Overview of Clustering
In machine learning, unsupervised learning helps us solve problems by learning from data and grouping it without labels. Clustering algorithms are powerful unsupervised tools for discovering hidden structure and unknown features in data. If you are working with large volumes of unstructured data, it is often better to partition the data into logical groupings before any analysis. Clustering techniques are now used everywhere, for example in identifying fraudulent or criminal activity, filtering spam email, and optimizing network traffic flow, and they can exploit large amounts of unlabeled data to build powerful groupings.

A cluster is a collection of data points aggregated together because of certain similarities. Several basic clustering algorithms, such as hierarchical clustering, K-means clustering, and density-based clustering, can be applied to different types of problems:
- Hierarchical clustering attempts to establish a hierarchy of clusters. It has two strategies: the “bottom-up” (agglomerative) approach treats each observation as its own cluster and then merges the two closest clusters at each level until the top level is reached; the “top-down” (divisive) approach starts with all observations in a single cluster and then splits off the relatively dissimilar observations at each level.
- K-means clustering tries to find k distinct clusters, where k is specified in advance, and the center of each cluster is computed as the mean of the observations assigned to it. This post focuses on this algorithm.
- Density-based clustering assumes that the cluster structure is determined by how tightly the samples are distributed. The algorithm examines the connectivity between samples from the perspective of sample density and keeps expanding clusters along connectable samples to obtain the final result. A brief scikit-learn sketch of all three families is shown below.
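As a quick illustration, here is a minimal sketch assuming scikit-learn is installed; the dataset X and the parameter values are placeholders of my own, not choices from this post:
# Minimal sketch: one representative from each clustering family via scikit-learn.
# X is a placeholder dataset; the parameter values are purely illustrative.
import numpy as np
from sklearn.cluster import AgglomerativeClustering, KMeans, DBSCAN

X = np.random.rand(300, 2)                                            # placeholder data
hier_labels = AgglomerativeClustering(n_clusters=3).fit_predict(X)    # hierarchical (bottom-up)
kmeans_labels = KMeans(n_clusters=3, n_init=10).fit_predict(X)        # K-means
density_labels = DBSCAN(eps=0.1, min_samples=5).fit_predict(X)        # density-based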
K-means
The K-means clustering algorithm originated as a vector-quantization method in signal processing and is now a popular clustering method in data mining. The purpose of K-means clustering is to divide n observations into k clusters such that each observation belongs to the cluster with the nearest mean, also called the centroid. Note that K-means is completely different from K-nearest neighbors (KNN), a supervised classification method; the shared letter K in their names is a coincidence.
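Formally, given observations $x_1, \dots, x_n$, K-means seeks a partition $S = \{S_1, \dots, S_k\}$ that minimizes the within-cluster sum of squares

$$\underset{S}{\arg\min} \sum_{i=1}^{k} \sum_{x \in S_i} \lVert x - \mu_i \rVert^2,$$

where $\mu_i$ is the mean (centroid) of the observations in $S_i$.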

Let’s take a look at how naïve K-means clustering works. K-means is an iterative process, and the algorithm is composed of four steps (a minimal code sketch follows the list):
1. Select k random data points in the data space as the initial centroids, each representing a cluster center;
2. Assign each observation to the closest cluster by computing its Euclidean distance to each centroid;
3. Update each cluster centroid by computing the mean of the observations assigned to that cluster;
4. Repeat steps 2 and 3 until the centroid positions no longer change.
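Here is a minimal NumPy sketch of these four steps (the function name, stopping check, and lack of empty-cluster handling are my own illustrative choices, not the implementation developed in this post):
# Minimal sketch of the four naive K-means steps; illustrative only.
import numpy as np

def naive_kmeans(X, k, max_iter = 100, seed = 0):
    rng = np.random.default_rng(seed)
    # Step 1: pick k random observations as the initial centroids
    centroids = X[rng.choice(len(X), size = k, replace = False)]
    for _ in range(max_iter):
        # Step 2: assign each observation to its closest centroid (Euclidean distance)
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis = 2)
        labels = dists.argmin(axis = 1)
        # Step 3: recompute each centroid as the mean of its assigned observations
        # (empty clusters are not handled in this sketch)
        new_centroids = np.array([X[labels == j].mean(axis = 0) for j in range(k)])
        # Step 4: stop once the centroid positions no longer change
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids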
Implementation of naïve K-means clustering
First of all, define a helper function to generate some artificial clustering test data:
# Imports used throughout this post
import math
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Generate clustering data: points uniformly distributed within a disk of radius R around a given center
def fake_data(n = 100, R = 1, min_ = 0, max_ = 1, center_x = 0, center_y = 0):
    pi = math.pi
    # Take the square root of the uniform radii so points are uniform over the disk area
    r = R * np.sqrt(np.random.uniform(min_, max_, size = n))
    theta = np.random.uniform(min_, max_, size = n) * 2 * pi
    x = center_x + np.cos(theta) * r
    y = center_y + np.sin(theta) * r
    df = pd.DataFrame(np.column_stack([x, y]), columns = ['x', 'y'])
    return df
Let’s assume that we have three different groups of data points, gathered around the theoretical centroids $(x, y) \in \{(5, 30), (30, 10), (50, 50)\}$:
# 'Random' initialization
np.random.seed(10)
# Create data
group_1 = fake_data(n = 30, R = 10, center_x = 5, center_y = 30)
group_2 = fake_data(n = 30, R = 10, center_x = 30, center_y = 10)
group_3 = fake_data(n = 30, R = 10, center_x = 50, center_y = 50)
# Merge into one testing data pool (DataFrame.append is deprecated in recent pandas, so use pd.concat)
data = pd.concat([group_1, group_2, group_3], ignore_index = True)
data.head()
|   | x | y |
|---|---|---|
| 0 | 7.986546 | 21.740908 |
| 1 | 3.572751 | 29.804669 |
| 2 | 11.684570 | 25.677808 |
| 3 | 1.353208 | 37.847375 |
| 4 | 10.950436 | 33.800396 |
Visualize the data points as a scatter plot in the data space:
# Plot pre-cluster data
plt.scatter(data['x'], data['y'], c = 'black', alpha=0.3, edgecolors='none')
plt.xlabel('Feature 1')
plt.ylabel('Feature 2')
plt.title('Pre-cluster data')
plt.show()
Although K-means clustering is computationally difficult (the general problem is NP-hard), efficient heuristic algorithms converge quickly to a local optimum.
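Because the heuristic only reaches a local optimum, the result depends on the random initial centroids. A common remedy, sketched below using the illustrative naive_kmeans helper from the earlier sketch (scikit-learn's KMeans does the same thing via its n_init parameter), is to restart with different seeds and keep the solution with the smallest within-cluster sum of squares:
# Sketch: run several random restarts and keep the lowest within-cluster sum of squares.
# Uses the illustrative naive_kmeans helper defined in the sketch above.
X = data[['x', 'y']].to_numpy()
best_inertia, best_labels, best_centroids = np.inf, None, None
for seed in range(10):
    labels, centroids = naive_kmeans(X, k = 3, seed = seed)
    inertia = ((X - centroids[labels]) ** 2).sum()   # total squared distance to assigned centroids
    if inertia < best_inertia:
        best_inertia, best_labels, best_centroids = inertia, labels, centroids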