Clustering is a technique for finding similarity groups in data, called clusters. It attempts to group individuals in a population together by similarity, without being driven by a specific purpose. Clustering is often called unsupervised learning, as you don’t have prescribed labels in the data and no class values denoting an a priori grouping of the data instances are given. In this post, let’s discuss the famous centroid-based clustering algorithm, K-means, in the simplest way.
Check out the following figures to get started:
To run a K-means algorithm, you have to randomly initialize points called the cluster centroids (see figures 1 and 2). I have three cluster centroids because I want to group my data into three clusters. K-means is an iterative algorithm that repeats two steps: 1. the cluster assignment step and 2. the move centroid step.
In the cluster assignment step, the algorithm goes through each of the data points and, depending on which centroid is closest (the red, the blue, or the green), assigns the data point to one of the three cluster centroids.
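As a minimal sketch of the assignment step (using NumPy, with made-up toy points and hypothetical starting centroids):

```python
import numpy as np

# Toy data: six 2-D points and three hypothetical initial centroids.
points = np.array([[1.0, 1.0], [1.5, 2.0], [8.0, 8.0],
                   [8.5, 8.5], [4.0, 7.0], [4.5, 7.5]])
centroids = np.array([[1.0, 1.0], [8.0, 8.0], [4.0, 7.0]])

# Cluster assignment step: compute each point's distance to every
# centroid, then pick the index of the nearest one.
distances = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
labels = distances.argmin(axis=1)  # one centroid index per point
print(labels)  # → [0 0 1 1 2 2]
```

Each entry of `labels` records which of the three centroids that data point now belongs to.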
In the move centroid step, K-means moves each centroid to the average of the points assigned to its cluster. In other words, the algorithm calculates the average of all the points in a cluster and moves the centroid to that average location.
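The move centroid step can be sketched like this (the labels are hypothetical values produced by a previous assignment step):

```python
import numpy as np

points = np.array([[1.0, 1.0], [1.5, 2.0], [8.0, 8.0],
                   [8.5, 8.5], [4.0, 7.0], [4.5, 7.5]])
# Hypothetical labels from a previous cluster assignment step.
labels = np.array([0, 0, 1, 1, 2, 2])
k = 3

# Move centroid step: each centroid becomes the mean of the points
# currently assigned to it.
centroids = np.array([points[labels == j].mean(axis=0) for j in range(k)])
print(centroids)
# → [[1.25 1.5 ], [8.25 8.25], [4.25 7.25]]
```

Note that each centroid lands exactly halfway between the two points assigned to it, since a mean over two points is their midpoint.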
This process is repeated until there is no change in the clusters (or possibly until some other stopping condition is met). The initial centroids are either chosen randomly or supplied as specific starting points by the user.
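Putting the two steps together, the whole algorithm can be sketched as follows. This is a plain NumPy sketch, not a production implementation; the function name and the empty-cluster handling (keeping a centroid in place if no points are assigned to it) are my own choices:

```python
import numpy as np

def kmeans(points, k, rng=np.random.default_rng(0), max_iters=100):
    """Plain k-means: alternate assignment and centroid moves until stable."""
    # Random initialization: pick k distinct data points as starting centroids.
    centroids = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(max_iters):
        # 1. Cluster assignment step: each point joins its nearest centroid.
        dists = np.linalg.norm(points[:, None] - centroids[None, :], axis=2)
        labels = dists.argmin(axis=1)
        # 2. Move centroid step: each centroid moves to its cluster's mean
        #    (a centroid with an empty cluster stays where it is).
        new = np.array([points[labels == j].mean(axis=0)
                        if np.any(labels == j) else centroids[j]
                        for j in range(k)])
        if np.allclose(new, centroids):  # stop when nothing changes
            break
        centroids = new
    return centroids, labels

points = np.array([[1.0, 1.0], [1.5, 2.0], [8.0, 8.0],
                   [8.5, 8.5], [4.0, 7.0], [4.5, 7.5]])
centroids, labels = kmeans(points, k=3)
```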
Now, check out figures 3 and 4 below. They are examples of K-means being run on 90 data points (with k = 3). The data does not have well-defined clusters as in the previous examples. Figure 3 shows the initial data points before clustering, and figure 4 shows the result after 16 iterations. The three lines in figure 4 show the path from each centroid’s initial location to its final location.
K-means is usually run many times, starting with different random centroids each time. The results can be compared by examining the clusters, or by a numeric measure such as the clusters’ distortion, which is the sum of the squared differences between each data point and its corresponding centroid. In the distortion case, the clustering with the lowest distortion value can be chosen as the best clustering.
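Here is a sketch of that comparison. The two "runs" below are hand-made hypothetical results rather than real K-means output, just to show how distortion picks between them:

```python
import numpy as np

def distortion(points, centroids, labels):
    """Sum of squared distances from each point to its assigned centroid."""
    return ((points - centroids[labels]) ** 2).sum()

# Made-up 1-D-style data: two clear clusters on a line.
points = np.array([[0.0, 0.0], [2.0, 0.0], [10.0, 0.0], [12.0, 0.0]])

# Run A: centroids at the true cluster centers.
cents_a = np.array([[1.0, 0.0], [11.0, 0.0]])
labels_a = np.array([0, 0, 1, 1])  # distortion: 4.0

# Run B: a poor local optimum from a different random start.
cents_b = np.array([[0.0, 0.0], [8.0, 0.0]])
labels_b = np.array([0, 0, 1, 1])  # distortion: 24.0

# Keep whichever run has the lowest distortion.
best = min([(cents_a, labels_a), (cents_b, labels_b)],
           key=lambda cl: distortion(points, *cl))
```

Run A wins, matching the intuition that its centroids sit in the middle of the two groups.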
To choose an appropriate value for K, run the experiment using different values of K and see which ones generate good results. Since K-means is used for exploratory data mining, you must examine the clustering results anyway to determine which clusters make sense. The value of K can be decreased if some clusters are too small, and increased if the clusters are too broad.
For a more objective measure, you can experiment with increasing values of K and graph various metrics (indices) of the quality of the resulting clusterings. There are various methods on this Wikipedia page for determining the number of clusters in a data set.
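One common graph of this kind plots the best distortion found for each candidate K; a sketch under stated assumptions (NumPy, synthetic blob data made up for illustration, distortion as the quality metric, and 10 random restarts per K):

```python
import numpy as np

def kmeans_distortion(points, k, rng):
    """Run plain k-means and return the final distortion (sum of squared
    distances from each point to its assigned centroid)."""
    centroids = points[rng.choice(len(points), size=k, replace=False)]
    labels = np.zeros(len(points), dtype=int)
    for _ in range(100):
        d = np.linalg.norm(points[:, None] - centroids[None, :], axis=2)
        labels = d.argmin(axis=1)
        new = np.array([points[labels == j].mean(axis=0)
                        if np.any(labels == j) else centroids[j]
                        for j in range(k)])
        if np.allclose(new, centroids):
            break
        centroids = new
    return ((points - centroids[labels]) ** 2).sum()

# Synthetic data: three well-separated blobs of 30 points each
# (made up for illustration; in practice use your own data set).
rng = np.random.default_rng(42)
points = np.vstack([rng.normal(loc=c, scale=0.3, size=(30, 2))
                    for c in [(0, 0), (5, 5), (0, 5)]])

# For each candidate k, keep the lowest distortion over 10 random restarts.
distortions = {k: min(kmeans_distortion(points, k, np.random.default_rng(s))
                      for s in range(10))
               for k in range(1, 7)}
for k, d in distortions.items():
    print(k, round(float(d), 1))
# The distortion should drop sharply up to k = 3 and then flatten;
# that "elbow" suggests three clusters.
```

Plotting `distortions` against K gives the familiar elbow chart; the Wikipedia methods mentioned above (gap statistic, silhouette, and so on) are more principled variants of the same idea.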