The best-known method in the class of nonhierarchical clustering algorithms is the k-means approach. In the k-means method, unlike with the hierarchical ones, observations can be moved from one cluster to another in order to minimize a joint measure of the quality of clustering; hence, the method is iterative in nature. The starting point is the selection of k seeds, one per cluster, which are used to “attract” items into the cluster. The seeds should be natural attractors; i.e., they should be rather different from one another. There is no point in selecting two seeds that are close to each other, as they would be natural candidates for being placed in the same cluster. We also notice that in this approach, we have to specify in advance the number of clusters that we wish to form.
There are several variations on the theme, but a possible k-means algorithm is as follows:
- Step 1: Select k initial seeds.
- Step 2: Form an initial partition by assigning each observation to the closest cluster, measured by the distance between the observation to be assigned and the centroid of each cluster; the centroid of a cluster is updated whenever a new element is added.
- Step 3: Reassign selected observations to another cluster if this improves an overall measure of the quality of clustering; update centroids accordingly, and stop when no further improvement is possible.
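The three steps above can be sketched in Python. This is a minimal illustration under simplifying assumptions: seeds are drawn at random among the observations (rather than deliberately chosen to be far apart, as the text recommends), and centroids are updated after a full pass of assignments, a common batch variant of step 2 rather than an update after every single insertion. All names (`kmeans`, `dist2`, `assign`) are mine.

```python
import random

def kmeans(points, k, iters=20, seed=0):
    """Naive k-means sketch: seed, assign to nearest centroid, iterate."""
    rng = random.Random(seed)

    def dist2(a, b):
        # Squared Euclidean distance between two points.
        return sum((x - y) ** 2 for x, y in zip(a, b))

    # Step 1: select k initial seeds (here: k random distinct observations).
    centroids = [list(p) for p in rng.sample(points, k)]

    assign = [0] * len(points)
    for _ in range(iters):
        # Step 2: assign each observation to the cluster with the
        # closest centroid.
        changed = False
        for i, p in enumerate(points):
            best = min(range(k), key=lambda j: dist2(p, centroids[j]))
            if best != assign[i]:
                assign[i] = best
                changed = True
        # Step 3: update each centroid as the mean of its current members.
        for j in range(k):
            members = [points[i] for i in range(len(points)) if assign[i] == j]
            if members:
                centroids[j] = [sum(c) / len(members) for c in zip(*members)]
        if not changed:
            # Stop when no observation changes cluster.
            break
    return assign, centroids
```

On well-separated data, for example `kmeans([(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)], 2)`, the procedure recovers the two obvious groups; on harder data the result depends on the seeds, as discussed below.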
One possible measure of overall clustering quality in step 3 is the error sum of squares of the partition (ESS),

ESS = Σ_{i=1}^{n} ‖ X_i − X̄_{C(i)} ‖²,

where C(i) is the index of the cluster to which observation X_i is currently assigned, and X̄_{C(i)} is its centroid. Clearly, the procedure is heuristic, because we may get stuck in a local minimum, which may depend on the initial seeds.
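Given a partition and its centroids, the ESS above translates directly into code. This is a small illustrative helper (the names `ess`, `points`, `assign`, and `centroids` are mine): it sums, over all observations, the squared distance from each observation to the centroid of its current cluster.

```python
def ess(points, assign, centroids):
    """Error sum of squares: sum over i of ||X_i - centroid_{C(i)}||^2."""
    return sum(
        sum((x - c) ** 2 for x, c in zip(p, centroids[assign[i]]))
        for i, p in enumerate(points)
    )
```

For example, with points (0,0), (0,2) in a cluster centered at (0,1) and points (10,10), (10,12) in a cluster centered at (10,11), each observation contributes 1, so the ESS is 4; a reassignment in step 3 is accepted only if it lowers this value.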