A first class of approaches to cluster formation is based on the sequential construction of a tree (dendrogram), which leads us to form clusters; Fig. 17.5 shows a simple dendrogram. The leaves of the tree correspond to individual objects, while branches correspond to sequential groupings of observations and clusters. Since a tree suggests a natural hierarchy, methods in this class are called hierarchical. Methods differ in the way clusters are built:
- In divisive methods, we start from one big cluster containing all observations and proceed to split it into smaller, disjoint clusters.
- In agglomerative methods, increasingly large clusters are built by merging smaller ones.
The second approach is more common and is best illustrated by a simple example.
Example 17.4 Consider four observations collected in the following matrix:
Remember that rows correspond to observations and columns to variables; so the first observation is X1 = [3, 6, 9, −1]^T. Using Euclidean distance, we find the following distance matrix:
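As a concrete illustration, such a distance matrix can be computed with standard tools. The following Python sketch uses SciPy with a placeholder data matrix: only X1 = [3, 6, 9, −1] is reproduced above, so the remaining rows (and hence most of the resulting distances) are illustrative assumptions, not the example's actual data.

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

# Rows are observations, columns are variables. Only the first row (X1) is
# taken from the text; X2, X3, X4 are placeholder values for illustration
# (X3 is chosen so that d(X1, X3) = 5.0990, but the other distances will
# not match the example's matrix D).
X = np.array([
    [3.0,  6.0,  9.0, -1.0],   # X1 (from the text)
    [5.0,  4.0, 12.0,  9.0],   # X2 (placeholder)
    [6.0,  7.0,  5.0, -1.0],   # X3 (placeholder)
    [10.0, 12.0,  0.0,  7.0],  # X4 (placeholder)
])

# Pairwise Euclidean distances, arranged as a symmetric 4x4 matrix D.
D = squareform(pdist(X, metric="euclidean"))
print(np.round(D, 4))
```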
Now, let us build clusters using an agglomerative (bottom-up) strategy based on the centroid distance between clusters. The first iteration involves the distances between single observations, and we immediately see in matrix D that the two closest observations are X1 and X3, whose distance is 5.0990; merging them yields a cluster A. The centroid of these two observations is
Now we compute the distances between the centroid of cluster A and the remaining observations X2 and X4:
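Since the numerical values are not reproduced above, here is a minimal sketch of this centroid-distance step in Python, using the same placeholder data as before (only X1 comes from the text):

```python
import numpy as np

# Same placeholder data matrix as in the previous sketch (only X1 is from the text).
X = np.array([[3, 6, 9, -1],
              [5, 4, 12, 9],
              [6, 7, 5, -1],
              [10, 12, 0, 7]], dtype=float)

# Centroid of cluster A = {X1, X3}: the component-wise mean of the two observations.
c_A = X[[0, 2]].mean(axis=0)

# Centroid distances from cluster A to the remaining observations X2 and X4.
d_A_X2 = np.linalg.norm(c_A - X[1])
d_A_X4 = np.linalg.norm(c_A - X[3])
print(c_A, round(d_A_X2, 4), round(d_A_X4, 4))
```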
By comparing these distances with d(X2, X4) = 15.6525, we conclude that we should merge cluster A with X2, obtaining a new cluster B consisting of three observations; this is finally merged with the remaining observation X4 and the process stops, resulting in the dendrogram of Fig. 17.5. We observe that the centroid of cluster B is
and its distance from X4 is 17.5278, which is the height of the dendrogram. In fact, the height of each vertical bar in the dendrogram corresponds to the distance at which the two clusters it joins are merged.
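The whole agglomerative procedure, including the dendrogram, can be reproduced with SciPy's centroid linkage. The sketch below again relies on the placeholder data (only X1 is given in the text), so the merge heights will differ from those in the example.

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram

# Placeholder data matrix (only X1 is from the text).
X = np.array([[3, 6, 9, -1],
              [5, 4, 12, 9],
              [6, 7, 5, -1],
              [10, 12, 0, 7]], dtype=float)

# Centroid linkage: at each step, merge the two clusters whose centroids
# are closest in Euclidean distance, as in the example.
Z = linkage(X, method="centroid")
print(Z)  # each row: indices of the merged clusters, merge distance, new cluster size

# Plot the resulting dendrogram (the counterpart of Fig. 17.5).
dendrogram(Z, labels=["X1", "X2", "X3", "X4"])
plt.ylabel("merge distance")
plt.show()
```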
In the example, we always merge a cluster with a single observation, but this need not be the case in general, as we may well merge two clusters. We should note that, in this class of methods, we do not necessarily specify the number of clusters a priori. Given the tree, we may decide where to draw a horizontal line separating clusters, based on the merge distances. Also note that the approach is not iterative, but one-shot: In agglomerative methods we build the tree all the way up, without revising previous decisions.
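With SciPy, the horizontal cut corresponds to thresholding the linkage matrix. A minimal sketch follows; the threshold value and the placeholder data are illustrative assumptions.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Placeholder data matrix as before (only X1 is from the text).
X = np.array([[3, 6, 9, -1],
              [5, 4, 12, 9],
              [6, 7, 5, -1],
              [10, 12, 0, 7]], dtype=float)
Z = linkage(X, method="centroid")

# "Drawing a horizontal line" at height t: all merges occurring above this
# distance are undone, and each remaining subtree becomes a cluster.
labels = fcluster(Z, t=10.0, criterion="distance")  # t = 10.0 is an illustrative threshold
print(labels)  # cluster label assigned to each observation
```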