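Given two observations $\mathbf{x}$ and $\mathbf{y}$, each a vector of $p$ variables, we may define various distance measures between them, such as
- The Euclidean distance, $d(\mathbf{x},\mathbf{y}) = \sqrt{\sum_{i=1}^{p} (x_i - y_i)^2}$
- The Mahalanobis distance, $d(\mathbf{x},\mathbf{y}) = \sqrt{(\mathbf{x}-\mathbf{y})^{\mathsf{T}} \mathbf{S}^{-1} (\mathbf{x}-\mathbf{y})}$, where $\mathbf{S}$ is the (sample) covariance matrix.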
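The two distances above can be sketched in a few lines of Python. This is a minimal illustration using only the standard library, not a reference implementation: the function names, the small Gaussian-elimination helper `solve`, and the data values are all assumptions made up for this example.

```python
import math

def euclidean(x, y):
    """Euclidean distance between two observations of equal length."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def sample_covariance(data):
    """Sample covariance matrix (divisor n - 1) of a list of observations."""
    n, p = len(data), len(data[0])
    means = [sum(row[j] for row in data) / n for j in range(p)]
    return [[sum((row[i] - means[i]) * (row[j] - means[j]) for row in data) / (n - 1)
             for j in range(p)]
            for i in range(p)]

def solve(S, b):
    """Solve S z = b by Gaussian elimination with partial pivoting (hypothetical helper)."""
    n = len(b)
    M = [list(S[i]) + [b[i]] for i in range(n)]  # augmented matrix [S | b]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        if abs(M[col][col]) < 1e-12:
            raise ValueError("covariance matrix is (numerically) singular")
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    z = [0.0] * n
    for i in range(n - 1, -1, -1):  # back substitution
        z[i] = (M[i][n] - sum(M[i][j] * z[j] for j in range(i + 1, n))) / M[i][i]
    return z

def mahalanobis(x, y, S):
    """Mahalanobis distance sqrt((x - y)^T S^{-1} (x - y))."""
    diff = [a - b for a, b in zip(x, y)]
    z = solve(S, diff)  # z = S^{-1} (x - y), without forming the inverse
    return math.sqrt(sum(d * zi for d, zi in zip(diff, z)))

# Made-up observations on two strongly correlated variables.
data = [(1.0, 2.0), (2.0, 4.5), (3.0, 5.5), (4.0, 8.5), (5.0, 9.5)]
S = sample_covariance(data)
print("Euclidean:  ", euclidean(data[0], data[1]))
print("Mahalanobis:", mahalanobis(data[0], data[1], S))
```

When S is the identity matrix the two distances coincide; otherwise the Mahalanobis distance rescales each direction by the variability of the data along it. In practice one would more likely rely on a numerical library, for instance `scipy.spatial.distance.mahalanobis`, which expects the inverse of the covariance matrix as its third argument.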
Other distances may be defined to account for categorical variables; we may also assign weights to variables to express their relative importance. These distances measure the dissimilarity between two single observations, but when we aggregate several observations into clusters, we need some way to measure distance between clusters, i.e., between sets of observations. Four ways of defining distance between clusters are illustrated in Fig. 17.4:
- Fig. 17.4(a) illustrates the single linkage (nearest-neighbor) distance, $d(A,B) = \min_{\mathbf{x} \in A,\, \mathbf{y} \in B} d(\mathbf{x},\mathbf{y})$, where $d(\mathbf{x},\mathbf{y})$ is the chosen distance between two single observations. This distance between clusters A and B is given by the smallest distance between an element of A and an element of B.
- Fig. 17.4(b) illustrates the complete linkage (farthest-neighbor) distance, $d(A,B) = \max_{\mathbf{x} \in A,\, \mathbf{y} \in B} d(\mathbf{x},\mathbf{y})$. This distance between clusters A and B is given by the largest distance between an element of A and an element of B.
- Fig. 17.4(c) illustrates the average linkage distance, $d(A,B) = \frac{1}{N_A N_B} \sum_{\mathbf{x} \in A} \sum_{\mathbf{y} \in B} d(\mathbf{x},\mathbf{y})$. This distance between clusters A and B is given by the average distance over all pairs consisting of an element of A and an element of B; $N_A$ and $N_B$ are the numbers of elements in A and B, respectively.
- Fig. 17.4(d) illustrates the centroid distance, $d(A,B) = d(\bar{\mathbf{x}}_A, \bar{\mathbf{x}}_B)$. In this case we take the sample mean along each dimension for the observations in cluster A, and define the centroid $\bar{\mathbf{x}}_A = \frac{1}{N_A} \sum_{\mathbf{x} \in A} \mathbf{x}$; the centroid for cluster B is defined in the same way, and the distance between clusters is given by the distance between these two “representative” observations, which are essentially the barycenters of the two sets [the black squares in Fig. 17.4(d)].

Fig. 17.4 Four different distances between clusters.

Fig. 17.5 A dendrogram from hierarchical clustering.
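The four cluster distances listed above can be compared on a toy example. The sketch below assumes Euclidean distance between observations (via `math.dist`, available from Python 3.8); the function names and the two small clusters are made up for illustration.

```python
import math
from itertools import product
from statistics import fmean

def single_linkage(A, B, dist=math.dist):
    """Smallest distance between an element of A and an element of B."""
    return min(dist(a, b) for a, b in product(A, B))

def complete_linkage(A, B, dist=math.dist):
    """Largest distance between an element of A and an element of B."""
    return max(dist(a, b) for a, b in product(A, B))

def average_linkage(A, B, dist=math.dist):
    """Average distance over all N_A * N_B pairs (a, b)."""
    return fmean([dist(a, b) for a, b in product(A, B)])

def centroid(A):
    """Sample mean along each dimension of the observations in A."""
    return tuple(fmean(column) for column in zip(*A))

def centroid_distance(A, B, dist=math.dist):
    """Distance between the centroids (barycenters) of A and B."""
    return dist(centroid(A), centroid(B))

# Two made-up clusters in the plane.
A = [(0.0, 0.0), (0.0, 1.0)]
B = [(3.0, 0.0), (4.0, 0.0)]

for name, f in [("single", single_linkage), ("complete", complete_linkage),
                ("average", average_linkage), ("centroid", centroid_distance)]:
    print(f"{name:>8}: {f(A, B):.4f}")
```

By construction, the single linkage distance never exceeds the average linkage distance, which in turn never exceeds the complete linkage distance. Passing a different `dist` function (a hypothetical parameter of this sketch) swaps in another distance between observations, such as a weighted one. For actual hierarchical clustering, `scipy.cluster.hierarchy.linkage` offers `'single'`, `'complete'`, `'average'`, and `'centroid'` among its methods.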