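Given two observations $\mathbf{x}, \mathbf{y} \in \mathbb{R}^n$, we may define various distance measures between them, such as

  • The Euclidean distance
    $$d(\mathbf{x}, \mathbf{y}) = \sqrt{\sum_{k=1}^{n} (x_k - y_k)^2}$$
  • The Mahalanobis distance
    $$d(\mathbf{x}, \mathbf{y}) = \sqrt{(\mathbf{x} - \mathbf{y})^{\mathsf{T}} \mathbf{S}^{-1} (\mathbf{x} - \mathbf{y})}$$
    where $\mathbf{S}$ is the (sample) covariance matrix.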
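As a rough illustration of the two distances above, here is a minimal NumPy sketch. The function names, the vector names `x` and `y`, and the small data matrix `X` are assumptions made up for this example; they do not come from the text.

```python
import numpy as np

def euclidean(x, y):
    """Euclidean distance between two observation vectors."""
    d = np.asarray(x, dtype=float) - np.asarray(y, dtype=float)
    return float(np.sqrt(d @ d))

def mahalanobis(x, y, S):
    """Mahalanobis distance between x and y, given a covariance matrix S."""
    d = np.asarray(x, dtype=float) - np.asarray(y, dtype=float)
    # Solve S z = d rather than forming the inverse of S explicitly.
    z = np.linalg.solve(np.asarray(S, dtype=float), d)
    return float(np.sqrt(d @ z))

# Hypothetical data: five observations (rows) of two variables (columns).
X = np.array([[1.0, 2.0], [2.0, 4.5], [3.0, 5.5], [4.0, 8.5], [5.0, 9.0]])
S = np.cov(X, rowvar=False)  # sample covariance matrix of the columns

d_euc = euclidean(X[0], X[4])
d_mah = mahalanobis(X[0], X[4], S)
print(d_euc, d_mah)
```

When S is the identity matrix, the Mahalanobis distance reduces to the Euclidean one; otherwise it rescales the difference vector according to the variances and correlations of the variables.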

Other distances may be defined to account for categorical variables; we may also assign weights to variables to express their relative importance. These distances measure the dissimilarity between two single observations, but when we aggregate several observations into clusters, we need some way to measure distance between clusters, i.e., between sets of observations. Four ways of defining distance between clusters are illustrated in Fig. 17.4:

  • Figure 17.4(a) illustrates the single linkage (nearest-neighbor) distance
    $$d(A, B) = \min_{\mathbf{x} \in A,\, \mathbf{y} \in B} d(\mathbf{x}, \mathbf{y})$$
    This distance between clusters A and B is given by the smallest distance between an element of A and an element of B.
  • Figure 17.4(b) illustrates the complete linkage (farthest-neighbor) distance
    $$d(A, B) = \max_{\mathbf{x} \in A,\, \mathbf{y} \in B} d(\mathbf{x}, \mathbf{y})$$
    This distance between clusters A and B is given by the largest distance between an element of A and an element of B.
  • Figure 17.4(c) illustrates the average linkage distance
    $$d(A, B) = \frac{1}{N_A N_B} \sum_{\mathbf{x} \in A} \sum_{\mathbf{y} \in B} d(\mathbf{x}, \mathbf{y})$$
    This distance between clusters A and B is given by the average distance between any pair consisting of an element of A and an element of B; $N_A$ and $N_B$ are the numbers of elements in A and B, respectively.
  • Figure 17.4(d) illustrates the centroid distance
    $$d(A, B) = d(\bar{\mathbf{x}}_A, \bar{\mathbf{x}}_B)$$
    In this case we take the sample mean along each dimension for observations in cluster A, and define the centroid $\bar{\mathbf{x}}_A = \frac{1}{N_A} \sum_{\mathbf{x} \in A} \mathbf{x}$; the centroid $\bar{\mathbf{x}}_B$ for cluster B is defined in the same way, and the distance between clusters is given by the distance between these two “representative” observations, which are essentially the barycenters of the two sets [the black squares in Fig. 17.4(d)].

Fig. 17.4 Four different distances between clusters.

Fig. 17.5 A dendrogram from hierarchical clustering.
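The four cluster distances listed above can be sketched in plain Python as follows. This is only an illustration under stated assumptions: the function names and the two small clusters `A` and `B` are hypothetical, and the Euclidean distance is used as the underlying distance between observations, although any other distance could be substituted.

```python
from itertools import product
from math import dist  # Euclidean distance between two points (Python 3.8+)

def single_linkage(A, B):
    """Smallest distance between an element of A and an element of B."""
    return min(dist(a, b) for a, b in product(A, B))

def complete_linkage(A, B):
    """Largest distance between an element of A and an element of B."""
    return max(dist(a, b) for a, b in product(A, B))

def average_linkage(A, B):
    """Average distance over all N_A * N_B pairs (a, b)."""
    return sum(dist(a, b) for a, b in product(A, B)) / (len(A) * len(B))

def centroid(C):
    """Sample mean along each dimension of the observations in C."""
    n = len(C)
    return tuple(sum(coords) / n for coords in zip(*C))

def centroid_distance(A, B):
    """Distance between the centroids of A and B."""
    return dist(centroid(A), centroid(B))

# Hypothetical clusters of two-dimensional observations.
A = [(0.0, 0.0), (0.0, 2.0)]
B = [(3.0, 0.0), (3.0, 2.0)]

for f in (single_linkage, complete_linkage, average_linkage, centroid_distance):
    print(f.__name__, f(A, B))
```

For these two clusters the single linkage and centroid distances both equal 3, while the complete linkage distance is the square root of 13 (about 3.61), which shows that the choice of cluster distance can change which clusters look closest.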
