Category: Dealing with Complexity: Data Reduction and Clustering

  • Nonhierarchical clustering: k-means

    The best-known method in the class of nonhierarchical clustering algorithms is the k-means approach. In the k-means method, unlike with the hierarchical ones, observations can be moved from one cluster to another in order to minimize a joint measure of the quality of clustering; hence, the method is iterative in nature. The starting point is the selection of k seeds,…
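
    The iterative reassignment described above can be sketched in a few lines. This is a minimal illustration of the standard k-means loop (assign each observation to its nearest centroid, then recompute centroids), not the book's own procedure: the excerpt is cut off before the seed selection is specified, so picking k observations at random as seeds is an assumption, as are the names `k_means` and `squared_distance`.

```python
import random


def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length sequences."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def k_means(points, k, max_iter=100, seed=0):
    """Return (assignment, centroids) for a list of numeric tuples.

    Assumption: the k seeds are observations drawn at random.
    """
    rng = random.Random(seed)
    centroids = [list(p) for p in rng.sample(points, k)]
    assignment = None
    for _ in range(max_iter):
        # Assignment step: move each observation to its nearest centroid.
        new_assignment = [
            min(range(k), key=lambda j: squared_distance(p, centroids[j]))
            for p in points
        ]
        if new_assignment == assignment:
            break  # no observation changed cluster, so we stop
        assignment = new_assignment
        # Update step: each centroid becomes the mean of its members.
        for j in range(k):
            members = [p for p, a in zip(points, assignment) if a == j]
            if members:  # keep the old centroid if a cluster empties
                centroids[j] = [sum(c) / len(members) for c in zip(*members)]
    return assignment, centroids


points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
labels, centers = k_means(points, k=2)
```

    Each pass can only lower (or leave unchanged) the total squared distance of observations from their centroids, which is the kind of joint quality measure the text refers to; the result may still depend on the seeds.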

  • Hierarchical methods

    A first class of approaches to cluster formation is based on the sequential construction of a tree (dendrogram) that leads us to form clusters; Fig. 17.5 shows a simple dendrogram. The leaves of the tree correspond to objects; branches of the tree correspond to sequential groupings of observations and clusters. Since a tree suggests a natural hierarchy,…

  • Measuring distance

    Given two observations, we may define various distance measures between them. Other distances may be defined to account for categorical variables; we may also assign weights to variables to express their relative importance. These distances measure the dissimilarity between two single observations, but when we aggregate several observations into clusters, we need some…
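
    The formulas for the distance measures did not survive in this excerpt. As a hedged illustration, the sketch below implements two widely used choices, the Euclidean and Manhattan distances, plus a weighted Euclidean variant reflecting the remark on variable weights. Whether these are the measures the original text lists is an assumption, and the function names are hypothetical.

```python
import math


def euclidean(x, y):
    """Square root of the summed squared coordinate differences."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def manhattan(x, y):
    """Sum of the absolute coordinate differences."""
    return sum(abs(a - b) for a, b in zip(x, y))


def weighted_euclidean(x, y, weights):
    """Euclidean distance with a nonnegative weight per variable."""
    return math.sqrt(
        sum(w * (a - b) ** 2 for w, a, b in zip(weights, x, y))
    )


x, y = (0.0, 0.0), (3.0, 4.0)
d_euclid = euclidean(x, y)
d_manhattan = manhattan(x, y)
d_weighted = weighted_euclidean(x, y, weights=(1.0, 0.0))  # ignores the 2nd variable
```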

  • CLUSTER ANALYSIS

    The aim of cluster analysis is to search for patterns in a dataset by grouping similar items. The number of groups (clusters) need not be fixed in advance, and, in fact, there is an array of different methods, which share a common need: the definition of a distance between observations, which is used to measure…

  • FACTOR ANALYSIS

    The rationale behind factor analysis may be best understood by a small numerical example. Example 17.2 Consider observations in ℝ^5 and the correlation matrix of their component variables X1, X2, …, X5: Does this suggest some structure? We see that X1 and X2 seem to be strongly correlated with each other, whereas they seem weakly correlated with X3, X4, and X5. The latter variables, on the contrary, seem…

  • Applications of PCA

    Principal component analysis can be applied in a marketing setting when questionnaires are administered to potential customers asking for a quantitative evaluation along many dimensions. Many such questions are, or are perceived as, redundant. Spotting the few principal components may help in assessing which product features, or combination thereof, are most important. They can also tell…

  • A small numerical example

    Principal component analysis in practice is carried out on sampled data, but it may be instructive to consider an example where both the probabilistic and the statistical sides are dealt with. Consider first a random variable with bivariate normal distribution, X ∼ N(0, Σ), where Σ has unit diagonal entries and off-diagonal entries equal to ρ, and ρ > 0. Essentially X1 and X2 are standard normal variables with positive correlation ρ. To find the eigenvalues of Σ,…
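
    As a check on the eigenvalue computation, here is a small sketch. It assumes that Σ is the 2 × 2 matrix with ones on the diagonal and ρ off the diagonal, which is what standard normal X1 and X2 with correlation ρ imply. Setting det(Σ − λI) = (1 − λ)² − ρ² = 0 gives λ = 1 + ρ and λ = 1 − ρ; the code evaluates the general closed form for a symmetric 2 × 2 matrix. The value ρ = 0.8 is purely illustrative and the function name is hypothetical.

```python
import math


def eigenvalues_sym_2x2(a, b, d):
    """Eigenvalues of the symmetric matrix [[a, b], [b, d]], largest first."""
    mean = (a + d) / 2
    radius = math.sqrt(((a - d) / 2) ** 2 + b ** 2)
    return mean + radius, mean - radius


rho = 0.8  # illustrative value, not taken from the source
lam1, lam2 = eigenvalues_sym_2x2(1.0, rho, 1.0)

# Share of total variance carried by the first principal component.
share_first = lam1 / (lam1 + lam2)
```

    Since the two eigenvalues sum to 2, the first component accounts for a fraction (1 + ρ)/2 of the total variance, which approaches 1 as ρ approaches 1.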

  • Another view of PCA

    Another view is obtained by interpreting the first principal component in terms of orthogonal projection. Consider a unit vector u, and imagine projecting the observed vector X on u. This yields a vector parallel to u, of length u^T X. Since u has unit length, the projection of observation X(k) on u is (u^T X(k)) u. We are projecting p-dimensional observations on just one axis, and of course we would like to…
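
    The projection on a unit vector can be illustrated as follows. This is a minimal sketch with hypothetical helper names (`normalize`, `project`); it only computes the scalar u^T x and the corresponding vector parallel to u, and does not attempt to reproduce the criterion that the truncated sentence goes on to state.

```python
import math


def normalize(v):
    """Scale v to unit Euclidean length."""
    norm = math.sqrt(sum(c * c for c in v))
    return [c / norm for c in v]


def project(x, u):
    """Project x on the unit vector u.

    Returns (score, vector), where score = u^T x is the signed length
    of the projection and vector = score * u is parallel to u.
    """
    score = sum(ui * xi for ui, xi in zip(u, x))
    return score, [score * ui for ui in u]


u = normalize([3.0, 4.0])          # direction (0.6, 0.8)
score, vector = project([5.0, 0.0], u)
```

    The residual x − (u^T x) u is orthogonal to u, which is what makes this an orthogonal projection.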

  • A geometric view of PCA

    The linear data transformation, including centering, can be written as where . We assume that data have already been centered, in order to ease notation. Hence The Zi variables are called principal components: Z1 is the first principal component. We recall that the matrix A rotating axes is orthogonal: Now, let us consider the sample covariance matrix of X, i.e., SX. Since we assume centered…

  • THE NEED FOR DATA REDUCTION

    Consider a sample of observations X(k), k = 1,…, n. Each observation X(k) consists of a vector of p elements. If p = 2, visualizing observations is easy, but this is certainly no piece of cake for large values of p. Hence, we need some way to reduce data dimensionality, by mapping observations in ℝ^p to observations in a lower-dimensional space ℝ^q, where q is possibly much smaller than p. Reducing data…