Consider a sample of observations X(k), k = 1,…, n. Each observation X(k) consists of a vector of p elements. If p = 2, visualizing observations is easy, but this is certainly no piece of cake for large values of p. Hence, we need some way to reduce data dimensionality, by mapping observations in ℝp to observations in a lower-dimensional space ℝq, where q is possibly much smaller than p. Reducing data dimensionality should be helpful in
- Simplifying our analysis
- Improving regression (when too many variables are considered, numerical results may not be stable because of collinearity)
- Speeding up Monte Carlo simulation by concentrating sampling on the most relevant dimensions
- Classifying patterns
- Spotting outliers
One possible strategy for reducing the dimensionality of data is to discard some components of each observed vector. For instance, we could consider only the first component X1 of each observation. However, by adopting such a crude strategy, we would miss much information.
Example 17.1 As a concrete example, imagine analyzing the grades obtained by a sample of students in different subjects. Comparing students on the basis of their performance in a single subject, whatever it is, does not seem quite appropriate. Taking the average of the grades, which is what is typically done, is a better choice, but it does not make much sense either, since it aggregates different kinds of ability. Some subjects may in fact be strongly related to one another, with the consequence that some grades are strongly correlated, whereas other subjects shed light on unrelated forms of intelligence. Generalizing a bit, we might take linear combinations of grades. In other words, we could consider a small set of linear combinations of the variables, for weight vectors uj:

Zj = uj1 X1 + uj2 X2 + ⋯ + ujp Xp,    j = 1, …, q.

For each weighting scheme j, we find a new variable Zj. But how can we find a sensible way to assign weights? We should find a limited set of variables Zj that really contribute the most information. By “most information,” we mean combinations that maximize variance and are uncorrelated. The rationale is that we should keep just a few variables Zj that account for most of the observed variability among students and are not redundant in terms of the information they provide. This is what principal component analysis does.
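To make this concrete, here is a minimal sketch in Python, using a small, entirely hypothetical grades matrix and scikit-learn's PCA; the rows of components_ play the role of the weight vectors uj, and explained_variance_ratio_ quantifies how much of the total variability the first few variables Zj capture.

```python
import numpy as np
from sklearn.decomposition import PCA

# Hypothetical grades of 6 students in 4 subjects (rows = students, columns = subjects).
grades = np.array([
    [28, 27, 18, 20],
    [30, 29, 21, 22],
    [24, 25, 30, 29],
    [22, 21, 27, 28],
    [26, 26, 24, 25],
    [29, 28, 19, 21],
], dtype=float)

pca = PCA(n_components=2).fit(grades)   # keep q = 2 linear combinations Z1, Z2
Z = pca.transform(grades)               # the new variables Zj for each student

print("weight vectors u_j (one per row):")
print(pca.components_.round(3))
print("fraction of total variance captured by Z1, Z2:",
      pca.explained_variance_ratio_.round(3))
```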
Fig. 17.1 Introducing PCA.
Alternatively, we could try to find underlying factors, which are latent in the sense that they are not directly observed, and which should explain students’ performance. This is the approach taken by factor analysis. Each observation X(k) is rewritten as a linear function of just a very few factors fj. Such latent factors might correspond to basic abilities that have different degrees of influence on the results obtained in different subjects.
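As a rough illustration of the latent-factor idea (just a sketch on synthetic data with hypothetical loadings, using scikit-learn's FactorAnalysis rather than any machinery developed here), we can generate grades driven by two unobserved abilities and try to recover the loadings:

```python
import numpy as np
from sklearn.decomposition import FactorAnalysis

rng = np.random.default_rng(0)

# Synthetic example: two latent abilities drive the grades in four subjects.
ability = rng.normal(size=(200, 2))               # unobserved factors f1, f2
loadings = np.array([[1.0, 0.2],                  # subject 1 depends mostly on f1
                     [0.9, 0.1],                  # subject 2 depends mostly on f1
                     [0.1, 1.0],                  # subject 3 depends mostly on f2
                     [0.2, 0.8]])                 # subject 4 depends mostly on f2
grades = 25 + ability @ loadings.T + 0.3 * rng.normal(size=(200, 4))

fa = FactorAnalysis(n_components=2).fit(grades)
print(fa.components_.round(2))   # estimated loadings: one row per latent factor
```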
Alternatively, we might just try to find meaningful groups of similar students, who may be ideal candidates for different types of jobs or further study, or perhaps for supplemental lecture hours to help them. This is what cluster analysis is all about.
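Similarly, a minimal clustering sketch (again on the hypothetical grades matrix, with scikit-learn's KMeans as one possible algorithm) would group students with similar grade profiles:

```python
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical grades of 6 students in 4 subjects.
grades = np.array([
    [28, 27, 18, 20],
    [30, 29, 21, 22],
    [24, 25, 30, 29],
    [22, 21, 27, 28],
    [26, 26, 24, 25],
    [29, 28, 19, 21],
], dtype=float)

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(grades)
print(km.labels_)            # cluster membership of each student
print(km.cluster_centers_)   # average grade profile of each cluster
```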
17.2 PRINCIPAL COMPONENT ANALYSIS (PCA)
To introduce principal component analysis (PCA) in the most intuitive way, let us have a look at Fig. 17.1, which shows a scatterplot of observations in two dimensions, X1 and X2. The observations are clearly correlated, and the correlation is positive. Now, consider the two axes referring to variables Z1 and Z2. Of course, we may represent the very same set of observations in terms of Z1 and Z2. This is just a change of coordinates, which accomplishes two objectives:
- The data have been centered, which is obtained by subtracting the sample mean; as we observed, this shift in the origin of the data is often advisable to improve numerical stability in data analysis algorithms.
- The variables Z1 and Z2 are uncorrelated; this is accomplished by a rotation of the axes.
We recall from Section 3.4.3 that to rotate a vector we multiply it by an orthogonal matrix. Since orthogonal matrices enjoy plenty of nice properties, we might be able to find a linear transformation of the variables with some interesting structure. We notice that the coordinate axes associated with Z1 and Z2 are parallel to the axes of an ellipsoid associated with the scatterplot, and that the first variable is the one with the largest variance. In fact, there are two different ways to introduce PCA:
- PCA is a means of taking linear combinations of the original variables, in such a way that the new variables are uncorrelated.
- PCA is a way to combine variables so that the first component has maximal variance, the second one has the next maximal variance, and so forth.
Let us pursue both points of view.
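Before doing so formally, both viewpoints can be checked numerically. The sketch below (with synthetic, positively correlated two-dimensional data, so the numbers are only illustrative) reproduces the change of coordinates of Fig. 17.1: the observations are centered and then rotated by the orthogonal matrix whose columns are the eigenvectors of the sample covariance matrix; the resulting variables Z1 and Z2 are uncorrelated, with Z1 carrying the largest variance.

```python
import numpy as np

rng = np.random.default_rng(42)

# Synthetic, positively correlated data in two dimensions (X1, X2).
X = rng.multivariate_normal(mean=[10.0, 5.0],
                            cov=[[4.0, 3.0], [3.0, 4.0]],
                            size=500)

Xc = X - X.mean(axis=0)                # 1) center the data
S = np.cov(Xc, rowvar=False)
eigvals, U = np.linalg.eigh(S)         # U is orthogonal: U.T @ U = I
order = np.argsort(eigvals)[::-1]      # sort directions by decreasing variance
U = U[:, order]

Z = Xc @ U                             # 2) rotate the axes

print("U orthogonal?", np.allclose(U.T @ U, np.eye(2)))
print("covariance of (Z1, Z2), approximately diagonal with decreasing entries:")
print(np.cov(Z, rowvar=False).round(3))
```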