Another view is obtained by interpreting the first principal component in terms of orthogonal projection. Consider a unit vector $\mathbf{u}$, and imagine projecting the observed vector $\mathbf{X}$ on $\mathbf{u}$. This yields a vector parallel to $\mathbf{u}$, of length $\mathbf{u}^T\mathbf{X}$. Since $\mathbf{u}$ has unit length, the projection of observation $\mathbf{X}^{(k)}$ on $\mathbf{u}$ is
\[
\left(\mathbf{u}^T\mathbf{X}^{(k)}\right)\mathbf{u}.
\]
We are projecting $p$-dimensional observations on just one axis, and of course we would like to have an approximation that is as good as possible. More precisely, we should find $\mathbf{u}$ in such a way that the distance between the originally observed vector $\mathbf{X}^{(k)}$ and its projection is as small as possible. If we have a sample of $n$ observations, we should minimize the average squared distance
\[
\frac{1}{n}\sum_{k=1}^{n}\left\|\mathbf{X}^{(k)} - \left(\mathbf{u}^T\mathbf{X}^{(k)}\right)\mathbf{u}\right\|^2,
\]
which looks much like a least-squares problem. This amounts to an orthogonal projection of the original vectors on $\mathbf{u}$, where we know that the projected vector and the residual $\mathbf{X}^{(k)} - \left(\mathbf{u}^T\mathbf{X}^{(k)}\right)\mathbf{u}$ are orthogonal.3 Hence, we can apply the Pythagorean theorem to rewrite the problem:
\[
\min_{\mathbf{u}}\ \frac{1}{n}\sum_{k=1}^{n}\left[\left\|\mathbf{X}^{(k)}\right\|^2 - \left(\mathbf{u}^T\mathbf{X}^{(k)}\right)^2\right].
\]
Therefore, since the squared norms $\left\|\mathbf{X}^{(k)}\right\|^2$ do not depend on $\mathbf{u}$, we essentially want to maximize
\[
\sum_{k=1}^{n}\left(\mathbf{u}^T\mathbf{X}^{(k)}\right)^2
\]
subject to the condition $\mathbf{u}^T\mathbf{u} = 1$. The problem can be restated as
\[
\max_{\mathbf{u}}\ \mathbf{u}^T\mathcal{X}^T\mathcal{X}\,\mathbf{u} \quad \text{s.t.}\ \mathbf{u}^T\mathbf{u} = 1,
\]
where $\mathcal{X}$ is the $n \times p$ data matrix whose rows are the observations $\mathbf{X}^{(k)}$.
But we know that, assuming data are centered, the sample covariance matrix is $\mathbf{S}_X = \mathcal{X}^T\mathcal{X}/(n-1)$; hence, the problem is equivalent to
\[
\max_{\mathbf{u}}\ \mathbf{u}^T\mathbf{S}_X\mathbf{u} \tag{17.1}
\]
\[
\text{s.t.}\quad \mathbf{u}^T\mathbf{u} = 1. \tag{17.2}
\]
In plain English, what we want is to find one dimension on which the multidimensional data should be projected, in such a way that the variance of the projected data is maximized. This makes sense from a least-squares perspective, but it also has an intuitive appeal: the dimension along which we maximize variance is the one providing the most information.
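To make this equivalence concrete, here is a minimal NumPy sketch on made-up, centered data (the data set, the candidate direction, and all variable names are purely illustrative and not taken from the book): it projects the observations on a unit vector, evaluates the average squared distance, checks the Pythagorean decomposition numerically, and verifies that the average squared projection coincides with $\mathbf{u}^T\mathbf{S}_X\mathbf{u}$ up to the $1/n$ versus $1/(n-1)$ scaling, which does not affect the maximizer.

import numpy as np

rng = np.random.default_rng(42)              # synthetic data, purely for illustration
n, p = 200, 3
X = rng.normal(size=(n, p)) @ rng.normal(size=(p, p))
X = X - X.mean(axis=0)                       # center the data, as assumed in the text

u = np.array([1.0, 1.0, 0.0])
u = u / np.linalg.norm(u)                    # a candidate unit vector

scores = X @ u                               # projection lengths u^T X^(k)
proj = np.outer(scores, u)                   # projected vectors (u^T X^(k)) u
resid = X - proj                             # residuals, orthogonal to u

avg_sq_dist  = np.mean(np.sum(resid**2, axis=1))   # least-squares objective to minimize
avg_sq_norm  = np.mean(np.sum(X**2, axis=1))       # does not depend on u
avg_sq_score = np.mean(scores**2)                  # term to maximize

# Pythagorean decomposition: ||X^(k)||^2 = (u^T X^(k))^2 + ||residual||^2
assert np.isclose(avg_sq_norm, avg_sq_score + avg_sq_dist)

# The same quantity, up to the 1/n versus 1/(n-1) scaling, is u^T S_X u
S = X.T @ X / (n - 1)
assert np.isclose(u @ S @ u, np.sum(scores**2) / (n - 1))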
To solve the problem above, we may associate the constraint (17.2) with a Lagrange multiplier λ and augment the objective function (17.1) to obtain the Lagrangian function:4
\[
L(\mathbf{u}, \lambda) = \mathbf{u}^T\mathbf{S}_X\mathbf{u} + \lambda\left(1 - \mathbf{u}^T\mathbf{u}\right).
\]
The gradient of the Lagrangian function with respect to $\mathbf{u}$ is
\[
\nabla_{\mathbf{u}} L(\mathbf{u}, \lambda) = 2\mathbf{S}_X\mathbf{u} - 2\lambda\mathbf{u},
\]
and setting it to zero yields the first-order optimality condition
\[
\mathbf{S}_X\mathbf{u} = \lambda\mathbf{u}.
\]
This amounts to saying that λ must be an eigenvalue of the sample covariance matrix, but which one? We can rewrite the objective function (17.1) as follows:
\[
\mathbf{u}^T\mathbf{S}_X\mathbf{u} = \mathbf{u}^T\lambda\mathbf{u} = \lambda\,\mathbf{u}^T\mathbf{u} = \lambda.
\]
Hence, we see that λ should be the largest eigenvalue of $\mathbf{S}_X$, $\mathbf{u}$ is the corresponding normalized eigenvector, and we obtain the same result as in the previous section. Furthermore, we may continue along the same route, asking for another direction along which variance is maximized, subject to the constraint that it be orthogonal to the first direction we found. Since the eigenvectors of a symmetric matrix can be chosen to be mutually orthogonal, we see that we will indeed find all of them, in decreasing order of the corresponding eigenvalues.
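In computational terms, the whole derivation boils down to an eigendecomposition of the sample covariance matrix. The following sketch, again on made-up data, is one possible way to carry it out with NumPy (np.linalg.eigh is simply a convenient routine for symmetric matrices; the derivation above does not prescribe any particular solver): it sorts the eigenvectors by decreasing eigenvalue, checks that the leading one attains the maximum projected variance, and verifies that the principal directions are mutually orthogonal.

import numpy as np

rng = np.random.default_rng(0)               # synthetic data, purely for illustration
n, p = 500, 4
X = rng.normal(size=(n, p)) @ rng.normal(size=(p, p))
X = X - X.mean(axis=0)                       # centered data matrix

S = X.T @ X / (n - 1)                        # sample covariance matrix S_X

# eigh handles symmetric matrices and returns eigenvalues in ascending order;
# reverse the order to list the principal directions by decreasing eigenvalue
eigvals, eigvecs = np.linalg.eigh(S)
order = np.argsort(eigvals)[::-1]
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

u1 = eigvecs[:, 0]                           # first principal direction
print("largest eigenvalue:          ", eigvals[0])
print("variance of projected scores:", np.var(X @ u1, ddof=1))   # the two coincide

# any other unit vector yields a smaller projected variance (Rayleigh quotient bound)
v = rng.normal(size=p)
v = v / np.linalg.norm(v)
assert v @ S @ v <= eigvals[0] + 1e-12

# the principal directions are mutually orthogonal
assert np.allclose(eigvecs.T @ eigvecs, np.eye(p))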