In practice, principal component analysis is carried out on sampled data, but it is instructive to consider an example where both the probabilistic and the statistical sides are dealt with. Consider first a random variable with bivariate normal distribution, X ∼ N(0, Σ), where

\[
\Sigma = \begin{bmatrix} 1 & \rho \\ \rho & 1 \end{bmatrix}
\]

and ρ > 0. Essentially X1 and X2 are standard normal variables with positive correlation ρ. To find the eigenvalues of Σ, we must find its characteristic polynomial and solve the corresponding equation

\[
\det(\Sigma - \lambda I) = \det \begin{bmatrix} 1-\lambda & \rho \\ \rho & 1-\lambda \end{bmatrix} = (1-\lambda)^2 - \rho^2 = 0
\]

This yields the two eigenvalues λ1 = 1 + ρ and λ2 = 1 − ρ. Note that both eigenvalues are positive, since ρ is a correlation coefficient with 0 < ρ < 1. To find an eigenvector corresponding to the first eigenvalue, we consider the system of linear equations:
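As a quick numerical sanity check, the eigenvalues of Σ can be computed directly. This is only a sketch using NumPy; the specific value ρ = 0.85 is simply the one used later in the example.

```python
import numpy as np

rho = 0.85  # illustrative value; any 0 < rho < 1 works
Sigma = np.array([[1.0, rho],
                  [rho, 1.0]])

# eigvalsh returns the eigenvalues of a symmetric matrix in ascending order
eigenvalues = np.linalg.eigvalsh(Sigma)
print(eigenvalues)  # [1 - rho, 1 + rho]
```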

\[
\begin{aligned}
-\rho u_1 + \rho u_2 &= 0 \\
\rho u_1 - \rho u_2 &= 0
\end{aligned}
\]

Clearly, the two equations are linearly dependent, and any vector such that u1 = u2 is an eigenvector. By the same token, any vector such that u1 = −u2 is an eigenvector corresponding to λ2. Two normalized eigenvectors are

\[
\mathbf{u}_1 = \frac{1}{\sqrt{2}} \begin{bmatrix} 1 \\ 1 \end{bmatrix}, \qquad
\mathbf{u}_2 = \frac{1}{\sqrt{2}} \begin{bmatrix} 1 \\ -1 \end{bmatrix}
\]
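The eigendecomposition can also be verified numerically. Note that numerical routines fix eigenvector signs arbitrarily, so only the directions are meaningful; again, this is a NumPy sketch with an illustrative value of ρ.

```python
import numpy as np

rho = 0.85  # illustrative
Sigma = np.array([[1.0, rho], [rho, 1.0]])

# eigh sorts eigenvalues in ascending order, so lambda2 = 1 - rho comes first;
# the columns of V are the corresponding orthonormal eigenvectors
w, V = np.linalg.eigh(Sigma)
u2, u1 = V[:, 0], V[:, 1]

# Up to an overall sign, u1 is proportional to (1, 1) and u2 to (1, -1)
print(np.abs(u1))  # both entries approximately 1/sqrt(2) = 0.7071
print(np.abs(u2))  # both entries approximately 1/sqrt(2) = 0.7071
```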

These are the rows of the transformation matrix

\[
\frac{1}{\sqrt{2}} \begin{bmatrix} 1 & 1 \\ 1 & -1 \end{bmatrix}
\]

Since we are dealing with standard normals, μ = 0 and the first principal component is

\[
Z_1 = \frac{X_1 + X_2}{\sqrt{2}}
\]

The second principal component is

\[
Z_2 = \frac{X_1 - X_2}{\sqrt{2}}
\]
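Both components are plain linear transformations of X, obtained by multiplying an observation by the matrix whose rows are the normalized eigenvectors. As a sketch (the observation x below is an arbitrary made-up point, not from the example):

```python
import numpy as np

# transformation matrix whose rows are the normalized eigenvectors
A = np.array([[1.0,  1.0],
              [1.0, -1.0]]) / np.sqrt(2.0)

x = np.array([0.3, -0.2])   # an arbitrary observation (illustrative)
z = A @ x                   # z[0] = Z1, z[1] = Z2
print(z[0], (x[0] + x[1]) / np.sqrt(2))  # identical by construction
```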

As a further check, let us compute the variance of the first principal component:

\[
\mathrm{Var}(Z_1) = \frac{1}{2}\left[ \mathrm{Var}(X_1) + \mathrm{Var}(X_2) + 2\,\mathrm{Cov}(X_1, X_2) \right] = \frac{1 + 1 + 2\rho}{2} = 1 + \rho = \lambda_1
\]
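The same check can be carried out by simulation. In this Monte Carlo sketch, the seed and the sample size are arbitrary choices; the sample variances of the two components should be close to 1 + ρ and 1 − ρ, respectively.

```python
import numpy as np

rho = 0.85
rng = np.random.default_rng(42)  # arbitrary seed
Sigma = np.array([[1.0, rho], [rho, 1.0]])
X = rng.multivariate_normal([0.0, 0.0], Sigma, size=200_000)

Z1 = (X[:, 0] + X[:, 1]) / np.sqrt(2.0)
Z2 = (X[:, 0] - X[:, 1]) / np.sqrt(2.0)
print(Z1.var())  # close to 1 + rho = 1.85
print(Z2.var())  # close to 1 - rho = 0.15
```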

Figure 17.2 shows the level curves of the joint density of X when ρ = 0.85.


Since the correlation is positive, the major axis of the ellipses has positive slope; it is along this direction that variability is largest.

As we noted, practical PCA is carried out on sampled data. Figure 17.3 shows a sample of size 200 from the above bivariate normal distribution, with ρ = 0.85. The sample statistics are

[sample mean vector and sample covariance matrix: numerical values not reproduced]

Fig. 17.2 Level curves of a multivariate normal with ρ = 0.85.

[Fig. 17.3: (a) scatterplot of the sample; (b), (c) the two principal components; figure not reproduced]

Since the sample size is not very large, the estimated parameters are not especially close to their theoretical counterparts. Nevertheless, the observation cloud (scatterplot) displayed in Fig. 17.3(a) clearly shows positive correlation. The matrix of normalized eigenvectors is

[estimated eigenvector matrix: numerical values not reproduced]

Apart from a sign, the values in this matrix, if the estimates were perfect, should all equal 1/√2 ≈ 0.7071 in absolute value. The eigenvalues of the sample covariance matrix are

[estimated eigenvalues: numerical values not reproduced]

and indeed:

[numerical check: values not reproduced]

Note that the data must be centered, since the sample means are not zero. The two small plots in Figs. 17.3(b) and (c) show the two principal components, i.e., the projections of the original data. We clearly see that the first principal component accounts for most of the variability, precisely

\[
\frac{\hat{\lambda}_1}{\hat{\lambda}_1 + \hat{\lambda}_2}
\]

[numerical value not reproduced]
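The whole sampling experiment can be reproduced along these lines. This is a NumPy sketch; the seed is an arbitrary choice, so the resulting estimates will differ from those reported in the original example, but the fraction of variance explained by the first component should be close to the theoretical value λ1/(λ1 + λ2) = (1 + ρ)/2 ≈ 0.925.

```python
import numpy as np

rho, n = 0.85, 200
rng = np.random.default_rng(0)           # arbitrary seed
Sigma = np.array([[1.0, rho], [rho, 1.0]])
X = rng.multivariate_normal([0.0, 0.0], Sigma, size=n)

Xc = X - X.mean(axis=0)                  # center the data: sample means are not zero
S = np.cov(Xc, rowvar=False)             # sample covariance matrix
w, V = np.linalg.eigh(S)                 # ascending eigenvalues, orthonormal columns

Z = Xc @ V[:, ::-1]                      # projections, largest-variance component first
explained = w[1] / w.sum()               # fraction of variance on the first component
print(explained)                         # typically close to (1 + rho)/2 = 0.925
```

The projected components are uncorrelated by construction, since the columns of V are orthonormal eigenvectors of the sample covariance matrix.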
