Given a random vector X with expected value μ, the covariance matrix can be expressed as
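Σ = Cov(X) = E[(X − μ)(X − μ)^T].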
Note that inside the expectation we are multiplying a column vector p × 1 and a row vector 1 × p, which does result in a square matrix p × p. It may also be worth noting that there is a slight inconsistency of notation, since we denote variance in the scalar case by σ², but we do not use Σ² here, as this would be somewhat confusing. The element in row i and column j of matrix Σ, [Σ]_ij, is the covariance σ_ij between X_i and X_j. Consistently, we should regard the variance of X_i as Cov(X_i, X_i) = σ_ii. We may also express the covariance matrix as
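Σ = E[X X^T] − μ μ^T.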
This is just a vector generalization of Eq. (8.5). If we consider a linear combination Z of variables X, i.e.,
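Z = w^T X = w_1 X_1 + w_2 X_2 + ⋯ + w_p X_p,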
then the variance of Z is
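Var(Z) = w^T Σ w,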
where Σ is the covariance matrix of X. By the same token, let us consider a linear transformation from the random vector X to a random vector Z, represented by the matrix A, i.e.,
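Z = AX.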
It turns out that the covariance matrix of Z is
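Cov(Z) = A Σ A^T.    (15.3)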
By recalling that a linear combination of jointly normal variables is normal, the following theorem can be immediately understood.
THEOREM 15.1 Let X be a vector of n jointly normal variables with expected value μ and covariance matrix Σ. Given a matrix A ∈ ℝ^{m×n}, the transformed vector AX, taking values in ℝ^m, has a jointly normal distribution with expected value Aμ and covariance matrix AΣA^T.
The above properties refer to covariance matrices, i.e., to probabilistic concepts. The same results carry over to the sample covariance matrix, which we denote by S. Again, there is a bit of notational inconsistency with respect to the sample variance S² in the scalar case, but we will think of the sample variance of X_i in terms of sample covariance, S_ii, and adopt this notation, which is consistent with the use of Σ for a covariance matrix. The sample covariance matrix may be expressed in terms of the random observation vectors X^(k):
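S = (1/(n−1)) ∑_{k=1}^{n} (X^(k) − X̄)(X^(k) − X̄)^T = (1/(n−1)) [ ∑_{k=1}^{n} X^(k) (X^(k))^T − n X̄ X̄^T ].    (15.4)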
The expression in Eq. (15.4) is a multivariable generalization of the familiar way of rewriting sample variance; see Eq. (9.7). It is also fairly easy to show that we may write the sample covariance matrix in a very compact form using the data matrix χ. The sum in Eq. (15.4) can be expressed as χ^T χ, and by rewriting the vector of sample means as in Eq. (15.1), we obtain
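S = (1/(n−1)) χ^T (I − (1/n) 1 1^T) χ,    (15.5)

where 1 denotes a column vector of n ones.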
From a computational point of view, Eq. (15.5) may not be quite convenient; however, these ways of rewriting the sample covariance matrix may come in handy when proving theorems and analyzing data manipulations. If the data are already centered, then expressing the sample covariance matrix is immediate:
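S = (1/(n−1)) χ^T χ.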
We should also note the following properties, which generalize what we are familiar with in the scalar case:
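Z^(k) = A X^(k) + b    ⇒    Z̄ = A X̄ + b,    S_Z = A S_X A^T,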
where b is an arbitrary vector of real numbers.
If we need the sample correlation matrix R, consisting of the sample correlation coefficients R_ij between X_i and X_j, we may introduce the diagonal matrix of sample standard deviations
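D = diag(√S_11, √S_22, …, √S_pp),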
and then let
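R = D^{-1} S D^{-1}.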
15.3.2 Measuring distance and the Mahalanobis transformation
In Section 3.3.1 we defined the concept of vector norm, which can be used to measure the distance between points in ℝ^n. We might also define the distance between observed vectors in the same way, but in statistics we typically want to account for the covariance structure as well. As an introduction, consider the distance between the realization of a random variable X and its expected value μ, or between two realizations X_1 and X_2. Does a distance measure based on a plain difference, such as |X − μ| or |X_1 − X_2|, make sense? Such a measure is highly debatable from a statistical perspective, as it disregards dispersion altogether. A suitable measure should be expressed in terms of the number of standard deviations, which leads to the standardized distances
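d(X, μ) = |X − μ| / σ_X    and    d(X_1, X_2) = |X_1 − X_2| / σ_X.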
Alternatively, we may consider the squared distances
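d²(X, μ) = (X − μ)² / σ_X²    and    d²(X_1, X_2) = (X_1 − X_2)² / σ_X².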
To generalize the idea to the distance between observation vectors X^(1) and X^(2), we may rely on the covariance matrix and define the squared distance
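d²(X^(1), X^(2)) = (X^(1) − X^(2))^T Σ^{-1} (X^(1) − X^(2)),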
Fig. 15.6 Illustration of Mahalanobis distance.
where Σ^{-1} is the inverse of the covariance matrix. More often than not, we do not know the underlying covariance matrix, and we have to replace it with the sample covariance matrix S. We may also express the distance with respect to the expected value in the same way:
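d²(X, μ) = (X − μ)^T Σ^{-1} (X − μ).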
The last expression should be familiar, since it is related to the argument of the exponential function that defines the joint PDF of a multivariable normal distribution.7 We also recall that the level curves of this PDF are ellipses, whose shape depends on the correlation between variables. This is very helpful in understanding the rationale behind the definition of the distances described above, which are known as Mahalanobis distances. Consider the two points A and B in Fig. 15.6. The figure is best interpreted in terms of a bivariate normal distribution with expected value μ; the ellipse is a level curve of its PDF. Geometrically, if we consider standard Euclidean distance, the points A and B do not have the same distance from μ. However, if we account for covariance by Mahalanobis distance, we see that the two points have the same distance from μ. Strictly speaking, we cannot compare the probabilities of outcomes A and B, as they are both zero; nevertheless, the two points are on the same “isodensity” curve and are, in a loose sense, equally likely.
Mahalanobis distance can also be interpreted as a Euclidean distance modified by a suitably chosen weight matrix, which changes the relative importance of the dimensions. Measuring distances is essential in clustering algorithms, as we will see in Section 17.4.1. Finally, Mahalanobis distance may also be interpreted in terms of a transformation, called the Mahalanobis transformation. Consider the square root of the covariance matrix, i.e., a symmetric matrix Σ^{1/2} such that
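Σ^{1/2} Σ^{1/2} = Σ.

The Mahalanobis transformation is then defined as

Z = Σ^{-1/2} (X − μ),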
where X is a random vector with expected value μ and covariance matrix Σ. Clearly, this transformation is just an extension of the familiar standardization of a scalar random variable. The distance between X and μ can be expressed in terms of the standardized variables as follows:
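d²(X, μ) = (X − μ)^T Σ^{-1} (X − μ) = Z^T Z = ‖Z‖².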
Now, using Eqs. (15.3) and (15.7), we find that the covariance matrix of the standardized variables is
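Cov(Z) = Σ^{-1/2} Σ Σ^{-1/2} = I.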
Thus, we see that the Mahalanobis transformation yields a set of uncorrelated and standardized variables.