Running a linear regression with multiple explanatory variables is a rather straightforward extension of what we have seen so far, especially if we assume fixed, deterministic values of the regressors. The underlying statistical model is

    Y = β_0 + β_1 x_1 + β_2 x_2 + ⋯ + β_q x_q + ε,

where ε is a random error term.
We avoid using α to denote the constant term, so that we may group the parameters into a vector β = [β_0, β_1, …, β_q]^T. The model is estimated on the basis of a sample of n observations,
which are collected in the vector Y = [Y_1, Y_2, …, Y_n]^T and in the n × (q + 1) matrix χ, whose ith row is [1, x_{i1}, x_{i2}, …, x_{iq}].
The data matrix χ collects the observed values of the regressor variables, and it includes a leading column of ones. This makes the notation more uniform: we may think of coefficient β_0 as being associated with a stream of constant observations x_{i0} = 1. To estimate β, we apply ordinary least squares, on the basis of the following regression equations:

    Y_i = b_0 + b_1 x_{i1} + b_2 x_{i2} + ⋯ + b_q x_{iq} + e_i,        i = 1, …, n,
where e_i is the residual for observation i, and b_j is the estimate of parameter β_j, j = 0, …, q. If we collect the residuals in the vector e and the coefficients in the vector b, the regression equations may be rewritten in the following convenient matrix form:

    Y = χb + e.        (16.1)
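As a concrete illustration of the ingredients of Eq. (16.1), here is a minimal NumPy sketch (the toy numbers are made up, not taken from the text): the data matrix χ is obtained from the raw regressor observations by prepending a column of ones.

```python
import numpy as np

# Toy sample: n = 5 observations of q = 2 regressors (all values are made up).
X_raw = np.array([[1.2, 0.7],
                  [0.9, 1.5],
                  [1.8, 0.3],
                  [2.1, 1.1],
                  [1.4, 0.9]])
Y = np.array([3.1, 3.9, 2.8, 4.4, 3.5])        # observed responses Y_1, ..., Y_n

n, q = X_raw.shape
# Data matrix chi: a leading column of ones (for the intercept b_0), then the regressors.
chi = np.column_stack([np.ones(n), X_raw])
print(chi.shape)                               # (5, 3), i.e., n x (q + 1)
```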
Using least squares, we aim at minimizing the sum of squared residuals, which is just the squared norm of the vector e:

    f(b) = Σ_{i=1}^{n} e_i^2 = e^T e = (Y − χb)^T (Y − χb).        (16.2)
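As a quick numerical sanity check (again a sketch with made-up numbers and an arbitrary candidate vector b, not part of the text), the componentwise sum of squared residuals, the inner product e^T e, and the squared Euclidean norm of e all coincide:

```python
import numpy as np

# Toy data (made up), mirroring the previous sketch: chi has a leading column of ones.
chi = np.array([[1.0, 1.2, 0.7],
                [1.0, 0.9, 1.5],
                [1.0, 1.8, 0.3],
                [1.0, 2.1, 1.1],
                [1.0, 1.4, 0.9]])
Y = np.array([3.1, 3.9, 2.8, 4.4, 3.5])
b = np.array([1.0, 0.5, 0.8])        # an arbitrary candidate coefficient vector

e = Y - chi @ b                      # residual vector e = Y - chi b

ssr_sum  = np.sum(e**2)              # componentwise sum of squared residuals
ssr_quad = e @ e                     # inner product e^T e
ssr_norm = np.linalg.norm(e)**2      # squared Euclidean norm
print(np.isclose(ssr_sum, ssr_quad), np.isclose(ssr_quad, ssr_norm))  # True True
```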
Now we only have to follow the familiar least-squares drill, but in matrix terms. The concepts of Section 3.9.1, concerning derivatives of quadratic forms, come in quite handy here. To see why, let us rewrite Eq. (16.2):

    f(b) = (Y − χb)^T (Y − χb) = Y^T Y − 2 Y^T χ b + b^T χ^T χ b.        (16.3)
This is a function of the vector of coefficients b and includes a constant term, a linear term, and a quadratic form. Furthermore, the matrix χ^T χ is square, symmetric, and positive semidefinite, implying that the associated quadratic form is convex. Hence, we are minimizing a convex function, and stationarity conditions are sufficient for optimality; we just have to take the gradient, i.e., the vector of partial derivatives of f(b) with respect to each coefficient b_j, and set it to zero. From Section 3.9.1, we recall the following rules to obtain the gradient of linear and quadratic functions of multiple variables:

    ∇_x (a^T x) = a,        ∇_x (x^T A x) = (A + A^T) x,
for a column vector a and a square matrix A (so that ∇_x (x^T A x) = 2Ax when A is symmetric). By applying these rules to (16.3), we immediately get the optimality conditions:

    −2 χ^T Y + 2 χ^T χ b = 0,    that is,    χ^T χ b = χ^T Y.
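The gradient formula can be spot-checked numerically (a sketch with made-up data, not from the text) by comparing −2 χ^T Y + 2 χ^T χ b against a central finite-difference approximation of the gradient of f(b) = (Y − χb)^T (Y − χb):

```python
import numpy as np

rng = np.random.default_rng(0)
n, q = 6, 2
chi = np.column_stack([np.ones(n), rng.normal(size=(n, q))])  # toy data matrix
Y = rng.normal(size=n)
b = rng.normal(size=q + 1)                                    # arbitrary point b

def f(b):
    e = Y - chi @ b
    return e @ e                     # f(b) = (Y - chi b)^T (Y - chi b)

grad_analytic = -2 * chi.T @ Y + 2 * chi.T @ chi @ b

# Central finite differences, one coordinate at a time.
h = 1e-6
grad_fd = np.array([(f(b + h * np.eye(q + 1)[j]) - f(b - h * np.eye(q + 1)[j])) / (2 * h)
                    for j in range(q + 1)])

print(np.allclose(grad_analytic, grad_fd, atol=1e-4))  # True
```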
The optimality conditions are just a system of linear equations; the reader is urged to check the size of each matrix involved and to verify that all of the sizes match; in particular, the square matrix χ^T χ belongs to the space ℝ^{(q+1)×(q+1)}. To solve this system, formally, we just have to invert a matrix:

    b = (χ^T χ)^{−1} χ^T Y.        (16.4)
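The following sketch (simulated toy data, made-up numbers) applies Eq. (16.4) literally and compares it with two numerically preferable routes: solving the normal equations as a linear system, and calling a least-squares routine on χb ≈ Y directly. On well-conditioned data all three agree:

```python
import numpy as np

rng = np.random.default_rng(1)
n, q = 50, 3
chi = np.column_stack([np.ones(n), rng.normal(size=(n, q))])   # data matrix with a column of ones
beta_true = np.array([2.0, -1.0, 0.5, 3.0])                    # "true" parameters (made up)
Y = chi @ beta_true + 0.1 * rng.normal(size=n)                 # simulated responses

# Eq. (16.4), literally: b = (chi^T chi)^{-1} chi^T Y.
b_inv = np.linalg.inv(chi.T @ chi) @ chi.T @ Y

# Numerically preferable: solve the normal equations chi^T chi b = chi^T Y ...
b_solve = np.linalg.solve(chi.T @ chi, chi.T @ Y)

# ... or use a least-squares routine on chi b ~ Y directly.
b_lstsq, *_ = np.linalg.lstsq(chi, Y, rcond=None)

print(np.allclose(b_inv, b_solve), np.allclose(b_solve, b_lstsq))  # True True
```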
Can something go wrong with this matrix inversion? The answer is “definitely yes,” and it is fairly easy to see why, by a proper interpretation of the regression equation (16.1). What we are doing is trying to express the vector Y as a linear combination of q + 1 vectors:

    Y ≈ b_0 1_n + b_1 x_1 + b_2 x_2 + ⋯ + b_q x_q,
where x_j, j = 1, …, q, is the column vector collecting the n observations x_{ij} of variable j, and 1_n is a vector whose n elements are all equal to 1.
Since n > q + 1, there is little hope of succeeding, and we must settle for an optimal approximation, whereby we project the vector Y onto the subspace spanned by these q + 1 vectors, in such a way as to minimize the norm of the residual vector e = Y − χb. In general, we cannot take for granted that these q + 1 vectors are linearly independent; if they are not, the coefficients of this approximation are not even well defined, since one of the columns can be expressed as a linear combination of the others. So, in order to avoid trouble, the vectors 1_n and x_j should be linearly independent, which amounts to saying that the data matrix χ has full rank.¹ If so, it turns out that the matrix χ^T χ is nonsingular and Eq. (16.4) makes sense.

Actually, there is an even subtler issue: even if the columns of χ are linearly independent, some regressor variables could be strongly correlated. In such a case, random sampling is unlikely to produce truly linearly dependent columns; however, the matrix χ^T χ could be close to singular, resulting in unstable estimates of the regression parameters. This issue is called multicollinearity and is outlined in Section 16.2.1.
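To make the near-singularity concrete, here is a small sketch (made-up simulated data, not from the text): when one regressor is almost a copy of another, the condition number of χ^T χ becomes very large, and the individual coefficient estimates jump around from sample to sample even though the fitted combination is stable.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 100
x1 = rng.normal(size=n)
x2 = x1 + 1e-3 * rng.normal(size=n)           # nearly a copy of x1: strongly correlated regressors
chi = np.column_stack([np.ones(n), x1, x2])

print(np.linalg.cond(chi.T @ chi))             # very large: chi^T chi is close to singular

def fit(seed):
    """Simulate Y = 1 + 2*x1 + noise and return the OLS coefficient estimates."""
    noise = np.random.default_rng(seed).normal(scale=0.1, size=n)
    Y = 1.0 + 2.0 * x1 + noise
    b, *_ = np.linalg.lstsq(chi, Y, rcond=None)
    return b

b_a, b_b = fit(10), fit(11)
print(b_a)   # the x1 and x2 coefficients typically swing widely from sample to sample,
print(b_b)   # while their sum stays close to 2 and the intercept stays close to 1
```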