To investigate the statistical validity of a multiple regression model, the first step is to check the variance of the estimators. In this case, we have multiple estimators, so we should check their covariance matrix:
Using Eq. (16.4), we see that
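As a point of reference, here is the standard matrix form of the quantities involved, writing X for the n × (q + 1) data matrix (a leading column of ones plus the q regressors; the text denotes it by χ), y for the vector of observed responses, β for the true coefficient vector, ε for the error vector, and β̂ for the vector of least-squares estimators:

$$
\hat{\beta} = (X^\top X)^{-1} X^\top y
            = (X^\top X)^{-1} X^\top (X\beta + \epsilon)
            = \beta + (X^\top X)^{-1} X^\top \epsilon .
$$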
The familiar assumptions about errors, in the multivariate case, can be expressed as
i.e., the expected value is zero and the covariance matrix is the identity matrix times a common variance σ², since errors are mutually independent and identically distributed. Incidentally, in what follows we also assume normality of errors. The first implication of Eq. (16.6) is that ordinary least-squares estimators are, under the standard assumptions, unbiased:
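In the same notation, the assumptions and the resulting unbiasedness property read

$$
\mathrm{E}[\epsilon] = \mathbf{0},
\qquad
\mathrm{Cov}(\epsilon) = \mathrm{E}\!\left[\epsilon\,\epsilon^\top\right] = \sigma^2 I,
$$
$$
\mathrm{E}[\hat{\beta}] = \beta + (X^\top X)^{-1} X^\top\, \mathrm{E}[\epsilon] = \beta .
$$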
Note that this is easy to obtain if we consider the data matrix χ as a given matrix of fixed numbers; if we consider stochastic regressors, some more work is needed. To find the covariance matrix of estimators, we substitute (16.6) into (16.5):
In (16.7), we have used the fact that, since χᵀχ is a symmetric matrix, transposing its inverse just yields the inverse itself.
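Spelled out in the same notation, the derivation runs

$$
\mathrm{Cov}(\hat{\beta})
 = \mathrm{E}\!\left[(\hat{\beta}-\beta)(\hat{\beta}-\beta)^\top\right]
 = (X^\top X)^{-1} X^\top\, \mathrm{E}\!\left[\epsilon\,\epsilon^\top\right] X\, (X^\top X)^{-1}
 = \sigma^2 (X^\top X)^{-1},
$$

using $\bigl[(X^\top X)^{-1}\bigr]^\top = \bigl[(X^\top X)^\top\bigr]^{-1} = (X^\top X)^{-1}$.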
What we are still missing is a way to estimate error variance σ². Not surprisingly, the problem is solved by an extension of what we know from the theory of simple regression. Let us define the sum of squared residuals
where q is the number of regressors. Then, the following unbiased estimator of σ² is obtained:
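A sketch of these two quantities, writing a for the estimated intercept and b_j for the estimated slope coefficients (the exact notation in the text may differ):

$$
\mathrm{SSR} = \sum_{i=1}^{n} e_i^2
             = \sum_{i=1}^{n} \Bigl( Y_i - a - \sum_{j=1}^{q} b_j x_{ij} \Bigr)^{2},
\qquad
s^2 = \frac{\mathrm{SSR}}{n - q - 1} .
$$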
The first thing to notice is that if we plug in q = 1, we get the familiar result from simple regression. We see that the more regressors we use, the more degrees of freedom we lose. Indeed, confidence intervals and hypothesis testing for single parameters, under a normality assumption, are not different from the case of a single regressor; the only caution is that we must account for the degrees of freedom we lose for each additional regressor.
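To make the bookkeeping concrete, here is a small Python sketch of these formulas on synthetic data; the variable names and the simulated setup are illustrative, not taken from the text:

```python
# A minimal numerical sketch of the formulas above (synthetic data, illustrative only).
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
n, q = 50, 3                                   # n observations, q regressors
X = rng.normal(size=(n, q))
beta_true = np.array([2.0, 1.0, -0.5, 0.0])    # intercept + q slopes
Xd = np.column_stack([np.ones(n), X])          # data matrix with a constant column
y = Xd @ beta_true + rng.normal(scale=1.5, size=n)

# OLS estimates: b = (X'X)^{-1} X'y
XtX_inv = np.linalg.inv(Xd.T @ Xd)
b = XtX_inv @ Xd.T @ y

# Unbiased estimate of the error variance: s^2 = SSR / (n - q - 1)
resid = y - Xd @ b
SSR = resid @ resid
s2 = SSR / (n - q - 1)

# Covariance matrix of the estimators and t-based 95% confidence intervals
cov_b = s2 * XtX_inv
se = np.sqrt(np.diag(cov_b))
t_crit = stats.t.ppf(0.975, df=n - q - 1)
for name, bj, sj in zip(["a", "b1", "b2", "b3"], b, se):
    print(f"{name}: {bj:6.3f}  95% CI [{bj - t_crit*sj:6.3f}, {bj + t_crit*sj:6.3f}]")
```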
If we test the significance of a single parameter, essentially we run a familiar t test. In the case of multiple regressors, however, it may be more informative to test the whole model, or maybe a subset of parameters. This leads to an F test, based on the analysis of variance concepts that we outlined in Section 10.3.4. The F statistic is
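In the notation used here, with SSE denoting the explained (regression) sum of squares and SSR the residual sum of squares as before, the statistic takes the standard form

$$
F = \frac{\mathrm{SSE}/q}{\mathrm{SSR}/(n - q - 1)} .
$$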
This is an F variable with q and n − q − 1 degrees of freedom, whose quantiles are tabulated and can be used to check overall significance. To understand this ratio, we should regard it as the ratio of two terms:
- Explained variability, in the form of a sum of squares divided by q degrees of freedom (note that q is the total number of regression coefficients minus 1)
- Unexplained variability, the sum of squared residuals divided by n − q − 1 degrees of freedom (a numerical sketch of the whole test follows this list)
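Here is a compact Python sketch of the overall test on synthetic data (names and numbers are mine, for illustration only):

```python
# Overall F test for a synthetic multiple-regression setup (illustrative only).
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n, q = 50, 3
Xd = np.column_stack([np.ones(n), rng.normal(size=(n, q))])  # constant + q regressors
y = Xd @ np.array([2.0, 1.0, -0.5, 0.0]) + rng.normal(size=n)

b = np.linalg.solve(Xd.T @ Xd, Xd.T @ y)       # OLS coefficients
resid = y - Xd @ b
SSR = resid @ resid                            # unexplained variability (residual sum of squares)
SSE = np.sum((Xd @ b - y.mean()) ** 2)         # explained variability

F = (SSE / q) / (SSR / (n - q - 1))
p_value = stats.f.sf(F, q, n - q - 1)          # upper tail of the F(q, n - q - 1) distribution
print(f"F = {F:.2f}, p-value = {p_value:.4g}")
```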
Finally, the R² coefficient can also be calculated along familiar lines. However, when using the coefficient of determination with multiple variables, there is a subtle issue we should consider. What happens if we add another regressor? Clearly, R² cannot decrease if we add one more opportunity to fit empirical data, but this does not necessarily mean that the model has improved. The more variables we consider, the more uncertainty we have in estimating their parameters. As Example 16.2 shows, we may have trouble with uncertain estimates because of collinearity. Hence, adding a possibly correlated regressor may be far from beneficial, and we need a measure to capture the tradeoff between adding one variable and the trouble this may cause. The adjusted R² coefficient has been proposed, which accounts for the degrees of freedom that are lost because of additional regressors:
To see the rationale behind the adjustment, recall that R² is the ratio of explained variability to total variability:
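In the notation used above, with SST denoting total variability,

$$
R^2 = \frac{\mathrm{SSE}}{\mathrm{SST}} = 1 - \frac{\mathrm{SSR}}{\mathrm{SST}},
\qquad
\mathrm{SST} = \sum_{i=1}^{n} \bigl(Y_i - \bar{Y}\bigr)^2 .
$$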
In the adjusted R², we divide the two terms of the ratio by their degrees of freedom:
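One common way of writing the result of this adjustment (the text's Eq. (16.9) may use an equivalent form) is

$$
\bar{R}^2 = 1 - \frac{\mathrm{SSR}/(n - q - 1)}{\mathrm{SST}/(n - 1)}
          = 1 - (1 - R^2)\,\frac{n - 1}{n - q - 1} .
$$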
This adjustment leads to (16.9). The sum of squared residuals SSR cannot increase if we add regressors. However, the term (n − q − 1) may decrease faster than SSR if the added regressors do not contribute much explanatory power. The net result is that an additional regressor may actually result in a decrease of the adjusted R².
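As a final sketch, the following Python fragment (again with synthetic data and my own variable names) computes R² and the adjusted R² before and after adding a regressor that is pure noise, so that the penalty for the lost degree of freedom can be inspected; the plain R² can only go up, while the adjusted version may go down:

```python
# Effect of adding a pure-noise regressor on R^2 and adjusted R^2 (synthetic data).
import numpy as np

def r2_and_adjusted(Xd, y):
    """Return (R^2, adjusted R^2) for a data matrix Xd whose first column is the constant."""
    n, k = Xd.shape
    q = k - 1                                    # number of regressors (excluding the constant)
    b = np.linalg.solve(Xd.T @ Xd, Xd.T @ y)
    resid = y - Xd @ b
    SSR = resid @ resid                          # sum of squared residuals
    SST = np.sum((y - y.mean()) ** 2)            # total variability
    r2 = 1.0 - SSR / SST
    adj = 1.0 - (SSR / (n - q - 1)) / (SST / (n - 1))
    return r2, adj

rng = np.random.default_rng(7)
n = 40
x1, x2 = rng.normal(size=n), rng.normal(size=n)
y = 1.0 + 2.0 * x1 - 1.0 * x2 + rng.normal(size=n)

base = np.column_stack([np.ones(n), x1, x2])
noise = np.column_stack([base, rng.normal(size=n)])   # add a regressor unrelated to y

print("two regressors: R^2 = %.4f, adjusted R^2 = %.4f" % r2_and_adjusted(base, y))
print("plus noise:     R^2 = %.4f, adjusted R^2 = %.4f" % r2_and_adjusted(noise, y))
```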