The R2 coefficient and ANOVA

Testing the significance of a slope coefficient is a first diagnostic check of the suitability of a regression model. However, per se, this test does not tell us much about the predictive power of the model as a whole. The reader may better appreciate the point by thinking about a multiple regression model involving several explanatory variables: A t test checks the impact of each regressor individually, but it does not tell us anything about the overall model. The standard error of regression is a measure of the uncertainty of predictions and can help in this respect, but it suffers from the usual problem with standard deviation: Its magnitude depends on how we measure things. The typical performance measure used to check the usefulness of a regression model is the R2 coefficient, also known as the coefficient of determination. The R2 coefficient can be understood in two related ways:

  • It can be seen as the fraction of total variability that the model is able to explain.
  • It can be seen as the squared coefficient of correlation between the explanatory variable x and the response Y, or between the prediction $\hat{Y}$ and the observed response Y.

Whatever the interpretation, the value of the R2 coefficient is bounded between 0 and 1; the closer to 1, the higher the explanatory power of the model.

Fig. 10.5 Geometric interpretation of the R2 coefficient.

Let us pursue the former interpretation first, with the graphical aid of Fig. 10.5. When we look at raw data, we see that the response variable Y is subject to variability with respect to its average $\bar{Y}$. This variability can be measured by the sum of squared deviations with respect to $\bar{Y}$, which is called the total sum of squares (TSS):

$$\mathrm{TSS} = \sum_{i=1}^{n} \left( Y_i - \bar{Y} \right)^2$$

Not all of this variability is noise, since part of it can be attributed to variability in the explanatory variable x. This “predictable” variability is explained by the regression model in terms of the predicted response $\hat{Y}_i$. So, we may measure the explained variability by the explained sum of squares:

$$\mathrm{ESS} = \sum_{i=1}^{n} \left( \hat{Y}_i - \bar{Y} \right)^2$$

From Fig. 10.5, we see that there is some residual part of variability that cannot be accounted for by model predictions. This unexplained variability is related to the residuals, more precisely to the sum of squared residuals, which we already defined as

$$\mathrm{SSR} = \sum_{i=1}^{n} e_i^2 = \sum_{i=1}^{n} \left( Y_i - \hat{Y}_i \right)^2$$

This view is reinforced by the following theorem.

THEOREM 10.3 For a linear regression model, the total variability can be partitioned into the sum of explained and unexplained variability as follows:

$$\mathrm{TSS} = \mathrm{ESS} + \mathrm{SSR}$$

PROOF The total variability can be rewritten by adding and subtracting $\hat{Y}_i$ within squared terms and by expanding them:

$$\mathrm{TSS} = \sum_{i=1}^{n} \left( Y_i - \hat{Y}_i + \hat{Y}_i - \bar{Y} \right)^2 = \sum_{i=1}^{n} \left( Y_i - \hat{Y}_i \right)^2 + \sum_{i=1}^{n} \left( \hat{Y}_i - \bar{Y} \right)^2 + \underbrace{2 \sum_{i=1}^{n} \left( Y_i - \hat{Y}_i \right)\left( \hat{Y}_i - \bar{Y} \right)}_{v} = \mathrm{SSR} + \mathrm{ESS} + v$$

Now we prove that the last term v = 0. The first step is to rewrite model predictions in terms of deviations with respect to averages, using the least-squares estimator of intercept:

$$\hat{Y}_i = a + b x_i = \left( \bar{Y} - b\bar{x} \right) + b x_i = \bar{Y} + b \left( x_i - \bar{x} \right)$$

Plugging this expression into the expression of v, and using the least-squares estimator of slope, we get

$$v = 2 \sum_{i=1}^{n} \left[ Y_i - \bar{Y} - b \left( x_i - \bar{x} \right) \right] b \left( x_i - \bar{x} \right) = 2b \left[ \sum_{i=1}^{n} \left( x_i - \bar{x} \right)\left( Y_i - \bar{Y} \right) - b \sum_{i=1}^{n} \left( x_i - \bar{x} \right)^2 \right] = 0,$$

where the last equality follows from the expression of the least-squares slope, $b = \sum_{i=1}^{n} (x_i - \bar{x})(Y_i - \bar{Y}) \big/ \sum_{i=1}^{n} (x_i - \bar{x})^2$.

On the basis of this result, we may define the R2 coefficient as the ratio of explained variability and total variability:

$$R^2 = \frac{\mathrm{ESS}}{\mathrm{TSS}} = 1 - \frac{\mathrm{SSR}}{\mathrm{TSS}}$$

We see that R2 is also 1 minus the fraction of unexplained variability. When the coefficient of determination is close to 1, little variability is left unexplained by the model.
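
As a concrete numerical check, here is a minimal Python sketch (the data are made up for illustration and are not the datasets of this chapter) that fits a least-squares line, computes TSS, ESS, and SSR, and verifies both the decomposition of Theorem 10.3 and the two equivalent expressions for R2:

    import numpy as np

    # Made-up data, for illustration only
    x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
    Y = np.array([2.1, 3.9, 6.2, 7.8, 10.1, 11.9])

    # Least-squares estimates of slope b and intercept a
    b = np.sum((x - x.mean()) * (Y - Y.mean())) / np.sum((x - x.mean())**2)
    a = Y.mean() - b * x.mean()
    Y_hat = a + b * x

    TSS = np.sum((Y - Y.mean())**2)      # total variability
    ESS = np.sum((Y_hat - Y.mean())**2)  # explained variability
    SSR = np.sum((Y - Y_hat)**2)         # unexplained variability (residuals)

    print(np.isclose(TSS, ESS + SSR))    # True: TSS = ESS + SSR
    print(ESS / TSS, 1 - SSR / TSS)      # the two expressions for R2 coincide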

Recalling the meaning of the slope b, we would expect the regression model to have good explanatory power when there is a strong correlation between the explanatory and the response variables. Strictly speaking, since we are dealing with a nonstochastic regressor, we are a bit sloppy when talking about this correlation, and it would be better to consider the correlation between $\hat{Y}_i$ and $Y_i$. Hence, let us consider the squared sample correlation:

$$r^2_{\hat{Y},Y} = \frac{\left[ \sum_{i=1}^{n} \left( \hat{Y}_i - \bar{\hat{Y}} \right)\left( Y_i - \bar{Y} \right) \right]^2}{\sum_{i=1}^{n} \left( \hat{Y}_i - \bar{\hat{Y}} \right)^2 \, \sum_{i=1}^{n} \left( Y_i - \bar{Y} \right)^2}$$

It is not difficult to prove that this expression boils down to R2. To begin with, we see that the average prediction and the average observation of the response variable are the same:

$$\bar{\hat{Y}} \equiv \frac{1}{n} \sum_{i=1}^{n} \hat{Y}_i = \frac{1}{n} \sum_{i=1}^{n} \left[ \bar{Y} + b \left( x_i - \bar{x} \right) \right] = \bar{Y}$$

Hence, we may rewrite the numerator of $r^2_{\hat{Y},Y}$ as

$$\sum_{i=1}^{n} \left( \hat{Y}_i - \bar{Y} \right)\left( Y_i - \bar{Y} \right) = \sum_{i=1}^{n} \left( \hat{Y}_i - \bar{Y} \right)\left( Y_i - \hat{Y}_i + \hat{Y}_i - \bar{Y} \right) = \mathrm{ESS} + \frac{v}{2}$$

We see that we get the same null term v that we encountered in proving Theorem 10.3. Hence, we can rewrite the squared correlation as

$$r^2_{\hat{Y},Y} = \frac{\mathrm{ESS}^2}{\mathrm{ESS} \cdot \mathrm{TSS}} = \frac{\mathrm{ESS}}{\mathrm{TSS}} = R^2$$

Thus, we have another aspect to consider in interpreting the coefficient of determination.
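
The same equivalence can be checked numerically: squaring the sample correlation between predictions and observations gives exactly ESS/TSS. A short sketch, again with made-up data:

    import numpy as np

    # Made-up data, for illustration only
    x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
    Y = np.array([2.1, 3.9, 6.2, 7.8, 10.1, 11.9])

    b = np.sum((x - x.mean()) * (Y - Y.mean())) / np.sum((x - x.mean())**2)
    Y_hat = Y.mean() + b * (x - x.mean())   # predictions in deviation form

    R2 = np.sum((Y_hat - Y.mean())**2) / np.sum((Y - Y.mean())**2)  # ESS / TSS
    r2 = np.corrcoef(Y_hat, Y)[0, 1]**2     # squared sample correlation

    print(np.isclose(R2, r2))               # True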

The R2 coefficient gives us an evaluation of the overall fit of the model. One word of caution is in order, however, when we think of using multiple explanatory variables. It is easy to understand that the more variables we add, the better the fit: R2 cannot decrease when we add more and more variables. However, there is a tradeoff, since each additional parameter consumes a degree of freedom and increases the uncertainty of our estimates. Furthermore, the more variables we use, the better we fit the sampled data, but this improvement need not translate out of sample.
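
The first point is easy to verify numerically: in the sketch below (made-up data; the extra regressors are pure noise, unrelated to the response), R2 never decreases as columns are added to the design matrix, even though the added variables carry no information:

    import numpy as np

    rng = np.random.default_rng(42)
    n = 30
    x = np.linspace(0.0, 10.0, n)
    Y = 1.0 + 2.0 * x + rng.normal(scale=2.0, size=n)   # made-up data

    X = np.column_stack([np.ones(n), x])                # intercept + true regressor
    for k in range(5):
        # append a regressor that is pure noise
        X = np.column_stack([X, rng.normal(size=n)])
        beta, *_ = np.linalg.lstsq(X, Y, rcond=None)
        Y_hat = X @ beta
        R2 = 1.0 - np.sum((Y - Y_hat)**2) / np.sum((Y - Y.mean())**2)
        print(f"{X.shape[1] - 1} regressors: R2 = {R2:.4f}")   # nondecreasing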

The interpretation of R2 in terms of explained variability suggests another way of checking the significance of the regression model as a whole, rather than in terms of a t test on the slope coefficient. As we mentioned, this is important when dealing with multiple regression, which involves multiple coefficients; these may be tested individually, but we would also like to have an evaluation of the overall model. This test applies concepts that we introduced when dealing with analysis of variance (ANOVA) in Section 9.6. Again, there are a couple of ways to interpret the idea.

  1. If the model has explanatory power, then the explained variability ESS should be large with respect to the unexplained variability SSR. We could check their ratio, and if this is large enough, then the model is significant. In fact, there are many settings in which one has to compare variances. To carry out the test, we need to build a test statistic with a well-defined probability distribution, and this calls for a little adjustment.
  2. Remembering what we did in Section 9.6, we may consider a joint test on the parameters of a multiple regression model,

     $$Y_i = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \cdots + \beta_q x_{iq} + \epsilon_i.$$

     The model is significant if at least one coefficient is nonzero. Hence, we may run the following test:

     $$H_0: \beta_1 = \beta_2 = \cdots = \beta_q = 0 \qquad \text{against} \qquad H_a: \beta_j \neq 0 \ \text{for at least one } j.$$

To devise a test, let us use some intuition first. If we assume that errors are normally distributed, then the sums of squares are related to sums of squared normals, and we know that this leads to a chi-square distribution. If we want to check the ratio of sums of squares, we have to deal with a ratio of chi-square random variables, and we know that this involves an F distribution. The reasoning here is quite informal, as we should check the independence of the two sums of squares: the F distribution involves two independent chi-square variables. Another issue is related to the degrees of freedom, which are typically associated with sums of squares and define which kind of chi-square variable is involved. To get a clue, let us consider the relationship between TSS, ESS, and SSR:

$$\underbrace{\sum_{i=1}^{n} \left( Y_i - \bar{Y} \right)^2}_{\mathrm{TSS}} = \underbrace{\sum_{i=1}^{n} \left( \hat{Y}_i - \bar{Y} \right)^2}_{\mathrm{ESS}} + \underbrace{\sum_{i=1}^{n} \left( Y_i - \hat{Y}_i \right)^2}_{\mathrm{SSR}}$$

We know that the left-hand term has n − 1 degrees of freedom, since only n − 1 terms can be assigned freely. We suggested that the last term, which is related to the standard error of regression, has n − 2 degrees of freedom. Then, the remaining term ESS must have just 1 degree of freedom. To further justify this intuition, please note that we may rewrite all of the involved terms in the ESS as

$$\hat{Y}_i - \bar{Y} = b \left( x_i - \bar{x} \right) = \left( x_i - \bar{x} \right) \frac{\sum_{j=1}^{n} \left( x_j - \bar{x} \right)\left( Y_j - \bar{Y} \right)}{\sum_{j=1}^{n} \left( x_j - \bar{x} \right)^2}$$

Table 10.3 Typical ANOVA table summarizing source of variation, sum of squares, degrees of freedom, mean squares, F statistic, and the corresponding p-value.

  Source of variation      Sum of squares   Degrees of freedom   Mean square           F statistic        p-value
  Regression (explained)   ESS              1                    MSexp = ESS/1         F = MSexp/MSerr    P{F(1, n-2) > F}
  Error (residual)         SSR              n - 2                MSerr = SSR/(n - 2)
  Total                    TSS              n - 1

This shows that every term in the ESS depends on the observations Yi only through the single random quantity b, so this sum of squares must have 1 degree of freedom. Correcting the sums of squares for their degrees of freedom, we define the explained mean square and the mean square due to errors:

$$\mathrm{MS}_{\mathrm{exp}} = \frac{\mathrm{ESS}}{1}, \qquad \mathrm{MS}_{\mathrm{err}} = \frac{\mathrm{SSR}}{n-2},$$

and consider the following test statistic:

$$F = \frac{\mathrm{MS}_{\mathrm{exp}}}{\mathrm{MS}_{\mathrm{err}}}$$

If the true slope is zero, i.e., under the null hypothesis, this variable is a ratio of chi-square variables and has an F distribution, more precisely an F(1, n − 2) distribution. We are encouraged to reject the null hypothesis if the test statistic is large, as this suggests that ESS is large with respect to SSR. Using quantiles from the F distribution, we may test the significance of the regression model. Again, we should emphasize that this depends on a normality assumption and that we have just provided an intuitive justification, not a rigorous proof. It is customary to express the ANOVA test and the related F test in the tabular form shown in Table 10.3.
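
A minimal sketch of the resulting F test in Python follows (made-up data again; the p-value is the right tail of an F(1, n − 2) distribution, computed with SciPy's survival function):

    import numpy as np
    from scipy import stats

    # Made-up data, for illustration only
    x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
    Y = np.array([2.3, 3.1, 6.5, 7.2, 9.8, 12.4])
    n = len(Y)

    b = np.sum((x - x.mean()) * (Y - Y.mean())) / np.sum((x - x.mean())**2)
    Y_hat = Y.mean() + b * (x - x.mean())

    ESS = np.sum((Y_hat - Y.mean())**2)
    SSR = np.sum((Y - Y_hat)**2)

    MS_exp = ESS / 1          # 1 degree of freedom for the explained variability
    MS_err = SSR / (n - 2)    # n - 2 degrees of freedom for the residuals
    F = MS_exp / MS_err
    p_value = stats.f.sf(F, 1, n - 2)    # P{F(1, n-2) > F}

    print(f"F = {F:.4f}, p-value = {p_value:.4f}")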

Example 10.10 To illustrate the procedure, let us consider the ANOVA for the two datasets of Figs. 10.1 and 10.2. The two respective ANOVA tables are illustrated in Table 10.4. In both cases, we have 5 degrees of freedom in TSS, as there are n = 6 observations; 1 degree of freedom is associated with ESS, and n − 2 = 4 with SSR. Apart from minor rounding differences, we see that the ESS is the same for both regression models, since the regression coefficients and the values of the explanatory variable are the same. What makes the difference is the SSR, i.e., the sum of squared residuals, which is much larger in the second case because of the much larger variability in the second dataset. This variability is left unexplained by the regression model. In this simple regression model, MSexp = ESS, since 1 degree of freedom is associated with explained variability, whereas MSerr = SSR/4. The test statistic is 50.0864 in the first dataset. This is a rather large value for an F(1, 4) distribution, and it is quite unlikely that it is just the result of sampling variability. Indeed, the p-value is 0.0021, which means that the model is significant at less than 1%. On the contrary, for the second dataset the test statistic is small, 2.6621, and the corresponding p-value is 17.81%. This means that, to reject the null hypothesis that the regression model is not significant, we must accept a very large probability of a type I error.

Table 10.4 ANOVA tables for the datasets illustrated in Figs. 10.1 and 10.2.

One should wonder whether this amounts to a test genuinely different from the t test on the slope. In the simple regression case, the two tests are actually equivalent. Indeed, in Section 7.7.2 we pointed out that an F(1, n) variable is just the square of a t variable with n degrees of freedom. So, the t test on the single slope of a simple regression model and the F test are equivalent. In fact, the p-values above are exactly the same as those we obtained in Example 10.8 when running the t test. Furthermore, for the first dataset, we have

$$F = 50.0864 = T^2,$$

where T is the value of the t statistic for the slope computed in Example 10.8.

This does not generalize to multiple regression models, where there are multiple coefficients and the degrees of freedom will play an important role.
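
For the simple regression case, the equivalence is easy to verify numerically: the squared t statistic for the slope equals the F statistic, and the two-sided t p-value equals the F p-value. A sketch with made-up data (not the data of Example 10.8):

    import numpy as np
    from scipy import stats

    # Made-up data, for illustration only
    x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
    Y = np.array([2.3, 3.1, 6.5, 7.2, 9.8, 12.4])
    n = len(Y)

    Sxx = np.sum((x - x.mean())**2)
    b = np.sum((x - x.mean()) * (Y - Y.mean())) / Sxx
    Y_hat = Y.mean() + b * (x - x.mean())

    s2 = np.sum((Y - Y_hat)**2) / (n - 2)          # estimate of the error variance
    t_stat = b / np.sqrt(s2 / Sxx)                 # t statistic for H0: slope = 0
    F_stat = np.sum((Y_hat - Y.mean())**2) / s2    # F = MS_exp / MS_err

    print(np.isclose(t_stat**2, F_stat))           # True
    p_t = 2 * stats.t.sf(abs(t_stat), n - 2)       # two-sided t test p-value
    p_F = stats.f.sf(F_stat, 1, n - 2)             # F test p-value
    print(np.isclose(p_t, p_F))                    # True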

Fig. 10.6 Plot of residuals coherent with the regression model assumptions; e (ordinate) refers to residual and i (abscissa) is an observation index.

