All of the theory we have built so far relies on specific assumptions about the errors, which we recall here for convenience:
- They have expected value zero.
- They are mutually independent and identically distributed; in particular, they have the same standard deviation, which does not depend on $x_i$.
Since errors are not observable directly, we must settle for a check based on residuals. The check can exploit sound statistical procedures, which are beyond our scope; for our purposes, it is enough to reinforce the important concepts by illustrating a few visual checks that we can carry out just by plotting residuals.
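Before moving on, here is a minimal sketch of such a visual check in Python (assuming NumPy and matplotlib are available; the data and variable names are made up for illustration, not taken from the text):

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(42)

# Made-up data: a linear relationship plus i.i.d. Gaussian noise.
x = np.linspace(0, 10, 50)
y = 2.0 + 0.5 * x + rng.normal(scale=1.0, size=x.size)

# Least-squares fit of a straight line (slope and intercept).
slope, intercept = np.polyfit(x, y, 1)
residuals = y - (intercept + slope * x)

# By construction the residuals average to zero; the visual check
# is whether they look like pure noise when plotted in order.
plt.stem(np.arange(1, x.size + 1), residuals)
plt.xlabel("observation number i")
plt.ylabel("residual e_i")
plt.show()
```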
The assumption about the expected value is automatically met, since the way we build the least-squares estimators implies that the average residual is zero, as shown in Eq. (10.2). What we need to check is that there is no evident autocorrelation or lack of stationarity. The plot of the residuals $e_i$ should look like Fig. 10.6, where we see that they do reasonably resemble pure noise. On the contrary, in Fig. 10.7 we see a pattern that is typically associated with positively correlated errors: if we observe a positive residual for observation $i$, the next observation $i+1$ is likely to display a positive residual as well. The same holds for negative errors, and we see “waves” of positive vs. negative residuals.

Such a pattern may be observed for at least a couple of reasons. The first one is that there is indeed some correlation between consecutive observations in time. In such a case, subscript $i$ really refers to time and to the order in which observations were taken; the obvious case is when time is the explanatory variable. Another possible reason has little to do with statistics: we may also observe a pattern like this when there are nonlinearities in the observed phenomenon. Consider, for instance, the nonlinear function in Fig. 10.8, and imagine using linear regression to approximate it, using a few sample points (possibly affected by noise). The nonlinear curve is somehow cut by the regression line, and this results in a nonrandom pattern in the residuals: they have one sign in the middle range of the interval of $x$, and the opposite sign near the extreme points (see the simulation sketch below). Of course, this second case has more to do with the appropriateness of the selected functional form, and we are somewhat improperly using statistical concepts as a diagnostic tool.

In passing, we should also clarify the meaning of the subscript $i$ associated with an observation. If the explanatory variable is time, subscript $i$ refers to the position in the chronological sequence of observations. But if we are regressing sales against price, we might wish to sort observations according to the value of the explanatory variable, so that subscript $i$ does not refer to the order in which we took our samples. Doing so, however, we lose the possibility of spotting patterns due to the impact of time. Hence, it is advisable to plot residuals in every sensible ordering, so as to carry out multiple checks.
Fig. 10.7 Plot of residuals suggesting autocorrelation in the errors.
Fig. 10.8 Using linear regression with a nonlinear underlying function results in “autocorrelated” residuals.
Fig. 10.9 Plot of residuals suggesting that the mean of the error process is not stationary.
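To see the nonlinearity effect of Fig. 10.8 concretely, the following sketch (same assumptions as above: NumPy and matplotlib, made-up data) fits a straight line to noisy samples of a parabola; the residuals come out positive near the extremes of $x$ and negative in the middle:

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)

# Nonlinear underlying function (a parabola, chosen for illustration)
# observed with a little noise.
x = np.linspace(-3, 3, 40)
y = x**2 + rng.normal(scale=0.3, size=x.size)

# Fitting a straight line "cuts" the curve: the residuals have one
# sign in the middle of the range and the opposite sign near the
# extremes, a nonrandom pattern mimicking autocorrelation.
slope, intercept = np.polyfit(x, y, 1)
residuals = y - (intercept + slope * x)

plt.scatter(x, residuals)
plt.axhline(0.0, linestyle="--")
plt.xlabel("x")
plt.ylabel("residual")
plt.show()
```

The point is not this particular parabola: any curvature that the straight line cannot capture leaves a systematic trace of this kind in the residuals.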
Another check concerns the stationarity of the error process in terms of mean and standard deviation. By construction, the average residual is zero over the whole dataset, but a plot of residuals like the one in Fig. 10.9 suggests lack of stationarity. If $i$ is actually related to the temporal sequence of observations, we could consider running a multiple regression in which time is an explanatory variable. Last but not least, we assumed homoskedasticity, but a plot of residuals such as Fig. 10.10 suggests that variance is not constant over our observations. In Section 10.5 we outline two possible ways to cope with heteroskedasticity. We close this section by hinting at a few more quantitative tests; a small numerical sketch follows the list.
- To check autocorrelation, we could measure the correlation between $e_i$ and $e_{i+1}$. Again, we should be careful about the meaning of the subscript $i$: if it is time, checking correlation between consecutive samples makes sense; otherwise, we should sort observations according to values of the explanatory variable.

Fig. 10.10 Plot of residuals suggesting that the variance of the error process is not stationary.
- To check whether time should be included as an explanatory variable, we may measure the “correlation” between $e_i$ and $i$, where $i$ refers to the chronological order of observations.
- Finally, to check homoskedasticity, we may measure the “correlation” between $e_i^2$ and $i$.
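A minimal numerical sketch of these three checks, assuming the residuals are held in a NumPy array in a meaningful order (the function and variable names are ours, for illustration only):

```python
import numpy as np

def residual_checks(e):
    """Rough numerical counterparts of the three checks above.

    e: array of residuals, stored in a meaningful order
    (e.g., chronological).
    """
    i = np.arange(1, e.size + 1)
    # Autocorrelation: correlation between e_i and e_{i+1}.
    lag1 = np.corrcoef(e[:-1], e[1:])[0, 1]
    # Impact of time: "correlation" between e_i and i.
    time_trend = np.corrcoef(e, i)[0, 1]
    # Homoskedasticity: "correlation" between e_i^2 and i.
    variance_trend = np.corrcoef(e**2, i)[0, 1]
    return lag1, time_trend, variance_trend

# Example with artificially heteroskedastic residuals: the noise
# standard deviation grows with i, so variance_trend should come
# out markedly positive while the other two stay near zero.
rng = np.random.default_rng(7)
i = np.arange(1, 101)
e = rng.normal(scale=0.1 * i)
e = e - e.mean()  # residuals average to zero by construction
print(residual_checks(e))
```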