Using regression for forecasting and explanation purposes

We have seen that omitting variables may result in biased estimates, or even in debatable models in which significant coefficients are attached to regressors that have no real impact on the response variable. However, a rather cynical point of view could be that, as long as the model does a good job at forecasting, no one should care. This opinion should not be dismissed out of hand: we should ask what our real aim is when building a regression model. If we are interested only in forecasting, then proper variable selection may not be an issue, as long as bad choices do not result in a model with very little predictive power. This could be the point of view of an engineer, or of someone who is just interested in the decisions that are based on model output; if the decisions are satisfactory, so be it. Arguably, the viewpoint of a sociologist would be rather different, as she would probably like to understand what really drives a phenomenon. There is nothing wrong with either line of reasoning; they are just two different ways of using modeling tools. Indeed, many forecasting frameworks that we do not consider here, like neural networks, have been criticized because they do not offer any explanation for their output, yet they may be practically useful.

Leaving such philosophical considerations aside, once we have estimated a multiple linear regression model, we might be interested in forecasting. Given the model

$$\hat{Y} = \hat{\mathbf{b}}^T \mathbf{x},$$

we should just plug in $\mathbf{x}_0$ to obtain a point forecast $\hat{Y}_0 = \hat{\mathbf{b}}^T \mathbf{x}_0$. However, we have repeatedly observed that a single point forecast may be of little use, and a suitable prediction interval should be devised. Given an estimate $\hat{\sigma}_\epsilon$ of the standard deviation of the errors and a confidence level 1 − α, if we assume normality of errors, it is tempting to build a prediction interval like

$$\hat{Y}_0 \pm t_{1-\alpha/2,\,n-q-1}\,\hat{\sigma}_\epsilon,$$

where in selecting the t distribution we account for estimated coefficients in setting the degrees of freedom. However, we know from Section 10.4 that doing so is not quite correct, as we would consider the uncertainty only in the realization of the error term, and not the uncertainty in the estimate of coefficients. To properly cast the problem, we should evaluate the variance of the forecast error:

$$\mathrm{Var}\left(Y_0 - \hat{Y}_0\right),$$

where the random response is given by

$$Y_0 = \mathbf{b}^T \mathbf{x}_0 + \epsilon_0.$$

Since the estimate of b depends only on past realizations of the errors, i.e., those that affect the observed responses Y, the new response $Y_0$ is independent of them and hence of the forecast $\hat{Y}_0$, and we may write

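$$\mathrm{Var}\left(Y_0 - \hat{Y}_0\right) = \mathrm{Var}(Y_0) + \mathrm{Var}\left(\hat{Y}_0\right) = \sigma_\epsilon^2 + \mathrm{Var}\left(\hat{\mathbf{b}}^T \mathbf{x}_0\right).$$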

Now, to evaluate the second term, we may take advantage of Eq. (16.8):

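$$\mathrm{Var}\left(\hat{\mathbf{b}}^T \mathbf{x}_0\right) = \mathbf{x}_0^T\,\mathrm{Cov}\left(\hat{\mathbf{b}}\right)\mathbf{x}_0 = \sigma_\epsilon^2\,\mathbf{x}_0^T \left(\mathbf{X}^T \mathbf{X}\right)^{-1} \mathbf{x}_0.$$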
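The variance of the plugged-in forecast can be checked numerically. The sketch below is only an illustration: it assumes that Eq. (16.8) is the usual covariance matrix of the least-squares estimator, $\sigma_\epsilon^2 (\mathbf{X}^T \mathbf{X})^{-1}$, and all numbers in it (sample size, coefficients, design matrix, and the point `x0`) are hypothetical. It compares the theoretical value with the sample variance of the forecast over many simulated data sets sharing the same design matrix.

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical setting: intercept plus two regressors, normal errors.
n, sigma = 50, 2.0
b = np.array([1.0, 0.5, -2.0])
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])  # fixed design
x0 = np.array([1.0, 0.3, -0.7])  # leading 1 accounts for the intercept

# Theoretical variance: sigma^2 * x0^T (X^T X)^{-1} x0.
var_theory = sigma**2 * (x0 @ np.linalg.solve(X.T @ X, x0))

# Simulate many response vectors (one per column) for the same design.
n_rep = 20000
Y = (X @ b)[:, None] + rng.normal(scale=sigma, size=(n, n_rep))

# Least-squares estimates for all replications at once (normal equations).
B_hat = np.linalg.solve(X.T @ X, X.T @ Y)   # shape (3, n_rep)
forecasts = x0 @ B_hat                      # one forecast per replication

var_mc = forecasts.var(ddof=1)
print(f"theory: {var_theory:.4f}, simulation: {var_mc:.4f}")
```

The two values should agree up to sampling noise, which shrinks as `n_rep` grows.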

Then, if we knew the “true” standard deviation of errors $\sigma_\epsilon$, we could conclude

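$$\frac{Y_0 - \hat{Y}_0}{\sigma_\epsilon \sqrt{1 + \mathbf{x}_0^T \left(\mathbf{X}^T \mathbf{X}\right)^{-1} \mathbf{x}_0}} \sim N(0, 1).$$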

Plugging in the estimate $\hat{\sigma}_\epsilon$, we obtain a t distribution with n − q − 1 degrees of freedom, and the following prediction interval with confidence level 1 − α:

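$$\hat{Y}_0 \pm t_{1-\alpha/2,\,n-q-1}\,\hat{\sigma}_\epsilon \sqrt{1 + \mathbf{x}_0^T \left(\mathbf{X}^T \mathbf{X}\right)^{-1} \mathbf{x}_0}.$$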
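As a minimal sketch of how such an interval might be computed in practice, the function below uses NumPy and SciPy. The function name `prediction_interval` and its arguments are hypothetical, introduced only for illustration. It assumes a model with an intercept and q regressors, normal errors, and the standard unbiased estimate of the error variance (sum of squared residuals divided by n − q − 1).

```python
import numpy as np
from scipy import stats


def prediction_interval(X, y, x0, alpha=0.05):
    """Return (point forecast, lower bound, upper bound) at x0.

    X  : (n, q) array of regressor values, WITHOUT the column of ones
    y  : (n,) array of observed responses
    x0 : (q,) array of regressor values for the new observation
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    n, q = X.shape

    # Add the intercept, so the design matrix has q + 1 columns.
    Xd = np.column_stack([np.ones(n), X])
    x0d = np.concatenate([[1.0], np.asarray(x0, dtype=float)])

    # Least-squares estimate of the coefficients.
    b_hat = np.linalg.lstsq(Xd, y, rcond=None)[0]

    # Estimate of the error standard deviation, n - q - 1 degrees of freedom.
    residuals = y - Xd @ b_hat
    dof = n - q - 1
    sigma_hat = np.sqrt(residuals @ residuals / dof)

    # x0^T (X^T X)^{-1} x0, computed with a linear solve instead of an inverse.
    h0 = x0d @ np.linalg.solve(Xd.T @ Xd, x0d)

    y0_hat = x0d @ b_hat
    t_quantile = stats.t.ppf(1.0 - alpha / 2.0, dof)
    half_width = t_quantile * sigma_hat * np.sqrt(1.0 + h0)
    return y0_hat, y0_hat - half_width, y0_hat + half_width


# Usage on synthetic (made-up) data.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(30, 2))
y_demo = 1.0 + X_demo @ np.array([2.0, -1.0]) + rng.normal(scale=0.5, size=30)
y0, lower, upper = prediction_interval(X_demo, y_demo, [0.5, 0.5])
print(f"forecast {y0:.3f}, 95% interval [{lower:.3f}, {upper:.3f}]")
```

Because the term under the square root is always larger than 1, this interval is wider than the tempting one based on the error term alone, and it widens further as x0 moves away from the bulk of the observed data.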
