We have seen that omitting variables may result in biased estimates, or even in debatable models where significant coefficients are associated with regressor variables that may have no real impact on the response variable. However, a rather cynical point of view could be that, as long as the model does a good job at forecasting, no one should care. This opinion should not be dismissed out of hand. We should ask what our real aim is when building a regression model. If we are interested only in a forecasting model, then maybe proper variable selection is not an issue, as long as bad choices do not result in a model with very little predictive power. This could be the point of view of an engineer or someone who is just interested in the decisions that are based on model output; if the decisions are satisfactory, so be it. Arguably, the viewpoint of a sociologist would be rather different, as she would probably really like to understand what drives a phenomenon. Neither line of reasoning is wrong; they are just two different ways of using modeling tools. Indeed, many forecasting frameworks that we do not consider here, like neural networks, have been criticized because they do not offer any explanation for their output, yet they may be practically useful.
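Leaving such philosophical considerations aside, once we have estimated a multiple linear regression model, we may want to use it for forecasting. Given the fitted model
$$\hat{Y} = \mathbf{x}^{T}\hat{\mathbf{b}},$$
where the vector $\mathbf{x}$ collects the values of the q regressors, preceded by a 1 for the intercept,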
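we should just plug in $\mathbf{x}_0$ to obtain a point forecast $\hat{Y}_0 = \mathbf{x}_0^{T}\hat{\mathbf{b}}$. However, we have repeatedly observed that a single point forecast may be of little use, and a suitable prediction interval should be devised. Given an estimate $\hat{\sigma}_\epsilon$ of the standard deviation of the errors and a confidence level 1 − α, if we assume normality of errors, it is tempting to build a prediction interval like
$$\hat{Y}_0 \pm t_{1-\alpha/2,\,n-q-1}\,\hat{\sigma}_\epsilon,$$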
where, in selecting the t distribution, we account for the estimated coefficients when setting the degrees of freedom. However, we know from Section 10.4 that doing so is not quite correct, as we would consider only the uncertainty in the realization of the error term, and not the uncertainty in the estimates of the coefficients. To properly cast the problem, we should evaluate the variance of the forecast error:
$$\mathrm{Var}\left(Y_0 - \hat{Y}_0\right),$$
where the random response is given by
$$Y_0 = \mathbf{x}_0^{T}\mathbf{b} + \epsilon_0.$$
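Since the estimate of b depends only on the past realizations of the errors, i.e., those that affect the observed responses Y, the new response $Y_0$ is independent of it, and we may write
$$\mathrm{Var}\left(Y_0 - \hat{Y}_0\right) = \mathrm{Var}(Y_0) + \mathrm{Var}(\hat{Y}_0) = \sigma_\epsilon^2 + \mathrm{Var}\left(\mathbf{x}_0^{T}\hat{\mathbf{b}}\right).$$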
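Now, to evaluate the second term, we may take advantage of Eq. (16.8):
$$\mathrm{Var}\left(\mathbf{x}_0^{T}\hat{\mathbf{b}}\right) = \mathbf{x}_0^{T}\,\mathrm{Cov}(\hat{\mathbf{b}})\,\mathbf{x}_0 = \sigma_\epsilon^2\,\mathbf{x}_0^{T}\left(\mathbf{X}^{T}\mathbf{X}\right)^{-1}\mathbf{x}_0,$$
where $\mathbf{X}$ is the data matrix used to fit the model, including the column of ones for the intercept.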
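The standard result that the variance of the point forecast $\mathbf{x}_0^{T}\hat{\mathbf{b}}$ equals $\sigma_\epsilon^2\,\mathbf{x}_0^{T}(\mathbf{X}^{T}\mathbf{X})^{-1}\mathbf{x}_0$ can be checked numerically. The following is only a minimal Monte Carlo sketch: the design matrix, the coefficient values, the error standard deviation, and the number of replications are all made-up assumptions, not data from the text.

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical setup: n observations, q = 2 regressors plus intercept
n, sigma = 25, 0.5
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])  # fixed design matrix
b = np.array([1.0, 2.0, -0.5])       # assumed "true" coefficients
x0 = np.array([1.0, 0.5, -1.0])      # new point, with leading 1 for the intercept

# Theoretical variance of the point forecast x0^T b_hat
theoretical = sigma**2 * (x0 @ np.linalg.solve(X.T @ X, x0))

# Simulate many samples of the response on the same design matrix
n_rep = 20000
eps = rng.normal(scale=sigma, size=(n_rep, n))
Y = X @ b + eps                      # one simulated sample per row

# Least-squares fit for all replications at once (one column per sample)
B_hat = np.linalg.lstsq(X, Y.T, rcond=None)[0]   # shape (3, n_rep)
forecasts = x0 @ B_hat                            # point forecast per replication
empirical = forecasts.var(ddof=1)

print(f"theoretical variance: {theoretical:.5f}")
print(f"empirical variance:   {empirical:.5f}")
```

With a large number of replications, the two printed values should be close, which supports using the quadratic form above as the second term in the variance of the forecast error.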
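Then, if we knew the “true” standard deviation of errors $\sigma_\epsilon$, we could conclude
$$\frac{Y_0 - \hat{Y}_0}{\sigma_\epsilon\sqrt{1 + \mathbf{x}_0^{T}\left(\mathbf{X}^{T}\mathbf{X}\right)^{-1}\mathbf{x}_0}} \sim N(0, 1).$$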
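Plugging in the estimate $\hat{\sigma}_\epsilon$, we obtain a t distribution with n − q − 1 degrees of freedom, and the following prediction interval with confidence level 1 − α:
$$\hat{Y}_0 \pm t_{1-\alpha/2,\,n-q-1}\,\hat{\sigma}_\epsilon\sqrt{1 + \mathbf{x}_0^{T}\left(\mathbf{X}^{T}\mathbf{X}\right)^{-1}\mathbf{x}_0}.$$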
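To make the difference between the naive interval and the one that also accounts for coefficient uncertainty concrete, here is a minimal sketch in Python using NumPy and SciPy. The function name `prediction_interval`, the synthetic data, and the forecast point are illustrative assumptions; the sketch also assumes normal errors and a model with an intercept, so that the degrees of freedom are n − q − 1.

```python
import numpy as np
from scipy import stats


def prediction_interval(X, y, x0, alpha=0.05):
    """Return (point forecast, naive half-width, corrected half-width).

    X  : (n, q) array of regressor values, WITHOUT the column of ones
    y  : (n,) array of observed responses
    x0 : (q,) array of regressor values at which we forecast
    """
    n, q = X.shape
    Xd = np.column_stack([np.ones(n), X])   # data matrix with intercept column
    x0d = np.concatenate([[1.0], x0])       # leading 1 for the intercept

    # Least-squares estimate of the coefficient vector
    b_hat, *_ = np.linalg.lstsq(Xd, y, rcond=None)

    # Estimate of the error standard deviation with n - q - 1 degrees of freedom
    resid = y - Xd @ b_hat
    dof = n - q - 1
    sigma_hat = np.sqrt((resid @ resid) / dof)

    # Quadratic form x0^T (X^T X)^{-1} x0, obtained by solving a linear system
    quad = x0d @ np.linalg.solve(Xd.T @ Xd, x0d)

    t_quant = stats.t.ppf(1 - alpha / 2, dof)
    y0_hat = x0d @ b_hat
    naive = t_quant * sigma_hat                        # error term only
    corrected = t_quant * sigma_hat * np.sqrt(1 + quad)  # adds coefficient uncertainty
    return y0_hat, naive, corrected


# Synthetic data, purely for illustration
rng = np.random.default_rng(0)
X = rng.normal(size=(30, 2))
y = 1.0 + 2.0 * X[:, 0] - 0.5 * X[:, 1] + rng.normal(scale=0.3, size=30)

y0_hat, naive, corrected = prediction_interval(X, y, np.array([0.5, -1.0]))
print(f"point forecast:       {y0_hat:.3f}")
print(f"naive half-width:     {naive:.3f}")
print(f"corrected half-width: {corrected:.3f}")
```

The corrected half-width is always larger than the naive one, since the quadratic form is positive; the gap grows as the forecast point moves away from the region covered by the observed data.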