USING REGRESSION MODELS

Regression models can be used in a variety of ways, but the essential possibilities are

  1. Using a regression model to assess the impact of explanatory variables
  2. Using a regression model to forecast the value of the response variable for given values of the explanatory variables

In the first case, we are actually concerned with the estimate of the slope; the idea is that understanding the phenomenon can lead to knowledge and to better policies. Apparently, there is little difference from the second case since, after all, a good estimate of the parameters should lead to good forecasts. In fact, there are at least a couple of subtle differences between the two mindsets. If the only practical relevance of a regression model is forecasting, we may be quite happy with a model that does not yield any plausible explanation of a phenomenon, but is empirically good at forecasting future values of the response variable. We will see that, in the limit, one can resort to time series models that do not try to explain anything; their only role is forecasting, without any ambition to find explanations. A completely different approach is taken, for instance, by sociologists, since in that case a true understanding of the driving forces behind an observed result is the true added value of a regression model. Furthermore, forecasting when regressors are themselves stochastic may seem a quite tricky business, as the regressors themselves would have to be forecast. For instance, in finance one can build a regression model linking the returns of many stock shares to a few macroeconomic factors, which might be easier to forecast. In this section we deal only with forecasting when either the regressors are nonstochastic and under our control, or the explanatory variable is time.

Apparently, forecasting is quite easy. Given the regression model, just plug in the value x0, and the predicted response is

$$\hat{Y}_0 = a + b x_0$$

Note that this is just a point forecast, i.e., a single value. If we knew the underlying statistical model and its parameters, the response would be a random variable

$$Y_0 = \alpha + \beta x_0 + \epsilon_0$$

Assuming that errors are normally distributed, the response would be normal as well, with

$$Y_0 \sim \mathsf{N}(\alpha + \beta x_0, \sigma^2)$$

This uncertainty is just linked to the variance of the future realization of the error ε0. However, we should take the uncertainty of our estimators into account, too. We should consider Ŷ0 as a random variable, since it is based on the estimators a and b. Since we know how to characterize the uncertainty of our estimates, we might build a confidence interval for the expected response E[Y0]. However, when we consider the realization of the error term in the future, we are dealing with a prediction interval, which means dealing with the distribution of a random variable, rather than the estimation of an unknown parameter, which is a number. Indeed, we may even go as far as building a probability distribution for Y0. Assuming that regressors are nonstochastic and that errors are normally distributed, this leads to a t distribution with n − 2 degrees of freedom, or a normal distribution for a sufficiently large sample.

Since errors are independent, we may just add the variances of Ŷ0 and ε0. The former depends on the variances of a and b, but we should not forget their covariance. So, let us write everything explicitly:

$$\mathrm{Var}(\hat{Y}_0) = \mathrm{Var}(a + b x_0) = \mathrm{Var}(a) + x_0^2\,\mathrm{Var}(b) + 2 x_0\,\mathrm{Cov}(a, b)$$

To streamline notation, let

$$S_{xx} \equiv \sum_{i=1}^{n} (x_i - \bar{x})^2$$

which is a number under our assumptions. Then, recalling the expressions for the variances and covariance of the least-squares estimators,

$$\mathrm{Var}(a) = \sigma^2\left(\frac{1}{n} + \frac{\bar{x}^2}{S_{xx}}\right), \qquad \mathrm{Var}(b) = \frac{\sigma^2}{S_{xx}}, \qquad \mathrm{Cov}(a, b) = -\frac{\sigma^2 \bar{x}}{S_{xx}}$$

and collecting terms, we obtain

$$\mathrm{Var}(\hat{Y}_0) = \sigma^2\left[\frac{1}{n} + \frac{(x_0 - \bar{x})^2}{S_{xx}}\right]$$

Taking into account the variance of the error affecting Y0, we conclude that

$$\mathrm{Var}(Y_0 - \hat{Y}_0) = \sigma^2\left[1 + \frac{1}{n} + \frac{(x_0 - \bar{x})^2}{S_{xx}}\right]$$

The square root of this variance is the standard error of Y0, which we should interpret as a standard error of prediction, rather than a standard error of estimate. As usual, it is important to interpret the result we have obtained. We see that the variance depends on three contributions:

  1. The first term is just σ² and is due to the realization of the future error term.
  2. The second term, σ²/n, depends on the underlying uncertainty in estimating the expected response and is mitigated by increasing the sample size.
  3. The last term is a bit more complicated. The denominator Sxx tells us that the more the observations are spread out, the better; indeed, we know that this spread affects our ability to estimate the slope. The numerator depends on the distance between x0, the value of the explanatory variable for which we predict the response, and the average x̄ in the available dataset. We see that the farther we go from x̄, the less certain we are. Indeed, we cannot extrapolate too much outside the available data. Actually, the situation can be even worse if the underlying phenomenon is nonlinear and the linear regression model is just a suitable approximation over a limited range of the explanatory variable.
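To make the effect of extrapolation concrete, here is a small Python sketch that evaluates the prediction variance σ²[1 + 1/n + (x0 − x̄)²/Sxx] for a hypothetical design of price points and an assumed error variance (illustrative values, not data from the text), and checks that it grows as x0 moves away from x̄:

```python
import numpy as np

# Hypothetical, nonstochastic design points (NOT data from the text).
x = np.array([2.0, 2.4, 2.6, 2.8, 3.2])
n = len(x)
xbar = x.mean()
Sxx = np.sum((x - xbar) ** 2)   # spread of the observations
sigma2 = 1.0                    # assumed (known) error variance

def prediction_variance(x0):
    # Var(Y0 - Y0_hat) = sigma^2 * (1 + 1/n + (x0 - xbar)^2 / Sxx)
    return sigma2 * (1.0 + 1.0 / n + (x0 - xbar) ** 2 / Sxx)

# The variance is smallest at x0 = xbar and grows as we extrapolate.
widths = [prediction_variance(x0) for x0 in (xbar, 3.5, 4.0, 5.0)]
```

Evaluating the function at x̄ gives the minimum value σ²(1 + 1/n); the third contribution then inflates the variance quadratically in the distance x0 − x̄.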

In practice, we must estimate σ² by SER² = SSR/(n − 2), and when computing prediction intervals we should use quantiles from the t distribution with n − 2 degrees of freedom. With large samples, quantiles from the standard normal can be used.
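The whole computation can be sketched in a few lines of Python. The data below are hypothetical (not taken from the text), and the quantile t0.975,3 = 3.1824 is the tabulated value for a sample of size n = 5:

```python
import numpy as np

# Hypothetical data, for illustration only.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])
n = len(x)

xbar, ybar = x.mean(), y.mean()
Sxx = np.sum((x - xbar) ** 2)
b = np.sum((x - xbar) * (y - ybar)) / Sxx    # least-squares slope
a = ybar - b * xbar                          # least-squares intercept

resid = y - (a + b * x)
SER = np.sqrt(np.sum(resid ** 2) / (n - 2))  # standard error of regression

def prediction_interval(x0, t_quantile):
    # Point forecast plus/minus t quantile times standard error of prediction.
    y0_hat = a + b * x0
    se_pred = SER * np.sqrt(1 + 1 / n + (x0 - xbar) ** 2 / Sxx)
    return y0_hat - t_quantile * se_pred, y0_hat + t_quantile * se_pred

# For n = 5, df = n - 2 = 3, and the tabulated quantile is t_{0.975,3} = 3.1824.
lo, hi = prediction_interval(6.0, 3.1824)
```

The interval is centered on the point forecast a + b·x0, and its half-width combines the three variance contributions discussed above.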

Example 10.11 A firm observes the following data correlating sales price and sales volume (measured in thousands of items):

[Table of observed sales prices and corresponding sales volumes not reproduced.]

The firm wants to predict sales if the sales price is raised to €4. Since sales are not completely predictable, the firm would also like a prediction interval with a 95% confidence level to come up with a robust decision.

Least squares yield

[Estimated intercept and slope not reproduced.]

Not surprisingly, the slope is negative, since we expect sales to drop after a price increase. Note, however, that the dataset shows that sales were higher when the price was €2.8 than when it was €2.6; other factors, or just pure randomness, may be at play. Then, a point forecast is easily obtained:

$$\hat{Y}_0 = a + b \cdot 4 = 262.6$$

Thus, we expect to sell something like 263 items. To build a prediction interval, we need to estimate SER, based on the residuals ei = Yi − (a + b xi).

[Table of residuals not reproduced.]

Then

$$\mathrm{SER} = \sqrt{\frac{\mathrm{SSR}}{n-2}} = 487.1$$

We may also check that R2 = 0.857, which looks satisfactory. Applying (10.23), where x̄ = 3.2, we get

[Computation of the standard error of prediction not reproduced.]

Fig. 10.11 Bad regression-based sales forecast in Example 10.11.

Since t0.975,3 = 3.1824, the 95% prediction interval, expressed in items, is:

[Prediction interval bounds not reproduced.]

This is a pretty large amount of uncertainty, but there is something far worse: the prediction interval does not make any sense, as it includes negative values. On second thought, this is not quite surprising, since SER = 487.1 is large with respect to Ŷ0 = 262.6. Note, however, that the large value of R2 may induce a false sense of security. To visualize what is really going wrong, we may have a look at Fig. 10.11. We see that SER is not that large in absolute terms, but it is relatively large when the price goes up, because it affects much smaller sales volumes. Furthermore, since sales cannot be negative, a negatively sloped regression line is a sensible model only within a limited range of prices. When prices are very high, a nonlinear model like the one hinted at by the dashed line in the figure would make more sense.
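The pathology in this example is easy to reproduce. The following Python sketch uses hypothetical price–sales data, chosen to mimic the qualitative pattern of the example (decreasing, slightly convex sales; these are not the numbers from the table above): the linear fit has a high R2, yet the 95% prediction interval at a moderately extrapolated price dips below zero sales:

```python
import numpy as np

# Hypothetical price (EUR) and sales (items) data -- illustrative only,
# NOT the dataset of Example 10.11.
x = np.array([2.0, 2.4, 2.6, 2.8, 3.2])
y = np.array([8000.0, 5500.0, 4200.0, 3300.0, 2200.0])
n = len(x)

xbar, ybar = x.mean(), y.mean()
Sxx = np.sum((x - xbar) ** 2)
b = np.sum((x - xbar) * (y - ybar)) / Sxx   # negative slope, as expected
a = ybar - b * xbar

resid = y - (a + b * x)
SSR = np.sum(resid ** 2)
SST = np.sum((y - ybar) ** 2)
r2 = 1.0 - SSR / SST                        # coefficient of determination
SER = np.sqrt(SSR / (n - 2))                # standard error of regression

x0 = 3.5                                    # extrapolated price
y0_hat = a + b * x0                         # positive point forecast
se_pred = SER * np.sqrt(1 + 1 / n + (x0 - xbar) ** 2 / Sxx)
t_q = 3.1824                                # tabulated t_{0.975,3}
lo, hi = y0_hat - t_q * se_pred, y0_hat + t_q * se_pred
# r2 is high, yet lo is negative: the interval includes impossible sales.
```

A high R2 measures fit over the observed range of prices; it says nothing about the sensibility of the model, or of the prediction interval, outside that range.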
