USING REGRESSION MODELS

Regression models can be used in a variety of ways, but the essential possibilities are

  1. Using a regression model to assess the impact of explanatory variables
  2. Using a regression model to forecast the value of the response variable for given values of the explanatory variables

In the first case, we are actually concerned with the estimate of the slope; the idea is that understanding the phenomenon can lead to knowledge and to better policies. Apparently, there is little difference from the second case since, after all, a good estimate of the parameters should lead to good forecasts. In fact, there are at least a couple of subtle differences between the two mindsets. If the only practical relevance of a regression model is forecasting, we may be quite happy with a model that does not yield any plausible explanation of a phenomenon, but is empirically good at forecasting future values of the response variable. We will see that, in the limit, one can resort to time series models that do not try to explain anything; their only role is forecasting, without any ambition to find explanations. A completely different approach is taken, for instance, by sociologists, since in that case a true understanding of the driving forces behind an observed result is the true added value of a regression model. Furthermore, forecasting when regressors are themselves stochastic may seem a quite tricky business, as the regressors themselves would have to be forecast. For instance, in finance one can build a regression model linking the returns of many stock shares to a few macroeconomic factors, which might be easier to forecast. In this section we deal only with forecasting when either the regressors are nonstochastic and under our control, or the explanatory variable is time.

Apparently, forecasting is quite easy. Given the regression model, just plug in the value x0, and the predicted response is

$$\hat{Y}_0 = a + b x_0$$

Note that this is just a point forecast, i.e., a single value. If we knew the underlying statistical model and its parameters, the response would be a random variable

$$Y_0 = \alpha + \beta x_0 + \epsilon_0$$

Assuming that errors are normally distributed, the response would be normal as well, with

$$Y_0 \sim \mathsf{N}(\alpha + \beta x_0, \sigma^2)$$

This uncertainty is just linked to the variance of the future realization of the error ε0. However, we should take the uncertainty of our estimators into account, too. We should consider Ŷ0 as a random variable, since it is based on the estimators a and b. Since we know how to characterize the uncertainty of our estimates, we might build a confidence interval for the expected response E[Y0]. However, when we consider the realization of the error term in the future, we are dealing with a prediction interval, which means dealing with the distribution of a random variable, rather than the estimation of an unknown parameter, which is a number. Indeed, we may even go as far as building a probability distribution for Y0. Assuming that regressors are nonstochastic and that errors are normally distributed, this leads to a t distribution with n − 2 degrees of freedom, or a normal distribution for a sufficiently large sample.

Since errors are independent, we may just add the variances of Ŷ0 and ε0. The former depends on the variances of a and b, but we should not forget their covariance. So, let us write everything explicitly:

$$\mathrm{Var}(\hat{Y}_0) = \mathrm{Var}(a + b x_0) = \mathrm{Var}(a) + x_0^2\,\mathrm{Var}(b) + 2 x_0\,\mathrm{Cov}(a, b)$$

To streamline notation, let

$$S_{xx} \equiv \sum_{i=1}^{n} (x_i - \bar{x})^2$$

which is a number under our assumptions. Then, recalling the expressions for the variances and covariance of the least-squares estimators,

$$\mathrm{Var}(a) = \sigma^2\left(\frac{1}{n} + \frac{\bar{x}^2}{S_{xx}}\right), \qquad \mathrm{Var}(b) = \frac{\sigma^2}{S_{xx}}, \qquad \mathrm{Cov}(a, b) = -\frac{\sigma^2 \bar{x}}{S_{xx}}$$

and collecting terms, we obtain

$$\mathrm{Var}(\hat{Y}_0) = \sigma^2\left[\frac{1}{n} + \frac{(x_0 - \bar{x})^2}{S_{xx}}\right]$$

Taking into account the variance of the error affecting Y0, we conclude that

$$\mathrm{Var}(Y_0 - \hat{Y}_0) = \sigma^2\left[1 + \frac{1}{n} + \frac{(x_0 - \bar{x})^2}{S_{xx}}\right]$$

The square root of this variance is the standard error of Y0, which we should interpret as a standard error of prediction, rather than a standard error of estimate. As usual, it is important to interpret the result we have obtained. We see that the variance depends on three contributions:

  1. The first term is just σ² and is due to the realization of the future error term.
  2. The second term, σ²/n, depends on the underlying uncertainty in estimating the expected response and is mitigated by increasing the sample size.
  3. The last term is a bit more complicated. The denominator Sxx tells us that the more the observations are spread out, the better; indeed, we know that this spread affects our ability to estimate the slope. The numerator depends on the distance between x0, the value of the explanatory variable for which we predict the response, and the average x̄ in the available dataset. We see that the farther we go from x̄, the less certain we are. Indeed, we cannot extrapolate too much outside the available data. Actually, the situation can be even worse if the underlying phenomenon is nonlinear and the linear regression model is just a suitable approximation over a limited range of the explanatory variable.
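To make the effect of extrapolation concrete, here is a small Python sketch that evaluates the prediction variance σ²[1 + 1/n + (x0 − x̄)²/Sxx] for a hypothetical design of price points and an assumed error variance (illustrative values, not data from the text), and checks that it grows as x0 moves away from x̄:

```python
import numpy as np

# Hypothetical, nonstochastic design points (NOT data from the text).
x = np.array([2.0, 2.4, 2.6, 2.8, 3.2])
n = len(x)
xbar = x.mean()
Sxx = np.sum((x - xbar) ** 2)   # spread of the observations
sigma2 = 1.0                    # assumed (known) error variance

def prediction_variance(x0):
    # Var(Y0 - Y0_hat) = sigma^2 * (1 + 1/n + (x0 - xbar)^2 / Sxx)
    return sigma2 * (1.0 + 1.0 / n + (x0 - xbar) ** 2 / Sxx)

# The variance is smallest at x0 = xbar and grows as we extrapolate.
widths = [prediction_variance(x0) for x0 in (xbar, 3.5, 4.0, 5.0)]
```

Evaluating the function at x̄ gives the minimum value σ²(1 + 1/n); the third contribution then inflates the variance quadratically in the distance x0 − x̄.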

In practice, we must estimate σ² by SER² = SSR/(n − 2), and when computing prediction intervals we should use quantiles from the t distribution with n − 2 degrees of freedom. With large samples, quantiles from the standard normal can be used.
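The whole computation can be sketched in a few lines of Python. The data below are hypothetical (not taken from the text), and the quantile t0.975,3 = 3.1824 is the tabulated value for a sample of size n = 5:

```python
import numpy as np

# Hypothetical data, for illustration only.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])
n = len(x)

xbar, ybar = x.mean(), y.mean()
Sxx = np.sum((x - xbar) ** 2)
b = np.sum((x - xbar) * (y - ybar)) / Sxx    # least-squares slope
a = ybar - b * xbar                          # least-squares intercept

resid = y - (a + b * x)
SER = np.sqrt(np.sum(resid ** 2) / (n - 2))  # standard error of regression

def prediction_interval(x0, t_quantile):
    # Point forecast plus/minus t quantile times standard error of prediction.
    y0_hat = a + b * x0
    se_pred = SER * np.sqrt(1 + 1 / n + (x0 - xbar) ** 2 / Sxx)
    return y0_hat - t_quantile * se_pred, y0_hat + t_quantile * se_pred

# For n = 5, df = n - 2 = 3, and the tabulated quantile is t_{0.975,3} = 3.1824.
lo, hi = prediction_interval(6.0, 3.1824)
```

The interval is centered on the point forecast a + b·x0, and its half-width combines the three variance contributions discussed above.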

Example 10.11 A firm observes the following data correlating sales price and sales volume (measured in thousands of items):

[Table of observed sales prices and corresponding sales volumes not reproduced.]

The firm wants to predict sales if the sales price is raised to €4. Since sales are not completely predictable, the firm would also like a prediction interval with a 95% confidence level to come up with a robust decision.

Least squares yield

[Estimated intercept and slope not reproduced.]

Not surprisingly, the slope is negative, since we expect sales to drop after a price increase. Note, however, that the dataset shows that sales were higher when the price was €2.8 than when it was €2.6; other factors, or just pure randomness, may be at play. Then, a point forecast is easily obtained:

$$\hat{Y}_0 = a + b \cdot 4 = 262.6$$

Thus, we expect to sell something like 263 items. To build a prediction interval, we need to estimate SER, based on the residuals ei = Yi − (a + b xi).

[Table of residuals not reproduced.]

Then

$$\mathrm{SER} = \sqrt{\frac{\mathrm{SSR}}{n-2}} = 487.1$$

We may also check that R2 = 0.857, which looks satisfactory. Applying (10.23), where x̄ = 3.2, we get

[Computation of the standard error of prediction not reproduced.]

Fig. 10.11 Bad regression-based sales forecast in Example 10.11.

Since t0.975,3 = 3.1824, the 95% prediction interval, expressed in items, is:

[Prediction interval bounds not reproduced.]

This is a pretty large amount of uncertainty, but there is something far worse: the prediction interval does not make any sense, as it includes negative values. On second thought, this is not quite surprising, since SER = 487.1 is large with respect to Ŷ0 = 262.6. Note, however, that the large value of R2 may induce a false sense of security. To visualize what is really going wrong, we may have a look at Fig. 10.11. We see that SER is not that large in absolute terms, but it is relatively large when the price goes up, because it affects much smaller sales volumes. Furthermore, since sales cannot be negative, a negatively sloped regression line is a sensible model only within a limited range of prices. When prices are very high, a nonlinear model like the one hinted at by the dashed line in the figure would make more sense.
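The pathology in this example is easy to reproduce. The following Python sketch uses hypothetical price–sales data, chosen to mimic the qualitative pattern of the example (decreasing, slightly convex sales; these are not the numbers from the table above): the linear fit has a high R2, yet the 95% prediction interval at a moderately extrapolated price dips below zero sales:

```python
import numpy as np

# Hypothetical price (EUR) and sales (items) data -- illustrative only,
# NOT the dataset of Example 10.11.
x = np.array([2.0, 2.4, 2.6, 2.8, 3.2])
y = np.array([8000.0, 5500.0, 4200.0, 3300.0, 2200.0])
n = len(x)

xbar, ybar = x.mean(), y.mean()
Sxx = np.sum((x - xbar) ** 2)
b = np.sum((x - xbar) * (y - ybar)) / Sxx   # negative slope, as expected
a = ybar - b * xbar

resid = y - (a + b * x)
SSR = np.sum(resid ** 2)
SST = np.sum((y - ybar) ** 2)
r2 = 1.0 - SSR / SST                        # coefficient of determination
SER = np.sqrt(SSR / (n - 2))                # standard error of regression

x0 = 3.5                                    # extrapolated price
y0_hat = a + b * x0                         # positive point forecast
se_pred = SER * np.sqrt(1 + 1 / n + (x0 - xbar) ** 2 / Sxx)
t_q = 3.1824                                # tabulated t_{0.975,3}
lo, hi = y0_hat - t_q * se_pred, y0_hat + t_q * se_pred
# r2 is high, yet lo is negative: the interval includes impossible sales.
```

A high R2 measures fit over the observed range of prices; it says nothing about the sensibility of the model, or of the prediction interval, outside that range.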
