When selecting variables, there are a few issues and tradeoffs involved.
- We might wish to include variables that are not directly observable, and we have to settle for an observable proxy. If so, we must ensure that the variable we include is an adequate substitute.
- Our choices are also affected by the availability of data. If data collection has already been carried out and there is no possibility of adding variables and observations, we have to live with what we have. If data have still to be gathered, cost issues may arise.
- If observations of many variables are available, it may not be clear which subset of regressors offers the best explanatory or predictive power. Model building procedures have been proposed to come up with the best subset of variables, such as forward selection (a stepwise procedure in which one regressor is added at a time) and backward elimination (a stepwise procedure in which one variable is omitted at a time).
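To make the stepwise procedures concrete, here is a minimal sketch of forward selection in Python, using adjusted R2 as the selection criterion; the criterion, the stopping rule, and the synthetic data are illustrative assumptions rather than the only possible choices. Backward elimination would work analogously, starting from the full model and dropping one variable at a time.

```python
import numpy as np
import statsmodels.api as sm

def forward_selection(X, y, names):
    """Greedy forward selection: at each step add the regressor that most
    improves adjusted R^2; stop when no addition improves it further."""
    selected, remaining = [], list(range(X.shape[1]))
    best_adj_r2 = -np.inf
    while remaining:
        # Score every candidate model obtained by adding one more regressor
        scores = [(sm.OLS(y, sm.add_constant(X[:, selected + [j]])).fit().rsquared_adj, j)
                  for j in remaining]
        adj_r2, j_best = max(scores)
        if adj_r2 <= best_adj_r2:          # no improvement: stop
            break
        best_adj_r2 = adj_r2
        selected.append(j_best)
        remaining.remove(j_best)
    return [names[j] for j in selected], best_adj_r2

# Tiny synthetic illustration (made-up data, unrelated to Table 16.1):
# only x1 and x3 actually enter the data-generating process.
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 4))
y = 1.0 + 2.0 * X[:, 0] - 1.5 * X[:, 2] + rng.normal(scale=1.0, size=50)
print(forward_selection(X, y, ["x1", "x2", "x3", "x4"]))
```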
At first sight, it would seem that we should aim at finding the model with the largest R2 coefficient and that, arguably, the more variables we include, the better the model we obtain. However, the following examples show that subtle difficulties may be encountered.
Table 16.1 Sample data illustrating issues in regressor selection.
Example 16.1 (Omitted variables and bias) Let us consider the sample data in Table 16.1. If we regress a model specified as
Y = β0 + β1x1 + β2x2 + ε,
we obtain the following estimates, including a 95% confidence interval for each parameter:
The R2 coefficient is 0.8424. Now, let us repeat the application of least squares, omitting the variable x2, i.e., fitting the reduced model Y = β0 + β1x1 + ε.
The R2 coefficient is now 0.6975. We see that by omitting a variable we reduce R2, which may not be so surprising. We also see that the estimate of β1 changes considerably: it increases. Actually, the data have been generated by Monte Carlo sampling from a known linear model involving both regressors. The standard deviation of the error term was fairly large, and the values of the regressors x1 and x2 have been obtained by sampling a multivariate normal distribution with correlation ρ = 0.6 between the two components.
When the two regressors are used, R2 is less than ideal because of the limited sample size and the large variability of the errors. When we omit the second explanatory variable, we lose explanatory power, but there is a subtler effect: the estimate of β1 is biased. In fact, since the two regressors are positively correlated, the coefficient of x1 in the second regression increases, because part of the effect of x2 is attributed to x1. This is an example of omitted-variable bias.
If the correlation is negative, we get a lower estimate. For instance, repeating the experiment with ρ = −0.6, we obtain R2 = 0.5002 when regressing against both variables, whereas if we omit x2 we find R2 = 0.0406. In the latter case we see a dramatic reduction of R2, as well as a negative bias in the estimate of β1.
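The mechanism at work in Example 16.1 can be reproduced with a short Monte Carlo sketch like the one below. The intercept, slopes, error standard deviation, and regressor means are made-up values, since the parameters used to generate Table 16.1 are not reproduced here; only the qualitative effect matters, i.e., an upward bias of the estimate of β1 when the correlation is positive and a downward bias when it is negative.

```python
import numpy as np
import statsmodels.api as sm

def one_run(rho, n=30, beta=(10.0, 2.0, 3.0), sigma=5.0, seed=0):
    """Simulate one sample with correlated regressors and fit the full
    model (y ~ x1 + x2) and the reduced model (y ~ x1 only)."""
    rng = np.random.default_rng(seed)
    cov = [[1.0, rho], [rho, 1.0]]
    X = rng.multivariate_normal([10.0, 10.0], cov, size=n)      # correlated x1, x2
    y = beta[0] + beta[1] * X[:, 0] + beta[2] * X[:, 1] + rng.normal(0.0, sigma, n)

    full = sm.OLS(y, sm.add_constant(X)).fit()
    reduced = sm.OLS(y, sm.add_constant(X[:, [0]])).fit()
    return full.params[1], reduced.params[1], full.rsquared, reduced.rsquared

for rho in (0.6, -0.6):
    b1_full, b1_red, r2_full, r2_red = one_run(rho)
    print(f"rho={rho:+.1f}: b1 (full)={b1_full:.2f}, b1 (x2 omitted)={b1_red:.2f}, "
          f"R2 (full)={r2_full:.2f}, R2 (x2 omitted)={r2_red:.2f}")
```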
The numbers involved in the example above should be considered with due care, as we are reporting results obtained with a single Monte Carlo sample, and no general conclusion can be drawn. What is really important is that omitting a variable may affect the estimates of the parameters associated with the other variables. One might then argue that it is better to stay on the safe side and include as many variables as we can. The following example shows that this is not the case.
Table 16.2 Parameter estimates and 95% confidence intervals for five samples, illustrating the effect of collinearity.
Example 16.2 (Multicollinearity and instability) Let us repeat the experiment of Example 16.1, but this time with ρ = 0.98. This means that we are regressing against two strongly correlated regressors. Rather than running a single regression, we use Monte Carlo sampling to generate n = 30 observations, on which we apply OLS, and the procedure is repeated five times. In Table 16.2 we show the estimates of the three parameters, along with their 95% confidence intervals, for each sample. The results are quite striking. If we compare the estimates across samples, we notice a lot of variability: for instance, the coefficient b1 is negative in sample 4 and positive in sample 5. Furthermore, even looking at a single sample, we are often unsure of the sign of an effect. It is tempting to attribute all of this to the limited number of observations (30) and to the large variability of the underlying errors; however, R2 is not too bad in any of the five samples. In fact, the problem is the high correlation between the regressors, an issue known as multicollinearity.
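A rough sketch of the experiment is reported below: five independent samples with ρ = 0.98, each fitted by OLS. The true parameters are again invented for illustration, since the values behind Table 16.2 are not shown, but the instability of the estimates, including occasional sign flips, is easy to reproduce.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(42)
n, rho = 30, 0.98
beta0, beta1, beta2, sigma = 10.0, 2.0, 3.0, 5.0     # illustrative true values

for sample in range(1, 6):
    X = rng.multivariate_normal([10.0, 10.0], [[1.0, rho], [rho, 1.0]], size=n)
    y = beta0 + beta1 * X[:, 0] + beta2 * X[:, 1] + rng.normal(0.0, sigma, n)
    res = sm.OLS(y, sm.add_constant(X)).fit()
    ci = res.conf_int(alpha=0.05)          # rows: const, b1, b2; columns: lower, upper
    print(f"sample {sample}: b1={res.params[1]:+.2f} in [{ci[1, 0]:.2f}, {ci[1, 1]:.2f}], "
          f"b2={res.params[2]:+.2f} in [{ci[2, 0]:.2f}, {ci[2, 1]:.2f}], "
          f"R2={res.rsquared:.2f}")
```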
To better understand the difficulty revealed by Example 16.2, we should reflect on the meaning of the parameter β1. This is supposed to be the increment in the expected value of Y when x1 is incremented by 1 and x2 is kept fixed. However, this makes little sense here, because x1 and x2 are strongly correlated: we cannot explore the effect of a variation in one variable while keeping the other one fixed. Furthermore, because of this strong link, the data matrix X has two columns that are almost, although not exactly, linearly dependent. As a result, the matrix X^T X is not exactly singular, but it is nearly so, and solving the associated system of linear equations is delicate, since a small change in the inputs results in a large change in the solution. In numerical analysis, such a problem is said to be ill-conditioned, and it gives rise to numerical instability.
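The numerical side of the issue can be checked directly by computing the condition number of X^T X for increasingly correlated regressors; the distributions used in the sketch below are arbitrary assumptions, and only the qualitative behavior matters.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 30
for rho in (0.0, 0.6, 0.98, 0.999):
    X = rng.multivariate_normal([0.0, 0.0], [[1.0, rho], [rho, 1.0]], size=n)
    X = np.column_stack([np.ones(n), X])           # prepend the intercept column
    print(f"rho={rho:5.3f}  cond(X^T X) = {np.linalg.cond(X.T @ X):12.1f}")
```

A large condition number means that small perturbations of the data can produce large changes in the least-squares solution, which is exactly the instability observed in Example 16.2.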
So, we see that we should include neither too few nor too many explanatory variables. Finding the right model requires a bit of experimentation, as well as some understanding of the underlying phenomenon. As the following example illustrates, we cannot trust a linear regression model as a black box.
Example 16.3 (Lurking variables) Consider a regression model used to study the impact of car features on mileage, i.e., miles per gallon or kilometers per liter of petrol. There are many characteristics of a car that can explain mileage; let us choose the turn circle, measured in feet or meters. If you regress mileage on turn circle, you will find a positive coefficient and a statistically significant model. But is this a sensible model? No one would really think that turn circle determines mileage, so why is the model statistically significant? Actually, turn circle is positively correlated with vehicle length, which in turn is positively correlated with vehicle weight. Arguably, weight is the main determinant of mileage, and turn circle merely acts as its surrogate in the model. If you include weight and the other characteristics, the coefficient of turn circle will not be statistically significant.
Weight, in this case, is the lurking variable. As another example, imagine regressing sales against expenditure on advertising. Any marketer will be delighted by a regression model showing the positive impact of such expenditure on sales. But imagine that the advertisements are part of a more general campaign in which prices are also reduced. How can you be sure that the real driver of sales is not just the price reduction?
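To see the mechanism at work, consider the following hypothetical simulation of the advertising example (all names and numbers are invented): sales depend only on a price discount, advertising expenditure tracks the discount, and yet a regression of sales on advertising alone appears highly significant.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(7)
n = 200
discount = rng.uniform(0.0, 10.0, n)                        # the lurking variable
adv = 5.0 + 2.0 * discount + rng.normal(0.0, 2.0, n)        # advertising tracks the discount
sales = 100.0 + 8.0 * discount + rng.normal(0.0, 10.0, n)   # sales driven by the discount only

only_adv = sm.OLS(sales, sm.add_constant(adv)).fit()
both = sm.OLS(sales, sm.add_constant(np.column_stack([adv, discount]))).fit()

print("sales ~ adv             p-value of adv:", only_adv.pvalues[1])  # looks significant
print("sales ~ adv + discount  p-value of adv:", both.pvalues[1])      # no longer is
```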
These examples show that the interpretation of a regression model cannot rely on statistical concepts alone. Regression models may be able to detect association, but not causation. A statistically significant model need not be significant from a business perspective, and domain-specific knowledge is needed to correctly analyze model results.