THE NEED FOR A STATISTICAL FRAMEWORK

So far, we have regarded linear regression as a numerical problem, which is fairly easy to solve by least squares. However, this is a rather limited view. To begin with, it would be useful to gain deeper insight into the meaning of the regression coefficients, in particular the slope b. The careful reader might have noticed that the numerator in formula (10.4) looks suspiciously like a covariance between the variables x and y, whereas the denominator suggests the variance of x. This interpretation must be handled with care since, so far, we have considered the variables x and y as numbers, not as random variables. Still, let us pursue this line of intuition by manipulating the expression for the slope a bit. In the following, we will use the rather obvious identities

\[
\sum_{i=1}^{n} x_i = n\bar{x}, \qquad \sum_{i=1}^{n} y_i = n\bar{y}, \qquad \sum_{i=1}^{n} (x_i - \bar{x}) = \sum_{i=1}^{n} (y_i - \bar{y}) = 0
\]

Equation (10.4) can be rewritten as

\[
b = \frac{\sum_{i=1}^{n} (x_i - \bar{x})\, y_i}{\sum_{i=1}^{n} (x_i - \bar{x})\, x_i}
\]

In this ratio, using the identities above, we may subtract zero from both numerator and denominator and rearrange, which yields

\[
b = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n} (x_i - \bar{x})^2}
\]

This is yet another way of writing the slope,5 and by dividing numerator and denominator by n − 1, we get an expression in terms of sample statistics:

\[
b = \frac{S_{xy}}{S_x^2} = \frac{r_{xy}\, S_x S_y}{S_x^2} = r_{xy}\,\frac{S_y}{S_x},
\qquad \text{where } S_{xy} = \frac{1}{n-1}\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y}), \quad
S_x^2 = \frac{1}{n-1}\sum_{i=1}^{n} (x_i - \bar{x})^2, \quad
r_{xy} = \frac{S_{xy}}{S_x S_y}
\]

Formally, the quantities $S_x$, $S_y$, $S_{xy}$, and $r_{xy}$ are similar to sample standard deviations, covariances, and correlation coefficients; even though the interpretation is debatable, since we are dealing with numbers rather than random samples, the notation is quite handy and tells us a lot. The slope coefficient is closely related to the sample correlation between x and y. Indeed, we know that correlation captures the linear association between random variables. Positive correlation results in an upward-sloping regression line, whereas negative correlation results in a downward-sloping line. The exact slope also depends on the standard deviations of x and y, but these are somewhat arbitrary, as they are influenced by the units in which we measure x and y.
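To make the identity concrete, here is a minimal Python sketch (the data are made up purely for illustration) that computes the slope by least squares and then again through the sample covariance and the sample correlation:

```python
import numpy as np

# Illustrative data (made-up values, not taken from the book's examples)
x = np.array([2.0, 4.0, 5.0, 7.0, 9.0, 11.0, 12.0])
y = np.array([3.1, 5.9, 7.2, 9.8, 12.5, 15.1, 16.4])

n = len(x)

# Slope and intercept by ordinary least squares
b_ls, a_ls = np.polyfit(x, y, 1)

# Same slope via sample covariance and variance: b = S_xy / S_x^2
S_xy = np.sum((x - x.mean()) * (y - y.mean())) / (n - 1)
S_x2 = np.sum((x - x.mean()) ** 2) / (n - 1)
b_cov = S_xy / S_x2

# ... and via the sample correlation: b = r_xy * S_y / S_x
r_xy = np.corrcoef(x, y)[0, 1]
b_corr = r_xy * y.std(ddof=1) / x.std(ddof=1)

print(b_ls, b_cov, b_corr)  # the three values coincide (up to rounding)
```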

The interpretation of the slope in terms of correlation should be properly cast within a statistical framework. Indeed, a regression model relies on observed data that we can associate with a sampling mechanism, provided that we postulate a data-generating process, i.e., an underlying stochastic model that we observe from the outside through a sample of noisy observations. Within this framework, the regression coefficients are estimates of the unknown parameters of the data-generating process. If we rely on concepts from inferential statistics and analyze linear regression as an estimation problem, we are able to

  • Quantify uncertainty about estimates, i.e., calculate confidence intervals for the underlying parameters
  • Test hypotheses about the effect of explanatory variables
  • Quantify uncertainty about predictions, i.e., calculate prediction intervals in order to come up with robust decisions (see the code sketches below)

Fig. 10.2 A different dataset yielding the same regression line as in Fig. 10.1.
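As a rough illustration of the first two points, the following Python sketch (using the statsmodels library, with made-up data unrelated to Fig. 10.1 or 10.2) fits a simple linear regression and reports 95% confidence intervals for the intercept and slope, together with the p-values for testing the null hypotheses α = 0 and β = 0. Prediction intervals are sketched after Example 10.2 below.

```python
import numpy as np
import statsmodels.api as sm

# Made-up data for illustration (not the datasets of Fig. 10.1 or 10.2)
x = np.array([2.0, 4.0, 5.0, 7.0, 9.0, 11.0, 12.0])
y = np.array([3.1, 5.9, 7.2, 9.8, 12.5, 15.1, 16.4])

X = sm.add_constant(x)           # design matrix with an intercept column
res = sm.OLS(y, X).fit()         # ordinary least squares fit

print(res.params)                # estimates a and b of alpha and beta
print(res.conf_int(alpha=0.05))  # 95% confidence intervals for alpha and beta
print(res.pvalues)               # p-values for H0: alpha = 0 and H0: beta = 0
```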

An example will illustrate why all of the above is so important.

Example 10.2 Consider the dataset tabulated in Fig. 10.2 and plotted there as filled diamonds; the figure also includes the dataset of Fig. 10.1, represented by empty circles. The two datasets look quite different, yet we invite the reader to check that the two regression lines are the same (disregarding slight numerical discrepancies). Now imagine that we want to predict the response when the explanatory variable is set to $x_{01} = 13$ or $x_{02} = 17$. The prediction, based on the two identical linear regression models, would be just the same: we simply plug these values of x into the regression line (10.5) to obtain

\[
\hat{Y}_{01} = a + b\,x_{01}, \qquad \hat{Y}_{02} = a + b\,x_{02}
\]

This is straightforward, but in which case should we trust the linear regression model more? Intuition suggests that the second dataset is affected by a larger level of uncertainty, and that the resulting predictions should be taken with due care. Indeed, for the first dataset the sum of squared residuals for the best-fit line is $\mathrm{SSR}_1 = 53.1495$, whereas in the second case it is $\mathrm{SSR}_2 = 1000$. Furthermore, the value $x_{01} = 13$ lies within the range of the observed sample on which the fitting is based, whereas $x_{02} = 17$ does not. Intuition again suggests that if we are extrapolating outside the range of observed values, we should be more uncertain about our predictions, but the least-squares method per se does not provide us with tools to address these issues.
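The following Python sketch (statsmodels again; the two samples are artificially constructed and are not the datasets of Figs. 10.1 and 10.2) mimics the situation of Example 10.2: both samples yield exactly the same fitted line, but one has a much larger sum of squared residuals. The 95% prediction intervals widen both with the residual variability and when we move from x = 13, inside the observed range, to x = 17, outside it.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(42)

# Made-up explanatory values; 13 lies inside the observed range, 17 does not
x = np.arange(2.0, 15.0)              # 2, 3, ..., 14
X = sm.add_constant(x)

line = 1.0 + 0.8 * x                  # a common "true" line, chosen arbitrarily

# Build noise orthogonal to the regression space, so both samples
# yield exactly the same fitted line, only with different spread
raw = rng.normal(size=x.size)
hat = X @ np.linalg.pinv(X)           # projection onto span{1, x}
noise = raw - hat @ raw               # component orthogonal to the fit
noise /= np.std(noise)

y_small = line + 1.0 * noise          # small residual variability
y_large = line + 5.0 * noise          # large residual variability

for y in (y_small, y_large):
    res = sm.OLS(y, X).fit()
    pred = res.get_prediction(sm.add_constant(np.array([13.0, 17.0])))
    frame = pred.summary_frame(alpha=0.05)
    print("coefficients:", res.params)            # identical for both samples
    print("SSR:", res.ssr)                        # very different for the two
    print(frame[["mean", "obs_ci_lower", "obs_ci_upper"]])
```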

The first step in building a statistical framework is a clear statement of the assumptions about the underlying data-generating process, i.e., the process by which observations of the response variable are generated for given values of the explanatory variable. We have two possibilities:

  1. If the explanatory variable is a number x, we speak of a nonstochastic regressor and may write a linear data-generating model as
     \[
     Y_i = \alpha + \beta x_i + \epsilon_i, \qquad i = 1, \ldots, n \tag{10.8}
     \]
  2. If the explanatory variable is a random variable X, we speak of a stochastic regressor and may write
     \[
     Y_i = \alpha + \beta X_i + \epsilon_i, \qquad i = 1, \ldots, n \tag{10.9}
     \]
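A small simulation may help fix ideas about the nonstochastic-regressor model (10.8): the values x_i are fixed numbers, and only the errors change from sample to sample. The parameter values and the normal error distribution below are arbitrary choices made purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

alpha, beta, sigma = 2.0, 0.5, 1.0    # arbitrary "true" parameters
x = np.linspace(1.0, 10.0, 20)        # fixed (nonstochastic) regressor values

# Two independent samples generated by the same data-generating process:
# the x values are identical, only the random errors differ.
Y_sample1 = alpha + beta * x + rng.normal(0.0, sigma, size=x.size)
Y_sample2 = alpha + beta * x + rng.normal(0.0, sigma, size=x.size)

print(Y_sample1[:3], Y_sample2[:3])   # same x, different noisy responses
```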

Even if the regressor is nonstochastic, the response $Y_i$ is a random variable because of the terms $\epsilon_i$, which are random variables and are called errors. Even though errors are modeled as random variables, they need not be truly random in principle. Errors may represent genuinely random factors, but they may also stand for other variables that would help explain the response yet cannot be observed directly. They can also represent modeling errors, i.e., deviations from a simple linear functional form. If so, they have more to do with our ignorance and modeling mistakes than with true randomness.6 We will just put everything under one roof, and random errors will play the role of a catchall. However, this does not mean that anything goes: We do model errors as random variables, but they are subject to very precise conditions for our modeling mechanism to work. It is essential to validate a regression model a posteriori by checking whether there is any blatant violation of these assumptions.

The difference between nonstochastic and stochastic regressors has implications for how the data-generating model should be characterized. Before stating the required formal assumptions, we show with a few examples that the difference is substantial, not merely academic.

Example 10.3 Imagine that we are the monopolistic producer of a good and we want to set its selling price. The price will arguably affect demand, and we could conceive of a linear demand model such as

\[
D = \alpha + \beta p + \epsilon,
\]

where D is demand and p is price. In this market setting, price is a managerial lever under our complete control and, of course, it can be perfectly observed, without any measurement uncertainty. Hence, price in this case is a nonstochastic regressor and there is a clear causal relationship between price and demand. We could try to estimate the demand function in order to find the optimal price to maximize profit.7
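As a sketch of how an estimated demand function might be used, suppose (purely for illustration; the symbols below are not taken from the book) that the fitted relationship is $\hat{D}(p) = a + b p$ with $b < 0$, and that the unit production cost is a constant c. The expected profit and the price maximizing it are then

\[
\pi(p) = (p - c)\,(a + b p), \qquad \frac{d\pi}{dp} = a + 2bp - bc = 0 \;\Rightarrow\; p^* = \frac{c}{2} - \frac{a}{2b},
\]

which is indeed a maximum, since the second derivative 2b is negative.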

Example 10.4 Now consider a model relating the random return $R_k$ that we may earn from investing in stock share k to the random return $R_m$ from the market portfolio as a whole. The return from the market portfolio is typically proxied by a broad market index such as the S&P 500. If we regress $R_k$ on $R_m$, namely

\[
R_k = \alpha + \beta R_m + \epsilon,
\]

we build a model with a stochastic regressor, as we can hardly claim control over the market. Furthermore, this case also illustrates, once again, an important point: Linear regression models, like correlation analysis, are about association, not causation. Unlike the monopolistic producer setting a price, here it is hard to say which variable is the cause and which one is the effect. On the one hand, we could say that the general market mood does affect the return on asset k; on the other hand, asset k is itself part of the market portfolio and contributes to the return $R_m$. A regression model like this has many uses8 and can be exploited to assess the systematic risk of an asset, i.e., the risk component that cannot be diversified away by holding a widely diversified portfolio.
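As a minimal sketch of how such a beta might be estimated (the return series below are simulated stand-ins, not real market data; in practice one would regress observed stock returns on observed index returns):

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(7)

# Simulated monthly returns: stand-ins for the market index and for stock k
r_market = rng.normal(0.01, 0.04, size=60)
r_stock = 0.002 + 1.3 * r_market + rng.normal(0.0, 0.02, size=60)

X = sm.add_constant(r_market)
res = sm.OLS(r_stock, X).fit()

alpha_hat, beta_hat = res.params
print(alpha_hat, beta_hat)   # beta_hat measures the stock's systematic risk
```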

Example 10.5 Stochastic and nonstochastic regressors may coexist in the same model, when there are multiple explanatory variables. Consider a model predicting the rating or the share obtained by broadcasting a movie on a TV network. The rating depends on many factors, including characteristics of the movie itself, such as its genre and the number of star actors and actresses featured in it, as well as the broadcast slot, i.e., the time, day, and month of the broadcast. These are variables that a TV network may control, because it selects which movie to show and when. However, the rating also depends on what competitor networks show at the same time, i.e., their counterprogramming decisions. Clearly, the level of competition is not under our control,9 and should be considered a stochastic regressor.

Whether regressors are stochastic or not, $Y_i$ is a random variable, since errors make our observations noisy, and α and β in the underlying data-generating process are unknown parameters that we try to estimate. What we actually observe should be written, in the case of nonstochastic regressors, as

\[
Y_i = a + b\,x_i + e_i, \qquad i = 1, \ldots, n,
\]

where the regression coefficients a and b are found by least squares and the residuals $e_i$ are evaluated by comparing predicted and observed values of the response variable. Note the difference between this expression and (10.8). We use Roman letters $(a, b, e_i)$ when referring to coefficients and residuals, which are estimates and observed variables, respectively; we use Greek letters $(\alpha, \beta, \epsilon_i)$ when referring to underlying quantities that we cannot observe directly. Even if the regressors are nonstochastic, the regression coefficients a and b are random variables,10 since they depend on the random variables $Y_i$ through Eqs. (10.1) and (10.3). They are estimators of the unknown parameters α and β. It is very useful to notice that this conceptual framework is exactly the same one we used when estimating the expected value μ, an unknown number, by the sample mean $\bar{X}$, an observable random variable. We should also notice that the errors $\epsilon_i$ are not directly observable, since we do not know the true parameters. We must settle for a proxy of the errors, the residuals $e_i$.
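A small Monte Carlo sketch (with arbitrary parameter values) makes the point concrete: the true slope β is a fixed number, while the least-squares estimate b changes from sample to sample, because each sample comes with different errors.

```python
import numpy as np

rng = np.random.default_rng(1)

alpha, beta, sigma = 2.0, 0.5, 1.0       # "true" parameters, chosen arbitrarily
x = np.linspace(1.0, 10.0, 20)           # fixed regressor values

slopes = []
for _ in range(1000):
    # Each replication draws new errors, hence new responses and new estimates
    Y = alpha + beta * x + rng.normal(0.0, sigma, size=x.size)
    b, a = np.polyfit(x, Y, 1)
    slopes.append(b)

slopes = np.array(slopes)
print(slopes.mean(), slopes.std())       # mean close to beta = 0.5
```

The average of the estimated slopes is close to β, a hint of the unbiasedness discussed below, while their spread quantifies the sampling variability of the estimator.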

Least squares provides us with estimators of the parameters in the data-generating processes (10.8) or (10.9). Whenever we estimate parameters, we would like to have unbiased estimators with as little variance as possible. In order to investigate the properties of the least-squares estimators, we need a precise statement of the conditions on regressor variables and errors. In the next sections we treat in fair detail the case of nonstochastic regressors, which is easier to deal with. We will outline the case of stochastic regressors later, in Section 10.5. It turns out that the fundamental results for stochastic regressors are not that different, but the underlying assumptions must be expressed in a more complicated way.

The validity of the regression model also depends on our assumptions about the errors $\epsilon_i$. Very roughly speaking, it is generally assumed that they are independent of each other and that their expected value is zero. Independence among the errors simplifies results considerably and ensures that we are not missing something systematic in our model; if their expected value were not zero, we could simply include it in the intercept α. A fundamental issue is whether the errors are somehow related to the value of $x_i$ or $X_i$:

  • We speak of a homoskedastic case if the variance of the errors is the same for each observation and, in particular, does not depend on the value of the regressor variable: $\mathrm{Var}(\epsilon_i) = \sigma_\epsilon^2$, for all $i = 1, \ldots, n$.
  • Otherwise, we have a heteroskedastic case.

Strictly speaking, the estimation approach we are describing, where errors are i.i.d. variables, should be referred to as ordinary least squares (OLS); the slope and intercept given by (10.1) and (10.3) are OLS estimators. Variations on the theme are used to deal with violations of the assumptions above.
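The following sketch (with arbitrary parameter choices) generates one homoskedastic and one heteroskedastic sample from the same line; in the second sample the error standard deviation grows with x, which is exactly the kind of violation one looks for when validating a model a posteriori.

```python
import numpy as np

rng = np.random.default_rng(3)

alpha, beta = 2.0, 0.5                   # arbitrary "true" parameters
x = np.linspace(1.0, 10.0, 50)

# Homoskedastic errors: constant standard deviation
y_homo = alpha + beta * x + rng.normal(0.0, 1.0, size=x.size)

# Heteroskedastic errors: standard deviation proportional to x
y_hetero = alpha + beta * x + rng.normal(0.0, 0.3 * x)

# A rough diagnostic: compare residual spread for small vs. large x
for y in (y_homo, y_hetero):
    b, a = np.polyfit(x, y, 1)
    resid = y - (a + b * x)
    print(resid[:25].std(), resid[25:].std())
```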

Finally, in order to derive more specific results on confidence intervals and on testing hypotheses about the underlying parameters, we may need further assumptions concerning the exact distribution of the errors. If we assume normality, rather simple results can be obtained; alternatively, the central limit theorem can be invoked to prove asymptotic normality of the estimators, under suitable technical conditions. None of these assumptions should be taken for granted, and they should be carefully checked in order to assess the validity of the regression model: whether the least-squares estimators are biased or not, as well as how efficient they are, depends on them.

