We take advantage of all the probabilistic and statistical knowledge we have built so far to enter the realm of empirical model building. Models come in many forms, but our aim here is to find a relationship between two variables, say, x and y, based on a set of n joint observations (xi, yi), i = 1,…,n. We got acquainted with correlation, and if two variables are correlated, we can try to put such knowledge to good use for decision making and forecasting. The first step in building a model is choosing a functional form representing the link between the variables of interest. The simplest relationship that comes to mind is linear:

y = a + bx
This is called a simple linear regression model. It is obviously linear, but one could and should wonder whether a more complicated, nonlinear functional form is better suited to our task. It is simple since there is only one variable x that we use to “explain” the variable y; multiple linear regression models rely on possibly several explanatory variables. We cover these more advanced models later, since they rely on considerably more challenging technical machinery. Yet, even the innocent-looking simple linear regression model hides a lot of issues, which are best understood in a simple setting. A deep understanding of these issues is needed to tackle nonlinear and multiple regression models.
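As a concrete preview of what follows, the sketch below fits the coefficients a and b by ordinary least squares on a small, purely hypothetical dataset. The closed-form formulas in the comments are the standard least-squares estimates developed in Section 10.1; numpy is assumed to be available.

```python
# Minimal sketch: fitting y = a + b*x by ordinary least squares.
# The data are purely hypothetical and serve only as an illustration.
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])   # explanatory variable (regressor)
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])   # response variable

# Closed-form least-squares estimates (anticipating Section 10.1):
#   b = sum((x - xbar)*(y - ybar)) / sum((x - xbar)^2),  a = ybar - b*xbar
x_bar, y_bar = x.mean(), y.mean()
b = np.sum((x - x_bar) * (y - y_bar)) / np.sum((x - x_bar) ** 2)
a = y_bar - b * x_bar

print(f"estimated intercept a = {a:.3f}, estimated slope b = {b:.3f}")
```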
It is tempting to interpret x as an input or a cause, and y as an output or an effect. Granted, there are many practical cases in which this interpretation does make sense, but we should never forget that a regression model relies on association, not causation. The same caveats that we pointed out when dealing with correlation apply to regression models. By the same token, when referring to functions it is customary to call x the independent variable and y the dependent variable; obviously, these terms can be a bit misleading in a statistical framework. In the following, we refer to x as the explanatory variable or regressor; y is called the response or regressed variable.
To build and use a linear regression model, we must accomplish the following steps:
- We must devise a suitable way to choose the coefficients a and b.
- We should check if the model makes sense and is reliable enough.
- We should use the model by
- Building knowledge to understand a phenomenon
- Generating forecasts and scenarios for decision making under uncertainty (see the sketch after this list)
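To make the third step tangible, the following sketch (again on purely hypothetical data, with numpy assumed available) uses a fitted simple linear regression to produce a point forecast for a new value of x and generates a few crude scenarios by perturbing the forecast with residual-based noise. This is only an illustration of the idea, not the full treatment given in Section 10.4.

```python
# Minimal sketch: using a fitted simple linear regression for forecasting
# and crude scenario generation. Data and x_new are purely hypothetical.
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])
b, a = np.polyfit(x, y, deg=1)        # least-squares slope and intercept

x_new = 6.0
y_forecast = a + b * x_new            # point forecast of the response
print(f"point forecast of y at x = {x_new}: {y_forecast:.2f}")

# Crude scenarios: perturb the forecast with noise whose standard deviation
# is estimated from the residuals (this assumes roughly i.i.d. errors).
sigma_hat = (y - (a + b * x)).std(ddof=2)   # ddof=2: two estimated coefficients
rng = np.random.default_rng(seed=42)
scenarios = y_forecast + sigma_hat * rng.standard_normal(5)
print("five hypothetical scenarios:", np.round(scenarios, 2))
```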
We accomplish the first step in Section 10.1, where we lay down the foundations of the least-squares method. Section 10.2 deals with the second step, which requires building a statistical framework for linear regression. This is needed to state precise assumptions behind our modeling endeavor, which should be thoroughly checked before using the model; we also need to draw statistical inferences and test hypotheses about the estimated coefficients in the model. We do so in Section 10.3, for the simpler case of a nonstochastic regressor, i.e., when the explanatory variable x is treated as a number rather than a random variable. Then, in Section 10.4, we tackle the third step. There are different uses of a linear regression model, and of statistics in general. We might be interested in understanding a physical or social phenomenon; in such a case the model is used for knowledge discovery purposes and to ascertain the impact of explanatory variables. In a business management setting, we are more likely to be interested in using the model to generate forecasts and scenarios as an input to a decision-making procedure. However, we should not undervalue model building per se: Gathering data and identifying a model is often a good way to reach a common understanding of a multifaceted problem, possibly involving different members of an organization with a limited view of the overall process. Finally, in Section 10.5 we relax a few limiting assumptions made in Section 10.3 and outline extensions such as the weighted least-squares method; in Section 10.6 we take a look at the links between linear algebra and linear regression. The last two sections can be safely omitted by readers who just need a basic understanding of linear regression.
Fig. 10.1 Data for linear regression.