LEAST-SQUARES METHOD

Consider the data tabulated and depicted in Fig. 10.1. These joint observations are displayed as circles, and a look at the plot suggests the possibility of finding a linear relationship between x and y. A linear law relating the two variables, such as

$$ y = a + b x $$

can be exploited to understand what drives a social or physical phenomenon, and is one way of putting the correlation between data to use. The sample correlation coefficient for these data is 0.9623 and, even though the sample is a toy one, the correlation is significant at the 1% level according to the test of Section 9.4.5 (the p-value is 0.0021). Intuitively, the sign of the slope b should be the same as the sign of the correlation. If we relate advertising expenditure and revenue, we would expect a positive correlation; if we relate price and demand, we would expect a negative correlation. Before we go on, it is important to stress two points:

  • A linear relationship need not be the best model. We know that correlation picks up only linear associations, but there could be a more complicated relationship. Nevertheless, linear models are the natural starting point in our investigation.
  • It is tempting to interpret x as a cause and y as an effect; however, correlation only measures association. For instance, when relating price and demand we cannot say in general which one is the independent variable, as this depends on the specific market and its level of competition.
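
As an aside, the significance claim above is easy to reproduce numerically. Below is a minimal sketch, assuming the test of Section 9.4.5 is the usual t test for a correlation coefficient, with statistic $t = r\sqrt{n-2}/\sqrt{1-r^2}$ and $n-2$ degrees of freedom; the sample size passed in is a hypothetical value, since the data behind Fig. 10.1 are not reproduced here.

```python
from scipy.stats import t

def corr_pvalue(r: float, n: int) -> float:
    """Two-sided p-value for H0: rho = 0, given sample correlation r and sample size n."""
    # Test statistic: t = r * sqrt(n - 2) / sqrt(1 - r^2), with n - 2 degrees of freedom
    t_stat = r * ((n - 2) ** 0.5) / ((1.0 - r * r) ** 0.5)
    return 2.0 * t.sf(abs(t_stat), df=n - 2)

# r = 0.9623 as quoted in the text; n = 6 is a hypothetical sample size
print(corr_pvalue(0.9623, 6))  # ~0.002, small enough to reject H0 at the 1% level
```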

Clearly, there is no way to find a “perfect fit” line passing through all of the observed points, unless a linear model is an exact representation of reality. This will never be the case in real life, because of uncertainty and/or unmodeled factors.

Now, what makes a good model, and how can we measure its fit against empirical data? Let us compare the two alternative lines depicted in Fig. 10.1. It is a fairly safe bet that most readers would choose the continuous line rather than the dashed one. Our eye is measuring the distance between the observed points $(x_i, y_i)$ and the corresponding points $\hat{y}_i$ that the linear model would predict. Finding the prediction $\hat{y}_i$ associated with a setting of the regressor $x_i$ simply requires plugging the latter into the regression model:

$$ \hat{y}_i = a + b x_i $$

To formalize this distance for each individual observation, we can define a residual:

$$ e_i \equiv y_i - \hat{y}_i = y_i - (a + b x_i) $$

In order to evaluate the overall fit of a model, we must aggregate the individual residuals into one distance measure. It should be clear that the average residual is not very useful, as positive and negative deviations cancel each other. Borrowing the variance idea from statistics, we may consider squared deviations to get rid of the sign and define the sum of squared residuals (SSR) as follows:

$$ \mathrm{SSR} = \sum_{i=1}^{n} e_i^2 = \sum_{i=1}^{n} \left( y_i - a - b x_i \right)^2 $$
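
As a minimal sketch of this distance measure, the function below computes the residuals and SSR of a candidate line on arbitrary data; the arrays shown are hypothetical stand-ins for the data of Fig. 10.1.

```python
import numpy as np

def ssr(a: float, b: float, x: np.ndarray, y: np.ndarray) -> float:
    """Sum of squared residuals of the line y = a + b*x on the data (x, y)."""
    residuals = y - (a + b * x)   # e_i = y_i - (a + b x_i)
    return float(np.sum(residuals ** 2))

# Hypothetical data standing in for Fig. 10.1
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([1.2, 2.3, 2.8, 4.1, 4.9, 6.2])
print(ssr(0.0, 1.0, x, y))  # SSR of the 45-degree line through the origin
```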

According to the least-squares approach, the best-fit line is the one that minimizes SSR with respect to the regression coefficients a and b. Of course, there is no difference between minimizing SSR and minimizing the average squared residual, as multiplying the objective function by 1/n does not change the optimal solution.

Finding the best coefficients a and b calls for the solution of a least-squares problem which, in the case of simple linear regression, is a straightforward exercise in calculus: we just need to enforce the first-order optimality conditions, one per regression coefficient. A first condition is obtained by requiring that the partial derivative of SSR with respect to the intercept a vanish:

$$ \frac{\partial\,\mathrm{SSR}}{\partial a} = -2 \sum_{i=1}^{n} \left( y_i - a - b x_i \right) = 0 $$

Rearranging the condition yields

$$ a^* = \bar{y} - b\,\bar{x} \tag{10.1} $$

where $\bar{x}$ and $\bar{y}$ are the average values of x and y. It is useful to interpret this condition: it tells us that the barycenter $(\bar{x}, \bar{y})$ of the experimental data lies on the regression line, which does make good sense. This condition also implies that the average residual for the best-fit model is zero:

$$ \bar{e} = \frac{1}{n} \sum_{i=1}^{n} \left( y_i - a^* - b x_i \right) = 0 $$

The second optimality condition is obtained by requiring that the partial derivative of SSR with respect to the slope b vanish:

$$ \frac{\partial\,\mathrm{SSR}}{\partial b} = -2 \sum_{i=1}^{n} x_i \left( y_i - a - b x_i \right) = 0 $$

We can get rid of the leading factor −2, which is irrelevant, and plug the optimal value a* into this condition:

$$ \sum_{i=1}^{n} x_i y_i - \left( \bar{y} - b\,\bar{x} \right) \sum_{i=1}^{n} x_i - b \sum_{i=1}^{n} x_i^2 = 0 \tag{10.2} $$

Solving for b yields

$$ b^* = \frac{\sum_{i=1}^{n} x_i y_i - n\,\bar{x}\,\bar{y}}{\sum_{i=1}^{n} x_i^2 - n\,\bar{x}^2} \tag{10.3} $$
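
The derivation above can also be double-checked symbolically. The following sketch uses sympy to impose the two stationarity conditions on SSR and verifies that the solution coincides with the closed-form expressions (10.1) and (10.3); a three-point sample keeps the symbolic algebra small, and the result generalizes to any n.

```python
import sympy as sp

n = 3  # a small symbolic instance
a, b = sp.symbols('a b')
x = sp.symbols('x1:4')  # (x1, x2, x3)
y = sp.symbols('y1:4')  # (y1, y2, y3)

# Sum of squared residuals for the line y = a + b*x
SSR = sum((y[i] - a - b * x[i]) ** 2 for i in range(n))

# First-order optimality conditions: both partial derivatives must vanish
sol = sp.solve([sp.diff(SSR, a), sp.diff(SSR, b)], [a, b], dict=True)[0]

xbar = sum(x) / n
ybar = sum(y) / n
b_closed = (sum(x[i] * y[i] for i in range(n)) - n * xbar * ybar) \
           / (sum(x[i] ** 2 for i in range(n)) - n * xbar ** 2)

print(sp.simplify(sol[b] - b_closed))                # 0: slope matches (10.3)
print(sp.simplify(sol[a] - (ybar - sol[b] * xbar)))  # 0: intercept matches (10.1)
```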

This way of expressing the optimal slope is arguably the most immediate for carrying out the calculations. However, we may also rewrite the slope in a form that is easier to remember, dividing the numerator and the denominator by n to make the averages of x and y explicit:

$$ b^* = \frac{\frac{1}{n} \sum_{i=1}^{n} x_i y_i - \bar{x}\,\bar{y}}{\frac{1}{n} \sum_{i=1}^{n} x_i^2 - \bar{x}^2} $$

With least squares, we find explicit expressions for a* and b*. In the following we will see how these expressions may be interpreted intuitively to improve our understanding; first, however, it is better to illustrate the approach by a small numerical example. To streamline notation, we will drop the asterisk (*) and denote the optimal values of the coefficients simply by a and b.
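
As a minimal sketch of these explicit expressions, the function below computes the optimal coefficients directly from (10.3) and (10.1); the comparison with numpy's polyfit and the zero-mean-residual check are sanity tests only, and the data are hypothetical stand-ins for Fig. 10.1.

```python
import numpy as np

def least_squares_line(x: np.ndarray, y: np.ndarray) -> tuple[float, float]:
    """Return (a, b) minimizing SSR for the line y = a + b*x, via (10.3) and (10.1)."""
    n = len(x)
    xbar, ybar = x.mean(), y.mean()
    b = (np.sum(x * y) - n * xbar * ybar) / (np.sum(x ** 2) - n * xbar ** 2)  # (10.3)
    a = ybar - b * xbar                                                        # (10.1)
    return a, b

# Hypothetical data standing in for Fig. 10.1
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([1.2, 2.3, 2.8, 4.1, 4.9, 6.2])
a, b = least_squares_line(x, y)

# Sanity checks: agreement with numpy, and zero average residual
assert np.allclose([b, a], np.polyfit(x, y, 1))
assert abs(np.mean(y - (a + b * x))) < 1e-10
```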

Example 10.1 Let us apply the least-squares method to the data of Fig. 10.1. The necessary calculations are reported in Table 10.1. Applying (10.3) to the tabulated sums gives the slope b; plugging this value into (10.1) then yields the intercept a. The resulting regression model

$$ \hat{y} = a + b x $$

corresponds to the continuous line in Fig. 10.1, which in fact is the best-fit line in the least-squares sense.
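
Since the actual figures of Table 10.1 are not reproduced here, the following sketch replays the same kind of tabulated computation on hypothetical data: accumulate the sums of $x_i$, $y_i$, $x_i y_i$, and $x_i^2$, then apply (10.3) for the slope and (10.1) for the intercept.

```python
# Hypothetical (x_i, y_i) pairs standing in for Table 10.1
data = [(1.0, 1.2), (2.0, 2.3), (3.0, 2.8), (4.0, 4.1), (5.0, 4.9), (6.0, 6.2)]

n = len(data)
sum_x = sum(x for x, _ in data)
sum_y = sum(y for _, y in data)
sum_xy = sum(x * y for x, y in data)
sum_x2 = sum(x * x for x, _ in data)

xbar, ybar = sum_x / n, sum_y / n
b = (sum_xy - n * xbar * ybar) / (sum_x2 - n * xbar ** 2)  # slope via (10.3)
a = ybar - b * xbar                                        # intercept via (10.1)
print(f"fitted line: y = {a:.4f} + {b:.4f} x")
```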

