The aim of this section is to broaden our view of linear regression by analyzing it in the light of some concepts from linear algebra. In fact, linear regression can be regarded as a form of orthogonal projection within suitably chosen vector spaces. To see this, let us group observations and residuals into vectors as follows:
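Assuming n paired observations, we stack the observed responses, the corresponding values of the explanatory variable, and the residuals into column vectors:

\[
\mathbf{Y} = \begin{bmatrix} Y_1 \\ Y_2 \\ \vdots \\ Y_n \end{bmatrix},
\qquad
\mathbf{x} = \begin{bmatrix} x_1 \\ x_2 \\ \vdots \\ x_n \end{bmatrix},
\qquad
\mathbf{e} = \begin{bmatrix} e_1 \\ e_2 \\ \vdots \\ e_n \end{bmatrix}.
\]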
Let us also denote by u = [1, 1, …, 1]^T the column vector whose entries are all set to 1. Ideally, we would like to find coefficients a and b such that the vector Y can be expressed as a linear combination of x and u:
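\[
\mathbf{Y} = a\,\mathbf{u} + b\,\mathbf{x}.
\]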
However, there is little chance of expressing a vector in ℝ^n exactly with a basis consisting of two vectors only, so we settle for a vector that is as close as possible, by minimizing the norm of the vector of residuals:
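\[
\min_{a,\,b} \; \bigl\| \mathbf{e} \bigr\| = \bigl\| \mathbf{Y} - a\,\mathbf{u} - b\,\mathbf{x} \bigr\|.
\]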
We are looking for a vector in the linear subspace spanned by x and u, which is as close as possible to Y. In other words, we are projecting Y onto this subspace. From linear algebra, we know that this projection enjoys an orthogonality property: the difference between Y and its projection should be orthogonal to the projection itself. Intuitively, if we consider a plane and a point outside the plane, the path of minimal length between the point and the plane lies along a line that is orthogonal to the plane itself.
Therefore, we should expect that
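\[
\mathbf{e} \cdot \bigl( a\,\mathbf{u} + b\,\mathbf{x} \bigr) = 0,
\]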
where the dot · denotes the usual inner product between vectors in ℝ^n. Since the inner product is a linear operator, we just need to check orthogonality between e and the basis vectors u and x. Now, from least-squares theory, we know that the average of the residuals is indeed zero, so
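\[
\mathbf{e} \cdot \mathbf{u} = \sum_{i=1}^{n} e_i = n\,\bar{e} = 0.
\]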
Checking orthogonality between e and x proceeds along familiar lines, based on the form of least-squares estimators:
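\[
\mathbf{e} \cdot \mathbf{x} = \sum_{i=1}^{n} x_i e_i
= \sum_{i=1}^{n} x_i \bigl[ (Y_i - \bar{Y}) - b (x_i - \bar{x}) \bigr]
= \sum_{i=1}^{n} (x_i - \bar{x})(Y_i - \bar{Y}) - b \sum_{i=1}^{n} (x_i - \bar{x})^2 = 0,
\]

where we have used $a = \bar{Y} - b\bar{x}$ to rewrite the residuals, the fact that deviations from the mean sum to zero, and the expression of the slope estimator b as the ratio between the last two sums.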
Hence, we see that ordinary least squares may be interpreted in terms of orthogonal projection on linear subspaces. This view of linear regression in linear algebraic terms will prove most useful in understanding multiple linear regression.
So far in this section, we have not referred to any probabilistic or statistical concept. Thus, it may also be useful to interpret linear regression in probabilistic terms. Note that, in practice, regression is carried out on sampled data; however, we may also consider best approximation problems between random variables. Let us consider a random variable Y, which we want to approximate by a linear affine transformation of another random variable X, i.e.,
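\[
Y \approx a + bX,
\]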
for coefficients a and b that we must determine in a suitable way. What are the desirable properties of such an approximation? To begin with, the two expected values should be the same:
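\[
\mathrm{E}[Y] = \mathrm{E}[a + bX] = a + b\,\mathrm{E}[X],
\]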
which is just the probabilistic counterpart of (10.1). To find another condition, we should require that the approximation is good in a probabilistic sense. If we introduce the approximation error, say ε ≡ Y − (a + bX), we may require that its variance be small; therefore, we solve the problem
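\[
\min_{a,\,b} \;\mathrm{Var}(\epsilon),
\]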
subject to condition (10.27). Actually, from this requirement we find a condition on b only. To see this, let us express the variance of the error:
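\[
\mathrm{Var}(\epsilon) = \mathrm{Var}\bigl( Y - a - bX \bigr) = \mathrm{Var}(Y) + b^2\,\mathrm{Var}(X) - 2b\,\mathrm{Cov}(X, Y),
\]

which does not depend on a at all.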
Minimization with respect to b yields
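\[
\frac{d}{db}\,\mathrm{Var}(\epsilon) = 2b\,\mathrm{Var}(X) - 2\,\mathrm{Cov}(X, Y) = 0
\quad\Rightarrow\quad
b = \frac{\mathrm{Cov}(X, Y)}{\mathrm{Var}(X)}.
\]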
This is just the probabilistic counterpart of (10.7).
Since we find the best approximation of Y in terms of X by linear regression, the error should not carry any information related to X. In other words, ε and X should be uncorrelated. Indeed,
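\[
\mathrm{Cov}(\epsilon, X) = \mathrm{Cov}\bigl( Y - a - bX,\; X \bigr)
= \mathrm{Cov}(Y, X) - b\,\mathrm{Var}(X)
= \mathrm{Cov}(Y, X) - \frac{\mathrm{Cov}(X, Y)}{\mathrm{Var}(X)}\,\mathrm{Var}(X) = 0.
\]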
This also implies that ε is uncorrelated with the overall approximation a + bX. Now, can we bridge the gap between this probabilistic view of regression and the view above, based on inner products and orthogonal projection? To this end, we should define a suitable inner product between random variables, as well as an orthogonality concept. A comparison between the two views suggests that we may link orthogonality and lack of correlation. In other words, random variables X1 and X2 are “orthogonal” if Cov(X1, X2) = 0. This in turn suggests the definition of an inner product between random variables:
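\[
\langle X_1, X_2 \rangle \equiv \mathrm{Cov}(X_1, X_2).
\]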
Let us see if this makes sense. From linear algebra, we recall the properties that any legitimate inner product should enjoy:
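For elements x, y, and z of the space and a scalar α, a standard statement of these properties is:

1. ⟨x, y⟩ = ⟨y, x⟩;
2. ⟨αx, y⟩ = α⟨x, y⟩;
3. ⟨x + y, z⟩ = ⟨x, z⟩ + ⟨y, z⟩;
4. ⟨x, x⟩ ≥ 0, and ⟨x, x⟩ = 0 if and only if x = 0.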
The first three properties correspond to properties of covariance that we listed in Section 8.3.1. The last property says, first, that variance cannot be negative, which is fine. The problem comes from the second point: variance is zero for any constant random variable, not only for the constant zero. This suggests that there might be some more work to do, which is beyond the scope of an introductory book, but it turns out that there is a way to solve this issue and to properly define an inner product between random variables.
Leaving the last technicality aside, it does seem that we can define a linear space of random variables, in which we may take linear combinations and where the inner product is just covariance. Within this framework, the two views described above are indeed strictly related: they both amount to orthogonal projection within linear spaces equipped with a properly defined inner product. The inner product also defines a norm
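\[
\| x \| = \sqrt{\langle x, x \rangle},
\]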
where x may be a vector or a random variable. With least squares, we minimize the squared norm of an error/residual, which is orthogonal to the projected element, be it a vector or a random variable. We close this section with a short example reinforcing this general framework.
Example 10.12 (Pythagorean theorem for random variables) We know that if vectors x and y are orthogonal, then
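\[
\| \mathbf{x} + \mathbf{y} \|^2 = \| \mathbf{x} \|^2 + \| \mathbf{y} \|^2.
\]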
If we apply this to the legs and the hypotenuse of a right triangle, we get the familiar form of the Pythagorean theorem. If we apply the idea to a linear space of random variables, where the squared norm is the variance, we get something quite familiar. If two random variables are uncorrelated, we obtain
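\[
\mathrm{Var}(X + Y) = \mathrm{Var}(X) + \mathrm{Var}(Y).
\]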
Problems
10.1 Consider the following sales data:
Build a linear regression model to predict sales and calculate R².
Build a 95% confidence interval for the slope.
10.3 A firm sells a perishable product, with a time window for sales limited to 1 month. The product is ordered once per month, and the delivery lead time is very small, so that the useful shelf life is really 1 month. Each piece is bought at €10 and is sold for €14; if the product expires, it can be scrapped for €2 per unit. Over the last 4 months, a positive trend in sales has been observed:
Hence, the firm resorts to linear regression to forecast sales over the next period. How many items should the firm buy, in order to maximize expected profit in May?