A VECTOR SPACE LOOK AT LINEAR REGRESSION

The aim of this section is to broaden our view of linear regression by analyzing it in the light of some concepts from linear algebra. In fact, linear regression can be regarded as a sort of orthogonal projection within suitably chosen vector spaces. To see this, let us group the observations and the residuals into vectors as follows:

Y = [Y_1, Y_2, \ldots, Y_n]^T, \qquad x = [x_1, x_2, \ldots, x_n]^T, \qquad e = [e_1, e_2, \ldots, e_n]^T

Let us also introduce u = [1, 1, …, 1]^T, i.e., a column vector whose entries are all set to 1. Ideally, we would like to find coefficients a and b such that vector Y can be expressed as a linear combination of x and u:

Y = a u + b x

However, there is little hope of expressing an arbitrary vector in \mathbb{R}^n using a basis consisting of two vectors only, and we settle for a vector that is as close as possible to Y, by minimizing the norm of the vector of residuals:

\min_{a,\,b} \; \| e \| = \| Y - a u - b x \|

We are looking for a vector in the linear subspace spanned by x and u, which is as close as possible to Y. In other words, we are projecting Y on this subspace. From linear algebra, we know that this projection has some orthogonality properties, in the sense that the difference between Y and its projection and the projection itself should be orthogonal vectors. Intuitively, if we consider a plane, and a point outside the plane, the path of minimal length between the point and the plane lies along a line that is orthogonal to the plane itself.
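As a concrete illustration, the following is a minimal numerical sketch, assuming NumPy and purely hypothetical data, showing that the least-squares fitted values coincide with the orthogonal projection of Y onto the subspace spanned by u and x:

    import numpy as np

    # Hypothetical data for illustration: n observations of x and Y
    rng = np.random.default_rng(42)
    x = np.linspace(0.0, 10.0, 20)
    Y = 3.0 + 2.0 * x + rng.normal(scale=1.5, size=x.size)

    # Basis of the subspace: u (all ones) and x, collected as columns of A
    u = np.ones_like(x)
    A = np.column_stack([u, x])

    # Least-squares coefficients for Y ~ a*u + b*x
    (a, b), *_ = np.linalg.lstsq(A, Y, rcond=None)

    # Orthogonal projection of Y onto span{u, x}: P = A (A^T A)^{-1} A^T
    P = A @ np.linalg.solve(A.T @ A, A.T)
    Y_hat = P @ Y

    # The fitted values coincide with the projection (up to roundoff)
    print(np.allclose(a * u + b * x, Y_hat))   # True

Here the projection matrix is formed explicitly only for illustration; in practice one would rely on a least-squares routine directly.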

In light of this projection interpretation, we should expect that

e \cdot (a u + b x) = 0

where the dot · denotes the usual inner product between vectors in \mathbb{R}^n. Since the inner product is a linear operator, we just need to check orthogonality between e and the basis vectors u and x. From least-squares theory, we know that the average of the residuals is indeed zero, so

e \cdot u = \sum_{i=1}^{n} e_i = 0

Checking orthogonality between e and x proceeds along familiar lines, based on the form of least-squares estimators:

e \cdot x = \sum_{i=1}^{n} x_i e_i = \sum_{i=1}^{n} x_i (Y_i - a - b x_i) = \sum_{i=1}^{n} x_i (Y_i - \bar{Y}) - b \sum_{i=1}^{n} x_i (x_i - \bar{x}) = 0

where we have used a = \bar{Y} - b \bar{x}, and the last equality follows from the expression of the least-squares slope b.

Hence, we see that ordinary least squares may be interpreted in terms of orthogonal projection on linear subspaces. This view of linear regression in linear algebraic terms will prove most useful in understanding multiple linear regression.
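Readers who wish to verify these orthogonality conditions numerically may use a small sketch along the following lines, again assuming NumPy and hypothetical data, and using the standard closed-form estimators of simple linear regression:

    import numpy as np

    # Hypothetical sample, just to check the orthogonality conditions numerically
    rng = np.random.default_rng(0)
    x = rng.uniform(0.0, 10.0, size=30)
    Y = 1.0 + 0.5 * x + rng.normal(size=30)

    # Standard closed-form least-squares estimators for simple regression
    b = np.sum((x - x.mean()) * (Y - Y.mean())) / np.sum((x - x.mean()) ** 2)
    a = Y.mean() - b * x.mean()

    # Residual vector and the all-ones vector u
    e = Y - (a + b * x)
    u = np.ones_like(x)

    # Both inner products vanish up to roundoff: e is orthogonal to u and x
    print(np.isclose(e @ u, 0.0, atol=1e-8), np.isclose(e @ x, 0.0, atol=1e-8))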

So far in this section, we have not referred to any probabilistic or statistical concept; thus, it may be useful to also interpret linear regression in probabilistic terms. Note that, in practice, regression is carried out on sampled data; however, we may also consider best approximation problems between random variables. Let us consider a random variable Y, which we want to approximate by an affine transformation of another random variable X, i.e.,

Y \approx a + b X

for coefficients a and b that we must determine in a suitable way. What are the desirable properties of such an approximation? To begin with, the two expected values should be the same:

\mathrm{E}[Y] = \mathrm{E}[a + b X] = a + b\, \mathrm{E}[X]

which is just the probabilistic counterpart of (10.1). To find another condition, we should require that the approximation is good in a probabilistic sense. If we introduce the error \epsilon \equiv Y - (a + b X), we may require that its variance is small; therefore, we solve the problem

\min_{b} \; \mathrm{Var}(\epsilon)

subject to condition (10.27). Actually, from this requirement we find a condition on b only. To see this, let us express the variance of the error:

\mathrm{Var}(\epsilon) = \mathrm{Var}(Y - a - b X) = \mathrm{Var}(Y) + b^2\, \mathrm{Var}(X) - 2 b\, \mathrm{Cov}(X, Y)

Minimization with respect to b yields

\frac{d}{d b}\, \mathrm{Var}(\epsilon) = 2 b\, \mathrm{Var}(X) - 2\, \mathrm{Cov}(X, Y) = 0 \quad \Longrightarrow \quad b = \frac{\mathrm{Cov}(X, Y)}{\mathrm{Var}(X)}

This is just the probabilistic counterpart of (10.7).
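A small simulation sketch, assuming NumPy and a hypothetical joint sample of (X, Y), shows how the probabilistic coefficients can be estimated by replacing Cov(X, Y) and Var(X) with their sample counterparts:

    import numpy as np

    # Hypothetical joint sample of (X, Y), with Y depending linearly on X plus noise
    rng = np.random.default_rng(1)
    X = rng.normal(size=100_000)
    Y = 2.0 + 3.0 * X + rng.normal(size=X.size)

    # Probabilistic regression coefficients estimated from the sample:
    # b = Cov(X, Y) / Var(X), and a chosen so that E[Y] = a + b E[X]
    b = np.cov(X, Y)[0, 1] / np.var(X, ddof=1)
    a = Y.mean() - b * X.mean()

    print(round(b, 3), round(a, 3))   # close to 3.0 and 2.0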

Since we find a best approximation of Y in terms of X by linear regression, the error should not carry any information related to X. In other words, \epsilon and X should be uncorrelated. Indeed,

\mathrm{Cov}(\epsilon, X) = \mathrm{Cov}(Y - a - b X,\, X) = \mathrm{Cov}(Y, X) - b\, \mathrm{Var}(X) = \mathrm{Cov}(Y, X) - \frac{\mathrm{Cov}(X, Y)}{\mathrm{Var}(X)}\, \mathrm{Var}(X) = 0

This also implies that \epsilon is uncorrelated with the overall approximation a + bX. Now, can we bridge the gap between this probabilistic view of regression and the earlier view based on inner products and orthogonal projection? To this aim, we should define a suitable inner product between random variables, as well as an orthogonality concept. A comparison of the two views suggests that we may link orthogonality with lack of correlation. In other words, random variables X_1 and X_2 are “orthogonal” if \mathrm{Cov}(X_1, X_2) = 0. This in turn suggests the definition of an inner product between random variables:

\langle X_1, X_2 \rangle \equiv \mathrm{Cov}(X_1, X_2)

Let us see if this makes sense. From linear algebra, we recall the properties that any legitimate inner product should enjoy:

  1. \langle X, Y \rangle = \langle Y, X \rangle
  2. \langle X + Y, Z \rangle = \langle X, Z \rangle + \langle Y, Z \rangle
  3. \langle a X, Y \rangle = a \langle X, Y \rangle, for any scalar a
  4. \langle X, X \rangle \ge 0, and \langle X, X \rangle = 0 only if X = 0

The first three properties correspond to properties of covariance that we listed in Section 8.3.1. The last property says, first, that variance cannot be negative, which is fine. The problem comes from the second point: variance is zero for any constant random variable, not only for the constant zero. This suggests that there might be some more work to do, which is beyond the scope of an introductory book, but it turns out that there is a way to solve this issue and properly define an inner product between random variables.

Leaving the last technicality aside, it does seem that we can define a linear space of random variables, on which we may take linear combinations, and where the inner product is just covariance. Within this framework, the two views described above are indeed strictly related: they both amount to an orthogonal projection within a linear space equipped with a properly defined inner product. The inner product also defines a norm

\| x \| = \sqrt{ \langle x, x \rangle }

where x may be a vector or a random variable. By least squares, we minimize the squared norm of an error/residual, which is orthogonal to the projected element, be it a vector or a random variable. We close this section with a short example reinforcing this general framework.
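Before turning to that example, a brief numerical aside, assuming NumPy and hypothetical numbers, shows that the same square-root-of-an-inner-product recipe yields the Euclidean norm in the vector case and the standard deviation in the random-variable case:

    import numpy as np

    # Vector case: the norm induced by the dot product is the Euclidean norm
    e = np.array([0.5, -1.0, 0.3, 0.2])
    print(np.sqrt(e @ e))                      # ||e||

    # Random-variable case: the norm induced by covariance is the standard deviation
    rng = np.random.default_rng(2)
    eps = rng.normal(scale=2.0, size=50_000)   # hypothetical error with Var = 4
    print(np.sqrt(np.cov(eps, eps)[0, 1]))     # close to 2.0, i.e., sqrt(Var(eps))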

Example 10.12 (Pythagorean theorem for random variables) We know that if vectors x and y are orthogonal, then

\| x + y \|^2 = \| x \|^2 + \| y \|^2

If we apply this to the legs and the hypotenuse of a right triangle, we get the familiar form of the Pythagorean theorem. If we apply the same idea to a linear space of random variables, where the squared norm is the variance, we get something quite familiar. If two random variables are uncorrelated, we obtain

\mathrm{Var}(X + Y) = \mathrm{Var}(X) + \mathrm{Var}(Y)
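A quick simulation, assuming NumPy and two independent (hence uncorrelated) hypothetical random variables, illustrates this variance decomposition:

    import numpy as np

    # Two independent (hence uncorrelated) hypothetical random variables
    rng = np.random.default_rng(3)
    X = rng.normal(scale=2.0, size=200_000)    # Var(X) = 4
    Y = rng.uniform(-3.0, 3.0, size=200_000)   # Var(Y) = 6**2 / 12 = 3

    # Sample check of the "Pythagorean theorem": Var(X + Y) = Var(X) + Var(Y)
    print(round(np.var(X + Y, ddof=1), 2))                   # close to 7
    print(round(np.var(X, ddof=1) + np.var(Y, ddof=1), 2))   # close to 7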

Problems

10.1 Consider the following sales data:

[sales data table omitted]

Build a linear regression model to predict sales and calculate R^2.

10.2 Given the observed data

[observed data table omitted]

build a 95% confidence interval for the slope.

10.3 A firm sells a perishable product, with a time window for sales limited to 1 month. The product is ordered once per month, and the delivery lead time is very small, so that the useful shelf life is really 1 month. Each piece is bought at €10 and is sold for €14; if the product expires, it can be scrapped for €2 per unit. Over the last 4 months a positive trend in sales has been observed:

[monthly sales data omitted]

Hence, the firm resorts to linear regression to forecast sales over the next period. How many items should the firm buy, in order to maximize expected profit in May?

