Location measures tell us nothing about the dispersion of data. Two distributions may share the same mean, median, and mode, yet be quite different. Figure 4.8 illustrates the importance of dispersion in discerning the difference between distributions sharing the same location measures. One possible way to characterize dispersion is by measuring the range $X_{(n)} - X_{(1)}$, i.e., the difference between the largest and the smallest observations. However, the range has a couple of related shortcomings:
- It uses only two observations of a possibly large dataset, with a corresponding potential loss of valuable information.
- It is rather sensitive to extreme observations.
An alternative and arguably better idea is based on measuring deviations from the mean. We could consider the average deviation from the mean, i.e., something like
$$\frac{1}{n}\sum_{i=1}^{n}\left(X_i - \bar{X}\right)$$
However, it is easy to see that the above definition is useless, as the average deviation is identically zero by its very definition:
$$\frac{1}{n}\sum_{i=1}^{n}\left(X_i - \bar{X}\right) = \frac{1}{n}\sum_{i=1}^{n} X_i - \bar{X} = \bar{X} - \bar{X} = 0$$
Fig. 4.8 Bar charts illustrating the role of dispersion: mean, median, and mode are the same, but the two distributions are quite different.
The problem is that we have positive and negative deviations canceling each other. To get rid of the sign of deviations, we might consider taking absolute values, which yields the mean absolute deviation (MAD):
$$\text{MAD} = \frac{1}{n}\sum_{i=1}^{n}\left| X_i - \bar{X} \right|$$
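As a concrete illustration, here is a minimal Python sketch of the MAD computation; the sample values are arbitrary illustration data, not taken from the text:

```python
def mean(xs):
    return sum(xs) / len(xs)

def mad(xs):
    """Mean absolute deviation: average of |x_i - mean|."""
    m = mean(xs)
    return sum(abs(x - m) for x in xs) / len(xs)

sample = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]  # arbitrary data
print(mad(sample))  # 1.5 (the mean is 5.0)
```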
As an alternative, we may average the squared deviations, which leads to the most common measure of dispersion.
DEFINITION 4.6 (Variance) In the case of a population of size n, variance is defined as
$$\sigma^2 = \frac{1}{n}\sum_{i=1}^{n}\left(x_i - \mu\right)^2$$
In the case of a sample of size n, variance is defined as
$$S^2 = \frac{1}{n-1}\sum_{i=1}^{n}\left(X_i - \bar{X}\right)^2$$
These definitions mirror the definitions of mean for populations and samples. However, a rather puzzling feature of the definition of sample variance S² is the division by n − 1, instead of n. A convincing justification will be given in Section 9.1.2, within the framework of inferential statistics. For now, let us observe that the n deviations are not independent, since the identity of Eq. (4.3) shows that when we know the sample mean and the first n − 1 deviations, we can easily figure out the last one. In fact, there are only n − 1 independent pieces of information or, in other words, n − 1 degrees of freedom. Another informal argument is that, since we do not know the true population mean μ, we have to settle for its estimate $\bar{X}$, and in estimating one parameter we lose one degree of freedom (1 df). This is actually useful as a mnemonic to help us deal with more complicated cases, where estimating multiple parameters results in the loss of more degrees of freedom.
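The dependence among the deviations is easy to check numerically. The following sketch, with made-up data, verifies that the deviations sum to zero, so the last one is pinned down by the sample mean and the first n − 1 deviations:

```python
sample = [12.0, 7.0, 3.0, 10.0, 13.0]  # made-up data
m = sum(sample) / len(sample)          # sample mean: 9.0
devs = [x - m for x in sample]

print(sum(devs))                  # 0.0 (up to rounding), as in Eq. (4.3)
print(devs[-1], -sum(devs[:-1]))  # both 4.0: the last deviation is implied
```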
Variance is more commonly used than MAD. Compared with MAD, variance emphasizes large deviations, since they are squared. Another reason, which will become apparent in the following, is that variance involves squaring deviations, and the function g(z) = z² is nicely differentiable, whereas MAD involves the absolute value h(z) = |z|, which is not differentiable at the origin. However, squaring does have a drawback: it changes the unit of measurement. For instance, the variance of weekly demand is measured in squared items, and it is difficult to assign a meaning to that. This is why a closely related measure of dispersion has been introduced.
DEFINITION 4.7 (Standard deviation) Standard deviation is defined as the square root of variance. The usual notation, mirroring Definition 4.6, is σ for a population and S for a sample.
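For concreteness, the following Python sketch contrasts the two divisors, n for a population and n − 1 for a sample; the data are arbitrary:

```python
import math

def variance(xs, sample=True):
    """Sample variance divides by n - 1; population variance by n."""
    n = len(xs)
    m = sum(xs) / n
    ss = sum((x - m) ** 2 for x in xs)
    return ss / (n - 1) if sample else ss / n

data = [12.0, 7.0, 3.0, 10.0, 13.0]            # arbitrary data
print(variance(data, sample=True))              # S^2     = 16.5
print(math.sqrt(variance(data, sample=True)))   # S       ~ 4.062
print(variance(data, sample=False))             # sigma^2 = 13.2
```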
The calculation of variance and standard deviation is simplified by the following shortcuts:
$$\sigma^2 = \frac{1}{n}\sum_{i=1}^{n} x_i^2 - \mu^2 \tag{4.5}$$
$$S^2 = \frac{1}{n-1}\left(\sum_{i=1}^{n} X_i^2 - n\bar{X}^2\right) \tag{4.6}$$
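On well-behaved data, the shortcut agrees with the definition, as the following sketch (again with arbitrary data) confirms:

```python
data = [12.0, 7.0, 3.0, 10.0, 13.0]  # arbitrary data
n = len(data)
m = sum(data) / n

# Definition: S^2 = sum of squared deviations / (n - 1)
s2_def = sum((x - m) ** 2 for x in data) / (n - 1)

# Shortcut of Eq. (4.6): S^2 = (sum of squares - n * mean^2) / (n - 1)
s2_short = (sum(x * x for x in data) - n * m * m) / (n - 1)

print(s2_def, s2_short)  # both 16.5
```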
Example 4.12 Consider the sample:
We have
Hence, sample variance is
and sample standard deviation is
It is quite instructive to prove the above formulas. We consider here the shortcut of Eq. (4.5), leaving the second one as an exercise:
$$\sigma^2 = \frac{1}{n}\sum_{i=1}^{n}\left(x_i - \mu\right)^2 = \frac{1}{n}\sum_{i=1}^{n}\left(x_i^2 - 2\mu x_i + \mu^2\right) = \frac{1}{n}\sum_{i=1}^{n} x_i^2 - 2\mu \cdot \frac{1}{n}\sum_{i=1}^{n} x_i + \mu^2 = \frac{1}{n}\sum_{i=1}^{n} x_i^2 - \mu^2$$
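The identity behind Eq. (4.5) can also be verified symbolically, e.g., with sympy for a small fixed n; this is just a sanity check of the algebra above:

```python
import sympy as sp

n = 3
xs = sp.symbols('x1:4')                      # x1, x2, x3
mu = sum(xs) / n

lhs = sum((x - mu) ** 2 for x in xs) / n     # definition of variance
rhs = sum(x ** 2 for x in xs) / n - mu ** 2  # shortcut of Eq. (4.5)
print(sp.simplify(lhs - rhs))                # 0
```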
These rearrangements do streamline calculations by hand or with a pocket calculator, but they can be numerically dangerous when dealing with somewhat pathological cases.
Table 4.7 Computing variance with raw and centered data.

Example 4.13 Consider the dataset in Table 4.7. The first column shows the raw data; the second column shows the corresponding centered data, obtained by subtracting the mean from the raw data. Of course, variance is the same in both cases, as shifting data by any amount does not affect dispersion. If we use the definition of variance, we get the correct result in both cases, S² = 229.17. However, if we apply the streamlined formula of Eq. (4.6) to the raw data, on a finite-precision computer we get 0. This is a consequence of numerical errors, and we may see why by considering just two observations with the same structure as the data in Table 4.7:
$$X_1 = \alpha + \epsilon_1, \qquad X_2 = \alpha + \epsilon_2,$$
where $\epsilon_1$ and $\epsilon_2$ are much smaller than $\alpha$; in the table, for instance, $\alpha = 10{,}000{,}000{,}000$. Then
$$X_1^2 + X_2^2 = 2\alpha^2 + 2\alpha\left(\epsilon_1 + \epsilon_2\right) + \left[\epsilon_1^2 + \epsilon_2^2\right]$$
and
$$2\bar{X}^2 = 2\alpha^2 + 2\alpha\left(\epsilon_1 + \epsilon_2\right) + \left[\frac{\left(\epsilon_1 + \epsilon_2\right)^2}{2}\right]$$
We see that the two expressions are different, but since $\epsilon_1$ and $\epsilon_2$ are relatively small and get squared, the terms in the brackets are much smaller than the other ones. In finite-precision computer arithmetic, they are canceled in the calculations, so that the difference between the two expressions turns out to be zero. This is a numerical error due to truncation. In general, when taking the difference of similar quantities, a loss of precision may result. If we subtract the mean $\bar{X}$ instead, we compute variance with centered data:
$$S^2 = \left(X_1 - \bar{X}\right)^2 + \left(X_2 - \bar{X}\right)^2 = \left(\frac{\epsilon_1 - \epsilon_2}{2}\right)^2 + \left(\frac{\epsilon_2 - \epsilon_1}{2}\right)^2,$$
which yields the correct result
$$S^2 = \frac{\left(\epsilon_1 - \epsilon_2\right)^2}{2},$$
with no risk of numerical cancelation. Although the effect need not be this striking in real-life datasets, it is generally advisable to work with centered data.
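The cancellation is easy to reproduce in standard double-precision arithmetic. In the sketch below, the small offsets are made-up values playing the role of $\epsilon_1, \epsilon_2, \ldots$, and $\alpha$ matches the order of magnitude in Table 4.7; the shortcut of Eq. (4.6) loses all significant digits, while the centered computation does not:

```python
alpha = 1e10
eps = [10.0, 25.0, 40.0]              # made-up small offsets
data = [alpha + e for e in eps]
n = len(data)
m = sum(data) / n

# Shortcut of Eq. (4.6): the huge terms cancel catastrophically
s2_short = (sum(x * x for x in data) - n * m * m) / (n - 1)

# Centered computation: subtract the mean first, then square
s2_centered = sum((x - m) ** 2 for x in data) / (n - 1)

print(s2_short)     # garbage (perhaps 0.0, perhaps even negative)
print(s2_centered)  # 225.0, the correct value
```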
We close this section by pointing out a fundamental property of variance and standard deviation, which follows from the fact that they involve a sum of squares.
PROPERTY 4.8 Variance and standard deviation can never be negative; they are zero in the “degenerate” case when there is no variability at all in the data.
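A trivial check of the degenerate case, with made-up constant data:

```python
constant = [7.0] * 6                  # no variability at all
m = sum(constant) / len(constant)
print(sum((x - m) ** 2 for x in constant) / (len(constant) - 1))  # 0.0
```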