We introduced the basic concepts of frequencies and histograms in Section 1.2.1. Here we treat the same concepts in a slightly more systematic way, illustrating a few potential difficulties that may occur even with these very simple ideas.
Imagine a car insurance agent who has collected the weekly number of accidents that occurred during the last 41 weeks, as shown in Table 4.2. This raw representation of data is somewhat confusing, even for such a small dataset. Hence, we need a more systematic way to organize and present data. A starting point is sorting the data in order to see the frequency with which each of the values above occurs. For instance, we see that in 5 weeks we have observed one accident, whereas two accidents have been observed in 12 cases. Doing so for all of the observed values, from 1 to 6, we get the second column of Table 4.3, which shows frequencies for the raw data above.
An even clearer picture may be obtained by considering relative frequencies, which are obtained by taking the ratio of observed frequencies and the total number of observations:

$$\text{relative frequency} = \frac{\text{observed frequency}}{\text{total number of observations}}.$$
For instance, two accidents have been observed in 12 cases out of 41; hence, the relative frequency of the value 2 is 12/41 ≈ 0.2927. Relative frequencies are also displayed in Table 4.3. Sometimes, they are reported in percentage terms. What is really essential to notice is that relative frequencies must add up to 1, or 100%; when numbers are rounded, a small discrepancy may occur.
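As a minimal sketch in Python, frequencies and relative frequencies can be tabulated with a counter. The accident counts below are hypothetical, chosen only to reproduce the two frequencies quoted in the text (5 ones and 12 twos over 41 weeks); the actual data of Table 4.2 are not reproduced here.

```python
from collections import Counter

# Hypothetical weekly accident counts for 41 weeks (not the real Table 4.2 data)
data = [2, 1, 3, 2, 4, 2, 1, 5, 2, 3,
        2, 6, 1, 2, 3, 4, 2, 3, 1, 2,
        5, 3, 2, 4, 3, 1, 2, 6, 3, 4,
        2, 3, 5, 4, 3, 2, 4, 3, 5, 4, 6]

freq = Counter(data)                       # absolute frequencies
n = sum(freq.values())                     # total number of observations
rel = {k: f / n for k, f in freq.items()}  # relative frequencies

for value in sorted(freq):
    print(value, freq[value], round(rel[value], 4))

# Relative frequencies must add up to 1 (up to rounding)
print(round(sum(rel.values()), 10))
```

The last line checks the adding-up property: any discrepancy from 1 can only come from floating-point rounding.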
Table 4.3 Organizing raw data using frequencies and relative frequencies.
Fig. 4.1 A bar chart of frequencies for the data in Table 4.3.
Frequencies and relative frequencies may be represented graphically in a few ways. The most common graphical display is a bar chart, like the one illustrated in Fig. 4.1. The same bar chart, with a different vertical axis, would represent relative frequencies. A bar chart can also be used to illustrate frequencies of categorical data. In such a case, the ordering of bars has no meaning, whereas for quantitative variables we have a natural ordering of bars. A bar chart for quantitative variables is usually called a histogram. For qualitative variables, we may also use an alternative representation such as a pie chart. Figure 4.2 shows a pie chart for answers to a hypothetical question (answers can be “yes,” “no,” and “don’t know”). Clearly, a pie chart does not show any ordering between categories.
Fig. 4.2 A pie chart for categorical data.
Table 4.4 Frequency and relative frequencies for grouped data.
A histogram is naturally suited to display discrete variables, but what about continuous variables, or discrete ones with many observed values? In such a case, it is customary to group data into intervals corresponding to classes. As a concrete example, consider the time it takes to get to the workplace by car. Time is continuous in this case, but there is little point in discriminating too finely, down to fractions of seconds. We may consider “bins” characterized by a width of three minutes, as illustrated in Table 4.4. To formalize the concept, each “bin” corresponds to an interval. The common convention is to use closed-open intervals. This means, for instance, that the class 18–21 in Table 4.4 includes all observations ≥ 18 and < 21 or, in other words, it corresponds to the interval [18, 21). To generalize, we use bins of the following form:
$$B_k(x_0, h) = [\,x_0 + (k-1)h,\; x_0 + kh\,), \qquad k = 1, 2, 3, \ldots,$$

where x0 is the origin of this set of bins and should not be confused with an observation, and h is the bin width. Actually, h need not be the same for all of the bins, but a common width is the natural choice. The first bin, B1(x0, h), corresponds to the interval [x0, x0 + h); the second bin, B2(x0, h), corresponds to the interval [x0 + h, x0 + 2h); and so on. For widely dispersed data it may be convenient to introduce two side bins, i.e., an unbounded interval (−∞, xl) collecting the observations below a lower bound xl, and an unbounded interval [xu, +∞) for the observations above an upper bound xu.

Fig. 4.3 Bar charts illustrating the effect of changing bin width in histograms.
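The bin definition above amounts to a simple formula for locating an observation: x falls into bin k = ⌊(x − x0)/h⌋ + 1. A minimal sketch in Python (the commute time 18.5 is a made-up example value):

```python
import math

def bin_index(x, x0, h):
    """Index k of the closed-open bin B_k(x0, h) = [x0 + (k-1)h, x0 + kh) containing x."""
    return math.floor((x - x0) / h) + 1

# With origin x0 = 0 and h = 3 (three-minute bins, as in Table 4.4),
# an 18.5-minute commute falls into the seventh bin, the interval [18, 21):
k = bin_index(18.5, 0, 3)
print(k, (k - 1) * 3, k * 3)
```

Note how the closed-open convention resolves boundary values: an observation exactly equal to 21 goes to the next bin, [21, 24), not to [18, 21).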
Histograms are a seemingly trivial concept that can be used to figure out basic properties of a dataset, in terms of symmetry vs. skewness (see Fig. 1.5) as well as dispersion (see Fig. 1.6). In practice, they may not be as easy to use as one might imagine. A first choice we have to make concerns the bin width h and, correspondingly, the number of bins. The same dataset may look different if we change the number of bins, as illustrated in Fig. 4.3. Using too few bins does not discriminate the data enough, and any underlying structure is lost in the blur [see histogram (a) in Fig. 4.3]; using too many may result in a confusing jaggedness that should be smoothed in order to see the underlying pattern [see histogram (b) in Fig. 4.3]. A common rule of thumb is to use no fewer than 5 and no more than 20 bins. A probably less obvious issue is related to the choice of the origin x0, as illustrated by the next example.
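Beyond the 5-to-20 rule of thumb quoted above, a widely used default (not mentioned in the text, added here only as an illustration) is Sturges' rule, which ties the number of bins to the sample size:

```python
import math

def sturges_bins(n):
    """Sturges' rule: ceil(log2(n)) + 1 bins for a sample of size n."""
    return math.ceil(math.log2(n)) + 1

# For the 41-week accident dataset this suggests 7 bins,
# comfortably inside the 5-20 range.
print(sturges_bins(41))
```

Any such formula is only a starting point; the sensible course is to try a few bin counts and see which display reveals the structure of the data best.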
Example 4.5 Consider the dataset in Table 4.5, reporting observed values along with their frequencies. Now let us choose h = 0.2, just to group data a little bit. Figure 4.4 shows three histograms obtained by setting x0 = 54.9, x0 = 55.0, and x0 = 55.1, respectively. At first sight, the change in histogram shape due to an innocent shift in the origin of bins is quite surprising. If we look more carefully into the data, the source of the trouble becomes evident. Let us check which interval corresponds to the first bin in the three cases. When x0 = 54.9, the first bin is [54.9, 55.1) and is empty. When x0 = 55.0, the first bin is changed to [55.0, 55.2); now two observations fall into this bin. Finally, when x0 = 55.1, the first bin is changed to [55.1, 55.3), and 2 + 4 = 6 observations fall into this bin. The point is that, by shifting bins, we abruptly change the number of observations falling into each bin, and the result is a wild variation in the overall shape of the histogram.
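The origin-shift effect is easy to reproduce numerically. The observations below are hypothetical (Table 4.5 is not reproduced here), chosen only to match the first-bin counts quoted in the example: two values in [55.1, 55.2), four in [55.2, 55.3), plus a few larger ones.

```python
# Hypothetical data consistent with the counts quoted in Example 4.5
data = [55.15, 55.15, 55.25, 55.25, 55.25, 55.25, 55.45, 55.55]
h = 0.2

def first_bin_count(data, x0, h):
    """Number of observations falling into the first bin [x0, x0 + h)."""
    return sum(1 for x in data if x0 <= x < x0 + h)

# Shifting the origin by 0.1 changes the first-bin count from 0 to 2 to 6
for x0 in (54.9, 55.0, 55.1):
    print(x0, first_bin_count(data, x0, h))
```

The same data yield first-bin counts of 0, 2, and 6 as the origin moves, which is exactly the abrupt reshuffling of observations among bins that distorts the histogram in Fig. 4.4.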
Table 4.5 Data for Example 4.5.
Fig. 4.4 Effect of shifting the origin of histograms for the data in Table 4.5.
The example above, although somewhat pathological, shows that histograms are an innocent-looking graphical tool that may actually be dangerous if used without care, especially with small datasets. Still, they are useful for getting an intuitive picture of the distribution of data. For our purposes, the essential point is that the histogram of relative frequencies is a first intuitive clue leading to the idea of a probability distribution.
We close this section by defining two concepts that will be useful in the following.
DEFINITION 4.3 (Order statistics) Let X1, X2, …, Xn be a sample of observed values. If we sort these data in increasing order, we obtain the order statistics, which are denoted by X(1), X(2), …, X(n). The smallest observed value is X(1) and the largest observed value is X(n).
DEFINITION 4.4 (Outliers) An outlier is an observation that looks quite apart from the other ones. This may be a very unlikely value, or an observation that actually comes from a different population.
Example 4.6 Consider observations
Ordering these values yields the following order statistics:
The last value is quite apart from the remaining ones and is a candidate outlier.
Spotting an outlier is a difficult task, and the very concept looks quite arbitrary. There are statistical procedures to classify outliers in a sensible and objective manner, but in practice we need to dig a bit deeper into the data to figure out why a value looks so different. It may be the result of a data entry error or a wrong observation, in which case the observation should be eliminated. It could be just an unlikely observation, in which case eliminating it may result in a dangerous underestimation of the actual uncertainty. In other cases, we may be mixing observations from what are actually different populations. If we take observations of a variable in small towns and then throw New York into the pool, an outlier is likely to result.
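One common statistical procedure of the kind alluded to above (not described in this text; it is the fence used in boxplots) flags points that fall farther than 1.5 interquartile ranges outside the quartiles. A sketch in Python, on a made-up sample with one suspicious value:

```python
import statistics

def iqr_outliers(data, k=1.5):
    """Flag points outside [Q1 - k*IQR, Q3 + k*IQR], the usual boxplot fence."""
    q1, _, q3 = statistics.quantiles(data, n=4)   # quartiles of the sample
    iqr = q3 - q1
    lo, hi = q1 - k * iqr, q3 + k * iqr
    return [x for x in data if x < lo or x > hi]

# Hypothetical sample: the value 60 sits far from the rest
print(iqr_outliers([12, 14, 13, 15, 14, 13, 16, 15, 60]))
```

Such a rule only nominates candidates; as the text stresses, deciding whether a flagged value is an error, a rare but genuine observation, or a member of a different population still requires looking into the data.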