The median m is a value such that 50% of the observed values are smaller than or equal to it. In this section we generalize the idea to an arbitrary percentage. We could ask which value is such that 80% of the observations are smaller than or equal to it. Or, seeing things the other way around, we could ask what is the relative standing of an observed value. In Section 1.2.1 we anticipated quite practical motivations for asking such questions, which are of interest in measuring the service level in a supply chain or the financial risk of a portfolio of assets. The key concept is the set of cumulative (relative) frequencies.

DEFINITION 4.9 (Cumulative relative frequencies) Consider a sample of n observations Xi, i = 1, …, n, and group them into m classes, corresponding to each distinct observed value yk, k = 1, …, m. Classes are sorted in increasing order with respect to values: yk < yk+1. If fk is the frequency of class k, the cumulative frequency of the corresponding value is the sum of all the frequencies up to and including that value:

$$F_k = \sum_{j=1}^{k} f_j$$

By the same token, given relative frequencies pk = fk/n, we define cumulative relative frequencies:

$$P_k = \sum_{j=1}^{k} p_j = \frac{F_k}{n}$$

For the sake of simplicity, when no ambiguity arises, we often speak of cumulative frequencies, even though we refer to the relative ones.
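As a concrete illustration of Definition 4.9, the following minimal Python sketch (ours, not part of the original text) tabulates frequencies, relative frequencies, and cumulative relative frequencies for a small sample of made-up values:

```python
from collections import Counter

def cumulative_frequencies(sample):
    """Tabulate f_k, p_k, F_k, and P_k for each distinct value y_k in a sample."""
    n = len(sample)
    counts = Counter(sample)            # f_k for each distinct value y_k
    F, P, rows = 0, 0.0, []
    for y in sorted(counts):            # classes sorted so that y_k < y_{k+1}
        f = counts[y]
        F += f                          # cumulative frequency F_k
        P += f / n                      # cumulative relative frequency P_k
        rows.append((y, f, f / n, F, P))
    return rows

# Hypothetical sample, just to show the output format
for y, f, p, F, P in cumulative_frequencies([2, 1, 2, 3, 1, 2, 5, 4, 3, 2]):
    print(f"y = {y}   f = {f}   p = {p:.2%}   F = {F}   P = {P:.2%}")
```

Note that the last printed cumulative relative frequency is 100.00%, as it must be.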

Example 4.14 Consider the data in Table 4.8, which displays frequencies, relative frequencies, and cumulative frequencies for a dataset of 47 observations taking values in the set {1, 2, 3, 4, 5}. Summing the frequencies yields the cumulative frequencies:

$$F_1 = f_1 = 11, \qquad F_2 = f_1 + f_2 = 11 + 15 = 26, \qquad \dots, \qquad F_5 = 47$$

Table 4.8 Illustrating cumulative frequencies.

Value   Frequency   Rel. freq.   Cum. freq.   Cum. rel. freq.
1       11          23.40%       11           23.40%
2       15          31.91%       26           55.32%
…       …           …            …            …
Total   47          100.00%

[partial reconstruction; the rows for values 3, 4, and 5 are not recoverable from the text, and the bar charts of Fig. 4.9 are not reproduced]

Fig. 4.9 Bar charts of relative frequencies and cumulative relative frequencies for the data in Table 4.8.

Cumulative relative frequencies are computed by adding relative frequencies:

$$P_1 = p_1 = 23.40\%, \qquad P_2 = \frac{26}{47} = 55.32\%, \qquad \dots, \qquad P_5 = 100\%$$

Since relative frequencies add up to 1, the last cumulative relative frequency must be 1 (or 100%). Since relative frequencies cannot be negative, cumulative frequencies form a nondecreasing sequence; this is also illustrated in Fig. 4.9. Incidentally, the percentages in the first two rows of Table 4.8 may look wrong: if we add up the first two relative frequencies, 23.40% and 31.91%, we obtain 55.31%, whereas the second cumulative relative frequency in the last column of the table is 55.32%. This is just an effect of rounding, since 26/47 = 55.319% rounds to 55.32%; such apparent inconsistencies are common when displaying cumulative frequencies.

Cumulative frequencies are related to a measure of the relative standing of a value yk, the percentile rank. A small example illustrates why one might be interested in percentile ranks.

Example 4.15 In French universities, grades are assigned on a numerical scale whose upper bound is 20, the minimum passing grade being 10. A student has just passed a tough exam and is preparing to ask her parents for a well-deserved bonus, like a brand-new motorcycle, a more powerful flamethrower, or maybe a trip to visit museums abroad. Unfortunately, her parents do not share her enthusiasm: her grade is just 16, whereas the maximum is 20. Since 16 is only slightly above the midpoint of the range of passing grades (15), they argue that she is just a bit above average. She should do much better to earn a bonus! How can she defend her position?

As the saying goes, everything is relative. If she got the highest grade in the class, her claim is reasonable. Or maybe only 2 classmates out of 70 earned a higher grade. What she needs to show is that she is near the top of the distribution of grades, i.e., that a large percentage of students earned a worse grade.

The percentile rank of an observation Xi could be defined as the fraction of observations which are less than or equal to Xi:

$$\frac{b + e}{n} \times 100\%$$

where b is the number of observations below Xi and e is the number of observations equal to Xi. This definition is just based on the cumulative relative frequency corresponding to value Xi. However, there might be a little ambiguity. Imagine that all of the students in the example above have received the same grade, 16 out of 20. Using this definition, the percentile rank would be 100%. This is the same rank that our friend would get if she were the only student with a 16 out of 20, with everyone else lagging far behind. A definition that does not discriminate between these two quite different cases is debatable indeed. We could argue that if everyone has received 16 out of 20, then the percentile rank for everyone should be 50%. Hence, we could consider the alternative definition of the percentile rank of Xi as

$$\frac{b + 0.5\,e}{n} \times 100\%$$

which accounts for observations equal to Xi in a slightly different way. Some well-known spreadsheets use yet another definition, which eliminates the number of observations equal to Xi:

$$\frac{b}{b + a} \times 100\%$$

where b is the number of observations strictly below Xi, as before, and a is the number of observations strictly above Xi. We see that, sometimes, descriptive statistics is based on concepts that are a bit shaky; however, for a large dataset, the above ambiguity is often irrelevant in practice.
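To make the comparison concrete, here is a small Python sketch (ours, not the book's) computing all three definitions for a given observation; note that the spreadsheet-style formula is undefined when every observation is tied:

```python
def percentile_ranks(sample, x):
    """Percentile rank of x in a sample, under the three definitions above."""
    n = len(sample)
    b = sum(1 for v in sample if v < x)        # strictly below x
    e = sum(1 for v in sample if v == x)       # equal to x
    a = sum(1 for v in sample if v > x)        # strictly above x
    r1 = 100 * (b + e) / n                     # fraction <= x
    r2 = 100 * (b + 0.5 * e) / n               # ties get half weight
    r3 = 100 * b / (b + a) if b + a else None  # spreadsheet style; 0/0 if all tied
    return r1, r2, r3

print(percentile_ranks([16] * 70, 16))           # (100.0, 50.0, None)
print(percentile_ranks(list(range(1, 71)), 68))  # x near the top of 1, ..., 70
```

With 70 identical grades, the first definition yields 100% and the second 50%, in line with the discussion above.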

Now let us go the other way around. Given a value, we may be interested in its percentile rank, which is related to a cumulative frequency; given a relative frequency, which is a percentage, we may ask what the corresponding value is. Essentially, we are inverting the mapping between values and cumulative frequencies. The value corresponding to a given percentage is called a percentile. For instance, the median is just the 50th percentile, and we want to generalize the concept. Unfortunately, there is no standard definition of a percentile, and software packages might yield slightly different values, especially for small datasets. In the following, we illustrate three possible approaches. None is definitely better than the others, and the choice may depend on the application.

Approach 1. Let us start with an intuitive definition. Say that we want to find the kth percentile. What we should do, in principle, is sort the n observations to get the order statistics X(j), j = 1, …, n. Then, we should find the value corresponding to position kn/100. Since this ratio is not an integer in general, we might round it to the nearest integer. To illustrate, consider again the dataset (4.2), which we repeat here for convenience:

[dataset (4.2), consisting of 12 observations, is not reproduced here]

What is the 42nd percentile? Since we have only 12 observations, one possible approach relies on the following calculation:

$$\frac{42 \times 12}{100} = 5.04 \approx 5$$

so that we should take the 5th element (80.2). Sometimes, it is suggested to add 0.5 to the ratio above before rounding. Doing so, we are sure that at least 42% of the data are less than or equal to the corresponding percentile. Note that in the sample above all of the observed values are distinct. The example below illustrates the case of repeated values.
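The following minimal Python sketch (our illustration) computes the position of the kth percentile under approach 1; adding 0.5 before rounding amounts to rounding the position up:

```python
import math

def percentile_pos(n, k, safe=False):
    """Approach 1: position of the kth percentile among n sorted observations."""
    ratio = k * n / 100
    if safe:
        # rounding up guarantees that at least k% of the data are
        # less than or equal to the selected observation
        return math.ceil(ratio)
    return round(ratio)  # plain rounding to the nearest position

print(percentile_pos(12, 42))             # 5 -> take X(5), i.e., 80.2
print(percentile_pos(12, 42, safe=True))  # 6 -> take X(6) to be on the safe side
```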

Example 4.16 Let us consider again the inventory management problem of Section 1.2.1. For convenience, we repeat here the cumulative frequencies of each value of observed demand:

[table of cumulative frequencies for the 20 observed demand values, not reproduced here]

Imagine that we want to order a number of items such that we satisfy the whole demand in at least 85% of the cases. If we trust the observed data, we should find an 85th percentile. Since we have 20 observations, we could look at the value X(j), where j = 85 × 20/100 = 17. Looking at the disaggregated data, we see that X(17) = 4, and there is no need for rounding. With large datasets, though, it might be easier to work with cumulative frequencies. There is no value whose cumulative frequency is exactly 85%; what we may do, however, is take the first value such that its cumulative frequency is at least 85%, which again leads us to order four items. This example has two notable features:

  • We want to be “on the safe side.” We have a minimal service level, the probability of satisfying all customers, that we want to ensure. Hence, it makes sense to round positions up. If the minimal service level were 79%, then j = 79 × 20/100 = 15.8; looking at the order statistics, we see that X(15) = 3 and X(16) = 4, and we should order four items to be on the safe side (see the sketch after this list).
  • In many cases, including this one, percentiles should correspond to decisions; since we can only order an integer number of items, we need a percentile that is an integer number.
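Here is a minimal sketch of the round-up rule (again ours; since the demand data are not reproduced above, we only compute positions):

```python
import math

n = 20  # number of demand observations in Example 4.16
for level in (85, 79):
    j = math.ceil(level * n / 100)  # round the position up to be on the safe side
    print(f"service level {level}% -> order statistic X({j})")
# 85% -> X(17) and 79% -> X(16); in the example both equal 4, so we order 4 items
```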

Approach 2. Rounding positions may be a sensible procedure, but it is not consistent with the definition of the median that we considered before. For an even number of observations, we defined the median as the average of two consecutive values. If we want to be consistent with that approach, we may define the kth percentile as a value such that:

  1. At least kn/100 observations are less than or equal to it.
  2. At least (100 − k)n/100 observations are greater than or equal to it.

For instance, if n = 22 and we are looking for the 80th percentile, we want a value such that at least 80 × 22/100 = 17.6 observations are less than or equal to it, which means that we should take X(18); on the other hand, at least (100 − 80) × 22/100 = 4.4 values should be larger than or equal to it, and this requirement, too, leads us to the 18th observation in ascending order. Hence, we see that in this case we just compute a position and round it up. However, when kn/100 is an integer, two observations satisfy the above requirements. Indeed, this happens if we look for the 75th percentile and n = 32: since 75 × 32/100 = 24, both X(24) and X(25) meet the two requirements stated above. So, we may take their average, which is exactly what happens when calculating the median of an even number of observations.
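A small Python sketch of approach 2 (our illustration, assuming 0 < k < 100 and already sorted data):

```python
import math

def percentile_a2(sorted_data, k):
    """Approach 2: the kth percentile, consistent with the usual median."""
    n = len(sorted_data)
    h = k * n / 100
    if h == int(h):
        j = int(h)
        # two order statistics qualify: average them, as for the median
        return 0.5 * (sorted_data[j - 1] + sorted_data[j])
    return sorted_data[math.ceil(h) - 1]  # otherwise, round the position up

data = list(range(1, 33))       # hypothetical sample: 1, 2, ..., 32
print(percentile_a2(data, 75))  # 24.5, the average of X(24) = 24 and X(25) = 25
print(percentile_a2(data, 50))  # 16.5, i.e., the median for even n
```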

Approach 3. Considering the two methods above, which is the better one? Actually, it depends on our aims. Approach 2 does not make sense if the percentile we are looking for must be a decision restricted to an integer value, as in Example 4.16. Furthermore, with approach 1 we are sure that the percentile will be an observed value, whereas with approach 2 we may get a value that has not been observed. This is critical with integer variables, but if we are dealing with a continuous variable, an unobserved value makes perfect sense. Indeed, there is a third approach, which can be used with continuous variables and is based on interpolating values, rather than rounding positions. The idea can be summarized as follows:

  1. The sorted data values are taken to be the 100(0.5/n), 100(1.5/n), …, 100([n − 0.5]/n) percentiles.
  2. Linear interpolation is used to compute percentiles for percent values between 100(0.5/n) and 100([n − 0.5]/n).
  3. The minimum or maximum values in the dataset are assigned to percentiles for percent values outside that range.

Let us illustrate linear interpolation with a toy example.

Example 4.17 We are given the dataset

[the five sorted observations: 3.4, 7.2, 8.3, …, 12.5]

According to the procedure above, 3.4 is taken as the 10th percentile, 7.2 as the 30th percentile, and so on up to 12.5, which is taken as the 90th percentile. If we ask for the 5th percentile, the procedure yields 3.4, the smallest observation. If we ask for the 95th percentile, the procedure yields 12.5, the largest observation.

Things are more interesting for the 42nd percentile. This should be somewhere between 7.2, which corresponds to 30%, and 8.3, which corresponds to 50%. To see exactly where, in Fig. 4.10 we plot the cumulative frequency as a function of the observed values, and we draw a line joining the two observations. How much should we move along the line from value 7.2 toward value 8.3? The length of the interval is

$$8.3 - 7.2 = 1.1$$

and we should move by a fraction of that interval, given by

$$\frac{42 - 30}{50 - 30} = \frac{12}{20} = 0.6$$

Hence, the percentile we are looking for is

$$7.2 + 0.6 \times 1.1 = 7.86$$
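The scheme is easy to code. The Python sketch below is ours, not the book's; the fourth observation of the toy dataset is elided above, so we fill it with an arbitrary placeholder, which does not affect the 42nd percentile (only the second and third order statistics matter there):

```python
def percentile_interp(sorted_data, k):
    """Approach 3: interpolate between plotting positions 100*(j - 0.5)/n."""
    n = len(sorted_data)
    pos = [100 * (j - 0.5) / n for j in range(1, n + 1)]
    if k <= pos[0]:
        return sorted_data[0]    # clamp to the minimum below the first position
    if k >= pos[-1]:
        return sorted_data[-1]   # clamp to the maximum above the last position
    for j in range(1, n):
        if k <= pos[j]:
            frac = (k - pos[j - 1]) / (pos[j] - pos[j - 1])
            return sorted_data[j - 1] + frac * (sorted_data[j] - sorted_data[j - 1])

data = [3.4, 7.2, 8.3, 10.0, 12.5]  # 10.0 is a made-up placeholder
print(percentile_interp(data, 42))  # 7.86 (up to floating-point rounding)
print(percentile_interp(data, 5))   # 3.4 (clamped to the smallest observation)
print(percentile_interp(data, 95))  # 12.5 (clamped to the largest observation)
```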

The above choice of the lowest and highest percentiles might look debatable, but it reflects our lack of knowledge about what may happen below X(1) and above X(n). Note that if we have a large dataset, X(1) will be taken as a lower percentile than in the example, reflecting the fact that with more observations we are more confident about the lowest value that observations may take; however, we can seldom claim that a value below X(1) cannot be observed. Similar considerations apply to X(n). In fact, there is little we can do about extreme values unless we superimpose a theoretical structure based on probability theory.


Fig. 4.10 Finding percentiles by interpolation.

In practice, whichever approach we use, provided that it makes sense for the type of variable we are dealing with and for the purpose of the analysis, it will not significantly influence the result for a large dataset. When dealing with probability theory and random variables, we will introduce a closely related concept, the quantile, which does have a standard definition.

