RANDOM SAMPLES AND SAMPLE STATISTICS

Inferential statistics relies on random samples. There are many ways to take a sample:

  • Given a large population, we may administer a questionnaire to a relatively small sample of randomly selected individuals; alternatively, we may structure the sample in such a way that it is representative of the overall population.
  • Given a stream of manufactured items, we may take a random sample of them to check for defects; alternatively, we may carefully plan experiments in order to assess the impact of factors related to product design or manufacturing technology.
  • Sometimes, rather than working with a real system, we have to rely on a computer-based model reflecting randomness in the real world, and carry out Monte Carlo simulation experiments to assess system performance.
  • Finally, sometimes we have a set of data and we must make the best of it, as it would be too costly or impossible to collect more. Destructive tests, like experiments with car crashes, are rather costly, but at least they can be carefully planned in order to squeeze out as much information as possible. When dealing with a time series of stock prices, we cannot set experimental conditions at will to check the impact of controlled factors. We have to rely on the observed history, pretending that it came from random sampling, and that’s it.

The last point should be stressed: Quite often we assume that observed data have been generated by a process that we represent as a statistical model. If we want to use tools from probability and statistics, this is a necessary step. However, we should always keep in mind that whatever conclusion we come up with, it is only as good as the underlying model, which can be a rather drastic simplification of reality. Hence, it is important to formalize what we mean by random sample, in order to have a clear picture of the assumptions that statistical tools rely on; the validity of these assumptions should be carefully checked on a case-by-case basis.

DEFINITION 9.1 (Random sample) A random sample is a sequence X1, X2, …, Xn of independent and identically distributed (i.i.d.) random variables. Each element Xi in the sample is referred to as an observation, and n is the sample size.
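The definition can be made concrete with a short sketch. The population distribution below (normal, with mean 10 and standard deviation 2) and the sample size are purely illustrative assumptions; the point is that each observation Xi is an independent draw from the same distribution.

```python
import random
import statistics

# Illustrative assumption: a normal population with mean 10 and
# standard deviation 2. Each call to random.gauss is an independent,
# identically distributed observation X_i.
random.seed(42)  # for reproducibility
n = 100
sample = [random.gauss(10, 2) for _ in range(n)]

# The sample is a list of n realizations; summary measures such as
# the sample mean can then be computed from it.
print(len(sample))              # sample size n
print(statistics.mean(sample))  # close to the population mean 10
```

Running the sketch again with a different seed would produce a different sample, and hence a different sample mean; this is exactly the randomness that Definition 9.2 below captures.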

It is very important to stress the role of independence in this definition. All of the concepts introduced here depend critically on this assumption. It may well be the case that there is correlation within a real-life sample, and a blind application of naive statistical procedures may lead to erroneous conclusions and a possible business disaster. Furthermore, we also assume that the data are somewhat homogeneous, since they are identically distributed. Clearly, if the data have been observed under completely different settings, the conclusions we draw from their analysis may be severely flawed.

Example 9.1 Consider performing a quality check on manufactured items. This typically consists of the measurement of a few quantities for each item, which should conform to given specifications. There may be small variability in such measurements that results from pure randomness and does not significantly affect quality as perceived by customers. Let us denote by Xk the measured value for item k. If we consider this as the realization of a random variable, can we say that these variables are independent? Well, it may be the case that, due to tool wear, the machine starts producing a sequence of items that do not meet the specifications; then, the values we observe are not really independent. The conditional probability that item k + 1 is defective, given that item k is defective, can be larger than the unconditional probability of producing a defective item.
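The tool-wear effect in Example 9.1 can be sketched by letting defect states follow a two-state Markov chain, so that consecutive items are not independent. The transition probabilities below are illustrative assumptions, not values from the text.

```python
import random

# Illustrative assumption: defects persist once tool wear sets in.
random.seed(0)
p_defect_after_good = 0.05    # P(item defective | previous item good)
p_defect_after_defect = 0.60  # P(item defective | previous item defective)

T = 200_000
defective = [False]
for _ in range(T - 1):
    p = p_defect_after_defect if defective[-1] else p_defect_after_good
    defective.append(random.random() < p)

# Unconditional defect rate vs. rate conditional on a previous defect
uncond = sum(defective) / T
after_defect = [b for a, b in zip(defective, defective[1:]) if a]
cond = sum(after_defect) / len(after_defect)
print(uncond, cond)  # the conditional rate is much larger
```

The simulated conditional probability is close to 0.60, far above the unconditional defect rate, which is exactly the dependence pattern described in the example.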

Example 9.2 Sometimes, the assumption that variables are identically distributed may be wrong. Imagine taking measurements of whatever you like within a population consisting of men and women. If gender has a significant impact on the measured variable, it may be a gross mistake to attribute variability to randomness.

Example 9.3 Consider daily demand dt, t = 1, …, T, for an item. Considering this as a sequence of i.i.d. variables may be a gross mistake as well. It may be the case that demands on two consecutive days are negatively correlated; if customers buy many items on day t, they might not buy any more until their inventory has been depleted. Furthermore, at retail stores it may well be the case that sales on Saturdays are much larger than sales on Tuesdays. Such seasonal patterns are commonly observed, in which case data should be deseasonalized before they are analyzed.1
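The seasonal pattern in Example 9.3 can be sketched as follows. The weekday demand levels are hypothetical; the sketch estimates a multiplicative seasonal factor per weekday and divides it out, a simple form of deseasonalization.

```python
import random
import statistics

# Hypothetical weekly pattern: mean demand is much higher on
# "Saturday" (index 5) than on "Tuesday" (index 1). Treating d_t as
# i.i.d. would lump this systematic pattern into random noise.
random.seed(1)
base = [20, 15, 18, 22, 30, 60, 40]  # assumed mean demand per weekday
T = 7 * 200
demand = [max(0, round(random.gauss(base[t % 7], 3))) for t in range(T)]

# Estimate a multiplicative seasonal factor per weekday, then divide
# each observation by its weekday factor to deseasonalize.
overall = statistics.mean(demand)
factors = [statistics.mean(demand[d::7]) / overall for d in range(7)]
deseason = [demand[t] / factors[t % 7] for t in range(T)]

print(factors[5] > factors[1])  # Saturday factor exceeds Tuesday's
```

After deseasonalization the weekday means are roughly equal, so the remaining variability is a better candidate for modeling as random.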

If we take for granted that we are dealing with i.i.d. variables, we may under- or overestimate true variability. Assuming that we have a legitimate random sample, we use available data to estimate some quantity of interest, possibly a summary measure. By far, the most common such measure is the sample mean, but there are other measures that we may be interested in, like variance and correlation. Formally, given a random sample, we compute one or more sample statistics (not to be confused with Statistics itself).
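The danger of under- or overestimating variability can be illustrated with a sketch. The AR(1) model and its parameters below are illustrative assumptions: with positive correlation, the naive i.i.d. standard-error formula sigma / sqrt(n) understates the true variability of the sample mean.

```python
import random
import statistics

random.seed(2)

def ar1_sample(n, phi=0.8):
    """AR(1) series x_t = phi * x_{t-1} + eps_t, with eps_t ~ N(0, 1)."""
    x, out = 0.0, []
    for _ in range(n):
        x = phi * x + random.gauss(0, 1)
        out.append(x)
    return out

n, reps = 50, 2000
samples = [ar1_sample(n) for _ in range(reps)]

# True variability: standard deviation of the sample mean across
# repeated experiments.
true_se = statistics.stdev(statistics.mean(s) for s in samples)

# Naive estimate: the i.i.d. formula sigma / sqrt(n), averaged over
# the same samples.
naive_se = statistics.mean(statistics.stdev(s) / n ** 0.5 for s in samples)

print(naive_se < true_se)  # the naive formula underestimates
```

With negatively correlated data the inequality would be reversed, so blindly applying the i.i.d. formula can err in either direction.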

DEFINITION 9.2 (Statistic) A statistic is a random variable whose value is determined by a random sample.

In other words, a statistic is a function of a random sample and, as such, it is a random variable. This is quite important to realize; we use the most familiar statistic, the sample mean, to illustrate basic concepts and possible pitfalls.
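The fact that the sample mean is itself a random variable can be made tangible by repeating the sampling experiment many times. The exponential population with mean 5 is an illustrative assumption.

```python
import random
import statistics

# Repeat the sampling experiment: each repetition draws a fresh
# sample of size n from an assumed exponential population (mean 5)
# and records its sample mean.
random.seed(3)
n, reps = 30, 5000
means = [statistics.mean(random.expovariate(1 / 5) for _ in range(n))
         for _ in range(reps)]

# The realized sample means fluctuate around the population mean 5;
# their spread (the standard error) is roughly 5 / sqrt(30).
print(statistics.mean(means))
print(statistics.stdev(means))
```

Each repetition yields a different value of the statistic, which is why it makes sense to speak of the distribution of the sample mean.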
