The need for testing a hypothesis about an unknown parameter arises in many problems of inferential statistics. There are general and powerful ways to build appropriate procedures for testing hypotheses, which we outline in Section 9.10. Since they require some level of mathematical sophistication, we offer here an elementary treatment that is strongly linked to the way we built confidence intervals for the expected value of a normal distribution. To get a grasp of the underlying issues, we begin with hypotheses about the expected value of a normal distribution. We will generalize a bit in Section 9.4, where we consider tests on the difference between the means of two populations, as well as on variances, proportions, and correlation. In Section 9.6 we show how Analysis of Variance allows us to deal with more than two populations. A solid understanding of these topics is more than adequate for most business practitioners. Let us start with a little numerical example.
Example 9.10 Consider the data listed in Table 9.2 and the following claims:
- The sample comes from a normal distribution, i.e., from a normal population.
- The expected value of this distribution, or the population mean, if you prefer, is μ0 = 5.
How can we check the truth of these claims? At present, we have absolutely no tool with which to verify that a sample comes from a normal distribution. We will consider the issue later, in Section 9.5, but we should note that it involves not just a few parameters, but the whole distribution. Since a nonparametric test looks a bit hard, let us take normality for granted, at least for now. Since we already know a bit about estimating an expected value, we can try checking the second claim. Probably, the first step should be the calculation of the standard sample statistics, i.e., the sample mean X̄ and the sample standard deviation S.
Table 9.2 Sample data for hypothesis testing.
We see that the sample mean is quite different from μ0 = 5. This could lead us to reject the claim. However, we also see that variability is quite large and that the sample size is rather small. The discrepancy between X̄ and μ0 could be explained in either of two ways:
- The claim is false, i.e., μ ≠ 5.
- The claim is true, but the sample mean differs from the expected value because of randomness in sampling.
In this case we are somewhat embarrassed. If the sample statistics were, say, the same sample mean paired with S = 0.5, we would feel rather confident that the claim is false. If, on the contrary, the sample mean were very close to 5, we would probably say that the small discrepancy is not statistically significant. But what exactly does statistically significant mean, anyway?
Questions like this one are quite relevant in practice, and to check such claims properly we need to rely on sensible statistical procedures. In hypothesis testing we state a hypothesis concerning an unknown parameter, in this case the expected value. This hypothesis should be checked against an alternative. For instance, in our first example we should test H0 : μ = 5 against the alternative Ha : μ ≠ 5. More generally, we may test the hypothesis

H0 : μ = μ0

for a given μ0, against the alternative hypothesis

Ha : μ ≠ μ0.
The first hypothesis, H0, is called the null hypothesis. The somewhat odd name is justified by the fact that, in many practical settings, if the null hypothesis is true then we should just sit and do nothing. It is when the hypothesis is rejected that we should act before it is too late, or that we have discovered an interesting fact or some unexpected pattern in the data. Clearly, this is just an interpretation that need not always hold. Note that the null and the alternative hypotheses are complementary and jointly cover all of the possible values of μ. In the following, we will also consider other kinds of hypotheses, such as

H0 : μ ≤ μ0,    Ha : μ > μ0,

or

H0 : μ ≥ μ0,    Ha : μ < μ0.
Note again that the two hypotheses are mutually exclusive and cover all of the possible values of μ. The following examples illustrate why the above forms may arise in practical contexts.
Example 9.11 Consider the manufacturing process for producing shafts that must fit into a hole. Ideally, the diameter of all of the shafts should be exactly as specified by product design; in practice, a slight discrepancy is unavoidable. However, we do not want a diameter larger than specified, as the shaft would not fit the hole; on the other hand, we do not want a diameter smaller than specified either, as this may result in dangerous vibrations. Taking a random sample of items, we should check that the average diameter is sufficiently close to the nominal value. If the sample mean is too large or too small, then we should probably act quickly in order to bring the manufacturing process back under control.
Example 9.12 Unlike in the previous example, there are cases in which the situation is not symmetric. Consider, for instance, the concentration of a water pollutant: if the measured concentration is below the maximum level allowed, no one will complain. We should act only if we find that the concentration exceeds a certain danger threshold; if there is less pollutant than acceptable, no one would add any just to hit that threshold (hopefully). By the same token, there are cases in which we should act when the average is below some level. No consumer association would complain if it found more candies than expected in a box. (The viewpoint of the producer is different!)
Whatever hypothesis we test, it might be true or false. If we accept a true hypothesis or we reject a false one, then we did well. Unfortunately, there are two types of errors that we could make:
- We commit a type I error if we reject a true hypothesis.
- We commit a type II error if we accept a false hypothesis.
The probability of an error is something that we have to accept, as in real life we will (almost) never know the truth; the best we can do is to keep the probability of errors as small as possible. Which kind of error is the most relevant one depends on the practical situation at hand and on the cost of making each of the two mistakes. Indeed, in practice, this also influences the way in which the null hypothesis is stated. Such considerations notwithstanding, we will see shortly that it is definitely easier to control the probability of a type I error. Keeping this probability low means that we require quite strong evidence against the null hypothesis before rejecting it. An unfortunate consequence is that if we reject the null hypothesis only when we are pretty sure, the probability of a type II error tends to increase.11 The best way to get the point is to keep in mind the following sound principle:12
You are innocent until proven guilty.
Clearly, if we take such an attitude, we realize that sometimes we might lack enough evidence to reject the null hypothesis, even if it is very suspicious. This means that the probability of a type II error will increase. Indeed, it is sometimes better to say that we “fail to reject” the null hypothesis, as we are not really accepting it.13 Incidentally, this raises an immediate issue: How can we prove something using hypothesis testing? Of course, we cannot really prove anything using sampled data, but we can collect strong statistical evidence supporting a claim. However, this claim should not be taken as the null hypothesis. We should take the negation of our claim as the null hypothesis H0; then, if we are able to reject H0, this will provide good support for our idea.14
To make all of the above ideas operational, we have to be more specific. As we said, we consider first a test concerning the expected value of a normal distribution, where we want to keep the probability of a type I error reasonably low. Such an error occurs if the null hypothesis is true and we reject it by mistake. Let us start with the null hypothesis H0 : μ = μ0, tested against the alternative Ha : μ ≠ μ0. If the null hypothesis is true, and the population is normal, then the test statistic

TS = (X̄ − μ0) / (S / √n)

has a t distribution with n − 1 degrees of freedom. We already used this result when deriving a confidence interval, but please note a fundamental difference:
Fig. 9.3 A rejection region for hypothesis testing.
In this case we know the value μ0, which is provided by the null hypothesis. Now let us follow an intuitive route:
- Reasonably, we should reject H0 if the sample mean is “far” from μ0 (either larger or smaller), which implies that the statistic TS will be “far” from 0, i.e., too large (positive) or too small (negative).
- Then we could find a positive critical value t* such that we reject the null hypothesis if TS < −t* or TS > t*, and we fail to reject H0 if |TS| ≤ t*. Given the symmetry of the t distribution, there is no point in defining two different thresholds for negative or positive values.
- The critical value t* defines two regions:
- A rejection region C, such that if the statistic is in that region, TS ∈ C, then we reject the null hypothesis.
- An “acceptance” region, such that we fail to reject the null hypothesis if TS ∉ C.
- Actually, because of sampling errors, it might well be the case that the test statistic falls in the rejection region by pure chance; this, however, should be relatively unlikely. We want a suitably small probability α of a type I error, say, α = 0.05. This value is often called the significance level. Then it is easy to see that by setting t* = t1−α/2, n−1 we obtain a rejection region associated with a probability of type I error given by α. The idea is illustrated in Fig. 9.3. If the null hypothesis is true, the test statistic will fall in the acceptance region with probability 1 − α:

Pμ0{ −t1−α/2, n−1 ≤ TS ≤ t1−α/2, n−1 } = 1 − α,

where we use the notation Pμ0 to emphasize that we compute this probability under the probability measure assumed in H0. The rejection region consists of two tails, each associated with a probability α/2; even if the null hypothesis is true, because of sampling variability there is a probability α/2 that the standardized sample mean falls in the right tail (X̄ is much larger than μ0) and a probability α/2 that it falls in the left tail (X̄ is much smaller than μ0). If, in such a circumstance, we reject H0, the probability of a type I error is α, putting the two tails together.
Wrapping everything up, the procedure prescribes the following conditions for a given significance level α:

Reject H0 if |TS| > t1−α/2, n−1;    fail to reject H0 if |TS| ≤ t1−α/2, n−1.
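As a minimal sketch of this decision rule (the function name and the choice of Python with scipy are purely illustrative, not part of the original procedure), the whole test boils down to a few lines:

```python
import math
from scipy import stats  # only needed for the quantiles of the t distribution

def two_tail_t_test(sample, mu0, alpha):
    """Two-tail test of H0: mu = mu0 at significance level alpha."""
    n = len(sample)
    xbar = sum(sample) / n                                          # sample mean
    s = math.sqrt(sum((x - xbar) ** 2 for x in sample) / (n - 1))   # sample standard deviation
    ts = (xbar - mu0) / (s / math.sqrt(n))                          # test statistic TS
    t_crit = stats.t.ppf(1 - alpha / 2, n - 1)                      # critical value t_{1-alpha/2, n-1}
    return ts, t_crit, abs(ts) > t_crit                             # last value is True if H0 is rejected
```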
A test like this is called a two-sided or two-tail test, as the rejection region consists of two tails. Let us illustrate with an example.
Example 9.13 Consider again the data in Table 9.2 and the null hypothesis H0 : μ = 5, against the alternative Ha : μ ≠ 5. As we observed, the sample statistics seem to contradict the claim, but we should test this carefully. Assume that we choose a significance level α = 0.1. The test statistic is

TS = (X̄ − 5) / (S / √10),

which for our sample yields |TS| = 0.8308.
Indeed, TS ≠ 0, but we do not know yet whether this is really significant. Since the sample size is n = 10, we have 9 degrees of freedom; to find the critical value that draws the line between acceptance and rejection, we may consult statistical tables and obtain t1−α/2, n−1 = t0.95,9 = 1.8331. Once again, note that since the rejection region consists of two tails, we split its total probability α between the two tails; this is quite similar to what we do when calculating confidence intervals. Since |TS| = 0.8308 < 1.8331, we cannot reject the hypothesis at that significance level; see Fig. 9.4.

What would happen with a smaller probability of type I error? We would fail to reject the null hypothesis again, since decreasing the significance level makes us even more conservative. For instance, if we set α = 0.05, we should compare the test statistic against the quantile t0.975,9 = 2.2622. The rejection region then consists of two smaller tails, and we would fail to reject again, since TS is still in the acceptance region. In order to reject the null hypothesis, we should increase the probability of a type I error. If we set α = 0.5, the quantile marking the rejection region is t0.75,9 = 0.7027. In this case, |TS| = 0.8308 > 0.7027, i.e., the test statistic falls in the rejection region. However, we then have a very large probability of a type I error; with α = 50%, we are basically flipping a coin.
Fig. 9.4 Checking rejection, given a test statistic.
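Incidentally, the quantiles used above are easy to check with any statistical software; here is a quick sketch (again in Python with scipy, an arbitrary choice) that reproduces the comparisons for the three significance levels considered in the example:

```python
from scipy import stats

# critical values t_{1-alpha/2, 9} for the significance levels used in Example 9.13
for alpha in (0.1, 0.05, 0.5):
    t_crit = stats.t.ppf(1 - alpha / 2, 9)
    print(f"alpha = {alpha}: critical value = {t_crit:.4f}, reject = {0.8308 > t_crit}")
# the critical values are 1.8331, 2.2622, and 0.7027, so H0 is rejected only for alpha = 0.5
```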
One could wonder which value of α exactly draws the line between acceptance and rejection. Using suitable software for evaluating the CDF of the t distribution, we find that

P{T9 > 0.8308} = 0.2138,

where T9 denotes a t random variable with 9 degrees of freedom. Hence, doubling this probability because the test is two-tail,

P{|T9| > 0.8308} = 2 × 0.2138 = 0.4276,

and if we choose α > 42.76%, then we reject the null hypothesis. This leads to the p-value concept, which is discussed later.
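The same figure is easily checked in a couple of lines; a small sketch follows (once more in Python with scipy, where 0.8308 is the value of the test statistic from the example):

```python
from scipy import stats

tail_prob = 1 - stats.t.cdf(0.8308, 9)   # P{T_9 > 0.8308}, approximately 0.2138
p_value = 2 * tail_prob                  # two-tail test: the tail probability is doubled
print(p_value)                           # approximately 0.4276; reject H0 only if alpha exceeds this value
```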
To summarize, we failed to reject H0. Can we say that we accept it? Well, this is a bit of a philosophical question, but one would probably not say that such a sample mean really supports H0 : μ = 5. All we can say is that the evidence is not strong enough to reject it. By the way, I can tell you that in this hypothetical case the null hypothesis was indeed true; the sample was obtained by running a generator of pseudorandom variates, sampling a normal distribution with expected value 5 and standard deviation 10. In real life, you will never know.
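Since we know, in this artificial setting, that the data were generated from a normal distribution with expected value 5 and standard deviation 10, it is instructive to replicate the experiment many times and check how often a true null hypothesis is rejected by pure chance. The following simulation sketch (the seed and the number of replications are arbitrary choices) should produce a rejection frequency close to the chosen significance level:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)          # arbitrary seed, for reproducibility
alpha, n, n_reps = 0.1, 10, 100_000
t_crit = stats.t.ppf(1 - alpha / 2, n - 1)

rejections = 0
for _ in range(n_reps):
    sample = rng.normal(loc=5.0, scale=10.0, size=n)     # H0: mu = 5 is true by construction
    ts = (sample.mean() - 5.0) / (sample.std(ddof=1) / np.sqrt(n))
    rejections += abs(ts) > t_crit
print(rejections / n_reps)               # should be close to alpha = 0.1, the type I error probability
```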
The careful reader will probably find some close connection between what we have learned about confidence intervals and hypothesis testing. Indeed, it turns out that the testing procedure above could also be carried out in terms of confidence intervals. What we could do is:
- Compute a confidence interval with confidence level 1 − α
- Reject the null hypothesis if μ0 does not fall inside the confidence interval
We prefer to avoid this way of reasoning in order to reinforce the basic concepts about hypothesis testing. Furthermore, as we show in the next section, this way of thinking is not that helpful when the form of the null hypothesis is different, leading to a one-tail test.
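For the sake of completeness, here is a sketch of the confidence-interval route, if only to make the equivalence concrete for the two-tail case (the function name is illustrative):

```python
import math
from scipy import stats

def reject_by_ci(sample, mu0, alpha):
    """Reject H0: mu = mu0 iff mu0 lies outside the (1 - alpha) confidence interval."""
    n = len(sample)
    xbar = sum(sample) / n
    s = math.sqrt(sum((x - xbar) ** 2 for x in sample) / (n - 1))
    half_width = stats.t.ppf(1 - alpha / 2, n - 1) * s / math.sqrt(n)
    return not (xbar - half_width <= mu0 <= xbar + half_width)
```

For a two-tail test this yields exactly the same decision as comparing |TS| with the critical value, since |TS| > t1−α/2, n−1 holds precisely when μ0 falls outside the interval X̄ ± t1−α/2, n−1 · S/√n.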