Correlation analysis: testing significance and potential dangers

To estimate the correlation coefficient $\rho_{XY}$ between $X$ and $Y$, we may just plug the sample covariance $S_{XY}$ and the sample standard deviations $S_X$, $S_Y$ into its definition, resulting in the sample coefficient of correlation, or sample correlation for short:

$$r_{XY} = \frac{S_{XY}}{S_X S_Y} = \frac{\sum_{i=1}^{n} (X_i - \bar{X})(Y_i - \bar{Y})}{\sqrt{\sum_{i=1}^{n} (X_i - \bar{X})^2}\,\sqrt{\sum_{i=1}^{n} (Y_i - \bar{Y})^2}}$$

The factors $n - 1$ in $S_{XY}$, $S_X$, and $S_Y$ cancel each other, and it can be proved that $-1 \le r_{XY} \le +1$, just like its probabilistic counterpart $\rho_{XY}$.
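As a concrete illustration, here is a minimal Python sketch (not part of the original text) that computes the sample correlation directly from its definition; the helper name `sample_correlation` is ours, and the final comparison against NumPy's built-in `np.corrcoef` simply confirms that the $n - 1$ factors cancel as claimed:

```python
import numpy as np

def sample_correlation(x, y):
    """Sample correlation r_XY = S_XY / (S_X * S_Y); the n - 1 factors
    in the covariance and the standard deviations cancel, leaving only
    sums of deviations from the sample means."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    dx, dy = x - x.mean(), y - y.mean()
    return np.sum(dx * dy) / np.sqrt(np.sum(dx**2) * np.sum(dy**2))

rng = np.random.default_rng(42)
x = rng.normal(size=100)
y = 0.5 * x + rng.normal(size=100)
print(sample_correlation(x, y))   # always lies in [-1, +1]
print(np.corrcoef(x, y)[0, 1])    # matches the built-in estimator
```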

Once again, we stress that the estimator that we have just defined is a random variable depending on the random sample we take. Whenever we estimate a nonzero correlation coefficient, we should wonder whether it is statistically significant. Even if the "true" correlation is $\rho_{XY} = 0$, because of sampling variability it is quite unlikely that we get a sample correlation $R_{XY} = 0$. How can we decide if a nonzero sample correlation is really meaningful? A simple strategy is to test the null hypothesis

$$H_0: \rho_{XY} = 0$$

against the alternative hypothesis

$$H_a: \rho_{XY} \neq 0$$

However, we need a statistic whose distribution under the null hypothesis is fairly manageable. One useful result is that, if the sample is normal, the statistic

$$T = R_{XY} \sqrt{\frac{n - 2}{1 - R_{XY}^2}}$$

is approximately distributed as a t variable with n − 2 degrees of freedom, for a suitably large sample. In the following example we show how to take advantage of this distributional result.
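To make the test operational, the following Python sketch (our own, assuming NumPy and SciPy are available; the function name `correlation_t_test` is hypothetical) computes the $T$ statistic above and the corresponding two-sided $p$-value from the $t$ distribution with $n - 2$ degrees of freedom:

```python
import numpy as np
from scipy import stats

def correlation_t_test(x, y):
    """Test H0: rho_XY = 0 via T = R * sqrt((n - 2) / (1 - R^2)),
    which is t-distributed with n - 2 degrees of freedom under the
    null hypothesis (for a normal sample)."""
    n = len(x)
    r = np.corrcoef(x, y)[0, 1]
    t_stat = r * np.sqrt((n - 2) / (1 - r**2))
    p_value = 2 * stats.t.sf(abs(t_stat), df=n - 2)  # two-sided
    return r, t_stat, p_value
```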

Example 9.22 Say that you are convinced that there is a positive correlation between normal random variables $X$ and $Y$. You take a sample of 12 joint observations. What is the minimum value of sample correlation that you would find statistically significant? In this case, we should consider a one-sided test, with null hypothesis

$$H_0: \rho_{XY} \le 0$$

tested against the alternative one

$$H_a: \rho_{XY} > 0$$

Assume that we test with the standard significance level $\alpha = 5\%$. The degrees of freedom we should consider are $n - 2 = 10$, so the rejection region is $T > t_{0.05,10} = 1.812$. Now we have to transform this rejection region for the $T$ statistic into a rejection region for the sample correlation. Setting

$$T = R_{XY} \sqrt{\frac{n - 2}{1 - R_{XY}^2}} = t_{0.05,10} = 1.812$$

we solve for $R_{XY}$:

$$R_{XY} = \pm\sqrt{\frac{t_{0.05,10}^2}{t_{0.05,10}^2 + n - 2}} = \pm\sqrt{\frac{1.812^2}{1.812^2 + 10}} \approx \pm 0.4973$$

where, of course, we should take the positive root: any sample correlation larger than about 0.497 would be considered statistically significant at the 5% level.
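The same computation can be scripted; the short sketch below (ours, using SciPy's `stats.t.ppf` for the quantile) inverts the $T$ statistic to recover the threshold on $R_{XY}$:

```python
import numpy as np
from scipy import stats

n, alpha = 12, 0.05
t_crit = stats.t.ppf(1 - alpha, df=n - 2)   # t_{0.05,10} ~= 1.812
# Inverting T = R * sqrt((n - 2) / (1 - R^2)) gives
# R = t / sqrt(t^2 + n - 2); we keep the positive root.
r_min = t_crit / np.sqrt(t_crit**2 + n - 2)
print(r_min)                                # ~= 0.4973
```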

Correlation analysis is very useful, but as with any other tool, we must be well aware of its pitfalls and limitations in order to use it properly. These are a direct consequence of a few observations we made earlier, when analyzing the link between independence and lack of correlation. We repeat them here for convenience, along with some additional warnings.

Correlation measures association, not causation: Lurking variables A common misunderstanding is the confusion of correlation with causation. When $X$ and $Y$ are correlated, it is tempting to conclude that $X$ "causes" $Y$. This may be true, but it is knowledge of the phenomenon that allows us to draw such a conclusion. Correlation per se does not measure anything except a symmetric association. In fact, the definitions of covariance and correlation are symmetric, and this also applies to their sample counterparts: $R_{XY} = R_{YX}$. Therefore, it is not at all clear which variable is the cause and which one is the effect.

Sometimes, we might detect a spurious correlation between X and Y, which is actually the effect of another variable Z. To see how this may happen, imagine that there is indeed a causal relationship between Z and Y, where Z is the cause and Y the effect, which is reflected by a positive correlation between Z and Y. If Z is positively correlated with X, because of a pure noncausal association, we will detect a positive correlation between X and Y as well; however, this positive correlation does not reflect any cause-effect relationship. In such a case, we call Z a lurking variable.
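The mechanism is easy to reproduce numerically. In the following sketch (our own; the coefficients and sample size are arbitrary), $Z$ causes $Y$, while $X$ is merely associated with $Z$ and has no effect on $Y$; nevertheless, $X$ and $Y$ come out clearly correlated:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10_000
z = rng.normal(size=n)                    # lurking variable
x = 0.8 * z + rng.normal(size=n)          # associated with Z, no effect on Y
y = 0.8 * z + rng.normal(size=n)          # Z is the actual cause of Y
print(np.corrcoef(x, y)[0, 1])            # ~0.39: spurious X-Y correlation
# Removing the contribution of Z kills the association:
print(np.corrcoef(x - 0.8 * z, y - 0.8 * z)[0, 1])   # ~0
```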

Example 9.23 Consider the relationship between how much we spend in advertisements, measured by $X$, and demand, measured by $Y$. If we detect a positive correlation between $X$ and $Y$, we might be satisfied with our last marketing campaign. But now suppose that $Z$ measures the discount we offer to support promotional sales. Quite often, a marketing campaign involves increased advertising and a reduction in price to expand the customer base. So, it might be the case that the real cause of the increase in demand is the reduction in price.

Fig. 9.6 Scatterplot of a nonlinear relationship between X and Y.

Lurking variables are a quite common issue, and we may easily be led to wrong conclusions. We will see other examples in Section 16.2.1, when dealing with multiple regression models.

Correlation measures only linear associations We have already pointed out that, in general, lack of correlation does not imply independence. When the relationship between $X$ and $Y$ is nonlinear, the coefficient of correlation may fail to reflect this link at all. An example is shown in Fig. 9.6, where there is indeed a link between the two variables, but the sample correlation is practically zero. This happens because, when $Y$ is larger than its mean, $X$ can be either larger or smaller than its mean (see also Example 8.4). To overcome this difficulty, there are a few tricks that can be used (a small numerical illustration follows the list):

  • One possible strategy is a nonlinear transformation of variables. Sometimes, rather than considering $X$, we may take $\sqrt{X}$ or $\log X$. These nonlinear transformations are commonly used to develop statistical models.
  • Another possibility is to account for nonlinearity explicitly, relating X and Y by a nonlinear model; nonlinear regression is outlined in Section 16.4.
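To see both the failure of linear correlation and the transformation trick at work, consider this small numerical sketch (ours; the quadratic link mirrors the situation of Fig. 9.6, and squaring $X$ plays the role of the nonlinear transformation):

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(size=1000)
y = x**2 + 0.2 * rng.normal(size=1000)   # strong, but nonlinear, link

print(np.corrcoef(x, y)[0, 1])           # ~0: linear correlation misses it
print(np.corrcoef(x**2, y)[0, 1])        # ~1 after the transformation
```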

The King Kong effect Another danger of correlation analysis is the impact of a single "odd" observation. The issue is illustrated in Fig. 9.7. In Fig. 9.7(a) we see a random sample from a bivariate normal distribution with zero correlation ($\rho_{XY} = 0$). The sample correlation is very small, $R_{XY} = 0.0275$; this is not significant and is clearly compatible with sampling variability. Now assume that we add a single observation, $(X_k, Y_k) = (40, 40)$, which is quite far apart, as shown in Fig. 9.7(b). Now the sample correlation jumps to $R_{XY} = 0.7657$, because of the impact of the two large deviations $X_k - \bar{X}$ and $Y_k - \bar{Y}$ on the sample correlation. This is often called the "King Kong" or "Big Apple" effect. In practice, the King Kong effect might be the effect of an outlier, i.e., a measurement error or, possibly, an observation that actually belongs to a different population.

Fig. 9.7 An illustration of the King Kong effect.
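The effect is easy to replicate; in the sketch below (our own, with arbitrary sample size and seed), a single far-away observation turns an insignificant sample correlation into a large one:

```python
import numpy as np

rng = np.random.default_rng(2)
x = rng.normal(size=100)
y = rng.normal(size=100)             # independent: true rho_XY = 0
print(np.corrcoef(x, y)[0, 1])       # small, compatible with rho = 0

# Append a single "King Kong" observation (X_k, Y_k) = (40, 40):
x_k, y_k = np.append(x, 40.0), np.append(y, 40.0)
print(np.corrcoef(x_k, y_k)[0, 1])   # jumps close to 1
```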

