So far, we have considered the organization and representation of data in one dimension, but in applications we often observe multidimensional data. Of course, we may list summary measures for each single variable, but this would miss an important point: the relationship between different variables.
Such relationships raise issues concerning independence, correlation, etc. Here we want to get acquainted with those concepts in the simplest way. To begin with, let us consider bidimensional categorical data. To represent data of this kind, we may use a contingency table.
Example 4.18 Consider a sample of 500 married couples, where both husband and wife are employed. We collect information about yearly salary. For each person in the sample, we collect categorical information about the gender. The quantitative information about salary is transformed into categorical information by asking: Is salary less or more than $30,000? We could express this as “high” and “low” income. We should not just disaggregate couples into two separate samples of 500 males and 500 females, as we could miss some information about the interactions between the two categorical variables. The contingency table in Table 4.9 is able to capture information about interactions. Armed with the contingency table, we may ask a few questions:
Table 4.9 Contingency table for qualitative data
- What is the probability that a randomly selected female is high-income? We have 500 wives in the sample, and 36 + 54 = 90 are high-income. Then, the desired probability is
  \[ \frac{90}{500} = 0.18. \]
- What is the probability that a randomly selected person, of whatever gender, is low-income? Note that we have 500 + 500 = 1000 persons, since there are 500 pairs in the sample. We have 212 pairs in which both members are low-income, and 198 + 36 pairs in which exactly one of them is low-income. Hence, we should take the following ratio:
  \[ \frac{2 \times 212 + 198 + 36}{1000} = \frac{658}{1000} = 0.658. \]
- If we pick a couple at random, what is the probability that the wife is low-income, assuming that the husband is low-income? Apparently, this is a tough question, but we may find the answer using a little intuition. There are 212 + 36 = 248 pairs in which the husband is low-income. We should restrict the sample to this subset and take the ratio
  \[ \frac{212}{248} \approx 0.855, \]
  since the wife is low-income in 212 out of these 248 pairs.
- If we pick a couple at random, what is the probability that the wife is low-income, assuming that the husband is high-income? Using the same idea as in the previous question, we restrict the sample to the 198 + 54 = 252 pairs in which the husband is high-income and get
  \[ \frac{198}{252} \approx 0.786. \]

Fig. 4.12 Two scatterplots illustrating data dependence.
In Section 5.3 we will see how the last two questions relate to the fundamental concept of conditional probability.
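The four calculations of Example 4.18 can be sketched in a few lines of Python. The cell counts below are an assumption: they are inferred from the figures quoted in the example (212, 36, 198, and 54, which add up to 500 couples), and the names `counts` and `p_wife_low_given_husband` are introduced only for illustration.

```python
# Cell counts inferred from the numbers quoted in Example 4.18 (an assumption,
# not a copy of Table 4.9). Keys are (husband_income, wife_income).
counts = {
    ("low", "low"): 212,
    ("low", "high"): 36,
    ("high", "low"): 198,
    ("high", "high"): 54,
}

n_couples = sum(counts.values())  # 500 couples, hence 1000 persons

# 1. A randomly selected wife is high-income.
p_wife_high = sum(c for (h, w), c in counts.items() if w == "high") / n_couples

# 2. A randomly selected person (either gender) is low-income:
#    each couple contributes 0, 1, or 2 low-income persons.
low_persons = sum(c * ((h == "low") + (w == "low")) for (h, w), c in counts.items())
p_person_low = low_persons / (2 * n_couples)


# 3. and 4. Wife is low-income, with the sample restricted to a given husband income.
def p_wife_low_given_husband(husband_income):
    subset = {k: c for k, c in counts.items() if k[0] == husband_income}
    return subset[(husband_income, "low")] / sum(subset.values())


print(p_wife_high)                                  # 0.18
print(p_person_low)                                 # 0.658
print(round(p_wife_low_given_husband("low"), 4))    # 0.8548
print(round(p_wife_low_given_husband("high"), 4))   # 0.7857
```

Restricting the dictionary to the rows with a given husband income mirrors the idea of restricting the sample to a subset before taking the ratio.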
If we need to represent pairs of quantitative variables, we might aggregate them in classes and prepare corresponding contingency tables. A possibly more useful representation is the scatterplot, which is better suited to investigate the relationships between pairs of variables. In a bidimensional scatterplot, points are drawn corresponding to observations, which are pairs of values; coordinates are given by the values taken by the two variables in each observation. In Fig. 4.12 two radically different cases are illustrated. In scatterplot (a), we can hardly claim that the two variables have a definite relationship, since no pattern is evident; points look completely random. Scatterplot (b) is quite another matter, as it seems that there is indeed some association between the two variables; we could even imagine drawing a line passing through the data. We may take advantage of this kind of association by building linear regression models.
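The contrast between the two scatterplots can be mimicked numerically. The sketch below generates synthetic data (hypothetical, not the data behind Fig. 4.12) for a patternless case and for a case with a linear trend plus noise; it then computes the sample correlation coefficient and a least-squares line as two possible summaries of the association. The helper names `pearson` and `fit_line`, the sample size, and the trend parameters are all assumptions made for illustration.

```python
import random

random.seed(0)  # fixed seed so the synthetic data are reproducible


def pearson(xs, ys):
    """Sample correlation coefficient of two equally long lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5


def fit_line(xs, ys):
    """Least-squares intercept a and slope b of the line y = a + b*x."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    b = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum(
        (x - mx) ** 2 for x in xs
    )
    return my - b * mx, b


# Synthetic observations: each (x, y) pair is one point of a scatterplot.
x = [random.uniform(0, 10) for _ in range(200)]
y_a = [random.uniform(0, 10) for _ in x]                  # case (a): unrelated to x
y_b = [2.0 + 0.8 * xi + random.gauss(0, 1) for xi in x]   # case (b): linear trend plus noise

r_a, r_b = pearson(x, y_a), pearson(x, y_b)
intercept_b, slope_b = fit_line(x, y_b)
print(f"case (a): r = {r_a:.2f}")
print(f"case (b): r = {r_b:.2f}, fitted line y = {intercept_b:.2f} + {slope_b:.2f} x")
```

A correlation close to zero in case (a) and close to one in case (b) is the numerical counterpart of what the eye sees in the two plots; the fitted line is the one we could imagine drawing through the points.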
Contingency tables and scatterplots work well in two dimensions, but what if we have 10 or even more dimensions? How can we visualize data in order to discern potentially interesting associations and patterns? These are challenging issues dealt with within multivariate statistics. One possibility is trying to generalize the analysis for two dimensions. For instance, we may arrange several scatterplots according to a matrix, one plot for each possible pair of variables. As you may imagine, these graphical approaches can be useful in low-dimensional cases but they are not fully satisfactory. More refined alternatives are based, e.g., on the following approaches:
- We can reduce data dimensionality by considering combinations of variables or underlying factors.
- We can try to spot and classify patterns by cluster analysis.
These data reduction methods include principal component analysis, factor analysis, and cluster analysis.
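To get a feeling for why a matrix of scatterplots becomes unwieldy as dimensions grow, we may simply count the distinct pairs of variables, each of which requires its own plot. The following is a minimal sketch; the variable names `x1`, ..., `x10` and the helper `scatterplot_pairs` are hypothetical.

```python
from itertools import combinations


def scatterplot_pairs(variables):
    """All unordered pairs of variables, one scatterplot per pair."""
    return list(combinations(variables, 2))


# Hypothetical variable names, used only for illustration.
variables = [f"x{i}" for i in range(1, 11)]
pairs = scatterplot_pairs(variables)
print(len(pairs))   # 45
print(pairs[:3])    # [('x1', 'x2'), ('x1', 'x3'), ('x1', 'x4')]
```

With n variables there are n(n − 1)/2 pairs, so 10 variables already call for 45 plots, which is one reason to look for the more refined approaches listed above.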
Problems
4.1 You are carrying out research on how many pizzas are consumed by teenagers in the age range from 13 to 17. A sample of 20 boys/girls in that age range is taken, and the number of pizzas eaten per month is given in the following table:
- Compute mean, median, and standard deviation.
- Is there any odd observation in the dataset? If so, get rid of it and repeat the calculation of mean and median. Which one is more affected by an extreme value?
4.2 The following table shows a set of observed values and their frequencies:
- Compute mean, variance, and standard deviation.
- Find the cumulated relative frequencies.
4.3 You observe the following data, reporting the number of daily emergency calls received by a firm providing immediate repair services for critical equipment:
4.4 Management wants to investigate the time it takes to complete a manual assembly task. A sample of 12 workers is timed, yielding the following data (in seconds):
- Find mean and median; do you think that the data are skewed?
- Find the standard deviation.
- What is the percentile rank of the person who took 20.1 seconds to complete the task?
- Find the quartiles, using the second approach that we have described for the calculation of percentiles.
- Suppose that management wants to define an acceptable threshold, based on the 90% percentile; all workers taking more than this time are invited to a training session to improve their performance. Find this percentile using the interpolation method.
4.5 Professors at a rather unknown but large college have developed a habit of heavy drinking to forget about their students. The following data show the number of hangovers since the beginning of semester, disaggregated for male and female professors:
- Given that the professor is a female, what is the probability that she had a hangover twice or more during the semester?
- What is the probability that a professor is a male, given that he had a hangover once or less during the semester?