A rather general answer to this question is that statistics is a group of methods to collect, analyze, present, and interpret data (and possibly to make decisions). We often consider statistics as a branch of mathematics, but this is the result of a more recent tendency. From a historical perspective, the term “statistics” stems from the word “state.” Originally, the driving force behind the discipline was the need to collect data about population and economy, a need that was felt in the city-states of Venice and Florence during the Renaissance. Many governments did the same in the following centuries. Then statistics took a more quantitative turn, mainly under the impulse of French mathematicians. As a consequence, statistics became more intertwined with the theory of probability, a development that was not free from controversy.
Over time, many statistical tools have been introduced, and they are often regarded as a collection of cookbook recipes, which may result in quite some confusion. A good starting point for bringing some order is to draw the line between two related subbranches:
- Descriptive Statistics consists of methods for organizing, displaying, and describing data by using tables, graphs, and summary measures.
- Inferential Statistics consists of methods that use sampling to help make decisions or predictions about a population.
To better understand the role of sampling, we should introduce the following concepts.
DEFINITION 4.1 (Population vs. sample) A population consists of all elements (individuals, items, etc.) whose characteristics are being studied. A sample is a portion of the population, which is selected for study.
To get the point, it suffices to reflect a bit on the cost and the time required to carry out a census of the whole population of a state, e.g., to figure out average household income. A much more common occurrence is a sample survey. For the study to be effective, the sample must be representative of the whole population. If you sample people in front of a big investment bank, you are likely to get a misleading picture, as the sample is probably biased toward a very specific type of individual.
Example 4.1 One of the best-known examples of bad sample selection is the 1936 presidential election poll by the Literary Digest. According to this poll, the Republican governor of Kansas, Alf Landon, would beat the incumbent president, Franklin Delano Roosevelt, by 57–43%. The sample size was not tiny at all, as the Digest mailed over 10 million questionnaires and over 2.3 million people responded. The real outcome was quite different, as Roosevelt won with 62% of the vote. One of the reasons commonly put forward to explain such a blunder is that many respondents were selected from lists of automobile and telephone owners. Arguably, a selection process like that would be acceptable nowadays, but at the time the sample was biased towards relatively wealthy people, which in turn resulted in a bias towards Republican voters.
A sample drawn in such a way that each element in the target population has a chance of being selected is called a random sample. If the chance of being selected is the same for each element, we speak of a simple random sample.1
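As a rough sketch of the idea (the income figures below are made up for illustration), Python's standard library can be used to draw a simple random sample, in which each element has the same chance of being picked:

```python
import random

# Hypothetical population of household incomes (made-up values)
population = [24_000, 31_500, 47_200, 52_800, 61_000, 75_300, 88_900, 120_000]

# Simple random sample of size 3: sampling without replacement,
# every element has the same probability of being selected
sample = random.sample(population, k=3)
print(sample)
```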
Household income is an example of a variable. A variable is a characteristic of each member of the population, and below we discuss different types of variable we might be interested in. Income is a quantitative variable, and we may want some information about average income of the population. The average income of the population is an example of a parameter. Typically, we do not know the parameters characterizing a whole population, and we have to resort to some form of estimate. If we use sampling, we have to settle for the average income of the sample, which is a statistic. The statistic can be used to estimate the unknown parameter.
If sampling is random, whenever we repeat the experiment, we get different results, i.e., different values of the resulting statistic. If the results show wide swings, any conclusion that we draw from the study cannot be trusted. Intuition suggests that the larger the sample, the more reliable the conclusions. Furthermore, if the individuals in the population are not too different from one another, the sample can be small. In the limit, if all of the individuals were identical, any one of them would make a perfect sample. But if there is much variability within the population, a large sample must be taken. In practice, we need some theoretical background to properly address issues related to the size of the sample and the reliability of the conclusions we get from sampling, especially if such conclusions are the basis of decision making. In inferential statistics, we will consider such issues in detail. By contrast, basic descriptive statistics does not strictly rely on such sophisticated concepts. However, probability theory is best understood by using descriptive statistics as a motivation. Descriptive statistics is quite useful when conducting an exploratory study, i.e., when we want to analyze data to see whether an interesting pattern emerges, suggesting some hypothesis or line of action. When a confirmatory analysis is carried out, to check a hypothesis, inferential statistics comes into play.
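To make the intuition about sampling variability and sample size concrete, here is a small simulation sketch (the population values are synthetic, generated only for illustration): it repeatedly draws samples of different sizes and measures how widely the resulting sample means swing around the population mean.

```python
import random
import statistics

random.seed(42)

# Synthetic population of 10,000 "household incomes" (for illustration only)
population = [random.gauss(50_000, 15_000) for _ in range(10_000)]
mu = statistics.mean(population)  # the population parameter, normally unknown

def spread_of_sample_means(n, repetitions=200):
    """Draw many samples of size n and measure how much the sample mean varies."""
    means = [statistics.mean(random.sample(population, n)) for _ in range(repetitions)]
    return statistics.stdev(means)

print(f"population mean: {mu:,.0f}")
for n in (10, 100, 1000):
    print(f"sample size {n:>4}: spread of sample means = {spread_of_sample_means(n):,.0f}")
```

Larger samples yield sample means that cluster more tightly around the population mean; running the same experiment on a less variable population would show tighter clustering for every sample size.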
Table 4.1 Illustrating types of variable.
4.1.1 Types of variable
If we are sampling a population to figure out average household income, we are considering income as the variable of interest.
DEFINITION 4.2 (Variables and observations) A variable is a characteristic under study, which assumes different values for different elements of a population or a sample. The value of a variable for an element is called an observation or measurement.
This definition is illustrated in Table 4.1, where hypothetical data are shown. An anonymous person is characterized by weight, height, marital status, and number of children. Variables are arranged in columns, and each observation corresponds to a row. We immediately see differences between those variables. A variable can be
- Quantitative, if it can be measured numerically
- Qualitative or categorical, otherwise
Clearly, weight and number of children are quantitative variables, whereas marital status is not. Other examples of categorical variables are gender, hair color, or make of a computer.
If we look more carefully at quantitative variables in the table, we see another difference. You cannot have 2.1567 children; this variable is restricted to a set of discrete values, in this case integer numbers. On the contrary, weight and height can take, in principle, any value. In practice, we truncate those numbers to a suitable number of significant digits, but they can be considered as real numbers. Hence, quantitative variables should be further classified as
- Discrete, if the values they can take are countable (number of cars, number of accidents, etc.)
- Continuous, if they can assume any value within an interval (length, weight, time, etc.)
We will generally associate discrete variables with integer numbers, and continuous variables with real numbers, as this is by far the most common occurrence. However, this is not actually a rule. For instance, we could consider a discrete variable that can take two real values such as ln(18) or 2π. We should also avoid the strict identification of “a variable that can take an infinite number of values” with a continuous variable. It is true that a continuous variable restricted to a bounded interval, e.g., [2, 10], can assume an infinite number of values, but a discrete variable can take an infinite number of integer values as well.2 For instance, if we consider the number of accidents that occurred on a highway in one month, there is no natural upper bound on them, and this should be regarded as a variable taking integer values i = 1, 2, 3, …, even though very large values are (hopefully) quite unlikely.
The classification looks pretty natural, but the following examples show that sometimes a little care is needed.
Example 4.2 (Dummy and nominal variables) Marital status is clearly a qualitative variable. However, in linear regression models it is quite common to associate such variables with the binary values 1 and 0, which typically correspond to yes/no or true/false. In statistics, such a variable is often called a “dummy” variable. The interpretation of these numerical values is actually arbitrary and depends on the modeler’s choice. It is often the case that numerical values are attached to categorical variables for convenience, but we should consider these as nominal variables. A common example is given by the Standard Industrial Classification (SIC) codes.3 You might be excited to discover that SIC code 1090 corresponds to “Miscellaneous Metal Ores” and 1220 to “Bituminous Coal & Lignite Mining.” No disrespect intended to industries in these sectors and, most importantly, no one should think that the second SIC code is really larger than the first one, thereby implying some ranking among them.
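As a minimal sketch of the idea (the data and the 0/1 coding below are made up, and in practice a library function such as pandas' get_dummies would do the same job), a categorical variable can be mapped to dummy variables as follows:

```python
# Made-up observations of a categorical variable
marital_status = ["single", "married", "married", "single", "divorced"]

# A single yes/no dummy: 1 = "married", 0 = otherwise (an arbitrary coding choice)
is_married = [1 if status == "married" else 0 for status in marital_status]
print(is_married)  # [0, 1, 1, 0, 0]

# One dummy per category; the resulting 0/1 values carry no ordering whatsoever
categories = sorted(set(marital_status))
dummies = {cat: [1 if status == cat else 0 for status in marital_status]
           for cat in categories}
print(dummies)
```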
The example above points out a fundamental feature of truly numerical variables: they are ordered, whereas categorical variables cannot really be ordered, even though they may be associated with numerical (nominal but not ordinal) values. We cannot double a qualitative or a nominal variable, can we? But even doubling a quantitative variable is trickier than we may think.
Example 4.3 Imagine that, between 6 and 11 a.m., the temperature on a given day rises from 10°C to 20°C. Can we say that the temperature has doubled? It is tempting to say yes, but imagine that you measure temperature in Fahrenheit, rather than Celsius, degrees. In this case, the two temperatures are 50°F and 68°F, respectively,4 and their ratio is certainly not 2.
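The arithmetic is easy to verify, for instance with a couple of lines of Python:

```python
def celsius_to_fahrenheit(c):
    return 9 / 5 * c + 32

t1_c, t2_c = 10, 20
t1_f, t2_f = celsius_to_fahrenheit(t1_c), celsius_to_fahrenheit(t2_c)

print(t2_c / t1_c)  # 2.0  -> the ratio on the Celsius scale
print(t2_f / t1_f)  # 1.36 -> the ratio on the Fahrenheit scale (68/50)
```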
What is wrong with the last example is that the origin of the temperature scale is actually arbitrary. On the contrary, the origin of a scale measuring the number of children in a family is not arbitrary. We conclude this section with an example showing, once again, that the same quantity may be modeled in different ways, using different types of variable.
Example 4.4 (What is time?) Time is a variable that plays a fundamental role in many models. Which type of variable should we use to represent time?
Time as a continuous variable. From a “philosophical” point of view, time is continuous. If you consider two time instants, you can always find a time instant between them. Indeed, time in physics is usually represented by a real number. Many useful models in finance are also based on a continuous representation of time, as this may result in handy formulas.5
Time as a discrete variable. Say that we are in charge of managing the inventory of some item, which is ordered at the end of each week. Granted, time is continuous, but from our perspective what really matters is demand during each week. We could model demand by a variable like d_t, where the subscript t refers to weeks 1, 2, 3, …. In this model, time is discretized because of the structure of the decision-making process. We are not interested in demand second by second. In the EOQ model (see Section 2.1) we treated time as a continuous variable, because the demand rate was constant. In real life, demand is unlikely to be constant, and time must be discretized in order to build a manageable model. Indeed, quite often time is discretized to come up with a suitable computational procedure to support decisions.6
Time as a categorical variable. Consider daily sales at a retail store. Typically, demand on Mondays is lower than average, maybe because the store is closed in the morning. Demand on Fridays is greater, and it explodes on Saturdays, probably because most people are free from work on weekends. We observe similar seasonal patterns in the ratings of TV programs and in the consumption of electrical energy.7 We could try to analyze the statistical properties of d_Mon, d_Fri, etc. In this case, we see that the “Monday” and “Friday” subscripts do not correspond to ordered time instants, as there are different weeks, each one with a Monday and a Friday. The time subscript in this case actually plays the role of a categorical variable.
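As a small illustration of this last case (the sales figures below are invented), daily observations can be grouped by weekday, treating the weekday label as a category rather than as an ordered number:

```python
from collections import defaultdict
from statistics import mean

# Made-up daily sales records: (weekday, units sold)
sales = [("Mon", 80), ("Tue", 120), ("Fri", 160), ("Sat", 240),
         ("Mon", 90), ("Tue", 115), ("Fri", 170), ("Sat", 230)]

# Group the observations by the categorical "weekday" variable
by_weekday = defaultdict(list)
for day, units in sales:
    by_weekday[day].append(units)

# Average sales per weekday; the categories have no meaningful numerical ordering
for day, values in by_weekday.items():
    print(day, mean(values))
```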
Time can be modeled in different ways, and the choice among them may depend on the purpose of the model or on computational convenience.