Consider the following questions:
- Given a set of characteristics, such as age, sex, cholesterol level, and smoking habits, what is the probability that a person will die of heart disease within the next 10 years?
- Given household income, number of children, education level, etc., what is the probability that a household will subscribe to a package of telephone/Web services?
- Given occupation, age, education, income, loan amount, etc., what is the probability that a homeowner will default on his mortgage payments?
All of these questions could be addressed by building a statistical model, possibly a regression model, but they have a troubling feature in common: The response variable is either 0 or 1, where we interpret 0 as “it did not happen” and 1 as “it happened.” So far, we have considered the possibility of regression models with qualitative variables as regressors, but here it is the regressed variable that is qualitative, and represented by a Bernoulli random variable taking values 1 or 0 with probabilities π and 1 − π, respectively. We could generalize the problem to a multinomial variable, taking values within a discrete and finite set, but for the sake of simplicity we will stick to the simple binary case.
One possibility for building a statistical model relying on linear regression would be a relationship such as

p = β0 + β1 x1 + β2 x2 + ··· + βq xq + ε,     (16.10)

where p is the probability of the event of interest, and ε is an error whose probability distribution must be chosen. Note that we are not using a binary variable as the response variable, since we are not predicting the occurrence of the event, but its probability. Predicting the occurrence directly is possible, e.g., using linear discriminant analysis, whereby we identify a linear combination βTx and a threshold level γ, and we predict Y = 1 if βTx ≥ γ, and Y = 0 otherwise.
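Such a threshold rule is easy to state in code. The following minimal Python sketch uses a hypothetical coefficient vector β, observation x, and threshold γ; none of these values come from a fitted model, they only illustrate the rule.

```python
import numpy as np

beta = np.array([0.3, -1.2, 0.7])   # hypothetical coefficients
x = np.array([1.5, 0.4, 2.0])       # one observation of the regressors
gamma = 0.5                         # hypothetical threshold level

# Predict the event (Y = 1) when the linear score clears the threshold
y_pred = 1 if beta @ x >= gamma else 0
print(y_pred)
```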
However, for the purpose of forecasting, estimating a probability may have some advantages, so we will pursue this alternative approach. To fit a model such as (16.10), we could use a dataset in which Yi ∈ {0, 1} and apply familiar OLS. Unfortunately, this idea suffers from a significant drawback: when the event is very likely or very unlikely, the fitted response may well be larger than 1 or smaller than 0, respectively, and then it cannot be interpreted as a probability. So, we cannot take such a simplistic linear regression approach. To overcome the difficulty, we may adopt a nonlinear transformation of the output, so that it can be sensibly interpreted as a probability. This may be accomplished by using the logistic function

F(z) = e^z / (1 + e^z),
Fig. 16.1 The logistic function.
which is plotted in Fig. 16.1. This is where the term logistic regression comes from. The nonlinear transformation allows us to express the probability p as

p = exp(β0 + β1 x1 + ··· + βq xq) / [1 + exp(β0 + β1 x1 + ··· + βq xq)].     (16.12)
Now the probability p is correctly bounded within the interval [0, 1]. To gain a better insight into the meaning of the parameters βj, we need a more convenient form, which is obtained by solving Eq. (16.12) for the exponential:

p / (1 − p) = exp(β0 + β1 x1 + ··· + βq xq).
Taking logarithms of both sides and adding an error term, we obtain the statistical model

log[ p / (1 − p) ] = β0 + β1 x1 + ··· + βq xq + ε.     (16.13)
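To see the effect of the transformation concretely, here is a minimal Python sketch (using NumPy, with hypothetical coefficients β0 = −2 and β1 = 0.8 for a single regressor): whatever the value of the linear combination, the resulting p always lies strictly between 0 and 1.

```python
import numpy as np

def logistic(z):
    """Logistic function: maps any real z into the interval (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical coefficients for a model with a single regressor x1
beta0, beta1 = -2.0, 0.8

for x1 in (-10.0, -2.0, 0.0, 2.0, 10.0):
    p = logistic(beta0 + beta1 * x1)
    print(f"x1 = {x1:6.1f}  ->  p = {p:.4f}")  # always within (0, 1)
```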
The ratio p/(1 − p) provides us with equivalent information in terms of odds. Odds are well known to people engaged in betting. When the odds are “1 to 1,” this means that the ratio is 1/1, which in turn implies

p / (1 − p) = 1,   i.e.,   p = 0.5.

When we say “it’s 10 to 1 that….,” we mean that the event is quite likely; indeed:

p / (1 − p) = 10,   i.e.,   p = 10/11 ≈ 0.91.
The logarithm of the odds ratio is known as the logit function. From Eq. (16.13) we can interpret the parameter βj as the increment in the logit, i.e., the logarithm of the odds ratio, for a unit increment in xj.
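The link between odds, probabilities, and coefficients can be checked with a short sketch; the coefficient value 0.8 below is purely illustrative and not taken from any fitted model.

```python
import numpy as np

def odds_to_prob(odds):
    """Convert odds p/(1 - p) into the probability p."""
    return odds / (1.0 + odds)

print(odds_to_prob(1.0))    # "1 to 1"  -> p = 0.5
print(odds_to_prob(10.0))   # "10 to 1" -> p = 10/11, roughly 0.91

# A unit increment in xj adds betaj to the logit, i.e., it multiplies
# the odds by exp(betaj).
beta_j = 0.8                # hypothetical coefficient
print(np.exp(beta_j))       # odds multiplier per unit increment in xj
```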
Now we need to find a way to fit the regression coefficients against a set of observations xi and Yi ∈ {0, 1}, i = 1, …, n. The nonlinear transformation operated by the logistic function precludes a straightforward application of least squares; the common approach to estimating a logistic regression model relies on maximum-likelihood parameter estimation. To build the likelihood function, we observe again that Yi is the realization of a Bernoulli variable, which may be regarded as a binomial variable when only one experiment is carried out. Hence

P(Yi = yi) = pi^yi (1 − pi)^(1 − yi),   yi ∈ {0, 1},
where we write yi as a number, since we are referring to the observed value of the random variable Yi. The probability pi depends on the observation xi and on the parameters β. Of course, we assume independence of the errors, so the observations are independent and the likelihood function is just the product of the individual probabilities:

L(β) = ∏_{i=1}^n pi^yi (1 − pi)^(1 − yi).
The task of maximizing L can be somewhat simplified by taking its logarithm:

log L(β) = ∑_{i=1}^n [ yi log pi + (1 − yi) log(1 − pi) ].
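As an illustration of the maximum-likelihood approach, the following sketch builds a small synthetic dataset (the “true” coefficients and the sample size are arbitrary assumptions) and maximizes the log-likelihood numerically by minimizing its negative with scipy.optimize.minimize.

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(42)

# Synthetic data: n observations, an intercept column, and two regressors
n = 500
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
beta_true = np.array([-0.5, 1.0, -2.0])          # assumed "true" parameters
p_true = 1.0 / (1.0 + np.exp(-X @ beta_true))
y = rng.binomial(1, p_true)                      # Bernoulli responses

def neg_log_likelihood(beta):
    """Negative of log L(beta); minimizing it maximizes the likelihood."""
    z = X @ beta
    # yi*log(pi) + (1 - yi)*log(1 - pi) simplifies to yi*z - log(1 + e^z)
    return -np.sum(y * z - np.logaddexp(0.0, z))

res = minimize(neg_log_likelihood, x0=np.zeros(3), method="BFGS")
print(res.x)   # maximum-likelihood estimates, close to beta_true
```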
Efficient nonlinear programming algorithms are available to maximize the log-likelihood numerically and estimate a logistic regression model. Commercial statistical packages are widely available to carry out the task.
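Open-source alternatives exist as well; for instance, the Python library statsmodels provides such a routine. The following is a minimal sketch on a small synthetic dataset, assuming the design matrix already contains the intercept column; it is only meant to show the mechanics, not a definitive workflow.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(42)

# Small synthetic dataset, purely for illustration
n = 500
X = sm.add_constant(rng.normal(size=(n, 2)))   # intercept + two regressors
beta_true = np.array([-0.5, 1.0, -2.0])        # assumed "true" parameters
y = rng.binomial(1, 1.0 / (1.0 + np.exp(-X @ beta_true)))

result = sm.Logit(y, X).fit()    # maximizes the log-likelihood numerically
print(result.params)             # estimated coefficients
print(result.summary())          # standard errors, z-statistics, etc.
```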