The method of maximum likelihood is an alternative approach for finding estimators in a systematic way. Imagine that a random variable X has a PDF characterized by a single parameter θ; we indicate this as f_X(x; θ). If we draw a sample of n i.i.d. variables from this distribution, the joint density is just the product of the individual PDFs:
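\[
f_{X_1,\dots,X_n}(x_1, \dots, x_n; \theta) = \prod_{i=1}^{n} f_X(x_i; \theta).
\]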
This is a function depending on the parameter θ. If we are interested in estimating θ, given a sample X_i = x_i, i = 1, …, n, we may swap the role of variables and parameters and build the likelihood function
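\[
L(\theta) = \prod_{i=1}^{n} f_X(x_i; \theta).
\]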
The shorthand notation L(θ) is used to emphasize that this is a function of the unknown parameter θ, for a given sample of observations. On the basis of this framework, intuition suggests that we should select the parameter yielding the largest value of the likelihood function. For a discrete random variable, the interpretation is more natural: We select the parameter maximizing the probability of what we have actually observed. For a continuous random variable, we cannot really speak of probabilities, but the rationale behind the method of maximum likelihood should be apparent: We try to find the best explanation of what we observe.
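Formally, the maximum-likelihood estimate is the value of θ that maximizes the likelihood over the admissible parameter set:
\[
\hat{\theta}_{\mathrm{ML}} = \operatorname*{arg\,max}_{\theta} \; L(\theta).
\]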
Example 9.37 Let us consider the PDF of an exponential random variable with parameter λ:
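\[
f_X(x; \lambda) = \lambda e^{-\lambda x}, \qquad x \ge 0.
\]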
Given the observed values x_1, …, x_n, the likelihood function is
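\[
L(\lambda) = \prod_{i=1}^{n} \lambda e^{-\lambda x_i} = \lambda^{n} \exp\!\left( -\lambda \sum_{i=1}^{n} x_i \right).
\]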
Quite often, rather than attempting direct maximization of the likelihood function, it is convenient to maximize its logarithm, i.e., the loglikelihood function
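\[
\ell(\theta) = \log L(\theta).
\]
Since the logarithm is a strictly increasing function, maximizing ℓ(θ) is equivalent to maximizing L(θ).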
The rationale behind this function is easy to grasp, since by taking the logarithm of the likelihood function, we transform a product of functions into a sum of functions:
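\[
\ell(\theta) = \sum_{i=1}^{n} \log f_X(x_i; \theta).
\]
In the exponential case, this boils down to
\[
\ell(\lambda) = n \log \lambda - \lambda \sum_{i=1}^{n} x_i.
\]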
The first-order optimality condition yields
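\[
\frac{d\ell(\lambda)}{d\lambda} = \frac{n}{\lambda} - \sum_{i=1}^{n} x_i = 0
\quad\Longrightarrow\quad
\hat{\lambda} = \frac{n}{\sum_{i=1}^{n} x_i} = \frac{1}{\bar{X}}.
\]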
The result is rather intuitive, since E[X] = 1/λ. In Problem 9.17, the reader is invited to check that the method of moments yields the same estimator.
If we must estimate multiple parameters, the idea does not change much. We just solve a maximization problem involving multiple variables.
Example 9.38 Let us consider maximum-likelihood estimation of the parameters θ_1 = μ and θ_2 = σ² of a normal distribution. Straightforward manipulation of the PDF yields the loglikelihood function
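\[
\ell(\mu, \sigma^2) = -\frac{n}{2}\log(2\pi) - \frac{n}{2}\log \sigma^2 - \frac{1}{2\sigma^2} \sum_{i=1}^{n} (x_i - \mu)^2.
\]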
The first-order optimality condition with respect to μ is
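\[
\frac{\partial \ell}{\partial \mu} = \frac{1}{\sigma^2} \sum_{i=1}^{n} (x_i - \mu) = 0
\quad\Longrightarrow\quad
\hat{\mu} = \frac{1}{n} \sum_{i=1}^{n} X_i = \bar{X}.
\]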
Hence, the maximum-likelihood estimator of μ is just the sample mean. If we apply the first-order condition with respect to σ², plugging in the estimator of μ just found (the sample mean), we obtain
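\[
\frac{\partial \ell}{\partial \sigma^2} = -\frac{n}{2\sigma^2} + \frac{1}{2\sigma^4} \sum_{i=1}^{n} (x_i - \hat{\mu})^2 = 0
\quad\Longrightarrow\quad
\hat{\sigma}^2 = \frac{1}{n} \sum_{i=1}^{n} (X_i - \bar{X})^2.
\]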
In the two examples above, maximum likelihood yields the same estimator as the method of moments. Then, one could well wonder whether there is really any difference between the two approaches. The next example provides us with a partial answer.
Example 9.39 Let us consider a uniform distribution on interval [0, θ]. On the basis of a random sample of n observations, we build the likelihood function
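\[
L(\theta) = \prod_{i=1}^{n} \frac{1}{\theta}\, \mathbf{1}_{\{0 \le x_i \le \theta\}}
= \begin{cases} \dfrac{1}{\theta^n}, & \theta \ge \max_i x_i, \\[4pt] 0, & \text{otherwise.} \end{cases}
\]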
This may look a bit weird at first sight, but it makes perfect sense, as the PDF is zero for x > θ. Indeed, θ cannot be smaller than the largest observation:
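\[
\theta \ge \max\{x_1, x_2, \dots, x_n\}.
\]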
In this case, since there is a constraint on θ, we cannot just take the derivative of L(θ) and set it to 0 (first-order optimality condition). Nevertheless, it is easy to see that likelihood is maximized by choosing the smallest θ, subject to constraint above. Hence, using the notation of order statistics, we find
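\[
\hat{\theta} = X_{(n)} = \max\{X_1, X_2, \dots, X_n\}.
\]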
While the application of maximum likelihood to exponential and normal variables looks a bit dull, in the last case we start seeing something worth noting. The estimator is quite different from the one we obtained by using the method of moments. The most striking feature is that this estimator does not use the whole sample, but just a single observation. We leave it as an exercise for the reader to verify that, in the case of a uniform distribution on the interval [a, b], maximum-likelihood estimation yields
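\[
\hat{a} = X_{(1)} = \min_i X_i, \qquad \hat{b} = X_{(n)} = \max_i X_i.
\]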
Indeed, the smallest and largest observations are sufficient statistics for the parameters a and b of a uniform distribution.25 We refrain from stating a formal definition of sufficient statistics, but the idea is rather intuitive. Given a random sample X, a sufficient statistic is a function T(X) that captures all of the information we need from the sample in order to estimate a parameter. As a further example, sample mean is a sufficient statistic for the parameter μ of a normal distribution. This concept has far-reaching implications in the theory of inferential statistics.
A last point concerns unbiasedness. In the first two examples, we have obtained biased estimators for variance, even though they are consistent. It can be shown26 that an unbiased estimator of θ for the uniform distribution on [0, θ] is
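\[
\hat{\theta}_u = \frac{n+1}{n}\, X_{(n)}.
\]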
Again, we see that maximum likelihood yields a less than ideal estimator, even though for n → ∞ there is no real issue. It turns out that maximum-likelihood estimators (MLEs) do have limitations, but a few significant advantages as well. Subject to some technical conditions, the following properties can be shown for MLEs:
- They are consistent.
- They are asymptotically normal.
- They are asymptotically efficient.
- They are invariant, in the sense that, given a function g(·), the MLE of γ = g(θ) is g(θ̂), where θ̂ is the MLE of θ.
As a general rule, finding MLEs requires the solution of an optimization problem by numerical methods, but there is an opportunity here. Whatever constraints we want to enforce on the parameters, depending on domain-specific knowledge, can easily be added. We obtain a constrained optimization problem that can be tackled using the theory and methods of constrained optimization.
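To make the point concrete, the following is a minimal sketch, in Python, of how such a constrained maximum-likelihood problem might be tackled numerically; the synthetic data, the illustrative constraint μ ≥ 0, and the choice of SciPy's L-BFGS-B solver are all assumptions made for the sake of the example, not prescriptions.

```python
# Minimal sketch: numerical maximum-likelihood estimation of a normal
# distribution, with an illustrative domain constraint mu >= 0 enforced
# through bounds on the decision variables.
import numpy as np
from scipy.optimize import minimize

# Synthetic data, for illustration only.
rng = np.random.default_rng(42)
sample = rng.normal(loc=1.5, scale=2.0, size=500)

def neg_loglik(params, x):
    """Negative loglikelihood of a normal distribution (we minimize it)."""
    mu, sigma = params
    n = x.size
    return 0.5 * n * np.log(2.0 * np.pi * sigma**2) + np.sum((x - mu) ** 2) / (2.0 * sigma**2)

# Bounds encode the constraints: mu >= 0 (assumed domain knowledge)
# and sigma > 0 (so that the loglikelihood is well defined).
result = minimize(
    neg_loglik,
    x0=np.array([0.0, 1.0]),             # starting point
    args=(sample,),
    method="L-BFGS-B",
    bounds=[(0.0, None), (1e-6, None)],  # (mu, sigma)
)

mu_hat, sigma_hat = result.x
print(f"MLE estimates: mu = {mu_hat:.4f}, sigma = {sigma_hat:.4f}")
```

Note that the problem is parameterized here in terms of σ rather than σ²; by the invariance property listed above, either parameterization leads to the same estimate of the variance.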