Consider the problem of estimating a parameter θ, characterizing the probability distribution of a random variable X. We have some prior information about θ that we would like to express in a sensible way. We might assume that the unknown parameter lies anywhere in the unit interval [0, 1], or we might assume that it is close to some number μ, but we are somewhat uncertain about it. Such knowledge, or subjective view, may be expressed by a probability density p(θ), which is called the prior distribution of θ. In the first case, we might associate a uniform distribution with θ; in the second case, the prior could be a normal distribution with expected value μ and variance σ². Note that this is the variance that we associate with the parameter, which is regarded as a random variable rather than a number, and not the variance of the random variable X itself.
In Bayesian estimation, the prior is merged with experimental evidence by using Bayes’ theorem. Experimental evidence consists of independent observations X1, …, Xn from the unknown distribution. Here and in the following we mostly assume that random variable X is continuous, and we speak of densities; the case of discrete random variables and probability mass functions is similar. We also assume that the values of the parameter θ are not restricted to a discrete set, so that the prior is a density as well. Hence, let us denote the density of X by f(x | θ), to emphasize its dependence on parameter θ. Since a random sample consists of independent random variables, their joint distribution, conditional on θ, is
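f_n(x_1, \dots, x_n \mid \theta) = \prod_{i=1}^{n} f(x_i \mid \theta),

i.e., the product of the individual conditional densities.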
The conditional density fn(x1, …, xn | θ) is also called the likelihood function, as it is related to the likelihood of observing the data values x1, …, xn, given the value of the parameter θ; also notice the similarity with the likelihood function in maximum likelihood estimation.25 Note that what really matters here is that the observed random variables X1, …, Xn are independent conditionally on θ. Since we are speaking about n+1 random variables, we could also consider the joint density
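f(x_1, \dots, x_n, \theta) = f_n(x_1, \dots, x_n \mid \theta)\, p(\theta),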
but this will not be really necessary for what follows. Given the joint conditional distribution fn(x1, …, xn | θ) and the prior p(θ), we can find the marginal density of X1, …, Xn by applying the total probability theorem:26
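g_n(x_1, \dots, x_n) = \int_{\Omega} f_n(x_1, \dots, x_n \mid \theta)\, p(\theta)\, d\theta,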
where we integrate over the domain Ω on which θ is defined, i.e., the support of the prior distribution. Now what we need is to invert the conditioning: we would like the distribution of θ conditional on the observed values Xi = xi, i = 1, …, n, i.e., the posterior density pn(θ | x1, …, xn).
This posterior density should merge the prior and the density of observed data conditional on the parameter. This is obtained by applying Bayes’ theorem to densities, which yields
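p_n(\theta \mid x_1, \dots, x_n) = \frac{f_n(x_1, \dots, x_n \mid \theta)\, p(\theta)}{g_n(x_1, \dots, x_n)}.    (14.17)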
Note that the posterior density involves a term gn(x1, …, xn), which does not really depend on θ. Its role is to normalize the posterior distribution, so that its integral is 1. Sometimes, it might be convenient to rewrite Eq. (14.17) as
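p_n(\theta \mid x_1, \dots, x_n) \propto f_n(x_1, \dots, x_n \mid \theta)\, p(\theta),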
where the symbol ∝ means “proportional to.” In plain English, Eq. (14.17) states that the posterior is proportional to the product of the likelihood function fn(x1, …, xn | θ) and the prior distribution p(θ):
posterior ∝ prior × likelihood
What we are saying is that, given some prior knowledge about the parameter and the distribution of observations conditional on the parameter, we obtain an updated distribution of the parameter, conditional on the actually observed data.
Example 14.14 (Bayesian learning and coin flipping) We tend to take for granted that coins are fair, and that the probability of getting heads is 1/2. Let us consider flipping a possibly unfair coin, with an unknown probability θ of getting heads. In order to learn this unknown value, we flip the coin repeatedly, i.e., we run a sequence of independent Bernoulli trials with unknown parameter θ.27 If we do not know anything about the coin, we might just assume a uniform prior
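p(\theta) = 1, \quad 0 \le \theta \le 1.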
If we flip the coin N times, we know that the probability of getting H heads is related to the binomial probability distribution
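\theta^{H} (1 - \theta)^{N - H}.    (14.19)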
This is our likelihood function. If we regard this expression as the probability of observing H heads, given θ, this should actually be the probability mass function of a binomial variable with parameters θ and N, but we are disregarding the binomial coefficient [see Eq. (6.16)], which does not depend on θ and just normalizes the distribution. If we multiply this likelihood function by the prior, which is just 1, we obtain the posterior density for θ, given the number of observed heads:
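p_N(\theta \mid H) \propto \theta^{H} (1 - \theta)^{N - H}, \quad 0 \le \theta \le 1.    (14.20)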
Equations (14.19) and (14.20) look like the same thing, because we use a uniform prior, but they are very different in nature. Equation (14.20) gives the posterior density of θ, conditional on the fact that we observed H heads and N − H tails. If we look at it this way, we recognize the shape of a beta distribution, which is a density, rather than a mass function. To normalize the posterior, we should multiply it by the appropriate value of the beta function.28 Again, this normalization factor does not depend on θ and can be disregarded.
In Fig. 14.5 we display posterior densities, normalized in such a way that their maximum is 1, after flipping the coin N times and having observed H heads. The plot in Fig. 14.5(a) is just the uniform prior. Now imagine that the first flip lands heads. After observing the first head, we know for sure that θ ≠ 0; indeed, if θ were zero, we could not observe any head. The posterior is now proportional to a triangle:
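p_1(\theta \mid H = 1) \propto \theta, \quad 0 \le \theta \le 1.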
Fig. 14.5 Updating posterior density in coin flipping.
This triangle is shown in Fig. 14.5(b). If we observe another head in the second flip, the updated posterior density is a portion of a parabola, as shown in Fig. 14.5(c):
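p_2(\theta \mid H = 2) \propto \theta^2, \quad 0 \le \theta \le 1.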
If we get tails at the third flip, we rule out θ = 1 as well. Proceeding this way, we get beta distributions concentrating around the true (unknown) value of θ. Incidentally, the figure has been obtained by Monte Carlo simulation of coin flipping with θ = 0.2.
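A minimal Python sketch of this kind of simulation is given below; this is not the code used to produce the figure, and the choice of numpy, the random seed, and the flip counts reported are arbitrary:

import numpy as np

rng = np.random.default_rng(0)         # arbitrary seed
theta_true = 0.2                       # true probability of heads, as in the text
flips = rng.random(100) < theta_true   # True means heads

theta = np.linspace(0.0, 1.0, 501)     # grid of candidate values for the parameter
for n in (1, 2, 3, 10, 100):
    H = int(flips[:n].sum())
    # With a uniform prior, the posterior is proportional to theta^H (1 - theta)^(n - H),
    # i.e., a beta density; we rescale it so that its maximum is 1, as in Fig. 14.5.
    post = theta**H * (1.0 - theta)**(n - H)
    post = post / post.max()
    print(f"n = {n:3d}, heads = {H:3d}, posterior mode at theta = {theta[post.argmax()]:.2f}")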
Armed with the posterior density, we may obtain a Bayes’ estimator in different ways. Figure 14.5 would suggest taking the mode of the posterior, which would spare us the work of normalizing it. However, this need not be the most sensible choice. If we consider the expected value of the posterior distribution, we obtain
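\hat{\theta} = \mathrm{E}[\theta \mid x_1, \dots, x_n] = \int_{\Omega} \theta\, p_n(\theta \mid x_1, \dots, x_n)\, d\theta = \frac{\int_{\Omega} \theta\, f_n(x_1, \dots, x_n \mid \theta)\, p(\theta)\, d\theta}{\int_{\Omega} f_n(x_1, \dots, x_n \mid \theta)\, p(\theta)\, d\theta}.

For the coin-flipping example, this ratio of integrals is easy to approximate numerically; the following Python sketch does so by brute force on a grid (the data values N and H below are made up for illustration):

import numpy as np

N, H = 10, 2                                    # hypothetical experiment: 2 heads out of 10 flips
theta = np.linspace(0.0, 1.0, 2001)             # grid over the support of the uniform prior
prior = np.ones_like(theta)                     # uniform prior on [0, 1]
likelihood = theta**H * (1.0 - theta)**(N - H)  # binomial kernel, constant factor dropped
post_kernel = likelihood * prior                # unnormalized posterior

# Bayes' estimator = posterior mean: a ratio of two integrals, approximated here by
# sums over the grid (the common grid spacing cancels in the ratio).
theta_hat = np.sum(theta * post_kernel) / np.sum(post_kernel)
print(theta_hat)  # close to (H + 1) / (N + 2) = 0.25, the exact mean of the beta posterior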
There are different ways of framing the problem, which are a bit beyond our scope, but one thing that we can immediately appreciate is the challenge we face. The estimator above involves what looks like an intimidating integral, but our task is even more difficult in practice, because finding the posterior density may be a challenging computational exercise as well. In fact, given a prior, there is no general way of finding a closed-form posterior; things can really get awkward when multiple parameters are involved. Moreover, there is no guarantee that the posterior distribution pn(θ | x1, …, xn) will belong to the same family as the prior p(θ). However, there are some exceptions. A family of distributions is called a conjugate family of priors if, whenever the prior is in the family, the posterior is too. The following example illustrates the idea.
Example 14.15 (A normal prior) Consider a sample (X1, …, Xn) from a normal distribution with unknown expected value θ and known variance σ0². Then, given our knowledge about the multivariate normal, and taking advantage of independence among the observations, we have the following likelihood function:
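f_n(x_1, \dots, x_n \mid \theta) = \prod_{i=1}^{n} \frac{1}{\sqrt{2\pi}\,\sigma_0} \exp\left\{ -\frac{(x_i - \theta)^2}{2\sigma_0^2} \right\} = \frac{1}{(2\pi)^{n/2}\sigma_0^{n}} \exp\left\{ -\frac{1}{2\sigma_0^2} \sum_{i=1}^{n} (x_i - \theta)^2 \right\}.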
Let us assume that the prior distribution of θ is normal, too, with expected value μ and variance σ²:
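p(\theta) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left\{ -\frac{(\theta - \mu)^2}{2\sigma^2} \right\}.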
To get the posterior, we may simplify our work by considering in each function only the part that involves θ, wrapping the rest within a proportionality constant. In more detail, the likelihood function can be written as
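f_n(x_1, \dots, x_n \mid \theta) \propto \exp\left\{ -\frac{1}{2\sigma_0^2} \sum_{i=1}^{n} (x_i - \theta)^2 \right\}.    (14.22)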
We may simplify the expression further, by observing that
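\sum_{i=1}^{n} (x_i - \theta)^2 = \sum_{i=1}^{n} (x_i - \bar{x})^2 + n(\bar{x} - \theta)^2,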
where x̄ is the average of the xi, i = 1, …, n. Then, we may include the terms not depending on θ in the proportionality constant and rewrite (14.22) as
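f_n(x_1, \dots, x_n \mid \theta) \propto \exp\left\{ -\frac{n(\bar{x} - \theta)^2}{2\sigma_0^2} \right\}.    (14.23)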
By a similar token, we may rewrite the prior as
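p(\theta) \propto \exp\left\{ -\frac{(\theta - \mu)^2}{2\sigma^2} \right\}.    (14.24)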
Multiplying Eqs. (14.23) and (14.24), we obtain the posterior
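p_n(\theta \mid x_1, \dots, x_n) \propto \exp\left\{ -\frac{n(\bar{x} - \theta)^2}{2\sigma_0^2} - \frac{(\theta - \mu)^2}{2\sigma^2} \right\}.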
Again, we should try to include θ within one term; to this aim, we use a bit of tedious algebra30 and rewrite the argument of the exponential as follows:
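-\frac{n(\bar{x} - \theta)^2}{2\sigma_0^2} - \frac{(\theta - \mu)^2}{2\sigma^2} = -\frac{(\theta - \nu)^2}{2\tau^2} + C,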
where
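\nu = \frac{\sigma_0^2\, \mu + n\sigma^2\, \bar{x}}{\sigma_0^2 + n\sigma^2}, \qquad \tau^2 = \frac{\sigma_0^2\, \sigma^2}{\sigma_0^2 + n\sigma^2},

and C is a constant that does not involve θ; the symbol τ² is used here simply as a shorthand for the variance of the resulting normal density.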
Finally, this leads to
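p_n(\theta \mid x_1, \dots, x_n) \propto \exp\left\{ -\frac{(\theta - \nu)^2}{2\tau^2} \right\}.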
Disregarding the normalization constant, we immediately recognize the familiar shape of a normal density, with expected value ν. Then, given an observed sample mean x̄ and a prior expected value μ, Eq. (14.27) tells us that the Bayes’ estimator of θ can be written as
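\hat{\theta} = \nu = \frac{\bar{x}/(\sigma_0^2/n) + \mu/\sigma^2}{1/(\sigma_0^2/n) + 1/\sigma^2}.    (14.30)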
Eq. (14.30) has a particularly nice and intuitive interpretation: The posterior estimator is a weighted average of the sample mean x̄ (the new evidence) and the prior expected value μ, with weights that are inversely proportional to σ0²/n, the variance of the sample mean, and σ², the variance of the prior. The more reliable a term, the larger its weight in the average.
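As a concrete illustration, here is a small Python sketch of this weighted average; the function name and the numerical values are ours, chosen only for illustration:

def bayes_estimate_normal(xbar, n, sigma0, mu, sigma):
    """Posterior mean of theta for a normal sample of size n with sample mean xbar and
    known standard deviation sigma0, combined with a normal prior with mean mu and
    standard deviation sigma."""
    w_data = n / sigma0**2     # precision (inverse variance) of the sample mean
    w_prior = 1.0 / sigma**2   # precision of the prior
    return (w_data * xbar + w_prior * mu) / (w_data + w_prior)

# With a vague prior (large sigma) the estimate stays close to the sample mean,
# while a tight prior pulls it toward mu.
print(bayes_estimate_normal(xbar=1.3, n=25, sigma0=2.0, mu=0.0, sigma=10.0))  # about 1.30
print(bayes_estimate_normal(xbar=1.3, n=25, sigma0=2.0, mu=0.0, sigma=0.1))   # about 0.08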
In the example, it may sound a little weird to assume that the variance of X is known, but not its expected value. Of course, one can extend the framework to cope with estimation of multiple parameters, but there are cases in which we are more uncertain about expected value than variance, as we see in the following section.