1 Appendix: Common distributions

1 Appendix: Common distributions This Appendix provides details for common univariate and multivariate distributions, including definitions, moments...
Author: Oswin Patterson
3 downloads 1 Views 246KB Size
1

Appendix: Common distributions

This Appendix provides details for common univariate and multivariate distributions, including definitions, moments, and simulation. Deyroye (1986) provides a complete treatment of random number generation, although care must be taken as many distributions can be parameterized in different ways. Uniform • A random variable X has a uniform distribution on the interval [α, β], denoted U (α, β) , if the probability density function (pdf) is p (x|α, β) =

1 β−α

for x ∈ [α, β] and 0 otherwise. The mean and variance of a uniform random variable 2 are E (X) = α+β and var (X) = (β−α) , respectively. 2 12 • The uniform distribution plays a foundational role in random number generation. In particular, uniform random numbers are required for the inverse transform simulation method, accept-reject algorithms, and the Metropolis algorithm. Fast and accurate pre-programmed algorithms are available in most statistical software packages and programming languages. Bernoulli • A random variable X has a Bernoulli distribution with parameter θ, denoted X ∼ Ber (θ) if the probability mass function (pmf) is Prob (X = x|θ) = θx (1 − θ)1−x . for x ∈ {0, 1}. The mean and variance of a Bernoulli random variable are E(X) = θ and var(X) = θ(1 − θ), respectively.

1

• To simulate X ∼ Ber (θ), 1. Draw U ∼ U (0, 1) 2. Set X = 1 if U < θ. Binomial • A random variable X ∈ {1, ..., n} has a Binomial distribution with parameters n and θ, denoted, X ∼ Bin (n, θ), if the pmf is Prob (X = x|n, θ) =

n! θx (1 − θ)n−x , x! (n − x)!

where n! = n (n − 1)! = n (n − 1) · · · 2 · 1. The mean and variance of a Binomial random variable are E(X) = nθ and var(X) = nθ (1 − θ), respectively. The Binomial distribution arises as the distribution of a sum of n independent Bernoulli trials. The Binomial is closely related to a number of other distributions. If W1 , ..., Wn are i.i.d. Pn Ber (p), then i=1 Wi ∼ Bin (n, p). As n → ∞ with np = λ, X ∼ Bin (n, p) converges in distribution to a Poisson distribution with parameter λ. • To simulate X ∼ Bin (n, θ), 1. Draw X1 , ..., Xn independently Xi ∼ Ber (θ) 2. Set X = count (Xi = 1) . Multinomial • A vector of random variables X = (X1 , ..., Xk ) has a Multinomial distribution, denoted X ∼ M ult (n, p1 , ..., pk ), if k

p (X = x|p1 , ..., pk ) =

Y n! pxi i x1 ! · · · xk ! i=1

Pk where i=1 xi = n. The Multinomial distribution is a natural extension of the Bernoulli and Binomial distributions. The Bernoulli distribution gives a single trial resulting in success or failure. The Binomial distribution is an extension that involves 2

n independently repeated Bernoulli trials. The Multinomial allows for multiple outcomes, instead of the two outcomes in the Binomial distribution. There are still n total trials, but now the outcome of each trial is assigned into one of k categories, and xi counts the number of outcomes in category i. The probability of category i is pi . The mean, variance, and covariances of the Multinomial distribution are given by E (Xi ) = npi , var (Xi ) = npi (1 − pi ) , and cov (Xi , Xj ) = −npi pj Multinomial distributions are often used in modeling finite mixture distributions where the Multinomial random variables represent the various mixture components. Dirichlet • A vector of random variables X = (X1 , .., Xk ) has a Dirichlet distribution, denoted P X ∼ D (α1 , ..., αk ), if ki=1 Xi = 1 Γ p (x|α1 , ..., αk ) =

P

k i=1

αi



Γ (α1 ) · · · Γ (αk )

k Y

xαi i −1

i=1

The Dirichlet distribution is used as a prior for mixture probabilities in mixture models. The mean, variance, and covariances of the Multinomial distribution are αi E (Xi ) = Pk

i=1

αi

P αi j6=i αj var (Xi ) = P  2 P k k i=1 αi + 1 i=1 αi cov (Xi , Xj ) = P k

i=1 αi

−αi αj 2 P k

 α + 1 i=1 i

• To simulate a Dirichlet X = (X1 , . . . , Xk ) ∼ D (α1 , ..., αk ) use the two step procedure Step 1: Draw k independent Gammas Yi ∼ Γ (αi , 1) Step 2: Set Xi = Yi /

k X i=1

3

Yi

Poisson • A random variable X ∈ N+ (the non-negative integers) has a Poisson distribution with parameter λ, denoted X ∼ Poi (λ) , if the pmf is −λ λ

Prob (X = x|λ) = e

x

x!

.

The mean and variance of a Poisson random variable are E(X) = λ and var(X) = λ, respectively. • To simulate X ∼ Poi (λ), 1. Draw Z1 , . . . , Zn independently, Zi ∼ exp (1) ) ( n X 2. Set X = inf n > 0 : Zi > λ , i=1

Exponential • A random variable X ∈ R+ has an exponential distribution with parameter µ, denoted, X ∼ exp (µ), if the pdf is   1 x p (x|µ) = exp − . µ µ The mean and variance of an exponential random variable are E (X) = µ and var (X) = µ2 , respectively. • The inverse transform method is the easiest way to simulate exponential random x variables, since the cumulative distribution function (cdf) is F (x) = 1 − e− µ . • To simulate X ∼ exp (µ), 1. Draw U ∼ U [0, 1] 2. Set X = −µ ln (1 − U ) . Gamma 4

• A random variable X ∈ R+ has a Gamma distribution with parameters α and β, denoted X ∼ G (α, β), if the pdf is p (x|α, β) =

β α α−1 x exp (−βx) , Γ (α)

The mean and variance of a Gamma random variable are E(X) = αβ −1 and var(X) = αβ −2 , respectively. It is important to realise that there are different parameterizations of the Gamma distribution; e.g. MATLAB parameterizes the Gamma density as p (x|α, β) =

1 xα−1 exp (−x/β) . Γ (α) β α

If X = Y /β and Y ∼ G (α, 1) then X ∼ G (α, β). To see this, use the inverse transform Y = βX with dY /dX = β, which implies that p (x|α, β) =

β α α−1 1 (xβ)α−1 exp (−βx) β = x exp (−βx) , Γ (α) Γ (α)

the density of a G (α, β) random variable. The exponential distribution is a special case of the Gamma distribution when α = 1: X ∼ G (1, µ−1 ) implies that X ∼ exp (µ). • Gamma random variable simulation is standard, with built-in generators in most software packages. These algorithms typically use accept/reject algorithms that are customized to the specific values of α and β. To simulate X ∼ G (α, β), when α is integer-valued, 1. Draw X1 , ..., Xα independently Xi ∼ exp(1) α X 2. Set X = β Xi . i=1

For non-integer α, accept-reject methods provide fast and accurate algorithms for Gamma simulation. Beta • A random variable X ∈ [0, 1] has a Beta distribution with parameters α and β, 5

denoted X ∼ B (α, β), if the pdf is p (x|α, β) =

xα−1 (1 − x)β−1 , B (α, β)

where B (α, β) = Γ (α) Γ (β) /Γ (α + β) is the Beta function. As R1 we have B (α, β) = 0 xα−1 (1 − x)β−1 dx.

R

p (x|α, β) dx = 1

The mean and variance of a Beta random variable are E (X) =

αβ α and var (X) = , 2 α+β (α + β) (α + β + 1)

respectively. If α = β = 1, then X ∼ U (0, 1). • If α and β are integers, to simulate X ∼ B (α, β), 1. Draw X1 ∼ G (α, 1) and X2 ∼ G (β, 1) X1 . 2. Set X = X1 + X2 When α is integer valued and β is non-integer valued (a common case in Bayesian inference), exponential random variables can be used to simulate beta random variables via the transformation method. Chi-squared • A random variable X ∈ R+ has a Chi-squared distribution with parameter ν, denoted X ∼ Xν2 if the pdf is p (x|ν) =

 x ν 1 −1 2  x exp − . 2 2 Γ ν2 ν 2

The mean and variance of X are E (X) = ν and var (X) = 2ν, respectively. The  Xν2 -distribution is a special case of the Gamma distribution: Xν2 = G ν2 , 21 . • Simulating chi-squared random variables typically uses the transformation method. For integer values of ν, the following two-step procedure simulates a Xν2 random

6

variable: Step 1: Draw Z1 , ..., Zν independently Zi ∼ N (0, 1) ν X Step 2: Set X = Zi2 . i=1

When ν is large, simulating using normal random variables is computationally costly and alternative more computationally efficient algorithms use Gamma random variable generation. Inverse Gamma • A random variable X ∈ R+ has a inverse Gamma distribution, denoted by X ∼ IG (α, β), if the pdf is   β β α −(α+1) x exp − p (x|α, β) = . Γ (α) x The mean and variance of the inverse Gamma distribution for α > 2 are E (X) =

β β2 and var (X) = α−1 (α − 1)2 (α − 2)

If Y ∼ G (α, β) , then X = Y −1 ∼ IG (α, β). To see this, write Z



1= Z0 ∞ = 0

β α α−1 y exp (−βy) dy = Γ (α)   β 1 βα exp − dx. Γ (α) xα+1 x

Z

0



βα Γ (α)

  α−1   β 1 −1 exp − dx x x x2

• The following two-steps simulate an IG (α, β) Step 1: Draw Y ∼ G (α, 1) β Step 2: Set X = . Y Again, as in the case of the Gamma distribution, some authors use a different parameterization for this distribution, so it is important to be careful to make sure you are 7

drawing using the correct parameters. In the case of prior distributions over the scale, σ 2 , it is additional complicated because some authors (Zellner, 1971) parameterize σ instead of σ 2 . Pareto • A random variable X ∈ R+ has a Pareto distribution, denoted by P ar(α, β), if the pdf for x > β and α > 0 is αβ α p(x|α, β) = α+1 x The mean and variance are E (X) =

αβ α−1

if α > 1 and var (X) =

αβ 2 (α−1)2 (α−2)

if α > 2.

• The following two-steps simulate a P ar(α, β) distribution Step 1: Draw Y ∼ exp (α) Step 2: Set X = βeY . Normal • A random variable X ∈ R has a normal distribution with parameters µ and σ 2 , denoted X ∼ N (µ, σ 2 ), if the pdf is p x|µ, σ

 2

(x − µ)2 exp − =√ 2σ 2 2πσ 2 1

!

we will also use φ (x|µ, σ 2 ) to denote the pdf and Φ (x|µ, σ 2 ) the cdf. The mean and variance are E (X) = µ and var (X) = σ 2 . • Given the importance of normal random variables, all software packages have functions to draw normal random variables. The algorithms typically use transformation methods drawing uniform and exponential random variables or look-up tables. The classic algorithm is the Box-Muller approach based on simulating uniforms. Log-Normal

8

• A random variable X ∈ R+ has a lognormal distribution with parameters µ and σ 2 , denoted by X ∼ LN (µ, σ 2 ) if the pdf is p x|µ, σ

2



  1 1 2 =√ exp − 2 (ln x − µ) . 2σ 2πσ 2 x 1

1

2

The mean and variance of the log-normal distribution are E (X) = eµ+ 2 σ and similarly var (X) = exp (2µ + σ 2 ) (exp (σ 2 ) − 1). It is related to a normal distribution via the transformation X = eµ+σZ . Although finite moments of the lognormal exist, the distribution does not admit a moment-generating function. • Simulating lognormal random variables via the transformation method is straightforward since X = eµ+σZ where Z ∼ N (0, 1) is LN (µ, σ 2 ). Truncated Normal • A random variable X has a truncated normal distribution, denoted by T N (µ, σ 2 ), with parameters µ, σ 2 and truncation region (a, b) if the pdf is p (x|a < x < b) = where it is clear that normal distribution is

Rb −∞

φ (x|µ, σ 2 ) , Φ (b|µ, σ 2 ) − Φ (a|µ, σ 2 )

φ (x|µ, σ 2 ) dx = Φ (b|µ, σ 2 ). The mean of a truncated

E(X|a < X < b) = µ − σ

φa − φb , Φb − Φa

where φx is the standard normal density evaluated at (x − µ) /σ and Φx is the standard normal cdf evaluated at (x − µ) /σ. • The inversion method can be used to simulate this distribution. A two-step algorithm that provides a draw from a truncated standard normal is Step 1: U ∼ U [0, 1] Step 2: X = Φ−1 [Φ (a) + U (Φ (b) − Φ (a))] ,

9

where Φ (a) =

Ra −∞

(2π)−1/2 exp (−x2 /2) dx. For the general truncated normal,

Step 1: Draw U ∼ U [0, 1]         a−µ b−µ a−µ −1 Φ Step 2: X = µ + σΦ +U Φ −Φ , σ σ σ where Φ−1 is the inverse of the cdf. Double exponential • A random variable X ∈ R has a double exponential (or Laplace) distribution with parameters µ and σ 2 , denoted X ∼ DE (µ, σ), if the pdf is   1 1 p (x|µ, σ) = exp − |x − µ| . 2σ σ The mean and variance are E (X) = µ and var (X) = 2σ 2 . • The composition method can be used to simulate a DE (µ, σ) random variable: Step 1: Draw λ ∼ exp (2) and Z ∼ N (0, 1) √ Step 2: Set X = µ + σ λZ. Check exponential • A random variable X ∈ R has a check (or asymmetric) exponential distribution with parameters µ and σ 2 , denoted X ∼ CE (τ, µ, σ), if the pdf is   1 1 p (x|µ, σ) = exp − ρτ (x − µ) . σµτ σ where ρτ (x) = |x| − (2τ − 1)x and µ−1 = 2τ (1 − τ ). The double exponential is a τ 1 special case when τ = 2 . • The composition method simulates a CE (µ, σ) random variable: Step 1: Draw λ ∼ exp (µτ ) and Z ∼ N (0, 1) √ Step 2: Set X = µ + (2τ − 1)σλ + σ λZ. 10

T • A random variable X ∈ R+ has a t-distribution with parameters ν,µ, and σ 2 , denoted X ∼ tν (µ, σ 2 ), if the pdf is

p x|ν, µ, σ

 2

=

ν+1 2 √  2 νπσ Γ ν2

2



Γ

1+

(x − µ) νσ 2

!− ν+1 2

When µ = 0 and σ = 1, the distribution is denoted merely as tν . The mean and ν for ν > 2. The variance of the t-distribution are E (X) = µ and var (X) = σ 2 ν−2 Cauchy distribution is the special case where ν = 1. • The composition method simulates a tν (µ, σ 2 ) random variable, ν ν  Step 1. Draw λ ∼ IG , and Z ∼ N (0, 1) √2 2 Step 2. Set X = µ + σ λZ Z • The class of Z-distributions for a, b > 0 have density 1 ea(z−µ)/σ fZ (z|a, b, µ, σ) = ≡ Z(z; a, b, σ, µ). σB(a, b) (1 + e(z−µ)/σ )a+b A typical parameterisation is a = δ + θ, b = δ − θ or δ = 12 (a + b), θ = 21 (a − b). This is a variance-mean mixture of normals where (  2 ) Z ∞ 1 1 1 √ exp − z − µ − (a − b)λσ Z(z; a, b, σ, µ) = pa,b (λ) dλ 2λσ 2 2 2πλσ 2 0 where pa,b (λ) is a Polya distribution which is an infinite mixture of exponentials that can be easily sampled as D

λ=

∞ X

2ψk−1 Zk ,

where

ψk = (a + k)(b + k) and Zk ∼ exp(1).

k=0

11

• Z-Distribution simulation is then given by √ 1 X = µ + (a − b)λσ + σ λZ, where Z ∼ N (0, 1) 2 Exponential power • A random variable has an exponential power distribution. denoted X ∼ EP(µ, σ, γ) if the pdf is   x − µ γ 1 . p (x|µ, σ, γ) = exp − 2σΓ(1 + γ −1 ) σ The mean and variance are E (X) = µ and var (X) = exponential are then special cases.

Γ(3/γ) 2 σ . Γ(1/γ)

Normal and double

• Exponential power simulation relies on the composition method: if X ∼ EP (µ, σ), √ then X = σ λZ, where the scale parameter is related to a positive stable random 3 −1 variable λ ∼ λ− 2 St+ α/2 (λ ) and ε ∼ N (0, 1). Inverse Gaussian • A random variable X ∈ R+ has an inverse Gaussian distribution with parameters µ and α, denoted X ∼ IN (µ, α), if the pdf is r p (x|µ, α) =

α (x − µ)2 α exp − 2πx3 2µ2 x

! .

The mean and variance of an inverse Gaussian random variable are E (X) = µ and var (X) = µ3 /α, respectively. • To simulate an inverse Gaussian IN (µ, α), Step 0: Draw U ∼ U (0, 1), V ∼ χ21 ξ2 ξ p V − √ 4ξµV + ξ 2 V 2 2µ 2 µ ξ2 Step 2: Set X = W 1(U ≥ ξ ) + 1(U ≤ ξ ) ξ+W ξ+W W Step 1: Draw W = ξ +

where ξ =

p

µ/α. 12

Generalized inverse Gaussian • A random variable X ∈ R+ has an generalized inverse Gaussian distribution with parameters a, b, and p, X ∼ GIG (a, b, p), if the pdf is p (x|a, b, p) =

 a  p2 b

   xp−1 1 b √  exp − ax + , 2 x 2Kp ab

where Kp is the modified Bessel function of the third kind. The mean and variance are known, but are complicated expressions of the Bessel functions. The Gamma distribution is the special case with β = b/2 and a = 0, the inverse Gamma is the special case with a = 0. • Simulating GIG random variables is typically done using resampling methods. Multivariate normal • A k×1 random vector X ∈ Rk+ has a multivariate normal distribution with parameters µ and Σ, denoted X ∼ Nk (µ, Σ), if the pdf is 1

p (x|µ, Σ) =

−1/2

k

|Σ|

(2π) 2



1 exp − (x − µ) Σ−1 (x − µ)0 2



where |Σ| is the determinant of the positive definite symmetric matrix Σ. The mean and covariance matrix of a multivariate normal are E (X) = µ and cov (X) = Σ, respectively. Multivariate T • A 1×k random vector X ∈ Rk+ has a multivariate t-distribution with parameters ν, µ, and Σ, denoted X ∼ tν (µ, Σ), if the pdf is given by p (x|ν, µ, Σ) =

Γ

ν+k 2 1/2

(νπ)



Γ

ν 2 1/2

(x − µ) Σ−1 (x − µ)0 1+ ν



|Σ|

− ν+k 2 .

The mean and covariance matrix of a multivariate t random variable are E (X) = µ and cov (X) = Σ, respectively. 13

• The following two steps provide a draw from a multivariate t-distribution: Step 1. Simulate Y ∼ Nk (µ, Σ) and Z ∼ Xν2  − 12 Z Step 2. Set X = µ + Y ν Wishart • A random m × m matrix Σ has a Wishart distribution, Σ ∼ Wm (v, V ), if the density function is given by p (Σ|v, V ) =

|Σ| 2

vm 2

(v−m−1) 2 v

|V | 2 Γm

  1 −1  exp − tr V Σ , v 2 

2

for v > m, where Γm

v  2

=

m Y

π

m(m−1) 4

 Γm

k=1

v−k+1 2



is the multivariate Gamma function. If v < m, then S does not have a density although its distribution is well defined. The Wishart distribution arises naturally in multivariate settings with normally distributed random variables as the distribution of quadratic forms of multivariate normal random variables. • The Wishart distribution can be viewed as a multivariate generalization of the Xν2 distribution. From this, it is clear how to sample from a Wishart distribution: Step 1: Draw Xj ∼ N (0, V ) for j = 1, ..., v v X Step 2: Set S = Xj Xj0 . j=1

Inverted Wishart • A random m × m matrix Σ has an inverted Wishart distribution, denoted Σ ∼ IW m (v, V ) if the density function is p (Σ|v, V ) =

|Σ|− 2

vm 2

(v+m+1) 2

|V |−m/2 Γm 14



  1 −1 .  exp − tr V Σ v 2 2

This also implies that Σ−1 has a Wishart distribution, Σ−1 ∼ Wm (v, V −1 ). The Jacobian of the transformation −1 ∂Σ −(m+1) . ∂Σ = |Σ| • To generate Σ ∼ IW m (w, W ), follow the two step procedure: Step 1: Draw Xi ∼ N 0, W −1 P Step 2: Set Σ = vi=1 Xi Xi0 .



In cases where m is extremely large, there are more efficient algorithms for drawing inverted Wishart random variables that involves factoring W and sampling from univariate X 2 distributions.

15

2

Likelihoods, Priors, and Posteriors

This appendix provides combinations of likelihoods and priors for the following types of observed data: Bernoulli, Poisson, exponential, normal, normal regression, and multivariate normal. For each specification, proper conjugate and Jeffreys’ priors are given. The overriding Bayesian paradigm takes the form of Bayes rule p(parameters|data) =

p(data|parameters)p(parameters) p(data)

where the types of data and parameters are problem specific. Bernoulli observations • If the data (yt |θ) ∼ Ber (θ) with θ ∈ [0, 1], then the likelihood is p (y|θ) =

T Y t=1

p (yt |θ) =

T Y

θyt (1 − θ)1−yt = θ

PT

t=1

yt

(1 − θ)T −

PT

t=1

yt

,

t=1

• A conjugate prior distribution is the beta family, θ ∼ B (a, A), where p (θ) =

Γ (a + A) a−1 θ (1 − θ)A−1 . Γ (a) Γ (A)

By Bayes rule, the posterior distribution is also Beta p (θ|y) ∝ p (y|θ) p (θ) = λa+ where aT = a +

PT

t=1

PT

t=1

yt −1

(1 − θ)A+T −

yt and AT = A + T −

PT

t=1

PT

t=1

yt −1

∼ B (aT , AT ) ,

yt .

• Fisher’s information for Bernoulli observations is   2 ∂ ln p (yt |θ) 1 I (θ) = −Eθ = , ∂θ2 θ (1 − θ) where Eθ denotes the expectation under a Ber(θ). Jeffreys’ prior is 1 2

p (θ) = I (θ) = θ

− 12

16

− 12

(1 − θ)

 ∼B

1 1 , 2 2

 .

Multinomial observations • If (y|θ) is Multinomial data from k categories where (y|θ) ∼ M ulti(θ1 , . . . , θk ), then the likelihood for T trials is given by k X T! yk y1 yi = T θ . . . θk where p(y1 , . . . , yk |θ1 , . . . , θk ) = y1 ! . . . yk ! 1 i=1

• A conjugate prior is a Dirichlet distribution, θ ∼ Dir(α), with density P Γ( αi ) α1 −1 p(θ1 , . . . , θk |α) = Q . . . θkαk −1 θ1 i Γ(αi ) The posterior is then p(θ1 , . . . , θk |α, y1 , . . . , yk ) ∝ p(y1 , . . . , yk |θ1 , . . . , θk )p(θ1 , . . . , θk |α) ∝ θ1y1 . . . θkyk θ1α1 −1 . . . θkαk −1 = θ1α1 +y1 −1 . . . θkαk +yk −1 ∼ Dir(α + y) which is again a Dirchlet with parameter α + y. Poisson observations • If the data (yt |λ) ∼ Poi (λ), then the likelihood is p (y|λ) =

T Y e−λ λyt t=1

yt !

∝ e−λT λ

PT

t=1

yt

,

• A conjugate prior for λ is a Gamma distribution, λ ∼ G (a, A), with density p (λ) =

Aa a−1 λ exp (−λA) . Γ (a)

The posterior distribution is Gamma p (λ|y) ∝ p (y|λ) p (λ) = e−λ(A+T ) λa+ 17

PT

t=1

yt −1

∼ G (aT , AT )

where aT = a +

PT

t=1

yt and AT = A + T .

• Fisher’s information for Poisson observations is  2  ∂ ln p (yt |λ) 1 I (λ) = −Eλ = . 2 ∂λ λ 1

1

Jeffreys’ prior is then p (λ) ≡ I (λ) 2 = λ− 2 . This can be viewed as a special case of the Gamma prior with a = 12 and A = 0. Exponential observations • If the data (yt |µ) ∼ exp (µ), then the likelihood is p (y|µ) =

T Y

µ exp (−µyt ) ∝ µT exp −µ

T X

t=1

! yt ,

t=1

• A conjugate prior for µ is a Gamma distribution, µ ∼ G (a, A). The posterior p (µ|y) ∝ p (y|µ) p (µ) ∝ µ where aT = a + T and AT = A +

„ PT « −µ A+ t=1 yt a+T −1

e

∼ G (aT , AT )

PT

t=1 yt .

• Fishers’ information for exponential observations is  I (µ) = −Eµ

  2 ∂ 2 ln p (yt |µ) 1 = . 2 ∂µ µ

Jeffreys prior for exponential observations is p (µ) ≡ µ−1 . This is a special case of the Gamma prior with a = 1 and A = 0. Normal observations with known variance • If the data (yt |µ, σ 2 ) ∼ N (µ, σ 2 ), the likelihood is p y|µ, σ

 2

 =

1 2πσ 2

 T2

18

T 1 X exp − 2 (yt − µ)2 2σ t=1

! .

We can factorize

T X

T X   (yt − µ) = (yt − y)2 + (y − µ)2 . 2

t=1

t=1

Thus, the likelihood, as a function of µ, is proportional to T (µ − y)2 exp − 2σ 2

! ,

with the other terms in p(y|µ, σ 2 ) being absorbed into the constant of integration. • A conjugate prior distribution for µ that is independent of σ 2 is given by µ ∼ N (a, A). The posterior distribution is p µ|y, σ

 2

∝ p y|µ, σ

 2

" #! 1 (µ − y)2 (µ − a)2 − . p (µ) ∝ exp − 2 σ 2 /T A

Completing the square yields (µ − y)2 (µ − a)2 (µ − aT )2 (y − a)2 − = + 2 , σ 2 /T A AT σ /T + A with parameters aT a 1 1 1 y = + 2 and = 2 + . AT A σ /T AT σ /T A The posterior distribution is p µ|y, σ

 2

" #! 1 (µ − aT )2 (y − a)2 ∝ exp − + 2 ∼ N (aT , AT ) . 2 AT σ /T + A

• A conjugate prior distribution for µ conditional on σ 2 is µ ∼ N (a, σ 2 A). The posterior distribution is p (µ|y, σ 2 ) ∝ p (y|µ, σ 2 ) p (µ) which gives (

p µ|y, σ

 2

1 ∝ exp − 2 2σ

19

(µ − y)2 (µ − a)2 + T −1 A

!)

Completing the square for the quadratic term in the exponential, (µ − y)2 (µ − a)2 (µ − aT )2 (y − a)2 = + + T −1 A AT T −1 + A where

aT a y 1 1 1 = + −1 and = −1 + . AT A T AT T A

The posterior distribution is p µ|y, σ

 2

(µ − aT )2 ∝ exp − 2σ 2 AT

!  ∼ N aT , AT σ 2 .

Notice the slight differences between this example and the previous one, in terms of the hyper-parameters and the form of the posterior distribution. • Fisher’s information for normal observations with σ 2 known is  2  ∂ ln p (yt |µ, σ 2 ) I (µ) = −Eλ ≡ 1. ∂µ2 Jeffreys prior for normal observations (with known variance) is a constant, p (µ) ≡ 1 which is improper. However, the posterior is proper and can be viewed as a limiting of the normal conjugate prior with a = 0 and A → ∞. Normal Variance with known Mean • Given µ, if (yt |µ, σ 2 ) ∼ N (µ, σ 2 ) the likelihood for σ 2 , is 

1 σ2

 T2

PT exp −

• A conjugate inverse Gamma prior σ 2 ∼ IG

p σ

 2

2 t=1 (yt − µ) 2σ 2

b B , 2 2



! .

has pdf

 2b    b B 2 − 2 −1  σ = exp − 2 2σ Γ 2b B 2

20

By Bayes rule,    p σ 2 |µ, y ∝ p y|µ, σ 2 p σ 2 ! PT   b+T +1 2 2 B + t=1 (yt − µ) 1 exp − ∝ 2 σ 2σ 2   bT BT , , ∼ IG 2 2 where bT = b + T and BT = B +

PT

t=1

(yt − µ)2 .

 The parameterization of the inverse Gamma, σ 2 ∼ IG 2b , B2 , is used as opposed to σ 2 ∼ IG (b, B) because the hyperparameters do not have any 1/2 terms. This is chosen for notational simplicity. It is also common in the literature to assume  p (σ) ∼ IG 2b , B2 , which only changes the first term in the expression for bT . • Fisher’s information for normal observations (with known mean) is I σ

2



 = −Eσ2

 ∂ 2 ln p (yt |µ, σ 2 ) 1 ≡ 2. 2 σ ∂ (σ 2 )

−1

Jeffreys’ prior is p (σ 2 ) ∝ (σ 2 ) , which is improper distribution. However, the resulting posterior is proper and can be viewed as a limiting of the inverse Gamma conjugate prior with B = 0 and b = 0. A flat or constant prior for σ 2 also leads to a proper posterior. Assuming p (σ 2 ) ≡ 1 yields a conditional posterior p σ 2 |µ, y ∼ IG 

T , 2

PT

2 t=1 (yt − µ) 2

! .

Unknown mean and variance: dependent priors • If the data (yt |µ, σ 2 ) ∼ N (µ, σ 2 ) and assuming that both µ and σ 2 are unknown, then the likelihood as a function of µ and σ 2 is 

1 σ2

 T2

PT exp −

21

2 t=1 (yt − µ) 2σ 2

! .

• A conjugate prior for (µ, σ 2 ) is p (µ, σ 2 ) = p (µ|σ 2 ) p (σ 2 ) where p µ|σ

2



∼ N a, Aσ

2



and p σ

2



 ∼ IG

b B , 2 2

 .

 This distribution is often expressed as N (a, Aσ 2 ) IG 2b , B2 . This prior assumes that µ and σ 2 are dependent. Bayes rule and a few lines of algebra yields a posterior     p µ, σ 2 |y ∝ p y|µ, σ 2 p µ|σ 2 p σ 2 " #!   T 2+b + 12 +1 T 1 1 (µ − y)2 (µ − a)2 X ∝ exp − 2 + + (yt − y)2 + B . σ2 2σ 1/T A t=1 Combining the quadratic terms that depend on µ by completing the square (µ − y)2 (µ − a)2 (µ − aT )2 (y − a)2 + = + , 1/T A AT 1/T + A where the hyper-parameters are 1 1 aT a y 1 = + −1 and = + −1 . AT A T AT A T Inserting this into the likelihood, gives a posterior p µ, σ 2 |y ∝ 



1 σ2

 T 2+b + 12 +1

#! " 1 (µ − aT )2 +B+S exp − 2 2σ AT

where we have

T

(y − a)2 X S= + (yt − y)2 . 1/T + A t=1 Given the conjugate prior structure, the posterior p (µ, σ 2 |y) ∝ p (µ|σ 2 , y) p (σ 2 |y) A few lines of algebra shows that 2



p µ, σ |y ∝



1 σ2

 21

1 (µ − aT )2 exp − 2 σ 2 AT

!

 ×

given p (µ|σ 2 , y) ∼ N (aT , σ 2 AT ) and p(σ 2 |y) ∼ IG 22

1 σ2 bT 2

 T 2+b +1

, B2T





B+S exp − 2σ 2



with bT = b + T, BT =

B + S. Marginal parameter distributions. In this specification, the marginal parameter distributions, p (σ 2 |y) and p (µ|y), are both known analytically. p (σ 2 |y) is inverse Gamma. The marginal p (µ|y) is Z

∞ 2

Z

2



p µ, σ |y dσ =

p (µ|y) = 0



  p µ|σ 2 , y p σ 2 |y dσ 2 .

0

Both p (µ|σ 2 , y) and p (σ 2 |y) are known, and the integral can be computed analytically. Ignoring integration constants, Z





p (µ|y) ∝ 0

"

1 σ2

 bT2+1 +1

2

∝ 1+

(µ − aT ) AT BT



  1 (µ − aT ) BT exp − 2 + dσ 2 σ 2AT 2

#− bT2+1

using our integration results. This is the kernel of a t-distribution, thus the marginal posterior is p(µ|y) ∼ tbT (aT , AT BT ). R The marginal likelihood, p (y) = p (y|µ, σ 2 ) p (µ, σ 2 ) dµdσ 2 can be computed

p y|µ, σ

 2

 =Ky

 T2

1 σ2 

2

exp −  12

T (µ − y) + S 2σ 2

!

1 (µ − a)2 p µ|σ = Kµ exp − 2 2σ A b   2 +1    1 B 2 p σ = Kσ exp − 2 , σ2 2σ 2



1 σ2

´

!

1

with constants Ky = (2π)−T /2 , Kσ = (B/2)b/2 /Γ (b/2) and Kµ = (2πA)− 2 . Substi-

23

tuting these expressions, the marginal likelihood is  b+T2 +1 +1   1 S+B dσ 2 p (y) = KKµ Kσ exp − σ2 2σ 2 0 " #! Z ∞ 1 (µ − y)2 (µ − a)2 exp − 2 · dµ + 2σ T −1 A −∞ ∞

Z



Completing the square inside the integrand gives (µ − aµT )2 (µ − y)2 (µ − a)2 (y − a)2 = , + + −1 T −1 A Aµ T +A with hyper-parameters aµT y a 1 1 1 + and µ = −1 + . µ = −1 AT T A AT T A The integrals can be expressed as Z 0





1 σ2

 b+T2 +1 +1

2

S + B + T(y−a) −1 +A exp − 2 2σ

The second integral is

R∞ −∞

 exp

2

(µ−aµ ) − 21 σ2 ATµ

! dσ 2

1 (µ − aµT )2 exp − 2 2σ Aµ −∞

Z

 dµ =





! dµ

1

2πAµ (σ 2 ) 2 . Using this, σ 2 can

be integrated out yielding the expression for the marginal likelihood: 2 ! S + B + T(y−a) −1 +A dσ 2 p (y) = 2πAµ KKµ Kσ exp − 2 2σ 0 # b+T "   T2  µ  21   2´b 2 − 2 b+T Γ 1 A B (y − a) 2 = S + B + −1 . b 2π A 2 T +A Γ 2



Z





1 σ2

 b+T +1 2

Finally, the predictive distribution can also be computed analytically. Since Z    T p yT +1 |y = p yT +1 |µ, σ 2 p µ, σ 2 |y T dµdσ 2 . To simplify, first compute the integral against µ by substituting from the posterior.

24

√ Since p (µ|σ 2 , y) ∼ N (aT , σ 2 AT ), we have that µ = aT + σ AT Z where Z is an independent normal. Substituting in yT +1 = µ + σεT +1 , gives yT +1 = aT + σ

p AT Z + σεT +1 = aT + σηT +1

where ηT +1 ∼ N (0, AT + 1). The predictive is therefore p yT +1 |y

 T

!   bT   +1 1 (yT +1 − aT )2 1 2 BT exp − 2 exp − 2 dσ 2 ∝ 2 2σ A + 1 σ 2σ T 0 " #! Z ∞   BT2+1 +1 1 1 (yT +1 − aT )2 exp − 2 BT + dσ 2 ∝ 2 σ 2σ A + 1 T 0 " # BT2+1 (yT +1 − aT )2 ∝ 1+ BT (AT + 1) Z





1 σ2

 12

∼ tbT (aT , BT (AT + 1)) . • Jeffreys’ prior for normal observations with unknown mean and variance is a bivariate distribution. Fishers’ information is  2  " # ∂ 2 ln p(yt |µ,σ 2 ) ∂ ln p(yt |µ,σ 2 ) 1  0 2 ∂µ∂σ 2 σ2  , I µ, σ 2 = −Eµ,σ2  ∂ 2 ln p∂µ (yt |µ,σ2 ) ∂ 2 ln p(yt |µ,σ2 ) = 0 σ22 2 2 2 ∂µ∂σ

∂(σ )

1

generating a prior distribution p (µ, σ 2 ) ≡ det (I (µ, σ 2 )) 2 = σ −2 . This prior is improper, but leads to a proper posterior distribution that can be viewed as a limiting case of the usual conjugate posterior 2



p µ, σ |y ∼ N aT , AT σ

2



 IG

bT BT , 2 2

 ,

where aT = 0, AT → ∞, b = 0 and B = 0. Regression • Consider a regression model specification, yt |xt , β, σ 2 ∼ N (xt β, σ 2 ), where xt is a vector of observed covariates, β is a k × 1 vector of regression coefficients and εt ∼ 25

N (0, σ 2 ). To express the likelihood, it is useful to stacking the data into matrices: y = Xβ + ε, where Y is a T × 1 vector of dependent variables, X is a T × k matrix of regressor variables and ε, where Y is a T × 1 vector of errors. The usual OLS regression estimator is βb = (X 0 X)−1 X 0 y and the residual sum of ˆ 0 (y − X β). ˆ Completing the square, squares is S = (y − X β) ˆ 0 (X 0 X)(β − β) ˆ + (y − X β) ˆ 0 (y − X β) ˆ (y − Xβ)0 (y − Xβ) = (β − β) ˆ 0 (X 0 X)(β − β) ˆ + S, = (β − β) which implies that T /2

 1 0 exp − 2 (y − Xβ) (y − Xβ) p(y|β, σ ) = 2σ  T /2   1 S 1 0 0 ˆ ˆ = exp − 2 (β − β) (X X)(β − β) − 2 σ2 2σ 2σ 

2

1 σ2



  ˆ S are sufficient statistics for (β, σ 2 ). Hence β, • A proper conjugate prior is p (β|σ 2 ) ∼ Nk (a, σ 2 A) and p (σ 2 ) ∼ IG p β|σ

2



= (2π)

k 2

σ

 k 2 −2

− 21

|A|

b B , 2 2



with



 1 0 −1 exp − 2 (β − a) (A) (β − a) 2σ

k

since |σ 2 A| = (σ 2 ) |A|. The posterior distribution is 2



p β, σ |y ∝



1 σ2

 b+T +1  2

1 σ2

 k2



  1 2 exp − 2 Q β, σ , 2σ

where the quadratic form is defined by  Q β, σ 2 = (y − Xβ)0 (y − Xβ) + (β − a)0 A−1 (β − a) + B. we can complete the square as follows: combine (y − Xβ)0 (y − Xβ)+(β − a)0 A−1 (β − a)

26

by stacking vectors y˜ =

!

y A

− 12

a

and W =

X 1 A− 2

! ,

where y˜ is a (T + k) × 1 vector and W is a (T + k) × k matrix. Then, (y − Xβ)0 (y − Xβ) + (β − a)0 A−1 (β − a) = (˜ y − W β)0 (˜ y − W β) . By analogy to the previous case, define β = (W 0 W )−1 W 0 y˜, where W 0 W = X 0 X + A−1 W 0 y˜ = X 0 y + A−1 a. Adding and subtracting W β to y˜ − W β and simplifying gives  0    y˜ − W β + W β − W β y˜ − W β + W β − W β 0  0  = β − β W 0 W β − β + y˜ − W β y˜ − W β

(˜ y − W β)0 (˜ y − W β) =



= (β − aT )0 A−1 y − W aT )0 (˜ y − W aT ) , T (β − aT ) + (˜ where A−1 = W 0 W = X 0 X + A−1 and aT = AT [X 0 y + A−1 a]. Straightforward T 0  algebra shows that y˜ − W β y˜ − W β = y 0 y + a0 A−1 a − a0T A−1 T aT . Putting the pieces together, the quadrtic form is  −1 0 0 −1 0 Q β, σ 2 = (β − aT )0 A−1 T (β − aT ) + y y + a A a − aT AT aT + B = (β − aT )0 A−1 T (β − aT ) + BT , where BT = y 0 y + a0 A−1 a − a0T A−1 T aT + B. This leads to the same posterior.

3 3.1

Direct sampling Generating i.i.d. random variables from distributions

The Gibbs sampler and MH algorithms require simulating i.i.d. random variables from “recognizable” distributions. Appendix 1 provides a list of common “recognizable” dis27

tributions, along with methods for generating random variables from these distributions. This section briefly reviews that standard methods and approaches for simulating random variables from recognizable distributions. Most of these algorithms first generate random variables from a relatively simple “building block” distribution, such as a uniform or normal distribution, and then transform these draws to obtain a sample from another distribution. This section describes a number of these approaches that are commonly encountered in practice. Most “random-number” generators actual use deterministic methods, along with transformations. In this regard, it is important to remember Von Neumann’s famous quotation: “Anyone who attempts to generate random numbers by deterministic means is, of course, living in a state of sin.” Inverse CDF method The inverse distribution method uses samples of uniform random variables to generate draws from random variables with a continuous distribution function, F . Since F (x) is uniformly distributed on [0, 1], draw a uniform random variable and invert the CDF to get a draw from F . Thus, to sample from F , Step 1: Draw U ∼ U [0, 1] Step 2: Set X = F −1 (U ) , where F −1 (U ) = inf {x : F (x) = U }. This inversion method provides i.i.d. draws from F provided that F −1 (U ) can be exactly calculated. For example, the CDF of an exponential random variable with parameter µ is F (x) = 1 − exp (−µx), which can easily be inverted. When F −1 cannot be analytically calculated, approximate inversions can be used. For example, suppose that the density is a known analytical function. Then, F (x) can be computed to an arbitrary degree of accuracy on a grid and inversions can be approximately calculated, generating an approximate draw from F . With all approximations, there is a natural trade-off between computational speed and accuracy. One example where efficient approximations are possible are inversions involving normal distributions, which is useful for generating truncated normal random variables. Outside of these limited cases, the inverse transform method does not provide a computationally attractive approach for drawing random variables from a given distribution function. In particular, it does not work well in multiple dimensions.

28

Functional Transformations The second main method uses functional transformations to express the distribution of a random variable that is a known function of another random variable. Suppose that X ∼ F , admitting a density f , and that y = h (x) is an increasing continuous function. Thus, we can define x = h−1 (y) as the inverse of the function h. The distribution of y is given by Z

h−1 (y)

 f (x) dx = F X ≤ h−1 (y) .

F (a) = Prob (Y ≤ y) = −∞

Differentiating with respect to y gives the density via Leibnitz’s rule:  dh−1 (y) −1 , fY (y) = f h (y) dy where we make explicit that the density is over the random variable Y . This result is used widely. For example, if X ∼ N (0, 1), then Y = µ + σX. Since x = h−1 (y) = y−µ , the σ  x−µ distribution function is F σ and density   1 (y − µ)2 fY (y) = √ exp − 2σ 2 2πσ Transformations are widely used to simulate both univariate and multivariate random variP ables. As examples, if Y ∼ Xν2 and ν is an integer, then Y = νi=1 Xi2 where each Xi is independent standard normal. Exponential random variables can be used to simulate X 2 , Gamma, beta, and Poisson random variables. The famous Box-Muller algorithm simulates normals from uniform and exponential random variables. In the multivariate setting, Wishart (and inverse Wishart) random variables can be via sums of squared vectors of standard normal random variables. Mixture distributions In the multidimensional case, a special case of the transformation generates continuous mixture distributions. The density of a continuous mixture distribution is given by Z p (x) =

p (x|λ) p (λ) dλ,

29

where p (x|λ) is viewed as density conditional on the parameter λ. One example of this is the class of scale mixtures of normal distributions, where  2 1 x p (x|λ) = √ exp − 2λ 2πλ and, λ is the conditional variance of X. It is often simpler just to write √ X=

λε,

√ where ε ∼ N (0, 1). The distribution of λ determines the marginal distribution of X. Here are a number of examples of scale mixture distributions. • T-distribution. The t-distribution arises in many Bayesian inference problems involving inverse Gamma priors and conditionally normally distributed likelihoods. If  p (yt |λ) ∼ N (0, λ) and p (λ) ∼ IG 2b , B2 , then the marginal distribution of yt is  tb 0, Bb . The proof is direct by analytically computing the marginal distribution, ∞

Z

p (yt |λ) p (λ) dλ.

p (yt ) = 0

Using our integration results: Z p (y) = 0



1 √ 2π |

  21     b +1   1 1 y 2 (B/2)b/2 1 2 B  exp − dλ exp − λ 2λ λ 2λ Γ 2b {z }| {z } likelihood

prior

b+1   ∞  ( 2 )+1 1 y2 + B 1 ∝ exp − dλ λ λ 2 0

Z

which is in the class of scale mixture integrals. We obtain  − b+1 y2 ( 2 ) p (y) ∝ 1 + B Thus yt ∼ tb (0, B). Given µ, yt |µ, λ, σ 2 ∼ N (µ, λσ 2 ) and λ ∼ IG implies p (yt |µ, σ 2 ) ∼ tb (µ, B).

b B , 2 2



which

• Double-exponential distribution. The double exponential arises as a scale mix30

ture distribution: If p (yt |λ) ∼ N (0, λ) and λ ∼ exp (2), then the marginal distribution of yt ∼ DE (0, 1). The proof is again by direct integration using the results in our integration appendix: Z 0



   1 y2 1 1 √ exp − −λ dλ = exp (− |y|) , 2 λ 2 2πλ

More generally, if µ and σ 2 are known, then yt |µ, λ, σ 2 ∼ N (µ, λσ 2 ) and λ ∼ exp (2), then p (yt |µ, σ 2 ) ∼ DE (µ, σ 2 ) substituting b = (y − µ) /σ and multiplying both sides by σ −1 . • Asymmetric Laplacean. The asymmetric Laplacean distribution is a scale mixture of normal distributions: if  p (yt |λ) ∼ N ((2τ − 1) λ, λ) and λ ∼ exp µ−1 , τ then yt ∼ CE (τ, 0, 1). The proof uses our integration appendix and Z 0



  1 1 1 2 √ exp − (y + (2τ − 1) λ) − 2τ (1 − τ ) λ dλ = exp (− |y| − (2τ − 1) y) 2λ 2 2πλ

In general, if µ and σ 2 are known, then   p(yt |µ, λ, σ 2 ) ∼ N µ + (1 − 2τ ) λ, λσ 2 and λ ∼ exp µ−1 τ which leads to p (yt |µ, σ 2 ) ∼ CE (τ, µ, σ 2 ). • Exponential power family. This family of distributions (Box and Tiao, 1973) p(x|τ, γ) is given by  y γ  p(y|σ, γ) = (2σ)−1 c(γ) exp − σ where c(γ) = Γ(1 + γ −1 )−1 . Following West (1987) we have the following scale mixtures of normals representation γ

Z

p(y|σ = 1, γ) = c(γ) exp (−|β| ) = 0

31



 2 y 1 √ exp − p(λ|γ)dλ 2λ 2πλ

where p(λ|γ) ∝ λ−3/2 St+γ (λ−1 ) and St+ a is the (analytically intractable) density of a 2 positive stable distribution. Factorization Method Another method that is useful in some multivariate settings is known as factorization. The rules of probability imply that a joint density, p (x1 , ..., xn ), can always be factored as p (x1 , ..., xn ) = p (xn |x1 , ..., xn−1 ) p (x1 , ..., xn−1 ) = p (x1 ) p (x2 |x1 ) · · · p (xn |x1 , ..., xn−1 ) . In this case, simulating X1 ∼ p (x1 ), X2 ∼ p (x2 |X1 ), ... Xn ∼ p (xn |X1 , ..., Xn−1 ) generates a draw from the joint distribution. This procedure is common in Bayesian statistics, and the distribution of X1 is a marginal distribution and the other distributions are conditional distributions. This is used repeatedly to express a joint posterior in terms of lower dimensional conditional posteriors. Rejection sampling The final general method discussed here is the accept-reject method, developed by von Neumann. Suppose that the goal is to generate a sample from f (x), where it is assumed that f is bounded, that is, f (x) ≤ cg (x) for some c. The accept-reject algorithm is a two-step procedure Step 1: Draw U ∼ U [0, 1] and X ∼ g Step 2: Accept Y = X if U ≤

f (X) , otherwise return to Step 1. cg (X)

Rejection sampling simulates repeatedly until a draw that satisfies U ≤ direct calculation, it is clear that Y has density f :  Prob(Y ≤ y) = Prob X ≤ y|U ≤

f (X) cg (X)

R y R f (x)/cg(X)



 du g (x) dx −∞ −∞ = R −∞ R f (x)/cg(X)  = du g (x) dx −∞ −∞

32

f (X) cg(X)

is found. By

  f (X) Prob X ≤ y, U ≤ cg(X)   = f (X) Prob U ≤ cg(X)

R 1 y f (x) dx c −∞ R 1 −∞ f (x) dx c −∞

Z

y

=

f (x) dx. −∞

Rejection sampling requires (a) a bounding or dominating density g, (b) an ability to evaluate the ratio f /g; (c) an ability to simulate i.i.d. draws from g, and (d) the bounding R constant c. Rejection sampling does not require that the normalization constant f (x) dx be known, since the algorithm only requires knowledge of f /c. For continuous densities on a bounded support, it is to satisfy (a) and (c) (a uniform density works), but for continuous densities on unbounded support it can be more difficult since we need to find a (x) maximizes the acceptance density with heavier tails and higher peaks. Setting c = sup fg(x) x

probability. In practice, finding the constant is difficult because f generally depends on a multi-dimensional parameter vector, f (x|θf ), and thus the bounding is over x and θf . Rejection sampling is often used to generate random variables from various recognizable distributions, such as the Gamma or beta distributions. In these cases, the structure of the densities is well known and the bounding density can be tailored to generate fast and efficient rejection sampling algorithms. In many of these cases, the densities are logconcave (e.g., Normal, Double exponential, Gamma, and Beta). A density f is log-concave (x) if ln (f (x)). For differentiable densities, this is equivalent to assuming that dlnf = ff0(x) dx (x) 2

(x) is non-increasing in x and d lnf < 0. Under these conditions, it is possible to develop dx2 “black-box” generation methods that perform well in many settings. Another modification that “adapts” the dominating densities works well for log-concave densities. The basis of the rejection sampling is the simple fact that since

Z f (x) =

f (x)

Z du =

0

f (x)

U (du) , 0

where U is the uniform distribution function, the density f is a marginal distribution from the joint distribution (X, U ) ∼ U ((x, u) : 0 ≤ u ≤ f (x)) . More generally, if X ∈ Rd is a random vector with density f and U is an independent random variable distributed U [0, 1], then (X, cU f (X)) is uniformly distributed on the set  A = (x, u) : x ∈ Rd , 0 ≤ u ≤ cf (x) . Conversely, if (X, U ) is uniformly distributed on A, then the density of X is f (X). Multinomial Resampling Sampling N values from a discrete distribution (xi , pi ) can be done by simulating standard uniforms Ui and then using binary search to find the

33

value of j, and hence xj , corresponding to qj−1 < Ui ≤ qj P where qj = jl=0 pj and q0 = 0. This algorithm is inefficient, although commonly used due to its simplicity, as requires O(N ln N ) operations. The ln N operations comes from the binary search. A more efficient method (particularly suited to the particle filtering methods) is to simulate N +1 exponentially distributed variables z0 , . . . , zN via zi = − ln Ui and calculating P the totals Zj = jl=0 zl and then merging Zj and qj in the sense that if qj ZN > Zi then output xj . This is an O(N ) algorithm. Slice Sampling Slice sampling can be used to sample the conditional posterior when the scale mixture is a stable distribution. Here we have  p λt |σ 2 , yt ∼ p(yt |λt , σ 2 )p (λt ) , where p (yt |λt , σ 2 ) = N (0, σ 2 λt ). To do this, introduce a uniform random variable ut and consider the joint distribution    p λt , ut |σ 2 ∝ p (λt ) Uut 0, φ yt0 , λt σ 2 . Then the algorithm is:   p λt |ut , σ 2 ∝ p (λt ) on N yt ; 0, σ 2 λt > ut    p ut |λt , σ 2 ∼ U 0, φ yt ; λt σ 2 . The accept/reject sampling alternative is as follows. Since we have the upper bound −1/2 φ (yt ; 0, λt σ 2 ) ≤ (2πyt2 ) exp(−1/2), the sample λt via λt ∼ p (λt ) h i  2 −1/2 −1/2 u ∼ U 0, 2πyt e , and if U > φ (yt , σ 2 λt ), repeat.

34

3.2

Integration results

There are a number of useful integration results that are repeatedly used for normalising probability distributions. The first identity is (x − µ)2 exp − 2σ 2 −∞

Z



!

√ dx =

2πσ 2 ,

for all µ which defines the univariate normal distribution. The second identity is for the Gamma function, defined as Z ∞ Γ (α) = y α−1 e−y dy, 0

which is important for a number of distributions, most notably the Gamma and inverse Gamma. This implies that y α−1 e−y /Γ (α) is a proper density. Changing variables to x = βy, gives the standard form of the Gamma distribution. Integration by parts implies Γ (α + 1) = αΓ (α), so Γ (α) = (α − 1)! with Γ (1) = 1 and √ Γ (1/2) = π. For other fractional values 

1 Γ (α) Γ α + 2



√ = 21−2α πΓ (2α) ,

A number of integrals are useful for analytic characterization of distributions, either priors or posteriors, that are scale mixtures of normals. As discussed in the previous appendix, scale mixtures involve integrals that are a product of two distributions. The following integral identities are useful for analyzing these specifications. For any given p, a, b > 0, Z



1 −p p x exp −ax dx = a b Γ b b 0 Z ∞  p+1   1 1 p p exp −ax−b dx = a− b Γ x b b 0 p−1

b



These integrals are useful when combining Gamma or inverse Gamma distributions prior distributions with normal likelihood functions. Second, for any a and b, Z 0



   a 1 b2 2 √ exp − a x+ dx = exp (−|ab|) , 2 x 2πx

which are useful for double exponential. A related integral, which is useful for the check 35

exponential distribution is Z 0

3.3



   a 1 b2 2 √ exp − + 2(2τ − 1)b + a x dx = exp (− |ab| − (2τ − 1)b) . 2 x 2πx

Useful algebraic formulae

One of the most useful algebraic tricks for calculating posterior distribution is completing the square (have fun showing the algebra for the second part!). In the scalar case, we have the identity (x − µ3 )2 (µ1 − µ2 )2 (x − µ1 )2 (x − µ2 )2 + = + Σ1 Σ2 Σ3 Σ 1 + Σ2   −1 −1 −1 −1 where µ3 = Σ3 Σ−1 µ + Σ µ and Σ = Σ + Σ . 1 2 3 1 2 1 2 A shrinkage interpretation is the following. When analyzing shrinkage and it is common to define the weight w = (Σ1 + Σ2 )−1 Σ1 . We can now re-write completing the square as (x − µ1 )2 (x − µ2 )2 (x − (1 − w)µ1 − wµ2 )2 (µ1 − µ2 )2 + . + = Σ1 Σ2 Σ1 (1 − w)−1 Σ2 (1 − w) In vector-matrix case, completing the square becomes 0 −1 0 −1 0 −1 (x − µ1 )0 Σ−1 1 (x − µ1 )+(x − µ2 ) Σ2 (x − µ2 ) = (x − µ3 ) Σ3 (x − µ3 )+(µ1 − µ2 ) Σ4 (µ1 − µ2 )

 −1 where µ3 = Σ3 Σ−1 1 µ1 + Σ2 µ2 and −1 Σ3 = Σ−1 1 + Σ2

−1

−1 Σ4 = Σ−1 Σ−1 1 1 + Σ2

−1

Σ−1 2 .

Completing the square for the convolution of two normal densities is also useful when interpreting the fundamental identity of likelihood times prior is equal to posterior times marginal. The identity yields φ (µ1 , Σ1 ) φ (µ2 , Σ2 ) = φ (µ, Σ) c(µ1 , Σ1 , µ2 , Σ2 )

36

with φ is the standard normal density and − k2

c(µ1 , Σ1 , µ2 , Σ2 ) = (2π)

− k2

|Σ1 + Σ2 |

 1 0 −1 exp − (µ2 − µ1 ) (Σ1 + Σ2 ) (µ2 − µ1 ) 2 

  −1 −1 −1 −1 with parameters µ = Σ Σ−1 1 µ1 + Σ2 µ2 and Σ = Σ1 + Σ2

3.4

The EM, ECM, and ECME algorithms

MCMC methods have been used extensively to perform numerical integration. There is also interest in using simulation-based methods to optimise functions. The EM algorithm is a algorithms in a general class of Q-maximisation algorithms that finds a (deterministic) sequence {θ(g) } converging to arg maxθ∈θ Q(θ). First, define a function Q(θ, φ) such that Q(θ) = Q(θ, θ) and it satisfies a convexity constraint Q(θ, φ) ≥ Q(θ, θ). Then define θ(g+1) = arg max Q(θ, θ(g) ) θ∈θ

This satisfies the convexity constraint Q(θ, θ) ≥ Q(θ, ϕ) for any ϕ. In order to prove convergence. you get a sequence of inequalities Q(θ(0) , θ(0) ) ≤ Q(θ(1) , θ(0) ) ≤ Q(θ(1) , θ(1) ) ≤ . . . ≤ Q In many models we have to deal with a latent variable and require estimation where integration is also involved. For example, suppose that we have a triple (y, z, θ) with joint probability specification p(y, z, θ) = p(y|z, θ)p(z, θ). This can occur in missing data problems and estimation problems in mixture models. A standard application of the EM algorithm is to find Z arg max p(y|z, θ)p(z|θ)dz θ∈θ

z

As we are just finding an optimum, you do not need the prior specification p(θ). The EM algorithm finds a sequence of parameter values θ(g) by alternating between an expectation and a maximsation step. This still requires the numerical (or analytical) computation of the criteria function Q(θ, θ(g) ) described below. 37

EM algorithms have been used in extensively in mixture models and missing data problems. The EM algorithm uses the particular choice where Z Q(θ) = log p(y|θ) = log p(y, z|θ)dz Here the likelihood has a mixture representation where z is the latent variable (missing data, state variable etc). This is termed a Q-maximization algorithm with: (g)

Z

Q(θ, θ ) =

log p(y|z, θ)p(z|θ(g) , y)dz = Ez|θ(g) ,y [log p(y|z, θ)]

To implement EM you need to be able to calculate Q(θ, θ(g) ) and optimize at each iteration. The EM algorithm and its extensions ECM and ECME are methods of computing maximum likelihood estimates or posterior modes in the presence of missing data. Let the objective function be l(θ) = log p(θ|y) + c(y), where c(y) is a possibly unknown normalizing constant that does not depend on β and y denotes observed data. We have a mixture representation, Z Z p(θ|y) =

p(θ, z|y)dz =

p(θ|z, y)p(z|y)dz

where distribution of the latent variables is p(z|θ, y) = p(y|θ, z)p(z|θ)/p(y|θ). In some cases the complete data log-posterior is simple enough for arg maxθ log p(θ|z, y) to be computed in closed form. The EM algorithm alternates between the Expectation and Maximization steps for which it is named. The E-step and M-step computes Z (g) Q(β|β ) = Ez|β (g) ,y [log p(y, z|β)] = log p(y, z|β)p(z|β (g) , y)dz β (g+1) = arg max Q(β|β (g) ) β

This has an important monotonicity property that ensures `(β (g) ) ≤ `(β (g+1) ) for all g. In fact, the monotonicity proof given by Dempster et al (1977) shows that any β with Q(β, β (g) ) ≥ Q(β (g) , β (g) ) also satisfies the log-likelihood inequality `(β) ≥ `(β (g) ). In problems with many parameters the M-step of EM may be difficult. In this case θ may be partitioned into components (θ1 , . . . , θk ) in such a way that maximizing log p(θj |θ−j , z, y) is easy. The ECM algorithm pairs the EM algorithm’s E-step with k conditional maximization (CM) steps, each maximizing Q over one component θj with each component of θ−j 38

fixed at the most recent value. Due to the fact that each CM step increases Q the ECM algorithm retains the monotonicity property. The ECME algorithm replaces some of ECM’s CM steps with maximizations over l instead of Q. Liu and Rubin (1994) show that doing so can greatly increase the rate of convergence. In many cases we will have a parameter vector θ = (β, ν) partitioned into its components and a missing data vector z = (λ, ω). Then we compute the Q(β, ν|β (g) , ν (g) ) objective function and then compute E- and M steps from this to provide an iterative algorithm for updating parameters. To update the hyperparameter ν we can maximize the fully data posterior p(β, ν|y) with β fixed at β (g+1) . The algorithm can be summarized as follows β (g+1) = argmaxβ Q(β|β (g) , ν (g) ) where Q(β|β (g) , ν (g) ) = Ez

β (g) ,ν (g) ,y

  log p(y, z|β, ν (g) )

ν (g+1) = argmaxν log p(β (g+1) , ν|y) Simulated Annealing (SA) A simulation-based approach to finding θˆ = arg maxH(θ) θ∈Θ

is to sample a sequence densities πJ (θ) = R

e−JH(θ) eJH(θ) dµ(θ)

where J is a temperature parameter. Instead of looking at derivatives and performing gradient-based optimization you can simulate from the sequence of densities. This forms a time-homogeneous Markov chain and under suitable regularly conditions on the relaxation ˆ The main caveat is that we need to know schedule for the temperature we have θ(g) → θ. the criterion function H(θ) to evaluate the Metropolis probability for sampling from the sequence of densities. This is not always available. An interesting generalisation which is appropriate in latent variable mixture models is the following. Suppose that H(θ) = Ez|θ {H(z, θ)} is unavailable in closed-form where without loss of generality we assume that H(z, θ) ≥ 0. In this case we can use latent variable simulated annealing (LVSA) methods. Define a joint probability distribution for z J = (z1 , . . . , zJ ) as J Y πJ (z J , θ) ∝ H(zj , θ)p(zj |θ)µ(θ) j=1

for some measure µ which ensures integrability of the joint. This distribution has the

39

property that its marginal distribution on θ is given by πJ (θ) ∝ Ez|θ {H(z, θ)}J µ(θ) = eJ ln H(θ) µ(θ) By the simulated annealing argument we see that this marginal collapses on the maximum of ln H(θ). The advantage of this approach is that it is typically straightforward to sample with MCMC from the conditionals πJ (zi |θ) ∼ H(zj , θ)p(zj |θ) and πJ (θ|z) ∼

J Y

H(zj , θ)p(zj |θ)µ(θ)

j=1

Jacquier, Johannes and Polson (2007) apply this to finding MLE estimates for commonly encountered latent variable models.

3.5

Notes and Discussion

Geman (1987) provides a monograph discussion of Markov chain optimisation methods. Pincus (1968) provides an early application of the Metropolis algorithm to simulated annealing optimisation. Standard references on simulated annealing include Kirkpatrick (1984), Kirkpatrick et al, (1983), Aarts and Korst (1989) Van Laarhoven and Aarts, (1987). Muller (2000) and Mueller et al (2004) propose this version of latent variable simulated annealing for dealing with the joint problem of integration and optimisation. Slice sampling applications in Statistics are dicussed in Besag and Green (1993), Polson, (1996), Damien et al, (1999) and Neal (2003) provides a general algorithm. General strategies based on the ratio-of-uniforms method are in Wakefield et al (1991). Devroye (1986) and Ripley (1987) discuss many tailored made algorithms for simulation. The EM algorithm for missing data problems was originally developed by Dempster, Laird and Rubin (1977) but has its roots in the hidden Markov model literature and the Baum-Welch algorithm. Liu and Rubin (1994) consider faster ECME algorothms. Liang and Wong (2001) and Liu, Liang and Wong (2000) consider extensions including evolutionary MCMC algorithms.

40