Introduction to Bayesian inference March 2, 2005

1 Quick review: Joint, Marginals, Conditionals

[Same principle if discrete]

Consider the joint distribution: f_{XY}(x, y) = f_{X|Y}(x|y) f_Y(y) = f_{Y|X}(y|x) f_X(x). Then
\[
f_{X|Y}(x|y) = \frac{f_{Y|X}(y|x)\, f_X(x)}{f_Y(y)} \propto f_{Y|X}(y|x)\, f_X(x).
\]
Nothing “Bayesian” about this yet - it is just mathematically the correct solution. Marginals:
\[
f_X(x) = \int f_{XY}(x, y)\, dy = \int f_{Y|X}(y|x)\, f_X(x)\, dy = \int f_{X|Y}(x|y)\, f_Y(y)\, dy.
\]

Example 1: Consider
\[
\begin{pmatrix} x \\ y \end{pmatrix} \sim N\!\left( \begin{pmatrix} 0 \\ 0 \end{pmatrix};\ \begin{pmatrix} 1 & \rho \\ \rho & 1 \end{pmatrix} \right)
\]
[we already considered the generalisation of this in the KF notes], so that we have

\[
\begin{aligned}
f_{XY}(x, y) &= \frac{1}{2\pi\sqrt{1-\rho^2}} \exp\!\left\{ -\frac{1}{2}
\begin{pmatrix} x & y \end{pmatrix}
\begin{pmatrix} 1 & \rho \\ \rho & 1 \end{pmatrix}^{-1}
\begin{pmatrix} x \\ y \end{pmatrix} \right\} \\
&= \frac{1}{\sqrt{2\pi}} \exp\!\left( -\frac{1}{2} x^2 \right)
\cdot \frac{1}{\sqrt{2\pi}\sqrt{1-\rho^2}} \exp\!\left( -\frac{(y-\rho x)^2}{2(1-\rho^2)} \right)
= f_X(x)\, f_{Y|X}(y|x)
\end{aligned}
\]

OR

\[
= \frac{1}{\sqrt{2\pi}} \exp\!\left( -\frac{1}{2} y^2 \right)
\cdot \frac{1}{\sqrt{2\pi}\sqrt{1-\rho^2}} \exp\!\left( -\frac{(x-\rho y)^2}{2(1-\rho^2)} \right)
= f_Y(y)\, f_{X|Y}(x|y).
\]

So in this case the marginals are the same, f_Y(y) = f_X(x) = N(0, 1), and f_{X|Y}(x|y) = N(ρy; 1 − ρ²), f_{Y|X}(y|x) = N(ρx; 1 − ρ²). Nothing Bayesian above.
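As a quick numerical sanity check of this factorisation (a sketch, not part of the original notes; it assumes numpy and scipy are available, and the values of ρ, x and y are arbitrary):

import numpy as np
from scipy.stats import multivariate_normal, norm

rho = 0.6
x, y = 0.3, -1.1

# joint density of (x, y) under N(0, [[1, rho], [rho, 1]])
joint = multivariate_normal([0.0, 0.0], [[1.0, rho], [rho, 1.0]]).pdf([x, y])

# marginal f_X(x) = N(0, 1) times conditional f_{Y|X}(y|x) = N(rho*x, 1 - rho^2)
factorised = norm(0.0, 1.0).pdf(x) * norm(rho * x, np.sqrt(1.0 - rho**2)).pdf(y)

print(joint, factorised)  # the two numbers agree up to floating-point error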

0·1 Bayesian statistics: overview

Only “Bayesian” when we consider learning about a parameter θ from all of our data y. Slightly controversial to frequentists (ML practitioners) because we regard θ as a random variable and attach a “prior” to it. So we have to construct (before seeing the data) a prior f(θ). This needn’t be proper (have finite volume) - e.g. the Jeffreys prior on variances.

MAIN RULE:
\[
f(\theta|y) = \frac{L(y|\theta)\, f(\theta)}{f(y)} \propto L(y|\theta)\, f(\theta),
\]
where, of course, the likelihood L(y|θ) = f(y_1, ..., y_n|θ). We record summaries such as E[θ|y], Var[θ|y], etc. We have
\[
f(y) = \int_\Theta L(y|\theta)\, f(\theta)\, d\theta = \frac{L(y|\theta)\, f(\theta)}{f(\theta|y)}.
\]

This is the normalising constant, known as the “marginal likelihood” - only really important when comparing different models, e.g. two linear models with different sets of covariates. [Known as model selection, model mixing, model averaging - not covered in this course.] Generally we take a fully parametric approach, not dissimilar from the classical (ML) setup.
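As an aside (not in the original notes), the MAIN RULE and the marginal likelihood can be made concrete on a grid; a minimal sketch with made-up data, a N(θ, 1) likelihood and a N(0, 2²) prior:

import numpy as np
from scipy.stats import norm

y = np.array([1.2, 0.7, 2.1, 1.5])            # made-up data, y_i ~ N(theta, 1)
theta = np.linspace(-5.0, 5.0, 2001)          # grid over the parameter space

likelihood = np.array([norm(t, 1.0).pdf(y).prod() for t in theta])   # L(y|theta)
prior = norm(0.0, 2.0).pdf(theta)                                    # f(theta)

unnorm = likelihood * prior
marg_lik = np.trapz(unnorm, theta)            # f(y) = integral of L(y|theta) f(theta) dtheta
posterior = unnorm / marg_lik                 # f(theta|y), proportional to L(y|theta) f(theta)

print("marginal likelihood f(y):", marg_lik)
print("posterior mean E[theta|y]:", np.trapz(theta * posterior, theta))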

Advantages: allows asymptotically efficient estimation even when the likelihood cannot be written down explicitly (e.g. multivariate probit, SV models, etc.); model selection may be included; the methodology is entirely consistent, so parameter uncertainty is easy to take into account.

Disadvantages:


0·2 Conjugate models

Conjugate models are those where the prior has the same form as the posterior (two examples below) and everything may be performed analytically - all integrals are tractable. In this case we can work out f(θ|y) exactly (and its moments, if they exist). Also in this case we can work out the marginal likelihood as
\[
f(y) = \frac{L(y|\theta)\, f(\theta)}{f(\theta|y)}.
\]

0·3 Example 2: Poisson-Gamma

We have a model where transactions (between stock trades) arise as a Poisson process, so that the number of trades per day is
\[
y_i \sim Po(\theta), \quad i = 1, \ldots, n, \qquad f_Y(y_i) = c\, e^{-\theta} \theta^{y_i} \quad (\text{with } c = 1/y_i!).
\]
[Note the properties of the Poisson: E[y_i] = θ, Var[y_i] = θ.] Now if we maximise the log-likelihood (ML) we would have
\[
l(\theta) = \log L(y|\theta) = k - n\theta + \log\theta \sum_{i=1}^n y_i.
\]
So
\[
l'(\theta) = -n + \frac{\sum_{i=1}^n y_i}{\theta} = 0 \quad \Rightarrow \quad \hat\theta = \bar{y},
\qquad
l''(\theta) = -\frac{\sum_{i=1}^n y_i}{\theta^2},
\]
giving E[-l''(\theta)]^{-1} = θ/n. The estimator is unbiased, since E[θ̂] = E[ȳ] = θ, and asymptotically we have
\[
\hat\theta \to \theta, \qquad \hat\theta \stackrel{L}{\to} N(\theta;\ \theta/n), \quad \text{estimated by } N(\hat\theta;\ \bar{y}/n).
\]
Note that Var(θ̂) = θ/n = E[-l''(\theta)]^{-1}, so, from ML theory, our unbiased estimator in this case reaches the Cramer-Rao lower bound (the best - minimum variance - unbiased estimator we can get!).

Bayesian approach: now consider a conjugate prior, which in this case is the Gamma, so we have f_θ(θ) = ga(α; β), i.e.

\[
f_\theta(\theta) = c_1\, \theta^{\alpha-1} e^{-\beta\theta}, \qquad c_1 = \beta^\alpha/\Gamma(\alpha).
\]
[Note the properties of the Gamma: E[θ] = α/β and Var[θ] = α/β²; also, as α → ∞ we get normality.] We choose the values in our prior ourselves, e.g. α = 1, β = 1.2, reflecting our prior beliefs - this is the part which frequentists do not like! Though for large samples it makes no difference, as we will see. We get (applying the Bayes formula - think graphically also...)
\[
f(\theta|y) \propto L(y|\theta) f(\theta),
\]
so
\[
\log f(\theta|y) = \text{const} + l(\theta) + \log f_\theta(\theta) = \text{const} - n\theta + n\bar{y}\log\theta + (\alpha - 1)\log\theta - \beta\theta,
\]
so (since conjugate, the posterior is of the same form as the prior by definition)
\[
f(\theta|y) = ga(\alpha + n\bar{y};\ \beta + n),
\]
so under the posterior we have
\[
E_{\theta|y}[\theta] = \frac{\alpha + n\bar{y}}{\beta + n}, \qquad Var_{\theta|y}[\theta] = \frac{\alpha + n\bar{y}}{(\beta + n)^2}.
\]
Note that our posterior has the same asymptotic form as our MLE [kind of - here we are thinking of θ as having the distribution]:
\[
E_{\theta|y}[\theta] \to \bar{y}, \qquad Var_{\theta|y}[\theta] \to \bar{y}/n, \qquad f(\theta|y) \to N(\cdot\,;\,\cdot),
\]
so asymptotically, regardless of our prior specification, θ ∼ N(ȳ; ȳ/n). Note the reversal, in that we are considering the distribution of θ. Bayesian interval estimates (usually called credible intervals or regions) look much the same as confidence intervals but have a direct interpretation (“we have 95% confidence that θ ∈ [a, b]”). Note that this is subtly but importantly different from the frequentist interpretation: “if the same experiment were performed a large number of times then we would expect that on 95% of occasions θ would lie within the confidence intervals constructed in the same manner from the data”. The Bayesian interpretation is actually simpler and (arguably) more intuitive! The conjugacy argument (prior of the same form as the posterior) makes everything go through more simply, in that we have an explicit form for the posterior.
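A short simulation sketch of this update (not in the original notes; the data are simulated, and the ga(3, 1.2) prior matches the one used in the figures below):

import numpy as np
from scipy.stats import gamma, norm

rng = np.random.default_rng(0)

n, theta_true = 20, 5.0
y = rng.poisson(theta_true, size=n)          # simulated daily trade counts

alpha, beta = 3.0, 1.2                       # ga(3, 1.2) prior, as in the figures
a_post, b_post = alpha + y.sum(), beta + n   # posterior is ga(alpha + n*ybar, beta + n)
post = gamma(a=a_post, scale=1.0 / b_post)   # scipy uses shape and scale = 1/rate

print("posterior mean/var:", post.mean(), post.var())
print("MLE ybar, asy. var:", y.mean(), y.mean() / n)

# the asymptotic N(ybar, ybar/n) approximation gets close to the exact posterior
approx = norm(y.mean(), np.sqrt(y.mean() / n))
print("P(4 < theta < 6):", post.cdf(6.0) - post.cdf(4.0), approx.cdf(6.0) - approx.cdf(4.0))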

[Figure: prior Ga(3, 1.2), likelihood (n = 4) and posterior, plotted against θ over [0, 10].]

Figure 1: Poisson-Gamma example with n = 4; true parameter = 5.

[Figure: prior Ga(3, 1.2), likelihood (n = 20) and posterior, plotted against θ over [0, 10].]

Figure 2: Poisson-Gamma example with n = 20; true parameter = 5.

0·4 Example 3: Unknown variance (σ²)

Take the univariate Gaussian linear model with (known) parameter β,
\[
y_i \sim N(x_i\beta;\ \sigma^2),
\]
so that in matrix form the n × 1 vector y ∼ N_n(Xβ; σ²I_n). We obtain
\[
\log L(y|\sigma^2) = \text{const} - \frac{n}{2}\log\sigma^2 - \frac{1}{2\sigma^2}(y - X\beta)'(y - X\beta).
\]
We take a Gamma prior for σ^{-2} ∼ ga(α, β), so that σ² ∼ Iga(α, β), the inverse-gamma. Note the properties: if a rv Z ∼ Iga(a, b) then
\[
f_Z(z) = c\, z^{-(a+1)} e^{-b/z}, \quad c = b^a/\Gamma(a), \qquad E(z) = \frac{b}{a-1}, \qquad V(z) = \frac{b^2}{(a-1)^2(a-2)},
\]
so we have the log-prior
\[
\log f(\sigma^2) = \text{const} - (\alpha + 1)\log\sigma^2 - \beta/\sigma^2.
\]

[Figure: top panel - prior Ga(3, 1.2), likelihood (n = 200) and posterior over θ in [0, 10]; bottom panel - posterior and its N(·, ·) approximation over θ in [4.5, 6.0].]

Figure 3: Poisson-Gamma example with n = 200; true parameter = 5.

Hence we get the log-posterior
\[
\begin{aligned}
\log f(\sigma^2|y) &= \text{const} - \frac{n}{2}\log\sigma^2 - \frac{1}{2\sigma^2}(y - X\beta)'(y - X\beta) - (\alpha + 1)\log\sigma^2 - \beta/\sigma^2 \\
&= \text{const} - \left(\alpha + \frac{n}{2} + 1\right)\log\sigma^2 - \left\{\beta + \frac{1}{2}(y - X\beta)'(y - X\beta)\right\}/\sigma^2,
\end{aligned}
\]
so
\[
f(\sigma^2|y) = Iga\!\left( \alpha + \frac{n}{2};\ \beta + \frac{1}{2}(y - X\beta)'(y - X\beta) \right).
\]

In particular,
\[
E[\sigma^2|y] = \frac{\beta + \frac{1}{2}(y - X\beta)'(y - X\beta)}{\alpha + \frac{n}{2} - 1}
\to \frac{(y - X\beta)'(y - X\beta)}{n} = \frac{\sum_i (y_i - x_i\beta)^2}{n},
\]
the sample variance of the error, as we would expect!
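Again a small simulation sketch (not part of the notes; the design, coefficients and the ga(2, 1) prior are made up) confirming that the posterior mean of σ² approaches the sample error variance:

import numpy as np
from scipy.stats import invgamma

rng = np.random.default_rng(1)

n, sigma2_true = 200, 2.5
X = np.column_stack([np.ones(n), rng.normal(size=n)])
b = np.array([1.0, 0.5])                      # the (known) regression coefficients
y = X @ b + rng.normal(scale=np.sqrt(sigma2_true), size=n)

a0, b0 = 2.0, 1.0                             # ga(a0, b0) prior on 1/sigma^2, i.e. Iga(a0, b0) on sigma^2
resid = y - X @ b
post = invgamma(a=a0 + n / 2.0, scale=b0 + 0.5 * resid @ resid)   # Iga(a0 + n/2, b0 + RSS/2)

print("posterior mean of sigma^2:", post.mean())
print("sample error variance:    ", resid @ resid / n)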

0·5 Bayesian prediction

Let us suppose we wish to forecast a future observation in a Bayesian manner, so we have to forecast y_{n+1} given y = (y_1, ..., y_n)'. We may write down
\[
f(y_{n+1}|y) = \int f(y_{n+1}|\theta)\, f(\theta|y)\, d\theta.
\]
In conjugate models we may solve this explicitly. For the Poisson example we looked at, we get what is called a Poisson-Gamma distribution. For the LM with unknown variance we get a Normal with inverse-gamma variance (i.e. a t distribution). If n is very large then, under weak regularity assumptions, E[θ|y] → θ̂ (the MLE) and Var[θ|y] → 0, so we have
\[
f(y_{n+1}|y) \to f_{Y|\theta}(y_{n+1}|\hat\theta).
\]
In smaller samples it is important to take parameter uncertainty into account. Consider for example a portfolio problem where we have two assets with excess returns per month
\[
\begin{pmatrix} y_{1t} \\ y_{2t} \end{pmatrix} \sim N\!\left( \begin{pmatrix} \mu_1 \\ \mu_2 \end{pmatrix};\ \Sigma \right).
\]
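For the Poisson-Gamma case above the predictive can also be obtained by simulation, which is the route taken later for non-conjugate models; a sketch with an illustrative ga(103, 21.2) posterior (the values are made up):

import numpy as np

rng = np.random.default_rng(2)

a_post, b_post = 103.0, 21.2                  # illustrative ga(a, b) posterior for theta

# f(y_{n+1}|y) = integral of f(y_{n+1}|theta) f(theta|y) dtheta:
# draw theta from the posterior, then y_{n+1}|theta from the Poisson model
theta_draws = rng.gamma(shape=a_post, scale=1.0 / b_post, size=100_000)
y_next = rng.poisson(theta_draws)

print("predictive mean:", y_next.mean())      # approx a_post / b_post
print("predictive var: ", y_next.var())       # larger than the mean: parameter uncertainty adds spread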

0·6 Other conjugate examples

Other conjugate forms: Bernardo & Smith (1994).

1 The Normal Linear model (unknown σ², β)

See Press(), Bernardo & Smith (1994) and Greene (latest vol.).


1·1 Non-conjugacy

Whilst the philosophical reasons for Bayesian inference might be compelling, they didn't have that much impact on statistics/econometrics until comparatively recently. The reasons for the resurgence of Bayesian methods (now almost 50% of papers in stats and a growing proportion in econometrics) were:

• Model comparison: Bayes factors allow comparison between many competing models (e.g. subsets of parameters in the LM).

• Model averaging.

• Efficient estimation for latent variable models (stochastic volatility, hierarchical linear models, limited dependent models, etc.).

Let's explore the last point more, since we have restricted time on Bayesian methods and this leads us into simulation fairly easily. The essential point here is that we might be completely content to use ML provided we can explicitly write down the (log-)likelihood. But whilst we might parametrically specify a DGM, we do not always have a situation where the likelihood can be written down. It may be intractable.


1·2 Latent variable (limited dependent model)

Consider the multivariate (bivariate in our case) probit model, e.g. an individual being employed/unemployed and manually/non-manually trained:
\[
y_i = \begin{pmatrix} y_{i1} \\ y_{i2} \end{pmatrix}, \qquad
y_{i1} = \begin{cases} 1 & y^*_{i1} > 0 \\ 0 & y^*_{i1} \le 0 \end{cases}, \quad
y_{i2} = \begin{cases} 1 & y^*_{i2} > 0 \\ 0 & y^*_{i2} \le 0 \end{cases},
\tag{1·1}
\]
where
\[
y^*_i = \begin{pmatrix} y^*_{i1} \\ y^*_{i2} \end{pmatrix}
= \begin{pmatrix} x_{i1}\beta_1 \\ x_{i2}\beta_2 \end{pmatrix}
+ \begin{pmatrix} u_{i1} \\ u_{i2} \end{pmatrix}, \qquad
\begin{pmatrix} u_{i1} \\ u_{i2} \end{pmatrix} \sim N\!\left( \begin{pmatrix} 0 \\ 0 \end{pmatrix};\ \begin{pmatrix} 1 & \rho \\ \rho & 1 \end{pmatrix} \right),
\tag{1·2}
\]
so (1·2) is essentially a LM, but we do not observe y^*_i directly, only through y_i in (1·1). The log-likelihood, with θ = (β, ρ), is of course
\[
l(\theta) = \log f(y|\beta, \rho) = \sum_{i=1}^n \log f(y_i|\beta, \rho),
\]
where each component is
\[
f(y_i|\beta, \rho) = \Pr(Y_i = y_i|\beta, \rho) = \int I(Y_i = y_i|y^*_i)\, f(y^*_i|\beta, \rho)\, dy^*_i.
\]
Picture when β = 0... When the dimension is 1 (rather than 2) we may write down l(β, ρ) explicitly. When the dimension is ≥ 2 we have an intractable integral (even numerical methods do not work for dimensions of about 3 or 4 or more!). This is where modern Bayesian methods become useful.
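As a concrete illustration (a sketch, not from the notes; the data, coefficient values and helper function are made up, and it assumes a scipy recent enough that multivariate_normal exposes a cdf method), the bivariate case can still be evaluated observation by observation as an orthant probability - exactly the computation that stops scaling as the dimension grows:

import numpy as np
from scipy.stats import multivariate_normal

rng = np.random.default_rng(3)

# simulate a bivariate probit: y*_ij = x_i b_j + u_ij, y_ij = 1{y*_ij > 0}
n, rho = 500, 0.4
b1, b2 = np.array([0.5, 1.0]), np.array([-0.2, 0.8])
X = np.column_stack([np.ones(n), rng.normal(size=n)])
u = rng.multivariate_normal([0.0, 0.0], [[1.0, rho], [rho, 1.0]], size=n)
ystar = np.column_stack([X @ b1, X @ b2]) + u
y = (ystar > 0).astype(int)

def loglik(b1, b2, rho, X, y):
    # each term Pr(Y_i = y_i | b, rho) is an orthant probability of a bivariate
    # normal; sign flips turn every observation into an upper orthant
    cov = np.array([[1.0, rho], [rho, 1.0]])
    ll = 0.0
    for xi, yi in zip(X, y):
        mean = np.array([xi @ b1, xi @ b2])
        s = 2 * yi - 1                                    # +1 if y_ij = 1, -1 if y_ij = 0
        p = multivariate_normal([0.0, 0.0], np.outer(s, s) * cov).cdf(s * mean)
        ll += np.log(p)
    return ll

print(loglik(b1, b2, rho, X, y))    # evaluated at the true parameters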


1·2.1 Questions

1. Consider that we have a biased coin (not 50/50 heads vs tails) and we wish to learn about θ, the probability of a head.

(a) Write down the log-likelihood for θ.

(b) Write down the MLE θ̂.

(c) What is the variance of θ̂?

(d) Write down I(θ) = −[l''(θ)]^{-1} and evaluate this at θ̂. If the answer is the same as (c) we have the Cramer-Rao lower bound (we cannot get a better estimator than this, of course).

(e) Deduce the asymptotic distribution of θ̂.

(f) Assume that we have a Beta(α, β) prior on θ.

(g) Deduce the posterior.

(h) What are the mean and variance under the posterior?

(i) What happens to the mean and variance as n → ∞?

Binomial distribution: X ∼ Bin(n, θ), with
\[
E(X) = n\theta, \qquad Var(X) = n\theta(1-\theta), \qquad f_X(x) = \frac{n!}{(n-x)!\,x!}\,\theta^x (1-\theta)^{n-x}, \quad x = 0, 1, \ldots, n,
\]
and as n → ∞, X ∼ N(nθ, nθ(1−θ)) approximately.

Beta distribution: X ∼ Beta(α, β), with
\[
E(X) = \frac{\alpha}{\alpha+\beta}, \qquad Var(X) = \frac{\alpha\beta}{(\alpha+\beta)^2(\alpha+\beta+1)}, \qquad f_X(x) = \frac{\Gamma(\alpha+\beta)}{\Gamma(\alpha)\Gamma(\beta)}\,x^{\alpha-1}(1-x)^{\beta-1}, \quad 0 < x < 1.
\]
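A quick Monte Carlo check of these moment formulae (a sketch, not part of the exercise; the values of n, θ, α and β are arbitrary):

import numpy as np

rng = np.random.default_rng(4)

n, theta = 50, 0.3
alpha, beta = 2.0, 5.0

x = rng.binomial(n, theta, size=200_000)
print(x.mean(), n * theta)                         # E(X) = n*theta
print(x.var(), n * theta * (1 - theta))            # Var(X) = n*theta*(1 - theta)

z = rng.beta(alpha, beta, size=200_000)
print(z.mean(), alpha / (alpha + beta))                                     # E(X) = alpha/(alpha + beta)
print(z.var(), alpha * beta / ((alpha + beta) ** 2 * (alpha + beta + 1)))   # Var(X) as stated above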


References

Bernardo, J. M. & Smith, A. F. M. (1994). Bayesian Theory. John Wiley, Chichester.
