Bayesian Estimation & Model Evaluation
Frank Schorfheide, University of Pennsylvania
MFM Summer Camp
June 12, 2016
Why Bayesian Inference?
• Why not? Bayes' theorem delivers the posterior

  p(θ|Y) = p(Y|θ) p(θ) / ∫ p(Y|θ) p(θ) dθ.

• Treat uncertainty with respect to shocks, latent states, parameters, and model specifications symmetrically.
• Condition inference on what you know (the data Y) instead of what you don't know (the parameter θ).
• Make optimal decisions conditional on observed data.
Excuses and Overview
• Too little time to provide a detailed survey of state-of-the-art Bayesian methods.
• Instead: an eclectic collection of ideas and insights related to:
  1. Model Development
  2. Identification
  3. Priors
  4. Computations
  5. Working with Multiple Models
1. Model Development
• Bayesian estimation can take a lot of time... so don't waste it on bad models!
• Suppose you have an elaborate macro-finance DSGE model...
• Applied theorists get credit for plugging parameter values into the model and solving/simulating it.
• You can easily get extra credit by:
  • specifying a prior distribution p(θ);
  • generating draws θ^i, i = 1, ..., N, from the prior;
  • simulating trajectories Ỹ^i (conditional on θ^i), i = 1, ..., N;
  • computing sample statistics S(Ỹ^i);
  • comparing the distribution of simulated sample statistics with the observed sample statistic S(Y);
  • calling it a prior predictive check. (A code sketch follows below.)
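A minimal sketch of these steps, with a hypothetical AR(1) standing in for the DSGE model and the first-order autocorrelation as the sample statistic S(·):

```python
import numpy as np

rng = np.random.default_rng(0)

def simulate_model(theta, T=200):
    """Simulate a trajectory from the model given theta.
    Stand-in model: an AR(1), y_t = theta * y_{t-1} + eps_t."""
    y = np.zeros(T)
    eps = rng.standard_normal(T)
    for t in range(1, T):
        y[t] = theta * y[t - 1] + eps[t]
    return y

def stat(y):
    """Sample statistic S(Y): first-order autocorrelation."""
    return np.corrcoef(y[:-1], y[1:])[0, 1]

# 1. draw theta^i from the prior, here Uniform(0, 0.95)
N = 1000
theta_draws = rng.uniform(0.0, 0.95, size=N)

# 2.-3. simulate Y~^i conditional on theta^i and compute S(Y~^i)
S_sim = np.array([stat(simulate_model(th)) for th in theta_draws])

# 4. compare with the observed statistic S(Y)
S_obs = 0.7  # placeholder for the statistic computed from actual data
pct = np.mean(S_sim <= S_obs)
print(f"S(Y) = {S_obs:.2f} sits at the {100 * pct:.1f}th percentile "
      "of the prior predictive distribution")
```

If S(Y) lands far in the tails of the prior predictive distribution, the model (or the prior) is in trouble before estimation even starts.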
1. Predictive Checks – An Example
Reference: Chang, Doh, and Schorfheide (2007, JMCB)
2. Identification
• We are trying to learn the parameters θ from the data.
• Formal definitions... e.g., the model is identified at θ0 if

  p(Y|θ) = p(Y|θ0) for all Y implies that θ = θ0.

• Without identification or with weak identification:
  • use more/different data to achieve identification;
  • use identification-robust inference procedures.
• Lack of identification does not raise conceptual issues for Bayesian inference (as long as priors are proper), but possibly computational challenges.

Reference: Fernandez-Villaverde, Rubio-Ramirez, Schorfheide (2016, HB of Macro Chapter)
2. (Lack of) Identification – An Analytical Example
• Let φ be an identifiable reduced-form parameter.
• Let θ be a structural parameter of interest, restricted by

  φ ≤ θ and θ ≤ φ + 1.

• Parameter θ is set-identified. The interval Θ(φ) = [φ, φ + 1] is called the identified set.
• This problem shows up prominently in VARs identified with sign restrictions.
References: Moon and Schorfheide (2012, Econometrica); Schorfheide (2016, Discussion of World Congress Lectures by Müller and Uhlig)
2. (Lack of) Identification – An Analytical Example
• Joint posterior of θ and φ:

  p(θ, φ|Y) = p(φ|Y) p(θ|φ, Y) ∝ p(Y|φ) p(θ|φ) p(φ).

• Because θ does not enter the likelihood function, we deduce that

  p(φ|Y) = p(Y|φ) p(φ) / ∫ p(Y|φ) p(φ) dφ   and   p(θ|φ, Y) = p(θ|φ).

  No updating of beliefs about θ conditional on φ!
• Marginal posterior distribution of θ (with θ|φ ∼ U[φ, φ + 1], only φ ∈ [θ − 1, θ] contributes):

  p(θ|Y) = ∫_{θ−1}^{θ} p(φ|Y) p(θ|φ) dφ.

  Updating of the marginal posterior of θ! (A numerical sketch follows below.)
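Anticipating the parametrization on the next slide (φ|Y ∼ N(−0.5, V̄), θ|φ ∼ U[φ, φ + 1]), the marginal posterior has the closed form p(θ|Y) = F(θ) − F(θ − 1), where F is the posterior CDF of φ. A small numerical sketch:

```python
import numpy as np
from scipy.stats import norm

def posterior_theta(theta, mu=-0.5, V=0.25):
    """p(theta|Y) = integral_{theta-1}^{theta} p(phi|Y) dphi,
    with phi|Y ~ N(mu, V) and theta|phi ~ U[phi, phi+1]."""
    F = norm(loc=mu, scale=np.sqrt(V)).cdf
    return F(theta) - F(theta - 1.0)

grid = np.linspace(-2.0, 1.5, 8)
for V in (1 / 4, 1 / 20, 1 / 100):  # the three cases plotted on the next slide
    print(f"V = {V:.2f}:", np.round(posterior_theta(grid, V=V), 3))
```

As V̄ shrinks, the density converges to the uniform distribution on [−0.5, 0.5]: the data pin down φ but never θ within the identified set.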
2. An Analytical Example: Posterior p(θ|Y)

[Figure: posterior density of θ over the range θ ∈ [−2, 1.5]. Assume φ|Y ∼ N(−0.5, V̄) and θ|φ ∼ U[φ, φ + 1]; V̄ equals 1/4 (solid red), 1/20 (dashed blue), and 1/100 (dotted green). As V̄ shrinks, p(θ|Y) flattens toward the uniform density on [−0.5, 0.5].]
3. Prior Distributions
• Ideally: a probabilistic representation of our knowledge/beliefs before observing the sample Y.
• More realistically: the choice of prior as well as model is influenced by some observations. Try to keep this influence small, or adjust measures of uncertainty.
• Views about the role of priors:
  1. keep them "uninformative" (???) so that the posterior inherits the shape of the likelihood function;
  2. use them to regularize the likelihood function;
  3. incorporate information from sources other than Y.
3. Role of Priors – Example 1
• "Uninformative" priors? Consider a structural VAR

  y_t = Φ y_{t−1} + Σ_tr Ω ε_t,   u_t = Σ_tr Ω ε_t,   E[u_t u_t'] = Σ,

  where Σ_tr is the lower-triangular Cholesky factor of Σ and Ω is orthonormal.
• A uniform distribution on the orthonormal matrix Ω does not induce a uniform prior over the identified set for the IRF

  IRF(i, h) = Φ^h Σ_tr [Ω]_{·i} = Φ^h Σ_tr q,   where ‖q‖ = 1.

[Figure: the unit circle in (q1, q2)-space is mapped into objects of interest via F_q(Σ_tr) and F_θ(Σ_tr); the line Σ_tr,21 q1 + Σ_tr,22 q2 = 0 marks the boundary implied by a zero/sign restriction.]

A simulation sketch follows below.
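An illustrative simulation (with a hypothetical Σ_tr, not a calibration from the reference): draw Ω from the uniform (Haar) distribution via a QR decomposition and inspect the induced distribution of an impact response under a sign restriction; it is far from flat over the identified set.

```python
import numpy as np

rng = np.random.default_rng(0)
Sigma_tr = np.array([[1.0, 0.0],
                     [0.5, 2.0]])  # hypothetical Cholesky factor

def haar_q(n=2):
    """First column of a Haar-distributed orthogonal matrix,
    obtained from the QR decomposition of a Gaussian matrix."""
    Q, R = np.linalg.qr(rng.standard_normal((n, n)))
    return Q[:, 0] * np.sign(R[0, 0])  # sign fix for Haar measure

draws = np.array([Sigma_tr @ haar_q() for _ in range(100_000)])
# sign restriction: impact response of variable 1 is non-negative
kept = draws[draws[:, 0] >= 0.0]

# induced prior for the impact response of variable 2:
hist, edges = np.histogram(kept[:, 1], bins=10, density=True)
print("density over the identified set is far from flat:")
print(np.round(hist, 3))
```

The uniform prior on Ω is uniform over directions q, but any nonlinear mapping from q to an object of interest concentrates prior mass in some regions of the identified set.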
Reference: Schorfheide (2016, World Congress Discussion)
3. Role of Priors – Example 2a
• Consider the model

  y_t = θ1 x_{1,t} + θ1 θ2 x_{2,t} + u_t.

• No identification of θ2 if θ1 = 0.
• Models with multiplicative parameters generate likelihood functions that look like this...

[Figure: likelihood contours in the (θ1, θ2)-plane; a ridge opens up as θ1 → 0.]
3. Role of Priors – Example 2a
• The identification problem also distorts

  p(θ1 = 0|Y) ∝ ∫ p(Y|θ1 = 0, θ2) p(θ2) p(θ1 = 0) dθ2.

• Reparameterize: α1 = θ1, α2 = θ1 θ2.
• A prior p(α1, α2) ∝ c can regularize the problem.
• Jacobian of the mapping from θ to α:

  ∂α/∂θ' = [ 1  0 ; θ2  θ1 ],   so   |det(∂α/∂θ')| = |θ1|.

• The induced prior density p(θ1, θ2) ∝ |θ1| vanishes as θ1 approaches the point of non-identification.
• More generally: try to add information when the data are not particularly informative.
References: For cointegration model: Kleibergen and van Dijk (1994), Kleibergen and Paap (2002)
3. Role of Priors – Example 2b
• For instance, high-dimensional VARs

  Y = X Φ + U,   u_t ∼ N(0, Σ),

  with a low observation-to-parameter ratio.
• A hierarchical (conjugate) MNIW prior p(Φ, Σ|λ) adds information. Frequentist perspective: add some bias and reduce variance to improve MSE.
• How much? Data-driven choice of λ (empirical Bayes), as sketched below:

  λ̂ = argmax_λ ∫ p(Y|Φ, Σ) p(Φ, Σ|λ) d(Φ, Σ).
• Or specify a prior p(λ) and integrate out the hyperparameters.
• Alternative priors: LASSO, spike-and-slab, ...

References: Giannone, Lenza, and Primiceri (2014, REStat)
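A stylized sketch of the empirical-Bayes step, using a conjugate normal prior with shrinkage hyperparameter λ in a plain regression (a stand-in for the MNIW-VAR setup); the marginal data density is available in closed form, so λ̂ is a one-dimensional maximization:

```python
import numpy as np

rng = np.random.default_rng(0)

# hypothetical regression standing in for the VAR: low obs/parameter ratio
T, k = 40, 20
X = rng.standard_normal((T, k))
beta_true = np.concatenate([np.ones(3), np.zeros(k - 3)])
y = X @ beta_true + rng.standard_normal(T)

def log_marglik(lam, y, X):
    """log p(Y | lambda) under beta ~ N(0, I/lambda), u_t ~ N(0, 1):
    marginally y ~ N(0, X X'/lambda + I)."""
    V = X @ X.T / lam + np.eye(len(y))
    sign, logdet = np.linalg.slogdet(V)
    return -0.5 * (logdet + y @ np.linalg.solve(V, y)
                   + len(y) * np.log(2 * np.pi))

lams = np.geomspace(0.01, 100, 200)
lam_hat = lams[np.argmax([log_marglik(l, y, X) for l in lams])]
print(f"empirical-Bayes shrinkage: lambda_hat = {lam_hat:.2f}")
```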
3. Role of Priors – Example 3
• Prior elicitation based on: pre-sample information; information from excluded data series; or micro (macro) level information when estimating a model on macro (micro) data.
• A cute example...
• Production function:

  Y_t = (A_t H_t)^α K_t^{1−α} (1 − ϕ (H_t/H_{t−1} − 1)²).

• Prior for the adjustment-cost parameter ϕ?
  • Firms can either search for workers, incurring adjustment costs ϕ (ΔH/H)² Y, or pay head hunters for finding workers.
  • The head hunters' service fee is ζ W ΔH.
  • Head hunters tend to charge about ζ = 1/3 to 2/3 of the quarterly earnings of a worker.
  • Recruiting costs should be approximately the same: ϕ (ΔH/H)² Y = ζ W ΔH.
  • With a labor share of 2/3 (= WH/Y) and a one percent increase of employment, ΔH/H = 1%, we obtain a range of 22 to 44 for ϕ. (The arithmetic is spelled out below.)
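Spelling out the arithmetic: equating the two recruiting costs and dividing by Y gives

  ϕ (ΔH/H)² = ζ (W ΔH)/Y = ζ (WH/Y)(ΔH/H)   ⟹   ϕ = ζ (WH/Y) / (ΔH/H).

With WH/Y = 2/3 and ΔH/H = 0.01, ζ = 1/3 yields ϕ ≈ 22 and ζ = 2/3 yields ϕ ≈ 44.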
Reference: Chang, Doh, and Schorfheide (2007, JMCB)
4. Computations
• Practical work utilizes algorithms to generate draws θ^i, i = 1, ..., N, from the posterior p(θ|Y).
• Post-process draws by converting them into the object of interest h^i = h(θ^i) to characterize p(h(θ)|Y) ⟹ inference and decision making under uncertainty.
• Important algorithms:
  • importance sampling;
  • Markov chain Monte Carlo (MCMC) algorithms, e.g., Metropolis-Hastings samplers or Gibbs samplers.
• More recently: widespread access to parallel computation environments.
• Sequential Monte Carlo (SMC) techniques provide an interesting alternative.

Reference: Herbst and Schorfheide (2015, Princeton University Press)
4. Importance Sampling
• Target posterior π(θ) ∝ f(θ).
• Use the identity ∫ h(θ) f(θ) dθ = ∫ h(θ) [f(θ)/g(θ)] g(θ) dθ.
• The θ^i's are draws from g(·).
• Approximation:

  E_π[h] ≈ [ (1/N) Σ_{i=1}^N h(θ^i) w(θ^i) ] / [ (1/N) Σ_{i=1}^N w(θ^i) ],   w(θ) = f(θ)/g(θ).

[Figure: target density f and two proposal densities g1, g2, together with the resulting weight functions f/g1 and f/g2; a poorly matched proposal produces highly uneven weights.]

A runnable sketch follows below.
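A minimal runnable version, with a t-density as a stand-in for the (unnormalized) posterior kernel f:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# target kernel f: a t(3) density as a stand-in for the posterior
f = lambda th: stats.t(df=3).pdf(th)

# proposal g: easy to draw from, with tails fat enough to cover f
g = stats.norm(loc=0.0, scale=2.0)

N = 50_000
theta = g.rvs(N, random_state=rng)
w = f(theta) / g.pdf(theta)            # importance weights w(theta) = f/g

h = lambda th: th**2                   # object of interest h(theta)
est = np.sum(h(theta) * w) / np.sum(w)
ess = w.sum()**2 / (w**2).sum()        # effective sample size: weight unevenness
print(f"E[h] approx {est:.3f} (true value 3.0), ESS = {ess:,.0f} of {N:,}")
```

A low effective sample size signals that a few draws dominate the weights, which is exactly the failure mode the figure illustrates.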
4. A Challenging Posterior
• Consider the state-space model:

  y_t = [1 1] s_t,
  s_t = [ θ1²                  0
          (1 − θ1²) − θ1 θ2    1 − θ1² ] s_{t−1} + [1 0]' ε_t.

• Shocks: ε_t ∼ iid N(0, 1); uniform prior.
• Simulate T = 200 observations given θ = [0.45, 0.45]', which is observationally equivalent to θ = [0.89, 0.22]'.

[Figure: bimodal posterior contours in the (θ1, θ2)-plane over [0, 1]², with one mode near each of the two observationally equivalent parameter values.]

A sketch verifying the equivalence with a Kalman filter follows below.
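A sketch that verifies the observational equivalence: simulate data at θ = (0.45, 0.45) and evaluate the Gaussian likelihood with a Kalman filter initialized at the stationary distribution; the exact counterpart of θ is (√(1 − 0.45²), 0.45²/√(1 − 0.45²)) ≈ (0.89, 0.22).

```python
import numpy as np
from scipy.linalg import solve_discrete_lyapunov

rng = np.random.default_rng(0)

def ss_matrices(theta):
    """y_t = Z s_t, s_t = Phi s_{t-1} + R eps_t, eps_t ~ N(0,1)."""
    t1, t2 = theta
    Phi = np.array([[t1**2, 0.0],
                    [(1 - t1**2) - t1 * t2, 1 - t1**2]])
    return Phi, np.array([1.0, 0.0]), np.array([1.0, 1.0])

def kalman_loglik(y, theta):
    Phi, R, Z = ss_matrices(theta)
    s = np.zeros(2)
    P = solve_discrete_lyapunov(Phi, np.outer(R, R))  # stationary init
    ll = 0.0
    for yt in y:
        s, P = Phi @ s, Phi @ P @ Phi.T + np.outer(R, R)  # predict
        v = yt - Z @ s                                    # forecast error
        F = Z @ P @ Z                                     # its variance
        ll += -0.5 * (np.log(2 * np.pi * F) + v**2 / F)
        K = P @ Z / F                                     # Kalman gain
        s, P = s + K * v, P - np.outer(K, Z @ P)          # update
    return ll

# simulate T = 200 observations at theta = (0.45, 0.45)
Phi, R, Z = ss_matrices((0.45, 0.45))
s, y = np.zeros(2), np.empty(200)
for t in range(200):
    s = Phi @ s + R * rng.standard_normal()
    y[t] = Z @ s

t1b = np.sqrt(1 - 0.45**2)
print(kalman_loglik(y, (0.45, 0.45)))         # the two values coincide
print(kalman_loglik(y, (t1b, 0.45**2 / t1b))) # (up to floating-point error)
```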
4. From Importance to Sequential Importance Sampling

[Figure: sequence of tempered posteriors of θ1, fanning out from the prior at n = 0 to the bimodal posterior at n = Nφ = 50.]

• Likelihood tempering:

  π_n(θ) = f_n(θ)/Z_n = [p(Y|θ)]^{φ_n} p(θ) / ∫ [p(Y|θ)]^{φ_n} p(θ) dθ,   φ_n = (n/Nφ)^λ.
4. SMC Algorithm: A Graphical Illustration

[Figure: particle swarms transported from the prior (φ0) toward the posterior (φ3) through repeated C-S-M cycles.]

• π_n(θ) is represented by a swarm of particles {θ_n^i, W_n^i}_{i=1}^N:

  h̄_{n,N} = (1/N) Σ_{i=1}^N W_n^i h(θ_n^i)  →a.s.  E_{π_n}[h(θ_n)].

• C is Correction; S is Selection; and M is Mutation.
4. SMC Algorithm
1. Initialization (φ0 = 0). Draw the initial particles from the prior:

   θ_1^i ∼iid p(θ) and W_1^i = 1, i = 1, ..., N.

2. Recursion. For n = 1, ..., Nφ:
   1. Correction. Reweight the particles from stage n − 1 by defining the incremental weights

      w̃_n^i = [p(Y|θ_{n−1}^i)]^{φ_n − φ_{n−1}}    (1)

      and the normalized weights

      W̃_n^i = w̃_n^i W_{n−1}^i / [ (1/N) Σ_{i=1}^N w̃_n^i W_{n−1}^i ],   i = 1, ..., N.    (2)

      An approximation of E_{π_n}[h(θ)] is given by

      h̃_{n,N} = (1/N) Σ_{i=1}^N W̃_n^i h(θ_{n−1}^i).    (3)

   2. Selection. (continued on the next slide)
4. SMC Algorithm
1. Initialization.
2. Recursion. For n = 1, ..., Nφ:
   1. Correction.
   2. Selection. (Optional resampling.) Let {θ̂^i}_{i=1}^N denote N iid draws from a multinomial distribution characterized by support points and weights {θ_{n−1}^i, W̃_n^i}_{i=1}^N, and set W_n^i = 1. An approximation of E_{π_n}[h(θ)] is given by

      ĥ_{n,N} = (1/N) Σ_{i=1}^N W_n^i h(θ̂_n^i).    (4)

   3. Mutation. Propagate the particles {θ̂_n^i, W_n^i} via N_MH steps of an MH algorithm with transition density θ_n^i ∼ K_n(θ_n|θ̂_n^i; ζ_n) and stationary distribution π_n(θ). An approximation of E_{π_n}[h(θ)] is given by

      h̄_{n,N} = (1/N) Σ_{i=1}^N h(θ_n^i) W_n^i.    (5)

(A compact end-to-end sketch follows below.)
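A compact sketch of the full correction-selection-mutation cycle on a stand-in bimodal posterior (for simplicity, weights here are normalized to sum to one rather than to average one as in the slides):

```python
import numpy as np

rng = np.random.default_rng(0)

# stand-in log-likelihood with two well-separated modes
def loglik(th):
    return np.logaddexp(-0.5 * ((th - 2.0) / 0.3) ** 2,
                        -0.5 * ((th + 2.0) / 0.3) ** 2)

log_prior = lambda th: -0.5 * (th / 3.0) ** 2       # N(0, 3^2) kernel

N, Nphi, lam, n_mh = 2000, 50, 2.0, 2
phis = (np.arange(Nphi + 1) / Nphi) ** lam          # tempering schedule

theta = rng.normal(0.0, 3.0, N)                     # init: draws from prior
W = np.full(N, 1.0 / N)

for n in range(1, Nphi + 1):
    # Correction: reweight by the incremental likelihood
    incr = (phis[n] - phis[n - 1]) * loglik(theta)
    W = W * np.exp(incr - incr.max())
    W /= W.sum()

    # Selection: resample when the effective sample size is low
    if 1.0 / np.sum(W**2) < N / 2:
        theta = rng.choice(theta, size=N, p=W)
        W = np.full(N, 1.0 / N)

    # Mutation: a few random-walk MH steps targeting pi_n
    scale = 0.5 * theta.std() + 1e-8
    for _ in range(n_mh):
        prop = theta + scale * rng.standard_normal(N)
        log_alpha = (phis[n] * loglik(prop) + log_prior(prop)
                     - phis[n] * loglik(theta) - log_prior(theta))
        accept = np.log(rng.uniform(size=N)) < log_alpha
        theta = np.where(accept, prop, theta)

print("posterior mean approx", np.sum(W * theta))   # near 0 by symmetry
print("mass near +2:", np.sum(W[theta > 0]))        # near 0.5: both modes kept
```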
4. Remarks
• Correction step: reweight particles from iteration n − 1 to create an importance sampling approximation of E_{π_n}[h(θ)].
• Selection step: the resampling of the particles
  • (good) equalizes the particle weights and thereby increases the accuracy of subsequent importance sampling approximations;
  • (not good) adds a bit of noise to the MC approximation.
• Mutation step:
  • adapts particles to the posterior π_n(θ);
  • imagine we don't do it: then we would be using draws from the prior p(θ) to approximate the posterior π(θ), which can't be good!

[Figure: tempered posteriors of θ1 again, from the prior at n = 0 to the bimodal posterior at n = 50.]
5. Working with Multiple Models
• Assign prior probabilities γ_{j,0} to models M_j, j = 1, ..., J.
• Posterior model probabilities are given by

  γ_{j,T} = γ_{j,0} p(Y|M_j) / Σ_{j=1}^J γ_{j,0} p(Y|M_j),   where   p(Y|M_j) = ∫ p(Y|θ_(j), M_j) p(θ_(j)|M_j) dθ_(j).

• Log marginal data densities are sums of one-step-ahead predictive scores:

  ln p(Y|M_j) = Σ_{t=1}^T ln ∫ p(y_t|θ_(j), Y_{1:t−1}, M_j) p(θ_(j)|Y_{1:t−1}, M_j) dθ_(j).

• Bayesian model averaging:

  p(h|Y) = Σ_{j=1}^J γ_{j,T} p(h_j(θ_(j))|Y, M_j).

A computational sketch for the model probabilities follows below.
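A small utility for turning log marginal data densities into posterior model probabilities; the log-sum-exp trick guards against underflow, and the numbers are hypothetical:

```python
import numpy as np

def model_probs(log_ml, log_prior_probs):
    """Posterior model probabilities gamma_{j,T} from log marginal
    data densities ln p(Y|M_j) and log prior probabilities."""
    z = np.asarray(log_ml) + np.asarray(log_prior_probs)
    z -= z.max()          # stabilize before exponentiating
    w = np.exp(z)
    return w / w.sum()

# hypothetical log marginal data densities for J = 3 models
print(model_probs([-1402.3, -1399.8, -1410.1], np.log([1/3, 1/3, 1/3])))
```

Note how a difference of a few log points already concentrates almost all posterior probability on one model.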
5. Working with Multiple Models
• Application: DSGE model with and without financial frictions.
• Food for thought:
  • Bayesian model averaging essentially assumes that the model space is complete. Is it?
  • Time-varying model weights can be a stand-in for nonlinear macroeconomic dynamics.
Reference: Del Negro, Hasegawa, and Schorfheide (2016, JoE)
5. A Stylized Framework
• Consider a principal-agent setting designed to separate the task of estimating models from the task of combining them.
• Agents M_m = econometric modelers:
  • provide the principal with predictive densities p(y_{t+1}|I_t^m, M_m);
  • are rewarded based on the realized value of ln p(y_{t+1}|I_t^m, M_m) (induces truth-telling);
  • I_t^m is the model-specific information set.
• Principal P = policy maker, who aggregates the information obtained from the modelers:

  p(y_{t+1}|λ, I_t^P, P) = λ p(y_{t+1}|I_t^1, M_1) + (1 − λ) p(y_{t+1}|I_t^2, M_2),

  where I_t^P = {y_{1:t}, {p(y_τ|I_{τ−1}^m, M_m)}_{τ=1}^t for m = 1, 2}.
5. Bayesian Model Averaging (BMA): λ ∈ {1, 0}
• At any time T the policy maker can use the predictive densities to form marginal likelihoods:

  p(Y_{1:T}|M_i) = Π_{t=1}^T p(y_t|Y_{1:t−1}, M_i)

• ... and use them to update model probabilities:

  λ_T^BMA = P[λ = 1|Y_{1:T}] = λ_0^BMA p(Y_{1:T}|M_1) / [ λ_0^BMA p(Y_{1:T}|M_1) + (1 − λ_0^BMA) p(Y_{1:T}|M_2) ],

  where λ_T^BMA is the probability that M_1 is correct.
• Predictive density:

  p_BMA(y_{t+1}|I_t^P, P) = λ_t^BMA p(y_{t+1}|Y_{1:t}, M_1) + (1 − λ_t^BMA) p(y_{t+1}|Y_{1:t}, M_2).

A recursive implementation sketch follows below.
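A recursive implementation sketch with hypothetical one-step-ahead log predictive densities for the two models:

```python
import numpy as np

def bma_weights(logpred1, logpred2, lam0=0.5):
    """Recursive BMA weight lambda_t^BMA on M1, updated each period
    with the realized one-step-ahead predictive densities."""
    lam, path = lam0, []
    for l1, l2 in zip(logpred1, logpred2):
        num = lam * np.exp(l1)
        lam = num / (num + (1 - lam) * np.exp(l2))
        path.append(lam)
    return np.array(path)

# hypothetical predictive scores: M1 better early on, M2 better later
rng = np.random.default_rng(0)
l1 = rng.normal(-1.0, 0.3, 100); l1[50:] -= 0.4
l2 = rng.normal(-1.2, 0.3, 100); l2[50:] += 0.4
print(bma_weights(l1, l2)[[0, 49, 99]].round(3))
```

The recursion reproduces the product form above: each period multiplies in the latest predictive density and renormalizes.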
5. BMA and Model Misspecification
• BMA is based on the assumption that the model space contains the 'true' model ("complete model space"):

  p(y_{1:T}|λ, P) = p(y_{1:T}|M_1) = Π_{t=1}^T p(y_t|Y_{1:t−1}, M_1)   if λ = 1,
  p(y_{1:T}|λ, P) = p(y_{1:T}|M_2) = Π_{t=1}^T p(y_t|Y_{1:t−1}, M_2)   if λ = 0.

[Figure: the DGP p(Y_{1:T}) and the two models p(Y_{1:T}|M_1) and p(Y_{1:T}|M_2), with arrows indicating their KL discrepancies from the DGP.]

• λ_T^BMA →a.s. 1 or 0 as T → ∞ (Dawid 1984, others): asymptotically, no model averaging! All the weight is on the model closest to the DGP in KL discrepancy.
5. Optimal (Static) Pools: λ ∈ [0, 1]
• A policy maker concerned about misspecification of the M_i could create convex combinations of the predictive densities:

  p(Y_{1:T}|λ, P) = Π_{t=1}^T [ λ p(y_t|Y_{1:t−1}, M_1) + (1 − λ) p(y_t|Y_{1:t−1}, M_2) ].

[Figure: the DGP p(Y_{1:T}), with the pooled model spanning the segment between p(Y_{1:T}|M_1) and p(Y_{1:T}|M_2).]

• λ_T^SP = argmax_{λ∈[0,1]} p(y_{1:T}|λ, P) generally does not converge to 1 or 0 (unless one of the models is correct): it exploits gains from diversification. (A grid-search sketch follows below.)
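A grid-search sketch for λ^SP, reusing the same style of hypothetical predictive scores:

```python
import numpy as np

def static_pool_weight(logpred1, logpred2, grid=None):
    """lambda^SP = argmax_lambda sum_t log[lam p_1t + (1-lam) p_2t]."""
    if grid is None:
        grid = np.linspace(0.0, 1.0, 201)
    p1, p2 = np.exp(logpred1), np.exp(logpred2)
    score = [np.sum(np.log(l * p1 + (1 - l) * p2)) for l in grid]
    return grid[int(np.argmax(score))]

# hypothetical scores: M1 better early on, M2 better later
rng = np.random.default_rng(0)
l1 = rng.normal(-1.0, 0.3, 100); l1[50:] -= 0.4
l2 = rng.normal(-1.2, 0.3, 100); l2[50:] += 0.4
print("lambda_SP =", static_pool_weight(l1, l2))
```

With neither model dominating throughout the sample, the optimizer settles on an interior weight: the diversification result stated above.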
References: Hall and Mitchell (2007), Geweke and Amisano (2011)
5. Dynamic Pools – Prior for Weights λ_{1:T}
• Dynamic pool: replace λ by a sequence λ_t.
• Likelihood function:

  p(y_{1:T}|λ_{1:T}, P) = Π_{t=1}^T [ λ_t p(y_t|y_{1:t−1}, M_1) + (1 − λ_t) p(y_t|y_{1:t−1}, M_2) ].

• Prior p(λ_{1:T}|ρ) for the sequence λ_{1:T}:

  x_t = ρ x_{t−1} + √(1 − ρ²) ε_t,   ε_t ∼ iid N(0, 1),   x_0 ∼ N(0, 1),   λ_t = Φ(x_t),

  where Φ(·) is the Gaussian CDF.
• Unconditionally, λ_t ∼ U[0, 1] for all t.
• Hyperparameter ρ controls the amount of "smoothing."
• As ρ → 1: dynamic pool → static pool.
• Specify a prior distribution for ρ (and other hyperparameters) and base our results on the (real-time) posterior distribution. (A simulation sketch of the prior follows below.)
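A quick simulation from this prior (illustrative): the AR(1) in x_t is stationary with unit variance, so each λ_t = Φ(x_t) is uniform marginally, while ρ governs persistence.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)

def lambda_path(rho, T=200):
    """Draw lambda_{1:T} from the prior: Gaussian AR(1) mapped
    through the normal CDF, so each lambda_t is U[0,1] marginally."""
    x = rng.standard_normal()                  # x_0 ~ N(0, 1)
    lam = np.empty(T)
    for t in range(T):
        x = rho * x + np.sqrt(1 - rho**2) * rng.standard_normal()
        lam[t] = norm.cdf(x)
    return lam

print(lambda_path(0.99)[:5].round(2))  # rho near 1: smooth, static-pool-like
print(lambda_path(0.10)[:5].round(2))  # rho near 0: erratic weights
```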
5. Dynamic Pools – Nonlinear State-Space System
• Measurement equation:

  p(y_t|λ_t, P) = λ_t p(y_t|y_{1:t−1}, M_1) + (1 − λ_t) p(y_t|y_{1:t−1}, M_2).

• Transition equation:

  λ_t = Φ(x_t),   x_t = ρ x_{t−1} + √(1 − ρ²) ε_t,   ε_t ∼ iid N(0, 1).

• Use a particle filter to construct the sequence p(λ_t|ρ, I_t^P, P), as sketched below.
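A bootstrap particle filter sketch for the filtered pool weights, using hypothetical predictive scores for the two models (the measurement density of y_t is the pooled predictive density):

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)

def dynamic_pool_filter(logpred1, logpred2, rho, M=5000):
    """Bootstrap particle filter for p(lambda_t | rho, I_t^P):
    particles track x_t, with lambda_t = Phi(x_t)."""
    x = rng.standard_normal(M)                 # x_0 ~ N(0, 1)
    means = []
    for l1, l2 in zip(logpred1, logpred2):
        # propagate particles through the transition equation
        x = rho * x + np.sqrt(1 - rho**2) * rng.standard_normal(M)
        lam = norm.cdf(x)
        # weight by the pooled predictive density of y_t
        w = lam * np.exp(l1) + (1 - lam) * np.exp(l2)
        w /= w.sum()
        means.append(np.sum(w * lam))          # filtered mean of lambda_t
        # resample to keep the particle system balanced
        x = rng.choice(x, size=M, p=w)
    return np.array(means)

# hypothetical predictive scores: M1 better early, M2 better later
l1 = rng.normal(-1.0, 0.3, 100); l1[50:] -= 0.4
l2 = rng.normal(-1.2, 0.3, 100); l2[50:] += 0.4
path = dynamic_pool_filter(l1, l2, rho=0.9)
print(path[[0, 49, 99]].round(2))  # filtered weight on M1 drifts down
```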
5. Application
• Two models: Smets-Wouters and Smets-Wouters with financial frictions.
• Track relative performance over time and construct real-time weights.
5. Log Scores Comparison: SWFF vs. SWπ

[Figure: h-step-ahead log predictive scores ln p(ȳ^{(h)}_{t+h,h}|I_t^{m+}, M_m) for the two models over time.]

5. Dynamic Pools – Posterior p_DP(λ_t|I_t^P, P)

[Figure: real-time posterior of λ_t, with hyperparameters ρ ∼ U[0, 1], µ = 0, σ = 1.]
To Recap...
• Too little time to provide a detailed survey of state-of-the-art Bayesian methods.
• Instead: an eclectic collection of ideas and insights related to:
  1. Model Development
  2. Identification
  3. Priors
  4. Computations
  5. Working with Multiple Models