Stan: A platform for Bayesian inference

Andrew Gelman, Bob Carpenter, Matt Hoffman, Daniel Lee, Ben Goodrich, Michael Betancourt, Marcus Brubaker, Jiqiang Guo, Peter Li, and Allen Riddell

Department of Statistics, Columbia University, New York (and other places)
5 Mar 2014
Stan code for linear regression

data {
  int N;
  int K;
  vector[N] y;
  matrix[N,K] X;
}
parameters {
  vector[K] b;
  real<lower=0> sigma;  // scale must be positive
}
model {
  y ~ normal(X*b, sigma);
}
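For intuition, the log likelihood this model contributes to the posterior can be written out in plain Python (an illustrative re-derivation, not Stan's actual implementation):

```python
import math

def normal_lpdf(y, mu, sigma):
    # log density of Normal(mu, sigma) evaluated at y
    return -0.5 * math.log(2 * math.pi) - math.log(sigma) \
           - 0.5 * ((y - mu) / sigma) ** 2

def linreg_log_lik(y, X, b, sigma):
    # y ~ normal(X*b, sigma): sum the normal log density over rows of X
    lp = 0.0
    for y_n, x_n in zip(y, X):
        mu_n = sum(x * w for x, w in zip(x_n, b))
        lp += normal_lpdf(y_n, mu_n, sigma)
    return lp
```

With a perfect fit (X*b = y) and sigma = 1, each observation contributes exactly -0.5*log(2*pi).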
Stan code for an item response model: 1

data {
  int I;            // # questions
  int J;            // # students
  int K;            // # schools
  int N;            // # observations
  int ii[N];        // question for n
  int jj[N];        // student for n
  int kk[N];        // school for n
  int y[N];         // correctness for n
}
parameters {
  vector[J] alpha;  // ability for student j
  vector[I] beta;   // difficulty for item i
  vector[K] gamma;  // ability for school k
  vector[I] delta;  // discrimination for item i
  real<lower=0> sigma_gamma;  // school ability scale, positive
}
Stan code for an item response model: 2

model {
  alpha ~ normal(0,1);            // priors
  beta ~ normal(0,10);
  gamma ~ normal(0,sigma_gamma);
  delta ~ lognormal(-0.5,1);
  for (n in 1:N)                  // likelihood
    y[n] ~ bernoulli_logit(delta[ii[n]]
             * (alpha[jj[n]] + gamma[kk[n]] - beta[ii[n]]));
}
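To see exactly what each observation contributes, the bernoulli_logit likelihood can be sketched in Python (variable names follow the Stan program; this is an illustration, not Stan's implementation):

```python
import math

def log_sigmoid(eta):
    # log(1 / (1 + exp(-eta))), computed stably for large |eta|
    return min(eta, 0.0) - math.log1p(math.exp(-abs(eta)))

def irt_log_lik(y, ii, jj, kk, alpha, beta, gamma, delta):
    # y[n] ~ bernoulli_logit(delta[i] * (alpha[j] + gamma[k] - beta[i]))
    # using Pr(y = 0 | eta) = sigmoid(-eta)
    lp = 0.0
    for n in range(len(y)):
        eta = delta[ii[n]] * (alpha[jj[n]] + gamma[kk[n]] - beta[ii[n]])
        lp += log_sigmoid(eta) if y[n] == 1 else log_sigmoid(-eta)
    return lp
```

When abilities, school effects, and difficulties are all zero, every response has probability 1/2 regardless of discrimination.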
Stan overview
- Fit open-ended Bayesian models
- Specify log posterior density in C++
- Code a distribution once, then use it everywhere
- Hamiltonian No-U-Turn sampler
- Autodiff
- Runs from R or Python; postprocessing
People
- Core:
  - Andrew Gelman (stats/poli-sci): R, proto-user, . . .
  - Bob Carpenter (comp sci/linguistics): C++, language design, . . .
  - Matt Hoffman (comp sci/acoustics): NUTS, optimization, C++, . . .
  - Daniel Lee (comp sci/stats): C++, make, integration, . . .
  - Ben Goodrich (stats/poli-sci): linear algebra, R, C++, . . .
  - Michael Betancourt (physics): geometric samplers, C++, . . .
  - Marcus Brubaker (comp sci): linear algebra, optimization, C++, . . .
  - Jiqiang Guo (stats): R, C++, . . .
  - Peter Li (stats/math): C++, . . .
  - Allen Riddell (comp sci): Python, C++, . . .
- Research collaborators
- Users
Funding
- U.S. National Science Foundation
- Novartis
- Columbia University
- U.S. Department of Energy
Roles of Stan
- Bayesian inference for unsophisticated users (alternative to BUGS)
- Bayesian inference for sophisticated users (alternative to programming it yourself)
- Fast and scalable gradient computation
- Environment for developing new algorithms
Example from toxicology
Sparse data
Validation using predictive simulations
Public opinion: Health care reform
Public opinion: School vouchers
Public opinion: Death penalty
- Hierarchical time series model
- Open-ended
Tree rings
Speed dating
- Each person meets 10–20 “dates”
- Rate each date on Attractiveness, Sincerity, Intelligence, Ambition, Fun to be with, Shared interests
- Outcomes:
  - Do you want to see this person again? (Yes/No)
  - How much do you like this person? (1–10)
  - How important are each of the 6 attributes?
- Logistic or linear regression
- Hierarchical model: coefficients vary by person
Steps of Bayesian data analysis
- Model building
- Inference
- Model checking
- Model understanding and improvement
Background on Bayesian computation
- Point estimates and standard errors
- Hierarchical models
- Posterior simulation
- Markov chain Monte Carlo (Gibbs sampler and Metropolis algorithm)
- Hamiltonian Monte Carlo
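As a reference point for what Stan improves on, a random-walk Metropolis sampler for a one-dimensional target fits in a few lines of Python (a toy sketch, not Stan code):

```python
import math
import random

def metropolis(log_p, x0, n_iter, scale, seed=1):
    # Random-walk Metropolis: propose x' = x + scale * Normal(0, 1),
    # accept with probability min(1, p(x') / p(x))
    rng = random.Random(seed)
    x, lp = x0, log_p(x0)
    draws = []
    for _ in range(n_iter):
        prop = x + scale * rng.gauss(0.0, 1.0)
        lp_prop = log_p(prop)
        if math.log(rng.random()) < lp_prop - lp:
            x, lp = prop, lp_prop
        draws.append(x)
    return draws

# Standard normal target: draws should have mean ~0 and variance ~1
draws = metropolis(lambda x: -0.5 * x * x, 0.0, 20000, 2.4)
```

The proposal scale must be tuned by hand, and in high dimensions with correlated parameters that tuning becomes hopeless, which is part of what motivates Hamiltonian Monte Carlo.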
Solving problems
- Problem: Gibbs too slow, Metropolis too problem-specific
  Solution: Hamiltonian Monte Carlo
- Problem: Interpreters too slow, won’t scale
  Solution: Compilation
- Problem: Need gradients of log posterior for HMC
  Solution: Reverse-mode algorithmic differentiation
- Problem: Existing algo-diff slow, limited, unextensible
  Solution: Our own algo-diff
- Problem: Algo-diff requires fully templated functions
  Solution: Our own density library, Eigen linear algebra
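The reverse-mode idea can be illustrated with a toy expression graph in Python (Stan's actual implementation is templated C++; this only sketches the adjoint propagation):

```python
import math

class Var:
    # One node in the expression graph: a value, parent edges with
    # local gradients, and an adjoint filled in during the reverse sweep
    def __init__(self, val, parents=()):
        self.val = val
        self.parents = list(parents)  # (parent Var, d(self)/d(parent))
        self.adj = 0.0

    def __add__(self, other):
        return Var(self.val + other.val, [(self, 1.0), (other, 1.0)])

    def __mul__(self, other):
        return Var(self.val * other.val, [(self, other.val), (other, self.val)])

def vlog(x):
    return Var(math.log(x.val), [(x, 1.0 / x.val)])

def grad(out):
    # Topologically order the graph, then push adjoints from output to inputs
    order, seen = [], set()
    def visit(v):
        if id(v) not in seen:
            seen.add(id(v))
            for p, _ in v.parents:
                visit(p)
            order.append(v)
    visit(out)
    out.adj = 1.0
    for v in reversed(order):
        for p, g in v.parents:
            p.adj += v.adj * g

x, y = Var(3.0), Var(2.0)
f = x * y + vlog(x)   # df/dx = y + 1/x, df/dy = x
grad(f)
```

One forward evaluation plus one reverse sweep yields all partial derivatives, which is why the cost of a gradient is a small constant multiple of the cost of the log posterior itself.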
Radford Neal (2011) on Hamiltonian Monte Carlo

“One practical impediment to the use of Hamiltonian Monte Carlo is the need to select suitable values for the leapfrog stepsize, ε, and the number of leapfrog steps L . . . Tuning HMC will usually require preliminary runs with trial values for ε and L . . . Unfortunately, preliminary runs can be misleading . . . ”
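The ε and L that Neal refers to enter through the leapfrog integrator; a one-dimensional Python sketch shows the two tuning knobs (illustrative only):

```python
def leapfrog(grad_log_p, q, p, eps, L):
    # Simulate Hamiltonian dynamics for H(q, p) = -log p(q) + p^2/2
    # using L leapfrog steps of size eps
    p = p + 0.5 * eps * grad_log_p(q)    # initial half step in momentum
    for step in range(L):
        q = q + eps * p                  # full step in position
        if step < L - 1:
            p = p + eps * grad_log_p(q)  # full step in momentum
    p = p + 0.5 * eps * grad_log_p(q)    # final half step in momentum
    return q, p

# Standard normal target: log p(q) = -q^2/2, so grad log p(q) = -q
q1, p1 = leapfrog(lambda q: -q, 1.0, 0.0, 0.1, 10)
```

Leapfrog is reversible and nearly energy-conserving, which keeps the Metropolis correction cheap; but a poorly chosen ε or L wastes the whole trajectory, which is the tuning problem NUTS addresses.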
The No-U-Turn Sampler
- Created by Matt Hoffman
- Run the HMC steps until the trajectory starts to turn around (bends through an angle > 180°)
- Computationally efficient
- Requires no tuning of #steps
- Complications to preserve detailed balance
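The stopping rule can be sketched as a check on whether the momentum at either end of the trajectory still points away from the other end (a simplified version of the criterion in the NUTS paper):

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def is_u_turn(q_minus, q_plus, p_minus, p_plus):
    # Stop extending the trajectory once further leapfrog steps would,
    # to first order, shrink the distance between its two endpoints
    dq = [b - a for a, b in zip(q_minus, q_plus)]
    return dot(dq, p_plus) < 0 or dot(dq, p_minus) < 0
```

While both endpoint momenta have positive projection on the end-to-end displacement, the trajectory is still moving apart and doubling continues; once either projection turns negative, the trajectory is doubling back and sampling stops.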
NUTS example trajectory (Hoffman and Gelman)

[Figure 2: Example of a trajectory generated during one iteration of NUTS. The blue ellipse is a contour of the target distribution; the black open circles are the positions θ traced out by the leapfrog integrator and associated with elements of the set of visited states B; the black solid circle is the starting position; the red solid circles are positions associated with states that must be excluded from the set C of possible next samples because their joint probability is below the slice variable u; and the positions with a red “x” through them correspond to states that must be excluded from C to preserve detailed balance.]

- Blue ellipse is contour of target distribution
- Initial position at black solid circle
- Arrows indicate a U-turn in momentum
NUTS vs. Gibbs and Metropolis
- Two dimensions of highly correlated 250-dim distribution
- 1M samples from Metropolis, 1M from Gibbs (thin to 1K)
- 1K samples from NUTS, 1K independent draws

[Figure 7: Samples generated by random-walk Metropolis, Gibbs sampling, and NUTS. The plots compare 1,000 independent draws from a highly correlated 250-dimensional distribution (right) with 1,000,000 samples (thinned to 1,000 for display) generated by random-walk Metropolis (left), 1,000,000 samples (thinned to 1,000 for display) generated by Gibbs sampling (second from left), and 1,000 samples generated by NUTS (second from right). Only the first two dimensions are shown here.]
NUTS vs. Basic HMC I
- 250-D normal and logistic regression models
- Vertical axis shows effective #sims (big is good)
- (Left) NUTS; (Right) HMC with increasing trajectory length t = εL
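Effective sample size, the quantity on the vertical axis, discounts correlated draws; a simple version that truncates at the first negative autocorrelation can be written in Python (a rough estimator, not the one Stan uses):

```python
def effective_sample_size(x):
    # ESS = N / (1 + 2 * sum_k rho_k), truncating the sum at the
    # first negative autocorrelation rho_k
    n = len(x)
    mean = sum(x) / n
    var = sum((v - mean) ** 2 for v in x) / n
    def rho(k):
        return sum((x[i] - mean) * (x[i + k] - mean)
                   for i in range(n - k)) / (n * var)
    tail = 0.0
    for k in range(1, n):
        r = rho(k)
        if r < 0:
            break
        tail += r
    return n / (1 + 2 * tail)
```

A sticky chain that lingers in one region has large positive autocorrelations and thus an ESS far below the raw number of draws, which is exactly what the Metropolis and Gibbs columns of the comparison show.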
NUTS vs. Basic HMC II
- Hierarchical logistic regression and stochastic volatility
- Simulation time is step size times #steps L
- NUTS can beat optimally tuned HMC
Solving more problems in Stan
- Problem: Need ease of use of BUGS
  Solution: Compile directed graphical model language
- Problem: Need to tune parameters for HMC
  Solution: Auto tuning, adaptation
- Problem: Efficient up-to-proportion density calcs
  Solution: Density template metaprogramming
- Problem: Limited error checking, recovery
  Solution: Static model typing, informative exceptions
- Problem: Poor boundary behavior
  Solution: Calculate limits (e.g. lim_{x→0} x log x)
- Problem: Restrictive licensing (e.g., closed, GPL, etc.)
  Solution: Open-source, BSD license
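Two of the boundary-safe functions alluded to above are easy to illustrate in Python (sketches of the idea; Stan's versions live in its C++ math library):

```python
import math

def xlogx(x):
    # x * log(x), patched with its limit at the boundary:
    # lim_{x -> 0+} x log x = 0
    return 0.0 if x == 0.0 else x * math.log(x)

def log1m(x):
    # log(1 - x) computed via log1p, which stays accurate when x is tiny
    # (naive log(1 - x) rounds to log(1.0) = 0 for very small x)
    return math.log1p(-x)
```

Without the patch, xlogx(0.0) would raise a domain error even though the entropy-style expressions it appears in are perfectly well defined there.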
Example: the “Kumaraswamy distribution”

p(θ | a, b) = a b θ^(a−1) (1 − θ^a)^(b−1),  for a, b > 0 and θ ∈ (0, 1)

model {
  // Put priors on a and b here if you want
  // Put in the rest of your model

  // Kumaraswamy log-likelihood
  increment_log_prob(N*(log(a)+log(b)) + (a-1)*sum_log_theta);
  for (n in 1:N)
    increment_log_prob((b-1)*log1m(pow(theta[n],a)));
}
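A quick sanity check of the algebra: the two increment_log_prob lines together should reproduce the Kumaraswamy log density, which must integrate to 1 over (0, 1). In Python (an external check, not part of the Stan program):

```python
import math

def kumaraswamy_lpdf(theta, a, b):
    # log a + log b + (a-1) log theta + (b-1) log(1 - theta^a),
    # matching the N*(log(a)+log(b)), (a-1)*sum_log_theta, and
    # (b-1)*log1m(pow(theta,a)) terms in the Stan model
    return (math.log(a) + math.log(b)
            + (a - 1.0) * math.log(theta)
            + (b - 1.0) * math.log1p(-theta ** a))

# Midpoint rule: the density should integrate to ~1 over (0, 1)
a, b, n = 2.0, 3.0, 20000
total = sum(math.exp(kumaraswamy_lpdf((i + 0.5) / n, a, b))
            for i in range(n)) / n
```

The density also matches its closed form pointwise, e.g. p(0.5 | 2, 3) = 2 · 3 · 0.5 · (1 − 0.25)² = 1.6875.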
Check that it worked

R code:
N