Introduction to Bayesian multilevel models (hierarchical Bayes/graphical models)

Tom Loredo, Dept. of Astronomy, Cornell University
http://www.astro.cornell.edu/staff/loredo/
IAC Winter School, 3–4 Nov 2014
1970 baseball averages

Efron & Morris looked at batting averages of baseball players who had N = 45 at-bats in May 1970: a 'large' N, and the sample includes Roberto Clemente (an outlier!).

• Red = n/N maximum likelihood estimates of the true averages
• Blue = remainder of season ('true' averages), N_rmdr ≈ 9N
• Cyan = James-Stein estimator: nonlinear, correlated, biased, but better!

[Figure: early-season estimates vs. full-season averages on a 0.2–0.4 batting-average scale (grand mean ≈ 0.265); RMSE = 0.277 for the early-season MLEs vs. RMSE = 0.148 after shrinkage.]
[Figure: batting average (0–0.5) vs. full-season rank (0–20) for each player, comparing first-45-at-bats MLEs, full-season averages, and the shrinkage estimates; lines mark the closer estimate. Shrinkage is closer for 15 of 18 players.]
Theorem (independent Gaussian setting): In dimension ≳ 3, shrinkage estimators always beat independent MLEs in terms of expected RMS error.

"The single most striking result of post-World War II statistical theory" — Brad Efron
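To make the theorem concrete, here is a minimal numpy sketch of the positive-part James-Stein estimator (shrinking toward the grand mean) in the idealized known-variance Gaussian setting. The numbers are synthetic stand-ins: d = 18 matches the Efron & Morris player count, but sigma and the simulated averages are assumptions, not the real data.

```python
import numpy as np

rng = np.random.default_rng(42)

# Toy stand-in for the Efron-Morris setup (hypothetical numbers):
# true means theta_i, observed y_i ~ N(theta_i, sigma^2), known sigma.
d = 18                                     # number of players
sigma = 0.07                               # assumed common measurement std. dev.
theta = rng.normal(0.265, 0.03, size=d)    # "true" batting averages
y = rng.normal(theta, sigma)               # early-season MLEs

# Positive-part James-Stein estimator, shrinking toward the grand mean:
y_bar = y.mean()
s2 = np.sum((y - y_bar) ** 2)
shrink = max(0.0, 1.0 - (d - 3) * sigma**2 / s2)
theta_js = y_bar + shrink * (y - y_bar)

rmse = lambda est: np.sqrt(np.mean((est - theta) ** 2))
print(f"MLE RMSE:         {rmse(y):.4f}")
print(f"James-Stein RMSE: {rmse(theta_js):.4f}")   # typically smaller
```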
Accounting For Measurement Error

Introduce latent/hidden/incidental parameters. Suppose f(x|θ) is a distribution for an observable, x.

From N precisely measured samples, {x_i}, we can infer θ from

  L(θ) ≡ p({x_i}|θ) = ∏_i f(x_i|θ)

  p(θ|{x_i}) ∝ p(θ) L(θ) = p(θ, {x_i})

(A binomial point process)
Graphical representation

• Nodes/vertices = uncertain quantities (gray → known)
• Edges specify conditional dependence
• Absence of an edge denotes conditional independence

[Graph: θ at the top, with edges down to x_1, x_2, ..., x_N.]

The graph specifies the form of the joint distribution:

  p(θ, {x_i}) = p(θ) p({x_i}|θ) = p(θ) ∏_i f(x_i|θ)

Posterior from BT: p(θ|{x_i}) = p(θ, {x_i}) / p({x_i})
But what if the x data are noisy, D_i = {x_i + ε_i}?

{x_i} are now uncertain (latent) parameters. We should somehow incorporate ℓ_i(x_i) = p(D_i|x_i):

  p(θ, {x_i}, {D_i}) = p(θ) p({x_i}|θ) p({D_i}|{x_i})
                     = p(θ) ∏_i f(x_i|θ) ℓ_i(x_i)

Marginalize over {x_i} to summarize inferences for θ. Marginalize over θ to summarize inferences for {x_i}.

Key point: Maximizing over x_i and integrating over x_i can give very different results!
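To see why maximizing over the latent {x_i} can mislead, here is a hedged numpy sketch (a Neyman-Scott-style toy example of my own, not from the slides): latent x_i are drawn from a Gaussian population of width τ and observed with noise σ. The joint likelihood maximized over {x_i} grows without bound as τ → 0, while the marginal likelihood (with the x_i integrated out analytically) peaks near the true τ.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical Gaussian setup: population f(x|tau) = N(x; 0, tau^2),
# noisy data D_i = x_i + eps_i with known noise sigma.
tau_true, sigma, N = 2.0, 1.0, 200
x = rng.normal(0.0, tau_true, N)         # latent member values
D = x + rng.normal(0.0, sigma, N)        # observed data

def log_marginal(tau):
    """log p({D}|tau) with the x_i integrated out: D_i ~ N(0, tau^2 + sigma^2)."""
    v = tau**2 + sigma**2
    return -0.5 * np.sum(D**2 / v + np.log(2 * np.pi * v))

def log_profile(tau):
    """log of max over {x_i} of p({x_i}|tau) p({D}|{x_i})."""
    xhat = D * tau**2 / (tau**2 + sigma**2)   # per-member maximizer
    return np.sum(-0.5 * (xhat / tau) ** 2
                  - 0.5 * ((D - xhat) / sigma) ** 2
                  - np.log(2 * np.pi * tau * sigma))

taus = np.linspace(0.05, 4.0, 200)
best = taus[np.argmax([log_marginal(t) for t in taus])]
print(f"marginal likelihood peaks near tau = {best:.2f} (truth: {tau_true})")
print(f"profile log-like: tau=0.05 -> {log_profile(0.05):.1f}, "
      f"tau=2.0 -> {log_profile(2.0):.1f}")   # diverges as tau -> 0
```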
To estimate x_1:

  p(x_1|{D_i}) ∝ ℓ_1(x_1) ∫dθ p(θ) f(x_1|θ) ∏_{i=2}^{N} ∫dx_i f(x_i|θ) ℓ_i(x_i)
              = ℓ_1(x_1) ∫dθ p(θ) f(x_1|θ) L_m,-1(θ)
              ≈ ℓ_1(x_1) f(x_1|θ̂)

where L_m,-1(θ) is the marginal likelihood omitting member 1, and θ̂ is determined by the remaining data (empirical Bayes, EB).

f(x_1|θ̂) behaves like a prior that shifts the x_1 estimate away from the peak of ℓ_1(x_1).

This generalizes the corrections derived by Eddington, Malmquist and Lutz-Kelker (sans selection effects). (Landy & Szalay (1992) proposed adaptive Malmquist corrections that can be viewed as an approximation to this.)
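A hedged worked example (mine, not from the slides): if both the plug-in population f(x_1|θ̂) and the member likelihood ℓ_1(x_1) are Gaussian, their product is Gaussian with a precision-weighted mean, pulled from the raw measurement toward the population mean.

```python
# Hypothetical Gaussian illustration of p(x1|D) ~ l1(x1) * f(x1|theta_hat):
mu_hat, tau = 0.265, 0.03    # plug-in population mean and width (assumed)
D1, sigma = 0.400, 0.07      # member 1's noisy measurement and its error

w_pop, w_dat = 1 / tau**2, 1 / sigma**2          # precisions
x1_mean = (w_pop * mu_hat + w_dat * D1) / (w_pop + w_dat)
x1_sd = (w_pop + w_dat) ** -0.5

print(f"raw estimate:      {D1:.3f}")
print(f"shrunken estimate: {x1_mean:.3f} +/- {x1_sd:.3f}")
```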
Graphical representation

[Graph: θ → x_1, x_2, ..., x_N → D_1, D_2, ..., D_N.]

  p(θ, {x_i}, {D_i}) = p(θ) p({x_i}|θ) p({D_i}|{x_i})
                     = p(θ) ∏_i f(x_i|θ) p(D_i|x_i) = p(θ) ∏_i f(x_i|θ) ℓ_i(x_i)

(sometimes called a "two-level MLM" or "two-level hierarchical model")
Multilevel models

1. Conditional and marginal dependence/independence
2. Populations and multilevel modeling
3. MLMs for cosmic populations
1. Conditional and marginal dependence/independence
Binomial counts

[Cartoon: two runs of coin flips; n_1 heads in N flips, then n_2 heads in N flips.]

Suppose we know n_1 and want to predict n_2.
Predicting binomial counts — known α

Success probability α → p(n|α) = [N!/(n!(N−n)!)] α^n (1−α)^(N−n)

Consider two successive runs of N = 20 trials with known α = 0.5:

  p(n_2|n_1, α) = p(n_2|α)

n_1 and n_2 are conditionally independent.

[Figure: joint distribution of (n_1, n_2) on a 0–20 grid; the n_2 distribution does not depend on n_1.]
Model structure as a graph

• Circular nodes/vertices = a priori uncertain quantities (gray = becomes known as data)
• Edges specify conditional dependence
• Absence of an edge indicates conditional independence

[Graph: α → n_1, n_2 ⟺ plate notation: α → n_i, with the repeated node on a plate labeled N.]

  p({n_i}|α) = ∏_i p(n_i|α)

Knowing α lets you predict each n_i, independently.
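A quick simulation sketch (my own illustration, not from the slides) that checks this: with α known, conditioning on n_1 leaves the distribution of n_2 unchanged.

```python
import numpy as np

rng = np.random.default_rng(1)
N, alpha, trials = 20, 0.5, 200_000

n1 = rng.binomial(N, alpha, trials)
n2 = rng.binomial(N, alpha, trials)

# With alpha known, p(n2|n1, alpha) = p(n2|alpha):
print(f"unconditional E[n2] = {n2.mean():.3f}")
for cut in (5, 10, 15):
    sel = n1 == cut
    print(f"E[n2 | n1={cut}] = {n2[sel].mean():.3f}")
```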
Predicting binomial counts — uncertain α

Consider the same setting, but with α uncertain. The outcomes are physically independent, but n_1 tells us about α → the outcomes are marginally dependent:

  p(n_2|n_1, N) = ∫dα p(α, n_2|n_1, N) = ∫dα p(α|n_1, N) p(n_2|α, N)

[Figure: joint (n_1, n_2) distributions on 0–20 grids for a flat prior on α (left) and a prior with α = 0.5 ± 0.1 (right); the n_2 distribution now shifts with n_1.]
Graphical model — "Probability for everything"

[Graph: α → n_1, n_2; information flows from n_1 through α to n_2.]

  p(α, n_1, n_2) = π(α) ∏_i p(n_i|α) ≡ π(α) ∏_i ℓ_i(α)

where ℓ_i(α) is the member likelihood.

From joint to conditionals:

  p(α|n_1, n_2) = p(α, n_1, n_2) / p(n_1, n_2) = π(α) ∏_i ℓ_i(α) / ∫dα π(α) ∏_i ℓ_i(α)

  p(n_2|n_1) = ∫dα p(α, n_1, n_2) / p(n_1)

Observing n_1 lets you learn about α; knowledge of α affects predictions for n_2 → dependence on n_1.
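A small grid-integration sketch of these conditionals (my own illustration; the Beta(12.5, 12.5) prior is a stand-in roughly matching the α = 0.5 ± 0.1 panel above):

```python
import numpy as np
from scipy import stats

N = 20
a, b = 12.5, 12.5                    # Beta(a, b): mean 0.5, std ~ 0.1
alphas = np.linspace(0.0005, 0.9995, 1000)
dal = alphas[1] - alphas[0]
prior = stats.beta.pdf(alphas, a, b)

def predictive(n1):
    """p(n2|n1) = int dalpha p(alpha|n1) p(n2|alpha), on a grid."""
    post = prior * stats.binom.pmf(n1, N, alphas)   # unnormalized p(alpha|n1)
    post /= post.sum() * dal
    return np.array([(post * stats.binom.pmf(n2, N, alphas)).sum() * dal
                     for n2 in range(N + 1)])

for n1 in (5, 15):
    pmf = predictive(n1)
    print(f"n1 = {n1}: E[n2|n1] = {pmf @ np.arange(N + 1):.2f}")
# The expected n2 tracks the observed n1: marginal dependence.
```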
2. Populations and multilevel modeling
A population of coins/flippers

[Cartoon: a population of flipper+coin pairs.]

Each flipper+coin pair flips a different number of times.
Population parameters

[Graph: θ → α_1, α_2, ..., α_N (success probabilities) → n_1, n_2, ..., n_N (data).]

  p(θ, {α_i}, {n_i}) = π(θ) ∏_i p(α_i|θ) p(n_i|α_i)
                     = π(θ) ∏_i p(α_i|θ) ℓ_i(α_i)

Terminology: θ are hyperparameters; π(θ) is the hyperprior.
A simple multilevel model: beta-binomial

Goal: Learn a population-level "prior" by pooling data

Qualitative: [Graph: θ → α_1, ..., α_N → n_1, ..., n_N.]

Quantitative:

  θ = (a, b) or (µ, σ)                                   (population parameters)
  π(θ) = Flat(µ, σ)
  p(α_i|θ) = Beta(α_i|θ)                                 (success probabilities)
  p(n_i|α_i) = (N_i choose n_i) α_i^{n_i} (1−α_i)^{N_i−n_i}   (data)

  p(θ, {α_i}, {n_i}) = π(θ) ∏_i p(α_i|θ) p(n_i|α_i) = π(θ) ∏_i p(α_i|θ) ℓ_i(α_i)
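A compact numerical sketch of this model (my own illustration; it parameterizes the Beta by (a, b) and evaluates the hyperparameter posterior on a grid, using the closed-form beta-binomial marginal likelihood with the α_i integrated out analytically):

```python
import numpy as np
from scipy.special import betaln, gammaln

rng = np.random.default_rng(2)

# Simulate a population (hypothetical numbers):
a_true, b_true, M = 20.0, 60.0, 30          # Beta(20, 60): mean 0.25
alpha = rng.beta(a_true, b_true, M)          # member success probabilities
Ntr = rng.integers(10, 100, M)               # different trial counts per member
n = rng.binomial(Ntr, alpha)                 # observed counts

def log_marglike(a, b):
    """Sum of log beta-binomial marginals: alpha_i integrated out analytically."""
    lchoose = gammaln(Ntr + 1) - gammaln(n + 1) - gammaln(Ntr - n + 1)
    return np.sum(lchoose + betaln(a + n, b + Ntr - n) - betaln(a, b))

# Grid posterior over (a, b) with a flat hyperprior:
aa, bb = np.meshgrid(np.linspace(1, 60, 120), np.linspace(1, 150, 150))
logpost = np.vectorize(log_marglike)(aa, bb)
i, j = np.unravel_index(np.argmax(logpost), logpost.shape)
print(f"MAP hyperparameters: a = {aa[i, j]:.1f}, b = {bb[i, j]:.1f}")
```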
Generating the population & data

[Figure: a Beta population distribution (parameterized by mean and concentration) generating the member α_i, and the resulting binomial distributions for the data.]
Likelihood function for one member's α

[Figure.]
Learning the population distribution

[Figure.]
Lower level estimates

Two approaches:

• Hierarchical Bayes (HB): calculate the marginals

  p(α_j|{n_i}) ∝ ℓ_j(α_j) ∫dθ π(θ) p(α_j|θ) ∏_{i≠j} ∫dα_i p(α_i|θ) p(n_i|α_i)

• Empirical Bayes (EB): plug in an optimum θ̂ and estimate the {α_i}; view this as an approximation to HB, or as a frequentist procedure
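Continuing the beta-binomial sketch above (same hypothetical variables), the EB member estimates follow in closed form once the plug-in (â, b̂) is fixed, because the Beta prior is conjugate to the binomial likelihood:

```python
# Continuation of the beta-binomial sketch above (same variables).
a_hat, b_hat = aa[i, j], bb[i, j]                 # plug-in hyperparameters (EB)

alpha_mle = n / Ntr                               # independent member MLEs
alpha_eb = (n + a_hat) / (Ntr + a_hat + b_hat)    # conjugate posterior means

rmse = lambda est: np.sqrt(np.mean((est - alpha) ** 2))
print(f"MLE RMSE: {rmse(alpha_mle):.4f}")
print(f"EB  RMSE: {rmse(alpha_eb):.4f}")          # shrinkage typically wins
```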
Bayesian outlook:
• Marginal posteriors are narrower than the likelihoods
• Point estimates tend to be closer to the true values than MLEs (averaged across the population)
• The joint distribution for {α_i} is dependent

Frequentist outlook:
• Point estimates are biased
• Reduced variance → estimates are closer to the truth on average (lower MSE in repeated sampling)
• The bias for one member's estimate depends on the data for all other members

Lingo:
• Estimates shrink toward the prior/population mean
• Estimates "muster and borrow strength" across the population (Tukey's phrase); this increases both accuracy and precision of the estimates
Population and member estimates

[Figure.]
Competing data analysis goals

"Shrunken" member estimates provide improved and reliable estimates of population member properties. But they are under-dispersed in comparison to the true values → not optimal for estimating population properties.*

No point estimates of member properties are good for all tasks! We should view data catalogs as providing descriptions of member likelihood functions, not "estimates with errors".

* Louis (1984); Eddington noted this in 1940!
3. MLMs for cosmic populations
Observing and modeling cosmic populations

[Diagram: population → observables → measurements → catalog. Source characteristics χ (e.g., L, r) map to observables (e.g., flux F); indicator scatter and transformation bias affect the mapping, measurement error affects observation, and selection introduces truncation and censoring. Precise quantities (e.g., z) may become uncertain measured ones.]
Science goals:
• Density estimation: infer the distribution of source characteristics, p(χ)
• Regression/conditional density estimation: infer relationships between different characteristics
• Map regression: infer parameters defining the mapping from characteristics to observables

Notably, seeking improved point estimates of source characteristics is seldom a goal in astronomy.
Number counts, luminosity functions

• GRB peak fluxes: Loredo & Wasserman 1993, 1995, 1998 (GRB luminosity/spatial distribution via hierarchical Bayes)
• TNO magnitudes: Gladman+ 1998, 2001, 2008 (TNO size distribution via hierarchical Bayes)
CB244 molecular cloud

[Figure: Herschel data from Stutz+ 2010; SED properties vs. position.]

Kelly+ 2012: dust parameter correlations via hierarchical Bayes. β = power-law modification index; expect β → 0 for large grains.
Measurement error models for cosmic populations

[Same population → observables → measurements → catalog diagram as above.]
Schematic graphical model

[DAG (directed acyclic graph): population parameters θ → source characteristics χ_1, ..., χ_N → source observables O_1, ..., O_N → data D_1, ..., D_N. Circles are "random variable" nodes (each has a pdf); shaded nodes become known (data); edges denote conditional dependence.]

The graph specifies the form of the joint distribution:

  p(θ, {χ_i}, {O_i}, {D_i}) = p(θ) ∏_i p(χ_i|θ) p(O_i|χ_i) p(D_i|O_i)

Posterior from Bayes's theorem:

  p(θ, {χ_i}, {O_i}|{D_i}) = p(θ, {χ_i}, {O_i}, {D_i}) / p({D_i})
Plates

[Diagram: the explicit graph θ → (χ_1, ..., χ_N) → (O_1, ..., O_N) → (D_1, ..., D_N) is equivalent to a compact plate representation, θ → χ_i → O_i → D_i, with the repeated subgraph enclosed in a plate labeled N.]
"Two-level" effective models

Number counts (O = flux): [plate diagram θ → O_i → D_i]. Calculate the flux distribution using the "fundamental equation" of statistical astronomy, analytically/numerically marginalizing over χ = (L, r).

Dust SEDs (χ = spectrum params): [plate diagram θ → χ_i → D_i]. Observables = fluxes in bandpasses; the fluxes are deterministic in χ_i.
From flips to fluxes

Simplified number counts model:
• α_i → source flux, F_i
• Upper level π(α) → log N–log S distribution
• n_i → counts in CCD pixels

⇒ "Eddington bias" in disguise, with both member and population inference and uncertainty quantification
Another conjugate MLM: Gamma-Poisson

Goal: Learn a flux distribution from photon counts

Qualitative: [Graph: θ → F_1, ..., F_N → n_1, ..., n_N.]

Quantitative:

  θ = (α, s) or (µ, σ)             (population parameters)
  π(θ) = Flat(µ, σ)
  p(F_i|θ) = Gamma(F_i|θ)          (source properties)
  p(n_i|F_i) = Pois(n_i|ε_i F_i)   (observed data)
Gamma-Poisson population and member estimates

[Figure: true, ML, and shrunken point estimates against the MLM posterior for p(F); KLD from the truth: ML = 0.060, shrunken points = 0.179, MLM = 0.031. Member estimates: ML RMSE = 4.30, EB RMSE = 3.72.]

Simulations: N = 60 sources from a gamma distribution with ⟨F⟩ = 100 and σ_F = 30; exposures spanning a dynamic range of ×16.
Benefits and requirements of cosmic MLMs

Benefits:
• Selection effects quantified by non-detection data (vs. V/V_max and "debiasing" approaches)
• Source uncertainties propagated via marginalization
  • Adaptive generalization of Eddington/Malmquist "corrections"
  • No single adjustment addresses both source & population estimation

Requirements:
• Data summaries for non-detection intervals (exposure, efficiency)
• Likelihood functions (not posterior distributions) for detected source characteristics (perhaps a role for interim priors)
Some Bayesian MLMs in astronomy

Surveys (number counts/"log N–log S"/Malmquist):
• GRB peak flux distribution (Loredo & Wasserman 1998+)
• TNO/KBO magnitude distribution (Gladman+ 1998; Petit+ 2008)
• Malmquist-type biases in cosmology; MLM tutorial (Loredo & Hendry 2009 in the BMIC book)
• "Extreme deconvolution" for proper motion surveys (Bovy, Hogg & Roweis 2011)
• Exoplanet populations (2014 Kepler workshop)

Directional & spatio-temporal coincidences:
• GRB repetition (Luo+ 1996; Graziani+ 1996)
• GRB host ID (Band 1998; Graziani+ 1999)
• VO cross-matching (Budavári & Szalay 2008)
Linear regression with measurement error:
• QSO hardness vs. luminosity (Kelly 2007)

Time series:
• SN 1987A neutrinos, uncertain energy vs. time (Loredo & Lamb 2002)
• Multivariate "Bayesian Blocks" (Dobigeon, Tourneret & Scargle 2007)
• SN Ia multicolor light curve modeling (Mandel+ 2009+)
How far we've come

SN 1987A neutrinos, 1990: marked Poisson point process; background, thinning/truncation, measurement error. [Plate diagram: θ → (t, ε) → D, repeated N times.] Model checking via examining conditional predictive distributions.

SN Ia light curves (Mandel 2009, 2011): nonlinear regression, Gaussian process regression, measurement error. Model checking via cross-validation.
SN IIP light curves (Sanders+ 2014)

[Figure.]
Recap of Key Ideas

• Conditional & marginal dependence/independence
• Latent parameters for measurement error
• Graphical models, multilevel models, hyperparameters
• Beta-binomial & gamma-Poisson conjugate MLMs
• Shrinkage estimators (member point estimates)
  • Empirical Bayes
  • Hierarchical Bayes
• Member vs. population inference—competing goals