Model Inference and Averaging
Ting-Kam Leonard Wong
Department of Mathematics, University of Washington, Seattle
UW Math Data Science Seminar, May 28, 2015
Topics
Selected topics from The Elements of Statistical Learning by Hastie, Tibshirani and Friedman:
- Chapter 7: Model Assessment and Selection
- Chapter 8: Model Inference and Averaging
Estimating prediction error
General setting
- T = {(x_i, y_i) : 1 ≤ i ≤ N}: training data (samples)
- (X, Y): underlying population (with a certain distribution)
- f̂(X): prediction model estimated from T
- L(Y, f̂(X)): loss function, e.g. the square loss (Y − f̂(X))²
- Generalization error
      Err_T = E[ L(Y, f̂(X)) | T ]
- Expected prediction error
      Err = E[ L(Y, f̂(X)) ] = E[ Err_T ]
- Training error
      err = (1/N) ∑_{i=1}^N L(y_i, f̂(x_i))
Main ideas
- The training error err may not be a good estimate of Err or Err_T
- err is computed over T, the same data to which f̂ is fitted
- Bias of f̂(·) (the model does not capture the true f completely)
- Variability of f̂(·) (which depends on T)
Bias-variance decomposition
Assume
- Y = f(X) + ε, with E[ε] = 0 and Var(ε) = σ_ε²
- L(Y, f̂(X)) = (Y − f̂(X))²
Fix x₀. The bias-variance decomposition (the expectation is over Y and over the training set, hence over f̂) is
      Err(x₀) = E[ (Y − f̂(x₀))² | X = x₀ ]
              = σ_ε² + [E f̂(x₀) − f(x₀)]² + E[ (f̂(x₀) − E f̂(x₀))² ]
              = Irreducible error + Bias² + Variance
(a short derivation is sketched below)
- Bias-variance tradeoff; overfitting
- Estimation of prediction error
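A quick sketch of the algebra behind the decomposition (assuming ε is independent of f̂(x₀) and E[ε] = 0):

```latex
\begin{aligned}
\mathrm{Err}(x_0) &= E\!\left[(Y - \hat f(x_0))^2 \mid X = x_0\right]
  = E\!\left[(\varepsilon + f(x_0) - \hat f(x_0))^2\right] \\
  &= \sigma_\varepsilon^2 + E\!\left[(f(x_0) - \hat f(x_0))^2\right]
  \qquad \text{(the cross term vanishes since } E[\varepsilon] = 0\text{)} \\
  &= \sigma_\varepsilon^2 + \left[f(x_0) - E\hat f(x_0)\right]^2
     + E\!\left[(\hat f(x_0) - E\hat f(x_0))^2\right].
\end{aligned}
```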
Example: k-nearest neighbor regression
- Average the y-values of the k nearest neighbors {x_(ℓ)} of x₀:
      f̂(x₀) = (1/k) ∑_{ℓ=1}^k y_{x_(ℓ)}
- Bias-variance decomposition (conditioned on the training data):
      Err(x₀) = σ_ε² + [ f(x₀) − (1/k) ∑_{ℓ=1}^k f(x_(ℓ)) ]² + σ_ε²/k
- The optimal k depends on the shape of f.
Example: k-nearest neighbor regression
- X = (X₁, X₂) ∈ [0, 1]² uniformly distributed
- f(X) = 3X₁² + X₁X₂ (unknown), σ_ε² = 0.25
- Sampled N = 200 values of x and estimated
      Êrr_k = σ_ε² + (1/N) ∑_{i=1}^N [ f(x_i) − (1/k) ∑_{ℓ=1}^k f(x_(i,ℓ)) ]² + σ_ε²/k
  (a simulation sketch follows below)
[Figure: the estimate Êrr_k plotted against k = 1, ..., 40]
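The following is a rough reconstruction of this simulation in Python (my own sketch, not the speaker's code); the variable names and the choice to include each point among its own nearest neighbors are assumptions.

```python
# Estimate Err_k for k-NN regression with f(X) = 3*X1^2 + X1*X2, sigma_eps^2 = 0.25, N = 200.
import numpy as np

rng = np.random.default_rng(0)
N, sigma2 = 200, 0.25
X = rng.uniform(size=(N, 2))
f = 3 * X[:, 0] ** 2 + X[:, 0] * X[:, 1]       # true regression function at the sample points

# pairwise distances; neighbors of x_i are taken within the sample (x_i itself included)
D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
order = np.argsort(D, axis=1)

err_hat = {}
for k in range(1, 41):
    knn_mean_f = f[order[:, :k]].mean(axis=1)  # (1/k) * sum of f over the k nearest neighbors
    bias2 = np.mean((f - knn_mean_f) ** 2)
    err_hat[k] = sigma2 + bias2 + sigma2 / k   # irreducible error + bias^2 + variance

best_k = min(err_hat, key=err_hat.get)         # k minimizing the estimated Err
```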
Example: k-nearest neighbor regression
If we naively compute the training error
      (1/N) ∑_{i=1}^N (y_i − f̂(x_i))²
we get a very poor estimate of Err, which we can see here because we know the truth.
[Figure: training error (square loss) plotted against k = 1, ..., 40]
Cross-Validation
Setting:
- Fix a learning method {(x_i, y_i)}_{i=1}^N ↦ f̂(·)
- Estimate Err = E[ L(Y, f̂(X)) ]
Procedure:
- Divide the data randomly into K parts, typically K = 5 or 10
- For each 1 ≤ k ≤ K, train f̂^{−k} without the k-th part. Each f̂^{−k} is trained on about (K − 1)/K of the entire data set
- For each data point (x_i, y_i) let κ(i) ∈ {1, ..., K} be its group, and compute
      CV(f̂) = (1/N) ∑_{i=1}^N L(y_i, f̂^{−κ(i)}(x_i))
- If f̂ = f̂_α depends on a tuning parameter α, find the α̂ which minimizes CV(f̂_α), then use this α̂ to fit the model to all the data
(a code sketch follows below)
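A minimal sketch of K-fold cross-validation with square loss; fit and predict are hypothetical stand-ins for whatever learning method is being assessed.

```python
import numpy as np

def cv_error(x, y, fit, predict, K=10, seed=0):
    rng = np.random.default_rng(seed)
    N = len(y)
    folds = rng.permutation(N) % K          # kappa(i): fold label of observation i
    losses = np.empty(N)
    for k in range(K):
        test = folds == k
        model = fit(x[~test], y[~test])     # train f^{-k} without the k-th part
        losses[test] = (y[test] - predict(model, x[test])) ** 2
    return losses.mean()                    # CV(f_hat) under square loss
```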
Cross-Validation: remarks
(p. 245) Consider a classification problem with a large number of predictors and consider the following modeling strategy:
1. Screen the predictors: find a subset of "good predictors" that show fairly strong correlation with the class labels
2. Using just this subset of predictors, build a multivariate classifier
3. Use cross-validation to estimate the unknown tuning parameters and to estimate the prediction error of the final model
Cross-Validation: remarks
The authors warn (p. 246): "While this point may seem obvious to the reader, we have seen this blunder committed many times in published papers in top rank journals."
Correct procedure: divide the data into K groups at random. For each k:
1. Find a subset of "good" predictors that show fairly strong correlation with the class labels, using all of the samples except those in fold k
2. Using just this subset of predictors, build a multivariate classifier, using all of the samples except those in fold k
3. Use the classifier to predict the class labels for the samples in fold k
(a code sketch follows below)
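A minimal sketch of the correct procedure, with the screening step redone inside every fold; screen_predictors and fit_classifier are hypothetical placeholders (the fitted classifier is assumed to expose a predict method).

```python
import numpy as np

def cv_error_with_screening(X, y, screen_predictors, fit_classifier, K=5, seed=0):
    rng = np.random.default_rng(seed)
    folds = rng.permutation(len(y)) % K
    fold_errors = []
    for k in range(K):
        test = folds == k
        keep = screen_predictors(X[~test], y[~test])        # 1. screen on training folds only
        clf = fit_classifier(X[~test][:, keep], y[~test])   # 2. fit on training folds only
        pred = clf.predict(X[test][:, keep])                # 3. predict the held-out fold
        fold_errors.append(np.mean(pred != y[test]))
    return float(np.mean(fold_errors))
```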
Bootstrap methods
- Z = (z₁, ..., z_N), z_i = (x_i, y_i): data set
- A bootstrap sample Z^{*b} consists of N data points sampled with replacement from Z
- Apply the computation to Z^{*b}, and repeat for b = 1, ..., B
- Average over the B bootstrap samples (a code sketch follows below)
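A minimal sketch of the basic bootstrap recipe, here used to estimate the standard error of a statistic; the exponential data and the choice of the median are made up for illustration.

```python
import numpy as np

def bootstrap(z, statistic, B=1000, seed=0):
    rng = np.random.default_rng(seed)
    N = len(z)
    # recompute the statistic on B bootstrap samples drawn with replacement from z
    stats = np.array([statistic(z[rng.integers(0, N, size=N)]) for _ in range(B)])
    return stats.mean(), stats.std(ddof=1)   # bootstrap mean and standard error

# Example: standard error of the sample median of 50 exponential draws.
z = np.random.default_rng(1).exponential(size=50)
mean_b, se_b = bootstrap(z, np.median)
```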
Bootstrap: simple example (taken from the Internet)
Estimating prediction error with bootstrap
- Simple approach:
      Êrr_boot = (1/B)(1/N) ∑_{b=1}^B ∑_{i=1}^N L(y_i, f̂^{*b}(x_i))
- Leave-one-out bootstrap:
      Êrr_boot^(1) = (1/N) ∑_{i=1}^N (1/|C^{−i}|) ∑_{b ∈ C^{−i}} L(y_i, f̂^{*b}(x_i))
  where C^{−i} is the set of bootstrap samples that do not contain observation i
- More sophisticated variants...
- Requires a lot of computation (a code sketch follows below)
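A minimal sketch of the leave-one-out bootstrap estimate, assuming square loss; fit and predict are hypothetical stand-ins for the learning method.

```python
import numpy as np

def err_boot_loo(x, y, fit, predict, B=100, seed=0):
    rng = np.random.default_rng(seed)
    N = len(y)
    losses = [[] for _ in range(N)]
    for b in range(B):
        idx = rng.integers(0, N, size=N)            # indices of the bootstrap sample Z*b
        model = fit(x[idx], y[idx])
        held_out = np.setdiff1d(np.arange(N), idx)  # observations not contained in Z*b
        for i in held_out:
            pred = predict(model, x[i:i + 1])[0]
            losses[i].append((y[i] - pred) ** 2)
    # average only over the bootstrap samples that do not contain observation i
    return float(np.mean([np.mean(l) for l in losses if l]))
```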
Model Selection and Averaging
Bayesian approach and BIC
- Fitting is carried out by maximizing a log-likelihood
- Bayesian information criterion (BIC) for a model with d parameters:
      BIC = −2 · loglik + (log N) · d
  (a small code sketch follows below)
- Consider candidate models M_m, m = 1, ..., M
- Choose the m that minimizes BIC
- Example: choose p and q for ARMA(p, q), a class of linear time series models
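A minimal sketch of the BIC formula as code; the log-likelihoods in the example are made-up numbers, not results from the talk.

```python
import numpy as np

def bic(loglik, d, N):
    """BIC = -2 * loglik + log(N) * d for a model with d parameters fit to N data points."""
    return -2.0 * loglik + np.log(N) * d

# Illustrative (made-up) comparison: a 2-parameter vs. a 5-parameter model on N = 300 points.
print(bic(loglik=-280.0, d=2, N=300), bic(loglik=-277.5, d=5, N=300))
```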
Why is BIC "Bayesian"?
Idea: use the data to update your information about which model is the best.
- P(M_m): prior distribution over models
- Posterior: P(M_m | Z) ∝ P(M_m) · P(Z | M_m)
- Bayesian decision rule: choose the m which maximizes P(M_m | Z)
- Laplace approximation:
      log P(Z | M_m) ≈ log P(Z | θ̂_m, M_m) − (d_m/2) · log N
  where θ̂_m is the maximum likelihood estimate
- So minimizing BIC is approximately equivalent to choosing the model with the largest posterior probability (with a uniform prior). In fact
      P(M_m | Z) ≈ e^{−BIC_m/2} / ∑_{ℓ=1}^M e^{−BIC_ℓ/2}
  (a code sketch follows below)
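A minimal sketch of this approximation; the three BIC values in the example are illustrative numbers only.

```python
import numpy as np

def bic_to_posterior(bic_values):
    """Approximate P(M_m | Z) from BIC values, assuming a uniform prior over models."""
    b = np.asarray(bic_values, dtype=float)
    w = np.exp(-0.5 * (b - b.min()))   # subtract the minimum for numerical stability
    return w / w.sum()

# Example with illustrative BIC values for three candidate models.
print(bic_to_posterior([567.1, 566.4, 571.6]))
```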
Example: time series analysis
- ARMA(p, q) model for {X_t}:
      X_t = ∑_{i=1}^p φ_i X_{t−i} + ε_t + ∑_{j=1}^q θ_j ε_{t−j}
- In practice, usually p, q ≤ 2 or 3.
- Assume the true model is ARMA(1, 2):
      X_t = 0.5 X_{t−1} + ε_t + 0.5 ε_{t−1} − 0.3 ε_{t−2}
[Figures: a simulated time series of X (300 time points) and its autocorrelation function (ACF, lags 0 to 20)]
Example: time series analysis
- We simulated N = 300 data points from the true model
- Fit an ARMA(p, q) model for p, q = 0, 1, ..., 5 and compute the BIC (rows indexed by q, columns by p):

  q \ p      0        1        2        3        4        5
    0     753.11   630.03   616.04   594.15   573.31   577.29
    1     567.09   566.42   571.61   571.18   575.26   578.68
    2     566.84   570.36   576.04   576.39   579.13   584.22
    3     572.46   576.02   578.11   577.40   584.42   589.81
    4     570.28   576.36   580.41   585.96   590.11   592.97
    5     575.44   580.81   585.60   587.43   590.70   596.40

- The models (p, q) = (0, 1), (0, 2), (1, 1) and (1, 2) give the best fits (a code sketch follows below)
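A rough sketch of this experiment in Python using statsmodels (an assumed choice of library, not necessarily what the speaker used); exact BIC values will differ from the table above depending on the simulated series and estimation details.

```python
import numpy as np
from statsmodels.tsa.arima_process import arma_generate_sample
from statsmodels.tsa.arima.model import ARIMA

# True model ARMA(1, 2): X_t = 0.5 X_{t-1} + e_t + 0.5 e_{t-1} - 0.3 e_{t-2}.
# statsmodels expects lag-polynomial coefficients, so the AR signs are flipped.
np.random.seed(0)
x = arma_generate_sample(ar=[1, -0.5], ma=[1, 0.5, -0.3], nsample=300)

bic = np.zeros((6, 6))
for p in range(6):
    for q in range(6):
        # some fits may emit convergence warnings; they are harmless for this illustration
        bic[q, p] = ARIMA(x, order=(p, 0, q)).fit().bic   # rows: q, columns: p
print(bic.round(2))
```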
Model averaging
- {M_m}_{m=1}^M: candidate models for the data Z
- ξ: some quantity of interest, e.g. f(x₀) (random in the Bayesian framework)
- Posterior distribution:
      P(ξ | Z) = ∑_{m=1}^M P(ξ | M_m, Z) P(M_m | Z)
- Recall that P(M_m | Z) can be approximated via BIC
- In general, one may consider model averaging of the form
      f̂(x) = ∑_{m=1}^M w_m f̂_m(x)
  (a code sketch follows below)
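A minimal sketch of BIC-weighted model averaging; preds and bic_values are hypothetical inputs holding each model's predictions and its BIC.

```python
import numpy as np

def bic_average(preds, bic_values):
    """preds: (M, n) array, row m holds fhat_m evaluated at n points; bic_values: length M."""
    b = np.asarray(bic_values, dtype=float)
    w = np.exp(-0.5 * (b - b.min()))      # approximate posterior weights P(M_m | Z)
    w /= w.sum()
    return w @ np.asarray(preds)          # fhat(x) = sum_m w_m * fhat_m(x)
```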
Stacking
First an example:
- One might hope to do something which approximates
      ŵ(x) = argmin_w E[ (Y − ∑_{m=1}^M w_m f̂_m(x))² ]
  where w_m ≥ 0 and ∑_m w_m = 1
- Consider f̂_m = the prediction from the best subset of inputs of size m, among M total inputs. Then the optimal weight is ŵ_M = 1 (all weight on the largest model)
- Problem: this procedure does not take model complexity into account
Stacking
- Let f̂_m^{−i}(x) be the prediction at x, using model m, fit to the dataset with (x_i, y_i) removed
- Compute the stacking weights
      ŵ^st = argmin_w ∑_{i=1}^N ( y_i − ∑_{m=1}^M w_m f̂_m^{−i}(x_i) )²
  where w_m ≥ 0 and ∑_m w_m = 1
- The final prediction is
      f̂(x) = ∑_{m=1}^M ŵ_m^st f̂_m(x)
- f̂ minimizes the leave-one-out cross-validation error among convex combinations of the models (a code sketch follows below)
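A minimal sketch of computing the stacking weights as a constrained least-squares problem; the matrix F of leave-one-out predictions f̂_m^{−i}(x_i) is assumed to have been computed beforehand.

```python
import numpy as np
from scipy.optimize import minimize

def stacking_weights(F, y):
    """F: (N, M) matrix with F[i, m] = fhat_m^{-i}(x_i); returns the convex weights w^st."""
    N, M = F.shape
    loss = lambda w: np.sum((y - F @ w) ** 2)
    result = minimize(
        loss,
        x0=np.full(M, 1.0 / M),                                        # start from equal weights
        bounds=[(0.0, 1.0)] * M,                                       # w_m >= 0
        constraints=[{"type": "eq", "fun": lambda w: w.sum() - 1.0}],  # sum_m w_m = 1
        method="SLSQP",
    )
    return result.x
```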
Important topics not covered
- Vapnik-Chervonenkis (VC) dimension
- EM algorithm
- Markov chain Monte Carlo
- Bagging and bumping
- ...
Thank you!