Model Inference and Averaging

Ting-Kam Leonard Wong
Department of Mathematics, University of Washington, Seattle

UW Math Data Science Seminar, May 28, 2015


Topics

Selected topics from The Elements of Statistical Learning by Hastie, Tibshirani and Friedman:

- Chapter 7: Model Assessment and Selection
- Chapter 8: Model Inference and Averaging


Estimating prediction error


General setting

- $T = \{(x_i, y_i) : 1 \le i \le N\}$: training data (samples)
- $(X, Y)$: underlying population (with a certain distribution)
- $\hat{f}(X)$: prediction model estimated from $T$
- $L(Y, \hat{f}(X))$: loss function, e.g. the squared error $(Y - \hat{f}(X))^2$
- Generalization error:
$$\mathrm{Err}_T = E[L(Y, \hat{f}(X)) \mid T]$$
- Expected prediction error:
$$\mathrm{Err} = E[L(Y, \hat{f}(X))] = E[\mathrm{Err}_T]$$
- Training error:
$$\overline{\mathrm{err}} = \frac{1}{N} \sum_{i=1}^N L(y_i, \hat{f}(x_i))$$


Main ideas

- The training error $\overline{\mathrm{err}}$ may not be a good estimate of $\mathrm{Err}$ or $\mathrm{Err}_T$
- $\overline{\mathrm{err}}$ is computed over the same data $T$ to which $\hat{f}$ is fitted
- Bias of $\hat{f}(\cdot)$: the model does not capture the true $f$ completely
- Variability of $\hat{f}(\cdot)$, which depends on $T$


Bias-variance decomposition

Assume

- $Y = f(X) + \varepsilon$, with $E[\varepsilon] = 0$ and $\mathrm{Var}(\varepsilon) = \sigma_\varepsilon^2$
- $L(Y, \hat{f}(X)) = (Y - \hat{f}(X))^2$

Fix $x_0$. Taking the expectation over $Y$ and over the training set (so that $\hat{f}$ is random), the bias-variance decomposition is
$$\mathrm{Err}(x_0) = E[(Y - \hat{f}(x_0))^2 \mid X = x_0] = \sigma_\varepsilon^2 + [E\hat{f}(x_0) - f(x_0)]^2 + E[(\hat{f}(x_0) - E\hat{f}(x_0))^2]$$
$$= \text{Irreducible error} + \text{Bias}^2 + \text{Variance}$$

- Bias-variance tradeoff; overfitting
- Estimation of prediction error
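As a sanity check on this identity, here is a minimal Monte Carlo sketch (an added illustration, not from the slides; the choice of $f$, the degree-1 fit, and all names are assumptions): an intentionally underfit model is refit on many training sets, and the three terms are compared with a direct estimate of $\mathrm{Err}(x_0)$.

```python
import numpy as np

rng = np.random.default_rng(0)
f = lambda x: np.sin(2 * np.pi * x)    # hypothetical true regression function
sigma = 0.5                            # noise sd, so sigma_eps^2 = 0.25
x0, R, N = 0.3, 2000, 50               # test point, replications, sample size

fhat_x0 = np.empty(R)
for r in range(R):
    x = rng.uniform(size=N)                       # a fresh training set T
    y = f(x) + sigma * rng.standard_normal(N)
    # a deliberately simple (biased) model: degree-1 polynomial fit
    fhat_x0[r] = np.polyval(np.polyfit(x, y, 1), x0)

bias2 = (fhat_x0.mean() - f(x0)) ** 2
var = fhat_x0.var()
print("sigma^2 + Bias^2 + Variance:", sigma ** 2 + bias2 + var)

# direct Monte Carlo estimate of Err(x0) = E[(Y - fhat(x0))^2 | X = x0]
y0 = f(x0) + sigma * rng.standard_normal(R)
print("direct estimate of Err(x0): ", np.mean((y0 - fhat_x0) ** 2))
```

The two printed numbers agree up to Monte Carlo error, which is exactly what the decomposition asserts.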


Example: k-nearest neighbor regression

- Average the $y$-values of the $k$ nearest neighbors $\{x_{(\ell)}\}$ of $x_0$:
$$\hat{f}(x_0) = \frac{1}{k} \sum_{\ell=1}^k y_{(\ell)}$$
- Bias-variance decomposition (conditioned on the training data):
$$\mathrm{Err}(x_0) = \sigma_\varepsilon^2 + \left[f(x_0) - \frac{1}{k}\sum_{\ell=1}^k f(x_{(\ell)})\right]^2 + \frac{\sigma_\varepsilon^2}{k}$$
- The optimal $k$ depends on the shape of $f$.


Example: k-nearest neighbor regression

- $X = (X_1, X_2) \in [0, 1]^2$, uniformly distributed
- $f(X) = 3X_1^2 + X_1 X_2$ (unknown), $\sigma_\varepsilon^2 = 0.25$
- Sampled $N = 200$ values for $x$ and estimated
$$\widehat{\mathrm{Err}}_k = \sigma_\varepsilon^2 + \frac{1}{N}\sum_{i=1}^N \left[f(x_i) - \frac{1}{k}\sum_{\ell=1}^k f(x_{(i,\ell)})\right]^2 + \frac{\sigma_\varepsilon^2}{k}$$

[Figure: estimate of Err plotted against k for k = 0 to 40; vertical axis roughly 0.30 to 0.50]
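This simulation is straightforward to reproduce. A minimal sketch follows (a reconstruction under assumptions, not the original code; in particular, we take each training point to be its own first nearest neighbor). With that convention $\widehat{\mathrm{Err}}_1 = 2\sigma_\varepsilon^2 = 0.5$, which matches the left edge of the plot.

```python
import numpy as np

rng = np.random.default_rng(0)
N, sigma2 = 200, 0.25
X = rng.uniform(size=(N, 2))                     # X uniform on [0,1]^2
f_vals = 3 * X[:, 0] ** 2 + X[:, 0] * X[:, 1]    # f(X) = 3 X1^2 + X1 X2

# rank every point's neighbors by squared Euclidean distance
# (each point counts as its own first nearest neighbor)
d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=2)
order = np.argsort(d2, axis=1)

def err_hat(k):
    knn_mean = f_vals[order[:, :k]].mean(axis=1)   # (1/k) sum_l f(x_(i,l))
    bias2 = np.mean((f_vals - knn_mean) ** 2)      # averaged squared bias
    return sigma2 + bias2 + sigma2 / k

for k in (1, 2, 5, 10, 20, 40):
    print(f"k = {k:2d}: Err_hat = {err_hat(k):.3f}")
```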


Example: k-nearest neighbor regression

If we naively compute the training error
$$\frac{1}{N} \sum_{i=1}^N \left(y_i - \hat{f}(x_i)\right)^2$$
we get a badly misleading estimate, as we can verify here because we know the truth.

[Figure: training error (square loss) plotted against k for k = 0 to 40; vertical axis roughly 0.00 to 0.30]


Cross-Validation

Setting:

- Fix a learning method $\{(x_i, y_i)\}_{i=1}^N \mapsto \hat{f}(\cdot)$
- Estimate $\mathrm{Err} = E[L(Y, \hat{f}(X))]$

Procedure:

- Divide the data randomly into $K$ parts, typically $K = 5$ or $10$
- For each $1 \le k \le K$, train $\hat{f}^{-k}$ without the $k$-th part. Each $\hat{f}^{-k}$ is trained on about $(K-1)/K$ of the entire data set
- For each data point $(x_i, y_i)$, let $\kappa(i) \in \{1, \ldots, K\}$ be its group, and compute
$$\mathrm{CV}(\hat{f}) = \frac{1}{N}\sum_{i=1}^N L(y_i, \hat{f}^{-\kappa(i)}(x_i))$$
- If $\hat{f} = \hat{f}_\alpha$ depends on a tuning parameter $\alpha$, find the $\hat{\alpha}$ that minimizes $\mathrm{CV}(\hat{f}_\alpha)$, then use this $\hat{\alpha}$ to fit the model to all the data
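A minimal sketch of this procedure (an added illustration; `fit` and `loss` are hypothetical placeholders for whatever learning method and loss function are in use):

```python
import numpy as np

def cv_error(X, y, fit, loss, K=10, seed=0):
    """K-fold CV estimate of Err.  `fit(X_train, y_train)` must return a
    prediction function; `loss(y, yhat)` returns per-sample losses."""
    rng = np.random.default_rng(seed)
    kappa = rng.permutation(len(y)) % K       # random group label kappa(i)
    losses = np.empty(len(y))
    for k in range(K):
        test = kappa == k
        f_minus_k = fit(X[~test], y[~test])   # trained without the k-th part
        losses[test] = loss(y[test], f_minus_k(X[test]))
    return losses.mean()                      # CV(fhat)
```

To tune a parameter $\alpha$, one would call `cv_error` once per candidate value and keep the minimizer.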


Cross-Validation: remarks

(p. 245) Consider a classification problem with a large number of predictors and the following modeling strategy:

1. Screen the predictors: find a subset of “good predictors” that show fairly strong correlation with the class labels
2. Using just this subset of predictors, build a multivariate classifier
3. Use cross-validation to estimate the unknown tuning parameters and to estimate the prediction error of the final model


Cross-Validation: remarks

The authors warn (p. 246): “While this point may seem obvious to the reader, we have seen this blunder committed many times in published papers in top rank journals.”

Correct procedure: divide the data into K groups at random. For each k:

1. Find a subset of “good” predictors that show fairly strong correlation with the class labels, using all of the samples except those in fold k
2. Using just this subset of predictors, build a multivariate classifier, using all of the samples except those in fold k
3. Use the classifier to predict the class labels for the samples in fold k
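In code, the point is simply that the screening step moves inside the fold loop. A minimal sketch (an added illustration; `build` is a hypothetical placeholder for the classifier-fitting routine):

```python
import numpy as np

def cv_error_with_screening(X, y, n_keep, build, K=5, seed=0):
    """Cross-validation where the predictor screening is redone inside
    every fold, so held-out labels never influence the screening.
    `build(X_train, y_train)` returns a prediction function."""
    rng = np.random.default_rng(seed)
    kappa = rng.permutation(len(y)) % K
    wrong = np.empty(len(y))
    for k in range(K):
        test = kappa == k
        Xtr, ytr = X[~test], y[~test]
        # step 1: screen predictors using only the samples outside fold k
        corr = np.abs(np.corrcoef(Xtr, ytr, rowvar=False)[-1, :-1])
        keep = np.argsort(corr)[-n_keep:]
        # step 2: build the classifier on the screened training data
        predict = build(Xtr[:, keep], ytr)
        # step 3: predict the labels of the samples in fold k
        wrong[test] = predict(X[test][:, keep]) != y[test]
    return wrong.mean()   # misclassification-rate estimate
```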


Bootstrap methods

$Z = (z_1, \ldots, z_N)$, $z_i = (x_i, y_i)$: the data set

- A bootstrap sample $Z^{*b}$ consists of $N$ data points sampled with replacement from $Z$
- Apply the computation of interest to $Z^{*b}$, and repeat for $b = 1, \ldots, B$
- Take the average over the $B$ bootstrap samples


Bootstrap: simple example (taken from the Internet)

[Figure omitted]
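Since the figure does not survive in this transcript, here is a minimal sketch of a textbook-style example of the recipe above (an added illustration, not necessarily the example that was shown): estimating the standard error of the sample median.

```python
import numpy as np

rng = np.random.default_rng(0)
z = rng.exponential(size=50)     # an observed data set Z = (z_1, ..., z_N)

B = 1000
stat = np.empty(B)
for b in range(B):
    z_star = rng.choice(z, size=len(z), replace=True)  # bootstrap sample Z*b
    stat[b] = np.median(z_star)                        # apply the computation to Z*b

# average over the B bootstrap samples: here, a standard-error estimate
print("sample median:", np.median(z))
print("bootstrap SE of the median:", stat.std(ddof=1))
```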


Estimating prediction error with bootstrap

- Simple approach:
$$\widehat{\mathrm{Err}}_{\mathrm{boot}} = \frac{1}{B}\sum_{b=1}^B \frac{1}{N}\sum_{i=1}^N L(y_i, \hat{f}^{*b}(x_i))$$
- Leave-one-out bootstrap:
$$\widehat{\mathrm{Err}}^{(1)}_{\mathrm{boot}} = \frac{1}{N}\sum_{i=1}^N \frac{1}{|C^{-i}|}\sum_{b \in C^{-i}} L(y_i, \hat{f}^{*b}(x_i))$$
where $C^{-i}$ is the set of bootstrap samples that do not contain observation $i$
- More sophisticated variants exist...
- Requires a lot of computation
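A minimal sketch of the leave-one-out bootstrap estimate (an added illustration; `fit` and `loss` are hypothetical placeholders as before):

```python
import numpy as np

def err_boot_loo(X, y, fit, loss, B=200, seed=0):
    """Leave-one-out bootstrap estimate: observation i is scored only by
    the fits fhat^{*b} whose bootstrap sample did not contain i."""
    rng = np.random.default_rng(seed)
    N = len(y)
    loss_sum = np.zeros(N)
    n_out = np.zeros(N)                         # this becomes |C^{-i}|
    for b in range(B):
        idx = rng.integers(0, N, size=N)        # bootstrap sample Z*b
        fhat = fit(X[idx], y[idx])
        out = np.setdiff1d(np.arange(N), idx)   # observations left out of Z*b
        loss_sum[out] += loss(y[out], fhat(X[out]))
        n_out[out] += 1
    seen = n_out > 0   # with small B, a point might appear in every sample
    return (loss_sum[seen] / n_out[seen]).mean()
```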

Model Selection and Averaging


Bayesian approach and BIC

Fitting is carried out by maximizing a log-likelihood.

- Bayesian information criterion (BIC) for a model with $d$ parameters:
$$\mathrm{BIC} = -2 \cdot \mathrm{loglik} + (\log N) \cdot d$$
- Consider candidate models $\mathcal{M}_m$, $m = 1, \ldots, M$
- Choose the $m$ that minimizes BIC
- Example: choose $p$ and $q$ for ARMA($p, q$), a class of linear time series models


Why is BIC “Bayesian”?

Idea: use the data to update your information about which model is best.

- $P(\mathcal{M}_m)$: prior distribution
- Posterior: $P(\mathcal{M}_m \mid Z) \propto P(\mathcal{M}_m) \cdot P(Z \mid \mathcal{M}_m)$
- Bayesian decision rule: choose the $m$ that maximizes $P(\mathcal{M}_m \mid Z)$
- Laplace approximation:
$$\log P(Z \mid \mathcal{M}_m) \approx \log P(Z \mid \hat{\theta}_m, \mathcal{M}_m) - \frac{d_m}{2} \cdot \log N$$
where $\hat{\theta}_m$ is the maximum likelihood estimate
- So minimizing BIC is approximately equivalent to choosing the model with the largest posterior probability (under a uniform prior). In fact,
$$P(\mathcal{M}_m \mid Z) \approx \frac{e^{-\frac{1}{2}\mathrm{BIC}_m}}{\sum_{\ell=1}^M e^{-\frac{1}{2}\mathrm{BIC}_\ell}}$$
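A small numeric sketch of this approximation (an added illustration), using the four best BIC values from the ARMA example that follows:

```python
import numpy as np

def bic_posterior(bic):
    """Approximate P(M_m | Z) from BIC values, assuming a uniform prior."""
    b = np.asarray(bic, dtype=float)
    w = np.exp(-0.5 * (b - b.min()))   # shift by the minimum for stability
    return w / w.sum()

# BIC values of ARMA(0,1), (0,2), (1,1), (1,2) from the table two slides below
print(bic_posterior([567.09, 566.84, 566.42, 570.36]))
# -> roughly [0.27, 0.30, 0.38, 0.05]: several models share the posterior mass
```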

Example: time series analysis

- ARMA($p, q$) model for $\{X_t\}$:
$$X_t = \sum_{i=1}^p \phi_i X_{t-i} + \varepsilon_t + \sum_{j=1}^q \theta_j \varepsilon_{t-j}$$
- In practice, usually $p, q \le 2$ or $3$.
- Assume the true model is ARMA($1, 2$):
$$X_t = 0.5 X_{t-1} + \varepsilon_t + 0.5\varepsilon_{t-1} - 0.3\varepsilon_{t-2}$$

[Figures: a simulated time series of X (t = 0 to 300) and its sample autocorrelation function (lags 0 to 20)]

Example: time series analysis

- We simulated $N = 300$ data points from the true model
- Fit an ARMA($p, q$) model for $p, q = 0, 1, \ldots, 5$ and compute the BIC:

  p \ q      0       1       2       3       4       5
    0     753.11  567.09  566.84  572.46  570.28  575.44
    1     630.03  566.42  570.36  576.02  576.36  580.81
    2     616.04  571.61  576.04  578.11  580.41  585.60
    3     594.15  571.18  576.39  577.40  585.96  587.43
    4     573.31  575.26  579.13  584.42  590.11  590.70
    5     577.29  578.68  584.22  589.81  592.97  596.40

- $(p, q) = (0, 1)$, $(0, 2)$, $(1, 1)$ and $(1, 2)$ give the best fit
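One way this experiment could be run (a sketch assuming statsmodels; the original software is not stated, and the fitted BIC values will differ from the table with a different seed and fitting details):

```python
import numpy as np
from statsmodels.tsa.arima_process import arma_generate_sample
from statsmodels.tsa.arima.model import ARIMA

np.random.seed(0)
# true ARMA(1,2); statsmodels uses lag-polynomial coefficients,
# so ar = [1, -phi_1] and ma = [1, theta_1, theta_2]
x = arma_generate_sample(ar=[1, -0.5], ma=[1, 0.5, -0.3], nsample=300)

for p in range(6):
    for q in range(6):
        res = ARIMA(x, order=(p, 0, q)).fit()
        print(f"ARMA({p},{q}): BIC = {res.bic:.2f}")
```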


Model averaging

- $\{\mathcal{M}_m\}_{m=1}^M$: candidate models for data $Z$
- $\xi$: some quantity of interest, e.g. $f(x_0)$ (random in the Bayesian framework)
- Posterior distribution:
$$P(\xi \mid Z) = \sum_{m=1}^M P(\xi \mid \mathcal{M}_m, Z) P(\mathcal{M}_m \mid Z)$$
Recall that $P(\mathcal{M}_m \mid Z)$ can be approximated using BIC
- In general, one may consider model averaging of the form
$$\hat{f}(x) = \sum_{m=1}^M w_m \hat{f}_m(x)$$


Stacking

First an example:

- One might hope to do something which approximates
$$\hat{w} = \arg\min_w E\left[\left(Y - \sum_{m=1}^M w_m \hat{f}_m(x)\right)^2\right]$$
where $w_m \ge 0$ and $\sum_m w_m = 1$
- Consider $\hat{f}_m$ = the prediction from the best subset of inputs of size $m$, among $M$ total inputs. Then the optimal weight is $\hat{w}_M = 1$: all of the weight goes to the largest model
- Problem: this procedure does not account for model complexity


Stacking

- Let $\hat{f}_m^{-i}(x)$ be the prediction at $x$, using model $m$, applied to the data set with $(x_i, y_i)$ removed
- Compute the stacking weights
$$\hat{w}^{\mathrm{st}} = \arg\min_w \sum_{i=1}^N \left(y_i - \sum_{m=1}^M w_m \hat{f}_m^{-i}(x_i)\right)^2$$
where $w_m \ge 0$ and $\sum_m w_m = 1$
- Final prediction:
$$\hat{f}(x) = \sum_{m=1}^M \hat{w}_m^{\mathrm{st}} \hat{f}_m(x)$$
- $\hat{f}$ minimizes the leave-one-out cross-validation error among convex combinations of the $\hat{f}_m$
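The constrained least-squares problem for $\hat{w}^{\mathrm{st}}$ is easy to set up numerically. A minimal sketch (an added illustration, assuming SciPy; `P` collects the leave-one-out predictions):

```python
import numpy as np
from scipy.optimize import minimize

def stacking_weights(P, y):
    """P[i, m] = fhat_m^{-i}(x_i), the leave-one-out prediction of model m
    at x_i.  Returns the w minimizing sum_i (y_i - P[i] @ w)^2 subject to
    w_m >= 0 and sum_m w_m = 1."""
    M = P.shape[1]
    res = minimize(
        lambda w: np.sum((y - P @ w) ** 2),
        x0=np.full(M, 1.0 / M),                  # start from equal weights
        bounds=[(0.0, None)] * M,                # w_m >= 0
        constraints=[{"type": "eq", "fun": lambda w: w.sum() - 1.0}],
        method="SLSQP",
    )
    return res.x

# final predictor: refit each model on the full data and combine,
# fhat(x) = sum_m w[m] * fhat_m(x)
```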

Important topics not covered

- Vapnik-Chervonenkis (VC) dimension
- EM algorithm
- Markov chain Monte Carlo
- Bagging and bumping
- ...


Thank you!

