Model Inference and Averaging
Ting-Kam Leonard Wong
Department of Mathematics, University of Washington, Seattle
UW Math Data Science Seminar, May 28, 2015
Topics
Selected topics from The Elements of Statistical Learning by Hastie, Tibshirani and Friedman:
- Chapter 7: Model Assessment and Selection
- Chapter 8: Model Inference and Averaging
Estimating prediction error
General setting
- T = {(x_i, y_i) : 1 ≤ i ≤ N}: training data (samples)
- (X, Y): underlying population (with a certain distribution)
- f̂(X): prediction model estimated from T
- L(Y, f̂(X)): loss function, e.g. the square loss (Y − f̂(X))²
- Generalization error
      Err_T = E[ L(Y, f̂(X)) | T ]
- Expected prediction error
      Err = E[ L(Y, f̂(X)) ] = E[ Err_T ]
- Training error
      err = (1/N) ∑_{i=1}^N L(y_i, f̂(x_i))
Main ideas
- The training error err may not be a good estimate of Err or Err_T
- err is computed over T, the same data to which f̂ is fitted
- Bias of f̂(·) (the model does not capture the true f completely)
- Variability of f̂(·) (which depends on T)
Bias-variance decomposition
Assume
- Y = f(X) + ε, with E[ε] = 0 and Var(ε) = σ_ε²
- L(Y, f̂(X)) = (Y − f̂(X))²
Fix x₀. The bias-variance decomposition (the expectation is over Y and over the training set, hence over f̂) is
      Err(x₀) = E[ (Y − f̂(x₀))² | X = x₀ ]
              = σ_ε² + [E f̂(x₀) − f(x₀)]² + E[ (f̂(x₀) − E f̂(x₀))² ]
              = Irreducible error + Bias² + Variance
(a short derivation is sketched below)
- Bias-variance tradeoff; overfitting
- Estimation of prediction error
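A quick sketch of the algebra behind the decomposition (assuming ε is independent of f̂(x₀) and E[ε] = 0):

```latex
\begin{aligned}
\mathrm{Err}(x_0) &= E\!\left[(Y - \hat f(x_0))^2 \mid X = x_0\right]
  = E\!\left[(\varepsilon + f(x_0) - \hat f(x_0))^2\right] \\
  &= \sigma_\varepsilon^2 + E\!\left[(f(x_0) - \hat f(x_0))^2\right]
  \qquad \text{(the cross term vanishes since } E[\varepsilon] = 0\text{)} \\
  &= \sigma_\varepsilon^2 + \left[f(x_0) - E\hat f(x_0)\right]^2
     + E\!\left[(\hat f(x_0) - E\hat f(x_0))^2\right].
\end{aligned}
```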
Example: k-nearest neighbor regression
- Average the y-values of the k nearest neighbors {x_(ℓ)} of x₀:
      f̂(x₀) = (1/k) ∑_{ℓ=1}^k y_{x_(ℓ)}
- Bias-variance decomposition (conditioned on the training data):
      Err(x₀) = σ_ε² + [ f(x₀) − (1/k) ∑_{ℓ=1}^k f(x_(ℓ)) ]² + σ_ε²/k
- The optimal k depends on the shape of f.
Example: k-nearest neighbor regression
- X = (X₁, X₂) ∈ [0, 1]² uniformly distributed
- f(X) = 3X₁² + X₁X₂ (unknown), σ_ε² = 0.25
- Sampled N = 200 values of x and estimated
      Êrr_k = σ_ε² + (1/N) ∑_{i=1}^N [ f(x_i) − (1/k) ∑_{ℓ=1}^k f(x_(i,ℓ)) ]² + σ_ε²/k
  (a simulation sketch follows below)
[Figure: the estimate Êrr_k plotted against k = 1, ..., 40]
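The following is a rough reconstruction of this simulation in Python (my own sketch, not the speaker's code); the variable names and the choice to include each point among its own nearest neighbors are assumptions.

```python
# Estimate Err_k for k-NN regression with f(X) = 3*X1^2 + X1*X2, sigma_eps^2 = 0.25, N = 200.
import numpy as np

rng = np.random.default_rng(0)
N, sigma2 = 200, 0.25
X = rng.uniform(size=(N, 2))
f = 3 * X[:, 0] ** 2 + X[:, 0] * X[:, 1]       # true regression function at the sample points

# pairwise distances; neighbors of x_i are taken within the sample (x_i itself included)
D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
order = np.argsort(D, axis=1)

err_hat = {}
for k in range(1, 41):
    knn_mean_f = f[order[:, :k]].mean(axis=1)  # (1/k) * sum of f over the k nearest neighbors
    bias2 = np.mean((f - knn_mean_f) ** 2)
    err_hat[k] = sigma2 + bias2 + sigma2 / k   # irreducible error + bias^2 + variance

best_k = min(err_hat, key=err_hat.get)         # k minimizing the estimated Err
```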
Example: k-nearest neighbor regression
If we naively compute the training error
      (1/N) ∑_{i=1}^N (y_i − f̂(x_i))²
we get a very poor estimate of Err, which we can see here because we know the truth.
[Figure: training error (square loss) plotted against k = 1, ..., 40]
Cross-Validation
Setting:
- Fix a learning method {(x_i, y_i)}_{i=1}^N ↦ f̂(·)
- Estimate Err = E[ L(Y, f̂(X)) ]
Procedure:
- Divide the data randomly into K parts, typically K = 5 or 10
- For each 1 ≤ k ≤ K, train f̂^{−k} without the k-th part. Each f̂^{−k} is trained on about (K − 1)/K of the entire data set
- For each data point (x_i, y_i) let κ(i) ∈ {1, ..., K} be its group, and compute
      CV(f̂) = (1/N) ∑_{i=1}^N L(y_i, f̂^{−κ(i)}(x_i))
- If f̂ = f̂_α depends on a tuning parameter α, find the α̂ which minimizes CV(f̂_α), then use this α̂ to fit the model to all the data
(a code sketch follows below)
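A minimal sketch of K-fold cross-validation with square loss; fit and predict are hypothetical stand-ins for whatever learning method is being assessed.

```python
import numpy as np

def cv_error(x, y, fit, predict, K=10, seed=0):
    rng = np.random.default_rng(seed)
    N = len(y)
    folds = rng.permutation(N) % K          # kappa(i): fold label of observation i
    losses = np.empty(N)
    for k in range(K):
        test = folds == k
        model = fit(x[~test], y[~test])     # train f^{-k} without the k-th part
        losses[test] = (y[test] - predict(model, x[test])) ** 2
    return losses.mean()                    # CV(f_hat) under square loss
```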
Cross-Validation: remarks
(p. 245) Consider a classification problem with a large number of predictors and consider the following modeling strategy:
1. Screen the predictors: find a subset of "good predictors" that show fairly strong correlation with the class labels
2. Using just this subset of predictors, build a multivariate classifier
3. Use cross-validation to estimate the unknown tuning parameters and to estimate the prediction error of the final model
Cross-Validation: remarks
The authors warn (p. 246): "While this point may seem obvious to the reader, we have seen this blunder committed many times in published papers in top rank journals."
Correct procedure: divide the data into K groups at random. For each k:
1. Find a subset of "good" predictors that show fairly strong correlation with the class labels, using all of the samples except those in fold k
2. Using just this subset of predictors, build a multivariate classifier, using all of the samples except those in fold k
3. Use the classifier to predict the class labels for the samples in fold k
(a code sketch follows below)
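A minimal sketch of the correct procedure, with the screening step redone inside every fold; screen_predictors and fit_classifier are hypothetical placeholders (the fitted classifier is assumed to expose a predict method).

```python
import numpy as np

def cv_error_with_screening(X, y, screen_predictors, fit_classifier, K=5, seed=0):
    rng = np.random.default_rng(seed)
    folds = rng.permutation(len(y)) % K
    fold_errors = []
    for k in range(K):
        test = folds == k
        keep = screen_predictors(X[~test], y[~test])        # 1. screen on training folds only
        clf = fit_classifier(X[~test][:, keep], y[~test])   # 2. fit on training folds only
        pred = clf.predict(X[test][:, keep])                # 3. predict the held-out fold
        fold_errors.append(np.mean(pred != y[test]))
    return float(np.mean(fold_errors))
```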
Bootstrap methods
- Z = (z₁, ..., z_N), z_i = (x_i, y_i): data set
- A bootstrap sample Z^{*b} consists of N data points sampled with replacement from Z
- Apply the computation to Z^{*b}, and repeat for b = 1, ..., B
- Average over the B bootstrap samples (a code sketch follows below)
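A minimal sketch of the basic bootstrap recipe, here used to estimate the standard error of a statistic; the exponential data and the choice of the median are made up for illustration.

```python
import numpy as np

def bootstrap(z, statistic, B=1000, seed=0):
    rng = np.random.default_rng(seed)
    N = len(z)
    # recompute the statistic on B bootstrap samples drawn with replacement from z
    stats = np.array([statistic(z[rng.integers(0, N, size=N)]) for _ in range(B)])
    return stats.mean(), stats.std(ddof=1)   # bootstrap mean and standard error

# Example: standard error of the sample median of 50 exponential draws.
z = np.random.default_rng(1).exponential(size=50)
mean_b, se_b = bootstrap(z, np.median)
```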
Bootstrap: simple example (taken from the Internet)
Estimating prediction error with bootstrap
- Simple approach:
      Êrr_boot = (1/B)(1/N) ∑_{b=1}^B ∑_{i=1}^N L(y_i, f̂^{*b}(x_i))
- Leave-one-out bootstrap:
      Êrr_boot^(1) = (1/N) ∑_{i=1}^N (1/|C^{−i}|) ∑_{b ∈ C^{−i}} L(y_i, f̂^{*b}(x_i))
  where C^{−i} is the set of bootstrap samples that do not contain observation i
- More sophisticated variants...
- Requires a lot of computation (a code sketch follows below)
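A minimal sketch of the leave-one-out bootstrap estimate, assuming square loss; fit and predict are hypothetical stand-ins for the learning method.

```python
import numpy as np

def err_boot_loo(x, y, fit, predict, B=100, seed=0):
    rng = np.random.default_rng(seed)
    N = len(y)
    losses = [[] for _ in range(N)]
    for b in range(B):
        idx = rng.integers(0, N, size=N)            # indices of the bootstrap sample Z*b
        model = fit(x[idx], y[idx])
        held_out = np.setdiff1d(np.arange(N), idx)  # observations not contained in Z*b
        for i in held_out:
            pred = predict(model, x[i:i + 1])[0]
            losses[i].append((y[i] - pred) ** 2)
    # average only over the bootstrap samples that do not contain observation i
    return float(np.mean([np.mean(l) for l in losses if l]))
```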
Model Selection and Averaging
Bayesian approach and BIC
- Fitting is carried out by maximizing a log-likelihood
- Bayesian information criterion (BIC) for a model with d parameters:
      BIC = −2 · loglik + (log N) · d
  (a small code sketch follows below)
- Consider candidate models M_m, m = 1, ..., M
- Choose the m that minimizes BIC
- Example: choose p and q for ARMA(p, q), a class of linear time series models
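A minimal sketch of the BIC formula as code; the log-likelihoods in the example are made-up numbers, not results from the talk.

```python
import numpy as np

def bic(loglik, d, N):
    """BIC = -2 * loglik + log(N) * d for a model with d parameters fit to N data points."""
    return -2.0 * loglik + np.log(N) * d

# Illustrative (made-up) comparison: a 2-parameter vs. a 5-parameter model on N = 300 points.
print(bic(loglik=-280.0, d=2, N=300), bic(loglik=-277.5, d=5, N=300))
```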
Why is BIC "Bayesian"?
Idea: use the data to update your information about which model is the best.
- P(M_m): prior distribution over models
- Posterior: P(M_m | Z) ∝ P(M_m) · P(Z | M_m)
- Bayesian decision rule: choose the m which maximizes P(M_m | Z)
- Laplace approximation:
      log P(Z | M_m) ≈ log P(Z | θ̂_m, M_m) − (d_m/2) · log N
  where θ̂_m is the maximum likelihood estimate
- So minimizing BIC is approximately equivalent to choosing the model with the largest posterior probability (with a uniform prior). In fact
      P(M_m | Z) ≈ e^{−BIC_m/2} / ∑_{ℓ=1}^M e^{−BIC_ℓ/2}
  (a code sketch follows below)
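A minimal sketch of this approximation; the three BIC values in the example are illustrative numbers only.

```python
import numpy as np

def bic_to_posterior(bic_values):
    """Approximate P(M_m | Z) from BIC values, assuming a uniform prior over models."""
    b = np.asarray(bic_values, dtype=float)
    w = np.exp(-0.5 * (b - b.min()))   # subtract the minimum for numerical stability
    return w / w.sum()

# Example with illustrative BIC values for three candidate models.
print(bic_to_posterior([567.1, 566.4, 571.6]))
```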
Example: time series analysis
- ARMA(p, q) model for {X_t}:
      X_t = ∑_{i=1}^p φ_i X_{t−i} + ε_t + ∑_{j=1}^q θ_j ε_{t−j}
- In practice, usually p, q ≤ 2 or 3.
- Assume the true model is ARMA(1, 2):
      X_t = 0.5 X_{t−1} + ε_t + 0.5 ε_{t−1} − 0.3 ε_{t−2}
[Figures: a simulated time series of X (300 time points) and its autocorrelation function (ACF, lags 0 to 20)]
Example: time series analysis
- We simulated N = 300 data points from the true model
- Fit an ARMA(p, q) model for p, q = 0, 1, ..., 5 and compute the BIC (rows indexed by q, columns by p):

  q \ p      0        1        2        3        4        5
    0     753.11   630.03   616.04   594.15   573.31   577.29
    1     567.09   566.42   571.61   571.18   575.26   578.68
    2     566.84   570.36   576.04   576.39   579.13   584.22
    3     572.46   576.02   578.11   577.40   584.42   589.81
    4     570.28   576.36   580.41   585.96   590.11   592.97
    5     575.44   580.81   585.60   587.43   590.70   596.40

- The models (p, q) = (0, 1), (0, 2), (1, 1) and (1, 2) give the best fits (a code sketch follows below)
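A rough sketch of this experiment in Python using statsmodels (an assumed choice of library, not necessarily what the speaker used); exact BIC values will differ from the table above depending on the simulated series and estimation details.

```python
import numpy as np
from statsmodels.tsa.arima_process import arma_generate_sample
from statsmodels.tsa.arima.model import ARIMA

# True model ARMA(1, 2): X_t = 0.5 X_{t-1} + e_t + 0.5 e_{t-1} - 0.3 e_{t-2}.
# statsmodels expects lag-polynomial coefficients, so the AR signs are flipped.
np.random.seed(0)
x = arma_generate_sample(ar=[1, -0.5], ma=[1, 0.5, -0.3], nsample=300)

bic = np.zeros((6, 6))
for p in range(6):
    for q in range(6):
        # some fits may emit convergence warnings; they are harmless for this illustration
        bic[q, p] = ARIMA(x, order=(p, 0, q)).fit().bic   # rows: q, columns: p
print(bic.round(2))
```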
Model averaging
- {M_m}_{m=1}^M: candidate models for the data Z
- ξ: some quantity of interest, e.g. f(x₀) (random in the Bayesian framework)
- Posterior distribution:
      P(ξ | Z) = ∑_{m=1}^M P(ξ | M_m, Z) P(M_m | Z)
- Recall that P(M_m | Z) can be approximated via BIC
- In general, one may consider model averaging of the form
      f̂(x) = ∑_{m=1}^M w_m f̂_m(x)
  (a code sketch follows below)
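A minimal sketch of BIC-weighted model averaging; preds and bic_values are hypothetical inputs holding each model's predictions and its BIC.

```python
import numpy as np

def bic_average(preds, bic_values):
    """preds: (M, n) array, row m holds fhat_m evaluated at n points; bic_values: length M."""
    b = np.asarray(bic_values, dtype=float)
    w = np.exp(-0.5 * (b - b.min()))      # approximate posterior weights P(M_m | Z)
    w /= w.sum()
    return w @ np.asarray(preds)          # fhat(x) = sum_m w_m * fhat_m(x)
```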
Stacking
First an example:
- One might hope to do something which approximates
      ŵ(x) = argmin_w E[ (Y − ∑_{m=1}^M w_m f̂_m(x))² ]
  where w_m ≥ 0 and ∑_m w_m = 1
- Consider f̂_m = the prediction from the best subset of inputs of size m, among M total inputs. Then the optimal weight is ŵ_M = 1 (all weight on the largest model)
- Problem: this procedure does not take model complexity into account
Stacking
- Let f̂_m^{−i}(x) be the prediction at x, using model m, fit to the dataset with (x_i, y_i) removed
- Compute the stacking weights
      ŵ^st = argmin_w ∑_{i=1}^N ( y_i − ∑_{m=1}^M w_m f̂_m^{−i}(x_i) )²
  where w_m ≥ 0 and ∑_m w_m = 1
- The final prediction is
      f̂(x) = ∑_{m=1}^M ŵ_m^st f̂_m(x)
- f̂ minimizes the leave-one-out cross-validation error among convex combinations of the models (a code sketch follows below)
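A minimal sketch of computing the stacking weights as a constrained least-squares problem; the matrix F of leave-one-out predictions f̂_m^{−i}(x_i) is assumed to have been computed beforehand.

```python
import numpy as np
from scipy.optimize import minimize

def stacking_weights(F, y):
    """F: (N, M) matrix with F[i, m] = fhat_m^{-i}(x_i); returns the convex weights w^st."""
    N, M = F.shape
    loss = lambda w: np.sum((y - F @ w) ** 2)
    result = minimize(
        loss,
        x0=np.full(M, 1.0 / M),                                        # start from equal weights
        bounds=[(0.0, 1.0)] * M,                                       # w_m >= 0
        constraints=[{"type": "eq", "fun": lambda w: w.sum() - 1.0}],  # sum_m w_m = 1
        method="SLSQP",
    )
    return result.x
```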
Important topics not covered
- Vapnik-Chervonenkis (VC) dimension
- EM algorithm
- Markov chain Monte Carlo
- Bagging and bumping
- ...
Thank you!