Correlated variable selection and online learning

Alan Qi, Purdue CS & Statistics

Joint work with F. Yan, S. Zhe, T.P. Minka, and R. Xiang

Outline (n: samples, p: variables or features)

‣ EigenNet and NaNOS: Selecting correlated variables (p >> n)

‣ Virtual Vector Machine: Bayesian online learning (n >> p)

Correlated variable selection

Many variables (p >> n): n samples, p variables or features

Uncorrelated variables

[Figure: Lasso solution for two uncorrelated variables x1 and x2, with points from class 1 and class 2]

Lasso (Tibshirani 1994) works well when variables are uncorrelated

Correlated variables

[Figure: Lasso solution for two strongly correlated variables x1 and x2]

Problem: Lasso selects only one out of two strongly correlated variables.

Sparsity regularizers

‣ Lasso: variable selection by l1 regularization

‣ Elastic net (Zou & Hastie 2005): selection of groups of variables by composite l1/l2 regularization

‣ And many more: SCAD, etc.

[Figure: the solution sits where the feasible set of the regularizer meets the point of highest data likelihood]

Both Lasso and Elastic net ignore valuable correlation information embedded in the data.
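To make the gap concrete, here is a small scikit-learn sketch (not from the slides; the toy data and penalty settings are assumptions, and it uses the regression forms of the penalties): with two nearly identical relevant features, Lasso tends to keep only one, while Elastic net spreads weight over both.

```python
# Minimal sketch (assumed toy setup): Lasso vs. Elastic net on correlated features.
import numpy as np
from sklearn.linear_model import Lasso, ElasticNet

rng = np.random.default_rng(0)
n, p = 60, 40
X = rng.standard_normal((n, p))
X[:, 1] = X[:, 0] + 0.01 * rng.standard_normal(n)      # features 0 and 1 are nearly identical
y = X[:, 0] + X[:, 1] + 0.1 * rng.standard_normal(n)   # both features are relevant

lasso = Lasso(alpha=0.1).fit(X, y)
enet = ElasticNet(alpha=0.1, l1_ratio=0.5).fit(X, y)

print("Lasso weights on the correlated pair:      ", lasso.coef_[:2])
print("Elastic net weights on the correlated pair:", enet.coef_[:2])
```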

EigenNet
• Guide variable selection by the data's eigen-structure
• Select the eigen-subspace based on label information

Sparse conditional model:

p(w) ∝ exp(−λ1 Σ_d |w_d|)

p(y_i | w, x_i) = 1 / (1 + exp(−y_i wᵀx_i)),  i = 1, ..., n

[Graphical model: hyperparameter λ1 on w; label y_i depends on w and x_i, plate over i = 1, ..., n]
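For concreteness, a minimal numpy sketch of the unnormalized log posterior implied by these two factors (function and argument names are illustrative, not from the slides):

```python
import numpy as np

def log_posterior(w, X, y, lam1):
    """Unnormalized log posterior of the sparse conditional model:
    log p(w) + sum_i log p(y_i | w, x_i), with labels y_i in {-1, +1}."""
    log_prior = -lam1 * np.sum(np.abs(w))            # p(w) ∝ exp(-λ1 Σ_d |w_d|)
    margins = y * (X @ w)                            # y_i w^T x_i
    log_lik = -np.sum(np.logaddexp(0.0, -margins))   # Σ_i log σ(y_i w^T x_i)
    return log_prior + log_lik
```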

Generative model:

V = (v_1, ..., v_m), where v_j is an eigenvector of XᵀX

[Graphical model: hyperparameter λ2 on w̃; s_j depends on w̃ and v_j, plate over j = 1, ..., m]
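The eigen-structure that the generative model conditions on can be computed directly from the data; a minimal numpy sketch (function name assumed, not the full generative model):

```python
import numpy as np

def top_eigenvectors(X, m):
    """Return V = (v_1, ..., v_m), the top-m eigenvectors of X^T X
    (equivalently, the top right singular vectors of X)."""
    evals, evecs = np.linalg.eigh(X.T @ X)   # eigenvalues in ascending order
    order = np.argsort(evals)[::-1][:m]      # indices of the m largest eigenvalues
    return evecs[:, order]                   # p x m matrix of eigenvectors
```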

Integration of models

[Combined graphical model over λ1, λ2, λ3, w, w̃, y_i, x_i (i = 1, ..., n), v_j, s_j (j = 1, ..., m)]

[Figure: data likelihood, feasible set, and eigenvector v, shown for uncorrelated and for correlated variables]

Eigenvector attracts classifier.

Adaptive composite regularization

[Figure: the feasible set with eigenvector v, shown for uncorrelated and for correlated variables]

EigenNet vs Lasso

[Figure: uncorrelated variables x1, x2] Both select the relevant uncorrelated variable.

EigenNet vs Lasso

[Figure: correlated variables x1, x2] EigenNet selects both correlated variables.

Prediction with uncorrelated variables

• n: 10-80
• p: 40
• 8 variables are relevant to the labels

[Figure: test error rate vs. number of training examples (20-80) for Lasso, Elastic net, Bayesian lasso, and EigenNet]

Results averaged over 10 runs

Prediction with correlated variables

• n: 10-80
• p: 40
• 8 variables are relevant
• Two groups of 4 correlated variables (toy generator sketch below)

[Figure: test error rate vs. number of training examples (20-80) for Lasso, Elastic net, Bayesian lasso, and EigenNet]

Results averaged over 10 runs
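The exact generating process is not given on the slides; the sketch below is one plausible way to produce such data (the group layout, correlation level rho, and noise scale are assumptions for illustration).

```python
import numpy as np

def make_correlated_data(n, p=40, rho=0.9, seed=0):
    """Toy generator: 8 relevant variables arranged in two groups of 4 correlated ones."""
    rng = np.random.default_rng(seed)
    X = rng.standard_normal((n, p))
    for group in (range(0, 4), range(4, 8)):              # two groups of 4 variables
        base = rng.standard_normal(n)
        for j in group:                                    # correlate variables within the group
            X[:, j] = rho * base + np.sqrt(1 - rho**2) * rng.standard_normal(n)
    w_true = np.zeros(p)
    w_true[:8] = 1.0                                       # only the first 8 variables carry signal
    y = np.sign(X @ w_true + 0.5 * rng.standard_normal(n))
    return X, y, w_true
```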

True and estimated classifiers

[Figure: estimated weights for Lasso, Elastic net, and EigenNet across the variables, with the two groups of relevant variables (Group 1, Group 2) marked]

SNP-based AD classification

374 subjects, 2000 SNPs

[Figure: classification error rate for Lasso, Elastic net, ARD, and EigenNet]

Predicting ADAS-Cog score

726 subjects, 14 imaging features

[Figure: root-mean-square error for Lasso, Elastic net, and EigenNet]

Joint graph and node selection

[Diagram: input variables and labels]

Constraints: e.g., biological pathways, structural symmetry

New sparse Bayesian model, NaNOS (Zhe et al. 2013):
- Encode constraints (e.g., pathways) via graph Laplacians in a generalized spike-and-slab prior (see the sketch below)
- Automatically determine which constraints and features are useful
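For concreteness, a minimal sketch of the graph-Laplacian building block only; the toy adjacency matrix is a stand-in for a pathway graph, and how NaNOS folds the Laplacian into its generalized spike-and-slab prior is not reproduced here.

```python
import numpy as np

def graph_laplacian(A):
    """Unnormalized Laplacian L = D - A of an undirected graph
    (e.g., a biological pathway over a set of genes)."""
    D = np.diag(A.sum(axis=1))
    return D - A

# Toy 4-node pathway: a chain 0-1-2-3 (assumed example, not a real pathway)
A = np.array([[0, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
L = graph_laplacian(A)
w = np.array([1.0, 0.9, 1.1, 1.0])
smoothness = w @ L @ w   # small when connected nodes carry similar weights
```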

Results on synthetic data

[Fig. 4: prediction errors (regression PMSE, classification error rate) and F1 scores for gene selection in Experiments 1-3, on data simulated from a transcription-factor regulatory network (TFs and their regulated genes)]

NaNOS: lower prediction error and higher selection accuracy.

Results on real data

[Fig. 6: predictive performance (PMSE and classification error rate) on three gene expression studies of cancer: (a) diffuse large B cell lymphoma, (b) colorectal cancer, (c) pancreatic ductal adenocarcinoma]

[Pathway diagram spanning the cell membrane, endoplasmic reticulum, and nucleus: MHCI, MHCII, Smad4, NOG, BMP, BMPR1, BMPR2, Smad1/5/8, CIITA, RFX, CREB, TNFα, IFNG, Smad6/7, and transcription factors]

Predicting conversion from MCI to AD

[Figure: classification error rate for ARD vs. NaNOS]

Using symmetry constraint

Ongoing research: sparse multiview learning

Related work:
- Sparse latent factor models (West 2003, Carvalho et al. 2008, Bhattacharya and Dunson 2011)
- Canonical correlation analysis

Outline (n: samples, p: variables or features)

‣ EigenNet: Selecting correlated variables (p >> n)

‣ Virtual Vector Machine: Bayesian online learning (n >> p)

Ubiquitous data streams

How to handle massive data or data streams? Online learning.

Bayesian treatment

‣ Linear dynamic systems: Kalman filtering (see sketch below)

‣ Nonlinear dynamic systems: extended Kalman filtering, unscented Kalman filtering, particle filtering, assumed density filtering, online variational inference, etc.
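As a reminder of the linear baseline, here is a one-step Kalman filter sketch in plain numpy; the model matrices A, Q, H, R are placeholders supplied by the caller, and the function name is illustrative.

```python
import numpy as np

def kalman_step(m, P, z, A, Q, H, R):
    """One predict/update step of a Kalman filter.
    m, P: previous posterior mean and covariance of the state
    z: new observation; A, Q: transition model; H, R: observation model."""
    # Predict
    m_pred = A @ m
    P_pred = A @ P @ A.T + Q
    # Update
    S = H @ P_pred @ H.T + R                # innovation covariance
    K = P_pred @ H.T @ np.linalg.inv(S)     # Kalman gain
    m_new = m_pred + K @ (z - H @ m_pred)
    P_new = P_pred - K @ S @ K.T
    return m_new, P_new
```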

Previous online learning work

• Stochastic gradient: Perceptron (Rosenblatt 1962)
• Natural gradient (Le Roux et al., 2007)
• Online Gaussian process classification (Csató & Opper, 2002)
• Passive-Aggressive algorithm (Crammer et al. 2006)

What has been ignored?

Previous data points

Idea

Summarize all previous samples by representative virtual points & dynamically update them

Intuition

• Many data points are not important to classification: prune them
• Many data points are similar: group them

Virtual vector machine

Apply this intuition in a Bayesian framework:

q(w) ∝ r(w) ∏_i f(b_i; w)    (7)

q(w): approximate posterior of the classifier w
b_i: the i-th virtual data point, kept in a small buffer; f(b_i; w) has the form of the original likelihood factors
r(w): the Gaussian "residual", representing information from real data that is not included in the virtual points

r(w) ∼ N(m_r, V_r)    (8)

Because the likelihood terms f are step functions, the resulting distribution on w is a Gaussian modulated by a piecewise constant function.
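A minimal data-structure sketch of this representation (the class and field names are assumptions, not from the paper): the approximate posterior is carried as a small buffer of virtual points plus the Gaussian residual r(w) = N(m_r, V_r).

```python
import numpy as np
from dataclasses import dataclass, field

@dataclass
class VVMState:
    """q(w) ∝ r(w) ∏_i f(b_i; w): Gaussian residual plus a small buffer of virtual points."""
    m_r: np.ndarray                               # mean of the Gaussian residual r(w)
    V_r: np.ndarray                               # covariance of r(w)
    buffer: list = field(default_factory=list)    # virtual points stored as (x, y) pairs
    max_size: int = 30                            # fixed buffer size (e.g., 30 in the spam experiment)

    def add(self, x, y):
        self.buffer.append((x, y))
        # when the buffer overflows, evict or merge points (see the operations below)
```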

Dynamic update

Two operations to maintain a fixed buffer size:
• Eviction
• Merging

Eviction

Remove the point with the smallest impact on the classifier posterior distribution, i.e., the point with the biggest "Bayesian margin":

m_wᵀx / √(xᵀV_w x)
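A minimal numpy sketch of this eviction rule, assuming the posterior over w is summarized by a Gaussian N(m_w, V_w) and the buffer holds (x, y) pairs; the margin formula follows the slide, and the function names are illustrative.

```python
import numpy as np

def bayesian_margin(x, m_w, V_w):
    """Bayesian margin m_w^T x / sqrt(x^T V_w x) of a data point x."""
    return (m_w @ x) / np.sqrt(x @ V_w @ x)

def evict(buffer, m_w, V_w):
    """Drop the virtual point with the biggest Bayesian margin,
    i.e., the one whose removal affects the posterior least."""
    margins = [bayesian_margin(x, m_w, V_w) for x, _ in buffer]
    del buffer[int(np.argmax(margins))]
    return buffer
```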

Version space (posterior space)

The subset of all hypotheses that are consistent with the observed training examples.

[Figure: four data points shown as hyperplanes; the version space is the brown area]

Deleting unimportant point

[Figure: version space (brown area), EP approximation (red ellipse), four data points (hyperplanes); the version space with three points, after deleting the point with the largest distance to the version space]

Merging

Merge similar points, leading to the smallest impact on the classifier posterior distribution.

[Figure: version space (brown area), EP approximation (red ellipse), four data points (hyperplanes); the version space with three points, after merging two similar points]

New algorithm

• Assumed density filtering (ADF): project a posterior distribution onto the exponential family, given a data point (moment-matching sketch below)

• Inverse ADF: find a virtual point b* that will lead to a desired projection from two real points
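Below is a minimal sketch of the forward ADF projection for a step-function (sign) likelihood, using the standard Gaussian moment-matching update; this is a textbook version, not necessarily the exact form used in the VVM paper, and scipy's normal CDF is assumed to be available.

```python
import numpy as np
from scipy.stats import norm

def adf_update(m, V, x, y):
    """Project N(m, V) * step(y * w^T x) back onto a Gaussian by moment matching."""
    xVx = x @ V @ x
    z = y * (m @ x) / np.sqrt(xVx)
    alpha = norm.pdf(z) / norm.cdf(z)                              # d log Z / dz
    m_new = m + (V @ x) * (y * alpha / np.sqrt(xVx))               # shifted mean
    V_new = V - np.outer(V @ x, V @ x) * (alpha * (alpha + z) / xVx)  # shrunk covariance
    return m_new, V_new
```

Inverse ADF would run this map backwards: given a target Gaussian obtained from two real points, solve for a single virtual point b* whose projection reproduces it (not sketched here).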

Estimation accuracy

• VVM achieves a smooth trade-off between computation cost and estimation accuracy.
• ADF uses a buffer of size one, and EP uses the whole sequence of 300 data points.

Thyroid classification

Results averaged over 10 random permutations of data

Email spam classification

Results averaged over 10 random permutations. Buffer sizes: VVM 30, SOGP 143, PA 80.

Conclusions

Summary

‣ Correlated variable selection (p >> n):
- Capture correlation between variables by eigenvectors or graph constraints
- Select both eigen-subspaces/graphs and features
- Handle both regression and classification

‣ Bayesian online learning (n >> p):
- Combine parametric approximation with online data summarization
- Leads to efficient computation and accurate estimation for nonlinear dynamic systems

Acknowledgment

• My group: Z. Xu, S. Zhe, Y. Guo, S. A. Naqvi, D. Runyan, X. Tan, H. Peng, Y. Yang, Y. Han, F. Yan

• Collaborators: T. P. Minka (Microsoft), S. Pyne (Harvard Medical School), G. Ceder (MIT), L. Shen (IU Medical School), A. Saykin (IU Medical School), J. P. Robinson (Purdue), K. Lee (U. of Toronto), F. Garip (Harvard), J. Wang (Eli Lilly), P. Yu (Eli Lilly), H. Neven (Google)

• Funding and support:
- NSF: CAREER, Robust Intelligence, CDI, STC
- Microsoft, IBM, Eli Lilly
- Showalter Foundation
- Indiana Clinical & Translational Science Institute
- Purdue