principal component regression without principal component analysis
Roy Frostig 1, Cameron Musco 2, Christopher Musco 2, Aaron Sidford 3
1 Stanford, 2 MIT, 3 Microsoft Research
resources
Paper, slides, and template code available at chrismusco.com
our work
Simple, robust algorithms for principal component regression.
principal component regression
Principal Component Regression (PCR) = Principal Component Analysis (unsupervised) + Linear Regression (supervised)
our approach: skip the dimensionality reduction
[Figure: two 3D scatter plots, "PCA" and "Regression": the data is first projected onto its top principal components, then a regression fit is computed on the projected data.]
Regression is cheap (fast iterative or stochastic methods).
PCA is a major computational bottleneck.
our approach: skip the dimensionality reduction
[Figure: 3D scatter plot, "Single-shot iterative algorithm": the projection and the regression fit are computed together in one pass.]
Final algorithm just uses a few applications of any fast, black-box regression routine.
formal setup
Standard Regression: Given A, b. Solve x* = arg min_x ‖Ax − b‖₂.
Principal Component Regression: Given A, b, λ. Solve x* = arg min_x ‖A_λ x − b‖₂.
[Diagram: SVD of the data matrix A (rows = data points, columns = features): A = U Σ Vᵀ, with left singular vectors U, singular values σ₁ ≥ σ₂ ≥ … ≥ σ_{d−1} ≥ σ_d on the diagonal of Σ, and right singular vectors Vᵀ.]
[Diagram: the analogous factorization A_λ = U_λ Σ_λ V_λᵀ, which keeps only the singular directions of A with σᵢ² ≥ λ.]
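For concreteness, here is a minimal numpy sketch of this definition, forming A_λ by truncating the SVD (the helper name and toy data are illustrative, not from the slides):

    import numpy as np

    def project_top_components(A, lam):
        """Form A_lambda: A projected onto singular directions with sigma_i^2 >= lam."""
        U, s, Vt = np.linalg.svd(A, full_matrices=False)
        k = int(np.sum(s**2 >= lam))            # number of retained components
        return (U[:, :k] * s[:k]) @ Vt[:k, :]   # U_lambda @ Sigma_lambda @ V_lambda^T

    rng = np.random.default_rng(0)
    A = rng.standard_normal((100, 20))
    A_lam = project_top_components(A, lam=15.0)

Note that this is exactly the expensive PCA step the talk is about avoiding; it is shown only to pin down notation.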
formal setup
[Plot: squared singular values σᵢ² of A versus index i. Components with σᵢ² above the threshold λ are labeled "Signal"; those below are labeled "Noise".]
[Plot: the spectrum of A_λ: the "Signal" singular values are kept exactly and the "Noise" singular values are zeroed out.]
formal setup
Principal Component Regression (PCR): Goal: x* = arg min_x ‖A_λ x − b‖₂. Solution: x* = (A_λᵀ A_λ)⁻¹ A_λᵀ b.
What's the computational cost?
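As a hedged sketch, the naive pipeline in numpy (names are mine; a least-squares solve stands in for the explicit inverse, since A_λᵀ A_λ is rank-deficient and the formula is really a pseudoinverse):

    import numpy as np

    def pcr_naive(A, b, lam):
        """Naive PCR: form A_lambda by truncated SVD, then regress b onto it."""
        U, s, Vt = np.linalg.svd(A, full_matrices=False)   # the PCA bottleneck
        k = int(np.sum(s**2 >= lam))
        A_lam = (U[:, :k] * s[:k]) @ Vt[:k, :]
        x, *_ = np.linalg.lstsq(A_lam, b, rcond=None)      # min-norm least squares
        return x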
naive pcr runtime
Cost of computing A_λ (PCA): ≈ O(nnz(A)·k + d·k²), where k is the number of principal components with squared singular value above λ.
Cost of evaluating x* = (A_λᵀ A_λ)⁻¹ A_λᵀ b (regression): ≈ O(nnz(A)·√κ), where κ is the condition number of the system.
For PCR, k is large but κ is small (A_λ is well conditioned), so the PCA step is the bottleneck.
our goal
Goal: remove the bottleneck dependence on k.
reason for hope?
Optimistic observation: PCA computes too much information.
x* = (A_λᵀ A_λ)⁻¹ A_λᵀ b = (Aᵀ A)⁻¹ A_λᵀ b
We don't need to compute A_λ itself (which incurs the k dependence), as long as we can apply it to a single vector efficiently.
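The identity is easy to sanity-check numerically; a small hedged sketch (the setup is mine):

    import numpy as np

    rng = np.random.default_rng(0)
    A = rng.standard_normal((100, 20))
    b = rng.standard_normal(100)
    lam = 15.0

    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    k = int(np.sum(s**2 >= lam))
    A_lam = (U[:, :k] * s[:k]) @ Vt[:k, :]

    x_pcr, *_ = np.linalg.lstsq(A_lam, b, rcond=None)   # (A_lam^T A_lam)^+ A_lam^T b
    x_alt = np.linalg.solve(A.T @ A, A_lam.T @ b)       # (A^T A)^{-1} A_lam^T b
    print(np.allclose(x_pcr, x_alt))                    # True, up to roundoff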
reason for hope?
It's very often more efficient to apply a matrix function to a vector than to compute the matrix function explicitly:
∙ (AᵀA)x, (AᵀA)²x, or (AᵀA)³x
∙ A⁻¹x
∙ exp(A)x, and many more
Why not A_λ?
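For example, (AᵀA)²x costs just four matrix-vector products if applied implicitly, versus forming AᵀA first; a quick sketch:

    import numpy as np

    rng = np.random.default_rng(0)
    A = rng.standard_normal((1000, 500))
    x = rng.standard_normal(500)

    M = A.T @ A                                  # explicit: ~O(n d^2) to form
    y_explicit = M @ (M @ x)

    y_implicit = A.T @ (A @ (A.T @ (A @ x)))     # implicit: four ~O(nnz(A)) matvecs

    print(np.allclose(y_explicit, y_implicit))   # True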
main result
Theorem (Main Result): There is an algorithm that approximately applies A_λᵀ to any vector b using ≈ log(1/ε) well-conditioned linear system solutions.
⇒ PCR in ≈ O(nnz(A) · √κ) time.
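Given such a routine as a black box, the rest of PCR is the well-conditioned solve from the earlier identity x* = (AᵀA)⁻¹ A_λᵀ b. A hedged sketch with conjugate gradient (apply_A_lam_T stands in for the paper's routine; here, for testing only, it cheats and uses an exact SVD):

    import numpy as np
    from scipy.sparse.linalg import LinearOperator, cg

    def pcr(A, b, lam, apply_A_lam_T):
        """PCR given any black-box routine applying A_lambda^T to a vector."""
        n, d = A.shape
        AtA = LinearOperator((d, d), matvec=lambda v: A.T @ (A @ v))
        rhs = apply_A_lam_T(A, lam, b)
        # rhs lies (approximately) in the top singular subspace, so CG acts as
        # if restricted there: effective condition number ~ sigma_1^2 / lambda
        x, info = cg(AtA, rhs)
        return x

    def apply_A_lam_T(A, lam, b):        # exact stand-in, NOT the fast algorithm
        U, s, Vt = np.linalg.svd(A, full_matrices=False)
        k = int(np.sum(s**2 >= lam))
        return Vt[:k, :].T @ (s[:k] * (U[:, :k].T @ b))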
decomposing matrix
Goal: apply A_λᵀ quickly to any vector.
[Plot: the spectrum σᵢ² of A_λ: only components above the threshold λ survive.]
decomposing matrix
A_λᵀ = S Aᵀ
[Plot: the spectrum σᵢ² of A.]
[Plot: the spectrum of S: a step function equal to 1 on the large σᵢ (σᵢ² ≥ λ) and 0 on the small σᵢ.]
decomposing matrix
A_λᵀ b = S Aᵀ b
[Plot: the spectrum of S, as above: 1 on the large σᵢ, 0 on the small σᵢ.]
How do we apply S to Aᵀ b?
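One hedged way to make the decomposition concrete: S is the projection onto the top right singular vectors, and forming it this way is exactly the expensive PCA step we want to avoid (the setup is mine):

    import numpy as np

    rng = np.random.default_rng(0)
    A = rng.standard_normal((100, 20))
    b = rng.standard_normal(100)
    lam = 15.0

    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    k = int(np.sum(s**2 >= lam))
    S = Vt[:k, :].T @ Vt[:k, :]                    # projection; eigenvalues are 0 or 1
    A_lam_T = (Vt[:k, :].T * s[:k]) @ U[:, :k].T   # A_lambda^T, for comparison

    print(np.allclose(A_lam_T @ b, S @ (A.T @ b)))  # True: A_lambda^T b = S A^T b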
decomposing matrix
How do we apply S to Aᵀ b? This is actually a common task.
∙ Saad, Bekas, Kokiopoulou, Erhel, Guyomarc'h, Napoli, Polizzi, and others: applications in eigenvalue counting, computational materials science, and learning problems like eigenfaces and LSI.
∙ Tremblay, Puy, Gribonval, Vandergheynst: "Compressive Spectral Clustering," ICML 2016.
a first approximation
We turn to Ridge Regression, a popular alternative to PCR:
Goal: x* = arg min_x ‖Ax − b‖₂² + λ‖x‖₂². Solution: x* = (AᵀA + λI)⁻¹ Aᵀ b.
Claim: R = (AᵀA + λI)⁻¹ AᵀA coarsely approximates S.
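The claim is easy to visualize: the eigenvalues of R are a smooth relaxation of the 0/1 step in S. A small sketch (the setup is mine):

    import numpy as np

    rng = np.random.default_rng(0)
    A = rng.standard_normal((200, 50))
    lam = 40.0

    s = np.linalg.svd(A, compute_uv=False)
    step = (s**2 >= lam).astype(float)   # eigenvalues of S: exact 0/1 step
    ridge = s**2 / (s**2 + lam)          # eigenvalues of R: smooth relaxation

    for sig2, t, r in zip(s**2, step, ridge):
        print(f"sigma_i^2 = {sig2:7.1f}   S: {t:.0f}   R: {r:.2f}")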
a first approximation
Singular values of S: σᵢ(S) = 1 if σᵢ²(A) ≥ λ, and 0 if σᵢ²(A) < λ.
Singular values of R = (AᵀA + λI)⁻¹ AᵀA: σᵢ(R) = σᵢ²(A) / (σᵢ²(A) + λ), which is ≈ 1 if σᵢ²(A) ≫ λ and ≈ 0 if σᵢ²(A) ≪ λ.
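The coarse step in R can be sharpened: the polynomial t²(3 − 2t) pushes values above 1/2 toward 1 and values below toward 0, so composing a few such maps of R approaches S. The sketch below is only illustrative, working directly on eigenvalues; the paper's actual algorithm applies a related polynomial to vectors using ≈ log(1/ε) ridge-regression solves, never computing an eigendecomposition:

    import numpy as np

    def sharpen(r, rounds=8):
        """Push ridge eigenvalues r_i = sigma_i^2/(sigma_i^2 + lam) toward 0/1."""
        for _ in range(rounds):
            r = r**2 * (3 - 2*r)   # smooth step; fixed points at 0, 1/2, and 1
        return r

    s2 = np.array([100.0, 60.0, 41.0, 39.0, 10.0, 1.0])  # example sigma_i^2 values
    lam = 40.0
    r = s2 / (s2 + lam)
    # eigenvalues far from the lam threshold snap to 0 or 1;
    # those right at the threshold (near 1/2) move slowly
    print(np.round(sharpen(r), 3))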