principal component regression without principal component analysis

Roy Frostig (Stanford), Cameron Musco (MIT), Christopher Musco (MIT), Aaron Sidford (Microsoft)

0

research resources

Paper, slides, and template code available at chrismusco.com

1

our work

Simple, robust algorithms for principal component regression.

2

principal component regression

Principal Component Regression (PCR) = Principal Component Analysis (unsupervised) + Linear Regression (supervised)

3


our approach: skip the dimensionality reduction

[Figure: two 3-D scatter plots of the same data, titled "PCA" and "Regression", illustrating the standard two-stage PCR pipeline: project the data onto its top principal components, then fit a regression on the projected data.]

Regression is cheap (fast iterative or stochastic methods).

PCA is a major computational bottleneck.

4


our approach: skip the dimensionality reduction

Single-shot iterative algorithm

[Figure: 3-D scatter plot of the same data, produced by a single-shot iterative algorithm rather than the two-stage PCA + regression pipeline.]

Final algorithm just uses a few applications of any fast, black-box regression routine.

5


formal setup

Standard Regression: Given A, b. Solve: x∗ = arg min_x ∥Ax − b∥₂

Principal Component Regression: Given A, b, λ. Solve: x∗ = arg min_x ∥A_λ x − b∥₂

[Diagram: singular value decomposition of the data matrix A (data points × features): A = U Σ Vᵀ, with left singular vectors U, singular values Σ = diag(σ₁, σ₂, …, σ_{d−1}, σ_d), and right singular vectors Vᵀ.]

6

formal setup

A_λ keeps only the top principal components: the singular directions of A whose squared singular value is at least λ.

[Diagram: A_λ = U_λ Σ_λ V_λᵀ, the decomposition restricted to the left singular vectors, singular values, and right singular vectors above the threshold.]

6

formal setup

[Figure: squared singular values σᵢ² of A plotted against index i (i = 1, …, 1000). The large values above the threshold λ are labeled "Signal"; the long tail below λ is labeled "Noise". A companion plot shows the singular values of A_λ: the signal values are retained and everything below λ is zeroed out.]

7

formal setup

Principal Component Regression (PCR):

Goal: x∗ = arg min_x ∥A_λ x − b∥₂

Solution: x = (A_λᵀ A_λ)⁻¹ A_λᵀ b

What's the computational cost?

8
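For reference, here is a minimal numpy sketch of this naive pipeline: an explicit dense SVD to form A_λ, followed by a least-squares solve. The data sizes, the threshold, and the variable names are illustrative assumptions, not the paper's template code.

```python
import numpy as np

def naive_pcr(A, b, lam):
    """Naive PCR baseline: explicit PCA (a dense SVD here) followed by regression.

    Keeps the singular directions of A with squared singular value >= lam,
    forms A_lam explicitly, then solves least squares against it.
    """
    U, s, Vt = np.linalg.svd(A, full_matrices=False)  # the PCA bottleneck
    keep = s**2 >= lam                                # the k retained components
    A_lam = (U[:, keep] * s[keep]) @ Vt[keep, :]      # A_lam = U_k Sigma_k V_k^T
    # x = (A_lam^T A_lam)^+ A_lam^T b, via a least-squares solve
    x, *_ = np.linalg.lstsq(A_lam, b, rcond=None)
    return x

# Tiny usage example on synthetic data (sizes and threshold are arbitrary).
rng = np.random.default_rng(0)
A = rng.standard_normal((200, 50))
b = rng.standard_normal(200)
x_pcr = naive_pcr(A, b, lam=200.0)
```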


naive pcr runtime

Cost of computing A_λ (PCA): ≈ O(nnz(A)·k + d·k²), where k is the number of principal components with value above λ.

Cost of evaluating x = (A_λᵀ A_λ)⁻¹ A_λᵀ b (regression): ≈ O(nnz(A)·√κ).

For PCR, k is large and κ is small (A_λ is well conditioned).

9

our goal

Goal: Remove bottleneck dependence on k

10


reason for hope?

Optimistic observation: PCA computes too much information.

x∗ = (A_λᵀ A_λ)⁻¹ A_λᵀ b = (Aᵀ A)⁻¹ A_λᵀ b

Don't need to compute A_λ (which incurs a k dependence) as long as we can apply it to a single vector efficiently.

11
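A quick numerical sanity check of this identity, as a sketch on synthetic data; it assumes A has full column rank (so AᵀA is invertible) and forms A_λ explicitly only for the purpose of the test.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((200, 50))   # full column rank, so A^T A is invertible
b = rng.standard_normal(200)
lam = 200.0

# Form A_lam explicitly here only to test the identity.
U, s, Vt = np.linalg.svd(A, full_matrices=False)
keep = s**2 >= lam
A_lam = (U[:, keep] * s[keep]) @ Vt[keep, :]

lhs = np.linalg.pinv(A_lam.T @ A_lam) @ (A_lam.T @ b)  # (A_lam^T A_lam)^-1 A_lam^T b (pseudoinverse)
rhs = np.linalg.solve(A.T @ A, A_lam.T @ b)            # (A^T A)^-1 A_lam^T b
print(np.allclose(lhs, rhs))                           # True, up to round-off
```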


reason for hope?

It's very often more efficient to apply a matrix function to a vector than to compute the matrix function explicitly:

∙ (AᵀA)x, (AᵀA)²x, or (AᵀA)³x
∙ A⁻¹x
∙ exp(A)x … many more

Why not A_λ?

12
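For instance, (AᵀA)³x can be applied with six sparse matrix-vector products, each costing O(nnz(A)), without ever forming the d × d matrix (AᵀA)³. A generic illustration (not specific to the paper):

```python
import numpy as np
import scipy.sparse as sp

rng = np.random.default_rng(0)
A = sp.random(100_000, 1_000, density=1e-3, format="csr", random_state=0)
x = rng.standard_normal(1_000)

# Apply (A^T A)^3 x with six sparse matrix-vector products.
# Each matvec costs O(nnz(A)); the d x d matrix (A^T A)^3 is never materialized.
y = x
for _ in range(3):
    y = A.T @ (A @ y)

# The explicit alternative, M = (A.T @ A) ** 3, would take O(d^2) memory
# and far more time just to form M, before ever touching x.
```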


main result

Theorem (Main Result): There is an algorithm that approximately applies A_λᵀ to any vector b using ≈ log(1/ϵ) well-conditioned linear system solves.

⇒ PCR in ≈ O(nnz(A)·√κ) time.

13
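Schematically, the theorem suggests the following end-to-end shape for PCR. The sketch below uses a hypothetical black-box routine apply_A_lambda_T standing in for the algorithm from the theorem; it is an illustration of how the pieces fit together, not the paper's actual procedure.

```python
import numpy as np

def pcr_without_pca(A, b, apply_A_lambda_T):
    """Hypothetical end-to-end PCR sketch (illustration only).

    `apply_A_lambda_T(v)` is assumed to approximate A_lam^T v using a handful
    of well-conditioned linear system solves, as promised by the main result.
    """
    z = apply_A_lambda_T(b)
    # By the identity x* = (A^T A)^-1 A_lam^T b from the "reason for hope" slide;
    # any fast black-box regression routine could replace this dense solve
    # (assumes A has full column rank).
    return np.linalg.solve(A.T @ A, z)
```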


decomposing matrix

Goal: Apply A_λᵀ quickly to any vector.

[Figure: spectrum of A_λ, the squared singular values σᵢ² plotted against index i, with everything below the threshold λ zeroed out.]

14

decomposing matrix

A_λᵀ = S Aᵀ

[Figures: the spectrum of A (σᵢ² against index i) and the spectrum of S, a step function whose values are 1 on the directions with large σᵢ and 0 on the directions with small σᵢ.]

14
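For intuition, one matrix with exactly this step-function spectrum is the projection onto A's top right singular directions, S = V_λ V_λᵀ. The sketch below forms S explicitly, purely for illustration (avoiding this is the whole point), and checks that S Aᵀ reproduces A_λᵀ.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((200, 50))
lam = 200.0

U, s, Vt = np.linalg.svd(A, full_matrices=False)
keep = s**2 >= lam

S = Vt[keep, :].T @ Vt[keep, :]                            # projection onto top right singular directions
A_lam_T = Vt[keep, :].T @ np.diag(s[keep]) @ U[:, keep].T  # A_lam^T = V_lam Sigma_lam U_lam^T

print(np.allclose(S @ A.T, A_lam_T))      # True: A_lam^T = S A^T
print(np.round(np.linalg.eigvalsh(S)))    # spectrum of S is exactly 0s and 1s
```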

decomposing matrix

A_λᵀ b = S Aᵀ b

[Figure: spectrum of S, equal to 1 for large σᵢ and 0 for small σᵢ.]

How do we apply S to Aᵀ b?

15


decomposing matrix

How do we apply S to Aᵀ b? This is actually a common task.

∙ Saad, Bekas, Kokiopoulou, Erhel, Guyomarc'h, Napoli, Polizzi, and others: applications in eigenvalue counting, computational materials science, and learning problems like eigenfaces and LSI.

∙ Tremblay, Puy, Gribonval, Vandergheynst: "Compressive Spectral Clustering", ICML 2016.

16


a first approximation

We turn to Ridge Regression, a popular alternative to PCR:

Goal: x∗ = arg min_x ∥Ax − b∥₂² + λ∥x∥₂²

Solution: x∗ = (AᵀA + λI)⁻¹ Aᵀ b

Claim: R = (AᵀA + λI)⁻¹ AᵀA coarsely approximates S.

17
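The appeal of R is that it can be applied to a vector with a single ridge-regression-style linear solve, with no eigendecomposition. A minimal sketch, with a dense solve standing in for whichever fast iterative ridge solver one prefers (the function name is illustrative):

```python
import numpy as np

def apply_R(A, lam, v):
    """Apply R = (A^T A + lam*I)^-1 A^T A to v with one ridge-type linear solve.

    The dense solve is a stand-in; on large problems an iterative solver would
    be used, and the regularized system is well conditioned.
    """
    d = A.shape[1]
    return np.linalg.solve(A.T @ A + lam * np.eye(d), A.T @ (A @ v))

# Coarse approximation to S A^T b (i.e., to A_lam^T b):
#   z = apply_R(A, lam, A.T @ b)
```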


a first approximation

Singular values of S:

σᵢ(S) = 1 if σᵢ²(A) ≥ λ,  0 if σᵢ²(A) < λ.

Singular values of R = (AᵀA + λI)⁻¹ AᵀA:

σᵢ(R) = σᵢ²(A) / (σᵢ²(A) + λ),  which is ≈ 1 if σᵢ²(A) ≫ λ and ≈ 0 if σᵢ²(A) ≪ λ.
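A small numeric check of this comparison; the particular σᵢ² values and the threshold λ are made up for illustration.

```python
import numpy as np

lam = 5.0
sigma_sq = np.array([20.0, 15.0, 9.0, 5.0, 1.0, 0.1])  # example values of sigma_i^2(A)

sigma_S = (sigma_sq >= lam).astype(float)  # step function: exactly 0 or 1
sigma_R = sigma_sq / (sigma_sq + lam)      # smooth ridge shrinkage factors

print(sigma_S)                 # [1. 1. 1. 1. 0. 0.]
print(np.round(sigma_R, 2))    # [0.8  0.75 0.64 0.5  0.17 0.02]
```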
