Shrinkage Structure of Partial Least Squares

Author: Naomi Stevens
Shrinkage Structure of Partial Least Squares Ole C. Lingjaerde, Nils Christophersen University of Oslo

Scandinavian Journal of Statistics 2000

Lingjaerde, Christophersen (U Oslo)

Shrinkage Structure of Partial Least Squares

Scandinavia J Stat 2000

1 / 20

Outline

1. Introduction
   Shrinkage structure
   Partial least squares
   Ritz values
   Filter factor for PLS
2. Shrinkage structure of PLS
3. Discussion

Starting from OLS

Consider the linear model $y = X\beta + \varepsilon$, where $\beta$ is a $p$-vector of unknown parameters and $\varepsilon$ is an $n$-dimensional noise vector.

Definition (Singular value decomposition) For an $n \times p$ matrix $X$, the singular value decomposition (SVD) is

$$X = U D V^T \tag{1}$$

where $U^T U = V^T V = V V^T = I_p$ and $D$ is a diagonal matrix with the singular values $\sigma_1 \ge \dots \ge \sigma_p$ on the diagonal. When $X$ has full column rank, its pseudoinverse is $X^{+} = V D^{-1} U^T$, and $X^T X = V D^2 V^T$.

More from OLS

Following Equation (1) on the previous slide, we have

$$\hat\beta_{OLS} = V D^{-1} U^T y = \sum_{i=1}^{p} \frac{u_i^T y}{\sigma_i} v_i \tag{2}$$

where $\sigma_i$ is the $i$th singular value of $X$, so that $\lambda_i = \sigma_i^2$ is the $i$th eigenvalue of $X^T X$. Under high collinearity, $\sigma_i$ is very small for some $i$, and the OLS estimate then suffers from huge variability. The quantity $f_i = u_i^T y$ is called a Fourier coefficient and is of interest in the later discussion.
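
The SVD expansion of the OLS estimate can be checked numerically. The following is a minimal sketch on synthetic data; the seed, dimensions, and coefficient vector are arbitrary assumptions chosen for illustration and do not come from the slides.

```python
import numpy as np

# Synthetic regression problem (illustrative values only).
rng = np.random.default_rng(0)
n, p = 50, 4
X = rng.normal(size=(n, p))
y = X @ np.array([1.0, -2.0, 0.5, 3.0]) + 0.1 * rng.normal(size=n)

# Thin SVD: U is n x p, s holds sigma_1 >= ... >= sigma_p, Vt is V^T.
U, s, Vt = np.linalg.svd(X, full_matrices=False)

f = U.T @ y                  # Fourier coefficients f_i = u_i^T y
beta_svd = Vt.T @ (f / s)    # sum_i (f_i / sigma_i) v_i

# Reference solution from a standard least-squares solver.
beta_lstsq = np.linalg.lstsq(X, y, rcond=None)[0]

# Eigenvalues of X^T X, sorted in decreasing order, equal sigma_i^2.
lam = np.linalg.eigvalsh(X.T @ X)[::-1]
```

Dividing by a small `s[i]` is what inflates the variance of the estimate under collinearity.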

The salvation of OLS: shrinkage estimators

To address the collinearity problem suffered by OLS, several shrinkage estimators have been developed. For example, Principal Component Regression (PCR):

$$\hat\beta_{PCR} = \sum_{i=1}^{m} \frac{u_i^T y}{\sigma_i} v_i \tag{3}$$

where $m < p$. And Ridge Regression (RR), with ridge parameter $k > 0$:

$$\hat\beta_{RR} = \sum_{i=1}^{p} \frac{\sigma_i^2}{\sigma_i^2 + k} \, \frac{u_i^T y}{\sigma_i} v_i \tag{4}$$

We may expect PCR and RR to yield "shrunken" versions of $\hat\beta_{OLS}$.

Filter factors

Many well-known shrinkage estimators take the form

$$\hat\beta = \sum_{i=1}^{p} w_i \, \frac{u_i^T y}{\sigma_i} v_i \tag{5}$$

where the $w_i$ are called filter factors. Filter factors are a convenient tool for discussing the shrinkage of PLS. For example:

For OLS, $w_i = 1$
For PCR, $w_i = 1$ for $i \le m$ and $w_i = 0$ for $i > m$
For RR, $w_i = \sigma_i^2 / (\sigma_i^2 + k)$
For GRR, $w_i = \sigma_i^2 / (\sigma_i^2 + k_i)$
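
The filter-factor form lends itself to a single routine that takes the vector $w$ as input. The sketch below is one possible illustration on synthetic data; the helper name `filtered_estimate`, the seed, and the values of `m` and `k` are assumptions made for the example.

```python
import numpy as np

def filtered_estimate(X, y, w):
    """Return sum_i w_i (u_i^T y / sigma_i) v_i for filter factors w."""
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    return Vt.T @ (w * (U.T @ y) / s)

# Synthetic data (illustrative values only).
rng = np.random.default_rng(1)
n, p, m, k = 40, 5, 3, 2.0
X = rng.normal(size=(n, p))
y = rng.normal(size=n)
s = np.linalg.svd(X, compute_uv=False)   # singular values, decreasing

w_ols = np.ones(p)                                   # no shrinkage
w_pcr = (np.arange(1, p + 1) <= m).astype(float)     # keep first m components
w_rr = s**2 / (s**2 + k)                             # ridge filter factors

beta_ols = filtered_estimate(X, y, w_ols)
beta_pcr = filtered_estimate(X, y, w_pcr)
beta_rr = filtered_estimate(X, y, w_rr)

# Ridge regression in its usual closed form, for comparison.
beta_rr_direct = np.linalg.solve(X.T @ X + k * np.eye(p), X.T @ y)
```

Because the $v_i$ are orthonormal, any filter factors in $[0, 1]$ give an estimate whose Euclidean norm is no larger than that of the OLS estimate.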

Partial least squares

Definition (Partial Least Squares (PLS)) For $m \ge 1$, define the Krylov space

$$K_m = \mathrm{span}\{X^T y, \; (X^T X) X^T y, \; \dots, \; (X^T X)^{m-1} X^T y\}$$

Then the PLS estimate is defined as

$$\hat\beta_{PLS} = \arg\min_{\beta} \|y - X\beta\|^2 \quad \text{subject to } \beta \in K_m \tag{6}$$

Note that the PLS solution is a function of $m$. Let $R_m$ be a $p \times m$ matrix whose columns form a basis of $K_m$; then the explicit solution is

$$\hat\beta_{PLS} = R_m \left(R_m^T X^T X R_m\right)^{-1} R_m^T X^T y \tag{7}$$
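
The Krylov-space definition can be turned into a small computation. The sketch below builds an orthonormal basis of $K_m$ by repeated Gram-Schmidt (instead of using the raw, ill-conditioned powers of $X^T X$) and then applies the explicit solution. The function names `krylov_basis` and `pls_krylov` and the synthetic data are assumptions for illustration; this is not the NIPALS algorithm usually found in PLS software.

```python
import numpy as np

def krylov_basis(X, y, m):
    """Orthonormal p x m basis of K_m = span{X^T y, (X^T X) X^T y, ...}."""
    A = X.T @ X
    v = X.T @ y
    R = np.zeros((X.shape[1], m))
    for j in range(m):
        for _ in range(2):  # orthogonalize twice for numerical stability
            v = v - R[:, :j] @ (R[:, :j].T @ v)
        R[:, j] = v / np.linalg.norm(v)
        v = A @ R[:, j]     # next Krylov direction
    return R

def pls_krylov(X, y, m):
    """PLS estimate: least squares restricted to beta in K_m."""
    R = krylov_basis(X, y, m)
    XR = X @ R
    coef = np.linalg.solve(XR.T @ XR, XR.T @ y)
    return R @ coef

# Synthetic data (illustrative values only).
rng = np.random.default_rng(2)
n, p = 60, 4
X = rng.normal(size=(n, p))
y = X @ np.array([2.0, 0.0, -1.0, 0.5]) + 0.2 * rng.normal(size=n)

betas = [pls_krylov(X, y, m) for m in range(1, p + 1)]
rss = [float(np.sum((y - X @ b) ** 2)) for b in betas]
beta_ols = np.linalg.lstsq(X, y, rcond=None)[0]
```

Since the spaces $K_1 \subset K_2 \subset \dots$ are nested, the residual sum of squares cannot increase with $m$, and with $m = p$ (and $K_p$ of full dimension) the estimate coincides with OLS.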

Partial least squares (continued)

Definition (Partial Least Squares (another version)) The general underlying model of multivariate PLS is [1]:

$$X = T P^T + E, \qquad Y = T Q^T + F \tag{8}$$

where $X$ is an $n \times p$ matrix of predictors, $Y$ is an $n \times 1$ matrix of responses, $T$ is an $n \times m$ score matrix, $P$ and $Q$ are $p \times m$ and $1 \times m$ loading matrices, respectively, and $E$ and $F$ are residual terms. Note that when $m = \mathrm{rank}(X)$, the result is identical to that of OLS. However, the shrinkage properties and filter factors $w_i$ of PLS are not as apparent as those of PCR and RR.

[1] http://en.wikipedia.org/wiki/Partial_least_squares_regression

Ritz values

Definition (Ritz values) Let $(v, \lambda)$ be an eigenvector-eigenvalue pair of $X^T X$. To find a good approximation to $(v, \lambda)$ in $K_m$, an orthogonal projection method seeks a pair $(u, \theta)$ with $u \in K_m$ such that

$$X^T X u - \theta u \perp K_m \tag{9}$$

Here $\theta$ is called a Ritz value. Intuitively, the Ritz values should approach eigenvalues of $X^T X$ as $m$ increases. The Ritz values are also the eigenvalues of $R_m^T X^T X R_m$ when the columns of $R_m$ form an orthonormal basis of $K_m$.
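
The orthogonality condition defining a Ritz pair can be verified directly. The following sketch computes Ritz values and Ritz vectors from an orthonormal Krylov basis on synthetic data; `krylov_basis`, the seed, and the choice `m = 3` are assumptions made for the example.

```python
import numpy as np

def krylov_basis(X, y, m):
    """Orthonormal p x m basis of K_m = span{X^T y, (X^T X) X^T y, ...}."""
    A = X.T @ X
    v = X.T @ y
    V = np.zeros((X.shape[1], m))
    for j in range(m):
        for _ in range(2):  # orthogonalize twice for numerical stability
            v = v - V[:, :j] @ (V[:, :j].T @ v)
        V[:, j] = v / np.linalg.norm(v)
        v = A @ V[:, j]
    return V

# Synthetic data (illustrative values only).
rng = np.random.default_rng(3)
n, p, m = 60, 5, 3
X = rng.normal(size=(n, p))
y = rng.normal(size=n)

A = X.T @ X
lam = np.linalg.eigvalsh(A)[::-1]          # eigenvalues, decreasing
V = krylov_basis(X, y, m)

# Ritz values: eigenvalues of the projected m x m matrix V^T A V.
theta, Z = np.linalg.eigh(V.T @ A @ V)
theta, Z = theta[::-1], Z[:, ::-1]         # decreasing order
ritz_vectors = V @ Z                       # each column u lies in K_m

# Residuals A u - theta u, one column per Ritz pair.
residuals = A @ ritz_vectors - ritz_vectors * theta
```

Projecting `residuals` onto the basis `V` gives (numerically) zero, which is exactly the condition $X^T X u - \theta u \perp K_m$.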

Ritz values for PLS

Theorem (Property of Ritz values) The Ritz values $\theta_1^{(m)} \ge \dots \ge \theta_m^{(m)}$ satisfy the properties:

1. $\lambda_1 > \theta_1^{(m)} \ge \dots \ge \theta_m^{(m)} > \lambda_p$
2. $\lambda_{p-m+i} \le \theta_i^{(m)} < \lambda_i$, $i = 1, \dots, m$
3. Each of the open intervals $(\theta_{i+1}^{(m)}, \theta_i^{(m)})$ contains one or more eigenvalues $\lambda_j$.
4. The Ritz values $\{\theta_i^{(m)}\}$ and $\{\theta_i^{(m+1)}\}$ separate each other:

$$\theta_1^{(m+1)} > \theta_1^{(m)} > \theta_2^{(m+1)} > \dots > \theta_m^{(m)} > \theta_{m+1}^{(m+1)}$$

Thus, for fixed $k$, the $k$th largest Ritz value increases with $m$ and the $k$th smallest Ritz value decreases with $m$.
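
The separation property of the Ritz values can be observed numerically by computing them for successive $m$. The sketch below does this on synthetic data; the helper names `krylov_basis` and `ritz_values` and the data are assumptions for illustration.

```python
import numpy as np

def krylov_basis(X, y, m):
    """Orthonormal p x m basis of K_m = span{X^T y, (X^T X) X^T y, ...}."""
    A = X.T @ X
    v = X.T @ y
    V = np.zeros((X.shape[1], m))
    for j in range(m):
        for _ in range(2):  # orthogonalize twice for numerical stability
            v = v - V[:, :j] @ (V[:, :j].T @ v)
        V[:, j] = v / np.linalg.norm(v)
        v = A @ V[:, j]
    return V

def ritz_values(X, y, m):
    """Ritz values theta_1 >= ... >= theta_m of X^T X with respect to K_m."""
    V = krylov_basis(X, y, m)
    return np.linalg.eigvalsh(V.T @ (X.T @ X) @ V)[::-1]

# Synthetic data (illustrative values only).
rng = np.random.default_rng(4)
n, p = 60, 5
X = rng.normal(size=(n, p))
y = rng.normal(size=n)

lam = np.linalg.eigvalsh(X.T @ X)[::-1]
thetas = [ritz_values(X, y, m) for m in range(1, p + 1)]  # thetas[m-1] for m factors

for m, th in enumerate(thetas, start=1):
    print(m, np.round(th, 3))
```

Reading the printed rows from top to bottom, the largest Ritz value moves up toward $\lambda_1$ and the smallest moves down toward $\lambda_p$; for $m = p$ the Ritz values coincide with the eigenvalues.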

A picture explains everything, about Ritz values

Filter factor of PLS

Theorem (Filter factor values for PLS) Assume that $\dim(K_m) = m$. The filter factors for PLS with $m$ factors are given by

$$w_i^{(m)} = 1 - \prod_{j=1}^{m} \left(1 - \frac{\lambda_i}{\theta_j^{(m)}}\right) \tag{10}$$

for $i = 1, \dots, p$. Here $\lambda_1 \ge \dots \ge \lambda_p$ are the eigenvalues of $X^T X$, and $\theta_1^{(m)} \ge \dots \ge \theta_m^{(m)}$ are the eigenvalues of $V_m^T X^T X V_m$, where $V_m$ is any $p \times m$ matrix whose columns form an orthonormal basis for $K_m$. The filter factors for PLS are therefore completely determined by the eigenvalues of $X^T X$ and $V_m^T X^T X V_m$.
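
The product formula for the PLS filter factors can be cross-checked against the constrained least-squares definition of PLS. The sketch below is an illustration on synthetic data, with `krylov_basis`, the seed, and `m = 3` as assumptions: it computes $w_i^{(m)}$ from the eigenvalues and Ritz values, plugs them into the filter-factor expansion, and compares the result with the estimate computed directly in $K_m$.

```python
import numpy as np

def krylov_basis(X, y, m):
    """Orthonormal p x m basis of K_m = span{X^T y, (X^T X) X^T y, ...}."""
    A = X.T @ X
    v = X.T @ y
    V = np.zeros((X.shape[1], m))
    for j in range(m):
        for _ in range(2):  # orthogonalize twice for numerical stability
            v = v - V[:, :j] @ (V[:, :j].T @ v)
        V[:, j] = v / np.linalg.norm(v)
        v = A @ V[:, j]
    return V

# Synthetic data (illustrative values only).
rng = np.random.default_rng(5)
n, p, m = 80, 6, 3
X = rng.normal(size=(n, p))
y = X @ rng.normal(size=p) + rng.normal(size=n)

A = X.T @ X
U, s, Vt = np.linalg.svd(X, full_matrices=False)
lam = s**2                                   # eigenvalues of X^T X, decreasing
Vm = krylov_basis(X, y, m)
theta = np.linalg.eigvalsh(Vm.T @ A @ Vm)    # Ritz values

# w_i = 1 - prod_j (1 - lam_i / theta_j), one value per i = 1..p.
w = 1.0 - np.prod(1.0 - lam[:, None] / theta[None, :], axis=1)

# Estimate from the filter-factor expansion.
beta_filter = Vt.T @ (w * (U.T @ y) / s)

# Estimate from least squares restricted to K_m.
beta_direct = Vm @ np.linalg.solve(Vm.T @ A @ Vm, Vm.T @ (X.T @ y))
```

Printing `w` for a few values of `m` typically shows entries both above and below 1, unlike the PCR and RR filter factors.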

Properties of filter factors

Theorem (Largest and smallest filter factors) For all $m$, we have $w_p^{(m)} \le 1$, and

$w_1^{(m)} \ge 1$ for $m = 1, 3, 5, \dots$
$w_1^{(m)} \le 1$ for $m = 2, 4, 6, \dots$

This theorem reveals a striking difference between the filter factors of PLS and those of other shrinkage methods (e.g. PCR, RR). While in other methods the filter factors are generally no larger than 1, the PLS filter factor $w_1^{(m)}$ oscillates around 1 as $m$ increases.
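
As a worked illustration (not taken from the slides), the smallest cases can be checked by hand from the product formula $w_i^{(m)} = 1 - \prod_{j=1}^{m} (1 - \lambda_i / \theta_j^{(m)})$ together with the bounds $\lambda_p < \theta_j^{(m)} < \lambda_1$. The shorthand $b = X^T y$ is introduced here only for this example.

```latex
\begin{align*}
w_i^{(1)} &= 1 - \left(1 - \frac{\lambda_i}{\theta_1^{(1)}}\right)
           = \frac{\lambda_i}{\theta_1^{(1)}},
\qquad \theta_1^{(1)} = \frac{b^T X^T X \, b}{b^T b}, \quad b = X^T y, \\
w_1^{(1)} &= \frac{\lambda_1}{\theta_1^{(1)}} > 1,
\qquad
w_p^{(1)} = \frac{\lambda_p}{\theta_1^{(1)}} < 1
\quad \text{since } \lambda_p < \theta_1^{(1)} < \lambda_1, \\
1 - w_1^{(2)} &= \left(1 - \frac{\lambda_1}{\theta_1^{(2)}}\right)
                 \left(1 - \frac{\lambda_1}{\theta_2^{(2)}}\right) > 0
\quad \Longrightarrow \quad w_1^{(2)} < 1 .
\end{align*}
```

In general, each of the $m$ factors $1 - \lambda_1/\theta_j^{(m)}$ is negative, so the sign of $1 - w_1^{(m)}$ is $(-1)^m$, which gives the alternation between odd and even $m$.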

A picture explains everything, for filter factors

Property of filter factors, continued

Theorem (Filter factors in the middle) For $m < M$ there is a partitioning of the set of integers $1, \dots, p$ into $m + 1$ non-empty disjoint sets $I_1, \dots, I_{m+1}$, where each element in $I_j$ is smaller than each element in $I_k$ when $j < k$. When $m$ is even, we have

$w_i^{(m)} \le 1$ for $i \in I_1 \cup I_3 \cup \dots \cup I_{m+1}$
$w_i^{(m)} \ge 1$ for $i \in I_2 \cup I_4 \cup \dots \cup I_m$

When $m$ is odd, we have

$w_i^{(m)} \ge 1$ for $i \in I_1 \cup I_3 \cup \dots \cup I_m$
$w_i^{(m)} \le 1$ for $i \in I_2 \cup I_4 \cup \dots \cup I_{m+1}$

We can see that the intermediate $w_i^{(m)}$ also oscillate around 1 as $m$ varies, but as $m$ increases, $w_i^{(m)}$ approaches 1.
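
The alternating pattern can be made visible by counting how often $w_i^{(m)} - 1$ changes sign as $i$ runs from 1 to $p$. The sketch below is a hedged illustration on synthetic data; `krylov_basis`, `pls_filter_factors`, and `sign_changes` are hypothetical helper names, and the factors are computed from the product formula $w_i^{(m)} = 1 - \prod_j (1 - \lambda_i/\theta_j^{(m)})$.

```python
import numpy as np

def krylov_basis(X, y, m):
    """Orthonormal p x m basis of K_m = span{X^T y, (X^T X) X^T y, ...}."""
    A = X.T @ X
    v = X.T @ y
    V = np.zeros((X.shape[1], m))
    for j in range(m):
        for _ in range(2):  # orthogonalize twice for numerical stability
            v = v - V[:, :j] @ (V[:, :j].T @ v)
        V[:, j] = v / np.linalg.norm(v)
        v = A @ V[:, j]
    return V

def pls_filter_factors(X, y, m):
    """PLS filter factors w_1..w_p, ordered by decreasing eigenvalue."""
    A = X.T @ X
    lam = np.linalg.eigvalsh(A)[::-1]
    V = krylov_basis(X, y, m)
    theta = np.linalg.eigvalsh(V.T @ A @ V)
    return 1.0 - np.prod(1.0 - lam[:, None] / theta[None, :], axis=1)

def sign_changes(w):
    """Number of times w_i - 1 changes sign along i (exact zeros ignored)."""
    s = np.sign(w - 1.0)
    s = s[s != 0]
    return int(np.sum(s[1:] != s[:-1]))

# Synthetic data (illustrative values only).
rng = np.random.default_rng(6)
n, p = 100, 8
X = rng.normal(size=(n, p))
y = rng.normal(size=n)

factors = {m: pls_filter_factors(X, y, m) for m in range(1, 5)}
changes = {m: sign_changes(w) for m, w in factors.items()}
```

Each sign change corresponds to a boundary between two consecutive index sets $I_j$ and $I_{j+1}$, so there can be at most $m$ of them.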

Close to unity condition

Theorem (Close to unity condition) For $1 \le i \le m \le p - 2$, we have

$$\left| w_i^{(m)} - 1 \right| \le \rho_i^{(m)} \, \frac{(\lambda_1 - \lambda_p)^m}{\lambda_p^m} \, \tan^2 \phi(X^T y, v_i)$$

Here $\rho_i^{(m)}$ is a rather complicated function of the Ritz values and the eigenvalues of $X^T X$, $\phi(X^T y, v_i)$ is the angle between $X^T y$ and $v_i$, and $v_i$ is the $i$th eigenvector of $X^T X$. This theorem shows that when the angle between $X^T y$ and $v_i$ is small, $w_i^{(m)}$ is close to unity.

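Large deviation condition

Theorem (Large deviation condition) Let $N$ be an integer with $m \le N \le p$, and let $\delta^* = \min\{|\lambda_i - \lambda_k| : i \ne k, \; i, k \le N\}$. Given a positive $\delta \le \delta^* / (2\lambda_1)$, suppose there is a set of $m$ distinct indices $J \subseteq \{1, \dots, N\}$ such that $|w_i^{(m)} - 1| < \delta^m$ for $i \in J$. Then for any index $l \in \{1, \dots, p\} \setminus J$, we have

$$\left| w_l^{(m)} - 1 \right| > \prod_{i \in J} \left| 1 - \frac{\lambda_l}{\lambda_i} \right| - \delta \sum_{i \in J} \frac{\lambda_l}{\lambda_i} \prod_{k \in J \setminus \{i\}} \left| 1 - \frac{\lambda_l}{\lambda_k} \right| + O(\delta^2)$$

This shows that for $l$ such that $\lambda_l \ll \lambda_i$ for all $i \in J$ (e.g. in cases of high collinearity), the corresponding filter factor $w_l^{(m)}$ must deviate significantly from 1.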

Discussion

- The filter factors for PLS oscillate around 1.
- The first PLS filter factor quickly converges to 1, but the last Ritz value approximates its corresponding eigenvalue poorly.
- Ritz values approximate their corresponding eigenvalues in natural order.
- Intermediate filter factors follow a less consistent pattern.
- Filter factors may be negative!
- ...

Thank you!
