12/9/2013
Partial Least Squares: a tutorial
Lutgarde Buydens

• Multivariate regression
  – Multiple Linear Regression (MLR)
  – Principal Component Regression (PCR)
  – Partial Least Squares (PLS)
• Validation
• Preprocessing
Multivariate Regression

[Figure: raw NIR data, absorbance vs. wavenumber (cm-1) over 2000-14000 cm-1, next to the data matrices X (n × p) and Y (n × k)]

Rows: cases, observations, …
• Analytical observations of different samples
• Experimental runs
• Persons
• …

Columns: variables, classes, tags
• X (n × p): independent variables (will always be available), e.g. spectral variables, analytical measurements
• Y (n × k): dependent variables (to be predicted later from X), e.g. class information, concentration, …

Y = f(X): predict Y from X
• MLR: Multiple Linear Regression
• PCR: Principal Component Regression
• PLS: Partial Least Squares
From univariate to Multiple Linear Regression (MLR)

Univariate: y = b0 + b1 x1 + ε
• b0: intercept, b1: slope, ε: residual
• Least squares regression

Multiple Linear Regression: y = b0 + b1 x1 + b2 x2 + … + bp xp + ε
• Y = Ŷ + E
• Least squares maximizes r(y, ŷ)
MLR: Multiple Linear Regression

y = b0 + b1 x1 + b2 x2 + … + bp xp + ε

In matrix form: Ynk = Xnp Bpk + Enk, with Ŷ = XB and E = Y − Ŷ.
(X is augmented with a column of ones, so b0 is estimated together with b1 … bp.)

Least squares solution: b = (XTX)-1XTy

Disadvantages of (XTX)-1:
• Uncorrelated X-variables required: r(x1,x2) ≈ 1 causes problems
• n ≥ p + 1
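The least-squares solution above can be sketched in Python/NumPy (an illustrative example, not part of the slides; the helper name `mlr_fit` is chosen here):

```python
import numpy as np

def mlr_fit(X, y):
    """Least squares MLR: b = (XTX)-1 XTy.

    A column of ones is prepended so b[0] is the intercept b0.
    np.linalg.lstsq solves the same system without forming the
    inverse explicitly, which is numerically more stable."""
    X1 = np.column_stack([np.ones(len(X)), X])
    b, *_ = np.linalg.lstsq(X1, y, rcond=None)
    return b

# Noise-free example: the true coefficients are recovered exactly
rng = np.random.default_rng(0)
X = rng.normal(size=(20, 3))
y = 2.0 + X @ np.array([1.0, -3.0, 0.5])
b = mlr_fit(X, y)          # b ≈ [2.0, 1.0, -3.0, 0.5]
```

Forming (XTX)-1 explicitly is exactly what fails when the X-variables are correlated, which is the point of the next slide.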
MLR: Multiple Linear Regression

Disadvantages of (XTX)-1:
• Uncorrelated X-variables required: when r(x1,x2) ≈ 1, MLR fits a plane through a line!!
• n ≥ p + 1

Example: two almost identical data sets (only x2 in row 4 differs), modelled as y = b1 x1 + b2 x2 + ε (in matrix form: yn1 = Xnp bp1 + en1):

          Set A             Set B
   x1       x2       x1       x2        y
  -1.01    -0.99    -1.01    -0.99     -1.89
   3.23     3.25     3.23     3.25     10.33
   5.49     5.55     5.49     5.55     19.09
   0.23     0.21     0.23     0.23      2.19
  -2.87    -2.91    -2.87    -2.91     -8.09
   3.67     3.76     3.67     3.76     11.29

MLR results:
          Set A                    Set B
   b1 = 10.3, b2 = −6.92    b1 = 2.96, b2 = 0.28
   R² = 0.98                R² = 0.98

Because x1 and x2 are almost perfectly correlated, a change of 0.02 in a single value flips the coefficients completely, while the fit (R²) stays the same.

PCR: Principal Component Regression
• Step 1: Perform PCA on the original X (n rows, p cols) → scores T (n rows, a cols)
• Step 2: Use the orthogonal PC-scores as independent variables in an MLR model for y: coefficients a1, a2, … aa
• Step 3: Calculate the b-coefficients (b0, b1, … bp) from the a-coefficients

Dimension reduction:
• Variable selection
• Latent variables (PCR, PLS)
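The Set A / Set B instability can be reproduced numerically. A minimal NumPy sketch (not from the slides; the helper name `fit_no_intercept` is chosen here, and the model has no intercept, matching the slide):

```python
import numpy as np

# Data from the Set A / Set B table; the two sets differ only in
# one x2 value (0.21 vs 0.23)
y    = np.array([-1.89, 10.33, 19.09, 2.19, -8.09, 11.29])
x1   = np.array([-1.01,  3.23,  5.49, 0.23, -2.87,  3.67])
x2_a = np.array([-0.99,  3.25,  5.55, 0.21, -2.91,  3.76])
x2_b = np.array([-0.99,  3.25,  5.55, 0.23, -2.91,  3.76])

def fit_no_intercept(x1, x2, y):
    """Least squares for y = b1*x1 + b2*x2 (no intercept)."""
    X = np.column_stack([x1, x2])
    b, *_ = np.linalg.lstsq(X, y, rcond=None)
    return b

b_a = fit_no_intercept(x1, x2_a, y)   # slide reports roughly (10.3, -6.92)
b_b = fit_no_intercept(x1, x2_b, y)   # slide reports roughly (2.96, 0.28)
```

Both fits explain the data equally well; only the coefficients are unstable, because the near-singular XTX amplifies tiny data changes.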
PCR: Principal Component Regression

Step 0: Mean-center X
Step 1: Perform PCA: X = TPT; reconstructed data X* = (TPT)*
Step 2: Perform MLR on the scores: Y = TA, so A = (TTT)-1TTY
Step 3: Calculate B on the original variables:
  Y = X*B = (TPT)B, so A = PTB and (because P has orthonormal columns) B = PA
  Calculate the b0's from the means.

Dimension reduction: use scores (projections) on latent variables that explain maximal variance in X. The MLR is effectively performed on the reconstructed X* = (TPT)*.
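The steps above can be sketched for a single y in Python/NumPy (an illustration of the procedure, not the slides' own code; `pcr_fit` is a name chosen here):

```python
import numpy as np

def pcr_fit(X, y, n_components):
    """Principal Component Regression, following the slide's steps.

    Step 0: mean-center X and y.
    Step 1: PCA via SVD: X = T PT (scores T, loadings P).
    Step 2: MLR of y on the orthogonal scores: a = (TTT)-1 TTy.
    Step 3: back-transform to coefficients on the original
            variables: b = P a; b0 from the means."""
    x_mean, y_mean = X.mean(axis=0), y.mean()
    Xc, yc = X - x_mean, y - y_mean
    U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    P = Vt[:n_components].T            # loadings, p x a
    T = Xc @ P                         # scores,  n x a
    a = np.linalg.solve(T.T @ T, T.T @ yc)
    b = P @ a
    b0 = y_mean - x_mean @ b
    return b0, b

# With all components kept, PCR reduces to ordinary MLR
rng = np.random.default_rng(1)
X = rng.normal(size=(30, 4))
y = 1.0 + X @ np.array([0.5, -1.0, 2.0, 0.0]) + 0.01 * rng.normal(size=30)
b0, b = pcr_fit(X, y, n_components=4)
```

With n_components < p, the small-variance directions (the ones that make XTX near-singular) are simply discarded, which is what stabilizes the coefficients.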
PCR: Principal Component Regression

Optimal number of PCs: calculate the cross-validation RMSE for different numbers of PCs:
RMSECV = √( Σi (yi − ŷi)² / n )

PLS: Partial Least Squares Regression

• Phase 1: X (n rows, p cols) → T (n rows, a cols): calculate new independent variables (scores on latent variables)
• Phase 2: MLR of y on the scores T: coefficients a1, a2, … aa
• Phase 3: Calculate b0, b1, … bp (on the original variables) from a1, a2, … aa
PLS: Partial Least Squares Regression
Projection to Latent Structures

PCR vs. PLS:
• PCR uses PCs: maximize the variance in X
• PLS uses LVs: maximize the covariance(X,y) = sd(X) · sd(y) · cor(X,y)

Phase 1: Calculate new independent variables (T)
Sequential algorithm: latent variables and their scores are calculated sequentially.
• Step 0: Mean-center X
• Step 1: Calculate w1 (LV1), the direction that maximizes covariance(X,Y): SVD on XTY, (XTY)pk = Wpa Daa ZTak, with w1 = the 1st column of W
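Step 1 (and the scores of Step 2 on the next slide) can be sketched in NumPy for a single y (an illustration, not the slides' code; `pls_first_lv` is a name chosen here):

```python
import numpy as np

def pls_first_lv(X, y):
    """First latent variable of the sequential PLS algorithm
    (X and y are assumed mean-centered).

    SVD on XTY gives W; w1 is its first column and maximizes the
    covariance between the projection Xw and y. For a single y,
    w1 is simply XTy scaled to unit length (up to sign).
    The scores t1 = X w1 are the projections of X on w1."""
    W, d, Zt = np.linalg.svd(X.T @ y.reshape(-1, 1), full_matrices=False)
    w1 = W[:, 0]
    t1 = X @ w1
    return w1, t1

# Illustration on mean-centered random data
rng = np.random.default_rng(2)
X = rng.normal(size=(15, 3)); X = X - X.mean(axis=0)
y = X @ np.array([1.0, 2.0, 0.0]); y = y - y.mean()
w1, t1 = pls_first_lv(X, y)
```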
PLS: Partial Least Squares Regression

Phase 1: Calculate new independent variables (T)
Sequential algorithm: latent variables and their scores are calculated sequentially.
• Step 1: Calculate LV1 = w1 that maximizes covariance(X,Y): SVD on XTY, (XTY)pk = Wpa Daa ZTak, with w1 = the 1st column of W
• Step 2: Calculate t1, the scores (projections) of X on w1: tn1 = Xnp wp1
PLS: Partial Least Squares Regression

Optimal number of LVs: calculate the cross-validation RMSE for different numbers of LVs:
RMSECV = √( Σi (yi − ŷi)² / n )

MLR, PCR, PLS compared on the two almost identical data sets, y = b1 x1 + b2 x2 + ε:

          Set A             Set B
   x1       x2       x1       x2        y
  -1.01    -0.99    -1.01    -0.99     -1.89
   3.23     3.25     3.23     3.25     10.33
   5.49     5.55     5.49     5.55     19.09
   0.23     0.21     0.23     0.23      2.19
  -2.87    -2.91    -2.87    -2.91     -8.09
   3.67     3.76     3.67     3.76     11.29

          Set A             Set B
          b1       b2       b1       b2
  MLR    10.3    -6.92     2.96     0.28
  PCR     1.60     1.62     1.60     1.62
  PLS     1.60     1.62     1.60     1.62

PCR and PLS give the same, stable coefficients on both sets; only MLR is unstable.

VALIDATION

Common measure for prediction error: RMSE.
Estimating prediction error, basic principle: test how well your model works with new data it has not seen yet!
A biased approach

The prediction error of the samples the model was built on is biased: these samples were also used to build the model, so the model is biased towards accurate prediction of these specific samples.

Validation: basic principle
Test how well your model works with new data it has not seen yet!
Split the data in a training and a test set. Several ways:
• One large test set
• Leave one out and repeat: LOO
• Leave n objects out and repeat: LNO
• …
Apply the entire model procedure on the test set.

Training and test sets
• The test set should be representative of the training set
• Random choice is often the best
• Check for extremely unlucky divisions
• Apply the whole procedure on the test and validation sets

Build the model (b0 … bp) on the training set, predict ŷ for the test set, and report the RMSEP.
Remark: for the final model use the whole data set.
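The split-and-RMSEP procedure can be sketched in NumPy (illustrative helper names, not from the slides):

```python
import numpy as np

def rmsep(y_true, y_pred):
    """Root mean squared error of prediction on an independent test set."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

def random_split(n, test_fraction=0.3, seed=0):
    """Random train/test split of the sample indices
    (random choice is often the best, per the slide)."""
    idx = np.random.default_rng(seed).permutation(n)
    n_test = int(round(n * test_fraction))
    return idx[n_test:], idx[:n_test]   # train indices, test indices
```

The 30% test fraction is a choice made here for illustration; the slides do not prescribe one.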
Cross-validation
• Most simple case: Leave-One-Out (LOO, segment = 1 sample). Normally 10-20% is left out (LnO).
• Remark: for the final model use the whole data set.
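A model-agnostic LOO sketch in NumPy (the `fit`/`predict` callables are placeholders chosen here, standing in for e.g. a PLS fit):

```python
import numpy as np

def rmsecv_loo(X, y, fit, predict):
    """Leave-one-out cross-validation RMSE (RMSECV).

    fit(X, y) must return a model; predict(model, X) must return
    predictions. Every sample is left out once and predicted by a
    model built on the remaining samples."""
    n = len(y)
    errors = np.empty(n)
    for i in range(n):
        keep = np.arange(n) != i
        model = fit(X[keep], y[keep])
        errors[i] = y[i] - predict(model, X[i:i + 1])[0]
    return float(np.sqrt(np.mean(errors ** 2)))
```

For example, with a trivial "predict the training mean" model and y = (0, 2), each left-out sample is predicted by the other one, giving RMSECV = 2.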
Cross-validation: an example
• Split the data into a training set and a validation set
• Build a model on the training set and predict the validation samples
• Split the data again into a training set and a validation set, and repeat
  – Until all samples have been in the validation set once
  – Common: Leave-One-Out (LOO)
Cross-validation: a warning

• Data: 13 x 5 = 65 NIR spectra (1102 wavelengths), X is 65 × 1102
  – 13 samples: different compositions of NaOH, NaOCl and Na2CO3
  – 5 temperatures: each sample measured at 5 temperatures

  Composition   NaOH (wt%)   NaOCl (wt%)   Na2CO3 (wt%)   Temperature (°C)
   1            18.99        0             0              15, 21, 27, 34, 40
   2             9.15        9.99          0.15           15, 21, 27, 34, 40
   3            15.01        0             4.01           15, 21, 27, 34, 40
   4             9.34        5.96          3.97           15, 21, 27, 34, 40
   …             …           …             …              …
   13           16.02        2.01          1.00           15, 21, 27, 34, 40

• Leave the SAMPLE out: all 5 temperature replicates of a sample must leave the training set together, otherwise near-replicates of the left-out spectrum remain in the training set and the cross-validation error is over-optimistic.
Validation: selection of the number of LVs

Through validation: choose the number of LVs that results in the model with the lowest prediction error. The test set used to assess the final model cannot be used for this!

Divide the training set further:
1) Determine the number of LVs with a test' set split off from the training set (or by cross-validation)
2) Build the model (b0 … bp) on the training set
3) Assess the final model on the test set: predict ŷ and report the RMSEP

Remark: for the final model use the whole data set.
Double cross-validation

1) Determine the number of LVs: CV inner loop (CV2, within the training set)
2) Build the model (b0 … bp) and estimate the prediction error (RMSEP): CV outer loop (CV1) over the full data set

• Split the data into a training set and a validation set; the validation set is used later to assess model performance!

Remark: for the final model use the whole data set.
Double cross-validation
• Apply cross-validation on the rest: split the training set into a (new) training set and a test set
• For each split, build models with 1 LV, 2 LVs, 3 LVs, …
• Select the number of LVs with the lowest RMSECV
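The nested procedure can be sketched model-agnostically in NumPy (an illustration only: `fit(X, y, n_lv)`/`predict` are placeholders for e.g. a PLS fit with a given number of LVs, and the 5 outer folds with a LOO inner loop are choices made here, not prescribed by the slides):

```python
import numpy as np

def double_cv(X, y, fit, predict, candidates, n_outer=5, seed=0):
    """Double (nested) cross-validation sketch.

    Outer loop: each fold is held out once to estimate prediction
    error on samples never used for fitting or model selection.
    Inner loop: LOO on the remaining samples picks the complexity
    (number of LVs) with the lowest RMSECV."""
    n = len(y)
    idx = np.random.default_rng(seed).permutation(n)
    outer_errors, chosen = [], []
    for test in np.array_split(idx, n_outer):
        train = np.setdiff1d(idx, test)
        rmsecv = []
        for a in candidates:                       # inner loop: LOO
            errs = []
            for i in train:
                rest = np.setdiff1d(train, [i])
                model = fit(X[rest], y[rest], a)
                errs.append(y[i] - predict(model, X[i:i + 1])[0])
            rmsecv.append(np.sqrt(np.mean(np.square(errs))))
        best = candidates[int(np.argmin(rmsecv))]
        chosen.append(best)
        model = fit(X[train], y[train], best)      # outer-loop model
        outer_errors.extend(y[test] - predict(model, X[test]))
    return float(np.sqrt(np.mean(np.square(outer_errors)))), chosen
```

Both the number of LVs and the reported error are thus determined on samples the respective models have not seen, which is exactly the point of the summary slide that follows.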
Double cross-validation
• Repeat the procedure until all samples have been in the validation set once
• In this way:
  – The number of LVs is determined using samples not used to build the model
  – The prediction error is also determined using samples the model has not seen before
• Remark: for the final model use the whole data set.

PLS: an example
[Figure: raw and mean-centered NIR data; absorbance (a.u.) vs. wavenumber (cm-1), 2000-14000 cm-1]
[Figure: RMSECV values for the prediction of NaOH vs. number of LVs (1-10)]
[Figure: regression coefficients for the raw data vs. wavenumber (cm-1), 3000-13000 cm-1]
Why Pre-Processing?

True vs. predicted
[Figure: predicted NaOH plotted against true NaOH (wt%), range about 8-20]

Data artefacts: offset, slope, scatter, …
[Figure: an original spectrum and the same spectrum with offset, offset + slope, multiplicative, and offset + slope + multiplicative artefacts; intensity (a.u.) vs. wavelength (a.u.)]

Pre-processing tasks:
• Baseline correction
• Alignment
• Scatter correction
• Noise removal
• Scaling, normalisation
• Transformation
• Other: missing values, outliers
• …

Pre-Processing Methods (4914 combinations: all reasonable)

STEP 1: (7x) BASELINE
• No baseline correction
• (3x) Detrending, polynomial order (2-3-4)
• (2x) Derivatisation (1st - 2nd)
• AsLS

STEP 2: (10x) SCATTER
• No scatter correction
• SNV
• (3x) RNV (15, 25, 35)%
• MSC
• (4x) Scaling to: mean, median, max, L2 norm

STEP 3: (10x) NOISE
• No noise removal
• (9x) S-G smoothing (window: 5-9-11 pt; order: 2-3-4)

STEP 4: (7x) SCALING & TRANSFORMATIONS
• Meancentering
• Autoscaling
• Range scaling
• Pareto scaling
• Poisson scaling
• Level scaling
• Log transformation

Supervised pre-processing methods: OSC, DOSC

Pre-Processing Results
• Complexity of the model: number of LVs
• Classification accuracy (%)

J. Engel et al. TrAC 2013
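Two of the simplest steps listed above, SNV scatter correction and mean-centering, can be sketched in NumPy (illustrative helper names, not from the slides):

```python
import numpy as np

def snv(spectra):
    """Standard Normal Variate scatter correction: each spectrum
    (row) is centered and scaled by its own standard deviation."""
    spectra = np.asarray(spectra, dtype=float)
    mean = spectra.mean(axis=1, keepdims=True)
    std = spectra.std(axis=1, keepdims=True)
    return (spectra - mean) / std

def mean_center(X):
    """Column-wise mean-centering (the 'meancentering' scaling step)."""
    return X - X.mean(axis=0, keepdims=True)
```

Note the different axes: SNV operates per spectrum (row), while mean-centering operates per variable (column).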
SOFTWARE • PLS Toolbox (Eigenvector Inc.) – www.eigenvector.com – For use in MATLAB (or standalone!)
• XLSTAT-PLS (XLSTAT) – www.xlstat.com – For use in Microsoft Excel
• Package pls for R – Free software – http://cran.r-project.org