Author: Erika Fowler
HOW TO PERFORM MULTIVARIATE ANALYSES?

W4M Core Team

http://workflow4metabolomics.org


The "Multivariate" module

The "Multivariate" module on W4M allows you to perform:
• Principal Component Analysis (PCA)
• Partial Least-Squares regression (PLS) and discriminant analysis (PLS-DA)
• Orthogonal Partial Least-Squares regression (OPLS) and discriminant analysis (OPLS-DA)

The original PCA, PLS and OPLS algorithms, based on the NIPALS algorithm, have been implemented in the R environment.
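To give an idea of what NIPALS does, here is a minimal Python sketch of NIPALS for PCA. It is an illustration only, not the module's R code: the function name nipals_pca, the starting vector, the tolerance and the synthetic data are all assumptions, and this simplified version does not handle missing values.

```python
import numpy as np

def nipals_pca(X, n_components=2, tol=1e-10, max_iter=500):
    """PCA by NIPALS on a samples x variables array without missing values.

    Returns (scores, loadings) with shapes (n, A) and (p, A).
    """
    E = np.asarray(X, dtype=float)
    E = E - E.mean(axis=0)                  # mean-centre each variable
    n_samples, n_variables = E.shape
    scores = np.zeros((n_samples, n_components))
    loadings = np.zeros((n_variables, n_components))
    for a in range(n_components):
        t = E[:, np.argmax(E.var(axis=0))].copy()   # starting score vector
        for _ in range(max_iter):
            p = E.T @ t / (t @ t)           # project the residual matrix on t
            p /= np.linalg.norm(p)          # normalise the loading to length 1
            t_new = E @ p                   # updated score vector
            done = np.linalg.norm(t_new - t) <= tol * np.linalg.norm(t_new)
            t = t_new
            if done:
                break
        scores[:, a] = t
        loadings[:, a] = p
        E = E - np.outer(t, p)              # deflate before the next component
    return scores, loadings

# Synthetic data (hypothetical): three latent directions of decreasing strength
rng = np.random.default_rng(0)
latent = rng.normal(size=(30, 3)) * np.array([10.0, 3.0, 1.0])
axes, _ = np.linalg.qr(rng.normal(size=(6, 3)))
X = latent @ axes.T
scores, loadings = nipals_pca(X, n_components=2)
```

Each component is extracted by alternating between the score and loading vectors until the scores stop changing, then removed from the data (deflation) before the next one is computed.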


Chaining the statistical modules

The Multivariate module can be chained with the Univariate module and with the Filters module (either to filter out pool or blank samples before the statistics, or to filter out variables according to a statistical threshold after the analysis).


Preparing your files (1/9)

Your data must be split into 3 files:
• dataMatrix.tsv
• sampleMetadata.tsv
• variableMetadata.tsv


Preparing your files (2/9)

Each file can be prepared in Excel and saved as tab-delimited text.


Preparing your files (3/9)

You can then rename your file with the .tsv extension (instead of .txt) by right-clicking on the file.

.tsv files (i.e. tab-separated values) are handled correctly by both Excel and Galaxy.


Preparing your files (4/9)

• The decimal separator must be "."
• Missing values must be indicated as "NA"
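These two conventions can be checked before uploading. The following Python sketch is only an illustration (parse_value is a hypothetical helper, not part of W4M): it converts one cell of the data matrix to a number, maps "NA" to a missing value, and rejects a comma used as decimal separator.

```python
import math

def parse_value(cell):
    """Convert one dataMatrix cell to a float; "NA" becomes NaN."""
    text = cell.strip()
    if text == "NA":
        return math.nan
    if "," in text:
        raise ValueError(f"decimal separator must be '.', got {text!r}")
    return float(text)

# One tab-separated row of intensities (made-up values)
row = "12.5\tNA\t3.0e4".split("\t")
values = [parse_value(cell) for cell in row]
```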


Preparing your files (5/9)

Note: you can switch your default language in Excel to English so that your decimal separator is automatically set to ".".


Preparing your files (6/9)

The dataMatrix.tsv file must contain:
• the names of your samples in the first row
• the names of your variables in the first column
• numbers (or NA) in all the other cells

Note: the name in the top-left (A1) cell does not matter; avoid using "ID" for Excel compatibility.


Preparing your files (7/9)

The sampleMetadata.tsv file must contain:
• the names of the factors to be used in the statistical analyses in the first row
• the names of your samples in the first column, which must exactly match those of the dataMatrix.tsv file
• columns of characters for qualitative factors and columns of numbers for quantitative factors

Notes:
1) the name in the top-left (A1) cell does not matter; avoid using "ID" for Excel compatibility
2) you can add columns to store metadata about your samples even if they are not used in your Galaxy analysis
3) results from the statistical analyses (e.g. scores) will be added as supplementary columns in this file


Preparing your files (8/9)

The variableMetadata.tsv file must contain:
• the names of the metadata (e.g. mzmed, rtmed) in the first row (there must be at least one column in addition to the variable names)
• the names of your variables in the first column, which must exactly match those of the dataMatrix.tsv file

Notes:
1) the name in the top-left (A1) cell does not matter; avoid using "ID" for Excel compatibility
2) you can add columns to store metadata about your variables even if they are not used in your Galaxy analysis
3) results from the statistical analyses (e.g. loadings, VIPs) will be added as new columns in this file
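The consistency rules between the three files can be verified with a short script before uploading. The sketch below is an illustration under stated assumptions: read_tsv and check_tables are hypothetical helpers, the tables are tiny made-up examples, and "exactly match" is interpreted as same names in the same order.

```python
import csv
import io

def read_tsv(text):
    """Parse tab-separated text into a list of rows (lists of strings)."""
    return list(csv.reader(io.StringIO(text), delimiter="\t"))

def check_tables(data_matrix, sample_metadata, variable_metadata):
    """Return a list of problems; an empty list means the tables agree."""
    problems = []
    samples_dm = data_matrix[0][1:]                    # first row, without A1
    variables_dm = [row[0] for row in data_matrix[1:]]
    samples_sm = [row[0] for row in sample_metadata[1:]]
    variables_vm = [row[0] for row in variable_metadata[1:]]
    if samples_dm != samples_sm:
        problems.append("sample names differ between dataMatrix and sampleMetadata")
    if variables_dm != variables_vm:
        problems.append("variable names differ between dataMatrix and variableMetadata")
    if len(variable_metadata[0]) < 2:
        problems.append("variableMetadata needs at least one metadata column")
    return problems

# Made-up example tables
data_matrix = read_tsv("name\ts1\ts2\nv1\t1.5\tNA\nv2\t3.0\t4.2\n")
sample_metadata = read_tsv("name\tgender\ns1\tF\ns2\tM\n")
variable_metadata = read_tsv("name\tmzmed\trtmed\nv1\t180.06\t35.2\nv2\t255.23\t410.9\n")
problems = check_tables(data_matrix, sample_metadata, variable_metadata)
```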


Preparing your files (9/9)

Sample and variable names:
• should not start with a digit
• should contain only the following characters:
  • abcdefghijklmnopqrstuvwxyz
  • ABCDEFGHIJKLMNOPQRSTUVWXYZ
  • 0123456789
  • , [comma]
  • - [dash]
  • _ [underscore]
  •   [blank]
• should not contain other punctuation marks or accents
• should not contain any duplicates
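The naming rules above translate directly into a regular expression. The following Python sketch (invalid_names is a hypothetical helper and the names are made up) lists the names that break the character rules and the names that appear more than once.

```python
import re
from collections import Counter

# Letters, digits, comma, dash, underscore and blank only; no leading digit
NAME_PATTERN = re.compile(r"(?![0-9])[A-Za-z0-9,_ -]+")

def invalid_names(names):
    """Return (names breaking the character rules, duplicated names)."""
    bad = [name for name in names if not NAME_PATTERN.fullmatch(name)]
    duplicated = [name for name, count in Counter(names).items() if count > 1]
    return bad, duplicated

names = ["QC_01", "1blank", "sample-A", "échantillon", "QC_01"]
bad, duplicated = invalid_names(names)
```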


Loading your files into Galaxy (1/2)

Upload your three files (dataMatrix.tsv, sampleMetadata.tsv and variableMetadata.tsv):
• either by using the icon and dragging and dropping the files


Loading your files into Galaxy (2/2)

• or with Get Data / Upload File


Check that your data have been uploaded correctly


Rename your history (optional)


Open the "Multivariate" module and select your 3 files of interest:

You are now ready to start your multivariate analyses!


Principal component analysis (PCA)


Select:
• the total number of components
• the scaling
• the logarithm (log10) transformation of the values (optional)
• the components for display

and launch the computation.
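As an illustration of what the transformation and scaling options do to the data, here is a minimal Python sketch. It makes several assumptions that are not stated on the slide: the log10 transformation is applied before the scaling, the scaling shown is mean-centering followed by division by the standard deviation (one common choice among several), and all intensities are strictly positive. The function name log10_then_scale is hypothetical.

```python
import numpy as np

def log10_then_scale(X):
    """log10-transform a samples x variables array, then centre and scale
    each variable to unit variance (NA values, stored as NaN, are ignored)."""
    logged = np.log10(np.asarray(X, dtype=float))
    centred = logged - np.nanmean(logged, axis=0)
    return centred / np.nanstd(logged, axis=0, ddof=1)

# Made-up intensities: 3 samples x 2 variables
X = np.array([[10.0, 100.0],
              [100.0, 1000.0],
              [1000.0, 100000.0]])
Z = log10_then_scale(X)
```

After this step every variable has mean 0 and standard deviation 1, so that intense and weak signals contribute equally to the components.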


Graphical results

Look at the "figure.pdf" file to see the scree plot, the extreme observations, and the loading and score plots.


Observation diagnostics

The "observation diagnostics" plot highlights observations with a large distance from the center in the score plane (score distance) or a large distance to their projection on the score plane (orthogonal distance).
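The two distances can be sketched as follows in Python. This uses the classical, non-robust definitions (score distance as the distance in the score space after standardising each component, orthogonal distance as the norm of the residual), which is an assumption: the module may compute them differently or add cut-off values. The function name and the synthetic data are hypothetical.

```python
import numpy as np

def observation_distances(X, n_components=2):
    """Return (score_distance, orthogonal_distance) for each sample of X."""
    Xc = np.asarray(X, dtype=float)
    Xc = Xc - Xc.mean(axis=0)
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    P = Vt[:n_components].T                      # loadings
    T = Xc @ P                                   # scores
    # distance from the centre within the score space, per-component scaled
    score_dist = np.sqrt(((T / T.std(axis=0, ddof=1)) ** 2).sum(axis=1))
    # distance between each sample and its projection on the score space
    ortho_dist = np.linalg.norm(Xc - T @ P.T, axis=1)
    return score_dist, ortho_dist

# Synthetic data: two strong directions plus weak noise
rng = np.random.default_rng(1)
X = rng.normal(size=(30, 5)) * np.array([5.0, 5.0, 0.1, 0.1, 0.1])
X[0, 2] += 3.0          # sample 0 is pushed away from the score plane
score_dist, ortho_dist = observation_distances(X, n_components=2)
```

In this example sample 0 stands out on the orthogonal distance only, because it was moved in a direction that the two components do not describe.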


Graphical results

The figure can be downloaded as a .pdf file.


Numerical results

Numerical results (including the percentage of explained inertia) can be viewed in the "information.txt" file.


Score and loading values

The score values of the selected components have been added as columns in the sampleMetadata.tsv file, and the loading values as columns in the variableMetadata.tsv file.


Tuning the parameters

You can recall the page with your parameters, modify them, and restart the analysis.


Advanced parameters (1/2)

• The default algorithm is svd (faster), except when there are missing values, in which case nipals is used instead
• The number of extreme values highlighted on the loading plots (coloured in red) can be modified
• The type of graphic can be modified


Advanced parameters (2/2)

• A factor (column of the sampleMetadata.tsv file) can be selected to color the samples
• If this factor is qualitative, it can also be used to draw the Mahalanobis ellipse of each class



Partial Least Squares (PLS) and Partial Least Squares Discriminant Analysis (PLS-DA)


Select (1/2)

• the Y response to be modelled (column of the sampleMetadata.tsv file)
  Note: in the case of a qualitative response, Mahalanobis ellipses can be drawn for each class by indicating the same factor as Y
• the number of random permutations of the labels used to estimate the significance of the model


Select (2/2)

• the total number of components
• the scaling
• the logarithm (log10) transformation of the values (optional)
• the components for display

and launch the computation.


Graphical results

Look at the "figure.pdf" file to see the results of the permutation testing, the extreme observations, and the loading and score plots.


Diagnostic metrics

• 0 ≤ R2X ≤ 1: fraction of the X inertia explained by the model
• 0 ≤ R2Y ≤ 1: fraction of the Y inertia explained by the model
• Q2Y ≤ 1: estimation of the predictive performance of the model by cross-validation (Q2Y can be negative for a model without any predictive ability)

R2X and R2Y increase with the number of components, whereas Q2Y reaches a maximum and then decreases once additional components only overfit the data, as can be visualized with the "overview" graphic.
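The difference between R2Y (computed on the samples used to build the model) and Q2Y (computed on samples left out during cross-validation) can be sketched in Python. Several things here are assumptions made for the illustration: an ordinary least-squares fit stands in for the PLS model, the number of folds (7) and the way samples are assigned to folds are arbitrary choices, and the data are synthetic.

```python
import numpy as np

def fit_predict(X_train, y_train, X_new):
    """Stand-in model: ordinary least squares with an intercept (not PLS)."""
    A = np.column_stack([np.ones(len(X_train)), X_train])
    coef, *_ = np.linalg.lstsq(A, y_train, rcond=None)
    return np.column_stack([np.ones(len(X_new)), X_new]) @ coef

def r2y(X, y):
    """1 - RSS/TSS, with the residuals of the model fitted on all samples."""
    residual = y - fit_predict(X, y, X)
    centred = y - y.mean()
    return 1 - (residual @ residual) / (centred @ centred)

def q2y(X, y, n_folds=7):
    """1 - PRESS/TSS, where each sample is predicted by a model built without it."""
    folds = np.arange(len(y)) % n_folds
    press = 0.0
    for k in range(n_folds):
        left_out = folds == k
        predicted = fit_predict(X[~left_out], y[~left_out], X[left_out])
        press += ((y[left_out] - predicted) ** 2).sum()
    centred = y - y.mean()
    return 1 - press / (centred @ centred)

# Synthetic data: y depends on the first two columns of X
rng = np.random.default_rng(2)
X = rng.normal(size=(40, 3))
y = X @ np.array([1.0, 2.0, 0.0]) + rng.normal(scale=0.1, size=40)
r2, q2 = r2y(X, y), q2y(X, y)
```

Because left-out samples are harder to predict than samples seen during fitting, q2 is lower than r2; a large gap between the two is a sign of overfitting.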


Permutation testing

The algorithm randomly permutes the Y labels, builds the corresponding models and computes their R2X, R2Y and Q2Y metrics.

Counting the number of R2Y (and Q2Y) values from these random models that are greater than the value of the true model gives an indication of the significance of the PLS model.
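The principle of the permutation test can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions: the R2 of an ordinary least-squares fit stands in for the R2Y of a PLS model, and the function names, the number of permutations and the synthetic data are hypothetical.

```python
import numpy as np

def r2_score(X, y):
    """Stand-in model quality: R2 of an ordinary least-squares fit (not PLS)."""
    A = np.column_stack([np.ones(len(y)), X])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    residual = y - A @ coef
    centred = y - y.mean()
    return 1 - (residual @ residual) / (centred @ centred)

def permutation_test(X, y, score, n_perm=100, seed=0):
    """Score the true model, then models built on randomly permuted responses."""
    rng = np.random.default_rng(seed)
    observed = score(X, y)
    permuted = np.array([score(X, rng.permutation(y)) for _ in range(n_perm)])
    n_better = int(np.sum(permuted >= observed))   # random models at least as good
    return observed, permuted, n_better

# Synthetic data with a real relationship between X and y
rng = np.random.default_rng(4)
X = rng.normal(size=(30, 2))
y = 2.0 * X[:, 0] + rng.normal(scale=0.5, size=30)
observed, permuted, n_better = permutation_test(X, y, r2_score, n_perm=100)
```

When few or none of the random models reach the score of the true model, the relationship between X and Y is unlikely to be due to chance.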


Numerical results

The details of the R2X, R2Y, and Q2Y values are stored in the "information.txt" file.


Scores, loadings and VIPs

The scores of the selected components have been added as columns in the sampleMetadata.tsv file, and the loadings and VIPs as columns in the variableMetadata.tsv file.


Advanced parameters (1/2)

• Use the icon to view your last parameters, modify them and start a new computation
• The optimal number of components can be estimated
• The dataset can be split into a reference subset and a test subset (the latter comprising the samples with odd indices); in this case, an estimation of the error on the test subset (RMSEP) is computed in addition to the estimation of the error on the reference subset (RMSEE)
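The reference/test split and the two error estimates can be illustrated with the Python sketch below. It rests on assumptions that the slide does not spell out: sample positions are counted from 1 (as in R), the error is a plain root mean squared error without any degrees-of-freedom correction, and a straight-line fit stands in for the PLS model. The helper names and the data are hypothetical.

```python
import numpy as np

def split_odd_even(n_samples):
    """Return 0-based (reference, test) indices; the test subset holds the
    samples at odd positions when counting from 1."""
    positions = np.arange(1, n_samples + 1)
    test = np.flatnonzero(positions % 2 == 1)
    reference = np.flatnonzero(positions % 2 == 0)
    return reference, test

def rmse(y_true, y_pred):
    """Root mean squared error between observed and predicted values."""
    diff = np.asarray(y_true, dtype=float) - np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean(diff ** 2)))

# Synthetic data and a stand-in model (straight line fitted on the reference subset)
rng = np.random.default_rng(3)
x = rng.normal(size=20)
y = 2.0 * x + rng.normal(scale=0.2, size=20)
reference, test = split_odd_even(len(y))
slope, intercept = np.polyfit(x[reference], y[reference], deg=1)
rmsee = rmse(y[reference], slope * x[reference] + intercept)  # error of estimation
rmsep = rmse(y[test], slope * x[test] + intercept)            # error of prediction
```

An RMSEP much larger than the RMSEE suggests that the model fits the reference samples well but does not generalise to new samples.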


Advanced parameters (2/2)

• Several other types of graphics are available:
  • XY-scores
  • predict-train and predict-test (the latter being available only if the test subset of odd-index samples has been defined)


Orthogonal Partial Least Squares (OPLS) and Orthogonal Partial Least Squares Discriminant Analysis (OPLS-DA)


Select (1/2)

• the Y response to be modelled (as in PLS)
• the number of predictive components (set it to 1)
• the number of orthogonal components
• the number of random permutations of the labels used to estimate the significance of the model (as in PLS)


Select (2/2)

• the scaling
• the logarithm (log10) transformation of the values (optional)
• the components for display

and launch the computation.


Graphical results

Look at the "figure.pdf" file to see the results of the permutation testing, the extreme observations, and the loading and score plots.


Diagnostic metrics and permutation testing

The diagnostics are similar to those of PLS (see above).

Note: OPLS improves the interpretation of the components but not the overall predictive performance of the model.


Numerical results

The details of the R2X, R2Y, and Q2Y values are stored in the "information.txt" file.


Advanced parameters

• Use the icon to view your last parameters, modify them and start a new computation
• The optimal number of orthogonal components can be estimated
• Note: take care to avoid too many orthogonal components (which would result in overfitting) by thoroughly examining the R2Y and Q2Y values in the "overview" graphic

