Statistical Genomics and Bioinformatics Workshop: Genetic Association and RNA-Seq Studies


Genomic Clustering and Signature Development
Brooke L. Fridley, PhD, University of Kansas Medical Center
August 16, 2013

Genomic Clustering


Clustering Basics
• Clustering is the process of grouping a set of physical or abstract objects into classes of similar objects.
  – Also called unsupervised learning.
• Exploratory tool: use these methods for visualization, hypothesis generation, and selection of genes for further consideration.
  – These methods should not be used inferentially.

• Hierarchical clustering in particular: we are provided with a picture from which we can draw many (or any) conclusions, so interpret it with care.

Why cluster genes?
• Identify groups of possibly co-regulated genes
• Identify typical temporal or spatial gene expression patterns
• Arrange a set of genes in a linear order that is, we hope, at least not totally meaningless
• Aids in interpretation


Why cluster samples?
• Quality control: detect experimental artifacts / bad hybridizations
• Check whether samples are grouped according to known categories (though this might be better addressed using a supervised approach: statistical tests, classification)
• Identify new “classes” of biological samples
  – Tumor subtypes
  – Disease heterogeneity

Human breast tumors cluster into six distinct molecular subtypes with differences in patient survival.

Hallett et al. (2012) A gene signature for predicting outcome in patients with basal-like breast cancer. Scientific Reports.


Clustering vs Classification
• Clustering is ‘unsupervised’:
  – We do not use any information about which class the samples belong to (e.g., disease status, cancer type) to determine the cluster structure
  – Hierarchical, K-means, PCA, SOM, model-based clustering
  – Clustering finds groups in the data
• Classification methods are ‘supervised’:
  – Identify to which of a set of groups/categories a new observation belongs, on the basis of a training set of observations whose category membership is known
  – Discriminant analysis, PAM (shrunken centroids), Random Forests/CART, K-nearest neighbors
  – Classification methods find ‘classifiers’

Cluster analysis
Generally, cluster analysis is based on two components:
1. Distance measure: quantification of the dissimilarity / similarity of objects.
2. Clustering algorithm: a procedure to group objects.
Aim: small within-cluster distances, large between-cluster distances.


Distance and Similarity
• Every clustering method is based solely on a measure of distance or similarity.
  – The clustering is only as good as the distance matrix (see the sketch below).
• Generally, not enough thought and time is spent on choosing and estimating the distance/similarity matrix.
  – Applying correlation to highly skewed data will give misleading results.
  – Applying Euclidean distance to data measured on a categorical scale is invalid.
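To illustrate how the choice of measure changes the distance matrix, the sketch below computes Euclidean and correlation-based distances for a small simulated expression matrix (scipy and numpy assumed; the data are made up).

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

# Simulated expression matrix: 5 genes (rows) x 10 samples (columns)
rng = np.random.default_rng(0)
expr = rng.normal(size=(5, 10))

# Euclidean distance between genes (sensitive to overall magnitude)
d_euc = squareform(pdist(expr, metric="euclidean"))

# Correlation distance = 1 - Pearson correlation (sensitive to shape, not level)
d_cor = squareform(pdist(expr, metric="correlation"))

print(np.round(d_euc, 2))
print(np.round(d_cor, 2))
```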

Hierarchical Clustering
• The most over-used statistical method in gene expression analysis
• Tends to be pretty unstable
  – Many different ways to perform hierarchical clustering
  – Sensitive to small changes in the data
• Produces clusters of every size; where to “cut” the dendrogram is user-determined

Huse, Kwon, et al. (2010) Parallel evolution in Pseudomonas aeruginosa over 39,000 generations in vivo. mBio 1(4).


Distances between clusters used for hierarchical clustering
• The distance between two clusters is based on the pairwise distances between members of the clusters (see the linkage sketch below):
  – Complete linkage: largest pairwise distance
  – Average linkage: average pairwise distance
  – Single linkage: smallest pairwise distance
• Complete linkage gives preference to compact / spherical clusters.
• Single linkage can produce long, stretched clusters.
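A minimal sketch of agglomerative clustering under the three linkage rules above, including a user-chosen "cut" of the tree, using scipy on simulated data.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import pdist

rng = np.random.default_rng(1)
X = rng.normal(size=(20, 50))           # 20 samples x 50 features

d = pdist(X, metric="euclidean")        # condensed pairwise distance matrix

for method in ("complete", "average", "single"):
    Z = linkage(d, method=method)       # agglomerative clustering with the chosen linkage
    labels = fcluster(Z, t=3, criterion="maxclust")  # "cut" the tree into 3 clusters
    print(method, labels)
```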

Hierarchical Clustering Example: Visualizing DNA Methylation Data

[Figure: heatmap of CpG loci (rows, unmethylated to methylated) by tissue samples (columns), with a sample dendrogram from unsupervised hierarchical clustering based on Manhattan distance and average linkage.]

Christensen et al. PLoS Genetics, 2009


K-Means
• Intuitive and very easy to implement
• Requires pre-specification of the number of clusters K
  – K is typically unknown for most practical purposes
  – Misspecification may lead to poor results
• Choice of a distance measure may be difficult to justify
• Clusters are expected to be of similar size
• May not work well for irregular clusters
  – Clusters are based on the first moment (i.e., the mean) only


[Figure: K-means clustering illustration, a scatterplot of samples plotted by Gene A vs Gene B expression.]
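To make the mechanics concrete, here is a minimal usage sketch with scikit-learn's KMeans on simulated two-feature data; the library and parameter choices are illustrative, not something the slides prescribe.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(2)
X = rng.normal(size=(100, 2))                 # 100 samples, 2 features (e.g., two genes)

km = KMeans(n_clusters=2, n_init=10, random_state=0)  # K must be pre-specified
labels = km.fit_predict(X)                    # cluster label for each sample

print(km.cluster_centers_)                    # final cluster means
print(labels[:10])
```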


• Need to first pre-specify the number of clusters
• For simplicity, assume K = 2
• Step 1: Initialize the means for the two clusters

[Figure: Gene A vs Gene B scatterplot with initial cluster means m1(1) and m2(1).]

• Compute the distance of each of the points to each of the two means
• Assign points to the cluster with the closest mean

[Figure: Gene A vs Gene B scatterplot showing the points and the two means m1(1) and m2(1).]


• Compute the distance of each of the points to each of the two means
• Assign points to the cluster with the closest mean

[Figure: points now grouped by their assignment to the nearest of m1(1) and m2(1).]

• Re-compute the means based on the observations within each cluster (i.e., m1(2) and m2(2))
• Compute the distance of each of the points to each of the two means
• Assign points to the cluster with the closest mean

[Figure: updated means m1(2) and m2(2) on the Gene A vs Gene B scatterplot.]


• Re-compute the means based on the observations within each cluster (i.e., m1(2) and m2(2))
• Compute the distance of each of the points to each of the two means
• Assign points to the cluster with the closest mean

[Figure: re-assignment of points to the updated means m1(2) and m2(2).]

• After 5 iterations, we converge on our final solution!
• The solution consists of class labels for each of the n observations

[Figure: final clusters with converged means m1(5) and m2(5) on the Gene A vs Gene B scatterplot.]
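The initialize / assign / update loop illustrated in the preceding slides can be written directly; below is a minimal numpy sketch on simulated two-gene data (variable names and the convergence check are illustrative assumptions).

```python
import numpy as np

rng = np.random.default_rng(3)
X = np.vstack([rng.normal(0, 1, (50, 2)),        # simulated "Gene A"/"Gene B" values
               rng.normal(4, 1, (50, 2))])
K = 2

means = X[rng.choice(len(X), K, replace=False)]  # step 1: initialize the K means
for _ in range(100):
    # assign each point to the cluster with the closest mean
    d = np.linalg.norm(X[:, None, :] - means[None, :, :], axis=2)
    labels = d.argmin(axis=1)
    # re-compute each mean from the observations assigned to it
    new_means = np.array([X[labels == k].mean(axis=0) for k in range(K)])
    if np.allclose(new_means, means):            # converged: means no longer change
        break
    means = new_means

print(means)        # final cluster means
print(labels[:10])  # class labels for the first few observations
```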


Cautions about Clustering
• Clustering can be a useful exploratory tool
• Cluster results are very sensitive to noise in the data
• Need to assess cluster structure and the stability of results (one simple check is sketched below)
• Different clustering approaches can give quite different results
  – Methods
  – Number of clusters
  – Distance measures
• For hierarchical clustering, interpretation is almost always subjective
• Clustering does not tell us anything about which features should be used for clustering
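As one example of the kind of structure check mentioned above, the sketch below computes average silhouette widths for several candidate values of K with scikit-learn (simulated data; this is only one of many possible checks).

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(4)
X = rng.normal(size=(100, 20))

# Compare average silhouette width across candidate numbers of clusters
for k in range(2, 6):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    print(k, round(silhouette_score(X, labels), 3))
```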

Genomic Classification


Learning set

[Figure: learning set of breast tumor arrays labeled bad prognosis (recurrence < 5 yrs) or good prognosis (metastasis-free at 5 yrs); the objects are arrays, the feature vectors are gene expression profiles, and a classification rule built from the learning set is used to predict the class of a new array.]

Reference: van't Veer et al. (2002) Gene expression profiling predicts clinical outcome of breast cancer. Nature, Jan.

Decision tree classifiers (CART)

[Figure: a simple decision tree that splits on the expression of Gene 1 (e.g., at a cutpoint of 0.18) and assigns each branch to a predicted class.]

• Transparent rules; easy to interpret and implement (see the sketch below)
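A minimal sketch of fitting and printing a CART-style tree with scikit-learn (simulated data; the gene names, cutpoint, and tree depth are illustrative, not the values from the slide's figure).

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier, export_text

rng = np.random.default_rng(5)
X = rng.normal(size=(200, 5))                    # 200 samples x 5 genes
y = (X[:, 0] > 0.18).astype(int)                 # class driven by "gene 1" for illustration

tree = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X, y)

# Print the fitted splitting rules, which is what makes trees easy to interpret
print(export_text(tree, feature_names=[f"gene_{i+1}" for i in range(5)]))
```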


Ensemble classifiers

[Figure: the training set X1, X2, ..., X100 is resampled many times (Resample 1, ..., Resample 500); a classifier is built on each resample (Classifier 1, ..., Classifier 500), and the individual classifiers are combined into a single aggregate classifier.]

Examples: bagging, boosting, random forests (sketched below)
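A sketch of the resample-and-aggregate idea with scikit-learn, using bagged trees and a random forest on simulated data; the 500 resamples echo the diagram, and all other settings are illustrative.

```python
import numpy as np
from sklearn.ensemble import BaggingClassifier, RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(6)
X = rng.normal(size=(100, 20))
y = (X[:, :3].sum(axis=1) > 0).astype(int)

# Bagging: 500 bootstrap resamples, one tree per resample, majority-vote aggregation
bag = BaggingClassifier(DecisionTreeClassifier(), n_estimators=500, random_state=0).fit(X, y)

# Random forest: bagging plus random feature selection at each split
rf = RandomForestClassifier(n_estimators=500, random_state=0).fit(X, y)

print(bag.predict(X[:5]), rf.predict(X[:5]))
```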

Feature Selection
• Can lead to better classification performance by removing variables that are noise with respect to the outcome
• May provide useful biological insights
• Can eventually lead to diagnostic tests
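One simple univariate filter, sketched with scikit-learn's SelectKBest; this illustrates the general idea of removing noise variables rather than a specific method recommended in the slides.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif

rng = np.random.default_rng(7)
X = rng.normal(size=(80, 1000))                 # 80 samples x 1000 genes
y = rng.integers(0, 2, size=80)                 # binary outcome
X[y == 1, :10] += 1.0                           # make the first 10 genes informative

# Keep the 10 genes with the strongest univariate association (F-test) with the outcome
selector = SelectKBest(score_func=f_classif, k=10).fit(X, y)
print(np.flatnonzero(selector.get_support()))   # indices of the selected genes
```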


Classifier Performance Assessment
• Any classification rule needs to be evaluated for its performance on future samples.
• One needs to estimate future performance based on the data that are available.
• Approaches for assessing classifier performance (a cross-validation sketch follows):
  – Cross-validation
    • V-fold CV
    • Leave-one-out cross-validation (LOOCV)
  – Training vs testing set
  – Testing on an independent dataset
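A sketch of V-fold and leave-one-out cross-validation with scikit-learn on simulated data (the classifier and accuracy metric here are arbitrary placeholders).

```python
import numpy as np
from sklearn.model_selection import cross_val_score, KFold, LeaveOneOut
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(8)
X = rng.normal(size=(60, 30))
y = rng.integers(0, 2, size=60)

clf = KNeighborsClassifier(n_neighbors=3)

# 5-fold cross-validation: each fold is held out once for testing
cv5 = cross_val_score(clf, X, y, cv=KFold(n_splits=5, shuffle=True, random_state=0))
print("5-fold CV accuracy:", cv5.mean().round(3))

# Leave-one-out cross-validation: n folds of size 1
loo = cross_val_score(clf, X, y, cv=LeaveOneOut())
print("LOOCV accuracy:", loo.mean().round(3))
```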

Diagram of performance assessment

[Figure: three ways to estimate classifier performance. (1) Resubstitution estimation: the classifier is built and evaluated on the same training set. (2) Cross-validation: the learning set is repeatedly split into a (CV) training set used to build the classifier and a (CV) test set used to assess it. (3) Test set estimation: the classifier built on the training set is evaluated on an independent test set.]


Pharmacogenomic (PGx) Classifiers
Benefits:
• Enables patients to be treated with drugs that actually work for them
• Avoids false-negative trials in heterogeneous populations
• Avoids erroneous generalization of conclusions from positive trials

1. Develop a PGx classifier to determine which patients are likely to benefit from a new treatment (TRT)
2. Establish the reproducibility of the PGx classifier
3. Use the PGx classifier to design and analyze a new clinical trial to evaluate the effectiveness of the TRT in the overall population or in pre-defined subsets determined by the classifier

Coming Full Circle: Integrating Many Methods & Data Types


Molecular Phenotype Based GWAS

[Figure: all cases are divided, based on clustering, into Molecular Case Subtype A and Molecular Case Subtype B; a GWAS is run within each subtype, yielding novel subtype-specific loci.]

• Molecular subtype GWAS:
  – For risk, with existing controls
  – For clinical outcome
  – For quantitative traits

Example: Integrative Analysis with Multiple Methods and Multiple Data Types
1. Association analysis to determine epigenetic features associated with clinical outcome (time to recurrence, TTR)
2. Model-based clustering (semi-supervised) to determine clinically relevant methylation-based subtypes
3. Nearest shrunken centroid (PAM) analysis (supervised) to determine genes whose mRNA levels differ between the methylation subgroups
4. Pathway analysis of the resulting differentially expressed genes to determine enriched pathways


Application to Ovarian Cancer
337 high-grade serous (HGS) ovarian cancer tumors, randomly split into:
• Training set: N = 168, recurrences = 110
• Testing set: N = 169, recurrences = 116

• Restricted to high-grade serous (HGS) histology
• Pre-chemotherapy tumor samples
• Illumina 450K methylation array
• Similar stage and recurrence status between the testing and training data sets

Wang, Cicek, ..., Fridley, Goode (2013). Tumor hypo-methylation at 6p21.3 associates with longer time to recurrence of high-grade serous epithelial ovarian cancer. Under review.

[Figure: four-step analysis workflow (Steps 1-4), starting from the epigenetic (methylation) data.]


[Figure: analysis workflow of the semi-supervised clustering (SS-RPMM) used in this study. The 337 high-grade serous tumors are randomly split into a training set (168) and a testing set (169). The training set is used for (1) loci ranking, (2) cross-validation, and (3) clustering and signature generation; the optimal number of CpG loci was 60. The resulting signature is then applied to the testing set, defining clusters rL (worse outcome) and rR (better outcome).]

All HGS samples in the testing set (169 samples / 104 events); L: worse outcome, R: better outcome; p-value = 5.3E-4, HR = 0.5.

[Figure: Kaplan-Meier plot of the association between group (R or L) and recurrence time for all samples in the testing set (n = 169).]
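A sketch of how a Kaplan-Meier comparison between two clusters could be produced in Python, assuming the lifelines package and simulated recurrence data; this is not the software used in the original analysis.

```python
import numpy as np
from lifelines import KaplanMeierFitter
from lifelines.statistics import logrank_test

rng = np.random.default_rng(9)
n = 169
group = rng.choice(["L", "R"], size=n)                      # cluster labels
time = rng.exponential(np.where(group == "R", 40, 20))      # simulated recurrence times
event = rng.random(n) < 0.6                                 # True = recurrence observed

kmf = KaplanMeierFitter()
for g in ("L", "R"):
    mask = group == g
    kmf.fit(time[mask], event_observed=event[mask], label=g)
    kmf.plot_survival_function()                            # Kaplan-Meier curve per group

# Log-rank test comparing recurrence time between the two groups
res = logrank_test(time[group == "L"], time[group == "R"],
                   event_observed_A=event[group == "L"],
                   event_observed_B=event[group == "R"])
print(res.p_value)
```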


HGS testing samples treated with platinum and taxane (130 samples / 90 events); L: worse outcome, R: better outcome; p-value = 1.20E-5, HR = 0.39.

[Figure: Kaplan-Meier plot of the association between group (R or L) and recurrence time for testing-set samples that received platinum and taxane treatment (n = 130).]

Nearest Shrunken Centroid Method (PAM Analysis)
• Goal: in a sample with K different classes and p variables, which variables contribute to the separation of these classes?
• PAM shrinks each class centroid towards the overall centroid; the shrinkage factor is determined by cross-validation (see the sketch below).
• The shrinkage de-noises large effects while setting small ones to zero (i.e., selecting key genes).
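A sketch of the shrunken-centroid idea using scikit-learn's NearestCentroid and its shrink_threshold option, which is related to but not identical to the PAM software; the data and threshold are illustrative, and in practice the threshold would be chosen by cross-validation.

```python
import numpy as np
from sklearn.neighbors import NearestCentroid

rng = np.random.default_rng(10)
X = rng.normal(size=(104, 500))                  # 104 samples x 500 genes
y = np.array([0] * 48 + [1] * 56)                # two subgroups (e.g., rL vs rR)
X[y == 1, :20] += 1.5                            # 20 genes truly differ between groups

# Shrink each class centroid toward the overall centroid; genes whose class
# centroids collapse onto the overall centroid no longer influence classification.
clf = NearestCentroid(shrink_threshold=1.0).fit(X, y)
print(clf.predict(X[:5]))
```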


Ovarian Cancer Study
• 104 of the HGS cases have Agilent gene expression data
  – rL (poor outcome): 48
  – rR (better outcome): 56


Gene expression heatmap of PAM-selected genes

[Figure: expression heatmaps of the signature genes selected by the PAM analysis using a shrinkage factor of 1.5, chosen to minimize cross-validation error. L: worse outcome, R: better outcome. 921 probes show higher expression in the L group; 1413 probes show higher expression in the R group.]

What are the genes that distinguish the molecular subtypes?
• 712 genes are over-expressed in patients with poor outcome
  – Moderately enriched in signaling pathways, such as Wnt/beta-catenin signaling (p-value = 8.71E-5)
• 958 genes are over-expressed in patients with better outcome
  – Extremely enriched in immune-related pathways, such as the Antigen Presentation Pathway (p-value = 1.6E-32), Crosstalk between Dendritic Cells and Natural Killer Cells (p-value = 2E-24), and Communication between Innate and Adaptive Immune Cells (p-value = 5E-24)
  – This may explain why this group is associated with better outcome (protection from an enhanced immune response)


Questions? Thank you for attending this workshop.
