Statistical Genomics and Bioinformatics Workshop 8/16/2013
Statistical Genomics and Bioinformatics Workshop: Genetic Association and RNA-Seq Studies
Genomic Clustering and Signature Development
Brooke L. Fridley, PhD
University of Kansas Medical Center
Genomic Clustering
Clustering Basics
• Clustering is the process of grouping a set of physical or abstract objects into classes of similar objects.
– It is also called unsupervised learning.
• Exploratory tool: use these methods for visualization, hypothesis generation, and selection of genes for further consideration.
– We should not use these methods inferentially.
• Hierarchical clustering specifically: we are provided with a picture from which we can draw many (or any) conclusions.
Why cluster genes?
• Identify groups of possibly co-regulated genes
• Identify typical temporal or spatial gene expression patterns
• Arrange a set of genes in a linear order that is at least not totally meaningless (we hope)
• Aid interpretation
Why cluster samples?
• Quality control: detect experimental artifacts/bad hybridizations
• Check whether samples are grouped according to known categories (though this might be better addressed using a supervised approach: statistical tests, classification)
• Identify new “classes” of biological samples
– Tumor subtypes
– Disease heterogeneity
Human breast tumors cluster into 6 distinct molecular subtypes of breast cancer with differences in patient survival.
Hallett, et al. (2012) A gene signature for predicting outcome in patients with basal-like breast cancer. Scientific Reports.
Clustering vs Classification
• Clustering is ‘unsupervised’:
– We don’t use any information about what class the samples belong to (e.g. disease status, cancer type) to determine cluster structure
– Hierarchical, K-Means, PCA, SOM, model-based clustering
– Clustering finds groups in the data
• Classification methods are ‘supervised’:
– They identify which of a set of groups/categories a new observation belongs to, on the basis of a training set of observations whose category membership is known
– Discriminant analysis, PAM (shrunken centroids), Random Forests/CART, K-Nearest Neighbor
– Classification methods find ‘classifiers’
Cluster analysis
Generally, cluster analysis is based on two components:
1. Distance measure: quantification of dissimilarity / similarity of objects.
2. Cluster algorithm: a procedure to group objects.
Aim: small within-cluster distances, large between-cluster distances.
Distance and Similarity
• Every clustering method is based solely on the measure of distance or similarity.
– The clustering is only as good as the distance matrix.
• Generally, not enough thought and time is spent on choosing and estimating the distance/similarity matrix.
– Applying correlation to highly skewed data will provide misleading results.
– Applying Euclidean distance to data measured on a categorical scale is invalid.
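To make the dependence on the distance measure concrete, here is a minimal sketch of three common choices (Euclidean, Manhattan, and one minus Pearson correlation). The function names and the toy expression profiles are assumptions made for illustration; they are not part of the workshop material.

```python
import math

def euclidean(x, y):
    """Straight-line distance between two profiles."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def manhattan(x, y):
    """Sum of absolute coordinate differences."""
    return sum(abs(a - b) for a, b in zip(x, y))

def correlation_distance(x, y):
    """One minus the Pearson correlation: 0 for perfectly correlated profiles."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return 1.0 - sxy / math.sqrt(sxx * syy)

def distance_matrix(profiles, dist):
    """Pairwise distance matrix under the chosen measure."""
    return [[dist(p, q) for q in profiles] for p in profiles]

# Hypothetical expression profiles across four samples.
gene_a = [1.0, 2.0, 3.0, 4.0]
gene_b = [10.0, 20.0, 30.0, 40.0]   # same shape as gene_a, larger scale
gene_c = [4.0, 3.0, 2.0, 1.0]       # same scale as gene_a, opposite trend

euclid = distance_matrix([gene_a, gene_b, gene_c], euclidean)
corr = distance_matrix([gene_a, gene_b, gene_c], correlation_distance)
```

In this toy data, Euclidean distance places gene_a closer to gene_c (similar magnitude), while correlation distance places gene_a closer to gene_b (same pattern), so the two measures would lead a clustering algorithm to different groupings.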
Hierarchical Clustering
• The most overused statistical method in gene expression analysis
• Tends to be pretty unstable
– Many different ways to perform hierarchical clustering
– Sensitive to small changes in the data
• Provides clusters of every size
– Where to “cut” the dendrogram is user-determined
Huse, Kwon, et al. (2010) Parallel Evolution in Pseudomonas aeruginosa over 39,000 generations in vivo. mBio 1(4).
Distances between clusters used for hierarchical clustering
• Distance between two clusters is based on the pairwise distances between members of the clusters.
• Complete linkage: largest distance
• Average linkage: average distance
• Single linkage: smallest distance
• Complete linkage gives preference to compact / spherical clusters.
• Single linkage can produce long stretched clusters.
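The three linkage rules can be sketched in a few lines. The code below is an illustrative, unoptimized agglomerative procedure written for this handout; `linkage_distance`, `agglomerate`, and the one-dimensional toy points are assumed names and data, not the implementation behind any figure in these slides.

```python
def linkage_distance(cluster_a, cluster_b, dist, method):
    """Distance between two clusters from the pairwise distances of their members."""
    pairwise = [dist[i][j] for i in cluster_a for j in cluster_b]
    if method == "complete":
        return max(pairwise)                  # largest pairwise distance
    if method == "single":
        return min(pairwise)                  # smallest pairwise distance
    if method == "average":
        return sum(pairwise) / len(pairwise)  # average pairwise distance
    raise ValueError("unknown linkage method: " + method)

def agglomerate(dist, n_clusters, method="average"):
    """Merge the two closest clusters repeatedly until n_clusters remain."""
    clusters = [[i] for i in range(len(dist))]
    while len(clusters) > n_clusters:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = linkage_distance(clusters[a], clusters[b], dist, method)
                if best is None or d < best[0]:
                    best = (d, a, b)
        _, a, b = best
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return [sorted(c) for c in clusters]

# Hypothetical one-dimensional data; the distance is the absolute difference.
points = [0.0, 1.0, 5.0, 7.0]
dist = [[abs(p - q) for q in points] for p in points]
two_clusters = agglomerate(dist, n_clusters=2, method="average")
```

Stopping at `n_clusters` plays the role of cutting the dendrogram at a chosen height; running the loop down to a single cluster and recording each merge would give the full tree.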
Hierarchical Clustering Example: Visualizing DNA Methylation Data
[Figure: heatmap with axes labeled CpG loci and tissue samples; color scale from unmethylated to methylated.]
Unsupervised hierarchical clustering based on Manhattan distance and average linkage.
Christensen et al. PLoS Genetics, 2009
K-Means
• Intuitive and very easy to implement
• Requires pre-specification of the number of clusters K
– K is typically unknown for most practical purposes
– Misspecification may lead to poor results
• Choice of a distance measure may be difficult to justify
• Clusters are expected to be of similar size
• May not work well for irregular clusters
– Clusters are based on the first moment (i.e., mean) only
K-means clustering illustration
[Figure: scatter plot of observations with axes Gene A and Gene B.]
• Need to first pre-specify the number of clusters
• For simplicity, assume k = 2
• Step 1: Initialize the means for the two clusters, $m_1^{(1)}$ and $m_2^{(1)}$
[Figure: scatter plot (Gene A vs. Gene B) showing the initial means $m_1^{(1)}$ and $m_2^{(1)}$.]
• Compute the distance of each of the points to each of the two means
• Assign points to the cluster with the closest mean
[Figure: scatter plot (Gene A vs. Gene B) with each point assigned to the nearer of $m_1^{(1)}$ and $m_2^{(1)}$.]
• Re-compute the means based on the observations within each cluster (i.e., $m_1^{(2)}$ and $m_2^{(2)}$)
• Compute the distance of each of the points to each of the two means
• Assign points to the cluster with the closest mean
[Figure: scatter plot (Gene A vs. Gene B) showing the updated means $m_1^{(2)}$ and $m_2^{(2)}$.]
• After 5 iterations, we converge on our final solution!
• The solution consists of class labels for each of the n observations
[Figure: scatter plot (Gene A vs. Gene B) showing the final means $m_1^{(5)}$ and $m_2^{(5)}$.]
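The K-means iterations walked through above (initialize the means, assign each point to the closest mean, re-compute the means, repeat until the labels stop changing) can be sketched as follows. This is a minimal illustration using squared Euclidean distance; the function names, starting means, and six toy points are assumptions, not the data shown in the slides.

```python
def squared_distance(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def kmeans(points, initial_means, max_iter=100):
    """Alternate assignment and mean updates until the labels stop changing."""
    means = [list(m) for m in initial_means]
    labels = None
    for _ in range(max_iter):
        # Assignment step: each point joins the cluster with the closest mean.
        new_labels = [
            min(range(len(means)), key=lambda k: squared_distance(p, means[k]))
            for p in points
        ]
        if new_labels == labels:
            break  # converged: no point changed cluster
        labels = new_labels
        # Update step: re-compute each mean from its current members.
        for k in range(len(means)):
            members = [p for p, lab in zip(points, labels) if lab == k]
            if members:  # keep the old mean if a cluster is empty
                means[k] = [sum(col) / len(members) for col in zip(*members)]
    return labels, means

# Hypothetical (Gene A, Gene B) expression pairs, with k = 2.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5), (8.0, 8.0), (9.0, 8.5), (8.5, 9.0)]
labels, means = kmeans(points, initial_means=[(0.0, 0.0), (10.0, 10.0)])
```

The result depends on the starting means, which is one reason to run K-means from several initializations and compare the solutions.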
Cautions about Clustering
• Clustering can be a useful exploratory tool
• Cluster results are very sensitive to noise in the data
• Need to assess cluster structure and stability of results
• Different clustering approaches can give quite different results
– Methods
– Number of clusters
– Distance measures
• For hierarchical clustering, interpretation is almost always subjective
• Clustering doesn’t tell us anything about what features should be used for clustering
Genomic Classification
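[Figure: classification schematic. A learning set of arrays (objects) described by gene expression feature vectors, labeled bad prognosis (recurrence, 5 yrs) or good prognosis, is used to build a classification rule; the rule is then applied to a new array to predict its class (good prognosis? metastasis > 5).]
Reference: L van’t Veer et al. (2002) Gene expression profiling predicts clinical outcome of breast cancer. Nature, Jan.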
Decision tree classifiers (CART)
[Figure: example decision tree that splits on Gene 1 at a threshold of 0.18, with yes/no branches leading to class labels.]
Transparent rules that are easy to interpret and implement
Ensemble classifiers
[Figure: the training set (X1, X2, … X100) is resampled 500 times (Resample 1 … Resample 500); a classifier is built on each resample (Classifier 1 … Classifier 500); the classifiers are combined into an aggregate classifier.]
Examples: Bagging, Boosting, Random Forest
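As a rough illustration of the resample-then-aggregate idea, the sketch below bags one-gene decision stumps: each bootstrap resample of the training set gets its own classifier, and the aggregate classifier takes a majority vote. The stump learner, the function names, the number of resamples, and the toy data are all assumptions chosen for brevity; this is not a Random Forest implementation.

```python
import random
from collections import Counter

def fit_stump(xs, ys):
    """Choose the threshold and orientation with the fewest training errors."""
    best = None
    for t in sorted(set(xs)):
        for low_label, high_label in ((0, 1), (1, 0)):
            errors = sum(
                (low_label if x <= t else high_label) != y for x, y in zip(xs, ys)
            )
            if best is None or errors < best[0]:
                best = (errors, t, low_label, high_label)
    _, t, low_label, high_label = best
    return lambda x: low_label if x <= t else high_label

def bagged_classifier(xs, ys, n_resamples=25, seed=0):
    """Fit one stump per bootstrap resample and aggregate by majority vote."""
    rng = random.Random(seed)
    n = len(xs)
    stumps = []
    for _ in range(n_resamples):
        idx = [rng.randrange(n) for _ in range(n)]  # sample with replacement
        stumps.append(fit_stump([xs[i] for i in idx], [ys[i] for i in idx]))

    def predict(x):
        votes = Counter(stump(x) for stump in stumps)
        return votes.most_common(1)[0][0]

    return predict

# Hypothetical expression values for one gene and binary class labels.
expression = [0.1, 0.2, 0.3, 0.4, 1.1, 1.2, 1.3, 1.4]
classes = [0, 0, 0, 0, 1, 1, 1, 1]
predict = bagged_classifier(expression, classes)
```

An odd number of resamples avoids tied votes in the two-class case.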
Feature Selection
• Can lead to better classification performance by removing variables that are noise with respect to the outcome
• May provide useful biological insights
• Can eventually lead to diagnostic tests
Classifier Performance Assessment
• Any classification rule needs to be evaluated for its performance on future samples.
• One needs to estimate future performance based on what is available.
• Assess the performance of the classifier based on:
– Cross-validation
  • V-fold CV
  • Leave-one-out cross-validation (LOOCV)
– Training vs. testing set
– Testing on an independent dataset
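A minimal sketch of V-fold cross-validation follows, with LOOCV as the special case where the number of folds equals the number of samples. A nearest-centroid classifier stands in for whatever classification rule is being assessed; the function names, the deterministic fold assignment (sample i goes to fold i mod V), and the toy data are assumptions made for illustration.

```python
def nearest_centroid_fit(X, y):
    """Compute one centroid (per-feature mean) for each class."""
    centroids = {}
    for label in set(y):
        rows = [x for x, lab in zip(X, y) if lab == label]
        centroids[label] = [sum(col) / len(rows) for col in zip(*rows)]
    return centroids

def nearest_centroid_predict(centroids, x):
    """Predict the class whose centroid is closest in squared Euclidean distance."""
    return min(
        centroids,
        key=lambda lab: sum((a - b) ** 2 for a, b in zip(x, centroids[lab])),
    )

def cross_validated_error(X, y, n_folds):
    """Estimate the misclassification rate by V-fold cross-validation."""
    n = len(X)
    errors = 0
    for fold in range(n_folds):
        test_idx = [i for i in range(n) if i % n_folds == fold]
        train_idx = [i for i in range(n) if i % n_folds != fold]
        # The classifier is rebuilt without the held-out fold each time.
        model = nearest_centroid_fit(
            [X[i] for i in train_idx], [y[i] for i in train_idx]
        )
        errors += sum(nearest_centroid_predict(model, X[i]) != y[i] for i in test_idx)
    return errors / n

# Hypothetical two-gene expression data with known prognosis labels.
X = [[0.0, 0.1], [0.2, 0.0], [0.1, 0.3], [0.3, 0.2],
     [2.0, 2.1], [2.2, 1.9], [1.9, 2.3], [2.1, 2.0]]
y = ["good", "good", "good", "good", "bad", "bad", "bad", "bad"]

four_fold_error = cross_validated_error(X, y, n_folds=4)
loocv_error = cross_validated_error(X, y, n_folds=len(X))
```

In practice the fold assignment is usually randomized, and any feature selection has to be repeated inside each fold so that the held-out samples never influence the classifier being tested.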
Diagram of performance assessment
[Figure: three routes to performance assessment of a classifier. Resubstitution estimation: the classifier is built and evaluated on the same training set. Cross-validation: the learning set is split into a (CV) training set and a (CV) test set. Test set estimation: the classifier is built on the training set and evaluated on an independent test set.]
Pharmacogenomic (PGx) Classifiers
Benefits:
• Enables patients to be treated with drugs that actually work for them
• Avoids false-negative trials in heterogeneous populations
• Avoids erroneous generalization of conclusions from positive trials
Steps:
1. Develop a PGx classifier to identify patients likely to benefit from a new treatment (TRT)
2. Establish reproducibility of the PGx classifier
3. Use the PGx classifier to design and analyze a new clinical trial to evaluate effectiveness of TRT in the overall population or in pre-defined subsets determined by the classifier
Coming Full Circle: Integrating Many Methods & Data Types
Molecular Phenotype Based GWAS
[Figure: all cases are divided, based on clustering, into molecular case subtypes A and B; a GWAS is run for each subtype to identify novel subtype-specific loci.]
• Molecular Subtype GWAS:
– For risk with existing controls
– For clinical outcome
– For quantitative trait
Example: Integrative Analysis
Multiple Methods and Multiple Data Types
1. Association analysis to determine epigenetic features associated with clinical outcome (time to recurrence, TTR)
2. Model-based clustering (semi-supervised) to determine clinically relevant methylation-based subtypes
3. Nearest shrunken centroid (PAM) (supervised) analysis to determine genes whose mRNA levels differ between methylation subgroups
4. Pathway analysis of the resulting differentially expressed genes to determine enriched pathways
Application to Ovarian Cancer 337 HGS Ovarian Cancer Tumors
Training set N=168 Recurrences = 110
Testing set N=169 Recurrences = 116
• Restricted to High Grade Serous (HGS) histology • Pre-chemo tumor sample • 450K Illumina Methylation Array • Similar stage and recurrence status between testing and training data sets
Wang, Cicek, … ,Fridley, Goode (2013). Tumor Hypo-Methylation at 6p21.3 Associates with Longer Time to Recurrence of High-Grade Serous Epithelial Ovarian Cancer. Under review.
[Figure: four-step workflow (Step 1 to Step 4), starting from epigenetics.]
[Figure: 337 high-grade serous tumors are randomly split into a training set (168) and a testing set (169). SS-RPMM is trained on the training set in three steps: 1. loci ranking, 2. cross-validation, 3. clustering and signature generation. Optimal number of CpG loci = 60. The signature is then applied to the testing set. rL: worse outcome; rR: better outcome.]
Analysis workflow of semi-supervised clustering used in this study.
All the HGS samples in testing set 169 samples / 104 events L: worse outcome R: better outcome
p-value=5.3E-4, HR=0.5
Kaplan-Meier plot of time to recurrence by group (R or L) for all samples in the testing set (n=169).
HGS testing samples with platinum and taxane treatment 130 samples/ 90 events L: worse outcome R: better outcome
p-value=1.20E-5, HR=0.39
Kaplan-Meier plot of time to recurrence by group (R or L) for the testing-set samples that received platinum and taxane treatment (n=130).
Nearest Shrunken Centroid Method (PAM Analysis)
• Goal: in a sample with K different classes and p variables, which variables contribute to the separation of these classes?
• PAM shrinks each class centroid towards the overall centroid. The shrinkage factor is determined by CV.
• The shrinkage de-noises large effects while setting small ones to zero (i.e., selection of key genes)
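The shrinkage step can be illustrated with soft thresholding. The sketch below is a deliberate simplification: it shrinks the raw difference between each class centroid and the overall centroid by a fixed amount `delta`, whereas the published method first standardizes each difference by a within-class standard deviation and a class-size factor, and picks the threshold by cross-validation. The function names, `delta = 0.5`, and the toy data are assumptions for illustration only.

```python
def soft_threshold(d, delta):
    """Move d toward zero by delta; differences smaller than delta become zero."""
    magnitude = abs(d) - delta
    if magnitude <= 0:
        return 0.0
    return magnitude if d > 0 else -magnitude

def shrunken_centroids(X, y, delta):
    """Shrink each class centroid toward the overall centroid (simplified)."""
    n_genes = len(X[0])
    overall = [sum(row[g] for row in X) / len(X) for g in range(n_genes)]
    shrunken = {}
    for label in sorted(set(y)):
        rows = [row for row, lab in zip(X, y) if lab == label]
        centroid = [sum(r[g] for r in rows) / len(rows) for g in range(n_genes)]
        shrunken[label] = [
            overall[g] + soft_threshold(centroid[g] - overall[g], delta)
            for g in range(n_genes)
        ]
    return overall, shrunken

def selected_genes(overall, shrunken):
    """Genes whose shrunken centroid still differs from the overall centroid."""
    return [
        g for g in range(len(overall))
        if any(c[g] != overall[g] for c in shrunken.values())
    ]

# Hypothetical data: rows are samples, columns are three genes.
X = [[5.0, 1.0, 2.0], [5.2, 1.1, 2.1], [1.0, 1.2, 1.9], [0.8, 0.9, 2.0]]
y = ["L", "L", "R", "R"]

overall, shrunken = shrunken_centroids(X, y, delta=0.5)
kept = selected_genes(overall, shrunken)
```

Only the first gene separates the two groups by more than `delta`, so it is the only one that survives the shrinkage; the other two collapse onto the overall centroid and drop out of the classifier.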
Ovarian Cancer Study
• 104 of the HGS cases have Agilent gene expression data
– rL (poor outcome): 48
– rR (better outcome): 56
Gene expression heatmap of PAM selected genes
L: worse outcome R: better outcome
921 probes with higher expression in L group
1413 probes with higher expression in R group
Expression heatmap of signature genes selected by PAM analysis using a shrinkage factor of 1.5, chosen to minimize cross-validation error.
What are the genes that distinguish between the molecular subtypes?
• 712 genes are overexpressed in patients with poor outcome
– Moderately enriched in signaling pathways, such as Wnt/beta-catenin Signaling (p-value=8.71E-5)
• 958 genes are overexpressed in patients with better outcome
– Extremely enriched in immune-related pathways, such as Antigen Presentation Pathway (p-value=1.6E-32), Crosstalk between Dendritic Cells and Natural Killer Cells (p-value=2E-24), and Communication between Innate and Adaptive Immune Cells (p-value=5E-24)
– Might explain why this group is associated with better outcome (possibly through the protection of a boosted immune response)
Questions? Thank You for Attending this Workshop.