Define the common question
The methods
Supervised analysis of two groups
– How
– Multiple comparisons correction
– Primary vs. metastatic tumor cells (an example)
– Caveats
Unsupervised analysis
– What is clustering
– How we cluster
– Sample clustering (an example)
– Gene clustering (an example)
– Coupling the two clusterings
Typical Experiments
– One-color experiment
– Two-color experiment
[Figure: condition and control samples, followed by RNA extraction and RNA labeling.]
DNA Array Technologies (A) Affymetrix
(B) “Spotting”
Expression Data Matrix (a result of quantification)
[Figure: an experiment set shown as a gene expression matrix of genes by samples. The cells hold gene expression levels; gene annotations accompany the genes and sample annotations accompany the samples.]
Analysis flow: Linearization → Normalization → Filtering → Data analysis
Applications of microarrays
Evolution – Most gene expression differences between chimpanzees and humans have been detected in the brain.
Development – Associating gene expression with metamorphosis stages in Drosophila.
Regulation – Finding novel regulatory motifs by coupling motif search with co-expression.
Behavior – The behavior of honey-bee workers can be predicted from their brain gene expression.
Functional annotation – Annotating unknown genes based on co-expression (guilt by association).
Tissue – Molecular signatures specific to subtypes of cancer tissue.
Biological questions
Sample classification
– What is the set of genes that differentiates between two or more groups of treatments? (Supervised methods)
– What is the set of samples that have the same expression profile in the detected cell(s)? (Unsupervised methods)
Gene classification
– What is the set of genes that have the same expression profile along a set of treatments? (Unsupervised methods)
Data analysis methods
Supervised methods
– Analysis of variance (ANOVA/t-test)
– Discriminant analysis
– K-nearest neighbors
Unsupervised methods
– Partition methods: K-means, SOM (Self-Organizing Maps)
– Hierarchical methods
Supervised classification
Classifying normal and cancer groups of patients
[Figure: expression profile of a gene. x-axis: patients N1–N10 (Normal) and ALL1–ALL10 (Cancer); y-axis: expression intensity, 50–110. A second panel is labeled with Genes and Samples.]
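The two-group comparison pictured here, one gene measured in a normal group and a cancer group, can be sketched as a per-gene two-sample t statistic. This is a minimal illustration, not the lecture's own code: it assumes Welch's unequal-variance form of the test, and the gene names and intensities are invented for the example. Only the statistic is computed; converting it to a p-value requires a t distribution, which the Python standard library does not provide.

```python
from math import sqrt
from statistics import mean, variance


def welch_t(group_a, group_b):
    """Welch's two-sample t statistic (unequal variances allowed)."""
    standard_error = sqrt(variance(group_a) / len(group_a)
                          + variance(group_b) / len(group_b))
    return (mean(group_a) - mean(group_b)) / standard_error


# Hypothetical expression intensities: gene -> (normal samples, cancer samples).
expression = {
    "gene_up_in_cancer": ([60, 62, 58, 61, 59], [100, 98, 102, 101, 99]),
    "gene_unchanged": ([80, 82, 79, 81, 78], [81, 79, 80, 82, 78]),
}

# One statistic per gene: each gene is tested separately.
t_scores = {gene: welch_t(normal, cancer)
            for gene, (normal, cancer) in expression.items()}

# Rank genes by how strongly they separate the two groups.
ranked = sorted(t_scores, key=lambda gene: abs(t_scores[gene]), reverse=True)
for gene in ranked:
    print(f"{gene}: t = {t_scores[gene]:.2f}")
```

Ranking by the absolute value keeps genes that are higher in either group near the top of the list.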
Multiple comparison correction
Testing 10,000 null hypotheses at a p-value threshold of 0.05: about 500 tests (10,000 × 0.05) are expected to be falsely significant even when no gene truly differs.
[Figure: random matrix with 10,000 rows; columns split into Group A and Group B.]
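The expected count of about 500 false positives can be checked with a small simulation on a random matrix. The sketch below rests on assumptions that are not in the slides: 10 samples per group, standard normal noise, and a z-test with known unit variance standing in for the t-test so that only the Python standard library is needed.

```python
import random
from math import sqrt
from statistics import NormalDist

random.seed(0)  # fixed seed so the run is repeatable

N_GENES = 10_000     # rows of the random matrix
N_PER_GROUP = 10     # columns in Group A and in Group B (assumed)
ALPHA = 0.05         # p-value threshold

standard_normal = NormalDist()


def z_test_p_value(group_a, group_b):
    """Two-sided p-value for a difference in means, variance known to be 1."""
    mean_a = sum(group_a) / len(group_a)
    mean_b = sum(group_b) / len(group_b)
    z = (mean_a - mean_b) / sqrt(1 / len(group_a) + 1 / len(group_b))
    return 2 * (1 - standard_normal.cdf(abs(z)))


false_positives = 0
for _ in range(N_GENES):
    # Both groups are drawn from the same distribution, so every hit is false.
    group_a = [random.gauss(0, 1) for _ in range(N_PER_GROUP)]
    group_b = [random.gauss(0, 1) for _ in range(N_PER_GROUP)]
    if z_test_p_value(group_a, group_b) < ALPHA:
        false_positives += 1

print(f"expected about {N_GENES * ALPHA:.0f}, observed {false_positives}")
```

The observed count varies from run to run around the expected value, which is why an uncorrected threshold cannot be trusted on its own for thousands of genes.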
Multiple comparison correction methods
Family-wise error rate (FWER) – Adjust the significance threshold (type I error) to control the probability of making at least one false positive across all tests.
False Discovery Rate (FDR) – Adjust the significance threshold to control the expected proportion of false positives among the tests called significant.
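The slide names the two families of correction but not a specific procedure. As an illustration only, the sketch below uses the Bonferroni correction as a representative FWER method and the Benjamini-Hochberg step-up procedure as a representative FDR method; both choices, and the example p-values, are assumptions made for this example.

```python
def bonferroni(p_values, alpha=0.05):
    """FWER control: compare every p-value with alpha divided by the test count."""
    threshold = alpha / len(p_values)
    return [p <= threshold for p in p_values]


def benjamini_hochberg(p_values, alpha=0.05):
    """FDR control: reject the k smallest p-values, where k is the largest
    rank whose p-value satisfies p_(k) <= (k / m) * alpha."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    cutoff_rank = 0
    for rank, index in enumerate(order, start=1):
        if p_values[index] <= rank / m * alpha:
            cutoff_rank = rank
    rejected = [False] * m
    for index in order[:cutoff_rank]:
        rejected[index] = True
    return rejected


# Hypothetical p-values from ten tests.
p_values = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205, 0.212, 0.216]

print("uncorrected:", sum(p <= 0.05 for p in p_values))
print("Bonferroni: ", sum(bonferroni(p_values)))
print("BH (FDR):   ", sum(benjamini_hochberg(p_values)))
```

On the same input, the FDR procedure typically rejects at least as many hypotheses as the FWER procedure, which is the usual reason for preferring it in exploratory screens.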
Classification of primary and metastatic tumors by t-test
[Figure: expression heat map; color code from low to high.]
Ramaswamy S, Ross KN, Lander ES & Golub TR. Nat Genet. 2003 Jan;33(1):49-54.
Note About Supervision
[Figure: expression profile of a gene, with one sample marked as an outlier. x-axis: patients N1–N10 (Normal) and ALL1–ALL10 (Cancer); y-axis: expression intensity, 50–110.]
Note About Supervision
[Figure: expression heat map with color code.]
Ramaswamy S, Ross KN, Lander ES & Golub TR. Nat Genet. 2003 Jan;33(1):49-54.
Constructing clustering (an example)
Method: agglomerative clustering
Steps:
• Compare all pairwise distances
• Define the relationships among samples
Data: six RNA samples from three tissues, each in duplicate
Distance matrix
– Single linkage = minimum distance between members of two clusters
– Average linkage = average distance between members of two clusters
– Complete linkage = maximum distance between members of two clusters
Create a tissue dendrogram
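The steps above can be sketched as a small agglomerative clustering routine that supports the three linkage rules. This is a minimal illustration under stated assumptions: Euclidean distance is assumed as the distance measure, and the tissue names and three-gene expression profiles are invented. The merge history it returns is the information a dendrogram displays.

```python
from itertools import combinations
from math import dist

# How to summarize the pairwise distances between two clusters.
LINKAGE = {
    "single": min,
    "complete": max,
    "average": lambda distances: sum(distances) / len(distances),
}


def agglomerate(profiles, linkage="average"):
    """Repeatedly merge the two closest clusters.

    Returns the merge history as (cluster_a, cluster_b, distance) tuples.
    """
    summarize = LINKAGE[linkage]
    clusters = [(name,) for name in profiles]  # start: one cluster per sample
    merges = []
    while len(clusters) > 1:
        best = None
        for a, b in combinations(clusters, 2):
            pairwise = [dist(profiles[x], profiles[y]) for x in a for y in b]
            distance = summarize(pairwise)
            if best is None or distance < best[0]:
                best = (distance, a, b)
        distance, a, b = best
        merges.append((a, b, distance))
        clusters = [c for c in clusters if c not in (a, b)] + [a + b]
    return merges


# Hypothetical data: six samples, three tissues in duplicate, three genes each.
profiles = {
    "liver_1": (9.0, 1.0, 1.0), "liver_2": (8.8, 1.2, 1.1),
    "brain_1": (1.0, 9.0, 1.0), "brain_2": (1.1, 8.9, 0.8),
    "heart_1": (1.0, 1.0, 9.0), "heart_2": (1.3, 0.9, 9.1),
}

for a, b, distance in agglomerate(profiles, linkage="average"):
    print(f"merge {a} + {b} at distance {distance:.2f}")
```

With well-separated tissues the duplicates merge first under any of the three linkage rules; the rules start to disagree only when clusters are elongated or overlapping.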
Sample classification
Unsupervised clustering classified the primary tumor samples into two groups.
– Horizontal color bar: recurrent (red) vs. non-recurrent (black) patients
– Vertical color bar: metastasis-overexpressed genes (red) vs. primary-overexpressed genes (black)
– Color code: low to high
Ramaswamy S, Ross KN, Lander ES & Golub TR. Nat Genet. 2003 Jan;33(1):49-54.
Genes related to the same function are clustered together
Human Fibroblast Growth
[Figure: clustered expression over a time course from 0h to 24h; color scale from down-regulated through 1:1 to up-regulated.]
A – Cholesterol biosynthesis
B – Cell cycle
C – Immediate-early response
D – Signaling and angiogenesis
E – Wound healing
Eisen MB, Proc. Natl. Acad. Sci. USA Vol. 95, pp. 14863–14868
Co-expression may imply co-regulation
– Genes that clustered together were found to have the URS1 motif 5’-DSGGCGGCND-3’ in their upstream region. For metabolic genes it was found in 15 out of 52.
– Genes that clustered together were found to have the MSE motif 5’-DNCRCAAAW-3’ in their upstream region, which is suggested to be recognized by the Ndt80 transcription factor.
S. Chu et al., The Transcriptional Program of Sporulation in Budding Yeast. Science 282:699-705
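The two motifs are written with IUPAC degenerate nucleotide codes (for example D = A, G or T; S = C or G; N = any base). A minimal sketch of how such a motif could be searched in upstream regions is to translate it into a regular expression. The gene names and sequences below are invented for illustration, and only the forward strand is scanned, which is a simplification.

```python
import re

# IUPAC nucleotide codes expressed as regular-expression character classes.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "[AG]", "Y": "[CT]", "S": "[CG]", "W": "[AT]",
    "K": "[GT]", "M": "[AC]",
    "B": "[CGT]", "D": "[AGT]", "H": "[ACT]", "V": "[ACG]",
    "N": "[ACGT]",
}


def motif_to_regex(motif):
    """Translate a degenerate motif into a compiled regular expression."""
    return re.compile("".join(IUPAC[base] for base in motif))


MOTIFS = {
    "URS1": motif_to_regex("DSGGCGGCND"),
    "MSE": motif_to_regex("DNCRCAAAW"),
}

# Hypothetical upstream sequences (5' to 3'), invented for this example.
upstream = {
    "geneA": "ATCCTGGGCGGCTAATTC",
    "geneB": "TTTGACACAAAAGCGT",
    "geneC": "ACGTACGTACGTACGT",
}

# For each gene, list the motifs found in its upstream sequence.
hits = {
    gene: [name for name, pattern in MOTIFS.items() if pattern.search(sequence)]
    for gene, sequence in upstream.items()
}

for gene, found in hits.items():
    print(gene, found or "no motif")
```

Counting such hits inside a cluster and comparing the count with the rest of the genome is one way to ask whether a co-expressed group shares a regulatory element.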
Notes About Hierarchical Clustering
– Given a dendrogram (or tree), we can browse levels of coherent expression with functionality or regulation in mind.
– Nodes can be flipped: swapping the two branches under a node changes the leaf order but not the clustering.
Validation of clustering
– Random1: shuffling within rows
– Random2: shuffling within columns
– Random3: shuffling both rows and columns
What is the number of valid clusters? How many genes would cluster together by chance?
Eisen MB, Proc. Natl. Acad. Sci. USA Vol. 95, pp. 14863–14868
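The three randomizations can be sketched as simple permutations of an expression matrix stored as a list of rows. This is an illustrative reading, not the cited paper's code: in particular, "both rows and columns" is interpreted here as shuffling within rows and then within columns, and the small matrix is invented.

```python
import random


def shuffle_within_rows(matrix, rng):
    """Random1: permute the values inside each row independently."""
    return [rng.sample(row, len(row)) for row in matrix]


def shuffle_within_columns(matrix, rng):
    """Random2: permute the values inside each column independently."""
    shuffled_columns = [rng.sample(column, len(column)) for column in zip(*matrix)]
    return [list(row) for row in zip(*shuffled_columns)]


def shuffle_rows_and_columns(matrix, rng):
    """Random3 (assumed reading): shuffle within rows, then within columns."""
    return shuffle_within_columns(shuffle_within_rows(matrix, rng), rng)


# Hypothetical expression matrix: 3 genes (rows) by 4 samples (columns).
matrix = [
    [1, 2, 3, 4],
    [5, 6, 7, 8],
    [9, 10, 11, 12],
]

rng = random.Random(0)  # fixed seed so the run is repeatable
random1 = shuffle_within_rows(matrix, rng)
random2 = shuffle_within_columns(matrix, rng)
random3 = shuffle_rows_and_columns(matrix, rng)

for name, shuffled in (("Random1", random1), ("Random2", random2), ("Random3", random3)):
    print(name, shuffled)
```

Clustering each shuffled matrix and comparing the result with the clustering of the real data gives a baseline for how much structure appears by chance.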
Coupling classification
– Gene that classifies the samples into two expected groups
– Gene that classifies the samples into two unexpected groups
[Figure: expression of each gene across samples labeled Normal and Cancer; color scale from down-regulated through 1:1 to up-regulated.]
Classifiers comparison
– t-test: gene relationships: each gene separately; a-priori knowledge: yes
– Discriminant analysis / K-nearest neighbors: gene relationships: combination of genes
– Hierarchical clustering: gene relationships: combination of genes