Author: William May
Microarray analysis

Outline
Define the common question
The methods
Supervised analysis of two groups
– How
– Multiple comparisons correction
– Primary vs. metastatic tumor cells (an example)
– Caveats
Unsupervised analysis
– What is clustering
– How we cluster
– Sample clustering (an example)
– Gene clustering (an example)
– Coupling the two clusterings

Typical Experiments
– One-color experiment
– Two-color experiment
[Figure: condition and control samples, followed by RNA extraction and RNA labeling]

DNA Array Technologies (A) Affymetrix

(B) “Spotting”

Expression Data Matrix (a result of quantification)
[Figure: experiment set made up of the gene expression matrix (gene expression levels; axes: genes, samples), gene annotations, and sample annotations]

Analysis flow
– Linearization
– Normalization
– Filtering
– Data analysis

Pre-processing
[Figure: pre-processing and normalization for two-color and one-color experiments]

Applications of microarrays
Evolution – Most of the gene expression differences between chimpanzees and humans have been detected in the brain.
Development – Associating gene expression with metamorphosis stages in Drosophila.
Regulation – Finding novel regulatory motifs by coupling motif search with co-expression.
Behavior – The behavior of honey-bee workers can be predicted from their brain gene expression.
Functional annotation – Annotating unknown genes based on co-expression (guilt by association).
Tissue – Molecular signatures specific to subtypes of cancer tissue.

Biological questions
Sample classification
– Which set of genes differentiates between two or more groups of treatments? (Supervised methods)
– Which set of samples shares the same expression profile in the detected cell(s)? (Unsupervised methods)
Gene classification
– Which set of genes shares the same expression profile along a set of treatments? (Unsupervised methods)

Data analysis methods
Supervised methods
– Analysis of variance (ANOVA / t-test)
– Discriminant analysis
– K-nearest neighbors
Unsupervised methods
– Partition methods: K-means, SOM (Self-Organizing Maps)
– Hierarchical methods

Supervised classification

Classifying normal and cancer groups of patients
[Figure: expression profile of a gene; x-axis: patients N1–N10 (Normal) and ALL1–ALL10 (Cancer); y-axis: expression intensity (50–110)]
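The comparison of one gene between the two patient groups can be sketched as a two-sample t-statistic. This is only an illustration: the intensity values are made up, the function name welch_t is hypothetical, and Welch's unequal-variance form is one possible choice, not necessarily the test used in the lecture.

```python
import math
from statistics import mean, variance

def welch_t(group_a, group_b):
    """Welch's t-statistic for one gene measured in two groups of samples."""
    # Standard error of the difference between the two group means.
    se = math.sqrt(variance(group_a) / len(group_a) +
                   variance(group_b) / len(group_b))
    return (mean(group_a) - mean(group_b)) / se

# Hypothetical expression intensities of a single gene (not real data).
normal = [58, 62, 60, 55, 63, 59, 61, 57, 64, 60]   # N1..N10
cancer = [88, 95, 90, 85, 97, 92, 89, 94, 91, 93]   # ALL1..ALL10

t = welch_t(cancer, normal)
print(f"t = {t:.2f}")  # a large |t| suggests the gene separates the groups
```

In a supervised analysis this statistic would be computed for every gene (every row of the expression matrix), which is what makes the multiple comparison problem unavoidable.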

Multiple comparison correction
When 10,000 tests are run and the null hypothesis is rejected at a p-value threshold of 0.05, about 500 tests (10,000 × 0.05) are expected to be falsely significant, even if no gene truly differs between the groups.

[Figure: random matrix with 10,000 rows; columns split into Group A and Group B]
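The random-matrix argument can be checked with a small simulation. The sketch below makes several assumptions that are not in the slides: 10 samples per group, standard normal noise, and a two-sided z-test with known variance as a stand-in for the t-test (so that only the Python standard library is needed). The function name is hypothetical.

```python
import random
from statistics import NormalDist

def count_false_positives(n_genes=10_000, n_per_group=10, alpha=0.05, seed=0):
    """Test each row of a pure-noise matrix; every 'significant' row is a false positive."""
    rng = random.Random(seed)
    std_normal = NormalDist()
    # Standard error of the difference of two means when sigma = 1 is known.
    se = (2 / n_per_group) ** 0.5
    hits = 0
    for _ in range(n_genes):
        group_a = [rng.gauss(0, 1) for _ in range(n_per_group)]
        group_b = [rng.gauss(0, 1) for _ in range(n_per_group)]
        z = (sum(group_a) / n_per_group - sum(group_b) / n_per_group) / se
        p = 2 * (1 - std_normal.cdf(abs(z)))  # two-sided p-value
        if p < alpha:
            hits += 1
    return hits

hits = count_false_positives()
print(hits, "of 10,000 random genes look significant at p < 0.05")
```

The count should land near 500, although the exact number depends on the seed.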

Multiple comparison correction methods
Family-wise error rate (FWER) – Adjusts the type I error threshold (p-value) so that the probability of getting even one false positive across all tests stays below the chosen level.
False discovery rate (FDR) – Adjusts the type I error threshold (p-value) so that the expected proportion of false positives among the results called significant stays below the chosen level.
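As a rough illustration of the two families, the sketch below uses the Bonferroni correction as an example of FWER control and the Benjamini-Hochberg step-up procedure as an example of FDR control. The slides do not say which specific methods were intended, so both choices are assumptions, and the p-values are made up.

```python
def bonferroni(pvalues, alpha=0.05):
    """FWER control: reject only p-values at or below alpha / m."""
    m = len(pvalues)
    return [p <= alpha / m for p in pvalues]

def benjamini_hochberg(pvalues, alpha=0.05):
    """FDR control: find the largest rank k with p_(k) <= k * alpha / m,
    then reject the k smallest p-values."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    largest_k = 0
    for rank, idx in enumerate(order, start=1):
        if pvalues[idx] <= rank * alpha / m:
            largest_k = rank
    rejected = [False] * m
    for idx in order[:largest_k]:
        rejected[idx] = True
    return rejected

# Hypothetical p-values from eight tests.
pvals = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205]
print(sum(bonferroni(pvals)), "rejected with Bonferroni")
print(sum(benjamini_hochberg(pvals)), "rejected with Benjamini-Hochberg")
```

On these values Bonferroni keeps one result and Benjamini-Hochberg keeps two, which reflects the general pattern: FDR control is less strict and so retains more genes.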

Classification of primary and metastatic tumors by t-test
[Figure: expression heat map; color code from low to high]
Ramaswamy S, Ross KN, Lander ES & Golub TR. Nat Genet. 2003 Jan;33(1):49-54.

Note About Supervision
[Figure: expression profile of a gene with one outlier sample marked; x-axis: patients N1–N10 (Normal) and ALL1–ALL10 (Cancer); y-axis: expression intensity (50–110)]
[Figure: expression heat map; color code from low to high]
Ramaswamy S, Ross KN, Lander ES & Golub TR. Nat Genet. 2003 Jan;33(1):49-54.

t-test vs. Discriminant analysis
[Figure: two scatter plots of Gene 2 against Gene 1 for Group A and Group B; one panel for the t-test, one for discriminant analysis]
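The contrast between testing each gene separately and combining genes can be sketched with a toy data set. Everything here is an assumption made for illustration: the data are synthetic, Fisher's linear discriminant is used as one common form of discriminant analysis, and the function names are hypothetical.

```python
import math
from statistics import mean, variance

def t_statistic(a, b):
    """Welch's t-statistic for one gene in two groups."""
    return (mean(a) - mean(b)) / math.sqrt(variance(a) / len(a) + variance(b) / len(b))

def scatter(points):
    """Group mean and within-group scatter (s11, s12, s22) of 2-D points."""
    m1 = mean(p[0] for p in points)
    m2 = mean(p[1] for p in points)
    s11 = sum((p[0] - m1) ** 2 for p in points)
    s12 = sum((p[0] - m1) * (p[1] - m2) for p in points)
    s22 = sum((p[1] - m2) ** 2 for p in points)
    return (m1, m2), (s11, s12, s22)

def fisher_direction(group_a, group_b):
    """Fisher's discriminant direction w = Sw^-1 (mean_a - mean_b) for two genes."""
    (a1, a2), (sa11, sa12, sa22) = scatter(group_a)
    (b1, b2), (sb11, sb12, sb22) = scatter(group_b)
    s11, s12, s22 = sa11 + sb11, sa12 + sb12, sa22 + sb22
    det = s11 * s22 - s12 * s12
    d1, d2 = a1 - b1, a2 - b2
    # Multiply the mean difference by the inverse of the 2x2 scatter matrix.
    return ((s22 * d1 - s12 * d2) / det, (-s12 * d1 + s11 * d2) / det)

# Synthetic data: both genes rise together, and the groups differ only in
# the offset between Gene 2 and Gene 1.
group_a, group_b = [], []
for x in range(10):
    jitter = 0.3 if x % 2 == 0 else -0.3
    group_a.append((x, x + 1 + jitter))
    group_b.append((x, x - 1 + jitter))

t_gene1 = t_statistic([p[0] for p in group_a], [p[0] for p in group_b])
t_gene2 = t_statistic([p[1] for p in group_a], [p[1] for p in group_b])

w = fisher_direction(group_a, group_b)
proj_a = [w[0] * p[0] + w[1] * p[1] for p in group_a]
proj_b = [w[0] * p[0] + w[1] * p[1] for p in group_b]

print("t (Gene 1):", round(t_gene1, 2), " t (Gene 2):", round(t_gene2, 2))
print("groups separated on the discriminant axis:", min(proj_a) > max(proj_b))
```

Neither gene gives a large t-statistic on its own in this toy example, yet the combination of the two separates the groups completely.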

Unsupervised classification

What is clustering?
[Figure: people plotted by hair length against chin length]

People in n-dimensional characteristics space
[Table: people (Person1–Person9) as rows; characters (Chin, Hair, Hat, Nose, Glasses, Neck) as columns]

Genes in n-dimensional experimental conditions space
[Table: genes (Gene1–Gene9) as rows; RNA samples (Heart, Uterus, Liver, Kidney, Pancreas, Muscle) as columns]
[Figure: expression cluster (2D) and expression profile (n-D)]

Finding similar patterns in expression matrix

Reordered Gene Matrix

Distances
– Manhattan (blocks)
– Euclidean
– Pearson correlation

Manhattan distance
$$ d(x, y) = \sum_i |x_i - y_i| $$
[Figure: expression intensity across treatments for a Control and a Treatment profile, with the distance between them marked]

Euclidean distance
$$ d(x, y) = \sqrt{\sum_i (x_i - y_i)^2} $$
[Figure: expression intensity across treatments for a Control and a Treatment profile, with the distance between them marked]

Pearson correlation
$$ r = \frac{\sum_i (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_i (x_i - \bar{x})^2}\,\sqrt{\sum_i (y_i - \bar{y})^2}} $$
[Figure: intensity across treatments]
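The three measures can be written in a few lines each. The sketch below is a plain standard-library version with hypothetical function names and made-up profiles, where x and y are the expression profiles of two genes (or two samples) of equal length.

```python
import math

def manhattan(x, y):
    """Sum of absolute differences between two profiles."""
    return sum(abs(a - b) for a, b in zip(x, y))

def euclidean(x, y):
    """Square root of the sum of squared differences."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def pearson(x, y):
    """Pearson correlation coefficient r between two profiles."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    num = sum((a - mx) * (b - my) for a, b in zip(x, y))
    den = (math.sqrt(sum((a - mx) ** 2 for a in x)) *
           math.sqrt(sum((b - my) ** 2 for b in y)))
    return num / den

# Two hypothetical profiles with the same shape but different magnitude.
x = [1, 2, 3]
y = [2, 4, 6]
print(manhattan(x, y))   # 6
print(euclidean(x, y))   # square root of 14
print(pearson(x, y))     # very close to 1: the shapes match
```

The example shows why the choice matters: Manhattan and Euclidean distances treat these two profiles as different, while Pearson correlation treats them as the same pattern because it ignores magnitude.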

Constructing clustering (an example)
Method: Agglomerative clustering
Steps:
• Compare all pairwise distances
• Define the relationships among the samples
Data: Six RNA samples from three tissues, each in duplicate

Distance matrix
– Single linkage = minimum distance
– Average linkage = average distance
– Complete linkage = maximum distance

Create a tissue dendrogram
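The steps above can be sketched as a naive agglomerative procedure. The expression values, sample names, and the function name agglomerate are invented for illustration, and Euclidean distance is an assumed choice; the slides do not give the actual data or distance used.

```python
import math
from itertools import combinations

def euclidean(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

# How the distance between two clusters is derived from pairwise distances.
LINKAGE = {
    "single": min,                            # minimum distance
    "complete": max,                          # maximum distance
    "average": lambda d: sum(d) / len(d),     # average distance
}

def agglomerate(samples, linkage="average"):
    """Repeatedly merge the two closest clusters; return the merges in order."""
    combine = LINKAGE[linkage]
    clusters = [(name,) for name in samples]
    merges = []
    while len(clusters) > 1:
        best = None
        for a, b in combinations(clusters, 2):
            dists = [euclidean(samples[i], samples[j]) for i in a for j in b]
            d = combine(dists)
            if best is None or d < best[0]:
                best = (d, a, b)
        d, a, b = best
        clusters = [c for c in clusters if c not in (a, b)] + [a + b]
        merges.append((a, b, d))
    return merges

# Hypothetical data: three tissues in duplicate, four genes each.
samples = {
    "liver_1": [9.0, 2.0, 1.0, 5.0],
    "liver_2": [8.8, 2.2, 1.1, 5.1],
    "heart_1": [2.0, 9.0, 1.5, 4.0],
    "heart_2": [2.1, 8.7, 1.4, 4.2],
    "muscle_1": [2.5, 7.0, 6.0, 4.0],
    "muscle_2": [2.4, 7.2, 6.1, 3.9],
}

merges = agglomerate(samples, linkage="average")
for a, b, d in merges:
    print(f"merge {a} + {b} at distance {d:.2f}")
```

Read from first merge to last, the list is the dendrogram from the leaves up. With these values the duplicates pair up first and the tissues join afterwards.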

Sample classification
Unsupervised clustering classified the primary tumor samples into two groups.
– The horizontal color bar shows recurrent (red) vs. non-recurrent (black) patients.
– The vertical color bar shows metastasis-overexpressed (red) vs. primary-overexpressed (black) genes.
[Figure: clustered expression heat map; color code from low to high]
Ramaswamy S, Ross KN, Lander ES & Golub TR. Nat Genet. 2003 Jan;33(1):49-54.

Genes related to the same function are clustered together
[Figure: Human Fibroblast Growth, time course from 0h to 24h; color scale from down-regulated through 1:1 to up-regulated]
A – Cholesterol biosynthesis
B – Cell cycle
C – Immediate and early response
D – Signaling and angiogenesis
E – Wound healing

Eisen MB, Proc. Natl. Acad. Sci. USA Vol. 95, pp. 14863–14868

Co-expression may imply co-regulation
– Genes that clustered together were found to have the URS1 motif 5’-DSGGCGGCND-3’ in their upstream region. For metabolic genes it was found in 15 out of 52.
– Genes that clustered together were found to have the MSE motif 5’-DNCRCAAAW-3’ in their upstream region, which is suggested to be recognized by the Ndt80 transcription factor.

Chu S. et al., The Transcriptional Program of Sporulation in Budding Yeast. Science 282:699-705

Notes About Hierarchical Clustering
– Given a dendrogram (or tree), we can browse the levels of coherent expression with function or regulation in mind.
– Flip nodes: the two branches under any node can be swapped without changing the tree.

Validation of clustering
– Random1: shuffling within rows
– Random2: shuffling within columns
– Random3: shuffling both rows and columns
Questions to ask:
– What is the number of valid clusters?
– How many genes would cluster together by chance?

Eisen MB, Proc. Natl. Acad. Sci. USA Vol. 95, pp. 14863–14868
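The three randomizations can be sketched as follows. The function names are hypothetical, and reading "both rows and columns" as a row shuffle followed by a column shuffle is an interpretation, not something the slide spells out.

```python
import random

def shuffle_within_rows(matrix, rng):
    """Random1: permute the values inside each row (gene) independently."""
    return [rng.sample(row, len(row)) for row in matrix]

def shuffle_within_columns(matrix, rng):
    """Random2: permute the values inside each column (sample) independently."""
    columns = [list(col) for col in zip(*matrix)]
    shuffled = [rng.sample(col, len(col)) for col in columns]
    return [list(row) for row in zip(*shuffled)]

def shuffle_both(matrix, rng):
    """Random3 (assumed reading): shuffle within rows, then within columns."""
    return shuffle_within_columns(shuffle_within_rows(matrix, rng), rng)

# Small hypothetical expression matrix: 3 genes x 4 samples.
matrix = [
    [1.0, 2.0, 3.0, 4.0],
    [5.0, 6.0, 7.0, 8.0],
    [9.0, 10.0, 11.0, 12.0],
]
rng = random.Random(0)
random1 = shuffle_within_rows(matrix, rng)
random2 = shuffle_within_columns(matrix, rng)
random3 = shuffle_both(matrix, rng)
```

Clustering each shuffled matrix with the same settings as the real data gives a baseline for how much structure appears by chance alone.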

Coupling classification
[Figure: a gene that classifies the samples into the two expected groups, Normal and Cancer; color scale from down-regulated through 1:1 to up-regulated]
[Figure: a gene that classifies the samples into two unexpected groups, each mixing Normal and Cancer samples]

Classifiers comparison

                     t-test                 Discriminant analysis / K-nearest neighbors   Hierarchical clustering
Gene relationships   Each gene separately   Combination of genes                          Combination of genes
A-priori knowledge   yes                    yes                                           no

Thank you