Unsupervised analysis of gene expression data

Author: Fay Parsons
Bing Zhang
Department of Biomedical Informatics
Vanderbilt University
[email protected]

Overall workflow of a microarray study

- Biological question
- Experiment design
- Microarray experiment
- Image analysis
- Pre-processing
- Data analysis
- Hypothesis
- Experimental verification

Applied Bioinformatics, Spring 2011

Three major goals of gene expression studies

- Class comparison (supervised analysis)
  - e.g. disease biomarker discovery
  - Differential expression analysis
  - Input: gene expression data, class label of the samples
  - Output: differentially expressed genes
- Class detection (unsupervised analysis)
  - e.g. patient subgroup detection
  - Clustering analysis
  - Input: gene expression data
  - Output: groups of similar samples or genes
- Class prediction (supervised learning)
  - e.g. disease diagnosis and prognosis
  - Machine learning techniques
  - Input: gene expression data, class label of the samples (training data)
  - Output: prediction model

[Figure: example expression data matrix; the values are not recoverable from the extracted text]

What is clustering

- Clustering algorithms are methods to divide a set of n objects (genes or samples) into g groups so that within-group similarities are larger than between-group similarities
- Unsupervised techniques that do not require sample annotation in the process

Example data matrix (rows: genes, columns: samples)

Gene      Sample_1  Sample_2  Sample_3  Sample_4  Sample_5  ……
TNNC1     14.82     14.46     14.76     11.22     11.55     ……
DKK4      10.71     10.37     11.23     19.74     19.73     ……
ZNF185    15.20     14.96     15.07     12.57     12.37     ……
CHST3     13.40     13.18     13.15     11.18     10.99     ……
FABP3     15.87     15.80     15.85     13.16     12.99     ……
MGST1     12.76     12.80     12.67     14.92     15.02     ……
DEFA5     10.63     10.47     10.54     15.52     15.52     ……
VIL1      11.47     11.69     11.87     13.94     14.01     ……
AKAP12    18.26     18.10     18.50     15.60     15.69     ……
HS3ST1    10.61     10.67     10.50     12.44     12.23     ……
……        ……        ……        ……        ……        ……        ……
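
As a rough illustration of the definition above, the following Python sketch takes the ten genes and five samples shown in the example matrix and compares Euclidean distances between samples. The split into Sample_1 to Sample_3 versus Sample_4 and Sample_5 is an assumption made here by eye; the slide does not label the samples. The helper names are hypothetical.

```python
import math
from itertools import combinations

# Values copied from the example matrix (rows: genes, columns: Sample_1..Sample_5).
expression = {
    "TNNC1":  [14.82, 14.46, 14.76, 11.22, 11.55],
    "DKK4":   [10.71, 10.37, 11.23, 19.74, 19.73],
    "ZNF185": [15.20, 14.96, 15.07, 12.57, 12.37],
    "CHST3":  [13.40, 13.18, 13.15, 11.18, 10.99],
    "FABP3":  [15.87, 15.80, 15.85, 13.16, 12.99],
    "MGST1":  [12.76, 12.80, 12.67, 14.92, 15.02],
    "DEFA5":  [10.63, 10.47, 10.54, 15.52, 15.52],
    "VIL1":   [11.47, 11.69, 11.87, 13.94, 14.01],
    "AKAP12": [18.26, 18.10, 18.50, 15.60, 15.69],
    "HS3ST1": [10.61, 10.67, 10.50, 12.44, 12.23],
}

def sample_profile(j):
    """Expression profile of sample j (0-based) across all listed genes."""
    return [values[j] for values in expression.values()]

def euclidean(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def mean(values):
    values = list(values)
    return sum(values) / len(values)

# Hypothetical grouping, chosen by eye for illustration only.
group_a = [0, 1, 2]   # Sample_1, Sample_2, Sample_3
group_b = [3, 4]      # Sample_4, Sample_5

# Average distance between samples that share a group.
within = mean(
    euclidean(sample_profile(i), sample_profile(j))
    for group in (group_a, group_b)
    for i, j in combinations(group, 2)
)
# Average distance between samples from different groups.
between = mean(
    euclidean(sample_profile(i), sample_profile(j))
    for i in group_a
    for j in group_b
)

print(f"mean within-group distance:  {within:.2f}")
print(f"mean between-group distance: {between:.2f}")
```

A grouping is a reasonable clustering in the sense of the definition when the within-group distance is clearly smaller than the between-group distance, which is the case for this split.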

Why clustering?

- Exploratory data analysis, providing rough maps and suggesting directions for further study
- Representing distances among high-dimensional expression profiles in a concise, visually effective way, such as a tree or dendrogram
- Identify candidate subgroups in complex data, e.g. identification of novel sub-types in cancer, identification of co-expressed genes
- Functional annotation based on guilt by association

Clustering methods

- Hierarchical clustering: generate a hierarchy of clusters going from 1 cluster to n clusters
- Partitioning: divide the data into g groups using some reallocation algorithms, e.g. K-means

Hierarchical clustering

- Agglomerative clustering (bottom-up)
  - Start out with all sample units in n clusters of size 1.
  - At each step of the algorithm, the pair of clusters with the shortest distance is combined into a single cluster.
  - The algorithm stops when all sample units are combined into a single cluster of size n.
- Divisive clustering (top-down)
  - Start out with all sample units in a single cluster of size n.
  - At each step of the algorithm, clusters are partitioned into a pair of daughter clusters, selected to maximize the distance between each daughter.
  - The algorithm stops when sample units are partitioned into n clusters of size 1.
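
The bottom-up procedure can be sketched in a few lines of Python. This is a naive illustration, not an efficient or canonical implementation: the function name `agglomerative`, the Euclidean object distance, and the default of taking the minimum pairwise distance between clusters are all assumptions made here, and the 2-D points are invented.

```python
import math
from itertools import product

def euclidean(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def agglomerative(points, distance=euclidean, linkage=min):
    """Naive bottom-up clustering.

    Returns the merge history as (members_a, members_b, distance) tuples,
    where members are tuples of indices into `points`.
    `linkage` aggregates the pairwise distances between two clusters
    (min and max are two possible choices).
    """
    clusters = [(i,) for i in range(len(points))]   # n clusters of size 1
    merges = []
    while len(clusters) > 1:
        best = None
        # Find the pair of clusters with the shortest distance.
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = linkage(
                    distance(points[i], points[j])
                    for i, j in product(clusters[a], clusters[b])
                )
                if best is None or d < best[0]:
                    best = (d, a, b)
        d, a, b = best
        merges.append((clusters[a], clusters[b], d))
        # Combine the pair into a single cluster.
        merged = clusters[a] + clusters[b]
        clusters = [c for k, c in enumerate(clusters) if k not in (a, b)]
        clusters.append(merged)
    return merges   # ends when one cluster of size n remains

# Hypothetical 2-D points, for illustration only.
points = [(0.0, 0.0), (0.0, 1.0), (5.0, 5.0), (5.0, 6.5), (10.0, 0.0)]
for left, right, height in agglomerative(points):
    print(left, "+", right, f"at distance {height:.2f}")
```

For n objects the history always contains n - 1 merges, and the recorded distances are the information a tree drawing of the result would need.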

Agglomerative clustering

- Require distance measurement
  - Between two objects
  - Between clusters

Between objects distance measurement

- Euclidean distance
  - Focus on the absolute expression value

$$d = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2}$$

- Pearson correlation coefficient
  - Focus on the expression profile shape
  - Parametric: assumes the data are normally distributed and follow the linear regression model

$$r = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2}\,\sqrt{\sum_{i=1}^{n} (y_i - \bar{y})^2}}, \qquad d = 1 - r$$

- Spearman correlation coefficient
  - Focus on the expression profile shape
  - Non-parametric, no assumption
  - Less sensitive but more robust than Pearson
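
A minimal Python sketch of the three measures follows. It assumes the usual definitions: Euclidean distance as the square root of the summed squared differences, Pearson's r as covariance over the product of standard deviations, and Spearman's coefficient as Pearson's r computed on ranks (ties share their average rank). The function names and the three expression profiles are hypothetical.

```python
import math

def euclidean_distance(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def pearson_r(x, y):
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    # Undefined (division by zero) when either profile is constant.
    return cov / (sd_x * sd_y)

def ranks(values):
    """Ranks starting at 1; tied values share the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        average_rank = (pos + end) / 2 + 1
        for k in range(pos, end + 1):
            result[order[k]] = average_rank
        pos = end + 1
    return result

def spearman_r(x, y):
    return pearson_r(ranks(x), ranks(y))

# Hypothetical log2 expression profiles over five time points.
gene_a = [1.0, 2.0, 3.0, 4.0, 5.0]
gene_b = [3.0, 4.0, 5.0, 6.0, 7.0]   # same shape as gene_a, shifted up by 2
gene_c = [1.0, 1.1, 1.3, 2.0, 6.0]   # same ordering as gene_a, but not linear

print("Euclidean a-b:       ", euclidean_distance(gene_a, gene_b))
print("Pearson d=1-r, a-b:  ", 1 - pearson_r(gene_a, gene_b))
print("Pearson d=1-r, a-c:  ", 1 - pearson_r(gene_a, gene_c))
print("Spearman d=1-r, a-c: ", 1 - spearman_r(gene_a, gene_c))
```

In this toy case the shifted profile is far from gene_a in Euclidean terms but has a Pearson distance of essentially zero, while the non-linear profile keeps a Spearman distance of essentially zero because only the ordering of the values matters.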

Different measurement, different distance

Most similar profile to GeneA (blue) based on different distance measurement:
- Euclidean: GeneB (pink)
- Pearson: GeneC (green)
- Spearman: GeneD (red)

[Figure: expression profiles of GeneA, GeneB, GeneC and GeneD. x-axis: Time (hr); y-axis: Gene expression level (log2)]

Between cluster distance measurement

- Single linkage: the smallest of all pairwise distances between members of the two clusters
- Complete linkage: the largest of all pairwise distances between members of the two clusters
- Average linkage: the average of all pairwise distances between members of the two clusters
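
The three linkage rules differ only in how they summarize the pairwise distances between two clusters. A small Python sketch, assuming Euclidean distance between objects and using two invented one-dimensional clusters (all names are hypothetical):

```python
import math
from itertools import product
from statistics import mean

def euclidean(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def pairwise(cluster_a, cluster_b, distance=euclidean):
    """All distances between one member of cluster_a and one of cluster_b."""
    return [distance(p, q) for p, q in product(cluster_a, cluster_b)]

def single_linkage(cluster_a, cluster_b):
    return min(pairwise(cluster_a, cluster_b))

def complete_linkage(cluster_a, cluster_b):
    return max(pairwise(cluster_a, cluster_b))

def average_linkage(cluster_a, cluster_b):
    return mean(pairwise(cluster_a, cluster_b))

# Hypothetical clusters of one-dimensional objects.
cluster_a = [(0.0,), (1.0,)]
cluster_b = [(4.0,), (6.0,)]

# The four pairwise distances are 4, 6, 3 and 5.
print(single_linkage(cluster_a, cluster_b))    # 3.0
print(complete_linkage(cluster_a, cluster_b))  # 6.0
print(average_linkage(cluster_a, cluster_b))   # 4.5
```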

Visualization and interpretation of hierarchical clustering results

- Dendrogram
  - Output of a hierarchical clustering
  - Tree structure with the genes or samples as the leaves
  - The height of the join indicates the distance between the left branch and the right branch
- Heat map
  - Graphical representation of data where the values are represented as colors

Partitioning

- General idea
  - Select the number of groups, g
  - Randomly divide the objects into g groups
  - Iteratively rearrange the objects until a stop condition is met
- Representative methods
  - K-means
  - Self Organizing Map (SOM)

K-means

1. Define k = number of clusters
2. Randomly initialize a seed vector for each cluster
3. Go through all objects, and assign each object to the cluster to which it is most similar
4. Recalculate all seed vectors as the means of the patterns in each cluster
5. Repeat 3 & 4 until a stop condition is met (e.g. until all objects get assigned to the same partition twice in a row)
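
The K-means steps can be sketched in plain Python as follows. Several details are assumptions made for this illustration rather than something the slide specifies: similarity is measured by Euclidean distance, the initial seeds are drawn from the objects themselves, an empty cluster keeps its previous seed, and the names `k_means`, `max_iter` and `seed` are hypothetical. The data points are invented.

```python
import math
import random

def euclidean(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def mean_vector(vectors):
    """Component-wise mean of a non-empty list of equal-length vectors."""
    n = len(vectors)
    return tuple(sum(column) / n for column in zip(*vectors))

def k_means(objects, k, max_iter=100, seed=0):
    rng = random.Random(seed)
    seeds = rng.sample(objects, k)          # step 2: random initial seed vectors
    assignment = None
    for _ in range(max_iter):
        # Step 3: assign each object to the cluster with the closest seed.
        new_assignment = [
            min(range(k), key=lambda c: euclidean(obj, seeds[c]))
            for obj in objects
        ]
        # Step 5: stop when the partition is the same twice in a row.
        if new_assignment == assignment:
            break
        assignment = new_assignment
        # Step 4: recalculate each seed as the mean of its members.
        for c in range(k):
            members = [obj for obj, a in zip(objects, assignment) if a == c]
            if members:                     # an empty cluster keeps its old seed
                seeds[c] = mean_vector(members)
    return assignment, seeds

# Hypothetical two-dimensional objects, for illustration only.
objects = [(1.0, 1.0), (1.2, 0.8), (0.8, 1.1), (8.0, 8.0), (8.2, 7.9), (7.9, 8.3)]
assignment, seeds = k_means(objects, k=2)
print(assignment)
print(seeds)
```

Because the seeds are initialized randomly, different values of `seed` can end in different final partitions, so in practice the procedure is often run several times.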

K-means (illustration)

[Figure: K-means on two clusters, with panels in this order]
- Randomly initialize seeds (seed vector 1, seed vector 2)
- Objects join with closest seed
- Recalculate seeds
- Reassign objects
- Recalculate seeds
- Reassign objects
- Seeds become stable: final clusters

Cool animations

- Hierarchical clustering: http://home.dei.polimi.it/matteucc/Clustering/tutorial_html/AppletH.html
- K-means: http://animation.yihui.name/mvstat:k-means_cluster_algorithm

Resources

- Data source
  - Gene Expression Omnibus (GEO): http://www.ncbi.nlm.nih.gov/geo/
  - ArrayExpress: http://www.ebi.ac.uk/arrayexpress/
- Microarray data analysis tools
  - Bioconductor: http://www.bioconductor.org/
  - Expression profiler: http://www.ebi.ac.uk/expressionprofiler/

Summary

- Agglomerative clustering
  - Bottom-up
  - Between objects distance measurement
    - Euclidean distance
    - Pearson’s correlation coefficient
    - Spearman’s correlation coefficient
  - Between cluster distance measurement
    - Single linkage
    - Complete linkage
    - Average linkage
  - Visualization
    - Dendrogram
    - Heat map
- k-means clustering
  - Partitioning

Exercise

- Data set: evan_deneris_2010_5ht_top500diff.txt
  - 500 selected probe sets
  - Four groups (Rostral_5ht, Rostral_non5ht, Caudal_5ht, Caudal_non5ht)
  - No missing value; already normalized; already log transformed
- Use hierarchical clustering in Expression profiler (http://www.ebi.ac.uk/expressionprofiler) to generate a heat map
