Clustering: hierarchical and k-means
Author: Tamsin Simon

Clustering analysis

Need to define:
• a measure of similarity
• an algorithm for using the measure of similarity to discover natural groups in the data

The number of ways to divide n items into k clusters is approximately k^n / k!
Example: 10^500 / 10! = 2.756 × 10^493

T.R.Hvidsten and J. Komorowski
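The count above can be checked numerically. The sketch below is illustrative only: the function names are made up for this example, and the exact count uses the standard recurrence for Stirling numbers of the second kind, which the slides do not mention.

```python
from math import factorial

def approx_partitions(n, k):
    """The slide's estimate k^n / k!, rounded down to an integer."""
    return k ** n // factorial(k)

def stirling2(n, k):
    """Exact number of ways to split n labelled items into k non-empty groups.

    Uses the recurrence S(n, k) = k * S(n - 1, k) + S(n - 1, k - 1).
    """
    row = [1] + [0] * k  # S(0, 0), S(0, 1), ..., S(0, k)
    for _ in range(n):
        row = [0] + [j * row[j] + row[j - 1] for j in range(1, k + 1)]
    return row[k]

if __name__ == "__main__":
    digits = str(approx_partitions(500, 10))
    print("leading digits:", digits[:5], "number of digits:", len(digits))
    print("estimate vs exact for n=20, k=3:",
          approx_partitions(20, 3), stirling2(20, 3))
```

For n = 500 and k = 10 the estimate has 494 digits starting with 27557, which agrees with 2.756 × 10^493. For small n the estimate and the exact count differ noticeably, so the formula is best read as an order-of-magnitude argument.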


Measure of similarity

What is similar?

Euclidean distance
[Figure: two points plotted against axes E1 and E2, with the Euclidean distance d between them]
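As a minimal sketch of this measure, the Euclidean distance between two profiles a and b is the square root of the sum of squared coordinate differences. The function name and the sample points are illustrative, not taken from the slides.

```python
import math

def euclidean(a, b):
    """Euclidean distance between two equally long sequences of numbers."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

if __name__ == "__main__":
    # Two hypothetical genes measured in experiments E1 and E2.
    print(euclidean((0.0, 0.0), (3.0, 4.0)))
```

The same function works for any number of experiments, since it only assumes that both profiles have the same length.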

Hierarchical clustering

INPUT: n genes/experiments
• Consider each gene/experiment as an individual cluster and initiate an n × n distance matrix d
• Repeat
  – identify the two most similar clusters in d (i.e. smallest number in d)
  – merge the two most similar clusters and update the matrix (i.e. substitute the two clusters with the new cluster)

OUTPUT: A tree of merged genes/experiments (called a dendrogram)
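The loop above can be sketched in a few lines of Python. This is a deliberately naive version under stated assumptions: the function name `agglomerate` and the toy matrix are hypothetical, and instead of updating the matrix in place it recomputes each intercluster distance from the original item distances, which is simpler but slower.

```python
from itertools import combinations

def agglomerate(d, linkage=min):
    """Agglomerative clustering on a symmetric n x n distance matrix d.

    Clusters are tuples of item indices. `linkage` reduces the item-to-item
    distances between two clusters to one number (min = single linkage,
    max = complete linkage). Returns the merges as (height, a, b) tuples.
    """
    def between(pair):
        a, b = pair
        return linkage(d[i][j] for i in a for j in b)

    clusters = [(i,) for i in range(len(d))]
    merges = []
    while len(clusters) > 1:
        # Identify the two most similar clusters.
        a, b = min(combinations(clusters, 2), key=between)
        merges.append((between((a, b)), a, b))
        # Substitute the two clusters with the new, merged cluster.
        clusters = [c for c in clusters if c not in (a, b)] + [a + b]
    return merges

if __name__ == "__main__":
    toy = [[0, 1, 4, 5],
           [1, 0, 3, 6],
           [4, 3, 0, 2],
           [5, 6, 2, 0]]
    for height, a, b in agglomerate(toy):
        print(height, a, b)
```

The list of merges, read from first to last, is the dendrogram from the leaves up: each entry records which two clusters were joined and at what height.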


Hierarchical clustering Intercluster similarity measures: (a) single linkage, (b) complete linkage and (c) average linkage
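Under the usual definitions (assumed here, since the slide only names the measures), single linkage uses the smallest distance between members of the two clusters, complete linkage the largest, and average linkage the mean over all member pairs. A minimal sketch with illustrative function names:

```python
import math

def pairwise(a, b):
    """All distances between a point in cluster a and a point in cluster b."""
    return [math.dist(p, q) for p in a for q in b]

def single_linkage(a, b):
    """Distance between the two closest members."""
    return min(pairwise(a, b))

def complete_linkage(a, b):
    """Distance between the two most distant members."""
    return max(pairwise(a, b))

def average_linkage(a, b):
    """Mean distance over all pairs of members."""
    distances = pairwise(a, b)
    return sum(distances) / len(distances)

if __name__ == "__main__":
    left = [(0, 0), (1, 0)]
    right = [(3, 0), (5, 0)]
    print(single_linkage(left, right),
          complete_linkage(left, right),
          average_linkage(left, right))
```

Single linkage tends to chain clusters together through close neighbours, while complete linkage favours compact clusters, so the choice can change the resulting tree considerably.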


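Example of hierarchical clustering: languages of Europe

Distance: the number of numerals (one to ten) whose words begin with different letters in the two languages, e.g. d(E, N) = 2, d(E, Du) = 7, d(Sp, I) = 1
Languages: E = English, N = Norwegian, Da = Danish, Du = Dutch, G = German, Fr = French, Sp = Spanish, I = Italian, P = Polish, H = Hungarian, Fi = Finnish
Intercluster strategy: SINGLE LINKAGE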

Iteration 1

Distance matrix (lower triangle):

        E   N  Da  Du   G  Fr  Sp   I   P   H  Fi
E       0
N       2   0
Da      2   1   0
Du      7   5   6   0
G       6   4   5   5   0
Fr      6   6   6   9   7   0
Sp      6   6   5   9   7   2   0
I       6   6   5   9   7   1   1   0
P       7   7   6  10   8   5   3   4   0
H       9   8   8   8   9  10  10  10  10   0
Fi      9   9   9   9   9   9   9   9   9   8   0

[Dendrogram, height axis 1 to 8: I and Fr joined at height 1]

Iteration 2

Distance matrix (lower triangle; columns in the same order as the rows):

I Fr    0
E       6   0
N       6   2   0
Da      5   2   1   0
Du      9   7   5   6   0
G       7   6   4   5   5   0
Sp      1   6   6   5   9   7   0
P       4   7   7   6  10   8   3   0
H      10   9   8   8   8   9  10  10   0
Fi      9   9   9   9   9   9   9   9   8   0

[Dendrogram, height axis 1 to 8: I-Fr joined at height 1; Da and N joined at height 1]

Iteration 3

Distance matrix (lower triangle; columns in the same order as the rows):

Da N    0
I Fr    5   0
E       2   6   0
Du      5   9   7   0
G       4   7   6   5   0
Sp      5   1   6   9   7   0
P       6   4   7  10   8   3   0
H       8  10   9   8   9  10  10   0
Fi      9   9   9   9   9   9   9   8   0

[Dendrogram, height axis 1 to 8: Sp joined to I-Fr at height 1; Da-N unchanged]

Iteration 4

Distance matrix (lower triangle; columns in the same order as the rows):

Sp I Fr    0
Da N       5   0
E          6   2   0
Du         9   5   7   0
G          7   4   6   5   0
P          3   6   7  10   8   0
H         10   8   9   8   9  10   0
Fi         9   9   9   9   9   9   8   0

[Dendrogram, height axis 1 to 8: E joined to Da-N at height 2; I-Fr-Sp unchanged]

Iteration 5

Distance matrix (lower triangle; columns in the same order as the rows):

E Da N     0
Sp I Fr    5   0
Du         5   9   0
G          4   7   5   0
P          6   3  10   8   0
H          8  10   8   9  10   0
Fi         9   9   9   9   9   8   0

[Dendrogram, height axis 1 to 8: P joined to I-Fr-Sp at height 3; Da-N-E unchanged]

Iteration 6

Distance matrix (lower triangle; columns in the same order as the rows):

P Sp I Fr    0
E Da N       5   0
Du           9   5   0
G            7   4   5   0
H           10   8   8   9   0
Fi           9   9   9   9   8   0

[Dendrogram, height axis 1 to 8: G joined to Da-N-E at height 4; I-Fr-Sp-P unchanged]

Iteration 7

Distance matrix (lower triangle; columns in the same order as the rows):

G E Da N     0
P Sp I Fr    5   0
Du           5   9   0
H            8  10   8   0
Fi           9   9   9   8   0

[Dendrogram, height axis 1 to 8: Du joined to Da-N-E-G at height 5; I-Fr-Sp-P unchanged]

Iteration 8

Distance matrix (lower triangle; columns in the same order as the rows):

Du G E Da N    0
P Sp I Fr      5   0
H              8  10   0
Fi             9   9   8   0

[Dendrogram, height axis 1 to 8: I-Fr-Sp-P joined to Da-N-E-G-Du at height 5]

Iteration 9

Distance matrix (lower triangle; columns in the same order as the rows):

P Sp I Fr Du G E Da N    0
H                        8   0
Fi                       9   8   0

[Dendrogram, height axis 1 to 8: H joined at height 8]

Iteration 10

Distance matrix (lower triangle; columns in the same order as the rows):

P Sp I Fr Du G E Da N H    0
Fi                         8   0

[Dendrogram, height axis 1 to 8: Fi joined at height 8; final leaf order I, Fr, Sp, P, Da, N, E, G, Du, H, Fi]
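The ten iterations above can be reproduced with a short script. This is a sketch rather than the authors' implementation: the names `LANGS`, `LOWER` and `single_linkage` are made up here, the matrix is the Iteration 1 distance matrix, and ties between equally small distances are broken by insertion order, so merges at the same height may occur in a different order than on the slides.

```python
LANGS = ["E", "N", "Da", "Du", "G", "Fr", "Sp", "I", "P", "H", "Fi"]

# Lower triangle of the Iteration 1 distance matrix, one row per language.
LOWER = [
    [0],
    [2, 0],
    [2, 1, 0],
    [7, 5, 6, 0],
    [6, 4, 5, 5, 0],
    [6, 6, 6, 9, 7, 0],
    [6, 6, 5, 9, 7, 2, 0],
    [6, 6, 5, 9, 7, 1, 1, 0],
    [7, 7, 6, 10, 8, 5, 3, 4, 0],
    [9, 8, 8, 8, 9, 10, 10, 10, 10, 0],
    [9, 9, 9, 9, 9, 9, 9, 9, 9, 8, 0],
]

def single_linkage(names, lower):
    """Return the merges as (height, merged_cluster) in the order performed."""
    clusters = [frozenset([name]) for name in names]
    # dist maps an unordered pair of clusters to their distance.
    dist = {}
    for i, row in enumerate(lower):
        for j in range(i):
            dist[frozenset([clusters[i], clusters[j]])] = row[j]

    merges = []
    while len(clusters) > 1:
        pair = min(dist, key=dist.get)  # smallest number in the matrix
        height = dist.pop(pair)
        a, b = tuple(pair)
        merged = a | b
        clusters = [c for c in clusters if c != a and c != b]
        # Single linkage update: the new cluster is as close to c as the
        # closer of its two parts was.
        for c in clusters:
            d = min(dist.pop(frozenset([a, c])), dist.pop(frozenset([b, c])))
            dist[frozenset([merged, c])] = d
        clusters.append(merged)
        merges.append((height, merged))
    return merges

if __name__ == "__main__":
    for height, cluster in single_linkage(LANGS, LOWER):
        print(height, sorted(cluster))
```

The merge heights do not depend on how ties are broken, so they should match the dendrogram heights read off the slides.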


Any data mining result needs to be consistent BOTH with the data and current knowledge!


Evaluation of clusters

Clusters may be evaluated according to how well they describe current knowledge

[Figure: the final dendrogram (height axis 1 to 8) annotated with language families: Roman (I, Fr, Sp), Slavic (P), Germanic (Da, N, E, G, Du), Ugro-Finnish (H, Fi)]
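One way to put a number on "how well clusters describe current knowledge" is cluster purity. Purity is not mentioned on the slides and is used here only as an illustration; the family labels follow the figure, and `CUT_AT_4` is the set of clusters obtained by cutting the single-linkage dendrogram of the language example between heights 4 and 5.

```python
from collections import Counter

# Language families as annotated on the dendrogram.
FAMILY = {
    "I": "Roman", "Fr": "Roman", "Sp": "Roman",
    "P": "Slavic",
    "Da": "Germanic", "N": "Germanic", "E": "Germanic",
    "G": "Germanic", "Du": "Germanic",
    "H": "Ugro-Finnish", "Fi": "Ugro-Finnish",
}

# Clusters left when only merges at height 4 or lower are kept.
CUT_AT_4 = [
    {"I", "Fr", "Sp", "P"},
    {"Da", "N", "E", "G"},
    {"Du"},
    {"H"},
    {"Fi"},
]

def purity(clusters, known):
    """Fraction of items that belong to the majority class of their cluster."""
    total = sum(len(cluster) for cluster in clusters)
    agree = sum(
        Counter(known[item] for item in cluster).most_common(1)[0][1]
        for cluster in clusters
    )
    return agree / total

if __name__ == "__main__":
    print(purity(CUT_AT_4, FAMILY))
```

Purity alone rewards many small clusters (every singleton is pure), so in practice it is usually read together with the number of clusters.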


Hierarchical clustering: properties
• Huge memory requirements: stores the n × n matrix
• Running time: O(n^3)
• Deterministic: produces the same clustering each time
• Nice visualization: dendrogram
• Number of clusters can be selected using the dendrogram

K-means clustering
• Split the data into k random clusters
• Repeat
  – calculate the centroid of each cluster
  – (re-)assign each gene/experiment to the closest centroid
  – stop if no new assignments are made
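The steps above can be sketched as follows. The function names are illustrative, and two details are assumptions that the slide leaves open: the random split deals points round-robin so that no cluster starts empty, and a cluster that loses all its points keeps its previous centroid.

```python
import random

def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def centroid(members):
    """Coordinate-wise mean of a non-empty list of points."""
    return tuple(sum(coords) / len(members) for coords in zip(*members))

def random_labels(n, k, seed=None):
    """Split n items into k random clusters of (almost) equal size."""
    order = list(range(n))
    random.Random(seed).shuffle(order)
    labels = [0] * n
    for position, index in enumerate(order):
        labels[index] = position % k
    return labels

def kmeans(points, k, labels, max_iter=100):
    """Run k-means from an initial labelling; returns (labels, centroids)."""
    labels = list(labels)
    if set(labels) != set(range(k)):
        raise ValueError("every cluster needs at least one point to start with")
    centroids = [None] * k
    for _ in range(max_iter):
        # Calculate the centroid of each cluster.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # an emptied cluster keeps its previous centroid
                centroids[c] = centroid(members)
        # (Re-)assign each point to the closest centroid.
        new_labels = [
            min(range(k), key=lambda c: sq_dist(p, centroids[c])) for p in points
        ]
        if new_labels == labels:  # no new assignments: stop
            break
        labels = new_labels
    return labels, centroids

if __name__ == "__main__":
    pts = [(0, 0), (1, 0), (2, 0), (10, 0), (11, 0), (12, 0)]
    print(kmeans(pts, 2, random_labels(len(pts), 2, seed=0)))
```

Passing the initial labels in explicitly keeps the random part separate from the iteration, which makes a run easy to repeat with a known starting point.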


Example of K-means: two dimensions

[Figure sequence: points in two dimensions, K = 2, centroids marked with x]
• Initial clusters
• Iteration 1: calculate centroids, then (re-)assign
• Iteration 2: calculate centroids, then (re-)assign
• Iteration 3: calculate centroids, then (re-)assign. No new assignments! STOP

K-means: properties
• Low memory usage
• Running time: O(n) per iteration (for a fixed number of clusters k and a fixed number of dimensions)
• Improves iteratively: not trapped in previous mistakes
• Non-deterministic: will in general produce different clusters with different initializations
• Number of clusters must be decided in advance
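Because the result depends on the random initialization, a common remedy is to run k-means several times and keep the best run. The sketch below does this under stated assumptions: the names are hypothetical, and "best" is taken to mean the lowest within-cluster sum of squared distances, a criterion the slides do not specify.

```python
import random

def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans_once(points, k, rng, max_iter=100):
    """One k-means run from a random split; returns (score, labels)."""
    order = list(range(len(points)))
    rng.shuffle(order)
    labels = [0] * len(points)
    for position, index in enumerate(order):
        labels[index] = position % k  # round-robin: no cluster starts empty
    centroids = [None] * k
    for _ in range(max_iter):
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # an emptied cluster keeps its previous centroid
                centroids[c] = tuple(sum(x) / len(members) for x in zip(*members))
        new_labels = [
            min(range(k), key=lambda c: sq_dist(p, centroids[c])) for p in points
        ]
        if new_labels == labels:
            break
        labels = new_labels
    # Within-cluster sum of squares: lower means tighter clusters.
    score = sum(sq_dist(p, centroids[lab]) for p, lab in zip(points, labels))
    return score, labels

def best_of(points, k, runs=10, seed=0):
    """Repeat k-means `runs` times and keep the run with the lowest score."""
    rng = random.Random(seed)
    return min((kmeans_once(points, k, rng) for _ in range(runs)),
               key=lambda result: result[0])

if __name__ == "__main__":
    pts = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0),
           (10.0, 0.0), (11.0, 0.0), (12.0, 0.0)]
    print(best_of(pts, 2))
```

The score is only comparable between runs with the same k, since adding clusters always lowers it.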


Hierarchical vs. k-means
• Hierarchical clustering:
  – computationally expensive -> relatively small data sets
  – nice visualization, no. of clusters can be selected
  – deterministic
  – cannot correct early "mistakes"
• K-means:
  – computationally efficient -> large data sets
  – predefined no. of clusters
  – non-deterministic -> should be run several times
  – iterative improvement
• Hierarchical k-means: top-down hierarchical clustering using k-means iteratively with k=2 -> best of both worlds!
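The hierarchical k-means idea can be sketched as a recursive split: run k-means with k=2 on the whole data set, then on each half, and so on. The names below are illustrative, and the stopping rule (stop when a cluster has at most `max_leaf` points) is an assumption, since the slide does not say when to stop splitting.

```python
import random

def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def two_means(points, rng, max_iter=100):
    """Split points into two groups with k-means (k=2) from a random split."""
    order = list(range(len(points)))
    rng.shuffle(order)
    labels = [0] * len(points)
    for position, index in enumerate(order):
        labels[index] = position % 2
    centroids = [None, None]
    for _ in range(max_iter):
        for c in (0, 1):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # an emptied cluster keeps its previous centroid
                centroids[c] = tuple(sum(x) / len(members) for x in zip(*members))
        new_labels = [
            0 if sq_dist(p, centroids[0]) <= sq_dist(p, centroids[1]) else 1
            for p in points
        ]
        if new_labels == labels:
            break
        labels = new_labels
    left = [p for p, lab in zip(points, labels) if lab == 0]
    right = [p for p, lab in zip(points, labels) if lab == 1]
    return left, right

def bisect(points, rng, max_leaf=2):
    """Top-down tree: leaves are lists of points, inner nodes are pairs."""
    if len(points) <= max_leaf:
        return list(points)
    left, right = two_means(points, rng)
    if not left or not right:  # degenerate split: treat as a leaf
        return list(points)
    return (bisect(left, rng, max_leaf), bisect(right, rng, max_leaf))

def leaves(tree):
    """Flatten a tree from bisect() into its list of leaf clusters."""
    if isinstance(tree, list):
        return [tree]
    return leaves(tree[0]) + leaves(tree[1])

if __name__ == "__main__":
    data = [(0,), (1,), (10,), (12,), (100,), (103,), (110,), (115,)]
    print(bisect(data, random.Random(0)))
```

The result is a tree like a dendrogram, but it is built from the top down and each split costs only a k-means run, so it scales to larger data sets than the agglomerative procedure.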


Example 1
• 96 normal and malignant lymphocyte samples
• Almost 20 000 cDNA clones
• Two sub-clusters of DLBCL were shown to include patients with significantly different expected survival time!

Alizadeh et al., Distinct types of diffuse large B-cell lymphoma identified by gene expression profiling, Nature, 403:503-511, 2000.

Example 2

Iyer et al., The transcriptional program in the response of human fibroblasts to serum, Science, 283(5398): 83-87, 1999.

The mRNA levels of 8613 human genes were measured in fibroblasts at 12 time points from 0 minutes to 24 hours. 517 genes whose expression changed substantially in response to serum were selected.

[Figure panels: Expression clusters; Functional clusters]

Example 3

Transcriptional profiling of the cell cycle in human fibroblasts using 6,800 genes, sampled every other hour from 0 to 24 hours.
Biological processes with a significantly higher representation in certain clusters than would be expected by chance.

Cho et al., Transcriptional regulation and function during the human cell cycle, Nature Genetics, 27: 48-54, 2001.
