Reconstructing Regulatory Networks from Microarrays

Lecture 1 (Apr 26, 2011)
GENOME 541, Introduction to Computational Molecular Biology, Spring 2011
Instructor: Su-In Lee, University of Washington, Seattle

Outline
- Microarray data analysis
- Clustering approaches
- Beyond clustering
- Algorithms for learning regulatory networks
- Learning regulatory networks from genetically diverse set of individuals

Microarray gene expression data

[Figure: expression matrix with experiments (samples) indexed by i and genes indexed by j; color marks upregulated (induced) versus downregulated (repressed) genes.]

E_ij: RNA level of gene j in experiment i

Analyzing microarray data

Supervised learning problems: learn the mapping!
[Figure: genes by samples expression matrix paired with a clinical trait per sample (cancer/normal).]
- Gene signatures can provide a valuable diagnostic tool for clinical purposes.
- They can also help characterize basic cellular processes, such as metastasis and aging.

Unsupervised learning: learn the underlying model!
- Gene clustering can reveal cellular processes and their response to different conditions.
- Sample clustering can reveal phenotypically distinct populations, with clinical implications.

* Bhattacharjee et al. Classification of human lung carcinomas by mRNA expression profiling reveals distinct adenocarcinoma subclasses. PNAS (2001).
* van't Veer et al. Gene expression profiling predicts clinical outcome of breast cancer. Nature (2002).

Hierarchical clustering
- Compute all pairwise distances
- Merge the closest pair
- …

[Figure: expression profiles E1, E2, E3, … merged step by step into a tree.]

Distance measures:
1. Euclidean distance
2. (Pearson's) correlation coefficient
3. Etc.

Properties:
- Easy
- Depends on where to start the grouping
- The "tree" structure is troublesome to interpret
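The two steps above (compute all pairwise distances, then repeatedly merge the closest pair) can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the average-linkage rule, the toy `genes` profiles and the function names are not from the lecture, and the distance is taken to be one minus the Pearson correlation.

```python
import math

def pearson_distance(a, b):
    """Return 1 - Pearson correlation of two equal-length profiles.
    Assumes neither profile is constant (non-zero variance)."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return 1.0 - cov / math.sqrt(var_a * var_b)

def agglomerate(profiles, n_clusters, distance=pearson_distance):
    """Average-linkage agglomerative clustering; returns lists of row indices."""
    n = len(profiles)
    # Step 1: compute all pairwise distances once.
    d = {(i, j): distance(profiles[i], profiles[j])
         for i in range(n) for j in range(i + 1, n)}

    def linkage(c1, c2):
        # Average distance between members of the two clusters.
        total = sum(d[(min(i, j), max(i, j))] for i in c1 for j in c2)
        return total / (len(c1) * len(c2))

    clusters = [[i] for i in range(n)]
    while len(clusters) > n_clusters:
        # Step 2: merge the closest pair of clusters.
        pairs = ((p, q) for p in range(len(clusters))
                 for q in range(p + 1, len(clusters)))
        a, b = min(pairs, key=lambda pq: linkage(clusters[pq[0]], clusters[pq[1]]))
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]  # b > a, so index a is still valid
    return clusters

# Hypothetical toy data: four genes measured in four experiments.
genes = [
    [1.0, 2.0, 3.0, 4.0],  # gene 0: rising
    [2.1, 3.9, 6.2, 8.0],  # gene 1: rising, larger scale
    [4.0, 3.1, 1.9, 1.0],  # gene 2: falling
    [8.0, 6.0, 4.1, 2.0],  # gene 3: falling, larger scale
]
clusters = agglomerate(genes, n_clusters=2)
print(clusters)
```

With a correlation-based distance the two rising genes group together despite their different scales; a Euclidean distance would weigh the scale difference as well.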

K-means clustering

[Figure: genes 1 to N plotted as points in the space of experiments E1 and E2.]

- Overall optimization
- How many clusters (k)?
- How to initialize?
- Local minima

Generally, heuristic methods have no established means to determine the "correct" number of clusters or to choose the "best" algorithm.
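As a rough sketch of the procedure, the standard Lloyd iteration for k-means alternates an assignment step and a centroid update until the labels stop changing. The toy points, the choice of initial centroids and the function name `kmeans` are assumptions made for illustration; the result depends on the initialization, which is the local-minima issue listed above.

```python
def kmeans(points, centroids, n_iter=100):
    """Lloyd's algorithm with squared Euclidean distance.
    Returns (labels, centroids); the initial centroids are supplied by the caller."""
    centroids = [list(c) for c in centroids]
    labels = None
    for _ in range(n_iter):
        # Assignment step: each point goes to its nearest centroid.
        new_labels = [
            min(range(len(centroids)),
                key=lambda k: sum((x - c) ** 2 for x, c in zip(p, centroids[k])))
            for p in points
        ]
        if new_labels == labels:
            break  # converged: assignments did not change
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for k in range(len(centroids)):
            members = [p for p, lab in zip(points, labels) if lab == k]
            if members:  # keep the old centroid if a cluster becomes empty
                centroids[k] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centroids

# Hypothetical toy data: six genes described by two experiments (E1, E2).
points = [[0.0, 0.1], [0.2, 0.0], [0.1, 0.2],
          [3.0, 3.1], [3.2, 2.9], [2.9, 3.0]]
labels, centers = kmeans(points, centroids=[points[0], points[3]])
print(labels, centers)
```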

Clustering expression profiles
- Co-regulated genes cluster together
- Infer gene function

Limitations:
- No explanation of what caused the expression of each gene (no regulatory mechanism)

Beyond clustering
- Cluster: set of genes with similar expression profiles
- Regulatory module: set of genes with a shared regulatory mechanism
- Goal: an automatic method for identifying candidate modules and their regulatory mechanism

Inferring regulatory networks

"Expression data": measurements of the mRNA levels of all genes, e_1, …, e_Q, across experimental conditions, with Q ≈ 2×10^4 (for human).

- Infer the regulatory network that controls gene expression
  - Causal relationships among e_1, …, e_Q
  - A and B regulate the expression of C (A and B are regulators of C)
- Bayesian networks

[Figure: expression matrix of genes e_1, …, e_Q by experimental conditions, next to a small network in which A and B point to C.]

Regulatory network
- Bayesian network representation
  - X_i: expression level of gene i
  - Val(X_i): continuous
- Interpretation
  - Conditional independence
  - Causal relationships
- Conditional probability distribution (CPD)?

[Figure: Bayesian network over genes X1 to X6.]

CPD for discrete expression level

After discretizing the expression levels to "high" and "low":
- Parameters: the probability values in every entry

Table CPD for X5 given X3 and X4:

                      X5=high   X5=low
X3=high, X4=high      0.3       0.7
X3=high, X4=low       0.95      0.05
X3=low,  X4=high      0.1       0.9
X3=low,  X4=low       0.2       0.8

[Figure: Bayesian network over X1 to X6, with the table attached to X5 as its parameters.]
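A table CPD is just a lookup from a parent configuration to a distribution over the child. The sketch below stores the table above as a dictionary; the names `cpd_x5` and `p_x5` are illustrative and not part of the lecture.

```python
# Table CPD P(X5 | X3, X4) from the slide, keyed by (X3, X4).
cpd_x5 = {
    ("high", "high"): {"high": 0.3, "low": 0.7},
    ("high", "low"): {"high": 0.95, "low": 0.05},
    ("low", "high"): {"high": 0.1, "low": 0.9},
    ("low", "low"): {"high": 0.2, "low": 0.8},
}

def p_x5(x5, x3, x4):
    """Return P(X5 = x5 | X3 = x3, X4 = x4) by table lookup."""
    return cpd_x5[(x3, x4)][x5]

# Each row is a distribution, so a binary child needs one free
# parameter per parent configuration: 4 in total here.
n_free_parameters = len(cpd_x5) * (2 - 1)

print(p_x5("high", x3="high", x4="low"))
```

The number of rows doubles with every additional binary parent, which is one reason the later slides move to more compact CPDs.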

Context specificity of gene expression (tree CPD)

[Figure: RNA level of the target gene X5 as a function of what is bound to its upstream region; the CPD of X5 in the network over X1 to X6 is marked "?".]
- Context A: basal expression level
- Context B: the activator (X3) at the activator binding site induces expression
- Context C: the activator (X3) and the repressor (X4) at their binding sites decrease expression

Context specificity of gene expression

- Context A: basal expression level
- Context B: activator (X3) induces expression
- Context C: activator (X3) + repressor (X4) decrease expression

Tree CPD for the target gene X5:
- X3 ↑ false: Context A, P(Level) centered near 0
- X3 ↑ true:
  - X4 ↑ false: Context B, P(Level) centered near 3
  - X4 ↑ true: Context C, P(Level) centered near -3

Continuous-valued expression I
- Tree conditional probability distributions (CPD)
  - Parameters: mean (μ) and variance (σ²) of the normal distribution in each context
  - Represents combinatorial and context-specific regulation

Tree CPD for X5, with the parameters at the leaves:
- X3 ↑ false: Context A, (μA, σA), P(Level) centered near 0
- X3 ↑ true, X4 ↑ false: Context B, (μB, σB), P(Level) centered near 3
- X3 ↑ true, X4 ↑ true: Context C, (μC, σC), P(Level) centered near -3

[Figure: Bayesian network over X1 to X6, with the tree attached to X5.]
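A tree CPD can be evaluated by walking the tree to a leaf and then using that leaf's normal distribution. The sketch below assumes leaf means of 0, 3 and -3 (read off the slide's axis labels), unit standard deviations, and a threshold of 0 for deciding that a regulator is "up"; the standard deviations, the threshold and all function names are assumptions.

```python
import math

def normal_pdf(x, mu, sigma):
    """Density of N(mu, sigma^2) at x."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))

# Leaf parameters (mu, sigma) per context; sigma = 1.0 is a placeholder.
LEAVES = {"A": (0.0, 1.0), "B": (3.0, 1.0), "C": (-3.0, 1.0)}

def context(x3, x4, threshold=0.0):
    """Walk the tree: first test the activator X3, then the repressor X4."""
    if not x3 > threshold:
        return "A"  # activator not up: basal expression
    if not x4 > threshold:
        return "B"  # activator up, repressor not up: induced
    return "C"      # activator and repressor up: decreased

def tree_cpd_density(x5, x3, x4):
    """P(X5 = x5 | X3 = x3, X4 = x4) under the tree CPD."""
    mu, sigma = LEAVES[context(x3, x4)]
    return normal_pdf(x5, mu, sigma)

print(context(1.2, -0.4), tree_cpd_density(3.0, 1.2, -0.4))
```

Only three (μ, σ) pairs are needed here, one per context, instead of one entry per joint configuration of the parents.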

Continuous-valued expression II
- Linear Gaussian CPD
  - Parameters: weights w_1, …, w_N associated with the parents (regulators) X_1, …, X_N

For the target X5 with parents Par_5:

$P(X_5 \mid \mathrm{Par}_5 : w) = \mathcal{N}\left(\sum_i w_i x_i,\ \varepsilon^2\right)$

[Figure: parents X1, X2, …, XN connected to X5 with weights w1, w2, …, wN; P(Level) is a single normal density.]
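For a concrete reading of the linear Gaussian CPD, the log-density of a target value is a normal log-density whose mean is the weighted sum of the parents. The weights, parent values and noise level below are hypothetical.

```python
import math

def linear_gaussian_logpdf(x, parents, weights, eps):
    """log N(x; sum_i w_i * parent_i, eps^2)."""
    mean = sum(w * p for w, p in zip(weights, parents))
    return -0.5 * math.log(2.0 * math.pi * eps ** 2) - (x - mean) ** 2 / (2.0 * eps ** 2)

# Hypothetical example: an activator with weight 0.5 and a repressor with weight -3.
weights = [0.5, -3.0]
parents = [2.0, 1.0]            # observed regulator expression levels
mean = 0.5 * 2.0 + -3.0 * 1.0   # = -2.0
logp = linear_gaussian_logpdf(-2.0, parents, weights, eps=1.0)
print(logp)
```

Unlike a tree CPD, this form has one parameter per parent and treats regulator effects as additive.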

Learning
- Structure learning [Koller & Friedman]
  - Constraint-based approaches
  - Score-based approaches
  - Bayesian model averaging
- Score-based learning: given the set of all possible network structures and a scoring function that measures how well a model fits the observed data, we try to select the highest-scoring network structure.
- Scoring function
  - Likelihood score
  - Bayesian score
  - Regularization

[Figure: Bayesian network over X1 to X6.]

Scoring functions
- Let S: structure, Θ_S: parameters for S, D: data
- Likelihood score
- How to overcome overfitting?
  - Reduce the complexity of the model
  - Bayesian score; regularization

[Figure: structure S, a Bayesian network over X1 to X6.]
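The formulas on this slide did not survive extraction. For orientation, the standard textbook forms of the three scores are written out below in the slide's notation (S, Θ_S, D). They follow the usual definitions in Koller & Friedman rather than the slide itself, and the penalty R with weight λ in the last line is an assumed generic regularizer.

```latex
% Likelihood score: fit of the best parameters for structure S
\mathrm{score}_L(S : D) = \max_{\Theta_S} \log P(D \mid S, \Theta_S)

% Bayesian score: parameters are integrated out instead of maximized
\mathrm{score}_B(S : D) = \log P(D \mid S) + \log P(S),
\qquad
P(D \mid S) = \int P(D \mid S, \Theta_S)\, P(\Theta_S \mid S)\, d\Theta_S

% Regularized likelihood: penalize parameter complexity explicitly
\mathrm{score}_R(S : D) = \max_{\Theta_S} \left[ \log P(D \mid S, \Theta_S) - \lambda\, R(\Theta_S) \right]
```

The likelihood score never decreases when an edge is added, which is why it overfits; the other two scores trade fit against model complexity.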

Modularity of regulatory networks
- Genes tend to be co-regulated with others by the same factors.
  - Biologically more relevant
  - More compact representation
    - Smaller number of parameters
    - Reduced search space for structure learning
- Candidate regulators
  - A fixed set of genes that can be parents of other modules.
- Same module ⇒ share the CPD

[Figure: X1 to X6 grouped into Module 1, Module 2 and Module 3.]

The module networks concept

[Figure: a regulation program combined with the Module 3 genes gives the target gene expression (induced or repressed). The program is shown both as a tree CPD that splits on activator X3 ↑ and repressor X4 ↑ (true/false branches) and as a linear CPD: 0.5 × activator (X3) expression + (-3) × repressor (X4) expression.]

Goal
- Identify modules (member genes)
- Discover the module regulation program

[Figure: expression matrix of genes (X's) by experiments with candidate regulators; an example regulatory program with tests on Heat Shock?, CMK1 and HAP4 ↑; the module network over X1 to X6 (Module 1, Module 2, Module 3), where a tree CPD with leaf parameters such as (μA, σA) gives P(Level) near 0, 3 and -3 in contexts A, B and C.]

Learning
- Structure learning
  - Find the structure that maximizes the structural score (Bayesian score or via regularization)
- Expectation Maximization (EM) algorithm
  - M-step: given a partition of the genes into modules, learn the best regulation program (linear or tree CPD) for each module.
  - E-step: given the inferred regulation programs, reassign genes to modules such that the associated regulation program best predicts each gene's behavior.
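The alternation between the two steps can be illustrated with a heavily simplified hard-EM loop. This is not the lecture's algorithm: each module's "program" is reduced to a single regulator with one least-squares weight, squared error stands in for the structural score, and all names and data are hypothetical.

```python
def fit_program(members, regulators):
    """M-step for one module: choose the single regulator and weight
    that minimize squared error over all member genes."""
    best = None
    for name, r in regulators.items():
        rr = sum(v * v for v in r)
        w = sum(rv * gv for g in members for rv, gv in zip(r, g)) / (len(members) * rr)
        sse = sum((gv - w * rv) ** 2 for g in members for rv, gv in zip(r, g))
        if best is None or sse < best[0]:
            best = (sse, name, w)
    return best[1], best[2]

def sq_error(gene, program, regulators):
    """Squared error of predicting one gene from a module program."""
    name, w = program
    return sum((gv - w * rv) ** 2 for rv, gv in zip(regulators[name], gene))

def hard_em(genes, regulators, assignment, n_modules, max_iter=50):
    programs = {}
    for _ in range(max_iter):
        # M-step: learn a program for every non-empty module.
        for m in range(n_modules):
            members = [g for g, a in zip(genes, assignment) if a == m]
            if members:
                programs[m] = fit_program(members, regulators)
        # E-step: move each gene to the module that predicts it best.
        new_assignment = [
            min(programs, key=lambda m: sq_error(g, programs[m], regulators))
            for g in genes
        ]
        if new_assignment == assignment:
            break
        assignment = new_assignment
    return assignment, programs

# Hypothetical data: two regulators, four genes, four experiments.
regulators = {"R1": [1.0, -1.0, 1.0, -1.0], "R2": [1.0, 1.0, -1.0, -1.0]}
genes = [
    [2.0, -2.0, 2.0, -2.0],   # follows 2 * R1
    [1.9, -2.1, 2.0, -2.0],   # follows 2 * R1 with noise
    [-1.0, -1.0, 1.0, 1.0],   # follows -1 * R2
    [-1.1, -0.9, 1.0, 1.1],   # follows -1 * R2 with noise
]
# Start from a deliberately mixed partition.
assignment, programs = hard_em(genes, regulators, [0, 1, 0, 1], n_modules=2)
print(assignment, programs)
```

Starting from the mixed partition, the first pass gives both modules the same regulator, the reassignment then separates the two gene groups, and the second pass recovers a different regulator for each module.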

Learning regulatory network
- Iterative procedure
  - Cluster genes into modules (E-step)
  - Learn a regulatory program for each module (tree model) (M-step)
- Maximum increase in the structural score (Bayesian)

[Figure: candidate regulators and modules (M1, M22); genes shown include KEM1, MKT1, MSK1, ECM18, UTH1, DHH1, ASG7, MEC3, GPA1, HAP4, MFA1, TEC1, HAP1, PHO3, SGS1, SEC59, PHO5, RIM15, PHO84, PHM6, PHO2, PHO4, SAS5, SPL2, GIT1 and VTC3.]

From expression to regulation (M-step)
- Combinatorial search over the space of trees

[Figure: expression of the module genes PHO3, PHO5, PHO84, PHM6, PHO2, PHO4, GIT1 and VTC3, first with the arrays in their original order and then with the arrays sorted according to the expression of HAP4; the tree splits on HAP4 ↑.]

Segal et al. Nat Genet 2003

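From expression to regulation (M-step)

[Figure: the same module genes (PHO3, PHO5, PHO84, PHM6, PHO2, PHO4, GIT1, VTC3) with a second regulator, SIP4, added next to HAP4; the tree now has splits on HAP4 ↑ and SIP4 ↑.]

Segal et al. Nat Genet 2003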

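Learning control programs

Score: $\log P(M \mid D)$, with the marginal-likelihood term

$\log \int P(D \mid M, \mu, \sigma)\, P(\mu, \sigma)\, d\mu\, d\sigma$

Score of the HAP4 split: $\log P(M \mid D)$, with the marginal-likelihood term summed over the two sides of the split

$\log \int P(D_{\mathrm{HAP4}\uparrow} \mid M, \mu, \sigma)\, P(\mu, \sigma)\, d\mu\, d\sigma + \log \int P(D_{\mathrm{HAP4}\downarrow} \mid M, \mu, \sigma)\, P(\mu, \sigma)\, d\mu\, d\sigma$

[Figure: the tree with a HAP4 ↑ split and the expression distribution at each leaf.]

Segal et al. Nat Genet 2003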

Review: Learning regulatory network
- Iterative procedure
  - Cluster genes into modules (E-step)
  - Learn a regulatory program for each module (M-step)
- Maximum increase in Bayesian score

[Figure: candidate regulators and modules (M1, M22), with a learned tree program that splits on HAP4 ↑ and MSK1 ↑ (true/false branches); genes shown include KEM1, MSK1, DHH1, MKT1, ECM18, UTH1, ASG7, MEC3, GPA1, HAP4, MFA1, TEC1, HAP1, PHO3, SGS1, SEC59, PHO5, RIM15, PHO84, PHM6, PHO2, PHO4, SAS5, SPL2, GIT1 and VTC3.]

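Feature selection via regularization
- Assume a linear Gaussian CPD: $P(E_{\mathrm{Targets}} \mid x : w) = \mathcal{N}\left(\sum_i w_i x_i,\ \varepsilon^2\right)$
- MLE: solve $\max_w\; -\left(\sum_i w_i x_i - E_{\mathrm{Targets}}\right)^2$
- Candidate regulators (features) x_1, …, x_N with weights (parameters) w_1, …, w_N. Yeast: 350 genes; mouse: 700 genes.
- Problem: this objective learns too many regulators

[Figure: regulatory network in which x_1, x_2, …, x_N point to E_Targets with weights w_1, w_2, …, w_N.]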

L1 regularization
- "Select" a subset of regulators
  - Combinatorial search?
  - Effective feature selection algorithm: L1 regularization (LASSO) [Tibshirani, J. Royal. Statist. Soc B. 1996]
- $\min_w\; \left(\sum_i w_i x_i - E_{\mathrm{Targets}}\right)^2 + \sum_i C\,|w_i|$: convex optimization!
  ⇒ Induces sparsity in the solution w (many w_i's set to zero)
- Model: $P(E_{\mathrm{Targets}} \mid x : w) = \mathcal{N}\left(\sum_i w_i x_i,\ \varepsilon^2\right)$
- Candidate regulators (features). Yeast: 350 genes; mouse: 700 genes.

[Figure: x_1, x_2, …, x_N point to E_Targets with weights w_1, w_2, …, w_N.]
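One common way to solve this objective is cyclic coordinate descent with soft-thresholding; the lecture does not say which solver was used, so the sketch below is an assumed choice. It minimizes the squared error summed over experiments plus C times the sum of |w_i|, on hypothetical data built so that the answer can be checked by hand (the regulator columns are orthogonal).

```python
def soft_threshold(rho, t):
    """Shrink rho toward zero by t; values within [-t, t] become exactly 0."""
    if rho > t:
        return rho - t
    if rho < -t:
        return rho + t
    return 0.0

def lasso(X, y, C, n_sweeps=200):
    """Minimize sum_t (sum_i w_i * X[t][i] - y[t])^2 + C * sum_i |w_i|
    by cyclic coordinate descent. Rows of X are experiments."""
    n_features = len(X[0])
    w = [0.0] * n_features
    for _ in range(n_sweeps):
        for j in range(n_features):
            # Correlation of feature j with the residual that ignores feature j.
            rho = sum(
                row[j] * (yt - sum(w[i] * row[i] for i in range(n_features) if i != j))
                for row, yt in zip(X, y)
            )
            z = sum(row[j] ** 2 for row in X)
            # Setting the subgradient to zero gives a threshold of C / 2.
            w[j] = soft_threshold(rho, C / 2.0) / z
    return w

# Hypothetical data: 4 experiments, 3 candidate regulators (orthogonal columns).
X = [[1.0, 1.0, 1.0],
     [1.0, -1.0, 1.0],
     [-1.0, 1.0, 1.0],
     [-1.0, -1.0, 1.0]]
# Target = 2 * regulator 0 + 0.1 * regulator 1.
y = [2.1, 1.9, -1.9, -2.1]

w_dense = lasso(X, y, C=0.0)   # plain least squares: keeps the weak regulator
w_sparse = lasso(X, y, C=2.0)  # L1 penalty: weak regulator is set to zero
print(w_dense, w_sparse)
```

With C = 0 the weak regulator keeps a weight near 0.1; with C = 2 it is set exactly to zero while the strong regulator is shrunk from 2 to 1.75, which illustrates both the sparsity and the shrinkage bias of the penalty.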

Learning linear regulatory network
- Iterative procedure
  - Cluster genes into modules
  - Learn a regulatory program for each module
- L1 regularized optimization: $\min_w\; \left(\sum_i w_i x_i - E_{\mathrm{Targets}}\right)^2 + \sum_i C\,|w_i|$
- Is this predicted relationship "real"?

[Figure: a module's expression written as a weighted sum of regulators (weights -3, 0.5 and -1.2 appear next to GPA1, MFA1 and M120); modules shown include M1, M22, M120, M321 and M1011; genes shown include ECM18, UTH1, TEC1, ASG7, MEC3, GPA1, MFA1, HAP1, PHO3, SGS1, SEC59, PHO5, RIM15, SAS5, PHO84, PHM6, PHO2, PHO4, SPL2, GIT1 and VTC3.]

Lee et al., PLoS Genet 2009

Statistical evaluation
- Cross-validation test
  - Divide the data (experiments) into training and test data
  - Compute the likelihood function for the test data
- "Test likelihood": how well does the learned regulatory network fit the test data?

[Figure: genes by experiments matrix with a block of experiments held out as "test data".]

Module evaluation criteria
- Are the module genes functionally coherent?
- Do the regulators have regulatory roles in the predicted conditions?
- Are the genes in the module known targets of the predicted regulators?
- Are the regulators consistent with the cis-regulatory motifs found in promoters of the module genes?

Functional coherence
- Compare each module (e.g. Module 1) with known functional categories (e.g. cholesterol synthesis):
  - Gene ontology
  - Predicted targets of TFs
  - Sharing TF binding sites
- How significant is the overlap?
  - Calculate P(# overlap ≥ k | K, n, N, two groups are independent) based on the hypergeometric distribution

[Figure: a module and a known functional category drawn as overlapping gene sets, labeled m, n, k and N.]
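The tail probability above can be computed directly from binomial coefficients. The sketch assumes the following reading of the symbols, which the slide does not spell out: N genes in total, K of them in the functional category, n genes in the module, and k genes in the overlap.

```python
from math import comb

def hypergeom_tail(k, K, n, N):
    """P(overlap >= k) when n genes are drawn without replacement
    from N genes, K of which belong to the category."""
    total = comb(N, n)
    hits = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, min(K, n) + 1))
    return hits / total

# Small hand-checkable case: 10 genes, 4 annotated, module of 3, overlap of at least 2.
# (C(4,2)*C(6,1) + C(4,3)*C(6,0)) / C(10,3) = (36 + 4) / 120 = 1/3
p_value = hypergeom_tail(k=2, K=4, n=3, N=10)
print(p_value)
```

A small value means that an overlap this large would rarely arise if module membership and category membership were independent.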

Module functional coherence

[Figure: bar chart of coherence (%), from 0 to 100, for modules 1 to 50. 26 modules are >60% coherent; 41 modules are >40% coherent.]

- Metabolic: AA, respiration, glycolysis, galactose
- Stress: oxidative stress, osmotic stress
- Cellular localization: nucleus, ER
- Cellular processes: cell cycle, sporulation, mating
- Molecular functions: protein folding, RNA & DNA processing, trafficking

Example: Respiration Module
- HAP4 known to upregulate oxidative phosphorylation
- HAP4, MSN4, XBP1 known to be regulators under predicted conditions
- HAP4 binding site found in 39/55 genes
- MSN4 binding site found in 28/55 genes

Any drawbacks?
- Many models can get similar scores. Which one would you choose?
- A gene can be involved in multiple modules.
- Is it possible to incorporate prior knowledge?

Structural learning via bootstrapping
- Many networks achieve similar scores (e.g. 33.15, 33.10, 32.99)
- Which one is the most robust?
  - Estimate the robustness of each network or each edge. How?
  - Learn the networks from multiple datasets.

Bootstrapping
- Sampling with replacement: from the original data (genes by experiments), draw bootstrap data 1, data 2, …, data N and learn a network from each.
- Estimated confidence of each edge i = (# networks that contain the edge) / (total # networks (N))

[Figure: network with an estimated confidence on each edge: 0.8, 0.31, 0.75, 0.26, 0.56, 0.43, 0.73.]

Inferring sub-networks from perturbed expression profiles, Pe'er et al. Bioinformatics 2001
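The confidence estimate can be sketched as follows. The network learner here is a deliberate stand-in (an edge is declared when the absolute Pearson correlation exceeds a threshold), not the Bayesian network learner of the cited paper; the data, the threshold and the function names are all hypothetical.

```python
import math
import random

def pearson(a, b):
    """Pearson correlation; returns 0.0 if either profile is constant."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    va = sum((x - ma) ** 2 for x in a)
    vb = sum((y - mb) ** 2 for y in b)
    if va == 0 or vb == 0:
        return 0.0
    return cov / math.sqrt(va * vb)

def learn_edges(samples, threshold=0.8):
    """Stand-in learner: undirected edge (i, j) if |correlation| > threshold."""
    columns = list(zip(*samples))  # one profile per gene
    return {(i, j)
            for i in range(len(columns)) for j in range(i + 1, len(columns))
            if abs(pearson(columns[i], columns[j])) > threshold}

def bootstrap_confidence(samples, n_boot=100, seed=0):
    """Fraction of bootstrap networks that contain each edge."""
    rng = random.Random(seed)
    counts = {}
    for _ in range(n_boot):
        # Resample experiments with replacement, same size as the original data.
        resample = [rng.choice(samples) for _ in samples]
        for edge in learn_edges(resample):
            counts[edge] = counts.get(edge, 0) + 1
    return {edge: c / n_boot for edge, c in counts.items()}

# Hypothetical data: 8 experiments, 3 genes; gene 1 is exactly twice gene 0.
g0 = [0.1, 0.5, 0.9, 1.3, 1.7, 2.1, 2.5, 2.9]
g2 = [1.0, -1.0, 0.5, -0.5, 0.8, -0.8, 0.2, -0.2]
samples = [(a, 2 * a, c) for a, c in zip(g0, g2)]
confidence = bootstrap_confidence(samples)
print(confidence)
```

Edges that appear in nearly every bootstrap network are robust to the particular set of experiments, while edges with low confidence depend on a few samples.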

Overlapping processes
- The living cell is a complex system
  - Example: the cell cycle
  - Genes functionally relevant to cell cycle regulation in the specific cell cycle phase
- Genes have a regulatory mechanism
- Genes participate in multiple processes
- Mutually exclusive clustering as a common approach to analyzing gene expression
  - (+) genes likely to share a common function
  - (-) groups genes into mutually exclusive clusters
  - (-) no information about genes' relation to one another

[Partial figure from Maaser and Borlak, Oct. 2008]

Decomposition of processes
- Model the expression level of a gene as a mixture of regulatory modules.
- Hard EM vs soft EM

[Figure: a gene's expression X as a combination, with weights w1, w2, …, wN, of regulatory modules whose tree programs split on HAP4 ↑, MSK1 ↑, DHH1 ↑, PUF3 ↑, HAP1 ↑ and HAP5 ↑ (true/false branches).]

Probabilistic discovery of overlapping cellular processes and their regulation, Battle et al. Journal of Computational Biology 2005
