Reconstructing Regulatory Networks from Microarrays
Lecture 1 – Apr 26, 2011
GENOME 541, Introduction to Computational Molecular Biology, Spring 2011
Instructor: Su-In Lee
University of Washington, Seattle
Outline
Microarray data analysis
Clustering approaches
Beyond clustering
Algorithms for learning regulatory networks
Learning regulatory networks from a genetically diverse set of individuals
Microarray gene expression data
$E_{ij}$: RNA level of gene $j$ in experiment $i$
[Figure: expression matrix indexed by experiments (samples) $i$ and genes $j$; the color scale runs from repressed (downregulated) to induced (upregulated).]
Analyzing microarray data
Supervised learning problems
  Learn the mapping! (genes × samples → clinical trait, e.g. cancer/normal)
  Gene signatures can provide a valuable diagnostic tool for clinical purposes.
  Can also help characterize basic cellular processes, such as metastasis and aging.
  * van't Veer et al. Gene expression profiling predicts clinical outcome of breast cancer. Nature (2002).
Unsupervised learning
  Learn the underlying model!
  Gene clustering can reveal cellular processes and their response to different conditions.
  Sample clustering can reveal phenotypically distinct populations, with clinical implications.
  * Bhattacharjee et al. Classification of human lung carcinomas by mRNA expression profiling reveals distinct adenocarcinoma subclasses. PNAS (2001).
Hierarchical clustering
Compute all pairwise distances
Merge closest pair
Distance measures
  1. Euclidean distance
  2. (Pearson’s) correlation coefficient
  3. Etc.
Easy
Depends on where to start the grouping
The “tree” structure is hard to interpret
[Figure: genes plotted in the space of experiments E1, E2, E3, …]
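The two steps on this slide (compute all pairwise distances, merge the closest pair) can be sketched in a few lines of Python. This is a minimal illustration, not the lecture's implementation: average linkage is an assumed choice (the slide does not name a linkage rule), and the function names are hypothetical.

```python
import math

def euclidean(a, b):
    """Euclidean distance between two expression profiles."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def pearson_distance(a, b):
    """1 - Pearson correlation; constant profiles are treated as uncorrelated."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    va = sum((x - ma) ** 2 for x in a)
    vb = sum((y - mb) ** 2 for y in b)
    if va == 0 or vb == 0:
        return 1.0
    return 1.0 - cov / math.sqrt(va * vb)

def hierarchical_clustering(profiles, distance=euclidean):
    """Agglomerative clustering with average linkage (an assumed choice).

    Returns the list of merges as (cluster_id_a, cluster_id_b, distance).
    Leaves have ids 0..n-1; each merge creates the next unused id.
    """
    clusters = {i: [i] for i in range(len(profiles))}
    merges = []
    next_id = len(profiles)
    while len(clusters) > 1:
        best = None
        ids = sorted(clusters)
        for p in range(len(ids)):
            for q in range(p + 1, len(ids)):
                a, b = ids[p], ids[q]
                # Average of all pairwise distances between the two clusters.
                d = sum(distance(profiles[i], profiles[j])
                        for i in clusters[a] for j in clusters[b])
                d /= len(clusters[a]) * len(clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        d, a, b = best
        clusters[next_id] = clusters.pop(a) + clusters.pop(b)
        merges.append((a, b, d))
        next_id += 1
    return merges
```

The returned merge list is the “tree”: reading it top to bottom replays the order in which genes were grouped, which also makes visible how early merges constrain every later one.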
K-means clustering
Overall optimization
Open issues
  How many clusters (k)?
  How to initialize?
  Local minima
Generally, heuristic methods have no established means to determine the “correct” number of clusters and to choose the “best” algorithm.
[Figure: genes 1, …, N plotted in the space of experiments E1 and E2.]
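A minimal sketch of k-means (Lloyd's algorithm) makes the three open issues concrete: `k` must be supplied, the starting centers are drawn at random, and the loop stops at whatever local minimum it reaches. The function name, the random initialization, and the seed parameter are assumptions made for illustration.

```python
import random

def kmeans(points, k, n_iter=100, seed=0):
    """Lloyd's algorithm on a list of equal-length numeric profiles.

    Initial centers are k distinct input points chosen at random
    (an assumed initialization; other choices give other local minima).
    """
    rng = random.Random(seed)
    centers = [list(p) for p in rng.sample(points, k)]
    assignment = None
    for _ in range(n_iter):
        # Assignment step: each point goes to its nearest center.
        new_assignment = [
            min(range(k),
                key=lambda c: sum((x - y) ** 2 for x, y in zip(p, centers[c])))
            for p in points
        ]
        if new_assignment == assignment:
            break  # converged to a (possibly local) minimum
        assignment = new_assignment
        # Update step: each center moves to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return assignment, centers
```

Running the function with several seeds and comparing the resulting within-cluster distances is one common way to probe the local-minima issue.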
Clustering expression profiles
Co-regulated genes cluster together
Infer gene function
Limitations: no explanation of what caused the expression of each gene (no regulatory mechanism)
Beyond Clustering
Cluster: set of genes with similar expression profiles
Regulatory module: set of genes with shared regulatory mechanism
Goal: automatic method for identifying candidate modules and their regulatory mechanism
Inferring regulatory networks
“Expression data”: measurement of mRNA levels of all genes
  [Figure: expression matrix of genes $e_1, \dots, e_Q$ across experimental conditions, with $Q \approx 2 \times 10^4$ (for human).]
Infer the regulatory network that controls gene expression
  Causal relationships among $e_1, \dots, e_Q$
  Example: A and B regulate the expression of C (A and B are regulators of C)
Bayesian networks
Regulatory network
Bayesian network representation
  $X_i$: expression level of gene $i$
  $\mathrm{Val}(X_i)$: continuous
Interpretation
  Conditional independence
  Causal relationships
Conditional probability distribution (CPD)?
[Figure: example Bayesian network over $X_1, \dots, X_6$.]
CPD for discrete expression level
After discretizing the expression levels to “high” and “low”…
Parameters: probability values in every entry
Table CPD for X5 given its parents X3 and X4

                      X5=high   X5=low
  X3=high, X4=high    0.3       0.7
  X3=high, X4=low     0.95      0.05
  X3=low,  X4=high    0.1       0.9
  X3=low,  X4=low     0.2       0.8

[Figure: example Bayesian network over X1, …, X6, with the table CPD attached to X5.]
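The table CPD above can be stored as a lookup keyed on the parent values. The sketch below uses the probabilities from the table; the dictionary layout and function names are assumptions for illustration.

```python
import random

# P(X5 = high | X3, X4), copied from the table CPD on the slide.
TABLE_CPD = {
    ("high", "high"): 0.3,
    ("high", "low"): 0.95,
    ("low", "high"): 0.1,
    ("low", "low"): 0.2,
}

def prob_x5(x5, x3, x4):
    """Return P(X5 = x5 | X3 = x3, X4 = x4)."""
    p_high = TABLE_CPD[(x3, x4)]
    return p_high if x5 == "high" else 1.0 - p_high

def sample_x5(x3, x4, rng):
    """Draw a value of X5 given its parents, using a random.Random instance."""
    return "high" if rng.random() < TABLE_CPD[(x3, x4)] else "low"
```

With two binary parents the table needs four free parameters; each additional binary parent doubles that count, which is one motivation for the more compact tree and linear CPDs.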
Tree CPD
Context specificity of gene expression
  Context A: basal expression level
  Context B: activator induces expression
  Context C: activator + repressor decrease expression
[Figure: upstream region of the target gene (X5) with an activator (X3) binding site and a repressor (X4) binding site; RNA level of X5 in each context.]
Tree representation of the contexts
  X3 ↑?
    false: Context A, P(Level) centered at 0
    true: X4 ↑?
      false: Context B, P(Level) centered at 3
      true: Context C, P(Level) centered at -3
Continuous-valued expression I
Tree conditional probability distributions (CPD)
Parameters: mean (µ) and variance (σ²) of the normal distribution in each context
Represents combinatorial and context-specific regulation
Tree CPD for X5
  X3 ↑?
    false: Context A, parameters $(\mu_A, \sigma_A)$, P(Level) centered at 0
    true: X4 ↑?
      false: Context B, parameters $(\mu_B, \sigma_B)$, P(Level) centered at 3
      true: Context C, parameters $(\mu_C, \sigma_C)$, P(Level) centered at -3
[Figure: example Bayesian network over X1, …, X6, with the tree CPD attached to X5.]
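A tree CPD can be read as "route the sample to a leaf, then score it under that leaf's normal distribution". The sketch below follows the tree on the slide (split on X3 ↑, then on X4 ↑). The leaf means 0, 3, and -3 are read off the figure; the unit standard deviations and the function names are assumptions.

```python
import math

def gaussian_logpdf(x, mu, sigma):
    """Log density of N(mu, sigma^2) at x."""
    return -0.5 * math.log(2 * math.pi * sigma ** 2) - (x - mu) ** 2 / (2 * sigma ** 2)

# Leaf parameters (mu, sigma) per context. Means follow the figure;
# sigma = 1.0 is an assumed placeholder value.
LEAVES = {"A": (0.0, 1.0), "B": (3.0, 1.0), "C": (-3.0, 1.0)}

def context(x3_up, x4_up):
    """Route a sample down the tree: X3 up? then X4 up?"""
    if not x3_up:
        return "A"                    # basal expression
    return "C" if x4_up else "B"      # repressed vs. induced

def tree_cpd_logpdf(level, x3_up, x4_up):
    """log P(X5 = level | X3, X4) under the tree CPD."""
    mu, sigma = LEAVES[context(x3_up, x4_up)]
    return gaussian_logpdf(level, mu, sigma)
```

Note that the X4 test is never evaluated when X3 is not up, which is exactly the context specificity the slide describes.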
Continuous-valued expression II
Linear Gaussian CPD
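Parameters: weights $w_1, \dots, w_N$ associated with the parents (regulators)
$P(X_5 \mid \mathrm{Par}_5 : w) = N\left(\sum_i w_i x_i,\ \varepsilon^2\right)$
[Figure: parents such as X3, X4, …, XN with weights $w_1, w_2, \dots, w_N$ pointing to X5; P(Level) is a normal distribution around the weighted sum.]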
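The linear Gaussian CPD is a one-line computation: the mean is the weighted sum of the parents' expression levels and the variance is fixed. A minimal sketch, with a hypothetical function name and made-up example values:

```python
import math

def linear_gaussian_logpdf(x_target, parents, weights, eps):
    """log P(x_target | parents : w) for N(sum_i w_i * x_i, eps^2)."""
    mean = sum(w * x for w, x in zip(weights, parents))
    return -0.5 * math.log(2 * math.pi * eps ** 2) - (x_target - mean) ** 2 / (2 * eps ** 2)

# Example: an activator with weight 0.5 and a repressor with weight -3
# (illustrative values) give a predicted mean of 0.5*2.0 + (-3.0)*1.0 = -2.0.
example = linear_gaussian_logpdf(-2.0, parents=[2.0, 1.0], weights=[0.5, -3.0], eps=1.0)
```

Compared with a tree CPD, this form has one parameter per parent but cannot express context-specific effects such as "the repressor matters only when the activator is present".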
Learning
Structure learning [Koller & Friedman]
  Constraint-based approaches
  Score-based approaches
  Bayesian model averaging
Given a set of all possible network structures and a scoring function that measures how well the model fits the observed data, we try to select the highest-scoring network structure.
Scoring function
  Likelihood score
  Bayesian score
  Regularization
[Figure: example Bayesian network over X1, …, X6.]
Scoring functions
Let $S$: structure, $\Theta_S$: parameters for $S$, $D$: data
Likelihood score
How to overcome overfitting?
  Reduce the complexity of the model
  Bayesian score; regularization
[Figure: structure S over X1, …, X6.]
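For reference, the three scores named on this slide are commonly written as follows. These are the standard textbook forms in the slide's notation ($S$, $\Theta_S$, $D$), not formulas recovered from the slide itself; the penalty weight $\lambda$ and the choice of an L1 penalty in the last line are assumptions.

```latex
\begin{align*}
\mathrm{Score}_L(S : D) &= \max_{\Theta_S} \log P(D \mid S, \Theta_S) \\
\mathrm{Score}_B(S : D) &= \log P(D \mid S) + \log P(S),
  \qquad P(D \mid S) = \int P(D \mid S, \Theta_S)\, P(\Theta_S \mid S)\, d\Theta_S \\
\mathrm{Score}_{\mathrm{reg}}(S : D) &= \max_{\Theta_S}
  \left[ \log P(D \mid S, \Theta_S) - \lambda \lVert \Theta_S \rVert_1 \right]
\end{align*}
```

The likelihood score never decreases when an edge is added, which is why it overfits. The Bayesian score averages over parameters instead of maximizing, and the regularized score charges for nonzero parameters; both therefore favor simpler structures.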
Modularity of regulatory networks
Genes tend to be co-regulated with others by the same factors.
  Biologically more relevant
  More compact representation
    Smaller number of parameters
    Reduced search space for structure learning
Candidate regulators
  A fixed set of genes that can be parents of other modules.
Same module ⇒ share the CPD
[Figure: genes X1, …, X6 grouped into Module 1, Module 2, and Module 3.]
The module networks concept
[Figure: a module network over X1, …, X6 with Modules 1 to 3. The regulation program of Module 3 is shown in two forms. Linear CPD: target gene expression is a weighted sum of activator (X3) expression and repressor (X4) expression, with weights 0.5 and -3 shown. Tree CPD: splits on activator X3 ↑ and repressor X4 ↑ (false/true branches), separating the conditions in which the Module 3 genes are induced from those in which they are repressed.]
Goal
  Identify modules (member genes)
  Discover module regulation program
[Figure: expression matrix of genes (X's) by experiments together with the candidate regulators; an example regulatory program splits on Heat Shock?, CMK1, and HAP4 ↑, and each leaf is a context (A, B, C) with its own P(Level), e.g. parameters $(\mu_A, \sigma_A)$ for Context A.]
Learning
Structure learning
  Find the structure that maximizes the structural score (Bayesian score or via regularization)
Expectation Maximization (EM) algorithm
  M-step: Given a partition of the genes into modules, learn the best regulation program (linear or tree CPD) for each module.
  E-step: Given the inferred regulation programs, reassign genes to modules such that the associated regulation program best predicts each gene’s behavior.
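The alternation between the two steps can be sketched as a hard-assignment loop. This is a deliberately simplified stand-in, not the lecture's algorithm: each module's "regulation program" here is a single regulator with one least-squares weight, and fit is measured by squared error rather than a Bayesian score. All names are hypothetical.

```python
def fit_program(genes, regulators):
    """M-step for one module: pick the single regulator r and weight w
    that minimize squared error over all genes in the module."""
    best = None
    for r, reg in enumerate(regulators):
        denom = sum(x * x for x in reg) * len(genes)
        if denom == 0:
            continue
        w = sum(x * y for g in genes for x, y in zip(reg, g)) / denom
        err = sum((y - w * x) ** 2 for g in genes for x, y in zip(reg, g))
        if best is None or err < best[0]:
            best = (err, r, w)
    return best[1], best[2]

def prediction_error(gene, program, regulators):
    """Squared error of a module's program when predicting one gene."""
    r, w = program
    return sum((y - w * x) ** 2 for x, y in zip(regulators[r], gene))

def hard_em(expr, regulators, assignment, n_iter=20):
    """Alternate M-step (fit programs) and E-step (reassign genes).

    expr: one expression profile per gene; regulators: one profile per
    candidate regulator; assignment: initial module index per gene.
    """
    k = max(assignment) + 1
    programs = [None] * k
    for _ in range(n_iter):
        # M-step: learn a program for each non-empty module.
        programs = []
        for m in range(k):
            members = [g for g, a in zip(expr, assignment) if a == m]
            programs.append(fit_program(members, regulators) if members else None)
        # E-step: move each gene to the module that predicts it best.
        new_assignment = [
            min((m for m in range(k) if programs[m] is not None),
                key=lambda m: prediction_error(g, programs[m], regulators))
            for g in expr
        ]
        if new_assignment == assignment:
            break
        assignment = new_assignment
    return assignment, programs
```

Because each gene is assigned to exactly one module, this is "hard" EM; the result depends on the initial partition, so in practice the starting assignment often comes from an ordinary clustering run.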
Learning regulatory network
Iterative procedure
  Cluster genes into modules (E-step)
  Learn a regulatory program for each module (tree model) (M-step)
    Maximum increase in the structural score (Bayesian)
[Figure: network of candidate regulators and modules (e.g. M1, M22); genes shown include KEM1, MKT1, MSK1, ECM18, UTH1, DHH1, ASG7, MEC3, GPA1, HAP4, MFA1, TEC1, HAP1, PHO3, SGS1, SEC59, PHO5, RIM15, PHO84, PHM6, PHO2, PHO4, SAS5, SPL2, GIT1, VTC3.]
From expression to regulation (M-step)
Combinatorial search over the space of trees
[Figure: expression of the module genes (PHO3, PHO5, PHO84, PHM6, PHO2, PHO4, GIT1, VTC3), first with arrays sorted in original order, then with arrays sorted according to expression of HAP4, which yields a split on HAP4 ↑. A further panel adds a split on SIP4 ↑.]
Segal et al. Nat Genet 2003
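Learning control programs
Score: $\log P(M \mid D)$, evaluated via the marginal likelihood
  $\log \int P(D \mid M, \mu, \sigma)\, P(\mu, \sigma)\, d\mu\, d\sigma$
Score of HAP4 split: $\log P(M \mid D)$, evaluated as the sum over the two sides of the split
  $\log \int P(D_{\mathrm{HAP4}\uparrow} \mid M, \mu, \sigma)\, P(\mu, \sigma)\, d\mu\, d\sigma + \log \int P(D_{\mathrm{HAP4}\downarrow} \mid M, \mu, \sigma)\, P(\mu, \sigma)\, d\mu\, d\sigma$
Segal et al. Nat Genet 2003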
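The split score compares "one normal distribution for all arrays" against "one normal distribution per side of the split". The sketch below keeps that structure but swaps in a simpler quantity: it uses the maximum-likelihood Gaussian log-likelihood instead of the marginal likelihood integral on the slide, so it is an approximation for illustration only. Function names, the threshold of 0, and the variance floor are assumptions.

```python
import math

def gaussian_loglik(values):
    """Log-likelihood of values under a normal fitted by maximum likelihood.

    A small variance floor avoids log(0) for near-constant data.
    """
    n = len(values)
    mu = sum(values) / n
    ss = sum((v - mu) ** 2 for v in values)
    var = max(ss / n, 1e-6)
    return -0.5 * n * math.log(2 * math.pi * var) - ss / (2 * var)

def split_gain(module_expr, regulator, threshold=0.0):
    """Gain in log-likelihood from splitting the arrays on regulator > threshold.

    module_expr: one row per module gene, one column per array.
    regulator: the regulator's expression, one value per array.
    """
    up = [row[a] for row in module_expr
          for a, r in enumerate(regulator) if r > threshold]
    down = [row[a] for row in module_expr
            for a, r in enumerate(regulator) if r <= threshold]
    if not up or not down:
        return 0.0
    return gaussian_loglik(up) + gaussian_loglik(down) - gaussian_loglik(up + down)

def best_split(module_expr, regulators):
    """Name of the candidate regulator with the largest split gain.

    regulators: dict mapping regulator name to its per-array expression.
    """
    return max(regulators, key=lambda name: split_gain(module_expr, regulators[name]))
```

Unlike the Bayesian score, this maximum-likelihood gain is never negative, so on its own it would keep splitting; the prior $P(\mu, \sigma)$ in the integral is what penalizes splits that leave too few arrays in a leaf.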
Review – Learning regulatory network
Iterative procedure
  Cluster genes into modules (E-step)
  Learn a regulatory program for each module (M-step)
    Maximum increase in Bayesian score
[Figure: network of candidate regulators and modules (e.g. M1, M22) with a tree regulation program that splits on HAP4 ↑ and MSK1 ↑ (false/true branches); genes shown include KEM1, MSK1, DHH1, MKT1, ECM18, UTH1, ASG7, MEC3, GPA1, HAP4, MFA1, TEC1, HAP1, PHO3, SGS1, SEC59, PHO5, RIM15, PHO84, PHM6, PHO2, PHO4, SAS5, SPL2, GIT1, VTC3.]
Feature selection via regularization
Assume linear Gaussian CPD
  $P(E_{\mathrm{Targets}} \mid x : w) = N\left(\sum_i w_i x_i,\ \varepsilon^2\right)$
MLE: solve $\max_w\ -\left(\sum_i w_i x_i - E_{\mathrm{Targets}}\right)^2$
Candidate regulators (features) $x_1, \dots, x_N$ with weights (parameters) $w_1, \dots, w_N$
  Yeast: 350 genes
  Mouse: 700 genes
Problem: this objective learns too many regulators
L1 regularization
“Select” a subset of regulators
  Combinatorial search?
  Effective feature selection algorithm: L1 regularization (LASSO) [Tibshirani, J. Royal. Statist. Soc B. 1996]
$\min_w\ \left(\sum_i w_i x_i - E_{\mathrm{Targets}}\right)^2 + \sum_i C\,|w_i|$: convex optimization!
  ⇒ Induces sparsity in the solution $w$ (many $w_i$’s set to zero)
Candidate regulators (features) $x_1, \dots, x_N$ with weights $w_1, \dots, w_N$
  Yeast: 350 genes
  Mouse: 700 genes
$P(E_{\mathrm{Targets}} \mid x : w) = N\left(\sum_i w_i x_i,\ \varepsilon^2\right)$
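One standard way to solve the L1-regularized objective is cyclic coordinate descent with soft-thresholding. The sketch below is an illustrative solver for the objective as written on the slide, summed over experiments; it is not claimed to be the optimizer used in the lecture, and the function names are hypothetical.

```python
def soft_threshold(z, t):
    """Shrink z toward zero by t; values within [-t, t] become exactly 0."""
    if z > t:
        return z - t
    if z < -t:
        return z + t
    return 0.0

def lasso(X, y, C, n_iter=200):
    """Minimize sum_e (sum_i w_i * X[e][i] - y[e])^2 + C * sum_i |w_i|.

    X: one row per experiment, one column per candidate regulator.
    y: target expression, one value per experiment.
    """
    n, p = len(X), len(X[0])
    w = [0.0] * p
    for _ in range(n_iter):
        for j in range(p):
            # Correlation of feature j with the residual that ignores feature j.
            rho = sum(
                X[e][j] * (y[e] - sum(w[i] * X[e][i] for i in range(p) if i != j))
                for e in range(n)
            )
            z = sum(X[e][j] ** 2 for e in range(n))
            # Setting the subgradient to zero gives w_j = S(rho, C/2) / z.
            w[j] = soft_threshold(rho, C / 2.0) / z if z > 0 else 0.0
    return w
```

The soft-threshold step is where sparsity comes from: any regulator whose correlation with the residual stays below `C / 2` receives a weight of exactly zero, so increasing `C` selects fewer regulators.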
Learning linear regulatory network
Iterative procedure
  Cluster genes into modules
  Learn a regulatory program for each module
    L1 regularized optimization: $\min_w\ \left(\sum_i w_i x_i - E_{\mathrm{Targets}}\right)^2 + \sum_i C\,|w_i|$
Is this predicted relationship “real”?
[Figure: a module's expression written as a weighted sum of its regulators, with weights -3, 0.5, and -1.2 shown on regulators including GPA1, MFA1, and M120; network of modules (M1, M22, M120, M321, M1011) and genes including ECM18, UTH1, TEC1, ASG7, MEC3, GPA1, MFA1, HAP1, PHO3, SGS1, SEC59, PHO5, RIM15, SAS5, PHO84, PHM6, PHO2, PHO4, SPL2, GIT1, VTC3.]
Lee et al., PLoS Genet 2009
Statistical evaluation
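Cross-validation test
  Divide the data (experiments) into training and test data
  Learn the regulatory network from the training data
  Compute the likelihood function for the test data
“Test likelihood”: how well does the regulatory network fit the test data?
[Figure: expression matrix (genes × experiments) with a held-out block of experiments marked as “Test data”.]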
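The test-likelihood idea can be sketched for the simplest possible model: one target gene explained by one regulator through a linear Gaussian CPD. The model choice, the fold layout (every `n_folds`-th experiment held out), and the function names are assumptions made to keep the example short.

```python
import math

def fit_linear_gaussian(xs, ys):
    """Least-squares weight and residual variance for y ~ N(w * x, var)."""
    w = sum(x * y for x, y in zip(xs, ys)) / sum(x * x for x in xs)
    var = sum((y - w * x) ** 2 for x, y in zip(xs, ys)) / len(xs)
    return w, max(var, 1e-6)  # variance floor avoids division by zero

def log_likelihood(xs, ys, w, var):
    """Log-likelihood of (xs, ys) under y ~ N(w * x, var)."""
    return sum(
        -0.5 * math.log(2 * math.pi * var) - (y - w * x) ** 2 / (2 * var)
        for x, y in zip(xs, ys)
    )

def cross_validated_test_loglik(xs, ys, n_folds=4):
    """Sum of held-out log-likelihoods over n_folds splits of the experiments."""
    n = len(xs)
    total = 0.0
    for fold in range(n_folds):
        test = [e for e in range(n) if e % n_folds == fold]
        train = [e for e in range(n) if e % n_folds != fold]
        # Fit on the training experiments only.
        w, var = fit_linear_gaussian([xs[e] for e in train], [ys[e] for e in train])
        # Score on the experiments the model has not seen.
        total += log_likelihood([xs[e] for e in test], [ys[e] for e in test], w, var)
    return total
```

A higher test log-likelihood means the learned relationship generalizes to unseen experiments, which is the sense in which a predicted regulator can be called statistically supported.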
Module Evaluation Criteria
Are the module genes functionally coherent?
Do the regulators have regulatory roles in the predicted conditions?
Are the genes in the module known targets of the predicted regulators?
Are the regulators consistent with the cis-regulatory motifs found in promoters of the module genes?
Functional coherence
Modules vs. known functional categories
  Gene ontology
  Predicted targets of TFs
  Sharing TF binding sites
  Example category: Cholesterol synthesis
How significant is the overlap?
  Calculate $P(\#\,\text{overlap} \ge k \mid K, n, N, \text{two groups are independent})$ based on the hypergeometric distribution
[Figure: a module of $n$ genes and a functional category of $K$ genes, sharing $k$ genes, among $N$ genes in total.]
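The overlap p-value is the upper tail of the hypergeometric distribution. A minimal sketch using only the standard library follows; the function name is hypothetical, and the arguments use the slide's symbols ($k$ overlapping genes, category size $K$, module size $n$, $N$ genes in total).

```python
from math import comb

def hypergeom_pvalue(k, K, n, N):
    """P(overlap >= k) when a module of n genes is drawn at random from
    N genes, K of which belong to the functional category."""
    total = comb(N, n)
    tail = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, min(K, n) + 1))
    return tail / total

# Example with made-up numbers: a module of 3 genes that all fall in a
# 3-gene category, out of 10 genes in total.
p_example = hypergeom_pvalue(3, 3, 3, 10)
```

When many modules are tested against many categories, the resulting p-values usually need a multiple-testing correction before any overlap is called significant.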
Module Functional Coherence
[Figure: bar chart of coherence (%) for modules 1 to 50.]
  26 modules >60% coherent
  41 modules >40% coherent
Functional categories covered
  Metabolic: AA, respiration, glycolysis, galactose
  Stress: oxidative stress, osmotic stress
  Cellular localization: nucleus, ER
  Cellular processes: cell cycle, sporulation, mating
  Molecular functions: protein folding, RNA & DNA processing, trafficking
Example: Respiration Module
HAP4 known to up-regulate oxidative phosphorylation
HAP4, MSN4, XBP1 known to be regulators under predicted conditions
HAP4 binding site found in 39/55 genes
MSN4 binding site found in 28/55 genes
Any drawback?
Many models can get similar scores. Which one would you choose?
A gene can be involved in multiple modules.
Possible to incorporate prior knowledge?
Structural learning via bootstrapping
Many networks achieve similar scores (e.g. 33.15, 33.10, 32.99, …)
Which one is the most robust?
  Estimate the robustness of each network or each edge.
  How? Learn the networks from multiple datasets.
Inferring sub-networks from perturbed expression profiles, Pe’er et al. Bioinformatics 2001
Bootstrapping
  Sampling with replacement
  Original data (genes × experiments) → bootstrap data 1, data 2, …, data $N$
  Learn one network from each bootstrap dataset
  Estimated confidence of each edge $i$ = (# networks that contain the edge) / (total # networks ($N$))
[Figure: network with estimated edge confidences such as 0.8, 0.31, 0.75, 0.26, 0.56, 0.43, 0.73.]
Inferring sub-networks from perturbed expression profiles, Pe’er et al. Bioinformatics 2001
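The bootstrap procedure can be sketched end to end once a structure learner is available. Here the learner is a deliberately crude stand-in (an undirected edge whenever the absolute Pearson correlation exceeds a threshold), used only so the example is self-contained; it is not the Bayesian network learner of Pe’er et al. All function names and the threshold are assumptions.

```python
import math
import random

def pearson(a, b):
    """Pearson correlation; returns 0.0 if either profile is constant."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    va = sum((x - ma) ** 2 for x in a)
    vb = sum((y - mb) ** 2 for y in b)
    if va == 0 or vb == 0:
        return 0.0
    return sum((x - ma) * (y - mb) for x, y in zip(a, b)) / math.sqrt(va * vb)

def learn_edges(data, threshold=0.8):
    """Stand-in structure learner: edge if |correlation| > threshold.

    data: dict mapping gene name to its expression across experiments.
    """
    genes = sorted(data)
    return {
        (g, h)
        for i, g in enumerate(genes)
        for h in genes[i + 1:]
        if abs(pearson(data[g], data[h])) > threshold
    }

def bootstrap_edge_confidence(data, n_boot=200, seed=0):
    """Fraction of bootstrap networks that contain each edge."""
    rng = random.Random(seed)
    n_exp = len(next(iter(data.values())))
    counts = {}
    for _ in range(n_boot):
        # Resample experiments with replacement, keeping genes aligned.
        idx = [rng.randrange(n_exp) for _ in range(n_exp)]
        sample = {g: [values[e] for e in idx] for g, values in data.items()}
        for edge in learn_edges(sample):
            counts[edge] = counts.get(edge, 0) + 1
    return {edge: c / n_boot for edge, c in counts.items()}
```

Edges that appear in nearly every bootstrap network are robust to which experiments happened to be measured; edges with low confidence are the ones most likely to differ among the equally high-scoring networks.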
Overlapping processes
The living cell is a complex system
  Example: the cell cycle
  Genes functionally relevant to cell cycle regulation in the specific cell cycle phase
  Genes have a regulatory mechanism
  Genes participate in multiple processes
Partial figure from Maaser and Borlak Oct. 2008
Mutually exclusive clustering as a common approach to analyzing gene expression
  (+) genes likely to share a common function
  (-) groups genes into mutually exclusive clusters
  (-) no info about genes' relation to one another
Decomposition of processes…
Model an expression level of a gene as a mixture of regulatory modules.
Hard EM vs soft EM
[Figure: expression of a gene X as a combination, with weights $w_1, w_2, \dots, w_N$, of regulatory modules; each module has a tree regulation program with false/true splits on regulators such as HAP4, MSK1, DHH1, PUF3, HAP1, and HAP5.]
Probabilistic discovery of overlapping cellular processes and their regulation, Battle et al. Journal of Computational Biology 2005
Outline
Microarray data analysis
Clustering approaches
Beyond clustering
Algorithms for learning regulatory networks
Learning regulatory networks from a genetically diverse set of individuals