Short introduction to gene finding
DNA --(transcription)--> pre-mRNA --(splicing)--> mRNA --(translation)--> protein
Example: Donor site signals

ccatcccctatatttatggcaggtgaggaaagggtgggggctgggg
attcatcatcatgggtgcatcggtgagtatctcccaggccccaatc
agaagatctaccccaccatctggtaagtgtgtcccaccactgcccc
acagagtgagcccttcttcaaggtgggtggtgtcagggcctccccc
acgagtcctgcatgagccagatgtaaggcttgccgttgccctccct
tgcagaacctcatggtgctgaggtggggccaagcctgggccggggg
tcgatgaatttgggatcatccggtgagagctcttcctctctcctgg
agatgacgtccgtgatgagaaggtagggggtgcaccccagtcccca
gtggagaatgagaggtgggatggtaggtgatgccttcgaggcccag
tttcttgtggctattttaaaaggtaattcatggagaaatagaaaaa
tttgaagaaactccacgaagaggttgatggcagtgactttcggaaa
agtggatgcccttaaaggaaccgtggagtaccaacccccctgcagt
cgacccgtgaccctcgtgagaggtacgaagccccagcccggggctc
aaatgcagtggaagagggactagtacgtgagccatgctgggagtgt
catggcgggtgtgctgaagaaggtgagacgaatggaggtcactgtt
tgtgtgcaatactcctcacgaggtatgtacctttgttctttcttcg
gaagctggctatggttaaagcggtaagtagctaagtcagttttgtt
attagaagaggtgattcttcaggtaaagaaaaagttgactatttag
Signals in gene finding
• Conserved sequences of fixed length that appear at the boundaries of exons and at other important places.
• Our interest: create realistic generative probabilistic models of these signals that can be used inside probabilistic models of genes (such as hidden Markov models)
  – High probability for actual signals
  – Low probability for decoys (sites that are not signals)
• Different from classification:
  – A clear separation of signals and decoys is not possible
  – We care about the actual values of the probabilities
  – Many more decoys than positives: 10 kb of sequence contains ≈ 1 signal and ≈ 600 decoys
Example: Position Weight Matrix (PWM)

Position   0    1    2    3    4    5    6    7    8
A        .38  .62  .12   0    0   .71  .73  .11  .21
C        .31  .10  .04   0    0   .02  .06  .06  .10
G        .18  .12  .77   1    0   .24  .08  .75  .14
T        .13  .16  .07   0    1   .03  .13  .08  .55

ccatcccctatatttatggcaggtgaggaaagggtgggggctgggg
attcatcatcatgggtgcatcggtgagtatctcccaggccccaatc
agaagatctaccccaccatctggtaagtgtgtcccaccactgcccc
acagagtgagcccttcttcaaggtgggtggtgtcagggcctccccc
acgagtcctgcatgagccagatgtaaggcttgccgttgccctccct
tgcagaacctcatggtgctgaggtggggccaagcctgggccggggg
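As a sketch of how such a matrix is used, a PWM assigns a candidate 9-character window the product of its per-position probabilities. The values below are copied from the table above (positions 3 and 4 carry the invariant GT of the donor consensus); the example windows are illustrative, not taken from the slides.

```python
# Donor-site PWM from the table above (positions 0..8).
PWM = {
    "A": [0.38, 0.62, 0.12, 0.0, 0.0, 0.71, 0.73, 0.11, 0.21],
    "C": [0.31, 0.10, 0.04, 0.0, 0.0, 0.02, 0.06, 0.06, 0.10],
    "G": [0.18, 0.12, 0.77, 1.0, 0.0, 0.24, 0.08, 0.75, 0.14],
    "T": [0.13, 0.16, 0.07, 0.0, 1.0, 0.03, 0.13, 0.08, 0.55],
}

def pwm_probability(window):
    """Pr[s | signal] under the PWM: product of per-position probabilities."""
    assert len(window) == 9
    p = 1.0
    for i, base in enumerate(window.upper()):
        p *= PWM[base][i]
    return p

# A window matching the consensus scores high; one violating the
# invariant GT at positions 3-4 gets probability zero.
print(pwm_probability("AAGGTAAGT"))  # > 0
print(pwm_probability("AAGCTAAGT"))  # 0.0
```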
Main challenge: dependencies within signal
How much more information do we gain if we consider pairs of positions instead of individual positions?

[Figure: pairwise heatmap over signal positions −3 … +6, color scale from 0 to 0.058; darker is better]
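One standard way to quantify the extra information carried by a pair of positions is their empirical mutual information. This is a sketch, assuming the heatmap plots pairwise mutual information estimated from aligned training signals; the toy signals below are illustrative only.

```python
from collections import Counter
from math import log2

def mutual_information(signals, i, j):
    """Empirical mutual information (in bits) between signal positions i and j."""
    n = len(signals)
    pi = Counter(s[i] for s in signals)
    pj = Counter(s[j] for s in signals)
    pij = Counter((s[i], s[j]) for s in signals)
    # sum over observed pairs: p(x,y) * log2( p(x,y) / (p(x) p(y)) )
    return sum(c / n * log2(c * n / (pi[x] * pj[y]))
               for (x, y), c in pij.items())

# Toy data: position 1 copies position 0; position 2 is independent of both.
sig = ["AAG", "AAC", "CCG", "CCC", "GGG", "GGC", "TTG", "TTC"]
print(mutual_information(sig, 0, 1))  # 2.0 bits: fully dependent
print(mutual_information(sig, 0, 2))  # 0.0 bits: independent
```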
Signals as Bayesian networks
• random variables / vertices = signal positions
• edges = “dependencies” between positions
To generate a signal from model M:
• Generate characters at the signal positions in topological order
• The model specifies, for each position i:
  Pr[S_i = x_i | S_{j1} = x_{j1}, ..., S_{jk} = x_{jk}],
  where j1, ..., jk are the predecessors of i in the DAG
[Diagram: Bayesian network over donor-site positions labeled n, A, G, G, T]
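The generation procedure above can be sketched directly: visit positions in topological order and draw each character from its conditional distribution given the already-generated parent values. The tiny three-position network and its probabilities below are hypothetical, not a trained donor model.

```python
import random

# For each position: (list of parent positions, conditional probability
# tables keyed by the tuple of parent values). Hypothetical example network.
NETWORK = {
    0: ([], {(): {"A": 0.6, "G": 0.4}}),
    1: ([0], {("A",): {"G": 0.9, "T": 0.1},
              ("G",): {"G": 0.5, "T": 0.5}}),
    2: ([1], {("G",): {"T": 1.0},
              ("T",): {"A": 0.7, "C": 0.3}}),
}

def generate(network, rng=random):
    """Sample one signal: positions are visited in topological order,
    so all parents are already assigned when a position is drawn."""
    signal = {}
    for i in sorted(network):
        parents, tables = network[i]
        dist = tables[tuple(signal[p] for p in parents)]
        bases, probs = zip(*dist.items())
        signal[i] = rng.choices(bases, probs)[0]
    return "".join(signal[i] for i in sorted(signal))

print(generate(NETWORK))
```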
Examples of generative models for donor signal
[Diagrams: DAGs over the donor-site positions for five models:
PWM (order 0), PWM (order 1), Tree (HOT, order 1), PWM (order 2), HOT (order 2)]
Training HOT models
Task: Given a training set S_1, ..., S_ℓ, find the model topology with maximum in-degree k that maximizes the likelihood of S_1, ..., S_ℓ.
• The amount of training data limits the maximum in-degree (otherwise we risk overfitting).
• Once the topology is known, this is a problem with fully observed data, and training is trivial (frequency counting).
Optimization problem: choose a DAG of maximum in-degree k minimizing the total cost. The cost of each vertex h depends on the set T of its parents:
  H(T, h) − H(T),  where H(X) = −∑_x Pr(X = x) log Pr(X = x)
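The per-vertex cost H(T, h) − H(T) is the conditional entropy of position h given its parents, and it can be estimated from the training set by frequency counting. A minimal sketch on toy data (the signals below are illustrative):

```python
from collections import Counter
from math import log2

def entropy(signals, positions):
    """Empirical H(X) = -sum_x Pr(X=x) log2 Pr(X=x) over the given positions."""
    n = len(signals)
    counts = Counter(tuple(s[p] for p in positions) for s in signals)
    return -sum(c / n * log2(c / n) for c in counts.values())

def cost(signals, h, parents):
    """Cost of giving vertex h the parent set T: H(T, h) - H(T) = H(h | T)."""
    return entropy(signals, parents + [h]) - entropy(signals, parents)

# Toy data: position 1 copies position 0, so knowing the parent removes
# all uncertainty; with no parents, the full entropy of position 1 remains.
sig = ["AAG", "AAC", "CCG", "CCC", "GGG", "GGC", "TTG", "TTC"]
print(cost(sig, 1, [0]))  # 0.0
print(cost(sig, 1, []))   # 2.0
```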
Solving the optimization problem
• For k ≥ 2: formulated as a minimum directed spanning hypertree
  – NP-hard
  – use integer programming to solve
• For k = 1: equivalent to a minimum spanning tree [Chow, Liu 1968]
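For k = 1 the Chow–Liu construction applies: minimizing the sum of conditional entropies H(h | parent) = H(h) − I(h; parent) is equivalent to finding a maximum spanning tree under pairwise mutual information. A sketch using Prim's algorithm on toy data (the signals are illustrative):

```python
from collections import Counter
from math import log2

def mutual_information(signals, i, j):
    """Empirical mutual information (bits) between positions i and j."""
    n = len(signals)
    pi = Counter(s[i] for s in signals)
    pj = Counter(s[j] for s in signals)
    pij = Counter((s[i], s[j]) for s in signals)
    return sum(c / n * log2(c * n / (pi[x] * pj[y]))
               for (x, y), c in pij.items())

def chow_liu_tree(signals):
    """Maximum spanning tree over pairwise MI (Prim's algorithm).
    Returns {child: parent}; vertex 0 is the arbitrary root."""
    m = len(signals[0])
    in_tree, parent = {0}, {}
    while len(in_tree) < m:
        _, u, v = max((mutual_information(signals, u, v), u, v)
                      for u in in_tree for v in range(m) if v not in in_tree)
        parent[v] = u
        in_tree.add(v)
    return parent

# Position 1 copies position 0 (MI = 2 bits), so the tree attaches 1 to 0.
sig = ["AAG", "AAC", "CCG", "CCC", "GGG", "GGC", "TTG", "TTC"]
print(chow_liu_tree(sig))
```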
Finding optimal topology with an integer program
Variables:
• a_{T,h} = 1 iff T is the set of parents of vertex h
• b_{i,j} = 1 iff vertex i is before vertex j in the topological ordering
Minimize
  ∑_{T,h : |T| ≤ k} a_{T,h} (H(T, h) − H(T))
subject to:
• b_{i,j} + b_{j,i} = 1
• b_{i,j} + b_{j,k} + b_{k,i} ≤ 2
• a_{T,h} ≤ b_{x,h} for all x ∈ T
• ∑_T a_{T,h} = 1
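For intuition about what the integer program computes, tiny instances can be brute-forced: enumerate topological orderings and, for each vertex, pick the cheapest parent set of size at most k among earlier vertices. This exponential sketch on toy data is only for illustration; realistic signal lengths need the ILP above.

```python
from collections import Counter
from itertools import combinations, permutations
from math import log2

def entropy(signals, positions):
    n = len(signals)
    counts = Counter(tuple(s[p] for p in positions) for s in signals)
    return -sum(c / n * log2(c / n) for c in counts.values())

def best_dag(signals, k):
    """Brute-force optimal topology of max in-degree k (tiny inputs only)."""
    m = len(signals[0])
    best = None
    for order in permutations(range(m)):
        total, parents = 0.0, {}
        for idx, h in enumerate(order):
            # candidate parent sets: subsets of earlier vertices, size <= k
            options = [list(T) for r in range(k + 1)
                       for T in combinations(order[:idx], r)]
            T = min(options, key=lambda T: entropy(signals, T + [h])
                                           - entropy(signals, T))
            total += entropy(signals, T + [h]) - entropy(signals, T)
            parents[h] = tuple(T)
        if best is None or total < best[0]:
            best = (total, parents)
    return best

sig = ["AAG", "AAC", "CCG", "CCC", "GGG", "GGC", "TTG", "TTC"]
total, parents = best_dag(sig, k=1)
print(total, parents)
```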
Performance in gene finding

                     Donor: PWM      Donor: HOT-3
                     Acceptor: PWM   Acceptor: PWM-3   Change
Exon sensitivity         60%             63%            +3
Exon specificity         62%             63%            +1
Donor sensitivity        70%             72%            +2
Donor specificity        73%             72%            −1
Start sensitivity        48%             68%            +20
Start specificity        38%             40%            +2

(Set of 41 human genes with 533 exons)
Using signal models for discrimination
• The model defines the probability Pr(s | signal)
• For discrimination, we want Pr(signal | s):

  Pr(signal | s) = Pr(signal) Pr(s | signal) / ( Pr(signal) Pr(s | signal) + Pr(¬signal) Pr(s | ¬signal) )

• Choose a threshold score above which a site is predicted as a “true” donor site
• Changing the threshold balances sensitivity vs. specificity
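The Bayes formula above is a one-liner given the two likelihoods and a prior. As a sketch, the default prior below assumes roughly 1 signal per 600 decoys (the earlier 10 kb estimate); the likelihood values in the example are made up.

```python
def posterior(p_s_given_signal, p_s_given_decoy, prior_signal=1/601):
    """Pr(signal | s) by Bayes' rule; the default prior of 1/601 reflects
    the earlier estimate of ~1 signal per ~600 decoys (an assumption)."""
    num = prior_signal * p_s_given_signal
    den = num + (1 - prior_signal) * p_s_given_decoy
    return num / den

# Even a window 10,000x more likely under the signal model than under the
# decoy model reaches only ~94% posterior at this skewed prior.
print(posterior(1e-4, 1e-8))
```

This is why the raw likelihood ratio threshold must be set much higher than 1: the heavily skewed prior pushes the posterior of most candidate sites toward zero.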
Sensitivity vs. specificity
[Figure: ROC-style curve, true positives (0.6–0.9) vs. false positives (0.0–0.4), for PWM-0, PWM-2, and HOT-2]
(Set of 41 human genes with 533 exons)
How much data do you need for training?
[Figure: specificity (%) of donor sites at 90% sensitivity vs. amount of training data (5,000–10,000 examples), for PWM-0, PWM-1, PWM-2, PWM-3, TREE, HOT-2, HOT-3]
(Set of 41 human genes with 533 exons)
Signals in gene finding – summary
• Main problem: how to capture dependencies between non-adjacent signal positions
• Traditional tradeoff: how many dependencies we can capture without overfitting (limited in-degree of vertices in the model)
• Training HOT models is hard in general; however, integer programming does a reasonable job
• Structured models (HOTs) improve performance over unstructured models (PWMs)
• Improving the performance of a signal model may significantly impact performance in other parts of the gene model