Short introduction to gene finding

DNA → (transcription) → pre-mRNA → (splicing) → mRNA → (translation) → protein


Example: Donor site signals

ccatcccctatatttatggcaggtgaggaaagggtgggggctgggg
attcatcatcatgggtgcatcggtgagtatctcccaggccccaatc
agaagatctaccccaccatctggtaagtgtgtcccaccactgcccc
acagagtgagcccttcttcaaggtgggtggtgtcagggcctccccc
acgagtcctgcatgagccagatgtaaggcttgccgttgccctccct
tgcagaacctcatggtgctgaggtggggccaagcctgggccggggg
tcgatgaatttgggatcatccggtgagagctcttcctctctcctgg
agatgacgtccgtgatgagaaggtagggggtgcaccccagtcccca
gtggagaatgagaggtgggatggtaggtgatgccttcgaggcccag
tttcttgtggctattttaaaaggtaattcatggagaaatagaaaaa
tttgaagaaactccacgaagaggttgatggcagtgactttcggaaa
agtggatgcccttaaaggaaccgtggagtaccaacccccctgcagt
cgacccgtgaccctcgtgagaggtacgaagccccagcccggggctc
aaatgcagtggaagagggactagtacgtgagccatgctgggagtgt
catggcgggtgtgctgaagaaggtgagacgaatggaggtcactgtt
tgtgtgcaatactcctcacgaggtatgtacctttgttctttcttcg
gaagctggctatggttaaagcggtaagtagctaagtcagttttgtt
attagaagaggtgattcttcaggtaaagaaaaagttgactatttag

Signals in gene finding

• Conserved sequences of fixed length that appear at the boundaries of exons and at other important places.
• Our interest: create realistic generative probabilistic models of these signals that can be used inside probabilistic models of genes (such as hidden Markov models)
  – High probability to actual signals
  – Low probability to decoys (sites which are not signals)
• This differs from classification:
  – A clear separation of signals and decoys is not possible
  – We care about the values of the probabilities
  – There are many more decoys than positives: in 10 kb of sequence, ≈ 1 signal and ≈ 600 decoys

Example: Position Weight Matrix (PWM)

Pos:   0    1    2    3    4    5    6    7    8
A    .38  .62  .12    0    0  .71  .73  .11  .21
C    .31  .10  .04    0    0  .02  .06  .06  .10
G    .18  .12  .77    1    0  .24  .08  .75  .14
T    .13  .16  .07    0    1  .03  .13  .08  .55

(Each column sums to 1; the invariant GT dinucleotide occupies positions 3 and 4.)

ccatcccctatatttatggcaggtgaggaaagggtgggggctgggg
attcatcatcatgggtgcatcggtgagtatctcccaggccccaatc
agaagatctaccccaccatctggtaagtgtgtcccaccactgcccc
acagagtgagcccttcttcaaggtgggtggtgtcagggcctccccc
acgagtcctgcatgagccagatgtaaggcttgccgttgccctccct
tgcagaacctcatggtgctgaggtggggccaagcctgggccggggg
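As a sketch of how such a matrix is used: the probability of a candidate 9-mer is the product of per-position probabilities, since a PWM treats positions as independent. A minimal Python example (probabilities copied from the table above; the site strings are illustrative):

```python
# Position Weight Matrix for the 9-position donor signal
# (probabilities from the table above; positions are independent).
PWM = {
    "a": [.38, .62, .12, 0, 0, .71, .73, .11, .21],
    "c": [.31, .10, .04, 0, 0, .02, .06, .06, .10],
    "g": [.18, .12, .77, 1, 0, .24, .08, .75, .14],
    "t": [.13, .16, .07, 0, 1, .03, .13, .08, .55],
}

def pwm_prob(site):
    """Probability of a 9-mer under the PWM: product of per-position probabilities."""
    assert len(site) == 9
    p = 1.0
    for i, ch in enumerate(site.lower()):
        p *= PWM[ch][i]
    return p

# The invariant GT at positions 3-4 means any site lacking it gets probability 0.
print(pwm_prob("aaggtaagt"))  # consensus-like site, relatively high probability
print(pwm_prob("aagctaagt"))  # 0.0: c at position 3 is impossible under this PWM
```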

Main challenge: dependencies within signal

How much more information do we gain if we consider pairs of positions instead of individual positions?

[Figure: heat map of pairwise information between signal positions −3 … +6; scale 0 to 0.058, darker is better]
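The pairwise information behind such a plot can be estimated directly from aligned signal sequences. A minimal sketch of the mutual-information computation (the toy sequences below are invented for illustration, not the slide's data):

```python
import math
from collections import Counter

def mutual_information(sites, i, j):
    """Mutual information (in bits) between positions i and j
    of a set of aligned signal sequences."""
    n = len(sites)
    pi = Counter(s[i] for s in sites)
    pj = Counter(s[j] for s in sites)
    pij = Counter((s[i], s[j]) for s in sites)
    mi = 0.0
    for (a, b), c in pij.items():
        # c/n is the joint probability; pi[a]/n and pj[b]/n the marginals.
        mi += c / n * math.log2(c * n / (pi[a] * pj[b]))
    return mi

# Toy example: positions 0 and 1 perfectly correlated, 0 and 2 independent.
sites = ["AGC", "AGA", "TCC", "TCA"] * 25
print(mutual_information(sites, 0, 1))  # 1 bit: position 0 determines position 1
print(mutual_information(sites, 0, 2))  # 0: the positions are independent
```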

Signals as Bayesian Networks

• random variables / vertices = signal positions
• edges = “dependencies” between positions

To generate a signal from model M:
• Generate characters at signal positions in topological order
• The model specifies for each position i:
  Pr[S_i = x_i | S_{j1} = x_{j1}, ..., S_{jk} = x_{jk}],
  where j1, ..., jk are the predecessors of i in the DAG
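A minimal sketch of this generation process for a hypothetical three-position chain S0 → S1 → S2 (all probabilities below are invented for illustration):

```python
import random

# Hypothetical toy network over three signal positions:
# S0 has no parents; S1 depends on S0; S2 depends on S1.
p0 = {"A": 0.5, "G": 0.5}
p1 = {"A": {"G": 0.9, "T": 0.1}, "G": {"G": 0.2, "T": 0.8}}
p2 = {"G": {"T": 0.95, "C": 0.05}, "T": {"T": 0.5, "C": 0.5}}

def sample(dist, rng):
    """Draw one character from a {char: probability} distribution."""
    r, acc = rng.random(), 0.0
    for ch, p in dist.items():
        acc += p
        if r < acc:
            return ch
    return ch  # guard against floating-point rounding

def generate(rng):
    """Generate the positions in topological order S0, S1, S2,
    conditioning each position on its already-generated parents."""
    s0 = sample(p0, rng)
    s1 = sample(p1[s0], rng)
    s2 = sample(p2[s1], rng)
    return s0 + s1 + s2

rng = random.Random(0)
print([generate(rng) for _ in range(5)])
```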

[Figure: example DAG over the nine donor-site positions n A G G T n A G T]

Examples of generative models for donor signal

[Figure: DAGs over the donor-site positions n A G G T n A G T for five models: PWM (order 0), PWM (order 1), Tree (HOT, order 1), PWM (order 2), HOT (order 2)]

Training HOT models

Task: Given a training set S_1, ..., S_ℓ, find the model topology with maximum in-degree k that maximizes the likelihood of S_1, ..., S_ℓ.

The amount of training data limits the maximum in-degree (otherwise we risk overfitting).

Once the topology is known, this is a problem with fully observed data, and training is trivial (frequency counting).

Optimization problem: choose a DAG of maximum in-degree k minimizing the total cost. The cost of each vertex h depends on the set T of its parents:

  H(T, h) − H(T),

where H(X) = − Σ_x Pr(X = x) log Pr(X = x) is the entropy; H(T, h) − H(T) is the conditional entropy H(h | T).
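The vertex cost H(T, h) − H(T) can be computed from empirical counts over the training set; a minimal sketch (toy training data, entropies in bits):

```python
import math
from collections import Counter

def entropy(tuples):
    """Empirical entropy H(X) = -sum p log p of the observed tuples."""
    n = len(tuples)
    return -sum(c / n * math.log2(c / n) for c in Counter(tuples).values())

def cost(sites, parents, h):
    """Cost of vertex h with parent set T: H(T, h) - H(T),
    i.e. the conditional entropy H(h | T)."""
    joint = [tuple(s[p] for p in parents) + (s[h],) for s in sites]
    marginal = [tuple(s[p] for p in parents) for s in sites]
    return entropy(joint) - entropy(marginal)

# Toy data: position 1 copies position 0; position 2 is independent noise.
sites = ["AAC", "AAG", "TTC", "TTG"]
print(cost(sites, [0], 1))  # 0: position 0 fully determines position 1
print(cost(sites, [0], 2))  # 1 bit of position 2 is left unexplained
```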


Solving the optimization problem

• For k ≥ 2: formulated as a minimum directed spanning hypertree problem
  – NP-hard
  – solved in practice with integer programming
• For k = 1: equivalent to a minimum spanning tree problem [Chow, Liu 1968]
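For k = 1, a rough sketch of the Chow–Liu idea: build a maximum spanning tree over positions, with edges weighted by mutual information (toy data; Prim's algorithm; the helper names are mine):

```python
import math
from collections import Counter

def mutual_info(sites, i, j):
    """Mutual information (bits) between positions i and j."""
    n = len(sites)
    pi, pj = Counter(s[i] for s in sites), Counter(s[j] for s in sites)
    pij = Counter((s[i], s[j]) for s in sites)
    return sum(c / n * math.log2(c * n / (pi[a] * pj[b]))
               for (a, b), c in pij.items())

def chow_liu(sites):
    """Maximum spanning tree on mutual information via Prim's algorithm;
    returns the tree as a list of (parent, child) edges rooted at 0."""
    m = len(sites[0])
    in_tree, edges = {0}, []
    while len(in_tree) < m:
        best = max(((mutual_info(sites, u, v), u, v)
                    for u in in_tree for v in range(m) if v not in in_tree),
                   key=lambda t: t[0])
        edges.append((best[1], best[2]))
        in_tree.add(best[2])
    return edges

# Toy data: position 1 copies position 0, so the tree links them.
sites = ["AAC", "AAG", "TTC", "TTG"]
print(chow_liu(sites))
```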


Finding optimal topology with integer programming

Variables:
• a_{T,h} = 1 iff T is the set of parents of vertex h
• b_{i,j} = 1 iff vertex i is before vertex j in the topological ordering

minimize  Σ_{T,h: |T| ≤ k}  a_{T,h} (H(T, h) − H(T))

subject to:
• b_{i,j} + b_{j,i} = 1
• b_{i,j} + b_{j,k} + b_{k,i} ≤ 2
• a_{T,h} ≤ b_{x,h} for all x ∈ T
• Σ_T a_{T,h} = 1
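For very short signals, the optimum that the integer program searches for can also be found by brute force over topological orderings; a sketch under toy data (this is not the integer program itself, just an exhaustive check of the same objective):

```python
import itertools
import math
from collections import Counter

sites = ["AAC", "AAG", "TTC", "TTG"]  # toy training set

def H(cols):
    """Empirical joint entropy of the given tuple of positions."""
    n = len(sites)
    counts = Counter(tuple(s[c] for c in cols) for s in sites)
    return -sum(v / n * math.log2(v / n) for v in counts.values())

def cost(T, h):
    """H(T, h) - H(T): conditional entropy of position h given parents T."""
    return H(tuple(T) + (h,)) - H(tuple(T))

def best_topology(m, k):
    """Enumerate topological orderings; each vertex takes its cheapest
    parent set (size <= k) among earlier vertices. Exhaustive, so only
    feasible for short signals."""
    best = (float("inf"), None)
    for order in itertools.permutations(range(m)):
        total, parents = 0.0, {}
        for pos, h in enumerate(order):
            c, T = min((cost(T, h), T)
                       for r in range(min(k, pos) + 1)
                       for T in itertools.combinations(order[:pos], r))
            total += c
            parents[h] = T
        if total < best[0]:
            best = (total, parents)
    return best

total, parents = best_topology(m=3, k=1)
print(total, parents)  # position 1 hangs off position 0, which determines it
```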


Performance in gene finding

                     Donor: PWM       Donor: HOT-3
                     Acceptor: PWM    Acceptor: PWM-3
Exon Sensitivity         60%              63%  (+3)
Exon Specificity         62%              63%  (+1)
Donor Sensitivity        70%              72%  (+2)
Donor Specificity        73%              72%  (−1)
Start Sensitivity        48%              68%  (+20)
Start Specificity        38%              40%  (+2)

(Set of 41 human genes with 533 exons)


Using signal models for discrimination

• The model defines the probability Pr(s | signal)
• For discrimination, we want Pr(signal | s):

  Pr(signal | s) = Pr(signal) Pr(s | signal) / [Pr(signal) Pr(s | signal) + Pr(¬signal) Pr(s | ¬signal)]

• Choose a threshold score above which a site is predicted as a true donor site
• Changing the threshold balances sensitivity vs. specificity
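Plugging in the earlier estimate of ≈ 1 signal per ≈ 600 candidate sites as the prior, the posterior can be sketched as follows (the likelihood values are hypothetical):

```python
def posterior(p_s_given_signal, p_s_given_decoy, prior_signal):
    """Pr(signal | s) via Bayes' rule, as in the formula above."""
    num = prior_signal * p_s_given_signal
    den = num + (1 - prior_signal) * p_s_given_decoy
    return num / den

# Per 10 kb there is roughly 1 true signal among ~600 candidate sites,
# so the prior Pr(signal) is about 1/600.
prior = 1 / 600

# Hypothetical model likelihoods for one candidate site:
p = posterior(p_s_given_signal=0.04, p_s_given_decoy=1e-5, prior_signal=prior)
print(p)
print(p > 0.5)  # comparing against a threshold gives the prediction
```

Note how the tiny prior penalizes every candidate: even a site that the signal model scores much higher than the decoy model only reaches a moderate posterior, which is why the threshold choice matters.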


Sensitivity vs. specificity

[Figure: true positives (0.6–0.9) vs. false positives (0.0–0.4) for PWM-0, PWM-2, and HOT-2]

(Set of 41 human genes with 533 exons)


How much data do you need for training?

[Figure: specificity (%) of donor sites at 90% sensitivity as a function of the amount of training data (5,000–10,000), for PWM-0, PWM-1, PWM-2, PWM-3, TREE, HOT-2, and HOT-3]

(Set of 41 human genes with 533 exons)

Signals in gene finding – summary

• Main problem: how to capture dependencies between non-adjacent signal positions
• Traditional tradeoff: how many dependencies can we capture without overfitting (limited in-degree of vertices in the model)
• Training HOT models is hard in general; however, integer programming does a reasonable job
• Structured models (HOTs) improve performance over unstructured models (PWMs)
• Improving the performance of a signal model may significantly impact performance in other parts of the gene model

