Short introduction to gene finding
DNA --(transcription)--> pre-mRNA --(splicing)--> mRNA --(translation)--> protein
Example: Donor site signals

ccatcccctatatttatggcaggtgaggaaagggtgggggctgggg
attcatcatcatgggtgcatcggtgagtatctcccaggccccaatc
agaagatctaccccaccatctggtaagtgtgtcccaccactgcccc
acagagtgagcccttcttcaaggtgggtggtgtcagggcctccccc
acgagtcctgcatgagccagatgtaaggcttgccgttgccctccct
tgcagaacctcatggtgctgaggtggggccaagcctgggccggggg
tcgatgaatttgggatcatccggtgagagctcttcctctctcctgg
agatgacgtccgtgatgagaaggtagggggtgcaccccagtcccca
gtggagaatgagaggtgggatggtaggtgatgccttcgaggcccag
tttcttgtggctattttaaaaggtaattcatggagaaatagaaaaa
tttgaagaaactccacgaagaggttgatggcagtgactttcggaaa
agtggatgcccttaaaggaaccgtggagtaccaacccccctgcagt
cgacccgtgaccctcgtgagaggtacgaagccccagcccggggctc
aaatgcagtggaagagggactagtacgtgagccatgctgggagtgt
catggcgggtgtgctgaagaaggtgagacgaatggaggtcactgtt
tgtgtgcaatactcctcacgaggtatgtacctttgttctttcttcg
gaagctggctatggttaaagcggtaagtagctaagtcagttttgtt
attagaagaggtgattcttcaggtaaagaaaaagttgactatttag
Signals in gene finding
• Conserved sequences of fixed length that appear at the boundaries of exons and at other important places.
• Our interest: create realistic generative probabilistic models of these signals that can be used inside probabilistic models of genes (such as hidden Markov models)
  – High probability for actual signals
  – Low probability for decoys (sites that are not signals)
• Different from classification:
  – A clear separation of signals and decoys is not possible
  – We care about the actual values of the probabilities
  – Many more decoys than positives: 10 kb of sequence contains ≈ 1 signal and ≈ 600 decoys
Example: Position Weight Matrix (PWM)

Position   0    1    2    3    4    5    6    7    8
A        .38  .62  .12   0    0   .71  .73  .11  .21
C        .31  .10  .04   0    0   .02  .06  .06  .10
G        .18  .12  .77   1    0   .24  .08  .75  .14
T        .13  .16  .07   0    1   .03  .13  .08  .55

ccatcccctatatttatggcaggtgaggaaagggtgggggctgggg
attcatcatcatgggtgcatcggtgagtatctcccaggccccaatc
agaagatctaccccaccatctggtaagtgtgtcccaccactgcccc
acagagtgagcccttcttcaaggtgggtggtgtcagggcctccccc
acgagtcctgcatgagccagatgtaaggcttgccgttgccctccct
tgcagaacctcatggtgctgaggtggggccaagcctgggccggggg
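As a sketch of how such a matrix is used, a PWM assigns a candidate 9-character window the product of its per-position probabilities. The values below are copied from the table above (positions 3 and 4 carry the invariant GT of the donor consensus); the example windows are illustrative, not taken from the slides.

```python
# Donor-site PWM from the table above (positions 0..8).
PWM = {
    "A": [0.38, 0.62, 0.12, 0.0, 0.0, 0.71, 0.73, 0.11, 0.21],
    "C": [0.31, 0.10, 0.04, 0.0, 0.0, 0.02, 0.06, 0.06, 0.10],
    "G": [0.18, 0.12, 0.77, 1.0, 0.0, 0.24, 0.08, 0.75, 0.14],
    "T": [0.13, 0.16, 0.07, 0.0, 1.0, 0.03, 0.13, 0.08, 0.55],
}

def pwm_probability(window):
    """Pr[s | signal] under the PWM: product of per-position probabilities."""
    assert len(window) == 9
    p = 1.0
    for i, base in enumerate(window.upper()):
        p *= PWM[base][i]
    return p

# A window matching the consensus scores high; one violating the
# invariant GT at positions 3-4 gets probability zero.
print(pwm_probability("AAGGTAAGT"))  # > 0
print(pwm_probability("AAGCTAAGT"))  # 0.0
```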
Main challenge: dependencies within signal
How much more information do we gain if we consider pairs of positions instead of individual positions?

[Figure: pairwise heatmap over signal positions −3 … +6, color scale from 0 to 0.058; darker is better]
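One standard way to quantify the extra information carried by a pair of positions is their empirical mutual information. This is a sketch, assuming the heatmap plots pairwise mutual information estimated from aligned training signals; the toy signals below are illustrative only.

```python
from collections import Counter
from math import log2

def mutual_information(signals, i, j):
    """Empirical mutual information (in bits) between signal positions i and j."""
    n = len(signals)
    pi = Counter(s[i] for s in signals)
    pj = Counter(s[j] for s in signals)
    pij = Counter((s[i], s[j]) for s in signals)
    # sum over observed pairs: p(x,y) * log2( p(x,y) / (p(x) p(y)) )
    return sum(c / n * log2(c * n / (pi[x] * pj[y]))
               for (x, y), c in pij.items())

# Toy data: position 1 copies position 0; position 2 is independent of both.
sig = ["AAG", "AAC", "CCG", "CCC", "GGG", "GGC", "TTG", "TTC"]
print(mutual_information(sig, 0, 1))  # 2.0 bits: fully dependent
print(mutual_information(sig, 0, 2))  # 0.0 bits: independent
```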
Signals as Bayesian networks
• random variables / vertices = signal positions
• edges = “dependencies” between positions
To generate a signal from model M:
• Generate characters at the signal positions in topological order
• The model specifies, for each position i:
  Pr[S_i = x_i | S_{j1} = x_{j1}, ..., S_{jk} = x_{jk}],
  where j1, ..., jk are the predecessors of i in the DAG
[Diagram: Bayesian network over donor-site positions labeled n, A, G, G, T]
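The generation procedure above can be sketched directly: visit positions in topological order and draw each character from its conditional distribution given the already-generated parent values. The tiny three-position network and its probabilities below are hypothetical, not a trained donor model.

```python
import random

# For each position: (list of parent positions, conditional probability
# tables keyed by the tuple of parent values). Hypothetical example network.
NETWORK = {
    0: ([], {(): {"A": 0.6, "G": 0.4}}),
    1: ([0], {("A",): {"G": 0.9, "T": 0.1},
              ("G",): {"G": 0.5, "T": 0.5}}),
    2: ([1], {("G",): {"T": 1.0},
              ("T",): {"A": 0.7, "C": 0.3}}),
}

def generate(network, rng=random):
    """Sample one signal: positions are visited in topological order,
    so all parents are already assigned when a position is drawn."""
    signal = {}
    for i in sorted(network):
        parents, tables = network[i]
        dist = tables[tuple(signal[p] for p in parents)]
        bases, probs = zip(*dist.items())
        signal[i] = rng.choices(bases, probs)[0]
    return "".join(signal[i] for i in sorted(signal))

print(generate(NETWORK))
```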
Examples of generative models for donor signal
[Diagrams: DAGs over the donor-site positions for five models:
PWM (order 0), PWM (order 1), Tree (HOT, order 1), PWM (order 2), HOT (order 2)]
Training HOT models
Task: Given a training set S_1, ..., S_ℓ, find the model topology with maximum in-degree k that maximizes the likelihood of S_1, ..., S_ℓ.
• The amount of training data limits the maximum in-degree (otherwise we risk overfitting).
• Once the topology is known, this is a problem with fully observed data, and training is trivial (frequency counting).
Optimization problem: choose a DAG of maximum in-degree k minimizing the total cost. The cost of each vertex h depends on the set T of its parents:
  H(T, h) − H(T),  where H(X) = −∑_x Pr(X = x) log Pr(X = x)
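The per-vertex cost H(T, h) − H(T) is the conditional entropy of position h given its parents, and it can be estimated from the training set by frequency counting. A minimal sketch on toy data (the signals below are illustrative):

```python
from collections import Counter
from math import log2

def entropy(signals, positions):
    """Empirical H(X) = -sum_x Pr(X=x) log2 Pr(X=x) over the given positions."""
    n = len(signals)
    counts = Counter(tuple(s[p] for p in positions) for s in signals)
    return -sum(c / n * log2(c / n) for c in counts.values())

def cost(signals, h, parents):
    """Cost of giving vertex h the parent set T: H(T, h) - H(T) = H(h | T)."""
    return entropy(signals, parents + [h]) - entropy(signals, parents)

# Toy data: position 1 copies position 0, so knowing the parent removes
# all uncertainty; with no parents, the full entropy of position 1 remains.
sig = ["AAG", "AAC", "CCG", "CCC", "GGG", "GGC", "TTG", "TTC"]
print(cost(sig, 1, [0]))  # 0.0
print(cost(sig, 1, []))   # 2.0
```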
Solving the optimization problem
• For k ≥ 2: formulated as a minimum directed spanning hypertree
  – NP-hard
  – use integer programming to solve
• For k = 1: equivalent to a minimum spanning tree [Chow, Liu 1968]
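For k = 1 the Chow–Liu construction applies: minimizing the sum of conditional entropies H(h | parent) = H(h) − I(h; parent) is equivalent to finding a maximum spanning tree under pairwise mutual information. A sketch using Prim's algorithm on toy data (the signals are illustrative):

```python
from collections import Counter
from math import log2

def mutual_information(signals, i, j):
    """Empirical mutual information (bits) between positions i and j."""
    n = len(signals)
    pi = Counter(s[i] for s in signals)
    pj = Counter(s[j] for s in signals)
    pij = Counter((s[i], s[j]) for s in signals)
    return sum(c / n * log2(c * n / (pi[x] * pj[y]))
               for (x, y), c in pij.items())

def chow_liu_tree(signals):
    """Maximum spanning tree over pairwise MI (Prim's algorithm).
    Returns {child: parent}; vertex 0 is the arbitrary root."""
    m = len(signals[0])
    in_tree, parent = {0}, {}
    while len(in_tree) < m:
        _, u, v = max((mutual_information(signals, u, v), u, v)
                      for u in in_tree for v in range(m) if v not in in_tree)
        parent[v] = u
        in_tree.add(v)
    return parent

# Position 1 copies position 0 (MI = 2 bits), so the tree attaches 1 to 0.
sig = ["AAG", "AAC", "CCG", "CCC", "GGG", "GGC", "TTG", "TTC"]
print(chow_liu_tree(sig))
```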
Finding optimal topology with an integer program
Variables:
• a_{T,h} = 1 iff T is the set of parents of vertex h
• b_{i,j} = 1 iff vertex i is before vertex j in the topological ordering
Minimize
  ∑_{T,h : |T| ≤ k} a_{T,h} (H(T, h) − H(T))
subject to:
• b_{i,j} + b_{j,i} = 1
• b_{i,j} + b_{j,k} + b_{k,i} ≤ 2
• a_{T,h} ≤ b_{x,h} for all x ∈ T
• ∑_T a_{T,h} = 1
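For intuition about what the integer program computes, tiny instances can be brute-forced: enumerate topological orderings and, for each vertex, pick the cheapest parent set of size at most k among earlier vertices. This exponential sketch on toy data is only for illustration; realistic signal lengths need the ILP above.

```python
from collections import Counter
from itertools import combinations, permutations
from math import log2

def entropy(signals, positions):
    n = len(signals)
    counts = Counter(tuple(s[p] for p in positions) for s in signals)
    return -sum(c / n * log2(c / n) for c in counts.values())

def best_dag(signals, k):
    """Brute-force optimal topology of max in-degree k (tiny inputs only)."""
    m = len(signals[0])
    best = None
    for order in permutations(range(m)):
        total, parents = 0.0, {}
        for idx, h in enumerate(order):
            # candidate parent sets: subsets of earlier vertices, size <= k
            options = [list(T) for r in range(k + 1)
                       for T in combinations(order[:idx], r)]
            T = min(options, key=lambda T: entropy(signals, T + [h])
                                           - entropy(signals, T))
            total += entropy(signals, T + [h]) - entropy(signals, T)
            parents[h] = tuple(T)
        if best is None or total < best[0]:
            best = (total, parents)
    return best

sig = ["AAG", "AAC", "CCG", "CCC", "GGG", "GGC", "TTG", "TTC"]
total, parents = best_dag(sig, k=1)
print(total, parents)
```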
Performance in gene finding

                     Donor: PWM      Donor: HOT-3
                     Acceptor: PWM   Acceptor: PWM-3   Change
Exon sensitivity         60%             63%            +3
Exon specificity         62%             63%            +1
Donor sensitivity        70%             72%            +2
Donor specificity        73%             72%            −1
Start sensitivity        48%             68%            +20
Start specificity        38%             40%            +2

(Set of 41 human genes with 533 exons)
Using signal models for discrimination
• The model defines the probability Pr(s | signal)
• For discrimination, we want Pr(signal | s):

  Pr(signal | s) = Pr(signal) Pr(s | signal) / ( Pr(signal) Pr(s | signal) + Pr(¬signal) Pr(s | ¬signal) )

• Choose a threshold score above which a site is predicted as a “true” donor site
• Changing the threshold balances sensitivity vs. specificity
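The Bayes formula above is a one-liner given the two likelihoods and a prior. As a sketch, the default prior below assumes roughly 1 signal per 600 decoys (the earlier 10 kb estimate); the likelihood values in the example are made up.

```python
def posterior(p_s_given_signal, p_s_given_decoy, prior_signal=1/601):
    """Pr(signal | s) by Bayes' rule; the default prior of 1/601 reflects
    the earlier estimate of ~1 signal per ~600 decoys (an assumption)."""
    num = prior_signal * p_s_given_signal
    den = num + (1 - prior_signal) * p_s_given_decoy
    return num / den

# Even a window 10,000x more likely under the signal model than under the
# decoy model reaches only ~94% posterior at this skewed prior.
print(posterior(1e-4, 1e-8))
```

This is why the raw likelihood ratio threshold must be set much higher than 1: the heavily skewed prior pushes the posterior of most candidate sites toward zero.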
Sensitivity vs. specificity
[Figure: ROC-style curve, true positives (0.6–0.9) vs. false positives (0.0–0.4), for PWM-0, PWM-2, and HOT-2]
(Set of 41 human genes with 533 exons)
How much data do you need for training?
[Figure: specificity (%) of donor sites at 90% sensitivity vs. amount of training data (5,000–10,000 examples), for PWM-0, PWM-1, PWM-2, PWM-3, TREE, HOT-2, HOT-3]
(Set of 41 human genes with 533 exons)
Signals in gene finding – summary
• Main problem: how to capture dependencies between non-adjacent signal positions
• Traditional tradeoff: how many dependencies we can capture without overfitting (limited in-degree of vertices in the model)
• Training HOT models is hard in general; however, integer programming does a reasonable job
• Structured models (HOTs) improve performance over unstructured models (PWMs)
• Improving the performance of a signal model may significantly impact performance in other parts of the gene model