Combinatorics of RNA Secondary Structures


Christine E. Heitsch School of Mathematics Georgia Institute of Technology

The 23rd Clemson Mini-Conference on Discrete Mathematics and Algorithms – October 3, 2008

Discrete Mathematical Biology

Combinatorics. . . “as motivated by”

and

“with applications to” . . . molecular biology.

C. E. Heitsch, GaTech


Combinatorics on (Biological) Words

[Figure: a DNA molecule and an RNA molecule.]

DNA and RNA are (oriented, biochemical) sequences over (nucleotide) alphabets with the complementary Watson-Crick base pairing.
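As a small illustration of complementary base pairing, the sketch below computes the reverse complement of an RNA word. The name `reverse_complement` and the lowercase alphabet are choices made here for illustration, and only the canonical Watson-Crick pairs a-u and g-c are modeled.

```python
# Watson-Crick complements over the RNA alphabet {a, u, c, g}.
RNA_COMPLEMENT = {"a": "u", "u": "a", "c": "g", "g": "c"}


def reverse_complement(word: str) -> str:
    """Return the word that pairs with `word` when read in the opposite orientation."""
    return "".join(RNA_COMPLEMENT[base] for base in reversed(word.lower()))


print(reverse_complement("gggggg"))  # cccccc
print(reverse_complement("gcuc"))    # gagc
```

Because the two strands of a helix run in opposite orientations, the complement has to be reversed before it is compared with a downstream segment of the same sequence.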

RNA: More Than Just the Messenger

"The structure of Pariacoto virus reveals a dodecahedral cage of duplex RNA," by Tang et al. in Nat Struct Biol.

"Breakthrough of the year: Small RNAs Make Big Splash," published in Science, issue of 20 Dec 2002.

RNA Structural Hierarchy

Three Dimensional RNA Molecular Structure

Tertiary: all other intra-molecular interactions
Secondary: set of base pairs induced by self-bonding
Primary: linear sequence of nucleotide bases

[Figure: tRNA.]

Folded RNA Molecules Form Structures

Selective base pair hybridization ⇐⇒ structure and function

GCGGAUUUAG CUCAGUUGG GAGAGCGCCA GCCUGAAGA UCUGGAGGUC CUGGUUCGA UCCACAGAAU UCGCACCA

Primary sequence −→ secondary structure −→ 3D molecule

Important Biomathematical Questions

GCGGAUUUAG CUCAGUUGG GAGAGCGCCA GCCUGAAGA UCUGGAGGUC CUGGUUCGA UCCACAGAAU UCGCACCA

Prediction? Analysis? Design?

How do RNA sequences encode secondary structures?

Representing RNA Configurations by Trees

Abstract folded sequence to its graphical “skeleton”:

[Figure: a folded RNA structure next to its plane tree. Labels on the structure: 1-loop, 4-loop, 6 stacked base pairs, external loop. Labels on the tree: leaf vertex, vertex of degree 3, edge of weight 6, root vertex; edge weights 5, 5, 4, 6.]

stacked base pairs −→ edges, single-stranded regions −→ vertices.
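The abstraction can be sketched in a few lines. The code below assumes the structure is given in dot-bracket notation (a common text encoding that the slides do not use) and that the brackets are balanced; the function name and the output shape, a list of `(edge_weight, subtree)` pairs for the root, are illustrative choices rather than the author's representation.

```python
def structure_to_tree(dot_bracket: str):
    """Return the root's ordered list of (edge_weight, subtree) pairs."""
    partner, stack = {}, []
    for pos, symbol in enumerate(dot_bracket):
        if symbol == "(":
            stack.append(pos)
        elif symbol == ")":
            left = stack.pop()
            partner[left], partner[pos] = pos, left

    def loop(lo: int, hi: int):
        """Edges leaving the loop (vertex) that occupies positions lo..hi-1."""
        edges, p = [], lo
        while p < hi:
            q = partner.get(p)
            if q is None:
                p += 1  # unpaired base: it belongs to this loop
                continue
            weight = 1
            # Extend the helix while the next pair stacks on the previous one.
            while p + weight < q - weight and partner.get(p + weight) == q - weight:
                weight += 1
            edges.append((weight, loop(p + weight, q - weight + 1)))
            p = q + 1  # continue after the closing base of this helix
        return edges

    return loop(0, len(dot_bracket))


print(structure_to_tree("(((..((...))..((...))..)))..."))
# [(3, [(2, []), (2, [])])]
```

In the example, the external loop (root) has one edge of weight 3 leading to a vertex of degree 2, whose two edges of weight 2 end in leaf vertices (hairpin loops).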

Designing RNA Secondary Structures

Given a plane tree T with edge weights W, produce (by a deterministic algorithm) an RNA sequence R such that S(R) is represented by T.

[Figure: T (edge weights 5, 5, 4, 6) =⇒ Design =⇒ R =⇒ Predict =⇒ S(R).]

R = gcggau uua gcuc aguuggga gagc g ccaga cugaaga ucugg agguc cugug uucgauc cacag a auucgc acca

Sequence to Structure: A One-to-Many Mapping

[Figure: five secondary structure drawings (output of sir_graph by D. Stewart and M. Zuker), with reported energies dG = -39.02 [initially -43.4], dG = -42.3 [initially -42.3], dG = -41.7 [initially -41.7], dG = -42.4 [initially -42.4], and dG = -42.4 [initially -42.4].]

Secondary structures for the combinatorial RNA sequence R = aaaa gggggg aaaa cccccc aaaa gggggg aaaa cccccc aaaa gggggg aaaa cccccc aaaa.

Plane / Ordered / Linear Trees

From Stanley’s Enumerative Combinatorics, Vol. 2:

Definition. A plane tree T is a rooted tree whose subtrees at any vertex are linearly ordered. A vertex with k children has degree k.

$T_3$ = { [Figure: the five plane trees with 3 edges.] }

$T_n = \{T \mid T \text{ has } n \text{ edges (and } n + 1 \text{ vertices)}\}$ and $|T_n| = \frac{1}{n+1}\binom{2n}{n} = C_n$.
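The count can be checked by brute force for small n. In this sketch a plane tree is encoded as the tuple of its root's subtrees, in order; that encoding and the function names are choices made here, not notation from the slides.

```python
from functools import lru_cache
from math import comb


@lru_cache(maxsize=None)
def plane_trees(n: int):
    """All plane trees with n edges; a tree is the ordered tuple of its root's subtrees."""
    if n == 0:
        return ((),)
    trees = []
    for k in range(n):  # first subtree has k edges, plus 1 edge joining it to the root
        for first in plane_trees(k):
            for rest in plane_trees(n - 1 - k):
                trees.append((first,) + rest)
    return tuple(trees)


def catalan(n: int) -> int:
    """C_n = binom(2n, n) / (n + 1)."""
    return comb(2 * n, n) // (n + 1)


for n in range(6):
    print(n, len(plane_trees(n)), catalan(n))
```

The recursion splits a tree into its first subtree and the tree formed by the remaining subtrees, which mirrors the usual Catalan recurrence.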

Local Moves (“as motivated by”)

Intuitively, switch the “pairing” between the “half-edges” of any two adjacent edges.

Local Moves Give Global Structure

[Figure: the graph $G_3$.]

Theorem (H). The graph $G_n$ is connected with diameter $n - 1$ and $n$-partite with disjoint sets $T_{n,k}$ where $|T_{n,k}| = \frac{1}{n}\binom{n}{k}\binom{n}{k-1}$.
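The sizes in the theorem are the Narayana numbers. The slide does not say how the classes $T_{n,k}$ are defined; the sketch below assumes, as one standard interpretation, that $T_{n,k}$ collects the plane trees with n edges and k leaves, and checks the formula against a brute-force count. The tuple encoding of trees and all function names are illustrative.

```python
from collections import Counter
from functools import lru_cache
from math import comb


@lru_cache(maxsize=None)
def plane_trees(n: int):
    """All plane trees with n edges, each encoded as the ordered tuple of its subtrees."""
    if n == 0:
        return ((),)
    return tuple(
        (first,) + rest
        for k in range(n)
        for first in plane_trees(k)
        for rest in plane_trees(n - 1 - k)
    )


def leaves(tree) -> int:
    """Number of vertices without children."""
    return 1 if not tree else sum(leaves(child) for child in tree)


def narayana(n: int, k: int) -> int:
    """(1/n) * binom(n, k) * binom(n, k - 1)."""
    return comb(n, k) * comb(n, k - 1) // n


def leaf_counts(n: int) -> Counter:
    """Map k to the number of plane trees with n edges and k leaves (assumed T_{n,k})."""
    return Counter(leaves(tree) for tree in plane_trees(n))


print(sorted(leaf_counts(3).items()))  # [(1, 1), (2, 3), (3, 1)]
```

Summing the class sizes over k = 1, ..., n recovers the Catalan number $C_n$.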

Catalan Numbers and Noncrossing Partitions

[Figure: two partitions of {1, 2, 3, 4} drawn on points around a circle.]

P = {1}, {2, 3}, {4}

P′ = {1, 3, 4}, {2}

Theorem (Fomin, H). The lattice $(T_n, \le)$ is isomorphic to NC(n), the lattice of noncrossing partitions of [n] ordered by refinement.

Open Enumeration Problem Solved

Complementation operation [Kreweras, 1972]:

[Figure: the points 1, 2, 3, 4 and 1’, 2’, 3’, 4’ arranged around a circle.]

Counting orbits in the Catalan lattice

        Orbits of Length
 n     2     4     6     n    2n    Total     Cn
 2    (1)    0     0     1     0        1      2
 3     1     0     0     1     0        2      5
 4     1    (1)    0     1     1        3     14
 5     1     0     0     2     3        6     42
 6     1     1     0     3     9       14    132
 7     1     0     0     5    28       34    429
 8     1     1     0     8    85       95   1430
 9     1     0     3    14   262      280   4862
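The orbit totals can be reproduced for small n with a short brute-force sketch. It encodes each noncrossing partition as the permutation whose cycles are the blocks in increasing order and takes the complement to be $K(\pi) = \pi^{-1} c$ with $c$ the cycle $(1\,2\,\ldots\,n)$. That formula is one common convention for the Kreweras map and is an assumption here; the inverse convention has the same orbits. Elements are numbered from 0 in the code.

```python
from collections import Counter
from itertools import combinations


def set_partitions(elements):
    """Yield every partition of the list `elements` as a list of blocks."""
    if not elements:
        yield []
        return
    first, rest = elements[0], elements[1:]
    for partition in set_partitions(rest):
        for i in range(len(partition)):  # join an existing block
            yield partition[:i] + [[first] + partition[i]] + partition[i + 1:]
        yield [[first]] + partition      # or start a new block


def is_noncrossing(partition) -> bool:
    for block_a, block_b in combinations(partition, 2):
        for a, c in combinations(sorted(block_a), 2):
            inside = [x for x in block_b if a < x < c]
            if inside and len(inside) != len(block_b):
                return False  # block_b has elements both inside and outside (a, c)
    return True


def to_permutation(partition, n):
    """Each block becomes a cycle listed in increasing order."""
    perm = [0] * n
    for block in partition:
        block = sorted(block)
        for j, x in enumerate(block):
            perm[x] = block[(j + 1) % len(block)]
    return tuple(perm)


def kreweras(perm):
    """Assumed convention: K(pi)(i) = pi^{-1}(c(i)), where c(i) = i + 1 mod n."""
    n = len(perm)
    inverse = [0] * n
    for i, image in enumerate(perm):
        inverse[image] = i
    return tuple(inverse[(i + 1) % n] for i in range(n))


def orbit_lengths(n: int) -> Counter:
    """Map each orbit length to the number of orbits of that length."""
    points = {to_permutation(p, n)
              for p in set_partitions(list(range(n))) if is_noncrossing(p)}
    seen, lengths = set(), Counter()
    for start in points:
        length, current = 0, start
        while current not in seen:
            seen.add(current)
            current = kreweras(current)
            length += 1
        if length:
            lengths[length] += 1
    return lengths


print(sorted(orbit_lengths(4).items()))  # [(2, 1), (4, 1), (8, 1)]
```

Applying the map twice rotates a partition by one step, so every orbit length divides 2n; this is why the table only needs columns for 2, 4, 6, n, and 2n in this range.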

Discrete Mathematical Biology

Combinatorics. . . “as motivated by”

and

“with applications to” . . . molecular biology.


Important Biomathematical Questions

GCGGAUUUAG CUCAGUUGG GAGAGCGCCA GCCUGAAGA UCUGGAGGUC CUGGUUCGA UCCACAGAAU UCGCACCA

Prediction? Analysis? Design?

How do RNA sequences encode secondary structures?

Predicting Nested RNA Base Pairings

Let $R = b_1 b_2 \ldots b_n \in \{a, u, c, g\}^+$ be a 5’ to 3’ RNA sequence.

Definition. Let $S(R)$ be a set of base pairs $S(R) = \{b_i - b_j \mid 1 \le i < j \le n\}$ where all $b_i - b_j$ and $b_{i'} - b_{j'}$ are distinct, $i = i' \iff j = j'$, and either $i < i' < j' < j$ or $i < j < i' < j'$. Then $S(R)$ is a nested secondary structure of $R$.
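The definition translates directly into a validity check. In this sketch a base pair $b_i - b_j$ is represented by the index pair `(i, j)` with 1-based positions; the function name and that representation are choices made here. Only the index conditions of the definition are tested, not whether the paired letters are complementary.

```python
from itertools import combinations


def is_nested_structure(pairs, n: int) -> bool:
    """Check that `pairs` satisfies the index conditions of a nested secondary structure."""
    pairs = list(pairs)
    if any(not (1 <= i < j <= n) for i, j in pairs):
        return False
    indices = [x for pair in pairs for x in pair]
    if len(indices) != len(set(indices)):
        return False  # some base occurs in two different pairs
    for (i, j), (k, l) in combinations(sorted(pairs), 2):
        # After sorting, i < k. Allowed: nested (i < k < l < j) or disjoint (i < j < k < l).
        if not (l < j or j < k):
            return False  # crossing: i < k < j < l
    return True


print(is_nested_structure([(1, 10), (2, 9), (12, 20)], 20))  # True
print(is_nested_structure([(1, 5), (3, 8)], 10))             # False
```

The second example fails because the pairs cross (1 < 3 < 5 < 8), which is exactly the configuration the nesting condition rules out.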

Thermodynamics of RNA Folding

RNA secondary structures are balanced between energetically favorable helices (stacked base pairs) and destabilizing loops (single-stranded regions).

[Figure: a secondary structure with its helices and loops labeled.]

Nearest Neighbor Energy Model

S. cerevisiae Phe-tRNA at 37 °C

∆G = -22.7 kcal/mole

RNA Viral Genomes (“with applications to”)

Selective base pair hybridization ⇐⇒ structure and function

The armyworm Pariacoto virus (or PaV) consists of 2 ssRNA sequences: RNA1 with 3,011 nucleotides and RNA2 with 1,311 nucleotides.

Hundreds of different secondary structures have free energy near the minimal dG = −1669.4 kcal/mole.

Primary sequence −→ secondary structure ?!? −→ 3D molecule

Identifying Significant Base Pairings

Two mfold predicted secondary structures: a PaV RNA2 configuration (right) and the base pairing of a “scrambled” RNA2 sequence (below) which is much more extended and less extensively branched.

[Figure: the two predicted secondary structures.]

Analyzing Vertex Energies

For a tree T, let EL(T) be the sum of the vertex energies.

Vertex degree         Related loop      Minimal energy
0                     1-loop            4.10
1                     2-loop            2.30
k − 1, k ≥ 3          k-loop, k ≥ 3     3.40 - 1.50 k
root, degree j ≥ 1    external          -1.90 j

[Figure: the five plane trees T with 3 edges, labeled EL(T) = 6.6, EL(T) = 6.7, EL(T) = 6.7, EL(T) = 5.2 (marked “Minimal Energy”), and EL(T) = 6.8.]

Plane trees T with 3 edges and their total loop energies EL(T).
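The five values can be recomputed from the table. The sketch below encodes a plane tree as the ordered tuple of its root's subtrees (an encoding chosen here for illustration) and sums the minimal vertex energies exactly as tabulated.

```python
from functools import lru_cache


@lru_cache(maxsize=None)
def plane_trees(n: int):
    """All plane trees with n edges, each encoded as the ordered tuple of its subtrees."""
    if n == 0:
        return ((),)
    return tuple(
        (first,) + rest
        for k in range(n)
        for first in plane_trees(k)
        for rest in plane_trees(n - 1 - k)
    )


def vertex_energy(degree: int, is_root: bool) -> float:
    """Minimal energy of one vertex, following the table."""
    if is_root:
        return -1.90 * degree          # external loop, root of degree j
    if degree == 0:
        return 4.10                    # 1-loop
    if degree == 1:
        return 2.30                    # 2-loop
    return 3.40 - 1.50 * (degree + 1)  # k-loop with k = degree + 1 >= 3


def loop_energy(tree, is_root: bool = True) -> float:
    """EL(T): the sum of the vertex energies over the whole tree."""
    return vertex_energy(len(tree), is_root) + sum(
        loop_energy(child, False) for child in tree
    )


print(sorted(round(loop_energy(tree), 1) for tree in plane_trees(3)))
# [5.2, 6.6, 6.7, 6.7, 6.8]
```

The minimum 5.2 belongs to the tree whose root has degree 1 and whose single child has degree 2: -1.90 + (3.40 - 4.50) + 4.10 + 4.10.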

Minimal Loop Energy Configurations

Plane trees T with 3 edges and their total loop energies EL(T): 6.6, 6.7, 6.7, 5.2 (minimal), and 6.8.

Theorem (H). For plane trees T with n edges, the total loop energy EL(T) is minimal when T has the maximal number of vertices with degree 2. (When n is odd, the root has degree 1.)
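A brute-force check of the theorem for small n is sketched below. It reads the statement as follows, which is an interpretation rather than the author's wording: every minimal-energy tree attains the largest possible number of degree-2 vertices (counting the root), and for odd n its root has degree 1. Energies are kept in integer tenths so that ties are compared exactly; the tree encoding and names are illustrative.

```python
from functools import lru_cache


@lru_cache(maxsize=None)
def plane_trees(n: int):
    """All plane trees with n edges, each encoded as the ordered tuple of its subtrees."""
    if n == 0:
        return ((),)
    return tuple(
        (first,) + rest
        for k in range(n)
        for first in plane_trees(k)
        for rest in plane_trees(n - 1 - k)
    )


def energy_tenths(tree, is_root: bool = True) -> int:
    """EL(T) in tenths, using the minimal vertex energies 4.10, 2.30, 3.40 - 1.50k, -1.90j."""
    degree = len(tree)
    if is_root:
        own = -19 * degree
    elif degree == 0:
        own = 41
    elif degree == 1:
        own = 23
    else:
        own = 34 - 15 * (degree + 1)
    return own + sum(energy_tenths(child, False) for child in tree)


def degree_two_vertices(tree) -> int:
    """Number of vertices (root included) with exactly two children."""
    return (len(tree) == 2) + sum(degree_two_vertices(child) for child in tree)


def theorem_holds(n: int) -> bool:
    trees = plane_trees(n)
    best = min(energy_tenths(t) for t in trees)
    most = max(degree_two_vertices(t) for t in trees)
    minimal = [t for t in trees if energy_tenths(t) == best]
    max_branching = all(degree_two_vertices(t) == most for t in minimal)
    root_ok = n % 2 == 0 or all(len(t) == 1 for t in minimal)
    return max_branching and root_ok


print([theorem_holds(n) for n in range(1, 8)])
```

The converse does not hold: for n = 3 the two trees with EL(T) = 6.7 also have one degree-2 vertex (the root), so the root condition for odd n is needed to single out the minimum.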

Designing, Analyzing, and Predicting. . .

Mathematical Result: Associated loop energies are minimized by maximizing vertices with three edges.

Biological Hypothesis: Branching degree in viral RNA loops correlates with functional significance.

In Conclusion

There is much to explore at the interface of discrete mathematics and molecular biology.


Acknowledgments

• Our Clemson hosts and conference organizers!
• NIH R01, Joint DMS/NIGMS Initiative to Support Research in the Area of Mathematical Biology.
• Burroughs Wellcome Fund Career Award at the Scientific Interface.
• Computation and Informatics in Biology and Medicine Training Program.
• Anne Condon (CS) and Holger Hoos (CS), University of British Columbia.
• Sergey Fomin (Math), University of Michigan, Ann Arbor.
• Steve Harvey (Bio) and David Bader (CS), Georgia Institute of Technology.
• Michael Zuker’s mfold algorithm.