Combinatorics of RNA Secondary Structures
Christine E. Heitsch, School of Mathematics, Georgia Institute of Technology
The 23rd Clemson Mini-Conference on Discrete Mathematics and Algorithms – October 3, 2008
Discrete Mathematical Biology
Combinatorics. . . “as motivated by”
and
“with applications to” . . . molecular biology.
C. E. Heitsch, GaTech
Combinatorics on (Biological) Words
DNA −→
←− RNA
DNA and RNA are (oriented, biochemical) sequences over (nucleotide) alphabets with the complementary Watson-Crick base pairing.
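As a small illustration of complementary pairing on oriented words, the sketch below computes the reverse complement of an RNA word. The name `reverse_complement` and the example word are illustrative choices, not from the talk, and only the canonical pairs a-u and c-g are assumed.

```python
# Canonical Watson-Crick pairs over the RNA alphabet
# (assumption: wobble pairs such as g-u are not modeled here).
RNA_PAIR = {"a": "u", "u": "a", "c": "g", "g": "c"}

def reverse_complement(word: str) -> str:
    """Return the word that pairs with `word` when read in the opposite orientation."""
    return "".join(RNA_PAIR[base] for base in reversed(word.lower()))

print(reverse_complement("gcggau"))  # auccgc
```

Applying the function twice returns the original word, which reflects the symmetry of the pairing.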
RNA: More Than Just the Messenger
The structure of Pariacoto virus reveals a dodecahedral cage of duplex RNA, by Tang et al. in Nat Struct Biol.
Breakthrough of the year: Small RNAs Make Big Splash. Published in Science, issue of 20 Dec 2002.
RNA Structural Hierarchy
Primary: linear sequence of nucleotide bases
Secondary: set of base pairs induced by self-bonding
Tertiary: all other intra-molecular interactions
[Figure: three-dimensional RNA molecular structure of a tRNA.]
Folded RNA Molecules Form Structures
Selective base pair hybridization ⇐⇒ structure and function
GCGGAUUUAG CUCAGUUGG GAGAGCGCCA GCCUGAAGA UCUGGAGGUC CUGGUUCGA UCCACAGAAU UCGCACCA
Primary sequence −→ secondary structure −→ 3D molecule
Important Biomathematical Questions
Prediction? Analysis? Design?
GCGGAUUUAG CUCAGUUGG GAGAGCGCCA GCCUGAAGA UCUGGAGGUC CUGGUUCGA UCCACAGAAU UCGCACCA
How do RNA sequences encode secondary structures?
Representing RNA Configurations by Trees
Abstract folded sequence to its graphical “skeleton”: stacked base pairs −→ edges, single-stranded regions −→ vertices.
[Figure: a folded sequence and its tree. The external loop corresponds to the root vertex, a 1-loop to a leaf vertex, and a 4-loop to a vertex of degree 3; 6 stacked base pairs correspond to an edge of weight 6. The edge weights shown are 5, 5, 4, and 6.]
Designing RNA Secondary Structures
Given a plane tree T with edge weights W, produce (by a deterministic algorithm) an RNA sequence R such that S(R) is represented by T.
T (edge weights 5, 5, 4, 6) =⇒ [Design] R =⇒ [Predict] S(R)
R = gcggau uua gcuc aguuggga gagc g ccaga cugaaga ucugg agguc cugug uucgauc cacag a auucgc acca
Sequence to Structure: A One-to-Many Mapping
[Figure: five secondary structure drawings, output of sir_graph by D. Stewart and M. Zuker, with dG = -39.02 [initially -43.4], dG = -42.3, dG = -41.7, dG = -42.4, and dG = -42.4.]
Secondary structures for the combinatorial RNA sequence R = aaaa gggggg aaaa cccccc aaaa gggggg aaaa cccccc aaaa gggggg aaaa cccccc aaaa.
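One way to read the combinatorial sequence R is as a depth-first walk around a weighted plane tree: a run of a's at every vertex visit, g's when descending an edge, and c's when returning along it. The sketch below is an illustrative reading under that assumption, not the design algorithm from the talk; the names `walk_sequence`, `Tree`, and `star` are hypothetical.

```python
from typing import List, Tuple

# A plane tree is a list of (edge_weight, child_subtree) pairs in left-to-right order.
Tree = List[Tuple[int, "Tree"]]

def walk_sequence(tree: Tree, loop: str = "aaaa") -> str:
    """Depth-first walk: emit `loop` at each vertex visit, g^w when
    descending an edge of weight w, and c^w when returning along it."""
    parts = [loop]
    for weight, child in tree:
        parts.append("g" * weight)
        parts.append(walk_sequence(child, loop))
        parts.append("c" * weight)
        parts.append(loop)
    return "".join(parts)

# Root with three leaf children, every edge of weight 6.
star = [(6, []), (6, []), (6, [])]
R = walk_sequence(star)
print(R)
```

For this tree the walk reproduces the sequence R in the caption (with the spaces removed). As the slide title stresses, such a sequence can still fold into several structures of similar energy, so the walk alone does not guarantee that S(R) is represented by the intended tree.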
Plane / Ordered / Linear Trees
From Stanley’s Enumerative Combinatorics, Vol. 2:
Definition. A plane tree T is a rooted tree whose subtrees at any vertex are linearly ordered. A vertex with k children has degree k.
[Figure: T3, the set of five plane trees with 3 edges.]
Tn = {T | n edges (and n + 1 vertices)} and $|T_n| = \frac{1}{n+1}\binom{2n}{n} = C_n$.
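The count can be checked by brute force for small n. The sketch below represents a plane tree as the tuple of its ordered subtrees (a representation chosen here for illustration) and compares the number of trees with the Catalan formula; `plane_trees` and `catalan` are hypothetical helper names.

```python
from functools import lru_cache
from math import comb

@lru_cache(maxsize=None)
def plane_trees(n: int) -> tuple:
    """All plane trees with n edges; a tree is the tuple of its ordered subtrees."""
    if n == 0:
        return ((),)
    trees = []
    # The first subtree uses k edges (plus one edge to the root); the
    # remaining subtrees together form a plane tree with n - 1 - k edges.
    for k in range(n):
        for first in plane_trees(k):
            for rest in plane_trees(n - 1 - k):
                trees.append((first,) + rest)
    return tuple(trees)

def catalan(n: int) -> int:
    """C_n = binom(2n, n) / (n + 1)."""
    return comb(2 * n, n) // (n + 1)

for n in range(8):
    assert len(plane_trees(n)) == catalan(n)

print(len(plane_trees(3)))  # 5
```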
Local Moves (“as motivated by”)
Intuitively, switch the “pairing” between the “half-edges” of any two adjacent edges.
Local Moves Give Global Structure
[Figure: the graph G3.]
Theorem (H). The graph Gn is connected with diameter n − 1 and n-partite with disjoint sets Tn,k where $|T_{n,k}| = \frac{1}{n}\binom{n}{k}\binom{n}{k-1}$.
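Since the n parts are disjoint and cover Tn, their sizes should add up to the Catalan number Cn. The sizes in the theorem are the Narayana numbers, and the sketch below checks the sum numerically; `narayana` and `catalan` are helper names introduced here.

```python
from math import comb

def narayana(n: int, k: int) -> int:
    """(1/n) * binom(n, k) * binom(n, k - 1); the division is exact."""
    return comb(n, k) * comb(n, k - 1) // n

def catalan(n: int) -> int:
    """C_n = binom(2n, n) / (n + 1)."""
    return comb(2 * n, n) // (n + 1)

for n in range(1, 10):
    sizes = [narayana(n, k) for k in range(1, n + 1)]
    assert sum(sizes) == catalan(n)

print([narayana(3, k) for k in range(1, 4)])  # [1, 3, 1]
```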
Catalan Numbers and Noncrossing Partitions
[Figure: two noncrossing partitions drawn on the points 1, 2, 3, 4.]
P = {1}, {2, 3}, {4}
P' = {1, 3, 4}, {2}
Theorem (Fomin, H). The lattice (Tn, ≤) is isomorphic to NC(n), the lattice of noncrossing partitions of [n] ordered by refinement.
Open Enumeration Problem Solved
Complementation operation [Kreweras, 1972]:
[Figure: the points 1, 2, 3, 4 and 1', 2', 3', 4' interleaved around a circle.]
Counting orbits in the Catalan lattice

        Orbits of Length
 n     2     4     6     n    2n    Total     Cn
 2    (1)    0     0     1     0        1      2
 3     1     0     0     1     0        2      5
 4     1    (1)    0     1     1        3     14
 5     1     0     0     2     3        6     42
 6     1     1     0     3     9       14    132
 7     1     0     0     5    28       34    429
 8     1     1     0     8    85       95   1430
 9     1     0     3    14   262      280   4862
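The orbit counts can be reproduced for small n by brute force. The sketch below assumes the standard permutation form of the Kreweras complement, where a noncrossing partition is encoded as a permutation π whose cycles are its blocks in increasing order, and the complement is π ↦ π⁻¹c with c the long cycle. All function names are hypothetical, and points are labeled 0..n-1 rather than 1..n.

```python
from itertools import combinations

def set_partitions(elems):
    """Yield all set partitions of the list `elems` as lists of blocks."""
    if not elems:
        yield []
        return
    first, rest = elems[0], elems[1:]
    for part in set_partitions(rest):
        # Put `first` into an existing block, or into a new block of its own.
        for i in range(len(part)):
            yield part[:i] + [[first] + part[i]] + part[i + 1:]
        yield [[first]] + part

def is_noncrossing(part):
    """False if some a < b < c < d has a, c in one block and b, d in another."""
    for A, B in combinations(part, 2):
        for X, Y in ((A, B), (B, A)):
            for a, c in combinations(sorted(X), 2):
                inside = any(a < y < c for y in Y)
                outside = any(y < a or y > c for y in Y)
                if inside and outside:
                    return False
    return True

def as_permutation(part, n):
    """Each block b1 < ... < bk becomes the cycle (b1 b2 ... bk) on 0..n-1."""
    perm = list(range(n))
    for block in part:
        b = sorted(block)
        for i, x in enumerate(b):
            perm[x] = b[(i + 1) % len(b)]
    return tuple(perm)

def kreweras(perm):
    """Map i to pi^{-1}(c(i)), where c is the long cycle i -> i + 1 mod n."""
    n = len(perm)
    inverse = [0] * n
    for i, x in enumerate(perm):
        inverse[x] = i
    return tuple(inverse[(i + 1) % n] for i in range(n))

def orbit_lengths(n):
    """Return (number of noncrossing partitions, sorted list of orbit lengths)."""
    elements = {as_permutation(p, n)
                for p in set_partitions(list(range(n))) if is_noncrossing(p)}
    lengths, seen = [], set()
    for start in elements:
        if start in seen:
            continue
        length, current = 0, start
        while current not in seen:
            seen.add(current)
            current = kreweras(current)
            length += 1
        lengths.append(length)
    return len(elements), sorted(lengths)

for n in range(2, 7):
    count, lengths = orbit_lengths(n)
    print(n, count, len(lengths), lengths)
```

The enumeration filters all set partitions, so it is only practical for small n; it is meant as a check of the first rows of the table, not as an efficient method.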
Discrete Mathematical Biology
Combinatorics. . . “as motivated by”
and
“with applications to” . . . molecular biology.
Important Biomathematical Questions
Prediction? Analysis? Design?
GCGGAUUUAG CUCAGUUGG GAGAGCGCCA GCCUGAAGA UCUGGAGGUC CUGGUUCGA UCCACAGAAU UCGCACCA
How do RNA sequences encode secondary structures?
Predicting Nested RNA Base Pairings
Let R = b1b2 . . . bn ∈ {a, u, c, g}+ be a 5′ to 3′ RNA sequence.
Definition. Let S(R) be a set of base pairs S(R) = {bi − bj | 1 ≤ i < j ≤ n} where all bi − bj and bi′ − bj′ are distinct, i = i′ ⇐⇒ j = j′, and either i < i′ < j′ < j or i < j < i′ < j′. Then S(R) is a nested secondary structure of R.
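The conditions in the definition are easy to test mechanically. The sketch below takes base pairs as 1-based index pairs (i, j) and checks that no index is shared and that any two pairs are either nested or disjoint; the name `is_nested` is hypothetical, and, like the definition, the check ignores which bases are actually complementary.

```python
from itertools import combinations

def is_nested(pairs, n):
    """Check the definition's conditions for index pairs (i, j) on a sequence of length n."""
    pairs = sorted(set(pairs))
    if any(not (1 <= i < j <= n) for i, j in pairs):
        return False
    for (i, j), (i2, j2) in combinations(pairs, 2):
        # After sorting, i <= i2. Two distinct pairs may not share an index...
        if len({i, j, i2, j2}) < 4:
            return False
        # ...and must be nested (i < i2 < j2 < j) or disjoint (i < j < i2 < j2).
        if not (i < i2 < j2 < j or i < j < i2 < j2):
            return False
    return True

print(is_nested([(1, 10), (2, 9), (12, 20)], 20))  # True
print(is_nested([(1, 5), (3, 8)], 10))             # False: the pairs cross
```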
Thermodynamics of RNA Folding
RNA secondary structures are balanced between energetically favorable helices (stacked base pairs) and destabilizing loops (single-stranded regions).
[Figure: a secondary structure with its helices and loops labeled.]
Nearest Neighbor Energy Model
S. cerevisiae Phe-tRNA at 37 °C
∆G = -22.7 kcal/mole
RNA Viral Genomes (“with applications to”)
Selective base pair hybridization ⇐⇒ structure and function
The armyworm Pariacoto virus (or PaV) consists of 2 ssRNA sequences: RNA1 with 3,011 nucleotides and RNA2 with 1,311 nucleotides.
There are hundreds of different secondary structures with free energy near the minimal dG = −1669.4 kcal/mole.
Primary sequence −→ secondary structure ?!? −→ 3D molecule
Identifying Significant Base Pairings
Two mfold predicted secondary structures: a PaV RNA2 configuration (right) and the base pairing of a “scrambled” RNA2 sequence (below), which is much more extended and less extensively branched.
Analyzing Vertex Energies
For a tree T, let EL(T) be the sum of the vertex energies.

Vertex degree         Related loop      Minimal energy
0                     1-loop            4.10
1                     2-loop            2.30
k − 1, k ≥ 3          k-loop, k ≥ 3     3.40 − 1.50 k
root, degree j ≥ 1    external          −1.90 j

[Figure: the five plane trees T with 3 edges and their total loop energies EL(T) = 6.6, 6.7, 6.7, 5.2, and 6.8; the minimal energy is 5.2.]
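The five values in the figure can be recomputed from the table. The sketch below enumerates the plane trees with 3 edges, using a tuple-of-subtrees representation chosen here for illustration, and sums the tabulated vertex energies; the function names are hypothetical.

```python
def plane_trees(n):
    """All plane trees with n edges; a tree is the tuple of its ordered subtrees."""
    if n == 0:
        return [()]
    return [(first,) + rest
            for k in range(n)
            for first in plane_trees(k)
            for rest in plane_trees(n - 1 - k)]

def vertex_energy(degree, is_root=False):
    """Minimal energies from the table; a non-root vertex of degree d >= 2 is a (d + 1)-loop."""
    if is_root:
        return -1.90 * degree
    if degree == 0:
        return 4.10
    if degree == 1:
        return 2.30
    return 3.40 - 1.50 * (degree + 1)

def loop_energy(tree, is_root=True):
    """EL(T): the sum of the vertex energies over all vertices of the tree."""
    total = vertex_energy(len(tree), is_root)
    return total + sum(loop_energy(child, is_root=False) for child in tree)

energies = sorted(round(loop_energy(t), 2) for t in plane_trees(3))
print(energies)  # [5.2, 6.6, 6.7, 6.7, 6.8]
```

The minimum, 5.2, belongs to the tree whose root has a single child with two leaf children.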
Minimal Loop Energy Configurations
[Figure: the five plane trees T with 3 edges and their total loop energies EL(T) = 6.6, 6.7, 6.7, 5.2, and 6.8; the minimal energy is 5.2.]
Theorem (H). For plane trees T with n edges, the total loop energy EL(T) is minimal when T has the maximal number of vertices with degree 2. (When n is odd, the root has degree 1.)
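A short calculation with the tabulated minimal energies suggests why binary branching wins. This is a sketch of one possible argument, not the proof from the talk. The symbols are introduced here for illustration: j is the root degree, L the number of leaves, U the number of non-root vertices of degree 1, B the number of non-root vertices of degree at least 2, and X the sum of (d − 2) over those B vertices.

```latex
\begin{align*}
e(0) &= 4.10, \quad e(1) = 2.30, \quad
e(d) = 3.40 - 1.50\,(d+1) = 1.90 - 1.50\,d \quad (d \ge 2),\\
n &= L + U + B && \text{(non-root vertices)},\\
n &= j + U + 2B + X && \text{(edges, counted at the parent)},\\
L &= j + B + X,\\
E_L(T) &= -1.90\,j + 4.10\,L + 2.30\,U + 1.90\,B - 1.50\,(2B + X)\\
       &= 1.5\,n + 0.7\,j + 0.8\,U + 1.1\,X .
\end{align*}
```

Since n − j − U − X = 2B must be even, odd n is best served by j = 1 and U = X = 0, giving EL(T) = 1.5 n + 0.7 (that is, 5.2 for n = 3), and even n by j = 2 and U = X = 0, giving 1.5 n + 1.4. In both cases every non-root internal vertex has degree 2.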
Designing, Analyzing, and Predicting. . .
Mathematical Result: Associated loop energies are minimized by maximizing vertices with three edges.
Biological Hypothesis: Branching degree in viral RNA loops correlates with functional significance.
In Conclusion
There is much to explore at the interface of discrete mathematics and molecular biology.
Acknowledgments
• Our Clemson hosts and conference organizers!
• NIH R01, Joint DMS/NIGMS Initiative to Support Research in the Area of Mathematical Biology.
• Burroughs Wellcome Fund Career Award at the Scientific Interface.
• Computation and Informatics in Biology and Medicine Training Program.
• Anne Condon (CS) and Holger Hoos (CS), University of British Columbia.
• Sergey Fomin (Math), University of Michigan, Ann Arbor.
• Steve Harvey (Bio) and David Bader (CS), Georgia Institute of Technology.
• Michael Zuker’s mfold algorithm.