Towards a DNA Sequencing Theory
(Learning a String) (Preliminary Version)
Ming Li, University of Waterloo
"DNA: Death's Natural Alternative" - Globe and Mail, April 14, 1990
Key words: Machine Learning, Shortest Common Superstring, Approximation Algorithm, DNA Sequencing.
Abstract

In laboratories, DNA sequencing is (roughly) done by sequencing a large amount of short fragments and then finding a shortest common superstring of the fragments. We study frameworks, under plausible assumptions, suitable for massive automated DNA sequencing and for analyzing DNA sequencing algorithms. We model the DNA sequencing problem as learning a superstring from its randomly drawn substrings. Under certain restrictions, this may be viewed as learning a superstring in the Valiant learning model, and in this case we give an efficient algorithm for learning a superstring and a quantitative bound on how many samples suffice.

One major obstacle to our approach turns out to be a quite well-known open question on how to approximate the shortest common superstring of a set of strings, raised by a number of authors in the last ten years [6, 29, 30]. We give the first provably good algorithm, which approximates the shortest superstring, of length n, by a superstring of length O(n log n).

1 Introduction

The fundamental quest of Life is to understand how it functions. Since the discovery of deoxyribonucleic acid (DNA) in the 1950's, the new field of molecular genetics has expanded at a rate that can be matched by no other field except computer science.

A DNA molecule, which holds the secrets of Life, consists of a sugar-phosphate backbone and, attached to it, a long sequence of four kinds of nucleotide bases. At an abstract level, a single-stranded DNA molecule can be represented as a character string over the set of nucleotides {A, C, G, T}. Such a character string ranges from a few thousand symbols long for a simple virus to 2 x 10^8 symbols for a fly and to 3 x 10^9 symbols for a human being. Determining this string for different molecules, or sequencing the molecules, is a crucial step towards understanding the biological functions of the molecules. Huge (in terms of money and man-years) national projects of sequencing the human genome are proposed or underway although, unfortunately, major technical problems remain.

With current laboratory methods, such as the Maxam-Gilbert or Sanger procedure, it is by far impossible

*The author is supported by NSERC operating grants OGP-0036747 and OGP-0046506. Address: Computer Science Department, University of Waterloo, Ontario N2L 3G1, Canada. Email: mli@watmath.waterloo.edu
CH2925-6/90/0000/0125$01.00 © 1990 IEEE
to sequence a long molecule (of 3 x 10^9 base pairs for a human) as a whole. Each time, a randomly chosen fragment of only up to 500 base pairs can be sequenced.

Various methods to sequence the long molecule are used. In general, biochemists "cut" millions of such (identical) molecules into pieces each typically containing about 200-500 nucleotides (characters) [4, 16]. For example, a biochemist can decide to cut the molecules (via restriction enzymes) whenever the substring GGACTT appears. She or he may apply many different cuttings using different substrings. After cutting, these fragments can be roughly "sorted" according to their "weight" ("length") in a not very precise fashion. A biochemist "samples" the fragments (of each length, plus or minus 50 base pairs) randomly. Then, say, the Sanger procedure is applied to sequence the sampled fragment. A good technician can process, say, 7 to 14 such fragments a day manually. Sampling is expensive and slow in the sense that it is manual or, at best, mechanical. The sampling is random in the sense that we have absolutely no idea where in the 3 x 10^9 character long sequence the fragment of 500 characters might come from. It is also well known that long repetitions appear in a human genome.

From hundreds, sometimes millions, of these random fragments, a biochemist has to assemble the superstring representing the whole molecule. Programs that try to approximate a shortest common superstring (whole molecule) of a given set of strings (fragments) are routinely used [23, 16]. However, it is not known whether this always works or why it works. It is reported in [23, 27, 16] that programs based on shortest common superstring algorithms^1 work very satisfactorily for real biological data. In [29], it is mentioned that "although there is no a priori reason to aim at a shortest superstring, this seems to be the most natural restriction that makes the program non-trivial." The goal of this paper is to initiate a study of mathematical foundations for the above discussions and experiments. We will try to formulate the problem in different ways and provide solutions in certain situations. We face at least two major problems:

(a) Given a set of strings, how do we efficiently approximate the shortest common superstring (finding it is NP-complete)? This has been an open problem raised by Gallant, Maier and Storer [6], Turner [30], and Tarhio and Ukkonen [29] in the past ten years. The latter two papers contain two algorithms, based on the maximum overlapping method also used by biochemists, and a conjecture that these algorithms guarantee good performance with respect to optimal length.

(b) Even given good approximation algorithms for the superstrings, does that guarantee we can infer the DNA sequence correctly?

We will provide an answer for (a) by giving a provably good approximation algorithm for the shortest common superstring problem. Our algorithm outputs a superstring of length at most n log n where n is the optimal length. We will also partially answer (b). Under certain restrictions, we study how well the shortest superstring algorithm does, and we provide a new "sequencing" algorithm under the Valiant learning model. We will also give quantitative bounds on the number of random fragments needed to be sequenced.

This paper studies Valiant learning in a domain where the sample space is of polynomial size. Since these concepts are trivially polynomially learnable, not much attention has been paid to them. In the past, researchers concentrated on the "learnability" of concept classes whose sample spaces are, of course (otherwise the problem would be trivial), super-polynomial [31, 2, 12, 8, 26, 22, 11], although efficient sampling was studied for example in [9]. On the other hand, our problem has trivial algorithms that need a high polynomial number of samples, but it also has non-trivial algorithms requiring a low polynomial number of samples.

It is worth noting that artificial intelligence methods [21] and string matching algorithms have been

^1 It was only conjectured that these algorithms approximate shortest superstrings.
extensively applied to DNA sequence analysis [5].

The paper is organized as follows: In the next section, we provide a solution to question (a). Our new approximation algorithm is also crucial in the following sections. In Section 3, we study possible ways of modeling our DNA sequencing problem. We will associate this problem with the Valiant learning model. We also state a stronger version of the Occam's Razor theorem proved in [3], which is needed in Section 4. Then in Section 4 we give provably efficient learning algorithms for the DNA sequencing problem.

2 Approximating the Shortest Common Superstring

As we have discussed above, the shortest common superstring problem plays an important role in current DNA sequencing practice. This section will present the first provably good algorithm for this problem.

2.1 The Shortest Common Superstring Problem

The shortest common superstring problem is: Given a finite set S of strings over some finite or infinite alphabet Σ, find a shortest string w such that each string x in S is a substring of w, i.e., w can be written as uxv for some u and v. The decision version of this problem is known to be NP-complete [6, 7]. Due to its wide range of applications in many areas such as data compression and molecular biology [29, 30, 6], finding good approximation algorithms for the shortest common superstring problem has become an important subject of research. It is an open question to find a provably good approximation algorithm [6, 29, 30] for this problem. We provide a solution. The superstring constructed by our algorithm is at most n log n long, where n is the optimal length.

There have been two independent results, by Tarhio and Ukkonen [29] and Turner [30], both of which considered similar approximation algorithms. Both [29, 30] conjectured that their algorithms would return a superstring within 2 times the optimal length. But this Tarhio-Turner-Ukkonen conjecture, and the question of finding an algorithm with a non-trivial performance guarantee with respect to length, remain open today. Papers [29, 30] did establish some performance guarantees with respect to the so-called "overlapping" measure. Such a measure basically counts the number of bits saved by the algorithm compared to plainly concatenating all the strings. It was shown that if the optimal solution saves m bits, then the approximate solutions in [29, 30] save at least m/2 bits. However, in general this has no implication with respect to the optimal length, since in the best case it only says that the approximation procedure produces a superstring which is half the total length of all strings.

The basic algorithm of [29, 30, 28] is quite simple and elegant. It is as follows: Given a set of strings S, choose a pair of strings from S with the largest overlap, merge these two, and put the result back in S. Repeat this process until only one string, the superstring, is left in S. This algorithm has always produced a superstring of length within 2 times the optimal length in all, random or artificial, experiments so far.

We will give a different algorithm. Our algorithm guarantees that the solution is within a log n multiplicative factor of the optimal length n, although we believe that it achieves 2n.

Notation: We usually use small letters for strings and capital letters for sets. If s is a string, then |s| denotes its length, that is, the number of characters in s. If S is a set, then |S| denotes its cardinality, that is, the number of elements in set S.
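To make the maximum-overlap heuristic of [29, 30] discussed above concrete, here is a small sketch in Python. It is an illustration under our own naming, not code from those papers: repeatedly merge the pair with the largest suffix-prefix overlap until one string remains.

```python
def overlap(a: str, b: str) -> int:
    """Length of the longest suffix of a that is a prefix of b."""
    for k in range(min(len(a), len(b)), 0, -1):
        if a.endswith(b[:k]):
            return k
    return 0

def greedy_superstring(strings):
    """Greedy heuristic of [29, 30]: repeatedly merge the pair of
    strings with the largest overlap until one string is left."""
    # Remove duplicates and strings that are substrings of others.
    S = [s for s in strings
         if not any(s != t and s in t for t in strings)]
    S = list(dict.fromkeys(S))
    while len(S) > 1:
        best = None  # (overlap length, i, j)
        for i in range(len(S)):
            for j in range(len(S)):
                if i == j:
                    continue
                k = overlap(S[i], S[j])
                if best is None or k > best[0]:
                    best = (k, i, j)
        k, i, j = best
        merged = S[i] + S[j][k:]  # overlap the two strings maximally
        S = [s for idx, s in enumerate(S) if idx not in (i, j)]
        S.append(merged)
    return S[0]
```

For example, greedy_superstring(["abc", "bcd", "cde"]) returns the optimal superstring "abcde"; in general, the conjecture that the output is within 2 times the optimal length is exactly the open question discussed above.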
2.2 The Approximation Algorithm
Assume that we are given a set S of m strings over some finite or infinite alphabet Σ. Let t be an optimal superstring with |t| = n, so each string in S is a substring of t. Following [29, 30], we assume that no string in S is a substring of another, since such strings can be easily deleted from S. Hence we can order the strings in S by the left ends of their first appearances in t, reading from left to right. We list them according to the above ordering: s_1, ..., s_m. In the following we identify s_i with its first appearance in t.

The idea behind our algorithm is to merge large groups. Each time, we try to determine two strings such that merging them properly would make many others become substrings of the newly merged string. For two strings s and s', we use the word merge to mean that we put s and s' (and nothing else) together, possibly utilizing some overlaps they have in common, to construct a superstring of the two. In general there may be more than one way to merge s and s'. There may be two optimal ways, and many other non-optimal ways. For example, if s = 010 and s' = 00200, we can merge them with s in front optimally as m_1(s, s') = 0100200, or with s' in front optimally as m_2(s, s') = 0020010; we can also merge them non-optimally as m_3(s, s') = 01000200 or m_4(s, s') = 00200010. These are all possible ways of merging s, s'. For each merge function m, we write m(s, s') to denote the resulting superstring. There are at most 2 min{|s|, |s'|} ways of merging s and s'.

We now present our algorithm.

Group-Merge: Input: S = {s_1, ..., s_m}.
(0) Let T := ∅.
(1) Find s, s' ∈ S such that
        cost(s, s') = min_m |m(s, s')| / weight(m(s, s'))
    is minimized, where
        weight(m(s, s')) = Σ_{a ∈ A} |a|
    and A is the set of strings in S that are substrings of m(s, s').
(2) Merge s, s' to m(s, s') as defined in Step (1). T := T ∪ {m(s, s')}. Delete set A from S.
(3) If |S| > 0 then go to (1).
(4) If |T| > 1 then S := T and go to (1), else return the only string in T as the superstring.

In the above algorithm we could put the result m(s, s') back into S, and hence get the following algorithm.

Group-Merge 1: Input: S = {s_1, ..., s_m}.
(0) Let weight(s_i) = |s_i| for each string s_i ∈ S.
(1) Find s, s' ∈ S such that
        cost(s, s') = min_m |m(s, s')| / weight(m(s, s'))
    is minimized, where
        weight(m(s, s')) = Σ_{a ∈ A} weight(a)
    and A is the set of strings in S that are substrings of m(s, s').
(2) Merge s, s' to m(s, s') as defined in Step (1). Delete set A from S. Add m(s, s') to S.
(3) If |S| > 1 then go to (1), else the only string left in S is the superstring.

In order to keep the analysis simple, we only analyze steps (0)-(3) of algorithm Group-Merge.

Theorem 1 Given a set of strings S, if the length of the optimal superstring is n, then algorithm Group-Merge produces a superstring of length O(n log n).

Proof: As discussed above, we can assume that in S no string is a substring of another and that all strings are ordered by their first appearance in the shortest superstring. Let this order be s_1, s_2, ..., s_m.

We separate S into groups: The first group G_1 contains s_1, ..., s_i, where i, i < m, is the largest index such that (the first appearances of) s_1 and s_i overlap in t; the second group contains s_{i+1}, ..., s_j, where j, i + 1 ≤ j ≤ m, is the largest index such that s_j overlaps with s_{i+1} in t; and so on. In general, if s_k is the last element in G_l, then G_{l+1} contains s_{k+1}, ..., s_p, where p, k + 1 ≤ p ≤ m, is the largest index such that s_{k+1} overlaps with s_p in t. See Figure 1. Assume that there are g groups: G_1, ..., G_g. For G_i let b_i and t_i be the first (bottom) and last (top) strings in G_i, according to our ordering, respectively. That is, b_i and t_i sandwich the rest of the strings in G_i, and for some optimal m_i, every string in G_i is a substring of m_i(b_i, t_i).
[Figure 1: Grouping of substrings (where k' ≤ k indicates the first step at which G_j^{k'} becomes empty).]

Lemma 2 Σ_{i=1}^{g} |m_i(b_i, t_i)| ≤ 2n, where n is the length of the shortest superstring t for S.

Proof: This can easily be seen geometrically: put all the groups G_1, ..., G_g back to their original positions in the optimal arrangement (which gives the shortest superstring t). Then the strings in G_i overlap with nothing except the strings in G_{i-1} and G_{i+1} (and of course in G_i itself), for i = 2, ..., g - 1. Thus counting the optimal solution for each G_i separately at most doubles the length of the optimal solution. □

Lemma 3 For each G_i, after deleting an arbitrary number of strings from G_i, among the strings left we can still merge a pair of strings in G_i so that all other strings are substrings of this merge. Furthermore, the resulting merge is a substring of m_i(b_i, t_i).

Proof: Consider the original optimal arrangement and ordering of the strings s_1, ..., s_m in t. All strings in G_i overlap at a single position. So after deleting some strings, we can still properly merge the first (bottom) and last (top) strings among the leftover strings in G_i to obtain a superstring such that all the leftover strings in G_i are substrings of this superstring. Obviously this superstring is also a substring of m_i(b_i, t_i). □

For a set of strings A, let ||A|| = Σ_{a ∈ A} |a|. Let G_i^r be the set of strings remaining in G_i before the r-th iteration, and let S_r be the set of strings cancelled at the r-th iteration of Group-Merge. Let b_r, t_r be the first and last strings in S_r, according to our ordering, respectively, and let m_r be the merge function used in the r-th iteration to combine b_r and t_r. Let there be a total of B iterations executed by Group-Merge.

Now the length L_j used to merge the strings in G_j can be bounded as follows:

    L_j ≤ |m_j(b_j, t_j)| H(||G_j||) = |m_j(b_j, t_j)| O(log ||G_j||),

where H(m) = Σ_{i=1}^{m} 1/i = O(log m). Hence the total length used to merge all of G_1, ..., G_g is, by Lemma 2, at most

    Σ_{j=1}^{g} L_j ≤ O(log l) Σ_{i=1}^{g} |m_i(b_i, t_i)| ≤ O(log l) 2n = O(n log l),

where l = max_i ||G_i||. But O(log l) = O(log n), since n is (polynomially) larger than the number of strings in any G_i and (polynomially) larger than the length of the longest string in any G_i. Therefore the algorithm will output a superstring of length at most O(n log n). □

Remark. We conjecture that our algorithm always outputs a superstring of length no more than 2n (and possibly even less), where n is the optimal length. Vijay Vazirani constructed an example for which our algorithm outputs a superstring of length 1.3n. Samir Khuller suggested that we can replace the weight function by just counting the number of strings "sandwiched" by the two merging strings. Although this algorithm still guarantees the n log n performance, a counterexample shows that the output sometimes exceeds 2n in length.
3 Modeling DNA Sequencing Problem

The task of this section is to properly put our DNA sequencing problem into a suitable mathematical framework. In the following, we safely assume that the expected sampling length l is much larger than log n, where n is the length of the DNA sequence^2. It may also be reasonable to assume that, in order to sequence a human genome of 3 x 10^9 characters at about 500 characters per fragment, the sequencing process must be fully automated without human intervention. We are interested in fast and simple sequencing methods that can be easily automated. Another safe assumption we will make is that sampling will be uniformly random. This is more or less the reality, or achievable. We consider two models.

3.1 Recovering It Precisely

If we can assume that a DNA sequence s is random and substrings from S_l are uniformly sampled, where S_l contains s's substrings of length l and the substrings of length less than l that are prefixes or suffixes of s, then s can be identified with no error with high probability.

Just to simplify the calculation, we consider a fixed l, rather than all l (less than 500). This should not change the result. We provide a rough calculation. Let K(s) ≥ |s| = n, where K(s) is the Kolmogorov complexity of s^3. Divide s into 2n/l consecutive blocks, each of length l/2. Let E_i denote the event that the i-th block is not fully covered by any sample, and let E = ∪_i E_i. At each random sampling, a fixed block is fully covered with probability about l/(2n), so Pr(E_i) ≤ 1 - l/(2n) per sample. Then after O((n log n)/l) samples, Pr(E) ≤ n^{-O(1)}. So all positions of s are covered with probability 1 - n^{-O(1)}. By the same calculation, the probability that some sampled substring does not overlap for more than l/2 characters with some other substring in the sample is also less than n^{-O(1)}. Since s is random, there are no repetitions in s of length greater than l/2 if l > 4 log n [18]; we thus can precisely recover s, with probability 1 - n^{-O(1)}, after O((n log n)/l) samples.

Of course, assuming that s is random is dangerous, since it is known that human DNA molecules are not very random and actually contain many (pretty long) repetitions (and redundancies). The above calculation may be meaningful only for lower life forms whose repetitions are mostly shorter than l/2.

3.2 Learning It Approximately

An alternative approach, which we will take in the rest of this paper, is to make no assumptions about the sequence s, although we will lose some power of our theory. We appeal to the Valiant learning model. Our learned DNA sequence will only be good, with high probability, for short queries about several hundred characters long. Although this only partially captures reality, we are able to fully characterize the world we have captured. This is still meaningful since certain biological functions are encoded by just a few hundred base pairs. One purpose of DNA sequencing is to find the functionality of the genes; there is really no need to insist on recovering the original sequence precisely, especially when this is impossible.

We first describe the distribution-independent model of learning introduced by Valiant [31]. More discussion and justification of this model can be found in [2, 12, 14, 32, 33]; [15] has been a useful source for me. We assume that the learning algorithm A has available a black box called EXAMPLES, with two buttons labeled POS and NEG. If POS (NEG) is pushed, a positive (negative) example is generated according to some fixed but unknown probability distribution D^+ (D^-). We assume nothing about the distributions D^+ and D^-, except that

^2 By current technology l is about 500, whereas n ≤ 3 x 10^9 to be interesting.
^3 We refer the reader to [18] for an introduction to Kolmogorov complexity. Here we give an intuitive definition: the Kolmogorov complexity of a binary string x conditional to y, written K(x|y), is the length of the shortest Pascal program, encoded in binary in a standard way, that starts with input y, prints x, and halts. K(x) = K(x|ε).
Σ_{s ∈ POS} D^+(s) = 1 and Σ_{s ∈ NEG} D^-(s) = 1 (i.e., Σ_{s ∈ POS} D^-(s) = 0 and Σ_{s ∈ NEG} D^+(s) = 0). For discrete domains, Valiant learnability can be defined as follows.

Definition 1 Let C and C' be concept classes. C is polynomially learnable from examples (by C') if there is a (randomized) algorithm A with access to POS and NEG which, taking inputs 0 < ε, δ < 1, for any c ∈ C and D^+, D^-, halts in polynomial(size(c), 1/ε, 1/δ) time and outputs a concept c' ∈ C' that with probability greater than 1 - δ satisfies Σ_{s ∈ c - c'} D^+(s) ≤ ε and Σ_{s ∈ c' - c} D^-(s) ≤ ε.

Theorem 4 Let C and C' be concept classes, and let c ∈ C with size(c) = n. For α ≥ 1 and 0 ≤ β < 1, let A be an algorithm that, on input of m/2 positive examples of c drawn from D^+ and m/2 negative examples of c drawn from D^-, outputs a hypothesis c' ∈ C' that is consistent with the examples and satisfies K(c'|c) ≤ n^α m^β, where K(c'|c) is the Kolmogorov complexity of c' conditional to c. Then A is a learning algorithm for C by C' for

    m = O(max((1/ε) log(1/δ), (n^α/ε)^{1/(1-β)})).
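The sample bound of Theorem 4 is easy to evaluate numerically. The following sketch computes the right-hand side up to the constant hidden in the O(·); the functional form is the standard Occam's Razor bound of [3], and the function name is ours.

```python
import math

def occam_sample_bound(n, alpha, beta, eps, delta):
    """m = max((1/eps) * log(1/delta), (n**alpha / eps)**(1/(1-beta))),
    up to the constant hidden in the O() of Theorem 4."""
    assert alpha >= 1 and 0 <= beta < 1
    assert 0 < eps < 1 and 0 < delta < 1
    return max(math.log(1 / delta) / eps,
               (n ** alpha / eps) ** (1 / (1 - beta)))
```

For instance, with n = 100, α = 1, β = 1/2 and ε = δ = 0.1, the second term dominates and the bound is (10^3)^2 = 10^6 examples; the compression exponent β controls how sharply the bound grows as the hypothesis is allowed to depend on the sample size.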