Towards a DNA Sequencing Theory

(Learning a String) (Preliminary Version)

Ming Li
University of Waterloo

"DNA: Death's Natural Alternative" - Globe & Mail, April 14, 1990

Key words: Machine Learning, Shortest Common Superstring, Approximation Algorithm, DNA Sequencing.

Abstract

In laboratories, DNA sequencing is (roughly) done by sequencing a large amount of short fragments and then finding a shortest common superstring of the fragments. We study frameworks, under plausible assumptions, suitable for massive automated DNA sequencing and for analyzing DNA sequencing algorithms. We model the DNA sequencing problem as learning a superstring from its randomly drawn substrings. Under certain restrictions, this may be viewed as learning a superstring in the Valiant learning model, and in this model we give an efficient algorithm for learning a superstring and a quantitative bound on how many samples suffice. One major obstacle to our approach turns out to be a quite well-known open question on how to approximate the shortest common superstring of a set of strings, raised by a number of authors in the last ten years [6, 29, 30]. We give the first provably good algorithm, which approximates a shortest superstring of length n by a superstring of length O(n log n).

1 Introduction

The fundamental quest of Life is to understand how it functions. Since the discovery of deoxyribonucleic acid (DNA) in the 1950's, the new field of molecular genetics has expanded at a rate that can be matched by no other field except computer science.

A DNA molecule, which holds the secrets of Life, consists of a sugar-phosphate backbone and, attached to it, a long sequence of four kinds of nucleotide bases. At an abstract level, a single-stranded DNA molecule can be represented as a character string over the set of nucleotides {A, C, G, T}. Such a character string ranges from a few thousand symbols long for a simple virus to 2 x 10^8 symbols for a fly and to 3 x 10^9 symbols for a human being. Determining this string for different molecules, or sequencing the molecules, is a crucial step towards understanding the biological functions of the molecules. Huge (in terms of money and man-years) national projects of sequencing the human genome have been proposed or are under way although, unfortunately, major technical problems remain.

With current laboratory methods, such as the Maxam-Gilbert or Sanger procedure, it is by far impossible

*The author is supported by NSERC operating grants OGP-0036747 and OGP-0046506. Address: Computer Science Department, University of Waterloo, Ontario N2L 3G1, Canada. Email: mli@watmath.waterloo.edu

CH2925-6/90/0000/0125$01.00 © 1990 IEEE

to sequence a long molecule (of 3 x 10^9 base pairs for a human) as a whole. Each time, a randomly chosen fragment of only up to 500 base pairs can be sequenced. Various methods to sequence the long molecule are used. In general, biochemists "cut" millions of such (identical) molecules into pieces, each typically containing about 200-500 nucleotides (characters) [4, 16]. For example, a biochemist can decide to cut the molecules wherever the substring GGACTT appears (via restriction enzymes). She or he may apply many different cuttings using different substrings. After cutting, these fragments can be roughly "sorted" according to their "weight" ("length") in a not very precise fashion. A biochemist then "samples" the fragments (of each length, plus or minus 50 base pairs) randomly. Then, say, the Sanger procedure is applied to sequence the sampled fragment. A good technician can process, say, 7 to 14 such fragments a day manually. Sampling is expensive and slow in the sense that it is manual or, at best, mechanical. The sampling is random in the sense that we have absolutely no idea where the fragment of 500 characters might come from in the 3 x 10^9 character long sequence. It is also well known that long repetitions appear in a human genome.

From hundreds, sometimes millions, of these random fragments, a biochemist has to assemble the superstring representing the whole molecule. Programs that try to approximate a shortest common superstring (whole molecule) of a given set of strings (fragments) are routinely used [23, 16]. However, it is not known whether this always works or why it works. It is reported in [23, 27, 16] that programs based on shortest common superstring algorithms* work very satisfactorily for real biological data. In [29], it is mentioned that "although there is no a priori reason to aim at a shortest superstring, this seems to be the most natural restriction that makes the program non-trivial." The goal of this paper is to initiate a study of mathematical foundations for the above discussions and experiments. We will try to formulate the problem in different ways and provide solutions in certain situations. We face at least two major problems:

(a) Given a set of strings, how do we efficiently approximate the shortest common superstring (finding it is NP-complete)? This has been an open problem raised by Gallant, Maier and Storer [6], Turner [30], and Tarhio and Ukkonen [29] in the past ten years. The latter two papers contain two algorithms, based on the maximum overlapping method also used by biochemists, and a conjecture that these algorithms guarantee good performance with respect to the optimal length.

(b) Even given good approximation algorithms for the superstrings, does that guarantee that we can infer the DNA sequence correctly?

We will provide an answer for (a) by giving a provably good approximation algorithm for the shortest common superstring problem. Our algorithm outputs a superstring of length at most n log n, where n is the optimal length. We will also partially answer (b). Under certain restrictions, we study how well the shortest superstring algorithm does and we provide a new "sequencing" algorithm under the Valiant learning model. We will also give quantitative bounds on the number of random fragments that need to be sequenced.

This paper studies Valiant learning in a domain where the sample space is of polynomial size. Since these concepts are trivially polynomially learnable, not much attention has been paid to them. In the past, researchers concentrated on the "learnability" of concept classes whose sample spaces are, of course (otherwise the problem would be trivial), super-polynomial [31, 2, 12, 8, 26, 22, 11], although efficient sampling was studied for example in [9]. On the other hand, our problem has trivial algorithms that need a high polynomial number of samples, but it also has non-trivial algorithms requiring a low polynomial number of samples. It is worth noting that artificial intelligence methods [21] and string matching algorithms have been extensively applied to DNA sequence analysis [5].

*It was only conjectured that these algorithms approximate shortest superstrings.
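The assembly task described above can be made concrete with a small, purely illustrative Python sketch (ours, not the paper's; the toy "genome" and fragments are made up): every sampled fragment is a substring of the molecule, any concatenation of the fragments is trivially a common superstring, and the assembly problem is to find a much shorter one.

```python
# Toy illustration (ours): fragments of a hypothetical molecule, and a
# check that a candidate assembly is a common superstring of them all.
genome = "ACGTGGACTTACGT"                            # made-up molecule
fragments = ["ACGTG", "GGACTT", "CTTACGT", "GTGGA"]  # sampled substrings

def is_common_superstring(w, strings):
    """True iff every string in `strings` occurs as a substring of w."""
    return all(x in w for x in strings)

# The true molecule is a common superstring of its fragments.
assert is_common_superstring(genome, fragments)

# Plain concatenation is always a common superstring, but a long one;
# here 23 characters versus the 14-character genome.
concat = "".join(fragments)
assert is_common_superstring(concat, fragments)
assert len(genome) < len(concat)
```

The gap between the concatenation and the true molecule is exactly what the shortest common superstring formulation tries to close.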

The paper is organized as follows: In the next section, we provide a solution to question (a). Our new approximation algorithm is also crucial in the following sections. In Section 3, we study possible ways of modeling our DNA sequencing problem. We will associate this problem with the Valiant learning model. We also state a stronger version of the Occam's Razor theorem proved in [3], which is needed in Section 4. Then in Section 4 we give provably efficient learning algorithms for the DNA sequencing problem.

2 Approximating the Shortest Common Superstring

As we have discussed above, the shortest common superstring problem plays an important role in current DNA sequencing practice. This section presents the first provably good approximation algorithm for this problem.

2.1 The Shortest Common Superstring Problem

The shortest common superstring problem is: Given a finite set S of strings over some finite or infinite alphabet Σ, find a shortest string w such that each string x in S is a substring of w, i.e., w can be written as uxv for some u and v. The decision version of this problem is known to be NP-complete [6, 7]. Due to its wide range of applications in many areas, such as data compression and molecular biology [29, 30, 6], finding good approximation algorithms for the shortest common superstring problem has become an important subject of research. It is an open question to find a provably good approximation algorithm [6, 29, 30] for this problem. We provide a solution. The superstring constructed by our algorithm is at most n log n long, where n is the optimal length.

There have been two independent results, by Tarhio and Ukkonen [29] and Turner [30], both of which considered similar approximation algorithms. Both [29, 30] conjectured that their algorithms would return a superstring within 2 times the optimal length. But this Tarhio-Turner-Ukkonen conjecture, and the question of finding an algorithm with a non-trivial performance guarantee with respect to length, remain open today. Papers [29, 30] did establish some performance guarantees with respect to the so-called "overlapping" measure. Such a measure basically counts the number of bits saved by the algorithm compared to plainly concatenating all the strings. It was shown that if the optimal solution saves m bits, then the approximate solutions in [29, 30] save at least m/2 bits. However, in general this has no implication with respect to the optimal length, since in the best case it only says that the approximation procedure produces a superstring of half the total length of all the strings. The basic algorithm of [29, 30, 28] is quite simple and elegant. It is as follows: Given a set of strings S, choose a pair of strings from S with the largest overlap, merge these two, and put the result back into S. Repeat this process until only one string, the superstring, is left in S. This algorithm has always produced a superstring of length within 2 times the optimal length in all, random or artificial, experiments so far.

We will give a different algorithm. Our algorithm guarantees that the solution is within a log n multiplicative factor of the optimal length n, although we believe that it achieves 2n.

Notation: We usually use small letters for strings and capital letters for sets. If s is a string, then |s| denotes its length, that is, the number of characters in s. If S is a set, then |S| denotes its cardinality, that is, the number of elements in set S.

2.2 The Approximation Algorithm

Assume that we are given a set S of m strings over some finite or infinite alphabet Σ. Let t be an optimal superstring with |t| = n, so each string in S is a substring of t. Following [29, 30], we assume that no string in S is a substring of another, since such strings can be easily deleted from S. Hence we can order the strings in S by the left ends of their first appearances in t, reading from left to right. We list them according to this ordering: s1, ..., sm. In the following we identify si with its first appearance in t.

The idea behind our algorithm is to merge large groups. Each time, we try to determine two strings such that merging them properly would make many others become substrings of the newly merged string. For two strings s and s', we use the word merge to mean that we put s and s' (and nothing else) together, possibly utilizing some overlaps they have in common, to construct a superstring of the two. In general there may be more than one way to merge s and s'. There may be two optimal ways, and many other non-optimal ways. For example, if s = 010 and s' = 00200, we can merge them with s in front optimally as m1(s, s') = 0100200, or with s' in front optimally as m2(s, s') = 0020010; we can also merge them non-optimally as m3(s, s') = 01000200 or m4(s, s') = 00200010. These are all possible ways of merging s and s'. For each merge function m, we write m(s, s') to denote the resulting superstring. There are at most 2 min{|s|, |s'|} ways of merging s and s'.

We now present our algorithm.

Group-Merge: Input: S = {s1, ..., sm}.
(0) Let T := ∅.
(1) Find s, s' ∈ S and a merge function m such that
        cost(s, s') = min_m |m(s, s')| / weight(m(s, s'))
    is minimized, where
        weight(m(s, s')) = Σ_{a∈A} |a|
    and A is the set of strings in S that are substrings of m(s, s').
(2) Merge s, s' into m(s, s') as defined in Step (1). T := T ∪ {m(s, s')}. Delete set A from S.
(3) If |S| > 0 then go to (1).
(4) If |T| > 1 then S := T and go to (1); else return the only string in T as the superstring.

In the above algorithm we could instead put the result m(s, s') back into S, and hence get the following algorithm.

Group-Merge 1: Input: S = {s1, ..., sm}.
(0) Let weight(si) = |si| for each string si ∈ S.
(1) Find s, s' ∈ S and a merge function m such that
        cost(s, s') = min_m |m(s, s')| / weight(m(s, s'))
    is minimized, where
        weight(m(s, s')) = Σ_{a∈A} weight(a)
    and A is the set of strings in S that are substrings of m(s, s').
(2) Merge s, s' into m(s, s') as defined in Step (1). Delete set A from S. Add m(s, s') to S.
(3) If |S| > 1 then go to (1); else the only string left in S is the superstring.

In order to keep the analysis simple, we only analyze steps (0)-(3) of algorithm Group-Merge.

Theorem 1 Given a set of strings S, if the length of the optimal superstring is n, then algorithm Group-Merge produces a superstring of length O(n log n).

Proof: As discussed above, we can assume that in S no string is a substring of another and that all strings are ordered by their first appearance in the shortest superstring. Let this order be s1, s2, ..., sm. We separate S into groups: The first group G1 contains s1, ..., si, where i, i < m, is the largest index such that (the first appearances of) s1 and si overlap in t; the second group contains s(i+1), ..., sj, where j, i+1 ≤ j ≤ m, is the largest index such that sj overlaps with s(i+1) in t; and so on. In general, if sk is the last element in Gl, then G(l+1) contains s(k+1), ..., sp, where p, k+1 ≤ p ≤ m, is the largest index such that s(k+1) overlaps with sp in t. See Figure 1.

[Figure 1: Grouping of Substrings]

Assume that there are g groups: G1, ..., Gg. For each Gi, let bi and ti be the first (bottom) and last (top) strings in Gi, according to our ordering, respectively. That is, bi and ti sandwich the rest of the strings in Gi, and for some optimal merge mi, every string in Gi is a substring of mi(bi, ti).

Lemma 2 Σ_{i=1}^{g} |mi(bi, ti)| ≤ 2n, where n is the length of the shortest superstring t for S.

Proof: This can easily be seen geometrically: put all the groups G1, ..., Gg back to their original positions in the optimal arrangement (which gives the shortest superstring t). Then the strings in Gi overlap with nothing except strings in G(i-1) and G(i+1) (and of course in Gi itself), for i = 2, ..., g-1. Thus counting the optimal solution for each Gi separately at most doubles the length of the optimal solution. □

Lemma 3 For each Gi, after deleting an arbitrary number of strings from Gi, among the strings left we can still merge a pair of strings in Gi so that all other strings are substrings of this merge. Furthermore, the resulting merge is a substring of mi(bi, ti).

Proof: Consider the original optimal arrangement and ordering of the strings s1, ..., sm in t. All strings in Gi overlap at a single position. So after deleting some strings, we can still properly merge the first (bottom) and last (top) strings among the leftover strings in Gi to obtain a superstring such that all the leftover strings in Gi are substrings of this superstring. Obviously this superstring is also a substring of mi(bi, ti). □

For a set of strings A, let ||A|| = Σ_{a∈A} |a|. Let Gi^r be the set of strings remaining in Gi before the r-th iteration. Let Sr be the set of strings cancelled at the r-th iteration of Group-Merge. Let br, tr be the first and last strings in Sr, according to our ordering, respectively. And let mr be the merge function used in the r-th iteration to combine br and tr. Let there be a total of B iterations executed by Group-Merge.

Now the length Lj used to merge the strings in Gj can be bounded as follows (where k' ≤ k indicates the first step at which Gj^{k'} becomes empty): since Group-Merge always chooses a pair of minimum cost and, by Lemma 3, the strings left in Gj can always be merged within a substring of mj(bj, tj),

    Lj ≤ |mj(bj, tj)| · H(||Gj||) = |mj(bj, tj)| · O(log ||Gj||),

where H(m) = 1 + 1/2 + ... + 1/m = O(log m). Hence the total length used to merge all of G1, ..., Gg is, by Lemma 2, at most

    Σ_{i=1}^{g} O(log l) · |mi(bi, ti)| ≤ O(log l) · 2n = O(n log l),

where l = max_i ||Gi||. But O(log l) = O(log n), since n is (polynomially) larger than the number of strings in any Gi and (polynomially) larger than the length of the longest string in any Gi. Therefore the algorithm will output a superstring of length at most O(n log n). □

Remark. We conjecture that our algorithm always outputs a superstring of length no more than 2n (and possibly even less), where n is the optimal length. Vijay Vazirani constructed an example for which our algorithm outputs a superstring of length 1.3n. Samir Khuller suggested that we can replace the weight function by just counting the number of strings "sandwiched" by the two merging strings. Although this variant still guarantees the n log n performance, a counterexample shows that its output sometimes exceeds 2n in length.
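Steps (0)-(3) of Group-Merge can be sketched in Python. This is our own simplified reading, not the authors' implementation: all helper names are ours, merges are enumerated as suffix/prefix overlaps of each ordered pair (the merge functions m of Section 2.2), and step (4) is realized by recursing on the collected merges T.

```python
from itertools import permutations

def merges(s, t):
    """All merges with s in front of t: an overlap of k means the
    length-k suffix of s equals the length-k prefix of t (k = 0 is
    plain concatenation)."""
    for k in range(min(len(s), len(t)), -1, -1):
        if k == 0 or s[-k:] == t[:k]:
            yield s + t[k:]

def group_merge(S):
    """Sketch of Group-Merge on a nonempty list of strings: repeatedly
    pick the merge minimizing cost = |merge| / (total length of the
    strings it absorbs), collect merges in T, then recurse on T."""
    # Drop strings that are substrings of another string in S.
    S = [x for x in dict.fromkeys(S)
         if not any(x != y and x in y for y in S)]
    T = []
    while len(S) > 1:
        best = None
        for s, t in permutations(S, 2):
            for w in merges(s, t):
                A = [a for a in S if a in w]      # strings absorbed by w
                cost = len(w) / sum(len(a) for a in A)
                if best is None or cost < best[0]:
                    best = (cost, w, A)
        _, w, A = best
        T.append(w)                               # step (2): T := T ∪ {w}
        S = [a for a in S if a not in A]          # delete A from S
    T.extend(S)                                   # a possible leftover string
    return group_merge(T) if len(T) > 1 else T[0] # step (4)

frags = ["0100", "00200", "1002"]
sup = group_merge(frags)
assert all(f in sup for f in frags)
```

The example merges of Section 2.2 are reproduced by `merges("010", "00200")`, whose outputs include 0100200 (the optimal merge m1) and 01000200 (the concatenation m3). This brute-force sketch is exponential in the worst case and is meant only to make the cost function concrete.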

3 Modeling the DNA Sequencing Problem

The task of this section is to properly put our DNA sequencing problem into a suitable mathematical framework. In the following, we safely assume that the expected sampling length l is much larger than log n, where n is the length of the DNA sequence**. It may also be reasonable to assume that, in order to sequence a human genome of length 3 x 10^9, at about 500 characters per fragment, the sequencing process must be fully automated, without human intervention. We are interested in fast and simple sequencing methods that can be easily automated. Another safe assumption we will make is that the sampling is uniformly random. This is more or less the reality, or at least achievable. We consider two models.

3.1 Recovering It Precisely

If we can assume that a DNA sequence s is random and that substrings from Sl are uniformly sampled, where Sl contains s's substrings of length l and the substrings of length less than l that are prefixes or suffixes of s, then s can be identified with no error with high probability.

Just to simplify the calculation, we consider a fixed l, rather than all l (less than 500). This should not change the result. We provide a rough calculation. Let K(s) ≥ |s| = n, where K(s) is the Kolmogorov complexity of s***. Divide s into 2n/l consecutive blocks of length l/2. Let Ei denote the event that the i-th block is not fully covered by any sample, and let E = ∪i Ei. At each random sampling, the probability of fully covering a fixed block is roughly l/(2n). Then after O((n/l) log n) samples, Pr(E) ≤ n^{-O(1)}. So all positions of s are covered with probability 1 - n^{-O(1)}. By the same calculation, the probability that there exists some substring which does not overlap by more than l/2 characters with some other substring in the sample is also less than n^{-O(1)}. Since s is random, there are no repetitions in s of length greater than l/2 if l > 4 log n [18]; we thus can precisely recover s, with probability 1 - n^{-O(1)}, after O((n/l) log n) samples.

Of course, assuming s to be random is dangerous, since it is known that human DNA molecules are not very random and actually contain many (pretty long) repetitions (and redundancies). The above calculation may be meaningful only for lower life forms, whose repetitions are mostly shorter than l/2.

3.2 Learning It Approximately

An alternative approach, which we will take in the rest of this paper, is to make no assumptions about the sequence s, although we will lose some power of our theory. We appeal to the Valiant learning model. Our learned DNA sequence will only be good, with high probability, for short queries about several hundred characters long. Although this only partially captures reality, we are able to fully characterize the world we have captured. This is still meaningful, since certain biological functions are encoded by just a few hundred base pairs. One purpose of DNA sequencing is to find the functionality of the genes; there is really no need to insist on recovering the original sequence precisely, especially when this is impossible.

We first describe the distribution-independent model of learning introduced by Valiant [31]. More discussion and justification of this model can be found in [2, 12, 14, 32, 33]. [15] has been a useful source for me. We assume that the learning algorithm A has available a black box called EXAMPLES, with two buttons labeled POS and NEG. If POS (NEG) is pushed, a positive (negative) example is generated according to some fixed but unknown probability distribution D+ (D-). We assume nothing about the distributions D+ and D-, except that

**By current technology l is about 500, whereas n may be as large as 3 x 10^9, for the problem to be interesting.
***We refer the reader to [18] for an introduction to Kolmogorov complexity. Here we give an intuitive definition: the Kolmogorov complexity of a binary string x conditional to y, written K(x|y), is the length of the shortest Pascal program, encoded in binary in a standard way, that starts with input y, prints x, and halts. K(x) = K(x|ε).

Σ_{x∈POS} D+(x) = 1 and Σ_{x∈NEG} D-(x) = 1 (i.e., Σ_{x∈NEG} D+(x) = 0 and Σ_{x∈POS} D-(x) = 0). For discrete domains, Valiant learnability can be defined as follows.

Definition 1 Let C and C' be concept classes. C is polynomially learnable from examples (by C') if there is a (randomized) algorithm A with access to POS and NEG which, taking inputs 0 < ε, δ < 1, for any c ∈ C and any D+, D-, halts in polynomial(size(c), 1/ε, 1/δ) time and outputs a concept c' ∈ C' that with probability greater than 1 - δ satisfies Σ_{x∈c-c'} D+(x) ≤ ε and Σ_{x∈c'-c} D-(x) ≤ ε.

Theorem 4 Let C and C' be concept classes. Let c ∈ C with size(c) = n. For α ≥ 1 and 0 ≤ β < 1, let A be an algorithm that, on input of m/2 positive examples of c drawn from D+ and m/2 negative examples of c drawn from D-, outputs a hypothesis c' ∈ C' that is consistent with the examples and satisfies K(c'|c) ≤ n^α m^β, where K(c'|c) is the Kolmogorov complexity of c' conditional to c. Then A is a learning algorithm for C by C' for

    m = O(max((1/ε) log(1/δ), (n^α/ε)^{1/(1-β)})).