Algorithms for Motif Search

Contents

1.1  Abstract
1.2  Introduction
1.3  Algorithms for Planted Motif Search
     A Probabilistic Analysis • The WINNOWER and SP-STAR Algorithms • Pattern Branching • Random Projection Algorithm • Algorithm PMS1 • Algorithm PMS2 • Algorithm PMS3 • Saving Memory • Extensions
1.4  Techniques for Edited Motif Search
     An Algorithm Similar to PMS1 • A Randomized Algorithm
1.5  Algorithms for Simple Motif Search
     A Simple Sorting Based Algorithm • Simple Motif Search (SMS) • Parallelism • TEIRESIAS Algorithm
1.6  Random Sampling
1.7  Conclusions

Sanguthevar Rajasekaran, University of Connecticut

1.1 Abstract

The problem of identifying meaningful patterns (i.e., motifs) in biological data has been studied extensively owing to its paramount importance. Three versions of this problem have been identified in the literature, and numerous algorithms have been proposed for them. In this chapter we survey some of these algorithms.

1.2 Introduction

Motif search is an important problem in biology. In general, it requires finding short patterns of interest in voluminous data. Several variants of the motif search problem have been identified in the literature. In this chapter we are interested in the following three versions.

Problem 1. The input is n sequences of length m each, together with two integers l and d. The problem is to find a motif (i.e., a sequence) M of length l. It is given that each input sequence contains a variant of M; the variants of interest are sequences that are at a Hamming distance of d from M. (This problem is also known as the planted (l, d)-motif search problem.)


Problem 2. The input is a database DB of sequences S1, S2, ..., Sn, together with integers l, d, and q. The output should be all the patterns in DB such that each pattern is of length l and occurs in at least q of the n sequences. A pattern U is considered an occurrence of another pattern V as long as the edit distance between U and V is at most d. We refer to this problem as the edited motif search problem.

Problem 3. A pattern is a string of symbols (also called residues) and ?'s, where "?" is a wild card character. A pattern cannot begin or end with ?. AB?D, EB??DS?R, etc. are examples of patterns. The length of a pattern is the number of characters in it (including the wild card characters). This problem takes as input a database DB of sequences. The goal is to identify all the patterns of length at most l (with anywhere from 0 to ⌊l/2⌋ wild card characters), together with a count of how many times each pattern occurs. Optionally, a threshold value for the number of occurrences could be supplied. We refer to this problem as the simple motif search problem.

1.3 Algorithms for Planted Motif Search

Though the planted motif search problem (Problem 1) is defined for arbitrary sequences, in the literature it is usually assumed that the input consists of DNA sequences (so that the alphabet size is 4). We also make this assumption. Numerous papers have been written on Problem 1; examples include Bailey and Elkan [4], Lawrence et al. [19], and Rocke and Tompa [28]. These algorithms employ local search techniques such as Gibbs sampling, expectation maximization, etc., and they may not always output the planted motif. We refer to such algorithms as approximation algorithms. Algorithms that always output the correct answer are referred to as exact algorithms.

Further algorithms have been proposed for Problem 1 by Pevzner and Sze [24] and by Buhler and Tompa [8]. The algorithm in [24] is based on finding cliques in a graph, and the algorithm of [8] employs random projections. These algorithms have been experimentally demonstrated to perform well; they are approximation algorithms as well. Details of these algorithms are given in this chapter.

Algorithms for Problem 1 can be categorized into two groups depending on the basic approach employed: profile-based algorithms and pattern-based algorithms (see e.g., [25]). Profile-based algorithms predict the starting positions of the occurrences of the motif in each sequence; pattern-based algorithms predict the motif (as a sequence of residues) itself.

Several pattern-based algorithms are known. Examples include PROJECTION [8], MULTIPROFILER [17], MITRA [11], and PatternBranching [25]. PatternBranching (due to Price, Ramabhadran, and Pevzner [25]) starts with random seed strings and performs local searches starting from these seeds. More details on this algorithm are provided later.

Examples of profile-based algorithms include CONSENSUS [15], GibbsDNA [19], MEME [3], and ProfileBranching [25]. The performance of a profile-based algorithm is specified with a measure called the "performance coefficient", which indicates how many positions (of the motif occurrences) have been predicted correctly. For the (15, 4) challenge problem, these algorithms have the following performance coefficients (respectively): 0.2, 0.32, 0.14, and 0.57. The run times of these algorithms for this instance are (respectively, in seconds): 40, 40, 5, and 80.

A profile-based algorithm could be either approximate or exact; likewise, a pattern-based algorithm may be either exact or approximate. Exact algorithms are also known as exhaustive enumeration algorithms in the literature.


Many exact algorithms are known (see e.g., [21, 7, 13, 30, 31, 32, 34, 26]). However, as pointed out in [8], these algorithms "become impractical for the sizes involved in the challenge problem". Exceptions are the MITRA algorithm [11] and the PMS algorithms of Rajasekaran, Balla, and Huang [26]; these algorithms are pattern-based and exact. MITRA solves, for example, the (15, 4) instance in 5 minutes using 100 MB of memory [11]. This algorithm is based on the WINNOWER algorithm [24] and uses pairwise similarity information; a new pruning technique enables MITRA to be more efficient than WINNOWER. MITRA uses a mismatch tree data structure and splits the space of all possible patterns into disjoint subspaces that start with a given prefix. The same (15, 4) instance is solved in 3.5 minutes by PMS [26]. It is noteworthy that profile-based algorithms such as CONSENSUS, GibbsDNA, MEME, and ProfileBranching take much less time for the (15, 4) instance [25]; however, these algorithms fall under the approximate category and may not always output the correct answer. Some of the pattern-based algorithms (such as PROJECTION, MULTIPROFILER, and PatternBranching) also take much less time [25], but these are approximate as well (though their success rates are close to 100%).

1.3.1 A Probabilistic Analysis

The problem of planted motif search is complicated by the fact that, for a given value of d, if the value of l is small enough, then the expected number of motifs that occur by random chance can be enormous. For instance, when n = 20, m = 600, l = 9, d = 2, the expected number of spurious motifs (that occur in each input sequence at a Hamming distance of d) is 1.6. On the other hand, for n = 20, m = 600, l = 10, d = 2, the expected number of spurious motifs is only 6.1 × 10^{-8}.

A probabilistic analysis to this effect can be conducted as follows. Let S_k be any input sequence (1 ≤ k ≤ n) and let u be any l-mer. Call the positions 1, l+1, 2l+1, ..., ⌈(m−l+1)/l⌉·l + 1 special positions. The probability that u occurs in S_k at a Hamming distance of d starting from a specific special position is p = C(l, d) (3/4)^d (1/4)^{l−d}. Thus the probability that u occurs in S_k starting from at least one of the special positions is 1 − (1 − p)^{m′}, where m′ = ⌈(m−l+1)/l⌉ + 1. As a result, the probability that u occurs somewhere in S_k is at least 1 − (1 − p)^{m′}. This means that the expected number of l-mers that occur in each of the n input sequences (at a Hamming distance of d) is at least 4^l [1 − (1 − p)^{m′}]^n.
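For concreteness, here is a small Python sketch (ours, not from the chapter) that evaluates a closely related estimate: it treats all m − l + 1 starting positions as independent and counts occurrences within Hamming distance at most d. Under these assumptions it reproduces the figures quoted above.

from math import comb

def expected_spurious_motifs(n, m, l, d):
    # Approximate E[# of l-mers that appear within Hamming distance <= d
    # in every one of n random DNA sequences of length m], treating the
    # m - l + 1 starting positions as independent.
    p = sum(comb(l, i) * 3**i for i in range(d + 1)) / 4**l
    q = 1 - (1 - p) ** (m - l + 1)   # P[at least one match in one sequence]
    return 4**l * q**n               # over all 4^l candidate l-mers

print(expected_spurious_motifs(20, 600, 9, 2))   # ~1.6
print(expected_spurious_motifs(20, 600, 10, 2))  # ~6.1e-8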

1.3.2 The WINNOWER and SP-STAR Algorithms

The algorithm of Pevzner and Sze [24], called WINNOWER, works as follows. If A and B are two instances (i.e., occurrences) of the motif, then the Hamming distance between A and B is at most 2d; in fact, the expected Hamming distance between A and B is 2d − 4d²/(3l). The algorithm constructs the collection C of all possible l-mers in the input. A graph G(V, E) is then constructed in which each l-mer in C corresponds to a node. Two nodes u and v of G are connected by an edge if and only if the Hamming distance between the corresponding two l-mers is at most 2d and these l-mers come from two different sequences. Clearly, the n instances of the motif M form a clique of size n in G. Thus the problem of finding M reduces to that of finding large cliques in G. Unfortunately, there will be numerous "spurious" edges (i.e., edges that do not connect instances of M) in G, and moreover finding cliques is NP-hard. Pevzner and Sze [24] employ a clever technique to prune spurious edges.

Pevzner and Sze [24] observe that the graph G constructed above is "almost random" and is multipartite. They use the notion of an extendable clique. If Q = {v1, v2, ..., vk} is any clique, a node u is called a neighbor of Q if {v1, v2, ..., vk, u} is also a clique; in other words, Q can be extended to a larger clique with the inclusion of u. A clique is called extendable if it has at least one neighbor in every part of the multipartite graph G. The algorithm WINNOWER is based on the observation that every edge in a maximal n-clique belongs to at least C(n−2, k−2) extendable cliques of size k. This observation is used to eliminate edges. WINNOWER is an iterative algorithm in which cliques of larger and larger sizes are constructed. The run time of the algorithm is O(N^{2d+1}) where N = nm, but in practice the algorithm runs much faster.

In [24] another algorithm called SP-STAR has also been given. This algorithm is faster than WINNOWER and uses less memory. The WINNOWER algorithm treats all the edges of G equally, without distinguishing between edges based on similarities. SP-STAR scores the l-mers of C as well as the edges of G appropriately and hence eliminates more edges than WINNOWER in any iteration. A sketch of the graph-building step follows.
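A minimal Python sketch of the graph construction (the names and data layout are ours, not from [24]): nodes are (sequence, position, l-mer) triples, and edges join l-mers from different sequences at Hamming distance at most 2d. The pruning of spurious edges via extendable cliques is omitted.

from itertools import combinations

def hamming(a, b):
    return sum(x != y for x, y in zip(a, b))

def build_winnower_graph(seqs, l, d):
    # Edges between l-mers of different sequences with distance <= 2d.
    nodes = [(i, j, s[j:j + l])
             for i, s in enumerate(seqs)
             for j in range(len(s) - l + 1)]
    edges = [(u, v) for u, v in combinations(nodes, 2)
             if u[0] != v[0] and hamming(u[2], v[2]) <= 2 * d]
    return nodes, edges

nodes, edges = build_winnower_graph(["ACGTACGT", "ACGAACGA"], 4, 1)
print(len(nodes), len(edges))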

1.3.3 Pattern Branching

A local search algorithm called PatternBranching has been proposed in [25]. If u is any l-mer, then there are C(l, d) 3^d l-mers that are at a Hamming distance of d from u; call each such l-mer a neighbor of u. One way of solving the planted motif search problem is to start from each l-mer in the input, search the neighbors of this l-mer, score them appropriately, and output the best scoring neighbor. There are a total of n(m − l + 1) l-mers in the input, and each of these l-mers has C(l, d) 3^d neighbors. For each such neighbor a score can be computed. Having computed the scores of all of these n(m − l + 1) C(l, d) 3^d neighbors, the best scoring neighbor is output. A similar approach has been employed by [35, 13].

Let S = S1, S2, ..., Sn be the collection of n given input sequences. The algorithm of [25] only examines a selected subset of the neighbors of any l-mer u of the input and hence is more efficient. For any l-mer u, let D_i(u) stand for the set of neighbors of u that are at a Hamming distance of i (for 1 ≤ i ≤ d). For any input sequence S_j, let d(u, S_j) denote the minimum Hamming distance between u and any l-mer of S_j (for 1 ≤ j ≤ n). Let d(u, S) = Σ_{j=1}^{n} d(u, S_j). For any l-mer u in the input, let BestNeighbor(u) stand for the neighbor v in D_1(u) whose distance d(v, S) is minimum among all the elements of D_1(u). The PatternBranching algorithm starts from a u, identifies u_1 = BestNeighbor(u), then identifies u_2 = BestNeighbor(u_1), and so on, finally reaching u_d. The best u_d from among all possible u's is output. A pseudocode for this algorithm is given next; in this pseudocode u and u_0 are the same.

Algorithm PatternBranching(S, l, d)
    Let M be an arbitrary l-mer;
    for each l-mer u in S do
        for j := 0 to d do
            if d(u_j, S) < d(M, S) then M := u_j;
            u_{j+1} := BestNeighbor(u_j);
    output M;

The above algorithm has been shown to perform well in practice [25]. Also, the above algorithm keeps only one best neighbor for each l-mer; instead, it is possible to keep more than one, i.e., BestNeighbor(u_j) could be a set of l-mers instead of a single l-mer.

In [25] a profile-based version of PatternBranching has been given as well. This version is called ProfileBranching. A Python rendering of the pattern-based version is sketched below.
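A compact sketch of the pseudocode above (the scoring and tie-breaking details of [25] are simplified, and BestNeighbor returns a single best element of D_1(u)):

def hamming(a, b):
    return sum(x != y for x, y in zip(a, b))

def dist_to_seq(u, seq):
    l = len(u)
    return min(hamming(u, seq[i:i + l]) for i in range(len(seq) - l + 1))

def total_dist(u, seqs):
    # d(u, S) = sum over sequences of the minimum Hamming distance.
    return sum(dist_to_seq(u, s) for s in seqs)

def d1_neighbors(u, alphabet="ACGT"):
    # All l-mers at Hamming distance exactly 1 from u.
    for i, c in enumerate(u):
        for a in alphabet:
            if a != c:
                yield u[:i] + a + u[i + 1:]

def pattern_branching(seqs, l, d):
    best, best_score = None, float("inf")
    for s in seqs:
        for i in range(len(s) - l + 1):
            u = s[i:i + l]                 # u_0
            for _ in range(d + 1):         # examine u_0, ..., u_d
                score = total_dist(u, seqs)
                if score < best_score:
                    best, best_score = u, score
                u = min(d1_neighbors(u), key=lambda v: total_dist(v, seqs))
    return best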

1.3.4 Random Projection Algorithm

The algorithm of Buhler and Tompa [8] is based on random projections. Let the motif M of interest be an l-mer. Let C be the collection of all the l-mers from all the n input sequences. Project these l-mers along k randomly chosen positions (for some appropriate value of k). In other words, for every l-mer u ∈ C, generate a k-mer u′, the subsequence of u corresponding to the k random positions chosen. (The random positions are the same for all the l-mers.) We can think of each k-mer thus generated as an integer, and we group the k-mers according to their integer values (i.e., we hash all the l-mers, using the k-mer of an l-mer as its hash value). If a hashed group has at least a threshold number s of l-mers in it, then there is a good chance that M will have its k-mer equal to the k-mer of this group.

There are n(m − l + 1) l-mers in the input and 4^k possible k-mers. Thus the expected number of l-mers that hash into the same bucket is n(m − l + 1)/4^k. The threshold value s is chosen to be twice this expected value. The value of k is chosen such that n(m − l + 1) < 4^k; this ensures that the expected number of random l-mers that hash into the same bucket is less than one. It should also be the case that k < l − d. Typical values used for k and s are 7 and 3, respectively.

The process of random hashing is repeated r times (for some appropriate value of r) so as to be sure that a bucket of size ≥ s is observed at least once. The value of r can be calculated as follows. The probability p that a given planted motif instance hashes to the planted bucket is given by C(l−d, k)/C(l, k). There are n instances of the planted motif, and hence the probability that fewer than s of them hash into the planted bucket is p′ = Σ_{i=0}^{s−1} C(n, i) p^i (1 − p)^{n−i}. As a result, the probability that fewer than s instances hash into the planted bucket in each of the r trials is P = (p′)^r. The value of r is thus ⌈log P / log p′⌉. In [8], a value of 0.05 is used for P.

We collect all the k-mers (and the corresponding l-mers) that pass the threshold, and these are processed further to arrive at the final answer M. Processing is done using the expectation maximization (EM) technique of Lawrence and Reilly [20]. The EM formulation employs the following model. Each input sequence has an instance of a length-l motif such that these instances are characterized by a 4 × l weight matrix W; in particular, W[i, j] is the probability that base i occurs in position j (1 ≤ i ≤ 4 and 1 ≤ j ≤ l). Occurrences of bases in the different positions are independent. Bases for the remaining m − l positions in each sequence are governed by a background distribution B. If S is the set of input sequences, then the EM-based technique of [20] determines a weight matrix model W* that maximizes the likelihood ratio Pr(S | W*, B) / Pr(S | B).
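The hashing step can be sketched as follows (the names are ours; the EM refinement of the enriched buckets is not shown).

import random
from collections import defaultdict

def projection_buckets(seqs, l, k, s):
    # One round of random projection: hash every l-mer by k random
    # positions and return the buckets holding at least s l-mers.
    positions = sorted(random.sample(range(l), k))
    buckets = defaultdict(list)
    for seq in seqs:
        for i in range(len(seq) - l + 1):
            u = seq[i:i + l]
            key = "".join(u[p] for p in positions)
            buckets[key].append(u)
    return {key: lms for key, lms in buckets.items() if len(lms) >= s}

# Typically this is repeated r times with fresh random positions.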

1.3.5 Algorithm PMS1

In this section we present details of the algorithm PMS1 of [26]. Consider the following simple algorithm for the planted motif problem: 1) Let the input sequences be S1, S2, ..., Sn, each of length m. Form all possible l-mers from out of these sequences; the total number of l-mers is ≤ nm. Call this collection C, and let the collection of l-mers in S1 be C′. 2) For every l-mer u ∈ C′, generate all the patterns v such that u and v are at a Hamming distance of d. The number of such patterns for a given u is C(l, d) 3^d, so the total number of patterns generated is O(m C(l, d) 3^d). Call this collection of l-mers C″. Note that C″ contains M, the desired output pattern (assuming that M does not occur exactly in any of the input sequences). 3) For every pair of l-mers (u, v) with u ∈ C and v ∈ C″, compute the Hamming distance between u and v. Output the l-mer of C″ that has a neighbor (i.e., an l-mer at a Hamming distance of d) in each one of the n input sequences. The run time of this algorithm is O(nm² l C(l, d) 3^d). If M occurs in one of the input sequences, then this algorithm will run in time O(n²m²l).

Another equally simple algorithm considers every possible l-mer, one at a time, and checks if this l-mer is the correct motif M. There are 4^l possible l-mers. Let M′ be one such l-mer. We can check if M′ = M as follows: compute the Hamming distance between u and M′ for every u ∈ C (where C is the collection of all possible l-mers in the input sequences). As a result we can check if M′ occurs in each input sequence (at a Hamming distance of d). Thus we can identify all the motifs of interest in a total of O(nml 4^l) time, and we get the following lemma.

LEMMA 1.1 We can solve the planted (l, d)-motif problem in O(nml 4^l) time.

Algorithm PMS1 is based on sorting and takes the following form.

Algorithm PMS1
1. Generate all possible l-mers from out of each of the n input sequences. Let Ci be the collection of l-mers from out of Si, for 1 ≤ i ≤ n.
2. For all 1 ≤ i ≤ n and for all u ∈ Ci, generate all l-mers v such that u and v are at a Hamming distance of d. Let the collection of l-mers corresponding to Ci be Ci′, for 1 ≤ i ≤ n. The total number of patterns in any Ci′ is O(m C(l, d) 3^d).
3. Sort all the l-mers in every Ci′, 1 ≤ i ≤ n, and eliminate duplicates in every Ci′. Let Li be the resultant sorted list corresponding to Ci′.
4. Merge all the Li's (1 ≤ i ≤ n) and output the generated (in step 2) l-mer(s) that occur in all the Li's.

The following theorem results.

THEOREM 1.1 Problem 1 can be solved in time O(nm C(l, d) 3^d l/w), where w is the word length of the computer. A run time of O(nm + m (C(l, d))² 3^{2d} l/w) is also achievable.
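As an illustration, here is a small Python sketch of PMS1 (the names are ours, not from [26]); Python sets play the role of the sorted lists Li, and the word-packing that yields the l/w factor is omitted.

from itertools import combinations, product

def neighbors_exact(u, d, alphabet="ACGT"):
    # All l-mers at Hamming distance exactly d from u.
    out = set()
    for pos in combinations(range(len(u)), d):
        choices = [[a for a in alphabet if a != u[p]] for p in pos]
        for repl in product(*choices):
            v = list(u)
            for p, a in zip(pos, repl):
                v[p] = a
            out.add("".join(v))
    return out

def pms1(seqs, l, d):
    lists = []
    for s in seqs:
        li = set()                       # Li: d-neighbors of all l-mers of s
        for i in range(len(s) - l + 1):
            li |= neighbors_exact(s[i:i + l], d)
        lists.append(li)
    return set.intersection(*lists)      # the merge step

print(pms1(["ACGTTACG", "AAGTTTCG", "CCGTTACC"], 5, 2))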

1.3.6 Algorithm PMS2

An improved algorithm called PMS2 has also been given in [26]. Let M be the planted motif. Note that if M occurs in every input sequence, then every substring of M also occurs in every input sequence. In particular, there are at least l − k + 1 k-mers (for d ≤ k ≤ l) such that each of these occurs in every input sequence at a Hamming distance of at most d. Let Q be the collection of k-mers that can be formed out of M. There are l − k + 1 k-mers in Q, and each one of them will be present in each input sequence at a Hamming distance of at most d.

Algorithms for Motif Search

1-7

In addition, in every input sequence Si there will be at least one position i_j such that a k-mer of Q occurs starting from i_j, another k-mer of Q occurs starting from i_j + 1, ..., and yet another occurs starting from i_j + l − k. We can get an l-mer by putting together the k-mers that occur starting from each such i_j. Possibly, there could be many motifs of length k in the positions starting from each of i_j, i_j + 1, ..., i_j + l − k such that all of these motifs are present in all of the input sequences (with a Hamming distance of at most d). Assume that M_{i_j+r} is one motif of length k that starts from position i_j + r of Si and that is also present in every input sequence (for 0 ≤ r ≤ l − k). If the last k − 1 residues of M_{i_j+r} are the same as the first k − 1 residues of M_{i_j+r+1} (for 0 ≤ r ≤ l − k − 1), then we can obtain an l-mer from these motifs in the obvious way. This l-mer is potentially a correct motif. Also note that, to obtain potential motifs (of length l), it suffices to process one of the input sequences in the manner described above.

Now we are ready to describe the improved algorithm, which has two phases. In the first phase we identify all (d + c)-mers M_{d+c} (for some appropriate value c) that occur in each of the input sequences at a Hamming distance of at most d; we also collect potential l-mers (as described above) in this phase. In the second phase we check, for each l-mer M′ collected in the first phase, whether M′ is a correct answer, and finally we output all the correct answers.

First we observe that the algorithm PMS1 can also be used for the case in which we look for a motif M that occurs in each input sequence at a Hamming distance of at most d. The second observation is that if c is large enough then there will not be many spurious hits. A suggested value for c is the largest integer for which PMS1 can be run without exceeding the core memory of the computer and within a reasonable amount of time. We present more details on the two phases.

Algorithm PMS2

Phase I
Solve the planted (d + c, d)-motif problem on the input sequences (with a Hamming distance of ≤ d, using e.g. a modified PMS1). Let R be the set of all motifs found.

Let Sk be one of the input sequences. (Sk could be an arbitrary input sequence; it could be chosen randomly as well.) Find all the occurrences of all the motifs of R in Sk (with a Hamming distance of up to d). This can be done, e.g., as follows: form all the (d + c)-mers of Sk (keeping track of the starting position of each in Sk); for each (d + c)-mer u ∈ Sk, find all the (d + c)-mers v such that u and v are at a Hamming distance of at most d; if R′ is the collection of these (d + c)-mers, sort R and R′, merge them, and figure out all the occurrences of interest.

Let Sk be of length m. For every position i in Sk, let Li be the list of all motifs of R that occur in Sk (with a Hamming distance of ≤ d) starting from position i. Let A be the l-mer of Sk that starts from position i. Let M1 be a member of Li. If M2 is a member of L_{i+l−(d+c)} such that the last 2(d + c) − l characters of M1 are the same as the first 2(d + c) − l characters of M2, then we can get an l-mer B by appending the last l − (d + c) residues of M2 to M1 (at the end). If the Hamming distance between A and B is d, then B is retained as a candidate for the correct motif. We gather all such candidates and check if any of them are correct motifs. Details are given below.

for i := 1 to m − l + 1 do
    for every u ∈ Li do
        for every v ∈ L_{i+l−(d+c)} do
            Let the l-mer of Sk starting from position i be A.
            If the last 2(d + c) − l residues of u are the same as the
            first 2(d + c) − l residues of v, form an l-mer B by appending
            the last l − (d + c) residues of v to u. If the Hamming
            distance between A and B is d, add B to the list C of candidates.

Phase II
for every v ∈ C do
    Check if v is a correct motif in O(nml) time.

For any node u of Li there can be at most 4^{l−(d+c)} candidate motifs. Thus the time needed to get all the candidate motifs is O(Σ_{i=1}^{m−l+1} |Li| 4^{l−(d+c)} l).

We arrive at the following theorem.

THEOREM 1.2 Problem 1 can be solved in time O(nm Σ_{i=0}^{d} C(d+c, i) 3^i (d+c)/w + znml + Σ_{i=1}^{m−l+1} |Li| 4^{l−(d+c)} l), where z is the number of potential l-mers collected in the first phase and w is the word length of the computer. If d ≤ ⌊l/2⌋, then the run time is O(mn C(d+c, d) 3^d (d+c)/w + znml + Σ_{i=1}^{m−l+1} |Li| 4^{l−(d+c)} l).

An Alternative Algorithm. We can modify the above algorithm as follows. We first find the collection R of all the (d + c)-mers that are present in every input sequence at a Hamming distance of at most d, as before. In the above version we pick only one sequence Sk and find all the candidate motifs arising out of Sk. An alternative is to find the candidate motifs from each sequence and take the intersection of these sets: let Ai be the set of candidates from Si (1 ≤ i ≤ n), let A = ∩_{i=1}^{n} Ai, and output A.
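A sketch of the stitching step in Phase I (the names are ours): two (d+c)-mers u and v whose occurrences in Sk are l − (d+c) positions apart are merged into an l-mer candidate when their overlap of 2(d+c) − l residues agrees.

def stitch_candidates(Sk, L, l, dc, d):
    # L[i] holds the (d+c)-mers of R occurring in Sk at position i
    # (within Hamming distance d); dc = d + c. Returns candidate l-mers.
    def hamming(a, b):
        return sum(x != y for x, y in zip(a, b))

    overlap = 2 * dc - l
    candidates = set()
    for i in range(len(Sk) - l + 1):
        A = Sk[i:i + l]
        for u in L.get(i, ()):
            for v in L.get(i + l - dc, ()):
                if u[l - dc:] == v[:overlap]:       # overlaps agree
                    B = u + v[overlap:]             # stitched l-mer
                    if hamming(A, B) == d:
                        candidates.add(B)
    return candidates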

1.3.7 Algorithm PMS3

A third algorithm, called PMS3, has been given in [26]; it enables one to handle large values of d. Let d′ = ⌊d/2⌋. Let M be the motif of interest, with |M| = l = 2l′ for some integer l′. Let M1 refer to the first half of M and M2 to the second half. We know that M occurs in every input sequence. Let S = s1, s2, ..., sm be an arbitrary input sequence, and let the occurrence of M (with a Hamming distance of d) in S start at position i. Let S′ = s_i, s_{i+1}, ..., s_{i+l′−1} and S″ = s_{i+l′}, ..., s_{i+l−1}. Then, clearly, either 1) the Hamming distance between M1 and S′ is at most d′, or 2) the Hamming distance between M2 and S″ is at most d′. Thus, in every input sequence, either M1 occurs with a Hamming distance of at most d′ or M2 occurs with a Hamming distance of at most d′. As a result, there are at least ⌈n/2⌉ input sequences in which the same half (M1 or M2) occurs with a Hamming distance of at most d′. PMS3 exploits these observations; more details can be found in [26].

1.3.8 Saving Memory

As described, PMS1 first forms all possible l-mers from out of all the input sequences, generates all relevant neighbors of these l-mers, and sorts and merges all of them to identify the generated l-mer(s) found in all the sequences. We can modify the algorithm as follows so as to reduce the memory used.

Algorithm PMS1A
    Generate all possible l-mers from out of the first input sequence S1. Let C1 be the collection of these l-mers.
    For all u ∈ C1 generate all l-mers v such that u and v are at a Hamming distance of d. Sort the collection of these l-mers and eliminate duplicates. Let L be the resultant sorted collection.
    for i := 2 to n do
        1. Generate all possible l-mers from out of the input sequence Si. Let Ci be the collection of these l-mers.
        2. For all u ∈ Ci generate all l-mers v such that u and v are at a Hamming distance of d. Let the collection of these l-mers be Ci′.
        3. Sort all the l-mers in Ci′ and eliminate duplicates. Let Li be the resultant sorted list.
        4. Merge Li and L and keep the intersection in L, i.e., set L := L ∩ Li.
    L now has the motif(s) of interest.
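Using the neighbors_exact helper from the PMS1 sketch above, the memory saving amounts to intersecting on the fly, so that only one list Li (plus the running intersection L) is ever held at a time; a sketch:

def pms1a(seqs, l, d):
    L = None
    for s in seqs:                        # one sequence at a time
        li = set()
        for i in range(len(s) - l + 1):
            li |= neighbors_exact(s[i:i + l], d)
        L = li if L is None else L & li   # L := L ∩ Li
    return L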

1.3.9 Extensions

The planted (l, d)-motif problem as defined above (in [24], for example) requires discovering a motif M that occurs in every input sequence at a Hamming distance of exactly d. Variations of this problem can be conceived of; we consider two variants in this section.

Problem 1(a). The input is n sequences, each of length m. The problem is to identify a motif M of length l. It is given that each input sequence has a substring of length l such that the Hamming distance between this substring and M is at most d.

THEOREM 1.3 Problem 1(a) can be solved in time O(nm (Σ_{i=0}^{d} C(l, i) 3^i) l/w). If d ≤ ⌊l/2⌋, then this run time is O(nm C(l, d) 3^d l/w).

Proof. An algorithm similar to PMS1 can be devised.
1. Generate all possible l-mers from out of each of the n input sequences. Let Ci be the collection of l-mers from out of Si, for 1 ≤ i ≤ n.
2. For all 1 ≤ i ≤ n and for all u ∈ Ci, generate all l-mers v such that u and v are at a Hamming distance of at most d. Let the collection of l-mers corresponding to Ci be Ci′, for 1 ≤ i ≤ n. The total number of patterns in any Ci′ is O(Σ_{i=0}^{d} m C(l, i) 3^i).
3. Sort all the l-mers in every Ci′ and eliminate duplicates, 1 ≤ i ≤ n. Let Li be the resultant sorted list corresponding to Ci′.
4. Merge all the Li's (1 ≤ i ≤ n) and output the generated (in step 2) l-mer(s) that occur in all the Li's.

Problem 1(b). The input is n sequences, each of length m. The problem is to find all motifs M of length l. A motif M should be output if it occurs in at least εn of the input sequences

at a Hamming distance of d. Here ε is a fraction specified as a part of the input. (This variant has been considered in [8], where a value of 1/2 is used for ε.)

THEOREM 1.4 Problem 1(b) can be solved in time O(nm C(l, d) 3^d l/w).

Proof. The algorithm to be used is the same as PMS1. The only difference is that step 4 now becomes: "Merge all the Li's (1 ≤ i ≤ n) and output the generated (in step 2) l-mers that occur in at least εn of the Li's." The run time remains the same asymptotically. One could also refine Problem 1(b) to look for motifs that occur at a Hamming distance of at most d.
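For Problem 1(b) the final step changes from intersection to threshold counting; a self-contained sketch of that step, given the per-sequence neighbor sets:

from collections import Counter

def motifs_with_support(lists, eps):
    # lists[i] is the set Li of d-neighbors generated from sequence i;
    # return the l-mers present in at least eps * n of the lists.
    n = len(lists)
    counts = Counter()
    for li in lists:
        counts.update(li)        # each Li is a set, so +1 per sequence
    return {u for u, c in counts.items() if c >= eps * n}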

1.4 Techniques for Edited Motif Search

In this section we consider edited motif search (Problem 2). Here the input is a database DB of sequences S1, S2, ..., Sn, together with integers l, d, and q. The output should be all the patterns in DB such that each pattern is of length l and occurs in at least q of the n sequences. A pattern U is considered an occurrence of another pattern V as long as the edit distance between U and V is at most d.

An algorithm for the above problem has been given by Sagot [29] that has a run time of O(n²m l^d |Σ|^d), where m is the average length of the sequences in DB and Σ is the alphabet from which the input sequences are generated. It uses O(n²m/w) space, where w is the word length of the computer. This algorithm can be used to solve Problem 1 as well as Problem 2. Consider the case of Problem 1. The algorithm builds a suffix tree on the given sequences in O(nm) time using O(nm) space. Some preprocessing is done on the suffix tree that takes O(n²m/w) time and O(n²m/w) space. If u is any l-mer present in the input, there are O(l^d (|Σ| − 1)^d) possible neighbors of u. (A neighbor of u is any word v such that the Hamming distance between u and v is d.) Any of these neighbors could potentially be a motif of interest. Since there are O(nm) l-mers in the input, the number of such neighbors is O(nm l^d (|Σ| − 1)^d). The algorithm of [29], for each such neighbor v, walks through the tree to check if v is a possible answer. This walking step is referred to as "spelling". The spelling operation takes a total of O(n²m l^d (|Σ| − 1)^d) time using an additional O(nm) space. When employed for solving Problem 2, the same algorithm takes O(n²m l^d |Σ|^d) time.

An algorithm with an expected run time of O(nm + d(nm)^{1+pow(ε)} log nm), where ε = d/l and pow(ε) is an increasing concave function, has been given in [2]. The value of pow(ε) is roughly 0.9 for protein and DNA sequences. This algorithm is also suffix-tree based.

1.4.1 An Algorithm Similar to PMS1

In this section we describe a sorting based algorithm that has the same run time as that of [29]. From here on, the word occurrence is used to denote occurrence within an edit distance of d, and the word presence is used to denote exact occurrence (i.e., occurrence within an edit distance of zero).

The basic idea behind the algorithm is as follows. We generate all possible l-mers in the database; there are at most mn such l-mers, and these are the patterns of interest. For each such l-mer we want to determine if it occurs in at least q of the input sequences. Let u be one of the above l-mers. If v is a string such that the edit distance between u and v is at most d, then we say v is a neighbor of u. We generate all the neighbors of u, and for each neighbor v of u we determine a list of input sequences in which v is present. These lists (over all possible neighbors of u) are then merged to obtain a list of input sequences in which u occurs (within an edit distance of d).

Note that if u is an l-mer, then its neighbors have lengths in the interval [l − d, l + d]; in other words, there are 2d + 1 possible values for the lengths of the neighbors of u. Also note that more than one neighbor of u could have the same length.

Corresponding to each r-mer x (where r is an integer in the interval [l − d, l + d]) present in the input, we keep a 4-tuple (x, SeqNum, Pos, 0). Here SeqNum is an index of the input sequence I that x belongs to. (The indexing of the input sequences can be done arbitrarily, e.g., in the order in which they appear in the input.) Pos is the starting position of x in I. The fourth entry (0) indicates that x is an r-mer present in one of the input sequences. There are O(nmd) such 4-tuples. For every neighbor v of u (u being an l-mer present in the input), we keep a 4-tuple as well: (v, SeqNum, Pos, 1). Here SeqNum is the index of the sequence I that u belongs to, and Pos is the starting position of u in I. The fourth entry (1) indicates that this 4-tuple corresponds to a neighbor.

We provide details of the algorithm next. Note that each l-mer present in the input can be represented as a pair (SeqNum, Pos), where SeqNum is the index of the sequence in which the l-mer is present and Pos is the starting position of the l-mer in this sequence. We make use of an array A[1 : n, 1 : m, 1 : n]. At the end of the algorithm this array will have the following property: A[SeqNum, Pos, j] = 1 if and only if the l-mer (SeqNum, Pos) occurs in the input sequence with index j (1 ≤ j ≤ n).

1. Generate 4-tuples for all r-mers present in DB, where r ∈ [l − d, l + d]. Each of these 4-tuples has 0 as its fourth entry. Call this collection C. Sort C in lexicographic order and eliminate duplicates among the 4-tuples whose first two entries are the same to get L1. Note that the first entries of the 4-tuples in C could be of different lengths; a simple way of sorting them is to group them into 2d + 1 groups, one per possible length of the first entry, and handle the groups separately. L1 has O(nmd) entries. Now sort the 4-tuples of L1 with respect to their first and fourth entries (in lexicographic order) to get L2.
2. For every distinct l-mer u present in DB, generate all the patterns v such that u and v are at an edit distance of at most d. The number of such neighbors for a given u is O(l^d |Σ|^d) (a proof follows). This generation is done using the algorithm of Myers [23]. Form the 4-tuples corresponding to all possible neighbors of all the distinct l-mers in DB. Each of these 4-tuples has 1 as its fourth entry.
3. Sort all the 4-tuples generated in step 2 with respect to their first and fourth entries. The total number of 4-tuples is O(nm l^d |Σ|^d). Let L3 be the sorted sequence.
4. Merge L3 with L2 to get L4. We can think of L4 as consisting of groups, where a group has 4-tuples with the same first entry. A group itself consists of two subgroups: the 4-tuples of the first subgroup have 0 as their fourth entry, and those of the second subgroup have 1.

for each group G of L4 do
    Let G1 and G2 be the two subgroups of G. Identify the distinct sequence indices (i.e., second entries) in the 4-tuples of G1. Let these indices be i1, i2, ..., ik. Note that k ≤ n.
    for each 4-tuple (v, SeqNum, Pos, 1) in G2 do
        A[SeqNum, Pos, i_j] := 1, for 1 ≤ j ≤ k. (I.e., the l-mer (SeqNum, Pos) occurs in the input sequences S_{i_j}, for 1 ≤ j ≤ k.)

Scan through the array A to output the right l-mers. In particular, output (SeqNum, Pos) if A[SeqNum, Pos, j] = 1 for at least q values of j (1 ≤ j ≤ n).

THEOREM 1.5 The above algorithm runs in time O(n²m l^d |Σ|^d). The space used is O(nm l^d |Σ|^d), and it can be reduced to O(nmd + l^d |Σ|^d).

Proof. The run time of step 1 is O(nmd l/w) using radix sort. Let u be any l-mer. The number of patterns v such that the edit distance between u and v is at most d is O(l^d |Σ|^d), as argued next. (The same fact has been proven by Crochemore and Sagot [10].) Let N(t) be the number of patterns obtainable from u by performing t operations (inserts, deletes, and substitutions) on u. The number of patterns of interest is then Σ_{t=0}^{d} N(t). Of the t operations, let the numbers of inserts, deletes, and substitutions be i, del, and s, respectively, with i + del + s = t. For a given choice of i, del, s, it is easy to see that the number of patterns obtainable is C(l+i, i) C(l, del) C(l, s) |Σ|^{s+i}. As a result, N(t) ≤ ((t+1)(t+2)/2) ((l+t)e/t)^t |Σ|^t, using the fact that C(a, b) ≤ (ae/b)^b. Finally, summing N(t) over all t, we see the result (as long as d ≥ 6). Thus the generation of all the patterns in step 2 takes O(nm l^d |Σ|^d) time (using the algorithm in [23]).

In step 3, sorting takes time O(nm l^d |Σ|^d l/w). Merging in step 4 also takes time O(nm l^d |Σ|^d l/w). For each 4-tuple of a given G2 the time spent is O(k), where k is the number of distinct sequence indices in the corresponding G1. Since k ≤ n, the total time spent in processing all the G2's is O(n²m l^d |Σ|^d). These observations prove the theorem (assuming that l/w = O(n)).

The above algorithm is simpler than the ones in [29, 2], which employ suffix trees; in comparison, the above algorithm uses only arrays, and it can be expected to run much faster in practice. It is easy to see that the run time of the above algorithm is O(nm l^d |Σ|^d z), where z is the maximum number of distinct sequence indices in the 4-tuples of any G1. The expected value of z can be calculated as follows. Let x be any r-mer present in DB. The expected number of sequences in which x occurs is the same as the expected value of z. If I is any input sequence and j is a fixed position in I, the probability that x is present in I starting from j is (1/4)^r. Thus the probability that x is present somewhere in I is ≤ m(1/4)^r. As a result, the expected number of sequences in which x is present is ≤ nm(1/4)^r, and hence the expected value of z is ≤ nm(1/4)^{l−d}. As an example, if n = 20, m = 600, and l − d = 10, the expected value of z is less than 1.
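A sketch of neighbor generation under edit distance (a plain recursive enumeration rather than the method of Myers [23]; adequate only for small d):

def edit_neighbors(u, d, alphabet="ACGT"):
    # All strings within edit distance <= d of u (lengths l-d .. l+d).
    result = {u}
    if d == 0:
        return result
    for i in range(len(u) + 1):
        for a in alphabet:                                   # insertions
            result |= edit_neighbors(u[:i] + a + u[i:], d - 1)
    for i in range(len(u)):
        result |= edit_neighbors(u[:i] + u[i + 1:], d - 1)   # deletions
        for a in alphabet:                                   # substitutions
            if a != u[i]:
                result |= edit_neighbors(u[:i] + a + u[i + 1:], d - 1)
    return result

print(len(edit_neighbors("ACGT", 1)))  # small sanity check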

1.4.2 A Randomized Algorithm

In this section we describe a simple randomized algorithm (due to [27]) that has the potential of performing better than the algorithms of [29, 2]. The algorithms in [29, 2] employ suffix trees; the algorithm to be discussed uses arrays. Before presenting the randomized algorithm we present a very simple algorithm on which it is based. It works as follows.
1. Generate all possible l-mers in DB. Let the collection of these l-mers be C. There are at most mn elements in C. Duplicates in C can be eliminated by a simple radix sort.


2. For every l-mer u in C, compute the number of occurrences of u in DB. This can be done in time O(nmd) using the algorithm of Galil and Park [14] (see also [1, 6, 18, 22, 23, 33]).

Thus we get the following theorem.

THEOREM 1.6 Problem 2 can be solved in time O(n²m²d).

A Randomized Algorithm. A randomized algorithm can be developed based on the above algorithm.
1. Generate all possible l-mers in DB. Let C be the collection of these l-mers. C has at most nm elements.
2. For each element u in C, pick a random sample Su from DB of (16αn/q) ln n sequences, where α is the probability parameter (assumed to be a constant). Count the number of occurrences Nu of u in the sample. This takes O(|Su| md) time for a single u (using the algorithm of Galil and Park [14]).
3. For each u in C such that Nu > 10.34α ln n, compute the occurrences of u in the entire input DB. If the number of occurrences of u in DB is q or more, then output u.

THEOREM 1.7 The above algorithm runs in time O((n²m²d log n)/q + gmnd), where g is the number of l-mers that pass the test in step 3. The probability of an incorrect answer is no more than nm · n^{−α}. The space used is linear in the input size.

Proof. The run time is easy to see. Note that if an l-mer occurs in fewer than q input sequences, it will never be output. If an l-mer u occurs in at least q sequences of DB, then the number of occurrences of u in Su (i.e., the value of Nu) is lower bounded by a binomial random variable with mean 16α ln n. An application of the Chernoff bounds (second equation of Lemma A.1) with ε = 1/(2√2) shows that the probability that Nu is less than 10.34α ln n is no more than n^{−α}. By the same token, let u′ be an l-mer that occurs in at most (3/8)q of the input sequences. The number of occurrences Nu′ of u′ in the sample is a binomial with mean 6α ln n. Using the third equation of Lemma A.1 with ε = 1/√2, the probability that Nu′ exceeds 10.25α ln n is at most n^{−α}.

In summary, if a pattern occurs in q or more input sequences, it will pass the test of step 3 with high probability, and not many spurious patterns will pass that test: to pass, a pattern has to occur in at least (3/8)q of the input sequences (in a high probability sense). Therefore a high probability upper bound on g is the number of patterns that occur in (3/8)q or more of the input sequences. Also note that there are at most nm patterns of interest.

Note that this algorithm has the potential of performing better than those of [29, 2], especially for large values of q. When q is large (εn for some constant fraction ε, for instance), g can be expected to be small, and hence the entire run time could be o(d(nm)^{1+pow(ε)} log nm).

Next we show that the expected value of g is very small. Assume that every residue in each input sequence is picked randomly (from an alphabet of size 4), and let the input consist of n sequences of length m each.

Let v be an l-mer, let Sk be any input sequence, and let i be any position in Sk. The probability that v is present in Sk starting from position i is (1/4)^l, so the probability that v is present in Sk starting from some position is ≤ (m − l + 1)(1/4)^l. For every l-mer u there are ≤ c l^d |Σ|^d neighbors (for some constant c), as shown above; the length of any such neighbor x is in the interval [l − d, l + d]. In Problem 2 we must check, for every l-mer u in the input, whether it occurs in at least q of the input sequences. The probability that either u or any of its neighbors is present in Sk is at most p1 = (m − l + d + 1) c |Σ|^d l^d (1/4)^{l−d}. As a result, the probability that u occurs in q or more of the input sequences is at most p2 = Σ_{i=q}^{n} C(n, i) p1^i (1 − p1)^{n−i} ≤ Σ_{i=q}^{n} C(n, i) p1^i. Using Stirling's approximation, we see that p2 ≤ (2/√(πn)) Σ_{i=q}^{n} 2^n p1^i. By simple arithmetic we see that p2 ≤ 1/(mn) when

q ≥ [(n + 1) − log(mn) − (1/2) log(πn)] / [2(l − d) − log(m − l + d + 1) − log c − d log |Σ| − d log l].

If p2 ≤ 1/(mn), then the expected number of patterns that occur in q or more of the input sequences is ≤ 1. Thus the value of g will indeed be small, even if q is not this high! As a numerical example, consider the case l = 20, d = 2, m = 256, n = 500. In this case the condition on q becomes q ≥ 36. Also, if α is 4, the probability of an incorrect answer is 2.048 × 10^{−6}. This is also an upper bound on the expected number of patterns that will be missed by the algorithm (a pattern is missed by the algorithm if it occurs in at least q of the input sequences but the algorithm fails to detect this). The above analysis could be tightened further.
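A sketch of the sampling filter (step 2 and the test of step 3); the dynamic program below stands in for the Galil–Park matcher [14], and the function names are ours.

import math
import random

def occurs_within(u, seq, d):
    # True if some substring of seq is within edit distance d of u
    # (semi-global alignment DP, O(|u| * |seq|) time).
    prev = [0] * (len(seq) + 1)            # free starting position in seq
    for i, uc in enumerate(u, 1):
        cur = [i] + [0] * len(seq)
        for j, sc in enumerate(seq, 1):
            cur[j] = min(prev[j] + 1, cur[j - 1] + 1,
                         prev[j - 1] + (uc != sc))
        prev = cur
    return min(prev) <= d                  # free ending position in seq

def sample_filter(u, db, q, d, alpha=4):
    n = len(db)
    size = min(n, int(16 * alpha * n * math.log(n) / q))
    sample = random.sample(db, size)
    Nu = sum(occurs_within(u, s, d) for s in sample)
    return Nu > 10.34 * alpha * math.log(n)   # survivors get a full count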

1.5 Algorithms for Simple Motif Search

Simple motif search (Problem 3) takes as input a database DB of n sequences. The goal is to identify all the patterns of length at most l (with anywhere from 0 to ⌊l/2⌋ wild card characters). The output should be all the patterns together with a count of how many times each pattern occurs.

The motif model for Problem 3 has been derived as follows [27]. Rajasekaran et al. [27] have generated a list of 312 minimotifs (i.e., motifs of short length) that have defined biological functions, and they have used this list to select parameters for a de novo analysis of novel minimotifs in the human proteome. They chose to analyze novel motifs with a length (l) of 10 amino acids because 92% of the previously characterized minimotifs in their list are less than 10 amino acids in length. Another reason for choosing a length of 10 amino acids is based on the function of minimotifs. Most minimotifs are in binding domains or substrates of enzymes. The peptide ligand binding surfaces on proteins in the Protein Data Bank are usually no longer than 35 angstroms, and a 10 amino acid peptide would achieve a maximum of 35 angstroms in length if it were in a random coil or a beta-sheet structure. Thus, the selection of a length of 10 amino acids is consistent with the length of peptides that interact with binding surfaces on protein domains. The average minimotif in their list has 2.1 wild card positions (a wild card signifies any of the 20 amino acids). Since only 13% of the minimotifs in their list have more than 50% wild card positions, they chose l/2, i.e., 5, wild cards as the maximal number in the algorithm.

In Problem 3 an optional threshold value for the number of occurrences can be supplied. Determining this threshold is a challenging task. One way is to rank the motifs in order of their number of occurrences and choose a certain number of them (either because they are over-represented or because they are under-represented).

Algorithms for Motif Search

1-15

Another way of determining the threshold is to analyze the table of occurrences of all the patterns in the database together with a model for the biological sequences under concern. This differs from the first way in that the number of occurrences of any motif is weighted as dictated by the model. A typical value for l is 10.

A simple sorting based algorithm for Problem 3, called Simple Motif Search (SMS), is given in [27]. The run time of this algorithm is O(l^{l/2} N) for a given pattern length l, the number of wild cards being at most ⌊l/2⌋. Here N is the number of residues in the database.

1.5.1 A Simple Sorting Based Algorithm

The algorithm of Martinez [21] addresses a variant of Problem 3 in which the input is just one sequence and the output consists of all repeated patterns; the matches of interest are exact. (Even if the input has many sequences, they can be concatenated to get a single sequence.) The algorithm of [21] works as follows. Let S = x1, x2, ..., xn be the input sequence. This sequence can be thought of as n sequences, where each sequence corresponds to a suffix of S; i.e., S1 is the same as S, S2 = x2 x3 ... xn, and so on. These n sequences are then sorted one residue at a time. At any level of the sorting we have groups of sequences; in particular, after k levels of sorting, two sequences are in the same group if their first k residues are the same. Sorting at any level is local to groups, and a group is not sorted further if it has only one element. The expected run time of the above algorithm is O(n log n), whereas its worst case run time is Ω(n²). The algorithm can be modified to have an expected run time of O(n) by performing radix sorting with respect to the first Ω(log n) residues of the n sequences (see e.g., [16]).

As another variant, consider a problem where the input is a sequence S and an integer k, and the goal is to report all repeats of length k. This variant can be solved in the worst case in O(nk/w) time, w being the word length of the computer, as follows: 1) form all k-mers of S (there are fewer than n such k-mers); 2) sort these k-mers lexicographically in time O(nk/w); and 3) scan through the sorted list to identify the repeats. A sketch appears below. Instead of the above algorithm one could also employ a prefix tree or a suffix array to get a run time of O(n); depending on the underlying constants and the values of k and w, the sorting-based algorithm could be faster.
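A sketch of the sorting-based variant for fixed-length repeats:

def repeats_of_length(S, k):
    # All k-mers occurring more than once in S, via sorting.
    kmers = sorted(S[i:i + k] for i in range(len(S) - k + 1))
    return sorted({a for a, b in zip(kmers, kmers[1:]) if a == b})

print(repeats_of_length("AGCTAGCAGT", 3))  # ['AGC']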

1.5.2 Simple Motif Search (SMS)

As has been pointed out before, we are interested in identifying all the patterns of length at most l (with anywhere from 0 to ⌊l/2⌋ wild card characters), and for every pattern the number of occurrences should be output. How a biologist identifies the biologically important patterns among these is a challenging question that will not be addressed in this chapter.

Define a (u, v)-class as a class of patterns where each pattern has length u and exactly v wild card characters; for example, GA??C?T belongs to the (7, 3)-class. Note that there are C(u−2, v) |Σ|^{u−v} patterns in a (u, v)-class.

To identify the patterns in a (u, v)-class, we perform C(u−2, v) sorts. More specifically, for each possible placement of v wild card characters (excluding the end positions) in a sequence of length u, we perform a sorting. As an example, consider the case u = 5 and v = 2. There are three possible placements: C??CC, CC??C, and C?C?C, where C corresponds to any residue. Call each placement a (u, v)-pattern type. For every (u, v)-pattern type, we perform the following steps.

Algorithm SMS
For every (u, v)-pattern type do:
1. If R is a pattern type in the (u, v)-class, generate all possible u-mers in all the sequences of DB. If the sequences in DB have lengths m1, m2, ..., mn, respectively, then the number of u-mers from Si is mi − u + 1, for 1 ≤ i ≤ n.
2. Sort all the u-mers generated in step 1 only with respect to the non-wild card positions of R. For example, if the pattern type under concern is CC??C?C, we generate all possible 7-mers in DB and sort the 7-mers with respect to positions 1, 2, 5, and 7. Employ radix sort (see e.g., [16]).
3. Scan through the sorted list and count the number of occurrences of each pattern.

The run time of the above algorithm is O(C(u−2, v) N u/w) for a (u, v)-class, where N is the total number of residues in DB and w is the word length of the computer.

Now consider the problem of identifying all of the following patterns: the maximum length is 10; the pattern lengths of interest are 3, 4, 5, 6, 7, 8, 9, and 10; and the maximum numbers of wild cards are 1, 2, 2, 3, 3, 4, 4, and 5, respectively. In other words, we are interested in the (10, 5)-class, (10, 4)-class, ..., (10, 1)-class, (9, 4)-class, (9, 3)-class, ..., (9, 1)-class, ..., (4, 2)-class, (4, 1)-class, and (3, 1)-class. Thus the total number of sorts done is

Σ_{i=0}^{5} C(8, i) + Σ_{i=0}^{4} C(7, i) + Σ_{i=0}^{4} C(6, i) + Σ_{i=0}^{3} C(5, i) + Σ_{i=0}^{3} C(4, i) + Σ_{i=0}^{2} C(3, i) + Σ_{i=0}^{2} C(2, i) + Σ_{i=0}^{1} C(1, i) = 429.

THEOREM 1.8 The SMS algorithm runs in time O(l^{l/2} N).
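A sketch of one (u, v)-pattern-type pass and its driver (the names are ours; a counter keyed on the non-wild-card residues replaces the radix sort-and-scan):

from collections import Counter
from itertools import combinations

def sms_pass(db, u, wild_positions):
    # Count occurrences of every pattern of one (u, v)-pattern type.
    # wild_positions: interior positions (1 .. u-2) that hold '?'.
    counts = Counter()
    for seq in db:
        for i in range(len(seq) - u + 1):
            mer = list(seq[i:i + u])
            for p in wild_positions:
                mer[p] = "?"
            counts["".join(mer)] += 1
    return counts

def sms(db, u, v):
    # All patterns of length u with exactly v interior wild cards.
    total = Counter()
    for wild in combinations(range(1, u - 1), v):
        total.update(sms_pass(db, u, wild))
    return total

print(sms(["GATTACAGATTA"], 5, 2).most_common(3))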

1.5.3 Parallelism

SMS is amenable to parallel implementation [27]. One possibility is to partition the sorts equally among the processors. For example, with two processors a reasonable partition is for the first processor to work on the maximum pattern length of 10 and for the second to work on the remaining pattern lengths; the first processor then performs 219 sorts and the second performs 210. A second possibility is to partition the sequences among the processors as equally as possible and finally merge the occurrence counts of the patterns. A third possibility is to treat each sort as a job. To begin with, all the jobs are available and each processor takes up one of them; as soon as a job is taken up (by some processor) it is marked unavailable, and when a processor completes its job it takes up another available job. Computation stops when there are no more available jobs. This third technique has been employed in [27], and the speedups obtained are close to linear. A sketch follows.
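The third strategy is an ordinary work queue; a sketch using a process pool, where each job is one pattern-type pass of the sms_pass sketch above (on some platforms the worker function must live in an importable module):

from concurrent.futures import ProcessPoolExecutor
from collections import Counter
from itertools import combinations

def run_all_passes(db, classes, workers=4):
    # classes: list of (u, v) pairs; each pattern-type pass is one job.
    # The pool hands a new job to each worker as soon as it finishes one.
    jobs = [(u, wild) for u, v in classes
            for wild in combinations(range(1, u - 1), v)]
    total = Counter()
    with ProcessPoolExecutor(max_workers=workers) as pool:
        futures = [pool.submit(sms_pass, db, u, wild) for u, wild in jobs]
        for f in futures:
            total.update(f.result())
    return total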

1.5.4 TEIRESIAS Algorithm

The TEIRESIAS algorithm [12] addresses a problem similar to Problem 3. Here we define a pattern to be a string of symbols (also called residues) and ?'s, where "?" refers to a wild card character. A pattern cannot begin or end with ?. AB?D, EB??DS?R, etc. are examples of patterns. The length of a pattern is the number of characters in it (including the wild card characters). If P is a pattern, any substring of P that is itself a pattern is called a subpattern of P; for instance, AB is a subpattern of AB?D. A pattern P is called an ⟨l, W⟩ pattern if every subpattern of P of length W or more contains at least l residues. A pattern P′ is said to be more specific than a pattern P if P′ can be obtained from P by changing one or more wild card characters of P into residues and/or by adding one or more residues or wild card characters to the left of P or to the right of P.

ABCD and E?AB?D, for example, are more specific than AB?D. A pattern P is said to be maximal if there is no pattern P′ that is more specific than P and that occurs the same number of times in a given database DB as P. The problem TEIRESIAS addresses takes as input a database DB of sequences and the parameters l, W, and q, and it outputs all ⟨l, W⟩ maximal patterns in DB that occur in at least q distinct sequences of DB. The run time of TEIRESIAS is Ω(W^l N log N), where N is the size of the database (i.e., the number of characters, or residues, in the database).

The algorithm consists of two phases. In the first phase, elementary ⟨l, W⟩ patterns are identified; an elementary ⟨l, W⟩ pattern is simply a pattern of length W with exactly l residues. This phase runs in time O(N W^l) [12]. In the second phase (known as the convolution phase), elementary patterns are combined (i.e., convolved) to obtain larger patterns; for example, AS?TF and TFDE can be combined to obtain AS?TFDE. All the convolved patterns that pass the support level and that are maximal are output. The run time of this phase is O(W^l N log N) [12].

1.6 Random Sampling

Random sampling can be employed to speed up computations. In this section we describe a simple form of sampling (presented in [27]) as it applies to Problem 3. Let the input database DB consist of the sequences S1, S2, ..., Sn. Assume that the problem is to determine how frequent certain patterns are; in particular, for each pattern we want to determine if the number of sequences in which it occurs is at least q, where q is a given threshold. We can use the following strategy: randomly choose a sample S′ of εn sequences (for some appropriate value ε), identify the patterns that are frequent in this sample (with an appropriately modified threshold value), and output these patterns.

Analysis. Let U be a pattern that occurs q times in DB. Then the number of occurrences q′ of U in S′ is binomially distributed with mean εq. Using the Chernoff bounds, Pr[q′ ≤ (1 − α)εq] ≤ exp(−α²εq/2). If q ≥ (2β ln n)/(α²ε), then this probability is at most n^{−β}. We refer to a probability of ≤ n^{−β} as low probability as long as β is a constant ≥ 1. That is, if a pattern U occurs q times in DB, then its occurrence count in S′ will be at least (1 − α)εq with high probability. Let N = Σ_{i=1}^{n} |Si|, i.e., N is the number of residues in DB. The total number of patterns that pass the threshold is clearly < N. If q ≥ (2/(α²ε))(β ln n + ln N), then each pattern that occurs at least q times in DB will occur at least (1 − α)εq times in the sample.

Also, if a pattern U occurs at most ((1 − α)/(1 + α)) q times in DB, then the expected number of occurrences of U in the sample is ((1 − α)/(1 + α)) εq. Using the Chernoff bounds, this number will be ≤ (1 − α)εq with probability ≥ 1 − exp(−α² ((1 − α)/(1 + α)) εq/3). Thus if q ≥ (3/(α²ε)) ((1 + α)/(1 − α)) (β ln n + ln N), then every pattern that occurs at most ((1 − α)/(1 + α)) q times in DB will occur at most (1 − α)εq times in S′. We arrive at the following lemma.

LEMMA 1.2 Consider the problem of identifying patterns in a database of n sequences, where each pattern of interest should occur in at least q of the input sequences. To solve this problem it suffices to use a random sample of size εn and a sample threshold of (1 − α)εq. In this case, with high probability, no pattern that has an occurrence of less than ((1 − α)/(1 + α)) q in DB will pass the sample threshold, provided q ≥ (3/(α²ε)) ((1 + α)/(1 − α)) (β ln n + ln N).

Examples. We present two examples to illustrate the usefulness of sampling; in particular, we want to see how small ε can be. For given n, N, and q, we fix suitable values for β and α and use the above constraint on q to evaluate ε. The value of α that minimizes (1/α²)((1 + α)/(1 − α)) is approximately 0.6, so we use this value for α, and we fix β = 1.

Consider the case n = 1000, m = 200, N = 200000, α = 0.6, β = 1, where m refers to the length of each sequence. The condition on q becomes q ≥ 666.6/ε. If q = 800, then ε could be as small as 0.833. As another example, consider the case n = 10000, m = 200, N = 2000000, α = 0.6, β = 1. The condition on q becomes q ≥ 833.25/ε. If q = 5000, the value of ε could be as small as 0.167.
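A sketch that evaluates the constraint of Lemma 1.2 to find the smallest usable sample fraction ε for given parameters (a direct evaluation of the formula; the chapter's quoted constants 666.6 and 833.25 appear to include some rounding, so the outputs may differ slightly from the figures above):

import math

def min_sample_fraction(n, N, q, alpha=0.6, beta=1.0):
    # Smallest eps satisfying q >= (3 / (alpha^2 * eps)) *
    # ((1 + alpha) / (1 - alpha)) * (beta * ln n + ln N), per Lemma 1.2.
    bound = (3 / alpha**2) * ((1 + alpha) / (1 - alpha)) \
            * (beta * math.log(n) + math.log(N))
    return min(1.0, bound / q)

print(min_sample_fraction(1000, 200000, 800))
print(min_sample_fraction(10000, 2000000, 5000))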

Appendix A: Chernoff Bounds

A Bernoulli trial is an experiment with two possible outcomes, viz., success and failure. The probability of success is p. A binomial random variable X with parameters (n, p) is the number of successes in n independent Bernoulli trials, the probability of success in each trial being p. We can get good estimates on the tail ends of binomial distributions (see, e.g., [9]). In particular, it can be shown that:

Lemma A.1. If X is binomial with parameters (n, p), and m > np is an integer, then Pr[X ≥ m] ≤ (np/m)^m exp(m − np). Also, Pr[X ≤ (1 − ε)np] ≤ exp(−ε²np/2) and Pr[X ≥ (1 + ε)np] ≤ exp(−ε²np/3) for all 0 < ε < 1.
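As a quick numerical sanity check of the lower-tail bound (the parameter values below are arbitrary, chosen only for illustration), the following Python snippet compares the empirical tail of a binomial variable against exp(−ε²np/2):

```python
import math
import random

# Compare the empirical lower tail Pr[X <= (1 - eps) n p] of a
# binomial(n, p) variable against the Chernoff bound exp(-eps^2 n p / 2).
n, p, eps, trials = 1000, 0.3, 0.1, 10_000
cutoff = (1 - eps) * n * p
hits = sum(
    sum(random.random() < p for _ in range(n)) <= cutoff
    for _ in range(trials)
)
print("empirical tail :", hits / trials)             # roughly 0.02 here
print("Chernoff bound :", math.exp(-eps**2 * n * p / 2))  # exp(-1.5) ~ 0.22
```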

1.7 Conclusions

In this chapter we have addressed three versions of the motif search problem and surveyed a number of deterministic and randomized algorithms that have been proposed for these problems. Given the significance of motif search, developing even more efficient algorithms remains an important open problem.

Acknowledgements

The author thanks S. Balla, C.-H. Huang, V. Thapar, M. Gryk, M. Maciejewski, and M. Schiller for many stimulating discussions. This work has been supported in part by NSF Grants CCR-9912395 and ITR-0326155.

References

[1] E. F. Adebiyi, T. Jiang, and M. Kaufmann, An efficient algorithm for finding short approximate non-tandem repeats, Bioinformatics 17, Supplement 1, 2001, pp. S5-S12.
[2] E. F. Adebiyi and M. Kaufmann, Extracting common motifs under the Levenshtein measure: theory and experimentation, Proc. Workshop on Algorithms for Bioinformatics (WABI), Springer-Verlag LNCS 2452, 2002, pp. 140-156.
[3] T. L. Bailey and C. Elkan, Fitting a mixture model by expectation maximization to discover motifs in biopolymers, Proc. Second International Conference on Intelligent Systems for Molecular Biology, 1994, pp. 28-36.
[4] T. L. Bailey and C. Elkan, Unsupervised learning of multiple motifs in biopolymers using expectation maximization, Machine Learning 21(1-2), 1995, pp. 51-80.
[5] M. Blanchette, Algorithms for phylogenetic footprinting, Proc. Fifth Annual International Conference on Computational Molecular Biology, 2001.
[6] M. Blanchette, B. Schwikowski, and M. Tompa, An exact algorithm to identify motifs in orthologous sequences from multiple species, Proc. Eighth International Conference on Intelligent Systems for Molecular Biology, 2000, pp. 37-45.
[7] A. Brazma, I. Jonassen, J. Vilo, and E. Ukkonen, Predicting gene regulatory elements in silico on a genomic scale, Genome Research 15, 1998, pp. 1202-1215.
[8] J. Buhler and M. Tompa, Finding motifs using random projections, Proc. Fifth Annual International Conference on Computational Molecular Biology (RECOMB), April 2001.
[9] H. Chernoff, A measure of asymptotic efficiency for tests of a hypothesis based on the sum of observations, Annals of Math. Statistics 23, 1952, pp. 493-507.
[10] M. Crochemore and M.-F. Sagot, Motifs in sequences: localization and extraction, in Handbook of Computational Chemistry, Crabbe, Drew, Konopka, eds., Marcel Dekker, Inc., 2001.
[11] E. Eskin and P. Pevzner, Finding composite regulatory patterns in DNA sequences, Bioinformatics S1, 2002, pp. 354-363.
[12] A. Floratos and I. Rigoutsos, On the time complexity of the TEIRESIAS algorithm, Research Report RC 21161 (94582), IBM T.J. Watson Research Center, April 21, 1998.
[13] D. J. Galas, M. Eggert, and M. S. Waterman, Rigorous pattern-recognition methods for DNA sequences: analysis of promoter sequences from Escherichia coli, Journal of Molecular Biology 186(1), 1985, pp. 117-128.
[14] Z. Galil and K. Park, An improved algorithm for approximate string matching, SIAM Journal on Computing 19(6), 1990, pp. 989-999.
[15] G. Hertz and G. Stormo, Identifying DNA and protein patterns with statistically significant alignments of multiple sequences, Bioinformatics 15, 1999, pp. 563-577.
[16] E. Horowitz, S. Sahni, and S. Rajasekaran, Computer Algorithms, W. H. Freeman Press, 1998.
[17] U. Keich and P. Pevzner, Finding motifs in the twilight zone, Bioinformatics 18, 2002, pp. 1374-1381.
[18] G. M. Landau and U. Vishkin, Introducing efficient parallelism into approximate string matching and a new serial algorithm, Proc. ACM Symposium on Theory of Computing, 1986, pp. 220-230.
[19] C. E. Lawrence, S. F. Altschul, M. S. Boguski, J. S. Liu, A. F. Neuwald, and J. C. Wootton, Detecting subtle sequence signals: a Gibbs sampling strategy for multiple alignment, Science 262, 1993, pp. 208-214.


[20] C. E. Lawrence and A. A. Reilly, An expectation maximization (EM) algorithm for the identification and characterization of common sites in unaligned biopolymer sequences, Proteins: Structure, Function, and Genetics 7, 1990, pp. 41-51.
[21] H. M. Martinez, An efficient method for finding repeats in molecular sequences, Nucleic Acids Research 11(13), 1983, pp. 4629-4634.
[22] E. W. Myers, Incremental alignment algorithms and their applications, Technical Report 86-22, Department of Computer Science, University of Arizona, Tucson, AZ 85721, 1986.
[23] E. W. Myers, A sublinear algorithm for approximate keyword searching, Algorithmica 12, 1994, pp. 345-374.
[24] P. Pevzner and S.-H. Sze, Combinatorial approaches to finding subtle signals in DNA sequences, Proc. Eighth International Conference on Intelligent Systems for Molecular Biology, 2000, pp. 269-278.
[25] A. Price, S. Ramabhadran, and P. A. Pevzner, Finding subtle motifs by branching from sample strings, Bioinformatics 1(1), 2003, pp. 1-7.
[26] S. Rajasekaran, S. Balla, and C.-H. Huang, Exact algorithms for planted motif challenge problems, Proc. Third Asia-Pacific Bioinformatics Conference, Singapore, 2005.
[27] S. Rajasekaran, S. Balla, C.-H. Huang, V. Thapar, M. Gryk, M. Maciejewski, and M. Schiller, Exact algorithms for motif search, Proc. Third Asia-Pacific Bioinformatics Conference, Singapore, 2005.
[28] E. Rocke and M. Tompa, An algorithm for finding novel gapped motifs in DNA sequences, Proc. Second International Conference on Computational Molecular Biology (RECOMB), 1998, pp. 228-233.
[29] M.-F. Sagot, Spelling approximate repeated or common motifs using a suffix tree, Springer-Verlag LNCS 1380, 1998, pp. 111-127.
[30] S. Sinha and M. Tompa, A statistical method for finding transcription factor binding sites, Proc. Eighth International Conference on Intelligent Systems for Molecular Biology, 2000, pp. 344-354.
[31] R. Staden, Methods for discovering novel motifs in nucleic acid sequences, Computer Applications in the Biosciences 5(4), 1989, pp. 293-298.
[32] M. Tompa, An exact method for finding short motifs in sequences, with application to the ribosome binding site problem, Proc. Seventh International Conference on Intelligent Systems for Molecular Biology, 1999, pp. 262-271.
[33] E. Ukkonen, Finding approximate patterns in strings, Journal of Algorithms 6, 1985, pp. 132-137.
[34] J. van Helden, B. Andre, and J. Collado-Vides, Extracting regulatory sites from the upstream region of yeast genes by computational analysis of oligonucleotide frequencies, Journal of Molecular Biology 281(5), 1998, pp. 827-842.
[35] M. Waterman, R. Arratia, and D. Galas, Pattern recognition in several sequences: consensus and alignment, Bulletin of Mathematical Biology 46, 1984, pp. 515-527.
