Solving the Longest Common Subsequence (LCS) Problem using the Associative ASC Processor with Reconfigurable 2D Mesh

Virdi Sabegh Singh, Hong Wang, Robert A. Walker
Computer Science Department, Kent State University, Kent OH 44242 USA

Abstract
As new genes are sequenced, it is necessary for molecular biologists to compare the new gene's biosequence to known sequences. One simple form of DNA sequence comparison can be done by solving the Longest Common Subsequence (LCS) problem. However, these sequences are large, the databases are large, and enormous computing power may be necessary — power not available to the average researcher. Further, it is necessary to find not only the longest common subsequence, but the best LCS when there are multiple LCSs. As a result, previous research has explored heuristic solutions, parallel solutions, and specialized hardware. In this paper, we propose a parallel LCS algorithm and a specialized processor architecture (the associative ASC Processor with reconfigurable 2D mesh), implemented on an Altera FPGA, to solve the exact and approximate match LCS problems. This solution uses inexpensive hardware, can be reconfigured later as new analysis techniques are developed, and moreover the exact match LCS algorithm runs in constant time, making it particularly attractive for processing large biosequences.

Index Terms — Associative Computing, Coterie Network, DNA, Dynamic Programming, Genome, Homologous, Longest Common Subsequence, Nucleotide, Sequence Comparison, SIMD

I. Introduction

One problem of great interest to molecular biologists [27] is the problem of sequence comparison [14], in particular comparing nucleotides of nucleic acids (DNA and RNA). DNA molecules encode the genetic information for most organisms, including instructions for the organism's life and development. RNA molecules serve a variety of functions; for example, messenger RNA carries the information used to build proteins. In molecular biology, sequence comparison is used to compare nucleic acid sequences to characterize the homology (i.e., correspondence) between two or more related sequences.

Nucleic acids are composed of a sequence of nucleotides. DNA molecules are composed of four types of (deoxyribo)nucleotides — A, G, C, and T — and can be millions of units long. RNA molecules are composed of four types of (ribo)nucleotides — A, G, C, and U — and are shorter but still up to thousands of units long. Each nucleotide is characterized by the base it contains: adenine (A), guanine (G), and cytosine (C) for both DNA and RNA, and thymine (T) for DNA or uracil (U) for RNA (T and U play similar roles). An organism's genetic information is encoded in the linear ordering of these nucleotides within its DNA molecules.

When a new gene is found, its function is usually unknown, so molecular biologists often compare the new gene's biosequence to known sequences from a genome bank, under the assumption that the new sequence may be functionally or physically similar to a known gene. As more and more DNA fragments are sequenced, the need for comparing the new sequences to those already known increases.

A. Sequence Analysis By Finding the Longest Common Subsequence (LCS)

Comparison between biosequences is usually done by aligning (possibly noncontiguous) subsections of the sequences such that corresponding subsections contain either the same symbols or substitutions. The simplest form of sequence analysis involves solving the Longest Common Subsequence (LCS) problem [13], where we eliminate the operation of substitution and allow only insertions and deletions.

Given an arbitrary string, a subsequence of the string can be obtained by deleting zero or more symbols, not necessarily consecutive ones. For example, if string X is ACCTTGCTC, then AGCTC and ACTTTC are subsequences of X (4 and 3 deletions, respectively), whereas TGTCT and CTCG are not. Although there are typically many common subsequences between two strings, some are longer than others. Given strings of size m and n, the Longest Common Subsequence (LCS) of strings X[1..m] and Y[1..n], LCS(X,Y), is a common subsequence of maximal length. For example, if string X is AGACTGAGTGA and string Y is ACTGAT, then A, ACT, and ACTGA are all common subsequences of X and Y, and ACTGA is the longest common subsequence if an exact match is required. However, if the deletion of characters is allowed (finding an approximate match), then deleting the third G in X gives ACTGAT as an even longer common subsequence.

B. Solving the Longest Common Subsequence (LCS) Problem

Fast solutions to these LCS problems are needed not only in biosequence comparison, but also in data compression, editing error correction, and syntactic pattern recognition [3, 31, 19]. However, the current size of genetic sequence databases is quite large, requiring an enormous amount of computing power for sequence analysis. Since so much computing power is usually not available to most investigators, efficient dynamic programming algorithms have been developed [15] with quadratic time complexity and linear space requirements. Unfortunately, since the size of each sequence is large, as is the size of the database, sequential algorithms are not always sufficient. As a result, parallel algorithms based on linear systolic arrays [11], specialized hardware [29], and distributed computing have been developed. However, solutions based on specialized hardware tend to be more expensive and harder (if not impossible) to change as new analysis techniques are developed.


The LCS problem is typically solved with dynamic programming techniques after filling strings of size m and n into an m x n table. The table elements can be regarded as vertices in a graph, and the dependencies between the table values define the edges. The LCS problem is solved by finding the longest path between the vertex in the upper left corner and the one in the lower right corner of the table.

When solving the LCS problem, there are two issues to consider — finding the longest common subsequence, and finding the best LCS when there are multiple such LCSs (especially probable for approximate match). To find the best LCS, an evaluation score is usually assigned, based on positive values for matches and negative values for substitutions, insertions, or deletions. Dynamic programming techniques consider this score and report the best LCS when properly tuned. In contrast, most other methods lack an evaluation score or use solution methods tied too strongly to one such evaluation.

Study of the theoretical complexity of the LCS problem [3, 25, 12, 22, 10, 9] gives a lower bound of Ω(n^2) if the elementary comparison operation is of type 'equal/unequal' and the alphabet size is unrestricted. Of the current LCS solutions available, the O(n^2/log n) algorithm of Masek and Paterson [32] is theoretically the fastest known. In recent years, several parallel algorithms have been designed [1, 2, 30]. Among them, Aggarwal and Park [1] and Apostolico et al. [2] have independently shown that this problem can be solved in O(log m log n) time using mn/log m processors on the CREW model, on which concurrent reads are allowed but no two processors can simultaneously attempt to write to the same memory location. Mi Lu and Hua Lin [26] proposed two parallel algorithms for this problem on the CREW-PRAM model. One takes O(log^2 m + log n) time with mn/log m processors, which is faster than all existing algorithms on the same model; the other takes O(log^2 m log log m) time with mn/(log^2 m log log m) processors when log^2 m log log m > log n, or otherwise O(log n) time with mn/log n processors.
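To make the sequential baseline concrete, the following sketch computes the LCS length with the standard dynamic programming recurrence in quadratic time and linear space, the complexity cited above; the function name is ours, for illustration only. Note that this classical formulation allows deletions in both strings, so it corresponds to the paper's approximate match notion.

```python
def lcs_length(x: str, y: str) -> int:
    """Classic DP solution to the LCS problem: O(mn) time, O(n) space."""
    prev = [0] * (len(y) + 1)          # previous row of the m x n table
    for xi in x:
        curr = [0]
        for j, yj in enumerate(y, 1):
            # match extends the diagonal; otherwise take the better neighbor
            curr.append(prev[j - 1] + 1 if xi == yj
                        else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]

# Running example from Section I.A: with deletions allowed in both strings,
# the LCS of X and Y is ACTGAT, of length 6.
print(lcs_length("AGACTGAGTGA", "ACTGAT"))  # -> 6
```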

C. Organization of this Paper

In this paper, we describe a new parallel algorithm for solving the LCS problem and an efficient implementation of that algorithm on specialized hardware. Our computing model is described in Section II, along with a description of the associative SIMD ASC Processor that we have implemented using FPGAs. Section III investigates several processing element interconnection networks, and describes the network implemented on our ASC Processor to solve the LCS problem. Our LCS algorithm, targeted toward our ASC Processor with this interconnection network, is then described in Section IV. Finally, the paper comments briefly on the implementation of this algorithm and draws some conclusions.


Figure 1: Associative SIMD Array with PE Interconnection Network

II. The Associative Computing Paradigm

As ASICs and FPGAs continue to grow, a wide range of architectures are being developed to make effective use of the millions (soon to be billions) of transistors available on those chips. System-on-a-Chip (SOC) architectures are increasingly common, and Multiprocessor SOC architectures [6] combine tens, possibly hundreds, of powerful processors, a network, and shared memory on a single chip. Our research is aimed at a third alternative — a SIMD Processor Array on a Chip. In contrast to the small number of powerful, independent CISC processors supported by a Multiprocessor SOC, the Processor Array on a Chip supports thousands of simple RISC processors operating in lock-step SIMD fashion under the direction of a single control unit. While SIMD may have been out of fashion in the 1990s, such SIMD processor arrays are now commercially available [8] and packaged into systems for a variety of applications [33].

These SIMD processor arrays can be further augmented with support for associative computing [7]. Associative computing references memory by content rather than address. In its simplest form, each memory cell is associated with a flag bit, and if a search for key data is successful, this bit is flagged, leaving that memory cell to be processed further as appropriate by the algorithm. An associative SIMD processor array [7, 20] adds associative computing to a SIMD processor array, assigning a dedicated Processing Element (PE) to each set of memory cells (see Figure 1). In associative SIMD computing, PEs search in parallel for a key in their local memories, and PEs whose search is successful are designated responders. Masked instructions can then be used to limit further SIMD processing to only those responders.

This SIMD associative computing model has been explored by researchers at Kent State University for over 30 years. Referred to as the ASC model [20, 21], it is particularly well-suited for applications such as data mining, bioinformatics, image processing, and air traffic control. We have implemented several prototypes of an FPGA-based associative SIMD processor array (see Figure 1), called the ASC Processor [17]. Even a small million-gate FPGA can implement one Control Unit and around 70 PEs. Further, the FPGA-based implementation is inexpensive even in small production volumes, and can easily be reconfigured as new variations on the algorithm are developed. The ASC Processor implements associative operations using a dedicated comparison unit in each PE, a mask stack to indicate when the PE is a responder, and special masked instructions to limit further sequential or parallel processing to only the responders. Additional hardware supports max/min search in constant time (with respect to the number of PEs) [5, 17], giving the PE array the ability to flag as responders those PEs that have a maximum (or minimum) value in a particular memory location.

This ASC Processor uses a reconfigurable network to support PE communication for associative computing. This reconfigurable PE network allows arbitrary PEs in the PE array to be connected via either a linear array (previously implemented) or a 2D mesh (implemented to support the LCS algorithm described in this paper), without the restriction of physical adjacency. More specifically, each PE in the PE array can choose to stay in the fixed existing network, or opt out of the network so that it is bypassed by any communication. While this solution does not allow the responders to be connected into an arbitrary configuration such as the coterie network described later in Section III.A, it does support communication sufficient for many algorithms, including our LCS algorithm.
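The associative search/responder/max-search cycle described above can be sketched in software as follows; the PE class, field names, and data are illustrative assumptions, since the real ASC Processor performs these steps in hardware in a single SIMD step.

```python
# Software sketch of associative search and masked processing on an
# array of PEs (hypothetical field names, not the ASC hardware layout).
class PE:
    def __init__(self, memory):
        self.memory = memory      # local memory cells: field -> value
        self.responder = False    # flag bit set by an associative search

def associative_search(pes, field, key):
    """Every PE compares the broadcast key against its local memory."""
    for pe in pes:                # conceptually a single parallel step
        pe.responder = (pe.memory.get(field) == key)

def masked_max(pes, field):
    """Max search restricted to responders (constant time in hardware)."""
    vals = [pe.memory[field] for pe in pes if pe.responder]
    return max(vals) if vals else None

pes = [PE({"base": b, "score": s})
       for b, s in [("A", 3), ("G", 7), ("A", 5), ("T", 2)]]
associative_search(pes, "base", "A")   # responders: the two PEs holding "A"
print(masked_max(pes, "score"))        # -> 5 (the max among responders only)
```

Note that the non-responder holding score 7 is ignored: masked instructions see only the flagged PEs.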

III. PE Interconnection Network

Virtually every parallel system, whether SIMD or MIMD, supports some kind of PE interconnection network. A linear array or a 2D array (also referred to as a "mesh") is quite common, often with wrap-around at the ends to form a ring or torus. Support for broadcast (at least from the control unit) and reduction is also common, and some systems augment the mesh with row and column broadcast buses. Other systems add a hypercube for long-distance communication among PEs, or specialized hardware for arbitrary point-to-point communication. As may be expected, there is a tradeoff between added functionality and added hardware.


Figure 2: (a) Coterie Network [24], and (b) Switches for PE Interconnect and Bypass

Some systems also add reconfigurability to address particular application domains. Hermann's Asynchronous Reconfigurable Mesh [16] was one of the first reconfigurable networks. Used in an associative processor for image processing, PEs in his processor could reconfigure the mesh network to form local regions of connection called coteries; these coteries are described more fully in Section III.A. Our ASC Processor [18] also implements a reconfigurable network to provide a linear array and 2D mesh for communication among a set of associative PEs; this network is summarized briefly in Section III.B. We have added some concepts from coteries to our ASC Processor to support an efficient implementation of the LCS problem, as also described in that section.

A. Coterie Network

The coterie network [24] was developed by Weems in the late 1980s as a reconfigurable network to support image processing. As shown in Figure 2a, the coterie network allows a 2D array to be partitioned into a set of distinct regions called coteries. Within a coterie, a broadcast reaches all members of the coterie; since coteries need not be limited to a single row or column, this network provides greater functionality than a simple row or column broadcast bus. Coteries can also communicate to exchange information. In their image processing application, separate coteries were assigned to different features in the image, and multi-associative programming was proposed to allow processing those different regions in parallel. However, this model was never fully implemented in hardware.

The key to the reconfigurability of the coterie network is the switches that interconnect the PEs, shown in Figure 2b. These switches are set by loading specific bits in a mesh control register seen as local storage within each PE. Thus PEs in a particular region can close the appropriate switches to form a coterie for broadcast communication within that region, and PEs on the boundaries of the coterie can close the appropriate switches to communicate with adjacent coteries. One set of switches, shown as bold lines in Figure 2b, connects a given PE to its North, East, West, and South neighbors, for either local or coterie-broadcast communication. Another set of switches, mentioned only briefly by Weems et al. but important for the LCS algorithm we present later, allows neighboring PEs to bypass a given PE in their communication — for example, a PE's North neighbor could bypass the PE to communicate directly with its South neighbor or its West neighbor. (Technically, Weems [24] described only the corner-turning bypasses and not the horizontal and vertical bypasses, but they are an obvious extension and again necessary for our LCS algorithm.)
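The per-PE mesh control register described above can be pictured as a small bitmask, one bit per switch. The bit layout and names below are purely illustrative assumptions, not the actual encoding of the coterie network or the ASC Processor.

```python
# Hypothetical bit layout for a PE's mesh control register: one bit per
# switch. Names and positions are illustrative, not the real hardware.
N, E, W, S = 0x1, 0x2, 0x4, 0x8        # neighbor-connect switches
BYP_WS, BYP_NE = 0x10, 0x20            # corner-turning bypasses
BYP_V, BYP_H = 0x40, 0x80              # vertical / horizontal bypasses

def close(reg: int, *switches: int) -> int:
    """Close (set) the given switches in the mesh control register."""
    for sw in switches:
        reg |= sw
    return reg

# A PE inside a coterie connects to all four neighbors for broadcast;
# a matched PE in the later LCS algorithm instead sets its W-S bypass.
coterie_pe = close(0, N, E, W, S)
matched_pe = close(0, BYP_WS)
print(hex(coterie_pe), hex(matched_pe))  # -> 0xf 0x10
```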

B. Modifying the ASC Processor's Reconfigurable Network for the LCS Algorithm

The key to the reconfigurability of the ASC Processor [17] developed at Kent State University is the data switch inside each PE. In the linear array supported by the version of the processor mentioned above, the data switch allows a PE to communicate with its two neighbors. However, the ASC Processor is an associative SIMD array, meaning data is accessed by content rather than address, so the ASC Processor must support associative search (all PEs search for a key in their local memory, and those that find it are designated as responders) as well as associative computing (responders are processed further, either sequentially or in parallel). To support associative computing, the data switch has a bypass mode to allow PE communication to skip non-responders.

To support the LCS algorithm described in Section IV, we have modified the ASC Processor's reconfigurable network from a linear array into a 2D mesh, and added some features inspired by the coterie network. To convert the linear array into a 2D mesh, the data switch was augmented to support communication with any of the PE's four neighbors, and the bypass mode was augmented to allow both horizontal and vertical bypassing of the PE as well as corner-turning bypasses (e.g., directly from a PE's South to West neighbors). We refer to the resulting network as a reconfigurable 2D mesh to highlight the bypass capability. If the sequences being compared are too large for the available PE array (which is likely), the sequences are "folded" onto the PEs available.

The coterie network is a powerful network, though we do not need all its features for our LCS algorithm. Specifically, our algorithm does not require coteries of arbitrary shape, but only coteries along each row and column. Therefore we augmented our ASC Processor's new 2D mesh with row and column broadcast buses, which were then further targeted to the LCS algorithm by limiting the functionality so that only the first processor in the row/column can broadcast to the rest of the row/column.

Figure 3: (a) Text String Stored in First Row, and (b) Text String After Broadcast Along Column Buses

IV. Algorithm to Solve the LCS Problem on the ASC Processor with Reconfigurable 2D Mesh

Inspired by the features of the coterie network and by the parallel string matching algorithms with VLDCs developed by K. L. Chung [23] for the m x n MCCRB model, we developed an LCS algorithm that can be efficiently supported by our ASC Processor when modified to support a reconfigurable 2D mesh (with bypass mode) and row/column broadcast buses as described in Section III.B. This section describes algorithms for finding the longest common subsequence (LCS) of an m-character pattern string within a (usually) larger n-character text string in constant time. First, Section IV.A presents an LCS algorithm to find an exact match of the pattern string within the text string, which runs in O(1) time, and then Section IV.B presents a variation to find an approximate match (where the subsequences contain gaps).

A. ASC Processor LCS Algorithm (Exact Match)

This section presents the exact match version of our LCS algorithm for the ASC Processor with the reconfigurable 2D mesh and row/column broadcast buses described in Section III.B. Initially, we assume the data switches of all the PEs are open, meaning each PE is disconnected from all of its neighbors and no switches are set to bypass the PE. Each PE also has a Match Register "M", a Length Register "L", and a Gap Register "G", all initially set to 0.

Step 1 — Broadcast the Text String

Broadcast each character of the n-character text string (stored in the first row) from the first PE in each column to all the other PEs in that column using the column broadcast bus.

Suppose the text string T = T(1) T(2) ... T(n) has been stored in row 1 of the PE array such that PE(1,j) stores T(j). To follow an example throughout this section, suppose:

Text string T = T(1) T(2) ... T(11) = AGACTGACTGA
Pattern string P = P(1) P(2) ... P(6) = ACTGAC

Figure 4: (a) PEs Set Bypass Switches Along Common Subsequences, and (b) PEs Identify the LCS

We now broadcast text string T from the first PE in each column to all the other PEs in that column using the column broadcast bus. This process is illustrated in Figure 3.

Step 2 — Broadcast the Pattern String

Broadcast each character of the m-character pattern string (stored in the first column) from the first PE in each row to all the other PEs in that row using the row broadcast bus.

Similarly, the pattern string P is broadcast from the first PE in each row to all the other PEs in that row using the row broadcast bus. Each PE now holds both an element of the text string and an element of the pattern string.

Step 3 — Identify Character Matches

In parallel, Push Mask (push "1" onto the top of the mask stack) if row data is equal to column data (i.e., there is a match between the pattern string and text string).

Each PE[i,j] now holds the text character T(j) and the pattern character P(i). Each PE compares those values and sets its Match Register "M" to 1 if they are equal, or 0 otherwise. The resulting Match Register "M" values are shown at the center of the PEs in Figure 4a.
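In software, the net effect of Steps 1 through 3 is simply a match matrix: after the broadcasts, the PE in row i, column j holds both P(i) and T(j), and its Match Register M is 1 exactly where the two agree. A minimal sketch using the running example (variable names are ours):

```python
# Match matrix produced by Steps 1-3 for the running example; compare
# the first rows against the M values shown in Figure 4a.
T = "AGACTGACTGA"   # text string, broadcast down the column buses
P = "ACTGAC"        # pattern string, broadcast across the row buses
M = [[1 if p == t else 0 for t in T] for p in P]

for p, row in zip(P, M):
    print(p, row)
# A [1, 0, 1, 0, 0, 0, 1, 0, 0, 0, 1]
# C [0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0]
# ...
```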

Step 4 — Set Bypass Switches Along the Common Subsequences

1. In parallel, move the top of the mask stack to the West neighbor's reg1.
2. In parallel, move the top of the mask stack to the South neighbor's reg2.
3. BYPASS from West to South in all PEs whose top of the mask stack is 1.
4. In parallel, Push Mask if reg1 = reg2 = 1.
5. BYPASS from North to East in all PEs whose top of the mask stack is 1.

This step is the key element of the algorithm, where we try to connect the PEs corresponding to characters that are common to both strings. The first two lines inform the West and South neighbors of a match, and the third line sets the matched PE to be bypassed from West to South (for example, notice the West-South bypass switch is set in the PE in row 2, column 4, of Figure 4a due to the C-C match in that PE). Lines 4 and 5 then set a North-East bypass in all PEs whose North and South neighbors have a match. The result is that West-South bypass switches are set in the matched cells, and North-East bypass switches are set to connect those matched cells along a common subsequence. Figure 4a illustrates the result for our running example.

Step 5 (Sequential Version) — Identify the LCS

1. Each PE at the beginning (bottom) of an LCS sends a token (initially 0) to its West neighbor.
2. A PE receiving a token adds 1 to the token if its Match Register "M" contains 1, then passes the token on if its West-South bypass switch is set, or stores it in its Length Register "L" if it is at the end of an LCS.
3. The PE with the largest value in its Length Register "L" is the start of the LCS.

This is the simplest method for finding the longest of the common subsequences. After line 2 finishes, the running example (see Figure 4b) gives three long common subsequences ending at PE[1,3] (ACTGAC, of length 6), PE[1,7] (ACTGA, of length 5), and PE[4,2] (GAC, of length 3). Of these, the LCS is ACTGAC, ending at PE[1,3]. Note that finding a maximum value across all PEs is a constant-time operation for a SIMD associative processor, so this step has running time O(nm).
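The bypass switches of Step 4 chain matched PEs along diagonals of the match matrix, and the Step 5 tokens count the run lengths. A sequential software equivalent of this exact match phase is the longest-common-substring recurrence over the same m x n grid (the hardware instead lets the token propagate through the switches); the function name is ours, a sketch rather than the paper's implementation.

```python
# Sequential stand-in for Steps 4-5 (exact match): find the length of
# the longest diagonal run of matches between pattern P and text T.
def longest_exact_match(P: str, T: str) -> int:
    m, n = len(P), len(T)
    L = [[0] * (n + 1) for _ in range(m + 1)]
    best = 0
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if P[i - 1] == T[j - 1]:
                # a match extends the diagonal run ending at [i-1, j-1]
                L[i][j] = L[i - 1][j - 1] + 1
                best = max(best, L[i][j])
    return best

# Running example: ACTGAC occurs contiguously in AGACTGACTGA.
print(longest_exact_match("ACTGAC", "AGACTGACTGA"))  # -> 6
```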

Step 5 (Parallel Version) — Identify the LCS

1. Each PE at the beginning (bottom) of an LCS sends its [row, column] ID to its West neighbor.
2. A PE receiving an ID passes it on, or, if it is at the end of an LCS, subtracts its own ID from the received ID to calculate the length of the path traveled and stores that value in its Length Register "L".
3. The PE with the largest value in its Length Register "L" is the start of the LCS.

This version of Step 5 takes advantage of the switches that connect the PEs along the common subsequences. The IDs in line 2 propagate at electrical speed rather than sequentially, so this step has running time O(1). The basic idea of injecting a token this way was used by K. L. Chung [23] in his string matching algorithm and by S. Akl to find the rank of elements on his Reconfigurable Bus model [28].

B. ASC Processor LCS Algorithm (Approximate Match)

The algorithm described above solves the LCS problem for exact match, but does not yet address approximate match, which is of interest to some biologists doing sequence comparison. As an example, consider strings ACCAGG and AGACTGAGG, where the longest common subsequence ACAGG is an approximate match rather than an exact match, and moreover actually occurs twice in the second string (via two different alignments of AGACTGAGG). For approximate match, we do not have a continuous bypass path from the beginning to the ending PE. However, we can perform a series of steps to get the LCS even in the case of approximate match. We will illustrate this algorithm for pattern string ACCAGG and text string AGACTGAGG to find approximate match ACAGG. The first steps in the approximate match LCS algorithm are the same as for exact match:


Figure 5: (a) Regions of Common Subsequence Separated by a "Gap", and (b) PEs Set Bypass Switches to Span the Gap

Step 1 — Broadcast the Text String

Broadcast each character of the n-character text string (stored in the first row) from the first PE in each column to all the other PEs in that column using the column broadcast bus.

Step 2 — Broadcast the Pattern String

Broadcast each character of the m-character pattern string (stored in the first column) from the first PE in each row to all the other PEs in that row using the row broadcast bus.

Step 3 — Identify Character Matches

In parallel, Push Mask (push "1" onto the top of the mask stack) if row data is equal to column data (i.e., there is a match between the pattern string and text string).

Step 4 — Set Bypass Switches Along the Common Subsequences

1. In parallel, move the top of the mask stack to the West neighbor's reg1.
2. In parallel, move the top of the mask stack to the South neighbor's reg2.
3. BYPASS from West to South in all PEs whose top of the mask stack is 1.
4. In parallel, Push Mask if reg1 = reg2 = 1.

The problem at this point is that the switches connect regions of the LCS but do not cross the "gap" between non-contiguous regions of characters. See Figure 5a, where switches close the AC and AGG paths, but those two paths are separated by a "gap" across the shaded PEs in the center of the figure. The approximate match LCS algorithm must find a way to span this gap.
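What the gap-spanning machinery below realizes in hardware is, in effect, the classical dynamic-programming LCS with insertions and deletions allowed. A small sequential sketch (our own, not the paper's implementation) confirms the running example:

```python
# Classical O(mn) LCS with backtracking to recover one subsequence;
# used here only to check the approximate-match running example.
def lcs(x: str, y: str) -> str:
    m, n = len(x), len(y)
    L = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if x[i - 1] == y[j - 1]:
                L[i][j] = L[i - 1][j - 1] + 1
            else:
                L[i][j] = max(L[i - 1][j], L[i][j - 1])
    # backtrack from the lower-right corner to recover one LCS
    i, j, out = m, n, []
    while i and j:
        if x[i - 1] == y[j - 1]:
            out.append(x[i - 1]); i -= 1; j -= 1
        elif L[i - 1][j] >= L[i][j - 1]:
            i -= 1
        else:
            j -= 1
    return "".join(reversed(out))

print(lcs("ACCAGG", "AGACTGAGG"))  # -> ACAGG
```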

Step 5 — Set Bypass Switches to Span Gaps in the Common Subsequences

1. Each PE at the beginning (bottom) of an LCS sends a token (initially 0) to its West neighbor.
2. BYPASS from West to South in all PEs that receive this token and whose top of the mask stack is 0.
3. BYPASS Vertically (from North to South) in all PEs above the PE identified in line 2, except the PE in row 1.
4. Each PE at the start (top) of an LCS sends a token (initially 0) to its South neighbor.
5. BYPASS from West to South in all PEs that receive this token and whose top of the mask stack is 0.
6. BYPASS Horizontally (from West to East) in all PEs to the right of the PE identified in line 5, except the PE in the last column.
7. BYPASS from West to South in all PEs whose Vertical and Horizontal bypass switches are set and whose Match Register "M" is 0, and set the Gap Register "G" in those PEs to 1.

As in Step 5 (Sequential Version) of the exact match algorithm, we inject tokens at the bottom of each common subsequence. Each time a token reaches a gap, it enters the South port of some PE and stops there, since without a match that PE will not have its West-South bypass switch set to pass the token along. We now close the West-South bypass switch of that PE, along with the Vertical bypass switches of all PEs above it in that column (except the one PE in the first row), allowing the token to cross the gap in "search" of the next region in the sequence. This step is repeated as the token propagates through West-South bypass switches across the gap, possibly allowing multiple tokens to reach PEs in the first row. In a similar way, we inject different token symbols from the South port of all those PEs at the top of each common subsequence. Each time one of these tokens reaches a gap, it enters the West port of some PE and stops there. We now close the West-South bypass switch of that PE, along with the Horizontal bypass switches of all PEs to the right of it in that row (except the one PE in the last column), again allowing the token to cross the gap in "search" of the next region in the sequence. This step is repeated as the token propagates through West-South bypass switches across the gap, possibly allowing multiple tokens to reach PEs in the last column. This process is illustrated in Figure 5, which shows the resulting Vertical and Horizontal bypass switches set in Figure 5b. (Note that multiple tokens pass through the array simultaneously, resulting in more Vertical and Horizontal bypass switches being set than might be expected at first.) Finally, all PEs whose Vertical and Horizontal bypass switches are set, but whose Match Register "M" contains 0, set their Gap Register "G" to 1 and close their West-South bypass switches. These are the PEs corresponding to gaps that can be successfully crossed by the token.

Step 6 — Identify the LCS

1. Each PE at the beginning (bottom) of an LCS sends a token (initially 0) to its West neighbor.
2. A PE receiving a token adds 1 to it if its Match Register "M" contains 1, then passes it on if its West-South bypass switch is set, or stores it in its Length Register "L" if it is at the end of an LCS.
3. The PE with the largest value in its Length Register "L" is the start of the LCS.

As in Step 5 (Sequential Version) of the exact match algorithm, we inject tokens at the bottom of each common subsequence and find the longest such common subsequence. If the user would like to add penalty values for gaps, or to consider character sequences, gap lengths, etc., those factors can be handled as shown below in Step 6 (Alternate).

Step 6 (Alternate) — Identify the Best LCS
1. Each PE at the beginning (bottom) of an LCS sends a token (initially 0) to its West neighbor
2. A PE receiving a token adds 1 to the token's Length value if its Match Register "M" contains 1
3. A PE receiving a token adds 1 to the token's Gap value if its Gap Register "G" contains 1
4. After updating the token value as above, a PE receiving a token passes the token on if its West-South bypass switch is set, or stores it in its Length Register "L" if it is the end of an LCS
5. The PE with the largest value in its Length Register "L" is the start of the LCS
6. If there are multiple LCSs, the "best" LCS can be chosen based on weighted Length and Gap values

This alternate version of Step 6 considers the length of the "gap" in the common subsequence. It is basically the same as the other Step 6, except the token now contains both Length and Gap values. When the token crosses a PE that is a match, the Length value is incremented; when it crosses a gap where the Vertical and Horizontal bypass switches are both set, the Gap value is incremented. If it crosses a gap solely along a Vertical or Horizontal bypass switch, the Gap value remains unchanged. Other optimizations can also be applied, depending on the molecular biologist's definition of the "best" LCS. For example, if during this step more than one token arrives at a port of a particular PE, we could pass along only the token whose Length value is larger than any token encountered so far, or the token whose Gap value is smaller than any previously encountered token of the same length. To illustrate this process, consider PE [3, 5] (shown shaded in Figure 5b), where one token approaches that PE's South port and another its East port. The token at its South port has Length 2 and Gap 2, while the token at its East port has Length 3 and Gap 1. If Length is the most important criterion, then the token from the East port will pass on and the token from the South port will be blocked. After all LCSs are found, for each LCS we have the length of the LCS and the number of gaps crossed. These factors can then be considered to select the "best" LCS using whatever weighting seems most appropriate to the molecular biologists performing the sequence comparison.
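The token-arbitration policy described above can be sketched as follows. This is a hypothetical illustration of one possible weighting (longest first, fewest gaps as tie-break); the paper leaves the exact weighting to the molecular biologist, and the function name is invented.

```python
# Hypothetical sketch of "best LCS" token arbitration in Step 6 (Alternate):
# each token carries a (Length, Gap) pair, and when several tokens meet at a
# PE only the preferred one is passed on.  The policy below prefers the larger
# Length; among equal Lengths it prefers the smaller Gap value.

def best_token(tokens):
    """tokens: list of (length, gaps) pairs arriving at one PE."""
    return max(tokens, key=lambda t: (t[0], -t[1]))

# The PE [3, 5] example from the text: South token (Length 2, Gap 2) vs.
# East token (Length 3, Gap 1).  With Length most important, East wins.
print(best_token([(2, 2), (3, 1)]))  # -> (3, 1)
```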

V. Conclusion and Future Work

As a proof of concept, we modified our ASC Processor to support a reconfigurable 2D mesh (with bypass mode) and row/column broadcast buses as described in Section III, and then successfully implemented the exact match version of our LCS algorithm on this new ASC Processor. An Altera APEX1000KC FPGA was large enough to hold the 6x11 array of PEs used in the examples of Section IV, and ran at a clock speed of 37 MHz. Larger sequences could be compared either by "folding" them onto this small PE array (still gaining a modest speed improvement) or by constructing a larger PE array using larger FPGAs.

In this paper, we have described a new parallel algorithm for solving the LCS problem and an efficient implementation of that algorithm on specialized hardware. Our hardware builds on our ASC Processor, designed previously and implemented on an Altera FPGA. We were inspired by certain features of the coterie network, and modified our ASC Processor to add a reconfigurable 2D mesh and row/column broadcast buses. We then took advantage of the resulting architecture to develop an efficient solution to the LCS problem, for both exact and approximate match. Since the resulting LCS algorithm runs on our FPGA-based ASC Processor, the underlying hardware is inexpensive and can be easily reconfigured as new analysis techniques are developed. In summary, this algorithm / processor combination provides a fast and inexpensive solution to the LCS problem.

Future work will concentrate on providing additional parameters to find the "best" common subsequence instead of simply the "longest" common subsequence. In molecular biology, the sequence comparison may need to be parameterized to give different weights to different nucleotide matches, or to consider the length or number of the individual subsequences and gaps.
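The "folding" mentioned above amounts to mapping many sequence positions onto few PEs. The sketch below is a hypothetical illustration of one simple index mapping (round-robin); the paper does not specify the actual folding scheme, and the function name is invented.

```python
# Hypothetical sketch of "folding" a long sequence onto a fixed-size PE row:
# sequence position i is handled by PE i % n_pes, so each PE processes its
# positions in successive passes.  This models only the index mapping, not
# the timing or control of the real hardware.

def fold(seq, n_pes):
    """Return, for each PE, the list of sequence positions it handles."""
    assignment = [[] for _ in range(n_pes)]
    for i in range(len(seq)):
        assignment[i % n_pes].append(i)
    return assignment

# An 11-character sequence folded onto a 6-PE row (the array width used in
# the paper's examples):
print(fold("ACGTACGTACG", 6))
# -> [[0, 6], [1, 7], [2, 8], [3, 9], [4, 10], [5]]
```

Each PE then needs O(len(seq)/n_pes) passes instead of one, which is why folding trades some of the constant-time speedup for capacity.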

VI. References

[1] A. Aggarwal and J. Park, "Notes on Searching in Multidimensional Monotone Arrays," Proc. 29th Ann. IEEE Symp. on Foundations of Computer Science, pp. 497-512, 1988.
[2] A. Apostolico, M. Atallah, L. Larmore, and S. McFaddin, "Efficient Parallel Algorithms for String Editing and Related Problems," SIAM J. Computing, Vol. 19, pp. 968-988, Oct. 1990.
[3] A. Aho, D. Hirschberg, and J. Ullman, "Bounds on the Complexity of the Longest Common Subsequence Problem," Journal of the ACM, Vol. 23, No. 1, pp. 1-12, Jan. 1976.
[4] A. Apostolico and Z. Galil (eds.), Pattern Matching Algorithms, Oxford University Press, 1997.
[5] A. D. Falkoff, "Algorithms for Parallel Search Memories," Journal of the ACM, Vol. 9, No. 4, pp. 488-511, 1962.
[6] A. Jerraya and W. Wolf, Multiprocessor Systems-on-Chips, Morgan Kaufmann, 2004.
[7] A. Krikelis and C. Weems, Associative Processing and Processors, IEEE Press, July 1997.
[8] ClearSpeed Products. http://www.clearspeed.com/products
[9] C. K. Wong and A. K. Chandra, "The String-to-String Correction Problem," Journal of the ACM, Vol. 21, pp. 13-16, 1976.
[10] D. Maier, "The Complexity of Some Problems on Subsequences and Supersequences," Journal of the ACM, Vol. 25, No. 2, pp. 322-336, 1978.
[11] D. P. Lopresti, "Rapid Implementation of a Genetic Sequence Comparator Using Field Programmable Logic Arrays," Advanced Research in VLSI, pp. 138-152, 1991.
[12] D. S. Hirschberg, "An Information Theoretic Lower Bound for the Longest Common Subsequence Problem," Inf. Proc. Letters, Vol. 7, No. 1, pp. 40-41, 1978.
[13] D. S. Hirschberg, "Algorithms for the Longest Common Subsequence Problem," Journal of the ACM, Vol. 24, No. 4, pp. 664-675, Oct. 1977.
[14] D. Sankoff and J. B. Kruskal (eds.), Time Warps, String Edits, and Macromolecules: The Theory and Practice of Sequence Comparison, Addison-Wesley, 1983.
[15] E. W. Myers and W. Miller, "Optimal Alignment in Linear Space," CABIOS, Vol. 4, No. 1, pp. 11-17, 1988.
[16] F. P. Herrmann and C. G. Sodini, "A Dynamic Associative Processor for Machine Vision Applications," IEEE Micro, Vol. 12, No. 3, pp. 31-41, June 1992.
[17] H. Wang and R. A. Walker, "Implementing a Scalable ASC Processor," Proc. of the 17th International Parallel and Distributed Processing Symposium (Workshop in Massively Parallel Processing), abstract on p. 267, full text on accompanying CD-ROM, April 2003.
[18] H. Wang and R. A. Walker, "A Scalable Pipelined Associative SIMD Array with Reconfigurable PE Interconnection Network for Embedded Applications," Proc. of the 17th International Conference on Parallel and Distributed Computing and Systems (PDCS'05), Phoenix, Arizona, November 2005.
[19] J. Modelevsky, "Computer Applications in Applied Genetic Engineering," Advances in Applied Microbiology, Vol. 30, pp. 169-195, 1984.
[20] J. Potter, Associative Computing: A Computing Paradigm, Plenum Publishing, New York, 1992.
[21] J. Potter, J. Baker, S. Scott, A. Bansal, C. Leangsuksun, and C. Asthagiri, "ASC: An Associative Computing Paradigm," IEEE Computer, Vol. 27, No. 11, pp. 19-25, November 1994.
[22] J. W. Hunt and T. G. Szymanski, "A Fast Algorithm for Computing Longest Common Subsequences," Comm. of the ACM, Vol. 20, No. 5, pp. 350-353, 1977.
[23] K. L. Chung, "O(1)-Time Parallel String-Matching Algorithm with VLDCs," Pattern Recognition Letters, Vol. 17, pp. 475-479, 1996.
[24] M. C. Herbordt, C. C. Weems, and M. J. Scudder, "Non-Uniform Region Processing on SIMD Arrays Using the Coterie Network," Machine Vision and Applications, Vol. 5, No. 2, pp. 105-125, Spring 1992.
[25] M. L. Fredman, "On Computing the Length of Longest Increasing Subsequences," Disc. Math., Vol. 11, pp. 29-35, 1975.
[26] M. Lu and H. Lin, "Parallel Algorithms for the Longest Common Subsequence Problem," IEEE Transactions on Parallel and Distributed Systems, Vol. 5, No. 8, pp. 835-848, 1994.
[27] P. A. Pevzner and N. C. Jones, An Introduction to Bioinformatics Algorithms, MIT Press, MA, 2004.
[28] S. Akl, Parallel Computation: Models and Methods, Prentice-Hall, New York, 1997.
[29] T. Hunkapiller, M. Waterman, R. Jones, M. Eggert, E. Chow, J. Peterson, and L. Hood, "Special Purpose VLSI-Based System for the Analysis of Genetic Sequences," Human Genome: 1989-90 Program Report, p. 101, March 1990.
[30] T. Mathies, "A Fast Parallel Algorithm to Determine Edit Distance," Tech. Rep. CMU-CS-88-130, Dept. of Computer Science, Carnegie Mellon Univ., Pittsburgh, PA, 1988.
[31] V. Chvátal, D. Klarner, and D. Knuth, "Selected Combinatorial Research Problems," Tech. Rep. STAN-CS-72-292, Stanford Univ., p. 26, 1972.
[32] W. J. Masek and M. S. Paterson, "A Faster Algorithm for Computing String-Edit Distances," Journal of Computer and System Sciences, Vol. 20, No. 1, pp. 18-31, 1980.
[33] WorldScape Defense Massively Parallel Computing. http://www.wscapeinc.com/technology.html
