Alignments of DNA and protein sequences containing frameshift errors

Vol. 12 no. 1 1996 Pages 31-40 Alignments of DNA and protein sequences containing frameshift errors Xiaojun Guan and Edward C.Uberbacher Abstract Mol...

Author: Samson Sharp

1 downloads 0 Views 743KB Size

Report

Download PDF

Recommend Documents

Biological Sequences: DNA, RNA, Protein

Fundamentals of massive automatic pairwise alignments of protein sequences: theoretical

DbClustal: rapid and reliable global multiple alignments of protein sequences detected by database searches

Recognition of Specific DNA Sequences

Consensus-Based Prediction of RNA and DNA Binding Residues from Protein Sequences

DNA, RNA, AND PROTEIN SYNTHESIS

Protein and DNA Sequence Comparison

Fast Fourier Transform Analysis of DNA Sequences

Up-frameshift (Upf) factors eliminate aberrant mrnas containing

Aligning Flowgrams to DNA Sequences

Sequences containing no 3-term arithmetic progressions

Editing Informational Content of Expressed DNA Sequences and Their Transcripts

Protein, DNA & RNA Structure

DNA, RNA, Protein Sample Prep & DNA Amplification

DNA, Protein Synthesis and Biotechnology EOC Review

DNA Replication and Protein Synthesis B5-B8

Unit 5 DNA, RNA and Protein Synthesis

CH. 8- DNA and protein synthesis

Genetics, DNA and Protein Synthesis Study Guide

FGFR1: RNA, DNA and protein correlations

CHAPTER 10 DNA, RNA, AND PROTEIN SYNTHESIS

Finding flexible patterns in unaligned protein sequences

Codon optimization: Why & how to design DNA sequences for optimal soluble protein expression. Rachel Speer, Ph.D

DNA Vom Gen zum Protein

Vol. 12 no. 1 1996 Pages 31-40

Alignments of DNA and protein sequences containing frameshift errors Xiaojun Guan and Edward C.Uberbacher Abstract Molecular sequences, like all experimental data, are subject to error. Many current DNA sequencing protocols have very significant error rates and often generate artefactual insertions and deletions of bases (indels) which corrupt the translation of sequences and compromise the detection of protein homologies. The impact of these errors on the utility of molecular sequence data is dependent on the analytic technique used to interpret the data. In the presence of frameshift errors, standard algorithms using six-frame translation can miss important homologies because only subfragments of the correct translation are available in any given frame. We present a new algorithm which can detect and correct frameshift errors in DNA sequences during comparison of translated sequences with protein sequences in the databases. This algorithm can recognize homologous proteins sharing 30% identity even in the presence of a 7% frameshift error rate. Our algorithm uses dynamic programming, producing a guaranteed optimal alignment in the presence of'frameshifts, and has a sensitivity equivalent to Smith-Waterman. The computational efficiency of the algorithm is 0(nm) where n and m are the sizes of two sequences being compared. The algorithm does not rely on prior knowledge or heuristic rules and performs significantly better than any previously reported method. Introduction Knowledge of protein and nucleic acid sequences is central to many aspects of modern biology. Experimentally determined DNA sequences are generally used as the basis for determining the sequence of amino acids in encoded proteins using multiple-frame translation or pattern recognition methods. At present, comparison among the sequences is most commonly done using protein sequence (i.e. deduced amino acid sequence) to understand functional relationships among proteins in the same and in different species (States et al., 1991). A number of sequence comparison algorithms have been introduced. For rigorous analysis, the dynamic programComputer Science and Mathematics Division, Oak Ridge National Laboratory, Oak Ridge, TN 37831-6364, USA E-mail:[email protected]

© Oxford University Press

ming algorithm proposed by Smith and Waterman (1981), and modified by Gotoh (1982), can be used to find the optimal alignment of two sequences. For fast identification of homologous sequences, heuristic methods such as BLAST (Altschul et al., 1990), based on a probabilistic model, and FastA (Pearson and Lipman, 1988), based on initial &-tuple matches, can be used. These heuristic methods are not as sensitive or accurate as the full dynamic programming method. As experimental data, sequences are subject to error. Error can arise from a variety of sources, such as migration artifacts in sequencing gels, improper gel reading by computer software, methods for removing inconsistencies during sequence assembly, and in data entry. Although some level of error in sequences seems inevitable, different sequencing strategies have different intrinsic error rates and types. Some approaches, such as single-pass cDNA sequencing or cosmid skimming, have virtually no redundancy checks on the sequence, and can have error rates in excess of 10%, while high-redundancy (shotgun) genomic sequencing can be ~ 1 % or less. In considering the impact of errors in nucleotide sequences, it is useful to consider not only the frequency of errors but also their type (substitution, deletion or insertion of one or more bases). In particular, derived amino acid sequences are affected very differently by errors of different types. The most common type of error involves an incorrect read for a base which results in a substitution of one base for another. This weakens the recognition of homologies in a linear and fairly gentle way. The requirement of a consistent reading frame for translation, free of insertions or deletions (i.e. without 'frameshift' errors), is a much more stringent constraint on the quality of sequence data used to recognize homologies with standard sequence comparison algorithms. In the presence of insertion-deletion errors, resulting translations for the coding regions may have only certain parts correct in any one given frame. Therefore, when comparing the protein translation to sequences in a protein database, useful homologies can easily be lost. The trend toward efficient sequencing methods which inherently have less redundancy creates a need for systems which can detect coding regions and detect sequence homologies in the presence of significant error rates.

31

X.Guan and E.C.Uberbacher

Several methods have been developed to deal with sequence errors when comparing sequences. Posfai and Roberts (1992) used an approach based on Fast A to compare the six-frame translations of a DNA sequence to sequences in a protein database. In this method, if two strongly matched segments are adjacent and are in different reading frames, a possible frameshift site has been found. States and Botstein (1991) proposed an algorithm based on a probabilistic model incorporating codon usage information and a non-uniform distribution of error probabilities. In their algorithm, a DNA sequence which may have errors is compared with a protein sequence, and uncertainties are assigned for each base in the DNA sequence as well as for an insertion or a deletion at that base using the prior information. In this method, and for a specific organism, proteins with 33% sequence identity can be recognized in the presence of 1 % frameshifting errors (deletion or insertion) and 5% base substitution. Both existing methods either rely on a probability model or on heuristic rules and involve methods with reduced sensitivity and accuracy compared to more robust dynamic programming methods. While these algorithms have some specialized utility, we describe here a more general and more sensitive approach to dealing with errors in the course of sequence comparison. We have previously described a coding recognition system which functions in the presence of indels (Xu el al., 1995), and in this paper address methods for recognition of protein sequence homologies when frameshifts are present. The algorithm we describe can recognize homology between DNA sequence translations and related protein sequences in the presence of significant (several percent) indel rates in DNA sequences. Furthermore, in this method and for each identified homology, the frameshift sites in the query sequence are located as well. The method is very general because it does not use heuristic rules, specific codon usage information, or any knowledge other than DNA sequences themselves. Our algorithm uses a rigorous dynamic programming approach similar to the Smith-Waterman algorithm, and is therefore more sensitive and accurate than other current methods. Methods The Smith-Waterman algorithm for local alignment modified by Gotoh (1982) is as follows. Given two s e q u e n c e s A=ala2--an a n d B = b\b2...bm, a cost function, w, for k insertions or deletions is defined as w(k) = —(M * k + v), where w^O and v^-Q. The alignment of A and B is performed by creating a score matrix D whose values are computed systematically from the upper left corner of the matrix to the lower right. Two other

32

matrices P and Q are used to calculate matrix D. D{i,j) = max{£>(/ - l , j - \ ) + c{bhaj),

P{i,j),

Q{U),0}

o < /ss«7,o (0, k) = Q{k, 0) = D(k, 0) = £>(0, k) = 0,

\fk > 0

c(bi,Oj) represents the similarity between cij and bj. To find the best local alignment, the maximum score in the score matrix D is found first which corresponds to the right end of an optimal local alignment, then a traceback procedure is performed to find the left end of the optimal local alignment. An example is given in Figure 1. Here the gap penalty function w(k) = - ( 1 *k+ 10), and similarities between amino acids are defined by the protein similarity matrix BLOSUM 62 (Henikoff and Henikoff, 1993). The same gap penalty function and the matrix BLOSUM 62 are used in all the experiments in this paper. When a DNA sequence with frameshift errors is translated into a protein sequence and then aligned with sequences in a protein database, correctly translated segments fall into different frames. In the comparison

M H F C G G T S L I N D Q W V V C P T

0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0

C

0 0 0 0 0 0 0 9 0 0 0 0 0 0 0 0 2 0 1 1 0 0 0 0 0 0 0 0 1 0 1 0 0 10 0 0 0 0

T G

L V S T R A V L T A G H

0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 7 6 0 0 0 0 0 0 0 0 0 0 0 6 0 13 6 1 0 0 0 0 0 0 0 0 0 6 5 2 14 5 2 1 5 0 0 0 0 5 0 0 1 5 6 12 3 6 2 4 1 0 0 1 6 0 0 0 3 10 13 2 5 0 3 2 4 0 0 2 0 0 1 5 13 11 1 2 0 6 4 3 0 0 1 0 1 0 2 14 11 2 1 0 3 4 1 0 0 0 0 0 1 3 13 9 1 0 0 2 2 0 0 0 0 0 0 2 2 14 8 2 1 0 1 0 0 0 0 0 0 1 1 3 11 5 0 0 0 0 0 0 0 1 4 0 1 2 3 15 e 3 2 I 0 0 0 1 5 2 0 1 2 7 16 e 4 3 0 0 0 0 0 4 1 0 1 3 6 15 6 3 9 0 0 0 0 0 3 0 0 2 4 5 14 4 5 7 1 0 0 1 5 2 0 1 3 9 5 12

0 8 0 0 0 4 4 0 0 0 1 0 0 0 0 2 2 2 2

b.

SI:

3 CGGTSLINDQWW 15

S2:

2 CTG-SLVSTRAVL 13

Fig. 1. Two protein sequences, SI = HFCGGTSLINDQWVVCPT and S2 = MCTGSLVSTRAVLTAGH, are compared using the SmithWaterman algorithm. Here w(k) = - ( 1 * k + 10), and similarities between amino acids are defined by the protein similarity matrix BLOSUM 62 (Henikoff and Henikoff, 1993). The maximum score in the score matrix D (a) is 16 at (15,13) and the corresponding alignment is given in (b).

Sequence alignment with error correction

a.

Identities: 63/234 (27%) Query:

36 PYQVSLNSGYHFCGGSLINDQWWSAAHCYKSRIQVRLGEHNINVLEGDEQFI

NAA 91

Sbjct:

41 PFIAFLTTERTMCTGSLVSTRAVLTAGHCVCSPLPV-IRVSFLTLRNGIX2QGIHHQPSGV 99

Query:

92 KIIK--HPNY

S-SWTLNN-DIMLIKLSSPVKLNARVAPVAU--PSACAPAGTQ 138

Sbjct:

100 KVAPGYMPSCMSARQRRPIAQTLSGFDIAIVMLAQMVNLQSGIRVISLPQPSDIPPPGTG 159

Query:

139 CLISGWG---NTLSNGVNNPDLLQCVDAPVLSQADCEAAYPGEITSSMICVGFLEG-GKD 194

Sbjct:

160 VFIVGYGRDDNDRDPSRKNGGILKKGRATIM

ECRHATNG

NPICVKAGQNFGQL 212

Query:

195 SCQGDSGGPW--CNGQLQGIVSWGYGCALPDNPG--V-YTKVCNFVGWIQDTI 243

Sbjct:

213 PAPGDSGGPLLPSLQGPVLGWS--HGVTLPNLPDIIVEYASVARMLDFVRSNI 264

b. framel: Query:

3 6 PYQVSLNSGYHFCGGSLINDQWWSAAH 63

Sbjct:

41 PFIAFLTTERTMCTGSLVSTRAVLTAGH 68

frarae2: Query:

191 GKDSCQGDSGGP 202

Sbjct:

210 GQLPAPGDSGGP 221

frame3: Query:

106 DIMLIKLSSPVKLNARVAPVALP 128

Sbjct:

126 DIAIVMLAQMVNLQSGIRVISLP 148

Fig. 2. The alignments of the translation of the rat mRNA and cercarial elastase using the Smith-Waterman algorithm before and after three deletion errors were introduced, where the rat mRNA is the query sequence and cercarial elastase is the subject sequence, (a) The local alignment of the translation of the rat mRNA sequence with cercarial elastase. (b) The three-frame local alignment of the rat mRNA sequence containing three deletion errors with cercarial elastase. Only small segments from the error-free alignment were picked up in each of the three frame alignments.

for any given frame, correct segments may be small and some segments will be incorrect, resulting in significant mismatch, even if the two sequences are truly homologous. An example is shown in Figure 2. The translation of the rat neutrophil elastase mRNA and human cercarial elastase share 27% sequence identity. The translation of rat mRNA is first aligned with human neutrophil elastase using the standard Smith-Waterman algorithm to show the homology (Figure 2(a)). When three deletion errors are introduced in the rat mRNA sequence, and a threeframe comparison is made using Smith-Waterman, the best local homology found in any frame is very poor (Figure 2(b)). In a comprehensive database search, the original alignment was at position 219 in the result list, but with the indels the best local alignment moves down to position 1410, effectively becoming lost in the noise. As in the above example, consider what happens when the translation in all three frames is aligned independently. Suppose the coding region starts in frame 1, and a deletion occurs in the sequence. The frame 1 translation is aligned with the protein sequence until the deletion site, then the frame 3 translation is aligned with the protein sequence. If a second deletion is encountered, the frame 2 translation picks up the alignment at this site. If instead an insertion

were to occur after the first deletion, the alignment would return to the frame 1 translation. If we align the three frame translations independently, segments of alignments will be scattered throughout the three score matrices. To find potential frameshift sites, one can examine the segments of alignments in all three frames and look for cases where two segments are adjacent and are in different frames. One might consider heuristic rules to decide which segments to bring into an alignment, especially if the homology is quite strong. However, if the homology is weak, or if there are many such segments, it is difficult to find the overall best alignment because of the complexity and combinatorics involved. We describe a new algorithm to find the optimal alignment in such a situation and which also locates the indel sites. Since any combination of deletions and insertions places the correct reading frame into one of three frames, no combinatorial calculation is necessary and a computationally efficient dynamic programming approach can be used. In the standard Smith-Waterman algorithm, a score matrix cell (/J)'s value can depend on three other matrix cells ( ; — l j - 1 ) , (ij — \) and (i-\,j). In the new alignment algorithm we consider not only the three cells in the same matrix, but also the same three cells in the other two

33

X.Guan and E.C.Uberbacher

frame 2

penalty, and let £>,, Ph and Qt, 1 ^ / ^ 3 , denote the matrices for the three frames. These matrices are initialized in the same way as in the Smith-Waterman algorithm. The steps to calculate D\(iJ) are listed below.

g,(/,j) = max{£>,(/,y- 1

w{\), Qx{iJ - 1) + u)

£),(/j) = max{£>,(/- I J

\) + c(bhaJ),Pl{i,j),

/, = max{£> 2 (/- I J ) + «'(1), P2(i - \,j)

h=max{D2(iJ-

\) + w(\),Q2{U- ')

/3 = m a x { 0 2 ( / - I J - ^) + if ?3 -6>

D\(i,j)

c(bhaj),tut

then

D\{i,j) = t3 — 6 Fig. 3. This picture shows the basic idea of our algorithm. When calculating score matrix cell (ij) for frame 1, we not only consider the three matrix cells (/ - IJ - 1), (/ — 1 J) and (ij - 1) in frame 1, but also consider the same three matrix cells in frames 2 and 3, thereby connecting segments of matching sequences in all three frames. This is done for all three score matrices.

frames' matrices. That is, when computing the alignment for a given frame of translation, we also consider whether there is a better partial alignment in either of the other frames prior to this point that can be continued by shifting the frame to the one under consideration (see Figure 3). To prevent an alignment from being constructed from random matches from three frames and very frequent frameshifts, a penalty is imposed for shifting the frame. Let 8 denote the frameshift

I

i2 = ma\{D3(iJ-

1) +

t3 = m a x { £ > 3 ( / - I J if tJ-6> r

D](/j) then

' ^'^' ~ ' Q\{'J) = t2 — 6 _ \\ J) — i }

Identities: 61/234 (26%) Query:

36 PYQVSLNSGYHFCGGSLINDQWWSAAHAT-NPIQVRLGEHNINVLEGDEQFI

Sbjct:

41 PFIAFLTTERTMCTGSLVSTRAVLTAGHCVCSPLPV-IRVSFLTLRNGDQQGIHHQPSGV 99

NAA 90

Frame:

41 111111111111111111111111111111111333333333 33 33 33333333333333 99

Query:

91 KIIK--HPNY

S-SWTLNN-DIMLIKLSSPVKLNARVAPVAL--PSACAPAGLS 137

Sbjct:

100 KVAPGYMPSCMSARQRRPIAQTLSGFDIAIVMLAQMVNLQSGIRVISLPQPSDIPPPGTG 159

Frame:

100 333333 3333333333333333 333333 33 33 33333333 33 33 3333333333333333 159

Query:

138 ALISGWG

NTLSNGVNNPDLLQCVDAPVLSQADCEAAYPGEITSSMICVGFLEG-GKD 193

Sbjct:

160 VFIVGYGRDDNDRDPSRKNGGILKKGRATIM

Frame:

160

ECRHATNG

NPICVKAGQNFGQL 212

Query:

194 SCQGDSGGPW--CNGQLQGIVSWGYGCALPDNPG--V-YTKVCNFVGWIQDTI 242

3 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 212

Sbjct:

213 PAPGDSGGPLLPSLQGPVLGWS--HGVTLPNLPDIIVEYASVARMLDFVRSNI

264

Frame:

213 222222221111111111111111111111111111111111111111111111

264

Fig. 4. The alignment of the rat mRNA containing three deletion errors with cercarial elastase using our algorithm.

34

Sequence alignment with error correction

D2 and £>3 are calculated similarly. When the calculation is completed, segments of matching sequences in the three score matrices are threaded together into sequence alignments. Along with each alignment path we record the frame of the current match. In the traceback procedure, we start from the matrix cell of maximum value and traceback the alignment through the three frames' matrices. If there is true homology between the two sequences aligned, the three score matrices will show basically the same alignment with slight differences at the ends. Figure 4 shows the alignment of the same sequences in Figure 2 using the new algorithm.

80 -

X

40 -

100

150

200

250

300

350 score

100

150

200

250

300

350 score

60 -

40 -

0

50

Results and discussion

We evaluated the sensitivity and accuracy of our algorithm in two different ways. Firstly, using several different examples, a DNA sequence coding region was subject to different error rates (errors were introduced randomly) and compared with a homologous protein sequence. As a control, the results were compared with that of the randomly permuted sequences that share the same base composition with the DNA sequence in a manner similar to a previous study by States and Botstein (1991). The goal was to evaluate the impact of different error rates on a known sequence alignments, and determine how much error was tolerable before the homology would become insignificant relative to the noise. Secondly, we searched a DNA sequence with different rates of indel error against a protein database to see whether the sequences from the same protein family were recognized by the algorithm, and how both strong and weak homologies within the result list were affected by different rates of sequence error. Binary comparison and alignment

This evaluation is similar to one performed by States and Botstein (1991). We selected the same two sequences, rat trypsin and human neutrophil, which share 33% sequence identity. We first tested the impact of substitution errors on the sequence alignment. For each error rate, 100 runs of the alignments were performed, each with a new set of errors randomly introduced in the coding region of the rat trypsin sequence. The distribution of the scores is plotted in Figure 5. As a control for the comparison, the DNA sequence was randomly permuted, translated and compared with the target protein sequence. The score distribution is also plotted on the same graph in Figure 5. The purpose was to examine the discrimination between the true alignment and the resulting random alignments. As can be seen in Figure 5, as more errors are introduced, the alignment and discrimination become

, ^TTThrn—,

,

150

350 score

200

250

300

10Z 100

150

200

250

300

350 score

80 -\

15Z 200

250

300

350 score

202 100

150

200

250

300 350

253! 100

150

200

250

300

350

Fig. 5. Two sequences, rat trypsin and human neutrophil, which share 33% sequence identity, were compared with a range of substitution error rates (1-25%). For each error rate, 100 runs of the alignments, each with a new set of errors introduced in the sequences, were performed and the distribution of the alignment scores was plotted as the open bars. The solid bars represented the distribution of the alignment scores of the same DNA sequences, randomly permuted, translated and compared with the target protein sequence.

weaker (as expected). However the discrimination remains significant to very high substitution rates. The performance is basically the same as standard Smith-Waterman in this case (base substitution but no indels). With 33% sequence identity between the two sequences, the algorithm can recognize the homology in the presence of up to 15% substitution errors, as opposed to ~ 5% for the States-Botstein algorithm. A similar evaluation was done to measure the effects of indel errors. Indel errors cause more damage to sequence alignment than substitution errors because indels produce frameshifts, while substitutions only reduce

35

X.Cuan and E.C.Uberbacher

80 -i 40 -

100

150

200

250

300

350 score

100

150

200

250

300

350 score

250

300

350 score

80 -i

50

100

150

200

250

300

350 score

52 50

100

150

200

250

300

350 score 12 50

50

100

150

200

250

300

350 score

10* 150

200

250

300

350 score

15* 100

150

200

250

300

350 score

Fig. 6. Discrimination for sequences with indel errors. Indel error rates (1-15%) were introduced and the distributions of alignments scores were plotted as in Figure 5.

sequence identity or similarity. For the same test sequences, the algorithm adequately separated the actual homologies from noise in the presence of a 7% frameshift error rate (see Figure 6), compared to ~ 1% for the States-Botstein algorithm. As we calculate the three score matrices, the frameshift information is recorded and can be used during traceback to get the positions of the frameshifts which are included in the final alignment (see Figure 4). To simulate the real applications where the background is very large, such as a sequence database, we repeated the above test with the entire SwissProt database as the control set. In Figure 7(a), the score of a perfect alignment between rat trypsin and human neutrophil was plotted along with the score distribution

36

100

150

200

250

300

350

Fig. 7. (a) The score of a perfect alignment between rat trypsin and human neutrophil was plotted (the open bar) along with the score distribution of the alignments of rat trypsin with all sequences in the SwissProt database where the sequences homologous to rat trypsin were removed, (b) Indel errors (1%) were added to rat trypsin and the resulting alignment score distribution was plotted as in Figure 6. The control set is the same set of sequences used in (a). Because of the large number of the sequences in the SwissProt database, the distribution of the control set was scaled down when plotted.

of the alignments of rat trypsin with all sequences in the SwissProt database where the sequences homologous to rat trypsin were removed. The purpose was to show how the perfect alignment between the two sequences is separated from the background noise. In Figure 7(b), 1% indel errors were added to rat trypsin and the resulting alignment score distribution was plotted as in Figure 6. The control set is the same set of sequences used in Figure 7(a). Because of the large number of the sequences in the SwissProt database, the distribution of the control set was scaled down when plotted (note that the tail of the distribution does not extend past 65 along the score axis). Figure 7 shows that although the 1% indels weakened the real alignments and increased the background noise somewhat, the real alignments remained clearly separated from the background. Comparison to protein family and the database One way of evaluating a protein sequence comparison algorithm is to examine how well a member of a protein

Sequence alignment with error correction

Rank Score . 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15

Identity

712 129/129(100%) P00131 CYC3 DESVH DESULFOVIBRIO VULGARI 543 93/106(88%) P00132 CYC3_DESVM DESULFOVIBRIO VULGARI 266 54/109(50%) P00133 CYC3_DESGI DESULFOVIBRIO GIGAS. 160 37/101(37%) P00135 CYC3_DESSA DESULFOVIBRIO SALEXIG 151 40/113(35%) PO0134 CYC3_DESDE DESULFOVIBRIO DESULFU P38554 CY32_DESDN DESULFOVIBRIO DESULFU 140 31/94(33%) 123 40/126(32%) P00136 CY31_DESDN DESULFOVIBRIO DESULFU 75 32/107(30%) P24092 HMWC_DESVH DESULFOVIBRIO VULGARI P32707 NRFB_ECOLI ESCHERICHIA COLI. 59 19/59(32%) P24735 AMPC_PSEAE PSEUDOMONAS AERUGINOS 58 17/43(40%) P00137 CYC3 DESAC DESULFUROMONAS ACETOX 58 22/80(28%) P32707 NRFB ECOLI ESCHERICHIA COLI. 55 14/37(38%) P11006 MAGA_XENLA XENOPUS LAEVIS (AFRIC 55 12/20(60%) P00120 C553_DESVM DESULFOVIBRIO VULGARI 53 12/24(50%) P23857 PSPE_ECOLI ESCHERICHIA COLI. 52 16/70(23%)

b. 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15

659 127/129(98%) 523 92/107(86%) 251 54/108(50%) 153 25/66(38%) 152 36/81(44%) 143 30/80(38%) 128 40/119(34%) 97 34/109(31%) 89 26/88(30%) 87 28/88(32%) 86 32/99(32%) 85 27/89(30%) 84 23/79(29%) 84 23/76(30%) 82 30/90(33%)

P00131 P00132 P00133 P00134 POO135 P38554 P00136 Q02817 P13728 Q06885 P13730 P28968 P24092 Q05049 P28623

CYC3_DESVH CYC3_DESVM CYC3_DESGI CYC3_DESDE CYC3_DESSA CY32_DESDN CY31_DESDN MUC2_HUMAN SGS3_DROYA GP10_DICDI SGS3_DROER VGLX_HSVEB HMWC_DESVH MUC1_XENLA GUND_CLOCL

DESULFOVIBRIO VULGARI DESULFOVIBRIO VULGARI DESULFOVIBRIO GIGAS.< DESULFOVIBRIO DESULFU DESULFOVIBRIO SALEXIG DESULFOVIBRIO DESULFU DESULFOVIBRIO DESULFU HOMO SAPIENS (HUMAN). DROSOPHILA YAKUBA (FR DICTYOSTELIUM DISCOID DROSOPHILA ERECTA (FR EQUINE HERPESVIRUS TY DESULFOVIBRIO VULGARI XENOPUS LAEVIS (AFRIC CLOSTRIDIUM CELLULOVO

c. 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15

414 111/125(89%) 332 86/106(81%) 154 49/106(46%) 99 34/88(39%) 96 26/65(40%) 90 43/104(41%) 75 31/95(33%) 74 15/37(41%) 69 27/75(36%) 68 i21/69(30%) 67 25/106(24%) 67 21/56(38%) 65 1 7 / 6 0 ( 2 8 % ) 65 10/29(34%) 64 30/97(31%)

P00131 P00132 P00133 POO134 P38554 P00135 P00136 Q03391 P13730 P13728 P35820 P28968 P37202 P38387 P07978

CYC3_DESVH DESULFOVIBRIO VULGARI CYC3_DESVM DESULFOVIBRIO VULGARI CYC3_DESGI DESULFOVIBRIO GIGAS.< CYC3_DESDE DESULFOVIBRIO DESULFU CY32_DESDN DESULFOVIBRIO DESULFU CYC3_DESSA DESULFOVIBRIO SALEXIG CY31_DESDN DESULFOVIBRIO DESULFU NME4_MOUSE MUS MUSCULUS (MOUSE). SGS3_DROER DROSOPHILA ERECTA (FR SGS3_DROYA DROSOPHILA YAKUBA (FR PSC_DROME DROSOPHILA MELANOGASTE VGLX_HSVEB EQUINE HERPESVIRUS TY DIS3_SCHPO SCHIZOSACCHAROMYCES P SECD_MYCLE MYCOBACTERIUM LEPRAE. HSP2_MOUSE MUS MUSCULUS (MOUSE).

Fig. 8. Searching the Desutfovibrio vulgaris (Hildenborough) gene for cytochrome c3 (GenBank accession no. X04304) in the SwissProt database with different indel error rates. Although the scores and the relative positions changed with different error rates, the same family members were recognized consistently in the top 10 of the returned sequence list, (a) Top 15 sequences returned by the standard Smith-Waterman algorithm, (b) Top 15 sequences returned by our algorithm when 1% indels were added, (c) Top 15 sequences returned by our algorithm when 5% indels were added.

family hits the other members of the family, including those distantly related. As an example of this we use the Desulfovibrio vulgaris (Hildenborough) gene for cytochrome c3 (GenBank accession no. X04304) as the query sequence. We first used the standard Smith-Waterman algorithm to find the homologous sequences in SwissProt protein database (version 31; Bairoch and Boeckmann, 1994). These matching sequences share sequence identity ranging from 88% to 30% with the query. The result is shown in Figure 8(a). The new algorithm was tested using several rates of indel in the query sequence. As is shown in Figure 8(c), the algorithm functioned well with a 5% frameshift rate, and still recognized the homologous sequences in the top 10 of the returned sequence list. In

a comparison, we searched the sequence with a 5% frameshift rate in three frames independently using the standard Smith-Waterman algorithm. Only three of the top 10 homologous sequences remained in the top 10 list, and the rest of the top 10 homologous sequences were not even present in the top 100 sequences returned in all three cases (see Figure 9—only the top 15 of the returned sequences for each frame are listed due to space limitations). The frameshift penalty value used in these tests was 10. With respect to the gap penalty function w(k) = -{k+ 10), we have experimented with different frameshift penalty values, and 10 produced the best overall results.

37

X.Guan and E.C.Uberbacher

Rank Score

Idencicy

16/54(30%) 13/45(29%) 24/84(29%) 19/57(33%) 11/31(35%) 22/71 (31%) 15/38(39%)

P00131 POO132 P00133 QO2221 P11214 P30289 P27350 P19637 P21697 P05113 P14222 P36649 P32494 P09252 Q04591

CYC3_DESVH DESULFOVIBRIO VULGARI CYC3_DESVM DESULFOVIBRIO VULGARI CYC3_DESGI DESULFOVIBRIO GIGAS. COXD_HUMAN HOMO SAPIENS -(HUMAN) . UROT_MOUSE MUS MUSCULUS (MOUSE). RNS3_STRAU STREPTOMYCES AUREOFAC AMY_STRTL STREPTOMYCES THERMOVIO UROT_RAT RATTUS NORVEGICUS (RAT) PLAS_SYNY3 SYNECHOCYSTIS SP. (ST IL5_HUMAN HOMO SAPIENS (HUMAN). PERF_HUMAN HOMO SAPIENS (HUMAN) . YACK_ECOLI ESCHERICHIA COLI. ADA3_YEAST SACCHAROMYCES CEREVIS DPOL_VZVD VARICELLA-ZOSTER VIRUS LIMA_PSEAE PSEUDOMONAS AERUGINOS

6 7 8 9 10 11 12 13 14 15

153 135 56 51 51 49 48 48 48 47 47 46 46 46 45

52/126(41%) 38/101(38%) 12/32(38%) 11/35(31%) 20/64(31%) 19/66(29%) 22/75(29%) 11/31(35%) 25/82(30%) 17/60(28%) 10/29(34%) 9/33(27%) 30/122(25%) 23/81(28%) 9/23(39%)

P00131 P00132 P20627 P27982 P15306 P29474 P09019 P15719 P15407 P35463 P10186 P23571 P19438 P10158 P13399

CYC3_DESVH DESULFOVIBRIO VULGARI CYC3_DESVM DESULFOVIBRIO VULGARI NIFB_ANASP ANABAENA SP. (STRAIN TAT_SIVAG SIMIAN IMMUNODEFICIENC TRBM_MOUSE MUS MUSCULUS (MOUSE). NOS3_HUMAN HOMO SAPIENS (HUMAN). HXB5_XENLA XENOPUS LAEVIS (AFRIC MDHC_MAIZE ZEA MAYS (MAIZE). FRA1_HUMAN HOMO SAPIENS (HUMAN). ETBR_PIG SUS SCROFA (PIG). UNG_HSV11 HERPES SIMPLEX VIRUS ( 3MG_RAT RATTUS NORVEGICUS (RAT). TNR1_HUMAN HOMO SAPIENS (HUMAN). FRA1_RAT RATTUS NORVEGICUS (RAT) TA4 EIMTE EIMERIA TENELLA.

c. 1 2 3 4

69 66 65 58

5

57 52 50 49 49

16/32(50%) 28/95(29%) 21/71(30%) 15/46(33%) 26/121(21%) 25/120(21%) 7/18(39%) 18/75(24%) 13/47(28%) 19/70(27%) 22/88(25%) 29/117(25%) 19/71(27%) 27/101(27%) 13/42(31%)

P00131 P00132 P14309 P10169 Q05049 P13730 P25054 Q01238 P08287 P35463 P35710 P38020 P24152 P18183 P13595

CYC3_DESVH DESULFOVIBRIO VULGARI CYC3_DESVM DESULFOVIBRIO VULGARI PHI0_HOLTU HOLOTHURIA TUBULOSA ( CON8_NEUCR NEUROSPORA CRASSA. MUC1_XENLA XENOPUS LAEVIS (AFRIC SGS3_DROER DROSOPHILA ERECTA (FR APC_HUMAN HOMO SAPIENS (HUMAN). HCC2_CRYCO CRYPTHECODINIUM COHNI H11L_CHICK GALLUS GALLUS (CHICKE ETBR_PIG SUS SCROFA (PIG). SOX5_MOUSE MUS MUSCULUS (MOUSE). KARP_CHLTR CHLAMYDIA TRACHOMATIS EXTN_SORVU SORGHUM VULGARE (SORG HRDB_STRCO STREPTOMYCES COELICOL NCA1_MOUSE MUS MUSCULUS (MOUSE).

a. 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15

182 37/55(67%) 83 16/33(48%) 72 15/33(45%) 55 12/23(52%) 55 26/101(26%) 54 26/87(30%) 52 17/41(41%) 52 25/101(25%)

b. 1 2 3

4 5

6 7 8 9 10 11 12 13 14 15

52 51 50 49 48 48 48

49 49 48 48 48 47

Fig. 9. Searching the Desulfovibrio vulgaris (Hildenborough) gene for cytochrome c3 (GenBank accession no. X04304) in three frames independently using the standard Smith-Waterman algorithm. Top 15 sequences returned for frame 1 (a), frame 2 (b) and frame 3 (c).

An issue that often accompanies methods based on the Smith-Waterman algorithm is their efficiency. The basic form of our algorithm is a factor of about seven times slower than Smith-Waterman with the same computational complexity (0(nm)). To speed up the algorithm, we made some small modifications without affecting its performance significantly. When calculating a score matrix for a given frame, instead of considering all three matrix cells (/ — l,j— 1) (/' — 1, j) and (/', j — 1) in other frames' matrices, we only consider the matrix cell ( / - 1, j — 1) (see Figure 10). The reason is that if there is a good partial match around the cell (i— \, j — \), it will be reflected in all three cells (/'- 1, j - 1), ( / - 1, j) and (/, j - 1), with slight differences (a insertion, a deletion or a substitution). So the result of considering just cell (/— 1, j — 1) should not differ significantly from considering all three cells

38

(assume the errors are distributed randomly). This reduces the running time by as much as 30% (to about one-fourth the speed of the Smith-Waterman algorithm). In Figure 11, the result of the modified algorithm is compared with that of the original algorithm. Our algorithm will be most useful in database searches when errors may be present in a query DNA sequence. If there are no errors, the new algorithm works as the standard Smith-Waterman algorithm. If frameshift errors are present in the query sequence, the new algorithm will correct the errors and retrieve the homologous sequences. Furthermore, as the new algorithm corrects the frameshift errors, it lists the positions in the final alignment where it makes the error corrections, which provide clues as to where the real errors are (most of the time, the precise positions of the errors are reported).

Sequence alignment with error correction

frame 2 IH.HI

llH.Dl

100

frame 1

150

200

250

300

350 score

80 40 0

0

31 50

100

150

200

250

300

350 score

150

200

250

300

350

Fig. 10. Schematic for the modified algorithm in which only one matrix cell (/ - 1J - 1) in other frames' matrices is considered in calculating matrix cell (ij) for the present frame.

As an experiment we used the algorithm to locate a known frameshift error in GenBank. We chose a sequence (P.adspersus Tpa2 gene, partial, accession no. X74339) in GenBank (release 89.0) that contains a frameshift error. This sequence (its protein translation) was not in the SwissProt database (version 31; Bairoch and Boeckmann, 1994), and served as a good test case as an unknown query sequence to SwissProt. The top two sequences returned in the search using our algorithm are shown in Figure 12. The two alignments consistently show a frameshift at position 36 in the protein translation (or position 111 in the DNA sequence), the actual frameshift site. Another use of the new algorithm is when a heuristic search method is used for fast identification of homology, and a query sequence hits several segments of a sequence in different frames. The new algorithm can be used to show the overall alignment of the two sequences and report potential frameshift errors. Conclusion

This paper presented an algorithm for detecting and correcting frameshift errors that occur in protein-encoding regions and can corrupt the recognition of important homologies. The algorithm appears to be robust, and can recognize weakly related sequences in the presence of

80 -i

Fig. 11. This figure shows that the simplification of the algorithm had no significant effect on the performance of the algorithm. We repeated the test described in Figure 5. (a) We added 3% substitution errors to the rat trypsin sequence and then compared it to the human neutrophil sequence using the original and modified algorithms. The result of the original algorithm is shown in (al) and the result of the modified algorithm in (a2). (b) The same test was repeated with 3% indel errors added to the rat trypsin sequence. The result of the original algorithm is shown in (bl) and the result of the modified algorithm in (b2).

~ 7% frameshift error and provide an optimal alignment. The algorithm provides a capability which can make single-pass or low-redundancy sequence data more informative, thereby reducing the necessity for highredundancy sequencing for gene and protein characterization purposes. This work complements earlier work designed to recognize frameshifts in coding regions using pattern recognition (Xu et al., 1995), and both methods combined provide a powerful technology which can improve the efficiency and reduce the costs of genomic and cDNA sequencing.

39

X.Cuan and E.C.Uberbacher

Hit 1 >P10978 POLX_TOBAC NICOTIANA TABACUM (CO score: 122 I d e n t i t i e s : 2 7 / 6 8 (40%) Query: 1 ELTEEIYMEQPPRFKDEQRPDLVCKRHRSIYAKKQA-RAWNIKINEVFTQQDFQRSKADP 59 Sbjct:

931 DLEEEIYMEQPEGFEVAGKKHMVCKLNKSLYGLKQAPRQWYMKFDSFMKSQTYLKTYSDP

990

Frame:

931 222222222222222222222222222222222222333333333333333333333333

990

Query:

60 CLYTKKLA 67

Sbjct:

991 CVYFKRFS 998

Frame:

991 33333333 998

Hit 2 >P25600 YCH4_YEAST SACCHAROMYCES s c o r e : 85 I d e n t i t i e s : 2 5 / 7 3 (34%) Query:

CEREVIS

2 LTEEIYMEQPPRFKDEQRPDLVCKRHRSIYAKKQAE-AWNIKINEVFTQQDFQRSKADPC 60

Sbjct:

12 MDEPIYVKQPPGFVNERNPDYWELYGGMYGLKQAPLLWNEHINNTLKKIGFCRHEGEHG

71

Frame:

12 222222222222222222222222222222222222333333333333333333333333

71

Query:

61 LYTKKLARRWIYM 73

Sbjct:

72 LYFRSTSDGPIYI 84

Frame:

72 3333333333333 84

Fig. 12. We selected a sequence (P.adspersus Tpa2 gene, partial, accession no. X74339) from GenBank (release 89) that contains a frameshift error and searched it against the SwissProt protein database. The top two homologous sequences returned from the search both suggested a frameshift in the sequence, around the position where the actual frameshift occurs.

Acknowledgements We would like to thank the manuscript reviewers for their important suggestions. This research was supported by the Office of Health and Environmental Research. US Department of Energy under contract no. DE-AC05-84OR21400 with Martin Marietta Energy System. Inc.

References Altschul.S.F., Gish.W.. Miller.W. Myers.E.W. and Lipman.D.J. (1990) Basic local alignment search tool. J. Mol. Bioi. 215, 403-410. Bairoch.A. and Boeckmann,B. (1994) The SWISSPROT protein sequence data bank: current status. Nucleic Acids Res., 22. 3578-3580. Gotoh.O. (1982) An improved algorithm for matching biological sequences. J. Mol. Biol.. 162, 705-708. Hcnikoff.S. and HenikofT.J. (1993) Performance evaluation ofaminoacid substitution matrices. Proteins, 17, 49-61. Pearson,W.R. and Lipman.D.J. (1988) Improved tools for biological sequence comparison. Proc. Natl. Acad. of Sci. USA, 85, 2444-2448. Posfai.J. and Roberts.R.J. (1992) Finding errors in DNA sequences. Proc. Nail. Acad. of Sci. USA, 89, 4698-4702. Smith,T.F. and Waterman,M. (1981) Comparison of biosequences. Adv. Appl. Math., 2, 482-489. States,D. J. and Botstein.D. (1991) Molecular sequence accuracy and the analysis of protein coding regions. Proc. Natl. Acad. of Sci. USA, 88. 5518-5522. States.D.J.. Gish.W. and Altschul.S. F. (1991) Improved sensitivity of nucleic acid database searches using application-specific matrices. Methods. 3, 66-70. Xu.Y.. Mural,R.J. and Uberbacher.E.C. (1995) Correcting sequencing errors in DNA coding regions using a dynamic programming approach. Comput. Applic. Biosci. 11, 117-124. Received on June 30. 1995; accepted on October 16, 1995

40