Strategies for de novo DNA sequencing

Strategies for de novo DNA sequencing

Anna Blomstergren

Royal Institute of Technology Department of Biotechnology Stockholm 2003

© Anna Blomstergren Department of Biotechnology Royal Institute of Technology Alba Nova University Center SE-106 91 Stockholm Sweden Printed at Universitetsservice US AB Box 700 14 SE-100 44 Stockholm Sweden ISBN 91-7283-608-3

Anna Blomstergren 2003. Strategies for de novo DNA sequencing. Department of Biotechnology, Albanova University Center, Royal Institute of Technology, Stockholm, Sweden. ISBN 91-7283-608-3

Abstract The development of improved sequencing technologies has enabled the field of genomics to evolve. Handling and sequencing of large numbers of samples require an increased level of automation in order to obtain high throughput and consistent quality. Improved performance has led to the sequencing of numerous microbial genomes and a few genomes from higher eukaryotes, and the benefits of comparing sequences both within and between species are now becoming apparent. This thesis describes both the development of automated purification methods for DNA, mainly sequencing products, and a comparative sequencing project. The initially developed purification technique is dedicated to single-stranded DNA containing vector-specific sequences, exemplified by sequencing products. Specific capture probes coupled to paramagnetic beads together with stabilizing modular probes hybridize to the single-stranded target. After washing, the purified DNA can be released using water. When sequencing products are purified they can be directly loaded onto a capillary sequencer after elution. Since this approach is specific it can be applied to multiplex sequencing products. Different probe sets are used for each sequencing product and the purifications are performed iteratively. The second purification approach, which can be applied to a number of different targets, involves biotinylated PCR products or sequencing products that are captured using streptavidin beads. This has been described previously, but here the interaction between streptavidin and biotin can be disrupted without denaturing the streptavidin, enabling the re-use of the beads. The relatively mild elution conditions also enable the release of sensitive biotinylated molecules. Another project described in this thesis is the comparative sequencing of the 40 kb cag pathogenicity island (PAI) in four Helicobacter pylori strains. 
The results included the discovery of a novel gene, present in approximately half of the Swedish strains tested. In addition, one of the strains contained a major rearrangement dividing the cag PAI into two parts. Further, information about the variability of different genes could be obtained. © Anna Blomstergren, 2003 Keywords: DNA sequencing, DNA purification, automation, solid-phase, streptavidin, biotin, modular probes, Helicobacter pylori, cag PAI.

What’s past is prologue... William Shakespeare

LIST OF PUBLICATIONS

This thesis is based on the following manuscripts, which in the text will be referred to by their roman numerals:

I. Anna Blomstergren, Deirdre O’Meara, Morten Lukacs, Mathias Uhlén and Joakim Lundeberg (2000), Cooperative oligonucleotides in purification of cycle sequencing products, Biotechniques 29(2), 352-363.

II. Anna Blomstergren, Anders Holmberg, Morten Lukacs and Joakim Lundeberg (2003), Automated purification of multiplex cycle sequencing products suitable for capillary electrophoresis, submitted.

III. Anders Holmberg, Anna Blomstergren, Morten Lukacs, Joakim Lundeberg and Mathias Uhlén (2003), Reversible biotin-streptavidin interaction with release using non-ionic aqueous solutions at elevated temperatures, submitted.

IV. Anna Blomstergren, Annelie Lundin, Christina Nilsson, Lars Engstrand, Joakim Lundeberg (2003), Comparative analysis of the complete cag pathogenicity island sequence in four Helicobacter pylori isolates, Gene, in press.

TABLE OF CONTENTS

INTRODUCTION
1 Historical background
2 Structure and properties of DNA
3 Sequencing methods
  3.1 Sanger sequencing
  3.2 Maxam and Gilbert
  3.3 Pyrosequencing
  3.4 Single molecule sequencing
  3.5 Sequencing by hybridization (SBH)
4 Major strategies for genome and transcript sequencing
  4.1 Sequencing of genomic DNA
    4.1.1 Complete genome sequencing
      4.1.1.1 Clone-by-clone shotgun sequencing
      4.1.1.2 Whole-genome shotgun sequencing
      4.1.1.3 Creating shotgun libraries
      4.1.1.4 Gap closure
    4.1.2 Sequencing specific regions of the genome
      4.1.2.1 Primer walking
      4.1.2.2 Directed PCR amplification
  4.2 Sequencing of transcripts (cDNA)
5 Sequencing technologies
  5.1 Amplification of templates
    5.1.1 Cultivation of plasmids and M13
    5.1.2 PCR amplification
    5.1.3 Rolling circle amplification
  5.2 Generation of Sanger fragments
  5.3 Purification of sequencing products
    5.3.1 Precipitation techniques
    5.3.2 Filtration methods
    5.3.3 Magnetic bead techniques
      5.3.3.1 Hybridization based techniques
      5.3.3.2 Streptavidin-biotin
      5.3.3.3 Unspecific capture of DNA
      5.3.3.4 Capture of dideoxynucleotides
  5.4 Separation and detection
    5.4.1 Electrophoresis
    5.4.2 Mass spectrometry
  5.5 Automation
6 Data analysis
  6.1 Quality assessment
  6.2 Assembly
  6.3 Comparing sequences
  6.4 Annotation

PRESENT INVESTIGATIONS
7 Solid-phase purification of DNA
  7.1 Hybridization based technique for the purification of cycle sequencing products
  7.2 Purification using the biotin-streptavidin system
  7.3 Comparison of the two assays
8 Comparative sequencing of H. pylori
  8.1 Helicobacter pylori
  8.2 The cag pathogenicity island
  8.3 Strategy for comparing the cag PAI in four clinical isolates of H. pylori
  8.4 Nucleotide and amino acid sequence variation
  8.5 Major rearrangements
9 Concluding remarks
10 Acknowledgments
11 References
Original Papers I-IV


INTRODUCTION

It is now 50 years since James Watson and Francis Crick proposed the double helical structure of DNA. This paved the way not only for understanding the relationships between DNA and proteins, but also for the development of powerful tools for studying genes and genomes. It is now possible to sequence a large number of DNA samples at relatively low cost, which has enabled the sequencing of the human genome (to 98% completion), but also the genomes of mouse, fruit fly, rice and a number of microbial species. Soon the genome sequences of the rat and our close relative the chimpanzee will be described and several hundred other genome sequencing projects are ongoing. Comparing the genomic sequences of different species gives important information through the identification of conserved regions between distantly related species, as well as through the differences between closely related species. The mouse genome, for example, has been widely used to discover regulatory and coding regions that are conserved between humans and mice and comparison of human and chimpanzee genomes may give important clues about genes involved in complex abilities like speech.


1 Historical background

In 1943 Oswald Avery and coworkers discovered that transfer of DNA between different strains of Pneumococcus could convey the ability to produce capsules (Avery et al. 1944). Since the trait was inherited by following generations of bacteria, the conclusion was that DNA could be the molecule encoding genes. This was first met with skepticism since DNA was regarded as being too simple to convey all the genetic information of an organism, but interest in DNA had been triggered. Ten years later James Watson and Francis Crick published a landmark article in which they proposed a double helical structure for DNA. They also described the complementary nature of the two strands of the helix (Watson and Crick 1953a; Watson and Crick 1953b). Two articles (Franklin and Gosling 1953; Wilkins et al. 1953), in the same issue of Nature, presenting x-ray photographs of DNA supported and to some extent provided the foundations for Watson and Crick’s conclusions. Francis Crick and coworkers went on to describe the flow of information in a cell, known as the central dogma, which includes the transcription of DNA to RNA and the translation of RNA to protein (Crick 1958). A few years later they also showed that the genetic code, used for translating the RNA sequence to amino acids, is based on triplets of bases (Crick et al. 1961), and shortly thereafter the genetic code was determined independently by the groups of Har Gobind Khorana and Marshall Nirenberg (Khorana 1965; Nirenberg et al. 1965).

Principles for sequencing and amplifying DNA laid the foundations for the field of genomics in the 1970s and 1980s. Two independent methods for sequencing DNA were developed in 1977, one by Fred Sanger and coworkers (Sanger et al. 1977) and the other by Allan Maxam and Walter Gilbert (Maxam and Gilbert 1977). The other key achievement in the field of molecular biology was the invention of an amplification method for DNA, called the polymerase chain reaction (PCR), by Kary Mullis in the mid 1980s (Mullis et al. 1986). All these accomplishments led to the first complete sequencing of the genome of a free-living organism, Haemophilus influenzae (Fleischmann et al. 1995). The number of completed genomes has grown rapidly over the years and is continuing to increase. Helicobacter pylori became the first organism of which two unrelated strains from the same species were completely sequenced (Tomb et al. 1997; Alm et al. 1999).

In 1990 the large project of sequencing the 3 billion bases of the human genome was initiated through an international collaboration called the Human Genome Project (HGP). The plan was to complete the genome sequence in 15 years, starting with the construction of high-resolution maps and then moving on to DNA sequencing. In 1998 Craig Venter proposed that a whole genome shotgun approach, which had first proved successful with microbial genomes, should be used for sequencing the complete human genome (Venter et al. 1998) and this was done by Celera Genomics Corporation in a parallel effort to the HGP. In 2001 both groups published draft versions of the human genome sequence, in Nature and Science, respectively (Lander et al. 2001; Venter et al. 2001).

2 Structure and properties of DNA

DNA is the carrier of genetic information in most organisms. The DNA is located in the nucleus, and to some extent in the mitochondria and chloroplasts, of eukaryotic cells and in the cytoplasm of prokaryotic cells. The central dogma describes the flow of information in a cell. First DNA is transcribed into messenger RNA (mRNA), which is then translated into protein. If the cell divides into two daughter cells the entire genome needs to be replicated in order for each cell to have a copy. All of these processes have been studied extensively and a number of different enzymes are involved (Lodish et al. 1997).

Nucleic acids are linear polymers of nucleotides. For DNA (deoxyribonucleic acid) the nucleotides consist of three components: a phosphate, a deoxyribose (a five carbon sugar) and an organic base (figure 1A). The bases can be adenine (A), cytosine (C), guanine (G) or thymine (T) and are either purines (A and G) or pyrimidines (C and T). In RNA (ribonucleic acid) deoxyribose is replaced by ribose and thymine is replaced by uracil (U). DNA in its native form consists of two antiparallel strands, with phosphate and deoxyribose units providing the backbones, which form a double helix (figure 1B). The two strands are held together by van der Waals forces and a large number of hydrogen bonds between the bases. In order to maintain a constant distance between the two strands a purine must form hydrogen bonds with a pyrimidine, which is possible for the combinations of A with T and G with C (figure 1C). This means that if one strand harbors a certain sequence, the other strand must have the complementary sequence for the double helix to form. If DNA is exposed to heat or high pH the two strands will separate, i.e. the DNA will be denatured, but if favorable conditions are restored the two strands will spontaneously re-hybridize to each other.
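The base-pairing rule described above (A with T, G with C, on antiparallel strands) can be illustrated with a short sketch. The function name and the restriction to the four unmodified bases are assumptions made for illustration only.

```python
# Watson-Crick pairing as stated in the text: A pairs with T, G pairs with C.
COMPLEMENT = {"A": "T", "T": "A", "G": "C", "C": "G"}


def reverse_complement(strand: str) -> str:
    """Return the complementary strand, written in the 5'->3' direction.

    The strands are antiparallel, so the complement is read in reverse.
    Only the four unmodified bases are handled (an assumption of this sketch).
    """
    return "".join(COMPLEMENT[base] for base in reversed(strand.upper()))


if __name__ == "__main__":
    print(reverse_complement("ATGC"))
```

Applying the function twice returns the original sequence, which mirrors the statement that each strand fully determines the other.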


Figure 1. The structure of DNA. A) Chemical composition of the sugar-phosphate backbone. B) Structure of the double helix. C) Hydrogen bonds between the organic bases.


3 Sequencing methods

A number of methods for determining the order of nucleotide bases in a DNA molecule have been developed since Fred Sanger and Alan Coulson presented the first technique in 1975 (Sanger and Coulson 1975). Here the most common methods for de novo DNA sequencing are briefly described. By far the most common sequencing approach is Sanger sequencing, and this thesis will focus on this methodology.

3.1 Sanger sequencing

As previously mentioned, a method for sequencing DNA, called the “plus and minus” method, was described by Fred Sanger and Alan Coulson in 1975 (Sanger and Coulson 1975). Two years later Sanger and his coworkers described a new, more efficient method (Sanger et al. 1977), which has been fundamental to the field of DNA sequencing. This method became known as the chain termination method, the dideoxynucleotide method or, simply, Sanger sequencing. In the initial setup (figure 2) a 32P-labeled primer was annealed to the DNA template. This primer acted as a starting point for DNA polymerase to synthesize the complementary strand. The extension continued until the polymerase incorporated a modified nucleotide, a dideoxynucleotide or terminator, which had been added to the reaction mixture. Since dideoxynucleotides lack a hydroxyl group at the 3’ position of the sugar, no further nucleotides could be added and the DNA chain extension terminated. By performing four reactions, each with a specific terminator and thus ending at a position corresponding to that specific base, four different sets of DNA fragments could be obtained. These fragments were then separated by electrophoresis through a polyacrylamide gel and detected by autoradiography.

Figure 2. Basic principle of Sanger sequencing using dye primer chemistry.

Several modifications and improvements have been made to the initial technique. The 32P-label has been replaced by various fluorophores (Ansorge et al. 1986; Smith et al. 1986; Ansorge et al. 1987; Swerdlow and Gesteland 1990; Karger et al. 1991), enabling the use of only one gel lane per sample and thus increasing throughput. The introduction of energy transfer (ET) dyes further improved the performance (Ju et al. 1995a; Ju et al. 1995b; Metzker et al. 1996). In addition the fluorescent labels can now be placed on the terminator instead of on the primer (Rosenthal and Charnock-Jones 1992; Kumar et al. 1999), allowing the sequencing reaction to be performed in one tube. Automated sequencers have been developed that simplify the separation and detection, as well as a variety of software packages for analyzing the obtained sequence data. An example of the sequence output obtained using terminators labeled with four different dyes can be seen in figure 3.
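The chain termination principle can be sketched in a few lines of code. This is an idealized illustration, not a model of real chemistry: it assumes that every position of the newly synthesized strand yields a detectable terminated fragment, and the function names are hypothetical.

```python
def sanger_fragments(synthesized: str) -> dict:
    """List the fragment lengths produced in each of the four terminator reactions.

    `synthesized` is the strand built by the polymerase from the primer.
    In the reaction containing terminator X, chains end wherever X is incorporated,
    so the fragment lengths equal the positions of X (idealized: all positions seen).
    """
    lanes = {"A": [], "C": [], "G": [], "T": []}
    for position, base in enumerate(synthesized, start=1):
        lanes[base].append(position)
    return lanes


def read_ladder(lanes: dict) -> str:
    """Read the sequence from the gel: shortest fragment first, one base per length."""
    base_at_length = {
        length: base for base, lengths in lanes.items() for length in lengths
    }
    return "".join(base_at_length[length] for length in sorted(base_at_length))


if __name__ == "__main__":
    lanes = sanger_fragments("GATTACA")
    print(lanes)
    print(read_ladder(lanes))
```

Sorting the fragments by length corresponds to the electrophoretic separation; the lane (or, with four dyes, the colour) in which each length appears identifies the base.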

Figure 3. Partial sequence electropherogram obtained from a dye terminator cycle sequencing reaction sequenced on a capillary electrophoresis unit.

After the introduction of thermostable polymerases a new technique for generating Sanger fragments, called cycle sequencing, was introduced (Innis et al. 1988; Carothers et al. 1989; Murray 1989; Manoni et al. 1992; Wang et al. 1992). Cycle sequencing has the advantages of needing less template and being able to start directly from double-stranded templates. The basic principle is the same as for classical Sanger sequencing, but a temperature profile is used. After adding all the reagents, the mixture is heated in order to denature the DNA template. The temperature is then lowered enough to allow annealing of the primer and extension by the polymerase. When the temperature is again raised, the created Sanger fragments will be denatured from the template, which is then available for annealing of a new primer. By alternating between the two temperatures, large numbers of sequencing fragments can be obtained, although the reaction is linear rather than exponential as in PCR.
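The difference between linear and exponential amplification can be made concrete with a small calculation. The sketch below assumes an idealized per-cycle efficiency (a hypothetical parameter, set to 1.0 by default) and a cycle number of 25 chosen purely for illustration.

```python
def cycle_sequencing_products(template_copies: float, cycles: int,
                              efficiency: float = 1.0) -> float:
    """Linear amplification: only the original template is copied in each cycle."""
    return template_copies * cycles * efficiency


def pcr_products(template_copies: float, cycles: int,
                 efficiency: float = 1.0) -> float:
    """Exponential amplification: products of earlier cycles also act as templates."""
    return template_copies * (1 + efficiency) ** cycles


if __name__ == "__main__":
    cycles = 25  # illustrative value, not taken from the text
    print(cycle_sequencing_products(1, cycles))
    print(pcr_products(1, cycles))
```

Because a single primer is used, the terminated fragments cannot themselves serve as templates, which is why the yield grows only in proportion to the number of cycles.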


3.2 Maxam and Gilbert

In 1977 a chemical method for sequencing DNA was presented by Allan Maxam and Walter Gilbert (Maxam and Gilbert 1977). They exposed a 32P-labeled DNA molecule to reagents that first damaged and then removed a base from its sugar (figure 4). The backbone of the DNA molecule was weakened at these positions and could therefore easily be broken. The removal of bases was limited to one residue for every 50 to 100 bases, while the cleavage of the backbone was performed to completion. Four different reactions, affecting different bases, were performed and the resulting fragments were separated and analyzed on polyacrylamide gels. The first reaction affected the DNA at G and A, but the reaction was 5-fold faster for G, resulting in dark bands for G and weak bands for A. In the second reaction the opposite was achieved, and dark bands were obtained for A while G gave weaker bands. The third reaction cleaved the DNA with similar efficiency at both C and T, while the fourth only affected C. These four reactions in combination provided more than enough information to elucidate the DNA sequence from the autoradiograph of the gel. If both strands are sequenced this technique can also detect 5-methylcytosine, since it will produce a gap in the sequence of one strand while the other strand will have a G. When PCR was introduced the 32P-labels could be exchanged for fluorophores attached to primers, generating labeled PCR products suitable for use as templates for the sequencing reaction.
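How the four reactions combine into a base call can be sketched as a simple decision rule. The encoding of band intensities as "dark", "weak" or "none" and the function names are assumptions made for this illustration; real gels require judgment that the sketch ignores.

```python
def call_base(g_over_a: str, a_over_g: str, c_plus_t: bool, c_only: bool) -> str:
    """Call one base from the band pattern observed at a single fragment length.

    g_over_a, a_over_g: "dark", "weak" or "none" in the two purine reactions.
    c_plus_t, c_only:   True if a band is present in the pyrimidine reactions.
    """
    if c_plus_t and c_only:
        return "C"  # cleaved in both pyrimidine reactions
    if c_plus_t:
        return "T"  # cleaved in C+T but not in the C-only reaction
    if g_over_a == "dark" and a_over_g == "weak":
        return "G"
    if a_over_g == "dark" and g_over_a == "weak":
        return "A"
    raise ValueError("band pattern does not match any base")


def read_gel(bands: list) -> str:
    """Read a sequence from band patterns ordered from shortest to longest fragment."""
    return "".join(call_base(*pattern) for pattern in bands)


if __name__ == "__main__":
    gel = [
        ("dark", "weak", False, False),
        ("none", "none", True, True),
        ("weak", "dark", False, False),
        ("none", "none", True, False),
    ]
    print(read_gel(gel))
```

The purine calls check both lanes, which reflects the redundancy noted above: the four reactions give more information than is strictly needed.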

Figure 4. Schematic illustration of sequencing with the Maxam Gilbert method.

3.3 Pyrosequencing

In 1987 Pål Nyrén described how DNA polymerase activity could be monitored by bioluminescence (Nyrén 1987), and soon thereafter a DNA sequencing method based on this system was presented (Hyman 1988). This sequencing method was cumbersome in practice, requiring several passes of samples through six sequential columns with immobilized enzymes. In 1998 the pyrosequencing method was described (Ronaghi et al. 1998). In pyrosequencing a primer is annealed to a single-stranded DNA template followed by sequential addition of one nucleotide at a time. When the correct nucleotide for the position adjacent to the 3’ end of the primer is added it will be incorporated by DNA polymerase and pyrophosphate (PPi) will be released (figure 5A). ATP sulphurylase will then convert PPi into ATP and finally light will be emitted by firefly luciferase (figure 5B and C). The amount of light emitted is proportional to the number of nucleotides incorporated into the growing chain for small numbers of incorporations, and it can be detected with a suitable instrument, for example a CCD camera. Apyrase is included in the reaction in order to degrade unreacted dNTPs and the ATP produced before the next nucleotide is added (figure 5D). This also has the advantage of reducing background signals from mismatch incorporations since the apyrase will compete with the polymerase for the nucleotides and incorrect nucleotides are incorporated more slowly than the correct nucleotide. Substitution of dATP with α-thio dATP is necessary to avoid background noise since dATP is a substrate for luciferase (Ronaghi et al. 1996). In order to increase the efficiency of the sequencing, single-stranded DNA-binding proteins (SSB) have been added to the reaction (Ronaghi 2000; Ehn et al. 2002). An example of a pyrosequencing result is shown in figure 6. Since pyrosequencing produces rather short reads of approximately 50 bp (Agaton et al. 2002; Ronaghi and Elahi 2002) it is most suited for applications like tag sequencing or single nucleotide polymorphism detection. Advantages of pyrosequencing include the facts that no labels or electrophoresis are needed and detection is performed in real time.

Figure 5. Enzymatic reactions involved in pyrosequencing.
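The sequential nucleotide addition in pyrosequencing can be sketched as a simulation. It assumes an ideal signal that is exactly proportional to the number of incorporations, a cyclic dispensation order of A, C, G, T, and complete degradation of unused nucleotides between additions; the function names are hypothetical.

```python
from itertools import cycle


def pyrogram(synthesized: str, dispensation: str = "ACGT",
             max_dispensations: int = 1000) -> list:
    """Return (nucleotide, signal) pairs for each nucleotide addition.

    `synthesized` is the strand the polymerase builds from the primer.
    The signal is the number of bases incorporated in that addition
    (0 for a wrong nucleotide, n for a homopolymer run of length n).
    """
    signals = []
    position = 0
    for count, nucleotide in enumerate(cycle(dispensation)):
        if position >= len(synthesized) or count >= max_dispensations:
            break
        incorporated = 0
        while position < len(synthesized) and synthesized[position] == nucleotide:
            incorporated += 1
            position += 1
        signals.append((nucleotide, incorporated))
    return signals


def decode(signals: list) -> str:
    """Rebuild the sequence from the pyrogram peaks."""
    return "".join(nucleotide * height for nucleotide, height in signals)


if __name__ == "__main__":
    peaks = pyrogram("GGAT")
    print(peaks)
    print(decode(peaks))
```

In practice the proportionality holds only for small numbers of incorporations, as stated above, so long homopolymer runs are harder to resolve than this sketch suggests.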


Figure 6. Tag sequencing using pyrosequencing. A poly(A)-tail can be seen at the end.

3.4 Single molecule sequencing

The possibility of single molecule sequencing was proposed by Jett et al. in 1989 (Jett et al. 1989) and several groups are working towards a functional system (Sauer et al. 2001; Stephan et al. 2001; Werner et al. 2003). A DNA molecule in which all nucleotides are labeled with base-specific fluorescent labels must first be synthesized. This DNA fragment is then attached to a microsphere and introduced into a flow channel where it is immobilized. Addition of an exonuclease will start the sequential degradation of the DNA fragment from the 3’ end, releasing one nucleotide at a time. The flowing buffer carries the released nucleotides to a detector, where the fluorescence is measured. This sequencing method will significantly increase both speed (with a rate of 100-1000 bases per second) and read-length if all the obstacles can be overcome. The read-length is limited by the stability of the labeled DNA fragment and the processivity of the exonuclease. Several kb could probably be sequenced in a single reaction, which is significantly more than with conventional sequencing methods. Additional problems associated with single molecule sequencing are that extremely sensitive detection methods and labels with large fluorescence quantum yields are needed, since the fluorescence of single nucleotides must be measured. All buffers and reagents need to be extensively purified since the method is very sensitive to fluorescent contaminants.


Several other approaches to single molecule sequencing have recently been proposed. The majority of these methods are based on the detection of incorporation of labeled nucleotides rather than the degradation of DNA. Single molecule sequencing is attractive for de novo sequencing of DNA but to date a successful sequencing experiment has still not been performed. Recently the J. Craig Venter Science Foundation announced a $500,000 Technology Prize, which will be awarded for advances allowing the human genome to be sequenced for $1,000 or less. To date, the various single-molecule sequencing approaches are probably the most promising candidates for this prize.

3.5 Sequencing by hybridization (SBH)

This DNA sequencing method is based on annealing a labeled unknown DNA fragment to a large number of short oligonucleotides, usually between 5 and 25 nt long. The sequence can then be deciphered from the hybridization pattern (Drmanac et al. 1993; Drmanac and Drmanac 2001; Drmanac et al. 2002). In most cases either the probes or a number of targets are attached to a solid phase as a DNA array. Initially, the complete set of probes of a certain length, for example all 65,536 combinations of 8-mers, was used, but selected sets of fewer oligonucleotides, for example all non-complementary probes, have also been applied. SBH has the potential to sequence longer DNA fragments than conventional methods and it can easily be miniaturized. The problems that still need to be resolved for SBH include variations in hybridization stability between different probes, false positive signals from probes with one-base mismatches and ambiguous reads when repetitive sequences are present in the target sequence (Marziali and Akeson 2001). These problems are of less importance when SBH is used for comparative sequencing or mutation analysis, but they are troublesome for de novo sequencing.
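The size of a complete probe set and the ambiguity caused by repeats can both be checked with a short sketch. It assumes ideal hybridization (a probe gives a signal if and only if it matches the target perfectly) and uses hypothetical function names.

```python
def probe_count(k: int) -> int:
    """Number of distinct probes of length k over the four bases A, C, G and T."""
    return 4 ** k


def spectrum(target: str, k: int) -> set:
    """Set of k-mers in the target, i.e. the probes that would give a signal."""
    return {target[i:i + k] for i in range(len(target) - k + 1)}


if __name__ == "__main__":
    print(probe_count(8))
    # Two targets that differ only in the number of repeat units
    # light up exactly the same probes:
    print(spectrum("ACGACGA", 3) == spectrum("ACGACGACGA", 3))
```

The second comparison shows why repetitive sequences are troublesome for de novo sequencing: a presence/absence pattern alone cannot tell how many copies of the repeat the target contains.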


4 Major strategies for genome and transcript sequencing

4.1 Sequencing of genomic DNA

In whole genome sequencing the aim is either to produce a complete, continuous sequence of high quality or a fragmented draft version of the genome. Although the draft version can be produced faster and at lower cost than the complete sequence, only the latter can be reliably used in different analyses, since only with a complete sequence can it be concluded, for example, that a gene not found in the sequence is truly missing. If a certain region of the genome is of special interest it can be sequenced separately. For example, a gene known to be associated with a specific disease can be sequenced from a number of individuals with differing symptoms or disease outcomes to establish the connection between genotype and phenotype.

4.1.1 Complete genome sequencing

Two major strategies (clone-by-clone and whole-genome shotgun approaches) have been used for whole genome sequencing. The most suitable method depends on the organism to be sequenced. For relatively small and non-repetitive genomes, the whole genome shotgun method is advantageous, since mapping and construction of large insert clones are avoided. More complex genomes, like the human genome, are difficult to sequence using this method, because the high level of repetitive sequences causes difficulties in the assembly of the genome. Combinations of the two methods, hybrid strategies, might be more successful when sequencing complex genomes.


4.1.1.1 Clone-by-clone shotgun sequencing

The public effort to sequence the human genome was performed using a clone-by-clone strategy (Lander et al. 2001). Initially, a three-stage divide and conquer approach was adopted (figure 7A), in which three different clone libraries were constructed (National Research Council 1988; Venter et al. 1996). First, a library of yeast artificial chromosome (YAC) clones was created, containing DNA fragments of approximately 1 Mb. This library was used to generate a low-resolution map of the genome (or chromosome) by identifying shared landmarks on overlapping clones. These landmarks included sequence-tagged sites (STSs; sites that can be uniquely amplified by PCR), or restriction fragment sites. Second, the inserts from suitable YAC clones covering the genome were fragmented into 40 kb pieces and subcloned into cosmid vectors. A high-resolution map was then constructed by identifying overlapping landmarks in the cosmid clones. This sequence-ready map could be used to select cosmid clones that form a minimal overlapping set, known as a tiling path. Third, the cosmid clones in the tiling path were further randomly fragmented and subcloned into M13 or plasmid vectors, carrying inserts of 1-10 kb. Finally, enough of these clones to cover the cosmid insert with an eight to ten-fold redundancy were sequenced, and computational assembly of the obtained sequence was performed. The random, or shotgun, sequencing of the cosmids ensured high accuracy, due to its redundancy.

This approach, however, was subject to a number of problems. To obtain even a low-resolution map covering the complete genome or a complete chromosome proved to be very difficult. Another problem was the instability of a high proportion of the YAC clones: almost 50% showed structural instability resulting in deletions or rearrangements. Cosmid clones also showed these instability problems to some extent. This approach could be simplified due to two scientific advances: the increase in computational power, which made shotgun sequencing of fragments significantly larger than a cosmid possible, and the development of a new vector, the bacterial artificial chromosome (BAC). The BAC could harbor an insert of up to 350 kb and was far more stable than the previously used YACs and cosmids. Using this new vector to replace both YACs and cosmids converted the three-stage strategy to a two-stage strategy (figure 7B), which was applied to a large portion of the human genome (Green 2001; Lander et al. 2001).
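Selecting a tiling path from a sequence-ready map can be illustrated with a standard greedy interval-covering procedure. This is a sketch under stated assumptions, not the procedure used by the sequencing centers: clone positions are taken as already known from the map, and the clone names, coordinates (in kb) and function name are hypothetical.

```python
def tiling_path(clones: list, region_start: int, region_end: int) -> list:
    """Pick a minimal set of overlapping clones covering [region_start, region_end].

    `clones` is a list of (name, start, end) tuples in map coordinates.
    Greedy rule: among the clones starting at or before the covered position,
    always take the one that reaches furthest.
    """
    path = []
    covered_to = region_start
    remaining = sorted(clones, key=lambda clone: clone[1])
    index = 0
    while covered_to < region_end:
        best = None
        while index < len(remaining) and remaining[index][1] <= covered_to:
            if best is None or remaining[index][2] > best[2]:
                best = remaining[index]
            index += 1
        if best is None or best[2] <= covered_to:
            raise ValueError(f"gap in the map at position {covered_to}")
        path.append(best[0])
        covered_to = best[2]
    return path


if __name__ == "__main__":
    mapped = [("c1", 0, 40), ("c2", 10, 50), ("c3", 35, 75),
              ("c4", 45, 80), ("c5", 70, 110)]
    print(tiling_path(mapped, 0, 110))
```

Clones that are skipped (c2 and c4 in the example) are redundant for coverage; leaving them out is what reduces the amount of shotgun sequencing needed.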


Figure 7. The clone-by-clone approach for genome sequencing using A) a three-stage and B) a two-stage strategy.

To circumvent some of the problems associated with the approaches described above, Craig Venter and coworkers proposed a new strategy in 1996 (Venter et al. 1996). In this simplified approach to sequencing large genomes, a library of BACs, with 15-fold coverage and containing inserts of 150 kb, is first created. Approximately 500 bases are then sequenced from each end of the BAC inserts. The sequences obtained from the BAC ends are called sequence-tagged connectors (STCs) and are scattered throughout the genome, spaced approximately 5 kb apart (if 300,000 BACs are used for the human genome). The BACs are also fingerprinted using one restriction enzyme in order to detect unreliable clones, containing for example deletions or chimeras. Finally, one or a few seed BACs, chosen as starting points, can be sequenced with the same shotgun approach as previously described for the cosmid clones. When the sequences of the BAC inserts are obtained they can be compared to the STCs of the other BACs in the library, theoretically identifying approximately 30 overlapping BAC clones for each seed BAC. Two clones showing minimal overlap at each end of the seed BAC can then be chosen for further sequencing. This approach significantly reduces the need for extensive mapping, allowing sequence generation to be started earlier.
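The figures quoted for the STC strategy follow from simple arithmetic. The sketch below is illustrative only: it assumes a human genome size of 3 Gb (a round value not stated in the text, chosen because it is consistent with the 300,000 BACs mentioned), and the function and variable names are hypothetical.

```python
def stc_statistics(genome_bp, insert_bp, coverage):
    """Return (number of BACs, mean STC spacing in bp, expected STC hits per seed BAC)."""
    n_bacs = coverage * genome_bp / insert_bp   # clones needed for the requested coverage
    n_stcs = 2 * n_bacs                         # one STC from each end of every insert
    spacing_bp = genome_bp / n_stcs             # mean distance between neighbouring STCs
    hits_per_seed = insert_bp / spacing_bp      # STCs expected within one sequenced seed BAC
    return n_bacs, spacing_bp, hits_per_seed


# Assumed inputs: 3 Gb genome (assumption), 150 kb inserts and 15-fold coverage (from the text).
n_bacs, spacing_bp, hits = stc_statistics(genome_bp=3_000_000_000,
                                          insert_bp=150_000,
                                          coverage=15)
print(n_bacs, spacing_bp, hits)
```

Under these assumptions the calculation gives 300,000 BACs, one STC every 5 kb and about 30 STC matches per seed BAC, in agreement with the numbers given above.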

4.1.1.2 Whole-genome shotgun sequencing

Instead of first cloning large fragments of the genome into vectors like YACs or BACs, whole genome shotgun sequencing directly fragments the entire genome into pieces suitable for plasmid vectors (figure 8). Sequencing a large number of random plasmid inserts, a few kb in size, in both directions then yields a highly redundant set of sequence reads, each approximately 500 bases long. Assembly of the sequence reads is done computationally and typically results in a number of contigs separated by gaps, which need to be closed by directed strategies. This strategy has proven to be effective for the sequencing of microbial genomes (Fleischmann et al. 1995; Fraser and Fleischmann 1997), but its value for complex genomes, e.g. the human genome has been debated (Venter et al. 1998; Butcher 2001; Green 2001; Lander et al. 2001; Venter et al. 2001; Green 2002; Myers et al. 2002; Waterston et al. 2002). In the whole genome shotgun approach for sequencing the human genome, described by Craig Venter and performed by Celera (Venter et al. 1998; Venter et al. 2001), the genome is fragmented into three different libraries of varying insert sizes. Most of the sequencing templates originate from a plasmid library containing 2 kb inserts, while fewer templates from a low-copy-number plasmid library containing 10 or 50 kb inserts are used for medium-range linking. The obtained sequence reads are then assembled using a complex algorithm capable of handling the approximately 70 million reads. This assembly produces a number of contigs, which are then ordered into scaffolds based on the presence of read pairs. In order to be able to assemble the genome properly it is important to obtain these read pairs by sequencing the plasmid inserts in both directions. The read pairs provide valuable information since they are physically connected and the distance between connected reads is known, enabling them to be used for confirmation of an assembly and for ordering contigs. 
If two sequences from the same clone are located in different contigs, this clone can be used for primer walking in order to close the gap.
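How read pairs connect contigs can be illustrated with a small sketch. It is a simplified illustration, not the assembly algorithm used by Celera: it only counts read pairs whose two reads were placed in different contigs, and ignores read orientation and insert size, which a real scaffolder would also check. All read and contig names are hypothetical.

```python
from collections import Counter


def count_contig_links(read_to_contig, read_pairs):
    """Count read pairs whose forward and reverse reads lie in different contigs."""
    links = Counter()
    for fwd, rev in read_pairs:
        c1 = read_to_contig.get(fwd)
        c2 = read_to_contig.get(rev)
        if c1 is None or c2 is None or c1 == c2:
            continue  # unplaced read, or both reads inside the same contig
        links[tuple(sorted((c1, c2)))] += 1
    return links


# Hypothetical placements of forward (.f) and reverse (.r) reads from four clones.
read_to_contig = {
    "cloneA.f": "contig1", "cloneA.r": "contig2",
    "cloneB.f": "contig1", "cloneB.r": "contig2",
    "cloneC.f": "contig2", "cloneC.r": "contig2",
    "cloneD.f": "contig2", "cloneD.r": "contig3",
}
read_pairs = [("cloneA.f", "cloneA.r"), ("cloneB.f", "cloneB.r"),
              ("cloneC.f", "cloneC.r"), ("cloneD.f", "cloneD.r")]

links = count_contig_links(read_to_contig, read_pairs)
for (left, right), n in sorted(links.items()):
    print(left, right, n)
```

Each clone counted in a link spans the corresponding gap and is therefore a candidate template for primer walking.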

Since the entire genome has to be assembled simultaneously, instead of the 150 kb of a BAC insert, whole genome shotgun sequencing demands much greater computational capacity. Repetitive sequences will also cause problems, especially interspersed repeats where more or less identical copies of a sequence are located far from each other. When a clone-based strategy is used these repeats would probably be located in different clones, and even if they could cause problems in the mapping process they would not interfere with the sequencing.

Figure 8. Whole genome shotgun sequencing. The obtained contigs are ordered into scaffolds using information from read pairs (bold lines) spanning the gaps.

The best way to sequence a complex genome might very well be to use a combination of whole genome shotgun sequencing and a clone-based approach. Whole genome shotgun sequencing would produce large quantities of sequence data while the mapping of larger clones was underway, thereby shortening the total time needed for the project. The large insert clones would then provide a scaffold on which the whole genome shotgun sequences could be assembled, significantly reducing the problems associated with the assembly process.

4.1.1.3 Creating shotgun libraries

Regardless of which of the above strategies is chosen, shotgun libraries must still be created. A shotgun library should be completely random and contain inserts of relatively uniform size. Libraries with large inserts can be somewhat more biased, since some regions might contain complete genes that are lethal to the Escherichia coli host. These regions will then be under-represented in the library. Short insert libraries display less of these problems since they contain the complete gene less often, but too short inserts will reduce the benefits of sequencing from both directions in order to obtain read pairs. The library should preferably be large enough to contain sufficient clones to cover the genome (or BAC) at least eight to ten times. The traditional method for creating a library (figure 9) is to shear the DNA using sonication or nebulization (Sambrook and Russell 2001). Restriction enzymes can also be used, but they are generally less random. The ends of the fragments are repaired to obtain blunt ends prior to ligation with a linearized plasmid vector. The pool of plasmid clones thus generated, each containing an insert, constitutes the shotgun library, and the inserts can then be sequenced using the methods described in subsequent sections of this thesis. A number of variations on the traditional approach have been described. A major concern is the ligation of blunt ends, which is quite inefficient and can result in the formation of chimeric inserts as well as self-ligated vectors. Blunt end ligation can be avoided by the introduction of adaptors (Haymerle et al. 1986; D'Souza et al. 1989; Povinelli and Gibbs 1993; Andersson et al. 1994; Andersson et al. 1996b). Oligonucleotides are ligated to the blunt ends of the inserts and in some cases also to the vector, using an excess of oligonucleotides to drive the reaction. The adaptors create complementary overhangs on the insert and vector, making the ligation much more efficient. 
If overhangs of 11 bases are used the ligation can even be omitted, since the annealing of inserts to the vector is stable enough (Nisson et al. 1991; Rashtchian et al. 1992; Andersson et al. 1994; Andersson et al. 1996b). Adaptors will also significantly reduce the formation of chimeras since the adaptors annealed to the inserts are complementary to the vector overhangs, but not to the other insert overhangs. Yet another advantage comes from the fact that the vector can be cleaved using two different restriction enzymes, which efficiently prevents re-circularization in the absence of an insert.
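The eight- to ten-fold coverage recommended for a shotgun library can be related to the number of reads required and to the amount of sequence expected to remain uncovered. The sketch below uses the standard Poisson approximation, which is not taken from this thesis; it assumes that reads start uniformly at random, and it ignores end effects and cloning bias. The 40 kb target and 500-base read length are illustrative values, and the function names are hypothetical.

```python
import math


def reads_needed(target_bp, read_bp, coverage):
    """Number of reads required to reach the given average coverage."""
    return math.ceil(coverage * target_bp / read_bp)


def expected_uncovered_fraction(coverage):
    """Poisson estimate of the fraction of bases covered by no read at all."""
    return math.exp(-coverage)


# Illustrative values: a 40 kb insert sequenced with 500-base reads.
for c in (1, 4, 8, 10):
    n = reads_needed(40_000, 500, c)
    gap_bp = 40_000 * expected_uncovered_fraction(c)
    print(f"{c:>2}x: {n:>4} reads, about {gap_bp:.1f} bases expected uncovered")
```

The rapidly shrinking uncovered fraction shows why redundancy beyond roughly ten-fold adds little new sequence under this idealized model; in practice, biased libraries leave more gaps than the model predicts.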

Figure 9. Shotgun library construction. The template DNA is fragmented by nebulization or sonication and the ends of the obtained fragments are repaired using for example T4 DNA polymerase and/or Klenow polymerase to create blunt ends. Preparative agarose gel electrophoresis can then be used to obtain fragments of the desired size range. Finally the fragments are ligated into a plasmid vector, which has been cut with a blunt end generating restriction enzyme and treated with a phosphatase (generally calf intestine phosphatase, CIP) to prevent self-ligation.

Another approach to avoid conventional blunt end ligation is employed by a commercially available kit (Invitrogen, Carlsbad, CA, USA), which has been widely used since its introduction in 2000. In this kit, a linearized vector is supplied with Vaccinia virus topoisomerase I covalently bound to the 3’ ends. When this vector is mixed with blunt end fragments that have been dephosphorylated the enzyme is released and the fragments are ligated to the vector in a highly efficient manner (Shuman 1994). Since the fragments are dephosphorylated there is no risk of chimera formation, but the empty vector does re-circularize to some extent. Two selection systems, blue-white selection and a gene lethal to E. coli, are included in the vector in order to discriminate between clones containing an insert and those consisting solely of vector sequence.

4.1.1.4 Gap closure

Shotgun sequencing is used, at some stage, in both the clone-by-clone strategy and whole genome shotgun sequencing. During this random sequencing stage the majority of the template is covered. Eventually, the sequencing of more clones will mainly lead to higher redundancy, but not to “new” sequence. At this point it is time to move from random sequencing to directed methods in order to close the remaining gaps. Before gap closure is started it is prudent to check the obtained assembly. This can be done by comparing a virtual restriction fragment pattern of the obtained sequence with the true experimentally determined pattern of the template. Another method is to check the distances between read pairs, i.e. the forward and reverse sequence reads from the same clone. Special care has to be taken when repeats are present in the sequence. Repeats that are larger than a sequencing read, and where the copies are similar, will be difficult to distinguish from each other. Correct assembly of these regions generally requires software specifically designed for this task (Tammi et al. 2002; Tammi et al. 2003), but if the repeats are identical or too similar not even this will suffice. If the repeats are interspersed it might be possible to obtain each copy separately by using, for example, PCR with primers located in the unique flanking sequences. Primer walking can then be used to sequence these PCR products. For tandem repeats, the only option might be to determine the number of copies of the repeat by agarose gel electrophoresis of a restriction fragment, without being able to completely resolve the DNA sequence. Once the assembly is believed to be correct the contigs are ordered as far as possible into scaffolds of contigs. This is mainly done by using read pairs spanning the gaps, but specific markers can also be used to map the contigs. The gaps between contigs are either sequence gaps, where a spanning clone is present, or physical gaps, where no clone is present. 
Sequence gaps can arise from a cloning bias, low redundancy or problems in the sequencing reaction. In the first two cases clones covering the gap can be identified and additional sequence information can be obtained using primer walking (further described in section 4.1.2.1). Problems in the sequencing reaction can originate from the presence of secondary structure, which hinders the polymerase. Sometimes this can be solved by sequencing the other strand or by testing other sequencing chemistries or enzymes (Kukanskis et al. 2000). When this does not work the clone can be further subdivided into fragments 300-500 bases long, which can then be cloned and sequenced (McMurray et al. 1998). This will usually disrupt the secondary structure sufficiently to allow sequencing through it.

The closure of physical gaps requires other methods. The first approach to be tested is usually to design primers close to the contig ends and try to span the gaps using PCR. If successful, the PCR products can then be sequenced using primer walking. In the case of microbial genomes or BACs, primer walking can be performed directly on the template (as described in section 4.1.2.1). This method is useful when the DNA region is unstable in subclones. Another approach is to isolate a large insert clone containing the missing region. This has been done by using, for example, subtractive hybridization (Frohme et al. 2001) or screening by hybridization (Yang et al. 2003). Recently, PCR-assisted contig extension was described for use in gap closure (Carraro et al. 2003). This technique involves stepwise extension from the end of a contig using one specific and one arbitrary PCR primer, and has previously been used in other applications (Sterky et al. 1998).
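The assembly check based on a virtual restriction fragment pattern, mentioned at the start of this section, can be sketched as follows. This is a minimal illustration under stated assumptions: the template is treated as linear, the recognition site is palindromic (so only one strand needs to be scanned), and the 5% size tolerance used when comparing with the experimentally determined pattern is an arbitrary choice. The function names are hypothetical; the example uses the EcoRI site GAATTC, which is cut after the first G.

```python
def virtual_digest(sequence, site, cut_offset):
    """Fragment lengths (largest first) from cutting a linear sequence at every site."""
    sequence = sequence.upper()
    cuts = []
    pos = sequence.find(site)
    while pos != -1:
        cuts.append(pos + cut_offset)          # position of the cut within the sequence
        pos = sequence.find(site, pos + 1)     # allow overlapping occurrences
    bounds = [0] + cuts + [len(sequence)]
    lengths = [b - a for a, b in zip(bounds, bounds[1:]) if b > a]
    return sorted(lengths, reverse=True)


def patterns_agree(predicted, observed, tolerance=0.05):
    """True if both patterns have the same number of bands of similar size."""
    if len(predicted) != len(observed):
        return False
    pairs = zip(sorted(predicted, reverse=True), sorted(observed, reverse=True))
    return all(abs(p - o) <= tolerance * o for p, o in pairs)


# Toy assembled sequence containing two EcoRI sites.
assembled = "AAA" + "GAATTC" + "CCCC" + "GAATTC" + "TT"
predicted = virtual_digest(assembled, site="GAATTC", cut_offset=1)
print(predicted)
```

A mismatch in the number or size of the fragments suggests a misassembly, for example a collapsed repeat, and should be resolved before gap closure begins.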

4.1.2 Sequencing specific regions of the genome

In some projects a certain region of the genome is of specific interest, for example a region related to the virulence of a pathogenic bacterium, or a known disease gene. In these cases it can be advantageous to study only the specific region in a large number of strains, individuals or even species. Two major approaches to accomplish the directed sequencing of a genomic region are described in the following sections. If the desired region is mostly unknown, primer walking (either directly on genomic material or on a subclone) must be used, while directed PCR amplification can be used if the sequence is known in a closely related organism, or if flanking sequences on both sides of the region are known. Repetitive DNA will be a major concern in both of these approaches since it will prevent the design of primers with unique priming sites. In these cases, the DNA has to be subcloned prior to sequencing.

4.1.2.1 Primer walking

If the sequence of a small section close to the region of interest is known, or can be introduced (using for example a transposon), primer walking can be used for sequencing. A primer annealing to the known region is used in a sequencing reaction and the obtained data are used to extend the known sequence. Designing a new primer close to the 3’ end of the previous read continues the walking. For confirmation, a second primer, pointing in the other direction, can be used to generate sequence data for the other strand (Voss et al. 1993). These steps are repeated until the entire region is covered.

Primer design and synthesis are generally bottlenecks in the primer walking approach. To avoid the delays they may cause, a number of groups have used either short 8-mers or modular primers (Kotler et al. 1994; Jones and Hardin 1998; Kostina et al. 2000). Libraries containing a large number of these primers are presynthesized, and for each round of sequencing a new primer is selected from the library. If the template is long, for example a bacterial genome, 8-mer primers might not be specific enough. Then, modular primers built from modules of 5 to 7 nucleotides can be used (Kotler et al. 1994; Kostina et al. 2000). These primers combine specificity with comparably small libraries. When large templates, for example BACs or bacterial genomes, are used for direct sequencing, a large excess of sequencing reagents is needed to drive the reaction (Heiner et al. 1998). Large amounts of template are also needed, which significantly reduces the applicability of primer walking since reagents are expensive and it is cumbersome to prepare large amounts of BAC or genomic DNA.
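Why 8-mers may lack specificity on long templates can be illustrated with a rough estimate of how many perfect-match priming sites are expected by chance. The sketch assumes a random sequence with equal base frequencies, which real genomes only approximate; the 2 Mb genome size and the 18-base effective priming site (standing in for several adjacent modules) are hypothetical values chosen for illustration.

```python
def expected_priming_sites(template_bp, primer_len, both_strands=True):
    """Expected number of chance perfect matches for one primer in random sequence."""
    strands = 2 if both_strands else 1
    return strands * template_bp / 4 ** primer_len


# Hypothetical templates: a 150 kb BAC insert and a 2 Mb bacterial genome.
for name, size in (("BAC insert", 150_000), ("bacterial genome", 2_000_000)):
    for k in (8, 18):
        sites = expected_priming_sites(size, k)
        print(f"{name}: {k}-mer, expected chance sites = {sites:.2g}")
```

Under these assumptions an 8-mer is expected to find several sites in a BAC and dozens in a bacterial genome, whereas a longer effective priming site is expected to be unique.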

4.1.2.2 Directed PCR amplification

If sequences flanking the region of interest are known they can be used for amplification using PCR. Fragments as large as 35 kb have been successfully amplified by this method (Barnes 1994), enabling it to be applied to quite large regions. The obtained PCR products can then be sequenced using the same primers as used in the PCR, and if necessary primer walking can be performed. This will significantly reduce the amount of template needed compared to direct primer walking. If desired, the fragment can be cloned into a vector using A/T cloning for further studies.

Another directed PCR amplification approach can be applied if the complete sequence of, for example, a closely related strain is known (as described in paper IV). This sequence can then be used for designing suitable primers for PCR. As before, these PCR products can be sequenced using the same primers. The obtained PCR products should overlap one another sufficiently to avoid gaps when the obtained sequences are assembled. This approach will also significantly reduce the amount of template needed compared to primer walking.
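Planning overlapping PCR products along a known reference sequence can be sketched as a simple tiling calculation. The values used here (a 40 kb region, 5 kb amplicons and 500-base overlaps) are illustrative assumptions, not the design used in paper IV, and the function name is hypothetical. Actual primer positions would also have to be adjusted to find suitable unique priming sites.

```python
def tile_amplicons(region_len, amplicon_len, overlap):
    """Return (start, end) coordinates of amplicons tiling a region with a fixed overlap."""
    if not 0 <= overlap < amplicon_len:
        raise ValueError("overlap must be non-negative and smaller than the amplicon length")
    step = amplicon_len - overlap
    amplicons = []
    start = 0
    while True:
        end = min(start + amplicon_len, region_len)   # the last amplicon may be shorter
        amplicons.append((start, end))
        if end == region_len:
            break
        start += step
    return amplicons


# Illustrative design: 40 kb region, 5 kb amplicons, 500-base overlaps.
plan = tile_amplicons(region_len=40_000, amplicon_len=5_000, overlap=500)
for start, end in plan:
    print(start, end)
```

With these values nine amplicons cover the region, each sharing 500 bases with its neighbour so that the assembled sequences leave no gaps.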

4.2 Sequencing of transcripts (cDNA)

The fraction of a eukaryotic genome that represents coding sequence can be very low, especially for mammals and plants. If the major interest is in identifying and analyzing genes, whole genome sequencing is rather inefficient for these genomes. A more rewarding approach can then be to sequence the transcripts or tags of the transcripts (Adams et al. 1991; Adams et al. 1992), frequently denoted expressed sequence tags (ESTs). The first steps involve the isolation of mRNA and synthesis of complementary DNA (cDNA). In eukaryotes the poly(A)-tail present at the 3’ end of almost all mRNAs can be utilized, both as a handle for purification and as a primer site for the first strand synthesis. The cDNAs are then cloned into a suitable vector and sequenced. In most cases the clones are sequenced from the 5’ end to avoid problems that arise from sequencing through the poly(A)-tail, although anchored primers that consist of an oligo(T) region followed by one or a few degenerate bases can be used to sequence the 3’ end (Khan et al. 1991; Liao and Gong 1997; Grayburn and Sims 1998). When enough clones have been sequenced the reads are assembled into clusters and the consensus sequences from each cluster are searched against relevant databases to identify the genes.

To further study the structure and function of a gene product, it is necessary to know the complete sequence. This can be obtained by primer walking along full-length clones from a cDNA library prepared as described above. However, a more efficient way is to ligate several full-length cDNAs into concatemers, randomly fragment them and clone them into sequencing vectors (Andersson et al. 1997; Yu et al. 1997). Prior to concatenation, restriction sites are introduced at the ends of the cDNAs, enabling both efficient ligation and recognition of junctions in the data analysis.
When sequence reads are assembled they are first restricted in silico at these sites, resulting in a separate assembly for each gene.

As the databases containing gene sequences grow, there is less need for long sequences in order to identify a gene. This has led to the development of a technique called serial analysis of gene expression (SAGE), where concatemers of short tags of cDNA are cloned and sequenced (Velculescu et al. 1995). Each tag is only nine bases long, but allows the unique identification of 95% of the human genes. Another suitable method for sequencing tags long enough for gene identification is pyrosequencing (Agaton et al. 2002). The focus of complete gene expression profiling has now shifted to the field of microarrays, where several thousands of genes can be studied simultaneously. However, EST sequencing can still be of interest for genomes where very little is known about the transcriptome, since this approach has the capability to discover novel genes. Ongoing EST sequencing projects include the amphibious axolotl, honeybee, silkworm, sheep, pig, coffee and poplar (http://wit.integratedgenomics.com/GOLD/).
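The in silico restriction of reads from concatenated cDNAs, described earlier in this section, can be sketched as a simple string operation. This is an illustration only: the text does not state which restriction sites were introduced, so the NotI recognition sequence GCGGCCGC is used purely as a placeholder, and the function name and minimum-length filter are hypothetical.

```python
def split_at_junctions(read, junction_site, min_len=1):
    """Split a read at every junction site and keep the pieces of useful length."""
    pieces = read.upper().split(junction_site)
    return [piece for piece in pieces if len(piece) >= min_len]


# Toy read spanning two junctions between three concatenated cDNAs.
junction = "GCGGCCGC"   # placeholder site, an assumption for this example
read = "ACGT" + junction + "TTTT" + junction + "GG"
print(split_at_junctions(read, junction))
print(split_at_junctions(read, junction, min_len=3))
```

Each resulting piece derives from a single cDNA, so pieces from different genes are never joined in the subsequent assembly.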

5 Sequencing technologies

When a strategy has been chosen for a sequencing project and either a library of clones or genomic DNA has been prepared, it is time to begin the actual sequencing process. Usually this involves amplification of template, the generation and purification of Sanger fragments, separation and detection. In projects where large numbers of sequences are being produced there is also a need for efficient automation.

5.1 Amplification of templates

In most cases an amplification step is needed before a cycle sequencing reaction can be performed. The most appropriate method for this depends on the type of template involved, and the number of samples to be sequenced. In most sequencing projects the templates are cloned into plasmids or M13 vectors, enabling the use of any of the amplification methods described here (figure 10). For high-throughput sequencing projects the cost per sample and the handling time become increasingly important.

5.1.1 Cultivation of plasmids and M13

Plasmids or M13 phages can be amplified through cultivation followed by lysis and purification. Plasmids are transformed or electroporated into E. coli cells, which are then cultivated. After harvesting, by centrifugation of the cell suspension and removal of growth medium, the cells can be lysed using nonionic or ionic detergents, organic solvents, alkali or heat (Sambrook and Russell 2001). The most commonly used methods employ either alkali treatment (Birnboim and Doly 1979) or boiling (Holmes and Quigley 1981), often combined with detergent, RNaseA and lysozyme treatment (Konecki and Phillips 1998; Marra et al. 1999; Li et al. 2002). Chromosomal DNA is more sensitive to shearing than plasmid DNA, and when the cells are subjected to denaturing conditions during lysis, the chromosomal DNA strands will be separated while the plasmid DNA strands are held together, since they are topologically intertwined. When normal conditions are restored the plasmids will reform their double-stranded form while chromosomal DNA is precipitated together with proteins and remnants of the cell wall (Sambrook and Russell 2001).

Figure 10. Amplification options for plasmid or M13 templates.

After collecting the cleared lysate, the plasmid can be further purified using a number of methods, including centrifugation in CsCl-ethidium bromide gradients, differential precipitation, ion-exchange chromatography or gel filtration. A number of protocols for preparing plasmids in a 96-well format, using for example filter plates (Ruppert et al. 1995; Itoh et al. 1999; Harris et al. 2002) or carboxylated magnetic particles (Skowronski et al. 2000; Elkin et al. 2001), have also been described. These have been developed for high-throughput and can often be used in automated workstations, but they are usually quite expensive. A method developed at Washington University (Marra et al. 1999) has been used for high-throughput plasmid preparation prior to sequencing in our core facility. This technique involves lysing the cells using Tween-20, RNaseA, lysozyme and a one-minute exposure to microwave radiation. After lysis the plates are centrifuged and the cleared lysate is collected. No further purification is needed prior to sequencing the plasmids. The protocol is very inexpensive and requires relatively few manipulations.

When M13 phage is used as a vector it is transformed or electroporated into E. coli in the same manner as plasmids. When the bacteria are cultivated on agar plates the infected cells form plaques, due to their slower growth rates. A plaque is then used to inoculate a liquid culture from which M13 phage can be purified. The infected E. coli cells will produce hundreds of phage particles in each generation. These particles are released into the growth medium without lysing or killing the cells. This enables the virus particles to be harvested by simply centrifuging the cell suspension and collecting the supernatant. In the traditional protocol (Sanger et al.
1980; Messing 1983; Sambrook and Russell 2001) the phage is then concentrated by precipitation using polyethylene glycol (PEG) in the presence of salt at high concentration. Extraction with phenol releases the single-stranded phage DNA, which is then collected by ethanol precipitation. This protocol is not suited for high-throughput purification and a number of variants have been developed, in which the phenol extraction has been replaced by lysis of the phage particles using detergents (Eperon 1986; Mardis 1994), heat (Beck and Alderton 1993), NaI (Wilson 1993) or NaClO4 (Andersson et al. 1996a; Marziali et al. 1999). Several methods have also introduced modifications to accommodate the use of a 96-well format, using glass-fiber filter plates (Andersson et al. 1996a; Marziali et al. 1999), magnetic particles (Fry et al. 1992; Wilson 1993; Johnson et al. 1996) or centrifuges capable of handling deep-well plates (Wilson 1993). When magnetic particles are used the target DNA can be captured either by unspecific binding of nucleic acids to carboxylated beads (Wilson 1993) or by more specific hybridization methods where oligonucleotide probes are attached to the beads (Fry et al. 1992; Johnson et al. 1996).

5.1.2 PCR amplification

Polymerase chain reaction (PCR) amplification was first described by Kary Mullis and coworkers (Mullis et al. 1986) and has revolutionized molecular biology. It has the capacity to amplify DNA exponentially and the key elements are the use of a thermostable DNA polymerase and a cyclic temperature profile. First the temperature is raised sufficiently to denature the two DNA strands in the template. Two primers are then annealed, one on each strand, and extended by DNA polymerase at the optimal temperature for the enzyme. After extension the temperature is again raised to denature the newly synthesized strands from the template and the procedure is repeated. Since the synthesized DNA can act as template in the following cycles, the amount of DNA will increase rapidly.

Initially E. coli DNA polymerase was used, and addition of enzyme after the denaturation step was necessary for each cycle since the high temperature inactivated the enzyme. Another problem was the risk of unspecific primer annealing due to the low stringency of hybridization at low temperatures. The use of thermostable DNA polymerases, like Taq DNA polymerase from Thermus aquaticus (Chien et al. 1976), simplified the procedure significantly as the enzyme is stable throughout the reaction (Saiki et al. 1988). The optimal temperature for extension is 72°C for Taq DNA polymerase, enabling the use of stringent annealing temperatures for the primers. Using a thermostable polymerase improves the specificity, yield, sensitivity and maximum product length of the PCR amplification.

PCR is a powerful method for amplifying DNA prior to sequencing. It can be used regardless of whether the template is in the form of plasmids, M13 phage or genomic DNA. If plasmids or M13 are used as template, the reaction can be performed directly from a picked colony or plaque. Care has to be taken with genomic DNA to avoid multiple priming sites, which lead to nonspecific amplification products. In most cases the PCR products are purified before being used as templates in a sequencing reaction. This is done mainly to remove unextended primers that would result in Sanger fragments of incorrect lengths, but it is also beneficial to remove excess nucleotides and misprimed PCR fragments such as primer dimers. Several purification methods have been used, including agarose gel purification (Tracy and Mulcahy 1991; Leonard et al. 1998), precipitation (Høgdall et al. 1999), filtration (Clarke and Diggle 2002) and an enzymatic approach using exonuclease I combined with shrimp alkaline phosphatase (ExoI/SAP) (Werle et al. 1994). An alternative method is to use low amounts of primers in the PCR reaction, and thus avoid the need for a purification step (Silva et al. 2001).
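The exponential character of PCR described in this section can be made concrete with a small calculation. The sketch uses the idealized relation N = N0 (1 + E)^n, where E is the per-cycle efficiency; it ignores the plateau phase that limits real reactions, and the cycle numbers and efficiencies shown are illustrative assumptions.

```python
def pcr_copies(initial_copies, cycles, efficiency=1.0):
    """Idealized number of template copies after a given number of PCR cycles."""
    if not 0.0 <= efficiency <= 1.0:
        raise ValueError("efficiency must be between 0 and 1")
    return initial_copies * (1.0 + efficiency) ** cycles


# One starting molecule, with perfect doubling and with 90% efficiency per cycle.
for n in (10, 20, 30):
    ideal = pcr_copies(1, n)
    reduced = pcr_copies(1, n, efficiency=0.9)
    print(f"{n} cycles: {ideal:.3g} copies (E = 1.0), {reduced:.3g} copies (E = 0.9)")
```

Even a modest drop in per-cycle efficiency reduces the final yield several-fold after 30 cycles, which is one reason why yields vary between templates.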

5.1.3 Rolling circle amplification

An amplification method that has recently been applied to sequencing templates is rolling circle amplification (RCA). A number of viruses that contain a single-stranded circular genome use rolling circle replication to multiply their DNA. Initially, RCA was used to amplify short DNA circles for generating tandem repeats (Fire and Xu 1995; Liu et al. 1996). In this technique, a primer is annealed to the circular template and a polymerase extends it, generating concatenated copies complementary to the template in an isothermal process. If, instead, two primers are added, one for each strand, the amplification of the template becomes exponential since the second primer can anneal to the newly synthesized strand (Lizardi et al. 1998). This process is called hyperbranched rolling circle amplification (HRCA), and was first used to detect point mutations in small amounts of human genomic DNA.

When RCA is performed on very short templates any DNA polymerase can be used, probably because the synthesized strand is released from the template due to the small radii of the circles (Liu et al. 1996). If the template is larger, for example a plasmid or phage, the polymerase needs to have strand-displacing activity in order for RCA to occur. The most commonly used polymerase is Φ29 DNA polymerase, which is a highly processive polymerase with strand displacement and proofreading 3’-5’ exonuclease activity, but other enzymes like exo(-) Vent DNA polymerase and Bst large fragment DNA polymerase have also been employed (Lizardi et al. 1998).

For plasmid-size targets the rate of RCA is approximately 20 copies per hour, limiting the applicability of the method. To overcome this limitation, multiply-primed RCA was developed (Dean et al. 2001; Detter et al. 2002; Nelson et al. 2002). Here, instead of specific primers, random hexamers are used in the reaction, leading to the generation of multiple replication forks. The double-stranded product can be used as template in sequencing reactions without any purification, although dilution is sometimes necessary, due to the high viscosity of the product. Since Φ29 DNA polymerase displays an exonuclease activity it is important to use exonuclease-resistant primers to prevent their degradation. The use of multiply-primed RCA for template preparation in sequencing projects is advantageous since high and uniform yields of DNA can be obtained with few manipulations, decreasing hands-on time compared to most other methods. Isothermal strand displacement amplification has also been applied for the amplification of complete microbial genomes prior to PCR, direct sequencing or library construction (Detter et al. 2002).

5.2 Generation of Sanger fragments

As already described, a number of improvements have been made to the original protocol for chain termination sequencing. The most common sequencing chemistry today uses terminators labeled with four different energy-transfer dyes. Dye terminator chemistry has the advantage that any primer can be used for the sequencing reaction, while dye primer chemistry is more or less restricted to universal sequencing primers, since labeling of primers is expensive. Another advantage of dye terminator chemistry is the possibility to perform the reaction in a single tube, instead of one for each base when dye primer chemistry is used. If PCR-amplified templates are used, leftover PCR primers can be extended in the sequencing reaction, generating unreadable sequence data. This problem is avoided if dye primer chemistry is used, since the fragments extended from PCR primers will not be labeled. On the other hand, false stops (where the polymerase is prematurely released from a template without incorporation of a terminator) will be a problem for dye primer chemistry, but not for dye terminator chemistry, for the same reason. Signals are significantly increased when energy-transfer dyes are used instead of conventional dyes. Energy-transfer dyes consist of one donor and one acceptor dye. When exposed to a laser beam, the donor will absorb energy and transfer it to the acceptor, which will emit light of a different wavelength.


5.3 Purification of sequencing products

Considerable efforts have been made by a large number of research groups to develop the “ultimate” purification method for sequencing products, which would be fast, cheap, efficient and automated. To date, no single method has been able to out-compete the rest; they all have specific advantages and disadvantages. The demands on the purification method depend on the sequencing chemistry and separation platform used. For instance, the use of dye terminators demands more thorough purification than the use of dye primers, whether separated on slab gels or capillaries, since the excess labeled terminators will otherwise result in dye blobs that obscure part of the sequence. The importance of purifying the cycle sequencing products has increased with the introduction of capillary sequencers since they are more sensitive to salt, template and other impurities than slab gels. The available purification methods can be divided into three major categories, depending on whether they are based on precipitation, spin columns/filter plates, or magnetic beads.

5.3.1 Precipitation techniques

The traditional purification method is ethanol or isopropanol precipitation, usually combined with a 70% ethanol wash, and sometimes also preceded by a phenol/chloroform extraction. When ethanol is used, addition of EDTA or salts (e.g. sodium acetate, ammonium acetate or magnesium chloride) can improve the performance of the precipitation. Ethanol precipitation is a very cheap method that does not require any expensive equipment. However, template will be co-purified with the sequencing products and if the precipitation is not properly optimized excess salt and terminators may also be precipitated. In addition, it is difficult to automate the process, due to the centrifugation steps. Another disadvantage is the variable yield of sequencing products. A modified precipitation protocol, in which n-butanol is used instead of ethanol, has also been described (Tillett and Neilan 1999). It has several advantages over traditional ethanol precipitation, since it gives higher yields (especially of short DNA fragments), requires shorter centrifugation steps, co-precipitates less salts, and avoids the need for a washing step.


5.3.2 Filtration methods

A large number of commercial kits are available for the removal of dye terminators by filtration. Spin columns are generally used for low-throughput applications while filter plates with 96 or 384 wells are used for medium- and high-throughput applications. One approach is to fill the columns or filter plate wells with a gel separation matrix, consisting of spheres with uniformly sized pores. When the cycle sequencing products are passed through the matrix the small molecules, such as salt and nucleotides, diffuse into the pores where they are retained. Longer DNA fragments pass through the matrix and can be recovered in the filtrate (figure 11A). Another approach is to use a filter that acts as a molecular sieve, allowing small molecules to pass while larger DNA fragments are retained (figure 11B). The sequencing products can then be obtained by resuspension in the desired loading buffer.

Figure 11. Basic principles of filtration methods. A) The matrix consists of spheres with pores; the small molecules diffuse into the pores and are retained. B) The filter allows small molecules to pass while large molecules are retained.

Both approaches are fast, and they are becoming quite inexpensive, probably due to the high competition between different companies. They are relatively easy to automate if vacuum filtration is used instead of centrifugation, and more or less any liquid handling robot fitted with a vacuum manifold can be used. The removal of salts and dye terminators is efficient using filtration methods and the yield is quite high, but unfortunately template, misprimed products or false stops will not be removed.


5.3.3 Magnetic bead techniques

Our laboratory has a long tradition of utilizing magnetic beads in a number of procedures, for example: solid phase sequencing (Stahl et al. 1988; Hultman et al. 1989; Uhlen et al. 1992; Wahlberg et al. 1992), in vitro mutagenesis (Hultman et al. 1990), gene assembly (Stahl et al. 1993), solid-phase cloning (Hultman and Uhlén 1994), immunomagnetic separation (Stark et al. 1996), diagnostic detection assays (O'Meara et al. 1998b), preparation of single-stranded DNA for pyrosequencing (Holmberg unpublished), and purification of PCR or cycle sequencing products (Papers I, II and III). Magnetic bead assays are easy to automate using liquid handling robots equipped with magnetic stations. In addition, buffer exchanges are efficient and fast, enabling thorough washes of the captured moiety. Magnetic beads have been utilized in a number of ways for the purification of sequencing products, some of which are described below.

5.3.3.1 Hybridization based techniques

Specific oligonucleotides coupled to magnetic beads have been used for the purification of single-stranded DNA. Due to the high specificity of hybridizations, this type of approach can be used in diagnostics for extracting viral particles that are present in very low concentrations (Albretsen et al. 1990; van Doorn et al. 1994; Millar et al. 1995; O'Meara et al. 1998b). Another possible application is the purification of M13 templates prior to sequencing reactions (Fry et al. 1992; Johnson et al. 1996). In Papers I and II, a technique utilizing cooperative oligonucleotides for the purification of cycle sequencing products is described (figure 12A). This technique is direction- and vector-specific, enabling the iterative purification of multiplex cycle sequencing products (further described in section 7.1).

5.3.3.2 Streptavidin-biotin

Streptavidin-biotin is a versatile system for purification of both DNA and other biomolecules. The non-covalent interaction involved is very stable, allowing harsh washes to remove all contaminants. A more thorough description of the biotin-streptavidin system is presented in section 7.2. Several groups have described the use of biotin and streptavidin for purification of sequencing products (Tong and Smith 1992; Tong and Smith 1993; Fangan et al. 1999; Ju 2002). In most cases the sequencing reaction is performed using a biotinylated primer (figure 12B), either internally labeled, when dye primers are used (Tong and Smith 1992; Tong and Smith 1993) or 5’-labeled when dye terminators are used (Fangan et al. 1999). Biotinylation of the terminators, together with the use of dye primers, has also been described (Ju 2002). It is beneficial to have the dye at one end of the sequencing fragment and the biotin label at the other. Fragments missing one of these features will then either not be captured or not detected, thereby reducing the background. The sequencing fragments can be eluted from the beads prior to loading on a DNA sequencer, either using denaturing or non-denaturing conditions (Paper III). These techniques have the advantages of not co-purifying template DNA while efficiently removing salt and dye terminators. Streptavidin beads are relatively expensive, but if sequencing products can be released in a non-denaturing fashion, and the beads thus can be re-used, costs can be lowered significantly.

Figure 12. Different approaches for purification of sequencing products using magnetic beads. A) Hybridization. B) Streptavidin beads and biotinylated products. C) Unspecific capture of DNA. D) Capture of unincorporated terminators.

5.3.3.3 Unspecific capture of DNA

Solid-phase reversible immobilization (SPRI) is a purification method in which DNA is precipitated onto the surface of carboxylated magnetic particles (figure 12C). After washing, the DNA can be eluted using water. In the original SPRI protocol, polyethylene glycol (PEG) and sodium chloride were used to precipitate the DNA onto the beads (Hawkins et al. 1994). This is not a suitable approach for the purification of sequencing products since capillary sequencers are sensitive to salt. Instead, an ethanol solution containing tetraethylene glycol (TEG) is used to precipitate the DNA onto the beads, followed by a 70% ethanol wash (Elkin et al. 2002). Large templates, e.g. BACs or rolling circle amplified DNA, will remain bound to the beads under the conditions used to elute sequencing products, while smaller templates will be co-eluted. This method can be useful for sequencing projects where, for example, primer walking is used, since there is no requirement for modified primers or universal sequence in the products.


5.3.3.4 Capture of dideoxynucleotides

Instead of capturing the desired sequencing products, as described above, this technique specifically removes the unincorporated dye terminators (Springer et al. 2003). Since no washes of the beads are necessary this method is faster than other magnetic bead approaches, but template, salt and other impurities will remain in the sample (figure 12D).

5.4 Separation and detection

5.4.1 Electrophoresis

Once the staggered set of Sanger fragments has been generated and purified, the fragments need to be separated and detected. The separation method must have sufficient resolution to be able to discriminate between two DNA fragments differing in size by a single nucleotide. The most common approach to achieve this separation is to use electrophoresis. Initially, sequencing was performed manually, and the fragments were separated on polyacrylamide slab gels prior to detection by autoradiography. The introduction of automated slab gel sequencers increased throughput and decreased the handling time. In these sequencers, fluorescently labeled DNA fragments were detected in real time at a fixed position by a CCD camera as the DNA migrated through the gel. Disadvantages with slab gel electrophoresis include time-consuming and tedious operations like pouring gels (using hazardous chemicals such as acrylamide), loading samples and tracking gel images. Capillary sequencers circumvent these problems by performing electrophoresis in thin capillaries instead of slab gels. The cross-linked polyacrylamide gels are substituted with replaceable matrices (linear polyacrylamide, for example), enabling the same capillary to be used more than 100 times. In addition, since heat is dispersed more rapidly from a capillary than a slab gel the electrophoresis can be performed at higher voltages, reducing run times. The samples are loaded, one sample per capillary, using electrokinetic injection. A certain number of ions will enter the capillary during the injection and if ionic molecules other than the sequencing fragments, like salt or nucleotides, are present they will compete with the DNA and thus lower the signal (Ruiz-Martinez et al. 1998). Therefore, it is important to efficiently purify the sequencing products prior to loading.
Since a sequencing sample is confined to the capillary into which it was injected, no lane tracking is necessary, making analysis of the raw data easier and faster. The human genome has mainly been sequenced using 96-capillary sequencers (ABI3700, Applied Biosystems and MegaBACE 1000, Amersham Pharmacia Biotech). These were introduced at approximately the same time as Celera launched their sequencing effort and shortly after the human genome consortium entered full-scale production sequencing. Although the 96-array capillary sequencers increased the throughput, the production of the draft sequence of the human genome would not have finished ahead of schedule without the use of a large number of sequencers. It is now possible to load a capillary sequencer with enough sample plates and reagents to perform eight runs, each taking approximately 3 hours, start the sequencer and come back the next day to collect the data. Since capillary electrophoresis only uses a small fraction of the prepared sequencing mixture (the sequencers have amol detection limits compared to the pmol sequencing reactions) miniaturization is possible in order to decrease reagent costs. Several groups have performed sub-microliter sequencing reactions in nanoreactors, for example fused-silica capillaries, prior to capillary electrophoresis (Soper et al. 1998; Hadd et al. 2000; Pang and Yeung 2000; Paegel et al. 2002).
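As a rough worked example, the figures quoted above (96 capillaries, eight queued runs, about 3 hours per run, amol detection versus pmol reactions) can be combined into a back-of-the-envelope estimate. The variable names are illustrative only, and the calculation ignores failed capillaries and set-up time.

```python
# Back-of-the-envelope throughput for one unattended load of a
# 96-capillary sequencer, using only the figures quoted in the text.
CAPILLARIES = 96      # one sample per capillary per run
RUNS_PER_LOAD = 8     # runs that can be queued in one load
HOURS_PER_RUN = 3     # approximate run time

samples_per_load = CAPILLARIES * RUNS_PER_LOAD   # samples analysed per load
hours_per_load = RUNS_PER_LOAD * HOURS_PER_RUN   # unattended instrument time

# amol = 1e-18 mol, pmol = 1e-12 mol: the unit ratio alone suggests that
# reactions are scaled about six orders of magnitude above the detection limit.
excess_factor = 10 ** (18 - 12)

print(samples_per_load, "samples in about", hours_per_load, "hours")
print("reaction scale / detection limit ~", excess_factor)
```

The six-orders-of-magnitude gap is what makes sub-microliter reactions attractive: in principle, most of the reagent in a conventional reaction is never injected.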

5.4.2 Mass spectrometry

Matrix-assisted laser desorption ionization time-of-flight (MALDI-TOF) mass spectrometry (MS) can be used as an alternative to electrophoresis for the separation and identification of Sanger fragments. In this approach, the DNA sample is dried together with a UV-absorbing matrix. When the sample is subjected to a short pulse of UV light, DNA ions are ablated into the gas phase; generally these ions are monovalent and intact. A high voltage pulse accelerates the ions in an electric field, giving them a common kinetic energy, and subsequently they are passed into a flight tube. When the ions pass through the vacuum of the flight tube they are separated according to size. An ion-to-electron conversion detector is placed at the other end of the flight tube, registering the time of flight (TOF) from the original laser pulse for each fragment. The TOF can then be used to calculate the mass, and since the nucleotide bases have different molecular masses the DNA sequence can be deduced. MALDI-TOF-MS is a very fast method for separating and detecting Sanger fragments. Unfortunately, the read length is only about 100 bases, making it unsuitable for most de novo sequencing applications (Marziali and Akeson 2001).
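The relation between flight time and mass can be sketched for an idealized linear instrument. If every ion of charge z gains the kinetic energy z·e·U = ½·m·v² and then drifts a distance L, the flight time is t = L·sqrt(m / (2·z·e·U)), so m = 2·z·e·U·(t/L)². The sketch below assumes this simplified model only (no initial velocity spread, no time spent in the acceleration region); the 20 kV voltage and 1 m tube length are hypothetical example values, not parameters from the thesis.

```python
import math

E_CHARGE = 1.602176634e-19   # elementary charge in coulombs
DALTON = 1.66053906660e-27   # one dalton in kilograms

def flight_time(mass_da, voltage, length, charge=1):
    """Idealized drift time (s) of an ion of mass_da daltons."""
    mass_kg = mass_da * DALTON
    return length * math.sqrt(mass_kg / (2 * charge * E_CHARGE * voltage))

def mass_from_tof(t, voltage, length, charge=1):
    """Invert the relation above: mass in daltons from a flight time in seconds."""
    return 2 * charge * E_CHARGE * voltage * (t / length) ** 2 / DALTON

# Hypothetical example: a monovalent 6000 Da fragment, 20 kV, 1 m flight tube.
t = flight_time(6000.0, voltage=20_000.0, length=1.0)
print(f"flight time: {t * 1e6:.1f} microseconds")
print(f"recovered mass: {mass_from_tof(t, 20_000.0, 1.0):.1f} Da")
```

Because t grows with the square root of the mass, neighbouring fragments arrive closer together as they get longer, which is consistent with the limited read length noted above.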


5.5 Automation

The automation of DNA sequencing has resulted in a vast number of different commercially available as well as “home-made” robotic workstations. These range from small semi-automatic instruments for liquid handling, like a plate filler or a Hydra (Robbins Scientific), to huge robots capable of performing several operations simultaneously, for example the Genesis RSP (Tecan). Pipetting heads have evolved from a single tip to 384 tips, but the most commonly used have eight tips (Biomek™ 1000, Beckman Coulter) or 96 tips (Microlab 4200, Hamilton; PlateMate™, Matrix; Biorobot 8000, Qiagen). Most of these workstations can be used to set up PCR reactions or cycle sequencing reactions, since this only requires a few pipetting steps per sample. Many liquid-handling workstations can also be equipped with vacuum manifolds or magnetic separation stations, enabling a number of purification techniques to be automated. A workstation that was developed at our department but is now commercially available (Magnatrix 1200, Magnetic Biosolutions AB) utilizes a movable magnet in order to perform the magnetic separation in the tips, enabling fast and efficient bead recovery. Colonies or plaques can be automatically picked and placed into microtiter plates using a picker robot equipped with a camera and one or several needles. This saves time and manual labor. However, the currently available robots are not as good as a trained human at deciding if a colony is worth picking.

6 Data analysis

Once sequencing reads have been obtained, the data analysis process starts. The choice of software and analysis mainly depends on the type of project. This area is too extensive, and the number of available software tools too large, to be covered comprehensively in this thesis. The exemplified software packages are freely available for academic users and they are listed, together with references, in table 1 (quality assessment, assembly and alignment) and table 2 (annotation).


6.1 Quality assessment

The raw data of a sequence read is first base-called, using an appropriate algorithm for the sequencing platform involved. The base caller will reduce the background originating from spectral overlap of the different dyes. Correction for mobility shifts caused by the dyes is also performed. Next, a quality value is assigned to each base in the read; this is a measure of the probability that the base call is incorrect. The most commonly used software for quality scoring is Phred (Ewing and Green 1998; Ewing et al. 1998). A Phred score of 10 represents a 10% error probability, while Phred scores of 20 and 30 correspond to 1% and 0.1% error rates, respectively. The standard approach when measuring read lengths is to give the number of continuous bases with a Phred score of 20 or more. Previously, accuracy was also commonly measured in terms of the quality of a read (rather than a base). If the true sequence was known, accuracy could be calculated as the number of correct bases divided by the total number of bases.
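The score values quoted above follow from the logarithmic definition Q = -10·log10(P), where P is the estimated probability that the base call is wrong. The sketch below converts between the two and computes a Q20 read length. The function names are illustrative, and reading "continuous bases" as the longest uninterrupted run at or above the threshold is an assumption, since other conventions exist.

```python
import math

def phred_to_error(q):
    """Error probability corresponding to a Phred quality value."""
    return 10 ** (-q / 10)

def error_to_phred(p):
    """Phred quality value corresponding to an error probability."""
    return -10 * math.log10(p)

def q20_read_length(qualities, threshold=20):
    """Longest run of consecutive bases with quality >= threshold."""
    best = run = 0
    for q in qualities:
        run = run + 1 if q >= threshold else 0   # a low-quality base resets the run
        best = max(best, run)
    return best

for q in (10, 20, 30):
    print(f"Q{q}: error probability {phred_to_error(q):.3%}")

# Hypothetical per-base qualities for a very short read.
print(q20_read_length([10, 25, 30, 22, 8, 40, 41]))
```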

6.2 Assembly

In a project where a large region of DNA is sequenced using several overlapping reads, these reads have to be assembled correctly. If no repetitive sequence is present this is quite straightforward. A computer program compares the reads to each other and determines the best way to join them. This can be done in at least two ways, and software packages using both kinds of method have been developed. One approach is to choose a read, compare it to all other reads and determine which are sufficiently similar. These reads are then assembled into a contig and the process is repeated for the next read (Gap4). If the first read to be compared is of poor quality this can negatively affect the assembly of contigs. In a second approach, all reads are compared to all other reads before any assembly is performed. The best matches are then assembled first to obtain the best overall assembly possible (Phrap). This will result in better assemblies, but requires more computational power. These two methods can be compared to two ways of putting a jigsaw puzzle together: either starting with one piece and finding the pieces that fit it before moving on to the next piece, or looking at all the pieces and working out how they all fit together, simultaneously. The second approach would, of course, be impossible without computational assistance if there were any more than a few pieces.
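The second strategy (compare all reads to all reads, then join the best matches first) can be illustrated with a toy greedy assembler. This is a minimal sketch under strong assumptions, not the Phrap algorithm: reads are error-free, all on the same strand, only exact suffix-prefix overlaps are considered, and quality values are ignored. The function names and the `min_len` parameter are hypothetical.

```python
def overlap(a, b, min_len=3):
    """Length of the longest suffix of a that exactly equals a prefix of b."""
    for n in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:n]):
            return n
    return 0

def greedy_assemble(reads, min_len=3):
    """Repeatedly merge the pair of sequences with the longest overlap."""
    contigs = list(dict.fromkeys(reads))        # drop exact duplicates, keep order
    while True:
        best = (0, None, None)
        for i, a in enumerate(contigs):         # all-against-all comparison
            for j, b in enumerate(contigs):
                if i != j:
                    n = overlap(a, b, min_len)
                    if n > best[0]:
                        best = (n, i, j)
        n, i, j = best
        if n == 0:                              # nothing left to join
            return contigs
        merged = contigs[i] + contigs[j][n:]
        contigs = [c for k, c in enumerate(contigs) if k not in (i, j)] + [merged]

reads = ["ATGGCGT", "GCGTACC", "TACCTGA"]
print(greedy_assemble(reads))
```

The all-against-all loop is quadratic in the number of reads at every merge step, which reflects the higher computational cost mentioned above; production assemblers use indexing to avoid this.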


Quality values are a useful asset when assemblies are being made. For instance, deviation at a corresponding base between two similar reads may be due either to a sequencing error, or to the reads originating from different parts of the DNA sequence, despite their similarity (for example they may be two copies of a repeat). When one or both bases have low quality assignations the difference can usually be attributed to sequencing error, but if they both have high quality values they should probably not be assembled together. When several copies of similar repeats are present in a sequence the assembly process will become very difficult and most programs will produce faulty assemblies. However, software has been specifically developed to resolve these regions (TRAP), for example by producing multiple alignments and comparing the sequence in defined nucleotide positions (DNPs). These positions are then used to group the reads into different copies of the repeat.
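The decision rule described above for a single discrepant position can be written down directly. The sketch below is illustrative only: the threshold of 30 for "high quality" is an assumption chosen for the example, and real assemblers weigh many positions together rather than classifying one base pair in isolation.

```python
def classify_discrepancy(base_a, qual_a, base_b, qual_b, high_quality=30):
    """Interpret one aligned position between two similar reads."""
    if base_a == base_b:
        return "match"
    if qual_a >= high_quality and qual_b >= high_quality:
        # Both calls are trustworthy, so the reads probably come from
        # different copies (for example of a repeat) and should not be joined.
        return "likely different copies"
    # At least one call is unreliable: attribute the difference to error.
    return "likely sequencing error"

print(classify_discrepancy("A", 45, "G", 12))
print(classify_discrepancy("A", 45, "G", 40))
```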

6.3 Comparing sequences

A large number of alignment tools that can be used for aligning short, similar nucleotide or amino acid sequences have been developed. The main differences between alignment tools and assembly tools are the nature of their inputs and their purpose. Assembly tools are used to put fragments of a common template together, forming a consensus sequence, while alignment tools compare consensus sequences from different templates. Some of the alignment tools can only handle pair-wise alignments, while others can handle multiple sequences (for example CLUSTALW). They use different algorithms, but if the differences between sequences are small most will perform well. In this case the major discernible difference between the tools is in the output format and the way the results are displayed. The best choice for a particular alignment depends mainly on the purpose of the alignment and what is to be done with the data afterwards. Large DNA sequences, for example complete bacterial genomes from different species or strains, are more difficult to align correctly. Problems arise partly from the larger number of bases to be compared, but mainly from the presence of insertions/deletions and rearrangements. Most traditional alignment tools have difficulties handling large gaps in one or more sequences, but several software packages have been developed to solve these problems (including AVID/MAVID, BlastZ, MGA and MUMmer). In order to extract useful information from these long alignments, new visualization tools have also been introduced (PipMaker and VISTA), but both the alignment and visualization packages could be improved significantly in terms of user friendliness and performance.

Table 1. Software tools for quality assessment, assembly and alignment.

Function | Name | Web address | Reference
--- | --- | --- | ---
Quality assessment | Phred | http://www.phrap.org/ | (Ewing and Green 1998; Ewing et al. 1998)
Assembly | Gap4 (Staden) | http://www.mrc-lmb.cam.ac.uk/pubseq/ | (Bonfield et al. 1995)
 | Phrap | http://www.phrap.org/ | (Gordon et al. 1998)
 | TRAP | | (Tammi et al. 2002; Tammi et al. 2003)
Alignment | AVID/MAVID | http://baboon.math.berkeley.edu/mavid/ | (Bray et al. 2003; Bray and Pachter 2003)
 | BlastZ | http://bio.cse.psu.edu/ | (Schwartz et al. 2003b)
 | CLUSTALW | http://www.ebi.ac.uk/clustalw/ | (Thompson et al. 1994)
 | MGA | http://bibiserv.techfak.uni-bielefeld.de/mga/ | (Hohl et al. 2002)
 | MUMmer | http://www.tigr.org/software/mummer/ | (Delcher et al. 1999; Delcher et al. 2002)
Visualization of long alignments | (Multi) PipMaker | http://bio.cse.psu.edu | (Schwartz et al. 2000; Schwartz et al. 2003a)
 | VISTA | http://www-gsd.lbl.gov/vista/ | (Mayor et al. 2000)

6.4 Annotation

One of the major challenges for any genome-sequencing project is to identify and annotate all coding and regulatory regions (Stein 2001). Full understanding of a genome requires annotation at nucleotide, protein and process levels. Nucleotide-level annotation starts with mapping known genes, gene tags and other markers using, for example, BLASTN or SSAHA. This process connects the genomic sequence with any pre-existing physical or genetic maps. Gene finding is also a major part of the nucleotide-level annotation. In prokaryotic genomes this is a fairly straightforward procedure, in which open reading frames (ORFs) exceeding a specified threshold length are identified and start and stop codons are verified. Prokaryotic genomes consist mainly of coding sequence, which simplifies this process. In eukaryotic genomes, especially of higher organisms, only a small proportion of the genome encodes genes. Together with the more complex gene structure of eukaryotes, in which genes are split into several exons divided by introns, this significantly complicates gene finding. A number of software algorithms have been developed to predict genes in eukaryotic genomes (for example GeneMark.hmm, Genie, GENSCAN and Grail), but they are still far from optimal. Known transcribed sequences can be very useful in predicting genes. Data from cDNA or EST sequencing projects can be compared with the genomic sequence in order to identify similarities. The most powerful method for predicting protein-coding regions in a genome is to combine ab initio gene predictors with similarity data (using, for example, GenomeScan and Grail/Exp).

There are a number of other features that are of interest in a genome, for example non-coding RNAs and regulatory regions. Non-coding RNAs are generally found by similarity searches, although certain characteristic secondary structures can be sought instead (elements of tRNAs, for instance). Unfortunately, similarity searches will only identify those RNAs that are already known, and there are probably a large number of small RNAs that have not yet been identified. The same is true for regulatory regions, where known transcription factor binding sites can be found by similarity searches, but many regulatory regions remain unknown. By comparing two species, e.g. human and mouse, it is possible to identify conserved intergenic regions that could represent regulatory regions.

In protein-level annotation attempts are made to create a catalogue of the proteins in an organism. This catalogue should contain the name and putative function of each protein.
To date, only a fraction of the genes found during the nucleotide-level annotation of an organism correspond to known, well-characterized proteins. One reason for this is the simple fact that DNA is easier to characterize than proteins. Usually, attempts are made to group the vast number of unknown proteins into protein families. This is accomplished by similarity searches against protein or protein domain databases (for example SWISS-PROT, PFAM, PROSITE and SMART). During evolution genes are sometimes duplicated, followed by divergence of the copies, leading to the formation of paralogue genes. These are similar in sequence, but can have completely different functions. Separating orthologue genes (which are descendants from the same ancestral gene rather than originating from a duplication) from paralogues can be very difficult, and human curation of the protein families is usually needed.

Perhaps the most difficult part of annotating a genome is the process-level, or functional, annotation. The aim here is to describe how the proteins encoded by the genome relate to metabolism, the cell cycle, cell death and so on. A standard vocabulary, called gene ontology (GO), describing the function of eukaryotic genes has been developed (http://www.geneontology.org/). This vocabulary is used to describe the proteins at molecular function, biological process and cellular component levels. “Molecular function” is concerned with aspects such as the enzymatic activity of the protein, while “biological process” describes the broader system in which the protein participates. “Cellular component” describes the subcellular location of the protein. Computational comparison of sequences is often not sufficient for full functional annotation, and additional investigations, for example microarray analyses, may be required to gather information on the roles that proteins play in specific physiological processes.

Table 2. Software tools for annotation.

Function | Name | Web address | Reference
--- | --- | --- | ---
Similarity searches | Blastn | http://www.ncbi.nlm.nih.gov/BLAST/ | (Altschul et al. 1990)
 | SSAHA | http://www.sanger.ac.uk/Software/analysis/SSAHA/ | (Ning et al. 2001)
Ab initio gene prediction | GeneMark.hmm | http://genemark.biology.gatech.edu/GeneMark/ | (Besemer and Borodovsky 1999)
 | Genie | http://www.fruitfly.org/seq_tools/genie.html | (Reese et al. 2000)
 | GENSCAN | http://genes.mit.edu/GENSCAN.html | (Burge and Karlin 1997)
 | Grail | http://compbio.ornl.gov/Grail-1.3/ | (Uberbacher and Mural 1991)
Gene predictions using similarity data | GenomeScan | http://genes.mit.edu/genomescan/ | (Yeh et al. 2001)
 | Grail/EXP | http://compbio.ornl.gov/grailexp/ | (Xu and Uberbacher 1997)
Protein databases | PFAM | http://pfam.wustl.edu/ | (Bateman et al. 2000)
 | PROSITE | http://www.expasy.org/prosite/ | (Hofmann et al. 1999)
 | SMART | http://smart.embl-heidelberg.de/ | (Ponting et al. 1999)
 | SWISS-PROT | http://www.ebi.ac.uk/swissprot/ | (Schuler 1997)
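Returning to the nucleotide-level annotation in section 6.4, the prokaryotic gene-finding step (ORFs above a threshold length with verified start and stop codons) can be sketched as a scan of the three reading frames on each strand. This is a simplified illustration, not one of the listed tools: only ATG is accepted as a start codon (an assumption, since bacteria also use alternative starts), the threshold `min_codons` is a hypothetical parameter, and minus-strand coordinates are reported relative to the reverse-complemented sequence.

```python
STOPS = {"TAA", "TAG", "TGA"}
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(seq):
    return seq.translate(COMPLEMENT)[::-1]

def orfs_in_strand(seq, min_codons):
    """Yield (start, end) of ATG...stop ORFs; end is exclusive and includes the stop."""
    for frame in range(3):
        start = None
        for pos in range(frame, len(seq) - 2, 3):
            codon = seq[pos:pos + 3]
            if start is None and codon == "ATG":
                start = pos                          # first in-frame start codon
            elif start is not None and codon in STOPS:
                if (pos - start) // 3 >= min_codons: # codons before the stop
                    yield start, pos + 3
                start = None                         # look for the next ORF

def find_orfs(seq, min_codons=100):
    """ORFs on both strands as (strand, start, end) tuples."""
    seq = seq.upper()
    found = [("+", s, e) for s, e in orfs_in_strand(seq, min_codons)]
    found += [("-", s, e)
              for s, e in orfs_in_strand(reverse_complement(seq), min_codons)]
    return found

# Hypothetical toy sequence: one short ORF (ATG + 5 codons + stop) in frame 2.
toy = "CC" + "ATG" + "GCT" * 5 + "TAA" + "GG"
print(find_orfs(toy, min_codons=5))
```

A length threshold alone works reasonably well in prokaryotes because, as noted above, most of the genome is coding; the same scan applied to an intron-containing eukaryotic gene would miss or truncate it.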


PRESENT INVESTIGATIONS

This thesis is based on four papers, three of which describe the development of two new methods for purifying DNA, while the fourth describes an effort to compare the cag pathogenicity island (PAI) in four Helicobacter pylori strains. Both of the developed purification techniques utilize magnetic beads, which are ideal for automation, but exploit different strategies to capture the DNA. However, both techniques were unsuitable for the approach used to sequence the 40 kb region in H. pylori, where specific primers were used to PCR-amplify overlapping sections. The first purification technique requires all of the samples to contain the same vector specific sequence, making it impossible to use for primer walking or directed PCR approaches. The second technique requires biotinylated primers, which are too expensive when hundreds of primers have to be used. Both techniques were developed as tools for purifying cloned inserts, which is an important step in high throughput sequencing strategies. In the subsequent sections these purification techniques are described, then some background concerning H. pylori and the cag PAI is given, followed by a description of the strategy used for sequencing this region and a presentation of the results obtained with it.


7 Solid-phase purification of DNA

As sequencing capacity requirements increase, so does the importance of automated solutions. Solid-phase techniques for purifying biomolecules such as DNA are easy to automate, especially if magnetic beads are used as the solid phase. A number of different methods have been developed for the purification of PCR or cycle sequencing products. In this thesis two different purification strategies that can be applied to cycle sequencing products are presented: a hybridization technique (Papers I and II) that is suitable for multiplex sequencing reactions, and a technique utilizing streptavidin to capture biotinylated DNA (Paper III). The major breakthrough incorporated into the second approach is a step allowing biotinylated biomolecules to be released from a streptavidin-coated support without denaturation.

7.1 Hybridization based technique for the purification of cycle sequencing products

Hybridization between oligonucleotides and their complementary nucleic acid targets is used in numerous applications, for example PCR amplification, cycle sequencing, nucleic acid blotting and microarrays. In studies described in Papers I and II, oligonucleotides coupled to magnetic beads were used for the purification of cycle sequencing products. Previous methods using probes on magnetic beads to capture single-stranded template prior to sequencing have relied on long capture probes (Fry et al. 1992) or triple-helix formation (Johnson et al. 1996) in order to stabilize the hybridization complexes. Here, we have used shorter adjacently annealing modular probes instead. The stabilizing effect of modular probes is shown in Paper I, but it has also been evaluated in more detail in previous studies (Lane et al. 1997; Raja et al. 1997; O'Meara et al. 1998a; O'Meara et al. 1998b; Nilsson et al. 1999). It has been suggested that the increased stability arises from van der Waals forces between adjacent nucleotides, together with an opening up of secondary structures in the target sequence. Even a one-base gap between the modular probes significantly decreases the stability of the complex. Paper I describes the purification technique, which was then further developed and applied to multiplex cycle sequencing reactions reported in Paper II. A capture probe, covalently coupled to a paramagnetic bead, and an adjacently annealing modular probe were designed to hybridize between the primer site and the insert on the cycle sequencing products (figure 13A). Different sets of probes were added in an iterative fashion as outlined in figure 13B in order to purify multiplex sequencing products.
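The iterative scheme for multiplex products can be illustrated with a toy model in which each probe set captures only its matching products and the supernatant is passed on to the next probe set. This is purely a bookkeeping sketch under idealized assumptions (complete and perfectly specific capture); the dictionary fields and probe names are hypothetical and do not come from Papers I or II.

```python
def iterative_capture(mixture, probe_sets):
    """Capture products one probe set at a time, passing the supernatant on."""
    supernatant = list(mixture)
    eluates = {}
    for probe in probe_sets:
        # Beads carrying this probe set bind only the matching products.
        eluates[probe] = [p for p in supernatant if p["target"] == probe]
        # Everything else stays in solution for the next round.
        supernatant = [p for p in supernatant if p["target"] != probe]
    return eluates, supernatant

# Hypothetical duplex reaction: forward and reverse products plus contaminants.
mixture = [
    {"name": "forward product", "target": "probe_set_fwd"},
    {"name": "reverse product", "target": "probe_set_rev"},
    {"name": "dye terminators", "target": None},
    {"name": "misprimed product", "target": None},
]
eluates, leftover = iterative_capture(mixture, ["probe_set_fwd", "probe_set_rev"])
for probe, captured in eluates.items():
    print(probe, [p["name"] for p in captured])
print("discarded:", [p["name"] for p in leftover])
```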


Figure 13. Purification of multiplex cycle sequencing products. A) Two probe sets were designed to anneal between the primer site and the insert. B) Duplex cycle sequencing was performed by adding two primers. One probe set at a time was then added in order to iteratively capture the corresponding sequencing products. The asterisked supernatant (*) corresponds to a normal single cycle sequencing reaction.

In figure 14 the results from the purification of single, duplex and quadruplex cycle sequencing reactions are presented (unpublished data). In the latter case the cycle sequencing products were obtained simultaneously from two directions on two different templates. This purification strategy has five major advantages. First, any type of sequencing chemistry and any sequencing primer can be used to generate sequencing products. Second, non-incorporated primer or misprimed sequences are not captured, which reduces background signals and increases the accuracy. Third, the sequencing templates can be plasmids, PCR products, M13 or even BACs, as long as they contain vector-specific sequence. If double-stranded template DNA is used it will not be co-purified, which is an advantage since excess template can interfere with capillary sequencing. Fourth, multiplex cycle sequencing products can be iteratively purified due to the specificity of the technique. This reduces costs as well as instrument and handling times per sample. Fifth, the beads are easy to regenerate in an automated fashion and can be re-used at least 12 times, which results in low reagent costs for the purification.

Figure 14. Results from single, duplex and quadruplex cycle sequencing reactions. Two vectors were used for the quadruplex reaction (pUC18 and pBluescript), but here the results from only one of the vectors (pBluescript) are shown.

7.2 Purification using the biotin-streptavidin system

The biotin-streptavidin system is the strongest non-covalent biological interaction known, with a Kd of 4·10^-14 M (Savage 1992). This has been exploited in a variety of applications in research and diagnostics, where streptavidin can be coupled to a solid phase while biotin is coupled to the moiety of interest. Harsh conditions, such as formamide at high temperatures, are generally required to separate the biotin from the streptavidin, resulting in denaturation of the streptavidin molecule.
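To give a feeling for what a dissociation constant of this magnitude means, the sketch below evaluates the textbook single-site binding expression, fraction bound = [L] / (Kd + [L]). This is a simplification introduced here for illustration: it treats each binding site independently, ignores the tetrameric nature of streptavidin and any surface effects on beads, and the function name and the chosen concentrations are assumptions.

```python
# Illustrative sketch: equilibrium occupancy for simple 1:1 binding.
# Assumes independent sites; Kd value is the one quoted in the text.

KD_BIOTIN_STREPTAVIDIN = 4e-14  # mol/L (Savage 1992, as cited in the text)


def fraction_bound(free_ligand_molar: float, kd_molar: float = KD_BIOTIN_STREPTAVIDIN) -> float:
    """Fraction of binding sites occupied at a given free ligand concentration."""
    if free_ligand_molar < 0 or kd_molar <= 0:
        raise ValueError("concentration must be >= 0 and Kd must be > 0")
    return free_ligand_molar / (kd_molar + free_ligand_molar)


# Hypothetical free biotin concentrations, from far below Kd to 1 nM.
for conc in (1e-15, 4e-14, 1e-12, 1e-9):
    print(f"{conc:.0e} M free biotin: {fraction_bound(conc):.5f} of sites occupied")
```

Under this simple model half of the sites are occupied when the free ligand concentration equals Kd, and at nanomolar concentrations essentially all sites are occupied, which is why release normally requires harsh conditions.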

Figure 15. Purification of A) PCR products and B) cycle sequencing products using streptavidin beads. The beads can be regenerated due to the non-denaturing elution conditions used.

In Paper III a technique for the non-denaturing release of biotin from streptavidin was investigated. Surprisingly, this could be accomplished with non-ionic aqueous solutions at elevated temperatures. The purification of both PCR products and cycle sequencing products is described in the paper. Briefly, the PCR or cycle sequencing reaction was performed using one biotinylated primer (figure 15). The resulting product was captured on the streptavidin beads and the complex was washed. Elution was then accomplished by heating to 70 °C for one second in deionized water. The washing was important in order to remove salt, which significantly reduces the elution efficiency even at low concentrations. Divalent salts, like MgCl2, are especially detrimental to the release of biotin from streptavidin, as can be seen in figure 16. Analyses in which the beads were subjected to two consecutive elutions clearly showed that no DNA remained on the beads after elution (figure 17). The relatively mild elution ensured that both the streptavidin and the biotin remained intact, enabling regeneration of the streptavidin beads.


Figure 16. Effect of different salts on elution efficiency.

Purification using this method has been employed for the preparation of PCR products for printing microarrays (Wirta unpublished). A modified version of this technique has also been used in the preparation of single-stranded templates for pyrosequencing. After capturing PCR products on streptavidin beads, the complex is treated with sodium hydroxide in order to separate the two strands. The non-biotinylated strand is discarded and the beads are washed prior to elution of the biotinylated strand (Holmberg unpublished). Further possible applications include solid-phase cloning and the purification of genomic DNA.

Figure 17. Elution efficiency. A) Biotinylated PCR product was bound to streptavidin beads and then eluted after washing. B) The beads were then subjected to a second elution. Peaks at 35 and 85 seconds correspond to markers, while the peak between 70 and 75 seconds corresponds to PCR product.


7.3 Comparison of the two techniques

Both techniques utilize magnetic beads as a solid phase and they have been automated on similar workstations (Magnatrix 8000 for the hybridization assay and Magnatrix 1200 for the streptavidin assay, both from Magnetic Biosolutions, Stockholm, Sweden). It might seem redundant to develop two quite similar methods, but they both have distinct advantages. The hybridization technique has the advantage of specificity, which allows for purification of multiplex cycle sequencing products. It also removes excess primer and misprimed sequencing products, which could interfere with the electrophoresis. Using streptavidin beads to capture biotinylated molecules is a more flexible approach, since the same beads can be used together with any biotinylated product, and the strong interaction allows more stringent washes. Both methods have the possibility of regenerating the beads, leading to significantly reduced reagent costs.

8 Comparative sequencing of H. pylori

Paper IV in this thesis describes a comparison of the cag pathogenicity island (PAI) in four clinical isolates of H. pylori. H. pylori is a common human pathogen that causes a number of gastric diseases, and therefore understanding the bacterial virulence factors and their impact on pathogenicity is of great importance. Two isolates were obtained from patients with gastric cancer and two from patients with duodenal ulcers. By selecting isolates from patients of the same age and sex as well as from the same geographic region (Enroth et al. 2000), the effects of host and environmental factors were minimized.

8.1 Helicobacter pylori

To date, 26 formally named Helicobacter species have been isolated from a number of mammals (Gueneau and Loiseaux-De Goer 2002). The human pathogen H. pylori was cultured for the first time by Barry Marshall and Robin Warren in 1982 (Marshall and Warren 1984). H. pylori is a spiral-shaped, microaerophilic, Gram-negative rod with unipolar sheathed flagella (figure 18A) (Lee et al. 1993; Covacci et al. 1999). It is primarily found in the antrum, but sometimes also in the duodenum or corpus of the human stomach (figure 18B) where it colonizes the mucus lining of the gastric epithelium. Approximately half of all humans in the world are infected with H. pylori, but the numbers are higher in developing countries and lower in industrialized countries, thus low socioeconomic status is associated with increased prevalence of infection (Parsonnet 1995; Pounder and Ng 1995; Mitchell 2001). Infection usually occurs early in life and it is chronic if not treated (Mitchell et al. 1992; Miyaji et al. 2000). H. pylori is believed to be spread through person-to-person contact, but further studies are needed to clarify the route of transmission (Covacci et al. 1999; Mitchell 2001; Björkholm et al. submitted).

Figure 18. Schematic pictures of A) H. pylori, a spiral shaped bacterium with unipolar flagella and B) a human stomach.

The link between H. pylori and gastro/duodenal disease was at first met with skepticism (Blaser 1987), but is now generally accepted. Infection by H. pylori always leads to inflammation, but in some individuals peptic ulcers or even gastric cancers develop (Lee et al. 1993). The formation of duodenal ulcer seems to be mutually exclusive to gastric ulcer or cancer. H. pylori predominantly colonizes the antrum of the stomach where there are no acid-producing parietal cells. Overproduction of acid can lead to acid leaking into the duodenum where intestinal cells are then replaced by gastric epithelial cells. This, in turn, enables H. pylori to colonize the duodenum and leads to an increased risk of duodenal ulcers. If, instead, the acid production is lower than normal, H. pylori can colonize the corpus, normally protected by low pH. This occurs during atrophy development in the corpus when the number of parietal cells decreases resulting in an increase of pH. The epithelium then changes to an antrum-like structure, allowing H. pylori to colonize the corpus and thus increasing the risk of both gastric ulcers and gastric cancer in the corpus (Israel and Peek 2001). Gastric cancers are the third most common form of cancer (Parkin et al. 2001), surpassed in frequency only by lung cancer and breast cancer, and H. pylori is classified as a class 1 carcinogen in humans (IARC 1994).

H. pylori strains isolated from unrelated individuals rarely have the same genetic fingerprint when analyzed by, for example, PFGE (Alm and Trust 1999). In 1999, H. pylori became the first species for which two unrelated strains had been fully sequenced (Alm et al. 1999). Strain 26695 was isolated from a British gastritis patient (Tomb et al. 1997), while strain J99 was isolated from an American duodenal ulcer patient (Alm et al. 1999). Comparison of the two completely sequenced genomes showed that much of the variation consisted of point mutations in the third base of the codons, and therefore had no impact on the expressed proteins. On the other hand, 7% of the genes were unique to the respective strains, indicating that more fundamental rearrangements of the genome do occur (Alm et al. 1999; Alm and Trust 1999; Doig et al. 1999). Most of the strain-specific genes were found in the two plasticity zones (PZ), which have a different G+C content compared to the rest of the genome, indicating horizontal transfer (Alm et al. 1999; Salama et al. 2000).

8.2 The cag pathogenicity island

Bacterial virulence traits can be encoded by particular chromosomal regions called pathogenicity islands (PAIs). PAIs are present in virulent strains of the bacteria, but absent or only sporadically present in less pathogenic strains. The PAIs often differ in G+C content compared to the rest of the genome, and are flanked by direct repeats or insertion sequence (IS) elements (Hacker et al. 1997). The presence of cryptic "mobility" genes, such as integrases, transposases, oris of plasmids or IS elements, indicates that they were probably acquired through horizontal transfer. PAIs are often somewhat unstable and partial or complete loss of the PAI is common (Hacker et al. 1997).

In H. pylori a PAI, named cag after the cytotoxin associated gene (cagA), can be found in more than 90% of all strains associated with severe gastric diseases (Ikenoue et al. 2001; Stein et al. 2001). The cag PAI is an approximately 40 kb genetic element flanked by 31 bp direct repeats, containing 27 genes, as shown in figure 19 (Censini et al. 1996; Akopyants et al. 1998; Stein et al. 2001). The flanking direct repeats are identical to the core sequence of the left and right arms of the IS element IS605; the second repeat is also the end of the glr gene (Censini et al. 1996; Stein et al. 2001). In some strains the cag PAI is split into two parts, either by an IS element or by two IS elements together with intervening DNA (Censini et al. 1996; Akopyants et al. 1998; Stein et al. 2001).
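The observation that PAIs often differ in G+C content from the rest of the genome can be made concrete with a small sliding-window calculation. The sketch below is not part of the analysis in Paper IV; the window size, the threshold and the toy sequence are arbitrary assumptions chosen to keep the example short.

```python
# Illustrative sketch: flag windows whose G+C content deviates from the
# average of the whole sequence. Parameters and sequence are made up.


def gc_content(seq: str) -> float:
    """Fraction of G and C bases in a non-empty DNA sequence."""
    if not seq:
        raise ValueError("sequence must not be empty")
    seq = seq.upper()
    return (seq.count("G") + seq.count("C")) / len(seq)


def windowed_gc(seq: str, window: int, step: int):
    """Return (start, gc) for each full window along the sequence."""
    return [(i, gc_content(seq[i:i + window]))
            for i in range(0, len(seq) - window + 1, step)]


def flag_atypical_windows(seq: str, window: int, step: int, threshold: float):
    """Return windows whose G+C content differs from the overall value by more than threshold."""
    overall = gc_content(seq)
    return [(start, gc) for start, gc in windowed_gc(seq, window, step)
            if abs(gc - overall) > threshold]


# Toy sequence: an AT-rich background with a GC-rich block in the middle.
toy_genome = "AT" * 50 + "GC" * 25 + "AT" * 50
print(flag_atypical_windows(toy_genome, window=50, step=50, threshold=0.5))
```

On real genomes the windows would be kilobases wide and the threshold far smaller; atypical composition is only a hint of horizontal transfer and would be read alongside other evidence such as flanking repeats and mobility genes.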


Figure 19. Overview of the cag PAI. The locations of the 31 bp flanking repeats are marked by the black arrows. Grey arrows mark genes and point in their direction of transcription. Known homologs are listed just below the arrows.

A number of different naming schemes have been used for the cag genes. The most common is the one used for strain 26695, designating the genes HP0520 to HP0547. The cag PAI encodes a type IV secretion system (TFSS), involved in a number of pathogenic processes, for example IL-8-induction and conformational changes of epithelial cells (Fischer et al. 2001; Stein et al. 2001). The CagA protein, encoded by the HP0547 gene (cagA) in the cag PAI, is translocated into the host cells by the TFSS. Once inside the host cell it is tyrosine phosphorylated, which starts a cascade of host responses, eventually leading to morphological changes called the scattering phenotype (Segal et al. 1999; Odenbreit et al. 2000; Stein et al. 2000; Fischer et al. 2001; Stein et al. 2001; Selbach et al. 2002). About half of the genes in the cag PAI, not including cagA, are also involved in the induction of expression and secretion of the chemokine Interleukin-8 (IL-8) by the host cells (Censini et al. 1996; Fischer et al. 2001; Stein et al. 2001; Selbach et al. 2002). It has been proposed that this induction is caused either by translocation of a second effector molecule or by the binding of the secretion apparatus to host cell receptors (Stein et al. 2001; Selbach et al. 2002). IL-8 induces the inflammatory response of the host.


8.3 Strategy for comparing the cag PAI in four clinical isolates of H. pylori

Isolates of H. pylori were obtained from patient biopsies and genomic DNA was prepared after minimal rounds of culturing. Since H. pylori has been completely sequenced twice, directed sequencing of a specific genomic region was possible without subcloning or mapping. The two cag PAI sequences obtained from the completed genomes were aligned, and primers for PCR and sequencing were designed using this alignment as a template (figure 20). The primers were mainly located in regions where the two genomes agreed and care was taken to avoid repetitive sequences. In order to obtain a satisfactory coverage of the region, primers were designed to produce PCR products of approximately 700 bp with 100 bp overlap between adjacent PCR products. Sequencing in both directions was then performed using the same primers as in the PCR and the obtained sequences were assembled into contigs.

A number of PCR reactions failed on one or more of the four strains. Analysis of the obtained sequence data showed that this was mainly due to nucleotide variation between different H. pylori strains. Therefore, new sets of primers were designed using the strain-specific sequences obtained in order to close the gaps between the contigs. Some regions were impossible to amplify with PCR due to either repetitive sequences or large insertions of unknown sequence. These regions were analyzed using cycle sequencing directly on genomic DNA. Even then, it proved impossible to completely resolve two regions. One of them, in the middle of gene HP0527 and approximately 3 kb long, consisted of previously described repetitive units (Liu et al. 1999), making directed sequencing impossible. The second region was only present in one of the strains and seemed to be a very large insertion or rearrangement of DNA from other parts of the H. pylori chromosome (further discussed below).
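The tiling scheme described above (products of roughly 700 bp overlapping by 100 bp) can be sketched as simple arithmetic. The code below only computes idealized amplicon coordinates over a region of a given length; it does not pick primers, check for repeats or account for strain variation, and the function name and the exact 40,000 bp region length are assumptions made for illustration.

```python
# Illustrative sketch: idealized coordinates for overlapping PCR products.
# Real primer placement depends on sequence conservation and repeats.


def tile_amplicons(region_length: int, amplicon_length: int = 700, overlap: int = 100):
    """Return (start, end) pairs covering [0, region_length) with fixed overlap.

    Coordinates are 0-based and end-exclusive; the final amplicon is
    truncated at the end of the region.
    """
    if region_length <= 0:
        raise ValueError("region_length must be positive")
    if not 0 <= overlap < amplicon_length:
        raise ValueError("overlap must be non-negative and shorter than the amplicon")
    step = amplicon_length - overlap
    tiles = []
    start = 0
    while True:
        end = min(start + amplicon_length, region_length)
        tiles.append((start, end))
        if end == region_length:
            break
        start += step
    return tiles


# Hypothetical region of exactly 40 kb, the approximate size of the cag PAI.
tiles = tile_amplicons(40_000)
print(len(tiles), "amplicons; first:", tiles[0], "last:", tiles[-1])
print("sequencing reads if each product is read in both directions:", 2 * len(tiles))
```

With these idealized numbers each new product advances 600 bp, so a 40 kb region needs on the order of 67 products per strain before any failed reactions are redesigned.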
When the cag PAI had been assembled for each strain individually, the consensus sequences could be compared. The four sequences (each consisting of two or three contigs) were aligned together with the cag PAI sequence from the two completed genomes of strains 26695 and J99 using multi PipMaker software (Schwartz et al. 2000). Localization of the cag PAI genes was straightforward since the completed genomes had already been annotated. The open reading frames were translated into amino acid sequence and compared. Differences between the strains included single nucleotide differences, insertions/deletions, insertion sequence elements (IS elements) and major rearrangements.


Figure 20. Strategy for comparing the cag PAI of four different H. pylori strains.


8.4 Nucleotide and amino acid sequence variation

When the nucleotide sequences of each gene were compared in the four sequenced strains, the detected variation ranged from 1.3% to 6.5%. The corresponding variation for the amino acid sequence ranged from 0.3% to 10.3%. If a nucleotide variation occurs in the third base of the codon it will not always affect the encoded amino acid sequence, due to the degenerate genetic code. A totally random distribution of nucleotide mutations would result in roughly 67% affecting base one or two in the codons involved, and would thus mostly be non-synonymous. In the cag PAI of H. pylori the fraction of non-synonymous mutations ranges from 6% in gene HP0525 to 64% in gene HP0547. This suggests that the amino acid sequence encoded by some genes cannot be altered without loss of viability, i.e. the structure of the protein must be conserved to maintain functionality, but other genes can be mutated more freely. Interestingly, the gene product of HP0547 (CagA), which has the highest frequency of non-synonymous mutations, is translocated into the host cells, where it triggers a number of cellular responses. For this protein, variation might be advantageous in order to allow adaptation to new hosts. Other proteins forming the core structure of the TFSS, allowing CagA translocation, are less prone to mutation.

A number of small insertions or deletions, each comprising only a few bases, can be found when comparing the four strains. The vast majority of these are located in the non-coding regions between genes. Only a few are located intragenically, and only one of them disturbs the reading frame of a gene (HP0536 in Ca73, one of the strains obtained from a cancer patient) while the others are in multiples of three bases. It is difficult to judge the importance of nucleotide variations in intergenic regions, including promoter regions, since very little is known about the regulation of gene expression in H. pylori.
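The distinction between synonymous and non-synonymous differences, and the effect of indel length on the reading frame, can be illustrated with a short sketch. It is not the analysis pipeline of Paper IV: it assumes two gap-free, in-frame coding sequences of equal length, counts differing codons rather than individual mutations, and uses the standard codon assignments. The function names and the example sequences are hypothetical.

```python
# Illustrative sketch: classify codon differences between two aligned,
# gap-free coding sequences using the standard genetic code.
from itertools import product

BASES = "TCAG"
# Amino acids in the conventional TCAG ordering of the standard code.
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(codon): aa
               for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}


def classify_codon_differences(ref: str, alt: str) -> dict:
    """Count differing codons that keep or change the encoded amino acid."""
    if len(ref) != len(alt) or len(ref) % 3 != 0:
        raise ValueError("sequences must be equally long and a multiple of three")
    counts = {"synonymous": 0, "non_synonymous": 0}
    for i in range(0, len(ref), 3):
        codon_ref, codon_alt = ref[i:i + 3].upper(), alt[i:i + 3].upper()
        if codon_ref == codon_alt:
            continue
        same_aa = CODON_TABLE[codon_ref] == CODON_TABLE[codon_alt]
        counts["synonymous" if same_aa else "non_synonymous"] += 1
    return counts


def preserves_reading_frame(indel_length: int) -> bool:
    """An insertion or deletion keeps the frame only if its length is a multiple of three."""
    return indel_length % 3 == 0


# Made-up example: GCT -> GCC keeps alanine, AAA -> AGA changes lysine to arginine.
print(classify_codon_differences("ATGGCTAAAGGT", "ATGGCCAGAGGT"))
print(preserves_reading_frame(3), preserves_reading_frame(1))
```

Dividing the non-synonymous count by the total number of differing codons gives a per-gene fraction comparable in spirit to the 6% to 64% range quoted above, although the exact counting method used in the paper may differ.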

8.5 Major rearrangements

A number of major rearrangements were found in the four studied H. pylori strains. An IS element was located in one of the strains (Ca52) obtained from a cancer patient. This IS element has previously been described in a different strain at another location of the cag PAI. Since the IS element in Ca52 was located between two clusters of genes, both transcribed towards the IS element, it should not have any effect on the expression of the cag genes. It was also shown that all four strains were able to translocate CagA and induce IL-8 in host cells, indicating functional cag PAIs.


One of the strains obtained from a duodenal ulcer patient (Du52:2) harbored a major rearrangement near the 3’ end of the cag PAI. This rearrangement was not completely resolved due to its size and the presence of duplicated sequences. When primer walking was performed from the 5’ end, genes HP0509 and HP0510 were found. These genes are located upstream from the cag PAI in other strains. Primer walking from the 3’ end revealed a duplication of the end motif of the cag PAI, together with the remnants of an IS element. There are at least two possible explanations for this feature: DNA including at least one duplication may have been inserted into the cag PAI, or a chromosomal rearrangement may have occurred. Further investigations are needed in order to determine the exact nature of this region. Perhaps the most surprising finding in these four newly sequenced strains was that three of them contained a novel gene, while the fourth had a deletion of the corresponding region. This new gene, HP0521B, was located where the gene HP0521 is normally found. The gene started in the sequence preceding HP0521, but in a different reading frame, while its stop codon was the same as for HP0521. A number of additional strains were examined, either by sequencing or by PCR, in order to determine how common this new variant was. Approximately half of the tested Swedish strains harbored the novel HP0521B. The additional strains also showed that the region close to the stop codons of the genes was rather unstable, displaying a number of different one or two base insertions or deletions. The resulting frame shifts have led to variation in the N-terminal regions of the translated proteins and resulted in the co-translation of HP0521 and HP0522 in two strains. Even though the cag PAI has been extensively studied and several variable regions have been previously characterized there is still much to discover.
This study shows the immense variability that can occur between different bacterial strains, even within the same species. Therefore, sequencing of one complete genome of a microbial organism provides a far from complete description of the species’ nature.


9 Concluding remarks

Even though the landmark accomplishment of sequencing the human genome is now more or less completed, there is still a need for further development of the techniques used for genome analysis. Recently a prize was announced that will be awarded to those who enable the human genome to be sequenced for $1000 or less. Achieving this goal will probably require the development of an alternative approach to Sanger sequencing, where single molecule sequencing is one candidate. Nevertheless, improved approaches to conventional sequencing will probably be needed for years to come, until a new method has been developed and implemented. The focus of the genomic field is now shifting towards comparative genomics either between species or between individuals of the same species. In this thesis the virulence-associated cag PAI from four strains of H. pylori has been compared. The results show that even in a rather well studied region of a genome there are novel features, especially in prokaryotes where the variability is often high.


10 Acknowledgments

Thank you...

... all of you who work or have worked at the department. There have been many fun DNA corner trips, Christmas parties, dissertation parties, birthday coffee breaks and beers to celebrate papers, not to mention ordinary lunches.

... Joakim. For being my supervisor and for your fantastic ability to always see something positive in the most discouraging results! For always being keen to hear a bit of gossip and for knowing where to go when you are craving sweets. Take care of yourself!

... Mathias, for constantly spreading enthusiasm around you and for your interest in the automation of Anders' and my methods.

... Anders. For all your help with computers, robots and magnetic beads. Also for your company at conferences, nice parties and good dinners.

... Deirdre. Thank you for helping me out at the lab and teaching me the basics. It was nice to have you as a lab neighbor (until Afshin occupied the bench between us).

... Sophia, Per-Åke, Stefan, Peter N and Afshin, for always taking the time.

... all of you at SMI. Above all Annelie, Christina and Lars. Thank you for all your help and support with the Helicobacter project. Thank you also for pleasant company at conferences in places as far apart as Perth and Helsingör.

... Bahram. The sequencing guru! You mean an incredible amount to everyone who works with sequencing at the department. We are lucky to have you!

... all the other technicians who have worked with sequencing over the years. I will not even try to list you, because then I would be sure to forget someone! It has been fun sharing a lab with you since we moved, and you have always been incredibly helpful when I have needed to use an instrument.

... Monica, Tina, Pia, Mona and Tommy. For keeping us confused PhD students in order!

... the K93 crowd: Martin, Henrik, Stina and Kristoffer. Now you are the only one left, Stina! It has been 10 fun years since we started at Teknis. I will never forget India!

... Micke and Peru, for all the help with Linux, scripts and computers in general.

... my former and current office mates. Especially Anna G, Nina and Stina, for being just as sensitive to the cold and as crazy about sweets as I am!

... all the staff at Magnetic Biosolutions. Henrik for programming help and nice dinners. Petra for letting me share a room with you at GSAC. Robert for humor and company (well, sort of) on New Year's Day; read this and you will know what DNA is afterwards! Morten for pleasant company at conferences and for all the help I have received with magnetic beads! Good luck with the salmon.

... Stina. For always being there when I have had silly questions that one does not want to ask anyone else, and for all the lunches we have had together. Good luck with your thesis and with everything else!

... Martin, for defending your thesis three weeks before me so that I have someone to ask about all the practical matters ;)

... the girls: Jenny G, Stina, Ingrid and Karin. For all the nice dinners. It must be about time for another one soon...

... Jenny A and Ingrid. For being my best friends, even though we do not always see each other as often as we should.

... all the other friends I have out there in the real world.

... all the members of Pq and Kårspexet over the years, for fun parties, a lot of hard work and a lot of friendship!

... everyone I have forgotten who ought to be mentioned here; you know why yourselves!

... Mum, Dad and Peter. You are always in my heart. You are the best family there is!

... Johan. I love you. Thank you for your endless support and patience this autumn.

... Dynal, the Swedish Foundation for Strategic Research (Stiftelsen för strategisk forskning), the Swedish Research Council (Vetenskapsrådet) and the Knut and Alice Wallenberg Foundation (Knut och Alice Wallenbergs stiftelse) for financial support.


11 References

Adams, M. D., Kelley, J. M., Gocayne, J. D., Dubnick, M., Polymeropoulos, M. H., Xiao, H., Merril, C. R., Wu, A., Olde, B., Moreno, R. F. and et al. (1991). "Complementary DNA sequencing: expressed sequence tags and human genome project." Science 252(5013): 1651-6. Adams, M. D., Dubnick, M., Kerlavage, A. R., Moreno, R., Kelley, J. M., Utterback, T. R., Nagle, J. W., Fields, C. and Venter, J. C. (1992). "Sequence identification of 2,375 human brain genes." Nature 355(6361): 632-4. Agaton, C., Unneberg, P., Sievertzon, M., Holmberg, A., Ehn, M., Larsson, M., Odeberg, J., Uhlen, M. and Lundeberg, J. (2002). "Gene expression analysis by signature pyrosequencing." Gene 289(1-2): 31-9. Akopyants, N. S., Clifton, S. W., Kersulyte, D., Crabtree, J. E., Youree, B. E., Reece, C. A., Bukanov, N. O., Drazek, E. S., Roe, B. A. and Berg, D. E. (1998). "Analyses of the cag pathogenicity island of Helicobacter pylori." Mol Microbiol 28(1): 37-53. Albretsen, C., Kalland, K. H., Haukanes, B. I., Havarstein, L. S. and Kleppe, K. (1990). "Applications of magnetic beads with covalently attached oligonucleotides in hybridization: isolation and detection of specific measles virus mRNA from a crude cell lysate." Anal Biochem 189(1): 40-50. Alm, R. A., Ling, L. S., Moir, D. T., King, B. L., Brown, E. D., Doig, P. C., Smith, D. R., Noonan, B., Guild, B. C., deJonge, B. L., Carmel, G., Tummino, P. J., Caruso, A., Uria-Nickelsen, M., Mills, D. M., Ives, C., Gibson, R., Merberg, D., Mills, S. D., Jiang, Q., Taylor, D. E., Vovis, G. F. and Trust, T. J. (1999). "Genomic-sequence comparison of two unrelated isolates of the human gastric pathogen Helicobacter pylori." Nature 397(6715): 176-80. Alm, R. A. and Trust, T. J. (1999). "Analysis of the genetic diversity of Helicobacter pylori: the tale of two genomes." J Mol Med 77(12): 834-46. Altschul, S. F., Gish, W., Miller, W., Myers, E. W. and Lipman, D. J. (1990). "Basic local alignment search tool." J Mol Biol 215(3): 403-10.
Andersson, B., Povinelli, C. M., Wentland, M. A., Shen, Y., Muzny, D. M. and Gibbs, R. A. (1994). "Adaptor-based uracil DNA glycosylase cloning simplifies shotgun library construction for large-scale sequencing." Anal Biochem 218(2): 300-8. Andersson, B., Lu, J., Edwards, K. E., Muzny, D. M. and Gibbs, R. A. (1996a). "Method for 96-well M13 DNA template preparations for large-scale sequencing." Biotechniques 20(6): 1022-7. Andersson, B., Wentland, M. A., Ricafrente, J. Y., Liu, W. and Gibbs, R. A. (1996b). "A "double adaptor" method for improved shotgun library construction." Anal Biochem 236(1): 107-13. Andersson, B., Lu, J., Shen, Y., Wentland, M. A. and Gibbs, R. A. (1997). "Simultaneous shotgun sequencing of multiple cDNA clones." DNA Seq 7(2): 63-70. Ansorge, W., Sproat, B. S., Stegemann, J. and Schwager, C. (1986). "A non-radioactive automated method for DNA sequence determination." J Biochem Biophys Methods 13(6): 315-23.


Ansorge, W., Sproat, B., Stegemann, J., Schwager, C. and Zenke, M. (1987). "Automated DNA sequencing: ultrasensitive detection of fluorescent bands during electrophoresis." Nucleic Acids Res 15(11): 4593-602. Avery, O., MacLeod, C. and McCarty, M. (1944). "Studies on the chemical nature of the substance inducing transformation of pneumococcal types. Induction of transformation by a desoxyribonucleic acid fraction isolated from Pneumococcus type III." J Exp Med 79: 137-158. Barnes, W. M. (1994). "PCR amplification of up to 35-kb DNA with high fidelity and high yield from lambda bacteriophage templates." Proc Natl Acad Sci U S A 91(6): 2216-20. Bateman, A., Birney, E., Durbin, R., Eddy, S. R., Howe, K. L. and Sonnhammer, E. L. L. (2000). "The Pfam Protein Families Database." Nucleic Acids Res 28(1): 263-266. Beck, S. and Alderton, R. P. (1993). "A strategy for the amplification, purification, and selection of M13 templates for large-scale DNA sequencing." Anal Biochem 212(2): 498-505. Besemer, J. and Borodovsky, M. (1999). "Heuristic approach to deriving models for gene finding." Nucleic Acids Res 27(19): 3911-3920. Birnboim, H. C. and Doly, J. (1979). "A rapid alkaline extraction procedure for screening recombinant plasmid DNA." Nucleic Acids Res 7(6): 1513-23. Björkholm, B., Guruge, J., Karlsson, M., O'Donnell, D., Engstrand, L., Falk, P. and Gordon, J. (submitted). "Gnotobiotic transgenic mice reveal that transmission of Helicobacter pylori is facilitated by loss of acid-producing parietal cells in donor and recipients." Blaser, M. J. (1987). "Gastric Campylobacter-like organisms, gastritis, and peptic ulcer disease." Gastroenterology 93(2): 371-83. Bonfield, J., Smith, K. and Staden, R. (1995). "A new DNA sequence assembly program." Nucleic Acids Res 23(24): 4992-4999. Bray, N., Dubchak, I. and Pachter, L. (2003). "AVID: A Global Alignment Program." Genome Res. 13(1): 97-102. Bray, N. and Pachter, L. (2003).
"MAVID multiple alignment server." Nucleic Acids Res 31(13): 3525-3526. Burge, C. and Karlin, S. (1997). "Prediction of Complete Gene Structures in Human Genomic DNA." J Mol Biol 268(1): 78-94. Butcher, J. (2001). ""Celeras method failed", says Human Genome Project." The Lancet 357: 531. Carothers, A. M., Urlaub, G., Mucha, J., Grunberger, D. and Chasin, L. A. (1989). "Point mutation analysis in a mammalian gene: rapid preparation of total RNA, PCR amplification of cDNA, and Taq sequencing by a novel method." Biotechniques 7(5): 494-6, 498-9. Carraro, D. M., Camargo, A. A., Salim, A. C., Grivet, M., Vasconcelos, A. T. and Simpson, A. J. (2003). "PCR-assisted contig extension: stepwise strategy for bacterial genome closure." Biotechniques 34(3): 626-8, 630-2. Censini, S., Lange, C., Xiang, Z., Crabtree, J. E., Ghiara, P., Borodovsky, M., Rappuoli, R. and Covacci, A. (1996). "cag, a pathogenicity island of Helicobacter pylori,


encodes type I-specific and disease-associated virulence factors." Proc Natl Acad Sci U S A 93(25): 14648-53. Chien, A., Edgar, D. B. and Trela, J. M. (1976). "Deoxyribonucleic acid polymerase from the extreme thermophile Thermus aquaticus." J Bacteriol 127(3): 1550-7. Clarke, S. C. and Diggle, M. A. (2002). "Automated PCR/sequence template purification." Mol Biotechnol 21(3): 221-4. Covacci, A., Telford, J. L., Del Giudice, G., Parsonnet, J. and Rappuoli, R. (1999). "Helicobacter pylori virulence and genetic geography." Science 284(5418): 1328-33. Crick, F. H. (1958). "On protein synthesis." Symp Soc Exp Biol 12: 138-63. Crick, F. H., Leslie Barnett, F., Brenner, S. and Watts-Tobin, R. (1961). "General nature of the genetic code for proteins." Nature 192: 1227-1232. Dean, F. B., Nelson, J. R., Giesler, T. L. and Lasken, R. S. (2001). "Rapid amplification of plasmid and phage DNA using Phi 29 DNA polymerase and multiply-primed rolling circle amplification." Genome Res 11(6): 1095-9. Delcher, A. L., Kasif, S., Fleischmann, R. D., Peterson, J., White, O. and Salzberg, S. L. (1999). "Alignment of whole genomes." Nucleic Acids Res 27(11): 2369-76. Delcher, A. L., Phillippy, A., Carlton, J. and Salzberg, S. L. (2002). "Fast algorithms for large-scale genome alignment and comparison." Nucleic Acids Res 30(11): 2478-83. Detter, J. C., Jett, J. M., Lucas, S. M., Dalin, E., Arellano, A. R., Wang, M., Nelson, J. R., Chapman, J., Lou, Y., Rokhsar, D., Hawkins, T. L. and Richardson, P. M. (2002). "Isothermal strand-displacement amplification applications for high-throughput genomics." Genomics 80(6): 691-8. Doig, P., de Jonge, B. L., Alm, R. A., Brown, E. D., Uria-Nickelsen, M., Noonan, B., Mills, S. D., Tummino, P., Carmel, G., Guild, B. C., Moir, D. T., Vovis, G. F. and Trust, T. J. (1999). "Helicobacter pylori physiology predicted from genomic comparison of two strains." Microbiol Mol Biol Rev 63(3): 675-707.
Drmanac, R., Drmanac, S., Strezoska, Z., Paunesku, T., Labat, I., Zeremski, M., Snoddy, J., Funkhouser, W. K., Koop, B., Hood, L. and et al. (1993). "DNA sequence determination by hybridization: a strategy for efficient large-scale sequencing." Science 260(5114): 1649-52. Drmanac, R. and Drmanac, S. (2001). "Sequencing by hybridization arrays." Methods Mol Biol 170: 39-51. Drmanac, R., Drmanac, S., Chui, G., Diaz, R., Hou, A., Jin, H., Jin, P., Kwon, S., Lacy, S., Moeur, B., Shafto, J., Swanson, D., Ukrainczyk, T., Xu, C. and Little, D. (2002). "Sequencing by hybridization (SBH): advantages, achievements, and opportunities." Adv Biochem Eng Biotechnol 77: 75-101. D'Souza, C. R., Deugau, K. V. and Spencer, J. H. (1989). "A simplified procedure for cDNA and genomic library construction using nonpalindromic oligonucleotide adaptors." Biochem Cell Biol 67(4-5): 205-9. Ehn, M., Ahmadian, A., Nilsson, P., Lundeberg, J. and Hober, S. (2002). "Escherichia coli single-stranded DNA-binding protein, a molecular tool for improved sequence quality in pyrosequencing." Electrophoresis 23(19): 3289-99.

Elkin, C., Kapur, H., Smith, T., Humphries, D., Pollard, M., Hammon, N. and Hawkins, T. (2002). "Magnetic bead purification of labeled DNA fragments for high-throughput capillary electrophoresis sequencing." Biotechniques 32(6): 1296, 1298-1300, 1302. Elkin, C. J., Richardson, P. M., Fourcade, H. M., Hammon, N. M., Pollard, M. J., Predki, P. F., Glavina, T. and Hawkins, T. L. (2001). "High-throughput plasmid purification for capillary sequencing." Genome Res 11(7): 1269-74. Enroth, H., Åkerlund, T., Sillen, A. and Engstrand, L. (2000). "Clustering of clinical strains of Helicobacter pylori analyzed by two-dimensional gel electrophoresis." Clin Diagn Lab Immunol 7(2): 301-6. Eperon, I. C. (1986). "Rapid preparation of bacteriophage DNA for sequence analysis in sets of 96 clones, using filtration." Anal Biochem 156(2): 406-12. Ewing, B. and Green, P. (1998). "Base-calling of automated sequencer traces using phred. II. Error probabilities." Genome Res 8(3): 186-94. Ewing, B., Hillier, L., Wendl, M. C. and Green, P. (1998). "Base-calling of automated sequencer traces using phred. I. Accuracy assessment." Genome Res 8(3): 175-85. Fangan, B. M., Dahlberg, O. J., Deggerdal, A. H., Bosnes, M. and Larsen, F. (1999). "Automated system for purification of dye-terminator sequencing products eliminates up-stream purification of templates." Biotechniques 26(5): 980-3. Fire, A. and Xu, S. Q. (1995). "Rolling replication of short DNA circles." Proc Natl Acad Sci U S A 92(10): 4641-5. Fischer, W., Puls, J., Buhrdorf, R., Gebert, B., Odenbreit, S. and Haas, R. (2001). "Systematic mutagenesis of the Helicobacter pylori cag pathogenicity island: essential genes for CagA translocation in host cells and induction of interleukin-8." Mol Microbiol 42(5): 1337-48. Fleischmann, R. D., Adams, M. D., White, O., Clayton, R. A., Kirkness, E. F., Kerlavage, A. R., Bult, C. J., Tomb, J. F., Dougherty, B. A., Merrick, J. M. and et al. (1995).
"Whole-genome random sequencing and assembly of Haemophilus influenzae Rd." Science 269(5223): 496-512. Franklin, R. and Gosling, R. (1953). "Molecular configuration in sodium thymonucleate." Nature 171(4356): 740-741. Fraser, C. M. and Fleischmann, R. D. (1997). "Strategies for whole microbial genome sequencing and analysis." Electrophoresis 18(8): 1207-16. Frohme, M., Camargo, A. A., Czink, C., Matsukuma, A. Y., Simpson, A. J., Hoheisel, J. D. and Verjovski-Almeida, S. (2001). "Directed gap closure in large-scale sequencing projects." Genome Res 11(5): 901-3. Fry, G., Lachenmeier, E., Mayrand, E., Giusti, B., Fisher, J., Johnston-Dow, L., Cathcart, R., Finne, E. and Kilaas, L. (1992). "A new approach to template purification for sequencing applications using paramagnetic particles." Biotechniques 13(1): 124-131. Gordon, D., Abajian, C. and Green, P. (1998). "Consed: a graphical tool for sequence finishing." Genome Res 8(3): 195-202. Grayburn, W. S. and Sims, T. L. (1998). "Anchored oligo(dT) primers for automated dye terminator DNA sequencing." Biotechniques 25(3): 340-1, 344-6.

Green, E. D. (2001). "Strategies for the systematic sequencing of complex genomes." Nat Rev Genet 2(8): 573-83. Green, P. (2002). "Whole-genome disassembly." Proc Natl Acad Sci U S A 99(7): 4143-4. Gueneau, P. and Loiseaux-De Goer, S. (2002). "Helicobacter: molecular phylogeny and the origin of gastric colonization in the genus." Infect Genet Evol 1(3): 215-23. Hacker, J., Blum-Oehler, G., Mühldorfer, I. and Tschäpe, H. (1997). "Pathogenicity islands of virulent bacteria: structure, function and impact on microbial evolution." Mol Microbiol 23(6): 1089-1097. Hadd, A. G., Goard, M. P., Rank, D. R. and Jovanovich, S. B. (2000). "Sub-microliter DNA sequencing for capillary array electrophoresis." J Chromatogr A 894(1-2): 191-201. Harris, D., Engelstein, M., Parry, R., Smith, J., Mabuchi, M. and Millipore, J. L. (2002). "High-speed plasmid isolation using 96-well, size-exclusion filter plates." Biotechniques 32(3): 626-8, 630-1. Hawkins, T. L., O'Connor-Morin, T., Roy, A. and Santillan, C. (1994). "DNA purification and isolation using a solid-phase." Nucleic Acids Res 22(21): 4543-4. Haymerle, H., Herz, J., Bressan, G. M., Frank, R. and Stanley, K. K. (1986). "Efficient construction of cDNA libraries in plasmid expression vectors using an adaptor strategy." Nucleic Acids Res 14(21): 8615-24. Heiner, C. R., Hunkapiller, K. L., Chen, S. M., Glass, J. I. and Chen, E. Y. (1998). "Sequencing multimegabase-template DNA with BigDye terminator chemistry." Genome Res 8(5): 557-61. Hofmann, K., Bucher, P., Falquet, L. and Bairoch, A. (1999). "The PROSITE database, its status in 1999." Nucl. Acids. Res. 27(1): 215-219. Hohl, M., Kurtz, S. and Ohlebusch, E. (2002). "Efficient multiple genome alignment." Bioinformatics 18(90001): 312S-320. Holmes, D. S. and Quigley, M. (1981). "A rapid boiling method for the preparation of bacterial plasmids." Anal Biochem 114(1): 193-7. Hultman, T., Ståhl, S., Hornes, E. and Uhlén, M. (1989).
"Direct solid phase sequencing of genomic and plasmid DNA using magnetic beads as solid support." Nucleic Acids Res 17(13): 4937-46. Hultman, T., Murby, M., Ståhl, S., Hornes, E. and Uhlén, M. (1990). "Solid phase in vitro mutagenesis using plasmid DNA template." Nucleic Acids Res 18(17): 5107-12. Hultman, T. and Uhlén, M. (1994). "Solid-phase cloning to create sublibraries suitable for DNA sequencing." J Biotechnol 35(2-3): 229-38. Hyman, E. D. (1988). "A new method of sequencing DNA." Anal Biochem 174(2): 423-36. Høgdall, E., Boye, K. and Vuust, J. (1999). "Simple preparation method of PCR fragments for automated DNA sequencing." J Cell Biochem 73(4): 433-6. IARC (1994). "Schistosomes, liver flukes and Helicobacter pylori. IARC Working Group on the Evaluation of Carcinogenic Risks to Humans. Lyon, 7-14 June 1994." IARC Monogr Eval Carcinog Risks Hum 61: 1-241. Ikenoue, T., Maeda, S., Ogura, K., Akanuma, M., Mitsuno, Y., Imai, Y., Yoshida, H., Shiratori, Y. and Omata, M. (2001). "Determination of Helicobacter pylori

virulence by simple gene analysis of the cag pathogenicity island." Clin Diagn Lab Immunol 8(1): 181-6. Innis, M. A., Myambo, K. B., Gelfand, D. H. and Brow, M. A. (1988). "DNA sequencing with Thermus aquaticus DNA polymerase and direct sequencing of polymerase chain reaction-amplified DNA." Proc Natl Acad Sci U S A 85(24): 9436-40. Israel, D. A. and Peek, R. M. (2001). "Pathogenesis of Helicobacter pylori-induced gastric inflammation." Aliment Pharmacol Ther 15: 1271-1290. Itoh, M., Kitsunai, T., Akiyama, J., Shibata, K., Izawa, M., Kawai, J., Tomaru, Y., Carninci, P., Shibata, Y., Ozawa, Y., Muramatsu, M., Okazaki, Y. and Hayashizaki, Y. (1999). "Automated filtration-based high-throughput plasmid preparation system." Genome Res 9(5): 463-70. Jett, J. H., Keller, R. A., Martin, J. C., Marrone, B. L., Moyzis, R. K., Ratliff, R. L., Seitzinger, N. K., Shera, E. B. and Stewart, C. C. (1989). "High-speed DNA sequencing: an approach based upon fluorescence detection of single molecules." J Biomol Struct Dyn 7(2): 301-9. Johnson, A. F., Wang, R., Ji, H., Chen, D., Guilfoyle, R. A. and Smith, L. M. (1996). "Purification of single-stranded M13 DNA by cooperative triple-helix-mediated affinity capture." Anal Biochem 234: 83-95. Jones, L. B. and Hardin, S. H. (1998). "Octamer-primed cycle sequencing using dye-terminator chemistry." Nucleic Acids Res 26(11): 2824-6. Ju, J., Kheterpal, I., Scherer, J. R., Ruan, C., Fuller, C. W., Glazer, A. N. and Mathies, R. A. (1995a). "Design and synthesis of fluorescence energy transfer dye-labeled primers and their application for DNA sequencing and analysis." Anal Biochem 231(1): 131-40. Ju, J., Ruan, C., Fuller, C. W., Glazer, A. N. and Mathies, R. A. (1995b). "Fluorescence energy transfer dye-labeled primers for DNA sequencing and analysis." Proc Natl Acad Sci U S A 92(10): 4347-51. Ju, J. (2002). "DNA sequencing with solid-phase-capturable dideoxynucleotides and energy transfer primers."
Anal Biochem 309(1): 35-9. Karger, A. E., Harris, J. M. and Gesteland, R. F. (1991). "Multiwavelength fluorescence detection for DNA sequencing using capillary electrophoresis." Nucleic Acids Res 19(18): 4955-62. Khan, A. S., Wilcox, A. S., Hopkins, J. A. and Sikela, J. M. (1991). "Efficient double stranded sequencing of cDNA clones containing long poly(A) tails using anchored poly(dT) primers." Nucleic Acids Res 19(7): 1715. Khorana, H. G. (1965). "Polynucleotide synthesis and the genetic code." Fed Proc 24(6): 1473-87. Konecki, D. S. and Phillips, J. J. (1998). "TurboPrep II: an inexpensive, high-throughput plasmid template preparation protocol." Biotechniques 24(2): 286-8, 290-3. Kostina, M., Azhikina, T., Gorodentseva, T., Berg, D. and Sverdlov, E. (2000). "Contiguous strings of strongly binding short oligonucleotides as a useful tool for completing sequencing experiments." DNA Seq 10(6): 355-64. Kotler, L., Sobolev, I. and Ulanovsky, L. (1994). "DNA sequencing: modular primers for automated walking." Biotechniques 17(3): 554-9.

Kukanskis, K. A., Siddiquee, Z., Shohet, R. V. and Garner, H. R. (2000). "Mix of sequencing technologies for sequence closure: an example." Biotechniques 28(4): 630-2, 634. Kumar, S., Fuller, C. W., Nampalli, S., Khot, M., Livshin, I., Sun, L., Hamilton, S., Samols, S. B., Mamone, J. A., Hujer, K. M., McArdle, B. F., Nelson, J. R. and Duthie, S. (1999). "Uniform band intensities in fluorescent dye terminator sequencing." Nucleosides Nucleotides 18(4-5): 1101-3. Lander, E. S., Linton, L. M., Birren, B., Nusbaum, C., Zody, M. C., Baldwin, J., Devon, K., Dewar, K., Doyle, M., FitzHugh, W., Funke, R., Gage, D., Harris, K., Heaford, A., Howland, J., Kann, L., Lehoczky, J., LeVine, R., McEwan, P., McKernan, K., Meldrim, J., Mesirov, J. P., Miranda, C., Morris, W., Naylor, J., Raymond, C., Rosetti, M., Santos, R., Sheridan, A., Sougnez, C., Stange-Thomann, N., Stojanovic, N., Subramanian, A., Wyman, D., Rogers, J., Sulston, J., Ainscough, R., Beck, S., Bentley, D., Burton, J., Clee, C., Carter, N., Coulson, A., Deadman, R., Deloukas, P., Dunham, A., Dunham, I., Durbin, R., French, L., Grafham, D., Gregory, S., Hubbard, T., Humphray, S., Hunt, A., Jones, M., Lloyd, C., McMurray, A., Matthews, L., Mercer, S., Milne, S., Mullikin, J. C., Mungall, A., Plumb, R., Ross, M., Shownkeen, R., Sims, S., Waterston, R. H., Wilson, R. K., Hillier, L. W., McPherson, J. D., Marra, M. A., Mardis, E. R., Fulton, L. A., Chinwalla, A. T., Pepin, K. H., Gish, W. R., Chissoe, S. L., Wendl, M. C., Delehaunty, K. D., Miner, T. L., Delehaunty, A., Kramer, J. B., Cook, L. L., Fulton, R. S., Johnson, D. L., Minx, P. J., Clifton, S. W., Hawkins, T., Branscomb, E., Predki, P., Richardson, P., Wenning, S., Slezak, T., Doggett, N., Cheng, J. F., Olsen, A., Lucas, S., Elkin, C., Uberbacher, E., Frazier, M., Gibbs, R. A., Muzny, D. M., Scherer, S. E., Bouck, J. B., Sodergren, E. J., Worley, K. C., Rives, C. M., Gorrell, J. H., Metzker, M. L., Naylor, S. L., Kucherlapati, R.
S., Nelson, D. L., Weinstock, G. M., Sakaki, Y., Fujiyama, A., Hattori, M., Yada, T., Toyoda, A., Itoh, T., Kawagoe, C., Watanabe, H., Totoki, Y., Taylor, T., Weissenbach, J., Heilig, R., Saurin, W., Artiguenave, F., Brottier, P., Bruls, T., Pelletier, E., Robert, C., Wincker, P., Smith, D. R., Doucette-Stamm, L., Rubenfield, M., Weinstock, K., Lee, H. M., Dubois, J., Rosenthal, A., Platzer, M., Nyakatura, G., Taudien, S., Rump, A., Yang, H., Yu, J., Wang, J., Huang, G., Gu, J., Hood, L., Rowen, L., Madan, A., Qin, S., Davis, R. W., Federspiel, N. A., Abola, A. P., Proctor, M. J., Myers, R. M., Schmutz, J., Dickson, M., Grimwood, J., Cox, D. R., Olson, M. V., Kaul, R., Shimizu, N., Kawasaki, K., Minoshima, S., Evans, G. A., Athanasiou, M., Schultz, R., Roe, B. A., Chen, F., Pan, H., Ramser, J., Lehrach, H., Reinhardt, R., McCombie, W. R., de la Bastide, M., Dedhia, N., Blocker, H., Hornischer, K., Nordsiek, G., Agarwala, R., Aravind, L., Bailey, J. A., Bateman, A., Batzoglou, S., Birney, E., Bork, P., Brown, D. G., Burge, C. B., Cerutti, L., Chen, H. C., Church, D., Clamp, M., Copley, R. R., Doerks, T., Eddy, S. R., Eichler, E. E., Furey, T. S., Galagan, J., Gilbert, J. G., Harmon, C., Hayashizaki, Y., Haussler, D., Hermjakob, H., Hokamp, K., Jang, W., Johnson, L. S., Jones, T. A., Kasif, S., Kaspryzk, A., Kennedy, S., Kent, W. J., Kitts, P., Koonin, E. V., Korf, I., Kulp, D., Lancet, D., Lowe, T. M., McLysaght, A., Mikkelsen, T., Moran, J. V., Mulder, N., Pollara, V. J., Ponting, C. P., Schuler,

G., Schultz, J., Slater, G., Smit, A. F., Stupka, E., Szustakowski, J., Thierry-Mieg, D., Thierry-Mieg, J., Wagner, L., Wallis, J., Wheeler, R., Williams, A., Wolf, Y. I., Wolfe, K. H., Yang, S. P., Yeh, R. F., Collins, F., Guyer, M. S., Peterson, J., Felsenfeld, A., Wetterstrand, K. A., Patrinos, A., Morgan, M. J., Szustakowki, J., de Jong, P., Catanese, J. J., Osoegawa, K., Shizuya, H., Choi, S. and Chen, Y. J. (2001). "Initial sequencing and analysis of the human genome." Nature 409(6822): 860-921. Lane, M. J., Paner, T., Kashin, I., Faldasz, B. D., Li, B., Gallo, F. J. and Benight, A. S. (1997). "The thermodynamic advantage of DNA oligonucleotide ‘stacking hybridization’ reactions: energetics of a DNA nick." Nucleic Acids Res. 25(3): 611-616. Lee, A., Fox, J. and Hazell, S. (1993). "Pathogenicity of Helicobacter pylori: a perspective." Infect Immun 61(5): 1601-10. Leonard, J. T., Grace, M. B., Buzard, G. S., Mullen, M. J. and Barbagallo, C. B. (1998). "Preparation of PCR products for DNA sequencing." Biotechniques 24(2): 314-7. Li, Y., Knobloch, O. and Hahn, H. P. (2002). "An extended boiling method for small-scale preparation of plasmid DNA tailored to long-range automated sequencing." J Biochem Biophys Methods 51(1): 69-74. Liao, J. and Gong, Z. (1997). "Sequencing of 3' cDNA clones using anchored oligo(dT) primers." Biotechniques 23(3): 368-70. Liu, D., Daubendiek, S., Zillman, M., Ryan, K. and Kool, E. (1996). "Rolling circle DNA synthesis: Small circular oligonucleotides as efficient templates for DNA polymerases." J Am Chem Soc 118: 1587-1594. Liu, G., McDaniel, T. K., Falkow, S. and Karlin, S. (1999). "Sequence anomalies in the Cag7 gene of the Helicobacter pylori pathogenicity island." Proc Natl Acad Sci U S A 96(12): 7011-6. Lizardi, P. M., Huang, X., Zhu, Z., Bray-Ward, P., Thomas, D. C. and Ward, D. C. (1998). "Mutation detection and single-molecule counting using isothermal rolling-circle amplification."
Nat Genet 19(3): 225-32. Lodish, H., Baltimore, D., Berk, A., Zipursky, L., Matsudaira, P. and Darnell, J. (1997). Molecular cell biology. New York, Scientific American Books. Manoni, M., Pergolizzi, R., Luzzana, M. and De Bellis, G. (1992). "Dideoxy linear PCR on a commercial fluorescent automated DNA sequencer." Biotechniques 12(1): 48-50, 52-3. Mardis, E. R. (1994). "High-throughput detergent extraction of M13 subclones for fluorescent DNA sequencing." Nucleic Acids Res 22(11): 2173-5. Marra, M. A., Kucaba, T. A., Hillier, L. W. and Waterston, R. H. (1999). "High-throughput plasmid DNA purification for 3 cents per sample." Nucleic Acids Res 27(24): e37. Marshall, B. J. and Warren, J. R. (1984). "Unidentified curved bacilli in the stomach of patients with gastritis and peptic ulceration." Lancet 1(8390): 1311-5. Marziali, A., Willis, T. D., Federspiel, N. A. and Davis, R. W. (1999). "An automated sample preparation system for large-scale DNA sequencing." Genome Res 9(5): 457-62.

Marziali, A. and Akeson, M. (2001). "New DNA sequencing methods." Annu Rev Biomed Eng 3: 195-223. Maxam, A. M. and Gilbert, W. (1977). "A new method for sequencing DNA." Proc Natl Acad Sci U S A 74(2): 560-4. Mayor, C., Brudno, M., Schwartz, J. R., Poliakov, A., Rubin, E. M., Frazer, K. A., Pachter, L. S. and Dubchak, I. (2000). "VISTA: visualizing global DNA sequence alignments of arbitrary length." Bioinformatics 16(11): 1046-7. McMurray, A. A., Sulston, J. E. and Quail, M. A. (1998). "Short-insert libraries as a method of problem solving in genome sequencing." Genome Res 8(5): 562-6. Messing, J. (1983). "New M13 vectors for cloning." Methods Enzymol 101: 20-78. Metzker, M. L., Lu, J. and Gibbs, R. A. (1996). "Electrophoretically uniform fluorescent dyes for automated DNA sequencing." Science 271(5254): 1420-2. Millar, D. S., Withey, S. J., Tizard, M. L., Ford, J. G. and Hermon-Taylor, J. (1995). "Solid-phase hybridization capture of low-abundance target DNA sequences: application to the polymerase chain reaction detection of Mycobacterium paratuberculosis and Mycobacterium avium subsp. silvaticum." Anal Biochem 226(2): 325-30. Mitchell, H. M., Li, Y. Y., Hu, P. J., Liu, Q., Chen, M., Du, G. G., Wang, Z. J., Lee, A. and Hazell, S. L. (1992). "Epidemiology of Helicobacter pylori in southern China: identification of early childhood as the critical period for acquisition." J Infect Dis 166(1): 149-53. Mitchell, H. M. (2001). Epidemiology of infection. Helicobacter pylori physiology and genetics. Mobley, H., Mendz, G. and Hazell, S. Washington, DC, ASM Press. 1: 7-18. Miyaji, H., Azuma, T., Ito, S., Abe, Y., Gejyo, F., Hashimoto, N., Sugimoto, H., Suto, H., Ito, Y., Yamazaki, Y., Kohli, Y. and Kuriyama, M. (2000). "Helicobacter pylori infection occurs via close contact with infected individuals in early childhood." J Gastroenterol Hepatol 15(3): 257-62. Mullis, K., Faloona, F., Scharf, S., Saiki, R., Horn, G. and Erlich, H. (1986).
"Specific enzymatic amplification of DNA in vitro: the polymerase chain reaction." Cold Spring Harb Symp Quant Biol 51 Pt 1: 263-73. Murray, V. (1989). "Improved double-stranded DNA sequencing using the linear polymerase chain reaction." Nucleic Acids Res 17(21): 8889. Myers, E. W., Sutton, G. G., Smith, H. O., Adams, M. D. and Venter, J. C. (2002). "On the sequencing and assembly of the human genome." Proc Natl Acad Sci U S A 99(7): 4145-4146. National Research Council (1988). Mapping and sequencing the human genome. Washington DC, National Academy Press. Nelson, J. R., Cai, Y. C., Giesler, T. L., Farchaus, J. W., Sundaram, S. T., Ortiz-Rivera, M., Hosta, L. P., Hewitt, P. L., Mamone, J. A., Palaniappan, C. and Fuller, C. W. (2002). "TempliPhi, phi29 DNA polymerase based rolling circle amplification of templates for DNA sequencing." Biotechniques Suppl: 44-7. Nilsson, P., O'Meara, D., Edebratt, F., Persson, B., Uhlén, M., Lundeberg, J. and Nygren, P. Å. (1999). "Quantitative investigation of the modular primer effect for DNA and peptide nucleic acid hexamers." Anal Biochem 269: 155-161.

Ning, Z., Cox, A. J. and Mullikin, J. C. (2001). "SSAHA: a fast search method for large DNA databases." Genome Res 11(10): 1725-9. Nirenberg, M., Leder, P., Bernfield, M., Brimacombe, R., Trupin, J., Rottman, F. and O'Neal, C. (1965). "RNA codewords and protein synthesis, VII. On the general nature of the RNA code." Proc Natl Acad Sci U S A 53(5): 1161-8. Nisson, P. E., Rashtchian, A. and Watkins, P. C. (1991). "Rapid and efficient cloning of Alu-PCR products using uracil DNA glycosylase." PCR Methods Appl 1(2): 120-3. Nyrén, P. (1987). "Enzymatic method for continuous monitoring of DNA polymerase activity." Anal Biochem 167(2): 235-8. O'Meara, D., Nilsson, P., Nygren, P. Å., Uhlén, M. and Lundeberg, J. (1998a). "Capture of single-stranded DNA assisted by oligonucleotide modules." Anal Biochem 255: 195-203. O'Meara, D., Yun, Z., Sönnerborg, A. and Lundeberg, J. (1998b). "Cooperative oligonucleotides mediating direct capture of hepatitis C virus RNA from serum." J Clin Microbiol 36: 2454-2459. Odenbreit, S., Puls, J., Sedlmaier, B., Gerland, E., Fischer, W. and Haas, R. (2000). "Translocation of Helicobacter pylori CagA into gastric epithelial cells by type IV secretion." Science 287(5457): 1497-500. Paegel, B. M., Yeung, S. H. and Mathies, R. A. (2002). "Microchip bioprocessor for integrated nanovolume sample purification and DNA sequencing." Anal Chem 74(19): 5092-8. Pang, H. M. and Yeung, E. S. (2000). "Automated one-step DNA sequencing based on nanoliter reaction volumes and capillary electrophoresis." Nucleic Acids Res 28(15): E73. Parkin, D. M., Bray, F. I. and Devesa, S. S. (2001). "Cancer burden in the year 2000. The global picture." Eur J Cancer 37 Suppl 8: S4-66. Parsonnet, J. (1995). "The incidence of Helicobacter pylori infection." Aliment Pharmacol Ther 9 Suppl 2: 45-51. Ponting, C., Schultz, J., Milpetz, F. and Bork, P. (1999).
"SMART: identification and annotation of domains from signalling and extracellular protein sequences." Nucl. Acids. Res. 27(1): 229-232. Pounder, R. E. and Ng, D. (1995). "The prevalence of Helicobacter pylori infection in different countries." Aliment Pharmacol Ther 9 Suppl 2: 33-9. Povinelli, C. M. and Gibbs, R. A. (1993). "Large-scale sequencing library production: an adaptor-based strategy." Anal Biochem 210(1): 16-26. Raja, M. C., Zevin-Sonkin, D., Shvartzburd, J., Kotler, L. and Ulanovsky, L. (1997). "DNA sequencing with modular primers using a two-step protocol with thermostable polymerase at the second step." Biotechniques 23(3): 362-4, 366, 368. Rashtchian, A., Buchman, G. W., Schuster, D. M. and Berninger, M. S. (1992). "Uracil DNA glycosylase-mediated cloning of polymerase chain reaction-amplified DNA: application to genomic and cDNA cloning." Anal Biochem 206(1): 91-7. Reese, M. G., Kulp, D., Tammana, H. and Haussler, D. (2000). "Genie---Gene Finding in Drosophila melanogaster." Genome Res. 10(4): 529-538.

Ronaghi, M., Karamohamed, S., Pettersson, B., Uhlén, M. and Nyrén, P. (1996). "Real-time DNA sequencing using detection of pyrophosphate release." Anal Biochem 242(1): 84-9. Ronaghi, M., Uhlén, M. and Nyrén, P. (1998). "A sequencing method based on real-time pyrophosphate." Science 281(5375): 363, 365. Ronaghi, M. (2000). "Improved performance of pyrosequencing using single-stranded DNA-binding protein." Anal Biochem 286(2): 282-8. Ronaghi, M. and Elahi, E. (2002). "Pyrosequencing for microbial typing." J Chromatogr B Analyt Technol Biomed Life Sci 782(1-2): 67-72. Rosenthal, A. and Charnock-Jones, D. S. (1992). "New protocols for DNA sequencing with dye terminators." DNA Seq 3(1): 61-4. Ruiz-Martinez, M. C., Salas-Solano, O., Carrilho, E., Kotler, L. and Karger, B. L. (1998). "A sample purification method for rugged and high-performance DNA sequencing by capillary electrophoresis using replaceable polymer solutions. A. Development of the cleanup protocol." Anal Chem 70(8): 1516-27. Ruppert, A., Szalay, B., van den Boom, D., Horst, G. and Koster, H. (1995). "A filtration method for plasmid isolation using microtiter filter plates." Anal Biochem 230(1): 130-4. Saiki, R. K., Gelfand, D. H., Stoffel, S., Scharf, S. J., Higuchi, R., Horn, G. T., Mullis, K. B. and Erlich, H. A. (1988). "Primer-directed enzymatic amplification of DNA with a thermostable DNA polymerase." Science 239(4839): 487-91. Salama, N., Guillemin, K., McDaniel, T. K., Sherlock, G., Tompkins, L. and Falkow, S. (2000). "A whole-genome microarray reveals genetic diversity among Helicobacter pylori strains." Proc Natl Acad Sci U S A 97(26): 14668-73. Sambrook, J. and Russell, D. W. (2001). Molecular cloning: a laboratory manual. Cold Spring Harbor, Cold Spring Harbor Laboratory Press. Sanger, F. and Coulson, A. R. (1975). "A rapid method for determining sequences in DNA by primed synthesis with DNA polymerase." J Mol Biol 94(3): 441-8. Sanger, F., Nicklen, S. and Coulson, A. R. (1977).
"DNA sequencing with chain-terminating inhibitors." Proc Natl Acad Sci U S A 74(12): 5463-7. Sanger, F., Coulson, A. R., Barrell, B. G., Smith, A. J. and Roe, B. A. (1980). "Cloning in single-stranded bacteriophage as an aid to rapid DNA sequencing." J Mol Biol 143(2): 161-78. Sauer, M., Angerer, B., Ankenbauer, W., Foldes-Papp, Z., Gobel, F., Han, K. T., Rigler, R., Schulz, A., Wolfrum, J. and Zander, C. (2001). "Single molecule DNA sequencing in submicrometer channels: state of the art and future prospects." J Biotechnol 86(3): 181-201. Savage, D. (1992). Avidin-Biotin Chemistry: A Handbook. Rockford, Pierce Chemical Company. Schuler, G. D. (1997). "Pieces of the puzzle: expressed sequence tags and the catalog of human genes." J Mol Med 75(10): 694-8. Schwartz, S., Zhang, Z., Frazer, K. A., Smit, A., Riemer, C., Bouck, J., Gibbs, R., Hardison, R. and Miller, W. (2000). "PipMaker--a web server for aligning two genomic DNA sequences." Genome Res 10(4): 577-86.

Schwartz, S., Elnitski, L., Li, M., Weirauch, M., Riemer, C., Smit, A., Green, E. D., Hardison, R. C. and Miller, W. (2003a). "MultiPipMaker and supporting tools: Alignments and analysis of multiple genomic DNA sequences." Nucleic Acids Res 31(13): 3518-24. Schwartz, S., Kent, W. J., Smit, A., Zhang, Z., Baertsch, R., Hardison, R. C., Haussler, D. and Miller, W. (2003b). "Human-mouse alignments with BLASTZ." Genome Res 13(1): 103-7. Segal, E. D., Cha, J., Lo, J., Falkow, S. and Tompkins, L. S. (1999). "Altered states: involvement of phosphorylated CagA in the induction of host cellular growth changes by Helicobacter pylori." Proc Natl Acad Sci U S A 96(25): 14559-64. Selbach, M., Moese, S., Meyer, T. F. and Backert, S. (2002). "Functional analysis of the Helicobacter pylori cag pathogenicity island reveals both VirD4-CagA-dependent and VirD4-CagA-independent mechanisms." Infect Immun 70(2): 665-71. Shuman, S. (1994). "Novel approach to molecular cloning and polynucleotide synthesis using vaccinia DNA topoisomerase." J Biol Chem 269(51): 32678-32684. Silva, W. A., Jr., Costa, M. C., Valente, V., Sousa, J. F., Paco-Larson, M. L., Espreafico, E. M., Camargo, S. S., Monteiro, E., Holanda, A. J., Zago, M. A., Simpson, A. J. and Neto, E. D. (2001). "PCR template preparation for capillary DNA sequencing." Biotechniques 30(3): 537, 540-2. Skowronski, E. W., Armstrong, N., Andersen, G., Macht, M. and McCready, P. M. (2000). "Magnetic, microplate-format plasmid isolation protocol for high-yield, sequencing-grade DNA." Biotechniques 29(4): 786-8, 790, 792. Smith, L. M., Sanders, J. Z., Kaiser, R. J., Hughes, P., Dodd, C., Connell, C. R., Heiner, C., Kent, S. B. and Hood, L. E. (1986). "Fluorescence detection in automated DNA sequence analysis." Nature 321(6071): 674-9. Soper, S. A., Williams, D. C., Xu, Y., Lassiter, S. J., Zhang, Y., Ford, S. M. and Bruch, R. C. (1998).
"Sanger DNA-sequencing reactions performed in a solid-phase nanoreactor directly coupled to capillary gel electrophoresis." Anal Chem 70(19): 4036-43. Springer, A. L., Booth, L. R., Braid, M. D., Houde, C. M., Hughes, K. A., Kaiser, R. J., Pedrak, C., Spicer, D. A. and Stolyar, S. (2003). "A rapid method for manual or automated purification of fluorescently labeled nucleic acids for sequencing, genotyping, and microarrays." J Biomol Tech 14(1): 17-32. Stahl, S., Hultman, T., Olsson, A., Moks, T. and Uhlen, M. (1988). "Solid phase DNA sequencing using the biotin-avidin system." Nucleic Acids Res 16(7): 3025-38. Stahl, S., Hansson, M., Ahlborg, N., Nguyen, T. N., Liljeqvist, S., Lundeberg, J. and Uhlen, M. (1993). "Solid-phase gene assembly of constructs derived from the Plasmodium falciparum malaria blood-stage antigen Ag332." Biotechniques 14(3): 424-34. Stark, M., Reizenstein, E., Uhlén, M. and Lundeberg, J. (1996). "Immunomagnetic separation and solid-phase detection of Bordetella pertussis." J Clin Microbiol 34(4): 778-84. Stein, L. (2001). "Genome annotation: from sequence to biology." Nat Rev Genet 2(7): 493-503.

Stein, M., Rappuoli, R. and Covacci, A. (2000). "Tyrosine phosphorylation of the Helicobacter pylori CagA antigen after cag-driven host cell translocation." Proc Natl Acad Sci U S A 97(3): 1263-8. Stein, M., Rappuoli, R. and Covacci, A. (2001). The cag pathogenicity island. Helicobacter pylori: physiology and genetics. Mobley, H., Mendz, G. and Hazell, S. Washington, DC, ASM Press. 1: 345-353. Stephan, J., Dorre, K., Brakmann, S., Winkler, T., Wetzel, T., Lapczyna, M., Stuke, M., Angerer, B., Ankenbauer, W., Foldes-Papp, Z., Rigler, R. and Eigen, M. (2001). "Towards a general procedure for sequencing single DNA molecules." J Biotechnol 86(3): 255-67. Sterky, F., Holmberg, A., Alexandersson, G., Lundeberg, J. and Uhlen, M. (1998). "Direct sequencing of bacterial artificial chromosomes (BACs) and prokaryotic genomes by biotin-capture PCR." J Biotechnol 60(1-2): 119-29. Swerdlow, H. and Gesteland, R. (1990). "Capillary gel electrophoresis for rapid, high resolution DNA sequencing." Nucleic Acids Res 18(6): 1415-9. Tammi, M. T., Arner, E., Britton, T. and Andersson, B. (2002). "Separation of nearly identical repeats in shotgun assemblies using defined nucleotide positions, DNPs." Bioinformatics 18(3): 379-88. Tammi, M. T., Arner, E. and Andersson, B. (2003). "TRAP: Tandem Repeat Assembly Program produces improved shotgun assemblies of repetitive sequences." Comput Methods Programs Biomed 70(1): 47-59. Thompson, J. D., Higgins, D. G. and Gibson, T. J. (1994). "CLUSTAL W: improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position-specific gap penalties and weight matrix choice." Nucleic Acids Res 22(22): 4673-80. Tillett, D. and Neilan, B. A. (1999). "n-butanol purification of dye terminator sequencing reactions." Biotechniques 26(4): 606-8, 610. Tomb, J. F., White, O., Kerlavage, A. R., Clayton, R. A., Sutton, G. G., Fleischmann, R. D., Ketchum, K. A., Klenk, H. P., Gill, S., Dougherty, B.
A., Nelson, K., Quackenbush, J., Zhou, L., Kirkness, E. F., Peterson, S., Loftus, B., Richardson, D., Dodson, R., Khalak, H. G., Glodek, A., McKenney, K., Fitzegerald, L. M., Lee, N., Adams, M. D., Venter, J. C. and et al. (1997). "The complete genome sequence of the gastric pathogen Helicobacter pylori." Nature 388(6642): 539-47. Tong, X. and Smith, L. M. (1992). "Solid-phase method for the purification of DNA sequencing reactions." Anal Chem 64(22): 2672-7. Tong, X. and Smith, L. M. (1993). "Solid phase purification in automated DNA sequencing." DNA Seq 4(3): 151-62. Tracy, T. E. and Mulcahy, L. S. (1991). "A simple method for direct automated sequencing of PCR fragments." Biotechniques 11(1): 68-75. Uberbacher, E. and Mural, R. (1991). "Locating Protein-Coding Regions in Human DNA Sequences by a Multiple Sensor-Neural Network Approach." Proc Natl Acad Sci U S A 88(24): 11261-11265. Uhlen, M., Hultman, T., Wahlberg, J., Lundeberg, J., Bergh, S., Pettersson, B., Holmberg, A., Stahl, S. and Moks, T. (1992). "Semi-automated solid-phase DNA sequencing." Trends Biotechnol 10(1-2): 52-5.

van Doorn, L. J., Kleter, B., Voermans, J., Maertens, G., Brouwer, H., Heijtink, R. and Quint, W. (1994). "Rapid detection of hepatitis C virus RNA by direct capture from blood." J Med Virol 42: 22-28. Velculescu, V. E., Zhang, L., Vogelstein, B. and Kinzler, K. W. (1995). "Serial analysis of gene expression." Science 270(5235): 484-7. Venter, J. C., Smith, H. O. and Hood, L. (1996). "A new strategy for genome sequencing." Nature 381(6581): 364-6. Venter, J. C., Adams, M. D., Sutton, G. G., Kerlavage, A. R., Smith, H. O. and Hunkapiller, M. (1998). "Shotgun sequencing of the human genome." Science 280(5369): 1540-2. Venter, J. C., Adams, M. D., Myers, E. W., Li, P. W., Mural, R. J., Sutton, G. G., Smith, H. O., Yandell, M., Evans, C. A., Holt, R. A., Gocayne, J. D., Amanatides, P., Ballew, R. M., Huson, D. H., Wortman, J. R., Zhang, Q., Kodira, C. D., Zheng, X. H., Chen, L., Skupski, M., Subramanian, G., Thomas, P. D., Zhang, J., Gabor Miklos, G. L., Nelson, C., Broder, S., Clark, A. G., Nadeau, J., McKusick, V. A., Zinder, N., Levine, A. J., Roberts, R. J., Simon, M., Slayman, C., Hunkapiller, M., Bolanos, R., Delcher, A., Dew, I., Fasulo, D., Flanigan, M., Florea, L., Halpern, A., Hannenhalli, S., Kravitz, S., Levy, S., Mobarry, C., Reinert, K., Remington, K., Abu-Threideh, J., Beasley, E., Biddick, K., Bonazzi, V., Brandon, R., Cargill, M., Chandramouliswaran, I., Charlab, R., Chaturvedi, K., Deng, Z., Di Francesco, V., Dunn, P., Eilbeck, K., Evangelista, C., Gabrielian, A. E., Gan, W., Ge, W., Gong, F., Gu, Z., Guan, P., Heiman, T. J., Higgins, M. E., Ji, R. R., Ke, Z., Ketchum, K. A., Lai, Z., Lei, Y., Li, Z., Li, J., Liang, Y., Lin, X., Lu, F., Merkulov, G. V., Milshina, N., Moore, H. M., Naik, A. K., Narayan, V. A., Neelam, B., Nusskern, D., Rusch, D.
B., Salzberg, S., Shao, W., Shue, B., Sun, J., Wang, Z., Wang, A., Wang, X., Wang, J., Wei, M., Wides, R., Xiao, C., Yan, C., Yao, A., Ye, J., Zhan, M., Zhang, W., Zhang, H., Zhao, Q., Zheng, L., Zhong, F., Zhong, W., Zhu, S., Zhao, S., Gilbert, D., Baumhueter, S., Spier, G., Carter, C., Cravchik, A., Woodage, T., Ali, F., An, H., Awe, A., Baldwin, D., Baden, H., Barnstead, M., Barrow, I., Beeson, K., Busam, D., Carver, A., Center, A., Cheng, M. L., Curry, L., Danaher, S., Davenport, L., Desilets, R., Dietz, S., Dodson, K., Doup, L., Ferriera, S., Garg, N., Gluecksmann, A., Hart, B., Haynes, J., Haynes, C., Heiner, C., Hladun, S., Hostin, D., Houck, J., Howland, T., Ibegwam, C., Johnson, J., Kalush, F., Kline, L., Koduru, S., Love, A., Mann, F., May, D., McCawley, S., McIntosh, T., McMullen, I., Moy, M., Moy, L., Murphy, B., Nelson, K., Pfannkoch, C., Pratts, E., Puri, V., Qureshi, H., Reardon, M., Rodriguez, R., Rogers, Y. H., Romblad, D., Ruhfel, B., Scott, R., Sitter, C., Smallwood, M., Stewart, E., Strong, R., Suh, E., Thomas, R., Tint, N. N., Tse, S., Vech, C., Wang, G., Wetter, J., Williams, S., Williams, M., Windsor, S., WinnDeen, E., Wolfe, K., Zaveri, J., Zaveri, K., Abril, J. F., Guigo, R., Campbell, M. J., Sjolander, K. V., Karlak, B., Kejariwal, A., Mi, H., Lazareva, B., Hatton, T., Narechania, A., Diemer, K., Muruganujan, A., Guo, N., Sato, S., Bafna, V., Istrail, S., Lippert, R., Schwartz, R., Walenz, B., Yooseph, S., Allen, D., Basu, A., Baxendale, J., Blick, L., Caminha, M., Carnes-Stine, J., Caulk, P., Chiang, Y. H., Coyne, M., Dahlke, C., Mays, A., Dombroski, M., Donnelly, M., Ely, D., Esparham, S., Fosler, C., Gire, H., Glanowski, S., Glasser, K., Glodek, A., Gorokhov, M., Graham, K., Gropman, B., Harris, M., Heil, J., Henderson, S., Hoover, J., Jennings, D., Jordan, C., Jordan, J., Kasha, J., Kagan, L., Kraft, C., Levitsky, A., Lewis, M., Liu, X., Lopez, J., Ma, D., Majoros, W., McDaniel, J., Murphy, S., Newman, M., Nguyen, T., Nguyen, N., Nodell, M., Pan, S., Peck, J., Peterson, M., Rowe, W., Sanders, R., Scott, J., Simpson, M., Smith, T., Sprague, A., Stockwell, T., Turner, R., Venter, E., Wang, M., Wen, M., Wu, D., Wu, M., Xia, A., Zandieh, A. and Zhu, X. (2001). "The sequence of the human genome." Science 291(5507): 1304-51.
Voss, H., Wiemann, S., Grothues, D., Sensen, C., Zimmermann, J., Schwager, C., Stegemann, J., Erfle, H., Rupp, T. and Ansorge, W. (1993). "Automated low-redundancy large-scale DNA sequencing by primer walking." Biotechniques 15(4): 714-21.
Wahlberg, J., Holmberg, A., Bergh, S., Hultman, T. and Uhlén, M. (1992). "Automated magnetic preparation of DNA templates for solid phase sequencing." Electrophoresis 13(8): 547-51.
Wang, B., Fang, Q., Williams, W. V. and Weiner, D. B. (1992). "Double-stranded DNA sequencing by linear amplification with Taq DNA polymerase." Biotechniques 13(4): 527-30.
Waterston, R. H., Lander, E. S. and Sulston, J. E. (2002). "On the sequencing of the human genome." Proc Natl Acad Sci U S A 99(6): 3712-3716.
Watson, J. D. and Crick, F. H. (1953a). "Molecular structure of nucleic acids. A structure for deoxyribose nucleic acid." Nature 171(4356): 737-738.
Watson, J. D. and Crick, F. H. (1953b). "Genetical implications of the structure of deoxyribonucleic acid." Nature 171(4361): 964-967.
Werle, E., Schneider, C., Renner, M., Volker, M. and Fiehn, W. (1994). "Convenient single-step, one tube purification of PCR products for direct sequencing." Nucleic Acids Res 22(20): 4354-5.
Werner, J. H., Cai, H., Jett, J. H., Reha-Krantz, L., Keller, R. A. and Goodwin, P. M. (2003). "Progress towards single-molecule DNA sequencing: a one color demonstration." J Biotechnol 102(1): 1-14.
Wilkins, M., Stokes, A. and Wilson, H. (1953). "Molecular structure of deoxypentose nucleic acids." Nature 171(4356): 738-740.
Wilson, R. K. (1993). "High-throughput purification of M13 templates for DNA sequencing." Biotechniques 15(3): 414-6, 418-20, 422.
Xu, Y. and Uberbacher, E. C. (1997). "Automated gene identification in large-scale genomic sequences." J Comput Biol 4(3): 325-38.
Yang, T. J., Yu, Y., Nah, G., Atkins, M., Lee, S., Frisch, D. A. and Wing, R. A. (2003). "Construction and utility of 10-kb libraries for efficient clone-gap closure for rice genome sequencing." Theor Appl Genet 107(4): 652-660.
Yeh, R.-F., Lim, L. P. and Burge, C. B. (2001). "Computational Inference of Homologous Gene Structures in the Human Genome." Genome Res 11(5): 803-816.
Yu, W., Andersson, B., Worley, K. C., Muzny, D. M., Ding, Y., Liu, W., Ricafrente, J. Y., Wentland, M. A., Lennon, G. and Gibbs, R. A. (1997). "Large-scale concatenation cDNA sequencing." Genome Res 7(4): 353-8.
