Chapter 4 Sequencing DNA and Databases

Author: Nigel Gilmore
Chapter 4 – Sequencing, Databases

Introduction

One of the most remarkable scientific advancements in history is the molecular biology revolution. In 1953 James Watson and Francis Crick proposed a molecular structure for DNA, which Oswald Avery had previously shown to be the genetic material. The next question was how this genetic information codes for the proteins that carry out cellular functions. To answer it, scientists needed to examine the sequences of the DNA they were working with. The first DNA sequences were determined by very laborious methods that generated relatively short sequences. Rapid DNA sequencing methods developed in the mid-1970s allowed scientists to generate far more sequence data. Less than four decades later, the technology has advanced so quickly that the genomic nucleotide sequences of numerous organisms, including mouse and human, have been completed.

For example, in July 1995 the first entire genomic sequence of an organism, the bacterium H. influenzae, was published in Science. The genome has 1,830,137 base pairs with 1727 predicted genes. The article has 40 authors (contributing scientists)! Since then, complete genomic sequences have been determined for 1167 microbes, and the list grows by a genome every 1-2 weeks (http://www.ncbi.nlm.nih.gov/genomes/lproks.cgi).

In April 1996, the sequence of the entire genome of the first eukaryotic organism, the yeast Saccharomyces cerevisiae, was completed. Saccharomyces cerevisiae has 16 chromosomes comprising a total of 12,068,000 base pairs, with an estimated 5,885 protein-encoding genes. The genome sequence of the nematode C. elegans was published in 1998 (Science 1998, vol. 282, p. 2012). The authors sequenced over 97,000,000 bases and identified an estimated 19,099 predicted protein-coding genes. The Drosophila genome was finished in 1999. The human genome has been completed.

In June 2007, 186 eukaryotic genomes were being sequenced. In June 2009 there were 581 genomic projects, of which 264 were in the process of being assembled and 27 were functionally complete, including a number of fungi, Drosophila (fruit fly), C. elegans (roundworm), Arabidopsis (a weed that is used for basic plant research), rice, mouse and human (see http://www.ncbi.nlm.nih.gov/genomes/leuks.cgi for a list of the genomes). Genomic projects for chimpanzee, cow, dog, pig, chicken, sea squirt and pufferfish, along with many others, are in the process of being assembled.

The growth of GenBank, the sequence database at the National Institutes of Health (NIH), is exponential; currently about 1 billion base pairs of DNA are deposited monthly, and that rate has doubled every year for the past several years. There are currently 100 billion bases from 165,000 different organisms stored in the US, European and Japanese sequence databases. Researchers predict that in the future it should take less than a day to determine the entire sequence of a microbe and perhaps as little as several weeks to determine the sequence of a human.

This year we will be cloning, sequencing, and analyzing cDNA from the duckweed Landoltia punctata. Since only limited genomic sequence information is available for this organism, it is likely that every sequence you generate will be novel. To contribute to the scientific community and to help provide information on the relatedness between different species, we would like to add the sequence information that you generate to the international genomic databases.

© 2014 WSSP


Since other scientists will be using and relying on your data, it is essential that you analyze the quality of your sequence before this data is submitted to the databases. We will therefore spend significant effort going over how to interpret your sequence information. In this chapter we will first go over some of the background for how DNA sequencing is performed.

I. DNA Sequencing Theory

The method of DNA sequencing most commonly used today is the enzymatic method originally developed by Sanger in 1977. This method is commonly referred to as dideoxy or chain termination sequencing. In this method a short oligonucleotide primer is hybridized to the DNA template that is to be sequenced (Fig 4-1). A DNA polymerase is then used to initiate DNA synthesis extending from the primer in the 5’ to 3’ direction. The synthesized DNA is complementary to the template strand of DNA. The reaction contains deoxynucleotides (dNTPs: dATP, dCTP, dGTP, dTTP) that the polymerase uses to extend the chain. However, the reaction also contains a small quantity of dideoxynucleotides (ddNTPs) (Fig 4-2). The ddNTPs lack the OH group at the 3’ position. In DNA synthesis, each new nucleotide is added to the 3’ OH group of the last nucleotide added. Once the polymerase incorporates a ddNTP onto the end of the chain, the chain therefore cannot be extended further (Fig 4-3).

Fig 4-1 Diagram of DNA sequencing using the chain termination method. All the DNA strands initiate at the same position and are terminated by the addition of labeled ddNTPs. The different size chains can be separated on a gel or column and the sequence determined.

Fig 4-2 The difference between dTTP and ddTTP is the presence of an H at the 3’ position.

Since the incorporation of a ddNTP is random, some DNA chains will be terminated near the beginning of the synthesis, while others will be extended further and terminate at other positions. Only a small percentage of the chains become terminated at any particular base. Because all four deoxynucleotides are present in the reaction, chain elongation proceeds normally until, by chance, DNA polymerase inserts a dideoxynucleotide (shown as colored letters in Fig 4-1). If the ratio of deoxynucleotides to their dideoxy versions is high enough, some DNA strands will succeed in adding several hundred nucleotides before insertion of the dideoxy version halts the process. At the end of the incubation period, the fragments are separated on a gel or column by length from longest to shortest (Fig 4-1). The resolution is so good that a difference of one nucleotide is enough to separate a strand from the next shorter and next longer strands.

Fig. 4-3. Top strand: The deoxy-C at the end of the chain allows the addition of the next base (A). Bottom strand: The dideoxy-C at the end of the chain prevents the addition of the next base, terminating the chain. From Pearson Education Inc

Sequencing instruments have been developed to automate this process. These machines use a different fluorescent label on each of the four bases in the reaction to detect the DNA fragments as they electrophorese off the gel or column. Since four different dyes are used, all the reactions can be done in a single tube, thus increasing throughput. Sensitive lasers scan the bottom of the gels and record the nucleotides that migrate off the gel. The figure below shows the fluorescent pattern of a sequencing run (Fig 4-4). Each peak corresponds to a different base in the DNA sequence. The sequence above the waveforms is the DNA sequence that has been interpreted by the instrument. The waveform in the example is fairly clean and there are no ambiguities. However, not all your sequences will be this straightforward!

Fig 4-4 An example of a DNA sequence waveform. Each peak represents a chain termination at a particular bp in the sequence. The color of the peak represents the specific fluorescent dye detected. The letters above indicate the sequence determined by the instrument.
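The chain termination logic described in this section can be illustrated with a small simulation. The sketch below is only a toy model, not a description of how a sequencing instrument works internally: the template, the number of chains, and the termination probability (dd_fraction) are all made-up values chosen for illustration. Each simulated chain is extended along the template until a ddNTP happens to be incorporated, and the sequence is then read by sorting the terminated fragments by length, shortest first (the order in which they would come off the gel).

```python
import random

# Watson-Crick base pairing used to build the complementary strand.
COMPLEMENT = {"A": "T", "T": "A", "G": "C", "C": "G"}


def synthesize_fragments(template, n_chains=5000, dd_fraction=0.05, seed=0):
    """Simulate chain termination synthesis on a template strand.

    `template` is written in the order the polymerase reads it, so the new
    strand grows one complementary base at a time. At each step there is a
    `dd_fraction` chance (hypothetical value) that a labeled ddNTP is added
    instead of a dNTP, which terminates the chain.

    Returns a set of (fragment_length, labeled_terminal_base) pairs.
    """
    rng = random.Random(seed)
    fragments = set()
    for _ in range(n_chains):
        for length, template_base in enumerate(template, start=1):
            new_base = COMPLEMENT[template_base]
            if rng.random() < dd_fraction:
                # A ddNTP was incorporated: record the fragment and stop.
                fragments.add((length, new_base))
                break
    return fragments


def read_sequence(fragments):
    """Read the sequence by ordering fragments from shortest to longest."""
    return "".join(base for _, base in sorted(fragments))


template = "TACGGATCCTGA"  # made-up template, for illustration only
fragments = synthesize_fragments(template)
print(read_sequence(fragments))
```

Because only a small fraction of chains terminates at any one position, many chains are needed so that every fragment length is represented; lowering n_chains or raising dd_fraction makes the longer fragments rarer, which mirrors the point made above about the ratio of dNTPs to ddNTPs.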

II. The use of computers in DNA sequence analysis.

A. Introduction

As discussed in the first part of this chapter, the development of fast and inexpensive methods has caused an explosion in the number of DNA sequences that have been determined. The growth of GenBank, the sequence database at NIH, is exponential; currently about 1 billion base pairs of DNA are deposited monthly, and that rate has doubled every year for the past several years. There are currently 100 billion bases from 165,000 different organisms stored in the US, European and Japanese sequence databases.

 2014 WSSP

4-3

Chapter 4 – Sequencing, Databases

Fig 4-5. Growth of GenBank

Fig 4-6 Growth of DNA sequences. From Baxevanis, AD (2011) Current Protocols in Bioinformatics, Wiley Online Library.

With this fantastic success, however, has come a problem—too much DNA!! Thus, as DNA sequence information is generated, a problem with storage and analysis of the vast amounts of information becomes apparent. This type of problem is ideally suited to computers. Computers serve as tools for handling the vast amounts of sequence information generated by molecular biologists. Computers do much more for molecular biologists than just store sequence information. Programs have also been written which analyze the DNA. For instance, it is important to know where the protein coding sequences are located on a DNA fragment, what convenient restriction enzyme sites are present in the DNA fragment, what gene regulatory sequences are present, etc. Computers can easily and rapidly determine this information. Another very important job for computers is to find similarities between an unknown DNA or protein sequence and a known DNA or protein sequence in the databases. At the molecular level, cells from all organisms function in remarkably similar ways. In fact, it has been shown that regulatory molecules in a yeast cell can function in the same way in human cells, and vice versa. Thus, information about a gene in one organism can shed light on a similar gene in another organism. Computers will take a sequence that you enter and compare it with all the DNA sequences currently known to find such cross-species homologs. These comparisons have dramatically increased our understanding of molecular processes in all organisms. The following is an introduction to DNA sequence analysis using computers. It is important to understand the terminology as well as to know what is available to you. A scientist who does not have a working knowledge of computers or is unaware of the available resources cannot function in today’s research environment.
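As a small illustration of the kind of analysis mentioned above, the following sketch locates restriction enzyme recognition sites in a DNA fragment. It is a minimal example rather than a replacement for a real analysis program: the test sequence is invented, the function names are hypothetical, and only three well-known enzymes (EcoRI, BamHI and HindIII) are included. Positions are reported 1-based, as in sequence database records.

```python
# Recognition sequences of three commonly used restriction enzymes.
RECOGNITION_SITES = {
    "EcoRI": "GAATTC",
    "BamHI": "GGATCC",
    "HindIII": "AAGCTT",
}


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    return seq.upper().translate(str.maketrans("ACGT", "TGCA"))[::-1]


def find_sites(seq, site):
    """Return the 1-based start positions of every occurrence of `site`."""
    seq, site = seq.upper(), site.upper()
    positions = []
    start = seq.find(site)
    while start != -1:
        positions.append(start + 1)
        # Resume one base later so overlapping occurrences are not missed.
        start = seq.find(site, start + 1)
    return positions


def restriction_map(seq):
    """Map each enzyme name to the positions of its sites in `seq`."""
    return {name: find_sites(seq, site) for name, site in RECOGNITION_SITES.items()}


fragment = "ATGGAATTCCGTGGATCCTTAAGCTTGAATTCTAA"  # invented example sequence
for enzyme, positions in restriction_map(fragment).items():
    print(enzyme, positions)
```

These three recognition sequences are palindromic (each equals its own reverse complement), so scanning one strand finds the sites on both. For an enzyme with a non-palindromic site, the reverse complement of the site would need to be searched as well.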

B. Searching the Databases:

The first thing that anyone wants to know about their DNA sequence is whether it has been found before. The most straightforward way to accomplish this is to compare the DNA sequence you have determined with all other DNA sequences present in the molecular biology databases. This type of search (nucleic acid by nucleic acid) has several advantages, including speed. A nucleic acid-by-nucleic acid database search, however, also has disadvantages. For instance, let’s say that the gene that encodes your cDNA has never been sequenced before, but a gene with the same function in another organism has been sequenced. Would these two genes show homology at the DNA level? The answer is, “not necessarily.” It is known that proteins having similar functions in different organisms often share significant sequence homology at the level of the amino acids, i.e., the two proteins have similar amino acids aligned in a similar fashion. However, because of the degeneracy of the genetic code (most amino acids are specified by more than one codon), the genes encoding these similar proteins may show little homology at the DNA level.

Therefore you should also translate your DNA sequence into its cognate protein sequence and use the latter to search the protein databases. Even if you have already identified your gene by searching the DNA databases, this type of search is essential because it will help you find proteins from other organisms that have similar protein sequences. It is possible that there is little information about the function of the protein that you have identified. However, closely related homologs may be present in other organisms, such as Drosophila, C. elegans, yeast or humans, for which a large body of information has already been obtained on the function and activity of the protein. In some cases, the three-dimensional structure of one of the homologs may have been determined by NMR or by X-ray crystallography. However, first let’s talk about DNA-by-DNA searches.
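The effect of the degeneracy of the genetic code can be made concrete with a short sketch. The two DNA sequences below are invented for illustration and are not from any real gene; they were chosen to use different codons for the same amino acids. The code translates both with the standard genetic code and then compares them position by position. The function names are hypothetical, and the identity calculation assumes the two sequences are already aligned and of equal length, which a real database search does not require.

```python
from itertools import product

# Standard genetic code, listed in the conventional T, C, A, G codon order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): amino_acid
    for codon, amino_acid in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(dna):
    """Translate a DNA coding sequence in frame 1 ('*' marks a stop codon)."""
    dna = dna.upper()
    return "".join(CODON_TABLE[dna[i:i + 3]] for i in range(0, len(dna) - 2, 3))


def fraction_identical(a, b):
    """Fraction of positions that match in two equal-length sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must have the same length")
    return sum(x == y for x, y in zip(a, b)) / len(a)


# Two made-up coding sequences that use different codons for the same protein.
gene_1 = "ATGCTTTCTCGTCTTTCT"
gene_2 = "ATGTTAAGCAGATTGAGT"

protein_1, protein_2 = translate(gene_1), translate(gene_2)
print(protein_1, protein_2)
print(f"DNA identity:     {fraction_identical(gene_1, gene_2):.0%}")
print(f"Protein identity: {fraction_identical(protein_1, protein_2):.0%}")
```

In this contrived case the two proteins are identical even though fewer than half of the nucleotide positions match, which is why a protein-level search can find relatives that a DNA-level search misses.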

C. Sequence Databases.

A database is a group of related records that are stored by computers. For instance, a catalog store might want to set up a database for all its customers. In this case, each record might contain a customer’s name, address, and phone number. The complete list of such records would comprise the database. Numerous computer programs have been developed which manipulate such databases in extremely powerful ways.

Databases for molecular biologists contain information pertaining to the sequence, structure, and function of biological molecules. There are two major types of databases in molecular biology: those that contain DNA sequence information and those that contain protein sequence information. You will be expected to understand both types of databases.

DNA and protein databases come in several different varieties. The reason for this is mostly historical. Laboratories across the world realized at about the same time that computers would be necessary to analyze all the information coming on-line in molecular biology. These individual labs spearheaded efforts in their various countries to begin biological databases. As a result, DNA and protein databases were developed in the United States, Europe, and Japan. In each case, institutions were set up to maintain and update the databases. These separate databases still exist, but they are no longer isolated. Thus, when you do a database search today, you generally search all existing databases, not just the one present in your own country. The combined sequence information of all the databases is referred to as the non-redundant (or nr) database. However, there are also databases containing specific types of DNA sequences that can be searched as well.

 2014 WSSP

4-5

Each record in all the various databases contains essentially the same information. Most important is the actual sequence, i.e., the DNA or protein sequence submitted by the individual scientist. In addition, each record contains the name of the organism from which it came, the date it was submitted, the address of the laboratory that did the submission, a reference to a published paper if available, and a brief description of the sequence.

D. An example of a database entry (DNA sequence):

LOCUS       AB231879    1383 bp    mRNA    linear    INV    07-JUN-2006
DEFINITION  Artemia franciscana mRNA for zinc finger protein Af-Zic, complete cds.
ACCESSION   AB231879
VERSION     AB231879.1  GI:94966317
KEYWORDS    .
SOURCE      Artemia franciscana
  ORGANISM  Artemia franciscana
            Eukaryota; Metazoa; Arthropoda; Crustacea; Branchiopoda;
            Anostraca; Artemiidae; Artemia.
REFERENCE   1
  AUTHORS   Aruga,J., Kamiya,A., Takahashi,H., Fujimi,T.J., Shimizu,Y.,
            Ohkawa,K., Yazawa,S., Umesono,Y., Noguchi,H., Shimizu,T.,
            Saitou,N., Mikoshiba,K., Sakaki,Y., Agata,K. and Toyoda,A.
  TITLE     A wide-range phylogenetic analysis of Zic proteins: Implications
            for correlations between protein structure conservation and body
            plan complexity
  JOURNAL   Genomics 87 (6), 784-792 (2006)
   PUBMED   16574373
REFERENCE   2  (bases 1 to 1383)
  AUTHORS   Aruga,J. and Toyoda,A.
  TITLE     Direct Submission
  JOURNAL   Submitted (10-AUG-2005) Jun Aruga, RIKEN Brain Science Institute,
            Laboratory for Comparative Neurogenesis; 2-1 Hirosawa, Wako-shi,
            Saitama 351-0198, Japan (E-mail:[email protected],
            URL:http://www.brain.riken.go.jp/labs/lcn/, Tel:81-48-467-9791,
            Fax:81-48-467-9792)
FEATURES             Location/Qualifiers
     source          1..1383
                     /organism="Artemia franciscana"
                     /mol_type="mRNA"
                     /db_xref="taxon:6661"
     gene            1..1383
                     /gene="Af-Zic"
     CDS             1..1383
                     /gene="Af-Zic"
                     /codon_start=1
                     /product="zinc finger protein Af-Zic"
                     /protein_id="BAE94140.1"
                     /db_xref="GI:94966318"
                     /translation="MTASLSASVMNPSFIKRESPASATALFVPNQFSAVPNFGFHHVP
                     SACATEQSSEMLNPFVDNHLRLNDQSNFQGYHHPHHGQIQQHHLGSYAARDFLFRRDM
                     GLGMGLEAHHTHAAQHHHMFDPSHAAAAAHHAMFTGFDHNTMRLPTEMYTRDASGYAA
                     QQFHQMGSMAPMAHPASAGAFLRYMRTPIKQELHCLWVDPEQPSPKKTCGKTFGSMHE
                     IVTHITVEHVGGPECTNHACFWQGCVRNGRAFKAKYKLVNHIRVHTGEKPFPCPFPGC
                     GKVFARSENLKIHKRTHTGEKPFKCEFEGCDRRFANSSDRKKHSHVHTSDKPYNCKVR
                     GCDKSYTHPSSLRKHMKVHGKSPPPASSGCDSDENESIADTNSDSAASPSPSSHDSSQ
                     VQVNHNRPPNHHNLGLGFTNPGHIGDWYVHQSAPDMPVPPATEHSPIGPPMHHPPNSL
                     NYFKTELVQN"
ORIGIN
        1 atgactgcta gtttaagtgc aagcgtgatg aatccaagtt ttataaagag ggaaagtcct
       61 gcatcggcta cagccctgtt cgtaccaaac caatttagtg cagtgcctaa ttttggattt
      121 caccatgttc ctagtgcttg tgcaactgag caaagtagtg aaatgctgaa cccttttgtg

(Note: the rest of the DNA sequence was deleted to save space)
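A record in this flat-file layout can be separated into its annotation and its sequence with very little code. The sketch below is a simplified illustration, not a complete parser: the function name split_record is hypothetical, the embedded sample is an abbreviated copy of the entry above (only the first 120 bases, with the "//" line that marks the end of a record), and it assumes that top-level keywords start in the first column and that the sequence follows the ORIGIN line. The FEATURES table and other nested fields would need more careful handling.

```python
SAMPLE_RECORD = """\
LOCUS       AB231879                1383 bp    mRNA    linear   INV 07-JUN-2006
DEFINITION  Artemia franciscana mRNA for zinc finger protein Af-Zic, complete
            cds.
ACCESSION   AB231879
VERSION     AB231879.1  GI:94966317
ORIGIN
        1 atgactgcta gtttaagtgc aagcgtgatg aatccaagtt ttataaagag ggaaagtcct
       61 gcatcggcta cagccctgtt cgtaccaaac caatttagtg cagtgcctaa ttttggattt
//
"""


def split_record(text):
    """Split a flat-file record into (annotation dict, sequence string)."""
    annotation = {}
    sequence_parts = []
    current_key = None
    in_sequence = False
    for line in text.splitlines():
        if line.startswith("//"):          # end-of-record marker
            break
        if line.startswith("ORIGIN"):      # everything after this is sequence
            in_sequence = True
            continue
        if in_sequence:
            # Drop the position numbers and the spaces between 10-base units.
            sequence_parts.append("".join(ch for ch in line if ch.isalpha()))
        elif line[:1].strip():             # a keyword starting in column 1
            current_key, _, value = line.partition(" ")
            annotation[current_key] = value.strip()
        elif current_key and line.strip():  # indented continuation line
            annotation[current_key] += " " + line.strip()
    return annotation, "".join(sequence_parts)


annotation, sequence = split_record(SAMPLE_RECORD)
print(annotation["ACCESSION"])
print(annotation["DEFINITION"])
print(len(sequence), sequence[:30])
```

Keeping the two parts separate matters in practice: the annotation is what you read to decide whether a record is relevant, while the sequence is what search and analysis programs actually operate on.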

There are two parts to any sequence file, the annotation and the sequence. The annotation portion includes items such as the definition (a description of the DNA or protein sequence and what it codes for), the accession number (a unique identification number that allows easy retrieval from any database) and the CDS (the protein coding sequence). A large clone may encode several proteins. The sequence portion is numbered and formatted into units of 10 bases or amino acids. Note that although there is a protein sequence in this file, it is part of the annotation.

There is another very important thing to remember about the sequences and information found in these databases. This information is submitted by scientists not unlike you. The accuracy of these sequences is not guaranteed by the database managers (who are often computer scientists), and a sequence is only as reliable as the scientist(s) who submitted it. As a result, some types of sequences are very good (e.g. C. elegans genomic sequence has an error rate of