DNA and RNA. Chapter Physical structure of DNA

Author: Bennett Gardner
Chapter 4

DNA and RNA

Except for some viruses, life’s genetic code is written in the DNA molecule (aka deoxyribonucleic acid). From the perspective of design, there is no human language that can match the simplicity and elegance of DNA. But from the perspective of implementation (how it is actually written and spoken in practice), DNA is a linguist’s worst nightmare. DNA has four major functions: (1) it contains the blueprint for making proteins and enzymes; (2) it plays a role in regulating when the proteins and enzymes are made and when they are not made; (3) it carries this information when cells divide; and (4) it transmits this information from parental organisms to their offspring. In this chapter, we will explore the structure of DNA, its language, and how the DNA blueprint becomes translated into a physical protein.

Figure 4.1: Structure of DNA.

4.1 Physical structure of DNA

Few people in literate societies can avoid seeing a picture of DNA. Physically, DNA resembles a spiral staircase. For our purposes here, imagine that we twist the staircase to remove the spiral so we are left with the ladder-like structure depicted in Figure 4.1. The two backbones to this ladder are composed of sugars (S in the figure) and phosphates (P); they need not concern us further. The whole action of DNA is in the rungs.

Each rung of the ladder is composed of two chemicals, called nucleotides, that are chemically bonded to each other; the bonded pair is called a base pair. DNA has four and only four nucleotides: adenine, thymine, guanine and cytosine, usually abbreviated by the first letter of their names (A, T, G, and C). These four nucleotides are very important, so their names should be committed to memory.

Inspection of Figure 4.1 reveals that the nucleotides do not pair randomly with one another. Instead A always pairs with T and G always pairs with C. This is the principle of complementary base pairing that is critical for understanding many aspects of DNA functioning. Because of complementary base pairing, if we know one strand of the DNA (i.e., one of its two helices), we will always know the other. Imagine that we sawed apart the DNA ladder in Figure 4.1 through the middle of each rung and threw away the entire right-hand side of the ladder. We would still be able to know the sequence of nucleotides on this missing piece because of the complementary base pairing. The sequence on the remaining left-hand piece starts with ATGCTC, so the missing right-hand side must begin with the sequence TACGAG.

DNA also has a particular orientation in space so that the “top” of a DNA sequence differs from its “bottom.” The reasons for this are too complicated to consider here, but the lingo used by geneticists to denote the orientation is important.
The “top” of a DNA sequence is called the 5’ end (read “five prime”) and the bottom is the 3’ (“three prime”) end.1 If DNA nucleotide sequence number 1 lies between DNA sequence number 2 and the “top,” then it is referred to as being upstream from DNA sequence 2. If it lies between sequence 2 and the 3’ end, then it is downstream from sequence 2.
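The sawed-ladder thought experiment can be written out as a few lines of Python. This is only a sketch of the pairing rule stated above (A with T, G with C); the names DNA_PAIR and complement are ours, chosen for illustration, and the strand is read rung by rung from top to bottom as in Figure 4.1.

```python
# Pairing rules from the text: A always pairs with T, G always pairs with C.
DNA_PAIR = {"A": "T", "T": "A", "G": "C", "C": "G"}


def complement(strand: str) -> str:
    """Return the partner strand, rung by rung, in the same top-to-bottom order."""
    return "".join(DNA_PAIR[base] for base in strand)


left_side = "ATGCTC"                  # the half of the ladder we kept
right_side = complement(left_side)    # the half we threw away
print(right_side)                     # TACGAG
```

Applying complement twice returns the original strand, which is another way of saying that either half of the ladder carries the full information.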


DNA Replication

Complementary base pairing also assists in the faithful reproduction of the DNA sequence, a process geneticists call DNA replication. When a cell divides, both of the daughter cells must contain the same genetic instructions. Consequently, DNA must be duplicated so that one copy ends up in one cell and the other in the second cell. Not only does the replication process have to be carried out, but it must be carried out with a high degree of fidelity. Most cells in our bodies (neurons being a notable exception) are constantly dying and being replenished with new cells. For example, the average life span of some skin cells is on the order of one to two days, so the skin that you and I had last month is not the same skin that we have today. By living into our eighties, we will have experienced well over 10,000 generations of skin cells! If this book were to be copied sequentially by 10,000 secretaries, one copying the output of another, the results would contain quite a lot of gibberish by the time the task was completed. DNA replication must be much more accurate than that.2

Figure 4.2: DNA replication.

Replication involves a series of proteins and enzymes that we will call the replication stuff. The first step in DNA replication occurs when an enzyme (cannot get away from those enzymes, can we?) separates the rungs much as our mythical saw cut them right down the middle (the left-hand picture in Figure 4.2). Enzymes then grab on to nucleotides floating free in the cell, glue them on to their appropriate partners on the separated strands, and synthesize a new backbone (right side of Figure 4.2). The situation is analogous to opening the zipper of your coat, but as the teeth of the zipper separate, new teeth appear. One set of new teeth binds to the freed teeth on the left-hand side of the original zipper, while another set binds to the teeth on the original right-hand side. When you get to the bottom, you are left with two completely closed zippers, one on the left and the other on the right of your jacket front.3

1 The terms 5’ and 3’ refer to the position of carbon atoms that link a nucleotide to the DNA backbone.

2 Of course, DNA does not replicate with 100% accuracy, and problems in replication may cause irregularities and even disease in cells. However, we do have the equivalent of DNA proofreading mechanisms that serve two purposes: helping to ensure that DNA is copied accurately and preventing DNA from becoming too damaged from environmental factors. A genetic defect in one proofreading mechanism leads to the disorder xeroderma pigmentosum that eventually results in death from skin cancer.

3 My apologies to molecular biologists for this oversimplified account of replication.
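The zipper analogy can be sketched in Python. This is a deliberately oversimplified model in the same spirit as the account above: a duplex is just a pair of strings, the function name replicate is ours, and no proofreading or copying errors are modeled.

```python
DNA_PAIR = {"A": "T", "T": "A", "G": "C", "C": "G"}


def complement(strand: str) -> str:
    """Build the partner strand from free nucleotides, one rung at a time."""
    return "".join(DNA_PAIR[base] for base in strand)


def replicate(duplex: tuple[str, str]) -> tuple[tuple[str, str], tuple[str, str]]:
    """Unzip a (left, right) duplex and rebuild a partner for each old strand."""
    left, right = duplex
    # The old left strand keeps its place and receives a new right-hand partner.
    daughter_one = (left, complement(left))
    # The old right strand keeps its place and receives a new left-hand partner.
    daughter_two = (complement(right), right)
    return daughter_one, daughter_two


parent = ("ATGCTC", "TACGAG")
first, second = replicate(parent)
print(first == parent, second == parent)  # True True
```

Each daughter duplex contains one old strand and one newly built strand, yet both read exactly like the parent, which is why both daughter cells end up with the same instructions.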


4.3. RNA


Table 4.1: Some important types of RNA.

Name              Abbreviation  Function
Messenger RNA     mRNA          Carries the message from the DNA to the protein factory
Ribosomal RNA     rRNA          Comprises part of the protein factory
Transfer RNA      tRNA          Transfers the correct building block to the nascent protein
Interference RNA  iRNA          Interferes with the DNA message


Before discussing the major role of DNA, it is important to discuss DNA’s first cousin, ribonucleic acid or RNA. Beyond its chemical composition, RNA has important similarities to, and differences from, DNA.

First, like DNA, RNA has four and only four nucleotides. But unlike DNA, RNA uses the nucleotide uracil (abbreviated as U) in place of thymine (T). Thus, the four RNA nucleotides are adenine (A), cytosine (C), guanine (G), and uracil (U).

Second, the nucleotides in RNA also exhibit complementary base pairing. The RNA nucleotides may pair with either DNA or other RNA molecules. When RNA pairs with DNA, G and C always pair together, T in DNA always pairs with A in RNA, but A in DNA pairs with U in RNA. When RNA pairs with RNA, then G pairs with C and A pairs with U.

Third, RNA is single-stranded (usually) while DNA is double-stranded. That is, RNA does not have the ladder-like structure of the DNA in Figure 4.1. Instead, RNA would look like Figure 4.1 after the ladder was sawed down the middle and one half of it discarded (with, of course, the added proviso that U would substitute for T in the remaining half).

Fourth, while there is one type of DNA, there are several different types of RNA, each of which performs different duties in the cell. The different types of importance for this text are listed in Table 4.1. Note the abbreviations.4

In terms of function, think of DNA as the monarch of the cell, giving all the orders. Unlike human monarchs, however, King DNA is unable to leave the throne room (i.e., the cell’s nucleus) and hence can never execute his own orders. The different types of RNA correspond to the various types of henchmen who carry out the King’s orders. Some occupy buildings in outlying districts (ribosomal RNA), others transport material to strategic locations (transfer RNA), while yet others act as messengers to give instructions on what to build (messenger RNA). A fourth type (interference RNA) actually disrupts other messages from the monarch! As we will see, the common language of the realm is the genetic code, and it is communicated by way of complementary base pairing.

4 The use of iRNA is not standard terminology in genetics but is used here to conform to mRNA, rRNA, and tRNA. Molecular biologists will refer to what is called interference RNA here as micro RNA (miRNA), short interfering RNA (siRNA) or the process RNA interference (RNAi).
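The two sets of pairing rules for RNA can be put side by side as lookup tables. The sketch below simply encodes the rules stated in this section; the table and function names are ours, and the example strands are made up for illustration.

```python
# DNA nucleotide -> the RNA nucleotide that pairs with it.
DNA_TO_RNA = {"G": "C", "C": "G", "T": "A", "A": "U"}

# RNA nucleotide -> the RNA nucleotide that pairs with it.
RNA_TO_RNA = {"G": "C", "C": "G", "A": "U", "U": "A"}


def pair(strand: str, rules: dict[str, str]) -> str:
    """Return the partner strand under the given pairing rules."""
    return "".join(rules[base] for base in strand)


print(pair("ATGCTC", DNA_TO_RNA))  # UACGAG  (RNA partner of a DNA strand)
print(pair("UACGAG", RNA_TO_RNA))  # AUGCUC  (RNA partner of an RNA strand)
```

Note that T never appears in the output and U never appears as a key of DNA_TO_RNA: the only place the two alphabets differ is the partner of adenine.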


The genetic code

DNA is a blueprint. It does not physically construct anything. Before discussing how the information in the DNA results in the manufacture of a concrete molecule, it is important to obtain an overall perspective on the genetic code. It is convenient to view the genome for any species as a book with the genetic code as the language common to the books of all life forms.

The “alphabet” for this language has four and only four letters given by four nucleotides in DNA (A, T, C, and G) or RNA (A, U, G and C). In contrast to human language, where a word is composed of any number of letters, a genetic “word” consists of three and only three nucleotide letters. Each genetic word symbolizes an amino acid. (We will define an amino acid later.) For example, the nucleotide sequence AAG is “DNAese” for the amino acid phenylalanine, the sequence GTC denotes the amino acid glutamine, and the sequence AGT stands for the amino acid serine.

Like natural language, DNA has synonyms. That is, there is more than one triplet nucleotide sequence symbolizing the same amino acid. For example, ATA and ATG in the DNA both denote the amino acid tyrosine. Figure 4.3 gives the genetic code in terms of the DNA triplet words.

The sentence in the DNA language is a series of words that gives a sequence of amino acids. For example, the DNA sentence AACGTATCGCAT would be read as a polypeptide chain composed of the amino acids leucine-histidine-serine-valine. Because of the triplet nature of the DNA language, it is not necessary to put spaces between the words. Given the correct starting position, the language will translate with 100% fidelity.

Like natural written language, part of the DNA language consists of punctuation marks. For example, the nucleotide DNA triplets ATT, ATC, and ACT are analogous to a period (.) in ending a sentence: all three signal the end of a polypeptide chain. They are called stop codons. TAC is a start codon. It acts as the capital signaling the first word in a new sentence. Biologically, it instructs the cell to “start the peptide chain here.”

Finally, DNA, just like a book, is organized into chapters. The chapters correspond to the chromosomes, so their number will vary from one species to the next. The book for humans consists of 24 different chapters, one for each chromosome (22 autosomes plus the X and the Y). The book for other species may contain fewer or more chapters with little correlation between the number of chapters and the complexity of the life forms.

The differences between natural human language and DNAese are as important as the similarities. All differences reduce to the fact that human language is coherent while DNA is the most muddled and disorganized communication system ever developed. First, the chapters in a human language book are arranged to tell a coherent story. There is no such ordering to chromosomes.



Figure 4.3: The genetic code in DNA codons.

* Start codon. Amino acids (with single-letter abbreviation): Ala = Alanine (A); Arg = Arginine (R); Asn = Asparagine (N); Asp = Aspartic acid (D); Cys = Cysteine (C); Gln = Glutamine (Q); Glu = Glutamic acid (E); Gly = Glycine (G); His = Histidine (H); Ile = Isoleucine (I); Leu = Leucine (L); Lys = Lysine (K); Met = Methionine (M); Phe = Phenylalanine (F); Pro = Proline (P); Ser = Serine (S); Thr = Threonine (T); Trp = Tryptophan (W); Tyr = Tyrosine (Y); Val = Valine (V).

Second, sentences in English physically follow one another with one sentence qualifying, embellishing, or adding information to another in order to complete a line of thought. The genetic language rarely, if ever, has a logical sequence. Metaphorically, one DNA sentence might describe the weather, the next give two ingredients for a chili recipe, and the third could be a political aphorism.

Third, whereas it is absurd to write an English compound sentence with a paragraph or two interspersed between the two independent clauses, DNA frequently places independent clauses of the same sentence in entirely different chapters.

Fourth, no English book would be published where most sentences are interrupted with what appears to be the musings of a chimpanzee randomly striking a keyboard. A single DNA sentence may be perforated with over a dozen long sequences of such apparent nonsense.

Fifth, with natural language it is considered bad rhetoric to repeat the same thought in adjacent sentences, let alone in the same words. With DNA, repetition is the norm, not the exception. Not only does DNA continuously stutter, stammer, and hem and haw, but it also contains numerous nonsensical passages that are repeated thousands of times, sometimes in the same chapter.

Finally, the size of the DNA “book” for any mammalian species far exceeds that of any book written by a human. With eighty-some characters per line and thirty-some lines on a page, a 500-page book contains about 1,500,000 English letters. It would take over 2,000 such books to contain the DNA book of Homo sapiens. And most of the characters in these 2,000 volumes have no apparent meaning!
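To make the idea of three-letter words, synonyms, and punctuation concrete, here is a small Python sketch that reads a DNA sentence. The dictionary holds only the DNA triplets named earlier in this section, not the full code of Figure 4.3; labeling the start codon TAC as Met relies on the standard genetic code, and the function names are ours.

```python
# A small excerpt of the DNAese dictionary: only the triplets named in the text.
DNAESE = {
    "TAC": "Met",                  # start codon (methionine in the standard code)
    "AAG": "Phe", "GTC": "Gln",
    "AGT": "Ser", "TCG": "Ser",    # synonyms for serine
    "ATA": "Tyr", "ATG": "Tyr",    # synonyms for tyrosine
    "AAC": "Leu", "GTA": "His", "CAT": "Val",
    "ATT": "STOP", "ATC": "STOP", "ACT": "STOP",  # the three "periods"
}


def read_words(dna: str) -> list[str]:
    """Cut the sentence into three-letter words; no spaces are needed."""
    return [dna[i:i + 3] for i in range(0, len(dna) - 2, 3)]


def read_sentence(dna: str) -> list[str]:
    """Look up each word in turn and stop reading at a stop codon."""
    chain = []
    for word in read_words(dna):
        meaning = DNAESE[word]
        if meaning == "STOP":
            break
        chain.append(meaning)
    return chain


print(read_sentence("AACGTATCGCAT"))     # ['Leu', 'His', 'Ser', 'Val']
print(read_sentence("TACAAGAGTATTCAT"))  # ['Met', 'Phe', 'Ser']
```

The second call shows the punctuation at work: reading stops at ATT, so the trailing CAT is never read. Starting read_words one letter late, as in read_words("ACGTATCGCAT"), gives entirely different words, which is why the correct starting position matters.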

4.5 Protein synthesis

4.5.1 Definitions

We now examine the specifics of how the blueprint in the DNA guides the manufacture of a protein. Although we have already spoken of proteins and enzymes, we must now take a closer look at these molecules. The basic building block for any protein or enzyme is the amino acid. There are twenty amino acids used in constructing proteins, most of which contain the suffix “ine,” e.g., phenylalanine, serine, tyrosine. Amino acids are frequently abbreviated by three letters, usually the first three letters of the name (e.g., phe for phenylalanine, tyr for tyrosine).

There are three major sources for the amino acids in our bodies. First, the cells in our bodies can manufacture amino acids from other, more basic compounds (or, as the case may be, from other amino acids). Second, proteins and enzymes within a cell are constantly being broken down into amino acids. Finally, we can obtain amino acids from our diet. When we eat a juicy steak, the protein in the meat is broken down into its amino acids by enzymes in our stomach and intestine. These amino acids are then transported by the blood to other cells in the body.

A series of amino acids physically linked together is called a polypeptide chain. For now, think of a polypeptide chain as a linear series of boxcars coupled together. The boxcars are the amino acids and their couplings the chemical bonds holding them together. The series is linear in the sense that it does not branch into a Y-like structure. The notion of a polypeptide chain is absolutely crucial for proper understanding of genes, so permit some latitude to digress into terminology.

Unfortunately, there are no written conventions for the language used to describe polypeptide chains, so terminology can be confusing to the novice. Typically, the word peptide is used to describe a chain of linked amino acids when the number of amino acids is small, say, a dozen or less. The word peptide is also used as an adjective and suffix to describe a substance that is composed of amino acids. For example, a peptide hormone is a hormone that is made up of linked amino acids, and a neuropeptide is a series of linked amino acids in a neuron. The phrases polypeptide chain or polypeptide usually refer to a longer series of coupled amino acids, sometimes numbering in the thousands. Be wary, however. One can always find exceptions to this usage.

We are now ready to define our old friend the protein. A protein is one or more polypeptide chains physically joined together and taking on a three-dimensional configuration. The polypeptide chain(s) comprising a protein will bend, fold back upon themselves, and bond at various spots to give a molecule that is no longer a simple linear structure. An example is hemoglobin, a protein in the red blood cells that carries oxygen. It is composed of four polypeptide chains that bend and bond and join together. Some proteins contain chemicals other than amino acids. For example, a lipoprotein contains a lipid (i.e., fat) in addition to the amino acid chain. Many of the receptors that reside on the cell membrane (but sometimes within the cell) are complexes that involve several proteins and lipids.

Finally, we must recall the definition of an enzyme. An enzyme is a particular class of protein responsible for metabolism.

With these definitions in mind, we can now present one definition of a gene. A gene is a sequence of DNA that contains the blueprint for the manufacture of a peptide or a polypeptide chain. Such genes are sometimes qualified by calling them structural genes or coding regions. A synonym for gene is locus (plural = loci), the Latin word meaning site, place, or location.


The process of protein synthesis

We can now look at the actual “manufacturing process” whereby the information on the DNA blueprint eventually becomes translated into a physical molecule. There are five steps in this process. Table X.X lists them by temporal sequence, giving both a common sense and technical definition. We will describe each of these in turn.

Step 1: Transcription

Depicted in Figure 4.4, transcription is the process whereby a section of DNA gets “read” and chemically “photocopied” into a molecule of RNA. As in DNA replication, there are a number of different enzymes required for transcription, each one performing a single task such as unwinding the double helix, cutting the bonds to make the DNA single stranded, or adding RNA nucleotides. Let us forgo the names and roles of these enzymes and simply call them the “transcription stuff.”

Transcription begins with a promoter region in the DNA. A promoter region is a section of DNA that the transcription stuff recognizes and binds to in order to start the transcription process. The promoter region is located at the upper end of a gene, just before the coding region (the section of nucleotides giving the blueprint for the protein). See the upper part of Figure 4.4.

After binding with the DNA, the transcription stuff unwinds the double helix and breaks the bonds between the nucleotides, making the DNA single stranded. From one of the DNA strands, the enzymes then synthesize an RNA molecule by grabbing free nucleotides, selecting the one that corresponds to the current DNA nucleotide, and “gluing” it to the RNA chain (see the bottom panel of Figure 4.4). If the DNA sequence is GCTAGA..., then the RNA sequence that is synthesized will read CGAUCU.... In this way, the information in the DNA is faithfully preserved in the RNA, albeit in the genetic equivalent of a “mirror image.”

Figure 4.4: Protein synthesis: transcription.

Because transcription requires a promoter region, not all of the DNA is regularly transcribed. Only 5% of human DNA ever becomes transcribed. What is the rest of the DNA doing? We postpone discussion of this important topic until later (Section X.X).

Step 2: Post transcriptional modification (editing)

In large multicellular life forms, there is a problem with freshly transcribed RNA: interspersed among the blueprint are sections of nucleotides that are, in terms of the information for making the protein, meaningless gibberish. The second step in protein synthesis occurs when the sections of nonsense get edited out from the actual message (see Figure 4.5). Logically this process should be called editing, but such a term appears too commonsensical to molecular biologists, who usually refer to it as post transcriptional modification.

Also, you all know the meaning of “blueprint” and “message section” as well as “junk” or “non message segment.” Such a situation is intolerable to scientists who spend untold hours concocting fancy vocabulary to refer to something that everyone knows in the first place. The same is true for post transcriptional modification. The sections of RNA that contain the actual message and blueprint for the polypeptide chain are called exons. Those sections of transcribed RNA that do not contain the message are called introns. Hence, in editing, the introns get cut out and the exons get spliced together.

Figure 4.5: Protein synthesis: post transcriptional modification (editing).

The resulting molecule is called messenger RNA, abbreviated as mRNA. Messenger RNA starts with some “header information” followed by the actual code for the peptide chain. There is also some “tail information” in the form of RNA nucleotides attached to the end of the molecule. One important term for mRNA is the codon. A codon is a series of three adjacent mRNA nucleotides that contain the message for a specific amino acid. (Sometimes the term codon is also used to refer to the triplet in DNA that gives rise to the three nucleotides in mRNA.)
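Editing can be sketched as cutting a string at known positions. In the Python below, the transcript, the intron letters, and the exon coordinates are all invented for illustration, and the exon positions are simply handed to the function; how a real cell recognizes intron boundaries is not modeled, and header and tail information are left out.

```python
def splice(transcript: str, exons: list[tuple[int, int]]) -> str:
    """Keep only the exon segments (start, end) and join them in order."""
    return "".join(transcript[start:end] for start, end in exons)


def introns(transcript: str, exons: list[tuple[int, int]]) -> list[str]:
    """Return the segments lying between consecutive exons (what gets cut out)."""
    return [transcript[end:next_start]
            for (_, end), (next_start, _) in zip(exons, exons[1:])]


def codons(mrna: str) -> list[str]:
    """Split the edited message into triplets of adjacent nucleotides."""
    return [mrna[i:i + 3] for i in range(0, len(mrna) - 2, 3)]


# Made-up transcript: exon, intron, exon.
transcript = "AUGUUC" + "GUCCCAG" + "UGGUAA"
exon_positions = [(0, 6), (13, 19)]

mrna = splice(transcript, exon_positions)
print(introns(transcript, exon_positions))  # ['GUCCCAG']
print(mrna)                                 # AUGUUCUGGUAA
print(codons(mrna))                         # ['AUG', 'UUC', 'UGG', 'UAA']
```

Reading codons straight from the unedited transcript would give AUG, UUC, GUC, CCA, and so on, so the intron would both insert gibberish and shift every later word.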

Step 3: Transportation

Transcription and editing take place in the nucleus. The cell’s protein factories (ribosomes), however, are located outside of the nucleus in the cytoplasm. The next step in protein synthesis is simple–the mRNA exits the nucleus, enters the cytoplasm of the cell, and using its RNA “header information,” attaches to a ribosome. This step is called transportation.

Step 4: Translation

In translation, the message in mRNA is “read” (aka translated) and a physical molecule is constructed from that message. To understand translation, we must first learn some more about ribosomes and the molecule called transfer RNA or tRNA. A ribosome is composed of a number of proteins and a specific type of RNA called ribosomal RNA or rRNA. There is actually a strong similarity between the function of the ribosome as a protein factory and the physical structure of the ribosome. The ribosome contains an assembly line and also a large storage area containing amino acids, each one with a chemical “barcode” attached to it that specifies the type of amino acid. That chemical barcode is written in RNA and the RNA-amino acid molecule is called transfer RNA or tRNA.

Figure 4.6: Schematic of a transfer RNA (tRNA) molecule.

Figure 4.6 presents a schematic of a tRNA molecule. The “barcode” that specifies the amino acid that the molecule carries is a three-nucleotide sequence called an anticodon. There are several three-dimensional loops in tRNA that are simply called “other RNA” in the figure. Finally there is the amino acid itself. In Figure 4.6, the amino acid is denoted as Trp, the abbreviation for tryptophan. Each ribosome “stores” a large number of tRNA molecules.

The process of translation is depicted in Figure 4.7. Think of the ribosomal “factory floor” as a table with two workers, one on each side, facing each other. A number of other workers are also present surrounding the table. An mRNA molecule enters the factory and, because of its header information, binds to the table and moves across it until an mRNA codon called a start codon tells the workers to start constructing a polypeptide chain (Figure 4.7, section A). One worker at the table looks at the first mRNA codon, reaches into the bin of tRNA molecules and picks out the one with the appropriate anticodon. For example, if the first codon is AUG, then the worker will select a tRNA molecule with the anticodon UAC, which carries the amino acid methionine (Met).

The mRNA molecule then moves down the table. Our worker reads the next mRNA codon which, in Figure 4.7, is AAA. The worker attaches the matching tRNA molecule. In this case, it is the one with the anticodon UUU carrying the amino acid lysine (Lys). The worker on the other side of the table attaches the two amino acids to each other. The adhesion is done chemically with a peptide bond. The situation is now depicted in section (B) of Figure 4.7. The mRNA now moves through the ribosome.
Our first worker reads the next mRNA codon (AGC) and attaches the appropriate tRNA molecule (one with the anticodon UCG carrying serine or Ser). The second worker attaches the amino acid from this tRNA to the now expanding polypeptide chain. Meanwhile, workers further down the factory floor detach the amino acid from the tRNA molecule. That tRNA will be recycled to pick up another amino acid and rejoin the other tRNA molecules in the storage area of the ribosome (section C of Figure 4.7). This cycle is repeated and repeated until the mRNA codon is a punctuation mark that signals a stop to the process. We now have a polypeptide chain.

Figure 4.7: Protein synthesis: translation.
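The factory-floor picture translates naturally into a loop. The Python sketch below is a cartoon under several assumptions: the tRNA bin holds only four of the many tRNAs in a real cell, the example message is invented, anticodons are written position by position against the codon as in the text (not in the 5’ to 3’ convention of molecular biology), and the names are ours.

```python
RNA_PAIR = {"A": "U", "U": "A", "G": "C", "C": "G"}

# The storage bin: anticodon "barcode" -> amino acid carried (small excerpt).
TRNA_BIN = {"UAC": "Met", "AAG": "Phe", "ACC": "Trp", "CAU": "Val"}

START_CODON = "AUG"
STOP_CODONS = {"UAA", "UAG", "UGA"}


def anticodon_for(codon: str) -> str:
    """First worker: work out which barcode pairs with the codon on the table."""
    return "".join(RNA_PAIR[base] for base in codon)


def translate(mrna: str) -> list[str]:
    """Slide the mRNA across the table, one codon at a time, from start to stop."""
    start = mrna.find(START_CODON)
    if start == -1:
        return []  # no start codon, so the workers never begin
    chain = []
    for i in range(start, len(mrna) - 2, 3):
        codon = mrna[i:i + 3]
        if codon in STOP_CODONS:
            break  # punctuation mark: the chain is finished
        # Second worker: bond the delivered amino acid onto the growing chain.
        chain.append(TRNA_BIN[anticodon_for(codon)])
    return chain


message = "GGAUGUUCUGGGUAUAAGC"  # header, AUG UUC UGG GUA, stop, tail
print(translate(message))        # ['Met', 'Phe', 'Trp', 'Val']
```

The letters before AUG and after UAA play the part of header and tail information: they ride along on the mRNA but contribute nothing to the chain.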

Step 5: Post translational modification

It is convenient to think of the polypeptide chain in linear terms, much as the boxcars of a train. The linear structure of a polypeptide is important information, but physics has another fate in store for our friend. The twenty amino acids used in proteins differ in electrical charges and respond differently to water, salts, and the temperature and acidity of the locale in which the polypeptide chain is produced. These physical forces, assisted or in some cases subverted by other molecules, cause the polypeptide chain to begin to fold back on itself and take on a three-dimensional configuration even as it is leaving the ribosome. This protein folding is virtually universal and is necessary if the protein is to have an active biological effect. The lock-and-key mechanisms that allow proteins such as enzymes and receptors to bind and perform their action are determined by the three-dimensional, folded structure of the protein.

Polypeptide folding is a universal and necessary action. But there are many other transformations that can happen to the polypeptide before it becomes a biologically active molecule or BAM. Some proteins will have a sugar added to them. Others, a fat. In some cases, two or more polypeptide chains must join together to produce a BAM. For example, the nicotinic receptor in the brain that responds to the neurotransmitter acetylcholine (and also nicotine, the addictive ingredient in tobacco products) requires five polypeptide chains to join together. There are even cases in which the newly translated polypeptide chain must be sliced up to generate BAMs. In short, there are many different things that can happen after translation to make the polypeptide into a BAM.

Furthermore, some post translational modification mechanisms can repeatedly occur and then be undone. Remember that second messenger system in cell communication? Many of these systems involve long and complicated pathways. Many of the proteins and enzymes in a signaling pathway must be activated in order for the signal to proceed. Hence, these proteins are constantly being activated and then deactivated, depending on the signaling needs of the cell.


Case study: Hemoglobin

The hemoglobin protein will figure prominently in several different sections of this book, so it will be used here to illustrate the genetic code and the organization of the genome. It will also help us to practice the genetic lingo we have learned in this chapter. The gory details about hemoglobin can be ignored. Concentrate on the big picture: what makes up a gene?

When we breathe in air, a series of chemical reactions in our lungs extracts oxygen atoms and implants them into the hemoglobin protein in our red blood cells. The red blood cells pulse through our arteries and eventually reach tiny capillaries in body tissues (e.g., liver cells, pancreas cells, muscle cells, neurons, etc.) where the hemoglobin releases the oxygen atoms. In humans over five months of age, hemoglobin is composed of four polypeptide chains, two α chains and two β chains.5 Each chain is coded for by a separate gene. Let’s examine the gene for the β chain first.

5 I am lying again. Some adult hemoglobin contains the δ chain, but we can ignore that to simplify matters.

Figure 4.8: The β hemoglobin-like gene cluster (human).

Figure 4.8 depicts the DNA segment containing the gene for the β polypeptide. This long section of DNA is located on chromosome 11 and is over 60,000 nucleotides long (or 60 kb, where kb denotes a kilobase or 1,000 base pairs). Only the tiny box with the label β contains the blueprint for the β peptide chain. (For the moment, ignore the boxes labeled ε, Gγ, Aγ, and δ.) The boxes labeled ψβ1 and ψβ2 are called pseudogenes for the β locus. A pseudogene is a nucleotide sequence highly similar to a functional gene but its DNA is not transcribed and/or translated. In short, a pseudogene does not produce a polypeptide capable of becoming a biologically active molecule.

The middle section of Figure 4.8 gives the structure of the β coding section, including the “punctuation marks.” Note the promoter regions and recall that this is the area that the transcription stuff binds to and begins the transcription process. There are also two punctuation marks downstream of the promoter. The first indicates where transcription is to begin and the second marks the first codon for translation.

The β coding section is roughly 1,600 base pairs long and includes three exons. The first exon is composed of the 90 nucleotides that have the code for the first 30 amino acids in the peptide chain, the second exon codes for the 31st through 104th amino acids, and the last for the remaining 42. Hence, of the 1,600 base pairs only 438 contain blueprint material, so only about 25% of the whole β locus contains the actual blueprint and processing information for the polypeptide chain.

The final section of Figure 4.8 gives the actual nucleotide sequence for the beginning of exon 1 as well as the amino acid sequence. Recall the triplet nature of the DNA codons. In DNAese, the first three coding letters are GTG, so the first amino acid is valine (Val).

Figure 4.9: The α hemoglobin-like gene cluster (human).

Figure 4.9 depicts the DNA region for the α chains. This is located on an entirely different chromosome from the β cluster, chromosome 16, and is roughly 30 kb in length. The boxes labeled α1 and α2 both contain the blueprint for the α peptide chain. This is an example of a gene duplication: the DNA for both of these loci is transcribed, edited and translated into the same α chains. Like the β chain, there is also a pseudogene for the α locus, denoted in Figure 4.9 by the box labeled ψα1. (Once again, ignore the boxes labeled ζ1 and ζ2.) The actual structure for the two α loci is very similar to that of the β locus (they too have three exons) and is not depicted in Figure 4.9.

To get a biologically active hemoglobin molecule, both of the α genes will be transcribed and edited into mRNA, which is transported into the cytoplasm and translated into α polypeptides that fold back on themselves, taking on a three-dimensional configuration. The β gene will undergo a similar process, giving a folded β polypeptide. Two α and two β chains will be joined together (an example of post translational modification). Heme groups are glued into this molecule (another case of post translational modification). A heme group contains a special iron ion that “catches” and binds an oxygen atom to it.
Figure 4.10 depicts the structure of the biologically active hemoglobin molecule.

Figure 4.10: The hemoglobin molecule (red = α chain, blue = β chain, green = heme group).

Figure from Haemoglobin.png

To understand the organization of the human genome, let’s play a mind game. Suppose that you were enrolled in a class in biochemical engineering. Your professor gives you the assignment to develop a machine (or other such mechanism) to produce a molecule like hemoglobin. What grade would you receive if you turned in something resembling human hemoglobin?

Think about this for a moment. You would have a massive blueprint but only a fraction of it has useable sections. Further, some useable sections will not actually be used at all (pseudogenes). If you were to manufacture something from a useable section (i.e., the two α sections in Figure 4.9 and the β section in Figure 4.8), you would end up with a non functional product. Instead, you need to copy these sections and then cut out the parts that make the product unusable and paste the rest together (the editing process after transcription). You also have a problem in terms of efficient production. With two useable α sections but only one useable β section, you will be producing two α “widgets” for every β widget. Hence, you will have to have other complicated mechanisms to ratchet down the production of α units and/or increase the rate of β unit production. Imagine yourself as the professor in this class reading and grading such a proposal. What grade would you give?

The point is that the human genome is not logical and efficient from an engineering standpoint. In fact, it appears to be so capricious that one wonders how any individual cell could function at all, let alone create a viable large multicellular life form. There is a very good reason for this complexity, but we postpone discussion of it until after we see many other examples of illogical and inefficient design in genetics.


4.7 Facts about the human genome

The 1980s saw new technologies that changed the face of genetics and cell biology. During that decade, geneticists began speculating about searching for the “holy grail” of human genetics: determining the nucleotide sequence, or the ordering of the As, Cs, Gs, and Ts, for the whole human genome. At the time, determining the sequence for even a small section of DNA was time-consuming and costly. Trying to do this for the whole genome would have been impossible given the resources then available for biological research.

Undeterred, groups of geneticists imagined a “big science” project for biology and medicine that would be the equivalent of the physicists’ massive particle accelerators and the astronomers’ space telescopes. They approached the US Congress, and in 1987 initial funds were given to the human and environmental research programs at the Department of Energy (DOE). In 1990, the DOE and the US National Institutes of Health joined forces to coordinate this effort, and the project was fully funded at $3 billion over 15 years. Soon, institutions from the United Kingdom, Japan, and other industrialized nations joined the project. The official name of this enterprise? The Human Genome Project (HGP).

Because of advances in both biotechnology and computing science, a rough draft of the sequence was announced in 2000 at a press conference headed by U.S. President Bill Clinton and British Prime Minister Tony Blair. Three years later, the full sequence was published. Uncharacteristically for a government-funded project, the HGP came in early and under budget. The following are some of the major facts that emerged from the HGP.

A consensus sequence. There is no such thing as THE human genome sequence. With the exception of identical twins, there are as many human genome sequences as there are members of Homo sapiens that have ever existed. Instead, what the HGP produced was a consensus sequence that acts as a map against which geneticists can compare other sequences. It is also a composite sequence, derived from the genomes of several individuals.

3.2 billion nucleotides.
The information content of the human genome sequence is 3.2 billion nucleotides. To see what this means, focus on one cell in a human male. That cell has two copies of each autosomal chromosome. Let’s throw one copy of each away, leaving us with 22 autosomal chromosomes and the X and the Y chromosomes. Each of these is double-stranded. Let’s make them single-stranded, tossing away the other strand. We would be left with 3.2 billion nucleotides.

20,000 to 25,000 genes. Molecular biologists define a gene as a section of DNA with the blueprint for a polypeptide chain. According to this definition, we humans have somewhere between 20,000 and 25,000 genes, far fewer than anticipated at the beginning of the project.

Roughly 2% of the human genome contains the blueprint for polypeptides. Given the number of genes and the average size of a gene, it is straightforward to arrive at a rough estimate of the proportion of the human genome that actually codes for proteins. It turns out that it is a very small proportion, roughly 2%. Adding the introns, promoter regions and the DNA that codes for the various RNAs does little to alter the fact that the blueprint content is only a small fraction of our total DNA.

Most of human DNA has no discernible function. If only 2% of our DNA codes for proteins, what does the rest do? The surprising answer is that we do not really know. There are definitely some regions that act as regulatory areas. That is, they modulate the “dimmer switch” of protein coding regions. But even if we double or triple the amount of these areas that we already know about, we cannot come close to explaining the other 98% of our DNA.

An unknown amount of our DNA is probably junk. Just because we do not know the function of a section of DNA does not mean that that section has no function. Hence, trying to determine the percent of the human genome that is not functional is problematic. Still, the genome contains large sections that repeat themselves over and over or can be deleted or inverted with no apparent effect. Also, there are roughly 10,000 pseudogenes. Virtually all geneticists agree that some areas of human DNA are junk (i.e., nonfunctional), but there is no consensus on the percent that is unnecessary.
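The rough 2% estimate mentioned above multiplies the number of genes by the average amount of blueprint per gene and divides by the size of the genome. The text does not give an average gene size, so the Python sketch below runs the calculation in reverse: it asks how much blueprint per gene the 2% figure implies for 20,000 and for 25,000 genes. The function names are hypothetical, and the numbers are only those quoted in this section.

```python
GENOME_SIZE_BP = 3_200_000_000       # 3.2 billion nucleotides, from the text
GENE_COUNT_RANGE = (20_000, 25_000)  # range of gene counts, from the text
BLUEPRINT_FRACTION = 0.02            # "roughly 2%", from the text


def coding_fraction(gene_count, blueprint_bp_per_gene, genome_bp=GENOME_SIZE_BP):
    """Forward estimate: share of the genome that codes for polypeptides."""
    return gene_count * blueprint_bp_per_gene / genome_bp


def implied_bp_per_gene(gene_count, fraction=BLUEPRINT_FRACTION,
                        genome_bp=GENOME_SIZE_BP):
    """Reverse estimate: average blueprint length per gene implied by a fraction."""
    return fraction * genome_bp / gene_count


for gene_count in GENE_COUNT_RANGE:
    per_gene = implied_bp_per_gene(gene_count)
    print(f"{gene_count:,} genes -> about {round(per_gene):,} blueprint bases per gene")
```

Under these figures, 2% of the genome is about 64 million nucleotides, which works out to roughly 2,600 to 3,200 blueprint nucleotides per gene. Feeding such a value back into `coding_fraction` returns the 2% starting point, which is all the check is meant to show.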