Computational Methods for Rational Oligonucleotide PCR Primer Design and Analysis:


Author: Randolph Palmer

Bioinformatics Workshop #2

Computational Methods for Rational Oligonucleotide PCR Primer Design and Analysis: Two Scenarios Using GCG¥’s SeqLab

‘Not your ordinary primer design.’

The two scenarios:

1) A complicated case where the target DNA is unknown and the sequences are ‘difficult’ to align: the “guessmer.” This is useful for discovering genes in organisms where they have not yet been identified, when the gene’s encoded protein sequence is known in several other, related organisms. Here the example is the prion gene in primates.

2) A case that you can do on your own, where all the DNA sequences are known and ‘easily’ aligned: the Human Papilloma Virus major capsid protein L1, for type and strain differentiation.

Fall 2006; a GCG¥ Wisconsin Package™ SeqLab® tutorial for Florida State University, sponsored by the School of Computational Science (SCS).

Author and Instructor: Steven M. Thompson

Steve Thompson
BioInfo 4U
2538 Winnwood Circle
Valdosta, GA, USA 31601-7953
[email protected]
229-249-9751

¥ GCG is the Genetics Computer Group, part of Accelrys Inc., producer of the Wisconsin Package for sequence analysis.

© 2006 BioInfo 4U, Steven M. Thompson

Introduction

The Polymerase Chain Reaction (PCR), developed at Cetus Corporation by Kary Mullis in the mid-1980s (Saiki et al., 1988), for which he won the Nobel Prize, and patented by Hoffmann-La Roche and Perkin-Elmer Corporation, has revolutionized modern molecular biology. From Jurassic Park scenarios in popular novels, to everyday research in countless laboratories across the world, to cutting-edge forensic pathology techniques, PCR is being used to analyze tinier concentrations of DNA than was ever before imagined possible. PCR allows the investigator to analyze any stretch of DNA in any organism where at least some sequence information is known, either in that organism or in related organisms. It can isolate, and amplify up to around a million-fold, just a few molecules of DNA from complex environmental mixtures, even where the DNA is significantly degraded; the ramifications are incredibly far-reaching. It has been employed, among many examples, to analyze DNA in Egyptian mummies, prehistoric insects preserved in amber, ancient fossilized leaves, and both ice-age frozen and tar-pit preserved mastodons and other animals from the ‘great age of mammals.’ Claims were even made of dinosaur DNA recovery from specimens recovered in a Utah coal mine, though the results were later shown to be contamination.

The practical applications are extensive in medicine, especially in the field of prenatal genetics and, in particular with HIV, immediately postnatal diagnosis. Other pathologies such as Lyme disease are also extremely amenable to PCR diagnosis. Furthermore, molecular evolutionists now have a tremendous tool for inferring phylogenies of any organisms, whether they can be cultured or not. Forensics, too, has been completely turned about. Now investigators can isolate the DNA from incredibly obscure bits of physical evidence, à la CSI, to positively exclude suspects based on distinct patterns, or fingerprints, within their DNA. Using it to ‘prove’ guilt is more difficult because of the population genetics statistics involved; however, even these probabilities can be demonstrated to within several orders of magnitude. PCR has truly changed the face of molecular biology.

PCR is a modified primer extension reaction using a thermostable DNA polymerase that allows for the heat dissociation of newly formed complementary DNA and subsequent hybridization of oligonucleotide probes to the target regions for further rounds of amplification. The scope and methods of PCR are huge, widely varied, and way beyond the aim of this workshop; I will not attempt to teach anything of the actual procedure. Refer to any good, modern text in molecular biology for details (for some good, early reviews of PCR methodology see Mullis [1990], White et al. [1989], and Cherfas [1990]). What I will attempt to teach is a rational method for inferring appropriate oligonucleotide probes, often known as primers, for PCR or hybridization screening analysis. These oligonucleotides are usually about 20 or more bases in length and target the beginning and ending locations of the PCR amplification process. Coupled with PCR techniques and/or ultrasensitive hybridization screenings, oligonucleotide primers have allowed the ‘fishing out’ of thousands of genes from complex genomes that would previously have been extremely difficult to even find, let alone sequence. Present-day economical, automated synthesis and the ready availability of nucleotides have made primers commonplace.

(This has also facilitated the development of reliable methods for the introduction of site-specific mutations into known sequences.) Because of the high specificity and adjustable stringency of oligonucleotide hybridization, the sequence knowledge of a relatively short stretch of unique DNA is sufficient to rapidly isolate and/or amplify, clone if desired, and sequence the corresponding gene. However, whatever technique one may use, primers are essential ingredients. PCR and hybridization screening both require the design of appropriate primers. This can be a ‘hit-or-miss’ affair, or you can use computational methods to greatly improve the efficiency of the process.

Several strategies can be imagined for the design of oligonucleotide primers. If an exact nucleotide sequence is known, then a single oligonucleotide probe for hybridization, or a pair of primers for PCR of a defined sequence, can simply be selected, tested, and synthesized. In the absence of a defined DNA sequence, sometimes a group of similar DNA sequences can be aligned and a consensus sequence created from which primers can be designed. However, this is often not possible because DNA can be very, very difficult to align. In some cases one may even be forced to work from either a small portion of a protein sequence from an Edman degradation reaction or, as will be illustrated in this exercise, a consensus pattern from a group of related proteins; the luxury of using DNA directly is often not available. When nucleotide data is lacking or problematic, amino acid sequences can be back translated to provide the necessary primers. In the absence of exact protein sequence data, a consensus pattern from a group of related proteins can often be used.

Back translating an amino acid sequence is not a trivial chore, though, because of the degeneracy of the genetic code. There are 64 possible codons, 61 of which encode the 20 amino acids, so most amino acids can be specified by more than one codon. Because of this, many different back translation probe techniques have been employed. Two of them are utilizing large pools of short oligonucleotides whose sequences are highly degenerate, and using small pools, or even just one pair, of longer oligonucleotides of lesser or no degeneracy. All organisms have preferential biases in codon usage, and this information can be used to advantage in deciding which codons to synthesize out of all of the possible choices. This strategy of choosing the longest defined stretches of unambiguous peptide, and back translating them to their most probable oligonucleotides, is known as designing “guessmers.” Guessmers contain the combination of codons most likely to match the authentic gene. Guessmers work because the decrease in hybridization stability caused by mismatched bases is offset by an increase in stability from using longer sequences. In most cases, mismatches will occur in only the third position of incorrect codon choices and, therefore, at least two of the three bases will still be matched. Naturally, the biggest constraint on utilizing this type of strategy is that relatively long stretches of amino acid sequence are required.
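To make the degeneracy problem concrete, here is a minimal Python sketch (not part of the GCG package, and not how GCG's BackTranslate is implemented). It builds the standard genetic code, counts how many different DNA sequences could encode a peptide, and assembles one guessmer from a table of preferred codons. The names `degeneracy`, `guessmer`, and `EXAMPLE_PREFERRED` are made up for this illustration, and the codon preferences are placeholders rather than measured usage values.

```python
from itertools import product
from math import prod

# Standard genetic code (NCBI translation table 1), bases ordered T, C, A, G.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}

# Reverse lookup: amino acid -> every codon that encodes it.
CODONS_FOR = {}
for codon, aa in CODON_TABLE.items():
    CODONS_FOR.setdefault(aa, []).append(codon)


def degeneracy(peptide):
    """Number of distinct DNA sequences that could encode the peptide."""
    return prod(len(CODONS_FOR[aa]) for aa in peptide)


def guessmer(peptide, preferred):
    """Back translate a peptide using one preferred codon per amino acid."""
    codons = []
    for aa in peptide:
        codon = preferred[aa]
        if CODON_TABLE.get(codon) != aa:
            raise ValueError(f"{codon} does not encode {aa}")
        codons.append(codon)
    return "".join(codons)


# HYPOTHETICAL codon preferences, for illustration only; real values should
# come from a codon usage table for the target organism or a close relative.
EXAMPLE_PREFERRED = {"W": "TGG", "N": "AAC", "T": "ACC", "G": "GGC"}

print(degeneracy("WNTGGSRYPG"))              # size of a fully degenerate pool
print(guessmer("WNTGG", EXAMPLE_PREFERRED))  # one guessmer candidate
```

Even a ten-residue peptide can correspond to a very large pool of possible coding sequences, which is why picking a single most probable codon per residue is attractive once the peptide is long enough.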

Because of this, guessmers are particularly appropriate when strong and sufficiently long consensus elements can be discovered in a protein family. They should be at least 30 nucleotides in length, in order to ensure sufficient hybridization despite potential mismatches, though PCR primers are seldom designed as long as hybridization probes. It’s also not worth the extra effort and bother to synthesize them longer than about 70 bases. For some early, very good descriptions of the factors involved in guessmer design and analysis, and references to the primary literature, see Sambrook et al. (1990) and Wood (1987). The first portion of today’s tutorial will explore guessmer design.

In order to discover possible consensus patterns within a known protein family for the design of a guessmer, the individual members must be maximally aligned and then a consensus must be created. Alignment is usually achieved through an automated progressive, pairwise alignment procedure, here the GCG program PileUp, which inserts gaps to align the full length of its members. Other automated alignment methods are also available, such as Thompson and Higgins’ ClustalW (1994), Smith and Smith’s PIMA (1995), and Gupta et al.’s MSA (1995), as are several different manual alignment editors. Consensus sequences can then be created from the alignment. Many methods merely rely on the positional frequency of individual symbols; however, some utilize much more information. Profile analysis (Gribskov et al., 1989) is one of these. Profile analysis takes advantage of the BLOSUM (Henikoff and Henikoff, 1992) Dayhoff-style scoring matrices (Schwartz and Dayhoff, 1979) that utilize the relative conservation of various amino acid substitutions within the alignment. Therefore, the resultant consensus residues are the most evolutionarily conserved rather than just statistically the most frequent. This can mean much more to us than an ordinary consensus and is especially appropriate in the design of the type of guessmer that we will be simulating, that is, a situation in which much sequence information for the protein of interest is known in other organisms but not in the one we are studying. I will illustrate the design of guessmers using the prion protein as an example.
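For comparison with profile analysis, the simpler positional-frequency approach mentioned above can be sketched in a few lines of Python. This is only an illustration of the frequency idea, not GCG's algorithm; the function name `column_consensus`, the `threshold` parameter, and the use of “x” for unconserved columns are assumptions made for the example.

```python
from collections import Counter


def column_consensus(aligned, threshold=0.8, gap_chars="~.-"):
    """Frequency-based consensus of equal-length aligned sequences.

    A column yields its most common residue when that residue makes up at
    least `threshold` of the non-gap symbols; otherwise it yields 'x'.
    Columns that contain only gaps yield '-'.
    """
    length = len(aligned[0])
    if any(len(seq) != length for seq in aligned):
        raise ValueError("aligned sequences must all have the same length")

    consensus = []
    for column in zip(*aligned):
        residues = [symbol for symbol in column if symbol not in gap_chars]
        if not residues:
            consensus.append("-")
            continue
        residue, count = Counter(residues).most_common(1)[0]
        consensus.append(residue if count / len(residues) >= threshold else "x")
    return "".join(consensus)


# Four ten-residue segments of the kind seen in a prion protein alignment.
segments = ["LFVATWSDLG", "VFVATWSDLG", "LFVATWSNLG", "LFVATWSDLG"]
print(column_consensus(segments, threshold=0.8))
```

A scheme like this treats every substitution as equally bad. Profile analysis instead weights substitutions by a scoring matrix, which is why its consensus reflects evolutionary conservation rather than raw counts.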

The prion molecule is responsible for a debilitating disease in animals and yet is encoded by the organism’s own DNA; the gene is expressed in both normal and afflicted cells. Large amounts of proteinaceous plaques aggregate and are deposited in the brains of afflicted animals. The prion protein has an unknown natural function but is found in very high quantities in the brains of animals infected with the degenerative neurological diseases scrapie and Bovine Spongiform Encephalopathy, in wild stock, and kuru, Creutzfeldt-Jakob Disease, or Gerstmann-Sträussler Syndrome in humans. It is also involved in Fatal Familial Insomnia and gained notoriety as the harbinger of “Mad-Cow Disease.” In humans the gene maps to position 20p12-pter and the disease can be inherited in an autosomal dominant fashion. Seventeen pathologic allelic variants are listed in OMIM (1995). One of the most peculiar aspects of the prion is that no infective nucleotide entity has ever been found, yet the protein particle itself is highly infectious. Somehow the infectious protein particle induces a posttranslational, pathological change in the host’s normal protein to convert it to the aberrant isoform. The primary amino acid sequence is not changed; only the structural conformation of the protein is different. Stanley B. Prusiner of the University of California, San Francisco, won the 1997 Nobel Prize in Physiology or Medicine, awarded by Stockholm’s Karolinska Institute, for his work on this system. For further information, see Prusiner’s article in Science, available on the World Wide Web at: http://www.sciencemag.org/feature/data/prusiner/245.shl.

The second scenario utilizes a human papillomavirus (HPV) dataset. HPV is known to be associated with many varieties of human genital cancers. The DNA from certain types of HPV, in particular types 16 and 18, has been found integrated into various sites on human chromosomes, especially 12q13, and is often associated with the cis-activation of cellular oncogenes and/or the establishment of heritable fragile sites (OMIM). HPV exists in a dizzying number of genetic types: there are almost 2000 HPV nucleotide sequences, including around 50 complete HPV genomes, in GenBank (Bilofsky, et al. 1986)! Some types appear relatively benign while others have powerful etiologic roles.


The ability to easily discriminate between HPV types is obviously a valuable diagnostic tool. PCR provides a proven methodology for achieving just this. The gene for the HPV major capsid protein, the L1 gene as it is known, has proven to be a reliable locus for this technique. The HPV viral coat is largely built from this protein, which therefore represents the first and major antigen presented to the host. Hence, the selective pressure on the molecule is quite intense: it evolves quickly enough to provide sufficient variation between types for screening purposes and yet has strongly conserved areas to provide for ‘universal’ primers. One paired set, the so-called MY09/11 consensus, has been extensively used for this purpose. See, for other historic examples, the articles by Tenti, Nagano, Stewart and their collaborators (all 1996).

I have already prepared a multiply aligned DNA sequence dataset of the L1 region from about 50 different HPV sequences most similar to type 16 for the second scenario. This dataset will not require the design of guessmers, as these sequences have quite a high degree of similarity, enough to make this region quite easy to align at the DNA level. From the multiple sequence alignment provided, you will be able to design your own ‘universal’ and type/strain specific primers. Furthermore, using the GCG primer design software, you can test the efficiency of the commercial MY09/11 universal set and compare it to your newly designed primers. Finally, you can review the results of a database search that I completed using the MY09/11 primers to see just how specific and/or universal they are for HPV L1 genes.

The Tutorial: A ‘Real-Life’ Project Oriented Approach

I write these tutorials from a ‘lowest-common-denominator’ biologist’s perspective. That is, I only assume that you have fundamental molecular biology knowledge, but are relatively inexperienced regarding computers. As a consequence, they are written quite explicitly. Therefore, if you do exactly what is written, it will work. However, this requires two things: 1) you must read very carefully and not skim over vital steps, and 2) you mustn’t take offense if you already know what I’m discussing. I’m not insulting your intelligence. This also makes the tutorials longer than otherwise necessary. Sorry.

I use bold type in the tutorial for those commands and keystrokes that you are to type in at your console or for buttons that you are to click in SeqLab. I also use bold type for section headings. Screen traces are shown in a “typewriter” style Courier font, and “////////////” indicates abridged data. The arrow symbol, “>”, indicates the system prompt and should not be typed as a part of commands. Really important statements may be underlined.

Specialized “X-server” graphics communications software is required to use GCG’s SeqLab interface. This needs to be installed separately on personal style ‘Wintel’ or Macintosh machines but comes standard with most UNIX operating systems. The details of X and of connecting to the GCG server on campus will not be covered in this exercise. If you are unsure of these procedures, ask for assistance in the computer laboratory. I am also available for individualized personal help in your own laboratories if you are having difficulties connecting to the GCG server from there; just contact me at [email protected]. A couple of tips should be mentioned at this point, though. Rather than holding mouse buttons down to activate items, just click on them; and do not close windows with the X-server software’s close icon in the upper right- or left-hand window corner. Rather, always use GCG’s “Close” or “Cancel” or “OK” button.

Standard operating procedure: first step in much of molecular biology research

Probe genomic digests, shotgun clones, or cDNA libraries, or use PCR methods toward the same end. But how do you design the oligonucleotide(s)?

One way, defined DNA: based on known DNA sequences, define and test probes/primers to any level of specificity using a multiple sequence alignment of those sequences and primer design and analysis software, such as GCG’s Prime. This is covered in the second portion of the tutorial.

Another way, the guessmer, ‘universal’ primers based on protein homology: start from known protein sequences and find strong consensus elements within them; BackTranslate the consensus elements to yield consensus DNA sequences; use Prime to locate candidate primers within the conserved DNA regions; test candidate primers’ suitability with FindPatterns and Prime.

Get started: SeqLab and primer design

Use the powerful X-based Graphical User Interface (GUI) sequence editor SeqLab to fully appreciate multiple sequence alignments and, especially, to manipulate them.

SeqLab is a part of the Accelrys Genetics Computer Group’s (GCG) Wisconsin Package. This comprehensive package of sequence analysis programs is used worldwide and is one of my primary support responsibilities on campus. The package should initialize automatically as soon as you log onto the GCG server. This process activates all of the programs within the package and displays the current version of both the software and all of its accompanying databases. Log on to the campus GCG server Mendel with an X tunneled ssh terminal connection (that’s a capital X!):

> ssh -X [email protected]

I placed a file in a publicly accessible GCG directory to make the last part of this section doable in ‘real time.’ Therefore, after logging in to the GCG server, issue the following command to copy this file into your account:

> fetch primer-tutorial.prion.finds

List your directory (ls) using the long form option (-l) on the new file to see how big it is:

> ls -l primer-tutorial.prion.finds
-rw-r--r-- 1 stevet gcg 49285 Jun 22 20:14 primer-tutorial.prion.finds

Next, issue the command “seqlab &” (without the quotes) in your terminal window to fire up the SeqLab interface. The ampersand, “&,” is not necessary, but it really helps out by launching SeqLab as a background process so that you can retain control of your initial terminal window:

> seqlab &

The command should produce two new windows, the first an introduction with an “OK” box; click “OK.” You should now be in SeqLab’s “List” mode. Before beginning the analyses, go to the “Options” menu and select “Preferences . . .” We should check a few options there to ensure that SeqLab runs in its most intuitive manner. If you were involved in last month’s workshop, there is no need to repeat this section on SeqLab’s preferences. It ‘remembers’ your settings.

First notice that there are three different “Preferences” settings that can be changed: “General,” “Output,” and “Fonts;” start with “General.” The “Working Dir . . .” setting will be the directory from which SeqLab was initially launched. This is where all of SeqLab’s working files will be stored; it can be changed in your accounts if desired, but leave it as is for now. Be sure that the “Start SeqLab in:” choice has “Main List” selected (buttons are pushed in and shaded when they are turned on) and that “Close the window” is selected under the “After I push the “Run” button:” choice. Next select the “Output” Preference. Be sure “Automatically display new output” is selected. Finally, take a look at the “Fonts” menu. We’ll leave these choices as is, but if you’re dealing with really big alignments, then picking a smaller Editor font point size may help you to see more of your alignment on the screen at once. Click “OK” to accept any changes.

1) The first case: the guessmer, from proteins to primers

The scenario

You are given a particular protein to investigate, here the prion protein. It is unknown in the particular organism that your boss wants you to work with, let’s say for the purpose of the tutorial, the strange lemur-like critter the aye-aye. However, you are certain that the same protein has been worked with in other related organisms. You want to use PCR methods to isolate the gene, so you’ll need to come up with some primers. There are many ways to approach this design problem. I will present one that is useful when the protein’s sequence is known in several representative cases, and [let’s assume, for the purpose of the exercise, that] the DNA is too divergent to align directly. The first step is to look for the protein in the protein databases. We are going to use GCG’s database browser program LookUp to do this.

a) LookUp the UniProt protein database

We need to know proper database identity names or accession codes to find entries of interest in sequence databases. Database text searching programs are often the easiest way to do this. There are several methods; the NCBI Entrez program is one of the more powerful, and EMBL/EBI’s SRS is another. Here we’ll use GCG’s LookUp program because it creates an output file that can be used as an input list file to other GCG programs. Ensure that your “SeqLab Main Window” shows “Mode: Main List.” Launch “LookUp” through the “Functions” “Database Reference Searching” menu.

In the “LookUp” window be sure that “Search the chosen sequence libraries” is checked and that “UniProt” is the only library selected. Under the main query section of the window, type the word “prion” following the category “Definition” and the word “primate” in the “Organism” category. The “Organism” category supports any proper taxonomic name, making it a great way to restrict your searches. Press the “Run” button. This should find most of the prion proteins from primates in the UniProt database; since aye-ayes are primates, this is a logical approach. The program will next display the results of the search; scroll through your output and then “Close” the window. The very top portion of my LookUp output file follows below:

!!SEQUENCE_LIST 1.0
LOOKUP in: uniprot of: "([SQ-DEF: prion*] & [SQ-ORG: primate*])"
71 entries    October 9, 2006 19:45 ..

UNIPROT_SPROT:PRIO_AOTTR ! ID: 02f50101
! DE Major prion protein precursor (PrP) (PrP27-30) (PrP33-35C) (CD230
! DE antigen) (Fragment).
! GN Name=PRNP; Synonyms=PRP;
UNIPROT_SPROT:PRIO_ATEGE ! ID: 03f50101
! DE Major prion protein precursor (PrP) (PrP27-30) (PrP33-35C) (CD230
! DE antigen) (Fragment).
! GN Name=PRNP; Synonyms=PRP;
UNIPROT_SPROT:PRIO_ATEPA ! ID: 04f50101
! DE Major prion protein precursor (PrP) (PrP27-30) (PrP33-35C) (CD230
! DE antigen).
! GN Name=PRNP; Synonyms=PRP;
UNIPROT_SPROT:PRIO_CALJA ! ID: 0bf50101
! DE Major prion protein precursor (PrP) (PrP27-30) (PrP33-35C) (CD230
! DE antigen).
! GN Name=PRNP; Synonyms=PRP;
UNIPROT_SPROT:PRIO_CALMO ! ID: 0cf50101
! DE Major prion protein precursor (PrP) (PrP27-30) (PrP33-35C) (CD230
! DE antigen) (Fragment).
! GN Name=PRNP; Synonyms=PRP;
UNIPROT_SPROT:PRIO_CEBAP ! ID: 10f50101
! DE Major prion protein precursor (PrP) (PrP27-30) (PrP33-35C) (CD230
! DE antigen).
! GN Name=PRNP; Synonyms=PRP;
UNIPROT_SPROT:PRIO_CERAE ! ID: 11f50101
! DE Major prion protein precursor (PrP) (PrP27-30) (PrP33-35C) (CD230
! DE antigen).
! GN Name=PRNP; Synonyms=PRP;
UNIPROT_SPROT:PRIO_CERAT ! ID: 12f50101
! DE Major prion protein precursor (PrP) (PrP27-30) (PrP33-35C) (CD230

////////////////////////////////////////////////////////////////////////

Be careful that all of the proteins included in the output from any text-searching program are appropriate. In this case, upon a quick perusal, I see that at least one of the entries is not a true prion; it’s a prion-like protein:

UNIPROT_SPROT:PRND_HUMAN ! ID: 56f60101
! DE Prion-like protein doppel precursor (PrPLP) (Prion protein 2).

This entry should either be edited out of the list file, or it can be removed after loading the list into the SeqLab editor display. An option, if you use an editor, is to comment out the undesired sequences by placing an exclamation point, “!,” in front of the unwanted lines. GCG uses exclamation points as remark delimiters. Select the LookUp output file in the “SeqLab Output Manager” and press the “Add to Main List” button; close the window afterwards. Next, be sure that the LookUp output file is selected in the “SeqLab Main Window” and switch “Mode:” to “Editor.” This will load the file into the SeqLab editor and allow us to align the entries and perform further analyses. ‘Grab and drag’ the lower-right corner of the display to expand it to a more convenient size. The display should look similar to the following graphic:

[Graphic: SeqLab Editor display with the LookUp entries loaded]
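If you would rather edit the list file than cut entries in the editor, the “comment out with an exclamation point” idea can be sketched in Python as follows. This is a hedged illustration, not a GCG tool: the function name `comment_out_entries` is made up, and the sketch assumes that each entry line starts with a `LIBRARY:NAME` specification, as in the LookUp output shown earlier.

```python
def comment_out_entries(list_text, unwanted):
    """Prefix list-file entry lines for unwanted sequences with '!'.

    Lines that already start with '!' are remarks and are left untouched.
    Entry names are compared case-insensitively, ignoring the library
    prefix (the part before the colon).
    """
    unwanted_names = {name.upper() for name in unwanted}
    edited = []
    for line in list_text.splitlines():
        stripped = line.strip()
        if stripped and not stripped.startswith("!"):
            specification = stripped.split()[0]
            entry_name = specification.split(":")[-1].upper()
            if entry_name in unwanted_names:
                line = "! " + line
        edited.append(line)
    return "\n".join(edited)


# Two entry lines taken from the LookUp output discussed in the text.
example = (
    "UNIPROT_SPROT:PRIO_AOTTR ! ID: 02f50101\n"
    "UNIPROT_SPROT:PRND_HUMAN ! ID: 56f60101\n"
    "! DE Prion-like protein doppel precursor (PrPLP) (Prion protein 2)."
)
print(comment_out_entries(example, ["PRND_HUMAN"]))
```

Commenting a line out, rather than deleting it, keeps a record of what the search originally returned, which is handy if you later decide that an excluded entry belongs in the dataset after all.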

Select the prion-like entry, “PRND_HUMAN.” Press the “CUT” button to remove it. Explore the dataset; use the horizontal scroll bar to move along the length of the sequences, and the vertical scroll bar to see the rest of the entries. The “1:1” slider on top allows you to ‘zoom’ in and out on the dataset; move it to “2:1” so that you can see most of the length at once. Double-click on various entries’ names to see their database annotations (or single click the “INFO” icon with the sequence entry name selected). Entries can be analyzed and databases searched through the “Functions” menu, but not now; we’ve got too much to cover tonight. Change the “Display:” box from “Residue Coloring” to “Graphic Features.” Now the display shows a schematic of the database feature information from each entry. Double-click on various colored regions of the alignment (or use the “Features” choice under the “Windows” menu); a “Sequence Features” window will describe the features within the region of the sequence that you selected. Select the feature to show more details. I selected one of the alpha helices in the human prion and my display looks like the following graphic:

[Graphic: “Sequence Features” window for an alpha helix in the human prion]

“Close” the “Sequence Features” window. Switch the “Display:” back to “Residue Coloring” after checking out the “Graphic Features” representation. Also use the “File” menu “Save As . . .” button to save the dataset as an RSF file. Give it a filename that makes sense, such as “prion,” but leave the “.rsf” extension so that you’ll recognize the type of file that it is in your directory. RSF files contain sequence data, names, and annotation; the acronym stands for “rich sequence format.”

b) PileUp the hits and evaluate the results

Now we need to align all of these proteins to determine the most conserved areas, those areas most suitable for locating primers. Therefore, select all of the prion sequence entries in the editor window either by dragging the mouse through them all (if they all fit in the window), by using shift-click on the top and bottom-most entries, or by selecting “Select All” from the “Edit” menu. Now go to the “Functions” “Multiple Comparison” menu and choose “PileUp.”

ClustalW+ is also available there for situations too complicated for PileUp, but this dataset readily aligns with PileUp. You may want to see all the options that are available, although we don’t need to use any in this example. To do so, click on the “Options” button and scroll through the window; “Close” it when finished. Depending on the level of divergence in a dataset, better multiple sequence alignments can often be generated by using alternate scoring matrices (the -Matrix= option, with the BLOSUM30 matrix being the most suitable for the most diverged datasets; Henikoff and Henikoff, 1992) and/or different gap penalties. Gap penalties can be adjusted as desired, but the defaults usually work quite well. Furthermore, GCG’s -InSitu option can be incredibly effective at realigning regions within an alignment (see Workshop #1). However, these sequences are all similar enough that we can just run PileUp using the GCG defaults; therefore, just press “Run” in the “PileUp” window and the program will launch.

PileUp will first compare every sequence with every other one (this is the pairwise nature of the program), and then it will progressively merge them into an alignment in the order of determined similarity, from most to least similar. The window will go away and then, after a few moments, depending on the complexity of the alignment and the load on the server, new output windows will automatically display. The top window will be the Multiple Sequence Format (MSF) output from your PileUp run. Notice the BLOSUM62 matrix and the gap introduction and extension penalties used by default. Scroll through your alignment to check it out and then “Close” the window afterwards. A greatly abridged version of my primate prion MSF file follows below:

!!AA_MULTIPLE_ALIGNMENT 1.0
PileUp of: @/home/thompson/.seqlab-mendel/pileup_1.list

Symbol comparison table: GenRunData:blosum62.cmp  CompCheck: 1102

GapWeight: 8
GapLengthWeight: 2

pileup_1.msf  MSF: 664  Type: P  October 11, 2006 15:38  Check: 4298 ..

Name: q7kyz4_human  Len: 664  Check:  282  Weight: 1.00
Name: q7kyy8_human  Len: 664  Check:  963  Weight: 1.00
Name: o75942_human  Len: 664  Check: 7681  Weight: 1.00
Name: q6ses1_human  Len: 664  Check: 6122  Weight: 1.00
Name: prio_cerae    Len: 664  Check: 2703  Weight: 1.00
Name: prio_cerdi    Len: 664  Check: 2703  Weight: 1.00
Name: prio_cerat    Len: 664  Check: 4010  Weight: 1.00
Name: prio_macsy    Len: 664  Check: 4010  Weight: 1.00
Name: prio_thege    Len: 664  Check: 4214  Weight: 1.00
Name: prio_atege    Len: 664  Check: 4488  Weight: 1.00
Name: q5ub85_atepa  Len: 664  Check: 5143  Weight: 1.00
Name: q9tu20_varvv  Len: 664  Check: 6846  Weight: 1.00
Name: q86xr1_human  Len: 664  Check: 3331  Weight: 1.00
Name: prio_human    Len: 664  Check: 5841  Weight: 1.00
Name: q5qpb4_human  Len: 664  Check: 4002  Weight: 1.00
Name: q53yk7_human  Len: 664  Check: 5841  Weight: 1.00
Name: prio_gorgo    Len: 664  Check: 6237  Weight: 1.00
Name: q6fgn5_human  Len: 664  Check: 6291  Weight: 1.00
Name: q27h91_human  Len: 664  Check: 5263  Weight: 1.00
Name: prio_hylla    Len: 664  Check: 6422  Weight: 1.00
Name: prio_hylsy    Len: 664  Check: 6422  Weight: 1.00
Name: prio_pantr    Len: 664  Check: 6422  Weight: 1.00
Name: q5u0k3_human  Len: 664  Check: 6324  Weight: 1.00

///////////////////////////////////////////////////////////// // 1 q7kyz4_human q7kyy8_human o75942_human q6ses1_human prio_cerae prio_cerdi prio_cerat prio_macsy prio_thege prio_atege q5ub85_atepa q9tu20_varvv q86xr1_human prio_human q5qpb4_human q53yk7_human prio_gorgo q6fgn5_human q27h91_human prio_hylla prio_hylsy prio_pantr q5u0k3_human q540c4_human q5ub98_chisa q5ub97_cacca prio_ponpy q5ub99_pitir prio_colgu q6jl99_macmu prio_prefr prio_macar prio_macfa prio_macfu prio_macmu prio_macne prio_papha prio_cermo prio_mansp prio_cerne prio_erypa prio_certo

~~~~~~~~~~ ~~~~~~~~~~ MANLGCWMLV MANLGCWMLV MANLGCWMLV MANLGCWMLV ~~~~~~~MLV ~~~~~~~MLV ~~~~~~~MLV ~~~~~~~MLV ~~~~~~~~~~ ~~~~~~~~~~ ~~~~~~~MLV MANLGCWMLV MANLGCWMLV MANLGCWMLV MANLGCWMLV MANLGCWMLV MANLGCWMLV MANLGCWMLV MANLGCWMLV MANLGCWMLV MANLGCWMLV ~~~~~~~MLV ~~~~~~~~~~ ~~~~~~~~~~ MANLGCWMLV ~~~~~~~~~~ MANLGCWMLV MANLGCWMLV MANLGCWMLV MANLGCWMLV MANLGCWMLV MANLGCWMLV MANLGCWMLV MANLGCWMLV MANLGCWMLV ~~~~~~~MLV ~~~~~~~MLV ~~~~~~~MLV ~~~~~~~MLV ~~~~~~~MLV

~~~~~~~~~~ ~~~~~~~~~~ LFVATWSDLG LFVATWSDLG VFVATWSDLG VFVATWSDLG LFVATWSDLG LFVATWSDLG LFVATWSDLG LFVATWSDLG ~~~~~~~~~~ ~~~~~~~~~~ LFVATWSDLG LFVATWSDLG LFVATWSDLG LFVATWSDLG LFVATWSDLG LFVATWSDLG LFVATWSDLG LFVATWSDLG LFVATWSDLG LFVATWSDLG LFVATWSDLG LFVATWSDLG ~~~~~~~~~~ ~~~~~~~~~~ LFVATWSNLG ~~~~~~~~~~ LFVATWSDLG LFVATWSDLG LFVATWSDLG LFVATWSDLG LFVATWSDLG LFVATWSDLG LFVATWSDLG LFVATWSDLG LFVATWSDLG LFVATWSDLG LFVATWSDLG LFVATWSDLG VFVATWSDLG LFVATWSDLG

~~~~~~~~~~ ~~~~~~~~~~ LCKKRPKPGG LCKKRPKPGG LCKKRPKPGG LCKKRPKPGG LCKKRPKPGG LCKKRPKPGG LCKKRPKPGG LCKKRPKPGG ~~~~~~~~GG ~~~~~~~~~~ LCKKRPKPGG LCKKRPKPGG LCKKRPKPGG LCKKRPKPGG LCKKRPKPGG LCKKRPKPGG LCKKRPKPGG LCKKRPKPGG LCKKRPKPGG LCKKRPKPGG LCKKRPKPGG LCKKRPKPGG ~~~~~~~~GG ~~~~~~~~GG LCKKRPKPGG ~~~~~~~~GG LCKKRPKPGG LCKKRPKPGG LCKKRPKPGG LCKKRPKPGG LCKKRPKPGG LCKKRPKPGG LCKKRPKPGG LCKKRPKPGG LCKKRPKPGG LCKKRPKPGG LCKKRPKPGG LCKKRPKPGG LCKKRPKPGG LCKKRPKPGG

50 ~~~~~~~~~~ ~~~~~~~~~~ WNTGGSRYPG WNTGGSRYPG WNTGGSRYPG WNTGGSRYPG WNTGGSRYPG WNTGGSRYPG WNTGGSRYPG WNTGGSRYPG WNTGGSRYPG ~~~~~~~~~~ WNTGGSRYPG WNTGGSRYPG WNTGGSRYPG WNTGGSRYPG WNTGGSRYPG WNTGGSRYPG WNTGGSRYPG WNTGGSRYPG WNTGGSRYPG WNTGGSRYPG WNTGGSRYPG WNTGGSRYPG WNTGGSRYPG WNTGGSRYPG WNTGGSRYPG WNTGGSRYPG WNTGGSRYPG WNTGGSRYPG WNTGGSRYPG WNTGGSRYPG WNTGGSRYPG WNTGGSRYPG WNTGGSRYPG WNTGGSRYPG WNTGGSRYPG WNTGGSRYPG WNTGGSRYPG WNTGGSRYPG WNTGGSRYPG WNTGGSRYPG

~~~~~~~~~~ ~~~~~~~~~~ QGSPGGNRYP QGSPGGNRYP QGSPGGNRYP QGSPGGNRYP QGSPGGNRYP QGSPGGNRYP QGSPGGNRYP QGSPGGNRYP QGSPGGNRYP ~~~~~~~~~~ QGSPGGNRYP QGSPGGNRYP QGSPGGNRYP QGSPGGNRYP QGSPGGNRYP QGSPGGNRYP QGSPGGNRYP QGSPGGNRYP QGSPGGNRYP QGSPGGNRYP QGSPGGNRYP QGSPGGNRYP QGSPGGNRYP QGSPGGNRYP QGSPGGNRYP QGSPGGNRYP QGSPGGNRYP QGSPGGNRYP QGSPGGNRYP QGSPGGNRYP QGSPGGNRYP QGSPGGNRYP QGSPGGNRYP QGSPGGNRYP QGSPGGNRYP QGSPGGNRYP QGSPGGNRYP QGSPGGNRYP QGSPGGNRYP QGSPGGNRYP

//////////////////////////////////////////////////////////////////// q5ub92_cebap

~~~~~~~~~~ ~~~~

12

prio_atepa q5ub94_lagla prio_aottr q5ub96_alobe prio_calmo q5uba0_calmo q5ub87_calja q5ub88_calgo prio_calja q5ub90_9prim q5ub95_braar q5ub93_saisc q5ub91_aotle q5ub89_leoro q5ub86_cebpy prio_saisc q1l6p5_micmu q27h88_human q5tg42_human q5t2t6_human q5t2t5_human q5tg43_human q5tg34_human q5t2t8_human q5tg35_human q15196_human

~~~~~~~~~~ ~~~~~~~~~~ ~~~~~~~~~~ ~~~~~~~~~~ ~~~~~~~~~~ ~~~~~~~~~~ ~~~~~~~~~~ ~~~~~~~~~~ ~~~~~~~~~~ ~~~~~~~~~~ ~~~~~~~~~~ ~~~~~~~~~~ ~~~~~~~~~~ ~~~~~~~~~~ ~~~~~~~~~~ ~~~~~~~~~~ ~~~~~~~~~~ ~~~~~~~~~~ AYRGFIFKQT ~~~~~~~~~~ ~~~~~~~~~~ ~~~~~~~~~~ AYRGFIFKQT AYRGFIFKQT AYRGFIFKQT ~~~~~~~~~~

~~~~ ~~~~ ~~~~ ~~~~ ~~~~ ~~~~ ~~~~ ~~~~ ~~~~ ~~~~ ~~~~ ~~~~ ~~~~ ~~~~ ~~~~ ~~~~ ~~~~ ~~~~ SKPF ~~~~ ~~~~ ~~~~ SKPF SKPF SKPF ~~~~
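If you ever need to check an MSF file outside of SeqLab, the header is easy to read with a few lines of script. The following is only a minimal sketch, assuming the “Name: … Len: … Check: … Weight: …” layout seen in the PileUp output above; the function name read_msf_names and the two-line sample are my own illustration, not part of GCG.

```python
import re

# Matches one MSF header line such as
# " Name: prio_human  Len: 664  Check: 5841  Weight: 1.00"
NAME_LINE = re.compile(
    r"Name:\s*(\S+)\s+Len:\s*(\d+)\s+Check:\s*(\d+)\s+Weight:\s*([\d.]+)"
)


def read_msf_names(text):
    """Return (name, length, checksum, weight) tuples from MSF header text."""
    entries = []
    for line in text.splitlines():
        if line.strip().startswith("//"):  # header ends here, alignment follows
            break
        match = NAME_LINE.search(line)
        if match:
            name, length, check, weight = match.groups()
            entries.append((name, int(length), int(check), float(weight)))
    return entries


# Two entries copied from the abridged MSF header above.
sample = """\
 Name: q7kyz4_human  Len: 664  Check:  282  Weight: 1.00
 Name: prio_human    Len: 664  Check: 5841  Weight: 1.00
//
"""

for entry in read_msf_names(sample):
    print(entry)
```

A quick look at the returned names and lengths is a cheap way to confirm that every entry you expected actually made it into the alignment.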

After scrolling through your alignment and then “Close”ing its window, the next window visible will be the “SeqLab Output Manager.” This very important window will contain all of the output from your current SeqLab session. Files may be displayed, printed, saved with other names and/or in other locations, and deleted from this window. We need to use an extremely important function at this point; press the “Add to Editor” button and specify “Overwrite old with new” in the next window when prompted, to take your MSF output and merge it with the RSF file in the open editor. This will keep all feature information intact, yet renumber all of its reference locations. “Close” the “Output Manager” after loading your new alignment. The next window will contain PileUp’s cluster dendrogram; in the primate prion case, the graphic below:

This similarity dendrogram can be very helpful for determining whether the sequences used are all appropriate. The length of the vertical lines is proportional to the difference between the sequences. In this case I think that we should exclude all of the human outlier sequences seen at far left in the dendrogram, just keeping the main central cluster. However, realize that this tree is not an evolutionary tree. No phylogenetic inference algorithms, such as maximum likelihood or parsimony, nor any ‘multiple-hit’ correction models, such as Jukes-Cantor or Kimura, are used in its construction. PileUp’s dendrogram merely indicates the relative similarity of the sequences and, therefore, the clustering order in which the alignment was built. After loading our new alignment, my SeqLab Editor display looks like the following screen dump at a “4:1” zoom ratio:

Notice the nice columns of color representing columns of aligned residues. However, also notice that the alignment appears in two different sections, presumably true prions and those outliers seen in the dendrogram, which is further evidence that we are trying to align ‘apples and oranges.’ Double-click on the entry names at the very bottom of the alignment: it turns out that they are prion “interacting” proteins, not prions themselves.

Select all of these non-prion outlier sequences and “CUT” them from the alignment. It also turns out that the sequence just above this group, Q27H88_HUMAN, isn’t a real prion either; it’s another prion-like protein. Select and “CUT” it from the alignment as well. There were nine outliers in the dendrogram. I didn’t do a very good job of spotting all the non-prions on my perusal of the LookUp output. Oh well; they are easy to get rid of at this stage. To ensure that no columns of gaps are left in an alignment after cutting sequences out of it, it is always a good idea to use the “Edit” “Select All,” and then “Edit,” “Remove Gaps . . . ,” “Columns of gaps” functions. Do so at this time, and then use the “File” menu to “Save As . . .” the alignment. Use the same name as before and “Overwrite” the file. Return your display to a “1:1” zoom ratio.

c) Determine areas of maximal conservation

To identify regions of the alignment most appropriate for designing universal primers or probes, we need to decide what regions are most highly conserved. To design a hybridization probe, one most highly conserved section is chosen; to design paired PCR primers, two flanking, highly conserved areas are chosen. A good way of doing this is to calculate the running average similarity using a sliding window approach. The GCG graphics program PlotSimilarity does this so that we can easily visualize the positional conservation of a multiple sequence alignment. The program uses a sliding window along with a similarity matrix, such as BLOSUM62, to indicate which portions are most conserved and which are most variable. The program can also produce a color mask that corresponds to the plot by representing peaks with dark grays. This can be overlaid on the alignment in the SeqLab editor to see exactly where the similarity rises and falls.

An advantage of running PlotSimilarity on a protein alignment rather than a DNA alignment is that the peaks on the plot not only represent the most conserved regions of the alignment, but also those areas most resistant to evolutionary change, due to the algorithm’s use of the BLOSUM matrix in its calculations. Ensure that all of the sequence entries are still selected. Next go to the SeqLab “Functions” menu; select “Multiple Comparison” and then “PlotSimilarity.” You may get a “Which selection” box if you have previously selected a region of the alignment; if you do, specify “Selected sequences” not “Selected region.” This will produce a PlotSimilarity dialog box. We need to change some of the program defaults there, so choose “Options . . . .” Check “Save SeqLab colormask to” and “Scale the plot between:” the “minimum and maximum values calculated from the alignment.” The first option’s output file will be used in the next step and the second specification launches the program’s –Expand option. This blows up the plot, scaling it between the maximum and minimum similarity values observed so that the entire graph is used rather than just the portion of the Y-axis that the alignment happens to occupy. The Y-axis of the resulting plot will use the similarity values from the default amino acid scoring matrix or you can specify an alternative. “Close” the “PlotSimilarity Options” window; notice that the “Command Line:” box in the program window now reflects your updated options. Click the “Run” box to launch the program. The output will quickly return. “Close” the plotsimilarity.cmask display and the “Output Manager” and then take a look at the similarity plot. My example follows below:

This example shows a great deal of sequence similarity. Strong peaks can be seen centered about positions 40 and 90, and throughout 125 to 270 or thereabout. The ordinate scale here is dependent on the scoring matrix used by the program, by default the BLOSUM62 table in which amino acid identities vary from 4 to 11. The dashed line across the middle shows the average similarity value for the entire alignment, here about 3.8. “Close” the PlotSimilarity window after noting where appropriate sections of high conservation within the alignment occur.
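The sliding window idea behind such a plot can be sketched in a few lines. This is not PlotSimilarity’s actual code: as a stated simplification, each column is scored by plain pairwise identity (1 for a match, 0 otherwise) instead of BLOSUM62, gap symbols are simply skipped, and the function names are hypothetical. The three rows are taken from the first block of the alignment shown earlier.

```python
from itertools import combinations


def column_score(column, score=lambda a, b: 1.0 if a == b else 0.0):
    """Average pairwise score of the residues in one alignment column.

    Gap symbols ('.', '~', '-') are skipped. The default scorer is plain
    identity, a stand-in for a real table such as BLOSUM62.
    """
    residues = [c for c in column if c not in ".~-"]
    pairs = list(combinations(residues, 2))
    if not pairs:
        return 0.0
    return sum(score(a, b) for a, b in pairs) / len(pairs)


def running_similarity(alignment, window=10):
    """Mean column score over each run of `window` consecutive columns."""
    scores = [column_score(col) for col in zip(*alignment)]
    return [
        sum(scores[i:i + window]) / window
        for i in range(len(scores) - window + 1)
    ]


# First 20 columns of o75942_human, prio_cerae and prio_cerat.
rows = [
    "MANLGCWMLVLFVATWSDLG",
    "MANLGCWMLVVFVATWSDLG",
    "~~~~~~~MLVLFVATWSDLG",
]
profile = running_similarity(rows, window=10)
print([round(value, 2) for value in profile])
```

Windows that include the single L/V difference at column 11 score slightly lower than the fully conserved first window, which is exactly the kind of dip the real plot and color mask make visible.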

Next, go to the SeqLab “File” menu; select “Open Color Mask Files.” Select the file displayed in the dialog box, “plotsimilarity.cmask;” click “Add” and then “Close.” Notice that the display is now represented in various gray-tones — the intensity of color is proportional to the level of similarity in the alignment at that point, averaged over the default window of 10 amino acids. Notice the correspondence between the original plot’s peaks and valleys and the color mask’s dark and light areas. My screen dump is shown here:

The point of these similarity visualization techniques is to identify those regions of the alignment that will be most appropriate for designing universal primers: areas of high conservation, obviously. Try to identify stretches that correspond to around 100 bases, i.e. around 30 to 40 amino acids. Decide whether you want to design a single hybridization probe (the central repeat region here looks great) or paired PCR primers based on the observed similarity. Either case will do for the exercise. I will illustrate paired PCR guessmers by choosing the furthest separated, most highly conserved regions I can find. If designing a single hybridization probe, choose the single, longest, least ambiguous sequence you can find based on all the information you have. If designing PCR primers, choose two highly conserved stretches that bracket the longest portion of the alignment possible. This is obviously a subjective decision and depends on how much of the sequence you will be trying to amplify. Regardless, choose the longest regions possible, as I stated above, at least 30 to 40 amino acids long, in order to get target regions at least 100 base pairs apiece. We will isolate the best primers within these stretches. Decide which exact sequence regions to use; write down your selections. I selected residues 21 through 51 for my upstream primer, and 210 through 260 for my downstream primer.

d) Use ProfileMake to create a consensus

We need to generate a consensus of the sequence alignment next. We could use the “Consensus” tool under SeqLab’s “Edit” menu; however, the most powerful protein sequence consensus method I am aware of is the Profile algorithm. This algorithm uses all of the data of an alignment, its conservation and its variability, as well as the BLOSUM matrix to create a new alignment specific similarity matrix. Certainly, in this case, because of the high similarity of all the sequences, the difference would be trivial, but sometimes it can make a big difference. A profile, and its inherent consensus, is created with the program ProfileMake. Be sure that all of your sequences are selected and then go to the “Functions” “Multiple Comparison” menu and launch “ProfileMake.” Punch the “Options” button, select “Write the consensus into a sequence file,” and supply an appropriate filename. This will launch the program’s –SeqOut option to generate a normal GCG sequence file of the consensus in addition to the profile. Leave the other options as they are and “Close” the “Options” window. Press “Run” in the “ProfileMake” program window and check out the results. Take a look at the consensus sequence. The abridged primate prion profile consensus sequence follows:

!!AA_SEQUENCE 1.0
(Consensus) (Peptide) PROFILEMAKE v4.50 of: @/home/thompson/.seqlab-mendel/profilemake_12.list  Length: 287
 Sequences: 61  MaxScore: 1062.85  October 13, 2006 16:05
                          Gap: 1.00              Len: 1.00
                     GapRatio: 0.33         LenRatio: 0.10
 input_12.rsf{Q7KYZ4_HUMAN}   From: 1  To: 143  Weight: 1.00
 input_12.rsf{Q7KYY8_HUMAN}   From: 1  To: 135  Weight: 1.00
 input_12.rsf{O75942_HUMAN}   From: 1  To: 287  Weight: 1.00
 input_12.rsf{Q6SES1_HUMAN}   From: 1  To: 287  Weight: 1.00
 input_12.rsf{PRIO_CERAE}     From: 1  To: 287  Weight: 1.00
 input_12.rsf{PRIO_CERDI}     From: 1  To: 287  Weight: 1.00
 input_12.rsf{PRIO_CERAT}     From: 1  To: 287  Weight: 1.00
 input_12.rsf{PRIO_MACSY}     From: 1  To: 287  Weight: 1.00
 input_12.rsf{PRIO_THEGE}     From: 1  To: 287  Weight: 1.00
 input_12.rsf{PRIO_ATEGE}     From: 1  To: 282  Weight: 1.00
 input_12.rsf{Q5UB85_ATEPA}   From: 1  To: 273  Weight: 1.00
 input_12.rsf{Q9TU20_VARVV}   From: 1  To: 225  Weight: 1.00
////////////////////////////////////////////////////////////////////////////////
 Symbol comparison table: GenRunData:blosum62.cmp  FileCheck: 982
 Relaxed treatment of non-observed characters
 Exponential weighting of characters

 Length: 287  October 13, 2006 16:05  Type: P  Check: 7501  ..

       1  MANLGCWMLV LFVATWSDLG LCKKRPKPGG WNTGGSRYPG QGSPGGNRYP

      51  PQGGGGWGQP HGGGWGQPHG GGWGQPHGGG WGQPHGGGWG QPHGGGWGQP

     101  HGGGWGQPHG GGWGQPHGGG WGQPHGGGTH NQWNKPSKPK TNMKHMAGAA

     151  AAGAVVGGLG GYMLGSAMSR PLIHFGNDYE DRYYRENMYR YPNQVYYRPV

     201  DQYSNQNNFV HDCVNITIKQ HTVTTTTKGE NFTETDVKMM ERVVEQMCIT

     251  QYEKESQAYY QRGSSMVLFS SPPVILLISF LIFLIVG

You may want to look at your resultant “.prf” file. It’s a big table of numbers that doesn’t make a whole lot of sense on first inspection; however, it is a tremendously powerful tool in subsequent analysis steps. Other programs can read and interpret all of those numbers to perform very sensitive database searches and alignments by utilizing the information within it, which penalizes misalignments in phylogenetically conserved areas more than in variable regions. Load your new profile consensus sequence into the editor. Do this with the “Windows” menu “Output Manager” window. Select your new consensus sequence file name there and press the “Add to Editor” button. “Close” the “Output Manager” window after loading the consensus sequence.

e) Select and use BackTranslate on the consensus sequence

In an actual lab situation your peptide probe region(s) may not be as long as my examples. I was fortunate to find such strong consensus elements in the prion protein. Regardless of what length regions you come up with though, they are still peptide sequences, and oligonucleotide probes are necessary for both hybridization and PCR methodology. Backtranslation is not trivial because of the degeneracy of the genetic code. GCG has addressed this problem with their program BackTranslate. Alternate codons are indicated in the output along with their order of preference, based on the codon usage table that you specify, for each amino acid of the sequence. You can choose from them; the program generates either the most probable or the most ambiguous sequence. To use BackTranslate you must decide which codon usage table you want the program to utilize. By default BackTranslate will use a frequency table designed from highly expressed E. coli genes. Therefore, if you’re working with an E. coli gene, the program’s default is appropriate. However, if your protein comes from anything else, you will want to use an alternate table. GCG provides a few alternate data files in a public data library with the GCG logical name GenMoreData. The available tables, in addition to the default codon usage table, ecohigh.cod, are: celegans_high.cod, celegans_low.cod, drosophila_high.cod, human_high.cod, maize_high.cod, and yeast_high.cod. Even more tables are available at various molecular biology data servers such as IUBIO (http://iubio.bio.indiana.edu/soft/molbio/codon/). The TRANSTERM database at the European Bioinformatics Institute (ftp://ftp.ebi.ac.uk/pub/databases/transterm/) also contains several, and an especially good selection derived from a recent GenBank version comes from the CUTG database (http://www.kazusa.or.jp/codon/) available in GCG format through various SRS servers (e.g. see http://srs.sanger.ac.uk/srs6bin/cgi-bin/wgetz?-page+LibInfo+-lib+CUTG). Furthermore, if you are not satisfied with any of the available options, GCG has a program, CodonFrequency, that enables you to create your own codon frequency table from known coding sequences. Select your profile consensus sequence entry (only).

Now go to the “Functions” “Translation” “BackTranslate . . .” menu; specify “Selected Sequence,” if asked. In the BackTranslate program window change the type of sequence produced from the “Would you like to see:” most ambiguous default to “table of back-translations and the most probable sequence.” You also need to change the “Codon Frequency Table . . .” from the default “ecohigh.cod” to something more reasonable, so press the button and choose “human_high.cod” from the “Chooser for Codon Frequency Table” window that pops up. Press the “OK” button in the “Chooser” window after selecting the human table, and then press “Run” in the program window. Display the output file and notice how each codon is listed. An abridged version of my prion backtranslation sequence data file is shown below:

!!NA_SEQUENCE 1.0
BACKTRANSLATE of: : input_14.rsf{prion}  check: 7501  from: 1  to: 287

Description: (Consensus) (Peptide) PROFILEMAKE v4.50 of: @/home/thompson/.seqlab-mendel/profi
Accession/ID:
====================General comments====================
(Consensus) (Peptide) PROFILEMAKE v4.50 of: @/home/thompson/.seqlab-mendel/profilemake_12.list  Length: 287
 Sequences: 61  MaxScore: 1062.85  October 13, 2006 16:05

Using codon frequencies from: /usr/local/gcg/share/codon/human_high.cod  CheckFile: 1528

CODONFREQUENCY  January 24, 1991 16:55
From an existing codon frequency file: Humprb4l_217_1054.Cod  FileCheck: 8577
From an existing codon frequency file: Humprb4m_217_928.Cod  FileCheck: 7623
From an existing codon frequency file: Humprb1s_51_763.Cod  FileCheck: 8371
From an existing codon frequency file: Humprb1_51_946.Cod  FileCheck: 119
From an existing codon frequency file: Humptaa_156_488.Cod  FileCheck: 9052
. . .

   1                                                               7
  Met       Ala       Asn       Leu       Gly       Cys       Trp

  ATG 1.00  GCC 0.53  AAC 0.78  CTG 0.58  GGC 0.50  TGC 0.68  TGG 1.00
            GCG 0.17  AAT 0.22  CTC 0.26  GGG 0.24  TGT 0.32
            GCT 0.17            TTG 0.06  GGA 0.14
            GCA 0.13            CTT 0.05  GGT 0.12
                                CTA 0.03
                                TTA 0.02

  240       120       154       197       340       394       371

   8                                                              14
  Met       Leu       Val       Leu       Phe       Val       Ala

  ATG 1.00  CTG 0.58  GTG 0.64  CTG 0.58  TTC 0.80  GTG 0.64  GCC 0.53
            CTC 0.26  GTC 0.25  CTC 0.26  TTT 0.20  GTC 0.25  GCG 0.17
            TTG 0.06  GTT 0.07  TTG 0.06            GTT 0.07  GCT 0.17
            CTT 0.05  GTA 0.05  CTT 0.05            GTA 0.05  GCA 0.13
            CTA 0.03            CTA 0.03
            TTA 0.02            TTA 0.02

  215       172       190       157       155       193       103

  15                                                              21
  Thr       Trp       Ser       Asp       Leu       Gly       Leu

  ACC 0.57  TGG 1.00  AGC 0.34  GAC 0.75  CTG 0.58  GGC 0.50  CTG 0.58
  ACG 0.15            TCC 0.28  GAT 0.25  CTC 0.26  GGG 0.24  CTC 0.26
  ACA 0.14            TCT 0.13            TTG 0.06  GGA 0.14  TTG 0.06
  ACT 0.14            AGT 0.10            CTT 0.05  GGT 0.12  CTT 0.05
                      TCG 0.09            CTA 0.03            CTA 0.03
                      TCA 0.05            TTA 0.02            TTA 0.02

  145       148       74        126       114       162       265

  22                                                              28
  Cys       Lys       Lys       Arg       Pro       Lys       Pro

  TGC 0.68  AAG 0.82  AAG 0.82  CGC 0.37  CCC 0.48  AAG 0.82  CCC 0.48
  TGT 0.32  AAA 0.18  AAA 0.18  CGG 0.21  CCT 0.19  AAA 0.18  CCT 0.19
                                AGG 0.18  CCG 0.17            CCG 0.17
                                AGA 0.10  CCA 0.16            CCA 0.16
                                CGT 0.07
                                CGA 0.06

  169       119       119       70        94        98        120

///////////////////////////////////////////////////////////////////////////

prion_14.seq  Length: 861  October 13, 2006 16:35  Type: N  Check: 1187  ..

       1  ATGGCCAACC TGGGCTGCTG GATGCTGGTG CTGTTCGTGG CCACCTGGAG

      51  CGACCTGGGC CTGTGCAAGA AGCGCCCCAA GCCCGGCGGC TGGAACACCG

     101  GCGGCAGCCG CTACCCCGGC CAGGGCAGCC CCGGCGGCAA CCGCTACCCC

     151  CCCCAGGGCG GCGGCGGCTG GGGCCAGCCC CACGGCGGCG GCTGGGGCCA

     201  GCCCCACGGC GGCGGCTGGG GCCAGCCCCA CGGCGGCGGC TGGGGCCAGC

     251  CCCACGGCGG CGGCTGGGGC CAGCCCCACG GCGGCGGCTG GGGCCAGCCC

     301  CACGGCGGCG GCTGGGGCCA GCCCCACGGC GGCGGCTGGG GCCAGCCCCA

     351  CGGCGGCGGC TGGGGCCAGC CCCACGGCGG CGGCACCCAC AACCAGTGGA

     401  ACAAGCCCAG CAAGCCCAAG ACCAACATGA AGCACATGGC CGGCGCCGCC

     451  GCCGCCGGCG CCGTGGTGGG CGGCCTGGGC GGCTACATGC TGGGCAGCGC

/////////////////////////////////////////////////////////////////
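The “most probable” sequence above follows from a simple rule: for each residue, take the codon with the highest frequency in the chosen table. Here is a minimal sketch of that rule, not BackTranslate’s own code. The frequencies are copied from the human_high.cod excerpt above, only for the amino acids in the first twenty consensus residues; the function names and the 0.10 margin used to flag close runner-up codons are my own assumptions.

```python
# Codon frequencies copied from the human_high.cod excerpt above.
HUMAN_HIGH = {
    "M": {"ATG": 1.00},
    "A": {"GCC": 0.53, "GCG": 0.17, "GCT": 0.17, "GCA": 0.13},
    "N": {"AAC": 0.78, "AAT": 0.22},
    "L": {"CTG": 0.58, "CTC": 0.26, "TTG": 0.06,
          "CTT": 0.05, "CTA": 0.03, "TTA": 0.02},
    "G": {"GGC": 0.50, "GGG": 0.24, "GGA": 0.14, "GGT": 0.12},
    "C": {"TGC": 0.68, "TGT": 0.32},
    "W": {"TGG": 1.00},
    "V": {"GTG": 0.64, "GTC": 0.25, "GTT": 0.07, "GTA": 0.05},
    "F": {"TTC": 0.80, "TTT": 0.20},
    "T": {"ACC": 0.57, "ACG": 0.15, "ACA": 0.14, "ACT": 0.14},
    "S": {"AGC": 0.34, "TCC": 0.28, "TCT": 0.13,
          "AGT": 0.10, "TCG": 0.09, "TCA": 0.05},
    "D": {"GAC": 0.75, "GAT": 0.25},
}


def most_probable(peptide, table=HUMAN_HIGH):
    """Back-translate by picking the most frequent codon for each residue."""
    return "".join(max(table[aa], key=table[aa].get) for aa in peptide)


def ambiguous_positions(peptide, table=HUMAN_HIGH, margin=0.10):
    """List (position, residue, codons) where a runner-up codon is within
    `margin` of the best one; these are candidates for a mixed oligo."""
    hits = []
    for position, aa in enumerate(peptide, start=1):
        freqs = table[aa]
        best = max(freqs.values())
        close = [codon for codon, f in freqs.items() if best - f <= margin]
        if len(close) > 1:
            hits.append((position, aa, close))
    return hits


peptide = "MANLGCWMLVLFVATWSDLG"  # first 20 residues of the consensus
print(most_probable(peptide))
print(ambiguous_positions(peptide))
```

With this margin only the serine at position 17 is flagged, where AGC and TCC are nearly equally likely.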

The final, resultant nucleotide sequence is the most likely coding sequence for the consensus polypeptide we specified using the codon frequency chart we chose. A recommended enhancement, within those codons in potential primers, is to prepare a mixture of oligos containing the various codons for those positions that are particularly ambiguous, such as the serine at position 17 in the example above. AGC is used 34% of the time, but TCC is also used 28% of the time. Several more analyses are necessary before synthesizing your new probes, however. We need to discover which portions of the consensus elements that we have identified make the best primers. Of those portions, we need to determine whether any have significant internal complementarity such that strong ‘hairpin’ structures would be formed, and we should also check for self- and primer-dimer complementarity. The GCG program Prime can be used for all these tests. We also need to run a DNA database search to make sure that only the type of genes that we are interested in are ‘found.’ The GCG program FindPatterns is probably best for this type of search because it does not allow gapping.

f) Use Prime to locate ‘good’ primers within your candidate regions

The GCG program Prime can locate acceptable primers within a DNA template; Prime+ will work with genomic-length sequences. The programs are quite powerful and contain many, many options to maximize flexibility. We will use Prime here to find the best forward and reverse primers within the defined 5’ and 3’ sequence regions identified above based on sequence similarity. We’ll use Prime to localize the best primers within our defined stretches of DNA, eventually locating nucleotide hybridization guessmers of 30 to 50 bases, corresponding to a peptide of 10 to 17 residues, or PCR primers around 20 to 30 bases in length, corresponding to a peptide 7 to 10 amino acid residues long. (Although, in ‘real life,’ PCR primers may need to be even longer to maximize annealing potential.) But before we run Prime, we need to load the new backtranslated sequence into our Editor display so it is available for analysis. Therefore, use “Add to Editor” from the SeqLab “Output Manager” to load the new DNA sequence. The sequence will not load aligned to its respective protein coding region. In fact, this would be impossible because it is three times longer than the protein sequence, which would need to be spaced out to leave two gaps between every amino acid to reconcile the two. It will load starting at position one in the Editor display. Just realize that it is no longer aligned to the alignment above. Specify the upstream and downstream backtranslated sequence regions for Prime to search. We’ll use Prime’s –Begin1, –End1, –Begin2, –End2 and –Include options to restrict our primers to the predefined target regions. However, remember that there is a three to one numbering discrepancy between the DNA and protein sequence. Be sure that your backtranslated sequence is loaded and that it is the only one selected in the SeqLab editor, and then select the overall range within it for your target product with the “Edit” “Select Range” function.

In other words, choose that range delineated by your 5’ and 3’ most target locations identified in your PlotSimilarity notes, times three to compensate for the numbering discrepancy. In my example’s case that’s from base number “63” through “780.” Launch “Prime” through the “Functions” or “Windows” menu. Specify “Selected region” rather than “Selected sequence” when prompted by the “Which selection” box. You may want to specify a slightly longer “Primer Length” than the 18 through 22 default, though it isn’t necessary. I changed this parameter to a “Minimum” of “20” and a “Maximum” of “50” to help take into account potential mismatches introduced in the backtranslation step. Also, set “Maximum product length” to the full length of the region you selected on your backtranslated sequence. This will be the maximum value displayed, in my case “718.” Save an RSF file to add annotation to your existing prion dataset by checking “Save results as features in file.” Choose “Options” next; note that lots and lots of options are available. This makes the program very flexible and very powerful. Check “Specify PCR target range.” This activates the –Begin2 and –End2 command line options. The “PCR target starting” and “ending position” parameters identify those positions. Therefore, specify the 3’ end of your upstream candidate region and the 5’ end of your downstream candidate region respectively, again considering the three to one numbering discrepancy. The “starting position” is “153” and the “ending position” is “630” in my example. Also be sure that “Minimum % of specified PCR target range to be included in product” is “100.0.” This effectively brackets your desired product, forcing all primer searching to be performed within the desired primer binding regions. Just below that section in the “Prime Options” window, check “Save primers found to a pattern file,” the –FoundPrimers option, and designate an appropriate output data file name that makes sense to you (e.g. “prion.prime.dat”). This option saves the primers in a special GCG pattern data format file, which can be read by Prime and other programs, as well as in the standard text output. If you are looking for only the one best hybridization probe rather than paired primers, the –ForwardPrimers option is obviously necessary.
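The “times three” bookkeeping is easy to get wrong by hand, so here is a small sketch of the arithmetic for my example regions. The function and variable names are illustrative only. Note that multiplying a residue number by three gives the last base of that residue’s codon (codon i spans bases 3i − 2 through 3i), so the simple shortcut used in this workshop starts the selection two bases inside the first codon, which is harmless here.

```python
def residue_to_base(residue):
    """Workshop shortcut: residue number times three.

    This is the LAST base of that residue's codon, since codon i spans
    bases 3*i - 2 through 3*i.
    """
    return 3 * residue


upstream = (21, 51)      # residues chosen for the upstream primer region
downstream = (210, 260)  # residues chosen for the downstream primer region

select_begin = residue_to_base(upstream[0])    # "Select Range" start
select_end = residue_to_base(downstream[1])    # "Select Range" end
target_begin = residue_to_base(upstream[1])    # "PCR target starting position"
target_end = residue_to_base(downstream[0])    # "PCR target ending position"

# Longest and shortest products that still satisfy these settings.
max_product = select_end - select_begin + 1
min_product = target_end - target_begin + 1

print(select_begin, select_end, target_begin, target_end)
print(min_product, max_product)
```

The values reproduce the numbers quoted above (63 through 780, target 153 through 630, maximum product 718), and the minimum of 478 matches the lower product length bound that Prime reports in the run shown later.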

Accept the rest of the program defaults; “Close” the “Prime Options” window and press “Run” in the Prime program window. Your first pass may not find anything; as is often the case, mine didn’t. You’ll be able to easily tell because your “.rsf” and “.dat” files will be empty. This is because the default experimental conditions set by Prime are very restrictive. Often you will have to change many of these parameters in subsequent program runs to find any primers at all. Prime can sometimes be quite frustrating to run because of these stringent parameters. However, it’s best to start with the very stringent default conditions and slowly relax them, versus going the other way round, though Prime does have an option, “Ignore . . . constraints,” that does allow you to do that, if you become too frustrated. Use the “Output Manager” to select and “Display” the file that ends with the “.prime” extension. This text file describes the conditions used in the run and lists acceptable primers, if there were any, with their corresponding melting temperatures. The “.prime” file also points out exactly which parameters prevented success. Therefore, if no primers were discovered, either repeat the Prime run with different, more permissive, parameters, or choose different and/or longer target sections on the consensus sequence. You will have to experiment with changing these parameters to discover the combination that works. You may be forced to rerun the program a number of times adjusting parameters and the regions searched until you are successful. This can be frustrating; just persevere. Use the same data file output name in subsequent runs so that you end up with only the one successful set of universal primers. Based on our “.prime” report we can see that the parameter that most prevented success in this case is GC content. Therefore, repeat the run using a less stringent GC content (or whatever parameters you are having troubles with). Sometimes it will take many passes through the program adjusting different parameters each time in order to finally get something acceptable.

Play with the options to find the best primers in the backtranslated sequence. The “Windows” menu contains a ‘shortcut’ listing of all programs used in the current session; you can launch any of them from there as well as from the “Functions” menu. Relaunch “Prime” through the “Windows” menu and make the suggested parameter changes. From the “.prime” file and the “Prime Options” window we can see that default GC content is required to be between 40 and 50 percent, whereas our backtranslated sequence appears to be quite a bit higher than that. The GCG program Composition can give you an exact count of nucleotide content if you need it. Therefore, I will increase my –GCMaxPrimer parameter by increasing “Primer % G+C” “Maximum” to “75” and see what happens. It makes sense to allow the same GC content in the product, so change the “Product % G+C” “Maximum,” –GCMaxProduct, to “75” as well. “Run” the program again after you get all your settings specified.
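To see why the GC limit was the sticking point, you can compute the G+C percentage of a candidate primer yourself. The sketch below is a simple illustration with hypothetical function names, not Prime’s own calculation; the test sequence is the “forward1” primer from my successful run, listed in the pattern file further on.

```python
def gc_percent(seq):
    """Percentage of G and C bases in a nucleotide sequence."""
    seq = seq.upper()
    return 100.0 * (seq.count("G") + seq.count("C")) / len(seq)


def within_gc_limits(seq, minimum, maximum):
    """True if the sequence's G+C percentage lies inside the given limits."""
    return minimum <= gc_percent(seq) <= maximum


forward1 = "GCAACCGCTACCCCCCCCAG"

print(gc_percent(forward1))
print(within_gc_limits(forward1, 40.0, 50.0))  # limits quoted above as default
print(within_gc_limits(forward1, 40.0, 75.0))  # relaxed maximum used here
```

This primer only just passes the relaxed limit, which is a useful reminder that primers this GC-rich deserve a critical look before you order them.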

As mentioned above, an alternative option is to turn most of the constraints off by selecting the button next to “Ignore most of the constraints set by default . . .” and working your way toward more restrictive conditions rather than the other way around. As frustrating as Prime can be, it certainly can point out the exact conditions that must be altered from standard PCR reactions in order to have any success in the wet lab. In my case, rerunning the program with “GCMaxPrimer” and “GCMaxProduct” set to “75” did the trick. Whether this is a totally impossible PCR condition is not indicated by the program, so do not blindly accept the results! This may all seem like a genuine pain just to get a couple of primers for PCR; however, realize that successful primers found in this manner will most likely work with all similar organisms for this particular gene. You will not have to repeat the experience until you are given a totally different system on which to work. When the program successfully finishes and the output is displayed, check out the various files. The “.dat” data file lists the primers in a special ‘pattern’ format that can be used in subsequent Prime runs and in other GCG programs such as FindPatterns. The primer locations are noted as comments in the data file; look it over and then “Close” its window. My example follows below:

!!PATTERNS 1.0
This file contains possible primers for the template sequence:
/home/thompson/.seqlab-mendel/input_17.rsf{prion.backtranslated} ..

forward1   1 GCAACCGCTACCCCCCCCAG ! 75 -> 94
forward2   1 CCAAGCCCGGCGGCTGGAAC ! 15 -> 34
forward3   1 AAGCCCGGCGGCTGGAACAC ! 17 -> 36
forward4   1 TGCAAGAAGCGCCCCAAGCC ! 2 -> 21
forward5   1 GCAAGAAGCGCCCCAAGCCC ! 3 -> 22
forward6   1 CAACCGCTACCCCCCCCAGG ! 76 -> 95
reverse1   1 GCTCCACCACGCGCTCCATC ! 674 -> 655
reverse2   1 CTCCACCACGCGCTCCATCATC ! 673 -> 652
reverse3   1 CCACCACGCGCTCCATCATCTTC ! 671 -> 649
reverse4   1 ACCACGCGCTCCATCATCTTCAC ! 669 -> 647
reverse5   1 ACGCGCTCCATCATCTTCACGTC ! 666 -> 644
reverse6   1 CCACCACGCGCTCCATCATCTTCAC ! 671 -> 647
reverse7   1 CACCACGCGCTCCATCATCTTCAC ! 670 -> 647
reverse8   1 CGCTCCATCATCTTCACGTCGGTCTC ! 663 -> 638
reverse9   1 TCTGCTCCACCACGCGCTCC ! 677 -> 658
reverse10  1 TCCACCACGCGCTCCATCATC ! 672 -> 652
reverse11  1 GCTCCATCATCTTCACGTCGGTCTC ! 662 -> 638
reverse12  1 TGCTCCACCACGCGCTCCATC ! 675 -> 655
reverse13  1 GCTCCACCACGCGCTCCATCATC ! 674 -> 652
reverse14  1 ACATCTGCTCCACCACGCGCTC ! 680 -> 659
reverse15  1 CATCTGCTCCACCACGCGCTC ! 679 -> 659

“Close” the RSF file display window; we’ll be using that file just below, but it’s not much to read. The abridged “.prime” results from my successful run showing my set of primate prion primers (say that six times, real fast) follow below. The primers are ranked in terms of an annealing score with smaller numbers being better and the best primers at the top. The first three pairs shown here are all equally good. Read the Prime program “Help” upon a subsequent program run, if you are interested in how this function is calculated. I’ve indicated those parameters changed from their defaults with bold type in the following abridged screen trace:

PRIME of: input_17.rsf{prion.backtranslated}  ck: 8405  from: 63  to: 780
October 13, 2006 22:44

INPUT SUMMARY
-------------
Input sequence: /home/thompson/.seqlab-mendel/input_17.rsf{prion.backtranslated}

Primer constraints:
  primer size: 20 - 50
  primer 3' clamp: S   although this is often turned off in real experiments!
  primer sequence ambiguity: NOT ALLOWED
  primer GC content: 40.0 - 75.0%
  primer Tm: 50.0 - 65.0 degrees Celsius
  primer self-annealing. . .
    3' end: < 8 (weight: 2.0)
    total: < 14 (weight: 1.0)
  unique primer binding sites: required
  primer-template and primer-repeat annealing. . .
    3' end: ignored
    total: ignored
  repeated sequences screened: none specified

Product constraints:
  product length: 478 - 718
  product GC content: 40.0 - 75.0%
  product Tm: 70.0 - 95.0 degrees Celsius
  product must include the region from 153 - 630
  duplicate primer endpoints: NOT ALLOWED
  difference in primer Tm: < 2.0 degrees Celsius
  primer-primer annealing. . .
    3' end: < 8 (weight: 2.0)
    total: < 14 (weight: 1.0)

PRIMER SUMMARY
--------------
                                      forward   reverse
Number of primers considered:           2643      1384
Number of primers rejected for . . .
  primer 3' clamp:                        19        78
  primer sequence ambiguity:               0         0
  primer GC content:                    2478         0
  primer Tm:                             120       767
  non-unique binding sites:                0         0
  primer self-annealing:                   7        65
  primer-template annealing:               0         0
  primer-repeat annealing:                 0         0
Number of primers accepted:               19       474

PRODUCT SUMMARY
---------------
Number of products considered:          9006
Number of products rejected for. . .
  product length:                         61
  product GC content:                    283
  product Tm:                              0
  product position:                     1915
  duplicate primer endpoints:           2015
  difference in primer Tm:              3419
  primer-primer annealing:              1130
Number of products accepted:             183
Number of products saved:                 25
Maximum overlap between products:        718 bp

THE FOLLOWING PRODUCTS ARE SORTED BY THEIR ANNEALING SCORE
--------------------------------------------------------------------------------
Product: 1
[DNA] = 50.000 nM    [salt] = 50.000 mM

PRIMERS
-------
                               5'                        3'
forward primer (20-mer):   137 GCAACCGCTACCCCCCCCAG 156
reverse primer (20-mer):   736 GCTCCACCACGCGCTCCATC 717

                                forward   reverse
primer %GC:                       75.0      70.0
primer Tm (degrees Celsius):      61.5      60.5

PRODUCT
-------
product length: 600
product %GC: 73.0
product Tm: 88.7 degrees Celsius
difference in primer Tm: 1.1 degrees Celsius
annealing score: 53
optimal annealing temperature: 65.3 degrees Celsius
--------------------------------------------------------------------------------
Product: 2
[DNA] = 50.000 nM    [salt] = 50.000 mM

PRIMERS
-------
                               5'                          3'
forward primer (20-mer):   137 GCAACCGCTACCCCCCCCAG 156
reverse primer (22-mer):   735 CTCCACCACGCGCTCCATCATC 714

                                forward   reverse
primer %GC:                       75.0      63.6
primer Tm (degrees Celsius):      61.5      59.9

PRODUCT
-------
product length: 599
product %GC: 73.0
product Tm: 88.7 degrees Celsius
difference in primer Tm: 1.6 degrees Celsius
annealing score: 53
optimal annealing temperature: 65.2 degrees Celsius
--------------------------------------------------------------------------------
Product: 3
[DNA] = 50.000 nM    [salt] = 50.000 mM

PRIMERS
-------
                               5'                           3'
forward primer (20-mer):   137 GCAACCGCTACCCCCCCCAG 156
reverse primer (23-mer):   733 CCACCACGCGCTCCATCATCTTC 711

                                forward   reverse
primer %GC:                       75.0      60.9
primer Tm (degrees Celsius):      61.5      60.2

PRODUCT
-------
product length: 597
product %GC: 73.0
product Tm: 88.7 degrees Celsius
difference in primer Tm: 1.3 degrees Celsius
annealing score: 53
optimal annealing temperature: 65.3 degrees Celsius
/////////////////////////////////////////////////////////////////////////////
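Two of the numbers in the Prime report above are easy to double-check by hand or with a short script: the primer %GC values, and the relationship between the positions in the “.dat” pattern file and the positions in the “.prime” report. The sketch below is my own illustration, not anything Prime provides. The helper names are hypothetical, and the offset rule (pattern-file positions counted from the start of the selected range, here 63) is an assumption inferred from comparing the two example outputs.

```python
def gc_percent(sequence):
    """Percentage of G and C bases in a primer or product sequence."""
    seq = sequence.upper()
    return 100.0 * sum(base in "GC" for base in seq) / len(seq)


def to_template_position(relative_position, range_start):
    """Convert a position counted from the selected range start into a
    position on the full template (assumed rule, see the note above)."""
    return relative_position + range_start - 1


# Primer %GC values for the three reported products.
print(round(gc_percent("GCAACCGCTACCCCCCCCAG"), 1))     # forward primer
print(round(gc_percent("GCTCCACCACGCGCTCCATC"), 1))     # reverse, product 1
print(round(gc_percent("CTCCACCACGCGCTCCATCATC"), 1))   # reverse, product 2
print(round(gc_percent("CCACCACGCGCTCCATCATCTTC"), 1))  # reverse, product 3

# forward1 is listed at 75 -> 94 in the pattern file; the run began at 63.
print(to_template_position(75, 63), to_template_position(94, 63))
```

The %GC values agree with the report (75.0, 70.0, 63.6, and 60.9), and the converted forward1 coordinates land on 137 and 156, the positions Prime prints for that primer.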

A graphics “.figure” window shows schematically where the primers anneal to your sequence. Blue tick marks indicate forward primers and red tick marks indicate reverse primers. The graphic from the above run is shown below on the next page:

“Close” the graphics window. Be sure to “Add to Editor” the “prime.rsf” file displayed in the “Output Manager.” Choose “Overwrite old with new” in the “Reloading Same Sequence” window that pops up. This will merge the new feature annotation that locates the successful primers onto your existing RSF file. “Close” the “Output Manager” window after loading your new feature annotation. Take a look at this new feature information by changing your “Display:” to the “Graphic Features” cartoon representation, and zoom out to “4:1” so that most of the entire sequence can be seen at once. The products appear as orange arches, and the primers appear as upstream green, and downstream red, diamonds. Double-click on one of the new features and then select that entry in the “Feature” window to see a description. It should look something like the graphic displayed below, where an upstream prion primer is described against a backdrop of the editor window:

An alternative primer design approach is to isolate forward and reverse primers separately. Use the “Select: forward” and “reverse primers, only” options to do this. You may be able to ‘zero-in’ on regions a bit more specifically with this approach. If you do design primers with this forward-only and reverse-only method, you’ll need to test the pairs together in a subsequent program run to be sure that they won’t anneal with each other so badly as to interfere with the reaction. The Prime program also allows you to do this, and to test any other primers desired, by specifying an input data file of primers at run-time. The output “.dat” file that we wrote out above is one of these data files. That way, rather than discovering the best primers within a specified template, the program tests and ranks all the primers fed to it against the specified template. Another GCG program, PrimePair, can test a data file of input primers against themselves in the absence of a template.

One restriction to both programs is that they will not tolerate mismatches or ambiguities in their primers or in those sites where they anneal, so all ambiguities must be taken out of the primers and template annealing regions to be tested.

g) Will your primers only ‘find’ the correct genes?

Candidate primers need to pass one more test before being synthesized. You should check your primers’ specificity against the DNA database to ensure that they will not hybridize to completely the wrong type of sequence. This step can also point out, and allow you to correct if necessary, errors in your primer sequence created in the backtranslation step, if enough DNA sequences are available in the database to allow a comparison. The GCG program FindPatterns is probably best for this. It can be used to screen your candidate primers against the entire DNA database or any other GCG sequences desired. There are several advantages to using FindPatterns over standard similarity searching software such as BLAST or FastA: 1) you can test more than one primer at a time against as many sequences as you want; 2) the algorithm will not allow any gapping of your primers to the template, which would represent loop structures in the hybrid and should not be allowed; 3) similarities don’t count; identities are required, but mismatches are allowed by option, which again is just what you want in primer analysis; and 4) word size parameters are not relevant since the algorithm doesn’t use them, so they don’t have to be messed with (which you would need to do with heuristic style similarity searches such as BLAST, since they are not designed to find short regions of DNA similarity). For these reasons, I do not recommend using BLAST or FastA style searches for testing primer specificity.

The easiest way to run FindPatterns is to provide it with your primers as an input file rather than typing them in interactively. To do this FindPatterns needs its input file of patterns to be in exactly the same pattern data format as Prime can produce with its “Save primers found to a pattern file” –FoundPrimers option. This makes it relatively easy to test them against the database. However, running a full-blown FindPatterns GenBank search would require too much time for you to see the results in the time constraints of our tutorial. Therefore, I have already run this search and am providing that output file for you; it’s the one that you initially fetched when you began the exercise. If you would like to run this type of analysis for your own research, it is very important to use appropriate parameters! You need to specify the correct pattern data file, a realistic mismatch level, and the –Batch option. Give a mismatch level of slightly less than 20% of the length of your shortest primer sequence. The less than 20% mismatch cut-off level is a ‘rule-of-thumb’ because that is the number of expected mismatches if all codon choices were made on a completely random basis. In the example that I am providing I used a mismatch level of about 10%. If running FindPatterns from the command line rather than from SeqLab, the program will ask you which sequences you want to find your pattern in. These are not your primer sequences; these are the sequences you want to search your primer patterns against.

Therefore, answer with either all of GenBank or the appropriate subdivision of GenBank. Since I am trying to find prions in aye-ayes, the primate portion of GenBank is most relevant and I used gb_pr:*, which means that I want to search all of the sequences in the primate subdivision of GenBank. (See Workshop #1 and the GenHelp User’s Guide chapter Using Sequences, topic Using Database Sequences, subtopic Nucleic Acid Database tables, if this still confuses you.) Do not run FindPatterns today against GenBank; just scroll through and note the types of sequences that were found by my example run. Temporarily switch to the terminal window behind the SeqLab window and use the UNIX “more” utility to do this. Press the space bar, not the return key, to go from one page to the next. The abridged output file, “primer-tutorial.prion.finds” follows below on the next few pages:

> more primer-tutorial.prion.finds

! FINDPATTERNS on gb_pr:* allowing 2 mismatches
! Using patterns from: primer-tutorial.prion.dat    October 14, 2006 16:57 ..

APU15164 ck: 8218 len: 759 ! U15164 Ateles paniscus x Ateles fusciceps major prion protein precursor ge

PrPrR1 /Rev

CAGTACAGCAACCAGAACAACTTCG 499: TGGAT cagtacaacaaccagaacaactttg TGCAC mis=2

PrPrR2 /Rev

CAGTACAGCAACCAGAACAACTTCGT 499: TGGAT cagtacaacaaccagaacaactttgt GCACG mis=2

PrPrR3 /Rev

CAGTACAGCAACCAGAACAACTTCGTG 499: TGGAT cagtacaacaaccagaacaactttgtg CACGA mis=2

PrPrR4 /Rev

CAGTACAGCAACCAGAACAACTTCGTGC 499: TGGAT cagtacaacaaccagaacaactttgtgc ACGAC mis=2

PrPrR5 /Rev

CAGTACAGCAACCAGAACAACTTCGTGCA 499: TGGAT cagtacaacaaccagaacaactttgtgca CGACT mis=2

PrPrR10 /Rev GAGAACATGTACCGCTACCCCAACCA 451: ATCGT gaaaacatgtaccgttaccccaacca AGTAT mis=2

GGU15166 ck: 5814 len: 762 ! U15166 Gorilla gorilla major prion protein precursor gene, complete cds. 6

PrPrF1

GCCTGTGCAAGAAGCGCCCCAAGCC 59: CCTGG gcctctgcaagaagcgcccgaagcc TGGAG mis=2

PrPrF4

GGCCTGTGCAAGAAGCGCCCCAAGC 58: ACCTG ggcctctgcaagaagcgcccgaagc CTGGA mis=2

PrPrF5

GGGCCTGTGCAAGAAGCGCCCCAAG 57: GACCT gggcctctgcaagaagcgcccgaag CCTGG mis=2

PrPrF8

GGGCCTGTGCAAGAAGCGCCCCAAGC 57: GACCT gggcctctgcaagaagcgcccgaagc CTGGA mis=2

PrPrR1 /Rev

CAGTACAGCAACCAGAACAACTTCG 502: TGGAT cagtacagcaaccagaacaactttg TGCAC mis=1

PrPrR2 /Rev

CAGTACAGCAACCAGAACAACTTCGT 502: TGGAT cagtacagcaaccagaacaactttgt GCACG mis=1

PrPrR3 /Rev

CAGTACAGCAACCAGAACAACTTCGTG 502: TGGAT cagtacagcaaccagaacaactttgtg CACGA mis=1

PrPrR4 /Rev

CAGTACAGCAACCAGAACAACTTCGTGC 502: TGGAT cagtacagcaaccagaacaactttgtgc ACGAC mis=1

PrPrR5 /Rev

CAGTACAGCAACCAGAACAACTTCGTGCA 502: TGGAT cagtacagcaaccagaacaactttgtgca CGACT mis=1


HSPRP2 ck: 7852 len: 2,301 ! X83416 H.sapiens PrP gene, exon 2. 1/96

PrPrF1

GCCTGTGCAAGAAGCGCCCCAAGCC 77: CCTGG gcctctgcaagaagcgcccgaagcc TGGAG mis=2

PrPrF4

GGCCTGTGCAAGAAGCGCCCCAAGC 76: ACCTG ggcctctgcaagaagcgcccgaagc CTGGA mis=2

PrPrF5

GGGCCTGTGCAAGAAGCGCCCCAAG 75: GACCT gggcctctgcaagaagcgcccgaag CCTGG mis=2

PrPrF8

GGGCCTGTGCAAGAAGCGCCCCAAGC 75: GACCT gggcctctgcaagaagcgcccgaagc CTGGA mis=2

PrPrR1 /Rev

CAGTACAGCAACCAGAACAACTTCG 496: TGGAT gagtacagcaaccagaacaactttg TGCAC mis=2

PrPrR2 /Rev

CAGTACAGCAACCAGAACAACTTCGT 496: TGGAT gagtacagcaaccagaacaactttgt GCACG mis=2

PrPrR3 /Rev

CAGTACAGCAACCAGAACAACTTCGTG 496: TGGAT gagtacagcaaccagaacaactttgtg CACGA mis=2

PrPrR4 /Rev

CAGTACAGCAACCAGAACAACTTCGTGC 496: TGGAT gagtacagcaaccagaacaactttgtgc ACGAC mis=2

PrPrR5 /Rev

CAGTACAGCAACCAGAACAACTTCGTGCA 496: TGGAT gagtacagcaaccagaacaactttgtgca CGACT mis=2

////////////////////////////////////////////////////////////////////////////////

SSU08308 ck: 4959 len: 762 ! U08308 Symphalangus syndactylus prion protein gene, complete cds. 2/95

PrPrF1

GCCTGTGCAAGAAGCGCCCCAAGCC 59: CCTGG gcctctgcaagaagcgcccgaagcc TGGAG mis=2

PrPrF4

GGCCTGTGCAAGAAGCGCCCCAAGC 58: ACCTG ggcctctgcaagaagcgcccgaagc CTGGA mis=2

PrPrF5

GGGCCTGTGCAAGAAGCGCCCCAAG 57: GACCT gggcctctgcaagaagcgcccgaag CCTGG mis=2

PrPrF8

GGGCCTGTGCAAGAAGCGCCCCAAGC 57: GACCT gggcctctgcaagaagcgcccgaagc CTGGA mis=2

PrPrR1 /Rev

CAGTACAGCAACCAGAACAACTTCG 502: TGGAT cagtacagcagccagaacaactttg TGCAC mis=2

PrPrR2 /Rev

CAGTACAGCAACCAGAACAACTTCGT 502: TGGAT cagtacagcagccagaacaactttgt GCACG mis=2

PrPrR3 /Rev

CAGTACAGCAACCAGAACAACTTCGTG 502: TGGAT cagtacagcagccagaacaactttgtg CACGA mis=2

PrPrR4 /Rev

CAGTACAGCAACCAGAACAACTTCGTGC 502: TGGAT cagtacagcagccagaacaactttgtgc ACGAC mis=2

PrPrR5 /Rev

CAGTACAGCAACCAGAACAACTTCGTGCA 502: TGGAT cagtacagcagccagaacaactttgtgca CGACT mis=2

PrPrR10 /Rev GAGAACATGTACCGCTACCCCAACCA 454: ATCGT gaaaacatgcaccgctaccccaacca AGTGT mis=2

SSU08310 ck: 136 len: 783 ! U08310 Saimiri sciureus prion protein gene, complete cds. 2/95

PrPrR1 /Rev

CAGTACAGCAACCAGAACAACTTCG 523: TGGAT cagtacagcaaccagaacaactttg TGCAC mis=1

PrPrR2 /Rev

CAGTACAGCAACCAGAACAACTTCGT 523: TGGAT cagtacagcaaccagaacaactttgt GCACG mis=1

PrPrR3 /Rev

CAGTACAGCAACCAGAACAACTTCGTG 523: TGGAT cagtacagcaaccagaacaactttgtg CACGA mis=1

PrPrR4 /Rev

CAGTACAGCAACCAGAACAACTTCGTGC 523: TGGAT cagtacagcaaccagaacaactttgtgc ACGAC mis=1

PrPrR5 /Rev

CAGTACAGCAACCAGAACAACTTCGTGCA 523: TGGAT cagtacagcaaccagaacaactttgtgca CGACT mis=1

Databases searched: GenBank, Release 153.0, Released on 14Dec2006, Formatted on 15Dec2006 Total finds: 395 Total length: 320,531,468 Total sequences: 91,570 CPU time: 3:16:52.39

Only prion sequences were found by our new primers, which is excellent. An example FindPatterns command line run is shown below. Notice the GCG –check command line ‘super-option’ that lists all of the available options within a program and gives you a chance to use any of them. Remember, do not run FindPatterns on GenBank here today:

> findpatterns -check

FindPatterns identifies sequences that contain short patterns like GAATTC or YRYRYRYR. You can define the patterns ambiguously and allow mismatches. You can provide the patterns in a file or simply type them in from the terminal.

Minimal Syntax: % findpatterns [-INfile=]Genbank:Humig* -Default

Prompted Parameters:

-PATterns=GAATTC,RGGAY          patterns to be found
[-OUTfile=]findpatterns.find    the output file name

Local Data Files:

-DATa=pattern.dat               a file with a set of patterns

Optional Parameters:

-MISmatch=1      allows mismatches in the search for your subsequence
-NAMes           makes an output file in "file of filenames" format
-ONEstrand       searches only the top strand of nucleotide sequences
-SIXbase         searches only for patterns with six or more symbols
-CIRcular        searches all sequences as if they were circular

Press q to quit or for more:

-ALL             does an "overlapping-set" search in nucleotide sequences
-PERFect         looks only for perfect matches
-APPend          appends the pattern data file to the output file
-SHOw            shows every file searched even if there are no finds
-TERminal        writes output to the terminal screen instead of a file
-NOMONitor       suppresses the screen trace showing each file
-ONCe            limits finds to patterns found a maximum of 1 time
-MINCuts=1       limits finds to patterns found a minimum of 1 time
-MAXCuts=3       limits finds to patterns found a maximum of 3 times
-EXCLude=n1,n2   excludes patterns found between positions n1 and n2
-SINce=6.90      limits search to sequences dated on or after June 1990
-BATch           Submits the program to run in the batch queue

Add what to the command line ? -data=primer.dat -mismatch=2 -batch

FINDPATTERNS in what sequence(s) ? gb_pr:*

What should I call the output file (* findpatterns.find *) ? prion.finds

** findpatterns will run as a batch or at job.
** findpatterns was submitted using the command: " batch "
warning: commands will be executed using /bin/sh
job 848528595.b at Wed Nov 20 14:23:15 1996
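To make the idea behind this kind of search concrete, here is a small Python sketch of ungapped, mismatch-tolerant pattern matching on both strands. It is not FindPatterns and makes no claim about how GCG implements the search; the function names are hypothetical and the code only illustrates the points made above: no gaps are allowed, only identities count, and a fixed number of mismatches is tolerated.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Reverse complement of an unambiguous DNA sequence."""
    return seq.upper().translate(COMPLEMENT)[::-1]


def count_mismatches(a, b):
    """Number of positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))


def find_ungapped(pattern, target, max_mismatches):
    """Slide the pattern (and its reverse complement) along the target
    without gaps; report (1-based start, strand, mismatches) per hit."""
    target = target.upper()
    hits = []
    for strand, probe in (("+", pattern.upper()),
                          ("-", reverse_complement(pattern))):
        for start in range(len(target) - len(probe) + 1):
            window = target[start:start + len(probe)]
            mismatches = count_mismatches(probe, window)
            if mismatches <= max_mismatches:
                hits.append((start + 1, strand, mismatches))
    return hits


def mismatch_budget(shortest_primer_length, fraction):
    """Whole number of mismatches allowed for a given fraction of length."""
    return int(shortest_primer_length * fraction)


# One pattern and its matching region, copied from the APU15164 find above.
pattern = "CAGTACAGCAACCAGAACAACTTCG"
target = "TGGAT" + "cagtacaacaaccagaacaactttg" + "TGCAC"

allowed = mismatch_budget(len(pattern), 0.10)
print(allowed, find_ungapped(pattern, target, allowed))
```

With a 25 base pattern, a level of about 10% works out to the two mismatches used in my run, and the hit in this fragment is reported with two mismatches, just as the “mis=2” in the output file shows.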

This is the conclusion of the main part of today’s computer laboratory. You can either log out or leave the computers as they are with SeqLab and the terminal active. If you don’t log out, I’ll go around and log you out and clean up any loose ends.

I hope that you all will have come to realize the tremendous help that computational technology can be in this area by going through today’s tutorial. Obviously the same general ideas taught here can be tailored to any particular system for the design of primers to any level of specificity. For more help in this area and for personal sequence analysis consultation, contact me: Steve Thompson, [email protected].

2) More practice: universal and strain specific primer design from a pre-built DNA alignment

I doubt whether you’ll have time to do this section of the tutorial during our allotted two hours, but I encourage you to log back on to the GCG server at some point in the future and work through the following example yourself. As always, I am available for any personalized help you may need. As with the first portion of the tutorial, I have placed some files in a publicly accessible GCG directory to make this portion of the tutorial less tedious. After logging in to the GCG server and launching a terminal window, issue the following command to copy those files into your account.

> fetch primer-tutorial.L1.*

Next list your directory (ls) using the long form option (-l) on the new files to see what and how big they are:

> ls -l primer-tutorial.*
-rw-rw-rw-  1 stevet  gcg    2113 Jun 11 18:48 primer-tutorial.L1.dat
-rw-r--r--  1 stevet  gcg  170858 Jun 17 10:45 primer-tutorial.L1.finds
-rw-r--r--  1 stevet  gcg  623104 Jun 13 17:47 primer-tutorial.L1.rsf
-rw-r--r--  1 stevet  gcg   49285 Jun 22 20:14 primer-tutorial.prion.finds

Next launch SeqLab with the standard command:

> seqlab &

First I want you to see what the HPV L1 alignment looks like. Be sure the “Mode:” “Main List” choice is selected in your main window and then go to the “File” menu. Pick “Add sequences from” and select “Sequence Files.” (GCG format compatible sequences or list files are accessible through this route. Use SeqLab’s Editor “Import” function to directly load GenBank or ABI/SCF trace format sequences without the need to reformat.) This will produce an “Add Sequences” window from which you can select sequences to add to your working.list. The “Filter” box is very important here! You can control which files are displayed by choosing an appropriate text string. Since we want to use the alignment I already prepared and you fetched in the previous step, put the extension “.rsf” in the “Filter” box (including the period); be sure to leave the “*” wild card. Press the “Filter” button to display all of the RSF files in your working directory. Select the file entitled “primer-tutorial.L1.rsf” from the “Files” box, and then click the “Add” and then the “Close” buttons at the bottom of the window, to put the file in your “working.list.” It will appear in the SeqLab “Main List” window. Be sure it is selected and switch to “Editor” “Mode:” to load the RSF file into the SeqLab editor. Notice that all of the sequences now appear in the editor window with the bases and residues color-coded. Any portion of or all of the sequences loaded are now available for analysis by any of the GCG programs. Expand the window full-screen. The display will look something like the following graphic:

Each DNA sequence is shown together with its proper amino acid translation directly below. The nucleotide sequences are listed by their official GenBank entry name (LOCUS identifier). The protein sequences are not annotated because they did not come from the database; they were translated from their corresponding DNA sequences. Select every DNA sequence in the alignment by clicking each entry name that does not have the word frame within it; also do not select the last consensus sequence, cons25pct. Normally this click is done with the left mouse button; however, a bug in the Linux version of SeqLab switched this function to the right mouse button. Be sure to scroll up and down through the entire alignment. Release the key to scroll and then repress selecting the new screen of DNA entries as you go. The SeqLab main window will now look similar to the graphic on the following page, with every other sequence selected:

Now that all of the DNA entries are highlighted (and only the DNA entries, but not the consensus sequence at the bottom), run PlotSimilarity on this DNA sequence alignment in the same manner as you did on the prion protein dataset. That is, go to the SeqLab “Functions” menu; select “Multiple Comparison” and then “PlotSimilarity.” Then choose “Options . . .” and check “Save SeqLab colormask to” and “Scale the plot between:” the “minimum and maximum values calculated from the alignment.” “Close” the “PlotSimilarity Options” window. Click the “Run” box to launch the program. The output will quickly return. “Close” the plotsimilarity.cmask display and the “Output Manager” and then take a look at the similarity plot. My example follows below:


“Close” the PlotSimilarity graphics window after you’ve checked it out. As before, go up to the SeqLab “File” menu; select “Open Color Mask Files.” Select the file displayed in the dialog box, “plotsimilarity.cmask;” click “Add” and then “Close.” As with the prion dataset, identify those regions of the alignment that will be most appropriate for designing universal primers (areas of high conservation), but now we’re dealing directly with DNA. Also note areas of low conservation; these are candidate regions for strain specific primers. Take some notes of the general areas that appear promising to you. For the purpose of this exercise try to identify two of the more highly conserved regions that flank the longest stretch of L1 possible. Also note where the deepest and furthest separated valleys are. Try to get regions that are at least 100 or so bases long. I’ll show a screen dump graphic of the first upstream conserved region from around column 130 to 230 that I happened to pick below:
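The peaks and valleys you are looking for can be thought of as a running average of how well each alignment column agrees. The Python sketch below shows that idea only; it is a simplified stand-in that scores plain identity, it is not the scoring scheme PlotSimilarity itself uses, and the function names and the toy alignment are made up for illustration.

```python
from collections import Counter


def column_identity(alignment):
    """For each alignment column, the fraction of sequences that carry the
    most common base (gap characters never count as agreement)."""
    scores = []
    for column in zip(*alignment):
        residues = [base.upper() for base in column if base not in "-."]
        if not residues:
            scores.append(0.0)
            continue
        most_common_count = Counter(residues).most_common(1)[0][1]
        scores.append(most_common_count / len(column))
    return scores


def windowed_average(scores, window):
    """Average the per-column scores over a sliding window."""
    return [sum(scores[i:i + window]) / window
            for i in range(len(scores) - window + 1)]


# Toy alignment of three equal-length sequences (hypothetical data).
toy_alignment = [
    "ACGTACGTAA",
    "ACGTTCGAAA",
    "ACGTGCGCAA",
]

scores = column_identity(toy_alignment)
smoothed = windowed_average(scores, 4)
print([round(value, 2) for value in smoothed])
```

High values in the smoothed list correspond to conservation peaks, the candidate regions for universal primers, and low values correspond to the valleys that are candidates for strain specific primers.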

a) Design ‘universal’ primers

Go to the very bottom of the alignment. Select the sequence labeled “cons25pct” (only). This is a consensus sequence generated from the DNA alignment above it at the 25 percent agreement level that I produced through the “Consensus” tool of the “Edit” menu. The other DNA sequences should now no longer be selected. Scroll through the consensus sequence until you find the upstream highly conserved area that you noted above. Determine the exact base that you want your selection to begin and end at by selecting the candidate bases and noting their “col:” positions in the lower left-hand corner of the display. Remember, we want stretches about 100 bases long. Write these numbers down. Repeat this procedure with your 3’ candidate region. Now go up to the “Edit” menu and click on “Select Range.” This will produce a dialog box in which you can type the desired range; enter the overall length that you want to deal with, just like before, that is, the 5’ most base through the 3’ most base noted above. After typing in the numbers punch “Select” and then “Close” the box. Notice that the range specified is now highlighted on the display.

We’ll use Prime to again locate the best forward and reverse primers within the defined 5’ and 3’ sequence candidate regions identified. As before, the primer target regions within this delineation are specified with the –Begin2=, –End2=, and –Include= options to force primer discovery within the conservation peaks identified. As before, launch Prime from the “Functions” menu; choose “Primer Selection” and then “Prime.” This should produce a dialog box asking whether you want to analyze the selected sequences or the selected region; choose “Selected region.” A Prime program box will display. If you haven’t logged out since doing the prion exercise, press the “GCG Defaults” button in the “Prime Program Window” to reset all the parameters from what you ended up with then. Regardless, adjust the “Maximum PCR Product Length” up to the maximum allowed, the full length of your selected region. Next, check “Save results as features in file prime.rsf” to add any primer locations found to your current RSF file’s annotation. Choose “Options” next and scroll down through the extensive options list in the “Prime Options” window. Check “Specify PCR target range.” This activates the –Begin2 and –End2 options, as before with the prion dataset. The “PCR target starting” and “ending position” parameters identify those positions and need to be changed to restrict the target range accordingly. Therefore, as before, specify the 3’ end of your upstream candidate region and the 5’ end of your downstream candidate region respectively; also be sure that “Minimum % of specified PCR target range to be included in product” is set to “100.0” to force all primer searching within the desired primer binding regions. Also check the “Save primers found to a pattern file” option and give it an appropriate name like “HPV.all.dat,” to create a pattern data file for this dataset. “Close” the options window. Press “Run” in the program window after making your selections.
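It helps to see how the selected range and the PCR target range together bound the possible product lengths. The following Python sketch is only an illustration of that arithmetic; the function names are hypothetical, and the rule that a product must span the whole target range is my reading of the “100.0” setting, checked against the numbers in the prion run shown earlier (range 63 to 780, target 153 to 630, product length 478 - 718).

```python
def product_length_bounds(range_begin, range_end, target_begin, target_end):
    """Shortest and longest product that can contain the whole target range
    while staying inside the selected range (all positions are 1-based)."""
    if not (range_begin <= target_begin <= target_end <= range_end):
        raise ValueError("target range must lie inside the selected range")
    shortest = target_end - target_begin + 1
    longest = range_end - range_begin + 1
    return shortest, longest


def product_length(forward_5prime, reverse_5prime):
    """Product length from the 5' positions of the two primers."""
    return reverse_5prime - forward_5prime + 1


def product_covers_target(forward_5prime, reverse_5prime,
                          target_begin, target_end):
    """True when the product spans the entire target range."""
    return forward_5prime <= target_begin and reverse_5prime >= target_end


# Values from the prion run: selected range 63-780, target range 153-630,
# best product with primer 5' ends at 137 (forward) and 736 (reverse).
print(product_length_bounds(63, 780, 153, 630))
print(product_length(137, 736), product_covers_target(137, 736, 153, 630))
```

The upper bound computed this way is only the most that the selected range permits; the “Maximum PCR Product Length” you set in the program window can be lower than that.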
Prime will now search for the best primers within the restricted areas specified. The output will quickly display; as with the prion data, your “.rsf” and “.dat” files may be empty because of the strict default parameters. In my first run against cons25pct I couldn’t find any primers. Upon perusing my “.prime” output the worst offender again seemed to be GC content, so I lowered “product GC minimum” from the default 40% to “25%” in a subsequent run. You’ll probably have to do the same, since the consensus sequence is so AT rich. Repeat the run with the new parameters. Remember the “Windows” menu has a ‘shortcut’ listing to all programs used in the current session so it can be used to relaunch “Prime.” Change the appropriate parameters, and then press “Run” in the “Prime” main window. Note the “Command Line” parameters in the setup for my successful run, shown in the screen dump graphic to the right here:

When the program successfully finishes “Close” the RSF file display window, but look over the “HPV.all.dat” data file and then “Close” its window. My example follows below:

!!PATTERNS 1.0
This file contains possible primers for the template sequence:
/users/thompson/.seqlab-mendel/input_61.rsf{cons25pct} ..

forward1    1  GACTACTTGCTGTTGGACATC   !   77 ->   97
forward2    1  GAATATGTGACACGCACAAAC   !   31 ->   51
forward3    1  CTAGACTACTTGCTGTTGGAC   !   74 ->   94
forward4    1  CCCTGTATCTAAGGTTGTAAGC  !    3 ->   24
forward5    1  GTATCTAAGGTTGTAAGCACGG  !    7 ->   28
forward6    1  AAGCACGGATGAATATGTGAC   !   21 ->   41
forward7    1  TGTATCTAAGGTTGTAAGCACG  !    6 ->   27
forward8    1  GCACGGATGAATATGTGACAC   !   23 ->   43
forward9    1  TGCAGGCAGTTCTAGACTAC    !   63 ->   82
forward10   1  GGATGAATATGTGACACGCAC   !   27 ->   47
forward11   1  CTACTTGCTGTTGGACATCC    !   79 ->   98
forward12   1  CGGATGAATATGTGACACGC    !   26 ->   45
forward13   1  ACTACTTGCTGTTGGACATCC   !   78 ->   98
reverse1    1  TGCGTCCCAAAGGAAACTG     ! 1387 -> 1369
reverse2    1  CTGGCCTTAAATCCTGCTTG    ! 1418 -> 1399
reverse3    1  TTGCGTCCCAAAGGAAAC      ! 1388 -> 1371

Display and scroll through your successful “.prime” file and then “Close” its window. Remember, the best are at the top of the file. The beginning of my successful universal primer search “.prime” file is shown below over the next couple of pages; changed parameters are highlighted in bold:

PRIME of: input_62.rsf{cons25pct}  ck: 2375  from: 1  to: 1646
October 17, 2003 16:09

INPUT SUMMARY
-------------
Input sequence: /users/thompson/.seqlab-mendel/input_62.rsf{cons25pct}

Primer constraints:
  primer size: 18 - 22
  primer 3' clamp: S   although this is often turned off in real experiments!
  primer sequence ambiguity: NOT ALLOWED
  primer GC content: 40.0 - 55.0%
  primer Tm: 50.0 - 65.0 degrees Celsius
  primer self-annealing. . .
    3' end: < 8 (weight: 2.0)
    total: < 14 (weight: 1.0)
  unique primer binding sites: required
  primer-template and primer-repeat annealing. . .
    3' end: ignored
    total: ignored
  repeated sequences screened: none specified

Product constraints:
  product length: 1217 - 1419
  product GC content: 25.0 - 55.0%
  product Tm: 70.0 - 95.0 degrees Celsius
  product must include the region from 234 - 1450
  duplicate primer endpoints: NOT ALLOWED
  difference in primer Tm: < 2.0 degrees Celsius
  primer-primer annealing. . .
    3' end: < 8 (weight: 2.0)
    total: < 14 (weight: 1.0)

PRIMER SUMMARY
--------------
                                      forward   reverse
Number of primers considered:            500       482
Number of primers rejected for . . .
  primer 3' clamp:                       108       124
  primer sequence ambiguity:              18        18
  primer GC content:                     214       202
  primer Tm:                             104        90
  non-unique binding sites:                0         0
  primer self-annealing:                  11        16
  primer-template annealing:               0         0
  primer-repeat annealing:                 0         0
Number of primers accepted:               45        32

PRODUCT SUMMARY
---------------
Number of products considered:          1440
Number of products rejected for. . .
  product length:                        700
  product GC content:                      0
  product Tm:                              0
  product position:                        0
  duplicate primer endpoints:            305
  difference in primer Tm:                71
  primer-primer annealing:               320
Number of products accepted:              44
Number of products saved:                 25
Maximum overlap between products:       1419 bp

THE FOLLOWING PRODUCTS ARE SORTED BY THEIR ANNEALING SCORE
--------------------------------------------------------------------------------
Product: 1

[DNA] = 50.000 nM     [salt] = 50.000 mM

PRIMERS
-------
                                5'                       3'
forward primer (19-mer):    193 CATGCAGGCAGTTCTAGAC      211
reverse primer (20-mer):   1605 AGAGGTAGATGAGGTGGTGG    1586

                                forward    reverse
primer %GC:                       52.6       55.0
primer Tm (degrees Celsius):      50.1       51.9

PRODUCT
-------
product length:                 1413
product %GC:                    34.7
product Tm:                     73.7 degrees Celsius
difference in primer Tm:         1.8 degrees Celsius
annealing score:                  53
optimal annealing temperature:  51.7 degrees Celsius
--------------------------------------------------------------------------------
Product: 2

[DNA] = 50.000 nM     [salt] = 50.000 mM

PRIMERS
-------
                                5'                       3'
forward primer (20-mer):    192 TCATGCAGGCAGTTCTAGAC     211
reverse primer (20-mer):   1599 AGATGAGGTGGTGGGTGTAG    1580

                                forward    reverse
primer %GC:                       50.0       55.0
primer Tm (degrees Celsius):      51.5       52.5

PRODUCT
-------
product length:                 1408
product %GC:                    34.7
product Tm:                     73.6 degrees Celsius
difference in primer Tm:         1.0 degrees Celsius
annealing score:                  57
optimal annealing temperature:  52.1 degrees Celsius
--------------------------------------------------------------------------------
Product: 3

[DNA] = 50.000 nM     [salt] = 50.000 mM

PRIMERS
-------
                                5'                       3'
forward primer (21-mer):    209 GACTACTTGCTGTTGGACATC    229
reverse primer (19-mer):   1519 TGCGTCCCAAAGGAAACTG     1501

                                forward    reverse
primer %GC:                       47.6       52.6
primer Tm (degrees Celsius):      50.9       52.5

PRODUCT
-------
product length:                 1311
product %GC:                    34.2
product Tm:                     73.4 degrees Celsius
difference in primer Tm:         1.6 degrees Celsius
annealing score:                  57
optimal annealing temperature:  51.8 degrees Celsius
//////////////////////////////////////////////////////////////////
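Several of the numbers in the listing above can be cross-checked by hand, which is a useful habit before ordering oligos. The short Python sketch below recomputes primer %GC and product length from the reported sequences and coordinates. It also compares the reported optimal annealing temperature with the widely used estimate of Rychlik and co-workers (0.3 x the lower primer Tm + 0.7 x the product Tm - 14.9). That Prime computes its value in exactly this way is an assumption here; the sketch only shows that the reported numbers are consistent with it. The function names are mine, not GCG's.

```python
def gc_percent(primer: str) -> float:
    """Percentage of G and C bases in a primer."""
    primer = primer.upper()
    return 100.0 * sum(base in "GC" for base in primer) / len(primer)


def product_length(forward_start: int, reverse_start: int) -> int:
    """Length of the amplified product, counting both end positions."""
    return reverse_start - forward_start + 1


def annealing_temperature(primer_tms, product_tm: float) -> float:
    """Rychlik-style estimate (assumed, not confirmed, to match Prime)."""
    return 0.3 * min(primer_tms) + 0.7 * product_tm - 14.9


# Values copied from the three products listed above:
# (forward start, forward primer, reverse start, reverse primer,
#  (forward Tm, reverse Tm), product Tm, reported optimal annealing temp)
PRODUCTS = [
    (193, "CATGCAGGCAGTTCTAGAC", 1605, "AGAGGTAGATGAGGTGGTGG", (50.1, 51.9), 73.7, 51.7),
    (192, "TCATGCAGGCAGTTCTAGAC", 1599, "AGATGAGGTGGTGGGTGTAG", (51.5, 52.5), 73.6, 52.1),
    (209, "GACTACTTGCTGTTGGACATC", 1519, "TGCGTCCCAAAGGAAACTG", (50.9, 52.5), 73.4, 51.8),
]

for f_start, f_seq, r_start, r_seq, tms, prod_tm, reported_ta in PRODUCTS:
    estimate = annealing_temperature(tms, prod_tm)
    print(
        f"forward %GC {gc_percent(f_seq):.1f}, reverse %GC {gc_percent(r_seq):.1f}, "
        f"length {product_length(f_start, r_start)}, "
        f"estimated Ta {estimate:.2f} (reported {reported_ta})"
    )
```

Because the primer and product Tm values are themselves rounded to one decimal place, the estimate should be expected to agree with the reported temperature only to within about 0.05 degrees.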

My HPV L1 universal primer example location schematic is shown to the right here in a screen dump graphic:


“Close” the graphics window. “Add to Editor” the “prime.rsf” file displayed in the “Output Manager” and “Overwrite old with new” in the “Reloading Same Sequence” window to merge the new primer feature annotation onto your existing RSF file. “Close” the “Output Manager” window afterwards. Look at this new feature information by changing your “Display:” to “Graphic Features” and zoom out to “16:1” so that the entire length can be seen at once. PCR products are again coded as orange arches, upstream primers as green diamonds, and downstream primers as red diamonds. Double-click on the new feature and then select that entry in the “Feature” window to see a description. It should look similar to the graphic shown below:

“Close” the “Feature” window to return to SeqLab’s main window. The sample pattern data file that you ‘fetched’ at the beginning of this section contains the My09/11 published primer set mentioned in the Introduction. That data file, “primer-tutorial.L1.dat,” follows below:

The primers used in the L1 primer design computer laboratory
! The published My09/11 set:
!
MY11     1   GCMCAGGGWCATAAYAATGG   0   ! no ambiguity allowed
MY11a    1   GCaCAGGGaCATAAcAATGG   0
MY11b    1   GCcCAGGGtCATAAcAATGG   0
MY11c    1   GCaCAGGGtCATAAcAATGG   0
MY11d    1   GCcCAGGGaCATAAcAATGG   0
MY11e    1   GCaCAGGGaCATAAtAATGG   0
MY11f    1   GCcCAGGGtCATAAtAATGG   0
MY11g    1   GCaCAGGGtCATAAtAATGG   0
MY11h    1   GCcCAGGGaCATAAtAATGG   0
!
MY09     1   CGTCCMARRGGAWACTGATC   0   ! no ambiguity allowed
..
MY09a    1   CGTCCaAaaGGAaACTGATC   0
MY09b    1   CGTCCcAagGGAaACTGATC   0
MY09c    1   CGTCCaAgaGGAaACTGATC   0
MY09d    1   CGTCCcAggGGAaACTGATC   0
MY09e    1   CGTCCaAagGGAaACTGATC   0
MY09f    1   CGTCCcAaaGGAaACTGATC   0
MY09g    1   CGTCCaAggGGAaACTGATC   0
MY09h    1   CGTCCcAgaGGAaACTGATC   0
MY09i    1   CGTCCaAaaGGAtACTGATC   0
MY09j    1   CGTCCcAagGGAtACTGATC   0
MY09k    1   CGTCCaAgaGGAtACTGATC   0
MY09l    1   CGTCCcAggGGAtACTGATC   0
MY09m    1   CGTCCaAagGGAtACTGATC   0
MY09n    1   CGTCCcAaaGGAtACTGATC   0
MY09o    1   CGTCCaAggGGAtACTGATC   0
MY09p    1   CGTCCcAgaGGAtACTGATC   0
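The lettered entries in this file are simply the degenerate MY11 and MY09 sequences written out with every ambiguity code resolved, since the primer analysis programs do not accept ambiguity symbols. The following Python sketch shows how one data line could be split into its fields (name, offset, pattern, overhang, with anything after an exclamation point treated as a comment) and how a degenerate pattern expands into explicit sequences using the standard IUPAC nucleotide codes. The function names are hypothetical, and treating a missing overhang as 0 is my assumption; this is an illustration, not GCG's own parser.

```python
from itertools import product

# Standard IUPAC nucleotide ambiguity codes.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "M": "AC", "R": "AG", "W": "AT", "S": "CG", "Y": "CT", "K": "GT",
    "V": "ACG", "H": "ACT", "D": "AGT", "B": "CGT", "N": "ACGT",
}


def parse_record(line: str):
    """Split one data line into (name, offset, pattern, overhang)."""
    body = line.split("!", 1)[0]          # drop any trailing comment
    fields = body.split()                 # column positions do not matter
    if len(fields) < 3:
        raise ValueError(f"not a pattern record: {line!r}")
    name, offset, pattern = fields[0], int(fields[1]), fields[2]
    overhang = int(fields[3]) if len(fields) > 3 else 0  # assumed default
    return name, offset, pattern, overhang


def expand_degenerate(pattern: str):
    """List every unambiguous sequence matched by a degenerate pattern."""
    choices = [IUPAC[base] for base in pattern.upper()]
    return ["".join(combo) for combo in product(*choices)]


name, offset, pattern, overhang = parse_record(
    "MY11 1 GCMCAGGGWCATAAYAATGG 0 ! no ambiguity allowed"
)
variants = expand_degenerate(pattern)
print(name, "expands to", len(variants), "sequences")
for sequence in variants:
    print("  ", sequence)
```

MY11 contains three two-fold ambiguity codes (M, W, and Y) and so yields 2 x 2 x 2 = 8 sequences, matching MY11a through MY11h; MY09 contains four (M, R, R, and W) and yields 16, matching MY09a through MY09p.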

As mentioned in the prion exercise, these data files are in a special GCG format known as a pattern data file. You’ll have at least three primer pattern data files in your working directory now: one from the prion exercise, one sample data file, “primer-tutorial.L1.dat,” and your new HPV all data file. I’d like to discuss this format in more detail here. It is structured after GCG’s restriction enzyme data files, and begins with some helpful documentation or an appropriate explanation at the top of the file. Two periods “..” are essential in all GCG data files and separate header information from the data below. Each entire pattern needs to be on one line apiece without any gaps (ambiguity syntax is supported in most GCG programs that read these data, but not in the primer analysis ones); it needs to be prefaced with a name and the offset number 1, and followed by an optional overhang number 0. The exact column in which the various fields appear is not important, but the order of the fields is vital. Comments can be embedded anywhere by placing an exclamation point before them.

See if you can figure out how to submit this data file to Prime in order to test how well the commercial primers stack up against your custom designed ones with the universal consensus sequence. You’ll have to use the “Select forward and reverse primers from one file” option. Use the “Primers” button there to produce a dialog box, “Primer Chooser for Prime;” click on its “Primer Data File. . .” button, and use the “File Chooser” to pick my sample “primer-tutorial.L1.dat” and click “OK.” “Close” the “Primer Chooser” window. The “Prime Options” window should now show that you are using the appropriate data file; “Close” the “Prime Options” window and “Run” the program.

b) Design strain specific primers

To find suitable strain specific primers, first check out the phylogenetic tree shown below, which was inferred from the alignment that we have been working with.
This particular tree was estimated using a distance-based method with the Kimura correction model and the Fitch-Margoliash least-squares fit algorithm. It is representative of the trees that this alignment will yield with almost all inference algorithms. I have arbitrarily placed the HPV type 16 assemblage at the top of the tree and imply no directionality by this. Horizontal branch lengths are directly proportional to evolutionary divergence in units of substitutions per site; vertical branch length has no meaning other than to separate clades.
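To give a feel for what a distance “in units of substitutions per site” means, here is a small Python sketch of the Kimura two-parameter correction, d = -1/2 ln[(1 - 2P - Q) sqrt(1 - 2Q)], where P and Q are the observed proportions of transitions and transversions between two aligned sequences. I am assuming that the two-parameter model is the “Kimura correction” meant here; the function name and the toy sequences are illustrative only, and this is not the code that produced the tree.

```python
import math

PURINES = set("AG")


def kimura_2p_distance(seq_a: str, seq_b: str) -> float:
    """Kimura two-parameter distance between two aligned DNA sequences."""
    compared = transitions = transversions = 0
    for a, b in zip(seq_a.upper(), seq_b.upper()):
        if a not in "ACGT" or b not in "ACGT":
            continue                      # skip gaps and ambiguity codes
        compared += 1
        if a == b:
            continue
        if (a in PURINES) == (b in PURINES):
            transitions += 1              # purine<->purine or pyrimidine<->pyrimidine
        else:
            transversions += 1            # purine<->pyrimidine
    p = transitions / compared
    q = transversions / compared
    return -0.5 * math.log((1 - 2 * p - q) * math.sqrt(1 - 2 * q))


# Toy example: ten aligned sites with a single A<->G transition.
print(round(kimura_2p_distance("ACGTACGTAC", "GCGTACGTAC"), 4))
```

The corrected distance is always at least as large as the raw proportion of differing sites, because it allows for multiple substitutions having occurred at the same position.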

The Human Papillomaviruses most closely related to Type 16 are illustrated below in a phylogenetic tree:

Pick your favorite clade from the tree (a clade is a group of organisms all related to a common ancestor not shared by some other group). Go for any level of specificity, from just a couple of individual sequences composing a clade up to a whole bunch of related sequences in a larger clade. Now go back to the SeqLab editor. Deselect the consensus sequence by clicking it (but remember that on Linux SeqLab you need to use the right mouse button). Next, select the group of DNA sequences that you wish to design strain specific primers for. Remember to press the selection modifier key (or right-click in Linux) to select more than one nonadjacent entry. For my example I chose the clade that includes the type 6 and 11 strains. After selecting the sequences that you want to design strain specific primers for, go back to the “Functions” menu and run PlotSimilarity on just these chosen sequences. Your plot should look something like my example shown below, but with your particular choices selected:

Areas of high conservation in this plot that correspond to areas of divergence in the previous plot are ideal for designing strain specific primers. In fact, you can even produce printouts on clear plastic transparencies and overlay the two to exactly localize the regions (or use a light box with paper printouts). Another rationale is to selectively deselect particular sequences and note the shift in valleys rather than peaks. Both methods can be very powerful and can tremendously help in the design of primers to any level of phylogenetic specificity desired. This really does work; I’ve had many clients successfully use the technique in real experiments!

The actual primer design phase of this step proceeds exactly the same as before. Close the windows that overlay the SeqLab editor. With your group of desired sequences still chosen, go to the “Edit” menu and select “Consensus.” Pick a desired agreement level (25% through 50% seem to work well depending on the level of divergence in your subset of the data) and create a consensus sequence of your desired subset. A new consensus sequence will appear below the alignment. Select your new consensus sequence (only) and go through the same procedure with it as with the overall alignment consensus to discover the best forward and reverse primers in it. Don’t forget to save the new RSF file and to use the “Save primers found to a pattern file” option for saving the primers in GCG pattern data format. Name this data file with a name that you’ll recognize (e.g. “HPV.subset.dat”).

You’ll be selecting those ranges within the subset consensus that correspond to the furthest separated, most highly conserved regions that do not line up with highly conserved regions of the overall alignment. Got it? This way you should be choosing sections of the L1 gene that will discriminate between, in my case, the type 6 and 11 variants and all the other closely related type 16 sequences. Pretty cool, huh? I discovered four acceptable type 6/11 strain specific primer pairs with my alignment subgroup consensus. The abridged result of my analysis follows. The parameters I had to change are again highlighted in bold in the output below:

PRIME of: L1.rsf{cons50pct}

ck: 3391     from: 1     to: 1638     March 4, 1997 15:33

INPUT SUMMARY
-------------
Input sequence: /usr/thompson/seqlab/L1.rsf{cons50pct}

Primer constraints:
  primer size: 18 - 22
  primer 3' clamp: S
  primer sequence ambiguity: NOT ALLOWED
  primer GC content: 40.0 - 55.0%
  primer Tm: 50.0 - 65.0 degrees Celsius
  primer self-annealing. . .
    3' end: < 8 (weight: 2.0)
    total: < 14 (weight: 1.0)
  unique primer binding sites: required
  primer-template and primer-repeat annealing. . .
    3' end: ignored
    total: ignored
  repeated sequences screened: none specified

Product constraints:
  product length: 740 - 942
  product GC content: 40.0 - 55.0
  product Tm: 70.0 - 95.0 degrees Celsius
  product must include the region from 611 - 1350
  duplicate primer endpoints: NOT ALLOWED
  difference in primer Tm: < 2.0 degrees Celsius
  primer-primer annealing. . .
    3' end: < 8 (weight: 2.0)
    total: < 14 (weight: 1.0)

PRIMER SUMMARY
--------------
                                    forward   reverse
Number of primers considered:            6         6

Number of primers rejected for . . .
  primer 3' clamp:                       0         0
  primer sequence ambiguity:             0         0
  primer GC content:                     0         0
  primer Tm:                             0         0
  non-unique binding sites:              0         0
  primer self-annealing:                 0         1
  primer-template annealing:             0         0
  primer-repeat annealing:               0         0

Number of primers accepted:              6         5

PRODUCT SUMMARY
---------------
Number of products considered:          30

Number of products rejected for . . .
  product length:                        0
  product GC content:                    2
  product Tm:                            0
  product position:                      0
  duplicate primer endpoints:            9
  difference in primer Tm:              13
  primer-primer annealing:               0

Number of products accepted:             6
Number of products saved:                6

-----------------------------------------------------------------------------
Product: 1

[DNA] = 50.000 nM     [salt] = 50.000 mM

PRIMERS
-------
forward primer: T611fb
reverse primer: T611rc
                                5'                        3'
forward primer (21-mer):    525 GGATAACAGGGTTAATGTAGG     545
reverse primer (18-mer):   1408 GCTTTTGACAGGTAATGG       1391

                                forward    reverse
primer %GC:                       42.9       44.4
primer Tm (degrees Celsius):      53.0       51.4

PRODUCT
-------
product length:                  884
product %GC:                    40.0
product Tm:                     76.3 degrees Celsius
difference in primer Tm:         1.6 degrees Celsius
annealing score:                  53
optimal annealing temperature:  53.9 degrees Celsius
-----------------------------------------------------------------------------
Product: 2

[DNA] = 50.000 nM     [salt] = 50.000 mM

PRIMERS
-------
forward primer: T611fe
reverse primer: T611rc
                                5'                        3'
forward primer (20-mer):    522 ACAGGATAACAGGGTTAATG      541
reverse primer (18-mer):   1408 GCTTTTGACAGGTAATGG       1391

                                forward    reverse
primer %GC:                       40.0       44.4
primer Tm (degrees Celsius):      51.7       51.4

PRODUCT
-------
product length:                  887
product %GC:                    40.0
product Tm:                     76.3 degrees Celsius
difference in primer Tm:         0.3 degrees Celsius
annealing score:                  55
optimal annealing temperature:  53.9 degrees Celsius
-----------------------------------------------------------------------------
Product: 3

[DNA] = 50.000 nM     [salt] = 50.000 mM

PRIMERS
-------
forward primer: T611fc
reverse primer: T611rc
                                5'                        3'
forward primer (18-mer):    514 AACCCTGGACAGGATAAC        531
reverse primer (18-mer):   1408 GCTTTTGACAGGTAATGG       1391

                                forward    reverse
primer %GC:                       50.0       44.4
primer Tm (degrees Celsius):      52.0       51.4

PRODUCT
-------
product length:                  895
product %GC:                    40.2
product Tm:                     76.4 degrees Celsius
difference in primer Tm:         0.6 degrees Celsius
annealing score:                  55
optimal annealing temperature:  54.0 degrees Celsius
///////////////////////////////////////////////////////////////
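The selection rule used to choose the primer ranges for this run (regions that are highly conserved within the chosen subset but that do not coincide with highly conserved regions of the whole alignment) can also be expressed in a few lines of code. The Python sketch below scores each alignment column by the fraction of sequences that share its most common base, averages that over a sliding window, and reports the windows that pass both thresholds. This simple identity score is a stand-in that I am assuming for illustration; it is not PlotSimilarity's algorithm, and the function names, thresholds, and toy alignment are all hypothetical.

```python
def column_identity(column: str) -> float:
    """Fraction of sequences sharing the most common base in one column."""
    column = column.upper()
    best = max(column.count(base) for base in "ACGT")
    return best / len(column)             # gaps lower the score


def window_scores(alignment, window: int):
    """Average column identity for every window start (0-based)."""
    length = len(alignment[0])
    per_column = [
        column_identity("".join(seq[i] for seq in alignment))
        for i in range(length)
    ]
    return [
        sum(per_column[start:start + window]) / window
        for start in range(length - window + 1)
    ]


def discriminating_windows(all_seqs, subset, window=20,
                           subset_min=0.95, overall_max=0.80):
    """Window starts conserved in the subset but not in the whole alignment."""
    overall = window_scores(all_seqs, window)
    within = window_scores(subset, window)
    return [
        start
        for start, (o, w) in enumerate(zip(overall, within))
        if w >= subset_min and o <= overall_max
    ]


# Toy alignment: the first two sequences play the role of the chosen clade.
toy_alignment = ["ACGTACGTAA", "ACGTACGTAA", "ACGTTTTTAA", "ACGTGGGGAA"]
toy_subset = toy_alignment[:2]
print(discriminating_windows(toy_alignment, toy_subset,
                             window=4, subset_min=1.0, overall_max=0.7))
```

With real data the window would be about one primer length, and the thresholds would need tuning to the divergence of the dataset, much as the consensus agreement level does.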

Be sure to again “Add to Editor” the “prime.rsf” file displayed in the “Output Manager.” Choose “Overwrite old with new” in the “Reloading Same Sequence” window that pops up. As before, this will merge the new feature annotation that locates the successful strain specific primers onto your existing RSF file. “Close” the “Output Manager” window after loading your new feature annotation.

c) Test for specificity

First we’ll see how specific our primers are for the sequences in our dataset.

To do so, we’ll run FindPatterns in SeqLab against all the sequences in your alignment. Go to the “Functions” menu, choose “Database Sequence Searching” and “FindPatterns.” Next, specify both the “Search Set” and the “Patterns” to be used by the program. This gets a bit interesting, as you have to navigate through several file chooser boxes to designate your desired input file and pattern data file. First click the “Search Set” button to get a dialog box entitled “Build FindPattern’s Search Set.” Click on “Add Main List Selection” to produce the “List Chooser;” there select and then “Add” the alignment file we’ve been working on, “primer-tutorial.L1.rsf.” “Close” the chooser boxes to return to the FindPatterns program box.

Now click “Patterns” to get the “Pattern Chooser.” Here click on “Pattern Data File. . .” to get the “File Chooser;” your newly created data files from above should be displayed. Select your final universal primer data file and click “OK” and then press the “Add All” button; then click the “Pattern Data File. . .” button again. This time add your final subset specific primer data file by selecting it and pressing “OK” in the “File Chooser” window and then using the “Add All” button in the “Pattern Chooser.” Finally repeat the procedure with the My09/11 commercial data set. You should end up with a combined pattern set containing your universal and subset specific primer sets along with the commercial set; they will now be displayed in the “Chosen Patterns:” window. You may want to “Save Chosen. . .” to create a combined primer data pattern file in your account. “Close” the “Pattern Chooser” window. “Using selected Patterns.” should now be displayed after the “Patterns. . .” button.

Check “Save matches as features in ‘findpatterns.rsf’” and then press the “Run” button after setting up the specified search and pattern sets. The program box will go away and the output will display relatively soon. Scroll through the file, noticing the specificity of each primer for particular sequences. “Close” the windows when done. I found it interesting in my test runs that the My09/11 primers matched so few sequences in our dataset. Be sure to “Add to Editor” the “findpatterns.rsf” file displayed in the “Output Manager.” Choose “Overwrite old with new” in the “Reloading Same Sequence” window that pops up. As always, this will merge the new feature annotation that locates all the primer location data onto your existing RSF file. “Close” the “Output Manager” window after loading your new feature annotation. Check out the new features in “Graphic Features” mode to quickly see which sequences anneal to your new primers.

We still need to see if our primers are specific to HPV though. We did this sort of test in the prion exercise with FindPatterns, for the reasons already discussed. Because the DNA databases are so big, and getting bigger all the time, if you ever do this type of search from the command line, be sure to do it in batch mode. GCG makes this easy by providing a –Batch option to many of their CPU intensive programs. Searching all of GenBank in this fashion takes quite a while to run, so I am providing you with that output file. I ran the FindPatterns search with our candidate primers data file against all of GenBank, allowing for one mismatch to occur between the primer and the sequence. For those interested, I used the following command line for this search; however, do not repeat this search at this point:

> findpatterns -data=combined.L1.dat -mismatch=1 -batch gb:* primer-tutorial.genbank.finds
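To make concrete what a search “allowing for one mismatch” involves, here is a minimal Python sketch that slides a primer along a sequence and reports every position where it matches with no more than a set number of mismatches. Checking the reverse complement as well is a design choice of this sketch, and the function names and example sequence are mine; it illustrates the idea only and makes no claim about how FindPatterns is implemented.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Reverse complement of an unambiguous DNA sequence."""
    return seq.upper().translate(COMPLEMENT)[::-1]


def find_with_mismatches(pattern: str, sequence: str, max_mismatches: int = 1):
    """Return (1-based start, strand, mismatches) for every acceptable match."""
    sequence = sequence.upper()
    hits = []
    probes = (("+", pattern.upper()), ("-", reverse_complement(pattern)))
    for strand, probe in probes:
        for start in range(len(sequence) - len(probe) + 1):
            window = sequence[start:start + len(probe)]
            mismatches = sum(a != b for a, b in zip(probe, window))
            if mismatches <= max_mismatches:
                hits.append((start + 1, strand, mismatches))
    return hits


primer = "CATGCAGGCAGTTCTAGAC"                  # forward primer from Product 1
exact_site = "AAAA" + primer + "TTTT"
one_off_site = "AAAA" + "CATGCAGGCTGTTCTAGAC" + "TTTT"   # one base changed

print(find_with_mismatches(primer, exact_site))
print(find_with_mismatches(primer, one_off_site))
print(find_with_mismatches(primer, one_off_site, max_mismatches=0))
```

Each extra mismatch allowed makes the search more sensitive but also admits more chance hits, which is one reason a 20-mer searched against all of GenBank turns up sequences that have nothing to do with HPV.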

Temporarily switch to the terminal window that’s been hanging around behind SeqLab. Use the UNIX command “more” to page through the file “primer-tutorial.L1.finds” in the following manner:

> more primer-tutorial.L1.finds

The file is huge, almost 5000 lines, with 1,357 finds in 326 different sequences. Each individual pattern found is listed along with its location on each sequence. Notice the types of sequences being found by the program. Largely they are HPV, the proper sequences to be found by the candidate primers, but some notable exceptions appear. In particular, all of the U.S. patent sequences are interesting; they are likely commercial kits for HPV diagnosis. Other exceptions include a few E. coli sequences, a potato cDNA, and a cow kinase (none likely to cause much of a PCR contamination problem in genital tissue swabs), and some human “genomic” sequences. These human sequences may cause some problems and should probably be checked out. One of them turns out to be an HPV integrated site in a human carcinoma (Yabe, et al., 1991) and the other a genomic clone that encodes the PAX6 protein (van Heyningen and Little, 1995). The HPV site is expected. The PAX6 site is interesting. It probably won’t be a problem, since the primers that found it are from two different series, MY11 and T 6/11, but its contaminant potential should be kept in mind. For those interested, PAX6 turns out to be a ‘paired box’ type homeobox protein involved in vertebrate eye development.

Supplement

Back in the wet lab you would have synthesized oligos (and labeled them, if doing hybridization), performed the PCR reaction or hybridization screen, and isolated the products with plaque/colony purification or direct PCR purification, as appropriate. After you have found a candidate sequence, what next? Often it’s restriction mapping. The unknown stretch of DNA is restriction digested with various enzymes and agarose gel electrophoresed; the resultant fragment sizes are extrapolated from migration distances. From this information a tentative restriction map can be hypothesized. This type of restriction mapping, i.e. reconstructing a physical map based on overlaps without having an actual sequence, is computationally very difficult. Few automated solutions exist. Alternative strategies include subcloning the pieces into a manageable vector and then sequencing those fragments, or direct PCR product sequencing.

After generating some sequence data, the other type of restriction mapping, where you do know the sequence and merely want to know where all the various restriction enzymes may cut, can be very helpful. The GCG programs Map, MapPlot, MapSort, and PlasmidMap can all assist in guiding and illustrating this process. Once all cut sites have been mapped, SeqLab, or the stand-alone sequence editor SeqEd, can be used to actually perform the subcloning operation on the computer before doing it in the wet lab.

References Cited

Bilofsky, H.S., Burks, C., Fickett, J.W., Goad, W.B., Lewitter, F.I., Rindone, W.P., Swindell, C.D., and Tung, C.S. (1986) The GenBank(TM) Genetic Sequence Data Bank. Nucleic Acids Research 14: 1-4.

Cherfas, J. (1990) Genes Unlimited. New Scientist 14: 29-33.

Genetics Computer Group (GCG), Inc. (Copyright 1982-2000) Program Manual for the Wisconsin Package, Version 10.2, Madison, Wisconsin, USA 53711.

Gribskov, M., Luethy, R., and Eisenberg, D. (1989) Profile Analysis. Methods in Enzymology 183: 146-159, Academic Press, San Diego, California, USA.

Gupta, S.K., Kececioglu, J., and Schaffer, A.A. (1995) Making the Shortest-Paths Approach to Sum-of-Pairs Multiple Sequence Alignment More Space Efficient in Practice. Proc. 6th Annual Combinatorial Pattern Matching conference (CPM ‘95).

Henikoff, S. and Henikoff, J.G. (1992) Amino Acid Substitution Matrices from Protein Blocks. Proceedings of the National Academy of Sciences U.S.A. 89: 10915-10919.

Mullis, K.B. (1990) The Unusual Origin of the Polymerase Chain Reaction. Scientific American April: 56-65.

Nagano, H., Yoshikawa, H., Kawana, T., Yokota, H., Taketani, Y., Igarashi, H., Yoshikura, H., and Iwamoto, A. (1996) Association of multiple human papillomavirus types with vulvar neoplasias. J. Obstet. Gynaecol. 22: 1-8.

Online Mendelian Inheritance in Man, OMIM (TM). (1996) Center for Medical Genetics, Johns Hopkins University (Baltimore, MD) and National Center for Biotechnology Information, National Library of Medicine (Bethesda, MD). World Wide Web URL: http://www3.ncbi.nlm.nih.gov/omim/

Saiki, R.K., Gelfand, D.H., Stoffel, S., Scharf, S.J., Higuchi, R., Horn, G.T., Mullis, K.B., and Erlich, H.A. (1988) Primer-Directed Enzymatic Amplification of DNA with a Thermostable DNA Polymerase. Science 239: 487-491.

Sambrook, J., Fritsch, E.F., and Maniatis, T. (1989) Synthetic Oligonucleotide Probes. In Molecular Cloning: A Laboratory Manual, 2nd ed. (pp. 11.2-11.53), Cold Spring Harbor Laboratory Press, New York, New York, USA.

Schwartz, R.M. and Dayhoff, M.O. (1979) Matrices for Detecting Distant Relationships. In Atlas of Protein Sequences and Structure, 5, Suppl. 3 (pp. 353-358), National Biomedical Research Foundation, Washington, D.C., USA.

Smith, R.F. and Smith, T.F. (1992) Pattern-Induced Multi-sequence Alignment (PIMA) algorithm employing secondary structure-dependent gap penalties for comparative protein modelling. Protein Engineering 5: 35-41.

Stewart, A.C., Eriksson, A.M., Manos, M.M., Munoz, N., Bosch, F.X., Peto, J., and Wheeler, C.M. (1996) Intratype variation in 12 human papillomavirus types: a worldwide perspective. J. Virol. 70: 3127-3136.

Tenti, P., Romagnoli, S., Silini, E., Zappatore, R., Spinillo, A., Giunta, P., Cappellini, A., Vesentini, N., Zara, C., and Carnevali, L. (1996) Human papillomavirus types 16 and 18 infection in infiltrating adenocarcinoma of the cervix: PCR analysis of 138 cases and correlation with histologic type and grade. Am. J. Clin. Pathol. 106: 52-56.

Thompson, J.D., Higgins, D.G., and Gibson, T.J. (1994) CLUSTALW: improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position-specific gap penalties and weight matrix choice. Nucleic Acids Research 22: 4673-4680.

van Heyningen, V. and Little, P.F. (1995) Report of the fourth international workshop on human chromosome 11 mapping 1994. Cytogenet. Cell Genet. 69: 127-158.

White, T.J., Arnheim, N., and Erlich, H.A. (1989) The Polymerase Chain Reaction. Trends in Genetics 5: 185-189.

Wood, W.I. (1987) Gene Cloning Based on Long Oligonucleotide Probes. Methods in Enzymology 152: 443-447, Academic Press, San Diego, California, USA.

Yabe, Y., Sakai, A., Hitsumoto, T., Kato, H., and Ogura, H. (1991) A subtype of human papillomavirus 5 (HPV-5b) and its subgenomic segment amplified in a carcinoma: nucleotide sequences and genomic organizations. Virology 183: 793-798.