Next generation sequencing

Author: Alexia Byrd

www.gatc.co.uk/cgi-bin/wPrintpreview.cgi?sour...

Depiction of a new type of highly miniaturized microarray that incorporates randomness in its design. Each array contains ~50,000 beads carrying oligonucleotide probes. The beads are lodged in wells on the surface of a hexagonally packed optical fiber bundle. The location and identity of the randomly arrayed beads are determined using a hybridization-based decoding process. Among other applications, the decoded arrays have been used to develop a microarray-based gene expression profiling assay that makes use of PCR, and to carry out genotyping from small amounts of human genomic DNA using whole-genome amplification. (Artwork provided by Andrew Roberts at Studio 209, Portland, OR.)

Next generation of DNA and RNA sequencing methods
- Bead-amplification sequencing (Roche/454 FLX)
- Sequencing by synthesis (Illumina/Solexa Genome Analyzer)
- Sequencing by ligation (Applied Biosystems SOLiD System)
- Helicos HeliScope (2008)
- Pacific Biosciences SMRT (2010)

Common features:
- A complex interplay of enzymology, chemistry, software, hardware and optics engineering
- Streamlined sample preparation prior to sequencing (time saving)
- Preparation of fragment libraries of the DNA of interest by annealing platform-specific linkers, followed by amplification
- Amplification of the single-stranded fragment library and sequencing of the amplified fragments
- Single-molecule sequencing has just arrived or is under development

Comparison between capillary sequencers and next-generation sequencers
- A huge difference in the throughput of a single run: 96 capillaries with reads of about 750 bp, compared to several thousand (Roche) to tens of millions (Illumina, ABI) of shorter reads.
- Run times are longer on next-generation sequencers (8 h to 10 days).
- The longer run times are required to image the massively parallel sequencing reactions.
- Owing to the streamlined preparation and long run times, a single human operator can keep several next-generation machines running at full capacity.
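The throughput gap can be made concrete with a little arithmetic. The sketch below multiplies read count by read length, using the capillary figures above and the GS FLX Titanium figures (1 million reads of 400 bp) quoted later in these slides. It is an idealized estimate: it assumes every read reaches the nominal length, and the function name is only illustrative.

```python
def bases_per_run(reads: int, read_length_bp: int) -> int:
    """Idealized yield of one run: every read is assumed to reach full length."""
    return reads * read_length_bp


# Capillary instrument: 96 capillaries, reads of about 750 bp.
capillary = bases_per_run(96, 750)

# Roche GS FLX Titanium: 1 million reads of 400 bp.
titanium = bases_per_run(1_000_000, 400)

print(capillary)              # 72000
print(titanium)               # 400000000
print(titanium // capillary)  # 5555 (whole-number fold difference)
```

Under these assumptions a single Titanium run yields several thousand times more sequence than a single capillary run.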

ABI

Solexa

New (next) generation sequencing
- Sequencing the human genome took several years, using about 20,000 BAC clones, each containing a target fragment about 100 kb long, with 8-fold coverage of every part of the target. Analysis by capillary electrophoresis.
- Further development was based on whole-genome shotgun sequencing (WGS), in which the entire genome, inserted into vectors, is sequenced at once. The method is faster, but it leaves large gaps in highly polymorphic or repetitive genomes. Analysis by capillary electrophoresis.
- Next-generation sequencing (2004): high-throughput parallel reading of DNA segments at the whole-genome level via PCR amplification of single-stranded fragments of a genomic library.

Classical Sanger dideoxy sequencing method

Dideoxynucleotide sequencing represents only one method of sequencing DNA. It is commonly called Sanger sequencing since Sanger devised the method. This technique utilizes 2',3'-dideoxynucleotide triphosphates (ddNTPs), molecules that differ from deoxynucleotides by having a hydrogen atom attached to the 3' carbon rather than an OH group. These molecules terminate DNA chain elongation because they cannot form a phosphodiester bond with the next deoxynucleotide.
campus.queens.edu/faculty/jannr/molecular/
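How chain termination turns into a readable sequence can be illustrated with a toy model. The sketch below assumes an idealized reaction in which a terminated fragment is produced at every position of the newly synthesized strand, and that fragments are then separated purely by length. The function names are hypothetical and nothing here models real signal intensities.

```python
import random


def sanger_fragments(new_strand: str) -> list[str]:
    """All chain-terminated products: each prefix ends in an incorporated ddNTP."""
    return [new_strand[:i] for i in range(1, len(new_strand) + 1)]


def read_ladder(fragments: list[str]) -> str:
    """Sort fragments by length (as electrophoresis does) and read the
    terminal, labelled base of each one."""
    return "".join(fragment[-1] for fragment in sorted(fragments, key=len))


fragments = sanger_fragments("ATGCCGTA")
random.Random(0).shuffle(fragments)  # the reaction tube holds them in no order
print(read_ladder(fragments))        # ATGCCGTA
```

Because every fragment has a distinct length, sorting by size alone is enough to recover the order of the terminating bases.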

General concepts for clonal-array generation and sequencing

a | Bead-chips. Genomic DNA is fragmented and adaptors are ligated to create an insert library that is flanked by two universal priming sites. Because of the random fragmentation, the complexity of this signature sequence library is equivalent to the genome. This library is cloned on beads using emulsion PCR technology. A water-in-oil emulsion is created from a PCR mix that contains a limiting dilution of DNA and beads. The emulsion creates micro-compartments with, on average, a single bead and single DNA template each. After PCR, beads with clones are affinity selected and assembled onto a planar substrate. A subsequent cycle-sequencing reaction is used to read out the sequence on the clones.

b | Sequencing by synthesis (SBS). A common anchor primer is annealed to a constant sequence (universal priming site) that is contained within the library clones that are located on the polony (clonal bead) array (the orientation of the immobilized target might vary depending on the platform that is used). The sequence is read out by polymerase extension in a base-by-base fashion using either reversible terminators or sequential nucleotide addition (pyrosequencing). After incorporation of a single base or base type, the incorporated base is identified by fluorescence (laser) or chemiluminescence (no laser required).

c | Sequencing by ligation. The polony array set-up is similar to SBS in which a common primer is annealed to an arrayed polony library and used to read out the sequence through a stepwise ligation of random oligomers. The labelled oligomers are designed to have random bases inserted at every site except the query site. The query site has one of four base substitutions, each matched to a particular fluorescent label on the oligonucleotide. After read-out of each ligation event, the primer and the ligated oligomer are stripped, a new primer reannealed and the process repeated with an oligomer that contains a query base at a different position.

Fan et al. Nature Reviews Genetics 7, 632–644 (August 2006) | doi:10.1038/nrg1901
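The "limiting dilution" in emulsion PCR has a simple quantitative reading. The sketch below assumes that templates fall into droplets independently, so the number of templates per droplet follows a Poisson distribution. That distribution is an assumption of this example, since the caption only states the average of one template per compartment.

```python
import math


def poisson_pmf(k: int, mean: float) -> float:
    """Probability of exactly k templates in a droplet under a Poisson model."""
    return math.exp(-mean) * mean ** k / math.factorial(k)


mean_templates = 1.0  # "on average, a single DNA template" per compartment
p_empty = poisson_pmf(0, mean_templates)
p_single = poisson_pmf(1, mean_templates)
p_multiple = 1.0 - p_empty - p_single

print(round(p_empty, 3), round(p_single, 3), round(p_multiple, 3))
# 0.368 0.368 0.264
```

Under this model only about a third of the compartments are clonal, which is one way to see why beads carrying clones are affinity selected after PCR.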


Sequencing with novel generation bead-chips

FIGURE 1. A single well within a picotiter plate includes DNA copies bound to the bead as sequencing reagents are added. (Courtesy of Roche)

The Roche system uses native, unmodified DNA bases in its process. In the DNA preparation step, the DNA sample is sheared into small fragments that are then attached to 26 µm beads, one fragment per bead. Then, in a process called emulsion PCR, the DNA is amplified so that each bead carries 100,000 copies of the original DNA fragment. The DNA-coated beads are then loaded into the wells of a 1.6 million-well picotiter plate so that, on average, there is one bead per well. The wells of the picotiter plate are made of fiber-optic material so that they can transmit (via a coupled CCD imager) the light signals that are used to indicate the DNA sequence. The sequencing reagents and one of the four DNA bases are added to start the pyrosequencing process. The camera records all the wells in which that base was incorporated, and the intensity of the signal is used to infer the number of times that base was added to the growing strand. Then that base is washed away and the second base is added, and so on, until the sequence of the fragment is established.

Principles of Pyrosequencing

Pyrosequencing I

Step 1 A sequencing primer is hybridized to a single-stranded PCR amplicon that serves as a template. The mixture is incubated with the enzymes DNA polymerase, ATP sulfurylase, luciferase and apyrase, as well as the substrates adenosine 5' phosphosulfate (APS) and luciferin.

Step 2 The first deoxyribonucleotide triphosphate (dNTP) is added to the reaction. DNA polymerase catalyzes the incorporation of the deoxyribonucleotide triphosphate into the DNA strand if it is complementary to the base in the template strand. Each incorporation event is accompanied by release of pyrophosphate (PPi) in a quantity equimolar to the amount of incorporated nucleotide.
www1.qiagen.com/Products/PyroMarkQ96ID.aspx

Pyrosequencing II

Step 3 ATP sulfurylase converts PPi to ATP in the presence of adenosine 5' phosphosulfate (APS). ATP drives the luciferase-mediated conversion of luciferin to oxyluciferin, which generates visible light in amounts that are proportional to the amount of ATP. The light produced in the luciferase-catalyzed reaction is detected by a charge-coupled device (CCD) chip and seen as a peak in the raw data output (Pyrogram). The height of each peak (light signal) is proportional to the number of nucleotides incorporated.

Step 4 Apyrase, a nucleotide-degrading enzyme, continuously degrades unincorporated nucleotides and ATP. When degradation is complete, another nucleotide is added.

www1.qiagen.com/Products/PyroMarkQ96ID.aspx

Pyrosequencing III

Step 5 Addition of dNTPs is performed sequentially. It should be noted that deoxyadenosine alpha-thio triphosphate (dATPαS) is used as a substitute for the natural deoxyadenosine triphosphate (dATP), since it is efficiently used by the DNA polymerase but not recognized by the luciferase. As the process continues, the complementary DNA strand is built up and the nucleotide sequence is determined from the signal peaks in the Pyrogram trace.
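Steps 1 to 5 can be summarized as a small simulation of an idealized Pyrogram. The sketch below assumes a fixed cyclic dispensation order (`FLOW_ORDER`, a hypothetical choice) and noise-free signals that equal the number of bases incorporated in each flow. Real traces are noisy and the dispensation order depends on the assay.

```python
FLOW_ORDER = "TACG"  # assumed cyclic dispensation order, for illustration only


def flowgram(new_strand: str, flow_order: str = FLOW_ORDER) -> list[tuple[str, int]]:
    """Return (dispensed base, idealized peak height) for each nucleotide flow."""
    if set(new_strand) - set(flow_order):
        raise ValueError("strand contains a base that is never dispensed")
    signals = []
    position = 0
    flow = 0
    while position < len(new_strand):
        base = flow_order[flow % len(flow_order)]
        incorporated = 0
        # The polymerase extends through a whole homopolymer run in one flow.
        while position < len(new_strand) and new_strand[position] == base:
            incorporated += 1
            position += 1
        signals.append((base, incorporated))
        flow += 1
    return signals


def call_bases(signals: list[tuple[str, int]]) -> str:
    """Rebuild the sequence: a peak of height n means n copies of that base."""
    return "".join(base * height for base, height in signals)


signals = flowgram("TTAGGGC")
print(signals)
# [('T', 2), ('A', 1), ('C', 0), ('G', 3), ('T', 0), ('A', 0), ('C', 1)]
print(call_bases(signals))  # TTAGGGC
```

The zero-height flows correspond to dispensed nucleotides that are not complementary to the template and are simply degraded by apyrase.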

Roche/454 FLX Pyrosequencer
Library fragments are mixed with agarose beads carrying oligos complementary to the adapter sequences on the library, so that each bead is associated with a single fragment. Each fragment-bead complex is isolated into an individual oil:water micelle containing PCR mixture. Thermal cycling of the micelles (emulsion PCR) produces amplified unique sequences on the bead surface. The PCR products are sequenced en masse on picotiter plates (PTP) with a single bead in each picowell. Enzyme- and substrate-containing beads for the pyrosequencing reaction are added to the wells, which act as flow cells for the addition of individual pure nucleotide solutions. The CCD camera records the light emitted at each bead.
Mardis E.R. Annual Review of Genomics and Human Genetics 9: 387-403 (2008).

Watson and Cricks pyrosequencing readout

Timeline of the pyrosequencing development

- October 2005: Release of the Genome Sequencer 20, the first next-generation sequencing system on the market
- October 2005: Collaboration agreement signed with Roche Diagnostics
- January 2007: Release of the Genome Sequencer FLX System
- March 2007: Roche Diagnostics completes integration with 454 Life Sciences
- May 2007: Complete sequence of Jim Watson published in Nature. First genome to be sequenced for less than $1 million.
- November 2007: Announcement of the 100th peer-reviewed publication enabled by 454 Sequencing
- June 2008: 454 joins the 1000 Genomes Project, an international effort to build the most detailed map to date of human genetic variation as a tool for medical research
- September 2008: Announcement of the 250th peer-reviewed publication enabled by 454 Sequencing
- October 2008: Release of Genome Sequencer FLX Titanium Series reagents, featuring 1 million reads at 400 base pairs in length

Illumina bead arrays consist of randomly assembled glass (silica) beads. The beads have oligonucleotides covalently attached to their surface; each bead carries about one million oligos, all with the same sequence. Because the beads are randomly assembled on the arrays, the location of a particular probe is initially unknown, and a process called decoding is used to find the location of each bead.

For sequencing, attached DNA fragments are extended and bridge amplified to create an ultra-high-density sequencing flow cell with 80-100 million clusters, each containing ~1,000 copies of the same template. These templates are sequenced using a robust four-color DNA sequencing-by-synthesis technology that employs reversible terminators with removable fluorescent dyes. This approach ensures high accuracy and true base-by-base sequencing, eliminating sequence-context-specific errors and enabling sequencing through homopolymers and repetitive sequences.

Illumina beads and scanner

Bridge amplification – sequencing by synthesis


An isolated pair of 5' immobilized primers (positive and negative) and a specific target DNA strand. The solution above the array contains amplification buffer, target DNA, polymerase and labeled dNTPs.

Bridge amplification is a technology that uses primers bound to a solid phase for the extension and amplification of solution phase target nucleic acid sequences. The name refers to the fact that during the annealing step, the extension product from one bound primer forms a bridge to the other bound primer. All amplified products are covalently bound to the surface, and can be detected and quantified without electrophoresis.

An array of 100 pixels on a flat surface (bead, chip or any other suitable solid phase format). Each pixel contains primer pairs (negative and positive) and is specific for one target DNA sequence. All amplified DNA remains covalently attached to a specific pixel on the array. Detection of incorporated label in a pixel indicates the presence of a specific target DNA sequence in the sample.
www.promega.com/

Illumina sequencing by synthesis


Decoding process

After the arrays are assembled, the beads in each position (16 bead types in this example) are identified by decoding. The array is hybridized to 16 "decoder oligonucleotides", each of which matches one of the oligonucleotide sequences (bead types) on the beads in the array. The decoder oligos are labelled with 4 different fluorescent dyes. The array is then imaged and stripped.
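With only 4 dyes and 16 bead types, a single image cannot tell all bead types apart. One way to resolve this, sketched below, is to repeat the hybridize, image and strip cycle and give every bead type a distinct sequence of colours across rounds. That combinatorial scheme, the dye names and the function names are assumptions of this example and are not taken from the slide.

```python
from itertools import product

DYES = ("dye1", "dye2", "dye3", "dye4")  # hypothetical labels for the 4 dyes


def build_codebook(n_bead_types: int, dyes=DYES):
    """Assign each bead type a unique colour sequence over as few rounds as possible."""
    rounds = 1
    while len(dyes) ** rounds < n_bead_types:
        rounds += 1
    codes = list(product(dyes, repeat=rounds))[:n_bead_types]
    codebook = {code: bead_type for bead_type, code in enumerate(codes)}
    return codebook, rounds


def decode(observed, codebook):
    """Map the colours seen at each array position (one per round) to a bead type."""
    return [codebook[tuple(colours)] for colours in observed]


codebook, rounds = build_codebook(16)
print(rounds)  # 2, because 4 dyes over 2 rounds give 4 * 4 = 16 distinct codes

observed = [("dye1", "dye2"), ("dye4", "dye4"), ("dye3", "dye1")]
print(decode(observed, codebook))  # [1, 15, 8]
```

In this scheme the number of rounds grows only logarithmically with the number of bead types, so a handful of rounds can address a very large array.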

Illumina2008ProductGuide.pdf

Illumina experimental protocol
- The total mature RNA is isolated from the cell or tissue being studied. This RNA has already been processed (the noncoding introns removed, the coding exons spliced together and a poly-A tail added).
- The RNA is converted by reverse transcription into a double-stranded DNA copy known as cDNA. RNA itself is not a very stable molecule, and the cDNA allows the sample to be stored for a much longer period of time.
- When it is time to run the array, the cDNA is transcribed in vitro back to RNA (now known as cRNA), which is labeled with biotin by using biotin-tagged uracil bases.
- The biotin-labeled cRNA is added to the array. Wherever an RNA fragment and an oligonucleotide on a bead are complementary, the RNA sticks to the probe on the bead.
- The array is washed to remove any RNA that is not bound to a probe (i.e., no match was made) and then stained with a fluorescent molecule that binds biotin.
- Finally, the entire array is scanned with a laser and the data are stored on a computer for quantitative analysis of which genes were expressed and at what approximate level.

Ligation mediated sequencing

Mardis E.R. Annual Review of Genomics and Human Genetics 9: 387-403 (2008).

Structure of detector oligonucleotides
- The first two nucleotides determine the colour of the fluorophore. The colour table shows the relationship between dinucleotides and fluorophores. Four different dinucleotides (256 different oligonucleotides) correspond to each fluorophore. If the first or second nucleotide of the dinucleotide is known, the colour unambiguously identifies the other nucleotide.
- The next three positions are degenerate nucleotides: 64 different versions for each particular dinucleotide. When ligated to the sequencing primer, only one of these 64 versions fits the position.
- Detector oligonucleotides (DO) are 8-mers fluorescently labeled on the 3' end. DOs cannot be too short, otherwise T4 ligase would not recognize them as a substrate.
- Altogether, there are 1024 different detector oligos: 4^(2+3) = 4^5 = 1024 (dinucleotide plus 3 degenerate positions).
- The last three positions are universal bases; they are the same for all detector oligonucleotides.
- Dark oligonucleotides have the same internal structure, but have no fluorophores.
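The counts quoted for the detector oligonucleotides can be checked by enumeration. The sketch below lists the five specific positions of every probe (dinucleotide plus three degenerate bases) and groups them by colour. The colour rule used here, XOR of a 2-bit code per base with A=0, C=1, G=2, T=3, is the commonly described two-base encoding. It is an assumption of this example, because the colour table itself is a figure that does not survive in the text.

```python
from collections import Counter
from itertools import product

BASES = "ACGT"
CODE = {base: index for index, base in enumerate(BASES)}  # A=0, C=1, G=2, T=3


def colour(dinucleotide: str) -> int:
    """Colour index (0-3) of a dinucleotide under the assumed XOR scheme."""
    return CODE[dinucleotide[0]] ^ CODE[dinucleotide[1]]


# Specific part of each probe: 2 encoding bases followed by 3 degenerate bases.
probes = ["".join(p) for p in product(BASES, repeat=5)]
print(len(probes))  # 1024

probes_per_colour = Counter(colour(p[:2]) for p in probes)
print(sorted(probes_per_colour.items()))
# [(0, 256), (1, 256), (2, 256), (3, 256)]

dinucleotides_per_colour = Counter(
    colour(a + b) for a, b in product(BASES, repeat=2)
)
print(sorted(dinucleotides_per_colour.items()))
# [(0, 4), (1, 4), (2, 4), (3, 4)]
```

The enumeration reproduces the slide's numbers: 4 dinucleotides and 256 oligonucleotides per fluorophore, 1024 in total.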

seq.molbiol.ru/sch_seq_ligase.html

Sequencing: ligation step
The three main operations during ligation-based sequencing are:
- Ligation of detector oligonucleotides: only one of the 1024 possible types of oligonucleotide is suitable for ligation. Both the "XY" dinucleotide and the degenerate part must be complementary to the template for successful ligation.
- Scanning: unincorporated oligonucleotides are washed out and bead fluorescence is registered in four spectral intervals.
- Digestion of the ligated DO: removes the fluorophore, exposes the phosphate on the 5' end and shifts the sequencing primer to a new position.

Ligation accuracy
Two factors provide specificity of ligation:
- Hybridization stability: an 8-mer oligonucleotide is very sensitive to any mismatch.
- T4 ligase accuracy: the enzyme is particularly sensitive to mismatches on the 3' side of the gap (sensitivity drops quickly with increasing distance from the gap).
seq.molbiol.ru/sch_seq_ligase.html

Sequencing: example of 35-base sequencing

Five primers and seven ligations for each primer: 35 reactions altogether. Each ligation reaction provides information about the colour of a particular dinucleotide. According to the colour code table (bottom right), four different dinucleotides may correspond to the same colour. To resolve this ambiguity, one ligation reaction analyses a dinucleotide in which one nucleotide is already known: in the first ligation with primer "B", the dinucleotide overlaps with the known adaptor sequence. Starting from this first known nucleotide, it is possible to determine the whole sequence.
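The decoding chain described above, one known base followed by a series of colours, can be sketched in a few lines. As before, the XOR colour rule (A=0, C=1, G=2, T=3) is an assumption standing in for the colour code table in the figure, and the function names are illustrative only.

```python
BASES = "ACGT"
CODE = {base: index for index, base in enumerate(BASES)}  # A=0, C=1, G=2, T=3


def encode(sequence: str) -> list[int]:
    """Colour of every overlapping dinucleotide in the sequence."""
    return [CODE[a] ^ CODE[b] for a, b in zip(sequence, sequence[1:])]


def decode(first_base: str, colours: list[int]) -> str:
    """Walk along the colours, deriving each base from the previous one."""
    sequence = [first_base]
    for c in colours:
        previous = CODE[sequence[-1]]
        sequence.append(BASES[previous ^ c])
    return "".join(sequence)


colours = encode("ATGGC")
print(colours)               # [3, 1, 0, 3]
print(decode("A", colours))  # ATGGC
print(decode("C", colours))  # CGTTA (a wrong first base changes every later call)
```

The last line shows why the known adaptor base matters: each colour alone is compatible with four dinucleotides, and only the anchor fixes which chain of bases is the real one.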

The principles of 2-base encoding/decoding

Mardis E.R. Annual Review of Genomics and Human Genetics 9: 387-403 (2008).

Applications of next generation sequencing

Creating highly parallel genotyping assays

Fan et al. Nature Reviews Genetics 7, 632–644 (August 2006) | doi:10.1038/nrg1901

a | Molecular-inversion probe (MIP) genotyping uses circularizable probes with 5' and 3' ends that anneal upstream and downstream of the SNP site, leaving a 1 bp gap (genomic DNA is shown in blue). Polymerase extension with dNTPs and a non-strand-displacing polymerase is used to fill in the gap. Ligation seals the nick, and exonuclease I (which has 3' exonuclease activity) is used to remove excess unannealed and unligated circular probes. Finally, the circularized probe is released through restriction digestion at a consensus sequence, and the resultant product is PCR-amplified using common primers to 'built-in' sites on the circular probe. The orientation of the primers ensures that only circularized probes will be amplified. The resultant product is hybridized and read out on an array of universal-capture probes.

b | GoldenGate genotyping uses extension ligation between annealed locus-specific oligos (LSOs) and allele-specific oligos (ASOs). An allele-specific primer-extension (ASPE) step is used to preferentially extend the correctly matched ASO (at the 3' end) up to the 5' end of the LSO primer. Ligation then closes the nick. A subsequent PCR amplification step is used to amplify the appropriate product using common primers to 'built-in' universal PCR sites in the ASO and LSO sequences. As in MIP, the resultant products are hybridized and read out on an array of universal-capture probes (complementary to IllumiCodes).

c | Reduced-complexity PCR representation using restriction enzyme (RE) digestion of genomic DNA (gDNA), common primer adaptor ligation and single-primer PCR. The single-primer PCR reaction effectively selects for restriction digestion products of 200–2,000 nucleotides. The reduced-complexity representation is read out on an array of locus-specific probes. The decrease in complexity improves the signal-to-noise ratio by increasing the partial concentration of any given locus and decreasing cross-hybridization.

d | Whole-genome genotyping on bead arrays. gDNA is whole-genome amplified (WGA), fragmented, denatured and hybridized to an array of locus-specific capture probes (shown is an allele-specific primer extension assay using two bead types, A and B, per locus). SNPs are scored directly on the array surface by primer extension. The separation of the capture step from the SNP-scoring step allows efficient target capture and facilitates good discrimination between alleles. After extension, the array is stained and read out using standard immunohistochemical detection methods.

SNP genotyping illumina

The third approach to SNP genotyping is a high-throughput, low-multiplicity assay (1536-plex) that is more usually used for follow-on or focused custom genotyping. This approach adopts Illumina's GoldenGate® assay (figure 4), a modified form of which is used for the DASL assay we have seen already. Unlike DASL, the assay is an allele-specific, PCR-based amplification and ligation assay that relies upon a pool of locus-specific (LSO) and allele-specific (ASO) primers, each containing one of three universal tag sequences (P1, P2 and P3) and an array-specific address sequence (which forms the duplex with the oligo attached to the bead, specifying the location within the array). It is the combination of these that allows extension, ligation and PCR-based amplification, conferring detection of the specific alleles. Because of this design, GoldenGate® genotyping is a two-colour system.

CNV genotyping illumina

Fan et al. Nature Reviews Genetics 7, 632–644 (August 2006) | doi:10.1038/nrg1901

SNP/CNV genotyping

For genome wide SNP/CNV interrogation12 Illumina

Serial analysis of gene expression (SAGE)

Fan et al. Nature Reviews Genetics 7, 632–644 (August 2006) | doi:10.1038/nrg1901

Summary
• The new generation of high-throughput sequencing enables DNA sequences to be determined at the whole-genome level, with single-base-pair resolution.
• From each sample an adaptor-ligated library is prepared that contains all DNA or RNA (cDNA) fragments present in the sample.
• All platforms are based on adaptor ligation and amplification, but they take different approaches to sequencing:
- Pyrosequencing (Roche-Nimblegen)
- Sequencing by synthesis (Illumina-Solexa)
- Sequencing by ligation (ABI)
• Methods that do not require amplification before sequencing are also being developed.
• The applications are the same as for classical microarrays (expression profiling or SAGE, SNP and CNV genotyping, chromatin immunoprecipitation, chromatin methylation, etc.).
• The advantage over classical microarrays lies in the simple sample preparation and the ability to process a large number of samples in a short time.
• One person can manage the processing of a large number of samples on one or more instruments.

Design of biological experiments and standardization
- The role of bioinformatics in designing and tracking experiments
- Standards for performing biological experiments with microarrays
- Normalization
• Biological replicates
• Technical replicates
• Designing biological experiments on concrete examples

Bioinformatics

www.gwumc.edu

http://bioinformatics.ubc.ca/about/what_is_bioinformatics/images/computer.gif

Bioinformatics Wikipedia

Making sense of the huge amounts of DNA data produced by gene sequencing projects. Bioinformatics and computational biology involve the use of techniques from applied mathematics, informatics, statistics, and computer science to solve biological problems. Research in computational biology often overlaps with systems biology. Major research efforts in the field include sequence alignment, gene finding, genome assembly, protein structure alignment, protein structure prediction, prediction of gene expression and protein-protein interactions, and the modeling of evolution. The terms bioinformatics and computational biology are often used interchangeably, although the former typically focuses on algorithm development and specific computational methods, while the latter focuses more on hypothesis testing and discovery in the biological domain.

Bioinformatics
Research in computational biology is more hypothesis-driven; research in bioinformatics is more technique-driven. A common thread in projects in bioinformatics and computational biology is the use of mathematical tools to extract useful information from noisy data produced by high-throughput biological techniques. A representative problem in bioinformatics is the assembly of high-quality DNA sequences from fragmentary "shotgun" DNA sequencing. In computational biology, a representative problem might be statistical testing of a hypothesis of common gene regulation using data from mRNA microarrays or mass spectrometry.

Microarrays and bioinformatics: standardization
The lack of standardization in arrays presents an interoperability problem in bioinformatics, which hinders the exchange of array data. Various projects are attempting to facilitate the exchange and analysis of data produced with non-proprietary chips. The XML-based "Minimum Information About a Microarray Experiment" (MIAME) standard for describing a microarray experiment is being adopted by many journals as a requirement for the submission of papers incorporating microarray results.

http://www.mged.org/Workgroups/MIAME/miame.html

MIAME – I.
Experiment design:
- The goal of the experiment, one line maximum (e.g., the title from the related publication).
- A brief description of the experiment (e.g., the abstract from the related publication).
- Keywords, for example, time course, cell type comparison, array CGH (the use of MGED ontology terms is recommended).
- Experimental factors: the parameters or conditions tested, such as time, dose, or genetic variation (the use of MGED ontology terms is recommended).
- Experimental design: relationships between samples, treatments, extracts, labeling, and arrays (e.g., a diagram or table).
- Quality control steps taken (e.g., replicates or dye swaps).
- Links to the publication, any supplemental websites or database accession numbers.

MIAME – II.

Samples used, extract preparation and labeling:
- The origin of each biological sample (e.g., name of the organism, the provider of the sample) and its characteristics (e.g., gender, age, developmental stage, strain, or disease state).
- Manipulation of biological samples and protocols used (e.g., growth conditions, treatments, separation techniques).
- Experimental factor value for each experimental factor, for each sample (e.g., 'time = 30 min' for a sample in a time course experiment).
- Technical protocols for preparing the hybridization extract (e.g., the RNA or DNA extraction and purification protocol), and labeling.
- External controls (spikes), if used.

MIAME – III.
Hybridization procedures and parameters:
- The protocol and conditions used for hybridization, blocking and washing, including any post-processing steps such as staining.
Measurement data and specifications:
- The raw data, i.e. scanner or imager and feature extraction output (providing the images is optional). The data should be related to the respective array designs (typically each row of the imager output should be related to a feature on the array; see Array Designs).
- The normalized and summarized data, i.e., the set of quantifications from several arrays upon which the authors base their conclusions (for gene expression experiments also known as the gene expression data matrix, which may consist of averaged normalized log ratios). The data should be related to the respective array designs (typically each row of the summarized data will be related to one biological annotation, such as a gene name).
- Data extraction and processing protocols.
- Image scanning hardware and software, and processing procedures and parameters.
- Normalization, transformation and data selection procedures and parameters.

Statistical analysis The analysis of DNA microarrays poses a large number of statistical problems, including the normalisation of the data. From a hypothesis-testing perspective, the large number of genes present on a single array means that the experimenter must take into account a multiple testing problem: even if each gene is extremely unlikely to randomly yield a result of interest, the combination of all the genes is likely to show at least one or a few occurrences of this result which are false positives.
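The size of the multiple testing problem is easy to quantify. The sketch below assumes that the per-gene tests are independent, which real expression data only approximate, and shows the expected number of false positives, the chance of at least one, and the Bonferroni-adjusted per-test threshold. The gene count of 10,000 is a hypothetical round number.

```python
def expected_false_positives(alpha: float, n_tests: int) -> float:
    """Mean number of false positives when no gene is truly changed."""
    return alpha * n_tests


def familywise_error_rate(alpha: float, n_tests: int) -> float:
    """Probability of at least one false positive among independent tests."""
    return 1.0 - (1.0 - alpha) ** n_tests


def bonferroni_alpha(alpha: float, n_tests: int) -> float:
    """Per-test threshold that keeps the familywise error rate at or below alpha."""
    return alpha / n_tests


n_genes = 10_000  # hypothetical array size
print(expected_false_positives(0.05, n_genes))         # about 500
print(familywise_error_rate(0.05, 20))                 # about 0.64 for only 20 tests
print(bonferroni_alpha(0.05, n_genes))                 # about 5e-06
```

Even a modest number of tests makes a false positive likely, which is why array analyses either tighten the per-gene threshold or control a different error rate.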

Gene Ontology Viewer for Microarray Data Interpretation

swift.cmbi.kun.nl/.../ report/materials/

Definitions of gene ontology:
- A controlled vocabulary used to describe the biology of a gene product in any organism. There are 3 independent sets of vocabularies, or ontologies, that describe the molecular function of a gene product, the biological process in which the gene product participates, and the cellular component where the gene product can be found. www.madison.k12.wi.us/west/science/biotech/vocabulary.htm
- The Gene Ontology, or GO, is a trio of controlled vocabularies that are being developed to aid the description of the molecular functions of gene products, their placement in and as cellular components, and their participation in biological processes. Terms in each of the vocabularies are related to one another within a vocabulary in a polyhierarchical (or directed acyclic graph) manner; terms are mutually exclusive across the three vocabularies. ... en.wikipedia.org/wiki/Gene_Ontology

Gene Ontology

The adoption of common standards and ontologies for the management and sharing of microarray and/or mass spectrometry data is essential. The Global Open Biological Ontologies GOBO effort, which has grown from work by the Gene Ontology Consortium, is seeking to collect ontologies for the domains of genomics and proteomics. Together with Spotfire the ErasmusMC bioinformatics group works on the improvement and further development of a GO tool that runs in the portal environment of the Spotfire decision site product for functional genomics and proteomics applications. The wealth of biological data that will be generated using high-throughput technologies from different modalities in the next decade has yet to be realized, as has the enormous potential for discoveries.

www.erasmusmc.nl/.../ research/gocp.shtml

http://cardioserve.nantes.inserm.fr/ptf-puce/images/camembert_go.gif

Steps in microarray technology

irfgc.irri.org/cropbioportal/index.php?option...

Experimental design A proper experimental design is crucial for obtaining useful conclusions from a project. The choice of design ideally includes an assessment of the biological variation, the technical variation, the cost and duration of the experiment, and the availability of biological material. The experimental design can also depend on the methods that will be used to analyse the data afterwards. In certain cases, the parameters needed to find the optimal design must be obtained by a pilot experiment. A related problem is the comparison of different competing experimental methods or devices. Here, a proper test design is crucial as well to be able to make a firm conclusion in favor of one or the other method.

Dere E. et al., BMC Genomics 2006, 7:80

lunabiosciences.com/experimentaldesign.html

Experimental design

Different experimental designs with two-colour DNA microarrays involving two treatments (A and B). (a) and (b) provide for two and four technical replicates, respectively, with dye swap. (c) and (d) are based on two independent biological replicates of the treatments (indicated by the index on the treatment label). (c) provides for a biological replicate of design (a). (d) shows a simple loop design.

Experimental design with two-colour DNA microarrays in which the samples (A,...Z) are compared via a common reference

Juvan P., Rozman, D. Informatica Medica Slovenica, 11: 2-15 (2006)

Technical Issues Involved in Obtaining Reliable Data from Microarray Experiments – Standardization and beyond

Primary data analysis – experimental example

Figure 1 – Drug metabolism and cholesterol homeostasis Principal groups of genes involved in cholesterol homeostasis and drug metabolism present on the Sterolgene v0 cDNA microarray prototype. T. Rezen et al., BMC Genomics 2008, 9:76

Primary data analysis – experimental example Images of the Sterolgene v0 microarrays were analyzed by Array-Pro Analyzer 4.5 (Media Cybernetics, Bethesda, MD, USA). The median feature and local background intensities were extracted together with the estimates of their standard deviation. Only features with foreground to background ratio higher than 1.5 and coefficient of variation (CV, ratio between standard deviation of the background and the median feature intensity) lower than 0.5 in both channels were used for further analysis. Log2 ratios were normalized using LOWESS fit to spike in control RNAs according to their average intensity. Two types of spike in controls were used: custom-made (Firefly luciferase) and commercial ArrayControl Spikes (Ambion, Austin, TX, USA). In phenobarbital and cholesterol-feeding experiments data were additionally standardized (median-centered and scaled by median absolute deviation) in order to reduce inter-array variability. All data analysis were done in Orange software [37] Images of Agilent microarrays were analyzed using Array-Pro Analyzer 4.5 (Media Cybernetics, Bethesda, MD, USA). Features, with CV>0.39, were filtered out and data were normalized using LOWESS fit to all genes according to their average intensity. Filtration and normalization was done in BASE softwar]. Affymetrix data were normalized by the Robust Multichip Average (RMA) algorithm. After transformation to non-logarithmic data, the expression estimates were scaled to the average expression levels in the control group analyzed using GeneSpring software (Silicon Genetics, Redwood, USA). Classification of the differentially expressed genes was done in Orange using single-factor ANOVA or two tailed Student’s t-test. For Sterolgene microarrays probability of type I error αS=0.05 or αS=0.1 was used. 
For Agilent microarrays, the complementary probability of type I error was calculated according to the Bonferroni correction for multiple testing (αA=αS*nS/nA), but final comparisons were made using the more relaxed criteria αA=0.001 and αA=0.01. For Affymetrix microarrays, the complementary probability αAf=0.00043 was calculated, but a more relaxed criterion, αAf=0.001, was also used. Additional data analyses were done using only the genes common to the platforms, which were matched by UniGene, RefSeq or gene symbol. On this gene list, another classification of differentially expressed genes was done in Orange using ANOVA for the Affymetrix and Agilent platforms. The probability of type I error was selected as in the complementary Sterolgene experiment (α=0.1 in Affymetrix analyses and α=0.05 in Agilent analyses). Pearson's product-moment correlation coefficient and a scatterplot of log2 ratios from common genes were calculated in SPSS 14.0. All data have been submitted to GEO (Gene Expression Omnibus) under accession codes GSE6271 (Affymetrix data), GSE6317 (Agilent data), GSE6447 (Sterolgene phenobarbital and high-cholesterol diet data), and GSE6423 (Sterolgene fasting and inflammation data). T. Rezen et al., BMC Genomics 2008, 9:76
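The relation αA=αS*nS/nA can be illustrated with a short sketch. The text does not define nS and nA; the example assumes they are the numbers of genes tested on the Sterolgene and the larger platform, respectively. The counts and p-values below are hypothetical and are not taken from the study, and the function names are chosen only for this illustration.

```python
def scaled_alpha(alpha_ref, n_ref, n_target):
    """Scale a reference threshold by the ratio of tested genes.

    Implements alpha_target = alpha_ref * n_ref / n_target, the relation
    quoted in the text (n_ref and n_target are assumed to be gene counts).
    """
    if n_ref <= 0 or n_target <= 0:
        raise ValueError("gene counts must be positive")
    return alpha_ref * n_ref / n_target


def significant(p_values, alpha):
    """Return the sorted gene identifiers whose p-value is below alpha."""
    return sorted(gene for gene, p in p_values.items() if p < alpha)


# Hypothetical numbers: 300 genes on the small array, 10,000 on the large one.
alpha_large = scaled_alpha(alpha_ref=0.05, n_ref=300, n_target=10_000)

# Hypothetical p-values from a per-gene test on the large array.
p_values = {"geneA": 0.0004, "geneB": 0.02, "geneC": 0.0014}
de_genes = significant(p_values, alpha_large)
```

Because the larger array tests many more genes, the scaled threshold is much stricter than the reference one, which is presumably why the more relaxed criteria were also used for the final comparisons.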

Problems in comparing different data formats

[Figure 3, panels A–D: Venn diagrams of differentially expressed genes for Sterolgene vs. Agilent and Sterolgene vs. Affymetrix.]

Figure 3 - Agreement between Sterolgene v0, Agilent and Affymetrix platforms Venn diagrams illustrating agreement between differentially expressed (DE) gene lists from Sterolgene v0 cDNA microarrays, Agilent 10K cDNA microarrays and the Affymetrix MOE430A GeneChip. DE genes were determined using single-factor ANOVA, with a probability of type I error α=0.05 for the Sterolgene and Agilent platform comparisons and α=0.1 for the Sterolgene and Affymetrix platform comparisons. Only genes present on both microarrays were used in these analyses and are shown on the diagrams. A. In the starvation experiment, only one gene was common to both platforms. B. In the TNF-α experiment, only one gene was common to both platforms. C. In the phenobarbital experiment, four genes were common to both platforms. D. In the cholesterol diet experiment, six genes were common to both platforms.
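The comparison behind such a Venn diagram can be sketched with plain set operations: restrict both DE lists to the genes present on both arrays, then split them into platform-specific and shared parts. The function name and the gene identifiers below are hypothetical and only illustrate the bookkeeping; they are not data from the study.

```python
def de_agreement(de_a, de_b, genes_on_a, genes_on_b):
    """Compare two DE gene lists on the genes shared by both platforms.

    Returns the three regions of a two-set Venn diagram.
    """
    common = set(genes_on_a) & set(genes_on_b)  # genes present on both arrays
    a = set(de_a) & common                      # DE on platform A, comparable
    b = set(de_b) & common                      # DE on platform B, comparable
    return {"only_a": a - b, "both": a & b, "only_b": b - a}


# Hypothetical identifiers, matched beforehand (e.g. by gene symbol).
on_small_array = {"g1", "g2", "g3", "g4", "g5"}
on_large_array = {"g2", "g3", "g4", "g5", "g6"}
de_small = {"g1", "g2", "g3"}
de_large = {"g3", "g4", "g6"}

regions = de_agreement(de_small, de_large, on_small_array, on_large_array)
counts = {name: len(genes) for name, genes in regions.items()}
```

Genes that are differentially expressed on one platform but absent from the other (here `g1` and `g6`) drop out before counting, which matches the caption's restriction to genes present on both microarrays.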

Changes in cholesterol homeostasis and drug metabolism caused by different factors in mouse liver The Sterolgene v0 cDNA microarray successfully detected all changes in cholesterol homeostasis and drug metabolism caused by high-cholesterol diet, fasting, TNF-α, and phenobarbital (PB) treatment (solid arrows). The Agilent 10K cDNA microarray (G4104A) detected none of the changes caused by the inflammatory cytokine TNF-α and fasting (crosses). The Affymetrix MOE430A GeneChip detected only the down-regulation of cholesterol biosynthesis by the high-cholesterol diet and the induction of the Cyp2b family by the phenobarbital treatment (dashed arrows), but not the up-regulation of Cyp3a11 by the high-cholesterol diet or the up-regulation of the Cyp3a family and Alas1 by the phenobarbital treatment. For all microarrays, the same statistical method was used to determine differentially expressed genes (single-factor ANOVA). T. Rezen et al., BMC Genomics 2008, 9:76

Summary -Planning biological experiments requires collaboration between experimentalists and informaticians from the very beginning. -The design of an experiment requires defining the biological question, choosing the analysis platform, determining the number of biological and technical replicates, and planning the series of experiments (hybridizations or sequencing runs). -The technical execution of the experiment is followed by statistical and informatic processing and by data mining. Statistical and informatic processing comprises extraction of signal intensities, data normalization, and various statistical tests that yield a list of differentially expressed genes or DNA sequences. Secondary informatic analysis and data mining comprise clustering and pattern recognition, studies of gene lists with gene ontologies, searching for regulatory patterns, and the design of validation experiments.