NEXT-GENERATION SEQUENCING AND BIOINFORMATICS

Author: Patrick Rice

Moore's law:

the number of transistors in a dense integrated circuit doubles every two years

Moore's law calculates and predicts the pace of improvement of one of the fastest improving technologies, computers

In the last 15 years the pace of improvement of DNA sequencing technologies has been much faster than that of computers
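As a point of reference for that comparison, the improvement predicted by Moore's law over a given time span can be worked out directly. This is a minimal sketch: the function name and the 15-year span (taken from the sentence above) are illustrative choices, and only the two-year doubling period comes from the slides.

```python
def moore_growth_factor(years: float, doubling_period_years: float = 2.0) -> float:
    """Multiplicative improvement after `years` if capacity doubles every period."""
    return 2.0 ** (years / doubling_period_years)


# Over 15 years, a doubling every two years gives 2 ** 7.5, roughly 181-fold.
factor_15_years = moore_growth_factor(15)
print(f"Moore's law over 15 years: about {factor_15_years:.0f}x")
```

Any technology that improved by clearly more than this factor over the same 15 years has outpaced Moore's law.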

Frederick Sanger
Nobel Prize in Chemistry in 1958 for sequencing insulin (and proteins in general)
Nobel Prize in Chemistry in 1980 for sequencing nucleic acids
One of the very few people to win two Nobel Prizes in science

SANGER SEQUENCING

SANGER SEQUENCING
The most modern Sanger sequencers allow parallelization of up to 96 samples at once
Before sequencing, a step of PCR and purification is necessary; if you do not know the sequence in advance, you also need to perform a cloning step
OUTPUT: 1,000 bases per run (96,000 if you parallelize)

NEXT-GEN SEQUENCING TECHNOLOGIES
• Roche/454 FLX
• Applied Biosystems SOLiD System
• Illumina/Solexa Genome Analyzer
• Ion Torrent

NEXT-GENERATION DNA SEQUENCING: MAIN CHARACTERISTICS
EXTREME MINIATURIZATION
Reactions are carried out in volumes of microliters thanks to specific technological advances
This in turn allows

MASSIVE PARALLELIZATION
Thousands to millions of reactions are performed in parallel, reducing costs and increasing the output volume by orders of magnitude

NEXT-GEN SEQUENCING TECHNOLOGIES
Some specific aspects of each method are proprietary and therefore not disclosed
In 1977 Sanger made his method public (and it won him his second Nobel Prize); today every new method is a commercial product

SAMPLE PREPARATION
Nebulization of genomic DNA into fragments of 400-1000 base pairs
Ligation of the fragments to two adapters (type A and type B)
Selection of single-stranded fragments carrying both adapters

EMULSION PCR
Fragments are mixed with agarose beads, 28 microns in diameter, bearing oligos complementary to the adapters
Each bead-fragment complex is isolated in an individual water-in-oil micelle
The emulsion PCR reaction yields about 1 million copies of the amplified fragment on the surface of each bead

SAMPLE LOAD
Each bead is placed in a well of a picotiter slide (a 7x7 cm fiber-optic slide) carrying several million wells, each 44 microns in diameter
Multiple enzymes and reagents are added in the form of even smaller beads

PYROSEQUENCING REACTION
A single nucleotide species is added in each cycle

Nucleotide incorporation → light generation
(Rothberg, Nat. Biotechnol. 2008)

ROCHE/454 FLX Pyrosequencer
1 EMULSION PCR takes the place of thousands of cloning experiments
1 SEQUENCING RUN takes the place of thousands of SANGER sequencing runs

EXTREME MINIATURIZATION MASSIVE PARALLELIZATION

ROCHE/454 GS FLX+
BASE CALLING ACCURACY: 99.9% or more (lower in the final part of the reads)

OUTPUT:
Reads up to 1,000 nucleotides long
About 500,000-1,000,000 reads
Total output of about 700 megabases per run (8 hours)
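The claim that one 454 run replaces thousands of Sanger runs can be checked with the figures quoted in these slides. The sketch below uses only those numbers (96 x 1,000 bases for a parallelized Sanger run, 700 megabases for a GS FLX+ run); the function and constant names are illustrative.

```python
SANGER_BASES_PER_RUN = 96 * 1_000      # 96 parallel samples, 1,000 bases each
FLX_BASES_PER_RUN = 700_000_000        # 700 megabases per run


def equivalent_sanger_runs(total_bases: int,
                           sanger_bases_per_run: int = SANGER_BASES_PER_RUN) -> float:
    """Number of parallelized Sanger runs needed to produce `total_bases`."""
    return total_bases / sanger_bases_per_run


def implied_mean_read_length(total_bases: int, n_reads: int) -> float:
    """Average read length implied by a total output and a read count."""
    return total_bases / n_reads


print(f"Sanger runs per 454 run: {equivalent_sanger_runs(FLX_BASES_PER_RUN):.0f}")
print(f"Mean read length at 1,000,000 reads: "
      f"{implied_mean_read_length(FLX_BASES_PER_RUN, 1_000_000):.0f} nt")
```

With these inputs one 454 run corresponds to roughly 7,300 parallelized Sanger runs, and 700 megabases spread over one million reads implies a mean read length of 700 nucleotides, below the 1,000-nucleotide maximum.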

454 MAIN ISSUE
Homopolymers: stretches of a single nucleotide species
An intrinsic problem of the technology
Multiple identical nucleotides are incorporated in a single cycle
They generate more light, but telling the run lengths apart becomes increasingly difficult
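A short sketch can make the homopolymer problem concrete. It locates homopolymer runs in a sequence and, under the simplifying assumption (not stated in the slides) that light intensity is exactly proportional to the number of nucleotides incorporated, shows how the relative difference between a run of n and a run of n+1 shrinks as n grows. The function names and the example sequence are illustrative.

```python
from itertools import groupby


def homopolymer_runs(sequence: str, min_length: int = 3):
    """Return (start, base, length) for each run of identical bases >= min_length."""
    runs = []
    position = 0
    for base, group in groupby(sequence.upper()):
        length = sum(1 for _ in group)
        if length >= min_length:
            runs.append((position, base, length))
        position += length
    return runs


def relative_signal_step(run_length: int) -> float:
    """Relative signal increase from run_length to run_length + 1 bases.

    Assumes signal is proportional to the number of bases incorporated.
    """
    return 1.0 / run_length


print(homopolymer_runs("ACGTTTTTGCAAAC"))   # [(3, 'T', 5), (10, 'A', 3)]
for n in (1, 2, 4, 8):
    print(f"{n} -> {n + 1} bases: signal rises by {relative_signal_step(n):.0%}")
```

Going from 1 to 2 bases doubles the signal, while going from 8 to 9 raises it by only 12.5%, which is why long homopolymers are the reads most likely to carry insertion or deletion errors.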

454 MAIN ISSUE
This problem can affect the downstream bioinformatic analysis
KNOW YOUR MACHINE!

ILLUMINA/SOLEXA Genome Analyzer
Currently the market leader
Very low cost per base, proven technology
Sequencing by synthesis

ILLUMINA/SOLEXA Genome Analyzer
1. DNA fragmentation and ligation to 2 types of adapters
2. Templates are bound to the surface of a flow cell

3. "Bridge" amplification using primers that are complementary to the adapters and bound to the substrate at high density → production "in situ" of clusters of up to 1,000,000 template copies, which generate a signal strong enough to be detected

ILLUMINA/SOLEXA Genome Analyzer
4. Addition of fluorescent nucleotides blocked at the 3'-OH
5. Fluorescence detection
6. Removal of the fluorophore and of the 3' block
7. Repeat steps 4-6

HISEQ 4000
- the newest Solexa/Illumina instrument
- total output: 125-1500 Gb
- read length: 150 bp, paired-end
- cost per library construction: 500 euros
- sequencing cost per lane: 3000 euros

ILLUMINA/SOLEXA Genome Analyzer
• Four different fluorophores → no issues with homopolymers
• Shorter reads
Blocking the incorporation of multiple nucleotides is one of the foundations of the Illumina method
In each cycle the blocking is imperfect: a small percentage of the copies in a cluster incorporate two nucleotides, producing noise instead of good signal
When this percentage reaches a threshold, the signal is lost
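The link between imperfect blocking and short reads can be sketched with a simple model. It assumes that in every cycle a fixed fraction of the copies in a cluster incorporates an extra nucleotide and that those copies stay out of phase from then on; the failure rate (0.5% per cycle) and the loss threshold (half the cluster out of phase) are hypothetical values chosen for illustration, not Illumina specifications.

```python
import math


def in_phase_fraction(per_cycle_failure: float, cycles: int) -> float:
    """Fraction of cluster copies still in phase after `cycles` cycles."""
    return (1.0 - per_cycle_failure) ** cycles


def cycles_until_signal_lost(per_cycle_failure: float, min_in_phase: float) -> int:
    """First cycle at which the in-phase fraction drops below `min_in_phase`."""
    return math.ceil(math.log(min_in_phase) / math.log(1.0 - per_cycle_failure))


# Hypothetical values: 0.5% of copies slip each cycle, signal lost below 50% in phase.
for cycle in (50, 100, 150):
    print(f"cycle {cycle}: {in_phase_fraction(0.005, cycle):.1%} in phase")
print("usable cycles:", cycles_until_signal_lost(0.005, 0.5))
```

Under these assumptions the cluster falls below the threshold after 139 cycles. Because the in-phase fraction decays geometrically, even a small per-cycle failure rate puts a hard ceiling on read length.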

ION TORRENT

The smallest sequencer, fast and economical
An instrument: $50,000
A run: $1,000
Output: up to 80 Mb, with reads up to 400 bp long
Very quick: a run lasts 3 hours

ION TORRENT

In many respects similar to 454
DNA is amplified on microbeads, which are then loaded into wells
The beads are then subjected to cycles of incorporation of a single type of nucleotide

ION TORRENT
Does not detect light, but the H+ ions released during sequencing
Like a camera chip that detects protons instead of photons

The sequencing is performed on a semiconductor chip, which detects the release of protons
Potential for rapid technological development, taking advantage of the electronics industry

ION TORRENT
All nucleotides release H+, so cycles of incorporation of individual nucleotide types (A, C, G, T) are required

Same issue as 454: homopolymers
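The one-nucleotide-per-cycle scheme shared by 454 and Ion Torrent can be sketched as a conversion from a sequence to a series of flow signals. For simplicity the input is the strand being synthesized rather than its complementary template, the signal is an idealized integer count, and the flow order "TACG" is an assumed default for illustration rather than a documented instrument setting.

```python
def simulate_flows(synthesized: str, flow_order: str = "TACG", max_flows: int = 400):
    """Return one (nucleotide, bases_incorporated) pair per flow.

    Each flow offers a single nucleotide species; every consecutive matching
    base is incorporated in that one flow, so a homopolymer gives a count > 1.
    """
    signals = []
    position = 0
    flow_index = 0
    while position < len(synthesized) and flow_index < max_flows:
        nucleotide = flow_order[flow_index % len(flow_order)]
        count = 0
        while position < len(synthesized) and synthesized[position] == nucleotide:
            count += 1
            position += 1
        signals.append((nucleotide, count))
        flow_index += 1
    return signals


print(simulate_flows("TTAG"))   # [('T', 2), ('A', 1), ('C', 0), ('G', 1)]
```

The run "TT" is read as a single signal of height 2, not as two separate events. The instrument therefore has to infer the run length from the signal intensity alone, which is where homopolymer errors arise.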

THIRD GENERATION SEQUENCING TECHNOLOGIES
• Pacific Biosciences
• Oxford Nanopore

THIRD GENERATION SEQUENCING TECHNOLOGIES
REAL TIME SEQUENCING
The idea is to bypass the amplification step
Advantage:

This makes it possible to avoid DNA fragmentation and to obtain longer reads

Pacific Biosciences (PACBIO)
Launched in 2009 (third-generation?)
Real-time sequencing technology
The idea is to directly observe DNA polymerization while it is performed by DNA polymerase
Single Molecule Real Time (SMRT) sequencing
Recently the third machine was released: PACBIO SEQUEL

Cost: around 800,000 dollars

Zero-mode waveguide (ZMW)
Highly sensitive detection system
Nanophotonic structure with wells 50 nm in diameter
A laser illuminates from below, but its wavelength is too large for the light to propagate through the well
The same principle as microwave oven doors

Zero-mode waveguide (ZMW)
The light penetrates only 20-30 nm
This makes it possible to detect only what happens at the bottom of the well, reducing background noise and giving high sensitivity and temporal resolution

The latest PacBio instrument has around 1,000,000 wells

Polymerase
phi-29 phage polymerase
Highly processive, up to 70,000 nt
High fidelity, up to 100 times higher than that of Taq polymerase
Modified to be slower
The polymerase is attached to the bottom of the wells

Only 1/3 of the wells have a single polymerase, and thus can perform the sequencing
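One plausible explanation for the one-third figure is random loading. If polymerases land in wells independently, the number per well follows a Poisson distribution; this model is an assumption made for illustration and is not stated in the slides. Under it, the fraction of wells holding exactly one polymerase can never exceed 1/e, about 37%.

```python
import math


def poisson_probability(k: int, mean: float) -> float:
    """Probability of exactly k polymerases in a well under Poisson loading."""
    return math.exp(-mean) * mean ** k / math.factorial(k)


# Fraction of wells with exactly one polymerase for different average loadings.
for mean_per_well in (0.5, 1.0, 2.0):
    single = poisson_probability(1, mean_per_well)
    print(f"mean {mean_per_well} per well -> {single:.1%} single-polymerase wells")
```

The single-occupancy fraction peaks at an average of one polymerase per well. Loading less leaves more wells empty, and loading more fills wells with two or more polymerases, so roughly one third usable wells is close to the best this model allows.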

PacBio sequencing
Addition of single-stranded DNA, which binds to the polymerase
Addition of the 4 nucleotide species, tagged with 4 different fluorophores
The nucleotide is incorporated and the fluorophore is cleaved off

The free fluorophore generates a flash of light, which is detected by a fluorescence microscope

Characteristics
The sequencing is continuous and no washing is necessary → much faster
PacBio yields reads several thousand nucleotides long (up to 20,000)

Third generation sequencing
A new revolution → especially for bioinformatics

PacBio ISSUES
Cost: 10x more expensive than Illumina
Read quality: single-molecule sequencing means every mistake is recorded and cannot be averaged out by thousands of parallel reactions
However, these errors are random and can be overcome
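Why random errors can be overcome is easiest to see with a majority vote over several independent reads of the same position. The sketch below computes the probability that more than half of the reads are wrong at a position; the 15% per-read error rate is a hypothetical value, and the model is deliberately pessimistic because it treats all erroneous reads as agreeing on the same wrong base.

```python
from math import comb


def majority_error_probability(per_read_error: float, n_reads: int) -> float:
    """Probability that more than half of n independent reads are wrong at a site."""
    first_majority = n_reads // 2 + 1
    return sum(
        comb(n_reads, k) * per_read_error ** k * (1 - per_read_error) ** (n_reads - k)
        for k in range(first_majority, n_reads + 1)
    )


# Hypothetical per-read error rate of 15%, odd coverage to avoid ties.
for coverage in (1, 3, 9, 15):
    error = majority_error_probability(0.15, coverage)
    print(f"coverage {coverage:>2}: consensus error {error:.2e}")
```

The consensus error falls quickly as coverage grows, but only because the errors are random. A systematic error, such as a homopolymer miscalled the same way in every read, would survive the vote.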

A revolution from the point of view of downstream analysis
http://flxlexblog.wordpress.com/2013/10/01/developments-in-next-generation-sequencing-october-2013-edition/

NEXT-GEN IS TRENDY

It is the new thing
It is powerful and cheap
It has uses in any biological system (from viruses to human genetics)
It is useful for answering a number of questions (de novo sequencing, mapping, transcriptomics)

NEXT-GEN IS TRENDY
So everyone wants to use it: you just extract your DNA/RNA and send it to a sequencing company
And then, who will do the analysis?

NEXT-GEN WORKFLOW
1. What is the goal?
2. Choose the right experimental setup
3. Choose the right sequencing technology
4. Data analysis

What is your goal?
What exactly is the problem you want to address?
Evaluate approaches used in the past
Consider new approaches
Consider future problems

NO WAY BACK!

CHOOSE THE RIGHT EXPERIMENTAL SETUP

Nucleic acid quantity
Nucleic acid quality
Technical replicates
Biological replicates
Negative and/or positive controls

CHOOSE THE RIGHT TECHNOLOGY
De novo sequencing: 454, PacBio
Draft sequencing: Illumina, Ion Torrent
Microbial communities: 454, Illumina
Transcriptomics: Illumina, Ion Torrent

DATA ANALYSIS
A basic next-gen experiment generates gigabytes of information
This is HIGH-THROUGHPUT!

HIGH-THROUGHPUT TECHNOLOGIES
Technologies that generate too much data to be handled without computer assistance

EXAMPLES
Shotgun proteomics
Network analysis

HIGH-THROUGHPUT TECHNOLOGIES
Next-generation sequencing

BIOINFORMATICS
Bioinformatics is the development and use of computer methods for the analysis of biological data

Bioinformatics becomes absolutely necessary as the data load increases

BIOINFORMATICS

Most bioinformatics is run on Linux

SO WHAT IS UNIX?
Unix is a family of multitasking, multiuser computer operating systems that derive from the original AT&T Unix, developed in the 1970s at the Bell Labs research center by Ken Thompson, Dennis Ritchie, and others.

UNIX Advantages
Full multitasking with protected memory
Very efficient virtual memory
Access controls and security
A rich set of small commands that do specific tasks well
Ability to combine commands to accomplish complicated tasks
A powerfully unified file system
Available on a wide variety of machines
Optimized for program development

UNIX Disadvantages
The command-line interface is user hostile
Commands often have cryptic names and give very little feedback about what they are doing
To use Unix well, you need to understand some of its main design features
The richness of utilities (over 400 standard ones) often overwhelms novices
Documentation often feels underwhelming and short on examples
Expensive

UNIX → LINUX
Linux is a family of UNIX-like operating systems (OSs)
Each "member" of the family has different characteristics and comes with different software and graphical environments
Broadly, each distribution (a.k.a. distro) is "tuned" for a specific task, aimed at a specific kind of user, or designed for a specific kind of device
Most Unix advantages, plus it is FREE and user-friendly

Linux Distros
For beginners: Mint and Ubuntu, the #1 and #2 most popular distributions
For a specific task: e.g. BioLinux (bioinformatics), Scientific Linux (science in general) and Ubuntu Studio (multimedia)
For a specific platform: e.g. Mythbuntu (home theater PCs), Yellow Dog Linux (Apple machines), OpenWrt (routers)

LINUX FOR BIOINFORMATICS
Why Linux?
Free and runs on most hardware
Fully customizable
More efficient and stable

Why Linux for bioinformatics?
Supports multiple users in a controlled manner
Optimized for writing and executing scripts/commands
Features for handling massive amounts of files
Adopted by the scientific community

It requires more work than other operating systems

LINUX – OPEN SOURCE
Why Linux?
Free and open software

Open-source software (OSS) is computer software whose source code is made available under a license in which the copyright holder grants the rights to study, change, and distribute the software to anyone and for any purpose
Open-source software may be developed in a collaborative, public manner

LINUX
Why Linux?
Fully customizable
From the small details to the core functions

LINUX
Why Linux?
More efficient and stable
Linux servers are widely used → for example by Microsoft and Apple

As a bioinformatician, if you want to interact with your server quickly and well, you may find it easier if you use the same language

Is LINUX the only way to do bioinformatics?
ABSOLUTELY NOT

However, its characteristics make it optimal for most bioinformatic tasks
Supports multiple users in a controlled manner
Optimized for writing and executing scripts/commands
Features for handling massive amounts of files
Adopted by the scientific community

Many bioinformaticians use a Mac laptop to interact with a Linux server (Mac OS X is Unix-based)

Many Linux distros are as friendly as Windows

You get to browse your files visually
Internet browsers
Text processors
Skype
Even videogames
… and many things Windows does not give you

Give Ubuntu a try: https://www.ubuntu.com/download/desktop/try-ubuntu-before-you-install
