INTRODUCTION TO NEXT GENERATION SEQUENCING


Author: Lindsay Melton


BIOINFORMATICS SCHOOL: INTRODUCTION TO THE PROCESSING OF GENOMIC DATA OBTAINED BY HIGH-THROUGHPUT SEQUENCING, 14-18 JANUARY 2013, STATION BIOLOGIQUE, ROSCOFF


Claude Thermes, Genome Analysis, IMAGIF sequencing platform, Centre de Génétique Moléculaire, Centre de Recherche de Gif, 14/01/2013

LIBRARY PREPARATION

DNA-Seq Library

Genomic DNA
Cleavage (sonication)
Fragmented DNA
Adaptor ligation
PCR
PCR product

Comparison of two RNA-seq library protocols: SOLiD™ Whole Transcriptome Analysis Kit (RNase III fragmentation) versus Illumina's directional mRNA-Seq Library (zinc fragmentation)

SOLiD™ Whole Transcriptome Analysis Kit: RNase III fragmentation

RiboMinus RNA
RNase III fragmentation
Fragmented RNA (5' to 3')
Hybridization with adapters (NNNNNN overhangs), ligation
Reverse transcription
Size selection
PCR amplification

[Figure: read coverage along YBR078W (intron marked); tracks: sequencing on SOLiD, sequencing on Illumina]

Very heterogeneous pattern; not due to the sequencing technology but to library preparation: is RNase III fragmentation not so random?

Illumina directional mRNA-Seq Library: zinc fragmentation

Total RNA
Depletion of ribosomal RNA (ribo- RNA)
Zinc-fragmented RNA
Ligation
RT
PCR
ds PCR product

[Figure: read coverage along YBR078W (intron marked) for the Illumina directional mRNA-Seq library; tracks: zinc fragmentation, RNase III fragmentation, same number of reads]

[Figure: correlation between nucleotides as a function of the distance between nucleotides; curves: zinc fragmentation, RNase III]

M. Wery, M. Descrimes, C. Thermes, D. Gautheret & A. Morillon (submitted)
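One way to read the correlation-versus-distance comparison is as the correlation between per-base read coverage at position i and at position i + d. The slides do not say how the correlation was computed, so the sketch below assumes a plain Pearson correlation; the function names and the two toy coverage profiles are hypothetical and only illustrate the idea.

```python
import math


def pearson(x, y):
    """Pearson correlation of two equal-length sequences (nan if one is constant)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return float("nan")
    return sxy / math.sqrt(sxx * syy)


def coverage_autocorrelation(coverage, max_distance):
    """Correlation between coverage at positions i and i + d, for d = 1..max_distance."""
    return [
        pearson(coverage[:-d], coverage[d:])
        for d in range(1, max_distance + 1)
    ]


# Toy profiles (not real data): a smooth ramp and a profile that alternates
# between 0 and 10 reads at neighbouring positions.
smooth_profile = list(range(100))
alternating_profile = [0, 10] * 50

smooth_corr = coverage_autocorrelation(smooth_profile, 3)
alternating_corr = coverage_autocorrelation(alternating_profile, 3)
```

On these toy profiles the smooth ramp stays fully correlated at every distance, while the alternating profile flips between negative and positive correlation from one distance to the next. A heterogeneous coverage pattern would be expected to lose correlation faster with distance than a smooth one.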

SEQUENCING QUALITY

Sequencing quality checking
•  cluster density
•  focus
•  intensity
•  quality score

Quality score
•  Each base is assigned a quality score (Q-score) by a phred-like algorithm similar to that originally developed for Sanger sequencing
•  Phred was developed by Ewing et al. (Genome Research 1998) and was used for the Human Genome Project: « Although the basic ideas are simple, the full implementation is a complex, somewhat inelegant rule-based procedure, that has been arrived at empirically by progressively refining the algorithms on the basis of examining performance on particular data sets »
•  Phred:
-  calculates several parameters related to peak shape and peak resolution at each base
-  uses these parameters to look up a corresponding quality score in huge lookup tables generated from sequence traces where the correct sequence is known
-  uses different lookup tables for different sequencing chemistries and machines

Quality score
•  A quality score (Q-score) is a prediction of the probability of an incorrect base call
•  Given a base call X, the probability that X is wrong, P(~X), is expressed by Q(X):
   Q(X) = -10 log10(P(~X))
   where P(~X) is the estimated probability of the base call being wrong, estimated on the basis of the 4 intensities (corresponding to the dNTP dyes)
•  Q(X) = 30 indicates an error probability of 0.001: the base call is highly probable, with one intensity much higher than the 3 others
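The relation Q(X) = -10 log10(P(~X)) can be checked numerically. This is a minimal sketch of the conversion in both directions; the function names are illustrative and do not come from any sequencing software.

```python
import math


def phred_quality(error_probability):
    """Q = -10 * log10(P), where P is the probability that the base call is wrong."""
    return -10 * math.log10(error_probability)


def error_probability(quality):
    """Inverse relation: P = 10 ** (-Q / 10)."""
    return 10 ** (-quality / 10)


# Error probabilities for a few common Q-scores.
for q in (10, 20, 30, 40):
    print(f"Q{q}: error probability {error_probability(q):g}")
```

Each step of 10 in Q divides the error probability by 10, so Q20 corresponds to one expected error in 100 base calls and Q30 to one in 1000.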


Quality score
•  Q scores are written to FASTQ files in a compact encoded form that uses only one byte per quality value
•  Illumina 1.8: the quality score is represented by the ASCII character whose code equals the score + 33
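The "+ 33" encoding can be illustrated by decoding a FASTQ quality string back into numeric scores. This is a minimal sketch assuming the offset of 33 stated above; the function names and the example string are illustrative.

```python
PHRED_OFFSET = 33  # offset stated above for Illumina 1.8


def decode_quality(quality_string, offset=PHRED_OFFSET):
    """Turn a FASTQ quality string into a list of integer Q-scores."""
    return [ord(char) - offset for char in quality_string]


def encode_quality(scores, offset=PHRED_OFFSET):
    """Turn integer Q-scores back into a FASTQ quality string."""
    return "".join(chr(score + offset) for score in scores)


example = "I?5!"
scores = decode_quality(example)
print(scores)  # [40, 30, 20, 0]
```

Since ASCII 33 is "!", the lowest score (Q0) is written as "!", and every score maps to a printable character.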

Read Segment Quality Control
A number of factors can cause the quality of base calls to be low:
•  bad temperature control (leading to « chemistry problems »)
•  low intensity values
•  bad focus
•  phasing artifacts

PERFORMANCE PARAMETERS

HiSeq 2000 (Illumina) and Genome Analyzer IIx (Illumina)

1 to 1.5 x 10^9; 2 x 100 bp

Growth in the number of sequences deposited in public databases (EMBL)

Source: EMBL Statistics

Cost of 1 Mb of DNA sequencing

Sboner et al. Genome Biology 2011, 12:125

Cost of E. coli sequencing with 100x coverage: 50 €
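The amount of sequence behind a 100x figure can be estimated with simple arithmetic. The sketch below assumes an E. coli genome of about 4.6 Mb (a value not given in the slides) and 2 x 100 bp paired-end reads; the function name is illustrative.

```python
import math


def read_pairs_needed(genome_size_bp, coverage, read_length_bp):
    """Number of paired-end fragments needed to reach a target mean coverage."""
    bases_needed = genome_size_bp * coverage
    bases_per_pair = 2 * read_length_bp
    return math.ceil(bases_needed / bases_per_pair)


# Assumption: approximate size of the E. coli K-12 genome, not stated in the slides.
ECOLI_GENOME_BP = 4_600_000

pairs = read_pairs_needed(ECOLI_GENOME_BP, coverage=100, read_length_bp=100)
print(pairs)  # 2300000
```

Under these assumptions, 100x coverage of E. coli needs about 460 Mb of sequence, or roughly 2.3 million read pairs.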

[Table values, column headers not recoverable] 1 000; 200 000; 2 x 100 bp; $800 USD; $0.8 USD; $100 000 USD; $0.1 USD

[Table values, column headers not recoverable] 1 600; 200 000; 2 x 100 bp; $800 USD; $0.5 USD; $100 000 USD; $0.1 USD

COMPARISON MICROARRAY / NGS

Microarray transcriptome

NGS transcriptome (ILLUMINA)

Sboner et al. Genome Biology 2011, 12:125

E. coli transcriptome: library 300 €; sequencing + data management 150 €


Conclusion
« The rapid decrease in the cost of 'data generation' has not been matched by a comparable decrease in the cost of the computational infrastructure required to mine the data. The major burden will be the downstream analysis... » (Sboner et al. Genome Biology 2011, 12:125)