BIOINFORMATICS SCHOOL: INTRODUCTION TO THE PROCESSING OF GENOMIC DATA OBTAINED BY HIGH-THROUGHPUT SEQUENCING, 14-18 JANUARY 2013 - STATION BIOLOGIQUE - ROSCOFF
INTRODUCTION TO NEXT GENERATION SEQUENCING
Claude Thermes, Genome Analysis, IMAGIF Sequencing Platform, Centre de Génétique Moléculaire, Centre de Recherche de Gif, 14/01/2013
LIBRARY PREPARATION
DNA-Seq Library
Genomic DNA
Cleavage (sonication)
Fragmented DNA
Adaptor ligation
PCR
PCR product
Comparison of two RNA-seq library protocols: SOLiD™ Whole Transcriptome Analysis Kit (RNase III fragmentation) versus Illumina's directional mRNA-Seq Library (zinc fragmentation)
SOLiD™ Whole Transcriptome Analysis Kit: RNase III fragmentation
RiboMinus RNA (5' to 3')
RNase III fragmentation
Fragmented RNA
Hybridization with adapters (NNNNNN), ligation
Reverse transcription
Size selection
PCR amplification
Sequencing on SOLiD
[Figure: gene YBR078W (intron marked), sequencing on SOLiD]
[Figure: gene YBR078W (intron marked), sequencing on Illumina, SOLiD and Illumina tracks compared]
Very heterogeneous pattern; not due to sequencing technology but to library preparation: RNase III fragmentation not so random?
M. Wery, M. Descrimes, C. Thermes, D. Gautheret & A. Morillon (submitted)
SEQUENCING QUALITY
Sequencing quality checking:
• cluster density
• focus
• intensity
• quality score
Quality score
• Each base is assigned a quality score (Q-score) by a phred-like algorithm similar to that originally developed for Sanger sequencing
• Phred was developed by Ewing et al. (Genome Research 1998) and was used for the Human Genome Project
« Although the basic ideas are simple, the full implementation is a complex, somewhat inelegant rule-based procedure, that has been arrived at empirically by progressively refining the algorithms on the basis of examining performance on particular data sets »
• Phred:
- calculates several parameters related to peak shape and peak resolution at each base
- uses these parameters to look up a corresponding quality score in huge lookup tables generated from sequence traces where the correct sequence is known
- uses different lookup tables for different sequencing chemistries and machines
Quality score
• A quality score (Q-score) is a prediction of the probability of an incorrect base call
• Given a base call X, the probability that X is wrong, P(~X), is expressed by Q(X):
Q(X) = -10 log10(P(~X))
where P(~X) is the estimated probability of the base call being wrong, estimated on the basis of the 4 intensities (corresponding to the dNTP dyes)
• Q(X) = 30 indicates an error probability of 0.001: the base call is highly probable, i.e. one intensity is much higher than the 3 others
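The relation Q(X) = -10 log10(P(~X)) can be checked numerically. The sketch below is only an illustration of that formula; the function names are hypothetical and do not come from any sequencing software.

```python
import math


def phred_quality(p_error: float) -> float:
    """Return Q = -10 * log10(P(~X)) for an error probability in (0, 1]."""
    if not 0.0 < p_error <= 1.0:
        raise ValueError("error probability must be in (0, 1]")
    return -10.0 * math.log10(p_error)


def error_probability(q: float) -> float:
    """Invert the formula: P(~X) = 10 ** (-Q / 10)."""
    return 10.0 ** (-q / 10.0)


# Print the error probability for a few common Q-scores.
for q in (10, 20, 30, 40):
    print(f"Q{q}: error probability = {error_probability(q):g}")
```

Each increase of 10 in Q divides the estimated error probability by 10, so Q20 corresponds to 1 error in 100 base calls and Q30 to 1 in 1000.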
Quality score
• Q-scores are written to FASTQ files in a compact encoded form that uses only one byte per quality value
• Illumina 1.8: each quality score is represented by the ASCII character whose code equals the score + 33
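As a minimal sketch of the "ASCII code = value + 33" encoding described above, the following decodes and re-encodes a FASTQ quality string. The function names and the example string are made up for illustration; the offset of 33 is the only element taken from the slide.

```python
from typing import List

PHRED_OFFSET = 33  # Illumina 1.8 encoding: ASCII code = Q-score + 33


def decode_qualities(quality_string: str, offset: int = PHRED_OFFSET) -> List[int]:
    """Convert a FASTQ quality line into a list of integer Q-scores."""
    return [ord(char) - offset for char in quality_string]


def encode_qualities(scores: List[int], offset: int = PHRED_OFFSET) -> str:
    """Convert integer Q-scores back into a FASTQ quality line."""
    return "".join(chr(score + offset) for score in scores)


example = "II?5+!"  # hypothetical quality line for a 6-base read
print(decode_qualities(example))  # [40, 40, 30, 20, 10, 0]
```

With this offset, the character `!` (ASCII 33) encodes Q0 and `I` (ASCII 73) encodes Q40.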
Read Segment Quality Control
A number of factors can cause the quality of base calls to be low:
• bad temperature control (leading to « chemistry problems »)
• low intensity values
• bad focus
• phasing artifacts
PERFORMANCE PARAMETERS
HiSeq 2000 (Illumina), Genome Analyzer IIx (Illumina)
1 to 1.5 x 10^9, 2 x 100 bp
Growth of sequences deposited in public databases (EMBL)
Source: EMBL Statistics
Cost of 1 Mb of DNA sequencing
Sboner et al. Genome Biology 2011, 12:125
Cost of E. coli sequencing at 100x coverage: 50 €
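To give a sense of scale for "100x coverage", the sketch below estimates how many 2 x 100 bp read pairs are needed to reach a target coverage. The genome size of about 4.6 Mb for E. coli is an assumption that does not appear on the slide, and the function name is hypothetical; the calculation ignores unmapped or low-quality reads.

```python
import math


def read_pairs_for_coverage(genome_size_bp: int, coverage: float,
                            read_length_bp: int = 100, paired: bool = True) -> int:
    """Number of fragments to sequence so that total bases = coverage x genome size."""
    bases_needed = genome_size_bp * coverage
    bases_per_fragment = read_length_bp * (2 if paired else 1)
    return math.ceil(bases_needed / bases_per_fragment)


# Assumption: approximate E. coli genome size (not stated on the slide).
E_COLI_GENOME_BP = 4_600_000

print(read_pairs_for_coverage(E_COLI_GENOME_BP, 100))  # 2300000 read pairs
```

Under these assumptions, about 2.3 million read pairs are enough, a small fraction of the output of a single run.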
[Table, column headers lost in extraction] 1 000; 200 000; 2 x 100 bp; $800 USD; $0.8 USD; $100 000 USD; $0.1 USD
[Table, column headers lost in extraction] 1 600; 200 000; 2 x 100 bp; $800 USD; $0.5 USD; $100 000 USD; $0.1 USD
COMPARISON MICROARRAY / NGS
Microarray transcriptome
NGS transcriptome (ILLUMINA)
Sboner et al. Genome Biology 2011, 12:125
E. coli transcriptome library: 300 €; sequencing + data management: 150 €
Conclusion
The rapid decrease in the cost of ‘data generation’ has not been matched by a comparable decrease in the cost of the computational infrastructure required to mine the data. The major burden will be the downstream analysis...
Sboner et al. Genome Biology 2011, 12:125