INTRODUCTION TO NEXT GENERATION SEQUENCING

Author: Lindsay Melton
BIOINFORMATICS SCHOOL: INTRODUCTION TO THE PROCESSING OF GENOMIC DATA OBTAINED BY HIGH-THROUGHPUT SEQUENCING, 14-18 JANUARY 2013, STATION BIOLOGIQUE, ROSCOFF


Claude Thermes, Genome Analysis, IMAGIF Sequencing Platform, Centre de Génétique Moléculaire, Centre de Recherche de Gif, 14/01/2013

LIBRARY PREPARATION

DNA-Seq Library

Genomic DNA

Cleavage (sonication)

Fragmented DNA

Adaptor ligation

PCR

PCR product

Comparison of two RNA-seq library protocols: SOLiD™ Whole Transcriptome Analysis Kit (RNase III fragmentation) versus Illumina's directional mRNA-Seq Library (Zinc fragmentation)

SOLiD™ Whole Transcriptome Analysis Kit: RNase III fragmentation

RiboMinus RNA

RNase III

Fragmented RNA

Hybridization with adapters (NNNNNN), ligation

Reverse transcription

Size selection

PCR amplification

[Figure: read coverage along YBR078W (intron marked) for the RNase III library, sequenced on SOLiD and on Illumina]

Very heterogeneous pattern; not due to the sequencing technology but to the library preparation: is RNase III fragmentation not so random?

Illumina directional mRNA-Seq Library: Zinc fragmentation

Total RNA

Depletion of ribosomal RNA

ribo- RNA

Zinc fragmentation

Fragmented RNA

Ligation

RT

PCR

ds PCR product

[Figure: read coverage along YBR078W (intron marked), Zinc fragmentation versus RNase III fragmentation, same number of reads]

[Figure: correlation between nucleotides as a function of the distance between nucleotides, for Zinc fragmentation and RNase III fragmentation]

M. Wery, M. Descrimes, C. Thermes, D. Gautheret & A. Morillon (submitted)

SEQUENCING  QUALITY  

Sequencing quality checking
•  cluster density
•  focus
•  intensity
•  quality score

Quality score
•  Each base is assigned a quality score (Q-score) by a phred-like algorithm similar to that originally developed for Sanger sequencing
•  Phred was developed by Ewing et al. (Genome Research 1998) and was used for the Human Genome Project: « Although the basic ideas are simple, the full implementation is a complex, somewhat inelegant rule-based procedure, that has been arrived at empirically by progressively refining the algorithms on the basis of examining performance on particular data sets »
•  Phred:
-  calculates several parameters related to peak shape and peak resolution at each base
-  uses these parameters to look up a corresponding quality score in huge lookup tables generated from sequence traces where the correct sequence is known
-  uses different lookup tables for different sequencing chemistries and machines

Quality score
•  A quality score (Q-score) is a prediction of the probability of an incorrect base call
•  Given a base call X, the probability that X is not true, P(~X), is expressed by Q(X): Q(X) = -10 log10(P(~X))
•  P(~X) is the estimated probability of the base call being wrong; it is estimated on the basis of the 4 intensities (corresponding to the dNTP dyes)
•  Q(X) = 30 indicates an error probability of 0.001: the base call is highly probable, i.e. one intensity is much higher than the 3 others
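The relation Q(X) = -10 log10(P(~X)) can be checked numerically. The following is a minimal sketch, not part of the original slides; the function names `error_prob_to_q` and `q_to_error_prob` are hypothetical and chosen only for illustration.

```python
import math


def error_prob_to_q(p_error):
    """Convert an error probability P(~X) into a quality score Q(X)."""
    if not 0.0 < p_error <= 1.0:
        raise ValueError("error probability must be in (0, 1]")
    return -10.0 * math.log10(p_error)


def q_to_error_prob(q):
    """Invert the relation: P(~X) = 10 ** (-Q / 10)."""
    return 10.0 ** (-q / 10.0)


# Print the error probability for a few commonly quoted scores.
for q in (10, 20, 30, 40):
    print("Q =", q, "-> P(error) =", q_to_error_prob(q))
```

Each step of 10 in Q divides the error probability by 10, so Q = 20 corresponds to 1 error in 100 calls and Q = 30 to 1 error in 1000.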


Quality score
•  Q scores are written to FASTQ files in a compact encoded form that uses only one byte per quality value
•  Illumina 1.8: each quality score is stored as the ASCII character whose code equals the score + 33
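The "+ 33" encoding can be illustrated by decoding a quality string by hand. This is a minimal sketch assuming the Illumina 1.8 offset of 33 described above; the helper names and the example string are hypothetical.

```python
PHRED_OFFSET = 33  # offset stated on the slide for Illumina 1.8


def decode_quality(quality_line, offset=PHRED_OFFSET):
    """Turn a FASTQ quality line into a list of integer Q scores."""
    return [ord(ch) - offset for ch in quality_line]


def encode_quality(scores, offset=PHRED_OFFSET):
    """Turn a list of integer Q scores back into a FASTQ quality line."""
    return "".join(chr(q + offset) for q in scores)


example = "II?5+#"  # made-up quality line, for illustration only
print(decode_quality(example))  # [40, 40, 30, 20, 10, 2]
```

Files written by older Illumina pipelines may use a different offset, which is why the sketch takes the offset as a parameter rather than hard-coding it in the arithmetic.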

Read Segment Quality Control
A number of factors can cause the quality of base calls to be low:
•  bad temperature control (leading to « chemistry problems »)
•  low intensity values
•  bad focus
•  phasing artifacts
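The slides list causes of low-quality base calls rather than a remedy. One common downstream response is to trim the low-quality end of a read; the sketch below shows that idea only and is not Illumina's own read segment quality control procedure. It assumes the + 33 encoding, and the threshold of 20 and the function names are arbitrary, hypothetical choices.

```python
def mean_quality(qual, offset=33):
    """Mean Q score of a quality string (0.0 for an empty string)."""
    if not qual:
        return 0.0
    return sum(ord(ch) - offset for ch in qual) / len(qual)


def trim_low_quality_tail(seq, qual, min_q=20, offset=33):
    """Remove trailing bases whose Q score is below min_q.

    Scans from the 3' end and stops at the first base with Q >= min_q.
    """
    if len(seq) != len(qual):
        raise ValueError("sequence and quality must have the same length")
    end = len(seq)
    while end > 0 and ord(qual[end - 1]) - offset < min_q:
        end -= 1
    return seq[:end], qual[:end]


# Made-up read: six bases at Q40 followed by two bases at Q2.
print(trim_low_quality_tail("ACGTACGT", "IIIIII##"))  # ('ACGTAC', 'IIIIII')
```

Only the tail is trimmed here; a low-quality base in the middle of a read is left in place, which a sliding-window approach would handle differently.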

PERFORMANCE PARAMETERS

HiSeq 2000 (Illumina), Genome Analyzer IIx (Illumina)

1 to 1.5 x 10^9, 2 x 100 bp

Growth in the number of sequences deposited in public databases (EMBL)

Source: EMBL Statistics

Cost of 1 Mb of DNA sequencing

Sboner et al. Genome Biology 2011, 12:125


Cost of E. coli sequencing with 100X coverage: 50 €
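To give a sense of what 100X coverage means in sequencing output, coverage can be approximated as (number of sequenced bases) / (genome size). The sketch below is not from the slides: it assumes an E. coli genome of about 4.6 Mb and 2 x 100 bp paired-end reads, and the function names are hypothetical.

```python
import math

ECOLI_GENOME_BP = 4_600_000  # assumption: approximate E. coli genome size


def bases_required(genome_size_bp, coverage):
    """Total bases to sequence for a given mean coverage."""
    return genome_size_bp * coverage


def read_pairs_required(genome_size_bp, coverage, read_length_bp):
    """Number of paired-end fragments needed (two reads per fragment)."""
    bases_per_pair = 2 * read_length_bp
    return math.ceil(bases_required(genome_size_bp, coverage) / bases_per_pair)


print(bases_required(ECOLI_GENOME_BP, 100))            # 460000000
print(read_pairs_required(ECOLI_GENOME_BP, 100, 100))  # 2300000
```

Under these assumptions, 100X coverage corresponds to 460 Mb of sequence, or about 2.3 million read pairs, which is a small fraction of the output of a single run.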

1 000

200 000

2 x 100 bp $ 800 USD $ 0.8 USD $ 100 000 USD

$ 0.1 USD

1 600

200 000

2 x 100 bp $ 800 USD $ 0.5 USD $ 100 000 USD

$ 0.1 USD

COMPARISON MICROARRAY / NGS

Microarray transcriptome

NGS transcriptome (ILLUMINA)

Sboner et al. Genome Biology 2011, 12:125

E. coli transcriptome: library 300 €, sequencing + data management 150 €

Conclusion: The rapid decrease in the cost of ‘data generation’ has not been matched by a comparable decrease in the cost of the computational infrastructure required to mine the data. The major burden will be the downstream analysis... (Sboner et al. Genome Biology 2011, 12:125)