Data processing and analysis of genetic variation using nextgeneration sequencing Mark DePristo Dec. 1st - 2nd, 2010 Manager, Medical and Population Genetics Analysis Broad Institute of MIT and Harvard
Acknowledgments! Genome sequencing and analysis group (MPG)! Eric Banks (Development lead)! Ryan Poplin! Guillermo del Angel! Menachem Fromer! Kiran Garimella (Analysis Lead)! Corin Boyko! Chris Hartl! Matt Hanna! Khalid Shakir! Aaron McKenna!
Broad postdocs, staff, and faculty! Anthony Philippakis! Vineeta Agarwala! Manny Rivas! Jared Maguire! Carrie Sougnez! David Jaffe! Nick Patterson! Steve Schaffner! Shamil Sunyaev! Paul de Bakker!
1000 Genomes project! In general but notably:! Matt Hurles! Philip Awadalla! Richard Durbin! Goncalo Abecasis! Richard Gibbs! Gabor Marth! Thomas Keane! Gil McVean! Gerton Lunter! Heng Li!
Genome Sequencing Platform! In general but notably:! Lauren Ambrogio! Illumina Production Team! Tim Fennell! Kathleen Tibbetts! Alec Wysoker! Toby Bloom!
Copy number group! Bob Handsaker
Jim Nemesh
Josh Korn! Steve McCarroll !
Cancer genome analysis! Kristian Cibulskis! Andrey Sivachenko! Gad Getz!
Integrative Genomics Viewer (IGV)! Jim Robinson! Helga Thorvaldsdottir!
MPG directorship! Stacey Gabriel! David Altshuler! Mark Daly!
Self-introduction: who am I? • Currently the manager of Medical and Population Genetics Analysis in MPG – Run Genome Sequencing and Analysis (GSA) group
• Before this I was: – A business consultant at a local consulting firm – A Damon Runyon fellow doing experimental bacterial evolution at Harvard – A Marshall Fellow in Biochemistry at Cambridge – Mathematics and Computer Science major at Northwestern
Outline • Goal: understand how we can turn raw NGS into reliable information about genetic variation in our samples • Next-generation DNA sequencing (NGS) machines • Processing of NGS to discover and genotype – SNPs – Indels – CNVs
• Quality control
Phase 1: NGS data processing Typically by lane Input
Raw reads
Phase 2: Variant discovery and genotyping
Phase 3: Integrative analysis
Typically multiple samples simultaneously but can be single sample alone Sample 1 reads
Sample N reads
Raw indels
Raw SNPs
Raw SVs
External data
Mapping SNPs Local realignment
Pedigrees
Known variation
Population structure
Known genotypes
Indels Duplicate marking
Base quality recalibration
Output
Analysis-ready reads
Variant quality recalibration Structural variation (SV)
Raw variants
Genotype refinement
Analysis-ready variants
NGS TECHNOLOGIES AND TERMINOLOGY
Sequencing Generations
ABI 377
ABI 3730
=
454
=
Illumina
+
SOLiD
+
=
Helicos
=
Stacey Gabriel
Next, Next
Generation
??
=
Lots of Bases…. Illumina
120000
454
3730XL
90
Total Gb produced
100000
14
2
5
Total Gb produced
70 60
Production Instruments
Total Gb produced
3000
2500
80000
60000
40000 2000
20000
1500
1000
0
50
2008
40
500
30
0
20 10 0
2010
…2011 = 500,000 Gb 2006
2007
2008
SOLiD
Helicos
Prototype Instruments 8
Stacey Gabriel
2009
1
The Illumina HiSeq
Stacey Gabriel
Libraries, lanes, and flowcells Flowcell
Lanes Illumina
Each reaction produces a unique library of DNA fragments for sequencing.
Each NGS machine processes a single flowcell containing several independent lanes during a single sequencing run
Reads and fragments Non-reference bases are colored; reference bases are grey
Clean C/T heterozygote
Depth of coverage
IGV screenshot First and second read from the same fragment
Details about specific read
Individual reads aligned to the genome Reference genome
EXPERIMENTAL DESIGNS
Experimental parameters Candidate gene follow-up (FHS, Pfizer) Target size
Deep single genomes (Cancer) Depth per sample
~100 genes
Exome ARRA projects (ESP, Autism, T2D)
Whole exome Whole genome
Low-pass multi-sample (1000 Genomes, T2D GO)
Number of samples
Hybrid Selection : an Approach to “Fish” for Exons
“Baits” Biotinylated oligo
Hybridize and capture on bead
“Catch” Hybrid-selected targets
Sequence
“Pond” Sheared whole genome library Stacey Gabriel
Genotyping vs. pooling Genotyping
Pooling
Sample 1!
Samples 1 and 2!
Sample 2! Het
Hom
Het
• Known relationship between reads and samples • Analysis algorithms empowered to by using this information – Helps control errors
• E.g., multi-sample SNP calling
Variants!
Rare
Common
Rare
• Unknown relationship between reads and samples – Samples mixed together before sequencing
• Complicates analysis algorithms – Harder to control errors – Less reliable output
v 4.2
Indexed Hybrid Selection
Input: 3 ug July
Shearing (50 uL) 2.2x SPRI (no transfer)
P5
End Repair Modified 2.2x SPRI w/ SPRI Buffer
SP1
A-Base Addition
add adapters
Modified 2.2x SPRI w/ SPRI Buffer
PCR with universal oligos
Barcode Addition
Adapter Ligation
Modified 1x SPRI w/ SPRI Buffer
Limited Pond PCR
SP2’
DNA fragment
8-base index
P7
Pooling Option 2 Before Selection
1 x 50 ul reactions 6 cycles
Hybridization Blocking Oligos
Capture on Bead Direct Catch PCR off Bead 1 x 50 ul reaction P5
SP1
InP
P7
SP2 Barcode
Sheila Fisher
Pooling Option 1 After Selection
20 cycles Modified 1x SPRI w/ SPRI Buffer
Sequencing
Single vs. multi-sample analysis Fixed sequencing budget
Deep single-sample
Shallower multi-sample Sample 1! Sample 2!
Sample 1!
Sample 3! Sample 4! Variants!
Found 3 variants total!
• Higher sensitivity for variants in the sample • More accurate genotyping per sample • Cost: no information about other samples
Found 5 variants total!
• Sensitivity dependent on frequency of variation • Worse genotyping • More total variants discovered
High-pass sequencing design Inter-genic!
Exon I!
Intron I!
Variant site!
Exon II!
Inter-genic!
~30x reads! Excellent sensitivity for hetero- and homozygotes!
High depth allows excellent genotype calling!
Data requirements per sample!
Variant detection among multiple samples!
Targeted bases!
Variants found per sample!
~3-5M!
Percent of variation in genome!
>99%!
Pr{singleton discovery}!
>99%!
Pr{common allele discovery}!
>99%!
Coverage! # sequenced bases! # lanes of HiSeq!
~3 Gb! Avg. 30x! 100 Gb! ~8 lanes!
Low-pass sequencing design Inter-genic!
Exon I!
Intron I!
Variant site!
Exon II!
Inter-genic!
~4x reads! Heterozygotes can be mistaken for homozygotes due to sampling!
Variants missed by sampling!
Significantly better power to detect homozygous sites!
Data requirements per sample!
Variant detection among multiple samples!
Targeted bases!
Variants found per sample!
Coverage!
~3 Gb! Avg. 4x!
~3M!
Percent of variation in genome!
~90%!
# sequenced bases!
20 Gb!
Pr{singleton discovery}!
80% 20x! 5 Gb! ~0.33!
Pr{singleton discovery}!
~95%!
Pr{common allele discovery}!
~95%!
DATA PROCESSING
Phase 1: NGS data processing Typically by lane Input
Raw reads
Phase 2: Variant discovery and genotyping
Phase 3: Integrative analysis
Typically multiple samples simultaneously but can be single sample alone Sample 1 reads
Sample N reads
Raw indels
Raw SNPs
Raw SVs
External data
Mapping SNPs Local realignment
Pedigrees
Known variation
Population structure
Known genotypes
Indels Duplicate marking
Base quality recalibration
Output
Analysis-ready reads
Variant quality recalibration Structural variation (SV)
Raw variants
Genotype refinement
Analysis-ready variants
Commonly used tools • MAQ/BWA – Mapping (Illumina, 454, SOLID)
• BFAST – Mapping (SOLiD)
• SSAHA – Mapping (454)
• Mosaik – Mapping (Illumina, 454, SOLID) – Variant calling
• SOAP – Mapping (Illumina) – SNP calling
• IGV – Data visualization
• Samtools – Data processing – Variant calling
• glfTools – SNP calling
• QCALL – SNP calling
• GATK – – – –
Data processing Variant calling Variant manipulation Variant QC
• DINDEL – Indel calling
The GATK is a general toolkit for developing NGS data processing algorithms! Genome Analysis Toolkit (GATK)!
• Open-source map/reduce programming framework for developing analysis tools for next-gen sequencing data! • Easy-to-use, CPU and memory efficient, automatically parallelizing Java engine! Popular GATK tools! Local realignment! Indel calling!
SAM/BAM format!
• NGS standard: technology agnostic, binary, indexed, portable and extensible file format for NGS reads! • Used throughput Broad production pipeline! http://samtools.sourceforge.net/!
Base quality score recalibration!
SNP calling!
Variant evaluation!
Phasing and imputation!
GATKʼs Queue!
• Simple, distributable pipeline engine! • Underlies MPG-Firehose data processing! http://www.broadinstitute.org/gsa/wiki/!
VCF format!
• Standard and accessible format for storing variation and genotypes! • SNPs, indels, SV! • Many supporters: 1000 Genomes, dbSNP, GAP! http://vcftools.sourceforge.net/ !
The BAM format facilitates sequenceragnostic analyses Bases, quality scores, (optionally) alignments, and meta data!
BAM file! Read name!
Alignment gap information!
Quality scores (fastq format)!
SLX1:1:127:63:4 … 1 10052169 … 23M6N10M … GAAGATACTGGTTTTTTTCTTATGAGACGGAGT 768832'48::::::;;:/78$88818099897 SM:Z:JPTGBMN01 …!
Locus! BAM file allows us to represent the data of any sequencer. Analyses can then be conducted largely agnostic to the particular sequencer used.!
Read sequence!
Meta data!
Data processing and analysis! Kiran Garimella
BAM headers: an optional (but essential) part of a BAM file Required: Standard header @HD VN:1.0 GO:none SO:coordinate @SQ SN:chrM LN:16571 @SQ SN:chr1 LN:247249719 Essential: read groups. @SQ SN:chr2 LN:242951149 Essential: contigs of Carries platform (PL), library [cut for clarity] aligned reference (LB), and sample (SM) @SQ SN:chr9 LN:140273252 sequence. Should be information. Each read is @SQ SN:chr10 LN:135374737 in karotypic order. @SQ SN:chr11 LN:134452384 associated with a read group [cut for clarity] @SQ SN:chr22 LN:49691432 @SQ SN:chrX LN:154913754 @SQ SN:chrY LN:57772954 @RG ID:20FUK.1 PL:illumina PU:20FUKAAXX100202.1 LB:Solexa-18483 SM:NA12878 CN:BI @RG ID:20FUK.2 PL:illumina PU:20FUKAAXX100202.2 LB:Solexa-18484 SM:NA12878 CN:BI @RG ID:20FUK.3 PL:illumina PU:20FUKAAXX100202.3 LB:Solexa-18483 SM:NA12878 CN:BI @RG ID:20FUK.4 PL:illumina PU:20FUKAAXX100202.4 LB:Solexa-18484 SM:NA12878 CN:BI @RG ID:20FUK.5 PL:illumina PU:20FUKAAXX100202.5 LB:Solexa-18483 SM:NA12878 CN:BI @RG ID:20FUK.6 PL:illumina PU:20FUKAAXX100202.6 LB:Solexa-18484 SM:NA12878 CN:BI @RG ID:20FUK.7 PL:illumina PU:20FUKAAXX100202.7 LB:Solexa-18483 SM:NA12878 CN:BI @RG ID:20FUK.8 PL:illumina PU:20FUKAAXX100202.8 LB:Solexa-18484 SM:NA12878 CN:BI @PG ID:BWA VN:0.5.7 CL:tk Useful: Data processing tools applied to the reads @PG ID:GATK TableRecalibration VN:1.0.2864 20FUKAAXX100202:1:1:12730:189900 163 chrM 1 60 101M = 282 381 GATCACAGGTCTATCACCCTATTAACCACTCACGGGAGCTCTCCATGCATTTGGTA…[more bases] ?BA@A>BBBBACBBAC@BBCBBCBC@BC@CAC@:BBCBBCACAACBABCBCCAB…[more quals] RG:Z:20FUK.1 NM:i:1 SM:i:37 AM:i:37 MD:Z:72G28 MQ:i:60 PG:Z:BWA UQ:i:33
Finding the true origin of each read is a computationally demanding first step Phase 1: Picard! NGS data processing!
For more information see:
----- By lane/sample ----!
Input!
Raw reads!
Mapping!
GATK local realignment!
Enormous pile of short reads from NGS
Region 1
Mapping and alignment algorithms
Region 2
Duplicate marking!
GATK base quality recal!
Output!
Analysisready reads!
Detects correct read origin and flags them with high certainty
Li and Homer (2010). A survey of sequence alignment algorithms for next-generation sequencing. Briefings in Bioinformatics.
Region 3
Reference genome
Detects ambiguity in the origin of reads and flags them as uncertain
Dealing with indels: local realignment and per-base-alignment qualities (BAQ) Effect of MSA on alignments!
Phase 1: Picard! NGS data processing!
----- By lane/sample ----!
Input!
NA12878, chr1:1,510,530-1,510,589! rs28782535! rs28783181!
rs28788974!
rs34877486!
rs28788974!
Raw reads!
Mapping!
GATK local realignment! 1,000 Genomes Pilot 2 data, raw MAQ alignments!
1,000 Genomes Pilot 2 data, after MSA!
HiSeq data, raw BWA alignments!
HiSeq data, after MSA!
Duplicate marking!
GATK base quality recal!
Output!
Analysisready reads!
28 DePristo, M., Banks, E., Poplin, R. et. al, A framework for variation discovery and genotyping using nextgeneration DNA sequencing data. Manuscript in review.!
Base quality score recalibration
00
10 10
Reported Quality
!! ! !
GATK base quality recal! 0
50
100
150
!
200
Accuracy Accuracy (Empirical (Empirical !! Reported Reported Quality) Quality) !10 !5 0 5 10 !10 !5 0 5 10
10 5
! !! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! !! !! ! !! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! !! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! !! ! ! ! ! ! ! !! !! ! ! ! ! ! !! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! !! !! ! ! ! ! ! !! ! ! ! ! ! ! ! ! ! ! ! !! ! !! ! ! ! ! ! ! ! ! ! ! !!! !! !! !! !! ! !! ! ! ! !! ! ! !!! ! !! !!! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! !! ! ! !! ! ! ! ! ! ! !! ! ! ! !
!5
0
Original, RMSE = 1.784 Recalibrated, RMSE = 0.136
Duplicate marking!
!10
Accuracy (Empirical ! Reported Quality)
!
Machine Cycle
00
10 10
Analysisready reads! Original, RMSE = 2.169 Recalibrated, RMSE = 0.135
!! !!
0
!305
!20 10
!10 15
Original,RMSE RMSE==2.207 1.688 Original, Recalibrated,RMSE RMSE==0.186 0.213 Recalibrated,
First of pair reads 020
10 25
20 20
30 30
40 30 20 10 0
40 40
0
ReportedQuality Quality Reported
!!!!!!!!!! !!! ! !! !!! !! !!! !!!!!!! ! !! !!! !! ! !!! !! ! ! ! !!! !!! !!! ! !!! !!!! !! ! !!! !!! !!! ! ! ! ! !! ! !! ! !!! !! ! !!!!!!!!!!! ! !! !!! !!! !!! ! !!!!! !!! ! ! !!! ! ! !!! !!! ! !! !! ! !! ! !!! ! ! ! ! ! ! !! ! !!!!!!!! !! !! !! ! !!! ! ! ! !!! !!!!!!! !! ! !
Second of pair reads
Empirical Quality
40 40
40 40
20 30
30 35
MachineCycle Cycle Machine
!! !!
Original,RMSE RMSE==1.784 2.609 Original, Recalibrated,RMSE RMSE==0.136 0.089 Recalibrated,
! !! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! !! ! ! ! ! !! ! ! ! ! ! ! ! ! ! ! ! !! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! !! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! !! !! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! !! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! !! ! ! ! ! ! ! ! ! ! ! ! ! !! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! !! ! ! ! ! ! ! ! ! ! !! ! ! !! ! !! !! ! ! ! !! !! ! ! ! !! !! !! !! ! ! ! !!! !! !!! ! ! ! ! ! ! !! ! ! ! !!!!! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! !! !! ! ! !! !! !! !!! !! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! !!! ! ! ! ! !! !! ! !! !! ! ! ! !! !! !!! !!! ! ! !! ! ! !! ! ! ! !!! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! !! ! !!! ! ! ! ! ! ! ! ! ! ! ! !! !! ! ! ! ! ! !!! ! !! !!!! !!!!! ! ! ! ! ! !! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! !! ! ! ! ! ! ! ! ! ! ! ! ! !! ! ! ! ! ! !! ! !! ! ! ! ! ! !!!!! ! ! ! ! ! !! ! ! !! ! ! ! ! ! ! !! ! ! ! ! ! ! !! !! ! !! !! ! !! ! ! !! ! ! ! ! ! ! !! ! ! ! !
Second of pair reads !100 0
!50 50
First of pair reads
!
50 150
100 200
0 100
Original,RMSE RMSE==2.598 1.656 Original, Recalibrated,RMSE RMSE==0.052 0.088 Recalibrated,
Original,RMSE RMSE==2.169 2.469 Original, Recalibrated,RMSE RMSE==0.135 0.083 Recalibrated,
!!!!!!
!! !! ! !
S
!30
MachineCycle Cycle Machine
DePristo, M., Banks, E., Poplin, R. et. al, A framework for variation discovery and genotyping using next29 generation DNA sequencing data. Manuscript in review.! Ryan Poplin
rted rted Quality) Quality) 55 10 10
10 5
rted Quality)
Output!
30 30
ReportedQuality Quality Reported
GATK local realignment! !
20 20
10
40
Recalibrated,RMSE RMSE==0.213 0.135 Recalibrated,
5
30
!!
0
20
Recalibrated,RMSE RMSE==0.196 0.756 Recalibrated,
!5
10
!!
!10
0
Original, RMSE = 2.556 Recalibrated, RMSE = 0.213
10
0
!
5
!
Accuracy (Empirical ! Reported Quality)
Mapping!
! !! !! !! !! !! !! ! ! !! ! !! ! ! !! ! ! !! ! ! !! ! ! ! ! ! ! ! ! ! !! ! ! ! !! ! !!! ! !! !! !! !! !! ! !! ! ! !! ! ! !! !! !! ! ! ! ! !! !! ! !! ! !! ! !! ! ! ! ! !! ! !! !! !! ! ! !! ! ! !! Original, Original,RMSE RMSE==2.556 5.634
rted Quality)
!
Illumina/HiSeq Roche/4542000
00
!
Accuracy Accuracy (Empirical (Empirical !! Reported Reported Quality) Quality) !10 !5 0 5 10 !10 !5 0 5 10
!
!
!!
rted rted Quality) Quality) 55 10 10
!!
!! !! !!
!
00
20
Raw reads! ! !! !!
! !! !! !! !!!! ! ! !! !!! !! ! !!! ! ! ! !! !! ! ! !!! ! ! ! ! !! !! ! ! ! ! !! ! ! ! !! !! !! !! !! ! !! !! !! ! ! !! !! !!! ! !! ! ! !! ! ! ! ! !! ! ! !! ! !! ! ! !!! !! ! ! !! ! ! ! !! ! ! ! !! !! !! Original, Original,RMSE RMSE==5.242 1.215 !! !
Empirical Empirical Quality Quality 10 20 30 10 20 30
30
! ! !! ! ! ! !! ! ! !! ! ! !! !
Input!
!!
!
40 40
Life/SOLiD Illumina/GenomeAnalyzer
!! ----- By lane/sample ----! !!
10
Empirical Quality
40
Roche/454
Empirical Empirical Quality Quality 10 20 30 10 20 30
Phase 1: Picard! NGS data processing!
SNP calling I: genotype likelihoods per sample Bayesian model
Likelihood of the Likelihood for Prior for the data given the Independent base model! the genotype! genotype! genotype!
L(G | D) = P(G) P(D | G) =
∏
b∈{good _ bases}
P(b | G)
• Genotype likelihoods describe the probability of the reads for each genotype (AA, AC, …, GT, TT) at each locus • Likelihood of data computed using pileup of bases and associated quality scores at given locus • Only “good bases” are included: those satisfying minimum base quality, mapping read quality, pair mapping quality, NQS • P(b | G) uses platform-specific confusion matrices 30
See http://www.broadinstitute.org/gsa/wiki/index.php/Unified_genotyper for more information
SNP calling II: multi-sample SNP calling Sample-associated reads!
Genotype likelihoods! Allele frequency!
Individual 1! Individual 2!
Joint estimate across samples!
SNPs!
Individual N! Genotype frequencies!
• Simultaneous estimation of: – Allele frequency (AF) spectrum Pr{AF = i | D} – The probability that a SNP exists Pr{AF > 0 | D} – Assignment of genotypes to each sample Excellent mathematical treatment at http://lh3lh3.users.sourceforge.net/download/samtools.pdf
Variant Quality Score Recalibration: training on highly confident known sites to determine the probability that other sites are true The HapMap3 sites from NA12878 HiSeq! calls are used to train the GMM. Shown! here is the 2D plot of strand bias vs. the! variant quality / depth for those sites.!
Eric Banks, Ryan Poplin
Variants are scored based on their! fit to the Gaussians. The variants! (here just the novels) clearly! separate into good and bad clusters.!
Dindel: an algorithm by Kees Albers for calling small (~50 bp) indels in NGS Illumina data Phase 2: MPG-Firehose! Variant discovery and genotyping!
---- Multiple samples simultaneously ----! Sample 1 reads!
….!
Sample N reads!
Raw SNPs!
Raw indels!
Raw variants!
Filtered variants!
*!
The GATK version of this algorithm is still under development. We use it here experimentally for its genotyping facility. For initial discovery, we rely on sites emitted by IndelGenotyperV2, written by Andrey Sivachenko.! Figure and work by Kees Albers (http://sites.google.com/site/keesalbers/soft/dindel). A version of this is being implemented in the GATK by Guillermo del Angel with permission from Kees.!
33
Guillermo del Angel, Kiran Garimella
Mark DePristo November 29, 2010
NGS processing requires significant but consistent andrequirements storage costs 1 CPU Storage bytes storage ≈ 2 × targeted area bp
100
CPU time required to process a project
● SCZ_Sklar
80
Data type
60 ● Autism_Daly_WholeExome_Batch4 ● ●●Autism_Daly_WholeExome_Batch6 Autism_Daly_WholeExome_Batch1 ● BMI_Hirschhorn_NHGRI ● ● EOMI_Kathiresan_NHGRI HIV_deBakker_WholeExome_48Exomes ● Autism_Daly_WholeExome_Batch2 ● Autism_Daly_WholeExome_Batch5
40
Target
Storage
Single WEx
32 Mb
30-50 Gb
Deep WGS
2.85 Gb
250 Gb
Shallow WGS
2.85 Gb
25-50 Gb
700 EOMI exomes
32 Mb
35 Tb
1000G Pilot
2.85 Gb
9 Tb
1000G Phase I
2.85 Gb
60 Tb
Per-sample
y = 0.74x + 0.79
Autism_Daly_WholeExome_Batch3
Complete Project
20
CPU time (days)
Example Storage requires
MCKD1_20exomes
● Ashuldin_Mennonite ● Autism_Walsh ●
0
Amish_Rare_CFSC ● MCKD1 Ashuldin_mennonite Height_Hirschhorn_NHGRI● ●●
0
50
100
Number of samples
Kiran Garimella
150
QUALITY CONTROL
Evaluating SNP call quality Expected number of calls? • The number of SNP calls should be close to the average human heterozygosity of 1 variant per 1000 bases • Only detects gross under/over calling
What fraction of my calls are already known? • dbSNP catalogs most common variation, so most of the true variants found will be in dbSNP • For single sample calls, ~90 of variants should be in dbSNP • Need to adjust expectation when considering calls across samples
Concordance with genotype chip calls? • Often we have genotype chip data that indicates the homref, het, hom-var status at millions of sites • Good SNP calls should be >99.5% consistent these chip results, and >99% of the variable sites should be found • The chip sites are in the better parts of the genome, and so are not representative of the difficulties at novel sites
Transition to transversion ratio (Ti/Tv)? • Transitions are twice as frequent as transversions (see Ebersberger, 2002) • Validated human SNP data suggests that the Ti/Tv should be ~2.1 genome-wide and ~2.8 in exons • FP SNPs should has Ti/Tv around 0.5 • Ti/Tv is a good metric for assessing SNP call quality
A
C
G
T transitions transversions
How many calls should I expect?
Sample ethnicity
Target (L)
Heterozygosity (θ)
Expected number of variants
Single sample => 2N = 2 European
Whole genome
0.78 x 10-3
~3.3M
European
32Mb exome
0.42 x 10-3
~20K
African
Whole genome
1.00 x 10-3
~4.3M
African
32Mb exome
0.48 x 10-3
~23K
0.78 x 10-3
~13M
100 samples => 2N = 200 European
Whole genome
Including the 1000 Genomes Pilot, ~92-95% of all SNPs discovered in whole genome sequencing projects will be already known % of novel SNPs discovered in whole-genome sequencing 100
20
Number of SNPs in dbSNP (Millions)
100 80
15
1000 Genomes pilot 1
80 60
10
NA12878 (CEU) NA19240 (YRI) NA12878 (CEU)
60 40
NA19240 (YRI)
40 5
20
20 0 0
0 70
80
70 80 * NA12878 and NA19240 from 1KG Pilot 2 deep whole genome sequencing
90
90
100
110
100 build ID110 dbSNP
120
120
130
130
Genotype sensitivity and concordance • Genotype chips (Illumina 1M, e.g.) reliably identify genotypes at many sites – >99% accuracy at known polymorphic sites
• Sensitivity measure: what fraction of variant sites in sample on chip did I found? • Genotype quality: how concordance are the genotype calls from NGS with the chip? • Caveats: not comprehensive, only at easy sites, does not apply to novel calls
Non-reference sensitivity
Genotype of evaluation calls
Genotype of comparison calls A/A
A/B
B/B
./.
A/A
1
2
3
4
A/B
5
6
7
8
B/B
9
10
11
12
./.
13
14
15
16
A is ref allele, B is alt allele, . is no call
• Blue elements / red area elements • Measures fraction of sites called variant (A/B or B/B) in comparison that are also called variant in evaluation data • Penalizes hom-ref and no-calls at variant sites (6+7+10+11) / (2+3+6+7+10+11+14+1 5) = 34/ 68
Non-reference discrepancy rate
Genotype of evaluation calls
Genotype of comparison calls A/A
A/B
B/B
./.
A/A
1
2
3
4
A/B
5
6
7
8
B/B
9
10
11
12
./.
13
14
15
16
A is ref allele, B is alt allele, . is no call
• •
Blue elements / red area elements Measures accuracy of genotype calls at sites called by both sets – Excludes concordant A/A genotypes since these are often large in number and easier to get correct
•
Good metric for assaying accuracy of genotype calls
(2+3+5+7+9+10) / (2+3+5+6+7+9+10+11) = 36 / 53
Transition/transversion ratio (Ti/Tv) can be used to estimate false-positive rate
Because random errors will have Ti/Tv ratio of ~0.5, we can estimate the accuracy of novel SNP calls by their difference from expectation
EXPECTED OUTPUTS
ʹͻ͵͵
How do indel realignment and base Ǥ
quality recalibration affect SNP calling?
ʹǣ
ǡ
ǡǡͳǦ
ǡ
Ǥ "#$%!'()*%+,
!
-)./0+#')1!$)!2345676!*0+#01$'
2)8!)9!"2:'
! 3BB!
-0BB!'%$!
@1)C1
;#! false ?8J5>! positives due to indels!
I#10B!(0BB!'%$!
N)CO/0''! ǡ
Eric Banks
The process doesnʼt LL8?5! M8M6! remove real SNPs!
! ! ! ! ! ! ! ! http://www.broadinstitute.org/gsa/wiki/index.php/Local_realignment_around_indels ʹͳǤʹͶ
Ǥ
ͳͶǤͶͺ
͵ͳǤͺ͵
ʹǤͲͷ
ͲǤͻͺ
ͺͶǤͲʹ
ʹͲǤ͵ͷ
ͺͲǤͻͻ
ʹʹǤͷͻ
ͳǤͲ͵
͵Ͳ
ͻͻ
ʹǤͻͷ
ͳǤͳ
ͲǤͶ
ͲǤͲͳ
ͷͲǤͲͲ
ͲǤͲʹ
ͷʹǤͺ
http://www.broadinstitute.org/gsa/wiki/index.php/Base_quality_score_recalibration
• ȋȌδͷǡ • ε͵
How well does SNP calling work?
ͻǣ
Ǧ
"#$%!'()*%+,
!
-)./0+#')1!$)!2345676!*0+
2)8!)9!"2:'
! 3BB!
-0BB!'%$!
@1)C1
;#