Data processing and analysis of genetic variation using nextgeneration

Data processing and analysis of genetic variation using nextgeneration sequencing Mark DePristo Dec. 1st - 2nd, 2010 Manager, Medical and Population G...
Author: Scot Cannon
0 downloads 2 Views 5MB Size
Data processing and analysis of genetic variation using nextgeneration sequencing Mark DePristo Dec. 1st - 2nd, 2010 Manager, Medical and Population Genetics Analysis Broad Institute of MIT and Harvard

Acknowledgments! Genome sequencing and analysis group (MPG)! Eric Banks (Development lead)! Ryan Poplin! Guillermo del Angel! Menachem Fromer! Kiran Garimella (Analysis Lead)! Corin Boyko! Chris Hartl! Matt Hanna! Khalid Shakir! Aaron McKenna!

Broad postdocs, staff, and faculty! Anthony Philippakis! Vineeta Agarwala! Manny Rivas! Jared Maguire! Carrie Sougnez! David Jaffe! Nick Patterson! Steve Schaffner! Shamil Sunyaev! Paul de Bakker!

1000 Genomes project! In general but notably:! Matt Hurles! Philip Awadalla! Richard Durbin! Goncalo Abecasis! Richard Gibbs! Gabor Marth! Thomas Keane! Gil McVean! Gerton Lunter! Heng Li!

Genome Sequencing Platform! In general but notably:! Lauren Ambrogio! Illumina Production Team! Tim Fennell! Kathleen Tibbetts! Alec Wysoker! Toby Bloom!

Copy number group! Bob Handsaker
 Jim Nemesh
 Josh Korn! Steve McCarroll !

Cancer genome analysis! Kristian Cibulskis! Andrey Sivachenko! Gad Getz!

Integrative Genomics Viewer (IGV)! Jim Robinson! Helga Thorvaldsdottir!

MPG directorship! Stacey Gabriel! David Altshuler! Mark Daly!

Self-introduction: who am I? •  Currently the manager of Medical and Population Genetics Analysis in MPG –  Run Genome Sequencing and Analysis (GSA) group

•  Before this I was: –  A business consultant at a local consulting firm –  A Damon Runyon fellow doing experimental bacterial evolution at Harvard –  A Marshall Fellow in Biochemistry at Cambridge –  Mathematics and Computer Science major at Northwestern

Outline •  Goal: understand how we can turn raw NGS into reliable information about genetic variation in our samples •  Next-generation DNA sequencing (NGS) machines •  Processing of NGS to discover and genotype –  SNPs –  Indels –  CNVs

•  Quality control

Phase 1: NGS data processing Typically by lane Input

Raw reads

Phase 2: Variant discovery and genotyping

Phase 3: Integrative analysis

Typically multiple samples simultaneously but can be single sample alone Sample 1 reads

Sample N reads

Raw indels

Raw SNPs

Raw SVs

External data

Mapping SNPs Local realignment

Pedigrees

Known variation

Population structure

Known genotypes

Indels Duplicate marking

Base quality recalibration

Output

Analysis-ready reads

Variant quality recalibration Structural variation (SV)

Raw variants

Genotype refinement

Analysis-ready variants

NGS TECHNOLOGIES AND TERMINOLOGY

Sequencing Generations

ABI 377

ABI 3730

=

454

=

Illumina

+

SOLiD

+

=

Helicos

=

Stacey Gabriel

Next, Next

Generation

??

=

Lots of Bases…. Illumina

120000

454

3730XL

90

Total Gb produced

100000

14

2

5

Total Gb produced

70 60

Production Instruments

Total Gb produced

3000

2500

80000

60000

40000 2000

20000

1500

1000

0

50

2008

40

500

30

0

20 10 0

2010

…2011 = 500,000 Gb 2006

2007

2008

SOLiD

Helicos

Prototype Instruments 8

Stacey Gabriel

2009

1

The Illumina HiSeq

Stacey Gabriel

Libraries, lanes, and flowcells Flowcell

Lanes Illumina

Each reaction produces a unique library of DNA fragments for sequencing.

Each NGS machine processes a single flowcell containing several independent lanes during a single sequencing run

Reads and fragments Non-reference bases are colored; reference bases are grey

Clean C/T heterozygote

Depth of coverage

IGV screenshot First and second read from the same fragment

Details about specific read

Individual reads aligned to the genome Reference genome

EXPERIMENTAL DESIGNS

Experimental parameters Candidate gene follow-up (FHS, Pfizer) Target size

Deep single genomes (Cancer) Depth per sample

~100 genes

Exome ARRA projects (ESP, Autism, T2D)

Whole exome Whole genome

Low-pass multi-sample (1000 Genomes, T2D GO)

Number of samples

Hybrid Selection : an Approach to “Fish” for Exons

“Baits” Biotinylated oligo

Hybridize and capture on bead

“Catch” Hybrid-selected targets

Sequence

“Pond” Sheared whole genome library Stacey Gabriel

Genotyping vs. pooling Genotyping

Pooling

Sample 1!

Samples 1 and 2!

Sample 2! Het

Hom

Het

•  Known relationship between reads and samples •  Analysis algorithms empowered to by using this information –  Helps control errors

•  E.g., multi-sample SNP calling

Variants!

Rare

Common

Rare

•  Unknown relationship between reads and samples –  Samples mixed together before sequencing

•  Complicates analysis algorithms –  Harder to control errors –  Less reliable output

v 4.2

Indexed Hybrid Selection

Input: 3 ug July

Shearing (50 uL) 2.2x SPRI (no transfer)

P5

End Repair Modified 2.2x SPRI w/ SPRI Buffer

SP1

A-Base Addition

add adapters

Modified 2.2x SPRI w/ SPRI Buffer

PCR with universal oligos

Barcode Addition

Adapter Ligation

Modified 1x SPRI w/ SPRI Buffer

Limited Pond PCR

SP2’

DNA fragment

8-base index

P7

Pooling Option 2 Before Selection

1 x 50 ul reactions 6 cycles

Hybridization Blocking Oligos

Capture on Bead Direct Catch PCR off Bead 1 x 50 ul reaction P5

SP1

InP

P7

SP2 Barcode

Sheila Fisher

Pooling Option 1 After Selection

20 cycles Modified 1x SPRI w/ SPRI Buffer

Sequencing

Single vs. multi-sample analysis Fixed sequencing budget

Deep single-sample

Shallower multi-sample Sample 1! Sample 2!

Sample 1!

Sample 3! Sample 4! Variants!

Found 3 variants total!

•  Higher sensitivity for variants in the sample •  More accurate genotyping per sample •  Cost: no information about other samples

Found 5 variants total!

•  Sensitivity dependent on frequency of variation •  Worse genotyping •  More total variants discovered

High-pass sequencing design Inter-genic!

Exon I!

Intron I!

Variant site!

Exon II!

Inter-genic!

~30x reads! Excellent sensitivity for hetero- and homozygotes!

High depth allows excellent genotype calling!

Data requirements per sample!

Variant detection among multiple samples!

Targeted bases!

Variants found per sample!

~3-5M!

Percent of variation in genome!

>99%!

Pr{singleton discovery}!

>99%!

Pr{common allele discovery}!

>99%!

Coverage! # sequenced bases! # lanes of HiSeq!

~3 Gb! Avg. 30x! 100 Gb! ~8 lanes!

Low-pass sequencing design Inter-genic!

Exon I!

Intron I!

Variant site!

Exon II!

Inter-genic!

~4x reads! Heterozygotes can be mistaken for homozygotes due to sampling!

Variants missed by sampling!

Significantly better power to detect homozygous sites!

Data requirements per sample!

Variant detection among multiple samples!

Targeted bases!

Variants found per sample!

Coverage!

~3 Gb! Avg. 4x!

~3M!

Percent of variation in genome!

~90%!

# sequenced bases!

20 Gb!

Pr{singleton discovery}!

80% 20x! 5 Gb! ~0.33!

Pr{singleton discovery}!

~95%!

Pr{common allele discovery}!

~95%!

DATA PROCESSING

Phase 1: NGS data processing Typically by lane Input

Raw reads

Phase 2: Variant discovery and genotyping

Phase 3: Integrative analysis

Typically multiple samples simultaneously but can be single sample alone Sample 1 reads

Sample N reads

Raw indels

Raw SNPs

Raw SVs

External data

Mapping SNPs Local realignment

Pedigrees

Known variation

Population structure

Known genotypes

Indels Duplicate marking

Base quality recalibration

Output

Analysis-ready reads

Variant quality recalibration Structural variation (SV)

Raw variants

Genotype refinement

Analysis-ready variants

Commonly used tools •  MAQ/BWA –  Mapping (Illumina, 454, SOLID)

•  BFAST –  Mapping (SOLiD)

•  SSAHA –  Mapping (454)

•  Mosaik –  Mapping (Illumina, 454, SOLID) –  Variant calling

•  SOAP –  Mapping (Illumina) –  SNP calling

•  IGV –  Data visualization

•  Samtools –  Data processing –  Variant calling

•  glfTools –  SNP calling

•  QCALL –  SNP calling

•  GATK –  –  –  – 

Data processing Variant calling Variant manipulation Variant QC

•  DINDEL –  Indel calling

The GATK is a general toolkit for developing NGS data processing algorithms! Genome Analysis Toolkit (GATK)!

•  Open-source map/reduce programming framework for developing analysis tools for next-gen sequencing data! •  Easy-to-use, CPU and memory efficient, automatically parallelizing Java engine! Popular GATK tools! Local realignment! Indel calling!

SAM/BAM format!

•  NGS standard: technology agnostic, binary, indexed, portable and extensible file format for NGS reads! •  Used throughput Broad production pipeline! http://samtools.sourceforge.net/!

Base quality score recalibration!

SNP calling!

Variant evaluation!

Phasing and imputation!

GATKʼs Queue!

•  Simple, distributable pipeline engine! •  Underlies MPG-Firehose data processing! http://www.broadinstitute.org/gsa/wiki/!

VCF format!

•  Standard and accessible format for storing variation and genotypes! •  SNPs, indels, SV! •  Many supporters: 1000 Genomes, dbSNP, GAP! http://vcftools.sourceforge.net/ !

The BAM format facilitates sequenceragnostic analyses Bases, quality scores, (optionally) alignments, and meta data!

BAM file! Read name!

Alignment gap information!

Quality scores (fastq format)!

SLX1:1:127:63:4 … 1 10052169 … 23M6N10M … GAAGATACTGGTTTTTTTCTTATGAGACGGAGT 768832'48::::::;;:/78$88818099897 SM:Z:JPTGBMN01 …!

Locus! BAM file allows us to represent the data of any sequencer. Analyses can then be conducted largely agnostic to the particular sequencer used.!

Read sequence!

Meta data!

Data processing and analysis! Kiran Garimella

BAM headers: an optional (but essential) part of a BAM file Required: Standard header @HD VN:1.0 GO:none SO:coordinate @SQ SN:chrM LN:16571 @SQ SN:chr1 LN:247249719 Essential: read groups. @SQ SN:chr2 LN:242951149 Essential: contigs of Carries platform (PL), library [cut for clarity] aligned reference (LB), and sample (SM) @SQ SN:chr9 LN:140273252 sequence. Should be information. Each read is @SQ SN:chr10 LN:135374737 in karotypic order. @SQ SN:chr11 LN:134452384 associated with a read group [cut for clarity] @SQ SN:chr22 LN:49691432 @SQ SN:chrX LN:154913754 @SQ SN:chrY LN:57772954 @RG ID:20FUK.1 PL:illumina PU:20FUKAAXX100202.1 LB:Solexa-18483 SM:NA12878 CN:BI @RG ID:20FUK.2 PL:illumina PU:20FUKAAXX100202.2 LB:Solexa-18484 SM:NA12878 CN:BI @RG ID:20FUK.3 PL:illumina PU:20FUKAAXX100202.3 LB:Solexa-18483 SM:NA12878 CN:BI @RG ID:20FUK.4 PL:illumina PU:20FUKAAXX100202.4 LB:Solexa-18484 SM:NA12878 CN:BI @RG ID:20FUK.5 PL:illumina PU:20FUKAAXX100202.5 LB:Solexa-18483 SM:NA12878 CN:BI @RG ID:20FUK.6 PL:illumina PU:20FUKAAXX100202.6 LB:Solexa-18484 SM:NA12878 CN:BI @RG ID:20FUK.7 PL:illumina PU:20FUKAAXX100202.7 LB:Solexa-18483 SM:NA12878 CN:BI @RG ID:20FUK.8 PL:illumina PU:20FUKAAXX100202.8 LB:Solexa-18484 SM:NA12878 CN:BI @PG ID:BWA VN:0.5.7 CL:tk Useful: Data processing tools applied to the reads @PG ID:GATK TableRecalibration VN:1.0.2864 20FUKAAXX100202:1:1:12730:189900 163 chrM 1 60 101M = 282 381 GATCACAGGTCTATCACCCTATTAACCACTCACGGGAGCTCTCCATGCATTTGGTA…[more bases] ?BA@A>BBBBACBBAC@BBCBBCBC@BC@CAC@:BBCBBCACAACBABCBCCAB…[more quals] RG:Z:20FUK.1 NM:i:1 SM:i:37 AM:i:37 MD:Z:72G28 MQ:i:60 PG:Z:BWA UQ:i:33

Finding the true origin of each read is a computationally demanding first step Phase 1: Picard! NGS data processing!

For more information see:

----- By lane/sample ----!

Input!

Raw reads!

Mapping!

GATK local realignment!

Enormous pile of short reads from NGS

Region 1

Mapping and alignment algorithms

Region 2

Duplicate marking!

GATK base quality recal!

Output!

Analysisready reads!

Detects correct read origin and flags them with high certainty

Li and Homer (2010). A survey of sequence alignment algorithms for next-generation sequencing. Briefings in Bioinformatics.

Region 3

Reference genome

Detects ambiguity in the origin of reads and flags them as uncertain

Dealing with indels: local realignment and per-base-alignment qualities (BAQ) Effect of MSA on alignments!

Phase 1: Picard! NGS data processing!

----- By lane/sample ----!

Input!

NA12878, chr1:1,510,530-1,510,589! rs28782535! rs28783181!

rs28788974!

rs34877486!

rs28788974!

Raw reads!

Mapping!

GATK local realignment! 1,000 Genomes Pilot 2 data, raw MAQ alignments!

1,000 Genomes Pilot 2 data, after MSA!

HiSeq data, raw BWA alignments!

HiSeq data, after MSA!

Duplicate marking!

GATK base quality recal!

Output!

Analysisready reads!

28 DePristo, M., Banks, E., Poplin, R. et. al, A framework for variation discovery and genotyping using nextgeneration DNA sequencing data. Manuscript in review.!

Base quality score recalibration

00

10 10

Reported Quality

!! ! !

GATK base quality recal! 0

50

100

150

!

200

Accuracy Accuracy (Empirical (Empirical !! Reported Reported Quality) Quality) !10 !5 0 5 10 !10 !5 0 5 10

10 5

! !! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! !! !! ! !! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! !! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! !! ! ! ! ! ! ! !! !! ! ! ! ! ! !! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! !! !! ! ! ! ! ! !! ! ! ! ! ! ! ! ! ! ! ! !! ! !! ! ! ! ! ! ! ! ! ! ! !!! !! !! !! !! ! !! ! ! ! !! ! ! !!! ! !! !!! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! !! ! ! !! ! ! ! ! ! ! !! ! ! ! !

!5

0

Original, RMSE = 1.784 Recalibrated, RMSE = 0.136

Duplicate marking!

!10

Accuracy (Empirical ! Reported Quality)

!

Machine Cycle

00

10 10

Analysisready reads! Original, RMSE = 2.169 Recalibrated, RMSE = 0.135

!! !!

0

!305

!20 10

!10 15

Original,RMSE RMSE==2.207 1.688 Original, Recalibrated,RMSE RMSE==0.186 0.213 Recalibrated,

First of pair reads 020

10 25

20 20

30 30

40 30 20 10 0

40 40

0

ReportedQuality Quality Reported

!!!!!!!!!! !!! ! !! !!! !! !!! !!!!!!! ! !! !!! !! ! !!! !! ! ! ! !!! !!! !!! ! !!! !!!! !! ! !!! !!! !!! ! ! ! ! !! ! !! ! !!! !! ! !!!!!!!!!!! ! !! !!! !!! !!! ! !!!!! !!! ! ! !!! ! ! !!! !!! ! !! !! ! !! ! !!! ! ! ! ! ! ! !! ! !!!!!!!! !! !! !! ! !!! ! ! ! !!! !!!!!!! !! ! !

Second of pair reads

Empirical Quality

40 40

40 40

20 30

30 35

MachineCycle Cycle Machine

!! !!

Original,RMSE RMSE==1.784 2.609 Original, Recalibrated,RMSE RMSE==0.136 0.089 Recalibrated,

! !! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! !! ! ! ! ! !! ! ! ! ! ! ! ! ! ! ! ! !! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! !! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! !! !! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! !! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! !! ! ! ! ! ! ! ! ! ! ! ! ! !! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! !! ! ! ! ! ! ! ! ! ! !! ! ! !! ! !! !! ! ! ! !! !! ! ! ! !! !! !! !! ! ! ! !!! !! !!! ! ! ! ! ! ! !! ! ! ! !!!!! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! !! !! ! ! !! !! !! !!! !! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! !!! ! ! ! ! !! !! ! !! !! ! ! ! !! !! !!! !!! ! ! !! ! ! !! ! ! ! !!! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! !! ! !!! ! ! ! ! ! ! ! ! ! ! ! !! !! ! ! ! ! ! !!! ! !! !!!! !!!!! ! ! ! ! ! !! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! !! ! ! ! ! ! ! ! ! ! ! ! ! !! ! ! ! ! ! !! ! !! ! ! ! ! ! !!!!! ! ! ! ! ! !! ! ! !! ! ! ! ! ! ! !! ! ! ! ! ! ! !! !! ! !! !! ! !! ! ! !! ! ! ! ! ! ! !! ! ! ! !

Second of pair reads !100 0

!50 50

First of pair reads

!

50 150

100 200

0 100

Original,RMSE RMSE==2.598 1.656 Original, Recalibrated,RMSE RMSE==0.052 0.088 Recalibrated,

Original,RMSE RMSE==2.169 2.469 Original, Recalibrated,RMSE RMSE==0.135 0.083 Recalibrated,

!!!!!!

!! !! ! !

S

!30

MachineCycle Cycle Machine

DePristo, M., Banks, E., Poplin, R. et. al, A framework for variation discovery and genotyping using next29 generation DNA sequencing data. Manuscript in review.! Ryan Poplin

rted rted Quality) Quality) 55 10 10

10 5

rted Quality)

Output!

30 30

ReportedQuality Quality Reported

GATK local realignment! !

20 20

10

40

Recalibrated,RMSE RMSE==0.213 0.135 Recalibrated,

5

30

!!

0

20

Recalibrated,RMSE RMSE==0.196 0.756 Recalibrated,

!5

10

!!

!10

0

Original, RMSE = 2.556 Recalibrated, RMSE = 0.213

10

0

!

5

!

Accuracy (Empirical ! Reported Quality)

Mapping!

! !! !! !! !! !! !! ! ! !! ! !! ! ! !! ! ! !! ! ! !! ! ! ! ! ! ! ! ! ! !! ! ! ! !! ! !!! ! !! !! !! !! !! ! !! ! ! !! ! ! !! !! !! ! ! ! ! !! !! ! !! ! !! ! !! ! ! ! ! !! ! !! !! !! ! ! !! ! ! !! Original, Original,RMSE RMSE==2.556 5.634

rted Quality)

!

Illumina/HiSeq Roche/4542000

00

!

Accuracy Accuracy (Empirical (Empirical !! Reported Reported Quality) Quality) !10 !5 0 5 10 !10 !5 0 5 10

!

!

!!

rted rted Quality) Quality) 55 10 10

!!

!! !! !!

!

00

20

Raw reads! ! !! !!

! !! !! !! !!!! ! ! !! !!! !! ! !!! ! ! ! !! !! ! ! !!! ! ! ! ! !! !! ! ! ! ! !! ! ! ! !! !! !! !! !! ! !! !! !! ! ! !! !! !!! ! !! ! ! !! ! ! ! ! !! ! ! !! ! !! ! ! !!! !! ! ! !! ! ! ! !! ! ! ! !! !! !! Original, Original,RMSE RMSE==5.242 1.215 !! !

Empirical Empirical Quality Quality 10 20 30 10 20 30

30

! ! !! ! ! ! !! ! ! !! ! ! !! !

Input!

!!

!

40 40

Life/SOLiD Illumina/GenomeAnalyzer

!! ----- By lane/sample ----! !!

10

Empirical Quality

40

Roche/454

Empirical Empirical Quality Quality 10 20 30 10 20 30

Phase 1: Picard! NGS data processing!

SNP calling I: genotype likelihoods per sample Bayesian model

Likelihood of the Likelihood for Prior for the data given the Independent base model! the genotype! genotype! genotype!

L(G | D) = P(G) P(D | G) =



b∈{good _ bases}

P(b | G)

•  Genotype likelihoods describe the probability of the reads for each genotype (AA, AC, …, GT, TT) at each locus •  Likelihood of data computed using pileup of bases and associated quality scores at given locus •  Only “good bases” are included: those satisfying minimum base quality, mapping read quality, pair mapping quality, NQS •  P(b | G) uses platform-specific confusion matrices 30

See http://www.broadinstitute.org/gsa/wiki/index.php/Unified_genotyper for more information

SNP calling II: multi-sample SNP calling Sample-associated reads!

Genotype likelihoods! Allele frequency!

Individual 1! Individual 2!

Joint estimate across samples!

SNPs!

Individual N! Genotype frequencies!

•  Simultaneous estimation of: –  Allele frequency (AF) spectrum Pr{AF = i | D} –  The probability that a SNP exists Pr{AF > 0 | D} –  Assignment of genotypes to each sample Excellent mathematical treatment at http://lh3lh3.users.sourceforge.net/download/samtools.pdf

Variant Quality Score Recalibration: training on highly confident known sites to determine the probability that other sites are true The HapMap3 sites from NA12878 HiSeq! calls are used to train the GMM. Shown! here is the 2D plot of strand bias vs. the! variant quality / depth for those sites.!

Eric Banks, Ryan Poplin

Variants are scored based on their! fit to the Gaussians. The variants! (here just the novels) clearly! separate into good and bad clusters.!

Dindel: an algorithm by Kees Albers for calling small (~50 bp) indels in NGS Illumina data Phase 2: MPG-Firehose! Variant discovery and genotyping!

---- Multiple samples simultaneously ----! Sample 1 reads!

….!

Sample N reads!

Raw SNPs!

Raw indels!

Raw variants!

Filtered variants!

*!

The GATK version of this algorithm is still under development. We use it here experimentally for its genotyping facility. For initial discovery, we rely on sites emitted by IndelGenotyperV2, written by Andrey Sivachenko.! Figure and work by Kees Albers (http://sites.google.com/site/keesalbers/soft/dindel). A version of this is being implemented in the GATK by Guillermo del Angel with permission from Kees.!

33

Guillermo del Angel, Kiran Garimella

Mark DePristo November 29, 2010

NGS processing requires significant but consistent andrequirements storage costs 1 CPU Storage bytes storage ≈ 2 × targeted area bp

100

CPU time required to process a project

● SCZ_Sklar

80

Data type

60 ● Autism_Daly_WholeExome_Batch4 ● ●●Autism_Daly_WholeExome_Batch6 Autism_Daly_WholeExome_Batch1 ● BMI_Hirschhorn_NHGRI ● ● EOMI_Kathiresan_NHGRI HIV_deBakker_WholeExome_48Exomes ● Autism_Daly_WholeExome_Batch2 ● Autism_Daly_WholeExome_Batch5

40

Target

Storage

Single WEx

32 Mb

30-50 Gb

Deep WGS

2.85 Gb

250 Gb

Shallow WGS

2.85 Gb

25-50 Gb

700 EOMI exomes

32 Mb

35 Tb

1000G Pilot

2.85 Gb

9 Tb

1000G Phase I

2.85 Gb

60 Tb

Per-sample

y = 0.74x + 0.79

Autism_Daly_WholeExome_Batch3

Complete Project

20

CPU time (days)

Example Storage requires

MCKD1_20exomes

● Ashuldin_Mennonite ● Autism_Walsh ●

0

Amish_Rare_CFSC ● MCKD1 Ashuldin_mennonite Height_Hirschhorn_NHGRI● ●●

0

50

100

Number of samples

Kiran Garimella

150

QUALITY CONTROL

Evaluating SNP call quality Expected number of calls? •  The number of SNP calls should be close to the average human heterozygosity of 1 variant per 1000 bases •  Only detects gross under/over calling

What fraction of my calls are already known? •  dbSNP catalogs most common variation, so most of the true variants found will be in dbSNP •  For single sample calls, ~90 of variants should be in dbSNP •  Need to adjust expectation when considering calls across samples

Concordance with genotype chip calls? •  Often we have genotype chip data that indicates the homref, het, hom-var status at millions of sites •  Good SNP calls should be >99.5% consistent these chip results, and >99% of the variable sites should be found •  The chip sites are in the better parts of the genome, and so are not representative of the difficulties at novel sites

Transition to transversion ratio (Ti/Tv)? •  Transitions are twice as frequent as transversions (see Ebersberger, 2002) •  Validated human SNP data suggests that the Ti/Tv should be ~2.1 genome-wide and ~2.8 in exons •  FP SNPs should has Ti/Tv around 0.5 •  Ti/Tv is a good metric for assessing SNP call quality

A

C

G

T transitions transversions

How many calls should I expect?

Sample ethnicity

Target (L)

Heterozygosity (θ)

Expected number of variants

Single sample => 2N = 2 European

Whole genome

0.78 x 10-3

~3.3M

European

32Mb exome

0.42 x 10-3

~20K

African

Whole genome

1.00 x 10-3

~4.3M

African

32Mb exome

0.48 x 10-3

~23K

0.78 x 10-3

~13M

100 samples => 2N = 200 European

Whole genome

Including the 1000 Genomes Pilot, ~92-95% of all SNPs discovered in whole genome sequencing projects will be already known % of novel SNPs discovered in whole-genome sequencing 100

20

Number of SNPs in dbSNP (Millions)

100 80

15

1000 Genomes pilot 1

80 60

10

NA12878 (CEU) NA19240 (YRI) NA12878 (CEU)

60 40

NA19240 (YRI)

40 5

20

20 0 0

0 70

80

70 80 * NA12878 and NA19240 from 1KG Pilot 2 deep whole genome sequencing

90

90

100

110

100 build ID110 dbSNP

120

120

130

130

Genotype sensitivity and concordance •  Genotype chips (Illumina 1M, e.g.) reliably identify genotypes at many sites –  >99% accuracy at known polymorphic sites

•  Sensitivity measure: what fraction of variant sites in sample on chip did I found? •  Genotype quality: how concordance are the genotype calls from NGS with the chip? •  Caveats: not comprehensive, only at easy sites, does not apply to novel calls

Non-reference sensitivity

Genotype of evaluation calls

Genotype of comparison calls A/A

A/B

B/B

./.

A/A

1

2

3

4

A/B

5

6

7

8

B/B

9

10

11

12

./.

13

14

15

16

A is ref allele, B is alt allele, . is no call

•  Blue elements / red area elements •  Measures fraction of sites called variant (A/B or B/B) in comparison that are also called variant in evaluation data •  Penalizes hom-ref and no-calls at variant sites (6+7+10+11) / (2+3+6+7+10+11+14+1 5) = 34/ 68

Non-reference discrepancy rate

Genotype of evaluation calls

Genotype of comparison calls A/A

A/B

B/B

./.

A/A

1

2

3

4

A/B

5

6

7

8

B/B

9

10

11

12

./.

13

14

15

16

A is ref allele, B is alt allele, . is no call

•  • 

Blue elements / red area elements Measures accuracy of genotype calls at sites called by both sets –  Excludes concordant A/A genotypes since these are often large in number and easier to get correct

• 

Good metric for assaying accuracy of genotype calls

(2+3+5+7+9+10) / (2+3+5+6+7+9+10+11) = 36 / 53

Transition/transversion ratio (Ti/Tv) can be used to estimate false-positive rate

Because random errors will have Ti/Tv ratio of ~0.5, we can estimate the accuracy of novel SNP calls by their difference from expectation

EXPECTED OUTPUTS

ʹͻ‘ˆ͵͵

How do indel realignment and base ’ƒ••†ƒ–ƒ•‡–•Ǥƒ”–‘‡‘ˆ‡ƒ…Š•‡…–‹‘•—ƒ”‹œ‡•–Š‡‹’ƒ…–‘ˆŽ‘…ƒŽ”‡ƒŽ‹‰‡– quality recalibration affect SNP calling? ƒ†„ƒ•‡“—ƒŽ‹–›”‡…ƒŽ‹„”ƒ–‹‘„›…‘’ƒ”‹‰…ƒŽŽ•‘”‡ƒ†•™‹–Š”ƒ™“—ƒŽ‹–› ƒ„Ž‡ʹǣƒ™–‘”‡…ƒŽ‹„”ƒ–‡†ǡ‹’—–‡†…ƒŽŽ• ‹‡“ǡš‘‡ǡƒ†͸ͳ•ƒ’Ž‡Ž‘™Ǧ

•…‘”‡•ƒ†ƒŽ‹‰‡–•–‘–Š‘•‡ƒ†‡‘–Š‡”‡ƒŽ‹‰‡†ǡ”‡…ƒŽ‹„”ƒ–‡†”‡ƒ†•Ǥ "#$%!&#'()*%+,

!

-)./0+#')1!$)!2345676!*0+#01$'

2)8!)9!"2:'

! 3BB!

-0BB!'%$!

@1)C1

;#! false ?8J5>! positives due to indels!

I#10B!(0BB!'%$! 



N)CO/0''! ƒ™”‡ƒ†•ǡƒŽŽ…ƒŽŽ•

Eric Banks



The process doesnʼt LL8?5! M8M6! remove real SNPs!





! ! ! ! ! ! ! ! http://www.broadinstitute.org/gsa/wiki/index.php/Local_realignment_around_indels ʹͳǤʹͶ

͸Ǥ͹͸

ͳͶǤͶͺ

͵ͳǤͺ͵

ʹǤͲͷ

ͲǤͻͺ

ͺͶǤͲʹ

ʹͲǤ͵ͷ

ͺͲǤͻͻ

ʹʹǤͷͻ

ͳǤͲ͵

͵Ͳ

ͻͻ͹

ʹǤͻͷ

ͳǤͳ͹

ͲǤ͸Ͷ

ͲǤͲͳ

ͷͲǤͲͲ

ͲǤͲʹ

ͷʹǤ͹ͺ

http://www.broadinstitute.org/gsa/wiki/index.php/Base_quality_score_recalibration

‹“—‡–‘”ƒ™”‡ƒ†…ƒŽŽ•

• —ƒŽ‹–›‘˜‡”†‡’–ŠȋȌδͷǡ‘” • —ε͵

How well does SNP calling work?



ƒ„Ž‡ͻǣ‡”ˆ‘”ƒ…‡‘ˆŠƒ”†Ǧˆ‹Ž–‡”‹‰ƒ†˜ƒ”‹ƒ–“—ƒŽ‹–›•…‘”‡”‡…ƒŽ‹„”ƒ–‹‘ "#$%!&#'()*%+,

!

-)./0+#')1!$)!2345676!*0+

2)8!)9!"2:'

! 3BB!

-0BB!'%$!

@1)C1

;#

Suggest Documents