Fully powered polygenic prediction using summary statistics
Alkes L. Price Harvard T.H. Chan School of Public Health October 7, 2015 To download slides of this talk: google “Alkes HSPH”
Summary statistics are widely available
—Nat Genet editorial, July 2012
Outline
1. A brief history of summary statistic genetics
2. Introduction to polygenic prediction using summary statistics
3. LDpred method for polygenic prediction using summary statistics
4. Application of LDpred to real data sets
Definition of summary statistics
Definition: Summary statistics consist of:
• GWAS association z-scores for each typed or imputed SNP, plus
• Sample sizes on which z-scores were computed (may vary by SNP)
Note: Many applications also require LD information computed from a reference panel (e.g. 1000 Genomes or UK10K) using a population “very similar” to the target sample.
Meta-analysis can be performed using summary statistics
Evangelou & Ioannidis 2013 Nat Rev Genet
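As an illustration of why z-scores and sample sizes are enough for meta-analysis, here is a minimal sketch of the standard sample-size-weighted combination of z-scores for one SNP. The function name and the toy numbers are assumptions made for this example; real meta-analyses also need allele alignment and quality control, which are not shown.

```python
import math

def meta_analysis_z(z_scores, sample_sizes):
    """Combine per-study z-scores for one SNP, weighting each by sqrt(N)."""
    if len(z_scores) != len(sample_sizes):
        raise ValueError("one sample size per z-score is required")
    numerator = sum(math.sqrt(n) * z for z, n in zip(z_scores, sample_sizes))
    return numerator / math.sqrt(sum(sample_sizes))

# Toy input (made up): two studies of equal size with the same z-score.
combined = meta_analysis_z([2.0, 2.0], [100, 100])
```

With two equally sized studies the combined z-score is larger than either input by a factor of sqrt(2), reflecting the doubled sample size.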
Joint and conditional analysis can be performed using summary statistics
Yang et al. 2012 Nat Genet
Imputation can be performed using summary statistics
Lee et al. 2013 Bioinformatics; Pasaniuc et al. 2014 Bioinformatics also see Park et al. 2015 Bioinformatics, Lee et al. 2015 Bioinformatics
Rare variant meta-analysis can be performed using summary statistics
Lee et al. 2013 AJHG; Hu et al. 2013 AJHG; Liu et al. 2014 Nat Genet also see Clarke et al. 2013 PLoS Genet, Tang & Lin 2015 AJHG
Genetic variance and covariance can be inferred using summary statistics
Palla & Dudbridge 2015 AJHG; Bulik-Sullivan et al. 2015 Nat Genet
Functional enrichment can be inferred using summary statistics
Pickrell 2014 AJHG; Kichaev & Pasaniuc 2015 AJHG; Finucane et al. 2015 Nat Genet
Many projects at ASHG 2015 using summary statistics
• Invited talks: Pickrell, Pasaniuc, Im (this session)
• Platform talks: 11 Gusev, 77 Cichonska, 220 Golan, 272 Park
• Posters: 791 Kichaev, 797 Shi, 807 Roytman, 860 Salem, 868 Pare, 1301 Wu, 1334 Zhu, 1357 Chatterjee, 1477 Brown, 1618 Li, 1668 Khawaja, 1686 Lee, 1687 Zhao, 1728 Torres, 1867 O’Connor
Genetic prediction: why care?
Erbe et al. 2012 J Dairy Sci; Goss et al. 2011 New Engl J Med
Using only genome-wide significant SNPs is a Stone Age genetic prediction method
How should we conduct genetic prediction, Fred?
$$\hat{\phi}_k = \sum_{i \,\in\, \text{published SNPs}} \hat{\beta}_i x_{ik}$$
$\phi_k$ = phenotype for sample k; $\beta_i$ = effect size for SNP i; $x_{ik}$ = genotype for SNP i, sample k
Prediction $r^2$ is less than half the $r^2$ attained by polygenic prediction.
PGC-SCZ 2014 Nature; Vilhjalmsson et al. 2015 AJHG
Polygenic prediction can be performed using genome-wide summary statistics
$$\hat{\phi}_k = \sum_{i \,\in\, \text{all GWAS SNPs}} \hat{\beta}_i x_{ik}$$
$\phi_k$ = phenotype for sample k; $\beta_i$ = effect size for SNP i; $x_{ik}$ = genotype for SNP i, sample k
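The polygenic score is a weighted sum of genotypes, which can be sketched as a single matrix-vector product. The function name, array layout (samples in rows, SNPs in columns) and toy values are assumptions made for this example.

```python
import numpy as np

def polygenic_score(genotypes, effect_sizes):
    """Return phi_hat_k = sum_i beta_hat_i * x_ik for every sample k.

    genotypes:    (n_samples, n_snps) array of x_ik
    effect_sizes: (n_snps,) array of beta_hat_i
    """
    genotypes = np.asarray(genotypes, dtype=float)
    effect_sizes = np.asarray(effect_sizes, dtype=float)
    if genotypes.shape[1] != effect_sizes.shape[0]:
        raise ValueError("one effect size per SNP column is required")
    return genotypes @ effect_sizes

# Toy data (made up): 3 samples, 4 SNPs coded as allele counts.
x = np.array([[0, 1, 2, 0],
              [1, 1, 0, 2],
              [2, 0, 1, 1]])
beta_hat = np.array([0.10, -0.05, 0.02, 0.00])
scores = polygenic_score(x, beta_hat)
```

Restricting the sum to published SNPs or to all GWAS SNPs only changes which columns and effect sizes are passed in.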
Is polygenic prediction using raw genotypes more accurate than using summary statistics? Answer: slightly.
Expected prediction accuracy using summary statistics (fit each SNP individually):
$$r^2 = h_g^2 \cdot \frac{h_g^2}{h_g^2 + M/N}$$
$h_g^2$ = heritability explained by SNPs; M = number of (unlinked) SNPs; N = number of training samples
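To get a feel for the accuracy expression $r^2 = h_g^2 \cdot h_g^2 / (h_g^2 + M/N)$, the sketch below evaluates it at a few training sample sizes. The function name and the parameter values ($h_g^2 = 0.5$, M = 60,000) are illustrative assumptions, not values from the talk.

```python
def expected_r2_summary_stats(h2g, m, n):
    """Expected prediction r^2 when each SNP effect is fit individually."""
    return h2g * h2g / (h2g + m / n)

# Illustrative values only: accuracy grows with N and is bounded by h2g.
h2g, m = 0.5, 60_000
r2_by_n = {n: expected_r2_summary_stats(h2g, m, n)
           for n in (10_000, 60_000, 1_000_000)}
```

The accuracy approaches $h_g^2$ only when N is much larger than $M / h_g^2$.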
Accounting for non-infinitesimal architectures can improve polygenic prediction
Infinitesimal (Gaussian) architecture: $\beta_i \sim N(0, h_g^2/M)$, $\hat{\beta}_i \sim \beta_i + N(0, 1/N)$ =>
$$E(\beta_i \mid \hat{\beta}_i) = \frac{h_g^2}{h_g^2 + M/N}\, \hat{\beta}_i$$
Uniform shrink on estimated effect sizes $\hat{\beta}_i$ is appropriate.
Non-infinitesimal architecture (e.g. point-normal mixture, mixture of normals, etc.): non-uniform shrink on estimated effect sizes $\hat{\beta}_i$ is appropriate.
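The uniform shrink factor $h_g^2 / (h_g^2 + M/N)$ under the infinitesimal model can be checked by simulation. This is a minimal sketch assuming unlinked SNPs; the function names, seed and parameter values are chosen only for illustration.

```python
import numpy as np

def uniform_shrink(beta_hat, h2g, m, n):
    """Posterior mean E(beta_i | beta_hat_i) under the infinitesimal prior."""
    return (h2g / (h2g + m / n)) * np.asarray(beta_hat, dtype=float)

def simulated_shrink_factor(h2g, m, n, seed=0):
    """Regress simulated true effects on estimated effects (through the origin)."""
    rng = np.random.default_rng(seed)
    beta = rng.normal(0.0, np.sqrt(h2g / m), size=m)              # true effects
    beta_hat = beta + rng.normal(0.0, np.sqrt(1.0 / n), size=m)   # noisy estimates
    return float(beta @ beta_hat / (beta_hat @ beta_hat))

# Illustrative values: the theoretical factor is 0.5 / (0.5 + 1) = 1/3.
slope = simulated_shrink_factor(h2g=0.5, m=50_000, n=50_000)
shrunk = uniform_shrink([0.3, -0.6], h2g=0.5, m=50_000, n=50_000)
```

The simulated slope should land close to the theoretical factor, which is why every estimated effect size is scaled by the same constant in this setting.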
Standard heuristic approach for non-infinitesimal architectures: P-value thresholding
$$\hat{\phi}_k = \sum_{i:\, P_i < P_T} \hat{\beta}_i x_{ik}$$
(Note: requires optimization of the $P_T$ threshold in validation samples)
Purcell et al. 2009 Nature; Chatterjee et al. 2013 Nat Genet; Dudbridge 2013 PLoS Genet
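P-value thresholding can be sketched as a masked version of the polygenic score: only SNPs with a two-sided P-value below $P_T$ contribute. The function names and toy inputs are assumptions for this example, and the choice of $P_T$ in validation samples is not shown.

```python
import numpy as np
from math import erfc, sqrt

def two_sided_pvalue(z):
    """Two-sided P-value of a z-score under the standard normal."""
    return erfc(abs(z) / sqrt(2.0))

def thresholded_score(genotypes, beta_hat, z_scores, p_threshold):
    """Polygenic score using only SNPs with P-value < p_threshold."""
    genotypes = np.asarray(genotypes, dtype=float)
    beta_hat = np.asarray(beta_hat, dtype=float)
    p_values = np.array([two_sided_pvalue(z) for z in z_scores])
    keep = p_values < p_threshold
    return genotypes[:, keep] @ beta_hat[keep], keep

# Toy data (made up): 2 samples, 3 SNPs.
x = np.array([[1, 0, 2],
              [0, 2, 1]])
beta_hat = np.array([0.05, 0.01, -0.03])
z = [5.0, 0.5, -3.0]
scores, kept = thresholded_score(x, beta_hat, z, p_threshold=0.01)
```

In practice the score would be recomputed over a grid of thresholds and the best-performing $P_T$ selected in validation samples.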
Accounting for linkage disequilibrium can improve polygenic prediction
Problem: $\hat{\phi}_k = \sum_{i:\, P_i < P_T} \hat{\beta}_i x_{ik}$ does not account for LD between SNPs.
Standard heuristic approaches:
• Random LD-pruning: prune SNPs (e.g. until all remaining pairs have $r^2 < 0.2$), removing one of each pair of linked SNPs (decide randomly which SNP to remove)
• Informed LD-pruning (LD-clumping): prune SNPs, removing one of each pair of linked SNPs (remove the SNP with the less significant P-value in training data)
Purcell et al. 2009 Nature; Stahl et al. 2012 Nat Genet; also see Rietveld et al. 2013 Science (COJO)
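Informed LD-pruning can be sketched as a greedy pass over SNPs from most to least significant, keeping a SNP only if it is not in strong LD with one already kept. This is a simplified illustration that assumes a full pairwise $r^2$ matrix is available; the function name and toy inputs are hypothetical, and production tools typically work in genomic windows instead.

```python
import numpy as np

def ld_clump(p_values, r2, r2_max=0.2):
    """Return indices of SNPs kept by greedy informed LD-pruning."""
    p_values = np.asarray(p_values, dtype=float)
    r2 = np.asarray(r2, dtype=float)
    order = np.argsort(p_values, kind="stable")   # most significant first
    kept = []
    for i in order:
        # Keep SNP i only if it is weakly linked to every SNP kept so far.
        if all(r2[i, j] <= r2_max for j in kept):
            kept.append(int(i))
    return sorted(kept)

# Toy data (made up): SNPs 0 and 1 are linked, SNPs 2 and 3 are linked.
r2 = np.array([[1.00, 0.80, 0.01, 0.01],
               [0.80, 1.00, 0.01, 0.01],
               [0.01, 0.01, 1.00, 0.50],
               [0.01, 0.01, 0.50, 1.00]])
p_values = [1e-3, 1e-6, 0.2, 0.04]
kept = ld_clump(p_values, r2)
```

Random LD-pruning would follow the same loop with a random visiting order instead of sorting by P-value.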
Pruning + Thresholding is widely used …
Purcell et al. 2009 Nature; Lango Allen et al. 2010 Nature; Ripke et al. 2011 Nat Genet; Stahl et al. 2012 Nat Genet; Deloukas et al. 2013 Nat Genet; Ripke et al. 2013 Nat Genet; Chatterjee et al. 2013 Nat Genet; Dudbridge 2013 PLoS Genet; PGC-SCZ 2014 Nature
Pruning + Thresholding is widely used, but does not attain maximum prediction accuracy Simulations at different proportions p of causal SNPs:
[Figure: simulation results, with panels spanning non-infinitesimal to infinitesimal architectures and a reference label $h_g^2$]
Vilhjalmsson et al. 2015 AJHG
LDpred computes posterior means under a point-normal prior, accounting for LD
$$\hat{\phi}_k = \sum_{i \,\in\, \text{all GWAS SNPs}} E(\beta_i \mid \hat{\beta}_i)\, x_{ik}$$
$\phi_k$ = phenotype for sample k; $\beta_i$ = effect size for SNP i; $x_{ik}$ = genotype for SNP i, sample k
where $E(\beta_i \mid \hat{\beta}_i)$ are posterior mean effect sizes based on
• point-normal prior with 2 parameters:
  $h_g^2$ = heritability explained by SNPs (estimated from training data)
  p = proportion of causal SNPs (optimized in validation samples)
• LD from a reference panel
  Use validation samples as LD reference (restrict to SNPs with validation data)
Vilhjalmsson et al. 2015 AJHG
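In the special case of no LD between SNPs, posterior means can be computed analytically
$$E(\beta_i \mid \hat{\beta}_i) = \frac{h_g^2}{h_g^2 + Mp/N}\; p_i\, \hat{\beta}_i$$
$h_g^2$ = heritability explained by SNPs; p = proportion of causal SNPs; M = number of (unlinked) SNPs; N = number of training samples
where
$$p_i = \frac{\dfrac{p}{\sqrt{h_g^2/(Mp) + 1/N}}\, \exp\!\left(-\dfrac{\hat{\beta}_i^2}{2\,(h_g^2/(Mp) + 1/N)}\right)}{\dfrac{p}{\sqrt{h_g^2/(Mp) + 1/N}}\, \exp\!\left(-\dfrac{\hat{\beta}_i^2}{2\,(h_g^2/(Mp) + 1/N)}\right) + \dfrac{1-p}{\sqrt{1/N}}\, \exp\!\left(-\dfrac{\hat{\beta}_i^2}{2\,(1/N)}\right)}$$
is the posterior probability that $\beta_i \neq 0$, i.e. SNP i is causal
(generalizes uniform shrink, which is recovered when p = 1: infinitesimal prior, no LD)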
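The no-LD posterior mean under a point-normal prior is the product of a shrink factor, the posterior probability that the SNP is causal, and the estimated effect. The sketch below evaluates it directly; the function name and toy parameter values are assumptions, and for very large $N \hat{\beta}_i^2$ a log-space implementation would be needed to avoid underflow.

```python
import numpy as np

def posterior_mean_no_ld(beta_hat, h2g, p, m, n):
    """Return (posterior mean effects, posterior causal probabilities)."""
    beta_hat = np.asarray(beta_hat, dtype=float)
    var_causal = h2g / (m * p) + 1.0 / n   # variance of beta_hat if SNP is causal
    var_null = 1.0 / n                     # variance of beta_hat if SNP is null
    dens_causal = p / np.sqrt(var_causal) * np.exp(-0.5 * beta_hat**2 / var_causal)
    dens_null = (1.0 - p) / np.sqrt(var_null) * np.exp(-0.5 * beta_hat**2 / var_null)
    post_causal = dens_causal / (dens_causal + dens_null)
    shrink = h2g / (h2g + m * p / n)
    return shrink * post_causal * beta_hat, post_causal

# Toy values (made up): 1% of 1,000 SNPs causal, 1,000 training samples.
beta_hat = np.array([0.0, 0.05, 0.2])
means, probs = posterior_mean_no_ld(beta_hat, h2g=0.5, p=0.01, m=1000, n=1000)
```

Small estimates are shrunk almost to zero because their posterior causal probability is low, while large estimates are left nearly unchanged: this is the non-uniform shrink.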
In the special case of infinitesimal prior (with LD), posterior means can be computed analytically
$$E(\beta \mid \hat{\beta}) = \left(D + \frac{M}{N h_g^2}\, I\right)^{-1} \hat{\beta}$$
$h_g^2$ = heritability explained by SNPs; M = number of (unlinked) SNPs; N = number of training samples
where D is an LD matrix from a reference panel, and $\beta$, $\hat{\beta}$ are the vectors of true and estimated effect sizes
(reduces to uniform shrink when D = I: infinitesimal prior, no LD)
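With an infinitesimal prior and LD, the posterior mean $(D + \frac{M}{N h_g^2} I)^{-1} \hat{\beta}$ amounts to solving one linear system. A minimal sketch follows, assuming D is a small LD matrix for one window of SNPs; the function name and toy values are hypothetical.

```python
import numpy as np

def posterior_mean_infinitesimal(beta_hat, ld, h2g, m, n):
    """Solve (D + M / (N h2g) * I) x = beta_hat for the posterior means x."""
    beta_hat = np.asarray(beta_hat, dtype=float)
    ld = np.asarray(ld, dtype=float)
    system = ld + (m / (n * h2g)) * np.eye(ld.shape[0])
    return np.linalg.solve(system, beta_hat)

# Toy LD window (made up): the first two SNPs are correlated.
ld = np.array([[1.0, 0.6, 0.0],
               [0.6, 1.0, 0.0],
               [0.0, 0.0, 1.0]])
beta_hat = np.array([0.20, 0.13, 0.01])
with_ld = posterior_mean_infinitesimal(beta_hat, ld, h2g=0.5, m=1000, n=1000)
no_ld = posterior_mean_infinitesimal(beta_hat, np.eye(3), h2g=0.5, m=1000, n=1000)
```

Solving the system is preferable to forming the matrix inverse explicitly, and with D = I the result matches the uniform shrink.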
General case of non-infinitesimal prior with LD: posterior means cannot be computed analytically
Possible solutions:
• Assume 1 causal variant per locus
• Iterative approach
• MCMC
Solution: use MCMC.
Initialize $\beta_i = 0$
At each big iteration
  For each SNP i
    Re-sample $\beta_i$ based on
    • Point-normal prior on $\beta_i$
    • Observed $\hat{\beta} \sim N(D\beta, D/N)$
$$f(\beta_i \mid \hat{\beta}) \propto f(\beta_i)\, \exp\!\left(-\frac{N}{2}\, (\hat{\beta} - D\beta)^T D^{-1} (\hat{\beta} - D\beta)\right)$$
where $f(\beta_i)$ reflects the point-normal prior (based on $h_g^2$ and p)
100 big iterations generally suffice for convergence
Rao-Blackwellization: average the posterior means sampled
Related MCMC methods for prediction from raw genotypes are described in Erbe et al. 2012 J Dairy Sci, Zhou et al. 2013 PLoS Genet, Moser et al. 2015 PLoS Genet
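The MCMC steps above can be sketched as a small Gibbs sampler. This is an illustrative reading of the slide, not the LDpred implementation: it assumes D has a unit diagonal, in which case the conditional for $\beta_i$ depends on the data only through the residualized estimate $\hat{\beta}_i - \sum_{j \neq i} D_{ij} \beta_j$, and it adds a `burn_in` parameter that the slide does not mention. All names and toy values are hypothetical.

```python
import numpy as np

def ldpred_gibbs_sketch(beta_hat, ld, h2g, p, m, n,
                        n_iter=100, burn_in=10, seed=0):
    """Rao-Blackwellized posterior means under a point-normal prior with LD."""
    if not (0.0 < p <= 1.0) or burn_in >= n_iter:
        raise ValueError("need 0 < p <= 1 and burn_in < n_iter")
    rng = np.random.default_rng(seed)
    beta_hat = np.asarray(beta_hat, dtype=float)
    ld = np.asarray(ld, dtype=float)
    var_causal = h2g / (m * p)           # prior variance of a causal effect
    var_total = var_causal + 1.0 / n     # variance of the estimate if causal
    shrink = var_causal / var_total      # equals h2g / (h2g + M p / N)
    beta = np.zeros_like(beta_hat)       # initialize beta_i = 0
    mean_sum = np.zeros_like(beta_hat)
    for iteration in range(n_iter):      # "big" iterations
        for i in range(beta_hat.shape[0]):
            # Estimate for SNP i after removing LD contributions of other SNPs.
            resid = beta_hat[i] - ld[i] @ beta + ld[i, i] * beta[i]
            if p == 1.0:
                post_causal = 1.0
            else:
                log_odds = (np.log(p / (1.0 - p))
                            + 0.5 * np.log((1.0 / n) / var_total)
                            + 0.5 * resid**2 * (n - 1.0 / var_total))
                post_causal = 1.0 / (1.0 + np.exp(-log_odds))
            cond_mean = shrink * resid   # mean of beta_i given that it is causal
            if rng.random() < post_causal:
                beta[i] = rng.normal(cond_mean, np.sqrt(shrink / n))
            else:
                beta[i] = 0.0
            if iteration >= burn_in:
                # Rao-Blackwellization: accumulate the conditional posterior mean.
                mean_sum[i] += post_causal * cond_mean
    return mean_sum / (n_iter - burn_in)

# Toy LD window (made up): the first two SNPs are correlated.
ld = np.array([[1.0, 0.6, 0.0],
               [0.6, 1.0, 0.0],
               [0.0, 0.0, 1.0]])
beta_hat = np.array([0.20, 0.13, 0.01])
weights = ldpred_gibbs_sketch(beta_hat, ld, h2g=0.5, p=0.01, m=1000, n=1000)
```

Averaging the conditional posterior means rather than the sampled values reduces Monte Carlo noise; with D = I the sketch reproduces the analytic no-LD posterior mean exactly.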
LDpred performs well in simulations Simulations with real genotypes, 1% of SNPs causal
Understanding polygenic prediction Let’s hide away and dance. -- Freddie K.
Let’s hide away with data. -- Alkes
LDpred performs well on within-cohort prediction of WTCCC traits …
[Figure: within-cohort prediction accuracy for WTCCC traits; accuracy measures $R^2_{nag}$, $R^2_{obs}$, $R^2_{liab}$ (see Lee et al. 2012 Genet Epidemiol); figure annotations: “Dominated by HLA”, “Do not validate in new cohort”]
Data from WTCCC 2007 Nature. Results are similar to MCMC-based methods that require raw genotypes: Zhou et al. 2013 PLoS Genet, Moser et al. 2015 PLoS Genet
… but within-cohort prediction accuracy may be too good to be true
$R^2_{nag}$                                 CAD       T2D
Training: WTCCC, Validation: WTCCC          0.0451    0.0467
Training: WTCCC, Validation: WGHS           0.0048    0.0095
Results presented for LDpred; similar relative results for other methods.
Cryptic relatedness? Population structure? (Wray et al. 2013 Nat Rev Genet)
LDpred performs well on summary statistics with independent validation cohorts
PGC-SCZ 2014 Nature; MGS replication sample
[Figure: prediction accuracy in independent validation cohorts; panel labels: Training N=70K, Training N=30K, Training N=70K, Training N=90K, Training N=60K]
Height: complexities due to population stratification. Including PCs can improve prediction accuracy. (Chen et al. 2015 Genet Epidemiol)
Training N=130K (Lango Allen et al. 2010 Nature)
Conclusions …
• Explicitly modeling both LD and non-infinitesimal architectures improves polygenic prediction from summary statistics.
• Polygenic prediction should be evaluated using independent validation cohorts.
• Although polygenic predictions are not yet clinically useful, prediction accuracies will increase as sample sizes increase (bounded by heritability explained by SNPs; $h_g^2$).
… and Future directions
• Polygenic prediction in non-European samples is challenging. How to combine training data from Europeans (large sample size) with training data from target population (small sample size)? (cross-population genetic correlation; Poster 1477 Brown)
• Enrichment of heritability in functional annotation classes could potentially be used to improve polygenic prediction (Poster 1357 Chatterjee)
• Methods for large raw genotype data sets (e.g. UK Biobank) should be developed in parallel with summary statistic methods (Platform talk 38 Loh; Platform talk 170 Young)
Acknowledgements Bjarni Vilhjalmsson + Vilhjalmsson et al. 2015 AJHG co-authors
Everyone in alkesgrp.
Please check out our other ASHG 2015 talks:
• Platform talk 11 Gusev “Large-scale transcriptome-wide association study …”
• Platform talk 38 Loh “Contrasting regional architectures of schizophrenia …”
• Platform talk 196 Bhatia “Haplotypes of common SNPs explain a large …”
• Platform talk 352 Galinsky “Population differentiation analysis of 54,734 …”
• Platform talk 346 Hayeck “Mixed model association with family-biased …”
• Platform talk 354 Palamara “Leveraging distant relatedness to quantify …”