ENCODE UNIFORM DATA PROCESSING PIPELINES WORKSHOP
Ben Hitz and J. Seth Strattan, ENCODE DCC
ENCODE Research Applications and Users Meeting, July 2015
To set up your environment, see the link below under "Workshop Session 3: ENCODE Uniform Processing":
https://www.encodeproject.org/tutorials/encode-users-meeting-2015/
J. Seth Strattan, PhD ENCODE DCC
Pipelines Demonstration and Exercise
To set up an account: https://www.encodeproject.org/tutorials/encode-users-meeting-2015/
Click "Prepare to run web-based pipelines"
Log in ->
What would you like to learn?
How many of you:
1. … have downloaded ENCODE data and intersected it with other data?
2. … have already implemented an analysis pipeline based on ENCODE?
3. … could repeat an ENCODE analysis (from fastqs) to generate IDR-thresholded sets of peaks?
4. … want to repeat one of the ENCODE analysis pipelines on your data?
5. … need to access ENCODE data but found it difficult or don't know where to begin?
Pipelines Workshop in Context
Data Access, Visualization, Interpretation, Processing, Advanced Analysis
• Eurie: ENCODE Portal
• Pauline: UCSC Genome Browser
• Emily: ENSEMBL Browser
• Emily: VEP
• Jill: HaploReg and RegulomeDB
• Ben & Seth: ENCODE Processing Pipelines (Input Files, Outputs plumbed to inputs, Output Files)
• Michael: Factorbook
• Yanli: Element and 3D Browser
• Camden: SOM's
• Luca: ChromHMM
DCC Delivers ENCODE Data
Sample -> Library -> Primary Data -> Processed Data
Example primary data (FASTQ reads):
@BI:SL-HAB:D0RRAACXX:8:2309:21201:7829 1:X:0:GCCGTCGA
CTAACCCTAACCCTAACCCTAACCCTAACCCTAACC
+
CCCFFFFFHHHHHJJJJJJJGJJJJIIJJJJGGIGJ
@BI:SL-HAB:D0RRAACXX:8:2113:4623:40045 1:X:0:GCCGTCGA
GGGTTAGGGTTAGGGTTAGGGTTAGGGTTAGGGTTA
+
??@ADDBDH:CDHHI+AEFHI?GGHII:EFIII?F=
AWS S3 Bucket: ENCODE Files
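The primary data shown on this slide are FASTQ records: four lines per read (name, sequence, separator, quality string). As a minimal sketch of how such records can be read, assuming uncompressed input with exactly four lines per record, the following parser yields one record at a time. The names `FastqRecord` and `read_fastq` are illustrative only and are not part of any ENCODE tool.

```python
import io
from typing import Iterator, NamedTuple, TextIO


class FastqRecord(NamedTuple):
    name: str      # header line without the leading "@"
    sequence: str  # base calls
    quality: str   # per-base quality string, same length as sequence


def read_fastq(handle: TextIO) -> Iterator[FastqRecord]:
    """Yield records from a four-line-per-record FASTQ stream."""
    while True:
        header = handle.readline().rstrip("\n")
        if not header:  # end of input (or a blank line) stops the reader
            return
        sequence = handle.readline().rstrip("\n")
        separator = handle.readline().rstrip("\n")
        quality = handle.readline().rstrip("\n")
        if not header.startswith("@") or not separator.startswith("+"):
            raise ValueError(f"malformed FASTQ record near {header!r}")
        if len(sequence) != len(quality):
            raise ValueError(f"sequence/quality length mismatch in {header!r}")
        yield FastqRecord(header[1:], sequence, quality)


# One record copied from the slide.
EXAMPLE = (
    "@BI:SL-HAB:D0RRAACXX:8:2309:21201:7829 1:X:0:GCCGTCGA\n"
    "CTAACCCTAACCCTAACCCTAACCCTAACCCTAACC\n"
    "+\n"
    "CCCFFFFFHHHHHJJJJJJJGJJJJIIJJJJGGIGJ\n"
)
records = list(read_fastq(io.StringIO(EXAMPLE)))
```

Real ENCODE fastq files are typically gzip-compressed, so in practice the handle would come from something like `gzip.open(path, "rt")`.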
ENCODE DCC Delivers ENCODE Metadata
[Diagram: Sample, Library, Primary Data (FASTQ reads), Processed Data]
ENCODE Analysis Pipelines as Deliverables
[Diagram: Sample, Library, Primary Data (FASTQ reads), Processed Data]
Goals:
1. Deploy ENCODE-defined pipelines for ChIP-seq, RNA-seq, DNase-seq, methylation.
2. Use those pipelines to generate the standard ENCODE peaks, quantitations, CpG.
3. Capture metadata to make clear what software, versions, parameters, inputs were used.
4. Capture, accession, and distribute the output.
5. Deliver exactly the same pipelines in a form that anyone can run on their data or with ENCODE data: one experiment or 1000.
Replicability – Provenance – Ease of Use – Scalability
Deployment Platform Considerations
[Diagram: Sample, Library, Primary Data (FASTQ reads), Processed Data]

Platform               Develop   Share     Run       Elastic            Provenance  Cost
HPC Cluster (Scripts)  Hard      Hard      Hard      Cluster-Dependent  Moderate    Obscure/Subsidized
HPC Container          Hard      Moderate  Moderate  Cluster-Dependent  Good        Obscure/Subsidized
Web/Cloud              Moderate  Easy      Easy      Highly             Excellent   Apparent but Low

Replicability – Provenance – Ease of Use – Scalability
We chose to deploy first to a web/cloud-based platform, DNAnexus.
Code is open source and adaptable for deployment to your HPC environment:
https://github.com/ENCODE-DCC
Pipelines Demonstration and Exercise
To set up an account: https://www.encodeproject.org/tutorials/encode-users-meeting-2015/
Click "Prepare to run web-based pipelines"
Log in ->
Ben Hitz, PhD ENCODE DCC
Pipelines Demonstration and Exercise
Select: Featured Projects: .. ENCODE Uniform Processing Pipelines
Set Up the Demo Project
Project Overview
A Workflow
[Screenshot with numbered callouts: 1., 2. INPUTS, 3.]
The Monitor Tab
An Applet
An RNA-seq Workflow
Interregnum
What/Where are the ENCODE Results?
While the example is running:
• What steps do the pipelines run?
• What inputs do they take?
• What outputs do they produce?
• Where are the Uniform ENCODE Results?
Schema: ENCODE ChIP-seq IDR Pipeline
https://github.com/ENCODE-DCC/chip-seq-pipeline

Pipeline steps:
fastq reads -> Map -> BAM -> Pool Replicates, Subsample Pseudoreplicates -> BAM (2 pseudoreplicates per replicate, 2 pseudoreplicates per pool) -> Call Peaks -> Peak Calls -> IDR -> IDR-thresholded Peak Calls
Signal Tracks -> bigWig
BAM, BAI: processed, mapped reads

Key Software:
• bwa
• Picard markDuplicates
• samtools
• MACS2 (Signal tracks)
• Target TFs: SPP (PeakSeq, GEM future), IDR2
• Target Histone Mods: MACS2 for peaks, Overlap thresholding, IDR2 (future)

Input Files:
• fastqs (SE or PE)
• Two biological replicates
• Matched controls

Output Files:
• One bam per replicate
• bigWig fold signal over control
• bigWig p-value signal over control
• bed/bigBed true replicates peaks
• bed/bigBed pooled replicates peaks
• bed/bigBed IDR thresholded peaks
• bed/bigBed Replicated peaks

QA Metrics:
• NRF (Non-redundant fraction)
• PBC1 and 2 (PCR bottleneck coefficients)
• Number of distinct uniquely-mapping reads
• NSC/RSC (Strand cross-correlation)
• IDR Rescue Ratio
• IDR Self-Consistency Ratio
• IDR Reproducibility Test
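The schema pools replicates and subsamples two pseudoreplicates per replicate and per pool before peak calling. The sketch below illustrates that idea only: it shuffles a collection of reads and splits it into two halves. It is not the code in the chip-seq-pipeline repository, and the function names, the in-memory lists, and the fixed seed are all assumptions made for illustration.

```python
import random
from typing import List, Sequence, Tuple, TypeVar

T = TypeVar("T")


def pool_replicates(*replicates: Sequence[T]) -> List[T]:
    """Concatenate the reads of all replicates into one pooled list."""
    pooled: List[T] = []
    for reads in replicates:
        pooled.extend(reads)
    return pooled


def make_pseudoreplicates(reads: Sequence[T], seed: int = 0) -> Tuple[List[T], List[T]]:
    """Randomly split reads into two disjoint halves (pseudoreplicates)."""
    shuffled = list(reads)
    random.Random(seed).shuffle(shuffled)  # seeded so the split is repeatable
    half = len(shuffled) // 2
    return shuffled[:half], shuffled[half:]


# Toy reads stand in for aligned reads from two biological replicates.
rep1 = [f"rep1_read{i}" for i in range(6)]
rep2 = [f"rep2_read{i}" for i in range(4)]

rep1_pr1, rep1_pr2 = make_pseudoreplicates(rep1)
pooled_pr1, pooled_pr2 = make_pseudoreplicates(pool_replicates(rep1, rep2))
```

Peaks called on each pair of pseudoreplicates can then be compared in the same way as peaks called on the true replicates.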
ENCODE ChIP-seq Quality Metrics: Resources
https://github.com/ENCODE-DCC/chip-seq-pipeline
[Pipeline diagram: fastq reads, Map, BAM, Pool Replicates, Subsample Pseudoreplicates, Call Peaks, Peak Calls, Signal Tracks, bigWig, IDR, IDR-thresholded Peak Calls]

Estimates:
• Depth
• Library Complexity
• ChIP Quality
• Replicate Concordance

Description:
• Number of uniquely mapping reads
• Number of distinct uniquely mapping reads
• Non-Redundant Fraction
• PCR Bottleneck Coefficient
• Normalized Strand Cross-Correlation
• Relative Strand Cross-Correlation
• IDR Rescue Ratio
• IDR Self-Consistency Ratio
• IDR Reproducibility Test

References:
• Jung YL, et al. Nucleic Acids Research. 2014;42(9):e74.
• Landt S, et al. Genome Research. 2012;22:1813-1831.
• Li Q, et al. Annals of Applied Statistics. 2011;5(3):1752-1779.
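The library complexity metrics listed above can be computed from the mapped positions of uniquely mapping reads. The sketch below uses the definitions commonly given for these metrics: NRF is distinct positions divided by total reads, PBC1 is positions with exactly one read divided by distinct positions, and PBC2 is positions with exactly one read divided by positions with exactly two reads. The function name and the tuple representation of a read position are assumptions for illustration, not the pipeline's implementation.

```python
from collections import Counter
from typing import Dict, Hashable, Iterable


def library_complexity(positions: Iterable[Hashable]) -> Dict[str, float]:
    """Compute NRF, PBC1 and PBC2 from one mapped position per read."""
    counts = Counter(positions)
    total_reads = sum(counts.values())
    if total_reads == 0:
        raise ValueError("no reads supplied")
    distinct = len(counts)                                 # positions with >= 1 read
    one_read = sum(1 for n in counts.values() if n == 1)   # positions with exactly 1 read
    two_reads = sum(1 for n in counts.values() if n == 2)  # positions with exactly 2 reads
    return {
        "NRF": distinct / total_reads,
        "PBC1": one_read / distinct,
        "PBC2": one_read / two_reads if two_reads else float("inf"),
    }


# Toy data: (chromosome, 5' position, strand) for five uniquely mapping reads.
toy_positions = [
    ("chr1", 100, "+"),
    ("chr1", 100, "+"),  # duplicate of the previous read
    ("chr1", 200, "-"),
    ("chr2", 50, "+"),
    ("chr2", 75, "+"),
]
metrics = library_complexity(toy_positions)
```

For the toy data, five reads fall on four distinct positions, three of which carry one read and one of which carries two, so NRF is 0.8, PBC1 is 0.75 and PBC2 is 3.0.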
ENCODE ChIP-seq on the Cloud
Uniformly Processed Data On the ENCODE Portal
ChIP-seq Example: https://www.encodeproject.org/experiments/ENCSR087PLZ/
• Pipeline graph shows relationships between files
• Click on files to see more file metadata and download links
• Click on steps to see more software metadata and download links
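The same file metadata shown on the experiment page can also be read programmatically, since the portal returns JSON when `?format=json` is appended to a page URL. The following is a minimal sketch under stated assumptions: the field names `files`, `accession`, `file_format`, `output_type` and `href` are assumed to be present in the experiment JSON, and the stub record (including its accessions) is hypothetical data used so the example runs without network access.

```python
import json
from typing import Any, Dict, List, Tuple
from urllib.request import Request, urlopen

PORTAL = "https://www.encodeproject.org"


def experiment_url(accession: str) -> str:
    """Build the JSON URL for an experiment page."""
    return f"{PORTAL}/experiments/{accession}/?format=json"


def fetch_experiment(accession: str) -> Dict[str, Any]:
    """Download experiment metadata (needs network access; not called below)."""
    request = Request(experiment_url(accession), headers={"Accept": "application/json"})
    with urlopen(request) as response:
        return json.load(response)


def summarize_files(experiment: Dict[str, Any]) -> List[Tuple[str, str, str, str]]:
    """Return (accession, format, output type, download URL) for each file."""
    rows = []
    for item in experiment.get("files", []):
        rows.append((
            item.get("accession", ""),
            item.get("file_format", ""),
            item.get("output_type", ""),
            PORTAL + item.get("href", ""),
        ))
    return rows


# Hypothetical stub standing in for a real portal response.
stub = {
    "accession": "ENCSR087PLZ",
    "files": [
        {"accession": "ENCFF000AAA", "file_format": "bam",
         "output_type": "alignments", "href": "/files/ENCFF000AAA/@@download/ENCFF000AAA.bam"},
        {"accession": "ENCFF000AAB", "file_format": "bigWig",
         "output_type": "fold change over control", "href": "/files/ENCFF000AAB/@@download/ENCFF000AAB.bigWig"},
    ],
}
summary = summarize_files(stub)
```

With network access, `summarize_files(fetch_experiment("ENCSR087PLZ"))` would apply the same summary to the live record, provided the assumed field names match.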
Schema: ENCODE WGBS Pipeline
https://github.com/ENCODE-DCC/dna-me-pipeline

For each replicate:
FASTQ (SE/PE) -> Trim Reads -> Map (converted genome) -> BAM (Bismark) -> Extract methyl calls -> BigBEDs, BigWigs (.bb)
Map to λ genome -> Non bisulfite conversion rate -> QC metrics

BISMARK (v 0.10)
Bed/BigBed files for:
• CG context
• CHG context
• CHH context
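The λ genome step exists because unmethylated lambda DNA should be fully converted by bisulfite treatment, so any cytosine on lambda that is still called methylated indicates failed conversion. A minimal sketch of that QC calculation follows, assuming methylated and unmethylated call counts per context are already available; the function name and the counts are hypothetical, not pipeline output.

```python
from typing import Dict, Tuple


def non_conversion_rate(methylated_calls: int, unmethylated_calls: int) -> float:
    """Fraction of lambda cytosines still called methylated after bisulfite treatment."""
    total = methylated_calls + unmethylated_calls
    if total == 0:
        raise ValueError("no cytosine calls on the lambda genome")
    return methylated_calls / total


# Hypothetical (methylated, unmethylated) call counts on lambda, per context.
lambda_calls: Dict[str, Tuple[int, int]] = {
    "CG": (5, 995),
    "CHG": (2, 998),
    "CHH": (10, 3990),
}
rates = {context: non_conversion_rate(m, u) for context, (m, u) in lambda_calls.items()}
```

A low rate in every context suggests the conversion worked; what counts as acceptable is a threshold set by the pipeline's QC criteria and is not stated on this slide.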
Schema: ENCODE RNA-seq Pipeline
https://github.com/ENCODE-DCC/long-rna-seq-pipeline

For each replicate (FASTQ, SE/PE):
Map Reads -> BAM (tophat) -> Signal Tracks -> BigWigs (.bw)
Map Reads -> BAM (STAR) -> Signal Tracks -> BigWigs (.bw)
Quantification -> RSEM file
Across replicates: IDR/MAD QC & filtered quantification

For each Mapper (STAR, tophat)
BAM files:
• mapped to genome
• mapped to transcriptome
BigWig files:
• plus/minus strand (paired)
• uniquely mapped
• multi+uniquely mapped
Quantifications (RSEM):
• genome
• transcriptome
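The MAD QC step compares quantifications between replicates. As an illustration only, the sketch below computes the median absolute deviation of log2 ratios between two replicates for genes expressed in both. The filtering threshold, the function name and the toy values are assumptions; the actual QC script in the long-rna-seq-pipeline repository may define the statistic differently.

```python
import math
from statistics import median
from typing import Dict


def mad_of_log_ratios(rep1: Dict[str, float], rep2: Dict[str, float],
                      min_value: float = 1.0) -> float:
    """Median absolute deviation of log2(rep1/rep2) over genes expressed in both."""
    ratios = [
        math.log2(rep1[gene] / rep2[gene])
        for gene in rep1.keys() & rep2.keys()
        if rep1[gene] >= min_value and rep2[gene] >= min_value
    ]
    if not ratios:
        raise ValueError("no genes pass the expression filter in both replicates")
    center = median(ratios)
    return median(abs(r - center) for r in ratios)


# Toy quantifications (for example, TPM per gene); values are hypothetical.
replicate_1 = {"geneA": 2.0, "geneB": 4.0, "geneC": 8.0, "geneD": 0.1}
replicate_2 = {"geneA": 2.0, "geneB": 2.0, "geneC": 2.0, "geneD": 0.2}
mad = mad_of_log_ratios(replicate_1, replicate_2)
```

In the toy data geneD is filtered out, the remaining log2 ratios are 0, 1 and 2, and their median absolute deviation is 1.0. Smaller values indicate better agreement between replicates.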
Uniformly Processed Data On the ENCODE Portal
RNA-seq Example: https://www.encodeproject.org/experiments/ENCSR368QPC/
• Pipeline graph shows relationships between files
• Click on files to see more file metadata and download links
• Click on steps to see more software metadata and download links
Pick up the Results
Visualize!
Visualize!
Pipeline Workshop Summary
DCC Goals:
1. Deploy ENCODE-defined pipelines for ChIP-seq, RNA-seq, DNase-seq, methylation.
2. Use those pipelines to generate the standard ENCODE peaks, quantitations, CpG.
3. Capture metadata to make clear what software, versions, parameters, inputs were used.
4. Capture, accession, and distribute the output.
5. Deliver exactly the same pipelines in a form that anyone can run on their data or with ENCODE data: one experiment or 1000.
Replicability – Provenance – Ease of Use – Scalability
Contributors

ENCODE Data Coordinating Center
Mike Cherry, PI, Stanford
Jim Kent, co-PI, UCSC
Eurie Hong, Project Manager
Pipeline Developers:
• Ben Hitz, WGBS, Software Lead
• Tim Dreszer, RNA-seq, DNAse-seq
• J. Seth Strattan, ChIP-seq
Portal Developers:
• Laurence Rowe
• Nikhil Podduturi
• Forrest Tanaka
Data Wranglers:
• Esther Chan
• Jean Davidson
• Venkat Malladi
• Cricket Sloan
• J. Seth Strattan
QA & Biocuration Assistance:
• Brian Lee
• Marcus Ho
• Aditi Narayanan
Support Staff:
• Stuart Miyasato
• Matt Simison
• Zhenhua Wang
@encodedcc
encode-[email protected]

ENCODE Data Analysis Center
Zhiping Weng, PI, University of Massachusetts
Mark Gerstein, co-PI, Yale
Methylation:
• Junko Tsuji, U Mass
• Eric Mendenhall, U Alabama, HAIB
RNA-seq:
• Alex Dobin, CSHL
• Carrie Davis, CSHL
• Rafael Irizarry, Harvard
• Xintao Wei, UConn
• Brent Gravely, UConn
• Colin Dewey, U Wisconsin
• Roderic Guigó, CRG
• Sarah Djebali, CRG
ChIP-seq:
• Anshul Kundaje, Stanford
• Nathan Boley, Stanford
• Jin Lee, Stanford

DNAnexus
• Mike Lin
• Andrey Kislyuk
• Singer Ma
• Brett Hannigan
• Ohad Rodeh
• Joe Dale
• George Asimenos

https://github.com/ENCODE-DCC/