ENCODE UNIFORM DATA PROCESSING PIPELINES WORKSHOP

Author: Garey Owen
ENCODE Uniform Data Processing Pipelines Workshop
Ben Hitz and J. Seth Strattan, ENCODE DCC
ENCODE Research Applications and Users Meeting, July 2015

To set up your environment, see the link "Workshop Session 3: ENCODE Uniform Processing" at:
https://www.encodeproject.org/tutorials/encode-users-meeting-2015/

J. Seth Strattan, PhD, ENCODE DCC

Pipelines Demonstration and Exercise
To set up an account, go to:
https://www.encodeproject.org/tutorials/encode-users-meeting-2015/
Click "Prepare to run web-based pipelines" and log in.

What would you like to learn?
How many of you:
1. … have downloaded ENCODE data and intersected it with other data?
2. … have already implemented an analysis pipeline based on ENCODE?
3. … could repeat an ENCODE analysis (from fastqs) to generate IDR-thresholded sets of peaks?
4. … want to repeat one of the ENCODE analysis pipelines on your data?
5. … need to access ENCODE data but found it difficult or don't know where to begin?

Pipelines Workshop in Context
Data Access, Visualization, Interpretation, Processing, Advanced Analysis
• Eurie: ENCODE Portal

Pipelines Workshop in Context
Data Access, Visualization, Interpretation, Processing, Advanced Analysis
• Eurie: ENCODE Portal
• Pauline: UCSC Genome Browser
• Emily: ENSEMBL Browser

Pipelines Workshop in Context
Data Access, Visualization, Interpretation, Processing, Advanced Analysis
• Eurie: ENCODE Portal
• Pauline: UCSC Genome Browser
• Emily: ENSEMBL Browser
• Emily: VEP
• Jill: HaploReg and RegulomeDB

Pipelines Workshop in Context
Data Access, Visualization, Interpretation, Processing, Advanced Analysis
• Eurie: ENCODE Portal
• Pauline: UCSC Genome Browser
• Emily: ENSEMBL Browser
• Emily: VEP
• Jill: HaploReg and RegulomeDB
• Ben & Seth: ENCODE Processing Pipelines (Input Files, outputs plumbed to inputs, Output Files)

Pipelines Workshop in Context
Data Access, Visualization, Interpretation, Processing, Advanced Analysis
• Eurie: ENCODE Portal
• Pauline: UCSC Genome Browser
• Emily: ENSEMBL Browser
• Emily: VEP
• Jill: HaploReg and RegulomeDB
• Ben & Seth: ENCODE Processing Pipelines (Input Files, outputs plumbed to inputs, Output Files)
• Yanli: Element and 3D Browser
• Camden: SOMs
• Michael: Factorbook
• Luca: ChromHMM

Pipelines Workshop in Context
Data Access, Visualization, Interpretation, Processing, Advanced Analysis
• Eurie: ENCODE Portal
• Pauline: UCSC Genome Browser
• Emily: ENSEMBL Browser
• Emily: VEP
• Jill: HaploReg and RegulomeDB
• Ben & Seth: ENCODE Processing Pipelines (Input Files, outputs plumbed to inputs, Output Files)
• Yanli: Element and 3D Browser
• Camden: SOMs
• Michael: Factorbook
• Luca: ChromHMM

DCC Delivers ENCODE Data
[Diagram: Sample, Library, Primary Data (example FASTQ reads), Processed Data; AWS S3 Bucket holding ENCODE Files]

ENCODE DCC Delivers ENCODE Metadata
[Diagram: Sample, Library, Primary Data (example FASTQ reads), Processed Data]

ENCODE Analysis Pipelines as Deliverables
[Diagram: Sample, Library, Primary Data (example FASTQ reads), Processed Data]

Goals:
1. Deploy ENCODE-defined pipelines for ChIP-seq, RNA-seq, DNase-seq, methylation.
2. Use those pipelines to generate the standard ENCODE peaks, quantitations, CpG.
3. Capture metadata to make clear what software, versions, parameters, inputs were used.
4. Capture, accession, and distribute the output.
5. Deliver exactly the same pipelines in a form that anyone can run on their data or with ENCODE data – one experiment or 1000.
Replicability – Provenance – Ease of Use – Scalability
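Goal 3 (recording which software, versions, parameters, and inputs produced each output) can be pictured as a small record attached to every pipeline step. The sketch below is only an illustration: the class name, field names, version string, parameter, and file name are all hypothetical and are not the DCC's metadata schema.

```python
import hashlib
import json
from dataclasses import dataclass, field, asdict


@dataclass
class StepProvenance:
    """Hypothetical provenance record for one pipeline step (not the DCC schema)."""
    step_name: str
    software: str
    version: str
    parameters: dict = field(default_factory=dict)
    input_files: list = field(default_factory=list)

    def to_json(self) -> str:
        # sort_keys makes the serialization deterministic, so identical
        # runs produce byte-identical records.
        return json.dumps(asdict(self), sort_keys=True)

    def fingerprint(self) -> str:
        # Hash of the canonical JSON: two steps share a fingerprint only if
        # software, version, parameters, and inputs all match.
        return hashlib.sha256(self.to_json().encode("utf-8")).hexdigest()


mapping = StepProvenance(
    step_name="map",
    software="bwa",
    version="x.y.z",                 # placeholder; record the real version at run time
    parameters={"threads": 4},       # illustrative parameter only
    input_files=["rep1.fastq.gz"],   # illustrative file name
)
same_again = StepProvenance("map", "bwa", "x.y.z", {"threads": 4}, ["rep1.fastq.gz"])
different = StepProvenance("map", "bwa", "x.y.z", {"threads": 8}, ["rep1.fastq.gz"])
```

Comparing fingerprints is one simple way to check replicability: `mapping` and `same_again` match, while `different` does not because one parameter changed.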


Deployment Platform Considerations
[Diagram: Sample, Library, Primary Data (example FASTQ reads), Processed Data]

Platform               Develop   Share     Run       Elastic            Provenance  Cost
HPC Cluster (Scripts)  Hard      Hard      Hard      Cluster-Dependent  Moderate    Obscure/Subsidized
HPC Container          Hard      Moderate  Moderate  Cluster-Dependent  Good        Obscure/Subsidized
Web/Cloud              Moderate  Easy      Easy      Highly             Excellent   Apparent but Low

Replicability – Provenance – Ease of Use – Scalability
We chose to deploy first to a web/cloud-based platform, DNAnexus.
Code is open source and adaptable for deployment to your HPC environment:
https://github.com/ENCODE-DCC

Pipelines Demonstration and Exercise
To set up an account, go to:
https://www.encodeproject.org/tutorials/encode-users-meeting-2015/
Click "Prepare to run web-based pipelines" and log in.

Ben Hitz, PhD, ENCODE DCC

Pipelines Demonstration and Exercise
Select: Featured Projects: ENCODE Uniform Processing Pipelines

Set Up the Demo Project

Project Overview

A Workflow
[Screenshot: workflow view showing the INPUTS panel, with numbered callouts 1 to 3]

The Monitor Tab

An Applet

An RNA-seq Workflow

Interregnum
What/Where are the ENCODE Results?
While the example is running:
• What steps do the pipelines run?
• What inputs do they take?
• What outputs do they produce?
• Where are the Uniform ENCODE Results?

Schema: ENCODE ChIP-seq IDR Pipeline
https://github.com/ENCODE-DCC/chip-seq-pipeline

Pipeline steps (from the diagram):
• fastq reads -> Map -> BAM (BAM, BAI: processed, mapped reads)
• Pool Replicates, Subsample Pseudoreplicates -> BAM (2 pseudoreplicates per replicate, 2 pseudoreplicates per pool)
• Call Peaks -> Peak Calls
• Signal Tracks -> bigWig
• IDR -> IDR-thresholded Peak Calls

Targets: TFs, histone mods

Key Software
• bwa, Picard MarkDuplicates, samtools, MACS2 (signal tracks), SPP (PeakSeq, GEM future), IDR2
• MACS2 for peaks, overlap thresholding, IDR2 (future)

Input Files
• fastqs (SE or PE)
• Two biological replicates
• Matched controls

Output Files
• One bam per replicate
• bigWig fold signal over control
• bigWig p-value signal over control
• bed/bigBed true replicates peaks
• bed/bigBed pooled replicates peaks
• bed/bigBed IDR thresholded peaks
• bed/bigBed Replicated peaks

QA Metrics
• NRF (Non-redundant fraction)
• PBC1 and 2 (PCR bottleneck coefficients)
• Number of distinct uniquely-mapping reads
• NSC/RSC (Strand cross-correlation)
• IDR Rescue Ratio
• IDR Self-Consistency Ratio
• IDR Reproducibility Test
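The "Pool Replicates / Subsample Pseudoreplicates" step can be sketched as follows. This is a minimal illustration that assumes reads fit in memory as Python lists; the function names are hypothetical, and the slide does not say how the actual pipeline shuffles or splits its alignment files.

```python
import random


def split_pseudoreplicates(reads, seed=0):
    """Shuffle reads and split them into two halves of (nearly) equal size."""
    shuffled = list(reads)
    random.Random(seed).shuffle(shuffled)  # seeded so the split is repeatable
    half = len(shuffled) // 2
    return shuffled[:half], shuffled[half:]


def make_pseudoreplicates(rep1, rep2, seed=0):
    """Return 2 pseudoreplicates per replicate and 2 for the pooled reads."""
    pooled = list(rep1) + list(rep2)
    return {
        "rep1": split_pseudoreplicates(rep1, seed),
        "rep2": split_pseudoreplicates(rep2, seed),
        "pool": split_pseudoreplicates(pooled, seed),
    }


# Toy read identifiers standing in for mapped reads.
rep1 = [f"rep1_read{i}" for i in range(10)]
rep2 = [f"rep2_read{i}" for i in range(7)]
pseudo = make_pseudoreplicates(rep1, rep2)
```

Each entry of `pseudo` holds two disjoint read sets that together cover the original reads, which is what lets peak calls on the two halves be compared for consistency.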

ENCODE ChIP-seq Quality Metrics: Resources
https://github.com/ENCODE-DCC/chip-seq-pipeline
[Diagram: ChIP-seq IDR pipeline schema: fastq reads, Map, BAM, Pool Replicates / Subsample Pseudoreplicates, Call Peaks, Signal Tracks, IDR]

Estimates and descriptions
• Depth: Number of uniquely mapping reads
• Library Complexity: Number of distinct uniquely mapping reads; Non-Redundant Fraction; PCR Bottleneck Coefficient
• ChIP Quality: Normalized Strand Cross-Correlation; Relative Strand Cross-Correlation
• Replicate Concordance: IDR Rescue Ratio; IDR Self-Consistency Ratio; IDR Reproducibility Test

References
• Jung YL, et al. Nucleic Acids Research. 2014;42(9):e74
• Landt S, et al. Genome Research. 2012;22:1813-1831
• Li Q, et al. Annals of Applied Statistics. 2011;5(3):1752-1779
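The library-complexity and replicate-concordance metrics named above can be illustrated with a short sketch. The slide gives no formulas, so the definitions used here are assumptions based on how these metrics are commonly described for ENCODE ChIP-seq: NRF as distinct positions over total reads, PBC1 as one-read positions over distinct positions, PBC2 as one-read positions over two-read positions, and the rescue and self-consistency ratios as max/min ratios of peak counts with a cutoff of 2. The function names are hypothetical.

```python
from collections import Counter


def library_complexity(positions):
    """Compute NRF, PBC1 and PBC2 from a non-empty list of read positions.

    `positions` holds one hashable key per uniquely mapped read,
    e.g. (chrom, start, strand).
    """
    counts = Counter(positions)
    total = len(positions)
    distinct = len(counts)                           # positions with >= 1 read
    m1 = sum(1 for c in counts.values() if c == 1)   # positions with exactly 1 read
    m2 = sum(1 for c in counts.values() if c == 2)   # positions with exactly 2 reads
    return {
        "NRF": distinct / total,
        "PBC1": m1 / distinct,
        "PBC2": m1 / m2 if m2 else float("inf"),
    }


def idr_reproducibility(n_true, n_pooled, n_self1, n_self2, cutoff=2.0):
    """Return (rescue ratio, self-consistency ratio, verdict).

    Peak counts must be positive. The cutoff of 2 is an assumed threshold.
    """
    rescue = max(n_pooled, n_true) / min(n_pooled, n_true)
    self_consistency = max(n_self1, n_self2) / min(n_self1, n_self2)
    failures = (rescue > cutoff) + (self_consistency > cutoff)
    verdict = ("pass", "borderline", "fail")[failures]
    return rescue, self_consistency, verdict


# Toy data: 8 reads over 5 distinct positions.
reads = (
    [("chr1", 100, "+")] * 3
    + [("chr1", 200, "+")] * 2
    + [("chr1", 300, "-"), ("chr2", 50, "+"), ("chr2", 75, "-")]
)
metrics = library_complexity(reads)
result = idr_reproducibility(n_true=10000, n_pooled=15000, n_self1=8000, n_self2=20000)
```

With the toy data, three positions carry a single read and one carries two, so a heavily duplicated library would show up as low NRF and PBC values; the second call has one ratio above the cutoff and is therefore reported as borderline.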

ENCODE ChIP-seq on the Cloud

Uniformly Processed Data on the ENCODE Portal
ChIP-seq Example
https://www.encodeproject.org/experiments/ENCSR087PLZ/
• Pipeline graph shows relationships between files
• Click on files to see more file metadata and download links
• Click on steps to see more software metadata and download links
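The same file metadata shown on the experiment page can also be read programmatically. The sketch below rests on several assumptions not stated on the slide: that the portal returns a JSON view when `?format=json` is appended to a page URL, and that each entry in the experiment's `files` list carries `accession`, `file_format`, `output_type`, and `href` fields. The helper names and the file accessions in the stand-in record are made up.

```python
from urllib.parse import urljoin

PORTAL = "https://www.encodeproject.org"


def experiment_json_url(accession):
    """Build the JSON view URL for an experiment (assumes ?format=json works)."""
    return urljoin(PORTAL, f"/experiments/{accession}/?format=json")


def list_files(experiment, file_format=None):
    """Return (accession, file_format, output_type, download URL) tuples."""
    rows = []
    for f in experiment.get("files", []):
        if file_format and f.get("file_format") != file_format:
            continue
        rows.append((
            f.get("accession"),
            f.get("file_format"),
            f.get("output_type"),
            urljoin(PORTAL, f.get("href", "")),
        ))
    return rows


# Stand-in for a parsed JSON response; file accessions are invented.
example = {
    "accession": "ENCSR087PLZ",
    "files": [
        {"accession": "ENCFF000AAA", "file_format": "fastq",
         "output_type": "reads",
         "href": "/files/ENCFF000AAA/@@download/ENCFF000AAA.fastq.gz"},
        {"accession": "ENCFF000BBB", "file_format": "bed",
         "output_type": "peaks",
         "href": "/files/ENCFF000BBB/@@download/ENCFF000BBB.bed.gz"},
    ],
}
bed_files = list_files(example, file_format="bed")
```

To use real data, the URL from `experiment_json_url` could be fetched with `urllib.request` and parsed with `json.load`, then passed to `list_files` in place of `example`.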


Schema: ENCODE WGBS Pipeline
https://github.com/ENCODE-DCC/dna-me-pipeline

Per replicate (from the diagram):
FASTQ (SE/PE) -> Trim Reads -> Map (converted genome) -> BAM (Bismark) -> Extract methyl calls -> BigBEDs, BigWigs (.bb)

Software: Bismark (v 0.10)

Bed/BigBed files for:
• CG context
• CHG context
• CHH context

QC metrics: Map to λ genome -> non-bisulfite-conversion rate
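The three methylation contexts and the λ-genome QC metric can be made concrete with a small sketch. It assumes the usual IUPAC meaning of H (A, C, or T) and only looks at the forward strand; the function names are hypothetical, and this is not how Bismark itself is implemented.

```python
def cytosine_context(seq, i):
    """Classify the cytosine at seq[i] (forward strand) as CG, CHG or CHH.

    Returns None when the context cannot be determined (sequence end or N).
    """
    seq = seq.upper()
    if seq[i] != "C":
        raise ValueError(f"position {i} is {seq[i]!r}, not a cytosine")
    nxt = seq[i + 1:i + 3]            # up to two bases downstream
    if nxt[:1] == "G":
        return "CG"
    if len(nxt) < 2 or "N" in nxt:
        return None
    # First downstream base is H (A, C or T); the second decides the context.
    return "CHG" if nxt[1] == "G" else "CHH"


def non_conversion_rate(unconverted_c, converted_c):
    """Fraction of cytosines left unconverted in reads mapped to the λ genome.

    Assumes the λ spike-in is unmethylated, so every unconverted C reflects
    incomplete bisulfite conversion.
    """
    return unconverted_c / (unconverted_c + converted_c)


seq = "ACGTCAGCTTC"
contexts = {i: cytosine_context(seq, i) for i, base in enumerate(seq) if base == "C"}
rate = non_conversion_rate(unconverted_c=5, converted_c=995)
```

In the toy sequence the cytosines at positions 1, 4, and 7 fall in CG, CHG, and CHH contexts respectively, and the final cytosine has no downstream bases, so its context is left undetermined.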

Schema: ENCODE RNA-seq Pipeline
https://github.com/ENCODE-DCC/long-rna-seq-pipeline

Per replicate (the diagram shows two replicates), starting from FASTQ (SE/PE):
• Map Reads -> BAM (tophat) -> Signal Tracks -> BigWigs (.bw)
• Map Reads -> BAM (STAR) -> Signal Tracks -> BigWigs (.bw)
• Quantification -> RSEM file

Across replicates: IDR/MAD QC & filtered quantification

For each mapper (STAR, tophat)
BAM files:
• mapped to genome
• mapped to transcriptome
BigWig files:
• plus/minus strand (paired)
• uniquely mapped
• multi+uniquely mapped
Quantifications (RSEM):
• genome
• transcriptome
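One way to picture the MAD part of the replicate QC is as the median absolute deviation of log2 expression ratios between two replicates. The sketch below is an assumption-laden illustration, not the pipeline's own QC script: the function name, the `min_value` filter, and the toy quantifications are all hypothetical.

```python
import math
from statistics import median


def mad_of_log_ratios(rep1, rep2, min_value=1.0):
    """Median absolute deviation of log2(rep1/rep2) over expressed genes.

    `rep1` and `rep2` are per-gene quantifications in the same gene order.
    Genes below `min_value` in either replicate are skipped (assumed filter).
    """
    ratios = [
        math.log2(a / b)
        for a, b in zip(rep1, rep2)
        if a >= min_value and b >= min_value
    ]
    if not ratios:
        raise ValueError("no genes pass the expression filter")
    center = median(ratios)
    return median(abs(r - center) for r in ratios)


# Toy quantifications for four genes; the last gene is filtered out.
rep1 = [10.0, 20.0, 40.0, 0.1]
rep2 = [10.0, 10.0, 10.0, 5.0]
mad = mad_of_log_ratios(rep1, rep2)
identical = mad_of_log_ratios(rep1[:3], rep1[:3])
```

Values near zero indicate that the two replicates quantify genes consistently; larger values indicate more disagreement between replicates.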

Uniformly Processed Data on the ENCODE Portal
RNA-seq Example
https://www.encodeproject.org/experiments/ENCSR368QPC/
• Pipeline graph shows relationships between files
• Click on files to see more file metadata and download links
• Click on steps to see more software metadata and download links

Pick Up the Results

Visualize!

Visualize!

Pipeline Workshop Summary
DCC Goals:
1. Deploy ENCODE-defined pipelines for ChIP-seq, RNA-seq, DNase-seq, methylation.
2. Use those pipelines to generate the standard ENCODE peaks, quantitations, CpG.
3. Capture metadata to make clear what software, versions, parameters, inputs were used.
4. Capture, accession, and distribute the output.
5. Deliver exactly the same pipelines in a form that anyone can run on their data or with ENCODE data – one experiment or 1000.
Replicability – Provenance – Ease of Use – Scalability

Contributors

ENCODE Data Coordinating Center
• Mike Cherry, PI, Stanford
• Jim Kent, co-PI, UCSC
• Eurie Hong, Project Manager
• Pipeline Developers: Ben Hitz (WGBS, Software Lead); Tim Dreszer (RNA-seq, DNase-seq); J. Seth Strattan (ChIP-seq)
• Portal Developers: Laurence Rowe, Nikhil Podduturi, Forrest Tanaka
• Data Wranglers: Esther Chan, Jean Davidson, Venkat Malladi, Cricket Sloan, J. Seth Strattan
• QA & Biocuration Assistance: Brian Lee, Marcus Ho, Aditi Narayanan
• Support Staff: Stuart Miyasato, Matt Simison, Zhenhua Wang

ENCODE Data Analysis Center
• Zhiping Weng, PI, University of Massachusetts
• Mark Gerstein, co-PI, Yale
• Methylation: Junko Tsuji, U Mass; Eric Mendenhall, U Alabama, HAIB
• RNA-seq: Alex Dobin, CSHL; Carrie Davis, CSHL; Rafael Irizarry, Harvard; Xintao Wei, UConn; Brent Graveley, UConn; Colin Dewey, U Wisconsin; Roderic Guigó, CRG; Sarah Djebali, CRG
• ChIP-seq: Anshul Kundaje, Stanford; Nathan Boley, Stanford; Jin Lee, Stanford

DNAnexus
• Mike Lin, Andrey Kislyuk, Singer Ma, Brett Hannigan, Ohad Rodeh, Joe Dale, George Asimenos

Contact
• @encodedcc
• encode-[email protected]
• https://github.com/ENCODE-DCC/