SIPeS User Manual

SIPeS

User Manual Version 2.0

Congmao Wang Shanghai Jiao Tong University [email protected]


I. Introduction

This document is a user guide to the SIPeS (Site Identification from Paired-end Sequencing) application, version 2.0.

With the advent of high-throughput sequencing, chromatin immunoprecipitation combined with sequencing (ChIP-Seq) has become a powerful tool for studying protein-DNA interactions on a genome-wide scale. SIPeS is a novel algorithm that allows researchers to identify transcription factor binding sites from paired-end sequencing reads. It finds peaks using a dynamic baseline built directly from the pileup of fragments. This avoids having to estimate the average DNA fragment length, as is necessary with single-end sequencing, and gives more powerful prediction of binding sites with high sensitivity and specificity.

Version 2.0 of SIPeS performs well not only on ChIP-Seq data but also on MeDIP-Seq and HMC-Seq data. It also supports multithreading, which saves considerable running time. Peaks are now ranked by "max fragment pileup value", whereas the previous version ranked them by the "fold" of each peak.
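To make the idea of a fragment pileup concrete, the sketch below counts how many fragments cover each position and reports the maximum. It is only an illustration under stated assumptions: start and end coordinates are treated as inclusive, and the function name max_pileup is hypothetical. It is not the SIPeS implementation, and it does not model the dynamic baseline.

```python
from collections import defaultdict


def max_pileup(fragments):
    """Return the largest number of fragments overlapping any single position.

    `fragments` is a list of (start, end) pairs; both ends are treated as
    inclusive, which is an assumption made for this illustration.
    """
    # Difference array: +1 where a fragment begins, -1 just past its end.
    delta = defaultdict(int)
    for start, end in fragments:
        delta[start] += 1
        delta[end + 1] -= 1

    depth = 0
    best = 0
    # Sweep positions in order, accumulating the coverage depth.
    for pos in sorted(delta):
        depth += delta[pos]
        best = max(best, depth)
    return best


example = [(123, 456), (300, 500), (400, 888), (789, 888)]
print(max_pileup(example))
```

Sorting only the fragment endpoints keeps the work proportional to the number of fragments rather than to the chromosome length.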

Ⅱ. Preparing the Input

SIPeS reads the IP sample from a file named "fragment.fasta" and, if available, the input control from a file named "fragment_control.fasta". Both files are in FASTA format. Place them in the same folder as the SIPeS executable file, and do not rename them.

    >chr1
    123 456
    789 888
    ...
    >chr2
    234 567
    890 888
    ...

The first and second terms on each line, separated by one space, are the chromosomal start and end coordinates of the mapped mates (fragments). If your file is tab-separated, you can convert it with:

    sed 's/\t/ /' yourfile > your_newfile

These two files can be generated in two main steps.

First: Map the paired-end sequencing reads with SSAHA2 (http://www.sanger.ac.uk/Software/analysis/SSAHA2/) or other mapping software. A typical command for each program can be taken from its manual.

For SSAHA2 with Illumina-Solexa paired-end short reads [2x35bp, 2x40bp, 2x44bp and other 2x??bp (here ?? is shorter than 76bp)]:

    ssaha2 -solexa -pair $(reads_length-1),400 -outfile your_outfilename your_reference_genome.fa your_lane_1.fq your_lane_2.fq > your_ssaha2_output_file_name

$(reads_length-1) is 34 for 2x35bp, 39 for 2x40bp and 43 for 2x44bp.

For SSAHA2 with Illumina-Solexa paired-end longer reads [2x76bp, 2x91bp and other 2x??bp (here ?? is longer than 76bp)]:

    ssaha2 -solexa -skip 6 -pair $(reads_length-1),400 -outfile your_outfilename your_reference_genome.fa your_lane_1.fq your_lane_2.fq > your_ssaha2_output_file_name

$(reads_length-1) is 75 for 2x76bp and 90 for 2x91bp.

For SOAP2, Maq and Bowtie, take a typical command from the manual of each program. Note that Maq versions before 0.7.1 cannot align reads longer than 63bp, and check the longest read length that Bowtie can handle.

Second: Generate fragment.fasta and fragment_control.fasta (if available) from the output of the mapping software. You can use the preprocessing program of SIPeS, which currently supports SSAHA2 and SOAP2, or write a Python/Perl/AWK script yourself for other mappers such as Maq and Bowtie. Please contact me if you have any problem with this step.

Different people use different sequencing platforms, such as Roche/454, Illumina/Solexa, Applied Biosystems, Helicos and others, and read lengths also differ (2x35bp, 2x40bp, 2x44bp, 2x76bp, 2x91bp, etc.). The preprocessing programs supplied with SIPeS are therefore only a reference; write your own preprocessing scripts when needed.

For SSAHA2, the supplied preprocessing programs support:

    Arabidopsis thaliana: 2x40bp, one mismatched nucleotide allowed in either end-1 or end-2
    Rice: 2x40bp, one mismatched nucleotide allowed in either end-1 or end-2
    Mouse: 2x44bp and 2x91bp, two mismatched nucleotides allowed in either end-1 or end-2
    Human: 2x76bp, two mismatched nucleotides allowed in either end-1 or end-2
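When none of the supplied preprocessing programs fits, a custom script has to write the fragment file itself. The sketch below shows one possible way to do that in Python. It assumes the fragments have already been extracted from the mapper output as (chromosome, start, end) tuples; that extraction depends on the mapper and is not shown. The function name write_fragments is hypothetical, and sorting the fragments is a convenience here, not a documented requirement of SIPeS.

```python
import io


def write_fragments(fragments, handle):
    """Write (chrom, start, end) tuples as a ">chrN" header line followed
    by one "start end" line per fragment.

    Chromosome names are assumed to already have the numeric "chrN" form.
    """
    by_chrom = {}
    for chrom, start, end in fragments:
        by_chrom.setdefault(chrom, []).append((start, end))

    # Order chr1, chr2, ... numerically rather than alphabetically.
    for chrom in sorted(by_chrom, key=lambda name: int(name[3:])):
        handle.write(f">{chrom}\n")
        for start, end in sorted(by_chrom[chrom]):
            # Exactly one space between the start and end coordinates.
            handle.write(f"{start} {end}\n")


# Demonstration with an in-memory buffer; a real script would use
# open("fragment.fasta", "w") instead.
buffer = io.StringIO()
write_fragments(
    [("chr2", 234, 567), ("chr1", 789, 888), ("chr1", 123, 456)],
    buffer,
)
print(buffer.getvalue(), end="")
```

The same function can be called a second time with the control fragments to produce fragment_control.fasta.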

Note: Store the fragment information under chromosome names of the form chr1, chr2, chr3... (names such as Chr1 or chr01 are not correct). If the reference genome contains chrX, chrY, chrC, chrM, etc., replace each letter with a number that continues after the last numbered chromosome. For example, if the reference genome is mouse (mm9), rename chrX to chr20 and chrY to chr21. SIPeS can currently process at most 24 chromosomes; we will try to raise this limit in the next version. If your reference genome is human, there is no need to rename chrX or chrY, because SIPeS processes them automatically.
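The renaming rule in the note can be sketched as a small Python helper. This is an illustration only: the function name renumber_chromosomes is hypothetical, and the order in which lettered chromosomes receive their numbers is assumed to be the order in which the caller lists them. For human data the note says no renaming is needed, so the helper would not be used there.

```python
def renumber_chromosomes(names, max_supported=24):
    """Map chromosome names to the all-numeric "chrN" form.

    Numbered chromosomes keep their number (without leading zeros);
    lettered ones (chrX, chrY, ...) receive the next free numbers in the
    order they appear in `names`.
    """
    numbered = [n for n in names if n[3:].isdigit()]
    lettered = [n for n in names if not n[3:].isdigit()]

    mapping = {n: f"chr{int(n[3:])}" for n in numbered}
    next_number = max((int(n[3:]) for n in numbered), default=0) + 1
    for name in lettered:
        mapping[name] = f"chr{next_number}"
        next_number += 1

    # The manual states that SIPeS 2.0 handles at most 24 chromosomes.
    if next_number - 1 > max_supported:
        raise ValueError("more chromosomes than SIPeS 2.0 can process")
    return mapping


# Mouse (mm9): chr1 to chr19, then chrX and chrY.
mouse = [f"chr{i}" for i in range(1, 20)] + ["chrX", "chrY"]
mapping = renumber_chromosomes(mouse)
print(mapping["chrX"], mapping["chrY"])
```

Keeping the mapping as a dictionary makes it easy to translate the peak coordinates back to the original chromosome names after SIPeS has finished.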

Ⅲ. Running SIPeS

SIPeS runs from the command line. Executable files are provided for Linux systems (i686, x86-64); for other systems, including Mac OS, you can compile SIPeS from the source code yourself. We suggest running SIPeS under a Linux system for good efficiency.

A. Usage: ./SIPeS [OPTIONS]

[-bs ]   dynamic baseline start used to construct the signal map (default: 1)

[-be ]   dynamic baseline end used to construct the signal map (default: 500); a larger value gives better peak quality but a longer running time

[-p ]    p-value cutoff to call signal maps (default: 0.05)

[-pb ]   the minimal binding times to find signal maps (default: 30)

[-f ]    fold-enrichment cutoff to find signal maps (default: 2)

[-c ]    use the input control as the background model (default: 0); this takes effect only when "-c 1" is given

[-gs ]   effective genome size, the number of nucleotides in the reference genome; users need to set this parameter based on their own data, and the SIPeS preprocessing program will calculate it for users

[-r ]    remove the signal coordinate and fragment pileup value information of each chromosome (default: 0); use "-r 1" to enable this option

[-h]     print the help information
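When SIPeS is called from a script, the option list can be assembled programmatically. The following Python sketch builds the argument list from the documented options and their defaults. The function name build_sipes_command and its keyword names are hypothetical; only the flags themselves come from this manual. The sketch prints the command rather than running it.

```python
def build_sipes_command(genome_size, use_control=False, baseline_start=1,
                        baseline_end=500, p_value=0.05, fold=2,
                        min_binding=30, remove_signal_files=False):
    """Assemble the SIPeS argument list from the documented options."""
    command = [
        "./SIPeS",
        "-bs", str(baseline_start),
        "-be", str(baseline_end),
        "-p", str(p_value),
        "-f", str(fold),
        "-pb", str(min_binding),
    ]
    # -c and -r are switched on by passing the value 1.
    if use_control:
        command += ["-c", "1"]
    if remove_signal_files:
        command += ["-r", "1"]
    command += ["-gs", str(genome_size)]
    return command


command = build_sipes_command(119707899, use_control=True)
print(" ".join(command))
# The list could be handed to subprocess.run(command, check=True) from the
# folder that holds the SIPeS executable and fragment.fasta.
```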

B. Examples:

(1) Using an input control as the background:

    ./SIPeS -bs 1 -be 500 -p 0.05 -f 2 -pb 30 -c 1 -gs 119707899

(2) Using a random model based on the Poisson distribution as the background:

    ./SIPeS -bs 1 -be 500 -p 0.05 -f 2 -pb 30 -gs 119707899

Trick: You can first run SIPeS with "-be 50" to find the largest "max fragment pileup value" among all peaks (sort "Peak Information.xls" by "max fragment pileup value"), and then use that value as -be to find the peaks again. For example, run:

    ./SIPeS -bs 1 -be 50 -p 0.05 -f 2 -pb 30 -gs your_effective_genome_size

If the largest "max fragment pileup value" in the generated "Peak Information.xls" is 1908, rerun SIPeS as follows to find the final peaks:

    ./SIPeS -bs 1 -be 1908 -p 0.05 -f 2 -pb 30 -gs your_effective_genome_size

Note: A larger -be value makes SIPeS take more time.
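The lookup step of the trick can be scripted instead of sorting the file by hand. The Python sketch below is written under loudly stated assumptions: this manual does not specify the layout of "Peak Information.xls", so the code assumes a tab-delimited text file with a header row that contains a column titled "max fragment pileup value". The function name largest_pileup and the sample rows are made up for illustration; adjust the column name and delimiter to the real file.

```python
import csv
import io


def largest_pileup(handle, column="max fragment pileup value"):
    """Return the largest value found in `column` of a tab-delimited table.

    The column title and the tab delimiter are assumptions about the
    peak file, not documented facts.
    """
    reader = csv.DictReader(handle, delimiter="\t")
    return max(int(float(row[column])) for row in reader)


# Made-up stand-in for the first-pass output; real files may differ.
sample = io.StringIO(
    "signal start\tsignal end\tmax fragment pileup value\n"
    "1000\t1500\t42\n"
    "9000\t9800\t1908\n"
)
be_value = largest_pileup(sample)
print(f"./SIPeS -bs 1 -be {be_value} -p 0.05 -f 2 -pb 30 "
      "-gs your_effective_genome_size")
```

With a real result file, the in-memory sample would be replaced by open("Peak Information.xls", newline="").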

Ⅳ. Interpreting the Output

The results of SIPeS consist of two types of files:

(1) Signal coordinates and fragment pileup value information for each chromosome.

(2) Peak information, including signal start, signal end, signal width, reads in signal, max fragment pileup value, summit start, summit end, summit middle, summit width, summit endpoint number, p-value and fold-enrichment of each signal map.

Note: If the message "The maximum number of rows has been exceeded. Excess rows were not imported!" appears when you open "Peak Information.xls", rename the file to "Peak Information" and reopen it with another editing tool (such as UltraEdit). This happens only when the number of peaks is larger than 65535.
