Topics in Sequencing ●
Current advances in sequencing
●
Sequencing applications
●
Algorithms for short read alignment ●
Seed-and-Extend/Hash table index methods –
●
Suffix Tree/Array based methods –
●
Spaced seeding Burroughs-Wheeler Transform & FM-Index
Algorithms for short read assembly (if time permits)
03/10/2011
CS 555 - Biological and Linguistic Sequence Analysis
1
DNA/RNA Sequencing ●
Given a sample of DNA or RNA, read the sequence of bases that make it up ● ●
03/10/2011
Used to be slow & manual – Sanger sequencing Current generation of massively parallel machines enable high throughput sequencing of entire genomes (“next generation”)
CS 555 - Biological and Linguistic Sequence Analysis
2
Sequencer Technology – DNA is amplified and fragmented – Single-stranded DNA is stuck to flowcell – New nucleotides with florescent dyes are added at each stage – The DNA is imaged; then the dyes washed away for the next cycle 03/10/2011
CS 555 - Biological and Linguistic Sequence Analysis
3
Sequencing Technology
03/10/2011
CS 555 - Biological and Linguistic Sequence Analysis
4
Reference Genomes ●
Lots of effort has gone into publishing “reference” genomes ● ●
●
●
e.g. Human in 2003 Combine contributions from many individuals to describe “typical” genome sequence for species Thousands of reference genomes have now been published for various organisms I'll be talking about applications that use the human reference –
03/10/2011
3.2 billion bases, keeps getting refined, considered 'complete enough' CS 555 - Biological and Linguistic Sequence Analysis
5
Sequencing Data ●
Current sequencing machines produce lots of data ●
●
e.g. Illumina HiSeq – up to 25GB/day
But there's a catch: short reads ●
●
●
●
03/10/2011
Machines output short sequences (25 – 100bp) in batches of 10s or 100s of millions Bioinformatics challenge is to make sense of all of these short reads Many applications are based on finding mapping locations of sample reads in the reference genome Can be solved by lots of fast approximate matching/local alignment CS 555 - Biological and Linguistic Sequence Analysis
6
Example Read Typical sequence pair in FASTQ format: @GA1_0010:3:120:19851:8281#0/1 NTGTGGTTTATCTATCCACCAGGACAGATTTCTACA +GA1_0010:3:120:19851:8281#0/1 KSSPPUXXVU[X]^^_^X[]Z__^_X^_____]\^` @GA1_0010:3:120:19851:8281#0/2 ACCAATTTACTGTATTAGTCCATCTTAATAAGAAAT +GA1_0010:3:120:19851:8281#0/2 ddcdcacddc`dcdd`d`adddddYdddadddddcc ●
# in pair DNA Sequence Quality Scores
Reads can be generated in pairs that come from a known distance apart in the sample ●
●
Sequence Name
Can be a mapping hint, among other uses
Quality scores indicate the probability that the base in that position was called by error
03/10/2011
CS 555 - Biological and Linguistic Sequence Analysis
7
Sequencing Applications ●
By finding approx. match locations of reads from a sample in the reference, you can characterize the genetic variations in the sample compared to the reference ●
For example, single nucleotide polymorphism (SNP) discovery – single bases where the sample differs from the reference
Reference
CAAATAGGC AAATAGGCT GCAAATCGG AATAGGCTT ATAGGCTTA ...ACGTAGCAAATCGGCTTACTAGACCAATTTAC...
SNP 03/10/2011
CS 555 - Biological and Linguistic Sequence Analysis
8
Sequencing Applications ●
● ●
Some experiments can generate reads in pairs that have a known distance apart in the sample genome Distance can be used to direct matching Mappings of pairs that aren't concordant with the expected distance can indicate larger variations like deletions, insertions, and inversions
Medvedev et al 2009, Nature Methods 03/10/2011
CS 555 - Biological and Linguistic Sequence Analysis
9
Sequencing Applications ●
RNA-SEQ: ● ●
●
●
Sequence RNA Map reads back to reference Use number of reads to estimate gene expression Find which exons are making it into the final gene product (“alternative splicing”)
03/10/2011
Illumina.com
CS 555 - Biological and Linguistic Sequence Analysis
10
Sequencing Applications ●
Chromatin Immunoprecipitation (ChIP-SEQ): ●
●
●
03/10/2011
Transcription Factors: proteins that bind to DNA to turn genes on or off ChIP-SEQ captures reads near TFs Map back to reference, learn TF binding sites, figure out how genes regulate each other
Valouev et al, Nature Methods, 2008.
CS 555 - Biological and Linguistic Sequence Analysis
11
Sequencing Applications ●
Bisulfite Sequencing ●
●
●
●
In cells, methyl groups are added to cytosine (C) bases in the DNA strand, especially 'CpG' sites; this usually causes gene repression Example of an 'epigentic' mechanism: heritable change to DNA without alterations in DNA sequence Bisulfite turns unmethylated C's into U's but leaves methylated C's alone; U's get coverted back to T's in reads Mapping to converted versions of reference show where methylation is happening
m m GCCCGTCACACG CGGGCAGTGTGC m m 03/10/2011
GTTCGTTATACG TGGGCAGTGTGC
GTTCGTTATACG CAAGCAATATGC TGGGCAGTGTGC ACCCGTCACACG
CS 555 - Biological and Linguistic Sequence Analysis
12
Short Read Mapping Challenges ●
●
●
●
●
All of these applications are basically approximate matching problems at heart. Should be easy but: The human genome is made of ~50% repetitive sequences – causes mapping ambiguity Reads are short; if they fall in repetitive areas, it's hard to know where they truly map Have to account for both sequencing (machine) errors and variations between the sample and reference The volume of data can be so large that performance is a real concern
03/10/2011
CS 555 - Biological and Linguistic Sequence Analysis
13
Some Short Read Aligners Or, funny name roll call
BFAST
Novoalign
Bowtie
RazerS
BWA
RMAP
Cloudburst
SHRiMP
Cross_match
SOAPv1
ELAND
Stampy
MAQ
ZOOM
Mosaik
SOAPv2
MrFAST
segemehl vmatch
03/10/2011
CS 555 - Biological and Linguistic Sequence Analysis
14
Hash Index Alignment Strategies ●
●
Two main strategies used by short read aligners: Hash Table Indexing and Suffix Tree/Array methods Algorithms based on Hash Tables (Seed-and-Extend) ●
●
03/10/2011
Create a hash table index of the location of all k-mers (ngrams) in the reference (or the reads), with k < read length Look up locations of k-mer seeds from the reads (or the reference) in the index, and then extend with dynamic programming (like Baeza-Yates and Perleberg algorithm)
CS 555 - Biological and Linguistic Sequence Analysis
15
Hash Index Alignment Strategies ●
Allowing m mismatches requires more seeds ● ●
●
●
e.g. BYP based RMAP (Smith et al 2008): m+1 The more seeds, and the smaller they are, the less benefit from hash table indexing
For performance, often don't use an exhaustive set of seeds to guarantee all matches are found Often seeds are chosen to emphasize higher quality portions of the read (the beginning in Illumina)
03/10/2011
CS 555 - Biological and Linguistic Sequence Analysis
16
Hash Indexing – Spaced Seeds ●
Surprisingly, spaced seeds increase sensitivity ●
●
Choose non-consecutive positions for the seed instead of a consecutive block (substring vs. subsequence) Use a mask to define the seed –
● ●
111011100110111: a seed of length 15 and weight 11
Only require a match at the 1 positions Ma et al. (2002) found that this seed is 50% more sensitive than a contiguous seed of 11 positions for sequences with 70% similarity ACATTGCACGAATTC ACATTGCACGA ACA TGC CG ATC AC TTG AC GA TC
03/10/2011
11111111111 111011100110111 110111011011011
CS 555 - Biological and Linguistic Sequence Analysis
17
Hash Indexing – Spaced Seeds ●
Hits to spaced seeds are more independent than hits to consecutive seeds ●
●
This has to do with the lower likelihood that they will overlap each other Analyzing the properties of spaced seeds and algorithms for designing optimal spaced seeds is an active topic in theoretical computer science –
03/10/2011
Depending on the constraints chosen, some problems in spaced seed optimization are NP-hard
CS 555 - Biological and Linguistic Sequence Analysis
18
Short read alignment – Hash Indices ●
●
●
Most short read aligners use multiple spaced seed templates in a hash table indexing approach With proper seed choice, hash table index aligners can be very sensitive, at a cost to performance However: ● ●
●
03/10/2011
The more sensitive, the greater the performance hit Comprehensive hash table indices take lots of memory, which also degrades performance in practical implementations Depending on the application, may want to sacrifice sensitivity for performance, or limit mismatches CS 555 - Biological and Linguistic Sequence Analysis
19
Short read alignment – Suffix arrays ●
Suffix tree/array based methods are very fast for exact match, as we've seen ●
●
Finds multiple match locations at the same time
However, they have large memory requirements ●
●
●
03/10/2011
Assume 4 byte integers, then a suffix array needs 4N bytes → 12GB for the human genome Implementation tricks and compression can get this down, but not by much To get around this, the current crop of aligners use the FM-Index, based on the Burroughs-Wheeler Transform (BWT) CS 555 - Biological and Linguistic Sequence Analysis
20
Burroughs-Wheeler Transform ●
BWT (Burroughs and Wheeler, 1994) is a transformation on a string X related to building a suffix array. The easy way to build it: ●
●
Add an end of character symbol $ that is lexicographically smaller than all characters in Σ to the end of X
●
Create a matrix of all cyclic rotations of X and sort it
●
BWT(X) is the last column of the sorted matrix
The BWT was originally designed for compression purposes
03/10/2011
CS 555 - Biological and Linguistic Sequence Analysis
21
Computing BWT Append $
ACAACG$ 0123456
Create Cycled Strings
Sort Rows
0 1 2 3 4 5 6
6 2 0 3 1 4 5
ACAACG$ CAACG$A AACG$AC ACG$ACA CG$ACAA G$ACAAC $ACAACG
$ACAACG AACG$AC ACAACG$ ACG$ACA CAACG$A CG$ACAA G$ACAAC
Take Last Column
GC$AAAC
X: ACAACG$ Suffix array: [6,2,0,3,1,4,5] BWT(X): GC$AAAC 03/10/2011
CS 555 - Biological and Linguistic Sequence Analysis
22
Relation of BWT and Suffix Arrays ●
●
●
For our string ACAACG, the suffix array S is [6,2,0,3,1,4,5], and BWT(X) is GC$AAAC BWT(X) and S are related: ●
BWT(i) = $ if S[i] == 0
●
BWT(i) = X(S[i] – 1) otherwise
Therefore, we can use linear time and space algorithms for building suffix arrays to compute BWT(X)
03/10/2011
CS 555 - Biological and Linguistic Sequence Analysis
23
Aside on BWT Compression ●
Why is BWT useful for compression? ●
●
●
● ●
03/10/2011
The transform attempts to take advantage of patterns in the text to produce a string with more repeated characters Imagine the BWT of an English text, which contains many instances of 'the' The sorted matrix will have many rows that start with 'he' and end with 't', which means the last column will have a large run of 't's This makes it easy to compress There are fast algorithms for reversing the transform and searching in it without decompressing CS 555 - Biological and Linguistic Sequence Analysis
24
BWT properties ●
The matrix of sorted cyclic rotations has the property of 'last-first mapping' ●
●
The ith occurrence of char a in the last column is the same character in X as the ith occurrence of a in the first column
We can use this property to reverse the transformation
Langmead et al, Genome Biology, 2009. 03/10/2011
CS 555 - Biological and Linguistic Sequence Analysis
25
The FM Index ●
●
Ferragina and Manzini (2000) developed methods for searching BWT transformed strings with minimal memory requirements First we need to define two sets of values: ●
●
Let C(a) be the count of symbols in X[0,n-2] that are lexicographically smaller than a, where n is the length of X Let O(a,i) be the number of occurrences of a in BWT(X)[0,i]
●
Using these values, we can quickly search for substrings using ranges on the BWT string
03/10/2011
CS 555 - Biological and Linguistic Sequence Analysis
26
Searching with the FM Index Searching for the pattern “AAC” in our string ACAACG
●
●
●
Let c be the current character, sp be the start of our range on BWT and ep be the end Start with c equal to the last character in the pattern
Initialize: c = 'C' sp = C[c] + 1 = 4 ep = C[c+1] + 1 = C['G'] + 1 = 6
C[$] = 0 C[A] = 0 C[C] = 3 C[G] = 5 03/10/2011
O($,1)=0 O(A,1)=0 O(C,1)=0 O(G,1)=1
O($,2)=0 O(A,2)=0 O(C,2)=1 O(G,2)=1
O($,3)=0 O(A,3)=0 O(C,3)=1 O(G,3)=1
O($,4)=0 O(A,4)=1 O(C,4)=1 O(G,4)=1
O($,5)=0 O(A,5)=2 O(C,5)=1 O(G,5)=1
CS 555 - Biological and Linguistic Sequence Analysis
O($,6)=0 O(A,6)=3 O(C,6)=1 O(G,6)=1 27
Searching with the FM Index Searching for the pattern “AAC” in our string ACAACG
●
Move to next character c = 'A' sp = C[c] + O(c,sp) + 1 = 2 ep = C[c] + O(c,ep) + 1 = 4
C[$] = 0 C[A] = 0 C[C] = 3 C[G] = 5 03/10/2011
O($,1)=0 O(A,1)=0 O(C,1)=0 O(G,1)=1
O($,2)=0 O(A,2)=0 O(C,2)=1 O(G,2)=1
O($,3)=0 O(A,3)=0 O(C,3)=1 O(G,3)=1
O($,4)=0 O(A,4)=1 O(C,4)=1 O(G,4)=1
O($,5)=0 O(A,5)=2 O(C,5)=1 O(G,5)=1
CS 555 - Biological and Linguistic Sequence Analysis
O($,6)=0 O(A,6)=3 O(C,6)=1 O(G,6)=1 28
Searching with the FM Index Searching for the pattern “AAC” in our string ACAACG
●
Move to next character c = 'A' sp = C[c] + O(c,sp) + 1 = 1 ep = C[c] + O(c,ep) + 1 = 2 Stop when pattern exhausted or sp >= ep C[$] = 0 C[A] = 0 C[C] = 3 C[G] = 5 03/10/2011
O($,1)=0 O(A,1)=0 O(C,1)=0 O(G,1)=1
O($,2)=0 O(A,2)=0 O(C,2)=1 O(G,2)=1
O($,3)=0 O(A,3)=0 O(C,3)=1 O(G,3)=1
O($,4)=0 O(A,4)=1 O(C,4)=1 O(G,4)=1
O($,5)=0 O(A,5)=2 O(C,5)=1 O(G,5)=1
CS 555 - Biological and Linguistic Sequence Analysis
O($,6)=0 O(A,6)=3 O(C,6)=1 O(G,6)=1 29
Advantages of BWT/FM ●
●
Backwards searching is equivalent to traversing a suffix tree. But why go to this trouble? Our goal was to save memory, but the O(a,i) array would be very large: 4n integers ●
●
●
●
03/10/2011
The trick is to only store a few entries from the O array and compute the rest on the fly from the BWT string For example BWA (Li and Durbin, 2009) only stores O(•,i) for i's that are factors of 128 A DNA string, or its BWT, can be stored using only two bits per character This leads to a massive memory savings CS 555 - Biological and Linguistic Sequence Analysis
30
Inexact tree matching ●
●
●
Can use suffix tree to align portion of the read and extend (like BYP) Some use greedy backtracking approaches that do not guarantee that all matches will be found Others do all possible backtrackings ●
03/10/2011
The dotted line shows a path through the prefix tree of “GOOGOL” looking for the pattern “LOL” with 1 mismatch
Li and Durbin, Bioinformatics, 2009
CS 555 - Biological and Linguistic Sequence Analysis
31
Inexact tree matching ●
BWA (Li and Durbin, 2009) computes a lower bound on the number of possible mismatches for substrings of the pattern
●
For the previous example, “LOL” and “GOOGOL”: ●
D(0) = 0
●
D(1) = 1 → “LO” is not a substring
● ●
D(2) = 1
Therefore would not traverse the “G” and “O” branches
03/10/2011
CS 555 - Biological and Linguistic Sequence Analysis
32
Additional Complications ●
Incorporating base quality scores ●
●
●
Bowtie (Langmead et al 2009) chooses backtracking paths based on quality scores Others re-rank alignments based on the quality scores of their matches and mismatches
Both the reads and their reverse complements need to be mapped to the reference ●
03/10/2011
Need to either hold two data structures in memory or convert on the fly
CS 555 - Biological and Linguistic Sequence Analysis
33
Current State of Short Read Aligners ●
Hash-Index based aligners are the most sensitive ●
●
●
●
Important for applications where mismatches cause big problems like structural variation detection Can be very slow and use lots of memory
Suffix tree based aligners are gaining in popularity ●
Much faster
●
Dramatically less memory usage (3GB vs. 8-12GB)
One recent strategy is to use a tiered alignment strategy, with a tree based aligner first and an indexed aligner on the reads it fails to map unambiguously
03/10/2011
CS 555 - Biological and Linguistic Sequence Analysis
34
de novo Short Read Assembly ●
What if there's no genome reference to align to? ● ●
●
●
Sequencing new species from scratch Metagenomic studies – take a sample (of water, soil, etc) and sequence all the DNA you find in it Characterizing areas that are heavily divergent from the reference
Can't cast these problems as mapping/alignment anymore ●
●
03/10/2011
Instead, we want to piece the short reads back together to find the long sequences from which they came Look for overlaps between reads CS 555 - Biological and Linguistic Sequence Analysis
35
De Bruijn Graphs ●
●
Most de novo assemblers use a data structure called a de Bruijn graph to hold overlaps between reads Each node represents a set of k-mers from the reads that overlap by k – 1 symbols Zerbino and Birney, Genome Res, 2008
03/10/2011
CS 555 - Biological and Linguistic Sequence Analysis
36
Assembly with de Bruijn Graphs ●
●
●
●
●
Use a set of hash table indices to add each read to the graph Collapse non-branching chains of nodes into single nodes to produce 'contigs' Try to use paired end reads, which come from a known distance apart, to scaffold contigs together This process is extremely RAM intensive – often conducted on 512GB supercomputers Active research into parallelizing and distributing the process
03/10/2011
CS 555 - Biological and Linguistic Sequence Analysis
37