Topics in Sequencing ●

Current advances in sequencing



Sequencing applications



Algorithms for short read alignment ●

Seed-and-Extend/Hash table index methods –



Suffix Tree/Array based methods –



Spaced seeding Burroughs-Wheeler Transform & FM-Index

Algorithms for short read assembly (if time permits)

03/10/2011

CS 555 - Biological and Linguistic Sequence Analysis

1

DNA/RNA Sequencing ●

Given a sample of DNA or RNA, read the sequence of bases that make it up ● ●

03/10/2011

Used to be slow & manual – Sanger sequencing Current generation of massively parallel machines enable high throughput sequencing of entire genomes (“next generation”)

CS 555 - Biological and Linguistic Sequence Analysis

2

Sequencer Technology – DNA is amplified and fragmented – Single-stranded DNA is stuck to flowcell – New nucleotides with florescent dyes are added at each stage – The DNA is imaged; then the dyes washed away for the next cycle 03/10/2011

CS 555 - Biological and Linguistic Sequence Analysis

3

Sequencing Technology

03/10/2011

CS 555 - Biological and Linguistic Sequence Analysis

4

Reference Genomes ●

Lots of effort has gone into publishing “reference” genomes ● ●





e.g. Human in 2003 Combine contributions from many individuals to describe “typical” genome sequence for species Thousands of reference genomes have now been published for various organisms I'll be talking about applications that use the human reference –

03/10/2011

3.2 billion bases, keeps getting refined, considered 'complete enough' CS 555 - Biological and Linguistic Sequence Analysis

5

Sequencing Data ●

Current sequencing machines produce lots of data ●



e.g. Illumina HiSeq – up to 25GB/day

But there's a catch: short reads ●







03/10/2011

Machines output short sequences (25 – 100bp) in batches of 10s or 100s of millions Bioinformatics challenge is to make sense of all of these short reads Many applications are based on finding mapping locations of sample reads in the reference genome Can be solved by lots of fast approximate matching/local alignment CS 555 - Biological and Linguistic Sequence Analysis

6

Example Read Typical sequence pair in FASTQ format: @GA1_0010:3:120:19851:8281#0/1 NTGTGGTTTATCTATCCACCAGGACAGATTTCTACA +GA1_0010:3:120:19851:8281#0/1 KSSPPUXXVU[X]^^_^X[]Z__^_X^_____]\^` @GA1_0010:3:120:19851:8281#0/2 ACCAATTTACTGTATTAGTCCATCTTAATAAGAAAT +GA1_0010:3:120:19851:8281#0/2 ddcdcacddc`dcdd`d`adddddYdddadddddcc ●

# in pair DNA Sequence Quality Scores

Reads can be generated in pairs that come from a known distance apart in the sample ●



Sequence Name

Can be a mapping hint, among other uses

Quality scores indicate the probability that the base in that position was called by error

03/10/2011

CS 555 - Biological and Linguistic Sequence Analysis

7

Sequencing Applications ●

By finding approx. match locations of reads from a sample in the reference, you can characterize the genetic variations in the sample compared to the reference ●

For example, single nucleotide polymorphism (SNP) discovery – single bases where the sample differs from the reference

Reference

CAAATAGGC AAATAGGCT GCAAATCGG AATAGGCTT ATAGGCTTA ...ACGTAGCAAATCGGCTTACTAGACCAATTTAC...

SNP 03/10/2011

CS 555 - Biological and Linguistic Sequence Analysis

8

Sequencing Applications ●

● ●

Some experiments can generate reads in pairs that have a known distance apart in the sample genome Distance can be used to direct matching Mappings of pairs that aren't concordant with the expected distance can indicate larger variations like deletions, insertions, and inversions

Medvedev et al 2009, Nature Methods 03/10/2011

CS 555 - Biological and Linguistic Sequence Analysis

9

Sequencing Applications ●

RNA-SEQ: ● ●





Sequence RNA Map reads back to reference Use number of reads to estimate gene expression Find which exons are making it into the final gene product (“alternative splicing”)

03/10/2011

Illumina.com

CS 555 - Biological and Linguistic Sequence Analysis

10

Sequencing Applications ●

Chromatin Immunoprecipitation (ChIP-SEQ): ●





03/10/2011

Transcription Factors: proteins that bind to DNA to turn genes on or off ChIP-SEQ captures reads near TFs Map back to reference, learn TF binding sites, figure out how genes regulate each other

Valouev et al, Nature Methods, 2008.

CS 555 - Biological and Linguistic Sequence Analysis

11

Sequencing Applications ●

Bisulfite Sequencing ●







In cells, methyl groups are added to cytosine (C) bases in the DNA strand, especially 'CpG' sites; this usually causes gene repression Example of an 'epigentic' mechanism: heritable change to DNA without alterations in DNA sequence Bisulfite turns unmethylated C's into U's but leaves methylated C's alone; U's get coverted back to T's in reads Mapping to converted versions of reference show where methylation is happening

m m GCCCGTCACACG CGGGCAGTGTGC m m 03/10/2011

GTTCGTTATACG TGGGCAGTGTGC

GTTCGTTATACG CAAGCAATATGC TGGGCAGTGTGC ACCCGTCACACG

CS 555 - Biological and Linguistic Sequence Analysis

12

Short Read Mapping Challenges ●









All of these applications are basically approximate matching problems at heart. Should be easy but: The human genome is made of ~50% repetitive sequences – causes mapping ambiguity Reads are short; if they fall in repetitive areas, it's hard to know where they truly map Have to account for both sequencing (machine) errors and variations between the sample and reference The volume of data can be so large that performance is a real concern

03/10/2011

CS 555 - Biological and Linguistic Sequence Analysis

13

Some Short Read Aligners Or, funny name roll call

BFAST

Novoalign

Bowtie

RazerS

BWA

RMAP

Cloudburst

SHRiMP

Cross_match

SOAPv1

ELAND

Stampy

MAQ

ZOOM

Mosaik

SOAPv2

MrFAST

segemehl vmatch

03/10/2011

CS 555 - Biological and Linguistic Sequence Analysis

14

Hash Index Alignment Strategies ●



Two main strategies used by short read aligners: Hash Table Indexing and Suffix Tree/Array methods Algorithms based on Hash Tables (Seed-and-Extend) ●



03/10/2011

Create a hash table index of the location of all k-mers (ngrams) in the reference (or the reads), with k < read length Look up locations of k-mer seeds from the reads (or the reference) in the index, and then extend with dynamic programming (like Baeza-Yates and Perleberg algorithm)

CS 555 - Biological and Linguistic Sequence Analysis

15

Hash Index Alignment Strategies ●

Allowing m mismatches requires more seeds ● ●





e.g. BYP based RMAP (Smith et al 2008): m+1 The more seeds, and the smaller they are, the less benefit from hash table indexing

For performance, often don't use an exhaustive set of seeds to guarantee all matches are found Often seeds are chosen to emphasize higher quality portions of the read (the beginning in Illumina)

03/10/2011

CS 555 - Biological and Linguistic Sequence Analysis

16

Hash Indexing – Spaced Seeds ●

Surprisingly, spaced seeds increase sensitivity ●



Choose non-consecutive positions for the seed instead of a consecutive block (substring vs. subsequence) Use a mask to define the seed –

● ●

111011100110111: a seed of length 15 and weight 11

Only require a match at the 1 positions Ma et al. (2002) found that this seed is 50% more sensitive than a contiguous seed of 11 positions for sequences with 70% similarity ACATTGCACGAATTC ACATTGCACGA ACA TGC CG ATC AC TTG AC GA TC

03/10/2011

11111111111 111011100110111 110111011011011

CS 555 - Biological and Linguistic Sequence Analysis

17

Hash Indexing – Spaced Seeds ●

Hits to spaced seeds are more independent than hits to consecutive seeds ●



This has to do with the lower likelihood that they will overlap each other Analyzing the properties of spaced seeds and algorithms for designing optimal spaced seeds is an active topic in theoretical computer science –

03/10/2011

Depending on the constraints chosen, some problems in spaced seed optimization are NP-hard

CS 555 - Biological and Linguistic Sequence Analysis

18

Short read alignment – Hash Indices ●





Most short read aligners use multiple spaced seed templates in a hash table indexing approach With proper seed choice, hash table index aligners can be very sensitive, at a cost to performance However: ● ●



03/10/2011

The more sensitive, the greater the performance hit Comprehensive hash table indices take lots of memory, which also degrades performance in practical implementations Depending on the application, may want to sacrifice sensitivity for performance, or limit mismatches CS 555 - Biological and Linguistic Sequence Analysis

19

Short read alignment – Suffix arrays ●

Suffix tree/array based methods are very fast for exact match, as we've seen ●



Finds multiple match locations at the same time

However, they have large memory requirements ●





03/10/2011

Assume 4 byte integers, then a suffix array needs 4N bytes → 12GB for the human genome Implementation tricks and compression can get this down, but not by much To get around this, the current crop of aligners use the FM-Index, based on the Burroughs-Wheeler Transform (BWT) CS 555 - Biological and Linguistic Sequence Analysis

20

Burroughs-Wheeler Transform ●

BWT (Burroughs and Wheeler, 1994) is a transformation on a string X related to building a suffix array. The easy way to build it: ●



Add an end of character symbol $ that is lexicographically smaller than all characters in Σ to the end of X



Create a matrix of all cyclic rotations of X and sort it



BWT(X) is the last column of the sorted matrix

The BWT was originally designed for compression purposes

03/10/2011

CS 555 - Biological and Linguistic Sequence Analysis

21

Computing BWT Append $

ACAACG$ 0123456

Create Cycled Strings

Sort Rows

0 1 2 3 4 5 6

6 2 0 3 1 4 5

ACAACG$ CAACG$A AACG$AC ACG$ACA CG$ACAA G$ACAAC $ACAACG

$ACAACG AACG$AC ACAACG$ ACG$ACA CAACG$A CG$ACAA G$ACAAC

Take Last Column

GC$AAAC

X: ACAACG$ Suffix array: [6,2,0,3,1,4,5] BWT(X): GC$AAAC 03/10/2011

CS 555 - Biological and Linguistic Sequence Analysis

22

Relation of BWT and Suffix Arrays ●





For our string ACAACG, the suffix array S is [6,2,0,3,1,4,5], and BWT(X) is GC$AAAC BWT(X) and S are related: ●

BWT(i) = $ if S[i] == 0



BWT(i) = X(S[i] – 1) otherwise

Therefore, we can use linear time and space algorithms for building suffix arrays to compute BWT(X)

03/10/2011

CS 555 - Biological and Linguistic Sequence Analysis

23

Aside on BWT Compression ●

Why is BWT useful for compression? ●





● ●

03/10/2011

The transform attempts to take advantage of patterns in the text to produce a string with more repeated characters Imagine the BWT of an English text, which contains many instances of 'the' The sorted matrix will have many rows that start with 'he' and end with 't', which means the last column will have a large run of 't's This makes it easy to compress There are fast algorithms for reversing the transform and searching in it without decompressing CS 555 - Biological and Linguistic Sequence Analysis

24

BWT properties ●

The matrix of sorted cyclic rotations has the property of 'last-first mapping' ●



The ith occurrence of char a in the last column is the same character in X as the ith occurrence of a in the first column

We can use this property to reverse the transformation

Langmead et al, Genome Biology, 2009. 03/10/2011

CS 555 - Biological and Linguistic Sequence Analysis

25

The FM Index ●



Ferragina and Manzini (2000) developed methods for searching BWT transformed strings with minimal memory requirements First we need to define two sets of values: ●



Let C(a) be the count of symbols in X[0,n-2] that are lexicographically smaller than a, where n is the length of X Let O(a,i) be the number of occurrences of a in BWT(X)[0,i]



Using these values, we can quickly search for substrings using ranges on the BWT string

03/10/2011

CS 555 - Biological and Linguistic Sequence Analysis

26

Searching with the FM Index Searching for the pattern “AAC” in our string ACAACG







Let c be the current character, sp be the start of our range on BWT and ep be the end Start with c equal to the last character in the pattern

Initialize: c = 'C' sp = C[c] + 1 = 4 ep = C[c+1] + 1 = C['G'] + 1 = 6

C[$] = 0 C[A] = 0 C[C] = 3 C[G] = 5 03/10/2011

O($,1)=0 O(A,1)=0 O(C,1)=0 O(G,1)=1

O($,2)=0 O(A,2)=0 O(C,2)=1 O(G,2)=1

O($,3)=0 O(A,3)=0 O(C,3)=1 O(G,3)=1

O($,4)=0 O(A,4)=1 O(C,4)=1 O(G,4)=1

O($,5)=0 O(A,5)=2 O(C,5)=1 O(G,5)=1

CS 555 - Biological and Linguistic Sequence Analysis

O($,6)=0 O(A,6)=3 O(C,6)=1 O(G,6)=1 27

Searching with the FM Index Searching for the pattern “AAC” in our string ACAACG



Move to next character c = 'A' sp = C[c] + O(c,sp) + 1 = 2 ep = C[c] + O(c,ep) + 1 = 4

C[$] = 0 C[A] = 0 C[C] = 3 C[G] = 5 03/10/2011

O($,1)=0 O(A,1)=0 O(C,1)=0 O(G,1)=1

O($,2)=0 O(A,2)=0 O(C,2)=1 O(G,2)=1

O($,3)=0 O(A,3)=0 O(C,3)=1 O(G,3)=1

O($,4)=0 O(A,4)=1 O(C,4)=1 O(G,4)=1

O($,5)=0 O(A,5)=2 O(C,5)=1 O(G,5)=1

CS 555 - Biological and Linguistic Sequence Analysis

O($,6)=0 O(A,6)=3 O(C,6)=1 O(G,6)=1 28

Searching with the FM Index Searching for the pattern “AAC” in our string ACAACG



Move to next character c = 'A' sp = C[c] + O(c,sp) + 1 = 1 ep = C[c] + O(c,ep) + 1 = 2 Stop when pattern exhausted or sp >= ep C[$] = 0 C[A] = 0 C[C] = 3 C[G] = 5 03/10/2011

O($,1)=0 O(A,1)=0 O(C,1)=0 O(G,1)=1

O($,2)=0 O(A,2)=0 O(C,2)=1 O(G,2)=1

O($,3)=0 O(A,3)=0 O(C,3)=1 O(G,3)=1

O($,4)=0 O(A,4)=1 O(C,4)=1 O(G,4)=1

O($,5)=0 O(A,5)=2 O(C,5)=1 O(G,5)=1

CS 555 - Biological and Linguistic Sequence Analysis

O($,6)=0 O(A,6)=3 O(C,6)=1 O(G,6)=1 29

Advantages of BWT/FM ●



Backwards searching is equivalent to traversing a suffix tree. But why go to this trouble? Our goal was to save memory, but the O(a,i) array would be very large: 4n integers ●







03/10/2011

The trick is to only store a few entries from the O array and compute the rest on the fly from the BWT string For example BWA (Li and Durbin, 2009) only stores O(•,i) for i's that are factors of 128 A DNA string, or its BWT, can be stored using only two bits per character This leads to a massive memory savings CS 555 - Biological and Linguistic Sequence Analysis

30

Inexact tree matching ●





Can use suffix tree to align portion of the read and extend (like BYP) Some use greedy backtracking approaches that do not guarantee that all matches will be found Others do all possible backtrackings ●

03/10/2011

The dotted line shows a path through the prefix tree of “GOOGOL” looking for the pattern “LOL” with 1 mismatch

Li and Durbin, Bioinformatics, 2009

CS 555 - Biological and Linguistic Sequence Analysis

31

Inexact tree matching ●

BWA (Li and Durbin, 2009) computes a lower bound on the number of possible mismatches for substrings of the pattern



For the previous example, “LOL” and “GOOGOL”: ●

D(0) = 0



D(1) = 1 → “LO” is not a substring

● ●

D(2) = 1

Therefore would not traverse the “G” and “O” branches

03/10/2011

CS 555 - Biological and Linguistic Sequence Analysis

32

Additional Complications ●

Incorporating base quality scores ●





Bowtie (Langmead et al 2009) chooses backtracking paths based on quality scores Others re-rank alignments based on the quality scores of their matches and mismatches

Both the reads and their reverse complements need to be mapped to the reference ●

03/10/2011

Need to either hold two data structures in memory or convert on the fly

CS 555 - Biological and Linguistic Sequence Analysis

33

Current State of Short Read Aligners ●

Hash-Index based aligners are the most sensitive ●







Important for applications where mismatches cause big problems like structural variation detection Can be very slow and use lots of memory

Suffix tree based aligners are gaining in popularity ●

Much faster



Dramatically less memory usage (3GB vs. 8-12GB)

One recent strategy is to use a tiered alignment strategy, with a tree based aligner first and an indexed aligner on the reads it fails to map unambiguously

03/10/2011

CS 555 - Biological and Linguistic Sequence Analysis

34

de novo Short Read Assembly ●

What if there's no genome reference to align to? ● ●





Sequencing new species from scratch Metagenomic studies – take a sample (of water, soil, etc) and sequence all the DNA you find in it Characterizing areas that are heavily divergent from the reference

Can't cast these problems as mapping/alignment anymore ●



03/10/2011

Instead, we want to piece the short reads back together to find the long sequences from which they came Look for overlaps between reads CS 555 - Biological and Linguistic Sequence Analysis

35

De Bruijn Graphs ●



Most de novo assemblers use a data structure called a de Bruijn graph to hold overlaps between reads Each node represents a set of k-mers from the reads that overlap by k – 1 symbols Zerbino and Birney, Genome Res, 2008

03/10/2011

CS 555 - Biological and Linguistic Sequence Analysis

36

Assembly with de Bruijn Graphs ●









Use a set of hash table indices to add each read to the graph Collapse non-branching chains of nodes into single nodes to produce 'contigs' Try to use paired end reads, which come from a known distance apart, to scaffold contigs together This process is extremely RAM intensive – often conducted on 512GB supercomputers Active research into parallelizing and distributing the process

03/10/2011

CS 555 - Biological and Linguistic Sequence Analysis

37