String Matching and Suffix Tree

Author: Augustine McBride

Gusfield Ch1-7 EECS 458 CWRU Fall 2004

BLAST (Altschul et al. ’90)
• Idea: true match alignments are very likely to contain a short segment that is identical (or scores very highly).
• Consider every substring (seed) of length w, where w=12 for DNA and w=4 for protein.
• Find the exact occurrences of each seed (how?)
• Extend each hit in both directions and stop at the maximum score.

Problems
• Pattern matching: find the exact occurrences of a given pattern in a given structure (string matching)
• Pattern recognition: recognize approximate occurrences of a given pattern in a given structure (image recognition)
• Pattern discovery: identify significant patterns in a given structure when the patterns are unknown (promoter discovery)

Definitions
• String S[1..m]
• Substring S[i..j]
• Prefix S[1..i]
• Suffix S[i..m]
• Proper substring, prefix, suffix
• Exact matching problem: given a string P called the pattern and a long string T called the text, find all occurrences of P in T.

Naïve method
• Align the left end of P with the left end of T
• Compare letters of P and T from left to right, until
  – Either a mismatch is found (not an occurrence)
  – Or P is exhausted (an occurrence)
• Shift P one position to the right
• Restart the comparison from the left end of P
• Repeat the process until the right end of P shifts past the right end of T
• Time complexity: worst case Θ(mn), where m=|P| and n=|T|
• Not good enough!
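
The steps of the naïve method can be sketched in a few lines of Python. This is only an illustration: the function name `naive_match` and the choice to report 1-based start positions (to match the S[1..m] indexing used in these slides) are assumptions, not part of the original material.

```python
def naive_match(P: str, T: str) -> list[int]:
    """Return the 1-based start position of every occurrence of P in T."""
    m, n = len(P), len(T)
    occurrences = []
    if m == 0:
        return occurrences
    # s is the 0-based shift of P's left end along T.
    for s in range(n - m + 1):
        j = 0
        # Compare left to right until a mismatch is found or P is exhausted.
        while j < m and P[j] == T[s + j]:
            j += 1
        if j == m:
            occurrences.append(s + 1)  # P was exhausted: an occurrence
    return occurrences


print(naive_match("aa", "aaaa"))  # [1, 2, 3]
```

Every shift can cost up to m comparisons and there are n-m+1 shifts, which is where the Θ(mn) worst case comes from (for example P=aa…a against T=aa…a).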

Speedup
• Ideas:
  – When a mismatch occurs, shift P by more than one letter, but never so far as to miss an occurrence
  – After shifting, skip over parts of P to reduce comparisons
  – Preprocess P or T

Fundamental preprocessing
• Can be applied to the pattern P or the text T.
• Given a string S (|S|=m) and a position i>1, define Zi as the length of the longest common prefix of S and S[i..m].
• Example: S=abxyabxz

i    1  2  3  4  5  6  7  8
S    a  b  x  y  a  b  x  z
Zi   -  0  0  0  3  0  0  0

[Figure: S aligned against its own suffixes; only S[5..8]=abxz shares a nonempty prefix (abx) with S.]

Fundamental preprocessing
• Intention:
  – Concatenate P and T, separated by an extra letter $ that occurs in neither: S=P$T
  – For every i, Zi≤|P|
  – Every i>|P|+1 with Zi=|P| records an occurrence of P in T, starting at position i-(|P|+1) of T
• Question: what is the running time to compute all the Zs? The naïve method according to the definition runs in Θ((m+n)^2) time!
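
As a small sketch of the S=P$T idea, the following Python computes the Z-values directly from the definition (the quadratic naïve method) and reads the occurrences off them. The function names `z_by_definition` and `occurrences_via_z`, the `sep` parameter, and the padding of the list so that indices are 1-based are all illustrative assumptions.

```python
def z_by_definition(S: str) -> list[int]:
    """Z[i] for 1-based i in 2..|S|, straight from the definition.

    Z[0] and Z[1] are unused placeholders so indices match the slides.
    """
    m = len(S)
    Z = [0] * (m + 1)
    for i in range(2, m + 1):
        length = 0
        # S[i..m] starts at 0-based index i-1; extend while letters agree.
        while i - 1 + length < m and S[length] == S[i - 1 + length]:
            length += 1
        Z[i] = length
    return Z


def occurrences_via_z(P: str, T: str, sep: str = "$") -> list[int]:
    """1-based start positions of P in T, read off the Z-values of P$T."""
    if sep in P or sep in T:
        raise ValueError("separator must not occur in P or T")
    if not P:
        return []
    p = len(P)
    Z = z_by_definition(P + sep + T)
    # Position i of S=P$T corresponds to position i-(p+1) of T.
    return [i - (p + 1) for i in range(p + 2, len(Z)) if Z[i] == p]


print(z_by_definition("abxyabxz")[2:])  # [0, 0, 0, 3, 0, 0, 0]
print(occurrences_via_z("aa", "aaaa"))  # [1, 2, 3]
```

The separator guarantees Zi≤|P|, because no match can extend across a letter that appears only once in S.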

Fundamental preprocessing
• Goal: linear time to compute all the Zs
• Z-box: for Zi>0, the box starting at i with length Zi (ending at i+Zi-1)
• ri: the rightmost end j+Zj-1 of any Zj-box over all 1<j≤i, for i>1
• li: the left end of the Zj-box ending at ri
[Figure: the Z-box S[li..ri], labeled α, matches the prefix S[1..Zli]; position i lies between li and ri.]

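Fundamental preprocessing
• Computing Zk:
  – Given Zi for all 1<i≤k-1, together with the values of r and l for position k-1
  1. k>r: compute Zk by explicitly comparing S[k..m] with S[1..m] until a mismatch is found (updating r and l if Zk>0)
  2. k≤r: k is in the Z-box starting at l (substring S[l..r]), therefore S[k]=S[k-l+1], S[k+1]=S[k-l+2], …, S[r]=S[Zl]. In other words, Zk≥min{Zk-l+1, r-k+1}
[Figure: the Z-box S[l..r]=α matches the prefix S[1..Zl]; β=S[k..r] matches S[k-l+1..Zl].]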

Fundamental preprocessing
• A) Zk-l+1 < (r-k+1): Zk=Zk-l+1, and r, l remain unchanged
• B) Zk-l+1 ≥ (r-k+1): Zk ≥ (r-k+1); compare S[r+1] with S[r-k+2], and so on, until a mismatch is found (then update l=k and r=k+Zk-1)
[Figure: case A, the match starting at k-l+1 ends strictly inside β; case B, it reaches the end of β, so the letters after r (marked ?) must be compared explicitly.]

Fundamental preprocessing
• Conclusions:
  1. Zk is correctly computed
  2. There are a constant number of operations besides comparisons for each k
    – |S| iterations
    – Whenever a mismatch occurs, the iteration terminates
    – Whenever a match occurs, r is increased
  3. In total at most |S| mismatches and at most |S| matches
  4. Running time Θ(|S|) and space Θ(|S|)
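
Putting the cases k>r, A and B together gives the linear-time computation. The sketch below follows the slides' 1-based positions by padding the result list; the name `z_values` and the local names `kp`, `beta` and `q` are illustrative choices, not taken from the original.

```python
def z_values(S: str) -> list[int]:
    """Z[i] for 1-based i in 2..|S|; Z[0] and Z[1] are unused placeholders."""
    m = len(S)
    Z = [0] * (m + 1)
    l = r = 0  # 1-based ends of the Z-box reaching furthest right; 0 = none yet
    for k in range(2, m + 1):
        if k > r:
            # Case 1: k is outside every Z-box found so far; compare explicitly.
            q = 0
            while k + q <= m and S[q] == S[k - 1 + q]:
                q += 1
            Z[k] = q
            if q > 0:
                l, r = k, k + q - 1
        else:
            kp = k - l + 1    # mirror position of k inside the prefix
            beta = r - k + 1  # length of S[k..r]
            if Z[kp] < beta:
                # Case A: copy the value; l and r remain unchanged.
                Z[k] = Z[kp]
            else:
                # Case B: S[k..r] is known to match; continue from
                # S[r+1] against S[r-k+2] until a mismatch.
                q = r + 1  # 1-based position of the letter being tested
                while q <= m and S[q - 1] == S[q - k]:
                    q += 1
                Z[k] = q - k
                l, r = k, q - 1
    return Z


print(z_values("abxyabxz")[2:])  # [0, 0, 0, 3, 0, 0, 0]
```

Each successful comparison in Case 1 or Case B pushes r to the right, and each iteration ends at its first mismatch, which matches the counting argument in the conclusions above.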

Fundamental preprocessing
• Theorem: there is a Θ(n+m)-time and Θ(n+m)-space algorithm that finds all occurrences of P in T, where m=|P| and n=|T|.
• Notes:
  – Alphabet-independent
  – Space requirement can be reduced to Θ(m)
  – Not well suited to searching for multiple patterns
  – Strictly linear: every letter in T has to be compared at least once

Projects
• Topics
• Meeting: 3 times, as a group
• Presentations: 25 minutes/student (~20m talk + 5m questions)
• Term paper: single-spaced, 11pt, 1in margins; 5-6p, 9-10p, 10-12p, excluding references

The Boyer-Moore algorithm: an example
• P=abxabxab, T=daaabxababxabxab
[Figure: P drawn beneath T at successive shifts.]

The Boyer-Moore algorithm
• Rule 1: right-to-left comparison
• Rule 2: bad character rule
  – For each x∈Σ, R(x) denotes the position of the rightmost occurrence of x in P (0 if x does not appear)
  – When a mismatch occurs, T[k] against P[i], shift P right by max{1, i-R(T[k])} places. When R(T[k])<i, this aligns T[k] with P[R(T[k])]
  – O(|Σ|) space to store the R-values
• Rule 3: good suffix rule
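
Rules 1 and 2 alone already give a correct (if not always fast) matcher. The sketch below uses only right-to-left comparison and the bad character rule, leaving out the good suffix rule entirely. The function names are illustrative, and a dictionary stands in for the |Σ|-sized table of R-values, which is an assumption made for brevity.

```python
def rightmost_positions(P: str) -> dict[str, int]:
    """R(x): 1-based position of the rightmost occurrence of x in P."""
    R = {}
    for pos, ch in enumerate(P, start=1):
        R[ch] = pos  # later occurrences overwrite earlier ones
    return R


def bad_character_search(P: str, T: str) -> list[int]:
    """1-based start positions of P in T, using Rules 1 and 2 only."""
    m, n = len(P), len(T)
    if m == 0 or m > n:
        return []
    R = rightmost_positions(P)
    occurrences = []
    h = m  # 1-based position of T aligned with the right end of P
    while h <= n:
        i, k = m, h
        # Rule 1: compare right to left.
        while i > 0 and P[i - 1] == T[k - 1]:
            i -= 1
            k -= 1
        if i == 0:
            occurrences.append(k + 1)
            h += 1  # simplest safe shift after a full match
        else:
            # Rule 2: shift by max{1, i - R(T[k])}; R is 0 for absent letters.
            h += max(1, i - R.get(T[k - 1], 0))
    return occurrences


print(bad_character_search("abxabxab", "daaabxababxabxab"))  # [9]
```

On this example most shifts are still by one place, because the mismatched text letter usually occurs to the right of the mismatch position in P; this is the situation the good suffix rule is meant to improve.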

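The Boyer-Moore algorithm
[Figure: good suffix rule. Before the shift, the suffix t of P matches T and the letter y of P mismatches the letter x of T. After the shift, the copy t’ of t in P, preceded by a letter z, is aligned with t in T.]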

The Boyer-Moore algorithm
• Rule 3: good suffix rule
  – When a mismatch occurs, T[k] against P[i]
  – Find the rightmost occurrence of P[(i+1)..m] in P such that the letter to its left differs from P[i]
  – Shift P right so that this occurrence of P[(i+1)..m] is aligned with T[(k+1)..(m+k-i)]
  – If there is no such occurrence of P[(i+1)..m], find the longest prefix of P that matches a suffix of P[(i+1)..m], and shift P right so that this prefix is aligned with the corresponding suffix

Preprocessing for the good suffix rule
• Let L(i) denote the largest position less than m such that the string P[i..m] matches a suffix of P[1..L(i)]
• Let N(j) denote the length of the longest suffix of the substring P[1..j] that is also a suffix of P
• Recall Zi: the length of the longest substring of S that starts at i and matches a prefix of S.
[Figure: t=P[i..m] and its copy ending at L(i); Zi measured from position i against the prefix of the string; N(j) measured leftward from position j against the suffix P[m-N(j)+1..m].]

Preprocessing for the good suffix rule
• Theorem: L(i) is the largest index j less than m such that N(j)≥|P[i..m]|=m-i+1.
[Figure: P[i..m] and its copy ending at L(i); the suffix of P[1..j] of length N(j) matches P[m-N(j)+1..m].]
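
Because N(j) is the mirror image of Zi, it can be obtained by computing Z-values on the reversed pattern: N(j) of P equals Z at position m-j+1 of the reverse of P. The sketch below uses that identity and then evaluates L(i) by scanning for the largest qualifying j, exactly as the theorem states it. The function names are illustrative, the Z-values are computed from the definition for brevity, and the scan for L(i) is deliberately the simple quadratic one rather than a linear-time method.

```python
def z_by_definition(S: str) -> list[int]:
    """Z[i] for 1-based i in 2..|S|; Z[0] and Z[1] are unused placeholders."""
    m = len(S)
    Z = [0] * (m + 1)
    for i in range(2, m + 1):
        length = 0
        while i - 1 + length < m and S[length] == S[i - 1 + length]:
            length += 1
        Z[i] = length
    return Z


def n_values(P: str) -> list[int]:
    """N[j] for 1-based j in 1..m-1 (N[0] and N[m] are left as 0, unused)."""
    m = len(P)
    Z = z_by_definition(P[::-1])  # Z-values of the reversed pattern
    N = [0] * (m + 1)
    for j in range(1, m):
        N[j] = Z[m - j + 1]
    return N


def l_values(P: str) -> list[int]:
    """L[i] for 1-based i in 2..m; 0 means no index j qualifies."""
    m = len(P)
    N = n_values(P)
    L = [0] * (m + 1)
    for i in range(2, m + 1):
        needed = m - i + 1  # |P[i..m]|
        # Largest j < m with N(j) >= |P[i..m]|, as in the theorem.
        for j in range(m - 1, 0, -1):
            if N[j] >= needed:
                L[i] = j
                break
    return L


P = "cabdabdab"
print(n_values(P)[3], n_values(P)[6])  # 2 5
print(l_values(P)[8])                  # 6
```

For P=cabdabdab, the suffix P[8..9]=ab reappears ending at position 6 (and at position 3), and the largest such end position below m is 6, which is the value reported for L(8).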

The Knuth-Morris-Pratt Algorithm
• History:
  – Best known
  – Not the method of choice, inferior in practice
  – Can be generalized for multiple string matching
• Preprocessing P
• Example:
[Figure: the string abxyabxzw drawn twice, with a position marked *.]

KMP
• Idea:
  – Left-to-right comparison
  – Shift P more places without missing an occurrence
• A prefix of P matches a proper suffix of P[1..i] and the next letters do not match!
• Define si of P, 2