Approximate String Matching

Author: Ethel McKinney
Ricardo Baeza-Yates
Center for Web Research, www.cwr.cl
Depto. de Ciencias de la Computación, Universidad de Chile
Santiago, CHILE
[email protected]

Outline

– Text Retrieval
– Theory vs. Practice
– Problem
– String searching
– From automata to algorithms
– Filtering
– Indices
– ASM with Indices
– Concluding remarks

Based on surveys by Baeza-Yates [3], Baeza-Yates [5], Navarro [19] and Navarro et al. [23], and own work and views.

User’s point of view

[Figure: text retrieval system. Labels: Query, Text, User-defined text normalization, User-defined index points and structure, Indexing Algorithm, Searching Algorithm, Index, Answer, User Interface, ?]

Tools vs. Intelligence

Applications to other areas: Web retrieval, XML processing, NL processing, text mining, multimedia search, bioinformatics, signal processing, ....

Theory vs. Practice

How can we measure the goodness of an algorithm?
– Asymptotic worst case behavior
– Asymptotic average case behavior
– Practical behavior

Balance between theory and practice: D. Knuth [IFIP’89 Invited speech]
– Software is hard
– The best theory is inspired by practice
– The best practice is inspired by theory

Problem

– $\Sigma$: finite alphabet of size $\sigma$ ($\sigma$ is considered bounded)
– Text $T$ of length $n$
– Pattern $P$ of length $m$
– Problem: find all occurrences of $P$ in $T$

Answer Models
1. A match is a word (depends on the language)
2. A match is any sequence starting in an index-point
Some data structures assume the first model

Search Models
– exact match
– approximate match (distance function needed)
– closest match or all matches at a certain distance

Computation Models
– RAM model: words of a given size
– Text-Pattern comparisons
– Arithmetical/Bitwise operations

Space complexity: the extra space used for the search (index)

Time complexity: time needed to find the pattern, or equivalent measure (for example, comparisons)
– Worst-case
– Average-case (uniform text and pattern)
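As a baseline for the problem statement and the comparison-counting cost model, here is a minimal sketch of exact search by brute force. The function name `brute_force_search`, its parameters, and the 0-based positions are assumptions made for this illustration; they do not come from the slides.

```c
#include <stddef.h>
#include <string.h>

/* Brute-force exact search (illustrative sketch, names are hypothetical).
 * Tries every alignment of pat against text.
 * Stores up to max_hits starting positions (0-based) in hits[] and
 * returns the total number of occurrences. If comparisons is not NULL,
 * it receives the number of text-pattern character comparisons made. */
size_t brute_force_search(const char *text, const char *pat,
                          size_t *hits, size_t max_hits,
                          size_t *comparisons)
{
    size_t n = strlen(text), m = strlen(pat);
    size_t found = 0, cmp = 0;

    if (m > 0 && m <= n) {
        for (size_t i = 0; i + m <= n; i++) {
            size_t j = 0;
            while (j < m) {
                cmp++;                          /* one text-pattern comparison */
                if (text[i + j] != pat[j]) break;
                j++;
            }
            if (j == m) {                       /* whole pattern matched at i */
                if (found < max_hits) hits[found] = i;
                found++;
            }
        }
    }
    if (comparisons != NULL) *comparisons = cmp;
    return found;
}
```

Searching for "aa" in "aaaa" reports three overlapping occurrences after six comparisons. In the worst case this approach needs on the order of $nm$ comparisons, which is what the algorithms that follow improve on.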

Algorithmic point of view

Input data:
– Raw pattern and text: sequential, on-line, real-time algorithms
– Preprocessing of the pattern: pattern is known in advance
– Preprocessing of the text (index): indexed search
   – Inverted index
   – Suffix trees (tries, Patricia trees, ....)
   – Suffix arrays
   – Based on $q$-grams
   – Automata: DAWGs, suffix based
– Hybrid solutions:
   – Filtering or Filtration
   – Two Level TR

String-Matching Space-Time Trade-Offs

[Figure: space-time trade-off plot. Axes: Space Complexity and Time Complexity. Labels: Tries, Suffix arrays, Inverted files, Patricia trees, Hybrid solutions, Signature files, Two level TR, Sequential search, Boyer-Moore like algorithms, KMP, Shift-or, Brute force, RAC]
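To make the "preprocessing of the text" side of the trade-off concrete, the following is a minimal sketch of a suffix array with binary search. It assumes a NUL-terminated text, builds the array by plain sorting (not a linear-time construction), and uses hypothetical names (`build_suffix_array`, `sa_count`) that are not taken from the slides.

```c
#include <stdlib.h>
#include <string.h>

/* Text being indexed, read by the qsort comparator.
 * A file-level variable keeps the sketch short; it is not thread-safe. */
static const char *sa_text;

static int cmp_suffix(const void *a, const void *b)
{
    size_t i = *(const size_t *)a, j = *(const size_t *)b;
    return strcmp(sa_text + i, sa_text + j);
}

/* Build the suffix array of text[0..n-1] by sorting suffix start
 * positions lexicographically. Returns NULL if allocation fails;
 * the caller frees the result. */
size_t *build_suffix_array(const char *text, size_t n)
{
    size_t *sa = malloc((n ? n : 1) * sizeof *sa);
    if (sa == NULL) return NULL;
    for (size_t i = 0; i < n; i++) sa[i] = i;
    sa_text = text;
    qsort(sa, n, sizeof *sa, cmp_suffix);
    return sa;
}

/* Count occurrences of pat using two binary searches over the
 * suffix array: the first suffix whose m-character prefix is >= pat,
 * and the first one whose prefix is > pat. */
size_t sa_count(const char *text, const size_t *sa, size_t n, const char *pat)
{
    size_t m = strlen(pat);
    size_t lo = 0, hi = n;

    while (lo < hi) {                           /* lower bound */
        size_t mid = lo + (hi - lo) / 2;
        if (strncmp(text + sa[mid], pat, m) < 0) lo = mid + 1;
        else hi = mid;
    }
    size_t first = lo;
    hi = n;
    while (lo < hi) {                           /* upper bound */
        size_t mid = lo + (hi - lo) / 2;
        if (strncmp(text + sa[mid], pat, m) <= 0) lo = mid + 1;
        else hi = mid;
    }
    return lo - first;
}
```

The index costs one integer per text position. In exchange, each query inspects only about $\log n$ suffixes instead of scanning the whole text, which is the kind of trade the plot describes.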

String Matching: Definition

Basic problem: find exact occurrences of a pattern in a text

Variations
– Allow $k$ mismatches (Hamming distance)
– Allow $k$ insertions (Episode distance, not symmetric)
– Allow $k$ insertions and deletions (LCS distance)
– Allow $k$ mismatches, insertions and deletions
– Language dependent measure: phonetic, morphemes, etc.

Examples:
text: This is a text example ...
fragments shown: text, ex, t ext

Software examples: grep command in Unix (sequential) or Google in the Web (index based).

String Matching Complexity

$n$: size of the text
$m$: size of the pattern

Raw text (bounds on the number of comparisons)
– Worst case: matching lower and upper bound
– Average case: matching lower and upper bound
– ASM: separate worst-case and average-case bounds

Preprocessed text
– Index construction: cost in time and space (finite alphabet)
– Search, worst case: counted in comparisons
– Search, average case: counted in comparisons
– ASM: several results, still open
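Two of the variations listed under String Matching: Definition can be sketched directly. The block below is an illustration under stated assumptions: `count_k_mismatches` handles the Hamming case by checking every alignment, and `count_k_errors` handles mismatches, insertions and deletions (commonly called edit distance) with the classical column-wise dynamic programming scan. Both names and the choice to count matches rather than report them are assumptions, not taken from the slides.

```c
#include <stdlib.h>
#include <string.h>

/* Count alignments where pat matches text with at most k mismatches
 * (Hamming distance). Checks every alignment: O(mn) time. */
size_t count_k_mismatches(const char *text, const char *pat, size_t k)
{
    size_t n = strlen(text), m = strlen(pat), count = 0;
    if (m == 0 || m > n) return 0;
    for (size_t i = 0; i + m <= n; i++) {
        size_t errors = 0;
        for (size_t j = 0; j < m && errors <= k; j++)
            if (text[i + j] != pat[j]) errors++;
        if (errors <= k) count++;
    }
    return count;
}

/* Count text positions j such that some substring ending at j is within
 * edit distance k of pat (k mismatches, insertions and deletions).
 * col[i] holds the best distance between pat[0..i-1] and a substring
 * ending at the current text position. O(mn) time, O(m) space.
 * Returns (size_t)-1 if allocation fails. */
size_t count_k_errors(const char *text, const char *pat, size_t k)
{
    size_t n = strlen(text), m = strlen(pat), count = 0;
    size_t *col = malloc((m + 1) * sizeof *col);
    if (col == NULL) return (size_t)-1;

    for (size_t i = 0; i <= m; i++) col[i] = i;  /* column for empty text */
    for (size_t j = 0; j < n; j++) {
        size_t diag = col[0];                    /* value at (i-1, j-1) */
        col[0] = 0;                              /* a match may start anywhere */
        for (size_t i = 1; i <= m; i++) {
            size_t up = col[i];                  /* value at (i, j-1) */
            size_t best = diag + (pat[i - 1] != text[j]);
            if (col[i - 1] + 1 < best) best = col[i - 1] + 1;
            if (up + 1 < best) best = up + 1;
            col[i] = best;
            diag = up;
        }
        if (col[m] <= k) count++;                /* match ends at position j */
    }
    free(col);
    return count;
}
```

For the text "abcabd" and the pattern "abd", allowing one mismatch accepts the alignments "abc" and "abd". Allowing one error of any kind also accepts the substring "ab", which needs a single insertion to become "abd".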

Classical Algorithms

– Knuth-Morris-Pratt
– Boyer-Moore
   – Match heuristic
   – Occurrence heuristic
– Horspool
– Sunday

Match heuristic defines BM automata

[Figure: alignment diagrams showing pattern fragments x and y against the Text for Knuth-Morris-Pratt and for the two Boyer-Moore heuristics]

String Searching: Historical View

[Figure: timeline from 1970 to 1992 (marked years: 1970, 1980, 1986, 1988, 1990, 1992) with a Theory side and a Practice side. Names shown: Knuth-Morris-Pratt, Fischer-Paterson, Boyer-Moore, Rytter, Horspool, Karp-Rabin, Galil, Apostolico-Giancarlo, Abrahamson, Sunday, Crochemore-Perrin, Wu-Manber, Baeza-Yates, Baeza-Yates/Gonnet, Regnier, Baeza-Yates/Perleberg, Hume-Sunday, Colussi-Galil-Giancarlo, Cole, Choffrut, Cole-Hariharan]
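The occurrence heuristic is easiest to see in Horspool's simplification of Boyer-Moore, which keeps only that heuristic. The following is a minimal sketch under that assumption. The name `horspool_search`, the 0-based positions and the "first occurrence only" behaviour are choices made for this illustration.

```c
#include <limits.h>
#include <stddef.h>
#include <string.h>

/* Horspool search (occurrence heuristic only).
 * Returns the 0-based position of the first occurrence of pat[0..m-1]
 * in text[0..n-1], or (size_t)-1 if there is none. */
size_t horspool_search(const char *text, size_t n, const char *pat, size_t m)
{
    size_t shift[UCHAR_MAX + 1];

    if (m == 0 || m > n) return (size_t)-1;

    /* Default shift: the character does not occur in pat[0..m-2]. */
    for (size_t c = 0; c <= UCHAR_MAX; c++) shift[c] = m;
    /* Otherwise shift so that its last occurrence lines up. */
    for (size_t j = 0; j + 1 < m; j++)
        shift[(unsigned char)pat[j]] = m - 1 - j;

    size_t i = 0;
    while (i + m <= n) {
        size_t j = m;
        /* Compare the window right to left. */
        while (j > 0 && text[i + j - 1] == pat[j - 1]) j--;
        if (j == 0) return i;
        /* Shift by the text character under the last pattern position. */
        i += shift[(unsigned char)text[i + m - 1]];
    }
    return (size_t)-1;
}
```

Searching for "text" in "This is a text example ..." inspects the windows starting at positions 0, 4, 8 and 10, so most text characters are never compared. This skipping is why Boyer-Moore like algorithms do well on average.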

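Knuth-Morris-Pratt Algorithm

Fascinating story.... from theory and practice

Preprocessing: $next[j] = \max\{\, i \mid pat[k] = pat[j-i+k] \text{ for } k = 1, \ldots, i-1, \text{ and } pat[i] \neq pat[j] \,\}$ (0 if no such $i$ exists)

Example:

j         1  2  3  4  5  6  7  8  9 10 11 12
pat[j]    a  b  r  a  c  a  d  a  b  r  a
next[j]   0  1  1  0  2  0  2  0  1  1  0  5

Worst case complexity: $O(n + m)$

Extension to multiple patterns: Aho-Corasick

Algorithm

    /* Search pat[1..m] in text[1..n]. Positions are 1-based and pat
       needs room for index m+1. MAX_PATTERN_SIZE,
       CHARACTER_NOT_IN_THE_TEXT, END_OF_STRING and
       Report_match_at_position() are placeholders that the caller
       must supply. */
    void kmp( char text[], int n, char pat[], int m, int next[] );

    void search( char text[], int n, char pat[], int m )
    {
        int next[MAX_PATTERN_SIZE];

        pat[m+1] = CHARACTER_NOT_IN_THE_TEXT;   /* sentinel */
        kmp( pat, m, pat, m+1, next );  /* Preprocess pattern: fills next[1..m+1];
                                           n = m keeps i within m+1 */
        kmp( text, n, pat, m, next );   /* Search text */
        pat[m+1] = END_OF_STRING;
    }

    /* The same loop serves both phases: the first call (dosearch == 0)
       builds next[] from the pattern, the next call scans the text. */
    void kmp( char text[], int n, char pat[], int m, int next[] )
    {
        static int dosearch = 0;
        int i, j;

        i = 1;
        if( !dosearch )                 /* Preprocessing */
            j = next[1] = 0;
        else
            j = 1;
        do {
            if( j == 0 || text[i] == pat[j] ) {
                i++; j++;
                if( !dosearch ) {       /* Preprocessing */
                    if( text[i] != pat[j] ) next[i] = j;
                    else next[i] = next[j];
                }
            }
            else j = next[j];
            if( dosearch && j > m ) {   /* Search */
                Report_match_at_position( i-m );
                j = next[m+1];
            }
        } while( i <= n );      /* the source listing ends at "while( i";
                                   the bound and the toggle below are
                                   a reconstruction */
        dosearch = !dosearch;
    }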
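For experimenting with the example table, here is a self-contained sketch of the same idea. It is not the slide's listing: it splits preprocessing and searching into two functions instead of sharing one loop through a static flag, and it uses the string's terminating '\0' as the sentinel. The names `kmp_next` and `kmp_search` are assumptions. Positions in `next[]` and in the reported matches are 1-based, as on the slide.

```c
#include <stdlib.h>
#include <string.h>

/* Fill next[1..m+1] for the NUL-terminated pattern pat of length m.
 * next[] must have room for m + 2 ints. Position j on the slide
 * corresponds to pat[j - 1] here; pat[m] == '\0' acts as the sentinel
 * at position m + 1. */
void kmp_next(const char *pat, int m, int next[])
{
    int i = 1, j = 0;
    next[1] = 0;
    while (i <= m) {
        if (j == 0 || pat[i - 1] == pat[j - 1]) {
            i++; j++;
            if (pat[i - 1] != pat[j - 1]) next[i] = j;
            else next[i] = next[j];
        } else {
            j = next[j];
        }
    }
}

/* Search pat (length m) in text (length n). Stores up to max_hits
 * 1-based match positions in hits[] and returns the number of
 * occurrences, or -1 if allocation fails. */
int kmp_search(const char *text, int n, const char *pat, int m,
               int hits[], int max_hits)
{
    int found = 0;
    if (m <= 0 || n < m) return 0;

    int *next = malloc((size_t)(m + 2) * sizeof *next);
    if (next == NULL) return -1;
    kmp_next(pat, m, next);

    int i = 1, j = 1;
    while (i <= n) {
        if (j == 0 || text[i - 1] == pat[j - 1]) {
            i++; j++;
            if (j > m) {                        /* full match ends at i - 1 */
                if (found < max_hits) hits[found] = i - m;
                found++;
                j = next[m + 1];                /* continue, allowing overlaps */
            }
        } else {
            j = next[j];                        /* text pointer i never backs up */
        }
    }
    free(next);
    return found;
}
```

Calling `kmp_next("abracadabra", 11, next)` fills `next[1..12]` with 0 1 1 0 2 0 2 0 1 1 0 5, the values in the example above. Since `i` never decreases, the text is read strictly left to right.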
