Approximate String Matching
Theory vs. Practice
Ricardo Baeza-Yates
Center for Web Research, www.cwr.cl
Depto. de Ciencias de la Computación
Universidad de Chile, Santiago, CHILE
[email protected]

Outline
– Text Retrieval Problem
– String searching
– From automata to algorithms
– Filtering
– Indices
– ASM with Indices
– Concluding remarks
Based on surveys by Baeza-Yates [3], Baeza-Yates [5], Navarro [19] and Navarro et al. [23], and own work and views.

User's point of view
[Figure: block diagram with the labels Query; Text; User-defined text normalization; User-defined index points and structure; Indexing Algorithm; Searching Algorithm; Index; Text; Answer; User Interface; ?]
Tools vs. Intelligence
Applications to other areas: Web retrieval, XML processing, NL processing, text mining, multimedia search, bioinformatics, signal processing, ....

Theory vs. Practice
How can we measure the goodness of an algorithm?
– Asymptotic worst case behavior
– Asymptotic average case behavior
– Practical behavior
D. Knuth [IFIP'89 Invited speech]
– Balance between theory and practice
– Software is hard
– The best theory is inspired by practice
– The best practice is inspired by theory

Problem
– Finite alphabet
– Text of length n
– Pattern of length m
– … is considered bounded
– Problem: find all occurrences of the pattern in the text
Search Models
– exact match
– approximate match (distance function needed): closest match or all matches at a certain distance
Answer Models
1. A word (depends on the language)
2. Any sequence starting in an index-point
Some data structures assume the first model

Computation Models
– RAM model: words of size …
– Text-Pattern comparisons
– Arithmetical/Bitwise operations
Space complexity: the extra space used for the search (index)
Time complexity: time needed to find the pattern, or equivalent measure (for example, comparisons)
– Worst-case
– Average-case (uniform text and pattern)
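
To make the "text-pattern comparisons" measure concrete, here is a minimal sketch of brute-force exact search that also counts character comparisons. It is not taken from the slides: the name brute_force_count and the 0-based indexing are assumptions made for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper (not from the slides): counts the occurrences of pat
 * in text by brute force. If comparisons is not NULL, it receives the number
 * of text-pattern character comparisons performed. */
size_t brute_force_count(const char *text, const char *pat, size_t *comparisons)
{
    size_t n = strlen(text), m = strlen(pat);
    size_t matches = 0, cmp = 0;

    assert(m > 0);                       /* the empty pattern is excluded */
    for (size_t i = 0; i + m <= n; i++) {
        size_t j = 0;
        while (j < m) {
            cmp++;                       /* one text-pattern comparison */
            if (text[i + j] != pat[j])
                break;
            j++;
        }
        if (j == m)
            matches++;                   /* occurrence starting at position i */
    }
    if (comparisons != NULL)
        *comparisons = cmp;
    return matches;
}
```

In the worst case this performs on the order of n times m comparisons, which is the baseline the later algorithms improve on.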

Algorithmic point of view
Input data:
Raw pattern and text
- sequential, on-line, real-time algorithms
Preprocessing of the pattern
- pattern is known in advance
Preprocessing of the text: Indexed search
– Inverted index
– Suffix trees (tries, Patricia trees, ....)
– Suffix arrays
– Based on -grams
– Automata: DAWGs, suffix based
Hybrid solutions:
– Filtering or Filtration
– Two Level TR

String-Matching Space-Time Trade-Offs
[Figure: plot of Space Complexity against Time Complexity. Labels shown: index; Tries; Suffix arrays; Inverted files; Patricia trees; Hybrid solutions; Signature files; Two level TR; Sequential search; Boyer-Moore like algorithms; KMP; Shift-or; Brute force; RAC]
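
As an illustration of preprocessing the text, the following is a minimal sketch of a suffix array with binary search. It is an assumption-laden example, not the construction used in the slides: the names suffix_array_build and suffix_array_count are hypothetical, the text is a NUL-terminated C string, and the index is built naively by sorting suffixes with qsort, which is simple but far from the best known construction time.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

static const char *sa_text;   /* text being indexed, read by the comparator */

static int cmp_suffix(const void *a, const void *b)
{
    return strcmp(sa_text + *(const int *)a, sa_text + *(const int *)b);
}

/* Hypothetical helper: returns the start positions of all n suffixes of
 * text in lexicographic order. The caller frees the result. */
int *suffix_array_build(const char *text, int n)
{
    assert(n > 0);
    int *sa = malloc((size_t)n * sizeof *sa);
    assert(sa != NULL);
    for (int i = 0; i < n; i++)
        sa[i] = i;
    sa_text = text;
    qsort(sa, (size_t)n, sizeof *sa, cmp_suffix);
    return sa;
}

/* Hypothetical helper: counts occurrences of pat with two binary searches
 * over the suffix array, comparing only the first strlen(pat) characters. */
int suffix_array_count(const char *text, const int *sa, int n, const char *pat)
{
    size_t m = strlen(pat);
    int lo = 0, hi = n;

    while (lo < hi) {                    /* first suffix with prefix >= pat */
        int mid = lo + (hi - lo) / 2;
        if (strncmp(text + sa[mid], pat, m) < 0) lo = mid + 1;
        else hi = mid;
    }
    int first = lo;
    hi = n;
    while (lo < hi) {                    /* first suffix with prefix > pat */
        int mid = lo + (hi - lo) / 2;
        if (strncmp(text + sa[mid], pat, m) <= 0) lo = mid + 1;
        else hi = mid;
    }
    return lo - first;                   /* suffixes that start with pat */
}
```

The trade-off is the one the slide names: extra space for the index in exchange for a search whose number of string comparisons grows with the logarithm of the text size.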

String Matching: Definition
Basic problem: find exact occurrences of a pattern in a text
Variations
– Allow mismatches (Hamming distance)
– Allow insertions (Episode distance, not symmetric)
– Allow insertions and deletions (LCS distance)
– Allow mismatches, insertions and deletions
– Language dependent measure: phonetic, morphemes, etc.
Examples:
text: This is a text example ...
[Figure: pattern fragments shown against the text: text; ex; t ext]
Software examples: grep command in Unix (sequential) or Google in the Web (index based).

String Matching Complexity
n: size of the text
m: size of the pattern
Raw text
– Worst case: lower and upper bound of … comparisons
– Average case: lower and upper bound …
– ASM: … worst case, … average case
Preprocessed text:
– Index construction: … time and space (finite alphabet)
– Worst case: … comparisons
– Average case: … comparisons
– ASM: several results, still open
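
The variation that allows mismatches, insertions and deletions can be sketched with the classical column-by-column dynamic programming approach. This is a standard textbook technique swapped in for illustration, not necessarily the method the slides go on to present. The name asm_count_end_positions and the error bound k are assumptions.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: counts the text positions at which some substring
 * ends that matches pat with at most k errors (mismatches, insertions,
 * deletions). Uses O(m) extra space and O(mn) time. */
int asm_count_end_positions(const char *text, const char *pat, int k)
{
    int n = (int)strlen(text), m = (int)strlen(pat);
    int count = 0;

    assert(m > 0 && k >= 0);
    int *col = malloc((size_t)(m + 1) * sizeof *col);
    assert(col != NULL);

    for (int i = 0; i <= m; i++)
        col[i] = i;                      /* column for the empty text */

    for (int j = 0; j < n; j++) {
        int diag = col[0];               /* value at (i-1, j-1) */
        col[0] = 0;                      /* a match may start anywhere */
        for (int i = 1; i <= m; i++) {
            int old = col[i];            /* value at (i, j-1) */
            int best = diag + (pat[i - 1] != text[j]);   /* match or mismatch */
            if (old + 1 < best)
                best = old + 1;          /* extra character in the text */
            if (col[i - 1] + 1 < best)
                best = col[i - 1] + 1;   /* pattern character skipped */
            col[i] = best;
            diag = old;
        }
        if (col[m] <= k)
            count++;                     /* an approximate match ends at j */
    }
    free(col);
    return count;
}
```

With k = 0 the function degenerates to exact matching, which gives a quick sanity check.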

Classical Algorithms
Knuth-Morris-Pratt
Boyer-Moore
– Match heuristic
– Occurrence heuristic
Match heuristic defines BM automata
[Figure: three alignments of the pattern against the Text, with segments labelled x and y]

String Searching: Historical View
[Figure: timeline with marks at 1970, 1980, 1986, 1988, 1990 and 1992, laid out between Theory and Practice. Names shown: Knuth-Morris-Pratt; Fischer-Paterson; Boyer-Moore; Rytter; Horspool; Karp-Rabin; Galil; Sunday; Apostolico-Giancarlo; Abrahamson; Crochemore-Perrin; Wu-Manber; Baeza-Yates; Baeza-Yates/Gonnet; Regnier; Baeza-Yates/Perleberg; Hume-Sunday; Colussi-Galil-Giancarlo; Cole; Choffrut; Cole-Hariharan]
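
The occurrence heuristic can be illustrated with Horspool's simplification of Boyer-Moore, which shifts the pattern using only the text character aligned with its last position. This is a minimal sketch under assumptions: the name horspool_search is hypothetical, indexing is 0-based, and only the first occurrence is reported.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: returns the 0-based position of the first occurrence
 * of pat in text, or -1 if there is none. */
long horspool_search(const char *text, const char *pat)
{
    size_t n = strlen(text), m = strlen(pat);
    size_t shift[UCHAR_MAX + 1];

    assert(m > 0);
    /* Occurrence table: distance from the last occurrence of each character
     * in pat[0..m-2] to the end of the pattern, or m if it does not occur. */
    for (size_t c = 0; c <= UCHAR_MAX; c++)
        shift[c] = m;
    for (size_t j = 0; j + 1 < m; j++)
        shift[(unsigned char)pat[j]] = m - 1 - j;

    size_t i = 0;                        /* start of the current window */
    while (i + m <= n) {
        size_t j = m;                    /* compare right to left */
        while (j > 0 && text[i + j - 1] == pat[j - 1])
            j--;
        if (j == 0)
            return (long)i;
        i += shift[(unsigned char)text[i + m - 1]];
    }
    return -1;
}
```

On typical text the window often advances by close to the pattern length, which is why Boyer-Moore like algorithms inspect fewer characters than the text contains.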

Knuth-Morris-Pratt Algorithm
Fascinating story.... from theory and practice
Preprocessing: next[j] … for …
Example:
  pat[j]:  a b r a c a d a b r a
  next[j]: 0 1 1 0 2 0 2 0 1 1 0 5
Worst case complexity: …
Extension to multiple patterns: Aho-Corasick

Algorithm
search( text, n, pat, m ) // Search pat[1..m] in text[1..n]
char text[], pat[];
int n, m;
{
    int next[MAX_PATTERN_SIZE];

    pat[m+1] = CHARACTER_NOT_IN_THE_TEXT;
    kmp( pat, m+1, pat, m+1, next ); // Preprocess pattern
    kmp( text, n, pat, m, next );    // Search text
    pat[m+1] = END_OF_STRING;
}

kmp( text, n, pat, m, next )
char text[], pat[];
int n, m, next[];
{
    static dosearch = 0;
    int i, j;

    i = 1;
    if( !dosearch ) // Preprocessing
        j = next[1] = 0;
    else
        j = 1;
    do {
        if( j == 0 || text[i] == pat[j] ) {
            i++; j++;
            if( !dosearch ) { // Preprocessing
                if( text[i] != pat[j] ) next[i] = j;
                else next[i] = next[j];
            }
        }
        else j = next[j];
        if( dosearch && j > m ) { // Search
            Report_match_at_position( i-m );
            j = next[m+1];
        }
} while( i