Approximate String Matching


Author: Ethel McKinney


Approximate String Matching
Theory vs. Practice

Ricardo Baeza-Yates
Center for Web Research, www.cwr.cl
Depto. de Ciencias de la Computación
Universidad de Chile
Santiago, CHILE
[email protected]

Outline

– Text Retrieval
– Problem
– String searching
– From automata to algorithms
– Filtering
– Indices
– ASM with Indices
– Concluding remarks

Based on surveys by Baeza-Yates [3], Baeza-Yates [5], Navarro [19] and Navarro et al. [23], and own work and views.

User's point of view

[Figure: text retrieval diagram with the labels Query, Text, User-defined text normalization, User-defined index points and structure, Indexing Algorithm, Searching Algorithm, Index, Text, Answer, User Interface]

Tools vs. Intelligence

Applications to other areas: Web retrieval, XML processing, NL processing, text mining, multimedia search, bioinformatics, signal processing, ...

Theory vs. Practice

How can we measure the goodness of an algorithm?
– Asymptotic worst case behavior
– Asymptotic average case behavior
– Practical behavior

D. Knuth [IFIP'89 Invited speech]: balance between theory and practice
– Software is hard
– The best theory is inspired by practice
– The best practice is inspired by theory

Problem

– Alphabet: finite, and its size is considered bounded
– Text: of length n
– Pattern: of length m
– Problem: find all occurrences of the pattern in the text

Search Models
– exact match
– approximate match (distance function needed): closest match or all matches at a certain distance

Answer Models
1. a word (depends on the language)
2. any sequence starting in an index-point
Some data structures assume the first model

Computation Models
– RAM model with machine words of a given size
– Text-Pattern comparisons
– Arithmetical/Bitwise operations

Time complexity: time needed to find the pattern, or an equivalent measure (for example, comparisons)
– Worst-case
– Average-case (uniform text and pattern)

Space complexity: the extra space used for the search (index)
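The problem statement above (find all occurrences of a pattern in a text, with time measured in text-pattern comparisons) can be illustrated with a minimal brute-force scan. This is only a sketch: the function name `brute_force_search` and its parameters are hypothetical and do not come from the slides, and positions are 0-based here.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper (not from the slides): naive sequential search.
   Stores up to max_out 0-based match positions in out, returns the
   total number of occurrences, and sets *comparisons to the number of
   text-pattern character comparisons performed. */
size_t brute_force_search(const char *text, const char *pat,
                          size_t *out, size_t max_out,
                          size_t *comparisons)
{
    size_t n = strlen(text), m = strlen(pat), found = 0;

    assert(m > 0);                 /* an empty pattern is not handled */
    *comparisons = 0;

    for (size_t i = 0; i + m <= n; i++) {   /* every candidate alignment */
        size_t j = 0;
        while (j < m) {
            (*comparisons)++;
            if (text[i + j] != pat[j])
                break;             /* mismatch: try the next alignment */
            j++;
        }
        if (j == m) {              /* all m characters matched */
            if (found < max_out)
                out[found] = i;
            found++;
        }
    }
    return found;
}
```

On the text "aaaa" and the pattern "aa" every alignment matches, which is the kind of input where the comparison count grows with the product of the two lengths rather than with their sum.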

Algorithmic point of view

Input data:
– Raw pattern and text: sequential, on-line, real-time algorithms
– Preprocessing of the pattern: pattern is known in advance
– Preprocessing of the text: index
  – Inverted index
  – Suffix trees (tries, Patricia trees, ....)
  – Suffix arrays
  – Based on q-grams
  – Automata: DAWGs, suffix based
– Hybrid solutions:
  – Filtering or Filtration
  – Two Level TR

String-Matching Space-Time Trade-Offs

[Figure: space complexity plotted against time complexity. Regions: Indexed search, Hybrid solutions, Sequential search. Labels: Tries, Patricia trees, Suffix arrays, Inverted files, Signature files, Two level TR, Boyer-Moore like algorithms, KMP, Shift-or, Brute force, RAC]

String Matching: Definition

Basic problem: find exact occurrences of a pattern in a text

Variations
– Allow mismatches (Hamming distance)
– Allow insertions (Episode distance, not symmetric)
– Allow insertions and deletions (LCS distance)
– Allow mismatches, insertions and deletions
– Language dependent measure: phonetic, morphemes, etc.

Examples, on the text "This is a text example ...": text, ex, t ext

Software examples: grep command in Unix (sequential) or Google in the Web (index based).

String Matching Complexity

n: size of the text
m: size of the pattern

Raw text
– Worst case: lower and upper bound, measured in comparisons
– Average case: lower and upper bound
– ASM: worst case and average case bounds

Preprocessed text
– Index construction: time and space (finite alphabet)
– Worst case: measured in comparisons
– Average case: measured in comparisons
– ASM: several results, still open
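The distance-based variations listed under String Matching: Definition can be made concrete with a short sketch. It assumes the usual textbook definitions: Hamming distance counts mismatching positions between equal-length strings, LCS distance counts insertions and deletions only, and allowing mismatches, insertions and deletions gives what is commonly called edit (Levenshtein) distance. The function names are hypothetical, and the episode distance is not covered.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hamming distance: number of mismatching positions.
   Only defined for equal lengths; returns -1 otherwise. */
int hamming_distance(const char *a, const char *b)
{
    size_t la = strlen(a), lb = strlen(b);
    int d = 0;

    if (la != lb)
        return -1;
    for (size_t i = 0; i < la; i++)
        if (a[i] != b[i])
            d++;
    return d;
}

/* Shared dynamic program, one row at a time. Insertions and deletions
   cost 1; a mismatch costs subst_cost (1 = edit distance, 2 = LCS
   distance, since a substitution is then one deletion plus one
   insertion). Returns -1 if memory cannot be allocated. */
static int dp_distance(const char *a, const char *b, int subst_cost)
{
    size_t la = strlen(a), lb = strlen(b);
    int *row, d;

    assert(subst_cost == 1 || subst_cost == 2);
    row = malloc((lb + 1) * sizeof *row);
    if (row == NULL)
        return -1;

    for (size_t j = 0; j <= lb; j++)
        row[j] = (int)j;                    /* empty prefix of a */

    for (size_t i = 1; i <= la; i++) {
        int diag = row[0];                  /* value at (i-1, j-1) */
        row[0] = (int)i;
        for (size_t j = 1; j <= lb; j++) {
            int up = row[j];                /* value at (i-1, j) */
            int best = diag + (a[i - 1] != b[j - 1] ? subst_cost : 0);
            if (up + 1 < best)
                best = up + 1;              /* delete a[i-1] */
            if (row[j - 1] + 1 < best)
                best = row[j - 1] + 1;      /* insert b[j-1] */
            row[j] = best;
            diag = up;
        }
    }
    d = row[lb];
    free(row);
    return d;
}

int edit_distance(const char *a, const char *b) { return dp_distance(a, b, 1); }
int lcs_distance(const char *a, const char *b)  { return dp_distance(a, b, 2); }
```

For instance, "text" and "test" are at Hamming distance 1 and edit distance 1, but at LCS distance 2, because without substitutions the x has to be deleted and the s inserted.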

Classical Algorithms

[Figure: alignments of the pattern against the text, with mismatching characters x and y, for each of the following]
– Knuth-Morris-Pratt
– Boyer-Moore: match heuristic and occurrence heuristic
– Horspool
– Sunday

Match heuristic defines BM automata

String Searching: Historical View

[Figure: timeline from 1970 to 1992 (marks at 1970, 1980, 1986, 1988, 1990, 1992), laid out between Theory and Practice. Names shown: Knuth-Morris-Pratt, Fischer-Paterson, Boyer-Moore, Rytter, Horspool, Galil, Karp-Rabin, Apostolico-Giancarlo, Abrahamson, Sunday, Crochemore-Perrin, Wu-Manber, Baeza-Yates, Baeza-Yates/Gonnet, Regnier, Hume-Sunday, Baeza-Yates/Perleberg, Choffrut, Cole, Colussi-Galil-Giancarlo, Cole-Hariharan]
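As an illustration of the occurrence heuristic named among the classical algorithms, here is a minimal sketch of Horspool's simplification of Boyer-Moore, where the shift depends only on the text character aligned with the last pattern position. The function name `horspool_search` is hypothetical, positions are 0-based, and the window is compared with `memcmp` for brevity rather than character by character from right to left.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper (not from the slides): returns the 0-based
   position of the first occurrence of pat in text at or after start,
   or -1 if there is none. */
long horspool_search(const char *text, const char *pat, size_t start)
{
    size_t n = strlen(text), m = strlen(pat);
    size_t shift[UCHAR_MAX + 1];
    size_t i = start;

    assert(m > 0);                 /* an empty pattern is not handled */

    /* Occurrence table: characters absent from pat[0..m-2] allow a
       full shift of m; others shift by the distance from their last
       occurrence in pat[0..m-2] to the end of the pattern. */
    for (size_t c = 0; c <= UCHAR_MAX; c++)
        shift[c] = m;
    for (size_t j = 0; j + 1 < m; j++)
        shift[(unsigned char)pat[j]] = m - 1 - j;

    while (i + m <= n) {
        if (memcmp(text + i, pat, m) == 0)
            return (long)i;
        /* Shift by the text character under the last pattern position. */
        i += shift[(unsigned char)text[i + m - 1]];
    }
    return -1;
}
```

Calling it again with `start` set to one past the previous result enumerates all occurrences; on "abracadabra" with the pattern "abra" this yields positions 0 and 7.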

Knuth-Morris-Pratt Algorithm

Fascinating story.... from theory and practice

Preprocessing: the table next[j], for j = 1, ..., m+1

Example:
pattern:  a b r a c a d a b r a
next[j]:  0 1 1 0 2 0 2 0 1 1 0 5

Worst case complexity: linear, O(n + m)

Extension to multiple patterns: Aho-Corasick

Algorithm

search( text, n, pat, m ) // Search pat[1..m] in text[1..n]
char text[], pat[];
int n, m;
{
    int next[MAX_PATTERN_SIZE];

    pat[m+1] = CHARACTER_NOT_IN_THE_TEXT;
    kmp( pat, m+1, pat, m+1, next ); // Preprocess pattern
    kmp( text, n, pat, m, next );    // Search text
    pat[m+1] = END_OF_STRING;
}

kmp( text, n, pat, m, next )
char text[], pat[];
int n, m, next[];
{
    static int dosearch = 0; // 0: preprocessing pass, otherwise: search pass
    int i, j;

    i = 1;
    if( !dosearch ) // Preprocessing
        j = next[1] = 0;
    else
        j = 1;
    do {
        if( j == 0 || text[i] == pat[j] ) {
            i++; j++;
            if( !dosearch ) { // Preprocessing
                if( text[i] != pat[j] ) next[i] = j;
                else next[i] = next[j];
            }
        }
        else j = next[j];
        if( dosearch && j > m ) { // Search
            Report_match_at_position( i-m );
            j = next[m+1];
        }
    } while( i