Project MOHAIR. Why fuzzy strings are 2.7 times more fun than regular string.
Outline What’s the problem? The Chrysanthemum Corollary. The Ellis Island Effect.
Define Fuzzy Logic. What do we need to make it work?
What has MOHAIR done for you?
What is the problem? Match data to user requests. The User can’t spell. The Chrysanthemum Corollary.
The Database is weird! The Ellis Island Effect.
Exact pattern matching is unreliable.
Fuzzy stuff Inexact pattern matching. But not too inexact! Reasonable matches ranked according to some metric.
We need: Some general purpose way of determining reasonable
matches.
Some way of ranking the results.
The Cauliflower Conundrum The Sainsburys Solution Input
Did you mean…
Caulyflower
Cauliflower
Cauliflour
Cauliflower
Koliflower
Cauliflower
Koliflour
No Match
Colyflower
Cornflower (the plant) Cornflour (the grocery)
Not quite fuzzy enough? IN (‘pat1’ [, ‘pat2’...]) LIKE ‘...’ SIMILAR TO ‘[ck](au|o)ll?[iy]flo(u|we)r[[:SPACE:]]*’ Input
Accepted
cauliflower
Y
caulyflower
Y
cauliflour
Y
koliflour
Y
kolliflour
Y
kolliflower
Y
cauliflower soup
N
MOHAIR: Character Classes Fixed WHITESPACE Added: ASCII, BLANK, CNTRL, GRAPH, PRINT, PUNCT,
Ye Olde Patent Leather SOUNDEX. Pretty good for a centenarian, but… Leading character is too significant. Four character code leads to false positives. Name
SOUNDEX(Name)
Nife
N100
Nichols
N242
Nicholson
N242
Schwartsenegger
S632
Shwarzenegger
S625
Schwarzenegger
S625
Single Knife code leads to false K510negatives.
The SOUNDEX collection. SOUNDEX_DM The Daitch-Mokotoff Soundex. Up to 16 six character codes in a comma separated list.
SOUNDEX_NYSIIS (aka NYSIIS) New York State Identification and Intelligence System. 2.7% more accurate than SOUNDEX. Single code, varchar(128).
Size matters, its more fun in groups
Name
Soundex
Soundex_dm
NYSIIS
dmetaphone
Knife
K510
567000
NAF
NF
Nife
N100
670000
NAF
NF
Nichols
N242
658400,648400
NACAL
NXLS,NKLS
Nicholson
N242
658460,648460
NACALSAN
NXLS,NKLS
Schwartsenegger S632
479465
SWARTSANAGAR
XRTS,XFRT
Shwarzenegger
S625
479465,474659
SWARSANAGAR
XRSN,XRTS
Schwarzenegger
S625
479465,474659
SWARSANAGAR
XRSN,XFRT
But which one? Name
Soundex
Soundex_dm
NYSIIS
dmetaphone
cauliflower
C414
587879,487879
CALAFLAR
KLFL
caulyflower
C414
587879,487879
CALAFLAR
KLFL
cauliflour
C414
587890,487890
CALAFLAR
KLFL
koliflour
K414
587890
CALAFLAR
KLFL
kolliflour
K414
587890
CALAFLAR
KLFL
kolliflower
K414
587879
CALAFLAR
KLFL
cauliflower soup
C414
587879,487879, 587874,487874
CALAFLARSAP
KLFL
Handling a pair. Application. Table procedures.
select element from tp_soundex_dm('nichols') ┌──────┐ │elemen│ ├──────┤ │658400│ │648400│
Table Procedure…it begins! create procedure tp_soundex_dm( string varchar(256) not null not default ) result row my_row( element char(6) not null not default ) as declare code_string varchar(111) not null not default;
Table Procedure…the body. select soundex_dm(:string) into :code_string; select length(:code_string) into :len; n = 1; while (1 > 0)
Ranking. Can we do better than alphabetical? String Distance Metrics. LEVENSHTEIN JARO_WINKLER
Levenshtein distance The minimum number of changes that need to be
made to convert one string into another. An integer. levenshtein(‘nichols’, ‘nicholson’) == 2 name
levenshtein(‘cauliflower’, name)
cauliflower
0
caulyflower
1
cauliflour
2
koliflower
3
koliflour
5
kolliflour
5
JARO_WINKLER A float value between 0 and 1, where 0.0 means no
similarities and 1.0 means the strings are identical.
Better with names. Name
jaro_winkler(‘schwarzenegger’, name)
schwarzenegger
1.000
shwarzenegger
0.979
schwartsenegger
0.944
schworzanayga
0.877
cauliflower
0.547
The Cauliflower Correction. A stock table… Item
Nysiis_code
cauliflower
CALAFLAR
cauliflower soup
CALAFLARSAP
cornflour
CARNFLAR
colander
CALANDAR
kaleidoscope
CALADASCAP
And they ask for a colliflour… select item from stock where left(nysiis_code, 4) = left(NYSIIS(?), 4) order by levenshtein(item, ?); Item cauliflower colander kaleidoscope cauliflower soup
Summary Fuzzy string matching is necessary...and fun! What MOHAIR has delivered. Corrections and additions to Character Classes SOUNDEX_DM
What MOHAIR is delivering. Extra Soundex routines:
SOUNDEX_NYSIIS, DOUBLE_METAPHONE