Project MOHAIR. Why fuzzy strings are 2.7 times more fun than regular string

Project MOHAIR. Why fuzzy strings are 2.7 times more fun than regular string. Outline What’s the problem? The Chrysanthemum Corollary. The Ellis ...
Author: Ronald Harrell
4 downloads 0 Views 222KB Size
Project MOHAIR. Why fuzzy strings are 2.7 times more fun than regular string.

Outline What’s the problem? The Chrysanthemum Corollary. The Ellis Island Effect.

Define Fuzzy Logic. What do we need to make it work?

What has MOHAIR done for you?

What is the problem? Match data to user requests. The User can’t spell. The Chrysanthemum Corollary.

The Database is weird! The Ellis Island Effect.

Exact pattern matching is unreliable.

Fuzzy stuff Inexact pattern matching. But not too inexact! Reasonable matches ranked according to some metric.

We need: Some general purpose way of determining reasonable

matches.

Some way of ranking the results.

The Cauliflower Conundrum The Sainsburys Solution Input

Did you mean…

Caulyflower

Cauliflower

Cauliflour

Cauliflower

Koliflower

Cauliflower

Koliflour

No Match

Colyflower

Cornflower (the plant) Cornflour (the grocery)

Not quite fuzzy enough? IN (‘pat1’ [, ‘pat2’...]) LIKE ‘...’ SIMILAR TO ‘[ck](au|o)ll?[iy]flo(u|we)r[[:SPACE:]]*’ Input

Accepted

cauliflower

Y

caulyflower

Y

cauliflour

Y

koliflour

Y

kolliflour

Y

kolliflower

Y

cauliflower soup

N

MOHAIR: Character Classes Fixed WHITESPACE Added: ASCII, BLANK, CNTRL, GRAPH, PRINT, PUNCT,

Ye Olde Patent Leather SOUNDEX. Pretty good for a centenarian, but… Leading character is too significant. Four character code leads to false positives. Name

SOUNDEX(Name)

Nife

N100

Nichols

N242

Nicholson

N242

Schwartsenegger

S632

Shwarzenegger

S625

Schwarzenegger

S625

Single Knife code leads to false K510negatives.

The SOUNDEX collection. SOUNDEX_DM The Daitch-Mokotoff Soundex. Up to 16 six character codes in a comma separated list.

SOUNDEX_NYSIIS (aka NYSIIS) New York State Identification and Intelligence System. 2.7% more accurate than SOUNDEX. Single code, varchar(128).

Size matters, its more fun in groups

Name

Soundex

Soundex_dm

NYSIIS

dmetaphone

Knife

K510

567000

NAF

NF

Nife

N100

670000

NAF

NF

Nichols

N242

658400,648400

NACAL

NXLS,NKLS

Nicholson

N242

658460,648460

NACALSAN

NXLS,NKLS

Schwartsenegger S632

479465

SWARTSANAGAR

XRTS,XFRT

Shwarzenegger

S625

479465,474659

SWARSANAGAR

XRSN,XRTS

Schwarzenegger

S625

479465,474659

SWARSANAGAR

XRSN,XFRT

But which one? Name

Soundex

Soundex_dm

NYSIIS

dmetaphone

cauliflower

C414

587879,487879

CALAFLAR

KLFL

caulyflower

C414

587879,487879

CALAFLAR

KLFL

cauliflour

C414

587890,487890

CALAFLAR

KLFL

koliflour

K414

587890

CALAFLAR

KLFL

kolliflour

K414

587890

CALAFLAR

KLFL

kolliflower

K414

587879

CALAFLAR

KLFL

cauliflower soup

C414

587879,487879, 587874,487874

CALAFLARSAP

KLFL

Handling a pair. Application. Table procedures.

select element from tp_soundex_dm('nichols') ┌──────┐ │elemen│ ├──────┤ │658400│ │648400│

Table Procedure…it begins! create procedure tp_soundex_dm( string varchar(256) not null not default ) result row my_row( element char(6) not null not default ) as declare code_string varchar(111) not null not default;

Table Procedure…the body. select soundex_dm(:string) into :code_string; select length(:code_string) into :len; n = 1; while (1 > 0)

Ranking. Can we do better than alphabetical? String Distance Metrics. LEVENSHTEIN JARO_WINKLER

Levenshtein distance The minimum number of changes that need to be

made to convert one string into another. An integer. levenshtein(‘nichols’, ‘nicholson’) == 2 name

levenshtein(‘cauliflower’, name)

cauliflower

0

caulyflower

1

cauliflour

2

koliflower

3

koliflour

5

kolliflour

5

JARO_WINKLER A float value between 0 and 1, where 0.0 means no

similarities and 1.0 means the strings are identical.

Better with names. Name

jaro_winkler(‘schwarzenegger’, name)

schwarzenegger

1.000

shwarzenegger

0.979

schwartsenegger

0.944

schworzanayga

0.877

cauliflower

0.547

The Cauliflower Correction. A stock table… Item

Nysiis_code

cauliflower

CALAFLAR

cauliflower soup

CALAFLARSAP

cornflour

CARNFLAR

colander

CALANDAR

kaleidoscope

CALADASCAP

And they ask for a colliflour… select item from stock where left(nysiis_code, 4) = left(NYSIIS(?), 4) order by levenshtein(item, ?); Item cauliflower colander kaleidoscope cauliflower soup

Summary Fuzzy string matching is necessary...and fun! What MOHAIR has delivered. Corrections and additions to Character Classes SOUNDEX_DM

What MOHAIR is delivering. Extra Soundex routines:

SOUNDEX_NYSIIS, DOUBLE_METAPHONE

Suggest Documents