Information Retrieval: Text processing
Luca Bondi


Text processing: Introduction

Why do we need text processing before building the index term vocabulary?
• remove whitespace and punctuation
• how to deal with apostrophes and hyphenation?
• what about composite names (e.g. New York)?
• remove non-discriminative terms (language dependent)
• normalize with respect to UPPER and lower case
• normalize with respect to plurals
• normalize acronyms (USA vs. U.S.A.)
• normalize accents
• …


Text processing: Introduction

Two main steps:
• Tokenization: the process of chopping character streams into tokens
• Linguistic pre-processing: building equivalence classes of tokens, which are the sets of terms that are indexed
  • Stop word removal
  • Normalization
  • Stemming
  • Lemmatization


Text processing: Tokenization



Example:
• Input: “Friends, Romans, Countrymen, lend me your ears”
• Output: |Friends|Romans|Countrymen|lend|me|your|ears|

A token is an instance of a character sequence in some particular document.

A type is the class of all tokens containing the same character sequence.

A term is a (normalized) type that is indexed in the IR system dictionary.
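A minimal Python sketch of the token/type/term distinction (the regex split below is a toy tokenizer, only for illustration):

```python
import re

text = "Friends, Romans, Countrymen, lend me your ears"

# Toy tokenizer: any maximal run of letters is a token.
tokens = [t for t in re.split(r"[^A-Za-z]+", text) if t]
print(tokens)         # 7 tokens

# Types: distinct character sequences among the tokens.
types = set(tokens)
print(len(types))     # 7 types (no word repeats here)

# Terms: normalized types, e.g. after case folding.
terms = {t.lower() for t in types}
print(sorted(terms))
```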


Text processing: Tokenization

Tokenization is not only about chopping on whitespace and throwing away punctuation characters.

Issues:
• apostrophes (e.g. “aren’t” → aren’t, arent, are|n’t, aren|t)
• hyphenation (e.g. “over-eager” → overeager, over|eager)
• whitespace (e.g. “San Francisco” → San Francisco, San|Francisco)
• compounds (e.g. “Computerlinguistik”)
• tokenization is language specific (need for language identification)
• tokenization should recognize specific strings (e.g. email addresses, URLs, etc.)

The same tokenization needs to be performed on documents and queries; a pattern-based sketch follows below.
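A hedged sketch of a pattern-based tokenizer that keeps URLs, e-mail addresses, and apostrophe contractions intact (the patterns are illustrative, far from exhaustive):

```python
import re

# Order matters: URLs and e-mail addresses are tried before the
# generic word pattern, so they survive as single tokens.
TOKEN_RE = re.compile(
    r"https?://[\w./-]+"            # URLs
    r"|[\w.+-]+@[\w-]+\.[\w.-]+"    # e-mail addresses
    r"|\w+(?:'\w+)?"                # words, optionally with an apostrophe
)

def tokenize(text):
    return TOKEN_RE.findall(text)

print(tokenize("They aren't at http://example.com; email bob@example.com"))
# ['They', "aren't", 'at', 'http://example.com', 'email', 'bob@example.com']
```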


Text processing: Stop words



Stop words, extremely common and semantically non-selective words, are excluded from the dictionary entirely.

General strategy:
• sort terms by frequency
• add the most frequent terms to the stop list

Example: the stop list of the Reuters-RCV1 dataset
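A minimal sketch of that strategy, assuming the collection is already tokenized (the default list size of 25 is an arbitrary choice, not a recommended value):

```python
from collections import Counter

def build_stop_list(tokenized_docs, size=25):
    """Return the `size` most frequent terms across the collection."""
    counts = Counter()
    for doc in tokenized_docs:
        counts.update(term.lower() for term in doc)
    return {term for term, _ in counts.most_common(size)}

docs = [["the", "cat", "sat", "on", "the", "mat"],
        ["the", "dog", "ate", "the", "bone"]]
print(build_stop_list(docs, size=1))  # {'the'}
```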


Text processing: Stop words



Phrase searches might be significantly affected by the use of stop lists.

Example:
• “Flight to London” is different from “Flight London”
• “To be or not to be” (all of its words might be in the stop list)

General trends:
• early IR systems: quite large stop lists (200-300 terms)
• recent IR systems:
  • very small stop lists (7-12 terms)
  • no stop list at all (e.g. Web search engines)


Text processing: Normalization



Token normalization is the process of canonicalizing tokens so that matches occur despite superficial differences in the character sequences.
• e.g. USA vs. U.S.A.

The most standard way of normalizing is to create equivalence classes, which are named after one member of the set.

This might cause unexpected results:
• e.g. C.A.T. → cat
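A sketch of one common normalization rule, deleting periods and folding case, which reproduces both the intended U.S.A./USA merge and the unexpected C.A.T. collision:

```python
def normalize(token):
    # Delete periods and fold case, so U.S.A. and USA end up in the
    # same equivalence class, named after the period-free member.
    return token.replace(".", "").lower()

print(normalize("U.S.A."))  # usa
print(normalize("USA"))     # usa
print(normalize("C.A.T."))  # cat -- now collides with the animal
```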


Text processing: Normalization

Normalization typically deals with:
• Accents and diacritics (might be critical in languages other than English)
  • résumé → resume
  • naïve → naive
• Capitalization / case folding
  • a common strategy is to reduce everything to lower case
  • it might be critical: “General Motors” → general motors
  • for English, a good compromise is reached by simply
    • lowercasing words at the beginning of a sentence
    • lowercasing everything in a title that is all uppercase
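A sketch of accent and diacritic removal via Unicode decomposition, plus plain case folding (standard library only; the sentence-initial lowercasing heuristic above is not implemented here):

```python
import unicodedata

def strip_accents(token):
    # Decompose characters (é -> e + combining accent), then drop
    # the combining marks.
    decomposed = unicodedata.normalize("NFD", token)
    return "".join(c for c in decomposed if not unicodedata.combining(c))

print(strip_accents("résumé"))            # resume
print(strip_accents("naïve").casefold())  # naive
```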


Text processing: Stemming and lemmatization

Goal: reduce inflectional forms, and sometimes derivationally related forms, of a word to a common base form.
• am, are, is → be
• car, cars, car’s, cars’ → car

Stemming: a crude heuristic process that chops off the ends of words, in the hope of achieving this goal correctly most of the time.

Lemmatization: an accurate process that uses a dictionary and a morphological analysis of words, normally aiming to return the base or dictionary form of a word. Lemmatization collapses the different inflectional forms of a lemma.
• it relies on NLP (Natural Language Processing) tools, which have been shown not to help in IR systems

Example: the token “saw”
• stemming might return just “s”
• lemmatization attempts to return “see” or “saw”, depending on whether the token is used as a verb or a noun
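A sketch reproducing the “saw” example with NLTK, one possible toolkit (assumes `pip install nltk` plus a one-time `nltk.download('wordnet')`; note that Porter’s stemmer actually leaves “saw” unchanged, so the “s” result above would come from a cruder stemmer):

```python
from nltk.stem import PorterStemmer, WordNetLemmatizer

stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

print(stemmer.stem("saw"))                   # 'saw' (unchanged by Porter)
print(lemmatizer.lemmatize("saw", pos="v"))  # 'see' (verb reading)
print(lemmatizer.lemmatize("saw", pos="n"))  # 'saw' (noun reading)
```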


Text processing: Stemming (Porter’s algorithm)

Five phases of word reductions, applied sequentially.

Each phase applies various conventions to select rules (e.g., select the rule from each rule group that applies to the longest suffix).

Step 1a example:
• SSES → SS (caresses → caress)
• IES → I (ponies → poni)
• SS → SS (caress → caress)
• S → “” (cats → cat)
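A direct sketch of Step 1a, with the longest-matching-suffix convention made explicit (a standalone toy, not a full Porter implementation):

```python
# Step 1a rules as (suffix, replacement) pairs, ordered so that the
# longest matching suffix wins.
STEP_1A = [("sses", "ss"), ("ies", "i"), ("ss", "ss"), ("s", "")]

def step_1a(word):
    for suffix, replacement in STEP_1A:
        if word.endswith(suffix):
            return word[: len(word) - len(suffix)] + replacement
    return word

for w in ["caresses", "ponies", "caress", "cats"]:
    print(w, "->", step_1a(w))
# caresses -> caress, ponies -> poni, caress -> caress, cats -> cat
```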



Many rules use the concept of the measure m of a word, which checks whether the word is long enough that it is reasonable to regard the matching portion as a suffix rather than as part of the stem.
• e.g. (m > 1) EMENT → “”, where m is the measure of the remaining word (the number of vowel-consonant sequences in it)
  • “replacement” → “replac”
  • “cement” ↛ “c”
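A simplified sketch of the measure and of the (m > 1) EMENT rule; it treats a, e, i, o, u as vowels and ignores Porter’s special handling of y:

```python
VOWELS = set("aeiou")

def measure(stem):
    # m = number of vowel-to-consonant transitions, i.e. the m in the
    # pattern [C](VC)^m[V] (simplified: 'y' is always a consonant here).
    m, prev_is_vowel = 0, False
    for ch in stem.lower():
        is_vowel = ch in VOWELS
        if prev_is_vowel and not is_vowel:
            m += 1
        prev_is_vowel = is_vowel
    return m

def apply_ement(word):
    if word.endswith("ement"):
        stem = word[: -len("ement")]
        if measure(stem) > 1:   # the (m > 1) condition
            return stem
    return word

print(apply_ement("replacement"))  # replac (m("replac") = 2 > 1)
print(apply_ement("cement"))       # cement (m("c") = 0, rule blocked)
```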

Text processing: Stemming algorithms

Many different algorithms exist:
• Porter’s algorithm (the most common algorithm for English) [Porter, 1980]
• Lovins stemmer [Lovins, 1968]
• Paice/Husk stemmer [Paice, 1990]


Text processing: Stemming



Stemmers increase recall and decrease precision.

Example: Porter’s algorithm stems
• operational, operative, operating → oper

so a query containing any one of these words also matches documents that use the others, whether or not they are relevant.


Text statistics: Summary



• Zipf’s law
• Luhn analysis
• Heaps’ law

Text statistics: Zipf’s law



“… given some document collection, the frequency of any word is inversely proportional to its rank in the frequency table …”
• the most frequent word will occur approximately twice as often as the second most frequent word, which in turn occurs twice as often as the fourth most frequent word, etc.



If the words w in a collection are ordered according to the ranking function r(w), in decreasing order of frequency f(w), then they satisfy the following relationship:

r(w) · f(w) = c

• different collections have different constants c
• in English collections, c ≈ N/10, where N is the number of words in the collection
• Example: the word “the” is the most frequent word (r(“the”) = 1) in an English collection of N = 10^6 words, so c ≈ 10^5, and its frequency is estimated as f(“the”) = c / r(“the”) = 10^5 / 1 = 10^5
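A sketch that checks the relationship empirically on a tokenized collection by printing r(w) · f(w) for the top-ranked words (the file name is a placeholder):

```python
from collections import Counter

def zipf_check(tokens, top=10):
    """Print r(w) * f(w) for the `top` most frequent words."""
    counts = Counter(t.lower() for t in tokens)
    for rank, (word, freq) in enumerate(counts.most_common(top), start=1):
        print(f"r={rank:2d}  f={freq:6d}  r*f={rank * freq:7d}  {word}")

# On real English text the r*f column should stay roughly constant
# (around N/10), up to noise at both ends of the ranking.
zipf_check(open("collection.txt").read().split())
```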

Text statistics: Luhn analysis

The discriminative power of significant words is maximum between the two cut-off levels.

Luhn analysis is used for:
• weighting index terms
• building stop lists (the most frequent and the least frequent words are removed)
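A sketch of Luhn-style selection, keeping only the terms whose collection frequency falls between the two cut-offs (both thresholds are arbitrary placeholders; Luhn’s analysis does not fix universal values):

```python
from collections import Counter

def luhn_significant_terms(tokens, lower_cutoff=2, upper_cutoff=100):
    counts = Counter(t.lower() for t in tokens)
    # Keep the mid-frequency band: above lower_cutoff the word is not
    # too rare to matter, below upper_cutoff it is not too common to
    # discriminate between documents.
    return {w for w, f in counts.items() if lower_cutoff <= f <= upper_cutoff}
```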

Text statistics: Heaps’ law

How does the vocabulary size (number of distinct words) grow with the collection size (number of documents)?
• there is no real limit, because of proper nouns (places, people, etc.)

Heaps’ law:

M = k · N^β

where M is the vocabulary size and N is the number of tokens in the collection; typical values reported for English collections are 30 ≤ k ≤ 100 and β ≈ 0.5.
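A sketch that estimates k and β from a growing prefix of a tokenized collection, via least squares in log-log space (log M = log k + β log N):

```python
import math

def fit_heaps(tokens, step=1000):
    """Fit M = k * N**beta by linear regression on (log N, log M)."""
    xs, ys, seen = [], [], set()
    for i, tok in enumerate(tokens, start=1):
        seen.add(tok.lower())
        if i % step == 0:
            xs.append(math.log(i))          # log N: tokens read so far
            ys.append(math.log(len(seen)))  # log M: vocabulary so far
    mean_x, mean_y = sum(xs) / len(xs), sum(ys) / len(ys)
    beta = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
            / sum((x - mean_x) ** 2 for x in xs))
    k = math.exp(mean_y - beta * mean_x)
    return k, beta
```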