Information Retrieval: Text processing
Luca Bondi
Text processing: Introduction
Why do we need text processing before building the index term vocabulary?
• remove whitespace and punctuation
• how to deal with apostrophes and hyphenation?
• what about composite names (e.g. New York)?
• remove non-discriminative terms (language dependent)
• normalize with respect to UPPER and lower case
• normalize with respect to plurals
• normalize acronyms (USA vs. U.S.A.)
• normalize accents
• …
Two main steps:
• Tokenization: the process of chopping character streams into tokens
• Linguistic pre-processing: building equivalence classes of tokens, which are the set of terms that are indexed
  • Stop word removal
  • Normalization
  • Stemming
  • Lemmatization
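As an illustration of these two steps, here is a minimal Python sketch of the pipeline; the function names and the tiny stop list are hypothetical, chosen only for this example.

```python
import re

def tokenize(text):
    # Naive tokenizer: split on any run of non-word characters,
    # keeping apostrophes inside words (e.g. "aren't").
    return [t for t in re.split(r"[^\w']+", text) if t]

def preprocess(tokens, stop_words):
    # Linguistic pre-processing reduced to its simplest form:
    # case folding followed by stop word removal.
    return [t.lower() for t in tokens if t.lower() not in stop_words]

tokens = tokenize("Friends, Romans, Countrymen, lend me your ears")
print(preprocess(tokens, stop_words={"me", "your"}))
# ['friends', 'romans', 'countrymen', 'lend', 'ears']
```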
Text processing: Tokenization
• Example
  • Input: “Friends, Romans, Countrymen, lend me your ears”
  • Output: |Friends|Romans|Countrymen|lend|me|your|ears|
• A token is an instance of a character sequence in some particular document
• A type is the class of all tokens containing the same character sequence
• A term is a (normalized) type that is indexed in the IR system dictionary
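A small Python sketch of the three notions; the counts in the comments follow directly from the definitions above.

```python
text = "To be, or not to be"

tokens = text.replace(",", "").split()  # instances: ['To', 'be', 'or', 'not', 'to', 'be']
types = set(tokens)                     # distinct character sequences: 5 ('To' and 'to' differ)
terms = {t.lower() for t in types}      # normalized types: {'to', 'be', 'or', 'not'}

print(len(tokens), len(types), len(terms))  # 6 5 4
```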
• Tokenization is not only about chopping on whitespace and throwing away punctuation characters
• Issues:
  • apostrophes (e.g. “aren’t” → aren’t, arent, are|n’t, aren|t)
  • hyphenation (e.g. “over-eager” → overeager, over|eager)
  • white spaces (e.g. “San Francisco” → San Francisco, San|Francisco)
  • compounds (e.g. “Computerlinguistik”)
• Tokenization is language specific (need for language identification)
• Tokenization should recognize specific strings (e.g. email addresses, URLs, etc.)
• The same tokenization needs to be performed on documents and queries
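One way to address some of these issues is a pattern-based tokenizer that recognizes URLs and e-mail addresses before falling back to ordinary words. The patterns below are a rough sketch, not a complete solution (they do not handle compounds or multi-word names such as “San Francisco”).

```python
import re

TOKEN_RE = re.compile(
    r"https?://\S+"              # URLs, kept as single tokens
    r"|[\w.+-]+@[\w-]+\.[\w.]+"  # e-mail addresses
    r"|\w+(?:[-']\w+)*"          # words, keeping internal hyphens and apostrophes
)

print(TOKEN_RE.findall("aren't over-eager"))
# ["aren't", 'over-eager']
print(TOKEN_RE.findall("write to jo@example.com or see https://example.org"))
# ['write', 'to', 'jo@example.com', 'or', 'see', 'https://example.org']
```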
Text processing: Stop words
• Stop words: extremely common, semantically non-selective words that are excluded from the dictionary entirely
• General strategy:
  • sort terms by frequency
  • add the most frequent terms to the stop list
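A sketch of this strategy in Python; `size` is an arbitrary cut-off, and in practice the resulting candidates are usually reviewed by hand before being adopted as a stop list.

```python
from collections import Counter

def build_stop_list(tokens, size=25):
    # Count term frequencies and keep the `size` most frequent terms.
    counts = Counter(t.lower() for t in tokens)
    return [term for term, _ in counts.most_common(size)]
```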
• Example: stop list in the Reuters-RCV1 dataset
• Phrase searches might be significantly affected by the use of stop lists
• Example:
  • “Flight to London” is different from “Flight London”
  • “To be or not to be” (all words might be in the stop list)
• General trends:
  • early IR systems: quite large stop lists (200-300 terms)
  • more recent IR systems: very small stop lists (7-12 terms), or no stop list at all (e.g. Web search engines)
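The phrase-search problem above is easy to reproduce; with the illustrative stop list below, the second query loses every term.

```python
STOP = {"to", "be", "or", "not", "the", "a", "of"}  # illustrative stop list

def remove_stop_words(query):
    return [t for t in query.lower().split() if t not in STOP]

print(remove_stop_words("Flight to London"))    # ['flight', 'london']
print(remove_stop_words("To be or not to be"))  # [] -- nothing left to search for
```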
Text processing: Normalization
• Token normalization is the process of canonicalizing tokens so that matches occur despite superficial differences in the character sequences
  • e.g. USA vs. U.S.A.
• The most standard way of normalizing is to create equivalence classes, which are named after one member of the set
• Might cause unexpected results:
  • e.g. C.A.T. → cat
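Equivalence classes can be held in a simple lookup table that maps every member to the class representative; the entries here are only illustrative.

```python
EQUIVALENCE = {
    "u.s.a.": "usa",   # class named after the member "usa"
    "usa": "usa",
    "colour": "color",
    "color": "color",
}

def normalize(token):
    # Unknown tokens fall through unchanged (apart from case folding).
    return EQUIVALENCE.get(token.lower(), token.lower())

print(normalize("U.S.A."))  # usa
print(normalize("C.A.T."))  # c.a.t. -- unless someone adds it to the table
```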
• Normalization typically deals with:
  • Accents and diacritics (might be critical in languages other than English)
    • résumé → resume
    • naïve → naive
  • Capitalization/case folding
    • A common strategy is to reduce everything to lower case
    • It might be critical: “General Motors” → general motors
    • For English, a good compromise is to lowercase only words at the beginning of a sentence, and everything in a title that is all uppercase
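Accent and diacritic removal can be written with the Python standard library alone: decompose each character (NFD) and drop the combining marks. A minimal sketch:

```python
import unicodedata

def strip_accents(token):
    # NFD splits "é" into "e" plus a combining acute accent;
    # the combining marks are then filtered out.
    decomposed = unicodedata.normalize("NFD", token)
    return "".join(c for c in decomposed if not unicodedata.combining(c))

print(strip_accents("résumé"), strip_accents("naïve"))  # resume naive
print(strip_accents("Résumé").lower())                  # resume (with case folding)
```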
Text processing: Stemming and lemmatization
• Goal: reduce inflectional forms, and sometimes derivationally related forms, of a word to a common base form
  • am, are, is → be
  • car, cars, car’s, cars’ → car
• Stemming: a crude heuristic process that chops off the ends of words, in the hope of achieving this goal correctly most of the time
• Lemmatization: a more accurate process that uses a dictionary and a morphological analysis of words, normally aiming to return the base or dictionary form of a word
  • Lemmatization collapses the different inflectional forms of a lemma
  • NLP (Natural Language Processing) tools of this kind have been shown not to help in IR systems
• Example: the token “saw”
  • Stemming → might return just “s”
  • Lemmatization → attempts to return “see” or “saw”, depending on whether the token is used as a verb or as a noun
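The “saw” example can be reproduced with NLTK (assuming `nltk` is installed and the WordNet data downloaded). Note that the Porter stemmer happens to leave “saw” unchanged; the reduction to just “s” mentioned above would come from a cruder stemmer.

```python
from nltk.stem import PorterStemmer, WordNetLemmatizer

stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

print(stemmer.stem("saw"))                   # 'saw' -- no context, no dictionary
print(lemmatizer.lemmatize("saw", pos="v"))  # 'see' -- verb reading
print(lemmatizer.lemmatize("saw", pos="n"))  # 'saw' -- noun reading
```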
Text processing: Stemming - Porter’s algorithm
• Five phases of word reductions, applied sequentially
• Each phase applies various conventions to select rules (e.g., select the rule from each rule group that applies to the longest suffix)
  • Step 1a example:
    • SSES → SS (caresses → caress)
    • IES → I (ponies → poni)
    • SS → SS (caress → caress)
    • S → “” (cats → cat)
• Many rules use the concept of the measure m of a word
  • It checks whether what remains is long enough that it is reasonable to regard the matching portion as a suffix rather than as part of the stem
  • E.g. (m > 1) EMENT → “”, where m is the measure (the number of vowel-consonant sequences) of the remaining word:
    • “replacement” → “replac”
    • “cement” ↛ “c”
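A sketch of step 1a and of the measure, assuming a simplified vowel/consonant classification that ignores Porter’s special handling of ‘y’:

```python
import re

def measure(stem):
    # Porter's measure m: the number of vowel-consonant sequences in the stem.
    pattern = re.sub(r"[aeiou]+", "V", stem.lower())
    pattern = re.sub(r"[^V]+", "C", pattern)
    return pattern.count("VC")

def step_1a(word):
    # Rules are ordered by suffix length, so the longest matching suffix wins.
    for suffix, repl in [("sses", "ss"), ("ies", "i"), ("ss", "ss"), ("s", "")]:
        if word.endswith(suffix):
            return word[: len(word) - len(suffix)] + repl
    return word

print([step_1a(w) for w in ["caresses", "ponies", "caress", "cats"]])
# ['caress', 'poni', 'caress', 'cat']
print(measure("replac"), measure("c"))
# 2 0 -> (m > 1) holds for "replacement" but not for "cement"
```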
Text processing: Stemming algorithms

• Many different algorithms exist:
  • Porter’s algorithm (the most common algorithm for English) [Porter, 1980]
  • Lovins stemmer [Lovins, 1968]
  • Paice/Husk stemmer [Paice, 1990]
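For a quick comparison, NLTK ships a Porter implementation and a Lancaster stemmer (NLTK’s Paice/Husk implementation); a Lovins stemmer is not in NLTK but is available in third-party packages.

```python
from nltk.stem import PorterStemmer, LancasterStemmer

porter, lancaster = PorterStemmer(), LancasterStemmer()
for w in ["operational", "operative", "operating", "ponies", "maximum"]:
    print(f"{w:12} porter: {porter.stem(w):8} lancaster: {lancaster.stem(w)}")
```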
Text processing: Stemming
• Stemmers increase recall and decrease precision
• Example: Porter’s algorithm stems operational, operative, operating → oper
  • a query for any one of these words then also retrieves documents containing the others, whether or not they are relevant
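The recall/precision trade-off is visible in a toy retrieval setting: once the query and the documents are stemmed, a query for “operate” matches all three (purely illustrative) documents below.

```python
from nltk.stem import PorterStemmer

stem = PorterStemmer().stem
docs = {
    1: "operational research techniques",
    2: "operating system internals",
    3: "operative dentistry",
}

q = stem("operate")  # 'oper'
hits = [d for d, text in docs.items() if q in {stem(t) for t in text.split()}]
print(q, hits)  # oper [1, 2, 3] -- recall goes up, precision likely goes down
```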
Text statistics: Summary
• Zipf’s law
• Luhn analysis
• Heaps’ law
Text statistics: Zipf’s law
• “…given some document collection, the frequency of any word is inversely proportional to its rank in the frequency table…”
  • the most frequent word will occur approximately twice as often as the second most frequent word, which in turn occurs twice as often as the fourth most frequent word, etc.
• If the words w in a collection are ordered according to the ranking function r(w), in decreasing order of frequency f(w), then they satisfy the relationship

  r(w) · f(w) = c

• Different collections have different constants c
• In English collections, c ≈ N/10, where N is the number of words in the collection
  • Example: the word ‘e’ is the most frequent word (r(e) = 1) in an English collection of documents with N = 10^6, so c = 10^5
  • The frequency of ‘e’ is then estimated as f(e) = c / r(e) = 10^5 / 1 = 10^5
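The estimate above generalizes to any rank; a short sketch, assuming c ≈ N/10:

```python
def zipf_frequency(rank, n_words):
    # Estimated frequency of the word at a given rank, with c ~ N/10.
    c = n_words / 10
    return c / rank

N = 10**6
print(zipf_frequency(1, N))  # 100000.0 -- the most frequent word
print(zipf_frequency(2, N))  # 50000.0  -- roughly half as frequent
```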
Text statistics: Luhn analysis

• The discriminative power of the significant words is maximum between the two cut-off levels (an upper and a lower frequency cut-off)
• Used to:
  • weight index terms
  • build stop lists (the most frequent and the least frequent words are removed)
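A sketch of Luhn-style term selection: keep only the mid-frequency band between an upper and a lower cut-off. The cut-off values below are arbitrary; Luhn’s analysis leaves them to be tuned per collection.

```python
from collections import Counter

def significant_terms(tokens, lower_cut=2, upper_cut=0.01):
    # Drop terms rarer than `lower_cut` occurrences (low discriminative power)
    # and terms covering more than `upper_cut` of all tokens (stop-word-like).
    counts = Counter(tokens)
    total = sum(counts.values())
    return {t for t, f in counts.items() if f >= lower_cut and f / total <= upper_cut}
```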
Text statistics: Heaps’ law

• How does the vocabulary size (number of distinct words) grow with the collection size (number of words)?
  • There is no hard upper limit, because of proper nouns (places, people, etc.)
• Heaps’ law:

  M = k · N^β

  where M is the vocabulary size, N is the number of words in the collection, and k and β are collection-dependent parameters (β is typically around 0.5)
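A numeric sketch of the law; the parameter values k = 44 and β = 0.49 are those reported for Reuters-RCV1 in Manning et al.’s Introduction to Information Retrieval, used here only as an illustration.

```python
def heaps_vocabulary(n_words, k=44, beta=0.49):
    # Predicted vocabulary size M = k * N^beta.
    return k * n_words ** beta

for n in [10**4, 10**6, 10**8]:
    print(f"N = {n:>9} -> M ~ {int(heaps_vocabulary(n))}")
```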