Information Retrieval: Text processing
Luca Bondi
Text processing: Introduction
Why do we need text processing before building the index term vocabulary?
• remove whitespace and punctuation
• how to deal with apostrophes and hyphenation?
• what about composite names (e.g. New York)?
• remove non-discriminative terms (language dependent)
• normalize with respect to UPPER and lower case
• normalize with respect to plurals
• normalize acronyms (USA vs. U.S.A.)
• normalize accents
• …
Two main steps:
• Tokenization: the process of chopping character streams into tokens
• Linguistic pre-processing: building equivalence classes of tokens, which are the set of terms that are indexed
  • Stop word removal
  • Normalization
  • Stemming
  • Lemmatization
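As an illustration of these two steps, here is a minimal Python sketch of the pipeline; the function names and the tiny stop list are hypothetical, chosen only for this example.

```python
import re

def tokenize(text):
    # Naive tokenizer: split on any run of non-word characters,
    # keeping apostrophes inside words (e.g. "aren't").
    return [t for t in re.split(r"[^\w']+", text) if t]

def preprocess(tokens, stop_words):
    # Linguistic pre-processing reduced to its simplest form:
    # case folding followed by stop word removal.
    return [t.lower() for t in tokens if t.lower() not in stop_words]

tokens = tokenize("Friends, Romans, Countrymen, lend me your ears")
print(preprocess(tokens, stop_words={"me", "your"}))
# ['friends', 'romans', 'countrymen', 'lend', 'ears']
```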
Text processing: Tokenization
• Example
  • Input: “Friends, Romans, Countrymen, lend me your ears”
  • Output: |Friends|Romans|Countrymen|lend|me|your|ears|
• A token is an instance of a character sequence in some particular document
• A type is the class of all tokens containing the same character sequence
• A term is a (normalized) type that is indexed in the IR system dictionary
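A small Python sketch of the three notions; the counts in the comments follow directly from the definitions above.

```python
text = "To be, or not to be"

tokens = text.replace(",", "").split()  # instances: ['To', 'be', 'or', 'not', 'to', 'be']
types = set(tokens)                     # distinct character sequences: 5 ('To' and 'to' differ)
terms = {t.lower() for t in types}      # normalized types: {'to', 'be', 'or', 'not'}

print(len(tokens), len(types), len(terms))  # 6 5 4
```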
• Tokenization is not only about chopping on whitespace and throwing away punctuation characters
• Issues:
  • apostrophes (e.g. “aren’t” → aren’t, arent, are|n’t, aren|t)
  • hyphenation (e.g. “over-eager” → overeager, over|eager)
  • white spaces (e.g. “San Francisco” → San Francisco, San|Francisco)
  • compounds (e.g. “Computerlinguistik”)
• Tokenization is language specific (need for language identification)
• Tokenization should recognize specific strings (e.g. email addresses, URLs, etc.)
• The same tokenization needs to be performed on documents and queries
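One way to address some of these issues is a pattern-based tokenizer that recognizes URLs and e-mail addresses before falling back to ordinary words. The patterns below are a rough sketch, not a complete solution (they do not handle compounds or multi-word names such as “San Francisco”).

```python
import re

TOKEN_RE = re.compile(
    r"https?://\S+"              # URLs, kept as single tokens
    r"|[\w.+-]+@[\w-]+\.[\w.]+"  # e-mail addresses
    r"|\w+(?:[-']\w+)*"          # words, keeping internal hyphens and apostrophes
)

print(TOKEN_RE.findall("aren't over-eager"))
# ["aren't", 'over-eager']
print(TOKEN_RE.findall("write to jo@example.com or see https://example.org"))
# ['write', 'to', 'jo@example.com', 'or', 'see', 'https://example.org']
```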
Text processing: Stop words
• Stop words: extremely common, semantically non-selective words that are excluded from the dictionary entirely
• General strategy:
  • sort terms by frequency
  • add the most frequent terms to the stop list
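A sketch of this strategy in Python; `size` is an arbitrary cut-off, and in practice the resulting candidates are usually reviewed by hand before being adopted as a stop list.

```python
from collections import Counter

def build_stop_list(tokens, size=25):
    # Count term frequencies and keep the `size` most frequent terms.
    counts = Counter(t.lower() for t in tokens)
    return [term for term, _ in counts.most_common(size)]
```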
• Example: stop list in the Reuters-RCV1 dataset
• Phrase searches might be significantly affected by the use of stop lists
• Example:
  • “Flight to London” is different from “Flight London”
  • “To be or not to be” (all words might be in the stop list)
• General trends:
  • early IR systems: quite large stop lists (200-300 terms)
  • more recent IR systems: very small stop lists (7-12 terms), or no stop list at all (e.g. Web search engines)
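The phrase-search problem above is easy to reproduce; with the illustrative stop list below, the second query loses every term.

```python
STOP = {"to", "be", "or", "not", "the", "a", "of"}  # illustrative stop list

def remove_stop_words(query):
    return [t for t in query.lower().split() if t not in STOP]

print(remove_stop_words("Flight to London"))    # ['flight', 'london']
print(remove_stop_words("To be or not to be"))  # [] -- nothing left to search for
```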
Text processing: Normalization
• Token normalization is the process of canonicalizing tokens so that matches occur despite superficial differences in the character sequences
  • e.g. USA vs. U.S.A.
• The most standard way of normalizing is to create equivalence classes, which are named after one member of the set
• Might cause unexpected results:
  • e.g. C.A.T. → cat
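Equivalence classes can be held in a simple lookup table that maps every member to the class representative; the entries here are only illustrative.

```python
EQUIVALENCE = {
    "u.s.a.": "usa",   # class named after the member "usa"
    "usa": "usa",
    "colour": "color",
    "color": "color",
}

def normalize(token):
    # Unknown tokens fall through unchanged (apart from case folding).
    return EQUIVALENCE.get(token.lower(), token.lower())

print(normalize("U.S.A."))  # usa
print(normalize("C.A.T."))  # c.a.t. -- unless someone adds it to the table
```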
• Normalization typically deals with:
  • Accents and diacritics (might be critical in languages other than English)
    • résumé → resume
    • naïve → naive
  • Capitalization/case folding
    • A common strategy is to reduce everything to lower case
    • It might be critical: “General Motors” → general motors
    • For English, a good compromise is to lowercase only words at the beginning of a sentence, and everything in a title that is all uppercase
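Accent and diacritic removal can be written with the Python standard library alone: decompose each character (NFD) and drop the combining marks. A minimal sketch:

```python
import unicodedata

def strip_accents(token):
    # NFD splits "é" into "e" plus a combining acute accent;
    # the combining marks are then filtered out.
    decomposed = unicodedata.normalize("NFD", token)
    return "".join(c for c in decomposed if not unicodedata.combining(c))

print(strip_accents("résumé"), strip_accents("naïve"))  # resume naive
print(strip_accents("Résumé").lower())                  # resume (with case folding)
```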
Text processing: Stemming and lemmatization
• Goal: reduce inflectional forms, and sometimes derivationally related forms, of a word to a common base form
  • am, are, is → be
  • car, cars, car’s, cars’ → car
• Stemming: a crude heuristic process that chops off the ends of words, in the hope of achieving this goal correctly most of the time
• Lemmatization: a more accurate process that uses a dictionary and a morphological analysis of words, normally aiming to return the base or dictionary form of a word
  • Lemmatization collapses the different inflectional forms of a lemma
  • NLP (Natural Language Processing) tools of this kind have been shown not to help in IR systems
• Example: the token “saw”
  • Stemming → might return just “s”
  • Lemmatization → attempts to return “see” or “saw”, depending on whether the token is used as a verb or as a noun
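The “saw” example can be reproduced with NLTK (assuming `nltk` is installed and the WordNet data downloaded). Note that the Porter stemmer happens to leave “saw” unchanged; the reduction to just “s” mentioned above would come from a cruder stemmer.

```python
from nltk.stem import PorterStemmer, WordNetLemmatizer

stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

print(stemmer.stem("saw"))                   # 'saw' -- no context, no dictionary
print(lemmatizer.lemmatize("saw", pos="v"))  # 'see' -- verb reading
print(lemmatizer.lemmatize("saw", pos="n"))  # 'saw' -- noun reading
```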
Text processing: Stemming - Porter’s algorithm
• Five phases of word reductions, applied sequentially
• Each phase applies various conventions to select rules (e.g., select the rule from each rule group that applies to the longest suffix)
  • Step 1a example:
    • SSES → SS (caresses → caress)
    • IES → I (ponies → poni)
    • SS → SS (caress → caress)
    • S → “” (cats → cat)
• Many rules use the concept of the measure m of a word
  • It checks whether what remains is long enough that it is reasonable to regard the matching portion as a suffix rather than as part of the stem
  • E.g. (m > 1) EMENT → “”, where m is the measure (the number of vowel-consonant sequences) of the remaining word:
    • “replacement” → “replac”
    • “cement” ↛ “c”
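A sketch of step 1a and of the measure, assuming a simplified vowel/consonant classification that ignores Porter’s special handling of ‘y’:

```python
import re

def measure(stem):
    # Porter's measure m: the number of vowel-consonant sequences in the stem.
    pattern = re.sub(r"[aeiou]+", "V", stem.lower())
    pattern = re.sub(r"[^V]+", "C", pattern)
    return pattern.count("VC")

def step_1a(word):
    # Rules are ordered by suffix length, so the longest matching suffix wins.
    for suffix, repl in [("sses", "ss"), ("ies", "i"), ("ss", "ss"), ("s", "")]:
        if word.endswith(suffix):
            return word[: len(word) - len(suffix)] + repl
    return word

print([step_1a(w) for w in ["caresses", "ponies", "caress", "cats"]])
# ['caress', 'poni', 'caress', 'cat']
print(measure("replac"), measure("c"))
# 2 0 -> (m > 1) holds for "replacement" but not for "cement"
```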
Text processing: Stemming algorithms

• Many different algorithms exist:
  • Porter’s algorithm (the most common algorithm for English) [Porter, 1980]
  • Lovins stemmer [Lovins, 1968]
  • Paice/Husk stemmer [Paice, 1990]
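For a quick comparison, NLTK ships a Porter implementation and a Lancaster stemmer (NLTK’s Paice/Husk implementation); a Lovins stemmer is not in NLTK but is available in third-party packages.

```python
from nltk.stem import PorterStemmer, LancasterStemmer

porter, lancaster = PorterStemmer(), LancasterStemmer()
for w in ["operational", "operative", "operating", "ponies", "maximum"]:
    print(f"{w:12} porter: {porter.stem(w):8} lancaster: {lancaster.stem(w)}")
```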
Text processing: Stemming
• Stemmers increase recall and decrease precision
• Example: Porter’s algorithm stems operational, operative, operating → oper
  • a query for any one of these words then also retrieves documents containing the others, whether or not they are relevant
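The recall/precision trade-off is visible in a toy retrieval setting: once the query and the documents are stemmed, a query for “operate” matches all three (purely illustrative) documents below.

```python
from nltk.stem import PorterStemmer

stem = PorterStemmer().stem
docs = {
    1: "operational research techniques",
    2: "operating system internals",
    3: "operative dentistry",
}

q = stem("operate")  # 'oper'
hits = [d for d, text in docs.items() if q in {stem(t) for t in text.split()}]
print(q, hits)  # oper [1, 2, 3] -- recall goes up, precision likely goes down
```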
Text statistics: Summary
• Zipf’s law
• Luhn analysis
• Heaps’ law
Text statistics: Zipf’s law
• “…given some document collection, the frequency of any word is inversely proportional to its rank in the frequency table…”
  • the most frequent word will occur approximately twice as often as the second most frequent word, which in turn occurs twice as often as the fourth most frequent word, etc.
• If the words w in a collection are ordered according to the ranking function r(w), in decreasing order of frequency f(w), then they satisfy the relationship

  r(w) · f(w) = c

• Different collections have different constants c
• In English collections, c ≈ N/10, where N is the number of words in the collection
  • Example: the word ‘e’ is the most frequent word (r(e) = 1) in an English collection of documents with N = 10^6, so c = 10^5
  • The frequency of ‘e’ is then estimated as f(e) = c / r(e) = 10^5 / 1 = 10^5
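The estimate above generalizes to any rank; a short sketch, assuming c ≈ N/10:

```python
def zipf_frequency(rank, n_words):
    # Estimated frequency of the word at a given rank, with c ~ N/10.
    c = n_words / 10
    return c / rank

N = 10**6
print(zipf_frequency(1, N))  # 100000.0 -- the most frequent word
print(zipf_frequency(2, N))  # 50000.0  -- roughly half as frequent
```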
Text statistics: Luhn analysis

• The discriminative power of the significant words is maximum between the two cut-off levels (an upper and a lower frequency cut-off)
• Used to:
  • weight index terms
  • build stop lists (the most frequent and the least frequent words are removed)
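A sketch of Luhn-style term selection: keep only the mid-frequency band between an upper and a lower cut-off. The cut-off values below are arbitrary; Luhn’s analysis leaves them to be tuned per collection.

```python
from collections import Counter

def significant_terms(tokens, lower_cut=2, upper_cut=0.01):
    # Drop terms rarer than `lower_cut` occurrences (low discriminative power)
    # and terms covering more than `upper_cut` of all tokens (stop-word-like).
    counts = Counter(tokens)
    total = sum(counts.values())
    return {t for t, f in counts.items() if f >= lower_cut and f / total <= upper_cut}
```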
Text statistics: Heaps’ law

• How does the vocabulary size (number of distinct words) grow with the collection size (number of words)?
  • There is no hard upper limit, because of proper nouns (places, people, etc.)
• Heaps’ law:

  M = k · N^β

  where M is the vocabulary size, N is the number of words in the collection, and k and β are collection-dependent parameters (β is typically around 0.5)
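A numeric sketch of the law; the parameter values k = 44 and β = 0.49 are those reported for Reuters-RCV1 in Manning et al.’s Introduction to Information Retrieval, used here only as an illustration.

```python
def heaps_vocabulary(n_words, k=44, beta=0.49):
    # Predicted vocabulary size M = k * N^beta.
    return k * n_words ** beta

for n in [10**4, 10**6, 10**8]:
    print(f"N = {n:>9} -> M ~ {int(heaps_vocabulary(n))}")
```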