Tokenization

Where Are Terms From?
•  How can we derive terms from documents to be indexed?
–  Collect the documents to be indexed
–  Tokenize the text
–  Linguistic preprocessing of tokens
•  Language-dependent
–  Use heuristic methods, user selection, metadata, or machine learning methods to determine the language of the document
–  Some languages (e.g., Arabic) may need special preprocessing of character sequences


Choosing a Proper Document Unit
•  Many possible choices
–  Each file in a folder as a document
–  In an mbox-format UNIX email file, each email within the large file is treated as a document (see the sketch below)
–  Within an email, each attachment may be treated as a document
•  Why does indexing granularity matter?
–  A tradeoff between precision and recall
–  Big granularity often leads to low precision – searching for “Kung Fu Panda” may return a book containing “Kung Fu” at the beginning and “Panda” at the end
–  Very small granularity often leads to low recall – searching for “Beijing Olympics” may miss the two sentences “I went to Beijing to join my friends. We watched the Olympic games together.”
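As a minimal sketch of the mbox bullet above, Python's standard mailbox module can iterate over the messages in one mbox file; the path "inbox.mbox" is a placeholder used only for illustration:

    import mailbox

    # Treat each message in an mbox-format UNIX mail file as one document.
    # "inbox.mbox" is a hypothetical path used only for illustration.
    for doc_id, msg in enumerate(mailbox.mbox("inbox.mbox")):
        subject = msg.get("Subject", "")
        print(f"document {doc_id}: {subject}")
        # Multipart attachments could in turn be treated as separate
        # documents, as the slide suggests.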


Tokenization
•  Given a character sequence and a defined document unit, tokenization is the task of chopping it up into pieces, called tokens, perhaps at the same time throwing away certain characters, such as punctuation
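As a minimal illustration of this definition, the sketch below chops a character sequence on non-word characters, discarding punctuation; real tokenizers are considerably more careful:

    import re

    def tokenize(text):
        # Keep maximal runs of letters/digits/underscores; punctuation
        # such as commas and semicolons is simply thrown away.
        return re.findall(r"\w+", text)

    print(tokenize("Friends, Romans, Countrymen, lend me your ears;"))
    # ['Friends', 'Romans', 'Countrymen', 'lend', 'me', 'your', 'ears']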


Tokens, Types, and Terms
•  Text: “to sleep perchance to dream”
•  A token is an instance of a sequence of characters in some particular document that are grouped together as a useful semantic unit for processing
–  Examples: “to”, “sleep”, “perchance”, “to”, “dream”
•  A type is the class of all tokens containing the same character sequence
–  Examples: “to”, “sleep”, “perchance”, “dream”
•  A term is a (perhaps normalized) type that is included in the IR system’s dictionary
–  Examples: “sleep”, “perchance”, “dream”
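The distinction can be made concrete in a few lines; dropping “to” as a stop word mirrors the slide’s term example:

    text = "to sleep perchance to dream"
    tokens = text.split()        # 5 tokens: individual instances in the text
    types = set(tokens)          # 4 types: distinct character sequences
    terms = types - {"to"}       # 3 terms: types kept in the dictionary,
                                 # here after dropping "to" as a stop word
    print(len(tokens), len(types), sorted(terms))
    # 5 4 ['dream', 'perchance', 'sleep']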


Apostrophes
•  Used for possession and contractions
Mr. O’Neill thinks that the boys’ stories about Chile’s capital aren’t amusing.
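How a tokenizer handles these apostrophes is a design decision. As one reference point (the choice of NLTK here is an assumption; any tokenizer would do), a Penn Treebank-style tokenizer typically keeps “O’Neill” together but splits the contraction:

    from nltk.tokenize import word_tokenize  # pip install nltk; nltk.download('punkt')

    sent = "Mr. O'Neill thinks that the boys' stories about Chile's capital aren't amusing."
    print(word_tokenize(sent))
    # Penn Treebank conventions typically yield "O'Neill" as one token
    # and split "aren't" into "are" + "n't"; details vary by version.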


Specific Tokens
•  C++, C#
•  B-52, B777
•  M*A*S*H
•  Email addresses ([email protected])
•  Web URLs (http://www.cs.sfu.ca)
•  IP addresses (142.32.48.231)
•  Phone numbers (778-782-3054)
•  City names (San Francisco, New York)


Hyphens
•  Hyphenation is used in English for
–  Splitting up vowels in words (co-education)
–  Joining nouns as names (Hewlett-Packard)
–  Showing word grouping (the hold-him-back-and-drag-him-away maneuver)
–  Special usage (San Francisco-Los Angeles)
•  Splitting on white space may not always be desirable
–  “New York University” should not be returned for query “York University”
–  “lowercase”, “lower-case”, and “lower case” are equivalent


Word Segmentation
•  In some languages (e.g., Chinese), text is written without any spaces between words
信息检索和Web搜索是一门很有意思的课程。
(“Information retrieval and Web search is a very interesting course.”)
•  Word segmentation methods
–  Use a large vocabulary and take the longest vocabulary match (see the sketch below)
–  Machine learning sequence models (e.g., Markov models, conditional random fields)
–  Character k-grams
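A minimal sketch of the first method, greedy longest-vocabulary match (maximum matching), with a toy vocabulary; production segmenters use large dictionaries or the sequence models listed above:

    def segment(text, vocab, max_len=4):
        # Greedy longest-match word segmentation.
        words, i = [], 0
        while i < len(text):
            # Try the longest candidate starting at position i first.
            for j in range(min(len(text), i + max_len), i, -1):
                if text[i:j] in vocab or j == i + 1:  # single char as fallback
                    words.append(text[i:j])
                    i = j
                    break
        return words

    vocab = {"信息", "检索", "信息检索", "搜索", "课程"}  # toy vocabulary
    print(segment("信息检索", vocab))  # ['信息检索'] -- the longest match wins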


Stop Words
•  Some extremely common words that would appear to be of little value in helping select documents matching a user need are excluded from the vocabulary entirely
–  Determined by collection frequency – the total number of times each term appears in the document collection
•  Using a stop list significantly reduces the number of postings that a system has to store
[Figure: 25 semantically nonselective words that are common in Reuters-RCV1]
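A minimal filtering sketch; the stop list below is a small illustrative sample, not the actual Reuters-RCV1 list:

    # Illustrative stop list; a real one is derived from collection frequency.
    STOP_WORDS = {"a", "an", "and", "are", "as", "at", "be", "for",
                  "in", "is", "it", "of", "on", "the", "to", "was", "with"}

    def remove_stop_words(tokens):
        # Dropping these tokens before indexing shrinks the postings lists.
        return [t for t in tokens if t not in STOP_WORDS]

    print(remove_stop_words("the president of the united states".split()))
    # ['president', 'united', 'states']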


Using Stop Words or Not?
•  Phrase searches may need stop words
–  “President of the United States”
–  “flights from Vancouver” vs. “flights to Vancouver”
–  “To be or not to be”, “Let It Be”, “I don’t want to be”, …
•  Web search engines generally do not use stop lists
–  Some specific techniques, introduced later, can reduce the cost due to stop words


Token Normalization
•  How can we know “USA” matches “U.S.A.”?
•  Token normalization is the process of canonicalizing tokens so that matches occur despite superficial differences in the character sequences of the tokens


Creating Equivalence Classes
•  Using rules that remove characters such as hyphens
–  Both “anti-discriminatory” and “antidiscriminatory” map to “antidiscriminatory”
•  Maintaining relations between unnormalized tokens (see the sketch after this list)
–  Index unnormalized tokens and maintain a query expansion list of multiple vocabulary entries: when a query asks for “car”, search both “car” and “automobile”
–  Alternatively, expand during index construction: index a document containing “car” under both “car” and “automobile”
–  Expansion of query terms can be asymmetric
–  More space-costly, but more flexible than rule-based normalization
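A sketch of query-time expansion over unnormalized tokens; the expansion map is hypothetical and, as noted above, can be asymmetric:

    # "car" expands to "automobile", but a query for "automobile"
    # does not have to expand back to "car" (asymmetric expansion).
    EXPANSIONS = {
        "car": {"car", "automobile"},
        "usa": {"usa", "u.s.a."},
    }

    def expand_query(terms):
        expanded = set()
        for t in terms:
            expanded |= EXPANSIONS.get(t, {t})  # unknown terms map to themselves
        return expanded

    print(expand_query(["car"]))  # {'car', 'automobile'}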


Example
[Figure omitted]

Normalization Techniques
•  Accents and diacritics
–  Normalize tokens to remove diacritics
•  Capitalization/case-folding
–  Reducing all letters to lower case
•  May cause problems for names such as Bush, Black, Fed, …
–  Use heuristics to lowercase only some tokens, e.g., convert the first word in a sentence and all words in a title (see the sketch below)
–  Truecasing: use a machine learning sequence model to make the decision
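A sketch of the sentence-start heuristic (a simplification; truecasing would instead learn this decision from data):

    def heuristic_case_fold(tokens):
        # Lowercase only sentence-initial tokens; keep mid-sentence
        # capitalized tokens, which are likely names (Bush, Black, Fed).
        folded = []
        for i, tok in enumerate(tokens):
            sentence_start = i == 0 or tokens[i - 1] in {".", "!", "?"}
            folded.append(tok.lower() if sentence_start else tok)
        return folded

    print(heuristic_case_fold(["Black", "reporters", "questioned", "Fed", "officials", "."]))
    # ['black', 'reporters', 'questioned', 'Fed', 'officials', '.']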


Stemming and Lemmatization
•  How can we know “organize”, “organizes”, and “organizing” should map to the same word?
•  Stemming and lemmatization reduce inflectional forms and sometimes derivationally related forms of a word to a common base form
–  am, are, is → be
–  car, cars, car’s, cars’ → car
–  “the boy’s cars are different colors” → “the boy car be different color”


Stemming
•  Algorithmic stemming: a crude heuristic process that chops off the ends of words in the hope of being correct most of the time
–  Often removes derivational affixes
•  Porter’s algorithm
–  The rule “(m>1) EMENT → ” maps “replacement” to “replac”
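NLTK ships an implementation of Porter's algorithm; a quick way to try the rule above (the outputs shown follow Porter's rules, though minor variations across NLTK versions are possible):

    from nltk.stem import PorterStemmer  # pip install nltk

    stemmer = PorterStemmer()
    for word in ["replacement", "organize", "organizes", "organizing"]:
        print(word, "->", stemmer.stem(word))
    # replacement -> replac   (the "(m>1) EMENT ->" rule at work)
    # organize, organizes, and organizing all reduce to the same stem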


Comparison of Stemmers
[Figure omitted]

Lemmatization
•  Dictionary-based stemming
•  Uses a vocabulary and morphological analysis of words to remove inflectional endings only and return the base or dictionary form of a word (the lemma)
•  “saw” → “see” or “saw”, depending on whether the token is used as a verb or a noun
•  Brings very modest benefits for retrieval in English – improving recall but possibly hurting precision
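For comparison with stemming, a quick look at NLTK's WordNet-based lemmatizer (an example tool choice; it requires the WordNet data):

    from nltk.stem import WordNetLemmatizer  # pip install nltk; nltk.download('wordnet')

    lem = WordNetLemmatizer()
    print(lem.lemmatize("cars"))           # car  (noun is the default POS)
    print(lem.lemmatize("are", pos="v"))   # be
    # Resolving "saw" to "see" (verb) vs. "saw" (noun) requires knowing
    # the token's part of speech, i.e., morphological analysis in context.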


Krovetz Stemmer – A Hybrid Method
•  A hybrid approach
•  Constantly consults a dictionary to check whether a word is valid
–  If a word is not found, check it against a list of common inflectional and derivational suffixes, modify the word, and check the dictionary again
•  Uses manually generated exception entries to record special stemming rules
•  Low false positive rate, but tends to have a high false negative rate
•  Produces stems that, in most cases, are full words


Comparison
[Figure omitted]

Phrases
•  Phrases are important in queries
–  For query “black sea”, a document containing the sentence “the sea turned black” may not be a good match
•  Phrases are often subtle
–  For query “fishing supplies”, should documents containing “fish”, “fishing”, and “supplies” count?
•  How should phrases be identified during tokenization and stemming?
–  N-gram method: a phrase is any sequence of n words
–  Many search engines index all n-grams with 2 ≤ n ≤ 5
–  In a document of 1,000 words, there are 3,990 n-grams with 2 ≤ n ≤ 5 (verified in the sketch below)
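The count in the last bullet is easy to verify; this sketch enumerates all n-grams for 2 ≤ n ≤ 5 over a synthetic 1,000-word document:

    def ngrams(words, lo=2, hi=5):
        # All word n-grams with lo <= n <= hi, in document order.
        return [tuple(words[i:i + n])
                for n in range(lo, hi + 1)
                for i in range(len(words) - n + 1)]

    words = [f"w{i}" for i in range(1000)]  # synthetic 1,000-word document
    print(len(ngrams(words)))               # 999 + 998 + 997 + 996 = 3990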


Part-of-Speech (POS) Tagger
•  A POS tagger marks the words in a text with labels corresponding to the part of speech of each word in that context
–  Based on statistical or rule-based approaches
–  Trained using large, manually labeled corpora
•  Typical tags
–  NN (singular noun), NNS (plural noun), VB (verb), VBD (verb, past tense), VBN (verb, past participle), IN (preposition), JJ (adjective), CC (conjunction, e.g., “and”, “or”), PRP (pronoun), MD (modal auxiliary, e.g., “can”, “will”)
•  Noun phrases: sequences of nouns, or adjectives followed by nouns
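A quick way to see these tags in practice is NLTK's default tagger (a tool choice for illustration; it needs the tokenizer and tagger models downloaded):

    import nltk  # pip install nltk; nltk.download('punkt'); nltk.download('averaged_perceptron_tagger')

    tokens = nltk.word_tokenize("The fast flights from Vancouver landed safely.")
    print(nltk.pos_tag(tokens))
    # Expected output along the lines of:
    # [('The', 'DT'), ('fast', 'JJ'), ('flights', 'NNS'), ('from', 'IN'),
    #  ('Vancouver', 'NNP'), ('landed', 'VBD'), ('safely', 'RB'), ('.', '.')]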


POS Tagger Example
[Figure omitted]
Classroom discussion: What is the major drawback of a POS tagger in web search?


Summary
•  Extracting tokens from documents is an important task
•  Choosing document units
•  Tokenization
•  Stop words and whether to use them
•  Token normalization
•  Stemming and lemmatization
•  Processing phrases


To-do List
•  Read Section 4.3 in the textbook
•  Try out the Porter Stemmer