Searching

Web Systems and Algorithms: Searching and Indexing
Chris Brooks
Department of Computer Science, University of San Francisco


Information Needs

Once we’ve collected a large body of Web pages (or other documents), we would probably like to find information within this collection. There are a few criteria we would like to satisfy:

Expressivity: can we ask complex queries?
Scalability: can we handle large document collections?
Relevance: do the documents retrieved satisfy our need? Are they ranked?

A user typically has an information need. The job of an IR system (such as a search engine) is to translate that need into a query and then find documents that satisfy it.

We can evaluate a search engine by its effectiveness at satisfying this information need.


Queries

What sorts of queries might we want to ask?

Boolean queries: “cat AND dog”, “(cat OR dog) AND fish”, “cat AND NOT mouse”

Proximity queries: “cat NEAR dog”, “cat, dog within the same sentence”

Phrases: “the lazy dog”, “the cat in the hat”

Full text: the user just inputs keywords: “cat dog fish”

Synonymy: documents with words like “cat”

Similarity: documents similar to a given document

Boolean queries

Let’s start simple, with Boolean queries.

The user provides one or more keywords, and we must find documents containing those keywords.

Given this query language, how can we structure our document collection? We want to avoid a linear search through all documents.

Inverted Index

The standard way to solve this is through the construction of an inverted index: a hashtable mapping each token to the list of documents it appears in.

Construction is very easy:

    index = {}
    for document in collection:
        for word in document:
            if word not in index:
                index[word] = []
            if document not in index[word]:
                index[word].append(document)

(Note: we only add a document once, even if a word occurs in it multiple times.)
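As a quick end-to-end illustration of the same loop, here is a toy sketch that uses document IDs as the postings (the two documents are made up):

    # hypothetical toy collection: document ID -> token list
    collection = {"d1": "the cat sat on the mat".split(),
                  "d2": "the dog chased the cat".split()}

    index = {}
    for doc_id, words in collection.items():
        for word in words:
            postings = index.setdefault(word, [])
            if doc_id not in postings:  # each document appears only once
                postings.append(doc_id)

    print(index["cat"])  # ['d1', 'd2']
    print(index["dog"])  # ['d2']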


Inverted Index

Retrieval is also easy. For single-word queries, return the list of documents the word maps to.

For multi-word queries, compute the intersection (for AND), the union (for OR), or the set difference (for NOT) of the posting lists.

Intersection

How do we efficiently compute the intersection of two lists?

    # assume both posting lists are sorted
    def intersect(l1, l2):
        i = j = 0
        result = []
        while i < len(l1) and j < len(l2):
            if l1[i] == l2[j]:
                result.append(l1[i])
                i += 1
                j += 1
            elif l1[i] < l2[j]:
                i += 1
            else:
                j += 1
        return result
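For example, with sorted lists of document IDs, intersect([13, 100, 150], [13, 42, 150]) returns [13, 150], using a single linear pass over both lists.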


Union and NOT

Union and NOT are computed with very similar merges; see the sketch below.

For more complex queries, we want to take care to merge lists in the correct order: start with the least frequent term, then move in order of increasing frequency, so that intermediate results stay small.

Nested queries can be much more challenging.
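A rough sketch of the other two merges, under the same sorted-posting-list assumption as the intersection code above:

    # union: documents containing either term (for OR)
    def union(l1, l2):
        i = j = 0
        result = []
        while i < len(l1) and j < len(l2):
            if l1[i] == l2[j]:
                result.append(l1[i])
                i += 1
                j += 1
            elif l1[i] < l2[j]:
                result.append(l1[i])
                i += 1
            else:
                result.append(l2[j])
                j += 1
        result.extend(l1[i:])
        result.extend(l2[j:])
        return result

    # difference: documents in l1 but not in l2 (for AND NOT)
    def difference(l1, l2):
        i = j = 0
        result = []
        while i < len(l1):
            if j < len(l2) and l1[i] == l2[j]:
                i += 1
                j += 1
            elif j < len(l2) and l2[j] < l1[i]:
                j += 1
            else:
                result.append(l1[i])
                i += 1
        return result

Both run in a single linear pass, like the intersection merge.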

Processing Documents

We actually skipped a step here: we need to go from a document, which is a string, to a list of tokens, which are the keys in the inverted index.

We need to decide: how to separate tokens, which tokens to retain, and whether tokens can be combined or grouped.

Tokenizing

The simplest thing to do is to split on whitespace.

What about punctuation or non-alphanumeric characters? Do we throw them away? What about that’s, or aren’t, or C++? What about San Francisco, or Tiger Woods? What about anti-social vs. antisocial? Dates? Phone numbers?

What about languages like Chinese or Japanese, which do not separate words with whitespace?


Tokenizing

Choices include: a fixed set of rules, written as regular expressions; rules learned from labeled data; or a segmentation algorithm, such as Viterbi.

The choice depends on anticipated information needs, experience, and efficiency issues.
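As a minimal sketch of the regular-expression approach (this particular rule set is an illustrative assumption, not a production tokenizer):

    import re

    # one toy rule set: words with internal apostrophes or hyphens,
    # plus simple numbers and dates; everything else is discarded
    TOKEN = re.compile(r"[A-Za-z]+(?:['\-][A-Za-z]+)*|\d+(?:[/\-]\d+)*")

    def tokenize(text):
        return [t.lower() for t in TOKEN.findall(text)]

    print(tokenize("That's anti-social, isn't it? Call 415-422-5555."))
    # that's, anti-social, isn't, it, call, 415-422-5555

Note how each of the hard cases above corresponds to a deliberate choice in the rules.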

Removing non-useful tokens

We would also like to avoid indexing tokens that do not help us find documents.

Stopwords. These are words that do not carry any useful semantic content: a, an, the, are, among, about, he, she, they, etc.

We can either use a fixed stopword list, or determine stopwords through frequency analysis.

Removing stopwords may give us problems with phrase search: “As We May Think”, “University of San Francisco”.

Markup is another case; we can probably extract it with an HTML parser.

Stemming

We may also want to remove prefixes and suffixes: for example, a user searches for “car” and the document contains the token “cars”. This process is known as stemming.

Most stemmers use a fixed set of rules for removing suffixes. This can introduce errors, due to inconsistencies in English: university and universal both stem to univers.

In Web search, stemming can also introduce problems due to acronyms and jargon: gated, SOCKS, ides.
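A toy illustration of rule-based suffix stripping (these rules are stand-ins; a real stemmer such as Porter’s has many more rules and conditions):

    # illustrative suffix rules, checked longest-first
    SUFFIXES = ["ities", "ing", "es", "s"]

    def stem(word):
        for suffix in SUFFIXES:
            # only strip if a reasonable stem remains
            if word.endswith(suffix) and len(word) > len(suffix) + 2:
                return word[:-len(suffix)]
        return word

    print(stem("cars"))          # car
    print(stem("universities"))  # univers
    # "universal" would need its own "-al" rule to also reach "univers"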


Normalization

Stemming is a special case of what is called normalization or conflation.

Other issues include: accents and diacritical marks (less important in English, more so in other languages); case (do we want to convert everything to lower case? what about proper names like Bush or Windows?); acronyms; misspellings; dates; and regionalisms (color vs. colour).

Conflation

Terms with the same meaning are grouped into an equivalence class, e.g. car, auto, automobile. If we have a thesaurus such as WordNet that contains synonyms, we can deal with this in two different ways:

During querying, look up synonyms and add them to the query as a disjunction: “car” becomes “car” OR “auto” OR “automobile”.

During indexing, store a document under the entry for each of its synonyms.

The tradeoff is space vs. time.
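A sketch of the query-time option, with a hard-coded synonym table standing in for a real thesaurus like WordNet:

    # hypothetical synonym table; WordNet would supply this in practice
    SYNONYMS = {"car": ["auto", "automobile"]}

    def expand(term):
        # turn one query term into a disjunction over its equivalence class
        return " OR ".join([term] + SYNONYMS.get(term, []))

    print(expand("car"))   # car OR auto OR automobile
    print(expand("fish"))  # fish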


Measuring performance

There are two basic metrics for measuring retrieval performance:

Precision: what fraction of the documents returned actually meet the user’s information need?

Recall: what fraction of the documents that would meet the user’s information need are returned?

Often we trade one against the other.
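In set terms, a small sketch (relevant and retrieved are assumed to be sets of document IDs):

    def precision(retrieved, relevant):
        # fraction of the returned documents that are actually relevant
        return len(retrieved & relevant) / len(retrieved)

    def recall(retrieved, relevant):
        # fraction of the relevant documents that were returned
        return len(retrieved & relevant) / len(relevant)

    retrieved = {"d1", "d2", "d3", "d4"}
    relevant = {"d2", "d4", "d7"}
    print(precision(retrieved, relevant))  # 2/4 = 0.5
    print(recall(retrieved, relevant))     # 2/3 ≈ 0.67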


Ranked Retrieval

We probably don’t want to just give the user an unsorted list of results. We would instead like to present the best results first. What does “best” mean? The best match to our query? The most authoritative document?

Ranked Retrieval

If we want to score documents according to how well they match our query, we can use TF-IDF and cosine similarity. TF-IDF assigns each word in a document a weight according to how frequently it occurs in that document (discounted by how common it is across the collection).

Cosine similarity measures the angle between the two weight vectors, producing a score between 0 and 1.

This is nice, but perhaps not effective for short queries.
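A compact sketch of TF-IDF weighting and cosine scoring (raw term counts and a plain log IDF; real systems vary in the exact weighting formulas):

    import math
    from collections import Counter

    def tfidf_vectors(docs):
        # docs: document ID -> token list (assumed already preprocessed)
        N = len(docs)
        df = Counter(t for words in docs.values() for t in set(words))
        idf = {t: math.log(N / df[t]) for t in df}
        vecs = {d: {t: c * idf[t] for t, c in Counter(words).items()}
                for d, words in docs.items()}
        return vecs, idf

    def cosine(v1, v2):
        dot = sum(w * v2.get(t, 0.0) for t, w in v1.items())
        n1 = math.sqrt(sum(w * w for w in v1.values()))
        n2 = math.sqrt(sum(w * w for w in v2.values()))
        return dot / (n1 * n2) if n1 and n2 else 0.0

    docs = {"d1": "the cat sat".split(), "d2": "the dog ran".split()}
    vecs, idf = tfidf_vectors(docs)
    query = {"cat": idf["cat"]}  # treat the query as a tiny document
    print(sorted(vecs, key=lambda d: cosine(query, vecs[d]), reverse=True))
    # ['d1', 'd2']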

Ranked Retrieval

An alternative approach is to use a document’s place in the Web to determine its rank. Intuition: each document has prestige, which is a function of the prestige of the documents that link to it. This is the basis of PageRank.

One advantage is that a document’s owner cannot “game” the system by putting extra words into a document.

PageRank

The idea behind PageRank is this: what is the probability of winding up on a web page x if one is surfing “at random”?

Let’s assume the web is strongly connected, and that every page has out-degree >= 1.

To have clicked once and wound up at x, a surfer must have: been at a page y that links to x, and chosen the link from y to x out of the N_y = E(y) outward links from y. Assuming a uniform distribution, this choice has probability 1/N_y.


PageRank

Since there may be many pages y linking to x, we sum over all of them:

    p1[x] = sum over pages y linking to x of p0[y] * (1/N_y)

(each term is the probability of being at y and then choosing to go to x).

We can follow this computation out, iterating it, to determine the stationary distribution of p. This is a page’s prestige, or PageRank. We’ll go through this more carefully next week.

What’s interesting about PageRank is that a document’s “value” is determined by the value of its neighbors.

In practice, we would first find pages that matched a query, then use PageRank to order these results, perhaps in combination with document-level rankings.

A criticism of PageRank is that there’s not necessarily a connection between quality and popularity.
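A bare-bones power-iteration sketch of the computation just described, under the slide’s assumptions (strongly connected graph, every page has at least one outlink, and no damping factor):

    def pagerank(links, iterations=50):
        # links: page -> list of pages it links to
        pages = list(links)
        p = {x: 1.0 / len(pages) for x in pages}  # uniform start
        for _ in range(iterations):
            nxt = {x: 0.0 for x in pages}
            for y, outs in links.items():
                share = p[y] / len(outs)  # each outlink gets 1/N_y of y's mass
                for x in outs:
                    nxt[x] += share
            p = nxt
        return p

    print(pagerank({"a": ["b"], "b": ["a", "c"], "c": ["a"]}))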


Index compression

An inverted index can take up a large amount of space, particularly if positional information about where a term occurred in a document is also kept. For performance reasons, it is preferable to keep the index small enough to be held in memory. We can do this through index compression. If our decompression algorithm is fast enough, the cost of decompression will be less than the cost of retrieving an entry from disk.

A simple thing we can do is to store offsets (gaps) between document IDs rather than the IDs themselves. For example, if we have foo: 13, 100, 150, we store foo: 13, 87, 50. For large numbers of documents, the gaps will usually be much smaller than the IDs, and can be encoded in fewer bits; see the sketch below.

We can also construct a hierarchy of indices, using a tree-based approach, in which each identifier maps into a second-level index. Note that compression makes updating the index more challenging.
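A sketch of the gap encoding itself; a real index would then pack the small gaps into a variable-length byte or bit encoding:

    def to_gaps(doc_ids):
        # [13, 100, 150] -> [13, 87, 50]
        return [doc_ids[0]] + [b - a for a, b in zip(doc_ids, doc_ids[1:])]

    def from_gaps(gaps):
        doc_ids, total = [], 0
        for g in gaps:
            total += g
            doc_ids.append(total)
        return doc_ids

    print(to_gaps([13, 100, 150]))  # [13, 87, 50]
    print(from_gaps([13, 87, 50]))  # [13, 100, 150]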


Relevance Feedback

Retrieval can be very challenging in the Web domain: queries are very short, and there are lots of potential results. If we can get the user to give us some help in understanding their information need, we can improve performance. This is called relevance feedback.

The basic idea is this: the user submits a query q, and we return some documents. The user marks each returned document as “yes” or “no”. We then use the features of the “yes” documents to do a second search.


Rocchio’s method

A simple way to do this is called Rocchio’s method. We compute each document’s score as before. We then compute a weight for the most useful words in the “good” documents, and a weight for the most useful words in the “bad” documents. These weights are used to adjust a document’s score.
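The slide describes adjusting document scores; the classic formulation of Rocchio instead moves the query vector itself. A sketch of that version, with conventional but illustrative weights:

    def rocchio(query, good, bad, alpha=1.0, beta=0.75, gamma=0.15):
        # query: term -> weight; good/bad: lists of document vectors
        terms = set(query)
        for v in good + bad:
            terms |= set(v)
        new_q = {}
        for t in terms:
            g = sum(v.get(t, 0.0) for v in good) / max(len(good), 1)
            b = sum(v.get(t, 0.0) for v in bad) / max(len(bad), 1)
            new_q[t] = alpha * query.get(t, 0.0) + beta * g - gamma * b
        return {t: w for t, w in new_q.items() if w > 0}

The reweighted query can then be rescored against the collection with cosine similarity, as before.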


Probabilistic feedback

If we have “good” and “bad” documents, we can also use them to build a Naive Bayes classifier that will predict the likelihood of a document matching a search query.

We can also build a probabilistic model directly from our collection and use this to predict the likelihood that a document will satisfy a query:

    P(d_x = True | q = {cat, dog, tree})
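A minimal sketch of the Naive Bayes idea over word-presence features, with add-one smoothing (the feature choice and the omitted class prior are simplifications):

    import math

    def presence_probs(docs, vocab):
        # P(word present | class), estimated from one class's documents
        return {w: (sum(w in set(d) for d in docs) + 1) / (len(docs) + 2)
                for w in vocab}

    def log_odds_good(doc, p_good, p_bad):
        words = set(doc)
        score = 0.0
        for w in p_good:
            if w in words:
                score += math.log(p_good[w] / p_bad[w])
            else:
                score += math.log((1 - p_good[w]) / (1 - p_bad[w]))
        return score  # > 0: looks more like the "good" documents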


Relevance Feedback

Relevance feedback has not been widely used in web search. Issues: users don’t like labeling data; users don’t want to search twice; and it is mostly useful for boosting recall, which is not as important as precision for this task.

We can, however, approximate “good” documents by mining clickstream data.

Handling metadata

The standard bag-of-words approach ignores metadata and document structure. The simplest approach is to add weights to terms that occur inside a tag of interest. The META tag was originally useful for this, but it is an obvious target for spammers.

The anchor text inside links to a particular page can be particularly helpful. We can also make use of documents that are linked to by many pages.

Complex queries

The approach we’ve seen so far will work for Boolean and free-text searches. What about phrases or sentences?


The simplest approach would be to construct a separate index for phrases of length n or less. We don’t want to keep all phrases, though; just those that are statistically interesting. To decide, we estimate the frequencies of tokens t1 and t2 from our collection: if t1 t2 is an interesting phrase, then

    P(t1 t2) >> P(t1) * P(t2)

Positional indices

An alternate, more scalable approach is to keep track of the position in a document where a term occurs. To process a phrase query, we treat it as a Boolean query, but in computing the intersection we only admit adjacent words. For example, suppose we had the query “fat cat” and the following postings:

    fat: d1: (34, 72, 103), d2: (44, 61)
    cat: d1: (35, 88, 104), d2: (42, 99)

We would conclude that d1 has two matches (positions 34-35 and 103-104), and d2 zero.
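A sketch of the positional check for a two-word phrase (postings map document -> sorted position list, as in the example above):

    def phrase_matches(pos1, pos2):
        # count occurrences of word 2 immediately after word 1
        later = set(pos2)
        return sum(1 for p in pos1 if p + 1 in later)

    fat = {"d1": [34, 72, 103], "d2": [44, 61]}
    cat = {"d1": [35, 88, 104], "d2": [42, 99]}
    for doc in fat:
        print(doc, phrase_matches(fat[doc], cat.get(doc, [])))
    # d1 2
    # d2 0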