Computer Science / CS342 / Fall 2013

Multimedia Retrieval Chapter 2: Text Retrieval

Dr. Roger Weber, [email protected]

2.1 Overview and Motivation
2.2 Term Extraction
2.3 Models for Text Retrieval
2.4 Indexing Structures
2.6 Literature and Links

2.1 Overview and Motivation



• The problem of managing and retrieving information has been a constant challenge throughout the history of computer science. With the first generation of computers, information was stored on punch cards. Programs then used features (punches) to select the specific information the user had requested. Today, storing and archiving information is hardly a problem any more; the search problem, however, remains challenging. Typical types of information retrieval:
  – Database: information is maintained in a structured way. Queries refer to the structure to define constraints on the information to be retrieved (SQL as query language). However, SQL comes with some constraints on retrieval: a query like SELECT * FROM * WHERE * LIKE '%house%' is not feasible and is not supported by vendors.
  – Boolean retrieval systems: to simplify matters, the first systems only allowed Boolean queries. While browsing through the collection of documents, a document can be marked as relevant without knowing the status of other documents (recall: punch cards and tapes only support sequential access). The first systems appeared in libraries: the user entered data into a form to express „this field must contain terms x and y" or „this field must contain either term x or term y". The systems reported matching books in an unordered list (or ordered by some internal key).
  – Retrieval systems with ranking: right from the beginning, people noticed that Boolean search is not well suited for searching. On the one hand, it was difficult for users to express their information need with the given query language. On the other hand, relevant documents often appeared on later result pages rather than right at the beginning. Vector space retrieval simplified the query language and introduced a ranking based on the relevance of a document for a given query.

Multimedia Retrieval – Fall 2013

Page 2-2

  – Vague queries against databases: consider a database containing sales information for computers (e.g., processor frequency, memory, hard disk capacity, price). A vague query such as „customer looks for a cheap computer with at least 500 MHz, 64 MB, and 300 GB" cannot be answered directly with an SQL statement. For instance, the condition „hard disk capacity >= 300 GB" is too hard, as the customer would also buy a PC with just 280 GB if that one is cheaper. To handle such queries, a system must display results that, to a certain degree, do not match all query constraints. Otherwise, the user would have to perform multiple queries, varying the values of the constraints. This type of search problem comes close to the nearest neighbor search of Chapter 6.
  – Complex queries: consider a database with information on industrial parts. A complex query may look as follows:
    • "Find bolts made of steel with a radius of 2.5 mm and a length of 10 cm, implementing DIN 4711. The bolts should have a polished surface and should be usable within an electronic engine."
    The main problem with the query above is that research currently has few ideas on how to model such complex information. For simpler cases, a semantic network is appropriate to model complex information. Nevertheless, it remains open whether current retrieval methods are suitable for such query scenarios.


  – Web retrieval: with the advent of the Web, new problems arose. First, retrieval systems had to adapt to a less controlled environment without context (no fixed vocabulary, multiple languages, varying quality of sources). Second, the enormous number of documents (and their uncontrolled additions) in result lists demonstrated the weakness of the then-current ranking algorithms. Third, search engines had to face web sites that tried to manipulate their rankings for certain keywords (nobody would write a book which contains a term a thousand times in the title). Web retrieval quickly turned out to be a research topic of its own (Chapter 3).
  – Multimedia content: the increasing amount of digitized data (images, audio files, video files) challenged information systems and search engines: how can we efficiently search for images, music, or movies? Today's search engines still focus on meta-data or annotations and use text retrieval models to identify relevant content. Chapters 5 and 6 elaborate on newer approaches that extract features from the signal information.
  – Heterogeneous, distributed, autonomous information sources: frequently, documents are stored in different systems and archives with varying functionality. The user, however, should not have to check each system in turn to identify relevant information. Rather, a central (or virtual) view over all documents should enable fast retrieval of documents of any kind.


Text Retrieval – Overview

(figure: architecture of a text retrieval system, offline ←→ online from the perspective of query evaluation)
1. insert: a new document („Dogs at home") is added to the collection
   – feature extraction (offline): docID = doc10; dog → word 10, word 25; cat → word 13; home → word 2, word 27; ...
   – indexing (offline) into an inverted file: dog → doc3, doc4, doc10; cat → doc10; home → doc1, doc7, doc10; ...
2. query transformation: Q = {dog, dogs, hound, home}
3. retrieval: look up the query terms in the inverted file
4. relevance ranking: RSV(Q,doc1) = .2, RSV(Q,doc4) = .4, RSV(Q,doc10) = .6
   result: doc10, doc4, doc1

The retrieval problem
• Given:
  – N documents (D0, ..., DN-1)
  – a query Q of the user
• Problem:
  – compute a ranked list of k documents Dj that best match the query

– Links to external references, e.g., „read this important note"
– Question: what does the link text describe? the document itself or the embedded/referenced object?

• Usually, the link text is associated with both the embedding and the linked document. In most cases, the link text is a good summary for the linked document. In a few cases, the link text is meaningless („click here“)


Elimination of structure in today's search engines

• Most search engines distinguish different areas in the document:
  – Title (<title>XYZ</title>)
  – Remaining header (meta keywords, ...)
  – Main body (<body>...</body>)
    • Text between markups (<b>Hallo</b> → Hallo)
    • Special attributes of selected tags (e.g., an image's alt attribute: "Image of my cat")
  – URL / links
  – Seldom: comments! (<!-- ... -->)
  – Google was the first engine to use link text to describe both the current document and the referenced one:
    • <a href="http://www-dbs.ethz.ch/">Database Research Group</a>

→ „Database Research Group“ also describes the document at http://www-dbs.ethz.ch/

  – Search engines frequently weight text pieces according to their layout: e.g., terms within the title tag receive higher weights than terms within plain body text. Weights can be realized as multiple occurrences of a term (see step 3, position and occurrences).


Elimination of Structure in other Document Types

• XML/PDF/Postscript, ...:
  – The general approach is similar to HTML; extraction and weighting have to be adapted to the syntax of the respective document format. Additionally, heuristic algorithms are deployed to detect the importance of text pieces and references to other documents.
  – In Chapter 4, we study XML documents and combine structure and text retrieval. In this chapter, we simply reduce an XML document to a list of terms with weights and occurrences associated with them.


Step 2: Elimination of frequent/infrequent terms

• The objective of indexing is to determine useful answers for user queries. To achieve this goal, it is not necessary to consider terms with little or no semantics (e.g., the, a, it) or terms that appear only seldom (e.g., endoplasmic reticulum in a computer science library).



• Theoretical solution: restrict indexing to terms that have proven to be useful or that appear interesting from past practical experience with the system. However, this requires a feedback mechanism with the user to understand term importance.



• Pragmatic solution: term frequencies in a document collection follow the so-called Zipfian distribution
  – Rank terms by their occurrence frequencies
  – Let
      N  = number of term occurrences in the collection
      M  = number of distinct terms in the collection
      nt = number of occurrences of term t
      rt = rank of term t when ordered by the number of occurrences
      pr = probability that a randomly picked term occurrence has rank r, i.e., pr = nt / N for r = rt


Zipf's law

• Central theorem:

    pr = c / r    with c ≈ 0.1 (const),    i.e.,  rank · frequency = const

  – If term t occurs nt times, then t has rank rt = c·N / nt
  – Mathematical considerations:
    • Σ(r=1..M) pr = 1 must hold
    • it follows:  c = 1 / Σ(r=1..M) (1/r) ≈ 1 / (ln(M) + 0.5772)
    • c depends on the number of distinct terms M:

        M:  5‘000    10‘000   50‘000   100‘000
        c:  0.11     0.10     0.09     0.08
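The normalization of pr can be checked numerically. The following sketch (not from the slides) computes the exact constant c = 1 / Σ(r=1..M) 1/r and the approximation via ln(M) + 0.5772, reproducing the table above:

```python
# Sketch: the Zipf constant c for a vocabulary of M distinct terms,
# exact harmonic-sum version vs. the slide's approximation.
import math

def zipf_constant(M: int) -> float:
    """Exact constant so that the probabilities p_r = c/r, r = 1..M, sum to 1."""
    return 1.0 / sum(1.0 / r for r in range(1, M + 1))

def zipf_constant_approx(M: int) -> float:
    """Approximation 1 / (ln(M) + 0.5772) using the Euler-Mascheroni constant."""
    return 1.0 / (math.log(M) + 0.5772)

if __name__ == "__main__":
    for M in (5_000, 10_000, 50_000, 100_000):
        print(M, round(zipf_constant(M), 2), round(zipf_constant_approx(M), 2))
```

Running this prints 0.11, 0.10, 0.09, 0.08 for the four vocabulary sizes, matching the table.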

Distribution of term frequencies


Restriction to significant terms

• Stop words are terms with little or no semantic meaning. Often, stop words are not indexed since they occur in almost all documents and carry only little information. Examples:
  – German: in, der, wo, ich
  – English: the, a, is
  The rank of these terms typically lies on the left side of the „upper cut-off" line. Generally, stop words account for 20% to 30% of the term occurrences in a text. Eliminating stop words therefore significantly reduces the memory consumption of the index.



• Similarly, the most frequent terms in a collection of documents carry little information (rank on the left side of the „upper cut-off" line):
  – The term „computer" is meaningless for indexing articles about computer science
  – The term „computer" is, however, important to distinguish general articles from articles on computer science



• Analogously, one can strip off terms that are seldom used, assuming that users will not use them in their queries (the rank is on the right side of the „lower cut-off" line). However, the additional memory consumption of keeping them is rather small.


Discrimination Method

• Let sim(di, dj) denote the similarity between two documents Di and Dj with 0 ≤ sim(di, dj) ≤ 1 (1: identical, 0: dissimilar)
• The document Di is described by a vector di; component k denotes the number of occurrences of the k-th term in document Di (see later on)
• We introduce a hypothetical document C (centroid) which contains all M terms with their average frequencies, i.e., ck = nk / N
• The density of a collection is the sum of similarities between all documents and C:

    Q = Σ(i=1..N) sim(C, di)

• Finally, we obtain the discrimination value of a term, i.e., the contribution of a term to distinguishing documents in the collection from each other:
  – Remove term t from the collection and compute a new density Qt according to the approach described above (ignoring the components associated with term t)
  – The discrimination value is then:  DWt = Qt − Q
  – Interpretation: if DWt > 0, it follows that Qt > Q. In other words, the average similarity to the centroid has increased after the removal of term t:
    • DWt > 0 ⇒ term t is a good descriptor (useful for describing documents)
    • DWt ≈ 0 ⇒ term t contains little semantics to distinguish documents
    • DWt < 0 ⇒ term t is a bad descriptor
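The discrimination value can be sketched in a few lines. The example below is a toy illustration (not from the slides): it assumes cosine similarity as sim and represents documents as term-to-frequency dicts; a term appearing in only one document gets DWt > 0, while a stop-word-like term present everywhere gets DWt < 0:

```python
# Sketch of the discrimination method: DW_t = Q_t - Q with cosine similarity
# as sim and the centroid C holding every term's average frequency.
import math

def cosine(a, b):
    terms = set(a) | set(b)
    dot = sum(a.get(t, 0) * b.get(t, 0) for t in terms)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def centroid(docs):
    # hypothetical document C: every term with its average frequency
    c = {}
    for d in docs:
        for t, f in d.items():
            c[t] = c.get(t, 0) + f / len(docs)
    return c

def density(docs):
    C = centroid(docs)
    return sum(cosine(C, d) for d in docs)

def discrimination_value(docs, term):
    # remove term t everywhere, recompute the density, take the difference
    stripped = [{t: f for t, f in d.items() if t != term} for d in docs]
    return density(stripped) - density(docs)

docs = [{"dog": 2, "the": 5}, {"cat": 3, "the": 5}, {"bird": 1, "the": 5}]
print(discrimination_value(docs, "dog"))  # > 0: good descriptor
print(discrimination_value(docs, "the"))  # < 0: bad descriptor (stop-word-like)
```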

Step 3: Mapping text to terms

• To select appropriate features for documents, one typically uses linguistic or statistical approaches that define the features based on words, fragments of words, or phrases.
• Most search engines use words or phrases as features. Some engines use stemming, some differentiate between upper and lower case, and some support error correction.
• An interesting option is the usage of fragments of words, i.e., so-called n-grams. Although not directly related to the semantics of the text, they are very useful to support "fuzzy" retrieval:
  – Example: street  → str, tre, ree, eet
             streets → str, tre, ree, eet, ets
             strets  → str, tre, ret, ets
  – Benefits:
    • Simple misspellings or bad recognition often lead to bad retrieval results; fragments significantly improve retrieval quality in such cases (see below)
    • Stemming and syllable division are not necessary any more
    • No language-specific processing necessary; every language is processed equally
  – EuroSpider uses 3-grams to index OCR texts. Examples:
    • Zentral- und Hochschulbibliothek Luzern (www.zhbluzern.ch) http://zhbluzern.eurospider.com/digital_library/
    • Luzerner Staatsarchiv (www.staluzern.ch) http://staluzern.eurospider.com/digital_library/
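The n-gram idea above can be sketched directly; the overlap function below is an assumed Dice-style score (one of several possible choices, not prescribed by the slides) showing why the misspelling "strets" still matches "streets":

```python
# Sketch: character n-grams and a simple n-gram overlap score.
def ngrams(word: str, n: int = 3) -> list[str]:
    """All overlapping substrings of length n."""
    return [word[i:i + n] for i in range(len(word) - n + 1)]

def overlap(a: str, b: str, n: int = 3) -> float:
    """Dice-like fraction of shared n-grams; robust against misspellings."""
    ga, gb = set(ngrams(a, n)), set(ngrams(b, n))
    return 2 * len(ga & gb) / (len(ga) + len(gb)) if ga or gb else 0.0

print(ngrams("street"))              # ['str', 'tre', 'ree', 'eet']
print(overlap("streets", "strets"))  # misspelling still scores high
```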


Locations and frequency of terms

• Retrieval algorithms often use the number of term occurrences and the positions of terms within the document to identify and rank results:
  – Term frequency ("feature frequency"): tf(Ti, Dj) = number of occurrences of feature Ti in document Dj
  – Term locations ("feature locations"): loc(Ti, Dj) → P(ℕ) [set of locations]; the cardinality of loc(Ti, Dj) is identical to tf(Ti, Dj)
• Term frequency is important to rank documents, cf. Section 2.3
• Term locations frequently influence the ranking and whether a document appears in the result at all, e.g.:
  – Condition: Q = „white NEAR house" (explicit phrase matching): looking for documents with the terms "white" and "house" close to each other
  – Ranking: Q = „white house" (implicit phrase matching): documents with the term "white" next to "house" should be at the top of the results
• Finally, search engines resort to locations to identify suitable pieces of the document for the presentation of the result set (snippets)
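For a tokenized document, tf, loc, and a NEAR operator are straightforward; this sketch (assumed helper functions, window size chosen arbitrarily) makes the cardinality relation between loc and tf explicit:

```python
# Sketch: term locations and term frequency; |loc(T, D)| == tf(T, D).
def loc(term: str, tokens: list[str]) -> set[int]:
    """Set of positions at which `term` occurs in the token list."""
    return {i for i, t in enumerate(tokens) if t == term}

def tf(term: str, tokens: list[str]) -> int:
    """Term frequency = cardinality of the location set."""
    return len(loc(term, tokens))

def near(t1: str, t2: str, tokens: list[str], window: int = 3) -> bool:
    """Naive NEAR operator: some occurrence of t1 within `window` tokens of t2."""
    return any(abs(i - j) <= window
               for i in loc(t1, tokens) for j in loc(t2, tokens))

doc = "the white house is a white building".split()
print(loc("white", doc))            # {1, 5}
print(tf("white", doc))             # 2
print(near("white", "house", doc))  # True
```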


Step 4: Reduction of terms to their stems

• Stemming: in most languages, words have various inflected (or, sometimes, derived) forms. The different forms should not carry different meanings but should be mapped to a single form. However, in many languages, it is not simple to derive the linguistic stem without a dictionary. At least for English, there exist algorithms that produce good results without the need for a dictionary (Porter algorithm).
  – German: due to strong conjugations and declensions, it is not possible to automatically derive a linguistically correct stem. Stemming requires a dictionary which enumerates all possible inflections of a stem. Examples:
    • gehen: gehe, gehst, geht, gehen, ging, gingst, gingen, gegangen
    • Haus: Haus, Hauses, Häuser
    In addition, the German language supports composite words which may or may not be split into their parts:
    • Gartenhaus → Garten, Haus (?? good or bad ??)
  – English: fewer and more regular inflections. There exist algorithms to automatically derive the stem (near-stem) of terms. Examples:
    • going → go, conflated → conflat, relational → relat


Porter Algorithm (short overview) •

• •

The Porter Algorithm determines not a linguistic stem but a near-stem of words, i.e., in most cases, words with the same linguistic stem are reduced to same approximate stem. The algorithm is very efficient. – Porter defines character v as a „vocal“ if • it is an A, E, I, O, U • it is a Y and the preceding character is not a „vocal“ (e.g. RY, BY) – All other characters are consonants – Let C be a sequence of consonants, and let V be a sequence of vocals – Each word follows the following pattern: • [C](VC)m[V] • m is the measure of the word – further: • *o: stem ends with cvc; second consonant must not be W, X or Y (-WIL, -HOP) • *d: stem with double consonant (-TT, -SS) • *v*: stem contains a vocal The following rules define mappings for words with the help of the forms introduced above. m is used to avoid overstemming of short words. Source: Porter, M.F.: An Algorithm for Suffix Stripping. Program, Vol. 14, No. 3, 1980
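The measure m can be computed by classifying each character as vowel or consonant and counting completed VC groups. The following is a minimal sketch of that step only (not a full Porter stemmer); Porter's paper gives TREE m=0, TROUBLE m=1, OATEN m=2, PRIVATE m=2:

```python
# Sketch: Porter's word measure m for the pattern [C](VC)^m[V].
def forms(word: str) -> str:
    """Classify each character as 'v' (vowel) or 'c' (consonant);
    y counts as a vowel when the preceding character is not a vowel."""
    out = []
    for i, ch in enumerate(word.lower()):
        if ch in "aeiou" or (ch == "y" and i > 0 and out[-1] == "c"):
            out.append("v")
        else:
            out.append("c")
    return "".join(out)

def measure(word: str) -> int:
    """m = number of v->c transitions, i.e., completed VC groups."""
    f = forms(word)
    return sum(1 for i in range(len(f) - 1) if f[i] == "v" and f[i + 1] == "c")

print(measure("tree"), measure("trouble"), measure("oaten"), measure("private"))
# 0 1 2 2
```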


Porter algorithm - extracts (1)

  Rule                        Examples
  Step 1
  a)  SSES -> SS              caresses  -> caress
      IES  -> I               ponies    -> poni
      SS   -> SS              caress    -> caress
      S    ->                 cats      -> cat
  b)  (m>0) EED -> EE         feed      -> feed
      (*v*) ED  ->            plastered -> plaster
      (*v*) ING ->            motoring  -> motor
  ... (further rules)
  Step 2
      (m>0) ATIONAL -> ATE    relational  -> relate
      (m>0) TIONAL  -> TION   conditional -> condition
      (m>0) ENCI    -> ENCE   valenci     -> valence
      (m>0) IZER    -> IZE    digitizer   -> digitize
  ... (further rules)

Porter algorithm - extracts (2)

  Rule                                   Examples
  Step 3
      (m>0) ICATE -> IC                  triplicate -> triplic
      (m>0) ATIVE ->                     formative  -> form
      (m>0) ALIZE -> AL                  formalize  -> formal
  ... (further rules)
  Step 4
      (m>1) and (*S or *T) ION ->        adoption   -> adopt
      (m>1) OU  ->                       homologou  -> homolog
      (m>1) ISM ->                       platonism  -> platon
  ... (further rules)
  Step 5
  a)  (m>1) E ->                         rate       -> rate
      (m=1) and (not *o) E ->            cease      -> ceas
  b)  (m>1 and *d and *L) -> single letter   controll -> control

Dictionary based stemming

• A dictionary significantly improves the quality of stemming (note: the Porter algorithm does not derive a linguistically correct stem). It determines the correct linguistic stem for all words, but at the price of additional lookup costs and maintenance costs for the dictionary.
• The EuroWordNet initiative tries to develop a semantic dictionary for the European languages. Next to words, the dictionary shall contain inflected forms and relations between words (see next section). However, the usage of these dictionaries is not for free (with the exception of WordNet for English). Names remain a problem of their own...
• Examples of such dictionaries / ontologies:
  – EuroWordNet: http://www.illc.uva.nl/EuroWordNet/
  – GermaNet: http://www.sfs.uni-tuebingen.de/lsd/
  – WordNet: http://wordnet.princeton.edu/
• We look at dictionary-based stemming with the example of Morphy, the stemmer of WordNet. Morphy combines two approaches for stemming:
  – a rule-based approach for regular inflections, much like the Porter algorithm but much simpler
  – an exception list with strong or irregular inflections of terms


Rule-based approach for regular inflections

• The rule-based approach follows rules similar to the Porter algorithm but much simpler, as it need not necessarily succeed. It targets regular inflections.
• The rules depend on the type of word (noun, adjective, verb):
  – The rules are applied to the current term; the reduced form is looked up in a dictionary. If it exists in the dictionary, it is a possible stem of the current term.
  – This may lead to several stems which are all valid and are returned to the caller of the algorithm.
  Example: axes -> axis, axe

  Type   Suffix   Ending
  NOUN   s        -
  NOUN   ses      s
  NOUN   xes      x
  NOUN   zes      z
  NOUN   ches     ch
  NOUN   shes     sh
  NOUN   men      man
  NOUN   ies      y
  VERB   s        -
  VERB   ies      y
  VERB   es       e
  VERB   es       -
  VERB   ed       e
  VERB   ed       -
  VERB   ing      e
  VERB   ing      -
  ADJ    er       -
  ADJ    est      -
  ADJ    er       e
  ADJ    est      e

  (- denotes the empty ending, i.e., the suffix is simply stripped)
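The rule-plus-dictionary lookup described above can be sketched as follows. The dictionary, the exception entry, and the rule excerpt are toy assumptions for illustration; real Morphy ships full rule tables and exception files:

```python
# Sketch of Morphy-style stemming: apply suffix-detachment rules, keep only
# candidates that exist in the dictionary, and add exception-list stems.
NOUN_RULES = [("ses", "s"), ("xes", "x"), ("zes", "z"), ("ches", "ch"),
              ("shes", "sh"), ("men", "man"), ("ies", "y"), ("s", "")]
DICTIONARY = {"axe", "axis", "box", "church", "man"}   # toy dictionary
EXCEPTIONS = {"axes": ["axis"]}                        # strong/irregular forms

def stems(word: str) -> set[str]:
    found = set(EXCEPTIONS.get(word, []))
    for suffix, ending in NOUN_RULES:
        if word.endswith(suffix):
            candidate = word[: len(word) - len(suffix)] + ending
            if candidate in DICTIONARY:   # only dictionary words are valid stems
                found.add(candidate)
    return found

print(stems("axes"))      # {'axe', 'axis'}: rule s -> '' plus the exception list
print(stems("churches"))  # {'church'}: rule ches -> ch
```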

Exception list with strong or irregular inflections

• For each type of word, an exception list enumerates all strong or irregular inflections of words. The stems obtained this way are returned together with the ones obtained by the rule-based approach.

  adj.exc (~1500 entries):
  ... stagiest stagy · stalkier stalky · stalkiest stalky · stapler stapler · starchier starchy · starchiest starchy · starer starer · starest starest · starrier starry · starriest starry · statelier stately · stateliest stately · steadier steady · steadiest steady · stealthier stealthy · stealthiest stealthy · steamier steamy · steamiest steamy ...

  verb.exc (~2400 entries):
  ... ate eat · atrophied atrophy · averred aver · averring aver · awoke awake · awoken awake · babied baby · baby-sat baby-sit · baby-sitting baby-sit · back-pedalled back-pedal · back-pedalling back-pedal · backbit backbite · backbitten backbite · backslid backslide · backslidden backslide · bade bid · bagged bag · bagging bag ...

  noun.exc (~2000 entries):
  ... neuromata neuroma · neuroptera neuropteron · neuroses neurosis · nevi nevus · nibelungen nibelung · nidi nidus · nielli niello · nilgai nilgai · nimbi nimbus · nimbostrati nimbostratus · noctilucae noctiluca · nodi nodus · nielli niello ... noes no · nomina nomen · nota notum · noumena noumenon · novae nova · novelle novella ...

Step 5: Mapping to index terms

• Term extraction must further deal with homonyms (equal terms but different semantics) and synonyms (different terms but equal semantics). There are further relations between terms that may be useful to consider. The following is a list of the most common relationships:
  – Homonyms (equal terms but different semantics): bank (shore vs. financial institute)
  – Synonyms (different terms but equal semantics): walk, go, pace, run, sprint
  – Hypernyms (umbrella terms) / hyponyms (sub-terms): animal ← dog, cat, bird, ...
  – Holonyms (the whole, "has part") / meronyms (the component, "is part of"): door ← lock
• The relationships above define a network (often denoted as an ontology) with terms as nodes and relations as edges. An occurrence of a term may also be interpreted as an occurrence of near-by terms in this network (whereby "near-by" has to be defined appropriately).
  – Example: a document contains the term „dog". We may also interpret this as an occurrence of the term „animal" (with a smaller weight).


Concluding remarks

• Some search engines do not implement steps 4 and 5. Google only recently improved its search capabilities with stemming.
• If the collection contains documents in different languages, cross-lingual approaches are required that (automatically) translate or relate terms across languages and make documents retrievable even for queries in a different language than the document.
  – English: to go, German: gehen, French: aller → „moving"
• Term extraction for queries:
  – Similar to term extraction for documents
  – If term extraction for queries implements step 5:
    • Omit step 5 in term extraction for the documents of the collection
    • Extend the query terms with „near-by" terms:
      – Expansion with synonyms: Q = „house" → Qnew = „house, home, domicile, ..."
      – If a specialized search returns too few answers, exchange keywords with their hypernyms: e.g., Q = „mare" (female horse) → Qnew = „horse"
      – If a general search term returns too many results, let the user choose a more specialized term to reduce the result list: e.g., Q = „horse" → Qnew = „mare, pony, chestnut, pacer"


Other document types, e.g., XML

• XML retrieval comprises two different aspects:
  – Text retrieval: similar to web retrieval, only the textual parts within the document are of interest while the structure is not needed to answer the query. If the search engine understands the semantics of the XML document (e.g., known DTD, XML Schema), term extraction may again assign different weights to terms in different path types or tags. In principle, this is very much like classical retrieval. Analogously to HTML, XML may contain links (XLink, XPointer); they are handled similarly to web retrieval approaches.
  – Search with path types: parts of the document are selected by path expressions; within these parts, search predicates formulate the data items to return, similar to classical database queries. Often, these queries require knowledge about the XML structure, i.e., its DTD or XML Schema. Even searching for names is rather difficult without appropriate knowledge about the data structure: a name may be encoded as „Hans Meier", as „Meier, Hans", or split across separate elements.


2.3 Models for Text Retrieval

• Overview
  – Boolean Retrieval
  – Fuzzy Retrieval
  – Vector Space Retrieval
  – Probabilistic Retrieval (BIR Model)
  – Latent Semantic Indexing


2.3.1 Boolean Retrieval

• Historically:
  – Documents were stored on tapes or punched cards
  – Searching: only sequential access, i.e., while reading documents, the algorithm had to decide whether a document matches the query or not (rewinding is too costly, and no temporary store was available for sorting results)
• Today:
  – Boolean search is still very frequent but is not state-of-the-art. Search engines like Google use it for its simplicity but improve it with additional sorting/ranking criteria for the result sets
• Model:
  – Document D is represented by a binary vector d with di = 1 if term ti occurs in the document
  – Query q comes from the query space Q; let t be an arbitrary term, and q1 and q2 be queries from Q; Q is given by queries of type
    • t
    • q1 ∧ q2
    • q1 ∨ q2
    • ¬ q1
  – Additional operators exist, e.g., SAME, WITHIN, ADJ to constrain the distance between term occurrences




• Evaluation:
  – Retrieval status value rsv for a query from Q and a document d:
    • rsv(ti, d)       = di
    • rsv(q1 ∧ q2, d)  = min( rsv(q1,d), rsv(q2,d) )
    • rsv(q1 ∨ q2, d)  = max( rsv(q1,d), rsv(q2,d) )
    • rsv(¬ q1, d)     = 1 − rsv(q1, d)
  – rsv takes only values 0 or 1
  – Limitations of engines: q1 ∧ ¬ q2 is supported but q1 ∨ ¬ q2 is not
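The recursive rsv definition maps directly onto code. This sketch (assumed query encoding: nested tuples of "and"/"or"/"not" over term strings, documents as term sets) evaluates exactly the min/max/1−x rules above:

```python
# Sketch: recursive evaluation of the Boolean retrieval status value.
def rsv(query, doc: set[str]) -> int:
    if isinstance(query, str):                      # rsv(t, d) = d_i
        return 1 if query in doc else 0
    op, *args = query
    if op == "and":                                 # min over the operands
        return min(rsv(q, doc) for q in args)
    if op == "or":                                  # max over the operands
        return max(rsv(q, doc) for q in args)
    if op == "not":                                 # 1 - rsv
        return 1 - rsv(args[0], doc)
    raise ValueError(f"unknown operator: {op}")

d = {"white", "house", "garden"}
print(rsv(("and", "white", "house"), d))           # 1
print(rsv(("and", "white", ("not", "house")), d))  # 0
```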





• Advantages:
  – Given a document, one can define a query that returns exactly that document
  – Efficient implementation feasible with inverted files
• Disadvantages:
  – Result sets become unreasonably large (cf. Google's hit counts)
  – No ranking of documents, i.e., painful browsing through many documents
  – Complexity of the query language (especially with operators like SAME, ADJ, ...)
  – Retrieval quality much worse than with any of the approaches to follow


2.3.2 Fuzzy Retrieval

• Same model as Boolean retrieval but enriched with a ranking mechanism
• Model:
  – Document D is represented by a vector d with di ∈ [0,1]; di denotes the (normalized) term frequency of term ti in the document
  – Query space Q equal to Boolean retrieval



Evaluation: – Similar to Boolean retrieval but rsv-function evaluates to values between 0 and 1 (fuzzy logic) – Ordering of documents with descending rsv-values – Option (1): fuzzy algebraic • • • •

Multimedia Retrieval – Fall 2013

rsv(ti , d) rsv(q1 ∨ q2 , d) rsv(q1 ∧ q2 , d) rsv(¬ q1 , d)

= di = rsv(q1,d) + rsv(q2,d) - rsv(q1,d) · rsv(q2,d) = rsv(q1,d) · rsv(q2,d) = 1 - rsv(q1, d) Page 2-34
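Option (1) differs from the Boolean evaluator only in the operator arithmetic. A sketch under the same assumed query encoding as before (nested tuples; here documents map terms to normalized frequencies in [0, 1]):

```python
# Sketch: fuzzy-algebraic evaluation (option 1) of the rsv-function.
def rsv(query, doc: dict[str, float]) -> float:
    if isinstance(query, str):
        return doc.get(query, 0.0)          # rsv(t, d) = d_i
    op, *args = query
    a = rsv(args[0], doc)
    if op == "not":
        return 1.0 - a
    b = rsv(args[1], doc)
    if op == "or":
        return a + b - a * b                # probabilistic sum
    if op == "and":
        return a * b                        # product
    raise ValueError(f"unknown operator: {op}")

d = {"white": 0.5, "house": 0.8}
print(rsv(("or", "white", "house"), d))    # 0.5 + 0.8 - 0.4 = 0.9
print(rsv(("and", "white", "house"), d))   # 0.4
```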



• Evaluation (contd.):
  – Option (2): extended Boolean model, soft Boolean operator
    • rsv(ti, d)       = di
    • rsv(q1 ∧ q2, d)  = ca1 · max( rsv(q1,d), rsv(q2,d) ) + ca2 · min( rsv(q1,d), rsv(q2,d) )
    • rsv(q1 ∨ q2, d)  = co1 · max( rsv(q1,d), rsv(q2,d) ) + co2 · min( rsv(q1,d), rsv(q2,d) )
    • rsv(¬ q1, d)     = 1 − rsv(q1, d)
    • Usually, it holds:  co1 > co2  and  ca1 < ca2
    • Soft Boolean operator: ca1 = co1 and ca2 = co2 = 1 − co1, i.e., only a single operator ("andor")
• Advantage:
  – Ranking of retrieved documents
• Disadvantages:
  – Hardly better than Boolean retrieval and worse than all other approaches to follow
  – No weighting of terms feasible, i.e., frequent terms dominate the result
  – Complexity of the query language


2.3.3 Vector Space Retrieval

• The retrieval status value ranks documents Dj given the query Q: rsv(Q, Dj)
• Documents and queries are represented by M-dimensional vectors in the space ℝ^M
• Model:
  – Document D is represented by a vector d with di ∈ [0, ∞)
    • tf(Ti, Dj) = term frequency ("feature frequency") = number of occurrences of feature Ti in document Dj
    • df(Ti)     = document frequency = number of documents that contain feature Ti
    • idf(Ti)    = inverse document frequency = discrimination value of term Ti
  – Queries are described in the same way; search is implemented as similarity retrieval


Inverse Document Frequency

• The discrimination method in Section 2.2 is one example
• Approach: idf shall express how well a term distinguishes documents in the collection
  – Example: "the" is not able to segregate documents in the collection (used everywhere)
  – Example: "computer" sharply segregates books in the library of Uni Basel
• Typical idf-functions from the literature (N = number of documents in the collection):

    idf(Ti) := log( (N + 1) / (df(Ti) + 1) )

  often you'll find:

    idf(Ti) := log( N / df(Ti) )

  (figure: idf as a function of df for N = 100; idf falls monotonically from about 4 at df = 1 to 0 at df = N)
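Both idf variants are one-liners. The sketch below uses base-10 logarithms, matching the worked example a few slides further on (the base is otherwise a free choice; the plot on this slide is consistent with the natural logarithm):

```python
# Sketch: the two idf variants shown above, with log base 10 assumed.
import math

def idf_smooth(df: int, N: int) -> float:
    """idf with +1 smoothing in numerator and denominator."""
    return math.log10((N + 1) / (df + 1))

def idf_plain(df: int, N: int) -> float:
    """Plain idf = log(N / df)."""
    return math.log10(N / df)

N = 100
print(round(idf_plain(1, N), 2))   # rare term: high discrimination value
print(round(idf_plain(N, N), 2))   # 0.0: a term occurring in every document
```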

Document-Term-Matrix

• Documents Dj and query Q are represented by vectors; dij and qi denote the i-th component representing the i-th term (M = number of distinct terms):

    dj := (d1j, ..., dMj)^T        q := (q1, ..., qM)^T

    dij := tf(Ti, Dj) · idf(Ti)
    qi  := tf(Ti, Q)  · idf(Ti)

• dij and qi become large if term Ti occurs frequently in document Dj but is used by only few documents of the collection
• Combining the document vectors into a matrix yields the document-term-matrix A with dj denoting the j-th column of A:

    A ∈ ℝ^(M×N),  aij = dij
    (M: number of terms, N: number of documents)



• Evaluation:
  – The retrieval status value rsv(q, dj) ranks documents for a given query
  – A query q is answered with the k documents having the highest rsv-values
• Typical functions (let d be the document vector and q be the query vector):
  – Inner vector product (simplest retrieval status function):

      rsv(q, d) = q^T d        for all documents at once:  rsv = A^T q

  – Cosine measure:

      rsv(q, d) = q^T d / ( ||q|| · ||d|| )

    with L a diagonal matrix, ljj = 1/||dj||, and q' = q/||q||:  rsv = L A^T q'


Example: Vector Space Retrieval (Source: GF98)

• Given: 3 documents (D1, D2, D3) and the query Q:
  D1: "Shipment of gold damaged in a fire"
  D2: "Delivery of silver arrived in a silver truck"
  D3: "Shipment of gold arrived in a truck"
  Q:  "gold silver truck"

• Document frequency of the i-th term (df(Ti)) and inverse document frequency [ idf(Ti) = log10(N / df(Ti)) ]; M = 11, N = 3:

  Id  Term Ti   df(Ti)  idf(Ti)
  1   a         3       0
  2   arrived   2       0.176
  3   damaged   1       0.477
  4   delivery  1       0.477
  5   fire      1       0.477
  6   gold      2       0.176
  7   in        3       0
  8   of        3       0
  9   silver    1       0.477
  10  shipment  2       0.176
  11  truck     2       0.176

Example: Document-Term-Matrix AT (empty cells contain 0)

  doc   T2     T3     T4     T5     T6     T9     T10    T11
  D1           .477          .477   .176          .176
  D2    .176          .477                 .954          .176
  D3    .176                        .176          .176   .176
  Q                                 .176   .477          .176

  (columns T1, T7, T8 contain only zeros since idf = 0 for a, in, of)

• Using the inner vector product to rank documents:

    rsv = A^T q = (0.031, 0.486, 0.062)^T

• This results in the ranking:  D2 (0.486), D3 (0.062), D1 (0.031)
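The GF98 example above can be reproduced end to end; the sketch below builds the tf-idf vectors with idf = log10(N / df) and ranks by the inner vector product:

```python
# Reproducing the example: tf-idf weights and inner-product ranking.
import math
from collections import Counter

docs = {
    "D1": "shipment of gold damaged in a fire",
    "D2": "delivery of silver arrived in a silver truck",
    "D3": "shipment of gold arrived in a truck",
}
query = "gold silver truck"
N = len(docs)

tfs = {name: Counter(text.split()) for name, text in docs.items()}
vocab = sorted({t for c in tfs.values() for t in c})
df = {t: sum(1 for c in tfs.values() if t in c) for t in vocab}
idf = {t: math.log10(N / df[t]) for t in vocab}

def vector(counts: Counter) -> list[float]:
    """Component i = tf(Ti) * idf(Ti)."""
    return [counts[t] * idf[t] for t in vocab]

q = vector(Counter(query.split()))
rsv = {name: sum(x * y for x, y in zip(vector(c), q)) for name, c in tfs.items()}
for name in sorted(rsv, key=rsv.get, reverse=True):
    print(name, round(rsv[name], 3))   # D2 0.486, D3 0.062, D1 0.031
```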

Remarks

• There are many more methods to determine the vector representations and to compute retrieval status values
• Main assumption of vector space retrieval:
  – Terms occur independently of each other in documents
  – This is generally not true: if one writes about Mercedes, the term "car" is likely to co-occur in the document
• Advantages:
  – Simple model with efficient evaluation algorithms
  – Partial-match queries possible, i.e., it returns documents that contain only some of the query terms (similar to the or-operator of Boolean retrieval)
  – Very good retrieval quality, but not state-of-the-art
  – Relevance feedback may further improve vector space retrieval
• Disadvantages:
  – Many heuristics and simplifications; no proof of "correctness" of the result set
  – HTML/Web: occurrence of terms is not the most important criterion to rank documents (spamming)


2.3.4 Probabilistic Retrieval

• Probabilistic retrieval is based on probability theory
• An optimal algorithm would exactly distinguish between relevant and non-relevant documents; this requires an exact description of the set of relevant documents
• Probabilistic retrieval requires user interaction; it aims at improving the query such that the distinction between relevant and non-relevant documents becomes possible
• Model:
  – Basic idea: given a query Q and a document, estimate the probability that the user considers the document to be relevant
  – Assumptions:
    • This probability depends only on the query and the document collection
    • There exists a subset of documents that are relevant for the user. Let R be this subset. All other documents are non-relevant for the user.




•  Model (contd.):
   – Let P(R | Dj) be the probability that document Dj is relevant for query Q, and let P(NR | Dj) be the probability that document Dj is not relevant for query Q. We define:

        sim(Dj, Q) = P(R | Dj) / P(NR | Dj)

•  How to compute the similarity: Binary Independence Retrieval (BIR)
   – Conditional probabilities, Bayes' theorem:

        P(A | B) · P(B) = P(A ∩ B) = P(B | A) · P(A)

   – Applying Bayes' theorem to the definition of the similarity:

        sim(Dj, Q) = P(R | Dj) / P(NR | Dj) = [ P(Dj | R) · P(R) ] / [ P(Dj | NR) · P(NR) ]

   – With:
     • P(Dj | R): the probability that a randomly selected document in R is Dj
     • P(R): the probability that a randomly selected document is relevant (in R)
     • P(Dj | NR), P(NR): analogously to the probabilities above, but applied to the set of non-relevant documents NR



How to compute the similarity (contd.): – The Binary Independence Retrieval (BIR) simplifies the computation by considering only binary representations of the documents (and queries) instead of the documents themselves. More precisely: a document D is characterized with a binary vector x with xi=1 if term Ti occurs in document D. Analogously, xi=0 denotes that term Ti does not appear in the document. Similarly, one obtains a binary representation q for a query Q. – Instead of probabilities P(Dj | R), BIR resorts to probabilities P(x | R). Obviously, it shall hold that P(Dj | R) = P(x | R), if x characterizes document Dj. – Further assumption: the probability P(x | R) is given by the product of probabilities over all components of the vector, i.e.: P(x | R)

= Π P(xi | R)

P(x | NR) = Π P(xi | NR)

= Πxi=1 P(xi=1 | R) · Πxi=0 P(xi=0 | R) = Πxi=1 P(xi=1 | NR) · Πxi=0 P(xi=0 | NR)

with P(xi=1| R) being the probability that term Ti occurs in a randomly chosen document from R, and P(xi=0 | R) being the probability that term Ti does not appear in a randomly chosen document from R. The similarity is thus given as: sim(x ,Q) =

Multimedia Retrieval – Fall 2013

P(Dj | R) · P(R) ———————— P(Dj | NR) · P(NR)

P(R) Π P(xi | R) = ——— · —————— P(NR) Π P(xi | NR) Page 2-45



•  How to compute the similarity (contd.):
   – Simplification of notation: let

        ri = P(xi=1 | R),        ni = P(xi=1 | NR)
        1 − ri = P(xi=0 | R),    1 − ni = P(xi=0 | NR)

   – To further simplify the computation, we assume that ri = ni if term Ti does not occur in Q (qi = 0). This means that only query terms matter for the distinction between relevant and non-relevant documents. It follows:

        P(xi | R) / P(xi | NR) = 1   if qi = 0

   – Inserting into the similarity function leads to:

        sim(x, Q) = [ P(R) / P(NR) ] · Π_{qi=1, xi=1} ( ri / ni ) · Π_{qi=1, xi=0} ( (1 − ri) / (1 − ni) )

                  = [ P(R) / P(NR) ] · Π_{qi=1, xi=1} ( ri (1 − ni) / ( ni (1 − ri) ) ) · Π_{qi=1} ( (1 − ri) / (1 − ni) )

   – Basically, we are not interested in absolute values of the similarity but only in the ordering induced by the similarity function. Thus, we can reduce the formula by eliminating all factors that are constant for a given query (the first and the last factor above do not depend on the document). We obtain:

        sim(x, Q) ∼ Π_{qi=1, xi=1} ( ri (1 − ni) / ( ni (1 − ri) ) )



•  How to compute the similarity (contd.):
   – Usually, one computes the logarithm of the similarity and uses an abbreviation ci for the ratios in the product:

        ci = log [ ri (1 − ni) / ( ni (1 − ri) ) ]

   – Finally, we obtain a retrieval status value for document Dj:

        rsv(Dj, Q) = Σ_{Ti ∈ Dj, Ti ∈ Q} ci

     i.e., the rsv is the sum of the ci-values over all terms that appear both in the document Dj and in the query Q. Hence, given the ci-values for the query terms (a small number of terms), search engines can efficiently compute the rsv with inverted files.
   – Final question: how to compute the ci-values, more precisely: ri and ni for the given query terms?
     • Initialization (for the first round):

          ri = 0.5,   ni = df(Ti) / N   ∀ Ti ∈ Q

       i.e., query terms are assumed to appear with constant probability (50%) in relevant documents, while the probability of occurrence of a query term in non-relevant documents follows the distribution of that term over the whole collection.



•  How to compute the similarity (contd.):
   – Final question (contd.):
     • Feedback step: the user marks l out of the k presented documents as relevant. Let ki be the number of documents in the result set that contain term Ti, and let li be the number of relevant documents (selected by the user) that contain term Ti. The new estimates for ri and ni are:

          ri = li / l          ni = ( ki − li ) / ( k − l )

       To avoid numerical problems with small values, one often uses:

          ri = ( li + 0.5 ) / ( l + 1 )          ni = ( ki − li + 0.5 ) / ( k − l + 1 )
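The BIR weighting described above can be sketched in Java as follows. This is a sketch and not part of the course material; the class and method names are made up, but the formulas follow the slides (initialization, smoothed feedback estimates, and the rsv as a sum of ci-values).

```java
import java.util.Map;

// Sketch of the BIR term weights: ci = log( ri(1-ni) / (ni(1-ri)) ).
public class Bir {

    // Initial estimate: ri = 0.5 and ni = df/N. Inserting into the
    // ci-formula yields ci = log((N - df) / df).
    public static double initialCi(int df, int N) {
        double ri = 0.5;
        double ni = (double) df / N;
        return Math.log((ri * (1 - ni)) / (ni * (1 - ri)));
    }

    // Feedback estimate with the smoothed (+0.5 / +1) formulas.
    public static double feedbackCi(int li, int l, int ki, int k) {
        double ri = (li + 0.5) / (l + 1.0);
        double ni = (ki - li + 0.5) / (k - l + 1.0);
        return Math.log((ri * (1 - ni)) / (ni * (1 - ri)));
    }

    // rsv(D, Q): sum of ci over the query terms that occur in the document.
    public static double rsv(Map<String, Double> ci, Iterable<String> docTerms) {
        double sum = 0;
        for (String t : docTerms) {
            Double c = ci.get(t); // only query terms carry a ci-value
            if (c != null) sum += c;
        }
        return sum;
    }
}
```

With an inverted file, a search engine would look up only the (few) query terms and accumulate the ci-values per document, exactly as `rsv` does here per document.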



•  Advantages:
   – Ordering of the documents based on decreasing probability of being relevant
   – Efficient evaluation with inverted files is possible
•  Disadvantages:
   – The first step uses rough (heuristic) estimates for the ci-values
   – Frequencies and positions of terms in the document are not considered
   – The assumption of term independence does not hold in practice (house, building)
•  Note: there are further probabilistic models that address some of the disadvantages mentioned above (see literature)


2.3.4 Latent Semantic Indexing (LSI)
•  Preliminary considerations:
   – Vector space retrieval maps documents to points in an M-dimensional space (the terms denote the individual dimensions). This view has two shortcomings:
     • there are correlations between terms (synonyms!), regardless of whether the terms denote words, phrases, or n-grams
     • the M-dimensional space may be too high-dimensional, and a transformation to a sub-space with fewer dimensions may be more appropriate to answer queries



Basic idea of LSI:
   – Transform the document vectors into a low-dimensional space that approximates the original vectors as well as possible. The new dimensions are no longer bound to individual terms but should denote concepts encompassing several terms. This transformation and the resulting clustering of terms is called "Latent Semantic Indexing".


Preliminary mathematical background
•  For each eigenvalue λ and corresponding eigenvector x of a square (n,n)-matrix A, it holds that:

        (1)  A x = λ x



Eigenvalues are determined by solving the equation det(A − λI) = 0, i.e., by finding the zeroes of a polynomial of degree n. Note that the zeroes can be real or complex and may occur with multiplicity greater than one. For a symmetric matrix, eigenvectors corresponding to distinct eigenvalues are orthogonal to each other.



A symmetric matrix A has real eigenvalues (no complex ones). Let r be the rank of A. We can write the matrix A as the following product:

        (2)  A = U Λ U^T

Λ denotes an (r,r) diagonal matrix with the (non-zero) eigenvalues on the diagonal; U is an (n,r)-matrix with orthonormal columns, i.e., U^T U = I.




The singular value decomposition (SVD) generalizes the eigenvalue decomposition to non-square matrices. Let A be an (m,n)-matrix of rank r. Then there exist an (r,r) diagonal matrix S, an (m,r)-matrix U, and an (n,r)-matrix V, the latter two with orthonormal columns, such that:

        (3)  A = U S V^T

Obviously, it follows that:

        (4)  A^T A = (U S V^T)^T (U S V^T) = V S U^T U S V^T = V S² V^T
        (5)  A A^T = (U S V^T) (U S V^T)^T = U S V^T V S U^T = U S² U^T

That is: U holds the eigenvectors of A A^T in its columns, and V holds the eigenvectors of A^T A in its columns.

•  We can write (3) as a sum of dyadic vector products:

        (6)  A = s1 (u1 v1^T) + s2 (u2 v2^T) + … + sr (ur vr^T)

•  We obtain an approximation of A if one or several of the summands are omitted in (6). The best approximation of rank k is obtained by keeping the k summands with the largest singular values.

[…]

•  (Signature files, contd.) Let the query signature be SQ = 00010011.
   – All records with SQ ∧ (¬SD) = 0 are called candidates. In the example: { t4, t6 }
     • t4 is a hit
     • t6 is a "false hit" (also called a "false drop")


2.4.3.2 Computation of Signatures
•  In the following, we consider several approaches to derive signatures for texts. Usually, this is implemented with well-optimized hash functions.
•  Notation:

        l    Length of the signature in bits
        bi   i-th bit of the signature
        SQ   Signature of the query
        SD   Signature of document D
        g    Weight of the signature for a term (number of bits set)
        N    Number of documents
        W    Size of the dictionary (number of distinct terms)
        SP   Signature potential
        F    Error rate




•  Binary signatures
   – Features are mapped to a unique signature of length l (2^l possibilities)
   – This approach allows for a very efficient comparison between two documents (feature = term); only if the signatures match does the more expensive text comparison have to be executed.
•  Superimposed Coding
   – A feature sets g bits of the signature (g is the weight of the signature); if a document consists of several features, all their signatures are superimposed (or-ed):

        text                   010010001000   S1
        search                 010000100100   S2
        methods                100100000001   S3
        ─────────────────────────────────────────────
        text search methods    110110101101   S1 ∨ S2 ∨ S3

   – There are different signatures for the individual features
   – However: superimposing makes it impossible to reconstruct the original set of features of a document (→ false hits)
   – The introductory examples had just one bit per feature (i.e., g = 1), and the features were n-grams instead of terms


   – Query transformation also uses superimposing to determine the signature. How do we compare two signatures with superimposed coding?
     • Each bit set in the query signature must also be set in the signature of the document (the other bits do not matter)
     • Downside: introduction of false hits, i.e., we may also find documents that do not actually contain the query features (or only some of them)
     • An efficient implementation is given by: SQ ∧ (¬SD) = 0
   – Example:

        text search methods                       110110101101
        in search of knowledge-based IR           010110101110
        an optical system for full text search    010110101100
        the lexicon and IR                        101001001001

        Query: text search                        010010101100

     Result: the first three documents fulfill the signature condition, but the second document does not contain all query features (false hit)
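The candidate test SQ ∧ (¬SD) = 0 can be sketched with the 12-bit signatures of the example above. This is a sketch and not part of the course material.

```java
// Sketch of the superimposed-coding candidate test SQ ∧ (¬SD) = 0.
public class SignatureMatch {

    // A document is a candidate iff every bit set in the query signature
    // is also set in the document signature.
    public static boolean isCandidate(long sq, long sd) {
        return (sq & ~sd) == 0;
    }

    public static void main(String[] args) {
        long sq = 0b010010101100L;          // query "text search"
        long[] docs = {
            0b110110101101L,                // text search methods
            0b010110101110L,                // in search of knowledge-based IR
            0b010110101100L,                // an optical system for full text search
            0b101001001001L,                // the lexicon and IR
        };
        for (long sd : docs)
            System.out.println(isCandidate(sq, sd)); // true, true, true, false
    }
}
```

As on the slide, the first three documents qualify as candidates; the second one is a false hit that only the subsequent text comparison can eliminate.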


   – Mapping features to signatures
     • The transformation to signatures depends on the chosen collection and the used features (words, n-grams, …)
     • Features are words or phrases
       – Approach 1: Associate each word/phrase with a signature of weight g. Evenly distribute the words/phrases across all possible signatures if the number of words/phrases is larger than the number of distinct signatures
       – Approach 2: Construct the signature by extracting n-grams from the word and superimposing their signatures (see below). This may lead to different signature weights. If this is not desired, only use the g most frequent n-grams to derive the superimposed signature (for very short words, additional bits may have to be set)
     • Features are n-grams
       – Define the mapping of text to n-grams (including the handling of non-alphabetic characters and of the lower/upper case distinction)
       – Define a hash function that yields a bit position for each n-gram; the signature has all bits set to zero except for the position obtained by the hash function, i.e., the signature weight is g = 1.
       – The signature of the text is the superimposed coding of all the signatures of its n-grams.


   – Mapping features to signatures
     • Features are n-grams
       – Example 1 for a hash function h (with n = 2), based on a character classification T (characters are distributed over the classes with roughly equal frequency):

            T(c) = 0:  , - . ; : / …
            T(c) = 1:  E
            T(c) = 2:  Ä J N P X Y
            T(c) = 3:  R U
            T(c) = 4:  C I K
            T(c) = 5:  H Ö S
            T(c) = 6:  M O T W
            T(c) = 7:  D G L Q
            T(c) = 8:  A B Ü V Z F

            h(c1, c2) = [ 7 · T(c1) + T(c2) ] MOD l

       – Example 2 (Pos(c) denotes the position of c in the alphabet):

            h(c1, c2) = [ 26 · Pos(c1) + Pos(c2) ] MOD l

   – The hash function should map to all bit positions with equal probability (even distribution of the bits along the signature).
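The bigram hash of Example 1 can be sketched as follows. This is a sketch and not part of the course material; the signature length l = 64 is an assumption (the slide leaves l open), while the character classes are those of the table above.

```java
// Sketch of the bigram hash h(c1,c2) = (7·T(c1) + T(c2)) mod l from
// Example 1, with an assumed signature length l = 64.
public class BigramHash {

    static final int L = 64; // signature length in bits (assumption)

    // Character classification T from the slide; unknown characters
    // (punctuation etc.) fall into class 0.
    static int T(char c) {
        String[] classes = { "", "E", "ÄJNPXY", "RU", "CIK", "HÖS",
                             "MOTW", "DGLQ", "ABÜVZF" };
        c = Character.toUpperCase(c);
        for (int i = 1; i < classes.length; i++)
            if (classes[i].indexOf(c) >= 0) return i;
        return 0;
    }

    static int h(char c1, char c2) {
        return (7 * T(c1) + T(c2)) % L;
    }

    // Signature of a text: superimpose the 1-bit signatures of all bigrams.
    static long signature(String text) {
        long s = 0;
        for (int i = 0; i + 1 < text.length(); i++)
            s |= 1L << h(text.charAt(i), text.charAt(i + 1));
        return s;
    }
}
```

For instance, the bigram "TE" maps to bit 7·T('T') + T('E') = 7·6 + 1 = 43.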


   – Error rate of signatures (collisions)
     • Assumption: a feature sets g bits of a signature of length l. How many distinct codes (signature strings) exist?
       – SP = SP(l, g) is the signature potential, i.e., the number of distinct codes with g bits set in a signature of length l
       – It follows that:

            SP(l, g) = ( l choose g ) = l! / ( g! · (l − g)! )

         which is maximal for g = l/2.
     • Determining the error rate F
       – The M distinct features (M = W, the size of the dictionary) are mapped to SP distinct signatures. If M > SP, there are collisions: on average, M/SP features share the same signature.
       – When searching for a feature signature, the comparison function also returns the M/SP − 1 other features that share the same signature.
       – Hence:

            F = ( M/SP − 1 ) · N / M        (N: number of documents)


   – To limit the error rate (i.e., for a given F), one may determine the required signature potential and thus the length of the signatures:

        SP = M · N / ( F · M + N )

   – Examples for SP:

        l     g     SP(l, g)
        8     4             70
        16    8         12 870
        24    12     2 704 156
        32    16   601 080 390
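The signature potential from the table can be computed as a binomial coefficient. This is a sketch and not part of the course material; the multiplicative form below avoids the factorials (and their overflow) in SP(l, g) = l! / (g!·(l−g)!).

```java
// Sketch: computing the signature potential SP(l,g) = binomial(l,g).
public class SignaturePotential {

    public static long sp(int l, int g) {
        long result = 1;
        for (int i = 1; i <= g; i++) {
            // multiply first, then divide: the intermediate value is always
            // an integer, because it equals binomial(l - g + i, i)
            result = result * (l - g + i) / i;
        }
        return result;
    }
}
```

For the values of the table: sp(8,4) = 70, sp(16,8) = 12 870, sp(24,12) = 2 704 156, sp(32,16) = 601 080 390.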

•  Disjoint Coding
   – The signature of a text is given by concatenating the signatures of its features in the order of their appearance. To find a feature, one searches along the signature chain.
•  Block Superimposed Coding
   – Only the signatures within the same text block (e.g., a sentence) are superimposed. The signatures of the text blocks are concatenated as in disjoint coding.




•  Overview of the different transformation options for the text "This is a text. A text has many words. Words are made from letters." (split into blocks 1–4), with the feature signatures:

        h(text)    = 000101
        h(many)    = 110000
        h(words)   = 100100
        h(made)    = 001100
        h(letters) = 100001

   – Disjoint Coding (one signature per feature occurrence, concatenated):
        000101 000101 110000 100100 100100 001100 100001
   – Block Superimposed Coding (one superimposed signature per block, concatenated):
        000101 110101 100100 101101
   – Superimposed Coding (all feature signatures superimposed):
        111101


2.4.3.3 Similarity Search
•  Signatures enable ranking based on similarity values. In contrast to the retrieval models discussed so far, signatures may provide robustness against (even severe) spelling mistakes; e.g., the difference between the signatures for "Meier" and "Maier" is small.
•  To provide a similarity ranking, we must define a distance function between the query signature SQ and the signature SD of the document.
   – Hamming Distance
     • Number of differing bits in SQ and SD (bit-wise comparison):

          hamming(SQ, SD) = w( SQ XOR SD )

     • To accelerate the counting of "ones" (the function w), we resort to pre-computed look-up tables. For instance, such a table may contain for all 8-bit strings the number of "ones" (e.g., 01001001 → 3, 11101001 → 5). The signature SQ XOR SD is split into 8-bit substrings; for each substring, the number of "ones" is looked up, and the sum of these numbers is the Hamming distance between SQ and SD.


   – Cover Distance
     • The cover distance counts the bits that are set in the query signature but not in the signature of the document
     • An efficient implementation based on the look-up table described before is:

          cover(SQ, SD) = w( SQ ) − w( SQ AND SD )

•  Discussion:
   – The Hamming distance compares entire signatures, i.e., entire documents. The document should look like the query; this is useful if the full document is known (at least approximately)
   – The cover distance implements a partial comparison, i.e., the document should contain the query; this is useful if only parts of the document are known
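Both distances can be sketched with the 8-bit look-up table described above. This is a sketch and not part of the course material; the class and method names are made up.

```java
// Sketch of the Hamming and cover distances with an 8-bit look-up table
// for the bit-count function w.
public class SignatureDistances {

    static final int[] W = new int[256]; // number of "ones" for all 8-bit strings
    static {
        for (int i = 1; i < 256; i++)
            W[i] = (i & 1) + W[i >> 1]; // W[i>>1] is already computed
    }

    // w(s): number of "ones", summed over the 8-bit substrings of s
    static int w(long s) {
        int sum = 0;
        for (; s != 0; s >>>= 8)
            sum += W[(int) (s & 0xFF)];
        return sum;
    }

    static int hamming(long sq, long sd) { return w(sq ^ sd); }

    static int cover(long sq, long sd)   { return w(sq) - w(sq & sd); }
}
```

Note that cover(SQ, SD) = 0 is exactly the candidate condition SQ ∧ (¬SD) = 0 of the "contains" search; larger values rank documents by how many query bits they miss.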


2.4.3.4 Indexing Structures
•  Sequential Organization
   – The signatures of the documents are stored in sequential order (in a flat file). Evaluation occurs in two phases: the signatures are scanned to determine the candidates, and the candidate documents are then read to separate hits from false hits.
   – Signature search accelerates text search due to the much quicker comparison function and the much smaller amount of data read. Exact searches, however, are better served by inverted lists (or B-trees in databases).
   – Example: let "10100000" be the query signature; we search with "contains" semantics (cf. superimposed coding: SQ ∧ (¬SD) = 0):

             b1 b2 b3 b4 b5 b6 b7 b8
        S1    0  0  1  0  1  0  1  1
        S2    1  0  1  1  1  0  0  0   ← candidate
        S3    0  1  1  0  0  1  1  0
        S4    1  0  0  1  0  1  1  1
        S5    1  1  1  0  0  1  0  0   ← candidate
        S6    0  1  1  0  0  1  0  1
        S7    1  0  0  0  1  0  1  0
        S8    0  0  0  1  1  1  0  1

     The candidates (S2, S5) are then read and compared with the query to separate hits from false hits.




•  Bit-sliced Organization – Vertical Partitioning
   – For each of the l bits, an individual file stores that bit slice over all documents of the collection
   – If query evaluation focuses on the "ones" in the query signature, we only have to look at the slices of the bits that are set in the query → reduction of the amount of data to read. This does not, however, accelerate searches with the Hamming distance.
   – Example: let "10100000" be the query signature; we search with "contains" semantics (cf. superimposed coding: SQ ∧ (¬SD) = 0). Since only b1 and b3 are set in SQ, only these two slices are read:

                 S1 S2 S3 S4 S5 S6 S7 S8
        b1        0  1  0  1  1  0  1  0
        b3        1  1  1  0  1  1  0  0
        ─────────────────────────────────
        b1 ∧ b3   0  1  0  0  1  0  0  0   → candidates: S2, S5

     The candidates are then read and compared with the query to separate hits from false hits.




•  Horizontal Partitioning
   – Similar signatures are grouped and stored in separate files ("buckets").
   – Query evaluation proceeds in three steps:
     1. Identify the groups (buckets) which may contain candidates
     2. Read these buckets and identify the candidates
     3. Read the candidate documents and separate "hits" from "false hits"
   – To determine and describe the groups, a further hash function is required; it computes a key for the signatures of a group. Signatures with the same key are stored in the same bucket (belong to the same group).
     • A simple hash function extracts the first k bits of a signature (fixed prefix), with k typically being small
     • Another approach, the "extended prefix", extracts a key of variable length from the beginning of the signature; the key length is chosen such that the key contains a certain number of "ones".


   – Example: let the key be the first two bits of the signature; let "10100000" be the query signature; we search with "contains" semantics (cf. superimposed coding: SQ ∧ (¬SD) = 0). The query key is 10; hence only the buckets with keys 10 and 11 can contain candidates (the groups 00 and 01 cannot):

        Bucket 00:              Bucket 01:
          S1  0 0 1 0 1 0 1 1     S3  0 1 1 0 0 1 1 0
          S8  0 0 0 1 1 1 0 1     S6  0 1 1 0 0 1 0 1

        Bucket 10:              Bucket 11:
          S2  1 0 1 1 1 0 0 0     S5  1 1 1 0 0 1 0 0
          S4  1 0 0 1 0 1 1 1
          S7  1 0 0 0 1 0 1 0

     Within the qualifying buckets, the signature test yields the candidates S2 and S5; the final text comparison separates the hits from the false hits.
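The bucket-qualification step of this example can be sketched as follows. This is a sketch and not part of the course material; it assumes 8-bit signatures and the fixed 2-bit prefix key of the example.

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of horizontal partitioning with a fixed 2-bit prefix as bucket key.
public class HorizontalPartitioning {

    static int key(int signature) {        // first two bits of an 8-bit signature
        return (signature >> 6) & 0b11;
    }

    // A bucket with key k can contain candidates only if every "one" in the
    // query key also appears in k (same "contains" test as for signatures).
    static boolean bucketQualifies(int queryKey, int bucketKey) {
        return (queryKey & ~bucketKey) == 0;
    }

    static List<Integer> qualifyingBuckets(int sq) {
        int qk = key(sq);
        List<Integer> buckets = new ArrayList<>();
        for (int k = 0; k <= 0b11; k++)
            if (bucketQualifies(qk, k)) buckets.add(k);
        return buckets;
    }

    public static void main(String[] args) {
        // query signature 10100000 → key 10: only buckets 10 and 11 qualify
        System.out.println(qualifyingBuckets(0b10100000)); // [2, 3]
    }
}
```

Only the qualifying buckets are read in step 2; the signature test within them then yields the candidates.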




•  Signature Tree (S-Tree)
   – Similar to horizontal partitioning, we group signatures, with the groups typically being small (fitting into a database page). However, there is no key for a group.
   – Each group is represented by a block signature over all its members (superimposed coding).
   – The block signatures are again treated as signatures and are recursively grouped. This leads to a signature tree.
   – To keep the block signatures selective, the Hamming distance between the signatures of the members of a group should be minimal.
   – The dynamic organization of the tree follows the typical patterns of balanced trees such as the B-tree (splitting full nodes, merging/reinserting in case of underflows).


   – Example: let "10100000" be the query signature; we search with "contains" semantics (cf. superimposed coding: SQ ∧ (¬SD) = 0):

        root (block signatures):   10011111   10111011   11100111

     Only the block signatures 10111011 and 11100111 match the query signature (both b1 and b3 set), so only the corresponding leaves have to be read:

        leaf 1 (10011111):       leaf 2 (10111011):       leaf 3 (11100111):
          S4  1 0 0 1 0 1 1 1      S1  0 0 1 0 1 0 1 1      S3  0 1 1 0 0 1 1 0
          S7  1 0 0 0 1 0 1 0      S2  1 0 1 1 1 0 0 0      S5  1 1 1 0 0 1 0 0
          S8  0 0 0 1 1 1 0 1                                S6  0 1 1 0 0 1 0 1

     The signature test in the leaves yields the candidates S2 and S5; the final comparison shows S2 to be a hit and S5 a false hit.

2.5 Lucene – An Open Source Search Engine
•  Apache hosts several projects that provide easy-to-use yet powerful text and web retrieval. All of them are based on the core engine called Lucene. In addition, third-party libraries enrich Lucene with additional content extractors and analyzers.
   – Lucene: core retrieval library for both the analysis of documents and searching
   – Apache Tika: parsers and extractors for various file formats
   – Nutch: open source web search engine with scalable, distributed crawlers and a Tomcat web application to search through the content
   – Solr: open source enterprise search engine supporting a rich set of file formats



•  In this chapter, we look at:
   – how Lucene analyzes documents
   – how Lucene ranks documents
   – how to use Lucene in your own applications
•  Note: this is not meant to be a complete overview of Lucene. Refer to the online documentation or to books such as "Lucene in Action" for more details.


2.5.1 History of Lucene
•  Started as a SourceForge project and joined the Apache Jakarta family in 2001. The original author was Doug Cutting. Since 2005, Lucene has been a top-level Apache project with many sub-projects. Some of them, namely Nutch and Tika, have become independent Apache projects.
•  Main versions (selection):
   – 1.01b (July 2001): last SourceForge release
   – 1.2 (June 2002): first Apache Jakarta release
   – 1.9 (February 2006): binary stored fields, date tools, range filters, regexp queries
   – 2.0 (May 2006): clean-up of code, removed deprecated methods
   – 2.9 (September 2009): near-realtime search, numeric ranges, clean-up; 2.9.4 is the recommended release for production
   – 3.0 (November 2009): clean-up and migration to Java 1.5 (generics, varargs); 3.6 is the latest build, released in July 2012
   – 4.0beta (August 2012): speedup of indexing and retrieval
•  Lucene implementations
   – Java (original), C++ (CLucene), .NET (Lucene.NET), C (Lucene4c), Objective-C (LuceneKit), Python (Lupy), PHP 5 (Zend), Perl (Plucene), Delphi (MUTIS), JRuby (Ferret), Common Lisp (Montezuma)


2.5.2 Core Data Model of Lucene
•  Lucene is a high-performance, full-featured text search engine library. It is suitable for a wide range of applications that require text retrieval functions. Most importantly, it works across different platforms, firstly due to its Java implementation, and secondly due to the many ports to other programming languages.
•  If you are looking for an open source search engine, Lucene-based projects such as Nutch (web search engine) or Solr (enterprise search engine) provide ready-to-deploy search applications. In all other cases, applications have to implement their search features through the Lucene APIs.
•  The core concepts of Lucene revolve around
   – Document and Field to encompass the content of documents
   – Analyzer to parse the content and extract features
   – IndexWriter, which maintains the inverted index including concurrency control
   – Directory, which holds the inverted index structures
   – Query and QueryParser to represent queries and parse input strings, respectively
   – Term and TermQuery to denote unit search expressions
   – IndexSearcher, which exposes search methods over the inverted indexes
   – TopDocs, which contains the results of a search sorted by score




•  Lucene's API is split into offline analysis functions and online search functions. The interaction with an application is as follows:

   [Figure: offline, the application maintains the document library from sources such as databases, files, the Internet/Intranet, or a DMS/CMS (possibly with special analyzers), while Lucene analyzes and indexes the documents into the inverted lists. Online, the application performs query construction and result presentation, while Lucene analyzes the query and searches the inverted lists.]

Indexing Documents with Lucene (Version 2.9.4)

1. Select the Directory to store the index in:

     directory = FSDirectory.open(new File("./index"));

2. Create an Analyzer for the documents:

     analyzer = new StandardAnalyzer(Version.LUCENE_29);

3. Create a Document and add Fields:

     doc = new Document();
     doc.add(new Field("title", title, Field.Store.YES, Field.Index.ANALYZED));
     doc.add(new Field("content", content, Field.Store.NO, Field.Index.ANALYZED));
     doc.add(new Field("id", id, Field.Store.YES, Field.Index.NO));

4. Get an IndexWriter and add the Document:

     writer = new IndexWriter(directory, analyzer, true,
                              IndexWriter.MaxFieldLength.UNLIMITED);
     writer.addDocument(doc);

5. Close the IndexWriter (optionally optimize first):

     writer.optimize();
     writer.close();

   [Figure: the application maintains the document library; Documents with Fields (title, content, id) are passed through the Analyzer to the IndexWriter, which writes to the Directory.]

Indexing Documents with Lucene (Version 2.9.4)
•  Directory
   – Lucene provides multiple ways to maintain and persist inverted indexes, among them file-based indexes, memory-based indexes, and database indexes.
   – The LockFactory associated with a directory implements basic concurrency control mechanisms. IndexWriter and IndexSearcher provide concurrency control to the application to ensure the integrity of the indexes (other transactional properties depend on the selected directory implementation).
•  Analyzers
   – Lucene and 3rd-party extensions provide a rich set of pre-defined analyzers with support for various languages. The main function of an analyzer is to return a TokenStream. A token stream implements a pipeline that cascades a Tokenizer with a set of TokenFilters.
   – A Tokenizer parses the fields of documents, removes syntactical elements, and produces a stream of tokens.
   – A TokenFilter filters/changes/aggregates elements in the token stream. Prominent examples include stemming, stop word elimination, and lower-case conversion.
•  Fields
   – Lucene is able to store additional attributes for each document in the index. The purpose of fields is two-fold:
     • the ability to restrict the search to specified meta data items (e.g., only title, author, abstract, etc.)
     • the ability to store data that identifies the document (or is relevant for presentation purposes)
   – The creation of a field includes three options:
     • Field.Store: YES or NO, indicating whether the content needs to be stored. NO means that the content is only analyzed but no longer available at search time. Use YES for identifying attributes (or for presentation). Typical examples include ID, file name, document type, date of insertion, and size of the document.
     • Field.Index: the main values are ANALYZED and NO. NO indicates that the field content must not be analyzed; it is then also not possible to search for such attributes. ANALYZED is used for content that must be searchable.
     • Field.TermVector: allows fine-tuning of which term vector information is kept in the index.


Searching Documents with Lucene (Version 2.9.4)

1. Select the Directory where the index resides:

     directory = FSDirectory.open(new File("./index"));

2. Create the same Analyzer as used for the documents:

     analyzer = new StandardAnalyzer(Version.LUCENE_29);

3. Create a Query (optionally through a QueryParser):

     parser = new QueryParser(Version.LUCENE_29, "content", analyzer);
     Query query = parser.parse(queryStringFromUser);

4. Get an IndexSearcher and search:

     searcher = new IndexSearcher(directory, true);
     TopDocs hits = searcher.search(query, NUM_RESULTS);

5. Present the result:

     for (int i = 0; i < hits.scoreDocs.length; i++) {
       Document doc = searcher.doc(hits.scoreDocs[i].doc);
       // access stored fields, e.g., doc.get("title"), and display them
     }

   [Figure: user input is turned into a Query (via the QueryParser and the same Analyzer as at indexing time); the IndexSearcher evaluates the query against the Directory and returns TopDocs for result presentation.]

[…]

•  Scoring: Lucene ranks with a tf-idf-based scoring function. In version 2.9, the score of a document d for a query q is (in simplified form):

     score(q, d) = coord(q, d) · queryNorm(q) · Σ_{t ∈ q} [ tf(t in d) · idf(t)² · boost(t) · norm(t, d) ]

   norm(t, d) denotes a value that Lucene computes at indexing time and stores within the inverted lists for each term in document d; it includes a boost factor that applications can specify when adding documents.


2.6 Literature and Links
•  General Books on Text Retrieval
   – Gerard Salton and Michael J. McGill. Information Retrieval – Grundlegendes für Informationswissenschaftler, McGraw-Hill Book Company, 1983.
   – W.B. Frakes and R. Baeza-Yates. Information Retrieval: Data Structures and Algorithms, Prentice Hall, 1992.
   – Karen Sparck Jones and Peter Willet. Readings in Information Retrieval, Morgan Kaufmann Publishers Inc., 1997.
   – David A. Grossman and Ophir Frieder. Information Retrieval: Algorithms and Heuristics, Kluwer Academic Publishers, 1998 (1st edition), 2004 (2nd edition).
   – Ricardo Baeza-Yates and Berthier Ribeiro-Neto. Modern Information Retrieval, ACM Press Books, 1999 (1st edition), 2011 (2nd edition).
   – Sandor Dominich. Mathematical Foundations of Information Retrieval, Kluwer Academic Publishers, 2001.
   – Christopher Manning, Prabhakar Raghavan, Hinrich Schütze. Introduction to Information Retrieval, Cambridge University Press, 2008.
   – Stefan Büttcher, Charles Clarke, Gordon Cormack. Information Retrieval: Implementing and Evaluating Search Engines, MIT Press, 2010.
•  Latent Semantic Indexing / Latent Semantic Analysis
   – G. W. Furnas, S. C. Deerwester, S. T. Dumais, T. K. Landauer, R. A. Harshman, L. A. Streeter, K. E. Lochbaum. Information Retrieval using a Singular Value Decomposition Model of Latent Semantic Structure, SIGIR 1988: 465–480.
   – S. C. Deerwester, S. T. Dumais, T. K. Landauer, G. W. Furnas, and R. A. Harshman. Indexing by Latent Semantic Analysis, Journal of the American Society for Information Science, 41(6):391–407, 1990.
   – Christos Faloutsos. Searching Multimedia Databases by Content, Kluwer Academic Publishers, 1996.
   – Telcordia: http://lsi.research.telcordia.com/


Literature and Links (2)
•  Thesauri & Ontologies for Selected Languages
   – EuroWordNet: http://www.illc.uva.nl/EuroWordNet/
   – GermaNet: http://www.sfs.uni-tuebingen.de/lsd/
   – WordNet: http://www.cogsci.princeton.edu/~wn/
•  Application of Retrieval Models
   – Boolean Retrieval (+ scoring function)
     • http://www.google.com
     • http://www.yahoo.com
     • http://lucene.apache.org (to constrain candidates only)
   – Vector Space Retrieval
     • AltaVista (legacy search engine)
     • Eurospider (ETH prototype search engine; startup company)
     • http://lucene.apache.org (extended version for scoring only)
   – Probabilistic Retrieval
     • Inktomi (legacy search engine)
   – Retrieval with LSI
     • http://www.netlib.org/cgi-bin/lsiBook
     • http://lsi.research.telcordia.com/lsi-bin/lsiQuery
