PostgreSQL Full Text Search

PostgreSQL Full Text Search An Introduction and a Performance Comparison with Apache Lucene /Solr Seminar Thesis Master of Science in Engineering Sp...

Author: Patience Bell

19 downloads 0 Views 1MB Size

Report

Download PDF

Recommend Documents

Full-Text Search in PostgreSQL. Oleg Bartunov Moscow University PostgreSQL Global Development Group

Full Text Search Functions

Full Text Search Agent Throughput

Oleg Bartunov (thanks 1C for support) Alexander Korotkov. Full-text search in PostgreSQL in milliseconds

Full-Text Search in XML Databases

search?q=search text"

TeXQuery: A Full-Text Search Extension to XQuery

Contextual Sentence Decomposition with Applications to Semantic Full-Text Search

SAP HANA SPS 09 - What s New? Full-text Search

Information Retrieval and Search Engines in Full-text Databases

On the Completeness of Full-Text Search Languages for XML

TeXQuery: A Full-Text Search Extension to XQuery

A User Interface for Semantic Full Text Search

Full Text of Testimony

User Guide. Checking the Full text limit will restrict your search to only search and retrieve records containing full text from ProQuest

Hooks in PostgreSQL. Hooks in PostgreSQL

Full-text Available Online at

Hooks in PostgreSQL. Hooks in PostgreSQL

Performance Comparison of DHT based Peer-to-Peer Full-Text Search Systems

Design of a Full Text Search index for a database management

Full-Text Search with Sphinx and MySQL. Percona Live, NY, 2011

index.html Homepage: Online Database Featuring Author, Key Word and Full-Text Search

Full-Text Search with Sphinx and PHP. SphinxSearch LAMP stack integration, tips and tricks

PostgreSQL Full Text Search An Introduction and a Performance Comparison with Apache Lucene /Solr

Seminar Thesis

Master of Science in Engineering Specialization: Information and Communication Technologies (ICT)

Advisor:

Prof. Martin Studer

Author:

Stefan Lütolf

Version:

2.6 (final)

Chur, December 2012

Database Seminar

PostgreSQL Full Text Search

Abstract PostgreSQL is a widely used open source relational database system. Since version 8.0 it offers full text search. This allows searching a document or text in a database. Full Text Search (FTS) refers to information retrieval techniques for searching a document or text data in a database, which contains textual data. Search criteria (like queries) are used to analyse the documents and list all the matching results; optionally, they can be sorted by relevance. PostgreSQL uses the vector space model: a specific document (e.g. an email, a text document or a magazine article) is searched in a large data pool consisting of documents. FTS in PostgreSQL is optimized for using these techniques in an organisation, for example for a search engine in the intranet. In this paper an introduction to Full Text Search in PostgreSQL is given as well as a performance comparison of alternatives such as Apache Lucene /Solr. Now, the question arises of what reasons can be given to use Full Text Search instead of well-known text operators.

Stefan Lütolf

Page 2 of 31

Database Seminar

PostgreSQL Full Text Search

Table of Contents 1

2

3

4

Introduction ......................................................................................................................... 5 1.1

What is Full Text Search?.......................................................................................... 5

1.2

What is a Document? ................................................................................................ 5

1.3

Information Retrieval Model ....................................................................................... 5

1.4

Indexing ..................................................................................................................... 6

1.5

Full Text Search versus Text Operators .................................................................... 6

Simple Full Text Search Example ...................................................................................... 7 2.1

Preparing ................................................................................................................... 7

2.2

Separate Vector Column ........................................................................................... 8

2.3

Index on the Vector Column .................................................................................... 10

2.4

Text matching .......................................................................................................... 11

2.5

Ranking search results ............................................................................................ 12

Full Text Search Features ................................................................................................ 14 3.1

Different Languages ................................................................................................ 14

3.2

Weights .................................................................................................................... 14

3.3

Dictionaries .............................................................................................................. 15

Benchmark ....................................................................................................................... 16 4.1

State of the Art ......................................................................................................... 16

4.2

Overview of Alternatives .......................................................................................... 16

4.3

PostgreSQL ............................................................................................................. 17

4.4

Apache Lucene /Solr ............................................................................................... 20

5

Conclusion ........................................................................................................................ 22

6

Bibliography ...................................................................................................................... 23

Annex A: SQL Script ................................................................................................................ 24 Annex B: PHP Script................................................................................................................ 27 Annex C: SQL Script (Small Dataset)...................................................................................... 28 Annex D: SQL Script (Large Dataset) ..................................................................................... 29 Annex E: Apache SOLR .......................................................................................................... 30

Stefan Lütolf

Page 3 of 31

Database Seminar

PostgreSQL Full Text Search

Table of Figures Figure 1: ER-model.................................................................................................................... 7 Figure 2: table content after inserting data ................................................................................ 7 Figure 3: expanded ER-model with a separate vector column ................................................. 8 Figure 4: result of the extra ts_vector column in PostgreSQL ................................................... 8 Figure 5: result of ranking ........................................................................................................ 12 Figure 6: result of a normalized ranking .................................................................................. 13 Figure 7: Result of ranking and highlighting ............................................................................ 13 Figure 8: diagram initialization process runtime ...................................................................... 17 Figure 9: diagram query tests runtime without indexes ........................................................... 18 Figure 10: diagram query tests with indexes ........................................................................... 19 Figure 11: Initialization of Lucene /Solr ................................................................................... 20 Figure 12: Query Tests with Lucene /Solr ............................................................................... 21

Table of Charts Table 1: Comparison of Alternatives [19] ................................................................................ 16 Table 2: Comparison of the Dataset ........................................................................................ 17 Table 3: Query test without any Indexes ................................................................................. 18 Table 4: Query test using indexes ........................................................................................... 19 Table 5: Initialization in Apache SOLR .................................................................................... 20 Table 6: Lucene Query tests ................................................................................................... 21

Stefan Lütolf

Page 4 of 31

Database Seminar

PostgreSQL Full Text Search

1 Introduction 1.1

What is Full Text Search?

Full Text Search (FTS) refers to information retrieval techniques for searching a document or text data in a full text database. Search criteria (like queries) are used to analyse the documents and list all the matching results. [1] With full text searching, it is possible to identify natural-language documents that match a query and optionally to sort them by relevance to the query. The most common use case is to find all documents based on the given query and optionally to sort them by relevance. [2]

1.2

What is a Document?

A document is defined as the unit of searching in a full text search system. This can be an email message or a book or a magazine article. There are two possibilities to store a document in a database; firstly, the whole text can be stored in a single field or a combination of a few fields in the database. Secondly, it can be referred to a document in the file system, whereof only a special type of vector containing the key word (lexemes) is stored. [2]

1.3

Information Retrieval Model

A few Information Retrieval model for Retrieval Adhoc are known, such as the standard Boolean model or the expanded Boolean model [3], the vector space model [4] as well as the probabilistic model [5]. In PostgreSQL, the vector space model is used for the documents in Full Text Search. “The tsvector type represents a document in a form optimized for text search” [6] The advantage of the Vector Space model over the standard Boolean model is that the ranking feature is supported in the former mentioned model.

Stefan Lütolf

Page 5 of 31

Database Seminar

1.4

PostgreSQL Full Text Search

Indexing

If the Search engine deals with a small number of documents it is possible to scan the full documents with every query. This is called serial scanning. If there are plenty of documents to scan, the task of full text search is divided into: first indexing the whole documents and then searching the queries. The indexing task means, that the search engine creates a list of search terms, named a concordance. The indexer makes an entry for each term found in a document. It is possible to ignore stop words, such as “the” or “and”, because these words aren’t as relevant as others. Furthermore, a language-specific stemming on the words can be used; thus, only the basic form of a word is used, like “speak” instead of “spoken”. [2] Using indexes with queries, which deliver most of the rows, is slow, because PostgreSQL prefer to make a sequential scan instead of using the index. [2]

1.5

Full Text Search versus Text Operators

What are the reasons for using Full Text Search instead of operators? There are still a few operators in PostgreSQL, which can be used for textual data types. These are “~, ~*, LIKE, ILIKE”. ILIKE with wildcards: SELECT * FROM sometextpool WHERE content ILIKE '%myquery%' And POSIX regular expressions: SELECT * FROM sometextpool WHERE content ~ 'myquery' The disadvantages of these operators are the followings [2]: 

There is no linguistic support



They provide no ranking



Dealing with huge number of documents is very slow

The FTS allows pre-processing the documents: 

Parsing documents into tokens



Converting tokens into lexemes (This is a normalized token)



Storing pre-processed documents for searching

Stefan Lütolf

Page 6 of 31

Database Seminar

PostgreSQL Full Text Search

2 Simple Full Text Search Example There are two different datasets. The first is a small set with 2000 rows and maximal 5000 character per text: movie reviews from the department of computer science at Cornel University (polarity dataset v2.0). The second dataset is from the department of computer science as well, however it is a pool of 27886 unprocessed html files [7].

2.1

Preparing

First of all a new table “FullTextSearch” is created with an id and a content column, as illustrated in the box below: --CLEANUP DROP TABLE IF EXISTS FullTextSearch; --INITIALIZATION CREATE TABLE FullTextSearch( id SERIAL NOT NULL, content text NOT NULL ); ALTER TABLE ONLY FullTextSearch ADD CONSTRAINT fulltextsearch_pkey PRIMARY KEY (id); The ER-model of this database: FullTextSearch PK

id content

Figure 1: ER-model To insert data into the database it is possible to use the standard Insert command. The syntax used is shown in the following box: INSERT INTO FullTextSearch(content) VALUES ('a fat cat sat on a mat - it ate a fat rats'); Further information about the self-made php script is given in the Annex B. The next figure shows the inserted dataset of the database.

Figure 2: table content after inserting data

Stefan Lütolf

Page 7 of 31

Database Seminar

2.2

PostgreSQL Full Text Search

Separate Vector Column

Example with the separate vector column A ts_vector is used for full text search in PostgreSQL; in this case a separate column with a ts_vector is necessary. For further performance tests an extra ts_vector is created for a GIN and a GIST Index. The ts_vector can be built with the following SQL syntax: /* ------------------------Separate Ts_Vector column ------------------------*/ --TS_Vector for GIN INDEX ALTER TABLE FullTextSearch ADD COLUMN content_tsv_gin tsvector; UPDATE FullTextSearch SET content_tsv_gin = to_tsvector('pg_catalog.english', content); --TS_Vector for GIST INDEX ALTER TABLE FullTextSearch ADD COLUMN content_tsv_gist tsvector; UPDATE FullTextSearch SET content_tsv_gist = to_tsvector('pg_catalog.english', content); The expanded ER-model is shown in the following figure: FullTextSearch PK

id content content_tsv_gin content_tsv_gist

Figure 3: expanded ER-model with a separate vector column The result of building a separate column is shown in the figure 4.

Figure 4: result of the extra ts_vector column in PostgreSQL

Stefan Lütolf

Page 8 of 31

Database Seminar

PostgreSQL Full Text Search

Background theory of the Text Search Vector “A data type tsvector is provided for storing pre-processed documents, along with a type tsquery for representing processed queries” [8] Pre-processing includes: 

“Parsing documents into tokens. It is useful to identify various classes of tokens, e.g., numbers, words, email addresses, so that they can be processed differently.” [8]



“Converting tokens into lexemes. A lexeme is a string, just like a token, but it has been normalized so that different forms of the same word are made alike. For example, normalization almost always includes folding upper-case letters to lower-case, and often involves removal of suffixes (such as “s” or “es” in English). This allows searches to find variant forms of the same word, without tediously entering all the possible variants. Also, this step typically eliminates stop words, which are words that are so common that they are useless for searching. (In short, then, tokens are raw fragments of the document text, while lexemes are words that are believed useful for indexing and searching.) PostgreSQL uses dictionaries to perform this step. Various standard dictionaries are provided, and custom ones can be created for specific needs”. [8]

The tsvector consists of the key words (lexemes) following the position of the key word in the referenced text. The reason to use a ts_vector is shown in the following example. Without the tsvector there is no normalisation and no linguistic support. The following queries illustrate that without the tsvector, the result is not as expected. /* ------------------------Basic text matching with two simple strings ------------------------*/ --Typ Cast-SELECT 'a fat cat sat on a mat and ate fat rats'::tsvector @@ 'cat & rat'::tsquery; --Result: false --Function to_ts..-SELECT to_tsvector('a fat cat sat on a mat and ate fat rats') @@ to_tsquery('cat & rat'); --Result: true, because the ts_vector includes linguistic supports The first query makes only a type cast. Therefore, the “rat” in the query ignores match the “rats” in the tsvector. The second query shows that the linguistic support ignores the “s” from the “rats” and for this reason the query matched.

Stefan Lütolf

Page 9 of 31

Database Seminar

PostgreSQL Full Text Search

Keep the vector up-to-date with a trigger With a trigger it can be guaranteed that the vector is up-to-date. This is used by Insert or modification the text column. With the following syntax, a trigger can be installed: /* ------------------------keep it up-to-date with a Trigger ------------------------*/ --TRIGGER CREATE TRIGGER tsv_gin_update BEFORE INSERT OR UPDATE ON FullTextSearch FOR EACH ROW EXECUTE PROCEDURE tsvector_update_trigger(content_tsv_gin,'pg_catalog.english',c ontent); CREATE TRIGGER tsv_gist_update BEFORE INSERT OR UPDATE ON FullTextSearch FOR EACH ROW EXECUTE PROCEDURE tsvector_update_trigger(content_tsv_gist,'pg_catalog.english', content);

2.3

Index on the Vector Column

Create Index on the example To increase the speed of a query, it is necessary to add an index to the ts_vector column. The syntax is shown in the box below. The first column is added with a GIN Index and the second with a GIST Index. With the former searching is quicker than with the GIST, however it is slower to update with the GIN Index. The Trigram is needed to use the GIN Index on a text column [9], which is for further performance tests. /* ------------------------Create Indexes ------------------------*/ --Index on content (Trigram needed,to use Gin Index) CREATE EXTENSION pg_trgm; CREATE INDEX FullTextSearch_content ON FullTextSearch USING GIN(content gin_trgm_ops); --GIN INDEX on content_tsv_gin CREATE INDEX FullTextSearch_content_tsv_gin ON FullTextSearch USING GIN(content_tsv_gin); --GIST INDEX on content_tsv_gist CREATE INDEX FullTextSearch_content_tsv_gist ON FullTextSearch USING GIST(content_tsv_gist);

Stefan Lütolf

Page 10 of 31

Database Seminar

PostgreSQL Full Text Search

Further Information about an Index There are two different Indexes that can be used to speed up full text search in PostgreSQL. Indexes are not mandatory for text searching, but in cases where a column is searched on a regular basis, an index is usually desirable. The two kinds of Indexes: 

GIN (Generalized Inverted Index)



GIST (Generalized Search Tree)

“There are substantial performance differences between the two index types […] A GIST index is lossy […]. GIN indexes are not lossy for standard queries, but their performance depends logarithmically on the number of unique words […]. GIN index lookups are about three times faster than GIST. GIN indexes take about three times longer to build than GIST.”[10]

2.4

Text matching

With the @@ Operator it is possible to match a document (tsvector) with a query (tsquery), It does not matter which data type is written first. A tsquery contains search terms; it is already normalized into lexemes. If the query is still in plaintext, there are functions to convert them into a tsquery. For this conversion the functions to_tsquery and plainto_tsquery are helpful to that. A query may combine multiple terms using AND, OR and NOT operators. The difference between the two functions is that the plainto_tsquery cannot recognize Boolean operators or weight labels. The Boolean AND operator is inserted between surviving words. The following example shows how it is applied to make a query to the database. /* ------------------------Basic Query ------------------------*/ --FTS Search with @@ Operator SELECT * FROM FullTextSearch WHERE content_tsv_gin @@ plainto_tsquery('english', 'movie'); The result of this query retrieved 497 matched rows. These are too many results to check all the matched samples. For this reason, it is useful to use the ranking function of PostgreSQL FTS, which is described in the next chapter.

Stefan Lütolf

Page 11 of 31

Database Seminar

2.5

PostgreSQL Full Text Search

Ranking search results

Ranking and Normalized Ranking Ranking attempts to measure how relevant documents are to a matching query, thus the most relevant match can be shown first. With the Vector space model used in FTS of PostgreSQL, a ranking of the matched text samples can be made. The following syntax can be applied to rank the results: --Ranking blank SELECT id,content, ts_rank_cd(content_tsv_gin, query) AS rank FROM FullTextSearch, to_tsquery('wonderful | good') query WHERE query @@ content_tsv_gin ORDER BY rank DESC LIMIT 10; In the next figure, the result of the example is listed

Figure 5: result of ranking The value of the rank column is a real number; however, this number is not normalized because the ranking functions do not use any global information. To normalize the ranking result, use the additional value of the ts_rank_cd can be used. The value 32 connoted that all ranks are scaled into the range zero to one. --Ranking normalized SELECT id, content, ts_rank_cd(content_tsv_gin, query, 32 /* rank/(rank+1) */) AS rank FROM FullTextSearch, to_tsquery('wonderful | good') query WHERE query @@ content_tsv_gin ORDER BY rank DESC LIMIT 10;

Stefan Lütolf

Page 12 of 31

Database Seminar

PostgreSQL Full Text Search

In the next figure, the normalized result is listed:

Figure 6: result of a normalized ranking

Highlighting Results “To present search results it is ideal to show a part of each document and how it is related to the query.” PostgreSQL has the function ts_headline to highlight text fragments, which matches best to the query. For more information see [11]. The Syntax on the example before: SELECT id, ts_headline(content,query) AS highlighted, ts_rank_cd(content_tsv_gin, query, 32 /* rank/(rank+1) */) AS rank FROM FullTextSearch, to_tsquery('wonderful | good') query WHERE query @@ content_tsv_gin ORDER BY rank DESC LIMIT 10; The following figure presents the result of the ranking with highlighting

Figure 7: Result of ranking and highlighting

Stefan Lütolf

Page 13 of 31

Database Seminar

PostgreSQL Full Text Search

3 Full Text Search Features 3.1

Different Languages

Each vector can have its own language, for example when having a few magazine articles in different languages, each of which can associate with its own language. To identify the language into the vector, the first parameter of the to_tsvector function, shown below, can be used: to_tsvector('pg_catalog.english', content); Another language can be chosen, for example German: to_tsvector('pg_catalog.german', 'hallo, wer ist hier');

3.2

Weights

“The function setweight can be used to label the entries of a tsvector with a given weight, where a weight is one of the letters A, B, C, or D. This is typically used to mark entries coming from different parts of a document, such as title versus body. Because to_tsvector(NULL) will return NULL, it is recommended to use coalesce whenever a field might be null […]” [11] UPDATE table SET column = setweight(to_tsvector(coalesce(title,'')), 'A') || setweight(to_tsvector(coalesce(keyword,'')), 'B') || setweight(to_tsvector(coalesce(abstract,'')), 'C') || setweight(to_tsvector(coalesce(body,'')), 'D'); Note that in this case the stop words are unaccounted for the vector. To remove the stop words, first build a tsvector and thereafter build the weighted column by the following SQL syntax: update fts set text_search_vector_weighted = setweight(text_search_vector, 'A'); “As in basic tsquery input, weight(s) can be attached to each lexeme to restrict it to match only tsvector lexemes of those weight(s).” [11] SELECT to_tsquery('english', 'Fat | Rats:AB'); For further Information see [11]

Stefan Lütolf

Page 14 of 31

Database Seminar

3.3

PostgreSQL Full Text Search

Dictionaries

“Dictionaries are used to eliminate words that should not be considered in a search (stop words), and to normalize words so that different derived forms of the same word will match. A successfully normalized word is called a lexeme. Aside from improving search quality, normalization and removal of stop words reduce the size of the tsvector representation of a document, thereby improving performance” [12].

Synonym Dictionary With this dictionary template, a word with a synonym in a given table can be replaced. For an example directory, an own synonym directory has to be created: “installation directory of PostgreSQL”\share\tsearch_data\my_synonyms.syn. The file looks like this: My_synonyms.syn postgres postgresql postgre paris fulltextsearch

pgsql pgsql pgsql paris fts

To create the own directory in PostgreSQL and install it, the following SQL syntax can be applied: CREATE TEXT SEARCH DICTIONARY my_synonym_dir ( TEMPLATE = synonym, SYNONYMS = my_synonyms ); ALTER TEXT SEARCH CONFIGURATION english ALTER MAPPING FOR asciiword WITH my_synonym_dir, english_stem; The result of a lexeme of the word “fulltextsearch” is “fts” like the entry of the created directory file. To verify this, use the following debug syntax: select * from ts_debug('english','fulltextsearch');

Thesaurus Dictionary The Synonym Dictionary does not support Phrases; instead a Thesaurus Dictionary is useful. For example: the word combination “supernovae stars” can be replaced by the synonym “sn”. The process of installing a thesaurus dictionary is similarly to the Synonym Dictionary.

Others 

ISPELL Dictionary (normalize many different linguistic forms)



Snowball Dictionary (recognize everything)

Stefan Lütolf

Page 15 of 31

Database Seminar

PostgreSQL Full Text Search

4 Benchmark 4.1

State of the Art

Existing database benchmarks, like the well-known database benchmarks from the Transaction Processing Performance Council (TPC) [13], do not contain benchmarks for fulltext search scenarios. One reason for this has been the lack of standard in the domain of SQL interfaces for full-text search in the SQL standards - the SQL Multimedia and Application Packages (SQL/MM) - Part 2: Full-Text [14] [15] tries to fill the gap. The emerging of a standard may allow for a widely accepted benchmark in the near future. The information retrieval community on the other side has been using the TREC [16] for benchmarking. The purpose of the text retrieval conference (TREC) is "to support research within the information retrieval community by providing the infrastructure necessary for largescale evaluation of text retrieval methodologies." The TREC benchmarks has the problem, that it is for scientific documents ("To index and search […] most groups used custom retrieval software develop by their own group or by an associated group. "[17]), in given time. Mike McCandless runs Lucene performance tests regularly on the English Wikipedia [18]. At the moment there is no information retrieval benchmark compared to the TPC benchmarks for relational database systems.

4.2

Overview of Alternatives

There is a comparison of Text Search from "Practical full-text search in PostgreSQL", Bill Karwin, PostgreSQL Conference West, 2009/10/17: Alternative

Indexing time

Index

Query speed

Solution

storage size LIKE / RegEx

None

None

12x

SQL

Full-Text

7x

5x

3x

SQL

Open

Source

Indexing Inverted

/

High

1x

5x

SQL / RDBMS

2x

10x

10x

IR / Open Source

1x

5x

1x

IR / Open Source

Offline

Offline

*

Web service

Index Apache Lucene Sphinx Search Search Engine Service Table 1: Comparison of Alternatives [19]

Stefan Lütolf

Page 16 of 31

Database Seminar

4.3

PostgreSQL Full Text Search

PostgreSQL

Testing System: Query Tool of pgAdmin III. Configuration of PostgreSQL: Out of the Box installation. Version: PostgreSQL 9.2 (32Bit). Dataset: Described in Chapter 2.

Comparison of the two dataset The dataset is processed and inserted into the database with a self-made php script, which is given in Annex B. All the values in the table below are measured in milliseconds. The SQL syntax on this test is shown in Annex A. Dataset

Loading Data

Create

Indexing

Indexing

to the DB

tsvector

time (GIN)

time (GIST)

Movie Review (small)

2’575

4’485

1’437

625

Movie Review (large)

77’777

77’755

30’504

3’672

Table 2: Comparison of the Dataset The following Graph shows the results in a diagram:

Initialization Movie Review (large)

Indexing time (GIST)

Indexing time (GIN)

Create tsvector Loading Data to the DB

Movie Review (small)

3.672 0.625 30.504 1.437 77.755 4.485 77.777 2.575 values in seconds

Figure 8: diagram initialization process runtime

Stefan Lütolf

Page 17 of 31

Database Seminar

PostgreSQL Full Text Search

Query comparison on the two dataset In Annex C is the SQL script of the performance test queries for the small dataset is given, the results for the large dataset is shown in Annex D. This is a comparison of the different possible text queries, such as Posix (RegEx), LIKE and a query with FTS supports (@@ operator). The FTS query is split into two different kinds of queries. First there is an equality search (small range) and thereafter, a search with a wide array of results is done. All the values in the table below are measured in milliseconds. Test without using indexes: Dataset

Posix

LIKE

LIKE

LIKE

FTS

FTS

RegEx

%word%

word%

%word

Equality

result

large

Movie

1688

1640

547

578

63

1610

Review

(1631

(1631

(8 rows)

(0 rows)

(1 row)

(1626 rows)

(small)

rows)

rows)

Movie

23705

23751

1578

1969

516

24751

Review

(15839

(17029

(0 rows)

(0 rows)

(17 rows)

(16608 rows)

(large)

rows)

rows)

Table 3: Query test without any Indexes

Query Tests without Index Movie Review (small) Posix RegEx LIKE %word%

1.688 1.64

LIKE word%

0.547 1.578

LIKE %word

0.578 1.969

FTS Equality FTS large result

Movie Review (large)

23.705 23.751

0.063 0.516 1.61

24.751

values in seconds

Figure 9: diagram query tests runtime without indexes

Stefan Lütolf

Page 18 of 31

Database Seminar

PostgreSQL Full Text Search

Test using indexes: Dataset

Posix

LIKE

LIKE

LIKE

FTS

FTS

RegEx

%word%

word%

%word

Equality

result

large

Movie

1688

1688

469

453

GIN:31

GIN: 1609

Review

(1631

(1631

(8 rows)

(0 rows)

GIST:32

GIST:1609

(small)

rows)

rows)

(1 row)

(1626rows)

Movie

23689

11969

1109

1344

GIN:79

GIN:24673

Review

(15839

(17029

(0 rows)

(0 rows)

GIST:125

GIST:24658

(large)

rows)

rows)

(17 rows)

(16608 rows)

Table 4: Query test using indexes

Query Tests with Index Movie Review (small) Posix RegEx

1.688

LIKE %word%

1.688

LIKE word%

0.469 1.109

LIKE %word

0.453 1.344

FTS Equality (GIN)

0.031 0.079

FTS Equality (GIST)

0.032 0.125

FTS large result (GIN)

1.609

FTS large result (GIST)

1.609

Movie Review (large)

23.689 11.969

24.673 24.673

values in seconds

Figure 10: diagram query tests with indexes

Stefan Lütolf

Page 19 of 31

Database Seminar

4.4

PostgreSQL Full Text Search

Apache Lucene /Solr

Apache SOLR is used for testing [20]. The setup of Solr and the schemas of the datasets, which is configured in Solr, are given in the Annex E. The initialization process time is shown in the table below. The initialization process time is divided into two tasks, first converting the data to an xml and thereafter to import it into the database. It isn’t clear if the indexing process time is included in the import process time. Note, the values are in milliseconds. Dataset

Convert to XML

Import

Movie Review (small)

1679

3800

Movie Review (large)

40420

32000

Table 5: Initialization in Apache SOLR The following graph shows the results in a diagram:

Initialization Movie Review (small)

Movie Review (large)

1.679 Convert to XML 40.42

3.8 Import 32

values in seconds Figure 11: Initialization of Lucene /Solr

Stefan Lütolf

Page 20 of 31

Database Seminar

PostgreSQL Full Text Search

Query Test using indexes (see schema.xml in Annex E for detail): The measure time contains first the prepare time and second the process time. The values are in milliseconds Dataset

Wildcard

Wildcard

Wildcard

FTS

FTS

*word*

+word* -word

*word –word

Equality

large Result set

Movie Review

16 + 93

0 + 15

16 + 15

0 + 16

0+0

(1657

(120 results)

(0 results)

(3

(1537 results)

(small)

results) Movie Review (large)

results)

31 + 156

0 + 63

0 + 63

0 + 15

0 + 16

(15338

(595 results)

(15 results)

(2

(14724 results)

results)

results)

Table 6: Lucene Query tests

The following Graph shows the results in a diagram:

Query Tests with Lucene / Solr Movie Review (small) prepare Movie Review (small) process Movie Review (large) prepare Movie Review (large) process

16 Wildcard *word*

93

31

Wildcard +word* word

0 0

Wildcard *word – word

0

FTS Equality

0 0

FTS large Resulat set

0 0 0

156

15 63 16 15 63 16 15

16 values in milliseconds Figure 12: Query Tests with Lucene /Solr

Stefan Lütolf

Page 21 of 31

Database Seminar

PostgreSQL Full Text Search

5 Conclusion To sum up, it can be said that Full Text search is a useful feature of PostgreSQL. As a vector can be created out of some text, it is possible to store only the keywords in a database and therefore less storage as well as time is needed for the query process. This need less storage and speed up the query process time. In addition to that, an index can be used to speed up the search query. However if the query delivers about the whole dataset, the database ignores the index and makes a whole sequence scan which takes considerably more time. Furthermore, the Synonym Dictionary makes it possible to create its own dictionary with similar words. This feature is useful for shortcuts or synonyms in the technical language.

The performance test is a sensitive issue; because of this the next remarks are a nonexhaustive enumeration: 

To create a GIST Index is much faster than to create a GIN Index. Searching with a GIN index is faster than with a GIST. (PostgreSQL)



The tests with wildcards (LIKE) shows that the index (GIN) increases the speed by querying (PostgreSQL). The wildcards queries are faster with Apache Lucene / Solr.



Fulltextsearch queries with a small result set (equality) are fast with both tested solution and the index increases the speed of the queries.



Fulltextsearch queries with a large result set are fast with Apache Lucene / Solr but with PostgreSQL it is very slow. The reason could be the scan variant, because PostgreSQL makes a sequential scan instead of an index scan.

Stefan Lütolf

Page 22 of 31

Database Seminar

PostgreSQL Full Text Search

6 Bibliography [1] „Full text search“, Wikipedia, the free encyclopedia. 23-Nov-2012. [2] „PostgreSQL: Documentation: 9.2: Full Text Search“. [Online]. Available: http://www.postgresql.org/docs/9.2/static/textsearch.html. [Accessed: 26-Nov-2012]. [3] „Standard Boolean model - Wikipedia, the free encyclopedia“. [Online]. Available: http://en.wikipedia.org/wiki/Standard_Boolean_model. [Accessed: 04-Dez-2012]. [4] „Vector space model“, Wikipedia, the free encyclopedia. 30-Nov-2012. [5] „Probability theory - Wikipedia, the free encyclopedia“. [Online]. Available: http://en.wikipedia.org/wiki/Probability_theory. [Accessed: 04-Dez-2012]. [6] „PostgreSQL: Documentation: 9.2: Text Search Types“. [Online]. Available: http://www.postgresql.org/docs/9.2/static/datatype-textsearch.html. [Accessed: 04-Dez2012]. [7] „Data“. [Online]. Available: http://www.cs.cornell.edu/people/pabo/movie-review-data/. [Accessed: 10-Dez-2012]. [8] „PostgreSQL: Documentation: 9.2: Introduction“. [Online]. Available: http://www.postgresql.org/docs/9.2/static/textsearch-intro.html. [Accessed: 07-Dez-2012]. [9] „PostgreSQL: Documentation: 9.1: pg_trgm“. [Online]. Available: http://www.postgresql.org/docs/9.1/static/pgtrgm.html. [Accessed: 10-Dez-2012]. [10] „PostgreSQL: Documentation: 9.2: GiST and GIN Index Types“. [Online]. Available: http://www.postgresql.org/docs/9.2/static/textsearch-indexes.html. [Accessed: 07-Dez2012]. [11] „PostgreSQL: Documentation: 9.2: Controlling Text Search“. [Online]. Available: http://www.postgresql.org/docs/9.2/static/textsearch-controls.html. [Accessed: 07-Dez2012]. [12] „PostgreSQL: Documentation: 9.2: Dictionaries“. [Online]. Available: http://www.postgresql.org/docs/9.2/static/textsearch-dictionaries.html. [Accessed: 10Dez-2012]. [13] „TPC - Homepage“. [Online]. Available: http://www.tpc.org/. [Accessed: 04-Dez-2012]. [14] „ISO/IEC 13249-2:2003 - Information technology -- Database languages -- SQL multimedia and application packages -- Part 2: Full-Text“. [Online]. Available: http://www.iso.org/iso/home/store/catalogue_tc/catalogue_detail.htm?csnumber=31368. [Accessed: 04-Dez-2012]. [15] „32N0760T.pdf“. . [16] „Text REtrieval Conference (TREC) Home Page“. [Online]. Available: http://trec.nist.gov/. [Accessed: 04-Dez-2012]. [17] „TERA.OVERVIEW.pdf“. . [18] „Lucene nightly benchmarks“. [Online]. Available: http://people.apache.org/~mikemccand/lucenebench/. [Accessed: 04-Dez-2012]. [19] Karwin Software Solutions LLC, „Full Text Search In PostgreSQL“, 03-Nov-2009. [20] „Apache Lucene - Apache Solr“. [Online]. Available: http://lucene.apache.org/solr/. [Accessed: 20-Dez-2012].

Stefan Lütolf

Page 23 of 31

Database Seminar

PostgreSQL Full Text Search

Annex A: SQL Script /* ------------------------Initialization ------------------------*/ --CLEANUP DROP TABLE IF EXISTS FullTextSearch; --INITIALIZATION CREATE TABLE FullTextSearch( id SERIAL NOT NULL, content text NOT NULL ); ALTER TABLE ONLY FullTextSearch ADD CONSTRAINT fulltextsearch_pkey PRIMARY KEY (id); --INSERT (see php skript to fill DB with more data) INSERT INTO FullTextSearch(content) VALUES ('a fat cat sat on a mat - it ate a fat rats');

/* ------------------------Basic text matching with two simple strings ------------------------*/ --Typ Cast-SELECT 'a fat cat sat on a mat and ate fat rats'::tsvector @@ 'cat & rat'::tsquery; --Result: false --Function to_ts..-SELECT to_tsvector('a fat cat sat on a mat and ate fat rats') @@ to_tsquery('cat & rat'); --Result: true, because the ts_vector includes linguistic supports

Stefan Lütolf

Page 24 of 31

Database Seminar

PostgreSQL Full Text Search

/* ------------------------Separate Ts_Vector column ------------------------*/ --TS_Vector for GIN INDEX ALTER TABLE FullTextSearch ADD COLUMN content_tsv_gin tsvector; UPDATE FullTextSearch SET content_tsv_gin = to_tsvector('pg_catalog.english', content); --TS_Vector for GIST INDEX ALTER TABLE FullTextSearch ADD COLUMN content_tsv_gist tsvector; UPDATE FullTextSearch SET content_tsv_gist = to_tsvector('pg_catalog.english', content);

/* ------------------------keep it up-to-date with a Trigger ------------------------*/ --TRIGGER CREATE TRIGGER tsv_gin_update BEFORE INSERT OR UPDATE ON FullTextSearch FOR EACH ROW EXECUTE PROCEDURE tsvector_update_trigger(content_tsv_gin,'pg_catalog.english',c ontent); CREATE TRIGGER tsv_gist_update BEFORE INSERT OR UPDATE ON FullTextSearch FOR EACH ROW EXECUTE PROCEDURE tsvector_update_trigger(content_tsv_gist,'pg_catalog.english', content);

/* ------------------------Basic Query ------------------------*/ --FTS Search with @@ Operator SELECT * FROM FullTextSearch WHERE content_tsv_gin @@ plainto_tsquery('english', 'movie');

Stefan Lütolf

Page 25 of 31

Database Seminar

PostgreSQL Full Text Search

/* ------------------------Create Indexes ------------------------*/ --Index on content (Trigram needed,to use Gin Index) CREATE EXTENSION pg_trgm; CREATE INDEX FullTextSearch_content ON FullTextSearch USING GIN(content gin_trgm_ops); --GIN INDEX on content_tsv_gin CREATE INDEX FullTextSearch_content_tsv_gin ON FullTextSearch USING GIN(content_tsv_gin);

--GIST INDEX on content_tsv_gist CREATE INDEX FullTextSearch_content_tsv_gist ON FullTextSearch USING GIST(content_tsv_gist);

/* ------------------------Ranking (and normalized ranking) ------------------------*/ --Ranking blank SELECT id,content, ts_rank_cd(content_tsv_gin, query) AS rank FROM FullTextSearch, to_tsquery('wonderful | good') query WHERE query @@ content_tsv_gin ORDER BY rank DESC LIMIT 10; --Ranking normalized SELECT id, content, ts_rank_cd(content_tsv_gin, query, 32 /* rank/(rank+1) */) AS rank FROM FullTextSearch, to_tsquery('wonderful | good') query WHERE query @@ content_tsv_gin ORDER BY rank DESC LIMIT 10; --Highlighting and Ranking SELECT id, ts_headline(content,query) AS highlighted, ts_rank_cd(content_tsv_gin, query, 32 /* rank/(rank+1) */) AS rank FROM FullTextSearch, to_tsquery('wonderful | good') query WHERE query @@ content_tsv_gin ORDER BY rank DESC LIMIT 10;

Stefan Lütolf

Page 26 of 31

Database Seminar

PostgreSQL Full Text Search

Annex B: PHP Script id

Stefan Lütolf

Page 30 of 31

Database Seminar

PostgreSQL Full Text Search

Schema of large Dataset: id

Stefan Lütolf

Page 31 of 31