PostgreSQL Full Text Search An Introduction and a Performance Comparison with Apache Lucene /Solr
Seminar Thesis
Master of Science in Engineering Specialization: Information and Communication Technologies (ICT)
Advisor:
Prof. Martin Studer
Author:
Stefan Lütolf
Version:
2.6 (final)
Chur, December 2012
Database Seminar
PostgreSQL Full Text Search
Abstract PostgreSQL is a widely used open source relational database system. Since version 8.0 it offers full text search. This allows searching a document or text in a database. Full Text Search (FTS) refers to information retrieval techniques for searching a document or text data in a database, which contains textual data. Search criteria (like queries) are used to analyse the documents and list all the matching results; optionally, they can be sorted by relevance. PostgreSQL uses the vector space model: a specific document (e.g. an email, a text document or a magazine article) is searched in a large data pool consisting of documents. FTS in PostgreSQL is optimized for using these techniques in an organisation, for example for a search engine in the intranet. In this paper an introduction to Full Text Search in PostgreSQL is given as well as a performance comparison of alternatives such as Apache Lucene /Solr. Now, the question arises of what reasons can be given to use Full Text Search instead of well-known text operators.
Stefan Lütolf
Page 2 of 31
Database Seminar
PostgreSQL Full Text Search
Table of Contents 1
2
3
4
Introduction ......................................................................................................................... 5 1.1
What is Full Text Search?.......................................................................................... 5
1.2
What is a Document? ................................................................................................ 5
1.3
Information Retrieval Model ....................................................................................... 5
1.4
Indexing ..................................................................................................................... 6
1.5
Full Text Search versus Text Operators .................................................................... 6
Simple Full Text Search Example ...................................................................................... 7 2.1
Preparing ................................................................................................................... 7
2.2
Separate Vector Column ........................................................................................... 8
2.3
Index on the Vector Column .................................................................................... 10
2.4
Text matching .......................................................................................................... 11
2.5
Ranking search results ............................................................................................ 12
Full Text Search Features ................................................................................................ 14 3.1
Different Languages ................................................................................................ 14
3.2
Weights .................................................................................................................... 14
3.3
Dictionaries .............................................................................................................. 15
Benchmark ....................................................................................................................... 16 4.1
State of the Art ......................................................................................................... 16
4.2
Overview of Alternatives .......................................................................................... 16
4.3
PostgreSQL ............................................................................................................. 17
4.4
Apache Lucene /Solr ............................................................................................... 20
5
Conclusion ........................................................................................................................ 22
6
Bibliography ...................................................................................................................... 23
Annex A: SQL Script ................................................................................................................ 24 Annex B: PHP Script................................................................................................................ 27 Annex C: SQL Script (Small Dataset)...................................................................................... 28 Annex D: SQL Script (Large Dataset) ..................................................................................... 29 Annex E: Apache SOLR .......................................................................................................... 30
Stefan Lütolf
Page 3 of 31
Database Seminar
PostgreSQL Full Text Search
Table of Figures Figure 1: ER-model.................................................................................................................... 7 Figure 2: table content after inserting data ................................................................................ 7 Figure 3: expanded ER-model with a separate vector column ................................................. 8 Figure 4: result of the extra ts_vector column in PostgreSQL ................................................... 8 Figure 5: result of ranking ........................................................................................................ 12 Figure 6: result of a normalized ranking .................................................................................. 13 Figure 7: Result of ranking and highlighting ............................................................................ 13 Figure 8: diagram initialization process runtime ...................................................................... 17 Figure 9: diagram query tests runtime without indexes ........................................................... 18 Figure 10: diagram query tests with indexes ........................................................................... 19 Figure 11: Initialization of Lucene /Solr ................................................................................... 20 Figure 12: Query Tests with Lucene /Solr ............................................................................... 21
Table of Charts Table 1: Comparison of Alternatives [19] ................................................................................ 16 Table 2: Comparison of the Dataset ........................................................................................ 17 Table 3: Query test without any Indexes ................................................................................. 18 Table 4: Query test using indexes ........................................................................................... 19 Table 5: Initialization in Apache SOLR .................................................................................... 20 Table 6: Lucene Query tests ................................................................................................... 21
Stefan Lütolf
Page 4 of 31
Database Seminar
PostgreSQL Full Text Search
1 Introduction 1.1
What is Full Text Search?
Full Text Search (FTS) refers to information retrieval techniques for searching a document or text data in a full text database. Search criteria (like queries) are used to analyse the documents and list all the matching results. [1] With full text searching, it is possible to identify natural-language documents that match a query and optionally to sort them by relevance to the query. The most common use case is to find all documents based on the given query and optionally to sort them by relevance. [2]
1.2
What is a Document?
A document is defined as the unit of searching in a full text search system. This can be an email message or a book or a magazine article. There are two possibilities to store a document in a database; firstly, the whole text can be stored in a single field or a combination of a few fields in the database. Secondly, it can be referred to a document in the file system, whereof only a special type of vector containing the key word (lexemes) is stored. [2]
1.3
Information Retrieval Model
A few Information Retrieval model for Retrieval Adhoc are known, such as the standard Boolean model or the expanded Boolean model [3], the vector space model [4] as well as the probabilistic model [5]. In PostgreSQL, the vector space model is used for the documents in Full Text Search. “The tsvector type represents a document in a form optimized for text search” [6] The advantage of the Vector Space model over the standard Boolean model is that the ranking feature is supported in the former mentioned model.
Stefan Lütolf
Page 5 of 31
Database Seminar
1.4
PostgreSQL Full Text Search
Indexing
If the Search engine deals with a small number of documents it is possible to scan the full documents with every query. This is called serial scanning. If there are plenty of documents to scan, the task of full text search is divided into: first indexing the whole documents and then searching the queries. The indexing task means, that the search engine creates a list of search terms, named a concordance. The indexer makes an entry for each term found in a document. It is possible to ignore stop words, such as “the” or “and”, because these words aren’t as relevant as others. Furthermore, a language-specific stemming on the words can be used; thus, only the basic form of a word is used, like “speak” instead of “spoken”. [2] Using indexes with queries, which deliver most of the rows, is slow, because PostgreSQL prefer to make a sequential scan instead of using the index. [2]
1.5
Full Text Search versus Text Operators
What are the reasons for using Full Text Search instead of operators? There are still a few operators in PostgreSQL, which can be used for textual data types. These are “~, ~*, LIKE, ILIKE”. ILIKE with wildcards: SELECT * FROM sometextpool WHERE content ILIKE '%myquery%' And POSIX regular expressions: SELECT * FROM sometextpool WHERE content ~ 'myquery' The disadvantages of these operators are the followings [2]:
There is no linguistic support
They provide no ranking
Dealing with huge number of documents is very slow
The FTS allows pre-processing the documents:
Parsing documents into tokens
Converting tokens into lexemes (This is a normalized token)
Storing pre-processed documents for searching
Stefan Lütolf
Page 6 of 31
Database Seminar
PostgreSQL Full Text Search
2 Simple Full Text Search Example There are two different datasets. The first is a small set with 2000 rows and maximal 5000 character per text: movie reviews from the department of computer science at Cornel University (polarity dataset v2.0). The second dataset is from the department of computer science as well, however it is a pool of 27886 unprocessed html files [7].
2.1
Preparing
First of all a new table “FullTextSearch” is created with an id and a content column, as illustrated in the box below: --CLEANUP DROP TABLE IF EXISTS FullTextSearch; --INITIALIZATION CREATE TABLE FullTextSearch( id SERIAL NOT NULL, content text NOT NULL ); ALTER TABLE ONLY FullTextSearch ADD CONSTRAINT fulltextsearch_pkey PRIMARY KEY (id); The ER-model of this database: FullTextSearch PK
id content
Figure 1: ER-model To insert data into the database it is possible to use the standard Insert command. The syntax used is shown in the following box: INSERT INTO FullTextSearch(content) VALUES ('a fat cat sat on a mat - it ate a fat rats'); Further information about the self-made php script is given in the Annex B. The next figure shows the inserted dataset of the database.
Figure 2: table content after inserting data
Stefan Lütolf
Page 7 of 31
Database Seminar
2.2
PostgreSQL Full Text Search
Separate Vector Column
Example with the separate vector column A ts_vector is used for full text search in PostgreSQL; in this case a separate column with a ts_vector is necessary. For further performance tests an extra ts_vector is created for a GIN and a GIST Index. The ts_vector can be built with the following SQL syntax: /* ------------------------Separate Ts_Vector column ------------------------*/ --TS_Vector for GIN INDEX ALTER TABLE FullTextSearch ADD COLUMN content_tsv_gin tsvector; UPDATE FullTextSearch SET content_tsv_gin = to_tsvector('pg_catalog.english', content); --TS_Vector for GIST INDEX ALTER TABLE FullTextSearch ADD COLUMN content_tsv_gist tsvector; UPDATE FullTextSearch SET content_tsv_gist = to_tsvector('pg_catalog.english', content); The expanded ER-model is shown in the following figure: FullTextSearch PK
id content content_tsv_gin content_tsv_gist
Figure 3: expanded ER-model with a separate vector column The result of building a separate column is shown in the figure 4.
Figure 4: result of the extra ts_vector column in PostgreSQL
Stefan Lütolf
Page 8 of 31
Database Seminar
PostgreSQL Full Text Search
Background theory of the Text Search Vector “A data type tsvector is provided for storing pre-processed documents, along with a type tsquery for representing processed queries” [8] Pre-processing includes:
“Parsing documents into tokens. It is useful to identify various classes of tokens, e.g., numbers, words, email addresses, so that they can be processed differently.” [8]
“Converting tokens into lexemes. A lexeme is a string, just like a token, but it has been normalized so that different forms of the same word are made alike. For example, normalization almost always includes folding upper-case letters to lower-case, and often involves removal of suffixes (such as “s” or “es” in English). This allows searches to find variant forms of the same word, without tediously entering all the possible variants. Also, this step typically eliminates stop words, which are words that are so common that they are useless for searching. (In short, then, tokens are raw fragments of the document text, while lexemes are words that are believed useful for indexing and searching.) PostgreSQL uses dictionaries to perform this step. Various standard dictionaries are provided, and custom ones can be created for specific needs”. [8]
The tsvector consists of the key words (lexemes) following the position of the key word in the referenced text. The reason to use a ts_vector is shown in the following example. Without the tsvector there is no normalisation and no linguistic support. The following queries illustrate that without the tsvector, the result is not as expected. /* ------------------------Basic text matching with two simple strings ------------------------*/ --Typ Cast-SELECT 'a fat cat sat on a mat and ate fat rats'::tsvector @@ 'cat & rat'::tsquery; --Result: false --Function to_ts..-SELECT to_tsvector('a fat cat sat on a mat and ate fat rats') @@ to_tsquery('cat & rat'); --Result: true, because the ts_vector includes linguistic supports The first query makes only a type cast. Therefore, the “rat” in the query ignores match the “rats” in the tsvector. The second query shows that the linguistic support ignores the “s” from the “rats” and for this reason the query matched.
Stefan Lütolf
Page 9 of 31
Database Seminar
PostgreSQL Full Text Search
Keep the vector up-to-date with a trigger With a trigger it can be guaranteed that the vector is up-to-date. This is used by Insert or modification the text column. With the following syntax, a trigger can be installed: /* ------------------------keep it up-to-date with a Trigger ------------------------*/ --TRIGGER CREATE TRIGGER tsv_gin_update BEFORE INSERT OR UPDATE ON FullTextSearch FOR EACH ROW EXECUTE PROCEDURE tsvector_update_trigger(content_tsv_gin,'pg_catalog.english',c ontent); CREATE TRIGGER tsv_gist_update BEFORE INSERT OR UPDATE ON FullTextSearch FOR EACH ROW EXECUTE PROCEDURE tsvector_update_trigger(content_tsv_gist,'pg_catalog.english', content);
2.3
Index on the Vector Column
Create Index on the example To increase the speed of a query, it is necessary to add an index to the ts_vector column. The syntax is shown in the box below. The first column is added with a GIN Index and the second with a GIST Index. With the former searching is quicker than with the GIST, however it is slower to update with the GIN Index. The Trigram is needed to use the GIN Index on a text column [9], which is for further performance tests. /* ------------------------Create Indexes ------------------------*/ --Index on content (Trigram needed,to use Gin Index) CREATE EXTENSION pg_trgm; CREATE INDEX FullTextSearch_content ON FullTextSearch USING GIN(content gin_trgm_ops); --GIN INDEX on content_tsv_gin CREATE INDEX FullTextSearch_content_tsv_gin ON FullTextSearch USING GIN(content_tsv_gin); --GIST INDEX on content_tsv_gist CREATE INDEX FullTextSearch_content_tsv_gist ON FullTextSearch USING GIST(content_tsv_gist);
Stefan Lütolf
Page 10 of 31
Database Seminar
PostgreSQL Full Text Search
Further Information about an Index There are two different Indexes that can be used to speed up full text search in PostgreSQL. Indexes are not mandatory for text searching, but in cases where a column is searched on a regular basis, an index is usually desirable. The two kinds of Indexes:
GIN (Generalized Inverted Index)
GIST (Generalized Search Tree)
“There are substantial performance differences between the two index types […] A GIST index is lossy […]. GIN indexes are not lossy for standard queries, but their performance depends logarithmically on the number of unique words […]. GIN index lookups are about three times faster than GIST. GIN indexes take about three times longer to build than GIST.”[10]
2.4
Text matching
With the @@ Operator it is possible to match a document (tsvector) with a query (tsquery), It does not matter which data type is written first. A tsquery contains search terms; it is already normalized into lexemes. If the query is still in plaintext, there are functions to convert them into a tsquery. For this conversion the functions to_tsquery and plainto_tsquery are helpful to that. A query may combine multiple terms using AND, OR and NOT operators. The difference between the two functions is that the plainto_tsquery cannot recognize Boolean operators or weight labels. The Boolean AND operator is inserted between surviving words. The following example shows how it is applied to make a query to the database. /* ------------------------Basic Query ------------------------*/ --FTS Search with @@ Operator SELECT * FROM FullTextSearch WHERE content_tsv_gin @@ plainto_tsquery('english', 'movie'); The result of this query retrieved 497 matched rows. These are too many results to check all the matched samples. For this reason, it is useful to use the ranking function of PostgreSQL FTS, which is described in the next chapter.
Stefan Lütolf
Page 11 of 31
Database Seminar
2.5
PostgreSQL Full Text Search
Ranking search results
Ranking and Normalized Ranking Ranking attempts to measure how relevant documents are to a matching query, thus the most relevant match can be shown first. With the Vector space model used in FTS of PostgreSQL, a ranking of the matched text samples can be made. The following syntax can be applied to rank the results: --Ranking blank SELECT id,content, ts_rank_cd(content_tsv_gin, query) AS rank FROM FullTextSearch, to_tsquery('wonderful | good') query WHERE query @@ content_tsv_gin ORDER BY rank DESC LIMIT 10; In the next figure, the result of the example is listed
Figure 5: result of ranking The value of the rank column is a real number; however, this number is not normalized because the ranking functions do not use any global information. To normalize the ranking result, use the additional value of the ts_rank_cd can be used. The value 32 connoted that all ranks are scaled into the range zero to one. --Ranking normalized SELECT id, content, ts_rank_cd(content_tsv_gin, query, 32 /* rank/(rank+1) */) AS rank FROM FullTextSearch, to_tsquery('wonderful | good') query WHERE query @@ content_tsv_gin ORDER BY rank DESC LIMIT 10;
Stefan Lütolf
Page 12 of 31
Database Seminar
PostgreSQL Full Text Search
In the next figure, the normalized result is listed:
Figure 6: result of a normalized ranking
Highlighting Results “To present search results it is ideal to show a part of each document and how it is related to the query.” PostgreSQL has the function ts_headline to highlight text fragments, which matches best to the query. For more information see [11]. The Syntax on the example before: SELECT id, ts_headline(content,query) AS highlighted, ts_rank_cd(content_tsv_gin, query, 32 /* rank/(rank+1) */) AS rank FROM FullTextSearch, to_tsquery('wonderful | good') query WHERE query @@ content_tsv_gin ORDER BY rank DESC LIMIT 10; The following figure presents the result of the ranking with highlighting
Figure 7: Result of ranking and highlighting
Stefan Lütolf
Page 13 of 31
Database Seminar
PostgreSQL Full Text Search
3 Full Text Search Features 3.1
Different Languages
Each vector can have its own language, for example when having a few magazine articles in different languages, each of which can associate with its own language. To identify the language into the vector, the first parameter of the to_tsvector function, shown below, can be used: to_tsvector('pg_catalog.english', content); Another language can be chosen, for example German: to_tsvector('pg_catalog.german', 'hallo, wer ist hier');
3.2
Weights
“The function setweight can be used to label the entries of a tsvector with a given weight, where a weight is one of the letters A, B, C, or D. This is typically used to mark entries coming from different parts of a document, such as title versus body. Because to_tsvector(NULL) will return NULL, it is recommended to use coalesce whenever a field might be null […]” [11] UPDATE table SET column = setweight(to_tsvector(coalesce(title,'')), 'A') || setweight(to_tsvector(coalesce(keyword,'')), 'B') || setweight(to_tsvector(coalesce(abstract,'')), 'C') || setweight(to_tsvector(coalesce(body,'')), 'D'); Note that in this case the stop words are unaccounted for the vector. To remove the stop words, first build a tsvector and thereafter build the weighted column by the following SQL syntax: update fts set text_search_vector_weighted = setweight(text_search_vector, 'A'); “As in basic tsquery input, weight(s) can be attached to each lexeme to restrict it to match only tsvector lexemes of those weight(s).” [11] SELECT to_tsquery('english', 'Fat | Rats:AB'); For further Information see [11]
Stefan Lütolf
Page 14 of 31
Database Seminar
3.3
PostgreSQL Full Text Search
Dictionaries
“Dictionaries are used to eliminate words that should not be considered in a search (stop words), and to normalize words so that different derived forms of the same word will match. A successfully normalized word is called a lexeme. Aside from improving search quality, normalization and removal of stop words reduce the size of the tsvector representation of a document, thereby improving performance” [12].
Synonym Dictionary With this dictionary template, a word with a synonym in a given table can be replaced. For an example directory, an own synonym directory has to be created: “installation directory of PostgreSQL”\share\tsearch_data\my_synonyms.syn. The file looks like this: My_synonyms.syn postgres postgresql postgre paris fulltextsearch
pgsql pgsql pgsql paris fts
To create the own directory in PostgreSQL and install it, the following SQL syntax can be applied: CREATE TEXT SEARCH DICTIONARY my_synonym_dir ( TEMPLATE = synonym, SYNONYMS = my_synonyms ); ALTER TEXT SEARCH CONFIGURATION english ALTER MAPPING FOR asciiword WITH my_synonym_dir, english_stem; The result of a lexeme of the word “fulltextsearch” is “fts” like the entry of the created directory file. To verify this, use the following debug syntax: select * from ts_debug('english','fulltextsearch');
Thesaurus Dictionary The Synonym Dictionary does not support Phrases; instead a Thesaurus Dictionary is useful. For example: the word combination “supernovae stars” can be replaced by the synonym “sn”. The process of installing a thesaurus dictionary is similarly to the Synonym Dictionary.
Others
ISPELL Dictionary (normalize many different linguistic forms)
Snowball Dictionary (recognize everything)
Stefan Lütolf
Page 15 of 31
Database Seminar
PostgreSQL Full Text Search
4 Benchmark 4.1
State of the Art
Existing database benchmarks, like the well-known database benchmarks from the Transaction Processing Performance Council (TPC) [13], do not contain benchmarks for fulltext search scenarios. One reason for this has been the lack of standard in the domain of SQL interfaces for full-text search in the SQL standards - the SQL Multimedia and Application Packages (SQL/MM) - Part 2: Full-Text [14] [15] tries to fill the gap. The emerging of a standard may allow for a widely accepted benchmark in the near future. The information retrieval community on the other side has been using the TREC [16] for benchmarking. The purpose of the text retrieval conference (TREC) is "to support research within the information retrieval community by providing the infrastructure necessary for largescale evaluation of text retrieval methodologies." The TREC benchmarks has the problem, that it is for scientific documents ("To index and search […] most groups used custom retrieval software develop by their own group or by an associated group. "[17]), in given time. Mike McCandless runs Lucene performance tests regularly on the English Wikipedia [18]. At the moment there is no information retrieval benchmark compared to the TPC benchmarks for relational database systems.
4.2
Overview of Alternatives
There is a comparison of Text Search from "Practical full-text search in PostgreSQL", Bill Karwin, PostgreSQL Conference West, 2009/10/17: Alternative
Indexing time
Index
Query speed
Solution
storage size LIKE / RegEx
None
None
12x
SQL
Full-Text
7x
5x
3x
SQL
Open
Source
Indexing Inverted
/
High
1x
5x
SQL / RDBMS
2x
10x
10x
IR / Open Source
1x
5x
1x
IR / Open Source
Offline
Offline
*
Web service
Index Apache Lucene Sphinx Search Search Engine Service Table 1: Comparison of Alternatives [19]
Stefan Lütolf
Page 16 of 31
Database Seminar
4.3
PostgreSQL Full Text Search
PostgreSQL
Testing System: Query Tool of pgAdmin III. Configuration of PostgreSQL: Out of the Box installation. Version: PostgreSQL 9.2 (32Bit). Dataset: Described in Chapter 2.
Comparison of the two dataset The dataset is processed and inserted into the database with a self-made php script, which is given in Annex B. All the values in the table below are measured in milliseconds. The SQL syntax on this test is shown in Annex A. Dataset
Loading Data
Create
Indexing
Indexing
to the DB
tsvector
time (GIN)
time (GIST)
Movie Review (small)
2’575
4’485
1’437
625
Movie Review (large)
77’777
77’755
30’504
3’672
Table 2: Comparison of the Dataset The following Graph shows the results in a diagram:
Initialization Movie Review (large)
Indexing time (GIST)
Indexing time (GIN)
Create tsvector Loading Data to the DB
Movie Review (small)
3.672 0.625 30.504 1.437 77.755 4.485 77.777 2.575 values in seconds
Figure 8: diagram initialization process runtime
Stefan Lütolf
Page 17 of 31
Database Seminar
PostgreSQL Full Text Search
Query comparison on the two dataset In Annex C is the SQL script of the performance test queries for the small dataset is given, the results for the large dataset is shown in Annex D. This is a comparison of the different possible text queries, such as Posix (RegEx), LIKE and a query with FTS supports (@@ operator). The FTS query is split into two different kinds of queries. First there is an equality search (small range) and thereafter, a search with a wide array of results is done. All the values in the table below are measured in milliseconds. Test without using indexes: Dataset
Posix
LIKE
LIKE
LIKE
FTS
FTS
RegEx
%word%
word%
%word
Equality
result
large
Movie
1688
1640
547
578
63
1610
Review
(1631
(1631
(8 rows)
(0 rows)
(1 row)
(1626 rows)
(small)
rows)
rows)
Movie
23705
23751
1578
1969
516
24751
Review
(15839
(17029
(0 rows)
(0 rows)
(17 rows)
(16608 rows)
(large)
rows)
rows)
Table 3: Query test without any Indexes
Query Tests without Index Movie Review (small) Posix RegEx LIKE %word%
1.688 1.64
LIKE word%
0.547 1.578
LIKE %word
0.578 1.969
FTS Equality FTS large result
Movie Review (large)
23.705 23.751
0.063 0.516 1.61
24.751
values in seconds
Figure 9: diagram query tests runtime without indexes
Stefan Lütolf
Page 18 of 31
Database Seminar
PostgreSQL Full Text Search
Test using indexes: Dataset
Posix
LIKE
LIKE
LIKE
FTS
FTS
RegEx
%word%
word%
%word
Equality
result
large
Movie
1688
1688
469
453
GIN:31
GIN: 1609
Review
(1631
(1631
(8 rows)
(0 rows)
GIST:32
GIST:1609
(small)
rows)
rows)
(1 row)
(1626rows)
Movie
23689
11969
1109
1344
GIN:79
GIN:24673
Review
(15839
(17029
(0 rows)
(0 rows)
GIST:125
GIST:24658
(large)
rows)
rows)
(17 rows)
(16608 rows)
Table 4: Query test using indexes
Query Tests with Index Movie Review (small) Posix RegEx
1.688
LIKE %word%
1.688
LIKE word%
0.469 1.109
LIKE %word
0.453 1.344
FTS Equality (GIN)
0.031 0.079
FTS Equality (GIST)
0.032 0.125
FTS large result (GIN)
1.609
FTS large result (GIST)
1.609
Movie Review (large)
23.689 11.969
24.673 24.673
values in seconds
Figure 10: diagram query tests with indexes
Stefan Lütolf
Page 19 of 31
Database Seminar
4.4
PostgreSQL Full Text Search
Apache Lucene /Solr
Apache SOLR is used for testing [20]. The setup of Solr and the schemas of the datasets, which is configured in Solr, are given in the Annex E. The initialization process time is shown in the table below. The initialization process time is divided into two tasks, first converting the data to an xml and thereafter to import it into the database. It isn’t clear if the indexing process time is included in the import process time. Note, the values are in milliseconds. Dataset
Convert to XML
Import
Movie Review (small)
1679
3800
Movie Review (large)
40420
32000
Table 5: Initialization in Apache SOLR The following graph shows the results in a diagram:
Initialization Movie Review (small)
Movie Review (large)
1.679 Convert to XML 40.42
3.8 Import 32
values in seconds Figure 11: Initialization of Lucene /Solr
Stefan Lütolf
Page 20 of 31
Database Seminar
PostgreSQL Full Text Search
Query Test using indexes (see schema.xml in Annex E for detail): The measure time contains first the prepare time and second the process time. The values are in milliseconds Dataset
Wildcard
Wildcard
Wildcard
FTS
FTS
*word*
+word* -word
*word –word
Equality
large Result set
Movie Review
16 + 93
0 + 15
16 + 15
0 + 16
0+0
(1657
(120 results)
(0 results)
(3
(1537 results)
(small)
results) Movie Review (large)
results)
31 + 156
0 + 63
0 + 63
0 + 15
0 + 16
(15338
(595 results)
(15 results)
(2
(14724 results)
results)
results)
Table 6: Lucene Query tests
The following Graph shows the results in a diagram:
Query Tests with Lucene / Solr Movie Review (small) prepare Movie Review (small) process Movie Review (large) prepare Movie Review (large) process
16 Wildcard *word*
93
31
Wildcard +word* word
0 0
Wildcard *word – word
0
FTS Equality
0 0
FTS large Resulat set
0 0 0
156
15 63 16 15 63 16 15
16 values in milliseconds Figure 12: Query Tests with Lucene /Solr
Stefan Lütolf
Page 21 of 31
Database Seminar
PostgreSQL Full Text Search
5 Conclusion To sum up, it can be said that Full Text search is a useful feature of PostgreSQL. As a vector can be created out of some text, it is possible to store only the keywords in a database and therefore less storage as well as time is needed for the query process. This need less storage and speed up the query process time. In addition to that, an index can be used to speed up the search query. However if the query delivers about the whole dataset, the database ignores the index and makes a whole sequence scan which takes considerably more time. Furthermore, the Synonym Dictionary makes it possible to create its own dictionary with similar words. This feature is useful for shortcuts or synonyms in the technical language.
The performance test is a sensitive issue; because of this the next remarks are a nonexhaustive enumeration:
To create a GIST Index is much faster than to create a GIN Index. Searching with a GIN index is faster than with a GIST. (PostgreSQL)
The tests with wildcards (LIKE) shows that the index (GIN) increases the speed by querying (PostgreSQL). The wildcards queries are faster with Apache Lucene / Solr.
Fulltextsearch queries with a small result set (equality) are fast with both tested solution and the index increases the speed of the queries.
Fulltextsearch queries with a large result set are fast with Apache Lucene / Solr but with PostgreSQL it is very slow. The reason could be the scan variant, because PostgreSQL makes a sequential scan instead of an index scan.
Stefan Lütolf
Page 22 of 31
Database Seminar
PostgreSQL Full Text Search
6 Bibliography [1] „Full text search“, Wikipedia, the free encyclopedia. 23-Nov-2012. [2] „PostgreSQL: Documentation: 9.2: Full Text Search“. [Online]. Available: http://www.postgresql.org/docs/9.2/static/textsearch.html. [Accessed: 26-Nov-2012]. [3] „Standard Boolean model - Wikipedia, the free encyclopedia“. [Online]. Available: http://en.wikipedia.org/wiki/Standard_Boolean_model. [Accessed: 04-Dez-2012]. [4] „Vector space model“, Wikipedia, the free encyclopedia. 30-Nov-2012. [5] „Probability theory - Wikipedia, the free encyclopedia“. [Online]. Available: http://en.wikipedia.org/wiki/Probability_theory. [Accessed: 04-Dez-2012]. [6] „PostgreSQL: Documentation: 9.2: Text Search Types“. [Online]. Available: http://www.postgresql.org/docs/9.2/static/datatype-textsearch.html. [Accessed: 04-Dez2012]. [7] „Data“. [Online]. Available: http://www.cs.cornell.edu/people/pabo/movie-review-data/. [Accessed: 10-Dez-2012]. [8] „PostgreSQL: Documentation: 9.2: Introduction“. [Online]. Available: http://www.postgresql.org/docs/9.2/static/textsearch-intro.html. [Accessed: 07-Dez-2012]. [9] „PostgreSQL: Documentation: 9.1: pg_trgm“. [Online]. Available: http://www.postgresql.org/docs/9.1/static/pgtrgm.html. [Accessed: 10-Dez-2012]. [10] „PostgreSQL: Documentation: 9.2: GiST and GIN Index Types“. [Online]. Available: http://www.postgresql.org/docs/9.2/static/textsearch-indexes.html. [Accessed: 07-Dez2012]. [11] „PostgreSQL: Documentation: 9.2: Controlling Text Search“. [Online]. Available: http://www.postgresql.org/docs/9.2/static/textsearch-controls.html. [Accessed: 07-Dez2012]. [12] „PostgreSQL: Documentation: 9.2: Dictionaries“. [Online]. Available: http://www.postgresql.org/docs/9.2/static/textsearch-dictionaries.html. [Accessed: 10Dez-2012]. [13] „TPC - Homepage“. [Online]. Available: http://www.tpc.org/. [Accessed: 04-Dez-2012]. [14] „ISO/IEC 13249-2:2003 - Information technology -- Database languages -- SQL multimedia and application packages -- Part 2: Full-Text“. [Online]. Available: http://www.iso.org/iso/home/store/catalogue_tc/catalogue_detail.htm?csnumber=31368. [Accessed: 04-Dez-2012]. [15] „32N0760T.pdf“. . [16] „Text REtrieval Conference (TREC) Home Page“. [Online]. Available: http://trec.nist.gov/. [Accessed: 04-Dez-2012]. [17] „TERA.OVERVIEW.pdf“. . [18] „Lucene nightly benchmarks“. [Online]. Available: http://people.apache.org/~mikemccand/lucenebench/. [Accessed: 04-Dez-2012]. [19] Karwin Software Solutions LLC, „Full Text Search In PostgreSQL“, 03-Nov-2009. [20] „Apache Lucene - Apache Solr“. [Online]. Available: http://lucene.apache.org/solr/. [Accessed: 20-Dez-2012].
Stefan Lütolf
Page 23 of 31
Database Seminar
PostgreSQL Full Text Search
Annex A: SQL Script /* ------------------------Initialization ------------------------*/ --CLEANUP DROP TABLE IF EXISTS FullTextSearch; --INITIALIZATION CREATE TABLE FullTextSearch( id SERIAL NOT NULL, content text NOT NULL ); ALTER TABLE ONLY FullTextSearch ADD CONSTRAINT fulltextsearch_pkey PRIMARY KEY (id); --INSERT (see php skript to fill DB with more data) INSERT INTO FullTextSearch(content) VALUES ('a fat cat sat on a mat - it ate a fat rats');
/* ------------------------Basic text matching with two simple strings ------------------------*/ --Typ Cast-SELECT 'a fat cat sat on a mat and ate fat rats'::tsvector @@ 'cat & rat'::tsquery; --Result: false --Function to_ts..-SELECT to_tsvector('a fat cat sat on a mat and ate fat rats') @@ to_tsquery('cat & rat'); --Result: true, because the ts_vector includes linguistic supports
Stefan Lütolf
Page 24 of 31
Database Seminar
PostgreSQL Full Text Search
/* ------------------------Separate Ts_Vector column ------------------------*/ --TS_Vector for GIN INDEX ALTER TABLE FullTextSearch ADD COLUMN content_tsv_gin tsvector; UPDATE FullTextSearch SET content_tsv_gin = to_tsvector('pg_catalog.english', content); --TS_Vector for GIST INDEX ALTER TABLE FullTextSearch ADD COLUMN content_tsv_gist tsvector; UPDATE FullTextSearch SET content_tsv_gist = to_tsvector('pg_catalog.english', content);
/* ------------------------keep it up-to-date with a Trigger ------------------------*/ --TRIGGER CREATE TRIGGER tsv_gin_update BEFORE INSERT OR UPDATE ON FullTextSearch FOR EACH ROW EXECUTE PROCEDURE tsvector_update_trigger(content_tsv_gin,'pg_catalog.english',c ontent); CREATE TRIGGER tsv_gist_update BEFORE INSERT OR UPDATE ON FullTextSearch FOR EACH ROW EXECUTE PROCEDURE tsvector_update_trigger(content_tsv_gist,'pg_catalog.english', content);
/* ------------------------Basic Query ------------------------*/ --FTS Search with @@ Operator SELECT * FROM FullTextSearch WHERE content_tsv_gin @@ plainto_tsquery('english', 'movie');
Stefan Lütolf
Page 25 of 31
Database Seminar
PostgreSQL Full Text Search
/* ------------------------Create Indexes ------------------------*/ --Index on content (Trigram needed,to use Gin Index) CREATE EXTENSION pg_trgm; CREATE INDEX FullTextSearch_content ON FullTextSearch USING GIN(content gin_trgm_ops); --GIN INDEX on content_tsv_gin CREATE INDEX FullTextSearch_content_tsv_gin ON FullTextSearch USING GIN(content_tsv_gin);
--GIST INDEX on content_tsv_gist CREATE INDEX FullTextSearch_content_tsv_gist ON FullTextSearch USING GIST(content_tsv_gist);
/* ------------------------Ranking (and normalized ranking) ------------------------*/ --Ranking blank SELECT id,content, ts_rank_cd(content_tsv_gin, query) AS rank FROM FullTextSearch, to_tsquery('wonderful | good') query WHERE query @@ content_tsv_gin ORDER BY rank DESC LIMIT 10; --Ranking normalized SELECT id, content, ts_rank_cd(content_tsv_gin, query, 32 /* rank/(rank+1) */) AS rank FROM FullTextSearch, to_tsquery('wonderful | good') query WHERE query @@ content_tsv_gin ORDER BY rank DESC LIMIT 10; --Highlighting and Ranking SELECT id, ts_headline(content,query) AS highlighted, ts_rank_cd(content_tsv_gin, query, 32 /* rank/(rank+1) */) AS rank FROM FullTextSearch, to_tsquery('wonderful | good') query WHERE query @@ content_tsv_gin ORDER BY rank DESC LIMIT 10;
Stefan Lütolf
Page 26 of 31
Database Seminar
PostgreSQL Full Text Search
Annex B: PHP Script id
Stefan Lütolf
Page 30 of 31
Database Seminar
PostgreSQL Full Text Search
Schema of large Dataset: id
Stefan Lütolf
Page 31 of 31