Full Text Search Agent Throughput

Author: Phebe Booth

Full Text Search Agent Throughput Best Practices

Perceptive Content, Version: 7.1.x

Written by: Product Knowledge, R&D
Date: July 2016

© 2016 Lexmark. All rights reserved. Lexmark is a trademark of Lexmark International Technology, S.A., or its subsidiaries, registered in the U.S. and/or other countries. All other trademarks are the property of their respective owners. No part of this publication may be reproduced, stored, or transmitted in any form without the prior written permission of Lexmark.

Content Server Throughput Best Practices

Table of Contents

Introduction to Full Text Search Agent Throughput
  About Full Text Search Agent
  Full Text Search Agent architecture
  Full Text Search Agent performance model
    About interaction with OSM Agent
    Full Text Agent internals
About scaling Full Text Search Agent throughput and performance
    General tuning
    Worker thread tuning
External performance influences and effects
  Recognition Agent
  System impact of tuning Full Text Agent
Content Index Throughput Testing Results
  Constants
  Index performance
Full Text Search Performance Testing Results
  Constants
  Search performance of text documents
  Search performance of Microsoft Word documents
  Search performance of PDF documents
  Search performance – varying Full Text Agent worker threads
Full Text Performance with Large-Scale Content Collections
  System specs
  Scalability recommendations

Introduction to Full Text Search Agent Throughput

For customers who process high numbers of pages, Perceptive Software recommends tuning systems for high throughput and efficiency to maximize performance. This document provides information to help you determine your throughput performance needs for Full Text Search Agent based on the volume of documents handled by your organization using Perceptive Content.

About Full Text Search Agent

Full Text Search Agent collects text data and creates searchable indexes based on the content of each document page. This allows you to identify documents containing specific keywords or phrases via content searches executed inside Perceptive Content. Content searches allow you to perform various types of full text searches, including fuzzy, phonetic, stemming and synonym, proximity, and relevance ranking searches. Your search can also contain wildcards to represent one or more characters. The server displays the document pages that contain matches for the search and highlights the matching keywords within the documents.

Full Text Search Agent architecture

For more information about the interaction between Full Text Agent and other Perceptive Content components, review the following list.

• RabbitMQ Message Broker. All documents submitted to Full Text Search Agent are delivered to the agent as jobs through the RabbitMQ message broker.
• OSM Agent. This agent is responsible for providing Full Text Agent with the physical files to index. When Full Text Agent receives a request to index a document, it sends a request to OSM Agent to retrieve the document for indexing.
• Content Collections. This database resides on the same server node as Full Text Agent and is required to store document indexing information.
• Perceptive Content Database. Full Text Agent communicates with the system database to update the document content status. Perceptive Content Database also stores the information necessary to link content index information to the documents stored in your system.
• Recognition Agent. This agent handles text recognition for raster-based (graphical) documents. Recognition Agent can strongly affect throughput performance. For more information about optimizing Recognition Agent throughput, refer to the Recognition Agent Throughput Best Practices Guide.

Full Text Search Agent performance model

About interaction with OSM Agent

Because Full Text Agent has no direct access to the Perceptive Content OSM, it relies on OSM Agent to deliver the physical files needed for indexing. OSM Agent maintains a pool of worker threads to handle these incoming requests for OSM objects. Other system agents may also contact OSM Agent, so the number of incoming requests must be balanced against the number of worker threads. OSM Agent's throughput therefore depends entirely on the availability of its worker threads and the speed of the OSM storage.
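As a rough illustration of why request volume must be balanced against worker threads, the following Python sketch estimates how long a fixed pool takes to clear a backlog of file requests. It is a simplified model, not OSM Agent's actual scheduler: the function name, the uniform fetch time, and the sample numbers are all assumptions made for illustration.

```python
import math


def estimated_drain_seconds(pending_requests: int,
                            worker_threads: int,
                            seconds_per_fetch: float) -> float:
    """Estimate the time for a fixed worker pool to clear a backlog.

    Assumes (hypothetically) that every fetch takes the same time and that
    each worker handles one request at a time, so the backlog is served in
    "waves" of at most `worker_threads` requests.
    """
    if worker_threads < 1:
        raise ValueError("worker_threads must be at least 1")
    waves = math.ceil(pending_requests / worker_threads)
    return waves * seconds_per_fetch


# Hypothetical numbers: 100 pending requests at 0.5 seconds per fetch.
print(estimated_drain_seconds(100, 4, 0.5))  # 25 waves of 0.5 s
print(estimated_drain_seconds(100, 8, 0.5))  # 13 waves of 0.5 s
```

In this model, doubling the workers roughly halves the drain time only while storage can sustain the extra concurrent fetches, which is why the speed of the OSM storage matters as much as the thread count.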

Full Text Agent internals

When Full Text Agent receives indexing jobs from the message broker, it does not start processing them immediately. Instead, it waits for one of two conditions to be met.

• Internal timer condition. Measures time. When the duration on the internal timer expires, Full Text Agent starts indexing all documents it has received.
• High water mark condition. Measures document pages. When this high water mark is met, Full Text Agent starts indexing the pages.
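The two start conditions can be sketched in Python as a small buffer that releases its pages when either trigger fires. This is an illustrative model only: the class name, its methods, and the choice to start the timer when the first page of a batch arrives are assumptions, not a description of the agent's internal code.

```python
import time
from typing import Callable, List, Optional


class SubmissionBuffer:
    """Hypothetical buffer that holds pages until a timer or page count fires."""

    def __init__(self, delay_seconds: float, page_threshold: int,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.delay_seconds = delay_seconds    # internal timer duration
        self.page_threshold = page_threshold  # high water mark, in pages
        self._clock = clock
        self._pages: List[str] = []
        self._timer_start: Optional[float] = None

    def add_page(self, page_id: str) -> None:
        # Assumption: the timer starts when the first page of a batch arrives.
        if not self._pages:
            self._timer_start = self._clock()
        self._pages.append(page_id)

    def should_flush(self) -> bool:
        """Return True when either start condition is met."""
        if not self._pages:
            return False
        high_water_mark_met = len(self._pages) >= self.page_threshold
        timer_expired = (self._clock() - self._timer_start) >= self.delay_seconds
        return high_water_mark_met or timer_expired

    def flush(self) -> List[str]:
        """Hand over the buffered pages for indexing and reset the buffer."""
        batch, self._pages = self._pages, []
        self._timer_start = None
        return batch
```

Under this model, a lower threshold or a shorter delay makes `should_flush` return true sooner, at the cost of smaller batches.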

About scaling Full Text Search Agent throughput and performance

For more information on scaling Full Text Search Agent throughput and performance, refer to the following topics.

General tuning

To improve response time for content jobs, tune Full Text Agent to start jobs at a lower high water mark or a shorter timer delay. Changing these two settings reduces the time that Full Text Agent waits to start indexing content jobs. To configure Full Text Agent to wait for a maximum of one minute or to begin indexing when it receives 100 pages, complete the following steps.

1. On the Perceptive Content Server computer, navigate to the [drive:]\inserver\etc\ folder and then open the inow.ini file in a text editor.
2. In the [Full Text] group, change the following settings:
   • Decrease content.submission.delay seconds to 60, for this example.
   • Decrease content.submission.threshold to 100, for this example.
3. Save and close the file.
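If you prefer to script the change rather than edit the file by hand, the following Python sketch applies the same two values to the text of an INI file. It assumes the group appears as a [Full Text] section and that the keys are spelled content.submission.delay and content.submission.threshold; confirm the exact names in your own inow.ini first. The function name and the starting values in the sample are hypothetical. Note that configparser drops comments when it rewrites a file, so work on a backup copy.

```python
import configparser
import io


def lower_submission_triggers(ini_text: str, delay_seconds: int,
                              page_threshold: int) -> str:
    """Return ini_text with the two Full Text submission settings replaced."""
    parser = configparser.ConfigParser(interpolation=None)
    parser.optionxform = str  # keep key names exactly as written
    parser.read_string(ini_text)

    section = "Full Text"  # assumed section name, taken from the steps above
    if not parser.has_section(section):
        raise KeyError(f"section [{section}] not found")

    # Assumed key names; verify them against your inow.ini.
    parser[section]["content.submission.delay"] = str(delay_seconds)
    parser[section]["content.submission.threshold"] = str(page_threshold)

    out = io.StringIO()
    parser.write(out, space_around_delimiters=False)
    return out.getvalue()


# Hypothetical starting values, for illustration only.
sample = (
    "[Full Text]\n"
    "content.submission.delay=300\n"
    "content.submission.threshold=1000\n"
)
print(lower_submission_triggers(sample, delay_seconds=60, page_threshold=100))
```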

Worker thread tuning

You should only increase the number of worker threads for OSM Agent in cases where the existing worker thread availability is low.

External performance influences and effects

Recognition Agent

One of the most critical performance considerations for a Full Text Search Agent solution is Recognition Agent. Extracting text from images via OCR is a resource-intensive operation. For more information, refer to the Perceptive Content Recognition Agent help under Manage Content>Content System>Manage agents.

System impact of tuning Full Text Agent

Much of the performance tuning for Full Text Search Agent requires adjustments outside of the Full Text Agent configuration. Before making any changes to the configuration of OSM Agent and Recognition Agent, you should evaluate the changes for any performance impact on other business processes and the Perceptive Content system as a whole. All Full Text Search Agent installations function as remote installations. There is no performance or throughput benefit (other than one based on network performance) to an installation local to the Perceptive Content Server. Currently, Perceptive Software supports Full Text Search Agent installations only on 32-bit operating systems.

Content Index Throughput Testing Results

The data in the following test runs was imported using Import Agent. System documents consisted of one to six random pages of text. For non-legacy runs, Import Agent was configured to automatically submit new content for indexing. Legacy runs were configured to index only content added for this test.

Constants

• The average text page size was approximately 3 KB, with each page representing a single document page. Because of how Import Agent submits new pages to content for indexing, the results for this test are lower than would normally be expected. As part of its processing, Import Agent submits each newly added page to Full Text Search Agent as if it were a unique document. This causes each document to be reviewed for indexing once per page processed by Import Agent.
• Each PDF document imported into the system was a single PDF file averaging 96 KB in size.
• Each MS Word document imported into the system was a single Word document averaging 32 KB in size.
• Full Text Agent was configured to submit legacy documents to Full Text Search Agent as fast as possible to keep the agent saturated with requests.
• Configuration for legacy document submission was the same as in Test Run 4. Unlike Test Run 1, each document was submitted a single time, which resulted in fewer duplicate requests and eliminated additional processing by Full Text Agent.

Index performance

[Figure: Pages Indexed per Minute, by data type (Text, PDF, Doc, Legacy PDF, Legacy Text).]

Test Run  Data Type    Pages Indexed  Run Time  Pages Indexed per Minute
1         Text         45993          2:21:00   326
2         PDF          13582          1:14:00   183
3         Doc          13582          0:42:00   323
4         Legacy PDF   13582          1:14:00   183
5         Legacy Text  47328          0:40:00   1183
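The rate column follows directly from the other two: pages indexed divided by run time in minutes, with the fraction dropped. The short Python sketch below recomputes it from the table values. The function and variable names are illustrative, and truncating rather than rounding is inferred from the published figures (13582 pages over 74 minutes is about 183.5, reported as 183).

```python
def pages_per_minute(pages_indexed: int, run_time: str) -> int:
    """Whole pages indexed per minute for a run time given as H:MM:SS."""
    hours, minutes, seconds = (int(part) for part in run_time.split(":"))
    total_seconds = hours * 3600 + minutes * 60 + seconds
    # Integer division drops the fractional page, matching the table.
    return pages_indexed * 60 // total_seconds


# (test run, data type, pages indexed, run time) copied from the table.
RUNS = [
    (1, "Text", 45993, "2:21:00"),
    (2, "PDF", 13582, "1:14:00"),
    (3, "Doc", 13582, "0:42:00"),
    (4, "Legacy PDF", 13582, "1:14:00"),
    (5, "Legacy Text", 47328, "0:40:00"),
]

for run, data_type, pages, run_time in RUNS:
    print(run, data_type, pages_per_minute(pages, run_time))
```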

Full Text Search Performance Testing Results

The data in the following test runs was imported using Import Agent. System documents consisted of one to six random pages of text.

Constants

During the test, the following constants were used:

• 64-bit Perceptive Content Server running on Windows Server 2008 R2 Enterprise, 64-bit Operating System, 8 CPU cores, 48 GB RAM.
• 32-bit Full Text Agent running on Windows Server 2008, 32-bit Operating System, 8 CPU cores, 48 GB RAM.

Search performance of text documents

This scale-up test varied the number of user connections with each test run. Each user performed a random full text search and opened five random documents from the search results. Simulated user wait times between search and opening of the documents were between 10 and 120 seconds. The database contained approximately 13,500 text files previously indexed by Full Text Agent.

Test Run  Connected Users  Full Text Agent CPU  Total Searches  Avg Search Time (sec)
1         200              3%                   1,434           1.829
2         400              5%                   2,880           1.114
3         600              8%                   4,290           0.948
4         800              11%                  5,822           0.910
5         1000             14%                  7,057           0.975
6         1200             17%                  8,368           0.985
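One way to read the table is to divide total searches by connected users. The Python sketch below does this with the values above; the variable and function names are illustrative. Each run works out to roughly seven searches per user, which suggests that search volume grew in step with the user count while average search time stayed close to one second. This is an interpretation of the published figures, not a stated test result.

```python
# (connected users, total searches) copied from the text-document table.
TEXT_RUNS = [
    (200, 1434),
    (400, 2880),
    (600, 4290),
    (800, 5822),
    (1000, 7057),
    (1200, 8368),
]


def searches_per_user(runs):
    """Return total searches divided by connected users for each run."""
    return [total / users for users, total in runs]


for (users, _), ratio in zip(TEXT_RUNS, searches_per_user(TEXT_RUNS)):
    print(f"{users} users: {ratio:.2f} searches per user")
```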

Search performance of Microsoft Word documents

This scale-up test varied the number of user connections with each test run. Each user performed a random full text search and opened five random documents from the search results. Simulated user wait times between search and opening of the documents were between 10 and 120 seconds. The database contained approximately 13,500 Microsoft Word documents previously indexed by Full Text Agent.

Test Run  Connected Users  Full Text Agent CPU  Total Searches  Avg Search Time (sec)
1         200              3%                   1,441           1.725
2         400              6%                   2,833           1.065
3         600              9%                   4,300           0.969
4         800              13%                  5,792           1.100
5         1000             15%                  7,231           1.004
6         1200             18%                  8,341           1.088

Search performance of PDF documents

This scale-up test varied the number of user connections with each test run. Each user performed a random full text search and opened five random documents from the search results. Simulated user wait times between search and opening of the documents were between 10 and 120 seconds. The database contained approximately 13,500 PDF documents previously indexed by Full Text Agent.

Test Run  Connected Users  Full Text Agent CPU  Total Searches  Avg Search Time (sec)
1         200              4%                   1,416           1.975
2         400              8%                   2,868           1.760
3         600              14%                  4,252           2.145
4         800              17%                  5,647           1.692
5         1000             23%                  7,010           2.144
6         1200             27%                  8,218           1.994

Search performance – varying Full Text Agent worker threads

This scale-up test varied the number of Full Text Agent worker threads. We collected data for test runs with 4, 8, 12, 16, 20, and 24 worker threads. Each test run was configured with 600 users, and each of these users performed random full text searches and opened one random document from the search results. Simulated user wait time between searches was between 10 and 30 seconds. We ran this test against a database with about 13,500 PDF documents previously indexed by Full Text Agent.

[Figure: Search Performance - PDF. Average search time (sec) versus number of worker threads (4 to 24).]

Test Run  FT Threads  Full Text Agent CPU  Total Searches  Avg Search Time (sec)
1         4           28%                  16,617          1.810
2         8           55%                  15,239          2.147
3         12          68%                  17,010          1.793
4         16          73%                  16,880          1.733
5         20          64%                  16,404          1.974
6         24          69%                  16,716          1.788

[Figure: CPU Utilization - PDF. CPU (%) versus number of worker threads (4 to 24) for inserver, inserverFT, and imagenow db.]

Full Text Performance with Large-Scale Content Collections

This scale-up test investigated search and indexing performance of large-scale content collections. Testing consisted of capturing search and indexing metrics at different milestones, during which Full Text Search Agent indexed over 4 million documents. The tests resulted in these outcomes:

• Search and indexing functionality remained stable through the steady increase in content collection size.
• Search performance and indexing performance both scaled successfully as content collection size increased.
• When collection size was increased by more than 300%, indexing throughput decreased only by 15%.
• Average search times gained approximately one additional second throughout the entire test.

Resource consumption for inserverFT increased with content collection size. See the following graph:

Actual resource utilization varies by hardware.

System specs

• Intel Xeon X5550
• 16 GB RAM
• iSCSI EqualLogic Storage

Scalability recommendations

For environments expecting content search volume to exceed 50 searches per minute, Perceptive Software recommends that you make the following configuration change to the inserverFT.ini file:

num.connection.workers=5

With this setting (five incoming search request workers), the server can process over 100 searches per minute.
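The recommendation reduces to a single threshold check, sketched below in Python. The function and constant names are hypothetical; only the 50 searches per minute threshold and the num.connection.workers=5 value come from this guide.

```python
from typing import Optional

# Values taken from the recommendation above.
SEARCHES_PER_MINUTE_THRESHOLD = 50
RECOMMENDED_CONNECTION_WORKERS = 5


def recommended_ft_setting(expected_searches_per_minute: float) -> Optional[str]:
    """Return the inserverFT.ini line to apply, or None if no change is advised."""
    if expected_searches_per_minute > SEARCHES_PER_MINUTE_THRESHOLD:
        return f"num.connection.workers={RECOMMENDED_CONNECTION_WORKERS}"
    return None


print(recommended_ft_setting(80))  # expected volume above the threshold
print(recommended_ft_setting(30))  # expected volume below the threshold
```

The guide gives no recommendation for volumes at or below the threshold, so the sketch returns None in that case rather than guessing at a value.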
