Database Optimization for Novelty Detection

Ong Chun Lin, Agus T. Kwee, and Flora S. Tsai
School of Electrical & Electronic Engineering, Nanyang Technological University, Singapore

Abstract— Research in the area of optimizing databases in Database Management Systems (DBMS) has been evolving constantly. Today, programming languages are being integrated into database systems to help professional programmers develop software quickly to meet deadlines. The design of a database must therefore cater to both the needs of customers and the efficiency of database processes. In this paper, a database application, novelty detection, is used to detect novel documents for readers who do not want to read redundant documents. This application needs a database to store historical and current documents. The objective of this research is to optimize the database tables for up to 10 million records. Experiments are conducted at both the sentence level and the document level. At both levels, data optimization and the use of proper indexing are investigated. In MYSQL, the B-Tree index is used to speed up data selection, and the use of EXPLAIN enables us to index the correct data columns and to avoid redundant indexing. Data types are also optimized to ensure that no extra work is done by MYSQL in selecting data. A technique known as batching is also introduced to speed up the insertion of results after novelty detection has been performed. Overall, the combined optimizations improved the speed by up to 90%. We have therefore successfully optimized the database for novelty detection, and the techniques have been integrated into a real-time novelty detection application.

Index Terms—database optimization, MYSQL, indexing, novelty detection
I. INTRODUCTION

Research on optimizing database systems has been an ongoing process, and Database Management Systems (DBMS) have been evolving rapidly. Today, algorithms and data are being unified by integrating programming languages with database systems. The motivation to optimize database systems for business processes is important in a world where database technology is constantly improving. Moreover, as lifestyles become fast-paced, slow and inefficient database applications are deemed unacceptable. One of the key challenges and motivations is to integrate programming languages with databases. In addition, the optimization of databases with large record sets and heavy workloads is a study of interest; such applications include health monitoring services. It is therefore important that a database system be properly planned and benchmarked to avoid inefficiencies and errors.
978-1-4244-4657-5/09/$25.00 ©2009 IEEE
In this paper, we study different database optimization techniques that can be employed for novelty detection, the process of singling out novel information from a given set of text documents. Today, information arrives at an escalating rate in news, blogs, forums, and social networks. Repeated information, arising from competition among various companies and firms, can make searches longer because of redundancy. Research on novelty detection goes hand in hand with database systems. Therefore, in this paper we used a novelty detection application implemented in C++ that is tied to a database; the database software used is MYSQL, an open-source database system. In novelty detection, users send different documents to be tested for their relevance and novelty. Because the database holds millions of records, the insertion and selection of data have to be kept optimal. Therefore, the operation timing on each table is investigated to ensure that the least amount of time is taken for the whole novelty detection process. These results are compared with those of the original tables, and are then benchmarked for further work on duplicating larger record sets. The objectives of this paper are to design, develop, and optimize techniques for MYSQL database retrieval for novelty detection, which has not been well studied before. This paper is organized as follows. First, literature reviews of storage engines and indexing schemes are provided. Next, the detailed methodologies of indexing and optimization techniques are described. The experimental results are then presented, followed by the conclusions.

II. LITERATURE REVIEW
A. Storage Engines

MYSQL offers two different types of tables, grouped into transactional and non-transactional tables. To understand transactional tables, an understanding of transactions is important. In MYSQL, a transaction is a group of SQL queries that must complete as a unit. An example is a banking process, where depositing cash requires authenticating and updating bank accounts; such an application has to complete at all costs to avoid a
mismanagement of cash, which could have dire results. Transactional tables therefore ensure that transactions are "safe" and completed. However, transactional tables come at the cost of efficiency. An example of a transactional table is InnoDB. Non-transactional tables, on the other hand, offer a good compromise in favor of efficiency: data processing is often faster than in transaction-safe tables, one reason being the absence of the extra overhead that transactional tables incur for safe transactions. MYISAM and MEMORY are examples of non-transactional tables.

B. Indexing

Indexes are separate data structures that allow a DBMS to locate particular records in the file of the base table more quickly. Small tables with a small data size usually do not pose many performance issues for MYSQL. However, as tables grow, retrieving data records can become very inefficient, resulting in long waiting times. Fortunately, indexing in MYSQL helps retrieve data records faster and more efficiently. For example, indexing helps the following simple SQL statement retrieve and sort data faster:

SELECT AVG(AGE) FROM USER GROUP BY CITY;

If the columns AGE and CITY are not indexed, MYSQL has to scan the whole table to read AGE, and then scan it again to sort for GROUP BY CITY, which makes the selecting and sorting processes slow. Since an index is always sorted, similar to the index of an encyclopedia, MYSQL does not need to do a whole-table lookup; an index covering CITY and AGE will therefore speed up the process. However, different storage engines offer different indexing strategies, summarized below:
• MYISAM – B-Tree
• MEMORY – Hash, B-Tree
• INNODB – B-Tree
The B-Tree index requires a certain understanding in order for it to work efficiently.
Some pointers to note are:
• Matching an exact value. A B-Tree index works when matching an exact value stored in a column and is used when there is a WHERE condition.
• Matching a leftmost prefix. Consider the following SQL query:

SELECT id FROM sample_names WHERE last_name = 'Alan';
If an index such as INDEX(age, last_name) is created, it will not be used to find the name Alan, because last_name is not the leftmost column of the index.
• Matching a range of values. A B-Tree index can also be used for range conditions (e.g., BETWEEN, <, >) on the leftmost indexed column.

III. METHODOLOGY
The experiments are aided by the MYSQL graphical user interface (GUI) tools MySQLyog and MYSQL Administrator for easy visual understanding. The experiment consists of two parts, sentence-level and document-level novelty detection. Each part tests an equal number of 50 topics using existing data. The novelty detection process is coded in Visual Studio C++, and the C++ program together with MYSQL runs on a 2x Intel® Core™ 2 Duo CPU E6850 @ 3.00 GHz machine with 3.3 GB RAM. Using the given data, records are duplicated to test for 100k, 200k, 500k, 1 million, 2 million, 5 million, and 10 million records. Since transactions are not needed in this application, the MYISAM storage engine is chosen. B-Tree indexing is implemented; hash indexes are used only to store temporary data. The test measures the time taken for each SQL statement in the novelty detection process, using a C++ timer from its internal library.

A. Indexing

To ensure that selecting data takes the minimal amount of time, tables were indexed optimally. The SQL statements were examined and indexed in the order shown:

1. SELECT DISTINCT topic_id FROM <table> ORDER BY topic_id;
   Index(topic_id)
2. SELECT doc_id, sentence_text FROM <table_relevant_sentences> WHERE topic_id = '<topic_id>' ORDER BY doc_date;
   Index(topic_id, doc_date)
3. SELECT DISTINCT tdm.doc_id FROM <table_docs> r, <table_tdm> tdm WHERE r.topic_id = tdm.topic_id AND r.topic_id = '<topic_id>' AND r.doc_id = tdm.doc_id ORDER BY r.doc_date DESC;
   Index(topic_id, doc_date) for the table storing relevant sentences; Index(topic_id, doc_date) for tdm
4. SELECT term, doc_freq, seq_no FROM <table_dictionary> WHERE topic_id = '<topic_id>';
   Index(topic_id, term, doc_freq, seq_no)
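The composite indexes above rely on the leftmost-prefix rule described in Section II.B, and EXPLAIN is how we verify that a query actually uses them. As a minimal sketch of that check, the following uses Python with SQLite, whose EXPLAIN QUERY PLAN plays the role of MYSQL's EXPLAIN; the table and index names are hypothetical, not the novelty detection schema.

```python
import sqlite3

# In-memory database standing in for the MYSQL tables (illustrative only).
conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE sample_names (id INTEGER, age INTEGER, last_name TEXT)")
cur.execute("CREATE INDEX idx_age_name ON sample_names (age, last_name)")

def plan(sql):
    # EXPLAIN QUERY PLAN is SQLite's analogue of MYSQL's EXPLAIN;
    # the last column of each row describes the access strategy.
    return " ".join(row[3] for row in cur.execute("EXPLAIN QUERY PLAN " + sql))

# age is the leftmost column of the composite index -> the index is used.
print(plan("SELECT id FROM sample_names WHERE age = 30"))

# last_name alone is not a leftmost prefix of (age, last_name),
# so the planner falls back to a full table scan.
print(plan("SELECT id FROM sample_names WHERE last_name = 'Alan'"))
```

Running the same kind of check against MYSQL's EXPLAIN output is how redundant or unused indexes were avoided in the experiments.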
B. Optimizing Data Types

The selection of the character set is important to prevent MYSQL from reading too much redundant data, which lengthens the time taken to deliver results; it also saves storage space. The tables' character sets have been optimized to reduce the time needed to return the result set. Multi-table queries are often used when selecting data, so when a condition compares columns from two different tables, the columns should have the same data type to make the comparison easy and to reduce processing time.

C. Optimizing Inserts

Retrieval of data causes serious problems when schemas and indexes are not created properly, but inserting data can cause serious performance issues as well. Consider the following INSERT statement:

INSERT INTO student VALUES (student_id, address, major);

Two methods are available to insert records into the database. The first is the LOAD DATA INFILE syntax: results are written to a text file or a comma-separated values (CSV) file, with fields delimited by a tab in the text file or a comma in the CSV file, and the data are then imported into the tables using LOAD DATA INFILE. The second method is to insert data in batches, with the syntax shown below:

INSERT INTO table VALUES (a,b), (c,d);
The values within the parentheses determine how many rows are inserted at a time. They act as a buffer, so that a large number of records, such as 1000, can be inserted in a single statement. Implementing the batching technique, however, requires programming effort, and the implementation varies with the programming language used.

IV. EXPERIMENTAL RESULTS

In this section, the experimental results of document-level and sentence-level novelty detection are discussed. For the experiments, records are replicated to test the timing of data retrieval and insertion and the space needed to store the data. The numbers of records used in the experiments are 8k, 100k, 200k, 500k, 2 million, 5 million, and 10 million. The timing of each MYSQL statement is recorded and discussed in terms of its improvement. The timings in the results are recorded in seconds and normalized to a logarithmic scale because of the great disparity between the different SQL statements. Note that values below 1 register a timing of less than one second and can produce ambiguous points on a logarithmic scale, since log10(1) = 0 and smaller values map to negative numbers; more important, however, is the general trend of the experiments.

Figure 1 Graph of Unoptimized Timing (Sentence Level)

A timing study was conducted on 8000 records of the unoptimized database. Sentence-level novelty detection was run on these records to determine the largest bottleneck in the whole process. From the graph, SQL E, which selects data from two tables, formed the largest bottleneck: it required about 49 seconds to execute, accounting for almost 20% of the whole sentence-level novelty detection. This is consistent with our expectation, as MYSQL has to scan through two different tables, which takes longer. The other peak was the insertion of results, SQL C, after the processing of one document was completed. In all, the unoptimized sentence-level novelty detection took around 242 seconds to complete.
Figure 2 Graph of Unoptimized vs. B-Tree Indexed Timing (Sentence Level)

After identifying the huge bottleneck, the different tables were indexed. When the number of records was relatively small, the time taken was short. However, as the number of records increased, in particular at 500,000 records, the novelty detection took about 2.3 days. Unoptimized runs beyond 500,000 records were therefore not carried out.
However, when the tables were indexed properly as described above, the process took only about 2.3 hours to complete, an improvement of about 96% over the unoptimized process with no indexes. Overall, the improvement between the unoptimized and indexed tables ranged from 13% to 96%.
Figure 3 Graph of Selecting Data with Same vs. Different Data Types (Sentence Level)

Prior to indexing a table properly, the data types should be optimized to ensure that appropriate storage is allocated. In addition, when comparing columns of two different tables, the columns should be designed with the same data types. This further improves the performance of SQL E beyond the gain from proper indexing; a performance improvement of around 95% overall was observed.

Figure 4 Graph of Normal Insertion vs. Insertion with Batching (Sentence Level)

Using batching to insert results in batches of 1000 proves to be a faster approach than inserting results one by one. This technique shows a performance improvement of almost 95%.

Figure 5 Graph of Normal B-Tree vs. B-Tree with Batching and Data Type Optimization (Sentence Level)

The implementation of the optimization techniques resulted in performance increases of orders of magnitude. The graph shows that performance improvement was observed from 8000 records through to 10 million records, ranging from 13% to almost 60% overall. The improvement tapers off because a fixed batch size of 1000 was used for insertion; in general, a larger batch size for larger data sets could give better performance. Even so, for 10 million records the optimization reduced the time by 7 hours. Tables 1 and 2 summarize the results for sentence-level and document-level novelty detection, respectively.
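The gain from batching can be reproduced in miniature. The sketch below is illustrative only, not the paper's C++/MYSQL implementation: it uses Python with an in-memory SQLite database and hypothetical result rows, inserting 10,000 rows one by one versus in batches of 1000 (executemany standing in for the multi-row VALUES syntax).

```python
import sqlite3
import time

# Hypothetical novelty detection result rows (id, doc_id, score).
ROWS = [(i, "doc_%d" % i, 0.5) for i in range(10000)]

def insert_one_by_one(cur):
    # One INSERT statement per row.
    for row in ROWS:
        cur.execute("INSERT INTO results VALUES (?, ?, ?)", row)

def insert_in_batches(cur, batch_size=1000):
    # Buffer rows and send each batch in one call, mirroring
    # INSERT INTO table VALUES (a,b), (c,d), ...
    for start in range(0, len(ROWS), batch_size):
        cur.executemany("INSERT INTO results VALUES (?, ?, ?)",
                        ROWS[start:start + batch_size])

def timed(insert_fn):
    conn = sqlite3.connect(":memory:")
    cur = conn.cursor()
    cur.execute("CREATE TABLE results (id INTEGER, doc_id TEXT, score REAL)")
    t0 = time.perf_counter()
    insert_fn(cur)
    conn.commit()
    elapsed = time.perf_counter() - t0
    count = cur.execute("SELECT COUNT(*) FROM results").fetchone()[0]
    conn.close()
    return elapsed, count

t_single, n_single = timed(insert_one_by_one)
t_batch, n_batch = timed(insert_in_batches)
print("row-by-row: %.4fs, batched: %.4fs" % (t_single, t_batch))
```

On a client/server DBMS such as MYSQL the difference is much larger than in this embedded sketch, since each single-row INSERT also pays a network round trip and per-statement parsing overhead.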
Table 1 Summary of Improvement in Time and Space (Sentence Level)

Number of Records | Whole Novelty Detection Timing Improvement (%) | Insertion Improvement (%) | SQL E Improvement (%) | Indexing Time (Secs)
11,000            | 11.0 | 97 | 93 | 0.938
100,000           | 6.0  | 99 | 99 | 8.953
200,000           | 5.8  | 97 | 99 | 15.969
500,000           | 3.0  | 98 | 99 | 152.000
1 million         | 3.0  | 98 | 99 | 395.000
Table 2 Summary of Improvement in Time (Document Level)

V. CONCLUSIONS

This paper addressed the issue of database optimization for novelty detection, which has not been studied before. The experimental results of using B-Tree indexing on larger data sets show an increase in the performance of database retrieval for novelty detection. On a small data set, indexing does not seem to improve performance very much; with a much larger data set, the results show data retrieval performance improving by more than 85%. Optimizing the data types of the columns in a table can reduce the CPU workload, and hence the time needed to retrieve and insert data. It is also important that, when retrieving data from multiple tables, the columns being compared should preferably have the same data type to provide easy handling for MYSQL. Finally, inserting data into a table requires certain pre-processing before the data can be inserted, and inserting rows one by one incurs a high amount of pre-processing time. Techniques such as batching can therefore speed up the insertion process by orders of magnitude: rows of records are kept in a buffer and inserted in one single insert operation instead of row by row. The experiments also showed that the indexing and batching techniques reduced the time needed by over 90%. The improvement in overall novelty detection timing for both sentence and document level decreases as the number of records increases, which can be attributed to the longer processing time for a larger data set. Optimizing the data types also increased the database size by 1%, but it helps reduce the time to retrieve and insert data, as MYSQL does not need to read too much redundant data.

In conclusion, a well-planned and well-designed schema and optimization strategy can speed up a database application by orders of magnitude compared to a database application that is poorly designed. The successful optimization of the MYSQL database has been implemented in a real novelty detection system.

REFERENCES
[1] C. Bujdei, S.-A. Moraru, and S. Dan, "Optimize databases for health monitoring systems," in Proceedings of the 1st International Conference on PErvasive Technologies Related to Assistive Environments, Article No. 78, ACM Press, 2008.
[2] Y. Chen, F. S. Tsai, and K.-L. Chan, "Blog search and mining in the business domain," in DDDM '07: Proceedings of the 2007 International Workshop on Domain Driven Data Mining, 2007, pp. 55–60.
[3] J. Gray, "The next database revolution," in Proceedings of the 2004 ACM SIGMOD International Conference on Management of Data, pp. 1–4, ACM Press, 2004.
[4] J. A. Hoffer, M. Prescott, and F. R. McFadden, Modern Database Management, Prentice Hall, 2002.
[5] A. T. Kwee, F. S. Tsai, and W. Tang, "Sentence-level novelty detection in English and Malay," in Lecture Notes in Computer Science (LNCS), vol. 5476, pp. 40–51, 2009.
[6] H. Liang, F. S. Tsai, and A. T. Kwee, "Detecting novel business blogs," in Seventh International Conference on Information, Communications and Signal Processing (ICICS), 2009.
[7] K. W. Ng, F. S. Tsai, K. C. Goh, and L. Chen, "Novelty detection for text documents using named entity recognition," in Information, Communications and Signal Processing, 6th International Conference on, 2007, pp. 1–5.
[8] R. D. Schneider, MySQL Database Design and Tuning, 2005.
[9] B. Schwartz, P. Zaitsev, V. Tkachenko, J. D. Zawodny, A. Lentz, and D. J. Balling, High Performance MySQL, 2008.
[10] Sun Microsystems, MySQL 5.1 Reference Manual, http://dev.mysql.com/doc/refman/5.1/en/insert-speed.html.
[11] W. Tang, A. T. Kwee, and F. S. Tsai, "Accessing contextual information for interactive novelty detection," in European Conference on Information Retrieval (ECIR) Workshop on Contextual Information Access, Seeking and Retrieval Evaluation, 2009.
[12] W. Tang and F. S. Tsai, "Threshold setting and performance monitoring for novel text mining," in SIAM International Conference on Data Mining Workshop on Text Mining, 2009.
[13] F. S. Tsai, W. Han, J. Xu, and H. C. Chua, "Design and development of a mobile peer-to-peer social networking application," Expert Systems with Applications, vol. 36, no. 8, pp. 11077–11087, 2009.
[14] Y. Zhang and F. S. Tsai, "Combining named entities and tags for novel sentence detection," in ESAIR '09: Proceedings of the WSDM '09 Workshop on Exploiting Semantic Annotations in Information Retrieval, 2009, pp. 30–34.
[15] Y. Zhang and F. S. Tsai, "Chinese novelty mining," in EMNLP '09: Proceedings of the Conference on Empirical Methods in Natural Language Processing, 2009.
[16] Y. Zhang, F. S. Tsai, and A. T. Kwee, "Multilingual sentence categorization and novelty mining," Information Processing and Management: an International Journal, 2009.