Developing a specialized directory system by automatically classifying Web documents

Young Mee Chung
Yonsei University, Seoul, South Korea

Young-Hee Noh
Ewha Womans University, Seoul, South Korea

Received 10 May 2002; revised 13 December 2002

Abstract. This study developed a specialized directory system using an automatic classification technique. Economics was selected as the subject field for the classification experiments with Web documents. The classification scheme of the directory follows the DDC, and subject terms representing each class number or subject category were selected from the DDC table to construct a representative term dictionary. In collecting and classifying the Web documents, various strategies were tested in order to find the optimal thresholds. In the classification experiments, Web documents in economics were classified into a total of 757 hierarchical subject categories built from the DDC scheme. The first and second experiments using the representative term dictionary resulted in relatively high precision ratios of 77 and 60%, respectively. The third experiment employing a machine learning-based k-nearest neighbours (kNN) classifier in a closed experimental setting achieved a precision ratio of 96%. This implies that it is possible to enhance the classification performance by applying a hybrid method combining a dictionary-based technique and a kNN classifier.

Correspondence to: Young Mee Chung, Department of Library and Information Science, Yonsei University, 134 Shinchon-Dong, Seodaemun-Gu, Seoul, Korea. E-mail: [email protected]

Journal of Information Science, 29 (2) 2003, pp. 117–126

1. Introduction

Hundreds of thousands of new sites appear on the Internet daily, and the number of Web documents provided by Internet sites is growing at an extraordinary rate. Not only the quantity but also the quality of these documents has been improving since the early stages of the Internet age. As a result, researchers and research institutes are increasingly using Web documents for scholarly purposes.

The well-known retrieval tools of the Web are search engines and directory services. The search engines that mainly perform keyword searching are continually improving with the addition of new functions for more effective retrieval. However, the search results are still far from satisfactory. The directory services, on the other hand, provide more satisfactory results to users by manually reviewing and classifying documents before retrieval. Nevertheless, it is almost impossible for human editors to keep classifying the immensely increasing number of Web documents.

Subject gateways like SOSIG (Social Science Information Gateway) also provide organized access to Internet resources, usually in a specialized subject area. Subject gateways are similar to specialized directories in that they maintain a list of quality Internet resources reviewed by human experts. In addition, many academic subject portals operated by university libraries, such as the Cleveland State University Library, give access to Web documents by pulling together free Internet resources and subscription-based


sources such as databases, electronic journals and other digitized collections.

A way of improving the effectiveness of retrieval is to group similar documents by clustering algorithms. Document clustering can be used for automatic classification of documents, as in the SONIA project of Stanford University [1], as well as for partitioning a group of documents falling within the same subject category into several sub-groups, as in OCLC’s SCORPION project [2]. Especially in the Web environment, where enormous numbers of documents are being retrieved, generating small homogeneous groups each containing more similar documents may increase the retrieval effectiveness [3]. For example, the Northern Light search engine clusters retrieved documents with similar topics into ‘custom search folders’, enabling users to access the most appropriate document subset. The ‘Find similar’ feature that most search engines provide as a search aid is also based on the clustering concept.

Similarly, in directory services, automatic classification or text categorization techniques based on a priori classification schemes can be used to automatically assign documents to relevant subject categories, greatly reducing the work of human experts. Especially in a research library environment where a specialized directory system is needed, a new information service can be provided without additional manpower by employing an appropriate document classification method.

In this study, we developed a specialized directory system that automates the collecting, indexing and classification processes of Web documents. Web documents were gathered by a Web robot and then automatically indexed. After indexing, automatic classification of the documents was performed using the subject term dictionary built from the DDC table. In an effort to build a more effective directory system, three classification experiments were carried out as described below:

• A subject specialist extracted subject terms representing each subject category or class number under economics using the explanatory note and relative index parts of the DDC table. The documents gathered by a Web robot were automatically indexed so that index terms represent each document. A Web document was then classified into an appropriate subject category after computing similarities between the document and every subject category.

• Additional representative terms for each subject category were derived from MARC records of library materials that had been manually classified into each category in a real library setting. For this purpose, the DDC numbers in a MARC record were used to extract subject terms from the title and subject headings in the same record. The terms that appeared in the title and the subject headings fields were added to the representative term dictionary. The same classification method as in the first experiment was then used.

• To enhance the performance of the dictionary-based classification technique, a kNN (k-nearest neighbours) classifier, a very effective machine learning classifier [4, 5], was applied to the pre-classified Web documents. For the efficiency of text categorization, document frequency was used as a feature selection method to remove non-informative terms according to corpus statistics.

2. Developing a specialized directory system

2.1. System outline

Many people use general Internet directory systems such as Yahoo! in order to meet their information needs. Most general directory systems do not apply standard classification schemes like DDC or UDC but use specific schemes designed for Internet information resources. However, specialized directory systems that focus on scientific information tend to use standard library schemes. McKiernan [6] surveyed sites that provide scientific information on the Internet and found that two sites classify resources in alphabetical order, one in numerical order, 20 by DDC, and three by UDC. Most Internet directory systems, whether general or specialized, rely on editors or surfers to classify or categorize information resources. However, Cora [7], a specialized retrieval engine in the computer science field, uses an extended Naive Bayes classifier to automatically classify documents into 75 hierarchical subject categories. The accuracy of this classification method is reported to be 66% [8].

In this study economics was selected as the sample subject field for constructing a specialized directory system through an automatic classification process. The classification scheme of the directory follows the DDC, and subject terms representing each class number or subject category were selected from the DDC table.



Fig. 1. Overview of the specialized directory system.

A representative term dictionary was built with category entries, each consisting of a class number, a subject category label and corresponding subject terms. The classification scheme and term dictionary need to be updated periodically in order to reflect new concepts appearing in incoming Web documents.

The overall structure of the specialized directory system constructed in this study is shown in Fig. 1. The workflow of the system can be described as follows:

• To gather Internet documents by a Web robot, initial URLs of Web sites dealing with economics resources are designated for the Web robot to visit;
• the documents gathered by the Web robot are stored in a temporary database;
• the document indexer extracts, from each document in the temporary database, URL information for the next visit and other indexing information to be used in the automatic classification process;
• according to the designated automatic classification algorithm, each Web document subjected to classification is compared with the representative term dictionary and assigned to the most relevant category;
• a subject specialist verifies the result of classification and saves it in the directory database if correctly assigned. This verification process may be omitted if 100% automatic classification is desired.

2.2. Classificatory structure of the directory

Since specialized directories deal mainly with scholarly information, unlike general Internet directory services, the use of standard classification schemes such as the DDC and the UDC is encouraged. To develop a specialized directory in the economics field, the DDC scheme was adopted and the subject headings that correspond to class numbers in the DDC table were used as subject category labels.

DDC is a decimal classification scheme that divides all subjects into 10 main classes numbered 0 to 9. Each main class is further divided into 10 divisions, and the divisions are once again divided into 10 sections. Since this study deals with the field of economics, it uses the ‘Economics’ (330) subdivision under the ‘Social Sciences’ (300) main class. A total of 757 subject categories were collected from the Economics subdivision, and subject terms representing each subject category were selected. A representative term dictionary built from these data was then used in automatically classifying incoming Web documents. Each record of the representative term dictionary includes a class number, a subject category label and subject terms for the subject category. A sample record for class number 334.7 is given below:

[334.7] Benefit societies [benefit societies, benevolent societies, friendly societies, mutual aid societies, provident societies]
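To make the structure of these records concrete, the following is a minimal sketch in Python of how such a dictionary entry could be held in memory; the class and field names are our own illustrative assumptions, not the implementation actually used in the study.

```python
# Illustrative sketch only: an in-memory form of the representative term
# dictionary described above (class number, category label, subject terms).
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class CategoryRecord:
    class_number: str                              # DDC class number, e.g. "334.7"
    label: str                                     # subject category label
    terms: set[str] = field(default_factory=set)   # representative subject terms


# The sample record for class number 334.7 quoted in the text:
benefit_societies = CategoryRecord(
    class_number="334.7",
    label="Benefit societies",
    terms={
        "benefit societies", "benevolent societies", "friendly societies",
        "mutual aid societies", "provident societies",
    },
)

# The full dictionary maps each of the 757 class numbers to its record.
term_dictionary: dict[str, CategoryRecord] = {
    benefit_societies.class_number: benefit_societies,
}
```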

The starting structure of the directory consists of nine levels, and the class number structure for each level is shown in Table 1. Thus, the top subject category, i.e. the Economics subdivision, is level 1 and can be subdivided up to level 9.

Table 1. Hierarchy of subject categories

Level of subject category    Class number structure
Level 1                      330
Level 2                      33*
Level 3                      33*.*
Level 4                      33*.**
Level 5                      33*.***
Level 6                      33*.****
Level 7                      33*.*****
Level 8                      33*.******
Level 9                      33*.*******

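The mapping from a class number to its level in Table 1 is mechanical: the bare number 330 is level 1, any other three-digit 33* number is level 2, and each decimal digit adds one level. A small helper function (our own illustrative sketch, not part of the described system) makes this explicit:

```python
def directory_level(class_number: str) -> int:
    """Return the directory level of a DDC class number under Economics (330),
    following the class number structure of Table 1 (illustrative sketch)."""
    whole, _, decimals = class_number.partition(".")
    if not whole.startswith("33") or len(whole) != 3:
        raise ValueError("not a class number in the Economics (33*) subdivision")
    if whole == "330" and not decimals:
        return 1                      # level 1: the Economics subdivision itself
    return 2 + len(decimals)          # 33* is level 2; each decimal digit adds a level


assert directory_level("330") == 1
assert directory_level("331") == 2
assert directory_level("334.7") == 3      # 33*.* in Table 1
assert directory_level("332.6327") == 6   # 33*.**** in Table 1
```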


The number of levels in the directory hierarchy is decided after considering the effectiveness of the system. The performance evaluation of the results of the classification experiments guides the selection of the hierarchy depth. A depth of up to nine levels is not unusual in other directories; for example, subject categories exceeding 10 levels can be seen in Yahoo!.

2.3. Collecting Web documents

In general, Web documents on the Internet are collected by a Web robot. Web robots gather documents by periodically navigating domestic and foreign Web sites on a URL list. Often a URL list of related sites is drawn up to designate the starting URLs. The well-known sites in a specific field or the URLs appearing on Usenet newsgroups are usually put on the list. Other options include using sites where mailing list documents are stored or designating the server of a well-known Web directory system.

This study sought out Web sites of well-known research institutes in the economics field to come up with 50 starting URLs. The URL list includes 37 domestic sites and 13 foreign sites such as the Asian Development Bank and the International Monetary Fund. Beginning with the starting URLs, the collected documents were temporarily saved in a local database in order to extract new URL information, that is, to acquire new URLs from already visited Web sites. The Web robot examined acquired domain names, excluded overlapping sites and checked whether a URL had already been visited, in order to prevent the retrieval of duplicate Web documents. The Web robot developed for this study followed the standard for robot exclusion, which specifies how to exclude robots from a server [9].

There are two ways for a Web robot to traverse hierarchically linked Internet documents: breadth-first search and depth-first search. The Web robot in our directory system was designed to first retrieve one host and then move on to the next, i.e. depth-first search. The depth designated for visits was ‘+10’; by designating ‘+10’, the robot visited the 10 upper sites and 10 lower sites from the starting URL. If ‘0’ had been designated, it would have retrieved all references both above and below the starting URL.
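As a rough illustration of this collection step, the sketch below shows a depth-first robot that skips already-visited URLs and honours the robot exclusion standard. It is our own simplified example, not the robot used in the study, and the ‘+10’ depth convention is approximated here by a plain link-depth limit.

```python
# Simplified, illustrative depth-first Web robot: visits a starting URL,
# follows links depth-first up to a fixed depth, skips already-visited URLs,
# and checks the robot exclusion standard (robots.txt) before each fetch.
import urllib.request
import urllib.robotparser
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse


class LinkExtractor(HTMLParser):
    """Collect href targets of <a> tags from an HTML page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def allowed(url: str) -> bool:
    """Honour the robot exclusion standard for the URL's host."""
    parts = urlparse(url)
    parser = urllib.robotparser.RobotFileParser()
    parser.set_url(f"{parts.scheme}://{parts.netloc}/robots.txt")
    try:
        parser.read()
    except OSError:
        return True  # assume allowed if robots.txt cannot be fetched
    return parser.can_fetch("*", url)


def crawl(url: str, depth: int, visited: set, store) -> None:
    """Depth-first crawl from `url`, following links up to `depth` levels;
    `store(url, html)` stands in for saving the page to the temporary database."""
    if depth < 0 or url in visited or not allowed(url):
        return
    visited.add(url)
    try:
        with urllib.request.urlopen(url, timeout=10) as response:
            html = response.read().decode("utf-8", errors="replace")
    except OSError:
        return
    store(url, html)
    extractor = LinkExtractor()
    extractor.feed(html)
    for link in extractor.links:
        crawl(urljoin(url, link), depth - 1, visited, store)
```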

2.4. Automatic indexing and classification of Web documents

In order to classify the gathered Web documents into specific subject categories of the directory, index terms need to be extracted. In the indexing process, a commercial indexing system based on a stop word list and a morphological analyser was used. Index terms were extracted from the full text of each document, including the title, subtitles, hyperlink anchors and words in bold or italics. However, to minimize the indexing overhead in the directory system, file names, file directory paths, e-mail addresses, host names and HTML tags were excluded from indexing. For each index term, the term frequency (TF) within a specific document and the document frequency (DF), that is the number of documents in which the term appears, were calculated. Each index term was assigned a standardized term frequency weight of 1 + log TF.

After indexing the gathered Web documents, the similarity between a specific document and the subject categories of the directory was measured by comparing the assigned index terms with the category representative terms. As previously mentioned, each subject category is represented by subject terms. The cosine coefficient was used to measure the similarity between a specific Web document i and a subject category j as follows:

S(D_i, C_j) = \frac{\sum_k t_{ik} \, c_{jk}}{\sqrt{\sum_k (t_{ik})^2 \times \sum_k (c_{jk})^2}}

where t_{ik} is the weight of term k within document i (the index term weight) and c_{jk} is the weight of representative term k in subject category j (set to 1 if the term is present).

If the similarity between an input document and a subject category exceeds a certain threshold value, the document may be classified into the corresponding subject category. In this study, the threshold value was varied in four steps, from 0.1 to 0.4. A document was assigned to only the one subject category with the highest similarity value, although there could be multiple categories for which the similarity exceeded the threshold.
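For illustration, the weighting and assignment rule just described can be written compactly as follows. This is our own sketch; the actual system's indexing pipeline, including the commercial morphological analysis of Korean text, is not reproduced here.

```python
import math


def document_weights(term_frequencies: dict) -> dict:
    """Index term weights of 1 + log TF, as described above."""
    return {t: 1 + math.log(tf) for t, tf in term_frequencies.items() if tf > 0}


def similarity(doc_weights: dict, category_terms: set) -> float:
    """Cosine similarity S(D_i, C_j); representative term weights c_jk are 1."""
    numerator = sum(w for t, w in doc_weights.items() if t in category_terms)
    doc_norm = math.sqrt(sum(w * w for w in doc_weights.values()))
    cat_norm = math.sqrt(len(category_terms))
    return numerator / (doc_norm * cat_norm) if doc_norm and cat_norm else 0.0


def classify(doc_weights: dict, categories: dict, threshold: float = 0.1):
    """Assign the document to the single most similar category above the threshold.
    `categories` maps a class number to its set of representative terms."""
    best, best_sim = None, threshold
    for class_number, terms in categories.items():
        sim = similarity(doc_weights, terms)
        if sim >= best_sim:
            best, best_sim = class_number, sim
    return best
```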

3. Automatic classification experiments

3.1. Outline of the experiments

In order to achieve the best performance in automatically classifying the collected Web documents in our directory system, classification experiments were carried out in three different settings. The number of test documents for the first and second experiments was 6743. In the third experiment, a total of 2512 documents, comprising 1889 training documents and 623 test documents, were used.


Table 2. Twelve threshold conditions for directory building

            dir 1  dir 2  dir 3  dir 4  dir 5  dir 6  dir 7  dir 8  dir 9  dir 10  dir 11  dir 12
TF          1      1      1      1      2      2      2      2      3      3       3       3
Similarity  0.1    0.2    0.3    0.4    0.1    0.2    0.3    0.4    0.1    0.2     0.3     0.4

In the first and second experiments, the threshold values for the term frequency of index terms and for the similarity between an input document and the subject categories were varied to construct 12 different experimental directories. The term frequency threshold of index terms was set from 1 to 3, generating three conditions. As mentioned before, the similarity threshold was set from 0.1 to 0.4, generating four conditions. In total, there were 12 threshold conditions resulting in 12 different directories, as shown in Table 2.

In the performance evaluation, three measures were used: classification precision, the degree of dispersion, and the degree of hierarchical concentration. To measure the performance of binary classification, the result of the category assignments is usually evaluated using a two-way contingency table for each subject category, and average precision across categories is computed [1]. In this study, classification precision is an averaged precision ratio computed as the percentage of correctly assigned documents among the total documents assigned to a specific category. The classification precision formula for each subject category is as follows:

\text{Classification Precision} = \frac{\text{Number of correctly assigned documents}}{\text{Total number of documents assigned to the category}}

The degree of dispersion measures whether the documents gathered by the Web robot are evenly assigned across the subject categories of the directory, without concentrating in a specific category. The formula is as follows:

\text{Degree of Dispersion} = \frac{\text{Number of categories to which documents were assigned}}{\text{Total number of categories}}

The degree of hierarchical concentration estimates the percentage of documents assigned to the subject categories at each level among the total number of classified documents. It is calculated by the following formula:

\text{Degree of Hierarchical Concentration} = \frac{\text{Number of documents classified at a given level}}{\text{Total number of documents classified}}
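The three measures can be illustrated in a few lines of code (a sketch under our own naming: `assignments` maps each subject category to the documents assigned to it, `correct` is the set of correctly assigned documents, and `docs_per_level` counts classified documents per hierarchy level):

```python
def classification_precision(assigned_docs: list, correct: set) -> float:
    """Share of correctly assigned documents among those assigned to one category."""
    if not assigned_docs:
        return 0.0
    return sum(1 for d in assigned_docs if d in correct) / len(assigned_docs)


def degree_of_dispersion(assignments: dict, total_categories: int = 757) -> float:
    """Share of the directory's categories that received at least one document."""
    return sum(1 for docs in assignments.values() if docs) / total_categories


def degree_of_hierarchical_concentration(docs_per_level: dict) -> dict:
    """Share of all classified documents that fall at each hierarchy level."""
    total = sum(docs_per_level.values())
    return {level: n / total for level, n in docs_per_level.items()}
```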

3.2. Results of the first and second experiments

Variation of thresholding conditions. Tables 3 and 4 show that, as the similarity threshold gets higher, the number of documents classified into each subject category decreases in both experiments. On the other hand, the frequency threshold of index terms does not appear to affect the classification result significantly. In the cases of dir 1, dir 5 and dir 9, where the similarity threshold was set at 0.1, the number of documents classified into the subject categories was at least 4694 in each of the three directories in both experiments. However, for dir 4, dir 8 and dir 12, where the similarity threshold was set at 0.4, the number of documents classified into subject categories was only 87, 163 and 407, respectively, in the first experiment, and 63, 122 and 249 in the second. This shows a very low rate of category assignment.

When the documents assigned to each category were examined by level, the greatest numbers of documents were assigned to categories in levels 1, 4 and 5 in the first experiment. In the second experiment, however, few documents were assigned to level 1, while many documents were assigned to levels 6, 5 and 4, in that order. The subject category of level 1 is ‘Economics’, with the class number 330. In the first experiment, the representative terms for this category were ‘economics’ and its corresponding Korean term, which caused many documents to be assigned to that category when a low similarity threshold such as 0.1 or 0.2 was applied. Since the directory is a specialized one in the economics field, it is not desirable to have many documents in the top level. In the second experiment, where nearly 100 representative terms were given for the top-level category, a dramatic decrease in the number of documents assigned to the top level can be noticed. On the other hand, level 9 contains very few documents, implying that this level may be too specific to be included in the directory.


Table 3. Result of the first experiment

Level  dir 1  dir 2  dir 3  dir 4  dir 5  dir 6  dir 7  dir 8  dir 9  dir 10  dir 11  dir 12  Total   Average
1      1315   1215   19     1      1285   1253   1188   18     1272   1418    1187    62      10,233  852.75
2      99     35     6      0      71     32     6      0      53     32      7       1       342     34.20
3      305    92     5      0      247    116    18     5      247    126     32      7       1200    109.09
4      1930   804    131    34     1771   1106   241    63     1566   1120    369     125     9260    771.67
5      1884   724    133    47     1683   993    233    64     1601   1139    389     127     9017    751.42
6      744    304    57     5      585    366    108    13     527    388     168     72      3337    278.08
7      337    71     4      0      179    87     23     0      163    100     34      13      1011    101.10
8      14     3      0      0      8      7      0      0      5      4       2       0       43      6.14
9      1      0      0      0      0      0      0      0      0      0       0       0       1       1.00
Total  6629   3248   355    87     5829   3960   1817   163    5434   4327    2188    407     34,444  2870.33

Analyses of categorical dispersion and hierarchical concentration. It is preferable to design the architecture of a directory so that documents are evenly dispersed over all the subject categories. The number of categories, out of the 757 subject categories in the directory, into which documents were actually assigned was therefore counted. The average number of such categories was 171.42 for the first experiment and 185.5 for the second. This means that only about a quarter of the subject categories were used in classifying the collected Web documents. The degrees of dispersion computed for the 12 directory conditions are shown in Table 5 and graphically in Fig. 2. The average dispersion ratio was 23% for the first experiment and 25% for the second.

The degree of hierarchical concentration was also computed for every level under the 12 directory conditions. The average degree of concentration for each level is presented in Fig. 3. The degree of hierarchical concentration is relatively high in levels 1, 4, 5 and 6 for the first experiment and in levels 4, 5 and 6 for the second.

Evaluation of classification accuracy. The classification accuracy was evaluated using the precision ratio as stated in Section 3.1, and was analysed by level as well as by directory condition. In calculating the average precision, level 9 was excluded because it contained too few documents to exist as a subject category in our directory. Table 6 gives the precision ratios by level for the first and second experiments. The average precision was about 77% for the first experiment and about 60% for the second. Level 1, containing the broadest subject category, achieved a precision ratio of 1.0, as expected. When the top level is excluded, the precision ratios decrease to 73 and 58%, respectively.

Table 4. Result of the second experiment

Level  dir 1  dir 2  dir 3  dir 4  dir 5  dir 6  dir 7  dir 8  dir 9  dir 10  dir 11  dir 12  Total   Average
1      7      1      0      0      2      0      0      0      0      0       0       0       10      0.83
2      32     15     0      0      9      5      0      0      3      1       1       1       67      5.58
3      397    164    6      0      198    126    7      0      193    102     13      0       1206    100.50
4      1264   603    47     3      1307   770    126    14     1130   691     185     38      6178    514.83
5      1680   713    131    46     1739   970    218    57     1458   1080    376     116     8584    715.33
6      2181   1722   109    14     2106   1794   187    51     1745   1952    1384    87      13,332  1111.00
7      120    32     2      0      145    56     10     0      138    81      19      6       609     50.75
8      32     10     0      0      29     16     2      0      27     23      7       1       147     12.25
9      2      0      0      0      0      0      0      0      0      0       0       0       2       0.17
Total  5715   3260   295    63     5535   3737   550    122    4694   3930    1985    249     30,135  279.03




Table 5. Degree of categorical dispersion

                   dir 1  dir 2  dir 3  dir 4  dir 5  dir 6  dir 7  dir 8  dir 9  dir 10  dir 11  dir 12  Average
First experiment   0.46   0.23   0.08   0.02   0.43   0.30   0.13   0.04   0.43   0.34    0.18    0.10    0.23
Second experiment  0.51   0.29   0.07   0.02   0.49   0.33   0.12   0.03   0.47   0.35    0.18    0.07    0.25

Fig. 2. Comparison of categorical dispersion.

Fig. 3. Degree of hierarchical concentration by level.

Table 7 shows the classification precision of the directory according to the different directory conditions. In both experiments, the performance was enhanced as the similarity threshold increased from 0.1 to 0.3. When the threshold was set at 0.4, there were several directory conditions under which the performance dropped. However, the frequency threshold

seemed to have little impact on the performance of automatic classification.

Discussion on the first and second experiments. The second experiment sought to improve the classification performance by adding the terms from titles, subject headings and other fields in MARC records to the representative terms drawn from the DDC table. Although the performance of the automatic classification became worse in the second experiment, the percentage of documents assigned to the top-level category decreased from 26.63 to 0.02%. Therefore, in terms of the specificity of classification and the hierarchical concentration, the second experiment gave more satisfactory results.

When looking at the classification results under varying threshold conditions, it was found that the number of documents actually assigned to subject categories decreased as the similarity threshold between a document and the subject categories became higher. Similarly, the number of categories to which documents were assigned also decreased as the similarity threshold became higher. This implies that an appropriate level of the similarity threshold needs to be ascertained for any Web directory to be effective.

In both the first and second experiments, when level 9, to which too few documents were assigned owing to over-segmentation, is excluded, the directory achieved a relatively high classification performance, with precision ratios of 77 and 60%, respectively. However, when using a subject term dictionary built from the DDC table, as in the first experimental setting, it is desirable to add more general terms to the representative terms of the categories in the top level. The analysis of the classification results according to threshold condition indicates that the performance improves as the similarity threshold increases from 0.1 up to 0.3. In contrast, the term frequency threshold, varied from 1 to 3, seems to have little influence on the performance or on the number of documents actually classified.


Table 6. Comparison of classification precisions by level

Level                        First experiment  Second experiment
1                            1.00              1.00
2                            0.97              0.14
3                            0.70              0.56
4                            0.78              0.65
5                            0.67              0.55
6                            0.69              0.67
7                            0.62              0.55
8                            0.71              0.80
Average                      0.77              0.60
Average (excluding level 1)  0.73              0.58

Table 7. Comparison of classification precisions by directory conditions

                   dir 1  dir 2  dir 3  dir 4  dir 5  dir 6  dir 7  dir 8  dir 9  dir 10  dir 11  dir 12  Average
First experiment   0.72   0.72   0.75   0.86   0.75   0.77   0.81   0.70   0.69   0.81    0.82    0.85    0.77
Second experiment  0.55   0.60   0.69   0.52   0.63   0.67   0.68   0.72   0.46   0.52    0.61    0.51    0.60

3.3. The third experiment to enhance the classification performance

Test collection. Of the 12 directories evaluated in the first experiment, dir 1, with a frequency threshold of 1 and a similarity threshold of 0.1, was used for the third experiment. dir 1 contained the full set of 757 subject categories, of which 386 contained more than one document. Among the total documents actually classified in dir 1, 2512 documents with category labels were used for the third experiment. This test collection was divided into two subsets at a ratio of 7 to 3, resulting in 1889 training documents and 623 test documents.

Classification process and result. In the third experiment, a kNN classifier was used to categorize the pre-classified test documents by learning from the training documents. The classifying process of the kNN classifier was as follows [10] (a small illustrative sketch is given after the list):

1. First, given a test document, the k nearest documents were retrieved from the training documents. For the calculation of similarity between each retrieved training document D_j and the test document D_x, the following cosine coefficient formula was used. In this experiment, the value of k was varied over 10, 20 and 30 for the purpose of finding the optimal k. As mentioned earlier, the term weights t_{xk} and t_{jk} were of the form 1 + log TF.

   Sim(D_x, D_j) = \frac{\sum_k t_{xk} \, t_{jk}}{\sqrt{\sum_k (t_{xk})^2 \times \sum_k (t_{jk})^2}}

2. For more efficient processing, feature reduction was applied in representing each document to be classified. Document frequency, which proved as effective as information gain or the chi-square test in a previous study [11], was used as the feature selection method. Based on previous studies containing feature selection experiments [4, 11], the top 20% of the terms with the highest document frequency were used to represent each document.

3. After computing the relevance score of each subject category on the basis of the cosine similarity values obtained in step 1, the test document was assigned to the category with the highest relevance score. As shown in the relevance formula below, the similarity of each nearest neighbour D_j to the test document was multiplied by the conditional probability of the document D_j belonging to each category C_k, and the products were summed to give the relevance score.

   rel(C_k \mid D_x) = \sum_{D_j \in k\ \text{top documents}} Sim(D_x, D_j) \, p(C_k \mid D_j)
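The three steps above can be sketched as follows. This is our own compact illustration, assuming single-labelled training documents so that p(C_k | D_j) reduces to an indicator of whether D_j carries the label C_k; the function names and data layout are ours, not the authors' implementation.

```python
import math
from collections import Counter, defaultdict


def select_features(documents: list, keep_ratio: float = 0.2) -> set:
    """Keep the top 20% of terms by document frequency (DF-based feature selection)."""
    df = Counter()
    for term_freqs in documents:
        df.update(term_freqs.keys())
    keep = max(1, int(len(df) * keep_ratio))
    return {term for term, _ in df.most_common(keep)}


def weight(term_freqs: dict, features: set) -> dict:
    """1 + log TF weights, restricted to the selected features."""
    return {t: 1 + math.log(f) for t, f in term_freqs.items() if t in features and f > 0}


def cosine(a: dict, b: dict) -> float:
    num = sum(w * b.get(t, 0.0) for t, w in a.items())
    denom = math.sqrt(sum(w * w for w in a.values())) * math.sqrt(sum(w * w for w in b.values()))
    return num / denom if denom else 0.0


def knn_classify(test_tf: dict, training: list, features: set, k: int = 30):
    """`training` is a list of (category, term_frequencies) pairs with known labels."""
    test_vec = weight(test_tf, features)
    # Step 1: retrieve the k nearest training documents by cosine similarity.
    nearest = sorted(
        ((cosine(test_vec, weight(tf, features)), category) for category, tf in training),
        reverse=True,
    )[:k]
    # Step 3: relevance score per category = sum of similarities of its neighbours.
    relevance = defaultdict(float)
    for sim, category in nearest:
        relevance[category] += sim
    return max(relevance, key=relevance.get) if relevance else None
```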


Table 8. Categorization performance of the kNN classifier

k value  Recall ratio  Precision ratio  Accuracy
10       0.9015        0.9486           0.9984
20       0.9010        0.9658           0.9984
30       0.9009        0.9752           0.9984
Average  0.9011        0.9632           0.9984


Table 8 shows the performance of the kNN classifier when k was set at 10, 20 and 30, respectively. The average precision was as high as 96.32%, with a slight increase in precision for larger k. It can be seen from the table that the k value hardly affects the categorization performance in this experiment.

4. Conclusions and suggestions

In developing a specialized Internet directory system in the field of economics, automatic classification experiments were performed in order to construct a more effective system. The DDC scheme was adopted as the classificatory architecture for the directory, and a dictionary-based classification method was employed to automatically categorize Web documents input to the directory system.

The first classification experiment, which used a representative term dictionary, achieved a relatively high precision of 77%. Although this level of performance is acceptable in a real system environment, human subject specialists may verify the classification results and reassign incorrectly classified documents without much effort. This process will create a high-performance directory similar to directory services managed by human editors. However, it was assumed that the classification performance would also improve if a machine learning categorization technique were applied as a second step to refine the classification result. When a kNN classifier was employed to reclassify the test documents, the classification precision was enhanced up to 96%. Although this extremely high precision was possible because the classifier was applied to the pre-classified

documents in our directory, the experimental result suggests a way to optimize the classification performance in a specialized directory system.

The suggestions to be made from the experimental results of this study are as follows:

• Construct a specialized directory system with a well-established classification scheme such as DDC, and employ an automatic classification procedure using a subject term dictionary to classify the collected documents.

• To improve classification performance, select a certain number of correctly classified documents from every subject category during the verification of classified documents and use them as training documents for a kNN classifier. Reclassify the collected documents in the directory using the kNN classifier.

• Once a sufficient number of Web documents have been collected and classified over a certain period of time, it is also possible to activate a kNN classifier to categorize newly collected documents, by providing some correctly classified documents as training documents.

• When gathering documents in a specific subject field from the Web and classifying them into a directory, the term frequency threshold for input documents as well as the similarity threshold for document–category matching need to be optimized.

References

[1] M. Sahami et al., SONIA: a service for organizing networked information autonomously, Digital Libraries 98. Available at: http://robotics.stanford.edu/users/sahami/papers-dir/98-sonia.ps (posted 1998, access date: 25 March 1999).
[2] The Scorpion Project. Available at: http://orc.rsch.oclc.org:6109/ (posted 1998, access date: 20 March 1999).
[3] O. Zamir and O. Etzioni, Web document clustering: a feasibility demonstration, ACM SIGIR ‘98 (1998) 46–54.
[4] Y. Yang, An evaluation of statistical approaches to text categorization, Information Retrieval 1 (1999) 69–90.
[5] Y. Yang and X. Liu, A re-examination of text categorization methods, ACM SIGIR ‘99 (1999) 42–49.
[6] G. McKiernan, Beyond Bookmarks: Schemes for Organizing the Web. Available at: www.public.iastate.edu/~CYBERSTACKS/CTW.htm (posted 1999, access date: 24 February 1999).
[7] Cora: Computer Science Research Paper Search Engine. Available at: www.whizbang.com (access date: 20 May 2001).



[8] A. McCallum et al., A machine learning approach to building domain-specific search engines, The Sixteenth International Joint Conference on Artificial Intelligence (IJCAI-99) (1999) 662–667.
[9] M. Koster, A Standard for Robot Exclusion. Available at: www.demon.co.uk/pages/knowbots/norobots.html (posted 1997, access date: 2 April 1999).


[10] L.S. Larkey and W.B. Croft, Combining classifiers in text categorization, ACM SIGIR ‘96 (1996) 287–297.
[11] Y. Yang and J.O. Pedersen, A comparative study on feature selection in text categorization, Proceedings of the Fourteenth International Conference on Machine Learning (ICML ‘97) (1997) 412–420.
