A Study of Artificial Intelligence Methods Used to Classify Corpora

Thomas B. Jones
Dept. of Computer Science
The University of New Mexico
Albuquerque, NM 87131

Abstract

Artificial intelligence methods are often used to classify corpora. Many methods have been used to accomplish this task, including singular value decomposition, clustering methods such as k-means and k-nearest neighbors, naive Bayesian methods, Bayesian networks, decision tree methods, and support vector machines. In this paper I present background on the problem of corpora classification, the methods generally used to preprocess corpora for analysis, and the methods used to automatically classify corpora.

1 Introduction

Corpora classification has many uses in the modern world. These uses include automatic search and retrieval [17], document clustering [11], detecting spam [15], aiding researchers in the search for synonyms in large bodies of autonomous research material [10], and readability assessments (especially in material written by technical experts for non-technical readers) [6].

2 Common Preprocessing Methods

Corpora in natural languages tend to be very complex. Even if corpora are limited to the 10,000 most common words in a language (a considerable limitation when one considers that the Oxford English Dictionary contains 171,476 words), a corpus of 1000 words still has 10000^1000, or 10^4000, different permutations. While many of these permutations may never occur in any corpus in the world, a corpus classifier must still be able to classify bodies of text that are highly unlikely. Therefore, rather than trying to use full natural language processing to classify text, many classifiers use a 'bag of words' approach which vectorizes each document in a set of corpora in some regular way.

2.1 Document Vectors

Because of this high degree of internal complexity, as well as a general lack of knowledge about how humans classify the corpora they read, many common artificial intelligence methods used to classify corpora today do not try to process corpora as they come, but instead vectorize the corpora into a much smaller space than the permutation space described in the preceding paragraph. It is common for modern methods to first process the corpora looking for unique words, adding each such word to a dictionary. Once this dictionary is built, the preprocessor then assigns to each document a vector D = (t_1, t_2, ..., t_n), where each t_i is a weighting assigned to the document for the ith term in the dictionary. As noted in [17], these weighting assignments can come in many different flavors. An existence weighting assigns a bit vector to each corpus in a corpora, with a 1 assigned to the ith position if the ith term in the dictionary of the corpora occurs in the corpus and a 0 assigned if the ith term does not appear in the corpus. A simple count weighting can also be used, assigning to the ith position of D the number of times the ith term occurs in a particular document. Another weighting that is often used is an importance weighting, which is like the existence weighting except that it weights terms continuously from 0 to 1 based on their importance. So, for example, a classifier which is trying to classify a set of corpora about brain disorders may rank the term 'bipolar' with a value of 0.9 while ranking the term 'dog' with a value of 0.1.

Importance weighting also points to an important task of many modern preprocessors: ensuring that overly common words do not occur in the corpora's dictionary. With respect to some classification which needs to be obtained, terms called 'stop words' are those terms which are equally likely to appear in documents which have that classification as in documents which do not have that classification [19]. Removing these words from a document before creating a dictionary or building a document vector has two primary benefits. First, it improves the signal from more important words; second, it decreases both the time and storage complexity of classification algorithms. A common example of a stop word in the English language is the word 'the', which will be equally likely in a set of corpora with a classification as in a set of corpora without that classification for almost all classifications likely to be applied to English corpora. Since many words in English have this property, it is common to use a standard list of stop words to remove unimportant words from a corpora.
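
As a concrete illustration of these weighting schemes, the short Python sketch below builds a dictionary from a toy set of corpora, removes stop words, and produces existence and count vectors. The stop-word list and example documents are hypothetical; a real system would use a much larger standard stop-word list.

import numpy as np

# A hypothetical stop-word list; real systems use much longer standard lists.
STOP_WORDS = {"the", "a", "an", "of", "and", "is", "in"}

def build_dictionary(corpora):
    """Collect the unique non-stop-word terms across all documents."""
    terms = sorted({w for doc in corpora for w in doc.lower().split()
                    if w not in STOP_WORDS})
    return {term: i for i, term in enumerate(terms)}

def count_vector(doc, dictionary):
    """Simple count weighting: position i holds how often term i occurs."""
    v = np.zeros(len(dictionary))
    for w in doc.lower().split():
        if w in dictionary:
            v[dictionary[w]] += 1
    return v

def existence_vector(doc, dictionary):
    """Existence weighting: 1 if term i appears in the document, else 0."""
    return (count_vector(doc, dictionary) > 0).astype(float)

# Toy example.
corpora = ["the dog chased the cat", "bipolar disorder is a brain disorder"]
D = build_dictionary(corpora)
print(existence_vector(corpora[0], D))
print(count_vector(corpora[1], D))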

2.2 Document Similarity and Latent Semantic Analysis

Even though the 'bag of words' approach described in section 2.1 helps decrease the complexity of many classification problems, it is often helpful to perform further preprocessing. To help with clustering, it is sometimes useful to determine two different documents' similarity. If D1 and D2 are two different normalized document vectors in the same set of corpora, their similarity, S, is often estimated in the following manner: S = D1 · D2 [17]. Another way to help determine if documents are similar is to transform their document vectors into a lower order space which captures as many features about the corpora as possible while simplifying computational complexity and capturing the semantic structure of the corpora. One such method is called latent semantic analysis [4]. Latent semantic analysis relies on the statistical methods of singular value decomposition and principal component analysis (PCA) to accomplish these tasks when the document vectors described in section 2.1 are transformed into Euclidean spaces. If C is a set containing all of the corpora, then the document vectors are arranged in a single |C| × |D| mean-centered matrix which I will call X, with each row containing the dictionary vector of a different corpus in the set of corpora minus the average value of each index. This matrix is then rewritten using singular value decomposition as the product of three matrices UΣV*, where U contains the left singular vectors of X and V contains the right singular vectors of X. Σ is a diagonal matrix which contains the singular values, which give the PCA 'score' of each component. If the terms in Σ are in order from largest at position (1, 1) to smallest at position (|C|, |C|) or (|D|, |D|) (depending on whether |C| > |D|), then the columns of V will be the principal components of X, in order from the most important component to the least important component. The first principal component (the first column) captures as much of the variance in X as it is possible to capture with a single vector in Euclidean space, while the second captures as much of the variance as possible after removing the first component, and so on. Using this space one can build an approximation of the input data which only uses the most important factors by setting all but the top L values in Σ to 0.

Using a factor analysis like PCA, the high dimensionality of some documents' term space described in section 2.1 (for example, a document of word length 10,000 may have dimensionality in the thousands after removing stop words) can be reduced to a space containing only 100 or so factors, each of which is simply a linear combination of terms [4]. Further, classifications and queries can simply be transformed into this space, and then documents which are close to the query or the classification can be found using normal Euclidean distance algorithms like the k-nearest neighbors method described in section 3.2. This, of course, reduces computational complexity, but it also helps with a number of other problems in corpora classification. It helps remove problems which the classification algorithm might have with synonyms and antonyms, since these terms are likely to be highly correlated with each other. Thus a corpus which contains a number of terms which are synonyms of terms in a search query (though not contained in that query) will likely be found near the query, since the occurrence of one term of a synonym or antonym pair in the query will force the occurrence of the other term in the newly transformed space. Secondly, factor analysis can aid with polysemy (which is more difficult than simple synonymy). Lastly, a factor analysis can remove noise (unique and anecdotal uses of terms) which might occur in a richer and fuller representation of the corpora.
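
The following NumPy sketch illustrates the truncated-SVD step described above on a toy document-term matrix. The example matrix, the choice of L = 2, and the fold-in helper for mapping new queries into the factor space are illustrative assumptions rather than a prescribed implementation.

import numpy as np

def lsa_transform(X, L):
    """Project documents onto the top-L latent factors of a mean-centered
    |C| x |D| document-term matrix X (rows are documents)."""
    Xc = X - X.mean(axis=0)                 # mean-center each term column
    U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    # Keep the L largest singular values; discard the rest.
    return U[:, :L] * S[:L], Vt[:L]         # document coordinates, factor loadings

def fold_in(query_vec, X, Vt_L):
    """Map a new query/document vector into the same L-dimensional space."""
    return (query_vec - X.mean(axis=0)) @ Vt_L.T

# Toy document-term count matrix: 4 documents, 5 terms.
X = np.array([[2, 0, 1, 0, 0],
              [1, 1, 0, 0, 0],
              [0, 0, 0, 3, 1],
              [0, 0, 1, 2, 2]], dtype=float)
docs_L, Vt_L = lsa_transform(X, L=2)
print(docs_L)                               # documents in the reduced factor space
print(fold_in(X[0], X, Vt_L))               # folding a vector into the same space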

3 Document Clustering Methods

After one or all of the preprocessing techniques described in section 2 have been performed on a training set of corpora, the actual document classification algorithm is called. The methods described in the rest of this paper rely on automatic methods which run over training data rather than the hard-coded methods which might be used in expert systems. This automatic training algorithm can use many different techniques (or combinations of techniques); in this section I will focus on algorithms which classify documents by clustering the documents into different classification groups, as described in [11, 2, 3, 12].

3.1 K-Means Clustering

The first such method is k-means clustering. Given some Euclidean space (like those described in the preceding section), the k-means clustering algorithm distributes a set of k centroids randomly throughout the space. Then, the algorithm iteratively re-positions the centroids by first finding all of the instances (in our case, each individual corpus) in the space closest to each centroid and then re-positioning each centroid to the average of the corpora closest to it. This process repeats until some stopping condition is reached (usually when the centroids fail to move a significant distance or fail to trade more than some number of corpora with each other).

This form of clustering has the advantage of being easy to both implement and understand, as well as having a run time which is approximately linear in the number of documents [11]; however, it has a number of acute disadvantages. The first problem is that centroids cannot be classified before the algorithm runs. By this I mean that even if a centroid starts in an area of the document space which might be classified 'useful documents', it may end up moving away from that area over the course of the k-means computation and may finally end up in a region which would more properly be classified 'semi-useful' documents. Secondly, the method depends on a pre-defined number of centroids being set by the user. If the user picks an inappropriate number of centroids (e.g., 10 centroids for a set of documents which, on visual examination, should clearly be broken into 4 clusters), then the algorithm is unable to compensate for the user's misunderstanding of the clustering structure of the documents under examination. Another difficulty when dealing with k-means is that the distance measure of the k-means algorithm can arbitrarily affect the clustering process. Finally, and perhaps most importantly, k-means clustering is only capable of finding local minima in the clustering space. This means there may be some much better clustering using k clusters in the document space which simply isn't captured because the k-means algorithm settled into a local minimum. This can, however, be partially ameliorated by a method known as simulated annealing [18], which periodically shakes up the centroids, moving them around randomly, using a temperature function which cools as the algorithm runs on. If the shaken-up centroid combination scores sufficiently better than the algorithm's normal progress, then that combination of centroids is used instead of the normal progression of the algorithm. This aids k-means in finding a global minimum.
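
A minimal NumPy sketch of the basic k-means loop described above (without the simulated annealing refinement) follows. The random initialization, the tolerance-based stopping condition, and the choice of k are assumptions left to the user, as discussed.

import numpy as np

def kmeans(X, k, max_iters=100, tol=1e-6, seed=0):
    """Cluster the rows of X (document vectors) around k centroids."""
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)]   # random initial centroids
    for _ in range(max_iters):
        # Assign each document to its nearest centroid (Euclidean distance).
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Move each centroid to the mean of the documents assigned to it.
        new_centroids = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                                  else centroids[j] for j in range(k)])
        if np.linalg.norm(new_centroids - centroids) < tol:    # stopping condition
            break
        centroids = new_centroids
    return labels, centroids

# Toy usage on random document vectors.
X = np.random.default_rng(1).random((20, 5))
labels, centroids = kmeans(X, k=3)
print(labels)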

3.2 K-Nearest Neighbors

K-nearest neighbors (kNN) is a method of clustering which uses the documents in the training set as 'lamp posts' in the document vector space, each with its own classification. A new document d is classified by transforming it into that space using the preprocessing methods used to build the space in the first place, and then finding the k nearest training documents to the new document d. The new document is then classified with the classification which is the mode of its k nearest neighbors' classifications. Many different distance measures have been proposed to find the k nearest neighbors. These include: Minkowski distance, Manhattan distance, Chebyshev distance, Euclidean distance, Canberra distance, and Kendall's rank correlation [12].

K-nearest neighbors is fairly intuitive; however, it does come with some significant drawbacks. The first drawback is that kNN has large storage requirements, as the entire training set must be maintained to act as 'lamp posts'. Secondly, like k-means (in fact even more than k-means), kNN is dependent on which distance measure is used. Finally, like k-means, there is no real principled method to choose k; the best way to do so is to try a group of different ks and see which are most accurate. Two reasons are given in [12] for how the choice of k might affect the accuracy of the kNN algorithm. The first is that if too small a value of k is chosen and the data is highly noisy, then the noisy data may win the majority vote and produce bad answers. On the other hand, if k is too large, then classifications which have a small footprint in the document vector space may end up being swamped out by their larger neighbors.
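
The sketch below shows the kNN classification rule under the assumption of Euclidean distance; the toy training vectors, labels, and choice of k are placeholders for illustration.

import numpy as np
from collections import Counter

def knn_classify(new_doc, train_vectors, train_labels, k=3):
    """Classify new_doc with the mode of its k nearest training documents."""
    dists = np.linalg.norm(train_vectors - new_doc, axis=1)   # Euclidean distance
    nearest = np.argsort(dists)[:k]                           # indices of k closest 'lamp posts'
    votes = Counter(train_labels[i] for i in nearest)
    return votes.most_common(1)[0][0]                         # majority vote

# Toy usage: three training documents in a 4-term space.
train_vectors = np.array([[1, 0, 0, 1], [0, 1, 1, 0], [1, 1, 0, 0]], dtype=float)
train_labels = ["spam", "not-spam", "spam"]
print(knn_classify(np.array([1, 0, 1, 0], dtype=float), train_vectors, train_labels, k=1))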

3.3 Agglomerative Hierarchical Clustering

Another commonly used clustering method is agglomerative hierarchical clustering, a method that is often considered better than k-means though it is also slower (taking quadratic time in the number of documents) [11]. This method requires some definition of the 'closeness' of two different clusters. Using this definition, it starts with each document as its own individual cluster and then finds the two clusters which are closest to each other and merges them, decreasing the number of clusters by one. This process is repeated on the new clusters until some specified stopping condition is reached (e.g., a target number of clusters, or all clusters being farther than some distance from each other).

After the clustering algorithms described above (or some other clustering algorithm; these by no means represent all the possible clustering algorithms) are run, documents are classified by the cluster in which they appear. This can be done by manual examination of the clusters, or by picking a pre-classified training document or documents, seeing which clusters they appear in, and classifying each cluster based on the occurrence of a majority of training documents in that cluster, or by some other method. In the future, documents which were not in the training set can be classified based on which cluster they are closest to after they have been transformed into the cluster space using the same preprocessing methods which were used on the training set.
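
A rough sketch of the agglomerative loop follows, using single-link distance (the smallest pairwise distance between two clusters) as one possible definition of 'closeness' and a target number of clusters as the stopping condition; this naive version is slower than the quadratic-time implementations discussed in [11].

import numpy as np

def agglomerative_cluster(X, target_clusters=2):
    """Merge the two closest clusters (single-link distance) until
    only target_clusters remain. Returns a list of index lists."""
    clusters = [[i] for i in range(len(X))]          # every document starts as its own cluster
    def closeness(a, b):                             # single link: smallest pairwise distance
        return min(np.linalg.norm(X[i] - X[j]) for i in a for j in b)
    while len(clusters) > target_clusters:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = closeness(clusters[i], clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        clusters[i] = clusters[i] + clusters[j]      # merge the two closest clusters
        del clusters[j]
    return clusters

X = np.array([[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.1, 5.2]])
print(agglomerative_cluster(X, target_clusters=2))   # prints [[0, 1], [2, 3]]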

3.4 Support Vector Machines

In [2, 1] the authors explain that a support vector machine clusters data instances (documents) by finding an optimal hyper-plane in |D|-space (where D is a document vector) which maximizes the distance between each classification cluster and this hyper-plane. This, of course, requires the classification to be linearly separable, a property which the set of corpora is, in fact, highly unlikely to have in |D|-space. However, two work-arounds to this problem are commonly used with support vector machines. The first is to allow some document instances to cross the hyper-plane and to create an associated error term for each such document, which is not allowed to exceed some value for the hyper-plane to be valid. The second method is to transform the documents into a feature space which has a higher dimensionality than the document vector space. With enough dimensionality, a consistent training set can be made linearly separable. The hyper-plane which is discovered by the support vector machine in this higher order space is linear in that space but non-linear in the document vector space.

A problem with raising documents into the higher dimensional space is that no automatic method for choosing the 'best space' has been found. Instead, document vectors must often be transformed into multiple spaces, and each space must be checked for how well it splits the set of corpora into classification clusters. This cross-validation can require extra programming resources as well as extra computational time. Further, because SVMs split space along a hyper-plane, they are inherently binary in nature. Therefore, whenever more than two classifications need to be made, the problem must be broken into multiple binary classifications for the SVM to be useful.
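
As a sketch of the first work-around (a soft margin), the code below trains a linear SVM by stochastic sub-gradient descent on the hinge loss. The Pegasos-style update is my own choice of solver rather than one prescribed by the sources, labels are assumed to be +1/-1, and the higher-dimensional feature transformation discussed above would be applied to X before training.

import numpy as np

def train_linear_svm(X, y, lam=0.01, epochs=50, seed=0):
    """Soft-margin linear SVM: minimize (lam/2)*||w||^2 + average hinge loss
    by stochastic sub-gradient descent. y must contain +1 / -1 labels."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    w, t = np.zeros(d), 0
    for _ in range(epochs):
        for i in rng.permutation(n):
            t += 1
            eta = 1.0 / (lam * t)                     # decaying step size
            if y[i] * (X[i] @ w) < 1:                 # document violates the margin
                w = (1 - eta * lam) * w + eta * y[i] * X[i]
            else:                                     # only the regularizer contributes
                w = (1 - eta * lam) * w
    return w

def classify(w, X):
    return np.sign(X @ w)                             # which side of the hyper-plane

# Toy linearly separable data.
X = np.array([[2.0, 1.0], [1.5, 2.0], [-1.0, -1.5], [-2.0, -1.0]])
y = np.array([1, 1, -1, -1])
w = train_linear_svm(X, y)
print(classify(w, X))                                 # should recover y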

4 Bayesian Networks

Another automatic method used to classify documents relies on the probabilistic ties between features and different classifications. Methods relying on Bayesian networks are able to capture the fact that a document's classification can depend on the existence of only one or a few terms. This is an advantage over clustering methods, which can often treat different terms as equally important when creating the space in which clustering occurs (though some preprocessing methods, like importance weighting, can fix this problem). Bayesian networks capture, in an easy-to-understand framework, the dependence between document features and classifications.

4.1 Spam Filtering: Naive Bayes

One important application of corpora classification is spam filtering [15]. Since people who send spam have an incentive to get spam past spam filters, spam is constantly evolving to defeat current methods. The machine learning technique of Bayesian filtering, however, has been shown to appropriately filter spam with an accuracy of 95%. These spam filters use what is called a naive Bayesian technique. Emails are first preprocessed using any of the methods in section 2, using at least the 'bag of words' technique. Then each term is treated as a feature of the email. This method uses Bayes' theorem, along with the assumption that the probability that any feature appears in a spam message is conditionally independent of the probability that any other feature appears in a spam message (this is where the 'naive' part of naive Bayes comes from). According to Bayes' theorem, if we let C be a random variable representing whether or not a message is spam, and let n be the number of terms in the email, then:

P(C | f_1, f_2, ..., f_n) = P(f_1, f_2, ..., f_n | C) P(C) / P(f_1, f_2, ..., f_n)

The term P(C) in the above equation is fairly easy to calculate: it is simply the ratio of messages which are spam in some training set. Further, because it is assumed that each feature has a conditional probability that is independent of every other feature, we can calculate P(f_1, f_2, ..., f_n | C) by simply taking the product of P(f_i | C) over i = 1 to n. Finally, the denominator can be calculated by simply adding P(f_1, f_2, ..., f_n | C) P(C) and P(f_1, f_2, ..., f_n | ¬C) P(¬C), where ¬C is the case where the message is not spam. Documents are then classified as spam if P(C | f_1, f_2, ..., f_n) / P(¬C | f_1, f_2, ..., f_n) ≥ β, where β is usually 1.

One of the questions surrounding naive Bayes classification is why such an algorithm works even though it makes a clearly ridiculous assumption (conditional independence between features) about the computational space in which it is working. It has been shown that on some tasks naive Bayes works as well as or better than the C4 algorithm at classification [14]. One explanation, posited in [20], is that while terms may not be conditionally independent of each other, as long as the structure of their dependence is the same for both possible classifications, the naive Bayes classifier will still work well on those classifications.

The naive Bayes classifier is a quick and dirty method for classifying documents into categories. However, while it often works in practice for spam filtering, it has been discovered that there are some conditions in which it may not work well [16]. In these situations a complete Bayesian network may be more appropriate.
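
A minimal sketch of a Bernoulli naive Bayes spam filter following the equations above; the Laplace smoothing term, the log-space computation (used to avoid numerical underflow), and the toy existence vectors are my own assumptions.

import numpy as np

def train_naive_bayes(X, y, alpha=1.0):
    """X: (n_msgs, n_terms) existence vectors; y: 1 for spam, 0 for not-spam.
    Returns P(C) and the per-term conditional probabilities for spam and not-spam."""
    spam, ham = X[y == 1], X[y == 0]
    p_c = len(spam) / len(X)                                   # P(C): ratio of spam messages
    # P(f_i | C) and P(f_i | not C), with Laplace smoothing (alpha).
    p_f_spam = (spam.sum(axis=0) + alpha) / (len(spam) + 2 * alpha)
    p_f_ham = (ham.sum(axis=0) + alpha) / (len(ham) + 2 * alpha)
    return p_c, p_f_spam, p_f_ham

def is_spam(x, p_c, p_f_spam, p_f_ham, beta=1.0):
    """Classify as spam if P(C|f)/P(not C|f) >= beta, working in log space."""
    log_spam = np.log(p_c) + np.sum(np.where(x == 1, np.log(p_f_spam), np.log(1 - p_f_spam)))
    log_ham = np.log(1 - p_c) + np.sum(np.where(x == 1, np.log(p_f_ham), np.log(1 - p_f_ham)))
    return (log_spam - log_ham) >= np.log(beta)

# Toy training set: 4 messages over 3 terms.
X = np.array([[1, 1, 0], [1, 0, 0], [0, 1, 1], [0, 0, 1]])
y = np.array([1, 1, 0, 0])
params = train_naive_bayes(X, y)
print(is_spam(np.array([1, 0, 0]), *params))                   # classifies this message as spam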

5 More Generalized Bayesian Networks: TAN

A Bayesian network is a directed acyclic graph which has terms and classifications as nodes. If an edge exists from random variable A to random variable B, then random variable B is dependent on random variable A; otherwise it is conditionally independent of random variable A [7]. The naive Bayes approach described in section 4.1 can be thought of as a specialized subset of a Bayesian network in which all edges go only from classifications to terms, and in which each term has each classification as a parent node. Full Bayesian networks are more expressive than a network limited by the naive Bayes assumption; however, solving a general Bayesian network is much more computationally intensive than solving a naive Bayesian network [7]. This extra computational difficulty in a general Bayesian network comes primarily from two different sources. The first is that under the naive Bayes assumption, the graph is already known. However, in a general Bayesian network, the graph is unknown and some machine learning technique must be used to find which random variables share edges. This problem has been found to be NP-complete [1]. The second problem is that of computing the conditional probabilities of various terms. If a term A has m different Boolean random variable parents, then there are 2^m different combinations which those parents can take on and, as such, A has 2^m different conditional probabilities.

In [7] the authors tackle the first and second problems described in the preceding paragraph by limiting themselves to a subset of Bayesian networks called Tree Augmented Naive Bayes networks (TAN). In such a network, each term or feature random variable (i.e., each non-class random variable) is allowed to have all classes as parents, in order to capture the special status of classes, and it is allowed to have one other attribute as a parent and as many children as necessary. This creates a tree structure (along with a Markov blanket of classes which covers all attribute nodes) which is computationally tractable. The cost function they use to find their network is the minimal description length (MDL) [13], a method which has been shown to find the correct underlying Bayesian network with probability approaching 1 as the number of training data points is increased.

Bayesian networks can occasionally have the problem of overfitting. This happens when a model so perfectly describes the data that it has been trained on that it begins to treat errors in the training data as if they should be first order factors in the model. The model begins to lose its ability to generalize, and instead becomes merely good at finding elements of the original training set. More formally, overfitting occurs for "any model that returns some hypothesis h whenever there is some other hypothesis h' which has a larger error than h on the training data but smaller error than h on the general data set" [12]. This is avoided by the MDL method used in [7], since the MDL method penalizes networks which have too many edges. This makes it difficult for the network to perfectly capture the training data and, as such, avoids the overfitting problem.

TAN was found to outperform naive Bayesian classifiers a majority of the time. It also outperformed the selective naive Bayesian classifier, a classifier which limits the naive Bayesian network to a smaller set of terms which have been shown to give better than average performance.
TAN was also shown to perform slightly better than C4.5 (a derivative of the ID3 algorithm described in section 6) a majority of the time, and to have performance almost exactly comparable to CL-Multinet, a multi-network Bayesian classifier.

While Bayesian networks like the naive Bayes network and TAN are important modern document classification methods, another important method used to classify documents is the decision tree. In the next section I examine the ID3 algorithm, a method for automatically building decision trees given a training set.
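
Before turning to decision trees, the sketch below illustrates one way the TAN tree over attributes can be chosen: score each candidate attribute pair by its class-conditional mutual information and build a maximum-weight spanning tree, so that each attribute receives at most one extra parent. This scoring function is a stand-in for the MDL-based search described above, and the Laplace smoothing and toy data are assumptions for illustration.

import numpy as np
from itertools import product

def cond_mutual_info(X, y, i, j, alpha=1.0):
    """Estimate I(X_i; X_j | C) for binary features X[:, i], X[:, j]
    and binary class y, from counts with Laplace smoothing."""
    n = len(y)
    cmi = 0.0
    for c, xi, xj in product([0, 1], repeat=3):
        mask_c = (y == c)
        n_c = mask_c.sum()
        p_xyc = (((X[:, i] == xi) & (X[:, j] == xj) & mask_c).sum() + alpha) / (n + 8 * alpha)
        p_x_c = (((X[:, i] == xi) & mask_c).sum() + alpha) / (n_c + 2 * alpha)
        p_y_c = (((X[:, j] == xj) & mask_c).sum() + alpha) / (n_c + 2 * alpha)
        p_xy_c = (((X[:, i] == xi) & (X[:, j] == xj) & mask_c).sum() + alpha) / (n_c + 4 * alpha)
        cmi += p_xyc * np.log(p_xy_c / (p_x_c * p_y_c))
    return cmi

def tan_edges(X, y):
    """Greedy (Prim-style) maximum-weight spanning tree over attributes,
    weighting each candidate edge by conditional mutual information."""
    d = X.shape[1]
    in_tree, edges = {0}, []
    while len(in_tree) < d:
        best = max((cond_mutual_info(X, y, i, j), i, j)
                   for i in in_tree for j in range(d) if j not in in_tree)
        _, i, j = best
        edges.append((i, j))        # attribute i becomes the extra parent of attribute j
        in_tree.add(j)
    return edges

# Toy data: 6 documents, 3 binary term features, binary class.
X = np.array([[1, 1, 0], [1, 1, 1], [0, 0, 1], [0, 0, 0], [1, 0, 1], [0, 1, 0]])
y = np.array([1, 1, 0, 0, 1, 0])
print(tan_edges(X, y))              # the chosen parent-child edges between attributes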

6 Decision Trees: The ID3 and C4.5 Algorithms

The ID3 algorithm, and its derivative the C4.5 algorithm, both classify documents by building up a decision tree from a data set [9, 12]. The ID3 algorithm does this iteratively, using a greedy algorithm at each node to determine which non-target attribute (in our case, a term in the document vector) is most likely to reduce the amount of target-attribute (i.e., document classification) entropy in the training set. The entropy of a subset S of the training data can be found with the following equation:

E(S) = − Σ_{j=1}^{n} f_S(j) log_2 f_S(j)

where f_S(j) is the frequency of target attribute (i.e., class) j in set S. When only two target attributes are possible, for example 'spam' and 'not-spam', this becomes E(S) = −f_S(+) log_2 f_S(+) − f_S(−) log_2 f_S(−), where '+' represents 'class x' and '−' represents 'not class x'. Next, for each non-target attribute A the algorithm calculates the gain due to splitting on that attribute in the following manner:

E(S, A) = E(S) − Σ_{i=1}^{m} f_S(A_i) E(S_{A_i})

where f_S(A_i) is the frequency of items in the set S which have attribute A set to value A_i, and E(S_{A_i}) is the entropy of the subset of S for which A takes the value A_i. To make this clearer, we can think of an example where the non-target attribute under examination is the existence of the word 'dog' in a document. The full gain of 'dog' would be computed as E(S, dog) = E(S) − f_S(dog) E(S_dog) − f_S(no-dog) E(S_no-dog), where f_S(dog) is the proportion of elements of S that contain 'dog', E(S_dog) is the entropy of the subset of elements of S that contain the word 'dog', f_S(no-dog) is the frequency of documents that do not contain the word 'dog', and E(S_no-dog) is the entropy of the subset of elements in S that do not contain the word 'dog'. Finally, we can construct a rough pseudo-code for the ID3 algorithm, as shown in Algorithm 1. This process splits the space at each node so as to minimize entropy in the next step down through the tree.

One of the problems experienced when using this algorithm is overfitting, which can occur if the tree is allowed to grow too deep. Ways to combat this effect include setting a limit on the maximum depth of the tree, and checking whether the gain is less than some small value ε rather than simply equal to zero for the stop condition. These are called pre-processing pruning methods, in that they prune the tree before it has been completely built by the ID3 algorithm. Post-processing pruning methods also exist; however, if a classification is difficult to describe in the document space, post-pruning damages the tree, causing its accuracy to fall [5].

The C4.5 algorithm differs from the ID3 algorithm in a few major ways: it allows for continuous attribute spaces by creating thresholds which divide the continuous attribute spaces into discrete ranges, it allows data to have missing attributes, and it allows different attributes to have weights placed on them, increasing or decreasing their cost in the entropy function.

Algorithm 1 The ID3 Algorithm
Input: S, Target Attribute, Attributes
Output: Decision Tree
  create a root node
  if E(S) = 0 or |Attributes| = 0 then
      label the node with the class j such that f_S(j) ≥ f_S(i) for all i
  else
      A := the attribute with the greatest gain
      label the current node with A
      split the set S based on attribute A
      remove A from Attributes
      re-run ID3 on each subset as a child of the current node
  end if
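
A compact Python sketch of Algorithm 1 for Boolean term attributes, following the entropy and gain definitions above. The data representation (a list of dictionaries with a 'class' key) and the toy corpus are my own choices for illustration, and no depth limit or ε threshold is applied.

import math
from collections import Counter

def entropy(examples, target):
    """E(S) = -sum_j f_S(j) log2 f_S(j) over the target attribute's values."""
    counts = Counter(e[target] for e in examples)
    total = len(examples)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def gain(examples, attr, target):
    """E(S) minus the frequency-weighted entropy of each subset S_Ai."""
    total = len(examples)
    remainder = 0.0
    for value in {e[attr] for e in examples}:
        subset = [e for e in examples if e[attr] == value]
        remainder += (len(subset) / total) * entropy(subset, target)
    return entropy(examples, target) - remainder

def id3(examples, attributes, target):
    """Return a nested-dict decision tree, with class labels at the leaves."""
    if entropy(examples, target) == 0 or not attributes:
        return Counter(e[target] for e in examples).most_common(1)[0][0]
    best = max(attributes, key=lambda a: gain(examples, a, target))
    tree = {best: {}}
    for value in {e[best] for e in examples}:
        subset = [e for e in examples if e[best] == value]
        tree[best][value] = id3(subset, [a for a in attributes if a != best], target)
    return tree

# Toy corpus: does a document mention 'dog'/'cat', and is it about pets?
docs = [{"dog": 1, "cat": 0, "class": "pets"},
        {"dog": 1, "cat": 1, "class": "pets"},
        {"dog": 0, "cat": 0, "class": "not-pets"},
        {"dog": 0, "cat": 1, "class": "pets"}]
print(id3(docs, ["dog", "cat"], "class"))   # prints a nested-dict decision tree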

7 Conclusion

When compared to other classification methods, ID3 and C4.5 compare relatively well. As noted previously, C4.5 outperforms naive Bayesian classifiers, though more general Bayesian network classifiers are able to outperform C4.5. However, C4.5 does have some drawbacks with respect to naive Bayes. It is less able to withstand missing information than a naive Bayes classifier. Additionally, decision trees are slightly less tolerant of noise in their training data set and in the general data set. It is also the case that naive Bayes is better at handling overfitting, since the underlying model is so simple that it is difficult to capture all of the interdependencies of any training data set. Lastly, it is difficult to use C4.5 to learn about the world incrementally, while naive Bayes accomplishes this task very well [12].

When looking at kNN, it is easy to see that this classification technique can have some real problems with accuracy, since it depends so heavily on the local nature of the training set. Further, kNN is highly intolerant of missing information in its input space (since a missing value essentially takes a document which is a single point and turns it into a line through the space). kNN is also much less tolerant of synonymy and polysemy than other classification methods. kNN with large values of k may be tolerant of noise; however, this trade-off comes at the price of being unable to classify documents whose classification is held by only a small number of training documents. The one area in which kNN shines is in its ability to add new data to its model [12].

Support vector machines are the newest of the methods described in this paper. When properly implemented, they have some of the best accuracy of any of the methods of corpora classification. However, this comes at a cost. The primary cost is the difficulty of their use along with the run time of their computations. They transform objects into a high dimensional space and then try to solve Lagrangian functions in that space. Additionally, it is often the case that many different dimensional transformations must be used before a 'good' transformation is found for the set of corpora which the user wants to automatically classify. After they have learned their classification procedure, however, they classify new documents relatively quickly and are highly tolerant of irrelevant attributes in the input corpora. A drawback, however, is that they are not so tolerant of noise, since noise complicates the transformation space in which they must work. Additionally, they can only perform binary classification splits. Finally, it is very difficult to get an explanation for any given classification choice from an SVM, since the transformation masks which attributes contributed to a classification decision [12].

In the future, I think it will be important to look at hybridizing these techniques in order to come up with better classification techniques than those which currently exist. One bit of work which may prove fruitful is combining decision trees with naive Bayes to come up with a new classification technique which determines a classification in a multi-class environment at each node and then uses that data to influence the construction of the naive Bayes classifier at the next node down. In fact, in [8] the authors combine the naive Bayes classifier and the SVM to create a brand new classifier. Their method first applies the naive Bayes classifier to produce a vector of probabilities, one for each classification. Then, this vector is used to represent the document in the SVM. Essentially, the Bayesian classifier shrinks the dimensionality of the space representing each document while the SVM re-increases the space. Methods like this can be used to decrease the computational complexity of the learning process used by the SVM and thus increase its potential strengths.

All in all, what I have discovered in this process of surveying different corpora classification methods is that, at least when limited to methods which currently exist, no one method is going to work for every situation. Different methods have different strengths and weaknesses. Some methods learn quickly but have low accuracy; other methods have high accuracy but require a large amount of storage. There are methods which can easily explain their 'reasoning' to their users, while other methods rely on techniques so opaque that a user would have to have several advanced degrees to understand fully what the method is doing. When performing automatic corpora classification, it is still up to the coder/user to pick the classification method which works best for their domain.

References

[1] Chickering, D. Learning Bayesian networks is NP-complete. Lecture Notes in Statistics, New York: Springer Verlag (1996), 121–130.
[2] Cortes, C., and Vapnik, V. Support-vector networks. Machine Learning 20, 3 (1995), 273–297.
[3] Cover, T., and Hart, P. Nearest neighbor pattern classification. Information Theory, IEEE Transactions on 13, 1 (1967), 21–27.
[4] Deerwester, S., Dumais, S., Furnas, G., Landauer, T., and Harshman, R. Indexing by latent semantic analysis. Journal of the American Society for Information Science 41, 6 (1990), 391–407.
[5] Elomaa, T. The biases of decision tree pruning strategies. Advances in Intelligent Data Analysis (1999), 63–74.
[6] Flesch, R. A new readability yardstick. Journal of Applied Psychology 32, 3 (1948), 221.
[7] Friedman, N., Geiger, D., and Goldszmidt, M. Bayesian network classifiers. Machine Learning 29, 2 (1997), 131–163.
[8] Isa, D., Lee, L., Kallimani, V., and Raj Kumar, R. Text document preprocessing with the Bayes formula for classification using the support vector machine. Knowledge and Data Engineering, IEEE Transactions on 20, 9 (2008), 1264–1272.
[9] Jin, C., De-lin, L., and Fen-xiang, M. An improved ID3 decision tree algorithm. In Computer Science & Education, 2009. ICCSE'09. 4th International Conference on (2009), IEEE, pp. 127–130.


[10] Jonquet, C., LePendu, P., Falconer, S., Coulet, A., Noy, N., Musen, M., and Shah, N. NCBO Resource Index: ontology-based search and mining of biomedical resources. Web Semantics: Science, Services and Agents on the World Wide Web 9, 3 (2011), 316–324.
[11] Karypis, M., Kumar, V., and Steinbach, M. A comparison of document clustering techniques. In KDD Workshop on Text Mining (2000).
[12] Kotsiantis, S., Zaharakis, I., and Pintelas, P. Supervised machine learning: A review of classification techniques. Frontiers in Artificial Intelligence and Applications 160 (2007), 3.
[13] Lam, W., and Bacchus, F. Learning Bayesian belief networks: An approach based on the MDL principle. Computational Intelligence 10, 3 (1994), 269–293.
[14] Langley, P., Iba, W., and Thompson, K. An analysis of Bayesian classifiers. In Proceedings of the National Conference on Artificial Intelligence (1992), John Wiley & Sons, pp. 223–223.
[15] Obied, A. Bayesian spam filtering, 2007.
[16] Rish, I. An empirical study of the naive Bayes classifier. In IJCAI 2001 Workshop on Empirical Methods in Artificial Intelligence (2001), vol. 3, pp. 41–46.
[17] Salton, G., and Buckley, C. Term-weighting approaches in automatic text retrieval. Information Processing & Management 24, 5 (1988), 513–523.
[18] Selim, S. Z., and Alsultan, K. A simulated annealing algorithm for the clustering problem. Pattern Recognition 24, 10 (1991), 1003–1008.
[19] Wilbur, W. J., and Sirotkin, K. The automatic identification of stop words. Journal of Information Science 18 (February 1992), 45–55.
[20] Zhang, H. The optimality of naive Bayes. AA 1, 2 (2004), 3.
