Large Scale Text Classification using Semi-supervised Multinomial Naive Bayes


Jiang Su (1) [email protected]
Jelber Sayyad-Shirabad (1) [email protected]
Stan Matwin (1,2) [email protected]

(1) School of Information Technology and Engineering, University of Ottawa, K1N 6N5, Canada
(2) Institute for Computer Science, Polish Academy of Sciences, Warsaw, Poland

Abstract

Numerous semi-supervised learning methods have been proposed to augment Multinomial Naive Bayes (MNB) using unlabeled documents, but their use in practice is often limited by implementation difficulty, inconsistent prediction performance, or high computational cost. In this paper, we propose a new, very simple semi-supervised extension of MNB, called Semi-supervised Frequency Estimate (SFE). Our experiments show that it consistently improves MNB with additional data (labeled or unlabeled) in terms of AUC and accuracy, which is not the case when combining MNB with Expectation Maximization (EM). We attribute this to the fact that SFE consistently produces better conditional log likelihood values than both EM+MNB and MNB on the labeled training data.

1. Introduction

Multinomial Naive Bayes (MNB) has been widely used in text classification. Given a set of labeled data, MNB is typically trained with a parameter learning method called Frequency Estimate (FE), which estimates word probabilities by computing appropriate frequencies from the data. The major advantages of FE are that it is simple to implement, often provides reasonable prediction performance, and is efficient. Since the cost of obtaining labeled documents is usually high while unlabeled documents are abundant, it is desirable to leverage the unlabeled data to improve the MNB model learned from the labeled data.

Numerous semi-supervised learning methods have been proposed to achieve this, and Expectation-Maximization (EM) (Dempster et al., 1977) is often used with MNB in a semi-supervised setting. Though the combination of EM+MNB is relatively fast and simple to use, past research has identified some inconsistencies with it. Namely, depending on the given dataset, EM may increase or decrease the prediction performance of MNB (Nigam et al., 2000). Additionally, (Chawla & Karakoulas, 2005) observed that an EM-based technique called Common Components underperforms naive Bayes in terms of AUC given moderately large labeled data. Thus, there is still a need for a semi-supervised learning method that is fast, simple to use, and consistently improves the prediction performance of MNB.

This paper presents Semi-supervised Frequency Estimate (SFE), a novel semi-supervised parameter learning method for MNB. We first point out that EM's objective function, maximizing marginal log likelihood (MLL), is quite different from the goal of classification learning, i.e. maximizing conditional log likelihood (CLL). We then propose SFE, which combines word probability estimates obtained from unlabeled data with class conditional probabilities given a word learned from labeled data to set the parameters of an MNB model. Our analysis shows that both SFE and EM learn the same word probability estimates from unlabeled data, but SFE obtains better CLL values than EM on the labeled training data. SFE is easy to implement and does not require additional meta-parameter tuning. Our experiments with eight widely used text classification datasets show that SFE consistently improves the AUC of MNB given different numbers of labeled documents, and also generates better AUC than EM for most of these datasets without any loss on the rest. Finally, while EM is one of the fastest semi-supervised learning methods, our computational cost comparisons on these datasets show that SFE can be as much as two orders of magnitude faster than EM and is potentially scalable to billions of unlabeled documents.

2. Related Work

Expectation Maximization (EM) is often chosen to make use of unlabeled data when learning an MNB model (Nigam et al., 2000). The combination of EM+MNB yields a fast semi-supervised learning method. However, (Nigam et al., 2000) point out that EM may decrease the performance of MNB when a dataset contains multiple subtopics in one class. They proposed a Common Components (CC) method using EM to address this problem. As already mentioned, (Chawla & Karakoulas, 2005) observed that while CC may improve the AUC of naive Bayes given a small amount of labeled data, it may significantly underperform naive Bayes given larger amounts of labeled data. Though many semi-supervised learning methods have been proposed in recent years, there is no dominant method in this area. (Zhu, 2008) points out that the reason for this is that semi-supervised learning methods need to make stronger model assumptions than supervised learning methods, and thus their performance may be data dependent. (Mann & McCallum, 2010) proposed the Generalized Expectation method and observed that the classical EM+MNB outperforms it on text classification datasets.

3. Text Document Representation

In text classification, a labeled document d is represented as d = {w_1, w_2, ..., w_i, c}, where each variable or feature w_i corresponds to a word in the document d, and c is the class label of d. The set of unique words w appearing in the whole document collection is called the vocabulary V. Typically, the value of w_i is the frequency f_i of the word w_i in document d. We use the boldface lower case letter w for the set of words in a document d, so a document can also be represented as {w, c}. We use T to denote the training data and d_t for the t-th document in a dataset T. Each document d contains |d| words. In general, we use a "hat" (^) to indicate parameter estimates.

Text representation often uses the bag-of-words approach. By ignoring the ordering of the words in a document, a word sequence can be transformed into a bag of words. In this way, only the frequency of a word in a document is recorded, and structural information about the document is ignored. In the bag-of-words approach, a document is often stored in a sparse format, i.e. only the words with non-zero frequency are stored. The sparse format can significantly reduce the required storage space.

Text classification is often considered different from traditional machine learning because of its high-dimensional and sparse data characteristics. The high dimensionality poses computational constraints, while the sparsity means that a document may have to be classified based on the values of a small number of features. Thus, finding an algorithm that is both efficient and generalizes well is a challenge for this application domain.
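To make the representation concrete, the following is a minimal sketch, not code from the paper, of turning a tokenized document into a sparse bag-of-words dictionary; the whitespace tokenizer, example text, and class label are illustrative assumptions.

```python
from collections import Counter

def bag_of_words(tokens):
    """Map a token sequence to a sparse {word: frequency} dictionary.

    Word order is discarded and only non-zero frequencies are kept,
    which is the sparse bag-of-words storage described above.
    """
    return dict(Counter(tokens))

# Illustrative labeled document {w, c}; the text and label are made up.
tokens = "the cat sat on the mat".split()
document = (bag_of_words(tokens), "pets")
print(document)
# ({'the': 2, 'cat': 1, 'sat': 1, 'on': 1, 'mat': 1}, 'pets')
```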

4. Multinomial Naive Bayes

The task of text classification can be approached from a Bayesian learning perspective, which assumes that the word distributions in documents are generated by a specific parametric model whose parameters can be estimated from the training data. Equation 1 shows the Multinomial Naive Bayes (MNB) model (McCallum & Nigam, 1998), one such parametric model commonly used in text classification:

$$P(c|d) = \frac{P(c) \prod_{i=1}^{n} P(w_i|c)^{f_i}}{P(d)} \qquad (1)$$

where f_i is the number of occurrences of the word w_i in a document d, P(w_i|c) is the conditional probability that the word w_i occurs in a document given the class value c, and n is the number of unique words in the document d. P(c) is the prior probability that a document with class label c occurs in the document collection.

The parameters in Equation 1 can be estimated by a generative parameter learning approach, called maximum likelihood or frequency estimate (FE), which simply uses relative frequencies in the data. FE estimates the conditional probability P(w_i|c) using the relative frequency of the word w_i in documents belonging to class c:

$$\hat{P}(w_i|c) = \frac{N_{ic}}{N_c} = \frac{N_{ic}}{\sum_{j=1}^{|V|} N_{jc}} \qquad (2)$$

where N_{ic} is the number of occurrences of the word w_i in training documents T with the class label c, and N_c is the total number of word occurrences in documents with class label c in T, which can be computed from the N_{ic}.
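To make Equations 1 and 2 concrete, here is a minimal sketch, not code from the paper, of FE training and log-space MNB classification. The Laplace smoothing constant alpha is an added assumption (to avoid zero probabilities for words unseen in a class), and all function and variable names are illustrative.

```python
import math
from collections import defaultdict

def train_fe(labeled_docs, alpha=1.0):
    """Frequency Estimate (Equation 2) with an assumed Laplace prior alpha.

    labeled_docs: iterable of (bag_of_words_dict, class_label) pairs.
    Returns class priors P(c), conditionals P(w_i|c), and the vocabulary.
    """
    n_ic = defaultdict(lambda: defaultdict(float))  # N_ic: class -> word -> count
    class_docs = defaultdict(int)
    vocab = set()
    for bag, c in labeled_docs:                     # one pass over the training data
        class_docs[c] += 1
        for w, f in bag.items():
            n_ic[c][w] += f
            vocab.add(w)
    total_docs = sum(class_docs.values())
    priors = {c: n / total_docs for c, n in class_docs.items()}
    cond = {}
    for c, counts in n_ic.items():
        n_c = sum(counts.values())                  # N_c = sum_j N_jc
        denom = n_c + alpha * len(vocab)
        cond[c] = {w: (counts.get(w, 0.0) + alpha) / denom for w in vocab}
    return priors, cond, vocab

def classify(bag, priors, cond, vocab):
    """Return argmax_c of log P(c) + sum_i f_i * log P(w_i|c) (Equation 1 in log space)."""
    best_c, best_score = None, -math.inf
    for c, prior in priors.items():
        score = math.log(prior)
        for w, f in bag.items():
            if w in vocab:                          # ignore out-of-vocabulary words
                score += f * math.log(cond[c][w])
        if score > best_score:
            best_c, best_score = c, score
    return best_c
```

Working in log space avoids floating-point underflow for long documents, and P(d) from Equation 1 can be dropped from the score because it does not depend on the class c.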


For convenience of implementation, the FE parameter learning method only needs to update the word frequencies N_{ic}, which can easily be converted to P̂(w_i|c) using Equation 2. To compute the frequencies from a given training dataset, we go through each training document and increase the entry for N_{ic} in a word frequency table by 1 or a constant. By processing the training dataset once, we obtain all the required frequencies:

$$N_{ic} = \sum_{t=1}^{|T|} f_{ic}^{t} \qquad (3)$$

where f_{ic}^{t} is the number of occurrences of the word w_i in the document d_t with the class label c. Once we have N_{ic} in hand, we can also estimate P(w_i):

$$\hat{P}(w_i) = \frac{\sum_{c=1}^{|C|} N_{ic}}{\sum_{j=1}^{|V|} \sum_{c=1}^{|C|} N_{jc}} = \frac{N_i}{\sum_{j=1}^{|V|} N_j} \qquad (4)$$

where N_i is the number of occurrences of the word w_i in the whole dataset.
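Assuming a per-class count table like the one sketched above (a mapping class -> word -> N_ic, which is an illustrative data structure rather than one specified in the paper), Equation 4 can be computed directly from those counts:

```python
from collections import defaultdict

def estimate_word_probs(n_ic):
    """Estimate P(w_i) (Equation 4) by summing the N_ic table over classes."""
    n_i = defaultdict(float)
    for counts in n_ic.values():       # sum over classes: N_i = sum_c N_ic
        for w, f in counts.items():
            n_i[w] += f
    total = sum(n_i.values())          # denominator: sum_j sum_c N_jc
    return {w: count / total for w, count in n_i.items()}
```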

The FE method is a generative learning approach because its objective function, shown in Equation 5, is the log likelihood (LL):

$$LL(T) = \sum_{t=1}^{|T|} \log \hat{P}(c|\mathbf{w}_t) + \sum_{t=1}^{|T|} \log \hat{P}(\mathbf{w}_t) \qquad (5)$$
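To connect Equation 5 with the terminology used in the introduction, the first sum is the conditional log likelihood (CLL) and the second is the marginal log likelihood (MLL); this is only a naming of the two terms, written out below, not an additional result.

$$LL(T) = \underbrace{\sum_{t=1}^{|T|} \log \hat{P}(c|\mathbf{w}_t)}_{\text{CLL}} + \underbrace{\sum_{t=1}^{|T|} \log \hat{P}(\mathbf{w}_t)}_{\text{MLL}}$$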

5. Semi-supervised Learning for MNB

In practice, it is often desirable to use unlabeled documents in order to partially compensate for the scarcity of labeled documents. While the unlabeled documents only provide P(w) information, the MLL term in Equation 5 provides an opportunity to utilize those documents in classification. In this section, we use subscripts l and u to distinguish the parameters estimated from the labeled data T_l, the unlabeled data T_u, and the combination of labeled and unlabeled data T_{u+l}. We also assume that |T_l|
