Promoting Domain-Specific Terms in Topic Models with Informative Priors

Angela Fan, Finale Doshi-Velez, Luke Miratrix

arXiv:1701.03227v1 [cs.CL] 12 Jan 2017

Harvard University

Abstract

Latent Dirichlet Allocation (LDA) models trained without stopword removal often produce topics with high posterior probabilities on uninformative words, obscuring the underlying corpus content. Even when canonical stopwords are manually removed, uninformative words common in that corpus will still dominate the most probable words in a topic. We propose a simple strategy for automatically promoting terms with domain relevance and demoting these domain-specific stopwords. Our approach is easily applied within any existing LDA framework. It increases the amount of domain-relevant content and reduces the appearance of canonical and human-evaluated stopwords in three very different domains: Department of Labor accident reports, online health forum posts, and NIPS abstracts. Along the way, we show that standard topic quality measures such as coherence and pointwise mutual information act counter-intuitively in the presence of common but irrelevant words. We also explain why these standard metrics fall short, propose an additional topic quality metric that targets the stopword problem, and show that it correlates with our human subject experiments.

Introduction

Latent Dirichlet Allocation (LDA) (Blei, Ng, and Jordan 2003) is a popular model for representing topics in large textual corpora as probability vectors over terms in the vocabulary. LDA posits that each document $d$ is a mixture $\theta_d$ over $K$ topics, each topic $k$ is a mixture $\beta_k$ over the vocabulary, and $w_{d,n}$, the $n$th word in document $d$, is generated by first sampling a topic and then drawing a word from that topic:

$$\theta_d \sim \mathrm{Dirichlet}(\alpha), \qquad \beta_k \sim \mathrm{Dirichlet}(\eta), \qquad z_{d,n} \sim \mathrm{Mult}(\theta_d), \qquad w_{d,n} \sim \mathrm{Mult}(\beta_{z_{d,n}})$$

Topics are then commonly interpreted by looking at the most probable words in their distributions $\beta_k$, $k = 1, \ldots, K$. Unfortunately, stopwords (words with no contextual information) often dominate these lists of highest probability words. Stopword-dominated topics are uninterpretable as semantic themes, and even if canonical stopwords are removed, topics dominated by overly general and uninformative words reduce the utility, reliability, and acceptance of statistical topic models by users outside of the machine learning community.

To improve topic quality, practitioners typically rely on heavy pre- and post-processing, such as creating stopword lists and re-training the LDA models without those words. Broadly, stopwords can be divided into two categories: canonical (“the,” “and”) or domain-specific (“child,” “son” in a corpus about children). Canonical stopwords can often be removed by referring to standard, publicly available lists. However, constructing lists of domain-specific stopwords is a non-trivial task and risks introducing human bias if the model trainer builds these lists over repeated LDA runs. Such extensive processing is also a challenge for scientific reproducibility, as the full preprocessing steps and deleted word lists are often not included in publications. Further, many proposed methods to improve topic quality are complex and not easily integrated into existing software, particularly for the applied LDA community or as part of a larger and more complex graphical model.

In this work, we propose a set of specific asymmetric priors for LDA that create informative, human-interpretable topics without necessitating stopword deletion. We expose subtle but important concerns regarding the evaluation of topic quality when documents contain many irrelevant words. We show that common metrics to assess topic quality, such as coherence (Mimno et al. 2011) and pointwise mutual information (PMI) (Newman, Bonilla, and Buntine 2011), actually prefer topics that place high probability on canonical stopwords; without stopword removal, these metrics neither correlate with human interpretability nor reflect topic quality. Furthermore, these standard topic quality metrics, along with perplexity, cannot compare LDA models trained across different vocabularies, as is the case when one is removing potential stopwords; in general, the removal of any term will improve these metrics. We quantify topic quality via human evaluation and then propose a lift-based metric to approximate this measure.
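The generative process stated at the start of the Introduction can be sketched in a few lines. This is only an illustration under assumptions, not code from the paper: the helper names, the toy sizes, and the hyperparameter values are hypothetical, and the Dirichlet draws are obtained by normalizing Gamma draws from the Python standard library.

```python
import random

def sample_dirichlet(concentration, rng):
    """Draw from a Dirichlet by normalizing independent Gamma(a, 1) draws."""
    draws = [rng.gammavariate(a, 1.0) for a in concentration]
    total = sum(draws)
    return [g / total for g in draws]

def sample_index(probs, rng):
    """Draw one index according to the given probabilities."""
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]

def generate_corpus(num_docs, doc_len, alpha, eta, rng):
    """alpha: length-K list; eta: K rows of length V (one prior per topic)."""
    topics = [sample_dirichlet(eta_k, rng) for eta_k in eta]   # beta_k
    corpus = []
    for _ in range(num_docs):
        theta = sample_dirichlet(alpha, rng)                   # theta_d
        doc = []
        for _ in range(doc_len):
            z = sample_index(theta, rng)                       # z_{d,n}
            doc.append(sample_index(topics[z], rng))           # w_{d,n}
        corpus.append(doc)
    return topics, corpus

# Hypothetical toy sizes: K = 3 topics, V = 8 vocabulary terms.
rng = random.Random(0)
K, V = 3, 8
topics, corpus = generate_corpus(5, 20, [0.5] * K, [[0.5] * V for _ in range(K)], rng)
```

Because `eta` is passed as one row per topic, the same sketch covers both a symmetric prior (identical rows) and priors that differ across topics.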

Related Work

To produce topics with more domain-relevant words, much work has focused on automatic stopword detection and removal. Popular techniques for automatic stopword detection include keyword expansion and other information retrieval approaches (Lo, He, and Ounis 2005; Saif, Fernandez, and Alani 2014; Sinka and Corne 2003; HaCohen-Kerner and Blitz 2010). Several approaches identify stopwords based on term weighting schemes (Ming, Wang, and Chua 2010) or word occurrence distributions (Wibisono and Utomo 2016; Baradad and Mugabushaka 2015). Makrehchin and Kamal (2008) assume that every document has a type or label and only include the words that are most correlated with the document label while minimizing information loss. Lo et al. (2005) begin with a set of pre-generated search engine queries and quantify word informativeness via the KL divergence between the query term distribution and the corpus background distribution. These approaches require parameter tuning to set various penalty cutoffs (Lo, He, and Ounis 2005), and several require document-specific labels or query terms. In contrast, our approach is both simple to apply within existing LDA software frameworks and uses flexible penalization rather than hard trimming thresholds. By avoiding deletion, our models reflect the full text, do not require domain-specific stopword lists, and reduce the human bias of deciding which words to delete.

More broadly, many efforts work to improve the semantic interpretability of topic models (Mehrotra et al. 2013; Yang et al. 2015; Xie, Yang, and Xing 2015; Andrzejewski, Zhu, and Craven 2009; Bischof and Airoldi 2012). In particular, much work has improved topic quality via different priors: Wallach et al. (2009) show the effectiveness of general asymmetric priors to improve topic quality, Newman et al. (2011) use an informative prior capturing short range dependencies between words, and Andrzejewski et al. (2009) use Dirichlet Forest priors to capture corpus structure. Other models modify LDA to incorporate corpus-wide data of word frequency and exclusivity (Bischof and Airoldi 2012; Jagarlamudi, Daumé III, and Udupa 2012).

However, the majority of these models are not targeted towards isolating stopwords. Many proposed methods, even in the LDA literature, still require stopword removal or deletion of high frequency words (Newman, Bonilla, and Buntine 2011; Zhao et al. 2011; Jagarlamudi, Daumé III, and Udupa 2012; Andrzejewski, Zhu, and Craven 2009; Lee and Mimno 2014). Some models have shown some robustness towards the presence of stopwords, but perform noticeably better with canonical stopword deletion (Tan and Ou 2010; Zhao et al. 2011). Further, while many methods have been proposed to identify stopwords and model only domain-relevant words, many LDA users still rely extensively on canonical stopword deletion, particularly for more complex graphical models.

Promoting Relevant Terms with Asymmetric, Informative Priors

Our approach to creating topics with fewer stopwords combines two ideas: (1) different Dirichlet concentrations $\eta_0$ and $\eta_1$ for different topics to encourage stopwords and domain-relevant words to appear in different topics, and (2) an asymmetric prior $\eta_1$ that penalizes likely stopwords in the domain-relevant topics and promotes domain-specific words. Importantly, since we only change the prior concentrations $\eta_0$ and $\eta_1$, our approach can easily be used to augment more complex LDA extensions. We validate this approach by showing that our models produce (1) topics with more domain-specific keywords deemed important by experts and (2) fewer stopwords on three datasets.
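As a rough illustration of how the per-topic prior rows described in this section might be assembled, the sketch below builds a K x V matrix with stopword rows, TF-IDF weighted rows, and keyword-seeded rows. Several details are assumptions rather than the authors' implementation: TF is taken as within-document relative frequency, IDF(v) as the number of documents divided by the number of documents containing v, and a small hypothetical `floor` keeps every Dirichlet parameter strictly positive, since a word present in every document scores zero under this definition. The function names and the toy corpus are also hypothetical; the default topic counts and weights mirror the settings reported in the experiments.

```python
import math
from collections import Counter

def average_tfidf(docs, vocab):
    """Average over documents of TF(v, d) * log IDF(v) (assumed definitions)."""
    n_docs = len(docs)
    doc_freq = Counter(w for doc in docs for w in set(doc))
    totals = dict.fromkeys(vocab, 0.0)
    for doc in docs:
        for w, count in Counter(doc).items():
            if w in totals:
                totals[w] += (count / len(doc)) * math.log(n_docs / doc_freq[w])
    return [totals[w] / n_docs for w in vocab]

def build_topic_word_prior(docs, vocab, keywords, n_stop=1, n_tfidf=9,
                           n_keyword=10, c1=1.0, c2=1.0, c=100.0, floor=1e-3):
    """Return a K x V list of Dirichlet parameters, one row per topic."""
    ti = average_tfidf(docs, vocab)
    stop_row = [1.0] * len(vocab)                                  # uninformative
    tfidf_row = [max(c1 * score, floor) for score in ti]           # penalizes common words
    keyword_row = [c2 * (c if w in keywords else 1.0) for w in vocab]
    return ([list(stop_row) for _ in range(n_stop)]
            + [list(tfidf_row) for _ in range(n_tfidf)]
            + [list(keyword_row) for _ in range(n_keyword)])

# Hypothetical toy corpus and keyword list.
docs = [["the", "child", "school", "the"],
        ["the", "child", "therapy"],
        ["the", "school", "teacher", "the"]]
vocab = sorted({w for doc in docs for w in doc})
prior = build_topic_word_prior(docs, vocab, keywords={"therapy"})
```

In this toy corpus "the" occurs in every document, so its TF-IDF row entry falls to the floor, while the keyword "therapy" receives weight `c2 * c` in the keyword rows. The resulting matrix would then be supplied as the topic-word prior to whichever LDA implementation is in use, provided it accepts a separate prior per topic.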

Stopword Topics ($\eta_0$)

The LDA model tries to explain all of the words in the document. Thus, it is important that high-frequency words be explained somehow: we cannot just relegate them to low probabilities in all our topics if we think that they have little information-bearing content. Thus, of the $K$ topics, we let $I$ of them be stopword topics: $\beta_k \sim \mathrm{Dirichlet}(\eta_0)$, where $\eta_0 = (1, 1, \ldots, 1)$ is uninformative. The explicit stopword topics give high-frequency, but uninteresting, words a place to go.

Choices for Domain Topics ($\eta_1$): Term Weighting

One intuitive prior-based penalization is to set the Dirichlet weights $\eta$ for the non-stopword topic priors to be the inverse of the corpus unigram frequency, which gives high frequency words low prior probabilities of occurring as a draw from a domain topic. This Word Frequency Prior is our most naive approach, and allows the overall model to achieve reasonable perplexities because frequent corpus words can be explained in the stopword topics. This sequestering of high frequency words to specific topics allows the non-stopword topics to more accurately reflect the nuances of the corpus.

However, while penalizing words based on their frequency effectively limits stopwords, it is not a targeted form of restriction: it equally penalizes a term that occurs a few times in many documents and a term that occurs often in just one document. We propose instead a prior penalization for the domain-relevant topics proportional to the TF-IDF score (Salton 1991) of the word (the overall model can still achieve low perplexities because frequent terms can be explained by stopword topics). Our TF-IDF Prior for $K$ topics in an LDA model with vocabulary size $V$, $I$ stopword topics, and $K - I$ non-stopword topics is

$$\beta_1, \ldots, \beta_I \sim \mathrm{Dir}(1, 1, \ldots, 1)$$
$$\beta_{I+1}, \ldots, \beta_K \sim \mathrm{Dir}\big(c_1 \mathrm{TI}(w_1), \ldots, c_1 \mathrm{TI}(w_V)\big)$$

where $\mathrm{TI}(w_v)$ represents the average TF-IDF score of word $v$ in the corpus and $c_1$ is an arbitrary scaling constant used to appropriately size the TF-IDF scores. We use the common TF-IDF score for word $v$ in document $d$ of $\mathrm{TF}(v, d) \log \mathrm{IDF}(v)$. This prior shrinks the posterior probability of words with small TF-IDF scores, e.g., common words that consistently appear across the corpus, in the non-stopword topics.

Choices for Domain Topics ($\eta_1$): Keyword Seeding

The term weighting approach relies only on the statistics of the documents to create the prior. However, in many situations, domain experts may have additional knowledge about the corpus vocabulary that the TF-IDF score does not take into account. In particular, many domains have publicly available, curated lists such as key terms for article abstracts, lists of medications, or categories of accidents. We incorporate such information using keyword topics with a prior that reduces shrinkage on those pre-specified vocabulary words. Similar to the TF-IDF prior, we set a $K$ topic LDA model to have $I$ stopword topics and $J$ TF-IDF weighted topics, but

then set the remaining keyword topics to promote domain-specific words:

$$\beta_1, \ldots, \beta_I \sim \mathrm{Dir}(1, 1, \ldots, 1)$$
$$\beta_{I+1}, \ldots, \beta_{I+J} \sim \mathrm{Dir}\big(c_1 \mathrm{TI}(w_1), \ldots, c_1 \mathrm{TI}(w_V)\big)$$
$$\beta_{I+J+1}, \ldots, \beta_K \sim \mathrm{Dir}(c_2 \gamma_1, \ldots, c_2 \gamma_V)$$

where $\gamma_i = c$ with $c \gg 1$ if $w_i$ is a keyword and 1 otherwise. The presence of the TF-IDF weighted topics serves a similar purpose as the stopword topics: providing topics for non-keywords to fill discourages word intrusion into the keyword topics. Furthermore, the large prior setting on relevant domain keywords allows them to be more easily sampled into the keyword topics in early iterations of Gibbs sampling, which discourages the appearance of stopwords in the keyword topics.

We emphasize that these domain-specific keywords (words to include rather than words to exclude) are quite distinct from domain-specific stopword lists. The keyword lists used in our experiments are large, generic, and downloaded from the internet. We find that even very general lists of domain terminology, when used as keywords, significantly reduce the number of stopwords in topics. Generating lists of domain-specific terminology often does not require an expert; it is easy to point other researchers to these terminology sources for reproducing experiments. In contrast, domain-specific stopword lists (words to exclude) often require an iterative process of expert pruning based on the specific corpus and repeated LDA model runs.

Alternative to PMI and Coherence: Lift

Model                          Log Lift    % Stopwords
ASD
  No Deletion Baseline         1.939       75.8%
  Stopword Deletion Baseline   2.167       0.0%
  TF-IDF Deletion Baseline     2.218       91.8%
  Keyword Topics Baseline      2.608       53.5%
  Word Frequency Prior         3.649       14.7%
  TF-IDF Prior                 6.707       10.3%
  Keyword Seeding Prior        5.978       9.3%
OSHA
  No Deletion Baseline         2.887       39.3%
  Stopword Deletion Baseline   3.286       0.0%
  TF-IDF Deletion Baseline     3.017       13.5%
  Keyword Topics Baseline      2.914       10.5%
  Word Frequency Prior         4.834       6.5%
  TF-IDF Prior                 5.871       5.3%
  Keyword Seeding Prior        5.184       5.0%
NIPS
  No Deletion Baseline         2.962       7.6%
  Stopword Deletion Baseline   3.584       0.0%
  TF-IDF Deletion Baseline     3.720       11.3%
  Keyword Topics Baseline      3.418       5.3%
  Word Frequency Prior         3.913       3.2%
  TF-IDF Prior                 6.596       4.0%
  Keyword Seeding Prior        6.271       3.3%

Table 4: Log lift calculated on the top 30 words of all models. Lift (large is good) represents the ratio of the probability of word i in a given topic to its corpus probability, averaged over all topics. TF-IDF Prior and Keyword Seeding Prior have better performance, correlating with results in Tables 1 and 2 (the stopword removal metric is shown for comparison).

Experiments

Corpora

In this section, we demonstrate the effectiveness of our proposed priors with results on three datasets with varying characteristics. The ASD corpus contains 656,972 posts from three online support communities for autism patients and their caretakers. Posts contain non-clinical medical vocabulary (e.g. “potty going” instead of “toilet training”) and abbreviations (e.g. “camhs” for “Child and Adolescent Mental Health Services”). The keywords used for keyword seeding models were downloaded from the Unified Medical Language System. The OSHA corpus contains 49,558 entries from the Department of Labor Occupational Safety and Health database of casualties. Unlike the ASD posts, the OSHA entries are short and structured. The keyword seeds used for the OSHA corpus were pretagged in the dataset as one-word descriptors of the accident (e.g. “ship” to indicate the accident occurred on a ship). The NIPS corpus contains 403 abstracts from the Neural Information Processing Systems Conference 2015 accepted papers. These concisely written abstracts are of medium length with a highly technical vocabulary and comparatively few stopwords. We used the list of 2015 NIPS submission category keywords for keyword seeding.

Baselines

We compare to four baselines: (1) No Deletion Baseline: vanilla LDA without stopword removal, (2) Stopword Deletion Baseline: LDA deleting the 127 canonical stopwords from the Stanford Natural Language

Toolkit, a common preprocessing step and a standard canonical stopword list, (3) TF-IDF Deletion Baseline: LDA deleting words with TF-IDF scores in the lowest 5%, similar to the stopword removal work in Lo et al. (2005) and Ming et al. (2010), and (4) Keyword Topics Baseline: LDA with only keyword seeding prior topics (i.e. no predesignated stopword or TF-IDF topic types). This last baseline is designed to demonstrate that seeding domain words alone is not sufficient, and that the prior designation of stopword and non-stopword topics is a necessary prerequisite to creating interpretable topics. These baselines, particularly canonical stopword deletion and TF-IDF based deletion, are extensively used by applied LDA users as well as the research community.

Evaluation Metrics

Following standard practice, we take the n most probable words in each topic as the ones that define the topic. We found that in conventional LDA models, even if the top 10 words appeared reasonable, stopwords often dominated the next most probable words. Looking at graphs of the probability mass captured by the top n words of a topic (see supplement), we find that most of the mass is captured before word 30, so we evaluate on the n = 30 most probable words in a topic. We considered two axes for evaluation: what proportion of the top word lists were identifiable as stopwords and how many words were identifiable as domain-relevant. To measure the number of stopwords, we report both

the percentage of NLTK canonical stopwords (Canonical Stopwords) and the percentage of low-information words identified by human evaluators (Human Stopwords). We present the median percentage of identified Human Stopwords on the ASD dataset for three models: No Deletion Baseline, Stopword Deletion Baseline, and TF-IDF Prior. Ten evaluators each assessed 2 runs of each model. The evaluators were college students presented with the task of identifying low-information words. No examples were given, to avoid priming the identification of canonical stopwords.

To verify that the topics contained domain-relevant content, we asked 2 domain experts each in the medical and law domains to independently identify terms deemed important to generate keyword whitelists. For the NIPS corpus, we used the paper titles as whitelist words (canonical stopwords were removed). The average percentage of these whitelist words in the top 30 words of each topic is reported as the Expert Words. This expert whitelist evaluation is a proxy for the studies presented in Chang et al. (2009), which used Mechanical Turk to identify topic words that did not belong. We quantify the opposite (words pre-designated to belong by domain experts) for three main reasons. First, expert whitelists are a more scalable evaluation method compared to Turk. Second, our corpora require more specific domain knowledge for accurate topic evaluation, making our topics less accessible for the average Turk worker. Finally, unlike the generic keyword lists, the experts were very selective in choosing important words from the corpora. Thus, we also report the co-occurrence of the top topic words with the expert-identified terms per document (Codocument Appearance) as a measure of whether our top words are correlated with the expert-produced lists.

Quantitative Evaluation

Canonical Stopwords (non-stopword topics)
Model                          ASD              OSHA            NIPS
No Deletion Baseline           75.8%            39.3%           7.6%
Stopword Deletion Baseline     0.0%             0.0%            0.0%
TF-IDF Deletion Baseline       91.8%            13.5%           11.3%
Keyword Topics Baseline        53.5%            10.5%           5.8%
Word Frequency Prior           14.7% (10.6%)    6.5% (1.8%)     3.2% (0.6%)
TF-IDF Prior                   10.3% (5.1%)     5.3% (1.2%)     4.0% (0.5%)
Keyword Seeding Prior          9.3% (6.5%)      5.0% (2.3%)     3.3% (1.7%)

Expert Words (non-stopword topics)
Model                          ASD              OSHA            NIPS
No Deletion Baseline           1.5%             4.3%            13.0%
Stopword Deletion Baseline     3.5%             6.5%            15.2%
TF-IDF Deletion Baseline       4.3%             6.8%            14.5%
Keyword Topics Baseline        3.6%             7.0%            17.3%
Word Frequency Prior           14.0% (13.8%)    8.2% (7.0%)     15.7% (14.5%)
TF-IDF Prior                   8.7% (8.2%)      7.2% (6.2%)     24.5% (24.2%)
Keyword Seeding Prior          20.3% (20.3%)    17.2% (16.2%)   47.8% (45.7%)

Codocument Appearance
Model                          ASD              OSHA            NIPS
No Deletion Baseline           70%              58%             73%
Stopword Deletion Baseline     71%              75%             84%
TF-IDF Deletion Baseline       71%              66%             84%
Keyword Topics Baseline        72%              66%             86%
Word Frequency Prior           90%              90%             92%
TF-IDF Prior                   92%              91%             93%
Keyword Seeding Prior          92%              91%             95%

Table 1: Evaluation of different fitted models. Evaluation columns are % of canonical stopwords (small is good), % expert words (large is good), and percentage of documents where expert words and top topic words either both appear or both do not appear (large is good). Parentheses report evaluation metrics calculated only on the 19 non-stopword topics for the three informative prior models. Our approaches have better performance on all metrics, except for the number of canonical stopwords under the Stopword Deletion Baseline, which is zero by construction.

Parameter Settings

We performed a grid search over the number of topics (5 to 50 topics in increments of 5), settings for prior weights $c_1$ (100, 10, 1, 1/10, 1/100) and $c_2$ (100, 10, 1, 1/10, 1/100), number of TF-IDF topics (1, 5, 10, 19), number of keyword seeding topics (1, 5, 10, 18, 19), weight of keyword seeding $c$ (10, 50, 100, 1000), and number of Gibbs sampling iterations (100, 200, 500, 1000). We explored various combinations, including $c_1 = c_2$ and $c_1 \neq c_2$. Our models were largely insensitive to these choices: the number of stopwords and number of expert words deviated little. The

most important parameter setting was that the prior on the stopword topics ($\eta_0$) should be larger than the prior weight placed on the informative topics ($c_1$, $c_2$, $\gamma_i$, $\mathrm{TI}(w)$). This encourages separation between stopword and non-stopword topics by ensuring that stopwords are sufficiently penalized. We present results for $K = 20$ topics, $c_1 = c_2 = 1$, and $c = 100$. Word Frequency Prior and TF-IDF Prior models were trained with $I = 1$ stopword topic. Keyword Prior models were trained with $I = 1$ stopword topic, 9 TF-IDF topics, and 10 keyword seeding topics.

Informative priors have superior quantitative performance. Our models drastically reduce the number of stopwords with high posterior probability mass and capture a greater amount of domain content in all corpora (Table 1), better than baselines with a hard trimming threshold such as the TF-IDF Deletion Baseline (Lo, He, and Ounis 2005; Ming, Wang, and Chua 2010). The stopwords that remain in our models are almost all present in the predesignated stopword topics. Our models increase the number of expert-designated domain-relevant words even though those words were not used for the keyword seeding; our keyword seeds came from large, generic online lists. Most domain content appears in the domain-relevant topics, with the predesignated stopword topic containing very few expert words. Additionally, the co-occurrence scores reveal that topic words from the informative prior models correlate more strongly with the independently produced expert keywords. In contrast, the deletion-based methods reduce the number of canonical stopwords present but fail to capture as much domain content and do not remove domain-specific stopwords.

In a separate study from Table 1, human evaluators marked 70.7% of the words in the No Deletion Baseline as stopwords and 25.9% of words in the Stopword Deletion Baseline as stopwords, but only 17.1% of words in the TF-IDF Prior model as stopwords. For the TF-IDF Prior model, the 19 non-stopword topics contained 12.7% stopwords, meaning almost a third of human-evaluated stopwords were contained by the predesignated stopword topic. Such results emphasize that canonical stopword deletion cannot create topics that humans judge to be stopword-free. In contrast, our TF-IDF model can create more readable, domain-specific topics with no vocabulary removal.
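The three Table 1 quantities can be approximated with a few set operations. The sketch below is one possible reading, not the authors' evaluation code: in particular, Codocument Appearance is interpreted here as the share of documents in which "at least one top topic word is present" and "at least one expert word is present" are either both true or both false. The function names and the toy data are hypothetical.

```python
def percent_in_set(topics_top_words, word_set):
    """Average, over topics, of the percentage of top words found in word_set."""
    per_topic = [100.0 * sum(w in word_set for w in top) / len(top)
                 for top in topics_top_words]
    return sum(per_topic) / len(per_topic)

def codocument_appearance(docs, top_words, expert_words):
    """Percentage of documents where top words and expert words both appear
    or both do not appear (assumed reading of the Table 1 caption)."""
    agree = 0
    for doc in docs:
        tokens = set(doc)
        has_top = bool(tokens & top_words)
        has_expert = bool(tokens & expert_words)
        agree += int(has_top == has_expert)
    return 100.0 * agree / len(docs)

# Hypothetical toy inputs.
topics_top_words = [["the", "school", "reading"], ["and", "therapy", "of"]]
stopwords = {"the", "and", "of"}
expert_words = {"reading", "therapy"}
docs = [["the", "school"], ["reading", "therapy"], ["cat"], ["therapy", "of"]]

stopword_pct = percent_in_set(topics_top_words, stopwords)
expert_pct = percent_in_set(topics_top_words, expert_words)
all_top_words = {w for top in topics_top_words for w in top}
codoc_pct = codocument_appearance(docs, all_top_words, expert_words)
```

The same `percent_in_set` helper serves for both the Canonical Stopwords and Expert Words columns; only the reference word list changes.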

Qualitative Topic Evaluation

ASD
  No Deletion Baseline:        social diagnosis as an or only are autism that child
  Stopword Deletion Baseline:  schools lea information need special son statement parents support class
  TF-IDF Deletion Baseline:    had just get school of will very not me out
  Keyword Topics Baseline:     special lea it need has statement support needs school to
  Word Frequency Prior:        reading paed attention helpful short communication cope aba diagnosed system
  TF-IDF Prior:                mobility improvements treatment preschool responsible expected friends panic professionals speak
  Keyword Seeding Prior:       learning attention symptoms similar problem development negative disorder positive school
  Example Stopword Topic:      child autism or on you it parent as son have

OSHA
  No Deletion Baseline:        from approximately fell his hospitalized is him falling injured in
  Stopword Deletion Baseline:  report trees surface backing inc degree determined forks fork board
  TF-IDF Deletion Baseline:    his hospital due while work him in death pronounced tree
  Keyword Topics Baseline:     head at injured falls for an balance fractures slipped lost
  Word Frequency Prior:        mower limb top operator chain trees cutting log ground fell
  TF-IDF Prior:                collapses street trees lacerations wooden laceration construction chipper facility tree
  Keyword Seeding Prior:       work rope tree landing protection caught lift edge open story
  Example Stopword Topic:      hospitalized employee by for at when ft fall his fell

NIPS
  No Deletion Baseline:        determinantal progress coordinate real learned cases theta ll super arms
  Stopword Deletion Baseline:  includes top analytically margin framework incurs parameterizations normal confirmed ucsd
  TF-IDF Deletion Baseline:    optimizations sgld finding others brownian strings logs generation recognize neurophysiological
  Keyword Topics Baseline:     kappa keyword weights integrating similarly geometric dependence spatial definiteness either
  Word Frequency Prior:        mlfre vanilla wold validation inexactness benchmark gumbel bckw newton generalized
  TF-IDF Prior:                scalability variations index parameter parametric calibration versions condition infinite generalize
  Keyword Seeding Prior:       nonlinear scalable newton optimisation hyperparameter stochastic optimality outliers epoch control
  Example Stopword Topic:      based problem algorithms method from show be can learning data

Table 2: Sample illustrative topics. We present the top 10 words of a tree accident topic for OSHA, a school difficulty topic for ASD, and a hyperparameter tuning topic for NIPS. Informative prior model topics are more specific and contain fewer stopwords. Stopword topics contain both domain-specific and canonical stopwords.

Lastly, simply seeding keywords as a prior without having topic types (Keyword Topics Baseline) is not effective at reducing the stopword effect or generating domain-relevant topics (Table 1). The combination of penalizing priors and topic types is required for interpretable topics.

Informative prior topics are more readable. Our informative prior models generate more interpretable topics (see Table 2). For example, in the OSHA dataset the baselines were too general: the deletion baseline topics included domain-specific stopwords such as “report” (e.g. “an accident report was filed”). The informative prior models captured greater specificity, such as the Word Frequency Prior displaying that tree-related accidents often occur when “cutting” “log[s].” In the ASD corpus, the informative prior models captured specific concerns about “learning,” “reading,” and “mobility” for ASD patients entering primary education. In contrast, the deletion-baseline topics included the domain-specific stopwords “son” and “parent” and addressed school concerns only vaguely. In the NIPS dataset, the more concise writing and technical terminology allow the baseline topics to contain far fewer stopwords. However, the topic words do not form a coherent theme with each other. In contrast, the topics for the Keyword Seeding Prior and the TF-IDF Prior are much clearer as a grouping. For example, the words “optimization,” “hyperparameter,” and “epoch” reference tuning various model parameters.

The learned stopword topics capture both canonical and domain-specific stopwords. In the OSHA case, we see the words “employee,” “ft,” and “hospitalized,” as well as “by,” “for,” and “at.” For ASD, we see “child,” “autism,” “son,” and “parent” as domain-specific stopwords. In the NIPS dataset, the words “problem,” “algorithms,” “method,” “data,” and “learning” are domain-specific stopwords.

Improving Traditional Metrics

In the previous section, we showed that according to human experts, our approaches promoted more domain-relevant terms and demoted more domain-irrelevant terms. However, human subject evaluation, such as that of Chang et al. (2009), does not scale; ideally we would prefer a metric that can be automated. Unfortunately, the results in Table 3 show that coherence and PMI, two standard quality metrics for LDA, do not correlate with our human evaluations: the coherence and PMI scores are highest for the No Deletion Baseline, even though over 50% of the top words are canonical stopwords. Similarly, the Keyword Topics Baseline performs well, despite containing both more canonical stopwords and fewer domain words than the informative prior models.[1]

[1] For numerical comparability, we leave out the Stopword Deletion Baseline and TF-IDF Deletion Baseline, as their vocabulary sizes differ from our other baselines and proposed informative prior models.

Coherence and PMI Results, ASD

                              Avg Coherence (# top words)          Avg PMI
Model                         10          20          30
No Deletion Baseline          −45.5       −235.5      −554.2       −1.56
Keyword Topics Baseline       −48.2       −244.3      −580.1       −1.42
Word Frequency Prior          −115.2      −543.9      −1278.3      −2.02
TF-IDF Prior                  −143.3      −680.1      −1611.8      −2.08
Keyword Seeding Prior         −102.8      −481.8      −1119.6      −2.42

Table 3: Average topic coherence and average pointwise mutual information on the ASD corpus (closer to 0 is better). All metrics prefer the No Deletion and Keyword Topics baselines, even though they have many more stopwords and less domain content (see Table 1).

Are the results in Table 3 an empirical fluke? No. We next explain why

these metrics systematically produce counterintuitive results when faced with irrelevant words and propose a metric to diagnose stopword-dominated topics.

Traditional topic quality metrics are not robust to stopwords. The Coherence of a topic $t$ is defined as

$$\sum_{m=2}^{M} \sum_{l=1}^{m-1} \log \frac{D(v_m^t, v_l^t) + 1}{D(v_l^t)},$$

where $v_i^t$ is the $i$th most probable word in topic $t$ and $M$ represents the number of top topic words to evaluate (the $+1$ ensures the log is defined). $D(x)$ represents the number of documents word $x$ appears in, and $D(x, y)$ represents the number of documents $x$ and $y$ co-appear in. Coherence is largest when $D(v_m^t, v_l^t) = D(v_l^t)$, which occurs when either (a) the words co-occur in a very small subset of documents and are absent elsewhere, or (b) the words appear in all documents. The former case is incredibly unlikely, particularly as topic evaluation is conducted on the top $M$ most probable topic words. For concreteness, we provide some examples on ASD, a corpus about children with autism. “Autism” and “child” only co-occur in 3% of documents, but “and” and “the” co-occur in 58% of documents. On the other hand, the latter case is achieved when the $M$ words evaluated are all common words appearing in every document, i.e. are stopwords. For example, “the” appears in 93% of OSHA documents, 74% of ASD documents, and 99% of NIPS documents. When averaged across all topics, coherence is maximized when all top topic words are common and overlapping.

The PMI Score of a topic is the median of

$$\log \frac{p(v_i, v_j)}{p(v_i)\, p(v_j)}$$

calculated for all pairs $v_i, v_j$ in the top 10 words of a topic, where $p(x)$ is the probability of seeing word $x$ in a random document, and $p(x, y)$ is the joint probability of seeing $x$ and $y$ appearing together in a random document. The frequencies are traditionally estimated from Wikipedia. PMI for each pair of words is maximized if $v_i$ and $v_j$ co-occur. In practice, this is easily achieved with high frequency words that appear with high probability in all documents, that is, stopwords. Particularly in real world, noisy corpora, domain words alone are relatively rare, so multiple domain-relevant words co-occurring strongly is incredibly rare. For example, on the ASD dataset, the words “school” and “read,” both fairly common words in English and topics of high concern for parents, only co-occur in 1.5% of documents. Variants of PMI, such as Normalized PMI (Lau, Newman, and Baldwin 2014), adjust the frequencies to reflect specialized or technical corpora but suffer similar drawbacks: they are still maximized if topics are full of the same, high frequency, co-occurring words.

We stress that these issues are not limited to canonical stopwords, but plague topic model evaluation even for stopword deletion models, as these scoring mechanisms inevitably prefer topics of common words. These metrics, while sensible in the absence of stopwords, produce results that prefer stopword-laden topics, and do not correlate with our human evaluation studies or expert topic evaluation.

Lift-Score: An Automated Topic Quality Metric

In topic modeling, lift (Taddy 2012) is the ratio of a word’s probability within a topic to its marginal corpus probability. The lift of word $j$ in topic $t$ is defined as

$$\frac{\beta_{tj}}{\sum_i x_{ij} \,/\, \sum_i m_i},$$

where $\beta_{tj}$ represents the probability mass of word $j$ in topic $t$ and each $x_i$ is a vector of word counts in document $i$, with total term count $m_i$ (Taddy 2012). Previous work has used lift to sort topic words, as it reduces the appearance of globally frequent terms (Sievert and Shirley 2014). We calculate the average log lift of the top 30 words of each topic. Common words, such as stopwords, have comparatively lower lift, while rarer words that appear in fewer topics correspond to higher lift, allowing this metric to better target the stopword problem. The lift-score (Table 4) correlates with the quantitative and qualitative performance metrics in Tables 1 and 2. The informative prior models perform better than all baselines, with the TF-IDF Prior and Keyword Seeding Prior the best. Unlike Coherence and PMI, lift can be calculated across LDA models of varying vocabulary sizes and is not easily maximized by topics full of frequent words.
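The contrast between coherence and lift described above can be reproduced on a toy corpus. The sketch below follows the two definitions given in this section; the corpus, the two hand-written topics, and the function names are hypothetical and chosen only so that three stopwords appear in every document.

```python
import math
from collections import Counter

def coherence(top_words, docs):
    """Sum over word pairs of log((D(v_m, v_l) + 1) / D(v_l))."""
    doc_sets = [set(doc) for doc in docs]

    def doc_count(*words):
        return sum(all(w in s for w in words) for s in doc_sets)

    score = 0.0
    for m in range(1, len(top_words)):
        for l in range(m):
            joint = doc_count(top_words[m], top_words[l])
            score += math.log((joint + 1) / doc_count(top_words[l]))
    return score

def average_log_lift(topic_probs, top_words, docs):
    """Mean over top words of log(beta_tj / corpus frequency of word j)."""
    counts = Counter(w for doc in docs for w in doc)
    total_tokens = sum(len(doc) for doc in docs)
    logs = [math.log(topic_probs[w] / (counts[w] / total_tokens)) for w in top_words]
    return sum(logs) / len(logs)

# Hypothetical toy corpus: "the", "and", "of" occur in every document.
docs = [["the", "and", "of", "school", "reading"],
        ["the", "and", "of", "tree", "saw"],
        ["the", "and", "of", "school", "teacher"],
        ["the", "and", "of", "tree", "ladder"]]
stop_topic = {"the": 1 / 3, "and": 1 / 3, "of": 1 / 3}
domain_topic = {"school": 0.5, "reading": 0.25, "teacher": 0.25}

stop_coherence = coherence(list(stop_topic), docs)
domain_coherence = coherence(list(domain_topic), docs)
stop_lift = average_log_lift(stop_topic, list(stop_topic), docs)
domain_lift = average_log_lift(domain_topic, list(domain_topic), docs)
```

On this toy input the stopword topic receives the higher coherence, while the domain topic receives the higher average log lift, which is the qualitative pattern the section argues for.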

Discussion

The problem of stopwords is systemic: while LDA has been empirically useful, it often picks up on spurious word co-occurrences that result from linguistic structure. For example, researchers may wish to model important nouns, but these are often preceded by articles such as “the.” LDA’s bag-of-words assumption treats these co-occurrences as important indicators of words that appear together, allowing stopwords to have undue influence. Our approach penalizes these uninteresting co-occurrences and promotes relevant words, creating more interpretable topics. Importantly, it is easily incorporated into existing software by changing the existing symmetric Dirichlet prior on the word-topic distribution to one of the proposed priors, with no other inference modifications. This property is particularly powerful for the Keyword Seeding Prior, as many other LDA models incorporating external information require custom inference methods that may not be accessible to all users. This simplicity is a strength, and allows our methods to be easily incorporated into existing work.

Our prior parameter settings are also quite robust and require little modification. Despite the differences between the corpora, the same parameters produced interpretable topics that performed well both quantitatively and qualitatively.

Interesting avenues of future work include incorporating such priors into much more complex topic models, such as supervised LDA models with correlational and time-varying structure. In these scenarios, more elaborate modeling of word frequencies might render the larger effort computationally infeasible. More generally, it would be interesting to see whether these more interpretable topics show benefits in downstream prediction tasks.

We also expose an important gap in topic quality evaluation. We showed that even if deletion methods are used to remove generic stopwords, human evaluators judge the topics to contain large quantities of low-information words. In contrast, the TF-IDF informative prior model not only drastically reduces the number of canonical stopwords appearing in the top 30 words of each topic, but also curtails the number of general, low-information words. Traditional topic quality measures did not reveal these trends. Our proposed lift-score, however, correlates well with both human stopword evaluation and domain expert topic assessment. More generally, an important question is how to define an appropriate constellation of metrics to capture factors such as concentration, uniqueness, and relevance, all of which bear on topic quality.
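To make the point about changing only the prior concrete, the Python sketch below replaces a scalar symmetric η with a per-word vector and compares the prior mean of a topic under each. The weights are hypothetical placeholders, not the TF-IDF or Keyword Seeding priors as defined in this paper, and the rescaling that keeps the total concentration fixed is a choice made for this example only.

```python
def symmetric_prior(vocab_size, eta=0.1):
    """Standard symmetric Dirichlet parameter: the same eta for every word."""
    return [eta] * vocab_size

def informative_prior(weights, eta=0.1):
    """Per-word Dirichlet parameter built from hypothetical word weights.

    Weights are rescaled to have mean 1, so the total concentration
    matches that of the symmetric prior with the same base eta.
    """
    mean_weight = sum(weights) / len(weights)
    return [eta * w / mean_weight for w in weights]

def prior_mean(eta_vec):
    """Mean of Dirichlet(eta_vec): E[beta_j] = eta_j / sum_j eta_j."""
    total = sum(eta_vec)
    return [e / total for e in eta_vec]

# Hypothetical vocabulary and weights: low for generic words, high for domain words.
vocab = ["the", "of", "ladder", "scaffold"]
weights = [0.2, 0.2, 1.8, 1.8]

eta_symmetric = symmetric_prior(len(vocab))
eta_informative = informative_prior(weights)

mean_symmetric = prior_mean(eta_symmetric)
mean_informative = prior_mean(eta_informative)

for word, sym, inf in zip(vocab, mean_symmetric, mean_informative):
    print(f"{word:10s} symmetric={sym:.3f} informative={inf:.3f}")
```

Inference itself is untouched in this sketch; only the vector handed to the model as the topic-word prior differs. Libraries that accept a per-word prior can take such a vector directly. For instance, gensim's `LdaModel` documents its `eta` argument as accepting either a scalar or a 1D array with one entry per vocabulary term.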

References

[Andrzejewski, Zhu, and Craven 2009] Andrzejewski, D.; Zhu, X.; and Craven, M. 2009. Incorporating domain knowledge into topic modeling via Dirichlet forest priors. Proc Int Conf Mach Learn, 382(26).
[Baradad and Mugabushaka 2015] Baradad, V. P., and Mugabushaka, A.-M. 2015. Corpus specific stop words to improve the textual analysis in scientometrics. European Research Council Executive Agency.
[Bischof and Airoldi 2012] Bischof, J., and Airoldi, E. M. 2012. Capturing semantic content with word frequency and exclusivity. Proceedings of the 29th International Conference on Machine Learning.
[Blei, Ng, and Jordan 2003] Blei, D. M.; Ng, A. Y.; and Jordan, M. I. 2003. Latent Dirichlet allocation. Journal of Machine Learning Research 3.
[Chang et al. 2009] Chang, J.; Boyd-Graber, J.; Gerrish, S.; Wang, C.; and Blei, D. M. 2009. Reading tea leaves: How humans interpret topic models. Advances in Neural Information Processing Systems.
[HaCohen-Kerner and Blitz 2010] HaCohen-Kerner, Y., and Blitz, S. Y. 2010. Initial experiments with extraction of stopwords in Hebrew. In KDIR, 449–453.
[Jagarlamudi, Daumé III, and Udupa 2012] Jagarlamudi, J.; Daumé III, H.; and Udupa, R. 2012. Incorporating lexical priors into topic models. In Proceedings of the 13th Conference of the European Chapter of the Association for Computational Linguistics, 204–213. Association for Computational Linguistics.
[Lau, Newman, and Baldwin 2014] Lau, J. H.; Newman, D.; and Baldwin, T. 2014. Machine reading tea leaves: Automatically evaluating topic coherence and topic model quality. In EACL, 530–539.
[Lee and Mimno 2014] Lee, M., and Mimno, D. 2014. Low-dimensional embeddings for interpretable anchor-based topic inference. In Proceedings of Empirical Methods in Natural Language Processing. Citeseer.
[Lo, He, and Ounis 2005] Lo, R. T.-W.; He, B.; and Ounis, I. 2005. Automatically building a stopword list for an information retrieval system. In Journal on Digital Information Management: Special Issue on the 5th Dutch-Belgian Information Retrieval Workshop (DIR), volume 5, 17–24. Citeseer.
[Makrehchi and Kamel 2008] Makrehchi, M., and Kamel, M. S. 2008. Automatic extraction of domain-specific stopwords from labeled documents. In Advances in Information Retrieval. Springer. 222–233.
[Mehrotra et al. 2013] Mehrotra, R.; Sanner, S.; Buntine, W.; and Xie, L. 2013. Improving LDA topic models for microblogs via tweet pooling and automatic labeling. In Proceedings of the 36th International ACM SIGIR Conference on Research and Development in Information Retrieval, 889–892. ACM.
[Mimno et al. 2011] Mimno, D.; Wallach, H. M.; Talley, E.; Leenders, M.; and McCallum, A. 2011. Optimizing semantic coherence in topic models. Proceedings of the 2011 Conference on Empirical Methods in Natural Language Processing.
[Ming, Wang, and Chua 2010] Ming, Z.-Y.; Wang, K.; and Chua, T.-S. 2010. Vocabulary filtering for term weighting in archived question search. In Advances in Knowledge Discovery and Data Mining. Springer. 383–390.
[Newman, Bonilla, and Buntine 2011] Newman, D.; Bonilla, E. V.; and Buntine, W. 2011. Improving topic coherence with regularized topic models. Advances in Neural Information Processing Systems.
[Saif, Fernandez, and Alani 2014] Saif, H.; Fernandez, M.; and Alani, H. 2014. Automatic stopword generation using contextual semantics for sentiment analysis of Twitter. In Proceedings of the 2014 International Conference on Posters & Demonstrations Track-Volume 1272, 281–284. CEUR-WS.org.
[Salton 1991] Salton, G. 1991. Developments in automatic text retrieval. Science 253(5023):974–980.
[Sievert and Shirley 2014] Sievert, C., and Shirley, K. E. 2014. LDAvis: A method for visualizing and interpreting topics. In Proceedings of the Workshop on Interactive Language Learning, Visualization, and Interfaces, 63–70.
[Sinka and Corne 2003] Sinka, M. P., and Corne, D. 2003. Evolving better stoplists for document clustering and web intelligence. In HIS, 1015–1023.
[Taddy 2012] Taddy, M. 2012. On estimation and selection for topic models. In AISTATS, 1184–1193.
[Tan and Ou 2010] Tan, Y., and Ou, Z. 2010. Topic-weak-correlated latent Dirichlet allocation. In Chinese Spoken Language Processing (ISCSLP), 2010 7th International Symposium on, 224–228. IEEE.
[Wallach, Mimno, and McCallum 2009] Wallach, H. M.; Mimno, D.; and McCallum, A. 2009. Rethinking LDA: Why priors matter. Advances in Neural Information Processing Systems.
[Wibisono and Utomo 2016] Wibisono, S., and Utomo, M. S. 2016. Dynamic stoplist generator from traditional Indonesian cuisine with statistical approach. Journal of Theoretical & Applied Information Technology 87(1).
[Xie, Yang, and Xing 2015] Xie, P.; Yang, D.; and Xing, E. P. 2015. Incorporating word correlation knowledge into topic modeling. In Conference of the North American Chapter of the Association for Computational Linguistics.
[Yang et al. 2015] Yang, Y.; Downey, D.; Boyd-Graber, J.; and Graber, J. B. 2015. Efficient methods for incorporating knowledge into topic models. In Empirical Methods in Natural Language Processing.
[Zhao et al. 2011] Zhao, W. X.; Jiang, J.; Weng, J.; He, J.; Lim, E.-P.; Yan, H.; and Li, X. 2011. Comparing Twitter and traditional media using topic models. In European Conference on Information Retrieval, 338–349. Springer.