Extraction of synonyms and semantically related words from chat logs

Fredrik Norlindh

Uppsala University
Department of Linguistics and Philology
Master's Programme in Language Technology
Master's Thesis in Language Technology
November 20, 2012
Supervisors: Mats Dahllöf, Uppsala University; Sonja Petrović Lundberg, Artificial Solutions

Abstract

This study explores synonym extraction from domain-specific chat log collections by means of a tool-kit called JavaSDM. JavaSDM uses random indexing and measures distributional similarities. The focus of this study is to evaluate the effect of different preprocessing operations on the training data and of different extraction criteria. Four chat log collections containing approximately 1,000,000 tokens each were compared: one English and one Swedish from a retail company, and one English and one Swedish from a travel company. One gold standard was based on synonym dictionaries and one was a manually extended version of that gold standard. The extended gold standard included antonyms, misspellings and near-related hyponyms/hypernyms/siblings and was about 20% larger. On average around two of the extracted synonym candidates per test word were falsely classified as incorrect because they were not included in the dictionary-based gold standard. Precision, recall and f-score were computed. Test words were either nouns, verbs or adjectives. The f-scores were three to five times higher when using the extended gold standard. The best f-scores were achieved when the training data had been lemmatized. POS-tagging improved precision but decreased recall and decreased the number of misspellings extracted as synonyms. A cosine similarity threshold of 0.5 could be used to increase precision and f-score without substantially decreasing recall.

Contents

Acknowledgments
1 Introduction
2 Background
  2.1 Synonymy and other lexical sense relations
  2.2 Usage of extracted synonyms
  2.3 Random Indexing
  2.4 Synonym Extraction Methods
  2.5 Evaluation Methods
3 Data and Method
  3.1 Data
  3.2 Preprocessings
    3.2.1 Extraction of User Input from Chat Logs
    3.2.2 Tokenization
    3.2.3 Part-Of-Speech Tagging
    3.2.4 Lemmatization
    3.2.5 Stop Word Lists
  3.3 Synonym Extraction by Random Indexing
    3.3.1 Random Indexing Tool-Kit
    3.3.2 Extraction Criteria
  3.4 Evaluation
    3.4.1 Test Words
    3.4.2 Gold Standard
    3.4.3 Adapted Gold Standard
4 Results
  4.1 Tests
  4.2 Threshold
  4.3 Part-of-speech tagging
  4.4 Lemmatization
  4.5 The different gold standards
  4.6 Precision
  4.7 Recall
  4.8 F-score
  4.9 Comparison Overview
5 Conclusions
  5.1 Overview of the results
  5.2 Discussion
    5.2.1 Future Study suggestions
References
Bibliography

Acknowledgments

I would like to thank Mats Dahllöf for continuous support and feedback throughout this project. I am grateful to Artificial Solutions for the opportunity to do this project with a leading language technology company and for the use of their data. I especially want to thank my supervisor Sonja Petrovic Lundberg. I would also like to thank Per Starbäck for the time he took to read my thesis and the advice he gave me regarding formatting aesthetics, and Jörg Tiedemann for helpful comments and feedback.


1 Introduction

This study explores synonym extraction from chat log collections by means of an open source tool-kit called JavaSDM. JavaSDM measures distributional similarities and uses a method called random indexing. The focus of this study is to evaluate the effects of different preprocessing operations on the training data and of different extraction criteria. Preprocessing operations are for example lemmatization and POS-tagging. An extraction criterion is for example a similarity threshold t, which says that words with a lower similarity score than t are not to be considered as synonym suggestions. The study is intended to support the automatic part of a semi-automatic synonym extraction process (a process where humans select synonyms from automatically extracted synonym candidates). The effects these operations and criteria have on two languages and two domains will be compared.

Artificial Solutions provided four different collections of chat logs which contain approximately 1,000,000 tokens each: one Swedish and one English collection from a travel company, and one Swedish and one English collection from a retail company.

Four gold standards were created for each chat log corpus: one dictionary-based and one adapted gold standard for graph words, and one dictionary-based and one adapted gold standard for lemmas. The dictionary-based gold standards were based on existing synonym dictionaries but did not include words that did not exist in the input data. Inflections were added to the graph word based gold standards. The adapted gold standards were created by manually reading synonym proposals that were not included in the dictionary-based gold standards. The proposals that actually were correct were added to the adapted gold standards, which also include all words from the dictionary-based gold standards.

Graph word data and lemmatized data are different things, but, unless a combination of both is created and alterations in the JavaSDM code are made, one and only one of them must be chosen for synonym extraction with JavaSDM.


2 Background

Section 2.1 discusses when words can be seen as synonyms. Section 2.2 exemplifies how extracted synonyms can be used. Section 2.3 is about random indexing, which is implemented by JavaSDM. Section 2.4 gives an overview of different synonym extraction approaches that have been used in prior studies. Section 2.5 describes how synonym extractors have been evaluated.

2.1 Synonymy and other lexical sense relations

A synonym is a word having the same or nearly the same meaning as another word, for example "pal" and "friend" or "jump" and "leap". Creating an absolute definition of how similar a pair of words needs to be to count as synonyms is almost impossible, or at least rather subjective (Edmonds and Hirst (2002)). Are, for example, "buxom", "chubby", "plump", "obese" and "fat" all synonyms? The lexical database WordNet has chosen to classify their relation as "similar" instead of "synonym". The difficulty of synonym extraction evaluation is illustrated by the fact that not even professional lexicographers always agreed on whether two words were synonyms or not when evaluating a synonym extractor (Plas and Bouma (2004)); see Section 2.5 about evaluation methods.

Polysemous and homonymous graph words have different senses. This means that if two graph words are synonyms in one situation they are not necessarily synonyms in other situations. For example "I don't buy her story" could be replaced with "I don't believe her story", but "I will buy a new house" can't be replaced by "I will believe a new house".

Hyponyms and hypernyms can be synonymous, especially within specific domains. For example a lot of companies have an "XXX membership card", and due to the obvious reference the company and their customers often only say "membership card" although there are lots of other membership cards in the world. Whether a hypernym and a hyponym can be used as synonyms depends on how precise they are and on the context. For example "thing", which is a hypernym of everything, is not to be considered a synonym of everything. "Marabou" (the biggest chocolate company in Sweden today) is a proper noun often used as a synonym/referent for "choklad" ("chocolate") in Sweden. Proper nouns, such as "Marabou", are generally not included in synonym dictionaries, but there are exceptions. For example "MAC" and "computer" are listed as synonyms in English thesauri.

Conceptual siblings are words that are hyponyms "on the same level" of the same hypernym. For example "father" and "mother" are siblings that are used to tell that a person is the parent of a child. Another example of conceptual


siblings are products that have different names and/or are created in a slightly different way. This is very common at pharmacies in, for example, Sweden, where the pharmacists occasionally say they are out of the medicine you ask for, or that they have a similar, cheaper medicine that you should choose according to the laws regarding high-cost protection.

Siblings can also be antonymous, for example "man" and "woman". Antonymous word pairs often have paradigmatic relations, which means that it is possible to replace one word, for example "bad", with its antonym "good" and still have a syntactically and semantically coherent sentence, and due to irony they can even mean the same thing. Combinations of negations and antonyms can be used as a synonym phrase; for example one can say "Not bad!" instead of "Good!". Actually "bad" can be positive in a slang sense, for example "he's a baaad dude" (Cruse (2002)). Since antonyms often represent opposite polarities on the same "scale", they are often used in the same contexts, for example "I hate/love you". This is why they might be suggested as synonym candidates when measuring distributional similarity.

In many situations the positive/negative energy of a word is used to decide whether two words can be used to send the same message. For example the definitions of "stupid" and "ugly" are not synonyms, but in sentences such as "that dress looks ugly/stupid" they are both used to say that something does not look good, because they carry the same negative energy. The affective meaning of words is for example studied by Strapparava and Mihalcea (2007).

Another relation, not mentioned in earlier synonym extraction projects, is misspelling. Since the training data in this project are unedited chat data, misspellings occur; some words were actually misspelled more often than they were spelled correctly. Sometimes users of a dialogue system mix words from different languages. During the Swedish tests in this study, for example, "infant" was extracted as a synonym for "bebis" ("infant") because they were used in similar contexts. In such cases the translation should be considered a synonym.

2.2 Usage of extracted synonyms

Semi-automatic synonym extraction is probably the best way to find many synonyms of good quality. Muller et al. (2006) state that synonym extractors are supposed to complement existing thesauri or to help targeting a specific domain. Specific domains might use words in new senses, or new words that are not used outside the domain. Existing thesauri don't include all possible synonyms in a language, and new words and new word senses may appear every day. Automatic synonym extraction is a much more effective way of finding synonym candidates than letting lexicographers think or analyze corpora by themselves; lexicographers can then weed out the bad candidates in a set of proposals.

Synonyms are relevant for dialogue systems both to understand users and to widen their own vocabulary. Dialogue systems, such as those created by Artificial Solutions, are often domain specific. Plas and Bouma (2004) exemplify why it might be a good thing if hypernyms, hyponyms and conceptual siblings are found. If a question answering system is asked "What actor is used as Jar Jar Binks' voice?", the system looks for the name of a person if it knows that an actor "IS-A-person". If a question is "What is the profession of Renzo Piano?" and the system has access to a document saying "Renzo Piano is an architect and an Italian", then the knowledge that architect is a profession would help answer the question. The knowledge that product Y is a similar alternative to product X, as in the pharmacy example in Section 2.1, is useful when a customer is looking for something that, for example, is out of stock or more expensive than a similar product.

2.3 Random Indexing

JavaSDM, which is the tool-kit used for synonym extraction in this study, uses random indexing. Word space models use distributional statistics to generate high-dimensional vector spaces, in which words are represented by context vectors whose relative directions are assumed to indicate semantic similarity. Words with similar meanings tend to occur in similar contexts. Contexts can be either whole documents or context windows. A context window can include x words directly before the focus word and y words directly after.

Random indexing is a method based on an incremental word space model which accumulates context vectors based on the occurrence of words in contexts. Each word in the text is assigned a sparse random index vector with a dimensionality usually chosen to be in the range of a couple of hundred up to several thousands, depending on the size and redundancy of the data. A small number of the elements of the vector are randomly set to +1, an equal number to -1, and the rest to 0. The index vectors are used to construct context vectors for each word in the text (Sahlgren (2005) and Hassel (2007)). This is exemplified by Figure 2.1.

Figure 2.1: A context window focused on the token "merchandise", taking note of the co-occurring tokens. The row cv represents the continuously updated context vectors, and the row rl represents the static randomly generated index labels. The grey fields are not involved in this update example.
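A minimal sketch in Python of the accumulation step described above (illustrative only, not the JavaSDM implementation; the dimensionality 1000 and random degree 8 used as defaults here correspond to the default settings mentioned in Section 3.3.1):

```python
import numpy as np
from collections import defaultdict

def make_index_vector(dim=1000, random_degree=8, rng=np.random.default_rng()):
    # Sparse ternary index vector: half of the non-zero elements are +1, half -1.
    v = np.zeros(dim)
    pos = rng.choice(dim, size=random_degree, replace=False)
    v[pos[:random_degree // 2]] = 1.0
    v[pos[random_degree // 2:]] = -1.0
    return v

def train(tokens, window=4, dim=1000):
    # Accumulate each word's context vector as the sum of the index vectors of
    # its neighbours within the context window.
    index = defaultdict(lambda: make_index_vector(dim))
    context = defaultdict(lambda: np.zeros(dim))
    for i, focus in enumerate(tokens):
        for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
            if j != i:
                context[focus] += index[tokens[j]]
    return context

def cosine(u, v):
    return float(u @ v) / (np.linalg.norm(u) * np.linalg.norm(v))
```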

Sahlgren (2005) states that the most well-known word space alternatives to random indexing, such as Latent Semantic Analysis (LSA), first construct a huge co-occurrence matrix and then use a separate, time and memory consuming dimension reduction phase. He lists four advantages of random indexing compared to other word space methodologies:

• The context vectors can be used for similarity computations even after just a few examples have been encountered (Ferret (2010) confirmed this by trying different frequency thresholds deciding which words got their own context vectors, i.e. deciding which words could be extracted, and found that such thresholds only lowered results). By contrast, most other word space models require the entire data to be sampled before similarity computations can be performed.

• Random indexing uses "implicit" dimension reduction, since the fixed dimensionality d is much lower than the number of contexts c in the data. This leads to a significant gain in processing time and memory consumption compared to word space methods that employ computationally expensive dimension reduction algorithms.

• The dimensionality d of the vectors is a parameter in random indexing. This means that d does not change once it has been set. When encountering new data the values of the elements of the context vectors increase, but never their dimensionality. In HAL (Hyperspace Analogue to Language) and LSA, matrices are created of size T × T or D × T, where D is the number of indexed documents and T is equal to the number of unique terms found in the data set so far (Hassel (2007)).

• Random indexing can be used with any type of context. Other word space models typically use either documents or words as contexts. Random indexing is not limited to these naive choices, but can be used with basically any type of context.

2.4 Synonym Extraction Methods

There have been plenty of papers written about synonym extraction methods. Three important approaches are:

1. Distributional approaches to monolingual corpora, for example Rosell et al. (2009) and Ferret (2010). This method, which is used by JavaSDM, measures distributional similarities within a corpus. The approach is based on the distributional hypothesis (Harris (1954)) that words that occur in similar contexts also tend to have similar meanings/functions. Systems using this approach provide ranked lists of semantically related words according to their context similarities. Random indexing is a common method for this (Gorman and Curran (2006)).

2. Alignment of bi- or multilingual corpora, for example van der Plas and Tiedemann (2006) and Blondel and Senellart (2002). This method uses two or more parallel corpora and is based on the assumption that two words are similar if they have the same translation. If two words also share translational contexts they are even more similar. This is measured in a similar way as for distributional monolingual corpora. Automatic word alignment can be used to find the translations in parallel corpora. This approach is known as a way to separate synonyms from other semantic relations; for example "apple" is typically not translated with a word for "fruit" or "pear".

3. Dictionary-based approaches, for example Wang and Hirst (2012) and Muller et al. (2006). This method measures distributional similarities among definitions in dictionaries. It is similar to the distributional approaches to monolingual corpora, but specifically a dictionary is used instead. Hence, relatedness of words is measured using distributional similarity in the same way as in the monolingual case but with a different type of context. Automatic word alignment can be used to find translations in parallel data.

The latter two approaches are not relevant for this study, since each chat data collection is a monolingual corpus. A distributional method assesses the degree of synonymy between words according to their co-occurrence patterns within text corpora, under the assumption that similar words tend to appear in similar contexts (Wang (2009)). For example van der Plas and Tiedemann (2006) mention the suggestion of hyponyms, antonyms and hypernyms as synonyms as a problem encountered with distributional similarity, but as exemplified in Section 2.1 such synonym proposals are not necessarily negative. Wu and Zhou (2003) used a bilingual method, a monolingual dictionary and a monolingual corpus, and concluded that the corpus-based method can find nuanced synonyms or near-synonyms which generally cannot be found in manually built thesauri, exemplified by "handspring → handstand", "audiology → otology", "roisterer → carouser" and "parmesan → gouda".

2.5 Evaluation Methods

The validation of a method's ability to propose synonyms can be done with a test using a gold standard or with an extrinsic evaluation. Muller et al. (2006) state that "Comparing to already existing thesaurus is a debatable means when automatic construction is supposed to complement an existing one, or when a specific domain is targeted". They also say that, in order to get an indication of the extraction quality, manual verification of a sample of synonym proposals is a common practice. Time is the reason only a sample of the proposals is verified. Such verification is normally done either by the authors of a study or by independent lexicographers. The larger the sample, the more reliable the conclusions, but the more time is required. Automatic verification can verify more proposals in much shorter time but is less reliable.

The simplest evaluation measure of synonym extractions is a comparison with manually created thesauri (Wu and Zhou (2003)). Precision, recall and f-score are the most common measures. Gold standard words are often taken from one or several thesauri such as WordNet, Roget's, the Macquarie or Moby. If a word was polysemous, all the synonyms of all its senses were included in the gold standard in prior studies, for example Wu and Zhou (2003), Curran and Moens (2002) and van der Plas and Tiedemann (2006). Wang and Hirst (2012), who evaluated a dictionary method, tried to reduce this "problem" by using a POS-tagger. In their study a synonym proposal was considered correct if it both got the same POS-tag as the test word and was included in the same synset in WordNet.

Test data can be created using different techniques. van der Plas and Tiedemann (2006) automatically tagged all words and made a list of all words tagged as nouns with a frequency of 4 or more. Then they randomly used 1,000 of these as test words, and the synonyms found in Dutch EuroWordnet for these words were used as the gold standard. A correct synonym extraction, for example "e-mail" ("e-mail") for "e-postadress" ("e-mail address"), might be counted as incorrect because it is not included in the thesaurus. They had 10 lexicographers evaluate a sample of 100 incorrect suggestions. 10 out of the 100 word pairs were classified as synonyms by all evaluators and 37 word pairs were classified as synonyms by more than half of the human evaluators.

There have been several other evaluation methods, for example TOEFL (Test of English as a Foreign Language), used by for example Ferret (2010) and Wang (2009). TOEFL is originally an obligatory test for non-native speakers of English who intend to study at a university with English as the teaching language. The synonym extractor is supposed to pick the one out of four words that is most similar to a test word (Ferret (2010)). Another method, clustering, has been used by Rosell et al. (2009). First they assigned every word in their data to one of several clusters based on the similarity measure. A test group had graded the similarities between word pairs from 0 to 5 to build a synonym dictionary. The word pairs with a mean grade of 3.0 to 5.0 were put on a list. They used the list of synonym pairs to see if the words ended up in the same cluster or not. If a word pair graded as synonyms by the users had ended up in the same cluster it was counted as correct.

van der Plas and Tiedemann (2006) acknowledged that their conclusions about precision when using the monolingual distributional approach might have been a bit unfair, since they always extracted the same number of synonym candidates no matter how low the similarity scores were. Wu and Zhou (2003) tried different similarity thresholds and showed that a higher threshold increases precision but also decreases recall, and vice versa. Wu and Zhou (2003) describe evaluation alternatives for synonyms extracted from a corpus as follows: "Comparing the results with the human-built thesauri is not the best way to evaluate synonym extraction because the coverage of the human-built thesaurus is also limited. However, manually evaluating the results is time consuming. And it also cannot get the precise evaluation of the extracted synonyms. Although the human-built thesauri cannot help to precisely evaluate the results, they can still be used to detect the effectiveness of extraction methods."


3 Data and Method

In this chapter, Section 3.1 discusses the data used in this study. Section 3.2 is about the preprocessing operations and the tools used for them. Section 3.3 describes the random indexing tool-kit and the extraction criteria used in this study. Section 3.4 is about the creation of test data and gold standards. Figure 3.1 gives an overview of the extraction experiments.

Figure 3.1: A flow chart of the extraction experiments

3.1 Data

Artificial Solutions provided four chat log collections with sizes of approximately 1,000,000 tokens:

• Swedish Retail: Swedish chat log collections from the retail company
• English Retail: English chat log collections from the retail company
• Swedish Travel: Swedish chat log collections from the travel company
• English Travel: English chat log collections from the travel company

A chat log file could look like this:

Welcome to XXX. How can I help you today?
hi
Welcome.
i want pepper mill
You can find information about it on the webpage I've just opened for you.
thanks for your help
You are welcome.

Text categorized as "QUESTION" has been written by users and is used as training corpora in this study. The actual user input looks like this:

hi
i want pepper mill
thanks for your help

3.2 Preprocessings

3.2.1 Extraction of User Input from Chat Logs

The actual user input was extracted from the chat logs using regular expressions in a Linux terminal.

3.2.2 Tokenization

The JavaSDM tokenizer was not fully appropriate for noisy data; for example, character sequences like ",candy" and "broken,it" were not split up. I wrote a new tokenizer which split up tokens that include one or more punctuation marks. It did not split up hyphenated words, and it used abbreviation lists to avoid splitting abbreviations. The abbreviation lists were created by extracting Swedish abbreviations from the Stockholm-Umeå Corpus and Talbanken and English abbreviations from the Penn Treebank.
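A rough sketch of that behaviour (illustrative Python; the abbreviation list below is a placeholder, not the lists extracted from the corpora):

```python
import re

# Placeholder; the real lists came from SUC, Talbanken and the Penn Treebank.
ABBREVIATIONS = {"e.g.", "i.e.", "etc."}

def tokenize(text):
    tokens = []
    for raw in text.split():
        if raw.lower() in ABBREVIATIONS or re.fullmatch(r"\w+(-\w+)+", raw):
            tokens.append(raw)  # keep abbreviations and hyphenated words intact
        else:
            # split off punctuation inside or around the token (hyphens excepted)
            tokens.extend(re.findall(r"\w+(?:-\w+)*|[^\w\s]", raw))
    return tokens

print(tokenize("i want pepper,mill and a pepper-mill"))
# ['i', 'want', 'pepper', ',', 'mill', 'and', 'a', 'pepper-mill']
```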

3.2.3 Part-Of-Speech Tagging

The open source Hidden Markov Model tagger Hunpos was used to POS-tag the data before training. Swedish models trained by Beáta Megyesi on the Stockholm-Umeå Corpus and English models trained on the Wall Street Journal corpus were downloaded from http://stp.lingfil.uu.se/~bea/resources/hunpos/ and http://code.google.com/p/hunpos/. The tagging was not more fine-grained than "_a" = adjective, "_n" = noun, "_v" = verb and "_other" = any other word class. POS-tagging introduces a distinction between homonyms of different word classes. Morphological tagging can be useful for separating homonyms, but during pre-tests it gave lower results than the chosen POS-tagging (see Section 5.2).

The test words were either nouns, verbs or adjectives (see Section 3.4.1). After Hunpos tagging, past participle tags were converted to adjective tags because the two are very similar, and in Svenska Akademiens ordlista (SAOL, the dictionary of the Swedish Academy) several words, for example "intresserad" ("interested"), are classified both as adjective and perfect participle. According to Holmer (2009) a past participle has a function somewhere between adjective and verb. If a past participle has frequently enough been used with an adjective rather than a verb function, it is included in SAOL as an individual lexeme classified as an adjective. Proper nouns were treated as nouns, since both domains in this study have named products. As mentioned in Section 2.1, names can be near-synonyms, for example "MAC" and "computer" or "Marabou" and "chocolate".

Tests using the dictionary-based gold standards were run to find indications of whether to tag the whole corpus or only verbs, nouns and adjectives, since those are the only classes included in the test data. These tests, which were done before any manual correction, showed results favoring tagging only nouns, verbs and adjectives. This tag set was consequently used in this study. The corpora could then look like this:

the fat_a lady_n sings_v

POS-tagged data were required by the lemmatization tools.
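The conversion to the coarse tag set can be sketched as follows (the fine-grained tag names below are assumed, SUC-style examples; they are not quoted from the thesis):

```python
# Proper nouns are mapped to nouns and past participles to adjectives; all other
# word classes are left untagged, matching the setup chosen above.
COARSE = {"JJ": "_a", "PC": "_a",   # adjective, (past) participle (assumed tags)
          "NN": "_n", "PM": "_n",   # noun, proper noun (assumed tags)
          "VB": "_v"}               # verb (assumed tag)

def attach_coarse_tag(token, fine_tag):
    return token + COARSE.get(fine_tag[:2], "")

tagged = [("the", "DT"), ("fat", "JJ"), ("lady", "NN"), ("sings", "VB")]
print(" ".join(attach_coarse_tag(t, tag) for t, tag in tagged))
# the fat_a lady_n sings_v
```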

3.2.4 Lemmatization

The open source lemmatizer Lempas by Silvia Cinkova was used for Swedish (http://ufal.mff.cuni.cz/~cinkova/). A lemmatizer from the NLTK tool-kit was used for English (http://nltk.org/). Data needed to be POS-tagged before lemmatization could be performed by these lemmatizers. The purpose of lemmatization is to group the inflected forms of a word as a single item (http://www.collinsdictionary.com/dictionary/english/lemmatisation). If, for example, the graph words "run" and "runs" occur 5 times each, their lemma occurs 10 times if the data is lemmatized. A disadvantage of lemmatization is that it removes distinctions between inflections that sometimes matter. For example the plural form "tillgångar" ("assets") often means money, which is not what its lemma form "tillgång" ("asset"/"access") is used for.
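The thesis does not say which NLTK lemmatizer was used for English; a minimal sketch assuming the WordNet lemmatizer, which uses the POS information produced in the previous step:

```python
from nltk.stem import WordNetLemmatizer  # requires the NLTK 'wordnet' data package

lemmatizer = WordNetLemmatizer()
print(lemmatizer.lemmatize("runs", pos="v"))    # run
print(lemmatizer.lemmatize("babies", pos="n"))  # baby
```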

3.2.5 Stop Word Lists

Both the Swedish and the English stop word lists were taken from the NLTK tool-kit (http://nltk.org/). A regular expression matching tokens that do not contain any alphabetic character was added to the stop word lists. A few quick tests were enough to clearly see that the use of stop word lists improved precision; therefore all tests used stop word lists. JavaSDM can read a given list of stop words and remove all those words from the training data before doing any training. Tests using the dictionary-based gold standards indicated, however, that it is better to keep stop words during training and instead use the list of stop words as a filter that prevents those words from being given as synonym proposals (see Section 5.2 for a discussion of why). For that reason, and in order not to double the number of experiments and the manual evaluation, that method was used in the experiments.
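A sketch of the stop word handling (the exact regular expression for non-alphabetic tokens is not given in the thesis, so the pattern below is an assumption):

```python
import re
from nltk.corpus import stopwords  # requires the NLTK 'stopwords' data package

STOP = set(stopwords.words("english")) | set(stopwords.words("swedish"))
NON_ALPHABETIC = re.compile(r"^[^a-zA-ZåäöÅÄÖ]+$")  # assumed pattern

def is_stop(token):
    return token.lower() in STOP or NON_ALPHABETIC.match(token) is not None

print([t for t in ["the", "pepper", "mill", "123", "?!"] if not is_stop(t)])
# ['pepper', 'mill']
```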

3.3 Synonym Extraction by Random Indexing

3.3.1 Random Indexing Tool-Kit

The tool-kit called JavaSDM (http://www.csc.kth.se/~xmartin/java/), written by Martin Hassel, is the software used to extract synonyms in this study. It is an open source tool-kit which uses random indexing (more about random indexing in Section 2.3). JavaSDM can use one or more documents to train a model for synonym extraction. The model can be given a list of test words and provides a ranked list of synonym candidates for each test word.

Dimensionality, random degree, weighting scheme and context window size are training parameters that can be altered. All these parameters affect synonym extraction, but an examination of them is beyond the scope of this study. For example, 100 models would have to be created for each chat log corpus in order to find the best context window combining 1–10 words before and 1–10 words after the focus word. Each of these models would need to be combined with the other parameters, which means 100 times more models. Each new parameter setting multiplies the number of tests. For this reason the default settings were used in the experiments.

• A dimensionality as low as possible without decreasing extraction quality is desirable, since it reduces time and memory consumption. Where that point lies depends on a combination of the size and redundancy of the data. The best dimensionality is usually "between a couple of hundred and several thousands depending on the size and redundancy of the data" (Hassel (2007)). The default dimensionality is 1000.

• Random degree is the number of random +1's and -1's in each random index vector. The default random degree is 8, which means four +1's and four -1's.

• Context window size is the number of words before and after the focus word that will be considered as the context for each word in the text. When scanning through the text, the random index vectors of the neighbors in the sliding window are added to the context vector of the current focus word (Sahlgren (2005) and Hassel (2007)). A brief context window test gave better results for symmetrical windows with 4 or 6 words before and after the focus word than for a window with 5 words before and after the focus word. This indicates that the effect of the window size is not obvious. A symmetric context window with 4 words on each side of the focus word was the default setting.

• The three simplest weighting schemes available in JavaSDM are:
  – ConstantWS: Using a constant weight makes the weighting word order independent.
  – MangesWS: Calculates the weight based upon the distance to the current label as weight = 2^(1 − distance to focus word).
  – MartinsWS: Calculates the weight based upon the distance to the current label as weight = 1 / distance to focus word.

"Manges Weighting scheme" is the default weighting scheme. It weights with exponential dampening, 2^(1−d), where d is the distance between the focus word and the neighbor. This weighting scheme has for example been used in papers by the creator of JavaSDM (Rosell et al. (2009)). When the context window has four words on each side of the focus word, the Manges, Martins and Constant weighting schemes define weights like this:

          can   i     buy   the   merchandise   online   to    be    pick
Manges    1/8   1/4   1/2   1     Focus Word    1        1/2   1/4   1/8
Martins   1/4   1/3   1/2   1     Focus Word    1        1/2   1/3   1/4
Constant  1     1     1     1     Focus Word    1        1     1     1
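The three schemes, expressed as functions of the distance d between the neighbour and the focus word (a direct transcription of the formulas above, not JavaSDM code):

```python
def manges(d):   return 2.0 ** (1 - d)   # exponential dampening (default)
def martins(d):  return 1.0 / d
def constant(d): return 1.0

print([manges(d) for d in range(1, 5)])   # [1.0, 0.5, 0.25, 0.125]
print([martins(d) for d in range(1, 5)])  # [1.0, 0.5, 0.3333333333333333, 0.25]
```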

An extraction example from JavaSDM is given below. JavaSDM only measures the similarity between two words by the cosine similarity of their corresponding context vectors (the dot product of the normalized vectors). The maximal cosine similarity is 1. Ferret (2010) showed that cosine similarity gave the best results in a comparison with four other semantic similarity measures called ehlert, jaccard, lin and dice. The test word is "cute" and the other words are synonym proposals followed by their cosine similarity to the test word.

cute
ugly        0.92525595
pretty      0.9251133
sexy        0.915225
weird       0.9052225
funny       0.89562523
thick       0.890025
stupid      0.88813686
beautiful   0.876878
hair        0.8744281

3.3.2 Extraction Criteria

The extraction criteria were:

• Candidate set size limit:
  – a limit of at most 15 synonym proposals for models trained on graph word data
  – a limit of at most 10 synonym proposals for models trained on lemmatized data

A higher number of proposals is expected to improve recall and lower precision. In preliminary test runs 5, 10, 15 and 30 were used as the maximal number of proposals per test word. The f-scores for lemma models were highest when using 10, and 15 candidates gave the best f-score for graph word models. Thus the number of synonym proposals was limited to 10 for lemma based models and 15 for graph word based models. The reason graph word based models benefitted from more candidates is probably that they have more words in the gold standard. 30 proposals was very memory- and time-consuming.

• Different cosine similarity thresholds:
  – 0 (no threshold)
  – 0.5
  – 0.7
  – 0.85

The purpose of a similarity score threshold is to weed out bad synonym proposals and thereby get better precision. With a too high threshold, good synonym proposals might not be given, and thereby recall would be lower. A cosine similarity threshold was not an optional setting in the downloaded JavaSDM code, and therefore new code had to be written. Cosine similarity thresholds of 0, 0.5, 0.7 and 0.85 were used. These have different effects on precision, recall and f-score. The thresholds were chosen after running pre-tests with many different thresholds to see around which cosine similarity scores the results differ more or less. The synonym candidates with a cosine similarity score above the threshold are shown, but never more than the maximal number of candidates. A maximal number of candidates is preferable because some words have a really high number of candidates with high similarity, and the more suggestions, the more time and memory are consumed (when having 30 as the maximal number of candidates, my computer randomly crashed even with a cosine similarity threshold of 0.85, and it was very time consuming). Having 30 candidates as the maximum did not clearly decrease f-score in preliminary tests. If a user wants to use this tool to get a list of synonym candidates, he/she should set the highest number of candidates he/she wants to read. Another alternative was to, instead of limiting the candidate set size, use a threshold like the one mentioned above combined with a relative threshold compared to the synonym candidate with the highest similarity score. A few attempts were made to examine this idea, but as only negative effects were found this method was not further explored in this study.

• Extracted words whose POS-tag does not match the test word's tag are not given as synonym proposals.

This criterion is intended to separate the verb "book" from the noun "book" and also to exclude extractions of words from other word classes than the test word, since synonyms are supposed to belong to the same word class. For example "hair" would not have been extracted as a synonym candidate for "cute", as in the extraction example in Section 3.3.1. This criterion is only used when a model is trained on POS-tagged data. This criterion would have a negative effect if a POS-tagger has mistagged an extraction that actually is of the correct word class.

• Extracted stop words are not given as synonym proposals.

The stop word lists were collected as described in Section 3.2.5, but those tokens were kept during training. Instead, JavaSDM extractions of tokens that are in those lists were filtered out and thereby not counted as synonym proposals.
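Taken together, the criteria amount to a filtering step over JavaSDM's ranked candidate list. The sketch below illustrates that reading; the function and parameter names are illustrative, not JavaSDM's API.

```python
def filter_candidates(ranked, test_word_tag, stop_words,
                      threshold=0.5, max_candidates=10, use_pos=True):
    """ranked: (candidate, cosine) pairs sorted by cosine, descending.
    test_word_tag: the coarse tag suffix of the test word, e.g. "_n"."""
    proposals = []
    for word, score in ranked:
        if score < threshold:
            break                     # remaining candidates score even lower
        if word in stop_words:
            continue                  # stop words are never proposed
        if use_pos and not word.endswith(test_word_tag):
            continue                  # word-class criterion (POS-tagged models only)
        proposals.append((word, score))
        if len(proposals) == max_candidates:
            break                     # candidate set size limit
    return proposals
```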

3.4 Evaluation

The focus of this study is to evaluate preprocessing operations and extraction criteria. The evaluation is done by computing precision, recall and f-score. One test word set based on lemmas and one based on graph words were created. For each set, one dictionary-based and one adapted gold standard were created. The dictionary-based gold standards were created from existing synonym dictionaries and inflections of those synonyms and of the test words themselves. The adapted gold standard is the dictionary-based gold standard extended by manual correction.
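For each test word the measures are computed in the standard way (these definitions are standard ones, not quoted from the thesis), where proposals is the set of synonym candidates returned for the test word and gold is its gold standard set:

\[
\text{precision} = \frac{|\text{proposals} \cap \text{gold}|}{|\text{proposals}|},\qquad
\text{recall} = \frac{|\text{proposals} \cap \text{gold}|}{|\text{gold}|},\qquad
F = \frac{2 \cdot \text{precision} \cdot \text{recall}}{\text{precision} + \text{recall}}
\]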

3.4.1 Test Words

The test words in this study were selected among adjectives, nouns and verbs. The POS-tagging described in Section 3.2.3 was needed for this. The test words had to be chosen from the training data; otherwise JavaSDM would have no possibility of suggesting any synonym candidates. Keyword extraction based on a method by Ortuño et al. (2002) was used to extract the top 200 nouns, 100 verbs and 100 adjectives from the chat log collections. Those words also had to have a frequency of at least 30 in the chat log collection. This was done with POS-tagged data. No more than one inflected form of a word was included in the test word set based on graph words, in order to have the same number of test words for lemmas and graph words. The test word set based on lemmas consisted of the lemmas of the words in the test word set based on graph words. If no synonyms were found for a word that so far qualified for the gold standard, the word was not used as a test word.
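A sketch of the selection step, under the assumption that the Ortuño et al. (2002) keyword measure is the normalized standard deviation of the gaps between successive occurrences of a word (the thesis does not spell the formula out):

```python
import numpy as np

def sigma_score(positions):
    # Keywords tend to cluster, giving a high normalized spread of the gaps.
    gaps = np.diff(sorted(positions))
    return float(np.std(gaps) / np.mean(gaps)) if len(gaps) > 1 else 0.0

def select_test_words(occurrences, top_n, min_freq=30):
    """occurrences: {word: [token positions in the corpus]} for one POS class."""
    frequent = {w: p for w, p in occurrences.items() if len(p) >= min_freq}
    ranked = sorted(frequent, key=lambda w: sigma_score(frequent[w]), reverse=True)
    return ranked[:top_n]
```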

3.4.2 Gold Standard

The web pages www.synonymer.se and www.synonymer.org were used to look up base forms of Swedish synonyms. WordNet and www.thesaurus.com were used to find English synonyms. For example the adjective "able" got the synonyms "bright", "capable", "easy", "good", "strong" and "worthy" included in the gold standard. The English synonym sources included a lot more words than the Swedish sources. For a synonym found in a thesaurus to be included in the gold standard, its graph word had to occur at least once with the correct POS-tag in the training data. Otherwise it would be impossible to extract that word, and the non-extraction of the word would say nothing about the quality of JavaSDM. For example, if an English thesaurus had "book" as a synonym to "order", it would have been included in the gold standard because of lines like line 2 and not because of lines like line 1: Line 1: I read_v a book_n

Line 2: I want_v to book_v a ticket_n

Synonym phrases and compound words (separated by space) were not included in the gold standard because JavaSDM only extracts single tokens. For example "be good for" was not included in the gold standard as a synonym for "benefit". With a collocation extractor a phrase like that could have been transformed into the token "be_good_for".

If a model is trained on perfectly lemmatized data it cannot be used to extract other inflected forms of a word than the lemma, which models trained on graph word data often can (during the tests there were occasions when the lemmatizers had failed to lemmatize every form of a word the same way). It is also possible that a form of a word identical to the lemma is never used in the graph word data. Therefore the lemma and graph word models needed different gold standards. First the lemma gold standard was created as described above. Then a triple-column version of the training data was created, which showed the graph word version of the training data in the first column, the lemmatized version in the second column and the POS-tags in the third column, like this:

dogs   dog   NN
are    be    VB
cute   cute  JJ

Every graph word that had the same lemma and POS-tag as a word in the lemma gold standard was included in the graph word gold standard. Table 3.1 shows the effect of this gold standard creation method for the noun "babe":

Table 3.1:

Lemma Gold Standard   Graph Word Gold Standard
baby                  babies
child                 baby
newborn               babys
                      children
                      childs
                      newborn

In this example there are twice as many words in the graph word gold standard. Still one can see that not all inflections are included in the graph word gold standard, for example the plural form of newborn. That is because “newborns” never occurred in the training data. On the other hand the informal or incorrect forms “babys” and “childs” were included because they were used by users and they had been lemmatized as “baby” and “child” and were POS-tagged as nouns.
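A sketch of this expansion step (the data structures are illustrative, not the scripts used in the thesis):

```python
def graph_word_gold(lemma_gold, triples):
    """lemma_gold: {test word: set of (synonym lemma, POS) pairs};
    triples: (graph word, lemma, POS) rows from the training data."""
    gold = {test_word: set() for test_word in lemma_gold}
    for graph_word, lemma, pos in triples:
        for test_word, synonyms in lemma_gold.items():
            if (lemma, pos) in synonyms:
                gold[test_word].add(graph_word)
    return gold

triples = [("babies", "baby", "NN"), ("babys", "baby", "NN"),
           ("children", "child", "NN"), ("newborn", "newborn", "NN")]
print(graph_word_gold({"babe": {("baby", "NN"), ("child", "NN"), ("newborn", "NN")}},
                      triples))
# {'babe': {'babies', 'babys', 'children', 'newborn'}} (set order may vary)
```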

3.4.3 Adapted Gold Standard

The dictionary-based gold standards include far from all synonyms. For example "monteringsanvisning" ("assembly instruction") could not be counted as a correct synonym proposal for "instruktion" ("instruction") in tests with the dictionary-based gold standard, since it was not included there. Manual correction was done to avoid this, to get an idea of how useful an addition a synonym extractor can be to existing thesauri, and to get more accurate results.

Manual control was done on synonym proposals, from all models, that were not included in the gold standard when no threshold was used. Tests using thresholds could not possibly find words that were not found without a threshold. Manually found synonyms, antonyms and semantically near-related words such as hypernyms, hyponyms and siblings were added to the extended gold standard (more about synonyms and near-related words in Section 2.1). For example "trasig" ("broken") was added to the gold standard for "skadad" ("wounded") and "kontokort" ("credit card") was added to the gold standard for "kort" ("card"). Antonyms/siblings such as "mother" and "father" were also added to the extended gold standard. For the Swedish test word "bebis" ("baby") the non-Swedish word "infant" had been extracted, and since "infant" is an English translation of "bebis" it was added to the extended gold standard.

Generally the synonym proposals manually found to be correct were added to both the lemma and the graph word gold standards. When graph word based models had found more than one inflection of a word applicable for the gold standard, only one inflection of that word was added to the lemma gold standard although all the inflections were added to the graph word gold standard. If a word was found by a lemma based model but not by any of the graph word based models, there was no chance that the graph word based models would find that word in tests using thresholds, since those models would not find more words than the ones they found using no threshold. Still it was important to add that word to both the lemma and graph word gold standards to get a fair recall and f-score comparison. Examples of differences between dictionary-based and adapted gold standards are shown in Tables 3.2–3.4.

Table 3.2: English Retail Lemma, Test word = crap (noun):

Dictionary-based Gold Standard   Adapted Gold Standard
nonsense                         nonsense
bunk                             bunk
twaddle                          twaddle
                                 rubbish
                                 shit

Misspellings were also added to the extended gold standard. In order to save time, the Levenshtein algorithm was used to see whether an extraction was a misspelling of the test word (Jurafsky and Martin (2009), p. 108). The edit distance had to be 1 or 2 and the test word had to be at least five letters long; otherwise too many words could be counted as misspellings. The longer a word is, the more plausible a larger edit distance becomes, but deciding at which word lengths to allow a larger distance would have required further research. If a word with edit distance 1 or 2 was already included in the gold standard, for example "monkeys" for "monkey", then it was not added as a misspelling. More misspellings were found manually afterwards among the incorrectly classified extractions.

On average 1–3 new tokens per test word were added to the adapted gold standards of the four chat collections respectively. Approximately one fifth of these were misspellings. Table 3.5 shows the number of tokens included in each gold standard. The table also shows that there are more keywords with synonyms and semantically related words in the retail data than in the travel data.
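A sketch of the automatic part of that check (standard Levenshtein distance; the helper names are illustrative):

```python
def edit_distance(a, b):
    # Standard dynamic-programming Levenshtein distance.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

def is_misspelling(proposal, test_word, gold_standard):
    # Edit distance 1 or 2, test word at least five letters, and the proposal
    # must not already be in the gold standard (e.g. "monkeys" for "monkey").
    if len(test_word) < 5 or proposal in gold_standard:
        return False
    return 1 <= edit_distance(proposal, test_word) <= 2

print(is_misspelling("nummre", "nummer", set()))  # True
```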


Table 3.3: English Retail Graph Word, Test word = noticed (verb):

Dictionary-based Gold Standard   Adapted Gold Standard
observe                          observe
advert                           advert
catch                            catch
mark                             mark
mind                             mind
minded                           minded
note                             note
noted                            noted
recognize                        recognize
recognized                       recognized
refer                            refer
referred                         referred
referring                        referring
regard                           regard
regarding                        regarding
see                              see
seeing                           seeing
spot                             spot
spotted                          spotted
                                 discovered
                                 found
                                 realised
                                 realized
                                 notice (inflection of the test word)

Table 3.4: Swedish Retail Lemma, Test word = nummer (noun):

Dictionary-based Gold Standard   Adapted Gold Standard
exemplar                         exemplar
format                           format
mått                             mått
nr                               nr
personnummer                     personnummer
siffra                           siffra
storlek                          storlek
stycke                           stycke
tal                              tal
telefonnummer                    telefonnummer
                                 telefonnr
                                 telnr
                                 telefonnumer (misspelling)

Table 3.5: The gold standards are defined with the following abbreviations: DB = dictionary-based gold standard, A = adapted gold standard, GW = graph words, L = lemma. The chat collections and number of test words are defined like this: R-Swe(244) = Swedish Retail and 244 test words.

        R-Swe(244)   T-Swe(152)   R-Eng(171)   T-Eng(175)
DB-GW   3065         1509         3322         2447
A-GW    3675         1852         3642         2696
DB-L    1271         574          1874         1359
A-L     1849         890          2139         1568

4 Results

The first section shows the tested combinations of models and criteria that will be discussed in this chapter. It also describes the metrics computed in the study. Sections 4.2–4.4 summarize the main effects of thresholds, POS-tagging and lemmatization in the experiments. The dictionary-based and the adapted gold standards are compared in Section 4.5. F-score is the main metric, but Sections 4.6 and 4.7 exemplify how precision and recall were affected by preprocessing operations and extraction criteria. Section 4.8 shows the f-score for all tests using the adapted gold standards. Tables in Section 4.9 give a comparison overview of the evaluated combinations of preprocessing operations and extraction criteria.

4.1 Tests

16 tests were run on each of the four chat log corpora. Table 4.1 shows their parameters.

Table 4.1: Name = the name of the test, Text Unit, Set = the candidate set size limit, Threshold = the cosine similarity threshold and POS = part-of-speech tagged or not:

Name              Text Unit    Set   Threshold   POS
15-Graph-0        Graph word   15    0           -
15-Graph-0p5      Graph word   15    0.5         -
15-Graph-0p7      Graph word   15    0.7         -
15-Graph-0p85     Graph word   15    0.85        -
15-GraphPOS-0     Graph word   15    0           +
15-GraphPOS-0p5   Graph word   15    0.5         +
15-GraphPOS-0p7   Graph word   15    0.7         +
15-GraphPOS-0p85  Graph word   15    0.85        +
10-Lemma-0        Lemma        10    0           -
10-Lemma-0p5      Lemma        10    0.5         -
10-Lemma-0p7      Lemma        10    0.7         -
10-Lemma-0p85     Lemma        10    0.85        -
10-LemPOS-0       Lemma        10    0           +
10-LemPOS-0p5     Lemma        10    0.5         +
10-LemPOS-0p7     Lemma        10    0.7         +
10-LemPOS-0p85    Lemma        10    0.85        +

The following metrics, computed for all the test words, are discussed in this chapter:

• Precision, recall and f-score separately
  – Using the dictionary-based and the adapted gold standards separately
  – Averages and standard deviations

4.2 Threshold

As expected, a higher threshold yields higher precision. Strict extraction criteria in general benefit precision but not recall. A threshold of at least 0.5 is recommended, since practically no correct synonym proposals had cosine similarities below that level; thus lower thresholds only lower precision and f-score. Models trained on POS-tagged and lemmatized data from English Retail and Swedish Travel got better precision and f-score when using a threshold of 0.7, but in all other cases 0.5 gave the best f-score and recall. Figure 4.1 shows an example of how precision, recall and f-score were affected by thresholds in this study.

Figure 4.1: A diagram showing the effect of cosine similarity threshold on precision, recall and f-score when using a model trained on the lemmatized version of Swedish Retail.

4.3 Part-of-speech tagging

POS-tagging improves precision but lowers the number of misspellings found among the synonym proposals. Misspellings are harder to give a correct POS-tag, and due to the word-class criterion a correct tag is necessary. The fact that POS-tagged models got the best f-score for all collections when using the dictionary-based gold standard indicates that POS-tagging is useful for handling "well-known" words (see appendix). When using travel data, regardless of language, POS-tagging improved the f-score for both lemma and graph word based models. Models trained on POS-tagged data got more words with high similarity scores.


4.4 Lemmatization

Models trained on lemmatized data gave better precision, recall and f-score than graph word based models for all chat collections. Models trained on lemmatized data also got more words with high cosine similarity scores than graph word based models. Worth noting is that models trained on lemmatized data can propose tokens that in theory could be lemmatized misspellings. For example "essen" could be a lemmatization of the token "essens", which probably is a misspelling of "essence".

4.5 The different gold standards

POS-tagging combined with lemmatization improved the f-score when using the dictionary-based gold standards. When using the adapted gold standards, POS-tagging lowered the f-score for all chat collections but Swedish Travel. The best cosine similarity thresholds were either equal or lower when using the adapted gold standards. The results for precision, recall and f-score were much higher when using the adapted gold standards. If misspellings are excluded from the adapted gold standard, there were not many differences in the ranking of the models and thresholds, except for English Travel. The differences were the cosine similarity thresholds, as described above regarding adapted gold standards including misspellings, and the fact that lemmatized English Retail data gave a better f-score when it was not POS-tagged and the adapted gold standard without misspellings was used. These observations are exemplified by Tables 4.2–4.4, showing the ranking lists according to average f-score using the dictionary-based gold standard, the adapted gold standard and the adapted gold standard without misspellings for Swedish Retail.

Table 4.2: Swedish Retail using the dictionary-based gold standard:

Rank  Model             F-score
1.    10-LemPOS-0p7     0.059±0.085
2.    10-LemPOS-0p5     0.059±0.084
3.    10-LemPOS-0       0.059±0.084
4.    10-Lemma-0p7      0.055±0.085
5.    10-LemPOS-0p85    0.055±0.103
6.    10-Lemma-0p5      0.053±0.075
7.    10-Lemma-0        0.053±0.075
8.    15-Graph-0p5      0.05±0.059
9.    15-Graph-0        0.049±0.059
10.   10-Lemma-0p85     0.048±0.104
11.   15-GraphPOS-0p5   0.047±0.059
12.   15-GraphPOS-0     0.047±0.059
13.   15-Graph-0p7      0.043±0.063
14.   15-GraphPOS-0p7   0.043±0.057
15.   15-GraphPOS-0p85  0.03±0.062
16.   15-Graph-0p85     0.027±0.069

Table 4.3: Swedish Retail using the adapted gold standard:

Rank  Model             F-score
1.    10-Lemma-0p5      0.247±0.17
2.    10-Lemma-0        0.246±0.171
3.    10-Lemma-0p7      0.236±0.177
4.    10-LemPOS-0p5     0.231±0.179
5.    10-LemPOS-0       0.231±0.179
6.    10-LemPOS-0p7     0.231±0.179
7.    15-Graph-0p5      0.176±0.129
8.    15-Graph-0        0.175±0.129
9.    15-Graph-0p7      0.156±0.130
10.   10-LemPOS-0p85    0.155±0.177
11.   15-GraphPOS-0p5   0.149±0.126
12.   15-GraphPOS-0     0.149±0.126
13.   15-GraphPOS-0p7   0.142±0.127
14.   10-Lemma-0p85     0.118±0.154
15.   15-GraphPOS-0p85  0.086±0.123
16.   15-Graph-0p85     0.07±0.113

Table 4.4: Swedish Retail using the adapted gold standard without misspellings:

Rank  Model             F-score
1.    10-LemPOS-0p7     0.20861813±0.17
2.    10-LemPOS-0p5     0.20810248±0.17
3.    10-LemPOS-0       0.20809248±0.17
4.    10-Lemma-0p5      0.20524344±0.157
5.    10-Lemma-0        0.2045285±0.157
6.    10-Lemma-0p7      0.20224719±0.168
7.    10-LemPOS-0p85    0.14682704±0.182
8.    15-Graph-0p5      0.14472191±0.114
9.    15-Graph-0        0.14366198±0.113
10.   15-GraphPOS-0p5   0.13288288±0.119
11.   15-GraphPOS-0     0.13287288±0.119
12.   15-Graph-0p7      0.12905452±0.12
13.   15-GraphPOS-0p7   0.12848777±0.122
14.   10-Lemma-0p85     0.10565685±0.163
15.   15-GraphPOS-0p85  0.08052743±0.123
16.   15-Graph-0p85     0.06271605±0.112

4.6 Precision

For all chat log collections, the tests with a cosine similarity threshold of 0.85 were the four models with the highest precision, as shown in Tables 4.5 and 4.6. POS-tagging and lemmatization had a positive effect on precision. When models are trained on lemmatized data, more words get high similarity scores. Although models trained on graph word data have a limit of 15 synonym proposals instead of 10, the models trained on lemmas extracted almost as many or more words when the similarity threshold was set to 0.85. Table 4.5 exemplifies how prominently POS-tagging improved precision for all chat log collections but Swedish Retail.

Table 4.5: Precision ranking for English Retail using the adapted gold standard:

Rank  Model             Precision      Correct Proposals / Proposals
1     10-LemPOS-0p85    0.418±0.327    89/213
2     15-GraphPOS-0p85  0.357±0.367    91/255
3     10-Lemma-0p85     0.275±0.311    131/476
4     15-Graph-0p85     0.249±0.313    125/502
5     10-LemPOS-0p7     0.247±0.256    166/671
6     10-LemPOS-0p5     0.217±0.246    180/831
7     10-LemPOS-0       0.215±0.243    180/838
8     15-GraphPOS-0p7   0.213±0.238    191/895
9     15-GraphPOS-0p5   0.196±0.221    231/1180
10    15-GraphPOS-0     0.192±0.216    231/1201
11    10-Lemma-0p7      0.185±0.174    245/1324
12    15-Graph-0p7      0.162±0.154    283/1750
13    10-Lemma-0p5      0.159±0.154    262/1644
14    10-Lemma-0        0.159±0.15     262/1650
15    15-Graph-0p5      0.139±0.117    338/2426
16    15-Graph-0        0.136±0.116    338/2494

When the threshold is 0.85, models trained on POS-tagged Swedish data made more synonym proposals than untagged models, but they made far fewer proposals when the threshold was low, see Table 4.6.

Table 4.6: Precision ranking for Swedish Retail using the adapted gold standard:

Rank  Model             Precision      Correct Proposals / Proposals
1     10-Lemma-0p85     0.333±0.384    133/399
2     15-Graph-0p85     0.294±0.380    147/500
3     10-LemPOS-0p85    0.291±0.351    195/671
4     15-GraphPOS-0p85  0.268±0.391    187/697
5     10-LemPOS-0p7     0.247±0.206    402/1625
6     10-Lemma-0p7      0.244±0.218    423/1731
7     10-LemPOS-0p5     0.24±0.195     413/1720
8     10-LemPOS-0       0.24±0.195     413/1720
9     10-Lemma-0p5      0.224±0.171    508/2265
10    10-Lemma-0        0.222±0.171    508/2279
11    15-Graph-0p7      0.206±0.201    462/2246
12    15-GraphPOS-0p7   0.187±0.185    422/2256
13    15-GraphPOS-0p5   0.182±0.165    461/2531
14    15-GraphPOS-0     0.182±0.165    461/2531
15    15-Graph-0p5      0.18±0.133     631/3498
16    15-Graph-0        0.178±0.133    632/3550

A summary of the precision effects of preprocessings and thresholds for the four chat log collections:

1. A high threshold: 0.85 was best no matter what model or domain.
2. POS-tagging improved precision for both lemma and graph words (Swedish Retail was the only exception).
3. The lemma based models got higher precision than graph word based models.

4.7 Recall

For Swedish and English Retail and Swedish Travel no more correct candidates were found among the synonym proposals below cosine similarities of 0.5. Only English Travel made a few correct extractions below 0.5 which is shown in Table 4.7. Graph word based models generally extract more correct candidates due to all inflections of the test words and other synonyms. If for example subtracting proposals with the same lemma as the test words, the number of correct synonym proposals are pretty close for graph word and lemma models. A summary of the recall effects of preprocessings and thresholds for the four chat log collections: 1. Low thresholds benefit recall but practically no more correct synonyms were found with thresholds below 0.5 with the selected candidate set size limits 2. Lemma based models got higher recall than graph word based models 3. POS-tagged models got lower recall


Table 4.7: Recall ranking for English Travel using the adapted gold standard:

Rank  Model            Recall          Correct Proposals / Words in Gold Standard
1     10-Lemma-0       0.136 ± 0.157   213/1568
2     10-Lemma-0p5     0.135 ± 0.155   212/1568
3     10-LemPOS-0      0.106 ± 0.147   166/1568
4     10-LemPOS-0p5    0.105 ± 0.144   164/1568
5     15-Graph-0       0.104 ± 0.135   280/2696
6     15-Graph-0p5     0.102 ± 0.133   274/2696
7     10-Lemma-0p7     0.098 ± 0.125   153/1568
8     15-GraphPOS-0    0.083 ± 0.116   223/2696
9     15-GraphPOS-0p5  0.082 ± 0.112   220/2696
10    10-LemPOS-0p7    0.075 ± 0.123   118/1568

4.8 F-score

The f-score rankings in Tables 4.8–4.11 were computed with the adapted gold standards including misspellings. When using the dictionary-based gold standard, POS-tagging gave the best f-scores for all collections. When excluding misspellings from the adapted gold standard, models that were both lemmatized and POS-tagged gave the highest f-score for all collections but English Retail. When including misspellings in the adapted gold standard, models trained on data without POS-tags performed better than models trained on POS-tagged data.
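The rankings below report mean scores (± standard deviation) over the test words. A minimal sketch of that computation, assuming the proposals and the gold standard entries are available per test word (the names and data layout are illustrative):

    from statistics import mean, pstdev

    def prf(proposals, gold):
        """Precision, recall and f-score for a single test word."""
        correct = len(set(proposals) & set(gold))
        precision = correct / len(proposals) if proposals else 0.0
        recall = correct / len(gold) if gold else 0.0
        f_score = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        return precision, recall, f_score

    def summarize(results):
        """results: list of (proposals, gold) pairs, one per test word.
        Returns (mean, standard deviation) for precision, recall and f-score."""
        scores = [prf(p, g) for p, g in results]
        return [(round(mean(col), 3), round(pstdev(col), 3)) for col in zip(*scores)]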

Table 4.8: F-score ranking for Swedish Retail using the adapted gold standard:

Rank  Model             F-score
1     10-Lemma-0p5      0.247 ± 0.17
2     10-Lemma-0        0.246 ± 0.171
3     10-Lemma-0p7      0.236 ± 0.177
4     10-LemPOS-0p7     0.231 ± 0.179
5     10-LemPOS-0p5     0.231 ± 0.179
6     10-LemPOS-0       0.231 ± 0.179
7     15-Graph-0p5      0.176 ± 0.129
8     15-Graph-0        0.175 ± 0.129
9     15-Graph-0p7      0.156 ± 0.13
10    10-LemPOS-0p85    0.155 ± 0.177
11    15-GraphPOS-0p5   0.149 ± 0.126
12    15-GraphPOS-0     0.149 ± 0.126
13    15-GraphPOS-0p7   0.142 ± 0.127
14    10-Lemma-0p85     0.118 ± 0.154
15    15-GraphPOS-0p85  0.086 ± 0.123
16    15-Graph-0p85     0.07 ± 0.113

Table 4.9: F-score ranking for English Retail using the adapted gold standard:

Rank  Model             F-score
1     10-Lemma-0p7      0.141 ± 0.139
2     10-Lemma-0p5      0.139 ± 0.134
3     10-Lemma-0        0.138 ± 0.134
4     10-LemPOS-0p5     0.121 ± 0.143
5     10-LemPOS-0       0.121 ± 0.143
6     10-LemPOS-0p7     0.118 ± 0.143
7     15-Graph-0p5      0.111 ± 0.107
8     15-Graph-0        0.11 ± 0.106
9     15-Graph-0p7      0.105 ± 0.111
10    10-Lemma-0p85     0.1 ± 0.14
11    15-GraphPOS-0p5   0.096 ± 0.114
12    15-GraphPOS-0     0.095 ± 0.113
13    15-GraphPOS-0p7   0.084 ± 0.114
14    10-LemPOS-0p85    0.076 ± 0.165
15    15-Graph-0p85     0.060 ± 0.127
16    15-GraphPOS-0p85  0.047 ± 0.129

Table 4.10: F-score ranking for Swedish Travel using the adapted gold standard:

Rank  Model             F-score
1     10-LemPOS-0p7     0.235 ± 0.199
2     10-LemPOS-0p5     0.23 ± 0.195
3     10-LemPOS-0       0.229 ± 0.195
4     10-Lemma-0p5      0.218 ± 0.177
5     10-Lemma-0p7      0.215 ± 0.184
6     10-Lemma-0        0.215 ± 0.174
7     10-LemPOS-0p85    0.19 ± 0.219
8     15-GraphPOS-0p5   0.179 ± 0.148
9     15-GraphPOS-0     0.179 ± 0.148
10    15-GraphPOS-0p7   0.17 ± 0.149
11    15-Graph-0p5      0.163 ± 0.146
12    15-Graph-0        0.160 ± 0.139
13    15-Graph-0p7      0.142 ± 0.149
14    10-Lemma-0p85     0.138 ± 0.2
15    15-GraphPOS-0p85  0.091 ± 0.159
16    15-Graph-0p85     0.071 ± 0.139

Table 4.11: F-score ranking for English Travel using the adapted gold standard:

Rank  Model             F-score
1     10-Lemma-0p5      0.136 ± 0.126
2     10-LemPOS-0p5     0.135 ± 0.146
3     10-LemPOS-0       0.135 ± 0.146
4     10-Lemma-0        0.135 ± 0.124
5     10-Lemma-0p7      0.119 ± 0.121
6     15-GraphPOS-0p5   0.112 ± 0.115
7     15-GraphPOS-0     0.112 ± 0.114
8     15-Graph-0p5      0.11 ± 0.102
9     10-LemPOS-0p7     0.109 ± 0.141
10    15-Graph-0        0.109 ± 0.097
11    15-GraphPOS-0p7   0.089 ± 0.115
12    15-Graph-0p7      0.087 ± 0.104
13    10-Lemma-0p85     0.065 ± 0.125
14    10-LemPOS-0p85    0.038 ± 0.124
15    15-Graph-0p85     0.03 ± 0.093
16    15-GraphPOS-0p85  0.028 ± 0.108

4.9 Comparison Overview

Tables 4.12–4.16 compare each tested combination of preprocessing operations and extraction criteria with every other combination. The comparison is done by counting in what percentage of the test settings one combination gave better results than another; in the tables below, each row shows how often the model named at the left was outperformed by the model in each column. Both dictionary-based and adapted gold standard tests are counted in this section, and misspellings are included in all gold standard tests presented here. The five tables cover all corpora, the Swedish corpora, the English corpora, the corpora from the retail company and the corpora from the travel company.
• LU0 = Lemmatized and untagged, threshold = 0.0 (no threshold)
• LU5 = Lemmatized and untagged, threshold = 0.5
• LU7 = Lemmatized and untagged, threshold = 0.7
• LU85 = Lemmatized and untagged, threshold = 0.85
• LP0 = Lemmatized and POS-tagged, threshold = 0.0 (no threshold)
• LP5 = Lemmatized and POS-tagged, threshold = 0.5
• LP7 = Lemmatized and POS-tagged, threshold = 0.7
• LP85 = Lemmatized and POS-tagged, threshold = 0.85
• GP0 = Graph words and POS-tagged, threshold = 0.0 (no threshold)
• GP5 = Graph words and POS-tagged, threshold = 0.5
• GP7 = Graph words and POS-tagged, threshold = 0.7
• GP85 = Graph words and POS-tagged, threshold = 0.85


• GU0 = Graph words and untagged, threshold = 0.0 (no threshold)
• GU5 = Graph words and untagged, threshold = 0.5
• GU7 = Graph words and untagged, threshold = 0.7
• GU85 = Graph words and untagged, threshold = 0.85

These tables show that models trained on both lemmatized and POS-tagged data gave the best f-scores. A threshold of 0.5 is better than no threshold regardless of the preprocessing operations. For models trained on lemmatized Swedish data, a threshold of 0.7 gives a better f-score than 0.5. For models trained on graph words, 0.5 was the threshold giving the best f-score. POS-tagging clearly had a positive impact on travel data, while on graph word data from the retail company POS-tagging lowered the f-score. 0.85 was the threshold giving the lowest f-scores.
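A sketch of how such pairwise win percentages can be computed, assuming every model's f-scores have been collected per test setting (gold standard × chat log collection) in the same order for every model; the data layout is illustrative and not JavaSDM functionality:

    def pairwise_wins(scores):
        """scores: dict mapping model name -> list of f-scores, one per test
        setting, with the settings in the same order for every model.
        Returns the percentage of settings in which one model beat another
        (ties are not counted as wins here)."""
        table = {}
        for a, a_scores in scores.items():
            for b, b_scores in scores.items():
                if a == b:
                    continue
                wins = sum(x > y for x, y in zip(a_scores, b_scores))
                table[(a, b)] = 100.0 * wins / len(a_scores)
        return table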

Table 4.12: All tests (including both dictionary-based and adapted gold standard tests on Retail and Travel data in Swedish and English). Each row shows how often (per cent of the test settings) the model at the left was outperformed by the model in each column:

        LU0    LU5    LU7    LU85   LP0    LP5    LP7    LP85   GP0    GP5    GP7    GP85   GU0    GU5    GU7    GU85
LU0     -      87.5   75     12.5   75     75     50     12.5   25     25     12.5   0      0      0      0      0
LU5     12.5   -      50     0      62.5   62.5   50     25     12.5   25     12.5   0      0      0      0      0
LU7     25     50     -      12.5   75     75     50     12.5   25     25     0      0      12.5   12.5   12.5   0
LU85    87.5   100    87.5   -      100    100    100    62.5   75     75     62.5   0      87.5   87.5   75     0
LP0     25     37.5   25     0      -      100    62.5   12.5   0      0      0      0      0      0      0      0
LP5     25     37.5   25     0      0      -      62.5   12.5   0      0      0      0      0      0      0      0
LP7     50     50     50     0      37.5   37.5   -      12.5   25     25     0      0      12.5   25     0      0
LP85    87.5   75     87.5   37.5   87.5   87.5   87.5   -      37.5   50     37.5   0      62.5   62.5   62.5   0
GP0     75     87.5   75     25     100    100    75     62.5   -      100    0      0      0      0      0      0
GP5     75     75     75     25     100    100    75     50     0      -      0      0      50     50     25     0
GP7     87.5   87.5   100    37.5   100    100    100    62.5   100    100    -      0      75     75     62.5   0
GP85    100    100    100    100    100    100    100    100    100    100    100    -      100    100    100    50
GU0     100    100    87.5   12.5   100    100    87.5   37.5   100    50     25     0      -      100    12.5   0
GU5     100    100    87.5   12.5   100    100    75     37.5   100    50     25     0      0      -      12.5   0
GU7     100    100    87.5   25     100    100    100    37.5   100    75     37.5   0      87.5   87.5   -      0
GU85    100    100    100    100    100    100    100    100    100    100    100    50     100    100    100    -

Table 4.13: Swedish (including both dictionary-based and adapted gold standard tests on Swedish Retail and Travel). Each row shows how often (per cent of the test settings) the model at the left was outperformed by the model in each column:

        LU0    LU5    LU7    LU85   LP0    LP5    LP7    LP85   GP0    GP5    GP7    GP85   GU0    GU5    GU7    GU85
LU0     -      100    75     25     75     75     75     50     25     25     25     0      0      0      0      0
LU5     0      -      50     0      75     75     75     25     25     25     25     0      0      0      0      0
LU7     25     50     -      25     75     75     50     25     25     25     0      0      0      0      0      0
LU85    75     100    75     -      100    100    100    100    75     75     75     0      75     75     50     0
LP0     25     25     25     0      -      100    100    25     0      0      0      0      0      0      0      0
LP5     25     25     25     0      0      -      100    25     0      0      0      0      0      0      0      0
LP7     25     25     50     0      0      0      -      25     0      0      0      0      0      0      0      0
LP85    50     75     75     0      75     75     75     -      0      0      0      0      25     25     25     0
GP0     75     75     75     25     100    100    100    100    -      100    0      0      50     50     25     0
GP5     75     75     75     25     100    100    100    100    0      -      0      0      50     50     25     0
GP7     75     75     100    25     100    100    100    100    100    100    -      0      50     50     50     0
GP85    100    100    100    100    100    100    100    100    100    100    100    -      100    100    100    0
GU0     100    100    100    25     100    100    100    75     50     50     50     0      -      100    25     0
GU5     100    100    100    25     100    100    100    75     50     50     50     0      0      -      25     0
GU7     100    100    100    50     100    100    100    75     25     25     50     0      75     75     -      0
GU85    100    100    100    100    100    100    100    100    100    100    100    100    100    100    100    -

Table 4.14: English (including both dictionary-based and adapted gold standard tests on English Retail and Travel). Each row shows how often (per cent of the test settings) the model at the left was outperformed by the model in each column:

        LU0    LU5    LU7    LU85   LP0    LP5    LP7    LP85   GP0    GP5    GP7    GP85   GU0    GU5    GU7    GU85
LU0     -      100    50     0      75     75     25     0      25     25     0      0      0      0      0      0
LU5     0      -      50     0      50     50     25     25     0      25     0      0      0      0      0      0
LU7     50     50     -      0      75     75     50     0      25     25     0      0      25     25     25     0
LU85    100    100    100    -      100    100    100    25     75     75     50     0      100    100    100    0
LP0     25     50     25     0      -      100    25     0      0      0      0      0      0      0      0      0
LP5     25     50     25     0      0      -      25     0      0      0      0      0      0      0      0      0
LP7     75     75     50     0      75     75     -      0      50     50     0      0      25     50     0      0
LP85    100    75     100    75     100    100    100    -      75     100    75     0      100    100    100    0
GP0     75     100    75     25     100    100    50     25     -      100    0      0      50     50     25     0
GP5     75     75     75     25     100    100    50     0      0      -      0      0      50     50     25     0
GP7     100    100    100    50     100    100    100    25     100    100    -      0      100    100    75     0
GP85    100    100    100    100    100    100    100    100    100    100    100    -      100    100    100    100
GU0     100    100    75     0      100    100    75     0      50     50     0      0      -      100    0      0
GU5     100    100    75     0      100    100    50     0      50     50     0      0      0      -      0      0
GU7     100    100    75     0      100    100    100    0      75     75     25     0      100    100    -      0
GU85    100    100    100    100    100    100    100    100    100    100    100    0      100    100    100    -

Table 4.15: Retail (including both dictionary-based and adapted gold standard tests on Swedish and English Retail). Each row shows how often (per cent of the test settings) the model at the left was outperformed by the model in each column:

        LU0    LU5    LU7    LU85   LP0    LP5    LP7    LP85   GP0    GP5    GP7    GP85   GU0    GU5    GU7    GU85
LU0     -      75     100    0      50     50     50     0      0      0      0      0      0      0      0      0
LU5     25     -      75     0      50     50     50     25     0      0      0      0      0      0      0      0
LU7     0      25     -      25     50     50     25     0      0      0      0      0      0      0      0      0
LU85    100    100    75     -      100    100    100    75     50     50     25     0      100    100    75     0
LP0     50     50     50     0      -      100    75     0      0      0      0      0      0      0      0      0
LP5     50     50     50     0      0      -      75     0      0      0      0      0      0      0      0      0
LP7     50     50     75     0      25     25     -      0      0      0      0      0      0      0      0      0
LP85    100    75     100    25     100    100    100    -      25     50     50     0      75     75     75     0
GP0     100    100    100    50     100    100    100    75     -      100    0      0      100    100    50     0
GP5     100    100    100    50     100    100    100    50     0      -      0      0      100    100    50     0
GP7     100    100    100    75     100    100    100    50     100    100    -      0      100    100    100    0
GP85    100    100    100    100    100    100    100    100    100    100    100    -      100    100    100    50
GU0     100    100    100    0      100    100    100    25     0      0      0      0      -      100    0      0
GU5     100    100    100    0      100    100    100    25     0      0      0      0      0      -      0      0
GU7     100    100    100    25     100    100    100    25     50     50     0      0      100    100    -      0
GU85    100    100    100    100    100    100    100    100    100    100    100    100    100    100    100    -

Table 4.16: Travel (including both dictionary-based and adapted gold standard tests on Swedish and English Travel). Each row shows how often (per cent of the test settings) the model at the left was outperformed by the model in each column:

        LU0    LU5    LU7    LU85   LP0    LP5    LP7    LP85   GP0    GP5    GP7    GP85   GU0    GU5    GU7    GU85
LU0     -      100    50     25     100    100    50     25     50     50     25     0      0      0      0      0
LU5     0      -      25     0      75     75     50     25     25     50     25     0      0      0      0      0
LU7     50     75     -      0      100    100    75     25     50     50     0      0      0      0      0      0
LU85    75     100    100    -      100    100    100    50     100    100    100    0      75     75     75     0
LP0     0      25     0      0      -      100    50     25     0      0      0      0      0      0      0      0
LP5     0      25     0      0      0      -      50     25     0      0      0      0      0      0      0      0
LP7     50     50     25     0      50     50     -      25     50     50     0      0      25     50     0      0
LP85    75     75     75     50     75     75     75     -      50     50     25     0      50     50     50     0
GP0     50     75     50     0      100    100    50     50     -      100    0      0      0      0      0      0
GP5     50     50     50     0      100    100    50     50     0      -      0      0      0      0      0      0
GP7     75     75     100    0      100    100    100    75     100    100    -      0      0      0      0      0
GP85    100    100    100    100    100    100    100    100    100    100    100    -      100    100    100    50
GU0     100    100    100    25     100    100    75     50     100    100    100    0      -      100    25     0
GU5     100    100    100    25     100    100    50     50     100    100    100    0      0      -      25     0
GU7     100    100    100    25     100    100    100    50     100    100    100    0      75     75     -      0
GU85    100    100    100    100    100    100    100    100    100    100    100    50     100    100    100    -

5 Conclusions

The purpose of this study was to evaluate the effects of different preprocessing operations on chat log collections and of different extraction criteria when extracting synonyms with JavaSDM. In the experiments and in this chapter, lemmatization and POS-tagging are the main preprocessing operations and cosine similarity the main criterion that are discussed. Conclusions were drawn if there was a similar effect on at least two of the chat collections with either the language or the domain in common. What strengthens the conclusions drawn in this chapter is that the rankings of models and thresholds looked very similar no matter which gold standard or chat collection was used. The only obvious difference was that POS-tagging had a more positive effect on travel data. The standard deviations made all results overlap, since precision and recall often varied from 0.0 to 1.0 for single test words. Consequently these deviations are not very informative. Figure 5.1 is an example of why the standard deviations had wide ranges.

Figure 5.1: The example shows what percentage of the test words received synonym proposals with a specific f-score. The results in the diagram are from a model trained on the lemmatized chat collection of Swedish Retail, tested with 250 test words. 25 per cent of the test words got an f-score of 0.

Section 5.1 is a summary of the conclusions drawn from the results in this study. Section 5.2 discusses interesting observations from the study that were outside the scope of the thesis, alternative ways things could have been done, things that would have been tested or studied if there had been more time, and suggestions for future studies.

5.1 Overview of the results

The purpose of this study was to compare preprocessing operations and extraction criteria and thereby help users of JavaSDM for synonym extraction. The following is a list of the main conclusions that can be drawn from this study.
• POS-tagging clearly improves precision for all data collections except Swedish Retail.
• Lemmatized data give better precision, recall and f-score for all collections (this is discussed in Section 5.2).
• As expected, the best precision is achieved with higher thresholds; 0.85 was the highest in this study.
• The best recall is achieved with a threshold of 0.5 or lower. 0.5 is preferable since lower thresholds decrease precision and f-score.
• The best f-score for retail data is achieved without POS-tagging when misspellings are included in the gold standard.
• The best f-score for travel data is achieved with lemmatized and POS-tagged data.
• The best threshold regarding f-score was generally 0.5 for unlemmatized data. With lemmatized data the synonym candidates tend to get higher similarity scores, and therefore 0.7 is about even with 0.5. For models trained on both lemmatized and POS-tagged Swedish corpora, 0.7 gave a better f-score every time. With a larger corpus, words will probably get higher similarity scores, since words in lemmatized data have higher frequencies and the lemmatized models got higher cosine similarity scores than the graph word models.
• If misspellings are ignored, POS-tagging improves the f-score for all collections but English Retail.
• The fact that the adapted gold standards give three to four times better f-scores than the dictionary-based one indicates how useful synonym extraction can be for handling a domain specific vocabulary or complementing existing thesauri.

5.2 Discussion

As van der Plas and Tiedemann (2006) stated, a higher lemmatization and POS-tagging quality would improve the results; e.g. when a Swedish gold standard included "smart", the program extracted "smar", since that was how "smart" had been lemmatized. In such cases the random indexing is successful and the problem is the lemmatization quality, or in this example rather the lemmatizer's choice of lemma form for inflections of "smart". Unedited chat logs often lack

punctuation and capitalization of proper nouns and often contain misspellings and words from foreign languages. In specific domains there might be named entities that are rarely mentioned outside the domain. These are factors that might make things more difficult for lemmatizers and part-of-speech taggers.

Excluding from the gold standard those dictionary synonyms that do not appear in the training data makes the recall scores more reflective of JavaSDM's ability to extract synonyms. Still, it is not certain that the synonyms that occurred in the training data appeared in a sense that is synonymous to the test word. POS-tagging solves part of this problem, but good word-sense or morphological tagging would do it more accurately. Morphology is in theory useful for separating noun homonyms in Swedish. For example, gender separates the Swedish graph word "lag", which means "law" if it is utrum and "team" if it is neutrum, and species can decide whether the Swedish graph word "banan" means "banana" or "the track". Morphological information did not improve results in this study, but the training data was rather small and it could be worth trying morphological tagging on a larger corpus.
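One possible way to let morphological information separate such homonyms is to append the tags to the tokens before training, so that the two senses get separate distributional profiles. The tag format below is purely illustrative and is not the representation used by the taggers in this study:

    def tag_token(word, pos, morph=""):
        """Append POS and (optionally) morphological information to a token
        so that homonyms are kept apart during model training."""
        return word + "_" + pos + ("." + morph if morph else "")

    # "lag" as utrum ("law") versus neutrum ("team")
    print(tag_token("lag", "NN", "UTR"))  # lag_NN.UTR
    print(tag_token("lag", "NN", "NEU"))  # lag_NN.NEU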

Gold standard entries for the test word "byta":
Lemmatized Gold Standard: ersätta, skifta, ändra
Graph Word Gold Standard: ersätta, skifta, ändra, ersatt, ersatte, ersatts, ersätter, skiftar, ändrade, ändrat

When using the graph word gold standard, all inflections of a synonym that have occurred in the training data are seen as correct synonyms, e.g. "ersatt", "ersatte", "ersätta" and "ersatts" count as four correct synonyms for "byta" (see Section 3.4.2). Is it better if, for example, a graph word based model's proposals include those four correct candidates than if a lemmatized model's proposals only include two correct candidates, "ersätta" and "skifta"? Precision-wise the answer is yes according to the evaluation method, if one compares 4/15 to 2/10. Recall-wise the answer is no in this example, since the graph word gold standard includes more inflections of the other words as well, and the graph word model thereby gets a recall of 4/10 while the lemma model would get 2/3. The number of found words is counted and compared to get indications of the effects of this risk. If a lemma version makes more correct extractions this issue is no problem, but if a graph word version makes more correct extractions and still has lower recall, the issue could be there. For a clearer recall comparison between lemma and graph word based models, one could have counted extractions of words with the same lemma form as one suggestion, lemmatized the extractions and used the lemmatized gold standard for evaluation.
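A sketch of that normalization, with a small lookup table standing in for a real lemmatizer (the lookup and the function names are only for illustration):

    # A real lemmatizer would replace this toy lookup table.
    LEMMA = {"ersatt": "ersätta", "ersatte": "ersätta", "ersatts": "ersätta"}

    def lemmatize(word):
        return LEMMA.get(word, word)

    def lemma_recall(extractions, lemma_gold):
        """Collapse extractions that share a lemma into one proposal and
        score them against the lemmatized gold standard."""
        lemmas = {lemmatize(word) for word in extractions}
        return len(lemmas & set(lemma_gold)) / len(lemma_gold)

    # The four inflections count as one hit out of three gold standard lemmas.
    print(lemma_recall(["ersatt", "ersatte", "ersätta", "ersatts"],
                       ["ersätta", "skifta", "ändra"]))  # 0.333...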


Such a normalization would, for example, have meant a recall of 1/3 instead of 4/10 for a graph word based model that extracted "ersatt", "ersatte", "ersätta" and "ersatts" for "byta", using the gold standard above. The extractions and transformed extractions for "byta" are shown below:

Extractions: ersatt, ersätta, ersatte, ersatts, ringa, hitta, välja, använda, montera, ställa, skriva, se, hämtar, paxa, läsa
Transformed Extractions: ersätta, ringa, hitta, välja, använda, montera, ställa, skriva, se, hämtar, paxa, läsa

How to count precision, and thereby f-score, when using such a method is not as self-evident. One could choose between the following:
• The precision would be 1/15 if a lemma form is only counted as correct once.
• The precision would be 1/11 if all extractions with an identical lemma form are counted as only one extraction.
• The precision would be 4/15 if each inflection is counted individually.
If "ersätta" had been an incorrect extraction, would the four inflections be counted as one or four incorrect extractions? If the user of JavaSDM does not need extractions in more than one form and wants fewer extractions to read, it would be better to lemmatize the extractions and only show a recurring lemma once.

In this study, models were trained either on graph words or on lemmas, but one could try to combine graph words and lemmas when training models. For example, each focus word could be kept as a graph word while the surrounding words are lemmatized, or the other way around: "the dogs waved their tails" could become "the dog waved their tail" or "the dogs wave their tails".

Preliminary tests indicated that it is better to keep stop words during training and instead use the list of stop words to prevent those words from being proposed as synonyms. An explanation could be that stop words often carry syntactic information and are, for example, used by POS-taggers. The syntactic information from stop words is probably not as relevant if the data is POS-tagged.

Antonyms, which prior studies have mentioned as a problem, might be used ironically. With smileys, irony is sometimes revealed more clearly in chat logs. Smileys can carry a lot of information and are therefore something that should be explored further for use in dialogue systems. Sometimes smileys are used

instead of words to communicate positive or negative energy. Like all other tokens without alphabetic characters, smileys could not be extracted in this study.

When training models with JavaSDM, one can set different parameters, such as the type of weighting scheme and the size of the context windows. Default parameters were used in this study since no report was found stating when to use which combination of parameters. Different settings might be better for graph word data but not for lemmatized data, and vice versa. The corpus sizes might also have affected the conclusions drawn in this study. There are more things left to try regarding model training on POS-tagged data. There is a weighting scheme that during training can give extra weight to content words, for example words tagged as nouns, adjectives and verbs. It is not certain that it would improve results, since the pre-tests indicated that it can be a good choice to keep stop words during training, but it should definitely be tested.

Misspelled tokens are intended to have the same meaning as the test words and should have distributions suitable for the test word. Graph word data is better for this, since lemmatizers and part-of-speech taggers are not trained to handle unknown spellings. Even if misspellings should be seen as correct synonym suggestions and actually might be very relevant, there are other ways to find misspellings. Still, distributional similarity can be used to increase the probability that a detected or suspected misspelling really is a misspelling. JavaSDM found several misspellings with a frequency of 1, which in a way is not as relevant as frequent misspellings. Frequency counting was done and showed that some misspellings were more frequent than the "correct" spelling, and those misspellings are very interesting. It is likely that very frequent misspellings are not misspellings in the users' heads.

Compound words are not handled automatically by JavaSDM. Spaces mark the start and end of a word, which means that "answering machine", for example, is read as "answering" and "machine", and "washing machine" as "washing" and "machine". Such an example increases the risk of having "washing" suggested as a synonym for "answering". That problem could be avoided if such phrases were somehow recognized as one word when training JavaSDM models. This is a problem because compound words do not get as much data as they should, and spurious distributional relations may be calculated, e.g. if "asbra" ("cadaver" + "good", meaning "extremely good") is written "as bra", then "asbra" has one occurrence less, and "as" ("cadaver") also becomes associated with a positive word such as "good", and vice versa. This is extra difficult for verb compounds that can have other words in between, for example "Switch off the light!" and "Switch the light off!". I wrote a simple program that extracted nouns followed by nouns and then checked whether these coordinate nouns also occurred compounded as one word, with or without a hyphen. It then compared how many times each compounding method was used; quite often the coordinate nouns had occurred in all three versions, and even more often the compounding method was one that would be considered a misspelling in formal Swedish. This should be examined in further studies, because it is an obstruction for synonym extraction and a well-known problem in linguistics; incorrect disjunctions are the most common problem in Swedish (Melin (2006)).
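A sketch of the kind of check described above, assuming a POS-tagged token list from the corpus; the tag name "NN" and the assumption that hyphenated compounds are kept as single tokens are illustrative:

    from collections import Counter

    def compound_variants(tagged_tokens):
        """For every adjacent noun pair, count how often the pair also occurs
        written as one token, with or without a hyphen.
        tagged_tokens: list of (token, pos) pairs from the training data."""
        tokens = [word.lower() for word, _ in tagged_tokens]
        freq = Counter(tokens)
        pair_freq = Counter(
            (w1.lower(), w2.lower())
            for (w1, t1), (w2, t2) in zip(tagged_tokens, tagged_tokens[1:])
            if t1 == "NN" and t2 == "NN"
        )
        report = {}
        for (w1, w2), separate in pair_freq.items():
            report[(w1, w2)] = {
                "separate": separate,               # written as two tokens
                "compound": freq[w1 + w2],          # written as one token
                "hyphenated": freq[w1 + "-" + w2],  # written with a hyphen
            }
        return report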

5.2.1 Future study suggestions

The following is a summary list of the most relevant future study suggestions mentioned in the discussion:
• Lemmatize graph word extractions for a better comparison.
• Join compound words as one token. Try to handle incorrect spellings of Swedish compound words.
• Examine the effects of JavaSDM parameters, for example different context window sizes.
• Try weighting words in the context windows differently depending on their POS-tag.


Bibliography

V. D. Blondel and P. Senellart. Automatic extraction of synonyms in a dictionary. Technical report, University of Louvain, 2002.

D. Alan Cruse. Notes on meaning in language. Master's thesis, Mitthögskolan, 2002.

James R. Curran and Marc Moens. Improvements in automatic thesaurus extraction. In Proceedings of the ACL-02 Workshop on Unsupervised Lexical Acquisition - Volume 9, ULA '02, pages 59–66, Stroudsburg, PA, USA, 2002. Association for Computational Linguistics. doi: 10.3115/1118627.1118635. URL http://dx.doi.org/10.3115/1118627.1118635.

Philip Edmonds and Graeme Hirst. Near-synonymy and lexical choice. Computational Linguistics, 28(2), 2002.

Olivier Ferret. Testing semantic similarity measures for extracting synonyms from a corpus. In Proceedings of LREC, 2010.

James Gorman and James R. Curran. Random indexing using statistical weight functions. In Proceedings of the 2006 Conference on Empirical Methods in Natural Language Processing, EMNLP '06, pages 457–464, Stroudsburg, PA, USA, 2006. Association for Computational Linguistics. ISBN 1-932432-73-6. URL http://dl.acm.org/citation.cfm?id=1610075.1610139.

Zellig S. Harris. Distributional structure. Word, 10:146–162, 1954.

Martin Hassel. Resource Lean and Portable Automatic Text Summarization. PhD thesis, KTH, 2007.

Louise Holmer. Passiv och perfekt particip i SAOL Plus: en dokumentation av den lexikografiska arbetsprocessen. Master's thesis, University of Gothenburg, 2009.

Daniel Jurafsky and James H. Martin. Speech and Language Processing: An Introduction to Natural Language Processing, Computational Linguistics, and Speech Recognition. Pearson Education International, 2009.

Lars Melin. Experimentell språkvård. Språkvård 4, 2006.

Philippe Muller, Nabil Hathout, and Bruno Gaume. Synonym extraction using a semantic distance on a dictionary. In Proceedings of the First Workshop on Graph Based Methods for Natural Language Processing, TextGraphs-1, pages 65–72, Stroudsburg, PA, USA, 2006. Association for Computational Linguistics. URL http://dl.acm.org/citation.cfm?id=1654758.1654773.

M. Ortuño, P. Carpena, P. Bernaola-Galván, E. Muñoz, and A. M. Somoza. Keyword detection in natural languages and DNA. EPL (Europhysics Letters), 57(5):759–764, 2002.

Lonneke van der Plas and Gosse Bouma. Syntactic contexts for finding semantically related words. In CLIN, pages 173–186, 2004.

M. Rosell, M. Hassel, and V. Kann. Global evaluation of random indexing through Swedish word clustering compared to the People's Dictionary of Synonyms. 2009.

Magnus Sahlgren. An introduction to random indexing. SICS, Swedish Institute of Computer Science, 2005.

Carlo Strapparava and Rada Mihalcea. SemEval-2007 task 14: Affective text. In Proceedings of the 4th International Workshop on Semantic Evaluations (SemEval-2007), pages 70–74, 2007.

Lonneke van der Plas and Jörg Tiedemann. Finding synonyms using automatic word alignment and measures of distributional similarity. In Proceedings of the COLING/ACL 2006 Main Conference Poster Sessions, COLING-ACL '06, pages 866–873, Stroudsburg, PA, USA, 2006. Association for Computational Linguistics. URL http://dl.acm.org/citation.cfm?id=1273073.1273184.

Tong Wang. Extracting synonyms from dictionary definitions. Master's thesis, University of Toronto, 2009.

Tong Wang and Graeme Hirst. Exploring patterns in dictionary definitions for synonym extraction. Natural Language Engineering, 18:313–342, 2012.

Hua Wu and Ming Zhou. Optimizing synonym extraction using monolingual and bilingual resources. In Proceedings of the Second International Workshop on Paraphrasing - Volume 16, PARAPHRASE '03, pages 72–79, Stroudsburg, PA, USA, 2003. Association for Computational Linguistics. doi: 10.3115/1118984.1118994. URL http://dx.doi.org/10.3115/1118984.1118994.

