Linguistic and Pragmatic Patterns in Understatement Tweets Gwen Wijman ANR: 620029
Bachelor Thesis Communication and Information Sciences Specialization Human Aspects of Information Technology
Faculty of Humanities Tilburg University, Tilburg
Supervisor: dr. G.A. Chrupala Second Reader: dr. ir. P.H.M. Spronck
January 2015
Acknowledgement
This thesis is the final product of my Bachelor's program Communication and Information Sciences at Tilburg University. I have learned a great deal over the last three and a half years, and even though I could not use everything I have learned in this particular thesis project, I am happy that I took on this challenge.
First and foremost I want to thank my thesis supervisor, Grzegorz Chrupala, for his support and advice. Thanks for being very patient and answering all my (sometimes foolish) questions with care.
Thanks to my second reader, Pieter Spronck, for assessing my thesis in its final stages.
Thanks to Bram Jansen for your help and words of encouragement. You were the light when all I could see was darkness. I literally would not have completed my thesis without your support.
Last but not least, I want to thank everyone who ever helped me to complete the journey of my Bachelor's program. I hope to have you along for another year as I will take on the challenge of obtaining a Master's degree in Tilburg.
Abstract In this study, I investigate whether linguistic and pragmatic features that have previously been found in sarcastic tweets are present in understatement tweets. Two binary logistic regression models were used to find character and word n-grams that predict whether a tweet is an understatement or not. Patterns that had been found in sarcastic tweets also appeared in the understatement tweets; these features included laughter expressions, emoticons, repeated punctuation, question marks, exclamations, intensifiers, emotion words and explicit markers for sarcasm. The results were not completely in line with research previously conducted by De Freitas et al. (2014), Liebrecht et al. (2013) and González-Ibáñez et al. (2011), because the features present in my dataset differed in polarity and strength. Approximately half of the features found were positive predictors for understatements and the other half were negative predictors. Not all features relating to sarcastic tweets were strong predictors for understatements or non-understatements.
Table of contents
1. Introduction
1.1 Problem statement
1.2 Research question
2. Theoretical Background
2.1 Sarcasm and irony
2.2 Understatements
2.3 Patterns in sarcastic tweets
3. Methods
3.1 Data collection
3.2 Software
3.3 Analysis
3.3.1 N-grams
3.3.2 Logistic regression
4. Results
4.1 The classification models
4.2 Character n-grams
4.3 Word n-grams
4.4 New data
5. Discussion
5.1 Confusion matrices
5.2 Error analysis
5.3 Research question
6. Conclusion
6.1 Conclusion
6.2 Implications and future work
References
1. Introduction 1.1 Problem statement Understatements are used in many different settings, including movies, books, social media and real-life conversation. One example of understatement use in movies is from Monty Python and the Holy Grail (Gilliam & Jones, 1975): both arms of the Black Knight are cut off, but the knight refers to it as 'just a flesh wound'. Another example is from the social media network Twitter: googlemaps (2014): "Today's @letour stage is a bit on the twisty side. #understatement http://goo.gl/8l9nMe". A final example of an understatement in real life would be saying in a conversation that 'it is raining a bit' while it is in fact pouring. Because companies are interested in the valuable information that utterances on social media contain, Justo, Corcoran, Lukin, Walker, & Torres (2014) note that they could be interested in opinion mining and sentiment analysis on social media platforms. They also state that language on social media is challenging to examine through Natural Language Processing (NLP) because of its informal and unstructured nature. This makes language on social media different from language written in books and language used in face-to-face conversations. Research has been conducted on finding patterns and features of sarcastic or ironic tweets, but no such research has been done concerning one specific type of irony: understatements.
1.2 Research Question Because a lot of research has already been conducted for finding linguistic and pragmatic patterns in ironic and sarcastic tweets, I am interested in a smaller part of the whole. I want to know whether patterns that already have been found for sarcastic and ironic tweets can be found in tweets containing understatements as well.
This leads to the following research question: Research question (RQ): Are there linguistic and/or pragmatic patterns to be found in tweets containing understatements? The results of this study indicate that patterns relating to linguistic and pragmatic features from sarcasm research can be found in understatement tweets. However, these patterns are not all positive predictors for understatements: approximately half of the features were negative predictors, meaning that they rather predict that an utterance is not an understatement. The features previously found in sarcasm research are not as strong in understatement tweets as they were in sarcastic tweets. This thesis is organized as follows: chapter two focuses on the theoretical background that underlies this research; chapter three discusses the methods used for our experiments; chapter four presents the results of these experiments; chapter five gives a detailed error analysis and discusses the results found; finally, chapter six concludes this thesis and discusses limitations and implications for future work.
2. Theoretical Background In this chapter we focus on the theoretical background that forms the foundation for this thesis. We start by explaining sarcasm and irony in section 2.1. Section 2.2 focuses on understatements, and the concluding section 2.3 gives an overview of features found in previous research on sarcastic tweets. 2.1 Sarcasm and irony Researchers do not agree about the use of the terms "irony" and "sarcasm". The terms are sometimes used interchangeably in the research literature (Kreuz & Caucci, 2007; Liebrecht, Kunneman & Van Den Bosch, 2013), while other researchers choose to make a distinction between (verbal) irony and sarcasm, for example Caucci and Kreuz (2012) and Colston and O'Brien (2000). The Oxford English Dictionary defines irony as "the expression of meaning through the use of words which normally mean the opposite in order to be humorous or to emphasize a point" (Soanes et al., 2006, p. 384), while sarcasm is characterized as "a way of using words which say the opposite of what you mean, in order to upset or mock someone" (Soanes et al., 2006, p. 642). These definitions look very similar, but according to Nunberg (as cited in Caucci & Kreuz, 2012) sarcasm is "just one of many types of verbal irony" (p. 1). Sarcasm and irony are treated as equivalent in this research. We therefore use 'sarcasm' where irony might be meant, keeping in mind that sarcasm is a type of verbal irony. In a conversation, speakers normally comply with certain rules or "maxims" (Grice, as cited in Colston, 1997). When a speaker violates a maxim, the speaker does so intentionally; this is called a conversational implicature. According to Grice (as cited in Colston, 1997), the maxim of quality is violated when using sarcasm. When a sarcastic utterance is taken literally, it has a different meaning than when the receiver understands the speaker's intent or the implicature of the message.
An example is a child who tells his mother that the meal she prepared is really fantastic. He, however, refuses to eat any of it. The mother might think
that there is something wrong with his appetite if she takes his statement literally. The child, however, does not like the food at all, and uses a sarcastic utterance to implicate this. According to Maynard and Greenwood (2014) sarcasm often occurs in user-generated content and is challenging to investigate. Background knowledge such as context and culture is needed to identify whether an utterance is sarcastic or not. Because it is difficult for computers to interpret this context and culture, it is challenging to detect sarcasm through machines. In spoken language, sarcasm is identifiable through non-verbal cues such as facial expressions and voice inflections (Lee & Narayanan, 2005). In written text, Lee and Narayanan (2005) claim, there are no standard cues that signal the writer's sarcastic intent. However, Liebrecht et al. (2013) show that hashtags are used as extra-linguistic elements. The hashtag #sarcasm, for example, can be seen as "the social media equivalent" (p. 35) of the non-verbal cues that people transmit when using sarcasm in face-to-face communication. In their research, González-Ibáñez, Muresan, & Wacholder (2011) used sarcastic utterances which were explicitly identified as sarcastic by the writer. They assume that the best judge is the author of the tweet, because other human judges do not have enough context to successfully judge whether a tweet is sarcastic or not. Other research has also focused on author identification (the usage of a hashtag) of sarcastic tweets (Maynard & Greenwood, 2014; Liebrecht, Kunneman & Van Den Bosch, 2013). The use of hashtags has to be handled with care: hashtags might not always be used reliably, because users assign them to their own writings, and a wrong understanding of the concept of sarcasm might lead to incorrect sarcasm assignment. Justo et al.
(2014) note that not every sarcastic utterance is tweeted with the hashtag #sarcasm, and that the hashtag may only be used for the most obvious forms of sarcasm.
2.2 Understatements An understatement is an expression which is intentionally expressed more weakly than the situation warrants. According to the Oxford English Dictionary, an understatement (noun) or to understate (verb) is to "describe or represent something as being smaller or less important than it really is" (Soanes, Hawker, & Elliott, 2006, p. 800). An understatement can be seen as a form of verbal irony, like sarcasm. As in irony, the maxim of quality is broken when using understatements (Grice, in Colston, 1997). A soccer supporter (A) could be watching a game in which their favorite team is losing because they are not playing well. That supporter might say to a fellow supporter (B) sitting next to him that 'they are messing things up a bit'. Supporter A intentionally weakens his utterance towards supporter B for ironic purposes; supporter A knows that supporter B sees that their favorite team is messing things up very badly. In addition to the maxim of quality, the maxim of quantity is broken in understatements. The maxim of quantity entails that one should give only enough information to allow the receiver to understand the message (Martin, 1991). The implicature of an understatement is therefore the transmission of a message without literally saying it: supporter A says that his team is 'messing things up a bit' while in fact his team is playing very badly.
2.3 Patterns in sarcastic tweets Different patterns can be distinguished in sarcastic utterances on Twitter. De Freitas, Vanin, Hogetop, Bochernitsan & Vieira (2014) identified several linguistic patterns for irony in Portuguese tweets related to the topic 'Fim do mundo', or 'the end of the world' in English. Patterns they found include laughter expressions (for example 'hahaha' in English), emoticons, hashtags (#ironia, #joking, #kidding) and specific expressions such as "só que", which functions like "NOT" in English. They also looked for the use of repeated
punctuation (???, !!!!) and quotation marks. The patterns that occurred most in their corpus were laughter expressions, emoticons, the use of repeated punctuation and quotation marks. Liebrecht et al. (2013) collected a corpus of Dutch tweets labeled with the hashtag #sarcasm (#sarcasme in Dutch). They trained a classifier to predict sarcasm in tweets with the hashtag #sarcasm removed. Their analysis using a Balanced Winnow classifier shows that sarcastic tweets often contain hyperboles (in 60% of the tweets), which are formed with intensifiers (positive words such as 'awesome', 'lovely', 'fantastic') and positive exclamations ('wow', 'yay', 'yes'). The sarcastic tweets that did not use a hyperbole (34%) often contained an explicit marker: words that are possible synonyms for #sarcasm, such as #LOL, #joke and #NOT (the hashtag itself is not required). González-Ibáñez et al. (2011) collected sarcastic, positive and negative tweets based on the hashtags that users had assigned to their tweets. They described features which could occur in sarcastic, positive and negative tweets and investigated whether these features could be used to identify sarcastic, positive or negative tweets with the help of a χ2 analysis. They made a clear distinction between lexical and pragmatic factors. Lexical factors include emotion words (positive and negative), negations (unbreakable, misunderstood) and punctuation (exclamation marks, question marks). Emoticons and mentions (@google) were identified as pragmatic factors. González-Ibáñez et al. (2011) found that positive emotion words (for example 'happy', 'excited'), negations and mentions are important patterns for sarcasm detection. In this research we chose not to consider mentions a marker for sarcastic tweets, because mentions are very common in overall Twitter usage.
In Table 1, a summary is given of the linguistic and pragmatic patterns found in sarcastic and ironic tweets in the studies conducted by De Freitas et al. (2014), Liebrecht et al. (2013) and González-Ibáñez et al. (2011). The examples used are real tweets collected through the Twitter API.
Table 1. Linguistic and pragmatic patterns in sarcastic tweets.

Pattern | Example
Laughter expressions | Haha the joys of coming home from work at 10.30 and having to study. I have such an amazing life. #Sarcasm.
Emoticons | Mid life crisis is what makes some men popular on twitter :) #FACT #outrage #sarcasm
Repeated punctuation | NO!! I'm soooo surprised!! #Sarcasm
Question mark | This is what the 10 Year Treasury has done so far this morning. Totally rational, right? #sarcasm
Intensifiers | Well, this is an awesome day. #sarcasm
Exclamations | off to work…..YAY!!!! #sarcasm – Thankfully I only have to do my old job for 2 nights
Explicit markers | This has been de best day ever #not #lies #sarcasm
Emotion words | I start my day reading class tomorrow and I am SO excited. #sarcasm
3. Methods In this chapter the methodologies that underlie the experiments are discussed. In Section 3.1 information about data collection and the dataset is given. Section 3.2 discusses the software used for the experiments and in 3.3 the analysis of our data is explained.
3.1 Data collection The data used for this research consist of English tweets containing the hashtag #understatement. Tweets were collected using a Python script, which was run twice daily to collect as many tweets as possible. In the period between September 28th and October 19th, 2,958 tweets were gathered. The collected tweets then had to be manually filtered for retweets, i.e., tweets that have been reposted by a user other than the author. The tweets were then filtered for true understatements. In this research, an understatement is considered a true understatement when the utterance is intentionally weakened for sarcastic purposes. 656 tweets were used for the final analysis. All #understatement hashtags were removed before analyzing the data. To compare the understatement tweets to non-understatement tweets, tweets from the same users as in the understatement dataset were gathered. A Python script was used to gather the tweets in the timeline of every user, and for each user one non-understatement tweet was collected. Because it was not possible to retrieve tweets for every user in the understatement dataset (some users protected their accounts, making their tweets available only to followers, and others had deleted their profiles), 646 non-understatement tweets were collected. The total dataset of understatements and non-understatements was divided into three parts: one part for training the classifier and two parts for testing. The training set consists of 60 percent of the tweets (782 tweets), the validation set of 20% (260 tweets) and the final test set of the remaining 20% (260 tweets).
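The thesis does not give its splitting code; the 60/20/20 split described above can be sketched with scikit-learn's train_test_split. The placeholder tweet strings are invented for illustration, and rounding makes the resulting set sizes differ from the reported ones by a tweet.

```python
# Sketch of the 60/20/20 train/validation/test split, assuming the
# 656 understatements + 646 non-understatements reported above.
from sklearn.model_selection import train_test_split

tweets = [f"tweet {i}" for i in range(1302)]   # placeholder texts
labels = [1] * 656 + [0] * 646                 # 1 = understatement

# First split off 40% of the data, then halve it into validation and test.
X_train, X_rest, y_train, y_rest = train_test_split(
    tweets, labels, test_size=0.4, random_state=0)
X_val, X_test, y_val, y_test = train_test_split(
    X_rest, y_rest, test_size=0.5, random_state=0)
```

Fixing random_state makes the split reproducible, so the same tweets land in the same partition on every run.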
3.2 Software The software used in this research includes Python(x,y) and the Python GUI and editor IDLE. For collecting tweets, the Twitter API was used in combination with a Python script. For the analysis of the data, the scikit-learn Python module was used.
3.3 Analysis Using Python and the scikit-learn module, binary logistic regression was used to predict the dependent variable on the basis of the independent variables. The dependent variable in this research is whether a tweet is an understatement or not; the independent variables are n-grams extracted from the tweets. We compare the extracted n-grams to the features listed in Table 1 (in section 2.3).
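This setup (n-gram features feeding a binary logistic regression) can be sketched as follows. The toy tweets and labels are invented for illustration; the thesis's actual feature settings are discussed in chapter 4.

```python
# Minimal sketch of the analysis pipeline: word n-gram counts as
# independent variables, understatement-or-not as dependent variable.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

tweets = ["a bit on the twisty side", "great game today",
          "just a little wet outside", "watching the match now"]
labels = [1, 0, 1, 0]  # 1 = understatement, 0 = non-understatement

model = make_pipeline(
    CountVectorizer(analyzer="word", ngram_range=(1, 2)),
    LogisticRegression())
model.fit(tweets, labels)
```

The pipeline object bundles vectorization and classification, so new raw tweets can be passed straight to model.predict.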
3.3.1 N-grams N-grams were used to find out whether sequences of characters and words were predictors for tweets containing understatements. We experimented with which range of n-grams gave the best accuracy score, and the best model was used.
3.3.2 Logistic regression Logistic regression is a linear model for classification. It estimates the probability that an event will occur: when an event does not occur it is labeled 0, and when it does occur it is labeled 1.
With use of binary logistic regression we want to predict whether a tweet contains an understatement or not. In this research, tweets with no understatement belong to category 0 and tweets with an understatement belong to category 1. The model predicts whether a tweet belongs to category 0 or 1 on the basis of other information; in this research, that information consists of the character or word n-grams extracted from the tweets in our dataset. The model built using the n-grams and the logistic regression helps us to predict whether new tweets are understatements or non-understatements (Field, 2009). Because there are several predictors in our logistic regression, the equation is as follows:

P(y = 1 | X) = 1 / (1 + exp(-(β0 + Σn βn Xn)))
With β0 as the bias, Xn as the predictor variables and βn as the weights or coefficients of the corresponding predictor variables. Our model has a value which regulates the strength of the regularization of our analysis, the C-value. Regularization is the process of penalizing extreme parameter values; the smaller the C-value, the stronger the regularization in our models. Trying different values for the regularization variable is helpful, because the model could become more accurate for a certain value of C. When the best value of C is found, the model neither overfits nor underfits the training set.
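The C-value search described above can be sketched as a simple loop over candidate values, scoring each fitted model on a held-out validation set. The synthetic data and the candidate list are illustrative only.

```python
# Sketch: tuning the regularization strength C on a validation set.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=200, n_features=10, random_state=0)
X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.25, random_state=0)

scores = {}
for C in [0.01, 0.1, 0.5, 1.0, 1.1, 2.0, 10.0, 100.0]:
    # L2 penalty is the scikit-learn default; smaller C = stronger penalty.
    clf = LogisticRegression(C=C, penalty="l2", max_iter=1000).fit(X_tr, y_tr)
    scores[C] = clf.score(X_val, y_val)

best_C = max(scores, key=scores.get)  # C with the highest validation accuracy
```

Keeping the test set out of this loop means the final accuracy estimate is not biased by the C selection.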
4. Results This chapter presents the results of our experiments. First, the experiments concerning the character n-gram and word n-gram classification models are discussed in section 4.1. Subsequently, the results relating to character n-grams and word n-grams are given in sections 4.2 and 4.3. Finally, we present the results of the classification models when tested with new data in section 4.4.
4.1 The classification models The experiments for this study commenced with finding the right classification models for analyzing the data. In order to find the best model, I experimented with different ranges for character and word n-grams and chose the range with the highest accuracy score. I then tested whether different C-values had an effect on the accuracy scores for these n-gram ranges. The combination of n-gram range and C-value with the highest accuracy score was used as our classification model for further experiments; when accuracy scores for different ranges and C-values were identical, the simplest model was used. This section gives an overview of the results of the experiments conducted to find the best classification models. We start by presenting the results of the character n-gram classification models. Figure 1 shows the different character n-gram ranges that were tested and their corresponding accuracy scores. The different lines correspond to the minimum of the n-gram range, and the numbers on the x-axis correspond to the maximum of the n-gram range. The lowest accuracy score found was 0.569, for n-gram range [6,10]; the highest accuracy score was 0.673, for n-gram ranges [2,4] and [2,5]. Because the n-gram ranges [2,4] and [2,5] had the same accuracy score, I
decided to test both ranges with different C-values. In this model, the C-value is a variable that regulates the regularization strength: a smaller value of C imposes a stronger penalty on the coefficients. In this research the default L2 regularization is used, and the default value of C is 1. Figure 2 gives an overview of the different C-values used with the n-gram ranges [2,4] and [2,5] and their corresponding accuracy scores.
Figure 1. Character n-gram ranges and corresponding accuracy scores, C=1.
The lowest accuracy score found for the different C-values in n-gram range [2,4] was 0.669, for C=4, C=5, C=10 and C=100. The highest accuracy score was 0.677, corresponding to C=1.1. The lowest accuracy score found for the different C-values in n-gram range [2,5] was 0.661, for C=10 and C=100. The highest accuracy score was 0.673, corresponding to C=1 and C=1.2. The best character model, with which further experiments were carried out, is thus the model with a character n-gram range of [2,4] and a C-value of 1.1. The corresponding accuracy score of this model is 0.677.
Figure 2. C-values and accuracy scores for character n-gram range [2,4] and [2,5].
I now continue with the results of the word n-gram classification model. Figure 3 gives an overview of the different word n-gram ranges tested and the corresponding accuracy scores. The lowest accuracy score found was 0.553, for n-gram ranges [2,5] and [2,7]. The highest accuracy score of 0.631 corresponds to the n-gram ranges [1,5] and [1,6]. Because two ranges had an equally high accuracy score, further experiments were conducted on both ranges. Figure 4 shows the different C-values experimented with for word n-gram ranges [1,5] and [1,6] and their corresponding accuracy scores. For [1,5] the lowest accuracy score found was 0.619, for C=0.1 and C=100; the highest accuracy score was 0.631, for C=1. For n-gram range [1,6] the lowest accuracy score found was 0.612, for C=100; the highest accuracy score was 0.631, for C=0.5, C=1 and C=2.
Figure 3. Word n-gram ranges and corresponding accuracy scores, C=1.
Figure 4. C-values and accuracy scores for word n-gram range [1,5] and [1,6].
Figure 4 shows that four models achieved the highest accuracy score (0.631) and therefore performed best. The best word model, used for further experiments, was the simplest of these models. Thus, the model with a word n-gram range of [1,5] and a C-value of 1 was chosen to continue the experiments with.
4.2 Character n-grams In the following sections I discuss the n-grams that were found in our analysis. I look at the best predictors for understatement tweets and also try to detect the features previously found in sarcasm research (Table 1, section 2.3) in our dataset. This section (section 4.2) covers the results concerning the character n-gram model; section 4.3 focuses on the results found with the word n-gram model. As discussed in section 4.1, the best character and word n-gram models were used for this analysis. The coefficients given are relative values: they are meaningful in relation to the model and to each other, but not in comparison to other models. Figure 5 gives an overview of the 20 character n-grams that are the best predictors for tweets containing understatements and non-understatement tweets. It is noticeable that the first and second strongest predictors for understatements are 0.764 apart. Also remarkable is that the strongest predictor, with a value of 1.303, is a dot followed by a blank space ('._'). This punctuation pattern is used extensively in language and is therefore a questionable predictor for understatements. The combination of a dot and a blank space appeared 158 times in the validation set, mostly when tweets were composed of multiple sentences separated by a dot and a blank space. Other occurrences of this character n-gram are due to the manual labeling of the tweets: a label was added at the end of every tweet, but in some tweets a blank space was present between the last character of the tweet, in this case a dot, and the label. The predictor 'ha' could be an expression of laughter, but this is also questionable, since most laughter expressions are repetitions of 'ha'.
In the following tweet from the validation set, a repetition of 'ha' is used as a laughter expression: "Haha the guys at work just are calling me a shopaholic how rude !!". In the validation set the character combination 'ha' occurred 147 times. In none of these occurrences was 'ha' a laughter
expression. It was rather part of a word, for example 'thanks' and 'happy', as shown in the following two tweets from the validation set: "@LobsterLiveMixx Thanks for following! via http://t.co/ctftxCN6yq" and "@brihtarchik happy birthday". Other positive character n-grams found in this experiment show no clear relation to the features previously found in sarcastic tweets. As with the majority of the positive understatement predictors, no clear relation between most character n-grams and the features found in previous sarcasm research can be distinguished. We did, however, find a 2-gram consisting of two exclamation marks ('!!'), which is related to the feature 'repeated punctuation' in sarcasm research. The repetition of punctuation is supposed to be a predictor for sarcastic tweets, but these results suggest that '!!' is a negative predictor for understatement tweets.
Figure 5. Top 20 character n-grams of best character model. Note. _ = blank space.
Figure 6 shows the features previously found in sarcasm research and the relating character n-grams extracted from our data. With our character n-gram model we found four laughter expressions, five emoticons, five different cases of repeated punctuation, one question mark, four exclamations and five explicit markers.
Figure 6. Features and corresponding character n-grams of best character model. Note. _ = blank space.
The laughter expressions found include 'haha' with a coefficient of -0.122, 'ha' with a value of -0.225, 'lol' with a coefficient of 0.013 and 'lol_' with a score of 0.032. It is noticeable that the laughter expressions 'haha' and 'ha' are negative predictors for understatement tweets. This means that they do not predict an understatement tweet but rather a non-understatement tweet. 'lol' and 'lol_' are positive predictors for understatement tweets, but their coefficient values are rather low.
Emoticons found in our character n-grams are ':)' with a coefficient of -0.010, ':)_' with a score of 0.014, '=)' with a coefficient of 0.005, ':-D' with a coefficient of -0.009 and ':o)' with a value of 0.015. Two of the emoticons were negative predictors and three were positive predictors. All of their coefficient scores are low, so they are not meaningful predictors for either understatements or non-understatements. Repeated punctuation appeared five times in our n-gram list, namely '!!' with a score of -0.366, '!!!' with a coefficient of -0.226, '!!!!' with a value of -0.136, '??' with a coefficient of -0.121 and '???' with a coefficient of -0.062. It is noteworthy that all five repeated punctuation n-grams are negative predictors for understatements; they thus do not predict understatements but rather non-understatements. Like the repeated punctuation marks, the single question mark is a negative predictor for understatements, with a value of -0.125. Exclamations are also represented in our extracted n-grams; four variants were found: 'Wow' with a coefficient of 0.004, 'wow' with a coefficient of 0.024, 'Yay' with a value of 0.004 and 'yay' with a coefficient of -0.046. 'Wow', 'wow' and 'Yay' are positive predictors for understatements while 'yay' is a negative predictor. All four exclamation n-grams are rather weak predictors for understatements or non-understatements. Lastly, explicit markers for sarcasm were encountered. We found 'NOT' with a coefficient of -0.023, 'Not' with a value of 0.026, 'not' with a score of 0.103, '#Not' with a coefficient of 0.010 and '#not' with a value of 0.041. All but 'NOT' were positive predictors for understatements. What is remarkable is that only 'not' has a relatively high coefficient (0.103) in comparison with the other explicit markers. It could be argued that only 'not' is a plausible predictor for understatements and that the other coefficients are too low to be of significance.
Intensifiers and emotion words were not found in the n-gram list. This can be explained by the number of characters extracted from the tweets: strings consisting of two, three or four characters may be too short to capture intensifiers or emotion words.
4.3 Word n-grams We now continue with the results concerning the word n-gram model. Figure 7 gives an overview of the 20 word n-grams that are the best predictors for understatement and non-understatement tweets. One positive predictor for understatement tweets has a relation to the features found in sarcastic tweets: 'not' could be an explicit marker for sarcasm, for instance as used in the following tweet: "@FoxNews breaking: climate change robs walruses of ability to swim. #NotReally am I the only one who remembers haulouts on @NatGeo ?". Three other positive predictors could be used for weakening an utterance: 'little', 'pretty' and 'bit' are words that could be used in an understatement context, for example "Ok that last 3 minutes was pretty fun #DallasStars" or "Got a little wet walking home from the grocery store. #stormTO". The negative predictors for understatement tweets have no clear relation to the features previously found in sarcasm research.
Figure 7. Top 20 word n-grams of best word model.
Figure 8 shows the features previously found in sarcasm research and the relating word n-grams extracted from our data. With our best word n-gram model we found three laughter expressions, nine intensifiers, six emotion words, seven exclamations and two different explicit markers.
Figure 8. Features and corresponding word n-grams of best word model.
The laughter expressions found include 'LOL' with a coefficient of -0.062, 'lol' with a value of 0.038 and 'hahaha' with a coefficient of -0.087. It is notable that 'LOL' and 'hahaha' are negative predictors for understatements and thus predict non-understatements, while 'lol' is a positive predictor for understatements. All three laughter expressions have rather low coefficient values; they are not important predictors for understatements or non-understatements.
The intensifiers found are 'Really' with a coefficient of 0.055, 'really' with a coefficient of -0.201, 'So' with -0.193, 'so' with -0.248, 'Super' with 0.094, 'super' with 0.056, 'Too' with -0.125, 'too' with 0.064 and 'very' with 0.222. It is notable that 'really', 'So', 'so', 'Super' and 'Too' are negative predictors for understatement tweets, which means they predict non-understatements. Of these five negative predictors, 'really', 'So', 'so' and 'Too' have rather high coefficient values and can be seen as plausible predictors for non-understatements. Of the four positive predictors, 'Really', 'super', 'too' and 'very', only 'very' has a rather high coefficient value; the coefficients of 'Really', 'super' and 'too' are so low that it can be argued they are no meaningful predictors for understatements.

The six emotion words found in the word n-gram experiment are 'not happy' with a coefficient of 0.196, 'happy' with 0.399, 'sad' with 0.020, 'Excited' with -0.118, 'excited' with 0.181 and 'disappointed' with 0.079. It is noteworthy that only one of the six emotion words, 'Excited', is a negative predictor for understatements; it is also the only one starting with a capital letter. Another remarkable point is that the negative emotion words 'sad' and 'disappointed' have rather low coefficient values. 'Not happy' combines 'not' and 'happy', which could be why its value is higher than that of the other two negative emotion words. The two positive emotion words in lowercase, 'happy' and 'excited', have relatively high coefficient values and can therefore be seen as plausible predictors for understatements.
I found seven exclamations: 'YAY' with a coefficient of -0.072, 'Yay' with -0.027, 'yay' with -0.073, 'Wow' with 0.131, 'wow' with -0.125, 'Yes' with -0.100 and 'yes' with 0.199. It is notable that only two exclamations, 'Wow' and 'yes', are positive predictors for understatements; the other exclamations are negative predictors and thus predict non-understatements. 'YAY', 'Yay' and 'yay' have relatively low coefficient values and are therefore not strong predictors for non-understatements, while 'wow' and 'Yes' have rather high coefficient values and can be seen as plausible predictors for non-understatements. Finally, two explicit markers for sarcasm were found: 'Not' with a coefficient of 0.438 and 'not' with 0.829. These coefficient values are high compared to those of the other features found in the word n-gram experiment; 'Not' and 'not' are plausible predictors for understatements.
4.4 New data
In order to test whether the two best classification models performed better, equally well or worse on new data, I used the test set for a final accuracy measurement. Table 2 gives an overview of the classifiers, their accuracy on the validation set and their accuracy on the test set. The best character n-gram model (n-gram range 2,4, C = 1.1) had an accuracy of 67.7% on the validation data; on the test set, it reached an accuracy of 73.8%. The best word n-gram model (n-gram range 1,5, C = 1) had an accuracy of 63.1% on the validation data; on the test set, it reached 66.9%. These results show that the test set was easier to process for both models than the validation set.
Table 2. Comparing accuracy scores of validation set and test set.

Classifier                                 Validation set accuracy   Test set accuracy
Character N-grams (range 2,4, C = 1.1)             0.677                  0.738
Word N-grams (range 1,5, C = 1)                    0.631                  0.669
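The accuracy figures in Table 2 follow directly from the counts of correct assignments reported in section 5.1 (a trivial sketch; the per-tweet predictions themselves are not reproduced here):

```python
def accuracy(correct, total):
    """Fraction of tweets assigned the correct label."""
    return correct / total

# Best character n-gram model: 176 of 260 validation tweets correct
print(round(accuracy(176, 260), 3))  # 0.677
# Best word n-gram model: 164 of 260 validation tweets correct
print(round(accuracy(164, 260), 3))  # 0.631
```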
5. Discussion
In this chapter the results of our experiments are interpreted. We start by presenting the confusion matrices for both the best character and the best word model in section 5.1. In section 5.2 we take a detailed look at the types of errors that the models made. Finally, our research question is answered in section 5.3.
5.1 Confusion matrices
Confusion matrices were created to give an overview of the mistakes that the classification models make. Mistakes include false positives, the assignment of a positive label to a negative tweet, and false negatives, the assignment of a negative label to a positive tweet. The validation set consisted of 260 tweets, of which 130 were labeled positive and 130 negative. Figure 9 shows the correct and incorrect assignments of the best character model on the validation dataset. The model assigned the correct label to 176 tweets and made 84 mistakes, which leads to an accuracy score of 67.7%. 80 tweets were correctly labeled as understatements and 96 tweets were correctly labeled as non-understatements. The model assigned a positive label to a negative tweet 35 times, and false negatives occurred 49 times. Figure 10 shows the mistakes made by the best word classification model on the validation dataset. The model assigned the correct label to 164 tweets and made 96 mistakes, which leads to an accuracy score of 63.1%. 83 tweets were correctly labeled as positive and 81 tweets were correctly labeled as negative. Of the 96 mistakes, 50 were false positives and 46 were false negatives.
                       Classifier positive labels   Classifier negative labels
True positive labels              80                           49
True negative labels              35                           96

Figure 9. Confusion matrix of best character model on validation data.
                       Classifier positive labels   Classifier negative labels
True positive labels              83                           46
True negative labels              50                           81

Figure 10. Confusion matrix of best word model on validation data.
A paired samples t-test was conducted to compare the number of correct label assignments of the best character n-gram classifier and the best word n-gram classifier. No significant difference was found between the best character n-gram classifier (M = 0.68, SD = 0.469) and the best word n-gram classifier (M = 0.63, SD = 0.484); t(259) = 1.398, p = .163.
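The mechanics of this test can be sketched with scipy. The 0/1 correctness vectors below are reconstructed from the reported totals (176 vs 164 correct out of 260); the actual pairing of correct and incorrect tweets across the two classifiers is not recoverable here, so the resulting t and p illustrate the procedure rather than reproduce the reported t = 1.398, p = .163.

```python
import numpy as np
from scipy import stats

# 1 = tweet labeled correctly, 0 = labeled wrongly (reconstructed counts)
char_correct = np.array([1] * 176 + [0] * 84)  # best character model
word_correct = np.array([1] * 164 + [0] * 96)  # best word model

# Paired samples t-test over the 260 per-tweet correctness scores
t_stat, p_val = stats.ttest_rel(char_correct, word_correct)
print(f"t = {t_stat:.3f}, p = {p_val:.3f}")
```

With 260 paired observations the test has 259 degrees of freedom, which is why the result is reported as t(259).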
5.2 Error analysis
In this section the errors that our models made on the validation dataset are discussed. I start by discussing the average word and character counts of the correctly and wrongly assigned tweets of the best character n-gram model in section 5.2.1 and continue with the best word n-gram model in section 5.2.2. We then look at the types of mistakes that both models made in section 5.2.3.
5.2.1 Character n-gram model
As discussed in section 5.1, our best character n-gram model had an accuracy score of 67.7%. This means the model was right in 67.7% of the label assignments and wrong in 32.3%. In this section we look at the latter, the 32.3% in which the model was wrong. Table 3 shows the mean number of words and characters, with blank spaces either excluded or included, for the correct and wrong assignments. The difference in average word count between wrong and correct assignments is 0.78. The differences in average character count, without and with blank spaces, are 0.20 and 1.29 respectively. These scores show that the average correctly assigned tweet is shorter than the average wrongly assigned tweet; the largest difference is in the number of characters with blank spaces included.
Table 3. Average of words and characters of wrong and correct assignments by the best character n-gram model compared.

                                         Wrong assignments   Correct assignments
Average words                                 11.71                10.93
Average characters (no blank spaces)          60.46                60.26
Average characters (with blank spaces)        73.70                72.41
5.2.2 Word n-gram model
We now continue with the error analysis of the best word n-gram model. As stated in section 5.1, the best word n-gram model had an accuracy score of 63.1%. This means that the model made correct predictions in 63.1% of the cases and false predictions in 36.9% of the cases. Table 4 shows the average number of words and characters of wrong and correct assignments by the best word n-gram model. The difference in word count between wrong and correct assignments is -1.42. The differences in character count, without and with blank spaces, are -10.37 and -11.79 respectively. These results indicate that correctly assigned tweets consist of more words and more characters on average.
Table 4. Average of words and characters of wrong and correct assignments by the best word n-gram model compared.

                                         Wrong assignments   Correct assignments
Average words                                 10.24                11.66
Average characters (no blank spaces)          53.69                64.06
Average characters (with blank spaces)        65.23                77.02
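The averages in Tables 3 and 4 can be computed with a small helper like the following (a sketch with hypothetical names; the model's actual lists of wrongly and correctly assigned tweets are not reproduced here):

```python
def length_stats(tweets):
    """Average word and character counts for a list of tweet strings."""
    n = len(tweets)
    return {
        "avg_words": sum(len(t.split()) for t in tweets) / n,
        "avg_chars_no_spaces": sum(len(t.replace(" ", "")) for t in tweets) / n,
        "avg_chars_with_spaces": sum(len(t) for t in tweets) / n,
    }

# Illustrative input; in the analysis this would be the wrongly assigned
# tweets and the correctly assigned tweets, compared side by side
result = length_stats(["Got a little wet", "pretty fun"])
print(result["avg_words"])  # (4 + 2) / 2 = 3.0
```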
5.2.3 Mistakes by both models
To complete the error analysis, I look at the mistakes that both models made. What attracted my attention is that almost all false negatives occurred because of labeling errors. Two examples of false negatives are as follows: "@akilpin: There's about a million fireworks and drums going off outside. The Spanish love their fiestas!" and "@ESPNChiHawks: Richards still adjusting with Blackhawks". These examples were labeled as understatements, but the models assigned them as non-understatement tweets. These errors were made due to insufficient manual labeling of the dataset; the models were correct in assigning a negative label to these tweets.
The models also made errors in the form of false positives. The nature of these mistakes was harder to uncover than that of the false negatives. Two examples of false positives are as follows: "@francaiskitty Actually, you can pay someone to do that. lol." and "10 Writing Tips To Connect With Your Readers". It is not clear to me why both models assign a positive label to these (and similar) negative tweets. The first example does contain a positive predictor for understatement tweets ('lol'), but that predictor is rather weak, with a coefficient of 0.013 in the best character n-gram model and 0.038 in the best word n-gram model. Both models thus classify some negative tweets as positive for unclear reasons.
5.3 Research question
This section answers the overarching research question, which was formulated as follows:

RQ: Are there linguistic and/or pragmatic patterns to be found in tweets containing understatements?

In this thesis I attempted to find linguistic and pragmatic patterns in tweets containing understatements in order to answer this question. I found laughter expressions, emoticons, repeated punctuation, question marks, exclamations and explicit markers for sarcasm with our best character n-gram model, and laughter expressions, intensifiers, emotion words, exclamations and explicit markers for sarcasm with our best word n-gram model. I thus did find linguistic and pragmatic markers that had previously been encountered in sarcastic tweets, but not all of these features were positive predictors for understatements. Half of the character n-gram features found were negative predictors for understatements, and among the word n-gram features the distribution between positive and negative predictors was almost even.
6. Conclusion
In this final chapter I draw conclusions from this study (section 6.1) and discuss the limitations that arose during the study (section 6.2). Implications for future work are also given in section 6.2.
6.1 Conclusion
This study focused on finding linguistic and pragmatic features in understatement tweets. Two binary logistic regression models were used to find character and word n-grams that predict whether a tweet is an understatement or not. Linguistic and pragmatic features were found, but not all of them were positive predictors for understatements. The best character n-gram model found laughter expressions, emoticons, repeated punctuation, question marks, exclamations and explicit markers for sarcasm. The best word n-gram model found laughter expressions, intensifiers, emotion words, exclamations and explicit markers for sarcasm. There were equal numbers of positive and negative predictors among the character n-grams and almost equal numbers among the word n-grams. My findings were not completely in line with research previously conducted by De Freitas et al. (2014), Liebrecht et al. (2013) and González-Ibáñez et al. (2011). The findings were similar in that the understatement tweets contained predictors similar to those of sarcastic tweets, but differed in the polarity and strength of these predictors. In previous work the pragmatic and linguistic markers found in sarcastic tweets were all positive predictors for sarcasm; in my results these markers were either positive or negative predictors for understatements. Approximately half of the markers encountered in my dataset were positive predictors, the other half negative predictors for understatements.
6.2 Limitations and future work
In this section the limitations that arose while conducting this research are discussed and implications for future work are given. We start with the limitations. Because of the limited time in which the research had to be done, the number of tweets collected was rather small. To get better results, it is recommended to use a larger dataset collected over a greater timespan for this type of research. The filtering of the tweets was done by one untrained person who is not a native speaker of English. A more reliable dataset could be assembled if multiple trained judges assigned the tweets to the understatement or non-understatement conditions; comparing their assignments could yield an even better dataset.

More research is needed to get a clear view of the linguistic and pragmatic markers that predict understatements. Future studies could look at features that are characteristic of understatements instead of features that have been found in past sarcasm research. One such feature could be words that weaken a statement, for example 'little', 'mildly' and 'a bit'. These weakening words occurred in my dataset and deserve a closer look. Future work could also examine understatements in languages other than English, for example Dutch or German, which both use understatements as a stylistic device. The results for other languages could then be compared to investigate whether the linguistic and pragmatic markers that predict understatement tweets are alike or different. To take this research further, one could also investigate whether people with different cultural backgrounds use different pragmatic or linguistic markers in their understatements.

In short, not much research has been done on understatements and their linguistic and pragmatic predictors. To get a better understanding of the patterns that underlie understatements, more research is needed.
References

Caucci, G. M., & Kreuz, R. J. (2012). Social and paralinguistic cues to sarcasm. Humor, 25, 1-22.

Colston, H. L. (1997). "I've never seen anything like it": Overstatement, understatement, and irony. Metaphor and Symbol, 12(1), 43-58.

Colston, H. L., & O'Brien, J. (2000). Contrast and pragmatics in figurative language: Anything understatement can do, irony can do better. Journal of Pragmatics, 32(11), 1557-1583.

de Freitas, L. A., Vanin, A. A., Hogetop, D. N., Bochernitsan, M. N., & Vieira, R. (2014). Pathways for irony detection in tweets. In Proceedings of the 29th Annual ACM Symposium on Applied Computing (pp. 628-633). ACM.

Gilliam, T. (Director), & Jones, T. (Director). (1975). Monty Python and the Holy Grail [Motion picture]. United Kingdom: EMI Films.

González-Ibáñez, R., Muresan, S., & Wacholder, N. (2011). Identifying sarcasm in Twitter: A closer look. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies: Short Papers, Volume 2 (pp. 581-586). Association for Computational Linguistics.

googlemaps (2014, July 18). Today's @letour stage is a bit on the twisty side. #understatement http://goo.gl/8l9nMe [Twitter post]. Retrieved from https://twitter.com/googlemaps/status/490149030781919232

Justo, R., Corcoran, T., Lukin, S. M., Walker, M., & Torres, M. I. (2014). Extracting relevant knowledge for the detection of sarcasm and nastiness in the social web. Knowledge-Based Systems, 69, 124-133.

Martin, N. D. (1990). Understatement and overstatement in closing arguments. Louisiana Law Review, 51, 651-666.
Lee, C. M., & Narayanan, S. S. (2005). Toward detecting emotions in spoken dialogs. IEEE Transactions on Speech and Audio Processing, 13(2), 293-303.

Liebrecht, C. C., Kunneman, F. A., & Bosch, A. P. J. van den (2013). The perfect solution for detecting sarcasm in tweets #not. In Balahur, A., Goot, E. van der, & Montoyo, A. (Eds.), Proceedings of the 4th Workshop on Computational Approaches to Subjectivity, Sentiment and Social Media Analysis (pp. 29-37). New Brunswick, NJ: ACL.

Maynard, D., & Greenwood, M. A. (2014). Who cares about sarcastic tweets? Investigating the impact of sarcasm on sentiment analysis. In Proceedings of LREC.

Soanes, C., Hawker, S., & Elliott, J. (Eds.). (2006). Paperback Oxford English Dictionary (Vol. 10). Oxford University Press.