Automated detection of offensive language behavior on social networking sites

Automated detection of offensive language behavior on social networking sites Baptist Vandersmissen

Promotoren: prof. dr. ir. Filip De Turck, dr. ir. Tim Wauters Begeleiders: Philip Leroux, ir. Johannes Deleu, Joost Roelandts (Massive Media) Masterproef ingediend tot het behalen van de academische graad van Master in de ingenieurswetenschappen: computerwetenschappen

Vakgroep Informatietechnologie Voorzitter: prof. dr. ir. Daniël De Zutter Faculteit Ingenieurswetenschappen en Architectuur Academiejaar 2011-2012


Acknowledgments First of all, I would like to express great gratitude to my supervisors Philip Leroux and ir. Johannes Deleu for their many insights, continuous support and patience. Dr. Thomas Demeester also deserves mention for his contributions to this study. I thank Massive Media for providing me with data from Netlog.com. I also want to thank my dear friends at university, Ruben Verhack and Sacha Vanhecke, for these marvelous and intense five years. Finally, I want to thank my friends and family in general for their support and contagious enthusiasm.


Permission for Use of Content “The author gives permission to make this master dissertation available for consultation and to copy parts of this master dissertation for personal use. In the case of any other use, the limitations of the copyright have to be respected, in particular with regard to the obligation to state expressly the source when quoting results from this master dissertation.”

Baptist Vandersmissen, June 2012


Summary This study applies machine learning techniques to perform automated offensive language detection. A corpus, originating from the Dutch distribution of the social network Netlog, is used. An information retrieval system, enhanced with Rocchio query expansion, is developed to effectively find offensive messages. Subsequently, a Naive Bayes and a Support Vector Machine classifier are implemented and trained on the gathered offensive (and irrelevant) messages. To increase reliability, a third classifier is created that is based on word lists. Since each classifier has its own strengths and weaknesses, a combination of the different methods is proposed. This combination of separate classifiers outperforms all other methods.

Keywords: offensive language detection, query expansion, text classification

Abstract Social networking sites are booming as never before. Apart from the numerous new opportunities they provide, hazards such as messages containing sexual harassment or racist attacks also have to be taken into account. Since manually monitoring and analysing every message separately is unattainable, automated solutions are sought. This study applies machine learning techniques to perform automated offensive language detection. Offensive language can be defined as "expressing extreme subjectivity", and this study mainly focuses on two categories: 'sexual' and 'racist'. A corpus, originating from the Dutch distribution of the social network Netlog, is used and contains over seven million blog messages. We note that only a very small proportion (approximately 0.85%) of these blog messages can be characterized as containing abusive language. Initially, the intention is to implement two supervised learning methods: Naive Bayes and Support Vector Machine. These methods base the classification of a message on previous experience, derived from a labeled training set. To build such a training set, offensive messages should be efficiently extracted from the corpus. In order to achieve this, an information retrieval system, extended with a query expansion technique, is applied. A query containing offensive terms delivers offensive messages, but a more efficient approach is obtained by enhancing the query using Rocchio query expansion. This study shows that query expansion can effectively increase the number of relevant messages retrieved. The supervised classifiers are trained on the labeled set, after which their performance is tested on an independent validation set. The Naive Bayes classifier does not perform well on the validation set and is therefore disregarded in the further analysis. Our Support Vector Machine implementation achieves results of approximately 69% precision and 62% recall.
However, these results are obtained by ignoring very small messages, since the SVM has difficulties classifying messages that do not contain much information. To tackle the issues the SVM suffers from, a more reliable but less dynamic method is designed, based on word lists. This method, named a semantic classifier, obtains

reasonable results on the validation set with a recall of 93%, but more importantly it is highly complementary with the SVM classifier. The classification of a message is eventually performed by choosing the appropriate classifier depending on the situation and context. This method outperforms all others and achieves a precision of 100% combined with a 79% recall. Notwithstanding the solid results of our classifier on the validation set, we note that offensive language detection is a challenging domain that, like many text classification problems, suffers from specific linguistic characteristics. Automated systems can be an excellent aid, but the human capacity to properly estimate the real meaning of a message remains irreplaceable.

Samenvatting Social networking sites are highly topical and ubiquitous. Despite the countless new opportunities they offer, they also carry risks, which manifest themselves in the form of users who try to abuse the medium. The spreading of inappropriate, offensive or hurtful messages is a common problem on every digital communication medium. Since it is not realistic to check every message manually, automated solutions are sought. This study attempts to automatically detect offensive or hurtful language using machine learning techniques. Offensive language can be defined as "expressing extreme subjectivity". This study focuses mainly on messages that contain either sexually tinted, inappropriate content or racist messages. For this purpose a message collection is used, originating from the Dutch-language distribution of the social networking site Netlog, which counts more than seven million blog messages. It should be noted, however, that only 0.85% of these messages actually contain inappropriate language. Initially, the intention was to develop two supervised learning methods, based on Naive Bayes and on a Support Vector Machine respectively. These methods base the classification of a message on previous experience, derived from a known training set. A training set is built from positive and negative examples; in other words, descriptive example messages must be collected from the complete collection. To achieve this, a search system was implemented in combination with a technique that attempts to improve search queries. Relevant messages are found in the collection by entering queries into the system, and these queries are in turn expanded by Rocchio's query expansion technique.

Using a query expansion technique, this study shows that more relevant messages can be obtained in a more efficient way. Both supervised classifiers are then trained on the labeled message collection, after which their overall performance is tested on a validation set. Since the Naive Bayes classifier performs rather poorly, it is left out of consideration. The Support Vector Machine achieves better results, with a precision of about 69% and a recall of 62%. These results are, however, obtained by ignoring messages that are too small, i.e. that do not contain at least five distinct words. This strongly improves the result, since an SVM copes poorly with small messages. To address the various problems of the SVM method, a third method was developed that classifies messages using predefined word lists. This semantic method achieves reasonable results on the validation set, with a very high recall of 93%, and is moreover highly complementary with the SVM classifier. The final classification of a message is then determined by a combination of both methods: depending on the situation, one or the other classifier is chosen. With this system we achieve a precision of 100% and a recall of 79% on the validation set. Notwithstanding the excellent results on the validation set, we note that this set is not a perfect representation of the entire message collection. Detecting inappropriate language is a challenging domain that will still have to evolve considerably to master the difficult and characteristic properties of language.

Contents

Acknowledgments

Preface

1 Introduction: What is Offensive Language Detection?
  1.1 Introduction
  1.2 Goals of This Study
  1.3 Approach
  1.4 Applications
    1.4.1 Query Expansion
    1.4.2 Text Categorisation
  1.5 Challenges
  1.6 Related Work

2 The Dutch Netlog Corpus
  2.1 Netlog
  2.2 Corpus Overview
  2.3 Offensive language
  2.4 Creating Training and Validation set
    2.4.1 Labels
    2.4.2 Validation set
    2.4.3 Training set
  2.5 Informal Language and Multiple Languages

3 Methodology
  3.1 Information Retrieval
    3.1.1 Features
    3.1.2 Evaluation
    3.1.3 Spelling Correction
    3.1.4 Stemming
    3.1.5 Part-of-Speech Tagger
    3.1.6 Query Expansion
    3.1.7 Open Source Software
  3.2 Machine Learning Techniques
    3.2.1 Data Representation
    3.2.2 Feature Selection
    3.2.3 Classification Techniques
  3.3 Lexicon Based Text Classification
  3.4 Evaluation

4 Design and Implementation
  4.1 Information Retrieval
    4.1.1 Query Expansion
  4.2 Offensive Language Detection
    4.2.1 Naive Bayes
    4.2.2 Support Vector Machine
    4.2.3 Semantic Methods
    4.2.4 Combination of Methods

5 Results
  5.1 Query Expansion
    5.1.1 Rocchio Relevance Feedback
    5.1.2 Conclusion
  5.2 Offensive Language Detection
    5.2.1 Naive Bayes
    5.2.2 Support Vector Machine
    5.2.3 Semantic Classifier
    5.2.4 Comparison of Single Methods
    5.2.5 Combined Algorithm
    5.2.6 Corpus Classification
    5.2.7 Conclusion

6 Discussion
  6.1 Main Findings
    6.1.1 Corpus
    6.1.2 Development
    6.1.3 Experimental Study
  6.2 Discussion
    6.2.1 Domain-dependency
    6.2.2 Implicit Language
    6.2.3 Conclusion
  6.3 Future Improvements

7 Conclusion
  7.1 Main Conclusions

Preface "On average 37% of the people talk to people more online than they do in real life." A survey by Badoo in the U.K., U.S. and Germany

Since the beginning of the internet it has always been a goal to create forms of computer-mediated social interaction. The first type of social networking sites (SNSs) came in the form of online communities such as Theglobe.com and Tripod.com. These mostly focused on creating interaction between different people by bringing them together in chat rooms. It is only in the late 1990s that many sites began to incorporate more advanced features, and user profiles became the central point of view [56]. This somewhat newer form of social networking site began to flourish with the rise of SixDegrees.com in 1997 [4], followed by many others such as Friendster and, for example, the European Netlog, formerly known as Facebox. Since then the popularity of social networking sites has been increasing rapidly, up until today. Many of these pioneering sites, such as SixDegrees.com and Friendster, however died a silent death or lost a great number of users to more modern SNSs. Social networking sites have become widespread across the global population. Nowadays a staggering 900 million users are monthly active on the networking site Facebook¹. This means that around one in eight people in the world has a Facebook profile. Besides Facebook there are of course many more social networking sites; Qzone, for example, is a Chinese SNS with a user base of around 531 million [51]. Apart from the most popular mainstream SNSs, numerous smaller niche social networking sites exist, and the majority of these can in turn also present increasing visitor numbers. Facebook is by far the most popular SNS, but for others the popularity strongly depends on their specific features and geographical location; different cultures will, for example, have different expectations regarding privacy. Social networking sites have long been labeled as unsustainable and have been predicted to become defunct sooner rather than later.
Often, the reason SNSs were described as an internet hype or bubble was that their user base consists mostly of young and "unpredictable" people. On top of that, SNSs did not seem to produce any essential value except staying in contact with old friends or sharing pictures. These predictions have, however, slowly faded away as the importance and value of these sites continued to grow. Nowadays almost all major companies are present on one or more social networking sites, and even marketing strategies and other important decisions are often influenced by social media. The fact that a whole world is accessible from behind a computer is one of the greatest and key aspects of the internet. But at the same time it creates a safe haven for criminals and those with less noble goals. It is obvious that these digital environments pose many threats, especially for younger users. A study by ScanSafe shows that up to 80% of blogs contain offensive language [71]. Offensive language has spread into almost every corner of online communities. Well-known issues like bullying, harassment, exposure to harmful content, sexual grooming and racist attacks are important problems that directly and indirectly affect our mental health. Controlling such digital environments requires tremendous human effort: every twenty minutes, for example, 1,587,000 blog posts and 10,208,000 comments are posted on Facebook². It is here that this study presents an automatic method to detect offensive or abusive language.

We begin with an introduction of what offensive language detection exactly entails in chapter one. Moreover, we describe the challenges associated with this domain and our approach to this study. Chapter two gives an overview of the Netlog data corpus we work with. In chapter three the available techniques that are relevant to this study are discussed. Which techniques we used and how they are implemented can be found in chapter four, while the fifth chapter reports on the results of the experiments and the evolution of the whole study. Chapter six discusses the main findings of this study and makes some proposals for future work. Finally, chapter seven summarizes the conclusions.

¹ Facebook Statistics, http://newsroom.fb.com/content/default.aspx?NewsAreaId=22, [2012-05-04]
² Obsessed with Facebook, http://www.onlineschools.org/blog/facebook-obsession/, [2012-05-01]

Chapter 1

Introduction: What is Offensive Language Detection?

1.1 Introduction

What is Automated Detection? Automated means "operating with minimal human intervention; independent of external control"¹. Automated detection is thus the operation of detecting matters with no or minimal human intervention in a controlled environment. Terms directly associated with automated detection are machine learning and supervised learning. Machine learning, part of the artificial intelligence branch, is a scientific discipline dealing with the design and development of algorithms that allow machines to evolve behavior based on experience. Supervised learning is the machine learning task of deducing a function from labeled (supervised) training data [24].

What is Offensive Language Behavior? Offensive² can be described as:

− causing anger or annoyance; 'offensive remarks'
− causing or capable of causing harm
− exhibiting lack of respect; rude and discourteous

¹ Definition on www.thefreedictionary.com, [2012-05-02]
² idem

In this study offensive language is defined as the propagation of offensive messages or remarks that in some circumstances are inappropriate, exhibit a lack of respect towards certain groups of people or are just rude in general. In the literature offensive language is often referred to as 'flame'; [2] defines flames as 'exhibiting extreme subjectivity'. However, offensive language (flames) is a very vague and ambiguous concept. In general


people describe or experience certain events or messages in different ways, according to their own education, culture and personal experience. Therefore it is very important to accurately define what we describe as being offensive language. Our final goal is not only to detect offensive language but to be able to discover offensive language behavior, which can be described as a person (a user of a social networking site) repeatedly propagating offensive language in a certain time interval. Offensive language is not only a vague concept but is also exceptionally wide. The following subjects can all be interpreted as offensive, or at least cause a certain nuisance:

− Messages containing unwanted advertisement or plain spam.
− Scammers trying to steal personal information.
− Sexually inappropriate messages or indecent proposals.
− Racist messages: offending certain people or entire groups (aggression against some culture, subgroup of society, race or ideology in a tirade).
− ...

Because different categories often require different strategies, we decided to focus on just two categories. As stated above, harassment, sexual grooming and racist attacks are important as they are widespread and can do more harm to younger children than spam. Therefore, we focus on the detection of sexually inappropriate and racist messages. However, we intend to build a system that can, with some effort, be expanded in a flexible way.

What is a Social Networking Site? In [4] a social networking site (SNS) is defined as a web-based service that allows individuals to:

i. construct a public or semi-public profile within a bounded system;
ii. articulate a list of other users with whom they share a connection;
iii. view and traverse their list of connections and those made by others within the system.

The exact meaning of the above terms of course differs from site to site and is relatively flexible. Holding this definition in mind, several hundred social networking sites exist nowadays and are widespread across the global population. For this study we made use of the Dutch Netlog corpus (cf. chapter 2).
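The idea of supervised learning introduced above, deducing a classification function from labeled training data, can be illustrated with a minimal sketch. The toy messages and the simple word-overlap scoring below are invented for illustration; the actual classifiers in this study (Naive Bayes, Support Vector Machine) are far more principled.

```python
from collections import Counter

# Invented toy training data: pairs of (message, label).
TRAINING = [
    ("you are a wonderful friend", "acceptable"),
    ("have a nice day", "acceptable"),
    ("i hate you stupid idiot", "offensive"),
    ("you ugly stupid loser", "offensive"),
]

def train(examples):
    """Count how often each word occurs under each label."""
    counts = {}
    for text, label in examples:
        counts.setdefault(label, Counter()).update(text.split())
    return counts

def classify(counts, text):
    """Assign the label whose training vocabulary overlaps most with the message."""
    def score(label):
        return sum(counts[label][word] for word in text.split())
    return max(counts, key=score)

model = train(TRAINING)
print(classify(model, "stupid ugly message"))  # offensive
```

The point of the sketch is only that the system's behavior is entirely derived from the labeled examples, which is why building a good training set is a central objective of this study.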

Automated Detection of Offensive Language Behavior on Social Networking Sites

1.2 Goals of This Study

In this study we attempt to create a machine-driven solution to automatically detect offensive language behavior on social networking sites. The final goal is thus to develop a system that can automatically detect profiles/persons that extensively and deliberately post messages containing offensive language (cf. section 2.3). We can briefly describe our study by stating that we are dealing with a text classification problem. In the next chapter (cf. chapter 2) we give an overview of what we mean by offensive language. As mentioned before, we focus on two main categories, being 'sexual' and 'racist' messages.

The main objective is, however, divided into multiple smaller parts. The first objective focuses on finding and annotating a relevant subset of messages. As we deal with a rather large message set, it is key to efficiently extract relevant messages; in our case, relevant messages are messages that contain offensive language. While searching for relevant messages we also intend to build a thesaurus³ that can help us to build a classifier which is based not on training samples but on prefabricated word lists.

With this annotated subset of messages we then try to build several supervised learning methods. We mainly focus on two well-known machine learning techniques, being Naive Bayes and Support Vector Machine. Given the more complex and advanced nature of a Support Vector Machine, we hypothesize that the Support Vector Machine will outperform our Naive Bayes method. Therefore, we use our Naive Bayes classifier as a baseline against which to compare the eventual improvements. After building and implementing our classifiers we need to be able to validate their results. In order to do this we expand our existing training set⁴ with a set of irrelevant messages. We also randomly select two thousand messages out of the whole message set to create a realistic validation set, which can be used to test the performance of our classifiers. We then further try to improve the general performance by tweaking the available parameters and isolating determining factors.

A third goal is to take into account not only the message itself but also the reactions to a message. Sexually inappropriate or racist messages could cause a fuss, possibly manifested in the reactions to that message. We do this by creating a third classifier⁵ that is able not only to detect sexual or racist offensive language, but also to detect outrage⁶.

In our final goal we compare the results of the different separate classification methods. This research also checks whether performance can be increased by combining several separate methods into one whole. We then choose our best performing classification method and classify the whole data set to create a general overview. Due to time constraints, and the fact that the decision whether a user should be blocked depends on much more than only posting offensive messages, the propagation from message level to profile level has not been extensively developed. Our final results intend to sort users based on the number of detected offensive messages. Netlog will have an overview of offensive and non-offensive messages per user, ranked according to a relevance score⁷.

³ In this context a thesaurus can be defined as a list of concepts that describe a certain offensive category.
⁴ One set of sexually inappropriate messages and one set of racist messages.
⁵ A classifier that is built on prefabricated word lists.
⁶ In this study 'outrage' is defined as 'profound indignation, anger, or resentment'.
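The idea of combining separate methods into one whole can be sketched as a simple dispatcher. Everything below is a hypothetical illustration: the word list is invented, the statistical classifier is a stand-in for a trained model, and the five-distinct-word threshold mirrors the minimum message size mentioned in the summary.

```python
# Invented mini word list standing in for the prefabricated lists of the
# semantic classifier.
OFFENSIVE_WORDS = {"idiot", "loser"}

def semantic_classify(message):
    # Word-list ("semantic") classifier: flag any listed term.
    words = set(message.lower().split())
    return "offensive" if words & OFFENSIVE_WORDS else "acceptable"

def statistical_classify(message):
    # Stand-in for a trained statistical model (e.g. an SVM).
    return "offensive" if "hate" in message.lower().split() else "acceptable"

def combined_classify(message, min_distinct_words=5):
    # Short messages carry too little information for the statistical
    # model, so they are routed to the word-list classifier instead.
    if len(set(message.split())) < min_distinct_words:
        return semantic_classify(message)
    return statistical_classify(message)

print(combined_classify("you idiot"))  # offensive (word-list route)
```

The design point is complementarity: each component covers a situation where the other is weak, which is what makes the combination outperform either classifier on its own.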

1.3 Approach

Our study can be separated into two main parts. The first part is mainly oriented towards finding relevant messages in the whole message corpus (cf. chapter 2 for a detailed overview of the corpus). Because we intend to work with supervised learning methods (cf. section 1.1), it is crucial to find as many relevant messages as possible to build a complete training set. Query expansion is a well-known technique that tries to enhance the search for relevant documents by expanding the query with informative terms. From our point of view, every message containing racist or sexually inappropriate content is a relevant message. In the second part we pursue our final goal, classification of an unknown message, by applying text classification techniques. Text classification is the domain where documents are assigned one or more predefined classes (cf. chapter 3 for a more elaborate overview of these techniques).

1.4 Applications

The exponential growth of the World Wide Web makes it essential that people can use tools to better find, filter, and manage this electronic information. Document retrieval, categorization, routing and filtering can all be formulated as classification problems [29].

1.4.1 Query Expansion

The most well-known application using query expansion is by far the Google search engine. Google does numerous things to enhance your search results and increase the number of relevant pages retrieved. From the Google help pages we find:

− Words that share the same word stem as the word given by the user. For example, if the user's search query includes 'engineer', the search appliance could add 'engineers' to the query.
− Terms of one or more space-separated words that are synonymous or closely related to the words given by the user. For example, if a user searches for 'FAQ', the appliance could add 'frequently asked questions' to the query.

⁷ This score is based on the belief the classifier attaches to the classification of a message into a certain category.
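The two expansion strategies described on the Google help pages can be sketched in a few lines. The stem-variant and synonym tables below are toy stand-ins; a real system would derive them from a stemmer and a thesaurus rather than hard-coded dictionaries.

```python
# Toy lookup tables, invented for illustration only.
STEM_VARIANTS = {"engineer": ["engineers", "engineering"]}
SYNONYMS = {"faq": ["frequently asked questions"]}

def expand_query(query):
    """Add same-stem variants and synonyms of each query term."""
    terms = query.lower().split()
    expanded = list(terms)
    for term in terms:
        expanded.extend(STEM_VARIANTS.get(term, []))
        expanded.extend(SYNONYMS.get(term, []))
    return expanded

print(expand_query("engineer FAQ"))
# ['engineer', 'faq', 'engineers', 'engineering', 'frequently asked questions']
```

Rocchio query expansion, used later in this study, goes a step further: instead of static tables it derives the added terms from documents already judged relevant.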


1.4.2 Text Categorisation

Text categorization is of great importance for many information organization and management tasks.

Spam Filtering. The Internet Security Threat Report recently stated that spam email traffic dropped by around 13%. Despite this positive evolution, still 75% of all email traffic is identified as spam⁸. Spam messages are annoying to most users, as they waste their time and clutter their mailboxes. Text classification tries to differentiate regular emails from spam messages.

Topic Spotting. Topic spotting tries to automatically determine the topic of a text. It is often applied to structure large quantities of documents into a set of categories. For example, a collection of news articles can be divided into different categories such as politics, regional, economy, etc.

Language Identification. Text classification techniques can also be used to automatically determine the language of a text.
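Language identification, the last application mentioned, can be illustrated with a deliberately naive classifier that counts function words. The tiny stopword lists below are illustrative, not exhaustive; practical systems use character n-gram models trained on large samples.

```python
# Toy stopword lists (invented, far from complete) for two languages.
STOPWORDS = {
    "dutch": {"de", "het", "een", "en", "ik", "niet"},
    "english": {"the", "a", "an", "and", "i", "not"},
}

def identify_language(text):
    """Guess the language with the largest function-word overlap."""
    words = set(text.lower().split())
    return max(STOPWORDS, key=lambda lang: len(words & STOPWORDS[lang]))

print(identify_language("ik vind het niet leuk"))   # dutch
print(identify_language("i do not like the rain"))  # english
```

This kind of check is relevant to our setting, since the Netlog corpus mixes Dutch, informal Dutch and other languages (cf. section 2.5).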

1.5 Challenges

Since we are concerned with a text classification problem, a number of challenges arise. The main difficulty in the field of computational linguistics is ambiguity. Ambiguity can occur in a semantic, lexical or syntactic way [58] and is a challenging issue in natural language processing (NLP) [50, 59]. Lexical ambiguity of a word (or phrase) implies that this word has more than one meaning in the language to which it belongs. For instance, the word "bow" has several distinct lexical definitions, including "the front of the ship" and "the weapon to shoot arrows with". Lexical ambiguity can be solved by inferring the meaning of a word from its context. In contrast with lexical ambiguity there is semantic ambiguity, which imposes a choice between any number of possible interpretations; this form of ambiguity is closely related to vagueness [58]. When a phrase can be parsed in two or more ways we speak of syntactic ambiguity: the same sequence of words can then be matched to different grammatical structures. The well-known sentence 'Flying planes can be dangerous', for example, can be interpreted as 'the act of flying planes can be dangerous' or 'planes that are flying can be dangerous'. Again, context can be of crucial importance here.

⁸ Internet Security Threat Report instigated by Symantec. Report can be found on http://www.symantec.com/threatreport, [2012-05-15]

Automated Detection of Offensive Language Behavior on Social Networking Sites

Not only ambiguity, but also the fact that we are working in a social network environment with a large number of very young users, poses a challenge. This means the vast majority of the text will not be in standard Dutch, but will most likely contain informal language and different dialects. Numerous different spellings for one word make it harder to estimate the real value of that word (cf. chapter 2). The lack of grammatical rules in informal language makes it more difficult to identify separate sentences and to analyze the context to resolve ambiguity. A most challenging problem is the detection of implicit messages. Implicit messages can for example contain irony, sarcasm, etc. Again, humans can use the context to reveal whether the writer of a message is using irony or not. But even among human beings, misinterpreted irony is a frequent source of conflicts. The same applies to humour, which involves an extra difficulty: evaluating the appropriateness of humour implies determining whether a joke crosses a certain non-tolerable line. As exploring the boundaries of the right to freedom of speech is not one of our goals, we will simply ignore whether or not one is using irony, humour, etc. To end the list of challenges, domain independence is one of the biggest problems in machine learning and classification. Accuracy results for one corpus (e.g. a blog corpus) can be very high, yet the same system can perform very poorly when applied to another kind of corpus (e.g. a biomedical corpus). Finding effective approaches to overcome this problem is an active research field [58].

1.6 Related Work

Relatively few articles specifically discuss the detection of offensive language. However, many researchers are working on different kinds of opinion mining or sentiment detection. Examples are Pang et al. [45], Gordon et al. [18], Yu and Hatzivassiloglou [73], Riloff and Wiebe [52], Yi et al. [72], Dave et al. [11] and Riloff et al. [53]. We mention these articles because in many cases the detection of sentiment is, for some specific attributes, similar to the detection of offensive language. Hence, flame detection could be considered an offspring of subjective language detection.

Fk yea I swear: Cursing and gender in a corpus of MySpace pages
Thelwall, in [65], studies offensive language on the social networking site MySpace with a specific focus on swearing. The article investigates whether (strong) swearing is more dominant for one gender. The methods used do not rely on any text classification technique. The authors do show that swearing declines as users get older, but they could not find an overall gender difference in the use of stronger swear words.


Smokey: Automatic recognition of hostile messages
Ellen Spertus proposes in [60] a flame recognition system that not only looks for insulting words, but also for syntactic constructs that have a negative or condescending tendency. In the first step, the Smokey software converts each parsed sentence into Lisp s-expressions using sed and awk scripts. Next, semantic rules are used to further process these s-expressions, resulting in a 47-element feature vector based on the syntax and semantics of each sentence. Finally, one feature vector per message is created by summing up the vectors of each sentence. A message is then classified as a flame or not by evaluating its feature vector with rules generated by Quinlan’s C4.5 decision-tree generator. The feature-based rules were generated using a training set of 720 messages, and were able to correctly categorize 64% of the flames and 98% of the non-flames in a separate test set of 460 messages.

Detecting flames and insults in text
[34] describes a sentence-level classification system that tries to distinguish flames from information by interpreting the basic meaning of a sentence. This is achieved by using a set of rules and analysing the general semantic structure of a sentence. However, the system is limited by the fact that insulting sentences can only be detected when they contain related words or phrases originating from a lexicon.

Filtering Offensive Language in Online Communities using Grammatical Relations
In [71] the focus is more on filtering techniques than on the actual detection of offensive language. This article makes use of a simple word-matching algorithm in combination with an offensive word lexicon to detect the actual offensive words.

Offensive Language Detection Using Multi-level Classification
[48] tries to develop automatic and intelligent software for flame detection. Flame is another word for offensive language, including taunts, squalid phrases, etc. The article states that we are increasingly confronted with abusive language in, for example, emails or other texts. The authors perform flame detection by extracting features at different conceptual levels and applying multi-level classification. A message can be assigned to either the class Okay or the class Flame. Results point out that a 3-level classification gives a general accuracy of around 96% with a flame precision of 96.6%. They used a total of 1525 messages, of which 68% Okay and 32% Flame.
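Smokey’s pipeline of per-sentence rule features summed into one message vector can be sketched in miniature. The rules below are invented stand-ins for illustration only; the actual system uses 47 syntax- and semantics-based features derived from s-expressions, not regular expressions.

```python
import re

# Hypothetical stand-ins for a few semantic rules (NOT Spertus's actual rules).
RULES = [
    ("second_person_insult", re.compile(r"\byou (are|'re) (an? )?(idiot|fool)\b", re.I)),
    ("imperative_get", re.compile(r"^get (lost|out)\b", re.I)),
    ("polite_marker", re.compile(r"\b(please|thanks)\b", re.I)),
]

def sentence_vector(sentence):
    """One binary feature per rule for a single sentence."""
    return [1 if pat.search(sentence) else 0 for _, pat in RULES]

def message_vector(message):
    """Sum the per-sentence vectors into one feature vector per message,
    as Smokey does before handing the vector to the C4.5-derived rules."""
    sentences = [s.strip() for s in re.split(r"[.!?]+", message) if s.strip()]
    vec = [0] * len(RULES)
    for s in sentences:
        for i, f in enumerate(sentence_vector(s)):
            vec[i] += f
    return vec
```

The resulting message vector would then be scored by the learned decision rules.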


Chapter 2
The Dutch Netlog Corpus

As we are trying to detect offensive language behavior on social networking sites, a corpus of a similar nature is needed. Different corpora are available, such as Ken Lang’s Newsgroups data set [27], the Multi-Domain Sentiment corpus of Blitzer [3], movie reviews [45], the Wall Street Journal corpus, etc. However, none of these corpora (a) are real representations of the data on a social networking site, or (b) are well suited for flame detection, as they have been constructed for opinion mining purposes. We were able to make use of the Dutch Netlog Corpus (DNC), especially constructed for this study, containing a collection of blog messages and reactions on Netlog.

2.1 Netlog

Netlog is a European social networking site that was founded in the early 2000s by Lorenz Bogaerts and Toon Coppens. Netlog is nowadays being developed by Massive Media NV, located in Ghent. Over 96 million members1, spread over more than forty languages, frequently visit Netlog, with Dutch accounting for the ninth highest number of members (over four million). Initially Netlog focused on reaching a younger target audience, but this has changed over the years. Apart from Netlog, the company also develops a fast-growing online dating site named Twoo. We explicitly thank Massive Media for supplying this corpus.

2.2 Corpus Overview

In this section we give a general overview of the Dutch Netlog Corpus. We received approximately seven million blog messages from 800,201 different users. We have not received any information concerning a user’s age or other personal data. The corpus also includes more than eleven million reactions to the different blog messages. However, 71.9% of the blog messages do not contain any reaction. This means that the eleven million reactions are spread over about two million blog messages. In other words, a blog message belonging to the group containing reactions has on average six reactions. This is an important statistic because at a later stage reactions to a message are taken into account in the classification process (cf. chapter 4). Only a small subset of the messages will thus eventually be able to benefit from the extra information that reactions contain. Moreover, it is important to note that a message containing offensive language should be detected as soon as possible, thus optimally before anybody has been able to react to such a message.

1 http://nl.netlog.com/go/about/statistics, [2012-05-09]

2.3 Offensive language

Because offensive language and flames are very subjective concepts, we hereby attempt to clearly define them. We created two general categories of offensive language. The first category is named racist, and contains the following types of abusive language:
− Extremism: phrases that target a religion or ideology.
− Homophobia: phrases that usually talk about homosexual sentiments.
− Provocative language: expressions that may cause anger or violence.
− Racism: phrases that target the race or ethnicity of individuals.
− References to handicaps: phrases that attack the reader using his/her shortcomings.
− Slurs: phrases that try to attack a culture or ethnicity in some way.

The second category is named sexual:
− Crude language: expressions that embarrass people, mostly because they refer to sexual matters or excrement.
− Implicit/ambiguous language: expressions that refer to sexual matters in an indirect way.
− Indecent proposals: expressions that contain indecent proposals, often related to sex.
− Unrefined language: expressions that lack polite manners, where the speaker is harsh and rude.


Whenever this study mentions the concepts flame detection or offensive language, we implicitly refer to the above definitions. However, even with an elaborate definition, labeling a message as offensive or not is a nontrivial task. Determining when a certain expression is rude, unrefined, indecent or provocative remains a subjective decision.

2.4 Creating Training and Validation set

Since we intend to work with a supervised learning technique (cf. section 1.1), it is key to annotate and label a vast subset of blog messages. Supervised learning methods require a labeled example set to build experience, on which they can then base decisions about new unlabeled messages. Therefore, a training set should be as complete and descriptive as possible.

2.4.1 Labels

A message can be labeled in five different ways: ’sexual’, ’racist’, ’outrage’, ’irrelevant’ or ’unknown’.

Sexual
A message is labeled as ’sexual’ when it complies with the definition of offensive language. Moreover, it should contain either crude or unrefined language, implicit/ambiguous language or an indecent proposal. Because we work on a social networking site, which has a vast number of young users, we take a rather conservative approach to what can be described as being sexual and what not. For example: ’goed geil wil camsex ’

Racist
A message is labeled as ’racist’ when it complies with the definition of offensive language. The ’racist’ category contains more general abusive language than the name suggests. We do not consider the definition of a racist message to cover only discrimination based on race or ethnicity, but also matters such as homophobia, extremism, slurs, etc. For example: ’Bloed moet vloeien, weg met die vrijheid van die Joden republiek! ’

Outrage
Outrage is a category specifically designed for reactions, to indicate whether the writer of the message intends to express contempt. A regular blog message should not be tagged as an ’outrage’ message. Although it is perfectly possible to annotate a set of messages expressing outrage, it is not our primary goal to create a training set for outrage. For example: ’Pfff, vuile seksist! Ik weiger hier aan mee te doen. [thumbs down] ’

Irrelevant
All blog messages that do not contain any form of offensive language are considered ’irrelevant’. For example: ’ik kan niet slapen :( ’

Unknown
Our last label is named ’unknown’ and is applicable when it is unclear whether a message contains offensive language or not. Messages labeled as ’unknown’ are ignored.

2.4.2 Validation set

To be able to properly evaluate the performance of our classification system, a validation set is required: a validation set (also known as a test set) is a part of a data set used to assess the performance of classification models that have been trained on a separate part of the same data set2. We created a validation set containing a total of two thousand messages. To create a realistic image of the whole data set, we randomly selected these messages. Based on how the validation set is labeled, we can determine the positive/negative ratio in the whole set. The positive/negative ratio gives an indication of the percentage of messages containing offensive language in comparison with irrelevant messages. The two thousand labeled messages are distributed as follows:

Label        Amount   Percentage (%)
Racist            2             0.10
Sexual           15             0.75
Irrelevant     1983            99.15

Table 2.1: Distribution of validation set.

We clearly see that the amount of relevant messages is very low compared to the amount of irrelevant messages. According to the annotation of the test set, only a tiny 0.85% of the whole data set contains offensive language. This means that our whole data set should contain around 59,500 messages with flames, of which 7,000 racist messages and 52,500 sexual messages.

2 http://www2.statistics.com/resources/glossary/v/validset.php, [2012-05-10]
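The extrapolation above follows directly from the validation-set percentages in Table 2.1 and the corpus size of roughly seven million messages; as a quick arithmetic check:

```python
# Label counts from the 2,000-message validation set (Table 2.1).
validation = {"racist": 2, "sexual": 15, "irrelevant": 1983}
n_labeled = sum(validation.values())          # 2000
corpus_size = 7_000_000                       # approximate corpus size

# Extrapolate validation-set proportions to the whole corpus.
est_racist = corpus_size * validation["racist"] // n_labeled   # 7,000
est_sexual = corpus_size * validation["sexual"] // n_labeled   # 52,500
est_flames = est_racist + est_sexual                           # 59,500

flame_pct = 100 * (validation["racist"] + validation["sexual"]) / n_labeled  # 0.85
```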



Figure 2.1: Distribution of messages containing offensive language.

2.4.3 Training set

A training set can be defined as a specific part of the data set that is characteristic for the problem to be solved, and is used as input for learning algorithms3. As mentioned before, a training set should contain as many labeled samples as possible. Finding relevant messages is not a trivial task in a set that contains just over seven million messages. Therefore, we decided to use an information retrieval (IR) system, combined with query expansion (cf. chapter 3), to be able to easily search for relevant documents. With the aid of such an IR system we were able to build a training set with the following message distribution:

Label        Amount   Percentage (%)
Racist          173             3.27
Sexual          362             6.84
Irrelevant     4755            89.89
Total          5290           100.00

Table 2.2: Distribution of different classes in training set.

3 http://www.answers.com/topic/training-set, [2012-05-10]



Figure 2.2: Distribution of different classes in training set.

2.5 Informal Language and Multiple Languages

To conclude this chapter, we briefly discuss the nature of the data we received from Netlog. As we stated multiple times, the presence of informal language makes the detection of offensive language in particular, and text classification in general, more difficult. To begin with, every person has his or her own writing style and practices. Moreover, the internet is known for its fast-changing pace and new trends, including in the way people communicate with each other. Typical examples are the lack of spaces in words (e.g. “ikhaatu”) and the excessive use of abbreviations (e.g. “BFF”, “LOL”). Other examples are adding or removing characters, writing in dialect (e.g. “kzien a geire”), using emoticons (e.g. “:-)”, “:D”), using words from a foreign language, etc. Endless variations and the complete lack of any structural rule, except for one’s personal mindset, make it very hard to process chat language. Since there is no single way to write something, different variations arise that in essence all refer to one word, and frequency distributions will not be as accurate as with standard language. The fact that a classifier is unable to estimate the real value of a word hinders the final classification.


Chapter 3
Methodology

Text categorization (TC) (also known as text classification, or topic spotting) is the task of automatically sorting a set of documents into categories (or classes, or topics) from a predefined set. This task falls at the crossroads of information retrieval (IR) and machine learning (ML). This study is divided into two main parts based on two different techniques. In the first part we focus on collecting relevant messages from our data set. To do this we use, adjust and extend an information retrieval system. We extend this IR system to increase its overall efficiency by applying a technique called query expansion. The second part then focuses on building a classification system to perform text classification. We start with implementing a baseline classifier to be able to measure the general improvement and to compare our results with a more advanced classifier. Our baseline classifier will be an implementation of the relatively simple but widely used Naive Bayes (NB) method. A second, more advanced classifier will be an implementation of a support vector machine (SVM). These first two classification methods are supervised learning methods, which means they rely on a training set. In a last step we also implement a third classifier that does not have the ability to learn. This classifier will be based on prefabricated word lists and will, next to detecting offensive language, also have the ability to detect outrage in a message. The detection of ’outrage’ is however only used in the reactions, to support a possible prior classification of a blog message.
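As an illustration of the baseline, a multinomial Naive Bayes classifier with add-one smoothing can be sketched as follows. The toy training messages are invented for the example; the real baseline is trained on the labeled Netlog corpus and the class labels differ.

```python
import math
from collections import Counter, defaultdict

def train_nb(labeled_messages):
    """labeled_messages: iterable of (text, label).
    Returns log-priors, per-class word counts, and the vocabulary."""
    class_docs = Counter()
    word_counts = defaultdict(Counter)
    vocab = set()
    for text, label in labeled_messages:
        class_docs[label] += 1
        for w in text.lower().split():
            word_counts[label][w] += 1
            vocab.add(w)
    total = sum(class_docs.values())
    priors = {c: math.log(n / total) for c, n in class_docs.items()}
    return priors, word_counts, vocab

def classify_nb(text, priors, word_counts, vocab):
    """Pick the class maximizing log P(c) + sum of log P(w|c), add-one smoothed."""
    best, best_score = None, float("-inf")
    for c in priors:
        denom = sum(word_counts[c].values()) + len(vocab)
        score = priors[c] + sum(
            math.log((word_counts[c][w] + 1) / denom)
            for w in text.lower().split()
        )
        if score > best_score:
            best, best_score = c, score
    return best
```

Unseen words at classification time simply receive the smoothed count of one, so they never zero out a class.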

3.1 Information Retrieval

To be able to search through a vast set of messages, the use of an information retrieval system is essential. However, the term information retrieval is widely used and very broad. We use the definition proposed by [36]: “Information Retrieval is finding material (usually documents) of an unstructured nature that satisfies an information need from within large collections.”


3.1.1 Features

The eventual goal of an IR system is to satisfy a specific information need of the user from within a large collection of data. Although numerous different types of IR systems exist, depending on the specific context, they all comply with a general structure. In this section we very briefly discuss the basic functionality of such a system. Since an IR system relies on a collection of “unstructured data”, grouped per document, its first task is to virtually organize the different documents in the collection. This is done by the creation of an index. How a specific index is created heavily depends on the type of data we are working with. An index can be defined as “a list of words and corresponding pointers”1. An index can thus be compared to a thesaurus where every word refers to a list of pointers. These pointers represent the set of documents containing that word. The construction of an index makes it possible to identify a document based on its content. After creating an index, the next objective is to be able to efficiently describe an information need. An information need is mostly described by a summarizing sentence that lists the most important keywords. This is called a query. A query is handled by an information retrieval system by processing and selecting the (key)words in the query. These words are then used to find the matching documents in the index.
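The index construction described above can be sketched as a minimal inverted index; the document ids and texts below are invented examples, and real systems add tokenization, normalization and ranking on top of this.

```python
from collections import defaultdict

def build_index(docs):
    """docs: dict mapping document id -> text.
    Returns an inverted index: word -> set of document ids (the 'pointers')."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in text.lower().split():
            index[word].add(doc_id)
    return index

def search(index, query):
    """Conjunctive query: ids of documents containing every query word."""
    words = query.lower().split()
    if not words:
        return set()
    result = set(index.get(words[0], set()))
    for w in words[1:]:
        result &= index.get(w, set())
    return result
```

Looking a word up in the index immediately yields its posting set, so matching documents are found without scanning the whole collection.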

3.1.2 Evaluation

The evaluation of an information retrieval system is based on the amount of retrieved relevant documents. In this study we mainly focus on three important metrics:

Precision
Precision is the fraction of retrieved documents that are relevant:

Precision = |{relevant documents} ∩ {retrieved documents}| / |{retrieved documents}|

Recall
Recall is the fraction of relevant documents that are retrieved:

Recall = |{relevant documents} ∩ {retrieved documents}| / |{relevant documents}|

Average Precision (AP)
Average precision takes into account precision and recall simultaneously. This metric is especially useful when retrieved documents are ranked. Better ranking systems will yield a higher average precision as opposed to systems where relevant documents are more scattered [74].

Average Precision = \int_0^1 p(t)\, dr(t) = \sum_{i=1}^{n} p(i)\, \Delta r(i)

with \Delta r(i) being the change in recall from i − 1 to i, and p(i) the precision from 0 . . . i.

1 http://www.answers.com/topic/index, [2012-05-11]
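Given a ranked result list and the set of relevant documents, the three metrics can be computed as follows; the document ids in the test are invented, and average precision is computed in its usual discrete form, sampling precision at each rank where recall increases.

```python
def precision(retrieved, relevant):
    """Fraction of retrieved documents that are relevant."""
    return len(set(retrieved) & set(relevant)) / len(retrieved)

def recall(retrieved, relevant):
    """Fraction of relevant documents that are retrieved."""
    return len(set(retrieved) & set(relevant)) / len(relevant)

def average_precision(ranked, relevant):
    """Sum of p(i) * Delta r(i): precision at each rank where a relevant
    document appears, averaged over all relevant documents."""
    relevant = set(relevant)
    hits, ap = 0, 0.0
    for i, doc in enumerate(ranked, start=1):
        if doc in relevant:
            hits += 1
            ap += hits / i       # p(i); Delta r(i) = 1/|relevant| per hit
    return ap / len(relevant)
```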

3.1.3 Spelling Correction

The general purpose of spelling correction focuses on resolving typographical errors such as insertions, deletions, substitutions, and transpositions of letters that result in unknown words [8]. Unknown words are words that cannot be found in a trusted lexicon. However, 25 to over 50% of observed spelling mistakes are misspelled words resulting in a valid, though unintended, word [25]. For example, “I want you to be quite!” where “quite” should be spelled as “quiet”. The correction of a misspelled word strongly depends on the information available from the context. Multiple supervised and unsupervised learning methods exist, some based on a lexicon, others based on machine learning techniques [16, 17]. A very basic approach towards spelling correction is combining a trusted lexicon with the Levenshtein distance. When we find an unknown word, we search for the best match in the trusted lexicon according to the Levenshtein distance. This Levenshtein (or edit) distance is based on the number of insertions, deletions or substitutions required to transform one word into the other. Since the description of the information need (the query) in an IR system is of utmost importance, spelling correction can be of significant value. Misspelled words can change the entire focus of the query and consequently result in very few or no relevant documents at all.
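The basic approach described above (trusted lexicon plus Levenshtein distance) can be sketched directly; the small Dutch lexicon in the test is an invented example.

```python
def levenshtein(a, b):
    """Edit distance: minimal number of insertions, deletions and
    substitutions needed to transform a into b (dynamic programming)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        cur = [i]
        for j, cb in enumerate(b, start=1):
            cur.append(min(prev[j] + 1,                  # deletion
                           cur[j - 1] + 1,               # insertion
                           prev[j - 1] + (ca != cb)))    # substitution
        prev = cur
    return prev[-1]

def correct(word, lexicon):
    """Return the word itself if known, else the closest lexicon entry."""
    if word in lexicon:
        return word
    return min(lexicon, key=lambda w: levenshtein(word, w))
```

Note that this sketch cannot catch real-word errors such as “quite” for “quiet”, since “quite” is already in the lexicon; that case requires context.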

3.1.4 Stemming

Stemming is the process of reducing all words with the same root or stem to a common form [33]. A stemmer will for example match the words “fishing”, “fished”, “fish”, and “fisher” all to the same root word, “fish”. The most basic stemming algorithms only take into account morphological information to find a word’s reduced form. More advanced stemmers are language specific and try to use statistical and contextual information to find a word’s root form [69, 46]. Although we can generally increase the recall by using a stemmer, it can also significantly reduce precision by retrieving too many documents that have been incorrectly matched. When analyzing the results of applying stemming to a large number of queries, we notice that for every query that is helped by the technique, one is hurt [46].


The biggest problem that arises when using a stemmer is of course the fact that words with a totally different meaning are sometimes matched to the same root form. For example, stemming the Dutch words ’negeer ’, ’negeren’ and ’neger ’ all results in the same root ’neger ’. This is a typical example that will greatly reduce precision, as ’neger ’ can be considered offensive language, while ’negeer ’ definitely is not.
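The collision can be reproduced with a toy suffix-stripping stemmer. The two rules below (stripping the infinitive suffix ’-en’ and undoubling a long vowel) are a deliberately crude invention for illustration, not an actual Dutch stemming algorithm.

```python
def toy_stem(word):
    # Invented rule 1: strip the Dutch infinitive suffix '-en'
    # ('negeren' -> 'neger').
    if len(word) > 4 and word.endswith("en"):
        word = word[:-2]
    # Invented rule 2: undouble a long vowel before a final consonant
    # ('negeer' -> 'neger').
    if len(word) > 3 and word[-2] == word[-3] and word[-2] in "aeou":
        word = word[:-2] + word[-1]
    return word
```

All three forms reduce to ’neger ’, illustrating how the verb negeren (“to ignore”) and the offensive noun collapse into one stem and degrade precision.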

3.1.5 Part-of-Speech Tagger

Part-of-speech tagging (or POS tagging) is a technique that assigns an appropriate and context-related grammatical descriptor to words in a text. The early POS taggers distinguished words among eight different grammatical tags: noun, verb, particle, article, pronoun, preposition, adverb and conjunction [41]. Nowadays, the Penn Treebank tag set is often used as the basis to build an English POS tagger. This tag set, which is extensively described in [38], contains thirty-six different tags to describe a word. Since very few grammatical rules apply to more than one language, a general part-of-speech tagger does not exist. However, different part-of-speech taggers all more or less operate according to the same procedure [41]:
− Tokenization: divides a text into separate processing units and removes unwanted information.
− Ambiguity look-up: implies using a lexicon to tag known words and a guesser for tokens not represented in the lexicon.
− Ambiguity resolution or disambiguation: based on two information sources, one out of possibly multiple POS tags should be chosen.

Part-of-speech tagging can greatly affect the performance of an IR system [32]. Assigning a part-of-speech tag to a term in a query improves the overall description of the information need, which is an essential part of information retrieval. In other words, a user is able to discriminate between different senses in which a term is used. On top of that, the extra information provided by a POS tag can be used by a stemmer. This would, for example, solve the issue with ’negeer ’ and ’neger ’, where it is clear that ’negeer ’ (verb) and ’neger ’ (noun) do not carry the same POS tag and thus should not be mapped to the same root form.
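The three-step procedure can be illustrated with a minimal lexicon-based tagger. The tiny lexicon and the noun-default guesser are invented for illustration; a real tagger disambiguates using context statistics rather than simply taking the first candidate.

```python
# Step 2's lexicon: known words mapped to their possible POS tags
# (an invented four-word sample).
LEXICON = {
    "negeer": ["verb"],
    "neger": ["noun"],
    "de": ["article"],
    "loop": ["verb", "noun"],   # ambiguous: 'walk' (verb) or 'course' (noun)
}

def tag(text):
    tokens = text.lower().split()                  # step 1: tokenization
    tagged = []
    for tok in tokens:
        candidates = LEXICON.get(tok, ["noun"])    # step 2: look-up + guesser
        tagged.append((tok, candidates[0]))        # step 3: naive disambiguation
    return tagged
```

Even this sketch keeps ’negeer ’ (verb) and ’neger ’ (noun) apart, which is exactly the information a POS-aware stemmer would exploit.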

3.1.6 Query Expansion

Users tend to have difficulties formulating an appropriate query when searching for information on the web. According to one study, the average search query is 2.6 words long [1], which is in most cases too vague and too short to retrieve a large set of relevant documents. Good queries often presuppose knowledge of relevant documents, and on top of that users may not know how a query is used by a retrieval model. Query expansion tries to enhance the search, in order to find more relevant documents, by expanding the query with additional informative terms. Additional search terms define a more specific query that is less ambiguous and will result in documents that reflect the underlying information need. Existing methods to perform query expansion can be divided into two classes:
− Global methods are techniques for expanding or reformulating query terms independent of the query and the results returned from it, so that changes in the query wording will cause the new query to match other semantically similar terms.
− Local methods adjust a query relative to the documents that initially appear to match the query [36].

A popular technique categorized as a local method is relevance feedback. This technique is based on the possible relevance of highly ranked documents. This information is then used to further specify the query, to effectively improve the results. Relevance feedback can be exploited in a manual or automatic way.

3.1.6.1 Rocchio Classification

Rocchio classification is a method that stemmed from the SMART Information Retrieval System around 1970. It makes use of the vector space model. The algorithm is based on the assumption that users have a general idea of which documents should be denoted as relevant or irrelevant [36]. An ideal query has maximal similarity to the relevant documents and minimal similarity to the irrelevant documents. Suppose we have |D_r| relevant documents and |D_n| irrelevant documents. Then

Q = \frac{1}{|D_r|} \sum_{d_i \in D_r} d_i - \frac{1}{|D_n|} \sum_{d_j \in D_n} d_j

is the ideal query. Clearly, the relevant documents are not known. Assuming the user defines which retrieved documents are relevant, a set of relevant documents D_u (feedback) and irrelevant documents D_v (remaining documents) are available. Then

Q_{i+1} = \alpha \times Q_i + \frac{\beta}{|D_u|} \sum_{d_i \in D_u} d_i - \frac{\gamma}{|D_v|} \sum_{d_j \in D_v} d_j

is a modified query that shifts towards Q, with α, β, γ being tuning parameters and Q_i the original query [55]. Local methods such as the algorithm described above depend heavily on the relevance and similarity of the documents first retrieved. They also depend on which documents are selected as being relevant, whether this happens by hand or automatically. When the initial query retrieves only few relevant documents, a query drift can occur, which will lead the derived query much further away from the ideal query. A query drift occurs when the expanded form of the query changes the underlying “intent” [75]. In general, Rocchio query expansion leads to rather large queries and will often increase recall at the cost of precision [36].
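The modified-query formula can be sketched with sparse term-weight vectors represented as dictionaries. The default values α = 1, β = 0.75, γ = 0.15 are commonly cited textbook settings, not parameters tuned for this corpus, and the example vectors in the test are invented.

```python
from collections import defaultdict

def rocchio(query, relevant_docs, irrelevant_docs,
            alpha=1.0, beta=0.75, gamma=0.15):
    """query and each document: dict mapping term -> weight.
    Returns Q_{i+1} = alpha*Q_i + beta*centroid(D_u) - gamma*centroid(D_v)."""
    new_q = defaultdict(float)
    for t, w in query.items():
        new_q[t] += alpha * w
    for docs, coeff in ((relevant_docs, beta), (irrelevant_docs, -gamma)):
        if not docs:
            continue
        for doc in docs:
            for t, w in doc.items():
                new_q[t] += coeff * w / len(docs)
    # Negative term weights are usually clipped to zero.
    return {t: w for t, w in new_q.items() if w > 0}
```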


Figure 3.1: Example of Rocchio classification. Obtained from [36].

3.1.6.2 Local Feedback

Local feedback is a local method that considers the top m most highly ranked documents. The algorithm consists of the following steps:
− Add the n most frequent (non-stop word) terms in the m most highly ranked documents to the query.
− Disregard information from the other documents.

Many other variants exist (based upon clustering terms) and in recent years many improvements have been obtained on the basis of local feedback. This includes re-ranking the retrieved documents using automatically constructed fuzzy Boolean filters, clustering the top-ranked documents and removing the singleton clusters, and clustering the retrieved documents and using the terms that best match the original query for expansion [9]. TREC (Text REtrieval Conference) showed in 1996 that local feedback approaches are effective [68]. This technique has however an obvious drawback, as it relies on blind feedback, assuming the top-ranked documents are relevant. If a large fraction of the top-ranked documents is actually irrelevant, then the words added to the query are likely to be unrelated to the topic. Thus the effect of pseudo-feedback strongly depends on the quality of the initial retrieval.

Remark 1. Implicit relevance feedback. A document’s relevance is not based on direct user feedback; instead, indirect sources of evidence are used, rather than completely relying on the top-ranked documents. An example of an indirect source of evidence is an often-clicked document. Implicit feedback is less reliable than explicit feedback, but it is more useful than blind relevance feedback, because blind relevance feedback contains no evidence of user judgements [36].
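The two steps above can be sketched directly; the stop-word list and the ranked documents in the test are invented examples.

```python
from collections import Counter

STOP_WORDS = {"de", "het", "een", "en", "is"}  # tiny invented stop-word list

def expand_query(query_terms, ranked_docs, m=3, n=2):
    """Blind (pseudo-relevance) feedback: add the n most frequent
    non-stop-word terms from the top m ranked documents to the query."""
    counts = Counter()
    for doc in ranked_docs[:m]:            # assume the top m are relevant
        for w in doc.lower().split():
            if w not in STOP_WORDS and w not in query_terms:
                counts[w] += 1
    expansion = [w for w, _ in counts.most_common(n)]
    return list(query_terms) + expansion
```

If the top m documents are off-topic, the same code happily adds off-topic terms, which is exactly the query-drift risk discussed above.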


Remark 2. Relevance feedback does not always guarantee a successful query expansion.
− Misspellings: if the user spells a term differently from the way it is spelled in any document in the collection, then relevance feedback is unlikely to be effective. This is very likely when working with chat language, and is one of the reasons that shows the importance of stemming and the use of a decent spelling corrector.
− Cross-language information retrieval: documents (partly) in another language are not nearby in a vector space based on term distribution. This is again the case with chat language, as it is often influenced by foreign languages. On the contrary, documents from the same language tend to cluster.

3.1.6.3 Phrasefinder

Global methods consider all documents in the set for query expansion. The basic idea is that the global context of a concept can determine similarities between concepts. In other words, related terms will co-occur with each other. This is based on the association hypothesis, stating that words related in a corpus tend to co-occur in the documents of that corpus [70]. Concept and context can be defined in numerous ways. The simplest definition is the one stating that all words are concepts (except for stop words) and that the context for a word is the set of all words that co-occur with it in the set of documents [47]. More advanced definitions equate concepts with noun groups (containing multiple words) and define context as a fixed set of words surrounding the concept. The words surrounding the concept are called a window, which has a typical size of one to three sentences. These definitions are a result of the TREC-3 conference in 1994 and are used as a basis for the Phrasefinder technique. Every concept (noun group) is associated with a pseudo-document. The content of this pseudo-document consists of the words occurring in every window for that concept in the documents. First, these pseudo-documents are filtered by removing stop words and words that occur too often or too rarely. Then a database or thesaurus is automatically built with these pseudo-documents, creating a concept database. To expand a query, a ranked list of concepts is generated from this database. A number of concepts from this ranked list are added to the query, weighted appropriately. In general, global methods are very robust techniques that improve information retrieval. Some queries will however be degraded when irrelevant concepts are added. Another drawback in this case is that most general thesauri are expensive in terms of disk space and computer time to analyse the data and build the database [68].
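The pseudo-document construction can be sketched at the word level, treating every non-stop word as a concept and using a fixed window of surrounding words. This simplifies the noun-group concepts and sentence-sized windows of the actual Phrasefinder technique, and the stop-word list and documents are invented.

```python
from collections import defaultdict, Counter

STOP = {"de", "een", "en"}   # tiny invented stop-word list

def build_pseudo_docs(documents, window=2):
    """concept -> Counter of words co-occurring within +/- `window` positions;
    the Counter plays the role of the concept's pseudo-document."""
    pseudo = defaultdict(Counter)
    for text in documents:
        words = [w for w in text.lower().split() if w not in STOP]
        for i, concept in enumerate(words):
            context = words[max(0, i - window):i] + words[i + 1:i + 1 + window]
            pseudo[concept].update(context)
    return pseudo

def expand(query_term, pseudo, k=2):
    """Rank concepts co-occurring with the query term; return the top k."""
    return [w for w, _ in pseudo[query_term].most_common(k)]
```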

Automated Detection of Offensive Language Behavior on Social Networking Sites

3.1.6.4 Latent Semantic Analysis

Latent semantic analysis (LSA) is a mathematical method for computer modeling and simulation of the meaning of words and passages by analysis of representative corpora of natural text [26]. Latent Semantic Indexing (LSI) decomposes a term into a vector in a low-dimensional space. This is achieved using a technique called singular value decomposition. The hope is that related terms which are orthogonal in the high-dimensional space will have similar representations in the low-dimensional space, and that, as a result, retrieval based on the reduced representations will be more effective. The first step in the algorithm is to calculate the term-document frequency matrix. The entries in the term-document matrix are then transformed using an "ltc" weighting. This weighting takes the log of the individual cell entries, multiplies each entry for a term by the inverse document frequency weight of that term, and then normalizes for document length. The transformed term-document matrix is taken as input for the singular value decomposition algorithm. A best "reduced-dimension" approximation of this matrix is then calculated, which results in a reduced-dimension vector for each term and each document. The cosine between term-term, document-document, or term-document vectors is then used as the measure of similarity between them and can thus be used to build a global thesaurus [12]. Despite its potential, retrieval results using latent semantic indexing have so far not been shown to be conclusively better than those of standard vector space retrieval systems [70]. A serious problem with term clustering is that it cannot handle ambiguous terms. If a query term has several meanings, term clustering will add terms related to different meanings of the term and make the query even more ambiguous.
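The "ltc" weighting step can be sketched for one document column of the term-document matrix. This is a sketch under the common 1 + log(tf) interpretation of the log component, with cosine length normalisation; names are ours.

```java
// Sketch of the "ltc" transform for one document column: log of the raw
// term counts, multiplied by idf, then normalised to unit length.
// The 1 + log(tf) form of the log component is an assumption on our part.
class LtcWeighting {
    static double[] ltc(int[] tf, double[] idf) {
        double[] w = new double[tf.length];
        double norm = 0.0;
        for (int i = 0; i < tf.length; i++) {
            w[i] = tf[i] > 0 ? (1.0 + Math.log(tf[i])) * idf[i] : 0.0;
            norm += w[i] * w[i];
        }
        norm = Math.sqrt(norm);                   // cosine (length) normalisation
        for (int i = 0; i < w.length; i++) w[i] = norm > 0 ? w[i] / norm : 0.0;
        return w;
    }
}
```

The resulting unit-length columns are what the singular value decomposition would then be applied to.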

3.1.6.5 Local Context Analysis

Local context analysis attempts to combine local and global methods. It is based on the hypothesis that top-ranked documents tend to form several clusters [70] and that the number of relevant documents containing a query term is nonzero for (almost) every query term [70]. As in phrasefinder, noun groups are used as concepts, and concepts are selected based on co-occurrence with the query terms. Instead of choosing the concepts from the whole set of documents, only the top n ranked documents are considered, and only the best passages from this ranked list are taken into account. A passage is a text window of fixed size. Using passages is preferable to using whole documents, which can be very long, so that the co-occurrence of a concept at the beginning and a term at the end can be meaningless. It is also more efficient to use smaller parts of a document because it reduces the cost of processing. The algorithm is divided into the following steps: − Use a standard IR system to retrieve the top m passages.

Automated Detection of Offensive Language Behavior on Social Networking Sites

− Concepts in these passages are ranked according to the formula

bel(Q, c) = \prod_{t_i \in Q} \left( \delta + \frac{\log(af(c, t_i)) \cdot idf_c}{\log(n)} \right)^{idf_i}

where

af(c, t_i) = \sum_{j=1}^{n} f_{t_i j} \cdot f_{c j}

idf_i = \max\left(1, \frac{\log_{10}(N / N_i)}{5}\right), \quad idf_c = \max\left(1, \frac{\log_{10}(N / N_c)}{5}\right)

and

c is a concept.
f_{t_i j} is the number of occurrences of t_i in p_j.
f_{c j} is the number of occurrences of c in p_j.
N is the number of passages in the collection.
N_i is the number of passages containing t_i.
N_c is the number of passages containing c.
δ is a low nonzero value to avoid zero bel values.

This formula is a variation on the tf-idf formula, which is used in most information retrieval systems. The af part rewards concepts frequently occurring with query terms; idf_c penalizes concepts frequently occurring in the collection; idf_i emphasizes infrequent concepts. Finally, multiplication is used to emphasize co-occurrence with all query terms [68].

− In the third step the top m concepts are chosen from the ranked list and added to the query. Concepts are weighted in proportion to their ranks, so that a higher-ranked concept is weighted more heavily than a lower-ranked concept. Concepts are added to the query according to the following formulas:

Q_new = #WSUM(1.0 1.0 Q_old wt Q′)
Q′ = #WSUM(1.0 1.0 wt_1 c_1 wt_2 c_2 ... wt_m c_m)
wt_i = 1.0 − 0.9 · i / m

with c_i being the i-th ranked concept. The default value for wt is 2.0. #WSUM is an INQUERY operator that combines evidence from different parts of a query; specifically, it computes a weighted average of its operands.

Local context analysis has several advantages. Once the top-ranked passages are retrieved, query expansion performs fast. This technique also performs well with regard to proximity constraints. Phrasefinder, for example, can add concepts to the query that co-occur with all query terms but still do not match proximity constraints.
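The belief score above can be computed directly once the co-occurrence and passage counts are available. The sketch below evaluates the product for one concept; the precomputed arguments and method names are ours, and natural logarithms are assumed where the base is unspecified.

```java
// Sketch of the local-context-analysis belief score bel(Q, c) from the
// formula above. af[i] holds af(c, t_i) and idfI[i] holds idf_i for the
// i-th query term; n is the number of top-ranked passages, delta the small
// nonzero constant. Signatures are our own.
class LcaScore {
    static double bel(double[] af, double[] idfI, double idfC, int n, double delta) {
        double bel = 1.0;
        for (int i = 0; i < af.length; i++) {
            double term = delta + Math.log(af[i]) * idfC / Math.log(n);
            bel *= Math.pow(term, idfI[i]);   // exponent idf_i per query term
        }
        return bel;
    }
}
```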


3.1.6.6 Lexical Semantic Algorithms

Lexical semantic algorithms make use of the semantics in a document to expand and enhance a query. The basic idea of these algorithms is to expand the query focusing on the original query terms and their synonyms, acronyms, etc. These methods do not make use of any statistical data, but rely on an external lexicon or thesaurus. As a consequence, we are dealing with a language-dependent method. The algorithm described in [67] makes use of the online database WordNet, which acts as a thesaurus. WordNet is a large lexical database of English. Nouns, verbs, adjectives and adverbs are grouped into sets of cognitive synonyms (synsets), each expressing a distinct concept. Synsets are interlinked by means of conceptual-semantic and lexical relations [40]. One of the biggest challenges in this algorithm is to properly choose which query terms to expand based on the thesaurus.

3.1.6.7 Wikipedia as Knowledge Base

Different articles state that using Wikipedia as an extra knowledge base to perform query expansion can be useful [6, 30]. Wikipedia can be used to enhance our knowledge about certain concepts. In [30], Li et al. run every query both on the target corpus and on Wikipedia. Experiments pointed out that retrieval performance with Wikipedia-based query expansion was superior to performance without it. In [20] an algorithm is developed to automatically construct a concept graph using Wikipedia. This approach is statistical and language-independent.

3.1.6.8 Manually Constructed Thesaurus

A manually constructed thesaurus is a method that speaks for itself. To apply this method, a thesaurus is first constructed by hand. To expand a query, every query term is matched against the thesaurus and the corresponding terms are added to the query.

3.1.7 Open Source Software

The best-known and most popular information retrieval system is of course Google's search engine, but besides Google many open source IR systems exist. We list some freely available, Java-based IR software packages. Different systems of course offer different advantages and drawbacks.

3.1.7.1 Lucene

Apache Lucene(TM) is a Java-based, full-featured text search engine library, originally developed by Doug Cutting. Lucene is a very popular search framework and is used by multiple large companies such as IBM, Twitter and Apple.2 Key features of Apache Lucene are:

2 http://wiki.apache.org/lucene-java/PoweredBy, [2012-05-11]



− Lucene's core is based on the idea of a document containing fields of text. This allows Lucene's API to be independent of the file format and enables text from PDF, HTML, Microsoft Word and OpenDocument documents to be indexed.
− Lucene is backed by a large community, which facilitates finding a solution for a certain problem.
− It enables scalability and high-performance indexing.
− It offers powerful, accurate and efficient search algorithms.

Lucene provides multiple ways to enhance a query:

− Wildcard Searches: it is possible to use single or multiple character wildcards to broaden the search.
− Fuzzy Searches: a fuzzy search is based on the Levenshtein distance algorithm. This makes it possible to search on slight variations of the original word.
− Boosting a Term: a term or phrase can be boosted by adding a caret followed by a boost factor to the term. By boosting a term you can control the relevance of a document.

In Lucene the relevance of a document is calculated taking into account the following factors:

tf_q : term frequency of t in query q
tf_d : term frequency of t in document d
idf_t : inverse document frequency of t
numDocs : number of documents in the index
docFreq_t : number of documents containing t
norm_q : norm of query q
norm_dt : square root of the number of tokens in document d in the same field as term t
boost_t : boost factor for term t
coord_qd : number of terms occurring in both query q and document d, divided by the number of terms in query q

This gives the following equation to calculate the relevance of a certain document:

score_d = \left( \sum_{t \in q} \frac{tf_q \cdot idf_t \cdot tf_d \cdot boost_t}{norm_q \cdot norm_{dt}} \right) \cdot coord_{qd} \quad (3.1)
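Equation (3.1) can be illustrated by computing it directly from the per-term factors. This is a sketch of the formula only, not a call into the Lucene API; all names are ours.

```java
// Sketch of the scoring equation quoted above: per-term products of the
// query and document factors, divided by the norms, summed over the query
// terms and scaled by the coordination factor. Illustration only.
class LuceneStyleScore {
    static double score(double[] tfq, double[] tfd, double[] idf, double[] boost,
                        double normQ, double[] normDt, double coord) {
        double sum = 0.0;
        for (int t = 0; t < tfq.length; t++) {
            sum += (tfq[t] * idf[t] * tfd[t] * boost[t]) / (normQ * normDt[t]);
        }
        return sum * coord;
    }
}
```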

3.1.7.2 Terrier

Terrier, Terabyte Retriever, is a project that was initiated at the University of Glasgow in 2000. The goal of the project was to provide a flexible platform for the rapid development of large-scale Information Retrieval applications. On top of that, it provides a state-of-the-art test bed for research and experimentation in the wider field of IR. The Terrier project explored novel, efficient and effective search methods for large-scale document collections, combining new and cutting-edge ideas from probabilistic theory, statistical analysis, and data compression techniques [43]. Automatic pseudo-relevance feedback (cf. section 3.1.6) is included in the open source engine. The method works by taking the most informative terms from the top-ranked documents of the query and adding these new related terms to the query. The new query is reweighted and rerun, thus providing a richer set of retrieved documents. Terrier provides several term weighting models from the DFR framework, which are useful for identifying informative terms from top-ranked documents. In addition, Terrier also includes well-established pseudo-relevance feedback techniques, such as Rocchio's method (cf. section 3.1.6.1). Automatic query expansion is highly effective for many IR tasks.

3.1.7.3 Xapian

Xapian is a highly adaptable toolkit which allows developers to easily add advanced indexing and search facilities to their own applications. It supports the Probabilistic Information Retrieval model as well as a rich set of boolean query operators. Xapian supports the following important features: − Relevance feedback: given one or more documents, Xapian can suggest the most relevant index terms to expand a query, suggest related documents, categorise documents, etc. − Stemming of search terms (e.g. a search for "football" would match documents which mention "footballs" or "footballer"). This helps to find relevant documents which might otherwise be missed. Stemmers are currently included for Danish, Dutch, English, Finnish, French, German, Hungarian, Italian, Norwegian, Portuguese, Romanian, Russian, Spanish, Swedish, and Turkish. − Synonyms, supported both explicitly (e.g. "cash") and as an automatic form of query expansion.

3.2 Machine Learning Techniques

Text classification can be performed in various ways. Nowadays, the most popular and effective techniques are machine learning techniques. Machine learning is a subdomain of Artificial Intelligence that deals with algorithms which can learn based on previous experience. In other words, these algorithms are trained on previously labeled data, from which they can infer certain properties and patterns. These properties can then be applied to classify new data samples. Machine learning techniques can be supervised as well as unsupervised, or can apply reinforcement learning.



The major weakness of these techniques is their domain dependency, which means that new properties and unseen patterns in data will very likely lead to falsely classified data samples. As language can vary in infinite ways, rarely occurring patterns are the rule rather than the exception. Machine learning techniques are often limited to generalizing based on already known information.

3.2.1 Data Representation

Since we are dealing with textual data, it is vital that this information is extracted and represented in an efficient and complete manner. The simplest approach is using a bag-of-words model. This is an unordered set of words (features), disregarding grammar and even the exact position of the words [58]. Each distinct word corresponds to a feature, with the frequency of the word in the document as its value. Only words that do not occur in a stop list are considered. Although a bag-of-words representation seems to throw away significant information, it is a very simple yet effective representation of a document [22]. Another way of representing a textual document is by using N-grams. N-grams provide the ability to identify n-word expressions. For example, 'I love you' can be defined as a 3-gram and provides a whole lot more information than the separate words 'I', 'love' and 'you'. However, it is not recommended to use n-grams with more than three words: co-occurrences of certain n-grams will be less likely to be found as the size of the n-grams increases. The extraction of words or concepts from a document can also be influenced by whether stemming (cf. section 3.1.4) is applied. On top of that, extra contextual information can be added by applying a part-of-speech tagger (cf. section 3.1.5), detecting emoticons, excessive use of capital letters, etc.
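The two representations described above can be sketched in a few lines; class and method names are ours.

```java
import java.util.*;

// Sketch of the two representations discussed above: a bag-of-words
// frequency map (stop words removed) and word n-grams over a token list.
class TextRepresentation {
    static Map<String, Integer> bagOfWords(String[] tokens, Set<String> stop) {
        Map<String, Integer> bag = new HashMap<>();
        for (String t : tokens)
            if (!stop.contains(t)) bag.merge(t, 1, Integer::sum);  // count frequency
        return bag;
    }

    static List<String> ngrams(String[] tokens, int n) {
        List<String> grams = new ArrayList<>();
        for (int i = 0; i + n <= tokens.length; i++)
            grams.add(String.join(" ", Arrays.copyOfRange(tokens, i, i + n)));
        return grams;
    }
}
```

For instance, the 3-grams of the tokens "I love you" consist of the single expression "I love you", while the bag-of-words drops the ordering entirely.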

3.2.2 Feature Selection

Feature selection (also known as subset selection or feature reduction) is the technique of selecting a subset of relevant features. In a textual environment, effective feature selection is essential to improve the efficiency of the learning task and increase general accuracy [14]. Text classification in general, and offensive language detection more specifically, contains positive (relevant) classes and a negative (irrelevant) class. In most cases a relevant document is characterized by only a few specific features. Non-informative features are often described as noise: features that do not provide any extra information to further enhance the decision-making process. Eliminating noise and selecting only the most informative features is the key task of feature selection.

The overall feature selection procedure works by giving each potential feature a score according to a particular metric. Then the best k features are selected and only those features are used in the further text classification process [14]. Multiple feature selection metrics exist and make use of different aspects of the data. In essence, they can be divided into wrappers, filters, and embedded methods [19].

− Wrappers rank features according to their predictive power; this score is obtained by utilizing a learning machine as a black box to determine the most predictive subset of features. A wrapper method often uses sequential stepwise selection to gradually remove or add features at each step [62].
− Filters select subsets of features as a pre-processing step, not depending on the chosen predictor. Filters perform feature selection faster than wrappers and create a more general subset of features. A well-known filter method is Information-Gain-based feature selection [14].
− Embedded methods include feature selection in the training process of the learning machine. Embedded methods are in general more efficient than wrappers, since they do not require continuous retraining of the learning machine and splitting of the data (into validation and training sets) [19]. An example of a learning algorithm that embeds feature selection is CART [31].

A well-known and widely used feature selection (filter) method is Mutual Information (MI). This method computes the mutual information of a term t and a class c: a measure of how much information the presence or absence of the term contributes to making a correct decision on whether or not a message belongs to c [37]. In short, the algorithm selects the k terms per class with the highest mutual information.
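The mutual information of a term/class pair can be computed from four document counts. The sketch below follows the standard formulation over the term-presence and class-membership events; names are ours.

```java
// Sketch of mutual information I(U;C) for a term/class pair, from four
// counts: n11 documents of class c containing the term, n10 containing the
// term but outside c, n01 in c without the term, n00 with neither.
class MutualInfo {
    static double mi(int n11, int n10, int n01, int n00) {
        double n = n11 + n10 + n01 + n00;
        return part(n11, n11 + n10, n11 + n01, n)   // term present, class c
             + part(n10, n11 + n10, n10 + n00, n)   // term present, not c
             + part(n01, n01 + n00, n11 + n01, n)   // term absent, class c
             + part(n00, n01 + n00, n10 + n00, n);  // term absent, not c
    }

    private static double part(int joint, int rowSum, int colSum, double n) {
        if (joint == 0) return 0.0;                 // limit x log x -> 0
        return (joint / n) * (Math.log((n * joint) / ((double) rowSum * colSum)) / Math.log(2));
    }
}
```

A term distributed independently of the class yields a score of zero; a term perfectly predicting the class yields the maximum of one bit.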

3.2.3 Classification Techniques

As described above, machine learning techniques can be divided into three different categories: − Supervised learning (also known as inductive learning) is the process of learning a set of rules from labeled training samples [35]. − In unsupervised learning the algorithm is given a set of unlabeled samples and is supposed to group these into classes, without any prior knowledge of labeled output or the specific classes [35]. − Reinforcement learning is a technique where an algorithm depends on external feedback to base its decisions on. The learning algorithm is not told which actions should be executed, but must determine the effectiveness of certain actions based on the given reward. Without this feedback, the learning algorithm has no ground on which to decide which actions it should attempt [35]. We present two classification techniques, both supervised.


3.2.3.1 Naive Bayes

Naive Bayes (NB) is a simple yet effective classifier that is based on the so-called Naive Bayes assumption. This assumption states that all attributes of the examples are independent of each other given the context of the class [54]. Mathematically, this is described as P(X|C) = \prod_{i=1}^{n} P(X_i|C), where X = (X_1, ..., X_n) is a feature vector and C represents a class. Calculating the document probability comes down to multiplying the probabilities of all individual words in that document [54]. The class with the highest probability is then the final classification of the document. In [49], however, multiple improvements are proposed to eliminate several well-known issues that plague the regular Naive Bayes method. For example, the different features are given a score per class based on the normalized tf-idf metric (cf. chapter 4). Despite its often solid performance, Naive Bayes is frequently used as a baseline against which more advanced classification methods are compared [54].
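A minimal multinomial Naive Bayes scorer can illustrate the independence assumption. This is a toy sketch with Laplace smoothing, working in log space to avoid underflow; it is not the tf-idf variant of [49], and all names are ours.

```java
import java.util.*;

// Toy multinomial Naive Bayes sketch: the log-probability of a message
// under one class is the log prior plus the sum of smoothed per-word log
// probabilities, reflecting the independence assumption discussed above.
class TinyNaiveBayes {
    static double logScore(String[] message, Map<String, Integer> classWordCounts,
                           int classTotalWords, int vocabSize, double logPrior) {
        double score = logPrior;
        for (String w : message) {
            int count = classWordCounts.getOrDefault(w, 0);
            // Laplace (add-one) smoothing avoids zero probabilities
            score += Math.log((count + 1.0) / (classTotalWords + vocabSize));
        }
        return score;
    }
}
```

The class with the highest log score is chosen, mirroring the maximum-probability decision described above.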

3.2.3.2 Support Vector Machine

A Support Vector Machine (SVM) is often regarded as the classifier with the most success, yielding the highest accuracy results for many classification problems [66]. Much simplified, an SVM represents samples as points mapped to a high-dimensional space and constructs a hyperplane with a maximal Euclidean distance between the differently labeled samples. When classifying a new sample, it is mapped to the same space, and its predicted class depends on which side of the hyperplane it is located [63]. It is important to notice that this SVM hyperplane is determined by only a small subset of the training instances, which are called support vectors (cf. Figure 3.2) [63]. A classification problem mapped to a vector space can be linearly separable or nonlinearly separable. Two sets of points in an n-dimensional space are linearly separable if they can be completely separated by a hyperplane. This hyperplane is a decision surface of the form w · x + b = 0, where w is an adjustable weight vector, x is an input vector, and b is a bias term [15]. SVMs learn a linear threshold function, designed to solve linearly separable data sets.


Figure 3.2: SVM hyperplane construction in the idealized case of linearly separable data. Obtained from [63].

However, by using an appropriate kernel function, an SVM can be generalized, allowing it to learn nonlinear functions (such as polynomial or radial basis functions) [22]. Nonlinear separation surfaces can be approximated by linear surfaces by mapping them to a higher-dimensional space [64]. How this exactly happens lies outside the scope of this thesis; figure 3.3 shows a visualisation of this process, where ϕ represents the kernel function.

Figure 3.3: Feature vectors are mapped from a two-dimensional space to a three-dimensional embedding space. Obtained from [64].



Often data sets are not perfectly separable, which leads either to wrongly classified training samples or to overfitting. A required and important parameter when dealing with Support Vector Machines is the cost parameter C. This parameter controls the trade-off between allowing training errors and forcing rigid margins. Increasing the value of C increases the cost of misclassified samples and forces the SVM to create a more accurate model. However, this model may not generalize well and thus will not correctly estimate new samples (overfitting). It is therefore essential to find a well-balanced value for the cost parameter C.

3.3 Lexicon Based Text Classification

Lexicon based text classification algorithms rely on manually constructed word lists. Every word in a list belongs to a certain category. In the early 1990s there were no standardized labeled data sets available [44]. As a result, most researchers only considered proposals for systems or designed prototypes. Most systems also had no learning capability and focused on simpler classification tasks [44]. Despite the fact that these methods are rather static and outdated, they can still be useful in some cases. In [61], fuzzy logic is combined with natural-language processing to analyze the affect content in free text.

3.4 Evaluation

The evaluation of classification methods is based on the number of correctly classified messages as opposed to falsely classified messages. We distinguish four different situations when a new message is classified: − True Positive (TP): the classifier correctly indicates that the message contains offensive language. In other words, the message is rightly classified as sexual or racist. − True Negative (TN): the classifier correctly indicates that the message does not contain offensive language. In other words, the message is rightly classified as irrelevant. − False Positive (FP): the classifier wrongly labels the message as containing abusive language. − False Negative (FN): the classifier wrongly indicates that the message is irrelevant. The simplest way to evaluate the performance of a classification system is to analyze its accuracy. Accuracy shows the general correctness of a classifier and is calculated as follows:

Accuracy = (TP + TN) / (TP + TN + FP + FN)


However, a classification system that automatically labels all samples as irrelevant (negative) would yield very high accuracy on corpora that contain only a very small number of positive samples. Accuracy is thus not a useful metric when one category dominates the others. Therefore, it is important to again make use of the metrics precision and recall, which are interpreted in the same way as in information retrieval (cf. section 3.1.2). These metrics are defined as in [42]:

Precision = TP / (TP + FP)

Recall = TP / (TP + FN)

By analyzing precision and recall we can form a better understanding of the performance of the detection of offensive messages. This study also uses the F1 measure, which is an evenly weighted combination of both precision and recall:

F1 = 2 × (precision × recall) / (precision + recall)
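The evaluation metrics defined above can be computed directly from the confusion-matrix counts; class and method names are ours.

```java
// Sketch of the evaluation metrics defined above, computed from the
// confusion-matrix counts TP, TN, FP and FN.
class Metrics {
    static double accuracy(int tp, int tn, int fp, int fn) {
        return (tp + tn) / (double) (tp + tn + fp + fn);
    }
    static double precision(int tp, int fp) { return tp / (double) (tp + fp); }
    static double recall(int tp, int fn)    { return tp / (double) (tp + fn); }
    static double f1(int tp, int fp, int fn) {
        double p = precision(tp, fp), r = recall(tp, fn);
        return 2 * p * r / (p + r);   // evenly weighted harmonic mean
    }
}
```

Note how, for a highly skewed corpus, a trivial all-negative classifier scores well on accuracy but gets an undefined or zero precision and recall, which is exactly why the latter two are preferred here.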

Apart from testing our data on a separate test set, it is also possible to extract a small part of our training set to use as a validation set. This method is applied by dividing the training set into k parts. The classifier is then subsequently trained on k − 1 parts and tested on the remaining part. This technique is called cross validation and provides us with an accurate view of the performance of a classifier.


Chapter 4

Design and Implementation

This chapter discusses the system's most important design decisions and solutions. It starts by describing an information retrieval system, extended with a query expansion technique. Subsequently, the implementation of our offensive language detection system is explained.

4.1 Information Retrieval

As already mentioned, we intend to use an IR system to efficiently retrieve relevant documents to build a training set. Section 3.1.7 discusses several open source software packages that provide IR solutions. We decide to base our information retrieval system on Apache Lucene. Lucene provides an API to index and search large amounts of data in an efficient and performant manner. We choose Lucene over the other solutions because it is an advanced platform that can easily be extended and integrated into another system. Lastly, Lucene is well known and extensive support is provided by a large community.

4.1.1 Query Expansion

The aim of query expansion, together with several techniques describing how to apply it, has already been extensively described (cf. section 3.1.6). However, we need to determine which query expansion method best suits our purposes. Avoiding query drift is the biggest challenge when applying query expansion. Another issue faced when implementing query expansion, and one this study in particular endures, is chat language. Chat language implies that multiple relevant features (e.g. 'eigenlijk') occur in many different forms (e.g. 'eigelijk', 'eiglijk', 'eigelyk', etc.) in a text. Both global and automatic local methods will have difficulties effectively expanding a query, since it is much harder than usual to decide which features or documents are relevant. Using lexical and semantic relations to perform more efficient query expansion also does not seem suitable for this study: the Dutch language has no WordNet alternative, and again we predict that faulty and incomplete chat messages will not contain much structure or grammar.


Therefore, the manual and local query expansion technique of Rocchio is chosen (cf. section 3.1.6.1). Rocchio classification performs query expansion by indicating which documents are relevant (and which are not). In this way, we have more control and can more easily avoid query drift. Since the main goal of query expansion is to aid the construction of a training set, it is not an issue that user input is required. We implement Rocchio classification by making use of an existing extension of Lucene, i.e. LucQE. LucQE (Lucene Query Expansion) provides several modules to perform query expansion [57]. One of these modules is Rocchio query expansion. This method is slightly adapted to fit into our general system. Figure 4.1 gives a graphical representation of the approach towards the first stage of this study. The Netlog corpus is indexed using Lucene, which is also used to search through the data. The workflow consists of the following sequence:

− We insert a query into our Lucene system.

− Lucene returns all messages matching the query, ranked according to relevance.

− We select all messages that contain abusive language and specify whether they belong to the ’sexual’ or ’racist’ category.

− In the next step Rocchio classification is applied by expanding the base query with new words, originating from the relevant messages. Every message that was labeled is now also saved, so it can be reproduced to build our training set.

− In the following step the expanded query is inserted into the IR system and matching messages are displayed.

− These new terms can also be saved to construct a thesaurus. This thesaurus then contains informative terms grouped per category, which can be used to build a descriptive word list.
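The Rocchio expansion step of the workflow above can be sketched over term-weight maps, assuming the standard Rocchio formulation of section 3.1.6.1, q′ = α·q + (β/|D_r|)·Σ d_r − (γ/|D_nr|)·Σ d_nr; all class and method names are ours.

```java
import java.util.*;

// Sketch of one Rocchio update over sparse term-weight vectors: the query
// moves toward the centroid of the documents marked relevant and away from
// the centroid of those marked non-relevant.
class RocchioExpansion {
    static Map<String, Double> expand(Map<String, Double> query,
                                      List<Map<String, Double>> relevant,
                                      List<Map<String, Double>> nonRelevant,
                                      double alpha, double beta, double gamma) {
        Map<String, Double> q = new HashMap<>();
        query.forEach((t, w) -> q.merge(t, alpha * w, Double::sum));
        for (Map<String, Double> d : relevant)
            d.forEach((t, w) -> q.merge(t, beta * w / relevant.size(), Double::sum));
        for (Map<String, Double> d : nonRelevant)
            d.forEach((t, w) -> q.merge(t, -gamma * w / nonRelevant.size(), Double::sum));
        return q;
    }
}
```

Terms from the labeled relevant messages thus enter the expanded query with positive weight, which is exactly how new terms end up in the thesaurus described above.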


Figure 4.1: Graphical representation of this study’s approach.

4.2 Offensive Language Detection

To reach our final goal, two supervised classifiers are created. First, a Naive Bayes classifier is implemented; afterwards, this study proceeds by designing a Support Vector Machine. Naive Bayes classification is a rather simple method, but it has proven to perform solidly in many cases. Moreover, we can use the performance of our NB classifier as a baseline to assess the overall performance of our Support Vector Machine.

4.2.1 Naive Bayes

We describe the algorithm proposed by [49] and used in our system in more detail. Assume we have a total of N features and a fixed set of m classes. The following is calculated:

− The Normalized Frequency for a term (Tf) in a document is calculated by dividing the term frequency by the root mean square of the term frequencies in that document.
− The Weight Normalized Tf for a given feature in a given class equals the sum of the normalized frequencies of the feature across all the documents in the class.
− The Weight Normalized Tf-Idf for a given feature in a class is the Tf-Idf calculated using the standard idf, multiplied by the Weight Normalized Tf.
− The sum of the Weight Normalized Tf-Idf (W-N-Tf-Idf) values over all the features in a label is named Sigma_k.

For every term in a new message, a weight can be calculated depending on its Weight Normalized Tf-Idf in that class:

Weight = \log\left(\frac{\text{W-N-Tf-Idf} + 1}{\Sigma_k + N}\right)

The probability of a message belonging to a certain class is then calculated by running over every term in the message and summing their weights.
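The two basic computations above can be sketched as follows; method names and the exact argument shapes are ours.

```java
// Sketch of two steps of the algorithm described above: the normalized
// term frequency (term frequency divided by the root mean square of all
// term frequencies in the document) and the per-class log weight.
class WeightNormalized {
    static double normalizedTf(double tf, double[] allTfInDoc) {
        double sumSq = 0.0;
        for (double f : allTfInDoc) sumSq += f * f;
        double rms = Math.sqrt(sumSq / allTfInDoc.length);  // root mean square
        return tf / rms;
    }

    static double weight(double wnTfIdf, double sigmaK, int nFeatures) {
        return Math.log((wnTfIdf + 1.0) / (sigmaK + nFeatures));
    }
}
```

Summing `weight` over every term of a new message, per class, yields the score from which the most probable class is chosen.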

4.2.2 Support Vector Machine

In order to implement our SVM algorithm, an available open source software package is used. Well-known packages are LibSVM, LibLinear and SVM-Light. We choose to work with LibLinear, which inherits many features of the popular SVM library LibSVM [13]. LibLinear is built to perform optimally on linear, high-dimensional classification problems, and thus typically on text classification. To be able to work with LibLinear we need to transform our data into the data format used by the LibLinear package. LibLinear, just like LibSVM and SVM-Light, expects to receive an array of feature nodes per sample. This array of nodes represents a vector in the n-dimensional vector space. In order to convert a textual message to an array of feature nodes, we extract all features from the message and calculate the 'Tf-Idf' score of every feature. This study also provides a feature selection method based on mutual information (cf. section 3.2.2), a filter method that is applied by removing or ignoring all non-informative terms. Since feature selection is a technique to reduce the amount of noise, we analyze whether our system benefits from using it.

4.2.3 Semantic Methods

Apart from our two supervised classification methods, we also implement a classifier based on prefabricated word lists. The main reason to also design a semantic classification method is to be able to detect 'outrage' in a reaction, in order to support or improve the performance of our supervised classifiers. In this way we create two separate methods that do not, by default, share the same flaws.


Sexual          Racist        Outrage
kuste           homo          racistisch
hoeren          doden         vuil
kont            neger         vies
geile meiden    eigen land    niet kunnen

Table 4.1: Random entries from our word lists.

Word lists

These word lists are constructed by using the query expansion module on the training set and are refined in a manual way. In other words, the top terms are extracted from our labeled ’racist’ messages to construct our ’racist’ word list. The same procedure is followed in order to create a ’sexual’ word list. Finally, a list to detect ’outrage’ is constructed by searching and applying query expansion on reactions that express contempt. These lists are then manually further adjusted and extended. An entity in our word list contains the following properties: The score of a word is a measure of the importance of that word regarding to a specific class. The score property is derived from the rocchio classification term ranking. The property vital is a boolean that indicates whether the word is important or not. These first two properties are only used in our initial algorithms, and are not taken into account in the final implementation. We eventually use the third attempt as our final algorithm (cf. chapter 5), which ignores the properties score and vital. The main properties present in the word list are intensity, centrality and pos. Intensity expresses the strength of a word, while centrality indicates the degree of relatedness to a certain class. For example, the word ’neger’ has a greater intensity level compared to the word ’zwarte’. Centrality is a measure that suggests how much a term is linked to a class. Many words relate in some context to a class, but in itself are no offensive words (eg. ’vuil’ ). Both centrality and intensity represent decimals from zero to one. Finally, every word also has a part-of-speech tag, which is not extensively used in this study, but allows future enhancements to create a more advanced algorithm. The values of the intensity, centrality and pos tag are manually crafted and adjusted, based on the authors experience and assessment skills. Lastly, the possibility to add bigrams to the word lists is added. 
Bigrams can greatly improve performance since they often combine two regular words into one offensive concept (e.g. ’white power’ ).
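Such a word-list entry can be sketched as a small record type. The field names and example values below are illustrative only and do not correspond one-to-one to the actual implementation:

```python
from dataclasses import dataclass

@dataclass
class WordListEntry:
    term: str          # unigram or bigram, e.g. "white power"
    intensity: float   # strength of the word, in [0, 1]
    centrality: float  # degree of relatedness to the class, in [0, 1]
    pos: str           # part-of-speech tag

# hypothetical entries for a 'racist' word list; values are made up for illustration
racist_list = [
    WordListEntry("neger", intensity=0.9, centrality=0.9, pos="N"),
    WordListEntry("zwarte", intensity=0.4, centrality=0.6, pos="N"),
    WordListEntry("vuil", intensity=0.3, centrality=0.2, pos="ADJ"),
]
```

Note how the first entry carries a higher intensity than the second, and how a context-dependent word like ’vuil’ gets a low centrality.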

Automated Detection of Offensive Language Behavior on Social Networking Sites

Algorithms

We implemented three different algorithms, ranging from a straightforward baseline to a more advanced algorithm that uses the intensity and centrality concepts.

Simple Algorithm runs over the different features of a message and sums the scores of all features that have a match in the word list. The class with the highest score is the eventual result of the classification. If no hits are found, the message is logically classified as irrelevant.

Vital Algorithm runs over every word of the message and checks if the word can be found in a certain word list. If the word is indeed found, its score is added to the total (per class) and we check whether it is a vital word or not. Messages with too few or no vital words are automatically classified as irrelevant. We make use of the vital property to lower the number of false positives: as explained above, many words in our lists contribute to abusive language, but are not offensive on their own. Since spelling correction is incorporated, words that are not found in a word list at first can still be matched to a closely related one (based on the Levenshtein distance). However, this is a rather naive approach, since words are often not properly corrected (cf. chapter 5).

Fuzzy Algorithm runs per class over every word of the message and checks for possible hits. If a hit is found, the intensity and centrality of the word are analyzed. The score of a message per class is solely based on the intensity of the matched words. This score is however reduced to zero whenever the unique hits in a message do not reach a total centrality of at least 1.0, with at least one hit having a centrality of at least 0.7. The fuzzy algorithm, in other words, fine-tunes the vitality concept used in the algorithm above: a word can belong more or less to a certain class based on its centrality.
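A minimal sketch of the fuzzy scoring rule for one class. The representation of a word list as a dictionary mapping terms to (intensity, centrality) pairs is a simplification; the thresholds 1.0 and 0.7 are those stated above:

```python
def fuzzy_score(features, word_list):
    """Score a message for one class.

    features: list of (spell-corrected) tokens of the message.
    word_list: dict mapping term -> (intensity, centrality).
    """
    # unique hits: each matched term counts once
    hits = {t: word_list[t] for t in set(features) if t in word_list}
    score = sum(intensity for intensity, _ in hits.values())
    total_centrality = sum(c for _, c in hits.values())
    max_centrality = max((c for _, c in hits.values()), default=0.0)
    # reduce the score to zero unless the centrality thresholds are met
    if total_centrality < 1.0 or max_centrality < 0.7:
        return 0.0
    return score

# hypothetical word list for one class, values made up for illustration
wl = {"neger": (0.9, 0.9), "vuil": (0.3, 0.2)}
```

A message containing both terms passes the centrality check and keeps its intensity-based score, while a message containing only the low-centrality word ’vuil’ is scored zero.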

4.2.4

Combination of Methods

Our final classification method makes use of both our Support Vector Machine classifier and our fuzzy classifier based on word lists. It is worth mentioning that it was not originally our intention to combine several classifiers into a single system. However, our initial Support Vector Machine did not yield the expected results (cf. chapter 5). We therefore tried to improve our classification system by incorporating the information provided by reactions to a message. We did this, as described above, by implementing a semantic classifier that can detect outrage in a reaction. In addition, this semantic classifier can also detect regular abusive language. This gives us the following sources of knowledge to design a final algorithm with:
− Primary classifier: SVM trained on our default training set (possible outcomes ’sexual’, ’racist’, ’irrelevant’)


− Secondary classifier: semantic classifier based on a weighted word list (possible outcomes ’sexual’, ’racist’, ’irrelevant’)
− Reaction classifier: semantic classifier also based on word lists, but classifying only the reactions to a certain message (possible outcomes ’outrage’, ’sexual’, ’racist’, ’irrelevant’)
In more than 90% of the cases, at least one of our classifiers correctly predicts whether a message is relevant or not. Thus, it is vital to use the correct classifier in every situation. The algorithm takes into account the characteristics of our classifiers, derived from multiple tests. We distinguish between four different cases:
− When both the primary and secondary classifier give a message the same label, we logically assume this label is the correct one.
− When a message contains fewer features than a certain threshold, the secondary classifier takes precedence over the primary.
− When the primary and secondary classifier disagree, we analyze whether the reactions contain outrage or abusive language that supports the decision of one of our classifiers.
− If, on the contrary, the reactions do not contain any information, we base our decision on the normalized scores both classifiers assign to their decisions.
On the basis of these cases, we choose which classification result is eventually selected.
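The four cases can be sketched as a single decision function. The function name, argument names and the feature-count threshold below are illustrative, not the exact ones from the implementation:

```python
def combine(primary, secondary, reaction_label, n_features,
            primary_score, secondary_score, min_features=5):
    """Combine the primary (SVM) and secondary (fuzzy) labels into one decision.

    primary/secondary are class labels; primary_score/secondary_score their
    normalized scores. reaction_label is the label the reaction classifier
    assigned, or 'irrelevant' when reactions carry no information.
    """
    # Case 1: both classifiers agree
    if primary == secondary:
        return primary
    # Case 2: too few features -> trust the secondary (word-list) classifier
    if n_features < min_features:
        return secondary
    # Case 3: reactions contain outrage or abuse -> support the offensive label
    if reaction_label != "irrelevant":
        return primary if primary != "irrelevant" else secondary
    # Case 4: fall back on the normalized classifier scores
    return primary if primary_score >= secondary_score else secondary
```

For instance, a three-feature message on which the classifiers disagree is decided by the secondary classifier, regardless of the scores.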


Chapter 5

Results This chapter presents the results obtained from the experiments applied to our different methods. These results are based on the proposed metrics (cf. sections 3.1.2 and 3.4) and are gradually improved by a continuous development process. We start with stage one of our study, which is devoted to query expansion. We discuss the functionality of our chosen query expansion method and compare this to its overall performance. Further on, we extensively describe and discuss the performance of our text classification system, beginning with a detailed view of the functioning of our Naive Bayes classifier and ending with the algorithm that combines the functionality of different classifiers.

5.1

Query Expansion

To point out the utility of query expansion, we compare the (average) precision and recall (cf. section 3.1.2) of different queries, in both their simple and expanded forms. It is important to note that the queries in this experiment are focused on the retrieval of racist messages; queries built to retrieve sexual messages show similar results. We focus on racist messages because they occur only rarely in the corpus, as opposed to sexual messages, which can be found more easily (cf. section 2.4.2). To be able to measure the performance of our information retrieval system, we construct different (expanded) queries. By comparing the recall and precision of a simple query with its expanded form, we should notice significant differences. We expect both precision and especially recall to improve when the expanded query is applied.


5.1.1

Rocchio Relevance Feedback

We briefly recap the main formula that should shift a new (expanded) query Qi+1 closer to the ideal query Q (cf. section 3.1.6.1).

Qi+1 = α × Qi + (β / |Du|) × Σ_{di ∈ Du} di − (γ / |Dv|) × Σ_{dj ∈ Dv} dj

5.1.1.1

Initial Setup

The Rocchio algorithm depends on three main parameters (cf. section 3.1.6.1): α, β and γ. The importance of the original query is denoted by α, while β and γ respectively denote the importance of the selected relevant documents and the irrelevant documents. However, the influence of irrelevant documents is nowadays questioned, and γ is therefore often set to zero. In our implementation, we assign the following values to these parameters:

α = 0.75, β = 0.25, γ = 0
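With these parameter values (γ = 0, so irrelevant documents are ignored), one Rocchio update on bag-of-words term-frequency vectors can be sketched as follows. The representation of documents as Counters and the example documents are illustrative:

```python
from collections import Counter

ALPHA, BETA, GAMMA = 0.75, 0.25, 0.0

def rocchio_expand(query, relevant_docs, irrelevant_docs):
    """One Rocchio update on term-frequency vectors represented as Counters."""
    new_query = Counter({t: ALPHA * w for t, w in query.items()})
    for doc in relevant_docs:
        for t, w in doc.items():
            new_query[t] += BETA * w / len(relevant_docs)
    if GAMMA and irrelevant_docs:  # skipped here, since GAMMA == 0
        for doc in irrelevant_docs:
            for t, w in doc.items():
                new_query[t] -= GAMMA * w / len(irrelevant_docs)
    return new_query

# toy example: expand a two-term query with two marked relevant documents
q = Counter({"neger": 1.0, "zwart": 1.0})
rel = [Counter({"neger": 2, "marokkaan": 1}), Counter({"zwart": 1, "ras": 1})]
expanded = rocchio_expand(q, rel, [])
```

Terms from the relevant documents (here ’marokkaan’ and ’ras’) enter the expanded query with small weights, while original query terms keep the highest weights.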

5.1.1.2

Query 1

We start with a query containing the words ’neger’ and ’zwart’. This query is expanded, which results in the query: “zwart neger marokkan wit werk virus turk clan turk the ras jood ”. We now compare both recall and precision of these two queries.


Figure 5.1: Recall Comparison

Figures 5.1 and 5.2 clearly show that both precision and recall increase when the expanded query is used. The expanded query outperforms the base query by far in terms of recall: the simple query retrieves around 33 relevant documents, while the expanded query retrieves 51. The same applies to precision, albeit less pronounced (cf. Figure 5.2). We can also compare precision in relation to the number of relevant documents retrieved. Again, we see a much improved result when using the expanded query (cf. Figure 5.3). Lastly, Figure 5.4(a) displays the difference in average precision between both queries.


Figure 5.2: Precision Comparison

Figure 5.3: Precision-Recall Comparison



(a) Average Precision Comparison of Query 1

(b) Average Precision Comparison of Query 2

Figure 5.4: Average Precision Comparison.

5.1.1.3

Query 2

Our second base query starts from “sieg heil ” and is eventually expanded to “white race sieg parasiet heil vijand national fight pride trots 88 14 hitler nazi suprematie”. We only compare the average precision results, as the other results are similar to the numbers above (cf. section 5.1.1.2). Query 2 likewise shows improved average precision for its expanded form (cf. Figure 5.4(b)). 5.1.1.4

Combined Query

We combine different queries into one large query to evaluate the overall result. All the relevant terms of the query can be found in Table 5.1. We emphasize relevant, since an expanded query usually contains multiple irrelevant terms, which we filtered out to avoid query drift.

jood        vuil         vraatzucht    ras
neger       moslim       dik           kk
zwart       white        race          sieg
parasiet    heil         vijand        national
fight       pride        trots         88
14          hitler       nazi          supremati
neger       marokkaan    wit           werk
virus       turk         clan          turk

Table 5.1: Expanded query

Figure 5.5 clearly shows that our combined query outperforms all other queries in terms of recall.


Figure 5.5: Number of relevant documents among 1000 retrieved documents

Figure 5.6: Average Precision Comparison

We notice that, in terms of average precision, the combined query does not perform better than the separately expanded queries (cf. Figure 5.6). This is because the average


precision metric depends on the ranking of the retrieved relevant documents. In other words, we depend on the ranking capability of Lucene. Moreover, our combined query contains 36 different terms, which automatically increases the number of retrieved documents. This, in turn, makes it more difficult for Lucene to properly rank these documents and results in more scattered relevant documents.
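The dependence of average precision on ranking can be made concrete with a short sketch: the same number of relevant documents yields a lower score when they appear scattered through the ranking.

```python
def average_precision(ranking):
    """ranking: list of booleans, True where the retrieved document is relevant."""
    hits, precisions = 0, []
    for i, relevant in enumerate(ranking, start=1):
        if relevant:
            hits += 1
            precisions.append(hits / i)  # precision at this cut-off
    return sum(precisions) / len(precisions) if precisions else 0.0

# the same three relevant documents, ranked at the top vs. scattered
well_ranked = [True, True, True, False, False, False]
scattered   = [True, False, False, True, False, True]
```

Both rankings retrieve three relevant documents, yet the well-ranked list scores an average precision of 1.0 against roughly 0.67 for the scattered one.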

5.1.2

Conclusion

We notice a great improvement in both recall and precision when expanding a query. The number of retrieved relevant documents more than doubles when applying our combined query as opposed to its base form. However, precision does not significantly improve, since the retrieval of documents also depends on the way results are ranked. Thus, very large queries are not always better: while they generally increase recall, precision does not follow. The main goal of this stage was to use query expansion as a means to efficiently build a descriptive training set. Although query expansion greatly improved the retrieval of relevant messages, human intervention remains essential to guide the process.

5.2

Offensive Language Detection

This section presents the results of our flame detection system. We begin with an introduction to our Naive Bayes classifier, proceed to our support vector machine and semantic classifiers, and finally end with a combination of methods.

5.2.1

Naive Bayes

Our Naive Bayes classifier is a non-standard implementation that incorporates a few improvements (cf. section 4.2.1). However, we do not use any form of feature selection. The classifier is tested by applying cross validation (cf. section 3.4) as well as by testing it on our standard validation set (cf. section 2.4.2). 5.2.1.1

Evaluation

Table 5.2 displays the performance of our Naive Bayes classifier. For both tests, accuracy reaches solid values of 95% and 87%, respectively. In the remainder of this chapter, accuracy results are not analyzed in depth, because in all cases accuracy reaches values from 90% to 99%. This is due to the large impact of irrelevant messages, which account for more than 90% of the correctly classified messages. We notice a big difference in precision when applying cross validation as opposed to testing on our validation set. We attach more value to the results on our validation set, which better reflects a realistic situation. Naive Bayes has a tendency to classify documents as flames too readily, which in the second case results in a low precision. The reason why the NB classifier performs exceptionally well under cross validation is twofold. First, our training set is larger and contains far more relevant messages


than our validation set. Second, unlike our validation set, our training set is a coherent set of messages, which makes it much easier to classify messages from the same set. The simplicity of the Naive Bayes method does not allow it to properly unravel the structure of a message, which leads to the poor results on the validation set.

Evaluation method    Precision (%)    Recall (%)    F1 Measure (%)
Cross Validation     71.28            92.86         80.65
Validation set       5.24             70.00         9.75

Table 5.2: Performance of Naive Bayes.

5.2.1.2

Conclusion

On a realistic validation set, our Naive Bayes performs very poorly. One of the reasons is of course the simplicity of the NB method, which does not allow it to fully take into account the structure of a message. Another flaw is the fact that our validation set contains too few relevant documents, which makes it hard to produce good results, as falsely classifying one positive document (or vice versa) has a huge impact on the precision and recall numbers.

5.2.2

Support Vector Machine

The complexity of a Support Vector Machine is very high compared to a Naive Bayes classifier. This section gives a complete overview, starting from our initial setup and ending with our final version. We extensively examined multiple factors that possibly determine or influence the performance of our SVM classifier. We start by showing results that are not based on our final training set, but that give a good indication of where and how we improved. By showing the complete process, this study tries to give a comprehensive picture of how we achieved our final results. 5.2.2.1

Initial Setup

As we have already discussed, our support vector machine is implemented based on the LIBLINEAR library. An L2-regularized logistic regression function is used, our cost parameter is set to 1.2, epsilon has a value of 0.1 and no bias is used. 5.2.2.2

Evaluation

We evaluate our classifier on the standard validation set. We notice that our Support Vector Machine falsely classifies 45 messages as positive. This results in a precision of only 6%. Both precision and recall are extremely low, resulting in an F1 measure of only 8.70%.


TP    FP    TN      FN    Precision (%)    Recall (%)    F1 Measure (%)
3     45    1932    18    6.25             14.29         8.70

Table 5.3: Results of classifier after running on validation set. Precision drops to 6% due to the rate of true positive against false positive.
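The precision, recall and F1 values reported throughout this chapter follow directly from the confusion counts. A small sketch, checked against the numbers of Table 5.3:

```python
def metrics(tp, fp, tn, fn):
    """Compute precision, recall, F1 and accuracy from confusion counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    return precision, recall, f1, accuracy

# counts from Table 5.3
p, r, f1, acc = metrics(tp=3, fp=45, tn=1932, fn=18)
```

This reproduces the reported 6.25% precision, 14.29% recall and 8.70% F1, and also shows why accuracy (here above 96%) is uninformative on such an imbalanced set.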

In the next sections, we will try to determine the crucial factors responsible for this poor performance.

5.2.2.3

Data Analysis

The first and most obvious action to enhance the overall performance of a supervised classifier is to improve the training data. To limit the number of false positives, we extend the irrelevant messages in our training set. However, since SVMs base their decision on only a small subset of samples (i.e. the support vectors), we do not randomly add irrelevant messages. Instead, we classify 50,000 random messages originating from our data set and manually analyze all messages that are detected as offensive. Each false positive is then added to our training set as an irrelevant message. In this way, we change the decision making process and make sure the hyperplane that separates the different categories is modified. This can be viewed as a form of reinforcement learning (cf. section 3.2.3). The messages that are correctly classified as offensive are also added to our training set. Table 5.4 displays the results of this classification. In total, approximately 900 irrelevant messages are added to the training set, together with roughly 150 offensive, predominantly ’sexual’ messages. Not every message is added to our training set, since several messages are similar in terms of content and structure.

Total              50,000
Sexual             774
Racist             438
Irrelevant         48,788
Precision          17.07%
True positives     205
False positives    996

Table 5.4: Results of classifier after running on randomly selected set.
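The mining loop described above can be sketched as follows. The function names `svm_predict` and `manually_relabel` are placeholders for, respectively, the actual classifier and the manual annotation step:

```python
import random

def mine_hard_examples(svm_predict, manually_relabel, pool, n=50000):
    """Classify random messages and collect reviewed positives for the training set.

    svm_predict(msg) -> predicted label; manually_relabel(msg) -> true label.
    Returns (message, true_label) pairs: false positives come back labeled
    'irrelevant', true positives reinforce the offensive classes.
    """
    collected = []
    for msg in random.sample(pool, min(n, len(pool))):
        if svm_predict(msg) in ("sexual", "racist"):  # flagged as offensive
            true_label = manually_relabel(msg)         # manual analysis step
            collected.append((msg, true_label))
    return collected

# hypothetical demonstration with a toy pool and stand-in functions
pool = [f"msg{i}" for i in range(10)]
flagged = mine_hard_examples(
    lambda m: "sexual" if m.endswith(("1", "2")) else "irrelevant",
    lambda m: "irrelevant" if m == "msg1" else "sexual",
    pool, n=10)
```

Only the messages the classifier flags reach the (expensive) manual review, which is what makes this form of hard-example mining efficient.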

To estimate the effect of our improved training set, we again test our classifier on our validation set. Table 5.5 shows the results. We clearly see an improvement, though the overall performance is still very poor.


TP    FP    TN      FN    Precision (%)    Recall (%)    F1 Measure (%)
6     14    1963    15    30.00            28.57         29.27

Table 5.5: Results of classifier after improving the training set.

By repeating this process a number of times, we further enhanced our training set and eventually attained the set described in section 2.4.3. Another way of improving our results is to increase the number of relevant messages in our validation set. Since our validation set contains only a very small number of relevant messages, one wrongly classified message has a great impact on the eventual performance measures. Although our validation set is then no longer a realistic representation of our data set, we do get a slightly better picture of the behavior of our Support Vector Machine. We thus added a number of relevant, previously unseen messages to our validation set, which increased the total number of relevant messages to 29. Table 5.6 shows the final results of the enhancement of both the training and validation set.

TP    FP    TN      FN    Precision (%)    Recall (%)    F1 Measure (%)
15    22    1947    14    40.54            51.72         45.45

Table 5.6: Results of classifier after modifying validation set.

Figure 5.7 gives an overview of the evolution of the performance of our Support Vector Machine.

Figure 5.7: Evolution of SVM performance on validation set.

Although we managed, partly in an artificial way, to increase the performance of our SVM, the results are still far from satisfying. Therefore, we further


analyse different aspects in the following sections.

5.2.2.4

Message Length Analysis

A thorough analysis of messages that have been falsely classified as positive indicates that 71.3% of them contain fewer than ten features. To get a clear view of the performance specifically on small messages, our test set is split into two parts based on message size: one part with only messages smaller than a certain threshold, and one part containing only larger messages. The SVM clearly performs at a different level depending on which set it is run on. Figure 5.8 shows the difference in precision between the two sets.

Figure 5.8: Difference in performance when splitting the validation set into a set with small messages and one with larger messages.

In Figure 5.9, we see the precision results when splitting the validation set into two parts based on the number of features in a message. We can clearly see that a message containing more features has a greater chance of being correctly classified.


Figure 5.9: Snapshot of precision results showing the difference between the small validation set (messages that contain fewer than 20 features) and the large validation set (only messages that contain at least 20 features).

Figure 5.10: Comparison of SVM performance when removing messages under a certain threshold in both training and validation set.

Based on the above data, we decide to eliminate all messages with fewer features than a certain threshold. Instead of only eliminating these messages from our validation set, we also remove them from our training set. In this way, we prevent these small messages from negatively affecting the classifier when it attempts to build an accurate model. Figure 5.10 shows the results of our classifier when completely removing messages that


do not contain a certain minimum number of features. We notice a great performance improvement in both precision and recall. However, as soon as the threshold exceeds ten features, the performance drops again. This is because we are then throwing away too much information, which results in a less accurate training model. We conclude that messages that do not contain at least five features are not suitable for classification and are thus ignored completely by our Support Vector Machine.
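The resulting rule is a simple filter; the sketch below uses the threshold of five features found above (names are illustrative):

```python
MIN_FEATURES = 5  # messages with fewer features are ignored by the SVM

def usable_for_svm(features):
    """Return True if a tokenized message has enough features to be classified."""
    return len(features) >= MIN_FEATURES

def filter_set(dataset):
    """Drop too-small messages from a training or validation set of token lists."""
    return [features for features in dataset if usable_for_svm(features)]
```

Applying the same filter to both the training and the validation set keeps the model from being trained on messages it will never be asked to classify.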

5.2.2.5

Selection of Features Analysis

One last attempt to improve the performance of our SVM is to apply feature selection. As described in section 3.2.2, feature selection tries to eliminate noise by filtering out useless features and preserving only those that contain the most information. We applied a method named mutual information (cf. section 4.2.2). However, the results were not satisfying, as performance did not improve. We decided not to perform an in-depth analysis of why feature selection did not produce any improvement: as feature selection is an art in itself, this would cost too much time and lead us too far from the main subject.
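For reference, the mutual information of a term and a class can be computed from document counts. The sketch below follows the standard formulation over a 2×2 contingency table, not necessarily the exact implementation used here:

```python
from math import log2

def mutual_information(n11, n10, n01, n00):
    """MI (in bits) of a term and a class from document counts:
    n11 docs containing the term and in the class, n10 with the term outside
    the class, n01 in the class without the term, n00 with neither."""
    n = n11 + n10 + n01 + n00
    mi = 0.0
    for n_tc, n_t, n_c in [
        (n11, n11 + n10, n11 + n01),
        (n10, n11 + n10, n10 + n00),
        (n01, n01 + n00, n11 + n01),
        (n00, n01 + n00, n10 + n00),
    ]:
        if n_tc:  # skip empty cells (0 * log 0 = 0)
            mi += (n_tc / n) * log2(n * n_tc / (n_t * n_c))
    return mi
```

A term that appears in exactly the documents of one class scores the maximum of one bit, while a term independent of the class scores zero; feature selection would then keep only the highest-scoring terms.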

5.2.2.6

Conclusion

We conclude that our Support Vector Machine, when intensively analyzed and tuned, performs slightly above average. Figure 5.11 shows the overall progress of our SVM.

Figure 5.11: Precision and recall comparison of different stages of our Support Vector Machine



5.2.3

Semantic Classifier

In this section we present and compare the results of our semantic classifiers. The algorithms we designed (cf. section 4.2.3) were originally meant to detect outrage in reactions, so that certain decisions taken by our primary classifier could be supported or altered. However, by also constructing word lists for the other classes, the semantic classifiers can be regarded as equivalent to our SVM or NB classifiers.

5.2.3.1

Initial Setup

We created three classifiers, beginning with a rather simple implementation and ending with a more advanced one. Each implementation has its own characteristics (cf. section 4.2.3). The simplest classifier detects abusive language too quickly, as it attaches too much weight to abusive words that have multiple meanings depending on the context. We hypothesize that this classifier will yield high recall, but rather low precision. The second classifier, named the vital classifier, on the contrary focuses much more on high precision, at the cost of recall. The last classifier, named the fuzzy classifier, tries to combine the best parts of the two previous ones.

5.2.3.2

Evaluation

Figure 5.12 shows the results of our different semantic classifiers. As expected, our simple classifier performs especially well with regard to recall, as opposed to our vital classifier, which reaches a much higher precision. The fuzzy classifier reaches the same precision level as the vital classifier, but combines this with a recall of approximately 93%.

Figure 5.12: Comparison of performance of different semantic classifiers on validation set.



5.2.3.3

Reactions Classification

This section is devoted to determining the efficiency of classifying messages solely based on their reactions. In chapter 2, an extensive analysis of the corpus is provided. We concluded that 72% of all messages do not contain any reactions; however, a message that does contain reactions has six on average (cf. section 2.2). Logically, we can only take into account messages that do contain reactions for this experiment. Our validation set is then reduced to 584 messages. However, since semantic methods are not supervised, our training set can also be used as a ’validation set’. Since our training set includes 1416 messages that contain reactions, we decide to temporarily execute our experiments using this set. We apply three different methods to classify a message based on its reactions. The first method focuses on detecting outrage in the reactions, while the second method searches for reactions that contain abusive language; here we assume that messages containing abusive language will elicit reactions that also contain abusive language. The third method combines both and attempts to find abusive reactions as well as reactions that express outrage. Figure 5.13 shows a comparison of these methods, executed using our fuzzy classifier.

Figure 5.13: Classification of messages solely based on the reactions to the message.

Both our first and second method achieve a rather high precision, but perform poorly in terms of recall. The combination of both methods achieves reasonable results, but is not comparable to methods that directly classify the message itself. This seems normal, as the reactions to a message do not necessarily behave the same way as the message itself and are much less predictable.


5.2.3.4

Conclusion

We decide to select our fuzzy classifier as our final semantic classifier. Moreover, we conclude that using reactions to classify a message can be useful, though rather in an assisting role than as a standalone solution.

5.2.4

Comparison of Single Methods

In the above sections we thoroughly analyzed the results of three single methods. For the sake of completeness, Table 5.7 displays the results of Naive Bayes on the modified validation set.

TP    FP     TN      FN    Precision (%)    Recall (%)    F1 Measure (%)
25    248    1721    4     9.16             86.21         16.56

Table 5.7: Results of Naive Bayes classifier on the modified validation set.

Figure 5.14 combines the best results for each classifier separately. Naive Bayes achieves a very high recall, but combines it with the lowest precision. We focus on our Support Vector Machine classifier and our semantic classifier, which produce solid results.

Figure 5.14: Comparison between different single classifiers.

5.2.4.1

Length of Message Analysis

Section 5.2.2.4 showed that the SVM has difficulties classifying small messages. In this section we compare the behavior of both the SVM and the fuzzy classifier when applied to


small messages. We filter all messages that contain more than eleven features out of our validation set; this leaves 479 messages, of which two are relevant. Table 5.8 displays the results of the experiment. Figure 5.15 shows that the fuzzy classifier outperforms the SVM by far in terms of precision. These findings are used in the final algorithm, which favors the fuzzy classifier when messages contain only a few features.

Classifier    Accuracy (%)    Precision (%)    Recall (%)    F1 Measure (%)
SVM           97.3            14.3             100.0         25.0
Fuzzy         100.0           100.0            100.0         100.0

Table 5.8: Results of classifiers running on ’small’ validation set.

Figure 5.15: Comparison of precision of both classifiers running on ’small’ validation set.

5.2.4.2

Conclusion

Since section 5.2.5 proposes a combination of the SVM and fuzzy classifier, it is important to effectively determine the different weaknesses and strengths of both classifiers. We conclude that smaller messages should not be classified by the SVM, and also notice that the fuzzy classifier yields a very high recall. Thus, when the fuzzy classifier indicates that a message does not contain abusive language, there is a high chance the message is indeed irrelevant.


5.2.5

Combined Algorithm

We can state that, although the SVM and fuzzy classifier achieve reasonable results, the overall performance is not satisfactory. Therefore, a combination of separate classifiers is designed to improve the functionality of the system. As described in section 4.2.4, we use both our SVM and our fuzzy classifier to create a more reliable system.

5.2.5.1

Initial Setup

At first, we solely focused on the belief each classifier attached to its classification, choosing the classifier that linked the highest score to a certain class. However, since it is hard to compare the scores of two completely different classification systems, and a lot of extra information can be derived from the context, we designed a more complex system. This algorithm is described in section 4.2.4 and bases its decision on four general cases.

5.2.5.2

Evaluation

Our advanced method achieves the following results on our validation set.

TP    FP    TN      FN    Precision (%)    Recall (%)    F1 Measure (%)
23    0     1969    6     100.0            79.3          88.46

Table 5.9: Results of combination of different classifiers on validation set.

Table 5.9 displays the results achieved when classifying our validation set. Both precision and recall have improved significantly.

5.2.5.3

Comparison of Methods

The combination of different separate classifiers clearly outperforms our single methods (cf. Figure 5.16). Especially precision benefits from combining our classifiers, even reaching 100%. Of course, this is a result obtained on our validation set; in a realistic environment precision will not be perfect, but it will still be very high. Recall yields a less spectacular result, although it outperforms almost all other attempts.


Figure 5.16: Comparison of all implemented classification methods.

5.2.6

Corpus Classification

This section presents the results concerning the classification of our whole data set. Classifying the whole data set allows us to estimate the basic functionality of our classification system.

5.2.6.1

Detected Offensive Messages

Table 5.10 displays the distribution of messages among the different categories after the classification. Class

Amount

Percentage (%)

Irrelevant Sexual Racist

6,862,870 37,152 5,072

99.39 0.54 0.07

Table 5.10: Distribution of messages among the three classes: irrelevant, sexual and racist.

We notice that our system detects a total of 42,224 flame messages. This is equivalent to 0.60% of the whole corpus, while in chapter 2 we estimated, based on the number of offensive messages in our validation set, that 0.85% of the messages contain abusive language. In total, 30,271 different users have posted at least one flame message.


5.2.6.2

Influence of Different Classifiers

Table 5.11 denotes the impact of the separate methods on our final classification. Both the overall share and the share among positively classified messages are shown. The SVM and fuzzy classifier return the same result for the most part, which makes the decision straightforward in most cases. Reactions alter or support the decision only about 0.1% of the time, or 16% when considering positive messages alone. Apart from the situation where both classifiers give equal results, the SVM is preferred in 53% of the cases, while 47% of the decisions are based on the fuzzy classifier.

Situation                          Global (%)    Positive (%)
Same classification result         96.72         51.45
Reactions modify decision          0.10          16.27
Small message modifies decision    0.96          4.01
Other                              1.32          28.27

Table 5.11: Impact of different situations

5.2.7

Conclusion

As expected, our Naive Bayes method is greatly outperformed by both our SVM and our semantic method. This study notes that the SVM, when optimally tuned, achieves a precision of 69% and a recall of 62%. The semantic method, which is based on word lists, achieves a very high recall of 93% and combines this with a precision of 46%. By combining the information we obtain from the semantic method and the SVM classifier, a considerable performance boost is achieved. The main reason for building a method based on word lists was to be able to extract information from reactions. However, given the solid performance of this semantic method on our validation set, it is considered an adequate classifier in its own right. Our final algorithm achieves a precision of 100% and a recall of approximately 79%. All these tests are performed on our modified validation set, which is our initial validation set with an increased number of relevant documents.


Chapter 6

Discussion

This penultimate chapter briefly summarizes the results obtained in our experiments and explains why we obtain these results. Finally, we suggest possible enhancements and future improvements.

6.1 Main Findings

This section gives an overview of the main accomplishments and findings of this study.

6.1.1 Corpus

The corpus, extracted from the social networking site Netlog, contains approximately seven million messages. By extracting 2000 random messages from the corpus, we intend to (a) create a standard validation set to perform our experiments on, and (b) get a global idea of the number of relevant messages in the corpus. The validation set contains two 'racist' messages and fifteen 'sexual' messages. With such a low number of relevant messages, it is hard to properly estimate the performance and functionality of a classifier. Therefore, the number of relevant messages in the validation set is artificially increased. Again, approximately 2000 messages are selected at random, but this time only the relevant ones are kept. Randomly chosen irrelevant messages in our standard validation set are then swapped for these retrieved relevant messages.
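The swap procedure described above could be sketched as below; the sampling details and the data representation are assumptions, since the text only specifies that random irrelevant messages are exchanged for retrieved relevant ones.

```python
import random

def enrich_validation_set(validation_set, extra_relevant, is_relevant, seed=0):
    """Swap randomly chosen irrelevant messages for additional relevant ones,
    keeping the validation set the same size (sketch, details assumed)."""
    rng = random.Random(seed)
    enriched = list(validation_set)
    # indices of messages that may be swapped out
    irrelevant_idx = [i for i, m in enumerate(enriched) if not is_relevant(m)]
    k = min(len(extra_relevant), len(irrelevant_idx))
    for i, msg in zip(rng.sample(irrelevant_idx, k), extra_relevant):
        enriched[i] = msg
    return enriched
```

Keeping the set size constant means precision and recall remain comparable across the original and modified validation sets.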

6.1.2 Development

In the first stage, an information retrieval platform is set up in order to effectively retrieve relevant documents. The system is expanded by implementing Rocchio query expansion,


which enhances the IR process. Query expansion extracts the most informative words from the relevant documents and adds them to the query. This method should not only retrieve more relevant documents, but also list the most important words that characterise the relevant documents. A facility to save these words and construct a thesaurus is also implemented. The second stage consisted of implementing supervised classifiers to detect offensive language. Two supervised classifiers, Naive Bayes and a Support Vector Machine, were created. Subsequently, a 'semantic' classifier based on word lists was constructed, increasing reliability at the cost of flexibility. Finally, a combination of our SVM and semantic classifier was designed. Analysis of the outcome of these different classifiers on the validation set points out that over 90% of the samples are correctly classified by at least one of the two classifiers. Based on this context and the strengths and weaknesses of both classifiers, a final classification method was developed.
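A minimal sketch of positive-feedback Rocchio expansion as described here; the weighting scheme and the number of added terms are illustrative assumptions, not the thesis's exact parameters.

```python
from collections import Counter

def rocchio_expand(query_terms, relevant_docs, n_terms=5, beta=1.0):
    """Add the most informative terms from the relevant documents to the
    original query (simplified Rocchio, positive feedback only)."""
    centroid = Counter()
    for doc in relevant_docs:
        counts = Counter(doc.split())
        total = sum(counts.values())
        # accumulate the normalized term weight of each document
        for term, c in counts.items():
            centroid[term] += beta * c / total / len(relevant_docs)
    # highest-weighted terms not already in the query become expansion terms
    candidates = [t for t, _ in centroid.most_common() if t not in query_terms]
    return list(query_terms) + candidates[:n_terms]
```

A real implementation would also subtract a centroid of non-relevant documents and filter the proposed terms manually, as the thesis notes is required to avoid query drift.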

6.1.3 Experimental Study

This section gives an overview of the main experimental results.

Query Expansion

We can state that Rocchio query expansion, or query expansion in general, is very prone to query drift. Even with manual annotation of relevant documents, filtering irrelevant terms out of the expanded query is required. However, we did show that using an expanded query, as opposed to its base query, increases recall as well as precision. On average, recall increased by around 20% when expanding both query one and query two. Especially when creating one large query composed of different single queries, we noticed an improvement in terms of recall. The average precision also increased when expanding queries, but this depends on how the newly retrieved relevant messages are ranked: precision increases when the expanded query ranks relevant messages highly, and is lower when relevant messages are more scattered.

Offensive Language Detection

Our flame detection mechanism starts with developing different independent classifiers. We start with Naive Bayes, a simple but widely used classifier. This classifier, however, performs rather poorly on our realistic validation set: although an impressive recall is achieved, only 9% of the positively classified messages are in fact offensive, which of course completely nullifies the high recall. The second classifier is a Support Vector Machine, based on the LibLinear library. At first it does not produce significantly better results compared
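As a reminder, the precision and recall figures quoted in this section follow the standard confusion-matrix definitions; the counts in the example below are illustrative, not the thesis's actual numbers.

```python
def precision_recall(tp, fp, fn):
    """Standard precision/recall from confusion-matrix counts:
    precision = tp / (tp + fp), recall = tp / (tp + fn)."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall
```

For instance, a classifier that flags 100 messages of which only 9 are truly offensive, while missing 1 offensive message, scores 9% precision and 90% recall, illustrating how a high recall can be nullified by very low precision.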


to our Naive Bayes classifier. However, after extending and enhancing our training set in combination with our modified validation set (cf. section 6.1.1), we manage to obtain reasonable results. The SVM classifier highly depends on the number of features a message contains. This can be clearly inferred from an analysis of the falsely detected flame messages, of which 70% contain fewer than ten features. SVM precision is negatively influenced by messages that do not contain a minimum number of features. Another way to enhance our SVM is to filter features so that only meaningful features are kept. However, the implemented filter, based on mutual information, does not produce the expected results, so we do not apply any feature selection method. Eventually our SVM reaches, in a best-case scenario, a precision of almost 70% and a recall of 62%. This result is obtained when ignoring messages that do not contain more than five features. Since very small messages can contain abusive language just as much as large messages, this is considered a flaw.

Because of the limited performance of the SVM, we created another classifier, named the semantic classifier, which is based on word lists. Such methods are less dynamic and have fewer learning capabilities than supervised methods, but are more trustworthy. We mainly designed this method to incorporate information provided by reactions into our classification system. However, we notice that our semantic method performs solidly on our validation set, with an especially high recall of 93%.

Finally, the different classifiers are combined into one efficient classification system. This combined system achieves a precision of 100% on our validation set. Although this is an excellent result, it should be noted that this validation set is not a perfect representation of the whole dataset.
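The word-list ('semantic') classification mechanism can be sketched roughly as follows; the intensity values and the decision threshold are hypothetical, chosen only to illustrate the idea of scoring a message against a weighted word list.

```python
def semantic_classify(message, word_list, threshold=1.0):
    """Score a message against a weighted word list and flag it as
    offensive when the accumulated intensity passes a threshold."""
    tokens = message.lower().split()
    score = sum(word_list.get(tok, 0.0) for tok in tokens)
    return score >= threshold
```

Because the decision depends only on word presence, this classifier keeps working on very short messages, which is exactly where the SVM struggles; the combined system exploits that complementarity.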

Offensive Language Behavior

The final goal of this study is not only to detect offensive language, but to detect offensive language behavior in general. This means we intend to identify users that deliberately post offensive messages. Due to time constraints we were unable to fully develop a detailed method to decide whether or not a user should be labeled as exhibiting offensive language behavior. However, a classification of the complete dataset was performed in order to get an idea of the number of offensive messages and their distribution among the different users. We note that approximately 30% fewer messages have been classified as flames than initially expected from the number of relevant messages in the validation set.
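The per-user aggregation underlying this full-corpus classification could look like the sketch below; the `(user_id, text)` representation and the `classify` callback are assumptions for illustration.

```python
from collections import Counter

def flames_per_user(messages, classify):
    """Count flame messages per user over a corpus of (user_id, text) pairs.

    Users who never post a flame do not appear in the result, so the
    returned counter directly gives the set of flagged users."""
    counts = Counter()
    for user_id, text in messages:
        if classify(text):
            counts[user_id] += 1
    return counts
```

A behavior-level decision would then threshold these counts (or rates) per user, which is the detailed method the study leaves as future work.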

6.2 Discussion

This section discusses multiple concerns related to the obtained results, supported with some examples.


6.2.1 Domain-dependency

As described in section 1.5, domain-dependency is one of the biggest issues in the text classification domain. This issue has become even more relevant since our combined method partly relies on static word lists. Although we attempted to add flexibility to these word lists by automatically proposing new words, this has not been extensively tested. Furthermore, a proper intensity and centrality value must be estimated for each word. Combined, these elements make it hard to estimate the performance of our classification system on a completely different corpus.

6.2.2 Implicit Language

When messages contain implicit language, it is very difficult to properly decide whether a message should be labeled as offensive. One example message that is falsely classified as sexual is displayed below (in translation: "Megan Fox: 'I feel just like a whore.' 'As an actress I feel like a whore.' The statement comes from the delightful Megan Fox, recently voted sexiest woman. Fox actually considers all actors and actresses a kind of prostitute. 'It is dirty.' If you think about it, we are ...").

Megan Fox: "Ik voel me net een hoer". "Als actrice voel ik me een hoer." De uitspraak komt van de verrukkelijke Megan Fox, onlangs nog verkozen tot meest sexy vrouw. Fox vindt eigenlijk alle acteurs en actrices een soort van prositituees. "Het is vies". Als je erover na denkt zijn we ...

To detect such messages, more advanced techniques should be used (e.g., taking quotations into account). The example below raises another issue concerning our system. Since the main focus is on detecting offensive chat language, informative and formal messages are often classified wrongly when they appear to proclaim offensive language but are in fact merely reporting the latest news (in translation: "Polish nightclub uses photo of Hitler to attract customers. A Polish nightclub has been reprimanded for using a photo of Adolf Hitler to attract customers. In the picture, the Nazi leader poses with sunglasses, while below him an eagle of the Third Reich is depicted.").

Poolse nachtclub gebruikt foto van Hitler om volk te lokken. Een Poolse nachtclub is op de vingers getikt omdat ze een foto van Adolf Hitler gebruikte om volk te lokken. Op de prent poseert de nazileider met zonnebril, terwijl onder hem een adelaar van het Derde Rijk afgebeeld staat.

6.2.3 Conclusion

In general, it can be stated that detecting offensive language is a very subjective and challenging task. Such a system is a good supporting tool, but it is not suited as a standalone application.

6.3 Future Improvements

We intended to build a dynamic and flexible system that can easily incorporate new categories. In order to add a new class, two key additions need to be constructed. First, a


representative document set that is descriptive of the specific category should be gathered and added to the general training set. Second, a word list needs to be created; the most informative words can partly be extracted from the labeled set by applying query expansion. However, our system lacks the ability to improve and evolve together with the changing nature of offensive language, in particular chat language. A potential improvement would be to automatically suggest new words for our word lists, together with automatically adjusting the values of already existing words. Although we implemented the ability to extend our word lists with new words that frequently arise in flame messages, this module has not been extensively tested. To gradually adjust the behavior of the classifier, reinforcement learning could be a helpful improvement: with feedback on certain messages, the training set or word lists could be altered accordingly and thus adjusted to comply with new trends and practices.
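Such a feedback rule could look roughly like the sketch below; the learning rate and the clipping to the [0, 1] range are assumptions, not part of the implemented system.

```python
def update_word_list(word_list, message_tokens, was_flame, lr=0.1):
    """Nudge word intensities up when a message was confirmed offensive
    and down when it was a false positive (simple feedback rule)."""
    updated = dict(word_list)
    for tok in set(message_tokens):
        if tok in updated:
            delta = lr if was_flame else -lr
            # keep intensities within a bounded [0, 1] range
            updated[tok] = min(1.0, max(0.0, updated[tok] + delta))
    return updated
```

Applied over moderator feedback, words that keep appearing in confirmed flames gain weight, while words that trigger false positives gradually lose influence, letting the word lists track changing chat language.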


Chapter 7

Conclusion

7.1 Main Conclusions

At the outset of this thesis we stated our intention to detect offensive language behavior in an automated manner. This study outlines two main techniques: query expansion and text categorisation. The first technique is used to effectively retrieve relevant documents from the corpus, while the second is used to separate irrelevant messages from messages containing racist or sexually abusive language. The corpus contains just over seven million messages, but relevant messages occur only rarely: analysis reveals that only 0.85% contain offensive language. Our results show that by using query expansion we are able to increase recall and, when queries are not too large, also precision.

Based on the constructed training set, we designed two supervised learning methods. The first, Naive Bayes, performs rather poorly and achieves a precision of 9% and a high recall of 86%; the high recall is negated by the very low precision. The second supervised classifier is a Support Vector Machine, which is studied more extensively than Naive Bayes. The SVM achieves a reasonable 69% precision and 62% recall. However, messages that do not contain more than five features are ignored, since the SVM does not produce good results on small messages.

To enhance the overall results, we incorporated reactions to a message into our classification process. In this way, we can take into account the information reactions contain to support a decision on the message itself. Our study attempts to find reactions that express outrage or contain abusive language of their own. To achieve this, a classifier based on prefabricated word lists was created. This 'semantic' classifier achieves a very high recall of 93% and a reasonable precision of 46% on our standard validation set. Further analysis reveals that reactions affect only 0.1% of the total decisions.



Lastly, to improve the overall results, a combination of methods was designed. This combined method greatly improves the performance of the classification system, since it is capable of using the strengths of both the SVM and the semantic classifier while avoiding their weaknesses. Using this method, we managed to combine a perfect precision of 100% with a solid recall of 79%. Although these results are decent, they are not representative of the overall corpus and certainly not of other corpora. Offensive language detection can be of great aid in automating certain aspects of language monitoring, but the fact remains that, up until today, such systems are not able to truly understand the real meaning and emotions that language expresses.


Bibliography

[1] B. Jansen, A. Spink, and T. Saracevic. Real life, real users, and real needs: A study and analysis of user queries on the web. Information Processing and Management, 36(2):207–227, 1996.

[2] Janyce Wiebe, Theresa Wilson, Rebecca Bruce, Matthew Bell, and Melanie Martin. Learning subjective language. Computational Linguistics, 30(3):277–308, 2004.

[3] John Blitzer, Mark Dredze, and Fernando Pereira. Biographies, bollywood, boom-boxes and blenders: Domain adaptation for sentiment classification. 2007.

[4] Danah M. Boyd and Nicole B. Ellison. Social network sites: Definition, history, and scholarship. Journal of Computer-Mediated Communication, 13(1):210–230, 2007.

[5] John Broglio, James P. Callan, and W. Bruce Croft. Inquery system overview. In Proceedings of a workshop held at Fredericksburg, Virginia, September 19–23, 1993, TIPSTER '93, pages 47–67. Association for Computational Linguistics, 1993.

[6] Kevyn Collins-Thompson and Jamie Callan. Query expansion using random walk models. In Proceedings of the 14th ACM international conference on Information and knowledge management, CIKM '05, pages 704–711. ACM, 2005.

[7] Steve Cronen-Townsend, Yun Zhou, and W. Bruce Croft. A framework for selective query expansion. In Proceedings of the Thirteenth International Conference on Information and Knowledge Management, pages 236–237. ACM Press, 2004.

[8] Silviu Cucerzan and Eric Brill. Spelling correction as an iterative process that exploits the collective knowledge of web users. In Proceedings of EMNLP, pages 293–300, 2004.

[9] Hang Cui, Ji-Rong Wen, Jian-Yun Nie, and Wei-Ying Ma. Probabilistic query expansion using query logs. In Proceedings of the 11th international conference on World Wide Web, WWW '02, pages 325–332. ACM, 2002.

[10] D. Metzler, T. Strohman, H. Turtle, and W. B. Croft. Indri at TREC 2004: terabyte track. In Text Retrieval Conference, 2004.


[11] Kushal Dave, Steve Lawrence, and David M. Pennock. Mining the peanut gallery: opinion extraction and semantic classification of product reviews. In Proceedings of the 12th international conference on World Wide Web, WWW '03, pages 519–528, New York, NY, USA, 2003. ACM.

[12] S. T. Dumais. Latent semantic indexing: TREC-3. In Proceedings of the Text REtrieval Conference (TREC-3), page 219. ACM, 1995.

[13] R.-E. Fan, K.-W. Chang, C.-J. Hsieh, X.-R. Wang, and C.-J. Lin. LIBLINEAR: A library for large linear classification. Journal of Machine Learning Research, 9:1871–1874, 2008.

[14] George Forman. An extensive empirical study of feature selection metrics for text classification. J. Mach. Learn. Res., 3:1289–1305, March 2003.

[15] D. Garrett, D. A. Peterson, C. W. Anderson, and M. H. Thaut. Comparison of linear, nonlinear, and feature selection methods for EEG signal classification. IEEE Transactions on Neural Systems and Rehabilitation Engineering, 11:141–144, 2003.

[16] Andrew R. Golding and Dan Roth. A winnow-based approach to context-sensitive spelling correction. Machine Learning, 34:107–130, 1999.

[17] Andrew R. Golding and Yves Schabes. Combining trigram-based and feature-based methods for context-sensitive spelling correction. In Proceedings of the 34th annual meeting on Association for Computational Linguistics, ACL '96, pages 71–78, Stroudsburg, PA, USA, 1996. Association for Computational Linguistics.

[18] Andrew Gordon, Abe Kazemzadeh, Anish Nair, and Milena Petrova. Recognizing expressions of commonsense psychology in English text. In Proceedings of the 41st Annual Meeting on Association for Computational Linguistics - Volume 1, ACL '03, pages 208–215, Stroudsburg, PA, USA, 2003. Association for Computational Linguistics.

[19] Isabelle Guyon and André Elisseeff. An introduction to variable and feature selection. J. Mach. Learn. Res., 3:1157–1182, March 2003.

[20] Hadi Amiri, Abolfazl AleAhmad, Masoud Rahgozar, and Farhad Oroumchian.
Query expansion using Wikipedia concept graph. 2008.

[21] Marti A. Hearst and Jan O. Pedersen. Reexamining the cluster hypothesis: Scatter/gather on retrieval results. pages 76–84, 1996.

[22] Thorsten Joachims. Text categorization with support vector machines: Learning with many relevant features. In Claire Nédellec and Céline Rouveirol, editors, Machine Learning: ECML-98, volume 1398 of Lecture Notes in Computer Science, pages 137–142. Springer Berlin / Heidelberg, 1998. 10.1007/BFb0026683.


[23] Keith N. Hampton, Lauren Sessions Goulet, Lee Rainie, and Kristen Purcell. Social networking sites and our lives. 2010.

[24] S. Kotsiantis, I. Zaharakis, and P. Pintelas. Machine learning: a review of classification and combining techniques. Artificial Intelligence Review, 26:159–190, 2006. 10.1007/s10462-007-9052-3.

[25] Karen Kukich. Techniques for automatically correcting words in text. ACM Comput. Surv., 24(4):377–439, December 1992.

[26] T. K. Landauer and S. Dumais. Latent semantic analysis. Scholarpedia, 3(8):4356, 2008.

[27] Ken Lang. Newsweeder: Learning to filter netnews. In Proceedings of the Twelfth International Conference on Machine Learning, pages 331–339, 1995.

[28] Victor Lavrenko and W. Bruce Croft. Relevance based language models. In Proceedings of the 24th annual international ACM SIGIR conference on Research and development in information retrieval, SIGIR '01, pages 120–127. ACM, 2001.

[29] Y. H. Li and A. K. Jain. Classification of text documents. The Computer Journal, 41(8):537–546, 1998.

[30] Yinghao Li, Wing Pong Robert Luk, Kei Shiu Edward Ho, and Fu Lai Korris Chung. Improving weak ad-hoc queries using Wikipedia as external corpus. In Proceedings of the 30th annual international ACM SIGIR conference on Research and development in information retrieval, SIGIR '07, pages 797–798. ACM, 2007.

[31] Wei-Yin Loh. Classification and regression trees. Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery, 1(1):14–23, 2011.

[32] Robert M. Losee. How part-of-speech tags affect text retrieval and filtering performance. CoRR, cmp-lg/9602001, 1996.

[33] Julie B. Lovins. Development of a stemming algorithm. Technical report, Massachusetts Institute of Technology, Electronic Systems Laboratory, Cambridge, 1968.

[34] M. K. Altaf Mahmud and Kazi Zubair Ahmed. Detecting flames and insults in text. In Proceedings of the 6th International Conference on Natural Language Processing, 2008.

[35] I. Maglogiannis, K. Karpouzis, B. A. Wallace, and J. Soldatos, editors.
Emerging Artificial Intelligence Applications in Computer Engineering: Real Word AI Systems with Applications in eHealth, HCI, Information Retrieval and Pervasive Technologies. IOS Press, 2007.

[36] Christopher D. Manning and Hinrich Schütze. An Introduction to Information Retrieval, pages 177–185. Cambridge University Press, 2009.


[37] Christopher D. Manning and Hinrich Schütze. An Introduction to Information Retrieval, pages 272–273. Cambridge University Press, 2009.

[38] Mitchell P. Marcus, Mary Ann Marcinkiewicz, and Beatrice Santorini. Building a large annotated corpus of English: the Penn Treebank. Comput. Linguist., 19(2):313–330, June 1993.

[39] Andrew McCallum and Kamal Nigam. A comparison of event models for naive Bayes text classification. In AAAI-98 Workshop on Learning for Text Categorization, 1998.

[40] George A. Miller. WordNet: A lexical database for English. Communications of the ACM, 38(11):39–41, 1995.

[41] Ruslan Mitkov, editor. The Oxford Handbook of Computational Linguistics. Oxford University Press, 2005.

[42] David L. Olson and Dursun Delen. Advanced Data Mining Techniques, page 138. Springer Publishing Company, Incorporated, 1st edition, 2008.

[43] Iadh Ounis, Gianni Amati, Vassilis Plachouras, Ben He, Craig Macdonald, and Douglas Johnson. Terrier information retrieval platform. In David Losada and Juan Fernández-Luna, editors, Advances in Information Retrieval, volume 3408 of Lecture Notes in Computer Science, pages 517–519. Springer Berlin / Heidelberg, 2005.

[44] Bo Pang and Lillian Lee. Opinion mining and sentiment analysis. Found. Trends Inf. Retr., 2(1-2):1–135, January 2008.

[45] Bo Pang, Lillian Lee, and Shivakumar Vaithyanathan. Thumbs up?: sentiment classification using machine learning techniques. In Proceedings of the ACL-02 conference on Empirical methods in natural language processing - Volume 10, EMNLP '02, pages 79–86, Stroudsburg, PA, USA, 2002. Association for Computational Linguistics.

[46] Fuchun Peng, Nawaaz Ahmed, Xin Li, and Yumao Lu. Context sensitive stemming for web search. In Proceedings of the 30th annual international ACM SIGIR conference on Research and development in information retrieval, SIGIR '07, pages 639–646, New York, NY, USA, 2007. ACM.

[47] Yonggang Qiu and Hans-Peter Frei. Concept based query expansion.
In Proceedings of the 16th annual international ACM SIGIR conference on Research and development in information retrieval, SIGIR '93, pages 160–169. ACM, 1993.

[48] Amir Razavi, Diana Inkpen, Sasha Uritsky, and Stan Matwin. Offensive language detection using multi-level classification. In Advances in Artificial Intelligence, volume 6085 of Lecture Notes in Computer Science, pages 16–27. Springer Berlin / Heidelberg, 2010.


[49] Jason D. M. Rennie, Lawrence Shih, Jaime Teevan, and David R. Karger. Tackling the poor assumptions of naive Bayes text classifiers. In Proceedings of the Twentieth International Conference on Machine Learning, pages 616–623, 2003.

[50] Philip Resnik. Semantic similarity in a taxonomy: An information-based measure and its application to problems of ambiguity in natural language. CoRR, abs/1105.5444, 2011.

[51] Luke Richards. Social media in Asia: Understanding the numbers. May 2012.

[52] Ellen Riloff and Janyce Wiebe. Learning extraction patterns for subjective expressions. In Proceedings of the 2003 conference on Empirical methods in natural language processing, EMNLP '03, pages 105–112, Stroudsburg, PA, USA, 2003. Association for Computational Linguistics.

[53] Ellen Riloff, Janyce Wiebe, and Theresa Wilson. Learning subjective nouns using extraction pattern bootstrapping. In Proceedings of the seventh conference on Natural language learning at HLT-NAACL 2003 - Volume 4, CONLL '03, pages 25–32, Stroudsburg, PA, USA, 2003. Association for Computational Linguistics.

[54] I. Rish. An empirical study of the naive Bayes classifier. In IJCAI 2001 Workshop on Empirical Methods in Artificial Intelligence, pages 41–46, 2001.

[55] J. J. Rocchio. The SMART Retrieval System: Experiments in Automatic Document Processing, pages 313–323. Prentice Hall, 1971.

[56] C. Romm-Livermore and K. Setzekorn. Social Networking Communities and E-Dating Services: Concepts and Implications. IGI Global, 2008.

[57] Neil Rubens. The application of fuzzy logic to the construction of the ranking function of information retrieval systems. Computer Modelling and New Technologies, 10(1):20–27, 2006.

[58] Sarah Schrauwen. Machine learning approaches to sentiment analysis using the Dutch Netlog corpus. Master's thesis, University of Antwerp, 2010.

[59] Frédérique Segond, Anne Schiller, Gregory Grefenstette, and Jean-Pierre Chanod. An experiment in semantic tagging using hidden Markov model tagging.
In ACL/EACL Workshop on Automatic Information Extraction and Building of Lexical Semantic Resources for NLP Applications, pages 78–81, 1997.

[60] Ellen Spertus. Smokey: Automatic recognition of hostile messages. In Proc. IAAI, pages 1058–1065, 1997.

[61] Pero Subasic and Alison Huettner. Affect analysis of text using fuzzy semantic typing. In The Ninth IEEE International Conference on Fuzzy Systems, FUZZ-IEEE 2000, volume 2, pages 647–652, 2000.


[62] Luis Talavera. An evaluation of filter and wrapper methods for feature selection in categorical clustering. In A. Famili, Joost Kok, José Peña, Arno Siebes, and Ad Feelders, editors, Advances in Intelligent Data Analysis VI, volume 3646 of Lecture Notes in Computer Science, pages 742–742. Springer Berlin / Heidelberg, 2005.

[63] Saul A. Teukolsky, William T. Vetterling, and Brian P. Flannery. Numerical Recipes 3rd Edition: The Art of Scientific Computing, chapter 16, pages 883–885. William H. Press, 2007.

[64] Saul A. Teukolsky, William T. Vetterling, and Brian P. Flannery. Numerical Recipes 3rd Edition: The Art of Scientific Computing, chapter 16, page 890. William H. Press, 2007.

[65] Mike Thelwall. Fk yea I swear: cursing and gender in MySpace. Corpora, 3(1):83–107, 2008.

[66] Simon Tong and Daphne Koller. Support vector machine active learning with applications to text classification. J. Mach. Learn. Res., 2:45–66, March 2002.

[67] Ellen M. Voorhees. Query expansion using lexical-semantic relations. In Proceedings of the 17th annual international ACM SIGIR conference on Research and development in information retrieval, SIGIR '94, pages 61–69. Springer-Verlag New York, Inc., 1994.

[68] Jinxi Xu and W. Bruce Croft. Query expansion using local and global document analysis. In Proceedings of the 19th annual international ACM SIGIR conference on Research and development in information retrieval, SIGIR '96, pages 4–11. ACM, 1996.

[69] Jinxi Xu and W. Bruce Croft. Corpus-based stemming using cooccurrence of word variants. ACM Trans. Inf. Syst., 16(1):61–81, January 1998.

[70] Jinxi Xu and W. Bruce Croft. Improving the effectiveness of information retrieval with local context analysis. ACM Trans. Inf. Syst., 18:79–112, January 2000.

[71] Zhi Xu and Sencun Zhu. Filtering offensive language in online communities using grammatical relations. In Seventh Annual Collaboration, Electronic Messaging, Anti-Abuse and Spam Conference, 2010.
[72] Jeonghee Yi, Tetsuya Nasukawa, Razvan Bunescu, and Wayne Niblack. Sentiment analyzer: Extracting sentiments about a given topic using natural language processing techniques. In IEEE Intl. Conf. on Data Mining (ICDM), pages 427–434, 2003.

[73] Hong Yu and Vasileios Hatzivassiloglou. Towards answering opinion questions: separating facts from opinions and identifying the polarity of opinion sentences. In


Proceedings of the 2003 conference on Empirical methods in natural language processing, EMNLP '03, pages 129–136, Stroudsburg, PA, USA, 2003. Association for Computational Linguistics.

[74] M. Zhu. Recall, precision and average precision. Technical report, University of Waterloo, 2004.

[75] Liron Zighelnic and Oren Kurland. Query-drift prevention for robust query expansion. In Proceedings of the 31st annual international ACM SIGIR conference on Research and development in information retrieval, SIGIR '08, pages 825–826, New York, NY, USA, 2008. ACM.


List of Figures

2.1 Distribution of messages containing offensive language.
2.2 Distribution of different classes in training set.
3.1 Example of Rocchio classification. Obtained from [36].
3.2 SVM hyperplane construction in the idealized case of linearly separable data. Obtained from [63].
3.3 Feature vectors are mapped from a two-dimensional space to a three-dimensional embedding space. Obtained from [64].
4.1 Graphical representation of this study's approach.
5.1 Recall comparison.
5.2 Precision comparison.
5.3 Precision-recall comparison.
5.4 Average precision comparison.
5.5 Number of relevant documents when 1000 documents are retrieved.
5.6 Average precision comparison.
5.7 Evolution of SVM performance on validation set.
5.8 Difference in performance when splitting the validation set into a set with small messages and one with the larger messages.
5.9 Snapshot of precision results showing the difference between the small validation set (messages that contain fewer than 20 features) and the large validation set (solely messages that contain at least 20 features).
5.10 Comparison of SVM performance when removing messages under a certain threshold from both training and validation set.
5.11 Precision and recall comparison of different stages of our Support Vector Machine.
5.12 Comparison of performance of different semantic classifiers on validation set.
5.13 Classification of messages solely based on the reactions to the message.
5.14 Comparison between different single classifiers.
5.15 Comparison of precision of both classifiers running on 'small' validation set.
5.16 Comparison of all implemented classification methods.

List of Tables

2.1 Distribution of validation set.
2.2 Distribution of different classes in training set.
4.1 Random entries from our word lists.
5.1 Expanded query.
5.2 Performance of Naive Bayes.
5.3 Results of classifier after running on validation set. Precision drops to 6% due to the ratio of true positives to false positives.
5.4 Results of classifier after running on randomly selected set.
5.5 Results of classifier after improving the training set.
5.6 Results of classifier after modifying validation set.
5.7 Results of Naive Bayes classifier on the modified validation set.
5.8 Results of classifiers running on 'small' validation set.
5.9 Results of combination of different classifiers on validation set.
5.10 Distribution of messages among the three classes: irrelevant, sexual and racist.
5.11 Impact of different situations.
