Twitter Spamming: Techniques And Defence Approaches

International Journal of Applied Engineering Research, ISSN 0973-4562 Vol.7 No.11 (2012) © Research India Publications; http://www.ripublication.com/ijaer.htm

Arun Kumar R
Electronics & Computer Engineering Department, Indian Institute of Technology Roorkee, Roorkee, India

Sandeep Kumar
Electronics & Computer Engineering Department, Indian Institute of Technology Roorkee, Roorkee, India

Abstract: The rapid growth of social networking sites has made a huge impact on today’s society and the Web platform. Social networking sites have grown quickly in both size and popularity in recent years, and among them Twitter is the fastest growing. Its popularity attracts many spammers, who flood legitimate users’ accounts with large amounts of spam messages. As the volume of data on Twitter keeps growing, detecting spam in real time has become a challenging task for researchers as well as for Twitter itself, and a great deal of work is being carried out on spam detection. Twitter spam detection covers both detecting spammers and detecting spam links posted by users. In this paper we give a hierarchy of spamming techniques, defence approaches, and the evasion tactics adopted by spammers to evade detectors.

Keywords: Spam, Twitter, Machine Learning, Social Networking

INTRODUCTION
The web has changed the way we inform and get informed. Online Social Networking (OSN) sites provide a new form of communication in today’s world. As OSNs have grown, the number of spam messages has also increased. Spam is an inseparable part of the web: it has taken various forms, from e-mail spam, first seen in 1978, to the currently emerging social networking spam on Facebook, LinkedIn and Twitter.

Twitter is a microblogging service, founded in 2006, where users can post 140-character messages called tweets. The goal of Twitter is to allow friends to communicate and stay connected through the exchange of short messages. Unlike Facebook and MySpace, Twitter is directed, meaning that a user can follow another user, but the second user is not required to follow back. Most accounts are public and can be followed without the owner’s approval. Because public accounts can be followed without permission, spammers can easily follow legitimate users as well as other spammers.

The term spam refers to “unsolicited bulk messages”, and a spammer is one who spreads these messages. Spam is becoming an increasing problem on Twitter, as on other online social networking sites. Spammers use Twitter as a tool to post malicious links, send unsolicited messages to legitimate users, and hijack trending topics. Tweet spammers are driven by several goals, such as spreading advertisements to generate sales, disseminating pornography, viruses or phishing, or simply compromising the system’s reputation. Spammers are so efficient at sending out spam because they follow many Twitter users and hope that those users turn around and follow them; following back is considered proper etiquette on Twitter [1]. Spam not only pollutes real-time search, it also consumes extra resources from users and systems [8]. Most importantly, spam wastes human attention, the most valuable resource in the current era of the World Wide Web. Given that spammers are increasingly arriving on Twitter, the success of real-time search services and mining tools relies on the ability to distinguish valuable tweets from spam attacks.

RELATED WORK
Spam detection has an extensive research scope that covers identifying spammers or spam, preventing spammers, and counterbalancing the effects of spam on the media, society, etc. Kwak et al. [2] presented an exhaustive and qualitative study of the behaviour of Twitter user accounts, such as the variation in the number of followers and followings for normal users and spammers. Cha et al. [3] designed alternative metrics to measure Twitter accounts. McCord and Chuah [5] showed that user-based and content-based features, which are influenced by Twitter policies, can be used to distinguish between spammers and legitimate users on Twitter. They evaluated the usefulness of these features for spammer detection using traditional classifiers such as Random Forest, Naïve Bayesian, Support Vector Machine and K-Nearest Neighbor schemes on a Twitter dataset they collected. Benevenuto et al. [16] investigated different trade-offs for the classification approach of detecting spammers, instead of tweets containing spam, and the impact of different attribute sets. They also showed that the performance of the classifier changes with the feature set selected. Spirin [9] showed the importance of URL-derived feature sets in detecting spammers. Yang et al. [4] focus on analysing the evasion tactics utilized by current Twitter spammers and designed new machine learning features to detect Twitter spammers more effectively. In addition, the authors formalize the robustness of 24 detection features.

TWITTER TAXONOMY
Twitter has its own taxonomy, which this section defines.
a) Tweets: Short messages of at most 140 characters, used as the communication tool.
b) Followers: A user’s followers are the set of users who receive a tweet when it is posted. If a user posts a tweet on his home page, all of his followers receive the same tweet on their home pages too.
c) Friends: Friends are the set of users an account subscribes to in order to obtain access to their status updates.
d) Hashtags: These are indicated by a # symbol and are combined with keywords to indicate a topic of interest. A hashtag becomes popular when many people use it.
e) Trending Topics: These are the popular hashtags. They appear on the main Twitter page, which can significantly increase the number of tweets containing that topic.
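The taxonomy elements that live inside a tweet can be pulled out with a few regular expressions. The following is a minimal sketch, not a reference parser: the function name `parse_tweet` and the three patterns are assumptions made for illustration, and real tweet parsing has many more edge cases.

```python
import re

# Hypothetical patterns for illustration only.
HASHTAG_RE = re.compile(r"#(\w+)")
MENTION_RE = re.compile(r"@(\w+)")
URL_RE = re.compile(r"https?://\S+")


def parse_tweet(text):
    """Split a tweet into the taxonomy elements it contains."""
    if len(text) > 140:
        raise ValueError("a tweet is limited to 140 characters")
    urls = URL_RE.findall(text)
    # Remove URLs first so a '#fragment' inside a link is not counted as a hashtag.
    stripped = URL_RE.sub(" ", text)
    return {
        "hashtags": [h.lower() for h in HASHTAG_RE.findall(stripped)],
        "mentions": MENTION_RE.findall(stripped),
        "urls": urls,
    }
```

Hashtags are lower-cased here so that later comparisons against a list of trending topics are case-insensitive; that normalisation is a design choice of this sketch.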



TWITTER SPAMMING TECHNIQUES
Twitter spamming techniques can be divided into two categories:

A. Profile-Based Spamming Techniques
• Follow Spam: Follow spam is the act of following a large number of people, not because the user is actually interested in their tweets, but simply to gain attention, get views of the user’s profile (and possibly clicks on URLs therein), or (ideally) to get followed back. Automated programs make this task easier, allowing spammers to follow thousands of users within a fraction of a second. In extreme cases, these automated accounts follow so many people that they threaten the performance of the entire system. In less extreme cases, they simply annoy thousands of legitimate users, who get a notification about a new follower only to find out that the follower’s interest may not be entirely sincere. Such accounts can be identified by checking the tweets they post and examining their behaviour. Figure 1 shows an instance of follow spam.

Figure 1: Instance of Follow Spam Technique [7]

• Mention Spam: Spammers include the username of a targeted user in a tweet, which grabs the targeted user’s attention.

B. Content-Based Spamming Techniques
• Trend Abuse: Twitter’s API provides a list of the top trends per hour. Spammers use these trending topics in their tweets, and the tweets appear in the timeline, annoying all users because tweets from public accounts can be seen by anyone on Twitter. Figure 2 shows two typical instances: (a) trend abuse and (b) multi-trend spam.

Figure 2: Instances of trend abuse spamming techniques [11]: (a) Trend Abuse Scenario, (b) Multi-Trend Abuse Scenario

• Trend Setting: Spammers post a large number of tweets containing a specific word, making that word or hashtag a new trending topic.
• Fake Re-tweets: Spammers take advantage of Twitter’s re-tweet convention to make it appear that a spammer’s tweet was originally published by another user. These can be identified with Twitter’s search capabilities, which distinguish re-tweets from original tweets.
• Embedding Popular Search Terms: Spammers include popular search terms in their tweets. When a user searches for those terms, the spam tweets are displayed in the result set, which is again an annoying experience for a legitimate user who does not get the expected results.
• Direct Message: This is the traditional spamming technique, in which spammers send a personal message directly to the targeted user.

New spamming techniques are still emerging today. The techniques explained above are the most popular ones used by spammers.
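The trend abuse and multi-trend scenarios described in this section can be illustrated with a simple heuristic: count how many currently trending topics a tweet carries as hashtags, and flag tweets that combine a link with several of them. This is only a sketch under stated assumptions. The function names and the `max_trends` threshold are hypothetical and do not come from the paper, and a real detector would combine this signal with other features.

```python
import re

HASHTAG_RE = re.compile(r"#(\w+)")


def trending_hits(tweet_text, trending_topics):
    """Return the trending topics that appear as hashtags in the tweet, sorted."""
    tags = {t.lower() for t in HASHTAG_RE.findall(tweet_text)}
    trends = {t.lstrip("#").lower() for t in trending_topics}
    return sorted(tags & trends)


def looks_like_trend_abuse(tweet_text, trending_topics, max_trends=1):
    """Flag tweets that carry a link and more trending hashtags than max_trends.

    max_trends is a hypothetical threshold, not a value from the paper.
    """
    has_link = "http://" in tweet_text or "https://" in tweet_text
    hits = trending_hits(tweet_text, trending_topics)
    return has_link and len(hits) > max_trends
```

With the default threshold, a tweet that mentions a single trend is never flagged, so ordinary users discussing a trending topic are left alone; only link-bearing tweets that stack unrelated trends are marked.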

TWITTER SPAM DETECTION
Twitter itself offers several options for users to report spam messages or spammers. Some of them are: reporting a user as a spammer by clicking the “Report @username as Spam” button under the Actions section of a profile’s sidebar, reporting a tweet link as spam, and blocking a suspicious user. Twitter also provides guidelines for analysing a spammer and a set of “DON’T” rules for users [10]. Detection techniques for Twitter spam can be classified into two categories:

A. Detecting Spammers (nodes)
Spammers are detected by applying machine learning algorithms to data extracted through various data mining techniques, using the features specified in feature sets. This approach to Twitter spam detection is carried out in three steps:

a) Crawling Twitter data and building a labelled collection: Data about Twitter users can be crawled using different approaches. Twitter provides different APIs, such as the REST API, Streaming API and Search API. The data is crawled according to the feature set selected. The collected data is then manually labelled as spam or non-spam by examining the recent tweets and timeline of each user. The links in each user’s tweets are examined and checked manually. As this is a time-consuming process, labelling is done for a small set of data, on the basis of the feature set selected.

b) Construction of feature sets: One of the crucial and time-consuming tasks in web spam detection systems is feature extraction, which is usually accomplished after crawling and during the indexing phase. If fewer features are used to detect spam pages, computational cost is saved and the performance of the system therefore increases. Automated feature selection techniques from data mining provide an effective method for selecting the most predictive features from the many available.
After features are selected by feature selection methods, their effectiveness can be investigated by comparing the accuracy of classification algorithms applied to only the selected features with the accuracy obtained using all features. Ideally, the smaller feature set reaches the same or a higher level of performance. Feature sets can be classified into the following types:

1) Graph-based Features [11]:
• # friends
• # followers
• Reputation score [5] (# friends / # followers)
• Users within a certain distance in the social graph

2) Content-based Features:
• # duplicate tweets
• # HTTP links
• # replies and mentions
• # trending topics

3) URL-based Features:
• # tweets containing “spammy” URLs
• Fraction of tweets containing “spammy” URLs
• # “spammy” URLs
• # unique URLs

Various other types of feature sets, e.g. user-based, tweet-based, timeline-based and neighbourhood-based features, are used depending on the requirements of the detection system.

c) Classifying spammers and non-spammers using machine learning algorithms: After selecting features, we apply various classification algorithms and measure their performance on our dataset. The results are used to compare the effectiveness of different feature selection methods. For the classification task, we use algorithms such as neural networks, Support Vector Machines (SVM), the Naïve Bayesian classifier, decision trees and logistic regression. The performance of the detection process depends on the right combination of feature sets and machine learning algorithm.

B. Detecting Spam (tweets, links)
In this paper we focus on detecting spammers; detecting spam links falls under the class of web spam detection.
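Steps b) and c) can be sketched in a few lines of plain Python. The code below is an illustration under several assumptions, not the method of any cited work: the `account` dictionary layout, the `spammy_domains` blocklist and both function names are hypothetical; the reputation score follows the definition given in the list above, with `max(followers, 1)` added to avoid division by zero; and a small k-nearest-neighbour vote stands in for the classifiers named in step c).

```python
import math
import re
from urllib.parse import urlparse

URL_RE = re.compile(r"https?://\S+")


def extract_features(account, spammy_domains, trending_topics):
    """Build a feature dictionary for one account.

    `account` is a hypothetical dict with keys 'friends', 'followers', 'tweets'.
    """
    tweets = account["tweets"]
    friends, followers = account["friends"], account["followers"]
    trends = {t.lower() for t in trending_topics}

    def is_spammy(url):
        return urlparse(url).netloc.lower() in spammy_domains

    per_tweet_urls = [URL_RE.findall(t) for t in tweets]
    urls = [u for found in per_tweet_urls for u in found]
    spammy_tweets = sum(1 for found in per_tweet_urls if any(map(is_spammy, found)))

    return {
        # Graph-based features
        "friends": friends,
        "followers": followers,
        # max(..., 1) is an assumption that guards against zero followers.
        "reputation": friends / max(followers, 1),
        # Content-based features (rough proxies)
        "duplicate_tweets": len(tweets) - len(set(tweets)),
        "http_links": len(urls),
        "replies_mentions": sum(t.count("@") for t in tweets),
        "trending_topics": sum(
            1 for t in tweets for word in t.lower().split() if word in trends
        ),
        # URL-based features
        "spammy_url_tweets": spammy_tweets,
        "spammy_url_fraction": spammy_tweets / len(tweets) if tweets else 0.0,
        "spammy_urls": sum(1 for u in urls if is_spammy(u)),
        "unique_urls": len(set(urls)),
    }


def knn_predict(train, query, k=3):
    """Majority vote among the k labelled vectors closest to `query`.

    `train` is a list of (feature_vector, label) pairs; use an odd k with
    two classes so the vote cannot tie.
    """
    nearest = sorted(train, key=lambda item: math.dist(item[0], query))[:k]
    labels = [label for _, label in nearest]
    return max(set(labels), key=labels.count)
```

A feature vector for `knn_predict` can be formed by picking a fixed subset of the dictionary values, for example `[f["reputation"], f["spammy_url_fraction"]]`. In practice the features would be scaled to comparable ranges first, since Euclidean distance is otherwise dominated by the largest-valued feature.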

EVASION TACTICS
In spite of the efforts of many researchers and of Twitter to detect and avoid spam (as discussed in section V), spammers adopt evasion tactics to escape these detection methods. This section discusses the classification of evasion tactics and the methods adopted.

A. Evasion Techniques
The main evasion tactics utilized by spammers to evade existing detection approaches can be categorized into the following two types:

a) Profile-based Feature Evasion Tactics: A common intuition for discovering Twitter spam accounts is to look at an account’s basic profile information, such as its number of followers and number of tweets, since these indicators usually reflect the account’s reputation. To evade such profile-based detection features, spammers mainly use two tactics: gaining more followers and posting more tweets.

• Gaining More Followers: In general, the popularity of an account can be measured by its number of followers. A higher number of followers commonly implies that more users trust the account and would like to receive information from it. Thus, many profile-based detection features, such as number of followers, fofo ratio (the ratio of the number of accounts an account follows to its number of followers) and reputation score, are built on this number. To evade these features or break through Twitter’s 2,000 Following Limit Policy [8], spammers mainly adopt the following strategies to gain more followers. The first approach is to purchase followers from websites. These websites charge a fee and then use a group of Twitter accounts to follow their customers. The specific methods of providing these accounts may differ from site to site. The second approach is to exchange followers with other users. This method is usually assisted by a third-party website, which uses existing customers’ accounts to follow new customers’ accounts. Since this method only requires Twitter accounts to follow several other accounts in order to gain more followers, without paying any fee, Twitter spammers can get around the referral clause by creating more fake accounts. In addition, spammers can gain followers by creating a bunch of fake accounts and following their spam accounts with them. Figure 3 shows an existing website from which users can directly buy followers.

Figure 3: Example of twitter followers’ online trading websites [6]

• Posting More Tweets: Tweet-based features are also widely used in existing Twitter spammer detection approaches. To evade such features, spammers can post more tweets at regular intervals to behave more like legitimate accounts, especially by continuing to use public tweeting tools or software.

b) Content-based Feature Evasion Tactics: The percentage of tweets containing URLs is an effective indicator of spam accounts, and is utilized in work such as [9]. Many existing approaches design content-based features such as tweet similarity (number of posted tweets having similar semantic meaning) [6] and duplicate tweet count (number of duplicate tweets posted) [6] to detect spam accounts. To evade such content-based detection features, spammers mainly use two tactics: mixing normal tweets and posting heterogeneous tweets.

• Mixing Normal Tweets: Spammers can use this tactic to evade content-based features such as URL ratio (ratio of the number of posted tweets that contain a link to the total number of tweets posted), unique URL ratio (ratio of the number of unique URLs posted to the total number of URLs posted) and hashtag ratio (ratio of tweets containing hashtags to the total number of tweets posted) [9]. With this tactic, spammers dilute their spam tweets and make their accounts more difficult to distinguish from legitimate accounts. Figure 4 shows an instance of such a scenario.

Figure 4: Instance of mixing Normal Tweets.

• Posting Heterogeneous Tweets: Spammers can post heterogeneous tweets to evade content-based features such as tweet similarity [9] and duplicate tweet count [9]. Spammers can use public tools to convert a few different spam tweets into hundreds of variable tweets that have the same semantic meaning but use different words. Figure 5 shows an online tool that does this job.

Figure 5: Scenario of posting heterogeneous tweets and Spin Bot [6].
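The ratios defined in this section, and the way the evasion tactics shift them, can be made concrete with a short sketch. All function names below are hypothetical. The `max(followers, 1)` guard is an added assumption, and word-level Jaccard overlap is used as a simple stand-in for tweet similarity: it does not capture semantic meaning, which is exactly why tweets rewritten with different words slip past it.

```python
import re

URL_RE = re.compile(r"https?://\S+")


def content_ratios(tweets):
    """Compute URL ratio, unique URL ratio and hashtag ratio for a list of tweets."""
    if not tweets:
        return {"url_ratio": 0.0, "unique_url_ratio": 0.0, "hashtag_ratio": 0.0}
    urls = [u for t in tweets for u in URL_RE.findall(t)]
    with_url = sum(1 for t in tweets if URL_RE.search(t))
    with_tag = sum(1 for t in tweets if "#" in t)
    return {
        "url_ratio": with_url / len(tweets),
        "unique_url_ratio": len(set(urls)) / len(urls) if urls else 0.0,
        "hashtag_ratio": with_tag / len(tweets),
    }


def fofo_ratio(following, followers):
    """Following-to-follower ratio; max(..., 1) is an assumption to avoid dividing by zero."""
    return following / max(followers, 1)


def jaccard(tweet_a, tweet_b):
    """Word-level overlap between two tweets (a crude, non-semantic similarity)."""
    words_a, words_b = set(tweet_a.lower().split()), set(tweet_b.lower().split())
    union = words_a | words_b
    return len(words_a & words_b) / len(union) if union else 0.0
```

As an example of dilution, four identical spam tweets that each contain a link give a URL ratio of 1.0, and mixing in twelve link-free tweets lowers it to 4/16 = 0.25. Likewise, an account following 2,000 users with 50 followers has a fofo ratio of 40, which drops to 1 once it has acquired 2,000 followers.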

CONCLUSIONS AND FUTURE WORK
In this paper we have categorised and discussed various types of spamming techniques, the general approach to detecting spammers, and the categories of evasion tactics used to evade the features that detectors rely on. With the spamming techniques and detection methods explained in the previous sections, one should be able to:
1. Identify instances of spam
2. Prevent spammers
3. Counterbalance the effect of spamming
Spam detection and counterbalancing its impact is a never-ending story; it is just a matter of how fast one can detect spam accurately. Complete removal of spam is a myth. This work can be further extended by building an expert system that detects spammers on Twitter using minimal and robust feature sets that are costly to evade.

ACKNOWLEDGMENT
I would like to acknowledge the contribution of Late Dr. Anjali Sardana, Assistant Professor, Electronics and Computers Department, IIT Roorkee, whose guidance was indispensable throughout the course of this work.

REFERENCES
[1] K. Beck, “Analyzing Tweets to Identify Malicious Messages”, in IEEE International Conference on Electro/Information Technology (EIT), pp. 1-5, May 15-17, 2011.
[2] H. Kwak, C. Lee, H. Park, and S. Moon, “What is Twitter, a Social Network or a News Media?”, in Int’l World Wide Web (WWW ’10), 2010.
[3] M. Cha, H. Haddadi, F. Benevenuto, and K. Gummadi, “Measuring User Influence in Twitter: The Million Follower Fallacy”, in Int’l AAAI Conference on Weblogs and Social Media (ICWSM), 2010.
[4] Chao Yang, Robert Chandler Harkreader, and Guofei Gu, “Die Free or Live Hard? Empirical Evaluation and New Design for Fighting Evolving Twitter Spammers”, RAID 2011, pp. 318–337, Springer-Verlag Berlin Heidelberg, 2011.
[5] M. McCord and M. Chuah, “Spam Detection on Twitter Using Traditional Classifiers”, in Proc. 8th International Conference on Autonomic and Trusted Computing, ATC 2011, pp. 175-186, Banff, Canada, September 2-4, 2011.
[6] Alex Hai Wang, “DON’T FOLLOW ME: Spam Detection in Twitter”, in Proc. International Conference on Security and Cryptography (SECRYPT), pp. 1-10, July 26-28, 2010.
[7] “Social Signals and SEO-Can Facebook and Twitter help my SEO?”, http://thesemblog.com/tag/twitter/, March 2011.
[8] DITESCO, “The 2000 Following Limit Policy On Twitter”, http://twittnotes.com/2009/03/2000following-limit-on-twitter.html, March 10, 2009.
[9] Nikita Spirin, “Mutually Reinforcing Spam Detection on Twitter and Web”, in VIII All-Russian scientific conference Microsoft Technologies in the theory and practice of programming, pp. 1-7, Saint-Petersburg, Russia, 2011.
[10] Twitter, “Twitter Rules”, https://support.twitter.com/articles/18311-the-twitterrules, 2012.
[11] F. Benevenuto, Gabriel Magno, Tiago Rodrigues, and Virgílio Almeida, “Detecting Spammers on Twitter”, in CEAS 2010, Seventh Annual Collaboration, Electronic messaging, Anti-Abuse and Spam Conference, July 13-14, 2010, Redmond, US.
[12]