Identifying Botnets Using Anomaly Detection Techniques Applied to DNS Traffic

Ricardo Villamarín-Salomón and José Carlos Brustoloni
Department of Computer Science, University of Pittsburgh, Pittsburgh, PA 15260, USA
Email: {rvillsal, jcb}@cs.pitt.edu

Abstract—Bots are compromised computers that communicate with a botnet command and control (C&C) server. Bots typically employ dynamic DNS (DDNS) to locate the respective C&C server. By injecting commands into such servers, botmasters can reuse bots for a variety of attacks. We evaluate two approaches for identifying botnet C&C servers based on anomalous DDNS traffic. The first approach looks for domain names whose query rates are abnormally high or temporally concentrated. High DDNS query rates may be expected because botmasters frequently move C&C servers, and botnets with as many as 1.5 million bots have been discovered. The second approach looks for abnormally recurring DDNS replies indicating that the queried name does not exist (NXDOMAIN). Such queries may correspond to bots trying to locate C&C servers that have been taken down. In our experiments, the second approach automatically identified several domain names that were independently reported by others as being suspicious, while the first approach was not as effective.

I. INTRODUCTION

A bot is a malicious program that an attacker (also known as a botmaster) can control remotely through a C&C infrastructure. A botnet is a network of hosts infected with such a program. Botnets are typically used for nefarious activities such as spamming and click fraud. The C&C of a botnet is usually centralized [8]. Bots contact their C&C server to receive instructions. A common approach to combating botnets is to identify and disrupt this C&C communication [7]. If a botnet uses a single C&C server at a fixed IP address, network administrators can easily disrupt the botnet by null-routing that address [8]. To evade such disruption, botmasters often aggressively replicate and move C&C servers. In such botnets, bots typically use DDNS queries to locate an available C&C server instance.

Dagon et al. [10] proposed identifying botnet C&C servers by detecting domain names with abnormally high or temporally concentrated DDNS query rates. High query rates may be expected because botnets with as many as 1.5 million bots have been discovered [18]. Dagon et al. use Chebyshev's inequality [2] and a simplified version of the Mahalanobis distance [11] to quantify how anomalous the number of queries for each domain name is during a day or during an hour of that day, respectively. Considering that botnets often use Third Level Domains (3LDs) instead of subdirectories [10], Dagon et al. aggregate the lookups for each Second Level Domain (SLD) with

those of the respective 3LDs.

An alternative approach was proposed by Schonewille and van Helmond [28]. They observed that DDNS responses indicating a name error (NXDOMAIN) often correspond to botnet C&C servers that have been taken down. Hosts that repeatedly issue such queries may be infected with a bot and may still have the vulnerability that enabled the infection.

This paper experimentally evaluates the effectiveness of these approaches for detecting botnets in enterprise and access provider networks. Although these approaches have been known for some time, they do not seem to have been replicated by other researchers. Consequently, they could depend on network or botnet characteristics that are not always present or have changed since their initial proposal. In our experiments, the approach based on abnormally high or temporally concentrated query rates was ineffective. We found that many legitimate and popular domains currently use DNS with short time-to-live (TTL) values, including gmail.com, yahoo.com, and mozilla.com. The evaluated algorithms misclassify these names as C&C servers. On the other hand, the approach based on abnormally recurring NXDOMAIN replies was very effective. It detected several domain names that were independently reported by others to be suspicious.

The rest of this paper is organized as follows. Section II provides details of the anomaly detection techniques used. Section III discusses the proposed approaches for DDNS-based botnet detection. Section IV describes our evaluation methodology and Section V presents our experimental results. Section VI discusses the results and Section VII concludes.

II. ANOMALY DETECTION TECHNIQUES

In anomaly detection, the goal is to find objects that are different from most other objects [1]. Anomalous objects are also known as outliers because they lie far away from other data points in a scatter plot [1]. There are many algorithms for anomaly detection. In this section we briefly discuss Chebyshev's inequality [2], the Mahalanobis distance [1], and a simplified version of the latter [11].

A. Chebyshev's Inequality

An outlier detection method based upon Chebyshev's inequality can be used [2] when (1) the distribution of the available data is unknown or an experimenter does not want to make assumptions about its distribution, and (2) the observations are expected to be independent of one another. Chebyshev's inequality states [3]:

    Pr(|X − E(X)| ≥ kσ) ≤ 1/k²,    (1)

where X is a random variable, E(X) is its expected value, σ is its standard deviation, and k > 0 is a parameter. This formula establishes an upper bound on the fraction of data points whose values lie more than k standard deviations away from the population mean [2]. If k is set to 4.47, the upper bound is 0.05, a typical cut-off point for statistical significance [5]. Data points whose values lie more than k standard deviations away from the mean are considered anomalous [2].

B. Distance Measures

A distance d(x, y) between two data objects, x and y, quantifies how dissimilar they are. These objects can be viewed as points with n dimensions (features or attributes). When n is greater than one, the objects are represented as vectors. If the observations are multivariate Gaussian, we can classify points as anomalous if they have low probability with respect to the estimated distribution of the data [1]. One way to measure this is to calculate the distance of the object to the center of the distribution [1]. To do so, we can use the Mahalanobis distance, which takes the shape of the data distribution into account [1] and is defined as:

    mahalanobis(x, y) = (x − y) Σ⁻¹ (x − y)ᵀ,    (2)

where Σ⁻¹ is the inverse of the covariance matrix of the data. There are a number of requirements for using the Mahalanobis distance to detect anomalous objects. First and foremost, the distribution of the data has to be Gaussian. Second, the covariance matrix of the data must be invertible. Third, even when the matrix is invertible, computing its inverse is computationally expensive. To address the last two problems, a simplified version of the Mahalanobis distance was formulated in [11]:

    simplified_mahalanobis(x, y) = Σᵢ₌₁ⁿ |xᵢ − yᵢ| / (σᵢ + α),    (3)

where n is the number of features and σ is the vector of standard deviations of the features in the data vectors. To prevent the distance from becoming infinite when a specific feature has a constant value across all observations, a smoothing factor α is added to every standard deviation σᵢ. This simplified distance is based on the "naïve" assumption that the vector's elements are statistically independent. In cases where that assumption holds, the covariance matrix Σ becomes diagonal and the elements along the diagonal are just the variances of the individual variables [11].
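As an illustration, a minimal sketch of the simplified distance in (3); the feature vectors and standard deviations below are hypothetical, not taken from the paper's data set:

```python
def simplified_mahalanobis(x, y, sigma, alpha=0.001):
    """Simplified Mahalanobis distance of eq. (3):
    sum over features of |x_i - y_i| / (sigma_i + alpha).
    alpha keeps the distance finite when some sigma_i is zero."""
    return sum(abs(xi - yi) / (si + alpha)
               for xi, yi, si in zip(x, y, sigma))

# Hypothetical example: distance of an observation x to a reference
# vector y, given per-feature standard deviations sigma.
y = [10.0, 5.0, 1.0]
sigma = [2.0, 1.0, 0.5]
print(simplified_mahalanobis([14.0, 5.0, 1.0], y, sigma))  # only the first feature differs
```

Only the first feature contributes here, so the distance is 4/(2 + α), just under 2.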

III. DDNS-BASED BOTNET DETECTION

Bots typically initiate contact with C&C servers to poll for instructions. C&C servers often cannot initiate such contacts because bots’ addresses are unknown or dynamic or bots are

behind firewalls or network address translators (NATs) that do not permit externally initiated connections. Botmasters usually assign fully qualified domain names (FQDNs) to C&C servers. FQDNs enable bots to use DNS to locate the respective C&C servers. A botmaster can (possibly using a stolen credit card) create and register a FQDN or use a free 3LD from a DDNS provider [8]. In either case, the C&C server's FQDN is typically hosted by a DDNS provider. If network administrators disrupt a C&C server at a certain IP address, the botmaster can easily set up another C&C server instance with the same name at a different IP address. Because DDNS providers set low TTL values for the domain names they host ([16], [17]), IP address changes in C&C servers propagate almost immediately to bots [8].

Dagon et al. observed that DDNS query rates can be much higher for botnets' C&C servers than for other domain names [10]. In their method, a "Canonical DNS Request Rate" (CDRR) aggregates the query rate of an SLD with the query rates of the SLD's children 3LDs, according to the formula:

    C_SLDᵢ = R_SLDᵢ + Σⱼ₌₁^|SLDᵢ| R_3LDⱼ,    (4)

where R_SLDᵢ is the query rate of SLDᵢ, |SLDᵢ| is the number of 3LDs under SLDᵢ, and R_3LDⱼ is the query rate of the j-th such 3LD.

Dagon et al. suggest that when the CDRR of a name is anomalous according to Chebyshev's inequality (1), that name has an abnormally high query rate and is likely to belong to a botnet C&C server [10]. They also propose detecting botnet C&C servers as those names whose temporal CDRR distribution is abnormally concentrated [10]. Each SLD's hourly CDRRs in a day are sorted in decreasing order, forming a feature vector. They suggest that names whose feature vector differs from that of a normal name by more than a threshold are likely to belong to a botnet C&C server. They use the simplified Mahalanobis distance (3) to measure the distance between feature vectors.

Schonewille and van Helmond's alternative approach looks instead at NXDOMAIN reply rates [28]. Reply rates can be classified as anomalous by algorithms similar to those Dagon et al. used for query rates.
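For illustration, the aggregation in (4) over per-name query rates might look like the following sketch; the domain names and rates are invented, and the two-label SLD extraction is a simplification that ignores country-code suffixes:

```python
from collections import defaultdict

def canonical_rates(query_rates):
    """Aggregate each SLD's query rate with those of its children 3LDs,
    as in eq. (4). query_rates maps a queried name (SLD or 3LD) to its
    observed query rate."""
    cdrr = defaultdict(float)
    for name, rate in query_rates.items():
        labels = name.split(".")
        sld = ".".join(labels[-2:])  # e.g. "a.example.com" -> "example.com"
        cdrr[sld] += rate            # 3LD rates roll up into the parent SLD
    return dict(cdrr)

rates = {"example.com": 5.0, "a.example.com": 7.0,
         "b.example.com": 3.0, "other.net": 2.0}
print(canonical_rates(rates))  # {'example.com': 15.0, 'other.net': 2.0}
```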

IV. METHODOLOGY

This section outlines the methods we used to experimentally evaluate the proposed DDNS-based botnet detection approaches.

A. Data Collection

We collected all DNS traffic at the University of Pittsburgh (Pitt)'s Computer Science (CS) department for a period of 192 hours (9 days) starting on 2/13/2007. We used the tcpdump network sniffer to collect this data (11 GB) and store it in the pcap format. This format can be read and filtered using the libpcap library.

B. Data Selection

We filtered the data into three subsets as specified in Table I. All subsets exclude truncated packets and contain only DNS responses with TTL values of at most 60, 300 or 600 seconds either in the answer resource record (for answers of type A) or

in the authoritative resource record (for referrals or NXDOMAIN response codes). Responses with longer TTL values are excluded because DDNS providers typically do not use them [16][17].

TABLE I. SUBSETS OF COLLECTED DATA (AA=Authoritative Answer, RR=resource record, NXDOMAIN=name error, ANS=answer RR, AUTH=Authority RR, TTL=Time to Live, NS=Name Server, SOA=Start of Authority)

Subset Name         CSAA           CS_NS          DDNS_NS
TTL values (secs.)  60, 300, 600   60, 300, 600   60, 300, 600
AA flag             1              0, 1           0, 1
Response code       NOERROR        NXDOMAIN       NOERROR, NXDOMAIN
Referral allowed?   No             --             Yes
TTL in RR           ANS            ANS            ANS, AUTH
RR types            A              --             A, NS, SOA
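The selection criteria in Table I can be expressed as filter predicates over already-parsed DNS responses. The sketch below is ours, not the paper's code; the dictionary field names are a simplification of what a pcap-parsing library would expose:

```python
def in_csaa(resp, max_ttl):
    """CSAA: authoritative NOERROR answers with a low-TTL A record."""
    return (resp["aa"] == 1 and resp["rcode"] == "NOERROR"
            and any(rr["type"] == "A" and rr["ttl"] <= max_ttl
                    for rr in resp["answers"]))

def in_cs_ns(resp, max_ttl):
    """CS_NS: NXDOMAIN responses (authoritative or not) whose
    authority RR (e.g., SOA) has a low TTL."""
    return (resp["rcode"] == "NXDOMAIN"
            and any(rr["ttl"] <= max_ttl for rr in resp["authority"]))

resp = {"aa": 1, "rcode": "NOERROR",
        "answers": [{"type": "A", "ttl": 30}], "authority": []}
print(in_csaa(resp, 60), in_cs_ns(resp, 60))  # True False
```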

Our first subset, called CSAA, includes only responses with the NOERROR (0) return code and the authoritative answer (AA) flag set. This subset includes only answer resource records (RRs) of type A and excludes referrals and cached responses (e.g., from local DNS resolvers). Our second subset, named CS_NS, contains only responses with the NXDOMAIN return code (authoritative or not) returned by the CS department's DNS resolvers. When a host's DNS resolver receives such a response, it often appends the host's domain suffix to the queried name and retries the query. We filter out such query retries. Our third subset, called DDNS_NS, contains responses from known DDNS providers with NOERROR or NXDOMAIN return codes. It does not exclude referrals or cached responses. The known DDNS providers are those listed in [16] or [17] or otherwise discovered in our own data set. We formed three versions of each subset, each with a different maximum TTL value. For example, CS_NS_60 is the version of CS_NS with maximum TTL equal to 60 seconds.

C. Detection of abnormally high rates

As in [10], we performed the following steps for each subset version. First, for each SLD in the subset, we computed the SLD's total CDRR for the entire 192-hour period. Second, we computed the average and standard deviation of the total CDRRs. Third, for each SLD in the subset, we verified whether the SLD is anomalous according to Chebyshev's inequality with k = 4.47. Finally, we investigated whether anomalous SLDs are indeed suspicious.
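The steps above can be sketched as follows; the SLD names and CDRR totals are invented for illustration:

```python
from statistics import mean, stdev

def chebyshev_anomalies(cdrr_by_sld, k=4.47):
    """Flag SLDs whose total CDRR lies more than k standard deviations
    from the mean. By Chebyshev's inequality (1), at most 1/k^2 of the
    points can lie that far out (about 5% for k = 4.47)."""
    values = list(cdrr_by_sld.values())
    mu, sd = mean(values), stdev(values)
    return [sld for sld, r in cdrr_by_sld.items() if abs(r - mu) > k * sd]

# Hypothetical totals over the capture period: 100 quiet names and
# one name queried far more often.
totals = {f"name{i}.com": 10.0 for i in range(100)}
totals["cnc.example.net"] = 5000.0
print(chebyshev_anomalies(totals))  # ['cnc.example.net']
```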

D. Detection of abnormally temporally concentrated rates

Attempting to follow [10], we also performed the following steps for each subset version. First, for each SLD and day in the subset, we computed the SLD's feature vector containing the SLD's hourly CDRRs for that day. Second, we normalized each feature vector by dividing each of its elements (hourly CDRRs) by the elements' sum (the daily CDRR). We excluded any SLDs whose daily CDRR was less than 3, considering that such unpopular SLDs are unlikely to be C&C servers yet likely to have temporally concentrated rates. Third, we sorted each feature vector in decreasing element order. Fourth, for each day in the subset, we computed the day's centroid and covariance matrix. The centroid is the average feature vector of all included SLDs (after normalization and sorting). Fifth, for each SLD and day in the subset, we computed the SLD's distance to the day's centroid. Sixth, for each day in the subset, we sorted the included SLDs by decreasing centroid distance. The top SLDs with distances exceeding a threshold were considered anomalous. Finally, we investigated whether anomalous SLDs are indeed suspicious.

We ran into difficulties while computing feature vector distances. Covariance matrices were not diagonal, invalidating the use of simplified Mahalanobis distances. To compute (full) Mahalanobis distances, covariance matrices need to be inverted, but many covariance matrices were noninvertible. We overcame the latter problem by modifying the calculations as follows. First, we computed the daily centroid and covariance matrix considering only benchmark SLDs. We selected as benchmark SLDs those SLDs that have the fewest null features. These SLDs have well-distributed queries, as expected for popular and legitimate SLDs. On the other hand, SLDs with highly temporally concentrated rates (presumably C&C servers) will tend to be distant from benchmark centroids. This change reduced the number of noninvertible covariance matrices to only 2 out of 81 cases. Second, in cases where the covariance matrix was still noninvertible, we added a bit of Gaussian noise (µ = 0, σ = 10⁻⁸) to each feature of the benchmark SLDs. This change made all covariance matrices invertible.

V. EXPERIMENTAL RESULTS

Table II summarizes our results for detection based on abnormally high rates. In subsets CSAA and DDNS_NS, all SLDs detected as anomalous belong to legitimate organizations, such as yahoo.com and weather.com. In contrast, in subset CS_NS, almost all SLDs detected as anomalous have been independently reported by others as suspicious. These SLDs are listed in Table III.
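The benchmark-centroid computation with the noise fallback described above can be sketched with NumPy as follows; the vectors are hypothetical and assumed to be already normalized and sorted:

```python
import numpy as np

def mahalanobis_to_centroid(vectors, benchmark):
    """Full Mahalanobis distance (eq. 2) of each feature vector to the
    centroid of the benchmark vectors. If the benchmark covariance
    matrix is singular, add Gaussian noise (mu=0, sigma=1e-8) to the
    benchmark features and retry, mirroring the fallback in the text."""
    bench = np.asarray(benchmark, dtype=float)
    try:
        cov_inv = np.linalg.inv(np.cov(bench, rowvar=False))
    except np.linalg.LinAlgError:
        bench = bench + np.random.normal(0.0, 1e-8, bench.shape)
        cov_inv = np.linalg.inv(np.cov(bench, rowvar=False))
    centroid = bench.mean(axis=0)
    diffs = np.asarray(vectors, dtype=float) - centroid
    # Quadratic form diffs * cov_inv * diffs^T, one value per vector
    return np.sqrt(np.einsum("ij,jk,ik->i", diffs, cov_inv, diffs))

bench = [[1.0, 2.0], [2.0, 1.0], [3.0, 3.0], [0.0, 1.0]]
dists = mahalanobis_to_centroid([[1.5, 1.75], [5.0, 5.0]], bench)
# the centroid itself is at distance 0; a far point gets a larger distance
```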
The first SLD in the table was detected in all versions of CS_NS, while the other SLDs were detected only in CS_NS_600. SLD yahoo.com was the only SLD incorrectly detected as anomalous in CS_NS.

TABLE II. NUMBER OF SLDS CONSIDERED ANOMALOUS ACCORDING TO CHEBYSHEV'S INEQUALITY WITH K = 4.47 (ABNORMALLY HIGH RATES)

                                   # of Anomalous SLDs
Subset    TTL (secs.)  Total SLDs  Suspicious  Non-suspicious
CSAA      60           2512        0           10
          300          7417        0           16
          600          12518       0           48
CS_NS     60           96          1           0
          300          248         1           0
          600          381         4           1
DDNS_NS   60           173         0           1
          300          249         0           1
          600          284         0           2

TABLE III. SLDS IN CS_NS WITH ABNORMALLY HIGH RATES AND INDEPENDENTLY REPORTED AS SUSPICIOUS

SLD: fscking.com
  • a55.fscking.com: Reported as spoofing srtforums.com [21].
  • www1.fscking.com: Reported as suspicious [22].

SLD: 3lefties.com (Internet access)
  • mail1.3lefties.com: Reported as suspicious [23].

SLD: busitec.jp (Online fax services)
  • 2krad.busitec.jp: Reported as suspicious [24].

SLD: shacknet.nu (DDNS provider)
  • preschool.shacknet.nu: Reported as suspicious [25].
  • macher.fake-ip.shacknet.nu: Used in sender addresses of spam [26].

Figures 1 and 2 show the average CDRRs of anomalous and normal SLDs in CS_NS_300 and CS_NS_600, respectively (the plot for CS_NS_60 is almost identical to the former). In Figure 2, yahoo.com was misclassified as anomalous. Figure 3 shows the effect of correcting this misclassification. As expected, CDRRs of anomalous SLDs are much higher than those of normal SLDs. Figure 4 shows a representative plot for a case where no anomalous SLD is actually suspicious. The discrepancy between CDRRs of anomalous and normal SLDs is much smaller than in Figures 1 and 3.

Figure 2. Average hourly CDRRs in CS_NS_600, with classification by Chebyshev inequality and yahoo.com misclassified as anomalous

Figure 3. Average hourly CDRRs in CS_NS_600, with classification by Chebyshev inequality and considering yahoo.com normal

Figure 1. Average hourly CDRRs in CS_NS_300, with classification by Chebyshev inequality

Detection based on abnormally temporally concentrated rates using full Mahalanobis distances did not yield any suspicious SLD. Interestingly, detection using simplified Mahalanobis distances did identify a suspicious 3LD, in CS_NS. This result is surprising because the covariance matrix was not diagonal and therefore the simplified Mahalanobis distance would theoretically be inapplicable. The detected 3LD, unknown.sagonet.net, reportedly belongs to a “guestbook harvester” botnet [13]. The parent SLD, sagonet.net, belongs to a network service provider that has been reported to host customers who do “all types of undesirable Internet activities” [14], including spamming.

Figure 4. Average hourly CDRRs in DDNS_NS_300, with classification by Chebyshev inequality.

Figures 5 and 6 show the sorted hourly CDRRs of normal SLDs and the SLD classified as anomalous using simplified Mahalanobis distances. The anomalous SLD actually has more evenly distributed hourly CDRRs than do the normal SLDs. This result is contrary to the expectation that anomalous SLDs would be distinguished by their more temporally concentrated CDRRs [10].

Figure 5. Sorted hourly CDRRs in CS_NS_300, with detection by simplified Mahalanobis distance

Figure 6. Sorted hourly CDRRs in CS_NS_600, with detection by simplified Mahalanobis distance

VI. DISCUSSION

Detection based on abnormally high or temporally concentrated DDNS query rates should consider only DDNS queries. However, distinguishing DDNS queries from other DNS queries is difficult in enterprise and access provider networks. In subset CSAA, we attempted to make the distinction based solely on the responses' TTL values. Unfortunately, a low TTL value is a necessary but insufficient condition for DDNS responses [27]. Internet measurement studies suggest that low TTL values are becoming increasingly common for all domain names, independently of DDNS [27]. Many legitimate domains, such as google.com, yahoo.com, and weather.com, use low TTL values, e.g., for DNS-based load balancing. In subset DDNS_NS, we attempted instead to make the distinction by requiring the response to come from a known DDNS provider. However, some legitimate and popular domain names, such as mozilla.com, are also hosted by DDNS providers. Our failure to exclude non-DDNS names more thoroughly may have contributed to the significant number of false positives (legitimate names considered anomalous) we obtained in the CSAA and DDNS_NS subsets. This failure was inconsequential for the CS_NS subset, possibly because the subset includes only NXDOMAIN responses. Such responses are more likely for DDNS names than for other names.

Detection based on abnormally high or temporally concentrated DDNS query rates might be more effective if the query rates considered were those observed at DDNS providers, rather than at enterprise or access provider networks. DDNS providers can more easily exclude non-DDNS names. It would be interesting to repeat these experiments under such assumptions.

Another factor that may have influenced our results is botnet size. Although the total number of bots has been growing fast, there is anecdotal evidence that the typical number of bots in each botnet has been decreasing. Smaller botnets can be expected to generate fewer queries for each C&C server, making the latter's detection more difficult.

Our distance computations may have been hampered by the high dimensionality [1] and sparseness of the feature vectors (24 elements, many of which are null). It would be interesting to investigate whether dimensionality reduction techniques could improve our results.

VII. CONCLUSIONS

We evaluated two DDNS-based approaches for identifying botnet C&C servers in enterprise and access provider networks. The first approach attempts to detect DDNS names whose query rates are abnormally high or temporally concentrated. The second approach attempts to detect DDNS names with abnormally recurring NXDOMAIN replies. In our experiments, the first approach generated many false positives (legitimate names classified as C&C servers). In contrast, the second approach was effective. Most of the names it detected were independently reported as suspicious by others. The contrast in our results may be related to the difficulty of distinguishing DDNS names from other names in enterprise and access provider networks. Increasingly, popular legitimate names such as gmail.com and mozilla.com use low TTL values or DDNS hosting, blurring boundaries and confounding classifications. Detection based on NXDOMAIN replies may be less prone to such confounding because NXDOMAIN replies are more likely to refer to DDNS names than to other names.

Acknowledgements

This project was funded in part by The Technology Collaborative through a grant from the Commonwealth of Pennsylvania, Department of Community and Economic Development.

REFERENCES

[1] Pang-Ning Tan, Michael Steinbach, and Vipin Kumar, Introduction to Data Mining (1st ed.).
[2] B. G. Amidan, T. A. Ferryman, and S. K. Cooley, "Data Outlier Detection using the Chebyshev Theorem," in 2005 IEEE Aerospace Conference, pp. 1-6, IEEE Conference Publications, Manhattan Beach, CA, 2005.
[3] M. H. DeGroot, Probability and Statistics (3rd ed.), Addison-Wesley, 2001.
[4] J. A. Rice, Mathematical Statistics and Data Analysis, Wadsworth Publ. Co., 1995.
[5] C. Taylor and J. Alves-Foss, "An empirical analysis of NATE: Network Analysis of Anomalous Traffic Events," in Proceedings of the 2002 Workshop on New Security Paradigms, ACM Press, New York, NY, 2002, pp. 18-26.
[6] C. Taylor and J. Alves-Foss, "NATE - Network analysis of anomalous traffic events, a low-cost approach," in Proceedings of the New Paradigms in Security Workshop, Cloudcroft, New Mexico, Sept. 2001.
[7] E. Cooke and F. Jahanian, "The zombie roundup: Understanding, detecting, and disrupting botnets," in Steps to Reducing Unwanted Traffic on the Internet Workshop (SRUTI '05), 2005.
[8] N. Ianelli and A. Hackworth, Botnets as a Vehicle for Online Crime, CERT Coordination Center, 2005.
[9] "Bobax trojan analysis," http://www.lurhq.com/bobax.html, March 2005.
[10] D. Dagon, "Botnet Detection and Response, The Network is the Infection," OARC Workshop, 2005.
[11] K. Wang and S. J. Stolfo, "Anomalous payload-based network intrusion detection," in Recent Advances in Intrusion Detection (RAID), Sept. 2004.
[12] D. Dagon, C. Zou, and W. Lee, "Modeling Botnet Propagation Using Time Zones," in Proceedings of the 13th Annual Network and Distributed System Security Symposium (NDSS '06), 2006.
[13] R. D. Kloth, "List of Bad Bots, 2004," http://www.kloth.net/internet/badbots-2004.php
[14] User "Brisguy52," in a reply to the post "Help to Block IP range," http://forum.statcounter.com/phpBB2/viewtopic.php?t=6308
[15] CNN, "Expert: Botnets No. 1 emerging Internet threat," http://www.cnn.com/2006/TECH/internet/01/31/furst/, Jan. 31, 2006.
[16] Oth.net, "List of dynamic DNS providers," http://www.oth.net/dyndns.html
[17] D. E. Smith, "List of dynamic DNS providers," http://www.technopagan.org/dynamic/
[18] J. Evers, "'Bot herders' may have controlled 1.5 million PCs," http://news.zdnet.com/2100-1009_22-5906896.html
[19] A. Ramachandran and N. Feamster, "Understanding the network-level behavior of spammers," in Proceedings of ACM SIGCOMM, 2006.
[20] P. Mockapetris, "Domain Names - Implementation And Specification," http://www.ietf.org/rfc/rfc1035.txt
[21] http://www.srtforums.com/forums/showthread.php?t=334294
[22] http://google.com/search?q=fscking.com+site%3Aosdir.com
[23] http://google.com/search?q=3lefties.com+site%3Aosdir.com
[24] http://google.com/search?q=busitec.jp+site%3Aosdir.com
[25] http://google.com/search?q=shacknet.nu+site%3Aosdir.com
[26] http://google.com/search?q=macher.fake-ip.shacknet.nu
[27] J. Jung, E. Sit, H. Balakrishnan, and R. Morris, "DNS performance and the effectiveness of caching," IEEE/ACM Transactions on Networking, 10(5), October 2003.
[28] A. Schonewille and D.-J. van Helmond, "The Domain Name Service as an IDS," Master's Project, University of Amsterdam, Netherlands, Feb. 2006, http://staff.science.uva.nl/~delaat/snb-2005-2006/p12/report.pdf
