Classification of Applications by Network Behavior

Tomáš Dragoun [email protected] Faculty of Informatics, Masaryk University, Brno, Czech Republic

Abstract

Network flow classification is an important task carried out by network administrators. The problem is to create a mapping between unknown network flows and a set of possible classes. The aim of this paper is to use current knowledge of the problem to evaluate whether similar methods can be used to classify applications as a whole. A class system that divides applications into 9 easy-to-understand categories was designed, and an approach to classifying applications was proposed. Data from real users was analyzed and a series of our own experiments was performed in order to deduce common behavioral characteristics for each application class. These elementary properties were expressed as binary predicates, which describe the application behavior on different network levels. The predicates were verified experimentally and their usability was demonstrated with the aid of unsupervised machine learning. The behavior of any application is represented by a binary vector, which can be used as the input for a machine learning classifier.

Keywords: application, classification, machine learning, network behavior.

1 Application classes

Classification is the problem of creating a mapping between certain objects (usually network flows) and a set of possible classes. From now on, the following application classes will be considered:

• “DOWNLOAD” (download managers and proprietary update installers)
• “TORRENT” (file-sharing and torrent clients)
• “FTP” (File Transfer Protocol clients)
• “GAME” (online games)
• “IM” (instant messengers)
• “MAIL” (e-mail clients)
• “MEDIA PLAYER” (multimedia players)
• “VOIP” (internet telephony applications)
• “WEB” (web browsers)

Unlike a network flow, an application can belong to multiple classes. The classes of an application can differ from version to version, or even within one version when plug-in modules are supported.

2 Data

Data was collected at the level of the socket API from beta users of an endpoint firewall and supplied by an external company; a few million user sessions were captured. Each application was monitored within a 3-hour time window and the following attributes were measured: remote host count, maximum number of open connections, overall connection count, and a list of the eight most often used connections. Each connection was described by its transport layer protocol, remote port number, connection side and multiplicity of occurrence. The primary goal was to explore current network flow classification methods and to analyze whether the collected dataset could be used to classify applications as a whole.
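The measured attributes can be pictured as one record per application and time window. The following is only a sketch of such a record: the class and field names, and the "client"/"server" values for the connection side, are assumptions made for illustration and are not taken from the supplied dataset.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass(frozen=True)
class ConnectionSummary:
    protocol: str      # transport layer protocol, e.g. "TCP" or "UDP"
    remote_port: int   # remote port number
    side: str          # connection side, assumed to be "client" or "server"
    multiplicity: int  # how many times this connection occurred


@dataclass
class SessionRecord:
    application: str
    remote_hosts: int          # remote host count
    max_open_connections: int  # maximum number of open connections
    total_connections: int     # overall connection count
    top_connections: List[ConnectionSummary] = field(default_factory=list)

    def __post_init__(self) -> None:
        # Only the eight most often used connections are kept per window.
        if len(self.top_connections) > 8:
            raise ValueError("at most eight connections are recorded per window")


record = SessionRecord(
    application="example-browser",  # hypothetical application name
    remote_hosts=120,
    max_open_connections=14,
    total_connections=480,
    top_connections=[ConnectionSummary("TCP", 443, "client", 310)],
)
```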

3 Network flow classification

Currently we encounter three main techniques used to classify network flows: port matching, deep packet inspection (DPI) and classification-in-the-dark [5]. The advantages and disadvantages of these methods are briefly summarized below:

Classification method        Advantages                          Disadvantages
Port matching                Low computation requirements;       Low accuracy
                             easy implementation
Deep packet inspection       High accuracy;                      High processing requirements;
                             application protocol analysis       unusable for encrypted traffic
Classification-in-the-dark   Applicable for encrypted traffic;   Lower accuracy than DPI
                             can identify unknown apps

The study of current network flow classification methods led to the conclusion that the collected dataset can be used together with the port matching technique. Unfortunately, this method has become inaccurate for the following reasons [5]:

• Applications in peer-to-peer networks use random ports.
• Traffic is tunneled over various network protocols.
• Applications hide their behavior in order to bypass firewalls.

As a consequence, only applications which use ports in a conservative fashion can be classified with the aid of the port matching method. Other classification techniques must be employed to overcome the shortcomings of this approach. For classification to perform well, the attributes employed in the classification process must capture differences in network behavior between distinct classes. Since the supplied dataset contained mostly attributes from the network layer of the ISO/OSI model, our own data was collected in order to find relevant attributes covering other layers.

4 Network behavior by classes

The following section describes the network behavior that is typical for each application class. Elementary properties from different network layers will be used later to discriminate between applications. The conclusions are based on a study of the available literature and on an exploratory analysis of both the supplied data and our own collected data.

4.1 “DOWNLOAD”

Download clients are used for bulk data transfers. They usually aim to download as much data as possible in a short time period. This causes a high average packet size, which almost matches the MTU of the network (see Figure 1). In contrast to “TORRENT” applications, “DOWNLOAD” clients lack the small incoming packets that belong to the signaling flows of an underlying P2P network. Outgoing packets are small compared to those of “TORRENT” clients, because usually no data is sent to the network. Applications of class “DOWNLOAD” establish only a few connections during a session, although this does not necessarily hold if accelerated download of several segments simultaneously is supported. Download clients usually communicate with only a limited number of remote hosts. Content is downloaded via HTTP or HTTPS connections on TCP ports 80 and 443 respectively.

4.2 “TORRENT”

File-sharing applications usually communicate with many remote hosts during a session, which leads to a high count of established connections and to the use of a significant number of random high ports (see Figure 2). Applications of class “TORRENT” often use proprietary application protocols on stable ports for signaling and the UDP protocol on random ports for data transfer. We encounter both small packets (signaling flows) and large packets (data transfer). Torrent clients use only a single HTTP or HTTPS connection, in contrast to “WEB” browsers, which open several TCP connections to a single server so that the request is processed as quickly as possible.

4.3 “FTP”

FTP clients are mostly conservative and use ports reserved for the File Transfer Protocol and its extensions. These protocols are FTP (on TCP port 21), SFTP (on TCP port 22) and FTPS (on TCP port 443/990) [10]. FTP connections contain a high number of packets with the ACK and PSH TCP flags in both directions of data transfer. In passive FTP transfer mode, clients establish connections on random high ports, which causes a large number of connections, while the remote host count remains low in contrast to “TORRENT” clients.

Figure 1: Download Accelerator (incoming packets), µTorrent (outgoing packets), World of Tanks (all packets).

4.4 “GAME”

Game connections consist mostly of small packets, since only state information is transferred (see Figure 1). When a game starts, it usually connects to the update server via TCP in order to download updates, so it might act as a “DOWNLOAD” manager. In most cases, the game traffic itself is transferred over UDP. Interactive real-time communication between peers causes a similar frequency of incoming and outgoing packets. The frequency is stable and fluctuates only in breaks between game matches.

4.5 “IM”

Instant messaging clients often use proprietary application protocols on dedicated ports (e.g. IRC, ICQ, Miranda, MSN …). The frequency of incoming and outgoing packets is similar and very low in contrast to other application classes. The amount of received traffic can be larger than the amount of sent data, because many closed-source “IM” clients download and display various advertisements.

4.6 “MAIL”

E-mail clients are characterized by their conservative use of ports, since the majority of mail servers are configured according to the IANA recommendation. The protocols currently used most often for mail exchange are: SMTP (on TCP port 25), POP3 (on TCP port 110), IMAP (on TCP port 143), SMTP/S (on TCP port 465), SMTP/TLS (on TCP port 587), IMAP/S (on TCP port 993) and POP3/S (on TCP port 995) [10].

4.7 “MEDIA PLAYER”

Online multimedia players usually communicate with a small number of remote hosts during a session. The frequency of incoming packets is constant and the variability of inter-packet delay is rather low. Incoming connections contain a high number of packets with the ACK and PSH TCP flags, which is common for real-time applications.

4.8 “VOIP”

Signaling and voice/video transfer are usually separated due to latency requirements. Voice data is sent over UDP while signaling is done via TCP. TCP connections exhibit negligible activity and UDP datagrams are addressed to high ports. Interactive communication causes a stable frequency of UDP datagrams, which is comparable in both directions. Unlike the frequency, the ratio of sent to received data can differ, since some applications allow a different quality level to be chosen for each direction of the call. VoIP applications use open protocols (e.g. H.323, SIP, RTP, RTCP) or closed proprietary protocols (e.g. Skype).

4.9 “WEB”

Web browsers communicate with many remote hosts during a session, because web content is distributed across different web servers (see Figure 2). In contrast to “TORRENT”, web browsers use only a small number of ports, usually only TCP ports 80 and 443. “TORRENT” and “WEB” applications can be effectively distinguished by the ratio of remote hosts to used ports. When browsing web content, the application loads a large number of advertisements from different sources, which causes many short (< 15 s) HTTP and HTTPS connections. Since HTTP/1.1, TCP connections are persistent by default, so a single connection can handle multiple requests and remain active for a long time period. The traffic during a web browsing session is irregular and the response delay can reach a few seconds when the web server is under high load. Almost one fourth of TCP connections are terminated with the reset (RST) flag [2].
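The ratio of remote hosts to used ports mentioned above can be illustrated with a small sketch. The function name and the sample values are hypothetical; the paper does not state a concrete threshold, so none is assumed here.

```python
from typing import Set


def hosts_to_ports_ratio(remote_hosts: Set[str], remote_ports: Set[int]) -> float:
    """Number of distinct remote hosts per distinct remote port."""
    if not remote_ports:
        return 0.0
    return len(remote_hosts) / len(remote_ports)


# Illustrative values only: a browser contacts many hosts on ports 80 and 443,
# while a torrent client contacts many hosts on many random high ports.
web_hosts = {f"host{i}" for i in range(40)}
torrent_hosts = {f"peer{i}" for i in range(40)}
torrent_ports = set(range(50000, 50035))

web_ratio = hosts_to_ports_ratio(web_hosts, {80, 443})
torrent_ratio = hosts_to_ports_ratio(torrent_hosts, torrent_ports)
```

With these sample values the browser yields a ratio of 20 hosts per port, while the torrent client stays close to 1.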

Figure 2: High ports, remote hosts and connections counts of chosen classes in 3-hour time window.

5 Heuristics

The previous section described the elementary behavior properties of the different application classes. Now we can focus on the design of attributes that express these properties, with emphasis on low computational requirements of the evaluation algorithm. We decided to describe these properties with the aid of binary predicates, so-called heuristics. Each heuristic takes the value true or false: it is evaluated as true if the condition it defines was met in the given time window. Threshold constants in the conditions are denoted by the symbol C and can differ with respect to the chosen time window. Only one example is given for each group of heuristics; more examples can be found in the Appendix. A pseudocode implementation of each heuristic evaluation algorithm, together with the implemented prototype, can be found in [4].
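To make the idea of binary predicates concrete, the sketch below evaluates a few heuristics over one time window and collects the results into a binary vector. It is not the prototype from [4]: the measurement keys, heuristic names and threshold values (standing in for the constants C) are assumptions chosen for illustration.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical per-window measurements for one application.
Window = Dict[str, float]

# Each heuristic is a named binary predicate over the window.
# Names and thresholds (the constants C) are illustrative only.
HEURISTICS: List[Tuple[str, Callable[[Window], bool]]] = [
    ("high_remote_hosts", lambda w: w["remote_hosts"] >= 50),
    ("large_avg_packet", lambda w: w["avg_packet_size"] > 1000),
    ("low_packet_rate", lambda w: w["packets_per_second"] < 1.0),
]


def behavior_vector(window: Window) -> List[int]:
    """Evaluate every heuristic and return the results as a 0/1 vector."""
    return [int(predicate(window)) for _, predicate in HEURISTICS]


browser_window = {
    "remote_hosts": 120,
    "avg_packet_size": 640,
    "packets_per_second": 35.0,
}
print(behavior_vector(browser_window))  # [1, 0, 0]
```

Keeping the heuristics in an ordered list fixes the position of each predicate in the vector, which is what a downstream classifier would rely on.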

5.1 DPI heuristics

The first group of heuristics uses a method of lightweight DPI; in most cases only the prefix of the data segment is examined. Signatures (matching patterns) for the DPI heuristics were taken from the literature [1, 3, 6, 7].

• HTTP protocol
  Condition – data segment contains prefix: “GET”, “POST”, “PUT”, “HTTP”, “HEAD”, “SEARCH”.
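A prefix check of this kind is cheap to evaluate. The following sketch uses the six signatures listed above; the function name is hypothetical, and matching against the very first bytes of the payload is an assumption about how the prefix is examined.

```python
# Signatures from the HTTP protocol heuristic above, as byte strings.
HTTP_PREFIXES = (b"GET", b"POST", b"PUT", b"HTTP", b"HEAD", b"SEARCH")


def is_http(payload: bytes) -> bool:
    """True when the data segment starts with one of the HTTP signatures."""
    # bytes.startswith accepts a tuple and tests each prefix in turn.
    return payload.startswith(HTTP_PREFIXES)


request = b"GET /index.html HTTP/1.1\r\nHost: example.org\r\n\r\n"
tls_record = b"\x16\x03\x01\x02\x00"  # start of a TLS handshake, not HTTP
```

Here `is_http(request)` holds while `is_http(tls_record)` does not, which also illustrates why DPI is unusable for encrypted traffic.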

5.2 Port heuristics

The second group of heuristics is based on the data provided by the external company and describes the usage of ports.

• MAIL ports
  Condition – connection established on one of TCP ports: 25, 110, 143, 465, 587, 993 or 995.
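As a sketch, the MAIL ports condition reduces to a set membership test. Representing each connection as a (protocol, remote port) pair is an assumption made for this example.

```python
from typing import Iterable, Tuple

# TCP ports from the MAIL ports heuristic above.
MAIL_PORTS = frozenset({25, 110, 143, 465, 587, 993, 995})


def mail_ports(connections: Iterable[Tuple[str, int]]) -> bool:
    """True when at least one TCP connection uses a mail port."""
    return any(
        protocol == "TCP" and port in MAIL_PORTS
        for protocol, port in connections
    )


mail_client = [("TCP", 993), ("TCP", 587)]
browser = [("TCP", 443), ("UDP", 53)]
```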

5.3 Packet heuristics

The third group of heuristics is characterized by packet statistics and is based on our own collected data.

• Large incoming packets
  Condition – number of large incoming packets > (all incoming packets / C1), where large packets are of size at least C2 bytes.
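The condition above can be sketched directly from a list of incoming packet sizes. The values passed for C1 and C2 in the example are illustrative and are not the thresholds used in the experiments.

```python
from typing import Sequence


def large_incoming_packets(sizes: Sequence[int], c1: float, c2: int) -> bool:
    """True when more than len(sizes) / c1 incoming packets have size >= c2 bytes."""
    if not sizes:
        return False
    large = sum(1 for size in sizes if size >= c2)
    return large > len(sizes) / c1


bulk_download = [1500, 1500, 1460, 60]  # mostly near-MTU packets
game_traffic = [60, 60, 1500, 60]       # mostly small state updates
```

With C1 = 2 and C2 = 1400, the first sample satisfies the condition (3 large packets out of 4) and the second does not (1 out of 4).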

5.4 Flow heuristics

The last group describes behavior on the transport layer and is based on both the supplied data and our own collected data.

• Interactive UDP communication
  Condition – inter-packet delay has the same order of magnitude for both incoming and outgoing UDP datagrams.
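One possible reading of "same order of magnitude" is that the mean inter-packet delays of the two directions differ by less than a factor of ten. The sketch below implements that reading; the interpretation, the use of the mean, and the function names are all assumptions. Timestamps are assumed to be in seconds and sorted in ascending order.

```python
import math
from statistics import mean
from typing import Optional, Sequence


def mean_delay(timestamps: Sequence[float]) -> Optional[float]:
    """Mean gap between consecutive packets, or None with fewer than two packets."""
    gaps = [later - earlier for earlier, later in zip(timestamps, timestamps[1:])]
    return mean(gaps) if gaps else None


def interactive_udp(incoming: Sequence[float], outgoing: Sequence[float]) -> bool:
    """True when both mean delays lie within a factor of ten of each other."""
    d_in, d_out = mean_delay(incoming), mean_delay(outgoing)
    if not d_in or not d_out:  # missing data or zero delay
        return False
    return abs(math.log10(d_in) - math.log10(d_out)) < 1.0


voice_in = [0.00, 0.02, 0.04, 0.06]
voice_out = [0.00, 0.03, 0.06]
sparse_out = [0.0, 5.0, 10.0]
```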

6 Experiment

Figure 3: Experimental results for port heuristics.

The applicability of the heuristics was verified experimentally (see [4] for details); the results for port heuristics are depicted in Figure 3. Precision and recall [5] were calculated for each pair of class and heuristic, and a candidate list of the best heuristics for each class was then created according to the results. The set of heuristics which identifies each class best was chosen [4], and unsupervised machine learning was used to verify the chosen heuristics (namely hierarchical clustering with the C-link method and the Manhattan metric [9]). The hierarchical clustering dendrogram in Figure 4 shows clusters of applications separated with the aid of the attributes selected for class “IM”. Dendrograms for other classes can be found in [4]. We assume that supervised machine learning algorithms (e.g. SVM) will achieve good classification results on objects which form well separated clusters in vector space. A modular classification scheme with one-class classifiers, similar to the one proposed in [8], can be used with a chosen set of heuristics as a preprocessing filter.

Figure 4: Clustering dendrogram with attributes selected for class “IM”.
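The evaluation steps can be sketched in a few lines. Precision and recall are computed here for one (class, heuristic) pair from two sets of application names. The second part shows the Manhattan distance between binary vectors and a complete-linkage distance between clusters, which assumes that "C-link" stands for complete linkage. All names are hypothetical and this is not the evaluation code from [4].

```python
from typing import List, Sequence, Set, Tuple


def precision_recall(fired: Set[str], in_class: Set[str]) -> Tuple[float, float]:
    """Precision and recall of a heuristic for one application class.

    fired    -- applications for which the heuristic evaluated as true
    in_class -- applications that really belong to the class
    """
    true_positives = len(fired & in_class)
    precision = true_positives / len(fired) if fired else 0.0
    recall = true_positives / len(in_class) if in_class else 0.0
    return precision, recall


def manhattan(u: Sequence[int], v: Sequence[int]) -> int:
    """Manhattan distance; for binary vectors this counts differing positions."""
    return sum(abs(a - b) for a, b in zip(u, v))


def complete_link(cluster_a: List[Sequence[int]], cluster_b: List[Sequence[int]]) -> int:
    """Complete-linkage distance: the largest pairwise distance between clusters."""
    return max(manhattan(u, v) for u in cluster_a for v in cluster_b)
```

For example, a heuristic that fires for three applications, two of which belong to a class of four, has a precision of 2/3 and a recall of 1/2.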

7 Conclusions and Future research

This article describes the typical network behavior of each of the 9 proposed application classes. Binary predicates which capture properties of network traffic were designed, and the approach was verified experimentally with the aid of the implemented prototype. The behavior of any application is represented by a binary vector, which can be used as the input for a machine learning classifier. Future work should explore the performance of the heuristics on a large-scale dataset, and scenarios where a single application belongs to multiple classes should be examined closely. Finally, the set of categories might be extended and new attributes discovered.

References

[1] Application Layer Packet Classifier for Linux: l7-filter. CLEARFOUNDATION. Available from WWW: .

[2] CHARZINSKI, J. and FELDMANN, A.: HTTP/TCP connection and flow characteristics. Performance Evaluation. 2000, vol. 42, 2–3, pp. 367–399. DOI: 10.1002/047120644x.ch15.

[3] DEWES, Ch., WICHMANN, A. and FELDMANN, A.: An analysis of Internet chat systems. Proceedings of the conference on Internet measurement conference - IMC ’03. 2003. DOI: 10.1145/948213.948214.

[4] DRAGOUN, T.: Classification of applications by network behavior. Master thesis. 2015. Available from WWW: .

[5] GOMES, J., INÁCIO, P., PEREIRA, M. et al.: Detection and Classification of Peer-to-Peer Traffic: A Survey. ACM Computing Surveys. 2013, vol. 45, issue 3, pp. 1–40. DOI: 10.1145/2480741.2480747.

[6] KARAGIANNIS, T., BROIDO, A., FALOUTSOS, M. et al.: Transport Layer Identification of P2P Traffic. Proceedings of the 4th ACM SIGCOMM conference on Internet measurement - IMC ’04. 2004. DOI: 10.1145/1028788.1028804.

[7] KARAGIANNIS, T., PAPAGIANNAKI, K. and FALOUTSOS, M.: BLINC: Multilevel Traffic Classification in the Dark. ACM SIGCOMM Computer Communication Review. 2005, vol. 35, issue 4. DOI: 10.1145/1090191.1080119.

[8] RAHBARINIA, B., PERDISCI, R., LANZI, A. et al.: PeerRush: Mining for unwanted P2P traffic. Journal of Information Security and Applications. 2014, vol. 19, issue 3, pp. 62–82. DOI: 10.1007/978-3-642-39235-1_4.

[9] ROMESBURG, Ch.: Cluster Analysis for Researchers. USA, Belmont: Lifetime Learning Publications, 1984. ISBN 10.1787/888932708902.

[10] Service Name and Transport Protocol Port Number Registry. Internet Assigned Numbers Authority. Available from WWW: .

Appendix – Heuristics

Examples for each group of heuristics are listed below; the full list of heuristics, together with pseudocode and the implemented prototype of the evaluation algorithm, can be found in [4].

Deep packet inspection heuristics

The first group of heuristics uses a method of lightweight DPI; in most cases only the prefix of the data segment is examined. Signatures are plain ASCII strings; hexadecimal byte values are denoted by the escape sequence “\x”.

• Instant-messaging protocols
  Condition – data segment contains prefix: “USERNAME”, “PING”, “PONG”, “JOIN”, “NICK”, “PRIVMSG”, “WHO”, “WATCH”, “USERHOST”, “:irc”, “USR”, “CVR”, “CHG”, “NLN IDL”, “NLN NLN”, “YMSG” or „ 1023.

• Same TCP and UDP port used
  Condition – same remote port used for TCP and UDP communication.

• Same TCP client and server port
  Condition – same port used for both client and server TCP connection.

Packet heuristics

The third group of heuristics characterizes packet statistics and is based on our own collected data.

• Low packet frequency
  Condition – average packet frequency < C packets·s⁻¹.

• Large average packet size
  Condition – average packet size > C bytes.

• Small outgoing packets
  Condition – number of small outgoing packets > (all outgoing packets / C1), where small packets are of size at most C2 bytes.

Flow heuristics

The last group describes behavior on the transport layer and is based on both the supplied data and our own collected data.

• High remote hosts count (δ1)
  Condition – application contacted at least C different remote hosts.

• Short HTTP and HTTPS connections (δ5)
  Condition – at least C1 TCP connections shorter than C2 seconds were established on ports 80 or 443 during a C3-minute time window.

• Frequent change of data transfer direction (δ11)
  Condition – direction of data transmission changed at least C1 times during a C2-second time window.
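As an illustration of the last flow heuristic (δ11), the sketch below counts direction changes inside a sliding window that starts at each packet in turn. Representing packets as (timestamp, direction) pairs sorted by time, and sliding the window from every packet, are assumptions; the pseudocode in [4] may define the window differently.

```python
from typing import Sequence, Tuple

Packet = Tuple[float, str]  # (timestamp in seconds, "in" or "out")


def frequent_direction_change(packets: Sequence[Packet], c1: int, c2: float) -> bool:
    """True when some window of c2 seconds contains at least c1 direction changes."""
    for i, (start, first_direction) in enumerate(packets):
        changes = 0
        previous = first_direction
        for timestamp, direction in packets[i + 1:]:
            if timestamp - start > c2:
                break  # packet lies outside the window that starts at `start`
            if direction != previous:
                changes += 1
                previous = direction
        if changes >= c1:
            return True
    return False


chat = [(0.0, "in"), (0.1, "out"), (0.2, "in"), (0.3, "out")]
slow = [(0.0, "in"), (5.0, "out"), (10.0, "in")]
```

The quadratic scan keeps the example short; a production implementation would more likely maintain the window incrementally.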
