Modeling of statistical data sources based on measured network traffic


Simulation: Transactions of the Society for Modeling and Simulation International 88(10) 1216–1232 © 2012 The Society for Modeling and Simulation International DOI: 10.1177/0037549712452016 sim.sagepub.com

Matjaž Fras1, Jože Mohorko2 and Žarko Čučej3

Abstract Network traffic modeling for simulation purposes often requires a statistical description of the traffic data sources. Network traffic is usually measured by capturing packets at the physical level, and the statistical description of the data sources normally cannot be derived directly from such captured packet traffic. For that reason, we have investigated simpler solutions based on estimating the statistical processes of the traffic data sources from the measured packet traffic. We have developed estimation methods that yield suitable probability distribution functions, and their parameters, for the stochastic processes of the data sources. The statistical distributions of these processes, such as the data length process and the data inter-arrival time process, are important because they can be used to model network traffic in simulation tools. We first developed an estimation method that mimics the defragmentation process. This method allows the distributions of the data source processes, and their parameters, to be estimated from captured packet traffic. Further testing revealed some limitations of this method, especially for the data length process. We therefore developed a new estimation method, which is described in detail in this paper. The new method, called the estimation method based on histogram comparison (EMHC), uses the opposite concept: the distribution of data lengths is transformed by an analytical model into a packet size histogram, which is then compared with the packet size histogram of the captured packet traffic. An optimization method is used to find the parameters of the data length distribution that minimize the discrepancy between the histogram of captured packets and the estimated packet size histogram. To estimate the discrepancy between the two histograms, the well-known χ2 test is used, modified by a weighting function that takes into account the packet lengths as well as the packet frequencies. The proposed algorithm and method are confirmed through validations and experiments in a simulation tool.

Keywords network traffic modeling, statistical modeling, simulation, traffic fragmentation

1. Introduction

Tools for designing and planning networks simplify the decision-making process and the conceptualization of abstract ideas, which is the main value of simulation and modeling. They play a key role in the evaluation of design options and help to discover design issues that are likely to impact performance. With the introduction of new telecommunication services, the parameters of known models, or even whole models, can easily be changed and their influence tested. In real networks, such changes can be discovered by network traffic measurements. The measurement results can be used to build the more realistic traffic models needed in simulations, which are often defined by data source statistics. Because the internet protocol (IP) packet fragmentation process is strongly non-linear, it is not easy to derive data source statistics directly from the packet statistics.

1 Margento R&D, Slovenia
2 Tehnovitas R&D, Slovenia
3 Faculty of Electrical Engineering and Computer Science, University of Maribor, Slovenia

Corresponding author: Matjaž Fras, Margento R&D, Gosposvetska cesta 84, 2000 Maribor, Slovenia. Email: [email protected], [email protected]

1.1 Formulation of a problem

The main task for successful statistical modeling of network traffic is to minimize the statistical discrepancies between the measured and the simulation-generated traffic. This means that the modeled traffic must be similar to the measured traffic according to different criteria, such as bit and packet rate, burstiness (Hurst parameter), variance, etc. Network traffic modeling for simulation purposes is usually based on statistical modeling of data sources at the application level. This means that distributions and their parameters are needed to describe the stochastic processes of the network traffic, such as the data lengths and the data inter-arrival times. This concept of network traffic modeling is also used in the OPNET Modeler1,2 simulation tool, which is used in our research. There are various possibilities for network traffic modeling, such as the raw packet generator (RPG) station3,4 or the more often-used traffic generators.3,5,6 We consider the latter possibility, where traffic generators, included in device models, represent data sources that usually consist of three parts, as shown in Figure 1:

• a pseudo-random generator with a uniform probability density function (pdf) and the possibility of selecting different seeds to generate random numbers;
• two stochastic processes XD(x) and TD(t), which transform random numbers into data lengths (measured in bits or bytes) and data inter-arrival times (measured in seconds) according to selected statistical distributions; and
• an interface between the data source and a particular layer of the protocol stack in the simulator.

Figure 1. Simplified data source modeling for simulation purposes.
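The three-part data source described above can be illustrated with a minimal sketch. It assumes inverse-transform sampling, a Pareto distribution for the data lengths and an exponential distribution for the inter-arrival times; the function names and the default parameter values are hypothetical and are not taken from the OPNET models.

```python
import math
import random


def pareto_length(u, alpha, k):
    """Map a uniform number u in [0, 1) to a Pareto(alpha, k) data length."""
    return k / (1.0 - u) ** (1.0 / alpha)


def exponential_gap(u, mean):
    """Map a uniform number u in [0, 1) to an exponential inter-arrival time."""
    return -mean * math.log(1.0 - u)


def generate_source(count, seed, alpha=1.05, k=26.0, mean_gap=0.01):
    """Return a list of (arrival_time, data_length) pairs for one data source.

    The seeded uniform generator plays the role of the pseudo-random
    generator; the two mapping functions stand in for XD(x) and TD(t).
    All default values are illustrative assumptions.
    """
    rng = random.Random(seed)
    time = 0.0
    events = []
    for _ in range(count):
        length = pareto_length(rng.random(), alpha, k)
        time += exponential_gap(rng.random(), mean_gap)
        events.append((time, length))
    return events
```

Calling `generate_source(1000, seed=127)` twice returns the same sequence, which mirrors the role of the selectable seed in the generator.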

Usually, the traffic generator is placed above the IP encapsulation layer in the transmission control protocol (TCP)/IP model, which takes care of packet formation and fragmentation. Fragmentation is the process of segmentation in which long data are split into shorter packets, or vice versa, according to RFC 793 (reference 7). The packet payload is also padded with additional bits when the data is shorter than the predefined minimal payload. Because the traffic is modeled above the IP layer, an additional 20 bytes of IP header are added to the lengths of the generated data. A further 18 bytes are added for media access control (MAC) information (6 bytes destination address, 6 bytes source address and 2 bytes frame length or type) and the frame check sum (FCS; 4 bytes). The structure of the standard Ethernet frame, used in the IP station model in the OPNET simulation tool, is shown in Figure 2. With such a traffic model, the application protocol has no impact on the generated traffic. The model is suitable for simulating, with a single traffic source, network traffic that can be caused by many arbitrary communication applications (data sources) simultaneously. This is the opposite of the concept with application models, where each application protocol has an impact on the generated data. Changes in the offered data services or in users' habits are reflected in the processes XD(x) and TD(t). Thus, the identification of data source statistics and the estimation of their parameters from the measured packet traffic are important topics, because each of the two stochastic processes XD(x) and TD(t) is described by a probability distribution function (pdf). In the case of modeling measured packet network traffic in a simulation tool, the suitable


Figure 2. Ethernet frame. (SD: start delimiter; DA: destination address; SA: source address; FL: frame length; PAD: padding bits; FCS: frame check sum; IG: interframe gap. All field lengths are in octets.) Padding bits, if necessary, are located in the shaded field.

parameters and distributions for the data sources are estimated from the measured packet traffic.
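The header arithmetic described above can be checked with a small sketch. The 20-byte IP header and the 18 bytes of MAC and FCS fields come from the text; the 46-byte minimum payload is an assumption taken from standard Ethernet, and the function name is hypothetical. Fragmentation of data longer than the maximal payload is not handled here.

```python
IP_HEADER = 20     # bytes, as stated in the text
MAC_AND_FCS = 18   # bytes: 6 + 6 + 2 (MAC) + 4 (FCS), as stated in the text
MIN_PAYLOAD = 46   # bytes, assumption: standard Ethernet minimum payload


def frame_length(data_length):
    """Length in bytes of the Ethernet frame carrying one unfragmented data unit."""
    payload = data_length + IP_HEADER
    padding = max(0, MIN_PAYLOAD - payload)  # pad short payloads up to the minimum
    return payload + padding + MAC_AND_FCS
```

Under these assumptions, every data unit of 26 bytes or less produces the same 64-byte frame, which is why short data lengths cannot be told apart in a packet size histogram.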


1.2 Definitions of network traffic

The network packet traffic ZP[n] is a stochastic process that can be interpreted as the traffic volume, measured in packets, bytes or bits per unit of time. ZP[n] can be described as a composite of two stochastic processes: the packet-size process XP[n], which is a series of packet sizes lPi measured in bits (b) or bytes (B), and the packet inter-arrival time process YP[n], which is defined as a series of time intervals between packet arrivals tPi (time stamps):8

$$Z_P[n] = X_P[n] \circ Y_P[n], \quad n \in \mathbb{R} \tag{1}$$

Network traffic modeling is usually performed at a higher (application) layer of the TCP/IP model.9 ZD[n] represents the network traffic of data sources on the higher layers of the International Standards Organization Open Systems Interconnection (ISO/OSI) model, which can also be measured in packets, bytes or bits per time unit. As in the packet traffic case, ZD[n] can be described as a combination of the data source length process XD[n] and the data inter-arrival time process YD[n]:8

$$Z_D[n] = X_D[n] \circ Y_D[n], \quad n \in \mathbb{R} \tag{2}$$

1.3 Methods of traffic statistic transformation

Statistics of packet traffic can be described by their empirical histograms. The statistics of data sources can be described by their probability density functions (pdfs) or by theoretical histograms calculated from their pdfs. Both statistics are related by algorithms determined by:

• fragmentation of long data into shorter packets, or vice versa, defragmentation of shorter packets into longer data, according to RFC 793 (reference 7);
• padding of a packet data field (payload) with padding bits when the data is shorter than the requested payload in the packet.

To provide statistical equality between the measured packet network traffic ZPM[n] and the data source network traffic ZD[n], a transformation between the packet processes and the data processes must be performed: a transformation between the process of packet sizes XPM[n] and the process of data lengths XD[n], and a transformation between the packet inter-arrival time YPM[n] and the data inter-arrival time YD[n]:8

$$X_{PM}[n] \overset{\text{transformation}}{\longleftrightarrow} X_D[n], \qquad Y_{PM}[n] \overset{\text{transformation}}{\longleftrightarrow} Y_D[n] \tag{3}$$

The simplest way to estimate data statistics is to measure the data sources directly at the application layer. However, this is usually impossible or very impractical. The solution is therefore to measure the packet traffic with packet capturing tools (sniffers) and then to estimate the data source statistics from the captured packets. At first glance, it seems natural that the statistical parameters of the data sources can be determined directly from the measured packets (after data defragmentation). This approach is briefly summarized in Section 2. However, extensive research shows that this approach requires an in-depth packet analysis, which needs specialized, very powerful and, consequently, expensive instruments. Furthermore, in the case of encrypted packets, deep packet analysis is virtually impossible. These are some of the reasons why a new approach is proposed, based on simply obtainable indicators. In the proposed approach, a candidate pdf is chosen for the data source statistics. This candidate is then reshaped in an iterative optimization procedure so that the deviations between the statistics of the packet traffic simulated by the modeled data source and the statistics of the measured packet traffic are minimal. This new approach is described in Section 2.

1.4 Related work

In recent years, immense research effort has been devoted to measuring and analyzing network traffic for simulating packet-switching networks. This includes measurement of network traffic on the Internet,10,11 of traffic in high-speed networks12 and of next generation networks.13 Considerable attention has also been given to the analysis of network traffic caused by different applications, such as peer to peer (P2P),14,15 network games16 and the Voice over Internet Protocol (VoIP) application Skype.17 Network traffic has a stochastic nature; therefore, the main goal of traffic measurement analysis is to provide models that allow a statistical description of network traffic. Network traffic in contemporary communication networks is well described by a self-similar traffic model.9,18–20 In such a model, the Hurst parameter H is used as a measure of self-similarity. This model replaces the Poisson and Markov models21 used in traditional telephone networks. Paxson and Floyd21 show self-similarity for wide-area network (WAN) traffic, Crovella and Bestavros20 for WWW traffic and Garrett and Willinger22 for variable bit rate (VBR) video. Dainotti et al.23 developed a methodology and software architecture for packet-level statistical characterization of network traffic captured in a WAN. The probability distributions that describe self-similar network traffic are heavy-tailed. Examples of such distributions are the Pareto and Weibull distributions.24–26 A very important network traffic property with regard to self-similarity is short- or long-range dependence,24,25,27,28 which can be identified by an autocorrelation function.27,29

Some researchers perform network traffic analysis on higher TCP/IP layers, where data files and their transactions are attainable.8,30 It has been shown that file sizes, in the case of WWW network traffic, are best described by a Pareto distribution with parameter α = 1.31 For file transfer protocol (FTP) traffic, it appears that the shape parameter of the Pareto distribution is in the range 0.9 < α < 1.1.21 It was shown25,30,32,33 that the inter-arrival times of TCP connections are self-similar processes, which can be described by a heavy-tailed Weibull distribution. There are also analyses and simulations of data transactions in IP and general packet radio service (GPRS) networks34 using a multiple input, multiple output (MIMO) simulator.

Traffic modeling within network simulation tools is usually based on modeling data sources at the application level. Different models have been developed for the simulation of network traffic. One concept, based on fractal point processes, which are described by Ryu and Lowen35 and built on "on/off" models, is also supported in the OPNET simulation tool.4 Other models, such as hidden Markov models,36,37 are used to describe traffic sources at the packet level. Further Markov-based models, such as Markov-modulated Poisson processes (MMPPs) and special cases of MMPPs,38–40 are used to characterize the packet arrival process and the distribution of packet sizes. Some authors have also implemented different Markov-based traffic models in the OPNET simulation tool.41

In our research work, the OPNET Modeler simulation tool, version 14.0 (references 1, 2, 42, 43), is used. Only the supported traffic generators35 available in the models of standard communication nodes are used. The common communication nodes are the RPG station3,4 and the IP station; the IP station was used in our research work.6,8 In this case, the network traffic model is based on the distributions, and their parameters, of the two stochastic processes XD(x) and TD(t). In the context of our research project "Modeling of Command and Control Information Systems", financed by the Slovenian Ministry of Defense, we have developed a simulator for tactical networks on virtual terrain.2,42,43 A significant part of this project is also the modeling of network traffic data sources in the simulation tool, based on traffic measurements with sniffers in real tactical networks.3 We have focused on estimating the distributions and parameters of data source network traffic from measured packet network traffic, not just in tactical networks but in almost all IP networks.
Estimated distributions and parameters can be used in the OPNET simulation tool for modeling and then simulating the measured network traffic with the standard IP station.8,35 The network traffic in the IP station can be described by two independent probability distribution functions, one for the data source length process XD(x) and one for the data inter-arrival time process TD(t). The main focus and goal of our research are estimation methods that allow these two probability distribution functions to be estimated from captured packet network traffic. Because there was no known method, we first developed a method that mimics the defragmentation process.6,8,44,45 The main part of this method is the defragmentation algorithm, which is described in greater detail by Fras et al.8 This algorithm transforms captured packet traffic into estimated data traffic, which is then used to estimate suitable distributions for the data source length process XD(x) and the data inter-arrival time process TD(t)8,44 with standard fitting tools,46 such as EasyFit.47 Further testing revealed some limitations of this method, especially of the defragmentation algorithm, as well as some limitations of the fitting tools.6 This is the main reason why we have developed an estimation method with a new concept and approach.

Our proposed estimation method, called the estimation method based on histogram comparison (EMHC), is described in detail in this paper. This method offers a more direct and accurate alternative to the previously known methods8,44 for estimating the data source length process XD(x). It uses the opposite concept: the data source traffic is transformed by an analytical model into estimated packet traffic, which is then compared with the real captured packet traffic. An optimization method is used to find the data length distribution for which the discrepancy between the measured packet size process and the packet size process estimated by the analytical model is minimal.

The paper is organized as follows. Section 2 describes the estimation method based on histogram comparison. Section 3 describes the goodness-of-fit test, which is used to estimate discrepancies between histograms. Section 4 briefly describes the optimization method. Section 5 describes the experiments with the validation of the developed method and the modeling of real network traffic in the simulation tool. The paper ends with a conclusion.

2. Estimation method based on histogram comparison

The newly developed estimation method, the estimation method based on histogram comparison (EMHC), uses the opposite concept and a different approach from the previously developed method, which mimics the defragmentation process.8,44 EMHC is used to estimate the distribution of the data source length process XD(x) from captured packet traffic. It is based on an analytical model that allows the theoretical packet size histogram HT to be estimated for a given distribution of the data source length process XD(x). In the first step, the candidate pdf of the data source length process XD(x) is analytically transformed into the theoretical packet size histogram HT, which is compared with the packet size histogram of the measured packet traffic. The deviation of the theoretical packet traffic histogram from the empirical one serves to reshape the candidate pdf of the data source length process XD(x). Using the optimization method, we obtain the distribution of the data source length process XD(x) for which the discrepancies between the theoretical packet size histogram and the packet size histogram of the captured packet traffic are minimal. The main idea of the EMHC is shown in Figure 3. The EMHC has three main parts:

• a mapping algorithm that mimics fragmentation (MAFM, Mapping Algorithm with Fragmentation Mimics), which estimates a packet size histogram from the data length distribution and takes the packet fragmentation process into account;
• a goodness-of-fit test for evaluating discrepancies between packet size histograms; and
• an optimization method that minimizes the criterion function (the discrepancies between histograms) in order to estimate the optimal data length distribution parameters from the captured packets.

Figure 3. Concept of pdf fD(x) parameters estimation by the estimation method based on the histogram comparison (EMHC).

2.1 Histogram of theoretical data traffic

Theoretical histograms with equidistant intervals of data lengths and of data time intervals are used for modeling data sources. These intervals determine histograms with U = WN bins, organized in W windows with N bins each. The first N bins, which constitute window w = 1, are selected so that they overlap with the bins of the empirical histogram. The bins are labeled bwn, where:

• w ∈ [1,W] is an index that indicates the window in which a histogram bin lies; and
• n ∈ [1,N] is an index that indicates the serial number of a bin in window w (Figure 4).

Figure 4. Organization of bins and windows of the theoretical histogram of the XD(x) process.

Figure 5. Error εWN (lDmax = lPmin + WNΔl).

The probability fD[wn] that the value of the random variable x lies inside a particular bin bwn is determined by

$$f_D[wn] = \int_{x \in b_{wn}} f_D(x)\,dx; \quad w \in [1, W],\ n \in [1, N] \tag{4}$$
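As an illustration of Equation (4), the bin probabilities can be computed as differences of the cdf at the bin edges. The sketch below is only an example under stated assumptions: it uses an exponential data length distribution, equidistant bins that start at a hypothetical offset `x0`, and hypothetical function names.

```python
import math


def exp_cdf(x, mean):
    """Cdf of an exponential distribution with the given mean."""
    return 1.0 - math.exp(-x / mean) if x > 0 else 0.0


def bin_probabilities(cdf, x0, width, windows, bins):
    """Return f[w][n], the probability mass of bin b_wn (zero-based indices).

    The bins are equidistant, start at x0 and are grouped into `windows`
    windows of `bins` bins each, following the layout described above.
    """
    table = []
    for w in range(windows):
        row = []
        for n in range(bins):
            low = x0 + (w * bins + n) * width
            row.append(cdf(low + width) - cdf(low))
        table.append(row)
    return table
```

With a finite number of windows the probabilities sum to slightly less than 1, because the tail of the pdf beyond the last bin is not covered by the histogram.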

where fD(x) is the pdf of the considered process. The random variables of the process XD(x) are data lengths measured in bits or bytes. The cumulative distribution function (cdf) FD(∞) is by definition equal to 1:

$$F_D(\infty) = \int_0^{\infty} f_D(x)\,dx = 1 \tag{5}$$

When W → ∞, the same is valid for the theoretical histogram:

$$F_{wn}(\infty) = \sum_{w=1}^{w \to \infty} \sum_{n=1}^{N} \int_{x \in b_{wn}} f_D(x)\,dx = 1 \tag{6}$$

In real telecommunication systems, the data lengths can be very large; however, they are always limited. Hence, W is a finite number and, for a pdf fD(x) with unlimited support, the cdf Fwn(WN) calculated by Equation (6) is near but less than 1. On the other hand, all possible data of lengths lDmin ≤ lDi ≤ lDmax form the probability space S, where by the axioms of probability we have

$$F_{WN}(\infty) = \sum_{w=1}^{W} \sum_{n=1}^{N} f_D'[wn] = 1 \tag{7}$$

where fD'[wn] is slightly higher than fD[wn], as defined by Equation (4). The difference between fD'[wn] and fD[wn] is proportional to the area under the part of the pdf that is not considered in the theoretical histogram (Figure 5), which in most cases is negligibly small, since W and N are sufficiently large. That is why fD[wn] is used in place of fD'[wn] in the following:

$$\varepsilon_{WN} = \frac{1}{WN} \int_{NW\Delta l + l_{Pmin}}^{\infty} f_D(x)\,dx \tag{8}$$


Data with lengths inside the first window are mapped into packets without fragmentation:

$$D_i\big|_{l_{D_i} \le l_{PDUmax}} \rightarrow P_j\big|_{l_{P_j} = l_{D_i} + l_H} \tag{9}$$

where lDi is the length of data Di, lPDUmax is the length of the longest data in a packet, lPj is the length of packet Pj and lH is the length of a packet header. The data carried in a packet is called the Protocol Data Unit (PDU). Data longer than lPDUmax is first fragmented into a sequence of packets. For example, data with lDi ∈ b2n is fragmented into two PDUs, data with lDi ∈ b3n is fragmented into three PDUs and so forth; data with lDi ∈ bwn is fragmented into w PDUs according to

$$D_i\big|_{l_{D_i} \in b_{wn}} \rightarrow \Big\{ P_j\big|_{l_{MTU} = l_{PDU} + l_H \in b_{1N}},\ P_{j+1}\big|_{l_{MTU} = l_{PDU} + l_H \in b_{1N}},\ \ldots,\ P_{j+w-1}\big|_{l_{MTU} = l_{PDU} + l_H \in b_{1N}},\ P_{j+w}\big|_{l_{PDU} \in b_{1n}} \Big\} \tag{10}$$
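The mapping of Equations (9) and (10) can be sketched for a single data unit as follows. The default values (a 1500-byte MTU, a 20-byte header and a 46-byte minimal packet) and the padding of the last packet are assumptions made for illustration, and the function name is hypothetical.

```python
def fragment(data_length, l_mtu=1500, l_h=20, l_min=46):
    """Return the list of packet lengths produced by one data unit.

    Full packets have length l_mtu; the last packet carries the remainder
    plus the header and is padded up to l_min (an assumption here).
    """
    l_pdu_max = l_mtu - l_h              # longest data carried by one packet
    packets = []
    remaining = data_length
    while remaining > l_pdu_max:         # data in window w > 1: emit MTU packets
        packets.append(l_mtu)
        remaining -= l_pdu_max
    packets.append(max(remaining + l_h, l_min))  # unfragmented or last packet
    return packets
```

With these defaults, a data unit of 3000 bytes lies in the third window and yields two MTU packets plus one short remainder packet, that is, three packets in total.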

Let us assume that the pdf fD(x) is a continuous, monotonic function that is greater than zero in the window interval w ∈ [1, W]. In this case, each bin contains data. The probability can also be defined by the relative frequency with which data appear in traffic in such a bin:

$$f_D[wn] = \frac{N_{wn}}{N_D} \tag{11}$$

where ND is the number of all data in the "experiment", that is, sent into the network, and Nwn is the number of data with lDi ∈ bwn.

2.2 Mapping fD[wn] → p[n]

During the fragmentation process, data longer than lPDUmax is first fragmented into PDUs, which are then packed into a sequence of packets. Usually, one data unit therefore produces more than one packet, which means that the number of data ND is smaller than the number of packets NP. The probability of the packets can also be defined as the relative frequency of their appearance in traffic:

$$p[n] = \frac{N_n}{N_P}, \quad n = [1, \ldots, N] \tag{12}$$

where Nn represents the number of packets in bin n and NP the number of all packets. According to the mapping rule, each data with lDi ∈ b1n generates one packet, each data with lDi ∈ b2n generates two packets, and so on up to the longest data with lDi ∈ bWn, each of which generates W packets (Figure 6). The probability of packets generated by data with lDi ∈ bwn is, according to the fragmentation rules (Figure 6), "transferred" to bins bn as follows:

• probabilities of data source length fD[wn] from bwn (n ≤ N; w = 1, 2, …, W) are transferred to the probability p[n] of bins bn in ratio 1;
• probabilities of data source length fD[wn] from bwn for each window (1 ≤ n ≤ N; w > 1) are all transferred into bN in ratio (w − 1).

The first rule considers the probability of data lengths fD[wn] that affects the packets produced without fragmentation or the last packets (remainders) of a fragmentation process. The second rule considers the probability of data lengths fD[wn] that affects only the MTU (Maximum Transmission Unit) packets, which occur when large data is fragmented.

Figure 6. Fragmentation procedure: an example of data fragmentation with lDi ∈ {b0, b11, b1N, b21, b31, b11}.

Consequently, the total number of packets generated from data is

$$N_P = 1 \cdot N_D \sum_{n=1}^{N} f[1n] + 2 \cdot N_D \sum_{n=1}^{N} f[2n] + \ldots + W \cdot N_D \sum_{n=1}^{N} f[Wn] = N_D \sum_{w=1}^{W} w \sum_{n=1}^{N} f[wn] \tag{13}$$

$$\frac{N_D}{N_P} = \frac{1}{\sum_{w=1}^{W} w \sum_{n=1}^{N} f[wn]}$$
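Equation (13) can be checked numerically with a short sketch. It assumes the bin probabilities are given as a table `f` in which `f[w-1][n-1]` holds f[wn]; the function names are hypothetical.

```python
def packets_per_data(f):
    """Mean number of packets per data unit, N_P / N_D, following Equation (13).

    f[w-1][n-1] is the probability of data lengths in bin b_wn; data in
    window w contribute w packets each.
    """
    return sum(w * sum(row) for w, row in enumerate(f, start=1))


def data_per_packet(f):
    """The inverse ratio N_D / N_P."""
    return 1.0 / packets_per_data(f)
```

For example, if half of the data falls into the first window and half into the second, each data unit produces 1.5 packets on average.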

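$$\begin{aligned} p[n] &= k_{D \to P} \sum_{w=1}^{W} \int_{x \in b_{wn}} f_D(x)\,dx; \quad n = 1, 2, \ldots, N-1 \\ p[N] &= k_{D \to P} \sum_{w=1}^{W} \int_{x \in b_{wN}} f_D(x)\,dx + k_{D \to P} \sum_{w=1}^{W} (w-1) \sum_{n=1}^{N} \int_{x \in b_{wn}} f_D(x)\,dx \end{aligned} \tag{16}$$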

When packets are shorter than the minimal length of the frame payload, functions in the physical layer enlarge the packets with padding bits. Consequently, data that is shorter than lPmin (the required minimal length) is actually packed into packets with lengths that correspond to bin b1. The mapping algorithm is performed in two steps.

$$p[1] = k_{D \to P} \left( \sum_{w=1}^{W} \int_{(w-1)(l_{MTU} - l_H)}^{(w-1) l_{MTU} - w l_H + l_m + \Delta} f_D(x)\,dx \right) \tag{17}$$

$$p[n] = k_{D \to P} \left( \sum_{w=1}^{W} \int_{(w-1) l_{MTU} - w l_H + l_m + (n-1)\Delta}^{(w-1) l_{MTU} - w l_H + l_m + n\Delta} f_D(x)\,dx \right) \tag{18}$$
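The two transfer rules of Section 2.2 can be sketched on a discrete table of bin probabilities. This is a minimal illustration, not the authors' implementation: it assumes that `f[w-1][n-1]` holds fD[wn] and that the normalizing factor kD→P equals the ratio ND/NP, which is an assumption here. The function name `mafm` is hypothetical.

```python
def mafm(f):
    """Map data length probabilities f[w-1][n-1] to packet size probabilities p[n-1]."""
    bins = len(f[0])
    # Assumed normalization: k = N_D / N_P = 1 / sum_w (w * sum_n f[wn]).
    k = 1.0 / sum(w * sum(row) for w, row in enumerate(f, start=1))
    p = [0.0] * bins
    for w, row in enumerate(f, start=1):
        for n, mass in enumerate(row):
            p[n] += mass                    # rule 1: ratio 1 into bin b_n
        p[bins - 1] += (w - 1) * sum(row)   # rule 2: ratio (w - 1) into bin b_N
    return [k * value for value in p]
```

Under this normalization the resulting packet size probabilities always sum to 1, because every data unit in window w contributes exactly w packets.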

The final algorithm for the MAFM, where the Pareto distribution is used to describe the data length process, is

$$p[1] = k_{D \to P} \left( 1 - \frac{k^{\alpha}}{(k + \Delta)^{\alpha}} + \sum_{w=2}^{W} k^{\alpha} A_1(w) \right) \tag{27}$$

$$p[n] = k_{D \to P} \left( \sum_{w=1}^{W} k^{\alpha} A_n(w) \right) \tag{28}$$

$$p[N] = k_{D \to P}\, k^{\alpha} \left( \sum_{w=1}^{W} A_N(w) \right) + k_{D \to P}\, k^{\alpha} \left( \sum_{w=2}^{W} (w-1) B_N(w) \right) \tag{29}$$

where A1(w), An(w), AN(w) and BN(w) are

$$A_1(w) = \frac{1}{\left((w-1)(l_{MTU} - l_H)\right)^{\alpha}} - \frac{1}{\left((w-1) l_{MTU} - w l_H + l_m + \Delta\right)^{\alpha}} \tag{30}$$

$$A_n(w) = \frac{1}{\left((w-1) l_{MTU} - w l_H + l_m + (n-1)\Delta\right)^{\alpha}} - \frac{1}{\left((w-1) l_{MTU} - w l_H + l_m + n\Delta\right)^{\alpha}} \tag{31}$$

$$A_N(w) = \frac{1}{\left((w-1) l_{MTU} - w l_H + l_m + (N-1)\Delta\right)^{\alpha}} - \frac{1}{\left(w(l_{MTU} - l_H)\right)^{\alpha}} \tag{32}$$

$$B_N(w) = \frac{1}{\left((w-1)(l_{MTU} - l_H)\right)^{\alpha}} - \frac{1}{\left(w(l_{MTU} - l_H)\right)^{\alpha}} \tag{33}$$

where α is the shape parameter and k is the location parameter of the Pareto distribution (α, k > 0), lMTU is the maximal size of packets, lH represents the packet headers, lm is the minimal size of packets and Δ denotes the bin width.

2.5 Results for the Pareto pdf case

In the second case, network traffic is generated in the simulation tool with a Pareto distribution of data lengths (α = 1.05, k = 26) for different random generator seeds. The packet size histograms HS of the generated network traffic (almost 100,000 captured packets) for 10 different generator seeds are shown in Figure 9. It can be seen from Figure 9 that the differences between the packet size histograms, which are most visible in the first bin b1 and the last bin bN, are negligible. These discrepancies are a consequence of the property of the Pareto distribution that it has a finite expected value E(X) only under certain conditions.24,25

Figure 9. The packet size histograms HS for 10 different random generator seeds of generated traffic. In all cases, the Pareto data length distribution is used (α = 1.05, k = 26).

These estimated histograms HT are compared with the histograms of generated network traffic HS shown in Figure 9. For seed 127, the validation result is very good: the discrepancy between the histograms measured by the χ2 test is 14.56, with a P-value of 0.4089 ≈ 40%. Based on this result, the null hypothesis H0 cannot be rejected, which means that the observed data is consistent with the specified distribution. However, there are some cases where χ2 is around 1000, which means that the P-value is < 0.001. For these results, the null hypothesis H0 must be rejected. The simulated traffic for the same Pareto distribution of data lengths is very diverse, which can also be seen from the packet size histograms HS. The reason for this lies in the previously mentioned property of the Pareto distribution concerning its expected value when α ≤ 1 and α → 1. It can be shown, quite generally, that our method gives unstable results for distributions for which a finite expected value E(X) does not exist.

2.6 Conclusions about validation

The validation results show that the proposed MAFM model works well for distributions with a finite expected value E(X), such as the exponential distribution. However, for distributions without a finite expected value E(X), such as the Pareto distribution with shape parameter α ≤ 1, the method becomes unstable.

3. Goodness-of-fit test

Deviations between histograms are measured by statistical tests.46,49,52,53 Among them, the following three are chosen:

- Kolmogorov–Smirnov test;
- Anderson–Darling test;
- χ2 or Pearson's test.

Traffic quantity can be measured in bits, bytes or number of packets per time unit. The choice of measurement unit usually depends on the properties that we want to emphasize. Since packets do not have a constant length, the amount of data transferred by packets depends on the lengths of the packets. The standard statistical tests consider only the number of packets, so they do not depend on the amount of data contained. Numerous experiments show that, among the standard statistical tests used in our estimation method based on histogram comparison (EMHC), the best convergence is achieved with the standard χ2 test:

$$\chi^2 = \sum_{n=1}^{N} \frac{\Delta^2 p_n}{p_n} = \sum_{n=1}^{N} \frac{(f_{t,n} - f_{e,n})^2}{f_{e,n}} \tag{34}$$

where Δpn represents the difference between the probabilities of bin n in the two packet size histograms, and ft,n and fe,n represent the relative frequencies of packets in the theoretical and empirical histograms, respectively. Even better results are achieved by modifying the χ2 test with a weight function, which also takes into account the influence of packet length on the test results.
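The χ2 statistic of Equation (34) is straightforward to compute from two histograms of relative frequencies. In the sketch below, the function name is hypothetical and the skipping of empty empirical bins is an assumption made to avoid division by zero; the source does not say how such bins are treated.

```python
def chi_square(f_t, f_e):
    """Chi-square discrepancy between theoretical (f_t) and empirical (f_e) histograms.

    Bins whose empirical frequency is zero are skipped (an assumption).
    """
    return sum((t - e) ** 2 / e for t, e in zip(f_t, f_e) if e > 0)
```

Identical histograms give a discrepancy of zero, and the value grows as the theoretical histogram moves away from the empirical one.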

3.1 Proposed weighted χw2 test

The goal of modifying the χ2 test is to introduce the influence of the amount of data in the packets on the test results.54 Such a modification better balances the inner structure of the statistical test and consequently gives more accurate corrections of the pdf parameters. The modification is done by multiplying the terms of χ2 by the weight function U = {u1, u2, …, uN}:

$$\chi_u^2 = \sum_{n=1}^{N} u_n \frac{\Delta^2 p_n}{p_n} = \sum_{n=1}^{N} u_n \frac{(f_{t,n} - f_{e,n})^2}{f_{e,n}} \tag{35}$$

where u1, u2, … are the weights of the probabilities' influences. They are defined by the ratio

$$u_n = \frac{l_{b_n}}{l_{b_N}}$$

where lbn and lbN are the mean values of packet lengths in histogram bins bn and bN, respectively. Because all bins have an equal width, each mean can be calculated simply as the midpoint between the smallest and the largest packet length of the bin:

\[
l_{b_n} = \frac{l_{P_{\min},n} + l_{P_{\max},n}}{2}, \qquad n = 1, 2, \ldots, N \qquad (36)
\]

and ft,n and fe,n are the relative frequencies of packets with lPDUj ∈ bn in the theoretic and empiric histograms, respectively.
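A weighted variant of the test can be sketched in Python as follows. Two points are assumptions and not confirmed by the text: the weights are taken as u_n = l_bn / l_bN (each bin's mean packet length divided by that of the last bin), and the bin means are taken as midpoints of equal-width bins between the minimal and maximal packet length. The function names are hypothetical.

```python
def bin_midpoints(l_min, l_max, n_bins):
    """Midpoints of n_bins equal-width bins covering [l_min, l_max]."""
    width = (l_max - l_min) / n_bins
    return [l_min + (n + 0.5) * width for n in range(n_bins)]


def weighted_chi_square(f_t, f_e, weights):
    """Chi-square discrepancy with one weight per histogram bin."""
    if not (len(f_t) == len(f_e) == len(weights)):
        raise ValueError("all inputs must have one value per bin")
    total = 0.0
    for ft, fe, u in zip(f_t, f_e, weights):
        if fe > 0:  # assumption: empty empiric bins are skipped
            total += u * (ft - fe) ** 2 / fe
    return total


# Packet length limits quoted in the paper: 64 B minimum, 1502 B maximum.
mids = bin_midpoints(64, 1502, 3)
# ASSUMPTION: weight = mean bin length relative to the last bin.
weights = [m / mids[-1] for m in mids]

f_theoretic = [0.5, 0.3, 0.2]  # hypothetical relative frequencies
f_empiric = [0.4, 0.4, 0.2]
print(weighted_chi_square(f_theoretic, f_empiric, weights))
```

With this choice a mismatch in a bin of long packets contributes more to the statistic than the same mismatch in a bin of short packets, which is the stated purpose of the weighting.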

4. Optimization

The optimization task, shown schematically in Figure 3, is to find the distribution parameters of the data length process XD[n] that cause minimal discrepancies between the packet-size process of the measured network traffic ZPM[n] and the packet-size process of the modeled network traffic ZPS[n]. The estimation of the statistical parameters of measured traffic14–17,31,33 is, therefore, an optimization process, which minimizes (or maximizes) the criterion function f(x):

\[
\min_{x \in R} f(x) \qquad (37)
\]

Criterion function f(x) is the measure of fitting discrepancies between the measured histogram of packet sizes HM and the estimated theoretical histogram of packet sizes HT, which is derived from the data length distribution by the MAFM algorithm. The criterion function is f(χ2) (Equation (34)) in the case of the Chi-square test or f(χw2) (Equation (35)) if the weighted Chi-square test is used. Such problems (Equation (37)) can be solved by optimization methods, in which the criterion function f(x) is minimized in domain A.55,56 Linear optimization problems are solved by linear programming;57,58 in our case, the criterion function f(x) is non-linear, so non-linear programming is used.56 From our experiments, it can be seen that criterion functions f(χ2) or f(χw2) do not differ. This is the reason why we use numerical methods55,56 to solve this problem. In our experiments, the simple elimination method with an unrestricted search and fixed step56 is chosen as the optimization method. This method gives satisfactory results for unimodal criterion functions f(x). Figure 10 shows two-dimensional discrepancies between a histogram HS of simulated network traffic ZS[n] in the simulation tool and the estimated histogram of packet sizes HT, estimated for different parameters (k and α) of the Pareto distribution of data lengths with the use of the Chi-square goodness-of-fit test.
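An unrestricted search with a fixed step can be sketched for a single parameter as below. This is a simplified one-dimensional illustration under stated assumptions, not the implementation used in the experiments: the function name `fixed_step_search`, the step size and the quadratic stand-in for the criterion function are all hypothetical. In the paper the criterion would be the histogram discrepancy evaluated for candidate distribution parameters.

```python
def fixed_step_search(f, x0, step, max_iter=100_000):
    """Minimize a unimodal function by walking from x0 in fixed steps.

    The walk first probes both neighbours to pick the descending
    direction, then continues until the function stops decreasing.
    Returns the best point found and its function value.
    """
    fx = f(x0)
    if f(x0 + step) < fx:
        direction = step
    elif f(x0 - step) < fx:
        direction = -step
    else:
        return x0, fx  # x0 is already within one step of the minimum
    x = x0
    for _ in range(max_iter):
        x_next = x + direction
        f_next = f(x_next)
        if f_next >= fx:
            break  # the function started to increase: stop
        x, fx = x_next, f_next
    return x, fx


# Hypothetical stand-in criterion with its minimum at a shape value of 1.05.
def criterion(alpha):
    return (alpha - 1.05) ** 2


best_alpha, best_value = fixed_step_search(criterion, x0=0.5, step=0.01)
print(best_alpha, best_value)
```

The result is accurate only to within one step, and the method relies on the criterion being unimodal along the search direction, which matches the restriction stated in the text.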

5. Simulation results

5.1 Validation of the EMHC

Validation of the EMHC method is conducted in almost the same way as validation of the MAFM algorithm, described in Section 2. The network traffic is generated with the OPNET Modeler simulation tool, using known distributions for the processes of data lengths and data inter-arrival times. The packet size histogram HM is computed from the obtained packets. For each simulated network traffic, the data length parameters are estimated using the EMHC method. The estimated parameters are then compared with the parameters used in the traffic generation model in the simulation tool. Firstly, different simulated network traffics are generated in the simulation tool using an exponential data length distribution with two different rate parameters λ. For each rate parameter λ we also generate different network traffic for different generator seeds. For each block of captured packets of traffic generated in this way, the rate parameter λe is estimated using the proposed EMHC. Table 2 shows the results of these tests. The results show some discrepancies between the rate parameters λ chosen in the simulation tool and the rate parameters λe estimated by the EMHC. The errors, in worst case scenarios, are about


Figure 10. Criterion function f(χ2) as discrepancies (χ2) between histogram HS of synthetically generated network traffic in the simulation tool (Pareto: k = 24, α = 1.05, lMTU = 1502 B, lm = 64 B, lH = 46 B) and packet size histogram HT estimated by the Mapping Algorithm with Fragmentation Mimics (MAFM) for different values of the α and k parameters of the Pareto data length distribution.

Table 2. Generated network traffic in the OPNET simulation tool defined by exponential data length distribution (λ1 = 0.001, λ2 = 0.002) for different seeds, and the rate parameter λe estimated using the developed estimation method based on histogram comparison (EMHC).

Seed   λ1      Estimated λe1   λ2      Estimated λe2
121    0.001   0.001005        0.002   0.001999
122    0.001   0.001012        0.002   0.001987
123    0.001   0.001003        0.002   0.001998
124    0.001   0.001002        0.002   0.002013
125    0.001   0.001011        0.002   0.002022

1–2% (λ = 0.002, seed 125; λ = 0.001, seed 122 in Table 2). In these cases, estimation is carried out on more than 10,000 obtained packets. These results show that the proposed EMHC works well for the exponential distribution. In the second scenario, the network traffic is generated using a Pareto distribution for the process of data lengths. During the validation of the MAFM model for statistical transformation, we noted that in some cases where Pareto is used, the transformation method becomes unstable. This happens in the cases with shape parameters α ≤ 1 and α → 1. For this reason, a very long sample of network traffic (around a million packets) is generated for each test traffic.

Table 3. Generated network traffics using the OPNET Modeler simulation tool, defined by Pareto data length distribution for different seeds, and the shape parameter αe estimated by the estimation method based on histogram comparison (EMHC) from the captured packets.

Seed   α1    Estimated αe1   α2   Estimated αe2   α3    Estimated αe3
122    0.9   0.911           1    1.000           1.1   1.089
125    0.9   0.899           1    0.976           1.1   1.072
126    0.9   0.921           1    0.995           1.1   1.091
127    0.9   0.932           1    0.996           1.1   1.090
128    0.9   0.935           1    1.017           1.1   1.087

In the first case, the shape parameter is α = 0.9, in the second α = 1 and in the third α = 1.1. Table 3 shows the estimated shape parameters αe for network traffic modeled by these Pareto distributions (α1 = 0.9, k = 26; α2 = 1, k = 26; α3 = 1.1, k = 26) for different generator seeds. For the first case (α1 = 0.9, k = 26), the largest discrepancy between the set α and the estimated αe shape parameters in Table 3 is 3.89% (seed 128). In the second case (α2 = 1, k = 26), the largest discrepancy is 2.4% (seed 125), whereas for the third case (α3 = 1.1, k = 26) it is about 2.5% (seed 125).
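The worst-case relative discrepancies can be recomputed directly from the rows of Table 3. The short Python sketch below does this; the dictionary layout and the helper name `worst_case` are illustrative choices, while the numbers are copied from the table.

```python
# Estimated shape parameters from Table 3, keyed by set value and seed.
table3 = {
    0.9: {122: 0.911, 125: 0.899, 126: 0.921, 127: 0.932, 128: 0.935},
    1.0: {122: 1.000, 125: 0.976, 126: 0.995, 127: 0.996, 128: 1.017},
    1.1: {122: 1.089, 125: 1.072, 126: 1.091, 127: 1.090, 128: 1.087},
}


def worst_case(alpha, estimates):
    """Return (seed, relative error in percent) of the worst estimate."""
    seed = max(estimates, key=lambda s: abs(estimates[s] - alpha))
    return seed, 100.0 * abs(estimates[seed] - alpha) / alpha


for alpha, estimates in table3.items():
    seed, error = worst_case(alpha, estimates)
    print(f"alpha = {alpha}: worst seed {seed}, error {error:.2f}%")
```

The same helper can be applied to the rate parameters of Table 2 by replacing the dictionary contents.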


Figure 11: Captured network traffic in a real network, which is used to estimate distribution parameters and then simulate them in the simulation tool.

5.2 Experiment on real traffic

The developed estimation method is suitable for estimating the data source processes of different IP packet network traffic, which can be captured in small networks or in large core networks. The estimated distributions and parameters of the data source processes were used for modeling and simulating the measured packet network traffic with a standard IP station in the OPNET simulation tool. With this method we do not need any deep packet or protocol analysis, because detailed data content is not important for modeling purposes. The test traces of network traffic were captured in one of the laboratories of the faculty of electrical engineering and computer science in Maribor, which is part of the University of Maribor. The captured packets of the test network traffic were generated by different applications and protocols, such as hypertext transfer protocol (HTTP), simple mail transfer protocol (SMTP), etc., but we considered all packets as one test traffic. The captured test traffic is shown in Figure 11; it was captured by a Wireshark sniffer.59 During capturing, the Wireshark sniffer gets the packet time stamps from the libpcap (WinPcap) library with a precision of microseconds, which in turn gets them from the operating system kernel.59 For all captured real packets of the test traffic, we calculated the following statistical properties: average bit rate (107.8 kb/s), average packet rate (23.23 p/s) and Hurst parameter using the R/S method (0.732). We also found the maximal length of a packet (lMTU = 1502 B), the minimal length of a packet (lm = 64 B) and the length of the packet header (lH = 46 B). Based on the captured packets of the test trace, the distribution parameters of the data length process XD(x) are estimated using the proposed EMHC. The distribution parameter of data inter-arrival time is estimated using a

Table 4. Estimated distribution parameters for stochastic processes of data source network traffic. XD is estimated by the estimation method based on histogram comparison (EMHC); YD is estimated by a method that mimics the defragmentation process.8,45

Process   Distribution   a       k
XD        Pareto         26      0.868
YD        Weibull        0.355   0.0131

Table 5. Estimated distribution parameters for stochastic processes of data source network traffic.

Seed   Packet rate (p/s)   Bit rate (kb/s)   Hurst par. (H)
125    19.66               55.66             0.645
126    23.91               116.68            0.636
127    19.33               60.61             0.689
128    21.09               81.98             0.681

method that mimics the defragmentation process.8,45 Table 4 shows the estimated distributions and parameters of the data source processes for the measured real traffic shown in Figure 11. The estimated distribution parameters are validated by comparing the measured real network traffic with simulated network traffic. The simulated traffic is modeled with the parameters estimated by the EMHC for the process of data lengths and by the defragmentation method8,45 for the data inter-arrival process. In the simulation tool, four simulation scenarios are prepared, which differ only in the seed of the traffic generators. Table 5 shows the parameters of the simulated network traffic.
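The Hurst parameter reported for the measured trace and for the simulated traffic in Table 5 is estimated with the R/S method. A minimal Python sketch of a rescaled-range estimator is given below for orientation. The block sizes (powers of two starting at 8), the averaging over non-overlapping blocks and the function names are assumptions of this sketch; the exact procedure used in the experiments is not specified in the text.

```python
import math
import random


def rescaled_range(block):
    """R/S statistic of one block: range of cumulative deviations / std."""
    n = len(block)
    mean = sum(block) / n
    cum = lo = hi = 0.0
    for value in block:
        cum += value - mean
        lo = min(lo, cum)
        hi = max(hi, cum)
    std = math.sqrt(sum((v - mean) ** 2 for v in block) / n)
    return (hi - lo) / std if std > 0 else None


def hurst_rs(series, min_block=8):
    """Estimate H as the slope of log(mean R/S) against log(block size)."""
    xs, ys = [], []
    size = min_block
    while size <= len(series) // 2:
        values = []
        for start in range(0, len(series) - size + 1, size):
            rs = rescaled_range(series[start:start + size])
            if rs is not None and rs > 0:
                values.append(rs)
        if values:
            xs.append(math.log(size))
            ys.append(math.log(sum(values) / len(values)))
        size *= 2
    if len(xs) < 2:
        raise ValueError("series too short for an R/S estimate")
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    # Least-squares slope of the log-log points.
    num = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    return num / sum((x - mx) ** 2 for x in xs)


# Uncorrelated noise should give a value in the vicinity of 0.5;
# long-range dependent traffic gives values closer to 1.
random.seed(1)
noise = [random.gauss(0.0, 1.0) for _ in range(4096)]
print(hurst_rs(noise))
```

In practice the input series would be the traffic volume per time slot (for example bytes per second) derived from the captured packet time stamps.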


Figure 12. Comparison between the bit rate of measured and simulated network traffic.

The simulated network traffic in Table 5 is modeled using the estimated distribution parameters derived from the measured network traffic. We expect the simulated network traffic to represent the measured network traffic, with minimal disparities between the two with regard to the average bit rate, the average packet rate and the Hurst parameter. From Table 5, it can be seen that when the seed is 126, the simulated network traffic is very close to the measured network traffic in terms of the measured parameters. Figure 12 shows the measured and simulated (seed 126) network traffic in bits/s. This also confirms that our method for distribution parameter estimation is valid. However, for the other simulated network traffics (seeds 125, 127 and 128), the discrepancies between the measured and the simulated network traffic are larger, especially in terms of the bit rate. These discrepancies result from a property of the Pareto distribution concerning its expected value. If the estimated shape parameter of the Pareto distribution is lower than 1, then the expected value (E(x)) is not defined.24,25 This means that the Pareto distribution is very unpredictable in such cases, which is also the reason for the discrepancies between the simulated and measured bit rates of network traffic. From Figure 12 it can be seen that the simulated network traffic represents the measured network traffic very well. The burstiness of the measured and simulated network traffic, as measured by the Hurst parameter estimation, is also comparable.
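The statement about the expected value can be made concrete with a short derivation. It assumes the standard (type I) Pareto parameterization with scale k as the minimum value and shape α; whether this matches the parameter naming of the simulation tool is an assumption.

```latex
\begin{align*}
f(x) &= \frac{\alpha k^{\alpha}}{x^{\alpha+1}}, \qquad x \ge k,\\
E(x) &= \int_{k}^{\infty} x\, f(x)\, dx
      = \alpha k^{\alpha} \int_{k}^{\infty} x^{-\alpha}\, dx
      = \begin{cases}
          \dfrac{\alpha k}{\alpha - 1}, & \alpha > 1,\\[4pt]
          \infty, & \alpha \le 1.
        \end{cases}
\end{align*}
```

For α ≤ 1 the integral diverges, so the sample mean of the generated data lengths does not settle as the sample grows, and the average bit rate of a finite simulation run depends strongly on the generator seed.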

6. Conclusion

This paper presents a new method for estimating the statistical parameters of network traffic, the estimation method based on histogram comparison (EMHC). The presented method allows estimation of the statistical distribution of the data length process from captured packets. Together with the defragmentation method described by Fras et al.,8,45 all distribution parameters of the captured network traffic can be estimated. These methods can be used when network traffic cannot be measured directly on the higher ISO/OSI layers, for example when we do not have access to some of the network components (i.e. application servers). The estimated distribution parameters of network traffic processes can be used for statistical analysis of network traffic; the most important application, for us, is traffic modeling for simulation. Traffic measurements can be done with a relatively simple network packet capture tool (a sniffer). Using the described validation methodology, it can be confirmed that the developed estimation method is valid with some limitations, which are due to long tail distribution properties, such as in the Pareto pdf case where the shape parameter is α ≤ 1. The developed EMHC, together with the aforementioned defragmentation method, is implemented in a publicly accessible software tool named TraffMod,60 which is used for all described tests. This tool can be used for network traffic analysis as well as for estimation of data source traffic processes from the measured network traffic by the developed methods.

Funding

This work was partly financed by the Slovenian Ministry of Defense as part of the target research program 'Science for Peace and Security': M2-0140 - Modeling of Command and Control information systems, and partly by the Slovenian Ministry for Higher Education and Science, research program P2-0065 'Telematics'.

References

1. Jiang M, Hardy S and Trajkovic L. Simulating CDPD networks using OPNET. In: OPNETWORK 2000, Washington DC, August 2000.
2. Mohorko J, Fras M and Čučej Ž. Modeling methods in OPNET simulations of tactical command and control information systems. In: IWSSIP conference, Maribor, Slovenia, 27–30 June 2007.
3. Fras M, Mohorko J and Čučej Ž. Estimating the parameters of measured self similar traffic for modeling in OPNET. In: IWSSIP conference, Maribor, Slovenia, 27–30 June 2007.
4. Leys P, Potemans J, Van den Broeck B, et al. Use of the raw packet generator in OPNET. In: proceedings OPNETWORK 2002, Washington DC, 26–27 April 2002.
5. Botta A, Dainotti A and Pescapè A. Do you trust your software-based traffic generator? IEEE Commun Mag 2010; 48: 158–165.
6. Fras M. Methods for the statistical modeling of measured network traffic for simulation purposes. PhD Thesis, Maribor, Slovenia, 2009.
7. RFC 793 - Transmission Control Protocol, http://www.faqs.org/rfcs/rfc793.html (accessed 15 July 2011).
8. Fras M, Mohorko J and Čučej Ž. Modeling of captured network traffic by the mimic defragmentation process. Simulation 2011; 87: 437–448.
9. Park K, Kim G and Crovella ME. On the relationship between file sizes, transport protocols, and self-similar network traffic. In: international conference on network protocols, October 1996, pp.171–180.
10. Abrahamsson H. Traffic measurement and analysis. Kista: Swedish Institute of Computer Science, 1999.
11. Williamson C. Internet traffic measurement. IEEE Internet Comput 2001; 5: 70–74.
12. Celeda P. High-speed network traffic acquisition for agent systems. In: proceedings IEEE/WIC/ACM international conference on high-speed network traffic acquisition for agent systems, intelligent agent technology, 2–5 November 2007, pp.477–480.
13. Pezaros D. Network traffic measurement for the next generation internet. Computing Department, Lancaster University, 2005.
14. Epema D, Pouwelse J, Garbacki P, et al. The bittorrent P2P filesharing system: measurements and analysis. In: peer-to-peer systems IV, 2005.
15. Saroiu S, Gummadi PK and Gribble SD. A measurement study of peer-to-peer file sharing systems. In: proceedings of the multimedia computing and networking (MMCN), San Jose, CA, 2–5 January 2002.
16. Asensio E, Orduna JM and Morillo P. Analyzing the network traffic requirements of multiplayer online games. In: proceedings of the 2nd international conference on advanced engineering computing and applications in sciences: ADVCOMP'08, 2008, pp.229–234.
17. Yu Y, Liu D, Li J, et al. Traffic identification and overlay measurement of Skype. In: proceedings of the international conference on computational intelligence and security, vol. 2, 3–6 November 2006, pp.1043–1048.
18. Leland WE, Taqqu MS, Willinger W, et al. On the self-similar nature of Ethernet traffic (Extended version). IEEE/ACM Trans Networking 1994; 2: 1–15.

19. Willinger W and Paxson V. Where mathematics meets the Internet. Not Am Math Soc 1998; 45: 961–970.
20. Crovella ME and Bestavros A. Self-similarity in World Wide Web traffic: evidence and possible causes. IEEE/ACM Trans Networking 1997; 6: 835–846.
21. Paxson V and Floyd S. Wide area traffic: the failure of Poisson modeling. IEEE/ACM Trans Networking 1995; 3: 226–244.
22. Garrett M and Willinger W. Analysis, modeling and generation of self-similar VBR video traffic. In: proceedings of ACM SIGCOMM 94, 1994, pp.269–280.
23. Dainotti A, Pescapè A and Ventre G. A packet-level characterization of network traffic. In: 11th IEEE international workshop on computer-aided modeling, analysis and design of communication links and networks (CAMAD 2006), Trento, Italy, June 2006, pp.38–45.
24. Sheluhin O, Smolskiy S and Osin A. Self-similar processes in telecommunications. Chichester: John Wiley & Sons, 2007.
25. Park K and Willinger W. Self-similar network traffic and performance evaluation. John Wiley & Sons, 2000.
26. Yılmaz H. IP over DVB: management of self-similarity. Master of Science, Boğaziçi University, 2002.
27. Karagiannis T, Molle M and Faloutsos M. Understanding the limitations of estimation methods for long-range dependence. TechReport, UC Riverside, 2006.
28. Vujicic B, Cackov N, Vujicic S, et al. Modeling and characterization of traffic in public safety wireless networks. In: SPECTS 2005, Simon Fraser University, Vancouver, Canada.
29. Karagiannis T and Faloutsos M. Selfis: a tool for self-similarity and long range dependence analysis. In: 1st workshop on fractals and self-similarity in data mining: issues and approaches, University of California, July 2002.
30. Nuzman C, Saniee I, Sweldens W, et al. A compound model for TCP connection arrivals for LAN and WAN applications. Comput Network 2002; 40: 319–337.
31. Crovella ME and Lipsky L. Long-lasting transient conditions in simulations with heavy-tailed workloads. In: proceedings of the 1997 winter simulation conference, Atlanta, GA, 7–10 December 1997. Edmonton, Canada.
32. Joo Y and Ribeiro V. TCP/IP traffic dynamics and network performance: A lesson in workload modeling, flow control, and trace-driven simulations. Comput Commun Rev 2001; 31: 25–37.
33. Feldmann A, Gilbert AC, Huang P, et al. Dynamics of IP traffic: a study of the role of variability and the impact of control. In: proceedings of applications, technologies, architectures, and protocols for computer communication, Cambridge, MA, 30 August–3 September 1999, pp.301–313.
34. Klampfer S, Kotnik B, Svečko J, et al. MIMO simulator of call server input lines occupancy. Simulation 2011; 87: 423–436.
35. Ryu B and Lowen S. Fractal traffic model for Internet simulation. In: proceedings of the 5th IEEE symposium on computers and communications (ISCC 2000), 2000.
36. Dainotti A, Pescapè A, Salvo Rossi P, et al. Internet traffic modeling by means of hidden Markov models. Comput Network 2008; 52: 2645–2662.


37. Costamagna E, Favalli L and Tarantola F. Modeling and analysis of aggregate and single stream internet traffic. In: proceedings of IEEE GLOBECOM, December 2003, pp.3830–3834.
38. Salvador P, Pacheco A and Valadas R. Modeling IP traffic: joint characterization of packet arrivals and packet sizes using BMAPs. Comput Network 2004; 44: 335–352.
39. Muscariello L, Mellia M, Meo M, et al. Markov models of internet traffic and a new hierarchical MMPP model. Comput Commun J 2005; 28: 1835–1851.
40. Klemm A, Lindemann C and Lohmann M. Modeling IP traffic using the batch Markovian arrival process. Perform Eval J 2003; 54: 149–173.
41. Xinjie C, Tan T-Y and Subramanian KR. Source traffic modeling in OPNET. In: OPNETWORK 1999 - the third annual OPNET technology conference, Washington, DC, 1999.
42. Mohorko J and Fras M. Modeling of IRIS replication mechanism in a tactical communication network, using OPNET. Comput Network 2009; 53: 1125–1136.
43. Mohorko J, Fras M and Čučej Ž. Modeling of IRIS replication mechanism in tactical communication network with OPNET. In: OPNETWORK 2007 – the eleventh annual OPNET technology conference, Washington, DC, 27–31 August 2007.
44. Fras M, Mohorko J and Čučej Ž. A network traffic sources modeling method based on measured data defragmentation. In: IWSSIP conference, Rio de Janeiro, Brazil, 17–19 June 2010, pp.328–331.
45. Fras M, Mohorko J and Čučej Ž. A new approach to the modeling of network traffic in simulations. Inf MIDEM 2009; 1: 41–45.
46. Law AM and McComas MG. How the Expertfit distribution fitting software can make simulation models more valid. In: proceedings of the 2001 winter simulation conference, New Orleans, LA, 7–10 December 2003.
47. Free (demo) fitting tool EasyFit software, www.mathwave.com/ (accessed 17 March 2011).
48. Schervish MJ. P values: what they are and what they are not. Am Stat 1996; 50: 203–206.
49. Plackett RL. Karl Pearson and the chi-squared test. Int Stat Rev 1983; 51: 59–72.
50. Greenwood PE and Nikulin MS. A guide to chi-squared testing. New York: Wiley, 1996.
51. Cramer D and Howitt D. The Sage dictionary of statistics. London: SAGE Publications Ltd, 2004. p.76.
52. Chakravarti M, Laha RG and Roy J. Handbook of methods of applied statistics. Volume I. New York: John Wiley and Sons, 1967, pp.392–394.
53. Eadie WT, Drijard D, James FE, et al. Statistical methods in experimental physics. Amsterdam: North-Holland, 1971, pp.269–271.

54. Fras M, Mohorko J and Čučej Ž. A new goodness of fit test for histograms regarding network traffic packet size process. In: international conference on advanced technologies for communications, Hanoi, Vietnam, 6–9 October 2008.
55. Nocedal J. Numerical optimization. In: Wright SJ (ed) Linear programming: interior-point methods. New York: Springer, 1999.
56. Rao SS. Engineering optimization - theory and practice. 3rd ed. John Wiley & Sons, 1996.
57. Luenberger DG. Linear and nonlinear programming. 2nd ed. Norwell, MA: Kluwer Academic Publishers, 2003.
58. Vanderbei RJ. Linear programming - foundations and extensions. 3rd ed. Springer New York: Science + Business Media, 2008.
59. Wireshark. Free sniffer software, www.wireshark.org/ (accessed 28 February 2011).
60. OPNET, http://www.sparc.uni-mb.si/opnet/ (accessed 2 April 2011).

Author biographies

Matjaž Fras was born in Maribor, Slovenia on 13 March 1980. He obtained his BS degree in 2005 and MS degree in 2007 in electrical engineering from the University of Maribor, faculty of electrical engineering and computer science, where he has been working as a researcher since September 2006 in the laboratory for signal processing and remote control. He has been engaged in network traffic analysis, self-similarity and network simulations. Since 2009 he has worked at the Margento R&D Company as a researcher and developer of mobile payment systems. He obtained his PhD degree in 2009.

Jože Mohorko received his PhD in electrical engineering from the University of Maribor in 2002. From 1990 to 2001, he worked as an assistant and researcher of electrical and computer engineering at the University of Maribor. From 2001 to 2006 he worked as senior HW engineer at the Ultra d.o.o Company. Since 2006 he has been a researcher at the University of Maribor. His research interests include communications, signal and image processing, computer vision, measurements and telematics. He is a member of the Institute of Electrical and Electronics Engineers (IEEE).

Žarko Čučej received his BSc degree in electrical engineering from the University of Ljubljana, Ljubljana, Slovenia, in 1976 and his MSc and PhD degrees in electrical engineering from the University of Maribor, Maribor, Slovenia, in 1984 and 1990, respectively. Since 1985, he has been a professor of electrical and computer engineering with the University of Maribor. His research interests include remote control, digital communication and signal processing.
