On network traffic statistical analysis

This article presents a statistical analysis of university network traffic using the methods of self-similarity and chaos analysis. The object of measurement is the Šiauliai University LitNet network node, which serves educational institutions of the northern Lithuania region. Time series of network traffic characteristics are formed by registering the number of packets arriving at the node under different traffic regimes and at different time resolutions. Measurement results are processed by calculating the Hurst index, and the reliability of the results is estimated statistically. The investigation allows us to conclude that the time series are self-similar and that the aggregated time series exhibit slowly decaying dependence.


Introduction
Empirical research on packet traffic in computer networks shows that it is self-similar [1,2,6,8]. Once this property has been estimated, changes in traffic can be forecast adequately, and the forecasts can be used to increase network throughput and improve quality of service (QoS) by regulating packet latency, limiting delay variation and reducing packet loss at the data link and physical OSI layers [3,10]. QoS refers to the capability of a network to provide better service to selected network traffic over various technologies. These technologies make it possible to measure bandwidth, detect changing network conditions (such as congestion or availability of bandwidth), and prioritize or throttle traffic.
The self-similarity phenomenon is explained by the bursty pattern of network usage. Data traffic is inherently bursty: it occurs in short bursts of communication followed by long periods of silence. Users who need network resources to send their data can be characterised as follows: they do not announce exactly when they will demand access; one cannot predict how much they will demand; most of the time they do not need access to the network; and when they do ask for it, they want immediate access [9]. This situation is common in distance learning networks, where students receive tasks and send their answers at almost the same time.
Computer networks are widely used in contemporary university studies, and they often undergo unpredicted overload. Effective network control requires monitoring of network nodes in order to forecast node load and overload. According to A. Erramilli, O. Narayan and W. Willinger, empirical research on a 10 Mbps Ethernet local area network carried out at the Bellcore laboratory in 1989 established that Ethernet traffic has fractal characteristics and is self-similar with long-range dependence [1]. In his monograph, I. Kaj [5] proposes methods for the statistical analysis of modern communication traffic that draw on contemporary mathematical modelling. J. Beran analyses network traffic as a fractal process with second-order statistical self-similarity, which is characterised by a fractal measure [6]. Methods of non-linear (chaos) theory are applied to model and describe network processes and to estimate the heavy tails that characterise the strong burstiness of network traffic.
The aim of this research is to analyse traffic measurements from the Šiauliai University LitNet network node and to estimate its self-similarity. No comparable analysis of self-similarity in the data of the region's educational networks has yet been carried out in Lithuania. Software and hardware network monitoring tools were used to register data packets at a specified interval. Data was registered at different levels of time discretisation and at different levels of network load, and aggregated time series were formed. The measurement results were processed by estimating the fractal measure, calculating the Hurst coefficient and statistically estimating the reliability of the results [6].

Composition of Empirical Data
For the traffic measurements, the Šiauliai University LitNet network node with the highest traffic load was chosen. At this node, incoming inter-city channel traffic of 1 Gbps is distributed to the university and to the educational institutions of the Šiauliai region. Only data packets arriving at the node M were analysed; sent packets were disregarded. The collected information was stored in an external PostgreSQL database (DB). Measurements were recorded with a precision of one microsecond. A database record was created immediately after a TCP or other protocol data frame was received. Service information was not stripped when the frame data was saved: header, start-of-frame marker, sender and receiver addresses, etc. The maximum length of the recorded transport frames was 1518 bytes [10].
The ulogd software for the Linux operating system, distributed under the GPL licence, was used for the measurements [12]. Data was captured from the router's incoming data traffic. Every pre-routed packet is registered by the ulogd daemon in the PostgreSQL database (see Fig. 1). Data from January 4, 2008, 13:30:35 to April 16, 2008, 12:00:00 was chosen from the database for analysis. Within this period, more than three billion records were accumulated in the database; they corresponded to 8936965 seconds, or 103 days 10 hours 29 minutes and 25 seconds. Data for analysis was selected according to day of the week and time of day on the basis of traffic intensity, i.e., time series were selected from days when traffic was lowest (Sundays), medium (Saturdays) or highest (weekdays). To investigate how the load changes during the day, the hours of highest, medium and lowest traffic were identified for every day of the week. Traffic intensity in the selected time intervals was analysed by k-means cluster analysis using the Statistical Package for the Social Sciences (SPSS). This is a non-hierarchical clustering method in which the number of clusters is known in advance (k = 24) and distances between cluster centres and objects are calculated with the squared Euclidean distance metric. A one-hour measurement series consists of up to one and a half million records. In total, 309 hours of highest, medium and lowest traffic were selected for further analysis.
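As a rough illustration of the clustering step, the sketch below runs Lloyd's k-means algorithm with squared Euclidean distance on scalar traffic intensities. It is not the SPSS procedure used in the study: the function names `nearest` and `kmeans_1d`, the deterministic choice of initial centres and the sample intensities are assumptions made for this example, and three clusters are used instead of k = 24 only to keep it short.

```python
def nearest(value, centres):
    """Index of the centre closest to value (squared Euclidean distance)."""
    return min(range(len(centres)), key=lambda j: (value - centres[j]) ** 2)


def kmeans_1d(values, k, iterations=100):
    """Cluster scalar values into k groups with Lloyd's algorithm."""
    ordered = sorted(values)
    # Assumed initialisation: k evenly spaced order statistics.
    centres = [ordered[(2 * i + 1) * len(ordered) // (2 * k)] for i in range(k)]
    for _ in range(iterations):
        groups = [[] for _ in range(k)]
        for value in values:
            groups[nearest(value, centres)].append(value)
        # An empty cluster keeps its previous centre.
        updated = [sum(g) / len(g) if g else centres[i] for i, g in enumerate(groups)]
        if updated == centres:
            break
        centres = updated
    labels = [nearest(value, centres) for value in values]
    return centres, labels


if __name__ == "__main__":
    # Hypothetical packets-per-hour counts, not measurements from the study.
    hourly = [1, 2, 3, 100, 101, 102, 1000, 1001, 1002]
    print(kmeans_1d(hourly, 3))
```

With well-separated intensities such as these, the three clusters correspond to hours of lowest, medium and highest traffic.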
To analyse such time series, they must be aggregated, i.e., the data traffic must be calculated over equal time intervals. Two aggregation methods were chosen: (1) moving-average smoothing, in which the average traffic of a data series is calculated over a chosen time interval $\Delta t$, with $t_k = k\Delta t + \tau_1$; the resulting time series characterises the average change of data traffic at time steps of $\Delta t$; (2) the amount of data traffic transferred within the time interval $\Delta t$. The time intervals $\Delta t \in \{100\ \mathrm{ms}, 500\ \mathrm{ms}, 1\ \mathrm{s}\}$ were chosen for forming the series. From the 309 selected measurement sequences, 6 groups of series were formed, 1854 series in total. Aggregated time series are marked according to network load as follows: $x_t^{\min}$ for the lowest, $x_t^{\delta}$ for the medium and $x_t^{\max}$ for the highest load. The programme Fractan 4.4 was used to evaluate the time series [10]. For each aggregated series, the programme calculates the Hurst coefficient and the fractal measure, plots the numerical values and draws the resulting attractors.
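The two aggregation methods described above can be sketched as follows. This is one possible reading, not the authors' implementation: the record layout `(timestamp_us, frame_bytes)`, the function name `aggregate`, starting the first interval at the first record, and reporting empty intervals as zero are all assumptions. Timestamps are kept as integer microseconds, matching the stated measurement precision, so that interval indices are exact.

```python
from collections import defaultdict


def aggregate(records, dt_us):
    """Aggregate time-ordered (timestamp_us, frame_bytes) records.

    Returns (means, totals): the mean frame length and the total number
    of bytes in each consecutive interval of dt_us microseconds.
    """
    if not records:
        return [], []
    start = records[0][0]  # assumed: the first record opens interval 0
    sums = defaultdict(int)
    counts = defaultdict(int)
    for timestamp, size in records:
        k = (timestamp - start) // dt_us  # interval index
        sums[k] += size
        counts[k] += 1
    n_intervals = max(sums) + 1
    totals = [sums[k] for k in range(n_intervals)]
    means = [sums[k] / counts[k] if counts[k] else 0.0 for k in range(n_intervals)]
    return means, totals


if __name__ == "__main__":
    # Hypothetical records, not data from the study.
    sample = [(0, 100), (50_000, 300), (120_000, 1518), (450_000, 64)]
    print(aggregate(sample, 100_000))  # 100 ms intervals
```

Calling the function with `dt_us` set to 100000, 500000 and 1000000 gives the three resolutions used in the study.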

Network self-similarity estimated by using Hurst statistics
Because the time series formed from the lengths of data frames transferred via the computer network are not normally distributed, this section investigates their Hurst statistics. The Hurst coefficient characterises whether the analysed series is random or has short-range (also called Markov) or long-range dependence. If the Hurst coefficient $H = 0.5$, the members of the sequence are random and each subsequent member does not depend on the previous ones; otherwise, previous events recorded in the time series have a lasting influence on further processes, and this influence is stronger the more recent the event. Such series are invariant with respect to time. The influence of the current process on future events is calculated by estimating its correlation [6,3]: $C = 2^{2H-1} - 1$, where $C$ is the correlation measure and $H$ is the Hurst coefficient. When evaluating the self-similarity of a time series, the interval in which the value of the Hurst coefficient falls is very important.
If $0 \le H < 0.5$, the process characterised by the time series is anti-persistent, i.e., if an increase is observed in one period, a decrease is likely to follow in the next, and the probability is higher the closer $H$ is to 0. In this case the correlation is negative and approaches $-0.5$ as $H$ approaches 0. Such series are usually highly changeable and consist of frequent increases and decreases.
If $0.5 < H < 1.0$, the process is persistent with long-term memory, i.e., if the process tended to increase in the past, it will retain this tendency in the future with a probability that is higher the closer $H$ is to 1, and the correlation approaches 1. Such series are usually called trend-resistant; as $H$ approaches 0.5, the amount of noise in the series increases.
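The relationship between the Hurst coefficient and the correlation measure can be checked numerically. The sketch below evaluates $C = 2^{2H-1} - 1$ and names the three regimes discussed above; the function names `correlation_measure` and `regime` are chosen for this example only.

```python
def correlation_measure(h):
    """Correlation measure C = 2^(2H - 1) - 1 for 0 <= H <= 1."""
    if not 0 <= h <= 1:
        raise ValueError("H must lie in [0, 1]")
    return 2 ** (2 * h - 1) - 1


def regime(h):
    """Classify a series by the interval in which its Hurst coefficient falls."""
    if h < 0.5:
        return "anti-persistent"
    if h > 0.5:
        return "persistent"
    return "random"


if __name__ == "__main__":
    for h in (0, 0.5, 1):
        print(h, correlation_measure(h), regime(h))
```

At the boundary values the measure equals $-0.5$ for $H = 0$, $0$ for $H = 0.5$ and $1$ for $H = 1$.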
For the formed and aggregated time series $x_t$ at the network node, the Hurst coefficient is calculated according to the formula $H = \log(R/S) / \log(n/2)$, where $H$ is the Hurst coefficient and $R/S$ is the rescaled range statistic obtained from the formula

$$R/S = \frac{\max_{1 \le \tau \le n} \sum_{i=1}^{\tau} (x_i - \bar{x}_t) - \min_{1 \le \tau \le n} \sum_{i=1}^{\tau} (x_i - \bar{x}_t)}{\sqrt{\frac{1}{n} \sum_{i=1}^{n} (x_i - \bar{x}_t)^2}},$$

here $1 \le \tau \le n$, where $n$ is the number of sequence members, $\bar{x}_t$ is the average value of the series $x_t$, and $\sum_{i=1}^{\tau} (x_i - \bar{x}_t)$ is the cumulative series describing the sum of changes up to time $\tau$. According to Hurst [12], the following expression holds for the majority of natural phenomena: $\mathrm{E}[R(n)/S(n)] \approx c n^{H}$ as $n \to \infty$, where $\mathrm{E}$ denotes the expected value and $c$ is a constant independent of $n$ [6]. The Hurst coefficient is closely related to the fractal measure $D$: the fractal measure characterises local features of computer network data traffic, whereas the Hurst coefficient describes a characteristic of the whole process, namely its memory. In self-similar processes, local features are reflected in global ones and vice versa; because the dimension of the time series is $N = 1$, the relation can be expressed by the formula $D = 2 - H$, where $D$ is the fractal measure, the so-called attractor dimension, and $H$ is the Hurst coefficient. To estimate the attractor dimension, we calculate the Hausdorff measure $D$, the so-called fractal measure, which is obtained by analysing the strange Lorenz attractor [2]:

$$D = \lim_{\varepsilon \to 0} \frac{\ln N(\varepsilon)}{\ln(1/\varepsilon)},$$

here $N(\varepsilon)$ is the minimum number of $n$-dimensional cubes with edge length $\varepsilon$ needed to cover the points of the set as the edge length approaches zero. We analyse the system for $1 < D < 2$, in which case the formula is

$$D = \frac{\ln N}{\ln(1/(2r))},$$

here $N$ is the number of elements used to measure the fractal's undulation ($N \to 2$ when the fractal lies in a plane) and $r$ is the radius of a circle used in 2-dimensional space. In computer networks, the fractal measure characterises the dynamics of change in the data traffic time series when one variable is used [13]. The distribution of the Hurst coefficient, with percentages, is shown in Table 1.
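A minimal sketch of the calculation described above is given here, assuming that the range is taken over the cumulative deviations from the mean and that $S$ is the population standard deviation of the series. It applies the single-window formula $H = \log(R/S) / \log(n/2)$ and the relation $D = 2 - H$; it does not reproduce Fractan's internal procedure, which is not described in the text. The function names and the sample series are illustrative.

```python
import math


def rescaled_range(x):
    """R/S statistic: range of cumulative deviations over the standard deviation."""
    n = len(x)
    mean = sum(x) / n
    cumulative = 0.0
    highest = lowest = None
    for value in x:
        cumulative += value - mean  # cumulative deviation up to this point
        highest = cumulative if highest is None else max(highest, cumulative)
        lowest = cumulative if lowest is None else min(lowest, cumulative)
    std = math.sqrt(sum((v - mean) ** 2 for v in x) / n)
    if std == 0:
        raise ValueError("R/S is undefined for a constant series")
    return (highest - lowest) / std


def hurst(x):
    """Hurst coefficient from H = log(R/S) / log(n/2)."""
    n = len(x)
    if n < 3:
        raise ValueError("at least 3 points are needed so that log(n/2) > 0")
    return math.log(rescaled_range(x)) / math.log(n / 2)


def fractal_measure(h):
    """Fractal measure of a one-dimensional time series, D = 2 - H."""
    return 2 - h


if __name__ == "__main__":
    series = [1, 2, 3, 4]  # toy series, not measured data
    h = hurst(series)
    print(h, fractal_measure(h))
```

A steadily increasing toy series such as this one yields $H > 0.5$, consistent with a persistent process.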
Table 1 shows that more than 80% of the time series have $0.5 < H < 1.0$, and they preserve this feature for $\Delta t \in \{100\ \mathrm{ms}, 500\ \mathrm{ms}, 1000\ \mathrm{ms}\}$ and for the different network loads $x_t^{\min}$, $x_t^{\delta}$, $x_t^{\max}$. The calculated values were also evaluated statistically: for every group of series, according to the time interval, the mean, median, standard deviation and 95% confidence interval were calculated. The results are presented in Table 2. The Hurst coefficient ranges from 0.61 to 0.79; thus, the process of data transfer via the computer network described by the aggregated series is a persistent process with long-term memory. The fractal measure $D$ ranges from 1.23 to 1.82, and the confidence intervals are very narrow; this indicates that the calculated values are reliable, and the fractional value of the fractal measure suggests that the series have the features of fractals. Summarising the research data of this section, we can state that the investigated time series characterise persistent processes with long-term memory.

Table 2. Estimates of the Hurst coefficient and fractal measure and of their reliability
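The summary statistics reported for each group of series can be computed along the following lines. The sketch assumes a normal approximation for the 95% confidence interval of the mean, which the text does not specify; the function name `summarise` and the sample values are hypothetical and are not taken from Table 2.

```python
import math
import statistics


def summarise(values, confidence=0.95):
    """Mean, median, sample standard deviation and confidence interval of the mean."""
    n = len(values)
    mean = statistics.mean(values)
    std = statistics.stdev(values)
    # Assumed: normal approximation, reasonable for groups of a few hundred series.
    z = statistics.NormalDist().inv_cdf(0.5 + confidence / 2)
    half_width = z * std / math.sqrt(n)
    return {
        "mean": mean,
        "median": statistics.median(values),
        "std": std,
        "ci": (mean - half_width, mean + half_width),
    }


if __name__ == "__main__":
    # Hypothetical Hurst coefficients for one group of series.
    print(summarise([0.6, 0.7, 0.8]))
```

The interval narrows in proportion to the square root of the number of series, which is why groups of several hundred series give narrow intervals.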

Conclusions
1. Estimates of the reliability of the Hurst coefficient showed that the aggregated series describe a persistent process with long-term memory. This was confirmed by analysis of the Hurst coefficient charts.
2. The data traffic of the Šiauliai University LitNet network node is self-similar with long-term memory.