Entropy and network traffic anomaly detection





In this article, Daniil Volkov, a leading expert in the Data Science business area at Neoflex, talks about methods for detecting network anomalies based on entropy, the central characteristic of systems from the point of view of information theory. He also touches on some methods for detecting anomalies in time series.



Over the past decade, a large amount of research has focused on entropy as a measure of the "chaos" inherent in various characteristics of network traffic. In particular, entropy time series have proven to be a relatively good tool for detecting anomalies in network traffic. The pluses of this approach include:



  • Compactness. Full packet captures are not required; aggregated flow data (e.g., NetFlow) is sufficient.
  • Sensitivity to structure. Entropy reacts to changes in the distribution of traffic characteristics, not only to changes in volume metrics such as the packet rate (rps).
  • Generality. The approach does not rely on attack signatures, so it can also detect previously unknown (zero-day) attacks.




So, the proposed anomaly detection systems based on the concept of "entropy" analyze network flows rather than individual network packets. Let's define network flows (hereinafter simply flows) as unidirectional meta-information about network packets that share source and destination IP addresses and ports, as well as the IP protocol type. It is important to note that all network activity at OSI layer 3 and above reduces to flows, i.e., these are not only TCP connections but also stateless protocols such as UDP and ICMP (see the sketch after the list below). Benefits of using the concept of flows:



  • Flows are compact and easy to store, which simplifies analysis;
  • They cause fewer privacy and personal data concerns;
  • The necessary information is easy to obtain on the network, for example in the form of Cisco NetFlow, sFlow, or even IPFIX (to your liking).
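To make the notion of a flow concrete, here is a minimal sketch of grouping packet metadata into flows by the 5-tuple key (the record layout and all field values are illustrative assumptions, not a NetFlow parser):

```python
from collections import defaultdict

# Hypothetical packet records: (src_ip, dst_ip, src_port, dst_port, proto, bytes)
packets = [
    ("10.0.0.1", "93.184.216.34", 51234, 80, "TCP", 1500),
    ("10.0.0.1", "93.184.216.34", 51234, 80, "TCP", 400),
    ("10.0.0.2", "8.8.8.8", 40000, 53, "UDP", 80),
]

# A flow is keyed by the unidirectional 5-tuple; only meta-information is kept.
flows = defaultdict(lambda: {"packets": 0, "bytes": 0})
for src, dst, sport, dport, proto, size in packets:
    key = (src, dst, sport, dport, proto)
    flows[key]["packets"] += 1
    flows[key]["bytes"] += size

for key, stats in flows.items():
    print(key, stats)
```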


Entropy



The concepts of entropy in statistical physics and information theory are quite close to each other. It will also matter to us, when using entropy as a measure of the distribution of traffic characteristics, that entropy serves as a measure of how close a system is to equilibrium (in the theory of nonequilibrium processes, to which the exchange of network traffic can also be attributed). As many remember, the classical Shannon entropy is defined as:







$$H = -\sum_{i=1}^{n} p_i \log p_i,$$









where $p_i$ is the probability of the $i$-th state of the system, and $n$ is the number of all possible states of the system. To make the result easier to interpret, and to exclude the influence of seasonal factors that change $n$, in what follows we will use the normalized Shannon entropy:







$$H_0 = \frac{H}{\log n}, \qquad H_0 \in [0, 1].$$









Network traffic attributes and entropy time series



Now it remains to explain exactly how we will calculate $H_0$ for the various characteristics of network traffic, and most importantly, for which ones. Different authors suggest a wide variety of characteristics, but the following basic set appears in almost all works:



  • Source IP
  • Destination IP
  • Source port
  • Destination port.


It is sometimes suggested to extend this set with other features, such as flow records, or IP addresses for backbone flows.



We will use time series of $H_0$ computed for these traffic characteristics within time windows of finite length. Typical window lengths (overlapping, sliding windows can be used) are about 5-10 minutes, given that a typical attack on a network infrastructure lasts tens of minutes; a window of this size also accumulates a sufficiently large amount of statistics. So, if we are interested in the entropy of the source IP, then $n$ equals the number of unique IP addresses in the window. As for the probabilities, the overwhelming majority of authors use the number of packets with a given attribute value as the measure of the probability of such a packet in the network (which is generally logical, although the number of bytes or the number of flows can be used instead). For example, if we had 100 packets for 1.1.1.1, 100 packets for 1.1.1.2, and 300 packets for 2.2.2.2, then (taking logarithms base 2):







$$p_{1.1.1.1} = p_{1.1.1.2} = \frac{100}{100 + 100 + 300} = 0.2,$$

$$p_{2.2.2.2} = \frac{300}{500} = 0.6,$$

$$H = -(0.2 \log_2 0.2 + 0.2 \log_2 0.2 + 0.6 \log_2 0.6) \approx 1.37,$$

$$H_0 = \frac{1.37}{\log_2 3} \approx 0.86.$$
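The same computation in code: a minimal sketch (the function name and the choice of base-2 logarithms are ours) that reproduces the worked example above:

```python
import math
from collections import Counter

def normalized_entropy(values, base=2.0):
    """Normalized Shannon entropy H0 = H / log(n) of a sequence of
    attribute values (e.g., source IPs), where each occurrence counts
    as one packet. Returns 0.0 when there is a single unique value."""
    counts = Counter(values)
    n = len(counts)
    if n < 2:
        return 0.0
    total = sum(counts.values())
    h = -sum((c / total) * math.log(c / total, base) for c in counts.values())
    return h / math.log(n, base)

# The worked example: 100 packets from 1.1.1.1, 100 from 1.1.1.2,
# 300 from 2.2.2.2 -> H0 ~= 0.86
packets = ["1.1.1.1"] * 100 + ["1.1.1.2"] * 100 + ["2.2.2.2"] * 300
print(round(normalized_entropy(packets), 2))  # 0.86
```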









Next, we will discuss attributes and when it makes sense to consider them for the attack types of interest.



In most of the studies related to entropy and network anomalies, researchers have focused on the source IP address and, accordingly, its distribution and entropy. It's a pretty good choice.



In [Sharma et al. (2015)] the authors experimented with different types of attacks and analyzed how useful various attributes are for detecting them with an entropy approach. Specifically, the NUST dataset was used, with three attack types: TCP SYN flood, UDP flood, and Smurf. About 100 thousand packets of normal traffic and 10 thousand packets of attack traffic were taken for the analysis. The attributes were the standard Source IP, Destination IP, Source Port, and Destination Port, but Flags (distribution over TCP flags), Length (distribution over packet lengths), and Protocol (distribution over protocols) were considered as well.



As a result, it turned out that entropy-based anomaly detection systems benefit from the aforementioned additions to the standard attributes, especially packet length (which gave very good results for TCP SYN attacks). Entropy over protocols is only moderately useful: it brought significant results only in specific cases such as UDP floods, and such attacks are easily detected by entirely traditional traffic monitoring methods.



An algorithm for detecting network anomalies based on entropy time series



This method is a generalization of the one proposed by the authors of [Winter (2011)]. The generalizations mainly concern which attributes are used to build the system of time series, as well as the methods for detecting anomalies in the individual time series.



Algorithm



  1. Choose the traffic attributes over which the entropy time series will be built. [Winter (2011)] uses five: Src/Dst IP, Src/Dst Port, and flow records (1 flow record = Src IP + Dst IP + Src Port + Dst Port + IP Protocol);
  2. For each chosen attribute, compute the normalized entropy $H_0$ over consecutive time windows; denote the resulting set of base time series by $T$;
  3. Maintain (over a sliding window) the standard deviation $\sigma_i$ of the $i$-th time series from $T$;
  4. The main idea behind detecting sudden changes is to continually make short-term forecasts and measure the difference between the forecast and the actual value. Winter (2011) suggests simple exponential smoothing, but something more complex (and accurate) can be used, from ARIMA to LSTM networks.


For each $T_i$ we determine the forecast error at the moment of interest $t$:







$$\delta_i = \left| \mathrm{Pred}(T_i(t)) - T_i(t) \right|.$$









5. However, individual prediction errors are not equally significant, because the base time series have different variances. For this reason, we normalize the prediction errors with respect to the variance of the corresponding time series, multiplying by the weighting factor:







$$w_i = 1 - \frac{\sigma_i}{\max(\sigma_1, \ldots, \sigma_n)}.$$







6. To simplify the process of detecting anomalies, we introduce the aggregate characteristic "anomaly score":







$$AS = \sum_{i=1}^{n} \delta_i w_i.$$









7. If $AS > AS_{thr}$, we say that a network traffic anomaly has been detected. The threshold value $AS_{thr}$ is determined empirically, depending on the number of base time series $n$ and on the required sensitivity of the detector; a code sketch of steps 4-7 is given below.
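Putting steps 4-7 together, here is a minimal sketch in Python (the parameter values, the array layout, and the simple exponential-smoothing forecaster are illustrative assumptions; any forecasting model can be substituted):

```python
import numpy as np

def exp_smooth_forecast(x, alpha=0.5):
    """One-step-ahead forecast for the last point of x, fitted by
    simple exponential smoothing on all points except the last."""
    s = x[0]
    for v in x[1:-1]:
        s = alpha * v + (1 - alpha) * s
    return s

def anomaly_score(series, alpha=0.5, var_window=288):
    """Anomaly score AS for the latest time window.

    series: 2-D array, one row per base entropy time series T_i
            (e.g., src IP, dst IP, src port, dst port, flow records);
            the last column is the current observation.
    var_window: sliding-window length for the std estimate
            (288 five-minute windows ~= 24 hours).
    """
    series = np.asarray(series, dtype=float)
    # Step 4: forecast errors delta_i = |Pred(T_i(t)) - T_i(t)|
    deltas = np.array([abs(exp_smooth_forecast(row, alpha) - row[-1])
                       for row in series])
    # Step 5: weights w_i = 1 - sigma_i / max(sigma_1, ..., sigma_n)
    sigmas = series[:, -var_window:].std(axis=1)
    max_sigma = float(sigmas.max())
    if max_sigma == 0.0:
        return 0.0  # all series are flat in the window; nothing to weight
    weights = 1.0 - sigmas / max_sigma  # the noisiest series gets weight 0
    # Step 6: aggregate anomaly score
    return float((deltas * weights).sum())

# Step 7: an anomaly is flagged when anomaly_score(...) > AS_thr,
# with AS_thr chosen empirically.
```

In a real deployment, each call would receive the entropy series maintained by step 2, updated once per window.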



About setting parameters



For our algorithm, we need to set several parameters. Each combination of parameters involves a trade-off: a high detection rate at the cost of many false positives, or vice versa.



  • The length of the time window for computing entropy. Too short a window leaves too few flows for reliable statistics, while too long a window smooths anomalies out and delays their detection; as noted above, 5 to 10 minutes is a reasonable choice;
  • The smoothing factor of the exponential smoothing. It determines how much weight the forecast gives to the most recent observations; a value of about 80% was used;
  • The size of the sliding window for calculating variances. We decided to set the size of the sliding window for calculating the standard deviation to 24 hours: most often such a window covers an entire seasonal cycle. Otherwise, we recommend choosing windows that are multiples of 24 hours; an alternative natural choice would be 7 days.


Further modifications



Bhuyan et al. (2014) investigated a set of information metrics for their suitability for detecting anomalies with similar methods. In addition to the well-known Shannon entropy, the Hartley entropy, the Rényi entropy, and the generalized Rényi entropy were tested. These are generalizations of the Shannon entropy, defined by the following formula:







$$H_\alpha(x) = \frac{1}{1 - \alpha} \log \sum_{i=1}^{n} p_i^{\alpha}.$$









Here $\alpha \ge 0$, $\alpha \ne 1$, $p_i \ge 0$. As an exercise, the reader may verify that as $\alpha \to 1$, $H_\alpha$ tends to the Shannon entropy (L'Hôpital's rule helps). Among other things, the authors used the widespread assumption that malicious traffic obeys a Poisson distribution while legitimate traffic obeys a Gaussian distribution.
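A quick numerical check of this exercise: a sketch (the base-2 logarithm and the probe values of $\alpha$ are our choices) showing $H_\alpha$ approaching the Shannon entropy as $\alpha \to 1$:

```python
import numpy as np

def renyi_entropy(p, alpha):
    """Generalized Renyi entropy H_alpha (alpha >= 0, alpha != 1), base 2."""
    p = np.asarray(p, dtype=float)
    return np.log2(np.sum(p ** alpha)) / (1.0 - alpha)

def shannon_entropy(p):
    p = np.asarray(p, dtype=float)
    return -np.sum(p * np.log2(p))

p = [0.2, 0.2, 0.6]  # the distribution from the worked example above
for a in (0.5, 0.9, 0.99, 0.999):
    print(a, renyi_entropy(p, a))
print("Shannon:", shannon_entropy(p))  # the alpha -> 1 limit, ~1.371
```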



The authors come to the following conclusions:



  • it is important to use a minimal number of traffic attributes; most often these will be IP addresses or packet lengths;
  • generalized entropy gives the best results in detecting low-rate attacks for large values of $\alpha$.


Results



Since Winter (2011) had no real anomalies to validate against, they modeled and injected synthetic anomalies into the original dataset. For this they used a modified version of the FLAME tool by Brauckhoff et al. (2008), which facilitates the injection of artificial anomalies into flow data. The authors implemented two such "flow generators" (a toy illustration of the idea follows the list):



  • HTTP flood: models an HTTP-based DDoS attack in which roughly 11,500 "bots" flood a web server with HTTP requests. In total, the generator produced about 220,000 flows (anomaly C below);
  • Network scan: models a horizontal scan of a /16 network, i.e., 65,534 IP addresses, on TCP port 21 (FTP). In total, about 67,000 flows were generated (anomaly D below).
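As a toy illustration of why such injected anomalies are visible to entropy (this is not the FLAME tool; all addresses and counts are made up), the following sketch adds a synthetic "HTTP flood" to a window of flows, reusing the normalized_entropy function defined earlier in the article:

```python
import random

random.seed(0)

# Baseline window: a mix of flows between random hosts (values are made up).
baseline = [(f"10.0.{random.randint(0, 3)}.{random.randint(1, 50)}",
             f"172.16.0.{random.randint(1, 50)}") for _ in range(2000)]

# Flood: many "bots" hammer a single web server.
flood = [(f"198.51.{random.randint(0, 255)}.{random.randint(1, 254)}",
          "203.0.113.10") for _ in range(3000)]

for name, window in (("baseline", baseline), ("baseline+flood", baseline + flood)):
    srcs = [s for s, _ in window]
    dsts = [d for _, d in window]
    # normalized_entropy() is defined in the sketch earlier in the article
    print(name,
          "H0(src IP) =", round(normalized_entropy(srcs), 2),
          "H0(dst IP) =", round(normalized_entropy(dsts), 2))
```

The flood drives the destination-IP entropy down (traffic concentrates on one server) while the source-IP entropy rises (many new "bots" appear), which is exactly the kind of distribution shift the detector reacts to.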


The figure below shows the entropy analysis for the entire day of October 28th. The top chart is the original dataset, while the bottom chart contains our added anomalies (C and D). Both charts contain time series for all five attributes; the black time series of the anomaly score AS is the most interesting one. In both charts we have highlighted particularly strong anomalies and labeled them A through D. A and B represent "natural" anomalies that were already present in the original dataset.







To illustrate how all this can go unnoticed by traditional monitoring tools, the figure below shows three popular traffic statistics, bytes, packets, and flows per minute, for the same traffic as in the previous example. The anomalies (labeled A through D) disappear into the noise.







A similar experiment was carried out at Neoflex, except that the anomalies were "natural", i.e., already present in the customer data.











For comparison: standard network traffic dashboards for routers over the same period:







We can see that rather serious deviations in the distribution of network traffic characteristics remained invisible to the standard metrics. How to identify the causes of these anomalies from the collected information deserves a separate conversation. Let us only mention that finding the group of attribute values that significantly changed the entropy reduces to studying the dynamics of the distribution of those very attributes, and in simple cases it can be done by simply sorting the values by their probabilities $p_i$.



Summing up



So, we have implemented and tested an algorithm for detecting a wide class of network anomalies through entropy time series. Attacks such as floods, worms, and scans often lead to such anomalies. The main idea of the algorithm is to constantly make short-term forecasts and measure the difference between the forecasts and the actually observed value of entropy. Entropy serves as an indicator of the balance of the process, so abrupt changes in entropy indicate a qualitative change in the structure of the system (through changes in the distribution of its characteristics). It is important to note that attacks must still reach a certain scale in order to be detected: very small-scale attacks can be invisible on a high-speed network link. We believe that algorithms of this kind are a valuable tool for network operators and the information security departments of enterprises of various kinds. Such a detector is easy to configure, quick to deploy, and requires no training data.



Information sources



Philipp Winter, Harald Lampesberger, Markus Zeilinger, Eckehard Hermann (2011). "On Detecting Abrupt Changes in Network Entropy Time Series"

Sidharth Sharma, Santosh Kumar Sahu, Sanjay Kumar Jena (2015). "On Selection of Attributes for Entropy Based Detection of DDoS"

Monowar H. Bhuyan, D. K. Bhattacharyya, J. K. Kalita (2014). "Information Metrics for Low-rate DDoS Attack Detection: A Comparative Evaluation"

Brauckhoff, D., Wagner, A., May, M. (2008). "FLAME: A Flow-Level Anomaly Modeling Engine". In: Proc. of the Conference on Cyber Security Experimentation and Test, pp. 1-6. USENIX Association, Berkeley, CA, USA


