Zeek: New IRC Feature Extractor Package

This blogpost was authored by Ondrej Prenek (@ondraprenek) on February 13th, 2020

Introduction

Zeek [1] (formerly named Bro) is an open-source network security platform that supports a wide range of traffic analysis tasks, even outside of the security domain. The Zeek Package Manager [2] allows users to extend Zeek functionality by installing third-party scripts and plugins.

To facilitate the analysis of IRC connections, we created an IRC Feature Extractor Zeek Package that automatically recognizes IRC connections in a packet capture (pcap) file and extracts features from them.

The goal for the feature extraction is to describe individual IRC connections that occur in the pcap file as accurately as possible. The package was created during our research in the Aposemat project [3], a joint project between Avast Software and the Czech Technical University (CVUT), where we proposed a technique for detecting malicious IRC connections in the network. The package is used to separate individual IRC connections and generate a set of features that were used as a data source for the malware detection system.

In the following sections we will describe what an IRC connection is, how the data are preprocessed, which features are extracted, and how the features are computed. Then we show an example of how to install the package and how to use it.

Data Preprocessing

Once the data were obtained from the network traffic capture, we processed them to extract the features. We separated the whole pcap into IRC connections for each individual user. In our research, we define an IRC connection as a flow identified by the source IP address, the destination IP address, and the destination port. The source port is deliberately ignored in this separation so that multiple TCP connections can belong to a single IRC connection: when a new TCP connection is established between two IP addresses, the source port is randomly chosen from the unregistered port range, which is why the source port differs between TCP connections. This is shown in Figure 1, where there are two connections from the source IP address (192.168.0.1) to the same destination IP address (192.168.0.2) using different source ports.

Figure 1. Example of IRC connection that is defined by source IP address 192.168.0.1, destination IP address 192.168.0.2, and destination port 440. Source port is neglected, and therefore one IRC connection can have multiple source ports. The IP addresses and ports were randomly chosen for demonstration purposes
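
As a rough illustration of this grouping (not the package's actual code), the following Python sketch collects message records under a key made of the source IP, destination IP, and destination port. The record layout and the function name `group_irc_connections` are assumptions made for this example.

```python
from collections import defaultdict


def group_irc_connections(messages):
    """Group message records by (source IP, destination IP, destination port).

    The source port is left out of the key on purpose, so several TCP
    connections between the same endpoints fall into one IRC connection.
    """
    connections = defaultdict(list)
    for msg in messages:
        key = (msg["src_ip"], msg["dst_ip"], msg["dst_port"])
        connections[key].append(msg)
    return dict(connections)


# Hypothetical records mirroring Figure 1: two TCP connections that
# differ only in their source port.
messages = [
    {"src_ip": "192.168.0.1", "src_port": 50001, "dst_ip": "192.168.0.2", "dst_port": 440},
    {"src_ip": "192.168.0.1", "src_port": 50002, "dst_ip": "192.168.0.2", "dst_port": 440},
]
grouped = group_irc_connections(messages)
```

Both records end up under the single key ("192.168.0.1", "192.168.0.2", 440), which is the behaviour described for Figure 1.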

Extracted Features

Here, we describe the complete list of features that the package extracts for each IRC connection obtained from a pcap file. The features were chosen manually to give us a meaningful representation of the IRC connection, biased towards the malware detection problem we were trying to solve.

Total Packet Size

Total size, in bytes, of all packets that were sent in the IRC connection. It reflects how many messages were sent and how long they were.

Session Duration

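Duration of the IRC connection, i.e., the difference between the time of the last message and the time of the first message in the IRC connection. In the example log shown later in this post, the duration column equals end_time minus start_time and is therefore expressed in seconds.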

Number of Messages

The total number of messages in a given IRC connection.

Number of Source Ports

As we have mentioned before, the source port is ignored when unifying communication into IRC connections because it is randomly chosen when a TCP connection is established. This feature counts the distinct source ports seen in an IRC connection. We suppose that artificial users could use a higher number of source ports than real users, since the artificial users established more connections than the real users.
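
The four features described so far are simple aggregates. A minimal sketch, assuming each message record carries a timestamp in seconds (`ts`), a size in bytes (`size`), and a source port (`src_port`), could look as follows. These record keys are hypothetical; the output keys reuse the column names from the package's log.

```python
def basic_features(messages):
    """Compute the four aggregate features for one IRC connection."""
    timestamps = [m["ts"] for m in messages]
    return {
        # Sum of the sizes of everything sent in the connection.
        "size_total": sum(m["size"] for m in messages),
        # Time of the last message minus time of the first message.
        "duration": max(timestamps) - min(timestamps),
        "msg_count": len(messages),
        # Number of distinct source ports seen in the connection.
        "src_ports_count": len({m["src_port"] for m in messages}),
    }


# Hypothetical messages of a single IRC connection.
example = [
    {"ts": 10.0, "size": 60, "src_port": 50001},
    {"ts": 12.5, "size": 80, "src_port": 50001},
    {"ts": 20.0, "size": 100, "src_port": 50002},
]
features = basic_features(example)
```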

Message Periodicity

We suppose that artificial users (e.g., bots that are controlled by a botnet master) use IRC for sending commands periodically, so we wanted to measure that periodicity. To do that, we created a method that returns a number between 0 and 1: one if the message sequence is perfectly periodic, and zero if the message sequence is not periodic at all.

If you want to know more details about the computation of this feature, continue reading; otherwise, you can skip the rest of the description and continue to the next feature.

There are four stages in extracting the message periodicity feature: computing time differences between messages, applying the Fast Fourier Transform, splitting messages into boxes by the most significant period, and computing the Normalised Mean Squared Error. We will go through all the stages in detail:

1. Compute time differences between messages

For each message, we store a set of attributes - one of them is the time when the message was sent. We use that attribute to compute time differences between chronologically-sorted messages.
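
A minimal sketch of this first stage, assuming the timestamps are plain numbers in seconds (the function name is illustrative):

```python
def time_differences(timestamps):
    """Return the gaps between consecutive messages, in chronological order."""
    ordered = sorted(timestamps)
    return [later - earlier for earlier, later in zip(ordered, ordered[1:])]


# Unsorted hypothetical timestamps; sorting happens inside the function.
diffs = time_differences([5.0, 1.0, 3.0, 10.0])
```

A connection with n messages yields n - 1 time differences, so at least two messages are needed to obtain any value.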

2. Apply Fast Fourier Transform

We apply a fast Fourier transform (FFT) to the computed sequence of time differences between messages. The FFT is an efficient algorithm for computing the discrete Fourier transform, which we use to express a time sequence as a sum of periodic components and to recover the signal from those components. The output of the FFT is a sequence of numbers. The larger the number at a given position of the output, the larger the amplitude of the corresponding periodic component, and thus the more significant its influence on the periodicity of the data.
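
To make the transform concrete, the sketch below computes the magnitudes of the discrete Fourier transform directly from its definition, using only the Python standard library. This is a naive O(n^2) version written for illustration; an FFT routine produces the same values faster. The function name is an assumption of this example.

```python
import cmath


def dft_magnitudes(values):
    """Magnitude of each discrete Fourier transform coefficient of `values`."""
    n = len(values)
    magnitudes = []
    for k in range(n):
        # Coefficient k correlates the sequence with a wave that completes
        # k cycles over the n samples.
        coefficient = sum(
            x * cmath.exp(-2j * cmath.pi * k * t / n)
            for t, x in enumerate(values)
        )
        magnitudes.append(abs(coefficient))
    return magnitudes


# An alternating hypothetical sequence of time differences.
spectrum = dft_magnitudes([1.0, 0.0, 1.0, 0.0])
```

For this alternating input the magnitudes are approximately [2, 0, 2, 0]. Position 0 is always the plain sum of the values, and the peak at position 2 reflects the pattern that repeats every two samples.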

3. Split messages into boxes by most significant period

The position of the largest element in the FFT output represents the length of the most significant period. To assess the quality of that period, we split the data into boxes of that length.

4. Compute Normalised Mean Squared Error (NMSE)

From the messages split into boxes, we compute the normalized mean squared error (NMSE), from which we obtain a number in the interval between 0 and 1, where 1 represents perfectly periodic messages and 0 represents messages that are not periodic at all.
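
The text above does not spell out the exact NMSE formula, so the following is only one possible reading of stages 3 and 4, not the package's implementation. It assumes that the period length is already known, that each box is compared with the element-wise mean of all boxes, that the error is normalised by the mean squared value of the data, and that the score is one minus that normalised error. All names are hypothetical.

```python
def periodicity_score(diffs, period):
    """Score in [0, 1]; 1 means all boxes of length `period` are identical."""
    if period < 1:
        raise ValueError("period must be a positive integer")
    # Keep only as many values as fit into whole boxes.
    usable = (len(diffs) // period) * period
    boxes = [diffs[i:i + period] for i in range(0, usable, period)]
    if len(boxes) < 2:
        return 0.0  # assumed convention: too little data to compare boxes
    # Element-wise mean of the boxes, i.e. the "typical" period.
    mean_box = [sum(column) / len(boxes) for column in zip(*boxes)]
    # Mean squared error of every box against the typical period.
    mse = sum(
        (x - m) ** 2 for box in boxes for x, m in zip(box, mean_box)
    ) / usable
    # Normalise by the mean squared value of the data.
    power = sum(x * x for x in diffs[:usable]) / usable
    if power == 0:
        return 1.0  # assumed convention: all gaps are zero
    return 1.0 - min(mse / power, 1.0)


periodic = periodicity_score([1.0, 5.0, 1.0, 5.0, 1.0, 5.0], 2)
irregular = periodicity_score([1.0, 5.0, 2.0, 9.0, 7.0, 1.0], 2)
```

Under these assumptions, a sequence that repeats exactly every `period` values scores 1, and the score decreases as the boxes differ more from each other.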

The described process of extracting message periodicity feature is illustrated in Figure 2.

Figure 2. Illustration of how message periodicity is computed. The time differences between messages and FFT output numbers are chosen randomly for demonstration purposes.

Message Word Entropy

To capture whether the user sends the same message multiple times in a row, or whether the messages contain only a limited number of words, we compute the word entropy across all of the messages in the IRC connection. By the term word entropy we mean a measure of the uncertainty of the words in the messages.

To compute the word entropy, we use the formula below:

formula_entropy.gif

where n represents the number of words, and p_i represents the probability that the word i will be used among all other words.
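
Since the formula image is not reproduced here, the sketch below assumes the standard Shannon entropy, H = -sum(p_i * log2(p_i)), taken over the distinct words. The base-2 logarithm and the whitespace tokenisation are assumptions of this example and may differ from the package.

```python
import math
from collections import Counter


def word_entropy(messages):
    """Shannon entropy (base 2, assumed) of the words across all messages."""
    # Assumed tokenisation: split every message on whitespace.
    words = [word for message in messages for word in message.split()]
    if not words:
        return 0.0
    total = len(words)
    probabilities = [count / total for count in Counter(words).values()]
    return sum(-p * math.log2(p) for p in probabilities)


repeated = word_entropy(["spam spam", "spam"])
varied = word_entropy(["a b", "c d"])
```

A connection that repeats one word has entropy 0, while four equally frequent words give an entropy of 2 bits, so a lower value indicates more repetitive messages.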

Username Special Characters Mean

We want to determine whether the username of the user in the IRC communication is randomly generated or not. Therefore, in this feature, we compute the proportion of non-alphabetic characters in the username. We match non-alphabetic characters with the following regex.

 
irc-rgx.png
 

Then we count matched characters and divide them by the total number of username characters.
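
The exact regex is only available as an image, so the sketch below assumes the pattern `[^A-Za-z]`, i.e., every character that is not an ASCII letter. Under that assumption, the source string from the example log later in this post gives 4/36, which is consistent with the logged value 0.111111. This agreement does not prove that the package uses the same pattern.

```python
import re

# Assumed pattern: any character that is not an ASCII letter.
NON_ALPHA = re.compile(r"[^A-Za-z]")


def special_chars_ratio(text):
    """Share of characters in `text` matched by the assumed pattern."""
    if not text:
        return 0.0
    return len(NON_ALPHA.findall(text)) / len(text)


# 36 characters, of which "!", "@", ".", "." are non-alphabetic.
ratio = special_chars_ratio("AmpAttacks!AmpAttacks@Summit.gov.GoV")
```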

Message Special Characters Mean

If an artificial user sends many commands, its messages will most likely contain many more special characters than the messages of an ordinary user. With this feature, we obtain the average usage of non-alphabetic characters across all messages in the IRC connection. We apply the same procedure of matching special characters to each message as in the previous case: we match non-alphabetic characters by regex, and then we divide the number of matched characters by the total number of message characters. Finally, we compute the average of the values obtained for the individual messages.
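
A short sketch of this per-message averaging, again assuming the pattern `[^A-Za-z]` in place of the regex from the image (the function name is illustrative):

```python
import re

# Assumed pattern: any character that is not an ASCII letter.
# Note that spaces and digits count as special characters under it.
NON_ALPHA = re.compile(r"[^A-Za-z]")


def message_special_chars_mean(messages):
    """Average, over all non-empty messages, of the special-character share."""
    ratios = [
        len(NON_ALPHA.findall(message)) / len(message)
        for message in messages
        if message
    ]
    return sum(ratios) / len(ratios) if ratios else 0.0


# First message has no special characters, second has 2 out of 4.
mean_ratio = message_special_chars_mean(["abcd", "a!b?"])
```

Averaging the per-message ratios, rather than pooling all characters, gives short and long messages the same weight.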

We have now explained all eight features that this new package automatically extracts. Next, we will show an example of how to put the package into practice.

Example

To demonstrate how the package works, we will run it on a capture from a scenario of our recently released IoT23 Dataset [4], CTU-IoT-Malware-Capture-34-1 [5]. First, we will go through the package installation process, then we will run the package on the selected pcap file, and finally we will describe the output log.

First, download 2018-12-21-15-50-14-192.168.1.195.irc.pcap.

Installation

To install the package, run the following command in a terminal:

$ zkg install IRC-Zeek-package

Run

To extract the IRC features on the selected pcap file that contains IRC, run the following command:

$ zeek IRC-Zeek-package -r 2018-12-21-15-50-14-192.168.1.195.irc.pcap

The output will be stored in the irc_features.log file in the Zeek log format. The log will look like this:

#separator \x09
#set_separator  ,
#empty_field    (empty)
#unset_field    -
#path   irc_features
#open   2020-01-31-12-29-30
#fields src src_ip  src_ports_count dst dst_ip  dst_port    start_time  end_time    duration    msg_count   size_total  periodicity spec_chars_username_mean    spec_chars_msg_mean msg_word_entropy
#types  string  addr    count   string  addr    port    time    time    double  count   int double  double  double  double
AmpAttacks!AmpAttacks@Summit.gov.GoV    192.168.1.195   3   ##Summit    185.244.25.235  6667    1545404139.541643   1545478074.769545   73935.227902    153 251637  0.006601    0.111111    0.299652    5.254209
#close  2020-01-31-12-29-30

Every line starting with # consists of a line descriptor followed by the content described by that descriptor. Lines 1-5 describe predefined values that determine the structure of the log. Line 6 indicates the time when the package started the evaluation, and line 10 the time when it ended. Line 7 contains the names of the extracted features, line 8 contains the data type of each feature, and line 9 contains the feature values.
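
For further processing, a log with this structure can be read with a few lines of Python. The sketch below is a simplified reader written for this post, not part of Zeek or of the package. It assumes that the actual file separates fields with the character named in the #separator line (a tab, \x09), keeps every value as a string, and treats any line starting with # as a header line.

```python
def parse_zeek_log(text):
    """Parse the text of a Zeek-style log into a list of dicts (one per row)."""
    separator = "\t"
    fields = []
    rows = []
    for line in text.splitlines():
        if line.startswith("#separator"):
            # The separator is written as an escape such as \x09 after a space.
            raw = line.split(" ", 1)[1].strip()
            separator = raw.encode("ascii").decode("unicode_escape")
        elif line.startswith("#fields"):
            # Drop the "#fields" descriptor, keep the column names.
            fields = line.split(separator)[1:]
        elif line.startswith("#") or not line:
            continue  # other header lines (#types, #open, #close, ...)
        else:
            rows.append(dict(zip(fields, line.split(separator))))
    return rows


# A shortened log using three of the columns from the example above.
sample = "\n".join([
    "#separator \\x09",
    "#fields\tsrc_ip\tdst_port\tmsg_count",
    "#types\taddr\tport\tcount",
    "192.168.1.195\t6667\t153",
    "#close\t2020-01-31-12-29-30",
])
rows = parse_zeek_log(sample)
```

Each row becomes a dictionary keyed by the names from the #fields line, so a value can be looked up as rows[0]["msg_count"]. Converting values to numbers according to the #types line is left out of this sketch.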

Conclusion

The IRC Feature Extractor Zeek Package extends the functionality of the Zeek network analysis framework to the analysis of IRC connections. This package automatically recognizes IRC connections in a pcap file and extracts features from them.

In this blogpost, we introduced our newly created Zeek Package: we described how the data were preprocessed, which features were extracted, and how they were computed. Then we demonstrated the functionality of the package on an example and described the structure of the output log.

Download

Link to the package: https://github.com/stratosphereips/IRC-Zeek-package/

References

[1] The Zeek Network Security Monitor: https://www.zeek.org

[2] Zeek Package Manager: https://packages.zeek.org/

[3] Aposemat Project: https://www.stratosphereips.org/aposemat

[4] IoT23 Dataset: https://www.stratosphereips.org/blog/2020/1/22/aposemat-iot-23-a-labeled-dataset-with-malicious-and-benign-iot-network-traffic

[5] IoT23 Dataset - CTU Malware Capture 34-1: https://mcfp.felk.cvut.cz/publicDatasets/IoT-23-Dataset/IndividualScenarios/CTU-IoT-Malware-Capture-34-1

Contact

If you have any further questions, don’t hesitate to contact us! aposemat@aic.fel.cvut.cz