The World of Malicious IPs: Creating Blocklists from Honeypot Traffic

A honeypot network is a security mechanism to detect and deflect potential cyber-attacks. It works by creating a decoy system that appears to be a valuable target for attackers. The honeypot is designed to lure attackers into interacting with it so that security researchers can monitor their activities and learn more about their tactics. By nature, the honeypots are hidden and do not form part of any production system. As they do not receive legitimate connections, all the interactions with the honeypots can be considered attacks.

In our laboratory, we study these attacks, and one objective of this work is to generate threat intelligence feeds (TI feeds), particularly IP Blocklists. An IP Blocklist is a list of IPs known to be malicious that organizations can use to protect themselves by filtering any traffic to or from those IPs. These lists can be downloaded from the Internet or purchased online.
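
To make the filtering idea concrete, here is a minimal sketch of checking an address against a blocklist. It assumes a plain-text blocklist with one IPv4 address per line and "#" comment lines. That format, the function names, and the sample addresses (taken from the reserved documentation ranges) are assumptions for illustration, not a description of any particular feed.

```python
import ipaddress


def load_blocklist(lines):
    """Parse text lines into a set of IPv4 addresses.

    Blank lines and lines starting with '#' are skipped
    (assumed format: one address per line).
    """
    blocked = set()
    for line in lines:
        entry = line.strip()
        if not entry or entry.startswith("#"):
            continue
        blocked.add(ipaddress.IPv4Address(entry))
    return blocked


def is_blocked(ip, blocklist):
    """Return True if the given IPv4 address string is in the blocklist."""
    return ipaddress.IPv4Address(ip) in blocklist


# Hypothetical blocklist content, using documentation-only addresses.
sample = ["# hypothetical blocklist", "192.0.2.10", "198.51.100.7", ""]
blocklist = load_blocklist(sample)
print(is_blocked("192.0.2.10", blocklist))   # True
print(is_blocked("203.0.113.5", blocklist))  # False
```

In practice the lookup would usually happen in a firewall or proxy rather than in application code, but the set membership test is the core of the idea.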

There are several ways to create a Blocklist. For example, you can randomly choose a certain number of IPs from the address space. The problem with such a list is that it will probably not be very effective in defending you from attackers or, worse, it will contain the IP of a legitimate service or user of your organization. A better way to create Blocklists is to select the IPs of real attackers, and this is where honeypots play an important role.

Our honeypot network comprises 27 IoT devices exposed to the public Internet. It can be considered a network of high-interaction honeypots. The devices receive all sorts of attacks; they are pinged and scanned constantly by attackers searching for new victims. Some devices expose admin interfaces whose usernames and passwords are brute-forced. Eventually, the attackers compromise a device and install all sorts of malware, from crypto miners to command and control tools. Some compromised devices are even used to attack other victims. Luckily, all the network traffic to and from the honeypots is closely monitored and studied by security experts.

At the Stratosphere Laboratory, we created the Attacker IP Prioritization (AIP) tool. It is a Python framework that processes the attacks, extracts the attackers' IPs and statistics of the network traffic, and, based on a series of algorithms, generates IP Blocklists. The framework also allows us to test the efficiency of the IP Blocklists and so choose the best models. The resulting blocklists are updated daily and available for download in the following link.
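
The general idea of turning traffic statistics into a blocklist and then testing it can be sketched in a few lines. The example below is not the AIP algorithm. It is a deliberately naive stand-in that ranks IPs by the total number of flows seen in the past and measures how many of the next day's attacker IPs the resulting list would have covered. The function names, the ranking criterion, and the sample data are all assumptions made for illustration.

```python
from collections import Counter


def build_blocklist(history, max_size):
    """Keep the max_size IPs with the most flows in the history.

    history is an iterable of (ip, flows) pairs, one per IP per day.
    This ranking is a hypothetical baseline, not the AIP algorithm.
    """
    flows_per_ip = Counter()
    for ip, flows in history:
        flows_per_ip[ip] += flows
    return {ip for ip, _ in flows_per_ip.most_common(max_size)}


def hit_rate(blocklist, next_day_ips):
    """Fraction of the next day's attacker IPs present in the blocklist."""
    next_day_ips = set(next_day_ips)
    if not next_day_ips:
        return 0.0
    return len(blocklist & next_day_ips) / len(next_day_ips)


# Hypothetical observations using documentation-only addresses.
history = [
    ("192.0.2.1", 24),
    ("192.0.2.2", 1),
    ("192.0.2.1", 22),
    ("192.0.2.3", 5),
]
blocklist = build_blocklist(history, max_size=2)
print(sorted(blocklist))                                # ['192.0.2.1', '192.0.2.3']
print(hit_rate(blocklist, ["192.0.2.1", "192.0.2.9"]))  # 0.5
```

The second result already hints at the limitation of any purely history-based model: an IP that has never been observed cannot be on the list.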

While freely distributing IP Blocklists is harmless, detailed information about the attacks can be misused by attackers in several ways. The ability of honeypots to lure attackers depends greatly on secrecy, which can be compromised if detailed attack information is published and updated frequently. However, carefully sharing honeypot attack information can benefit the community.

Thus, we published the CTU-AIP-Attacks-2022 dataset, available for download in []. This dataset contains, for each day of 2022, the list of unique IPs (exclusively IPv4) that connected to our honeypot network, together with the statistics gathered by our AIP tool. The published statistics include the number of flows, packets, and bytes each attacker sent to the network on a given day, plus the total duration in seconds of all its flows, where the duration of a flow is the time between its first and last packet.

This is an example of the content of one of the files in the dataset.

~ $ zcat attacks.2022-04-04.csv.gz | head -n20
# This file is part of the CTU-AIP-Attacks-2022 dataset
# Version: 1.0
# Publication Date: 2023-03
# Authors: Joaquin Bogado, Veronica Valeros, Sebastian Garcia
# Institution: Stratosphere Laboratory, AIC, FEL, Czech Technical University in Prague
# DOI: 10.5281/zenodo.7684550
# Zenodo: https://zenodo.org/record/7684550/
# Source: https://mcfp.felk.cvut.cz/publicDatasets/CTU-AIP-Attacks-2022/
date,orig,flows,duration,packets,bytes
2022-04-04,1.0.234.65,1,5e-06,2,104
2022-04-04,1.10.172.211,1,0.0,1,52
2022-04-04,1.116.138.182,1,4.7e-05,2,80
2022-04-04,1.116.243.210,1,3e-06,2,80
2022-04-04,1.116.37.121,1,2e-06,2,80
2022-04-04,1.116.67.192,24,0.000173,48,1920
2022-04-04,1.116.73.236,22,0.000173,43,1720
2022-04-04,1.116.97.146,1,1e-06,2,120
2022-04-04,1.117.107.145,1,3e-06,2,126
2022-04-04,1.117.199.237,1,5e-06,2,80
2022-04-04,1.12.255.18,2,2e-05,4,160

There is one of these files per day, from 2022-01-01 to 2022-12-31, in compressed CSV format (.csv.gz), each with a header of 8 lines beginning with "#". The columns are:

'date': the date of the attacks in string format (yyyy-mm-dd).

'orig': the origin IP of the attacks in string format.

'flows': the number of flows sent by the attacker on that date in integer format.

'duration': the total duration of all the attacker's flows on that date, in seconds in float format. The duration of a single flow is the time between its first and last packet.

'packets': the number of packets sent by the attacker on that date in integer format.

'bytes': the number of bytes sent by the attacker on that date in integer format.

These files are easy to parse and load into data analysis tools. For example, in Python, the Pandas read_csv() function reads a file into a DataFrame, and passing comment='#' skips the header lines that begin with "#":

import pandas as pd

df = pd.read_csv('attacks.2022-04-04.csv.gz', comment='#')
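
To analyze more than one day at a time, the daily files can be concatenated into a single DataFrame. The sketch below assumes that all the files sit in one local directory and follow the naming of the example above (attacks.2022-MM-DD.csv.gz). The function name and the glob pattern are assumptions, not part of the dataset documentation.

```python
from pathlib import Path

import pandas as pd


def load_attacks(directory, pattern="attacks.2022-*.csv.gz"):
    """Load every daily file matching the pattern into one DataFrame."""
    files = sorted(Path(directory).glob(pattern))
    if not files:
        raise FileNotFoundError(f"no files matching {pattern!r} in {directory}")
    # comment="#" skips the metadata header lines; gzip compression is
    # inferred by pandas from the .gz extension.
    frames = [pd.read_csv(path, comment="#") for path in files]
    return pd.concat(frames, ignore_index=True)
```

Calling load_attacks() with the path of the directory that holds the downloaded files returns one row per IP per day. Loading the whole year at once may need a fair amount of memory, so a narrower pattern such as "attacks.2022-04-*.csv.gz" is a reasonable way to start.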

A quick analysis reveals that there are 1,226,521 different IP addresses in the dataset. The top 1% of those IPs (12,265) are responsible for more than 90% of the total amount of bytes sent to the honeypots during the whole year.
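
A figure like this can be computed with a groupby over the 'orig' column. The following is a minimal sketch of that computation, assuming a DataFrame with the columns described above. The function name and the toy data are illustrative and are not taken from the dataset.

```python
import math

import pandas as pd


def top_share(df, fraction=0.01, column="bytes"):
    """Return (number of top IPs, share of the column they account for)."""
    # Total per IP over all days, largest first.
    totals = df.groupby("orig")[column].sum().sort_values(ascending=False)
    n_top = max(1, math.floor(len(totals) * fraction))
    share = totals.iloc[:n_top].sum() / totals.sum()
    return n_top, float(share)


# Toy data (not from the dataset): one heavy sender and 99 light ones.
toy = pd.DataFrame({
    "orig": [f"198.51.100.{i}" for i in range(100)],
    "bytes": [900] + [1] * 99,
})
n_top, share = top_share(toy)
print(n_top, round(share, 3))  # 1 0.901
```

Rounding the number of top IPs down reproduces the count quoted above: 1% of 1,226,521 is 12,265.21, which gives 12,265 IPs.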


However, the dates on which the IPs in the top 1% are first seen attacking our honeypots are spread across the year. Every day, the honeypot network receives attacks from between 2,000 and 10,000 IPs never seen before, of which between 1 and 100 belong to the top 1%. This fact severely limits the performance of models based on the attacks' history, because those models cannot predict IPs they have never seen.
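
Counting the IPs that appear for the first time on each day can be sketched as follows, again assuming a DataFrame with the 'date' and 'orig' columns described above. The function name, the optional 'only' parameter, and the toy data are assumptions for illustration. Note that with this approach every IP seen on the first day of the data counts as new, since there is no earlier history to compare against.

```python
import pandas as pd


def new_ips_per_day(df, only=None):
    """Count, for each date, the IPs whose first appearance is that date.

    only: optional collection of IPs (for example the top 1%) to which
    the count is restricted.
    """
    if only is not None:
        df = df[df["orig"].isin(list(only))]
    # ISO dates (yyyy-mm-dd) sort correctly even as plain strings.
    first_seen = df.groupby("orig")["date"].min()
    return first_seen.value_counts().sort_index()


# Toy data (not from the dataset), using documentation-only addresses.
toy_days = pd.DataFrame({
    "date": ["2022-01-01", "2022-01-01", "2022-01-02", "2022-01-02",
             "2022-01-03", "2022-01-03", "2022-01-03"],
    "orig": ["192.0.2.1", "192.0.2.2", "192.0.2.1", "192.0.2.3",
             "192.0.2.4", "192.0.2.2", "192.0.2.3"],
})
print(new_ips_per_day(toy_days))
print(new_ips_per_day(toy_days, only={"192.0.2.3", "192.0.2.4"}))
```

In the toy data, two IPs are new on the first day and one on each of the following days. Restricting the count to a chosen subset shows how many of those particular IPs appear for the first time on each date.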

A model able to predict even a small percentage of the new IPs can be a game changer and help us to develop new and more efficient blocklists. But studying the attacks' history may not be sufficient to develop such a model. Augmenting the dataset with geolocation, ASN, or other data may shed more light on a very interesting and thriving problem in the network security field. If you have any new ideas, feel free to test them using the AIP tool. If you are a researcher using this dataset, please cite us in the following way:

Citation: Joaquin Bogado, Veronica Valeros, & Sebastian García. (2023). CTU-AIP-Attacks-2022 (1.0) [Data set]. Zenodo. https://doi.org/10.5281/zenodo.7684550

See https://zenodo.org/record/7684550 for other citation formats, download options, and a detailed technical description of the dataset.