Attacker IP Prioritizer Program

This blog post was authored by Thomas O’Hara (@bambino_thomas) on November 14th, 2019

Intro

The Attacker IP Prioritization (AIP) algorithm was created to sort the huge number of attacker IP addresses we observe so that they can be blocked using a blacklist. In the Aposemat project we started a research line called “Polonium in my IoTea”, where we are investigating the relationship between organized cyber crime groups (e.g. FancyBear) and their use of IoT devices. In this investigation we needed to analyze thousands of IP addresses attacking our honeypots and quickly decide which ones were the most dangerous. The idea of the AIP algorithm was born from the need to sort the attackers’ IP addresses from a statistical point of view, and then compare those rated IPs against our more technical, in-depth attack research to find out whether an IP is really part of an important attacker group. For that research we use investigation tools such as VirusTotal, PassiveTotal, Shodan, Machinae, and Reputation Authority. This prioritization need gradually grew into a much more ambitious project: designing an algorithm that prioritizes IP addresses continuously as it is constantly fed new data.

The source code for it can be found in the GitHub Repository.

The Principles

The aim of the AIP algorithm is to generate a list of the most dangerous IPs from the attacks on our IoT devices. We assume that the most dangerous IPs, from a statistical point of view, share several recognizable features:

  • First, they should be attacking more often than other IPs. In terms of our collected data, we increase the priority of the IPs that attack more.

  • Second, IPs should attack consistently. Namely, an IP should have a high daily average of attacks, and the standard deviation of its daily attack count should be low.

  • Third, the average duration of the attacks should be long. Larger and more advanced botnets are more organized and thorough, so they try more things once they get into our honeypots, which increases the length of their events.

  • Fourth, IPs should be currently active. An IP that was last seen a few months ago would have its priority decreased in our list.

  • Fifth, the number of bytes transferred and the number of packets sent and received should be high.

All five of these traits need to be included in the sorting process of AIP, and each of them needs to be weighted since they are not of equal importance. Therefore, there is a need to build a prioritization algorithm that receives data flows and outputs information built on top of these five characteristics.

The Data Source

The first step in the design of the AIP algorithm was to find a source from which the data could be imported efficiently. We decided to use our Splunk instance since it was already consuming all the flows going to our honeypots.

Using this existing setup, we wrote a custom Splunk script that sorts the flows by IP, from the most attacks per week to the fewest.

Figure 1: The output of the Splunk sorting function

As can be seen in Figure 1, each IP in the list of flows has nine features attached to it, generated from the data gathered during the last 24 hours.

  1. Amount of events - The total number of connections to our honeypots originating from the given IP

  2. Total duration - The combined length of all the events from this IP

  3. Average duration - The average length in seconds of all the connections per IP

  4. Amount of bytes - Total bytes sent and received

  5. Average number of bytes - Bytes transferred per connection for each IP

  6. Total packets - Packets across all the connections per IP

  7. Average packets - Average packets sent per connection

  8. Last event time - UNIX time of the last time the IP tried to connect in the last 24 hours

  9. First event time - UNIX time of the first time the IP tried to connect in the last 24 hours

The program shown in Figure 1 outputs a CSV file to be used by the sorting program.
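To make the shape of that CSV concrete, here is a minimal sketch of how the sorting program might read it. The column order, the absence of a header row, the `DailyFlowSummary` name, and the sample row are all assumptions made for illustration; the real export may be laid out differently.

```python
import csv
import io
from dataclasses import dataclass


@dataclass
class DailyFlowSummary:
    """One row of the daily export: the nine features for a single IP."""
    ip: str
    events: int
    total_duration: float
    avg_duration: float
    total_bytes: int
    avg_bytes: float
    total_packets: int
    avg_packets: float
    last_event: float   # UNIX time
    first_event: float  # UNIX time


def parse_daily_summary(csv_text):
    """Parse the export, assuming one row per IP and the column order above."""
    rows = []
    for rec in csv.reader(io.StringIO(csv_text)):
        if not rec:
            continue  # skip blank lines
        rows.append(DailyFlowSummary(
            ip=rec[0],
            events=int(rec[1]),
            total_duration=float(rec[2]),
            avg_duration=float(rec[3]),
            total_bytes=int(rec[4]),
            avg_bytes=float(rec[5]),
            total_packets=int(rec[6]),
            avg_packets=float(rec[7]),
            last_event=float(rec[8]),
            first_event=float(rec[9]),
        ))
    return rows


# Made-up sample row using a documentation-range IP address.
sample = "203.0.113.7,12,340.5,28.375,20480,1706.67,160,13.33,1573732800,1573689600\n"
summaries = parse_daily_summary(sample)
```

Parsing each row into a typed record up front means a malformed export fails at load time rather than in the middle of the rating step.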

The AIP Algorithm

The AIP algorithm takes each of the flows from the input and uses their data to calculate eight values for each IP. The first seven values from the input data remain unchanged: number of events, total duration, average duration, number of bytes, average number of bytes, total packets, and average packets. In addition, the first event time and the number of events are used to calculate the average number of events per day the IP has had since it was first seen by the program, giving us a total of eight features as input for our algorithm.
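The eighth feature can be sketched as follows. This is only one plausible reading of the description: the function name is hypothetical, and clamping the observation window to at least one day (so that an IP first seen an hour ago does not get an inflated average) is an assumption, not something the original program is known to do.

```python
import time

SECONDS_PER_DAY = 86_400


def average_events_per_day(total_events, first_event_time, now=None):
    """Average daily events since the IP was first seen (UNIX timestamps)."""
    if now is None:
        now = time.time()
    # Assumption: treat any window shorter than one day as a full day.
    days_seen = max((now - first_event_time) / SECONDS_PER_DAY, 1.0)
    return total_events / days_seen


# An IP first seen exactly 10 days ago with 120 events in total.
example = average_events_per_day(120, first_event_time=0, now=10 * SECONDS_PER_DAY)
```

With those inputs, `example` is 12 events per day.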

For each IP, each of the eight values is updated using the data from the current day and then saved to a file called the absolute data file. The absolute data file contains the values for all the IPs seen since the program was started.
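The post does not spell out the update rule, so the sketch below is an assumption: totals accumulate by addition and the per-connection averages are recomputed from the running totals. The function and key names are hypothetical, and an in-memory dictionary stands in for the absolute data file.

```python
def update_absolute(absolute, ip, today):
    """Fold one day's summary for `ip` into the running `absolute` table."""
    rec = absolute.setdefault(ip, {
        "events": 0,
        "total_duration": 0.0,
        "total_bytes": 0,
        "total_packets": 0,
        "first_event": today["first_event"],
        "last_event": today["last_event"],
    })
    # Assumed rule: totals accumulate by simple addition.
    for key in ("events", "total_duration", "total_bytes", "total_packets"):
        rec[key] += today[key]
    rec["first_event"] = min(rec["first_event"], today["first_event"])
    rec["last_event"] = max(rec["last_event"], today["last_event"])
    # Assumed rule: averages are recomputed from the running totals.
    n = max(rec["events"], 1)
    rec["avg_duration"] = rec["total_duration"] / n
    rec["avg_bytes"] = rec["total_bytes"] / n
    rec["avg_packets"] = rec["total_packets"] / n
    return rec


absolute = {}
day1 = {"events": 4, "total_duration": 40.0, "total_bytes": 400,
        "total_packets": 40, "first_event": 1000, "last_event": 2000}
day2 = {"events": 6, "total_duration": 20.0, "total_bytes": 600,
        "total_packets": 60, "first_event": 90000, "last_event": 95000}
update_absolute(absolute, "203.0.113.7", day1)
record = update_absolute(absolute, "203.0.113.7", day2)
```

Recomputing averages from totals avoids averaging two averages, which would weight a quiet day the same as a busy one.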

The next step is to feed the absolute data file, which has been updated with the last 24 hours of events, into the rating program. The rating program assigns each of the eight values a specific weight. These weights control the effect each value will have on the final score. The sum of all weights is one. The weights currently used are:

  • For the Amount Events (feature 1) = 0.10 (Weight 1)

  • For the Average Events (feature 2) = 0.15 (Weight 2)

  • For the Total Duration (feature 3) = 0.10 (Weight 3)

  • For the Average Duration (feature 4) = 0.15 (Weight 4)

  • For the Total Bytes (feature 5) = 0.10 (Weight 5)

  • For the Average Bytes (feature 6) = 0.15 (Weight 6)

  • For the Total Packets (feature 7) = 0.10 (Weight 7)

  • For the Average Packets (feature 8) = 0.15 (Weight 8)

The rating function has the form:

Figure 2: Score Function

Basically, each feature is multiplied by its weight and then summed with the rest, as in a basic linear combination. Then the sum is multiplied by a time modifier, as will be explained below, and then we take the square root of that total.
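As a sketch of that description, the score can be written as the square root of the time modifier times the weighted sum. The dictionary keys are hypothetical names for the eight features, and the post does not say whether the features are normalized before weighting, so this example uses raw values and takes the time modifier as a plain number.

```python
import math

# Weights as listed above; the key names are illustrative.
WEIGHTS = {
    "events": 0.10,
    "avg_events": 0.15,
    "total_duration": 0.10,
    "avg_duration": 0.15,
    "total_bytes": 0.10,
    "avg_bytes": 0.15,
    "total_packets": 0.10,
    "avg_packets": 0.15,
}


def score(features, time_modifier=1.0):
    """sqrt(time_modifier * sum of weight_i * feature_i)."""
    weighted_sum = sum(WEIGHTS[name] * features[name] for name in WEIGHTS)
    return math.sqrt(time_modifier * weighted_sum)


# Toy input: every feature set to 100, so the weighted sum is about 100.
toy = {name: 100.0 for name in WEIGHTS}
active_score = score(toy)        # time modifier of 1: roughly 10
stale_score = score(toy, 0.5)    # time modifier of 0.5: roughly 7.07
```

Because the square root is taken after the time modifier is applied, halving the modifier lowers the score to about 71% of its value rather than to 50%.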

The time modifier function receives the time of the last event of each IP. If the last event is less than one day old, the function returns 1, thus leaving the total unchanged. However, if the last event for a given IP is over a day old, the time modifier function returns a number that decreases as the IP stays inactive. In Python-like pseudocode:

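def time_modifier_function(days):
    # days: time since the IP's last event, in days
    if days > 1:
        # Equivalent to 30 / (days + 30): 0.5 after 30 days of inactivity
        return 1 - days / (days + 30)
    else:
        return 1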
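To get a feel for how quickly the modifier decays, the short example below evaluates the same function for a few periods of inactivity. The chosen day values are arbitrary.

```python
def time_modifier_function(days):
    """Same rule as above: 1 for recent IPs, 1 - days/(days + 30) otherwise."""
    if days > 1:
        return 1 - days / (days + 30)
    return 1


for days in (0.5, 2, 7, 30, 90):
    modifier = time_modifier_function(days)
    print(f"{days:>4} days since last event -> modifier {modifier:.3f}")
```

After 30 days of inactivity the modifier is 0.5, and after 90 days it is 0.25. Note that the function is not continuous at the one-day mark: an IP last seen just over a day ago already gets a modifier of about 0.97.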

In this way, if an IP is inactive, its score will decrease over time like this:

the x-axis is the percentage of the original score, y-axis is days

And if the IP is active, its score will increase like this:

x-axis is score, y-axis is days

Output and Usage

The AIP algorithm is scheduled to run every 24 hours at 12:00. It receives the data from the last 24 hours, updates the scores for each IP, and then writes the final list of IPs in descending score order to a CSV file. That file is copied to the Stratosphere labs public data sets. It is also copied to a historical directory in that same folder, thus saving the original ratings for every day that the program was run. This allows a researcher to go back in time to see what the blacklist looked like on a specific day.
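The final write step could look roughly like this. The `write_blacklist` name, the two-column layout, and the header row are assumptions for the sake of the example, and an in-memory buffer stands in for the output file.

```python
import csv
import io


def write_blacklist(scores, out):
    """Write (ip, score) rows to `out`, highest score first."""
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    writer = csv.writer(out)
    writer.writerow(["ip", "score"])  # assumed header
    writer.writerows(ranked)
    return [ip for ip, _ in ranked]


# Made-up scores for three documentation-range IP addresses.
buf = io.StringIO()
order = write_blacklist(
    {"198.51.100.4": 3.2, "203.0.113.7": 9.1, "192.0.2.55": 5.0},
    buf,
)
```

Here `order` lists `203.0.113.7` first and `198.51.100.4` last, matching the descending score order described above.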

This blacklist should be useful for anyone who wants access to a list of currently active and attacking IP addresses. Many blacklists today have a very simple mechanism for determining whether an IP is to be blacklisted or not. There is usually no order for the IPs, no priority, and no extra weight given to IPs that are more consistent in their activity. Our AIP algorithm generates a blacklist that solves this problem. Because it updates every 24 hours, you will always be given the currently active IPs first, ahead of the less active ones. Also, since this blacklist has a time modifier, an IP that is no longer being used will drop down the blacklist as time goes on. This is a useful and rare feature that most blacklists do not have. Whether you are doing research or simply blocking IPs in a firewall, we hope this blacklist will be useful to you.

Planned Improvements

Currently, the blacklist includes all the IPs that are listed in the absolute data file. The absolute data file is the file where all the compounded data over time is stored. At this point it includes even the IPs that have a very low score. We plan on adding a deletion function that will remove these IPs from the output rating file, even though they will be saved forever in the absolute data file.

We also plan to alter the time modifier function so that, instead of using the fixed value 30, it uses a variable that depends on the previous score of the IP, so that highly rated IPs disappear more slowly than barely active ones.

If anyone has any ideas on how to improve the program, please leave a comment!

The blacklist can be found here: Stratosphere AIP Blacklist.

The source code can be found in the GitHub Repository.