
AIP Tool

For those who wish to implement the AIP algorithm on their own networks with their own data, we have published the source code for the algorithm as the AIP tool on GitHub. It is very simple to use and deploy, and as long as you are familiar with crontab or another Linux automation system, it can be set up to update and regenerate itself automatically.

This tool is meant to be used in conjunction with at least one honeypot from which you gather your data. The input cannot simply be data collected from a normal network, since the tool does not differentiate between attacks and regular connections: it treats every IP in the input data file as an attacker. It is designed to be run once a day, at a time of your choosing, after the .csv file that contains the data from the last 24 hours has been copied to the designated directory. If it is not run once a day it will still work, but the rating system will be thrown off.

Input Data

The program accepts a directory that contains the data files from each day. You assign a directory for the program to look in every time it runs, and it checks whether there are any new files to process. If there are, it processes them and remembers their names so that it does not process them again the next time it runs.
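
The bookkeeping described above can be sketched in a few lines of Python. This is a minimal illustration, not the tool's actual code: the function names and the idea of keeping processed file names in a plain text log, one per line, are assumptions made for the example.

```python
from pathlib import Path


def find_new_files(data_dir, processed_log):
    """Return the .csv files in data_dir whose names are not yet in processed_log."""
    log_path = Path(processed_log)
    # Names recorded on earlier runs, one per line; empty on the very first run.
    seen = set(log_path.read_text().splitlines()) if log_path.exists() else set()
    return sorted(p for p in Path(data_dir).glob("*.csv") if p.name not in seen)


def mark_processed(files, processed_log):
    """Append the names of freshly processed files to the log (hypothetical format)."""
    with open(processed_log, "a") as log:
        for path in files:
            log.write(path.name + "\n")
```

A daily run would call find_new_files, process whatever it returns, and then call mark_processed so that the same files are skipped on the next run.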

In terms of file format, it accepts a .csv file that has one IP per line, with each of the following data inputs for each IP on that line, separated by commas:

Number of events - The total number of connections to our honeypots originating from the given IP

Total duration - The combined duration of all the connections made by this IP

Average duration - The average length in seconds of all the connections per IP

Number of bytes - Total bytes sent and received

Average number of bytes - The average number of bytes transferred per connection for this IP

Total packets - The total number of packets across all the connections per IP

Average packets - The average number of packets sent per connection

Last event time - UNIX time of the last time the IP tried to connect to something in the last 24 hours

First event time - UNIX time of the first time the IP tried to connect in the last 24 hours
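
The column layout listed above can be read into a typed record with a short parser. The sketch below is only an illustration: the IPRecord class, its field names, and the choice of int or float for each column are assumptions for the example, not part of the AIP tool.

```python
import csv
from typing import NamedTuple


class IPRecord(NamedTuple):
    # Hypothetical field names, in the column order described in the text.
    ip: str
    events: int
    total_duration: float
    avg_duration: float
    total_bytes: int
    avg_bytes: float
    total_packets: int
    avg_packets: float
    last_event: float
    first_event: float


def parse_line(line):
    """Parse one comma-separated, optionally quoted input line into an IPRecord."""
    f = next(csv.reader([line]))
    if len(f) != 10:
        raise ValueError(f"expected 10 fields, got {len(f)}")
    return IPRecord(
        ip=f[0],
        events=int(f[1]),
        total_duration=float(f[2]),
        avg_duration=float(f[3]),
        total_bytes=int(f[4]),
        avg_bytes=float(f[5]),
        total_packets=int(f[6]),
        avg_packets=float(f[7]),
        last_event=float(f[8]),
        first_event=float(f[9]),
    )
```

The csv module strips the surrounding double quotes, so each field arrives as a bare string before it is converted.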

For example, a single line in the file could look like this:

"IPv4 Address","26049","7415310","284.6","41808957","1605.0","284577","10.92","157899154","1578968762.519"

Rating Process

The AIP algorithm takes each of the flows from the input and uses its data to calculate eight values for each IP. For each IP, each of the eight values is updated using the data from the current day and then saved to a file called the absolute data file. The absolute data file contains the values for all the IPs seen since the program was first started.

The absolute data file is then checked against a whitelist data file, which is a simple .csv file that contains a list of IPs that should not be blocked. This is meant to be a safeguard against mistakes made by whatever filtration system is used to generate the input data. The whitelist file comes with a list of Google IPs and some other common web crawlers, and can easily be changed by the user.
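
As a rough illustration of the whitelist check, the sketch below drops whitelisted IPs from a set of per-IP records. It assumes the whitelist file holds one IP in the first column of each row and that the records are held in a dict keyed by IP; both are assumptions for the example, as are the function names.

```python
import csv


def load_whitelist(path):
    """Read whitelisted IPs from a CSV file, taking the first column of each row."""
    with open(path, newline="") as handle:
        return {row[0].strip() for row in csv.reader(handle) if row and row[0].strip()}


def drop_whitelisted(records, whitelist):
    """Return a copy of records (a dict keyed by IP) without the whitelisted IPs."""
    return {ip: values for ip, values in records.items() if ip not in whitelist}
```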

The next step is to feed the absolute data file, which has been updated with the last 24 hours of events, into the rating program. Each of the eight data features is normalized across all the IPs in the database. For example, suppose that IP-A has an average of 50 events per day in the database. Instead of using 50 directly to calculate the score, the AIP algorithm normalizes it to a value between 0 and 1 by comparing it to the highest and lowest average event values among the IPs in the database. In this way every feature is on the same scale, so that none of them dominates the final score simply because of its magnitude.
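
The normalization described above corresponds to ordinary min-max scaling. A minimal sketch for a single feature follows; the function name, the dict layout, and the handling of the case where every IP has the same value are assumptions for the example.

```python
def min_max_normalize(values):
    """Scale a dict of {ip: value} to the range [0, 1] using the lowest and highest value."""
    low, high = min(values.values()), max(values.values())
    if high == low:
        # Every IP has the same value; assumption: map them all to 0.0.
        return {ip: 0.0 for ip in values}
    return {ip: (v - low) / (high - low) for ip, v in values.items()}


# IP-A's 50 events per day, compared with hypothetical extremes of 0 and 100.
print(min_max_normalize({"IP-A": 50, "IP-B": 0, "IP-C": 100})["IP-A"])  # 0.5
```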

The rating program assigns each of the eight values a specific weight, and the weights sum to one. These weights control the effect each value has on the final score. The final score is the linear combination of the normalized values, each multiplied by its weight.
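
The weighted combination can be sketched as follows. The feature names and weights used with this function are placeholders chosen for illustration; the actual weights used by the AIP tool are not given here.

```python
import math


def weighted_score(normalized, weights):
    """Combine normalized feature values into one score using weights that sum to one."""
    if not math.isclose(sum(weights.values()), 1.0):
        raise ValueError("weights must sum to 1")
    # Linear combination: each normalized value times its weight, summed.
    return sum(weights[name] * normalized[name] for name in weights)
```

Because every normalized value lies between 0 and 1 and the weights sum to one, the resulting score also lies between 0 and 1.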

The final score is then multiplied by the time modifier function, which returns a value based on the time of the last event of each IP. If the last event is less than one day old, the function returns 1, leaving the score unchanged. However, if the last event for a given IP is over a day old, the time modifier function returns a value below 1, which reduces the score. The pseudocode is:

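def time_modifier_function(days):
    # days: time since the IP's last event, in days
    if days > 1:
        # Last event is more than a day old: return a factor below 1
        return days / (days + 5)
    else:
        # Last event is within the last day: leave the score unchanged
        return 1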

Thus the longer it has been since an IP has attacked, the more its score will decrease. This continues until the IP's score falls below the threshold score, which is the value an IP must exceed in order to be blacklisted. All IPs that score above the threshold are blacklisted; those that do not are still tracked, but they are not blacklisted.
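
The final selection step can be sketched as below. This is an illustration under stated assumptions: the function name and arguments are hypothetical, the time modifier is passed in as a callable rather than fixed, and the age of the last event is computed from UNIX timestamps in seconds.

```python
SECONDS_PER_DAY = 86400


def select_blacklist(scores, last_events, now, threshold, time_modifier):
    """Return the IPs whose time-adjusted score is above the threshold.

    scores and last_events are dicts keyed by IP; last_events and now are
    UNIX timestamps; time_modifier maps an age in days to a multiplier.
    """
    blacklist = []
    for ip, score in scores.items():
        days = (now - last_events[ip]) / SECONDS_PER_DAY
        if score * time_modifier(days) > threshold:
            blacklist.append(ip)
    return sorted(blacklist)
```

IPs that fall at or below the threshold are simply left out of the returned list; their records stay in the data and can be rated again on the next run.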