Introducing Collectress: Consistent Threat Intelligence Feed Collection and Storage

Collectress is a free software tool developed by Stratosphere: https://github.com/stratosphereips/collectress

This blog was authored by Veronica Valeros (@verovaleros) on July 30, 2023

This blog post introduces Collectress, a new tool developed at the Stratosphere Laboratory. Collectress was born out of the need to keep a continuous record of a given feed, whether for 30 days or 300 days, so that feeds can be evaluated over time and compared with each other on a reasonable basis.

Collectress

Collectress is a Python tool designed to download external web threat intelligence feeds periodically and consistently. The emphasis is on consistency here. To be able to evaluate the performance of a blocklist or feed over time, we need to have a historical record of the feed. This is the main purpose of the tool. 

Collectress reads the list of feeds to process from a YML file, which describes each feed with three keys: feed name, feed organisation, and feed URL. This information is then combined into a homogeneous file name.
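
To illustrate the idea, here is a minimal Python sketch of how three such values could be combined into a uniform file name. The dictionary keys (name, organisation, url), the helper names, and the naming scheme itself are assumptions made for this example. They are not taken from the Collectress source code, so check the sample data_feeds.yml in the repository for the real format.

```python
import posixpath
import re
from urllib.parse import urlparse


def slugify(text: str) -> str:
    """Lowercase the text and collapse anything non-alphanumeric into '_'."""
    return re.sub(r"[^a-z0-9]+", "_", text.lower()).strip("_")


def feed_filename(feed: dict) -> str:
    """Build a uniform file name from a feed's name, organisation, and URL.

    The naming scheme is hypothetical: <organisation>_<name><extension>,
    where the extension comes from the URL path and defaults to '.txt'.
    """
    url_path = urlparse(feed["url"]).path
    extension = posixpath.splitext(url_path)[1] or ".txt"
    return f"{slugify(feed['organisation'])}_{slugify(feed['name'])}{extension}"


# Illustrative feed entry; the keys and values are made up for this sketch.
example_feed = {
    "name": "Example Blocklist",
    "organisation": "Example Org",
    "url": "https://feeds.example.org/blocklist.csv",
}
print(feed_filename(example_feed))  # example_org_example_blocklist.csv
```

Whatever the exact scheme, deriving the file name from the same fields every time is what keeps the files for one feed comparable from day to day.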

Feeds are stored in a date-based nested directory structure: YYYY/MM/DD/<feeds>. This allows researchers to find all the feeds for a given day in one directory.
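
The date-based layout can be sketched in a few lines of Python. The function names and the data_output root folder below are assumptions chosen for the example, not the tool's actual code.

```python
from datetime import date
from pathlib import Path


def daily_feed_dir(root: Path, day: date) -> Path:
    """Return the YYYY/MM/DD directory for a given day under the output root."""
    return root / f"{day:%Y}" / f"{day:%m}" / f"{day:%d}"


def feeds_for_day(root: Path, day: date) -> list:
    """List the file names stored for one day (empty if the day has no data)."""
    directory = daily_feed_dir(root, day)
    if not directory.is_dir():
        return []
    return sorted(entry.name for entry in directory.iterdir() if entry.is_file())


# Hypothetical output root; a collector would typically call
# daily_feed_dir(...).mkdir(parents=True, exist_ok=True) before writing.
print(daily_feed_dir(Path("data_output"), date(2023, 7, 30)))
```

Zero-padded months and days keep the directories in chronological order when sorted by name.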

Collectress writes a comprehensive log for each run. The log shows how many feeds were downloaded successfully and how many failed, and it also records the runtime, the bytes transferred, and the success rate of each execution. Because the log is in JSON format, it can be easily parsed to gain more insight into the tool's operation over days and months.
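
As a rough idea of what such parsing could look like, the Python sketch below aggregates several runs into an overall success rate. It assumes one JSON object per line and uses hypothetical field names (feeds_ok, feeds_failed, bytes). The real Collectress log may be structured differently, so adapt the keys to what the tool actually writes. The sample records are invented for illustration only.

```python
import json


def summarise_runs(log_lines) -> dict:
    """Aggregate per-run records into totals and an overall success rate (%)."""
    runs = ok = failed = transferred = 0
    for line in log_lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        record = json.loads(line)
        runs += 1
        ok += record["feeds_ok"]          # hypothetical key
        failed += record["feeds_failed"]  # hypothetical key
        transferred += record["bytes"]    # hypothetical key
    attempted = ok + failed
    success_rate = 100.0 * ok / attempted if attempted else 0.0
    return {"runs": runs, "success_rate": success_rate, "bytes": transferred}


# Made-up records, not real Collectress output.
sample_log = [
    '{"feeds_ok": 3, "feeds_failed": 1, "bytes": 1000}',
    '{"feeds_ok": 4, "feeds_failed": 0, "bytes": 2000}',
]
print(summarise_runs(sample_log))
```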

Additionally, Collectress implements an ETag cache to reduce unnecessary downloads. The ETag of each remote file is cached for one day in a JSON file called etag_cache.json. Before downloading the full content, Collectress checks whether the ETag is still the same. If the ETag has changed, the file is downloaded in full. If not, yesterday's copy of the file is copied to today's directory.
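
The decision described above can be sketched in Python as follows. This is a simplified illustration under stated assumptions, not the Collectress implementation: the function names are hypothetical, the cache is modelled as a plain dictionary mapping feed URL to the last seen ETag, and a missing ETag header is treated as a reason to download.

```python
import urllib.request
from typing import Optional


def fetch_etag(url: str, timeout: float = 10.0) -> Optional[str]:
    """Ask the server for the current ETag using a HEAD request (no body)."""
    request = urllib.request.Request(url, method="HEAD")
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.headers.get("ETag")


def plan_action(cache: dict, url: str, remote_etag: Optional[str]) -> str:
    """Return 'copy' when the cached ETag matches, otherwise 'download'."""
    if remote_etag is not None and cache.get(url) == remote_etag:
        return "copy"      # unchanged: reuse yesterday's file
    return "download"      # changed, unknown, or never seen before


def remember(cache: dict, url: str, remote_etag: Optional[str]) -> None:
    """Store the latest ETag so the next run can compare against it."""
    if remote_etag is not None:
        cache[url] = remote_etag


# Offline example with a made-up URL and ETag values.
cache = {"https://feeds.example.org/blocklist.txt": '"abc123"'}
print(plan_action(cache, "https://feeds.example.org/blocklist.txt", '"abc123"'))
print(plan_action(cache, "https://feeds.example.org/blocklist.txt", '"def456"'))
```

In a real run, the value passed as remote_etag would come from fetch_etag(url), and the cache dictionary would be loaded from and saved back to a JSON file with the json module.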

To make the tool easier to deploy and more stable, we created a Docker image that contains all the dependencies. The image can be downloaded from Docker Hub or GitHub Packages.

How to Use

Collectress needs three inputs: the data output folder, the data feeds YML file, and the ETag cache file. Once these are in place, you can run the tool using Docker or Python, as shown below:

Figure 1 - Collectress downloading 43 threat intelligence feeds, featuring a simple progress bar.

We have included a sample data_feeds.yml file in our repository that you can use to test the tool. Collectress is still in early development, and we welcome new issues reporting bugs or requesting features, as well as contributions to the project that follow the contribution guidelines.

Learn more

Learn more about Collectress at GitHub: https://github.com/stratosphereips/collectress 

Before you go…

We have recently published a new dataset, CTU-SME-11: a labeled dataset with real benign and malicious network traffic mimicking a small or medium-sized enterprise environment. Check it out at Zenodo: https://zenodo.org/record/7958259