Three Years of Publishing Malware Traffic Datasets

The Stratosphere IPS is a behavioral-based intrusion detection and prevention system. It uses machine learning algorithms to detect malicious behaviors. In order to do that, we create models based on real malware behaviours to ensure a good accuracy and performance of our IPS. For this reason, in 2015 we started our sister project called 'Malware Capture Facility Project'. 

Nomad Project

The goal of the NoMaD project is to collect, label, organize and make available a large, verified and labeled dataset of normal and malicious HTTPS connections. This dataset is designed to support the research team at Cisco Prague as well as to support the research activities and publications of the CVUT University. The project will give Cisco Systems an evolving dataset to generate better and faster analysis; and will give the CTU University the opportunity to research about the HTTPSbehaviors in the network as part of its Stratosphere Project.

New dataset, CTU-13-Extended, now includes pcap files of normal traffic

After considering several request we decided to extend the previous CTU-13 dataset to include truncated versions of the original pcap files. The pcap files include now all the traffic: Normal, Botnet and Background. The pcap files where however truncated to protect the privacy of the users, but in such a way that it is still possible to read the complete TCP, UDP and ICMP headers.

Differences on The Behavioral Patterns of Malware and Normal DNS Connections

This blog post is a comparison and analysis of the differences in the behavioral patterns found in the DNS traffic of malware and normal connections. We captured malware and normal traffic in the MCFP project and we extracted the DNS behavior with the stf tool. The captures correspond to DNStraffic of a SPAM malware, DGA-based malware and a normal computer. The idea is to analyze the differences in the behaviors as they are shown by the stf program. For an explanation of how the stfprogram is generating this data see this explanation.

The Importance of Good Labels in Security Datasets

Working as security researchers is common to create a new machine learning algorithm that we want to evaluate. It may be that we are trying to detect malware, identify attacks or analyze IDS logs, but at some point we figure it out that we need a good dataset to complete our task. But not any dataset; in fact we need a labeled dataset. The dataset will be used not only to learn the features of, for example, malware traffic, but also to verify how good our algorithm is. Since getting a dataset is difficult and time consuming, the most common solution is to get a third-party dataset; although some researchers with time and resources may create their own. Either way, most usually we obtain a dataset of malware traffic (continuing with the malware traffic detection example) and we assign the label Malware to all of its instances. This looks good, so we make our training and testing, we obtain results and we publish. However, there are important problems in this approach that can jeopardize the results of our algorithm and the verification process. Let’s analyze each problem in turn.