Sophos and ReversingLabs on Monday announced SoReL-20M, a database of 20 million Windows Portable Executable files, including 10 million malware samples.
Aimed at driving security improvements across the industry, the database provides metadata, labels, and features for the files within, and enables interested parties to download the available malware samples for further research.
Containing a curated and labeled set of samples and relevant metadata, the publicly accessible dataset is expected to help accelerate machine learning research for malware detection.
Although machine learning models are built on data, the security field lacks a standard, large-scale dataset that all types of users, from independent researchers to laboratories and corporations, can easily access, and that gap has so far slowed progress, Sophos argues.
“Obtaining a large number of curated, labeled samples is both expensive and challenging, and sharing data sets is often difficult due to issues around intellectual property and the risk of providing malicious software to unknown third parties. As a consequence, most published papers on malware detection work on private, internal datasets, with results that cannot be directly compared to each other,” the company says.
SoReL-20M, a production-scale dataset of 20 million samples that includes 10 million disarmed pieces of malware, aims to address the problem.
For each sample, the dataset includes labels, detection metadata, and features extracted based on the EMBER 2.0 dataset. Complete binaries are provided for the included malware samples.
Additionally, PyTorch and LightGBM models already trained on this data are provided as baselines, along with the scripts needed to load and iterate over the data and to load, train, and test the models.
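As a rough illustration of what loading and iterating over a dataset of this kind involves, the Python sketch below yields shuffled mini-batches of feature vectors and labels. It does not use the SoReL-20M scripts or file formats. Every name in it (`make_synthetic_samples`, `iter_batches`) is hypothetical, the random vectors merely stand in for real extracted features, and the label convention (1 for malware, 0 for benign) is an assumption.

```python
import random
from typing import Iterator, List, Tuple

# One record: (feature vector, label). Assumed convention: 1 = malware, 0 = benign.
Sample = Tuple[List[float], int]


def make_synthetic_samples(n: int, dim: int, seed: int = 0) -> List[Sample]:
    """Build n placeholder records with random features and alternating labels."""
    rng = random.Random(seed)
    samples: List[Sample] = []
    for i in range(n):
        features = [rng.random() for _ in range(dim)]
        samples.append((features, i % 2))
    return samples


def iter_batches(
    samples: List[Sample],
    batch_size: int,
    shuffle: bool = True,
    seed: int = 0,
) -> Iterator[Tuple[List[List[float]], List[int]]]:
    """Yield (features, labels) mini-batches; the last batch may be smaller."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    order = list(range(len(samples)))
    if shuffle:
        # Seeded shuffle so that a run can be reproduced.
        random.Random(seed).shuffle(order)
    for start in range(0, len(order), batch_size):
        idx = order[start:start + batch_size]
        yield [samples[i][0] for i in idx], [samples[i][1] for i in idx]


if __name__ == "__main__":
    data = make_synthetic_samples(n=10, dim=4)
    for features, labels in iter_batches(data, batch_size=4):
        print(len(features), labels)
```

In practice, a loop like this would feed each batch to a model's training step. At the scale of 20 million samples, the records would be read from disk rather than held in memory.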
Given that the malware being released has been disarmed, it would require “knowledge, skill, and time to reconstitute” and run, Sophos says.
The company does admit that there is a chance that skilled attackers would learn from these samples or use them to create attack tools, but argues that there “are already many other sources attackers could leverage to gain access to malware information and samples that are easier, faster and more cost effective to use.”
Thus, the company believes that the disarmed samples are more valuable to security researchers looking to advance their independent defenses than they are to attackers.
The disabled malware samples, which have been in the wild for a while, are expected to call back to dismantled infrastructure. Moreover, they should be detected by most anti-virus vendors. With metadata released alongside the samples, detection is expected to improve.
“As an industry, we know malware isn’t limited to Windows or even executable files, which is why researchers and security teams will always need more data,” said ReversingLabs, which says it provides a reputation database of more than 12 billion goodware and malware files.