Security Experts:

Connect with us

Hi, what are you looking for?


Endpoint Security

Sophos, ReversingLabs Release 20 Million Sample Dataset for Malware Research

Sophos and ReversingLabs on Monday announced SoReL-20M, a database of 20 million Windows Portable Executable files, including 10 million malware samples.

Sophos and ReversingLabs on Monday announced SoReL-20M, a database of 20 million Windows Portable Executable files, including 10 million malware samples.

Aimed at driving security improvements across the industry, the database provides metadata, labels, and features for the files within, and enables interested parties to download the available malware samples for further research.

Containing a curated and labeled set of samples and relevant metadata, the publicly-accessible dataset is expected to help accelerate machine learning research for malware detection.

Although machine learning models are built on data, the field of security lacks a standard, large-scale dataset that all types of users (ranging from independent researchers to laboratories and corporations) can easily access, which has so far slowed down advancement, Sophos argues.

“Obtaining a large number of curated, labeled samples is both expensive and challenging, and sharing data sets is often difficult due to issues around intellectual property and the risk of providing malicious software to unknown third parties. As a consequence, most published papers on malware detection work on private, internal datasets, with results that cannot be directly compared to each other,” the company says.

A production-scale dataset covering 20 million samples, including 10 million disarmed pieces of malware, the SoReL-20M dataset aims to address the problem.

For each sample, the dataset includes features that have been extracted based on the EMBER 2.0 dataset, labels, detection metadata, and complete binaries for the included malware samples.

Additionally, PyTorch and LightGBM models that have already been trained on this data as baselines are provided, along with scripts needed to load and iterate over the data, as well as to load, train, and test the models.

Given that the malware being released has been disarmed, it would require “knowledge, skill, and time to reconstitute” and run, Sophos says.

The company does admit that there is a chance that skilled attackers would learn from these samples or use them to create attack tools, but argues that there “are already many other sources attackers could leverage to gain access to malware information and samples that are easier, faster and more cost effective to use.”

Thus, the company believes that the disarmed samples are more valuable to security researchers looking to advance their independent defenses.

The disabled malware samples, which have been in the wild for a while, are expected to call back to dismantled infrastructure. Moreover, they should be detected by most anti-virus vendors. With metadata released alongside the samples, detection is expected to improve.

“As an industry, we know malware isn’t limited to Windows or even executable files, which is why researchers and security teams will always need more data,” ReversingLabs, which claims to be providing a reputation database of more than 12 billion goodware and malware files, said.

Related: Microsoft, MITRE Release Adversarial Machine Learning Threat Matrix

Related: Tackling the SDLC With Machine Learning

Written By

Ionut Arghire is an international correspondent for SecurityWeek.

Click to comment

Expert Insights

Related Content

Application Security

Cycode, a startup that provides solutions for protecting software source code, emerged from stealth mode on Tuesday with $4.6 million in seed funding.

Management & Strategy

Industry professionals comment on the recent disruption of the Hive ransomware operation and its hacking by law enforcement.

Malware & Threats

Microsoft plans to improve the protection of Office users by blocking XLL add-ins from the internet.

Management & Strategy

Tens of cybersecurity companies have announced cutting staff over the past year, in some cases significant portions of their global workforce.


CISA, NSA, and MS-ISAC issued an alert on the malicious use of RMM software to steal money from bank accounts.


Chinese threat actor DragonSpark has been using the SparkRAT open source backdoor in attacks targeting East Asian organizations.

Management & Strategy

Microsoft making a multiyear, multibillion dollar investment in the artificial intelligence startup OpenAI, maker of ChatGPT and other tools.


Twenty-one cybersecurity-related M&A deals were announced in December 2022.