Security Experts:

Connect with us

Hi, what are you looking for?


Endpoint Security

Sophos, ReversingLabs Release 20 Million Sample Dataset for Malware Research

Sophos and ReversingLabs on Monday announced SoReL-20M, a database of 20 million Windows Portable Executable files, including 10 million malware samples.

Sophos and ReversingLabs on Monday announced SoReL-20M, a database of 20 million Windows Portable Executable files, including 10 million malware samples.

Aimed at driving security improvements across the industry, the database provides metadata, labels, and features for the files within, and enables interested parties to download the available malware samples for further research.

Containing a curated and labeled set of samples and relevant metadata, the publicly-accessible dataset is expected to help accelerate machine learning research for malware detection.

Although machine learning models are built on data, the field of security lacks a standard, large-scale dataset that all types of users (ranging from independent researchers to laboratories and corporations) can easily access, which has so far slowed down advancement, Sophos argues.

“Obtaining a large number of curated, labeled samples is both expensive and challenging, and sharing data sets is often difficult due to issues around intellectual property and the risk of providing malicious software to unknown third parties. As a consequence, most published papers on malware detection work on private, internal datasets, with results that cannot be directly compared to each other,” the company says.

A production-scale dataset covering 20 million samples, including 10 million disarmed pieces of malware, the SoReL-20M dataset aims to address the problem.

For each sample, the dataset includes features that have been extracted based on the EMBER 2.0 dataset, labels, detection metadata, and complete binaries for the included malware samples.

Additionally, PyTorch and LightGBM models that have already been trained on this data as baselines are provided, along with scripts needed to load and iterate over the data, as well as to load, train, and test the models.

Given that the malware being released has been disarmed, it would require “knowledge, skill, and time to reconstitute” and run, Sophos says.

The company does admit that there is a chance that skilled attackers would learn from these samples or use them to create attack tools, but argues that there “are already many other sources attackers could leverage to gain access to malware information and samples that are easier, faster and more cost effective to use.”

Thus, the company believes that the disarmed samples are more valuable to security researchers looking to advance their independent defenses.

The disabled malware samples, which have been in the wild for a while, are expected to call back to dismantled infrastructure. Moreover, they should be detected by most anti-virus vendors. With metadata released alongside the samples, detection is expected to improve.

“As an industry, we know malware isn’t limited to Windows or even executable files, which is why researchers and security teams will always need more data,” ReversingLabs, which claims to be providing a reputation database of more than 12 billion goodware and malware files, said.

Related: Microsoft, MITRE Release Adversarial Machine Learning Threat Matrix

Related: Tackling the SDLC With Machine Learning

Written By

Ionut Arghire is an international correspondent for SecurityWeek.

Click to comment

Daily Briefing Newsletter

Subscribe to the SecurityWeek Email Briefing to stay informed on the latest threats, trends, and technology, along with insightful columns from industry experts.

Join this webinar to learn best practices that organizations can use to improve both their resilience to new threats and their response times to incidents.


Join this live webinar as we explore the potential security threats that can arise when third parties are granted access to a sensitive data or systems.


Expert Insights

Related Content

Application Security

Cycode, a startup that provides solutions for protecting software source code, emerged from stealth mode on Tuesday with $4.6 million in seed funding.

Management & Strategy

SecurityWeek examines how a layoff-induced influx of experienced professionals into the job seeker market is affecting or might affect, the skills gap and recruitment...


The changing nature of what we still generally call ransomware will continue through 2023, driven by three primary conditions.


No one combatting cybercrime knows everything, but everyone in the battle has some intelligence to contribute to the larger knowledge base.


A recently disclosed vBulletin vulnerability, which had a zero-day status for roughly two days last week, was exploited in a hacker attack targeting the...


Twenty-one cybersecurity-related M&A deals were announced in December 2022.

Management & Strategy

Industry professionals comment on the recent disruption of the Hive ransomware operation and its hacking by law enforcement.

Malware & Threats

Threat actors are increasingly abusing Microsoft OneNote documents to deliver malware in both targeted and spray-and-pray campaigns.