Sophos, ReversingLabs Release 20 Million Sample Dataset for Malware Research

Sophos and ReversingLabs on Monday announced SoReL-20M, a database of 20 million Windows Portable Executable files, including 10 million malware samples.

Aimed at driving security improvements across the industry, the database provides metadata, labels, and features for the files within, and enables interested parties to download the available malware samples for further research.

Containing a curated and labeled set of samples and relevant metadata, the publicly-accessible dataset is expected to help accelerate machine learning research for malware detection.

Machine learning models are built on data, yet the security field lacks a standard, large-scale dataset that all types of users, from independent researchers to laboratories and corporations, can easily access. Sophos argues that this gap has so far slowed progress.

“Obtaining a large number of curated, labeled samples is both expensive and challenging, and sharing data sets is often difficult due to issues around intellectual property and the risk of providing malicious software to unknown third parties. As a consequence, most published papers on malware detection work on private, internal datasets, with results that cannot be directly compared to each other,” the company says.

The SoReL-20M dataset aims to address the problem. It is a production-scale dataset covering 20 million samples, including 10 million disarmed pieces of malware.

For each sample, the dataset includes labels, detection metadata, and features extracted based on the EMBER 2.0 dataset. Complete binaries are provided for the included malware samples.

Also provided are PyTorch and LightGBM models already trained on this data as baselines, along with the scripts needed to load and iterate over the data and to load, train, and test the models.
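To make "load and iterate over the data" concrete, here is a minimal Python sketch of batching labeled samples. It is an illustration only and does not reproduce the scripts shipped with SoReL-20M: the `Sample` record, its field names, the 8-value feature vector, and the helper functions are all hypothetical stand-ins, and the data is synthetic.

```python
import random
from dataclasses import dataclass
from typing import Iterator, List, Sequence


@dataclass
class Sample:
    """Hypothetical record; field names are assumptions, not the real schema."""
    sha256: str            # stand-in identifier for the file
    features: List[float]  # stand-in for an extracted feature vector
    is_malware: int        # label: 1 = malware, 0 = not malware
    detections: int        # stand-in for detection metadata


def make_toy_samples(n: int, dim: int = 8, seed: int = 0) -> List[Sample]:
    """Build synthetic samples, half of them labeled as malware."""
    rng = random.Random(seed)
    samples = []
    for i in range(n):
        label = i % 2  # alternate labels so half the samples are malware
        features = [rng.random() for _ in range(dim)]
        detections = rng.randint(5, 60) if label else 0
        samples.append(Sample(f"{i:064x}", features, label, detections))
    return samples


def iter_batches(samples: Sequence[Sample], batch_size: int) -> Iterator[List[Sample]]:
    """Yield consecutive fixed-size batches; the last one may be smaller."""
    for start in range(0, len(samples), batch_size):
        yield list(samples[start:start + batch_size])


if __name__ == "__main__":
    data = make_toy_samples(10)
    for batch in iter_batches(data, batch_size=4):
        malware_count = sum(s.is_malware for s in batch)
        print(f"batch of {len(batch)} samples, {malware_count} labeled malware")
```

A training loop would consume each batch in turn. At the scale of 20 million samples, a real loader would read from disk rather than hold everything in memory, which this toy version does not attempt.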

Because the malware being released has been disarmed, it would require “knowledge, skill, and time to reconstitute” and run the samples, Sophos says.

The company acknowledges that skilled attackers could learn from these samples or use them to create attack tools, but argues that there “are already many other sources attackers could leverage to gain access to malware information and samples that are easier, faster and more cost effective to use.”

Thus, the company believes that the disarmed samples are more valuable to security researchers looking to advance their independent defenses.

The disabled malware samples, which have been in the wild for a while, are expected to call back to infrastructure that has since been dismantled. Moreover, they should be detected by most anti-virus vendors. With metadata released alongside the samples, detection is expected to improve.

“As an industry, we know malware isn’t limited to Windows or even executable files, which is why researchers and security teams will always need more data,” said ReversingLabs, which claims to provide a reputation database of more than 12 billion goodware and malware files.

Written By

Ionut Arghire is an international correspondent for SecurityWeek.
