With RSA Conference on the horizon, odds are that if you make it to the exhibit floor, you will hear a lot about data science and machine learning.
Security vendors old and new are touting the powers of data science to solve security problems. And while these technologies have real value, the terms are rapidly becoming empty marketing buzzwords.
To keep our collective heads above water, it is important to understand the realities behind these technologies so we can separate the truth from the hype and make well-informed security decisions.
A quick intro to data science and why it matters
The world of data science can be hard to navigate, not simply because it involves lots of hard math, but also because it spans an enormously broad set of disciplines. Data science is concerned with the many ways knowledge can be extracted from data, drawing on mathematics, statistics, machine learning, and a variety of analytics, to name a few.
A subset of data science, machine learning enables software to iteratively learn from data and adapt without being explicitly programmed. For example, machine learning can reveal low-level traits that command-and-control messages have in common, or flag unusual employee behavior that may signal an impending data theft. These characteristics might be unknown beforehand, but machine-learning models can learn to recognize them from the data.
These examples illustrate critical concepts that make data science and machine learning important to security professionals.
First, the intelligence we extract from very large data sets tends to be fairly long-lived. Instead of chasing every URL a command-and-control server uses, we can learn its core underlying behavior and recognize it wherever it goes. This allows our security detections to stay well ahead of attackers.
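As a rough illustration of learning a behavior rather than chasing individual URLs, the sketch below scores how regular a host's outbound connection timing is, on the assumption that automated check-ins tend to recur at steady intervals wherever the server moves. The function name `beacon_regularity`, the sample timestamps, and the choice of the coefficient of variation as the measure are all hypothetical and chosen for illustration; this is not a method described by any particular vendor.

```python
from statistics import mean, pstdev


def beacon_regularity(timestamps):
    """Return the coefficient of variation of the gaps between connections.

    Values near 0 mean the connections are evenly spaced (machine-like);
    larger values mean irregular timing (more typical of human activity).
    """
    if len(timestamps) < 3:
        raise ValueError("need at least three timestamps")
    ordered = sorted(timestamps)
    # Gaps between consecutive connections, in seconds.
    gaps = [later - earlier for earlier, later in zip(ordered, ordered[1:])]
    average_gap = mean(gaps)
    if average_gap == 0:
        return 0.0
    return pstdev(gaps) / average_gap


# Hypothetical connection times in seconds, invented for this example.
automated = [0, 60, 121, 180, 241, 300]
human = [0, 5, 200, 210, 900, 905]

print(beacon_regularity(automated))
print(beacon_regularity(human))
```

The score depends only on timing, so it would stay the same if the destination URL or IP address changed. A real model would combine many such traits and learn its thresholds from data rather than rely on a single hand-picked feature.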
Second, machine learning extends intelligence to the local environment. An intelligence feed will never be able to tell you when one of your employees starts behaving abnormally. It’s the sort of thing that must be learned locally, and is often the essential context needed to find a live threat.
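To make the idea of locally learned behavior concrete, here is a minimal sketch that compares one employee's activity today against that same employee's own history. The function name `abnormality_score`, the upload figures, and the use of a simple z-score are assumptions made for illustration; production systems typically use richer models and many more signals.

```python
from statistics import mean, stdev


def abnormality_score(history, today):
    """Return how many standard deviations today's value sits from this user's own baseline."""
    if len(history) < 2:
        raise ValueError("need at least two past observations")
    baseline = mean(history)
    spread = stdev(history)
    if spread == 0:
        # A perfectly flat history: any change at all is a departure from it.
        return 0.0 if today == baseline else float("inf")
    return (today - baseline) / spread


# Hypothetical daily upload volumes (MB) for a single employee.
past_uploads_mb = [40, 55, 38, 47, 52, 44, 49]

print(abnormality_score(past_uploads_mb, 46))   # an ordinary day
print(abnormality_score(past_uploads_mb, 900))  # a sudden, very large upload
```

The baseline comes entirely from data observed inside the organization, which is why an external intelligence feed cannot supply it.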
Not all data is equal
Data science models inherently depend on the quality of data they consume. The better the data, the more you will be able to learn. An entire industry has been spawned by analyzing logs and events generated by other systems. While this approach may help connect the dots between observed events, it rarely finds hidden threats that go undetected in the first place.
By nature, logs are a secondary source of data that briefly summarize an event. Information that is not contained in the log is lost and unavailable for further analysis. Equally important, logs are only as good as the systems that generated them. If an upstream firewall or security device fails to detect a threat, there will be no log.
This is a fundamental issue. It is the job of a cyber security solution to detect threats that slip by standard layers of defense. Data science and machine learning can be applied to any data source, not just log data. Direct analysis of traffic, files or devices allows us to detect what was previously invisible.
Focus on answers, not data
Having looked at the inputs to a data science detection model, we can now turn our attention to what these models actually deliver. And this is where things can get a little dicey if you’re not careful.
While the promises may sound enticing, the vast majority of security and analytics solutions require a significant amount of human effort and attention in order to deliver value. Needless to say, most security organizations don’t have the luxury of extra time or staff.
As a precaution, be sure to evaluate whether a prospective solution makes life easier or harder for your staff. Many products generate mountains of anomalies that a human analyst must investigate, and this bottleneck will severely limit the product's real-world value.
It is critically important for security products to actually deliver high-confidence detections and answers. Of course, analysts will always need solid evidence to validate that a threat is real. But this should be an effort of verification and not require the analyst to do the heavy lifting of intensive analysis and diagnosis.
These are, of course, not the only factors to consider when evaluating data science and machine learning solutions. However, they can provide some context to cut through the hype and find the data science solutions that are most likely to deliver real value.