
Security Operations: What is Your Signal-to-Noise Ratio?

More Signal, Less Noise! With a Large Volume of Even the Highest Priority Security Alerts, Analysts Cannot Successfully Review Each Alert

When I chat with security leaders and practitioners, they often ask me for recommendations on how they can improve their security posture. I generally make several recommendations, which depend heavily on the specific organization and its maturity. One recommendation I almost always make is for the organization to take its security operations workflow to the next level by improving its efficiency. This is a topic I am passionate about, and it is one that I would like to discuss with a wider audience in this piece. Human analyst and incident responder resources are always in short supply, and an efficient security operations workflow is the single biggest force multiplier I have found to date for those resources.

There are many ways that an organization could look to improve the efficiency of its security operations workflow, but there is one way in particular that presents itself most prominently to me.  Most organizations have a variety of log data streaming into a centralized log collection and aggregation system (be it a SIEM, data warehouse, or otherwise).  Most organizations use that data to drive their alerting. Often, the alerts that are produced populate a ticketing or incident management system, and it is often from there that the events that make up the work queue are drawn.

Whatever the specific technologies and processes involved in this workflow, one point jumps out at me emphatically. Our efficiency as an organization correlates most strongly with the quality of our alerts. Put another way, our work queue defines what our scarce human resources work on in a given day. Given that, doesn’t it make sense to supply that work queue with the highest quality, highest fidelity alerts possible to ensure that human resources spend their precious cycles on the highest value work? In other words, more signal, less noise.

At the same time, recent media reports discussing various high profile breaches have indicated that, often, numerous alerts fired as a result of the intrusion activity. In many cases, the alerts were not properly handled, causing the breaches to remain undetected for months. I'm sure there are many angles from which these media reports can be dissected. Rather than play the blame game, I would like to discuss a subject that remains a challenge for our profession as a whole and that I alluded to above: the signal-to-noise ratio.

Wikipedia defines the signal-to-noise ratio as "a measure used in science and engineering that compares the level of a desired signal to the level of background noise."  In other words, the more you have of what you want, and the less you have of what you don't want, the easier it is to measure something.  

Let's illustrate this concept by imagining a conversation between two people in a noisy cafe. If I record that conversation from the next table, upon playback, it will be very difficult for me to truly understand what was discussed. Conversely, if I record that conversation in a quiet room, it will be much easier to understand what was discussed upon playback. The signal-to-noise ratio in the second scenario is much higher than in the first scenario.

The same concept applies to security operations and incident response. In security operations, true positives are the signal, and false positives are the noise. Consider the case of two different Security Operations Centers (SOCs), SOC A and SOC B.  In SOC A, the daily work queue contains approximately 100 reliable, high fidelity, actionable alerts. In SOC A, an analyst is able to review each alert. If incident response is necessary for a given alert, it is performed. In SOC B, the daily work queue contains approximately 100,000 alerts, almost all of which are false positives. Analysts attempt to review the alerts of the highest priority.

Because of the large volume of even the highest priority alerts, analysts are not able to successfully review all of the highest priority alerts.  Additionally, because of the large number of false positives, SOC B's analysts become desensitized to alerts and do not take them particularly seriously.

One day, 10 additional alerts relating to payment card stealing malware fire within a few minutes of each other.

In SOC A, where every alert is reviewed by an analyst, where the signal-to-noise ratio is high, and where 10 additional alerts seem like a lot, analysts successfully identify the breach less than 24 hours after it occurs. SOC A's team is able to perform analysis, containment, and remediation within the first 24 hours of the breach. The team is able to stop the bleeding before any payment card data is exfiltrated. Although there has been some damage, it can be controlled. The organization can assess the damage, respond appropriately, and return to normal business operations.

In SOC B, where an extremely small percentage of the alerts are reviewed by an analyst, where the signal-to-noise ratio is low, and where 10 additional alerts don't even raise an eyebrow, the breach remains undetected. Months later, SOC B will learn of the breach from a third party. The damage will be extensive, and it will take the organization months or years to fully recover.
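
The gap between the two SOCs can be put in rough numbers. The sketch below is only an illustration: the queue sizes (100 and 100,000 alerts per day) come from the scenario above, while the review capacity of 200 alerts per day and the function name `review_coverage` are assumptions made up for the example, not figures from any real SOC.

```python
def review_coverage(daily_alerts: int, daily_review_capacity: int) -> float:
    """Return the fraction of the daily work queue that analysts can review."""
    if daily_alerts <= 0:
        return 1.0  # an empty queue is trivially fully reviewed
    return min(1.0, daily_review_capacity / daily_alerts)


# Hypothetical capacity: the scenario gives queue sizes, not staffing levels.
DAILY_REVIEW_CAPACITY = 200

for name, daily_alerts in (("SOC A", 100), ("SOC B", 100_000)):
    coverage = review_coverage(daily_alerts, DAILY_REVIEW_CAPACITY)
    print(f"{name}: {coverage:.1%} of the daily queue reviewed")
```

Under that assumed capacity, SOC A reviews its entire queue while SOC B reviews a fraction of one percent of it, which is why 10 extra alerts stand out in one and vanish in the other.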

Unfortunately, in my experience, there are many more SOC B's out there than there are SOC A's. It is relatively straightforward to turn a SOC B into a SOC A, but it does require experienced professionals, organizational will, and focus.  How do I know? I've turned SOC B's into SOC A's several times during my career.

We are fortunate to have some great technology choices these days that we can leverage to improve our security operations and incident response functions. These technology choices can enable us to learn of and respond to breaches soon after they occur.

Before purchasing any technology intended to produce alerts destined for the work queue, we should ensure that it allows us to home in on the activity we want to identify (the true positives/the signal), while minimizing the activity we do not want to identify (the false positives/the noise). As always, these technologies are tools that need to be properly leveraged as part of the larger people, process, and technology picture.

What is your signal-to-noise ratio? Is it high enough to detect the next breach, or could it stand to be strengthened? I would posit that the ratio of true positives to false positives (the signal-to-noise ratio) is an important metric that all organizations should review. Not doing so could have dire consequences.
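
For teams that want to start tracking this metric, here is a minimal sketch of how it might be computed from closed alerts. It assumes each reviewed alert has been given a disposition of either "true_positive" or "false_positive"; those label names, the `alert_quality` function, and the sample counts are all hypothetical and would need to be mapped onto whatever your ticketing or incident management system actually records.

```python
from collections import Counter
from typing import Iterable, NamedTuple


class AlertQuality(NamedTuple):
    true_positives: int
    false_positives: int
    signal_to_noise: float  # true positives divided by false positives
    precision: float        # true positives divided by all dispositioned alerts


def alert_quality(dispositions: Iterable[str]) -> AlertQuality:
    """Summarize alert quality from per-alert dispositions.

    Labels other than "true_positive" and "false_positive" are ignored.
    """
    counts = Counter(dispositions)
    tp = counts["true_positive"]
    fp = counts["false_positive"]
    total = tp + fp
    # With zero false positives the ratio is unbounded; report infinity.
    ratio = tp / fp if fp else float("inf")
    precision = tp / total if total else 0.0
    return AlertQuality(tp, fp, ratio, precision)


# Made-up sample: 90 true positives and 10 false positives in one day.
sample = ["true_positive"] * 90 + ["false_positive"] * 10
print(alert_quality(sample))
```

Precision is reported alongside the raw ratio because it stays between 0 and 1 even when a queue has no false positives at all, which makes it easier to trend over time.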

Joshua Goldfarb (Twitter: @ananalytical) is CTO – Emerging Technologies at FireEye and has over a decade of experience building, operating, and running Security Operations Centers (SOCs). Before joining nPulse Technologies, which was acquired by FireEye, as its Chief Security Officer (CSO), he worked as an independent consultant, where he consulted and advised numerous clients in both the public and private sectors at strategic and tactical levels. Earlier in his career, Goldfarb served as the Chief of Analysis for US-CERT, where he built from the ground up and subsequently ran the network, physical media, and malware analysis/forensics capabilities. Goldfarb holds both a B.A. in Physics and an M.Eng. in Operations Research and Information Engineering from Cornell University.