Root Cause Analysis: Stop Playing Whack-a-Mole

Recently, a piece of Point-of-Sale (POS) malware, Backoff POS, has become big news. I read several different write-ups on the malware, including the US-CERT alert (TA14-212A) that was released in late July. In reviewing the different write-ups, I found a good deal of information regarding post-infection Indicators of Compromise (IOCs) to help organizations assess whether or not they have been compromised by Backoff POS. The information I saw was great, and it is a good thing that organizations were able to receive such detailed IOC information. But I must admit that I was quite surprised by what I didn’t see in any of the write-ups I reviewed. Allow me to explain.

As a practitioner, I am often asked by customers how they can best mitigate or reduce the risk presented by a variety of threats. Point-of-Sale malware is one of those threats, for obvious reasons. The damage to an organization (monetary, public relations, or otherwise) from a breach involving the theft of payment card data can be enormous. I get many questions when I meet with customers, but questions on mitigating or reducing risk are by far the most difficult. These questions require an intimate knowledge of specific threat vectors. In other words, for a given risk or threat, I need to know how that threat can get into my organization in order to try to keep it out.

Given this, I was startled by how little information was available on the initial delivery mechanism or initial infection vector into the organization for Backoff POS.  All of the information I had access to was either about the malware itself, or its behavior on the network following infection. That’s all great information and should absolutely be fully leveraged, but it is all reactive information. Put another way, that information does not help the proactive defender to mitigate or reduce the risk presented by Backoff.

My intent here is not to pick on those who researched and analyzed Backoff or to belittle their work, which I thought was excellent. Rather, I want to raise awareness regarding something that we as a community do not do enough of – root cause analysis. In other words, the question of “how or why are we getting infected?” is an important one in my opinion, but one that is often paid too little attention. Instead, as a community, we seem to accept as our fate the need to play whack-a-mole. Allow me to elaborate.

On any given day, an organization will detect or receive notification regarding any number of infected systems on the network. The organization will then perform incident response accordingly, as we might expect. Those of us who have worked in the field of incident response for a while recognize this as a routine part of our day – just like drinking our morning coffee. As part of our incident response, we will improve and tighten our controls to prevent what happened today from happening tomorrow. Seems like a good approach, right? Yes, absolutely – except for the fact that tomorrow, the attackers will be onto something else that we probably don't have controls in place for.

If we take a step back, we see that from this perspective, incident response can begin to feel a bit like the arcade game whack-a-mole. Kill 12 infected systems today and their associated infection vectors, and tomorrow, 15 more will pop up. I'm not suggesting that we abandon this – incident response absolutely needs to be performed for systems we know are infected. Rather, I'm suggesting that we think about treating the cause of the infections, rather than the symptoms.  If we can treat the cause of the infections, we will have far fewer symptoms to treat.

Getting to the root cause involves a level of understanding beyond that of simply identifying that a system is infected. We need to understand what specifically enabled or facilitated the infection. It's important to remember that root cause and infection vector are two different things. Identifying the infection vector allows us to know how the malicious payload was delivered. Identifying the root cause allows us to understand why the malicious payload succeeded in infecting the system. There is a subtle difference there.

Consider the all-too-common example of a drive-by re-direct attack delivering an exploit to a vulnerable version of Java. The infection vector tells us that an unsuspecting user (the innocent bystander) was re-directed to a malicious site that delivered an exploit. If we block the malicious site, there will be another one (or another 1,000) tomorrow. The root cause, on the other hand, tells us that the version of Java on the infected system was vulnerable, and it is upon this that the attackers preyed.

So how can we identify the root cause of infection? In order to identify root cause, we need to re-construct exactly what transpired during the infection to fully understand the sequence of events. In order to fully understand the sequence of events, we need to precisely extract only the relevant network traffic data and endpoint data. In order to precisely extract only the relevant data, we need to issue precise, targeted, and incisive queries across that data. In other words, we need to perform forensics to re-construct and fully understand what occurred.
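
As a rough illustration of what a precise, targeted query might look like, here is a minimal Python sketch that pulls only one host's records from the minutes leading up to detection and orders them into a timeline. The record fields (`time`, `host`, `url`), the function name `events_before_detection`, the ten-minute window, and all of the sample data are assumptions made up for this example; real network and endpoint data will look different and will usually live in a purpose-built platform rather than a Python list.

```python
from datetime import datetime, timedelta


def events_before_detection(events, host, detected_at, window=timedelta(minutes=10)):
    """Return one host's events from the window leading up to detection, oldest first."""
    start = detected_at - window
    relevant = [
        e for e in events
        if e["host"] == host and start <= e["time"] <= detected_at
    ]
    return sorted(relevant, key=lambda e: e["time"])


# Hypothetical proxy log records; hostnames and URLs are placeholders.
proxy_log = [
    {"time": datetime(2014, 8, 1, 9, 3, 5), "host": "ws-17", "url": "http://exploit.example/applet.jar"},
    {"time": datetime(2014, 8, 1, 9, 2), "host": "ws-17", "url": "http://news.example/story"},
    {"time": datetime(2014, 8, 1, 9, 3), "host": "ws-22", "url": "http://intranet.example/"},
    {"time": datetime(2014, 8, 1, 9, 3), "host": "ws-17", "url": "http://redirector.example/r"},
    {"time": datetime(2014, 8, 1, 7, 30), "host": "ws-17", "url": "http://mail.example/"},
]

# Assume the infection on ws-17 was detected at 09:05.
timeline = events_before_detection(proxy_log, "ws-17", datetime(2014, 8, 1, 9, 5))
for event in timeline:
    print(event["time"].isoformat(), event["url"])
```

The point of the sketch is the narrowing: unrelated hosts and traffic from hours earlier drop out, leaving a short, ordered sequence (page visit, re-direct, exploit delivery) that an analyst can actually reason about.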

What can we do once we identify the root cause? We can work to address it. For example, if vulnerable versions of Java are the root cause of 80% of our malicious code infections, we can work with IT to understand why we are running a vulnerable version of Java and correct that. Think of the ramifications here: By performing forensics to identify root cause and subsequently addressing the root cause, we could potentially achieve a five-fold decrease in malicious code infections. How do I know this?  I've seen it happen with my own eyes inside an enterprise.
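
The arithmetic behind that five-fold figure can be sketched in a few lines of Python. The incident labels and the helper name `reduction_factor` below are hypothetical, and the sketch assumes, optimistically, that fixing a root cause eliminates every infection attributed to it.

```python
from collections import Counter


def reduction_factor(root_causes, addressed):
    """Return how many times fewer incidents remain once one root cause is eliminated."""
    counts = Counter(root_causes)
    total = sum(counts.values())
    remaining = total - counts[addressed]
    if remaining == 0:
        return float("inf")  # every incident shared the addressed root cause
    return total / remaining


# Hypothetical root-cause labels for ten incidents; 8 of 10 (80%) trace back to Java.
root_causes = ["vulnerable Java"] * 8 + ["weak password", "unpatched browser plug-in"]

print(Counter(root_causes).most_common(1))                 # [('vulnerable Java', 8)]
print(reduction_factor(root_causes, "vulnerable Java"))    # 5.0
```

Eliminating 80% of the incidents leaves 20%, which is one fifth of the original volume, hence the five-fold decrease.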

As an added benefit, when there are fewer commodity malicious code infections to respond to, we can focus on other questions that are often overlooked for lack of time. For example, we might want to analyze our data looking for more sophisticated threats, or perhaps understand if we have particularly unusual traffic on our network that requires additional investigation. There is no shortage of good ways to invest newly liberated human resources.

Root cause analysis is a great thing, unless you like playing whack-a-mole, that is.

Joshua Goldfarb (Twitter: @ananalytical) is CTO – Emerging Technologies at FireEye and has over a decade of experience building, operating, and running Security Operations Centers (SOCs). Before joining nPulse Technologies, which was acquired by FireEye, as its Chief Security Officer (CSO), he worked as an independent consultant where he consulted and advised numerous clients in both the public and private sectors at strategic and tactical levels. Earlier in his career Goldfarb served as the Chief of Analysis for US-CERT where he built from the ground up and subsequently ran the network, physical media and malware analysis/forensics capabilities. Goldfarb holds both a B.A. in Physics and an M.Eng. in Operations Research and Information Engineering from Cornell University.