There are many important and useful tools related to the metrics landscape; let’s take a look at some of them and how they fit together. For the sake of this discussion, I’ll stick with the definition of “metrics” that I offered previously:
A metric is some data and an algorithm for reducing and presenting it to tell a story.
Different metrics tools such as statistics, benchmarks and heuristics each have their place in our intellectual landscape, but the underlying principle behind all of them is rooted in The Scientific Method. The method of science is one of humanity’s great inventions for controlling and understanding the world around us. The way science works, you attempt to gain understanding of something by hypothesizing some kind of cause/effect relationship in that thing, then hypothesizing how a change in that cause’s inputs will affect the outputs. After you’ve done that, you alter some of the inputs in an experimental set, while leaving another set – the control set – unaltered, and see if the results of the change support or destroy your hypothesis.
Let’s imagine we’re all sitting in a conference room and someone asks “does anyone know how to dim the lights in here?” When I jump up from my seat and start flipping light switches on the wall, I’m engaging in science even if it’s not a particularly grandiose experiment: I look at one of the switches and implicitly assume that “light switches control lights” based on past experiments. I observe that the lights are on now, which is my control situation, then flip a switch, performing an experiment. If the light goes off at the moment I flip the switch, I can reasonably infer a cause/effect relationship between the switch and the light. I could do an annoying repeat experiment by flipping the switch a couple more times, so I could be more sure of the cause/effect relationship between that switch and the light going on and off.
Most of us don’t think of the many things we do as embodying The Scientific Method – we’re too busy living – but really every time we’re hypothesizing something about cause/effect that’s what we’re doing. It’s a necessary part of the process of acquiring knowledge: without experience with light switches you cannot know how to turn on and off a light, and after years of using light switches and having them work in accordance with your subliminal theory of light switches, eventually you become comfortable establishing a general rule that light switches mostly work that way. If you found yourself in a conference room with no apparent light switches, you would start the process of turning the light off by hypothesizing that there was a switch (because that’s how your experience tells you lights work) but you just haven’t figured out what it looks like.
If you’ve ever worked with statistics, it’s a sure thing that sooner or later someone will say “correlation does not imply causation.” Finding a correlation between two things doesn’t tell you which thing causes the other, or whether one even causes the other at all. Let’s consider global warming: planetary temperatures have increased steadily as the number of pirates has dropped. That’s a fact. But observing that fact does not mean that lack of pirates is responsible for global warming – in fact there are an infinite number of wrong hypothetical cause/effect relationships you can come up with for any given situation, and maybe one or a small number of correct ones. The reason we use statistics is to explore our data sets looking for possible correlations that we might be interested in confirming separately. One of the worst ways you can let your science go off the rails is by jumping to conclusions without confirming a cause/effect relationship; when you overlook something, you can come up with a completely wrong answer that looks perfectly reasonable. That’s why the “correlation does not imply causation” mantra gets repeated so often: it’s a reminder to perform a sanity check on your hypothesis and cause/effect analysis.
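To make the pirate example concrete, here is a minimal sketch of how a correlation coefficient gets computed. The numbers are invented purely for illustration (they are not real pirate counts or temperature measurements), and the function name `pearson` is my own choice. The point is only that the arithmetic will happily report a strong relationship without knowing anything about cause and effect.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length sequences with at least 2 points")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # Covariance and variances, left unnormalized because the factors cancel.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined when one series is constant")
    return cov / sqrt(var_x * var_y)


# Invented, illustrative values only: NOT real data.
pirates = [5000, 3000, 1500, 400, 20]
temperature_c = [14.0, 14.2, 14.5, 14.8, 15.2]

r = pearson(pirates, temperature_c)
print(f"correlation between pirates and temperature: {r:.2f}")
```

With these made-up numbers the coefficient comes out strongly negative, which is exactly the trap: the statistic flags something worth investigating, and the sanity check on cause and effect is still up to you.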
In the sense that a statistic is a tool for exploring cause/effect relationships in our data, it’s a “heuristic” – a system for learning and discovering from experience. In the context of metrics, heuristics are used as a way of encapsulating a “rule of thumb” or a strategy used in decision-making. It’s how we communicate about our shared experience, and we see it fairly often in security. If I had a dollar for every time I’ve told someone I work in the field of computer security and they’ve replied, “Oh, firewalls and antivirus? That kind of stuff?” I could retire to a palace somewhere. What they’re implicitly saying when they ask that is that, to them, firewalls and antivirus are related to security, or perhaps that they are what security is.
It’s important to understand the difference between learned behaviors – things that have worked for us in the past and become heuristics – and our understanding of the problem we’re dealing with. Here’s an example of a heuristic in security: you should reduce your vulnerabilities in order of most severe to least severe. Lots of people do that, and it is in fact a pretty good rule of thumb. But if you deconstruct it and look at it with a critical eye you can see where the assumptions creep in: that vulnerabilities’ severity scores correctly map to what you are experiencing at a particular moment, and that your experience of severity matches everyone else’s. One of the fascinating things about metrics is the degree to which we are concerned with shared experience versus our own. I’ll get into that topic in a lot more detail later in this series when we go into normalization and data-sharing.
For the time being, it’s good to think of metrics as being of two different kinds: ones that capture your internal experience, and ones that deal with shared experience. An example of a metric that deals with internal experience might be your vulnerability count over time, whereas the severity score that your vulnerability management system applies to individual vulnerabilities will be based on the shared experience of the security community and is probably the result of a heuristic! Whoever it was who decided that a particular vulnerability was “critical” did so using a scoring system that factors in a bunch of knowledge like: is it remotely exploitable, is there an exploit in the wild, etc. Heuristics like that are valuable because they offer consistency for our shared experience: we can trust that when the vulnerability management system says something is critical, the rating was not produced by having a blindfolded engineer throw a dart at a dartboard. So, you can think of heuristics as a valuable way of standardizing how we capture knowledge – it’s crucial to establishing our definition of whatever it is we’re going to count as part of our metric.
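Here is a rough sketch of what a severity heuristic like that might look like once it is written down. Everything in it is hypothetical: the weights, the thresholds, and the names `Vulnerability` and `severity` are assumptions of mine for illustration, not any real scoring system. What it shows is that the rating is just encoded assumptions, applied consistently.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Vulnerability:
    name: str
    remotely_exploitable: bool
    exploit_in_the_wild: bool


def severity(vuln):
    """Hypothetical rule of thumb: each risk factor adds to the score."""
    score = 0
    if vuln.remotely_exploitable:
        score += 2  # assumed weight, not taken from any standard
    if vuln.exploit_in_the_wild:
        score += 2  # assumed weight, not taken from any standard
    if score >= 4:
        return "critical"
    if score >= 2:
        return "high"
    return "low"


# Lower rank sorts first, so "critical" is fixed before "low".
RANK = {"critical": 0, "high": 1, "low": 2}

# Made-up findings for illustration.
findings = [
    Vulnerability("local-only bug", False, False),
    Vulnerability("remote bug with public exploit", True, True),
    Vulnerability("remote bug, no known exploit", True, False),
]

# The "most severe first" heuristic, expressed as a sort.
work_queue = sorted(findings, key=lambda v: RANK[severity(v)])
for v in work_queue:
    print(f"{severity(v):8s} {v.name}")
```

Notice where the assumptions live: in the weights and the thresholds. If your own experience says a local-only bug on one particular server matters more than anything else, the shared heuristic will not know that.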
[Image: 0-110 in a very short distance – launching a car off an aircraft carrier, 1988 Citroen ad]
There is one last concept I want to introduce: benchmarks. A benchmark is another way of reaching for a common experience – we establish a test that we believe captures some kind of knowledge about our problem; and then we test against it. Now, we can produce a metric of how various systems perform against the benchmark, and automatically encapsulate all the underlying assumptions that went into the benchmark, without having to refer to it every time. Let’s look at a popular benchmark: 0-60 times in automobiles. The underlying assumption in that benchmark is that cars that can go 0-60 in less time are faster (more precisely: faster at getting from 0-60!) than other cars. Given 0-60 scores for a group of cars, we can talk about how they accelerate without having to constantly redefine the underlying assumption that the 0-60 time was measured in a straight line, with the car’s traction control system enabled, a professional driver at the wheel, and so forth. The benchmark allows us to normalize our experience so that we can focus on the critical point of comparison. That’s why it’s so egregious when someone cheats on a benchmark – it hacks that underlying assumption and throws into doubt our ability to reason about common experience.
A good rule of thumb, or heuristic, to use for your own metrics program is that if you’re going to share your results outside your organization, you need to start thinking about normalization, heuristics and benchmarks. If not – if your metrics process is entirely internally-focused – then you can just worry about whether a critical number goes up or down over time. If you want to be serious about metrics, though, you’ll eventually want to compare your experience with others’ and you’ll probably get better results if you bear sharing and normalization in mind while you’re designing your process.
One of the reasons security is in a rather shabby state is that we frequently jump to conclusions and enshrine ideas as “how things should be done” without really understanding the cause/effect relationships that underlie our decisions. That’s why we can’t answer simple questions like: “How much more danger would we be in if we took out the firewall?” First off, we don’t really know how much the firewall is reducing danger; we just got one because we were told to, so we can’t establish any kind of theory about what would happen if it were gone.
To further complicate the problem we can’t take advantage of one of the other great things we gain through The Scientific Method, namely shared experience. If all of our firewalls were configured the same, then we might be able to actually take a population of users/sites with firewalls and without and compare their outcomes directly, hypothesizing that the firewall had some effect.
This is why statistics appear so attractive and powerful to us: they seem to offer a magical way to learn from someone else’s experience without having to have that experience ourselves. They let me decide to put my seatbelt on when I drive my big-wheeled over-tall 4×4, because statistics indicate that doing so reduces the likelihood I’ll get thrown from the vehicle if I am so unfortunate as to roll it. Those other guys for the last 10 years who died in roll-overs? They were the control set. The ones who wore their seatbelts and didn’t get killed were the experimental set. Over time and a lot of experiments we get confidence that seatbelts save lives. In computer security, unfortunately, the field is so new, and changes so fast, that we don’t get enough experimental data-points to let us conclusively learn much. So we assume the firewall helps because there are people we trust who have used firewalls and think they helped. And it’s probably true, but what we don’t know is how true it is.
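The comparison between a control set and an experimental set can be sketched in a few lines. The counts below are invented for illustration only (they are not real crash or breach statistics), and the function names are mine. The sketch computes the bad-outcome rate in each group and the ratio between them, which is roughly the number we wish we had for firewalls.

```python
def outcome_rate(bad_outcomes, group_size):
    """Fraction of a group that experienced the bad outcome."""
    if group_size <= 0:
        raise ValueError("group_size must be positive")
    if not 0 <= bad_outcomes <= group_size:
        raise ValueError("bad_outcomes must be between 0 and group_size")
    return bad_outcomes / group_size


def relative_risk(exp_bad, exp_size, ctl_bad, ctl_size):
    """Rate in the experimental group divided by rate in the control group."""
    control_rate = outcome_rate(ctl_bad, ctl_size)
    if control_rate == 0:
        raise ValueError("control group had no bad outcomes; ratio is undefined")
    return outcome_rate(exp_bad, exp_size) / control_rate


# Invented counts, for illustration only.
rr = relative_risk(exp_bad=10, exp_size=1000, ctl_bad=30, ctl_size=1000)
print(f"experimental group has {rr:.2f}x the control group's rate")
```

A ratio well below 1 suggests the change helped, but only if the two groups are otherwise comparable and large enough to mean something. That caveat is the whole problem with firewalls: with every site configured differently and so few data points, a number like this would be hard to trust.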
Some of the areas of humanity’s common experience are socially important enough that we study cause/effect and outcomes very closely, and have enough data that we’re able to establish theories that are as close to conclusively proven as they get: you are a fool if you smoke cigarettes, or play slot machines in Las Vegas. But what if you want to do without your firewall? We don’t have enough common experience or enough data to be able to say, really. One of the conclusions we cannot avoid from this is that we need to do a much, much, much better job of collecting information about our security experiences and outcomes, and sharing that information more effectively. That’s why things like VERIS and BSIMM are so important – they are part of the scientific experiment of computer security: measuring what we do and the outcomes that result. Keeping metrics is the only way we can improve the state of security.
One last point: metrics can be manipulative. The famous quip often attributed to Disraeli, “there are lies, damned lies, and statistics,” pretty accurately sums up one of the most common uses of metrics – to convince someone of something that you’re not really sure of. That’s one of the reasons why I use the definition of metrics I presented above. It’s important to understand that a metric really isn’t any good unless you present it along with the underlying assumptions and (at least offer to provide) the data. In computer security, we’ve seen a fair number of factoids that amount to manipulative metrics: “80% of attacks come from the inside” is a bogus metric intended to manipulate the listener into paying more attention to insider threats than they do. In some circumstances that might be justifiable, and in others it might be absolutely inaccurate – if we’re going to be serious about metrics, we need to do much, much better than that.
I’m afraid that the reason bogus metrics like the “80% of attacks” carry weight in security is that generally we’re doing such a bad job of producing metrics that people aren’t even willing to challenge such an assertion. Here is how, by the way: “How do you define ‘attack’ in that context? How did you count them? From what population did you collect your data?” Those are the three questions you should be able to answer if you’re presenting a metric:
• Define your notion of what you are measuring
• Understand the reliability of your counts: are you relying on automatic collection or subjective self-reporting?
• Understand the population of your study and control group
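One way to keep yourself honest about those three questions is to make them part of the metric itself. The sketch below is hypothetical (the `MetricClaim` class and its field names are my own invention, not part of any tool): a claim carries its definition, its collection method, and its population alongside the headline number, and can report which of the three it has left blank.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MetricClaim:
    statement: str   # the headline, e.g. "80% of attacks come from the inside"
    definition: str  # what exactly is being measured
    collection: str  # how it was counted: automatic or self-reported
    population: str  # who or what was studied, and the control group

    def unanswered(self):
        """Return the names of the three questions this claim leaves blank."""
        answers = {
            "definition": self.definition,
            "collection": self.collection,
            "population": self.population,
        }
        return [name for name, value in answers.items() if not value.strip()]

    def is_defensible(self):
        """True only when all three questions have some answer."""
        return not self.unanswered()


# The factoid as it usually arrives: a number with nothing behind it.
factoid = MetricClaim("80% of attacks come from the inside", "", "", "")
print(factoid.is_defensible(), factoid.unanswered())
```

A non-empty answer is not necessarily a good answer, of course. All this does is make it obvious when nobody has even tried.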