When I was a junior systems administrator just out of college, I worked at a major hospital as a systems analyst in the clinical systems group. This was around 1987 or so, back in the days when a 400 Megabyte hard drive cost $30,000 and weighed 300 pounds, and Sun Workstations ran 4.2BSD and the UNIX operating system was a pretty stripped-down thing we'd consider a “micro kernel” by today's standards. We had a couple of Pyramid 9020s that ran Ingres databases, which were accessed via ASCII terminals and the Sun Workstations. Back in those days, the Pyramids were considered as having very fast I/O, perfect for database machines, because Pyramid had put a lot of work into developing smart I/O controllers that rearranged disk writes and cached, and a bunch of clever stuff. But there was a problem: ours often crashed. And when they crashed, the Ingres databases would get corrupted and life was not good for system administrators who had to come in at 3am to restart the system and get the databases back online. We were literally on the cutting edge of client/server computing in UNIX-land, and we bled a lot.
Eventually, the crashes became worse and worse and word of the problem percolated higher and higher up the management chain. And so it came to pass that a very senior person from the hospital called a very senior person at Pyramid and told them that the systems' unreliability was putting one of their flagship accounts in jeopardy. And Pyramid's response was to send out their best troubleshooter, a guy named Sylvan.
Sylvan was a UNIX guru who smoked like a chimney and drank coffee like a supermassive black hole gobbling down interstellar dust. I was tasked with getting Sylvan whatever he needed to fix the crashes. Sylvan ran a bunch of commands on the system and printed out some files, then went to smoke and drink coffee and think. I was expecting him to spend a lot of time standing in front of the console, demonstrating command-line wizardry and maybe muttering incantations, but – no – I was to be disappointed. I was an underling so I didn't rate to be involved in the meeting that happened later that week, but Sylvan dropped by my cube and we went out for lunch someplace quiet and he told me what he had done to fix the systems.
He had produced a chart mapping all of the system crashes that had happened that year (based on data that the system recorded during a kernel panic) and tabulated the causes that the system reported for the crashes.
The first metric he presented was, apparently, that there were five things causing crashes, and of those five things one was responsible for about 90% of the problems. He had also tabulated the downtime attributable to each cause, so we could see either that one problem had caused 90% of the crashes or that it had caused something like 99% of the downtime. Once Sylvan had established those facts to management's comfort, he produced another chart that mapped the causes of the crashes to system patches (Pyramid called them “PTFs”, for “Program Temporary Fixes”) that had not been installed because the system administrators hadn't been granted enough scheduled downtime to apply the fixes, and weren't confident that the fixes were actually going to work.
Sylvan's concluding chart summarized the amount of downtime that was caused unnecessarily by not applying patches, compared to the amount of downtime for unknown reasons (nearly zero). He observed that we had shot ourselves repeatedly in the same foot. Sylvan had been kind enough to further cover certain junior systems administrators' asses by documenting how many times we had requested downtime to install PTFs and been told that the system couldn't be taken down. His management recommendation was that we should ask the guys who ran the mainframes how to set up a patch/regression-test and change control process.
I never saw Sylvan again, but I decided right then and there that I was going to be a super bad-ass UNIX guru, just like him, someday.
In this story, there are a lot of things related to business process analysis, not simply metrics. First and foremost, what Sylvan did was a “root cause analysis” – a fancy way of saying “figuring out why it happened.” But if he had simply walked in and said, “Your system is unreliable because you didn't install crucial patches,” it wouldn't have carried much weight. It's fine to do a root cause analysis and understand the problem, but communicating that understanding effectively is crucial to bringing about a desired change.
That's where metrics came in. The basic metric wasn't very complicated: “90% of your problems were caused by not installing 5 critical patches.”
Later in this series I will get to the issue of how to present a contentious metric, but in this case, there wasn't much anyone could argue with – was it 5 patches or 6? Who cares, really? Sylvan made his point clearly enough.
We could have argued (maybe) about the causal relationship between crashes and patches, but that would have been difficult, since each of the patches contained a manifest that said “this patch includes fixes for the following problems: (list)”. In the case of the Pyramid system, it was especially straightforward, since the system fail-check would record a check code (other UNIX systems would just say: “Panic: some gobbledygook”) which was referenced in the patch manifests.
All Sylvan's script did was pull the check codes from the system crash dumps and directly match them against the manifest files, sorting them by patch file and leaving aside the handful that were not associated with a known patch. Sylvan was careful to explain to me how he had done it, so there was no question of anyone challenging his root cause analysis when he presented it.
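I never saw Sylvan's actual script, so what follows is only a rough sketch of the same idea in Python, not a reconstruction. It assumes the check codes have already been pulled out of the crash dumps into a list, and that each patch manifest has been reduced to a set of the check codes it fixes. The function name, the patch names and the check codes are all made up for illustration.

```python
from collections import Counter


def tally_crashes_by_patch(crash_codes, manifests):
    """Count crashes per patch.

    crash_codes: list of check codes, one entry per crash (hypothetical format).
    manifests:   dict mapping patch name -> set of check codes that patch fixes.
    Returns (Counter of patch -> crash count, list of codes with no known patch).
    """
    # Invert the manifests so we can look up a patch by check code.
    # If two patches claim the same code, the first one seen wins.
    code_to_patch = {}
    for patch, codes in manifests.items():
        for code in codes:
            code_to_patch.setdefault(code, patch)

    by_patch = Counter()
    unmatched = []
    for code in crash_codes:
        patch = code_to_patch.get(code)
        if patch is None:
            unmatched.append(code)  # set aside: no known patch
        else:
            by_patch[patch] += 1
    return by_patch, unmatched


# Made-up sample data, purely for illustration.
manifests = {
    "PTF-A": {"C101", "C102"},
    "PTF-B": {"C200"},
}
crash_codes = ["C101", "C101", "C200", "C102", "C999", "C101"]

by_patch, unmatched = tally_crashes_by_patch(crash_codes, manifests)
for patch, count in by_patch.most_common():
    share = 100 * count / len(crash_codes)
    print(f"{patch}: {count} crashes ({share:.0f}%)")
print(f"no known patch: {len(unmatched)}")
```

Sorting by `most_common()` puts the worst offender at the top, which is the ordering you want on the chart: the reader sees the dominant cause first and the unexplained handful last.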
This is an example of how you can develop a critical metric from data that you didn't realize was scattered all over your system, waiting to be collected and analyzed. When you start your metrics program, you'll find that a great deal of information can be gleaned from existing data that gets stored in various places – most likely in your system logs.
• If you understand your metric and the underlying data, you can prepare a couple different ways of looking at it, in case you don't get your point across the first time.
• The order in which you present your metrics should support and reinforce them; get buy-in on the little pieces then present your conclusion, which will then be unarguable.
• Metrics can build atop each other; the “hard part” of what Sylvan did was figuring out how to collect the data that the system left here and there in order to construct the higher-level pieces of data that he needed. Sylvan calculated total downtime with a script that scrubbed through the database logs looking for a “database restarted” message, then measured the time between that and the last successful transaction, then only counted occurrences where there was a significant gap. This marked the beginning of my career-long love affair with system logs.
• If you're not relying on common heuristics, briefly explain your methodology – it cements your metric in reality and makes it very hard to argue against.
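The downtime calculation described in the list above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not Sylvan's script: the log format (`timestamp | message`), the “transaction committed” marker, the five-minute threshold for a “significant gap” and the sample log lines are all invented here. Only the “database restarted” message comes from the story.

```python
from datetime import datetime, timedelta


def total_downtime(log_lines, min_gap=timedelta(minutes=5)):
    """Sum the gaps between the last good transaction and each restart.

    log_lines: iterable of "YYYY-MM-DD HH:MM:SS | message" strings
               (a hypothetical format) in chronological order.
    min_gap:   gaps shorter than this are treated as routine restarts
               and ignored (the threshold is an assumption).
    Returns (total downtime as a timedelta, number of outages counted).
    """
    last_ok = None
    outages = []
    for line in log_lines:
        stamp, _, message = line.partition(" | ")
        when = datetime.strptime(stamp, "%Y-%m-%d %H:%M:%S")
        if "transaction committed" in message:
            last_ok = when
        elif "database restarted" in message and last_ok is not None:
            gap = when - last_ok
            if gap >= min_gap:
                outages.append(gap)
            # Forget the old transaction so that back-to-back restarts
            # do not count the same gap twice.
            last_ok = None
    return sum(outages, timedelta()), len(outages)


# Made-up log lines, purely for illustration.
logs = [
    "2000-01-01 02:10:00 | transaction committed",
    "2000-01-01 02:12:00 | database restarted",  # 2-minute gap: ignored
    "2000-01-01 09:00:00 | transaction committed",
    "2000-01-01 10:30:00 | database restarted",  # 90-minute gap: counted
]

downtime, count = total_downtime(logs)
print(f"{count} outage(s), total downtime {downtime}")
```

Note that this measures from the last successful transaction, so a quiet system that happens to restart will look like an outage. That is why the “significant gap” threshold matters, and why you should say what threshold you used when you present the number.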
Next up: The Evolution of an Important Metric: The Poverty Line, and some of the dos and don'ts we can learn from how the Poverty Line was established and calculated.