Disaster Recovery

True White-Knuckled Stories of Metrics in Action: Sylvan

System Crash Report

When you start your metrics program, you will find that a great deal of information can be gleaned from existing data that gets stored in various places….

Marcus Ranum

April 2, 2014

When I was a junior systems administrator just out of college, I worked at a major hospital as a systems analyst in the clinical systems group. This was around 1987 or so, back in the days when a 400-megabyte hard drive cost $30,000 and weighed 300 pounds, Sun Workstations ran 4.2BSD, and the UNIX operating system was a pretty stripped-down thing we’d consider a “micro kernel” by today’s standards. We had a couple of Pyramid 9020s that ran Ingres databases, which were accessed via ASCII terminals and the Sun Workstations. Back in those days, the Pyramids were considered to have very fast I/O, perfect for database machines, because Pyramid had put a lot of work into developing smart I/O controllers that rearranged and cached disk writes and did a bunch of other clever stuff. But there was a problem: ours often crashed. And when they crashed, the Ingres databases would get corrupted and life was not good for system administrators who had to come in at 3am to restart the system and get the databases back online. We were literally on the cutting edge of client/server computing in UNIX-land, and we bled a lot.

Eventually, the crashes became worse and worse and word of the problem percolated higher and higher up the management chain. And so it came to pass that a very senior person from the hospital called a very senior person at Pyramid and told them that the systems’ unreliability was putting one of their flagship accounts in jeopardy. And Pyramid’s response was to send out their best trouble-shooter, a guy named Sylvan.

Sylvan was a UNIX guru who smoked like a chimney and drank coffee like a supermassive black hole gobbling down interstellar dust. I was tasked with getting Sylvan whatever he needed to fix the crashes. Sylvan ran a bunch of commands on the system and printed out some files, then went to smoke and drink coffee and think. I was expecting him to spend a lot of time standing in front of the console, demonstrating command-line wizardry and maybe muttering incantations, but – no – I was to be disappointed. I was an underling so I didn’t rate to be involved in the meeting that happened later that week, but Sylvan dropped by my cube and we went out for lunch someplace quiet and he told me what he had done to fix the systems.

He had produced a chart mapping all of the system crashes that had happened that year (based on data that the system recorded during a kernel panic) and tabulated the causes that the system reported for the crashes.

The first metric he presented was, apparently, that there were five things causing crashes, and of those five things one was responsible for about 90% of the problems. He had also tabulated the downtime caused by the various things that caused the crashes, so we could see either that one problem had caused 90% of the crashes or that it had caused something like 99% of the downtime. Once Sylvan had established those facts to management’s comfort, he produced another chart that mapped the causes of the crashes to system patches (Pyramid called them PTFs, for Program Temporary Fixes) that had not been installed because the system administrators hadn’t been granted enough scheduled downtime to apply the fixes, and weren’t confident that the fixes were actually going to work.

Sylvan’s concluding chart summarized the amount of downtime that was caused unnecessarily by not applying patches, compared to the amount of downtime for unknown reasons (nearly zero). He observed that we had shot ourselves repeatedly in the same foot. Sylvan had been kind enough to further cover certain junior systems administrators’ asses by documenting how many times we had requested downtime to install PTFs and been told that the system couldn’t be taken down. His management recommendation was that we should ask the guys who ran the mainframes how to set up a patch/regression-test and change control process.

I never saw Sylvan again, but I decided right then and there that I was going to be a super bad-ass UNIX guru, just like him, someday.

In this story, there are a lot of things related to business process analysis, not simply metrics. First and foremost, what Sylvan did was a “root cause analysis” – a fancy way of saying “figuring out why it happened.” But if he had simply walked in and said, “Your system is unreliable because you didn’t install crucial patches” it wouldn’t have carried much weight. It’s fine to do a root cause analysis and understand the problem, but communicating that understanding effectively is crucial to bringing about a desired change.

That’s where metrics came in. The basic metric wasn’t very complicated: “90% of your problems were caused by not installing 5 critical patches.”

Later in this series I will get to the issue of how to present a contentious metric, but in this case, there wasn’t much anyone could argue with. Was it 5 patches or 6? Who cares, really? Sylvan made his point clearly enough.

We could have argued (maybe) about the causal relationship between crashes and patches, but that would have been difficult, since each of the patches contained a manifest that said “this patch includes fixes for the following problems: (list).” In the case of the Pyramid system, it was especially straightforward, since the system fail-check would record a check code (other UNIX systems would just say: “Panic: some gobbledygook”) that was referenced in the patch manifests.

All Sylvan’s script did was pull the check codes from the system crash dumps and directly match them against the manifest files, sorting them by patch file and leaving aside the handful that were not associated with a known patch. Sylvan was careful to explain to me how he had done it, so there was no question of anyone challenging his root cause analysis when he presented it.
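The matching step described above can be sketched in a few lines of Python. This is only an illustration of the idea, not Sylvan's script: the check-code strings, the patch names, and the shape of the manifest data are all made up here, since the article does not record the real Pyramid formats.

```python
from collections import Counter


def map_crashes_to_patches(crash_codes, manifests):
    """Count crashes per patch.

    crash_codes: list of check-code strings, one per crash (hypothetical format).
    manifests: dict of patch name -> set of check codes that patch claims to fix.
    Returns (Counter of crashes per patch, list of codes no patch covers).
    """
    # Invert the manifests: check code -> patch that fixes it.
    # If two patches list the same code, the first one seen wins.
    code_to_patch = {}
    for patch, codes in manifests.items():
        for code in codes:
            code_to_patch.setdefault(code, patch)

    per_patch = Counter()
    unmatched = []
    for code in crash_codes:
        patch = code_to_patch.get(code)
        if patch is None:
            unmatched.append(code)  # set aside: no known patch
        else:
            per_patch[patch] += 1
    return per_patch, unmatched


def share_by_patch(per_patch, total_crashes):
    """Return (patch, percent of all crashes) pairs, largest first."""
    return [(patch, 100.0 * n / total_crashes)
            for patch, n in per_patch.most_common()]


if __name__ == "__main__":
    # Made-up sample data, for illustration only.
    crashes = ["C101"] * 9 + ["C999"]
    manifests = {"PTF-A": {"C101", "C102"}, "PTF-B": {"C200"}}
    per_patch, unmatched = map_crashes_to_patches(crashes, manifests)
    for patch, pct in share_by_patch(per_patch, len(crashes)):
        print(f"{patch}: {pct:.0f}% of crashes")
    print(f"not covered by any patch: {len(unmatched)}")
```

Keeping the unmatched codes in their own list mirrors the "leaving aside the handful" step: the residue stays visible instead of being silently dropped, which makes the headline percentage harder to dispute.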

This is an example of how you can develop a critical metric from data that you didn’t realize was scattered all over your system, waiting to be collected and analyzed. When you start your metrics program, you’ll find that a great deal of information can be gleaned from existing data that gets stored in various places – most likely in your system logs.

Lessons:

• If you understand your metric and the underlying data, you can prepare a couple of different ways of looking at it, in case you don’t get your point across the first time.

• The order in which you present your metrics should support and reinforce them; get buy-in on the little pieces then present your conclusion, which will then be unarguable.

• Metrics can build atop each other; the “hard part” of what Sylvan did was figuring out how to collect the data that the system left here and there in order to construct the higher-level pieces of data that he needed. Sylvan calculated total downtime with a script that scrubbed through the database logs looking for a “database restarted” message, then measured the time between that and the last successful transaction, and counted only occurrences where there was a significant gap. This marked the beginning of my career-long love affair with system logs.

• If you’re not relying on common heuristics, briefly explain your methodology – it cements your metric in reality and makes it very hard to argue against.
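The downtime calculation mentioned in the lessons above (find each "database restarted" message, measure back to the last successful transaction, keep only the significant gaps) might look roughly like this in Python. The log line layout, the "transaction committed" message text, and the five-minute threshold are assumptions made for the sketch; the article does not say what the Ingres logs actually looked like or what gap Sylvan treated as significant.

```python
from datetime import datetime, timedelta

# Assumed log layout: "YYYY-MM-DD HH:MM:SS <message>" (hypothetical).
TS_FORMAT = "%Y-%m-%d %H:%M:%S"
TS_LENGTH = 19


def find_outages(lines, min_gap=timedelta(minutes=5)):
    """Return (last_transaction, restart, gap) tuples for significant outages."""
    last_txn = None
    outages = []
    for line in lines:
        line = line.strip()
        try:
            ts = datetime.strptime(line[:TS_LENGTH], TS_FORMAT)
        except ValueError:
            continue  # skip lines that do not start with a timestamp
        message = line[TS_LENGTH:].strip()
        if "transaction committed" in message:
            last_txn = ts
        elif "database restarted" in message and last_txn is not None:
            gap = ts - last_txn
            if gap >= min_gap:
                outages.append((last_txn, ts, gap))
            # Do not count a second restart against the same transaction.
            last_txn = None
    return outages


def total_downtime(outages):
    """Sum the gaps of all recorded outages."""
    return sum((gap for _, _, gap in outages), timedelta())


if __name__ == "__main__":
    # Made-up sample log, for illustration only.
    log = [
        "1987-03-01 02:10:00 transaction committed",
        "1987-03-01 03:40:00 database restarted",
        "1987-03-01 03:41:00 transaction committed",
        "1987-03-01 03:42:00 database restarted",
    ]
    outages = find_outages(log)
    print(f"{len(outages)} outage(s), total downtime {total_downtime(outages)}")
```

The threshold is the judgment call worth explaining to an audience: it filters out routine restarts that follow a transaction by a minute or so, so that only gaps long enough to be real outages count toward the total.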

Next up: The Evolution of an Important Metric: The Poverty Line, and some of the do’s and don’ts we can learn from how the Poverty Line was established and calculated.
