There’s a Ton of Unstructured Data in the World - We Should All Tap Into Our Inner Analyst to Devise and Embrace New Categorization Models
Ever wonder how often people get burned when their waterskis catch fire? How about injuries suffered due to macaw attacks?
Well, the International Classification of Diseases is being modified and the 10th Revision (ICD-10) will be out soon. ICD-10 increases the number of medical service codes, which hospitals use to classify patient issues, from the current 18,000 to around 140,000. The World Health Organization has been classifying ailments for years in an effort to create a more exact and up-to-date accounting of diagnoses and hospital inpatient procedures, which could improve payment strategies and care guidelines.
So now you can find out if a patient's injury occurred in a chicken coop, opera house, art gallery, squash court, or any of nine locations in and around a mobile home. There are codes for injuries incurred while ironing, which isn’t surprising, but also while playing a brass instrument and crocheting. The classification system also includes patient characteristics observed by the caregiver, including “bizarre personal appearance”, which may overlap with “very low level of personal hygiene.”
Some critics believe 140,000 codes is too many and provides little benefit beyond the current set. I’m inclined to disagree: any classification system with a well-designed taxonomy can provide invaluable analysis potential, and some data sets are made for twisting and pivoting. For example, it may be interesting to query whether the majority of slip-and-falls occur in the home, in the workplace, or during sporting activities. Digging deeper, categorizing them by lacerations, sprains, bone breaks, or concussions may yield unexpected conclusions.
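As a rough sketch of that kind of pivot, the Python below counts slip-and-falls first by location and then by location and injury type. The records and field names (`cause`, `location`, `injury`) are invented for illustration; they are not real ICD-10 codes or real statistics.

```python
from collections import Counter

# Invented records for illustration only; not real ICD-10 codes or statistics.
records = [
    {"cause": "slip_and_fall", "location": "home", "injury": "sprain"},
    {"cause": "slip_and_fall", "location": "home", "injury": "laceration"},
    {"cause": "slip_and_fall", "location": "workplace", "injury": "sprain"},
    {"cause": "slip_and_fall", "location": "sports", "injury": "concussion"},
    {"cause": "burn", "location": "home", "injury": "burn"},
]

# First cut: where do slip-and-falls happen?
falls = [r for r in records if r["cause"] == "slip_and_fall"]
by_location = Counter(r["location"] for r in falls)

# Digging deeper: break each location down by injury type.
by_location_and_injury = Counter((r["location"], r["injury"]) for r in falls)

for (location, injury), count in sorted(by_location_and_injury.items()):
    print(f"{location:10} {injury:12} {count}")
```

The finer the taxonomy, the more of these cuts become possible without touching the raw narrative notes.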
If you’re a psychologist, you might want to find out how trainable people are. ICD-10 has both “walked into lamppost, initial encounter” and “walked into lamppost, subsequent encounter”. How many encounters does it take before the subject learns to sidestep immovable objects?
Maybe you’re not particularly interested in the frequency of animal-related injuries during lightning storms and are starting to wonder what all this has to do with security.
There are already taxonomies for classifying vulnerabilities in the form of the Common Vulnerabilities and Exposures (CVE) list and the Open Sourced Vulnerability Database (OSVDB), but there is no equivalent for events. Every vendor creates its own log formats, and many vendors have many formats, perhaps from acquiring multiple software applications or simply from not having a development standard. In many cases the software developers just make up their own events, following neither a prescribed format for the fields nor a standard for the text within the fields. This makes parsing and categorizing events from a wide range of vendors difficult, and yet it’s a critical undertaking: normalization is the foundation of cross-system data mining and correlation.
There are two main strategies for dealing with the lack of event standardization:
• Store the data, perhaps making a best effort to parse it into common, or normalized, fields, and wrap a flexible search engine around it;
• Invest significant effort into parsing and normalizing the data.
The first is the simpler of the two but is largely relegated to post-event analysis; the second requires more effort but lends itself to real-time correlation and early threat detection. This is the difference between log management and SIEM.
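A minimal sketch of the second strategy, in Python, might look like the following. The two log formats, the field names (`ts`, `user`, `src_ip`), and the `auth.failure` category are all made up for illustration; they do not come from any real product or standard.

```python
import re
from collections import Counter

# Two made-up log formats standing in for "vendor A" and "vendor B".
# Neither is a real product's format; they only illustrate the problem.
VENDOR_A = re.compile(
    r"(?P<ts>\S+) LOGIN_FAIL user=(?P<user>\S+) src=(?P<src_ip>\S+)"
)
VENDOR_B = re.compile(
    r"\[(?P<ts>[^\]]+)\] authentication failure for '(?P<user>[^']+)' "
    r"from (?P<src_ip>\S+)"
)
PARSERS = [("vendor_a", VENDOR_A), ("vendor_b", VENDOR_B)]


def normalize(line):
    """Map a raw log line to common fields, or return None if no parser fits."""
    for source, pattern in PARSERS:
        match = pattern.search(line)
        if match:
            event = match.groupdict()
            event["source"] = source
            # Hypothetical category label shared by both formats.
            event["category"] = "auth.failure"
            return event
    return None


raw_lines = [
    "2024-01-01T10:00:00Z LOGIN_FAIL user=alice src=10.0.0.5",
    "[2024-01-01 10:00:03] authentication failure for 'alice' from 10.0.0.5",
    "some line nobody wrote a parser for",
]

events = [e for e in map(normalize, raw_lines) if e is not None]

# Once the fields line up, correlating across vendors is a simple count.
failures = Counter(
    (e["user"], e["src_ip"]) for e in events if e["category"] == "auth.failure"
)
print(failures)
```

The third line shows the cost of this approach: anything without a parser is dropped unless it is also stored raw, which is why the two strategies are often combined.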
My background is in traditional intelligence (espionage, surveillance, and analysis), and I love use cases that combine electronic data with feeds from the real world. Health care information is a great example. I’ve mentioned before that health care organizations want to detect fraud, such as patients who don’t have a real medical condition but are collecting disability, and inappropriate use of patient data, such as caregivers accessing their neighbor’s record out of curiosity. ICD-10 brings us closer to implementing these types of use cases with more accuracy or, as we’re used to thinking of it, with less white noise and fewer false positives.
While many in the information security field aren’t used to thinking in terms of physical security threats, ICD-10 and cross-facility correlation could detect imminent outbreaks of an illness and track the contagion vectors, perhaps even factoring in airline, train, and bus itineraries. You might think of this as an element of business continuity planning, part of the ‘A’ in CIA. Or ICD-10 could be used as an early warning system to alert on possible biological weapons attacks, such as poisoned drinking water, and help control the damage.
There are, of course, privacy concerns. As we amass more data, the potential for abuse and the restriction of freedom becomes real. There’s a great video on the Internet where a man tries to order a pizza for delivery and the order taker has access to the hungry guy’s medical and financial history. Ultimately, he can only order a sprouts sub and pay in cash. Clearly, data needs to be classified as well as categorized, access needs to be granted to agencies, roles, and users based on need, and use and distribution need to be controlled both contractually and through monitoring.
Even with proper controls, data can be dangerous in the wrong hands. Analysis based upon activities and behavior is a reasonable predictor of malicious intent. Examining an individual’s recent purchases and comparing them to historical transactional trends can alert law enforcement agencies around the world to pay special attention to the guy who just bought a bunch of fertilizer, particularly when there’s no associated buying activity to suggest he has a garden. But an untrained analyst might take irrelevant factors into consideration, such as eye color, shoe style, or mustaches. While statistically (and purely hypothetically) more green-eyed women may commit crimes, common sense tells us eye color isn’t the major factor. Taking action on the data without further interpretation can lead to unjustified interrogation, detention, and even incarceration; at a minimum, it would make green-eyed women feel harassed.
There’s a ton of unstructured data in the world, emanating from, and stored in, unlikely places – data that can be used to enrich our ability to analyze the world around us and in our information systems. We should all tap into our inner analyst to devise and embrace new categorization models. Like Einstein, who sought a unified field theory, we can work toward a unified security model.