Big Data Means Exposing More Data to Internal Misuse or Accidental Exposure, and Exposing More Data to a Successful Attacker
As these stories often go, a friend who recently started a new job asked me if I had any thoughts about the security of “Big Data.” My first thought was, to some extent, that big data is more a buzzword than anything else. Big data is just more data, so it faces the same types of issues as any data, right?
But this is an oversimplification of the issues. To some extent, managing big data is kind of like having kids. Two kids are not twice as much work as one kid – it is more like an exponential relationship. Two kids is work². As your big data store grows, do the potential control issues grow at least as fast?
Obviously “sizing” is the big issue. But, beyond just “more,” you have to appreciate that big data not only means more data, but also more complicated data, more sensitive data, and a correspondingly greater chance of exposure due to errors or vulnerabilities. It also means exposing more data to internal misuse or accidental exposure, and exposing more data to an attacker who succeeds in penetrating your perimeter. At a very basic level, if a potential attacker knows that you have large volumes of high quality data, it may very well elevate your attack profile, since you are more likely to be viewed as an attractive target. You are accounting for that when you do your organizational risk analysis, right?
But, when we talk about big data we are not really just talking about volume or quantity. Most people who have the big data discussion will talk about Velocity, which is a function of the speed at which data enters your environment. Personally, I find the concept of Variety more interesting. What kind of data is it? It is much easier to manage large volumes of all PHI data than it is to manage PHI data, PCI data, medical telemetry data, and demographic data all mixed together. How contextually alike the data is can have a huge impact on how that data is managed. How similar or dissimilar is the data? It is much easier to manage large volumes of database files than it is to manage a complex combination of database files, flat text files, system logs, application specific data, customized format data, graphics, etc. The more diverse the specific pieces of information, the more complex the infrastructure required to support it.
IT implementations supporting big data have a whole host of throughput, availability, and data access controls that are more related to the operations supporting the data than to security of that data. Scalable infrastructures, parallel processing, data replication, and massive in-memory processing are just a few of the discussions to be had about big data operations. But, for us security geeks, what does a classical view of big data tell us about security?
1. Understand the data. Obviously, this is the best place for me to do my standard “Have you done a Business Impact Assessment (BIA)” rant. The question is pretty simple – you have lots of data, but do you know “what” you have? Do you have PCI data, or PHI data, or private corporate data, or private customer/consumer data? Before you worry about anything else, you have to understand what is actually in the big data. For one thing, understanding your data helps you add context to the data early in the process of managing it, and, probably just as importantly, lets you identify aberrant data as you see it. Cleansing big data of irrelevant, erroneous, or toxic data is not a task that you should take lightly.
2. Understand the size constraints. Yes, availability really is a security issue, so make sure you size your infrastructure appropriately. Is your network fast enough to support the throughput demands of the data you are managing, both Velocity and consumption? Do you have enough CPU capacity that you can support the movement and management of the data through any required applications, databases, and storage devices? Do you have enough disk space that you can easily store the data? Do you have a robust enough drive management process that you are not single threading a drive and trying to write too much active data to the same storage device(s)? These are all the standard IT issues on the functions required to just manage high volumes of data. Appreciate that higher volumes of data are, in many ways, just harder to protect, and that your solution has to scale with the data and demands on the data. Can you encrypt petabytes or exabytes of data in a real-time enough manner to make the data consumable while meeting operational constraints, including meeting timing requirements?
3. Understand the timing constraints. Again, this is highly reliant on the fact that you have done your job in step #1. But, timing is très important. Does your data have a lifespan? In more straightforward English, clinical medical information obviously has a more time-sensitive lifespan than typical manufacturing metrics. Said even more plainly, some data is not as valuable if it cannot be managed and analyzed in a timely manner. Do you think the Phalanx anti-missile system would have any value if it took five minutes to evaluate a threat and respond? (The correct answer is, of course, “no.”) This obviously drives IT capacity and throughput requirements, and supports availability security objectives. Sometimes it just does not matter, but in many instances old data may not be relevant, so the lifespan of the data is often more important than we appreciate.
4. Understand the appropriate level of data context. This is a direct expansion of the above three issues, and really what helps us make sense of the big data. When the data has context we can manage it as information as opposed to bits and bytes. Is it PHI data, or PCI, or personal information or something else? Contextual data can be mined for details, correlation information, and actively managed with meaning, instead of just “data.” Treating the data in an intelligent manner also lets us treat data with similar context in a similar manner – we build contextual relationships in the data.
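As a rough sketch of steps 1, 3, and 4 above, the snippet below tags each incoming record with a context label and checks whether the record has outlived an assumed lifespan. Everything specific here is hypothetical and invented for illustration: the field names, the context labels, the lifespans, and the names `classify` and `is_stale`. A real classification scheme would come out of your own BIA, not a lookup table like this one.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone

# Hypothetical field names that hint at a record's context.
CONTEXT_HINTS = {
    "PHI": {"patient_id", "diagnosis", "prescription"},
    "PCI": {"card_number", "expiry"},
    "telemetry": {"device_id", "reading"},
}

# Hypothetical lifespans per context; real values would come from a BIA.
LIFESPAN = {
    "PHI": timedelta(days=1),
    "PCI": timedelta(days=30),
    "telemetry": timedelta(minutes=5),
}


@dataclass
class TaggedRecord:
    payload: dict
    context: str
    received: datetime


def classify(payload: dict) -> str:
    """Return the first context whose hint fields appear in the payload."""
    for context, hints in CONTEXT_HINTS.items():
        if not hints.isdisjoint(payload):
            return context
    return "unclassified"


def is_stale(record: TaggedRecord, now: datetime) -> bool:
    """True when the record is older than the lifespan assumed for its context."""
    lifespan = LIFESPAN.get(record.context)
    if lifespan is None:
        return False  # no known lifespan, so staleness cannot be judged
    return now - record.received > lifespan


now = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
payload = {"device_id": "d1", "reading": 98.6}
record = TaggedRecord(payload, classify(payload), now - timedelta(minutes=10))
print(record.context, is_stale(record, now))
```

The point of the sketch is only that context and lifespan travel with the data from the moment it arrives, so that records with similar context can be treated in a similar manner later on.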
And there’s the rub. While big data may be “data”, we really don’t want it to be “data” as much as we want it to be “information” (which is data with context). Big data is more valuable as a source of analytics than it is as just “the data.” That is why context and correlation are so very important when you talk about big data – we need to make the data intelligent by using the available context to help ensure that we can consume relevant information. So, you are not just talking about “medical data,” you are talking about unique patient identifiers, pre-existing conditions, allergies, current prescriptions, contra-indications, and a whole host of demographic information about the patient as well as the provider. You are not just talking about “manufacturing data,” you are talking about specific inventory items, in-stock, re-order/manufacture points, required supplies, vendors, price of goods, selling price, buyer (and all their information, like industry, geographic location, volumes, discounts, and specific/customized delivery contracts). You are not just talking about security event data, you are talking about the detail that your IDS and internal systems are reporting attacks against a system named Mordor, which is a Windows Server 2008 R2 SP1 machine running Oracle 11g Enterprise, sitting in the Princeton, N.J., data center in row 3, rack A12, and it holds all of your clinical patient records, so it does indeed fall under HIPAA and HITECH.
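To make the security event example a little more concrete, here is a minimal sketch of turning a bare alert into “information” by joining it against an asset inventory. The inventory structure, the `enrich` function, and the alert fields (including the signature text) are assumptions made up for this illustration; only the Mordor details are taken from the paragraph above.

```python
# Hypothetical asset inventory keyed by lowercase hostname.
ASSETS = {
    "mordor": {
        "os": "Windows Server 2008 R2 SP1",
        "software": "Oracle 11g Enterprise",
        "location": "Princeton, N.J., row 3, rack A12",
        "data": "clinical patient records",
        "regulations": ["HIPAA", "HITECH"],
    }
}


def enrich(event: dict) -> dict:
    """Return a copy of the event with asset context attached, if known."""
    asset = ASSETS.get(event.get("target", "").lower())
    enriched = dict(event)
    enriched["asset"] = asset  # None when the target is not in the inventory
    enriched["regulated"] = bool(asset and asset["regulations"])
    return enriched


# Hypothetical alert as it might arrive from an IDS, before any context is added.
alert = enrich({"source": "IDS", "signature": "sql-injection attempt", "target": "Mordor"})
print(alert["regulated"], alert["asset"]["regulations"])
```

An alert against an unknown host comes back with no asset attached and `regulated` set to false, which is itself a useful signal that the inventory has a gap.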
In some ways, this all just exacerbates the problem. We are taking large quantities of potentially valuable, dynamic and complex data, and attaching contextual analytics to that data – assigning even greater value to the big data because of the context, and the information we can glean from the data. The fact is that the analytics themselves, and even the process used to create those analytics, are also highly valuable. This highlights the need to protect the analytics models and results, along with access to them.
After all, without the relevant intelligence we can get from good analytics of the big data, it really is just a bunch of data.