
Hadoop Data Encryption: "P.S. Find Robert Langdon"

“P.S. Find Robert Langdon” and an out-of-order Fibonacci sequence are part of the cryptic message in the opening scenes of “The Da Vinci Code.” Following that, we see cryptographer Sophie Neveu and Professor Langdon embark on a hunt to recover and unlock the secrets of the cryptex keystone. There are many other great movies I have enjoyed in which cryptography plays a role – from “A Beautiful Mind” to “The Imitation Game.” These films remind us that cryptography comes in many forms and has been used since early times to protect secrets, long before the invention of computers. Even so, encryption was not built into Apache Hadoop from the start – it was added over time and implemented across components. Today it has become a common method of protecting big data at financial institutions, healthcare organizations, telecommunication companies and government agencies.

HDFS Encryption

HDFS natively supports encryption of data via a mechanism called Encryption Zones. An Encryption Zone is an HDFS directory that has been associated with an encryption key. Once that association is made, all files in the directory and its subdirectories are encrypted automatically.
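As a sketch of that workflow, assuming a cluster with KMS already configured (the key name and directory path here are illustrative, not prescriptive):

```shell
# Create an encryption key in the KMS, then tie it to an empty directory.
hadoop key create finance-key
hdfs dfs -mkdir /data/finance
hdfs crypto -createZone -keyName finance-key -path /data/finance
# Verify the zone; files written under /data/finance are now encrypted.
hdfs crypto -listZones
```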

When using HDFS encryption, not all data in HDFS needs to be encrypted; you can have some directories with public or non-sensitive data in cleartext while sensitive data gets encrypted. Hadoop users can have their own Encryption Zones to protect their data from other users, and I will go more into that in the next section.

A common misconception about native HDFS encryption is the belief that, as with most disk encryption solutions, the data is encrypted when it is written to disk on the data nodes. In fact, the data is encrypted before it is ever sent to the data node. That architecture has two nice side effects: the data is also protected in transit, and the encryption keys are never exposed on the data nodes where the data is stored.

The cryptographic algorithm used to encrypt HDFS data is industry-proven AES. The default is AES-128, but it is configurable for organizations that have standardized on the stronger AES-256. Each individual file in an Encryption Zone is assigned its own randomly generated key, so in the event that a key were compromised, it would only be usable on a single file.
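For example, a 256-bit key could be created like this sketch (the key name is illustrative):

```shell
# Create a 256-bit key for AES-256; the default -size is 128.
hadoop key create finance-key -size 256
# Inspect the cipher and key length of the keys in the KMS.
hadoop key list -metadata
```

Note that on older JVMs, AES-256 may also require installing the JCE unlimited-strength policy files.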

There are some caveats to HDFS Encryption, however, that you should be aware of:

1) You cannot create an Encryption Zone on a non-empty directory. In other words, if you were expecting to take an existing directory with data and set it up as an encryption zone to have it all encrypted… No can do. That’s just not how it works. You need an empty directory to create an encryption zone. From there you can use the distcp command to copy the data from its current directory into the new, empty and now encrypted directory (and then delete the old cleartext files).

2) You cannot make your entire “/” root directory an encryption zone.

3) Nested encryption zones are not currently supported, but they are on the roadmap. For example, you may want to set up the entire /user directory as an encryption zone and allow users to then create their own nested encryption zones. We will hopefully support this soon – perhaps by the time this article is published.

4) Test your applications. There are many scenarios where applications can break due to user permissions on the encryption keys and restrictions on how files may be moved and copied in and out of encryption zones.
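Putting the first caveat into practice, migrating an existing cleartext directory might look like the following sketch (paths and key name are illustrative; verify the distcp flags against your Hadoop version, as copies across encryption zone boundaries can otherwise fail checksum verification):

```shell
# Create the new, empty directory and make it an Encryption Zone.
hdfs dfs -mkdir /data/secure
hdfs crypto -createZone -keyName finance-key -path /data/secure
# Copy the existing data in; encryption happens transparently on write.
# -update -skipcrccheck avoids checksum mismatches across the zone boundary.
hadoop distcp -update -skipcrccheck /data/legacy /data/secure
# Finally, remove the cleartext originals.
hdfs dfs -rm -r -skipTrash /data/legacy
```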

Hadoop Key Management Server (KMS)

The key to encryption (pun intended) is the actual management of the encryption keys. Each Encryption Zone has its own unique key with associated user permissions for that key, and each individual file also has its own unique, randomly generated key. So where do those keys get stored, how are they managed, and how are they protected? This is where Hadoop KMS comes into play. Apache Hadoop KMS is a pluggable key management service with three roles:

● generates and stores keys for Encryption Zones

● generates and encrypts/decrypts keys for files

● protects and manages the permissions to these keys

KMS is an independent service that runs separately from the Hadoop cluster and is typically managed by the information security team rather than the Hadoop administrators. It is important to create a separation of duties when it comes to managing permissions to encryption keys: Hadoop administrators may have access to the hdfs user and all the data, but not to the keys.

Permissions to keys are defined in KMS as Access Control Lists (ACLs). These ACLs define which users and groups have access to encryption keys, also known as the whitelist. ACLs can also be created to define which users or groups are blocked from accessing keys, aka the blacklist. Using ACLs, you can block privileged and administrative users such as the hdfs user from accessing user data.
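As an illustrative sketch, a kms-acls.xml fragment might whitelist a couple of users on one key while blacklisting the hdfs superuser from decrypting any file keys (the key and user names here are hypothetical):

```xml
<!-- Whitelist: only these users may decrypt file keys protected by finance-key. -->
<property>
  <name>key.acl.finance-key.DECRYPT_EEK</name>
  <value>alice,bob</value>
</property>
<!-- Blacklist: block the hdfs superuser from decrypting any file keys. -->
<property>
  <name>hadoop.kms.blacklist.DECRYPT_EEK</name>
  <value>hdfs</value>
</property>
```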

KMS by default uses an implementation based on the Java KeyStore (JKS). The JKS stores its secrets in a file that is typically protected only by filesystem permissions. This implementation has downsides, however, and is not recommended for production systems. One of those downsides is security: you can put a password on a JKS file, but then where do you store that password? In a text file on the server? There are also scalability and availability challenges; there are no built-in replication, redundancy or backup mechanisms for this JKS file.
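To make that concrete, the file-based JKS backing store is configured in kms-site.xml along the following lines (the paths are illustrative), and the keystore password itself ends up in yet another plain file:

```xml
<!-- Point KMS at a Java KeyStore file on the local filesystem. -->
<property>
  <name>hadoop.kms.key.provider.uri</name>
  <value>jceks://file@/var/lib/kms/kms.keystore</value>
</property>
<!-- The keystore password is read from a file, which must itself be protected. -->
<property>
  <name>hadoop.security.keystore.java-keystore-provider.password-file</name>
  <value>kms.keystore.password</value>
</property>
```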

In this article I only touched on HDFS encryption of data at rest and key management. But not all data resides in HDFS. There is sensitive data that may be outside of HDFS, there are temporary files and, of course, there is also data that is in transit. In my next column I will detail those other areas of encryption.

I leave you with the now-infamous anagram from “The Da Vinci Code” for your deciphering pleasure:

“O, Draconian devil!

Oh, lame saint!

So dark the con of Man”


Eddie Garcia is an information security architect at Cloudera, a provider of enterprise analytic data management, where he helps enterprise customers reduce security and compliance risks associated with sensitive data sets stored and accessed in Apache Hadoop environments. He was formerly the VP of InfoSec and Engineering for Gazzang prior to its acquisition by Cloudera. He was the chief architect of the Gazzang zNcrypt product and is author of four issued and provisional patents for data security. Prior to Gazzang, he was responsible for Enterprise Architecture projects that helped AMD’s distribution and OEM partners securely collaborate over secure networks with single sign-on. He holds an engineering degree in computer science from the Instituto Tecnologico y de Estudios Superiores de Monterrey.