In my previous contribution I reviewed Apache Hadoop’s native support for HDFS Encryption, which allows sensitive data to be stored and encrypted on HDFS with keys protected by KMS access control lists.
That said, any real-world big data environment has many other data sources, data-staging areas, temporary files and log files where sensitive data may reside outside of HDFS. Sensitive data should also be protected “in transit” when it moves from endpoints into the cluster, or when a job moves data from one node in the cluster to another. The good news is that there are solutions for encrypting data at rest and in transit that allow organizations not only to meet regulatory compliance requirements but, more importantly, to protect and secure their information assets.
Apache Hadoop Encryption in Transit
The first step in securing an Apache Hadoop cluster is to enable encryption in transit, which is the foundation for every security layer added afterward. Authentication and Kerberos rely on secure communications, so you must enable encryption of data in transit before you even go down the road of enabling them.
As you know, Apache Hadoop is not a monolithic application; quite the contrary, it’s a zoo of services from Pig and Hive to Impala and Kudu. These services may communicate over different protocols, including RPC, TCP/IP and HTTP, and each protocol has a different method for encrypting data. Everyone who does transactions on the web is (hopefully) familiar with the https:// in their browser and the little image of a lock next to it that identifies a secure communication. Similarly, to achieve secure communications in Hadoop we need to enable the secure version of each protocol used.
For the components that communicate with RPC we need to enable SASL to protect RPC data in transit. SASL is enabled by setting the hadoop.rpc.protection property in the core-site.xml file, and there are three choices when enabling this option:
• authentication, which provides authentication between the two parties
• integrity, which provides authentication between the two parties and message integrity
• privacy, which, in addition to authentication and integrity, also provides confidentiality
Of course, you will want to go with the most secure option, so set the hadoop.rpc.protection property to “privacy” in the core-site.xml file.
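A minimal sketch of that setting, assuming the usual Hadoop property layout inside the file's <configuration> element. The value shown selects the most protective of the three levels listed above:

```xml
<!-- core-site.xml: protect Hadoop RPC traffic with SASL -->
<property>
  <name>hadoop.rpc.protection</name>
  <value>privacy</value> <!-- other accepted values: authentication, integrity -->
</property>
```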
Once this property is changed, all daemons in the cluster need to be restarted for all parties to communicate with the same encrypted protocol.
The Hadoop Data Transfer Protocol is used to send and receive data from the cluster's DataNodes. It uses direct TCP/IP sockets and supports encrypted communication with keys that are exchanged over RPC (which reinforces the need to enable SASL first). To enable this encrypted TCP/IP data stream, we set the dfs.encrypt.data.transfer property to “true” in the hdfs-site.xml configuration file. This configuration change must be made on both the NameNode and the DataNodes.
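<property> <name>dfs.encrypt.data.transfer</name> <value>true</value> </property>
<property> <name>dfs.encrypt.data.transfer.algorithm</name> <value></value> <!-- leave empty for AES --> </property>
<property> <name>dfs.encrypt.data.transfer.cipher.suites</name> <value>AES/CTR/NoPadding</value> </property>
<property> <name>dfs.encrypt.data.transfer.cipher.key.bitlength</name> <value>256</value> <!-- can also be set to 128 or 192 --> </property>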
As you may have noticed, we can also optionally configure the encryption cipher; in this case we have configured the more secure AES-256 algorithm.
Once this property is changed, the NameNode and DataNode daemons in the cluster need to be restarted.
Encrypting HTTP traffic with SSL/TLS is where things get a bit more complicated. Different Apache Hadoop components have been developed in different programming languages (for example, MapReduce in Java and Hue in Python), and for this reason SSL/TLS is configured in a variety of ways and locations. As an example, we will review enabling encryption for the WebUI of MapReduce v2 as well as enabling encrypted shuffles for MapReduce v2.
To enable the encrypted WebUI for MapReduce v2, edit the core-site.xml file and set the hadoop.ssl.enabled property to “true”.
<property> <name>hadoop.ssl.enabled</name> <value>true</value> <final>true</final> </property>
And to enable encrypted shuffle for MapReduce v2, you will edit the mapred-site.xml file and also set the mapreduce.shuffle.ssl.enabled property to “true”.
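As a rough sketch, the shuffle setting could look like the following in mapred-site.xml. The <final> element is an optional assumption here; it marks the value as one that individual jobs should not be able to override:

```xml
<!-- mapred-site.xml: encrypt the MapReduce v2 shuffle phase -->
<property>
  <name>mapreduce.shuffle.ssl.enabled</name>
  <value>true</value>
  <final>true</final> <!-- assumption: prevents job-level overrides -->
</property>
```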
This is just scratching the surface. There are also certificates, trust stores and other SSL/TLS configurations that we need to worry about. But because this article is only meant to give you a high-level overview, we will not go into details and only point out that you also need to modify the ssl-server.xml and ssl-client.xml files with the appropriate values after obtaining SSL/TLS certificates from your AD admin or internal Certificate Authority (CA). In many cases, requesting the certificates from your AD team or a third party can be a lengthy process.
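To give a feel for what those files contain, here is a minimal sketch of an ssl-server.xml fragment. The keystore and truststore paths and the passwords are hypothetical placeholders, not values from any real cluster; substitute the locations and credentials of the certificates you actually obtain:

```xml
<!-- ssl-server.xml: keystore and truststore used by the Hadoop daemons -->
<!-- All paths and passwords below are hypothetical placeholders -->
<property>
  <name>ssl.server.keystore.type</name>
  <value>jks</value>
</property>
<property>
  <name>ssl.server.keystore.location</name>
  <value>/etc/hadoop/ssl/server-keystore.jks</value>
</property>
<property>
  <name>ssl.server.keystore.password</name>
  <value>CHANGE_ME</value>
</property>
<property>
  <name>ssl.server.truststore.location</name>
  <value>/etc/hadoop/ssl/truststore.jks</value>
</property>
<property>
  <name>ssl.server.truststore.password</name>
  <value>CHANGE_ME</value>
</property>
```

The ssl-client.xml file follows the same pattern with ssl.client.* property names. Because both files hold passwords in clear text, they are normally readable only by the accounts that run the Hadoop services.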
Encryption of Data at Rest outside HDFS
Once you have enabled encryption in transit across the Apache Hadoop components and encryption at rest within HDFS, the last thing you need to configure is encryption of data at rest outside of HDFS.
While some organizations may consider encrypted hard drives, this method is not commonly used and also requires specialized and more expensive hardware. Instead, we can take advantage of the Linux native data-at-rest encryption features, namely dm-crypt.
Using LUKS with dm-crypt, we can create an encrypted block device that sits above your standard storage device and encrypts and decrypts data as files are written and read. While setting up a dm-crypt block device is pretty simple thanks to tools like cryptsetup, storing and protecting the encryption password is not trivial and requires careful planning and testing. There are plenty of resources on the web for setting up dm-crypt, and the steps will vary based on your Linux distribution.
This article should give you a sense of what it takes to encrypt data in a Hadoop cluster; the tools are here and, if you are brave enough, you could attempt this at home. But the reality is that many things can go wrong if this is your first time, and you could end up permanently encrypting your data and throwing away the key (ultimate protection, even from you).
In my next column, I could go into much greater detail on how everything gets configured, but as everyone knows “threequels” are never as good as the first two, so I will save you the grief. Instead, I’ll offer you the option to look into how tools have automated and simplified the steps that are necessary to enable data encryption on Apache Hadoop.