Hadoop Data Encryption at Rest and in Transit

In my previous contribution I reviewed Apache Hadoop’s native support for HDFS Encryption, which allows sensitive data to be stored and encrypted on HDFS with keys protected by KMS access control lists.

This notwithstanding, in any real-world big data environment there are many other data sources, data-staging areas, temporary files and log files where sensitive data may reside outside of HDFS. And of course sensitive data should also be protected in transit, whether it is traveling from endpoints into the cluster or being moved by a job from one node in the cluster to another. The good news is that there are solutions for encrypting data at rest and in transit that will allow organizations not only to meet regulatory compliance requirements, but more importantly to protect and secure their information assets.

Apache Hadoop Encryption in Transit

The first step in securing an Apache Hadoop cluster is to enable encryption in transit, which is the foundation for every other security layer added subsequently. Kerberos authentication relies on secure communications, so before you even go down the road of enabling it you must encrypt data in transit.

As you know, Apache Hadoop is not a monolithic application; quite the contrary, it's a zoo of services from Pig and Hive to Impala and Kudu. These services communicate over different protocols, including RPC, direct TCP/IP and HTTP, and each protocol has a different method of encrypting data. Everyone who does transactions on the web is (hopefully) familiar with https:// in the browser and the little lock icon next to it that identifies a secure connection. Similarly, to achieve secure communications in Hadoop we need to enable the secure version of each protocol in use.

RPC/SASL

For the components that communicate over RPC we need to enable SASL to protect RPC data in transit. SASL is enabled by setting the hadoop.rpc.protection property in the core-site.xml file, and there are three choices for this setting:

• authentication, which provides authentication between the two parties

• integrity, which provides authentication between the two parties and message integrity

• privacy, which, in addition to authentication and integrity, also provides confidentiality

Of course, you will want to go with the most secure option, so let’s see what that looks like in the core-site.xml file.

<property>
  <name>hadoop.rpc.protection</name>
  <value>privacy</value>
</property>

Once this property is changed, all daemons in the cluster need to be restarted for all parties to communicate with the same encrypted protocol.

TCP/IP

The Hadoop Data Transfer Protocol, which is used to send and receive data from the cluster DataNodes, runs over direct TCP/IP sockets and supports encrypted communication, with keys exchanged over RPC (which reinforces the need to first enable SASL). To enable this encrypted TCP/IP data stream, set the dfs.encrypt.data.transfer property to “true” in the hdfs-site.xml configuration file. This configuration change must be made on both the NameNode and the DataNodes.

<property>
  <name>dfs.encrypt.data.transfer</name>
  <value>true</value>
</property>

<property>
  <name>dfs.encrypt.data.transfer.algorithm</name>
  <value></value> <!-- leave empty for AES -->
</property>

<property>
  <name>dfs.encrypt.data.transfer.cipher.suites</name>
  <value>AES/CTR/NoPadding</value>
</property>

<property>
  <name>dfs.encrypt.data.transfer.cipher.key.bitlength</name>
  <value>256</value> <!-- can also be set to 128 or 192 -->
</property>

As you may have noticed, we can also optionally configure the encryption cipher; in this case we have configured the more secure AES-256 algorithm.

Once this property is changed, the NameNode and DataNode daemons in the cluster need to be restarted.

TLS/HTTPS

This is where things get a bit more complicated. Different Apache Hadoop components have been developed in different coding languages – for example, MapReduce in Java, or Hue with Python – and for this reason there are a variety of ways and locations where SSL/TLS is configured. As an example, we will review enabling encryption for the WebUI of MapReduce v2 as well as enabling encrypted shuffles for MapReduce v2.

To enable the encrypted WebUI for MapReduce v2, you need to edit the core-site.xml file setting the hadoop.ssl.enabled property to “true”.

<property>
  <name>hadoop.ssl.enabled</name>
  <value>true</value>
  <final>true</final>
</property>

And to enable encrypted shuffle for MapReduce v2, you will edit the mapred-site.xml file and also set the mapreduce.shuffle.ssl.enabled property to “true”.

<property>
  <name>mapreduce.shuffle.ssl.enabled</name>
  <value>true</value>
  <final>true</final>
</property>

This is just scratching the surface. There are also certificates, trust stores and other SSL/TLS configurations that we need to worry about. But because this article is only meant to give you a high-level overview, we will not go into details and only point out that you also need to modify the ssl-server.xml and ssl-client.xml files with the appropriate values after obtaining SSL/TLS certificates from your AD admin or internal Certificate Authority (CA). In many cases, requesting the certificates from your AD team or a third party can be a lengthy process.
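To give a flavor of what those files look like, here is a minimal sketch of the keystore and truststore entries in ssl-server.xml; the file paths and passwords below are placeholders for illustration only, and the values in your environment will depend on how your certificates were issued and where your keystores live.

<property>
  <name>ssl.server.keystore.location</name>
  <value>/etc/hadoop/conf/keystore.jks</value> <!-- placeholder path -->
</property>

<property>
  <name>ssl.server.keystore.password</name>
  <value>changeit</value> <!-- placeholder password -->
</property>

<property>
  <name>ssl.server.truststore.location</name>
  <value>/etc/hadoop/conf/truststore.jks</value> <!-- placeholder path -->
</property>

<property>
  <name>ssl.server.truststore.password</name>
  <value>changeit</value> <!-- placeholder password -->
</property>

A similar set of ssl.client.* properties goes into ssl-client.xml so that clients know which truststore to use when validating the certificates presented by the cluster.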

Encryption of Data at Rest outside HDFS

Once you have enabled encryption in transit across the Apache Hadoop components, and HDFS encryption is protecting data at rest inside the cluster, the last thing you need to configure is encryption of data at rest outside of HDFS.

While some organizations may consider encrypted hard drives, this method is not commonly used and also requires specialized and more expensive hardware. Instead, we can take advantage of the Linux native data-at-rest encryption features, namely dm-crypt.

Using LUKS with dm-crypt we can create an encrypted block device that sits above the standard storage device and encrypts/decrypts data as files are written and read. While setting up a dm-crypt block device is pretty simple thanks to tools like cryptsetup, the storage and protection of the encryption password is not trivial and requires careful planning and testing. There are plenty of resources on the web for setting up dm-crypt, and the steps will vary based on your Linux distribution.
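As a rough illustration only (the device name, mapping name and mount point below are placeholders, and this is no substitute for your distribution's documentation), the basic cryptsetup flow looks something like this:

# Create a LUKS container on a spare block device (this destroys any existing data on /dev/sdb1)
cryptsetup luksFormat /dev/sdb1

# Open the container; the decrypted device appears under /dev/mapper
cryptsetup luksOpen /dev/sdb1 hadoop_scratch

# Create a filesystem and mount it where staging, temp and log data will live
mkfs.ext4 /dev/mapper/hadoop_scratch
mount /dev/mapper/hadoop_scratch /data/scratch

How the passphrase or key file is stored and supplied at boot is exactly the part that requires the careful planning and testing mentioned above.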

Summary

This article should give you a sense of what it takes to encrypt data in a Hadoop cluster; the tools are here and, if you are brave enough, you could attempt this at home. But the reality is that many things can go wrong if this is your first time, and you could end up permanently encrypting your data and throwing away the key (ultimate protection, even from you).

In my next column, I could go into much greater detail on how everything gets configured, but as everyone knows “threequels” are never as good as the first two, so I will save you the grief. Instead, I’ll offer you the option to look into how tools have automated and simplified the steps that are necessary to enable data encryption on Apache Hadoop.
