Hadoop Data Encryption at Rest and in Transit

In my previous contribution I reviewed Apache Hadoop’s native support for HDFS Encryption, which allows sensitive data to be stored and encrypted on HDFS with keys protected by KMS access control lists.

Eddie Garcia

August 24, 2016 (2:35 PM ET)

This notwithstanding, in any real-world big data environment there are many other data sources, data-staging areas, temporary files and log files where sensitive data may reside outside of HDFS. And of course sensitive data should also be protected “in transit” when going from endpoints into the cluster, or when a job moves data from one node in the cluster to another. The good news is that there are solutions for encrypting data at rest and in transit that will allow organizations not only to meet regulatory compliance requirements, but more importantly to protect and secure their information assets.

Apache Hadoop Encryption in Transit

The first step in securing an Apache Hadoop cluster is to enable encryption in transit, which is the foundation for every other security layer that will be added subsequently. Authentication and Kerberos rely on secure communications, so before you even go down the road of enabling authentication and Kerberos you must enable encryption of data-in-transit.

As you know, Apache Hadoop is not a monolithic application; quite the contrary, it’s a zoo of services from Pig and Hive to Impala and Kudu. These services may communicate with different protocols including RPC, TCP/IP or HTTP and each protocol has a different method to encrypt data. Everyone who does transactions on the web is (hopefully) familiar with an https:// in their browser and a little image of a lock next to it to identify a secure communication. Similarly, to achieve secure communications in Hadoop we need to enable the secure version of protocols used.

RPC/SASL

For the components that communicate with RPC we need to enable SASL to protect RPC data in transit. SASL is enabled by setting the hadoop.rpc.protection property in the core-site.xml file, and there are three choices when enabling this option:

• authentication, which provides authentication between the two parties

• integrity, which provides authentication between the two parties and message integrity

• privacy, which, in addition to authentication and integrity, also provides confidentiality

Of course, you will want to go with the most secure option, so let’s see what that looks like in the core-site.xml file.

<property>
<name>hadoop.rpc.protection</name>
<value>privacy</value>
</property>

Once this property is changed, all daemons in the cluster need to be restarted for all parties to communicate with the same encrypted protocol.

TCP/IP

The Hadoop data transfer protocol, which sends and receives block data to and from the cluster’s DataNodes, uses direct TCP/IP sockets and supports encrypted communication with keys that are exchanged over RPC (which reinforces the need to enable SASL first). To enable this encrypted TCP/IP data stream, set the dfs.encrypt.data.transfer property to “true” in the hdfs-site.xml configuration file. This configuration change must be made on both the NameNode and the DataNodes.

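<property>
<name>dfs.encrypt.data.transfer</name>
<value>true</value>
</property>
<property>
<name>dfs.encrypt.data.transfer.algorithm</name>
<!-- value missing in the original listing -->
</property>
<property>
<name>dfs.encrypt.data.transfer.cipher.suites</name>
<value>AES/CTR/NoPadding</value>
</property>
<property>
<name>dfs.encrypt.data.transfer.cipher.key.bitlength</name>
<value>256</value>
</property>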

As you may have noticed, we can also optionally configure the encryption cipher; in this case we have configured the more secure AES-256 algorithm.

Once this property is changed, the NameNode and DataNode daemons in the cluster need to be restarted.

TLS/HTTPS

This is where things get a bit more complicated. Different Apache Hadoop components have been developed in different programming languages (for example, MapReduce in Java and Hue in Python), and for this reason SSL/TLS is configured in a variety of ways and locations. As an example, we will review enabling encryption for the WebUI of MapReduce v2 as well as enabling encrypted shuffles for MapReduce v2.

To enable the encrypted WebUI for MapReduce v2, you need to edit the core-site.xml file setting the hadoop.ssl.enabled property to “true”.

<property>
<name>hadoop.ssl.enabled</name>
<value>true</value>
<final>true</final>
</property>

And to enable encrypted shuffle for MapReduce v2, you will edit the mapred-site.xml file and also set the mapreduce.shuffle.ssl.enabled property to “true”.

<property>
<name>mapreduce.shuffle.ssl.enabled</name>
<value>true</value>
</property>

This is just scratching the surface. There are also certificates, trust stores and other SSL/TLS configurations that we need to worry about. But because this article is only meant to give you a high-level overview, we will not go into details and only point out that you also need to modify the ssl-server.xml and ssl-client.xml files with the appropriate values after obtaining SSL/TLS certificates from your AD admin or internal Certificate Authority (CA). In many cases, requesting the certificates from your AD team or a third party can be a lengthy process.
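
As a rough illustration of the keystore and truststore settings mentioned above, a minimal ssl-server.xml might look like the sketch below. The property names are the standard ssl.server.* keys from Hadoop’s ssl-server.xml template; the file paths and passwords are hypothetical placeholders and must be replaced with the values for your own certificates.

```xml
<configuration>
  <!-- Keystore holding this node's private key and certificate.
       Path and passwords are hypothetical placeholders. -->
  <property>
    <name>ssl.server.keystore.type</name>
    <value>jks</value>
  </property>
  <property>
    <name>ssl.server.keystore.location</name>
    <value>/path/to/server-keystore.jks</value>
  </property>
  <property>
    <name>ssl.server.keystore.password</name>
    <value>KEYSTORE_PASSWORD_PLACEHOLDER</value>
  </property>
  <property>
    <name>ssl.server.keystore.keypassword</name>
    <value>KEY_PASSWORD_PLACEHOLDER</value>
  </property>

  <!-- Truststore holding the CA certificates this node should trust.
       Path and password are hypothetical placeholders. -->
  <property>
    <name>ssl.server.truststore.type</name>
    <value>jks</value>
  </property>
  <property>
    <name>ssl.server.truststore.location</name>
    <value>/path/to/truststore.jks</value>
  </property>
  <property>
    <name>ssl.server.truststore.password</name>
    <value>TRUSTSTORE_PASSWORD_PLACEHOLDER</value>
  </property>
</configuration>
```

The ssl-client.xml file follows the same pattern with ssl.client.* property names. Because both files contain passwords in clear text, it is worth restricting their file permissions to the accounts that run the Hadoop services.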

Encryption of Data at Rest outside HDFS

Once you have enabled in-transit and at-rest encryption in all the Apache Hadoop components, the last thing you need to configure is encryption of data at rest outside of HDFS.

While some organizations may consider encrypted hard drives, this method is not commonly used and also requires specialized and more expensive hardware. Instead, we can take advantage of the Linux native data-at-rest encryption features, namely dm-crypt.

Using LUKS with dm-crypt, we can create an encrypted block device that sits above your standard storage device and encrypts and decrypts data as files are written and read. While setting up a dm-crypt block device is pretty simple thanks to tools like cryptsetup, storing and protecting the encryption passphrase is not trivial and requires careful planning and testing. There are plenty of resources on the web for setting up dm-crypt; the steps vary based on your Linux distribution.

Summary

This article should give you a sense of what it takes to encrypt data in a Hadoop cluster; the tools are here and, if you are brave enough, you could attempt this at home. But the reality is that many things can go wrong if this is your first time, and you could end up permanently encrypting your data and throwing away the key (ultimate protection, even from you).

In my next column, I could go into much greater detail on how everything gets configured, but as everyone knows “threequels” are never as good as the first two, so I will save you the grief. Instead, I’ll offer you the option to look into how tools have automated and simplified the steps that are necessary to enable data encryption on Apache Hadoop.
