Data Protection

Hadoop Data Encryption at Rest and in Transit

In my previous contribution I reviewed Apache Hadoop’s native support for HDFS Encryption, which allows sensitive data to be stored and encrypted on HDFS with keys protected by KMS access control lists.

Eddie Garcia

August 24, 2016

This notwithstanding, in any real world big data environment there are many other data sources, data-staging areas, temporary files and log files where sensitive data may reside outside of HDFS. And of course sensitive data should also be protected “in transit” when going from end points into the cluster, or when a job runs moving data from one node in the cluster to another. The good news is that there are solutions for encrypting data at rest and in transit that will allow organizations not only to meet regulatory compliance requirements, but more importantly to protect and secure their information assets.

Apache Hadoop Encryption in Transit

The first step in securing an Apache Hadoop cluster is to enable encryption in transit, which is the foundation for every other security layer that will be added subsequently. Authentication and Kerberos rely on secure communications, so before you even go down the road of enabling authentication and Kerberos you must enable encryption of data-in-transit.

As you know, Apache Hadoop is not a monolithic application; quite the contrary, it’s a zoo of services from Pig and Hive to Impala and Kudu. These services may communicate with different protocols including RPC, TCP/IP or HTTP and each protocol has a different method to encrypt data. Everyone who does transactions on the web is (hopefully) familiar with an https:// in their browser and a little image of a lock next to it to identify a secure communication. Similarly, to achieve secure communications in Hadoop we need to enable the secure version of protocols used.

RPC/SASL

For the components that communicate with RPC we need to enable SASL to protect RPC data in transit. SASL is enabled by setting the hadoop.rpc.protection property in the core-site.xml file, and there are three choices when enabling this option:

• authentication, which provides authentication between the two parties

• integrity, which provides authentication between the two parties and message integrity

• privacy, which, in addition to authentication and integrity, also provides confidentiality

Of course, you will want to go with the most secure option, so let’s see what that looks like in the core-site.xml file.

<property>
<name>hadoop.rpc.protection</name>
<value>privacy</value>
</property>

Once this property is changed, all daemons in the cluster need to be restarted for all parties to communicate with the same encrypted protocol.

TCP/IP

The Hadoop Data Transfer Protocol, which is used to send and receive data from the cluster's DataNodes, uses direct TCP/IP sockets and supports encrypted communication with keys that are exchanged over RPC (which reinforces the need to enable SASL first). To enable this encrypted TCP/IP data stream, we set the dfs.encrypt.data.transfer property to “true” in the hdfs-site.xml configuration file. This configuration change must be made on both the NameNode and the DataNodes.

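<property>
<name>dfs.encrypt.data.transfer</name>
<value>true</value>
</property>
<property>
<name>dfs.encrypt.data.transfer.algorithm</name>
<!-- value not recoverable from the original listing; Hadoop accepts 3des or rc4 -->
</property>
<property>
<name>dfs.encrypt.data.transfer.cipher.suites</name>
<value>AES/CTR/NoPadding</value>
</property>
<property>
<name>dfs.encrypt.data.transfer.cipher.key.bitlength</name>
<value>256</value>
</property>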

As you may have noticed, we can also optionally configure the encryption cipher; in this case we have configured the more secure AES-256 algorithm.

Once this property is changed, the NameNode and DataNode daemons in the cluster need to be restarted.

TLS/HTTPS

This is where things get a bit more complicated. Different Apache Hadoop components have been developed in different coding languages – for example, MapReduce in Java, or Hue with Python – and for this reason there are a variety of ways and locations where SSL/TLS is configured. As an example, we will review enabling encryption for the WebUI of MapReduce v2 as well as enabling encrypted shuffles for MapReduce v2.

To enable the encrypted WebUI for MapReduce v2, you need to edit the core-site.xml file setting the hadoop.ssl.enabled property to “true”.

<property>
<name>hadoop.ssl.enabled</name>
<value>true</value>
<final>true</final>
</property>

And to enable encrypted shuffle for MapReduce v2, you will edit the mapred-site.xml file and also set the mapreduce.shuffle.ssl.enabled property to “true”.

<property>
<name>mapreduce.shuffle.ssl.enabled</name>
<value>true</value>
</property>

This is just scratching the surface. There are also certificates, trust stores and other SSL/TLS configurations that we need to worry about. But because this article is only meant to give you a high-level overview, we will not go into details and only point out that you also need to modify the ssl-server.xml and ssl-client.xml files with the appropriate values after obtaining SSL/TLS certificates from your AD admin or internal Certificate Authority (CA). In many cases, requesting the certificates from your AD team or a third party can be a lengthy process.
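To make that last step a little more concrete, here is a minimal sketch of what an ssl-server.xml might look like. The property names are the keystore and truststore settings from Hadoop's ssl-server.xml template; the file paths, the passwords and the choice of JKS as the store type are placeholders assumed for illustration, not values from a real cluster.

```xml
<?xml version="1.0"?>
<!-- ssl-server.xml: illustrative sketch only; every path and password below is a placeholder -->
<configuration>
  <!-- Keystore holding this node's private key and certificate -->
  <property>
    <name>ssl.server.keystore.type</name>
    <value>jks</value>
  </property>
  <property>
    <name>ssl.server.keystore.location</name>
    <!-- hypothetical path -->
    <value>/etc/hadoop/conf/server-keystore.jks</value>
  </property>
  <property>
    <name>ssl.server.keystore.password</name>
    <value>CHANGE_ME</value>
  </property>
  <!-- Truststore holding the CA certificates this node should trust -->
  <property>
    <name>ssl.server.truststore.type</name>
    <value>jks</value>
  </property>
  <property>
    <name>ssl.server.truststore.location</name>
    <!-- hypothetical path -->
    <value>/etc/hadoop/conf/truststore.jks</value>
  </property>
  <property>
    <name>ssl.server.truststore.password</name>
    <value>CHANGE_ME</value>
  </property>
</configuration>
```

The ssl-client.xml file follows the same pattern with the ssl.client.truststore.* properties. Because both files hold passwords in plain text, it is worth restricting their file permissions to the accounts that run the Hadoop services.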

Encryption of Data at Rest outside HDFS

Once you have enabled in-transit and at-rest encryption in all the Apache Hadoop components, the last thing you need to configure is encryption of data at rest outside of HDFS.

While some organizations may consider encrypted hard drives, this method is not commonly used and also requires specialized and more expensive hardware. Instead, we can take advantage of the Linux native data-at-rest encryption features, namely dm-crypt.

Using LUKS with dm-crypt we can create an encrypted block device that sits on top of a standard storage device and encrypts and decrypts data as files are written and read. While setting up a dm-crypt block device is fairly simple thanks to tools like cryptsetup, storing and protecting the encryption password is not trivial and requires careful planning and testing. There are plenty of resources on the web for setting up dm-crypt, and the details will vary based on your Linux distribution.

Summary

This article should give you a sense of what it takes to encrypt data in a Hadoop cluster; the tools are here and, if you are brave enough, you could attempt this at home. But the reality is that many things can go wrong if this is your first time, and you could end up permanently encrypting your data and throwing away the key (ultimate protection, even from you).

In my next column, I could go into much greater detail on how everything gets configured, but as everyone knows “threequels” are never as good as the first two, so I will save you the grief. Instead, I’ll offer you the option to look into how tools have automated and simplified the steps that are necessary to enable data encryption on Apache Hadoop.

Written By Eddie Garcia
