Security Experts:

Bigger Data, Smaller Problems: Managing Security Permissions of Data Subsets in Hadoop

“I’m Sorry Dave. I’m Afraid I Can’t Do That.”

In my previous column, I provided an overview of how authentication works on Hadoop and basic concepts to configure Kerberos for user login. In this installment, I will talk about authorization, which covers how to grant user permissions to a subset of data in Hadoop and limit the type of operations the user is allowed to perform. 

In a famous scene from the movie “2001: A Space Odyssey,” you might recall HAL saying, “I’m sorry Dave. I’m afraid I can’t do that.” This is an example where authentication passed, the computer knew it was Dave, yet authorization failed as Dave did not have permission to execute the request (since that would have compromised “the mission”). And as the story goes, Dave turned to physically shutting down HAL in one of the most memorable moments in film. “Dave, stop. Stop, will you? Stop, Dave. Will you stop, Dave? Stop, Dave. I'm afraid,” repeated HAL.

Big Data Representation

In the traditional SQL database world, we have established standards for granting user permissions to data and defining which operations are allowed. In the Hadoop world, it is more complex. To start, data can be unstructured, meaning it might not be in the standard table, row and column format that a database requires, an approach known as a predefined schema, or schema on write. One of the advantages of Hadoop is that you don’t have to predefine your data. For example, on HDFS you can store spreadsheets and text documents with different column formats, and Hadoop does not need to understand them up front. It is not until you use the data that you define the format of your results, which is known as schema on read.
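To make schema on read concrete, here is a minimal sketch in Python. It is only an illustration, not how Hadoop itself works internally: the sample data, the `read_with_schema` helper and the column names are all made up for this example. The raw text is stored untouched, and column names and types are applied only when the data is read.

```python
import csv
import io

# Raw text stored as-is; nothing about its structure is declared at write time.
# (Hypothetical sample data, invented for this illustration.)
RAW_CREW_FILE = "david,officer,2001\nfrank,officer,2001\n"


def read_with_schema(raw_text, column_names, converters):
    """Apply column names and type conversions only at read time."""
    rows = []
    for values in csv.reader(io.StringIO(raw_text)):
        row = {
            name: convert(value)
            for name, convert, value in zip(column_names, converters, values)
        }
        rows.append(row)
    return rows


# The "schema" is chosen by the reader, not by whoever stored the file.
crew = read_with_schema(RAW_CREW_FILE, ["name", "rank", "year"], [str, str, int])
print(crew[0])
```

A different reader could apply a different set of names and types to the same stored text, which is the flexibility, and the security challenge, that schema on read brings.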

Another major difference from SQL is that there are a multitude of ways in Hadoop to access data, including directly in HDFS, through Hive, using HBase and graphically in HUE, to name a few. Each of these access methods has its own way to define user permissions and authorization.

Let’s dive into authorization starting with HDFS, since it is at the core of Hadoop. The Hadoop Distributed File System supports POSIX-style permissions, very similar to those found on Linux filesystems. For a file or directory on HDFS we can define the owner and group as well as assign the operation permissions. On HDFS the permissions that can be set are read, write and execute, for both files and directories.

Let’s say that HAL was running Hadoop (though Hadoop did not exist in 2001, nor HAL for that matter). If we were to list the permissions in its Hadoop filesystem for the user home directories, we might have seen:

$ hdfs dfs -ls /user/

 drwx------   -   hal   hal      0 2014-10-21 12:30   /user/hal   

 drwxrw-r--   -   david officers 0 2014-10-21 12:30   /user/david

Using the -ls command, the permissions are shown at the left of each line as a sequence of letters (r, w, x), after a leading d that marks a directory. The permissions are ordered left to right for: owner, group and others.

A directory that grants every permission to everyone would be represented as drwxrwxrwx, such that

owner + group + others

drwx  + rwx   + rwx

The directory /user/hal has permission:   drwx------  

owner= hal        permissions=read, write and execute

group= hal        permissions=none

others            permissions=none

The directory /user/david has permission: drwxrw-r--    

owner= david      permissions=read, write and execute

group= officers   permissions=read and write

others            permissions=read
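The same breakdown can be sketched in a few lines of Python. This is only an illustration: the `decode_permissions` helper is a made-up name, and it assumes a plain 10-character mode string as shown in the listing, ignoring extras such as the sticky bit or a trailing ACL marker.

```python
def decode_permissions(mode):
    """Split a 10-character mode string such as 'drwxrw-r--' into its parts."""
    if len(mode) != 10:
        raise ValueError("expected a 10-character mode string")
    names = {"r": "read", "w": "write", "x": "execute"}
    result = {"is_directory": mode[0] == "d"}
    for position, who in enumerate(["owner", "group", "others"]):
        # Characters 1-3 belong to the owner, 4-6 to the group, 7-9 to others.
        triplet = mode[1 + 3 * position : 4 + 3 * position]
        result[who] = [names[flag] for flag in triplet if flag in names]
    return result


hal_home = decode_permissions("drwx------")
david_home = decode_permissions("drwxrw-r--")
print(hal_home)
print(david_home)
```

For /user/david this yields read, write and execute for the owner, read and write for the group, and read for others. Keep in mind that on a directory, the execute flag governs access to the directory's children rather than running anything.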

When Kerberos authentication is configured correctly and enforced, these permissions will protect data so only the users with the correct permissions can read/write to those files and directories.

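NOTE: Linux users should be aware that the execute permission does not mean the same thing on HDFS as it does on Linux. HDFS has no notion of executable files, so the execute permission is ignored for files; for directories, it is required to access the children of the directory. For more Hadoop Security look here.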

But HDFS POSIX permissions might not meet all your security requirements. Let’s say you loaded CSV spreadsheets, XML and JSON files into HDFS, but want to let users access only a subset of the data within these files. Unfortunately, there isn't a way to define permissions that granular in HDFS. In other words, access to a file is all or nothing.

So how can you define more granular permissions in Hadoop? Typically users don’t interact directly with HDFS on a day-to-day basis. They usually access data via an application running above HDFS, such as Hive or HBase. Hive and HBase will allow you to define more granular authorization — including table, column and cell level permissions — and allow users to access only a subset of data.

So let’s move on to explore authorization with Hive, which supports SQL-like permissions. The example below is for HiveServer2 (SQL standard based authorization was added in Hive version 0.13).

> CREATE ROLE commander;

> GRANT SELECT ON TABLE missions TO ROLE commander;

> GRANT commander TO USER david;

> SHOW ROLE GRANT USER david;

+-----------+--------------+---------------+----------+

| role      | grant_option | grant_time    | grantor  |

+-----------+--------------+---------------+----------+

| public    | false        | 0             |          |

| commander | false        | 1232323232323 | hal      |

+-----------+--------------+---------------+----------+

In this simple example, we first CREATE the ROLE commander. We then GRANT SELECT permission on the table missions to the new commander role. With these first two steps, we have defined a role with permissions that can be assigned to users and groups. To complete the steps, we GRANT the commander role to user david. To confirm we completed the steps correctly, we list the roles granted to david using the SHOW command and see that david now holds the commander role, granted by hal, in addition to the default public role. For more detailed information on Hive authorization refer to this

Now let’s review an example of HBase permissions. HBase is not as similar to a SQL database as Hive is. Instead, HBase organizes its data in tables, column families and columns. In the example below, we won’t be as granular and will just grant read permissions at the table level.

hbase> grant 'david', 'R', 'systemscheck'

This simple command grants user david read permissions ‘R’ on the HBase table systemscheck.

To verify that the grant command took effect and review the permissions for david, we run the user_permission command.

hbase> user_permission 'systemscheck'

User    Table,Family,Qualifier:Permission                                      

david  systemscheck,, :[Permission: actions=READ]

hal    systemscheck,, :[Permission: actions=READ,WRITE,EXEC,CREATE,ADMIN]

And we can clearly see that david has only READ permission on the table systemscheck, while hal has full READ, WRITE, EXEC, CREATE and ADMIN permissions.

For more detailed information on HBase authorization refer to this

With a basic understanding now of HDFS, Hive and HBase permissions, you can see that there are many factors to consider when setting up permissions in Hadoop. Below are a few guidelines to take into account when setting up Hadoop authorization. 

1. Start with a Kerberized Hadoop environment so users are always required to log in.

2. Set defaults of minimal permissions. For example, owner=read  group,other=none.

3. Grant permissions to data on a need-to-know basis.

4. Continually audit and monitor to ensure permissions meet your internal policies. 
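As a small illustration of guidelines 2 and 4, the sketch below scans the text of a directory listing and flags any entry where the group or others have been given any permission at all. It is only a starting point under stated assumptions: the `find_open_entries` helper is a made-up name, the sample listing is the example output from earlier in this column, and a real audit would need to read the listing from your cluster and apply your own policy.

```python
# Sample listing copied from the example output earlier in this column.
SAMPLE_LISTING = """\
drwx------   -   hal   hal      0 2014-10-21 12:30   /user/hal
drwxrw-r--   -   david officers 0 2014-10-21 12:30   /user/david
"""


def find_open_entries(listing_text):
    """Return paths whose group or others permission bits are not all '-'."""
    flagged = []
    for line in listing_text.splitlines():
        # Fields: mode, replication, owner, group, size, date, time, path.
        fields = line.split(None, 7)
        if len(fields) < 8 or len(fields[0]) < 10:
            continue  # skip blank lines and summary lines that are not entries
        mode, path = fields[0], fields[7].strip()
        group_bits, other_bits = mode[4:7], mode[7:10]
        if group_bits != "---" or other_bits != "---":
            flagged.append(path)
    return flagged


print(find_open_entries(SAMPLE_LISTING))
```

With the sample listing, only /user/david is flagged, since /user/hal grants nothing to the group or to others.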

It might seem overwhelming to set up authorization individually in each of these applications, and the authors of Apache Sentry would agree with you. Apache Sentry was built to help overcome this problem by providing a unified and centralized authorization framework for role-based access control. Role-based access control (RBAC) enables administrators to define permissions once for roles, making it easier to grant those roles to, and revoke them from, users or groups. Sentry also gives you methods to define more granular permissions centrally that can be applied across various applications, including Hive and Search. For more information, visit http://sentry.incubator.apache.org/
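To show the idea behind role-based access control, here is a toy model in Python. To be clear, this is not Sentry's API or its implementation; the `RoleBasedAccess` class and its method names are invented for illustration, and the role, user and table names are borrowed from the Hive example above.

```python
from collections import defaultdict


class RoleBasedAccess:
    """Toy RBAC model: privileges attach to roles, roles attach to users."""

    def __init__(self):
        self.role_privileges = defaultdict(set)  # role -> {(action, resource)}
        self.user_roles = defaultdict(set)       # user -> {role}

    def grant_privilege(self, role, action, resource):
        self.role_privileges[role].add((action, resource))

    def grant_role(self, role, user):
        self.user_roles[user].add(role)

    def revoke_role(self, role, user):
        self.user_roles[user].discard(role)

    def is_allowed(self, user, action, resource):
        # A user is allowed if any of their roles carries the privilege.
        return any(
            (action, resource) in self.role_privileges.get(role, ())
            for role in self.user_roles.get(user, ())
        )


rbac = RoleBasedAccess()
rbac.grant_privilege("commander", "SELECT", "missions")  # define the role once
rbac.grant_role("commander", "david")                    # then assign it to users

print(rbac.is_allowed("david", "SELECT", "missions"))
print(rbac.is_allowed("david", "INSERT", "missions"))
```

The point of the design is that the privilege is defined once on the role. Revoking the commander role from david removes his access in one step, without touching the privileges themselves.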

In my next article, I will cover in more detail the point I mentioned above on “continually audit and monitor to ensure permissions meet your internal policies” and provide some methods on how it can be achieved.

Eddie Garcia is an information security architect at Cloudera, a provider of enterprise analytic data management, where he helps enterprise customers reduce security and compliance risks associated with sensitive data sets stored and accessed in Apache Hadoop environments. He was formerly the VP of InfoSec and Engineering for Gazzang prior to its acquisition by Cloudera. He was the chief architect of the Gazzang zNcrypt product and is author of four issued and provisional patents for data security. Prior to Gazzang, he was responsible for Enterprise Architecture projects that helped AMD’s distribution and OEM partners securely collaborate over secure networks with single sign-on. He holds an engineering degree in computer science from the Instituto Tecnologico y de Estudios Superiores de Monterrey.