
Hadoop Audit and Logging “Back in Time”


In my previous contributions I covered authentication and authorization in Hadoop. This time I will be covering Audit, the third of the three AAAs of Information Security.  Audit and monitoring are critical to data security. Through audit, we can ensure that the security controls that are in place are working correctly and identify attempts to circumvent them.

Logs are a common way to record the actions of an application, allowing administrators and auditors to go “Back in Time” to review a user’s actions. Much like your credit card or bank statement, these logs provide evidence of the transactions performed. In the absence of a time machine, these logs may be the only means of reconstructing what took place in a Hadoop cluster at a given moment in time.

As you all know by now, Hadoop has many different components and it just so happens that they have different types of audit logs. I will cover the auditing capabilities of several components in this article.

HDFS Audit Logs

HDFS is at the core of Hadoop, providing the distributed file system that makes Hadoop so successful. HDFS has two different audit logs: hdfs-audit.log for user activity and SecurityAuth-hdfs.audit for service activity. Both of these logs are implemented with Apache Log4j, a common and well-known logging framework for Java, and are configured through the Hadoop log4j.properties file:
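As a sketch, the stock Hadoop log4j.properties routes each audit logger to its own rolling file along these lines (appender names and file paths can differ between distributions and versions):

```properties
# User activity: the FSNamesystem audit logger feeds hdfs-audit.log.
hdfs.audit.logger=INFO,RFAAUDIT
log4j.logger.org.apache.hadoop.hdfs.server.namenode.FSNamesystem.audit=${hdfs.audit.logger}
log4j.additivity.org.apache.hadoop.hdfs.server.namenode.FSNamesystem.audit=false
log4j.appender.RFAAUDIT=org.apache.log4j.RollingFileAppender
log4j.appender.RFAAUDIT.File=${hadoop.log.dir}/hdfs-audit.log
log4j.appender.RFAAUDIT.layout=org.apache.log4j.PatternLayout
log4j.appender.RFAAUDIT.layout.ConversionPattern=%d{ISO8601} %p %c{2}: %m%n

# Service activity: the SecurityLogger hierarchy feeds SecurityAuth-<user>.audit,
# which for a daemon running as the hdfs user yields SecurityAuth-hdfs.audit.
hadoop.security.logger=INFO,RFAS
log4j.logger.SecurityLogger=${hadoop.security.logger}
log4j.appender.RFAS=org.apache.log4j.RollingFileAppender
log4j.appender.RFAS.File=${hadoop.log.dir}/SecurityAuth-${user.name}.audit
```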


Below is an example log for user Marty McFly after listing files/directories and an attempted move (rename) into directory /user/doc, which was denied.


2015-07-01 12:15:10,123 INFO FSNamesystem.audit: allowed=true  [email protected] (auth:KERBEROS) ip=/ cmd=getfileinfo src=/user/martymcfly dst=null perm=null

2015-07-01 12:15:10,125 INFO FSNamesystem.audit: allowed=true  [email protected] (auth:KERBEROS) ip=/ cmd=listStatus src=/user/martymcfly dst=null perm=null

2015-07-01 12:15:46,167 INFO FSNamesystem.audit: allowed=false [email protected] (auth:KERBEROS) ip=/ cmd=rename src=/user/martymcfly/delorean dst=/user/doc perm=null

MapReduce Audit Logs

Like HDFS, MapReduce also has two logs: mapred-audit.log for user activity and SecurityAuth-mapred.audit for service activity. The Log4j configuration can be found in the same log4j.properties file:
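A sketch of the corresponding properties, assumed to mirror the stock Hadoop log4j.properties (names may vary by version):

```properties
# User activity: the MapReduce AuditLogger feeds mapred-audit.log.
mapred.audit.logger=INFO,MRAUDIT
log4j.logger.org.apache.hadoop.mapred.AuditLogger=${mapred.audit.logger}
log4j.additivity.org.apache.hadoop.mapred.AuditLogger=false
log4j.appender.MRAUDIT=org.apache.log4j.RollingFileAppender
log4j.appender.MRAUDIT.File=${hadoop.log.dir}/mapred-audit.log
log4j.appender.MRAUDIT.layout=org.apache.log4j.PatternLayout
log4j.appender.MRAUDIT.layout.ConversionPattern=%d{ISO8601} %p %c{2}: %m%n
```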


YARN Audit Logs

In YARN, user audit events are not written to a separate file but are instead mixed into the daemon log files. Service-level logging is enabled with the same Log4j property used by HDFS and MapReduce:
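A minimal sketch, assuming the same SecurityLogger mechanism as the other daemons:

```properties
# YARN user audit events (RMAuditLogger / NMAuditLogger) remain in the
# ResourceManager and NodeManager daemon logs; only service-level auth
# logging gets a dedicated file, via the shared SecurityLogger hierarchy.
hadoop.security.logger=INFO,RFAS
log4j.logger.SecurityLogger=${hadoop.security.logger}
log4j.appender.RFAS=org.apache.log4j.RollingFileAppender
log4j.appender.RFAS.File=${hadoop.log.dir}/SecurityAuth-${user.name}.audit
```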


Hive Audit Logs

Hive is a bit different and uses the Hive Metastore for service logging. To pick out the Hive audit events from the other logged events, filter for lines containing org.apache.hadoop.hive.metastore.HiveMetaStore.audit. Hive audit events also contain information identifying which database or table is being operated on.
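The filtering step can be sketched in a few lines of Python. The marker string is from the article; the sample log lines below are illustrative, not real Metastore output:

```python
# Filter Hive Metastore audit events out of a mixed daemon log.
AUDIT_MARKER = "org.apache.hadoop.hive.metastore.HiveMetaStore.audit"

def hive_audit_events(lines):
    """Return only the lines emitted by the Metastore audit logger."""
    return [line for line in lines if AUDIT_MARKER in line]

# Illustrative sample: one audit event, one unrelated daemon message.
sample = [
    "2015-07-01 12:20:01 INFO org.apache.hadoop.hive.metastore.HiveMetaStore.audit: "
    "ugi=martymcfly cmd=get_table : db=default tbl=delorean",
    "2015-07-01 12:20:02 INFO SomeOtherLogger: unrelated daemon chatter",
]

for event in hive_audit_events(sample):
    print(event)
```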

HBase Audit Logs

HBase has a separate file for audit logs, though playing back a user’s activity is a bit trickier because the events can be spread across the HBase nodes. The events contain information about the column family, column, table and action performed. The Log4j configuration can be found in HBase’s log4j.properties file:
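A sketch of the relevant properties, assuming the AccessController coprocessor is the audit source (TRACE level is needed for verbose per-request events):

```properties
# HBase access-control audit events flow through the SecurityLogger hierarchy.
log4j.logger.SecurityLogger=INFO,RFAS
log4j.logger.SecurityLogger.org.apache.hadoop.hbase.security.access.AccessController=TRACE
log4j.appender.RFAS=org.apache.log4j.RollingFileAppender
log4j.appender.RFAS.File=${hbase.log.dir}/SecurityAuth.audit
log4j.appender.RFAS.layout=org.apache.log4j.PatternLayout
log4j.appender.RFAS.layout.ConversionPattern=%d{ISO8601} %p %c: %m%n
```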


Sentry Audit Logs

While logging user operations is important, logging administrative operations and changes to user permissions is even more so. Apache Sentry also uses Log4j and writes to a dedicated file that is configured with:
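As a rough sketch only: a dedicated Log4j audit file would be wired up in the same pattern as the other components. The logger and appender names below are placeholders, not verified Sentry configuration keys:

```properties
# Placeholder sketch: logger, appender and path names are illustrative,
# not confirmed Sentry property names.
log4j.logger.com.example.sentry.audit=INFO, SENTRY_AUDIT
log4j.appender.SENTRY_AUDIT=org.apache.log4j.RollingFileAppender
log4j.appender.SENTRY_AUDIT.File=${sentry.log.dir}/sentry-audit.log
log4j.appender.SENTRY_AUDIT.layout=org.apache.log4j.PatternLayout
log4j.appender.SENTRY_AUDIT.layout.ConversionPattern=%d{ISO8601} %p %c: %m%n
```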


Cloudera Impala Audit Logs

Each Cloudera Impala daemon has its own audit log file. The format is a bit different and uses JSON for easier parsing of events. Like Hive, Impala logs information about the database, table and even the SQL statement executed.
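One-record-per-line JSON makes these logs easy to process programmatically. A minimal Python sketch follows; the field names (user, db_name, sql_statement) are illustrative assumptions, not the exact Impala audit schema:

```python
import json

def parse_audit_line(line):
    """Parse one line of a JSON-per-line audit log into a dict."""
    return json.loads(line)

# Illustrative record with assumed field names, not real Impala output.
record = parse_audit_line(
    '{"user": "martymcfly", "db_name": "default", '
    '"sql_statement": "SELECT * FROM delorean"}'
)
print(record["user"], record["sql_statement"])
```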

Monitoring and Log Analysis: Event Analysis and Alerts

Once you have set up all the Hadoop logging, an equally important step is to monitor the cluster proactively for security events, breaches and suspicious activity. And what better place to do this than Hadoop itself!

Among the many other great use cases for big data, one is to use Hadoop for log ingestion and security analytics. In the past, important information contained in log files was discarded during log rotation, but now, with Hadoop, smart organizations are storing all log data for active archiving. These organizations then take advantage of the large ecosystem of tools built on Hadoop for advanced persistent threat (APT) analytics, security forensics, cyber intelligence and user behavior machine learning.

Stay tuned for upcoming articles on new methods and approaches to capture network, packet and DNS data on Apache Hadoop to detect potential threats using machine learning.

It is always a good idea to make sure you have enabled logging correctly even on existing clusters or after performing upgrades. And if you are not currently storing logs in Hadoop you should definitely start now. 
