hadoop security

The admin user I created for the EQRS Kerberos setup is cloudera-scm/admin@CLOUDERA.COM with password admin

set JAVA_HOME first in order to use keytool

if you install Java with yum, it will install OpenJDK, which is not recommended, and the default Java home is placed under /usr/lib/jvm

run find / -type d -name "jdk*" to locate the Java home

export JAVA_HOME=/usr/java/jdk1.8.0_121-cloudera
export PATH=$JAVA_HOME/bin:$PATH

to make it permanent, add the two export lines to /etc/profile (vi /etc/profile), then run:

source /etc/profile
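As a sketch, the whole setup can be checked in one go. A temp file stands in for /etc/profile here, purely for illustration; the JDK path is the one used above.

```shell
# Sketch: persist JAVA_HOME by appending the exports to a profile script and
# sourcing it. The real target is /etc/profile; a temp file is used here
# for illustration only.
profile=$(mktemp)
cat >> "$profile" <<'EOF'
export JAVA_HOME=/usr/java/jdk1.8.0_121-cloudera
export PATH=$JAVA_HOME/bin:$PATH
EOF
. "$profile"

# the JDK's bin directory should now be first on PATH, so keytool resolves
echo "$JAVA_HOME"
echo "$PATH" | cut -d: -f1
```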

cloudera HTTPS level 0

Administration -> Settings -> Security

the truststore must be assigned, and the Cloudera Manager server and agents restarted, for the change to take effect; otherwise the management service won't start

generate self signed certificate with keytool

1) generate keystore file with domain specified

# key point: the hostname (CN) must be the private domain name of the EC2 instance for the Cloudera management service to recognize the host

# when enabling level 1 TLS, the CN must be the internal IP of the EC2 instance for the generated JKS to work

there is a conflict when enabling level 1 TLS: the management service expects the CN to be the hostname, while level 1 TLS expects the CN to be the IP

keytool -genkeypair -alias cmhost -keyalg RSA -keystore /opt/cloudera/security/jks/cmhost-keystore.jks -keysize 2048 -dname "CN=ip-10-0-0-169.us-gov-west-1.compute.internal,OU=Eqrs,O=Mantech,L=OwingsMills,ST=Maryland,C=US" -storepass eqrs#cloudera -keypass eqrs#cloudera

2) extract the X.509 certificate and private key from the keystore for Hue

keytool -importkeystore -srckeystore /opt/cloudera/security/jks/cmhost-keystore.jks -srcstorepass eqrs#cloudera -srckeypass eqrs#cloudera -destkeystore cloudera.p12 -deststoretype PKCS12 -srcalias cmhost -deststorepass eqrs#cloudera -destkeypass eqrs#cloudera

sudo openssl pkcs12 -in cloudera.p12 -passin pass:eqrs#cloudera  -nokeys -out /opt/cloudera/security/x509/cloudera.pem

sudo openssl pkcs12 -in cloudera.p12 -passin pass:eqrs#cloudera -nocerts -out /opt/cloudera/security/x509/cloudera.key -passout pass:eqrs#cloudera

3) export the self-signed certificate from the keystore

# export the certificate from the keystore; this certificate acts as if it were issued by a public CA such as VeriSign

keytool -exportcert -keystore /opt/cloudera/security/jks/cmhost-keystore.jks -alias cmhost -storepass eqrs#cloudera -file /opt/cloudera/security/x509/cmhost-keystore.cer

4) create a custom truststore from the certificate for the cloudera management service

keytool -importcert -keystore /opt/cloudera/security/jks/cmhost.truststore -alias cmhost -storepass eqrs#cloudera -file /opt/cloudera/security/x509/cmhost-keystore.cer -noprompt
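If keytool isn't on the PATH yet, the same generate/extract flow can be sanity-checked with openssl alone. This is a sketch: the paths and CN are illustrative, while the real commands above use the cluster's private DNS name and /opt/cloudera/security.

```shell
# Sketch: the keytool flow above, reproduced with openssl only.
dir=$(mktemp -d)

# 1) generate a self-signed key pair and certificate with an explicit CN
openssl req -x509 -newkey rsa:2048 -nodes -days 365 \
  -subj "/CN=ip-10-0-0-169.us-gov-west-1.compute.internal/OU=Eqrs/O=Mantech" \
  -keyout "$dir/cloudera.key" -out "$dir/cloudera.pem" 2>/dev/null

# 2) bundle certificate and key into PKCS12, as keytool -importkeystore does
openssl pkcs12 -export -in "$dir/cloudera.pem" -inkey "$dir/cloudera.key" \
  -name cmhost -passout pass:eqrs#cloudera -out "$dir/cloudera.p12"

# 3) check the certificate's CN, which the management service matches against
openssl x509 -in "$dir/cloudera.pem" -noout -subject
```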

config https for cloudera manager


config https for hue


config https for navigator


home page


AWS S3 encryption



AWS Key Management Service





access management


policy simulator


how a role differs from a resource-based policy


policy basics


aws read only policy example: https://aws.amazon.com/code/AWS-Policy-Examples/6851158459579252

aws read/write policy example:

EC2 actions


user, group and policy when to create user and when to create role


Don’t create an IAM user and pass the user’s credentials to the application or embed the credentials in the application. Instead, create an IAM role that you attach to the EC2 instance to give applications running on the instance temporary security credentials.

Admin group exam


within one account there are groups, and under those are the users

a policy is like a permission; a policy can also be attached directly to a group

a role is a collection of policies, and a role can be attached to multiple entities such as users, EC2 instances, and apps. It acts as a decoupling layer between policies and entities


best practice: add users to groups and apply policies to groups; create roles for apps running on EC2
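A policy is just a JSON document. As a sketch, a minimal read-only S3 policy might look like this; the bucket name "example-bucket" is hypothetical.

```shell
# Sketch: a minimal read-only S3 policy document (bucket name is made up).
cat > /tmp/readonly-policy.json <<'EOF'
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": ["s3:GetObject", "s3:ListBucket"],
      "Resource": [
        "arn:aws:s3:::example-bucket",
        "arn:aws:s3:::example-bucket/*"
      ]
    }
  ]
}
EOF
# validate the JSON before attaching the policy via the console or CLI
python3 -m json.tool /tmp/readonly-policy.json > /dev/null && echo "policy JSON is valid"
```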


notes on configuring Kerberos + LDAP (Lightweight Directory Access Protocol) + Sentry on CDH 5.3




implementing security in hadoop cluster


aws EBS encryption


When you create an encrypted EBS volume and attach it to a supported instance type, the following types of data are encrypted:

  • Data at rest inside the volume
  • All data moving between the volume and the instance
  • All snapshots created from the volume
Hadoop cluster security overview (from the Cloudera docs):

  • As illustrated, external data streams can be authenticated by mechanisms in place for Flume and Kafka. Any data from legacy databases is ingested using Sqoop. Users such as data scientists and analysts can interact directly with the cluster using interfaces such as Hue or Cloudera Manager. Alternatively, they could be using a service like Impala for creating and submitting jobs for data analysis. All of these interactions can be protected by an Active Directory Kerberos deployment.
  • Encryption can be applied to data at rest using transparent HDFS encryption with an enterprise-grade Key Trustee Server. Cloudera also recommends using Navigator Encrypt to protect data on a cluster associated with the Cloudera Manager, Cloudera Navigator, Hive and HBase metastores, and any log files or spills.
  • Authorization policies can be enforced using Sentry (for services such as Hive, Impala and Search) as well as HDFS Access Control Lists.
  • Auditing capabilities can be provided by using Cloudera Navigator.

config sentry on cloudera general instruction



data encryption

in transit: TLS via HTTPS

HDFS at rest: HDFS encryption

S3 at rest: S3 server-side data-at-rest encryption

Cloudera Manager also supports TLS authentication. Without certificate authentication, a malicious user can add a host to Cloudera Manager by installing the Cloudera Manager Agent software and configuring it to communicate with Cloudera Manager Server. To prevent this, you must install certificates on each agent host and configure Cloudera Manager Server to trust those certificates.

https with SSL or TLS

HTTPS is a protocol; it uses either the SSL or TLS protocol to encrypt the data. The point of HTTPS is to make sure:

1) data flowing over the internet is encrypted (symmetric encryption with a session key)

2) data flows ONLY between the server and the client that requested it (because only the server and client have the session key that can decrypt the data)

3) data is safe even if it's intercepted, because it's encrypted

a certificate contains the public key

a keystore file contains the certificate and the private key

the private key stays only on the server and is used during the handshake; the client never has it

only the client decides whether to trust the certificate sent by the server, which means that during the handshake the client verifies the server, but the server doesn't verify the client
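The session-key idea above can be sketched with openssl's symmetric encryption; the key string and file paths here are made up for illustration.

```shell
# Sketch of "symmetric encryption with a session key": once both sides share
# the session key, the same key encrypts and decrypts.
session_key="example-session-key"
echo "GET /index.html" > /tmp/request.txt

# sender side: encrypt with the shared session key
openssl enc -aes-256-cbc -pbkdf2 -salt -pass pass:"$session_key" \
  -in /tmp/request.txt -out /tmp/request.enc

# receiver side: decrypt with the same key; an interceptor without the key
# sees only ciphertext
openssl enc -d -aes-256-cbc -pbkdf2 -pass pass:"$session_key" \
  -in /tmp/request.enc -out /tmp/request.dec

diff /tmp/request.txt /tmp/request.dec && echo "round trip OK"
```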



data redaction (data masking)




data encryption

Understanding Java Keystores and Truststores

The standard Oracle Java JDK distribution includes a default truststore (cacerts) that contains root certificates for many well-known CAs, including Symantec. Rather than using this default truststore, Cloudera recommends using the alternative truststore (jssecacerts), which is created by simply copying cacerts to a file of that name. This file is loaded by Hadoop daemons at startup. All clients in a Cloudera Manager cluster configured for TLS/SSL need access to the truststore, to ascertain the validity of any certificates presented during TLS/SSL session negotiation, for example. The certificates assure the client or server process as to the validity of the host's public key. The private keys are maintained in the keystore.

cloudera navigator encrypt architecture

Still in the data-at-rest category: it fills in what HDFS at-rest encryption doesn't cover, such as logs and metadata.


Navigator Encrypt is part of Cloudera's overall encryption-at-rest solution, along with HDFS encryption—which operates at the HDFS folder level, enabling encryption to be applied only to HDFS folders where needed—and Navigator Key Trustee, which is a virtual safe-deposit box for managing encryption keys, certificates, and passwords.

HDFS transparent end-to-end encryption

  1. Transparent means that end users are unaware of the encryption/decryption processes, and end-to-end means that data is encrypted at rest and in transit.
  2. To get started with deploying the KMS and a keystore, see Enabling HDFS Encryption Using the Wizard on page 272. For information on configuring and securing the KMS, see Configuring the Key Management Server (KMS) on page 280 and Securing the Key Management Server (KMS) on page 285.
  3. Note: An encryption zone cannot be created on top of an existing directory.
  4. Accessing files within an encryption zone: to encrypt a new file, the HDFS client requests a new EDEK from the NameNode. The client then asks the KMS to decrypt it with the encryption zone's EZ key. This decryption results in a DEK, which is used to encrypt the file.
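The EDEK/DEK relationship is plain envelope encryption, which can be sketched with openssl. Keys and paths here are illustrative; the real KMS keeps the EZ key internally and never releases it.

```shell
# Sketch of the EDEK/DEK envelope-encryption idea (not the real KMS protocol).
work=$(mktemp -d)
ez_key="zone-master-key"                 # the encryption zone's EZ key (held by the KMS)
openssl rand -hex 32 > "$work/dek"       # per-file Data Encryption Key

# KMS side: encrypt the DEK with the EZ key; the result is the EDEK,
# which is what gets stored in the file's metadata on the NameNode
openssl enc -aes-256-cbc -pbkdf2 -salt -pass pass:"$ez_key" \
  -in "$work/dek" -out "$work/edek"

# client side (via the KMS): decrypt the EDEK back into the DEK
openssl enc -d -aes-256-cbc -pbkdf2 -pass pass:"$ez_key" \
  -in "$work/edek" -out "$work/dek.out"

# the recovered DEK is what encrypts the actual file contents
echo "file contents" > "$work/file"
openssl enc -aes-256-cbc -pbkdf2 -salt -pass "file:$work/dek.out" \
  -in "$work/file" -out "$work/file.enc"
```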

Java Key Store (JKS)

Java KeyStore (JKS) is a repository of security certificates – either authorization certificates or public key certificates – plus corresponding private keys, used for instance in SSL encryption.

The Java Development Kit maintains a CA keystore in folder jre/lib/security/cacerts. JDKs provide a tool named keytool to manipulate the keystore.



install and config








  • how the Secret is represented
  • how A provides the Secret to B
  • how B recognizes the Secret

Long-term Key / Master Key: in the security domain, some keys may stay unchanged for a long time. We generally call such a hash code the Master Key. Since a hash algorithm is irreversible, and the password corresponds one-to-one with its Master Key, this keeps the password confidential while making the Master Key just as effective as the password itself for proving your identity.

Short-term Key / Session Key: because packets encrypted with a Long-term Key should not be sent over the network, another kind of key, the Short-term Key (Session Key), is used to encrypt data that needs to be transmitted over the network.

Key Distribution Center (KDC): the KDC plays an important role in the whole Kerberos authentication as the third party trusted by both Client and Server, and the Kerberos authentication process is completed through the cooperation of these three parties.

From the above, we can see that Kerberos is essentially a ticket-based authentication scheme. For a Client to access the Server's resources, it must first pass the Server's authentication, and the precondition for that is that the Client presents a Session Ticket (Session Key + Client Info) obtained from the KDC and encrypted with the Server's Master Key. The Session Ticket is, so to speak, the Client's admission ticket into the Server's realm, and it must be obtained from a legitimate issuing authority: the KDC, trusted by both Client and Server. The ticket also carries a strong anti-forgery mark: it is encrypted with the Server's Master Key. For the Client, obtaining the Session Ticket is the most critical part of the whole authentication process.

If we compare the Ticket that the Client presents to the Server to a stock, then before the Client can obtain the Ticket from the KDC, it first needs a subscription warrant for it. In Kerberos this warrant is called the TGT (Ticket Granting Ticket), and the TGT is likewise issued by the KDC.

With the above, we basically understand the whole Kerberos authentication flow, which roughly consists of the following 3 sub-processes:

  1. The Client requests a TGT (Ticket Granting Ticket) from the KDC.
  2. The Client uses the TGT to request from the KDC a Ticket for accessing the Server.
  3. The Client finally submits the Ticket to the Server so the Server can authenticate it.

KDC gives a TGT to the client -> the client uses the TGT to get a Session Ticket (session key + client info) from the KDC -> the client uses the session ticket to verify itself to the kerberized service -> gets permission and then accesses the resource (hadoop resources)
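The role of the Server's Master Key in protecting the Session Ticket can be sketched with openssl symmetric encryption. The keys and ticket format here are illustrative, not the real Kerberos wire format.

```shell
# Sketch of why the Session Ticket is hard to forge: it is encrypted with the
# Server's Master Key, which only the Server and the KDC hold.
work=$(mktemp -d)
server_master_key="server-secret"              # shared by Server and KDC only
printf 'session_key=abc123;client=alice' > "$work/ticket.plain"

# KDC side: seal the Session Ticket (session key + client info)
openssl enc -aes-256-cbc -pbkdf2 -salt -pass pass:"$server_master_key" \
  -in "$work/ticket.plain" -out "$work/ticket.enc"

# the Client forwards ticket.enc but cannot read or alter its contents;
# Server side: open the ticket with its own Master Key
openssl enc -d -aes-256-cbc -pbkdf2 -pass pass:"$server_master_key" \
  -in "$work/ticket.enc" -out "$work/ticket.dec"
cat "$work/ticket.dec"
```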

install kerberos server on centos



kerberos instruction


Cloudera Navigator 2.0 Overview


Data Management

  • Audit data access and verify access privileges – The goal of auditing is to capture a complete and immutable record of all activity within a system. Cloudera Navigator auditing adds secure, real-time audit components to key data and access frameworks. Compliance groups can use Cloudera Navigator to configure, collect, and view audit events that show who accessed data, and how.
  • Search metadata and visualize lineage – Cloudera Navigator metadata management allows DBAs, data stewards, business analysts, and data scientists to define, search for, amend the properties of, and tag data entities and view relationships between datasets.
  • Policies – Data stewards can use Cloudera Navigator policies to define automated actions, based on data access or on a schedule, to add metadata, create alerts, and move or purge data.
  • Analytics – Hadoop administrators can use Cloudera Navigator analytics to examine data usage patterns and create policies based on those patterns

Data Encryption

  • Cloudera Navigator Encrypt transparently encrypts and secures data at rest without requiring changes to your applications and ensures there is minimal performance lag in the encryption or decryption process. It also has ACLs. Navigator Encrypt is best used to encrypt data outside HDFS, since HDFS itself has a better end-to-end encryption mechanism. The ACL uses rules to control process access to files. The rules specify whether a Linux process has access permissions to read from or write to a specific Navigator Encrypt path.
  • Cloudera Navigator Key Trustee Server is an enterprise-grade virtual safe-deposit box that stores and manages cryptographic keys and other security artifacts.
  • Cloudera Navigator Key HSM allows Cloudera Navigator Key Trustee Server to seamlessly integrate with a hardware security module (HSM).

navigator key trustee high availability setup


hadoop security config order

install MySQL

config SSL/TLS for cloudera manager, services and navigator

install and run kerberos server with KDC

install and run key trustee server


The most common Key Trustee Server clients are Navigator Encrypt and Key Trustee KMS.

install cloudera navigator encrypt

keystore and truststore in java SSL

Both keystores and truststores are used to store SSL certificates in Java, but there is a subtle difference between them: a truststore stores trusted public certificates, while a keystore stores the private keys and certificates of the client or server.


hdfs encryption



An encryption zone is a directory in HDFS with all of its contents, that is, every file and subdirectory in it, encrypted. The files in this directory will be transparently encrypted upon write and transparently decrypted upon read. Each encryption zone is associated with a key which is specified when the zone is created. Each file within an encryption zone also has its own encryption/decryption key, called the Data Encryption Key (DEK). These DEKs are never stored persistently unless they are encrypted with the encryption zone’s key. This encrypted DEK is known as the EDEK. The EDEK is then stored persistently as part of the file’s metadata on the NameNode.

setup navigator key trustee cluster for hdfs data encryption


hdfs permission guide


By default, non-admin users cannot access any encrypted data. You must create appropriate ACLs before users can access encrypted data. See the Cloudera documentation for more information on managing KMS ACLs.


Hadoop uses HDFS as its filesystem, and there is no "cd" command in HDFS. See the link below…


hdfs data encryption

general cloudera instruction


commands to create keys and zones

Log in or su as these users on one of the hosts in your cluster. These steps verify that the KMS is set up to encrypt files.

Create a key and directory.

su <KEY_ADMIN_USER>
hadoop key create mykey1
hadoop fs -mkdir /tmp/zone1

Create a zone and link to the key.

su hdfs
hdfs crypto -createZone -keyName mykey1 -path /tmp/zone1

Create a file, put it in your zone and ensure the file can be decrypted.

su <KEY_ADMIN_USER>
echo "Hello World" > /tmp/helloWorld.txt
hadoop fs -put /tmp/helloWorld.txt /tmp/zone1
hadoop fs -cat /tmp/zone1/helloWorld.txt
rm /tmp/helloWorld.txt

Ensure the file is stored as encrypted.

su hdfs
hadoop fs -cat /.reserved/raw/tmp/zone1/helloWorld.txt
hadoop fs -rm -R /tmp/zone1


config ACL


configure CDH service for data encryption


after recreating an encryption zone, e.g. /hbase, the right thing to do is to use the hdfs user to create the folder and copy the files over, and then change the owner to hbase; otherwise HBase can't start

using command

sudo su - hdfs

hadoop fs -chown -R hbase /hbase

hadoop fs -cp /user/hive-old/* /user/hive

intellij 15 active server


sentry roles and groups are managed by each service

1) in hive


matching existing sentry policy file


sentry policy file authorization


sentry authorization using policy files can't be enabled at the same time the sentry service is enabled

by setting the sentry user-to-group mapping class to HadoopGroupResourceAuthorizationProvider, we can map hadoop users to groups created on the command line, and then use the command line in each service to create roles and attach them to groups to achieve authorization, as in 1) in hive

sentry mainly provides role-based granular access control for hive and impala at the table and column level.

it doesn't control which users can access which services; that part is handled by kerberos. But kerberos doesn't authenticate which users can or can't log in to cloudera manager; in other words, kerberos only secures the services of the cluster

Hi Guys:

Shahid proposed a way to access cloudera services without exposing IPs in the URL. The following is how to do it.

For services in public subnet do the following.

1) find the port of the service and the public IP of the server

2) to access Director, run this: ssh -i clouderaKey.pem -L 7189:localhost:7189 ec2-user@

3) go to the browser on your local machine and type localhost:7189; you have to keep the terminal open to make this work


For services in private subnet do the following.

1) find the port of the service, the private IP of the server, and the public IP of a node in the public subnet (here we use the director server)

2) e.g., to access the Spark UI in the private subnet, find that the port is 18088, note the private IP, and then run the following command in your terminal

ssh -v -i clouderaKey.pem -L 18088: ec2-user@

3) go to the browser and type localhost:18088, which will lead you to the Spark UI

4) in order to make it work, your terminal must stay open

#  Create SOCKS Proxy with Bastion:

ssh -v -i sli-key.pem -CND 8157 sli@ -p 48520


Open isolated browser by command line:


"/Applications/Google Chrome.app/Contents/MacOS/Google Chrome" --user-data-dir="$HOME/chrome-with-proxy" --proxy-server="socks5://localhost:8157"

windows: open cmd terminal

"C:\Users\hl5315\AppData\Local\Google\Chrome\Application\chrome.exe" --user-data-dir="%USERPROFILE%\chrome-with-proxy" --proxy-server="socks5://localhost:8157"

ssh tunnelling

put the config file into the .ssh folder and change the private IP and key path accordingly.

and then in terminal run: ssh <the alias name of target server>



Here’s a simplified instruction for connecting to databases on AWS. Please make sure you set up ssh-config following Shahid’s instruction.


  1. In the SSH command line, enter the following command for SSH tunneling.

ssh -v -L ${localport}:${db_host}:${db_port} eqrspublicbastion



localport can be any port number you want to use for the DB connection.

db_host and db_port can be found in the attached DB spreadsheet.

For example:

ssh -v -L 1772:eqrsdv02.cskfanqlck6a.us-gov-west-1.rds.amazonaws.com:1525 eqrspublicbastion


  2. Connect to the DB with the following info:

Hostname: localhost

Port: ${localport}

Service name: ${service_id} (can be found in the attached spreadsheet)


  3. All team members should have their accounts created in the specified databases. The default password is 'Eqrs#12#4567890'.

Let me know if you are not in the list.

  • HLI
  • SLI

hadoop cluster system-level fine-grained user access control

using sentry

install kerberos and use an enterprise-level AD KDC, which itself should be a highly available cluster, as the authentication component of sentry, and define a different organization unit for each environment within AD

fine-grained rules, such as which user group can access which resources, are defined within sentry as the authorization component


Best Regards
