Securing Your Hadoop Cluster with Apache Ranger, Atlas and Knox
Attila Kanto & Zsombor Gegesy
June 13th, 2017 – Budapest Data Forum
Disclaimer
This document may contain product features and technology directions that are under development, may be under development in the future or may ultimately never be developed.
Product capabilities are based on information that is publicly available within the Apache Software Foundation websites (“Apache”). Progress of the project capabilities can be tracked from inception to release through Apache; however, technical feasibility, market demand, user feedback and the overarching Apache Software Foundation community development process can all affect timing and final delivery.
This document’s description of these features and technology directions does not represent a contractual commitment, promise or obligation from Hortonworks to deliver these features in any generally available product.
Product features and technology directions are subject to change, and must not be included in contracts, purchase orders, or sales agreements of any kind.
Since this document may contain an outline of general product development plans, customers should not rely upon it when making purchasing decisions.
Agenda
Security concepts overview
Apache Knox
Apache Ranger
Apache Atlas
Q&A
Five pillars of enterprise security
HDP Security: Comprehensive, Complete, Extensible

Perimeter level security
• Network security (e.g. firewalls)
• Apache Knox (e.g. gateways)

Authentication
• LDAP / AD
• Kerberos

Authorization
• Consistent authorization control across all HDP components with Apache Ranger

Data protection
• Encryption of data in motion and data at rest; Apache Ranger KMS

OS security
• Process isolation
• Namespaces
Apache Knox
What is Apache Knox?
REST API and Application Gateway for the Apache Hadoop Ecosystem
Why Apache Knox?
Extensible reverse proxy framework
Simplifies access
• Kerberos encapsulation
• Single access point for all REST and HTTP interactions (see the sketch below)
• Multi-cluster support

Enhanced security
• Eliminates the SSH edge node (securely exposes REST APIs and HTTP-based services at the perimeter)
• Protects the details of the cluster deployment
• Provides SSL for non-SSL services
• Central auditing

Enterprise integration
• LDAP/AD integration
• SSO integration
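To make the "single access point" concrete, here is a minimal sketch of listing an HDFS directory through the gateway with nothing but HTTPS and LDAP credentials. The gateway host, topology name, credentials and certificate path are placeholders; the URL follows Knox's usual gateway/{topology}/{service} layout.

```python
import requests

# Placeholder gateway address with the 'default' topology; adjust for a real cluster.
KNOX = "https://knox.example.com:8443/gateway/default"
AUTH = ("guest", "guest-password")  # LDAP/AD credentials checked by Knox

# List the HDFS root through the gateway. The client speaks plain HTTPS with
# basic auth; Knox handles Kerberos against the cluster behind the scenes.
resp = requests.get(
    f"{KNOX}/webhdfs/v1/",
    params={"op": "LISTSTATUS"},
    auth=AUTH,
    verify="/etc/knox/gateway-ca.pem",  # trust the gateway's certificate
)
resp.raise_for_status()
for f in resp.json()["FileStatuses"]["FileStatus"]:
    print(f["type"], f["pathSuffix"])
```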
What Apache Knox isn’t
Not an alternative to firewalls
Not an alternative to Kerberos
Not a channel for high volume data ingest or export
Apache Knox Overview

[Architecture diagram: the gateway proxies Hadoop UIs (Ambari, Zeppelin, YARN, Oozie), REST APIs (WebHDFS, WebHCat, YARN RM), SQL/DB endpoints (Hive, Phoenix, Gremlin) and WebSockets; it provides WebSSO, OAuth and KnoxSSO/token services, plus client DSL/SDK services: REST API classes, a Groovy-based DSL and token sessions.]
Authentication and Federation Providers

[Diagram: authentication and federation providers (header-based, SPNEGO, LDAP/AD, SAML) sit in front of the HTTP-proxied services (Hive, HBase, WebHCat, YARN RM, Ranger), which are reached over HTTP or through the KnoxShell SDK.]
The primary goal of the Apache Knox project is to provide access to Apache Hadoop by proxying HTTP resources.

Authentication services: authentication for REST API access as well as a WebSSO flow for UIs. LDAP/AD, header-based pre-authentication, Kerberos, SAML and OAuth are all available options.

Client DSL/SDK services: client development can be done with scripting through the DSL or by using the Knox Shell classes directly as an SDK.
Cluster access through an edge node
• CLIs are hard to install on desktops
• Limited auditing
• CLIs must be aware of the cluster topology

[Diagram: the user reaches the Hadoop CLIs on an edge node inside the DMZ via SSH/SCP; the edge node talks to the Hadoop services.]
Cluster access through the gateway
• All activity is audited consistently
• The cluster topology is not exposed to the client
• The user connects through a REST API

[Diagram: the user calls the gateway's REST API in the DMZ; the gateway forwards the calls to the Hadoop services.]
Authentication and Identity Propagation

The client is not aware that the cluster is secured with Kerberos.

0. Configure Knox as a trusted proxy.
1. REST API request from the user to the gateway.
2. Authentication challenge; the user responds with user:secret.
3. The gateway authenticates the user as user:secret.
4. The gateway authenticates as Knox via SPNEGO (i.e. Kerberos).
5. The gateway forwards the REST API request with doAs user.
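A rough Python sketch of steps 4–5 from the gateway's point of view (not Knox's actual implementation): authenticate as the Knox principal via SPNEGO, then impersonate the end user with WebHDFS's standard doas query parameter. The NameNode host is a placeholder, and requests_kerberos plus a valid ticket for the knox principal are assumed.

```python
import requests
from requests_kerberos import HTTPKerberosAuth  # assumed to be installed

def forward_liststatus(path: str, end_user: str) -> requests.Response:
    """Schematic of steps 4-5: authenticate as Knox itself via SPNEGO,
    then ask WebHDFS to execute the call as the already-authenticated user."""
    return requests.get(
        f"http://namenode.example.com:50070/webhdfs/v1{path}",
        params={"op": "LISTSTATUS", "doas": end_user},  # identity propagation
        auth=HTTPKerberosAuth(),  # uses the knox principal's ticket cache
    )

# Step 0 is the precondition: core-site.xml must whitelist Knox as a proxy
# user (hadoop.proxyuser.knox.groups / hadoop.proxyuser.knox.hosts),
# otherwise the doas request is rejected.
```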
Scalability and Fault Tolerance

[Diagram: the user sends REST API requests to a load balancer, which distributes them across multiple gateway instances; each gateway forwards to the Hadoop services.]
Multi-cluster / multi-tenant support

[Diagram: a single gateway proxies the user's requests to several Hadoop clusters.]
Extensibility: Providers and Services

Providers
• Features of the gateway that can be used by services

Services
• Actual Hadoop services like WebHDFS, Hive, RM, etc.
• Definitions of the endpoints the gateway exposes for a specific service
• Include configuration such as rewrite rules

Topologies
• Assemblies of providers and services (a sketch follows below)
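To make the provider/service/topology split concrete, here is a trimmed-down topology descriptor. In practice it is an XML file under the gateway's conf/topologies directory; it is rendered from Python here only for illustration. Host names are placeholders, and a real ShiroProvider would need more LDAP parameters (user DN template, search base, etc.) than shown.

```python
# A trimmed topology descriptor: one authentication provider plus one
# proxied service. Saved as {GATEWAY_HOME}/conf/topologies/sandbox.xml,
# it would be served by Knox under /gateway/sandbox/...
TOPOLOGY = """\
<topology>
  <gateway>
    <!-- Provider: a gateway feature used by the services below -->
    <provider>
      <role>authentication</role>
      <name>ShiroProvider</name>
      <enabled>true</enabled>
      <param>
        <name>main.ldapRealm.contextFactory.url</name>
        <value>ldap://ldap.example.com:389</value>
      </param>
      <!-- further LDAP parameters elided -->
    </provider>
  </gateway>
  <!-- Service: an actual cluster endpoint the gateway proxies -->
  <service>
    <role>WEBHDFS</role>
    <url>http://namenode.example.com:50070/webhdfs</url>
  </service>
</topology>
"""

with open("sandbox.xml", "w") as f:
    f.write(TOPOLOGY)
```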
Apache Ranger
Why Apache Ranger?

A centralized platform to define and administer security policies consistently
• Define a security policy once, and apply it across the component stack
• Deep visibility: a detailed audit trail
Fine-Grained Security On
• Database
• Table
• Column
• Queue (be it Kafka or YARN)
• Any resource
Apache Ranger
All the Hadoop components have some kind of user management
• Not integrated, not too sophisticated
• HDFS: POSIX-like user/group access policy
• Hive/HBase: db/table-level restrictions for users/groups
• Etc.

It would be nice if restrictions could be applied:
• Driven from LDAP/Active Directory
• By client IP address
• By time of access
• Data masking: 'support' could see only the last 4 digits of a credit card number
• Data filtering: 'sales' could see only the data from their own region
Apache Ranger
A solution for authorization with flexible policies, built on a plugin architecture
Supports:
• HDFS
• Hive
• HBase
• Kafka
• Knox
• Storm
• YARN
• NiFi
• Atlas

Contributed community plugins: Apache HAWQ, Druid, GaianDB
Apache Ranger
One policy to grant:
• hdfs://home/sales/{USERNAME} to all the users in the 'sales' group (see the sketch after this list)

Hive: database, table and column level access
• Row-level filter, e.g. location = "HU"
• Masking: hashing / hiding / etc.

HBase
• Table, column-family and column level filtering

YARN
• Limit processing queue access

Knox
• Limit access to topologies / services
Etc …
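The 'sales' policy above could be created programmatically through Ranger's public REST API. The sketch below is a minimal version; the Ranger host, credentials and the HDFS service name are placeholders, and {USER} is Ranger's per-user resource macro.

```python
import requests

RANGER = "http://ranger.example.com:6080"  # placeholder admin host
AUTH = ("admin", "admin-password")         # Ranger admin credentials

# Grant every member of 'sales' full access to their own home directory.
policy = {
    "service": "cluster1_hadoop",  # name of the HDFS repository in Ranger
    "name": "sales-home",
    "resources": {
        "path": {"values": ["/home/sales/{USER}"], "isRecursive": True},
    },
    "policyItems": [{
        "groups": ["sales"],
        "accesses": [
            {"type": t, "isAllowed": True} for t in ("read", "write", "execute")
        ],
    }],
}

resp = requests.post(f"{RANGER}/service/public/v2/api/policy",
                     json=policy, auth=AUTH)
resp.raise_for_status()
print("created policy id:", resp.json()["id"])
```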
Apache Ranger
Audit log of every access and decision
• Stored in HDFS and/or Solr, so it is easy to search/filter audit events (see the sketch below)
• Can be sent to Kafka to integrate with other services

Ranger KMS
• Secure key management for HDFS encryption ("data at rest")
• Access control policies
• Auditing
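Because the audit events land in a Solr collection (ranger_audits by default), they can be filtered with an ordinary Solr query. A small sketch: the Solr host is a placeholder, and the field names follow the stock Ranger audit schema.

```python
import requests

SOLR = "http://solr.example.com:8983/solr/ranger_audits"  # default collection

# The ten most recent denied accesses for user 'joe'
# (result:0 means "denied" in the Ranger audit schema).
params = {
    "q": "reqUser:joe AND result:0",
    "sort": "evtTime desc",
    "rows": 10,
    "wt": "json",
}
docs = requests.get(f"{SOLR}/select", params=params).json()["response"]["docs"]
for d in docs:
    print(d["evtTime"], d["repo"], d["resource"], d["access"])
```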
Can we have more flexibility?
If table/column/row specific restrictions are not enough
• Configuring hundreds of columns independently, by hand, is error prone

Tag-based access decisions:
• Every column tagged as 'Personal Information' should be hidden from 'X'
• Every table tagged with 'visibleBefore=2017-10-01' should be hidden after that date
But how to get the tags?
Metadata Truth in Hadoop

[Diagram: many projects feed the data lake with structured data (traditional RDBMS, MPP appliances) and unstructured data, with metadata collected in one place.]

Data management along the entire data lifecycle, with integrated provenance and lineage capability
• Modeling with metadata: cross-component dataset lineage; a centralized location for all metadata inside Hadoop
• Interoperable solutions: a single interface point for metadata exchange with platforms outside of Hadoop
Apache Atlas
A graph of the metadata
• Ability to collect and link various pieces of information automatically

As a graph, it is highly extensible
• Define new nodes and edges between them, and even new node types
Dynamic query language
REST API for external systems (see the sketch below)
External connectors with messaging frameworks
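A quick sketch of the REST API and the query language, using Atlas's v2 search endpoints; the host, credentials and the example table names are placeholders.

```python
import requests

ATLAS = "http://atlas.example.com:21000"  # placeholder host
AUTH = ("admin", "admin-password")

# Basic search: every Hive table carrying the 'PII' classification (tag).
r = requests.get(f"{ATLAS}/api/atlas/v2/search/basic",
                 params={"typeName": "hive_table", "classification": "PII"},
                 auth=AUTH)
for entity in r.json().get("entities", []):
    print(entity["displayText"], entity.get("classificationNames"))

# A similar question asked through the DSL endpoint.
r = requests.get(f"{ATLAS}/api/atlas/v2/search/dsl",
                 params={"query": "hive_table where name like 'sales_*'"},
                 auth=AUTH)
```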
Apache Atlas
For Hive
• Which column contains what kind of data
• Who created / who consumes the data: lineage
• Lineage whether the data was created by Sqoop, Storm, Kafka or Falcon, or by Hive SQL

Tags: for marking something as personal info … (see the tagging sketch below)
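Tags can be attached over the same REST API. The sketch below adds a classification to an entity by GUID; the host, credentials and GUID are placeholders, and the 'PII' classification type is assumed to be defined already.

```python
import requests

ATLAS = "http://atlas.example.com:21000"  # placeholder host
AUTH = ("admin", "admin-password")
GUID = "0fe58a81-placeholder-guid"        # entity (e.g. a Hive column) to tag

# Attach the 'PII' classification to the entity.
resp = requests.post(
    f"{ATLAS}/api/atlas/v2/entity/guid/{GUID}/classifications",
    json=[{"typeName": "PII"}],
    auth=AUTH,
)
resp.raise_for_status()
```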
Apache Atlas + Ranger
More fine-grained access decisions …
Apache Atlas and Ranger Integration
Basic tag policy
• Access and entitlements can be based on attributes.
• Personally Identifiable Information (PII) is a tag that can be leveraged to protect sensitive personal data (see the sketch below).

Geo-based policy
• Access policy based on location.
• A user might be able to access data in North America, but may be restricted from access in EMEA due to privacy compliance.

Time-based policy
• Access policy based on time windows.
• A user might be able to access data only between 8 AM and 5 PM (common under SOX regulations).

Prohibitions
• Restrictions on combining two data sets that are each compliant on their own, but not when combined, e.g. names and health care records.
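For the basic tag policy case, a Ranger tag-based policy is written against a tag service rather than a component repository: the Atlas tags synced into Ranger become the resource. The sketch below denies access to anything tagged PII except for a compliance group; the host, credentials, service name and the exact deny semantics should all be treated as assumptions.

```python
import requests

RANGER = "http://ranger.example.com:6080"  # placeholder host
AUTH = ("admin", "admin-password")

# Deny everyone ('public' is Ranger's built-in all-users group) access to
# resources tagged 'PII', except the 'compliance' group. Attached to the
# tag service, one rule covers every component whose tables/columns carry
# the tag via the Atlas -> Ranger tag sync.
policy = {
    "service": "cluster1_tag",  # the tag-type service in Ranger (assumed name)
    "name": "deny-pii",
    "resources": {"tag": {"values": ["PII"]}},
    "denyPolicyItems": [{
        "groups": ["public"],
        "accesses": [{"type": "hive:select", "isAllowed": True}],
    }],
    "denyExceptions": [{
        "groups": ["compliance"],
        "accesses": [{"type": "hive:select", "isAllowed": True}],
    }],
}

resp = requests.post(f"{RANGER}/service/public/v2/api/policy",
                     json=policy, auth=AUTH)
resp.raise_for_status()
```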
Q&A