Securing Your Hadoop Cluster With Apache Ranger, Atlas and Knox Attila Kanto & Zsombor Gegesy

June 13th 2017 – Budapest Data Forum

Disclaimer 

This document may contain product features and technology directions that are under development, may be under development in the future or may ultimately never be developed.



Product capabilities are based on information that is publicly available within the Apache Software Foundation websites (“Apache”). Progress of the project capabilities can be tracked from inception to release through Apache; however, technical feasibility, market demand, user feedback and the overarching Apache Software Foundation community development process can all affect timing and final delivery.



This document’s description of these features and technology directions does not represent a contractual commitment, promise or obligation from Hortonworks to deliver these features in any generally available product.



Product features and technology directions are subject to change, and must not be included in contracts, purchase orders, or sales agreements of any kind.



Since this document may contain an outline of general product development plans, customers should not rely upon it when making purchasing decisions.


© Hortonworks Inc. 2011 – 2017. All Rights Reserved

Agenda

• Security concepts overview
• Apache Knox
• Apache Ranger
• Apache Atlas
• Q&A


Five pillars of enterprise security


HDP Security: Comprehensive, Complete, Extensible

Perimeter Level Security
• Network security (i.e. firewalls)
• Apache Knox (i.e. gateways)

Authentication
• LDAP / AD
• Kerberos

Authorization
• Consistent authorization control across all HDP components with Apache Ranger

Data Protection
• Encrypt data in motion and data at rest; Apache Ranger KMS

OS Security
• Process isolation
• Namespaces


Apache Knox


What is Apache Knox?

REST API and Application Gateway for the Apache Hadoop Ecosystem


Why Apache Knox?

Extensible reverse proxy framework

Simplifies access
• Kerberos encapsulation
• Single access point for all REST and HTTP interactions
• Multi-cluster support

Enhanced security
• Eliminates the SSH edge node (securely exposes REST APIs and HTTP-based services at the perimeter)
• Protects the details of the cluster deployment
• Provides SSL for non-SSL services
• Central auditing

Enterprise integration
• LDAP/AD integration
• SSO integration


What Apache Knox isn’t

• Not an alternative to firewalls
• Not an alternative to Kerberos
• Not a channel for high-volume data ingest or export


Apache Knox Overview

[Diagram: the gateway groups three families of services between clients and the cluster]

• Proxying Services – HTTP proxying of UIs (Ambari, Zeppelin, Ranger, other Hadoop UIs), REST APIs (WebHDFS, WebHCat, Oozie, YARN RM, HBase, Hive, Phoenix, Gremlin, SQL/DB), and Web Sockets
• Authentication and Federation Providers – WebSSO, KnoxSSO/Token, OAuth, SAML, SPNEGO, LDAP/AD, header-based
• Client DSL/SDK Services – Groovy-based DSL, REST API classes, token sessions (KnoxShell SDK)

The primary goal of the Apache Knox project is to provide access to Apache Hadoop via proxying of HTTP resources.

Authentication Services: authentication for REST API access as well as a WebSSO flow for UIs. LDAP/AD, header-based pre-auth, Kerberos, SAML and OAuth are all available options.

Client DSL/SDK Services: client development can be done with scripting through the DSL, or by using the Knox Shell classes directly as an SDK.

Cluster access through an edge node

• CLIs are hard to install on desktops
• Limited auditing
• CLIs must be aware of the cluster topology

[Diagram: User → SSH/SCP → edge node running the Hadoop CLIs (in the DMZ) → Hadoop services]

Cluster access through the gateway

• All activity is audited consistently
• Cluster topology is not exposed to the client
• The user connects through a REST API

[Diagram: User → REST API → gateway (in the DMZ) → REST APIs of the Hadoop services]

Authentication and Identity Propagation

The client is not aware that the cluster is secured with Kerberos.

0. Configure Knox as a trusted proxy in Hadoop
1. The user sends a REST API request to the gateway
2. The gateway answers with an authentication challenge; the client responds with user:secret
3. The gateway authenticates the caller as user:secret
4. The gateway authenticates itself to the Hadoop services as Knox via SPNEGO (i.e. Kerberos)
5. The gateway forwards the REST API request, impersonating the end user via doAs
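From the client's point of view the flow above reduces to two things: address the gateway instead of the cluster, and send ordinary HTTP Basic credentials. The sketch below illustrates that, with hypothetical host, topology and credential values; it only builds the URL and header, it does not perform a request.

```python
import base64

# Hypothetical values for illustration; real ones depend on your deployment.
KNOX = "https://knox.example.com:8443"
TOPOLOGY = "default"

def knox_url(service_path):
    """Map a cluster-internal service path to its gateway URL.

    Instead of calling the NameNode's WebHDFS endpoint directly, the
    client only ever sees the gateway address and the topology name.
    """
    return f"{KNOX}/gateway/{TOPOLOGY}{service_path}"

def basic_auth_header(user, secret):
    """Build the HTTP Basic auth header sent in steps 2-3.

    Knox validates these credentials and then speaks Kerberos to the
    cluster on the user's behalf; the client never deals with SPNEGO.
    """
    token = base64.b64encode(f"{user}:{secret}".encode()).decode()
    return {"Authorization": "Basic " + token}

url = knox_url("/webhdfs/v1/tmp?op=LISTSTATUS")
headers = basic_auth_header("alice", "secret")
```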

Scalability and Fault Tolerance

[Diagram: User → REST API → load balancer → multiple gateway instances → REST APIs of the Hadoop services]

Multi-cluster support / multi-tenant support

[Diagram: One gateway routing the user's requests to multiple Hadoop clusters]

Extensibility: Providers and Services

• Providers
  – Features of the gateway that can be used by Services
• Services
  – Actual Hadoop services like WebHDFS, Hive, RM, etc.
  – Definitions of endpoints for the gateway to expose a specific service
  – Include provider configuration (e.g. rewrite rules)
• Topologies
  – Assemblies of providers and services
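A topology file is where providers and services come together. The fragment below is a minimal sketch following the provider/service element names documented for Knox; the host names, ports, LDAP URL and DN template are placeholders for a hypothetical deployment, not values from this talk.

```xml
<topology>
  <gateway>
    <!-- Provider: LDAP authentication via Apache Shiro -->
    <provider>
      <role>authentication</role>
      <name>ShiroProvider</name>
      <enabled>true</enabled>
      <param>
        <name>main.ldapRealm</name>
        <value>org.apache.hadoop.gateway.shirorealm.KnoxLdapRealm</value>
      </param>
      <param>
        <name>main.ldapRealm.userDnTemplate</name>
        <value>uid={0},ou=people,dc=example,dc=com</value>
      </param>
      <param>
        <name>main.ldapRealm.contextFactory.url</name>
        <value>ldap://ldap.example.com:389</value>
      </param>
      <param>
        <name>urls./**</name>
        <value>authcBasic</value>
      </param>
    </provider>
  </gateway>

  <!-- Services: endpoints this topology exposes through the gateway -->
  <service>
    <role>WEBHDFS</role>
    <url>http://namenode.example.com:50070/webhdfs</url>
  </service>
  <service>
    <role>HIVE</role>
    <url>http://hiveserver.example.com:10001/cliservice</url>
  </service>
</topology>
```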

Apache Ranger


Why Apache Ranger?

• Centralized platform to define and administer security policies consistently
• Define a security policy once, and apply it across the component stack
• Deep visibility: detailed audit trail

Fine-Grained Security On

• Database
• Table
• Column
• Queue (be it Kafka or YARN)
• Any resource

Apache Ranger

• All the Hadoop components have some kind of user management
• Not integrated, not too sophisticated
  – HDFS: POSIX-like user/group access policy
  – Hive/HBase: db/table-level restrictions for users/groups
  – etc.
• It would be nice if restrictions could be:
  – Driven from LDAP/Active Directory
  – Based on client IP address
  – Based on time of access
  – Data masking: ‘support’ can only see the last 4 digits of a credit card number
  – Data filtering: ‘sales’ can only see the data from their own region
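The masking and filtering wishes above can be sketched as plain functions. This is not Ranger's implementation, just a toy illustration of the two rules named on the slide; the group names mirror the examples.

```python
def mask_credit_card(number, group):
    """Masking rule from the slide: 'support' sees only the last 4
    digits; everyone else sees the number fully masked."""
    if group == "support":
        return "*" * (len(number) - 4) + number[-4:]
    return "*" * len(number)

def filter_rows(rows, group, region):
    """Filtering rule from the slide: 'sales' users only see rows
    belonging to their own region; other groups see everything."""
    if group == "sales":
        return [r for r in rows if r["region"] == region]
    return rows

masked = mask_credit_card("4111111111111111", "support")  # "************1111"
visible = filter_rows(
    [{"region": "HU", "amount": 10}, {"region": "US", "amount": 20}],
    "sales", "HU",
)
```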

Apache Ranger

• Solution for authorization with flexible policies, built on a plugin architecture
• Supports: HDFS, Hive, HBase, Kafka, Knox, Storm, YARN, NiFi, Atlas
• Contributed community plugins: Apache HAWQ, Druid, Gaian DB

Apache Ranger

• One policy to grant:
  – hdfs://home/sales/{USERNAME}
  – to all the users in the ‘sales’ group
• Hive database, table and column-level access
  – Row-level filter: ‘location = ”HU”’
  – Masking: hashing / hiding / etc.
• HBase: table, column-family and column-level filtering
• YARN: limit processing queue access
• Knox: limit access to topologies / services
• etc.
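The first bullet above shows Ranger's {USERNAME} placeholder: one policy covers every member of a group by expanding the variable per user. A minimal sketch of that idea (the policy shape here is invented for illustration, not Ranger's policy model):

```python
def resolve_resource(template, user):
    """Expand a Ranger-style {USERNAME} placeholder in a resource path."""
    return template.replace("{USERNAME}", user)

def is_allowed(user, groups, path, policy):
    """One policy granting each member of a group their own directory."""
    if policy["group"] not in groups:
        return False
    return path == resolve_resource(policy["resource"], user)

# The single policy from the slide: every 'sales' user gets their own home.
policy = {"group": "sales", "resource": "hdfs://home/sales/{USERNAME}"}
```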

Apache Ranger

• Audit log of every access and decision
  – Stored in HDFS and/or Solr
  – So it is easy to search/filter for audit events
  – Can be sent to Kafka for integration with other services
• Ranger KMS
  – Secure key management for HDFS encryption (“data at rest”)
  – Access control policies
  – Audit
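To make the audit trail concrete: each decision becomes a structured record that the sinks above (HDFS, Solr, Kafka) can store or forward. The field names below are illustrative, not the exact Ranger audit schema.

```python
import json

# One access decision as a structured audit record (illustrative fields).
event = {
    "user": "alice",
    "resource": "hdfs://home/sales/reports",
    "accessType": "read",
    "result": "allowed",
    "policyId": 42,
    "clientIp": "10.0.0.15",
}

# Serialized as JSON, a record like this can be indexed in Solr for
# searching, archived in HDFS, or published to a Kafka topic.
line = json.dumps(event, sort_keys=True)
```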

Can we have more flexibility?

• If table/column/row-specific restrictions are not enough
  – Configuring hundreds of columns independently, by hand, is error-prone
• Tag-based access decisions:
  – Every column tagged as ‘Personal Information’ should be hidden from ‘X’
  – Every table tagged with ’visibleBefore=2017-10-01’ should be hidden after that date
• But how do we get the tags?
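The two tag rules above can be sketched as a single check: the decision looks only at the tags attached to a resource, never at the resource itself. This is a toy illustration of the idea, not Ranger's tag-policy engine.

```python
from datetime import date

def tag_allows(tags, user_group, today):
    """Apply the two example rules from the slide to a resource's tags."""
    # Rule 1: 'Personal Information' is hidden from group 'X'.
    if "Personal Information" in tags and user_group == "X":
        return False
    # Rule 2: a 'visibleBefore=YYYY-MM-DD' tag hides the resource
    # once that date is reached.
    for tag in tags:
        if tag.startswith("visibleBefore="):
            if today >= date.fromisoformat(tag.split("=", 1)[1]):
                return False
    return True
```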

Metadata: Truth in Hadoop

[Diagram: Many projects feeding structured sources (traditional RDBMS, MPP appliances) and unstructured data into a data lake, with metadata as the common layer]

Data management along the entire data lifecycle, with integrated provenance and lineage capability.

• Modeling with metadata: cross-component dataset lineage; centralized location for all metadata inside Hadoop
• Interoperable solutions: single interface point for metadata exchange with platforms outside of Hadoop

Apache Atlas

• A graph of the metadata
  – Ability to collect and link various information automatically
• As a graph, it is highly extensible
  – Define new nodes and edges between them, and even new node types
• Dynamic query language
• REST API for external systems
• External connectors via messaging frameworks
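The graph model above can be pictured with a toy example: typed entities as nodes, named relations as edges, and lineage answered by walking the edges. Entity and relation names here are invented for illustration, not Atlas's type system.

```python
# Toy metadata graph: typed entities and named relations between them.
nodes = {
    "raw_sales":   {"type": "hive_table", "tags": []},
    "clean_sales": {"type": "hive_table", "tags": ["Personal Information"]},
    "etl_job":     {"type": "process",    "tags": []},
}
edges = [
    ("etl_job", "inputs",  "raw_sales"),
    ("etl_job", "outputs", "clean_sales"),
]

def upstream(entity):
    """Walk lineage backwards: which tables fed the process that
    produced this entity?"""
    sources = set()
    for proc, rel, out in edges:
        if rel == "outputs" and out == entity:
            sources |= {src for p, r, src in edges
                        if p == proc and r == "inputs"}
    return sources
```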

Apache Atlas

• For Hive
  – Which column contains what kind of data
  – Who created / who consumes the data – lineage
  – Lineage whether the data is created by Sqoop, Storm, Kafka, Falcon, or by Hive SQL
• Tags, e.g. for marking something as personal information

Apache Atlas + Ranger

• More fine-grained access decisions

Apache Atlas and Ranger Integration

• Basic tag policy
  – Access and entitlements can be based on attributes.
  – Personally Identifiable Information (PII) is a tag that can be leveraged to protect sensitive personal data.
• Geo-based policy
  – Access policy based on location.
  – A user might be able to access data in North America, but may be restricted from access in EMEA due to privacy compliance.
• Time-based policy
  – Access policy based on time windows.
  – A user might be able to access data only between 8 AM and 5 PM (common under SOX regulations).
• Prohibitions
  – Restrictions on combining two data sets that might each be compliant on their own, but not when combined, e.g. names and health care records.
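The time-based policy above is the easiest to sketch: the decision compares the access time against a window attached to the resource. A toy illustration, using the 8 AM – 5 PM window from the slide; this is not Ranger's policy engine.

```python
from datetime import time

def within_window(now, start=time(8, 0), end=time(17, 0)):
    """Time-based policy: allow access only inside the configured
    daily window (defaults mirror the 8 AM - 5 PM slide example)."""
    return start <= now <= end
```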

Q&A
