State of Security: Apache Spark & Apache Zeppelin

Vinay Shukla
Director, Product Management
June 30, 2016
Twitter: @neomythos

2 © Hortonworks Inc. 2011 – 2016. All Rights Reserved

Thank You

Who Am I?

• Product Management: Spark for 2.5+ years, Hadoop for 3+ years
• Recovering programmer
• Blog at www.vinayshukla.com
• Twitter: @neomythos
• Addicted to yoga, hiking, & coffee
• Minor contributor to Apache Zeppelin

Vinay Shukla

Security: Rings of Defense

Perimeter Level Security
• Network security (e.g., firewalls)

Data Protection
• Wire encryption
• HDFS TDE/DARE
• Others

Authentication
• Kerberos
• Knox (other gateways)

OS Security

Authorization
• Apache Ranger/Sentry
• HDFS permissions
• HDFS ACLs
• HBase ACLs

Key to Spark Security

Spark processes data in memory; it does not store the data itself.

Context: Spark Deployment Modes

• Spark on YARN
– Spark driver (SparkContext) runs in the YARN AM (yarn-cluster)
– Spark driver (SparkContext) runs locally on the client (yarn-client)

• Spark Shell & Spark Thrift Server run in yarn-client mode only

[Diagram: YARN-Client and YARN-Cluster side by side, each showing Client, App Master, Spark Driver, and Executor]

Spark on YARN

[Diagram: John Doe, Spark Submit, and a Hadoop cluster with YARN RM, Node Manager, Spark AM, Executor, and HDFS; steps numbered 1-4]

Spark – Security – Four Pillars

Authentication Authorization Audit Encryption

Spark leverages Kerberos on YARN
Ensure network is secure

Authentication: Kerberos Primer

[Diagram: Client with its Kerberos ticket cache, KDC, NameNode (NN), DataNode (DN)]

1. kinit: log in and get a Ticket Granting Ticket (TGT)
2. Client stores the TGT in its ticket cache
3. Get a NameNode Service Ticket (NN-ST)
4. Client stores the NN-ST in its ticket cache
5. Read/write file given NN-ST and file name; returns block locations, block IDs, and Block Access Tokens if access is permitted
6. Read/write block given Block Access Token and block ID

Kerberos authentication within Spark

[Diagram: John Doe, KDC, and a Hadoop cluster with Spark AM and NN; steps numbered 1-7]

• kinit
• Get service ticket (ST) for Spark
• Use Spark ST, submit Spark job
• Spark gets NameNode (NN) service ticket
• YARN launches Spark executors using John Doe’s identity
• Executor reads from HDFS using John Doe’s delegation token

Spark + X (Source of Data)

[Diagram: John Doe, KDC, and a Hadoop cluster with Spark AM and data source X; steps numbered 1-7]

• kinit
• Get service ticket (ST) for Spark
• Use Spark ST, submit Spark job
• Spark gets X ST
• YARN launches Spark executors using John Doe’s identity
• Executor reads from X using John Doe’s delegation token

Spark – Kerberos - Example

kinit -kt /etc/security/keytabs/johndoe.keytab [email protected]

./bin/spark-submit --class org.apache.spark.examples.SparkPi \
  --master yarn-cluster \
  --num-executors 3 \
  --driver-memory 512m \
  --executor-memory 512m \
  --executor-cores 1 \
  lib/spark-examples*.jar 10
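The two commands above could also be assembled programmatically, for example when a scheduler submits the job. The following is a minimal sketch, assuming Python; the function names `kinit_command` and `spark_pi_command` are hypothetical, and the keytab path and principal are left as parameters because they are site-specific.

```python
def kinit_command(keytab, principal):
    """Build the kinit invocation that obtains a TGT from a keytab."""
    return ["kinit", "-kt", keytab, principal]


def spark_pi_command(examples_jar, num_executors=3, slices=10):
    """Build the spark-submit invocation for the SparkPi example on YARN.

    The flags mirror the command shown on the slide.
    """
    return [
        "./bin/spark-submit",
        "--class", "org.apache.spark.examples.SparkPi",
        "--master", "yarn-cluster",
        "--num-executors", str(num_executors),
        "--driver-memory", "512m",
        "--executor-memory", "512m",
        "--executor-cores", "1",
        examples_jar,
        str(slices),
    ]
```

Each list can be handed to `subprocess.run(cmd, check=True)`; running kinit first matters because spark-submit relies on the ticket cache it populates.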

Spark – Authorization

[Diagram: John Doe, KDC, Ranger, YARN cluster (nodes A, B, C), and HDFS]

• Client gets service ticket for Spark
• Use Spark ST, submit Spark job
• Get NameNode (NN) service ticket
• Executors read from HDFS
• Ranger: Can John launch this job? Can John read this file?

Encryption: Spark – Communication Channels

[Diagram: Spark Submit, RM, NM, Shuffle Service, AM/Driver, executors Ex 1 to Ex N, and the data source]

Channels:
• Control/RPC
• Shuffle data
• Shuffle block transfer
• Read/write data (data source)
• FS: broadcast, file download

Spark Communication Encryption Settings

Channels: shuffle data; control/RPC; shuffle block transfer; read/write data; FS (broadcast, file download)

Settings:
• spark.authenticate = true (leverages YARN to distribute keys)
• spark.authenticate.enableSaslEncryption = true
• spark.ssl.enabled = true
• NM > Ex leverages YARN-based SSL
• Read/write data depends on the data source: for HDFS, RPC encryption (RC4 | 3DES), or SSL for WebHDFS
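The three Spark properties named on this slide can be kept in one place and turned into `--conf` flags for spark-submit. The following is a minimal sketch, assuming Python; the helper name `encryption_conf_args` is hypothetical, and only the property names and values come from the slide. A real deployment would also need keystore and truststore settings for SSL, which are omitted here.

```python
# Spark properties named on the slide; values are strings, as spark-submit expects.
ENCRYPTION_SETTINGS = {
    "spark.authenticate": "true",
    "spark.authenticate.enableSaslEncryption": "true",
    "spark.ssl.enabled": "true",
}


def encryption_conf_args(settings=ENCRYPTION_SETTINGS):
    """Return a flat list of spark-submit arguments: --conf key=value pairs."""
    args = []
    # Sort keys so the generated command line is stable across runs.
    for key, value in sorted(settings.items()):
        args.extend(["--conf", f"{key}={value}"])
    return args


if __name__ == "__main__":
    print(" ".join(encryption_conf_args()))
```

The same key/value pairs could instead be written to spark-defaults.conf so that every job picks them up.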

Gotchas with Spark Security

• Client -> Spark Thrift Server -> Spark Executors: no identity propagation on 2nd hop
– Lowers security, forces STS to run as Hive user to read all data
– Use SparkSQL via shell or programmatic API
– https://issues.apache.org/jira/browse/SPARK-5159

• Spark + HBase with Kerberos
– Issue fixed in Spark 1.4 (SPARK-6918)

• Spark Streaming + Kafka + Kerberos
– Issues fixed in HDP 2.4.x
– No SSL support yet

• Spark jobs > 72 hours
– Delegation token not renewed before Spark 1.4

• Spark Shuffle: only SASL, no SSL support

How can I get Row/Column/Masking with SparkSQL?

Hopefully you went to “Fine Grained Security for Hive & Spark” yesterday

Key Features: Spark Column Security with LLAP

Fine-Grained Column Level Access Control for SparkSQL.

Fully dynamic policies per user. Doesn’t require views.

Use Standard Ranger policies and tools to control access and masking policies.

Flow:
1. SparkSQL gets data locations, known as “splits”, from HiveServer2 and plans the query.
2. HiveServer2 authorizes access using Ranger. Per-user policies like row filtering are applied.
3. Spark gets a modified query plan based on dynamic security policy.
4. Spark reads data from LLAP. Filtering/masking is guaranteed by the LLAP server.

[Diagram: Spark Client, HiveServer2 (authorization), Hive Metastore (data locations, view definitions), Ranger Server (dynamic policies), and LLAP (data read, filter pushdown); steps numbered 1-4]

Example: Per-User Row Filtering by Region in SparkSQL

Original query:
SELECT * from CUSTOMERS WHERE total_spend > 10000

Query rewrites based on dynamic Ranger policies:

Spark User 1 (West Region), dynamic rewrite:
SELECT * from CUSTOMERS WHERE total_spend > 10000 AND region = 'west'

Spark User 2 (East Region), dynamic rewrite:
SELECT * from CUSTOMERS WHERE total_spend > 10000 AND region = 'east'

LLAP data access:
User ID   Region   Total Spend
1         East     5,131
2         East     27,828
3         West     55,493
4         West     7,193
5         East     18,193
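The effect of the per-user rewrite can be illustrated on the sample rows from this slide. The following is a minimal sketch in plain Python, not how Ranger or LLAP actually apply policies; the names `CUSTOMERS`, `REGION_POLICY`, and `run_query`, and the user identifiers, are hypothetical.

```python
# Sample rows from the slide: (user_id, region, total_spend).
CUSTOMERS = [
    (1, "East", 5131),
    (2, "East", 27828),
    (3, "West", 55493),
    (4, "West", 7193),
    (5, "East", 18193),
]

# Hypothetical policy table: the region each Spark user is allowed to see.
REGION_POLICY = {"spark_user_1": "West", "spark_user_2": "East"}


def run_query(user, min_spend=10000):
    """Apply the original predicate plus the user's region filter."""
    region = REGION_POLICY[user]
    return [
        row for row in CUSTOMERS
        if row[2] > min_spend and row[1] == region
    ]


if __name__ == "__main__":
    for user in sorted(REGION_POLICY):
        print(user, run_query(user))
```

Both users issue the same query, yet each sees only rows from their own region, which is the point of the dynamic rewrite.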

Interacting with Spark

[Diagram: ways to interact with Spark on YARN: Zeppelin, Spark-Shell, Spark Thrift Server, and a REST server, each with a Driver, in front of executors (Ex)]

Apache Zeppelin Security

Apache Zeppelin: Authentication + SSL

[Diagram: John Doe, firewall, SSL, LDAP, and Spark on YARN with executors (Ex); steps numbered 1-3]

Zeppelin + Livy E2E Security

[Diagram: John Doe, LDAP, Zeppelin (Ispark Group Interpreter), Livy APIs, Livy, Spark, and YARN; links labeled "SPNego: Kerberos" and "Kerberos/RPC"]

Job runs as John Doe

Apache Zeppelin: Authorization

• Notebook-level authorization
• Grant permissions (Owner, Reader, Writer) to users/groups on notebooks
• LDAP group integration just got merged (ZEPPELIN-946)
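The Owner/Reader/Writer model can be pictured as three sets of users and groups per notebook. The following is a toy sketch in Python, not Zeppelin's actual data model or API; the class `NotebookPermissions` and its methods are hypothetical, and it assumes that owners may also write and that writers may also read.

```python
from dataclasses import dataclass, field


@dataclass
class NotebookPermissions:
    """Toy permission record for one notebook (hypothetical, for illustration)."""
    owners: set = field(default_factory=set)
    writers: set = field(default_factory=set)
    readers: set = field(default_factory=set)

    def _matches(self, allowed, user, groups):
        # A principal matches if the user name or any of its groups is listed.
        return user in allowed or bool(allowed & set(groups))

    def can_read(self, user, groups=()):
        # Assumption: owners and writers can also read.
        return self._matches(self.owners | self.writers | self.readers, user, groups)

    def can_write(self, user, groups=()):
        # Assumption: owners can also write.
        return self._matches(self.owners | self.writers, user, groups)

    def can_change_permissions(self, user, groups=()):
        return self._matches(self.owners, user, groups)
```

For example, with `NotebookPermissions(owners={"johndoe"}, writers={"analysts"}, readers={"auditors"})`, a user in the `analysts` group can write, while a user who is only in `auditors` can read but not write. Group membership would come from LDAP in the setup the slide describes.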

Thank You

Vinay Shukla @neomythos