Running Apache Spark & Apache Zeppelin in Production
Vinay Shukla
Director, Product Management
August 31, 2016
Twitter: @neomythos
2 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Thank You
Who am I?
Vinay Shukla
• Product Management: Spark for 2.5+ years, Hadoop for 3+ years
• Recovering programmer
• Blog at www.vinayshukla.com
• Twitter: @neomythos
• Addicted to yoga, hiking, & coffee
• Smallest contributor to Apache Zeppelin
Security: Rings of Defense
• Perimeter level security: network security (i.e. firewalls)
• Data protection: wire encryption, HDFS TDE/DARE, others
• Authentication: Kerberos, Knox (and other gateways)
• OS security
• Authorization: Apache Ranger/Sentry, HDFS permissions, HDFS ACLs, YARN ACLs
Interacting with Spark
[Diagram: ways of interacting with Spark on YARN – Zeppelin, the Spark-Shell, and the Spark Thrift Server each host a Driver (Zeppelin via a REST server + Driver) that manages Executors on the cluster]
Context: Spark Deployment Modes
• Spark on YARN
– Spark driver (SparkContext) runs in the YARN AM (yarn-cluster)
– Spark driver (SparkContext) runs locally in the client process (yarn-client)
• Spark-Shell & Spark Thrift Server run in yarn-client mode only
[Diagram: YARN-Client vs YARN-Cluster – in yarn-client mode the Spark Driver runs in the client process while the App Master only negotiates Executors; in yarn-cluster mode the Spark Driver runs inside the App Master on the cluster]
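The two modes above are chosen at submit time. A minimal sketch, using the Spark 1.x syntax of this deck's era (later versions spell it --master yarn --deploy-mode client|cluster); the class name and jar are placeholders:

```shell
# yarn-client: the Spark Driver runs in the local client process
./bin/spark-submit --master yarn-client --class com.example.MyApp myApp.jar

# yarn-cluster: the Spark Driver runs inside the YARN Application Master
./bin/spark-submit --master yarn-cluster --class com.example.MyApp myApp.jar
```

yarn-cluster is the usual choice for production jobs, since the driver survives the client disconnecting; interactive tools (shell, Thrift Server, notebooks) need yarn-client.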
Spark on YARN
[Diagram: Spark-on-YARN job submission on a Hadoop cluster – (1) John Doe runs spark-submit, (2) the client submits the application to the YARN ResourceManager, (3) the RM directs a NodeManager to launch the Spark AM, (4) the AM requests Executor containers, which read/write HDFS]
Spark – Security – Four Pillars
Authentication · Authorization · Audit · Encryption
Spark leverages Kerberos on YARN. Ensure the network is secure.
Kerberos authentication within Spark
[Diagram: Kerberos authentication flow on a Hadoop cluster –
1. John Doe runs kinit against the KDC
2. The client gets a service ticket (ST) for Spark
3. Using the Spark ST, the client submits the Spark job
4. The Spark AM gets a Namenode (NN) service ticket
5. YARN launches Spark Executors using John Doe's identity
6. Executors read from HDFS using John Doe's delegation token]
Spark – Kerberos – Example
kinit -kt /etc/security/keytabs/johndoe.keytab [email protected]
./bin/spark-submit --class org.apache.spark.examples.SparkPi \
  --master yarn-cluster \
  --num-executors 3 \
  --driver-memory 512m \
  --executor-memory 512m \
  --executor-cores 1 \
  lib/spark-examples*.jar 10
Spark – Authorization
[Diagram: John Doe submits a Spark job to the YARN cluster –
1. The client gets a service ticket for Spark from the KDC
2. Using the Spark ST, the client submits the Spark job (Ranger: can John launch this job?)
3. The client gets a Namenode (NN) service ticket
4. Executors read from HDFS (Ranger: can John read this file?)]
Encryption: Spark – Communication Channels
[Diagram: Spark communication channels – Spark Submit ↔ RM and AM/Driver ↔ Executors (Control/RPC); Executors ↔ Shuffle Service on the NM (Shuffle Data); Executor ↔ Executor (ShuffleBlockTransfer); Executors ↔ DataSource (Read/Write Data); FS – Broadcast, File Download]
Spark Communication Encryption Settings
• Shuffle Data: spark.authenticate.enableSaslEncryption = true
• Control/RPC: spark.authenticate = true (leverages YARN to distribute keys)
• ShuffleBlockTransfer: NM > Ex leverages YARN-based SSL
• Read/Write Data: depends on the data source; for HDFS, RPC encryption (RC4 | 3DES), or SSL for WebHDFS
• FS – Broadcast, File Download: spark.ssl.enabled = true
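Taken together, the Spark-side settings above can be sketched in spark-defaults.conf. The keystore path and password are placeholders, and property names should be verified against your Spark version:

```
# spark-defaults.conf – sketch of the encryption settings above
spark.authenticate                       true
spark.authenticate.enableSaslEncryption  true
spark.ssl.enabled                        true
spark.ssl.keyStore                       /path/to/keystore.jks   # placeholder
spark.ssl.keyStorePassword               changeit                # placeholder
spark.ssl.protocol                       TLSv1.2
```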
Sharp Edges with Spark Security
• SparkSQL – only coarse-grained access control today
• Client > Spark Thrift Server > Spark Executors – no identity propagation on the 2nd hop
– Lowers security; forces the STS to run as the Hive user to read all data
– Workaround: use SparkSQL via the shell or the programmatic API
– https://issues.apache.org/jira/browse/SPARK-5159
• Spark Streaming + Kafka + Kerberos
– Issues fixed in HDP 2.4.x
– No SSL support yet
• Spark Shuffle – only SASL, no SSL support
• Spark Shuffle – no encryption for spill-to-disk or intermediate data
• Fine-grained security for SparkSQL: http://bit.ly/2bLghGz, http://bit.ly/2bTX7Pm
Apache Zeppelin Security
Apache Zeppelin: Authentication + SSL
[Diagram: Zeppelin authentication – (1) John Doe connects to Zeppelin over SSL through the firewall, (2) Zeppelin authenticates him against LDAP, (3) Zeppelin runs his code on Spark on YARN (Executors)]
Zeppelin + Livy E2E Security
[Diagram: John Doe authenticates to Zeppelin via LDAP; the Spark group interpreter calls the Livy APIs using SPNego (Kerberos); Livy talks to Spark on YARN over Kerberos/RPC; the job runs as John Doe]
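A sketch of the settings behind this flow, assuming the Livy interpreter and livy.conf properties of that era; the host, realm, and keytab path are placeholders, so verify the names against your Zeppelin/Livy versions:

```
# Zeppelin Livy interpreter setting – placeholder host
zeppelin.livy.url = http://livy-host.example.com:8998

# livy.conf – authenticate callers via SPNego, run jobs as the end-user
livy.impersonation.enabled = true
livy.server.auth.type = kerberos
livy.server.auth.kerberos.principal = HTTP/livy-host.example.com@EXAMPLE.COM   # placeholder
livy.server.auth.kerberos.keytab = /etc/security/keytabs/spnego.service.keytab # placeholder
```

With impersonation enabled, Livy launches each YARN application as the authenticated caller rather than as the Livy service user, which is what makes the "job runs as John Doe" step possible.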
Apache Zeppelin: Authorization
• Note-level authorization: grant permissions (Owner, Reader, Writer) on notes to users/groups; LDAP group integration
• Zeppelin UI authorization: allow only admins to configure interpreters; configured in shiro.ini
• Spark via Zeppelin > Livy > Spark: identity propagation – jobs run as the end-user
• Hive via the Zeppelin JDBC interpreter, and the shell interpreter: run as the end-user
• Authorization in Zeppelin vs. authorization at the data level
[urls]
/api/interpreter/** = authc, roles[admin]
/api/configurations/** = authc, roles[admin]
/api/credential/** = authc, roles[admin]
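For context, the [urls] rules above sit alongside the realm, user, and role sections of shiro.ini. A minimal sketch – the LDAP realm class follows Zeppelin's documentation, while the host, search base, password, and role mapping are placeholders:

```
[users]
# local fallback admin account – password is a placeholder
admin = changeme, admin

[main]
# LDAP group realm – host and search base are placeholders
ldapRealm = org.apache.zeppelin.server.LdapGroupRealm
ldapRealm.contextFactory.url = ldap://ldap.example.com:389
ldapRealm.contextFactory.environment[ldap.searchBase] = dc=example,dc=com

[roles]
admin = *

[urls]
/api/interpreter/** = authc, roles[admin]
/api/configurations/** = authc, roles[admin]
/api/credential/** = authc, roles[admin]
/** = authc
```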
Apache Zeppelin: Credentials
• LDAP/AD account: Zeppelin leverages the Hadoop Credential API
• Interpreter credentials: not solved yet – this is still an open issue
Spark Performance
#1 Big or Small Executor ?
spark-submit --master yarn \
  --deploy-mode client \
  --num-executors ? \
  --executor-cores ? \
  --executor-memory ? \
  --class MySimpleApp \
  mySimpleApp.jar \
  arg1 arg2
Show the Details
Cluster resource: 72 cores + 288 GB memory (3 nodes, each with 24 cores + 96 GB memory)
Benchmark with Different Combinations
Cores per E#   Memory per E (GB)   E per node
1              6                   18
2              12                  9
3              18                  6
6              36                  3
9              54                  2
18             108                 1
(#E stands for Executor; benchmarked with 18 cores + 108 GB memory per node; lower runtime is better)
Except for very small or very big executors, performance is relatively close.
Big Or Small Executor ?
Avoid too-small or too-big executors:
– A small executor decreases CPU and memory efficiency.
– A big executor introduces heavy GC overhead.
Usually 3–6 cores and 10–40 GB of memory per executor is a preferable choice.
Anything Else?
• Executor memory != container memory: container memory = executor memory + overhead memory (10% of executor memory)
• Leave some resources for other services and the OS
• Enable CPU scheduling if you want to constrain CPU usage#
#CPU scheduling: http://hortonworks.com/blog/managing-cpu-resources-in-your-hadoop-yarn-clusters/
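As a quick sanity check, the container-size rule above can be sketched in shell, assuming the default overhead rule of Spark's YARN mode in that era (10% of executor memory with a 384 MB floor – verify against your version, which may also let you set spark.yarn.executor.memoryOverhead explicitly):

```shell
# Container memory = executor memory + overhead (10%, floored at 384 MB)
executor_mb=4096
overhead_mb=$(( executor_mb / 10 ))
if [ "$overhead_mb" -lt 384 ]; then overhead_mb=384; fi
container_mb=$(( executor_mb + overhead_mb ))
echo "YARN container: ${container_mb} MB"
```

So a 4 GB executor actually asks YARN for roughly 4.4 GB; budget per-node resources with the overhead included.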
Multi-tenancy for Spark
• Leverage YARN queues
– Set user quotas
– Set the default YARN queue in spark-defaults
– Users can override it per job
• Leverage dynamic resource allocation
– Specify the range of executors a job uses
– Requires the external shuffle service
• Improves cluster resource utilization
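The two levers above map to spark-defaults.conf roughly as follows. A sketch: the queue name and executor bounds are placeholders, and property names should be checked against your Spark version:

```
# spark-defaults.conf – multi-tenancy sketch
spark.yarn.queue                      analytics   # placeholder queue name
spark.shuffle.service.enabled         true        # required for dynamic allocation
spark.dynamicAllocation.enabled       true
spark.dynamicAllocation.minExecutors  1
spark.dynamicAllocation.maxExecutors  20
```

A user can still override the queue per job, e.g. spark-submit --queue etl.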
Thank You
Vinay Shukla @neomythos