Apache Spark Operations

1© Cloudera, Inc. All rights reserved.
Spark Operations
Kostas Sakellis

Me
• Software Engineer at Cloudera
• Contributor to Apache Spark
• Before that, contributed to Cloudera Manager

Building a proof of concept!
Courtesy of: http://www.nefloridadesign.com/mbimages/6.jpg

Example
sc.textFile("hdfs://data/u.item", 4).map(Movie(_)).filter(_.month.equals("Nov")).collect()

Partitions
sc.textFile("hdfs://data/u.item", 4).map(Movie(_)).filter(_.month.equals("Nov")).collect()
[Diagram: the HDFS file split into Partitions 1-4]

RDDs
sc.textFile("hdfs://data/u.item", 4).map(Movie(_)).filter(_.month.equals("Nov")).collect()
[Diagram: HDFS feeding one RDD with Partitions 1-4]

RDDs
sc.textFile("hdfs://data/u.item", 4).map(Movie(_)).filter(_.month.equals("Nov")).collect()
[Diagram: HDFS feeding a chain of two RDDs, each with Partitions 1-4]

RDDs
sc.textFile("hdfs://data/u.item", 4).map(Movie(_)).filter(_.month.equals("Nov")).collect()
[Diagram: HDFS feeding a chain of three RDDs, each with Partitions 1-4]

RDDs
sc.textFile("hdfs://data/u.item", 4).map(Movie(_)).filter(_.month.equals("Nov")).collect()
[Diagram: HDFS feeding a chain of three RDDs, each with Partitions 1-4, ending in Collect]

RDD Lineage
sc.textFile("hdfs://data/u.item", 4).map(Movie(_)).filter(_.month.equals("Nov")).collect()
[Diagram: HDFS feeding a chain of three RDDs, each with Partitions 1-4, ending in Collect; the chain as a whole is labeled "Lineage"]
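
One way to picture lineage is as a chain of parent references: each dataset remembers its parent and the function that derives it, so any partition can be recomputed from the source on demand. The following is a minimal sketch in plain Scala, not Spark's actual RDD classes; `Node`, `Source`, `Mapped`, and `Filtered` are hypothetical names invented for this illustration.

```scala
object LineageSketch {
  // Hypothetical stand-in for an RDD: it can compute one partition
  // and describe the chain of steps that leads back to the source.
  sealed trait Node[A] {
    def compute(partition: Int): Iterator[A]
    def lineage: List[String]
  }

  // The root of the chain, standing in for a file read from HDFS.
  final case class Source[A](name: String, partitions: Vector[Vector[A]]) extends Node[A] {
    def compute(partition: Int): Iterator[A] = partitions(partition).iterator
    def lineage: List[String] = List(name)
  }

  // A map step: holds only its parent and the function, no data of its own.
  final case class Mapped[A, B](parent: Node[A], f: A => B) extends Node[B] {
    def compute(partition: Int): Iterator[B] = parent.compute(partition).map(f)
    def lineage: List[String] = "map" :: parent.lineage
  }

  // A filter step: again only a parent reference and a predicate.
  final case class Filtered[A](parent: Node[A], p: A => Boolean) extends Node[A] {
    def compute(partition: Int): Iterator[A] = parent.compute(partition).filter(p)
    def lineage: List[String] = "filter" :: parent.lineage
  }
}
```

Because no step stores data, calling `compute` for a single partition index replays just that partition's chain, which is the property that lets a lost partition be rebuilt without touching the others.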

Task
• A pipelined set of transformations run on a single thread
[Diagram: HDFS feeding a chain of three RDDs, each with Partitions 1-4, ending in Collect]
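
The pipelining idea can be sketched with an ordinary Scala iterator: a task walks one partition and pushes each record through every transformation before moving on to the next record. This is an illustration under assumptions, not Spark internals. The `Movie` fields and the `title|month` line format are invented here, since the slides never define `Movie`.

```scala
import scala.collection.mutable.ArrayBuffer

object TaskPipelineSketch {
  // Hypothetical record type; the deck does not show Movie's definition.
  final case class Movie(title: String, month: String)

  // Assumed input format "title|month", for illustration only.
  def parse(line: String): Movie = {
    val fields = line.split('|')
    Movie(fields(0), fields(1))
  }

  // One task: map then filter over a single partition's iterator.
  // The trace records the order in which the two functions are called.
  def runTask(partition: Iterator[String], trace: ArrayBuffer[String]): List[Movie] =
    partition
      .map { line => trace += s"map $line"; parse(line) }
      .filter { m => trace += s"filter ${m.title}"; m.month == "Nov" }
      .toList
}
```

Running `runTask` on a two-line partition yields a trace that alternates map and filter calls per record, rather than mapping every record first and filtering afterwards. No intermediate collection is built between the two steps.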

Spark Architecture

Spark System Architecture

Deployments
• Spark supports pluggable cluster managers
  • local, Standalone, YARN and Mesos
• In early 2014, CDH 4.x with Spark 0.9 only supported Standalone
• CDH 5.x includes Spark on YARN support

Standalone
[Diagram labels: Client, Master, Worker (x2), AppMaster, Process (x2)]

Standalone
• On cluster
    ./sbin/start-master.sh
    ./sbin/start-slave.sh <master-spark-URL>
• Submit job
    spark-submit --master <master-spark-URL> …

YARN Architecture
[Diagram labels: Client, Resource Manager, Node Manager (x2), Container (x3), AppMaster, Process (x2)]

Spark on YARN Architecture
[Diagram labels: Client, Resource Manager, Node Manager (x2), Container (x3), AppMaster, Process (x2)]

Spark on YARN
• Submit job
    spark-submit --master yarn-client …
• Cluster mode
    spark-submit --master yarn-cluster …
• Spark shell only works in client mode!

Customers often have shared infrastructure
Courtesy of: https://radioglobalistic.files.wordpress.com/2011/02/lagos-traffic.jpg

Multi-tenancy
• Cluster utilization is the top metric
  • Target: 70-80% utilization
• Mixed workloads from mixed customers
• We recommend YARN
  • Built-in resource manager

Underutilized Clusters
Courtesy of: http://media.nbclosangeles.com/images/1200*675/60-freeway-repair-dec16-2-empty.JPG

Dynamic Allocation
• Spark applications scale the number of executors based on load
  • Removes the need for --num-executors
  • Idle executors get killed
• First supported in CDH 5.4
• Ideal for:
  • Long ETL jobs with large shuffles
  • Shell applications: Hive and Spark shell
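
As a rough sketch of what turning this on involves, the settings below use Spark's `spark.dynamicAllocation.*` and `spark.shuffle.service.enabled` property names. The values are placeholders rather than recommendations, and `toSubmitArgs` is a hypothetical helper written for this note, not part of Spark.

```scala
object DynamicAllocationConfSketch {
  // Spark property names; the values are placeholders chosen for illustration.
  val settings: Seq[(String, String)] = Seq(
    "spark.dynamicAllocation.enabled" -> "true",
    // The external shuffle service keeps shuffle files available
    // after an idle executor has been removed.
    "spark.shuffle.service.enabled" -> "true",
    "spark.dynamicAllocation.minExecutors" -> "1",
    "spark.dynamicAllocation.maxExecutors" -> "20",
    "spark.dynamicAllocation.executorIdleTimeout" -> "60s"
  )

  // Hypothetical helper: render the settings as spark-submit --conf arguments.
  def toSubmitArgs(conf: Seq[(String, String)]): Seq[String] =
    conf.flatMap { case (key, value) => Seq("--conf", s"$key=$value") }
}
```

The rendered arguments take the place of --num-executors on the spark-submit command line.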

Dynamic Allocation Limitations
• Still required to specify cores
  • --executor-cores
• Memory
  • --executor-memory
  • Sizing must account for JVM overhead
  • Need to do the math yourself
• Our customers still get it wrong!
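
The memory math can be sketched as follows. The overhead rule used here (10% of executor memory, with a 384 MB floor) is an assumption modeled on the Spark-on-YARN memory overhead default; the factor has varied between Spark releases, so check the documentation for your version. The sketch also ignores YARN rounding container requests up to its allocation increment.

```scala
object ExecutorSizingSketch {
  // Assumed defaults, not authoritative for every Spark version.
  val OverheadFraction = 0.10
  val MinOverheadMb = 384

  // Off-heap overhead requested on top of --executor-memory.
  def overheadMb(executorMemoryMb: Int): Int =
    math.max(MinOverheadMb, (executorMemoryMb * OverheadFraction).toInt)

  // Total memory the container request asks YARN for.
  def containerMb(executorMemoryMb: Int): Int =
    executorMemoryMb + overheadMb(executorMemoryMb)

  // How many executors of this size fit in the memory YARN can hand out on one node.
  def executorsPerNode(nodeMemoryMb: Int, executorMemoryMb: Int): Int =
    nodeMemoryMb / containerMb(executorMemoryMb)
}
```

Under these assumptions, an 8192 MB executor needs a 9011 MB container, so a node with 65536 MB available to YARN fits 7 executors rather than the 8 that dividing by --executor-memory alone would suggest.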

The Future of Dynamic Allocation
• Only "task size" needed: --task-size
• Eliminates
  • --executor-cores
  • --num-executors
  • --executor-memory
• Leads to better cluster utilization

Security, now it’s getting serious.
Courtesy of: https://www.iti.illinois.edu/sites/default/files/Cybersecurity_image.jpg

Authentication
• Kerberos – the necessary evil
  • Ubiquitous among other services
  • YARN, HDFS, Hive, HBase, etc.
• Spark utilizes delegation tokens

Encryption
• Control plane: SPARK-6028 (Replace with netty)
• File distribution: Replace with netty
• Block Manager: Spark 1.4
• User UI / REST API: SPARK-2750 (SSL)
• Data-at-rest (shuffle files): SPARK-5682

Authorization
• Enterprises have sensitive data
  • Beyond HDFS file permissions
  • Partial access to data
  • Column-level granularity
• Apache Sentry
  • HDFS-Sentry synchronization plugin
• Record Service
  • Column-level security for Spark!

Thank you
We're Hiring!