Apache Spark Operations

Spark Operations, Kostas Sakellis. © Cloudera, Inc. All rights reserved.

Transcript of Apache Spark Operations

Page 1: Apache Spark Operations

Spark Operations

Kostas Sakellis

Page 2: Apache Spark Operations

Me

• Software Engineer at Cloudera
• Contributor to Apache Spark
• Before that, contributed to Cloudera Manager

Page 3: Apache Spark Operations

Building a proof of concept!

Courtesy of: http://www.nefloridadesign.com/mbimages/6.jpg

Page 4: Apache Spark Operations

Example

sc.textFile("hdfs://data/u.item", 4)
  .map(Movie(_))
  .filter(_.month.equals("Nov"))
  .collect()
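
The slide does not show how `Movie` is defined. The following is a minimal sketch of one possible definition, assuming a pipe-delimited layout like that of the MovieLens `u.item` file (id, title, release date as `dd-Mon-yyyy`). The `Movie` fields, the `MovieExample` object, and the sample rows are hypothetical, and the chain runs on a local `Seq` rather than an RDD so that it needs no cluster.

```scala
// Hypothetical record type; the talk does not show the real Movie class.
final case class Movie(id: Int, title: String, month: String)

object Movie {
  // Assumed input layout: "id|title|dd-Mon-yyyy|..." (pipe-delimited).
  def apply(line: String): Movie = {
    val fields = line.split('|')
    // The month is the middle part of the release date; empty if the date is missing.
    val dateParts = if (fields.length > 2) fields(2).split('-') else Array.empty[String]
    val month = if (dateParts.length > 1) dateParts(1) else ""
    Movie(fields(0).toInt, fields(1), month)
  }
}

object MovieExample {
  // Same map/filter chain as the slide, applied to a local collection.
  def novemberMovies(lines: Seq[String]): Seq[Movie] =
    lines.map(Movie(_)).filter(_.month.equals("Nov"))

  def main(args: Array[String]): Unit = {
    // Made-up sample rows, for illustration only.
    val sample = Seq(
      "1|Example Movie A|01-Jan-1995",
      "2|Example Movie B|22-Nov-1995"
    )
    novemberMovies(sample).foreach(println)
  }
}
```

With a `SparkContext` in scope, the same `Movie(_)` function could be passed to `map` on the RDD returned by `sc.textFile`, as on the slide.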

Page 7: Apache Spark Operations

Partitions

sc.textFile("hdfs://data/u.item", 4)
  .map(Movie(_))
  .filter(_.month.equals("Nov"))
  .collect()

[Diagram: HDFS and Partitions 1 to 4.]

Page 8: Apache Spark Operations

RDDs

sc.textFile("hdfs://data/u.item", 4)
  .map(Movie(_))
  .filter(_.month.equals("Nov"))
  .collect()

[Diagram: HDFS and one RDD with Partitions 1 to 4.]

Page 9: Apache Spark Operations

RDDs

sc.textFile("hdfs://data/u.item", 4)
  .map(Movie(_))
  .filter(_.month.equals("Nov"))
  .collect()

[Diagram: HDFS and two RDDs, each with Partitions 1 to 4.]

Page 10: Apache Spark Operations

RDDs

sc.textFile("hdfs://data/u.item", 4)
  .map(Movie(_))
  .filter(_.month.equals("Nov"))
  .collect()

[Diagram: HDFS and three RDDs, each with Partitions 1 to 4.]

Page 11: Apache Spark Operations

RDDs

sc.textFile("hdfs://data/u.item", 4)
  .map(Movie(_))
  .filter(_.month.equals("Nov"))
  .collect()

[Diagram: HDFS, three RDDs, each with Partitions 1 to 4, and Collect.]

Page 12: Apache Spark Operations

RDD Lineage

sc.textFile("hdfs://data/u.item", 4)
  .map(Movie(_))
  .filter(_.month.equals("Nov"))
  .collect()

[Diagram: HDFS, three RDDs, each with Partitions 1 to 4, and the labels Collect and Lineage.]

Page 13: Apache Spark Operations

Task

[Diagram: HDFS, three RDDs, each with Partitions 1 to 4, and Collect.]

• A pipelined set of transformations on a single thread
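
To make the idea of a pipelined task concrete, here is a small sketch that uses a plain Scala `Iterator` as a stand-in for one partition. It is not Spark's scheduler code: `TaskSketch`, `runTask`, and the `trace` buffer are hypothetical names introduced only to show that each record passes through the whole chain before the next record is read.

```scala
import scala.collection.mutable.ArrayBuffer

object TaskSketch {
  // Hypothetical stand-in for one task: it applies the chained functions to a
  // single partition, record by record, on the calling thread. The trace
  // buffer records the order in which the functions actually run.
  def runTask(partition: Iterator[String], trace: ArrayBuffer[String]): Vector[String] =
    partition
      .map { line => trace.append(s"map($line)"); line.toUpperCase }
      .filter { rec => trace.append(s"filter($rec)"); rec.startsWith("N") }
      .toVector // materializing the result drives the lazy iterator chain

  def main(args: Array[String]): Unit = {
    val trace = ArrayBuffer.empty[String]
    val out = runTask(Iterator("nov", "jan", "nov"), trace)
    println(out)
    // The trace interleaves map and filter calls per record.
    trace.foreach(println)
  }
}
```

Because iterators are lazy, no intermediate collection is built between the `map` and the `filter`; that is the sense in which the transformations are pipelined.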

Page 14: Apache Spark Operations

Spark Architecture

Page 15: Apache Spark Operations

Spark System Architecture

Page 16: Apache Spark Operations

Deployments

• Spark supports pluggable Cluster Managers
  • local, Standalone, YARN and Mesos

• In early 2014, CDH 4.x with Spark 0.9 only supported Standalone
• CDH 5.x includes Spark on YARN support

Page 17: Apache Spark Operations

Standalone

[Diagram labels: Client, Master, two Workers, AppMaster, two Processes.]

Page 18: Apache Spark Operations

Standalone

• On cluster
  ./sbin/start-master.sh
  ./sbin/start-slave.sh <master-spark-URL>

• Submit job
  spark-submit --master <master-spark-URL>

Page 19: Apache Spark Operations

YARN Architecture

[Diagram labels: Client, Resource Manager, two Node Managers, three Containers, AppMaster, two Processes.]

Page 20: Apache Spark Operations

Spark on YARN Architecture

[Diagram labels: Client, Resource Manager, two Node Managers, three Containers, AppMaster, two Processes.]

Page 22: Apache Spark Operations

Spark on YARN

• Submit job
  spark-submit --master yarn-client …

• Cluster mode
  spark-submit --master yarn-cluster …

• Spark shell only works in client mode!

Page 23: Apache Spark Operations

Customers often have shared infrastructure

Courtesy of: https://radioglobalistic.files.wordpress.com/2011/02/lagos-traffic.jpg

Page 24: Apache Spark Operations

Multi-tenancy

• Cluster utilization is the top metric
  • Target: 70-80% utilization

• Mixed workloads from mixed customers
• We recommend YARN
  • Built-in resource manager

Page 25: Apache Spark Operations

Underutilized Clusters

Courtesy of: http://media.nbclosangeles.com/images/1200*675/60-freeway-repair-dec16-2-empty.JPG

Page 26: Apache Spark Operations

Dynamic Allocation

• Spark applications scale the number of executors based on load
  • Removes need for: --num-executors
  • Idle executors get killed

• First supported in CDH 5.4
• Ideal for:
  • Long ETL jobs with large shuffles
  • Shell applications: Hive and Spark shell
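
As a rough illustration of the bullets above, the sketch below lists Spark properties that are typically set to turn on dynamic allocation and renders them as `--conf` arguments for `spark-submit`. The property names are real Spark settings; the executor bounds, the idle timeout value, and the `DynamicAllocationSketch` object are assumptions chosen for illustration, not recommendations from the talk.

```scala
object DynamicAllocationSketch {
  // Real Spark property names; every value here is an illustrative assumption.
  val settings: Seq[(String, String)] = Seq(
    "spark.dynamicAllocation.enabled" -> "true",
    // The external shuffle service keeps shuffle files readable after an
    // idle executor has been removed.
    "spark.shuffle.service.enabled" -> "true",
    "spark.dynamicAllocation.minExecutors" -> "1",          // hypothetical lower bound
    "spark.dynamicAllocation.maxExecutors" -> "20",         // hypothetical upper bound
    "spark.dynamicAllocation.executorIdleTimeout" -> "60s"  // idle executors are released after this
  )

  // Render the settings as spark-submit arguments: --conf key=value ...
  def submitArgs: Seq[String] =
    settings.flatMap { case (key, value) => Seq("--conf", s"$key=$value") }

  def main(args: Array[String]): Unit =
    println(("spark-submit" +: submitArgs).mkString(" "))
}
```

Note that the rendered command carries no `--num-executors` flag; the executor count is left to Spark within the configured bounds.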

Page 27: Apache Spark Operations

Dynamic Allocation Limitations

• Still required to specify cores
  • --num-cores

• Memory
  • --executor-memory
  • Includes JVM overhead
  • Need to do the math yourself

• Our customers still get it wrong!
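
The "math" in question can be sketched as follows for Spark on YARN, where the container request is the executor memory plus an off-heap overhead. The 384 MB floor and the 10% fraction are assumptions based on Spark-on-YARN defaults, which have varied across versions; `ExecutorMemorySketch` and its method names are hypothetical.

```scala
object ExecutorMemorySketch {
  // Assumed minimum overhead in MB (a Spark-on-YARN default).
  val MinOverheadMb = 384

  // Overhead is the larger of the floor and a fraction of the executor memory.
  // The default fraction of 0.10 is an assumption; older releases used less.
  def overheadMb(executorMemoryMb: Int, overheadFraction: Double = 0.10): Int =
    math.max((overheadFraction * executorMemoryMb).toInt, MinOverheadMb)

  // Approximate size of the container requested for one executor.
  def containerMb(executorMemoryMb: Int, overheadFraction: Double = 0.10): Int =
    executorMemoryMb + overheadMb(executorMemoryMb, overheadFraction)

  def main(args: Array[String]): Unit =
    for (heap <- Seq(2048, 4096, 8192))
      println(s"--executor-memory ${heap}m -> container of about ${containerMb(heap)} MB")
}
```

Under these assumptions, `--executor-memory 4096m` leads to a request of about 4505 MB, not 4096 MB. YARN may round the request up further to its allocation increment, which this sketch does not model.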

Page 28: Apache Spark Operations

The Future of Dynamic Allocation

• Only “task size” needed: --task-size
• Eliminates
  • --num-cores
  • --num-executors
  • --executor-memory

• Leads to better cluster utilization

Page 29: Apache Spark Operations

Security, now it’s getting serious.

Courtesy of: https://www.iti.illinois.edu/sites/default/files/Cybersecurity_image.jpg

Page 30: Apache Spark Operations

Authentication

• Kerberos: the necessary evil
  • Ubiquitous amongst other services
  • YARN, HDFS, Hive, HBase, etc.

• Spark utilizes delegation tokens

Page 31: Apache Spark Operations

Encryption

• Control plane: SPARK-6028 (replace with netty)
• File distribution: replace with netty
• Block Manager: Spark 1.4
• User UI / REST API: SPARK-2750 (SSL)
• Data-at-rest (shuffle files): SPARK-5682

Page 32: Apache Spark Operations

Authorization

• Enterprises have sensitive data
  • Beyond HDFS file permissions
  • Partial access to data
  • Column level granularity

• Apache Sentry
  • HDFS-Sentry synchronization plugin

• Record Service
  • Column level security for Spark!

Page 33: Apache Spark Operations

Thank you

We’re Hiring!