Transcript of Chicago Spark Meetup - April 2017


Building Efficient Pipelines in Apache Spark

Guru Medasani

© Cloudera, Inc. All rights reserved.


Agenda
- Introduction: Myself; Cloudera
- Spark Pipeline Essentials: Using the Spark UI; Resource Allocation; Tuning; Data Formats; Streaming
- Questions


Introduction: Myself

Current: Senior Solutions Architect at Cloudera (Chicago, IL)
Past: Big Data Engineer at Monsanto Research & Development (St. Louis, MO)


Introduction: Cloudera

The modern platform for data management, machine learning and advanced analytics
- Founded: 2008, by former employees of [company logos]
- Product: First commercial distribution of Hadoop; CDH shipped 2009
- World-class support: 24x7 global staff & operations in 27 countries; proactive & predictive support programs using our EDH
- Mission critical: Production deployments in run-the-business applications worldwide (financial services, retail, telecom, media, health care, energy, government)
- The largest ecosystem: 2,500+ partners
- Cloudera University: Over 45,000 trained
- Open-source leaders: Cloudera employees are leading developers & contributors to the complete Apache Hadoop ecosystem of projects


Spark Pipeline Essentials: Using Spark UI


UI: Event Timeline


UI: Job Details - DAG


UI: Stage Details


UI: Stage Metrics


UI: Skewed Data Metrics - Example


UI: Job Labels and Storage


UI: Job Labels and RDD Names


UI: DataFrame and Dataset Names

https://issues.apache.org/jira/browse/SPARK-8480


UI: Skipped Stages

http://stackoverflow.com/questions/34580662/what-does-stage-skipped-mean-in-apache-spark-web-ui


UI: Using Shuffle Metrics


Lots more in the UI
- SQL queries
- Environment variables
- Executor aggregates


Spark Pipeline Essentials: Resource Allocation


Resources: Basics

If running Spark on YARN, the first step is to set up proper YARN resource queues and dynamic resource pools.


Resources: Dynamic Allocation

Dynamic allocation allows Spark to dynamically scale the cluster resources allocated to your application based on the workload. Originally Spark-on-YARN only; now supported by all cluster managers.

Static Allocation vs Dynamic Allocation
- Static allocation: --num-executors NUM
- Dynamic allocation: enabled by default in CDH; a good starting point, but not the final solution
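As an illustration, the relevant dynamic-allocation knobs can be set in spark-defaults.conf or passed via --conf; the values below are examples, not recommendations:

```properties
spark.dynamicAllocation.enabled              true
spark.shuffle.service.enabled                true
spark.dynamicAllocation.minExecutors         2
spark.dynamicAllocation.maxExecutors         18
spark.dynamicAllocation.executorIdleTimeout  60s
```

The external shuffle service must be enabled so that shuffle files survive when an idle executor is removed.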


Dynamic Allocation in Spark Streaming
- Enabled by default in CDH
- Cloudera recommends disabling dynamic allocation for Spark Streaming

Why?
- Dynamic allocation removes executors when they are idle, but in streaming, data comes in every batch and executors run whenever data is available
- If the executor idle timeout is less than the batch duration, executors are constantly added and removed
- If the executor idle timeout is greater than the batch duration, executors are never removed
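Following that recommendation, a streaming job would typically turn dynamic allocation off and pin a fixed executor count (the count here is illustrative):

```properties
spark.dynamicAllocation.enabled  false
spark.executor.instances         10
```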


Resources: # Executors, cores, memory !?!

6 nodes, 16 cores each, 64 GB of RAM each


Decisions, decisions, decisions

- Number of executors (--num-executors)
- Cores for each executor (--executor-cores)
- Memory for each executor (--executor-memory)

6 nodes, 16 cores each, 64 GB of RAM each


Spark Architecture recap

Answer #1: Most granular
- Smallest possible executors: 1 core each
- 64 GB/node ÷ 16 executors/node = 4 GB/executor
- Total: 16 cores × 6 nodes = 96 cores => 96 executors

[Diagram: worker node running many single-core executors (Executor 1 … Executor 6 shown)]

Why?
- Not using the benefits of running multiple tasks in the same executor
- Missing the benefits of shared broadcast variables: more copies of the data are needed

Answer #2: Least granular
- 6 executors in total => 1 executor per node
- 64 GB memory each
- 16 cores each

[Diagram: worker node running a single 16-core executor]

Why?
- Need to leave some memory overhead for OS and Hadoop daemons


Answer #3: With overhead
- 6 executors, 1 executor/node
- 63 GB memory each
- 15 cores each

[Diagram: worker node running one executor, with overhead (1 GB, 1 core) reserved]

Let's assume you are running Spark on YARN from here on. There are 4 other things to keep in mind.

#1 - Memory overhead

- --executor-memory controls the heap size
- Some overhead is needed for off-heap memory, controlled by spark.yarn.executor.memoryOverhead
- Default is max(384 MB, 0.10 × spark.executor.memory)
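The default overhead rule, sketched in Python (the helper name is ours, not Spark's):

```python
def yarn_memory_overhead_mb(executor_memory_mb):
    """Default YARN off-heap overhead: max(384 MB, 10% of executor memory)."""
    return max(384, int(0.10 * executor_memory_mb))

# A 19 GB executor actually asks YARN for ~19 GB of heap plus ~1.9 GB of overhead.
print(yarn_memory_overhead_mb(19 * 1024))   # 1945
```

This is why a container request is always larger than --executor-memory alone.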


#2 - YARN AM needs a core: Client mode


#2 - YARN AM needs a core: Cluster mode


#3 - HDFS Throughput
- 15 cores per executor can lead to bad HDFS I/O throughput
- Best to keep it under 5 cores per executor

#4 - Garbage Collection
- Too much executor memory can cause excessive garbage-collection delays
- 64 GB is a rough guess at a good upper limit for a single executor
- When you reach this level, start looking at GC tuning

Calculations
- 5 cores per executor, for max HDFS throughput
- Cluster has 6 × 15 = 90 cores in total (after taking out cores for Hadoop/YARN daemons)
- 90 cores ÷ 5 cores/executor = 18 executors
- Each node has 3 executors
- 63 GB ÷ 3 = 21 GB; 21 × (1 − 0.07) ≈ 19 GB
- 1 executor for the AM => 17 executors
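The arithmetic above as a quick sanity-check script; the 1-core/1-GB daemon reserve and the ~7% heap-overhead factor mirror the slides' assumptions:

```python
nodes, cores_per_node, mem_per_node_gb = 6, 16, 64

# Reserve 1 core per node for OS/Hadoop daemons
usable_cores = nodes * (cores_per_node - 1)          # 90
cores_per_executor = 5                               # for good HDFS throughput
executors = usable_cores // cores_per_executor       # 18
executors_per_node = executors // nodes              # 3

# Split usable memory (1 GB reserved per node) across executors,
# then carve out ~7% of each executor's share for memory overhead
mem_per_executor = (mem_per_node_gb - 1) // executors_per_node   # 21
heap_gb = int(mem_per_executor * (1 - 0.07))                     # 19

# One executor's worth of resources goes to the YARN ApplicationMaster
final_executors = executors - 1                      # 17
print(final_executors, heap_gb, cores_per_executor)  # 17 19 5
```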

[Diagram: worker node running 3 executors, with overhead reserved]


Correct answer
- 17 executors in total
- 19 GB memory/executor
- 5 cores/executor

* Not etched in stone
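Put together as a spark-submit invocation (the class name and jar are placeholders):

```shell
spark-submit \
  --master yarn \
  --num-executors 17 \
  --executor-cores 5 \
  --executor-memory 19G \
  --class com.example.MyJob \
  my-job.jar
```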



Dynamic allocation helps with this though, right?

- Number of executors (--num-executors)
- Cores for each executor (--executor-cores)
- Memory for each executor (--executor-memory)

6 nodes, 16 cores each, 64 GB of RAM each


Spark Pipeline Essentials: Tuning

Memory: Unified Memory Management

https://issues.apache.org/jira/browse/SPARK-10000


Memory: Example
- Say you have a 64 GB executor
- Default spark.memory.fraction: 0.6 → 0.6 × 64 = 38.4 GB
- Default spark.memory.storageFraction: 0.5 → 0.5 × 38.4 = 19.2 GB

Based on how much data is being spilled, GC pauses, and OOMEs, you can take the following actions:
- Increase the number of executors (increasing parallelism)
- Tweak spark.yarn.executor.memoryOverhead (to avoid OOMEs)
- Tweak spark.memory.fraction (reduces memory pressure and spilling)
- Tweak spark.memory.storageFraction (set it to what you think is right, not something excessive)
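The example's arithmetic, sketched out. Note that Spark's real formula first subtracts ~300 MB of reserved memory from the heap before applying spark.memory.fraction; the slide ignores that, and so does this sketch:

```python
executor_heap_gb = 64
memory_fraction = 0.6     # spark.memory.fraction default
storage_fraction = 0.5    # spark.memory.storageFraction default

unified_gb = executor_heap_gb * memory_fraction   # execution + storage region
storage_gb = unified_gb * storage_fraction        # storage share immune to eviction

print(unified_gb, storage_gb)   # 38.4 19.2
```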


Memory: Hidden Caches (GraphX)

org.apache.spark.graphx.lib.PageRank


Memory: Hidden Caches (MLlib)


Parallelism
- Number of tasks depends on the number of partitions
- Too many partitions is usually better than too few
- A very important parameter in determining performance
- Datasets read from HDFS rely on the number of HDFS blocks; typically each HDFS block becomes a partition in the RDD
- The user can specify the number of partitions during input or transformations

What should the X be?
- The most straightforward answer is experimentation
- Look at the number of partitions in the parent RDD, then keep multiplying that by 1.5 until performance stops improving

val rdd2 = rdd1.reduceByKey(_ + _, numPartitions = X)
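The ×1.5 experiment can be scripted; partition_candidates is our helper name, and in PySpark the equivalent of the Scala line above is rdd1.reduceByKey(lambda a, b: a + b, numPartitions=X):

```python
def partition_candidates(parent_partitions, rounds=5):
    """Candidate partition counts to try: the parent's count, then ×1.5 per round."""
    counts, n = [], parent_partitions
    for _ in range(rounds):
        counts.append(n)
        n = int(n * 1.5)
    return counts

# Benchmark the job at each candidate and stop when performance plateaus.
print(partition_candidates(200))   # [200, 300, 450, 675, 1012]
```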


How about the cluster? The two main resources that Spark