Spark tuning
Agenda
1. Tuning Spark parameters
   a. Control Spark's resource usage
   b. Advanced parameters
   c. Dynamic Allocation
2. Tips for tuning your Spark program
3. Example use case of tuning a Spark algorithm
Tuning Spark Parameters
The easy way
If your Spark application is slow, just let it have more system resources.
Is there anything simpler?
Spark Architecture Simplified
Control Spark’s resource usage
• spark-submit command parameters (some are only available when running on YARN)
• num-executors: Number of executors to launch (default: 2)
• executor-cores: Number of cores per executor (default: 1)
• executor-memory: Memory per executor (default: 1g)
• driver-cores: Number of cores used by the driver, only in YARN cluster mode (default: 1)
• driver-memory: Memory for the driver (default: 1g)
Calculate the right values
• For example: 4 servers for Spark, each with 64 GB RAM and 16 cores. How should we set those spark-submit parameters? (one way to derive the last option follows this list)
– --num-executors 4 --executor-memory 63g --executor-cores 15
– --num-executors 7 --executor-memory 29g --executor-cores 7
– --num-executors 11 --executor-memory 19g --executor-cores 5
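One way to derive the last option (this reasoning is not spelled out on the original slide): reserve roughly one core and 1 GB per node for the OS and Hadoop daemons, leaving 15 usable cores and 63 GB per server. With 5 cores per executor that gives 3 executors per node, i.e. 12 in total; keeping one slot free for the YARN ApplicationMaster leaves 11 executors. 63 GB split across 3 executors is about 21 GB each, and once the YARN memory overhead (next slide) is budgeted in, roughly 19 GB remains for --executor-memory.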
Spark Executor’s Memory Model
• Memory request from YARN for each container = spark.executor.memory + spark.yarn.executor.memoryOverhead
• spark.yarn.executor.memoryOverhead = max(spark.executor.memory * 0.1, 384mb)
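For example, with --executor-memory 19g the overhead is max(19 GB × 0.1, 384 MB) = 1.9 GB, so each container requested from YARN is roughly 20.9 GB, which still fits within the ~21 GB per executor budgeted on the previous slide.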
More advanced parameters
• spark.shuffle.memoryFraction: Fraction of Java heap to use for aggregation and cogroups during shuffles (default: 0.2)
• spark.reducer.maxSizeInFlight: Maximum size of map outputs to fetch simultaneously from each reduce task (default: 48m)
• spark.shuffle.consolidateFiles: If set to "true", consolidates intermediate files created during a shuffle (default: false)
• spark.shuffle.file.buffer: Size of the in-memory buffer for each shuffle file output stream (default: 32k)
• spark.storage.memoryFraction: Fraction of Java heap to use for Spark's memory cache (default: 0.6)
• spark.akka.frameSize: Maximum message size to allow in "control plane" communication (for serialized tasks and task results), in MB (default: 10)
• spark.akka.threads: Number of actor threads to use for communication (default: 4)
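A minimal sketch of how these Spark 1.x shuffle settings could be passed to an application. The property names are the ones from the table above; the application name and values are made-up starting points, not recommendations:

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Illustrative values only: check the Spark UI before and after changing them.
val conf = new SparkConf()
  .setAppName("shuffle-tuning-demo")              // hypothetical application name
  .set("spark.shuffle.memoryFraction", "0.3")     // give shuffle aggregation a bit more heap
  .set("spark.storage.memoryFraction", "0.5")     // and shrink the cache fraction accordingly
  .set("spark.reducer.maxSizeInFlight", "96m")    // fetch larger map outputs per request
  .set("spark.shuffle.file.buffer", "64k")        // bigger buffer per shuffle output stream

val sc = new SparkContext(conf)
```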
Advanced Spark memory
Demo Spark UI
Using Dynamic Allocation
• Dynamically scales the set of cluster resources allocated to your application up and down based on the workload
• Only available when using YARN as the cluster manager
• Requires an external shuffle service, so you must configure a shuffle service in YARN
Dynamic Allocation parameters (1)
• spark.shuffle.service.enabled: Enables the external shuffle service, which preserves the shuffle files written by executors so the executors can be safely removed (default: false)
• spark.dynamicAllocation.enabled: Whether to use dynamic resource allocation (default: false)
• spark.dynamicAllocation.executorIdleTimeout: If an executor has been idle for more than this duration, it will be removed (default: 60s)
• spark.dynamicAllocation.cachedExecutorIdleTimeout: If an executor that has cached data blocks has been idle for more than this duration, it will be removed (default: infinity)
• spark.dynamicAllocation.initialExecutors: Initial number of executors to run (default: spark.dynamicAllocation.minExecutors)
Dynamic Allocation parameters (2)
• spark.dynamicAllocation.maxExecutors: Upper bound for the number of executors (default: infinity)
• spark.dynamicAllocation.minExecutors: Lower bound for the number of executors (default: 0)
• spark.dynamicAllocation.schedulerBacklogTimeout: If there have been pending tasks backlogged for more than this duration, new executors will be requested (default: 1s)
• spark.dynamicAllocation.sustainedSchedulerBacklogTimeout: Same as spark.dynamicAllocation.schedulerBacklogTimeout, but used only for subsequent executor requests (default: schedulerBacklogTimeout)
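As a sketch, the settings from the two tables could be combined like this when building the application's SparkConf. The executor bounds below are purely illustrative, and the external shuffle service must already be set up on the YARN NodeManagers:

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Illustrative sketch: enables dynamic allocation on YARN with made-up bounds.
val conf = new SparkConf()
  .setAppName("dynamic-allocation-demo")
  .set("spark.shuffle.service.enabled", "true")               // external shuffle service must be running in YARN
  .set("spark.dynamicAllocation.enabled", "true")
  .set("spark.dynamicAllocation.minExecutors", "2")
  .set("spark.dynamicAllocation.maxExecutors", "11")
  .set("spark.dynamicAllocation.executorIdleTimeout", "60s")

val sc = new SparkContext(conf)
```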
Dynamic Allocation in Action
Dynamic Allocation - The verdict
• Dynamic Allocation helps you use your cluster resources more efficiently
• But it is only effective when the Spark application is long-running, with long stages that have different numbers of tasks (Spark Streaming?)
• In addition, when an executor is removed, all of its cached data is no longer accessible
Tips for Tuning Your Spark Program
Tuning Memory Usage
• Prefer arrays of objects and primitive types over the standard Java or Scala collection classes (e.g. HashMap).
• Avoid nested structures with a lot of small objects and pointers when possible.
• Use numeric IDs or enumeration objects instead of strings for keys.
• If you have less than 32 GB of RAM, set the JVM flag -XX:+UseCompressedOops to make pointers four bytes instead of eight (see the sketch after this list).
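A hypothetical sketch for the last two tips: the JVM flag can be forwarded to every executor through the standard spark.executor.extraJavaOptions property, and pair RDDs can be keyed by a compact numeric id instead of a string. The input path and field layout below are made up:

```scala
import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setAppName("memory-tuning-demo")
  // Executor heaps below 32 GB can use 4-byte compressed pointers.
  .set("spark.executor.extraJavaOptions", "-XX:+UseCompressedOops")
val sc = new SparkContext(conf)

// Key by a numeric userId rather than, say, a user-name string.
val bytesPerUser = sc.textFile("hdfs:///path/to/access.log")
  .map { line =>
    val f = line.split("\t")
    (f(0).toInt, f(2).toLong)        // (userId, bytes)
  }
  .reduceByKey(_ + _)
```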
Other Tuning Tips (1)
● Use KryoSerializer instead of the default JavaSerializer
● Know when to persist an RDD and pick the right storage level
○ MEMORY_ONLY
○ MEMORY_AND_DISK
○ MEMORY_ONLY_SER
○ …
● Choose the right level of parallelism (a combined sketch follows this list)
○ spark.default.parallelism
○ repartition
○ 2nd argument of methods in PairRDDFunctions
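A minimal sketch combining the three tips above; the input path, parallelism value, and key field are assumptions, not part of the original slides:

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.storage.StorageLevel

val conf = new SparkConf()
  .setAppName("tuning-tips-demo")
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")  // Kryo instead of the default Java serializer
  .set("spark.default.parallelism", "60")                                 // e.g. roughly the total number of executor cores
val sc = new SparkContext(conf)

val pairs = sc.textFile("hdfs:///path/to/events")
  .map(line => (line.split(",")(0), 1))
  .persist(StorageLevel.MEMORY_ONLY_SER)       // serialized cache: slower to access but far more compact

// The 2nd argument of reduceByKey (a PairRDDFunctions method) sets the number of reduce partitions.
val counts = pairs.reduceByKey(_ + _, 60)
```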
Other tuning tips (2)
• Broadcast large variables (see the sketch below)
• Do not collect on large RDDs (filter first)
• Be careful with operations that require a data shuffle (join, reduceByKey, groupByKey, …)
• Avoid groupByKey; use reduceByKey, aggregateByKey, or combineByKey (lower level) when possible.
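A short sketch of the broadcast and collect tips; the lookup map and input path are invented, and sc is assumed to be an existing SparkContext:

```scala
// Ship a largish read-only lookup table to each executor once, instead of with every task.
val countryByIp: Map[String, String] = Map("1.2.3.4" -> "VN")    // made-up lookup data
val countryBc = sc.broadcast(countryByIp)

val vietnameseVisits = sc.textFile("hdfs:///path/to/visits")
  .map(line => line.split(",")(0))
  .filter(ip => countryBc.value.get(ip).contains("VN"))          // cut the data down on the cluster first...
  .collect()                                                     // ...so only a small result returns to the driver
```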
groupByKey vs reduceByKey (1)
groupByKey vs reduceByKey (2)
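The two slides above are diagrams in the original deck. As a textual stand-in, here is a minimal word-count comparison (sample data invented): with groupByKey every individual value crosses the network before being summed, while reduceByKey computes partial sums on the map side so far less data is shuffled.

```scala
val words = sc.parallelize(Seq("spark", "tuning", "spark", "yarn", "spark"))
  .map(word => (word, 1))

// groupByKey: all the 1s for each word are shuffled, then summed on the reduce side.
val countsWithGroup = words.groupByKey().mapValues(_.sum)

// reduceByKey: partial sums per partition are combined before the shuffle.
val countsWithReduce = words.reduceByKey(_ + _)
```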
Example use case of tuning a Spark algorithm
Tuning the CF algorithm in the RW project
• 1st algorithm, no parameter tuning: 27 mins
• 1st algorithm, parameters tuned: 18 mins
• 2nd algorithm (from Spark code), parameters tuned: ~7 mins 30s
• 3rd algorithm (improved Spark code), parameters tuned: ~6 mins 30s
Q&A
Thank You!