Sparklint @ Spark Meetup Chicago
Sparklint: a Tool for Identifying and Tuning Inefficient Spark Jobs Across Your Cluster
Simon Whitear, Principal Engineer
Why Sparklint?
- A successful Spark cluster grows rapidly
- Capacity and capability mismatches arise
- This leads to resource contention
- The tuning process is non-trivial
- The current Spark UI is operational in focus
We wanted to understand application efficiency
Spark cluster success:
- Platform rolls out with a maximum supported load
- Early projects ramp up; usage is fine
- Early successes feed back into recommendations to use the platform
- New users start loading up the platform just as the initial successes are being scaled
- Platform limits are hit; scaling requirements now begin to be understood and planned for
- Rough times follow while platform operations learn to lead the application usage
The Spark UI provides masses of info, but by default only for recent jobs/stages/tasks, and only while the job is alive. When serving the Spark UI from the history server, there is still little summary information for debugging the job config: have I used the right magic numbers (locality wait, cores, numPartitions, job scheduling mode, etc.)? It is difficult to compare different executions of the same job because of this missing level of summary; execution time is almost the only metric available to compare.
Sparklint provides:
- Live view of batch & streaming application stats, or
- Event-by-event analysis of historical event logs
- Stats and graphs for:
  - Idle time
  - Core usage
  - Task locality
A mechanism that listens to the Spark event log stream and accumulates lifetime stats without losing (too many) details, using constant memory in live mode thanks to the gauges we use. The mechanism also provides convenient replay when serving from a file. A set of stats and graphs describes job performance uniformly:
1. Idle time: the duration when all computation happens on the driver node; something to avoid.
2. Max core usage and core usage percentage: should be neither too high nor too low (we are thinking about using the average number of tasks in wait to supplement it).
3. Task execution time for a given stage, broken down by locality: this honestly describes the opportunity cost of a lower locality and indicates the ideal locality wait config.
Demo
Simulated workload analyzing site access logs:
- read text file as JSON
- convert to Record(ip, verb, status, time)
- countByIp, countByStatus, countByVerb
Using ReduceByKey.scala in the repo as a sample, we demo a series of attempts to optimize a Spark application. The logs are included as well. The highlights of each run are annotated in the screenshots in the attachment.
The application basically reads a text file, parses the JSON, and converts it to "Record(ip: String, verb: String, status: Int, time: Long)", then runs countByIp, countByStatus, and countByVerb on the records, repeated 10 times. These are three independent map-reduce jobs, each with one map stage (parsing) and one reduce stage (countByXXX).
Algorithm-level optimization is out of scope here. The app needs a constant number of CPU seconds, plus a floating but bounded amount of network I/O time (determined by job locality), to finish execution.
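The workload described above can be sketched roughly as follows. This is a reconstruction from the slide text, not the actual ReduceByKey.scala; the input path is a placeholder and the JSON parsing is elided:

```scala
import org.apache.spark.sql.SparkSession

// Field names follow the slide text: Record(ip, verb, status, time).
case class Record(ip: String, verb: String, status: Int, time: Long)

object AccessLogDemo {
  def parse(line: String): Record = ??? // JSON parsing elided

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.appName("AccessLogDemo").getOrCreate()
    val sc = spark.sparkContext

    val records = sc.textFile("hdfs:///logs/access.json").map(parse)

    // Three independent map-reduce jobs, repeated 10 times. Each has a map
    // stage (parsing) and a reduce stage (the countByXXX aggregation).
    for (_ <- 1 to 10) {
      records.map(r => (r.ip, 1L)).reduceByKey(_ + _).collect()     // countByIp
      records.map(r => (r.status, 1L)).reduceByKey(_ + _).collect() // countByStatus
      records.map(r => (r.verb, 1L)).reduceByKey(_ + _).collect()   // countByVerb
    }
    spark.stop()
  }
}
```

Run sequentially with the default FIFO scheduler, these three jobs execute strictly one after another, which is what the baseline runs below measure.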
Job took 10m7s to finish
- Already a pretty good distribution; low idle time indicates good worker usage and minimal driver-node interaction in the job
- But overall utilization is low
- Which is reflected in the common occurrence of the IDLE state (unused cores)
We use 16 cores as the baseline. The job takes 10 minutes to finish. The annotations in the picture describe what we are running here and how to read the Sparklint graph. After reading the chart, we decided to decrease the core count to see whether the execution time doubles, to figure out whether we are bound by CPU.
Job took 15m14s to finish
- Core usage increased; the job is more efficient
- Execution time increased, but the app is not CPU bound
Using 8 cores, the job took 15 minutes to finish, shorter than the expected 20 minutes, proving that we are not bound by CPU. In fact, this sawtooth pattern alone indicates we are not CPU bound and can serve as a classic example; an example of a CPU-bound application can be found in the last demo slide. This leads to another angle of optimization: job scheduling tweaking.
Job took 9m24s to finish
- Core utilization decreased proportionally, trading execution time for efficiency
- Lots of IDLE state shows we are over-allocating resources
Using 32 cores, the job took 9 minutes to finish, proving again that throwing more cores at it doesn't provide commensurate performance gains. The graph is a classic example of over-allocating resources. We can assume we need no more than 24 cores to do the work effectively, so now we can look into other ways of tuning the job: dynamic allocation and increased parallelism.
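The core-count experiments above come down to a single setting varied at submit time. A sketch, assuming a standalone cluster; the class name, master URL, and jar path are placeholders:

```shell
# Baseline run with 16 total cores across the cluster
spark-submit \
  --class ReduceByKey \
  --master spark://master:7077 \
  --total-executor-cores 16 \
  app.jar

# Repeat with --total-executor-cores 8 and 32, then compare the runs'
# event logs side by side in Sparklint.
```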
Job took 11m34s to finish
- Core utilization remains low; the config settings are not right for this workload
- Dynamic allocation is only effective at app start due to the long executorIdleTimeout setting
We try to optimize the resource requirement by using dynamic allocation, initially just using the default executorIdleTimeout of 1 minute. This also led us to try 1 core per executor. Since we rarely have any task longer than 1 minute, this proved that dynamic allocation is not the key to optimizing this kind of app with shorter tasks.
Job took 33m5s to finish
- Core utilization is up, but execution time is up dramatically due to reclaiming resources before each short-running task
- IDLE state is reduced to a minimum; it looks efficient, but execution is much slower due to dynamic allocation overhead
We reduced executorIdleTimeout to 10s. This decreased the resource footprint and increased utilization. However, this is a false saving for this job, because throughput is reduced due to the low core supply and the overhead of acquiring executors. This example proves again that dynamic allocation doesn't solve the optimization challenge when tasks are short.
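The dynamic allocation settings behind these two runs are a small config fragment; the values match the slides, everything else is a placeholder:

```shell
# Dynamic allocation requires the external shuffle service on the workers.
spark-submit \
  --conf spark.shuffle.service.enabled=true \
  --conf spark.dynamicAllocation.enabled=true \
  --conf spark.dynamicAllocation.executorIdleTimeout=60s \
  --conf spark.executor.cores=1 \
  app.jar

# Second run: executorIdleTimeout=10s — smaller footprint, higher utilization,
# but much slower overall because executors are reclaimed between short tasks.
```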
So, let's try parallelism inside the job using FAIR scheduling.
Job took 7m34s to finish
- Core utilization way up, with lower execution time
- Parallel execution is clearly visible in overlapping stages
- Flat tops show we are becoming CPU bound
Using 16 cores and the FAIR scheduler, this simple tweak cut the execution time from 10 minutes to 7.5 minutes, and our job is now CPU bound (see annotation). Running the three count stages in parallel under FAIR scheduling increases efficiency and reduces runtime, allowing us to become CPU bound.
Job took 5m6s to finish
- Core utilization decreases, trading efficiency for execution time again here
Using 32 cores and the FAIR scheduler, the execution time becomes 5 minutes (compared to 9 minutes in pic 3 using the same resources). We reduce efficiency in order to gain execution time; this is a decision for the team to make: if there is a hard SLA to hit, it may be worth running with lower utilization. We can now call the job scheduling optimization done.
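The FAIR-scheduling tweak amounts to enabling the scheduler mode and submitting the three independent counts from separate threads, e.g. via Futures. A sketch, assuming a `records` RDD as in the demo; names and paths are illustrative:

```scala
import scala.concurrent.{Await, Future}
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration.Duration
import org.apache.spark.sql.SparkSession

object FairDemo {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder
      .appName("FairDemo")
      .config("spark.scheduler.mode", "FAIR") // enable FAIR job scheduling
      .getOrCreate()
    val sc = spark.sparkContext
    val records = sc.textFile("hdfs:///logs/access.json") // parsing elided

    // Each Future submits an independent Spark job from its own thread;
    // under FAIR scheduling their stages interleave instead of queueing FIFO.
    val jobs = Seq(
      Future(records.map(line => (line, 1L)).reduceByKey(_ + _).count()), // stand-in for countByIp
      Future(records.map(line => (line, 1L)).reduceByKey(_ + _).count()), // ...countByStatus
      Future(records.map(line => (line, 1L)).reduceByKey(_ + _).count())  // ...countByVerb
    )
    Await.result(Future.sequence(jobs), Duration.Inf)
    spark.stop()
  }
}
```

The overlapping stages visible in the Sparklint graph are exactly these three jobs running concurrently.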
Thanks to dynamic allocation the utilization is high despite this being a bi-modal application
- Data loading and mapping require a large core count to get throughput
- Aggregation and IO of results are optimized for end file size, and therefore require fewer cores
This demos the correct scenario for using dynamic allocation, and shows that throwing more CPU at the job helps when it is CPU bound (the flat tops in the usage graph are the clear proof). In this case the partition count is chosen to optimize file size on HDFS, so the team is comfortable with the runtime.
Future Features:
- History Server event sources
- Inline recommendations
- Auto-tuning
- Streaming stage parameter delegation
- Replay-capable listener
Sparklint can easily distinguish CPU-bound from job-scheduling-bound applications. (We are working on automating this judgment using the average number of pending tasks.) It is really easy to spot when a job is bound not by CPU but by job scheduling (which leads to low core usage) or by driver-node operations (which lead to idle time). In theory your app will be 2x faster if you throw 2x cores at it, but this is not always true. The point of Spark-level optimization is to make your job CPU bound, at which point you can decide freely between the money gained from a faster application and the money spent providing more cores:
- If your job is CPU bound, simply add cores.
- If your job has a lot of idle time, try decreasing it by removing unwanted/unintended driver-node operations (this could be something as simple as doing a map on a large array instead of an RDD and forgetting about it).
- If your job is job-scheduling bound, you can both reduce waste by using dynamic allocation (which in turn provides high throughput when needed) and submit independent jobs in parallel using Futures and the FAIR scheduler: http://spark.apache.org/docs/latest/configuration.html#scheduling
The Credit:
Lead developer is Robert Xue, SDE @ Groupon: https://github.com/roboxue
Contribute!
Sparklint is OSS: