Profiling & Testing with Spark
Apache Spark 2.0 Improvements, Flame Graphs & Testing
Outline
Overview
Spark 2.0 Improvements
Profiling with Flame Graphs
How-to Flame Graphs
Testing in Spark
Overview
Apache Spark™ is a fast and general engine for large-scale data processing
Speed: in-memory computing, up to 100x faster than MapReduce
Ease of Use: bindings for Java, Scala, Python and R
Generality: supports SQL, streaming and complex analytics (ML)
Portability: runs on YARN, Mesos, standalone or in the cloud
Overview (Big Picture)
Overview (architecture)
Overview (code sample)
Monte Carlo π calculation
This code estimates π by “throwing darts” at a circle: pick random points in the unit square (from (0, 0) to (1, 1)) and count how many fall inside the unit circle. That fraction approaches π/4, which yields the estimate.
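The Spark version from the slide is not reproduced here; as a minimal sketch, the same estimation logic in plain Python (sequential, where Spark would parallelize the sampling across the cluster) looks like this:

```python
import random

def estimate_pi(num_samples: int, seed: int = 42) -> float:
    """Estimate pi by sampling random points in the unit square
    and counting the fraction that lands inside the unit circle."""
    rng = random.Random(seed)
    inside = 0
    for _ in range(num_samples):
        x, y = rng.random(), rng.random()
        if x * x + y * y < 1.0:  # point falls inside the unit circle
            inside += 1
    return 4.0 * inside / num_samples  # inside/total approaches pi/4

print(estimate_pi(100_000))
```

In Spark the sampling loop becomes a `parallelize(...).filter(...).count()` over the executors; the arithmetic is identical.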
Main Takeaway
Spark SQL:
Provides parallelism, affordable at scale
Scale out SQL on storage for Big Data volumes
Scale out on CPU for memory-intensive queries
Offloading reports from RDBMS becomes attractive
Spark 2.0 improvements:
Considerable speedup of CPU-intensive queries
Spark 2.0 Improvements
SQL Queries
sqlContext.sql("""
  SELECT a.bucket, sum(a.val2) tot
  FROM t1 a, t1 b
  WHERE a.bucket = b.bucket AND
        a.val1 + b.val1 < 1000
  GROUP BY a.bucket
  ORDER BY a.bucket""").show()
Complex and resource-intensive SELECT statement:
EXPLAIN directive (execution plan)
Execution Plan
The execution plan:
First instrumentation point for SQL tuning
Shows how Spark wants to execute the query (break-down)
Main players:
Catalyst: the query optimizer
Catalyst (query optimizer)
Logical Plan:
Describes the computation on the datasets without defining how to carry it out
Physical Plan:
Defines how the computation is carried out on each dataset
Project Tungsten (Goal)
“Improves the memory and CPU efficiency of Spark backend execution by pushing performance close to the limits of modern hardware.”
Project Tungsten
Perform manual memory management instead of relying on Java objects:
Reduce memory footprint
Eliminate garbage collection overheads
Use java.unsafe and off-heap memory
Code generation for expression evaluation:
Reduce virtual function calls and interpretation overhead (JVM)
Project Tungsten (Code-Gen)
Project Tungsten (Code-Gen)
The Volcano Iterator Model:
The standard for 30 years: almost all databases use it
Each operator is an “iterator” that consumes records from its input operator
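As an illustration (not Spark's actual code), the Volcano model can be sketched in Python with each operator pulling rows one at a time from its child:

```python
# Each operator is an iterator pulling one record at a time from its input.
def scan(table):
    for row in table:          # leaf operator: produces rows
        yield row

def filter_op(child, predicate):
    for row in child:          # one "next" call into the child per row
        if predicate(row):
            yield row

def aggregate_sum(child):
    total = 0
    for row in child:          # another call per row in the aggregate phase
        total += row
    return total

# SELECT sum(v) FROM table WHERE v > 4, expressed as an operator chain
table = [1, 5, 8, 12, 3]
plan = aggregate_sum(filter_op(scan(table), lambda v: v > 4))
print(plan)  # 5 + 8 + 12 = 25
```

Every row crosses two generator boundaries here; in a JVM engine each of those crossings is a virtual function call.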
Project Tungsten (Code-Gen)
Downsides of the Volcano Iterator Model:
Too many virtual function calls
at least 3 calls per row in the Aggregate phase
Can’t take advantage of modern CPU features
pipelining, prefetching, branch prediction, SIMD, ILP, instruction cache ...
Project Tungsten (Code-Gen)
What if we hire a college freshman to implement this query in Java in 10 mins?
Whole-stage Code-Gen: Spark as a “Compiler”
Project Tungsten (Code-Gen)
A student beating 30 years of science ...
Project Tungsten (Code-Gen)
Volcano
● Many virtual function calls
● Data in memory (or cache)
● No loop unrolling, SIMD
Hand-written code
● No virtual function calls
● Data in CPU registers
● Exploit compiler optimizations
○ loop unrolling, SIMD, pipelining
Take advantage of all the information that is known after query compilation
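As an illustration of what whole-stage code generation emits, the same filter-plus-sum query collapses into one hand-written-style loop with no operator boundaries:

```python
def fused_query(table, threshold):
    """What a code generator would emit for
    SELECT sum(v) FROM table WHERE v > threshold:
    one tight loop, no virtual calls between operators,
    the running total staying in a local variable (a CPU register)."""
    total = 0
    for v in table:            # single loop over the data
        if v > threshold:      # filter inlined
            total += v         # aggregate inlined
    return total

print(fused_query([1, 5, 8, 12, 3], 4))  # 25
```

Same result as the iterator chain, but the compiler can now unroll, pipeline and vectorize the loop because the whole query is visible in one function.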
Execution plan comparison (legacy vs whole stage code-gen)
WholeStageCodeGen
Profiling with Flame Graphs
Root Cause Analysis
Benchmarking:
Run the workload and measure it with the relevant diagnostic tools
Goals: understand the bottleneck(s) and find root causes
Limitations:
Our tools & time available for analysis are limiting factors
Profiling CPU-Bound workloads
Flame graph visualization of stack profiles:
● Brainchild of Brendan Gregg (Dec 2011)
● Code: https://github.com/brendangregg/FlameGraph
● Now very popular; available for many languages, including the JVM
Shows which parts of the code are hot
● Very useful to understand where CPU cycles are spent
Flame Graph Visualization
Recipe:
● Gather multiple stack traces
● Aggregate them by sorting alphabetically by function/method name
● Visualization using stacked colored boxes
● The width of each box is proportional to the time spent there
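The aggregation step of the recipe can be sketched in Python: collapse raw stack traces into “folded” stacks with sample counts, the text format that flame graph tools consume (function names here are illustrative):

```python
from collections import Counter

def fold_stacks(samples):
    """Collapse stack traces (root-first lists of frames) into
    'folded' lines: frames joined by ';' plus a sample count."""
    counts = Counter(";".join(stack) for stack in samples)
    # Sorting alphabetically groups common prefixes, which become
    # the wide boxes at the bottom of the flame graph.
    return [f"{stack} {n}" for stack, n in sorted(counts.items())]

samples = [
    ["main", "parse", "tokenize"],
    ["main", "parse", "tokenize"],
    ["main", "eval"],
]
for line in fold_stacks(samples):
    print(line)
# main;eval 1
# main;parse;tokenize 2
```

This folded format is exactly what the FlameGraph SVG generator takes as input.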
Flame Graph (Spark 1.6)
Flame Graph (Spark 2.0)
Spark CodeGen vs. Volcano
Code generation improves CPU-intensive workloads
● Replaces loops and virtual function calls with code generated for the query
● The use of vector operations (e.g. SIMD) also beneficial
● Codegen is crucial for modern in-memory DBs
Commercial RDBMS engines
● Typically use the slower Volcano model (with loops and virtual function calls)
● In the past optimizing for I/O latency was more important; now CPU cycles matter more
Flame Graphs
Pros: good to understand where CPU cycles are spent
● Useful for performance troubleshooting
● Functions at the top of the graph are the ones using CPU
● Parent methods/functions provide context
Limitations:
● Off-CPU and wait time are not charted (off-CPU flame graphs are still experimental)
● Interpreting flame graphs requires experience/knowledge
● Not included in Spark’s monitoring suite
How-to Flame Graphs
CERN Java Flight Recorder Approach (1/2)
Enable Java Flight Recorder (JFR)
● Extra options in spark-defaults.conf or CLI. Example:
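For example (flags shown are for Oracle JDK 8, where JFR requires unlocking commercial features; newer JDKs drop that flag):

```shell
spark-submit \
  --conf "spark.driver.extraJavaOptions=-XX:+UnlockCommercialFeatures -XX:+FlightRecorder" \
  --conf "spark.executor.extraJavaOptions=-XX:+UnlockCommercialFeatures -XX:+FlightRecorder" \
  ...
```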
Collect data with jcmd:
● Example, sampling for 10 sec:
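For instance, starting a 10-second recording on a running executor JVM (pid and output path are placeholders):

```shell
jcmd <pid> JFR.start duration=10s filename=/tmp/recording.jfr
```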
CERN Java Flight Recorder Approach (2/2)
Process the jfr file:
● From .jfr to merged stacks
● Produce the .svg file with the flame graph
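A sketch of the final step using Brendan Gregg’s FlameGraph scripts (the .jfr-to-merged-stacks conversion uses the helper described in the gist below; file names are placeholders):

```shell
# merged_stacks.txt: one line per unique stack ("frame1;frame2;... count")
./flamegraph.pl merged_stacks.txt > flamegraph.svg
```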
● Find details in Kay Ousterhout’s article: https://gist.github.com/kayousterhout/7008a8ebf2babeedc7ce6f8723fd1bf4
PayPal Approach
https://www.paypal-engineering.com/2016/09/08/spark-in-flames-profiling-spark-applications-using-flame-graphs/
CERN HProfiler Approach
HProfiler (CERN home-built tool)
● Automates collection and aggregation of stack traces into flame graphs for
distributed applications
● Integrates with YARN to identify the processes to trace across the cluster
Based on Linux perf_events stack sampling (bare metal)
Experimental tool
● Author Joeri Hermans @ CERN
● https://github.com/cerndb/Hadoop-Profiler
● Blog post: “Hadoop performance troubleshooting with stack tracing”
Testing in Spark
Testing in Spark
● Why run Spark outside of a cluster
● What to test
● Running Local
● Running as a Unit Test
● Data Structures
Testing in Spark
Why run Spark outside of a cluster
● Time
● Trusted Deployment
● Money
Testing in Spark
What to test
● Experiments
● Complex logic
● Data samples
● Business generated scenarios
Testing in Spark (Running Local)
Running Local
● A test doesn’t always need to be a unit test
● Notebook UIs like Zeppelin are OK for quick feedback, but they lack IDE features
● Running local in your IDE is priceless
Testing in Spark (Running Local)
Example
● Use runLocal flag to set a local SparkContext
● Separate out testable work from driver code
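The second point can be sketched in plain Python terms (names are illustrative, not from the original deck): keep the business logic in a pure function over an iterable of records, so a unit test feeds it a list while the driver feeds it partitions of an RDD (e.g. via `mapPartitions`):

```python
def high_value_totals(records, threshold):
    """Testable business logic: sum val2 per bucket for rows whose
    val1 exceeds the threshold.  Pure function over any iterable of
    (bucket, val1, val2) tuples -- no SparkContext needed in tests."""
    totals = {}
    for bucket, val1, val2 in records:
        if val1 > threshold:
            totals[bucket] = totals.get(bucket, 0) + val2
    return totals

# In a test: call it with a plain list.
rows = [("a", 10, 1), ("a", 99, 2), ("b", 50, 3)]
print(high_value_totals(rows, 20))  # {'a': 2, 'b': 3}
```

The driver code then only wires the function to the RDD, leaving nothing Spark-specific inside the logic under test.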
Testing in Spark (Unit Testing)
Example
FunSuite: a TDD-style unit testing suite for Scala (part of ScalaTest)
Testing in Spark (Data Structures)
Working with “hand-written” DataFrames:
Testing in Spark (Hive)
Testing with Hive:
● Spin-up a docker-hive container for Apache Hive (Big Data Europe)
● Enables real interactions with Hive:
○ create, delete, write, ...
Testing in Spark (Hive)
Putting Hive + Spark together:
● Create a custom hive-site.xml
● Start Spark with the provided hive-site.xml
○ spark-shell --files /PATH/hive-site.xml
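The custom hive-site.xml can be minimal; a sketch (the metastore URI is an assumption matching the port commonly exposed by the docker-hive image):

```xml
<configuration>
  <property>
    <!-- point Spark's Hive support at the dockerized metastore (assumed port) -->
    <name>hive.metastore.uris</name>
    <value>thrift://localhost:9083</value>
  </property>
</configuration>
```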
Testing in Spark (Hive)
Start Spark with the provided hive-site.xml:
Testing in Spark (Mini-Clusters)
Mini-Clusters
● Hadoop-mini-cluster
● Spark-unit-testing-with-hdfs
● Support for:
○ HBase & Hive
○ Kafka & Storm
○ Zookeeper
○ HDFS
○ ...
● Access HDFS files and test code that copies files from the local FS to HDFS
Conclusions
Conclusions
Apache Spark 2.0 Improvements (HDP 2.5 in tech preview)
● Scalability and performance on commodity HW
● Spark SQL useful for offloading queries from traditional RDBMS
● Code generation speeds up CPU-bound workloads by up to one order of magnitude
Diagnostics
● Profiling tools are important in the MPP world
● Execution plans analyzed with flame graphs
● Con: the solutions are still very immature
Testing
● Testing locally saves time and money, and takes advantage of IDE features
● Elegant ways to test code by using a local SparkContext
● Easy ways to recreate environments for testing real interactions (such as Hadoop)
Profiling & Testing with Spark
THANK YOU!
References
● Deep-dive-into-catalyst-apache-spark-2.0
● http://es.slideshare.net/databricks/spark-performance-whats-next
● https://paperhub.s3.amazonaws.com/dace52a42c07f7f8348b08dc2b186061.pdf
● http://www.brendangregg.com/flamegraphs.html
● http://db-blog.web.cern.ch/
● http://www.slideshare.net/SparkSummit/spark-summit-eu-talk-by-ted-malaska
Q & A