Profiling & Testing with Spark

Transcript
Page 1: Profiling & Testing with Spark

Profiling & Testing with Spark

Apache Spark 2.0 Improvements, Flame Graphs & Testing

Page 2: Profiling & Testing with Spark

Outline

Overview

Spark 2.0 Improvements

Profiling with Flame Graphs

How-to Flame Graphs

Testing in Spark

Page 3: Profiling & Testing with Spark

Overview

Apache Spark™ is a fast and general engine for large-scale data processing

Speed: runs computations in memory, up to 100x faster than MapReduce

Ease of Use: bindings for Java, Scala, Python and R

Generality: supports SQL, streaming and complex analytics (ML)

Portability: runs on YARN, Mesos, standalone or in the cloud

Page 4: Profiling & Testing with Spark

Overview (Big Picture)

Page 5: Profiling & Testing with Spark

Overview (architecture)

Page 6: Profiling & Testing with Spark

Overview (code sample)

Monte Carlo π calculation

“This code estimates π by ‘throwing darts’ at a circle. We pick random points in the unit square ((0, 0) to (1, 1)) and count how many fall inside the unit circle. That fraction should be π / 4, so we use it to get our estimate.”
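The code itself appears only as an image in the original deck. The idea it describes can be sketched in plain Scala as below; the distributed Spark version differs only in spreading the samples over the cluster with sc.parallelize (see the comment):

```scala
import scala.util.Random

// Estimate π by sampling points in the unit square and counting
// how many fall inside the quarter circle of radius 1.
def estimatePi(numSamples: Int): Double = {
  val inside = (1 to numSamples).count { _ =>
    val x = Random.nextDouble()
    val y = Random.nextDouble()
    x * x + y * y < 1.0 // true when the "dart" lands inside the circle
  }
  // The fraction of hits approximates π / 4
  4.0 * inside / numSamples
}

println(estimatePi(1000000))

// In Spark, the sampling loop becomes a distributed job, roughly:
//   sc.parallelize(1 to numSamples).filter { _ => ... }.count()
```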

Page 7: Profiling & Testing with Spark

Main Takeaway

Spark SQL:

Provides parallelism that is affordable at scale

Scales out SQL on storage for Big Data volumes

Scales out on CPU for memory-intensive queries

Offloading reports from an RDBMS becomes attractive

Spark 2.0 improvements:

Considerable speedup of CPU-intensive queries

Page 8: Profiling & Testing with Spark

Spark 2.0 Improvements

Page 9: Profiling & Testing with Spark

SQL Queries

sqlContext.sql("""
  SELECT a.bucket, sum(a.val2) tot
  FROM t1 a, t1 b
  WHERE a.bucket = b.bucket
    AND a.val1 + b.val1 < 1000
  GROUP BY a.bucket
  ORDER BY a.bucket""").show()

A complex and resource-intensive SELECT statement: inspect its execution plan with the EXPLAIN directive.

Page 10: Profiling & Testing with Spark

Execution Plan

The execution plan:

First instrumentation point for SQL tuning

Shows how Spark wants to execute the query (break-down)

Main players:

Catalyst: the query optimizer

Page 11: Profiling & Testing with Spark

Catalyst (query optimizer)

Logical Plan:

Describes computation on data sets without defining how to conduct it

Physical Plan:

Defines how to conduct the computation on each dataset

Page 12: Profiling & Testing with Spark

Project Tungsten (Goal)

“Improves the memory and CPU efficiency of Spark backend execution by pushing performance close to the limits of modern hardware.”

Page 13: Profiling & Testing with Spark

Project Tungsten

Perform manual memory management instead of relying on Java objects:

Reduce memory footprint

Eliminate garbage collection overheads

Use sun.misc.Unsafe and off-heap memory

Code generation for expression evaluation:

Reduce virtual function calls and interpretation overhead (JVM)
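As an illustration of the off-heap idea (a sketch, not Spark's actual internals), the snippet below allocates memory outside the GC-managed heap through sun.misc.Unsafe, obtained reflectively because the class is not public API:

```scala
// Illustrative sketch only: Tungsten manages memory manually instead of
// relying on Java objects; sun.misc.Unsafe gives raw off-heap allocation.
val theUnsafe = classOf[sun.misc.Unsafe].getDeclaredField("theUnsafe")
theUnsafe.setAccessible(true)
val unsafe = theUnsafe.get(null).asInstanceOf[sun.misc.Unsafe]

val addr = unsafe.allocateMemory(8) // 8 bytes off-heap, invisible to the GC
unsafe.putLong(addr, 42L)           // write a long at the raw address
val value = unsafe.getLong(addr)    // read it back
unsafe.freeMemory(addr)             // manual free: no GC will reclaim it
println(value)
```

Memory allocated this way carries no object headers and no GC cost, which is exactly the footprint/overhead argument made in the bullets above.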

Page 14: Profiling & Testing with Spark

Project Tungsten (Code-Gen)

Page 15: Profiling & Testing with Spark

Project Tungsten (Code-Gen)

The Volcano Iterator Model:

Standard for 30 years: almost all databases do it

Each operator is an “iterator” that consumes records from its input operator
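The model can be sketched in plain Scala, with each operator exposing an iterator over its input (the operator names are illustrative, not Spark classes):

```scala
// Volcano-style plan: scan -> filter -> aggregate, each stage an iterator.
// Every hasNext/next is a virtual call, paid at least once per row.
val scan: Iterator[Int] = (1 to 10).iterator        // "table scan" operator
val filter: Iterator[Int] = scan.filter(_ % 2 == 0) // "filter" operator
val sum = filter.foldLeft(0)(_ + _)                 // "aggregate" operator
println(sum) // 2 + 4 + 6 + 8 + 10 = 30
```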

Page 16: Profiling & Testing with Spark

Project Tungsten (Code-Gen)

Downsides of the Volcano Iterator Model:

Too many virtual function calls

at least 3 calls for each row in the Aggregate phase

Can’t take advantage of modern CPU features

pipelining, prefetching, branch prediction, SIMD, ILP, exploiting the instruction cache ...

Page 17: Profiling & Testing with Spark

Project Tungsten (Code-Gen)

What if we hire a college freshman to implement this query in Java in 10 mins?

Page 18: Profiling & Testing with Spark

Whole-stage Code-Gen: Spark as a “Compiler”

Page 19: Profiling & Testing with Spark

Project Tungsten (Code-Gen)

A student beating 30 years of science ...

Page 20: Profiling & Testing with Spark

Project Tungsten (Code-Gen)

Volcano

● Many virtual function calls

● Data in memory (or cache)

● No loop unrolling, SIMD

Hand-written code

● No virtual function calls

● Data in CPU registers

● Exploits compiler optimizations

○ loop unrolling, SIMD, pipelining

Take advantage of all the information that is known after query compilation
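The hand-written counterpart of the same filter-and-sum is one tight loop with no operator boundaries, which is essentially the shape whole-stage code generation emits:

```scala
// Fused, "generated-style" version of scan -> filter -> aggregate:
// a single loop over the data, no iterators, no virtual calls,
// with the running sum kept in a local (register-friendly) variable.
var sum = 0
var i = 1
while (i <= 10) {
  if (i % 2 == 0) sum += i // filter predicate inlined into the loop
  i += 1
}
println(sum) // same result as the iterator chain: 30
```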

Page 21: Profiling & Testing with Spark

Execution plan comparison (legacy vs whole stage code-gen)

WholeStageCodeGen

Page 22: Profiling & Testing with Spark

Profiling with Flame Graphs

Page 23: Profiling & Testing with Spark

Root Cause Analysis

Benchmarking:

Run the workload and measure it with the relevant diagnostic tools

Goals: understand the bottleneck(s) and find root causes

Limitations:

Our tools & time available for analysis are limiting factors

Page 24: Profiling & Testing with Spark

Profiling CPU-Bound workloads

Flame graph visualization of stack profiles:

● Brainchild of Brendan Gregg (Dec 2011)

● Code: https://github.com/brendangregg/FlameGraph

● Now very popular, available for many languages, also for JVM

Shows which parts of the code are hot

● Very useful to understand where CPU cycles are spent

Page 25: Profiling & Testing with Spark

Flame Graph Visualization

Recipe:

● Gather multiple stack traces

● Aggregate them by sorting alphabetically by function/method name

● Visualization using stacked colored boxes

● Length of the box proportional to time spent there
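The aggregation step of the recipe can be sketched in a few lines: identical stacks are merged and counted, producing the “folded” one-line-per-stack format that the FlameGraph scripts consume (sample data here is made up):

```scala
// Sketch of the folding step: each sampled stack is a root-to-leaf frame list;
// identical stacks are merged into "frame1;frame2;... count" lines.
val samples = Seq(
  List("main", "runQuery", "hashJoin"),
  List("main", "runQuery", "hashJoin"),
  List("main", "runQuery", "sort"),
  List("main", "gc")
)

val folded = samples
  .groupBy(_.mkString(";"))                         // aggregate identical stacks
  .map { case (stack, hits) => s"$stack ${hits.size}" }
  .toList
  .sorted                                           // alphabetical, as in the recipe

folded.foreach(println)
```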

Page 26: Profiling & Testing with Spark

Flame Graph (Spark 1.6)

Page 27: Profiling & Testing with Spark

Flame Graph (Spark 2.0)

Page 28: Profiling & Testing with Spark

Spark CodeGen vs. Volcano

Code generation improves CPU-intensive workloads

● Replaces loops and virtual function calls with code generated for the query

● The use of vector operations (e.g. SIMD) also beneficial

● Codegen is crucial for modern in-memory DBs

Commercial RDBMS engines

● Typically use the slower Volcano model (with loops and virtual function calls)

● In the past, optimizing for I/O latency was more important; now CPU cycles matter more

Page 29: Profiling & Testing with Spark

Flame Graphs

Pros: good to understand where CPU cycles are spent

● Useful for performance troubleshooting

● Functions at the top of the graph are the ones using CPU

● Parent methods/functions provide context

Limitations:

● Off-CPU and wait time not charted (experimental)

● Interpretation of flame graphs requires experience/knowledge

● Not included in Spark monitoring suite

Page 30: Profiling & Testing with Spark

How-to Flame Graphs

Page 31: Profiling & Testing with Spark

CERN Java Flight Recorder Approach (1/2)

Enable Java Flight Recorder (JFR)

● Extra options in spark-defaults.conf or CLI. Example:

Collect data with jcmd:

● Example, sampling for 10 sec:
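The deck's examples are images; they amount to enabling JFR on the JVMs and then dumping a recording with jcmd. A sketch, with the path and process id as placeholders (the unlock flags shown are the Java 8 ones; on Java 11+ JFR needs no unlocking):

```shell
# spark-defaults.conf: enable Java Flight Recorder on the executors
# (the same flags can go on the driver via spark.driver.extraJavaOptions)
spark.executor.extraJavaOptions -XX:+UnlockCommercialFeatures -XX:+FlightRecorder

# Sample a running executor JVM for 10 seconds with jcmd
jcmd <pid> JFR.start duration=10s filename=/tmp/recording.jfr
```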

Page 32: Profiling & Testing with Spark

CERN Java Flight Recorder Approach (2/2)

Process the jfr file:

● From .jfr to merged stacks

● Produce the .svg file with the flame graph

● Find details in Kay Ousterhout’s article:

https://gist.github.com/kayousterhout/7008a8ebf2babeedc7ce6f8723fd1bf4

Page 33: Profiling & Testing with Spark

PayPal Approach

https://www.paypal-engineering.com/2016/09/08/spark-in-flames-profiling-spark-applications-using-flame-graphs/

Page 34: Profiling & Testing with Spark

CERN HProfiler Approach

HProfiler (CERN home-built tool)

● Automates collection and aggregation of stack traces into flame graphs for distributed applications

● Integrates with YARN to identify the processes to trace across the cluster

Based on Linux perf_events stack sampling (bare metal)

Experimental tool

● Author Joeri Hermans @ CERN

● https://github.com/cerndb/Hadoop-Profiler

● Hadoop-performance-troubleshooting-stack-tracing
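Under the hood this is Brendan Gregg's standard perf_events flame-graph pipeline; a sketch of the manual steps, with the process id as a placeholder and the collapse/render scripts taken from the FlameGraph repository linked earlier:

```shell
# Sample on-CPU stacks of one process at 99 Hz for 10 seconds
perf record -F 99 -g -p <pid> -- sleep 10

# Fold the stacks and render the flame graph
perf script | ./stackcollapse-perf.pl | ./flamegraph.pl > flamegraph.svg
```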

Page 35: Profiling & Testing with Spark

Testing in Spark

Page 36: Profiling & Testing with Spark

Testing in Spark

● Why run Spark outside of a cluster

● What to test

● Running Local

● Running as a Unit Test

● Data Structures

Page 37: Profiling & Testing with Spark

Testing in Spark

Why run Spark outside of a cluster

● Time

● Trusted Deployment

● Money

Page 38: Profiling & Testing with Spark

Testing in Spark

What to test

● Experiments

● Complex logic

● Data samples

● Business generated scenarios

Page 39: Profiling & Testing with Spark

Testing in Spark (Running Local)

Running Local

● A test doesn’t always need to be a unit test

● UIs like Zeppelin are OK for quick feedback, but lack IDE features

● Running local in your IDE is priceless

Page 40: Profiling & Testing with Spark

Testing in Spark (Running Local)

Example

● Use runLocal flag to set a local SparkContext

● Separate out testable work from driver code
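A minimal sketch of the pattern (names are illustrative): the logic lives in a plain function that can be exercised without a cluster, while the driver only wires up the SparkContext behind the runLocal flag:

```scala
// Illustrative sketch: keep the logic in a plain, testable function,
// out of the driver code that builds the SparkContext.
object WordStats {
  // Pure logic: callable from a local test, no SparkContext needed
  def countLongWords(words: Seq[String], minLen: Int): Int =
    words.count(_.length >= minLen)
}

// Local run, no cluster involved:
val n = WordStats.countLongWords(Seq("spark", "is", "fast"), 3)
println(n) // "spark" and "fast" qualify

// With the deck's runLocal flag, the driver would build a local context,
// e.g. new SparkContext(new SparkConf().setMaster("local[*]").setAppName("app")),
// instead of pointing at a cluster master.
```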

Page 41: Profiling & Testing with Spark

Testing in Spark (Unit Testing)

Example

FunSuite: a TDD-style unit testing suite for Scala
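The slide's example is an image; a sketch of what a FunSuite test around a local SparkContext typically looks like, assuming the ScalaTest and Spark dependencies are on the classpath (class, app and test names here are illustrative):

```scala
import org.scalatest.FunSuite
import org.apache.spark.{SparkConf, SparkContext}

class WordStatsSuite extends FunSuite {
  test("words of at least the minimum length are counted") {
    // Local master: the test runs in-process, no cluster needed
    val sc = new SparkContext(
      new SparkConf().setMaster("local[2]").setAppName("test"))
    try {
      val rdd = sc.parallelize(Seq("spark", "is", "fast"))
      assert(rdd.filter(_.length >= 3).count() === 2)
    } finally {
      sc.stop() // always release the context, even on failure
    }
  }
}
```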

Page 42: Profiling & Testing with Spark

Testing in Spark (Data Structures)

Working with “hand-written” DataFrames:
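The DataFrame shown in the deck is an image; hand-written test data is typically built as below, assuming a SparkSession (or, in Spark 1.x, an sqlContext) named spark is in scope, as it is in spark-shell. Column names here are illustrative:

```scala
import spark.implicits._ // assumes `spark` is in scope, as in spark-shell

// Hand-written rows as a Seq of tuples, turned into a named-column DataFrame
val df = Seq(
  ("bucket1", 10L),
  ("bucket2", 25L)
).toDF("bucket", "val2")

df.show()
// Expected results can be built the same way and compared via collect()
```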

Page 43: Profiling & Testing with Spark

Testing in Spark (Hive)

Testing with Hive:

● Spin-up a docker-hive container for Apache Hive (Big Data Europe)

● Enables real interactions, allowing you to:

○ create, delete, write, ...

Page 44: Profiling & Testing with Spark

Testing in Spark (Hive)

Putting Hive + Spark together:

● Create a custom hive-site.xml

● Start Spark with the provided hive-site.xml

○ spark-shell --files /PATH/hive-site.xml
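A minimal custom hive-site.xml only needs to point Spark at the test metastore; the host and port below are assumptions matching the dockerized metastore's usual Thrift endpoint:

```xml
<configuration>
  <property>
    <name>hive.metastore.uris</name>
    <!-- Thrift endpoint of the dockerized Hive metastore (assumed host/port) -->
    <value>thrift://localhost:9083</value>
  </property>
</configuration>
```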

Page 45: Profiling & Testing with Spark

Testing in Spark (Hive)

Start Spark with the provided hive-site.xml:

Page 46: Profiling & Testing with Spark

Testing in Spark (Mini-Clusters)

Mini-Clusters

● Hadoop-mini-cluster

● Spark-unit-testing-with-hdfs

● Support for:

○ HBase & Hive

○ Kafka & Storm

○ Zookeeper

○ HDFS

○ ...

Access HDFS files & test code; copy files from the local FS to HDFS

Page 47: Profiling & Testing with Spark

Conclusions

Page 48: Profiling & Testing with Spark

Conclusions

Apache Spark 2.0 Improvements (HDP 2.5 in tech preview)

● Scalability and performance on commodity HW

● Spark SQL useful for offloading queries from traditional RDBMS

● Code generation speeds up CPU-bound workloads by up to one order of magnitude

Diagnostics

● Profiling tools are important in the MPP world

● Execution plans analyzed with flame graphs

● Cons: Very immature solutions

Testing

● Testing locally saves time and money, and takes advantage of IDE features

● Elegant ways to test code by using a local SparkContext

● Easy ways to recreate environments for testing real interactions (such as Hadoop)

Page 49: Profiling & Testing with Spark

Profiling & Testing with Spark

THANK YOU!

Page 50: Profiling & Testing with Spark

References

● Deep-dive-into-catalyst-apache-spark-2.0

● http://es.slideshare.net/databricks/spark-performance-whats-next

● https://paperhub.s3.amazonaws.com/dace52a42c07f7f8348b08dc2b186061.pdf

● http://www.brendangregg.com/flamegraphs.html

● http://db-blog.web.cern.ch/

● http://www.slideshare.net/SparkSummit/spark-summit-eu-talk-by-ted-malaska

Page 51: Profiling & Testing with Spark

Q & A