Debugging Apache Spark - Scala & Python super happy fun times 2017

Transcript
Page 1: Debugging Apache Spark -   Scala & Python super happy fun times 2017

Debugging Apache Spark: “Professional Stack Trace Reading”

with your friends Holden & Joey

Page 2: Debugging Apache Spark -   Scala & Python super happy fun times 2017

Who is Holden?

● My name is Holden Karau
● Preferred pronouns are she/her
● I’m a Principal Software Engineer at IBM’s Spark Technology Center
● Apache Spark committer (as of last month!) :)
● Previously Alpine, Databricks, Google, Foursquare & Amazon
● Co-author of Learning Spark & Fast Data Processing with Spark
  ○ Co-author of a new book focused on Spark performance coming this year*
● @holdenkarau
● SlideShare: http://www.slideshare.net/hkarau
● LinkedIn: https://www.linkedin.com/in/holdenkarau
● GitHub: https://github.com/holdenk
● Spark videos: http://bit.ly/holdenSparkVideos

Page 3: Debugging Apache Spark -   Scala & Python super happy fun times 2017
Page 4: Debugging Apache Spark -   Scala & Python super happy fun times 2017

Spark Technology Center


IBM Spark Technology Center

Founded in 2015.

Location:
Physical: 505 Howard St., San Francisco, CA
Web: http://spark.tc
Twitter: @apachespark_tc

Mission:
Contribute intellectual and technical capital to the Apache Spark community.
Make the core technology enterprise- and cloud-ready.
Build data science skills to drive intelligence into business applications — http://bigdatauniversity.com

Key statistics:
About 50 developers, co-located with 25 IBM designers.
Major contributions to Apache Spark — http://jiras.spark.tc
Apache SystemML is now an Apache Incubator project.
Founding member of UC Berkeley AMPLab and RISE Lab.
Member of R Consortium and Scala Center.


Page 5: Debugging Apache Spark -   Scala & Python super happy fun times 2017

Who is Joey?

● Preferred pronouns: he/him
● Where I work: Rocana – Platform Technical Lead
● Where I used to work: Cloudera (’11–’15), NSA
● Distributed systems, security, data processing, big data
● @fwiffo

Page 6: Debugging Apache Spark -   Scala & Python super happy fun times 2017

What is Rocana?

● We built a system for large-scale, real-time collection, processing, and analysis of event-oriented machine data
● On prem or in the cloud, but not SaaS
● Supportability is a big deal for us
  ○ Predictability of performance under load and failures
  ○ Ease of configuration and operation
  ○ Behavior in wacky environments

Page 7: Debugging Apache Spark -   Scala & Python super happy fun times 2017

Who do we think y’all are?

● Friendly[ish] people
● Don’t mind pictures of cats or stuffed animals
● Know some Spark
● Want to debug your Spark applications
● Ok with things getting a little bit silly

Lori Erickson

Page 8: Debugging Apache Spark -   Scala & Python super happy fun times 2017

What will be covered?

● Getting at Spark’s logs & persisting them
● What your options for logging are
● Attempting to understand common Spark error messages
● Understanding the DAG (and how pipelining can impact your life)
● Subtle attempts to get you to use spark-testing-base or similar
● Fancy Java debugging tools & clusters - not entirely the path of sadness
● Holden’s even less subtle attempts to get you to buy her new book
● Pictures of cats & stuffed animals

Page 9: Debugging Apache Spark -   Scala & Python super happy fun times 2017

Aka: Building our Monster Identification Guide

Page 10: Debugging Apache Spark -   Scala & Python super happy fun times 2017

So where are the logs/errors? (e.g. before we can identify a monster we have to find it)

● Error messages reported to the console*
● Log messages reported to the console*
● Log messages on the workers - access them through the Spark Web UI or Spark History Server :)
● Where the error happens: driver versus worker

(*When running in client mode)

PROAndrey

Page 11: Debugging Apache Spark -   Scala & Python super happy fun times 2017

One weird trick to debug anything

● Don’t read the logs (yet)
● Draw (possibly in your head) a model of how you think a working app would behave
● Then predict where in that model things are broken
● Now read logs to prove or disprove your theory
● Repeat

Krzysztof Belczyński

Page 12: Debugging Apache Spark -   Scala & Python super happy fun times 2017

Working in YARN? (e.g. before we can identify a monster we have to find it)

● Use yarn logs to get the logs after log aggregation
● Or set up the Spark history server
● Or yarn.nodemanager.delete.debug-delay-sec :)

Lauren Mitchell

Page 13: Debugging Apache Spark -   Scala & Python super happy fun times 2017

Spark is pretty verbose by default

● Most of the time it tells you things you already know
● Or don’t need to know
● You can dynamically control the log level with sc.setLogLevel
● This is especially useful to increase logging near the point of error in your code (see the sketch below)
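A minimal sketch of what that can look like in PySpark (assuming an existing SparkContext named sc; suspect_rdd is a stand-in for whatever you’re debugging):

sc.setLogLevel("WARN")    # quiet down the known-good part of the job
# ... transformations you already trust ...
sc.setLogLevel("DEBUG")   # very chatty: turn it up right before the failing action
suspect_rdd.count()
sc.setLogLevel("WARN")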

Page 14: Debugging Apache Spark -   Scala & Python super happy fun times 2017

But what about when we get an error?

● Python Spark errors often come in two-ish parts
● JVM stack trace (Friend Monster - comes with most errors)
● Python stack trace (Boo - has information)
● Buddy - often used to report the information from Friend Monster and Boo

Page 15: Debugging Apache Spark -   Scala & Python super happy fun times 2017

So what is that JVM stack trace?

● Java/Scala
  ○ Normal stack trace
  ○ Sometimes can come from worker or driver; if from a worker it may be repeated several times for each partition & attempt with the error
  ○ Driver stack trace wraps worker stack trace
● R/Python
  ○ Same as above but...
  ○ Doesn’t want your actual error message to get lonely
  ○ Wraps any exception on the workers (& some exceptions on the drivers)
  ○ Not always super useful

Page 16: Debugging Apache Spark -   Scala & Python super happy fun times 2017

Let’s make a simple mistake & debug :)

● Error in transformation (divide by zero)

Image by: Tomomi

Page 17: Debugging Apache Spark -   Scala & Python super happy fun times 2017

Bad outer transformation (Scala):

// `data` isn't shown on this slide; assuming an input RDD like the Python version:
val data = sc.parallelize(1 to 10)

val transform1 = data.map(x => x + 1)
val transform2 = transform1.map(x => x/0) // Will throw an exception when forced to evaluate
transform2.count() // Forces evaluation

David Martyn Hunt

Page 18: Debugging Apache Spark -   Scala & Python super happy fun times 2017

Let’s look at the error messages for it:

17/01/23 12:41:36 ERROR Executor: Exception in task 0.0 in stage 0.0 (TID 0)

java.lang.ArithmeticException: / by zero

at com.highperformancespark.examples.errors.Throws$$anonfun$1.apply$mcII$sp(throws.scala:9)

at com.highperformancespark.examples.errors.Throws$$anonfun$1.apply(throws.scala:9)

at com.highperformancespark.examples.errors.Throws$$anonfun$1.apply(throws.scala:9)

at scala.collection.Iterator$$anon$11.next(Iterator.scala:370)

at scala.collection.Iterator$$anon$11.next(Iterator.scala:370)

at scala.collection.Iterator$class.foreach(Iterator.scala:750)

at scala.collection.AbstractIterator.foreach(Iterator.scala:1202)

at scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:59)

at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:104)

at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:48)

at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:295)

at scala.collection.AbstractIterator.to(Iterator.scala:1202)

at scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:287)

at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1202)

Continued for ~100 lines

at scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:274)

at scala.collection.AbstractIterator.toArray(Iterator.scala:1202)

at org.apache.spark.rdd.RDD$$anonfun$collect$1$$anonfun$13.apply(RDD.scala:935)

at org.apache.spark.rdd.RDD$$anonfun$collect$1$$anonfun$13.apply(RDD.scala:935)

at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1944)

at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1944)

at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)

at org.apache.spark.scheduler.Task.run(Task.scala:99)

at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:282)

at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)

at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)

at java.lang.Thread.run(Thread.java:745)

17/01/23 12:41:36 WARN TaskSetManager: Lost task 0.0 in stage 0.0 (TID 0, localhost, executor driver): java.lang.ArithmeticException: / by zero

at com.highperformancespark.examples.errors.Throws$$anonfun$1.apply$mcII$sp(throws.scala:9)

at com.highperformancespark.examples.errors.Throws$$anonfun$1.apply(throws.scala:9)

at com.highperformancespark.examples.errors.Throws$$anonfun$1.apply(throws.scala:9)

at scala.collection.Iterator$$anon$11.next(Iterator.scala:370)

at scala.collection.Iterator$$anon$11.next(Iterator.scala:370)

at scala.collection.Iterator$class.foreach(Iterator.scala:750)

at scala.collection.AbstractIterator.foreach(Iterator.scala:1202)

at scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:59)

at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:104)

at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:48)

at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:295)

at scala.collection.AbstractIterator.to(Iterator.scala:1202)

at scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:287)

at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1202)

at scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:274)

at scala.collection.AbstractIterator.toArray(Iterator.scala:1202)

at org.apache.spark.rdd.RDD$$anonfun$collect$1$$anonfun$13.apply(RDD.scala:935)

at org.apache.spark.rdd.RDD$$anonfun$collect$1$$anonfun$13.apply(RDD.scala:935)

at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1944)

at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1944)

at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)

at org.apache.spark.scheduler.Task.run(Task.scala:99)

at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:282)

at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)

at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)

at java.lang.Thread.run(Thread.java:745)

17/01/23 12:41:36 ERROR TaskSetManager: Task 0 in stage 0.0 failed 1 times; aborting job

org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 0.0 failed 1 times, most recent failure: Lost task 0.0 in stage 0.0 (TID 0, localhost, executor driver): java.lang.ArithmeticException: / by zero

at com.highperformancespark.examples.errors.Throws$$anonfun$1.apply$mcII$sp(throws.scala:9)

at com.highperformancespark.examples.errors.Throws$$anonfun$1.apply(throws.scala:9)

at com.highperformancespark.examples.errors.Throws$$anonfun$1.apply(throws.scala:9)

at scala.collection.Iterator$$anon$11.next(Iterator.scala:370)

at scala.collection.Iterator$$anon$11.next(Iterator.scala:370)

at scala.collection.Iterator$class.foreach(Iterator.scala:750)

at scala.collection.AbstractIterator.foreach(Iterator.scala:1202)

at scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:59)

at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:104)

at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:48)

at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:295)

at scala.collection.AbstractIterator.to(Iterator.scala:1202)

at scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:287)

at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1202)

at scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:274)

at scala.collection.AbstractIterator.toArray(Iterator.scala:1202)

at org.apache.spark.rdd.RDD$$anonfun$collect$1$$anonfun$13.apply(RDD.scala:935)

at org.apache.spark.rdd.RDD$$anonfun$collect$1$$anonfun$13.apply(RDD.scala:935)

at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1944)

at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1944)

at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)

at org.apache.spark.scheduler.Task.run(Task.scala:99)

at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:282)

at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)

at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)

at java.lang.Thread.run(Thread.java:745)

Driver stacktrace:

at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1435)

at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1423)

at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1422)

at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)

at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)

at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1422)

at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:802)

at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:802)

at scala.Option.foreach(Option.scala:257)

at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:802)

at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1650)

at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1605)

at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1594)

at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)

at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:628)

at org.apache.spark.SparkContext.runJob(SparkContext.scala:1918)

at org.apache.spark.SparkContext.runJob(SparkContext.scala:1931)

at org.apache.spark.SparkContext.runJob(SparkContext.scala:1944)

at org.apache.spark.SparkContext.runJob(SparkContext.scala:1958)

at org.apache.spark.rdd.RDD$$anonfun$collect$1.apply(RDD.scala:935)

at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)

at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)

at org.apache.spark.rdd.RDD.withScope(RDD.scala:362)

at org.apache.spark.rdd.RDD.collect(RDD.scala:934)

at com.highperformancespark.examples.errors.Throws$.throwInner(throws.scala:11)

... 43 elided

Caused by: java.lang.ArithmeticException: / by zero

at com.highperformancespark.examples.errors.Throws$$anonfun$1.apply$mcII$sp(throws.scala:9)

at com.highperformancespark.examples.errors.Throws$$anonfun$1.apply(throws.scala:9)

at com.highperformancespark.examples.errors.Throws$$anonfun$1.apply(throws.scala:9)

at scala.collection.Iterator$$anon$11.next(Iterator.scala:370)

at scala.collection.Iterator$$anon$11.next(Iterator.scala:370)

at scala.collection.Iterator$class.foreach(Iterator.scala:750)

at scala.collection.AbstractIterator.foreach(Iterator.scala:1202)

at scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:59)

at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:104)

at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:48)

at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:295)

at scala.collection.AbstractIterator.to(Iterator.scala:1202)

at scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:287)

at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1202)

at scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:274)

at scala.collection.AbstractIterator.toArray(Iterator.scala:1202)

at org.apache.spark.rdd.RDD$$anonfun$collect$1$$anonfun$13.apply(RDD.scala:935)

at org.apache.spark.rdd.RDD$$anonfun$collect$1$$anonfun$13.apply(RDD.scala:935)

at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1944)

at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1944)

at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)

at org.apache.spark.scheduler.Task.run(Task.scala:99)

at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:282)

at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)

at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)

... 1 more

Page 19: Debugging Apache Spark -   Scala & Python super happy fun times 2017

Bad outer transformation (Python):

data = sc.parallelize(range(10))

transform1 = data.map(lambda x: x + 1)

transform2 = transform1.map(lambda x: x / 0)

transform2.count()

David Martyn Hunt

Page 20: Debugging Apache Spark -   Scala & Python super happy fun times 2017

Let’s look at the error messages for it:

[Stage 0:> (0 + 0) / 4]17/02/01 09:52:07 ERROR Executor: Exception in task 0.0 in stage 0.0 (TID 0)

org.apache.spark.api.python.PythonException: Traceback (most recent call last):

File "/home/holden/repos/spark/python/lib/pyspark.zip/pyspark/worker.py", line 180, in main

process()

File "/home/holden/repos/spark/python/lib/pyspark.zip/pyspark/worker.py", line 175, in process

serializer.dump_stream(func(split_index, iterator), outfile)

File "/home/holden/repos/spark/python/pyspark/rdd.py", line 2406, in pipeline_func

return func(split, prev_func(split, iterator))

File "/home/holden/repos/spark/python/pyspark/rdd.py", line 2406, in pipeline_func

return func(split, prev_func(split, iterator))

File "/home/holden/repos/spark/python/pyspark/rdd.py", line 2406, in pipeline_func

return func(split, prev_func(split, iterator))

File "/home/holden/repos/spark/python/pyspark/rdd.py", line 345, in func

return f(iterator)

File "/home/holden/repos/spark/python/pyspark/rdd.py", line 1040, in <lambda>

return self.mapPartitions(lambda i: [sum(1 for _ in i)]).sum()

Continued for ~400 lines

File "high_performance_pyspark/bad_pyspark.py", line 32, in <lambda>

transform2 = transform1.map(lambda x: x / 0)

Page 21: Debugging Apache Spark -   Scala & Python super happy fun times 2017

Working in Jupyter?

“The error messages were so useless - I looked up how to disable error reporting in Jupyter”

(paraphrased from PyData DC)

Page 22: Debugging Apache Spark -   Scala & Python super happy fun times 2017

Working in Jupyter - try your terminal for help

Possibly fixed by https://issues.apache.org/jira/browse/SPARK-19094, but it may not get in

tonynetone

AttributeError: unicode object has no attribute endsWith

Page 23: Debugging Apache Spark -   Scala & Python super happy fun times 2017

Ok maybe the web UI is easier?

Mr Thinktank

Page 24: Debugging Apache Spark -   Scala & Python super happy fun times 2017

And click through...

afu007

Page 25: Debugging Apache Spark -   Scala & Python super happy fun times 2017

A scroll down (not quite to the bottom)

File "high_performance_pyspark/bad_pyspark.py", line 32, in <lambda> transform2 = transform1.map(lambda x: x / 0)ZeroDivisionError: integer division or modulo by zero

Page 26: Debugging Apache Spark -   Scala & Python super happy fun times 2017

Or look at the bottom of console logs:

  File "/home/holden/repos/spark/python/lib/pyspark.zip/pyspark/worker.py", line 180, in main
    process()
  File "/home/holden/repos/spark/python/lib/pyspark.zip/pyspark/worker.py", line 175, in process
    serializer.dump_stream(func(split_index, iterator), outfile)
  File "/home/holden/repos/spark/python/pyspark/rdd.py", line 2406, in pipeline_func
    return func(split, prev_func(split, iterator))
  File "/home/holden/repos/spark/python/pyspark/rdd.py", line 2406, in pipeline_func
    return func(split, prev_func(split, iterator))
  File "/home/holden/repos/spark/python/pyspark/rdd.py", line 2406, in pipeline_func
    return func(split, prev_func(split, iterator))

Page 27: Debugging Apache Spark -   Scala & Python super happy fun times 2017

Or look at the bottom of console logs:

  File "/home/holden/repos/spark/python/pyspark/rdd.py", line 345, in func

return f(iterator)

File "/home/holden/repos/spark/python/pyspark/rdd.py", line 1040, in <lambda>

return self.mapPartitions(lambda i: [sum(1 for _ in i)]).sum()

File "/home/holden/repos/spark/python/pyspark/rdd.py", line 1040, in <genexpr>

return self.mapPartitions(lambda i: [sum(1 for _ in i)]).sum()

File "high_performance_pyspark/bad_pyspark.py", line 32, in <lambda>

transform2 = transform1.map(lambda x: x / 0)

ZeroDivisionError: integer division or modulo by zero

Page 28: Debugging Apache Spark -   Scala & Python super happy fun times 2017

And in Scala…

Caused by: java.lang.ArithmeticException: / by zero
at com.highperformancespark.examples.errors.Throws$$anonfun$4.apply$mcII$sp(throws.scala:17)
at com.highperformancespark.examples.errors.Throws$$anonfun$4.apply(throws.scala:17)
at com.highperformancespark.examples.errors.Throws$$anonfun$4.apply(throws.scala:17)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:370)
at scala.collection.Iterator$class.foreach(Iterator.scala:750)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1202)
at scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:59)
at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:104)
at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:48)
at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:295)
at scala.collection.AbstractIterator.to(Iterator.scala:1202)
at scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:287)
at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1202)
at scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:274)
at scala.collection.AbstractIterator.toArray(Iterator.scala:1202)
at org.apache.spark.rdd.RDD$$anonfun$collect$1$$anonfun$13.apply(RDD.scala:935)
at org.apache.spark.rdd.RDD$$anonfun$collect$1$$anonfun$13.apply(RDD.scala:935)
at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1944)
at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1944)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
at org.apache.spark.scheduler.Task.run(Task.scala:99)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:282)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
... 1 more

Page 29: Debugging Apache Spark -   Scala & Python super happy fun times 2017

(Aside): DAG differences illustrated

Melissa Wilkins

Page 30: Debugging Apache Spark -   Scala & Python super happy fun times 2017

Pipelines (& Python)

● Some pipelining happens inside of Python
  ○ For performance (fewer copies from Python to Scala)
● DAG visualization is generated inside of Scala
  ○ Misses Python pipelines :(

Regardless of language:
● Can be difficult to determine which element failed
● Stack trace _sometimes_ helps (it did this time)
● take(1) + count() are your friends - but a lot of work :( (see the sketch below)
● persist can help a bit too
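A rough sketch of the take(1) + count() approach, reusing the RDD names from the earlier divide-by-zero example:

transform1.take(1)    # succeeds -> the first map is probably fine
transform1.count()    # also succeeds
transform2.take(1)    # blows up -> the bug lives in the second map
transform1.persist()  # worth doing if you're going to poke at an intermediate RDD repeatedly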

Arnaud Roberti

Page 31: Debugging Apache Spark -   Scala & Python super happy fun times 2017

Side note: Lambdas aren’t always your friend

● Lambdas can make finding the error more challenging
● I love lambda x, y: x / y as much as the next human, but when y is zero :(
● A small bit of refactoring for your debugging never hurt anyone* (see the sketch below)
● If your inner functions are causing errors it’s a good time to have tests for them!
● Difficult to put logs inside of them

*A blatant lie, but… it hurts less often than it helps
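A small sketch of that refactoring (pairs is a hypothetical RDD of (x, y) tuples): pulling the lambda into a named function means the traceback names it, and you can test and log it on its own.

def safe_divide(x, y):
    # easy to unit test, and it shows up by name in stack traces
    if y == 0:
        return None
    return x / y

results = pairs.map(lambda p: safe_divide(p[0], p[1]))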

Zoli Juhasz

Page 32: Debugging Apache Spark -   Scala & Python super happy fun times 2017

Testing - you should do it!

● spark-testing-base provides simple classes to build your Spark tests with (a minimal example below)
  ○ It’s available on pip & Maven Central
● That’s a talk unto itself though (and it’s on YouTube)
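For flavour, a bare-bones local-mode test sketch (plain unittest, not spark-testing-base’s own classes - those wrap this kind of setup/teardown for you):

import unittest
from pyspark import SparkContext

class SimpleTransformTest(unittest.TestCase):
    def setUp(self):
        self.sc = SparkContext("local[2]", "test")

    def tearDown(self):
        self.sc.stop()

    def test_add_one(self):
        rdd = self.sc.parallelize([1, 2, 3])
        self.assertEqual(rdd.map(lambda x: x + 1).collect(), [2, 3, 4])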

Page 33: Debugging Apache Spark -   Scala & Python super happy fun times 2017

Adding your own logging:

● Java users: use Log4J & friends
● Python users: use the logging library (or even print!) - see the sketch below
● Accumulators
  ○ Behave a bit weirdly; don’t put large amounts of data in them
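A hedged sketch of worker-side logging with mapPartitions (the logging output lands in the executor logs / web UI rather than your driver console; the logger name is made up):

import logging

def process_partition(records):
    logging.basicConfig(level=logging.INFO)
    log = logging.getLogger("my_app")
    count = 0
    for r in records:
        count += 1
        yield r
    log.info("processed %d records in this partition", count)

cleaned = data.mapPartitions(process_partition)   # `data` as in the earlier examples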

Page 34: Debugging Apache Spark -   Scala & Python super happy fun times 2017

Also not all errors are “hard” errors

● Parsing input? Going to reject some malformed records
● flatMap or filter + map can make this simpler
● Still want to track the number of rejected records (see accumulators)
● Invest in dead letter queues
  ○ e.g. write malformed records to an Apache Kafka topic

Mustafasari

Page 35: Debugging Apache Spark -   Scala & Python super happy fun times 2017

So using names & logging & accs could be:

data = sc.parallelize(range(10))

rejectedCount = sc.accumulator(0)

def loggedDivZero(x):
    import logging
    try:
        return [x / 0]
    except Exception as e:
        rejectedCount.add(1)
        logging.warning("Error found " + repr(e))
        return []

# add1 isn't defined on the slide; a minimal stand-in so the example runs:
def add1(x):
    return x + 1

transform1 = data.flatMap(loggedDivZero)

transform2 = transform1.map(add1)

transform2.count()

print("Reject " + str(rejectedCount.value))

Page 36: Debugging Apache Spark -   Scala & Python super happy fun times 2017

Ok what about if we run out of memory?

In the middle of some Java stack traces:

  File "/home/holden/repos/spark/python/lib/pyspark.zip/pyspark/worker.py", line 180, in main

process()

File "/home/holden/repos/spark/python/lib/pyspark.zip/pyspark/worker.py", line 175, in process

serializer.dump_stream(func(split_index, iterator), outfile)

File "/home/holden/repos/spark/python/pyspark/rdd.py", line 2406, in pipeline_func

return func(split, prev_func(split, iterator))

File "/home/holden/repos/spark/python/pyspark/rdd.py", line 2406, in pipeline_func

return func(split, prev_func(split, iterator))

File "/home/holden/repos/spark/python/pyspark/rdd.py", line 2406, in pipeline_func

return func(split, prev_func(split, iterator))

File "/home/holden/repos/spark/python/pyspark/rdd.py", line 345, in func

return f(iterator)

File "/home/holden/repos/spark/python/pyspark/rdd.py", line 1040, in <lambda>

return self.mapPartitions(lambda i: [sum(1 for _ in i)]).sum()

File "/home/holden/repos/spark/python/pyspark/rdd.py", line 1040, in <genexpr>

return self.mapPartitions(lambda i: [sum(1 for _ in i)]).sum()

File "high_performance_pyspark/bad_pyspark.py", line 132, in generate_too_much

return range(10000000000000)

MemoryError

Page 37: Debugging Apache Spark -   Scala & Python super happy fun times 2017

Tubbs doesn’t always look the same

● Out of memory can be pure JVM (worker)
  ○ OOM exception during join
  ○ GC overhead limit exceeded
● OutOfMemory error, executors being killed by the kernel, etc.
● Running in YARN? “Application overhead exceeded”
● JVM out of memory on the driver side from Py4J

Page 38: Debugging Apache Spark -   Scala & Python super happy fun times 2017

Reasons for JVM worker OOMs (w/PySpark)

● Unbalanced shuffles
● Buffering of Rows with PySpark + UDFs
  ○ If you have a downstream select, move it upstream (see the sketch below)
● Individual jumbo records (after pickling)
● Off-heap storage
● Native code memory leak
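A rough DataFrame sketch of “move the select upstream” (df and its columns are hypothetical): trim to the columns the UDF actually needs before it runs, so less data gets shipped to and buffered in the Python workers.

from pyspark.sql.functions import udf
from pyspark.sql.types import IntegerType

code_len = udf(lambda s: len(s or ""), IntegerType())

# instead of applying the UDF to the wide df and selecting afterwards:
slim = df.select("id", "code")                       # upstream select drops the wide columns first
result = slim.withColumn("code_len", code_len(slim["code"]))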

Page 39: Debugging Apache Spark -   Scala & Python super happy fun times 2017

Reasons for Python worker OOMs (w/PySpark)

● Insufficient memory reserved for the Python worker (see the configuration sketch below)
● Jumbo records
● Eager entire-partition evaluation (e.g. sort + mapPartitions)
● Too-large partitions (unbalanced or not enough partitions)
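A hedged configuration sketch (Spark 2.x era property names; the values are made up, tune them for your job):

from pyspark import SparkConf, SparkContext

conf = (SparkConf()
        .set("spark.python.worker.memory", "1g")             # memory per Python worker before it spills
        .set("spark.yarn.executor.memoryOverhead", "1024"))  # extra off-heap headroom on YARN (MB)
sc = SparkContext(conf=conf)

# more, smaller partitions also helps with the "too large partitions" case
data = sc.textFile("hdfs:///some/path", minPartitions=200)   # hypothetical input path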

Page 40: Debugging Apache Spark -   Scala & Python super happy fun times 2017

And loading invalid paths:

org.apache.hadoop.mapred.InvalidInputException: Input path does not exist: file:/doesnotexist
at org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:251)
at org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:270)
at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:202)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:252)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:250)
at scala.Option.getOrElse(Option.scala:121)
at org.apache.spark.rdd.RDD.partitions(RDD.scala:250)
at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:252)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:250)
at scala.Option.getOrElse(Option.scala:121)
at org.apache.spark.rdd.RDD.partitions(RDD.scala:250)
at org.apache.spark.api.python.PythonRDD.getPartitions(PythonRDD.scala:53)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:252)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:250)
at scala.Option.getOrElse(Option.scala:121)
at org.apache.spark.rdd.RDD.partitions(RDD.scala:250)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:2080)
at org.apache.spark.rdd.RDD$$anonfun$collect$1.apply(RDD.scala:935)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
at org.apache.spark.rdd.RDD.withScope(RDD.scala:362)
at org.apache.spark.rdd.RDD.collect(RDD.scala:934)
at org.apache.spark.api.python.PythonRDD$.collectAndServe(PythonRDD.scala:458)
at org.apache.spark.api.python.PythonRDD.collectAndServe(PythonRDD.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
at py4j.Gateway.invoke(Gateway.java:280)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:214)
at java.lang.Thread.run(Thread.java:745)

Page 41: Debugging Apache Spark -   Scala & Python super happy fun times 2017

Connecting Java Debuggers

● Add the JDWP incantation to your JVM launch: -agentlib:jdwp=transport=dt_socket,server=y,address=[debugport]
  ○ spark.executor.extraJavaOptions to attach the debugger on the executors
  ○ --driver-java-options to attach on the driver process
  ○ Add “suspend=y” if only debugging a single worker & it’s exiting too quickly
● The JDWP debugger is IDE specific - Eclipse & IntelliJ have docs (configuration sketch below)
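A sketch of wiring that in through configuration from PySpark (the port and options are just examples; in client mode the driver-side options still belong on the spark-submit command line):

from pyspark import SparkConf, SparkContext

jdwp = "-agentlib:jdwp=transport=dt_socket,server=y,suspend=n,address=5005"
conf = SparkConf().set("spark.executor.extraJavaOptions", jdwp)  # attach to the executors
# for the driver JVM, pass --driver-java-options with the same string to spark-submit
sc = SparkContext(conf=conf)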

shadow planet

Page 42: Debugging Apache Spark -   Scala & Python super happy fun times 2017

Connecting Python Debuggers

● You’re going to have to change your code a bit :(
● You can use the broadcast + singleton “hack” to start pydev or your desired remote debugging lib on all of the interpreters (rough sketch below)
● See https://wiki.python.org/moin/PythonDebuggingTools for your remote debugging options and pick the one that works with your toolchain
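A very rough sketch of the singleton side of that hack (the broadcast piece for shipping connection details is left out; assumes pydevd is installed on the workers and a debug server is listening - host, port, and the failing function are all made up):

_debugger_attached = False

def maybe_attach_debugger():
    global _debugger_attached
    if not _debugger_attached:
        import pydevd
        pydevd.settrace("192.168.1.10", port=5678, suspend=False)  # hypothetical debug server
        _debugger_attached = True

def traced_divide(x):
    maybe_attach_debugger()   # runs at most once per Python worker process
    return x / 0              # now you can step into the failure on the worker

data.map(traced_divide).count()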

shadow planet

Page 43: Debugging Apache Spark -   Scala & Python super happy fun times 2017

Alternative approaches:

● Move take(1) up the dependency chain
● DAG in the Web UI -- less useful for Python :(
● toDebugString -- also less useful in Python :(
● Sample data and run locally (sketch below)
● Running in cluster mode? Consider debugging in client mode
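Two of those sketched in PySpark (broken_function stands in for whatever your transformation was doing):

print(transform2.toDebugString())   # the lineage as text - less pretty than the Scala output

sample = transform1.take(5)         # pull a small sample down to the driver
for x in sample:
    broken_function(x)              # run it locally, with a real debugger / print statements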

Melissa Wilkins

Page 45: Debugging Apache Spark -   Scala & Python super happy fun times 2017

High Performance Spark (soon!)

First seven chapters are available in “Early Release”*:
● Buy from O’Reilly - http://bit.ly/highPerfSpark
● Python is in Chapter 7 & debugging is in the Appendix

Get notified when it’s updated & finished:
● http://www.highperformancespark.com
● https://twitter.com/highperfspark

* Early Release means extra mistakes, but also a chance to help us make a more awesome book.

Page 46: Debugging Apache Spark -   Scala & Python super happy fun times 2017

And some upcoming talks:

● April
  ○ Meetup of some type in Madrid (TBD)
  ○ PyData Amsterdam
  ○ Philly ETE
  ○ Scala Days Chicago
● May
  ○ Scala LX?
  ○ Strata London
  ○ 3rd Data Science Summit Europe in Israel
● June
  ○ Scala Days CPH

Page 47: Debugging Apache Spark -   Scala & Python super happy fun times 2017

k thnx bye :)

If you care about Spark testing and don’t hate surveys: http://bit.ly/holdenTestingSpark

Will tweet results “eventually” @holdenkarau

Any PySpark users: have some simple UDFs you wish ran faster that you’re willing to share? http://bit.ly/pySparkUDF

Pssst: Have feedback on the presentation? Give me a shout ([email protected]) if you feel comfortable doing so :)