Debugging Apache Spark - Scala & Python super happy fun times 2017


  • Debugging Apache Spark: Professional Stack Trace Reading

    with your friends Holden & Joey

  • Who is Holden? My name is Holden Karau. Preferred pronouns are she/her. I'm a Principal Software Engineer at IBM's Spark Technology Center. Apache Spark committer (as of last month!) :) Previously at Alpine, Databricks, Google, Foursquare & Amazon. Co-author of Learning Spark & Fast Data Processing with Spark

    co-author of a new book focused on Spark performance coming this year*

    @holdenkarau Slide share http://www.slideshare.net/hkarau Linkedin https://www.linkedin.com/in/holdenkarau Github https://github.com/holdenk Spark Videos http://bit.ly/holdenSparkVideos


  • Spark Technology Center


    IBM Spark Technology Center

    Founded in 2015.

    Location: 505 Howard St., San Francisco, CA. Web: http://spark.tc Twitter: @apachespark_tc

    Mission: Contribute intellectual and technical capital to the Apache Spark community. Make the core technology enterprise- and cloud-ready. Build data science skills to drive intelligence into business applications: http://bigdatauniversity.com

    Key statistics: About 50 developers, co-located with 25 IBM designers. Major contributions to Apache Spark: http://jiras.spark.tc. Apache SystemML is now an Apache Incubator project. Founding member of UC Berkeley AMPLab and RISE Lab. Member of R Consortium and Scala Center.


  • Who is Joey?

    Preferred pronouns: he/him. Where I work: Rocana, Platform Technical Lead. Where I used to work: Cloudera (2011-2015), NSA. Distributed systems, security, data processing, big data. @fwiffo


  • What is Rocana?

    We built a system for large-scale, real-time collection, processing, and analysis of event-oriented machine data.

    On prem or in the cloud, but not SaaS. Supportability is a big deal for us:

    Predictability of performance under load and failures. Ease of configuration and operation. Behavior in wacky environments.

  • Who do we think y'all are? Friendly[ish] people. Don't mind pictures of cats or stuffed animals. Know some Spark. Want to debug your Spark applications. OK with things getting a little bit silly.

    Lori Erickson


  • What will be covered?

    Getting at Spark's logs & persisting them
    What your options for logging are
    Attempting to understand common Spark error messages
    Understanding the DAG (and how pipelining can impact your life)
    Subtle attempts to get you to use spark-testing-base or similar
    Fancy Java debugging tools & clusters - not entirely the path of sadness
    Holden's even less subtle attempts to get you to buy her new book
    Pictures of cats & stuffed animals

  • Aka: Building our Monster Identification Guide

  • So where are the logs/errors? (e.g. before we can identify a monster we have to find it)

    Error messages reported to the console*
    Log messages reported to the console*
    Log messages on the workers - accessed through the Spark Web UI or Spark History Server :)
    Where the error surfaces: driver versus worker

    (*When running in client mode)

    Andrey
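    Since persisting logs matters as much as finding them, here is a minimal sketch (not from the slides) of turning on Spark's event logging so the History Server can replay the UI after the application exits. The app name, master, and log directory are assumptions; substitute your own.

    import org.apache.spark.{SparkConf, SparkContext}

    // Enable event logging so the Spark History Server can reconstruct the
    // web UI for finished applications. The directory below is hypothetical;
    // it must match the one your history server is configured to read.
    val conf = new SparkConf()
      .setAppName("log-persistence-example")  // hypothetical app name
      .setMaster("local[2]")                  // local master, just for experimenting
      .set("spark.eventLog.enabled", "true")
      .set("spark.eventLog.dir", "hdfs:///tmp/spark-events")  // hypothetical path
    val sc = new SparkContext(conf)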

  • One weird trick to debug anything

    Don't read the logs (yet). Draw (possibly in your head) a model of how you think a working app would behave. Then predict where in that model things are broken. Now read the logs to prove or disprove your theory. Repeat.

    Krzysztof Belczyński

  • Working in YARN? (e.g. before we can identify a monster we have to find it)

    Use yarn logs -applicationId <app id> to retrieve logs after log aggregation. Or set up the Spark History Server. Or set yarn.nodemanager.delete.debug-delay-sec to keep local logs around :)

    Lauren Mitchell

    http://spark.apache.org/docs/latest/monitoring.html

  • Spark is pretty verbose by default

    Most of the time it tells you things you already know, or don't need to know. You can dynamically control the log level with sc.setLogLevel. This is especially useful to increase logging near the point of error in your code.
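    For example, a minimal sketch (assuming an existing SparkContext sc; suspectRdd is a hypothetical stand-in for the failing computation):

    sc.setLogLevel("DEBUG")          // valid levels include ALL, DEBUG, INFO, WARN, ERROR, OFF
    val result = suspectRdd.count()  // the operation under investigation
    sc.setLogLevel("WARN")           // turn the firehose back off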

  • But what about when we get an error?

    Python Spark errors often come in two-ish parts:
    JVM stack trace (Friend Monster - comes with most errors)
    Python stack trace (Boo - has information)
    Buddy - often used to report the information from Friend Monster and Boo

  • So what is that JVM stack trace?

    Java/Scala: A normal stack trace. It can come from the worker or the driver; if from a worker, it may be repeated several times, once per partition & attempt that hit the error. The driver stack trace wraps the worker stack trace.

    R/Python: Same as above but... it doesn't want your actual error message to get lonely. Wraps any exception on the workers (& some exceptions on the drivers). Not always super useful.
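    A hedged sketch (not from the slides) of peeling those layers apart on the driver; rdd here is a placeholder for whatever computation is failing:

    import org.apache.spark.SparkException

    try {
      rdd.count()
    } catch {
      case e: SparkException =>
        // The message includes the driver-side stack trace...
        println(e.getMessage)
        // ...while the cause, when present, is the exception actually
        // thrown on the worker.
        Option(e.getCause).foreach(c => println(s"Root cause: $c"))
    }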

  • Let's make a simple mistake & debug :)

    Error in transformation (divide by zero). Image by: Tomomi

  • Bad outer transformation (Scala):

    val transform1 = data.map(x => x + 1)
    val transform2 = transform1.map(x => x / 0) // Will throw an exception when forced to evaluate
    transform2.count() // Forces evaluation

    David Martyn Hunt
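    Not from the slides: one minimal way to keep a single bad record from failing the whole job is to capture the per-element failure and drop it, e.g. with scala.util.Try:

    import scala.util.Try

    // Try captures the ArithmeticException per element; toOption turns
    // failures into None, which flatMap drops. Here every element fails,
    // so the count is simply 0 - the point is the pattern, not the math.
    val safeTransform = transform1.flatMap(x => Try(x / 0).toOption)
    safeTransform.count() // forces evaluation without throwing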

  • Let's look at the error messages for it:

    17/01/23 12:41:36 ERROR Executor: Exception in task 0.0 in stage 0.0 (TID 0)

    java.lang.ArithmeticException: / by zero

    at com.highperformancespark.examples.errors.Throws$$anonfun$1.apply$mcII$sp(throws.scala:9)

    at com.highperformancespark.examples.errors.Throws$$anonfun$1.apply(throws.scala:9)

    at com.highperformancespark.examples.errors.Throws$$anonfun$1.apply(throws.scala:9)

    at scala.collection.Iterator$$anon$11.next(Iterator.scala:370)

    at scala.collection.Iterator$$anon$11.next(Iterator.scala:370)

    at scala.collection.Iterator$class.foreach(Iterator.scala:750)

    at scala.collection.AbstractIterator.foreach(Iterator.scala:1202)

    at scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:59)

    at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:104)

    at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:48)

    at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:295)

    at scala.collection.AbstractIterator.to(Iterator.scala:1202)

    at scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:287)

    at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1202)

    Continued for ~100 lines

    at scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:274)

    at scala.collection.AbstractIterator.toArray(Iterator.scala:1202)

    at org.apache.spark.rdd.RDD$$anonfun$collect$1$$anonfun$13.apply(RDD.scala:935)

    at org.apache.spark.rdd.RDD$$anonfun$collect$1$$anonfun$13.apply(RDD.scala:935)

    at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1944)

    at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1944)

    at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)

    at org.apache.spark.scheduler.Task.run(Task.scala:99)

    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:282)

    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)

    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)

    at java.lang.Thread.run(Thread.java:745)

    17/01/23 12:41:36 WARN TaskSetManager: Lost task 0.0 in stage 0.0 (TID 0, localhost, executor driver): java.lang.ArithmeticException: / by zero

    at com.highperformancespark.examples.errors.Throws$$anonfun$1.apply$mcII$sp(throws.scala:9)

    at com.highperformancespark.examples.errors.Throws$$anonfun$1.apply(throws.scala:9)

    at com.highperformancespark.examples.errors.Throws$$anonfun$1.apply(throws.scala:9)

    at scala.collection.Iterator$$anon$11.next(Iterator.scala:370)

    at scala.collection.Iterator$$anon$11.next(Iterator.scala:370)

    at scala.collection.Iterator$class.foreach(Iterator.scala:750)

    at scala.collection.AbstractIterator.foreach(Iterator.scala:1202)

    at scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:59)

    at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:104)

    at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:48)

    at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:295)

    at scala.collection.AbstractIterator.to(Iterator.scala:1202)

    at scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:287)

    at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1202)

    at scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:274)

    at scala.collection.AbstractIterator.toArray(Iterator.scala:1202)

    at org.apache.spark.rdd.RDD$$anonfun$collect$1$$anonfun$13.apply(RDD.scala:935)

    at org.apache.spark.rdd.RDD$$anonfun$collect$1$$anonfun$13.apply(RDD.scala:935)

    at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1944)

    at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1944)

    at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)

    at org.apache.spark.scheduler.Task.run(Task.scala:99)

    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:282)

    at java.util.concurrent.ThreadPoolE