Debugging Apache Spark - Scala & Python super happy fun times 2017


  • Debugging Apache Spark: Professional Stack Trace Reading

    with your friends Holden & Joey

  • Who is Holden? My name is Holden Karau. Preferred pronouns are she/her. I'm a Principal Software Engineer at IBM's Spark Technology Center. Apache Spark committer (as of last month!) :) Previously at Alpine, Databricks, Google, Foursquare & Amazon. Co-author of Learning Spark & Fast Data Processing with Spark

    co-author of a new book focused on Spark performance coming this year*

    @holdenkarau Slide share http://www.slideshare.net/hkarau Linkedin https://www.linkedin.com/in/holdenkarau Github https://github.com/holdenk Spark Videos http://bit.ly/holdenSparkVideos


  • Spark Technology Center


    IBM Spark Technology Center

    Founded in 2015.

    Location: 505 Howard St., San Francisco, CA. Web: http://spark.tc Twitter: @apachespark_tc

    Mission: Contribute intellectual and technical capital to the Apache Spark community. Make the core technology enterprise- and cloud-ready. Build data science skills to drive intelligence into business applications: http://bigdatauniversity.com

    Key statistics: About 50 developers, co-located with 25 IBM designers. Major contributions to Apache Spark: http://jiras.spark.tc. Apache SystemML is now an Apache Incubator project. Founding member of UC Berkeley AMPLab and RISE Lab. Member of R Consortium and Scala Center.


  • Who is Joey?

    Preferred pronouns: he/him. Where I work: Rocana, Platform Technical Lead. Where I used to work: Cloudera (2011-2015), NSA. Distributed systems, security, data processing, big data. @fwiffo


  • What is Rocana?

    We built a system for large-scale, real-time collection, processing, and analysis of event-oriented machine data.

    On prem or in the cloud, but not SaaS. Supportability is a big deal for us:

    Predictability of performance under load and failures. Ease of configuration and operation. Behavior in wacky environments.

  • Who do we think y'all are? Friendly[ish] people. Don't mind pictures of cats or stuffed animals. Know some Spark. Want to debug your Spark applications. OK with things getting a little bit silly.

    Lori Erickson


  • What will be covered?

    Getting at Spark's logs & persisting them
    What your options for logging are
    Attempting to understand common Spark error messages
    Understanding the DAG (and how pipelining can impact your life)
    Subtle attempts to get you to use spark-testing-base or similar
    Fancy Java debugging tools & clusters - not entirely the path of sadness
    Holden's even less subtle attempts to get you to buy her new book
    Pictures of cats & stuffed animals

  • Aka: Building our Monster Identification Guide

  • So where are the logs/errors? (e.g. before we can identify a monster we have to find it)

    Error messages reported to the console*
    Log messages reported to the console*
    Log messages on the workers - accessed through the Spark Web UI or Spark History Server :)
    Where the error surfaces: driver versus worker

    (*When running in client mode)

    Andrey
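    Since persisting logs matters as much as finding them, here is a minimal sketch (not from the slides) of turning on Spark's event logging so the History Server can replay the UI after the application exits. The app name, master, and log directory are assumptions; substitute your own.

    import org.apache.spark.{SparkConf, SparkContext}

    // Enable event logging so the Spark History Server can reconstruct the
    // web UI for finished applications. The directory below is hypothetical;
    // it must match the one your history server is configured to read.
    val conf = new SparkConf()
      .setAppName("log-persistence-example")  // hypothetical app name
      .setMaster("local[2]")                  // local master, just for experimenting
      .set("spark.eventLog.enabled", "true")
      .set("spark.eventLog.dir", "hdfs:///tmp/spark-events")  // hypothetical path
    val sc = new SparkContext(conf)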

  • One weird trick to debug anything

    Don't read the logs (yet). Draw (possibly in your head) a model of how you think a working app would behave. Then predict where in that model things are broken. Now read the logs to prove or disprove your theory. Repeat.

    Krzysztof Belczyński

  • Working in YARN? (e.g. before we can identify a monster we have to find it)

    Use yarn logs -applicationId <app id> to retrieve logs after log aggregation. Or set up the Spark History Server. Or set yarn.nodemanager.delete.debug-delay-sec to keep local logs around :)

    Lauren Mitchell

    http://spark.apache.org/docs/latest/monitoring.html

  • Spark is pretty verbose by default

    Most of the time it tells you things you already know, or don't need to know. You can dynamically control the log level with sc.setLogLevel. This is especially useful to increase logging near the point of error in your code.
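    For example, a minimal sketch (assuming an existing SparkContext sc; suspectRdd is a hypothetical stand-in for the failing computation):

    sc.setLogLevel("DEBUG")          // valid levels include ALL, DEBUG, INFO, WARN, ERROR, OFF
    val result = suspectRdd.count()  // the operation under investigation
    sc.setLogLevel("WARN")           // turn the firehose back off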

  • But what about when we get an error?

    Python Spark errors often come in two-ish parts:
    JVM stack trace (Friend Monster - comes with most errors)
    Python stack trace (Boo - has information)
    Buddy - often used to report the information from Friend Monster and Boo

  • So what is that JVM stack trace?

    Java/Scala: A normal stack trace. It can come from the worker or the driver; if from a worker, it may be repeated several times, once per partition & attempt that hit the error. The driver stack trace wraps the worker stack trace.

    R/Python: Same as above but... it doesn't want your actual error message to get lonely. Wraps any exception on the workers (& some exceptions on the drivers). Not always super useful.
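    A hedged sketch (not from the slides) of peeling those layers apart on the driver; rdd here is a placeholder for whatever computation is failing:

    import org.apache.spark.SparkException

    try {
      rdd.count()
    } catch {
      case e: SparkException =>
        // The message includes the driver-side stack trace...
        println(e.getMessage)
        // ...while the cause, when present, is the exception actually
        // thrown on the worker.
        Option(e.getCause).foreach(c => println(s"Root cause: $c"))
    }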

  • Let's make a simple mistake & debug :)

    Error in transformation (divide by zero). Image by: Tomomi

  • Bad outer transformation (Scala):

    val transform1 = data.map(x => x + 1)
    val transform2 = transform1.map(x => x / 0) // Will throw an exception when forced to evaluate
    transform2.count() // Forces evaluation

    David Martyn Hunt
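    Not from the slides: one minimal way to keep a single bad record from failing the whole job is to capture the per-element failure and drop it, e.g. with scala.util.Try:

    import scala.util.Try

    // Try captures the ArithmeticException per element; toOption turns
    // failures into None, which flatMap drops. Here every element fails,
    // so the count is simply 0 - the point is the pattern, not the math.
    val safeTransform = transform1.flatMap(x => Try(x / 0).toOption)
    safeTransform.count() // forces evaluation without throwing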

  • Let's look at the error messages for it:

    17/01/23 12:41:36 ERROR Executor: Exception in task 0.0 in stage 0.0 (TID 0)

    java.lang.ArithmeticException: / by zero

    at com.highperformancespark.examples.errors.Throws$$anonfun$1.apply$mcII$sp(throws.scala:9)

    at com.highperformancespark.examples.errors.Throws$$anonfun$1.apply(throws.scala:9)

    at com.highperformancespark.examples.errors.Throws$$anonfun$1.apply(throws.scala:9)

    at scala.collection.Iterator$$anon$11.next(Iterator.scala:370)

    at scala.collection.Iterator$$anon$11.next(Iterator.scala:370)

    at scala.collection.Iterator$class.foreach(Iterator.scala:750)

    at scala.collection.AbstractIterator.foreach(Iterator.scala:1202)

    at scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:59)

    at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:104)

    at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:48)

    at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:295)

    at scala.collection.AbstractIterator.to(Iterator.scala:1202)

    at scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:287)

    at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1202)

    Continued for ~100 lines

    at scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:274)

    at scala.collection.AbstractIterator.toArray(Iterator.scala:1202)

    at org.apache.spark.rdd.RDD$$anonfun$collect$1$$anonfun$13.apply(RDD.scala:935)

    at org.apache.spark.rdd.RDD$$anonfun$collect$1$$anonfun$13.apply(RDD.scala:935)

    at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1944)

    at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1944)

    at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)

    at org.apache.spark.scheduler.Task.run(Task.scala:99)

    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:282)

    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)

    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)

    at java.lang.Thread.run(Thread.java:745)

    17/01/23 12:41:36 WARN TaskSetManager: Lost task 0.0 in stage 0.0 (TID 0, localhost, executor driver): java.lang.ArithmeticException: / by zero

    at com.highperformancespark.examples.errors.Throws$$anonfun$1.apply$mcII$sp(throws.scala:9)

    at com.highperformancespark.examples.errors.Throws$$anonfun$1.apply(throws.scala:9)

    at com.highperformancespark.examples.errors.Throws$$anonfun$1.apply(throws.scala:9)

    at scala.collection.Iterator$$anon$11.next(Iterator.scala:370)

    at scala.collection.Iterator$$anon$11.next(Iterator.scala:370)

    at scala.collection.Iterator$class.foreach(Iterator.scala:750)

    at scala.collection.AbstractIterator.foreach(Iterator.scala:1202)

    at scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:59)

    at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:104)

    at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:48)

    at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:295)

    at scala.collection.AbstractIterator.to(Iterator.scala:1202)

    at scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:287)

    at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1202)

    at scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:274)

    at scala.collection.AbstractIterator.toArray(Iterator.scala:1202)

    at org.apache.spark.rdd.RDD$$anonfun$collect$1$$anonfun$13.apply(RDD.scala:935)

    at org.apache.spark.rdd.RDD$$anonfun$collect$1$$anonfun$13.apply(RDD.scala:935)

    at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1944)

    at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1944)

    at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)

    at org.apache.spark.scheduler.Task.run(Task.scala:99)

    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:282)

    at java.util.concurrent.ThreadPoolE