Effective Testing for Spark Programs (Strata NY 2015)


  • Effective Testing for Spark Programs

    or: avoiding "I didn't think that could happen"

    Now mostly works*

    *See developer for details. Does not imply warranty. :p

  • Who am I?

    My name is Holden Karau

    Preferred pronouns are she/her

    I'm a Software Engineer, currently at Alpine and previously at Databricks, Google, Foursquare & Amazon

    Co-author of Learning Spark & Fast Data Processing with Spark

    @holdenkarau

    Slideshare: http://www.slideshare.net/hkarau

    LinkedIn: https://www.linkedin.com/in/holdenkarau

    Spark videos: http://bit.ly/holdenSparkVideos


  • What is going to be covered:

    What I think I might know about you

    A bit about why you should test your programs

    Doing traditional unit testing for Spark programs, along with special considerations for SQL/Hive & Streaming

    Using counters & other job acceptance tests w/ Spark

    Cute & scary pictures (I promise at least one panda and one cat)

    Future work (some of this future work might even get done!)

  • Who I think you wonderful humans are:

    Nice* people

    Like silly pictures

    Familiar with Apache Spark (if not, buy one of my books or watch Paco's awesome video)

    Familiar with one of Scala, Java, or Python (if you know R well, I'd love to chat though)

    Want to make better software (or models, or w/e)

  • So why should you test?

    Makes you a better person

    Saves $s

    May help you avoid losing your employer all of their money (or users, if we were in the bay)

    AWS is expensive

    Waiting for our jobs to fail is a pretty long dev cycle

    This is really just to guilt-trip you & give you flashbacks to your QA internships

  • Why don't we test?

    It's hard

    Faking data, setting up integration tests, urgh, w/e

    Our tests can get too slow

    It takes a lot of time, and people always want everything done yesterday, or "I just want to go home and see my partner", etc.

  • Cat photo from http://galato901.deviantart.com/art/Cat-on-Work-Break-173043455

  • An artisanal Spark unit test

    @transient private var _sc: SparkContext = _
    // Expose the context to the tests (the next slide's test calls sc.parallelize)
    def sc: SparkContext = _sc

    override def beforeAll() {
      _sc = new SparkContext("local[4]", "test")
      super.beforeAll()
    }

    override def afterAll() {
      if (_sc != null) {
        _sc.stop()
      }
      System.clearProperty("spark.driver.port") // rebind issue
      _sc = null
      super.afterAll()
    }

    Photo by morinesque

  • And on to the actual test...

    test("really simple transformation") {
      val input = List("hi", "hi holden", "bye")
      val expected = List(List("hi"), List("hi", "holden"), List("bye"))
      assert(tokenize(sc.parallelize(input)).collect().toList === expected)
    }

    def tokenize(f: RDD[String]) = {
      f.map(_.split(" ").toList)
    }

    Photo by morinesque

  • Wait, where were the batteries?

    Photo by Jim Bauer

  • Let's get batteries!

    Spark unit testing:
    spark-testing-base - https://github.com/holdenk/spark-testing-base
    sscheck - https://github.com/juanrh/sscheck

    Integration testing:
    spark-integration-tests (Spark internals) - https://github.com/databricks/spark-integration-tests

    Performance:
    spark-perf (also for Spark internals) - https://github.com/databricks/spark-perf

    Spark job validation:
    spark-validator - https://github.com/holdenk/spark-validator

    Photo by Mike Mozart

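    If you want to wire spark-testing-base into an sbt build, the dependency looks roughly like this. The group and artifact are the library's published coordinates; the version string below is only a placeholder, since releases are tied to specific Spark versions (check the project README for the one matching yours):

    // build.sbt (sketch; replace the placeholder with a real release)
    libraryDependencies += "com.holdenkarau" %% "spark-testing-base" % "<spark-version>_<library-version>" % "test"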

  • A simple unit test re-visited (Scala)

    class SampleRDDTest extends FunSuite with SharedSparkContext {
      test("really simple transformation") {
        val input = List("hi", "hi holden", "bye")
        val expected = List(List("hi"), List("hi", "holden"), List("bye"))
        assert(SampleRDD.tokenize(sc.parallelize(input)).collect().toList === expected)
      }
    }

  • A simple unit test (Java)

    public class SampleJavaRDDTest extends SharedJavaSparkContext implements Serializable {
      @Test
      public void verifyMapTest() {
        List<Integer> input = Arrays.asList(1, 2);
        JavaRDD<Integer> result = jsc().parallelize(input).map(
          new Function<Integer, Integer>() {
            public Integer call(Integer x) { return x * x; }
          });
        assertEquals(input.size(), result.count());
      }
    }

  • A simple unit test (Python)

    class SimpleTest(SparkTestingBaseTestCase):
        """A simple test."""

        def test_basic(self):
            """Test a simple collect."""
            input = ["hello world"]
            rdd = self.sc.parallelize(input)
            result = rdd.collect()
            assert result == input

  • Making fake data

    If you have production data you can sample, you are lucky!
    (If possible, you can try to save it in the same format.)

    sc.parallelize is pretty good for small tests; note that we can also specify the number of partitions (as sketched below).

    Coming up with good test data can take a long time.
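
    As a tiny sketch of that approach (the data here is made up for illustration), the second argument to sc.parallelize controls how many partitions the fake data is spread over, so even small inputs can exercise partition boundaries:

    // Assumes a SparkContext `sc` from one of the test traits above
    val fakeUsers = List("holden", "panda", "cat")
    val rdd = sc.parallelize(fakeUsers, 3) // numSlices = 3
    assert(rdd.partitions.size === 3)
    assert(rdd.collect().toList === fakeUsers)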

    Lori Rielly

  • QuickCheck / ScalaCheck

    QuickCheck generates test data under a set of constraints; the Scala version is ScalaCheck, supported by the two unit testing libraries for Spark:

    sscheck
    Awesome people*, supports generating DStreams too!

    spark-testing-base
    Also awesome people*, generates more pathological RDDs

    *I assume

    Photo by tara hunt

  • With sscheck

    def forallRDDGenOfNtoM = {
      val (minWords, maxWords) = (50, 100)
      Prop.forAll(RDDGen.ofNtoM(minWords, maxWords, arbitrary[String])) {
        rdd: RDD[String] => rdd.map(_.length()).sum must be_>=(0.0)
      }
    }

  • With spark-testing-base

    test("map should not change number of elements") {
      forAll(RDDGenerator.genRDD[String](sc)) {
        rdd => rdd.map(_.length).count() == rdd.count()
      }
    }

  • Testing streaming.

    Photo by Steve Jurvetson

  • Artisanal Stream Testing Code

    // Setup our stream:
    class TestInputStream[T: ClassTag](@transient var sc: SparkContext,
        ssc_ : StreamingContext, input: Seq[Seq[T]], numPartitions: Int)
      extends FriendlyInputDStream[T](ssc_) {

      def start() {}

      def stop() {}

      def compute(validTime: Time): Option[RDD[T]] = {
        logInfo("Computing RDD for time " + validTime)
        val index = ((validTime - ourZeroTime) / slideDuration - 1).toInt
        val selectedInput = if (index < input.size) input(index) else Seq[T]()

        // lets us test cases where RDDs are not created
        if (selectedInput == null) {
          return None
        }

        val rdd = sc.makeRDD(selectedInput, numPartitions)
        logInfo("Created RDD " + rdd.id + " with " + selectedInput)
        Some(rdd)
      }
    }

    trait StreamingSuiteBase extends FunSuite with BeforeAndAfterAll
        with Logging with SharedSparkContext {

      // Name of the framework for Spark context
      def framework: String = this.getClass.getSimpleName

      // Master for Spark context
      def master: String = "local[4]"

      // Batch duration
      def batchDuration: Duration = Seconds(1)

      // Directory where the checkpoint data will be saved
      lazy val checkpointDir = {
        val dir = Utils.createTempDir()
        logDebug(s"checkpointDir: $dir")
        dir.toString
      }

      // Default after function for any streaming test suite. Override this
      // if you want to add your stuff to "after" (i.e., don't call after { } )
      override def afterAll() {
        System.clearProperty("spark.streaming.clock")
        super.afterAll()
      }

    Photo by Steve Jurvetson

  • and continued...

      /**
       * Create an input stream for the provided input sequence. This is done using
       * TestInputStream as queueStream's are not checkpointable.
       */
      def createTestInputStream[T: ClassTag](sc: SparkContext, ssc_ : TestStreamingContext,
          input: Seq[Seq[T]]): TestInputStream[T] = {
        new TestInputStream(sc, ssc_, input, numInputPartitions)
      }

      // Default before function for any streaming test suite. Override this
      // if you want to add your stuff to "before" (i.e., don't call before { } )
      override def beforeAll() {
        if (useManualClock) {
          logInfo("Using manual clock")
          // We can specify our own clock
          conf.set("spark.streaming.clock", "org.apache.spark.streaming.util.TestManualClock")
        } else {
          logInfo("Using real clock")
          conf.set("spark.streaming.clock", "org.apache.spark.streaming.util.SystemClock")
        }
        super.beforeAll()
      }

      /**
       * Run a block of code with the given StreamingContext and automatically
       * stop the context when the block completes or when an exception is thrown.
       */
      def withOutputAndStreamingContext[R](outputStreamSSC: (TestOutputStream[R], TestStreamingContext))
          (block: (TestOutputStream[R], TestStreamingContext) => Unit): Unit = {
        val outputStream = outputStreamSSC._1
        val ssc = outputStreamSSC._2
        try {
          block(outputStream, ssc)
        } finally {
          try {
            ssc.stop(stopSparkContext = false)
          } catch {
            case e: Exception =>
              logError("Error stopping StreamingContext", e)
          }
        }
      }
    }
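
    For comparison, the spark-testing-base library packages this streaming plumbing behind a helper: a suite that mixes in its StreamingSuiteBase supplies scripted input batches plus the expected output batches and lets the library drive the clock. The sketch below reflects my reading of that helper (testOperation); the class name is made up, and the exact argument order and the ordered flag should be verified against the library's README for your Spark version.

    // Sketch only: testOperation and its `ordered` flag are assumed from
    // spark-testing-base's documented API; they are not shown in this talk.
    class SampleStreamingTest extends FunSuite with StreamingSuiteBase {
      def tokenize(lines: DStream[String]): DStream[List[String]] =
        lines.map(_.split(" ").toList)

      test("tokenize over two batches") {
        val input = Seq(Seq("hi holden"), Seq("bye"))
        val expected = Seq(Seq(List("hi", "holden")), Seq(List("bye")))
        testOperation[String, List[String]](input, tokenize _, expected, ordered = false)
      }
    }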

  • and now for the clock

    /*
     * Allows us access to a manual clock. Note that the manual clock changed
     * between 1.1.1 and 1.3
     */
    class TestManualClock(var time: Long) extends Clock {
      def this() = this(0L)

      def getTime(): Long = getTimeMillis()     // Compat
      def currentTime(): Long = getTimeMillis() // Compat
      def getTimeMillis(): Long = synchronized { time }

      def setTime(timeToSet: Long): Unit = synchronized {
        time = timeToSet
        notifyAll()
      }

      def advance(timeToAdd: Long): Unit = synchronized {
        time += timeToAdd
        notifyAll()
      }

      def addToTime(timeToAdd: Long): Unit = advance(timeToAdd) // Compat

      /**
       * @param targetTi