Spark as a distributed Scala


  • Spark As A Distributed Scala

  • Who is who?

    Alexey Zvolinskiy, ~4 years of Scala experience

    I write a lot on my blog www.Fruzenshtein.com. Currently interested in Scala, Akka, Spark

    Passing through the Functional Programming in Scala Specialization on Coursera

    @Fruzenshtein

  • Why Scala?

  • What makes Scala so great?

    1. Functional programming language*

    2. Immutability

    3. Type system

    4. Collections API

    5. Pattern matching

    6. Implicits

  • Functional programming language

    1. Functions are first-class citizens

    2. Totality

    3. Determinism

    4. Purity

    [Diagram: a function A => B maps each input Ai to exactly one output Bi (total, deterministic, pure)]
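    A minimal sketch of these three properties (the function names are illustrative, not from the slides):

    // Total, deterministic, pure: every Int input maps to exactly one
    // output, the same one every time, with no side effects
    def double(a: Int): Int = a * 2

    // Not total: no result for negative inputs, it throws instead
    def checkedSqrt(a: Int): Double =
      if (a >= 0) math.sqrt(a) else throw new IllegalArgumentException("negative input")

    // Not pure: the result depends on hidden mutable state
    var counter = 0
    def next(): Int = { counter += 1; counter }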

  • Immutability

    1. Makes code more predictable

    2. Reduces the effort needed to understand code

    3. Key to thread-safety

    Books: Java Concurrency in Practice, Effective Java (2nd Edition)
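    A quick illustration of the thread-safety point (my own example, not from the slides): an immutable value never changes, so it can be shared between threads without locks, and an "update" builds a new value instead:

    // Case class fields are immutable vals
    case class Account(owner: String, balance: Int)

    val acc = Account("Alex", 100)

    // copy() returns a new instance; acc itself is untouched and can
    // be read from any thread without synchronization
    val updated = acc.copy(balance = acc.balance + 50)
    // acc.balance == 100, updated.balance == 150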

  • Type system

    1. Static typing

    2. Type inference

    3. Bounds, e.g. Map[K, V], List[T]
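    A small sketch of inference and an upper type bound (illustrative example, not from the slides):

    // Type inference: no annotations needed, xs is List[Int], total is Int
    val xs = List(1, 2, 3)
    val total = xs.sum

    // Upper bound: T must be a subtype of Comparable[T]
    def maxOf[T <: Comparable[T]](a: T, b: T): T =
      if (a.compareTo(b) >= 0) a else b

    maxOf("abc", "abd") // "abd"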

  • Collections API

    val numbers = List(1,2,3,4,5,6,7,8,9,10)
    numbers.filter(_ % 2 == 0)
           .map(_ * 10)
    // List(20, 40, 60, 80, 100)

    filter(p: Int => Boolean)  // here: n => n % 2 == 0
    map(f: Int => Int)         // here: n => n * 10

  • Collections API

    val groupsOfStudents = List(
      List(("Alex", 65), ("Kate", 87), ("Sam", 98)),
      List(("Peter", 84), ("Bob", 79), ("Samanta", 71)),
      List(("Rob", 82), ("Jack", 55), ("Ann", 90)))

    groupsOfStudents.flatMap(students => students)
      .groupBy(student => student._2 > 75)
      .get(true).get
    // List((Kate,87), (Sam,98), (Peter,84), (Bob,79), (Rob,82), (Ann,90))

  • And what?! Parallelism!

  • Idea of parallelism

    How to divide a problem into subproblems?

    How to use the hardware optimally?

  • Parallelism background

  • Scala parallel collections

    val from0to100000: Range = 0 until 100000

    val list = from0to100000.toList

    val parList = list.par
    // scala.collection.parallel.immutable.ParSeq[Int]

  • Some benchmarks

    val list = from0to100000.toList
    for (i … [the rest of the benchmark loop is truncated in the transcript]
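    Since the slide's benchmark is cut off, here is a minimal sketch of the kind of comparison it likely ran (the time helper and the workload are my assumptions; note that from Scala 2.13 on, .par lives in the separate scala-parallel-collections module):

    // Naive timing helper; for serious numbers use JMH or ScalaMeter
    def time[A](label: String)(body: => A): A = {
      val start = System.nanoTime()
      val result = body
      println(s"$label: ${(System.nanoTime() - start) / 1000000} ms")
      result
    }

    val list = (0 until 100000).toList
    val parList = list.par

    // Same operation, sequential vs parallel
    time("sequential") { list.map(math.sqrt(_)) }
    time("parallel")   { parList.map(math.sqrt(_)) }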
  • Ok, but what about Spark?!

  • Why distributed computations?

    Single machine (shared memory)  ->  Parallel collections (Scala)

    Multiple nodes (network)        ->  RDDs (Spark)

    Almost the same API
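    To make "almost the same API" concrete, a sketch of the same pipeline in both worlds (assumes an existing SparkContext sc; the numbers are arbitrary):

    import org.apache.spark.rdd.RDD

    // One machine: a parallel collection
    val localResult = (0 until 100000).toList.par
      .filter(_ % 2 == 0)
      .map(_ * 10)

    // Many machines: an RDD with the same combinators
    val rdd: RDD[Int] = sc.parallelize(0 until 100000)
    val distributedResult = rdd
      .filter(_ % 2 == 0)
      .map(_ * 10)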

  • RDD example

    [Diagram: one logical collection, partitioned across several Spark nodes]

    val tweets: RDD[Tweet] = ...
    tweets.filter(_.body.contains("bigdata"))
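    For context, a self-contained version of this example (the Tweet case class, the sample data, and the SparkContext setup are my assumptions; the slide only shows the filter):

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.rdd.RDD

    case class Tweet(author: String, body: String)

    object TweetFilter {
      def main(args: Array[String]): Unit = {
        val conf = new SparkConf().setAppName("tweet-filter").setMaster("local[*]")
        val sc = new SparkContext(conf)

        // In a real job this would come from HDFS, Kafka, etc.
        val tweets: RDD[Tweet] = sc.parallelize(Seq(
          Tweet("alex", "learning bigdata with Spark"),
          Tweet("kate", "cat pictures")))

        tweets.filter(_.body.contains("bigdata"))
              .collect()
              .foreach(println)

        sc.stop()
      }
    }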

  • Latency

    Numbers from Jeff Dean (http://research.google.com/people/jeff/, https://gist.github.com/2841832); graph and scale by Thomas Lee


  • Computation model

    memory:   seconds - days
    disk:     weeks - months
    network:  weeks - years

  • Spark transformations & actions

    1. Transformations are lazy

    2. Actions are eager

    Transformations: map, filter, flatMap
    Actions: reduce, collect, count

    val tweets: RDD[Tweet] = ...

    // Transformations only: lazy, nothing is computed yet
    tweets.filter(_.body.contains("bigdata"))
          .map(t => (t.author, t.body))

    // Ending with an action: eager, this triggers the computation
    tweets.filter(_.body.contains("bigdata"))
          .map(t => (t.author, t.body))
          .collect()

  • Rules of thumb

    1. Cache

    2. Apply efficiently

    3. Avoid shuffling

    val tweets: RDD[Tweet] = ...
    val cachedTweets = tweets.cache()

    // Filter first: map only touches the matching tweets
    cachedTweets.filter(_.body.contains("USA"))
                .map(t => (t.author, t.body))

    // Map first: every tweet is transformed before any filtering
    cachedTweets.map(t => (t.author, t.body))
                .filter(_._2.contains("USA"))

  • Shuffling

    Transaction(id: Int, amount: Int)

    We want to know how much money each client spent.

    Partitions before the shuffle:
    (1, 240) (2, 500) (2, 105)
    (3, 100) (1, 200) (1, 500)
    (1, 450) (3, 100) (3, 100)

    groupByKey() sends every value across the network:
    (2, [500, 105])  (1, [240, 200, 500, 450])  (3, [100, 100, 100])
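    A sketch of this approach with the slide's data (the RDD construction and an existing SparkContext sc are my assumptions):

    // (client id, amount) pairs, laid out as in the partitions above
    val transactions = sc.parallelize(Seq(
      (1, 240), (2, 500), (2, 105),
      (3, 100), (1, 200), (1, 500),
      (1, 450), (3, 100), (3, 100)))

    // groupByKey shuffles every single value across the network,
    // and only then do we sum per client
    val totals = transactions.groupByKey() // (1, [240, 200, 500, 450]), ...
      .mapValues(_.sum)                    // (1, 1390), (2, 605), (3, 300)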

  • Reduce before group

    reduceByKey() first combines values locally, inside each partition:

    (1, 240) (2, 500) (2, 105)        (1, 240) (2, 605)
    (3, 100) (1, 200) (1, 500)   =>   (3, 100) (1, 700)
    (1, 450) (3, 100) (3, 100)        (1, 450) (3, 200)

    Only the pre-reduced values then cross the network:
    (2, [605])  (1, [240, 700, 450])  (3, [100, 200])
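    The same totals with far less shuffling (same assumed transactions RDD as above):

    // Partial sums are computed inside each partition (map-side combine),
    // so only one value per key per partition crosses the network
    val totals = transactions.reduceByKey(_ + _) // (1, 1390), (2, 605), (3, 300)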

  • Thanks :)

    @Fruzenshtein