Spark as a distributed Scala

Transcript of "Spark as a distributed Scala"

Page 1: Spark as a distributed Scala

Spark As A Distributed Scala

Page 2: Spark as a distributed Scala

Who is who? Alexey Zvolinskiy, ~4 years of Scala experience

I write a lot on my blog www.Fruzenshtein.com. Currently interested in Scala, Akka, Spark…

Working through the Functional Programming in Scala Specialization on Coursera

@Fruzenshtein

Page 3: Spark as a distributed Scala

Why Scala?

Page 4: Spark as a distributed Scala

What makes Scala so great?

1. Functional programming language*
2. Immutability
3. Type system
4. Collections API
5. Pattern matching
6. Implicits

Page 5: Spark as a distributed Scala

Functional programming language

1. Functions are first-class citizens
2. Totality
3. Determinism
4. Purity

Diagram: a function A => B maps each input Ai (from A1 … An) to exactly one output Bi (from B1 … Bn).
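A minimal sketch of these properties (function names are illustrative, and toIntOption assumes Scala 2.13+):

// Pure, total, deterministic: defined for every Int, same output
// for the same input, no side effects.
def double(a: Int): Int = a * 2

// Not total: s.toInt throws on non-numeric input instead of returning a value.
def parseUnsafe(s: String): Int = s.toInt

// Made total by widening the result type to Option[Int].
def parse(s: String): Option[Int] = s.toIntOption

// Not pure: the println is an observable side effect.
def doubleNoisy(a: Int): Int = { println(s"doubling $a"); a * 2 }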

Page 6: Spark as a distributed Scala

Immutability

1. Makes code more predictable
2. Reduces the effort needed to understand code
3. Key to thread-safety

Books: Java Concurrency in Practice; Effective Java, 2nd Edition
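A small illustrative sketch (the Point type is made up for this example): immutable values produce new instances instead of mutating shared state, which is what makes them safe to share across threads.

case class Point(x: Int, y: Int)

val p = Point(1, 2)
val moved = p.copy(x = 5)   // new instance; p itself never changes

val xs = List(1, 2, 3)
val ys = 0 :: xs            // new list; xs is still List(1, 2, 3)
// xs(0) = 42               // does not compile: immutable List has no update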

Page 7: Spark as a distributed Scala

Type system

1. Static typing
2. Type inference
3. Bounds

Map[K, V]
List[T1 <: T2]
Set[+T]
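For illustration (Animal and Dog are hypothetical types, not from the slides), here is how inference, bounds, and variance look in code:

val n = 42                                   // type inference: n is Int

class Animal
class Dog extends Animal

// Upper bound: T must be a subtype of Animal.
def count[T <: Animal](xs: List[T]): Int = xs.size

// Covariance (the +T in List's declaration): a List[Dog]
// can be used wherever a List[Animal] is expected.
val dogs: List[Dog] = List(new Dog)
val animals: List[Animal] = dogs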

Page 8: Spark as a distributed Scala

Collections API

val numbers = List(1, 2, 3, 4, 5, 6, 7, 8, 9, 10)
numbers.filter(_ % 2 == 0)   // filter(p: Int => Boolean), here n => n % 2 == 0
  .map(_ * 10)               // map(f: Int => Int), here n => n * 10
// List(20, 40, 60, 80, 100)

Page 9: Spark as a distributed Scala

Collections API

val groupsOfStudents = List(
  List(("Alex", 65), ("Kate", 87), ("Sam", 98)),
  List(("Peter", 84), ("Bob", 79), ("Samanta", 71)),
  List(("Rob", 82), ("Jack", 55), ("Ann", 90))
)

groupsOfStudents.flatMap(students => students)
  .groupBy(student => student._2 > 75)
  .get(true).get
// List((Kate,87), (Sam,98), (Peter,84), (Bob,79), (Rob,82), (Ann,90))
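Note that .get(true).get throws if no student scored above 75. A safer spelling of the same query, as a sketch:

groupsOfStudents.flatten
  .groupBy(_._2 > 75)
  .getOrElse(true, Nil)   // empty list instead of an exception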

Page 10: Spark as a distributed Scala

And what?! Parallelism

Page 11: Spark as a distributed Scala

Idea of parallelism

How to divide a problem into subproblems?

How to use the hardware optimally?

Page 12: Spark as a distributed Scala

Parallelism background

Page 13: Spark as a distributed Scala

Scala parallel collections

val from0to100000: Range = 0 until 100000
val list = from0to100000.toList

val parList = list.par   // scala.collection.parallel.immutable.ParSeq[Int]
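Note: in Scala 2.12 and earlier, .par is available out of the box; since Scala 2.13 the parallel collections live in a separate module. A sketch of the 2.13 setup (the version number is illustrative):

// build.sbt
libraryDependencies += "org.scala-lang.modules" %% "scala-parallel-collections" % "1.0.4"

// in code
import scala.collection.parallel.CollectionConverters._
val parList = (0 until 100000).toList.par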

Page 14: Spark as a distributed Scala

Some benchmarks

def isPrime(n: Int): Boolean =
  !((2 until n - 1) exists (n % _ == 0))

val list = from0to100000.toList
for (i <- 1 to 10) {
  val t0 = System.currentTimeMillis()
  list.filter(isPrime(_))
  println(System.currentTimeMillis - t0)
}

val parList = list.par
for (i <- 1 to 10) {
  val t1 = System.currentTimeMillis()
  parList.filter(isPrime(_))
  println(System.currentTimeMillis - t1)
}

Sequential timings (ms): 7106, 6467, 6315, 6275, 6478, 8732, 6543, 6296, 6299, 6286
Parallel timings (ms): 5130, 5106, 4649, 4568, 4580, 4446, 4447, 4437, 4290, 4476
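A caveat on these numbers: a bare currentTimeMillis loop measures JIT warm-up as much as the code itself, so treat the gap between the two columns as illustrative; a harness such as JMH or ScalaMeter gives more trustworthy timings.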

Page 15: Spark as a distributed Scala

Ok, but what about Spark?!

Page 16: Spark as a distributed Scala

Why distributed computations?

Parallel collections (Scala) → single machine (shared memory)
RDDs (Spark) → multiple nodes (network)

Almost the same API

Page 17: Spark as a distributed Scala

RDD example

Diagram: one RDD spread across several Spark nodes.

val tweets: RDD[Tweet] = …
tweets.filter(_.body.contains("bigdata"))
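A self-contained local sketch of the same idea (the SparkContext setup, the sample data, and the Tweet fields are assumptions; the slide shows only the filter):

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

case class Tweet(author: String, body: String)

val conf = new SparkConf().setMaster("local[*]").setAppName("tweets")
val sc = new SparkContext(conf)

val tweets: RDD[Tweet] = sc.parallelize(Seq(
  Tweet("alice", "bigdata is everywhere"),
  Tweet("bob", "cat pictures")
))

val bigData = tweets.filter(_.body.contains("bigdata"))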

Page 18: Spark as a distributed Scala

Latency

Numbers from Jeff Dean (http://research.google.com/people/jeff/, https://gist.github.com/2841832); graph and scale by Thomas Lee.

Page 19: Spark as a distributed Scala

Computation model

memory: seconds - days
disk: weeks - months
network: weeks - years

Page 20: Spark as a distributed Scala

Spark transformations & actions

1. Transformations are lazy
2. Actions are eager

Transformations: map, filter, flatMap, …
Actions: reduce, collect, count, …

val tweets: RDD[Tweet] = …
tweets.filter(_.body.contains("bigdata"))
  .map(t => (t.author, t.body))

val tweets: RDD[Tweet] = …
tweets.filter(_.body.contains("bigdata"))
  .map(t => (t.author, t.body))
  .collect()
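To see the lazy/eager split concretely: the first snippet only records a lineage of transformations, and nothing runs on the cluster until the collect() action returns a plain local Array. A sketch, continuing the hypothetical Tweet setup above:

val pairs = tweets
  .filter(_.body.contains("bigdata"))
  .map(t => (t.author, t.body))                        // lazy: lineage only

val local: Array[(String, String)] = pairs.collect()   // eager: runs the job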

Page 21: Spark as a distributed Scala

Rules of thumb

1. Cache
2. Apply efficiently
3. Avoid shuffling

val tweets: RDD[Tweet] = …
val cachedTweets = tweets.cache()

// filter before map: only matching tweets get mapped
cachedTweets.filter(_.body.contains("USA"))
  .map(t => (t.author, t.body))

// map before filter: every tweet is mapped first
cachedTweets.map(t => (t.author, t.body))
  .filter(_._2.contains("USA"))
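The ordering matters because the filter is selective: running it first means the map only ever sees tweets that will be kept, while the second variant maps every tweet before discarding most of them. And cache() keeps the reused RDD in memory, so the second query does not recompute it from scratch.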

Page 22: Spark as a distributed Scala

Shuffling

Transaction(id: Int, amount: Int)
We want to know how much money each client spent.

Partition 1: (1, 240) (2, 500) (2, 105)
Partition 2: (3, 100) (1, 200) (1, 500)
Partition 3: (1, 450) (3, 100) (3, 100)

groupByKey() ships every value across the network:

(2, [500, 105]) (1, [240, 200, 500, 450]) (3, [100, 100, 100])

Page 23: Spark as a distributed Scala

Reduce before group

Starting from the same partitions:

Partition 1: (1, 240) (2, 500) (2, 105)
Partition 2: (3, 100) (1, 200) (1, 500)
Partition 3: (1, 450) (3, 100) (3, 100)

reduceByKey(…) first combines values within each partition:

Partition 1: (1, 240) (2, 605)
Partition 2: (3, 100) (1, 700)
Partition 3: (1, 450) (3, 200)

so the groupByKey() step only has to move the pre-reduced values:

(2, [605]) (1, [240, 700, 450]) (3, [100, 200])
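A runnable sketch of the per-client totals (sample data taken from the diagram; sc is the hypothetical local SparkContext from the earlier sketch):

val transactions = sc.parallelize(Seq(
  (1, 240), (2, 500), (2, 105),
  (3, 100), (1, 200), (1, 500),
  (1, 450), (3, 100), (3, 100)
))

// reduceByKey combines values inside each partition before shuffling,
// so far less data crosses the network than with groupByKey.
val totals = transactions.reduceByKey(_ + _)
totals.collect()   // Array((1,1390), (2,605), (3,300)), in some order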

Page 24: Spark as a distributed Scala

Thanks :)

@Fruzenshtein