Spark as a distributed Scala
Spark As A Distributed Scala
I write a lot on my blog www.Fruzenshtein.com. Currently interested in Scala, Akka, Spark…
Who is who?
Alexey Zvolinskiy, ~4 years of Scala experience
Passing through the Functional Programming in Scala Specialization on Coursera
@Fruzenshtein
Why Scala?
What makes Scala so great?
1. Functional programming language*
2. Immutability
3. Type system
4. Collections API
5. Pattern matching
6. Implicits
Functional programming language
1. Functions are first-class citizens
2. Totality
3. Determinism
4. Purity
[Diagram: a function A => B maps each input Ai to exactly one output Bi]
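These properties can be illustrated with a small Scala sketch (the function names are illustrative, not from the slides):

```scala
// A pure, total, deterministic function: for every input A it
// returns exactly one output B, with no side effects.
def double(n: Int): Int = n * 2   // total: defined for every Int

// Determinism: the same input always yields the same output.
val a = double(21)
val b = double(21)
// a == b

// Impure counterpart for contrast: the output depends on hidden
// mutable state, so the same input can yield different outputs.
var calls = 0
def impureDouble(n: Int): Int = { calls += 1; n * 2 + calls }
```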
Immutability
1. Makes code more predictable
2. Reduces the effort needed to understand code
3. Key to thread-safety
Books: Java Concurrency in Practice; Effective Java, 2nd Edition
Type system
1. Static typing
2. Type inference
3. Bounds and variance:
Map[K, V]
List[T1 <: T2]
Set[+T]
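A minimal sketch of these features (the Animal/Dog classes are invented for illustration):

```scala
// Type inference: no annotation needed, xs is List[Int]
val xs = List(1, 2, 3)

// Upper bound: T must be a subtype of Animal
class Animal { def name: String = "animal" }
class Dog extends Animal { override def name: String = "dog" }

def names[T <: Animal](animals: List[T]): List[String] =
  animals.map(_.name)

// Covariance (+T in the declaration of List): a List[Dog]
// is usable wherever a List[Animal] is expected
val dogs: List[Dog] = List(new Dog)
val animals: List[Animal] = dogs
```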
Collections API
val numbers = List(1,2,3,4,5,6,7,8,9,10)
numbers.filter(_ % 2 == 0)
       .map(_ * 10)
// List(20, 40, 60, 80, 100)

filter(p: Int => Boolean)  // here: (n => n % 2 == 0)
map(f: Int => Int)         // here: (n => n * 10)
Collections API
val groupsOfStudents = List(
  List(("Alex", 65), ("Kate", 87), ("Sam", 98)),
  List(("Peter", 84), ("Bob", 79), ("Samanta", 71)),
  List(("Rob", 82), ("Jack", 55), ("Ann", 90))
)
groupsOfStudents.flatMap(students => students)
  .groupBy(student => student._2 > 75)
  .get(true).get
// List((Kate,87), (Sam,98), (Peter,84), (Bob,79), (Rob,82), (Ann,90))
And what?!
= Parallelism =
Idea of parallelism
How to divide a problem into subproblems?
How to use a hardware optimally?
Parallelism background
Scala parallel collections
val from0to100000: Range = 0 until 100000
val list = from0to100000.toList
val parList = list.par
// parList: scala.collection.parallel.immutable.ParSeq[Int]
Some benchmarks
val list = from0to100000.toList
for (i <- 1 to 10) {
  val t0 = System.currentTimeMillis()
  list.filter(isPrime(_))
  println(System.currentTimeMillis - t0)
}
def isPrime(n: Int): Boolean = !((2 until n - 1).exists(n % _ == 0))
val parList = list.par
for (i <- 1 to 10) {
  val t1 = System.currentTimeMillis()
  parList.filter(isPrime(_))
  println(System.currentTimeMillis - t1)
}
Sequential (ms): 7106, 6467, 6315, 6275, 6478, 8732, 6543, 6296, 6299, 6286
Parallel (ms): 5130, 5106, 4649, 4568, 4580, 4446, 4447, 4437, 4290, 4476
Ok, but what about Spark?!
Why distributed computations?
Single machine (shared memory): parallel collections (Scala)
Multiple nodes (network): RDDs (Spark)
Almost the same API
RDD example
Spark
val tweets: RDD[Tweet] = …
tweets.filter(_.contains("bigdata"))
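Since the RDD API mirrors the Scala collections API, the same chain can be tried locally on a plain List. This is a sketch: the Tweet case class and sample data are invented stand-ins, not from the slides.

```scala
// Stand-in for the Tweet type used in the RDD example
case class Tweet(author: String, body: String) {
  def contains(s: String): Boolean = body.contains(s)
}

val tweets: List[Tweet] = List(
  Tweet("alice", "love bigdata"),
  Tweet("bob",   "hello world")
)

// Identical method name and shape as the RDD version above
val matched = tweets.filter(_.contains("bigdata"))
// matched keeps only alice's tweet
```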
Latency
Numbers from Jeff Dean http://research.google.com/people/jeff/ https://gist.github.com/2841832 Graph and scale by Thomas Lee
Computation model (humanized latency scale):
memory: seconds to days
disk: weeks to months
network: weeks to years
Spark transformations & actions
1. Transformations are lazy: map, filter, flatMap, …
2. Actions are eager: reduce, collect, count, …
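Spark itself is not runnable here, but Scala's lazy views give a faithful local analogue of lazy transformations and eager actions (a sketch, not Spark code):

```scala
var evaluated = 0

// Like RDD transformations: building the pipeline runs nothing yet
val lazyPipeline = (1 to 100).view
  .map { n => evaluated += 1; n * 2 }
  .filter(_ % 3 == 0)

assert(evaluated == 0)   // nothing computed so far (lazy)

// Like an action: forcing the view triggers the computation,
// and only as many elements as needed are evaluated
val result = lazyPipeline.take(2).toList
```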
val tweets: RDD[Tweet] = …tweets.filter(_.contains(“bigdata”)) .map(t => (t.author, t.body)
val tweets: RDD[Tweet] = …tweets.filter(_.contains(“bigdata”)) .map(t => (t.author, t.body) .collect()
Rules of thumb
1. Cache
2. Apply efficiently
3. Avoid shuffling
val tweets: RDD[Tweet] = …
val cachedTweets = tweets.cache()
// efficient: filter first, then map only the matching tweets
cachedTweets.filter(_.contains("USA"))
  .map(t => (t.author, t.body))

// less efficient: maps every tweet before filtering
cachedTweets.map(t => (t.author, t.body))
  .filter(_._2.contains("USA"))
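The effect of the ordering can be measured locally with plain collections (a sketch with an invented call counter, not Spark code):

```scala
// (author, body) pairs standing in for tweets
val tweets = List(
  ("a", "in USA today"),
  ("b", "elsewhere"),
  ("c", "USA again"),
  ("d", "nothing")
)

var mapped = 0
def toPair(t: (String, String)): (String, String) = {
  mapped += 1          // count how many elements the map step touches
  (t._1, t._2)
}

// Efficient: filter first, map only the survivors
mapped = 0
tweets.filter(_._2.contains("USA")).map(toPair)
val efficient = mapped   // map ran for 2 matching tweets only

// Inefficient: map everything, then filter
mapped = 0
tweets.map(toPair).filter(_._2.contains("USA"))
val wasteful = mapped    // map ran for all 4 tweets
```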
Shuffling
Partitions before the shuffle:
(1, 240) (2, 500) (2, 105)
(3, 100) (1, 200) (1, 500)
(1, 450) (3, 100) (3, 100)
groupByKey() ships every value across the network:
(2, [500, 105]) (1, [240, 200, 500, 450]) (3, [100, 100, 100])
Transaction(id: Int, amount: Int)
We want to know how much money each client spent.
Reduce before group
Original partitions:
(1, 240) (2, 500) (2, 105)
(3, 100) (1, 200) (1, 500)
(1, 450) (3, 100) (3, 100)
reduceByKey(…) first combines values locally on each partition:
(1, 240) (2, 605)
(3, 100) (1, 700)
(1, 450) (3, 200)
so the shuffle ships far less data than a plain groupByKey():
(2, [605]) (1, [240, 700, 450]) (3, [100, 200])
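A local Scala analogue of the two strategies, with plain collections standing in for RDDs (a sketch, not Spark code):

```scala
// The transactions from the diagram: (clientId, amount)
val transactions = List(
  (1, 240), (2, 500), (2, 105),
  (3, 100), (1, 200), (1, 500),
  (1, 450), (3, 100), (3, 100)
)

// groupByKey-style: collect every value per key, then sum
val viaGroup = transactions
  .groupBy(_._1)
  .map { case (id, txs) => id -> txs.map(_._2).sum }

// reduceByKey-style: combine values per key as we go
// (in Spark this combining happens on each partition before the shuffle)
val viaReduce = transactions
  .foldLeft(Map.empty[Int, Int]) { case (acc, (id, amount)) =>
    acc + (id -> (acc.getOrElse(id, 0) + amount))
  }
// both give: client 1 -> 1390, client 2 -> 605, client 3 -> 300
```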
Thanks :)
@Fruzenshtein