Seattle useR Group - R + Scala
Shouheng Yi
Data Scientist [email protected]
www.linkedin.com/in/shouhengyi
+
Seattle useR Group 05/05/2015
R is Hard to Scale
• Architectural Parallelism: most of R's parallelism is done at the CPU level using MPI
• Data Parallelism: data must be fully present in RAM during an R session
• Why? R's core is written in C and Fortran
Scientists vs. Developers
• Scientists and researchers love R, because most of their computing tasks are iterative/procedural
• Software engineers are less impressed, because they need to develop concurrent, reactive and robust applications
Why I Found Scala Useful
• Lives on the JVM (most devs are comfortable with the JVM)
• Great distributed frameworks - Akka, Slick, Spark, etc.
• Syntactic sugar (less typing) -> easier to debug -> rapid development
R
vec <- 1:100
sum <- 0
for(i in vec){ sum <- sum + i }
Scala
val vec = 1 to 100
val sum = (0 /: vec)((a, b) => a + b)
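The `/:` operator above is shorthand for `foldLeft`, and it has been deprecated since Scala 2.13. A minimal sketch of the same sum in spellings that still compile cleanly; the object name `FoldDemo` is only an illustration and is not from the talk:

```scala
object FoldDemo {
  val vec = 1 to 100

  // Explicit left fold: start at 0 and add each element in turn.
  val viaFoldLeft: Int = vec.foldLeft(0)((a, b) => a + b)

  // The same fold written with placeholder syntax.
  val viaPlaceholders: Int = vec.foldLeft(0)(_ + _)

  // The built-in shortcut for numeric collections.
  val viaSum: Int = vec.sum
}
```

All three values hold the same result as the R loop.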
A Simple Task
• Step 1: read a CSV file that holds 100,000,000 double values (~1.7 GB).
read.csv() hung on my MacBook Air; it was still running after 20+ hours:
> vector <- read.csv("./vector.csv", quote = F, row.names = F)
• Step 2: calculate its sum
There are existing R packages such as ff and bigmemory that address these out-of-memory issues, but I want to demonstrate an alternative method that is more generic, robust and scalable.
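Before any actors are involved, the core idea is to stream the file in fixed-size chunks so that it never has to fit in RAM. A minimal single-threaded sketch, assuming one number per line with no header row, and Scala 2.13+ for `scala.util.Using`; `ChunkedSum`, `sumFile` and `writeSample` are hypothetical names, not part of the talk's code:

```scala
import java.nio.file.{Files, Path}
import scala.io.Source
import scala.util.Using

object ChunkedSum {
  // Sum a one-number-per-line file while holding at most
  // `chunkSize` lines in memory at any time.
  def sumFile(path: Path, chunkSize: Int): Double =
    Using.resource(Source.fromFile(path.toFile)) { source =>
      source.getLines()
        .grouped(chunkSize)                      // lazy: one chunk at a time
        .map(chunk => chunk.map(_.toDouble).sum) // partial sum per chunk
        .sum                                     // combine the partial sums
    }

  // Write a tiny stand-in for vector.csv so the sketch can be tried locally.
  def writeSample(values: Seq[Double]): Path = {
    val path = Files.createTempFile("vector", ".csv")
    Files.write(path, values.mkString("\n").getBytes("UTF-8"))
    path
  }
}
```

For example, `ChunkedSum.sumFile(ChunkedSum.writeSample(Seq(1.5, 2.5, -1.0, 4.0, 3.0)), 2)` reads the file two lines at a time.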
Rserve
> library(Rserve)
> Rserve()
Starting Rserve: /Library/Frameworks/R.framework/Resources/bin/R CMD /Library/Frameworks/R.framework/Versions/3.1/Resources/library/Rserve/libs//Rserve
R version 3.1.2 (2014-10-31) -- "Pumpkin Helmet" Copyright (C) 2014 The R Foundation for Statistical Computing Platform: x86_64-apple-darwin10.8.0 (64-bit)
R is free software and comes with ABSOLUTELY NO WARRANTY. You are welcome to redistribute it under certain conditions. Type 'license()' or 'licence()' for distribution details.
Natural language support but running in an English locale
R is a collaborative project with many contributors. Type 'contributors()' for more information and 'citation()' on how to cite R or R packages in publications.
Type 'demo()' for some demos, 'help()' for on-line help, or 'help.start()' for an HTML browser interface to help. Type 'q()' to quit R.
Rserv started in daemon mode.
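[Diagram: message flow between the Producer and Worker actors]
Producer inbox: case ProcessData(sum: Double, isEnd: Boolean)
Worker inbox: case DoWork(ind: Int, size: Int)
Producer replies to Worker: sender ! DoWork(ind, size - 1)
Worker replies to Producer: sender ! ProcessData(sum, isEnd)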
Producer Class
import akka.actor.{Actor, ActorLogging, Props}
import akka.routing.RoundRobinRouter
import scala.io.Source
import ClusterMessageProtocol._

class Producer extends Actor with ActorLogging {
  // Some inputs: lines per chunk and number of workers
  var (size, nworker) = (1000000, 10)
  // Some counters and the result holder
  var (ind, ncorpse, sum_total): (Int, Int, Double) = (0, 0, 0.0)
  // Create the router
  val workerRouter = context.actorOf(
    Props(new Worker(self, sum_total)).withRouter(RoundRobinRouter(nworker)),
    name = "workerRouter"
  )
  // Read the file lazily and chop it into pieces of `size` lines
  val iterator = Source.fromFile("./vector.csv").getLines().grouped(size)
  // What to do when it starts
  override def preStart() = println(s"Producer $self is alive")
  // What to do when it stops
  override def postStop() = println(s"Producer $self is dead. The sum is $sum_total")
  // Which messages to handle
  override def receive = {
    case ProcessData(sum) =>
      sum_total += sum
      if (iterator.hasNext) {
        // More data left: hand the next chunk to the worker that just reported
        sender ! DoWork(iterator.next().toList)
      } else {
        // No data left: stop this worker and count it
        ncorpse += 1
        context.stop(sender)
      }
      // Stop the producer once every worker has been stopped
      if (ncorpse == nworker) context.stop(self)
  }
}
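Worker Class
import akka.actor.{Actor, ActorLogging, ActorRef}
import org.rosuda.REngine.Rserve.RConnection
import ClusterMessageProtocol._

class Worker(master: ActorRef, sum_total: Double) extends Actor with ActorLogging {
  // Announce itself and ask the producer for the first chunk
  override def preStart() = {
    println(s"Worker $self is alive!!!")
    master ! ProcessData(sum_total)
  }
  override def receive = {
    case DoWork(iter) =>
      // Rserve: ship the chunk to R and let R compute the partial sum
      val c: RConnection = new RConnection()
      c.assign("x", iter.toArray)
      val sum: Double = c.eval("sum(as.numeric(x))").asDouble()
      c.close()
      // Report the partial sum, which also asks for more work
      println(s"$self => Partial Sum: $sum, Size: ${iter.length}")
      sender ! ProcessData(sum)
  }
}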
Main
import akka.actor.{ActorSystem, Props}

object Application extends App {
  // Start the actor system and create the producer, which spawns the workers
  val system = ActorSystem("ClusterSystem")
  system.actorOf(Props[Producer], name = "producer")
}
import akka.actor.ActorRef

object ClusterMessageProtocol {
  sealed trait Message
  // Producer side
  case class InitiateWorker(worker: ActorRef) extends Message
  case class ProcessData(sum: Double) extends Message
  // Actor side
  case class DoWork(iter: List[String]) extends Message
}
…
Worker Actor[akka://ClusterSystem/user/producer/workerRouter/$h#504275836] is alive!!!
Worker Actor[akka://ClusterSystem/user/producer/workerRouter/$e#1071584906] is alive!!!
Producer Actor[akka://ClusterSystem/user/producer#1272599354] is alive
Actor[akka://ClusterSystem/user/producer/workerRouter/$h#1269880699] => Partial Sum: -964.3282348781046, Size: 1000000
Actor[akka://ClusterSystem/user/producer/workerRouter/$f#500982456] => Partial Sum: -177.85266733478048, Size: 1000000
…
Actor[akka://ClusterSystem/user/producer/workerRouter/$e#1850062035] => Partial Sum: -547.8233029081448, Size: 1000000
Actor[akka://ClusterSystem/user/producer/workerRouter/$h#1269880699] => Partial Sum: -660.0674912837135, Size: 1000000
Producer Actor[akka://ClusterSystem/user/producer#1420020857] is dead. The sum is -13615.40143829277
> sum(vector)
[1] -13615.4
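The Producer and Worker above follow a work-pulling pattern: a worker asks for a chunk, reports its partial sum, and asks again until nothing is left. That pattern can be tried without Akka or Rserve. The sketch below swaps actors for plain Scala Futures and computes the partial sums in Scala rather than R, so it is a stand-in for the idea and not the talk's implementation; `PullSum` and `parallelSum` are hypothetical names:

```scala
import java.util.concurrent.atomic.DoubleAdder
import scala.concurrent.{Await, Future}
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration._

object PullSum {
  // Each worker repeatedly pulls the next chunk, sums it, and adds the
  // partial sum to the shared total until the iterator is exhausted.
  def parallelSum(chunks: Iterator[Seq[Double]], nWorkers: Int): Double = {
    val total = new DoubleAdder // thread-safe accumulator

    // Hand out the next chunk, or None when nothing is left.
    // Synchronized because Iterator is not thread-safe.
    def nextChunk(): Option[Seq[Double]] = chunks.synchronized {
      if (chunks.hasNext) Some(chunks.next()) else None
    }

    def work(): Unit = {
      var chunk = nextChunk()
      while (chunk.isDefined) {
        total.add(chunk.get.sum)
        chunk = nextChunk()
      }
    }

    val workers = (1 to nWorkers).map(_ => Future(work()))
    Await.result(Future.sequence(workers), 1.minute)
    total.sum()
  }
}
```

Because workers pull only when they are idle, a slow worker never holds up chunks that a faster one could take.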
Applications
1. Optimization Problems
Evaluating objective function, simulation in parallel (Differential Evolution!)
2. Distributed Matrix Operations
Product, transpose, inverse of distributed matrices, quadratic programming in large dimensional space
3. Real-time machine learning
Linear/logistic regression (see 2), Random Forest, Neural network
4. Statistical Inference
Bootstrap, sampling, log-likelihood estimation, Bayesian
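As an illustration of item 1, candidate solutions can be scored concurrently and the best one kept. This is a minimal sketch using standard-library Futures instead of actors; the sphere function is a placeholder objective chosen for the example, and `ParallelEval` and `best` are hypothetical names:

```scala
import scala.concurrent.{Await, Future}
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration._

object ParallelEval {
  // Placeholder objective: the sphere function, minimized at the origin.
  def sphere(x: Vector[Double]): Double = x.map(v => v * v).sum

  // Score every candidate concurrently, then return the one with the
  // lowest objective value together with that value.
  def best(candidates: Seq[Vector[Double]]): (Vector[Double], Double) = {
    val scored = Future.traverse(candidates)(c => Future(c -> sphere(c)))
    Await.result(scored, 1.minute).minBy(_._2)
  }
}
```

In a Differential Evolution loop, a step like this would score one generation of candidates before the next generation is built.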
Thank You! Any Questions?
Email: [email protected] LinkedIn: www.linkedin.com/in/shouhengyi
Zhihu (知乎): 伊首衡