SparkR: Enabling Interactive Data Science at Scale
Shivaram Venkataraman Zongheng Yang
Spark: Fast! Scalable, Flexible
R: Statistics! Packages, Plots
Outline
SparkR API
Live Demo
Design Details
RDD
Parallel Collection
Transformations: map, filter, groupBy, …
Actions: count, collect, saveAsTextFile, …
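The split between transformations and actions can be sketched in a few lines of plain R (an illustrative mock with hypothetical helpers `make_rdd`, `transform`, and `collect_rdd`; this is not SparkR's implementation): transformations only record work to be done, and an action forces the evaluation.

```r
# Lazy-RDD sketch: transformations are queued, actions evaluate.
make_rdd <- function(data) list(data = data, ops = list())

# A transformation just records the function (note: masks base::transform here)
transform <- function(rdd, f) { rdd$ops <- c(rdd$ops, f); rdd }

# An action replays the queued functions over the data
collect_rdd <- function(rdd) {
  Reduce(function(d, f) lapply(d, f), rdd$ops, rdd$data)
}

rdd <- make_rdd(list(1, 2, 3))
rdd <- transform(rdd, function(x) x * 10)   # nothing computed yet
collect_rdd(rdd)                            # evaluates now: list(10, 20, 30)
```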
R + RDD = R2D2
R + RDD = RRDD
lapply, lapplyPartition
groupByKey, reduceByKey, sampleRDD
collect, cache, filter, …
broadcast, includePackage
textFile, parallelize
SparkR – R package for Spark
R → RRDD → Spark
Example: word_count.R

library(SparkR)
lines <- textFile(sc, "hdfs://my_text_file")
words <- flatMap(lines,
                 function(line) { strsplit(line, " ")[[1]] })
wordCount <- lapply(words, function(word) { list(word, 1L) })
counts <- reduceByKey(wordCount, "+", 2L)
output <- collect(counts)
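For reference, the same computation in plain base R on a small in-memory vector (an illustrative local equivalent of the distributed version above, with made-up input lines):

```r
# Local word count, no Spark: split lines into words, then tally.
lines  <- c("to be or", "not to be")
words  <- unlist(lapply(lines, function(line) strsplit(line, " ")[[1]]))
counts <- table(words)
# "to" and "be" each appear twice; "or" and "not" once
```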
Demo: Digit Classification
MNIST
A
b
Minimize \( \| Ax - b \|_2 \); closed-form solution \( x = (A^\top A)^{-1} A^\top b \)
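The normal-equations solution above can be computed directly in base R. Below is a minimal sketch on a tiny made-up design matrix (not the MNIST data from the demo), solving \( (A^\top A) x = A^\top b \) with `solve`:

```r
# Least squares via the normal equations, x = (A^T A)^{-1} A^T b.
# Toy data: column of ones (intercept) plus a single feature.
A <- matrix(c(1, 1,
              1, 2,
              1, 3), nrow = 3, byrow = TRUE)
b <- c(1, 2, 3)

# Solving the linear system is more stable than forming the explicit inverse
x <- solve(t(A) %*% A, t(A) %*% b)
# For this exactly-linear data, x recovers intercept 0 and slope 1
```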
How does this work?
Dataflow
Local:
  R → R Spark Context → JNI → Java Spark Context
Workers:
  Spark Executor → exec R
  Spark Executor → exec R
From http://obeautifulcode.com/R/How-R-Searches-And-Finds-Stuff/
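The worker-side contract in the dataflow above can be mocked in a few lines of base R (an illustrative sketch, not SparkR's actual wire protocol): the executor serializes an R closure together with a data partition, exec's an R process to apply it, and reads back a serialized result.

```r
# Mock of the exec'd R worker: deserialize a task, apply it, reserialize.
run_worker <- function(payload) {
  task   <- unserialize(payload)              # what the exec'd R process does
  result <- lapply(task$partition, task$fun)  # apply the closure per element
  serialize(result, NULL)                     # bytes sent back to the executor
}

# "Executor" side: package up a function and one data partition
payload <- serialize(list(fun = function(x) x * 2L,
                          partition = list(1L, 2L, 3L)),
                     NULL)
out <- unserialize(run_worker(payload))       # list(2L, 4L, 6L)
```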
Pipelined RDD

words <- flatMap(lines, …)
wordCount <- lapply(words, …)

Consecutive transformations are pipelined into a single exec'd R process per Spark Executor, rather than launching a new R process for each step.
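Pipelining amounts to function composition, which can be sketched in pure R (an illustrative assumption-laden mock of the idea, not SparkR internals): consecutive narrow transformations are fused into one function, so the partition crosses the JVM/R boundary only once.

```r
# Fuse two per-partition steps into one, so only one R process runs both.
compose_pipeline <- function(f, g) function(part) g(f(part))

flat_map_step <- function(part) {
  unlist(lapply(part, function(line) strsplit(line, " ")[[1]]))
}
map_step <- function(part) lapply(part, function(word) list(word, 1L))

pipeline <- compose_pipeline(flat_map_step, map_step)
result <- pipeline(list("a b", "c"))
# one pass yields list(list("a", 1L), list("b", 1L), list("c", 1L))
```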
Alpha developer release
One-line install!

library(devtools)
install_github("amplab-extras/SparkR-pkg", subdir = "pkg")
SparkR Implementation
Very similar to PySpark. Spark is easy to extend.
329 lines of Scala code
2079 lines of R code
693 lines of test code in R
Also on GitHub: EC2 setup scripts, all Spark examples, MNIST demo, YARN and Windows support
Developer Community
13 contributors (10 from outside AMPLab)
Collaboration with Alteryx
On the Roadmap
High-level DataFrame API
Integrating Spark's MLlib from R
Merge with Apache Spark
SparkR
Combine scalability & utility
RDD → distributed lists
Run R on clusters
Re-use existing packages
SparkR https://github.com/amplab-extras/SparkR-pkg
Shivaram Venkataraman [email protected] Zongheng Yang [email protected]
SparkR mailing list [email protected]