SparkR: Enabling Interactive Data Science at Scale

"SparkR" presentation by Shivaram Venkataraman and Zongheng Yang

Transcript of SparkR: Enabling Interactive Data Science at Scale

Page 1: SparkR: Enabling Interactive Data Science at Scale

SparkR: Enabling Interactive Data Science at Scale

Shivaram Venkataraman Zongheng Yang

Page 2: SparkR: Enabling Interactive Data Science at Scale

Fast!

Scalable

Flexible

Page 3: SparkR: Enabling Interactive Data Science at Scale

Statistics!

Packages

Plots

Page 4: SparkR: Enabling Interactive Data Science at Scale

Fast!

Scalable

Flexible

Statistics!

Plots

Packages

Page 5: SparkR: Enabling Interactive Data Science at Scale

Outline

SparkR API
Live Demo
Design Details

Page 6: SparkR: Enabling Interactive Data Science at Scale

RDD

Parallel Collection

Transformations: map, filter, groupBy, …

Actions: count, collect, saveAsTextFile, …
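For concreteness, a small sketch of that distinction using the SparkR names introduced on the next slides (the `sc` context and the HDFS path are placeholders, and the availability of filter/count in the R API is assumed from the following slide):

# Sketch only: transformations build the lineage lazily;
# the action at the end is what actually runs the job.
lines <- textFile(sc, "hdfs://some_text_file")          # base RDD (placeholder path)
longLines <- filter(lines, function(l) nchar(l) > 80)   # transformation: nothing runs yet
count(longLines)                                        # action: triggers the computation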

Page 7: SparkR: Enabling Interactive Data Science at Scale

R + RDD = R2D2

Page 8: SparkR: Enabling Interactive Data Science at Scale

R + RDD = RRDD

lapply, lapplyPartition, groupByKey, reduceByKey, sampleRDD, collect, cache, filter, …

broadcast, includePackage

textFile, parallelize
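A hedged sketch of how a few of these could be combined (the function names come from the slide; the exact signatures and the value() accessor for broadcast variables are assumptions based on the amplab-extras/SparkR-pkg documentation):

# Distribute a local vector, broadcast a shared object, and use a package on workers.
rdd <- parallelize(sc, 1:100, 2L)                           # local R vector -> RRDD with 2 partitions
weights <- broadcast(sc, runif(10))                         # ship a shared object to workers once
includePackage(sc, "Matrix")                                # make the Matrix package loadable in closures
scaled <- lapply(rdd, function(i) i * sum(value(weights)))  # read the broadcast value inside the closure
collect(scaled)[1:3]                                        # action: bring results back as an R list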

Page 9: SparkR: Enabling Interactive Data Science at Scale

SparkR – R package for Spark

[Stack diagram: R on top of RRDD on top of Spark]

Page 10: SparkR: Enabling Interactive Data Science at Scale

Example: word_count.R

library(SparkR)
lines <- textFile(sc, "hdfs://my_text_file")

Page 11: SparkR: Enabling Interactive Data Science at Scale

Example: word_count.R

library(SparkR)
lines <- textFile(sc, "hdfs://my_text_file")
words <- flatMap(lines,
                 function(line) {
                   strsplit(line, " ")[[1]]
                 })
wordCount <- lapply(words,
                    function(word) {
                      list(word, 1L)
                    })

Page 12: SparkR: Enabling Interactive Data Science at Scale

Example: word_count.R

library(SparkR)
lines <- textFile(sc, "hdfs://my_text_file")
words <- flatMap(lines,
                 function(line) {
                   strsplit(line, " ")[[1]]
                 })
wordCount <- lapply(words,
                    function(word) {
                      list(word, 1L)
                    })
counts <- reduceByKey(wordCount, "+", 2L)
output <- collect(counts)
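The collected result is an ordinary R list of (word, count) pairs, so it can be inspected with plain local R, for example (a sketch that assumes the structure produced above):

# Convert the list of pairs into a data frame and show the most frequent words.
counts_df <- data.frame(word  = sapply(output, function(p) p[[1]]),
                        count = sapply(output, function(p) as.integer(p[[2]])))
head(counts_df[order(-counts_df$count), ])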

Page 13: SparkR: Enabling Interactive Data Science at Scale

Demo: Digit Classification

Page 14: SparkR: Enabling Interactive Data Science at Scale

MNIST

Page 15: SparkR: Enabling Interactive Data Science at Scale

[Figure: feature matrix A and label vector b]

Minimize $\|Ax - b\|^2$; the closed-form least-squares solution is $x = (A^T A)^{-1} A^T b$.
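A minimal local-R sketch of that closed-form solve with toy data (the actual demo runs this over MNIST with SparkR; accumulating A^T A and A^T b across partitions before a local solve is an assumption about its structure, not something shown on the slide):

# Toy normal-equations solve: x = (A^T A)^{-1} A^T b.
set.seed(42)
A <- matrix(rnorm(100 * 10), nrow = 100)   # stand-in feature matrix (100 examples, 10 features)
b <- matrix(rnorm(100), nrow = 100)        # stand-in labels
x <- solve(t(A) %*% A, t(A) %*% b)         # solve the normal equations
dim(x)                                     # 10 x 1 coefficient vector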

Page 16: SparkR: Enabling Interactive Data Science at Scale

How does this work?

Page 17: SparkR: Enabling Interactive Data Science at Scale

Dataflow

[Diagram: a Local driver node and two Worker nodes]

Page 18: SparkR: Enabling Interactive Data Science at Scale

Dataflow

[Diagram: an R process runs on the Local driver node; two Worker nodes]

Page 19: SparkR: Enabling Interactive Data Science at Scale

Dataflow

[Diagram: on the Local node, the R Spark Context talks to the Java Spark Context over JNI; two Worker nodes]

Page 20: SparkR: Enabling Interactive Data Science at Scale

Dataflow

[Diagram: on the Local node, the R Spark Context drives the Java Spark Context over JNI; each Worker runs a Spark Executor that execs an R process]

Page 21: SparkR: Enabling Interactive Data Science at Scale
Page 22: SparkR: Enabling Interactive Data Science at Scale

From http://obeautifulcode.com/R/How-R-Searches-And-Finds-Stuff/

Page 23: SparkR: Enabling Interactive Data Science at Scale
Page 24: SparkR: Enabling Interactive Data Science at Scale

Dataflow

[Diagram repeated from page 20: R Spark Context over JNI to the Java Spark Context on the Local node; each Worker's Spark Executor execs an R process]

Page 25: SparkR: Enabling Interactive Data Science at Scale

Pipelined RDD

words <- flatMap(lines, …)
wordCount <- lapply(words, …)

[Diagram: two Spark Executors, each exec-ing an R process]

Page 26: SparkR: Enabling Interactive Data Science at Scale

Pipelined RDD

[Diagram: with pipelining, the flatMap and lapply functions run together inside a single R process exec-ed by each Spark Executor]

Page 27: SparkR: Enabling Interactive Data Science at Scale

Alpha developer release

One line install!

install_github("amplab-extras/SparkR-pkg", subdir="pkg")
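A typical first session after installing might look like the following (a sketch; sparkR.init and its master argument are taken from the SparkR-pkg README, so check there for the exact options):

# Start a local SparkR context and run a trivial job.
library(SparkR)
sc <- sparkR.init(master = "local")            # connect to a local Spark master
rdd <- parallelize(sc, 1:1000, 2L)             # distribute a vector over 2 partitions
head(collect(lapply(rdd, function(x) x * 2)))  # double each element and collect the result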

Page 28: SparkR: Enabling Interactive Data Science at Scale

SparkR Implementation

Very similar to PySpark
Spark is easy to extend

329 lines of Scala code
2079 lines of R code
693 lines of test code in R

Page 29: SparkR: Enabling Interactive Data Science at Scale

Also on GitHub:

EC2 setup scripts
All Spark examples
MNIST demo
YARN, Windows support

Page 30: SparkR: Enabling Interactive Data Science at Scale

Developer Community

13 contributors (10 from outside AMPLab)
Collaboration with Alteryx

Page 31: SparkR: Enabling Interactive Data Science at Scale

On the Roadmap

High-level DataFrame API
Integrating Spark's MLlib from R
Merge with Apache Spark

Page 32: SparkR: Enabling Interactive Data Science at Scale

SparkR

Combine scalability & utility

RDD → distributed lists
Run R on clusters
Re-use existing packages

Page 33: SparkR: Enabling Interactive Data Science at Scale

SparkR https://github.com/amplab-extras/SparkR-pkg

Shivaram Venkataraman [email protected]
Zongheng Yang [email protected]

SparkR mailing list [email protected]