SparkR: Enabling Interactive Data Science at Scale


Shivaram Venkataraman, Zongheng Yang

Spark: fast! Scalable. Flexible.

R: statistics! Packages. Plots.

Outline

SparkR API
Live Demo
Design Details

RDD: Parallel Collection

Transformations: map, filter, groupBy, …
Actions: count, collect, saveAsTextFile, …
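A minimal sketch of the lazy/eager split using the names above (assumes an existing SparkContext sc; setup is shown on the install slide later, and exact signatures in the alpha package may differ):

nums  <- parallelize(sc, 1:100)                  # parallel collection
evens <- filter(nums, function(x) x %% 2 == 0)   # transformation: builds lineage, runs nothing
count(evens)                                     # action: triggers the job, returns 50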

R + RDD = R2D2

R + RDD = RRDD

lapply, lapplyPartition
groupByKey, reduceByKey, sampleRDD
collect, cache, filter, …
broadcast, includePackage
textFile, parallelize
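broadcast and includePackage are the R-specific additions; here is a hedged sketch of their use, assuming an RDD words like the one in the word-count example below and a value() accessor for broadcast variables (per the alpha API, which may differ in detail):

# Ship a lookup table to every worker once, rather than inside each closure.
stopwords <- broadcast(sc, c("the", "a", "of", "and"))

# Make an installed R package loadable inside worker-side closures.
includePackage(sc, Matrix)

keep <- filter(words, function(w) !(w %in% value(stopwords)))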

SparkR – R package for Spark

Stack: R → RRDD → Spark

Example: word_count.R

library(SparkR)

lines <- textFile(sc, "hdfs://my_text_file")
words <- flatMap(lines,
                 function(line) {
                   strsplit(line, " ")[[1]]
                 })
wordCount <- lapply(words,
                    function(word) {
                      list(word, 1L)
                    })
counts <- reduceByKey(wordCount, "+", 2L)
output <- collect(counts)

Demo: Digit Classification

MNIST

[Figure: feature matrix A and label vector b]

Minimize ‖Ax − b‖²  ⇒  x = (AᵀA)⁻¹Aᵀb
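A sketch of how that closed-form solve can be distributed (the blocks RDD of row chunks is hypothetical; the actual demo script lives in the repo's examples):

# Each element of `blocks` is assumed to be list(A = <matrix chunk>, b = <labels>).
partials <- lapply(blocks, function(blk) {
  list(crossprod(blk$A),          # A_i' A_i, a small p x p matrix
       crossprod(blk$A, blk$b))   # A_i' b_i, a p-vector
})
# Sum the small per-chunk pieces on the driver, then solve locally.
sums <- Reduce(function(u, v) list(u[[1]] + v[[1]], u[[2]] + v[[2]]),
               collect(partials))
x <- solve(sums[[1]], sums[[2]])  # x = (A'A)^{-1} A'b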

How does this work?

Dataflow

[Architecture diagram] Local side: the R process creates an R Spark Context, which drives a Java Spark Context through JNI. Worker side: each Spark Executor execs an R process to run the user's R code.

From http://obeautifulcode.com/R/How-R-Searches-And-Finds-Stuff/
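Conceptually, the exec'd R worker does something like the following (an illustrative sketch of the protocol, not SparkR's actual worker code):

# The JVM hands the worker a serialized closure plus one serialized partition;
# the worker applies the closure element-wise and serializes the result back.
run_r_worker <- function(closure_bytes, partition_bytes) {
  closure <- unserialize(closure_bytes)    # user function shipped from the driver
  part    <- unserialize(partition_bytes)  # one partition of the RDD
  serialize(lapply(part, closure), NULL)   # raw bytes returned to the JVM
}

# Local illustration of the round trip:
out <- run_r_worker(serialize(function(x) x * 2, NULL),
                    serialize(as.list(1:3), NULL))
unserialize(out)  # list(2, 4, 6)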


Pipelined RDD

words <- flatMap(lines, …)
wordCount <- lapply(words, …)

Consecutive R transformations are fused: the Spark Executor execs a single R process per partition and runs both functions in one pass, rather than paying an exec and serialization round trip for each operation.
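In plain R, the fusion amounts to composing the per-partition functions (illustrative only; SparkR's pipelined RDD performs this composition internally):

# Run a chain of per-partition functions in a single pass, in one R process.
pipeline <- function(...) {
  stages <- list(...)
  function(part) Reduce(function(acc, stage) stage(acc), stages, part)
}

word_count_stage <- pipeline(
  function(lines) unlist(lapply(lines, function(l) strsplit(l, " ")[[1]])),
  function(words) lapply(words, function(w) list(w, 1L))
)

word_count_stage(list("to be or not to be"))  # both stages, one process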

Alpha developer release

One-line install (install_github comes from the devtools package):

install_github("amplab-extras/SparkR-pkg", subdir = "pkg")
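After installing, a session looks like this (a minimal sketch; sparkR.init() and the call signatures follow the alpha package's README and may differ in later releases):

library(SparkR)
sc <- sparkR.init(master = "local")      # connect to a local Spark instance
rdd <- parallelize(sc, 1:10)             # distribute a small vector
collect(lapply(rdd, function(x) x * x))  # squares, returned as a local list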

SparkR Implementation

Very similar to PySpark; Spark is easy to extend.

329 lines of Scala code
2079 lines of R code
693 lines of test code in R

Also on GitHub: EC2 setup scripts, all Spark examples, the MNIST demo, YARN and Windows support.

Developer Community

13 contributors (10 from outside AMPLab)
Collaboration with Alteryx

On the Roadmap

High-level DataFrame API
Integrating Spark's MLlib from R
Merge with Apache Spark

SparkR

Combine scalability & utility:

RDD → distributed lists
Run R on clusters
Re-use existing packages

SparkR https://github.com/amplab-extras/SparkR-pkg

Shivaram Venkataraman  shivaram@cs.berkeley.edu
Zongheng Yang  zongheng.y@gmail.com

SparkR mailing list sparkr-dev@googlegroups.com