SparkR: Enabling Interactive Data Science at Scale


Shivaram Venkataraman, Zongheng Yang

Spark: fast! Scalable. Flexible.

R: statistics! Packages. Plots.

Outline

SparkR API
Live Demo
Design Details

RDD: Parallel Collection

Transformations: map, filter, groupBy, …
Actions: count, collect, saveAsTextFile, …
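A minimal sketch of the lazy/eager split using the names above (assumes an existing SparkContext sc; setup is shown on the install slide later, and exact signatures in the alpha package may differ):

nums  <- parallelize(sc, 1:100)                  # parallel collection
evens <- filter(nums, function(x) x %% 2 == 0)   # transformation: builds lineage, runs nothing
count(evens)                                     # action: triggers the job, returns 50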

R + RDD = R2D2

R + RDD = RRDD

lapply, lapplyPartition
groupByKey, reduceByKey, sampleRDD
collect, cache, filter, …
broadcast, includePackage
textFile, parallelize
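broadcast and includePackage are the R-specific additions; here is a hedged sketch of their use, assuming an RDD words like the one in the word-count example below and a value() accessor for broadcast variables (per the alpha API, which may differ in detail):

# Ship a lookup table to every worker once, rather than inside each closure.
stopwords <- broadcast(sc, c("the", "a", "of", "and"))

# Make an installed R package loadable inside worker-side closures.
includePackage(sc, Matrix)

keep <- filter(words, function(w) !(w %in% value(stopwords)))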

SparkR – R package for Spark

Stack: R → RRDD → Spark

Example: word_count.R

library(SparkR)

lines <- textFile(sc, "hdfs://my_text_file")
words <- flatMap(lines,
                 function(line) {
                   strsplit(line, " ")[[1]]
                 })
wordCount <- lapply(words,
                    function(word) {
                      list(word, 1L)
                    })
counts <- reduceByKey(wordCount, "+", 2L)
output <- collect(counts)

Demo: Digit Classification

MNIST

[Figure: feature matrix A and label vector b]

Minimize ‖Ax − b‖²  ⇒  x = (AᵀA)⁻¹Aᵀb
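A sketch of how that closed-form solve can be distributed (the blocks RDD of row chunks is hypothetical; the actual demo script lives in the repo's examples):

# Each element of `blocks` is assumed to be list(A = <matrix chunk>, b = <labels>).
partials <- lapply(blocks, function(blk) {
  list(crossprod(blk$A),          # A_i' A_i, a small p x p matrix
       crossprod(blk$A, blk$b))   # A_i' b_i, a p-vector
})
# Sum the small per-chunk pieces on the driver, then solve locally.
sums <- Reduce(function(u, v) list(u[[1]] + v[[1]], u[[2]] + v[[2]]),
               collect(partials))
x <- solve(sums[[1]], sums[[2]])  # x = (A'A)^{-1} A'b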

How does this work?

Dataflow

[Architecture diagram] Local side: the R process creates an R Spark Context, which drives a Java Spark Context through JNI. Worker side: each Spark Executor execs an R process to run the user's R code.

From http://obeautifulcode.com/R/How-R-Searches-And-Finds-Stuff/
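Conceptually, the exec'd R worker does something like the following (an illustrative sketch of the protocol, not SparkR's actual worker code):

# The JVM hands the worker a serialized closure plus one serialized partition;
# the worker applies the closure element-wise and serializes the result back.
run_r_worker <- function(closure_bytes, partition_bytes) {
  closure <- unserialize(closure_bytes)    # user function shipped from the driver
  part    <- unserialize(partition_bytes)  # one partition of the RDD
  serialize(lapply(part, closure), NULL)   # raw bytes returned to the JVM
}

# Local illustration of the round trip:
out <- run_r_worker(serialize(function(x) x * 2, NULL),
                    serialize(as.list(1:3), NULL))
unserialize(out)  # list(2, 4, 6)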


Pipelined RDD

words <- flatMap(lines, …)
wordCount <- lapply(words, …)

Consecutive R transformations are fused: the Spark Executor execs a single R process per partition and runs both functions in one pass, rather than paying an exec and serialization round trip for each operation.
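In plain R, the fusion amounts to composing the per-partition functions (illustrative only; SparkR's pipelined RDD performs this composition internally):

# Run a chain of per-partition functions in a single pass, in one R process.
pipeline <- function(...) {
  stages <- list(...)
  function(part) Reduce(function(acc, stage) stage(acc), stages, part)
}

word_count_stage <- pipeline(
  function(lines) unlist(lapply(lines, function(l) strsplit(l, " ")[[1]])),
  function(words) lapply(words, function(w) list(w, 1L))
)

word_count_stage(list("to be or not to be"))  # both stages, one process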

Alpha developer release

One-line install (install_github comes from the devtools package):

install_github("amplab-extras/SparkR-pkg", subdir = "pkg")
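After installing, a session looks like this (a minimal sketch; sparkR.init() and the call signatures follow the alpha package's README and may differ in later releases):

library(SparkR)
sc <- sparkR.init(master = "local")      # connect to a local Spark instance
rdd <- parallelize(sc, 1:10)             # distribute a small vector
collect(lapply(rdd, function(x) x * x))  # squares, returned as a local list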

SparkR Implementation

Very similar to PySpark; Spark is easy to extend.

329 lines of Scala code
2079 lines of R code
693 lines of test code in R

Also on GitHub: EC2 setup scripts, all Spark examples, the MNIST demo, YARN and Windows support.

Developer Community

13 contributors (10 from outside AMPLab)
Collaboration with Alteryx

On the Roadmap

High-level DataFrame API
Integrating Spark's MLlib from R
Merge with Apache Spark

SparkR

Combine scalability & utility:

RDD → distributed lists
Run R on clusters
Re-use existing packages

SparkR https://github.com/amplab-extras/SparkR-pkg

Shivaram Venkataraman  shivaram@cs.berkeley.edu
Zongheng Yang  zongheng.y@gmail.com

SparkR mailing list sparkr-dev@googlegroups.com