
Parallelizing Existing R Packages with SparkR

Hossein Falaki @mhfalaki

About me

• Former Data Scientist at Apple Siri
• Software Engineer at Databricks
• Using Apache Spark since version 0.6
• Developed the first version of the Apache Spark CSV data source
• Worked on SparkR and the Databricks R Notebook feature
• Currently focusing on the R experience at Databricks

What is SparkR?

An R package distributed with Apache Spark (soon on CRAN):
• Provides an R frontend to Spark
• Exposes Spark DataFrames (inspired by R and pandas)
• Convenient interoperability between R data.frames and Spark DataFrames

Spark brings distributed, robust processing, data sources, and off-memory data structures; R brings a dynamic environment, interactivity, packages, and visualization.

SparkR architecture

[Diagram: the R driver talks to the Spark driver JVM through the RBackend; the driver JVM coordinates the JVM workers, which read from the data sources.]

SparkR architecture (since 2.0)

[Diagram: as before, the R driver talks to the Spark driver JVM through the RBackend; since 2.0 each JVM worker also spawns R worker processes, so user R code can run next to the data read from the data sources.]

Overview of SparkR API

IO: read.df / write.df / createDataFrame / collect
Caching: cache / persist / unpersist / cacheTable / uncacheTable
SQL: sql / table / saveAsTable / registerTempTable / tables
MLlib: glm / kmeans / naive Bayes / survival regression
DataFrame API: select / subset / groupBy / head / avg / column / dim
UDF functionality (since 2.0): spark.lapply / dapply / gapply / dapplyCollect

http://spark.apache.org/docs/latest/api/R/

Overview of SparkR API :: Session

The Spark session is your interface to Spark functionality in R:
• SparkR DataFrames are implemented on top of Spark SQL tables
• All DataFrame operations go through a SQL optimizer (Catalyst)
• Since 2.0, sqlContext is wrapped in a new object, the SparkSession

> spark <- sparkR.session()

All SparkR functions work if you pass them a session explicitly, or they will assume the existing session.
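For example, a session can be started with explicit options; the master, app name, and memory setting below are illustrative:

library(SparkR)

# Start (or connect to) a Spark session; all values shown are illustrative.
spark <- sparkR.session(
  master      = "local[*]",
  appName     = "SparkR-example",
  sparkConfig = list(spark.driver.memory = "2g")
)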

Reading/Writing data

[Diagram: read.df() and write.df() calls go from the R driver through the RBackend to the JVM, and the JVM workers read from and write to HDFS, S3, and other storage.]
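A minimal sketch of both calls; the paths and formats are illustrative:

# Read a JSON dataset from distributed storage into a SparkDataFrame
logs <- read.df("s3a://my-bucket/data/logs", source = "json")

# Write it back out as Parquet
write.df(logs, path = "s3a://my-bucket/data/logs-parquet",
         source = "parquet", mode = "overwrite")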

Moving data between R and JVM

[Diagram: SparkR::createDataFrame() moves a local R data.frame through the RBackend into the JVM as a SparkDataFrame; SparkR::collect() brings a SparkDataFrame back to the R driver.]
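A minimal sketch of the two directions, using the built-in faithful dataset:

# Local R data.frame -> distributed SparkDataFrame
sparkDF <- createDataFrame(faithful)

# SparkDataFrame -> local R data.frame (only sensible for small results)
localDF <- collect(sparkDF)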

Overview of SparkR API :: DataFrame API

SparkR DataFrames behave much like R data.frames:

> sparkDF$newCol <- sparkDF$col + 1
> subsetDF <- sparkDF[, c("date", "type")]
> recentData <- subset(sparkDF, sparkDF$date == "2015-10-24")
> firstRow <- head(sparkDF, 1)
> names(subsetDF) <- c("Date", "Type")
> dim(recentData)
> head(count(group_by(subsetDF, "Date")))

Overview of SparkR API :: SQL

You can register a DataFrame as a table and query it in SQL:

> logs <- read.df("data/logs", source = "json")
> registerTempTable(logs, "logsTable")
> errorsByCode <- sql("select count(*) as num, type from logsTable where type = 'error' group by code order by date desc")

> reviewsDF <- tableToDF("reviewsTable")
> registerTempTable(filter(reviewsDF, reviewsDF$rating == 5), "fiveStars")

Moving between languages

R and Scala share the same Spark tables, so a DataFrame registered as a temp table in R can be picked up from Scala (and vice versa).

In R:
> df <- read.df(...)
> wiki <- filter(df, ...)
> registerTempTable(wiki, "wiki")

In Scala:
val wiki = table("wiki")
val parsed = wiki.map { case Row(_, _, text: String, _, _) => text.split(' ') }
val model = KMeans.train(parsed)


SparkR UDF API

• spark.lapply(): runs a function over a list of elements
• dapply() / dapplyCollect(): applies a function to each partition of a SparkDataFrame
• gapply() / gapplyCollect(): applies a function to each group within a SparkDataFrame

spark.lapply

The simplest SparkR UDF pattern. For each element of a list, Spark:

1. Sends the function to an R worker
2. Executes the function
3. Returns the results of all workers as a list to the R driver

> spark.lapply(1:100, function(x) { runBootstrap(x) })
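runBootstrap above is a placeholder from the slide; a minimal self-contained sketch could bootstrap the mean of a small vector instead:

library(SparkR)
sparkR.session()

x <- rnorm(1000)   # small sample; large data should not be packed into the closure

# 100 bootstrap replicates of the mean, one per list element
bootMeans <- spark.lapply(1:100, function(i) {
  mean(sample(x, length(x), replace = TRUE))
})
head(unlist(bootMeans))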

spark.lapply control flow

[Diagram: spark.lapply control flow between the R driver, the driver JVM, and the worker JVMs with their local R workers]

1. Serialize the R closure on the R driver
2. Transfer it over the local socket to the driver JVM
3. Transfer the serialized closure over the network to the worker JVMs
4. Transfer it over the local socket to each R worker
5. De-serialize the closure
6. Execute it and serialize the result
7. Transfer the result over the local socket to the worker JVM
8. Transfer the serialized result over the network to the driver JVM
9. Transfer it over the local socket to the R driver
10. De-serialize the result

dapply

For each partition of a SparkDataFrame, Spark:

1. Collects the partition as an R data.frame
2. Sends the R function to the R worker
3. Executes the function

dapply(sparkDF, func, schema) combines the results as a SparkDataFrame with the given schema.
dapplyCollect(sparkDF, func) combines the results as a local R data.frame.
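A minimal sketch of both variants on the built-in faithful dataset; the derived waiting_secs column and its schema are illustrative:

df <- createDataFrame(faithful)

# dapply: the output schema must be declared up front
schema <- structType(structField("eruptions", "double"),
                     structField("waiting", "double"),
                     structField("waiting_secs", "double"))
result <- dapply(df, function(x) {
  cbind(x, waiting_secs = x$waiting * 60)   # x is one partition as a plain data.frame
}, schema)
head(collect(result))

# dapplyCollect: no schema; the results come straight back as an R data.frame
localResult <- dapplyCollect(df, function(x) {
  cbind(x, waiting_secs = x$waiting * 60)
})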

dapply control & data flow

[Diagram: dapply control and data flow — each partition is serialized and transferred from its worker JVM to the local R worker over a local socket, the function runs on the data.frame, and the result is serialized and transferred back to the worker JVM; nothing is collected to the driver.]

dapplyCollect control & data flow

[Diagram: dapplyCollect control and data flow — same as dapply, except the results are additionally transferred over the cluster network to the driver JVM and de-serialized into an R data.frame on the R driver.]

gapply

Groups a SparkDataFrame on one or more columns; for each group, Spark:

1. Collects the group as an R data.frame
2. Sends the R function to the R worker
3. Executes the function

gapply(sparkDF, cols, func, schema) combines the results as a SparkDataFrame with the given schema.
gapplyCollect(sparkDF, cols, func) combines the results as a local R data.frame.
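A minimal sketch, again on faithful: group by waiting time and compute the longest eruption per group; the schema and column names are illustrative:

schema <- structType(structField("waiting", "double"),
                     structField("max_eruption", "double"))
maxEruptions <- gapply(df, "waiting", function(key, x) {
  # key holds the grouping value(s); x is the group's rows as a data.frame
  data.frame(key, max(x$eruptions))
}, schema)
head(collect(maxEruptions))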

gapply control & data flow

[Diagram: gapply control and data flow — the data is first shuffled across the cluster to form the groups, then each group is serialized and transferred from its worker JVM to the local R worker, the function runs, and the result is serialized and transferred back to the worker JVM.]

dapply vs. gapply

                          gapply                              dapply
signature                 gapply(df, cols, func, schema)      dapply(df, func, schema)
                          gapply(gdf, func, schema)
user function signature   function(key, data)                 function(data)
data partitioning         controlled by grouping              not controlled

Parallelizing data

• Do not use spark.lapply() to distribute large data sets
• Do not pack data into the closure
• Watch for skew in the data: are the partitions evenly sized?
• Auxiliary data can be joined with the input DataFrame (see the sketch below), or distributed to all workers through the file system
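A hedged sketch of the join approach for a small lookup table; the logs DataFrame is reused from the earlier example and the column names are illustrative:

# Small auxiliary lookup table, created from a local data.frame
lookup <- createDataFrame(data.frame(type = c("error", "warn"),
                                     severity = c(3, 2)))

# Join it with the (large) input DataFrame instead of capturing it in a UDF closure
joined <- join(logs, lookup, logs$type == lookup$type, "left_outer")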

Packages on workers

• SparkR closure capture does not include packages
• You need to load packages on each worker, inside your function (see the sketch below)
• If they are not installed, install packages on the workers out-of-band
• spark.lapply() can be used to install packages
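A hedged sketch, assuming the forecast package as the dependency; the package name and the number of list elements are illustrative:

# Install the package across the cluster; spark.lapply gives no guarantee of one
# task per worker, so use more elements than workers to cover them all.
invisible(spark.lapply(1:32, function(i) {
  if (!requireNamespace("forecast", quietly = TRUE)) {
    install.packages("forecast", repos = "https://cloud.r-project.org")
  }
}))

# Inside a UDF, load the package explicitly; the closure does not carry it.
results <- spark.lapply(1:10, function(i) {
  library(forecast)
  # ... use the package here ...
  i
})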

Debugging user code

1. Verify your code on the driver
2. Interactively execute the code on the cluster; when an R worker fails, the Spark driver throws an exception containing the R error text
3. Inspect the failure reason of the failed job in the Spark UI
4. Inspect the stdout/stderr of the workers

Demo

Notebooks available at:
• http://bit.ly/2krYMwC
• http://bit.ly/2ltLVKs

Thank you!