
Parallelizing Existing R Packages with SparkR

Hossein Falaki @mhfalaki

About me

• Former Data Scientist at Apple Siri
• Software Engineer at Databricks
• Have been using Apache Spark since version 0.6
• Developed the first version of the Apache Spark CSV data source
• Worked on SparkR & the Databricks R Notebook feature
• Currently focusing on the R experience at Databricks

2

What is SparkR?

An R package distributed with Apache Spark:

• Provides an R frontend to Spark
• Exposes Spark DataFrames (inspired by data frames in R and pandas)
• Convenient interoperability between R and Spark DataFrames

3

Spark brings distributed/robust processing, data sources, and off-memory data structures; R adds a dynamic environment, interactivity, packages, and visualization.

SparkR architecture

4

[Diagram: the Spark driver pairs an R process with a JVM through the RBackend; the worker nodes run JVMs only, which read from the data sources.]

SparkR architecture (since 2.0)

5

[Diagram: since 2.0, the driver still pairs R with a JVM through the RBackend, but each worker JVM can also launch R processes, so R code runs on the executors as well as the driver.]

Overview of SparkR API

IO: read.df / write.df / createDataFrame / collect
Caching: cache / persist / unpersist / cacheTable / uncacheTable
SQL: sql / table / saveAsTable / registerTempTable / tables
MLlib: glm / kmeans / naive Bayes / survival regression
DataFrame API: select / subset / groupBy / head / avg / column / dim
UDF functionality (since 2.0): spark.lapply / dapply / gapply / dapplyCollect

6

http://spark.apache.org/docs/latest/api/R/
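
A minimal sketch of how a few of these calls fit together in a session, assuming Spark 2.x and R's built-in faithful dataset (the snippet itself is not from the deck):

library(SparkR)
sparkR.session()                              # start (or attach to) a Spark session

df <- createDataFrame(faithful)               # ship a local data.frame to Spark
head(select(df, "eruptions"))                 # DataFrame API
createOrReplaceTempView(df, "faithful")       # 2.x replacement for registerTempTable
long <- sql("SELECT waiting FROM faithful WHERE eruptions > 3")
collect(long)                                 # bring the result back as an R data.frame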

SparkR UDF API

7

• spark.lapply(): runs a function over a list of elements
• dapply() / dapplyCollect(): applies a function to each partition of a SparkDataFrame
• gapply() / gapplyCollect(): applies a function to each group within a SparkDataFrame

spark.lapply

8

Simplest SparkR UDF pattern. For each element of a list:

1. Sends the function to an R worker
2. Executes the function
3. Returns the results of all workers to the R driver as a list

spark.lapply(1:100, function(x) { runBootstrap(x) })
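
The runBootstrap() call above is just a placeholder from the slide. A self-contained sketch of the same pattern (the bootstrap body below is illustrative, not from the deck):

library(SparkR)
sparkR.session()

boot_means <- spark.lapply(1:100, function(i) {
  set.seed(i)                            # one reproducible replicate per list element
  x <- rnorm(500, mean = 10)             # simulate a sample on the worker
  mean(sample(x, replace = TRUE))        # bootstrap estimate of the mean
})
length(boot_means)                       # a plain R list of 100 results on the driver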

spark.lapply control flow

9

[Diagram: the R driver and each R worker are paired with a local JVM; the closure travels from the driver R process through the JVMs to the worker R processes, and the results travel back the same way.]

1. Serialize R closure
2. Transfer over local socket (driver R to driver JVM)
3. Transfer serialized closure over the network
4. Transfer over local socket (worker JVM to worker R)
5. Deserialize closure
6. Serialize result
7. Transfer over local socket
8. Transfer serialized result over the network
9. Transfer over local socket
10. Deserialize result

dapply

10

For each partition of a Spark DataFrame:

1. Collects each partition as an R data.frame
2. Sends the R function to the R worker
3. Executes the function

dapply(sparkDF, func, schema)

Combines the results as a SparkDataFrame with the provided schema.

dapplyCollect(sparkDF, func)

Combines the results as a local R data.frame.
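
A minimal sketch of both calls, assuming the built-in faithful dataset (the added column is illustrative, not from the deck):

library(SparkR)
df <- createDataFrame(faithful)

# dapply needs the output schema up front
schema <- structType(structField("eruptions", "double"),
                     structField("waiting", "double"),
                     structField("waiting_hours", "double"))

hours <- dapply(df, function(pdf) {
  # pdf is an ordinary R data.frame holding one partition
  pdf$waiting_hours <- pdf$waiting / 60
  pdf
}, schema)
head(hours)

# dapplyCollect skips the schema and returns a single local R data.frame
partition_sizes <- dapplyCollect(df, function(pdf) {
  data.frame(rows = nrow(pdf))           # one row per partition
})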

dapply control & data flow

11

[Diagram: for each partition, the input data is serialized in the worker JVM, transferred over a local socket to the R worker, and deserialized; the result data is serialized in the R worker and transferred back the same way. Driver and workers communicate over the cluster network.]

dapplyCollect control & data flow

12

[Diagram: same flow as dapply, except the results are also transferred over the cluster network to the driver and deserialized there into a local R data.frame.]

gapply

13

Groups a Spark DataFrame on one or more columns:

1. Collects each group as an R data.frame
2. Sends the R function to the R worker
3. Executes the function

gapply(sparkDF, cols, func, schema)

Combines the results as a SparkDataFrame with the provided schema.

gapplyCollect(sparkDF, cols, func)

Combines the results as a local R data.frame.
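
A minimal sketch of both calls, assuming the built-in mtcars dataset and grouping by its cyl column (the names are illustrative, not from the deck):

library(SparkR)
df <- createDataFrame(mtcars)

schema <- structType(structField("cyl", "double"),
                     structField("avg_mpg", "double"))

avg_by_cyl <- gapply(df, "cyl", function(key, pdf) {
  # key holds the grouping value(s); pdf holds that group's rows as an R data.frame
  data.frame(cyl = key[[1]], avg_mpg = mean(pdf$mpg))
}, schema)
head(collect(avg_by_cyl))

# gapplyCollect skips the schema and returns a local R data.frame
gapplyCollect(df, "cyl", function(key, pdf) {
  data.frame(cyl = key[[1]], avg_mpg = mean(pdf$mpg))
})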

gapply control & data flow

14

[Diagram: same flow as dapply, with an additional shuffle of the data over the cluster network so that each group's rows are brought together before being passed to an R worker.]

dapply vs. gapply

15

                          gapply                             dapply
signature                 gapply(df, cols, func, schema)     dapply(df, func, schema)
                          gapply(gdf, func, schema)
user function signature   function(key, data)                function(data)
data partitioning         controlled by grouping             not controlled

Parallelizing data

• Do not use spark.lapply() to distribute large data sets
• Do not pack data in the closure
• Watch for skew in data
  – Are partitions evenly sized?
• Auxiliary data (see the sketch below)
  – Can be joined with the input DataFrame
  – Can be distributed to all the workers
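
A small sketch of the two auxiliary-data options; the input and lookup tables below are made up for illustration:

library(SparkR)
inputDF <- createDataFrame(data.frame(id = 1:4, code = c("A", "B", "A", "B")))
lookup  <- data.frame(code = c("A", "B"), label = c("alpha", "beta"))

# Option 1: promote the auxiliary data to a SparkDataFrame and join it
lookupDF <- createDataFrame(lookup)
joined   <- join(inputDF, lookupDF, inputDF$code == lookupDF$code, "left_outer")

# Option 2: leave it as a plain R object; the closure ships it to every worker.
# Only reasonable because `lookup` is tiny; never pack large data in a closure.
merged <- dapplyCollect(inputDF, function(pdf) {
  merge(pdf, lookup, by = "code", all.x = TRUE)
})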

16

Packages on workers

• SparkR closure capture does not include packages
• You need to load the packages on each worker, inside your function
• If a package is not already installed, install it on the workers out-of-band
• spark.lapply() can be used to install packages (see the sketch below)
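
A hedged sketch of that pattern; the package name, repository, and executor count are assumptions for illustration:

library(SparkR)

# Out-of-band install: run an install step on the workers via spark.lapply.
# (One element per executor is an approximation; elements are not guaranteed
# to reach every executor.)
spark.lapply(1:8, function(i) {
  if (!requireNamespace("forecast", quietly = TRUE)) {
    install.packages("forecast", repos = "https://cloud.r-project.org")
  }
  TRUE
})

# Inside the UDF, load the package explicitly; closures do not carry packages.
df <- createDataFrame(faithful)
dapplyCollect(df, function(pdf) {
  library(forecast)                      # loaded on the worker, not the driver
  data.frame(rows = nrow(pdf))
})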

17

Debugging user code

1. Verify your code on the driver
2. Interactively execute the code on the cluster
   – When an R worker fails, the Spark driver throws an exception with the R error text (a small sketch follows after this list)
3. Inspect the failure reason of the failed job in the Spark UI
4. Inspect the stdout/stderr of the workers
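
For step 2, one convenient trick (my addition, not from the deck) is to wrap the UDF body in tryCatch so that a failing partition reports its error as data instead of aborting the whole job:

library(SparkR)
df <- createDataFrame(faithful)

diagnostics <- dapplyCollect(df, function(pdf) {
  tryCatch({
    # the real per-partition computation would go here
    data.frame(rows = nrow(pdf), error = NA_character_)
  }, error = function(e) {
    data.frame(rows = NA_integer_, error = conditionMessage(e))
  })
})
diagnostics                              # inspect which partitions failed and why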

18

Demo

19

http://bit.ly/2krYMwC
http://bit.ly/2ltLVKs

Thank you!