ScootR: Scaling R Dataframes on Dataflow Systems

Andreas Kunft 1, Lukas Stadler 2, Daniele Bonetta 2, Cosmin Basca 2, Jens Meiners 1, Sebastian Breß 1, Tilmann Rabl 1, Juan Fumero 3, Volker Markl 2

1 Technische Universität Berlin, 2 Oracle Labs, 3 University of Manchester


R has gained increasing traction

• Dynamically typed, open-source language

• Rich support for analytics & statistics

But

• Standalone R is not well suited for out-of-core data loads


Analytics pipelines often work on large amounts of raw data

• Dataflow engines (DF), e.g., Apache Flink and Spark, scale out

• Provide rich support for user-defined functions (UDFs)

But

• R users are often unfamiliar with DF APIs and concepts


Combine the usability of R with the scalability of dataflow engines

- Goals
- From function calls to an operator graph
- Approaches to execute R UDFs
- Our approach: ScootR
- Evaluation


GOALS

1. Provide a data.frame API with a natural feel

2. Achieve performance comparable to the native dataflow API

df$km <- df$miles * 1.6

df <- select(df, count = flights, distance)

df <- apply(df, func)


From function calls to an operator graph


MAPPING DATA TYPES

• R data.frame(T1,T2,…,TN) as Flink DataSet<TupleN<T1,T2,…,TN>>

• E.g., data.frame(integer, character) as DataSet<Tuple2<Integer, String>>

(N columns map to N fields; the DataSet has a fixed element type, a Tuple with arity N)
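The type mapping above can be sketched in plain R. This is an illustrative helper, not ScootR's implementation: the names `r_to_java` and `flink_type_of` are hypothetical, and only the integer/Integer and character/String pairs come from the slide; the other entries are assumptions.

```r
# Hypothetical lookup from R column types to Java field types.
# Only integer -> Integer and character -> String appear on the slide;
# the remaining entries are assumptions added for illustration.
r_to_java <- c(
  integer   = "Integer",
  character = "String",
  numeric   = "Double",   # assumption
  logical   = "Boolean"   # assumption
)

# Build the Flink element type for a data.frame schema given as a
# character vector of R type names, e.g. c("integer", "character").
flink_type_of <- function(col_types) {
  unknown <- setdiff(col_types, names(r_to_java))
  if (length(unknown) > 0) {
    stop("unsupported column type(s): ", paste(unknown, collapse = ", "))
  }
  fields <- unname(r_to_java[col_types])
  # N columns become N tuple fields, so the arity is the column count.
  sprintf("DataSet<Tuple%d<%s>>", length(fields), paste(fields, collapse = ", "))
}

# Prints "DataSet<Tuple2<Integer, String>>"
print(flink_type_of(c("integer", "character")))
```

Because the arity and field types are fixed once the schema is known, the element type can be derived before any data is read.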


MAPPING R FUNCTIONS TO OPERATORS

• Functions on data.frames lazily build an operator graph

1. Functions w/o UDFs are handled before execution, e.g., a select function is mapped to a project operator:

   select(df$id, df$arrival)  to  ds.project(1, 3)


2. Functions w/ UDFs call R functions during execution
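The two cases above can be sketched as a small lazy plan builder in base R. Everything here is hypothetical (`lazy_df`, `lazy_select`, `lazy_apply`, `describe_plan`, and the three-column schema); it shows only the general idea that calls record operators instead of touching data, and it is not ScootR's actual API.

```r
# Minimal lazy data.frame handle: it stores a schema and a list of pending
# operators instead of data. All names here are hypothetical.
lazy_df <- function(columns, ops = list()) {
  structure(list(columns = columns, ops = ops), class = "lazy_df")
}

# select() without a UDF: resolved to a project operator while the plan is
# built. Indices are 1-based positions in the current schema (an assumption;
# the engine's own indexing may differ).
lazy_select <- function(df, ...) {
  wanted <- c(...)
  idx <- match(wanted, df$columns)
  if (anyNA(idx)) stop("unknown column: ", wanted[is.na(idx)][1])
  op <- list(kind = "project", fields = idx)
  lazy_df(wanted, c(df$ops, list(op)))
}

# apply() with a UDF: only the function is recorded; it would be called
# per row once the plan runs.
lazy_apply <- function(df, udf) {
  op <- list(kind = "map", udf = udf)
  lazy_df(df$columns, c(df$ops, list(op)))
}

# Render the recorded plan as text; nothing is executed.
describe_plan <- function(df) {
  vapply(df$ops, function(op) {
    if (op$kind == "project") {
      sprintf("project(%s)", paste(op$fields, collapse = ", "))
    } else {
      "map(<R UDF>)"
    }
  }, character(1))
}

df <- lazy_df(c("id", "carrier", "arrival"))  # hypothetical schema
df <- lazy_select(df, "id", "arrival")
df <- lazy_apply(df, function(row) row)
print(describe_plan(df))  # "project(1, 3)" "map(<R UDF>)"
```

The select call is fully resolved while the plan is built, whereas the UDF is only stored and would run later inside the engine.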


Approaches to execute R UDFs

INTER PROCESS COMMUNICATION (IPC)

[Figure: IPC architecture. A client, a driver, and two workers; each worker runs two tasks, and each task is paired with a separate R process.]

1. Communication + serialization (R <> Java)

2. JVM and R compete for memory

[Figure: inside a worker, the JVM-side filter operator exchanges data with a separate R process (1), and the JVM and the R process compete for the worker's memory (2). The R process evaluates the UDF below.]

filter <- function(df) {
  df$language == 'english'
}


SOURCE-TO-SOURCE TRANSLATION (STS)

• Translate restricted set of functions to native dataflow API

• Constant translation overhead, but native execution performance

• E.g., STS translation in SparkR to Spark's Scala DataFrame API:

  R:      df <- filter(df, df$language == 'english')
  Scala:  val df = df.filter($"language" === "english")

  R:      df$km <- df$miles * 1.6
  Scala:  val df = df.withColumn("km", $"miles" * 1.6)


Inter Process Communication

+ Execute arbitrary R code

- Data serialization

- Data exchange

- Java and R process compete for memory

Source-to-source translation

+ Native performance

- Restricted to a language subset, or requires building a full-fledged compiler


A common runtime for R and Java

BACKGROUND: TRUFFLE/GRAAL

[Figure: the GraalVM stack. Source code (*.java, *.R, *.js) is handled by javac, TruffleR (fastR), and TruffleJS; the Truffle AST interpreters sit on top of the HotSpot runtime (Graal, interpreter, GC, …).]

Figure based on: Grimmer, Matthias, et al. "High-performance cross-language interoperability in a multi-language runtime." ACM SIGPLAN Notices, Vol. 51, No. 2. ACM, 2015.


SCOOTR: FASTR + FLINK


SCOOTR OVERVIEW

flink.init(SERVER, PORT)
flink.parallelism(DOP)

df <- flink.readdf(SOURCE,
  list("id", "body", …),
  list(character, character, …)
)

words <- function(df) {
  len <- length(strsplit(df$body, " ")[[1]])
  list(df$id, df$body, len)
}

df <- flink.apply(df, words)

flink.writeAsText(df, SINK)
flink.execute()


Efficient data access in R UDFs

function(df) {
  len <- length(strsplit(df$body, " ")[[1]])
  list(df$id, df$body, len)
}

1. Dataframe proxy keeps track of columns and provides efficient access

function(tuple) {
  len <- length(strsplit(tuple[[2]], " ")[[1]])
  list(tuple[[1]], tuple[[2]], len)
}

2. Rewrite to directly instantiate a Flink tuple instead of an R list

function(tuple) {
  len <- length(strsplit(tuple[[2]], " ")[[1]])
  flink.tuple(tuple[[1]], tuple[[2]], len)
}
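The column-access rewrite in step 1 can be imitated with base R's facilities for manipulating expressions. This is only a sketch of the idea, not how ScootR or fastR do it internally: `rewrite_access` is a hypothetical helper, the schema `c("id", "body")` is assumed, and the result is an ordinary R list rather than a Flink tuple.

```r
# Recursively rewrite calls of the form df$<column> into tuple[[<index>]].
# `columns` is the schema a (hypothetical) dataframe proxy would track.
# Limitation of this sketch: calls with empty arguments such as x[, 1]
# are not handled.
rewrite_access <- function(expr, columns, df_name = "df") {
  if (is.call(expr)) {
    is_col_access <- identical(expr[[1]], as.name("$")) &&
      identical(expr[[2]], as.name(df_name))
    if (is_col_access) {
      idx <- match(as.character(expr[[3]]), columns)
      if (is.na(idx)) stop("unknown column: ", as.character(expr[[3]]))
      return(call("[[", as.name("tuple"), idx))
    }
    # Not a column access: rewrite every part of the call and rebuild it.
    parts <- lapply(as.list(expr), rewrite_access,
                    columns = columns, df_name = df_name)
    return(as.call(parts))
  }
  expr  # constants and symbols are returned unchanged
}

words <- function(df) {
  len <- length(strsplit(df$body, " ")[[1]])
  list(df$id, df$body, len)
}

# Build a function over a positional tuple from the rewritten body.
rewritten <- function(tuple) NULL
body(rewritten) <- rewrite_access(body(words), c("id", "body"))

print(body(rewritten))
print(rewritten(list("42", "hello big world")))
```

Resolving column names to positions once, ahead of execution, means the per-row function does no name lookup of its own.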


IMPACT OF DIRECT TYPE ACCESS

• From list(...) to flink.tuple(...)

• Avoids additional pass over R list to create Flink tuple

• Up to 1.75x performance improvement

[Figure: two panels, output w/ arity 2 and output w/ arity 19. Purple is function execution; pink (hatched) is conversion from list to tuple.]


Evaluation

APPLY FUNCTION MICROBENCHMARK

• Airline On-Time Performance Dataset (2005 – 2016): CSV, 19 columns, 9.5 GB

• UDF: df$km <- df$miles * 1.6

ScootR and SparkR (STS) achieve near-native performance

Both heavily outperform GNU R and fastR


APPLY FUNCTION MICROBENCHMARK: SCALABILITY


MIXED PIPELINE W/ PREPROCESSING AND ML

Pipeline:

- (Distributed) preprocessing of the dataset

- Data is collected locally and a generalized linear model is trained

Majority of the time is spent in preprocessing

ScootR is up to 11x faster than GNU R and fastR


RECAP

• ScootR provides a data.frame API in R for Apache Flink

• R and Flink run within the same runtime

• Avoids serialization and data exchange

• Avoids type conversion

> Achieves near native performance for a rich set of operators
