First impressions of SparkR: our own machine learning algorithm

Veldkant 33A, Kontich ● [email protected] ● www.infofarm.be

Data Science Company

SparkR

RBelgium, 21/10/2015



Who am I

• Data Scientist at InfoFarm (www.infofarm.be)

• PhD in Math

• Author of parallelML (https://cran.r-project.org/web/packages/parallelML)

• Daily R user

• Spark enthusiast

[email protected] @RosiersWannes


Overview

• Apache Spark

– A brief introduction

– R versus Scala (Java/Python)

• SparkR-1.4.0

– Getting started

– R integration

– Our own machine learning algorithms

• SparkR-1.5…

– What’s new?

– Spark MLlib


Apache Spark


“Apache Spark is a fast and general engine for big data processing, with built-in modules for streaming, SQL, machine learning and graph processing”


Fast, Scalable and Fault Tolerant


One ring to rule them all…


Being lazy…

• Transformations (map, filter, union, sort, …) are lazy

• Actions (count, collect, save, …) force computation of the transformations

… is a good thing!
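The lazy/eager split can be illustrated with a toy model in plain R. This is not Spark code: `lazy_source`, `lazy_map` and `lazy_collect` are hypothetical helpers made up for this sketch. The "transformations" only record what should happen, and nothing runs until the "action" is called.

```r
# Toy model of lazy evaluation in plain R (an illustration, not Spark itself).
evaluations <- 0  # counts how many recorded steps have actually been executed

# Wrap local data together with an (initially empty) list of pending steps.
lazy_source <- function(x) list(data = x, steps = list())

# "Transformation": only records the function, does no work yet.
lazy_map <- function(ds, f) {
  ds$steps <- c(ds$steps, list(f))
  ds
}

# "Action": runs every recorded step in order and returns the result.
lazy_collect <- function(ds) {
  out <- ds$data
  for (f in ds$steps) {
    evaluations <<- evaluations + 1
    out <- f(out)
  }
  out
}

evens <- lazy_map(lazy_source(1:10), function(x) x[x %% 2 == 0])
pipeline <- lazy_map(evens, function(x) x * 10)

before <- evaluations             # nothing has been computed so far
result <- lazy_collect(pipeline)  # the action triggers both steps
```

Because the whole chain is known before anything runs, a real engine such as Spark gets the chance to plan the work as a whole instead of step by step.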


R versus Scala

• Scala advantages

– Natively written in Scala

– Big Data extension of Scala concepts

• R disadvantages

– Work in progress

– R packages not implemented for parallel processing

Yet promising as an excellent Big Data analysis tool for R users


SparkR-1.4.0


Initializing SparkR

• Download Spark http://spark.apache.org/downloads

• Install

– Installation (R/install-dev.sh)

– Documentation (R/create-docs.sh)

• Run


Using SparkR

• sparkContext

• sqlContext

• Possibly hiveContext

• parquetFile

• jsonFile

• read.df (via "com.databricks.spark.csv")
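A minimal sketch of how these pieces fit together, assuming the SparkR 1.4 API and a working Spark installation. The helper names `start_sparkr` and `load_examples`, the app name and all file paths are hypothetical; the CSV reader additionally assumes the spark-csv package is available to Spark.

```r
# Sketch against the SparkR 1.4 API; helper names and paths are illustrative.
start_sparkr <- function(master = "local[*]", with_hive = FALSE) {
  # sparkContext: the connection to the cluster (here a local one)
  sc <- SparkR::sparkR.init(master = master, appName = "sparkr-demo")
  # sqlContext, or a hiveContext when Spark was built with Hive support
  sqlContext <- if (with_hive) {
    SparkR::sparkRHive.init(sc)
  } else {
    SparkR::sparkRSQL.init(sc)
  }
  list(sc = sc, sqlContext = sqlContext)
}

load_examples <- function(sqlContext, parquet_path, json_path, csv_path) {
  list(
    parquet = SparkR::parquetFile(sqlContext, parquet_path),
    json    = SparkR::jsonFile(sqlContext, json_path),
    # CSV goes through the external data source named on the slide
    csv     = SparkR::read.df(sqlContext, csv_path,
                              source = "com.databricks.spark.csv",
                              header = "true")
  )
}
```

Nothing runs until the functions are called, e.g. `ctx <- start_sparkr()` followed by `load_examples(ctx$sqlContext, ...)` with real paths.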


Integrating native R code

• Magrittr

• Local computations

• Within SparkR functions

collect createDataFrame
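The round trip between Spark and local R could look roughly like this, assuming the SparkR 1.4 API. The function name `local_roundtrip`, the `faithful` dataset and the filter condition are illustrative choices, not taken from the talk.

```r
# Sketch: move data into Spark, transform it there, bring it back, continue in R.
local_roundtrip <- function(sqlContext, local_df = faithful) {
  # local data.frame -> distributed DataFrame
  sdf <- SparkR::createDataFrame(sqlContext, local_df)
  # distributed (lazy) transformation, condition given as a SQL string
  long_waits <- SparkR::filter(sdf, "waiting > 70")
  # distributed DataFrame -> local data.frame
  back <- SparkR::collect(long_waits)
  # from here on it is ordinary, single-machine R
  back$ratio <- back$eruptions / back$waiting
  back
}
```

Everything after `collect` runs on the driver only, so this pattern fits results that are small enough for local memory.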


Machine learning

• Spark MLlib machine learning algorithms were not available yet

• R algorithms are not implemented in a distributed way

We implemented:

– Naive Bayes (classification)

– K-means (clustering)

– Association rules (recommendation)
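To give an idea of why such algorithms map onto a DataFrame engine, here is one k-means iteration in plain R: assign every point to its nearest center, then average per cluster. This is a local sketch for illustration only, not the implementation from the talk; `kmeans_step` and the toy data are made up.

```r
# One k-means iteration on local matrices (illustrative sketch only).
kmeans_step <- function(points, centers) {
  # squared distance of every point to every center: one column per center
  d <- sapply(seq_len(nrow(centers)), function(k) {
    rowSums(sweep(points, 2, centers[k, ], "-")^2)
  })
  # per row, the index of the column holding the smallest distance
  cluster <- max.col(-d, ties.method = "first")
  # new center = mean of the points assigned to it (empty clusters keep theirs)
  new_centers <- centers
  for (k in unique(cluster)) {
    new_centers[k, ] <- colMeans(points[cluster == k, , drop = FALSE])
  }
  list(cluster = cluster, centers = new_centers)
}

points  <- rbind(c(0, 0), c(0, 1), c(10, 10), c(10, 11))
centers <- rbind(c(1, 1), c(9, 9))
step <- kmeans_step(points, centers)
```

The two halves of the step correspond to a per-row computation and a group-by aggregation, which are the kinds of operations a distributed DataFrame offers.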


Performance

• Naive Bayes

Set                     # observations       Action                   Time taken
Training + Calibration  5.890.434 + 654.325  Build model + Threshold  9min 6sec
Test                    725.479              Prediction               3min 40sec

• K-means

# observations  Total time  Time per iteration
7.270.238       3min 40sec  25sec (4 iterations)

• Association rules

Action           # observations  Time taken
Construct rules  1.048.575       < 30sec
Predict          1               Instantly


Lessons learned

• Nasty workarounds: e.g.

– Rounding: var - var %% 1

– Adding constant column: cast(data[[1]]*0, 'integer')

– Calculating which column has the smallest value:
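The first two workarounds can be tried out on local vectors in plain R. This only mirrors the arithmetic; on a SparkR DataFrame the same expressions are evaluated by Spark, whose remainder operator may treat negative values differently, so the local result is a guide rather than a guarantee. The variable names and sample values are made up.

```r
# Plain-R illustration of the arithmetic behind the workarounds.
x <- c(3.7, 0.2, 12, -2.5)

# Subtracting the remainder of a division by 1 drops the fractional part.
# In base R, %% never returns a negative remainder, so this equals floor(x).
dropped <- x - x %% 1

# A constant column derived from an existing one: multiply by 0, then cast.
data <- data.frame(value = c(1.5, 2.5, 3.5))
data$zero <- as.integer(data[[1]] * 0)
```

Note that the first trick rounds down rather than to the nearest integer; adding 0.5 before applying it would give conventional rounding for non-negative values.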


Lessons learned

• No notion of row indexes

Solvable via HiveQL

• Possible loss of ordering

Solvable by keeping an order on a certain column

• Not all Spark code available yet (map, flatMap, lapply, …)

Solvable by altering the source code to export them

• Slow computations due to the framework

At least numPartitions might help you


Lessons learned

• Caching does not support all types:

lapply(nb[["model"]], function(mod) {
  cache(mod)
  count(mod)
})

• Sometimes necessary to collect intermediate results

local_model <- collect(model)
for (i in 0:n) {
  # add a placeholder row (-1) for every category missing from the model
  if (!(i %in% local_model$category)) {
    local_model <- rbind(local_model, c(i, -1))
  }
}

When using R code, this will always be the case
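Once the model is local, the same gap filling can be written without a loop. A small sketch in plain R; the toy `local_model`, its `value` column name and `n` are assumptions standing in for the collected result.

```r
# Stand-in for the result of collect(model): categories 1 and 3 are missing.
local_model <- data.frame(category = c(0, 2), value = c(0.4, 0.6))
n <- 3

# Find the missing categories in one go and append placeholder rows (-1).
missing <- setdiff(0:n, local_model$category)
if (length(missing) > 0) {
  local_model <- rbind(local_model,
                       data.frame(category = missing, value = -1))
}

# Restore a predictable order, since rbind appends at the end.
local_model <- local_model[order(local_model$category), ]
```

This builds the extra rows once instead of calling `rbind` per category, which matters little here but keeps the intent easier to read.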


SparkR-1.5…


What’s new?

• Time classes (adding/subtracting times)

• More math functions (e.g. atan, rand)

• More text functions (e.g. concat, locate)

• More R functions (e.g. dim, ifelse)

• Create contingency table (crosstab)

• First machine learning algorithm (glm)
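For the last bullet, a sketch of what fitting a model looks like, assuming the SparkR 1.5 API and an existing `sqlContext`. The wrapper name `fit_sparkr_glm` and the choice of the `iris` dataset and formula are illustrative.

```r
# Sketch against the SparkR 1.5 API; requires a running Spark and a sqlContext.
fit_sparkr_glm <- function(sqlContext) {
  # iris column names contain dots; SparkR replaces them with underscores
  df <- SparkR::createDataFrame(sqlContext, iris)
  # linear regression, fitted by Spark rather than by local R
  model <- SparkR::glm(Sepal_Length ~ Sepal_Width + Species,
                       data = df, family = "gaussian")
  # predictions come back as a distributed DataFrame with a "prediction" column
  predictions <- SparkR::predict(model, newData = df)
  list(
    model = model,
    preview = SparkR::collect(
      SparkR::select(predictions, "Sepal_Length", "prediction")
    )
  )
}
```

The formula interface is the familiar one from R, which keeps the code close to a local `glm` call.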


Algorithms provided by Spark

● Classification and regression
  ○ Linear models (SVMs, logistic regression, linear regression)
  ○ Naive Bayes
  ○ Decision trees
  ○ Ensembles of trees (Random Forests and Gradient-Boosted trees)
  ○ Isotonic regression

● Collaborative filtering
  ○ Alternating least squares (ALS)

● Frequent pattern mining
  ○ FP-growth
  ○ Association rules
  ○ PrefixSpan

● Feature extraction and transformation

● Clustering
  ○ K-Means
  ○ Gaussian mixture
  ○ Power Iteration clustering
  ○ Latent Dirichlet allocation
  ○ Streaming k-means

● Dimensionality reduction
  ○ Singular value decomposition (SVD)
  ○ Principal component analysis (PCA)


Questions