First impressions of SparkR: our own machine learning algorithm
Veldkant 33A, Kontich ● [email protected] ● www.infofarm.be
Data Science Company
SparkR
RBelgium, 21/10/2015
Who am I
• Data Scientist at InfoFarm (www.infofarm.be)
• PhD in Mathematics
• Author of parallelML (https://cran.r-project.org/web/packages/parallelML)
• Daily R user
• Spark enthusiast
[email protected] ● @RosiersWannes
Overview
• Apache Spark
– A brief introduction
– R versus Scala (Java/Python)
• SparkR-1.4.0
– Getting started
– R integration
– Our own machine learning algorithms
• SparkR-1.5…
– What’s new?
– Spark MLlib
Apache Spark
“Apache Spark is a fast and general engine for big data
processing, with built-in modules for streaming, SQL,
machine learning and graph processing”
Fast, Scalable and Fault Tolerant
One ring to rule them all…
Being lazy…
• Transformations (map, filter, union, sort, …) are lazy
• Actions (count, collect, save, …) force computation of the pending transformations
… is a good thing!
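Spark's laziness can be sketched in SparkR. A hedged example: it assumes a running sqlContext, and people.json is a hypothetical input file:

```r
# Nothing is computed when a transformation is declared:
df     <- jsonFile(sqlContext, "people.json")  # lazy read
adults <- filter(df, df$age >= 18)             # transformation: still lazy

# Only an action triggers the actual work on the cluster:
count(adults)                                  # action: forces computation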
Scala versus R
• Scala advantages
– Spark is natively written in Scala
– Big Data extension of Scala concepts
• R disadvantages
– Work in progress
– R packages are not implemented for parallel processing
Yet SparkR is promising as an excellent Big Data analysis tool for R users
SparkR-1.4.0
Initializing SparkR
• Download Spark (http://spark.apache.org/downloads)
• Install
– Installation (R/install-dev.sh)
– Documentation (R/create-docs.sh)
• Run
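A minimal sketch of the install-and-run steps, assuming they are executed from the root of the unpacked Spark 1.4.0 distribution:

```shell
# Install the SparkR package into the bundled R library
R/install-dev.sh

# Build the SparkR documentation
R/create-docs.sh

# Launch an R shell with SparkR preloaded
./bin/sparkR
```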
Using SparkR
• sparkContext
• sqlContext
• Possibly hiveContext
• parquetFile
• jsonFile
• read.df (via "com.databricks.spark.csv")
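Putting these pieces together, a hedged sketch (file names are hypothetical; the CSV reader additionally requires the spark-csv package on the classpath, e.g. via --packages):

```r
library(SparkR)

# Contexts
sc         <- sparkR.init(master = "local[2]", appName = "SparkR-demo")
sqlContext <- sparkRSQL.init(sc)
# hiveContext <- sparkRHive.init(sc)  # only if Spark was built with Hive

# Reading data into Spark DataFrames
df1 <- parquetFile(sqlContext, "events.parquet")
df2 <- jsonFile(sqlContext, "events.json")
df3 <- read.df(sqlContext, "events.csv", source = "com.databricks.spark.csv")
```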
Integrating native R code
• magrittr (pipes)
• Local computations
• Within SparkR functions
collect → native R code → createDataFrame
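The collect/createDataFrame round-trip can be sketched as follows (assuming a Spark DataFrame df with a numeric column value; only safe when the collected data fits on the driver):

```r
local_df <- collect(df)                       # Spark DataFrame -> local R data.frame

local_df$logvalue <- log(local_df$value)      # any native R code runs locally

df2 <- createDataFrame(sqlContext, local_df)  # local data.frame -> Spark DataFrame
```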
Machine learning
• Spark MLlib machine learning algorithms were not available yet
• R algorithms are not implemented in a distributed way
We implemented:
– Naive Bayes (classification)
– K-means (clustering)
– Association rules (recommendation)
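As an illustration of the distributed style this forces, one building block of a Naive Bayes implementation, the per-class frequencies, can be computed with SparkR aggregations instead of native R (a sketch, assuming a training DataFrame train with a label column):

```r
# Count observations per class on the cluster
class_counts <- count(groupBy(train, "label"))

# The per-class result is small, so collecting it locally is safe
priors   <- collect(class_counts)
priors$p <- priors$count / sum(priors$count)
```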
Performance
• Naive Bayes

  Set                    | # observations      | Action                  | Time taken
  Training + Calibration | 5.890.434 + 654.325 | Build model + threshold | 9 min 6 sec
  Test                   | 725.479             | Prediction              | 3 min 40 sec

• K-means

  # observations | Total time   | Time per iteration
  7.270.238      | 3 min 40 sec | 25 sec (4 iterations)

• Association rules

  Action          | # observations | Time taken
  Construct rules | 1.048.575      | < 30 sec
  Predict         | 1              | Instantly
Lessons learned
• Nasty workarounds, e.g.:
– Rounding: var - var %% 1
– Adding a constant column: cast(data[[1]] * 0, 'integer')
– Calculating which column has the smallest value
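The first two workarounds, written out against a hypothetical Spark DataFrame df with a numeric column x:

```r
# Rounding down: no floor/round on columns yet, so use modulo
df <- withColumn(df, "x_floor", df$x - df$x %% 1)

# Constant column: multiply an existing column by zero and cast
df <- withColumn(df, "zero", cast(df$x * 0, "integer"))
```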
Lessons learned
• No notion of row indexes
Solvable via HiveQL
• Possible loss of ordering
Solvable by keeping an order on a certain column
• Not all Spark functionality is exposed yet (map, flatMap, lapply, …)
Solvable by altering the source code to export them
• Slow computations due to framework overhead
Tuning numPartitions might help
Lessons learned
• Caching does not support all types:
  lapply(nb[["model"]], function(mod) {
    cache(mod)
    count(mod)
  })
• Sometimes necessary to collect intermediate results:
  local_model <- collect(model)
  for (i in 0:n) {
    if (!i %in% local_model$category) {
      local_model <- rbind(local_model, c(i, -1))
    }
  }
When using native R code, this will always be necessary
SparkR-1.5…
What’s new?
• Time classes (adding/subtracting times)
• More math functions (e.g. atan, rand)
• More text functions (e.g. concat, locate)
• More R functions (e.g. dim, ifelse)
• Create contingency tables (crosstab)
• First machine learning algorithm (glm)
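A hedged sketch of the new glm interface in SparkR 1.5, using the iris dataset that ships with R (column dots become underscores when the data is converted to a Spark DataFrame):

```r
df <- createDataFrame(sqlContext, iris)

# Gaussian GLM, i.e. linear regression, fitted on the cluster
model <- glm(Sepal_Length ~ Sepal_Width + Species, data = df, family = "gaussian")

predictions <- predict(model, df)
```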
Algorithms provided by Spark
• Classification and regression
– Linear models (SVMs, logistic regression, linear regression)
– Naive Bayes
– Decision trees
– Ensembles of trees (Random Forests and Gradient-Boosted Trees)
– Isotonic regression
• Collaborative filtering
– Alternating least squares (ALS)
• Frequent pattern mining
– FP-growth
– Association rules
– PrefixSpan
• Feature extraction and transformation
• Clustering
– K-means
– Gaussian mixture
– Power iteration clustering
– Latent Dirichlet allocation
– Streaming k-means
• Dimensionality reduction
– Singular value decomposition (SVD)
– Principal component analysis (PCA)
Questions