First impressions of SparkR: our own machine learning algorithm
Veldkant 33A, Kontich ● [email protected] ● www.infofarm.be
Data Science Company
SparkR
RBelgium, 21/10/2015
Who am I
• Data Scientist at InfoFarm (www.infofarm.be)
• PhD in Mathematics
• Author of parallelML (https://cran.r-project.org/web/packages/parallelML)
• Daily R user
• Spark enthusiast
[email protected] ● @RosiersWannes
Overview
• Apache Spark
– A brief introduction
– R versus Scala (Java/Python)
• SparkR-1.4.0
– Getting started
– R integration
– Our own machine learning algorithms
• SparkR-1.5…
– What’s new?
– Spark MLlib
Apache Spark
“Apache Spark is a fast and general engine for big data
processing, with built-in modules for streaming, SQL,
machine learning and graph processing”
Fast, Scalable and Fault Tolerant
One ring to rule them all…
Being lazy…
• Transformations (map, filter, union, sort, …) are lazy
• Actions (count, collect, save, …) force computation of the pending transformations
… is a good thing!
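Spark's laziness can be sketched in SparkR. A hedged example: it assumes a running sqlContext, and people.json is a hypothetical input file:

```r
# Nothing is computed when a transformation is declared:
df     <- jsonFile(sqlContext, "people.json")  # lazy read
adults <- filter(df, df$age >= 18)             # transformation: still lazy

# Only an action triggers the actual work on the cluster:
count(adults)                                  # action: forces computation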
Scala versus R
• Scala advantages
– Spark is natively written in Scala
– Big Data extension of Scala concepts
• R disadvantages
– Work in progress
– R packages are not implemented for parallel processing
Yet SparkR is promising as an excellent Big Data analysis tool for R users
SparkR-1.4.0
Initializing SparkR
• Download Spark (http://spark.apache.org/downloads)
• Install
– Installation (R/install-dev.sh)
– Documentation (R/create-docs.sh)
• Run
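A minimal sketch of the install-and-run steps, assuming they are executed from the root of the unpacked Spark 1.4.0 distribution:

```shell
# Install the SparkR package into the bundled R library
R/install-dev.sh

# Build the SparkR documentation
R/create-docs.sh

# Launch an R shell with SparkR preloaded
./bin/sparkR
```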
Using SparkR
• sparkContext
• sqlContext
• Possibly hiveContext
• parquetFile
• jsonFile
• read.df (via "com.databricks.spark.csv")
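Putting these pieces together, a hedged sketch (file names are hypothetical; the CSV reader additionally requires the spark-csv package on the classpath, e.g. via --packages):

```r
library(SparkR)

# Contexts
sc         <- sparkR.init(master = "local[2]", appName = "SparkR-demo")
sqlContext <- sparkRSQL.init(sc)
# hiveContext <- sparkRHive.init(sc)  # only if Spark was built with Hive

# Reading data into Spark DataFrames
df1 <- parquetFile(sqlContext, "events.parquet")
df2 <- jsonFile(sqlContext, "events.json")
df3 <- read.df(sqlContext, "events.csv", source = "com.databricks.spark.csv")
```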
Integrating native R code
• magrittr (pipes)
• Local computations
• Within SparkR functions
collect → native R code → createDataFrame
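The collect/createDataFrame round-trip can be sketched as follows (assuming a Spark DataFrame df with a numeric column value; only safe when the collected data fits on the driver):

```r
local_df <- collect(df)                       # Spark DataFrame -> local R data.frame

local_df$logvalue <- log(local_df$value)      # any native R code runs locally

df2 <- createDataFrame(sqlContext, local_df)  # local data.frame -> Spark DataFrame
```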
Machine learning
• Spark MLlib machine learning algorithms were not available yet
• R algorithms are not implemented in a distributed way
We implemented:
– Naive Bayes (classification)
– K-means (clustering)
– Association rules (recommendation)
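As an illustration of the distributed style this forces, one building block of a Naive Bayes implementation, the per-class frequencies, can be computed with SparkR aggregations instead of native R (a sketch, assuming a training DataFrame train with a label column):

```r
# Count observations per class on the cluster
class_counts <- count(groupBy(train, "label"))

# The per-class result is small, so collecting it locally is safe
priors   <- collect(class_counts)
priors$p <- priors$count / sum(priors$count)
```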
Performance
• Naive Bayes

  Set                    | # observations      | Action                  | Time taken
  Training + Calibration | 5.890.434 + 654.325 | Build model + threshold | 9 min 6 sec
  Test                   | 725.479             | Prediction              | 3 min 40 sec

• K-means

  # observations | Total time   | Time per iteration
  7.270.238      | 3 min 40 sec | 25 sec (4 iterations)

• Association rules

  Action          | # observations | Time taken
  Construct rules | 1.048.575      | < 30 sec
  Predict         | 1              | Instantly
Lessons learned
• Nasty workarounds, e.g.:
– Rounding: var - var %% 1
– Adding a constant column: cast(data[[1]] * 0, 'integer')
– Calculating which column has the smallest value
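The first two workarounds, written out against a hypothetical Spark DataFrame df with a numeric column x:

```r
# Rounding down: no floor/round on columns yet, so use modulo
df <- withColumn(df, "x_floor", df$x - df$x %% 1)

# Constant column: multiply an existing column by zero and cast
df <- withColumn(df, "zero", cast(df$x * 0, "integer"))
```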
Lessons learned
• No notion of row indexes
Solvable via HiveQL
• Possible loss of ordering
Solvable by keeping an order on a certain column
• Not all Spark functionality is exposed yet (map, flatMap, lapply, …)
Solvable by altering the source code to export them
• Slow computations due to framework overhead
Tuning numPartitions might help
Lessons learned
• Caching does not support all types:
  lapply(nb[["model"]], function(mod) {
    cache(mod)
    count(mod)
  })
• Sometimes necessary to collect intermediate results:
  local_model <- collect(model)
  for (i in 0:n) {
    if (!i %in% local_model$category) {
      local_model <- rbind(local_model, c(i, -1))
    }
  }
When using native R code, this will always be necessary
SparkR-1.5…
What’s new?
• Time classes (adding/subtracting times)
• More math functions (e.g. atan, rand)
• More text functions (e.g. concat, locate)
• More R functions (e.g. dim, ifelse)
• Create contingency tables (crosstab)
• First machine learning algorithm (glm)
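A hedged sketch of the new glm interface in SparkR 1.5, using the iris dataset that ships with R (column dots become underscores when the data is converted to a Spark DataFrame):

```r
df <- createDataFrame(sqlContext, iris)

# Gaussian GLM, i.e. linear regression, fitted on the cluster
model <- glm(Sepal_Length ~ Sepal_Width + Species, data = df, family = "gaussian")

predictions <- predict(model, df)
```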
Algorithms provided by Spark
• Classification and regression
– Linear models (SVMs, logistic regression, linear regression)
– Naive Bayes
– Decision trees
– Ensembles of trees (Random Forests and Gradient-Boosted Trees)
– Isotonic regression
• Collaborative filtering
– Alternating least squares (ALS)
• Frequent pattern mining
– FP-growth
– Association rules
– PrefixSpan
• Feature extraction and transformation
• Clustering
– K-means
– Gaussian mixture
– Power iteration clustering
– Latent Dirichlet allocation
– Streaming k-means
• Dimensionality reduction
– Singular value decomposition (SVD)
– Principal component analysis (PCA)
Questions