Introduction to SparkR


Introduction to R and integration of SparkR and Spark’s MLlib

Dang Trung Kien

About me

• Statistics undergraduate

[email protected]

R

What is R?

• Statistical Programming Language

• Open source

• > 6000 available packages

• Widely used in academia and research

Companies that use R

• Facebook

• Google

• Foursquare

• Ford

• Bank of America

• ANZ

• …

Data types

• Vector

• Matrix

• List

• Data frame

Vector

• c(1, 2, 3, 4)

## [1] 1 2 3 4

• 1:4

## [1] 1 2 3 4

• c("a", "b", "c")

## [1] "a" "b" “c"

• c(T, F, T)

## [1] TRUE FALSE TRUE

Matrix

• matrix(c(1, 2, 3, 4), ncol=2)

## [,1] [,2]

## [1,] 1 3

## [2,] 2 4

• matrix(c(1, 2, 3, 4), ncol=2, byrow=T)

## [,1] [,2]

## [1,] 1 2

## [2,] 3 4

List

• list(12, "twelve")

## [[1]]

## [1] 12

##

## [[2]]

## [1] "twelve"

Data frame

name <- c("A", "B", "C")

age <- c(30, 17, 42)

male <- c(T, F, F)

data.frame(name, age, male)

## name age male

## 1 A 30 TRUE

## 2 B 17 FALSE

## 3 C 42 FALSE
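Columns of the resulting data frame can be accessed and filtered by name; a small illustration (the variable name df is mine):

df <- data.frame(name, age, male)
df$age              # the age column as a numeric vector
df[df$age >= 18, ]  # only the rows for adults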

# Fit and plot a simple linear regression
x <- 1:100
y <- 1:100 + runif(100, 0, 20)  # add uniform noise to a linear trend
m <- lm(y ~ x)                  # fit the linear model
plot(y ~ x)                     # scatter plot of the data
abline(m$coefficients)          # overlay the fitted regression line

But…

• R is single-threaded

• Can only process data sets that fit on a single machine

SparkR

SparkR

• An R package that provides a light-weight front-end to use Apache Spark from R

• Exposes the RDD API of Spark as distributed lists in R

• Allows users to interactively run jobs from the R shell on a cluster
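A minimal sketch of interactive use (assuming the standalone SparkR package API, i.e. sparkR.init, parallelize, lapply and reduce):

library(SparkR)
sc <- sparkR.init(master = "local")             # connect to a local Spark instance

rdd <- parallelize(sc, 1:1000, numSlices = 4)   # distribute a local vector
squares <- lapply(rdd, function(x) x * x)       # map runs on the workers
reduce(squares, function(a, b) a + b)           # aggregate the result back locally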

Spark + R

Spark operations exposed in SparkR:

• count, countByKey, countByValue

• map (lapply), flatMap, Filter

• reduce, reduceByKey

• distinct, union

• broadcast, includePackage

Data flow

[Diagram: on the local machine, the R Spark Context talks to the Java Spark Context through JNI; each worker runs an RSpark Executor.]

Word count

# Read the file as an RDD of lines
lines <- textFile(sc, "/path/to/file")

# Split each line into words
words <- flatMap(lines, function(line) {
  strsplit(line, " ")[[1]]
})

# Pair every word with a count of 1, then sum the counts per word
wordCount <- lapply(words, function(word) { list(word, 1L) })
counts <- reduceByKey(wordCount, "+", 2L)

# Collect the results to the driver and print them
output <- collect(counts)
for (wordcount in output) {
  cat(wordcount[[1]], ": ", wordcount[[2]], "\n")
}

SparkR and Spark’s MLlib

Machine Learning

• Arthur Samuel (1959): Field of study that gives computers the ability to learn without being explicitly programmed.

Machine Learning

• Supervised

Labels, features

Mapping of features to labels

Estimate a concept (model) that is closest to the true mapping

• Unsupervised

No labels

Clustering of data

Machine Learning

• Supervised

Naive Bayes, nearest neighbour, decision tree, linear regression, support vector machine…

• Unsupervised

K-means, DBSCAN, one-class SVM…
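For example, an unsupervised method such as k-means can be run on an unlabelled data set in plain R (a minimal sketch using the built-in iris measurements; 3 clusters is an arbitrary illustrative choice):

set.seed(1)
km <- kmeans(iris[, 1:4], centers = 3)  # cluster the numeric columns, ignoring the species labels
table(km$cluster)                       # number of points assigned to each cluster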


Supervised

• Classification

Cat or dog?

• Regression

Age?

Unsupervised

Naive Bayes

• Supervised machine learning

• Classifies texts based on word frequency

Naive Bayes

$$P(\text{class} \mid \text{doc}) = P(\text{class}) \prod_{\text{word} \in \text{doc}} P(\text{word} \mid \text{class})$$

$$\operatorname*{argmax}_{\text{class}} P(\text{class} \mid \text{doc}) = \operatorname*{argmax}_{\text{class}} P(\text{class}) \prod_{\text{word} \in \text{doc}} P(\text{word} \mid \text{class})$$

$$= \operatorname*{argmax}_{\text{class}} \log P(\text{class} \mid \text{doc}) = \operatorname*{argmax}_{\text{class}} \Big[ \log P(\text{class}) + \sum_{\text{word} \in \text{doc}} \log P(\text{word} \mid \text{class}) \Big]$$

$$P(c) = \frac{\text{number of class } c \text{ documents in the training set}}{\text{total number of documents in the training set}}$$

$$P(w \mid c) = \frac{\text{no. of occurrences of word } w \text{ in class } c \text{ documents} + 1}{\text{total no. of words in class } c \text{ documents} + \text{size of vocabulary}}$$

“a" “b” “c”

1 1 1 0

2 0 2 1

P 1( ) = P 2( ) = 12

P a |1( ) = 1+11+1+ 3

= 25

P b |1( ) = 25

P c |1( ) = 15

P a | 2( ) = 15

P b | 2( ) = 35

P c | 2( ) = 25

P 1|"a b b"( ) = 12× 25× 25× 25= 0.032

P 2 |"a b b"( ) = 12× 15× 35× 35= 0.036

MLlib

• Spark’s scalable machine learning library consisting of common learning algorithms and utilities, including classification, regression, clustering, collaborative filtering, dimensionality reduction, as well as underlying optimization primitives.

MLlib and SparkR

• Access to MLlib from SparkR is still under development, so the approach below can be used to run MLlib from R until MLlib is officially integrated into SparkR.

MLlib’s Naive Bayes in R

[Diagram: an R RDD of list(label, features) is turned into a Java RDD of serialised R objects, which is then converted to a Scala RDD of LabeledPoint.]

The model is trained by calling MLlib through rJava:

J("org.apache.spark.mllib.classification.NaiveBayes", "train", labeled.point.rdd, lambda)
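For context, rJava’s J() calls a static Java method by class and method name; a minimal, self-contained illustration (the example call is mine, not from the slides):

library(rJava)
.jinit()                                       # start the JVM
J("java.lang.Double", "parseDouble", "10.2")   # call a static Java method, returns 10.2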

Demo

Thank you for coming!