Spark Machine Learning

Amir H. Payberah (amir@sics.se)
SICS Swedish ICT
June 30, 2016


Data → Actionable Knowledge

That is roughly the problem that Machine Learning addresses!



Data and Knowledge

Data → Knowledge

• Is this email spam or no spam?
• Is there a face in this picture?
• Should I lend money to this customer given his spending behavior?


Data and Knowledge

• Knowledge is not concrete.
• Spam is an abstraction.
• Face is an abstraction.
• Who to lend to is an abstraction.

You do not find spam, faces, and financial advice in datasets, you just find bits!


Knowledge Discovery from Data (KDD)

• Preprocessing
• Data mining
• Result validation


KDD - Preprocessing

• Data cleaning
• Data integration
• Data reduction, e.g., sampling
• Data transformation, e.g., normalization (a minimal sketch of these steps follows)
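A hedged Scala sketch of cleaning, reduction, and transformation on an RDD of labeled points (assuming an existing points: RDD[LabeledPoint]; the variable names and thresholds are illustrative, not from the slides):

import org.apache.spark.mllib.feature.StandardScaler
import org.apache.spark.mllib.regression.LabeledPoint

// Cleaning: drop points with missing (NaN) feature values.
val cleaned = points.filter(lp => !lp.features.toArray.exists(_.isNaN))

// Reduction: keep a 10% sample, without replacement.
val sampled = cleaned.sample(withReplacement = false, fraction = 0.1, seed = 42)

// Transformation: standardize features to zero mean and unit variance.
val scaler = new StandardScaler(withMean = true, withStd = true).fit(sampled.map(_.features))
val standardized = sampled.map(lp => LabeledPoint(lp.label, scaler.transform(lp.features)))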


KDD - Mining Functionalities

• Classification and regression (supervised learning)
• Clustering (unsupervised learning)
• Mining frequent patterns
• Outlier detection


KDD - Result Validation

• The performance of the model needs to be evaluated against some criteria.
• The criteria depend on the application and its requirements.


MLlib - Data Types


Data Types - Local Vector

• Stored on a single machine.
• Dense and sparse representations:
  • Dense (1.0, 0.0, 3.0): [1.0, 0.0, 3.0]
  • Sparse (1.0, 0.0, 3.0): (3, [0, 2], [1.0, 3.0])

import org.apache.spark.mllib.linalg.{Vector, Vectors}

val dv: Vector = Vectors.dense(1.0, 0.0, 3.0)
val sv1: Vector = Vectors.sparse(3, Array(0, 2), Array(1.0, 3.0))
val sv2: Vector = Vectors.sparse(3, Seq((0, 1.0), (2, 3.0)))


Data Types - Labeled Point

• A local vector (dense or sparse) associated with a label.
• label: the label for this data point.
• features: the list of features for this data point.

// Defined in org.apache.spark.mllib.regression:
case class LabeledPoint(label: Double, features: Vector)

val pos = LabeledPoint(1.0, Vectors.dense(1.0, 0.0, 3.0))
val neg = LabeledPoint(0.0, Vectors.sparse(3, Array(0, 2), Array(1.0, 3.0)))
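The later snippets start from a placeholder val data: RDD[LabeledPoint] = .... One common way to obtain such an RDD is MLUtils.loadLibSVMFile; a minimal sketch, assuming a LIBSVM-formatted file at an illustrative path:

import org.apache.spark.mllib.util.MLUtils
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.rdd.RDD

// Each LIBSVM line is "label index1:value1 index2:value2 ...".
// The path below is only an example; sc is the SparkContext.
val data: RDD[LabeledPoint] = MLUtils.loadLibSVMFile(sc, "data/sample_libsvm_data.txt")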


MLlib - Preprocessing


Data Transformation - Normalizing Features

• To standardize features to zero mean and unit variance: (x − mean) / sqrt(variance)

import org.apache.spark.mllib.feature.StandardScaler
import org.apache.spark.mllib.regression.LabeledPoint

// labelData: RDD[LabeledPoint]
val features = labelData.map(_.features)
val scaler = new StandardScaler(withMean = true, withStd = true).fit(features)
val scaledData = labelData.map(lp =>
  LabeledPoint(lp.label, scaler.transform(lp.features)))


MLlib - Data Mining


Data Mining Functionalities

• Classification and regression (supervised learning)
• Clustering (unsupervised learning)
• Mining frequent patterns
• Outlier detection


Classification and Regression (Supervised Learning)


Supervised Learning (1/3)

• Right answers are given.
  • Training data (input data) is labeled, e.g., spam/not-spam or a stock price at a time.
• A model is prepared through a training process.
• The training process continues until the model achieves a desired level of accuracy on the training data.


Supervised Learning (2/3)

• Example: face recognition.

[Figure: training and testing face images from the ORL dataset, AT&T Laboratories, Cambridge UK]


Supervised Learning (3/3)

• Set of n training examples: (x1, y1), ..., (xn, yn).
• xi = 〈xi1, xi2, ..., xim〉 is the feature vector of the ith example.
• yi is the label of the ith example.
• A learning algorithm seeks a function f such that yi = f(xi).


Classification vs. Regression

• Classification: the output variable takes class labels.
• Regression: the output variable takes continuous values.


Types of Classification/Regression Models in Spark

• Linear models
• Decision trees
• Naive Bayes models


Linear Models


Linear Models

• Training dataset: (x1, y1), ..., (xn, yn).
• xi = 〈xi1, xi2, ..., xim〉
• Model the target as a function of a linear predictor applied to the input variables: yi = g(wᵀxi).
  • E.g., yi = w1xi1 + w2xi2 + ... + wmxim
• Loss function: f(w) := Σ_{i=1..n} L(g(wᵀxi), yi)
• Training amounts to an optimization problem: min over w ∈ R^m of f(w)


Linear Models - Regression (1/2)

• g(wᵀxi) = w1xi1 + w2xi2 + ... + wmxim
• Loss function: minimize the squared difference between the predicted and actual values: L(g(wᵀxi), yi) := (1/2)(wᵀxi − yi)²
• Solved with gradient descent (a sketch of one update step follows).
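The slides do not show the update rule; as a hedged illustration in plain Scala (independent of MLlib's SGD implementation), one batch gradient descent step for this squared loss is w := w − α Σ_i (wᵀxi − yi) xi:

// A minimal sketch of one (batch) gradient descent step for least squares.
// data: pairs of (features, label); w: current weights; alpha: step size.
def gradientStep(data: Seq[(Array[Double], Double)],
                 w: Array[Double], alpha: Double): Array[Double] = {
  val gradient = Array.fill(w.length)(0.0)
  for ((x, y) <- data) {
    val error = x.zip(w).map { case (xi, wi) => xi * wi }.sum - y  // w^T x - y
    for (j <- w.indices) gradient(j) += error * x(j)
  }
  w.zip(gradient).map { case (wj, gj) => wj - alpha * gj }
}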


Linear Models - Regression (2/2)

import org.apache.spark.mllib.regression.{LabeledPoint, LinearRegressionWithSGD}
import org.apache.spark.rdd.RDD

val data: RDD[LabeledPoint] = ...

val splits = data.randomSplit(Array(0.7, 0.3))
val (trainingData, testData) = (splits(0), splits(1))

val numIterations = 100
val stepSize = 0.00000001
val model = LinearRegressionWithSGD.train(trainingData, numIterations, stepSize)

val valuesAndPreds = testData.map { point =>
  val prediction = model.predict(point.features)
  (point.label, prediction)
}


Linear Models - Classification (Logistic Regression) (1/2)

• Binary classification: output values between 0 and 1.
• g(wᵀx) := 1 / (1 + e^(−wᵀx)) (the sigmoid function)
• If g(wᵀxi) > 0.5, then yi = 1, else yi = 0 (a small sketch of this decision rule follows).
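A hedged illustration of the decision rule in plain Scala (independent of MLlib; the names are illustrative):

// Sigmoid of the linear predictor and the 0.5-threshold decision rule.
def sigmoid(z: Double): Double = 1.0 / (1.0 + math.exp(-z))

def predict(w: Array[Double], x: Array[Double]): Double = {
  val z = w.zip(x).map { case (wi, xi) => wi * xi }.sum  // w^T x
  if (sigmoid(z) > 0.5) 1.0 else 0.0
}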


Linear Models - Classification (Logistic Regression) (2/2)

import org.apache.spark.mllib.classification.LogisticRegressionWithLBFGS
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.rdd.RDD

val data: RDD[LabeledPoint] = ...

val splits = data.randomSplit(Array(0.7, 0.3))
val (trainingData, testData) = (splits(0), splits(1))

val model = new LogisticRegressionWithLBFGS()
  .setNumClasses(10)
  .run(trainingData)

val predictionAndLabels = testData.map { point =>
  val prediction = model.predict(point.features)
  (prediction, point.label)
}


Decision Tree


Decision Tree

• A greedy algorithm.
• It performs a recursive binary partitioning of the feature space.
• Decision tree construction algorithm:
  • Find the best split condition (quantified by the impurity measure).
  • Stop when no further improvement is possible.


Impurity Measure

• Measures how well the classes are separated.
• The current implementation in Spark:
  • Regression: variance
  • Classification: Gini impurity and entropy (sketched below)
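The slides do not spell out the formulas; as a hedged reference, Gini impurity is 1 − Σ p_c² and entropy is −Σ p_c log2 p_c, where p_c is the proportion of class c at the node. A minimal Scala sketch (illustrative, not the MLlib implementation):

// Impurity of a node, given the class proportions at that node.
def gini(proportions: Seq[Double]): Double =
  1.0 - proportions.map(p => p * p).sum

def entropy(proportions: Seq[Double]): Double =
  -proportions.filter(_ > 0).map(p => p * math.log(p) / math.log(2)).sum

// Example: a perfectly mixed binary node has gini 0.5 and entropy 1.0.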


Stopping Rules

• The node depth is equal to the maxDepth training parameter.
• No split candidate leads to an information gain greater than minInfoGain.
• No split candidate produces child nodes which each have at least minInstancesPerNode training instances.


Decision Tree - Regression

import org.apache.spark.mllib.tree.DecisionTree
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.rdd.RDD

val data: RDD[LabeledPoint] = ...

val splits = data.randomSplit(Array(0.7, 0.3))
val (trainingData, testData) = (splits(0), splits(1))

val categoricalFeaturesInfo = Map[Int, Int]()  // all features are continuous
val impurity = "variance"
val maxDepth = 5
val maxBins = 32
val model = DecisionTree.trainRegressor(trainingData,
  categoricalFeaturesInfo, impurity, maxDepth, maxBins)

val labelsAndPredictions = testData.map { point =>
  val prediction = model.predict(point.features)
  (point.label, prediction)
}


Decision Tree - Classification

import org.apache.spark.mllib.tree.DecisionTree
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.rdd.RDD

val data: RDD[LabeledPoint] = ...

val splits = data.randomSplit(Array(0.7, 0.3))
val (trainingData, testData) = (splits(0), splits(1))

val numClasses = 2
val categoricalFeaturesInfo = Map[Int, Int]()
val impurity = "gini"
val maxDepth = 5
val maxBins = 32
val model = DecisionTree.trainClassifier(trainingData, numClasses,
  categoricalFeaturesInfo, impurity, maxDepth, maxBins)

val labelAndPreds = testData.map { point =>
  val prediction = model.predict(point.features)
  (point.label, prediction)
}


Random Forest

• Train a set of decision trees separately.
• The training can be done in parallel.
• The algorithm injects randomness into the training process, so that each decision tree is a bit different (sketched below).
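A hedged note on where the randomness typically comes from (consistent with the featureSubsetStrategy parameter in the code that follows): each tree is trained on a bootstrap sample of the data, and each split considers only a random subset of the features. A minimal sketch of the bootstrap part, assuming trainingData: RDD[LabeledPoint] and numTrees as defined below:

// One bootstrap sample per tree: sample with replacement, same expected size as the data.
val bootstrapSamples = (1 to numTrees).map { _ =>
  trainingData.sample(withReplacement = true, fraction = 1.0)
}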


Random Forest - Regression

import org.apache.spark.mllib.tree.RandomForest

// trainingData and testData: RDD[LabeledPoint], split as in the previous examples.
val categoricalFeaturesInfo = Map[Int, Int]()
val numTrees = 3
val featureSubsetStrategy = "auto"  // let the algorithm choose
val impurity = "variance"
val maxDepth = 4
val maxBins = 32
val model = RandomForest.trainRegressor(trainingData, categoricalFeaturesInfo,
  numTrees, featureSubsetStrategy, impurity, maxDepth, maxBins)

val labelsAndPredictions = testData.map { point =>
  val prediction = model.predict(point.features)
  (point.label, prediction)
}


Random Forest - Classification

import org.apache.spark.mllib.tree.RandomForest

// trainingData and testData: RDD[LabeledPoint], split as in the previous examples.
val numClasses = 2
val categoricalFeaturesInfo = Map[Int, Int]()
val numTrees = 3
val featureSubsetStrategy = "auto"
val impurity = "gini"
val maxDepth = 4
val maxBins = 32
val model = RandomForest.trainClassifier(trainingData, numClasses,
  categoricalFeaturesInfo, numTrees, featureSubsetStrategy, impurity,
  maxDepth, maxBins)

val labelAndPreds = testData.map { point =>
  val prediction = model.predict(point.features)
  (point.label, prediction)
}


Naive Bayes


Naive Bayes (1/4)

• Uses probability theory to classify things.
• Assumption: independence between every pair of features.
• y1: circles, and y2: triangles.
• Does (x1, x2) belong to y1 or y2?


Naive Bayes (2/4)

• Does (x1, x2) belong to y1 or y2?
• If p(y1|x1, x2) > p(y2|x1, x2), the class is y1.
• If p(y1|x1, x2) < p(y2|x1, x2), the class is y2.
• Bayes' rule lets us replace p(y|x) with p(x|y)p(y)/p(x).


Naive Bayes (3/4)

• Bayes' theorem: p(y|x) = p(x|y)p(y)/p(x)
• p(y|x): probability of instance x being in class y.
• p(x|y): probability of generating instance x given class y.
• p(y): probability of occurrence of class y.
• p(x): probability of instance x occurring.


Naive Bayes (4/4)

Is officer Drew male or female?

• p(y|x) = p(x|y)p(y)/p(x)
• p(male|drew) = p(drew|male)p(male)/p(drew) = (1/3 × 3/8) / (3/8) = 0.33
• p(female|drew) = p(drew|female)p(female)/p(drew) = (2/5 × 5/8) / (3/8) = 0.66
• Since 0.66 > 0.33, officer Drew is more likely to be female.


Naive Bayes

import org.apache.spark.mllib.classification.NaiveBayes
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.rdd.RDD

val data: RDD[LabeledPoint] = ...

val splits = data.randomSplit(Array(0.7, 0.3))
val (trainingData, testData) = (splits(0), splits(1))

val model = NaiveBayes.train(trainingData, lambda = 1.0,
  modelType = "multinomial")

val predictionAndLabel = testData.map(p =>
  (model.predict(p.features), p.label))


Clustering (Unsupervised Learning)


Clustering (1/5)

• Clustering is a technique for finding similarity groups in data, called clusters.
• It groups data instances that are similar to each other into one cluster, and data instances that are very different from each other into different clusters.
• Clustering is often called an unsupervised learning task, as no class values denoting an a priori grouping of the data instances are given.


Clustering (2/5)


• k-means clustering is a popular method for clustering.


Clustering (3/5)

• K: number of clusters (given).
  • One mean per cluster.
• Initialize means by picking K samples at random.
• Iterate (a minimal sketch of one iteration follows):
  • Assign each point to the nearest mean.
  • Move each mean to the center of its cluster.
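A hedged Scala sketch of one k-means iteration on plain arrays (illustrative only; MLlib's KMeans, shown a few slides later, is what you would use in practice):

// One k-means iteration: assign points to the nearest mean, then recompute the means.
def squaredDistance(a: Array[Double], b: Array[Double]): Double =
  a.zip(b).map { case (ai, bi) => (ai - bi) * (ai - bi) }.sum

def kMeansStep(points: Seq[Array[Double]],
               means: Seq[Array[Double]]): Seq[Array[Double]] = {
  // Assignment step: group points by the index of their closest mean.
  val assigned = points.groupBy(p => means.indices.minBy(i => squaredDistance(p, means(i))))
  // Update step: move each mean to the centroid of its assigned points.
  means.indices.map { i =>
    assigned.get(i) match {
      case Some(cluster) =>
        val dim = cluster.head.length
        Array.tabulate(dim)(d => cluster.map(_(d)).sum / cluster.size)
      case None => means(i)  // empty cluster: keep the old mean
    }
  }
}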


Clustering (4/5)

[Figure: k-means iteration steps]


Clustering (5/5)

import org.apache.spark.mllib.clustering.KMeans
import org.apache.spark.mllib.linalg.Vector
import org.apache.spark.rdd.RDD

// KMeans expects feature vectors (no labels).
val data: RDD[Vector] = ...

val numClusters = 2
val numIterations = 20
val clusters = KMeans.train(data, numClusters, numIterations)

// Evaluate clustering by computing the Within Set Sum of Squared Errors.
val WSSSE = clusters.computeCost(data)
println("Within Set Sum of Squared Errors = " + WSSSE)


MLlib - Result Validation


Classification Model Evaluation

• There exists a true output and a model-generated predicted output for each data point.
• The result for each data point can be assigned to one of four categories:
  1. True Positive (TP): label is positive and prediction is also positive
  2. True Negative (TN): label is negative and prediction is also negative
  3. False Positive (FP): label is negative but prediction is positive
  4. False Negative (FN): label is positive but prediction is negative


Binary Classification (1/2)

• Precision (positive predictive value): the fraction of retrieved instances that are relevant.
• Recall (sensitivity): the fraction of relevant instances that are retrieved.
• F-measure: (1 + β²) · (precision · recall) / (β² · precision + recall)
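In terms of the four categories from the previous slide, these reduce to simple count ratios; a minimal Scala sketch (illustrative, not the MLlib implementation):

// Precision, recall, and F_beta computed from confusion-matrix counts.
def precision(tp: Long, fp: Long): Double = tp.toDouble / (tp + fp)
def recall(tp: Long, fn: Long): Double = tp.toDouble / (tp + fn)
def fMeasure(p: Double, r: Double, beta: Double = 1.0): Double =
  (1 + beta * beta) * p * r / (beta * beta * p + r)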


Binary Classification (2/2)

import org.apache.spark.mllib.classification.LogisticRegressionWithLBFGS
import org.apache.spark.mllib.evaluation.BinaryClassificationMetrics

val model = new LogisticRegressionWithLBFGS()
  .setNumClasses(2)
  .run(trainingData)

val predictionAndLabels = testData.map { point =>
  val prediction = model.predict(point.features)
  (prediction, point.label)
}

val metrics = new BinaryClassificationMetrics(predictionAndLabels)

val precision = metrics.precisionByThreshold
precision.foreach { case (t, p) => println(s"Threshold: $t, Precision: $p") }

val recall = metrics.recallByThreshold
recall.foreach { case (t, r) => println(s"Threshold: $t, Recall: $r") }

val beta = 0.5
val fScore = metrics.fMeasureByThreshold(beta)
fScore.foreach { case (t, f) => println(s"Threshold: $t, F-score: $f") }


Regression Model Evaluation (1/2)

• Mean Squared Error (MSE): (1/N) Σ_{i=0..N−1} (yi − ŷi)², where yi is the true value and ŷi the predicted value.
• Root Mean Squared Error (RMSE): sqrt((1/N) Σ_{i=0..N−1} (yi − ŷi)²)
• Mean Absolute Error (MAE): (1/N) Σ_{i=0..N−1} |yi − ŷi|


Regression Model Evaluation (2/2)

import org.apache.spark.mllib.evaluation.RegressionMetrics
import org.apache.spark.mllib.regression.LinearRegressionWithSGD

val numIterations = 100
val model = LinearRegressionWithSGD.train(trainingData, numIterations)

val valuesAndPreds = testData.map { point =>
  val prediction = model.predict(point.features)
  (prediction, point.label)
}

val metrics = new RegressionMetrics(valuesAndPreds)
println(s"MSE = ${metrics.meanSquaredError}")
println(s"RMSE = ${metrics.rootMeanSquaredError}")
println(s"MAE = ${metrics.meanAbsoluteError}")


Summary


Summary

• Preprocessing: cleaning, integration, reduction, transformation
• Data mining: classification, clustering, frequent patterns, anomaly detection
• Result validation


Questions?
