Spark ML par Xebia (Spark Meetup du 11/06/2015)


Page 1: Spark ML par Xebia (Spark Meetup du 11/06/2015)

SPARK ML: A new High-Level API for MLlib

Spark 1.4.0 preview

Page 2: Spark ML par Xebia (Spark Meetup du 11/06/2015)

Matthieu Blanc

Instructor, Spark Developer Training

@matthieublanc

Page 3: Spark ML par Xebia (Spark Meetup du 11/06/2015)

MLLIB: Makes Machine Learning Easy and Scalable

Selection of Machine Learning Algorithms

Several design flaws:

• no support for machine learning workflows/pipelines

• hard to make MLlib itself a scalable project

• lack of homogeneity

org.apache.spark.ml to the rescue!

Page 4: Spark ML par Xebia (Spark Meetup du 11/06/2015)

Machine Learning

[Diagram. Training: Train Dataset (label, features) → Feature Engineering → ML Algorithm → Model. Prediction: Test Dataset (features) → Feature Engineering → Model → Predictions (features, prediction).]

Page 5: Spark ML par Xebia (Spark Meetup du 11/06/2015)

Machine Learning Pipeline

• Simple construction of ML workflow

• Inspect and debug it

• Tune parameters

• Re-run it on new data

Page 6: Spark ML par Xebia (Spark Meetup du 11/06/2015)

Dataframes

org.apache.spark.ml

Page 7: Spark ML par Xebia (Spark Meetup du 11/06/2015)

Key concepts

• DataFrame as ML Datasets

• Abstractions:

• Transformers

• Estimators

• Evaluators

• Parameters API -> CrossValidator

Page 8: Spark ML par Xebia (Spark Meetup du 11/06/2015)

Transformers

DataFrame → Transformer → DataFrame

def transform(dataset: DataFrame): DataFrame

Input columns: colA, colB, …, colX

Output columns: colA, colB, …, colX, newCol
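To make the "append a column" idea concrete, here is a minimal sketch in plain Scala, without Spark. The `Row` and `Dataset` aliases and the `ThresholdBinarizer` class are hypothetical stand-ins invented for illustration; they are not part of the Spark API.

```scala
object ToyTransformer {
  // Hypothetical stand-ins for Spark's Row and DataFrame: a row maps
  // column names to values, and a dataset is a sequence of rows.
  type Row = Map[String, Any]
  type Dataset = Seq[Row]

  // Mirrors the shape of `def transform(dataset: DataFrame): DataFrame`.
  trait Transformer {
    def transform(dataset: Dataset): Dataset
  }

  // Appends `outputCol`: 1.0 when `inputCol` exceeds `threshold`, else 0.0.
  // Existing columns are left untouched.
  final class ThresholdBinarizer(inputCol: String, outputCol: String, threshold: Double)
      extends Transformer {
    def transform(dataset: Dataset): Dataset =
      dataset.map { row =>
        val value = row(inputCol).asInstanceOf[Double]
        row.updated(outputCol, if (value > threshold) 1.0 else 0.0)
      }
  }

  val input: Dataset = Seq(
    Map("name" -> "a", "age" -> 12.0),
    Map("name" -> "b", "age" -> 40.0)
  )

  // Same rows and columns as `input`, plus the new `isAdult` column.
  val output: Dataset = new ThresholdBinarizer("age", "isAdult", 18.0).transform(input)
}
```

The input dataset is never modified: `transform` returns a new dataset with one more column, which is what allows transformers to be chained.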

Page 9: Spark ML par Xebia (Spark Meetup du 11/06/2015)

Transformer Usage

import org.apache.spark.ml.feature.OneHotEncoder

// Add a categoryVec column to the DataFrame
// by applying the OneHotEncoder transformation to the category column
val classEncoder = new OneHotEncoder()
  .setInputCol("category")
  .setOutputCol("categoryVec")
val newDataFrame = classEncoder.transform(dataFrame)

dataFrame columns: colA, colB, …, category: double

newDataFrame columns: colA, colB, …, category: double, categoryVec: vector

Page 10: Spark ML par Xebia (Spark Meetup du 11/06/2015)

Transformers Examples

Normalizer

VectorAssembler

PolynomialExpansion

Model

Tokenizer

OneHotEncoder

HashingTF

Binarizer

Page 11: Spark ML par Xebia (Spark Meetup du 11/06/2015)

Estimators

DataFrame → Estimator → Model

def fit(dataset: DataFrame): Model

Input columns: label: double, features: vector, …

Model extends Transformer

Page 12: Spark ML par Xebia (Spark Meetup du 11/06/2015)

Model is a Transformer

DataFrame → Model → DataFrame

def transform(dataset: DataFrame): DataFrame

Input columns: features: vector, …

Output columns: features: vector, prediction: double, …
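The Estimator/Model relationship can be sketched in plain Scala, without Spark. Everything below (`Example`, `MidpointEstimator`, `ThresholdModel`) is a hypothetical toy, not the Spark API: the estimator holds no learned state, `fit` learns a threshold from labelled data, and the resulting model is a Transformer that adds predictions.

```scala
object ToyEstimator {
  // Hypothetical stand-in for a DataFrame row with `feature` and `label` columns.
  final case class Example(feature: Double, label: Double)

  trait Transformer {
    // Returns (feature, prediction) pairs, i.e. the input plus a new column.
    def transform(features: Seq[Double]): Seq[(Double, Double)]
  }

  trait Estimator[M <: Transformer] {
    def fit(dataset: Seq[Example]): M
  }

  // The fitted model: holds the learned threshold and is itself a Transformer.
  final class ThresholdModel(val threshold: Double) extends Transformer {
    def transform(features: Seq[Double]): Seq[(Double, Double)] =
      features.map(f => (f, if (f > threshold) 1.0 else 0.0))
  }

  // Toy learning rule: the threshold is the midpoint between the mean feature
  // value of the positive examples and that of the negative examples.
  object MidpointEstimator extends Estimator[ThresholdModel] {
    def fit(dataset: Seq[Example]): ThresholdModel = {
      val (pos, neg) = dataset.partition(_.label == 1.0)
      require(pos.nonEmpty && neg.nonEmpty, "need examples of both classes")
      val meanPos = pos.map(_.feature).sum / pos.size
      val meanNeg = neg.map(_.feature).sum / neg.size
      new ThresholdModel((meanPos + meanNeg) / 2.0)
    }
  }

  val train = Seq(Example(1.0, 0.0), Example(2.0, 0.0), Example(8.0, 1.0), Example(9.0, 1.0))
  val model = MidpointEstimator.fit(train)         // learned threshold: 5.0
  val predictions = model.transform(Seq(3.0, 7.0)) // labels are not needed here
}
```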

Page 13: Spark ML par Xebia (Spark Meetup du 11/06/2015)

Estimator + Model Usage

import org.apache.spark.ml.classification.LogisticRegression

// Apply logisticRegression on a training dataset to create a model
// used to compute predictions on a test dataset
val logisticRegression = new LogisticRegression()
  .setMaxIter(50)
  .setRegParam(0.01)
// train
val lrModel = logisticRegression.fit(trainDF)
// predict
val newDataFrameWithPredictions = lrModel.transform(testDF)

Page 14: Spark ML par Xebia (Spark Meetup du 11/06/2015)

Estimators Examples

StringIndexer

StandardScaler

CrossValidator

Pipeline

LinearRegression

LogisticRegression

DecisionTreeClassifier

RandomForestClassifier

GBTClassifier

ALS

Page 15: Spark ML par Xebia (Spark Meetup du 11/06/2015)

Evaluators

DataFrame → Evaluator → Metric (Double)

Example metrics: area under ROC curve, area under PR curve, root mean square error

def evaluate(dataset: DataFrame): Double

Input columns: label: Double, prediction: Double, …
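As an illustration of what an evaluator computes, here is a minimal plain-Scala sketch of root mean square error over (label, prediction) pairs. The `Scored` case class and `rmse` function are hypothetical names chosen for this example; this is not Spark's RegressionEvaluator.

```scala
object ToyEvaluator {
  // Hypothetical stand-in for a DataFrame row with `label` and `prediction` columns.
  final case class Scored(label: Double, prediction: Double)

  // Root mean square error: sqrt(mean((prediction - label)^2)).
  def rmse(dataset: Seq[Scored]): Double = {
    require(dataset.nonEmpty, "dataset must not be empty")
    val squaredErrors = dataset.map(s => math.pow(s.prediction - s.label, 2))
    math.sqrt(squaredErrors.sum / dataset.size)
  }

  // Errors are 1 and -7, so the mean squared error is (1 + 49) / 2 = 25.
  val scored = Seq(Scored(1.0, 2.0), Scored(10.0, 3.0))
  val metric = rmse(scored)
}
```

Whatever the metric, the contract is the same: a whole dataset of labels and predictions is reduced to a single Double that can be compared across models.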

Page 16: Spark ML par Xebia (Spark Meetup du 11/06/2015)

Evaluator Usage

import org.apache.spark.ml.evaluation.BinaryClassificationEvaluator

// Area under the ROC curve for the validation set
val evaluator = new BinaryClassificationEvaluator()
println(evaluator.evaluate(dataFrameWithLabelAndPrediction))

Page 17: Spark ML par Xebia (Spark Meetup du 11/06/2015)

Evaluators Examples

RegressionEvaluator

BinaryClassificationEvaluator

Page 18: Spark ML par Xebia (Spark Meetup du 11/06/2015)

Pipeline

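[Diagram. The workflow from the Machine Learning slide (Train Dataset → Feature Engineering → ML Algorithm → Model; Test Dataset → Feature Engineering → Model → Predictions) mapped onto a Pipeline: a Pipeline chains Transformer and Estimator stages and is fit on a DataFrame to produce a PipelineModel, which transforms a DataFrame into a DataFrame.]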

Pipeline is an Estimator
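One way to picture why a Pipeline is an Estimator is the following plain-Scala sketch, written without Spark. All names (`Stage`, `Square`, `MeanCenterer`, and the toy `Pipeline` itself) are hypothetical. Fitting walks the stages in order: a Transformer is applied as-is, while an Estimator is first fit on the data produced so far and then replaced by its model.

```scala
object ToyPipeline {
  // Hypothetical stand-in for a DataFrame with only numeric columns.
  type Row = Map[String, Double]
  type Dataset = Seq[Row]

  sealed trait Stage
  trait Transformer extends Stage { def transform(ds: Dataset): Dataset }
  trait Estimator extends Stage { def fit(ds: Dataset): Transformer }

  // A fitted pipeline: every stage is now a Transformer, applied in order.
  final class PipelineModel(val stages: Seq[Transformer]) extends Transformer {
    def transform(ds: Dataset): Dataset = stages.foldLeft(ds)((d, t) => t.transform(d))
  }

  // The pipeline is itself an Estimator: fitting it yields a PipelineModel.
  final class Pipeline(stages: Seq[Stage]) extends Estimator {
    def fit(ds: Dataset): PipelineModel = {
      val (_, fitted) = stages.foldLeft((ds, Vector.empty[Transformer])) {
        case ((data, acc), t: Transformer) => (t.transform(data), acc :+ t)
        case ((data, acc), e: Estimator) =>
          val model = e.fit(data) // fit on the output of the previous stages
          (model.transform(data), acc :+ model)
      }
      new PipelineModel(fitted)
    }
  }

  // Toy feature-engineering stage: adds `out` = `in` squared.
  final class Square(in: String, out: String) extends Transformer {
    def transform(ds: Dataset): Dataset = ds.map(r => r.updated(out, r(in) * r(in)))
  }

  // Toy estimator: learns the mean of `in`, and its model subtracts that mean.
  final class MeanCenterer(in: String, out: String) extends Estimator {
    def fit(ds: Dataset): Transformer = {
      val mean = ds.map(_(in)).sum / ds.size
      new Transformer {
        def transform(d: Dataset): Dataset = d.map(r => r.updated(out, r(in) - mean))
      }
    }
  }

  val train: Dataset = Seq(Map("x" -> 1.0), Map("x" -> 3.0))
  val test: Dataset = Seq(Map("x" -> 2.0))

  val pipelineModel =
    new Pipeline(Seq(new Square("x", "x2"), new MeanCenterer("x2", "centered"))).fit(train)
  // The mean of x2 learned on `train` (5.0) is reused on `test`.
  val result = pipelineModel.transform(test)
}
```

The point of the sketch is that statistics learned during `fit` come from the training set only and are then applied unchanged to new data.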

Page 19: Spark ML par Xebia (Spark Meetup du 11/06/2015)

Pipeline Usage

import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.feature.{OneHotEncoder, VectorAssembler}

// The stages of our pipeline
val classEncoder = new OneHotEncoder()
  .setInputCol("class")
  .setOutputCol("classVec")
val vectorAssembler = new VectorAssembler()
  .setInputCols(Array("age", "fare", "classVec"))
  .setOutputCol("features")
val logisticRegression = new LogisticRegression()
  .setMaxIter(50)
  .setRegParam(0.01)
// the pipeline
val pipeline = new Pipeline()
  .setStages(Array(classEncoder, vectorAssembler, logisticRegression))
// train
val pipelineModel = pipeline.fit(trainSet)
// predict
val validationPredictions = pipelineModel.transform(testSet)

Page 20: Spark ML par Xebia (Spark Meetup du 11/06/2015)

CrossValidator

Given:

• Estimator

• Parameter Grid

• Evaluator

Find the Model with the best Parameters

CrossValidator is also an Estimator
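The search that a cross validator performs can be sketched in plain Scala, without Spark. The model below (one-parameter ridge regression through the origin), the fold assignment by row index, and all names are assumptions made for illustration. In this toy the metric is an error, so the lowest average wins.

```scala
object ToyCrossValidator {
  final case class Example(x: Double, y: Double)

  // "Estimator": ridge regression through the origin,
  // w = sum(x * y) / (sum(x * x) + regParam).
  def fit(train: Seq[Example], regParam: Double): Double =
    train.map(e => e.x * e.y).sum / (train.map(e => e.x * e.x).sum + regParam)

  // "Evaluator": root mean square error of the predictions w * x.
  def rmse(w: Double, data: Seq[Example]): Double =
    math.sqrt(data.map(e => math.pow(w * e.x - e.y, 2)).sum / data.size)

  // For each candidate parameter, average the validation error over
  // `numFolds` folds, then return the parameter with the lowest average.
  def bestParam(data: Seq[Example], grid: Seq[Double], numFolds: Int): Double = {
    val indexed = data.zipWithIndex
    val scored = grid.map { regParam =>
      val foldErrors = (0 until numFolds).map { fold =>
        // Rows whose index falls in this fold are held out for validation.
        val (validation, training) =
          indexed.partition { case (_, i) => i % numFolds == fold }
        val w = fit(training.map(_._1), regParam)
        rmse(w, validation.map(_._1))
      }
      (regParam, foldErrors.sum / numFolds)
    }
    scored.reduceLeft((a, b) => if (b._2 < a._2) b else a)._1
  }

  // Noise-free data y = 2x, so no regularization should win.
  val data = (1 to 9).map(i => Example(i.toDouble, 2.0 * i))
  val grid = Seq(10.0, 1.0, 0.0)
  val best = bestParam(data, grid, numFolds = 3)

  // Refit on all the data with the winning parameter.
  val finalModel = fit(data, best)
}
```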

Page 21: Spark ML par Xebia (Spark Meetup du 11/06/2015)

CrossValidator Usage

import org.apache.spark.ml.evaluation.BinaryClassificationEvaluator
import org.apache.spark.ml.tuning.{CrossValidator, ParamGridBuilder}

// We will cross validate our pipeline
val crossValidator = new CrossValidator()
  .setEstimator(pipeline)
  .setEvaluator(new BinaryClassificationEvaluator)
// The params we want to test
// (hashingTF is assumed to be a HashingTF stage of the pipeline; it is not defined on these slides)
val paramGrid = new ParamGridBuilder()
  .addGrid(hashingTF.numFeatures, Array(2, 5, 1000))
  .addGrid(logisticRegression.regParam, Array(1.0, 0.1, 0.01))
  .addGrid(logisticRegression.maxIter, Array(10, 50, 100))
  .build()
crossValidator.setEstimatorParamMaps(paramGrid)
// We will use a 3-fold cross validation
crossValidator.setNumFolds(3)
// train
val cvModel = crossValidator.fit(trainSet)
// predict with the best model
val testSetWithPrediction = cvModel.transform(testSet)
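To get a feel for the cost of such a grid, the following plain-Scala sketch expands three parameter lists into every combination, roughly what a parameter grid builder has to produce. The `build` helper and the use of plain strings as parameter names are assumptions for illustration, not the Spark API.

```scala
object ToyParamGrid {
  // Cartesian product of the value lists: one Map per parameter combination.
  def build(grid: Seq[(String, Seq[Double])]): Seq[Map[String, Double]] =
    grid.foldLeft(Seq(Map.empty[String, Double])) { case (combos, (name, values)) =>
      for (combo <- combos; value <- values) yield combo.updated(name, value)
    }

  val paramMaps = build(Seq(
    "numFeatures" -> Seq(2.0, 5.0, 1000.0),
    "regParam"    -> Seq(1.0, 0.1, 0.01),
    "maxIter"     -> Seq(10.0, 50.0, 100.0)
  ))

  // 3 x 3 x 3 = 27 combinations, each fitted once per fold.
  val numFolds = 3
  val fitsDuringSearch = paramMaps.size * numFolds
}
```

With three values for each of three parameters and 3 folds, the search alone requires 27 × 3 = 81 pipeline fits, not counting any final refit of the best parameters on the full training set, so the grid should be sized with care.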

Page 22: Spark ML par Xebia (Spark Meetup du 11/06/2015)

DEMO: https://github.com/mblanc/spark-ml

Page 23: Spark ML par Xebia (Spark Meetup du 11/06/2015)

Conclusion

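[Diagram, "Today" vs. "Tomorrow": dependencies ("uses" arrows) between o.a.spark.ml, o.a.spark.mllib, DataFrame and RDD. o.a.spark.ml is built on DataFrame and o.a.spark.mllib on RDD; the exact direction of the remaining arrows is not recoverable from the transcript.]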

Page 24: Spark ML par Xebia (Spark Meetup du 11/06/2015)

Summary

• Integration with DataFrames

• Familiar API based on scikit-learn

• Simple parameter tuning

• Schema validation

• User-defined Transformers and Estimators

• Composable and DAG Pipelines

1.4.1? 1.5.0?

Page 25: Spark ML par Xebia (Spark Meetup du 11/06/2015)

THANK YOU