Spark ML by Xebia (Spark Meetup, 11/06/2015)
Uploaded by hadoop-user-group-france
SPARK ML: A New High-Level API for MLlib
Spark 1.4.0 preview
Matthieu Blanc
Instructor, Spark Developer Training
@matthieublanc
MLLIB: Makes Machine Learning Easy and Scalable
Selection of Machine Learning Algorithms
Several design flaws:
• hard to build machine learning workflows/pipelines
• hard to make MLlib itself a scalable project
• lack of homogeneity
org.apache.spark.ml to the rescue!
Machine Learning
[Diagram] Train Dataset → Feature Engineering → (label, features) → ML Algorithm → Model
Test Dataset → Feature Engineering → (features) → Model → (features, prediction) = Predictions
Machine Learning Pipeline
• Simple construction of ML workflow
• Inspect and debug it
• Tune parameters
• Re-run it on new data
DataFrames
org.apache.spark.ml
Key concepts
• DataFrame as ML Datasets
• Abstractions:
• Transformers
• Estimators
• Evaluators
• Parameters API -> CrossValidator
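To make "DataFrame as ML Dataset" concrete, here is a minimal sketch of a DataFrame in the shape org.apache.spark.ml expects; the values and the `sqlContext` handle are illustrative:

```scala
import org.apache.spark.mllib.linalg.Vectors

// A tiny DataFrame with the conventional columns:
// a double "label" column and a vector "features" column
val training = sqlContext.createDataFrame(Seq(
  (1.0, Vectors.dense(0.0, 1.1, 0.1)),
  (0.0, Vectors.dense(2.0, 1.0, -1.0)),
  (1.0, Vectors.dense(0.0, 1.2, -0.5))
)).toDF("label", "features")

training.printSchema()
```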
Transformers
DataFrame → DataFrame
def transform(dataset: DataFrame): DataFrame
input columns: colA, colB, …, colX
output columns: colA, colB, …, colX, newCol
Transformer Usage
// Add a categoryVec column to the DataFrame
// by applying the OneHotEncoder transformation to the category column
val classEncoder = new OneHotEncoder()
  .setInputCol("category")
  .setOutputCol("categoryVec")
val newDataFrame = classEncoder.transform(dataFrame)
dataFrame: colA, colB, …, category: double
newDataFrame: colA, colB, …, category: double, categoryVec: vector
Transformers Examples
Normalizer
VectorAssembler
PolynomialExpansion
Model
Tokenizer
OneHotEncoder
HashingTF
Binarizer
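As a minimal sketch chaining two of the transformers listed above (Tokenizer, then HashingTF); the column names and the `textDF` input are illustrative:

```scala
import org.apache.spark.ml.feature.{Tokenizer, HashingTF}

// Split a text column into words
val tokenizer = new Tokenizer()
  .setInputCol("text")
  .setOutputCol("words")

// Hash the words into a fixed-size term-frequency vector
val hashingTF = new HashingTF()
  .setInputCol("words")
  .setOutputCol("features")
  .setNumFeatures(1000)

// Each transform adds a column; transformers compose naturally
val withFeatures = hashingTF.transform(tokenizer.transform(textDF))
```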
Estimators
DataFrame → Model
def fit(dataset: DataFrame): Model
input columns: label: double, features: vector, …
The returned Model extends Transformer
Model is a Transformer
DataFrame → DataFrame
def transform(dataset: DataFrame): DataFrame
input columns: features: vector, …
output columns: features: vector, prediction: double, …
Estimator + Model Usage
// Apply LogisticRegression on a training dataset to create a model
// used to compute predictions on a test dataset
val logisticRegression = new LogisticRegression()
  .setMaxIter(50)
  .setRegParam(0.01)
// train
val lrModel = logisticRegression.fit(trainDF)
// predict
val newDataFrameWithPredictions = lrModel.transform(testDF)
Estimators Examples
StringIndexer
StandardScaler
CrossValidator
Pipeline
LinearRegression
LogisticRegression
DecisionTreeClassifier
RandomForestClassifier
GBTClassifier
ALS
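A minimal sketch of one of the estimators listed above (StringIndexer): fit produces a model, and that model is then used as a transformer; the column names and the `df` input are illustrative:

```scala
import org.apache.spark.ml.feature.StringIndexer

// fit learns the string -> index mapping from the data (Estimator)...
val indexer = new StringIndexer()
  .setInputCol("category")
  .setOutputCol("categoryIndex")
val indexerModel = indexer.fit(df)

// ...and the resulting model is a Transformer
val indexed = indexerModel.transform(df)
```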
Evaluators
DataFrame → Metric (Double)
e.g. area under ROC curve, area under PR curve, root mean square error
def evaluate(dataset: DataFrame): Double
input columns: label: double, prediction: double, …
Evaluator Usage
// Area under the ROC curve for the validation set
val evaluator = new BinaryClassificationEvaluator()
println(evaluator.evaluate(dataFrameWithLabelAndPrediction))
Evaluators Examples
RegressionEvaluator
BinaryClassificationEvaluator
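A minimal sketch using the RegressionEvaluator listed above; the metric name and the `dataFrameWithLabelAndPrediction` input are illustrative:

```scala
import org.apache.spark.ml.evaluation.RegressionEvaluator

// Root mean square error between the label and prediction columns
val evaluator = new RegressionEvaluator()
  .setMetricName("rmse")
val rmse = evaluator.evaluate(dataFrameWithLabelAndPrediction)
println(s"RMSE = $rmse")
```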
Pipeline
[Diagram] Train Dataset → Feature Engineering → ML Algorithm → Model
Test Dataset → Feature Engineering → Model → Predictions
Pipeline = a sequence of Transformers and Estimators applied to a DataFrame
PipelineModel = a sequence of Transformers: DataFrame → DataFrame
Pipeline is an Estimator
Pipeline Usage
// The stages of our pipeline
val classEncoder = new OneHotEncoder()
  .setInputCol("class")
  .setOutputCol("classVec")
val vectorAssembler = new VectorAssembler()
  .setInputCols(Array("age", "fare", "classVec"))
  .setOutputCol("features")
val logisticRegression = new LogisticRegression()
  .setMaxIter(50)
  .setRegParam(0.01)
// the pipeline
val pipeline = new Pipeline()
  .setStages(Array(classEncoder, vectorAssembler, logisticRegression))
// train
val pipelineModel = pipeline.fit(trainSet)
// predict
val validationPredictions = pipelineModel.transform(testSet)
CrossValidator
Given:
• Estimator
• Parameter Grid
• Evaluator
Find the Model with the best Parameters
CrossValidator is also an Estimator
CrossValidator Usage
// We will cross validate our pipeline
val crossValidator = new CrossValidator()
  .setEstimator(pipeline)
  .setEvaluator(new BinaryClassificationEvaluator)
// The params we want to test
val paramGrid = new ParamGridBuilder()
  .addGrid(logisticRegression.regParam, Array(1.0, 0.1, 0.01))
  .addGrid(logisticRegression.maxIter, Array(10, 50, 100))
  .build()
crossValidator.setEstimatorParamMaps(paramGrid)
// We will use a 3-fold cross validation
crossValidator.setNumFolds(3)
// train
val cvModel = crossValidator.fit(trainSet)
// predict with the best model
val testSetWithPrediction = cvModel.transform(testSet)
DEMO: https://github.com/mblanc/spark-ml
Conclusion
[Diagram] Today: o.a.spark.ml uses DataFrames and uses o.a.spark.mllib, which uses RDDs. Tomorrow: both o.a.spark.ml and o.a.spark.mllib build on DataFrames.
Summary
• Integration with DataFrames
• Familiar API based on scikit-learn
• Simple parameters tuning
• Schema validation
• User-defined Transformers and Estimators
• Composable and DAG Pipelines
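A minimal sketch of the user-defined Transformers mentioned above, via the UnaryTransformer helper; the class name and column names are illustrative:

```scala
import org.apache.spark.ml.UnaryTransformer
import org.apache.spark.ml.util.Identifiable
import org.apache.spark.sql.types.{DataType, StringType}

// A user-defined Transformer that lower-cases a string column
class LowerCaser(override val uid: String)
  extends UnaryTransformer[String, String, LowerCaser] {

  def this() = this(Identifiable.randomUID("lowerCaser"))

  // the function applied to every value of the input column
  override protected def createTransformFunc: String => String = _.toLowerCase

  // the type of the generated output column
  override protected def outputDataType: DataType = StringType
}

// usage: new LowerCaser().setInputCol("text").setOutputCol("textLower")
```

Because it extends Transformer, such a class can be placed in a Pipeline alongside the built-in stages.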
1.4.1? 1.5.0?
MERCI (Thank you)