R, Scikit-Learn and Apache Spark ML - What difference does it make?

R, Scikit-Learn and Apache Spark ML - What difference does it make? Villu Ruusmann Openscoring OÜ

Page 1: R, Scikit-Learn and Apache Spark ML - What difference does it make?

R, Scikit-Learn and Apache Spark ML - What difference does it make?

Villu Ruusmann, Openscoring OÜ

Page 2: R, Scikit-Learn and Apache Spark ML - What difference does it make?

Overview

● Identifying long-standing, high-value opportunities in the applied predictive analytics domain

● Thinking about problems in API terms

● Providing solutions in API terms

● Developing and applying custom tools

+ A couple of tips if you're looking to buy or sell a VW Golf

Page 3: R, Scikit-Learn and Apache Spark ML - What difference does it make?

The trade-off

Page 4: R, Scikit-Learn and Apache Spark ML - What difference does it make?

"More data beats better algorithms"

Page 5: R, Scikit-Learn and Apache Spark ML - What difference does it make?

The state of the art

Page 6: R, Scikit-Learn and Apache Spark ML - What difference does it make?

Scaling out horizontally

Page 7: R, Scikit-Learn and Apache Spark ML - What difference does it make?

Elements of reproducibility

Standardized, human- and machine-readable descriptions:

● Dataset
● Data pre- and post-processing steps:
  ○ From real-life input table (SQL, CSV) to model
  ○ From model to real-life output table
● Model
● Statistics

Page 8: R, Scikit-Learn and Apache Spark ML - What difference does it make?

Calling R from within Apache Spark

1. Create and initialize R runtime
2. Format and upload input RDD; upload and execute R model; download output and parse into result RDD
3. Destroy R runtime

Page 9: R, Scikit-Learn and Apache Spark ML - What difference does it make?

Calling Scikit-Learn from within Apache Spark

1. Format input RDD (e.g. using Java NIO) as numpy.array
2. Invoke Scikit-Learn via Python/C API
3. Parse output numpy.array into result RDD

Page 10: R, Scikit-Learn and Apache Spark ML - What difference does it make?

API prioritization

Training << Maintenance ~ Deployment

One-time activity << Repeated activities

Short-term << Long-term

Page 11: R, Scikit-Learn and Apache Spark ML - What difference does it make?

JPMML - Java PMML API

● Conversion API
● Maintenance API
● Execution API
  ○ Interpreted mode
  ○ Translated + compiled ("Transpiled") mode
● Serving API
  ○ Integrations with popular Big Data frameworks
  ○ REST web service

Page 12: R, Scikit-Learn and Apache Spark ML - What difference does it make?

Calling JPMML-Spark from within Apache Spark

org.jpmml.spark.TransformerBuilder pmmlTransformerBuilder = ..;

org.apache.spark.ml.Transformer pmmlTransformer = pmmlTransformerBuilder.build();

org.apache.spark.sql.Dataset<Row> input = ..;

org.apache.spark.sql.Dataset<Row> result = pmmlTransformer.transform(input);

Page 13: R, Scikit-Learn and Apache Spark ML - What difference does it make?

The case study

Predicting the price of VW Golf cars using GBT algorithms:

● 71 columns:
  ○ A continuous label: log(price)
  ○ Two string and four numeric categorical features
  ○ 64 binary-like (0/1) and numeric continuous features
● 270'458 rows:
  ○ 153'978 complete cases
  ○ 116'480 incomplete (i.e. with missing values) cases
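
The split into complete and incomplete cases can be reproduced with a simple row scan. A minimal sketch using only the Python standard library, assuming that missing values are marked with "N/A" (the marker that the R training code passes as na.strings); the function name, the sample rows and the column names are made up for illustration.

```python
import csv
import io

def count_cases(tsv_text, na_values=("N/A", "NA", "")):
    """Return (complete, incomplete) row counts for a tab-separated table."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    complete = incomplete = 0
    for row in reader:
        # A row is incomplete as soon as any cell holds a missing-value marker
        if any(value in na_values for value in row.values()):
            incomplete += 1
        else:
            complete += 1
    return complete, incomplete

# Hypothetical three-row sample, not taken from the real dataset
sample = (
    "price\tmileage\tcolour\n"
    "9500\t120000\tRot\n"
    "7200\tN/A\tBlau\n"
    "11000\t80000\tN/A\n"
)
print(count_cases(sample))
```

On the real file the two counts should add up to the total row count, as 153'978 + 116'480 = 270'458 does here.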

Page 14: R, Scikit-Learn and Apache Spark ML - What difference does it make?

Gradient-Boosted Trees (GBTs)

Page 15: R, Scikit-Learn and Apache Spark ML - What difference does it make?

R training and conversion API

#library("caret")

library("gbm")

library("r2pmml")

cars = read.csv("cars.tsv", sep = "\t", na.strings = "N/A")

factor_cols = c("category", "colour", "ac", "fuel_type", "gearbox", "interior_color", "interior_type")

for(factor_col in factor_cols){

cars[, factor_col] = as.factor(cars[, factor_col])

}

# Doesn't work with factors with missing values

#cars.gbm = train(price ~ ., data = cars, method = "gbm", na.action = na.pass, ..)

cars.gbm = gbm(price ~ ., data = cars, n.trees = 100, shrinkage = 0.1, interaction.depth = 6)

r2pmml(cars.gbm, "gbm.pmml")

Page 16: R, Scikit-Learn and Apache Spark ML - What difference does it make?

Scikit-Learn training and conversion API

import pandas

from sklearn_pandas import DataFrameMapper

from sklearn.model_selection import GridSearchCV

from sklearn2pmml import sklearn2pmml, PMMLPipeline

cars = pandas.read_csv("cars.tsv", sep = "\t", na_values = ["N/A", "NA"])

mapper = DataFrameMapper(..)

regressor = ..

tuner = GridSearchCV(regressor, param_grid = .., fit_params = ..)

tuner.fit(mapper.fit_transform(cars), cars["price"])

pipeline = PMMLPipeline([

("mapper", mapper),

("regressor", tuner.best_estimator_)

])

sklearn2pmml(pipeline, "pipeline.pmml", with_repr = True)

Page 17: R, Scikit-Learn and Apache Spark ML - What difference does it make?

Dataset

                   | R                 | LightGBM             | XGBoost                  | Scikit-Learn             | Apache Spark ML
Abstraction        | data.frame        | lgb.Dataset          | xgb.DMatrix              | numpy.array              | RDD<Vector>
Memory layout      | Contiguous, dense | Contiguous, dense(?) | Contiguous, dense/sparse | Contiguous, dense/sparse | Distributed, dense/sparse
Data type          | Any               | double               | float                    | float or double          | double
Categorical values | As-is (factor)    | Encoded              | Binarized                | Binarized                | Binarized
Missing values     | Yes               | Pseudo (NaN)         | Pseudo (NaN)             | No                       | No
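
The difference between "Encoded" and "Binarized" categorical values can be shown on a single column. A minimal pure-Python sketch, not the implementation of any of the listed libraries; the helper names are hypothetical, None stands for a missing value, and mapping a missing value to NaN or to an all-zero row is an assumption made for this illustration.

```python
def label_encode(values):
    """"Encoded": map each distinct category to one integer code."""
    categories = sorted(set(v for v in values if v is not None))
    index = {category: code for code, category in enumerate(categories)}
    # Assumption: a missing value is carried along as NaN
    codes = [index[v] if v is not None else float("nan") for v in values]
    return categories, codes

def binarize(values):
    """"Binarized": expand one column into one 0/1 column per category."""
    categories = sorted(set(v for v in values if v is not None))
    # Assumption: a missing value becomes an all-zero row
    rows = [[1 if v == category else 0 for category in categories] for v in values]
    return categories, rows

colours = ["Rot", "Blau", None, "Rot"]
print(label_encode(colours))
print(binarize(colours))
```

The encoded form keeps one column per feature, whereas the binarized form produces as many columns as there are categories, so a tree can only test one category per split.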

Page 18: R, Scikit-Learn and Apache Spark ML - What difference does it make?

LightGBM via Scikit-Learn

from sklearn_pandas import DataFrameMapper

from sklearn2pmml.preprocessing import PMMLLabelEncoder

from lightgbm import LGBMRegressor

mapper = DataFrameMapper(

[(factor_column, PMMLLabelEncoder()) for factor_column in factor_columns] +

[(continuous_columns, None)]

)

transformed_cars = mapper.fit_transform(cars)

regressor = LGBMRegressor(n_estimators = 100, learning_rate = 0.1, max_depth = 6, num_leaves = 64)

regressor.fit(transformed_cars, cars["price"],

categorical_feature = list(range(0, len(factor_columns))))

Page 19: R, Scikit-Learn and Apache Spark ML - What difference does it make?

XGBoost via Scikit-Learn

from sklearn_pandas import DataFrameMapper

from sklearn2pmml.preprocessing import PMMLLabelBinarizer

from xgboost.sklearn import XGBRegressor

mapper = DataFrameMapper(

[(factor_column, PMMLLabelBinarizer()) for factor_column in factor_columns] +

[(continuous_columns, None)]

)

transformed_cars = mapper.fit_transform(cars)

regressor = XGBRegressor(n_estimators = 100, learning_rate = 0.1, max_depth = 6)

regressor.fit(transformed_cars, cars["price"])

Page 20: R, Scikit-Learn and Apache Spark ML - What difference does it make?

GBT algorithm (training)

                   | R              | LightGBM      | XGBoost           | Scikit-Learn              | Apache Spark ML
Abstraction        | gbm            | LGBMRegressor | XGBRegressor      | GradientBoostingRegressor | GBTRegressor
Parameterizability | Medium         | High          | High              | Medium                    | Medium
Split type         | Multi-way      | Binary        | Binary            | Binary                    | Binary
Categorical values | "set contains" | "equals"      | Pseudo ("equals") | Pseudo ("equals")         | "equals"
Missing values     | First-class    | Pseudo        | Pseudo            | No                        | No

Page 21: R, Scikit-Learn and Apache Spark ML - What difference does it make?

gbm-style splits

<Node id="9">

<SimplePredicate field="interior_type" operator="isMissing"/>

<Node id="12" score="3.0702062395803734E-4">

<SimplePredicate field="colour" operator="isMissing"/>

</Node>

<Node id="10" score="-0.018950416258408962">

<SimpleSetPredicate field="colour" booleanOperator="isIn">

<Array type="string">Grün Rot Violett Weiß</Array>

</SimpleSetPredicate>

</Node>

<Node id="11" score="-0.0017446280908351925">

<SimpleSetPredicate field="colour" booleanOperator="isIn">

<Array type="string">Beige Blau Braun Gelb Gold Grau Orange Schwarz Silber</Array>

</SimpleSetPredicate>

</Node>

</Node>
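
The three child nodes form a multi-way split on colour: one branch for a missing value and two "set contains" branches. A minimal Python sketch of how such a node could be evaluated, assuming that children are tested in document order, that Node 9's own predicate has already matched, and that None stands for a missing value; the function name is hypothetical.

```python
def score_node_9(colour):
    """Pick the score of the first child of Node 9 whose predicate matches."""
    if colour is None:
        # <SimplePredicate field="colour" operator="isMissing"/>
        return 3.0702062395803734e-4
    if colour in {"Grün", "Rot", "Violett", "Weiß"}:
        # First <SimpleSetPredicate booleanOperator="isIn">
        return -0.018950416258408962
    if colour in {"Beige", "Blau", "Braun", "Gelb", "Gold", "Grau",
                  "Orange", "Schwarz", "Silber"}:
        # Second <SimpleSetPredicate booleanOperator="isIn">
        return -0.0017446280908351925
    return None  # no child matched (a colour outside both sets)

for colour in (None, "Rot", "Gold"):
    print(colour, score_node_9(colour))
```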

Page 22: R, Scikit-Learn and Apache Spark ML - What difference does it make?

LightGBM- and XGBoost-style splits (1/3)

<Node id="39" defaultChild="76">

<SimplePredicate field="category" operator="equal" value="Cabrio/Roadster"/>

<Node id="76" score="0.0030283758">

<SimplePredicate field="colour" operator="notEqual" value="Orange"/>

</Node>

<Node id="77" score="0.02483887">

<SimplePredicate field="colour" operator="equal" value="Orange"/>

</Node>

</Node>

Page 23: R, Scikit-Learn and Apache Spark ML - What difference does it make?

LightGBM- and XGBoost-style splits (2/3)

<Node id="39">

<SimplePredicate field="category" operator="equal" value="Cabrio/Roadster"/>

<!-- if(colour == null || !"Orange".equals(colour)) return 0.0030283758 -->

<Node id="76" score="0.0030283758">

<CompoundPredicate booleanOperator="or">

<SimplePredicate field="colour" operator="isMissing"/>

<SimplePredicate field="colour" operator="notEqual" value="Orange"/>

</CompoundPredicate>

</Node>

<!-- else if("Orange".equals(colour)) return 0.02483887 -->

<Node id="77" score="0.02483887">

<SimplePredicate field="colour" operator="equal" value="Orange"/>

</Node>

<!-- else return null -->

</Node>

Page 24: R, Scikit-Learn and Apache Spark ML - What difference does it make?

LightGBM- and XGBoost-style splits (3/3)

<Node id="39">

<SimplePredicate field="category" operator="equal" value="Cabrio/Roadster"/>

<!-- if(colour != null && "Orange".equals(colour)) return 0.02483887 -->

<Node id="77" score="0.02483887">

<CompoundPredicate booleanOperator="and">

<SimplePredicate field="colour" operator="isNotMissing"/>

<SimplePredicate field="colour" operator="equal" value="Orange"/>

</CompoundPredicate>

</Node>

<!-- else return 0.0030283758 -->

<Node id="76" score="0.0030283758">

<True/>

</Node>

</Node>
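
The "or" formulation (isMissing or notEqual, then equal) and the "and" formulation (isNotMissing and equal, then a True catch-all) are meant to score every input identically. A minimal Python sketch that transcribes the pseudocode comments from both PMML fragments and compares them; the function names are hypothetical and None stands for a missing value.

```python
def score_or_form(colour):
    """CompoundPredicate "or" variant: missing or not Orange is tested first."""
    if colour is None or colour != "Orange":
        return 0.0030283758
    elif colour == "Orange":
        return 0.02483887
    return None  # "else return null"; not reachable for string or None input

def score_and_form(colour):
    """CompoundPredicate "and" variant: Orange is tested first, <True/> catches the rest."""
    if colour is not None and colour == "Orange":
        return 0.02483887
    return 0.0030283758

for colour in (None, "Orange", "Rot"):
    # Both variants must agree, including on the missing value
    assert score_or_form(colour) == score_and_form(colour)
    print(colour, score_and_form(colour))
```

The "and" variant ends with an always-true branch, so it can never fall through to a null result.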

Page 25: R, Scikit-Learn and Apache Spark ML - What difference does it make?

Model measurement using JPMML

org.dmg.pmml.tree.TreeModel treeModel = ..;

treeModel.accept(new org.jpmml.model.visitors.AbstractVisitor(){

private int count = 0; // Number of Node elements

private int maxDepth = 0; // Max "nesting depth" of Node elements

@Override

public VisitorAction visit(org.dmg.pmml.tree.Node node){

this.count++;

int depth = 0;

for(org.dmg.pmml.PMMLObject parent : getParents()){

if(!(parent instanceof org.dmg.pmml.tree.Node)) break;

depth++;

}

this.maxDepth = Math.max(this.maxDepth, depth);

return super.visit(node);

}

});
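
The same two measurements (number of Node elements and their maximum nesting depth) can be taken from the PMML text directly. A minimal Python sketch using xml.etree.ElementTree rather than the JPMML visitor API; it assumes an un-namespaced fragment like the ones shown in this deck, whereas a full PMML file declares an XML namespace that would have to be included in the tag name. The function name is hypothetical.

```python
import xml.etree.ElementTree as ET

def measure(node, depth=0):
    """Return (count, max_depth) for a Node element and all Node elements nested in it."""
    count, max_depth = 1, depth
    for child in node.findall("Node"):  # direct Node children only
        child_count, child_depth = measure(child, depth + 1)
        count += child_count
        max_depth = max(max_depth, child_depth)
    return count, max_depth

fragment = """
<Node id="39">
  <SimplePredicate field="category" operator="equal" value="Cabrio/Roadster"/>
  <Node id="77" score="0.02483887">
    <SimplePredicate field="colour" operator="equal" value="Orange"/>
  </Node>
  <Node id="76" score="0.0030283758">
    <True/>
  </Node>
</Node>
""".strip()

print(measure(ET.fromstring(fragment)))
```

As in the Java visitor, the root Node has depth 0 and each enclosing Node adds one level.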

Page 29: R, Scikit-Learn and Apache Spark ML - What difference does it make?

GBT algorithm (interpretation)

                    | R            | LightGBM           | XGBoost                    | Scikit-Learn    | Apache Spark ML
Feature importances | Direct       | Direct             | Transformed                | Transformed     | Transformed
Decision path       | No           | No(?)              | No(?)                      | Transformed     | Transformed
Model persistence   | RDS (binary) | Proprietary (text) | Proprietary (binary, text) | Pickle (binary) | SER (binary) or JSON (text)
Model reusability   | Good         | Fair(?)            | Good                       | Fair            | Fair
Java API            | No           | No                 | Pseudo                     | No              | Yes
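
"Transformed" feature importances refer to the binarized columns the model was trained on, not to the original features. A minimal Python sketch of one way to map them back, assuming that summing the importances of all columns derived from the same feature is an acceptable aggregation; the function name, the column naming scheme and the numbers are hypothetical.

```python
from collections import defaultdict

def aggregate_importances(column_importances, column_to_feature):
    """Sum per-column importances into per-feature importances."""
    totals = defaultdict(float)
    for column, importance in column_importances.items():
        # Columns without a mapping are treated as original features
        totals[column_to_feature.get(column, column)] += importance
    return dict(totals)

# Hypothetical importances of two binarized colour columns and one continuous column
column_importances = {"colour=Rot": 3.0, "colour=Blau": 2.0, "mileage": 10.0}
column_to_feature = {"colour=Rot": "colour", "colour=Blau": "colour"}

print(aggregate_importances(column_importances, column_to_feature))
```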

Page 30: R, Scikit-Learn and Apache Spark ML - What difference does it make?

LightGBM feature importances

Age 936

Mileage 887

Performance 738

[Category] 205

New? 179

[Type of fuel] 170

[Type of interior] 167

Airbags? 130

[Colour] 129

[Type of gearbox] 105

Page 31: R, Scikit-Learn and Apache Spark ML - What difference does it make?

Model execution using JPMML

org.dmg.pmml.PMML pmml;

try(InputStream is = ..){

pmml = org.jpmml.model.PMMLUtil.unmarshal(is);

}

org.jpmml.evaluator.Evaluator evaluator =

new org.jpmml.evaluator.mining.MiningModelEvaluator(pmml);

org.jpmml.evaluator.InputField inputField = selectField(evaluator.getInputFields(), ..);

org.jpmml.evaluator.TargetField targetField = selectField(evaluator.getTargetFields(), ..);

for(int value = min; value <= max; value += increment){

Map<FieldName, FieldValue> arguments =

Collections.singletonMap(inputField.getName(), inputField.prepare(value));

Map<FieldName, ?> result = evaluator.evaluate(arguments);

System.out.println(result.get(targetField.getName()));

}
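
The loop varies a single input field over a range and records the prediction at each step, leaving all other fields unset. A minimal Python sketch of the same pattern, independent of JPMML; the sweep function and the toy evaluator are hypothetical stand-ins for a real model evaluator.

```python
def sweep(evaluate, field, start, stop, step):
    """Evaluate a model at field = start, start + step, ... up to and including stop."""
    results = []
    value = start
    while value <= stop:
        # Only the swept field is supplied; every other input is left missing
        results.append((value, evaluate({field: value})))
        value += step
    return results

def toy_evaluate(arguments):
    """Hypothetical stand-in model: the prediction falls linearly with age."""
    return 100 - 5 * arguments["age"]

for value, prediction in sweep(toy_evaluate, "age", 0, 10, 5):
    print(value, prediction)
```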

Page 34: R, Scikit-Learn and Apache Spark ML - What difference does it make?

Lessons (to be) learned

● Limits and limitations of individual APIs
● Vertical integration vs. horizontal integration:
  ○ All capabilities on a single platform
  ○ Specialized capabilities on specialized platforms

● Ease-of-use and robustness beat raw performance in most application scenarios

● "Conventions over configuration"

Page 35: R, Scikit-Learn and Apache Spark ML - What difference does it make?

Q&A

[email protected]

https://github.com/jpmml

https://github.com/openscoring

https://groups.google.com/forum/#!forum/jpmml