Skutil - H2O meets Sklearn - Taylor Smith

Scikit-Util H2O MEETS SKLEARN Taylor Smith October 26, 2016

Page 1: Skutil - H2O meets Sklearn - Taylor Smith

Scikit-Util: H2O MEETS SKLEARN

Taylor Smith, October 26, 2016

Page 2: Skutil - H2O meets Sklearn - Taylor Smith

Agenda

- About me
- Problem statement
- Overview
- Package motivation
- Notable H2O additions
- Side-by-side
- Questions

Page 3: Skutil - H2O meets Sklearn - Taylor Smith

About me

Taylor Smith

- Data scientist at State Farm
- M.S. Analytics from The University of Texas at Austin
- ~3 years in data science, ~6 years writing software

http://github.com/tgsmith61591

https://www.linkedin.com/in/taylorgsmith

@TayGriffinSmith

Page 4: Skutil - H2O meets Sklearn - Taylor Smith

Problem statement

WHY AM I STANDING HERE TALKING TO YOU?

Page 5: Skutil - H2O meets Sklearn - Taylor Smith

DS/DE: typical division of labor

Data scientist

1. Frame the problem
2. Gather raw data
3. Analyze

Data engineer

1. Gather raw data
2. Consolidate data
3. Production

Page 6: Skutil - H2O meets Sklearn - Taylor Smith

Where’s the disconnect?

Exploration

- Technologies (Hadoop/Spark/Python/R)

Implementation

- Technologies (Python/R/Java)
- Dependencies/versioning

Discrepancy in tooling

Page 7: Skutil - H2O meets Sklearn - Taylor Smith

Package motivation

What is skutil?

- Began as a pre-processing library to unify Caret, sklearn, etc.
- Specifically relevant to actuarial departments (why?)
- Evolved to include H2O modules

Objectives:

- Deliver an easy transition into the world of distributed computing that H2O offers
- Help bridge the “gap” between data scientist and data engineer roles
- Provide the same, familiar interface that sklearn users have come to know and love

Page 8: Skutil - H2O meets Sklearn - Taylor Smith

Package motivation [cont’d]

Regarding R…

- H2O package completeness

Why Python…

- Quickly growing active user base
- Easily supported by non-DS engineers
- CI/CD friendly

https://www.r-bloggers.com/on-the-growth-of-r-and-python-for-data-science/

Page 9: Skutil - H2O meets Sklearn - Taylor Smith

Skutil: Notable H2O additions

H2OPipeline

- Similar to sklearn.pipeline.Pipeline

Diagram: H2OTransformer -> H2OTransformer -> H2OEstimator
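
The transformer -> transformer -> estimator chain on this slide can be sketched in plain Python. This is only a toy illustration of the fit/transform chaining idea behind sklearn.pipeline.Pipeline; every class name below (ColumnSelector, MeanCenterer, MidpointClassifier, ToyPipeline) is hypothetical and is not part of skutil, H2O, or sklearn.

```python
# Toy sketch of a pipeline: each transformer is fit and applied in order,
# and the final step is an estimator. All class names here are hypothetical.

class ColumnSelector:
    """Hypothetical transformer: keeps only the listed column indices."""
    def __init__(self, keep):
        self.keep = keep

    def fit(self, rows):
        return self  # nothing to learn

    def transform(self, rows):
        return [[row[i] for i in self.keep] for row in rows]


class MeanCenterer:
    """Hypothetical transformer: subtracts the column means learned in fit."""
    def fit(self, rows):
        self.means_ = [sum(col) / len(rows) for col in zip(*rows)]
        return self

    def transform(self, rows):
        return [[x - m for x, m in zip(row, self.means_)] for row in rows]


class MidpointClassifier:
    """Hypothetical estimator: thresholds the first feature at the midpoint
    between the two class means."""
    def fit(self, rows, y):
        zeros = [r[0] for r, label in zip(rows, y) if label == 0]
        ones = [r[0] for r, label in zip(rows, y) if label == 1]
        self.threshold_ = (sum(zeros) / len(zeros) + sum(ones) / len(ones)) / 2
        return self

    def predict(self, rows):
        return [1 if r[0] > self.threshold_ else 0 for r in rows]


class ToyPipeline:
    """Chains (name, transformer) steps and ends with a (name, estimator) step."""
    def __init__(self, steps):
        self.steps = steps

    def _split(self):
        return [t for _, t in self.steps[:-1]], self.steps[-1][1]

    def fit(self, rows, y):
        transformers, estimator = self._split()
        for t in transformers:
            rows = t.fit(rows).transform(rows)  # fit, then pass output onward
        estimator.fit(rows, y)
        return self

    def predict(self, rows):
        transformers, estimator = self._split()
        for t in transformers:
            rows = t.transform(rows)  # reuse what was learned during fit
        return estimator.predict(rows)


X = [[1.0, 10.0], [2.0, 20.0], [3.0, 30.0], [4.0, 40.0]]
y = [0, 0, 1, 1]

pipe = ToyPipeline([
    ("select", ColumnSelector([0])),
    ("center", MeanCenterer()),
    ("clf", MidpointClassifier()),
]).fit(X, y)

predictions = pipe.predict([[1.5, 0.0], [3.5, 0.0]])
```

New data passes through the same fitted transformers before it reaches the estimator, so the preprocessing and the model travel together as one object.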

Page 10: Skutil - H2O meets Sklearn - Taylor Smith

Skutil: Notable H2O additions [cont’d]

H2OGridSearchCV (and H2ORandomizedSearchCV)

- Similar to the sklearn.grid_search module (sklearn.model_selection in scikit-learn 0.18 and later)

Diagram: Parameter grid -> Param set 0 … Param set n -> Best model
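
The parameter grid -> param sets -> best model flow can be sketched in plain Python as follows. This is a toy illustration of exhaustive grid search with cross-validation, not skutil's or sklearn's implementation; ToyKNN, expand_grid, cv_accuracy, and grid_search are hypothetical names, and the data are made up for the example.

```python
from itertools import product


class ToyKNN:
    """Hypothetical 1-D k-nearest-neighbours classifier."""
    def __init__(self, k=1):
        self.k = k

    def fit(self, xs, ys):
        self.train_ = list(zip(xs, ys))
        return self

    def predict(self, xs):
        preds = []
        for x in xs:
            nearest = sorted(self.train_, key=lambda p: abs(p[0] - x))[:self.k]
            votes = [label for _, label in nearest]
            preds.append(max(set(votes), key=votes.count))  # majority vote
        return preds


def expand_grid(grid):
    """Turn {"name": [values...]} into a list of param sets (dicts)."""
    keys = sorted(grid)
    return [dict(zip(keys, combo)) for combo in product(*(grid[k] for k in keys))]


def cv_accuracy(params, xs, ys, n_folds=3):
    """Mean accuracy over interleaved folds for one param set."""
    scores = []
    for fold in range(n_folds):
        test_idx = set(range(fold, len(xs), n_folds))
        train = [(x, y) for i, (x, y) in enumerate(zip(xs, ys)) if i not in test_idx]
        test = [(x, y) for i, (x, y) in enumerate(zip(xs, ys)) if i in test_idx]
        model = ToyKNN(**params).fit([x for x, _ in train], [y for _, y in train])
        preds = model.predict([x for x, _ in test])
        hits = sum(1 for p, (_, y) in zip(preds, test) if p == y)
        scores.append(hits / len(test))
    return sum(scores) / len(scores)


def grid_search(grid, xs, ys):
    """Score every param set, then refit the best one on all the data."""
    results = [(cv_accuracy(params, xs, ys), params) for params in expand_grid(grid)]
    best_score, best_params = max(results, key=lambda r: r[0])
    best_model = ToyKNN(**best_params).fit(xs, ys)
    return best_model, best_params, best_score, results


xs = [0.0, 0.2, 0.4, 0.6, 0.8, 1.0, 1.2, 1.4, 1.6]
ys = [0, 0, 0, 0, 1, 1, 1, 1, 1]
grid = {"k": [1, 3, 5]}  # odd values only, so two-class votes never tie

best_model, best_params, best_score, results = grid_search(grid, xs, ys)
```

A randomized search would follow the same loop but sample a fixed number of param sets from the grid instead of enumerating all of them.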

Page 11: Skutil - H2O meets Sklearn - Taylor Smith

Ok, I have a model… now what?

Deploying in Python?

- Pickle-compatible persistence
- Entire pipelines can be stored

Deploying model in Java?

- Leverage H2O’s built-in “download POJO” capability*
- (A future release will auto-generate a main class and compile a runnable fat-jar)

* Just the H2O model; not the full pipeline
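
As a rough illustration of what "pickle-compatible persistence" of an entire pipeline means, the sketch below round-trips a fitted toy pipeline through Python's standard pickle module. The classes (MeanCenterer, MidpointClassifier, ToyPipeline) are hypothetical stand-ins, not skutil or H2O objects, and persisting a real H2O-backed model may involve extra steps that are not shown here.

```python
import pickle


class MeanCenterer:
    """Hypothetical transformer: subtracts the mean learned in fit."""
    def fit(self, xs):
        self.mean_ = sum(xs) / len(xs)
        return self

    def transform(self, xs):
        return [x - self.mean_ for x in xs]


class MidpointClassifier:
    """Hypothetical estimator: thresholds at the midpoint of the class means."""
    def fit(self, xs, ys):
        zeros = [x for x, y in zip(xs, ys) if y == 0]
        ones = [x for x, y in zip(xs, ys) if y == 1]
        self.threshold_ = (sum(zeros) / len(zeros) + sum(ones) / len(ones)) / 2
        return self

    def predict(self, xs):
        return [1 if x > self.threshold_ else 0 for x in xs]


class ToyPipeline:
    """Hypothetical two-step pipeline: one transformer, then one estimator."""
    def __init__(self, transformer, estimator):
        self.transformer = transformer
        self.estimator = estimator

    def fit(self, xs, ys):
        self.estimator.fit(self.transformer.fit(xs).transform(xs), ys)
        return self

    def predict(self, xs):
        return self.estimator.predict(self.transformer.transform(xs))


pipe = ToyPipeline(MeanCenterer(), MidpointClassifier())
pipe.fit([1.0, 2.0, 3.0, 4.0], [0, 0, 1, 1])

# Serialize the whole fitted pipeline (preprocessing state and model together).
blob = pickle.dumps(pipe)

# Later, or in another process that can import the same classes:
restored = pickle.loads(blob)

original_preds = pipe.predict([1.5, 3.5])
restored_preds = restored.predict([1.5, 3.5])
```

In practice the bytes would be written to a file with pickle.dump and read back with pickle.load. Unpickling executes code from the payload, so only load files from a trusted source.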

Page 12: Skutil - H2O meets Sklearn - Taylor Smith

Skutil at a glance: present and future

Current (v0.1.3)

- Transformers
- Feature selection
- Imputation
- Class balancers
- Model selection & Pipelines

Road map

- PySpark integration
  (Thank you to fellow contributor, Charles Drotar)
- Automated runnable jar creation using jinja

Page 13: Skutil - H2O meets Sklearn - Taylor Smith

H2O vs. Sklearn

SKUTIL IN ACTION

Page 14: Skutil - H2O meets Sklearn - Taylor Smith

H2O vs. Sklearn

1. Load data
2. Split data
3. Fit model

Page 15: Skutil - H2O meets Sklearn - Taylor Smith

Skutil vs. Sklearn

1. Load data
2. Split data
3. Fit model

Page 16: Skutil - H2O meets Sklearn - Taylor Smith

Questions?

THANK YOU!!