Skutil - H2O meets Sklearn - Taylor Smith
Scikit-Util: H2O MEETS SKLEARN
Taylor Smith, October 26, 2016
Agenda
  About me
  Problem statement
  Overview
  Package motivation
  Notable H2O additions
  Side-by-side
  Questions
About me: Taylor Smith
  Data scientist at State Farm
  M.S. Analytics from The University of Texas at Austin
  ~3 years in data science, ~6 years writing software
  http://github.com/tgsmith61591
  https://www.linkedin.com/in/taylorgsmith
  @TayGriffinSmith
Problem statement: WHY AM I STANDING HERE TALKING TO YOU?
DS/DE: typical division of labor
  Data scientist
    1. Frame the problem
    2. Gather raw data
    3. Analyze
  Data engineer
    1. Gather raw data
    2. Consolidate data
    3. Production
Where’s the disconnect?
  Exploration
    Technologies (Hadoop/Spark/Python/R)
  Implementation
    Technologies (Python/R/Java)
    Dependencies/versioning
  Discrepancy in tooling
Package motivation: What is skutil?
  Began as a pre-processing library to unify Caret, sklearn, etc.
  Specifically relevant to actuarial departments (why?)
  Evolved to include H2O modules
  Objectives:
    Deliver an easy transition into the world of distributed computing that H2O offers
    Help bridge the “gap” between data scientist and data engineer roles
    Provide the same, familiar interface that sklearn users have come to know and love
Package motivation [cont’d]
  Regarding R…
    H2O package completeness
  Why Python…
    Quickly growing active user base
    Easily supported by non-DS engineers
    CI/CD friendly
  https://www.r-bloggers.com/on-the-growth-of-r-and-python-for-data-science/
Skutil: Notable H2O additions
  H2OPipeline
    Similar to sklearn.pipeline.Pipeline
    [Diagram: H2OTransformer -> H2OTransformer -> H2OEstimator]
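
Since the slide describes H2OPipeline only by analogy to sklearn.pipeline.Pipeline, the sklearn side of that analogy can be sketched as follows. This is plain scikit-learn, not skutil code: the iris data, the step names ("scale", "pca", "clf") and the choice of transformers and estimator are illustrative assumptions, picked only to mirror the transformer -> transformer -> estimator layout.

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Small bundled dataset, used here purely for illustration.
X, y = load_iris(return_X_y=True)

# Two transformers followed by a final estimator. Step names are arbitrary
# labels chosen for this sketch.
pipe = Pipeline([
    ("scale", StandardScaler()),                 # transformer 1
    ("pca", PCA(n_components=2)),                # transformer 2
    ("clf", LogisticRegression(max_iter=1000)),  # final estimator
])

# fit() runs fit_transform on each transformer in order, then fits the estimator.
pipe.fit(X, y)

# predict() pushes new data through the same transformers before predicting.
preds = pipe.predict(X)
```

The point of the pattern is that the whole chain behaves like a single estimator, which is what makes it usable inside a grid search or as one persisted object.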
Skutil: Notable H2O additions [cont’d]
  H2OGridSearchCV (and H2ORandomizedSearchCV)
    Similar to sklearn.grid_search module
    [Diagram: Parameter grid -> Param set 0 … Param set n -> Best model]
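
The parameter grid -> param sets -> best model flow can be sketched with scikit-learn's own grid search. This is not the skutil H2OGridSearchCV API, whose signature the slides do not show. It also imports from sklearn.model_selection, which replaced the sklearn.grid_search module in later scikit-learn releases. The dataset, step names and grid values are assumptions made for the example.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

pipe = Pipeline([
    ("scale", StandardScaler()),
    ("tree", DecisionTreeClassifier(random_state=0)),
])

# Parameter grid: keys use the "<step name>__<parameter>" convention so the
# search can reach into a pipeline step. 3 x 2 = 6 parameter sets in total.
param_grid = {
    "tree__max_depth": [1, 2, 3],
    "tree__min_samples_leaf": [1, 5],
}

# Each parameter set is fit and scored with 3-fold cross-validation.
search = GridSearchCV(pipe, param_grid, cv=3)
search.fit(X, y)

# With the default refit=True, the best parameter set is refit on all the data.
best_model = search.best_estimator_
print(search.best_params_)
```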
Ok, I have a model… now what?
  Deploying in Python?
    Pickle-compatible persistence
    Entire pipelines can be stored
  Deploying model in Java?
    Leverage H2O’s built-in “download POJO” capability*
    (future release will auto-gen main class and compile runnable fat-jar)
  * Just the H2O model; not the full pipeline
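
For the Python deployment path, the idea of storing an entire pipeline can be sketched with the standard pickle module and a plain scikit-learn pipeline. The slides do not show skutil's own persistence calls, so none are used here. The dataset and step names are assumptions for illustration.

```python
import pickle

import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)

# Fit the whole chain (pre-processing plus model) as a single object.
pipe = Pipeline([
    ("scale", StandardScaler()),
    ("clf", LogisticRegression(max_iter=1000)),
]).fit(X, y)

# Serialize the fitted pipeline to bytes. Writing to a file with pickle.dump
# works the same way.
blob = pickle.dumps(pipe)

# Later, possibly in another process, restore it and predict. The fitted
# pre-processing steps travel with the model.
restored = pickle.loads(blob)
same = np.array_equal(pipe.predict(X), restored.predict(X))
```

Pickled objects should be loaded with the same library versions that wrote them, and only from trusted sources, since unpickling can execute arbitrary code.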
Skutil at a glance: present and future
  Current (v0.1.3)
    Transformers
    Feature selection
    Imputation
    Class balancers
    Model selection & Pipelines
  Road map
    PySpark integration (thank you to fellow contributor, Charles Drotar)
    Automated runnable jar creation using jinja
H2O vs. Sklearn: SKUTIL IN ACTION
H2O vs. Sklearn
  Load data
  Split data
  Fit model
Skutil vs. Sklearn
  Load data
  Split data
  Fit model
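
The code shown on these comparison slides is not in the transcript. As a rough stand-in for the sklearn column only, the three steps (load data, split data, fit model) might look like the sketch below. The dataset, the split ratio and the random forest are assumptions. The H2O and skutil columns are left out because they need a running H2O cluster and their exact calls are not given here.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# 1. Load data (a bundled toy dataset stands in for the talk's data).
X, y = load_iris(return_X_y=True)

# 2. Split data into training and hold-out sets, stratified by class.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42, stratify=y
)

# 3. Fit model on the training split and score it on the hold-out split.
model = RandomForestClassifier(n_estimators=50, random_state=42)
model.fit(X_train, y_train)
accuracy = model.score(X_test, y_test)
print(f"hold-out accuracy: {accuracy:.3f}")
```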
Questions? THANK YOU!!