Sparkling Water 2.0

29
Sparkling Water 2.0 Jakub Háva [email protected] Machine Learning with H2O, Budapest September 2, 2016

Transcript of Sparkling Water 2.0

Page 1: Sparkling Water 2.0

Sparkling Water 2.0

Jakub Há[email protected]

Machine Learning with H2O, BudapestSeptember 2, 2016

Page 2: Sparkling Water 2.0

Who am I

• Finishing high-performance cluster monitoring tool for JVM based languages

• Finishing Master’s at Charles Uni in Prague • Software engineer at H2O.ai

• Tea lover ( doesn’t mean I don’t like beer!)

Page 3: Sparkling Water 2.0

Distr ibuted Sparkl ing Team

• Michal - Mt. View, CA • Kuba - Prague, CZ • Mateusz – Tokyo, JP • Vlad - Mt. View, CA

Page 4: Sparkling Water 2.0

H2O.aiMachine Intelligence

H2O+Spark = Sparkling

Water

Page 5: Sparkling Water 2.0

Sparkl ing Water

• Transparent integration of H2O with Spark ecosystem - MLlib and H2O side-by-side

• Transparent use of H2O data structures and algorithms with Spark API

• Platform for building Smarter Applications • Excels in existing Spark workflows requiring

advanced Machine Learning algorithms

Functionality missing in H2O can be replaced by Spark and vice versa

Page 6: Sparkling Water 2.0

Benef i ts

• Additional algorithms – NLP

• Powerful data munging • ML Pipelines

• Advanced algorithms

• speed v. accuracy

• advanced parameters

• Fully distributed and parallelised

• Graphical environment

• R/Python interface

Page 7: Sparkling Water 2.0

H2O.aiMachine Intelligence

How to use Sparkling

Water?

Page 8: Sparkling Water 2.0

Star t spark with Sparkl ing Water

Page 9: Sparkling Water 2.0

H2O.aiMachine Intelligence

Model Building

Data Source

Data munging Modelling

Deep Learning, GBMDRF, GLM, GLRM

K-Means, PCACoxPH, Ensembles

Prediction processing

Page 10: Sparkling Water 2.0

H2O.aiMachine Intelligence

Data Munging

Data Source

Data load/munging/ exploration Modelling

Page 11: Sparkling Water 2.0

H2O.aiMachine Intelligence

Stream processing

DataSourceO

ff-lin

e m

odel

trai

ning

Data munging

Model prediction

Deploy the model

Stre

ampr

oces

sing

Data Stream

Spark Streaming/Storm/Flink

Export modelin a binary format

or as code

Modelling

Page 12: Sparkling Water 2.0

H2O.aiMachine Intelligence

What is inside?

Page 13: Sparkling Water 2.0

H2O.aiMachine Intelligence

Cluster

Worker node

Spark executor

Scala/Py main program

Driver node

H2OContext

SparkContext

Worker node

Spark executor

Worker node

Spark executor

Page 14: Sparkling Water 2.0

H2O.aiMachine Intelligence

H2O

Ser

vice

sH

2O S

ervi

ces

DataSource

Spar

k Ex

ecut

orSp

ark

Exec

utor

Spar

k Ex

ecut

or

Spark Cluster

DataFrame

H2O

Ser

vice

s

H2OFrame

DataSource

h2oContext.asDataFrame

h2oContext.asH2OFrame

Page 15: Sparkling Water 2.0

H2O.aiMachine Intelligence

New features, finally!

Page 16: Sparkling Water 2.0

Scala code in H2O Flow

• New type of cell • Access Spark from Flow UI • Experimenting made easy

Page 17: Sparkling Water 2.0

H2O Frame as Spark’s Datasource

• Use native Spark API to load and save data • Spark can optimise the queries when loading data

from H2O Frame • Use of Spark query optimiser

• Let’s see it in practise!

Page 18: Sparkling Water 2.0

Machine learning pipelines

• Wrap our algorithms as Transformers and Estimators • Support for embedding them into Spark ML

Pipelines • Can serialise fitted/unfitted pipelines • Unified API => Arguments are set in the same way

for Spark and H2O Models

Page 19: Sparkling Water 2.0

MLlib Algor ithms in Flow UI

• Can examine them in H2O Flow • Can generate POJO out of them

• Let’s see Support Vector Machines ( SVM )!

Page 20: Sparkling Water 2.0

H2O Algo from Flow

REST API FacadeREST API Adapter Impl

H2O Algorithm Impl

REST API TCP Port 54321

H2O Flow

Web UI

H2O Algorithm API Call

Sparkling Water

Page 21: Sparkling Water 2.0

Pure MLlib Algo from Flow

REST API FacadeREST API Adapter Impl

REST API TCP Port 54321

H2O Flow

Web UI

Mllib Impl

Spark

MLlib SVM Algorithm API Call

Sparkling Water

Page 22: Sparkling Water 2.0

PySparkl ing made easy

• PySparkling now in PyPi • Contains all H2O and Sparkling Water

dependencies, no need to worry about them • Just add in on your Python path and that’s it

Page 23: Sparkling Water 2.0

Sparkl ing Water high-availabil i ty

• New solution • Not yet in master, in testing phase • Sparkling Water is using external H2O cluster

instead of starting H2O in each executor • Spark executors can come and go and H2O won’t

be affected

Page 24: Sparkling Water 2.0

Sparkl ing Water Internal Backend

Sparkling App

jar file

Spark Master

JVM

spark-submit

Spark Worker

JVM

Spark Worker

JVM

Spark Worker

JVM

Sparkling Water Cluster

Spark Executor JVM

H2O

Spark Executor JVM

H2O

Spark Executor JVM

H2O

Page 25: Sparkling Water 2.0

Sparkl ing Water External Backend

Sparkling App

jar file

Spark Master JVM

spark-submit

Spark Worker

JVM

Spark Worker

JVM

Spark Worker

JVM

Sparkling Water Cluster

Spark Executor JVM

H2O

Spark Executor JVM

H2O

Spark Executor JVM

H2O

H2O Cluster

H2O

Page 26: Sparkling Water 2.0

And others!

• Support for Datasets • Zeppelin notebook support • Integration with TensorFlow • Support for fast joins • A Lots of bug fixes..

Page 27: Sparkling Water 2.0

Coming features

• RSparkling • Support for more MLlib algorithms in Flow • Python cell in the H2O Flow • Secure Communication - SSL • …

Page 28: Sparkling Water 2.0

H2O.aiMachine Intelligence

Checkout H2O.ai Training Books

http://h2o.ai/resources

Checkout H2O.ai Blog

http://h2o.ai/blog/

Checkout H2O.ai Youtube Channel

https://www.youtube.com/user/0xdata

Checkout GitHub

https://github.com/h2oai/sparkling-water

More info

Page 29: Sparkling Water 2.0

Learn more at h2o.ai Follow us at @h2oai

Thank you!Sparkling Water is

open-source ML application platform

combining power of Spark and H2O