Different Strategies of Scaling H2O Machine Learning on Apache Spark

41
Different Strategies of Scaling H2O Machine Learning on Apache Spark Jakub Háva [email protected] H2O & Booking.com, Amsterdam, April 6, 2017

Transcript of Different Strategies of Scaling H2O Machine Learning on Apache Spark

Page 1: Different Strategies of Scaling H2O Machine Learning on Apache Spark

Different Strategies of Scaling H2O Machine

Learning on Apache Spark

Jakub Há[email protected]

H2O & Booking.com, Amsterdam,April 6, 2017

Page 2: Different Strategies of Scaling H2O Machine Learning on Apache Spark

Who am I

• Finishing high-performance cluster monitoring tool for JVM based languages ( JNI, JVMTI, instrumentation )

• Finishing Master’s at Charles Uni in Prague • Software engineer at H2O.ai - Core Sparkling Water

• Climbing & Tea lover, ( doesn’t mean I don’t like beer!)

Page 3: Different Strategies of Scaling H2O Machine Learning on Apache Spark

Distributed Sparkling Team

• Michal - Mt. View, CA • Kuba - Prague, CZ • Mateusz – Tokyo, JP • Vlad - Mt. View, CA

Page 4: Different Strategies of Scaling H2O Machine Learning on Apache Spark

H2O.aiMachine Intelligence

H2O+Spark = Sparkling

Water

Page 5: Different Strategies of Scaling H2O Machine Learning on Apache Spark

Sparkling Water

• Transparent integration of H2O with Spark ecosystem - MLlib and H2O side-by-side

• Transparent use of H2O data structures and algorithms with Spark API

• Platform for building Smarter Applications • Excels in existing Spark workflows requiring advanced

Machine Learning algorithms

Functionality missing in H2O can be replaced by Spark and vice versa

Page 6: Different Strategies of Scaling H2O Machine Learning on Apache Spark

Benefits

• Additional algorithms – NLP

• Powerful data munging • ML Pipelines

• Advanced algorithms

• speed v. accuracy

• advanced parameters

• Fully distributed and parallelised

• Graphical environment

• R/Python interface

Page 7: Different Strategies of Scaling H2O Machine Learning on Apache Spark

H2O.aiMachine Intelligence

How to use Sparkling Water?

Page 8: Different Strategies of Scaling H2O Machine Learning on Apache Spark

Start spark with Sparkling Water

Page 9: Different Strategies of Scaling H2O Machine Learning on Apache Spark

H2O.aiMachine Intelligence

Model Building

Data Source

Data munging Modelling

Deep Learning, GBMDRF, GLM, GLRM

K-Means, PCACoxPH, Ensembles

Prediction processing

Page 10: Different Strategies of Scaling H2O Machine Learning on Apache Spark

H2O.aiMachine Intelligence

Data Munging

Data Source

Data load/munging/ exploration Modelling

Page 11: Different Strategies of Scaling H2O Machine Learning on Apache Spark

H2O.aiMachine Intelligence

Stream processing

DataSourceO

ff-lin

e m

odel

trai

ning

Data munging

Model prediction

Deploy the model

Stre

ampr

oces

sing

Data Stream

Spark Streaming/Storm/Flink

Export modelin a binary format

or as code

Modelling

Page 12: Different Strategies of Scaling H2O Machine Learning on Apache Spark

H2O.aiMachine Intelligence

Features Overview

Page 13: Different Strategies of Scaling H2O Machine Learning on Apache Spark

Scala code in H2O Flow

• New type of cell • Access Spark from Flow UI • Experimenting made easy

Page 14: Different Strategies of Scaling H2O Machine Learning on Apache Spark

H2O Frame as Spark’s Datasource

• Use native Spark API to load and save data • Spark can optimise the queries when loading data

from H2O Frame • Use of Spark query optimiser

Page 15: Different Strategies of Scaling H2O Machine Learning on Apache Spark

Machine learning pipelines

• Wrap our algorithms as Transformers and Estimators • Support for embedding them into Spark ML

Pipelines • Can serialise fitted/unfitted pipelines • Unified API => Arguments are set in the same way

for Spark and H2O Models

Page 16: Different Strategies of Scaling H2O Machine Learning on Apache Spark

MLlib Algorithms in Flow UI

• Can examine them in H2O Flow • Can generate POJO out of them

Page 17: Different Strategies of Scaling H2O Machine Learning on Apache Spark

PySparkling made easy

• PySparkling now in PyPi • Contains all H2O and Sparkling Water

dependencies, no need to worry about them • Just add in on your Python path and that’s it

Page 18: Different Strategies of Scaling H2O Machine Learning on Apache Spark

And others!

• Support for Datasets • RSparkling • Zeppelin notebook support • Integration with TensorFlow ( via DeepWater) • Support for high-cardinality fast joins • Secure Communication - SSL • Bug fixes..

Page 19: Different Strategies of Scaling H2O Machine Learning on Apache Spark

Coming features

• Support for more MLlib algorithms in Flow • Python cell in the H2O Flow • Integration with Steam • …

Page 20: Different Strategies of Scaling H2O Machine Learning on Apache Spark

H2O.aiMachine Intelligence

Internal Backend

Page 21: Different Strategies of Scaling H2O Machine Learning on Apache Spark

Sparkling Water Internal Backend

Sparkling App

jar file

Spark Master

JVM

spark-submit

Spark Worker

JVM

Spark Worker

JVM

Spark Worker

JVM

Sparkling Water Cluster

Spark Executor JVM

H2O

Spark Executor JVM

H2O

Spark Executor JVM

H2O

Page 22: Different Strategies of Scaling H2O Machine Learning on Apache Spark

Internal Backend

Cluster

Worker node

Spark executor Spark executorSpark executor

Driver node

SparkContext

Worker nodeWorker node

H2OContext

Page 23: Different Strategies of Scaling H2O Machine Learning on Apache Spark

Internal Backend

H2O

Ser

vice

sH

2O S

ervi

ces

DataSource

Spar

k Ex

ecut

orSp

ark

Exec

utor

Spar

k Ex

ecut

or

Spark Cluster

DataFrame

H2O

Ser

vice

s

H2OFrame

DataSource

Page 24: Different Strategies of Scaling H2O Machine Learning on Apache Spark

Pros & Cons

• Advantages • Easy to configure • Faster ( no need to send data to different cluster)

• Disadvantages • Spark kills or joins new executor => H2O goes

down • No way how to discover all executors

Page 25: Different Strategies of Scaling H2O Machine Learning on Apache Spark

When to use

• Have a few Spark Executors • Stable environment - not many Spark jobs need to

be repeated • Quick jobs

Page 26: Different Strategies of Scaling H2O Machine Learning on Apache Spark

When not To Use

• Unstable environment • Big number of executors • Big data, long running tasks which often fails in the

middle

Page 27: Different Strategies of Scaling H2O Machine Learning on Apache Spark

How to use

from pysparkling import *

conf = H2OConf(sc) .set_internal_cluster_mode() hc = H2OContext.getOrCreate(sc, conf)

Page 28: Different Strategies of Scaling H2O Machine Learning on Apache Spark

H2O.aiMachine Intelligence

External Backend

Page 29: Different Strategies of Scaling H2O Machine Learning on Apache Spark

Overview

• Sparkling Water is using external H2O cluster instead of starting H2O in each executor

• Spark executors can come and go and H2O won’t be affected

• Start H2O cluster on YARN automatically

Page 30: Different Strategies of Scaling H2O Machine Learning on Apache Spark

Separation Approach

• Separating Spark and H2O• But preserving same API:

val h2oContext = H2OContext.getOrCreate(“http://h2o:54321/)

• Spark and H2O can be submitted as Yarn job and controlled in separation• But H2O still needs non-elastic environment (H2O itself

does not implement HA yet)

Page 31: Different Strategies of Scaling H2O Machine Learning on Apache Spark

Sparkling Water External Backend

Sparkling App

jar file

Spark Master JVM

spark-submit

Spark Worker

JVM

Spark Worker

JVM

Spark Worker

JVM

Sparkling Water Cluster

Spark Executor JVM

H2O

Spark Executor JVM

H2O

Spark Executor JVM

H2O

H2O Cluster

H2O

Page 32: Different Strategies of Scaling H2O Machine Learning on Apache Spark

External Backend

Cluster

Worker node

Spark executor Spark executorSpark executor

Driver node

SparkContext

Worker nodeWorker node

H2OContext

Page 33: Different Strategies of Scaling H2O Machine Learning on Apache Spark

Pros & Cons

• Advantages• H2O does not crash when Spark executor goes down• Better resource management since resources can be

planned per tool

• Disadvantages• Transfer overhead between Spark and H2O processes

• under measurement with cooperation of a customer

Page 34: Different Strategies of Scaling H2O Machine Learning on Apache Spark

Extended H2O Jar

• External cluster requires H2O jar file to be extended by sparkling water classes

• ./gradlew extendJar -PdownloadH2O=cdh5.8

• We are working on making this even easier so the extended Jars will be available online

Page 35: Different Strategies of Scaling H2O Machine Learning on Apache Spark

Modes

• Auto Start mode • Start h2o cluster automatically on YARN

• Manual Start Mode • User is responsible for starting the cluster

manually

Page 36: Different Strategies of Scaling H2O Machine Learning on Apache Spark

Manual Cluster Start Mode

from pysparkling import *

conf = H2OConf(sc) .set_external_cluster_mode() .use_manual_cluster_start() .set_cloud_name(“test”) .set_h2o_driver_path(“path to h2o driver”)

hc = H2OContext.getOrCreate(sc, conf)

Page 37: Different Strategies of Scaling H2O Machine Learning on Apache Spark

Auto Cluster Start Mode

from pysparkling import *

conf = H2OConf(sc) .set_external_cluster_mode() .use_auto_cluster_start() .set_h2o_driver_path(“path_to_extended_driver") .set_num_of_external_h2o_nodes(4) .set_mapper_xmx(“2G”) .set_yarn_queue(“abc”)

hc = H2OContext.getOrCreate(sc, conf)

Page 38: Different Strategies of Scaling H2O Machine Learning on Apache Spark

Multi-network Environment

• Node with Spark driver & H2O client is connected to more networks

• Spark driver choose IP in different network then the external H2O cloud

• We need to manually configure H2O client IP to be on the same network as rest of the H2O cloud

Page 39: Different Strategies of Scaling H2O Machine Learning on Apache Spark

H2O.aiMachine Intelligence

Demo Time

Page 40: Different Strategies of Scaling H2O Machine Learning on Apache Spark

H2O.aiMachine Intelligence

Checkout H2O.ai Training Books

http://h2o.ai/resources

Checkout H2O.ai Blog

http://h2o.ai/blog/

Checkout H2O.ai Youtube Channel

https://www.youtube.com/user/0xdata

Checkout GitHub

https://github.com/h2oai/sparkling-water

More info

Page 41: Different Strategies of Scaling H2O Machine Learning on Apache Spark

Learn more at h2o.ai Follow us at @h2oai

Thank you!Sparkling Water is

open-source ML application platform

combining power of Spark and H2O