Multi-tenant Deep Learning and Streaming as-a-Service with Hopsworks
Theofilos Kakantousis (@theofiloskak), COO – Logical Clocks AB
Big Data Moscow 2018


Page 2:

©2018 Logical Clocks AB. All Rights Reserved

Deep Learning & Streaming-as-a-Service in Sweden

● Hopsworks

– Spark/Flink/Kafka/TensorFlow/Hadoop-as-a-service

– Built on Hops Hadoop (www.hops.io)

– hops.site, 600+ users as of September 2018

● RISE SICS ICE

– 250 kW Datacenter, ~1000 servers

https://www.sics.se/projects/sics-ice-data-center-in-lulea

Page 3:

[…] the general consensus seems to be that everyone expects some gain in performance numbers if the dataset size is increased dramatically [...]

Deep Learning needs Big data

Sun et al. - Revisiting Unreasonable Effectiveness of Data in Deep Learning Era - 2017

Hestness et al. - Deep Learning Scaling is Predictable, Empirically - 2017

Page 4:

AI Hierarchy of Needs

Data Engineers

Data Scientists

Data Scientists?

DDL (Distributed Deep Learning)

Deep Learning, RL

Machine Learning (ML)

Data Analytics

Data Pipelines

Big Data

Lots of GPUs

GPUs

Page 5:

Full-stack Data Science

Page 6:

Hopsworks

REST API

Page 7:

Hopsworks

Develop Train Test Deploy

MySQL Cluster

Hive

InfluxDB

Elasticsearch

Kafka

Projects, Datasets, Users

HopsFS / YARN

Spark, Flink, TensorFlow

Jupyter

Jobs, Kibana, Grafana

REST API

Page 8:

Big data needs scalable storage

Page 9:

HopsFS*

Metadata

Datanode

Namenode

● HDFS derivative with distributed metadata

– 37x increased capacity

– 16x increased throughput

HDFS Client

HDFS Client

Scale-out all layers

* HopsFS - https://goo.gl/yFCsGc

Scale Challenge Winner (2017)

Hops

Page 10:

HopsFS support for Small Files *

File placement by size (figure):

● Metadata layer (NDB), RAM: files < 1 KB
● Metadata layer (NDB), NVMe disk: files between 1 KB and 64 KB
● Datanode: files > 64 KB (configurable)

Namenode

● Integrates NVMe
● Open Images Dataset:
– 9m images
– ~80% small files (<64 KB)

*Size Matters: Improving the Performance of Small Files in Hadoop, Middleware 2018. Niazi et al
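The size thresholds on this slide can be read as a simple placement rule. The following is a minimal sketch under that reading, assuming cut-offs of 1 KB and 64 KB; the function name, the tier names, and the handling of the exact boundary values are hypothetical and are not taken from the HopsFS code base.

```python
from enum import Enum

KB = 1024


class Tier(Enum):
    NDB_RAM = "metadata layer (NDB), in memory"
    NDB_NVME = "metadata layer (NDB), NVMe disk"
    DATANODE = "datanode block storage"


def choose_tier(size_bytes, ram_limit=1 * KB, small_file_limit=64 * KB):
    """Pick a storage tier from the file size; both limits are configurable."""
    if size_bytes < 0:
        raise ValueError("size_bytes must be non-negative")
    if size_bytes < ram_limit:
        return Tier.NDB_RAM       # tiny files stay in memory with the metadata
    if size_bytes <= small_file_limit:
        return Tier.NDB_NVME      # small files go to NVMe in the metadata layer
    return Tier.DATANODE          # everything larger is stored on datanodes


if __name__ == "__main__":
    for size in (200, 10 * KB, 5 * 1024 * KB):
        print(size, "->", choose_tier(size).value)
```

With roughly 80% of the Open Images files under 64 KB, most reads in such a dataset would be served from the first two tiers under this rule.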

Page 11:

Multi-tenancy

Page 12:

Projects for Software-as-a-Service

[Figure labels: Proj-42, Proj-X, Proj-All; Shared Topic, Topic, /Projs/My/Data, CompanyDB]

Page 13:

Manage Projects like Github

Page 14:

Share like in Dropbox

Share any Data Source/Sink: HDFS Datasets, Kafka Topics, etc

Page 15:

Project Authorization

● Data Owner Privileges
– Import/Export data
– Manage Membership
– Share DataSets, Topics
● Data Scientist Privileges
– Write and Run code
● Delegate Administration of privileges to users
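The two roles above amount to a per-project mapping from role to privileges. A minimal sketch of such a check follows; the role keys, the `can` helper, and the assumption that a Data Owner may also write and run code are illustrative and are not the Hopsworks implementation.

```python
from enum import Enum, auto


class Privilege(Enum):
    IMPORT_EXPORT_DATA = auto()
    MANAGE_MEMBERSHIP = auto()
    SHARE_DATASETS_AND_TOPICS = auto()
    WRITE_AND_RUN_CODE = auto()


# Assumption: a Data Owner holds every privilege, including running code.
ROLE_PRIVILEGES = {
    "data_scientist": {Privilege.WRITE_AND_RUN_CODE},
    "data_owner": set(Privilege),
}


def can(user, project, privilege, memberships):
    """memberships maps (project, user) to a role; roles are scoped to one project."""
    role = memberships.get((project, user))
    return privilege in ROLE_PRIVILEGES.get(role, set())


if __name__ == "__main__":
    memberships = {
        ("proj-42", "alice"): "data_owner",
        ("proj-x", "alice"): "data_scientist",
    }
    print(can("alice", "proj-42", Privilege.MANAGE_MEMBERSHIP, memberships))
    print(can("alice", "proj-x", Privilege.MANAGE_MEMBERSHIP, memberships))
```

Keying the membership table by project and user is what lets the same person be an owner in one project and a plain data scientist in another.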

Page 16:

Custom Python environments with Conda

Python libraries are usable by Spark/TensorFlow

Page 17:

TLS (not Kerberos) for security in Hops

● X.509 Certificates for authentication
● 1 Certificate for each project user
● New App certificate generated for each job
● Store an audit trail of the operations (read/write/create/etc.) users and apps perform on HopsFS

Resource Manager

Node Manager

HopsFs

Generate App Cert

Auth w/ App Cert

Project_user cert
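For readers unfamiliar with certificate-based authentication, here is a minimal sketch of a client-side TLS context that verifies the server and can present its own X.509 certificate. It uses only the Python standard library; the helper name and the file paths in the usage note are hypothetical, and this is not how Hops itself wires certificates into its services.

```python
import ssl


def build_mtls_context(ca_file=None, cert_file=None, key_file=None):
    """Client-side TLS context: verifies the server, optionally presents a client cert."""
    # PROTOCOL_TLS_CLIENT enables certificate and hostname verification by default.
    ctx = ssl.SSLContext(ssl.PROTOCOL_TLS_CLIENT)
    ctx.minimum_version = ssl.TLSVersion.TLSv1_2
    if ca_file:
        # Trust only the given certificate authority.
        ctx.load_verify_locations(cafile=ca_file)
    if cert_file:
        # Present this certificate (for example a per-project-user cert) to the server.
        ctx.load_cert_chain(certfile=cert_file, keyfile=key_file)
    return ctx
```

A call such as `build_mtls_context("ca.pem", "project_user.pem", "project_user.key")` (hypothetical file names) would yield a context in which both sides authenticate with certificates, which is the property the slide relies on instead of Kerberos tickets.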

Page 18:

TLS certificate generation

Users don’t see the certificates; they authenticate using:
• LDAP
• password
• 2-Factor Authentication

Add/Del Users

Distributed Database

Insert/Remove Certs

Project Mgr

Root CA

HDFS, Spark, Kafka, YARN

Cert Signing Requests

Intermediate Certificate Authority

Hopsworks

Page 19:

Streaming-as-a-Service

Page 20:

ETL Workloads

Parquet, Hive

Hopsworks Jobs

trigger

Elastic

Pipelines transform raw data to structured data

HopsFS

Page 21:

Streaming Analytics in Hopsworks

[Figure: streaming analytics architecture. Labels: Data Src, Kafka, Batch Analytics, Parquet / ORC, HopsFS, YARN, MySQL, Grafana/InfluxDB, Elastic/Kibana, Public Cloud or On-Premise]

Page 22:

Lifecycle of a Streaming Job

Developer

1. Discover: Schema Registry and Kafka Broker Endpoints
2. Create: Kafka Properties file with certs and broker details
3. Create: Producer/Consumer using Kafka Properties
4. Download: the Schema for the Topic from the Schema Registry
5. Distribute: X.509 certs to all hosts on the cluster
6. Cleanup securely

Operations

Facilitate dev+ops with hops-util https://github.com/logicalclocks/hops-util
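The "Kafka Properties file with certs and broker details" step is the kind of boilerplate that hops-util is meant to hide. As a rough illustration of what has to be written by hand otherwise, the sketch below assembles standard Kafka client SSL settings in plain Python. The configuration keys are the regular Kafka client ones; the helper names, broker addresses, and file paths are hypothetical, and this does not use or reproduce the hops-util API.

```python
def kafka_ssl_properties(brokers, truststore, keystore, password):
    """Return Kafka client settings for a cluster secured with SSL/TLS."""
    return {
        "bootstrap.servers": ",".join(brokers),
        "security.protocol": "SSL",
        "ssl.truststore.location": truststore,
        "ssl.truststore.password": password,
        "ssl.keystore.location": keystore,
        "ssl.keystore.password": password,
        "ssl.key.password": password,
    }


def to_properties_text(props):
    """Render the settings in .properties format, one key=value per line."""
    return "\n".join(f"{key}={value}" for key, value in sorted(props.items())) + "\n"


if __name__ == "__main__":
    props = kafka_ssl_properties(
        ["broker1:9092", "broker2:9092"],  # hypothetical endpoints
        "truststore.jks",                  # hypothetical paths
        "keystore.jks",
        "changeit",
    )
    print(to_properties_text(props), end="")
```

Every value here (endpoints, keystores, passwords) is something a developer would otherwise have to discover and distribute manually, which is the burden the lifecycle list describes.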

Page 23:

Kafka Self-Service UI

Manage & Share
• Topics
• ACLs
• Avro Schemas

Page 24:

Realtime Logs

● YARN aggregates logs on job completion
– No good to us for Streaming

● Collect logs and make them searchable in real-time using Filebeat, Logstash, Elasticsearch, and Kibana

Page 25:

Realtime Logs

Page 26:

Resource Monitoring/Alerting

Page 27:

Jupyter Notebooks

Page 28:

Dela* – A Global Ecosystem for Datasets

Peer-to-Peer Search and Download for Huge DataSets (ImageNet, YouTube8M, MsCoCo, Reddit, etc.)

*http://ieeexplore.ieee.org/document/7980225/ (ICDCS 2017)

Page 29:

ML & Deep Learning-as-a-Service

Page 30:

HopsML Pipeline

Page 31:

HopsML Spark/TensorFlow Arch

Executor/Tf Executor/Tf

Driver

HopsFS, TensorBoard, Model Serving

Conda Envs

Conda Envs

Page 32:

Distributed Training

Page 33:

Deep Learning Hierarchy of Scale

[Figure: pyramid of training setups, annotated with "Training Time for ImageNet"]

Levels shown:
● DDL AllReduce on GPU Servers
● DDL with GPU Servers and Parameter Servers
● Parallel Experiments on GPU Servers
● Many GPUs on a Single GPU Server
● Single GPU

Training time labels shown: Weeks, Days, Days/Hours, Hours, Minutes

“My Model’s Training.”

Page 34:

GPU Resource Requests in HopsYARN

HopsYARN HopsYARN

10 GPUs on 1 host

100 GPUs on 10 hosts with ‘Infiniband’

Hops supports a Heterogeneous Mix of GPUs

4 GPUs on any host

Page 35:

Experiments in Hopsworks

Page 36:

The boring part of the job

● Find good Hyperparameters for your model

● Test different configurations
● Automate this!

“I have to run a hundred experiments to find the best model,” he complained, as he showed me his Jupyter notebooks. “That takes time. Every experiment takes a lot of programming, because there are so many different parameters.”

[https://thomaswdinsmore.com/2018/01/30/predictions-for-2018/]

Page 37:

Experiments in TensorFlow/Hopsworks

● Run and evaluate multiple models in parallel on a subset of the dataset

Experiment 1 Experiment 2

Experiment 4 Experiment 3
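Running several experiments in parallel is, at its core, a grid expansion followed by concurrent evaluation. The sketch below shows that idea with the Python standard library only. It is an illustration, not the Hopsworks experiment API: threads stand in for the cluster executors, and `toy_objective` is a hypothetical placeholder for a real training run on a data subset.

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import product


def expand_grid(grid):
    """Turn {'lr': [...], 'dropout': [...]} into a list of parameter dicts."""
    names = sorted(grid)
    return [dict(zip(names, values)) for values in product(*(grid[n] for n in names))]


def toy_objective(params):
    """Stand-in for a training run; returns a score where higher is better."""
    return -((params["lr"] - 0.01) ** 2) - (params["dropout"] - 0.5) ** 2


def run_experiments(grid, train_fn, max_workers=4):
    """Evaluate every combination concurrently and return (best_params, best_score)."""
    combos = expand_grid(grid)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        scores = list(pool.map(train_fn, combos))  # results keep the order of combos
    best = max(range(len(combos)), key=scores.__getitem__)
    return combos[best], scores[best]


if __name__ == "__main__":
    grid = {"lr": [0.001, 0.01, 0.1], "dropout": [0.3, 0.5]}
    print(run_experiments(grid, toy_objective))
```

Because each combination is independent, the same structure maps naturally onto one experiment per executor or per GPU.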

Page 38:

Reproducible Experiments

● Results tracking
● Hyperparameter tracking
● Jupyter notebook versioning
● Conda Env versioning
● WIP: Dataset versioning
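One way to picture what tracking buys you: if the hyperparameters, notebook version, and Conda environment version are recorded together, a run can be identified by a stable fingerprint of that record. This is a minimal sketch of the idea; the function and field names are hypothetical and do not describe how Hopsworks stores experiment metadata.

```python
import hashlib
import json


def experiment_fingerprint(hyperparameters, notebook_version, conda_env_version):
    """Stable short ID for a run: identical inputs always give the same fingerprint."""
    record = {
        "hyperparameters": hyperparameters,
        "notebook_version": notebook_version,
        "conda_env_version": conda_env_version,
    }
    # Sorting keys makes the serialization independent of dict insertion order.
    canonical = json.dumps(record, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:16]


if __name__ == "__main__":
    print(experiment_fingerprint({"lr": 0.01, "batch_size": 32}, "nb-v3", "env-v1"))
```

Two runs with the same fingerprint used the same code, environment, and settings, so any difference in results points at the data, which is why dataset versioning is the remaining piece.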

Page 39:

Experiments Dashboard

Page 40:

TensorBoard (1)

Page 41:

TensorBoard (2)

Page 42:

HopsAPI*

● Python (also Java/Scala)
– Manage TensorBoard, load/save models in HDFS
– TensorFlow, Horovod, TensorFlowOnSpark
– Parallel experiments
  ● Gridsearch
  ● Model Architecture Search with Genetic Algorithms
– Secure Streaming Analytics with Kafka/Spark/Flink
– SSL/TLS certs, Avro Schema, Endpoints for Kafka/Zookeeper/etc.

* https://github.com/logicalclocks/hops-util-py

Page 43:

Model Serving

Page 44:

Standard serving infrastructure

Scale model serving with Kubernetes

Considered best practice by the community

Provides tools for:
● Fault tolerance
● Rolling release of new models
● Autoscaling

Page 45:

Model Monitoring

HopsFS

Serving infrastructure

Re-train and deploy new model

Model monitoring infrastructure

● Log model inference requests/results to Kafka
● Spark monitors model performance and input data
● When to retrain?
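The slide leaves "When to retrain?" as an open question. One simple policy, shown here purely as an illustration, is to compare accuracy over a recent window of logged predictions with the accuracy measured at deployment time. The function name, the thresholds, and the assumption that ground-truth outcomes are available for logged requests are all hypothetical.

```python
from statistics import mean


def should_retrain(baseline_accuracy, recent_outcomes, max_drop=0.05, min_samples=100):
    """Decide whether to retrain from a window of logged prediction outcomes.

    recent_outcomes is a list of booleans (True when the prediction was correct),
    for example collected by a job that reads the inference log.
    """
    if len(recent_outcomes) < min_samples:
        return False  # too little evidence to act on
    recent_accuracy = mean(1.0 if ok else 0.0 for ok in recent_outcomes)
    return baseline_accuracy - recent_accuracy > max_drop


if __name__ == "__main__":
    window = [True] * 80 + [False] * 20  # 80% accuracy in the recent window
    print(should_retrain(0.90, window))
```

A production policy would likely also watch the input data distribution, as the second bullet suggests, since accuracy can only be measured once labels arrive.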

Page 46:

Model Serving on Kubernetes

Page 47:

Orchestrating Hops workflows

Data Collection

Experimentation

Training

Serving

Feature Extraction

Data Transformation & Verification

Test

Airflow (Hopsworks Operator)
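An orchestrator's job here is to run the stages in dependency order. The sketch below expresses that with the standard-library `graphlib` module (Python 3.9+) rather than the Airflow API. The stage names come from the slide, but the dependency chain between them is an assumption, since the extracted text does not preserve the arrows of the diagram.

```python
from graphlib import TopologicalSorter  # available from Python 3.9

# Each stage maps to the set of stages it depends on.
# Assumption: the stages form a single linear chain in this order.
PIPELINE = {
    "data_collection": set(),
    "data_transformation_and_verification": {"data_collection"},
    "feature_extraction": {"data_transformation_and_verification"},
    "experimentation": {"feature_extraction"},
    "training": {"experimentation"},
    "test": {"training"},
    "serving": {"test"},
}


def execution_order(dag):
    """Return the stages in an order that respects every dependency."""
    return list(TopologicalSorter(dag).static_order())


if __name__ == "__main__":
    for step, stage in enumerate(execution_order(PIPELINE), start=1):
        print(step, stage)
```

In an Airflow deployment each stage would be a task and the same dependencies would be declared between tasks; the scheduler then performs this ordering for you.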

Page 48:

Demo

Page 49:

Summary

● Build a single platform to cover the entire AI hierarchy of needs.

● Increase productivity of Data Scientists
– Manage all your data pipelines and workflows under a single roof
– Have first-class support for Python / Streaming / Deep Learning / ML / Data Governance / GPUs

Page 50:

Hopsworks → logicalclocks.com
GitHub → github.com/logicalclocks
Twitter → @logicalclocks