Enabling Cognitive Workloads on the Cloud: GPUs with Mesos, Docker and Marathon on POWER



  • #ibmedge 2016 IBM Corporation

    Enabling Cognitive Workloads on the Cloud: GPUs with Mesos, Docker and Marathon on POWER

    Seetharami Seelam, IBM Research

    Indrajit Poddar, IBM Systems

  • #ibmedge

    Please Note:

    IBM's statements regarding its plans, directions, and intent are subject to change or withdrawal without notice and at IBM's sole discretion.

    Information regarding potential future products is intended to outline our general product direction and it should not be relied on in making a purchasing decision.

    The information mentioned regarding potential future products is not a commitment, promise, or legal obligation to deliver any material, code or functionality. Information about potential future products may not be incorporated into any contract.

    The development, release, and timing of any future features or functionality described for our products remains at our sole discretion.

    Performance is based on measurements and projections using standard IBM benchmarks in a controlled environment. The actual throughput or performance that any user will experience will vary depending upon many factors, including considerations such as the amount of multiprogramming in the user's job stream, the I/O configuration, the storage configuration, and the workload processed. Therefore, no assurance can be given that an individual user will achieve results similar to those stated here.

    1

  • #ibmedge

    About Seelam

    Expertise: 10+ years in large-scale high-performance and distributed systems

    Built multiple cloud services for IBM Bluemix: autoscaling, business rules, containers, POWER containers, and Deep Learning as a service

    Enabled and scaled Docker on POWER/Z for extreme density (tens of thousands of containers)

    Enabling GPUs in the cloud for container-based workloads (Mesos/Kub/Docker)

    2

    Dr. Seetharami R. Seelam Research Staff Member IBM T. J. Watson Research Center Yorktown Heights, NY sseelam@us.ibm.com Twitter: sseelam


  • #ibmedge

    About Indrajit (a.k.a. I.P)

    Expertise: Accelerated Cloud Data Services, Machine Learning and Deep Learning (Apache Spark, TensorFlow with GPUs)

    Distributed Computing (scale out and up)

    Cloud Foundry, Spectrum Conductor, Mesos, Kubernetes, Docker, OpenStack, WebSphere

    Cloud computing on High Performance Systems

    OpenPOWER, IBM POWER

    3

    Indrajit Poddar Senior Technical Staff Member, Master Inventor, IBM Systems ipoddar@us.ibm.com Twitter: @ipoddar

  • #ibmedge

    Agenda

    Introduction to Cognitive workloads and POWER

    Requirements for GPUs in the Cloud

    Mesos/GPU enablement

    Kubernetes/GPU enablement

    Demo of GPU usage with Docker on OpenPOWER to identify dog breeds

    Machine and Deep Learning Ecosystem on OpenPOWER

    Summary and Next Steps

    4

  • #ibmedge

    Cognition

    5

    What you and I (our brains) do without even thinking about it… we recognize a bicycle

  • #ibmedge

    Now machines are learning the way we learn.

    6

    From "Texture of the Nervous System of Man and the Vertebrates" by Santiago Ramón y Cajal. Artificial Neural Networks


  • #ibmedge

    But training needs a lot of computational resources

    No easy scale-out: Deep Learning model training is hard to distribute

    Training can take hours, days or weeks

    Input data and model sizes are becoming larger than ever (e.g. video input, billions of features, etc.)

    Real-time analytics creates unprecedented demand for offloaded computation, accelerators, and higher memory bandwidth systems

    Resulting in…

    Moore's law is dying

  • #ibmedge

    OpenPOWER: Open Hardware for High Performance

    8

    Systems designed for big data analytics and superior cloud economics

    Up to:

    12 cores per CPU

    96 hardware threads per CPU

    1 TB RAM

    7.6 Tb/s combined I/O bandwidth

    GPUs and FPGAs coming

    [Chart: cloud economics comparison, OpenPOWER vs. traditional Intel x86]

    http://www.softlayer.com/bare-metal-search?processorModel[]=9

  • #ibmedge

    New OpenPOWER Systems with NVLink

    9

    S822LC-hpc ("Minsky"):

    2 POWER8 CPUs with 4 NVIDIA Tesla P100 GPUs

    GPUs hooked directly to the CPUs using NVIDIA's NVLink high-speed interconnect

    http://www-03.ibm.com/systems/power/hardware/s822lc-hpc/index.html

  • #ibmedge

    Transparent acceleration for Deep Learning on OpenPOWER and GPUs

    10

    Huge speed-ups with GPUs and OpenPOWER!

    http://openpower.devpost.com/

    Impressive acceleration examples:

    artNet genre classifier

    Distributed TensorFlow for cancer detection

    Scale-up and scale-out genetics bioinformatics

    Full red blood cell modeling

    Accelerated ultrasound imaging

    Emergency service prediction

  • #ibmedge

    Enabling Accelerators/GPUs in the cloud stack

    11

    [Diagram: the cloud stack — Deep Learning apps on clustering frameworks, packaged as containers and images, running on accelerators]

  • #ibmedge

    Requirements for GPUs in the Cloud

    12

    Function/Feature | Comments

    GPUs exposed to Dockerized applications | Apps need access to /dev/nvidia* to use the GPUs

    Support for NVIDIA GPUs | Both IBM Cloud and POWER systems support NVIDIA GPUs

    Support for multiple GPUs per node | IBM Cloud machines have up to 2 K80s (4 GPUs), and POWER nodes have many more

    Containers require no GPU drivers | GPU drivers are huge, and drivers in a container create portability problems, so we need to support mounting GPU drivers into the container from the host (volume injection)

    GPU isolation | GPUs allocated to a workload should be invisible to other workloads

    GPU auto-discovery | The worker node agent automatically discovers GPU types and counts and reports them to the scheduler

    GPU usage metrics | GPU utilization is critical for developers, so these metrics need to be exposed

    Support for heterogeneous GPUs in a cluster (including letting an app pick a GPU type) | IBM Cloud has M60s, K80s, etc., and different workloads need different GPUs

    GPU sharing | GPUs should be isolated between workloads, but in some cases sharable between workloads
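    The auto-discovery requirement above can be sketched with nvidia-smi's query mode. The query flags are real, but the canned output below is a hypothetical stand-in so the parsing is visible without GPU hardware; a real agent would run the same query on the node.

```shell
# On real hardware the agent would run:
#   nvidia-smi --query-gpu=index,name,memory.total --format=csv,noheader
# Canned output (hypothetical P100 node) stands in for that call here.
canned="0, Tesla P100-SXM2-16GB, 16280 MiB
1, Tesla P100-SXM2-16GB, 16280 MiB"

# Count GPUs and extract the model of the first one.
gpu_count=$(printf '%s\n' "$canned" | wc -l)
gpu_type=$(printf '%s\n' "$canned" | head -n1 | cut -d, -f2 | sed 's/^ //')

echo "discovered ${gpu_count} GPUs of type ${gpu_type}"
# prints: discovered 2 GPUs of type Tesla P100-SXM2-16GB
```

    The agent would report these counts and types to the scheduler as resources.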

  • #ibmedge

    NVIDIA Docker

    13

    Credit: https://github.com/NVIDIA/nvidia-docker

    A Docker wrapper and tools to package and run GPU-based apps

    Uses drivers on the host

    Manual GPU assignment

    Good for single node

    Available on POWER
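    The manual GPU assignment mentioned above is done with an environment variable. A typical single-node invocation looks like this sketch (it assumes NVIDIA drivers and nvidia-docker are installed on the host; the nvidia/cuda image name is illustrative and differs on POWER):

```shell
# Run nvidia-smi inside a CUDA container; nvidia-docker injects the host's
# driver volume and the /dev/nvidia* devices into the container.
nvidia-docker run --rm nvidia/cuda nvidia-smi

# Manual GPU assignment: restrict the container to GPU 0 only.
NV_GPU=0 nvidia-docker run --rm nvidia/cuda nvidia-smi
```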

  • #ibmedge

    Mesos and Ecosystem

    Open-source cluster manager

    Enables siloed applications to be consolidated on a shared pool of resources, delivering:

    Rich framework ecosystem

    Emerging GPU support

    14

  • #ibmedge

    Mesos GPU support (Joint work between Mesosphere, NVIDIA and IBM)

    Credit: Kevin Klues, Mesosphere

    Mesos support for GPUs: as of v1.1, Mesos will support GPUs in two different containerizers:

    Unified containerizer

    No Docker support initially

    Removes the Docker daemon from the node

    Docker containerizer

    Traditional executor for Docker

    Docker-container-based deployment

    Ongoing work:

    Code to allocate GPUs at the node in the Docker containerizer

    Code to share that state with the unified containerizer

    Logic for node recovery (NVIDIA driving this work)

    Limitations:

    No GPU sharing between Docker containers

    Limited GPU usage information exposed in the UI

    Slave recovery code will evolve over time

    NVIDIA GPUs only
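    Once a cluster has GPU support enabled, a framework such as Marathon can request GPUs in an app definition alongside cpus and mem. The sketch below assumes a Marathon endpoint on localhost:8080 and an illustrative image name; the "gpus" field is what asks Mesos for a GPU:

```shell
# Submit a one-GPU app to Marathon (endpoint and image are placeholders).
curl -X POST http://localhost:8080/v2/apps \
  -H "Content-Type: application/json" \
  -d '{
    "id": "/gpu-smoke-test",
    "cpus": 1,
    "mem": 1024,
    "gpus": 1,
    "container": { "type": "MESOS", "docker": { "image": "nvidia/cuda" } },
    "cmd": "nvidia-smi"
  }'
```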

  • #ibmedge

    Implementation

    GPUs shared by the Mesos containerizer and the Docker containerizer

    mesos-docker-executor extended to handle device isolation/exposure through the Docker daemon

    Native Docker implementation for CPU/memory/GPU/GPU-driver-volume management

    16

    [Diagram: the Mesos Agent's Nvidia GPU Isolator (Nvidia GPU Allocator + Nvidia Volume Manager) serves both the Mesos Containerizer and the Docker Containerizer; the Docker Containerizer drives the Docker Daemon via mesos-docker-executor to manage CPU, memory, GPUs, and the GPU driver volume. Docker image label check: com.nvidia.volumes.needed="nvidia_driver"]

  • #ibmedge

    Mesos GPU monitor and Marathon on OpenPOWER

    17

  • #ibmedge

    Usage and Progress

    Usage

    Compile Mesos with the NVML flag: ../configure --with-nvml=/nvml-header-path && make -j ins
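    Expanding the build step above into a fuller sketch, based on the flags in Mesos' own GPU support documentation; the NVML header path, master address, and parallelism are placeholders for your environment:

```shell
# Build Mesos against the NVIDIA Management Library headers (path is a placeholder).
../configure --with-nvml=/usr/include/nvidia/gdk
make -j"$(nproc)" && sudo make install

# Start an agent with GPU isolation enabled (master address is a placeholder).
mesos-agent --master=10.0.0.1:5050 \
  --containerizers=mesos,docker \
  --isolation="filesystem/linux,docker/runtime,cgroups/devices,gpu/nvidia"
```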