Enabling Cognitive Workloads on the Cloud: GPUs with Mesos, Docker and Marathon on POWER
07-Jan-2017
#ibmedge 2016 IBM Corporation
Enabling Cognitive Workloads on the Cloud: GPUs with Mesos, Docker and Marathon on POWER
Seetharami Seelam, IBM Research
Indrajit Poddar, IBM Systems
Please Note:
IBM's statements regarding its plans, directions, and intent are subject to change or withdrawal without notice and at IBM's sole discretion.
Information regarding potential future products is intended to outline our general product direction and it should not be relied on in making a purchasing decision.
The information mentioned regarding potential future products is not a commitment, promise, or legal obligation to deliver any material, code or functionality. Information about potential future products may not be incorporated into any contract.
The development, release, and timing of any future features or functionality described for our products remains at our sole discretion.
Performance is based on measurements and projections using standard IBM benchmarks in a controlled environment. The actual throughput or performance that any user will experience will vary depending upon many factors, including considerations such as the amount of multiprogramming in the user's job stream, the I/O configuration, the storage configuration, and the workload processed. Therefore, no assurance can be given that an individual user will achieve results similar to those stated here.
About Seelam

Expertise: 10+ years in large-scale, high-performance, and distributed systems
Built multiple cloud services for IBM Bluemix: autoscaling, business rules, containers, POWER containers, and Deep Learning as a service
Enabled and scaled Docker on POWER/Z for extreme density (tens of thousands of containers)
Enabling GPUs in the cloud for container-based workloads (Mesos/Kubernetes/Docker)
Dr. Seetharami R. Seelam Research Staff Member IBM T. J. Watson Research Center Yorktown Heights, NY sseelam@us.ibm.com Twitter: sseelam
About Indrajit (a.k.a. I.P.)

Expertise: accelerated cloud data services, Machine Learning and Deep Learning (Apache Spark, TensorFlow with GPUs)
Distributed computing (scale out and up): Cloud Foundry, Spectrum Conductor, Mesos, Kubernetes, Docker, OpenStack, WebSphere
Cloud computing on high-performance systems: OpenPOWER, IBM POWER
Indrajit Poddar Senior Technical Staff Member, Master Inventor, IBM Systems ipoddar@us.ibm.com Twitter: @ipoddar
Agenda
Introduction to Cognitive workloads and POWER
Requirements for GPUs in the Cloud
Mesos/GPU enablement
Kubernetes/GPU enablement
Demo of GPU usage with Docker on OpenPOWER to identify dog breeds
Machine and Deep Learning Ecosystem on OpenPOWER
Summary and Next Steps
Cognition
What you and I (our brains) do without even thinking about it: we recognize a bicycle.
Now machines are learning the way we learn.
From "Texture of the Nervous System of Man and the Vertebrates" by Santiago Ramón y Cajal. Artificial Neural Networks
https://en.wikipedia.org/wiki/Santiago_Ram%C3%B3n_y_Cajal
But training needs a lot of computational resources

Deep Learning model training is hard to distribute, and training can take hours, days, or weeks
Input data and model sizes are becoming larger than ever (e.g. video input, billions of features, etc.)
The result: unprecedented demand for offloaded computation, accelerators, and higher-memory-bandwidth systems, just as Moore's law is dying
OpenPOWER: Open Hardware for High Performance
Systems designed for big data analytics and superior cloud economics. Up to:
12 cores per CPU
96 hardware threads per CPU
1 TB RAM
7.6 Tb/s combined I/O bandwidth
GPUs and FPGAs coming

[Chart: OpenPOWER vs. traditional Intel x86]

http://www.softlayer.com/bare-metal-search?processorModel[]=9
New OpenPOWER Systems with NVLink
S822LC for HPC ("Minsky"):
2 POWER8 CPUs with 4 NVIDIA Tesla P100 GPUs
GPUs attached directly to the CPUs using NVIDIA's NVLink high-speed interconnect

http://www-03.ibm.com/systems/power/hardware/s822lc-hpc/index.html
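On a node like this, the NVLink attachment can be checked with standard NVIDIA tooling; a minimal sketch, assuming the CUDA drivers are already installed:

```shell
# List the GPUs, then show the interconnect topology matrix.
# On an NVLink-attached system the GPU links show up as NV# entries.
nvidia-smi
nvidia-smi topo -m
```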
Transparent acceleration for Deep Learning on OpenPOWER and GPUs
Huge speed-ups with GPUs and OpenPOWER!

Impressive acceleration examples:
artNet genre classifier
Distributed TensorFlow for cancer detection
Scale-up and scale-out genetics bioinformatics
Full red blood cell modeling
Accelerated ultrasound imaging
Emergency service prediction

http://openpower.devpost.com/
Enabling Accelerators/GPUs in the cloud stack
[Diagram: Deep Learning apps, packaged as containers and images, scheduled by clustering frameworks onto accelerators]
Requirements for GPUs in the Cloud

GPUs exposed to Dockerized applications: apps need access to /dev/nvidia* to use the GPUs.
Support for NVIDIA GPUs: both IBM Cloud and POWER systems support NVIDIA GPUs.
Support for multiple GPUs per node: IBM Cloud machines have up to 2 K80s (4 GPUs), and POWER nodes have many more.
Containers require no GPU drivers: GPU drivers are huge, and drivers in a container create a portability problem, so we need to support mounting GPU drivers into the container from the host (volume injection).
GPU isolation: GPUs allocated to a workload should be invisible to other workloads.
GPU auto-discovery: the worker node agent automatically discovers the GPU types and counts and reports them to the scheduler.
GPU usage metrics: GPU utilization is critical for developers, so these metrics need to be exposed.
Support for heterogeneous GPUs in a cluster (including letting the app pick a GPU type): IBM Cloud has M60, K80, etc., and different workloads need different GPUs.
GPU sharing: GPUs should be isolated between workloads, but sharable in some cases.
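The first and fourth requirements above (device access plus driver volume injection) can be sketched with plain Docker; the device paths, driver volume name, and image tag are illustrative and vary by installation:

```shell
# Expose the NVIDIA device nodes to the container and mount the host's
# driver libraries read-only, so the image itself ships no driver.
docker run --rm \
  --device=/dev/nvidiactl \
  --device=/dev/nvidia-uvm \
  --device=/dev/nvidia0 \
  -v nvidia_driver:/usr/local/nvidia:ro \
  nvidia/cuda nvidia-smi
```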
NVIDIA Docker
A Docker wrapper and tools to package and run GPU-based apps
Uses drivers on the host
Manual GPU assignment
Good for a single node
Available on POWER

Credit: https://github.com/NVIDIA/nvidia-docker
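A typical invocation, assuming nvidia-docker v1 is installed (the wrapper injects the device nodes and driver volume automatically; image tag illustrative):

```shell
# Run nvidia-smi inside a CUDA container; the wrapper adds the devices.
nvidia-docker run --rm nvidia/cuda nvidia-smi

# Manual GPU assignment: restrict the container to specific GPUs.
NV_GPU=0,1 nvidia-docker run --rm nvidia/cuda nvidia-smi
```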
Mesos and Ecosystem
Open-source cluster manager
Enables siloed applications to be consolidated on a shared pool of resources, delivering:
a rich framework ecosystem
emerging GPU support
Mesos GPU support (joint work between Mesosphere, NVIDIA, and IBM)

Credit: Kevin Klaues, Mesosphere

As of v1.1, Mesos supports GPUs in two different containerizers:
Unified containerizer: no Docker support initially; removes the Docker daemon from the node
Docker containerizer: traditional executor for Docker; Docker-container-based deployment

Ongoing work:
Code to allocate GPUs at the node in the Docker containerizer
Code to share that state with the unified containerizer
Logic for node recovery (NVIDIA driving this work)

Limitations:
No GPU sharing between Docker containers
Limited GPU usage information exposed in the UI
Slave recovery code will evolve over time
NVIDIA GPUs only
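A sketch of turning this on for an agent, following the upstream Mesos documentation for the Nvidia GPU isolator; the master address and device indices are illustrative:

```shell
# Enable the Nvidia GPU isolator and advertise two GPUs to the master.
mesos-agent \
  --master=10.0.0.1:5050 \
  --work_dir=/var/lib/mesos \
  --isolation="cgroups/devices,gpu/nvidia" \
  --nvidia_gpu_devices="0,1" \
  --resources="gpus:2"
```

Note that frameworks must also opt in with the GPU_RESOURCES framework capability, or the master will not offer them GPU resources.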
Implementation
GPU support shared by the Mesos containerizer and the Docker containerizer
mesos-docker-executor extended to handle device isolation/exposure through the Docker daemon
Native Docker implementation for CPU/memory/GPU/GPU driver volume management

[Diagram: inside the Mesos agent, an Nvidia GPU isolator with an Nvidia GPU allocator and an Nvidia volume manager serves both the Mesos containerizer and the Docker containerizer; the Docker containerizer drives the Docker daemon via mesos-docker-executor to set CPU, memory, GPU, and the GPU driver volume]

Docker image label check: com.nvidia.volumes.needed="nvidia_driver"
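Whether an image requests the driver volume can be verified by inspecting that label; a small check, with the image name illustrative:

```shell
# Print the label the Nvidia GPU isolator looks for before injecting
# the driver volume into the container.
docker inspect \
  --format '{{ index .Config.Labels "com.nvidia.volumes.needed" }}' \
  nvidia/cuda
```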
Mesos GPU monitor and Marathon on OpenPOWER
Usage and Progress
Usage
Compile Mesos with the flag: ../configure --with-nvml=/nvml-header-path && make -j install
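Once a GPU-enabled Mesos cluster is up, an app can request GPUs through Marathon; a hedged sketch, assuming a Marathon version that understands the gpus field, with the endpoint and app definition illustrative:

```shell
# POST a one-GPU app to Marathon's REST API.
curl -X POST http://marathon.example.com:8080/v2/apps \
  -H "Content-Type: application/json" \
  -d '{
        "id": "/gpu-smoke-test",
        "cmd": "nvidia-smi && sleep 3600",
        "cpus": 1,
        "mem": 1024,
        "gpus": 1
      }'
```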