Enabling Cognitive Workloads on the Cloud: GPUs with Mesos, Docker and Marathon on POWER


  • #ibmedge 2016 IBM Corporation

    Enabling Cognitive Workloads on the Cloud: GPUs with Mesos, Docker and Marathon on POWER

    Seetharami Seelam, IBM Research

    Indrajit Poddar, IBM Systems

  • #ibmedge

    Please Note:

    IBM's statements regarding its plans, directions, and intent are subject to change or withdrawal without notice and at IBM's sole discretion.

    Information regarding potential future products is intended to outline our general product direction and it should not be relied on in making a purchasing decision.

    The information mentioned regarding potential future products is not a commitment, promise, or legal obligation to deliver any material, code or functionality. Information about potential future products may not be incorporated into any contract.

    The development, release, and timing of any future features or functionality described for our products remains at our sole discretion.

    Performance is based on measurements and projections using standard IBM benchmarks in a controlled environment. The actual throughput or performance that any user will experience will vary depending upon many factors, including considerations such as the amount of multiprogramming in the user's job stream, the I/O configuration, the storage configuration, and the workload processed. Therefore, no assurance can be given that an individual user will achieve results similar to those stated here.


  • #ibmedge

    About Seelam

    Expertise: 10+ years in large-scale high-performance and distributed systems

    Built multiple cloud services for IBM Bluemix: autoscaling, business rules, containers, POWER containers, and Deep Learning as a service

    Enabled and scaled Docker on POWER/Z for extreme density (tens of thousands of containers)

    Enabling GPUs in the cloud for container-based workloads (Mesos/Kubernetes/Docker)

    Dr. Seetharami R. Seelam, Research Staff Member, IBM T. J. Watson Research Center, Yorktown Heights, NY. sseelam@us.ibm.com Twitter: sseelam


  • #ibmedge

    About Indrajit (a.k.a. I.P.)

    Expertise: Accelerated Cloud Data Services, Machine Learning and Deep Learning (Apache Spark, TensorFlow with GPUs)

    Distributed Computing (scale out and up): Cloud Foundry, Spectrum Conductor, Mesos, Kubernetes, Docker, OpenStack, WebSphere

    Cloud computing on High Performance Systems

    Indrajit Poddar, Senior Technical Staff Member, Master Inventor, IBM Systems. ipoddar@us.ibm.com Twitter: @ipoddar

  • #ibmedge


    Introduction to Cognitive workloads and POWER

    Requirements for GPUs in the Cloud

    Mesos/GPU enablement

    Kubernetes/GPU enablement

    Demo of GPU usage with Docker on OpenPOWER to identify dog breeds

    Machine and Deep Learning Ecosystem on OpenPOWER

    Summary and Next Steps


  • #ibmedge



    What you and I (our brains) do without even thinking about it: we recognize a bicycle

  • #ibmedge

    Now machines are learning the way we learn.


    [Figure: drawing from "Texture of the Nervous System of Man and the Vertebrates" by Santiago Ramón y Cajal, alongside an artificial neural network diagram]


  • #ibmedge

    But training needs a lot of computational resources

    Deep Learning model training is hard to distribute

    Training can take hours, days or weeks

    Input data and model sizes are becoming larger than ever (e.g. video input, billions of features, etc.)

    Unprecedented demand for offloaded computation, accelerators, and higher-memory-bandwidth systems

    Resulting in... Moore's law is dying

  • #ibmedge

    OpenPOWER: Open Hardware for High Performance

    Systems designed for big data analytics and superior cloud economics

    12 cores per CPU

    96 hardware threads per CPU

    1 TB RAM

    7.6 Tb/s combined I/O bandwidth

    GPUs and FPGAs coming



  • #ibmedge

    New OpenPOWER Systems with NVLink

    S822LC-hpc ("Minsky"): 2 POWER8 CPUs with 4 NVIDIA Tesla P100 GPUs

    GPUs hooked directly to CPUs using NVIDIA's NVLink high-speed interconnect

    http://www-03.ibm.com/systems/power/hardware/s822lc-hpc/index.html


  • #ibmedge

    Transparent acceleration for Deep Learning on OpenPOWER and GPUs


    Huge speed-ups with GPUs

    Impressive acceleration examples:

    ArtNet genre classifier

    Distributed TensorFlow for cancer detection

    Scale-up and scale-out genetics bioinformatics

    Full red blood cell modeling

    Accelerated ultrasound imaging

    Emergency service prediction


  • #ibmedge

    Enabling Accelerators/GPUs in the cloud stack

    Deep Learning apps


    Containers and images



    Clustering frameworks

  • #ibmedge

    Requirements for GPUs in the Cloud

    Function/Feature: Comments

    GPUs exposed to Dockerized apps: Apps need access to /dev/nvidia* to use the GPUs

    Support for NVIDIA GPUs: Both IBM Cloud and POWER systems support NVIDIA GPUs

    Support multiple GPUs per node: IBM Cloud machines have up to 2 K80s (4 GPUs), and POWER nodes have many more

    Containers require no GPU drivers: GPU drivers are huge, and drivers inside a container create portability problems, so we need to support mounting GPU drivers into the container from the host (volume injection)

    GPU isolation: GPUs allocated to a workload should be invisible to other workloads

    GPU auto-discovery: The worker node agent automatically discovers the GPU types and counts and reports them to the scheduler

    GPU usage metrics: GPU utilization is critical for developers, so these metrics need to be exposed

    Support for heterogeneous GPUs in a cluster (including letting an app pick a GPU type): IBM Cloud has M60s, K80s, etc., and different workloads need different GPU types

    GPU sharing: GPUs should be isolated between workloads, but sharable in some cases between workloads
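    The driver volume injection requirement above is typically met by mounting the host's driver volume and device nodes into the container with plain Docker. The following is only an illustrative sketch: the device paths, driver volume name/version, and image are host-specific examples, not fixed values.

```shell
# Sketch: volume injection of host GPU drivers into a container.
# Device list and the driver volume name (nvidia_driver_367.57) are
# examples; they depend on the host's GPU count and driver version.
docker run --rm \
  --device=/dev/nvidiactl \
  --device=/dev/nvidia-uvm \
  --device=/dev/nvidia0 \
  --volume=nvidia_driver_367.57:/usr/local/nvidia:ro \
  nvidia/cuda nvidia-smi
```

    Because the driver stays on the host, the same image remains portable across machines with different driver versions.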

  • #ibmedge

    NVIDIA Docker


    Credit: https://github.com/NVIDIA/nvidia-docker

    A Docker wrapper and tools to package and run GPU-based apps

    Uses drivers on the host

    Manual GPU assignment

    Good for single node

    Available on POWER
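    A minimal usage sketch (nvidia-docker 1.0 syntax; the image is an example, and GPUs are assigned manually via the NV_GPU environment variable):

```shell
# nvidia-docker wraps `docker run`, mounting the host driver volume and
# GPU devices automatically. NV_GPU limits the container to the listed
# GPU indices (manual assignment).
NV_GPU=0,1 nvidia-docker run --rm nvidia/cuda nvidia-smi
```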

  • #ibmedge

    Mesos and Ecosystem

    Open-source cluster manager

    Enables siloed applications to be consolidated on a shared pool of resources, delivering:

    Rich framework ecosystem

    Emerging GPU support


  • #ibmedge

    Mesos GPU support (Joint work between Mesosphere, NVIDIA and IBM)

    Credit: Kevin Klues, Mesosphere

    As of v1.1, Mesos will support GPUs in two different containerizers:

    Unified containerizer: no Docker support initially; removes the Docker daemon from the node

    Docker containerizer: traditional executor for Docker; Docker-container-based deployment

    Ongoing work: code to allocate GPUs at the node in Docker; code to share state with the unified containerizer; logic for node recovery (NVIDIA driving this work)

    Limitations: no GPU sharing between Docker containers; limited GPU usage information exposed in the UI; slave recovery code will evolve over time; NVIDIA GPUs only
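    For the unified containerizer path, the flow from the Mesos GPU support documentation can be sketched as follows. The master address and work directory are placeholders, and the resource values are examples:

```shell
# Agent: enable the Nvidia GPU isolator (unified containerizer).
mesos-agent --master=<master-ip>:5050 \
  --work_dir=/var/lib/mesos \
  --isolation="cgroups/devices,gpu/nvidia"

# Framework: must advertise the GPU_RESOURCES capability to receive
# offers from GPU-equipped agents; GPUs are then requested like any
# other resource.
mesos-execute --master=<master-ip>:5050 \
  --name=gpu-test \
  --command="nvidia-smi" \
  --framework_capabilities="GPU_RESOURCES" \
  --resources="cpus:0.1;mem:128;gpus:1"
```

    The capability opt-in exists so that GPU machines are not offered to frameworks that would schedule non-GPU work onto them.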

  • #ibmedge


    GPU shared by the Mesos containerizer and the Docker containerizer

    mesos-docker-executor extended to handle device isolation/exposure through the Docker daemon

    Native Docker implementation for CPU/memory/GPU/GPU driver volume management

    [Diagram: Mesos Agent with an Nvidia GPU Isolator managing CPU, memory, GPU and the GPU driver volume across the containerizer and the Docker daemon, with a Docker image label check]

  • #ibmedge

    Mesos GPU monitor and Marathon on OpenPOWER
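    When Marathon is started with GPU scheduling enabled (e.g. --enable_features=gpu_resources), an app can request GPUs in its app definition. This is an illustrative sketch; the app id, image, and resource values are examples:

```json
{
  "id": "/gpu-test",
  "cmd": "nvidia-smi",
  "cpus": 1,
  "mem": 128,
  "gpus": 1,
  "container": {
    "type": "MESOS",
    "docker": { "image": "nvidia/cuda" }
  }
}
```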


  • #ibmedge

    Usage and Progress


    Compile Mesos with the NVML flag: ../configure --with-nvml=/nvml-header-path && make -j install