Enabling Cognitive Workloads on the Cloud: GPUs with Mesos, Docker and Marathon on POWER



  • #ibmedge 2016 IBM Corporation

    Enabling Cognitive Workloads on the Cloud: GPUs with Mesos, Docker and Marathon on POWER

    Seetharami Seelam, IBM Research

    Indrajit Poddar, IBM Systems

  • #ibmedge

    Please Note:

    IBM's statements regarding its plans, directions, and intent are subject to change or withdrawal without notice and at IBM's sole discretion.

    Information regarding potential future products is intended to outline our general product direction and it should not be relied on in making a purchasing decision.

    The information mentioned regarding potential future products is not a commitment, promise, or legal obligation to deliver any material, code or functionality. Information about potential future products may not be incorporated into any contract.

    The development, release, and timing of any future features or functionality described for our products remains at our sole discretion.

    Performance is based on measurements and projections using standard IBM benchmarks in a controlled environment. The actual throughput or performance that any user will experience will vary depending upon many factors, including considerations such as the amount of multiprogramming in the user's job stream, the I/O configuration, the storage configuration, and the workload processed. Therefore, no assurance can be given that an individual user will achieve results similar to those stated here.

    1

  • #ibmedge

    About Seelam

    Expertise: 10+ years in large-scale high-performance and distributed systems

    Built multiple cloud services for IBM Bluemix: autoscaling, business rules, containers, POWER containers, and Deep Learning as a service

    Enabled and scaled Docker on POWER/Z for extreme density (tens of thousands of containers)

    Enabling GPUs in the cloud for container-based workloads (Mesos/Kub/Docker)

    2

    Dr. Seetharami R. Seelam Research Staff Member IBM T. J. Watson Research Center Yorktown Heights, NY sseelam@us.ibm.com Twitter: sseelam


  • #ibmedge

    About Indrajit (a.k.a. I.P)

    Expertise: Accelerated Cloud Data Services, Machine Learning and Deep Learning (Apache Spark, TensorFlow with GPUs)

    Distributed Computing (scale out and up)

    Cloud Foundry, Spectrum Conductor, Mesos, Kubernetes, Docker, OpenStack, WebSphere

    Cloud computing on High Performance Systems

    OpenPOWER, IBM POWER

    3

    Indrajit Poddar Senior Technical Staff Member, Master Inventor, IBM Systems ipoddar@us.ibm.com Twitter: @ipoddar

  • #ibmedge

    Agenda

    Introduction to Cognitive workloads and POWER

    Requirements for GPUs in the Cloud

    Mesos/GPU enablement

    Kubernetes/GPU enablement

    Demo of GPU usage with Docker on OpenPOWER to identify dog breeds

    Machine and Deep Learning Ecosystem on OpenPOWER

    Summary and Next Steps

    4

  • #ibmedge

    Cognition

    5

    What you and I (our brains) do without even thinking about it… we recognize a bicycle

  • #ibmedge

    Now machines are learning the way we learn.

    6

    From "Texture of the Nervous System of Man and the Vertebrates" by Santiago Ramón y Cajal. Artificial Neural Networks


  • #ibmedge

    But training needs a lot of computational resources

    No easy scale-out: Deep Learning model training is hard to distribute

    Training can take hours, days or weeks

    Input data and model sizes are becoming larger than ever (e.g. video input, billions of features, etc.)

    Real-time analytics creates unprecedented demand for offloaded computation, accelerators, and higher memory bandwidth systems

    Resulting in…

    Moore's law is dying

  • #ibmedge

    OpenPOWER: Open Hardware for High Performance

    8

    Systems designed for big data analytics and superior cloud economics

    Up to:

    12 cores per CPU

    96 hardware threads per CPU

    1 TB RAM

    7.6 Tb/s combined I/O bandwidth

    GPUs and FPGAs coming

    [Chart: cloud economics comparison, OpenPOWER vs. traditional Intel x86]

    http://www.softlayer.com/bare-metal-search?processorModel[]=9

  • #ibmedge

    New OpenPOWER Systems with NVLink

    9

    S822LC-hpc ("Minsky"):

    2 POWER8 CPUs with 4 NVIDIA Tesla P100 GPUs

    GPUs hooked directly to the CPUs using NVIDIA's NVLink high-speed interconnect

    http://www-03.ibm.com/systems/power/hardware/s822lc-hpc/index.html

  • #ibmedge

    Transparent acceleration for Deep Learning on OpenPOWER and GPUs

    10

    Huge speed-ups with GPUs and OpenPOWER!

    http://openpower.devpost.com/

    Impressive acceleration examples:

    artNet genre classifier

    Distributed TensorFlow for cancer detection

    Scale-up and scale-out genetics bioinformatics

    Full red blood cell modeling

    Accelerated ultrasound imaging

    Emergency service prediction

  • #ibmedge

    Enabling Accelerators/GPUs in the cloud stack

    11

    [Diagram: the cloud stack — Deep Learning apps on clustering frameworks, packaged as containers and images, running on accelerators]

  • #ibmedge

    Requirements for GPUs in the Cloud

    12

    Function/Feature | Comments

    GPUs exposed to Dockerized applications | Apps need access to /dev/nvidia* to use the GPUs

    Support for NVIDIA GPUs | Both IBM Cloud and POWER systems support NVIDIA GPUs

    Support for multiple GPUs per node | IBM Cloud machines have up to 2 K80s (4 GPUs), and POWER nodes have many more

    Containers require no GPU drivers | GPU drivers are huge, and drivers in a container create portability problems, so we need to support mounting GPU drivers into the container from the host (volume injection)

    GPU isolation | GPUs allocated to a workload should be invisible to other workloads

    GPU auto-discovery | The worker node agent automatically discovers GPU types and counts and reports them to the scheduler

    GPU usage metrics | GPU utilization is critical for developers, so these metrics need to be exposed

    Support for heterogeneous GPUs in a cluster (including letting an app pick a GPU type) | IBM Cloud has M60s, K80s, etc., and different workloads need different GPUs

    GPU sharing | GPUs should be isolated between workloads, but in some cases sharable between workloads
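    The auto-discovery requirement above can be sketched with nvidia-smi's query mode. The query flags are real, but the canned output below is a hypothetical stand-in so the parsing is visible without GPU hardware; a real agent would run the same query on the node.

```shell
# On real hardware the agent would run:
#   nvidia-smi --query-gpu=index,name,memory.total --format=csv,noheader
# Canned output (hypothetical P100 node) stands in for that call here.
canned="0, Tesla P100-SXM2-16GB, 16280 MiB
1, Tesla P100-SXM2-16GB, 16280 MiB"

# Count GPUs and extract the model of the first one.
gpu_count=$(printf '%s\n' "$canned" | wc -l)
gpu_type=$(printf '%s\n' "$canned" | head -n1 | cut -d, -f2 | sed 's/^ //')

echo "discovered ${gpu_count} GPUs of type ${gpu_type}"
# prints: discovered 2 GPUs of type Tesla P100-SXM2-16GB
```

    The agent would report these counts and types to the scheduler as resources.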

  • #ibmedge

    NVIDIA Docker

    13

    Credit: https://github.com/NVIDIA/nvidia-docker

    A Docker wrapper and tools to package and run GPU-based apps

    Uses drivers on the host

    Manual GPU assignment

    Good for single node

    Available on POWER
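    The manual GPU assignment mentioned above is done with an environment variable. A typical single-node invocation looks like this sketch (it assumes NVIDIA drivers and nvidia-docker are installed on the host; the nvidia/cuda image name is illustrative and differs on POWER):

```shell
# Run nvidia-smi inside a CUDA container; nvidia-docker injects the host's
# driver volume and the /dev/nvidia* devices into the container.
nvidia-docker run --rm nvidia/cuda nvidia-smi

# Manual GPU assignment: restrict the container to GPU 0 only.
NV_GPU=0 nvidia-docker run --rm nvidia/cuda nvidia-smi
```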

  • #ibmedge

    Mesos and Ecosystem

    Open-source cluster manager

    Enables siloed applications to be consolidated on a shared pool of resources, delivering:

    Rich framework ecosystem

    Emerging GPU support

    14

  • #ibmedge

    Mesos GPU support (Joint work between Mesosphere, NVIDIA and IBM)

    Credit: Kevin Klues, Mesosphere

    Mesos support for GPUs: as of v1.1, Mesos will support GPUs in two different containerizers:

    Unified containerizer

    No Docker support initially

    Removes the Docker daemon from the node

    Docker containerizer

    Traditional executor for Docker

    Docker-container-based deployment

    Ongoing work:

    Code to allocate GPUs at the node in the Docker containerizer

    Code to share that state with the unified containerizer

    Logic for node recovery (NVIDIA driving this work)

    Limitations:

    No GPU sharing between Docker containers

    Limited GPU usage information exposed in the UI

    Slave recovery code will evolve over time

    NVIDIA GPUs only
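    Once a cluster has GPU support enabled, a framework such as Marathon can request GPUs in an app definition alongside cpus and mem. The sketch below assumes a Marathon endpoint on localhost:8080 and an illustrative image name; the "gpus" field is what asks Mesos for a GPU:

```shell
# Submit a one-GPU app to Marathon (endpoint and image are placeholders).
curl -X POST http://localhost:8080/v2/apps \
  -H "Content-Type: application/json" \
  -d '{
    "id": "/gpu-smoke-test",
    "cpus": 1,
    "mem": 1024,
    "gpus": 1,
    "container": { "type": "MESOS", "docker": { "image": "nvidia/cuda" } },
    "cmd": "nvidia-smi"
  }'
```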

  • #ibmedge

    Implementation

    GPUs shared by the Mesos containerizer and the Docker containerizer

    mesos-docker-executor extended to handle device isolation/exposure through the Docker daemon

    Native Docker implementation for CPU/memory/GPU/GPU-driver-volume management

    16

    [Diagram: the Mesos Agent's Nvidia GPU Isolator (Nvidia GPU Allocator + Nvidia Volume Manager) serves both the Mesos Containerizer and the Docker Containerizer; the Docker Containerizer drives the Docker Daemon via mesos-docker-executor to manage CPU, memory, GPUs, and the GPU driver volume. Docker image label check: com.nvidia.volumes.needed="nvidia_driver"]

  • #ibmedge

    Mesos GPU monitor and Marathon on OpenPOWER

    17

  • #ibmedge

    Usage and Progress

    Usage

    Compile Mesos with the NVML flag: ../configure --with-nvml=/nvml-header-path && make -j ins
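    Expanding the build step above into a fuller sketch, based on the flags in Mesos' own GPU support documentation; the NVML header path, master address, and parallelism are placeholders for your environment:

```shell
# Build Mesos against the NVIDIA Management Library headers (path is a placeholder).
../configure --with-nvml=/usr/include/nvidia/gdk
make -j"$(nproc)" && sudo make install

# Start an agent with GPU isolation enabled (master address is a placeholder).
mesos-agent --master=10.0.0.1:5050 \
  --containerizers=mesos,docker \
  --isolation="filesystem/linux,docker/runtime,cgroups/devices,gpu/nvidia"
```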