ODSC Workshop - Distributed TensorFlow on Hops


@ODSC

Distributed Deep Learning on Hops

Robin Andersson, Fabio Buso

RISE SICS AB | Logical Clocks AB

London | October 12th-14th 2017

Please register on odsc.hops.site

Big Data and AI


Why you are here

From: https://research.fb.com/wp-content/uploads/2017/06/imagenet1kin1h5.pdf

Deep Learning with GPUs (on Hops)


Separate Clusters for Big Data and ML


*Slide from: TensorFlowOnSpark: Scalable TensorFlow Learning on Spark Clusters, YAHOO!

I need estimates for the ROI on these candidate features in our product

We are on it. We need to sync up with IT and engineering first.

Data Science in Enterprises Today

Data Science Team

CTO


IT

Collaboration Overhead is High

Prepare Dataset samples for Data Science

Data Science Team Data Engineering

We need access to these Datasets

DataLake

Ok

1. Update Access Rights

GPU Cluster

2. Copy Dataset Samples (some time later)

3. Run experiments


How it should be

IT

Data Science Data Engineering

Here’s someone who can help you out

I need help to work on a project for the CTO

Project

Conda Env, CPU/Storage Quotas, Self-Service, GDPR

Kafka Topics

DataLake

GPU Cluster

Elasticsearch

HopsWorks Data Platform


HopsWorks


[Diagram: Project X and Project Y, each with its own Kafka Topics and Project Data]

HopsFS


Open Source fork of Apache HDFS

16x faster than HDFS

37x more capacity than HDFS

SSL/TLS instead of Kerberos

Scale Challenge Winner (2017)

https://www.usenix.org/conference/fast17/technical-sessions/presentation/niazi

HopsYARN GPUs


Native GPU support in YARN - world first

Implications

- Schedule GPUs just like memory or CPU
- Exclusive allocation (no GPU-sharing)
- Distributed, scale-out Machine Learning

TensorFlow first-class support in Hops

[Diagram: TensorFlow code runs inside three Spark Executors, each with its own hyperparameters, e.g. learning rates 0.001/0.002/0.003 and dropout 0.3/0.5/0.7]

HopsUtil

Library for launching TensorFlow jobs

Manages the TensorBoard lifecycle

Helper Functions for Spark/Kafka/HDFS/etc


HopsUtil - Read data

from os import path
from hopsutil import hdfs
import tensorflow as tf

dataset = path.join(hdfs.project_path(), 'Resources/mnist/tfr/train')

files = tf.gfile.Glob(path.join(dataset, 'part-*'))

file_queue = tf.train.string_input_producer(files, ...)  # remaining arguments elided

HopsUtil - Initialize the Pydoop HDFS API

The Pydoop HDFS API is a rich API that provides operations such as:

- Connecting to an HDFS instance
- General file operations (create, read, write)
- Getting information on files, directories, and the file system

Connect to HopsFS using HopsUtil:

from hopsutil import hdfs

pydoop_handle = hdfs.get()
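As a hedged example of what the handle is then good for, the sketch below lists a dataset directory and reads a file through it. The list_directory and open_file calls come from the Pydoop hdfs fs API; the 'Resources' dataset and README.md file are only illustrative names.

from os import path
from hopsutil import hdfs

pydoop_handle = hdfs.get()

# List a dataset directory in the project (illustrative dataset name)
for info in pydoop_handle.list_directory(path.join(hdfs.project_path(), 'Resources')):
    print(info['name'], info['size'])

# Read a file through the same handle (illustrative file name)
f = pydoop_handle.open_file(path.join(hdfs.project_path(), 'Resources/README.md'), 'r')
print(f.read())
f.close()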

HopsUtil - TensorBoard

from hopsutil import tensorboard

[...]

logdir = tensorboard.logdir()

sv = tf.train.Supervisor(is_chief=True, logdir=logdir, [...], save_model_secs=60)


HopsUtil - Hyperparameter searching

from hopsutil import tflauncher

def training(learning_rate, dropout):
    [...]

params = {'learning_rate': [0.001, 0.002, 0.003], 'dropout': [0.3, 0.5, 0.7]}
tflauncher.launch(spark, training, params)
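Under the hood the idea is a simple grid search: one training run per combination of the listed values, each executed in its own Spark task (and thus on its own executor/GPU). The snippet below is only a conceptual sketch of that mapping, not the actual HopsUtil implementation; grid_launch is a made-up name.

from itertools import product

def grid_launch(spark, train_fn, params):
    # Expand the dict of lists into every hyperparameter combination,
    # e.g. {'learning_rate': 0.001, 'dropout': 0.3}, {'learning_rate': 0.001, 'dropout': 0.5}, ...
    names = sorted(params)
    combos = [dict(zip(names, values)) for values in product(*(params[n] for n in names))]

    # One Spark partition per combination, so the runs execute in parallel
    rdd = spark.sparkContext.parallelize(combos, len(combos))
    return rdd.map(lambda combo: (combo, train_fn(**combo))).collect()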


HopsUtil - Logging

from hopsutil import hdfs

[...]

while not sv.should_stop() and step < steps:
    hdfs.log(sess.run(accuracy))
    [...]

DEMO TIME! TensorFlow tour on HopsWorks


How to get started


How to get started (2)


How to get started (3)


TensorBoard


Dela - Search for interesting datasets


Dela - Import a Dataset

Dela


p2p network of Hops clusters

Find and share interesting datasets

Exploits unused bandwidth and backs off in case of network traffic

The Challenge


http://timdettmers.com/2017/08/31/deep-learning-research-directions

Experiment Time and Research Productivity

● Minutes, Hours:
  ○ Interactive analysis!
● 1-4 days:
  ○ Interactivity replaced by many parallel experiments
● 1-4 weeks:
  ○ High-value experiments only
● >1 month:
  ○ Don’t even try!


Solution: Go distributed


State-of-the-Art in GPU Hardware


Nvidia DGX-1


SingleRoot Commodity GPU Cluster Computing


The budget side

Commodity Server*
➔ 10 Nvidia GTX 1080Ti
  ◆ 11 GB Memory
➔ 256 GB RAM
➔ 2 Intel Xeon CPUs
➔ Infiniband
➔ SingleRoot PCI Complex

10 x Commodity Server = 150K Euro

Nvidia DGX-1
➔ 8 Nvidia Tesla V100
  ◆ 16 GB Memory
➔ 512 GB RAM
➔ 2 Intel Xeon CPUs
➔ Infiniband
➔ NVLink

Price per DGX-1 = 150K Euro

*www.servethehome.com/single-root-or-dual-root-for-deep-learning-gpu-to-gpu-systems/


Distributed TensorFlow

Distribute TensorFlow graph

Workers / Parameter server

Synchronous / Asynchronous

Model / Data parallelism

Problems:
- The clusterspec must be written by hand
- Processes must be started manually on every machine (see the sketch below)
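To see why these are problems, here is roughly what plain Distributed TensorFlow (TF 1.x) requires: every host:port is hard-coded in the clusterspec, and the same script must be started by hand on every machine with the right job name and task index. A minimal sketch with made-up host names:

import tensorflow as tf

# The clusterspec must be written (and kept in sync) by hand on every machine
cluster = tf.train.ClusterSpec({
    'ps':     ['ps0.example.com:2222'],
    'worker': ['worker0.example.com:2222',
               'worker1.example.com:2222']
})

# Each process must also be started manually with its own role, e.g.:
#   python train.py --job_name=worker --task_index=1
server = tf.train.Server(cluster, job_name='worker', task_index=1)

with tf.device(tf.train.replica_device_setter(cluster=cluster)):
    pass  # build the model here; variables are placed on the parameter server

# Workers then train against server.target; parameter servers just call server.join()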


Introducing TensorFlowOnSpark by YAHOO!

Wrapper for Distributed TensorFlow

- Creates the clusterspec automatically!
- Runs on a Hadoop/Spark cluster
- Starts the workers/parameter servers automatically
- First attempt at “scheduling” GPUs
- Simplifies the programming model
- Manages TensorBoard
- “Migrate all existing TF programs with < 10 lines of code”

TensorFlowOnSpark architecture

[Diagram: HopsFS feeds the Spark Driver and three Spark Executors: one runs the Parameter Server, two run Workers]

Scaling TensorFlowOnSpark


Near linear scaling up to 8 workers

*Slide from: TensorFlowOnSpark: Scalable TensorFlow Learning on Spark Clusters, YAHOO!

TensorFlowOnSpark on Hops


Our improved TensorFlowOnSpark - 1

Problem: Uses RAM (1 GPU = 27 GB RAM) as a proxy to ‘schedule’ GPUs.

Solution: Hops provides GPU scheduling!


Our improved TensorFlowOnSpark - 2

Problem: A worker will wait until GPUs become available, potentially forever!

Solution: GPU scheduling ensures that the GPU is allocated only to that particular worker.


Our improved TensorFlowOnSpark - 3

Problem: Each parameter server allocates 1 GPU, which is a waste!

Solution: Only workers may use GPUs.


Conversion guide: TensorFlowOnSpark

TFCluster.run(spark, training_fun, num_executors, num_ps…)

Add PySpark and TensorFlowOnSpark imports

Create your own FileWriter

Replace tf.train.Server() with TFNode.start_cluster_server()

Full conversion guide for Distributed TensorFlow to TensorFlowOnSpark: https://github.com/yahoo/TensorFlowOnSpark/wiki/Conversion-Guide
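Putting those steps together, a converted program looks roughly like the skeleton below. This is a hedged sketch based on the TensorFlowOnSpark examples; argument names and defaults vary between versions, so follow the conversion guide above for the details.

from pyspark import SparkContext
from tensorflowonspark import TFCluster, TFNode

def training_fun(args, ctx):
    # Replaces tf.train.Server(): the clusterspec is built for us from ctx
    cluster, server = TFNode.start_cluster_server(ctx)
    if ctx.job_name == 'ps':
        server.join()
    else:
        pass  # build the graph, create your own tf.summary.FileWriter, train

sc = SparkContext.getOrCreate()
num_executors, num_ps = 4, 1
cluster = TFCluster.run(sc, training_fun, None, num_executors, num_ps,
                        tensorboard=True, input_mode=TFCluster.InputMode.TENSORFLOW)
cluster.shutdown()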

DEMO TIME! Distributed TF on Spark


Distributed Stochastic Gradient Descent

SGD with Data Parallelism (Single Host)

Facebook: Scaling Synchronous SGD

June 2017: training time on ImageNet reduced from 2 weeks to 1 hour

➔ ~90% scaling efficiency going from 8 to 256 GPUs

Learning rate heuristic / warm-up phase / large batches (see the sketch below)

Paper: https://research.fb.com/wp-content/uploads/2017/06/imagenet1kin1h5.pdf
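The learning-rate heuristic is the paper's linear scaling rule: when the minibatch size grows by a factor k, scale the learning rate by k, and ramp up to that value gradually over the first few epochs. A rough sketch, assuming the paper's baseline of lr 0.1 for a 256-image minibatch and a 5-epoch warm-up:

def learning_rate(epoch, batch_size, base_lr=0.1, base_batch=256, warmup_epochs=5):
    # Linear scaling rule: the target lr grows linearly with the global batch size
    target_lr = base_lr * batch_size / base_batch
    if epoch < warmup_epochs:
        # Gradual warm-up: ramp from base_lr up to the scaled target
        return base_lr + (target_lr - base_lr) * epoch / warmup_epochs
    return target_lr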

All-Reduce

N GPUs, K parameters
Communication cost per GPU: 2(N-1) * K/N, roughly independent of the number of GPUs (see the simulation sketch below)

Overlap communication and computation

Drawback: synchronous communication

From: http://research.baidu.com/bringing-hpc-techniques-deep-learning/
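To make the 2(N-1) * K/N figure concrete, the toy NumPy simulation below runs the two phases of ring all-reduce: a reduce-scatter and an all-gather, each of N-1 steps in which every GPU sends and receives one chunk of K/N values, so per-GPU traffic stays close to 2K no matter how many GPUs join the ring. It only simulates the data movement; there is no real communication.

import numpy as np

def ring_allreduce(grads):
    # grads: one gradient vector of K parameters per 'GPU'
    n = len(grads)
    chunks = [np.array_split(g.astype(float), n) for g in grads]  # N chunks of ~K/N each

    # Phase 1, reduce-scatter: after N-1 steps GPU i owns the full sum of chunk (i+1) % n
    for step in range(n - 1):
        sends = [chunks[i][(i - step) % n].copy() for i in range(n)]
        for i in range(n):
            chunks[(i + 1) % n][(i - step) % n] += sends[i]

    # Phase 2, all-gather: N-1 more steps pass the completed chunks around the ring
    for step in range(n - 1):
        sends = [chunks[i][(i + 1 - step) % n].copy() for i in range(n)]
        for i in range(n):
            chunks[(i + 1) % n][(i + 1 - step) % n] = sends[i]

    return [np.concatenate(c) for c in chunks]

grads = [np.random.rand(12) for _ in range(4)]            # 4 'GPUs', K = 12 parameters
assert np.allclose(ring_allreduce(grads)[0], sum(grads))  # every GPU ends with the same sum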

Baidu All-Reduce - Performance scaling

From: http://research.baidu.com/bringing-hpc-techniques-deep-learning/

Horovod - Better than Baidu All-Reduce?

Fork of Baidu All-Reduce

Improvements

1. Replaced Baidu ring-allreduce with NVIDIA NCCL
2. Tensor Fusion
3. Support for larger models
4. Pip package
5. Horovod Timeline


Migrating existing code to run on Horovod

1. Run hvd.init()

2. Pin a server GPU to be used by this process using config.gpu_options.visible_device_list. The local rank maps to a unique GPU for the process.

3. Wrap the optimizer in hvd.DistributedOptimizer.

4. Add hvd.BroadcastGlobalVariablesHook(0) to broadcast initial variable states from rank 0 to all other processes.
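Put together, those four steps look roughly like the TF 1.x sketch below, which is modelled on the Horovod TensorFlow examples; the toy model and the Adam optimizer are placeholders, not part of the workshop code.

import tensorflow as tf
import horovod.tensorflow as hvd

hvd.init()  # 1. initialize Horovod

config = tf.ConfigProto()  # 2. one GPU per process, picked by local rank
config.gpu_options.visible_device_list = str(hvd.local_rank())

# Placeholder model: a tiny least-squares problem
x = tf.random_normal([32, 10])
w = tf.get_variable('w', [10, 1])
loss = tf.reduce_mean(tf.square(tf.matmul(x, w)))

opt = tf.train.AdamOptimizer(0.001 * hvd.size())  # scale lr with the number of workers
opt = hvd.DistributedOptimizer(opt)               # 3. all-reduce the gradients
train_op = opt.minimize(loss)

hooks = [hvd.BroadcastGlobalVariablesHook(0)]     # 4. sync initial variables from rank 0

with tf.train.MonitoredTrainingSession(config=config, hooks=hooks) as sess:
    for _ in range(100):
        sess.run(train_op)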

Horovod/Baidu AllReduce


Provide as a service on HopsWorks

Integration of All-Reduce with a Hadoop cluster
- Use YARN to schedule GPUs

Scheduling of homogeneous GPUs and network
- YARN supports node labels

HopsFS authentication/authorization

TensorBoard lifecycle management as in HopsUtil

The Team

Active contributors: Jim Dowling, Seif Haridi, Tor Björn Minde, Gautier Berthou, Salman Niazi, Mahmoud Ismail, Theofilos Kakantousis, Ermias Gebremeskel, Antonios Kouzoupis, Alex Ormenisan, Fabio Buso, Robin Andersson, August Bonds, Filotas Siskos, Mahmoud Hamed.

Past contributors: Vasileios Giannokostas, Johan Svedlund Nordström, Rizvi Hasan, Paul Mälzer, Bram Leenders, Juan Roca, Misganu Dessalegn, K “Sri” Srijeyanthan, Jude D’Souza, Alberto Lorente, Andre Moré, Ali Gholami, Davis Jaunzems, Stig Viaene, Hooman Peiro, Evangelos Savvidis, Steffen Grohsschmiedt, Qi Qi, Gayana Chandrasekara, Nikolaos Stanogias, Daniel Bali, Ioannis Kerkinos, Peter Buechler, Pushparaj Motamari, Hamid Afzali, Wasif Malik, Lalith Suresh, Mariano Valles, Ying Lieu, Aruna Kumari Yedurupaka, Tobias Johansson, Roberto Bampi, Fanti Machmount Al Samisti, Braulio Grana, Adam Alpire, Zahin Azher Rashid.

www.hops.io | github.com/hopshadoop

@hopshadoop
