ODSC Workshop - Distributed TensorFlow on Hops


Page 1: Odsc workshop - Distributed Tensorflow on Hops

@ODSC

Distributed Deep Learning on Hops

Robin Andersson
Fabio Buso

RISE SICS AB

Logical Clocks AB

London | October 12th-14th 2017

Page 2: Odsc workshop - Distributed Tensorflow on Hops

Please register on odsc.hops.site

Page 3: Odsc workshop - Distributed Tensorflow on Hops

Big Data and AI

3

Page 4: Odsc workshop - Distributed Tensorflow on Hops

Why you are here

4

From: https://research.fb.com/wp-content/uploads/2017/06/imagenet1kin1h5.pdf

Page 5: Odsc workshop - Distributed Tensorflow on Hops

Deep Learning with GPUs (on Hops)

5

Page 6: Odsc workshop - Distributed Tensorflow on Hops

Separate Clusters for Big Data and ML

6

*Slide from: TensorFlowOnSpark: Scalable TensorFlow Learning on Spark Clusters, YAHOO!

Page 7: Odsc workshop - Distributed Tensorflow on Hops

7

Data Science in Enterprises Today

CTO: "I need estimates for the ROI on these candidate features in our product."

Data Science Team: "We are on it. Need to first sync up with IT and engineering."

Page 8: Odsc workshop - Distributed Tensorflow on Hops

8

Collaboration Overhead is High: Prepare Dataset Samples for Data Science

Actors: Data Science Team, Data Engineering, IT, Data Lake, GPU Cluster

Data Science Team: "We need access to these Datasets."
Data Engineering: "Ok."

1. Update access rights
2. Copy dataset samples to the GPU cluster (some time later)
3. Run experiments

Page 9: Odsc workshop - Distributed Tensorflow on Hops

9

How it should be

Data Science: "I need help to work on a project for the CTO."
IT: "Here's someone who can help you out." (Data Engineering)

Project: Conda Env, CPU/Storage Quotas, Self-Service, GDPR

Resources: Kafka Topics, Data Lake, GPU Cluster, Elasticsearch

Page 10: Odsc workshop - Distributed Tensorflow on Hops

HopsWorks Data Platform

10

Page 11: Odsc workshop - Distributed Tensorflow on Hops

HopsWorks

11

[Diagram: Project X and Project Y, each with their own Kafka Topics and Project Data]

Page 12: Odsc workshop - Distributed Tensorflow on Hops

HopsFS

12

Open Source fork of Apache HDFS

16x faster than HDFS

37x more capacity than HDFS

SSL/TLS instead of Kerberos

Scale Challenge Winner (2017)

https://www.usenix.org/conference/fast17/technical-sessions/presentation/niazi

Page 13: Odsc workshop - Distributed Tensorflow on Hops

HopsYARN GPUs

13

Native GPU support in YARN - world first

Implications

- Schedule GPUs just like memory or CPU
- Exclusive allocation (no GPU-sharing)
- Distributed, scale-out Machine Learning

Page 14: Odsc workshop - Distributed Tensorflow on Hops

TensorFlow first-class support in Hops

14

Run TensorFlow code inside Spark Executors, one hyperparameter combination per executor:

- Spark Executor: TensorFlow code (0.003 learning rate, 0.3 dropout)
- Spark Executor: TensorFlow code (0.001 learning rate, 0.5 dropout)
- Spark Executor: TensorFlow code (0.002 learning rate, 0.7 dropout)

Page 15: Odsc workshop - Distributed Tensorflow on Hops

HopsUtil

Library for launching TensorFlow jobs

Manages the TensorBoard lifecycle

Helper Functions for Spark/Kafka/HDFS/etc

15

Page 16: Odsc workshop - Distributed Tensorflow on Hops

HopsUtil - Read data

from os import path

import tensorflow as tf
from hopsutil import hdfs

# Resolve the training data path inside the project's HopsFS dataset
dataset = path.join(hdfs.project_path(), 'Resources/mnist/tfr/train')

# Glob the TFRecord part files and feed them into an input queue
files = tf.gfile.Glob(path.join(dataset, 'part-*'))
file_queue = tf.train.string_input_producer(files, ...)
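Continuing from the snippet above, the queued records can then be parsed in the usual TF 1.x queue-based style. The feature names below are assumptions for the MNIST TFRecords, not taken from the slides:

reader = tf.TFRecordReader()
_, serialized_example = reader.read(file_queue)

features = tf.parse_single_example(serialized_example, features={
    'image_raw': tf.FixedLenFeature([], tf.string),  # assumed feature name
    'label': tf.FixedLenFeature([], tf.int64),       # assumed feature name
})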

16

Page 17: Odsc workshop - Distributed Tensorflow on Hops

17

HopsUtil - initialize Pydoop HDFS API

The Pydoop HDFS API is a rich API that provides operations such as:

- Connecting to an HDFS instance
- General file operations (create, read, write)
- Getting information on files, directories, and the filesystem

Connect to HopsFS using HopsUtil:

from hopsutil import hdfs

pydoop_handle = hdfs.get()
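As a rough sketch of what that handle enables, assuming it behaves like a standard Pydoop hdfs filesystem object (the path below is just an example):

from hopsutil import hdfs

fs = hdfs.get()

# List a project directory: one metadata dict per entry (name, size, kind, ...)
if fs.exists('Resources/mnist'):
    for entry in fs.list_directory('Resources/mnist'):
        print(entry['name'], entry['size'])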

Page 18: Odsc workshop - Distributed Tensorflow on Hops

HopsUtil - TensorBoard

import tensorflow as tf
from hopsutil import tensorboard

[...]

# Checkpoints and summaries go to the logdir managed by HopsUtil
logdir = tensorboard.logdir()

sv = tf.train.Supervisor(is_chief=True, logdir=logdir, [...], save_model_secs=60)
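The same logdir can also be handed to an explicit summary writer so events land where the HopsUtil-managed TensorBoard is watching; a minimal sketch:

writer = tf.summary.FileWriter(tensorboard.logdir(), graph=tf.get_default_graph())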

18

Page 19: Odsc workshop - Distributed Tensorflow on Hops

HopsUtil - Hyperparameter searching

from hopsutil import tflauncher

def training(learning_rate, dropout):
    [....]

params = {'learning_rate': [0.001, 0.002, 0.003], 'dropout': [0.3, 0.5, 0.7]}
tflauncher.launch(spark, training, params)
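A minimal sketch of what such a wrapped training function might contain; the model below is only a placeholder, and nothing beyond the function signature comes from the slide:

import tensorflow as tf

def training(learning_rate, dropout):
    # Toy model that actually uses the two hyperparameters
    x = tf.random_normal([32, 784])
    hidden = tf.layers.dense(x, 128, activation=tf.nn.relu)
    hidden = tf.layers.dropout(hidden, rate=dropout, training=True)
    loss = tf.reduce_mean(tf.square(tf.layers.dense(hidden, 10)))
    train_op = tf.train.GradientDescentOptimizer(learning_rate).minimize(loss)

    with tf.Session() as sess:
        sess.run(tf.global_variables_initializer())
        for _ in range(100):
            sess.run(train_op)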

19

Page 20: Odsc workshop - Distributed Tensorflow on Hops

HopsUtil - Logging

from hopsutil import hdfs

[...]

while not sv.should_stop() and step < steps:
    hdfs.log(sess.run(accuracy))

[...]

20

Page 21: Odsc workshop - Distributed Tensorflow on Hops

DEMO TIME! TensorFlow tour on HopsWorks

21

Page 22: Odsc workshop - Distributed Tensorflow on Hops

22

How to get started

Page 23: Odsc workshop - Distributed Tensorflow on Hops

23

How to get started (2)

Page 24: Odsc workshop - Distributed Tensorflow on Hops

24

How to get started (3)

Page 25: Odsc workshop - Distributed Tensorflow on Hops

25

TensorBoard

Page 26: Odsc workshop - Distributed Tensorflow on Hops

26

Dela - Search for interesting datasets

Page 27: Odsc workshop - Distributed Tensorflow on Hops

27

Dela - Import a Dataset

Page 28: Odsc workshop - Distributed Tensorflow on Hops

Dela

28

p2p network of Hops clusters

Find and share interesting datasets

Exploits unused bandwidth and backs off when there is competing network traffic

Page 29: Odsc workshop - Distributed Tensorflow on Hops

The Challenge

29

http://timdettmers.com/2017/08/31/deep-learning-research-directions

Page 30: Odsc workshop - Distributed Tensorflow on Hops

Experiment Time and Research Productivity

● Minutes, hours:
    ○ Interactive analysis!
● 1-4 days:
    ○ Interactivity replaced by many parallel experiments
● 1-4 weeks:
    ○ High-value experiments only
● >1 month:
    ○ Don't even try!

30

Page 31: Odsc workshop - Distributed Tensorflow on Hops

Solution: Go distributed

31

Page 32: Odsc workshop - Distributed Tensorflow on Hops

State-of-the-Art in GPU Hardware

32

Page 33: Odsc workshop - Distributed Tensorflow on Hops

Nvidia DGX-1

33

Page 34: Odsc workshop - Distributed Tensorflow on Hops

SingleRoot Commodity GPU Cluster Computing

34

Page 35: Odsc workshop - Distributed Tensorflow on Hops

The budget side

35

Commodity Server*

➔ 10x Nvidia GTX 1080Ti (11 GB memory each)
➔ 256 GB RAM
➔ 2x Intel Xeon CPUs
➔ Infiniband
➔ SingleRoot PCI Complex

10 x Commodity Server = 150K Euro

Nvidia DGX-1

➔ 8x Nvidia Tesla V100 (16 GB memory each)
➔ 512 GB RAM
➔ 2x Intel Xeon CPUs
➔ Infiniband
➔ NVLink

Price per DGX-1 = 150K Euro

*www.servethehome.com/single-root-or-dual-root-for-deep-learning-gpu-to-gpu-systems/

Page 36: Odsc workshop - Distributed Tensorflow on Hops

36

Distributed TensorFlow

Distribute TensorFlow graph

Workers / Parameter server

Synchronous / Asynchronous

Model / Data parallelism

Problems (see the sketch below):
- The ClusterSpec must be written by hand
- Each process must be started manually
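For context, this is roughly what that manual setup looks like in plain distributed TensorFlow (TF 1.x API; the host names and ports are placeholders):

import tensorflow as tf

# Hand-written ClusterSpec: every host:port must be known up front
cluster = tf.train.ClusterSpec({
    'ps': ['ps0.example.com:2222'],
    'worker': ['worker0.example.com:2222', 'worker1.example.com:2222'],
})

# Each process must also be started by hand with its own job name and task index
server = tf.train.Server(cluster, job_name='worker', task_index=0)

# Place variables on the parameter server and ops on this worker
with tf.device(tf.train.replica_device_setter(
        worker_device='/job:worker/task:0', cluster=cluster)):
    pass  # build the model graph here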

Page 37: Odsc workshop - Distributed Tensorflow on Hops

37

Introducing TensorFlowOnSpark by YAHOO!

Wrapper for Distributed TensorFlow

- Creates the ClusterSpec automatically!
- Runs on a Hadoop/Spark cluster
- Starts the workers/parameter servers automatically
- First attempt at "scheduling" GPUs
- Simplifies the programming model
- Manages TensorBoard
- "Migrate all existing TF programs with < 10 lines of code"

Page 38: Odsc workshop - Distributed Tensorflow on Hops

TensorFlowOnSpark architecture

38

[Architecture: the Spark Driver coordinates Spark Executors; one executor runs the Parameter Server, the others run Workers, and all read from and write to HopsFS]

Page 39: Odsc workshop - Distributed Tensorflow on Hops

Scaling TensorFlowOnSpark

39

Near linear scaling up to 8 workers

*Slide from: TensorFlowOnSpark: Scalable TensorFlow Learning on Spark Clusters, YAHOO!

Page 40: Odsc workshop - Distributed Tensorflow on Hops

TensorFlowOnSpark on Hops

40

Page 41: Odsc workshop - Distributed Tensorflow on Hops

41

Our improved TensorFlowOnSpark - 1

Problem: TensorFlowOnSpark uses RAM (1 GPU = 27 GB RAM) as a proxy to "schedule" GPUs.

Solution: Hops provides real GPU scheduling!

Page 42: Odsc workshop - Distributed Tensorflow on Hops

42

Our improved TensorFlowOnSpark - 2

Problem: A worker will wait until GPUs become available, potentially forever!

Solution: GPU scheduling ensures that each GPU is allocated exclusively to its particular worker.

Page 43: Odsc workshop - Distributed Tensorflow on Hops

43

Our improved TensorFlowOnSpark - 3

Problem: Each parameter server allocates 1 GPU - this is a waste!

Solution: Only workers may use GPUs.

Page 44: Odsc workshop - Distributed Tensorflow on Hops

44

Conversion guide: TensorFlowOnSpark

- TFCluster.run(spark, training_fun, num_executors, num_ps…)
- Add PySpark and TensorFlowOnSpark imports
- Create your own FileWriter
- Replace tf.train.Server() with TFNode.start_cluster_server()

Full conversion guide for Distributed TensorFlow to TensorFlowOnSpark:
https://github.com/yahoo/TensorFlowOnSpark/wiki/Conversion-Guide
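Putting those pieces together, a skeleton TensorFlowOnSpark job looks roughly like this (a hedged sketch based on the conversion guide above; argument names may differ slightly between TensorFlowOnSpark versions):

from pyspark import SparkConf, SparkContext
from tensorflowonspark import TFCluster, TFNode

def training_fun(args, ctx):
    # ctx carries job_name/task_index; TFNode builds the ClusterSpec and server
    cluster, server = TFNode.start_cluster_server(ctx)
    if ctx.job_name == 'ps':
        server.join()
    else:
        pass  # build the graph, train, and write summaries with your own FileWriter

sc = SparkContext(conf=SparkConf().setAppName('tfos_example'))
num_executors, num_ps = 4, 1
cluster = TFCluster.run(sc, training_fun, None, num_executors, num_ps,
                        tensorboard=False, input_mode=TFCluster.InputMode.TENSORFLOW)
cluster.shutdown()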


Page 45: Odsc workshop - Distributed Tensorflow on Hops

DEMO TIME! Distributed TF on Spark

45

Page 46: Odsc workshop - Distributed Tensorflow on Hops

Distributed Stochastic Gradient Descent

46

Page 47: Odsc workshop - Distributed Tensorflow on Hops

SGD with Data Parallelism (Single Host)

47

Page 48: Odsc workshop - Distributed Tensorflow on Hops

Facebook: Scaling Synchronous SGD

June 2017: training time on ImageNet reduced from 2 weeks to 1 hour

➔ ~90% scaling efficiency going from 8 to 256 GPUs

Learning rate heuristic / warm-up phase / large batches

48

Paper: https://research.fb.com/wp-content/uploads/2017/06/imagenet1kin1h5.pdf
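The key trick in that paper is the linear scaling rule: when the minibatch grows by a factor k (by using k times more GPUs), the learning rate is multiplied by k as well, reached via a gradual warm-up. A small illustrative sketch (the batch sizes and step counts are examples, not the paper's exact schedule):

base_lr = 0.1        # learning rate tuned for the base minibatch size
base_batch = 256     # e.g. 32 images/GPU on 8 GPUs
large_batch = 8192   # e.g. 32 images/GPU on 256 GPUs

k = large_batch // base_batch   # 32
target_lr = base_lr * k         # linear scaling rule: 0.1 * 32 = 3.2

def warmup_lr(step, warmup_steps=25000):
    # Ramp linearly from base_lr to target_lr during the warm-up phase
    if step >= warmup_steps:
        return target_lr
    return base_lr + (target_lr - base_lr) * step / float(warmup_steps)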

Page 49: Odsc workshop - Distributed Tensorflow on Hops

All-Reduce

49

N GPUs, K parameters: communication cost per GPU is 2(N-1) * K/N

≈ 2K for large N, i.e. (nearly) independent of the number of GPUs

Overlaps communication and computation

Drawback: Synchronous communication

From: http://research.baidu.com/bringing-hpc-techniques-deep-learning/
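As a rough worked example (parameter count is approximate): for a ResNet-50-scale model with K ≈ 25M parameters, each GPU sends about 2 * 7/8 * 25M ≈ 44M values per all-reduce on 8 GPUs, and about 2 * 255/256 * 25M ≈ 50M (≈ 2K) on 256 GPUs, so the per-GPU traffic barely grows with cluster size.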

Page 50: Odsc workshop - Distributed Tensorflow on Hops

Baidu All-Reduce - Performance scaling

50

From: http://research.baidu.com/bringing-hpc-techniques-deep-learning/

Page 51: Odsc workshop - Distributed Tensorflow on Hops

Horovod - Better than Baidu All-Reduce?

51

Fork of Baidu All-Reduce

Improvements

1. Replaced Baidu ring-allreduce with NVIDIA NCCL
2. Tensor Fusion
3. Support for larger models
4. Pip package
5. Horovod Timeline

Page 52: Odsc workshop - Distributed Tensorflow on Hops

52

Migrating existing code to run on Horovod

1. Run hvd.init()

2. Pin a server GPU to be used by this process using config.gpu_options.visible_device_list. The local rank maps to a unique GPU for the process.

3. Wrap the optimizer in hvd.DistributedOptimizer.

4. Add hvd.BroadcastGlobalVariablesHook(0) to broadcast initial variable states from rank 0 to all other processes.
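Put together, the four steps look roughly like this with the TF 1.x API (the toy model is only a placeholder for the real training graph):

import tensorflow as tf
import horovod.tensorflow as hvd

# 1. Initialize Horovod
hvd.init()

# 2. Pin this process to one GPU: local rank -> unique GPU
config = tf.ConfigProto()
config.gpu_options.visible_device_list = str(hvd.local_rank())

# Toy model standing in for the real training graph
x = tf.random_normal([32, 10])
loss = tf.reduce_mean(tf.square(tf.layers.dense(x, 1)))

# 3. Wrap the optimizer; gradients are averaged across workers with all-reduce
opt = hvd.DistributedOptimizer(tf.train.GradientDescentOptimizer(0.01))
train_op = opt.minimize(loss, global_step=tf.train.get_or_create_global_step())

# 4. Broadcast initial variable states from rank 0 to all other processes
hooks = [hvd.BroadcastGlobalVariablesHook(0),
         tf.train.StopAtStepHook(last_step=1000)]

with tf.train.MonitoredTrainingSession(hooks=hooks, config=config) as sess:
    while not sess.should_stop():
        sess.run(train_op)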

Page 53: Odsc workshop - Distributed Tensorflow on Hops

Horovod/Baidu AllReduce

53

Provide as a service on HopsWorks

Integration of All-Reduce with a Hadoop cluster
- Use YARN to schedule GPUs

Scheduling of homogeneous GPUs and network
- YARN supports node labels

HopsFS authentication/authorization

TensorBoard lifecycle management as in HopsUtil

Page 54: Odsc workshop - Distributed Tensorflow on Hops

The team

Active contributors: Jim Dowling, Seif Haridi, Tor Björn Minde, Gautier Berthou, Salman Niazi, Mahmoud Ismail, Theofilos Kakantousis, Ermias Gebremeskel, Antonios Kouzoupis, Alex Ormenisan, Fabio Buso, Robin Andersson, August Bonds, Filotas Siskos, Mahmoud Hamed.

Past contributors: Vasileios Giannokostas, Johan Svedlund Nordström, Rizvi Hasan, Paul Mälzer, Bram Leenders, Juan Roca, Misganu Dessalegn, K "Sri" Srijeyanthan, Jude D'Souza, Alberto Lorente, Andre Moré, Ali Gholami, Davis Jaunzems, Stig Viaene, Hooman Peiro, Evangelos Savvidis, Steffen Grohsschmiedt, Qi Qi, Gayana Chandrasekara, Nikolaos Stanogias, Daniel Bali, Ioannis Kerkinos, Peter Buechler, Pushparaj Motamari, Hamid Afzali, Wasif Malik, Lalith Suresh, Mariano Valles, Ying Lieu, Aruna Kumari Yedurupaka, Tobias Johansson, Roberto Bampi, Fanti Machmount Al Samisti, Braulio Grana, Adam Alpire, Zahin Azher Rashid.

54

Page 55: Odsc workshop - Distributed Tensorflow on Hops

www.hops.io
github.com/hopshadoop

@hopshadoop

55