
Transcript of DeepSpark: Asynchronous deep learning over Spark

Page 1

HUAWEI | EUROPEAN RESEARCH CENTER

John Doe

— Huawei Confidential —

DeepSpark

Contributors: Natan Peterfreund, Roman Talyansky, Uri Verner, Zach Melamed, Youliang Yan, Rongfu Zheng

Presenter: Uri Verner

Asynchronous deep learning over Spark

Page 2

DeepSpark is a scalable deep learning framework for Spark-based distributed environments.

Page 3

Outline

Background

DeepSpark architecture

Data locality optimizations

Initial results

Useful tools

Page 4

What is Apache Spark?

Spark is an advanced framework for distributed computation

Very fast at iterative algorithms

In-memory data caching between iterations

Provides fault-tolerance and recovery

Efficient data transfer between nodes (“shuffle”)

Easy and expressive APIs

Page 5

Synchronous vs. Asynchronous Training

(Diagram: workers repeatedly compute updates on input data and send them to a parameter server.)

Workers can get out of sync:

• network delays

• waiting for data

• machine crashes

• etc.

Page 6

System Architecture

(Architecture diagram:)

- Training worker machines: each runs a Spark executor whose training manager drives Caffe on several GPUs; input data is read from HDFS.

- Spark Driver: coordinates the cluster.

- Distributed parameter server: the model is held in a Spark RDD, spread across the cluster.

Page 7

Data Parallelism with Asynchronous Distributed Stochastic Gradient Descent

Each worker operates asynchronously with the other workers:

1. Download model M from the parameter server

2. Compute update ΔM on local data

3. Upload ΔM to the PS

4. The PS updates the model: M := M + ΔM
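As a rough, single-process sketch of these four steps (the parameter-server class, its interface, and the dummy gradient below are illustrative assumptions, not DeepSpark's actual code; the real PS is distributed in a Spark RDD and workers run against it concurrently):

#include <cstddef>
#include <vector>

// Illustrative sketch only: a toy in-process "parameter server" standing in
// for DeepSpark's distributed, RDD-backed PS.
struct ParameterServer {
    std::vector<float> model;

    explicit ParameterServer(std::size_t n) : model(n, 0.0f) {}

    // Step 1: a worker downloads the current model M.
    std::vector<float> download_model() const { return model; }

    // Steps 3-4: receive dM from a worker and update the model, M := M + dM.
    void upload_update(const std::vector<float>& delta) {
        for (std::size_t i = 0; i < model.size(); ++i) model[i] += delta[i];
    }
};

// Placeholder for step 2: in DeepSpark this is a Caffe forward/backward pass
// over a local mini-batch; here it just returns a dummy update of the right size.
std::vector<float> compute_update(const std::vector<float>& model) {
    return std::vector<float>(model.size(), -0.01f);
}

// Each worker runs this loop on its own, without waiting for other workers.
void worker_loop(ParameterServer& ps, int iterations) {
    for (int it = 0; it < iterations; ++it) {
        std::vector<float> m  = ps.download_model();   // 1. download M
        std::vector<float> dm = compute_update(m);     // 2. compute dM
        ps.upload_update(dm);                          // 3. upload dM to the PS
    }
}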

Page 8

Distributed Parameter Server

The model is stored in a Spark Resilient Distributed Dataset (RDD):

- Cached in memory

- API for distributed processing

Model update procedure:

- Training workers send local updates to the PS machines in split form

- Compute a new global model

- Update the training workers with the new model

(Diagram: updates from the workers are combined, via local ready models, into the global merged model.)
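A rough sketch of the split-form update, assuming the model is partitioned into contiguous shards, one per PS machine (the sharding scheme and these helpers are illustrative assumptions, not DeepSpark's exact layout; in DeepSpark the shards live in the model RDD's partitions):

#include <algorithm>
#include <cstddef>
#include <vector>

// Split a worker's full update dM into num_ps contiguous slices, one per PS
// machine. (Contiguous sharding is an illustrative assumption.)
std::vector<std::vector<float>> split_update(const std::vector<float>& delta,
                                             std::size_t num_ps) {
    std::vector<std::vector<float>> parts(num_ps);
    std::size_t chunk = (delta.size() + num_ps - 1) / num_ps;  // ceil division
    for (std::size_t p = 0; p < num_ps; ++p) {
        std::size_t begin = p * chunk;
        std::size_t end   = std::min(delta.size(), begin + chunk);
        if (begin < end)
            parts[p].assign(delta.begin() + begin, delta.begin() + end);
    }
    return parts;
}

// On each PS machine: merge a received slice into its shard of the global model.
void merge_into_shard(std::vector<float>& model_shard,
                      const std::vector<float>& update_slice) {
    for (std::size_t i = 0; i < update_slice.size(); ++i)
        model_shard[i] += update_slice[i];
}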

Page 9

Workers Don’t Wait For Model Update

(Diagram: each worker runs Caffe on several GPUs and keeps a local model, a pending model update, and a buffer of accumulated updates.)

Training loop: load the global model if a new version is available, run Forward/Backward, and add the resulting update to the local model and to the accumulated updates.

Update loop (runs independently): get the model from the PS and send the accumulated updates to the PS.

The training loop therefore never waits for a model update; a sketch of the two loops follows.
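A rough two-thread sketch of this decoupling; the synchronization, the stubbed Caffe call, and the PS exchange are illustrative assumptions, not DeepSpark's implementation (the two loops would run on separate threads, e.g. std::thread):

#include <atomic>
#include <cstddef>
#include <mutex>
#include <vector>

struct WorkerState {
    std::vector<float> local_model;   // model the GPUs train against
    std::vector<float> accumulated;   // updates not yet sent to the PS
    std::mutex mu;
    std::atomic<bool> stop{false};

    explicit WorkerState(std::size_t n)
        : local_model(n, 0.0f), accumulated(n, 0.0f) {}
};

// Assumed placeholders: in DeepSpark these are Caffe on the GPUs and the
// RDD-backed parameter server; here they are trivial stubs.
std::vector<float> forward_backward(const std::vector<float>& model) {
    return std::vector<float>(model.size(), -0.001f);  // dummy update
}
std::vector<float> exchange_with_ps(const std::vector<float>& updates) {
    return updates;  // stub: pretend the PS returns the merged global model
}

// Training loop: never blocks on the parameter server.
void training_loop(WorkerState& s) {
    while (!s.stop) {
        std::vector<float> model_copy;
        {
            std::lock_guard<std::mutex> lock(s.mu);
            model_copy = s.local_model;               // load (if new)
        }
        std::vector<float> delta = forward_backward(model_copy);  // Forward/Backward
        std::lock_guard<std::mutex> lock(s.mu);
        for (std::size_t i = 0; i < delta.size(); ++i) {
            s.local_model[i] += delta[i];     // apply to the local model right away
            s.accumulated[i] += delta[i];     // and remember it for the PS
        }
    }
}

// Update loop: runs independently, exchanging updates and models with the PS.
void update_loop(WorkerState& s) {
    while (!s.stop) {
        std::vector<float> to_send(s.local_model.size(), 0.0f);
        {
            std::lock_guard<std::mutex> lock(s.mu);
            to_send.swap(s.accumulated);              // grab accumulated updates
        }
        std::vector<float> new_model = exchange_with_ps(to_send);  // send / receive
        std::lock_guard<std::mutex> lock(s.mu);
        s.local_model = new_model;                    // replace the local model
    }
}

Note that plainly overwriting local_model with the PS copy discards any updates applied after the accumulated buffer was swapped out; preserving those is exactly what the next slide's "read-my-writes" step is about.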

Page 10

Preserve Local Updates

(Same training-loop / update-loop structure as the previous slide.)

A worker's own updates are preserved in its local model even before they appear in the global model, following the “read-my-writes” property [1].

[1] ”More Effective Distributed ML via a Stale Synchronous Parallel Parameter Server”, Ho et al., NIPS 2013
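A minimal sketch of one way to preserve them, in the spirit of read-my-writes [1] (the mechanism and the names are assumptions, not DeepSpark's implementation): when a fresh global model is loaded, re-apply the worker's own updates that the PS has not merged yet:

#include <cstddef>
#include <vector>

// Load a new global model while keeping the worker's own, not-yet-merged
// updates visible in its local model ("read-my-writes").
void load_global_model(std::vector<float>& local_model,
                       const std::vector<float>& global_model,
                       const std::vector<float>& unmerged_updates) {
    local_model = global_model;
    for (std::size_t i = 0; i < local_model.size(); ++i)
        local_model[i] += unmerged_updates[i];
}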

Page 11

Limited Staleness

(Diagram: the slowest worker holds model version 2 while another worker is already at version 10; each loads the global model only if a new version is available.)

Use a configurable staleness threshold to bound the version gap between workers.
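A small sketch of how such a threshold could be enforced, in the spirit of the stale synchronous parallel approach [1] (the exact policy, the bookkeeping, and these names are assumptions, not DeepSpark's implementation):

#include <algorithm>
#include <vector>

struct StalenessGate {
    int threshold;                    // configurable staleness threshold
    std::vector<int> worker_version;  // model version last loaded by each worker

    // A worker may start another iteration only if it is at most `threshold`
    // model versions ahead of the slowest worker; otherwise it waits until the
    // slowest worker (or a fresher global model) catches up.
    bool may_proceed(int worker_id) const {
        int slowest = *std::min_element(worker_version.begin(),
                                        worker_version.end());
        return worker_version[worker_id] - slowest <= threshold;
    }
};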

Page 12

Work Assignment with HDFS

(Diagram: HDFS storage nodes and worker machines.)

The input data is distributed, stored in blocks of 128 MB (by default), and replicated.

Worker machines may also be HDFS machines.

Problem: assign each (unique) data block to a worker

Requirements (in order of priority):

- Equal work distribution

- Minimize data transfer over the network

Page 13

The Data Block Assignment Problem

Given N data blocks, each with R replicas, and W workers: for each data block, choose one replica and assign it to a worker, such that each worker gets N/W blocks (±1) and non-local assignments are minimized (locality).

Page 14

Solving HDFS Locality Optimization

Represent the assignment as a minimum-cost flow optimization problem, a classical problem with an efficient solution [2].

[2] Ravindra K. Ahuja, Thomas L. Magnanti, and James B. Orlin. Network Flows: Theory, Algorithms, and Applications.

The flow network, over data blocks (N), replicas (R), and workers (W), pushes a total flow of N:

- source → data block: capacity = 1, cost = 0

- data block → replica: capacity = 1, cost = 0

- replica → worker: capacity = 1, cost = 1 for a remote replica, 0 for a local replica

- worker → sink: capacity = N/W, cost = 0
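A sketch of how the network could be built in code; `Solver` stands for any min-cost max-flow implementation with `add_edge` and `solve` (a hypothetical interface, not a specific library), and for brevity the replica layer is folded into the block-to-worker edges (cost 0 if the worker holds a replica of the block, 1 otherwise), which yields the same optimum:

#include <vector>

// block_replicas[b] lists the workers that hold a local replica of block b.
// `Solver` is assumed to expose add_edge(from, to, capacity, cost) and
// solve(source, sink, flow); the returned minimum cost is the number of
// non-local (remote) assignments.
template <typename Solver>
long long assign_blocks(Solver& g,
                        const std::vector<std::vector<int>>& block_replicas,
                        int num_workers) {
    const int n = static_cast<int>(block_replicas.size());  // N data blocks
    const int source       = 0;
    const int first_block  = 1;                  // block nodes: 1 .. N
    const int first_worker = 1 + n;              // worker nodes: N+1 .. N+W
    const int sink         = 1 + n + num_workers;
    const int per_worker   = (n + num_workers - 1) / num_workers;  // ~N/W

    for (int b = 0; b < n; ++b) {
        g.add_edge(source, first_block + b, /*capacity=*/1, /*cost=*/0);
        for (int w = 0; w < num_workers; ++w) {
            bool local = false;
            for (int host : block_replicas[b]) local = local || (host == w);
            // capacity = 1; cost = 0 for a local replica, 1 for a remote one
            g.add_edge(first_block + b, first_worker + w, 1, local ? 0 : 1);
        }
    }
    for (int w = 0; w < num_workers; ++w)        // each worker takes ~N/W blocks
        g.add_edge(first_worker + w, sink, per_worker, /*cost=*/0);

    return g.solve(source, sink, /*flow=*/n);    // push N units of flow
}

The caller would construct the solver with sink + 1 = N + W + 2 nodes; the block-to-worker assignment can then be read off the saturated block → worker edges.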

Page 15

Assigning the HDFS data blocks

(Figure: the resulting flow of N units from the data blocks through their replicas to the workers yields the block-to-worker assignment.)

Page 16

Initial Results

Setup: 4 machines with one Titan X per machine, TCP/IP over ConnectX-3 InfiniBand, GoogleNet model (from Caffe); each machine is used as both a worker and a PS.

(Two plots. Left: training loss vs. iterations, 0K to 40K, for Single worker, DeepSpark, and BSP (the ideal). Right: iteration time [ms] for Single worker, DeepSpark, and BSP.)

Page 17

Useful Optimization & Debugging Tools

Visualize the program’s execution using NVIDIA Tools Extension (NVTX)

Mark the beginnings and endings of all your important operations

(Screenshot: profiler timeline with NVTX ranges for both Caffe and Spark.)

Page 18

Useful Optimization & Debugging Tools

See CUDA Pro Tip: Generate Custom Application Profile Timelines with NVTX

Time ranges are marked using push-pop semantics.

C++ trick: define a special class with “push” in constructor & “pop” in destructor

Define a macro that creates a “profiling” object with info about the function; to describe the function, use the macros __PRETTY_FUNCTION__, __FILE__, and __LINE__.

Example:

int func() {
  PROFILER_FUNCTION_SCOPE();  // pushes an NVTX range for this function; popped automatically when the scope exits
  // ... body of function ...
}
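A minimal sketch of such a class and macro (the names are illustrative; DeepSpark's actual profiler helper may differ), using the NVTX C API from nvToolsExt.h:

#include <nvToolsExt.h>   // NVTX C API: nvtxRangePushA / nvtxRangePop
#include <string>

// RAII helper: pushes an NVTX time range on construction, pops it on destruction.
class ProfilerScope {
public:
    explicit ProfilerScope(const std::string& label) {
        nvtxRangePushA(label.c_str());
    }
    ~ProfilerScope() { nvtxRangePop(); }
};

// Creates a scoped "profiling" object labeled with the function, file, and line.
#define PROFILER_FUNCTION_SCOPE()                                   \
    ProfilerScope _profiler_scope_(std::string(__PRETTY_FUNCTION__) \
        + " (" + __FILE__ + ":" + std::to_string(__LINE__) + ")")

Each instrumented function then shows up as a named time range in the profiler timeline.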

Page 19

Copyright © 2016 Huawei Technologies. All Rights Reserved.

The information in this document may contain predictive statements including, without limitation, statements regarding the future financial and operating results, future product portfolio, new technology, etc. There are a number of factors that could cause actual results and developments to differ materially from those expressed or implied in the predictive statements. Therefore, such information is provided for reference purpose only and constitutes neither an offer nor an acceptance. Huawei may change the information at any time without notice.

EUROPEAN RESEARCH CENTER

DeepSpark

Contact emails:

[email protected]

[email protected]