
EFFICIENT AND ROBUST PARALLEL DNN TRAINING THROUGH MODEL PARALLELISM ON MULTI-GPU PLATFORM

Chi-Chung Chen 1 Chia-Lin Yang 1 Hsiang-Yun Cheng 2

ABSTRACT

The training process of a Deep Neural Network (DNN) is compute-intensive, often taking days to weeks to train a DNN model. Therefore, parallel execution of DNN training on GPUs is a widely adopted approach to speed up the process. Due to its implementation simplicity, data parallelism is currently the most commonly used parallelization method. Nonetheless, data parallelism suffers from excessive inter-GPU communication overhead due to frequent weight synchronization among GPUs. Another approach is pipelined model parallelism, which partitions a DNN model among GPUs and processes multiple mini-batches concurrently. This approach can significantly reduce the inter-GPU communication cost compared to data parallelism. However, pipelined model parallelism faces the weight staleness issue; that is, gradients are computed with stale weights, leading to training instability and accuracy loss. In this paper, we present a pipelined model parallel execution method that enables high GPU utilization while maintaining robust training accuracy via a novel weight prediction technique, SpecTrain. Experimental results show that our proposal achieves up to 8.91x speedup compared to data parallelism on a 4-GPU platform while maintaining comparable model accuracy.

1 INTRODUCTION

Deep Neural Networks (DNNs) have gained a lot of interest in recent years as they have demonstrated success on many classification and regression tasks, including image recognition (Krizhevsky et al., 2012; Simonyan & Zisserman, 2014; He et al., 2015; Szegedy et al., 2017; Real et al., 2019), language translation (Vaswani et al., 2017; Sutskever et al., 2014; Bahdanau et al., 2014; Arivazhagan et al., 2019), and speech recognition (Kim et al., 2017; Hinton et al., 2012). Training a DNN model is compute-intensive and often takes days to weeks, due to the large amount of training data and the ever-increasing model size. As such, multi-GPU platforms are widely adopted to speed up DNN training through parallel execution (Harlap et al., 2018; Dean et al., 2012; Cipar et al., 2013; Zhang et al., 2016; Lin et al., 2017; Lian et al., 2015; Chen et al., 2016).

In multi-GPU platforms, data parallelism (Krizhevsky, 2014) is a commonly used parallelization approach. When data parallelism is applied, each GPU holds a complete copy of the DNN model and processes a subset of the training data. Data parallelism suffers from excessive inter-GPU communication overheads, since weights updated at each individual GPU need to be synchronized. These communication overheads increase as the model size increases, severely hindering the scalability of data parallelism (Seide et al., 2014; Strom, 2015; Alistarh et al., 2016; Zhou et al., 2016; Aji & Heafield, 2017; Lin et al., 2017).

Another parallelization approach is model parallelism, which partitions a DNN model among GPUs. Each GPU is responsible for the weight updates of the assigned model layers. Therefore, the amount of data communicated among GPUs is significantly less than in data parallelism. Furthermore, model parallelism enables training a large model that exceeds the size constraint imposed by the accelerator memory capacity. However, in a naive model parallelism implementation, only one GPU is active at a time. To enable parallel execution, PipeDream (Harlap et al., 2018) proposes to adopt pipelining by injecting multiple mini-batches into the model concurrently. However, pipelined model parallelism introduces the staleness and consistency issue for weight updates. Since multiple mini-batches are simultaneously processed in the pipeline, a later mini-batch could start the training process before its prior mini-batch updates the weights. The staleness/consistency problem leads to unstable training and degrades model accuracy. PipeDream (Harlap et al., 2018) seeks to alleviate the consistency issue by ensuring that the forward and backward passes of a mini-batch on the same GPU see the same weights, through storing multiple versions of weights. However, the weight staleness problem still remains unsolved. To completely avoid the weight staleness and consistency issue, GPipe (Huang et al., 2018) proposes a different pipelining method from PipeDream. It divides each mini-batch into multiple micro-batches, pipelines the micro-batches of the same mini-batch, and updates weights synchronously at the end of a mini-batch. Even though GPipe could increase GPU utilization compared with non-pipelined model parallelism, it achieves significantly lower GPU throughput than PipeDream. According to the experiment in (Harlap, 2019), the throughput of GPipe could be only 29% of PipeDream's.

Figure 1. Overview of Stochastic Gradient Descent.

In this paper, we present a pipelined model parallel execution method that enables high GPU utilization while maintaining robust training accuracy via a novel weight prediction technique, SpecTrain. We adopt the pipelining method in PipeDream and perform a detailed comparison with data parallelism on a multi-GPU platform. For CNN models, pipelined model parallelism performs comparably to data parallelism. But for FCN/RNN models, which have massive numbers of model parameters, pipelined model parallelism shows a superior performance advantage over data parallelism, up to 8.91x speedup on a 4-GPU platform. However, in terms of training robustness, pipelined model parallelism shows unstable learning and leads to worse model accuracy than data parallelism due to the weight staleness problem. To tackle this challenge, the proposed SpecTrain mechanism predicts the future weights in early pipeline stages, so that a mini-batch can adopt the predicted future weights rather than the stale weights to perform its computation. The design is based on the observation that smoothed gradients used in momentum-based optimizers (Kingma & Ba, 2014) reflect the trend of weight updates and can be leveraged for accurate weight prediction. Our evaluations show that SpecTrain induces no accuracy drop in most of the workloads compared to data parallelism, while PipeDream incurs a 1.1% accuracy drop.

The rest of the paper is organized as follows. Section 2 provides background on parallel DNN training and compares data parallelism and model parallelism in terms of load balance, inter-GPU communication, and training efficiency. The staleness issue and our weight prediction technique, SpecTrain, are introduced in Section 3. Section 4 describes the evaluation methodology and results, followed by a summary of related work in Section 5. Finally, we conclude the paper in Section 6.

2 BACKGROUND AND MOTIVATION

2.1 DNN Training

DNN models are typically trained using Stochastic Gradient Descent (SGD), as shown in Figure 1. Training data are randomly sampled into mini-batches. Mini-batches are fed into the model and traverse it in two phases: forward and backward passes. The forward pass generates predictions and calculates the loss between the prediction and the ground truth. Then the backward pass backpropagates errors to obtain gradients to update the model weights. A mini-batch passing through the forward and backward phases is referred to as an iteration, and an epoch is defined as performing the forward-backward pass through the entire training dataset. The training process iterates over multiple epochs until the model converges.

Figure 2. Training process with different parallelization methods: (a) data parallelism, (b) model parallelism.
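To make the terminology above concrete, the following minimal sketch (a toy NumPy linear model of our own, not the paper's code; the data, learning rate, and batch size are illustrative assumptions) runs the forward pass, loss computation, backward pass, and weight update for each mini-batch, and one full pass over the training set constitutes an epoch.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(1024, 16))           # toy training set: 1024 samples, 16 features
y = X @ rng.normal(size=(16, 1))          # toy regression targets
W = np.zeros((16, 1))                     # model weights
lr, batch_size, epochs = 0.1, 128, 3

for epoch in range(epochs):               # one epoch = one pass over the training set
    perm = rng.permutation(len(X))        # randomly sample mini-batches
    for i in range(0, len(X), batch_size):
        idx = perm[i:i + batch_size]
        xb, yb = X[idx], y[idx]
        pred = xb @ W                            # forward pass: generate predictions
        loss = np.mean((pred - yb) ** 2)         # loss against the ground truth
        grad = 2 * xb.T @ (pred - yb) / len(xb)  # backward pass: gradient of the loss
        W -= lr * grad                           # SGD weight update (one iteration)
    print(f"epoch {epoch}: last mini-batch loss = {loss:.4f}")
```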

2.2 Parallel Training

The training process of a DNN is time-consuming, taking days or weeks to finish a large-scale training project. Parallelizing DNN training is needed to speed up the process. There are two parallelization approaches: data parallelism and model parallelism.

For data parallelism (Krizhevsky, 2014), each GPU holds a full copy of the model and processes a different set of training data, as shown in Figure 2(a). Each GPU computes its own gradients. These gradients are aggregated at the parameter server by summation. The aggregated gradients are then broadcast to all GPUs to update weights. Since every GPU receives the same gradients, the weights in the DNN model stay consistent among GPUs. The gathering and scattering process of gradients is called weight synchronization, which requires a large amount of inter-GPU communication. A variant of data parallelism widely used in distributed systems is asynchronous data parallelism (Dean et al., 2012; Cipar et al., 2013; Zhang et al., 2016; Lian et al., 2015; Chen et al., 2016). When asynchronous data parallelism is applied, the parameter server is in charge of weight updates. Each GPU sends its gradients to the parameter server, which then updates the weights and sends the updated weights back to that GPU. In this way, there is no synchronization among GPUs. This approach mitigates the unstable networking issue in distributed computing environments, but it introduces the inconsistency issue. Moreover, this method does not reduce the amount of data transferred among GPUs. Since we target multi-GPU systems where GPUs are connected with PCIe links, synchronous data parallelism is preferred.
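The synchronous weight-synchronization step can be sketched as follows (a CPU-only NumPy simulation of our own; the replica count, model, and loss are illustrative assumptions, not the paper's implementation): each replica computes a local gradient on its shard of the mini-batch, the gradients are summed, and the same update is applied to every copy so that all replicas stay consistent.

```python
import numpy as np

def local_gradient(w, x_shard, y_shard):
    """Mean-squared-error gradient on this replica's shard of the mini-batch."""
    pred = x_shard @ w
    return 2 * x_shard.T @ (pred - y_shard) / len(x_shard)

rng = np.random.default_rng(1)
n_gpus, lr = 4, 0.05
X, y = rng.normal(size=(128, 8)), rng.normal(size=(128, 1))
weights = [np.zeros((8, 1)) for _ in range(n_gpus)]   # each "GPU" holds a full model copy

# Split one mini-batch across replicas; each computes its own gradient.
shards = list(zip(np.array_split(X, n_gpus), np.array_split(y, n_gpus)))
grads = [local_gradient(w, xs, ys) for w, (xs, ys) in zip(weights, shards)]

# Weight synchronization: aggregate the gradients by summation, broadcast the result,
# and apply the same update on every replica so all model copies remain identical.
aggregated = sum(grads)
weights = [w - lr * aggregated for w in weights]
assert all(np.allclose(weights[0], w) for w in weights)
```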

Model parallelism (Harlap et al., 2018) partitions a model among multiple GPUs, where each GPU is responsible for the weight updates of the assigned model layers, as shown in Figure 2(b). Intermediate data, i.e., layer outputs for the forward pass and gradients for the backward pass, are transferred among GPUs. Since these partitions have dependencies, in a naive implementation of model parallelism only one GPU is active at a time, leading to low GPU utilization. To enable parallel execution, PipeDream (Harlap et al., 2018) proposes to adopt pipelining by injecting multiple mini-batches into the pipeline concurrently. Therefore, each GPU can process a different mini-batch simultaneously.
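To illustrate what crosses the GPU boundary under model parallelism, the toy sketch below (our own two-stage example on CPU with NumPy; the layer sizes and activation are assumptions) shows that only the layer output travels forward and only the gradient with respect to that intermediate data travels backward, while each stage updates its own weights locally.

```python
import numpy as np

rng = np.random.default_rng(2)
# Toy 2-stage partition: "GPU 0" owns the first layer, "GPU 1" owns the second.
W0 = 0.1 * rng.normal(size=(16, 32))   # weights resident on GPU 0
W1 = 0.1 * rng.normal(size=(32, 1))    # weights resident on GPU 1
lr = 0.01

x, y = rng.normal(size=(128, 16)), rng.normal(size=(128, 1))

# Forward pass: only the layer output crosses the GPU boundary.
h = np.tanh(x @ W0)                  # computed on GPU 0, sent to GPU 1
pred = h @ W1                        # computed on GPU 1

# Backward pass: only the gradient w.r.t. the intermediate data crosses back.
d_pred = 2 * (pred - y) / len(x)
grad_W1 = h.T @ d_pred               # GPU 1 updates its own layer locally
d_h = d_pred @ W1.T                  # sent back to GPU 0
grad_W0 = x.T @ (d_h * (1 - h ** 2)) # GPU 0 updates its own layer locally
W1 -= lr * grad_W1
W0 -= lr * grad_W0
```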

2.3 Data Parallelism vs. Model Parallelism

We compare data parallelism and model parallelism in three aspects: load balance across GPUs, inter-GPU communication, and training efficiency.

Load Balance across GPUs. Data parallelism partitions training data across multiple GPUs; therefore, load balance can be easily maintained. As for model parallelism, achieving load balance is more challenging. Since the complexity of different DNN layers varies, it takes significant effort for programmers to partition model layers across GPUs in a balanced way. A few prior works have addressed this issue (Harlap et al., 2018; Mirhoseini et al., 2017). PipeDream (Harlap et al., 2018) proposes to profile the processing time of each layer offline and use dynamic programming to partition the model. Mirhoseini et al. (Mirhoseini et al., 2017) adopt reinforcement learning to dynamically partition the model at run time.

Inter-GPU Communication. Both data parallelism and model parallelism need inter-GPU communication. Data parallelism communicates gradients for weight synchronization, while model parallelism transfers intermediate data between sub-models. Figure 3 compares the amount of data transferred among GPUs for these two parallelization methods.¹ We observe that data parallelism requires 13.4x more inter-GPU communication on average (up to 528x) than model parallelism. The only exception is Inception v4 (Szegedy et al., 2017), as its relatively small weight size reduces the cost of weight synchronization. SNN (Klambauer et al., 2017), Residual LSTM (Kim et al., 2017), and Transformer (Vaswani et al., 2017) require almost no communication between GPUs when model parallelism is applied, since these models have little intermediate data between layers. The excessive inter-GPU communication of data parallelism leads to considerable slowdown. As shown in Figure 4, on average 26.7% of training time is spent on inter-GPU data transfer (up to 76.7%) when data parallelism is applied.

¹ The experimental setup is described in Section 4.1.

Figure 3. The amount of inter-GPU data transfers per mini-batch (MB) when applying different parallelization approaches on a 4-GPU system.

Figure 4. Percentage of time spent on inter-GPU communications when applying data parallelism on a 4-GPU system.

Training Efficiency. Both data parallelism and model parallelism affect DNN training efficiency, i.e., the model convergence rate and model accuracy. For data parallelism, the data partition size affects a key DNN training parameter: the mini-batch size. A large mini-batch size increases GPU utilization but could degrade model accuracy if it is excessively large (Wu et al., 2017; Keskar et al., 2016). In contrast, a small mini-batch size may under-utilize GPU resources. For data parallelism, if each GPU performs the training process for a mini-batch, then the effective mini-batch size is the number of GPUs times the mini-batch size. To avoid the adverse effect of a large mini-batch size, we could choose to partition a mini-batch among GPUs; however, this could under-utilize GPU resources. As the number of GPUs increases, it becomes more and more challenging to select a data partition size that is good for DNN training efficiency.

Figure 5. The accuracy curve when training Transformer (Vaswani et al., 2017) with data parallelism and model parallelism. The light-color curve denotes the measured values. To make the trends clear, we add a thick line to show the moving average over the 20 latest values.

For model parallelism, if training proceeds in a pipelined manner, it induces the staleness issue. Since multiple mini-batches are in progress in the pipeline, before earlier mini-batches update the weights, later mini-batches adopt stale weights to derive gradients. This staleness issue leads to unstable learning and worse model accuracy. Figure 5 compares the model accuracy, i.e., the percentage of correct classifications, between model and data parallelism. We observe that the accuracy of data parallelism steadily increases as training goes on, but the accuracy of model parallelism fluctuates.

As DNN models continue to grow, we need to deploy more GPUs to speed up training. As such, data parallelism is expected to face the scalability issue, and model parallelism appears appealing for parallelizing DNN training. Therefore, in this paper, we tackle the main challenge, i.e., the staleness issue, to realize a robust and efficient model parallelism approach for parallelizing DNN training.

3 ROBUST PIPELINE DESIGN

3.1 Staleness Issue

Pipelined DNN training is bi-directional (Harlap et al., 2018); that is, a mini-batch flows through the pipeline (forward pass) and then traverses back for weight updates (backward pass). The pipeline issues forward and backward tasks in a round-robin manner, as demonstrated in Figure 6. Once a task is completed, the GPU asynchronously executes the next one to maximize utilization. The output data are sent to the next GPU via a background thread to hide the latency.
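The scheduling idea can be sketched with a small discrete-time simulation (our own simplification, not PipeDream's runtime; the stage count, mini-batch count, and the policy of draining backward work first are assumptions): several mini-batches are in flight at once, forward work flows toward the last GPU, and backward work flows back toward the first.

```python
from collections import deque

N_STAGES, N_BATCHES = 3, 6                 # assumed pipeline depth and number of mini-batches
fwd = [deque() for _ in range(N_STAGES)]   # forward tasks waiting at each GPU/stage
bwd = [deque() for _ in range(N_STAGES)]   # backward tasks waiting at each GPU/stage
fwd[0].extend(range(1, N_BATCHES + 1))     # all mini-batches are injected at stage 0

for t in range(1, 6 * N_BATCHES):
    handoffs, log = [], []
    for k in range(N_STAGES):
        if bwd[k]:                                     # simplified policy: drain backward work first
            b = bwd[k].popleft()
            log.append(f"GPU{k}:B{b}")
            if k > 0:
                handoffs.append((bwd[k - 1], b))       # backward pass flows toward stage 0
        elif fwd[k]:
            b = fwd[k].popleft()
            log.append(f"GPU{k}:F{b}")
            nxt = fwd[k + 1] if k + 1 < N_STAGES else bwd[k]
            handoffs.append((nxt, b))                  # forward pass flows toward the last stage
    for queue, b in handoffs:
        queue.append(b)                                # hand results over after this time unit
    if log:
        print(f"t={t:2d}  " + "  ".join(log))
    if not any(fwd) and not any(bwd):
        break
```

Printing the schedule shows multiple mini-batches being processed on different GPUs within the same time unit, which is the source of both the speedup and the staleness discussed next.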

Figure 6. Example timeline of pipelined model parallelism. The boxes with index i denote the time spent on processing the i-th mini-batch. Cyan and yellow boxes indicate forward and backward tasks, respectively. A round trip of processing a mini-batch is presented by the arrows.

Staleness is a critical issue that must be resolved in pipelined model parallelism. We use an example, shown in Figure 7, to explain the staleness issue in pipelining. In this example, Wt represents the version of weights at time unit t.

In the staleness-free single-GPU implementation of training, both the forward and backward passes of a mini-batch are performed based on the latest weights. At time t, the forward and backward passes of a mini-batch both perform computation based on the latest weights Wt, which were generated by the backward pass of the previous mini-batch at time t−1. For example, in Figure 7(a), the 3rd mini-batch produces W4, and the processing of the 4th mini-batch at the 4th time unit is based on W4.

Different from the single-GPU implementation, mini-batches in pipelined training adopt inconsistent and stale weights to perform computation. In pipelined training, a mini-batch is processed by different GPUs in multiple consecutive time units to finish the forward and backward passes. Since multiple mini-batches are in progress in the pipeline, weights are continuously updated at every time unit. Thus, a mini-batch adopts inconsistent versions of weights during its round trip (i.e., the entire flow of forward and backward passes) in the pipeline. In addition, before earlier mini-batches update the weights, a later mini-batch performs computation based on stale weights. For example, in Figure 7(b), the 4th mini-batch adopts various versions of weights, ranging from W4 to W6, during its round trip. From the 4th mini-batch's perspective, W4 and W5 are stale, and the only staleness-free version of weights is W6, as W6 is derived after the 3rd mini-batch updates the weights. Such weight update behavior is called the weight staleness issue, and it leads to unstable and inferior convergence.

3.2 SpecTrain: Staleness Mitigation via Weight Prediction

To resolve the staleness issue, we propose a weight prediction method, SpecTrain, and apply it in the pipeline. SpecTrain predicts future weights in early pipeline stages, so that a mini-batch can perform computation based on the predicted future weights rather than the stale weights. Our goal is to build a consistent and staleness-free training procedure.

Figure 7. Example timeline of (a) single-GPU training, (b) pipelining, (c) pipelining with PipeDream Weight Stashing, and (d) pipelining with SpecTrain. The round trip of the 5th mini-batch and its adopted weight versions are illustrated. Note that we intentionally align tasks among GPUs to mark the timestamps. In actual execution, GPUs run the next task back to back without waiting for other GPUs.

In a consistent training procedure, the entire round trip of a mini-batch should adopt the same weight version. PipeDream (Harlap et al., 2018) proposes Weight Stashing to alleviate the inconsistency problem. Weight Stashing ensures that the forward and backward passes of a mini-batch on the same GPU use the same weights, as demonstrated in Figure 7(c), where the 4th mini-batch uses W4 during both the forward and backward pass on GPU 0, even though newer versions of weights are produced. PipeDream applies Weight Stashing in its experiments. To implement Weight Stashing, every GPU maintains a queue that stores several old versions of weights. This additional queue wastes GPU memory space. Moreover, the training process of Weight Stashing still uses stale weights.
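The bookkeeping that Weight Stashing implies can be sketched as below (our own simplified illustration, not PipeDream's implementation; the class name and the single-tensor weights are assumptions): each GPU stores the weight version seen by a mini-batch's forward pass and reuses exactly that version in its backward pass, at the cost of one extra weight copy per in-flight mini-batch.

```python
import numpy as np

class WeightStash:
    """Keep the weight version each in-flight mini-batch saw in its forward pass,
    so the backward pass on this GPU can reuse exactly the same (possibly stale) weights."""

    def __init__(self, weights):
        self.weights = weights            # the live, most recently updated weights
        self.stash = {}                   # mini-batch id -> stored weight copy

    def forward_weights(self, batch_id):
        self.stash[batch_id] = self.weights.copy()   # one extra copy per in-flight mini-batch
        return self.stash[batch_id]

    def backward_weights(self, batch_id):
        return self.stash.pop(batch_id)   # same version as the forward pass used

    def apply_update(self, grad, lr):
        self.weights = self.weights - lr * grad      # updates only affect later mini-batches

gpu0 = WeightStash(np.zeros(4))
w_fwd = gpu0.forward_weights(batch_id=4)   # forward pass of mini-batch 4 sees W4
gpu0.apply_update(np.ones(4), lr=0.1)      # other mini-batches keep updating the live weights
w_bwd = gpu0.backward_weights(batch_id=4)  # backward pass of mini-batch 4 still sees W4
assert np.array_equal(w_fwd, w_bwd)
```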

To maintain weight consistency and avoid staleness, SpecTrain predicts future weights and adopts the predicted weights, rather than the earliest version of weights, throughout the entire round trip of a mini-batch. Figure 7(d) illustrates the idea of SpecTrain. Suppose that a mini-batch completes its round trip at time t. At early pipeline stages, the mini-batch predicts the future version of weights (Wt), which is expected to become the most up-to-date weights at time t, and uses this future version of weights to perform computation. For example, in Figure 7(d), the processing of the 4th mini-batch in its entire round trip is based on W6 rather than W4.

Weight Prediction. SpecTrain predicts future weights based on the observation that smoothed gradients used in Momentum SGD (Kingma & Ba, 2014) reflect the trend of weight updates. Momentum SGD is a common technique that can help to speed up and improve the stability of SGD by smoothing the weight updates. A smoothed gradient (vt) is the weighted average of recent gradients and is calculated by the following equation:

vt = γ · vt−1 + (1 − γ) · gt    (1)

where γ is the decay factor with 0 < γ ≤ 1 and gt is the newly generated gradient. Through averaging with recent gradients by the decay factor, the smoothed gradient vt reflects the trend of weight updates. Thus, we can use smoothed gradients to predict future weights.
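Equation (1) is a simple exponential moving average, sketched below in NumPy (our own illustration; the gradient values are synthetic, while γ = 0.9 follows the momentum factor used in Section 4.1):

```python
import numpy as np

def smoothed_gradient(v_prev, g_new, gamma=0.9):
    """Equation (1): exponential moving average of recent gradients."""
    return gamma * v_prev + (1.0 - gamma) * g_new

rng = np.random.default_rng(3)
v = np.zeros(4)
trend = np.array([1.0, -0.5, 0.2, 0.0])      # underlying direction of the weight updates
for _ in range(100):
    g = trend + 0.1 * rng.normal(size=4)     # noisy per-iteration gradients
    v = smoothed_gradient(v, g)
print(v)   # v stays close to the trend, which is what makes it useful for prediction
```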

In the original SGD, the weight update from time t to time t+1 is conducted by the following equation:

Wt+1 = Wt − η · gt    (2)

where η is the learning rate. Since our goal is to predict future weights, the true gradient gt has not been derived yet when we make the prediction at an early pipeline stage. Nevertheless, we can use the smoothed gradient vt−1 to replace the true gradient gt, as the smoothed gradient reflects the trend of weight updates. Thus, we can use the following equation to make the weight prediction:

Wt+1 = Wt − η · vt−1    (3)

Given a version of weights and the version difference s, we can predict the future version of weights by recursively applying Equation 3, as shown below:

Wt+s = Wt − s · η · vt−1    (4)

Figure 8. Comparison between the RMSE of the predicted weights and the RMSE of the stale weights when training SNN on a 4-GPU system, for version differences (a) s = 1, (b) s = 2, and (c) s = 3.

The version difference s is defined as the number of time units between the current pipeline stage and the time a mini-batch completes its round trip. As shown in the following equations, the calculation of the version difference s depends on the GPU index k and on whether the mini-batch is in a forward or a backward pass. For mini-batches in forward passes, assuming that there are N GPUs in the multi-GPU system, the version difference is calculated by

s = ⌊k/2⌋ + N − k − 1    (5)

For mini-batches in backward passes, the version difference can be simply calculated by

s = ⌊k/2⌋    (6)

For example, in Figure 7(d), at the 4th time unit, the 4th mini-batch is at the first pipeline stage of its forward pass, and its version difference is s = ⌊0/2⌋ + 3 − 0 − 1 = 2, as this mini-batch will not complete until the 6th time unit. Thus, the future weights of the 4th mini-batch can be predicted by W6 = W4 − (6 − 4) · η · v3. By using the predicted weights W6 rather than the stale weights W4 to perform computation, SpecTrain can mitigate the staleness issue in pipelined model parallelism.
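Putting Equations (4)-(6) together, a minimal sketch of the prediction step looks as follows (a standalone NumPy illustration of ours, not the paper's TensorFlow implementation; the weight and smoothed-gradient values are made up):

```python
import numpy as np

def version_difference(k, n_gpus, is_forward):
    """Equations (5) and (6): number of weight updates until this mini-batch's round trip ends."""
    if is_forward:
        return k // 2 + n_gpus - k - 1   # floor(k/2) + N - k - 1
    return k // 2                        # floor(k/2)

def predict_weights(w, v_smoothed, s, lr):
    """Equation (4): speculate s update steps ahead using the smoothed gradient."""
    return w - s * lr * v_smoothed

# The example from the text: GPU 0 in a 3-GPU pipeline, forward pass of the 4th mini-batch.
n_gpus, lr = 3, 0.01
s = version_difference(k=0, n_gpus=n_gpus, is_forward=True)
print(s)                                 # 2, matching floor(0/2) + 3 - 0 - 1

w4 = np.array([0.5, -0.2, 0.1])          # current (stale) weights W4 held by GPU 0
v3 = np.array([0.3, 0.3, -0.1])          # smoothed gradient v3 from Equation (1)
w6_hat = predict_weights(w4, v3, s, lr)  # approximates W6 = W4 - 2 * eta * v3
print(w6_hat)
```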

Prediction Accuracy. We conducted an experiment to evaluate the accuracy of our weight prediction. To quantify the accuracy, we calculate the root-mean-square error (RMSE) between the predicted weights and the actual weights Wt. As a comparison, the RMSE between the stale weights Wt−s and the actual weights Wt is also analyzed. The experiment is conducted by training the SNN (Klambauer et al., 2017) model.² To demonstrate how the version difference s affects the prediction accuracy, we perform RMSE evaluations for s = 1, s = 2, and s = 3. Figure 8 shows the error curves with and without weight prediction. As shown in the figure, the RMSE of the predicted weights is clearly lower than the RMSE of the stale ones, indicating that the smoothed-gradient-based weight prediction is accurate and can help to mitigate the staleness issue. As the version difference increases, the RMSE of the stale weights also increases, while the RMSE of the predicted weights remains much lower than that of the stale ones.

² The experimental setup is described in Section 4.1. Due to space limitations, we only show the RMSE results of SNN. Experiments on other models show similar results.
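The RMSE metric used in Figure 8 can be computed as follows (a small helper of our own; the weight arrays below are synthetic placeholders, not measured values):

```python
import numpy as np

def rmse(a, b):
    """Root-mean-square error between two flattened weight tensors."""
    return float(np.sqrt(np.mean((np.ravel(a) - np.ravel(b)) ** 2)))

rng = np.random.default_rng(4)
w_actual = 1e-2 * rng.normal(size=1000)                 # weights actually reached at time t
w_stale = w_actual + 3e-6 * rng.normal(size=1000)       # weights from s time units earlier
w_predicted = w_actual + 1e-6 * rng.normal(size=1000)   # speculated weights from Equation (4)
print(rmse(w_stale, w_actual), rmse(w_predicted, w_actual))
```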

4 EXPERIMENTS

4.1 Evaluation Methodology

We use Falconwitch PS1816 (H3 Platform Inc.), a single-node multi-GPU system designed by H3 Platform, to conduct our performance and robustness studies. The multi-GPU platform is equipped with four NVIDIA Tesla P40 GPUs connected via PCIe 3.0 x16, and it supports simultaneous execution of multiple peer-to-peer (P2P) transfers as long as the source and destination devices of these transfer requests are different. The CPU on the platform is an Intel Xeon E5-2650 with 160GB of DDR4-2400 off-chip main memory.

To evaluate the performance of the proposed method, six representative deep learning models are chosen as benchmarks. These models cover different types of neural networks, including convolutional neural networks (CNNs), fully-connected networks (FCNs), and recurrent neural networks (RNNs). For all these models, the mini-batch size is set to 128 and the momentum factor γ is set to 0.9. Since we focus on comparing the performance of different parallelization approaches rather than pursuing optimal accuracy, we adopt a fixed learning rate instead of variable rates to maintain the stability of the experiments.

CNN Models. We use three state-of-the-art image classification CNNs, VGG16 (Simonyan & Zisserman, 2014), ResNet-152 (He et al., 2015), and Inception v4 (Szegedy et al., 2017), to evaluate the impact of different parallelization methods on CNN models. To train these models, we use CIFAR-10, which contains 50,000 training images and 10,000 testing images, as the dataset. These three CNN models cover a wide range of layer shapes and sizes, and capture the evolution of CNNs in recent years. We train VGG16, ResNet-152, and Inception v4 using Momentum SGD with learning rates of 2e-3, 2e-2, and 1e-3, respectively. Unlike in the other two models, the first stage of VGG16 takes much longer to execute than the other stages in the pipeline and thus becomes the bottleneck. As a limitation of pipelining, this layer cannot be further divided for better load balance. Therefore, we combine pipelining with data parallelism and adopt data parallelism in the first stage to mitigate the imbalance between pipeline stages.

FCN Models. For FCN models, we use SNN (Klambauer et al., 2017) (trained on CIFAR-10) and Transformer (trained on the IMDb Movie Review Sentiment Dataset (Maas et al., 2011)) to study the impact of different parallelization approaches. SNN is a stack of 32 fully-connected layers with 2048 hidden units, while Transformer has 6 blocks in both the encoder and the decoder, with 8 heads and 512 hidden units in the fully-connected layers. The IMDb Movie Review Sentiment Dataset consists of 25,000 training movie reviews and 25,000 testing reviews, and each review is tagged with a sentiment, either positive or negative. Transformer is trained to determine the sentiment of the input sentences. Most of the parameters remain the same as in the original model, except that the input sentences are truncated to 20 words. These two models, SNN and Transformer, are trained using Momentum SGD with learning rates of 1e-3 and 5e-5, respectively.

RNN Models. We study the impact of different parallelization methods on Residual LSTM (Kim et al., 2017) using the IMDb dataset. The Residual LSTM comprises 8 layers of long short-term memory (LSTM) with 512 embedding units, 512 output units, and 1024 memory units. During the training process, the learning rate is set to 5e-3.

We use the TensorFlow (Abadi et al., 2015) framework with the cuDNN (Chetlur et al., 2014) deep learning library to implement the targeted neural network models. TensorFlow provides high-level APIs for building a data-parallel training process, and it also provides the flexibility to implement model parallelism paradigms, including PipeDream and our proposed method. We evaluate the following training implementations:

• Single GPU: Training on a single GPU. No parallelization method is applied.

• Data P.: Data parallelism approach.

• Vanilla Model P.: Model parallelism with pipelining.

• PipeDream: Model parallelism with pipelining, where the staleness issue is mitigated by maintaining consistency between the forward and backward tasks on the same GPU (Weight Stashing).

• SpecTrain: Pipelined model parallelism with our proposed weight prediction mechanism (SpecTrain) to solve the staleness issue. We partition the model using the same approach as PipeDream.

4.2 Throughput

Figure 9 compares the throughput (measured as the number of training samples per second) of data and pipelined model parallelism. The throughput measurements are conducted over the interval between the 50th and 250th training iterations to get stable results. Reported throughput numbers are normalized to Single GPU. Since the staleness mitigation methods in PipeDream and SpecTrain have very little impact on performance, we only show the result of Vanilla Model P.

For CNN models, including VGG16, ResNet-152, and Inception v4, the throughput of both parallel schemes increases as the number of GPUs increases. The throughput of pipelining is on par with data parallelism, except for Inception v4, which contains fewer weights and thus imposes less communication overhead on data parallelism.

For FCN/RNN models, including SNN, Transformer, and Residual LSTM, the scalability trend is different. The 4-GPU results of Data P. show significant throughput degradation compared to the 2-GPU results for all three models. On average, on the 4-GPU system, Data P. can only provide a 38.5% throughput improvement compared to Single GPU. The reason is that FCN and RNN models typically contain a large number of weights, and, for each mini-batch, these weights must be synchronized among all GPUs when Data P. is applied. A detailed analysis of the synchronization overhead incurred by Data P. is presented in Section 4.3. Different from Data P., the model parallelism scheme achieves significant throughput improvement on the 4-GPU system, as it does not suffer from the huge synchronization overhead. On average, when running FCN/RNN models on the 4-GPU system, model parallelism provides 3.10x throughput improvement over Data P.

In summary, model parallelism provides a pipelined design without incurring throughput-harmful synchronization overheads. On average, on the 4-GPU system, model parallelism provides 98.5% throughput improvement (up to 8.91x) over Data P.

4.3 Performance Breakdown

To analyze the performance difference between data parallelism and model parallelism in detail, we break down the overall execution time into computing time, P2P transfer time, P2P-induced idle time, and imbalance-induced idle time. The computing time includes the time spent on computation. P2P transfer time represents the data transfer time that cannot be overlapped with computation, and P2P-induced idle time is the time that data transfer is stalled due to a busy PCIe interconnect. Imbalance-induced idle time is the GPU idle time spent waiting for other GPUs to reach the synchronization point, i.e., caused by imbalanced workloads among GPUs.

Figure 9. Throughput of the targeted neural network models with various parallelization approaches on 2-GPU and 4-GPU systems, normalized to the throughput of Single GPU.

Figure 10. Performance breakdown of Data P. and Vanilla Model P. normalized to Data P.

We use the NVIDIA CUDA 8.0 profiler (nvprof) to obtain the runtime data for the performance breakdown during the training process. First, we dump the GPU trace by executing the training program with nvprof on the 4-GPU system. The trace contains the starting timestamp, ending timestamp, and transfer size of every computing kernel and data transfer transaction. We then write a script to analyze the GPU trace and derive the performance breakdown.
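A trace-analysis script of this kind can be sketched as below (our own rough sketch, not the authors' script; it assumes the GPU trace has been exported as a CSV with "Duration" and "Name" columns in consistent time units, which may require adapting to the actual nvprof output, and it only classifies rows into compute versus copy time rather than reproducing the full idle-time analysis):

```python
import csv

def breakdown(trace_csv):
    """Sum kernel time and memcpy time from a GPU trace exported as CSV.
    Assumes 'Duration' and 'Name' columns with numeric durations; rows that
    do not parse (e.g., a units row) are skipped."""
    compute = copy = 0.0
    with open(trace_csv) as f:
        for row in csv.DictReader(f):
            try:
                dur = float(row["Duration"])
            except (KeyError, ValueError):
                continue
            if "memcpy" in row.get("Name", "").lower():
                copy += dur        # P2P / device copies
            else:
                compute += dur     # compute kernels
    return compute, copy

# Hypothetical usage: print(breakdown("gpu_trace.csv"))
```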

Figure 10 illustrates the performance breakdown of the different parallelization methods on the 4-GPU system when processing a mini-batch, normalized to the overall execution time of a mini-batch when Data P. is applied. Data P. spends more time on P2P transfer and P2P-induced idle than the model parallelism scheme, as Data P. transfers a huge amount of data during weight synchronization, especially when training FCN/RNN models. On average, the P2P-related time (i.e., P2P transfer and P2P-induced idle) accounts for 26.7% of the overall execution time when Data P. is applied, while the P2P-related time in pipelined model parallelism is only 4.52%. For FCN/RNN models, the P2P-related time in Data P. contributes 49.8% of the total execution time on average, indicating that the communication among GPUs greatly impacts performance. In terms of workload balance among GPUs, model parallelism shows higher imbalance-induced idle time than data parallelism.

The computing time of Data P. is 63.1% longer than that of model parallelism on average. One reason is that the mini-batch partitioning leads to GPU under-utilization. Another reason is kernel preprocessing recomputation. State-of-the-art algorithms for computation in neural networks, such as FFT (Mathieu et al., 2013) and Winograd (Lavin & Gray, 2016), require kernels to preprocess weights before computing on the input data. This means that, for data parallelism, every GPU needs to re-run the preprocessing on the replicated weights, which increases the computational overhead.

4.4 Staleness and Convergence

Since pipelined model parallelism induces the staleness problem, we analyze the impact of staleness on convergence for the model-parallelism-based parallelization approaches, including Vanilla Model P., PipeDream, and our proposed SpecTrain. These model-parallelism-based methods are also compared with Data P., which has no staleness issue. Vanilla Model P. represents the worst case, as no staleness mitigation technique is applied. PipeDream mitigates the inconsistency between the forward and backward pass by sticking to stale weights, while SpecTrain utilizes weight prediction to alleviate the staleness issue. The experiments are done by training each model for 5000 iterations. The learning rates of these models are tuned, as described in Section 4.1, to ensure that training eventually converges. Training loss, validation loss, and validation accuracy are recorded every 20 steps (iterations) to show the learning curve. In the following paragraphs, we analyze the impact of the staleness problem on the learning curve and model accuracy.

Figure 11. Learning curves when using different parallelization schemes to train Transformer: (a) train loss, (b) validation loss, (c) validation accuracy. The light-color lines denote the measured values. To make the trends clear, we add a thick line to show the moving average over the 20 latest values.

Figure 11 shows the learning curves of Transformer when the different staleness mitigation techniques are applied. Due to space limitations, only the learning curves of Transformer are illustrated, but the trend is similar for all neural network models. As shown in Figure 11, Vanilla Model P. is the least stable parallelization scheme, especially near the convergence point, as no staleness mitigation technique is applied. PipeDream mitigates the weight inconsistency issue between the forward and backward pass but still uses stale weights, so the improvement is limited and its learning curve is very close to that of Vanilla Model P. Our SpecTrain technique effectively alleviates the instability problem through weight prediction, and the learning curve of SpecTrain is similar to that of Data P., indicating that SpecTrain achieves training robustness close to that of the staleness-free Data P.

Table 1 shows the model accuracy results. Vanilla Model P., i.e., model parallelism without a staleness mitigation technique, is seriously affected by the staleness problem, resulting in a 0.55% accuracy drop on average compared to Data P. The staleness mitigation method adopted by PipeDream cannot alleviate the staleness issue in our experiments, losing 1.1% validation accuracy relative to Data P. on average, which is worse than Vanilla Model P. This is likely because adopting the stale weights in the forward pass aggravates staleness, even though it resolves the weight inconsistency issue. By using weight prediction to resolve the staleness issue, our SpecTrain shows no accuracy drop in almost all of the workloads and achieves the same validation accuracy as the staleness-free Data P. The only exception is the RNN model, Residual LSTM, most likely due to the exploding gradient issue (Pascanu et al., 2013). The gradients in RNNs sometimes increase dramatically, and a sudden exploding gradient lowers the prediction accuracy of SpecTrain.

Table 1. Model accuracy of different neural network models when various parallelization schemes are applied.

Parallelization Scheme | Min. Train Loss | Min. Val. Loss | Max. Val. Accuracy

VGG16
  Data P.          | 0.213271 | 0.794613 | 73.4776%
  Vanilla Model P. | 0.204126 | 0.811148 | 73.0569%
  PipeDream        | 0.200585 | 0.811144 | 72.8365%
  SpecTrain        | 0.185566 | 0.796017 | 73.8081%

ResNet-152
  Data P.          | 0.338845 | 0.892366 | 71.2139%
  Vanilla Model P. | 0.254327 | 0.945588 | 70.3225%
  PipeDream        | 0.467527 | 0.979401 | 67.9287%
  SpecTrain        | 0.231724 | 0.924241 | 70.9936%

Inception v4
  Data P.          | 0.804475 | 0.913155 | 69.1607%
  Vanilla Model P. | 0.858834 | 0.919470 | 68.6599%
  PipeDream        | 0.864199 | 0.930320 | 68.5297%
  SpecTrain        | 0.756939 | 0.874898 | 70.7732%

SNN
  Data P.          | 0.431832 | 1.45124  | 50.9115%
  Vanilla Model P. | 0.766552 | 1.440200 | 50.6911%
  PipeDream        | 0.810452 | 1.450239 | 50.4107%
  SpecTrain        | 0.724402 | 1.406447 | 52.0733%

Transformer
  Data P.          | 0.649287 | 0.660871 | 60.3265%
  Vanilla Model P. | 0.655801 | 0.662379 | 60.0963%
  PipeDream        | 0.655877 | 0.662544 | 59.9760%
  SpecTrain        | 0.652193 | 0.662502 | 60.1362%

Residual LSTM
  Data P.          | 0.347742 | 0.658583 | 66.0557%
  Vanilla Model P. | 0.459975 | 0.652651 | 65.0240%
  PipeDream        | 0.467595 | 0.652948 | 64.8137%
  SpecTrain        | 0.454813 | 0.652251 | 64.8137%

5 RELATED WORK

Most related studies on DNN training parallelization target distributed systems. Data parallelism is the most commonly used parallelization scheme in distributed systems. Many deep learning frameworks, such as TensorFlow (Abadi et al., 2015), PyTorch (Paszke et al., 2017), and Caffe (Jia et al., 2014), provide high-level APIs for applying data-parallel training. It is well known that the huge communication overhead of data parallelism limits its scalability. Some prior studies focus on reducing the communication overheads of data parallelism. Seide et al. (Seide et al., 2014) propose 1-bit SGD, which quantizes the gradients to one bit per value to lower the communication cost. Strom (Strom, 2015) proposes to send gradients only when they are larger than a threshold to reduce communication. Alistarh et al. (Alistarh et al., 2016) design another gradient quantization approach for compression, and their scheme allows users to trade off accuracy and performance. Zhou et al. (Zhou et al., 2016) speed up data parallelism on convolutional neural networks by using low-bit-width weights, activations, and gradients. Aji et al. (Aji & Heafield, 2017) propose Gradient Dropping to prevent frequent gradient transfers. Lin et al. (Lin et al., 2017) implement Gradient Dropping (Aji & Heafield, 2017) and develop several techniques to increase its robustness. Sun et al. (Sun et al., 2017) improve the performance of the parameter server by leveraging distributed shared memory to reduce the networking time. Mamidala et al. (Mamidala et al., 2018) apply MPI parallelism to replace the gradient aggregation process at the parameter server, in order to provide better scalability. Project Adam (Chilimbi et al., 2014) and FireCaffe (Iandola et al., 2016) design deep learning training systems that optimize and balance workload computation and communication to scale DNN training.

In distributed systems, another issue, the straggler problem, further degrades the performance of data parallelism. Due to unstable networking, some workers may encounter heavy traffic and spend a long time on weight synchronization, making the other workers stall. This straggler problem reduces GPU utilization. Chen et al. (Chen et al., 2016) propose to give up on slow workers at run time, solving the straggler problem without affecting algorithmic correctness. Dean et al. (Dean et al., 2012) propose a novel variant of SGD called asynchronous data parallelism, which skips the synchronization between workers. Some deep learning frameworks, such as TensorFlow (Abadi et al., 2015) and Theano-MPI (Ma et al., 2016), support this parallelization approach. However, without synchronization, workers may compute with stale weights. As such, asynchronous data parallelism also faces the staleness issue. Cipar et al. (Cipar et al., 2013) present bounded staleness to prevent calculating gradients based on excessively stale weights. Zhang et al. (Zhang et al., 2016) propose a staleness-dependent learning rate that applies stale gradients to weight updates with a decay proportional to the inverse of the weight version difference.

Model parallelism is another appealing approach to parallelize DNN training. Krizhevsky (Krizhevsky, 2014) proposes to apply model parallelism on the fully-connected layers of AlexNet (Krizhevsky et al., 2012). Li et al. (Li et al., 2014) design a two-stage pipeline for training recurrent neural networks. Mirhoseini et al. (Mirhoseini et al., 2017) use reinforcement learning to come up with a device placement policy for model parallelism. The above methods cannot parallelize some portions of a model when dependencies exist, i.e., they do not enable inter-layer parallelization. Harlap et al. propose PipeDream (Harlap et al., 2018) as the first work to use pipelining in model parallelism to train DNN models, enabling inter-layer parallelization. Huo et al. (Huo et al., 2018b;a) also propose algorithms to break backward locking for model parallelism so that the backward pass can be performed in parallel. GPipe (Huang et al., 2018) updates the weights synchronously and reduces GPU idle time by dividing each mini-batch into multiple micro-batches and pipelining the micro-batches divided from the same mini-batch.

Some recent works consider more than data parallelism and model parallelism when searching for a parallelization strategy. Jia et al. (Jia et al., 2019; 2018) partition tensors along multiple dimensions and search for the best parallelization strategy.

6 CONCLUSION

As training DNN models is time-consuming, multi-GPU acceleration is a widely adopted way to speed it up. Data parallelism is the most common parallelization scheme, but it suffers from huge inter-GPU communication overheads. Model parallelism provides another parallelization method with much less communication; however, it poses a challenge: the staleness issue. We propose a weight prediction method, SpecTrain, that leverages smoothed gradients for weight prediction to mitigate the staleness problem. Experimental results show that the throughput of pipelined model parallelism is 98.5% higher than data parallelism on average (up to 8.91x). In addition, SpecTrain shows robust training, with no accuracy drop in most of the workloads compared to staleness-free training.

1 National Taiwan University, Taipei, Taiwan. 2 Academia Sinica, Taipei, Taiwan. Correspondence to: Chia-Lin Yang <[email protected]>.


REFERENCES

Abadi, M., Agarwal, A., Barham, P., Brevdo, E., Chen, Z., Citro, C., Corrado, G. S., Davis, A., Dean, J., Devin, M., Ghemawat, S., Goodfellow, I., Harp, A., Irving, G., Isard, M., Jia, Y., Jozefowicz, R., Kaiser, L., Kudlur, M., Levenberg, J., Mane, D., Monga, R., Moore, S., Murray, D., Olah, C., Schuster, M., Shlens, J., Steiner, B., Sutskever, I., Talwar, K., Tucker, P., Vanhoucke, V., Vasudevan, V., Viegas, F., Vinyals, O., Warden, P., Wattenberg, M., Wicke, M., Yu, Y., and Zheng, X. TensorFlow: Large-scale machine learning on heterogeneous systems, 2015. URL https://www.tensorflow.org/.

Aji, A. F. and Heafield, K. Sparse communication for distributed gradient descent. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, pp. 440-445, 2017.

Alistarh, D., Li, J., Tomioka, R., and Vojnovic, M. QSGD: Randomized quantization for communication-optimal stochastic gradient descent. CoRR, abs/1610.02132, 2016. URL http://arxiv.org/abs/1610.02132.

Arivazhagan, N., Bapna, A., Firat, O., Lepikhin, D., Johnson, M., Krikun, M., Chen, M. X., Cao, Y., Foster, G., Cherry, C., Macherey, W., Chen, Z., and Wu, Y. Massively multilingual neural machine translation in the wild: Findings and challenges. CoRR, abs/1907.05019, 2019.

Bahdanau, D., Cho, K., and Bengio, Y. Neural machine translation by jointly learning to align and translate. CoRR, abs/1409.0473, 2014. URL http://arxiv.org/abs/1409.0473.

Chen, J., Monga, R., Bengio, S., and Jozefowicz, R. Revisiting distributed synchronous SGD. CoRR, abs/1604.00981, 2016. URL http://arxiv.org/abs/1604.00981.

Chetlur, S., Woolley, C., Vandermersch, P., Cohen, J., Tran, J., Catanzaro, B., and Shelhamer, E. cuDNN: Efficient primitives for deep learning. CoRR, abs/1410.0759, 2014. URL http://arxiv.org/abs/1410.0759.

Chilimbi, T. M., Suzue, Y., Apacible, J., and Kalyanaraman, K. Project Adam: Building an efficient and scalable deep learning training system. In Proceedings of the 11th USENIX Symposium on Operating Systems Design and Implementation, pp. 571-582, 2014.

Cipar, J., Ho, Q., Kim, J. K., Lee, S., Ganger, G. R., Gibson, G., Keeton, K., and Xing, E. P. Solving the straggler problem with bounded staleness. In Proceedings of the 14th Workshop on Hot Topics in Operating Systems, 2013.

Dean, J., Corrado, G., Monga, R., Chen, K., Devin, M., Le, Q. V., Mao, M. Z., Ranzato, M., Senior, A. W., Tucker, P. A., Yang, K., and Ng, A. Y. Large scale distributed deep networks. In Proceedings of the 26th Conference on Neural Information Processing Systems, pp. 1232-1240, 2012.

H3 Platform Inc. Falconwitch PS1816. URL http://www.h3platform.com/product/.

Harlap, A. Improving ML applications in shared computing environments. Doctoral dissertation, 2019. URL https://aaronharlap.github.io//papers/aharlap_dissertation.pdf.

Harlap, A., Narayanan, D., Phanishayee, A., Seshadri, V., Devanur, N. R., Ganger, G. R., and Gibbons, P. B. PipeDream: Fast and efficient pipeline parallel DNN training. CoRR, abs/1806.03377, 2018. URL http://arxiv.org/abs/1806.03377.

He, K., Zhang, X., Ren, S., and Sun, J. Deep residual learning for image recognition. CoRR, abs/1512.03385, 2015. URL http://arxiv.org/abs/1512.03385.

Hinton, G., Deng, L., Yu, D., Dahl, G. E., Mohamed, A.-r., Jaitly, N., Senior, A., Vanhoucke, V., Nguyen, P., Sainath, T. N., et al. Deep neural networks for acoustic modeling in speech recognition: The shared views of four research groups. IEEE Signal Processing Magazine, 29(6):82-97, 2012.

Huang, Y., Cheng, Y., Chen, D., Lee, H., Ngiam, J., Le, Q. V., and Chen, Z. GPipe: Efficient training of giant neural networks using pipeline parallelism. CoRR, abs/1811.06965, 2018. URL http://arxiv.org/abs/1811.06965.

Huo, Z., Gu, B., and Huang, H. Training neural networks using features replay. CoRR, abs/1807.04511, 2018a. URL http://arxiv.org/abs/1807.04511.

Huo, Z., Gu, B., Yang, Q., and Huang, H. Decoupled parallel backpropagation with convergence guarantee. In Proceedings of the 35th International Conference on Machine Learning, pp. 2103-2111, 2018b.

Iandola, F. N., Moskewicz, M. W., Ashraf, K., and Keutzer, K. FireCaffe: Near-linear acceleration of deep neural network training on compute clusters. In Proceedings of the Conference on Computer Vision and Pattern Recognition, pp. 2592-2600, 2016.

Jia, Y., Shelhamer, E., Donahue, J., Karayev, S., Long, J., Girshick, R., Guadarrama, S., and Darrell, T. Caffe: Convolutional architecture for fast feature embedding. CoRR, abs/1408.5093, 2014. URL https://arxiv.org/abs/1408.5093.


Jia, Z., Lin, S., Qi, C. R., and Aiken, A. Exploring hidden dimensions in parallelizing convolutional neural networks. In Proceedings of the 35th International Conference on Machine Learning, pp. 2279-2288, 2018.

Jia, Z., Zaharia, M., and Aiken, A. Beyond data and model parallelism for deep neural networks. In Proceedings of the 2nd Conference on Systems and Machine Learning, 2019.

Keskar, N. S., Mudigere, D., Nocedal, J., Smelyanskiy, M., and Tang, P. T. P. On large-batch training for deep learning: Generalization gap and sharp minima. CoRR, abs/1609.04836, 2016. URL http://arxiv.org/abs/1609.04836.

Kim, J., El-Khamy, M., and Lee, J. Residual LSTM: Design of a deep recurrent architecture for distant speech recognition. In Proceedings of the 18th Conference of the International Speech Communication Association, pp. 1591-1595, 2017.

Kingma, D. P. and Ba, J. Adam: A method for stochastic optimization. CoRR, abs/1412.6980, 2014. URL http://arxiv.org/abs/1412.6980.

Klambauer, G., Unterthiner, T., Mayr, A., and Hochreiter, S. Self-normalizing neural networks. In Proceedings of the 31st Conference on Neural Information Processing Systems, pp. 972-981, 2017.

Krizhevsky, A. One weird trick for parallelizing convolutional neural networks. CoRR, abs/1404.5997, 2014. URL http://arxiv.org/abs/1404.5997.

Krizhevsky, A., Sutskever, I., and Hinton, G. E. ImageNet classification with deep convolutional neural networks. In Proceedings of the 26th Conference on Neural Information Processing Systems, pp. 1106-1114, 2012.

Lavin, A. and Gray, S. Fast algorithms for convolutional neural networks. In Proceedings of the Conference on Computer Vision and Pattern Recognition, pp. 4013-4021, 2016.

Li, B., Zhou, E., Huang, B., Duan, J., Wang, Y., Xu, N., Zhang, J., and Yang, H. Large scale recurrent neural network on GPU. In Proceedings of the International Joint Conference on Neural Networks, pp. 4062-4069, 2014.

Lian, X., Huang, Y., Li, Y., and Liu, J. Asynchronous parallel stochastic gradient for nonconvex optimization. In Proceedings of the 29th Conference on Neural Information Processing Systems, pp. 2737-2745, 2015.

Lin, Y., Han, S., Mao, H., Wang, Y., and Dally, W. J. Deep gradient compression: Reducing the communication bandwidth for distributed training. CoRR, abs/1712.01887, 2017. URL http://arxiv.org/abs/1712.01887.

Ma, H., Mao, F., and Taylor, G. W. Theano-MPI: A Theano-based distributed training framework. In Proceedings of the 22nd International European Conference on Parallel and Distributed Computing, pp. 800-813, 2016.

Maas, A. L., Daly, R. E., Pham, P. T., Huang, D., Ng, A. Y., and Potts, C. Learning word vectors for sentiment analysis. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics, pp. 142-150, 2011.

Mamidala, A. R., Kollias, G., Ward, C., and Artico, F. MXNET-MPI: Embedding MPI parallelism in parameter server task model for scaling deep learning. CoRR, abs/1801.03855, 2018. URL http://arxiv.org/abs/1801.03855.

Mathieu, M., Henaff, M., and LeCun, Y. Fast training of convolutional networks through FFTs. CoRR, abs/1312.5851, 2013. URL http://arxiv.org/abs/1312.5851.

Mirhoseini, A., Pham, H., Le, Q. V., Steiner, B., Larsen, R., Zhou, Y., Kumar, N., Norouzi, M., Bengio, S., and Dean, J. Device placement optimization with reinforcement learning. In Proceedings of the 34th International Conference on Machine Learning, pp. 2430-2439, 2017.

Pascanu, R., Mikolov, T., and Bengio, Y. On the difficulty of training recurrent neural networks. In Proceedings of the 30th International Conference on Machine Learning, pp. 1310-1318, 2013.

Paszke, A., Gross, S., Chintala, S., Chanan, G., Yang, E., DeVito, Z., Lin, Z., Desmaison, A., Antiga, L., and Lerer, A. Automatic differentiation in PyTorch. In NIPS Autodiff Workshop, 2017.

Real, E., Aggarwal, A., Huang, Y., and Le, Q. V. Regularized evolution for image classifier architecture search. In Proceedings of the 33rd AAAI Conference on Artificial Intelligence, pp. 4780-4789, 2019.

Seide, F., Fu, H., Droppo, J., Li, G., and Yu, D. 1-bit stochastic gradient descent and its application to data-parallel distributed training of speech DNNs. In Proceedings of the 15th Conference of the International Speech Communication Association, pp. 1058-1062, 2014.

Simonyan, K. and Zisserman, A. Very deep convolutional networks for large-scale image recognition. CoRR, abs/1409.1556, 2014. URL http://arxiv.org/abs/1409.1556.


Strom, N. Scalable distributed DNN training using commodity GPU cloud computing. In Proceedings of the 16th Conference of the International Speech Communication Association, pp. 1488-1492, 2015.

Sun, C., Zhang, Y., Yu, W., Zhang, R., Bhuiyan, M. Z. A., and Li, J. DPS: A DSM-based parameter server for machine learning. In Proceedings of the 14th International Symposium on Pervasive Systems, Algorithms and Networks & 11th International Conference on Frontier of Computer Science and Technology & 3rd International Symposium of Creative Computing, pp. 20-27, 2017.

Sutskever, I., Vinyals, O., and Le, Q. V. Sequence to sequence learning with neural networks. In Proceedings of the 28th Conference on Neural Information Processing Systems, pp. 3104-3112, 2014.

Szegedy, C., Ioffe, S., Vanhoucke, V., and Alemi, A. A. Inception-v4, Inception-ResNet and the impact of residual connections on learning. In Proceedings of the 31st AAAI Conference on Artificial Intelligence, pp. 4278-4284, 2017.

Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, L., and Polosukhin, I. Attention is all you need. In Proceedings of the 31st Conference on Neural Information Processing Systems, pp. 6000-6010, 2017.

Wu, L., Zhu, Z., and E, W. Towards understanding generalization of deep learning: Perspective of loss landscapes. CoRR, abs/1706.10239, 2017. URL http://arxiv.org/abs/1706.10239.

Zhang, W., Gupta, S., Lian, X., and Liu, J. Staleness-aware async-SGD for distributed deep learning. In Proceedings of the 25th International Joint Conference on Artificial Intelligence, pp. 2350-2356, 2016.

Zhou, S., Ni, Z., Zhou, X., Wen, H., Wu, Y., and Zou, Y. DoReFa-Net: Training low bitwidth convolutional neural networks with low bitwidth gradients. CoRR, abs/1606.06160, 2016. URL http://arxiv.org/abs/1606.06160.