Page 1

High Performance Machine Learning - Distributed training

Paweł Rościszewski
pawel.rosciszewski@pg.edu.pl
Office: 521 EA, office hours: Friday 10:00-11:30

March 7, 2018

Page 2

Supplementary courses

Coursera - Machine Learning (https://www.coursera.org/learn/machine-learning)

Coursera - Deep Learning (https://www.coursera.org/specializations/deep-learning)

Stanford CS231N (http://cs231n.stanford.edu/)

DataCamp premium support (https://www.datacamp.com/)

Page 3

Recap

HPC is crucial for contemporary Machine Learning workloads

Huge training datasets, big models, compute intensity

Also relevant: unsupervised learning, other compute-intensive workloads, inference

We will focus mostly on training supervised learning models

Page 4

Stochastic Gradient Descent to rule them all

Figure: SGD iterations [1]
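To make the iteration in the figure concrete, here is a minimal NumPy sketch of mini-batch SGD on a linear least-squares model. The synthetic data, learning rate and batch size are illustrative placeholders, not values from the slides.

```python
# Minimal sketch of mini-batch SGD on a linear model, using NumPy only.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 10))                      # synthetic inputs
w_true = rng.normal(size=10)
y = X @ w_true + 0.1 * rng.normal(size=1000)         # noisy targets

w = np.zeros(10)                                     # model parameters
lr, batch_size = 0.1, 32

for step in range(200):
    idx = rng.integers(0, len(X), size=batch_size)   # sample a mini-batch
    xb, yb = X[idx], y[idx]
    grad = 2.0 / batch_size * xb.T @ (xb @ w - yb)   # gradient of MSE on the batch
    w -= lr * grad                                   # SGD update
```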

Page 5

Single-node optimizations

Vectorization - example (see the sketch after this list)

High-performance libraries - NumPy, cuDNN, Intel MKL, Eigen, ...

Hardware - examples (1060, P100, V100 - CPU, memory, P2P)

Monitoring - examples (top, nvidia-smi, ...)

Experiments - NHWC/NCHW, batch size, benchmarks
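A minimal sketch of the vectorization point from the first bullet: the same dot product written as an interpreted Python loop and as a single NumPy call that dispatches to optimized, vectorized library code. The array sizes are arbitrary.

```python
# Compare an explicit Python loop with one vectorized NumPy call.
import time
import numpy as np

a = np.random.rand(1_000_000)
b = np.random.rand(1_000_000)

t0 = time.perf_counter()
s_loop = 0.0
for i in range(len(a)):          # interpreted, element-by-element
    s_loop += a[i] * b[i]
t1 = time.perf_counter()

s_vec = np.dot(a, b)             # one call into compiled, vectorized code
t2 = time.perf_counter()

print(f"loop:       {t1 - t0:.3f} s")
print(f"vectorized: {t2 - t1:.3f} s")
```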

Page 6

Computational graph - example
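Since this slide shows only a figure, here is a toy, hand-rolled computational graph in plain Python (no framework) illustrating the idea that frameworks such as TensorFlow build on: nodes record how their value was computed so gradients can be propagated backwards. The Node/mul/add/backward names are ad hoc for this sketch.

```python
# A toy computational graph with forward evaluation and reverse-mode gradients.
class Node:
    def __init__(self, value=0.0):
        self.value = value
        self.grad = 0.0
        self.parents = []        # (parent_node, local_gradient) pairs

def mul(a, b):
    out = Node(a.value * b.value)
    out.parents = [(a, b.value), (b, a.value)]   # d(ab)/da = b, d(ab)/db = a
    return out

def add(a, b):
    out = Node(a.value + b.value)
    out.parents = [(a, 1.0), (b, 1.0)]
    return out

def backward(node, upstream=1.0):
    node.grad += upstream
    for parent, local in node.parents:
        backward(parent, upstream * local)       # chain rule

# y = w * x + b, evaluated and differentiated on the graph
w, x, b = Node(2.0), Node(3.0), Node(1.0)
y = add(mul(w, x), b)
backward(y)
print(y.value, w.grad, x.grad, b.grad)           # 7.0 3.0 2.0 1.0
```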

Page 7

Profiling - example
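As a stand-in for the framework-specific example shown in the lecture, a minimal profiling sketch using Python's built-in cProfile on a toy "training step"; GPU and TensorFlow profilers follow the same idea of attributing time to individual operations. The matrix sizes and the training_step function are hypothetical.

```python
# Profile a toy training step and print where the time went.
import cProfile
import pstats
import numpy as np

def training_step(X, w):
    h = np.maximum(X @ w, 0.0)       # stand-in for a forward pass (dense layer + ReLU)
    return float(h.sum())

X = np.random.rand(2048, 2048)
w = np.random.rand(2048, 512)

profiler = cProfile.Profile()
profiler.enable()
for _ in range(10):
    training_step(X, w)
profiler.disable()

# Sort by cumulative time and show the top 5 entries.
pstats.Stats(profiler).sort_stats("cumulative").print_stats(5)
```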

Page 8

Recap

why HPC is crucial for ML

dataset, model sizes, hardware used for training

convolutions and their implementations

benchmarking, profiling of ML code

numerical formats used in ML

distributed training algorithms

Page 9

Schedule

numerical formats used in ML - recap

distributed training algorithms - recap

4.04 - Włodzimierz Kaoka (VoiceLab.ai) - practical deployment of RNNs for acoustic model inference

4.04 - graph representations?

11.04 - TensorFlow hands-on

18.04 - midterm test (45 min), TF hands-on

25.04 - lab starts

Page 10

Hardware trends

Figure: TensorCores [9]

Page 11

Hardware trends

Figure: Tensor Processing Unit architecture [10]

Page 12

Multinode training

Figure: SGD iterations [1]

Page 13

Multinode training - data parallelism

Figure: General architecture of data parallel multinode training [1]
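A minimal sketch of the data-parallel scheme in the figure, simulated in a single NumPy process: each "worker" computes a gradient on its own shard of the mini-batch and the gradients are averaged before the shared parameters are updated. A real system would do this across nodes with a parameter server or an all-reduce; data and hyperparameters here are synthetic.

```python
# Simulated synchronous data-parallel SGD with gradient averaging.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(1024, 10))
y = X @ rng.normal(size=10)

w = np.zeros(10)                 # replicated model parameters
n_workers, lr = 4, 0.1

def local_gradient(xb, yb, w):
    return 2.0 / len(xb) * xb.T @ (xb @ w - yb)

for step in range(100):
    idx = rng.integers(0, len(X), size=256)
    shards = np.array_split(idx, n_workers)          # split the batch across workers
    grads = [local_gradient(X[s], y[s], w) for s in shards]
    w -= lr * np.mean(grads, axis=0)                 # all-reduce / parameter-server average
```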

Page 14

Multinode training - model parallelism

Figure: General architecture of model parallel multinode training [2]
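A minimal sketch of model parallelism, again simulated in NumPy: the weight matrix of one large layer is split column-wise across two "devices", each computes its slice of the output, and the partial results are exchanged and concatenated. The sizes and the column-wise split are illustrative choices.

```python
# Simulated model-parallel evaluation of one large dense layer.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(64, 1000))              # one mini-batch of activations

W = rng.normal(size=(1000, 4000))            # a layer too large for one device
W_dev0, W_dev1 = W[:, :2000], W[:, 2000:]    # each device holds half the columns

h0 = X @ W_dev0                              # computed on device 0
h1 = X @ W_dev1                              # computed on device 1
h = np.concatenate([h0, h1], axis=1)         # exchange/concatenate the partial outputs

assert np.allclose(h, X @ W)                 # same result as the unsplit layer
```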

Page 15

Multinode training - mixed model/data

Convolutional layers cumulatively contain about 90-95% of the computation, about 5% of the parameters, and have large representations. [3]

Fully-connected layers contain about 5-10% of the computation, about 95% of the parameters, and have small representations. [3]

Figure: Mixed data/model parallel multinode training [3]
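A minimal sketch of the mixed scheme motivated above, simulated in NumPy: the compute-heavy "convolutional" part (here just a stand-in dense layer with ReLU) runs data-parallel on batch shards, while the parameter-heavy "fully-connected" part runs model-parallel on column slices of its weights. All shapes are illustrative and the actual communication pattern of [3] is not modeled.

```python
# Data-parallel "conv" phase followed by a model-parallel "FC" phase.
import numpy as np

rng = np.random.default_rng(0)
n_workers, batch, feat = 2, 64, 512

X_shards = [rng.normal(size=(batch // n_workers, feat)) for _ in range(n_workers)]
W_conv = rng.normal(size=(feat, feat)) * 0.01        # stand-in for the conv stack (replicated)
W_fc = rng.normal(size=(feat, 1000)) * 0.01          # large FC layer
W_fc_slices = np.split(W_fc, n_workers, axis=1)      # model-parallel split of the FC weights

# Data-parallel phase: every worker applies the replicated conv weights to its shard.
H_shards = [np.maximum(x @ W_conv, 0.0) for x in X_shards]

# Model-parallel phase: gather activations, each worker computes its slice of
# the FC output for the full batch, then the slices are concatenated.
H = np.concatenate(H_shards, axis=0)
out_slices = [H @ W_k for W_k in W_fc_slices]
out = np.concatenate(out_slices, axis=1)             # full FC output for the batch
```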

Page 16

Multinode training - search parallelism

Figure: US Department of Energy Deep Learning objectives [11]

"DNNs in general do not have good strong scaling behavior, so to fully exploit large-scale parallelism they rely on a combination of model, data and search parallelism." [4]
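A minimal sketch of search parallelism: independent training runs with different hyperparameters are embarrassingly parallel. Here they are spread over local processes with Python's multiprocessing; on a cluster each run would typically go to a separate node. The toy train_and_evaluate function and the learning-rate grid are hypothetical.

```python
# Run independent hyperparameter trials in parallel and keep the best one.
import numpy as np
from multiprocessing import Pool

def train_and_evaluate(lr):
    rng = np.random.default_rng(0)
    X = rng.normal(size=(512, 10))
    y = X @ rng.normal(size=10)
    w = np.zeros(10)
    for _ in range(200):
        idx = rng.integers(0, len(X), size=32)
        xb, yb = X[idx], y[idx]
        w -= lr * (2.0 / len(xb)) * xb.T @ (xb @ w - yb)
    return lr, float(np.mean((X @ w - y) ** 2))      # (hyperparameter, loss)

if __name__ == "__main__":
    learning_rates = [0.001, 0.01, 0.05, 0.1]
    with Pool(processes=4) as pool:
        results = pool.map(train_and_evaluate, learning_rates)
    print(min(results, key=lambda r: r[1]))          # best learning rate found
```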

Page 17

Multinode training - model averaging

Figure: Parallel training of DNNs with natural gradient and parameter averaging [5]
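A minimal sketch of the parameter-averaging idea from [5]: worker replicas run independent local SGD for a number of steps, then their parameter vectors are averaged and broadcast back as the new starting point. The natural-gradient component of [5] is omitted here, and all data and hyperparameters are synthetic.

```python
# Periodic model averaging across independently trained replicas.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(2048, 10))
y = X @ rng.normal(size=10)

n_workers, local_steps, lr = 4, 10, 0.05
w_global = np.zeros(10)

for round_ in range(20):
    replicas = []
    for k in range(n_workers):
        w = w_global.copy()                          # start from the averaged model
        for _ in range(local_steps):                 # independent local SGD steps
            idx = rng.integers(0, len(X), size=32)
            xb, yb = X[idx], y[idx]
            w -= lr * (2.0 / len(xb)) * xb.T @ (xb @ w - yb)
        replicas.append(w)
    w_global = np.mean(replicas, axis=0)             # average and redistribute
```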

Page 18

Multinode training - model averaging frequency?

Figure: Experiments with model averaging frequency [6]

Page 19

Multinode training - model averaging frequency?

Figure: Using experience from our HPCS class to optimize a popular ML framework [6]

Page 20

Federated Learning

Figure: Federated Learning architecture [7,8]
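A minimal sketch of Federated Averaging as described in [7]: clients train locally on data that never leaves the device and the server combines the returned models with an average weighted by client dataset size. Client sampling, compression and privacy mechanisms from [7,8] are omitted; the client data here is synthetic.

```python
# Simulated FedAvg rounds over three clients with private, synthetic data.
import numpy as np

rng = np.random.default_rng(0)
w_true = rng.normal(size=10)
clients = []
for n in (200, 50, 400):                             # clients with unequal data sizes
    X = rng.normal(size=(n, 10))
    y = X @ w_true + 0.1 * rng.normal(size=n)
    clients.append((X, y))

w_server = np.zeros(10)
lr, local_epochs = 0.05, 5

for round_ in range(30):
    updates, sizes = [], []
    for X, y in clients:                             # in practice: a sampled subset
        w = w_server.copy()
        for _ in range(local_epochs):                # local training on private data
            w -= lr * (2.0 / len(X)) * X.T @ (X @ w - y)
        updates.append(w)
        sizes.append(len(X))
    weights = np.array(sizes) / sum(sizes)
    w_server = np.average(updates, axis=0, weights=weights)   # weighted average
```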

Page 21

References

1 https://github.com/tensorflow/models/tree/master/research/inception

2 Dean, J., Corrado, G., Monga, R., Chen, K., Devin, M., Mao, M., Senior, A., Tucker, P., Yang, K., Le, Q.V., et al., 2012. Large scale distributed deep networks, in: Advances in Neural Information Processing Systems, pp. 1223–1231.

3 Krizhevsky, A., 2014. One weird trick for parallelizing convolutional neural networks. arXiv preprint arXiv:1404.5997.

4 Stevens, R., 2017. Deep Learning in Cancer and Infectious Disease: Novel Driver Problems for Future HPC Architecture. ACM Press, pp. 65–65. https://doi.org/10.1145/3078597.3091526

5 Povey, D., Zhang, X., Khudanpur, S., 2014. Parallel training of deep neural networks with natural gradient and parameter averaging. CoRR, vol. abs/1410.7455.

6 Rościszewski, P., Kaliski, J., 2017. Minimizing Distribution and Data Loading Overheads in Parallel Training of DNN Acoustic Models with Frequent Parameter Averaging. IEEE, pp. 560–565. https://doi.org/10.1109/HPCS.2017.89

7 McMahan, H.B., Moore, E., Ramage, D., Hampson, S., 2016. Communication-efficient learning of deep networks fromdecentralized data. arXiv preprint arXiv:1602.05629.

8 https://research.googleblog.com/2017/04/federated-learning-collaborative.html

9 https://devblogs.nvidia.com/programming-tensor-cores-cuda-9/

10 https://cloud.google.com/blog/big-data/2017/05/an-in-depth-look-at-googles-first-tensor-processing-unit-tpu

11 https://www.slideshare.net/insideHPC/a-vision-for-exascale-simulation-and-deep-learning
