Page 1

High Performance Machine Learning - Distributed training

Paweł Rościszewski
pawel.rosciszewski@pg.edu.pl
Office: 521 EA, office hours: Friday 10:00-11:30

March 7, 2018

Page 2

Supplementary courses

Coursera - Machine Learning (https://www.coursera.org/learn/machine-learning)

Coursera - Deep Learning (https://www.coursera.org/specializations/deep-learning)

Stanford CS231N (http://cs231n.stanford.edu/)

DataCamp premium support (https://www.datacamp.com/)

Page 3

Recap

HPC is crucial for contemporary Machine Learning workloads

Huge training datasets, big models, compute intensity

Also relevant: unsupervised learning, other compute-intensive workloads, inference

We will focus mostly on training supervised learning models

Page 4

Stochastic Gradient Descent to rule them all

Figure: SGD iterations [1]
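To make the iteration in the figure concrete, here is a minimal NumPy sketch of mini-batch SGD on a linear least-squares model. The synthetic data, learning rate and batch size are illustrative placeholders, not values from the slides.

```python
# Minimal sketch of mini-batch SGD on a linear model, using NumPy only.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 10))                      # synthetic inputs
w_true = rng.normal(size=10)
y = X @ w_true + 0.1 * rng.normal(size=1000)         # noisy targets

w = np.zeros(10)                                     # model parameters
lr, batch_size = 0.1, 32

for step in range(200):
    idx = rng.integers(0, len(X), size=batch_size)   # sample a mini-batch
    xb, yb = X[idx], y[idx]
    grad = 2.0 / batch_size * xb.T @ (xb @ w - yb)   # gradient of MSE on the batch
    w -= lr * grad                                   # SGD update
```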

Page 5

Single-node optimizations

Vectorization - example (see the sketch after this list)

High-performance libraries - NumPy, cuDNN, Intel MKL, Eigen, ...

Hardware - examples (1060, P100, V100 - CPU, memory, P2P)

Monitoring - examples (top, nvidia-smi, ...)

Experiments - NHWC/NCHW, batch size, benchmarks
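A minimal sketch of the vectorization point from the first bullet: the same dot product written as an interpreted Python loop and as a single NumPy call that dispatches to optimized, vectorized library code. The array sizes are arbitrary.

```python
# Compare an explicit Python loop with one vectorized NumPy call.
import time
import numpy as np

a = np.random.rand(1_000_000)
b = np.random.rand(1_000_000)

t0 = time.perf_counter()
s_loop = 0.0
for i in range(len(a)):          # interpreted, element-by-element
    s_loop += a[i] * b[i]
t1 = time.perf_counter()

s_vec = np.dot(a, b)             # one call into compiled, vectorized code
t2 = time.perf_counter()

print(f"loop:       {t1 - t0:.3f} s")
print(f"vectorized: {t2 - t1:.3f} s")
```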

Page 6

Computational graph - example
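Since this slide shows only a figure, here is a toy, hand-rolled computational graph in plain Python (no framework) illustrating the idea that frameworks such as TensorFlow build on: nodes record how their value was computed so gradients can be propagated backwards. The Node/mul/add/backward names are ad hoc for this sketch.

```python
# A toy computational graph with forward evaluation and reverse-mode gradients.
class Node:
    def __init__(self, value=0.0):
        self.value = value
        self.grad = 0.0
        self.parents = []        # (parent_node, local_gradient) pairs

def mul(a, b):
    out = Node(a.value * b.value)
    out.parents = [(a, b.value), (b, a.value)]   # d(ab)/da = b, d(ab)/db = a
    return out

def add(a, b):
    out = Node(a.value + b.value)
    out.parents = [(a, 1.0), (b, 1.0)]
    return out

def backward(node, upstream=1.0):
    node.grad += upstream
    for parent, local in node.parents:
        backward(parent, upstream * local)       # chain rule

# y = w * x + b, evaluated and differentiated on the graph
w, x, b = Node(2.0), Node(3.0), Node(1.0)
y = add(mul(w, x), b)
backward(y)
print(y.value, w.grad, x.grad, b.grad)           # 7.0 3.0 2.0 1.0
```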

Page 7

Profiling - example
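As a stand-in for the framework-specific example shown in the lecture, a minimal profiling sketch using Python's built-in cProfile on a toy "training step"; GPU and TensorFlow profilers follow the same idea of attributing time to individual operations. The matrix sizes and the training_step function are hypothetical.

```python
# Profile a toy training step and print where the time went.
import cProfile
import pstats
import numpy as np

def training_step(X, w):
    h = np.maximum(X @ w, 0.0)       # stand-in for a forward pass (dense layer + ReLU)
    return float(h.sum())

X = np.random.rand(2048, 2048)
w = np.random.rand(2048, 512)

profiler = cProfile.Profile()
profiler.enable()
for _ in range(10):
    training_step(X, w)
profiler.disable()

# Sort by cumulative time and show the top 5 entries.
pstats.Stats(profiler).sort_stats("cumulative").print_stats(5)
```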

Page 8

Recap

why HPC is crucial for ML

dataset, model sizes, hardware used for training

convolutions and their implementations

benchmarking, profiling of ML code

numerical formats used in ML

distributed training algorithms

Page 9

Schedule

numerical formats used in ML - recap

distributed training algorithms - recap

4.04 - Włodzimierz Kaoka (VoiceLab.ai) - practical deployment of RNNs for acoustic model inference

4.04 - graph representations?

11.04 - TensorFlow hands-on

18.04 - midterm test (45 min), TF hands-on

25.04 - lab starts

Page 10

Hardware trends

Figure: TensorCores [9]

Page 11

Hardware trends

Figure: Tensor Processing Unit architecture [10]

Page 12

Multinode training

Figure: SGD iterations [1]

Page 13

Multinode training - data parallelism

Figure: General architecture of data parallel multinode training [1]
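A minimal sketch of the data-parallel scheme in the figure, simulated in a single NumPy process: each "worker" computes a gradient on its own shard of the mini-batch and the gradients are averaged before the shared parameters are updated. A real system would do this across nodes with a parameter server or an all-reduce; data and hyperparameters here are synthetic.

```python
# Simulated synchronous data-parallel SGD with gradient averaging.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(1024, 10))
y = X @ rng.normal(size=10)

w = np.zeros(10)                 # replicated model parameters
n_workers, lr = 4, 0.1

def local_gradient(xb, yb, w):
    return 2.0 / len(xb) * xb.T @ (xb @ w - yb)

for step in range(100):
    idx = rng.integers(0, len(X), size=256)
    shards = np.array_split(idx, n_workers)          # split the batch across workers
    grads = [local_gradient(X[s], y[s], w) for s in shards]
    w -= lr * np.mean(grads, axis=0)                 # all-reduce / parameter-server average
```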

Page 14

Multinode training - model parallelism

Figure: General architecture of model parallel multinode training [2]
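A minimal sketch of model parallelism, again simulated in NumPy: the weight matrix of one large layer is split column-wise across two "devices", each computes its slice of the output, and the partial results are exchanged and concatenated. The sizes and the column-wise split are illustrative choices.

```python
# Simulated model-parallel evaluation of one large dense layer.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(64, 1000))              # one mini-batch of activations

W = rng.normal(size=(1000, 4000))            # a layer too large for one device
W_dev0, W_dev1 = W[:, :2000], W[:, 2000:]    # each device holds half the columns

h0 = X @ W_dev0                              # computed on device 0
h1 = X @ W_dev1                              # computed on device 1
h = np.concatenate([h0, h1], axis=1)         # exchange/concatenate the partial outputs

assert np.allclose(h, X @ W)                 # same result as the unsplit layer
```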

Page 15

Multinode training - mixed model/data

Convolutional layers cumulatively contain about 90-95% of the computation, about 5% of the parameters, and have large representations. [3]

Fully-connected layers contain about 5-10% of the computation, about 95% of the parameters, and have small representations. [3]

Figure: Mixed data/model parallel multinode training [3]
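A minimal sketch of the mixed scheme motivated above, simulated in NumPy: the compute-heavy "convolutional" part (here just a stand-in dense layer with ReLU) runs data-parallel on batch shards, while the parameter-heavy "fully-connected" part runs model-parallel on column slices of its weights. All shapes are illustrative and the actual communication pattern of [3] is not modeled.

```python
# Data-parallel "conv" phase followed by a model-parallel "FC" phase.
import numpy as np

rng = np.random.default_rng(0)
n_workers, batch, feat = 2, 64, 512

X_shards = [rng.normal(size=(batch // n_workers, feat)) for _ in range(n_workers)]
W_conv = rng.normal(size=(feat, feat)) * 0.01        # stand-in for the conv stack (replicated)
W_fc = rng.normal(size=(feat, 1000)) * 0.01          # large FC layer
W_fc_slices = np.split(W_fc, n_workers, axis=1)      # model-parallel split of the FC weights

# Data-parallel phase: every worker applies the replicated conv weights to its shard.
H_shards = [np.maximum(x @ W_conv, 0.0) for x in X_shards]

# Model-parallel phase: gather activations, each worker computes its slice of
# the FC output for the full batch, then the slices are concatenated.
H = np.concatenate(H_shards, axis=0)
out_slices = [H @ W_k for W_k in W_fc_slices]
out = np.concatenate(out_slices, axis=1)             # full FC output for the batch
```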

Page 16

Multinode training - search parallelism

Figure: US Department of Energy Deep Learning objectives [11]

"DNNs in general do not have good strong scaling behavior, so to fully exploit large-scale parallelism they rely on a combination of model, data and search parallelism." [4]
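A minimal sketch of search parallelism: independent training runs with different hyperparameters are embarrassingly parallel. Here they are spread over local processes with Python's multiprocessing; on a cluster each run would typically go to a separate node. The toy train_and_evaluate function and the learning-rate grid are hypothetical.

```python
# Run independent hyperparameter trials in parallel and keep the best one.
import numpy as np
from multiprocessing import Pool

def train_and_evaluate(lr):
    rng = np.random.default_rng(0)
    X = rng.normal(size=(512, 10))
    y = X @ rng.normal(size=10)
    w = np.zeros(10)
    for _ in range(200):
        idx = rng.integers(0, len(X), size=32)
        xb, yb = X[idx], y[idx]
        w -= lr * (2.0 / len(xb)) * xb.T @ (xb @ w - yb)
    return lr, float(np.mean((X @ w - y) ** 2))      # (hyperparameter, loss)

if __name__ == "__main__":
    learning_rates = [0.001, 0.01, 0.05, 0.1]
    with Pool(processes=4) as pool:
        results = pool.map(train_and_evaluate, learning_rates)
    print(min(results, key=lambda r: r[1]))          # best learning rate found
```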

Page 17

Multinode training - model averaging

Figure: Parallel training of DNNs with natural gradient and parameter averaging [5]
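A minimal sketch of the parameter-averaging idea from [5]: worker replicas run independent local SGD for a number of steps, then their parameter vectors are averaged and broadcast back as the new starting point. The natural-gradient component of [5] is omitted here, and all data and hyperparameters are synthetic.

```python
# Periodic model averaging across independently trained replicas.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(2048, 10))
y = X @ rng.normal(size=10)

n_workers, local_steps, lr = 4, 10, 0.05
w_global = np.zeros(10)

for round_ in range(20):
    replicas = []
    for k in range(n_workers):
        w = w_global.copy()                          # start from the averaged model
        for _ in range(local_steps):                 # independent local SGD steps
            idx = rng.integers(0, len(X), size=32)
            xb, yb = X[idx], y[idx]
            w -= lr * (2.0 / len(xb)) * xb.T @ (xb @ w - yb)
        replicas.append(w)
    w_global = np.mean(replicas, axis=0)             # average and redistribute
```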

Page 18

Multinode training - model averaging frequency?

Figure: Experiments with model averaging frequency [6]

Page 19

Multinode training - model averaging frequency?

Figure: Using experience from our HPCS class to optimize a popular ML framework [6]

Page 20

Federated Learning

Figure: Federated Learning architecture [7,8]
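A minimal sketch of Federated Averaging as described in [7]: clients train locally on data that never leaves the device and the server combines the returned models with an average weighted by client dataset size. Client sampling, compression and privacy mechanisms from [7,8] are omitted; the client data here is synthetic.

```python
# Simulated FedAvg rounds over three clients with private, synthetic data.
import numpy as np

rng = np.random.default_rng(0)
w_true = rng.normal(size=10)
clients = []
for n in (200, 50, 400):                             # clients with unequal data sizes
    X = rng.normal(size=(n, 10))
    y = X @ w_true + 0.1 * rng.normal(size=n)
    clients.append((X, y))

w_server = np.zeros(10)
lr, local_epochs = 0.05, 5

for round_ in range(30):
    updates, sizes = [], []
    for X, y in clients:                             # in practice: a sampled subset
        w = w_server.copy()
        for _ in range(local_epochs):                # local training on private data
            w -= lr * (2.0 / len(X)) * X.T @ (X @ w - y)
        updates.append(w)
        sizes.append(len(X))
    weights = np.array(sizes) / sum(sizes)
    w_server = np.average(updates, axis=0, weights=weights)   # weighted average
```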

Page 21

References

1 https://github.com/tensorflow/models/tree/master/research/inception

2 Dean, J., Corrado, G., Monga, R., Chen, K., Devin, M., Mao, M., Senior, A., Tucker, P., Yang, K., Le, Q.V., et al., 2012. Large scale distributed deep networks, in: Advances in Neural Information Processing Systems, pp. 1223–1231.

3 Krizhevsky, A., 2014. One weird trick for parallelizing convolutional neural networks. arXiv preprint arXiv:1404.5997.

4 Stevens, R., 2017. Deep Learning in Cancer and Infectious Disease: Novel Driver Problems for Future HPC Architecture. ACM Press, pp. 65–65. https://doi.org/10.1145/3078597.3091526

5 Povey, D., Zhang, X., Khudanpur, S., 2014. Parallel training of deep neural networks with natural gradient and parameter averaging. CoRR, vol. abs/1410.7455.

6 Rościszewski, P., Kaliski, J., 2017. Minimizing Distribution and Data Loading Overheads in Parallel Training of DNN Acoustic Models with Frequent Parameter Averaging. IEEE, pp. 560–565. https://doi.org/10.1109/HPCS.2017.89

7 McMahan, H.B., Moore, E., Ramage, D., Hampson, S., 2016. Communication-efficient learning of deep networks fromdecentralized data. arXiv preprint arXiv:1602.05629.

8 https://research.googleblog.com/2017/04/federated-learning-collaborative.html

9 https://devblogs.nvidia.com/programming-tensor-cores-cuda-9/

10 https://cloud.google.com/blog/big-data/2017/05/an-in-depth-look-at-googles-first-tensor-processing-unit-tpu

11 https://www.slideshare.net/insideHPC/a-vision-for-exascale-simulation-and-deep-learning
