High Performance Machine Learning: Distributed training
Paweł Rosciszewski
[email protected], room 521 EA, office hours: Friday 10:00-11:30
March 7, 2018
Paweł Rosciszewski High Performance Machine Learning March 7, 2018 1 / 21
Supplementary courses
Coursera - Machine Learning (https://www.coursera.org/learn/machine-learning)
Coursera - Deep Learning (https://www.coursera.org/specializations/deep-learning)
Stanford CS231N (http://cs231n.stanford.edu/)
DataCamp premium support (https://www.datacamp.com/)
Recap
HPC is crucial for contemporary Machine Learning workloads
Huge training datasets, big models, compute intensity
Also unsupervised learning, compute-intensive workloads, inference
We will focus mostly on training supervised learning models
Stochastic Gradient Descent to rule them all
Figure: SGD iterations [1]
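The SGD iteration in the figure can be sketched concretely. Below is a minimal toy example, assuming a least-squares objective; the data, learning rate, and batch size are made up for illustration:

```python
import numpy as np

# Toy SGD: repeatedly take a gradient step on a random mini-batch.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
true_w = np.array([1.0, -2.0, 0.5])
y = X @ true_w                      # noiseless targets

def mse_grad(w, idx):
    """Gradient of the mean squared error on the mini-batch `idx`."""
    xb, yb = X[idx], y[idx]
    return 2.0 / len(idx) * xb.T @ (xb @ w - yb)

w = np.zeros(3)
lr = 0.05
for step in range(500):
    batch = rng.choice(len(X), size=10, replace=False)  # random mini-batch
    w = w - lr * mse_grad(w, batch)                     # SGD update
```

Each update touches only 10 of the 100 examples, which is what makes the iteration cheap and the trajectory noisy.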
Single-node optimizations
Vectorization - example
High-performance libraries - NumPy, cuDNN, Intel MKL, Eigen, ...
Hardware - examples (1060, P100, V100 - CPU, mem, P2P)
Monitoring - examples (top, nvidia-smi, ...)
Experiments - NHWC/NCHW, batch size, benchmarks
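To make the vectorization bullet concrete: the same dot product written as an explicit Python loop versus a single NumPy call. Exact timings vary by machine; the relative speedup, not the numbers, is the point:

```python
import time
import numpy as np

x = np.random.rand(1_000_000)
w = np.random.rand(1_000_000)

# Scalar loop: every element passes through the Python interpreter.
t0 = time.perf_counter()
s_loop = 0.0
for i in range(len(x)):
    s_loop += x[i] * w[i]
t_loop = time.perf_counter() - t0

# Vectorized: one call that dispatches to optimized (often BLAS) code.
t0 = time.perf_counter()
s_vec = float(x @ w)
t_vec = time.perf_counter() - t0

print(f"loop: {t_loop:.3f}s, vectorized: {t_vec:.4f}s")
```

The vectorized form is typically orders of magnitude faster, which is why libraries such as NumPy, cuDNN, and Intel MKL matter so much for ML workloads.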
Computational graph - example
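The idea of a computational graph can be sketched in a few lines: each node stores its value, its parents, and the local derivatives needed for backpropagation. The `Node`, `add`, and `mul` names below are hypothetical, not any framework's API:

```python
# Toy reverse-mode autodiff over an explicit computational graph.
class Node:
    def __init__(self, value, parents=(), grad_fns=()):
        self.value = value          # forward-pass result
        self.parents = parents      # upstream nodes
        self.grad_fns = grad_fns    # local derivative w.r.t. each parent
        self.grad = 0.0

def add(a, b):
    return Node(a.value + b.value, (a, b), (lambda g: g, lambda g: g))

def mul(a, b):
    return Node(a.value * b.value, (a, b),
                (lambda g: g * b.value, lambda g: g * a.value))

def backward(out):
    """Accumulate gradients in reverse topological order."""
    topo, seen = [], set()
    def build(n):
        if id(n) not in seen:
            seen.add(id(n))
            for p in n.parents:
                build(p)
            topo.append(n)
    build(out)
    out.grad = 1.0
    for node in reversed(topo):
        for parent, gf in zip(node.parents, node.grad_fns):
            parent.grad += gf(node.grad)

x, y = Node(2.0), Node(3.0)
z = mul(add(x, y), y)    # z = (x + y) * y
backward(z)              # dz/dx = y, dz/dy = x + 2y
```

Frameworks like TensorFlow build the same kind of graph for you, which is also what makes the profiling shown on the next slide possible: each node is a measurable unit of work.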
Profiling - example
Recap
why HPC is crucial for ML
dataset, model sizes, hardware used for training
convolutions and their implementations
benchmarking, profiling of ML code
numerical formats used in ML
distributed training algorithms
Schedule
numerical formats used in ML - recap
distributed training algorithms - recap
4.04 - Włodzimierz Kaoka (VoiceLab.ai) - practical deployment of RNNs for acoustic model inference
4.04 - graph representations?
11.04 - TensorFlow hands-on
18.04 - midterm test (45 min), TF hands-on
25.04 - lab starts
Hardware trends
Figure: TensorCores [9]
Hardware trends
Figure: Tensor Processing Unit architecture [10]
Multinode training
Figure: SGD iterations [1]
Multinode training - data parallelism
Figure: General architecture of data parallel multinode training [1]
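The data-parallel scheme in the figure can be simulated in a few lines: each worker computes a gradient on its own data shard, and the gradients are averaged (in practice via a parameter server or an all-reduce). A toy sketch with made-up data and hyperparameters:

```python
import numpy as np

# Simulated synchronous data-parallel SGD on a least-squares problem.
rng = np.random.default_rng(1)
X = rng.normal(size=(400, 4))
true_w = rng.normal(size=4)
y = X @ true_w

def shard_grad(w, xs, ys):
    """Gradient computed locally by one worker on its shard."""
    return 2.0 / len(xs) * xs.T @ (xs @ w - ys)

n_workers = 4
shards = list(zip(np.array_split(X, n_workers),
                  np.array_split(y, n_workers)))

w = np.zeros(4)
for step in range(300):
    grads = [shard_grad(w, xs, ys) for xs, ys in shards]  # parallel in practice
    w -= 0.05 * np.mean(grads, axis=0)                    # averaging / all-reduce
```

With equal shards, the averaged gradient equals the full-batch gradient, so all workers stay in lockstep; the cost is a synchronization point every step.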
Multinode training - model parallelism
Figure: General architecture of model parallel multinode training [2]
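Model parallelism can likewise be sketched: the layers are partitioned across devices, and it is the activations, not the full parameter set, that cross the device boundary. A toy illustration in which the "devices" are just separate parameter dictionaries:

```python
import numpy as np

# Model parallelism sketch: each layer's weights live on a different device.
rng = np.random.default_rng(2)

device0 = {"W1": rng.normal(size=(8, 16))}   # first layer on device 0
device1 = {"W2": rng.normal(size=(16, 4))}   # second layer on device 1

def forward(x):
    h = np.maximum(x @ device0["W1"], 0.0)   # ReLU, computed on device 0
    # --- activation h crosses the interconnect to device 1 here ---
    return h @ device1["W2"]                 # computed on device 1

out = forward(rng.normal(size=(32, 8)))      # batch of 32 inputs
```

The communication volume is the activation size, which is why (as the next slide notes) this split pays off for layers with many parameters and small representations.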
Multinode training - mixed model/data
Convolutional layers cumulatively contain about 90-95% of the computation, about 5% of the parameters, and have large representations. [3]
Fully-connected layers contain about 5-10% of the computation, about 95% of the parameters, and have small representations. [3]
Figure: Mixed data/model parallel multinode training [3]
Multinode training - search parallelism
Figure: US Department of Energy Deep Learning objectives [11]
"DNNs in general do not have good strong scaling behavior, so to fully exploit large-scale parallelism they rely on a combination of model, data and search parallelism." [4]
Multinode training - model averaging
Figure: Parallel training of DNNs with natural gradient and parameter averaging [5]
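The parameter-averaging scheme of [5] can be approximated as follows: each worker runs local SGD on its own shard for a fixed number of steps, after which the workers' weights are averaged and redistributed. A toy sketch (plain gradient steps rather than natural gradient; data and hyperparameters are made up):

```python
import numpy as np

# Parameter averaging: independent local training, periodic model averaging.
rng = np.random.default_rng(3)
X = rng.normal(size=(400, 4))
true_w = rng.normal(size=4)
y = X @ true_w

def grad(w, xs, ys):
    return 2.0 / len(xs) * xs.T @ (xs @ w - ys)

n_workers, avg_every, lr = 4, 10, 0.05
shards = list(zip(np.array_split(X, n_workers),
                  np.array_split(y, n_workers)))
workers = [np.zeros(4) for _ in range(n_workers)]

for round_ in range(30):
    for i, (xs, ys) in enumerate(shards):
        for _ in range(avg_every):                 # independent local steps
            workers[i] = workers[i] - lr * grad(workers[i], xs, ys)
    avg = np.mean(workers, axis=0)                 # synchronization point
    workers = [avg.copy() for _ in range(n_workers)]
```

Compared with per-step gradient averaging, workers here synchronize only once every `avg_every` steps, trading communication for some divergence between local models; how often to average is exactly the question of the next slide.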
Multinode training - model averaging frequency?
Figure: Experiments with model averaging frequency [6]
Multinode training - model averaging frequency?
Figure: Using experience from our HPCS class to optimize a popular ML framework [6]
Federated Learning
Figure: Federated Learning architecture [7,8]
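The Federated Averaging algorithm of [7] builds on the same averaging idea, but the data stays on the clients and only a sampled subset of them participates in each round. A toy sketch with made-up client datasets and hyperparameters:

```python
import numpy as np

# FedAvg sketch: server broadcasts the model, sampled clients train locally
# on private data, server takes a dataset-size-weighted average.
rng = np.random.default_rng(4)
true_w = rng.normal(size=4)

# Each client holds a private, differently sized dataset.
clients = []
for n in (50, 80, 120, 200, 30):
    X = rng.normal(size=(n, 4))
    clients.append((X, X @ true_w))

def local_train(w, X, y, steps=20, lr=0.05):
    """Local SGD on one client; the raw data never leaves this function."""
    for _ in range(steps):
        w = w - lr * 2.0 / len(X) * X.T @ (X @ w - y)
    return w

global_w = np.zeros(4)
for round_ in range(30):
    picked = rng.choice(len(clients), size=3, replace=False)  # partial participation
    updates, sizes = [], []
    for i in picked:
        X, y = clients[i]
        updates.append(local_train(global_w.copy(), X, y))
        sizes.append(len(X))
    weights = np.array(sizes) / sum(sizes)                    # weight by data size
    global_w = sum(wgt * upd for wgt, upd in zip(weights, updates))
```

Only model parameters travel between server and clients, which is the privacy and bandwidth argument behind the architecture in the figure.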
References
1 https://github.com/tensorflow/models/tree/master/research/inception
2 Dean, J., Corrado, G., Monga, R., Chen, K., Devin, M., Mao, M., Senior, A., Tucker, P., Yang, K., Le, Q.V., et al., 2012. Large scale distributed deep networks, in: Advances in Neural Information Processing Systems. pp. 1223–1231.
3 Krizhevsky, A., 2014. One weird trick for parallelizing convolutional neural networks. arXiv preprint arXiv:1404.5997.
4 Stevens, R., 2017. Deep Learning in Cancer and Infectious Disease: Novel Driver Problems for Future HPC Architecture. ACM Press, pp. 65–65. https://doi.org/10.1145/3078597.3091526
5 Povey, D., Zhang, X., Khudanpur, S., 2014. Parallel training of deep neural networks with natural gradient andparameter averaging. CoRR, vol. abs/1410.7455.
6 Rosciszewski, P., Kaliski, J., 2017. Minimizing Distribution and Data Loading Overheads in Parallel Training of DNN Acoustic Models with Frequent Parameter Averaging. IEEE, pp. 560–565. https://doi.org/10.1109/HPCS.2017.89
7 McMahan, H.B., Moore, E., Ramage, D., Hampson, S., 2016. Communication-efficient learning of deep networks fromdecentralized data. arXiv preprint arXiv:1602.05629.
8 https://research.googleblog.com/2017/04/federated-learning-collaborative.html
9 https://devblogs.nvidia.com/programming-tensor-cores-cuda-9/
10 https://cloud.google.com/blog/big-data/2017/05/an-in-depth-look-at-googles-first-tensor-processing-unit-tpu
11 https://www.slideshare.net/insideHPC/a-vision-for-exascale-simulation-and-deep-learning