Intro to the Distributed Version of TensorFlow

Transcript
Page 1: Intro to the Distributed Version of TensorFlow

Dr. Miha Pelko, NorCom*

@mpelko

*views are my own

Page 2: Intro to the Distributed Version of TensorFlow

WE ARE HIRING!

Page 3: Intro to the Distributed Version of TensorFlow
Page 4: Intro to the Distributed Version of TensorFlow
Page 5: Intro to the Distributed Version of TensorFlow
Page 6: Intro to the Distributed Version of TensorFlow

Configuration at Yahoo!:

“We avoid unnecessary data movement between Hadoop clusters and separate deep learning clusters.”

“YARN works well for deep learning. Multiple experiments of deep learning can be conducted concurrently on a single cluster. It makes deep learning extremely cost effective as opposed to conventional methods. In the past, we had teams use “notepad” to schedule GPU resources manually, which was painful and worked only for a small number of users.”

From: http://yahoohadoop.tumblr.com/post/129872361846/large-scale-distributed-deep-learning-on-hadoop

Page 7: Intro to the Distributed Version of TensorFlow

See: https://www.tensorflow.org/versions/r0.9/how_tos/image_retraining/index.html

Inception-v3

RETRAIN INSTEAD OF DISTRIBUTE

Page 8: Intro to the Distributed Version of TensorFlow

ERLKÖNIG RECOGNITION
(Erlkönig: German industry slang for a camouflaged pre-production test car)

Page 9: Intro to the Distributed Version of TensorFlow

3 SPECIFIC CATEGORIES

erlkönig

car

road

Cut the last layer and train a new one
~30 minutes on a desktop CPU
> 90% accuracy
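A minimal sketch of the "cut the last layer" idea, assuming the 2048-dimensional Inception-v3 bottleneck activations have already been extracted as fixed features; the next_batch helper is hypothetical, and this is not the tutorial's actual retrain.py:

```python
# Treat Inception-v3 bottleneck activations as fixed features and train
# only a new softmax layer on top of them.
import tensorflow as tf

BOTTLENECK_SIZE = 2048   # width of Inception-v3's penultimate layer
NUM_CLASSES = 3          # erlkoenig / car / road

x = tf.placeholder(tf.float32, [None, BOTTLENECK_SIZE])
y_ = tf.placeholder(tf.float32, [None, NUM_CLASSES])

# The only trainable part: one fully connected softmax layer.
W = tf.Variable(tf.truncated_normal([BOTTLENECK_SIZE, NUM_CLASSES],
                                    stddev=0.001))
b = tf.Variable(tf.zeros([NUM_CLASSES]))
logits = tf.matmul(x, W) + b

cross_entropy = tf.reduce_mean(
    tf.nn.softmax_cross_entropy_with_logits(logits=logits, labels=y_))
train_step = tf.train.GradientDescentOptimizer(0.01).minimize(cross_entropy)

with tf.Session() as sess:
    sess.run(tf.initialize_all_variables())  # r0.9-era initializer
    for _ in range(4000):
        # next_batch is a hypothetical helper feeding precomputed bottlenecks
        batch_x, batch_y = next_batch()
        sess.run(train_step, feed_dict={x: batch_x, y_: batch_y})
```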

Page 10: Intro to the Distributed Version of TensorFlow
Page 11: Intro to the Distributed Version of TensorFlow
Page 12: Intro to the Distributed Version of TensorFlow
Page 13: Intro to the Distributed Version of TensorFlow

Tasks / Jobs

One server per task!
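In code, this maps onto tf.train.ClusterSpec (which jobs exist and which tasks they contain) and tf.train.Server (one per task). A minimal sketch with hypothetical hostnames:

```python
# A cluster is a set of named jobs ("ps", "worker"); each job is a list of
# tasks, and every task is backed by exactly one tf.train.Server process.
import tensorflow as tf

cluster = tf.train.ClusterSpec({
    "ps":     ["ps0.example.com:2222"],          # hypothetical hosts
    "worker": ["worker0.example.com:2222",
               "worker1.example.com:2222"],
})

# Each process starts one server for its own (job_name, task_index) pair;
# this one would be worker task 0.
server = tf.train.Server(cluster, job_name="worker", task_index=0)
server.join()  # block and serve; graph construction happens elsewhere
```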

Page 14: Intro to the Distributed Version of TensorFlow
Page 15: Intro to the Distributed Version of TensorFlow

§ Wrapper over a Coordinator, a Saver, and a SessionManager

§ Variable initialization

§ Checkpointing

§ Writes summaries to the log

§ Automatic initialization from the most recent checkpoint

§ is_chief flag for replicated-training setups
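The bullets above describe tf.train.Supervisor. A minimal single-process sketch with a toy loss standing in for a real model; managed_session is the post-r0.9 entry point (at r0.9 the equivalent call was prepare_or_wait_for_session):

```python
import tensorflow as tf

task_index = 0                       # this process's task id (assumption)
is_chief = (task_index == 0)         # only the chief initializes/checkpoints

global_step = tf.Variable(0, name="global_step", trainable=False)
loss = tf.reduce_mean(tf.square(tf.Variable(tf.ones([10]))))  # toy loss
train_op = tf.train.GradientDescentOptimizer(0.01).minimize(
    loss, global_step=global_step)

sv = tf.train.Supervisor(is_chief=is_chief,
                         logdir="/tmp/train_logs",  # checkpoints + summaries
                         global_step=global_step,
                         save_model_secs=600)

# managed_session initializes variables (or restores from the most recent
# checkpoint) and writes checkpoints/summaries in the background.
with sv.managed_session() as sess:   # pass server.target in a real cluster
    step = 0
    while not sv.should_stop() and step < 1000:
        _, step = sess.run([train_op, global_step])
```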

Page 16: Intro to the Distributed Version of TensorFlow
Page 17: Intro to the Distributed Version of TensorFlow

Source: http://download.tensorflow.org/paper/whitepaper2015.pdf

Page 18: Intro to the Distributed Version of TensorFlow

Performance of Distributed TensorFlow: A Multi-Node and Multi-GPU Configuration
See: http://www.altoros.com/performance-benchmark-distributed-tensorflow.html

Page 19: Intro to the Distributed Version of TensorFlow

§ Putting it all together (including deployment management)
§ See: https://www.tensorflow.org/versions/r0.9/how_tos/distributed/index.html
§ See: https://github.com/tensorflow/tensorflow/tree/master/tensorflow/tools/dist_test
§ See: https://github.com/bwahn/tensorflow-kr-docker

§ In-graph replication vs. between-graph replication (a sketch of the between-graph pattern follows this list)
§ See: https://www.tensorflow.org/versions/r0.9/how_tos/distributed/index.html#replicated-training

§ Specific hardware components
§ How to handle GPUs?
§ Other hardware?

§ Model-splitting parallelization
§ You’re on your own

§ See also: https://www.youtube.com/watch?v=YAkdydqUE2c
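As promised above, a hedged sketch of the between-graph replication pattern from the linked how-to; the hostnames, flag handling, and toy model are stand-ins, not the talk's actual code:

```python
import tensorflow as tf

flags = tf.app.flags
flags.DEFINE_string("job_name", "worker", "'ps' or 'worker'")
flags.DEFINE_integer("task_index", 0, "index of the task within its job")
FLAGS = flags.FLAGS

def main(_):
    cluster = tf.train.ClusterSpec({
        "ps":     ["ps0:2222"],                       # hypothetical hosts
        "worker": ["worker0:2222", "worker1:2222"],
    })
    server = tf.train.Server(cluster, job_name=FLAGS.job_name,
                             task_index=FLAGS.task_index)

    if FLAGS.job_name == "ps":
        server.join()                 # parameter servers just serve variables
    else:
        # Between-graph replication: every worker runs this same script and
        # builds its own copy of the graph; replica_device_setter pins the
        # variables to the ps job.
        with tf.device(tf.train.replica_device_setter(
                worker_device="/job:worker/task:%d" % FLAGS.task_index,
                cluster=cluster)):
            global_step = tf.Variable(0, name="global_step", trainable=False)
            x = tf.Variable(tf.ones([10]))            # toy stand-in model
            loss = tf.reduce_mean(tf.square(x))
            train_op = tf.train.GradientDescentOptimizer(0.01).minimize(
                loss, global_step=global_step)

        sv = tf.train.Supervisor(is_chief=(FLAGS.task_index == 0),
                                 logdir="/tmp/train_logs",
                                 global_step=global_step)
        with sv.managed_session(server.target) as sess:
            while not sv.should_stop():
                sess.run(train_op)

if __name__ == "__main__":
    tf.app.run()
```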

Page 20: Intro to the Distributed Version of TensorFlow

Thank you.