DEGREE PROJECT IN COMPUTER SCIENCE AND ENGINEERING, SECOND CYCLE, 30 CREDITS
STOCKHOLM, SWEDEN 2018

Managed Distributed TensorFlow with YARN
Enabling Large-Scale Machine Learning on Hadoop Clusters

TOBIAS JOHANSSON

KTH ROYAL INSTITUTE OF TECHNOLOGY
SCHOOL OF ELECTRICAL ENGINEERING AND COMPUTER SCIENCE



Managed Distributed TensorFlow with YARN
Enabling Large-Scale Machine Learning on Hadoop Clusters

Tobias Johansson

Master of Science Thesis

Software Engineering of Distributed Systems
School of Electrical Engineering and Computer Science

KTH Royal Institute of Technology

Stockholm, Sweden

31 January 2018

Examiner: Jim Dowling

TRITA EECS-EX-2018:39


© Tobias Johansson, 31 January 2018


Abstract

Apache Hadoop is the dominant open source platform for the storage and processing of Big Data. With the data stored in Hadoop clusters, it is advantageous to be able to run TensorFlow applications on the same cluster that holds the input data sets for training machine learning models. TensorFlow supports distributed executions where Deep Neural Networks can be trained utilizing a large number of compute nodes. Configuring and launching distributed TensorFlow applications manually is complex and impractical, and gets worse with more nodes.

This project presents a framework that utilizes Hadoop's resource manager YARN to manage distributed TensorFlow applications. The proposal is a native YARN application with one ApplicationMaster (AM) per job, utilizing the AM as a registry for discovery prior to job execution. Adapting TensorFlow code to the framework typically requires only a few lines of code.

In comparison to TensorFlowOnSpark, the user experience is very similar, and collected performance data indicates that there is an advantage to running TensorFlow directly on YARN with no extra layer in between.


Sammanfattning

Apache Hadoop is the leading open source platform for the storage and processing of Big Data. With data stored in Hadoop clusters, it is advantageous to be able to run TensorFlow applications on the same cluster that holds the input datasets for training machine learning models. TensorFlow supports distributed executions where deep neural networks can be trained using a large number of compute nodes. Configuring and launching distributed TensorFlow applications manually is complex and impractical, and gets worse with more nodes.

This project presents a framework that uses Hadoop's resource manager YARN to manage distributed TensorFlow applications. The proposal is a native YARN application with one ApplicationMaster (AM) per job, which uses the AM as a registry for discovery before the job is executed. Adapting TensorFlow code to the framework typically amounts to a few lines of code.

In comparison with TensorFlowOnSpark, the user experience is very similar, and collected performance data indicates that there is an advantage to running TensorFlow directly on YARN without an extra layer in between.


Contents

1 Introduction
  1.1 Problem description
  1.2 Problem context
  1.3 Goals
  1.4 Delimitations
  1.5 Risks, Consequences and Ethics
  1.6 Structure of this thesis

2 Background
  2.1 Apache Hadoop
    2.1.1 Architectural Overview
    2.1.2 HDFS
    2.1.3 YARN
  2.2 Hops Hadoop
  2.3 YARN Application Development
    2.3.1 Client
    2.3.2 ApplicationMaster
  2.4 Distributed TensorFlow
    2.4.1 TensorBoard
  2.5 Related work

3 Method
  3.1 Requirements Analysis
  3.2 Development
  3.3 Design Considerations
  3.4 Proposal
    3.4.1 High-Level View
    3.4.2 gRPC
    3.4.3 yarntf Environment Variables
    3.4.4 Client Arguments
    3.4.5 TensorBoard
    3.4.6 GPU
    3.4.7 RDMA
    3.4.8 File Distribution
    3.4.9 HopsWorks
    3.4.10 Packaging
    3.4.11 Fault Tolerance
    3.4.12 ApplicationMaster Notes

4 Analysis
  4.1 Performance
  4.2 Design Alternatives

5 Conclusions
  5.1 Conclusion
    5.1.1 Goals
  5.2 Future work

Bibliography

A yarntf-submit

B MNIST Performance

C MNIST Code


List of Figures

2.1 YARN components [1]
2.2 Distributed training [2]
3.1 Start a job
3.2 Activity in yarntf.createClusterServer()
4.1 MNIST 10k steps
4.2 MNIST 1–10k steps, linear regression α
4.3 MNIST 1–10k steps, linear regression β


List of Tables

2.1 ApplicationSubmissionContext fundamental data fields
2.2 ContainerLaunchContext fundamental data fields
3.1 Essential yarntf environment variables
3.2 Extra yarntf environment variables
B.1 MNIST perf, batch size 100, 1–5 workers
B.2 MNIST 1–10k steps, least squares trend line


List of Listings

2.1 ClusterSpec creation
2.2 Distributed TensorFlow execution
3.1 yarntf execution
3.2 Vanilla distributed TF initialization
3.3 yarntf initialization
3.4 Node registration information
3.5 Node registration information
3.6 GPU environment variables
A.1 yarntf-submit script
C.1 Hops-TensorFlow MNIST code


List of Acronyms and Abbreviations

AM ApplicationMaster

HDFS Hadoop Distributed File System

Hops Hadoop Open Platform-as-a-Service

NM NodeManager

PS Parameter Server

RM ResourceManager

RDMA Remote Direct Memory Access

TF TensorFlow

YARN Yet Another Resource Negotiator


Chapter 1

Introduction

Apache Hadoop [3] is the dominant platform for the storage and processing of Big Data. The Hadoop ecosystem's base components are a filesystem, the Hadoop Distributed File System (HDFS), and a resource manager, YARN. YARN manages the allocation of resources to applications, acting as a kind of operating system for the data center. Hops Hadoop [4] is a distribution of Hadoop developed at KTH that provides support for distributed metadata in both HDFS and YARN.

TensorFlow [5], released by Google as open source software in late 2015, has quickly become the dominant platform for Deep Learning. As of October 2016 (v0.11), TensorFlow supports reading datasets stored in HDFS. TensorFlow also supports distributed operation, where Deep Neural Networks can be trained using more than one server, each utilizing more than one CPU and GPU. Given that many large datasets used for Deep Learning reside, or will reside, on Hadoop, there would be huge advantages to processing the data with TensorFlow while it is in place in HDFS.

Is it possible to run distributed TensorFlow on YARN without a pre-existing framework in between? This thesis presents a solution for running distributed TensorFlow applications on YARN for Hops. The solution consists of a native YARN application that manages the containers for an application and bootstraps the container cluster with the help of a Python module developed alongside it. The Python module mainly helps with the distribution of connection information between the nodes.

1.1 Problem description

Currently, there is no support for running TensorFlow applications on Hadoop. Hadoop-native applications require support for scheduling using YARN, and YARN needs to know about the availability of GPUs for TensorFlow.


Apache Hadoop's YARN, as it stands, does not provide service discovery, a DNS service, allocation of ports, or allocation of GPUs. Support for the latter has been developed in parallel to this work, by [6], for Hops Hadoop.

Another issue regarding GPU performance for distributed TensorFlow applications is the latency bottleneck of moving data in and out of GPU device memory. This could be resolved using RDMA over InfiniBand for GPU-to-GPU communication.

Again, is it possible to run distributed TensorFlow on YARN without a pre-existing framework in between?

1.2 Problem context

Running a distributed TensorFlow application entails starting multiple instances of the application, giving each instance information about the cluster and its role. The cluster information contains the IP address and port of each node, for communication between the instances. Setting this up manually is time consuming and forces the user to have direct access to all machines that are going to run the application.
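What this manual setup amounts to can be sketched in a few lines of Python. The script name trainer.py and its flags are hypothetical placeholders, not part of any real tool; the point is that every instance must be handed the full cluster membership plus its own role and index:

```python
# Hypothetical sketch of manual deployment: one launch command per
# machine, each carrying the complete cluster membership ("ps_hosts"
# and "worker_hosts") plus that machine's own job name and task index.

ps_hosts = ["ps0.example.com:2222"]
worker_hosts = ["worker0.example.com:2222", "worker1.example.com:2222"]

def launch_commands(ps_hosts, worker_hosts, script="trainer.py"):
    cmds = []
    for job, hosts in (("ps", ps_hosts), ("worker", worker_hosts)):
        for idx, _ in enumerate(hosts):
            cmds.append(
                "python %s --ps_hosts=%s --worker_hosts=%s "
                "--job_name=%s --task_index=%d"
                % (script, ",".join(ps_hosts), ",".join(worker_hosts),
                   job, idx)
            )
    return cmds

# The user must run each command on the right host, by hand:
for cmd in launch_commands(ps_hosts, worker_hosts):
    print(cmd)
```

Every added or removed node changes every command, which is why this does not scale beyond a handful of machines.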

The datasets used for training Deep Neural Networks, the input for TensorFlow applications, can be very large, for example ImageNet [7]. Training can therefore be long-running; the computation can take several days or even weeks. In organizations where the input datasets are stored on Hadoop in HDFS, in-place processing would therefore be very beneficial.

Using CPUs for the high computational workload required by neural networks is not sufficient [8]. Given that GPUs are designed for a high degree of parallelism and high memory bandwidth, they outperform CPUs on neural network algorithms, where multiple individual "neurons" can be processed independently.

1.3 Goals

The goal of this project is to design and develop support for running native distributed TensorFlow applications in YARN. This requires changes to the programming model in TensorFlow to allow the programmer to specify requirements such as "start this application on 10 servers with 1 CPU, and 5 servers each with 4 GPUs". This also implies using the work of [6] for GPU support. The solution should additionally have the ability to support RDMA and be easily integrated with HopsWorks, the UI for Hops Hadoop.


1.4 Delimitations

To delimit the work, fault tolerance and failover will not be addressed. Furthermore, only the Python TensorFlow API will be considered.

1.5 Risks, Consequences and Ethics

If this implementation succeeds, it could enable running Deep Learning applications in general, and distributed TensorFlow applications in particular, on a cluster with much less manual configuration. This could disrupt workplaces and lessen the demand for server administration work. On the other hand, it would simplify the development of distributed Deep Learning applications. This could accelerate progress towards future implementations of "true" artificial intelligence, for which the consequences are unknown; this is an ethical issue that must be evaluated in parallel with the advance of Deep Learning.

Furthermore, distributed training on commodity hardware enables existing hardware to be utilized without buying a (single) more powerful replacement machine to speed up training. This is good from a sustainability point of view, since less old hardware will be discarded. Economically, running distributed training with a cluster resource manager like YARN can help save expenses, since it becomes easier to fully utilize hardware, in comparison to dedicating hardware to a single task.

1.6 Structure of this thesis

Chapter 1 describes the problem and its context. Chapter 2 provides the background necessary to understand the problem and the specific knowledge that the reader will need for the rest of this thesis. Chapter 3 then describes the goals, development, and solution proposed in this thesis project. The solution is analyzed and evaluated in Chapter 4. Finally, Chapter 5 offers some conclusions and suggests future work.


Chapter 2

Background

This chapter provides the background knowledge for this work. It goes through Apache Hadoop, which includes YARN, and its relationship and differences with Hops (Hadoop Open Platform-as-a-Service) in Sections 2.1 and 2.2. Details regarding native YARN application development are addressed in Section 2.3. We then give an overview of TensorFlow's API for distributed machine learning applications in Section 2.4. Finally, related work, including other proposed solutions, is presented in Section 2.5.

2.1 Apache Hadoop

The Apache Hadoop ecosystem emerged from three important papers released by Google between 2003 and 2006, presenting the Google File System (GFS) [9], MapReduce [10] and BigTable [11]. These works showed novel approaches for storing and processing very large data sets, i.e. Big Data. Hadoop's file system HDFS is an open source implementation of the fault-tolerant distributed file system GFS. MapReduce, later implemented for Hadoop, presented an easy-to-use programming model for processing large data sets in parallel, with the work automatically distributed over a large cluster of commodity machines. BigTable inspired Apache HBase [12], a distributed storage system for big data that uses HDFS to store its data, just as BigTable uses GFS. Together, this work paved the way for a new data processing paradigm of large-scale parallel computation on commodity hardware.

2.1.1 Architectural Overview

Hadoop's key components are HDFS (Hadoop Distributed File System), which handles storage, and YARN (Yet Another Resource Negotiator), whose services manage


computing resources and application scheduling [1]. These components have services that work together to perform work. This section gives an overview of these services' roles and how they cooperate, with YARN in greater detail.

2.1.2 HDFS

HDFS's services are the NameNode and the DataNode. The architecture follows a master/slave model where a single NameNode, a master server, maintains the filesystem's metadata: the filesystem namespace, access control information, the mapping from files to file chunks (blocks), and information about where the chunks are located.

The DataNodes, of which there are several in a Hadoop cluster, are the worker nodes that store the HDFS data blocks that make up files. On directives sent by the NameNode, a DataNode provides block storage on its local file system, fulfills read/write requests, creates/deletes and replicates data blocks, and periodically sends block reports and heartbeats to the NameNode. The block report lists all data blocks stored by a DataNode. The periodic heartbeats confirm that the DataNode is alive and are used for handling cluster membership.
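The heartbeat-based membership handling described above can be sketched as a toy Python model (this is illustrative pseudologic, not Hadoop code, and the timeout value is an arbitrary assumption): a node that has not sent a heartbeat within the timeout is no longer considered live.

```python
# Toy model of heartbeat-based cluster membership, in the spirit of
# the NameNode tracking DataNodes. Times are plain numbers here.

HEARTBEAT_TIMEOUT = 30  # seconds of silence before a node is considered dead

class MembershipTracker:
    def __init__(self):
        self.last_heartbeat = {}  # node id -> time of last heartbeat

    def heartbeat(self, node_id, now):
        self.last_heartbeat[node_id] = now

    def live_nodes(self, now):
        return [n for n, t in self.last_heartbeat.items()
                if now - t <= HEARTBEAT_TIMEOUT]

tracker = MembershipTracker()
tracker.heartbeat("dn1", now=0)
tracker.heartbeat("dn2", now=0)
tracker.heartbeat("dn1", now=25)   # dn1 keeps reporting; dn2 goes silent
print(tracker.live_nodes(now=40))  # ['dn1']
```

The real NameNode additionally uses block reports to know which blocks must be re-replicated when a DataNode is declared dead.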

To avoid a single point of failure (SPOF), a high-availability cluster setup has a Standby NameNode which takes over in case the active NameNode is lost. The same approach is used for YARN's ResourceManager.

2.1.3 YARN

YARN's services are the ResourceManager (RM), NodeManager (NM) and ApplicationMaster (AM). Its functionality is separated into two layers [13]:

• Platform layer, first-level scheduling: responsible for resource management. It includes usually multiple NodeManagers and a single per-cluster ResourceManager, running on a master node, that allocates cluster resources and schedules jobs on the worker nodes. Each worker node has a NodeManager, which spawns and manages containers scheduled by the ResourceManager.

• Framework layer, second-level scheduling: coordinates distributed application execution. It consists of an ApplicationMaster, one for each YARN application (or framework), which negotiates the resources desired for the application with the ResourceManager and orchestrates the distributed application execution.

A container is an isolated runtime environment on a single node with a given amount of physical resources such as RAM, CPU cores and GPUs. All YARN application execution runs inside containers, even the ApplicationMaster's. Once


started, the AM negotiates with the ResourceManager for more containers; the AM's container allocations and releases are done dynamically at run time. If there are no available containers, the request must wait and the application execution is halted until container allocation. It is not possible to pre-allocate the resources together with the ApplicationMaster submission, nor is gang scheduling supported. This can result in only a partial number of the containers being allocated, for an application that needs all its requested containers in order to execute.

Figure 2.1: YARN components [1]

The NodeManager monitors resource usage of individual containers and keeps track of worker node health for the ResourceManager. It is also responsible for managing a distributed cache (a file system cache for files used by containers), managing container logs, and auxiliary services that can be used by YARN applications.


An end Hadoop user can either interact directly with YARN by submitting an ApplicationMaster, or use a framework in between that runs the application, e.g. MapReduce [14] or Spark [15].

With the services described, the flow when a YARN application is submitted will now be briefly explained; see Figure 2.1 for an overview. The process starts with a Client using YARN's API to submit an application to the ResourceManager, with information about the ApplicationMaster container's physical resource requirements and a ContainerLaunchContext. The ContainerLaunchContext specifies the start-up command inside the container, which files should be available, and environment variables. After submission the Client can monitor the application, and the RM directs a NodeManager to start an AM instance on a worker node. Once the AM has started, it can request containers from the RM to run the actual computation. On each compute container allocation, the AM starts the container by providing the NM with a ContainerLaunchContext for that compute container.

More details regarding application submission and the application life cycle are given in Section 2.3.

2.2 Hops Hadoop

Hadoop Open Platform-as-a-Service (Hops) [4] is a Hadoop distribution that provides a UI, HopsWorks [16, 17], through which the end user runs jobs instead of via a command-line interface. HopsWorks provides user authentication and two new abstractions, projects and datasets, where users manage project membership. Datasets are also managed by users and can be shared between projects.

Hops also has its own implementations of HDFS (HopsFS) and YARN, which implement quota-based scheduling for CPU and memory [18].

HopsFS stores its metadata in an in-memory distributed database, MySQL NDB Cluster, which gives stateless NameNodes. With this approach, multiple NameNodes are supported and the architectural metadata bottleneck [19] in HDFS is overcome, which enables much larger cluster sizes and at least 16 times higher throughput [20].

Furthermore, Hops-YARN introduces a distributed ResourceManager [21]; as with HopsFS, its state is persisted in a MySQL NDB Cluster. This enables zero downtime on RM failover and better scalability. From an application development perspective, Hops-YARN is fully compatible with Apache Hadoop YARN.


2.3 YARN Application Development

This section gives the details needed to develop a native YARN application; it builds on Section 2.1.3. First the Client is explained more thoroughly, followed by more details regarding application submission and the application life cycle from a developer perspective.

2.3.1 Client

The Client is a regular Java program (with a runnable main method) that instantiates org.apache.hadoop.yarn.client.api.YarnClient; this object is initialized with an org.apache.hadoop.yarn.conf.YarnConfiguration object. The YarnConfiguration object depends on finding the following YARN configuration files in its classpath [1, 13]:

• yarn-default.xml

• yarn-site.xml

These files contain the YARN configuration parameters that are needed to communicate with the YARN cluster and the ResourceManager. This is usually resolved by running the Java program with the bin/yarn script [22]:

$HADOOP_HOME/bin/yarn <jar> [mainClass] args...

With the YarnClient object initialized, to submit an application the Client first must call yarnClient.createApplication() to notify the ResourceManager. The response contains an ApplicationSubmissionContext with a unique ApplicationId, and information regarding the Hadoop cluster state, e.g. the number of NodeManagers and their addresses, rack names, and the scheduler queue's current application count and physical resource capability. To submit, the Client responds to the RM by calling yarnClient.submitApplication(appContext), where appContext is the instance of ApplicationSubmissionContext. This object "represents all of the information needed by the ResourceManager to launch the ApplicationMaster for an application" [23]; its most fundamental data fields are:


Table 2.1: ApplicationSubmissionContext fundamental data fields

  Name                 Data type               Description
  amContainer          ContainerLaunchContext  Describes the AM container.
  applicationId        ApplicationId           Unique ID from RM.
  resource             Resource                Memory and CPU cores for AM.
  priority             Priority                Scheduler priority.
  queue                String                  Scheduler queue.
  nodeLabelExpression  String                  To select specific compute nodes
                                               for the application.

In turn, ContainerLaunchContext "represents all of the information needed by the NodeManager to launch a container" [24], and its most fundamental data fields are:

Table 2.2: ContainerLaunchContext fundamental data fields

  Name            Data type                  Description
  commands        List<String>               Command to execute in container.
  environment     Map<String,String>         Environment variables (Hadoop-
                                             specific classpaths needed).
  localResources  Map<String,LocalResource>  File resources; maps relative path
                                             in container to a file resource.

Here a LocalResource object points to a file in HDFS. When writing a Client, one file that always needs to be included as a LocalResource in the ContainerLaunchContext is the JAR file for the ApplicationMaster [25]. After the Client has submitted the application, it can monitor the state of the application by periodically calling yarnClient.getApplicationReport(appId); it can also force-kill the application. Direct output from the application is not available to the Client. To present more information from the execution, the AM can provide a tracking URL which is accessible in the application monitoring report.

2.3.2 ApplicationMaster

The role of an ApplicationMaster is to orchestrate the distributed execution of an application. On start-up, the AM needs to register itself with the RM to start heartbeating [25]. In this registration an optional tracking URL can be provided; the AM is then responsible for hosting the tracking server. If it is not the AM's first attempt, it could


have crashed and been restarted; the registration response can then contain references to application containers from previous attempts.

Application containers are requested from the RM, and on allocation the AM communicates with the NMs to launch the containers. The communication with the ResourceManager and NodeManagers can be done either synchronously or asynchronously. In the latter case, communication is done through the objects org.apache.hadoop.yarn.client.api.async.AMRMClientAsync and NMClientAsync respectively. On instantiation of a communication object, a callback handler class needs to be provided. These callback handler classes implement interfaces for handling events. For the RM, the most fundamental events are container allocation and completion, i.e. container execution has completed. For the NM, important events are container started, container stopped, and errors.

Container requests to the RM are made per container; gang scheduling is not supported [26]. The requests contain resources (memory/CPU cores), a priority level, and optionally specify particular compute nodes and racks. On allocation, the AM starts the container using NMClientAsync, by providing a reference to the container obtained in the allocation callback and a ContainerLaunchContext (see Table 2.2) that specifies the execution. When the work is done (the application is finished), the AM is expected to unregister itself.

2.4 Distributed TensorFlow

TensorFlow, presented in November 2015 [5], is an open-source software library for expressing and executing machine learning algorithms, including training and inference algorithms for deep neural network models. Execution can be distributed by the library and run on large-scale distributed systems with thousands of computation nodes, with support for utilizing GPUs.

The library emerged from the Google Brain project and builds on their earlier closed-source work DistBelief [2], which presented novel methods for large-scale distributed model training.

For distributed execution, TensorFlow has two types of nodes: parameter server (PS) and worker. A common training configuration is that multiple workers train the same model using mini-batches of the input dataset, with shared parameters hosted on the PS and updated by all workers [27]. See Figure 2.2 for an example showing distributed training with one parameter server and three workers.


Figure 2.2: Distributed training [2]

Now to deployment details. The cluster of nodes for an application is specified in a tf.train.ClusterSpec object prior to execution; this object needs to be created on each node. With hardcoded addresses, the ClusterSpec creation could look like this (the same on each node):

Listing 2.1: ClusterSpec creation

tf.train.ClusterSpec({
    "ps": [
        "ps0.example.com:2222"
    ],
    "worker": [
        "worker0.example.com:2222",
        "worker1.example.com:2222",
        "worker2.example.com:2222"
    ]
})

Furthermore, on each node a tf.train.Server object is to be created; the ClusterSpec is passed to the constructor together with the node's job name, i.e. PS or worker, and a task index, which creates a local server. Each worker node then either runs an execution session on its local server, or a session is executed remotely on the server.
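As a sketch, the cluster definition passed to tf.train.ClusterSpec is just a mapping from job names to address lists, from which each node can look up its own address by job name and task index. The helper below is illustrative (not part of TensorFlow); the tf calls are left as comments so the structure can be shown without TensorFlow installed:

```python
# The dict that would be passed to tf.train.ClusterSpec (TF 1.x API):
cluster_def = {
    "ps": ["ps0.example.com:2222"],
    "worker": [
        "worker0.example.com:2222",
        "worker1.example.com:2222",
        "worker2.example.com:2222",
    ],
}

def own_address(cluster_def, job_name, task_index):
    """Each node identifies itself by (job_name, task_index)
    within the shared cluster definition."""
    return cluster_def[job_name][task_index]

# On worker1 the local server would be created roughly as:
#   cluster = tf.train.ClusterSpec(cluster_def)
#   server = tf.train.Server(cluster, job_name="worker", task_index=1)
addr = own_address(cluster_def, "worker", 1)
# addr == "worker1.example.com:2222"
```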

A distributed execution with one PS and three workers could look like Listing 2.2. To start the distributed TensorFlow application, execution needs to be started on each of the nodes; as can be seen, the cluster information for creating the ClusterSpec is the same on each node, but the job name and task index differ. In practice this could mean that the developer needs to SSH to each of the four nodes to start each server.

Listing 2.2: Distributed TensorFlow execution

# On ps0.example.com:
$ python trainer.py \
    --ps_hosts=ps0.example.com:2222 \
    --worker_hosts=worker0.example.com:2222,\
worker1.example.com:2222,\
worker2.example.com:2222 \
    --job_name=ps --task_index=0
# On worker0.example.com:
$ python trainer.py \
    --ps_hosts=ps0.example.com:2222 \
    --worker_hosts=worker0.example.com:2222,\
worker1.example.com:2222,\
worker2.example.com:2222 \
    --job_name=worker --task_index=0
# On worker1.example.com:
$ python trainer.py \
    --ps_hosts=ps0.example.com:2222 \
    --worker_hosts=worker0.example.com:2222,\
worker1.example.com:2222,\
worker2.example.com:2222 \
    --job_name=worker --task_index=1
# On worker2.example.com:
$ python trainer.py \
    --ps_hosts=ps0.example.com:2222 \
    --worker_hosts=worker0.example.com:2222,\
worker1.example.com:2222,\
worker2.example.com:2222 \
    --job_name=worker --task_index=2

The complete programming model is not dealt with in this work, since the focus is on orchestrating the distributed execution; therefore, just a few more important notes. A distributed training process has a chief worker that coordinates the training and monitors the session, and from model checkpoints it can manage failures if another worker or parameter server is restarted [28]. Moreover, TensorFlow can read and write on HDFS with the help of a few environment variables being set [29]. This section ends with a few words regarding TensorBoard (TB).


2.4.1 TensorBoard

TensorBoard is a visualization toolkit for TensorFlow, used to visualize graphs. It is useful for inspecting and understanding TensorFlow application runs, where the execution can be followed. To use this toolkit, it is expected that the chief worker passes its logdir as an argument to the tf.summary.FileWriter constructor [30].

2.5 Related work

There are several proposals related to this topic on how to manage the orchestration of distributed TensorFlow applications. The most notable work is TensorFlowOnSpark (TFoS), which was presented and open-sourced by Yahoo in February 2017 [31]. TFoS runs on top of Spark [15], which already has support for running with YARN on a Hadoop cluster. To adapt TensorFlow code for TFoS, just a couple of lines of code need to be changed. It also presents the use of RDDs to feed the workers, and support for RDMA over InfiniBand.

Databricks proposed another Spark-based solution, TensorFrames [32], which lets you manipulate Spark's DataFrames with TensorFlow programs. According to [31] its big limitation is that it cannot support asynchronous distributed learning.

Other interesting proposals are TensorFlowOnYARN (TOY) [33] and the usage of service assemblies [34]. TOY uses remote TensorFlow nodes, managing a C++ TF server through JNI, where the YARN Client acquires the ClusterSpec and launches the application remotely.

The latter is experimental and uses upcoming Hadoop features that simplify bringing new services to YARN, using Docker, a new upcoming API layer, and DNS.

Lastly, we take a look outside Hadoop. For Kubernetes [35] (which is influenced by Borg [36]) there is a very simple solution [37] to the problem: by using Kubernetes's DNS add-on, the nodes' hostnames can be predicted prior to execution. With a Jinja2 template, the deployment specification can be created and piped directly to launch. For Mesos/Marathon the proposal in [38] is very similar.


Chapter 3

Method

The aim of the project was to extend the Hadoop/Hops YARN platform with a prototype TensorFlow YARN implementation to address the research goals, i.e. a framework that seamlessly manages the execution of distributed TensorFlow applications. In this chapter we present the analysis of the requirements to identify design needs in Section 3.1, the development process in Section 3.2, design considerations, including how previous work solved the problems, in Section 3.3, and our design proposal in Section 3.4.

3.1 Requirements Analysis

We wanted to enable YARN to run distributed TensorFlow applications natively. For this, a YARN application needed to be developed.

The two types of TensorFlow nodes, parameter server and worker, execute the same code but with different input arguments. Shared arguments build up the ClusterSpec, and the node-specific arguments specify the type of node and a task index.

Since the ClusterSpec is static and we cannot reserve ports for the containers that are launched for the TensorFlow application nodes, the application execution on the worker and parameter server nodes needs to be bootstrapped, or a random port must be selected.
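One common way to obtain such a port at container start-up (a general technique, not prescribed by the text) is to bind a socket to port 0 and let the OS assign a free port:

```python
import socket

def pick_free_port() -> int:
    """Bind to port 0 so the OS assigns an unused port, then release it.

    Note: there is a small race window between closing this socket and the
    TensorFlow server re-binding the port, which illustrates why static
    pre-selection of ports is error prone.
    """
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        s.bind(("", 0))  # port 0 => the OS picks a free port
        return s.getsockname()[1]

port = pick_free_port()
```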

Furthermore, since YARN was found not to support DNS or service discovery yet [39], we concluded the need to resolve discovery between an application's computation containers before executing the TensorFlow application. There could also be a need for those containers to rediscover the ApplicationMaster, if it were to restart.

Another issue we found to handle is that gang scheduling was not supported in YARN [26]; since applications are not started before all requested containers are allocated, this could induce deadlocks if there are only enough resources to allocate part of the containers for a distributed application, but not all of them. Even without a deadlock, resources will be locked and unused until the application has all its containers.

Finally, the remaining requirements:

• Distribution of binaries and application dependencies to all nodes.

• Read datasets from HDFS.

• Support GPU as a resource and custom protocols like RDMA.

• A seamless and dynamic execution model where the user can specify requirements such as "start this application on 10 servers with 1 CPU, and 5 servers each with 4 GPUs".

• Integrate the solution with HopsWorks.

3.2 Development

A native YARN application was implemented as the solution for this project. Because of the large amount of code needed for a basic YARN application, the code base was derived from the example application DistributedShell [40] in Hadoop's code base.

For testing and development, org.apache.hadoop.yarn.server.MiniYARNCluster was used to submit applications instead of a real Hadoop cluster. MiniYARNCluster provides an environment with a given number of simulated NodeManagers. To validate the execution of the ApplicationMaster and the TensorFlow containers it spawned, logs were collected and analyzed. These logs were either read programmatically from the NodeManagers during execution, or by aggregating the logs and using the $HADOOP_HOME/bin/yarn logs command [41, 42]. For the latter, MiniYARNCluster needs to be configured by setting yarn.log-aggregation-enable to true; to run the command from code we used org.apache.hadoop.yarn.client.cli.LogsCLI. In this manner, by analyzing logs, unit tests could be created.

3.3 Design Considerations

Studies of YARN's API for deployment and resource allocation, and of TensorFlow's model for distributed execution, led to the identification of the following design challenges:


• Discovery of computation containers.

• Discovery of the AM container.

• Communication between containers.

• Avoid port collision.

• Distribution of ClusterSpec.

• Handle container failure.

The main problems identified were discovery among the AM-spawned containers, to resolve the ClusterSpec, and discovery of the AM itself. After the TensorFlow application nodes are launched in containers, some type of discovery mechanism is needed in order to be able to generate a ClusterSpec or communicate with the AM. The most appealing solution was to use the simplified API layer approach Hortonworks proposed [34] for Hadoop, which is similar to Google's proposals for Kubernetes [37] and Mesos [38], and get a non-static ClusterSpec through DNS hostnames. These approaches were out of scope for the project because the needed implementations [39] were not released.

In the proposal of TensorFlowOnYARN [33] it was suggested that the ApplicationMaster selects random ports for all containers prior to execution, collects their IPs on allocation, and then provides a full cluster specification on container startup as part of the execution command. One issue with this approach is possible port collisions, since multiple containers can have the same IP; another is that the possible ways of handling failures are limited because of the static ClusterSpec. Neither is it possible for old containers to know where (on what node) an AM started after a crash [43].

TensorFlowOnSpark [31], which runs on top of Spark [15], creates a Spark executor for each TensorFlow node; each executor binds to a port and fetches its hostname, and this information is collected by the Spark driver, i.e. the ApplicationMaster. In the next step, the TensorFlow main function is passed for execution to each executor together with the cluster spec information, from which each node can create the tf.train.ClusterSpec and Server objects to run the application.

Another major design choice was whether to use one ApplicationMaster per job or to have a long-running service taking care of job submission requests. The latter approach was not found to be used in any other project.

For inter-process communication, i.e. communication between the nodes of the framework prototype and distribution of the information needed to build up the ClusterSpec at each computation node, Remote Procedure Call (RPC) and REpresentational State Transfer (REST) were considered.


3.4 Proposal

In this section the proposed solution is presented.

3.4.1 High-Level View

Now to the actual solution, from top to bottom. As mentioned, a native YARN application was implemented, i.e. Hops-TensorFlow, together with a Python package called yarntf. The Python package helps to bootstrap the execution by setting up a cluster specification, i.e. a tf.train.ClusterSpec object.

We will now look into how to execute a TensorFlow application on a Hadoop cluster with the solution, and then the architecture. The application containers depend on Python 2.7 being installed on the cluster nodes. Equivalent to the example in Listing 2.2, we can use the following command:

Listing 3.1: yarntf execution

$HOPSTF_HOME/bin/yarntf-submit \
    --pses 1 \
    --workers 3 \
    --main trainer.py

In comparison to a vanilla TensorFlow application, what we need to change in the code to conform to the proposed solution is to replace the usage of tf.train.ClusterSpec() and tf.train.Server() with yarntf.createClusterServer(), which returns a tuple of the same objects as the first two constructors. In practice, replace the following code:

Listing 3.2: Vanilla distributed TF initialization

def main():
    ps_hosts = FLAGS.ps_hosts.split(",")
    worker_hosts = FLAGS.worker_hosts.split(",")
    cluster = tf.train.ClusterSpec({
        "ps": ps_hosts,
        "worker": worker_hosts
    })
    server = tf.train.Server(
        cluster,
        job_name=FLAGS.job_name,
        task_index=FLAGS.task_index
    )


With:

Listing 3.3: yarntf initialization

def main():
    cluster, server = yarntf.createClusterServer()

Each TensorFlow job gets its own ApplicationMaster (AM); there is no long-running service needed on the Hadoop cluster. On submit, the AM spawns one container for each PS and worker. To submit and monitor an AM for a job, a Client was implemented. This Client is what the yarntf-submit script in Listing 3.1 uses. yarntf-submit is just a wrapper script for a call to $HADOOP_HOME/bin/yarn using the proposed framework's JAR file as an argument, see Appendix A. The flow when starting a job with 1 PS and 1 worker could look like Figure 3.1, where the Client uploads the AM JAR file and TensorFlow code to HDFS.

Figure 3.1: Start a job

With the TF nodes started and the TensorFlow code waiting to be executed, to generate the tf.train.ClusterSpec object on each node prior to TF execution, each node needs to know the other nodes' IP addresses and the ports used for TensorFlow. This is where the yarntf package comes in: when calling yarntf.createClusterServer(), each PS/worker registers its connection info with the AM through a remote procedure call. The AM has a ClusterSpec gRPC server, and the PS/worker nodes are clients. After registration they poll the cluster specification from the AM. The AM's cluster specification gRPC server only returns the registered nodes once all nodes have registered; until then it returns an empty list. Figure 3.2 shows the flow of activity when calling yarntf.createClusterServer().

Figure 3.2: Activity in yarntf.createClusterServer()
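The all-or-nothing reply of the cluster specification server can be sketched in plain Python. This is a simplified, single-process stand-in for the gRPC registry; the class and method names are illustrative, not yarntf's actual API:

```python
class ClusterSpecRegistry:
    """In-memory stand-in for the AM's cluster specification server.

    Nodes register their (job_name, task_index) -> "ip:port" mapping;
    get_cluster_spec() returns an empty dict until every expected node
    has registered, mirroring the AM's all-or-nothing reply.
    """

    def __init__(self, num_pses, num_workers):
        self.expected = num_pses + num_workers
        self.nodes = {}  # (job_name, task_index) -> "ip:port"

    def register(self, job_name, task_index, address):
        self.nodes[(job_name, task_index)] = address

    def get_cluster_spec(self):
        if len(self.nodes) < self.expected:
            return {}  # not everyone has registered yet
        spec = {}
        for (job, idx), addr in sorted(self.nodes.items()):
            spec.setdefault(job, []).append(addr)  # addresses in task-index order
        return spec

# Simulated registration of a 1-PS, 2-worker job:
registry = ClusterSpecRegistry(num_pses=1, num_workers=2)
registry.register("worker", 0, "10.0.0.2:41001")
assert registry.get_cluster_spec() == {}  # still waiting for the others
registry.register("worker", 1, "10.0.0.3:41002")
registry.register("ps", 0, "10.0.0.1:41000")
spec = registry.get_cluster_spec()
# spec == {"ps": ["10.0.0.1:41000"],
#          "worker": ["10.0.0.2:41001", "10.0.0.3:41002"]}
```

In the real implementation the nodes poll this server over gRPC until the reply is non-empty, then build their tf.train.ClusterSpec from it.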


3.4.2 gRPC

As stated, the ApplicationMaster has a gRPC [44] server for the cluster specification, and each PS/worker creates a client when calling yarntf.createClusterServer().

The registration message to the AM contains the following information, expressed in version 3 of the protocol buffers language (proto3):

Listing 3.4: Node registration information

message Container {
  string applicationId = 1;
  string ip = 2;
  int32 port = 3;
  string jobName = 4;
  int32 taskIndex = 5;
  int32 tbPort = 6;
}

The complete cluster specification that each node retrieves from the AM is an array of the same data type.

3.4.3 yarntf Environment Variables

For the nodes to get the address of the gRPC server (AM) and their own role (job name and task index), the containers get a set of environment variables that yarntf uses:

Table 3.1: Essential yarntf environment variables

Variable               Description
YARNTF_AM_ADDRESS      [ip:port] of the ApplicationMaster's cluster specification server.
YARNTF_APPLICATION_ID  ID from the RM.
YARNTF_JOB_NAME        "PS" or "worker".
YARNTF_TASK_INDEX      Task index (0 to n).
YARNTF_TENSORBOARD     "true" or undefined.
YARNTF_TB_DIR          For tf.summary.FileWriter.
YARNTF_PROTOCOL        Defined if an alternative protocol is to be used for TF internals (default: gRPC).

These environment variables are generated by the AM and set in the ContainerLaunchContext for each compute container upon allocation. More variables with the prefix YARNTF_ are set that could be useful for the TensorFlow application developer, namely:

Table 3.2: Extra yarntf environment variables

Variable          Description
YARNTF_MEMORY     Container's memory in MB.
YARNTF_VCORES     Container's virtual CPU cores.
YARNTF_GPUS       Container's GPU cores.
YARNTF_PSES       Number of PSes in the job.
YARNTF_WORKERS    Number of workers in the job.
YARNTF_HOME_DIR   HDFS home directory path for the AM.

It is also possible to give the TensorFlow job arbitrary environment variables by passing arguments to the Client.
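As a sketch, a TensorFlow program running under yarntf could read its role from these variables like this. The variable names come from Tables 3.1 and 3.2; the helper function and example values are illustrative:

```python
import os

def read_yarntf_role():
    """Read the node's role and cluster context from yarntf's environment.

    Returns (job_name, task_index, am_address); raises KeyError if the
    container was not launched by yarntf.
    """
    job_name = os.environ["YARNTF_JOB_NAME"]      # "PS" or "worker"
    task_index = int(os.environ["YARNTF_TASK_INDEX"])
    am_address = os.environ["YARNTF_AM_ADDRESS"]  # "ip:port" of the AM
    return job_name, task_index, am_address

# Example with the environment prepared as the AM would set it
# (illustrative values):
os.environ.update({
    "YARNTF_JOB_NAME": "worker",
    "YARNTF_TASK_INDEX": "2",
    "YARNTF_AM_ADDRESS": "10.0.0.5:50052",
})
job, idx, am = read_yarntf_role()
# job == "worker", idx == 2, am == "10.0.0.5:50052"
```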

3.4.4 Client Arguments

The end user never interacts with the ApplicationMaster directly, only through the Client. Therefore all arguments to the TensorFlow application go through the Client and are passed on by the ApplicationMaster to the compute containers' ContainerLaunchContext. Arguments to Client.main() or yarntf-submit are given in the form of an array of strings, where we can have options with values or flags (options without a value). Options are written with a -- prefix. The only always-required option to yarntf-submit is --main, which specifies the location of the TensorFlow application's main Python file. To give arguments to the TensorFlow application, --args is used; everything after this option until another option is recognized is considered the option's value. This does not hold for other options' values, which are terminated by whitespace. An example of how to launch distributed training can be seen in Listing 3.5, which specifies a scheduler queue, the number of compute nodes and their resources, followed by the TensorFlow application and its arguments. The images argument in this case is the relative path to the input dataset from the submitting Hadoop user's home directory, where the application utilizes the YARNTF_HOME_DIR environment variable.

Listing 3.5: Launching distributed training with yarntf-submit

$HOPSTF_HOME/bin/yarntf-submit \
    --queue default \
    --workers 3 \
    --pses 1 \
    --memory 1024 \
    --vcores 1 \
    --main $HOPSTF_HOME/yarntf/examples/mnist/mnist.py \
    --args \
    --images mnist/tfr/train \
    --format tfr \
    --mode train \
    --model mnist_model

To add dependencies to the application we use the option --files, where the value is a comma-separated list of .zip, .egg, and .py files. These dependencies will be added to PYTHONPATH. Other fundamental arguments are specifications of physical resources for the AM container, the path to the AM JAR, a node label expression, and an alternative Python path. There is also the possibility to increase the number of application attempts in case of AM failure, and to keep compute containers between attempts, but the implemented behavior of the AM is to discard any containers from previous attempts.

3.4.5 TensorBoard

With the Client flag --tensorboard, TensorBoard is enabled and started at the worker with task index 0 (by convention the "chief" worker). The TensorBoard instance's logdir is set to the value of the environment variable YARNTF_TB_DIR; this value is automatically set by the AM. Therefore it is expected [11] that the application's tf.summary.FileWriter takes the logdir, i.e. the value of YARNTF_TB_DIR, in its constructor.

To expose the address of TensorBoard, the AM provides an HTTP JSON endpoint. A call to the endpoint returns an array of strings on the format [ip address]:[port], representing all registered TensorBoards. The endpoint is available at [appTrackingUrl]/tensorboard, as specified when the AM registers itself with the ResourceManager.
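A client could consume this endpoint as sketched below. The response body is illustrative; only the [ip address]:[port] string format comes from the description above:

```python
import json

def parse_tensorboard_endpoints(body: str):
    """Parse the AM's /tensorboard response into (ip, port) tuples.

    The endpoint returns a JSON array of "[ip address]:[port]" strings.
    """
    addresses = json.loads(body)
    endpoints = []
    for addr in addresses:
        ip, port = addr.rsplit(":", 1)  # split on the last colon
        endpoints.append((ip, int(port)))
    return endpoints

# Hypothetical response for a job with one chief-worker TensorBoard:
body = '["10.0.0.7:6006"]'
endpoints = parse_tensorboard_endpoints(body)
# endpoints == [("10.0.0.7", 6006)]
```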

3.4.6 GPU

Apache Hadoop does not support GPU cores as a resource, but NVIDIA support has been added to Hops [6]. To support GPUs, they were added as a resource similar to memory and virtual cores (CPU) for the TensorFlow application containers. To not waste GPUs on the parameter servers, they always get zero.

An issue with YARN was detected while implementing GPU support: it is not possible to make multiple asynchronous requests for containers with different specifications that have the same priority. If this is done, requests are lost. To work around this issue, parameter servers were set to always be given priority 1 and workers priority 0.

The following environment variables are set on the application containers to enable GPU for TensorFlow:

Listing 3.6: GPU environment variables

CUDA_HOME=/usr/local/cuda
LD_LIBRARY_PATH=$LD_LIBRARY_PATH:$CUDA_HOME/lib64
PATH=$PATH:$CUDA_HOME/bin

The default value of CUDA_HOME is hardcoded, but it can be overwritten by defining it in the Client arguments using --env CUDA_HOME=[other path].

3.4.7 RDMA

To use protocols other than gRPC between the TensorFlow nodes, like RDMA, one uses the Client argument --protocol. The value of this option sets the YARNTF_PROTOCOL environment variable on the nodes, and if yarntf.createClusterServer() is used, the value sets the protocol parameter in the tf.train.Server constructor.

This work proposes for this implementation that RDMA nodes on the Hadoop cluster are marked with YARN node labels. The Client argument to define the node label expression is --node_label_expression.

3.4.8 File Distribution

The Client can add LocalResources to the AM, and the AM can add LocalResources to its containers, i.e. the TensorFlow PS/worker containers. To distribute files from the Client to the PS/worker containers through the AM, these files are copied to HDFS (unless they already are in HDFS), and their relative path (name), URI, size, and timestamp are added to a list, i.e. a data structure implemented by this work, io.hops.tensorflow.DistributedCacheList. The instance of this class is serialized, saved to HDFS, and added as a LocalResource for the AM; the AM then reads this list and adds the files as LocalResources to the compute containers.

3.4.9 HopsWorks

For the benefit of HopsWorks, the Client has added functionality to be used without Client arguments. To set options, getters and setters have instead been added that correspond to the arguments.

Also, to support Hops' extended resource model, i.e. GPU support, the code base uses Hops' forked Hadoop libraries instead of Apache Hadoop. Apart from the usage of GPU resources, everything can be ported to Apache Hadoop.

3.4.10 Packaging

The YARN application, including Client and ApplicationMaster, is packaged in a single JAR file. When the Client submits a job it uses its own JAR file if using yarntf; if using Client.main() or an instance of Client, another JAR file can theoretically be used. The Python package is made available on PyPI through "pip install".

3.4.11 Fault Tolerance

We do not implement any fault tolerance; this section just describes how faults are handled. By default, the number of attempts for the AM is 1, and if it were to try more attempts, it does not reuse containers from previous attempts. If any container fails, the ApplicationMaster finishes.

Since the cluster specification is static, it is proposed that the whole job is restarted by a higher-level framework, e.g. Hopsworks, and continued from a checkpoint.

3.4.12 ApplicationMaster Notes

The behavior of parameter servers is that they run forever; they do not stop when a TensorFlow application (all workers) is finished. Therefore the AM finishes the job when all worker containers have completed.

Moreover, the AM requests containers asynchronously and does not launch the containers until all are allocated. This could potentially cause a deadlock, where multiple AMs each have only a part of their requested containers allocated. To work around this, a timeout for this allocation is implemented that can be set by the Client; on timeout the application is stopped.


Chapter 4

Analysis

In this chapter we present how the proposed solution was evaluated. Results from performance tests are presented in Section 4.1, and an analysis of the design decisions taken in Section 4.2.

4.1 Performance

This section presents a performance evaluation utilizing CPU cores for computation. For GPUs, we only confirmed the expected vastly increased performance of a single worker's training and inference on a convolutional neural network (CNN) performing object recognition on the CIFAR-10 dataset [45].

To evaluate the performance of distributed TensorFlow, we did a quantitative data collection: we compared the rate of steps per second when training an MNIST [46] classification model between our proposed solution, i.e. yarntf, and TensorFlowOnSpark (TFoS) [31]. MNIST is a large database of handwritten digits where each image is 28 by 28 pixels. The model was a softmax model with one hidden layer, with all parameters (weights and biases) located on one parameter server (PS) visible to all worker nodes. We started from the same code base for both frameworks and kept them as similar as possible; the yarntf code is available in Appendix C.

The main comparison was done with the following setup:

• Hadoop environment: Hops [4] single-machine deployment on VirtualBox,where the virtual machine had 8 CPU cores and 64 GB RAM.

• ApplicationMaster (or Spark driver) resources: 1 CPU core, 1 GB RAM.

• TensorFlow nodes (PS/worker): 1 CPU core, 8 GB RAM.

• Input/output: Dataset stored in TFRecord format on HDFS.


The TFRecord format was chosen because it is the recommended format [47] when input data is stored on disk. This implies that for TFoS we did not use their mechanism to feed data through Spark RDDs [15]. Regarding the TF code, the only thing that differed was how the tf.train.ClusterSpec and tf.train.Server objects were created.

We tested the MNIST training with permutations of the following variables:

• yarntf and TensorFlowOnSpark

• 1 to 5 workers

• 1000, 2000, ..., 10000 training steps

The collected data, using a batch size of 100, can be found in Appendix B. The time for a job was measured between the YARN Client's submission and the completion of the job. For 10000 training steps, the result can be seen in Figure 4.1. Here we can see that the throughput was roughly equal using 1 and 2 workers, while yarntf's throughput was 9.1% higher on 3 workers, 9.4% higher on 4 workers, and 11% higher on 5 workers.

Figure 4.1: MNIST 10k steps (throughput [steps/s] vs. workers [nodes], TFoS and yarntf)

To get more insight from the collected data, we calculated estimated linear regression trend lines describing the relationship between the number of steps x and the time elapsed to job completion y, for a fixed number of workers. The 5 trend lines y = α + βx, for 1–5 workers, were estimated with the least squares method [48], and the results are presented in Figures 4.2 and 4.3, and in Appendix B. The correlation coefficients ρ were calculated to between 0.998–1.00, which indicates a very strong correlation and consistent data collected for this relation.
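The least squares estimates used here have the standard closed form; a minimal sketch follows (the data values are made up for illustration, not taken from Appendix B):

```python
def least_squares(xs, ys):
    """Fit y = alpha + beta * x by ordinary least squares,
    returning (alpha, beta, rho) where rho is Pearson's correlation."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    beta = sxy / sxx                    # time per step [s/step]
    alpha = my - beta * mx              # fixed job overhead [s]
    rho = sxy / (sxx * syy) ** 0.5      # correlation coefficient
    return alpha, beta, rho

# Illustrative data: steps vs. elapsed seconds for one worker count
steps = [1000, 2000, 3000, 4000, 5000]
elapsed = [95, 180, 262, 350, 430]  # hypothetical measurements
alpha, beta, rho = least_squares(steps, elapsed)
```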

Figure 4.2: MNIST 1–10k steps, linear regression α [s] vs. workers [nodes] (TFoS and yarntf)

Figure 4.3: MNIST 1–10k steps, linear regression β [s/step] vs. workers [nodes] (TFoS and yarntf)

From the trend lines we can interpret that TensorFlowOnSpark has a bigger overhead in general, as seen in Figure 4.2. Regarding the estimated performance for larger computations, when the initial overhead is negligible, Figure 4.3 is of interest: TensorFlowOnSpark seems to perform better with up to 2 worker nodes, and from 3 workers yarntf is better. For 1 worker TFoS gets a 9.6% better β value, and for 5 workers yarntf gets a 4.8% better value.

4.2 Design Alternatives

In this section a couple of design alternatives are discussed, mainly regarding the distribution of the cluster specification needed to create tf.train.ClusterSpec.

When not considering fault tolerance, the main problem was to distribute the information for the cluster specification to the TensorFlow nodes. We considered having the ApplicationMaster create a full cluster specification prior to launching the TF containers; the AM could retrieve the IPs of the containers, but not the ports, since a YARN container does not have its own network interface, and the AM cannot reserve ports on behalf of the containers. TOY [33] solved this by pre-selecting random ports for the TF nodes. We found that this approach was too error prone. This conclusion led us to take an approach similar to that of TFoS [31], i.e. reserve ports on the compute containers prior to execution, collect the cluster specification from each, and then distribute the information to all nodes.
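One common way to implement the port reservation step on each container is to bind a socket to port 0, letting the OS assign a free ephemeral port, and hold the socket open until the TensorFlow server is about to start. The sketch below illustrates the idea; it is not the exact yarntf implementation.

```python
import socket

def reserve_free_port():
    """Ask the OS for a free TCP port and keep the socket open to hold the
    reservation until the TensorFlow server is started on it."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    sock.bind(("", 0))  # port 0: the kernel assigns a free ephemeral port
    port = sock.getsockname()[1]
    return sock, port

sock, port = reserve_free_port()
# ...report (host, port) to the registry, then close the socket and
# start the tf.train.Server on the same port.
print(port)
sock.close()
```

Note that a small race window remains between closing the reserving socket and the TensorFlow server binding the same port; this is one reason port collisions are still possible, as discussed under future work.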

Launching all of the TF application's processes on the YARN Client and using remote servers, as TOY also did, was not considered a good choice either, since it introduces a single point of failure (SPOF) that is possible to avoid.

The TOY approach of having remote servers did, however, have a major advantage: it ensured that all Hadoop nodes had all dependencies needed for the TensorFlow application, since none were needed except on the Client. This leads to the big assumption yarntf makes: that Python is installed on all NodeManagers, and that all dependencies are either installed on each NM or attached on the Client's application submit.

With the above conclusions, there was a need for something that bootstraps the TF application by generating the tf.train.ClusterSpec out on each node. The problem to solve here was to enable discovery between the nodes. A centralized registry for this information was chosen over a decentralized approach, in which each node would have broadcast its information to all other nodes, because the latter appeared more complex, with each node acting as both client and server. Lastly, choosing gRPC instead of REST (HTTP + JSON) was motivated by gRPC being more lightweight, with shared data schemas instead of a need to parse JSON.
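The registration flow can be illustrated with plain Python data structures; the real system performs the equivalent calls over gRPC between the containers and the ApplicationMaster, and the class and method names below are illustrative only.

```python
import threading

class ClusterSpecRegistry:
    """Toy centralized registry: each TF node registers its (job, index, address),
    and retrieval blocks until all expected nodes have registered."""

    def __init__(self, expected):
        self.expected = expected  # e.g. {"ps": 1, "worker": 2}
        self.registered = {job: {} for job in expected}
        self.cond = threading.Condition()

    def register(self, job_name, task_index, address):
        with self.cond:
            self.registered[job_name][task_index] = address
            self.cond.notify_all()

    def get_cluster_spec(self):
        with self.cond:
            while any(len(self.registered[j]) < n for j, n in self.expected.items()):
                self.cond.wait()
            # Same dict shape that tf.train.ClusterSpec accepts.
            return {j: [self.registered[j][i] for i in sorted(self.registered[j])]
                    for j in self.expected}

registry = ClusterSpecRegistry({"ps": 1, "worker": 2})
registry.register("ps", 0, "host1:2222")
registry.register("worker", 0, "host2:2222")
registry.register("worker", 1, "host3:2222")
print(registry.get_cluster_spec())
```

Here get_cluster_spec() blocks until every expected node has registered, mirroring how each container waits for the full specification before creating its tf.train.ClusterSpec and tf.train.Server.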


Chapter 5

Conclusions

This chapter presents the conclusions drawn from this work. In Section 5.1 the proposed implementation is reviewed and the goals presented in Section 1.3 are evaluated. In Section 5.2, insights into improvements and complements to continue this work are discussed.

5.1 Conclusion

In this section we state the conclusions and insights gained as a result of this thesis project.

We have presented a framework, i.e. the Hops-TensorFlow [49] YARN application with an accompanying Python library, yarntf [50], for managing distributed TensorFlow with Apache Hadoop YARN in general, and with Hops-YARN in particular. The proposed system in Section 3.4 enables large-scale machine learning on Hadoop. In comparison to using standard distributed TensorFlow without a cluster manager, it is easier to use, with reduced complexity in configuring and launching applications. Running distributed TensorFlow on YARN without a pre-existing framework in between was successful.

The proposal is a native YARN application, with one ApplicationMaster submitted for each job (TensorFlow application instance), which orchestrates the distributed application to be launched in the containers it allocates. The modifications needed in the TF code are minor, typically a few lines of Python. The implemented client-side Python library, i.e. yarntf, is lightweight, with only 130 lines of code if we disregard the generated gRPC code. Each TensorFlow node registers itself at the ApplicationMaster, which acts as the server side for the gRPC calls, and then retrieves the information needed to create the tf.train.ClusterSpec and tf.train.Server objects which are returned by yarntf.


On job submit we can specify the desired number of workers and resources, including GPU resource requests on Hops. Moreover, it is possible to request a TensorBoard, which is exposed by the ApplicationMaster, and to add application dependencies on submit, which are distributed to the job's containers.

In comparison to TensorFlowOnSpark, which quickly gained traction after its release, the user experience and capabilities are very similar. Performance wise, our tests indicated that our proposal can have up to 11% better performance under certain circumstances, and that it seems to scale better when utilizing more than two workers.

5.1.1 Goals

The goals that were set for this work are considered to have been successfully achieved. The main goal was to develop a native Hadoop YARN framework that could launch distributed TensorFlow applications with specified physical resource requirements. This goal, and likewise GPU support, was met, as already mentioned.

Regarding RDMA, this was not tested; we only implemented support for specifying an arbitrary protocol for the tf.train.Server object, and a proposal for usage.

Finally, the goal to integrate with HopsWorks was achieved.

5.2 Future work

More extensive performance tests in comparison to TensorFlowOnSpark would be desirable. Testing large jobs with a large number of workers, on a big physical cluster, was left undone. Unfortunately, the data collected in the presented tests is not considered sufficient to explain the differences in scaling characteristics between the solutions.

The next obvious thing to work on for the proposed framework is fault tolerance and handling of failovers. Even though all workers except the "chief" are usually stateless, with a static ClusterSpec, solutions are restricted to using DNS to handle restarted nodes. With DNS support in YARN, utilizing it for failovers would be of interest. We should not forget that YARN containers do not have their own network interfaces, which implies that port collisions could occur; solving this, or using Docker, would be of interest. Another approach would be to make YARN restart a failed container on the same node, which would require modifications to YARN. Lastly, it is important to consider


that the proposed ApplicationMaster is stateful, and a strategy to recover its state is needed, since we need it to keep track of the job's containers.


Bibliography

[1] S. R. Alapati, Expert Hadoop Administration: Managing, Tuning, and Securing Spark, YARN, and HDFS, ser. Addison-Wesley Data and Analytics. Pearson Education, 2017. ISBN 978-0-13-459719-5

[2] J. Dean, G. S. Corrado, R. Monga, K. Chen, M. Devin, Q. V. Le, M. Z. Mao, M. Ranzato, A. Senior, P. Tucker, K. Yang, and A. Y. Ng, “Large Scale Distributed Deep Networks,” in Proceedings of the 25th International Conference on Neural Information Processing Systems - Volume 1, ser. NIPS’12. USA: Curran Associates Inc., 2012, pp. 1223–1231.

[3] Apache Hadoop. http://hadoop.apache.org/.

[4] Hops Hadoop. http://www.hops.io/.

[5] M. Abadi, A. Agarwal, P. Barham, E. Brevdo, Z. Chen, C. Citro, G. S. Corrado, A. Davis, J. Dean, M. Devin, S. Ghemawat, I. Goodfellow, A. Harp, G. Irving, M. Isard, Y. Jia, R. Jozefowicz, L. Kaiser, M. Kudlur, J. Levenberg, D. Mane, R. Monga, S. Moore, D. Murray, C. Olah, M. Schuster, J. Shlens, B. Steiner, I. Sutskever, K. Talwar, P. Tucker, V. Vanhoucke, V. Vasudevan, F. Viegas, O. Vinyals, P. Warden, M. Wattenberg, M. Wicke, Y. Yu, and X. Zheng, “TensorFlow: Large-Scale Machine Learning on Heterogeneous Systems,” 2015, software available from tensorflow.org. [Online]. Available: https://www.tensorflow.org/

[6] R. Andersson, “GPU integration for Deep Learning on YARN,” Master’s thesis, KTH, School of Information and Communication Technology (ICT), Stockholm, 2017, TRITA ICT-EX-2017:151.

[7] J. Deng, W. Dong, R. Socher, L. J. Li, K. Li, and L. Fei-Fei, “Imagenet: A large-scale hierarchical image database,” in 2009 IEEE Conference on Computer Vision and Pattern Recognition, June 2009. doi: 10.1109/CVPR.2009.5206848. ISSN 1063-6919 pp. 248–255.


[8] I. Goodfellow, Y. Bengio, and A. Courville, Deep Learning, ser. Adaptive Computation and Machine Learning. MIT Press, 2016. ISBN 978-0-262-03561-3

[9] S. Ghemawat, H. Gobioff, and S.-T. Leung, “The Google File System,” in Proceedings of the Nineteenth ACM Symposium on Operating Systems Principles, ser. SOSP ’03, 2003. ISBN 1-58113-757-5 pp. 29–43.

[10] J. Dean and S. Ghemawat, “MapReduce: Simplified Data Processing on Large Clusters,” in Proceedings of the 6th Symposium on Operating Systems Design & Implementation, ser. OSDI ’04. Berkeley, CA, USA: USENIX Association, 2004, pp. 137–150.

[11] F. Chang, J. Dean, S. Ghemawat, W. C. Hsieh, D. A. Wallach, M. Burrows, T. Chandra, A. Fikes, and R. E. Gruber, “Bigtable: A Distributed Storage System for Structured Data,” in Proceedings of the 7th USENIX Symposium on Operating Systems Design and Implementation, ser. OSDI ’06. Berkeley, CA, USA: USENIX Association, 2006, pp. 205–218.

[12] Apache HBase. http://hbase.apache.org/.

[13] A. C. Murthy, V. K. Vavilapalli, D. Eadline, J. Niemiec, and J. Markham, Apache Hadoop YARN: Moving Beyond MapReduce and Batch Processing with Apache Hadoop 2, ser. Addison-Wesley Data and Analytics. Pearson Education and Hortonworks, 2014. ISBN 978-0-321-93450-5

[14] V. K. Vavilapalli, A. C. Murthy, C. Douglas, S. Agarwal, M. Konar, R. Evans, T. Graves, J. Lowe, H. Shah, S. Seth, B. Saha, C. Curino, O. O’Malley, S. Radia, B. Reed, and E. Baldeschwieler, “Apache Hadoop YARN: Yet Another Resource Negotiator,” in Proceedings of the 4th Annual Symposium on Cloud Computing, ser. SOCC ’13, 2013. ISBN 978-1-4503-2428-1 pp. 5:1–5:16.

[15] M. Zaharia, M. Chowdhury, T. Das, A. Dave, J. Ma, M. McCauly, M. J. Franklin, S. Shenker, and I. Stoica, “Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing,” in Presented as part of the 9th USENIX Symposium on Networked Systems Design and Implementation (NSDI 12). San Jose, CA: USENIX, 2012. ISBN 978-931971-92-8 pp. 15–28.

[16] A. More and E. Gebremeskel, “HopsWorks: A project-based access control model for Hadoop,” Bachelor’s thesis, KTH, School of Information and Communication Technology (ICT), Stockholm, 2015, TRITA ICT-EX-2015:70.


[17] M. Ismail, E. Gebremeskel, T. Kakantousis, G. Berthou, and J. Dowling, “Hopsworks: Improving User Experience and Development on Hadoop with Scalable, Strongly Consistent Metadata,” in 2017 IEEE 37th International Conference on Distributed Computing Systems (ICDCS), June 2017. doi: 10.1109/ICDCS.2017.41. ISSN 1063-6927 pp. 2525–2528.

[18] M. R. Hasan, “Quota based access-control for Hops: Improving cluster utilization with Hops-YARN,” Master’s thesis, KTH, School of Information and Communication Technology (ICT), Stockholm, 2016, TRITA ICT-EX-2016:109.

[19] J. Shafer, S. Rixner, and A. L. Cox, “The Hadoop distributed filesystem: Balancing portability and performance,” in 2010 IEEE International Symposium on Performance Analysis of Systems Software (ISPASS), March 2010. doi: 10.1109/ISPASS.2010.5452045 pp. 122–133.

[20] M. Ismail, S. Niazi, M. Ronstrom, S. Haridi, and J. Dowling, “Scaling HDFS to More Than 1 Million Operations Per Second with HopsFS,” in 2017 17th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (CCGRID), May 2017. doi: 10.1109/CCGRID.2017.117 pp. 683–688.

[21] S. Kunganesan, “Distributed Resource Management for YARN,” Master’s thesis, KTH, School of Information and Communication Technology (ICT), Stockholm, 2015, TRITA ICT-EX-2015:231.

[22] Apache Software Foundation, “Apache Hadoop 2.7.3 – YARN Commands,” January 17, 2017. [Online]. Available: https://hadoop.apache.org/docs/r2.7.3/hadoop-yarn/hadoop-yarn-site/YarnCommands.html

[23] ——, “ApplicationSubmissionContext (Apache Hadoop Main 2.7.3 API),” January 17, 2017. [Online]. Available: http://hadoop.apache.org/docs/r2.7.3/api/org/apache/hadoop/yarn/api/records/ApplicationSubmissionContext.html

[24] ——, “ContainerLaunchContext (Apache Hadoop Main 2.7.3 API),” January 17, 2017. [Online]. Available: http://hadoop.apache.org/docs/r2.7.3/api/org/apache/hadoop/yarn/api/records/ContainerLaunchContext.html

[25] ——, “Apache Hadoop 2.7.3 – Hadoop: Writing YARN Applications,” January 17, 2017. [Online]. Available: https://hadoop.apache.org/docs/r2.7.3/hadoop-yarn/hadoop-yarn-site/WritingYarnApplications.html


[26] ——, “[YARN-624] Support gang scheduling in the AM RM protocol - ASF JIRA,” March 17, 2017. [Online]. Available: https://issues.apache.org/jira/browse/YARN-624

[27] Google, “Distributed TensorFlow — TensorFlow,” March 17, 2017. [Online]. Available: https://www.tensorflow.org/deploy/distributed

[28] J. Dowling, “Distributed TensorFlow - O’Reilly Media,” January 17, 2018. [Online]. Available: https://www.oreilly.com/ideas/distributed-tensorflow

[29] Google, “How to run TensorFlow on Hadoop — TensorFlow,” March 17, 2017. [Online]. Available: https://www.tensorflow.org/deploy/hadoop

[30] ——, “tensorboard/README.md at master · tensorflow/tensorboard,” May 17, 2017. [Online]. Available: https://github.com/tensorflow/tensorboard/blob/master/README.md

[31] Yahoo, “Open Sourcing TensorFlowOnSpark: Distributed Deep... — Hadoop at Yahoo,” March 17, 2017. [Online]. Available: http://yahoohadoop.tumblr.com/post/157196317141/open-sourcing-tensorflowonspark-distributed-deep

[32] TensorFrames. https://github.com/databricks/tensorframes.

[33] Apache Software Foundation, “[YARN-6043] [HDL] Tensorflow on YARN - ASF JIRA,” March 17, 2017. [Online]. Available: https://issues.apache.org/jira/browse/YARN-6043

[34] W. Tan and V. K. Vavilapalli, “Data Lake 3.0 Part 3 - Distributed TensorFlow Assembly on Apache Hadoop YARN - Hortonworks,” March 17, 2017. [Online]. Available: https://hortonworks.com/blog/distributed-tensorflow-assembly-hadoop-yarn/

[35] Kubernetes. https://kubernetes.io/.

[36] A. Verma, L. Pedrosa, M. R. Korupolu, D. Oppenheimer, E. Tune, and J. Wilkes, “Large-scale cluster management at Google with Borg,” in Proceedings of the European Conference on Computer Systems (EuroSys), Bordeaux, France, 2015.

[37] Google, “ecosystem/kubernetes at master · tensorflow/ecosystem,” March 17, 2017. [Online]. Available: https://github.com/tensorflow/ecosystem/tree/master/kubernetes


[38] ——, “ecosystem/marathon at master · tensorflow/ecosystem,” March 17, 2017. [Online]. Available: https://github.com/tensorflow/ecosystem/tree/master/marathon

[39] Apache Software Foundation, “[YARN-5079] [Umbrella] Native YARN framework layer for services and beyond - ASF JIRA,” March 17, 2017. [Online]. Available: https://issues.apache.org/jira/browse/YARN-5079

[40] Hops, “hops/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-applications/hadoop-yarn-applications-distributedshell at develop · hopshadoop/hops,” January 17, 2017. [Online]. Available: https://github.com/hopshadoop/hops/tree/develop/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-applications/hadoop-yarn-applications-distributedshell

[41] MapR, “YARN Log Aggregation - MapR 5.0 Documentation - doc.mapr.com,” March 17, 2017. [Online]. Available: http://doc.mapr.com/display/MapR/YARN+Log+Aggregation

[42] V. K. Vavilapalli, “Simplifying user-logs management and access in YARN - Hortonworks,” March 17, 2017. [Online]. Available: https://hortonworks.com/blog/simplifying-user-logs-management-and-access-in-yarn/

[43] Apache Software Foundation, “[YARN-4758] Enable discovery of AMs by containers - ASF JIRA,” March 17, 2017. [Online]. Available: https://issues.apache.org/jira/browse/YARN-4758

[44] gRPC. https://grpc.io/.

[45] A. Krizhevsky. (2009) CIFAR-10 and CIFAR-100 datasets. https://www.cs.toronto.edu/~kriz/cifar.html.

[46] Y. LeCun, C. Cortes, and C. Burges. (1998) MNIST handwritten digit database. http://yann.lecun.com/exdb/mnist/.

[47] Google, “Importing Data — TensorFlow,” March 17, 2017. [Online]. Available: https://www.tensorflow.org/programmers_guide/datasets

[48] G. Blom, J. Enger, G. Englund, J. Grandell, and L. Holst, Sannolikhetsteori och statistikteori med tillämpningar. Lund, Sweden: Studentlitteratur, 2005. ISBN 978-91-44-02442-4

[49] Hops-TensorFlow. https://github.com/hopshadoop/hops-tensorflow.

[50] yarntf. https://pypi.python.org/pypi/yarntf.


Appendix A

yarntf-submit

Listing A.1: yarntf-submit script

#!/usr/bin/env bash

set -u

script=$(readlink -f "$0")
scriptdir=$(dirname "$script")
hopstfroot=$(dirname "$scriptdir")

jar="$hopstfroot"/target/hops-tensorflow-0.0.4-SNAPSHOT.jar
client=io.hops.tensorflow.Client

$HADOOP_HOME/bin/yarn jar $jar $client --amjar $jar "$@"


Appendix B

MNIST Performance

Table B.1: MNIST performance, batch size 100, 1–5 workers. Time to job completion [s] per number of steps.

Setup      1000  2000  3000  4000  5000  6000  7000  8000  9000  10000
TFoS 1      211   369   517   671   832   962  1140  1290  1450   1630
TFoS 2      157   233   317   411   495   578   664   751   841    943
TFoS 3      132   212   262   334   398   469   549   605   696    763
TFoS 4      142   180   245   295   371   410   493   554   596    654
TFoS 5      128   191   228   279   332   382   443   499   542    609
yarntf 1    191   369   530   708   879  1060  1250  1420  1570   1700
yarntf 2    118   217   297   378   503   565   681   773   865    955
yarntf 3    101   176   222   289   359   411   488   566   629    699
yarntf 4   91.3   160   197   251   324   389   438   492   530    597
yarntf 5   90.7   144   185   247   287   333   395   434   492    549


Table B.2: MNIST 1–10k steps, least squares trend line.

Setup      α [s]   β [s/step]   ρ
TFoS 1      48.6      0.156     1.00
TFoS 2      61.0      0.0869    1.00
TFoS 3      58.6      0.0697    0.999
TFoS 4      70.7      0.0588    0.998
TFoS 5      73.9      0.0526    0.999
yarntf 1    26.0      0.171     0.999
yarntf 2    21.3      0.0934    0.999
yarntf 3    30.0      0.0662    0.999
yarntf 4    39.1      0.0560    0.998
yarntf 5    39.3      0.0502    0.999


Appendix C

MNIST Code

Listing C.1: Hops-TensorFlow MNIST code1 ””” D i s t r i b u t e d MNIST on g r i d based on TensorFlow MNIST example

”””2

3 from f u t u r e i m p o r t a b s o l u t e i m p o r t4 from f u t u r e i m p o r t d i v i s i o n5 from f u t u r e i m p o r t p r i n t f u n c t i o n6

7 i m p o r t a r g p a r s e8 i m p o r t math9 i m p o r t os

10 i m p o r t t ime11 from d a t e t i m e i m p o r t d a t e t i m e12

13 i m p o r t t e n s o r f l o w as t f14

15 i m p o r t y a r n t f16

17 IMAGE PIXELS = 2818

19

20 d e f p r i n t l o g ( w o r k e r i d , a r g ) :21 p r i n t ( w o r k e r i d , end=” : ” )22 p r i n t ( a r g )23

24

25 d e f h d f s p a t h ( r e l a t i v e p a t h ) :26 r e t u r n os . e n v i r o n [ ”YARNTF HOME DIR” ] + ” / ” + r e l a t i v e p a t h27

28

29 d e f main ( a r g s ) :30 job name = os . e n v i r o n [ ”YARNTF JOB NAME” ]31 t a s k i n d e x = i n t ( os . e n v i r o n [ ”YARNTF TASK INDEX” ] )

45

Page 64: Managed Distributed TensorFlow with YARN1301487/FULLTEXT01.pdf · processing of Big Data. With the data stored in Hadoop clusters, it is advantageous to be able to run TensorFlow

46 APPENDIX C. MNIST CODE

32 num workers = i n t ( os . e n v i r o n [ ”YARNTF WORKERS” ] )33 num pses = i n t ( os . e n v i r o n [ ”YARNTF PSES” ] )34 w o r k e r i d = job name + s t r ( t a s k i n d e x )35

36 # P a r a m e t e r s37 h i d d e n u n i t s = 12838 b a t c h s i z e = 10039

40 # Get TF c l u s t e r and s e r v e r i n s t a n c e s41 c l u s t e r , s e r v e r = y a r n t f . c r e a t e C l u s t e r S e r v e r ( )42

43 d e f r e a d c s v e x a m p l e s ( i m a g e d i r , l a b e l d i r , b a t c h s i z e =100 ,num epochs=None , t a s k i n d e x =None , num workers=None ) :

44 p r i n t l o g ( w o r k e r i d , ” num epochs : {0} ” . f o r m a t ( num epochs ) )45 # Se tup queue o f csv image f i l e n a m e s46 t f r e c o r d p a t t e r n = os . p a t h . j o i n ( i m a g e d i r , ' p a r t −* ' )47 images = t f . g f i l e . Glob ( t f r e c o r d p a t t e r n )48 p r i n t l o g ( w o r k e r i d , ” images : {0} ” . f o r m a t ( images ) )49 image queue = t f . t r a i n . s t r i n g i n p u t p r o d u c e r ( images , s h u f f l e

= F a l s e , c a p a c i t y =1000 , num epochs=num epochs ,50 name=”

image queue ” )51

52 # Se tup queue o f csv l a b e l f i l e n a m e s53 t f r e c o r d p a t t e r n = os . p a t h . j o i n ( l a b e l d i r , ' p a r t −* ' )54 l a b e l s = t f . g f i l e . Glob ( t f r e c o r d p a t t e r n )55 p r i n t l o g ( w o r k e r i d , ” l a b e l s : {0} ” . f o r m a t ( l a b e l s ) )56 l a b e l q u e u e = t f . t r a i n . s t r i n g i n p u t p r o d u c e r ( l a b e l s , s h u f f l e

= F a l s e , c a p a c i t y =1000 , num epochs=num epochs ,57 name=”

l a b e l q u e u e ” )58

59 # Se tup r e a d e r f o r image queue60 i m g r e a d e r = t f . T e x t L i n e R e a d e r ( name=” i m g r e a d e r ” )61 , img csv = i m g r e a d e r . r e a d ( image queue )62 i m a g e d e f a u l t s = [ [ 1 . 0 ] f o r c o l i n r a n g e ( 7 8 4 ) ]63 img = t f . pack ( t f . d e c o d e c s v ( img csv , i m a g e d e f a u l t s ) )64 # Normal i ze v a l u e s t o [ 0 , 1 ]65 norm = t f . c o n s t a n t ( 2 5 5 , d t y p e = t f . f l o a t 3 2 , shape = ( 7 8 4 , ) )66 image = t f . d i v ( img , norm )67 p r i n t l o g ( w o r k e r i d , ” image : {0} ” . f o r m a t ( image ) )68

69 # Se tup r e a d e r f o r l a b e l queue70 l a b e l r e a d e r = t f . T e x t L i n e R e a d e r ( name=” l a b e l r e a d e r ” )71 , l a b e l c s v = l a b e l r e a d e r . r e a d ( l a b e l q u e u e )72 l a b e l d e f a u l t s = [ [ 1 . 0 ] f o r c o l i n r a n g e ( 1 0 ) ]73 l a b e l = t f . pack ( t f . d e c o d e c s v ( l a b e l c s v , l a b e l d e f a u l t s ) )74 p r i n t l o g ( w o r k e r i d , ” l a b e l : {0} ” . f o r m a t ( l a b e l ) )75

Page 65: Managed Distributed TensorFlow with YARN1301487/FULLTEXT01.pdf · processing of Big Data. With the data stored in Hadoop clusters, it is advantageous to be able to run TensorFlow

47

76 # R e t u r n a b a t c h o f examples77 r e t u r n t f . t r a i n . b a t c h ( [ image , l a b e l ] , b a t c h s i z e ,

n u m t h r e a d s = a r g s . r e a d e r s , name=” b a t c h c s v ” )78

79 d e f r e a d t f r e x a m p l e s ( pa th , b a t c h s i z e =100 , num epochs=None ,t a s k i n d e x =None , num workers=None ) :

80 p r i n t l o g ( w o r k e r i d , ” num epochs : {0} ” . f o r m a t ( num epochs ) )81

82 # Se tup queue o f TFRecord f i l e n a m e s83 t f r e c o r d p a t t e r n = os . p a t h . j o i n ( pa th , ' p a r t −* ' )84 f i l e s = t f . g f i l e . Glob ( t f r e c o r d p a t t e r n )85 queue name = ” f i l e q u e u e ”86

87 # s p l i t i n p u t f i l e s a c r o s s workers , i f s p e c i f i e d88 i f t a s k i n d e x i s n o t None and num workers i s n o t None :89 n u m f i l e s = l e n ( f i l e s )90 f i l e s = f i l e s [ t a s k i n d e x : n u m f i l e s : num workers ]91 queue name = ” f i l e q u e u e {0} ” . f o r m a t ( t a s k i n d e x )92

93 p r i n t l o g ( w o r k e r i d , ” f i l e s : {0} ” . f o r m a t ( f i l e s ) )94 f i l e q u e u e = t f . t r a i n . s t r i n g i n p u t p r o d u c e r ( f i l e s , s h u f f l e =

F a l s e , c a p a c i t y =1000 , num epochs=num epochs ,95 name=queue name )96

97 # Se tup r e a d e r f o r examples98 r e a d e r = t f . TFRecordReader ( name=” r e a d e r ” )99 , s e r i a l i z e d = r e a d e r . r e a d ( f i l e q u e u e )

100 f e a t u r e d e f = { ' l a b e l ' : t f . F i x e d L e n F e a t u r e ( [ 1 0 ] , t f . i n t 6 4 ) ,' image ' : t f . F i x e d L e n F e a t u r e ( [ 7 8 4 ] , t f . i n t 6 4 ) }

101 f e a t u r e s = t f . p a r s e s i n g l e e x a m p l e ( s e r i a l i z e d , f e a t u r e d e f )102 norm = t f . c o n s t a n t ( 2 5 5 , d t y p e = t f . f l o a t 3 2 , shape = ( 7 8 4 , ) )103 image = t f . d i v ( t f . t o f l o a t ( f e a t u r e s [ ' image ' ] ) , norm )104 p r i n t l o g ( w o r k e r i d , ” image : {0} ” . f o r m a t ( image ) )105 l a b e l = t f . t o f l o a t ( f e a t u r e s [ ' l a b e l ' ] )106 p r i n t l o g ( w o r k e r i d , ” l a b e l : {0} ” . f o r m a t ( l a b e l ) )107

108 # R e t u r n a b a t c h o f examples109 r e t u r n t f . t r a i n . b a t c h ( [ image , l a b e l ] , b a t c h s i z e ,

n u m t h r e a d s = a r g s . r e a d e r s , name=” b a t c h ” )110

111 i f job name == ” ps ” :112 s e r v e r . j o i n ( )113 e l i f job name == ” worker ” :114 # A s s i g n s ops t o t h e l o c a l worker by d e f a u l t .115 wi th t f . d e v i c e ( t f . t r a i n . r e p l i c a d e v i c e s e t t e r (116 w o r k e r d e v i c e =” / j o b : worker / t a s k :%d ” % t a s k i n d e x ,117 c l u s t e r = c l u s t e r ) ) :118

119 # V a r i a b l e s o f t h e h id de n l a y e r

Page 66: Managed Distributed TensorFlow with YARN1301487/FULLTEXT01.pdf · processing of Big Data. With the data stored in Hadoop clusters, it is advantageous to be able to run TensorFlow

48 APPENDIX C. MNIST CODE

120 hid w = t f . V a r i a b l e ( t f . t r u n c a t e d n o r m a l ( [ IMAGE PIXELS *IMAGE PIXELS , h i d d e n u n i t s ] ,

121 s t d d e v =1 .0 /IMAGE PIXELS ) , name=” hid w ” )

122 h i d b = t f . V a r i a b l e ( t f . z e r o s ( [ h i d d e n u n i t s ] ) , name=” h i d b ”)

123

124 # V a r i a b l e s o f t h e so f tmax l a y e r125 sm w = t f . V a r i a b l e ( t f . t r u n c a t e d n o r m a l ( [ h i d d e n u n i t s , 1 0 ] ,126 s t d d e v =1 .0 / math .

s q r t ( h i d d e n u n i t s ) ) , name=”sm w” )127 sm b = t f . V a r i a b l e ( t f . z e r o s ( [ 1 0 ] ) , name=” sm b ” )128

129 # P l a c e h o l d e r s o r QueueRunner / Reade r s f o r i n p u t d a t a130 num epochs = 1 i f a r g s . mode == ” i n f e r e n c e ” e l s e None i f

a r g s . epochs == 0 e l s e a r g s . epochs131 i n d e x = t a s k i n d e x i f a r g s . mode == ” i n f e r e n c e ” e l s e None132 worke r s = num workers i f a r g s . mode == ” i n f e r e n c e ” e l s e

None133

134 i f a r g s . f o r m a t == ” csv ” :135 images = h d f s p a t h ( a r g s . images )136 l a b e l s = h d f s p a t h ( a r g s . l a b e l s )137 x , y = r e a d c s v e x a m p l e s ( images , l a b e l s , 100 ,

num epochs , index , worke r s )138 e l i f a r g s . f o r m a t == ” t f r ” :139 images = h d f s p a t h ( a r g s . images )140 x , y = r e a d t f r e x a m p l e s ( images , 100 , num epochs , index

, worke r s )141 e l s e :142 r a i s e ( ” {0} f o r m a t n o t s u p p o r t e d f o r t f i n p u t mode” .

f o r m a t ( a r g s . f o r m a t ) )143

144 h i d l i n = t f . nn . x w p l u s b ( x , hid w , h i d b )145 h i d = t f . nn . r e l u ( h i d l i n )146

147 y = t f . nn . so f tmax ( t f . nn . x w p l u s b ( hid , sm w , sm b ) )148

149 g l o b a l s t e p = t f . V a r i a b l e ( 0 )150

151 l o s s = − t f . r educe sum ( y * t f . l o g ( t f . c l i p b y v a l u e ( y , 1e−10 , 1 . 0 ) ) )

152 t r a i n o p = t f . t r a i n . Adag radOp t imize r ( 0 . 0 1 ) . min imize (153 l o s s , g l o b a l s t e p = g l o b a l s t e p )154

155 # T e s t t r a i n e d model156 l a b e l = t f . argmax ( y , 1 , name=” l a b e l ” )157 p r e d i c t i o n = t f . argmax ( y , 1 , name=” p r e d i c t i o n ” )158 c o r r e c t p r e d i c t i o n = t f . e q u a l ( p r e d i c t i o n , l a b e l )

Page 67: Managed Distributed TensorFlow with YARN1301487/FULLTEXT01.pdf · processing of Big Data. With the data stored in Hadoop clusters, it is advantageous to be able to run TensorFlow

49

159 a c c u r a c y = t f . r educe mean ( t f . c a s t ( c o r r e c t p r e d i c t i o n , t f .f l o a t 3 2 ) , name=” a c c u r a c y ” )

160

161 s a v e r = t f . t r a i n . Save r ( )162 summary op = t f . summary . m e r g e a l l ( )163 i n i t o p = t f . g l o b a l v a r i a b l e s i n i t i a l i z e r ( )164

165 # C r e a t e a ” s u p e r v i s o r ” , which o v e r s e e s t h e t r a i n i n g p r o c e s sand s t o r e s model s t a t e i n t o HDFS

166 l o g d i r = h d f s p a t h ( a r g s . model )167 p r i n t ( ” t e n s o r f l o w model p a t h : {0} ” . f o r m a t ( l o g d i r ) )168 s u m m a r y w r i t e r = t f . summary . F i l e W r i t e r ( os . e n v i r o n [ ”

YARNTF TB DIR” ] , g raph = t f . g e t d e f a u l t g r a p h ( ) )169

if args.mode == "train":
    sv = tf.train.Supervisor(is_chief=(task_index == 0),
                             logdir=logdir,
                             init_op=init_op,
                             summary_op=summary_op,
                             saver=saver,
                             global_step=global_step,
                             summary_writer=summary_writer,
                             stop_grace_secs=300,
                             save_model_secs=10)
else:
    sv = tf.train.Supervisor(is_chief=(task_index == 0),
                             logdir=logdir,
                             saver=saver,
                             global_step=global_step,
                             stop_grace_secs=300,
                             save_model_secs=0)
    output_dir = hdfs_path(args.output)
    output_file = tf.gfile.Open("{0}/part-{1:05d}".format(output_dir,
                                task_index), mode='w')
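The per-worker output path above relies on zero-padded string formatting, so every worker writes a distinct `part-NNNNN` file in the shared output directory. A small sketch of just the naming scheme (the directory name below is hypothetical):

```python
# Each worker derives its own output file name from its task index,
# zero-padded to five digits as in the listing above.
output_dir = "hdfs://namenode/user/demo/predictions"  # hypothetical path
for task_index in (0, 3, 12):
    print("{0}/part-{1:05d}".format(output_dir, task_index))
# e.g. the last line printed is .../predictions/part-00012
```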

# The supervisor takes care of session initialization, restoring from
# a checkpoint, and closing when done or an error occurs.
with sv.managed_session(server.target) as sess:
    print("{0} session ready".format(datetime.now().isoformat()))

    # Loop until the supervisor shuts down or 1000000 steps have completed.
    step = 0
    count = 0
    while not sv.should_stop() and step < args.steps:
        # Run a training step asynchronously.
        # See `tf.train.SyncReplicasOptimizer` for additional details
        # on how to perform *synchronous* training.

        # using QueueRunners/Readers
        if args.mode == "train":
            if step % 100 == 0:
                print("{0} step: {1} accuracy: {2}".format(
                    datetime.now().isoformat(), step, sess.run(accuracy)))
            _, summary, step = sess.run([train_op, summary_op, global_step])
            summary_writer.add_summary(summary, step)
        else:  # args.mode == "inference"
            labels, pred, acc = sess.run([label, prediction, accuracy])
            # print("label: {0}, pred: {1}".format(labels, pred))
            print("acc: {0}".format(acc))
            for i in range(len(labels)):
                count += 1
                output_file.write("{0} {1}\n".format(labels[i], pred[i]))
            print("count: {0}".format(count))

    if args.mode == "inference":
        output_file.close()
        # Delay chief worker from shutting down supervisor during
        # inference, since it can load model, start session, run
        # inference and request stop before the other workers even
        # start/sync their sessions.
        if task_index == 0:
            time.sleep(60)

# Ask for all the services to stop.
print("{0} stopping supervisor".format(datetime.now().isoformat()))
sv.stop()

if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument("-e", "--epochs", help="number of epochs",
                        type=int, default=0)
    parser.add_argument("-f", "--format", help="example format: (csv|pickle|tfr)",
                        choices=["csv", "pickle", "tfr"], default="tfr")
    parser.add_argument("-i", "--images",
                        help="HDFS path to MNIST images in parallelized format")
    parser.add_argument("-l", "--labels",
                        help="HDFS path to MNIST labels in parallelized format")
    parser.add_argument("-m", "--model",
                        help="HDFS path to save/load model during train/test",
                        default="mnist_model")
    parser.add_argument("-o", "--output",
                        help="HDFS path to save test/inference output",
                        default="predictions")
    parser.add_argument("-r", "--readers",
                        help="number of reader/enqueue threads",
                        type=int, default=1)
    parser.add_argument("-s", "--steps", help="maximum number of steps",
                        type=int, default=1000)
    parser.add_argument("-X", "--mode", help="train|inference",
                        default="train")
    args = parser.parse_args()
    print("args:", args)

    print("{0} ===== Start".format(datetime.now().isoformat()))
    main(args)
    print("{0} ===== Stop".format(datetime.now().isoformat()))
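The flag handling above can be exercised without a cluster by passing an explicit argument list to `parse_args` instead of relying on `sys.argv`. A small sketch using two of the flags from the listing (same names and defaults):

```python
import argparse

# Rebuild a subset of the parser from the listing and feed it an explicit
# argument list, to check flag parsing and defaults in isolation.
parser = argparse.ArgumentParser()
parser.add_argument("-s", "--steps", help="maximum number of steps",
                    type=int, default=1000)
parser.add_argument("-X", "--mode", help="train|inference", default="train")

args = parser.parse_args(["-X", "inference", "-s", "500"])
print(args.mode, args.steps)          # inference 500

defaults = parser.parse_args([])
print(defaults.mode, defaults.steps)  # train 1000
```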


TRITA EECS-EX-2018:39
