Performance and Control Flow in TensorFlow

Pang Wei Koh, Stanford Department of Computer Science, [email protected]
Emma Pierson, Stanford Department of Computer Science, [email protected]
1 SUMMARY

The success of deep learning on a variety of statistical tasks has sparked a revolution. This, in turn, raises new challenges in programming language design: how should we design a language that allows users to efficiently crunch a huge amount of data in parallel? How can we make use of specialized hardware? What sort of control flows are most effective? In this project, we study TensorFlow – probably the most popular deep learning framework around – and how it attempts to answer these challenges. To understand TensorFlow from the bottom up, we begin by benchmarking and assessing the ease-of-use of CUDA, the low-level framework upon which TensorFlow relies. We then investigate two recent higher-level TensorFlow innovations – the Dataset API and eager execution. Last, to get a bird's-eye view of the design considerations underlying the TensorFlow framework, we conduct three lengthy interviews with expert TensorFlow practitioners, all of whom have more than a thousand hours of practical deep learning experience.
2 BACKGROUND

TensorFlow is a deep learning framework developed by Google Brain and released in November 2015. As of this writing, it is the most popular deep learning framework on a variety of metrics [2]; the initial paper describing it [1] has more than 2,000 citations and the GitHub repository has more than 81,000 stars. TensorFlow has been used as the framework underlying numerous advances in the state of the art, including the defeat of the human Go world champion by a neural network trained tabula rasa in less than a day [15]. TensorFlow's success has even given rise to specialized hardware like the Tensor Processing Unit [7, 8].
In this project, we examine several fundamental building blocks underlying TensorFlow. At a low level, TensorFlow relies on GPUs to efficiently do its number crunching; it interfaces with these GPUs through kernels written in CUDA, a parallel computing platform from NVIDIA. TensorFlow programs also often need to ingest and process a large amount of data, so much so that data loading is a common performance bottleneck [18]: for high-dimensional data like images, it can be the rate-limiting factor, reducing CPU/GPU utilization. TensorFlow provides several utilities to increase data loading speed, including a Dataset API and its own binary data format, TFRecord. We study both GPU usage/CUDA and the Dataset API and explore how they affect performance.
At a higher level, a core abstraction that TensorFlow relies on to perform computation is the data flow graph. (The name "TensorFlow" combines the two mathematical objects fundamental to its operation: data flow graphs, and tensors, which are higher-dimensional generalizations of vectors.) The data flow graph is an old programming language construct that is not original to TensorFlow [9]: it expresses a sequence of computations as a graph whose nodes represent operations that transform input values.
Data flow graphs are useful abstractions for a number of reasons: they can be easily visualized, making it easier to understand and debug increasingly complex neural networks (Figure 1); computations can be distributed by assigning different nodes in the data flow graph to different computers; and the graph structure facilitates optimizations.
Figure 1: A data flow graph for part of a convolutional neural network, visualized using TensorBoard. Source: https://www.tensorflow.org/get_started/graph_viz.
In standard TensorFlow, a data flow graph is first constructed: i.e., the operations that need to be performed are described, but not actually performed. After the graph is constructed, its nodes can be evaluated. For example, in many neural networks, a forward pass first computes the values of critical nodes like the loss; a backward pass, which makes use of TensorFlow's symbolic differentiation capabilities, then computes the gradient of the loss with respect to the network's parameters so that they can be optimized.
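To make this define-then-run pattern concrete, the following is a minimal sketch (the shapes and variable names are ours, purely illustrative): the first block only describes operations, and nothing is computed until session.run is called on the nodes of interest.

```python
import numpy as np
import tensorflow as tf

# Build the graph: these lines only describe operations; nothing runs yet.
x = tf.placeholder(tf.float32, shape=[None, 10])
w = tf.Variable(tf.random_normal([10, 1]))
pred = tf.matmul(x, w)
loss = tf.reduce_mean(tf.square(pred))   # a "forward pass" node
grads = tf.gradients(loss, [w])          # symbolic differentiation for the backward pass

# Evaluate specific nodes of the graph inside a session.
batch = np.random.rand(32, 10).astype(np.float32)
with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    loss_val, grad_vals = sess.run([loss, grads], feed_dict={x: batch})
```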
This two-part approach of defining and then evaluating the computation graph has disadvantages, however. Because the graph structure must be predefined, it is difficult to implement dynamic computation graphs, in which the graph structure depends on the input. Such models are used, for example, in work using tree-structured LSTMs to do sentiment analysis (where the structure of the parse tree is determined by the input sentence) [16]. It is also somewhat unintuitive for practitioners used to the standard imperative execution paradigm to learn to define operations and only execute them later, and this separation can make TensorFlow somewhat difficult to debug. All these facts mean that rivals of TensorFlow that offer imperative execution, like Torch, are appealing.
In response to this, TensorFlow released eager execution [14], which allows operations to be executed in an imperative fashion. Eager execution is an extremely recent development in TensorFlow, and as of this writing can only be run using TensorFlow’s nightly build. In theory, it offers faster debugging, more intuitive imperative
execution, and support for almost all of TensorFlow's ops, but it is also described as "experimental", and the nightly build is recommended only for "adventurous" users, so rough edges are expected. We assess the ease of use and performance of eager execution in our experiments, described below.
As with any complex computational tool, interviewing real-world practitioners to understand how the tool is used in practice is an invaluable supplement to theoretical analyses of its structure and to benchmarks on simulated data [13]. We therefore set up interviews with three deep learning experts to supplement our analyses with real-world perspectives.
3 APPROACH

We conducted four investigations to understand the features which contribute to TensorFlow's performance.
(1) We implemented benchmarks in CUDA, the low-level framework that TensorFlow relies on to interface with GPUs, and assessed both performance and ease-of-use.
(2) We implemented benchmarks using the Dataset API to assess both its runtime performance and ease-of-use.
(3) We installed the nightly build of TensorFlow so we could assess the runtime performance and ease-of-use of the eager execution mode.
(4) We tracked down three expert TensorFlow practitioners, including an author on the original TensorFlow paper, and arranged lengthy interviews so that they could give us context on its real-world applications, history, and design choices. All our interview subjects have more than a thousand hours of experience in deep learning.
We describe the results of each of these investigations in separate sections below.
4 CUDA

4.1 Model

CUDA is a parallel computing platform, introduced by NVIDIA in 2007, that aims to allow developers to write code that takes advantage of the computing power in GPUs. Concretely, it is a set of APIs and libraries that developers can make use of when writing programs in C++, Fortran, and other supported languages.
GPUs can perform certain types of computation very efficiently, primarily because they are massively parallel: for example, the latest GPUs contain thousands of cores and can keep hundreds of thousands of threads (if not more) in flight [11]. In particular, many operations in machine learning can be parallelized: for example, in matrix-vector multiplication, each row of the matrix can be multiplied by the vector independently, and the results collected into the output vector. If done in parallel, this can lead to significant computational savings, at the slight cost of having to transport the data from the CPU to the GPU and then back.
4.2 Benchmarks We set out to investigate how fast GPUs could perform a dot product between two vectors, an operation that is ubiquitous in machine learning. (We adapted some starter code for adding two vectors together from NVIDIA’s developer’s blog [4] to do this.) A simple
implementation in C++ took an average of 0.19 seconds to take the dot product of 2 vectors, each of length ≈ 16 million and filled with 32-bit floats (as measured with gprof). On the other hand, the corresponding CUDA implementation (using 32 blocks and 256 threads per block) took less than 1 millisecond (as measured with nvprof), which is more than 2 orders of magnitude faster. This speedup is due to the parallelization and not because a single thread on the GPU is inherently faster at floating point operations: when we ran our CUDA implementation using just 1 block and 1 thread per block, it took more than 4 seconds to complete.
4.3 Issues

However, the efficiency of running code on the GPU does not come for free. In order to obtain speedups, developers need to write code that explicitly indexes into the parallel structure of the GPU, which is divided into blocks that each contain threads. This can be quite clunky, as shown in Fig 2: while the simple C++ implementation is straightforward to code and debug, the CUDA implementation requires reasoning about indices of threads, block dimensions, accumulation, GPU-CPU synchronization, and the like. Moreover, the usual problems with parallel code, like race conditions, also come into play.
Fortunately, for the average user, TensorFlow abstracts all of this away, hiding any direct GPU interfacing from the end user. TensorFlow developers need only install the correct GPU-enabled version of TensorFlow to automatically benefit from the CUDA kernels already implemented within TensorFlow, which cover most everyday use cases (advanced users can opt to write their own kernels to plug in). As seen in Fig 3, TensorFlow on the GPU is an order of magnitude faster at matrix-matrix multiplication than TensorFlow on the CPU. This ability to reap the performance benefits of GPUs without the hindrance of coding in low-level CUDA is one of the prime advantages of working in a framework like TensorFlow.
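As a rough sketch of the kind of comparison behind Fig 3 (this is illustrative, not our exact benchmark harness, and it assumes a CUDA-enabled GPU is visible to TensorFlow), the same matrix multiplication can be pinned to the CPU and then to the GPU with tf.device:

```python
import time
import numpy as np
import tensorflow as tf

n = 4000
a = np.random.rand(n, n).astype(np.float32)

for device in ["/cpu:0", "/gpu:0"]:
    with tf.device(device):                 # pin the op to one device
        x = tf.constant(a)
        prod = tf.matmul(x, x)
    with tf.Session() as sess:
        sess.run(prod)                      # warm-up run (startup/transfer costs)
        start = time.time()
        sess.run(prod)
        print("%s matmul: %.4f s" % (device, time.time() - start))
```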
Figure 2: Dot product code in C++ and with CUDA.
Figure 3: Runtimes for square matrix multiplication using GPU versus CPU.

5 DATASET API

TensorFlow v1.3 introduced a new Dataset API [17]. This API, together with its TFRecord data format, gives us two major benefits: 1) it allows us to write TensorFlow programs that are more resource-efficient, and 2) it helps us prototype and iterate on TensorFlow programs more efficiently. It achieves these by enabling streaming – that is, lazily loading and processing data only when required – and by providing easy ways for developers to implement common data processing patterns. We elaborate on these below.
To test this new API, we downloaded and processed a subset of 10,000 images from the popular ImageNet dataset [3]. We then wrote simple TensorFlow programs, using different data ingestion methods, to iterate over all of these images and sum up their pixel values. We measured the best performance of these different programs over repeated runs.
5.1 Performance

Naive data ingestion. The most common way of handling data ingestion – and the easiest to code up without using the Dataset API – is to load all of the required data into memory at the start of the program and then simply index into it. We implemented this with the code shown in Fig 4 by storing the data on disk as a large 10GB Numpy matrix. This is slow, taking about 2 minutes to run; moreover, it requires being able to fit the whole matrix into memory. However, we note that in real-world applications that make multiple passes over the data, loading the entire dataset into memory (if feasible) can pay off, since the loading only needs to happen once.
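A minimal sketch in the spirit of this naive approach (the file path, shapes, and variable names are illustrative, not the code from Fig 4): everything is loaded into memory up front, and we then index into the array.

```python
import numpy as np
import tensorflow as tf

data = np.load("imagenet_subset.npy")       # loads the whole ~10GB matrix at once

image = tf.placeholder(tf.float32, shape=data.shape[1:])
pixel_sum = tf.reduce_sum(image)

total = 0.0
with tf.Session() as sess:
    for i in range(data.shape[0]):          # simply index into the in-memory matrix
        total += sess.run(pixel_sum, feed_dict={image: data[i]})
print(total)
```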
Data streaming with the Dataset API. We can get significant savings on time and memory by streaming the data via the Dataset API. Since we don't need to use more than one image at a time, we can simply load the next example whenever we need it, which means that we only need enough memory to store one image instead of all the images at once. As seen in Fig 5, this method only takes about 1 minute to run, a 2x speedup. The computational savings come from two factors: not having to find and allocate a large contiguous chunk of memory, and loading the data in a binary format that can be used directly in TensorFlow without having to first convert it from the Numpy format.
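A minimal sketch of this kind of streaming pipeline, under the assumption that each record stores its raw float32 image bytes under a key named "X" (the file name and parsing logic are illustrative, not the code from Fig 5). Note that in TensorFlow v1.3 the API lived under tf.contrib.data; later releases expose it as tf.data.

```python
import tensorflow as tf

def parse_example(serialized):
    # Assumed record layout: raw float32 image bytes stored under the key "X".
    features = tf.parse_single_example(
        serialized, {"X": tf.FixedLenFeature([], tf.string)})
    return tf.decode_raw(features["X"], tf.float32)

dataset = tf.data.TFRecordDataset(["images.tfrecord"]).map(parse_example)
next_image = dataset.make_one_shot_iterator().get_next()
pixel_sum = tf.reduce_sum(next_image)

total = 0.0
with tf.Session() as sess:
    try:
        while True:                        # only one image is held in memory at a time
            total += sess.run(pixel_sum)
    except tf.errors.OutOfRangeError:      # the iterator signals the end of the data
        pass
print(total)
```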
Pre-fetching with the Dataset API. We can obtain additional savings by interleaving file I/O with computation time. In particular,
we can pre-fetch the next examples from disk while the computation is ongoing, so that when we need the next example, it is already loaded into memory. This trades off a small amount of memory usage for greater speed. We implemented this in Fig 6 by just adding a single extra call to the Dataset API, resulting in a 25% speedup (down to 45s from 1 minute). In our example, the computation is quick (since it is just adding the pixel values together), and we expect even greater savings in real applications where the computation (e.g., taking gradients through large neural networks) can be much more expensive.
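In terms of the streaming sketch above, the change amounts to one extra call in the pipeline (again illustrative, not the code from Fig 6; prefetch(1) buffers a single parsed example, and a larger buffer could be used instead):

```python
import tensorflow as tf

# parse_example is the same illustrative parser from the streaming sketch above.
dataset = (tf.data.TFRecordDataset(["images.tfrecord"])
           .map(parse_example)
           .prefetch(1))   # read and parse the next example while the current one is processed
next_image = dataset.make_one_shot_iterator().get_next()
```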
The TFRecord data format. TensorFlow recommends using its own TFRecord data format, which is a simple binary file format with some convenience features (like built-in serializers and deserializers for common data types). The main benefit of converting data into a TFRecord is data locality, especially in cases where the data for a single example is spread over multiple locations. For example, in a medical diagnosis setting, one might have MRI image data stored in one place and a patient's electronic medical history in a different place. This spread slows down loading individual examples from disk. Instead, we could first combine each example's data into a single TFRecord and store that on disk, improving data locality for future look-ups. We did this for our image data in Fig 7: in this simple example, we combine the raw image data (X) with the image label (Y) in the same record. In our case, we have a single TFRecord that stores all the data; however, in cases where there is too much data to do this, we could easily have separated each example into its own TFRecord, since the TFRecordDataset constructor also accepts a list of TFRecord paths. This is considerably more difficult to implement with Numpy, which does not have a built-in way of combining data from many different source files.
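A minimal sketch of this kind of TFRecord writer (the stand-in data, path, and feature keys are ours, not the code from Fig 7): each example's image bytes (X) and label (Y) are packed into a single record so that they sit next to each other on disk.

```python
import numpy as np
import tensorflow as tf

def make_example(image, label):
    # One record per example: raw image bytes under "X", integer label under "Y".
    return tf.train.Example(features=tf.train.Features(feature={
        "X": tf.train.Feature(bytes_list=tf.train.BytesList(value=[image.tobytes()])),
        "Y": tf.train.Feature(int64_list=tf.train.Int64List(value=[label])),
    }))

images = np.random.rand(100, 64, 64, 3).astype(np.float32)   # stand-in image data
labels = np.random.randint(0, 10, size=100)                   # stand-in labels

with tf.python_io.TFRecordWriter("images.tfrecord") as writer:
    for image, label in zip(images, labels):
        writer.write(make_example(image, int(label)).SerializeToString())
```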
5.2 Ease of development

Beyond performance improvements in running the entire program, the Dataset API makes life better for developers in two main ways.
Facilitating rapid prototyping. When writing a program, developers often want to be able to test it out quickly. Unfortunately, with the naive method of loading data above, we have to wait for the entire dataset to be loaded before we can try anything out. In our running example, it takes 1.5 minutes to get to the computational loop (Fig 8), so if we have a bug there it is slow to test. However, if we lazily load in the data (Fig 9), we can get to the computational loop much, much quicker (40ms). This makes it far easier to develop code on the real data, instead of having to create artificially smaller datasets just for prototyping.
Figure 4: Ingesting data by loading it all into memory at the start of the program.

Figure 5: Ingesting data by streaming it from a TFRecord.

Support for common patterns. In machine learning applications, there are many common operations that developers want to perform on their data: for example, grouping examples into small batches (instead of operating on individual examples), and taking multiple passes through the data, shuffling the order of data processing with each pass. These can be somewhat tedious to implement on our own. For example, in Fig 10, we show an implementation of batching and shuffling using Numpy; this makes the code messier (interleaving data processing code with the actual computation) and can be a big pain to implement if the batch size does not cleanly divide the number of training examples. However, using the Dataset API, implementing batching and shuffling just requires two extra calls to the batch and shuffle functions (Fig 11).
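A minimal sketch of what those extra calls look like in the streaming pipeline from above (illustrative, not the code from Fig 11; it reuses the hypothetical parse_example parser, and the buffer, epoch, and batch sizes are arbitrary):

```python
import tensorflow as tf

# parse_example is the same illustrative parser from the streaming sketch above.
dataset = (tf.data.TFRecordDataset(["images.tfrecord"])
           .map(parse_example)
           .shuffle(buffer_size=1000)   # reshuffle within a 1000-example buffer each pass
           .repeat(3)                   # three passes over the data
           .batch(32))                  # a final partial batch is handled for us
next_batch = dataset.make_one_shot_iterator().get_next()
```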
6 EAGER EXECUTION

Eager execution is an execution mode in TensorFlow introduced in late 2017 [14]. As of this writing, it is only part of the nightly
build of TensorFlow. Eager execution allows imperative, immediate execution of TensorFlow commands. In contrast, as described above, standard TensorFlow requires one to set up a computation graph and only then evaluate parts of the graph using session.run.
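As a minimal illustration of the imperative style (note that in the 2017 nightly builds the switch lives under tf.contrib.eager; later releases expose it as tf.enable_eager_execution at the top level):

```python
import tensorflow as tf

tf.enable_eager_execution()   # must be called once, at program startup

a = tf.constant([[1.0, 2.0], [3.0, 4.0]])
b = tf.matmul(a, a)           # executes immediately: no graph construction, no session.run
print(b.numpy())              # the concrete result is available right away
```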
We ran a series of three experiments to evaluate performance and ease of development with eager execution. One's a priori belief would be that eager execution might achieve comparable performance to standard TensorFlow on computationally simple tasks without complicated graphs. However, we might expect eager execution to have worse performance a) on more computationally intensive tasks (because it is still bleeding-edge) and b) on complex graphs (because it cannot optimize the graph structure ahead of time). Our experiments assess whether these hypotheses are accurate.

Figure 6: Ingesting data by streaming it from a TFRecord, using pre-fetching.

Figure 7: Improving data locality by writing to a TFRecord.
6.1 Performance

We ran a series of three experiments to compare the performance of eager execution to that of standard TensorFlow. All experiments were performed on a single NVIDIA TITAN Xp GPU with no other processes running on it.
Matrix multiplication: We compared the time to multiply two random square matrices of varying dimensions (Figure 12). Times were very similar for the two methods, as expected, although at very large matrix sizes eager execution began to be slower. (We were limited in the size of matrices we could assess because the GPU ran out of memory.)
Matrix inversion: We compared the time to invert a random matrix of varying dimensions (Figure 13). Eager execution was only slightly slower for small matrices (for 1000 × 1000 matrices, it ran in 25 milliseconds as opposed to 18 milliseconds) but became significantly slower for larger matrices. We note that matrix inversion is more computationally expensive than matrix multiplication, and as such may enjoy greater benefits from the more fully optimized static computation.
Figure 8: Pre-loading all of the data makes debugging and prototyping slow.
Figure 9: Streaming data allows developers to prototype code more easily.
End-to-end neural network (autoencoder): In order to assess performance in a realistic scenario, we implemented a full autoencoder [5] in both the standard and eager frameworks. To generate simulated data X of dimension p, we drew Z from a k-dimensional Gaussian with k < p, Z ∼ N(0, I), and then set X = AZ, with A a transformation matrix. This data generation process, standard in dimensionality reduction models like probabilistic principal component analysis, ensured that X could in fact be represented as a transformation of a low-dimensional latent state, making an autoencoder an appropriate model. We assessed the time to complete three epochs for 10,000 samples, k = 10, p = 1000, using a one-layer autoencoder with linear activations and the Adam optimizer [10].
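A minimal graph-mode sketch of this setup (the data generation and a one-layer linear autoencoder); the sizes match the experiment above, but the variable names, initialization, and batch size are illustrative rather than our exact code:

```python
import numpy as np
import tensorflow as tf

# Simulated data: a low-dimensional latent Z mapped up to p dimensions via A.
n, k, p = 10000, 10, 1000
Z = np.random.randn(n, k).astype(np.float32)
A = np.random.randn(k, p).astype(np.float32)
X = Z.dot(A)

# One-layer linear autoencoder: encode to k dimensions, decode back to p.
x = tf.placeholder(tf.float32, [None, p])
W_enc = tf.Variable(tf.random_normal([p, k], stddev=0.01))
W_dec = tf.Variable(tf.random_normal([k, p], stddev=0.01))
code = tf.matmul(x, W_enc)                 # linear activation
recon = tf.matmul(code, W_dec)
loss = tf.reduce_mean(tf.square(recon - x))
train_op = tf.train.AdamOptimizer().minimize(loss)

batch_size, n_epochs = 100, 3
with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    for epoch in range(n_epochs):
        for i in range(0, n, batch_size):
            sess.run(train_op, feed_dict={x: X[i:i + batch_size]})
```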
As expected, given the much more complex computational graph, eager execution was significantly slower than the static computation graph, with an average time of 4.9 seconds as opposed to 1.9 seconds.
Analysis of results: Our results are consistent with our prior expectations for eager execution. On computationally easy tasks without complex computation graphs, like matrix multiplication, eager execution performs comparably to standard TensorFlow. On more computationally intensive tasks, like large matrix inversions, or tasks with complex graph structure, like performing both a forward and backward pass through an autoencoder, eager execution is slower.
Figure 10: Batching and multiple passes with a naive Numpy implementation.
Figure 11: Batching and multiple passes with the Dataset API.
6.2 Ease of development

Eager execution offers two advantages over standard TensorFlow from a development standpoint:
• More intuitive programming patterns: Most programmers are used to having operations execute immediately, rather than defining everything they want to run and then executing specific parts of the computation graph. Consequently, eager execution is much closer to standard programming patterns than standard TensorFlow. This, in turn, can translate into more compact and intuitive code and easier debugging.
• Dynamic graphs: Some models are difficult to implement in standard TensorFlow. Specifically, graphs whose structure depends on the input, like tree-structured LSTMs in NLP, are difficult to implement with a static graph. Our interviews with deep learning practitioners (Section 7) confirm this. Eager execution, which allows the computations performed to depend on the input, more naturally facilitates dynamic graph structures (see the sketch after this list).
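A toy sketch of what input-dependent control flow looks like in eager mode (the function and threshold are made up for illustration): the number of loop iterations depends on the values in the input tensor, and is expressed as an ordinary Python loop rather than graph-level control-flow ops.

```python
import tensorflow as tf

tf.enable_eager_execution()   # may live under tf.contrib.eager in the nightly builds

def double_until(x, threshold):
    steps = 0
    while tf.reduce_sum(x).numpy() < threshold:   # control flow driven by tensor values
        x = x * 2.0
        steps += 1
    return x, steps

result, steps = double_until(tf.constant([0.5, 1.0]), threshold=100.0)
print(steps, result.numpy())
```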
To summarize our findings, the more intuitive and flexible interface of eager execution makes it somewhat useful for prototyping and for dynamic graph structures. However, the slower runtimes for computationally intensive tasks or complex computation graphs mean that standard TensorFlow is preferable for cases where high performance is desired and flexibility is unnecessary. We note that, for our purposes at least, eager execution does not at present save developer time; the time to set up the nightly build and get used to the different way of setting up computations more than offsets any gains. Further, we were somewhat disappointed to discover that eager execution was slower than standard TensorFlow even on tasks like matrix inversion, with less consistent runtimes. We refactored our eager execution code numerous times in an effort to get its performance to match standard TensorFlow's, and even opened an issue on the TensorFlow GitHub repository to make developers aware of our results; hopefully our project will be able to contribute to the TensorFlow ecosystem. However, as eager execution matures, and its implementation is no longer bleeding-edge, it may come closer to approaching the performance of standard TensorFlow in real-world applications – an exciting possibility for developers.

Figure 12: Performance on the matrix multiplication benchmark.

Figure 13: Performance on the matrix inversion benchmark.
7 USER CASE STUDIES

We were fortunate enough to be able to secure lengthy interviews with three expert deep learning practitioners in industry, some of whom would speak to us only on the condition that we did not identify them by name.
We interviewed one of the original developers of TensorFlow (an author on the canonical TensorFlow paper) about the thought that went into its original structure. He explained that the development
team anticipated the limitations of the original graph structure while they were developing TensorFlow. Specifically, they knew it would be hard to develop models that required dynamic graphs whose structure changed in response to the input; such models were already being developed at the time in natural language processing [16]. He viewed eager execution as an important step forward for TensorFlow, but not necessarily for the field of machine learning as a whole, since Torch has similar capabilities and is much faster.
A second deep learning practitioner, a research scientist at Salesforce Einstein (formerly MetaMind), largely corroborated this account and gave us welcome context on the history that preceded the development of TensorFlow eager execution. He explained that TensorFlow had initially attempted to compromise between two competing conceptions of how to build models:
• Model as data structure: Data structures are desirable because they are easy to optimize (e.g., compilers convert code to data structures like trees and graphs to perform optimizations). They are also highly portable: data structures are totally self-contained and easier to use across platforms – e.g., Caffe does this [6].
• Model as code: Harder to optimize and less portable. However, much easier for programmers to write – complicated models are a pain to implement as data structures. Also, data structures don't offer much flexibility (e.g., if you want the graph structure to change depending on the input).
To get the best of both worlds, the researcher explained, the original TensorFlow framework took code and converted it to data structures. TensorFlow's framework is quite comprehensive, with 500 ops, so it does offer a lot of versatility, but certain tasks remain hard and opaque to accomplish. And writing things in terms of graphs is not the most intuitive way to run code.
As all this was going on, the researcher explained, there was also a lot of work being done on automatic differentiation, which was incorporated into the Torch library, and after a huge amount of work they managed to make it just as fast as TensorFlow while being much more flexible and allowing dynamic graph structure. This made Torch much more appealing for many NLP tasks. TensorFlow eager execution did more to match Torch's flexibility but still had a lot of rough edges and was much slower, so the researcher told us he preferred Torch at present. He added, however, that if he were in an ecosystem that already used TensorFlow, eager execution would represent a significant step forward.
A third deep learning researcher corroborated the above accounts and added that the TensorFlow Fold API was another development to facilitate the creation of dynamic computation graphs. We asked him about the merits of the Dataset API. He said he did not use it at present, but in general had observed that significant speedups (and increases in GPU utilization) could be obtained by using TensorFlow's many data loading utilities. He added, though, that he rarely worried too much about optimizing such things because he was not doing work that required highly optimized code. In general, our impression is that the sheer size of TensorFlow's codebase is a mixed blessing: it has so many features, so many ways to do things, that it isn't always clear which features are worth picking up as a
developer. The fact that parts of the codebase are in flux also makes development somewhat difficult.
8 DISCUSSION

TensorFlow's progress is in some sense a metaphor for the progress of deep learning as a whole. It has enjoyed explosive growth, in terms of codebase, contributors, and users, but that rapid growth has produced some rough edges. Many of its features clearly improve both performance and ease-of-use: for example, it offers an interface to GPUs which massively improves performance over CPUs, while being much easier to use than CUDA. Similarly, its Dataset API reduces both development cost and dataset loading time. The many hours we invested in accomplishing these ubiquitous deep learning tasks without TensorFlow's utilities greatly increased our appreciation for how much time it saves.
But, as in deep learning as a whole, TensorFlow's rapid growth has produced rough edges. For example, when we used the bleeding-edge eager mode, we found that its performance failed to match standard TensorFlow's (as its developers warned), and its development cost remained substantial. The fact that parts of the codebase are still in flux also introduces challenges for practitioners: for example, it was reported that when rounding behavior in part of the TensorFlow codebase was changed from "truncate" to "round to even", one model went from 25% error to 99% error [12]. Our conclusion, on the basis of our investigations and our interviews with real-world practitioners, is "caveat programmer": TensorFlow's mature features are invaluable, but its newer features are still being ironed out. That said, the enormous amount of investment and engineering talent from both Google and the open source community, and the excitement around deep learning in general, makes us confident that TensorFlow will only continue to improve.
9 ACKNOWLEDGMENTS

Both authors contributed equally to this project. We thank the deep learning practitioners who contributed their time and expertise.
REFERENCES

[1] Martín Abadi, Ashish Agarwal, Paul Barham, Eugene Brevdo, Zhifeng Chen, Craig Citro, Greg S Corrado, Andy Davis, Jeffrey Dean, Matthieu Devin, et al. 2016. TensorFlow: Large-scale machine learning on heterogeneous distributed systems. arXiv preprint arXiv:1603.04467 (2016).
[2] Rachel Allen and Michael Li. 2017. Ranking Popular Deep Learning Libraries for Data Science. https://www.kdnuggets.com/2017/10/ranking-popular-deep-learning-libraries-data-science.html. (2017).
[3] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei. 2009. ImageNet: A Large-Scale Hierarchical Image Database. In CVPR09.
[4] Mark Harris. 2017. An Even Easier Introduction to CUDA. https://devblogs.nvidia.com/parallelforall/even-easier-introduction-cuda/. (2017).
[5] Geoffrey E Hinton and Ruslan R Salakhutdinov. 2006. Reducing the dimensionality of data with neural networks. Science 313, 5786 (2006), 504–507.
[6] Yangqing Jia, Evan Shelhamer, Jeff Donahue, Sergey Karayev, Jonathan Long, Ross Girshick, Sergio Guadarrama, and Trevor Darrell. 2014. Caffe: Convolutional Architecture for Fast Feature Embedding. arXiv preprint arXiv:1408.5093 (2014).
[7] Norm Jouppi. 2016. Google supercharges machine learning tasks with TPU custom chip. https://cloudplatform.googleblog.com/2016/05/Google-supercharges-machine-learning-tasks-with-custom-chip.html. (2016).
[8] Norman P Jouppi, Cliff Young, Nishant Patil, David Patterson, Gaurav Agrawal, Raminder Bajwa, Sarah Bates, Suresh Bhatia, Nan Boden, Al Borchers, et al. 2017. In-datacenter performance analysis of a tensor processing unit. In Proceedings of the 44th Annual International Symposium on Computer Architecture. ACM, 1–12.
[9] Krishna M. Kavi, Bill P. Buckles, and U. Narayan Bhat. 1986. A formal definition of data flow graph models. IEEE Transactions on Computers 11 (1986), 940–948.
[10] Diederik Kingma and Jimmy Ba. 2014. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980 (2014).
[11] NVIDIA. 2017. NVIDIA TITAN Xp specs. https://www.nvidia.com/en-us/design-visualization/products/titan-xp/. (2017).
[12] Ali Rahimi. 2017. NIPS 2017 test-of-time award presentation. https://www.youtube.com/watch?v=Qi1Yry33TQE. (2017).
[13] Peter Seibel. 2009. Coders at Work: Reflections on the Craft of Programming. Apress.
[14] Asim Shankar and Wolff Dobson. 2017. Eager Execution: An imperative, define-by-run interface to TensorFlow. https://research.googleblog.com/2017/10/eager-execution-imperative-define-by.html. (2017).
[15] David Silver, Julian Schrittwieser, Karen Simonyan, Ioannis Antonoglou, Aja Huang, Arthur Guez, Thomas Hubert, Lucas Baker, Matthew Lai, Adrian Bolton, et al. 2017. Mastering the game of Go without human knowledge. Nature 550, 7676 (2017), 354–359.
[16] Kai Sheng Tai, Richard Socher, and Christopher D Manning. 2015. Improved semantic representations from tree-structured long short-term memory networks. arXiv preprint arXiv:1503.00075 (2015).
[17] TensorFlow Development Team. 2017. Importing Data. https://www.tensorflow.org/programmers_guide/datasets. (2017).
[18] TensorFlow Development Team. 2017. Performance Guide. https://www.tensorflow.org/performance/performance_guide. (2017).