
TensorFlow

Marco Serafini

COMPSCI 532, Lecture 20


Motivations
• DistBelief: previous iteration
  • Parameter server
• Limitations:
  • Monolithic layers, difficult to define new ones
  • Difficult to offload computation with complex dependencies to parameter servers
    • E.g., apply updates based on gradients accumulated over multiple iterations
  • Fixed execution pattern: read data, compute the loss function (forward pass), compute gradients for the parameters (backward pass), write gradients to the parameter server
  • Not optimized for single workstations and GPUs


TensorFlow
• Dataflow graph of operators, but not a DAG
  • Loops and conditionals
• Deferred (lazy) execution
  • Enables optimizations, e.g. pipelining with GPU kernels
• Composable basic operators
  • Matrix multiplication, convolution, ReLU
• Concept of devices
  • CPUs, GPUs, mobile devices
  • Different implementations of the operators


Difference with Parameter Server
• Parameter server
  • Separate worker nodes and parameter nodes
  • Different interfaces
• TensorFlow: only tasks
  • Shared parameters are held in stateful operators: variables and queues
  • Tasks managing them are called PS tasks
  • PS tasks are regular tasks: they can run arbitrary operators
  • Uniform programming interface
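A sketch of how this looks in the TF 1.x API (the hostnames below are made up): "ps" and "worker" are just user-chosen job names in a cluster specification, and a PS task is an ordinary task that happens to host variables.

```python
import tensorflow as tf  # illustrative TF 1.x sketch; example.com hosts are hypothetical

cluster = tf.train.ClusterSpec({
    "ps":     ["ps0.example.com:2222"],
    "worker": ["worker0.example.com:2222", "worker1.example.com:2222"],
})
server = tf.train.Server(cluster, job_name="worker", task_index=0)

# Variables can be pinned to the PS task, which could also run computation,
# since it is just another task.
with tf.device("/job:ps/task:0"):
    weights = tf.Variable(tf.zeros([784, 10]), name="weights")
```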


Example
[Figure: example dataflow graph; variables such as b_1 are stateful operators]


Example
• Data-parallel training looks like this:
[Figure: data-parallel training dataflow with stateful queues feeding inputs, stateful variables holding the parameters, and concurrent steps for data parallelism]


Dataflow Graph
• Vertex: unit of local computation
  • Called an operation in TensorFlow
• Edges: inputs and outputs of computation
  • Values along edges are called tensors


Tensors
• Edges in the dataflow graph
• Data flowing among operators
• Format
  • n-dimensional arrays
  • Elements have primitive types (including byte arrays)
• Tensors are dense
  • All elements are represented
  • Users must find ways to encode sparse data efficiently
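A short sketch of the density point in TF 1.x style code; the index/value encoding below is just one illustrative user-level convention for sparse data, not a mechanism prescribed by the slides.

```python
import tensorflow as tf  # illustrative TF 1.x sketch

# A tensor is a dense n-dimensional array of a primitive element type;
# every element is materialized, even the zeros.
dense = tf.constant([[1.0, 0.0, 0.0],
                     [0.0, 0.0, 2.0]], dtype=tf.float32)

# One way a user can encode sparse data with dense tensors (an assumption
# for illustration): parallel index and value tensors.
indices = tf.constant([[0, 0], [1, 2]], dtype=tf.int64)
values = tf.constant([1.0, 2.0], dtype=tf.float32)
```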


Operations
• Vertices in the dataflow graph
• State is encapsulated in operations
  • Variables and queues
• Access to state (and tensors)
  • Variable op: returns a unique reference handle
  • Read op: takes the reference handle, produces the value of the variable
  • Write ops: take a reference and a value and update the variable
• Queues are also stateful operators
  • Get a reference handle, modify through operations
  • Blocking semantics, backpressure, synchronization
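A minimal TF 1.x sketch of both kinds of stateful operation (illustrative): reads and writes of a variable are separate ops on its handle, and a FIFO queue provides blocking enqueue/dequeue.

```python
import tensorflow as tf  # illustrative TF 1.x sketch

# A variable is a stateful operation; reads and writes are separate ops
# that operate on its reference handle.
counter = tf.Variable(0, dtype=tf.int32, name="counter")
increment = tf.assign_add(counter, 1)   # write op: updates the state
value = counter.read_value()            # read op: produces the current value

# A queue is another stateful operation, with blocking enqueue/dequeue
# semantics that give backpressure and synchronization.
queue = tf.FIFOQueue(capacity=10, dtypes=[tf.float32])
enqueue = queue.enqueue([3.14])
dequeue = queue.dequeue()

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    sess.run([increment, enqueue])
    print(sess.run([value, dequeue]))   # e.g. [1, 3.14]
```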


Execution Model
• Step: the client executes a subgraph by indicating:
  • Edges to feed the subgraph with input tensors
  • Edges to fetch the output tensors
  • The runtime prunes the subgraph to remove operations not needed for the fetched outputs
• Subgraphs are run asynchronously by default
  • Can execute multiple partial, concurrent subgraphs
  • Example: concurrent batches for data-parallel training
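A tiny TF 1.x sketch of a step (illustrative): the feeds and fetches passed to a single run call select the subgraph, and anything not needed for the fetch is pruned.

```python
import tensorflow as tf  # illustrative TF 1.x sketch

x = tf.placeholder(tf.float32, name="x")
y = x * 2.0
z = y + 1.0
unused = tf.sqrt(x)   # pruned: not needed to produce the fetched output

with tf.Session() as sess:
    # One step: feed a tensor for x, fetch the output of z.
    # Only the pruned subgraph {x, y, z} is executed.
    print(sess.run(z, feed_dict={x: 3.0}))   # 7.0
```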


Distributed Execution
• Tasks: named processes that send messages
  • PS tasks: store variables, but can also run computations
  • Worker tasks: the rest
  • Note: "informal" categories, not enforced by TensorFlow
• Devices: CPU, GPU, TPU, mobile, …
  • The CPU is the host device
  • A device executes a kernel for each operation assigned to it
  • The same operation (e.g. matrix multiplication) has different kernels for different devices
• Requirements for a device
  • Must accept kernels for execution
  • Must allocate memory for inputs and outputs
  • Must transfer data to and from host memory


Distributed Execution
• Each operation
  • Resides on a device
  • Corresponds to one or more kernels
  • Different kernels can be specialized for different devices
• Operations are executed within a task


Distributed Scheduling
• The TensorFlow runtime places operations on devices
  • Implicit constraints: a stateful operation must be on the same device as its state
  • Explicit constraints: dictated by the user
  • Optimal placement is still an open question
• Obtain per-device subgraphs
  • All operations assigned to the device
  • Send and Receive operations replace cross-device edges
  • Specialized per-device implementations
    • CPU – GPU: CUDA memory copy
    • Across tasks: TCP or RDMA
• Placement is preserved throughout a session
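A small TF 1.x sketch of an explicit placement constraint (illustrative; it assumes a GPU device is available): the runtime inserts Send/Receive pairs on the edge that crosses the two devices, here a CUDA memory copy between CPU and GPU.

```python
import tensorflow as tf  # illustrative TF 1.x sketch

# Explicit placement constraints from the user.
with tf.device("/cpu:0"):
    w = tf.Variable(tf.random_normal([1024, 1024]), name="w")

with tf.device("/gpu:0"):
    y = tf.matmul(w, w)   # Send/Receive ops bridge the CPU-GPU edge

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    sess.run(y)
```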


Dynamic Control Flow
• How to enable dynamic control flow with a static graph?
• Example: recurrent neural network
  • Train the network on sequences of variable length without unrolling
• Conditional: Switch and Merge
[Figure: Switch takes a data input and a control input and routes the data to one of two branches of ops (the untaken branch receives a dead signal); Merge outputs its one non-dead input]
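In the TF 1.x API, conditionals are exposed through tf.cond, which is built on these Switch and Merge primitives. A minimal illustrative sketch:

```python
import tensorflow as tf  # illustrative TF 1.x sketch

pred = tf.placeholder(tf.bool, name="pred")
x = tf.constant(3.0)

# Switch routes x to the taken branch (the other branch sees a dead
# signal); Merge emits the single non-dead result.
result = tf.cond(pred, lambda: x * 2.0, lambda: x + 10.0)

with tf.Session() as sess:
    print(sess.run(result, feed_dict={pred: True}))    # 6.0
    print(sess.run(result, feed_dict={pred: False}))   # 13.0
```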


Loops
• Use three additional operators: Enter, Exit, and NextIteration
[Figure: a loop body of ops between an Enter operator (data input) and an Exit operator, with NextIteration feeding values back for the next iteration]
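In the TF 1.x API these operators sit underneath tf.while_loop, which runs a data-dependent number of iterations without unrolling the body. A minimal illustrative sketch:

```python
import tensorflow as tf  # illustrative TF 1.x sketch

i0 = tf.constant(0)
cond = lambda i: tf.less(i, 10)   # loop condition
body = lambda i: tf.add(i, 1)     # loop body
final_i = tf.while_loop(cond, body, [i0])

with tf.Session() as sess:
    print(sess.run(final_i))      # 10
```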


Scaling to Large Models
• Model parallelism
  • Avoids moving terabytes of parameters every time
• Operations (typically implemented by a library)
  • Gather: reads tensor data from a shard of the parameters
  • Part: partitions the input across the shards of parameters
  • Stitch: aggregates the partial results from all shards
[Figure: sparse lookup layer with inputs partitioned across parameter shards and the per-shard results stitched back together]


Fault Tolerance
• Long-running tasks face failures and pre-emption
  • Sometimes run at night on idle machines
• Small operations, no need to tolerate individual failures
  • Even RDDs are overkill
• Users use the Save operation for checkpointing
  • Each variable in a task is connected to the same Save op for batching
  • Asynchronous, not consistent
• The Restore operation is executed by clients at startup
• Other use cases: transfer learning
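In the TF 1.x API, Save and Restore are wrapped by tf.train.Saver. A minimal illustrative sketch (the checkpoint path is arbitrary):

```python
import tensorflow as tf  # illustrative TF 1.x sketch

w = tf.Variable(tf.zeros([10]), name="w")
saver = tf.train.Saver()   # wires the variables into batched Save/Restore ops

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    # Periodic checkpointing: asynchronous with respect to training, so
    # it is not a consistent snapshot across concurrent updates.
    saver.save(sess, "/tmp/model.ckpt")
    # At (re)start, the client runs the Restore operation instead.
    saver.restore(sess, "/tmp/model.ckpt")
```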


Coordination
• TensorFlow is asynchronous by default
  • Stochastic Gradient Descent tolerates asynchrony
  • Asynchrony increases throughput
• But synchrony has benefits
  • Using stale parameters slows down convergence
• The system must support user-defined synchrony


Synchronous Coordination
• Use blocking queues for synchrony
• Redundant tasks for stragglers
[Figure: synchronous training variants with blocking queues on the workers' inputs and outputs; different colors denote different versions of the parameters; backup workers are proactive (not reactive)]
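One way the TF 1.x API exposes this queue-based synchronous scheme with backup workers is tf.train.SyncReplicasOptimizer. A minimal construction sketch (the distributed session and cluster setup are omitted; the replica counts are illustrative):

```python
import tensorflow as tf  # illustrative TF 1.x sketch

w = tf.Variable(0.0)
loss = tf.square(w - 1.0)
global_step = tf.train.get_or_create_global_step()

opt = tf.train.GradientDescentOptimizer(0.1)
# Each step blocks until gradients from replicas_to_aggregate workers have
# been collected; running more workers than that gives proactive backups
# that mask stragglers.
sync_opt = tf.train.SyncReplicasOptimizer(
    opt,
    replicas_to_aggregate=4,    # wait for 4 gradient contributions per step
    total_num_replicas=5)       # 5 workers, so 1 proactive backup
train_op = sync_opt.minimize(loss, global_step=global_step)
```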


Implementation
• Distributed master
  • Obtains the subgraphs for each participating device
• Dataflow executor
  • Handles requests from the master
  • Schedules the execution of the kernels of the local subgraph
  • Handles data transfer to devices and over the network


Single-Machine Performance
• Similar to a COST analysis
  • Comparison with single-server (not single-threaded) tools
• Four convolutional models using one GPU


Synchronous Microbenchmarks
• Null training steps
• Sparse performance is close to optimal (scalar)


Scalability
• Scalability is bound by access to the PS tasks (7 in the experiment)
• Synchronous coordination scales well
• Backups are beneficial (but an expensive way to do fault tolerance)