Transcript of HotChips 2019 Tutorial
Cloud TPU: Codesigning Architecture and Infrastructure
Clifford Chao, Brennan Saeta
1
HotChips 2019
How to Codesign: Leverage expertise across disciplines and stakeholders for optimal results.
2
HotChips 2019
Gordon Moore: “No exponential is forever”
● Past: take general-purpose compute and shrink technology to improve performance
● Current reality: many devices not scaling well anymore. Running out of tricks!
● Solution: many gains left to be had through specializing compute to specific applications. Codesign!
3
But First... Why Codesign?
HotChips 2019
Codesign: TPU Architecture
4
HotChips 2019
Brief Historical Perspective: TPU v1 (Google I/O 2016)
Deployed internally at Google in 2015 (not on Cloud).
Inference workloads, including AlphaGo (the Go matches against Lee Sedol).
ISCA paper in 2017 describing the technical details.
5
HotChips 2019
● For inference and training workloads
● 180 teraflops of computation, 64 GB of HBM memory, 2400 GB/s memory BW
Cloud TPU v2 Board (Google I/O 2017)
6
HotChips 2019
● 180 teraflops of computation, 64 GB of HBM memory, 2400 GB/s memory BW
● Architecture co-designed with ML research, systems, and data center teams.
Cloud TPU v2 Board
7
TPU Codesign
● ML research: computational requirements for cutting-edge models.
● Systems: power delivery, board space.
● Data Center: cooling, buildability.
HotChips 2019
Host Server
32 lane PCIe gen3
8
Cloud TPU v2 Host Connection
HotChips 2019
Host Server
32 lane PCIe gen3
9
Cloud TPU v2 Host Connection
TPU Codesign
● ML research: input/output data feed rates.
● Systems: board layout.
● Data Center: wiring and serviceability.
HotChips 2019 10-12
Cloud TPU v2 Chip Architecture
[Diagram: two cores per chip; each core has a Scalar Unit, a Vector Unit, a Matrix Unit (MXU), and 8 GB of HBM]
22.5 TFLOPS per core
● 2 cores per chip
● 4 chips per board (one 180-TFLOP Cloud TPU)
● Scalar Unit
● Vector Unit
● Matrix Unit
● Mostly float32
HotChips 2019 13
Cloud TPU v2 Chip Architecture
TPU Codesign
● ML research: types of operations to accelerate.
● Software: proper flexibility and programmability.
● Data center: performance metrics.
HotChips 2019 14-15
Cloud TPU v2 Chip Architecture
[Diagram: the same chip, highlighting the Matrix Units (MXUs)]
Matrix Unit (MXU)
● 128 x 128 systolic array
● float32 accumulate
● bfloat16 multiplies
HotChips 2019 16-25
Matrix Unit Systolic Array
[Animation across slides 16-25: computing Y = WX with W a 3x3 weight matrix and X a batch of three input vectors. The weights Wij stay resident in the array; the inputs Xij stream in one diagonal wavefront per cycle while partial sums accumulate as they flow through, e.g. Y11 = W11X11 + W12X12 + W13X13, until all nine outputs Y11...Y33 have emerged.]
Very high throughput for matrix multiplications (a toy model of the dataflow follows below).
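To make the animation concrete, here is a toy NumPy model of the weight-stationary dataflow (an illustrative sketch, not the MXU's actual implementation): each processing element holds one weight, inputs enter skewed by one cycle per row, and every element performs one multiply-accumulate per cycle.

import numpy as np

def systolic_matmul(W, X):
    """Toy cycle-by-cycle model of a weight-stationary systolic array.

    PE (r, c) permanently holds weight W[c, r]. Input element X[r, b]
    enters row r at cycle r + b and marches right one PE per cycle; the
    partial sum for output Y[c, b] enters the top of column c at cycle
    b + c and marches down, picking up one MAC per row.
    """
    n, batch = W.shape[0], X.shape[1]
    Y = np.zeros((n, batch))
    for t in range(2 * n + batch - 2):            # cycles until the pipeline drains
        for r in range(n):                        # array row
            for c in range(n):                    # array column
                b = t - r - c                     # input vector meeting PE (r, c) now
                if 0 <= b < batch:
                    Y[c, b] += W[c, r] * X[r, b]  # one MAC per PE per cycle
    return Y

W = np.arange(1.0, 10.0).reshape(3, 3)            # the 3x3 weights of the animation
X = np.random.randn(3, 3)                         # batch of three input vectors
assert np.allclose(systolic_matmul(W, X), W @ X)

In this schedule each input element is used by all n columns and each weight by every batch element; that operand reuse is exactly what the energy discussion on the next slides quantifies.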
HotChips 2019
Systolic Arrays and Energy Efficiency
From Mark Horowitz’s ISSCC 2014 Keynote: “Computing’s Energy Problem (and what we can do about it)”.
In conventional processors, the energy of the control logic, SRAM, and register accesses needed by a matrix multiply dominates the energy of the arithmetic itself.
26
HotChips 2019
Systolic Arrays and Energy Efficiency
This doesn’t even count the DRAM energy cost, which is orders of magnitude larger (1300-2600 pJ).
Now consider doing 16,384 operations per clock in a systolic array: it has high operand reuse, so only the arithmetic energy (the red portion of the figure) scales up.
27
HotChips 2019
Systolic Arrays and Energy Efficiency
For this figure, the DRAM energy cost is even bigger (1300-2600 pJ).
Now consider doing 16,384 or 65,536 operations per cycle in a systolic array.
Each operand gets reused 128 or 256 times within the systolic array (see the back-of-envelope arithmetic below).
28
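A back-of-envelope version of the argument, with illustrative per-operation energies (only the 1300-2600 pJ DRAM figure comes from the slides; the MAC and SRAM numbers are rough assumptions in the spirit of Horowitz's 45 nm data):

# Illustrative only: amortizing operand fetches over systolic reuse.
MAC_PJ  = 1.0       # ~pJ for one low-precision multiply-accumulate (assumed)
SRAM_PJ = 10.0      # ~pJ for one operand read from a small on-chip SRAM (assumed)
DRAM_PJ = 2000.0    # pJ per DRAM access (slide: 1300-2600 pJ)
REUSE   = 128       # operand reuse inside a 128x128 systolic array

naive    = MAC_PJ + 2 * SRAM_PJ            # fetch both operands for every MAC
systolic = MAC_PJ + 2 * SRAM_PJ / REUSE    # fetch once, reuse 128 times
print(f"energy per MAC: {naive:.2f} pJ vs {systolic:.2f} pJ "
      f"({naive / systolic:.0f}x saved)")
print(f"one DRAM access costs as much as {DRAM_PJ / MAC_PJ:.0f} MACs")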
TPU Codesign
● ML research: latency and BW requirements.
● Software: controllability.
● Systems: thermal limits.
HotChips 2019
Numerics
Neural-network researchers showed that inference did not need float32 precision (a quantization sketch follows below).
Vanhoucke, V., Senior, A., and Mao, M.Z., 2011. Improving the speed of neural networks on CPUs. In Proc. Deep Learning and Unsupervised Feature Learning NIPS Workshop (Vol. 1, p. 4).
TPUv1 deployed with int8 hardware and int16 support through software; int16 was intended as “insurance” and ended up used mostly in LSTM-based models.
Training has traditionally been done in floating point.
Q: Does training require full float32 resolution? A: Not everywhere. But where it does matters.
29
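As a concrete illustration of the low-precision-inference point, here is a generic symmetric int8 quantization sketch (my example, not TPUv1's actual quantization scheme):

import numpy as np

def quantize(x):
    """Symmetric per-tensor quantization of float32 to int8."""
    scale = np.abs(x).max() / 127.0
    return np.round(x / scale).astype(np.int8), scale

W = np.random.randn(64, 64).astype(np.float32)    # weights
x = np.random.randn(64).astype(np.float32)        # activations
Wq, sw = quantize(W)
xq, sx = quantize(x)
# Multiply-accumulate in integers (int32 accumulator), rescale once at the end.
y = (Wq.astype(np.int32) @ xq.astype(np.int32)) * (sw * sx)
print(np.max(np.abs(y - W @ x)))                  # small quantization error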
HotChips 2019
M M M M M M M M M M M M M M M M M M M M M M MS E E E E E E E E
Exponent: 8 bits Mantissa (Significand): 23 bits
float32: Single-precision IEEE Floating Point Format Range: ~1e−38 to ~3e38
S E E E E E M M M M M M M M M M
Exponent: 5 bits Mantissa (Significand): 10 bits
float16: Half-precision IEEE Floating Point Format Range: ~5.96e−8 to 65504
bfloat16: Brain Floating Point Format
S E E E E E E E E
Exponent: 8 bits Mantissa (Significand): 7 bits
M M M M M M M
Range: ~1e−38 to ~3e38
30
Numerics
HotChips 2019
M M M M M M M M M M M M M M M M M M M M M M MS E E E E E E E E
Exponent: 8 bits Mantissa (Significand): 23 bits
float32: Single-precision IEEE Floating Point Format Range: ~1e−38 to ~3e38
S E E E E E M M M M M M M M M M
Exponent: 5 bits Mantissa (Significand): 10 bits
float16: Half-precision IEEE Floating Point Format Range: ~5.96e−8 to 65504
bfloat16: Brain Floating Point Format
S E E E E E E E E
Exponent: 8 bits Mantissa (Significand): 7 bits
M M M M M M M
Range: ~1e−38 to ~3e38
31
NumericsTPU Codesign
● ML research: trains without loss scaling, unlike float16 [Micikevicius 2017].
● ASIC: small mantissa reduces multiplier power, area.
● Software: same dynamic range as float32, same Inf/NaN behavior.
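A minimal NumPy sketch of why bfloat16 is cheap to convert to and from: it is just the top 16 bits of a float32, so it inherits float32's 8-bit exponent (hence the same range and Inf/NaN behavior). Plain truncation is shown for clarity; real hardware typically rounds to nearest even.

import numpy as np

def f32_to_bf16_bits(x):
    """Truncate float32 to bfloat16: keep sign, the full 8-bit exponent,
    and the top 7 mantissa bits."""
    bits = np.asarray(x, dtype=np.float32).view(np.uint32)
    return (bits >> 16).astype(np.uint16)

def bf16_bits_to_f32(b):
    """Widen bfloat16 back to float32 by zero-filling the low mantissa bits."""
    return (np.asarray(b).astype(np.uint32) << 16).view(np.float32)

x = np.float32(3.14159265)
y = bf16_bits_to_f32(f32_to_bf16_bits(x))
print(y)   # 3.140625: same sign and exponent, coarser mantissa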
HotChips 2019
Cloud TPU v3 (Google I/O 2018)
32
HotChips 2019 33
● 420 teraflops of computation, 128 GB of HBM memory
● Liquid cooling
Cloud TPU v3 Board
HotChips 2019 34-35
Cloud TPU v3 Chip Architecture
[Diagram: two cores per chip; each core has a Scalar Unit, a Vector Unit, two Matrix Units (MXUs), and 16 GB of HBM]
TPU Codesign
● ML Research: latest computational requirements.
● Systems: liquid cooling.
● Data Center: space provisioning, network requirements.
HotChips 2019
Datacenter-Scaled Tiled Architecture
Three ways to spend computer-architecture resources:
● Compute
● Memory
● Custom interconnect
Allows us to scale the system by connecting many boards together into a super-computing pod to work on a single workload.
36
HotChips 2019
Datacenter-Scaled Tiled Architecture
First, about the custom interconnect itself:
● 2-D torus network topology
● Custom protocol allows lower overhead than generic protocols (e.g., RoCE, InfiniBand)
● Enables algorithms like parallel in-network reduction to a single node (see the ring-reduce sketch below)
● Router integrated into the TPU chip
Now, let’s use the interconnect to scale the system...
37
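To illustrate the in-network reduction idea, here is a single-process NumPy sketch of the classic ring reduce-scatter along one torus dimension (an illustrative algorithm; the TPU's hardware reduction differs in detail):

import numpy as np

def ring_reduce(node_values):
    """Reduce-scatter on a ring: after n-1 neighbor-to-neighbor exchanges,
    each node holds one fully reduced chunk; no central reducer is needed."""
    n = len(node_values)
    chunks = [list(np.array_split(v.astype(float), n)) for v in node_values]
    for step in range(n - 1):
        # Each node sends one chunk to its ring neighbor per step.
        sends = [chunks[node][(node - step) % n].copy() for node in range(n)]
        for node in range(n):
            src = (node - 1) % n              # receive from the left neighbor
            c = (src - step) % n              # the chunk index src sent this step
            chunks[node][c] += sends[src]     # accumulate the partial sum
    # Chunk c ends up fully reduced at node (c - 1) % n.
    return np.concatenate([chunks[(c - 1) % n][c] for c in range(n)])

vals = [np.random.randn(16) for _ in range(4)]   # 4 nodes, one vector each
assert np.allclose(ring_reduce(vals), np.sum(vals, axis=0))

Every link carries traffic on every step, so the reduction bandwidth scales with the number of nodes rather than bottlenecking at one root.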
HotChips 2019
Single board
Cloud TPU v2 Board
38
HotChips 2019 39
● 11.5 pflops
● 4 TB HBM
● 2-D torus topology
Cloud TPU v2 Pod
HotChips 2019 40
HotChips 2019 41
HotChips 2019 42
● >100 pflops
● 32 TB HBM
Cloud TPU v3 Pod
HotChips 2019 43
Cloud TPU v3 Pod
TPU Codesign
● ML research: size and scope of models.
● Data Center: cooling, space and network provisioning, power delivery.
● Software: programmability for parallelism infrastructure.
HotChips 2019 44
Example Slicing of TPU Pod Hardware
HotChips 2019
Cloud TPU Sizes
45
https://cloud.google.com/tpu/docs/types-zones
HotChips 2019 46
HotChips 2019
Actual Models Map to TPU Architecture
Matrix Unit → High-FLOPS Ops
● Matrix multiplication (dense layers)
● Convolutions
Vector Units → Element-wise Ops
● Non-linearities
● Activations
● Dropout
● Pooling
● Parameter gradient updates
Scalar Unit
● TPU control
● Data flow
● Memory addressing (concats)
47
HotChips 2019
Linear Performance Scaling
48
Runs on TPU:

import tensorflow as tf
from tensorflow.keras import layers as l

def model_fn(images, labels, mode):
  conv = l.Conv2D(32, 5)(images)
  flat = l.Flatten()(conv)
  logits = l.Dense(10)(flat)
  loss = tf.losses.sparse_softmax_cross_entropy(labels, logits)
  optimizer = tf.train.GradientDescentOptimizer(learning_rate=0.3)
  # Aggregate gradients across all TPU cores on each step.
  optimizer = tf.contrib.tpu.CrossShardOptimizer(optimizer)
  train_op = optimizer.minimize(
      loss, global_step=tf.train.get_or_create_global_step())
  return tf.contrib.tpu.TPUEstimatorSpec(
      mode=mode, loss=loss, train_op=train_op)

Runs on CPU:

from tensorflow.contrib.cluster_resolver import TPUClusterResolver

def input_fn(params):
  # path and parse_function are defined elsewhere.
  ds = tf.data.TFRecordDataset(path)
  ds = ds.map(parse_function)
  ds = ds.cache().repeat()
  ds = ds.shuffle(buffer_size=50000)
  ds = ds.batch(params['batch_size'], drop_remainder=True)  # static shapes for XLA
  return ds

cr = TPUClusterResolver()
rc = tf.contrib.tpu.RunConfig(cluster=cr, model_dir=FLAGS.model_dir)
est = tf.contrib.tpu.TPUEstimator(model_fn=model_fn, config=rc,
                                  use_tpu=True, train_batch_size=1024)
est.train(input_fn=input_fn)
Zero code changes when scaling from a single device to a whole pod!
TPU Codesign
● ML research: influence abstractions.
● Compilers: JIT code-gen & parallel IRs.
● APIs: carefully constructed for distribution.
HotChips 2019 51-52
Cloud TPU System Architecture
[Diagram: a user VM running Python drives many Host + TPU Device pairs; training data streams in from Cloud Storage]
System Codesign
● Storage systems: high bandwidth disks, network, etc.
● Accelerators: infeed & asynchronous abstractions.
● Software: high throughput, software pipelining, horizontal scaling.
HotChips 2019
Streaming Data to Cloud TPUs at High Speed
Cloud Storage
A distributed file system optimized for extreme scale and low cost.
Compute Engine
Stream data from a Compute Engine VM's local file system or memory for maximum flexibility.
Cloud Bigtable
Scalable, low-latency, wide-column structured storage system.
53
HotChips 2019
Results
54
HotChips 2019
MLPerf 0.6: TopLine* Comparison (mlperf.org)
* TopLine: fastest convergence time (regardless of resource consumption)
E2E time in seconds (number of chips in parentheses):

Benchmark        TPUv3          Tesla V100     Speedup
ResNet-50        (1024)   77    (1536)   80        4%
SSD              (1024)   72     (480)  134       86%
Mask-RCNN         (128) 2136     (192) 1107      -48%
GNMT              (512)  127     (384)  108      -15%
Transformer      (1024)   51     (480)   95       86%
55
HotChips 2019
Performance & Optimization Tips
56
HotChips 2019
Performance & Optimization Tips
On-device execution
● Leverage the MXU size
● Avoid reshapes (XLA helps a lot!)
● Scale out for trivial data-parallel improvements (interconnect awesomeness!)
● More advanced parallelism available
57
HotChips 2019
Performance & Optimization Tips
Input pipelines
● Many parallel, large, streaming reads
● Software pipelining critical
● Tradeoffs among storage, bandwidth, and compute to tune the input pipeline
  ○ Leverage CPU parallelism
(A sketch of such a pipeline follows below.)
58
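A minimal sketch of these tips with TF 1.x-era tf.data (the file pattern and parse_function are hypothetical placeholders):

import tensorflow as tf

def input_fn(params):
  # Hypothetical file pattern; parse_function is defined elsewhere.
  files = tf.data.Dataset.list_files('gs://my-bucket/train-*.tfrecord')
  # Many parallel, large, streaming reads from Cloud Storage.
  ds = files.apply(tf.data.experimental.parallel_interleave(
      tf.data.TFRecordDataset, cycle_length=16))
  # Leverage CPU parallelism for decoding and augmentation.
  ds = ds.map(parse_function, num_parallel_calls=64)
  ds = ds.shuffle(buffer_size=10000).repeat()
  ds = ds.batch(params['batch_size'], drop_remainder=True)
  # Software pipelining: prepare the next batches while the TPU computes.
  return ds.prefetch(2)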
Use the profiler!
HotChips 2019 59
From: https://cloud.google.com/tpu/docs/cloud-tpu-tools#profile_tab
Performance & Optimization Tips
HotChips 2019
Codesign Summary
1. Necessary due to the End of Moore's Law (EoML)
2. Architect infrastructure for usability and scale!
3. Leverage all areas of expertise
60
HotChips 2019
We already co-design:
61
1. Data centers
2. Power & cooling systems
3. Board topology
4. On-chip compute units
5. Accelerator-accelerator networks
6. Numerics
7. Memory & cache hierarchies
8. Compilers
9. ML frameworks & APIs
10. Datacenter networks
11. Data storage hardware & software
12. File formats
13. Accelerator-host interfaces & bandwidth
14. Cluster scheduling systems
15. ML algorithms
16. Model partitioning & parallelism research
17. Security
HotChips 2019
What else can we co-design?
62
63
Co-design the programming language too
Check out Swift for TensorFlow at:
https://github.com/tensorflow/swift
Please join our open design reviews on Fridays! Sign up at: [email protected] (Google Group)
HotChips 2019
Thank you!
64