ENERGY-EFFICIENT ARCHITECTURES FOR EXASCALE...
Transcript of ENERGY-EFFICIENT ARCHITECTURES FOR EXASCALE...
![Page 1: ENERGY-EFFICIENT ARCHITECTURES FOR EXASCALE SYSTEMSimages.nvidia.com/events/sc15/pdfs/SC_15_Keckler_distribute.pdf · Start with an energy-efficient architecture . 8 CPU 130 pJ/flop](https://reader033.fdocuments.net/reader033/viewer/2022050410/5f879755adc2cc12c2142232/html5/thumbnails/1.jpg)
Dr. Stephen W. Keckler Senior Director of Architecture Research, NVIDIA
ENERGY-EFFICIENT ARCHITECTURES FOR EXASCALE SYSTEMS
![Page 2: ENERGY-EFFICIENT ARCHITECTURES FOR EXASCALE SYSTEMSimages.nvidia.com/events/sc15/pdfs/SC_15_Keckler_distribute.pdf · Start with an energy-efficient architecture . 8 CPU 130 pJ/flop](https://reader033.fdocuments.net/reader033/viewer/2022050410/5f879755adc2cc12c2142232/html5/thumbnails/2.jpg)
2
The Goal:
Sustained ExaFLOPS on Problems of Interest
...
at reasonable cost
![Page 3: ENERGY-EFFICIENT ARCHITECTURES FOR EXASCALE SYSTEMSimages.nvidia.com/events/sc15/pdfs/SC_15_Keckler_distribute.pdf · Start with an energy-efficient architecture . 8 CPU 130 pJ/flop](https://reader033.fdocuments.net/reader033/viewer/2022050410/5f879755adc2cc12c2142232/html5/thumbnails/3.jpg)
3 Source: C Moore, Data Processing in ExaScale-ClassComputer Systems, Salishan, April 2011
The End of Historic
Scaling
![Page 4: ENERGY-EFFICIENT ARCHITECTURES FOR EXASCALE SYSTEMSimages.nvidia.com/events/sc15/pdfs/SC_15_Keckler_distribute.pdf · Start with an energy-efficient architecture . 8 CPU 130 pJ/flop](https://reader033.fdocuments.net/reader033/viewer/2022050410/5f879755adc2cc12c2142232/html5/thumbnails/4.jpg)
4
2017
CORAL 150-300PF (5-10x)
11MW (1.1x) 14-27 GFLOPs/W (7-14x)
Lots of Threads
20PF 18,000 GPUs
10MW 2 GFLOPs/W ~107 Threads
You Are Here
1,000PF (50x) 72,000HCNs (4x)
20MW (2x) 50 GFLOPs/W (25x)
~1010 Threads (1000x) 2013
2023
![Page 5: ENERGY-EFFICIENT ARCHITECTURES FOR EXASCALE SYSTEMSimages.nvidia.com/events/sc15/pdfs/SC_15_Keckler_distribute.pdf · Start with an energy-efficient architecture . 8 CPU 130 pJ/flop](https://reader033.fdocuments.net/reader033/viewer/2022050410/5f879755adc2cc12c2142232/html5/thumbnails/5.jpg)
5
HETEROGENEOUS NODE System Interconnect
NoC
MC NIC
LOC
0
LOC
7
DRAM Stacks
TOC
1
TOC
2
TOC
3
TOC0
Lane
15
Lane
0
TPC0
L20 1MB
L21 1MB
L22 1MB
L23 1MB
TPC
127
NVL
ink
NVLink NoC
MC DRAM DIMMs LLC
NV RAM
![Page 6: ENERGY-EFFICIENT ARCHITECTURES FOR EXASCALE SYSTEMSimages.nvidia.com/events/sc15/pdfs/SC_15_Keckler_distribute.pdf · Start with an energy-efficient architecture . 8 CPU 130 pJ/flop](https://reader033.fdocuments.net/reader033/viewer/2022050410/5f879755adc2cc12c2142232/html5/thumbnails/6.jpg)
6
How do we get to 50GFlops/Watt?
![Page 7: ENERGY-EFFICIENT ARCHITECTURES FOR EXASCALE SYSTEMSimages.nvidia.com/events/sc15/pdfs/SC_15_Keckler_distribute.pdf · Start with an energy-efficient architecture . 8 CPU 130 pJ/flop](https://reader033.fdocuments.net/reader033/viewer/2022050410/5f879755adc2cc12c2142232/html5/thumbnails/7.jpg)
7
Start with an energy-efficient architecture
![Page 8: ENERGY-EFFICIENT ARCHITECTURES FOR EXASCALE SYSTEMSimages.nvidia.com/events/sc15/pdfs/SC_15_Keckler_distribute.pdf · Start with an energy-efficient architecture . 8 CPU 130 pJ/flop](https://reader033.fdocuments.net/reader033/viewer/2022050410/5f879755adc2cc12c2142232/html5/thumbnails/8.jpg)
8
CPU 130 pJ/flop (Vector SP)
Optimized for Latency Deep Cache Hierarchy
Haswell 22 nm
GPU 30 pJ/flop (SP) Optimized for Throughput
Explicit Management of On-chip Memory
Maxwell 28 nm
![Page 9: ENERGY-EFFICIENT ARCHITECTURES FOR EXASCALE SYSTEMSimages.nvidia.com/events/sc15/pdfs/SC_15_Keckler_distribute.pdf · Start with an energy-efficient architecture . 8 CPU 130 pJ/flop](https://reader033.fdocuments.net/reader033/viewer/2022050410/5f879755adc2cc12c2142232/html5/thumbnails/9.jpg)
9
CPU 2 nJ/flop (Scalar SP)
Optimized for Latency Deep Cache Hierarchy
Haswell 22 nm
GPU 30 pJ/flop (SP) Optimized for Throughput
Explicit Management of On-chip Memory
Maxwell 28 nm
![Page 10: ENERGY-EFFICIENT ARCHITECTURES FOR EXASCALE SYSTEMSimages.nvidia.com/events/sc15/pdfs/SC_15_Keckler_distribute.pdf · Start with an energy-efficient architecture . 8 CPU 130 pJ/flop](https://reader033.fdocuments.net/reader033/viewer/2022050410/5f879755adc2cc12c2142232/html5/thumbnails/10.jpg)
10
HOW IS POWER SPENT IN A CPU?
In-order Embedded OOO Hi-perf
Clock + Control Logic 24%
Data Supply 17%
Instruction Supply 42%
Register File 11%
ALU 6% Clock + Pins
45%
ALU 4%
Fetch 11%
Rename 10%
Issue 11%
RF 14%
Data Supply 5%
Dally [2008] (Embedded in-order CPU) Natarajan [2003] (Alpha 21264)
20pJ FP op è1nJ instruction
![Page 11: ENERGY-EFFICIENT ARCHITECTURES FOR EXASCALE SYSTEMSimages.nvidia.com/events/sc15/pdfs/SC_15_Keckler_distribute.pdf · Start with an energy-efficient architecture . 8 CPU 130 pJ/flop](https://reader033.fdocuments.net/reader033/viewer/2022050410/5f879755adc2cc12c2142232/html5/thumbnails/11.jpg)
11
Latency-Optimized Core (LOC)
PC
PC
Branch Predict
I$
Register Rename
ALU 1 ALU 2 ALU 4 ALU 3
Reorder Buffer
Instruction Window
Register File
ALU 1 ALU 2 ALU 4 ALU 3
I$
Select
Register File
PCs
Throughput-Optimized Core (TOC)
![Page 12: ENERGY-EFFICIENT ARCHITECTURES FOR EXASCALE SYSTEMSimages.nvidia.com/events/sc15/pdfs/SC_15_Keckler_distribute.pdf · Start with an energy-efficient architecture . 8 CPU 130 pJ/flop](https://reader033.fdocuments.net/reader033/viewer/2022050410/5f879755adc2cc12c2142232/html5/thumbnails/12.jpg)
12
How do we continue to scale energy efficiency
...in a world where technology scaling is diminished?
![Page 13: ENERGY-EFFICIENT ARCHITECTURES FOR EXASCALE SYSTEMSimages.nvidia.com/events/sc15/pdfs/SC_15_Keckler_distribute.pdf · Start with an energy-efficient architecture . 8 CPU 130 pJ/flop](https://reader033.fdocuments.net/reader033/viewer/2022050410/5f879755adc2cc12c2142232/html5/thumbnails/13.jpg)
13
Do Less Work
Eliminate waste and redundancy
Move fewer bits
Move data more efficiently
![Page 14: ENERGY-EFFICIENT ARCHITECTURES FOR EXASCALE SYSTEMSimages.nvidia.com/events/sc15/pdfs/SC_15_Keckler_distribute.pdf · Start with an energy-efficient architecture . 8 CPU 130 pJ/flop](https://reader033.fdocuments.net/reader033/viewer/2022050410/5f879755adc2cc12c2142232/html5/thumbnails/14.jpg)
14
DO LESS WORK Mixed Precision Arithmetic
1 11 52 double-precision
1 8 23 single-precision
1 5 10 half-precision 4x throughput
4x bandwidth 4x capacity
< ¼ energy/op
5x precision bits 60x range
Only use as much precision as you need Exploit mix of representations Scaled arithmetic
![Page 15: ENERGY-EFFICIENT ARCHITECTURES FOR EXASCALE SYSTEMSimages.nvidia.com/events/sc15/pdfs/SC_15_Keckler_distribute.pdf · Start with an energy-efficient architecture . 8 CPU 130 pJ/flop](https://reader033.fdocuments.net/reader033/viewer/2022050410/5f879755adc2cc12c2142232/html5/thumbnails/15.jpg)
15
ELIMINATE WASTE Temporal SIMT
32-wide datapath
time
1 cy
c
1 warp instruction = 32 threads
thread 0
thread 31
ld ld ld ld ld ld ld ld ld ld ld ld ld ld ld ld ld ld ld ld ld ld ld ld ld ld ld ld ld ld ld ld ld ld ld ld ld ld ld ld ld ld ld ld ld ld ld ld ld ld ld ld ld ld ld ld ld ld ld ld ld ld ld ld ml ml ml ml ad ad ad ad st st st st
ml ml ml ml ad ad ad ad st st st st
ml ml ml ml ad ad ad ad st st st st
ml ml ml ml ad ad ad ad st st st st
ml ml ml ml ad ad ad ad st st st st
ml ml ml ml ad ad ad ad st st st st
ml ml ml ml ad ad ad ad st st st st
ml ml ml ml ad ad ad ad st st st st
Spatial SIMT (current GPUs) 1-wide
time
1 cy
c
ld ld ld ld ld ld ld
0 (threads)
1 2 3 4 5 6 7 ld
ld ld
8 9
ld 10
Pure Temporal SIMT
![Page 16: ENERGY-EFFICIENT ARCHITECTURES FOR EXASCALE SYSTEMSimages.nvidia.com/events/sc15/pdfs/SC_15_Keckler_distribute.pdf · Start with an energy-efficient architecture . 8 CPU 130 pJ/flop](https://reader033.fdocuments.net/reader033/viewer/2022050410/5f879755adc2cc12c2142232/html5/thumbnails/16.jpg)
16
ELIMINATE WASTE Temporal SIMT
32-wide (41%)
4-wide (65%)
1-wide (100%)
Increase efficiency on divergent code
![Page 17: ENERGY-EFFICIENT ARCHITECTURES FOR EXASCALE SYSTEMSimages.nvidia.com/events/sc15/pdfs/SC_15_Keckler_distribute.pdf · Start with an energy-efficient architecture . 8 CPU 130 pJ/flop](https://reader033.fdocuments.net/reader033/viewer/2022050410/5f879755adc2cc12c2142232/html5/thumbnails/17.jpg)
17
ELIMINATE WASTE Variable Warp Sizing
Small warps
+ Improved perf for divergent code
+ Better SIMD utilization
Emulate wide warp HW
+ Wider converged execution
+ Memory locality/convergence
+ Reduced power (frontend)
scheduler
Time Gang schedule warps with same PC
Schedule several warps, different PCs scheduler
Rogers [ISCA 2015]
![Page 18: ENERGY-EFFICIENT ARCHITECTURES FOR EXASCALE SYSTEMSimages.nvidia.com/events/sc15/pdfs/SC_15_Keckler_distribute.pdf · Start with an energy-efficient architecture . 8 CPU 130 pJ/flop](https://reader033.fdocuments.net/reader033/viewer/2022050410/5f879755adc2cc12c2142232/html5/thumbnails/18.jpg)
18
ELIMINATE REDUNDANCY Scalarization
Lee [CGO 2013]
LD R2ç<A> LD R3çR2, 1 ADD R4çR3, 2
LD R2ç<A> LD R3çR2, 2 ADD R4çR3, 2
LD R2ç<A> LD R3çR2,3 ADD R4çR3, 2
LD R2ç<A> LD R3çR2, 4 ADD R4çR3, 2
scalar op vector load vector op
SIMT Execution
ADD R4çSR3, 2 ADD R4çSR3, 2 ADD R4çSR3, 2 ADD R4çSR3, 2
LD SR2ç<A> VLD SR3çSR2, 1
Scalarized SIMT Execution
![Page 19: ENERGY-EFFICIENT ARCHITECTURES FOR EXASCALE SYSTEMSimages.nvidia.com/events/sc15/pdfs/SC_15_Keckler_distribute.pdf · Start with an energy-efficient architecture . 8 CPU 130 pJ/flop](https://reader033.fdocuments.net/reader033/viewer/2022050410/5f879755adc2cc12c2142232/html5/thumbnails/19.jpg)
19
MOVE FEWER BITS
Small multi-ported register file
Capture locality of commonly used operands
Can reduce RF energy by 50%
Register File Cache (RFC)
S F U
M E M
T E X
Operand Routing
Operand Buffering
Main Register File 4x128-bit Banks (1R1W)
RFC 4x32-bit (3R1W) Banks
ALU
Gebhart [ISCA 2011]
![Page 20: ENERGY-EFFICIENT ARCHITECTURES FOR EXASCALE SYSTEMSimages.nvidia.com/events/sc15/pdfs/SC_15_Keckler_distribute.pdf · Start with an energy-efficient architecture . 8 CPU 130 pJ/flop](https://reader033.fdocuments.net/reader033/viewer/2022050410/5f879755adc2cc12c2142232/html5/thumbnails/20.jpg)
20
MOVE DATA MORE EFFICIENTLY Toggle-aware Compression
0x00003A00 0x8001D000 0x00003A01 0x8001D008 ...
4�bytes128Ͳbyte�Uncompressed�Cache�Line
4�bytes
8Ͳbyte�flit
0x00003A00 0x8001D000
0x00003A01 0x8001D008
XOR
Flit�0
Flit�1
=0000...00100...00100... #�Toggles�=�2
0x5�0x3A00�0x7�0x8001D000
128Ͳbyte�FPCͲcompressed�Cache�Line
8Ͳbyte�flit
5�3A00�7�80001D000�5�1D
XOR
Flit�0
Flit�1
=
001001111...110100011000 #�Toggles�=�31
0x5�0x3A01�0x7�0x8001D008 0x5�...
1�01�7�80001D008�5�3A02�1
Metadata
Compression can increase power consumption
Goal: reduce bus toggling
Pekhimenko [HPCA 2016]
![Page 21: ENERGY-EFFICIENT ARCHITECTURES FOR EXASCALE SYSTEMSimages.nvidia.com/events/sc15/pdfs/SC_15_Keckler_distribute.pdf · Start with an energy-efficient architecture . 8 CPU 130 pJ/flop](https://reader033.fdocuments.net/reader033/viewer/2022050410/5f879755adc2cc12c2142232/html5/thumbnails/21.jpg)
21
MINIMIZE DATA MOVEMENT
Reduces distance
Increases bandwidth
Offers opportunity to optimize signaling circuits
Packaging
High-bandwidth on-package memory
![Page 22: ENERGY-EFFICIENT ARCHITECTURES FOR EXASCALE SYSTEMSimages.nvidia.com/events/sc15/pdfs/SC_15_Keckler_distribute.pdf · Start with an energy-efficient architecture . 8 CPU 130 pJ/flop](https://reader033.fdocuments.net/reader033/viewer/2022050410/5f879755adc2cc12c2142232/html5/thumbnails/22.jpg)
22
MINIMIZE DATA MOVEMENT Heterogeneous DRAM Architectures
~200 GB/s
High BW Memory TOCs LOCs
~2TB/s ~200 GB/s
High Capacity Memory
~200 GB/s
High BW Memory TOCs
~2TB/s
High Capacity Memory
Challenges Exploiting all available bandwidth Maximizing locality for frequently accessed data
![Page 23: ENERGY-EFFICIENT ARCHITECTURES FOR EXASCALE SYSTEMSimages.nvidia.com/events/sc15/pdfs/SC_15_Keckler_distribute.pdf · Start with an energy-efficient architecture . 8 CPU 130 pJ/flop](https://reader033.fdocuments.net/reader033/viewer/2022050410/5f879755adc2cc12c2142232/html5/thumbnails/23.jpg)
23
MINIMIZE DATA MOVEMENT Software-managed Caching with On-Package Memory
0"
1"
2"
3"
4"
5"
6"
backp
rop(
bfs(
cns(
comd(
kmeans(
minife(
mumm
er(
needle(
pathfi
nder(
srad_v1(
xsbench(
geo:m
ean(
Throughp
ut(Rela=
ve(
to(No(Migra=o
n(
Legacy"CUDA" First8Touch"+"Range"Exp"+"BW"Balancing" ORACLE"
Strategies Aggressively migrate pages upon First-Touch to GDDR memory Pre-fetch neighbors of touched pages to reduce TLB shootdowns Throttle page migrations when nearing peak BW
Competitive with manual memory copy Close to “perfect” prefetch
Agarwal [HPCA 2015]
![Page 24: ENERGY-EFFICIENT ARCHITECTURES FOR EXASCALE SYSTEMSimages.nvidia.com/events/sc15/pdfs/SC_15_Keckler_distribute.pdf · Start with an energy-efficient architecture . 8 CPU 130 pJ/flop](https://reader033.fdocuments.net/reader033/viewer/2022050410/5f879755adc2cc12c2142232/html5/thumbnails/24.jpg)
24
MINIMIZE DATA MOVEMENT
Tag overhead: hundreds of MB
Alloy tag and data in same DRAM row (Micro12)
Cache organization: optimize for bandwidth
Direct mapped, consecutive sets in same row
Results
Fine-grained transfers good for lower locality apps
Can eliminate some page migration overheads
Hardware Managed DRAM Cache RO
W BU
FFER
Tag Set Index Offset
16-byte DRAM bus
29-bits 5-bits RO
W BU
FFER
1GB D
RAM
(512M lines)
tag*
(2-byte) data
(64-byte)
Only medium of accessing DRAM array
![Page 25: ENERGY-EFFICIENT ARCHITECTURES FOR EXASCALE SYSTEMSimages.nvidia.com/events/sc15/pdfs/SC_15_Keckler_distribute.pdf · Start with an energy-efficient architecture . 8 CPU 130 pJ/flop](https://reader033.fdocuments.net/reader033/viewer/2022050410/5f879755adc2cc12c2142232/html5/thumbnails/25.jpg)
25
LOOMING MEMORY POWER CRISIS
Column Power
120W
160W
![Page 26: ENERGY-EFFICIENT ARCHITECTURES FOR EXASCALE SYSTEMSimages.nvidia.com/events/sc15/pdfs/SC_15_Keckler_distribute.pdf · Start with an energy-efficient architecture . 8 CPU 130 pJ/flop](https://reader033.fdocuments.net/reader033/viewer/2022050410/5f879755adc2cc12c2142232/html5/thumbnails/26.jpg)
26
SUMMARY
Do Less Work
Eliminate waste and redundancy
Move fewer bits
Move data more efficiently
![Page 27: ENERGY-EFFICIENT ARCHITECTURES FOR EXASCALE SYSTEMSimages.nvidia.com/events/sc15/pdfs/SC_15_Keckler_distribute.pdf · Start with an energy-efficient architecture . 8 CPU 130 pJ/flop](https://reader033.fdocuments.net/reader033/viewer/2022050410/5f879755adc2cc12c2142232/html5/thumbnails/27.jpg)