FPR/VR Register Overlay & SIMD Overview
Transcript of slides: arith23.gforge.inria.fr/slides/Schwarz.pdf
IBM Accelerators
Eric Schwarz, July 11, 2016
© 2016 IBM Corporation
Outline
Roadmaps of Z and Power
Arithmetic Feature Comparison
How to Get Performance without Frequency
z Systems - Processor Roadmap
[Chip floorplan diagram: four cores with private L2 caches, L3_0/L3_1 controllers, co-processors (CoP), MCU, memory controllers (MC), GX and memory I/Os, L3B]
z10 EC (2/2008)
Workload Consolidation and Integration Engine for CPU-Intensive Workloads
- Decimal FP
- InfiniBand
- 64-CP Image
- Large Pages
- Shared Memory

z196 (9/2010)
Top Tier Single Thread Performance, System Capacity
- Accelerator Integration
- Out of Order Execution
- Water Cooling
- PCIe I/O Fabric
- RAIM
- Enhanced Energy Management

zEC12 (8/2012)
Leadership Single Thread, Enhanced Throughput
- Improved Out-of-Order
- Transactional Memory
- Dynamic Optimization
- 2 GB Page Support
- Step Function in System Capacity

z13 (1/2015)
Leadership System Capacity and Performance
- Modularity & Scalability
- Dynamic SMT (supports two instruction threads)
- SIMD
- PCIe-Attached Accelerators (XML)
- Business Analytics Optimized
[Chart: frequency and per-core capacity trend across z Systems generations, 2000-2015]

Machine | Year | Process | Cores** | Frequency | Highlights
z900 | 2000 | 180 nm SOI | 16 | 770 MHz | Full 64-bit z/Architecture
z990 | 2003 | 130 nm SOI | 32 | 1.2 GHz | Superscalar, modular SMP
z9 EC | 2005 | 90 nm SOI | 54 | 1.7 GHz | System-level scaling
z10 EC | 2008 | 65 nm SOI | 64 | 4.4 GHz | High-frequency core, 3-level cache
z196 | 2010 | 45 nm SOI | 80 | 5.2 GHz | OOO core, eDRAM cache, RAIM memory, zBX integration
zEC12 | 2012 | 32 nm SOI | 101 | 5.5 GHz | OOO and eDRAM cache improvements, PCIe Flash, arch extensions for scaling
z13 | 2015 | 22 nm SOI | 141 | 5.0 GHz | SMT & SIMD, up to 10 TB of memory

Per-core capacity* kept climbing even as frequency flattened: roughly 902 (z10, +50% on a +159% GHz jump), 1202 (z196, +33% on +18% GHz), 1514 (zEC12, +26% on +6% GHz), and 1695 (z13, +12% despite -9% GHz).
z13 Continues the CMOS Mainframe Heritage Begun in 1994
* MIPS tables are not adequate for making comparisons of z Systems processors; additional capacity planning is required.
** Number of PU cores for customer use.
POWER8 Processor Chip
Caches
• 512 KB SRAM L2 per core
• 96 MB eDRAM shared L3
• Up to 128 MB eDRAM L4 (off-chip)
Memory
• Up to 230 GB/s sustained bandwidth
Bus Interfaces
• Durable open memory attach interface
• Integrated PCIe Gen3
• SMP Interconnect
• CAPI (Coherent Accelerator Processor Interface)
Cores
• 12 cores (SMT8)
• 8-wide dispatch, 10-wide issue, 16 execution pipes
• 2X internal data flows/queues
• Enhanced prefetching
• 64K data cache, 32K instruction cache
Accelerators
• Crypto & memory expansion
• Transactional Memory
• VMM assist
• Data Move / VM Mobility
Energy Management
• On-chip Power Management Micro-controller
• Integrated per-core VRM
• Critical Path Monitors
Technology
• 22 nm SOI, eDRAM, 15 metal layers, 650 mm²
[Chip diagram: 12 cores with private L2s surrounding the L3 cache & chip interconnect (8M L3 region per core), plus memory controllers, local SMP links, and accelerators]
From HotChips 2013 Presentation
POWER8 Core Microarchitecture
[Core floorplan labels: IFU, ISU, LSU, FXU, VSU, DFU]
Larger Caching Structures vs. POWER7
• 2x L1 data cache (64 KB)
• 2x outstanding data cache misses
• 4x translation cache
Wider Load/Store
• 32B → 64B L2-to-L1 data bus
• 2x data-cache-to-execution dataflow
Enhanced Prefetch
• Instruction speculation awareness
• Data prefetch depth awareness
• Adaptive bandwidth awareness
• Topology awareness
Execution Improvement vs. POWER7
• SMT4 → SMT8
• 8-wide dispatch
• 10-wide issue
• 16 execution pipes:
  2 FXU, 2 LSU, 2 LU, 4 FPU, 2 VMX, 1 Crypto, 1 DFU, 1 CR, 1 BR
• Larger issue queues (4 x 16-entry)
• Larger global completion and load/store reorder queues
• Improved branch prediction
• Improved unaligned storage access
Core Performance vs. POWER7
~1.6x Single Thread
~2x Max SMT
From HotChips 2013 Presentation
[Pipeline comparison diagram]
POWER7 core base: 32k D$, 256k L2$, 32MB L3$; issue ports: 2 LS, 2 FX, 1 BR, 1 CR, 1 (FP, ALU, CX), 1 (FP, PM, DF)
Enhanced POWER8 core: 64k D$, 512k L2$, 96MB L3$, plus L4$ in the memory buffer; issue ports: 2 LS, 2 LU, 2 FX, 2 (FP, ALU, PM, DF), 1 CR, 1 BR
Callouts: improved branch prediction, Crypto (AES, SHA) support, deeper out-of-order processing, more execution bandwidth, bigger caches, wider dispatch & issue
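To make the added FPU/VMX/VSU bandwidth concrete, here is a minimal sketch of the kind of data-parallel kernel those pipes execute, written with the AltiVec/VSX intrinsics from <altivec.h> (assuming GCC or Clang with VSX enabled, e.g. -mcpu=power8; the function and variable names are illustrative, not from the slides):

```c
#include <altivec.h>
#include <stdio.h>

/* Illustrative axpy kernel: y[i] += a * x[i]. Two doubles fit in one
   128-bit VSX register, so each iteration feeds a vector pipe with two
   FP64 lanes instead of issuing two scalar FPU operations. */
static void axpy_vsx(double a, const double *x, double *y, int n)
{
    vector double va = vec_splats(a);                  /* broadcast a to both lanes */
    for (int i = 0; i + 2 <= n; i += 2) {
        vector double vx = vec_xl(0, (double *)&x[i]); /* unaligned 16B load */
        vector double vy = vec_xl(0, &y[i]);
        vy = vec_madd(va, vx, vy);                     /* fused multiply-add per lane */
        vec_xst(vy, 0, &y[i]);
    }
    /* A scalar tail loop would handle odd n; omitted for brevity. */
}

int main(void)
{
    double x[4] = {1, 2, 3, 4}, y[4] = {10, 20, 30, 40};
    axpy_vsx(0.5, x, y, 4);
    for (int i = 0; i < 4; i++) printf("%g ", y[i]);   /* 10.5 21 31.5 42 */
    printf("\n");
    return 0;
}
```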
z13 Instr / Execution Dataflow

[Dataflow diagram: the I$ feeds a Branch Q and instruction decode / crack / dispatch / map into two issue-queue sides (IssQ side0 / side1). Each side has a GR file (GR 0 / GR 1), an LSU pipe, FXU a/b pipes, a BFU, a DFU, a 128b string/int SIMD unit (VBU0/VBU1), a vector FP unit (VFU0/VFU1), and a Vector/FPR register file, all sharing the D$.]

- Additional instruction flow for higher core throughput
- Additional execution units for higher core throughput
- New architected registers / execution units to accelerate business analytics workloads

* FXa pipes execute register writers and support back-to-back execution to themselves; FXb pipes execute non-register writers and non-relative branches (needs 3w AGEN)
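The 128b string/int SIMD pipes above are what accelerate byte-parallel kernels such as string scanning. As a rough portable illustration, here is the idea in GNU C generic vector extensions, which compile down to the z13 vector facility under -march=z13 (or to VMX/VSX on POWER); the function name is mine, and real z code would use the vector-string instructions rather than the scalar reduction loop:

```c
#include <stdio.h>
#include <string.h>

/* 16 bytes per 128-bit vector register */
typedef unsigned char v16qu __attribute__((vector_size(16)));
typedef signed char   v16qs __attribute__((vector_size(16)));

/* Illustrative strlen-style scan: examine 16 bytes per iteration instead
   of 1, mirroring what a 128b string unit does in a single operation.
   Assumes the buffer is padded so 16-byte reads past the NUL are safe. */
static size_t scan_nul(const unsigned char *s)
{
    v16qu zero = {0};
    for (size_t i = 0; ; i += 16) {
        v16qu chunk;
        memcpy(&chunk, s + i, 16);         /* unaligned 16-byte load */
        v16qs hit = (chunk == zero);       /* lanewise compare: -1 where NUL */
        for (int lane = 0; lane < 16; lane++)   /* reduce: first NUL lane */
            if (hit[lane])
                return i + lane;
    }
}

int main(void)
{
    unsigned char buf[32] = "vector string ops";  /* zero-padded tail */
    printf("%zu\n", scan_nul(buf));               /* prints 17 */
    return 0;
}
```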
Features of Recent IBM FPUs

Chip (year) | GHz / FO4 / process | BFU pipe | Cores/chip | Order | DFP add | BFU pipes (DP-SP-VDP) | Features added | Problems
P5 (2004) | 2.2 / 23 / 130nm | 6 | 2 | OOO | N/A | 2-2-0 | |
P6 (2007) | 5.0 / 13 / 65nm | 6 | 2 | InO | | 2-4-2 | | 13 FO4 in-order
P7 (2010) | 4.1 / 20 / 45nm | 6 | 8 | OOO | +10 | 2-4-4 | OOO, more VRs | DFU too far
P7+ (2012) | 4.2 / 20 / 32nm | 6 | 8 | OOO | +10 | 2-8-4 | 2 SP per DP |
P8 (2014) | 4.1 / 20 / 22nm | 6 | 12 | OOO | +5 | 2-8-4 | enh DFU + CAPI + SMT8 |
z9 (2005) | 1.7 / 27 / 90nm | 5 | 1 | InO | firmware | 1-1-0 | | software DFU
z10 (2007) | 4.4 / 15 / 65nm | 7 | 2 | InO | 16b-31w | 1-1-0 | | 15 FO4 in-order
z196 (2010) | 5.2 / 16 / 45nm | 8 | 4 | OOO | 12l-7t | 1-1-0 | OOO screams | more BFUs needed
zEC12 (2012) | 5.5 / 16 / 32nm | 8 | 6 | OOO | 12l-7t | 1-1-0 | zoned to DFU | need more int registers
z13 (2015) | 5.0 / 18 / 22nm | 8 | 8 | OOO | 8l-1t | 2-2-2 | enh DFU, SIMD, VRs, SMT2 |
Features of Recent IBM FPUs (continued)

Chip (year) | GHz / FO4 / process | BFU pipe | Cores/chip | SIMD | Other
P5 (2004) | 2.2 / 23 / 130nm | 6 | 2 | 32 x 64b FPRs, scalar | 2w SMT
P6 (2007) | 5.0 / 13 / 65nm | 6 | 2 | VMX: 32 x 128b VRs + FPRs | 2w SMT
P7 (2010) | 4.1 / 20 / 45nm | 6 | 8 | VSU: 64 x 128b VRs, 16b vector int MPY | 4w SMT
P7+ (2012) | 4.2 / 20 / 32nm | 6 | 8 | VSU + RNG | 4w SMT
P8 (2014) | 4.1 / 20 / 22nm | 6 | 12 | VSU + Crypto/AES + 32b vector int MPY | 8w SMT
z9 (2005) | 1.7 / 27 / 90nm | 5 | 1 | 16 x 64b FPRs, scalar | COP: CMPR + CRYPTO
z10 (2007) | 4.4 / 15 / 65nm | 7 | 2 | FPRs |
z196 (2010) | 5.2 / 16 / 45nm | 8 | 4 | | OOE
zEC12 (2012) | 5.5 / 16 / 32nm | 8 | 6 | |
z13 (2015) | 5.0 / 18 / 22nm | 8 | 8 | 32 x 128b VRs + VSU (Int/FP/String), 32b vector int MPY | COP, 2w SMT
Decimal Floating Point Unit Evolution in HDW since 2006
Pipelining progression: iterative → partially pipelined → fully pipelined addition
• z990 GA2 (2006): software
• Power6 (2006): in order, 13 FO4
• z10 (2007): 15 FO4
• Power7 (2010): out of order, 20 FO4
• z196 (2010): OOE, 2X FP performance
• Power7+ (2012)
• zEC12 (2012)
• Power8 (2014): SRT radix-16
• z13 (2015): SMT2, SIMD
Moved into the DFU over time: QP binary (software, then firmware, now in the DFU), QP signed BCD addition, and fixed-point divide
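The payoff of hardware DFP is exact decimal arithmetic. A minimal sketch, assuming a GCC toolchain with _Decimal64 support (on machines with a DFU this maps to hardware; elsewhere it is emulated in software):

```c
#include <stdio.h>

int main(void)
{
    /* Binary FP cannot represent 0.1 or 0.2 exactly; decimal FP can. */
    double     b = 0.1 + 0.2;
    _Decimal64 d = 0.1dd + 0.2dd;   /* dd suffix: _Decimal64 literal */

    printf("binary : 0.1 + 0.2 == 0.3 ? %s\n", (b == 0.3)   ? "yes" : "no"); /* no  */
    printf("decimal: 0.1 + 0.2 == 0.3 ? %s\n", (d == 0.3dd) ? "yes" : "no"); /* yes */
    return 0;
}
```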
FPU Architecture Advances
- Quad-precision hex in hardware for more years than I've been at IBM
- Quad-precision binary since 1998
- Integer FMA (56 x 56 + 112) in hardware since 2003 (see the sketch below)
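An n x n + 2n integer fused multiply-add like the 56 x 56 + 112 above is exactly the primitive needed to assemble wider multiplies. A sketch of the same decomposition at C-friendly widths, using 32 x 32 + 64 steps to build a 64 x 64 → 128-bit product (helper name is mine):

```c
#include <stdio.h>
#include <stdint.h>

/* Build a 64 x 64 -> 128-bit multiply out of 32 x 32 + 64 integer FMAs,
   the same decomposition a 56 x 56 + 112 dataflow enables for
   quad-precision and extended-precision multiplication. */
static void mul64x64(uint64_t a, uint64_t b, uint64_t *hi, uint64_t *lo)
{
    uint64_t a0 = (uint32_t)a, a1 = a >> 32;
    uint64_t b0 = (uint32_t)b, b1 = b >> 32;

    uint64_t p00  = a0 * b0;                      /* 32 x 32 -> 64 */
    uint64_t mid  = a1 * b0 + (p00 >> 32);        /* 32 x 32 + 64: one FMA */
    uint64_t mid2 = a0 * b1 + (uint32_t)mid;      /* another FMA */
    *hi = a1 * b1 + (mid >> 32) + (mid2 >> 32);   /* accumulate carries */
    *lo = (mid2 << 32) | (uint32_t)p00;
}

int main(void)
{
    uint64_t hi, lo;
    mul64x64(0xFFFFFFFFFFFFFFFFULL, 0xFFFFFFFFFFFFFFFFULL, &hi, &lo);
    /* (2^64 - 1)^2 = 0xFFFFFFFFFFFFFFFE_0000000000000001 */
    printf("%016llx %016llx\n", (unsigned long long)hi, (unsigned long long)lo);
    return 0;
}
```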
z196 - 2010
- BFP: new rounding mode (FPC bit 29), "truncate and OR inexactness"
  Supports SP A <= SP B + DP C with one rounding error (see the sketch below)
- IEEE 754-2008 heterogeneous support
- DFP quantum exception
  Tells software that is emulating a greater precision and range whether the hardware precision (16 or 34 digits, or exponent > 398 or 6176) is exceeded
  0.5 mask, 1.5 flag, new DXC code
  Clamped or rounded
  Replaces the use of Test Data Group for every operation
- Conversions to/from integer for BFP/DFP
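Why the one-rounding-error property matters: computing a single-precision result through a double-precision intermediate can round twice and come out wrong in the last bit. A small demonstration in plain C99 (values chosen to land exactly on the double-rounding tie; the comments sketch how the truncate-and-OR mode avoids it):

```c
#include <stdio.h>

int main(void)
{
    double b = 1.0;
    double c = 0x1p-24 + 0x1p-53;   /* exactly representable in double */

    /* Exact sum: 1 + 2^-24 + 2^-53, just above the SP tie point 1 + 2^-24.
       Correct single rounding to SP: 1 + 2^-23 (0x1.000002p+0). */

    double d  = b + c;    /* 1st rounding: ties-to-even -> 1 + 2^-24 */
    float  sp = (float)d; /* 2nd rounding: tie again -> 1.0, wrong last bit */

    printf("double-rounded SP result: %a\n", sp);          /* 0x1p+0 */
    printf("correctly rounded SP   : %a\n", 0x1.000002p0f);

    /* The "truncate and OR inexactness" mode (FPC bit 29) avoids this:
       the DP add truncates and ORs the inexact flag into the LSB, making
       the intermediate 1 + 2^-24 + 2^-52, which is no longer a tie, so
       the final conversion to SP rounds up correctly. */
    return 0;
}
```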
zEC12 - 2012
- Conversions between zoned decimal and DFP
More Changes
z13 - 2015
- SMT2 and doubled execution units
- Non-blocking, separate divide pipeline
- 4X the register-file size
- SIMD integer and string operations
- SIMD floating point (BFP DP)
- Move away from condition codes and branches
- 128-bit integer adds and beyond (carry-in and carry-out support); see the sketch below
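A minimal sketch of what carry-in/carry-out support buys: chaining wide adds into arbitrarily wide integer arithmetic without condition codes or branches. GNU C's unsigned __int128 stands in here for the vector quadword add-with-carry style operations (helper names are mine):

```c
#include <stdio.h>

typedef unsigned __int128 u128;

/* 128-bit add producing a carry-out, mirroring a quadword
   "add with carry / compute carry" primitive. */
static u128 add128(u128 a, u128 b, unsigned cin, unsigned *cout)
{
    u128 s = a + b;
    unsigned c = (s < a);        /* carry-out of a + b        */
    s += cin;
    c |= (s < (u128)cin);        /* carry-out of adding cin   */
    *cout = c;
    return s;
}

int main(void)
{
    /* 256-bit operands as {hi, lo} pairs; add lo, then hi with the carry. */
    u128 alo = ~(u128)0, ahi = 5;   /* a = 5 * 2^128 + (2^128 - 1) */
    u128 blo = 1,        bhi = 7;   /* b = 7 * 2^128 + 1           */
    unsigned c;
    u128 slo = add128(alo, blo, 0, &c);  /* low half wraps, c = 1  */
    u128 shi = add128(ahi, bhi, c, &c);  /* 5 + 7 + 1 = 13         */

    printf("hi=%llu lo=%llu carry=%u\n",
           (unsigned long long)shi, (unsigned long long)slo, c);
    /* prints: hi=13 lo=0 carry=0 */
    return 0;
}
```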
Trends
Memory is getting relatively slower
Frequency is constant
Power is important: clock/power gating
More parallel hardware/software
- SMT and SIMD/Vector
- Specialized accelerators
- FPGAs
Lines will blur with GPUs
Processing power is needed where the data is stored
Future Systems
- Core with multiple threads
- GPU: probably on chip
- FPGA (OpenPOWER): possibly both on chip and off chip
POWER8 CAPI
FPGA accelerator attachment
[Diagram: POWER8 chip with cores, memory controller (MC), and CAPP on the PowerBus, attached over PCIe to an FPGA; the FPGA hosts the accelerated application behind a Host Service Layer providing translate caches, interrupts, and fault handling, with data and control flowing between chip and FPGA]
CAPP: Coherently Attached Processor Proxy
• Provides a PowerBus surrogate
• Holds a directory of the cache lines used by the accelerator
• Communicates with the accelerator over an industry-standard PCIe electrical interface
• Coherency across the full memory range
• POWER Effective Address (EA) translation
Host Service Layer
• Provides the interface to the user application layer
• Responds to snoop commands of interest from the PowerBus
• Performs address translation and table walks
Standard System Topology Preserved
• Accelerators do not consume a processor socket
• Accelerators are a system option, not a configuration
• Accelerators do not reduce system memory
Performance
Not getting any faster (GHz), so gains must come from parallelism
Parallelism
- Reduce scalar bottlenecks: scatter/gather of memory elements
  Needs parallel cache fetch/store engines
- Mostly parallel execution: predicated execution, avoiding branches (see the sketch below)
- Easy programming model
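The sketch referred to above: predication replaces a data-dependent branch with a mask and select, so every element runs the same straight-line code (plain C; vector select instructions implement the same pattern per lane):

```c
#include <stdio.h>
#include <stdint.h>

/* Branchy form: the compare feeds a conditional branch, which mispredicts
   on irregular data and serializes SIMD execution. */
static void clamp_branchy(int32_t *x, int n, int32_t lim)
{
    for (int i = 0; i < n; i++)
        if (x[i] > lim)
            x[i] = lim;
}

/* Predicated form: build an all-ones/all-zeros mask from the compare and
   blend, with no branch in the loop body. */
static void clamp_predicated(int32_t *x, int n, int32_t lim)
{
    for (int i = 0; i < n; i++) {
        int32_t m = -(x[i] > lim);          /* 0 or 0xFFFFFFFF        */
        x[i] = (lim & m) | (x[i] & ~m);     /* select without a branch */
    }
}

int main(void)
{
    int32_t a[5] = {3, 9, 1, 7, 5}, b[5] = {3, 9, 1, 7, 5};
    clamp_branchy(a, 5, 4);
    clamp_predicated(b, 5, 4);
    for (int i = 0; i < 5; i++) printf("%d/%d ", a[i], b[i]);
    printf("\n");   /* 3/3 4/4 1/1 4/4 4/4 */
    return 0;
}
```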
Final Note: Reliability
With all this parallelism, something will go wrong
Many parallel executions need to be checked; small per-unit error rates compound across thousands of units, on the order of (1.00001)^1000
- Duplication: physical vs. time redundancy
- Residue and parity checking (see the sketch below)
Dynamic sparing, or physical replacement with a common FRU
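As a sketch of the residue checking mentioned above: residues are preserved by addition, (a + b) mod m = ((a mod m) + (b mod m)) mod m, so an adder can be checked concurrently by a much smaller mod-m unit. A toy mod-3 checker in plain C (real hardware computes residues with small parallel trees rather than the % operator):

```c
#include <stdio.h>
#include <stdint.h>

/* Residue check for a 64-bit adder: the mod-3 residue of the result must
   match the mod-3 sum of the operands' residues, corrected for wraparound.
   A single flipped result bit changes the value by +/-2^k, and 2^k mod 3
   is never 0, so any single-bit fault is detected. */
static unsigned res3(uint64_t v) { return (unsigned)(v % 3); }

static int checked_add(uint64_t a, uint64_t b, uint64_t *sum)
{
    uint64_t s = a + b;
    unsigned carry_out = (s < a);     /* wraparound drops 2^64, and 2^64 mod 3 = 1,
                                         so add 2*carry_out to compensate (mod 3) */
    unsigned expect = (res3(a) + res3(b) + 2 * carry_out) % 3;
    *sum = s;
    return res3(s) == expect;         /* 1 = check passed */
}

int main(void)
{
    uint64_t s;
    int ok = checked_add(0xFFFFFFFFFFFFFFFFULL, 2, &s);   /* wraps to 1 */
    printf("sum=%llu check=%s\n", (unsigned long long)s, ok ? "pass" : "FAIL");

    /* Inject a single-bit fault into the "datapath" result: */
    s ^= 1ull << 17;
    printf("faulty residue mismatch: %s\n",
           res3(s) == (res3(0xFFFFFFFFFFFFFFFFULL) + res3(2) + 2) % 3
               ? "missed" : "detected");   /* detected */
    return 0;
}
```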