Radiation tolerance of commercial 130nm CMOS technologies for High Energy Physics Experiments
Power5: IBM's Next Generation Power Microprocessor...POWER4 --- Shipped in Systems December 2001...
Transcript of Power5: IBM's Next Generation Power Microprocessor...POWER4 --- Shipped in Systems December 2001...
© 2003 IBM Corporation
Simultaneous Multi-threading Implementation in POWER5 -- IBM's Next Generation POWER Microprocessor
Ron Kalla, Balaram Sinharoy, Joel TendlerIBM Systems Group
© 2003 IBM CorporationHotchips 15, August 2003
SMT Implementation in POWER5
Outline
MotivationBackgroundThreading FundamentalsEnhanced SMT Implementation in POWER5 Additional SMT ConsiderationsSummary
© 2003 IBM CorporationHotchips 15, August 2003
SMT Implementation in POWER5
Microprocessor Design Optimization Focus Areas
Memory latencyIncreased processor speeds make memory appear further away
Longer stalls possibleBranch processing
Mispredict more costly as pipeline depth increases resulting in stalls and wasted power
Predication drives increased power and larger chip areaExecution Unit Utilization
Currently 20-25% execution unit utilization commonSimultaneous multi-threading (SMT) and POWER architecture address these areas
© 2003 IBM CorporationHotchips 15, August 2003
SMT Implementation in POWER5
POWER4 --- Shipped in Systems December 2001
Technology: 180nm lithography, Cu, SOI
POWER4+ shipping in 130nm todayDual processor core8-way superscalar
Out of Order execution
2 Load / Store units
2 Fixed Point units
2 Floating Point units
Logical operations on Condition Register
Branch Execution unit> 200 instructions in flightHardware instruction and data prefetch L3
Dire
ctor
y/C
ontr
ol
L2 L2 L2
LSU LSUIFUBXU
IDU IDU
IFUBXU
FPU FPU
FXU
FXUISU ISU
© 2003 IBM CorporationHotchips 15, August 2003
SMT Implementation in POWER5
POWER5 --- The Next Step
Technology: 130nm lithography, Cu, SOIDual processor core8-way superscalarSimultaneous multithreaded (SMT) core
Up to 2 virtual processors per real processor
24% area growth per core for SMT
Natural extension to POWER4 design
© 2003 IBM CorporationHotchips 15, August 2003
SMT Implementation in POWER5
Multi-threading Evolution
FX0FX1FP0FP1LS0LS1BRXCRL
Single ThreadFX0FX1FP0FP1LS0LS1BRXCRL
Coarse Grain Threading
FX0FX1FP0FP1LS0LS1BRXCRL
Fine Grain Threading
FX0FX1FP0FP1LS0LS1BRXCRL
Simultaneous Multi-Threading
Thread 1 ExecutingThread 0 Executing No Thread Executing
© 2003 IBM CorporationHotchips 15, August 2003
SMT Implementation in POWER5
Changes Going From ST to SMT CoreSMT easily added to Superscalar Micro-architecture
Second Program Counter (PC) added to share I-fetch bandwidthGPR/FPR rename mapper expanded to map second set of registers (High order address bit indicates thread)Completion logic replicated to track two threadsThread bit added to most address/tag buses
BR, CRLUnits
FPUsFPUs
FXUs,LSUs
CR, LR,CTR
FPRs
GPRsIntegerIssue Qs
FPIssue Qs
BR/CRUIssue Qs
RegisterRename
FetchUnit
I-Cache
Decode
PC
BR, CRLUnits
FPUsFPUs
FXUs,LSUs
CR, LR,CTR
FPRs
GPRs
PC
IntegerIssue Qs
FPIssue Qs
BR/CRUIssue Qs
RegisterRename
FetchUnit
I-Cache
Decode
DataCache
© 2003 IBM CorporationHotchips 15, August 2003
SMT Implementation in POWER5
Resource Sizes
~~
Analysis done to optimize every micro-architectural resource size
GPR/FPR rename pool size
I-fetch buffers
Reservation Station
SLB/TLB/ERAT
I-cache/D-cacheMany Workloads examinedAssociativity also examined
50 60 70 80 90 100 110 120 13080 120
Number of GPR Renames
IPC
ST SMT
Results based on simulation of an online transaction processing applicationVertical axis does not originate at 0
© 2003 IBM CorporationHotchips 15, August 2003
SMT Implementation in POWER5
Resource Sharing
Results based on simulation of an online transaction processing application
Threads share many resources Global Completion Table, BHT, TLB, . . .
Higher performance realized when resources balanced across threads
Tendency to drift toward extremes accompanied by reduced performance
Solution: Dynamically adjust resource utilization
0
2
4
6
8
10
Rel
ativ
e O
ccur
renc
e
0 5 10 15 20
05
1015
20
Thread 0
Thread 1
With dynamic resource utilization
adjustment
0
2
4
6
8
10
Rela
tive
Occ
urre
nce
0 5 10 15 20
05
1015
20
Thread 0
Thread 1
Without dynamic resource utilization
adjustment
Global Completion Table Occupancy
© 2003 IBM CorporationHotchips 15, August 2003
SMT Implementation in POWER5
Resource Sharing
0
2
4
6
8
10
Re
lati
ve
Oc
cu
rre
nc
e
0 5 10 15 20
05
1015
20
Thread 0
Threa
d 1
0
2
4
6
8
10
Re
lati
ve
Oc
cu
rre
nc
e
0 5 10 15 20
05
1015
20
Thread 0
Thr
ead
1
Threads share many resources Global Completion Table, BHT, TLB, . . .
Higher performance realized when resources balanced across threads
Tendency to drift toward extremes accompanied by reduced performance
Solution: Dynamically adjust resource utilization
Global Completion Table Occupancy
Results based on simulation of an online transaction processing application
© 2003 IBM CorporationHotchips 15, August 2003
SMT Implementation in POWER5
Thread PriorityInstances when unbalanced execution desirable
No work for opposite thread
Thread waiting on lock
Software determined non uniform balance
Power management
…Solution: Control instruction decode rate
Software/hardware controls 8 priority levels for each thread
0001111122
IPC
-5 -3 -1 0 1 3 50,7 7,0 1,1
Thread 1 Priority - Thread 0 Priority
Thread 0 IPC Thread 1 IPCPower Save Mode
Single Thread Mode
© 2003 IBM CorporationHotchips 15, August 2003
SMT Implementation in POWER5
Dynamic Thread Switching
Thread StatesUsed if no task ready for second thread to runAllocates all machine resources to one threadInitiated by softwareDormant thread wakes up on:
External interruptDecrementer interruptSpecial instruction from active thread
Active
Dormant
Null
software
hardware or software
software
software
© 2003 IBM CorporationHotchips 15, August 2003
SMT Implementation in POWER5
Single Thread Operation
Advantageous for execution unit limited applications
Floating or fixed point intensive workloads
Execution unit limited applications provide minimal performance leverage for SMT
Extra resources necessary for SMT provide higher performance benefit when dedicated to single thread
Determined dynamically on a per processor basis
POW
ER4+
POW
ER5
ST
POW
ER5
SMT
IPC
Matrix Multiply
© 2003 IBM CorporationHotchips 15, August 2003
SMT Implementation in POWER5
Other SMT Considerations
Power ManagementSMT Increases execution unit utilization
Dynamic power management does not impact performanceDebug tools / Lab bring-up
Instruction tracing
Hang detection
Forward progress monitorPerformance MonitoringServiceability
© 2003 IBM CorporationHotchips 15, August 2003
SMT Implementation in POWER5
Summary
POWER5 SMT implementation is more than SMTGood ROI for silicon area: Performance gain > Area increaseResource sizes optimized Dynamic feedback enhances instruction throughputSoftware controlled priority exploits machine architectureDynamic ST to/from SMT mode capability optimizes system resources
SMT impacts pervasive throughout chipOperating in laboratory
AIX, Linux and OS/400 booted and running