Stamatis Vassiliadis Symposium Sept. 28, 2007 J. E. Smith Future Superscalar Processors Based on...
Stamatis Vassiliadis Symposium
Sept. 28, 2007
J. E. Smith
Future Superscalar Processors Based on Instruction Compounding
Future Microprocessors 2
Instruction Compounding (Fusing)
Instruction compounding, or “fusing,” has become a key idea in high-performance microprocessors.
“A compound instruction reflects the parallel issue of instructions; it comprises some number of independent instructions or interlocked instructions”
“Instructions composing a compound instruction need not be consecutive.”
-- S. Vassiliadis et al., IBM Journal of Research and Development, Jan. 1994
The Future Processor: Three Key Aspects
Instruction compounding or fusing
• Based on S. Vassiliadis's work
• Employs compounding and 3-input ALU
Co-designed VM for dynamic translation/fusing
• Concealed from all software
• Optimized (fused) instructions held in code cache
Dual decoder front-end for fast startup
• Hardware front-end decoder for fast startup
• Software translator for sustained high performance
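The interplay between the two decoders and the code cache can be pictured with a small sketch. It is an illustration under stated assumptions, not the implementation described in the talk: the class name `CoDesignedVM`, the `hot_threshold` value, and the strings standing in for decoded and translated code are all hypothetical.

```python
from collections import Counter


class CoDesignedVM:
    """Toy model of a dual-decoder front end (all names are hypothetical)."""

    def __init__(self, hot_threshold=3):
        self.hot_threshold = hot_threshold  # assumed hotspot trigger, not from the talk
        self.exec_counts = Counter()        # executions seen per x86 address
        self.code_cache = {}                # x86 address -> fused code (concealed memory)

    def fetch(self, pc):
        """Return (path, code) for the x86 code starting at address pc."""
        if pc in self.code_cache:
            # Sustained performance: run the translated, fused code.
            return "code-cache", self.code_cache[pc]
        self.exec_counts[pc] += 1
        if self.exec_counts[pc] >= self.hot_threshold:
            # Hotspot detected: the software translator fuses and caches it
            # for later fetches.
            self.code_cache[pc] = f"fused({pc:#x})"
        # Fast startup: the hardware decoder handles untranslated code.
        return "hw-decoder", f"decoded({pc:#x})"


vm = CoDesignedVM()
paths = [vm.fetch(0x400)[0] for _ in range(5)]
print(paths)  # three 'hw-decoder' fetches, then two 'code-cache' fetches
```

The point of the sketch is only the control flow: cold code never waits for the translator, and hot code stops using the hardware decoder once a translation exists.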
Processor Micro-architecture
[Figure: processor block diagram. Memory regions: Conventional Memory, Concealed Memory. Contents: Data, x86 Code, Code Cache (V-code), Translation Software. Blocks: I-Cache, Vertical x86 Decoder, Horizontal V-code Decoder, Pipelined Rename/Dispatch, Issue Buffer, Pipelined Execution Backend. Path labels: x86 code, V-code, H-code.]
Fusible Instruction Set
RISC-ops with unique features:
• A fusible bit per instruction fuses two dependent instructions
• Dense instruction encoding, 16/32-bit ISA design
Special Features to Support the x86 ISA
• Condition codes
• Addressing modes
• Aware of long immediate & displacement values
[Figure: Fusible ISA instruction formats. Every format carries a fusible bit, F.]
Core 32-bit instruction formats:
• F, 10b opcode, 21-bit immediate/displacement
• F, 10b opcode, 16-bit immediate/displacement, 5b Rds
• F, 10b opcode, 11b immediate/disp, 5b Rsr, 5b Rds
• F, 16-bit opcode, 5b Rsr, 5b Rsr, 5b Rds
Add-on 16-bit instruction formats for code density:
• F, 5b op, 10b immd/disp
• F, 5b op, 5b Rsr, 5b Rds (two formats)
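As a rough illustration of how one of the core 32-bit formats could be packed, the sketch below encodes and decodes the format with a fusible bit, a 10-bit opcode, an 11-bit immediate, and two 5-bit register fields. The field widths come from the slide; the bit positions (F in the most significant bit, Rds in the least significant bits) and the function names `encode` and `decode` are assumptions made for the example.

```python
def encode(fuse, opcode, imm, rsr, rds):
    """Pack F | 10b opcode | 11b imm | 5b Rsr | 5b Rds (bit order assumed)."""
    assert fuse in (0, 1) and 0 <= opcode < (1 << 10)
    assert 0 <= imm < (1 << 11) and 0 <= rsr < 32 and 0 <= rds < 32
    return (fuse << 31) | (opcode << 21) | (imm << 10) | (rsr << 5) | rds


def decode(word):
    """Unpack a 32-bit word produced by encode()."""
    return {
        "fuse": (word >> 31) & 0x1,     # 1 means: fuse with the next instruction
        "opcode": (word >> 21) & 0x3FF,
        "imm": (word >> 10) & 0x7FF,
        "rsr": (word >> 5) & 0x1F,
        "rds": word & 0x1F,
    }


word = encode(fuse=1, opcode=0x2A, imm=0x7C, rsr=17, rds=3)
print(f"{word:#010x}", decode(word))
```

The widths add up to 1 + 10 + 11 + 5 + 5 = 32 bits, which is why the fusible bit costs one bit of opcode or immediate space in every format.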
Microarchitecture: Macro-op Execution
• Enhanced OOO superscalar microarchitecture
– Process & execute fused macro-ops as single instructions throughout the entire pipeline
[Figure: pipeline diagram. Stages: Fetch, Align/Fuse, Decode, Rename, Dispatch, Wake-up, Select, RF, EXE, MEM, WB, Retire. Other labels: fuse bit, cache ports, 3-1 ALUs. Stage annotations: increased/higher effective bandwidth; pipelined scheduling; wider effective window; simpler forward logic; simpler ROB tracking.]
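One way to picture what a 3-input ALU does for a fused pair is the sketch below: the tail operation consumes the head result directly, so the pair is evaluated from three inputs in one step. This is a software analogy only; the function name `collapsed_alu`, the 32-bit masking, and returning both results are assumptions, and the operand values are made up.

```python
import operator

MASK32 = 0xFFFFFFFF  # assumed 32-bit datapath


def collapsed_alu(head_op, a, b, tail_op, c):
    """Evaluate a fused dependent pair from three inputs.

    Returns (head_result, tail_result). The tail uses the head result
    directly instead of waiting for a separate write-back and re-read.
    """
    head = head_op(a, b) & MASK32
    tail = tail_op(head, c) & MASK32
    return head, tail


# ADD R18, Redi, 1 :: AND Reax, R18, 007f with a made-up Redi value
r18, reax = collapsed_alu(operator.add, 0x12FE, 1, operator.and_, 0x7F)
print(hex(r18), hex(reax))  # 0x12ff 0x7f
```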
Macro-op Fusing Algorithm
Objectives:
• Maximize fused dependent pairs
• Simple & fast
Heuristics:
• Pipelined scheduler: only single-cycle ALU ops can be a head. Minimize non-fused single-cycle ALU ops.
• Criticality: fuse instructions that are “close” in the original sequence. ALU-op criticality is easier to estimate.
• Simplicity: 2 or fewer distinct register operands per fused pair
Solution: two-pass fusing algorithm
• The 1st pass, a forward scan, prioritizes ALU ops: for each ALU-op tail candidate, look backward in the scan for its head
• The 2nd pass considers all kinds of RISC-ops as tail candidates
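The two-pass scan can be sketched in a few lines. This is a simplified reading of the slide, not the translator's actual code. The names `Op`, `can_fuse`, `fuse`, `MEM_KINDS`, and `max_dist` are hypothetical, the operand limit is interpreted as a limit on distinct source registers, the legality checks for hoisting a tail up to its head are conservative assumptions, and renaming the head result to a temporary register is assumed rather than modeled. The input is the six-RISC-op sequence used in the deck's fusing example.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class Op:
    kind: str                 # "ALU" (single-cycle), "LD", "ST", ...
    dst: Optional[str]        # destination register, None for stores
    srcs: Tuple[str, ...]     # source registers (immediates omitted)


MEM_KINDS = {"LD", "ST"}


def can_fuse(ops, h, t, max_srcs=2):
    """Check whether ops[t] may be hoisted up to ops[h] and fused as its tail."""
    head, tail = ops[h], ops[t]
    # Only a single-cycle ALU op whose result the tail reads can be a head.
    if head.kind != "ALU" or head.dst is None or head.dst not in tail.srcs:
        return False
    # Simplicity heuristic: limit distinct source registers read by the pair.
    external = set(head.srcs) | {s for s in tail.srcs if s != head.dst}
    if len(external) > max_srcs:
        return False
    for mid in ops[h + 1:t]:
        # The tail must not read or write a register written by an op it hoists past.
        if mid.dst is not None and (mid.dst in tail.srcs or mid.dst == tail.dst):
            return False
        # An op in between must not read a register the tail overwrites, unless
        # that register is the head's own result (assumed renamed to a temp).
        if tail.dst is not None and tail.dst != head.dst and tail.dst in mid.srcs:
            return False
        # Conservative assumption: never reorder two memory operations.
        if tail.kind in MEM_KINDS and mid.kind in MEM_KINDS:
            return False
    return True


def fuse(ops, max_dist=4):
    """Return sorted (head_index, tail_index) pairs for a straight-line block."""
    taken = set()
    pairs = []
    # Pass 1 considers only ALU ops as tails; pass 2 considers every kind.
    for tail_kinds in ({"ALU"}, None):
        for t, tail in enumerate(ops):
            if t in taken or (tail_kinds is not None and tail.kind not in tail_kinds):
                continue
            # Look backward, nearest candidate first ("close" pairs preferred).
            for h in range(t - 1, max(t - max_dist, 0) - 1, -1):
                if h not in taken and can_fuse(ops, h, t):
                    taken.update((h, t))
                    pairs.append((h, t))
                    break
    return sorted(pairs)


# The six RISC-ops of the deck's fusing example (0-based indices).
EXAMPLE = [
    Op("ALU", "Reax", ("Redi",)),          # ADD Reax, Redi, 1
    Op("ST", None, ("Reax", "R22")),       # ST Reax, mem[R22]
    Op("LD", "Rebx", ("Rebp", "Recx")),    # LD.zx Rebx, mem[Rebp + Recx << 1]
    Op("ALU", "Reax", ("Reax",)),          # AND Reax, 0000007f
    Op("ALU", "R17", ("Reax", "Resi")),    # ADD R17, Reax, Resi
    Op("LD", "Redx", ("R17",)),            # LD Redx, mem[R17 + 0x7c]
]
print(fuse(EXAMPLE))  # [(0, 3), (4, 5)]
```

On this input the first pass pairs the ADD with the AND, and the second pass pairs the remaining ADD with the final load, which matches the four macro-ops shown on the example slide.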
Fusing Algorithm: Example
x86 asm:
-----------------------------------------------------------
1. lea eax, DS:[edi + 01]
2. mov [DS:080b8658], eax
3. movzx ebx, SS:[ebp + ecx << 1]
4. and eax, 0000007f
5. mov edx, DS:[eax + esi << 0 + 0x7c]
RISC-ops:
-----------------------------------------------------
1. ADD Reax, Redi, 1
2. ST Reax, mem[R22]
3. LD.zx Rebx, mem[Rebp + Recx << 1]
4. AND Reax, 0000007f
5. ADD R17, Reax, Resi
6. LD Redx, mem[R17 + 0x7c]

After fusing: Macro-ops
-----------------------------------------------------
1. ADD R18, Redi, 1 :: AND Reax, R18, 007f
2. ST R18, mem[R22]
3. LD.zx Rebx, mem[Rebp + Recx << 1]
4. ADD R17, Reax, Resi :: LD Redx, mem[R17+0x7c]
Instruction Fusing Profile
[Figure: stacked bar chart. Y-axis: percentage of dynamic instructions (0% to 100%). Categories: ALU, FP or NOPs, BR, ST, LD, Fused.]
• 55+% of RISC-ops are fused, which increases effective ILP by a factor of 1.4.
• Only 6% single-cycle ALU ops are left un-fused.
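The 1.4 figure is consistent with simple counting, assuming every fused RISC-op travels through the pipeline as half of a two-op macro-op. The function name `effective_ilp_gain` is hypothetical and the calculation is a back-of-the-envelope check, not a result from the talk.

```python
def effective_ilp_gain(fused_fraction):
    """Ratio of RISC-ops to pipeline entities when fused ops travel in pairs."""
    # Unfused ops count as one entity each; fused ops count as half an entity.
    entities_per_op = (1.0 - fused_fraction) + fused_fraction / 2.0
    return 1.0 / entities_per_op


print(round(effective_ilp_gain(0.55), 2))  # 1.38
```

With 55% of RISC-ops fused, 100 RISC-ops become 72.5 entities, so each issue slot carries about 1.38 RISC-ops on average.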
Other DBT Software Profile
Of all fused macro-ops:
• 50% are ALU-ALU pairs.
• 30% are fused condition test & conditional branch pairs.
• Others are mostly ALU-MEM op pairs.
Of all fused macro-ops:
• 70+% are inter-x86-instruction fusions.
• 46% access two distinct source registers.
• Only 15% (6% of all instruction entities) write two distinct destination registers.
Translation Overhead Profile
• About 1000 instructions per translated hotspot instruction.
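The "15% of fused macro-ops, 6% of all instruction entities" pairing can be cross-checked with a short calculation. It assumes that an "instruction entity" is either a fused macro-op or an unfused RISC-op, and that roughly 55% of RISC-ops are fused, the figure quoted on the fusing profile slide. The function name `fused_entity_share` is hypothetical.

```python
def fused_entity_share(fused_fraction):
    """Fraction of pipeline entities that are fused macro-ops."""
    macro_ops = fused_fraction / 2.0               # each macro-op holds two RISC-ops
    entities = macro_ops + (1.0 - fused_fraction)  # macro-ops plus unfused ops
    return macro_ops / entities


share = fused_entity_share(0.55)
print(round(share, 2))         # 0.38
print(round(0.15 * share, 3))  # 0.057
```

Under these assumptions about 38% of entities are macro-ops, and 15% of those is roughly 6% of all entities, in line with the slide.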
Co-designed x86 Processor Performance
[Figure: line chart. X-axis: issue window size (16, 32, 48, 64). Y-axis: relative IPC performance (0.5 to 1.3). Series: 4-wide Macro-op, 3-wide Macro-op, 2-wide Macro-op, 4-wide Base, 3-wide Base.]
Dual Decoder Front-End
[Figure: processor block diagram. Memory regions: Conventional Memory, Concealed Memory. Contents: Data, x86 Code, Code Cache (V-code), Translation Software. Blocks: I-Cache, Vertical x86 Decoder, Horizontal V-code Decoder, Pipelined Rename/Dispatch, Issue Buffer, Pipelined Execution Backend. Path labels: x86 code, V-code, H-code.]
Evaluation: Startup Performance
Activity of HW Assists
[Figure: line chart. X-axis: "Time: Cycles" (log scale, 1 to 100,000,000). Y-axis: HW assist activity (%), 0 to 100. Legend: Superscalar, VM.soft, VM.dual. Additional label: Finish.]
Important Research Issues
Profiling
• Probe insertion via software translator not feasible
Multi-core
• Shared code cache
• SMT designs
Memory consistency
• Stores can be done in-order
• Re-scheduled loads may be important for performance
Precise traps
• Potential HW assist?