Using Dynamic Binary Translation to Fuse Dependent Instructions Shiliang Hu & James E. Smith.
2
Outline
• Introduction
• Fused Instruction Set
• Fusing Algorithm
• Evaluation
• Conclusion
3
MicroArchitecture Model
• Dependence-based Architectures: ILDP [ISCA’02] etc.
• Fuse dependent instruction pairs so they are processed as if they were single instructions in the processor pipeline
• Both higher IPC and deeper pipelining can be achieved simultaneously.
• Original proposal: Macro-op Scheduling by I. Kim and M. Lipasti [MICRO’03] (hardware-intensive, RISC)
• Other related work on pipelined scheduling logic.
4
Pipelined Scheduling Window
• Select-wakeup is the critical path for single-cycle instructions.
• If the producer has latency > 1, wakeup can be done a cycle late => wakeup and select in different pipe stages.
[Figure: select/wakeup timing for the sequence Reax ← mem[Resi + 4]; Reax ← Reax & 7; Rebx ← Reax + Rebx; Recx ← Rebx + 4, shown unfused and with the pair Reax ← Reax & 7 :: Rebx ← Reax + Rebx fused into one slot.]
5
Performance Implications
• Pros:
  – Effectively larger scheduling window: two instructions are held in the same window slot.
  – Effectively wider issue: by issuing one slot with two fused instructions, two dependent instructions are kicked off for execution with a single issue decision.
  – The scheduling logic can be pipelined without a heavy penalty if the fusing rate is high.
• Cons:
  – Non-fused single-cycle instructions have two-cycle latency.
  – If the head (the 1st instruction in the pair) provides a value to another critical consumer, that consumer is delayed; however, most values are consumed only once.
  – If the tail (the 2nd instruction in the pair) has a critical dependence, it slows down the wakeup of the pair.
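The trade-off above can be illustrated with a deliberately crude timing model. It is only a sketch: it assumes a single chain of dependent window slots, and that each slot delays its consumer by the larger of its execution latency and the scheduling-loop length. The function name `chain_cycles` is hypothetical and not from the slides.

```python
def chain_cycles(slot_latencies, scheduling_loop):
    """Cycles to issue a chain of dependent window slots back to back.

    Assumption: a consumer can issue no earlier than
    max(scheduling_loop, producer latency) cycles after its producer.
    """
    return sum(max(scheduling_loop, latency) for latency in slot_latencies)


# Four dependent single-cycle ALU ops, one per slot.
unfused = [1, 1, 1, 1]
# The same four ops fused into two slots; each fused pair executes in 2 cycles.
fused = [2, 2]

single_cycle_scheduler = chain_cycles(unfused, scheduling_loop=1)
pipelined_unfused = chain_cycles(unfused, scheduling_loop=2)
pipelined_fused = chain_cycles(fused, scheduling_loop=2)
print(single_cycle_scheduler, pipelined_unfused, pipelined_fused)
```

Under these assumptions the fused chain on a pipelined scheduler takes as many cycles as the unfused chain on a single-cycle scheduler, while the unfused chain on a pipelined scheduler takes twice as long.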
6
Co-Designed Virtual Machine
• Concurrently design ISA, microarchitecture, and dynamic binary translation (DBT) system
• Examples -- Transmeta Crusoe & Efficeon processors; IBM DAISY, BOA.
• Our Design for x86:
  – RISC-style implementation ISA with fuse bit
  – Fetch straightened code generated by fast DBT
  – Run on an enhanced dynamic superscalar
7
Implementation Instruction Set
• Allocate 1-bit of each instruction, the fuse bit, to fuse two instructions in the pipeline
• Dense Instruction Encoding: 16/32 bit instruction set design
• Features specialized for efficient emulation of the x86 ISA: long immediates, condition code, addressing modes etc
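As a minimal sketch of how a fuse bit could sit in an instruction word, the following packs and unpacks one 32-bit format (fuse bit, 10-bit opcode, two 5-bit register fields, 11-bit immediate). The field widths come from the slides; the bit positions, the field order, and the names `encode` and `decode` are assumptions made for illustration.

```python
def encode(fuse, opcode, rsr, rds, imm11):
    """Pack one 32-bit word. Assumed layout: F | opcode | Rsr | Rds | imm."""
    assert 0 <= opcode < (1 << 10) and 0 <= rsr < 32 and 0 <= rds < 32
    return (((fuse & 1) << 31) | (opcode << 21) | (rsr << 16)
            | (rds << 11) | (imm11 & 0x7FF))


def decode(word):
    """Unpack the fields of a word produced by encode()."""
    return {
        "fuse": (word >> 31) & 1,      # 1 = fuse with the following instruction
        "opcode": (word >> 21) & 0x3FF,
        "rsr": (word >> 16) & 0x1F,
        "rds": (word >> 11) & 0x1F,
        "imm11": word & 0x7FF,
    }


# Hypothetical opcode and register numbers, chosen only for the demo.
word = encode(fuse=1, opcode=0x2A, rsr=4, rds=0, imm11=0xF8)
fields = decode(word)
```

With this layout a decoder only needs to inspect one bit to know whether the next instruction belongs to the same fused pair.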
8
Fused Instruction Set
[Figure: instruction formats, each beginning with the fuse bit F. Field widths and examples as on the slide.]
32-bit formats:
• F, 10b opcode, 21-bit immediate/displacement
  – call 0x080af30e (21-bit disp)
  – jcc 0x080115a0
  – jmp 0x080C0988
• F, 10b opcode, 5b Rds, 16-bit immediate/displacement
  – LIMM.lo Redx, LO(0x0810a7de)
  – LIMM.hi Redx, HI(0x0810a7de)
  – CMP.cc Reax, 0x4000
• F, 10b opcode, 5b Rsr, 5b Rds, 11b immediate/displacement
  – LD Reax, mem[Resp + F8]
  – ST Reax, mem[Rebp + 4C]
  – ADD Reax, Rebx, 4c
• F, 16-bit opcode, 5b Rsr, 5b Rsr, 5b Rds
  – ADD Reax, Redx, Rebx
  – Fmac Facc, Fmp1, Fmp2
  – LD Reax, mem[Rebx + Rebp]
16-bit formats (x86 instruction → fused-ISA instruction):
• F, 7b op, 4b Rs, 4b Rd
  – mov esp, ebp → MOV Resp, Rebp
  – mov eax, [esp] → LD Reax, mem[Resp]
  – add eax, edx → ADD Reax, Redx
• F, 7b op, 4b I, 4b Rd
  – sub ecx, 4 → SUB Recx, 4
  – shr esi, 2 → SHR Resi, 2
  – inc ecx → INC Recx, 1
• F, 7b op, 8b immediate/displacement
  – jcc 3e, e.g. jnz 3e
9
An Illustrative Example
Row | x86 instruction | Fused ISA | Execution latency
1 | mov ebx, ds:[esi + 1c] | LD Rebx, [Resi + 1c] | 3
2 | test ebx, ebx | TEST Rebx, Rebx :: Jz 126 | 2
3 | jz 08115bf2 | |
4 | | LD Rtmp, [Rebx + 02] | 3
5 | cmp ds:[ebx + 02], 0d | CMP Rtmp, 0d :: Jz 2f | 2
6 | jnz 08115ae1 | |
7 | jmp 08115bf2 | (direct jmp removed) |
8 | add esp, 0c | ADD.cc Resp, 0c :: LD Rebx, [Resp] | 4
9 | pop ebx | ADD Resp, 4 :: LD Resi, [Resp] | 4
10 | pop ebp | ADD Resp, 4 :: LD Rtmp, [Resp] | 4
11 | ret_near | ADD Resp, 4 | 1
12 | | BR.ret Rtmp | 1
28 Bytes
10 x86 instructions 32 Bytes, 14 RISC-like instructions. Consume 9 scheduling window slots.
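The instruction and slot counts quoted for this example can be checked by tallying the fused-ISA column, with one list per scheduling-window slot. The list below is transcribed from the slide; the variable names are illustrative only.

```python
# One inner list per scheduling-window slot; a fused pair ("::") shares a slot.
fused_code = [
    ["LD Rebx, [Resi + 1c]"],
    ["TEST Rebx, Rebx", "Jz 126"],
    ["LD Rtmp, [Rebx + 02]"],
    ["CMP Rtmp, 0d", "Jz 2f"],
    ["ADD.cc Resp, 0c", "LD Rebx, [Resp]"],
    ["ADD Resp, 4", "LD Resi, [Resp]"],
    ["ADD Resp, 4", "LD Rtmp, [Resp]"],
    ["ADD Resp, 4"],
    ["BR.ret Rtmp"],
]

instructions = sum(len(slot) for slot in fused_code)
slots = len(fused_code)
fused_instructions = sum(len(slot) for slot in fused_code if len(slot) == 2)

print(instructions, slots, fused_instructions)
print(f"instructions per slot: {instructions / slots:.2f}")
```

The tally reproduces the 14 RISC-like instructions and 9 scheduling window slots stated on the slide; 10 of the 14 instructions sit in fused pairs.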
10
Dynamic Binary Translation
• Goals: Simple, Fast & Effective
• Hot Superblock detection and formation
• Translation from x86 binary to fused instruction set
• Code cache placement & linking among superblocks in the code cache
11
Hot Superblock Detection & Formation
• Modified MRET (Most Recently Executed Tail) -- Stop at indirect jumps. Threshold: 32. Max Len: 256.
[Figure: superblock formation over basic blocks A, B, C and D, showing the entry, the hot threshold, the path taken at superblock construction time, an early exit, the translated superblocks, and a superblock generated later.]
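A simplified sketch of hot-superblock detection in the spirit of MRET is shown below. The threshold (32), the maximum length (256) and the stop at indirect jumps come from the slide. Everything else is an assumption for illustration: the `Insn` record, counting only targets of backward taken branches, and ending a superblock at a backward branch.

```python
from collections import Counter
from dataclasses import dataclass

HOT_THRESHOLD = 32  # from the slide
MAX_LEN = 256       # from the slide; assumed to count x86 instructions


@dataclass(frozen=True)
class Insn:
    pc: int
    kind: str     # "plain", "branch" (direct) or "indirect"
    next_pc: int  # address that was actually executed next


def form_superblocks(trace):
    """Scan an executed-instruction trace and return {head pc: [pcs]}."""
    counts = Counter()
    superblocks = {}
    recording = None  # (head pc, list of pcs) while a superblock is collected
    prev = None
    for insn in trace:
        if recording is None:
            # Assumed candidate heads: targets of backward taken branches.
            is_candidate = (prev is not None and prev.kind != "plain"
                            and prev.next_pc <= prev.pc
                            and insn.pc == prev.next_pc)
            if is_candidate and insn.pc not in superblocks:
                counts[insn.pc] += 1
                if counts[insn.pc] >= HOT_THRESHOLD:
                    recording = (insn.pc, [])
        if recording is not None:
            head, pcs = recording
            pcs.append(insn.pc)
            backward = insn.kind == "branch" and insn.next_pc <= insn.pc
            # Stop at indirect jumps, backward branches, or the length limit.
            if insn.kind == "indirect" or backward or len(pcs) >= MAX_LEN:
                superblocks[head] = pcs
                recording = None
        prev = insn
    return superblocks


loop = [Insn(0x10, "plain", 0x14), Insn(0x14, "plain", 0x18),
        Insn(0x18, "branch", 0x10)]
hot = form_superblocks(loop * 40)
cold = form_superblocks(loop * 20)
```

In this toy trace the loop becomes hot only when it runs often enough: 40 iterations produce one superblock headed at 0x10, while 20 iterations produce none.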
12
Translation Procedure
Single Pass Algorithm:
1. Form superblocks using Modified MRET method
2. Crack x86 instructions into RISC-like abstract micro-ops
3. Perform Cluster Analysis of long immediates and assign to regs.
4. Generate micro-ops in the implementation ISA
5. Fusing algorithm: scan for dependent pairs to fuse (forward scan, backward pairing).
6. Assign registers; extend live ranges for precise traps, use consistent state mapping at superblock exits
7. Code generation
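Step 2 (cracking) can be sketched for a few of the instructions that appear in the illustrative example. The micro-op sequences for `pop`, `ret_near` and a memory-operand `cmp` follow that example; the tuple representation, the `crack` function, and the one-to-one fallback are assumptions, and condition-code variants such as ADD.cc are not modeled.

```python
def crack(mnemonic, *operands):
    """Return RISC-like micro-ops (as tuples) for one x86 instruction.

    Memory operands are assumed to be passed already rewritten in
    implementation-register form, e.g. "[Rebx + 02]".
    """
    if mnemonic == "pop":
        (reg,) = operands
        return [("LD", "R" + reg, "[Resp]"), ("ADD", "Resp", "4")]
    if mnemonic == "ret_near":
        return [("LD", "Rtmp", "[Resp]"), ("ADD", "Resp", "4"),
                ("BR.ret", "Rtmp")]
    if mnemonic == "cmp" and operands[0].startswith("["):
        mem, imm = operands
        return [("LD", "Rtmp", mem), ("CMP", "Rtmp", imm)]
    # Assumed fallback: simple register forms map to a single micro-op.
    dst, *rest = operands
    return [(mnemonic.upper(), "R" + dst, *rest)]


micro_ops = []
for insn in [("cmp", "[Rebx + 02]", "0d"), ("pop", "ebx"), ("ret_near",)]:
    micro_ops.extend(crack(*insn))
```

The resulting flat list of micro-ops is what the later steps (long-immediate handling, fusing, register assignment) would operate on.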
13
Cluster Analysis
• Objectives:
  – Remove embedded long immediates in x86 binary.
  – Reduce static and dynamic instructions.
• Long Immediate Conversion:
  – Scan the superblock looking for all long immediate values.
  – Perform value clustering analysis and allocate registers to frequent long immediate values.
  – Convert some x86 embedded long immediates into a register access, or a register plus a short immediate that can be handled in the implementation ISA.
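One way to picture the clustering step is a greedy pass over the sorted immediate values. This is a sketch under assumptions, not the authors' algorithm: the 11-bit signed short-immediate reach, the register names `Rimm0`, `Rimm1`, ..., the limit of three registers, and the rule that a cluster must cover at least two uses are all hypothetical choices.

```python
def cluster_immediates(values, short_bits=11, max_regs=3):
    """Map long immediates to (base register, short offset) where profitable."""
    reach = (1 << (short_bits - 1)) - 1  # largest positive signed short immediate
    clusters = []                        # each entry: [base value, member uses]
    for v in sorted(values):
        if clusters and v - clusters[-1][0] <= reach:
            clusters[-1][1].append(v)    # reachable from the current base
        else:
            clusters.append([v, [v]])    # start a new cluster at this value
    # Give registers to the clusters that cover the most uses.
    clusters.sort(key=lambda c: len(c[1]), reverse=True)
    plan = {}
    for reg, (base, members) in enumerate(clusters[:max_regs]):
        if len(members) < 2:
            continue  # loading the base would cost as much as the one use
        for v in members:
            plan[v] = (f"Rimm{reg}", v - base)
    return plan


uses = [0x0810A7DE, 0x0810A7E2, 0x0810A7DE, 0x4000, 0x08115BF2]
plan = cluster_immediates(uses)
```

Values that end up in `plan` can be rewritten as a register access or register plus short immediate; the rest keep their long-immediate form.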
14
Fusing Algorithm
• Objectives:
  – Maximize fused dependent pairs.
  – Minimize non-fused single-cycle ALU ops.
• Heuristics:
  – Only single-cycle ALU ops can be a head.
  – Fuse instructions that are close in the original sequence cracked from x86 binary.
• Fusing Algorithm:
  – Single pass forward scan.
  – For each tail candidate, look backward in the scan for its head.
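The forward-scan, backward-pairing idea can be sketched as follows. This is an illustration under stated assumptions rather than the authors' implementation: micro-ops carry only register names, condition codes are modeled as a pseudo-register `cc`, the look-back `window` of 8 is arbitrary, and memory and control ordering are ignored. The backward search stops at the first register conflict, which is a conservative stand-in for the legality check a real translator needs.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass
class MicroOp:
    text: str
    dst: Optional[str]
    srcs: Tuple[str, ...]
    single_cycle_alu: bool = False


def conflicts(earlier, later):
    """True if `later` may not move above `earlier` (register RAW/WAW/WAR)."""
    raw = earlier.dst is not None and earlier.dst in later.srcs
    waw = earlier.dst is not None and earlier.dst == later.dst
    war = later.dst is not None and later.dst in earlier.srcs
    return raw or waw or war


def fuse(ops, window=8):
    """Single forward scan; each tail candidate looks backward for a head."""
    fused = set()  # indices already used as a head or a tail
    pairs = []     # (head index, tail index)
    for t, tail in enumerate(ops):
        for h in range(t - 1, max(-1, t - 1 - window), -1):
            head = ops[h]
            if (h not in fused and head.single_cycle_alu
                    and head.dst is not None and head.dst in tail.srcs):
                pairs.append((h, t))
                fused.update((h, t))
                break
            if conflicts(head, tail):
                break  # the tail cannot be hoisted past this op
    return pairs


ops = [
    MicroOp("LD Rebx, [Resi + 1c]", "Rebx", ("Resi",)),
    MicroOp("TEST Rebx, Rebx", "cc", ("Rebx",), single_cycle_alu=True),
    MicroOp("Jz 126", None, ("cc",)),
    MicroOp("LD Rtmp, [Rebx + 02]", "Rtmp", ("Rebx",)),
    MicroOp("CMP Rtmp, 0d", "cc", ("Rtmp",), single_cycle_alu=True),
    MicroOp("Jz 2f", None, ("cc",)),
]
pairs = fuse(ops)
```

On this fragment the sketch pairs TEST with the first Jz and CMP with the second Jz, and leaves both loads unfused because their producers are not single-cycle ALU ops.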
15
Dependence Cycle Detection
[Figure: dependence-cycle cases (a), (b), (c) and (d) between candidate head/tail pairs and the instructions around them.]
• All cases are generalized in (d) due to the anti-scan fusing heuristic.
16
Dynamic x86 Superblock Size
• Average superblock size is about 15 x86 instructions, 20+ RISC ops.
• String instructions are common in some x86 applications.
[Figure: dynamic superblock size in x86 instructions (axis 0 to 30); series: x86i/sb, x86i/sb-string.]
17
Static Translation Size
• Variable length ISA is only about 33% bigger than x86 binary
• Fixed length ISA is 60% to 120% bigger than original x86 binary.
[Figure: relative code size (axis 0 to 2.2); series: x86, VarLen:16/32, VL-align, Fixed-Len:32.]
18
Long Immediate Values Converted
• Intra superblock conversion for now.
• Address displacements are easier to convert than general long immediate values.
[Figure: conversion percentage (axis 0 to 100); series: Displacement, Long Immd, ALL.]
19
Registers For Long Immediate
• Two or three registers are enough for 95+% dynamic superblocks.
• Most SPEC2000INT benchmarks need no more than 5 registers
[Figure: percentage of dynamic superblocks (axis 0 to 100) versus number of registers (0 to 8), one series per benchmark: 164.gzip, 175.vpr, 176.gcc, 181.mcf, 186.crafty, 197.parser, 252.eon, 253.perlbmk, 254.gap, 255.vortex, 256.bzip2, 300.twolf.]
20
Scheduling Density
• Consistently high fusing rate across SPEC2000INT benchmarks.
• A scheduling density of 1.5 means more than 60% of instructions are fused.
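The link between scheduling density and the fused fraction can be made explicit. The derivation below assumes, since the slide does not spell out the definition, that scheduling density $d$ is the number of dynamic instructions $N$ divided by the number of scheduling-window slots, and that $P$ fused pairs each occupy one slot:

```latex
\[
d = \frac{N}{N - P}
\quad\Longrightarrow\quad
\frac{2P}{N} = 2\left(1 - \frac{1}{d}\right),
\qquad
d = 1.5 \;\Rightarrow\; \frac{2P}{N} = \frac{2}{3} \approx 67\%.
\]
```

Under that assumption a density of 1.5 corresponds to about two thirds of instructions being fused, consistent with the "more than 60%" stated above.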
[Figure: dynamic scheduling density (axis 0 to 1.7) of the fused implementation ISA.]
21
Non-Fused Instruction Profile
• Consistently low single cycle ALU leftovers across SPEC2000INT
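• Single-cycle ALU ops are ~23% of the non-fused instructions, which are ~35% of all instructions: (~23%) × (~35%) means non-fused single-cycle ALU ops are about 8% of all instructions.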
[Figure: percentage of non-fused instructions by type (LD, ST, BR, ALU) for 164.gzip, 175.vpr, 176.gcc, 181.mcf, 186.crafty, 197.parser, 252.eon, 253.perlbmk, 254.gap, 255.vortex, 256.bzip2, 300.twolf, and the average.]
22
Distance Distribution of Fused Pairs
• Most pairs are consecutive or very close in the original sequence of RISC ops cracked from the x86 superblock.
[Figure: percentage of fused pairs by distance between head and tail (1 to 7) for 164.gzip, 175.vpr, 176.gcc, 181.mcf, 186.crafty, 197.parser, 252.eon, 253.perlbmk, 254.gap, 255.vortex, 256.bzip2, 300.twolf.]
23
Code Re-organization
• More than 50% of pairs cross x86 instruction boundaries.
• Pairs of single-cycle ALU ops are about 60%.
[Figure: percentage of fused pairs (axis 0 to 100); series: X-fuse, S-fuse.]
24
Source Register Operands
[Figure: percentage of fused pairs (axis 84 to 100); series: 2 source register operands, 3 source register operands.]
• 99+% fusable pairs have no more than 3 source register operands.
• 95+% fusable pairs have no more than 2 source register operands.
25
Conclusion
• High degree of fusing in typical x86 binary: 60% of all dynamic instructions
• Two source register operands are enough: 95% of fusable dependent pairs.
• Non-fused instructions are mostly LD, ST, BR, FP and NOPs
– Little impact from pipelined issue
• Variable length ISA improves code density: by 30% in our case
• Co-designed VM featuring fused instruction execution is promising.
• Future work: complete the co-designed microarchitecture.
26
Backup: Dynamic Binary Translation
• Start program execution by interpretation; identify “hot” (frequently executed) program paths
• Translate hot paths into the translation cache
• If program control flow reaches already translated code, execute natively
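[Figure: DBT (VMM) control flow between Interpret, Translate and Native execution. Interpretation moves to translation at the hot threshold. At the end of a superblock, control goes to native execution if a translation is found and to the interpreter if not. From native code, control stays native if the target translation is found, otherwise it returns to the DBT via a call-DBT instruction.]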
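The interpret / translate / execute-natively cycle can be sketched as a small dispatch loop. This is a toy model under assumptions: `interpret_block` and `translate` are hypothetical callbacks supplied by the caller, translated code is modeled as a Python callable that returns the next pc, and the default threshold of 32 is borrowed from the superblock-detection slide.

```python
def run(entry_pc, interpret_block, translate, threshold=32, max_steps=1000):
    """Dispatch loop: interpret until hot, then run from the translation cache."""
    counts = {}  # execution counts per block entry
    cache = {}   # translation cache: pc -> callable returning the next pc
    log = []
    pc = entry_pc
    for _ in range(max_steps):
        if pc is None:                 # program finished
            break
        if pc in cache:                # translation found: execute natively
            log.append(("native", pc))
            pc = cache[pc]()
            continue
        counts[pc] = counts.get(pc, 0) + 1
        if counts[pc] >= threshold:    # hot: translate, then re-dispatch
            cache[pc] = translate(pc)
            continue
        log.append(("interpret", pc))  # not hot yet: interpret
        pc = interpret_block(pc)
    return log


# Toy program: one block at pc 0 that loops to itself ten times, then exits.
state = {"n": 0}


def step(pc):
    state["n"] += 1
    return 0 if state["n"] < 10 else None


log = run(0, interpret_block=step,
          translate=lambda pc: (lambda: step(pc)), threshold=3)
```

In the toy run the block is interpreted until its counter reaches the threshold and every later execution is served from the translation cache.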