Duplicating and Deconstructing Virtual Load/Store Queues
June 18, 2006 5th Annual Workshop on Duplicating, Deconstructing and Debunking
Vikas Garg, Sonal Agarwal
Motivation
• Large instruction window and load/store queue to achieve high performance
• Speculative executions of memory instructions
• Replay traps due to re-ordering of memory accesses
• Pipeline flushes to handle replay traps
  • Wasted pipeline operations (Power)
  • Excessive L1 accesses (Power and Locality)
Motivation
• Virtual Load/Store Queue (VLSQ) proposal [Jaleel, HPCA’05]
  • Use large load store queue for the front end
  • Throttle memory instructions at issue stage
  • Reduces the re-ordering of memory instructions
  • Help in avoiding replay traps
  • Saves power
  • No big performance drop
What if we simply reduce the LSQ size?
Does a VLSQ really work?
Outline
• Motivation
• VLSQ Introduction
• Simulation Setup
• VLSQ Results
• VLSQ vs. LSQ
• Conclusions
VLSQ Introduction
[Figure: load/store queue with entries LD/ST 0 to LD/ST 15. Labels: LSQ Head, LSQ Tail, Virtual Head, Virtual Tail, FRONT END, ISSUE. Entry states (legend): ISSUED, NOT READY, BLOCKED, EMPTY.]
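The entry states in the figure legend can be illustrated with a small model. This is a minimal sketch, not the mechanism from the slides or from [Jaleel, HPCA’05]: it assumes the virtual window covers the oldest `window` occupied entries starting at the LSQ head, that every ready entry inside the window issues in the same cycle, and that entries outside the window are held back. The names `MemOp`, `State`, and `issue_cycle` are hypothetical.

```python
from dataclasses import dataclass
from enum import Enum


class State(Enum):
    ISSUED = "ISSUED"
    NOT_READY = "NOT READY"
    BLOCKED = "BLOCKED"
    EMPTY = "EMPTY"


@dataclass
class MemOp:
    ready: bool            # operands available, so the op could issue
    issued: bool = False


def issue_cycle(ops, capacity, window=None):
    """Issue ready ops inside the virtual window and label every LSQ slot.

    ops      -- occupied entries, oldest first (index 0 is the LSQ head)
    capacity -- physical LSQ size
    window   -- virtual queue size; None models an infinite VLSQ
    Assumption: the window is anchored at the LSQ head.
    """
    limit = len(ops) if window is None else min(window, len(ops))
    labels = []
    for i, op in enumerate(ops):
        if i < limit and op.ready:
            op.issued = True                  # inside the window and ready
        if op.issued:
            labels.append(State.ISSUED)
        elif i < limit:
            labels.append(State.NOT_READY)    # inside the window, still waiting
        else:
            labels.append(State.BLOCKED)      # outside the window, throttled
    labels.extend([State.EMPTY] * (capacity - len(ops)))
    return labels


ops = [MemOp(r) for r in (True, False, True, True, True, False)]
for label in issue_cycle(ops, capacity=8, window=4):
    print(label.value)
```

With `window=None` the same call never produces a BLOCKED label, which corresponds to the "Inf" VLSQ setting used as the baseline in the later slides.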
VLSQ Pipeline Operation
[Figure: pipeline diagram with blocks Fetch/Decode, Rename, Issue, Register File, Integer, Memory, and Load/Store Queue; two points are marked "Stall".]
Simulation Setup
• Alpha 21264 simulator (sim-alpha)
  • I-Cache (64KB, 1 cycle); D-Cache (64KB, 3 cycles)
  • L2-Cache (2MB, 15 cycles)
  • 1.3 GB/s DDR SDRAM (DRAMsim)
  • 1024 entry store-wait table
  • 2048 line 2-level bimodal branch predictor
  • Pipeline width: Fetch(8); Issue(8/4); Commit(11)
  • Functional units: Int(4), Int-Mul(4), FP(1), FP-Mul(1)
• Subset of SPEC 2000 benchmark
  • FP: applu, art, mgrid, swim; INT: gcc, gzip, mcf, twolf
  • Warm-up: 2 Billion Inst; Data: 500 Million Inst
  • Reference input
Simulation Setup (Continued…)

| ROB Size | Registers | Issue Queue | LSQ Size | VLSQ Size |
|---|---|---|---|---|
| 80 | 80/72 | 20/15 | 32/32 | Infinite |
| 128 | 160/144 | 40/30 | 64/64 | Infinite |
| 256 | 320/288 | 80/60 | 128/128 | Infinite |
| 512 | 640/576 | 160/120 | 256/256 | Infinite |

Baseline Out-of-Order Configurations
For VLSQ use baseline LSQ and VLSQ of Inf, 64, 32, 16, 8, 4, and 2
For LSQ use the VLSQ of Infinity and LSQ size of 64, 32, 16, 8, 4, and 2
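The two sweeps described above can be enumerated as a small experiment grid. This is a sketch under stated assumptions: the function names and dictionary layout are hypothetical, the run labels follow the "ROB-size" pattern seen on the chart axes later in the deck (for example "80-Inf" and "80-Base"), and the "Base" point in the LSQ sweep is included only because those chart axes show it.

```python
# Baseline LSQ size for each ROB size, copied as printed in the table.
BASELINE_LSQ = {80: "32/32", 128: "64/64", 256: "128/128", 512: "256/256"}
SWEEP = [64, 32, 16, 8, 4, 2]


def vlsq_runs():
    """VLSQ experiments: baseline LSQ, VLSQ size swept over Inf, 64, ..., 2."""
    return [
        {"label": f"{rob}-{size}", "rob": rob, "lsq": lsq, "vlsq": size}
        for rob, lsq in BASELINE_LSQ.items()
        for size in ["Inf"] + SWEEP
    ]


def lsq_runs():
    """LSQ experiments: infinite VLSQ, LSQ size swept over Base, 64, ..., 2."""
    runs = []
    for rob, base in BASELINE_LSQ.items():
        for size in ["Base"] + SWEEP:
            lsq = base if size == "Base" else size
            runs.append({"label": f"{rob}-{size}", "rob": rob,
                         "lsq": lsq, "vlsq": "Inf"})
    return runs


for run in vlsq_runs() + lsq_runs():
    print(run)
```

Each sweep yields 28 runs (4 ROB sizes times 7 queue settings). Every run would be repeated per benchmark, which the sketch leaves out.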
VLSQ - Performance
[Figure: CPI (axis 0 to 2.5) vs. ROB Size (80, 128, 256, 512), one series per VLSQ size: Inf, 64, 32, 16, 8, 4, 2.]
VLSQ - Trap Overhead
[Figure: trap overhead as a percentage of Execution Cycles (axis 0% to 50%) vs. ROB Size (80, 128, 256, 512), one series per VLSQ size: Inf, 64, 32, 16, 8, 4, 2.]
VLSQ - Map/Rename Stalls
[Figure: Stall Cycles per Thousand Inst. (axis 0 to 2,500) for each ROB-VLSQ size pair, 80-Inf through 512-2, broken down into ROB, MEM, and ISSUE stalls.]
VLSQ Pipeline Operation (Continued…)
[Figure: pipeline diagram with blocks Fetch/Decode, Rename, Issue, Register File, Integer, Memory, and Load/Store Queue; four points are marked "Stall".]
VLSQ Summary
• Reduces speculation and replay traps
• Not a big performance drop
• Saves power
• Stall propagates backwards
  • Need a lot of memory independent instructions
• On the critical path?
What if we simply reduce the LSQ size?
VLSQ works!
Small Load/Store Queue
[Figure: pipeline diagram with blocks Fetch/Decode, Rename, Issue, Register File, Integer, Memory, and Load/Store Queue; two points are marked "Stall".]
VLSQ vs. LSQ (Map/Rename Stalls)
[Figure, left panel "VLSQ Stalls": Stall Cycles (axis 0 to 2,500) for each ROB-VLSQ size pair, 80-Inf through 512-2, broken down into ROB, MEM, and ISSUE stalls.]
[Figure, right panel "LSQ Stalls": Stall Cycles (axis 0 to 2,500) for each ROB-LSQ size pair, 80-Base through 512-2, broken down into ROB, MEM, and ISSUE stalls.]
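VLSQ vs. LSQ (Performance)
[Figure, left panel "VLSQ Performance": CPI (axis 0 to 2.5) vs. ROB Size (80, 128, 256, 512), one series per VLSQ size: Inf, 64, 32, 16, 8, 4, 2.]
[Figure, right panel "LSQ Performance": CPI (axis 0 to 2.5) vs. ROB Size (80, 128, 256, 512), one series per LSQ size: Base, 64, 32, 16, 8, 4, 2.]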
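VLSQ vs. LSQ (Trap Overhead)
[Figure, left panel "VLSQ Trap Overhead": percentage of Execution Cycles (axis 0% to 50%) vs. ROB Size (80, 128, 256, 512), one series per VLSQ size: Inf, 64, 32, 16, 8, 4, 2.]
[Figure, right panel "LSQ Trap Overhead": percentage of Execution Cycles (axis 0% to 50%) vs. ROB Size (80, 128, 256, 512), one series per LSQ size: Base, 64, 32, 16, 8, 4, 2.]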
VLSQ vs. LSQ (Summary)

| Metric | Baseline | VLSQ | LSQ |
|---|---|---|---|
| CPI | 1.35 | 1.34 | 1.35 |
| ROB Stall Cycles | 1 | 0 | 0 |
| MEM Stall Cycles | 3 | 0 | 534 |
| ISSUE Stall Cycles | 233 | 363 | 1 |
| Total Stall Cycles | 236 | 364 | 536 |
| Trap Overhead | 45% | 36% | 26% |
| L1 Accesses | 648 | 499 | 451 |
| L1 Misses | 96 | 94 | 91 |
| Fetch Ops | 0% | 12% | 31% |
| Map Ops. | 0% | 12% | 34% |
| Exec Ops. | 0% | 12% | 18% |

ROB Size: 512; VLSQ Size 16; LSQ Size 16
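The comparison in the summary table can be made concrete by computing relative reductions against the baseline column. This is a small arithmetic sketch over the numbers printed above; the helper name `reduction` is hypothetical, and the units of the access and miss counts are taken as given since the slide does not state them. The per-category stall counts sum to within one cycle of the reported totals, which presumably reflects rounding in the original data.

```python
# Rows copied from the summary table: (Baseline, VLSQ, LSQ).
SUMMARY = {
    "L1 Accesses": (648, 499, 451),
    "L1 Misses": (96, 94, 91),
    "Trap Overhead (%)": (45, 36, 26),
}
STALLS = {"ROB": (1, 0, 0), "MEM": (3, 0, 534), "ISSUE": (233, 363, 1)}
TOTAL_STALLS = (236, 364, 536)


def reduction(base, value):
    """Relative reduction of value against base, in percent."""
    return 100.0 * (base - value) / base


for name, (base, vlsq, lsq) in SUMMARY.items():
    print(f"{name}: VLSQ -{reduction(base, vlsq):.1f}%, "
          f"LSQ -{reduction(base, lsq):.1f}%")

# Compare the per-category stall sums with the reported totals.
sums = [sum(col) for col in zip(*STALLS.values())]
print("summed stalls:", sums, "reported:", list(TOTAL_STALLS))
```

On these figures the small LSQ removes roughly 30% of L1 accesses against roughly 23% for the VLSQ, at the same CPI as the baseline.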
LSQ Summary
• Reduces speculation and replay traps
• Performance vs. power tradeoff better than that for VLSQ
• Simpler than VLSQ
  • Not on the critical path
  • Additional power saving from a smaller LSQ
Reducing LSQ size is better than using VLSQ!
VLSQ works BUT…
Dynamic Throttling
• Easy to do dynamic throttling using VLSQ
  • Just need to tweak the VLSQ window size
• Might be better to just vary the LSQ size
  • Maybe we can just shut down parts of the LSQ
• Better to throttle in the issue stage using
  • Just in time instruction delivery [Karkhanis, ISLPED’02]
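As an illustration of what "tweak the VLSQ window size" could look like, here is a minimal sketch of a feedback controller. None of it comes from the slides: the policy (halve the window when replay traps are frequent, double it when they are rare), the thresholds `low` and `high`, and the function name `next_window` are all hypothetical. Only the size bounds of 2 and 64 mirror the range swept in the experiments.

```python
def next_window(window, trap_rate, low=0.01, high=0.05,
                min_size=2, max_size=64):
    """Pick the VLSQ window size for the next interval.

    window    -- current window size
    trap_rate -- replay traps per memory instruction in the last interval
    low, high -- hypothetical thresholds, chosen only for illustration
    """
    if trap_rate > high:
        return max(min_size, window // 2)   # too many traps: throttle harder
    if trap_rate < low:
        return min(max_size, window * 2)    # few traps: allow more re-ordering
    return window


window = 16
for rate in (0.08, 0.08, 0.02, 0.0, 0.0):
    window = next_window(window, rate)
    print(rate, "->", window)
```

The same controller could drive the physical LSQ size instead, which is the alternative the slide suggests; the sketch does not model the cost of resizing either structure.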
Conclusions
• Speculative execution of memory instructions leads to wasted power due to replay traps
• VLSQ helps to reduce memory re-ordering and replay traps
• Reducing the LSQ size is more effective
• For power saving it is better to throttle earlier in the pipeline
Duplicating and Deconstructing Virtual Load/Store Queues
Questions?