Runahead Execution
An Alternative to Very Large Instruction Windows for Out-of-order Processors
Onur Mutlu, Yale N. Patt
Jared Stark, Chris Wilkerson
2
Outline
• Motivation
• Overview
• Mechanism
• Experimental Evaluation
• Conclusions
3
Motivation
• Out-of-order processors require very large instruction windows to tolerate today’s main memory latencies.
– Even in the presence of caches and prefetchers
• As main memory latency (in terms of processor cycles) increases, instruction window size should also increase to fully tolerate the memory latency.
• Building a very large instruction window is not an easy task.
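A rough back-of-the-envelope sketch of why the window has to grow with memory latency: if the front end keeps delivering instructions while one miss is outstanding, the window needs about latency times width entries to avoid filling up. The function name is hypothetical, and the 500-cycle and 3-wide figures are borrowed from the Baseline Processor slide. It assumes the front end sustains full width every cycle, so it is an upper bound, not a measured requirement.

```python
def window_entries_to_cover(miss_latency_cycles: int, fetch_width: int) -> int:
    """Upper bound on instructions that arrive while a single miss is outstanding.

    Assumption: the front end delivers `fetch_width` instructions every cycle.
    """
    return miss_latency_cycles * fetch_width


# Baseline-like numbers: ~500-cycle L2 miss penalty, 3-wide fetch.
print(window_entries_to_cover(500, 3))  # 1500
```

Under these assumptions the estimate is far above the 128-entry baseline window, which is the gap the talk is addressing.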
4
Small Windows: Full-window Stalls
• Instructions are retired in-order from the instruction window to support precise exceptions.
• When a very long-latency instruction has not completed, it blocks retirement, and incoming instructions fill the instruction window if the window is not large enough.
• Processor cannot place new instructions into the window if the window is already full. This is called a full-window stall.
• L2 misses are responsible for most full-window stalls.
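The stall condition described above can be sketched with a toy model of in-order retirement. This is only an illustration under simplifying assumptions (every instruction is independent, latency is fixed per instruction, and dispatch and retire share one hypothetical `width`); it is not the simulator used in the talk.

```python
from collections import deque


def count_full_window_stalls(latencies, window_size, width=1):
    """Return (total_cycles, stall_cycles) for a stream of instructions.

    latencies: execution latency in cycles for each instruction, in program order.
    A stall cycle is one in which instructions are waiting but the window is full.
    """
    window = deque()   # completion cycle of each in-flight instruction, oldest first
    next_inst = 0
    cycle = 0
    stall_cycles = 0
    while next_inst < len(latencies) or window:
        # In-order retirement: only the oldest instruction may leave, once complete.
        retired = 0
        while window and window[0] <= cycle and retired < width:
            window.popleft()
            retired += 1
        if next_inst < len(latencies) and len(window) >= window_size:
            stall_cycles += 1          # full-window stall: nothing can be inserted
        else:
            inserted = 0
            while (next_inst < len(latencies)
                   and len(window) < window_size
                   and inserted < width):
                window.append(cycle + latencies[next_inst])
                next_inst += 1
                inserted += 1
        cycle += 1
    return cycle, stall_cycles


# One 500-cycle miss followed by twenty single-cycle instructions.
stream = [500] + [1] * 20
print(count_full_window_stalls(stream, window_size=4)[1])
print(count_full_window_stalls(stream, window_size=1024)[1])
```

With a 4-entry window the sketch reports 496 stalled cycles for this stream; with a 1024-entry window it reports none, because the younger instructions fit behind the miss.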
5
Impact of Full-window Stalls
[Figure: percentage of cycles with full-window stalls (y-axis, 0% to 80%) for three machine models (L2 size, instruction window size): 512KB L2 with 128-entry window; perfect L2 with 128-entry window; 512KB L2 with 2048-entry window. IPC labels on the chart, in extraction order: 0.77, 1.69, 1.15.]
6
Overview of Runahead Execution
• During a significant percentage of full-window stall cycles, no work is performed in the processor.
• Runahead execution unblocks the full-window stall caused by a long-latency L2-miss instruction.
• Enter runahead mode when the oldest instruction is an L2-miss load and remove that load from the processor.
• While in runahead mode, keep processing instructions without updating architectural state and without blocking the instruction window due to L2 misses.
• When the original load miss returns, resume normal-mode execution starting with the runahead-causing load.
7
Benefits of Runahead Execution
• Loads and stores independent of L2-miss instructions generate useful prefetch requests:
– From main memory to L2
– From L2 to L1
• Instructions on the predicted program path are prefetched into the trace cache and L2.
• Hardware prefetcher tables are trained using future memory access information. The prefetcher also runs ahead along with the processor.
8
Mechanism
9
Entry into Runahead Mode
When an L2-miss load instruction reaches the head of the instruction window:
• Processor checkpoints architectural register state, branch history register, return address stack.
• Processor records the address of the L2-miss load.
• Processor enters runahead mode.
• L2-miss load marks its destination register as invalid and is removed from the instruction window.
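The entry steps above can be sketched as follows. The `Core` class, its field names, and `enter_runahead` are hypothetical, and the sketch reads "address of the L2-miss load" as the load's instruction address, since that is what is needed to refetch it later; the slides do not spell this out.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Core:
    regs: dict                                        # architectural register values
    branch_history: int = 0
    return_stack: list = field(default_factory=list)
    inv: set = field(default_factory=set)             # registers currently marked INV
    window: list = field(default_factory=list)        # (pc, dest_reg), oldest first
    runahead: bool = False
    checkpoint: Optional[dict] = None


def enter_runahead(core: Core) -> None:
    """Called when the instruction at the head of the window is an L2-miss load."""
    load_pc, dest = core.window[0]
    # Checkpoint architectural registers, branch history, and return address stack,
    # and record the address of the load (assumed here to be its PC).
    core.checkpoint = {
        "regs": dict(core.regs),
        "branch_history": core.branch_history,
        "return_stack": list(core.return_stack),
        "load_pc": load_pc,
    }
    core.runahead = True
    core.inv.add(dest)      # destination register of the miss becomes INV
    core.window.pop(0)      # the load leaves the window so it no longer blocks it


core = Core(regs={"r1": 5, "r2": 7}, window=[(0x40, "r1"), (0x44, "r2")])
enter_runahead(core)
print(core.runahead, sorted(core.inv), core.window)
```

The checkpoint copies the register state instead of aliasing it, so later runahead writes cannot leak into what is restored on exit.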
10
Processing in Runahead Mode
• Two types of results are produced: INV (invalid), VALID
• First INV result is produced by the L2-miss load that caused entry into runahead mode.
• An instruction produces an INV result
– If it sources an INV result
– If it misses in the L2 cache (A prefetch request is generated)
• INV results are marked using INV bits in the register file, store buffer, and runahead cache.
– INV bits prevent introduction of bogus data into the pipeline.
– Bogus values are not used for prefetching/branch resolution.
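INV propagation can be sketched with a tiny interpreter. The instruction encoding, the register names, and the `l2_resident` map are all made up for illustration, and only the register-file INV bits are modeled (not the store buffer or runahead cache).

```python
def run_ahead(instrs, regs, inv, l2_resident):
    """Process instructions in runahead mode and return the prefetch addresses issued.

    instrs: tuples ("add", dst, src1, src2) or ("load", dst, addr_reg)
    regs: register values; inv: set of registers whose value is INV
    l2_resident: address -> value for data that hits in the cache hierarchy
    """
    prefetches = []
    for op, dst, *srcs in instrs:
        if any(s in inv for s in srcs):
            inv.add(dst)                 # sources an INV result: the result is INV,
            continue                     # and no bogus address is used for prefetching
        if op == "load":
            addr = regs[srcs[0]]
            if addr not in l2_resident:
                prefetches.append(addr)  # L2 miss: generate a prefetch, result is INV
                inv.add(dst)
                continue
            regs[dst] = l2_resident[addr]
        else:                            # "add"
            regs[dst] = regs[srcs[0]] + regs[srcs[1]]
        inv.discard(dst)                 # a VALID result clears any stale INV bit
    return prefetches


regs = {"r1": 0x100, "r2": 0x200, "r3": 1}
inv = {"r4"}                             # destination of the runahead-causing load
program = [
    ("load", "r5", "r4"),                # address depends on INV data
    ("load", "r6", "r2"),                # independent load that misses in L2
    ("load", "r7", "r1"),                # independent load that hits
    ("add", "r8", "r7", "r3"),
    ("add", "r9", "r6", "r3"),
]
print(run_ahead(program, regs, inv, l2_resident={0x100: 0x300}))
print(sorted(inv))
```

In this example only the independent miss produces a prefetch; the load whose address comes from the INV register is dropped without touching memory.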
11
Pseudo-retirement in Runahead Mode
• An instruction is examined for pseudo-retirement when it reaches the head of the instruction window.
• An INV instruction is removed from the window immediately.
• A VALID instruction is removed when it completes execution and updates only the microarchitectural state.
• Pseudo-retired instructions free their allocated resources.
• Pseudo-retired runahead stores communicate their data and INV status to dependent runahead loads.
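A minimal sketch of the pseudo-retirement check at the head of the window. The `Entry` fields and the per-cycle `width` limit are assumptions for illustration; the update of microarchitectural state and the freeing of resources are reduced to removing the entry.

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class Entry:
    name: str
    inv: bool     # result is INV
    done: bool    # execution has completed


def pseudo_retire(window, width):
    """Remove up to `width` instructions from the head of the window this cycle."""
    retired = []
    while window and len(retired) < width:
        head = window[0]
        if not head.inv and not head.done:
            break                        # VALID but still executing: wait for it
        # INV entries leave immediately; VALID entries leave once complete.
        retired.append(window.popleft().name)
    return retired


window = deque([
    Entry("ld_miss", inv=True, done=False),
    Entry("add", inv=False, done=True),
    Entry("mul", inv=False, done=False),
    Entry("ld2", inv=True, done=False),
])
print(pseudo_retire(window, width=3))
```

The point of the sketch is that an INV instruction never blocks the head of the window, unlike an incomplete instruction in normal mode.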
12
Runahead Cache
• An auxiliary structure that holds the data values and INV bits for memory locations modified by pseudo-retired runahead stores.
• Its purpose is memory communication during runahead mode.
• Runahead loads access store buffer, runahead cache, and L1 data cache in parallel.
• Size of runahead cache is very small (512 bytes).
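The role of the runahead cache can be sketched as a small address-to-(value, INV) map. The class and function names are hypothetical. The hardware probes the three structures in parallel; the sketch checks them one after another, and the priority order (store buffer, then runahead cache, then L1) is an assumption. Capacity and replacement are not modeled.

```python
class RunaheadCache:
    """Holds data and INV bits written by pseudo-retired runahead stores."""

    def __init__(self):
        self.lines = {}                      # addr -> (value, inv)

    def store(self, addr, value, inv):
        # An INV store records only the INV bit; its data is not meaningful.
        self.lines[addr] = (None if inv else value, inv)

    def lookup(self, addr):
        return self.lines.get(addr)

    def flush(self):
        self.lines.clear()


def runahead_load(addr, store_buffer, rcache, l1):
    """Return (value, inv) for a runahead load.

    store_buffer: addr -> (value, inv) for stores still in the window
    l1: addr -> value for lines present in the L1 data cache
    """
    if addr in store_buffer:
        return store_buffer[addr]
    hit = rcache.lookup(addr)
    if hit is not None:
        return hit
    if addr in l1:
        return (l1[addr], False)
    return (None, True)                      # not found in this sketch: treat as INV


rc = RunaheadCache()
rc.store(0x10, 42, inv=False)                # pseudo-retired VALID store
rc.store(0x18, 0, inv=True)                  # pseudo-retired INV store
l1 = {0x10: 1, 0x20: 7}
print(runahead_load(0x10, {}, rc, l1))       # sees the runahead store, not stale L1 data
print(runahead_load(0x18, {}, rc, l1))
```

Without this structure, a runahead load to 0x10 would read the stale L1 value once the store had left the store buffer.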
13
Runahead Branches
• Runahead branches use the same predictor as normal branches.
• VALID branches are resolved and trigger recovery if mispredicted.
• INV branches cannot be resolved.
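The branch rules above amount to a three-way decision, sketched here with a hypothetical function and made-up return labels.

```python
def handle_runahead_branch(predicted_taken, source_inv, actual_taken=None):
    """Decide what happens to a branch executed in runahead mode.

    source_inv: True if the branch depends on an INV value, so its real
    outcome is unknown and `actual_taken` is ignored.
    """
    if source_inv:
        return "keep_prediction"   # INV branch: cannot be resolved
    if actual_taken != predicted_taken:
        return "recover"           # VALID branch, mispredicted: trigger recovery
    return "continue"              # VALID branch, predicted correctly


print(handle_runahead_branch(True, source_inv=True))
print(handle_runahead_branch(True, source_inv=False, actual_taken=False))
```

A mispredicted INV branch therefore sends runahead down the wrong path with no way to detect it, which is why a later slide reports counts "before mispredicted INV branch" separately.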
14
Exit from Runahead Mode
When the data for the instruction that caused entry into runahead returns from main memory:
• All instructions in the machine are flushed.
• INV bits are reset. Runahead cache is flushed.
• Processor restores the state as it was before the runahead-inducing instruction was fetched.
• Processor starts fetch beginning with the runahead-inducing L2-miss instruction.
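The exit sequence can be sketched as the inverse of entry. The dictionary layout and key names are hypothetical, and `load_pc` assumes the address recorded at entry is the instruction address of the runahead-inducing load.

```python
def exit_runahead(core):
    """Called when the data for the runahead-inducing load returns from memory."""
    core["window"].clear()              # flush all instructions in the machine
    core["inv"].clear()                 # reset INV bits
    core["runahead_cache"].clear()      # flush the runahead cache
    cp = core["checkpoint"]
    core["regs"] = dict(cp["regs"])     # restore the checkpointed state
    core["branch_history"] = cp["branch_history"]
    core["return_stack"] = list(cp["return_stack"])
    core["fetch_pc"] = cp["load_pc"]    # refetch starting with the L2-miss load
    core["runahead"] = False
    core["checkpoint"] = None


core = {
    "regs": {"r1": 123, "r2": 456},     # values produced during runahead
    "branch_history": 0b1011,
    "return_stack": [0x80, 0x90],
    "inv": {"r1", "r5"},
    "window": [(0x60, "r5")],
    "runahead_cache": {0x10: (42, False)},
    "runahead": True,
    "fetch_pc": 0x64,
    "checkpoint": {
        "regs": {"r1": 5, "r2": 7},
        "branch_history": 0b0001,
        "return_stack": [0x80],
        "load_pc": 0x40,
    },
}
exit_runahead(core)
print(core["regs"], hex(core["fetch_pc"]))
```

Nothing computed in runahead mode survives in architectural state; the lasting effects are the prefetched lines and the trained predictor and prefetcher tables, which this sketch does not model.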
15
Experimental Evaluation
16
Baseline Processor
• 3-wide fetch, 29-stage pipeline
• 128-entry instruction window
• 32 KB, 8-way, 3-cycle L1 data cache, write-back
• 512 KB, 8-way, 16-cycle L2 unified cache, write-back
• Approximately 500-cycle penalty for L2 misses
• Streaming prefetcher (16 streams)
17
Benchmarks
• Selected traces out of a pool of 280 traces
• Evaluated performance on those that gain at least 10% IPC improvement with perfect L2 cache
• 147 traces simulated for 30 million x86 instructions
• Trace Suites
– SPEC CPU95 (S95): 10 benchmarks, mostly FP
– SPEC FP2k (FP00): 11 benchmarks
– SPECint2k (INT00): 6 benchmarks
– Internet (WEB): 18 benchmarks: SpecJbb, Webmark2001
– Multimedia (MM): 9 benchmarks: mpeg, speech rec., quake
– Productivity (PROD): 17 benchmarks: Sysmark2k, winstone
– Server (SERV): 2 benchmarks: tpcc, timesten
– Workstation (WS): 7 benchmarks: CAD, nastran, verilog
18
Performance of Runahead Execution
[Figure: instructions per cycle (y-axis, 0.0 to 1.3) for each trace suite (S95, FP00, INT00, WEB, MM, PROD, SERV, WS, AVG). Series: no prefetcher, no runahead; prefetcher, no runahead; runahead, no prefetcher; runahead and prefetcher. Percentage labels, in extraction order: 22%, 52%, 16%, 12%, 22%, 15%, 13%, 35%, 12%.]
19
Effect of Frontend on Runahead
• Average number of instructions during runahead: 711
– Before mispredicted INV branch: 431
• Average number of L2 misses during runahead: 2.6
– Before mispredicted INV branch: 2.38
• Runahead becomes more effective with a better frontend:
– Real trace cache: 22% IPC improvement
– Perfect trace cache: 27% IPC improvement
– Perfect branch predictor and trace cache: 31% IPC improvement
20
Runahead vs. Large Windows
[Figure: instructions per cycle (y-axis, 0.0 to 1.5) for each trace suite (S95, FP00, INT00, WEB, MM, PROD, SERV, WS, AVG). Series: 128-entry window with runahead; 256-entry window; 384-entry window; 512-entry window. Percentage labels, in extraction order: -6%, 4%, 6%, 0%, 3%, -1%, 2%, 12%, 3%.]
21
Importance of Store-Load Data Communication
[Figure: instructions per cycle (y-axis, 0.0 to 1.3) for each trace suite (S95, FP00, INT00, WEB, MM, PROD, SERV, WS, AVG). Series: baseline; runahead, no runahead cache; runahead with runahead cache. Percentage labels, in extraction order: 10%, 23%, 12%, 7%, 13%, 9%, 5%, 3%, 8%.]
22
Conclusions
• Runahead execution results in a 22% IPC increase over the baseline processor with a 128-entry window and a streaming prefetcher.
• This is within 1% of the IPC of a 384-entry window machine.
• Runahead and the streaming prefetcher interact positively.
• Store-load data communication through memory in runahead mode is vital for high performance.
23
Backup Slides
24
Added Bits & Structures
25
Runahead-Prefetcher Interaction
[Figure: instructions per cycle (y-axis, 0.0 to 1.4) for individual benchmarks (tpcc, verilog, nastran, gcc, vortex, mgrid, apsi, ammp). Series: no prefetcher, no runahead; prefetcher, no runahead; runahead, no prefetcher; runahead and prefetcher. Percentage labels, in extraction order: 55%, 48%, 36%, 24%, 3%, 82%, 53%, 17%.]
26
Effect of a Better Frontend
[Figure: instructions per cycle (y-axis, 0.0 to 1.4) for each trace suite (S95, FP00, INT00, WEB, MM, PROD, SERV, WS, AVG). Series: baseline w/ perf TC; runahead w/ perf TC; baseline w/ perf TC and BP; runahead w/ perf TC and BP. Percentage labels, in extraction order: 27%, 75%, 27%, 15%, 26%, 21%, 13%, 35%, 13%, 31%, 89%, 37%, 20%, 34%, 29%, 8%, 36%, 13%.]
27
When do we see the L2 misses in Runahead?
[Figure: L2 Data Miss Distances. x-axis: nth out-of-window miss (miss 1 to miss 16). y-axis: cycles or instructions after the runahead-causing L2 miss (0 to 850). Series: cycles; instrs.]
28
Distance of the first L2 miss in runahead mode
Where does the first L2 miss occur?
[Figure: x-axis: distance in instructions from the runahead-causing L2 miss, in bins 128-256, 256-384, 384-512, 512-640, 640-768, 768-896, 896-1024, 1024-1536, 1536+. y-axis: percentage of first out-of-window L2 misses (0% to 45%).]
29
Instruction vs. Data Prefetching Benefit
[Figure: instructions per cycle (y-axis, 0.0 to 1.3) for each trace suite (S95, FP00, INT00, WEB, MM, PROD, SERV, WS, AVG). Series: baseline; runahead with no inst benefit; runahead with all benefits. Percentage labels, in extraction order: 88%, 96%, 55%, 87%, 91%, 74%, 94%, 97%, 87%.]
30
Runahead with a Larger L2 Cache
[Figure: instructions per cycle (y-axis, 0 to 1.4) for each trace suite (S95, FP00, INT00, WEB, MM, PROD, SERV, WS, AVG). Series: no runahead, 0.5 MB L2; runahead, 0.5 MB L2; no runahead, 1 MB L2; runahead, 1 MB L2; no runahead, 4 MB L2; runahead, 4 MB L2. Percentage labels, in extraction order: 22%, 52%, 16%, 12%, 22%, 15%, 13%, 35%, 12%, 17%, 40%, 13%, 10%, 19%, 12%, 14%, 30%, 7%, 16%, 32%, 8%, 8%, 13%, 11%, 30%, 27%, 6%.]
31
Runahead on in-order?
[Figure: instructions per cycle (y-axis, 0.0 to 1.3) for each trace suite (S95, FP00, INT00, WEB, MM, PROD, SERV, WS, AVG). Series: in-order baseline; in-order baseline with runahead; out-of-order baseline; out-of-order baseline with runahead. Percentage labels, in extraction order: 40%, 55%, 28%, 14%, 21%, 17%, 74%, 73%, 15%, 22%, 52%, 16%, 12%, 22%, 15%, 13%, 35%, 12%.]
32
Future Model Results
33
Future Model with Better Frontend