
Transcript of shieh06a

Page 1: shieh06a

Load Speculation

Jong-Jiann Shieh, Cheng-Chun Lin, and Shin-Rung Chen

Department of Computer Science and Engineering, Tatung University

[email protected], [email protected], [email protected]

Abstract

A superscalar processor must issue instructions as early as possible to achieve high performance, but a load instruction can normally be issued only after its register dependences are resolved and its memory dependences are known. A register dependence forces a load to wait until the prior instruction that produces its source register has completed; a memory dependence prevents a load from issuing before all aliases with prior stores are resolved. A load can therefore issue only when no register dependences remain and the effective addresses of all prior stores have been calculated. This paper combines two mechanisms, value prediction (VP) and a load forwarding history table (LFHT), to execute load instructions speculatively. Our study shows that doing so yields an average speedup of about 15% over the baseline architecture.

Keywords: load speculation, register dependence, memory dependence, value prediction, load forwarding

1. Introduction

Modern superscalar processors allow instructions to execute out of program order to exploit more instruction-level parallelism (ILP). These processors must monitor data dependences to maintain correct program behavior. There are two types of data dependences: register dependence and memory dependence.

Register dependence is detected in the instruction decode stage by examining the instructions' register operand fields. If a load depends on a prior instruction, it must wait until that instruction completes before the operand value can be used.

The lack of information about memory dependence at instruction decode time is a problem for an out-of-order instruction scheduler. If the scheduler executes a load before a prior store that writes to the same memory location, the load will read the wrong value. In this event the load and all subsequent dependent instructions must be re-executed, resulting in a huge performance penalty.

To avoid these memory order violations, the instruction scheduler can conservatively prevent loads from executing until all prior stores have executed. This approach sacrifices performance because, as the data in section 3 show, the majority of loads are made falsely dependent on stores that do not alias them.

In this paper, we use a simple value predictor to predict operand values, sidestepping register dependences, and propose a structure called the Load Forwarding History Table (LFHT) to exploit memory dependence speculation at run time. By combining the two mechanisms, the predictor helps the LFHT let more loads execute without waiting for prior stores' effective addresses to be calculated, so more loads issue earlier. When a load is speculatively executed, the instructions that depend on it are speculatively executed as well.

The rest of this paper is organized as follows. Section 2 surveys related work. Section 3 presents the overall structure within a superscalar processor. Section 4 describes our CPU model and simulation environment. Section 5 evaluates performance, and section 6 concludes.

2. Related Works

Traditional work on memory disambiguation was done in the context of compiler and hardware mechanisms for non-speculative disambiguation to ensure program correctness. Franklin and Sohi [2] proposed the address resolution buffer (ARB). The ARB directs memory references into bins according to their addresses; the bins are used to enforce a temporal order between references to the same address. The ARB is a banked structure, so multiple disambiguation requests can be dispatched in one cycle, provided they all map to different banks.

Chrysos and Emer [5] used a predictor to solve the memory disambiguation problem. Their goal was to schedule load instructions as early as possible without causing any memory order violations. The proposed predictor is based on store sets. A store set for a specific load is the set of all stores upon which that load has ever depended. The processor adds a store to a load's store set when a memory order violation is caused by the load executing before that store. On the next instance of the load, the store set is consulted to determine which stores the load must wait for before executing.
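The store-set idea described above can be sketched in a few lines. This is a simplified illustration, not Chrysos and Emer's exact hardware (which uses an SSIT and LFST); the class and method names here are invented for the example.

```python
# Simplified sketch of store-set memory dependence prediction: each load
# accumulates the PCs of stores it has ever conflicted with, and must
# wait for in-flight instances of those stores before issuing.

class StoreSetPredictor:
    def __init__(self):
        self.store_sets = {}  # load PC -> set of conflicting store PCs

    def record_violation(self, load_pc, store_pc):
        # Called when a load was found to have issued before an older
        # aliasing store (a memory order violation).
        self.store_sets.setdefault(load_pc, set()).add(store_pc)

    def must_wait(self, load_pc, inflight_store_pcs):
        # The load waits if any in-flight older store is in its store set.
        deps = self.store_sets.get(load_pc, set())
        return any(pc in deps for pc in inflight_store_pcs)

pred = StoreSetPredictor()
pred.record_violation(load_pc=0x400100, store_pc=0x4000F0)
assert pred.must_wait(0x400100, [0x4000F0])      # known conflicting store in flight
assert not pred.must_wait(0x400100, [0x4000A0])  # unrelated store: issue speculatively
```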

Yoaz, Erez, Ronen, and Jourdan designed the collision history table (CHT) predictor [7]. The CHT predicts whether a load will conflict with any store in the instruction window, allocating a new entry only when a load collides for the first time and invalidating the entry when its state changes to non-colliding. It does not predict which store a load will conflict with; it is therefore easier to design, but it does not provide the most precise information for disambiguation purposes.

The color set scheme [10] is a simple mechanism that incorporates multiple speculation levels within the processor and classifies load and store instructions into the appropriate level at run time. Each speculation level is termed a color, and the sets of load and store instructions are called color sets. The colors divide the loads into distinct sets, starting with the base color, which corresponds to the no-violation case: the set of loads that have never collided with unready store instructions in the past. Each color in the spectrum represents an increasing level of aggressiveness in load speculation; a load is allowed to issue only if its color is less than or equal to the current speculation level. If the processor later discovers that the load collided with a store, the color assigned to the load in the predictor is increased.

3. Value Prediction and LFHT

3.1 Issuing a Load

When executing a load or store instruction, the instruction is split into two micro-instructions inside the processor [1]. One calculates the effective address; the other performs the memory access once the effective address has been calculated and any potential store alias dependences have been resolved. In the baseline architecture, each store and load must wait for its effective-address calculation to complete. In addition, all stores issue in order with respect to prior stores, and each load must wait on the most recent store before it can be speculatively issued.

A load instruction spends cycles in three places: (1) waiting for its effective-address calculation (ea), (2) waiting for prior store addresses to be calculated (dep), and (3) the latency of fetching the data (mem). This paper focuses on (1) and (2): we use data prediction to attack (1) and the LFHT to attack (2).

Figure 3.1 shows how many cycles each load instruction spends waiting for its effective address [13]. As the figure shows, each load waits 7 cycles on average before its effective address is available, which wastes considerable issue opportunity.

[Figure: bar chart over bzip2, crafty, gap, gcc, gzip, mcf, parser, twolf, vortex, and vpr; y-axis 0–30 cycles.]

Figure 3.1 Cycles per load instruction spent waiting for its effective address in the baseline architecture.

In the conventional memory dependence disambiguation mechanism [14], load forwarding can detect a store alias and forward the store's data. Figure 3.2 shows the percentage of loads that can take advantage of load forwarding [12]. On the baseline simulation architecture (described in section 4), most loads neither forward store data nor conflict with a prior store: the average fraction of forwarding loads is 12.7%, and the lowest is only 2.7%. In other words, most loads are pending unnecessarily while memory dependences are disambiguated.

[Figure: bar chart over bzip2, crafty, gap, gzip, mcf, parser, and twolf; y-axis 0.0%–50.0%.]

Figure 3.2 Percentage of forwarding load instructions.

3.2 Value Prediction

All loads have to wait until their effective address is calculated before they can be issued. If the load is on the critical path, and the address can be accurately predicted, then it can be beneficial to speculate the value of the address and load the data as soon as possible.

A load instruction is effectively split into two instructions inside the processor, one of which calculates the effective address. To predict this instruction, we predict its operand so that it need not wait for the prior instruction it depends on.

Because we predict an instruction's operand, instructions that have no register dependence on a prior instruction need no prediction: their register values are already exact. In our simulation, we therefore predict only instructions with register dependences. All loads, however, must update the predictor so that its accuracy is maintained.

Data prediction speeds up the effective-address calculation of a load; the load then waits only on potential store aliases before issuing. If an operand was mispredicted, a recovery mechanism takes over once the actual operand is available.
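A simple per-PC stride predictor in this spirit can be sketched as follows. This is an illustrative model, not the paper's implementation; it uses the usual init/transient/steady confidence states (the states section 5.2 refers to) and makes a prediction only in the steady state.

```python
# Hypothetical sketch of a per-instruction stride value predictor.
# A prediction is offered only once the same stride has repeated
# (steady state); otherwise the instruction waits as in the baseline.

class StrideEntry:
    def __init__(self):
        self.last = 0
        self.stride = 0
        self.state = "init"

class StridePredictor:
    def __init__(self):
        self.table = {}  # instruction PC -> StrideEntry

    def predict(self, pc):
        e = self.table.get(pc)
        if e is None or e.state != "steady":
            return None  # no confident prediction available
        return e.last + e.stride

    def update(self, pc, value):
        # Called with the actual operand value when it becomes available.
        e = self.table.setdefault(pc, StrideEntry())
        stride = value - e.last
        if e.state == "init":
            e.state = "transient"
        elif stride == e.stride:
            e.state = "steady"      # stride repeated: gain confidence
        else:
            e.state = "transient"   # stride changed: lose confidence
        e.stride = stride
        e.last = value

p = StridePredictor()
for v in (100, 104, 108):
    p.update(0x400200, v)
assert p.predict(0x400200) == 112  # steady +4 stride
```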

Value prediction has been studied extensively and many schemes have been proposed [4, 6, 11]. In this paper we use the simplest scheme, a stride predictor, to predict the operand of a load instruction.

3.3 Load Forwarding History Table

When a load issues, it looks up the store buffer for a non-committed aliased store while performing its data cache access in parallel. If a store alias is found, the store's data is forwarded to the load, which then sees a shorter latency. If there is no store alias and the data cache hits, the load sees a longer latency because the data cache is pipelined. On a data cache miss, the miss is processed only if no alias is found in the store buffer. In this way, load forwarding detects store aliases and forwards store data, so loads can issue out of order without waiting for prior stores to execute [8, 9, 14].
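The lookup just described can be sketched as follows. This is a behavioral illustration only (the function name and data shapes are invented); a real store buffer is searched associatively in hardware, with the youngest older matching store winning.

```python
# Sketch of load issue with store forwarding: search older non-committed
# stores for a matching address; the youngest match forwards its data,
# otherwise the load reads the (pipelined) data cache.

def issue_load(addr, store_buffer, dcache):
    """store_buffer lists older non-committed stores, oldest first,
    as (address, data) pairs; dcache maps address -> data."""
    for store_addr, data in reversed(store_buffer):  # youngest alias wins
        if store_addr == addr:
            return data, "forwarded"          # short-latency forwarding path
    return dcache.get(addr), "cache"          # longer pipelined-cache path

dcache = {0x1000: 7, 0x3000: 9}
stores = [(0x2000, 1), (0x1000, 42)]          # the store to 0x1000 is younger
assert issue_load(0x1000, stores, dcache) == (42, "forwarded")
assert issue_load(0x3000, stores, dcache) == (9, "cache")
```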

The conventional memory dependence disambiguation mechanism cannot provide this information for a load at the decode stage. To exploit load-forwarding behavior and obtain these benefits, we propose the load forwarding history table (LFHT). The LFHT records the load-forwarding outcome of a load instruction's most recent execution and, when the load is encountered again, determines whether it may issue out of order.

Each LFHT entry contains two fields: a tag and an alias bit. The LFHT is organized as a direct-mapped cache indexed by the PC. The alias bit is sticky: once a load's forwarding behavior first indicates a conflict with a store at execution time, the load is always treated as aliased and waits until all prior store addresses have been calculated before issuing.

The LFHT is established and updated according to each load's forwarding behavior at run time. On an LFHT miss, part of the load's PC is written into the corresponding entry as the tag, and the alias bit is set or cleared according to the forwarding behavior. On an LFHT hit, the alias bit is likewise set or cleared, except that if the alias bit is already set and the forwarding behavior indicates no conflict with a prior store, the bit remains set. On an LFHT hit with the alias bit clear, the load is speculatively executed.
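The lookup and update rules above can be modeled directly. This is a minimal sketch under stated assumptions: the table size and the PC-to-index mapping (word-aligned PC modulo table size) are illustrative, not the paper's configuration.

```python
# Small behavioral model of the LFHT: a direct-mapped, PC-indexed table
# whose entries hold a tag and a sticky alias bit.

class LFHT:
    def __init__(self, entries=2048):
        self.entries = entries
        self.table = [None] * entries     # each entry: [tag, alias_bit]

    def _index_tag(self, pc):
        word = pc >> 2                    # word-aligned instruction PC
        return word % self.entries, word // self.entries

    def may_speculate(self, pc):
        idx, tag = self._index_tag(pc)
        e = self.table[idx]
        # Speculative issue only on a hit with the alias bit clear.
        return e is not None and e[0] == tag and e[1] is False

    def update(self, pc, forwarded):
        idx, tag = self._index_tag(pc)
        e = self.table[idx]
        if e is None or e[0] != tag:
            self.table[idx] = [tag, forwarded]   # allocate on miss
        elif forwarded:
            e[1] = True                          # conflict seen: set the bit
        # a non-forwarding execution never clears a set alias bit (sticky)

t = LFHT()
t.update(0x400300, forwarded=False)
assert t.may_speculate(0x400300)
t.update(0x400300, forwarded=True)
t.update(0x400300, forwarded=False)   # sticky bit is not cleared
assert not t.may_speculate(0x400300)
```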

Speculative loads are validated or invalidated as prior store addresses are calculated. Each time a store address is calculated, all executed speculative loads after that store in the instruction window have their addresses checked for an alias. If an alias is found, recovery action is taken and the load must be re-issued; the corresponding alias bit is set to prevent incorrect speculative execution of that load in the future.

3.4 Combining VP and LFHT

We combine the LFHT of section 3.3 with the value predictor of section 3.2. With VP alone, some loads obtain their operands sooner, but they must still wait for prior stores' effective addresses to rule out memory dependences. With the LFHT alone, the memory disambiguation problem is overcome, but register dependences remain: some loads' operands are unavailable, and those loads must wait for their operands before issuing to a function unit.

We therefore combine the two mechanisms, as shown in figure 3.3, to address both memory dependence and register dependence.

[Figure: block diagram of the pipeline — instruction fetch unit, decode unit, register update unit, load/store function unit, function units, register file — with the VP and LFHT attached via an additional data path.]

Figure 3.3 Architecture data path with VP and LFHT.

4. Evaluation Methodology

4.1 Machine Model

The simulator used in this work is derived from the SimpleScalar 2.0 and 3.0c tool sets [3], a suite of functional and timing simulation tools. The instruction set architecture employed is the Alpha AXP ISA.

Table 1 summarizes the parameters of our baseline architecture; Table 2 lists the architectures studied in this evaluation.

Table 1 Baseline Architecture Configuration

Instruction fetch: 8 instructions per cycle.
Out-of-order execution mechanism: issue of 8 instructions/cycle; 256-entry RUU (the ROB and IW combined); 128-entry load/store queue; loads execute only after all preceding store addresses are known; values bypassed to loads from matching stores ahead in the load/store queue; 2-cycle load-forwarding latency.
Architected registers: 32 integer, hi, lo, 32 floating point.
Functional units (FU): 8 integer ALUs, 8 load/store units, 4 FP adders, 1 integer MULT/DIV, 1 FP MULT/DIV.
FU latency: int ALU 1, load/store 1, int mult 3, int div 12, FP adder 2, FP mult 4, FP div 12, FP sqrt 24.
L1 instruction cache: 64K bytes, 2-way set assoc., 32-byte blocks, 4-cycle hit latency.
L1 data cache: 64K bytes, 2-way set assoc., 32-byte blocks, 4-cycle hit latency; dual ported.
L2 unified cache: 1024K bytes, 4-way set assoc., 64-byte blocks, 12-cycle hit latency.
Memory: access latency (first 36, rest 4) cycles; memory bus width 32 bytes.
TLB miss: 30 cycles.

Table 2 Architectures we studied

Baseline: baseline architecture.
VP: Baseline + VP.
VP + LFHT: Baseline + VP + LFHT.
VP + LFHT with Cycle Clear: Baseline + VP + LFHT with Cycle Clear.
Perfect VP + Perfect LFHT: Baseline + Perfect VP + Perfect LFHT.

Note: Cycle Clear is detailed in section 5.3.

4.2 Benchmarks

To perform our experimental study, we collected results for the SPEC2000 benchmarks. The programs were compiled with the gcc compiler included in the tool set. Tables 4 and 5 show the input data sets for the integer and floating-point benchmarks, respectively. In simulating the benchmarks, we skipped the first billion instructions and collected statistics over the next five hundred million.

Table 4 Input data set for benchmarks

SPECint 2000 / Input:
bzip2: input.source
crafty: crafty.in
gap: ref.in
gcc: 166.i
gzip: input.graphic
mcf: inp.in
parser: ref.in
twolf: ./twolf/ref
vortex: lendian.raw
vpr: net.in & arch.in

SPECfp 2000 / Input:
ammp: ammp.in
applu: applu.in
art: a10.img &
equake: inp.in
galgel: galgel.in
lucas: lucas2.in
mesa: mesa.in
mgrid: mgrid.in
swim: swim.in

5. Performance Analysis

In this section we examine the performance improvement gained by the proposed mechanism and explore the detailed configuration of VP and LFHT. Figures 5.1 and 5.2 first characterize the load instructions of each benchmark, which helps in analyzing the data. Note that load instructions with the same program counter are the same static instruction.

5.1 VP: Cycles Wasted Waiting for EA

Figure 5.1 shows how many cycles each load wastes waiting for its effective address when VP is used. Compared with figure 3.1 in section 3.1, VP saves 4.5 cycles per load instruction.

[Figure: bar chart over bzip2, crafty, gap, gcc, gzip, mcf, parser, twolf, vortex, and vpr; y-axis 0–6 cycles.]

Figure 5.1 Cycles per load instruction spent waiting for its effective address after using VP.

5.2 LFHT

For the integer benchmarks, the average LFHT hit rates are 68% with 128 entries, 85% with 512, 94% with 2048, and 99% with 8192; for the floating-point benchmarks, they are 80% with 128 entries, 99% with 512, 97% with 2048, and 99% with 8192.

Figure 5.2 shows how many cycles each load wastes waiting for its effective address when the LFHT is used: 3.43 cycles on average.

Figure 5.3 shows the same metric after combining VP and LFHT: each load now spends 1.91 cycles on average.

The remaining 1.91 cycles arise because (1) the effective-address calculation needs at least one cycle, and (2) 40% of loads with unavailable operands cannot be predicted, since their state fields in the VP are still init or transient.

5.3 IPC & Speedup

Figure 5.4 shows the IPC of the integer benchmarks, and figure 5.5 that of the floating-point benchmarks.

The alias bit in the LFHT is sticky: it remains set after the first conflicting store is detected. A load whose alias bit is set always waits until all prior store addresses have been calculated before issuing, whether or not a store alias actually occurs. Over a long run, this can suppress load speculation through false data dependences.

To keep the LFHT from becoming too conservative and creating false data dependences, all alias bits in the LFHT are cleared at a fixed interval. In this paper, we model a 50,000-cycle clear (CC) interval.
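The Cycle Clear policy amounts to a periodic sweep of the alias bits. A minimal sketch, assuming the [tag, alias_bit] entry layout used in the section 3.3 discussion and the paper's 50,000-cycle interval:

```python
# Sketch of Cycle Clear: every CLEAR_INTERVAL cycles, all alias bits are
# cleared so that stale conflicts do not suppress speculation forever.

CLEAR_INTERVAL = 50_000

def tick(cycle, table):
    """table: list of LFHT entries, each [tag, alias_bit] or None."""
    if cycle > 0 and cycle % CLEAR_INTERVAL == 0:
        for entry in table:
            if entry is not None:
                entry[1] = False  # clear the sticky alias bit

table = [[0x12, True], None, [0x34, False]]
tick(49_999, table)
assert table[0][1] is True     # not yet time to clear
tick(50_000, table)
assert table[0][1] is False    # alias bits cleared at the interval
```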

Figure 5.6 shows the integer benchmarks' speedup over the baseline: 14.5% on average without Cycle Clear and 16.1% with it. Figure 5.7 shows the floating-point benchmarks' speedup over the baseline: about 5% on average both with and without Cycle Clear. Speedup is calculated as:

(new scheme's IPC − baseline's IPC) / baseline's IPC

In our simulation, we also model perfect schemes for both VP and LFHT, to measure the maximum performance attainable from this approach.
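The speedup formula can be applied mechanically; the IPC values below are illustrative, not taken from the paper's figures:

```python
# Fractional speedup of a new scheme over the baseline:
# (new IPC - baseline IPC) / baseline IPC.

def speedup(new_ipc, base_ipc):
    return (new_ipc - base_ipc) / base_ipc

assert speedup(3.0, 2.0) == 0.5   # a 50% speedup over baseline
```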

With perfect VP, every load instruction is predicted with 100% accuracy: when a load is dispatched to the RUU, its operand's actual value can be used whether or not the operand is available.

With perfect LFHT, a load waits only for stores with the same effective address, so no time is wasted waiting for other stores' effective addresses.

The average speedup with perfect prediction is 20.5% for the integer benchmarks and 7.5% for the floating-point benchmarks. Our realistic results, 16.1% and 5% respectively, come close to this perfect case.

The main reason vortex performs worse than the baseline is its low value-prediction accuracy, which costs extra cycles squashing instructions and re-fetching them into the instruction window.

[Figure: bar chart over bzip2, crafty, gap, gcc, gzip, mcf, parser, twolf, vortex, and vpr; y-axis 0–6 cycles.]

Figure 5.2 Cycles per load instruction spent waiting for its effective address after using the LFHT.

[Figure: bar chart over bzip2, crafty, gap, gcc, gzip, mcf, parser, twolf, vortex, and vpr; y-axis 0–3.5 cycles.]

Figure 5.3 Cycles per load spent waiting for its effective address after combining VP and LFHT.

[Figure: IPC (integer) — grouped bars for baseline, VP (2K), VP+LFHT, with CC, and perfect, over bzip2, crafty, gap, gcc, gzip, mcf, parser, twolf, vortex, and vpr; y-axis 0–7.]

Figure 5.4 IPC (integer benchmarks).

[Figure: IPC (floating-point) — grouped bars for baseline, VP, VP+LFHT, with CC, and perfect, over ammp, applu, art, equake, galgel, lucas, mesa, mgrid, and swim; y-axis 0–6.]

Figure 5.5 IPC (floating-point benchmarks).


[Figure: speedup (integer) — grouped bars for VP, VP+LFHT, with CC, and perfect over the integer benchmarks; y-axis −0.1 to 0.6.]

Figure 5.6 Speedup over baseline (integer benchmarks).

[Figure: speedup (floating-point) — grouped bars for VP, VP+LFHT, with CC, and perfect over the floating-point benchmarks; y-axis −0.02 to 0.2.]

Figure 5.7 Speedup over baseline (floating-point benchmarks).

6. Conclusions

In this paper we present a combined mechanism for improving the load-instruction issue rule in modern superscalar processors. Conventionally, a load issues only after it is certain that no dependences exist, which reduces instruction-level parallelism. We propose a scheme that combines two mechanisms, value prediction (VP) and the load forwarding history table (LFHT), to execute loads speculatively. All VP and LFHT information is established and updated at run time from load-instruction history and load-forwarding behavior, and supplies memory disambiguation information for speculative load issue at issue time. Throughout this study we have not only examined the load issue rule but also revisited memory dependence and disambiguation from two angles: first, we studied the characteristics of load instructions and used that information for memory dependence and disambiguation; second, we proposed a combined scheme that exploits these characteristics to improve instruction-level parallelism.

We evaluated the performance of the proposed architecture with SimpleScalar. VP alone provides an average speedup of 8.5% over the baseline simulation architecture; VP with LFHT reaches 14.5%, and adding the LFHT's Cycle Clear raises it to 16.1%.

References

[1] M. Johnson, Superscalar Microprocessor Design, Prentice Hall, Englewood Cliffs, 1991.

[2] M. Franklin and G. S. Sohi, "ARB: A Hardware Mechanism for Dynamic Reordering of Memory References," IEEE Transactions on Computers, May 1996.

[3] D. C. Burger and T. M. Austin, "The SimpleScalar Tool Set, Version 2.0," Technical Report CS-TR-97-1342, University of Wisconsin, Madison, June 1997.

[4] K. Wang and M. Franklin, "Highly Accurate Data Value Prediction Using Hybrid Predictors," in Proceedings of the 30th Annual IEEE/ACM International Symposium on Microarchitecture, pages 281–290, Dec. 1997.

[5] G. Z. Chrysos and J. S. Emer, "Memory Dependence Prediction Using Store Sets," in Proceedings of the 25th Annual International Symposium on Computer Architecture, pages 142–153, June 1998.

[6] G. Reinman and B. Calder, "Predictive Techniques for Aggressive Load Speculation," in Proceedings of the 31st Annual ACM/IEEE International Symposium on Microarchitecture, pages 127–137, Nov. 1998.

[7] A. Yoaz, M. Erez, R. Ronen, and S. Jourdan, "Speculation Techniques for Improving Load Related Instruction Scheduling," in Proceedings of the 26th International Symposium on Computer Architecture, May 1999.

[8] G. Reinman and B. Calder, "A Comparative Survey of Load Speculation Architectures," Journal of Instruction-Level Parallelism, May 2000.

[9] A. Moshovos and G. S. Sohi, "Reducing Memory Latency via Read-after-Read Memory Dependence Prediction," IEEE Transactions on Computers, pages 313–326, March 2002.

[10] S. Onder, "Cost Effective Memory Dependence Prediction Using Speculation Levels and Color Sets," in Proceedings of the 2002 International Conference on Parallel Architectures and Compilation Techniques, pages 232–241, Sept. 2002.

[11] Huiyang Zhou, J. Flanagan, and T. M. Conte, "Detecting Global Stride Locality in Value Streams," in Proceedings of the 30th Annual International Symposium on Computer Architecture, pages 324–335, June 2003.

[12] Shin-Rung Chen, "Memory Disambiguation Using Load Forwarding," Master's Thesis, Department of Computer Science and Engineering, Tatung University, July 2004.

[13] Cheng-Chun Lin, "Load Speculation," Master's Thesis, Department of Computer Science and Engineering, Tatung University, July 2005.

[14] John Paul Shen and Mikko H. Lipasti, Modern Processor Design: Fundamentals of Superscalar Processors, McGraw-Hill, 2005.