Ryan DiBiasio, Chris Harris
Advanced Pipelining Techniques
Pipelining is a fundamental concept in computing that originated from the need to process data faster than a single-cycle, bus-style architecture allows. The simplest pipeline is the classic 5-stage RISC pipeline, and it illustrates the idea behind every modern microprocessor: dividing the hardware into stages, rather than having a single stage that each instruction must propagate through over multiple clock cycles. The goal of completing one instruction with every clock cycle led to pipelines becoming the universal approach to computer architecture, and the variations on pipelining are now numerous.
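
To make the one-instruction-per-cycle claim concrete, here is a minimal Python sketch (with illustrative numbers, not drawn from any of the cited sources) of the idealized pipeline speedup: with S equal stages, N instructions take S + (N - 1) cycles instead of S * N.

    # Idealized sketch: why pipelining approaches one instruction per cycle.
    def cycles_unpipelined(n_instructions: int, stages: int) -> int:
        # Each instruction propagates through every stage before the next starts.
        return n_instructions * stages

    def cycles_pipelined(n_instructions: int, stages: int) -> int:
        # The first instruction fills the pipeline; each later one finishes per cycle.
        return stages + (n_instructions - 1)

    if __name__ == "__main__":
        n, s = 1_000_000, 5  # classic 5-stage RISC pipeline
        speedup = cycles_unpipelined(n, s) / cycles_pipelined(n, s)
        print(f"speedup ~ {speedup:.2f}x")  # approaches S = 5 for large N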
Many different advanced pipelining techniques have been developed over the years, and more will continue to be developed. The following gives some insight into what these techniques are, and how the knowledge of past failures and successes can be used to predict what may come next in computing.
EPIC:
One influential design philosophy for microprocessors was created by a joint HP and Intel partnership and labeled EPIC, for "Explicitly Parallel Instruction Computing". The EPIC approach to programming a microprocessor had its roots in the Very Long Instruction Word (VLIW) approach to computing. The initial issue with VLIW was that it could not outpace the superscalar processors of the time: the VLIW approach wasted too much memory and had trouble keeping the pipeline full without inserting excessive "bubbles". The engineers at HP developed a middle path between the brute-force hardware parallelism of superscalar computing and the brute-force program-level parallelism of VLIW. The idea was that superscalar technology, though in the past considered a far better alternative to VLIW, would become far too complex for practical use in the near future, with a negative impact on clock speeds (1). This idea, along with the knowledge that a single-die microprocessor would most likely be available in the near future (1), led HP engineers to develop EPIC as a "design philosophy" more than as a specific architecture.
EPIC has its roots in VLIW in that the processor exploits ILP by running multiple instructions through the pipeline at the same time. However, in order to surpass superscalar designs, which had been hailed as far superior to VLIW, EPIC had to encompass the best aspects of superscalar while fixing the pitfalls of VLIW. The initial concept was to simplify the hardware of the microprocessor as much as possible, and in particular to avoid out-of-order execution, a very complex mechanism that was common in superscalar architectures. The other key difference EPIC hoped to address was enabling the same program to run on different architectures by tailoring it to operate as efficiently as possible on a given set of hardware. The VLIW approach did this by arranging the instructions at compile time into fixed ordered groups and filling the voids with blank data, while superscalar approaches did all of the rearranging in hardware, with complex circuits. The EPIC design used a refined spinoff of the VLIW approach that arranged the code in a superscalar-like manner before the instructions were issued to the processor. This was done by letting the compiler use knowledge of the processor design to arrange the code so that it completes as efficiently as possible. One key constraint remained: the processor worked best as an in-order design, since results arriving out of order would complicate the compiler's job. Even so, with such lofty goals, starting from a VLIW approach led to many problems that had to be overcome.
The main issues with the VLIW approach stemmed from how the compiler had to set up the code in order for it to be usable. VLIW segments code into blocks, which leaves holes not only in the pipeline but also in memory, where the code itself is stored. This way of encoding operations led to very large memory usage for program code, much of which was padding. In addition, VLIW approaches worked best when the code was built for a specific VLIW processor, which is not an option for most users, who did not want to buy new software for every architecture (and even worse for programmers, who did not want to rewrite software for every architecture). EPIC addresses each of these individually, reducing the bubbles caused by interrupts with additional instructions for load prediction and extra methods of branch prediction, since stalls in the pipeline of a VLIW-based processor are significantly worse than stalls in other architectures. The other issue for EPIC to address was large code size, which the HP engineers delegated to the compiler. To generate code for the processor, the compiler first optimizes the code using its knowledge of the processor. By estimating how long each instruction will take to complete, the compiler schedules code with minimal holes, so that instructions finish at roughly the same time and a large number of data hazards are avoided.
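
As a rough illustration of this compile-time scheduling, the following Python sketch greedily packs independent instructions into fixed-width issue bundles, padding unused slots with no-ops. The instruction format, the three-slot width, and the hazard test are all simplifying assumptions of mine, not HP's actual algorithm.

    # Hypothetical sketch of compile-time bundling in the VLIW/EPIC spirit.
    WIDTH = 3  # issue slots per long instruction word (assumed)

    def independent(instr, bundle):
        """True if instr has no RAW/WAW/WAR conflict with anything in the bundle."""
        dst, srcs = instr
        for b_dst, b_srcs in bundle:
            if dst == b_dst or dst in b_srcs or b_dst in srcs:
                return False
        return True

    def schedule(instructions):
        bundles, i = [], 0
        while i < len(instructions):
            bundle = [instructions[i]]
            i += 1
            # Fill remaining slots only while the next instruction is independent
            # (a real compiler would search a wider window and reorder code).
            while (len(bundle) < WIDTH and i < len(instructions)
                    and independent(instructions[i], bundle)):
                bundle.append(instructions[i])
                i += 1
            while len(bundle) < WIDTH:
                bundle.append(("nop", ()))  # wasted slot -> the code-bloat problem
            bundles.append(bundle)
        return bundles

    # r3 depends on r1, so it cannot share a bundle with the write to r1:
    code = [("r1", ("r2",)), ("r4", ("r5",)), ("r3", ("r1",))]
    for b in schedule(code):
        print(b)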
The EPIC architecture deals with many issues that are present in all architectures in order to handle hazards completely, but the more interesting solutions are for branch hazards and register renaming. Branch hazards are always an issue for branch prediction, but even more so in an architecture that uses the VLIW approach. To solve this, EPIC uses the compiler to assemble many different branch options into larger pieces, and even to avoid branches entirely where possible (1). The issue with register renaming stems from why it is done on a superscalar architecture in the first place: to handle various data dependencies, a superscalar architecture often renames registers in order to prevent data hazards from presenting themselves. The additional freedom given to the compiler to rewrite the code let the EPIC design fix the problem of using more registers than the instruction set can see: by renaming the registers in the program itself, an EPIC microprocessor could use the same number of registers as a superscalar processor while having all of them visible at all times.
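
The renaming idea can be sketched in a few lines. The Python below is a hypothetical illustration, not HP's scheme: every write receives a fresh register name, which removes WAW and WAR hazards while preserving true (RAW) dependences.

    # Minimal sketch of register renaming done in software (assumptions mine).
    from itertools import count

    def rename(code):
        fresh = count()          # supply of new register names
        current = {}             # architectural name -> latest physical name
        renamed = []
        for dst, srcs in code:
            # RAW: read the most recent physical name for each source.
            new_srcs = tuple(current.get(s, s) for s in srcs)
            # WAW/WAR: every write gets a brand-new destination register.
            new_dst = f"p{next(fresh)}"
            current[dst] = new_dst
            renamed.append((new_dst, new_srcs))
        return renamed

    # r1 is written twice (WAW) and read in between (RAW + WAR):
    code = [("r1", ("r2",)), ("r3", ("r1",)), ("r1", ("r4",))]
    print(rename(code))
    # -> [('p0', ('r2',)), ('p1', ('p0',)), ('p2', ('r4',))]  no false hazards left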
History of pipelining:
Pipelining was developed in the early years of microprocessor architecture, and it is part of what drew the user base away from supercomputers and toward microprocessors. The basic reference design is the 5-stage RISC pipeline, which allows one instruction to be completed per clock cycle and in general improves the throughput of any processor by a significant amount. The chief distinctions among pipeline designs are the number of stages and how the stages are set up, or rather how the new stages divide up the tasks of the older stages. In general, the more numerous and more uniform the pipeline stages, the shorter the critical path per stage, which allows a faster clock cycle. Processor engineers initially aimed to create as many pipeline stages as possible while keeping each stage at roughly the same critical-path delay, so as to minimize the clock period. However, as Intel eventually realized, there are some problems with this philosophy.
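
A back-of-the-envelope model shows the tradeoff. In the Python sketch below, a fixed amount of logic delay is divided among the stages, while each stage adds a fixed pipeline-register overhead; both delay numbers are illustrative assumptions, not measurements of any real processor.

    # Splitting fixed logic across more stages shortens the per-stage critical
    # path, but latch overhead per stage caps the benefit. Numbers are assumed.
    LOGIC_DELAY_NS = 10.0    # total combinational delay of one instruction
    LATCH_OVERHEAD_NS = 0.2  # pipeline-register overhead added per stage

    for stages in (5, 10, 20, 31):
        cycle = LOGIC_DELAY_NS / stages + LATCH_OVERHEAD_NS
        print(f"{stages:>2} stages: cycle = {cycle:.3f} ns "
              f"-> max clock ~ {1.0 / cycle:.2f} GHz")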
Intel's Pentium chip used the basic 5-stage architecture. Intel then progressed quickly: the Pentium Pro ran a 12-stage pipeline. This approach to pipelining allowed them to create a faster processor without significantly altering the underlying hardware technology. The next jump was backwards a bit: the Pentium III used a 10-stage pipeline but featured faster hardware that allowed significantly better clock speeds. However, the P3 design was capped at around 1 GHz, something the engineers at Intel looked to fix with the Pentium 4. To solve this, they built the P4 with a 20-stage pipeline, allowing a drastic increase in attainable clock speeds, with the downside of not actually accomplishing twice as much as a P3 at twice the clock speed. With clock speeds climbing, keeping execution speed growing required an even faster pipeline, which led to a 31-stage design for the Prescott cores in the P4 family. These cores topped out close to 4 GHz, but in the end drew too much power (2).
Intel's next step was to rein in the extreme number of pipeline stages. To keep power usage lower, they developed a newer, smaller architecture and ran the CPU with an efficient 14-stage pipeline. This pipeline was also wider than its predecessors, supporting 4 instructions at a time instead of the 3 supported by previous Intel processors. The shorter pipeline gave Intel fast processors without the issues associated with a very deep pipeline: larger branch misprediction penalties, as well as increased penalties from data hazards involving loads. Overall, Intel's pipelines initially grew in stage count over time, but after the problems with the P4, Intel dropped down to a more manageable size and has not increased it since the move to the 14 stages of the Core architecture (3).
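
The penalty side of deep pipelines is easy to quantify with a simple model. The Python sketch below assumes, hypothetically, that a misprediction flushes the whole pipeline, so the flush penalty grows with depth; the branch frequency, misprediction rate, and clock speeds are illustrative assumptions.

    # Why twice the clock did not mean twice the work: mispredictions cost more
    # cycles in a deeper pipeline. All rates and penalties here are assumed.
    BRANCH_FRACTION = 0.20   # fraction of instructions that are branches
    MISPREDICT_RATE = 0.05   # fraction of branches predicted wrongly

    def effective_cpi(base_cpi: float, flush_penalty_cycles: int) -> float:
        return base_cpi + BRANCH_FRACTION * MISPREDICT_RATE * flush_penalty_cycles

    # Assume the flush penalty scales with pipeline depth:
    for depth, clock_ghz in ((10, 1.0), (20, 2.0), (31, 3.8)):
        cpi = effective_cpi(1.0, depth)
        mips = clock_ghz * 1000 / cpi   # millions of instructions per second
        print(f"{depth:>2}-stage @ {clock_ghz} GHz: CPI = {cpi:.2f}, ~{mips:.0f} MIPS")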
ILP:
Pipelining is our practical means of exploiting instruction-level parallelism. At the heart of this concept is the knowledge that the demands of ILP cannot be perfectly met by hardware and coding schemes: imperfect branch and jump prediction, long memory accesses, and a limited supply of virtual registers will always remain problems. Nevertheless, since the 1960s we have been on a journey to make pipelined processors fit an ideal image of what instruction-level parallelism would look like. Although no processor yet fits that ideal, the techniques essential to today's processors are definite stepping stones toward that level of efficiency and throughput.
Scoreboarding:
There are a few ways to implement dynamic scheduling in a machine. One of the simpler, yet effective, methods for accomplishing multiple tasks simultaneously came with scoreboard technology. The scoreboard exploits instruction-level parallelism by executing the instructions in a basic block simultaneously (a block here refers to a code sequence that contains no branches except at the entry and exit). Scoreboarding allows instructions to execute simultaneously as long as two conditions are met: there must not be any data dependences between the instruction and those recently issued or still being processed, and the hardware resources must be available at the time the scoreboard logs the instruction. Instructions are therefore issued in order; however, if there are no dependences, an instruction may finish earlier than a previously issued one (out-of-order completion). In this way, scoreboarding can be classified as a very simple form of dynamic scheduling (4).
The CDC 6600 was one of the first machines to implement scoreboarding, and at the time it was considered unique, since the concept of superscalar architecture barely existed. Even though the CDC 6600 used a 10 MHz clock, its parallelism let it execute instructions at roughly the speed of a 40 MHz CPU (7). Scoreboarding builds on the MIPS-style pipeline approach in which structural and data hazards are checked in the early stages.
Four stages handle each instruction: issue, read operands, execute, and write result. The issue stage checks that the needed hardware is available and checks for WAW hazards; if one is detected, the instruction is delayed. In the read-operands stage, the source operands are checked to see whether any previously issued instruction will still write to them as a destination (RAW hazards). The instruction may then safely execute in the execute stage, and finally it notifies the scoreboard once a result has been obtained, so the scoreboard knows when it is safe to issue another instruction. After execution, the scoreboard checks for WAR hazards before writing the result to the destination. The scoreboard manages all this by tracking the status of the instructions, functional units, and registers (4).
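
The bookkeeping behind these checks can be sketched compactly. The Python below is my simplification of the scoreboard described above (it omits the WAR check before write-result and tracks only one pending write per register); it is not the CDC 6600's actual table layout.

    # Simplified scoreboard: structural + WAW checks at issue, RAW at read.
    class Scoreboard:
        def __init__(self, functional_units):
            self.free_units = set(functional_units)
            self.pending_writes = {}  # register -> id of instruction writing it

        def can_issue(self, unit, dst):
            # Issue stage: stall on structural hazards and WAW hazards.
            return unit in self.free_units and dst not in self.pending_writes

        def issue(self, instr_id, unit, dst):
            self.free_units.discard(unit)
            self.pending_writes[dst] = instr_id

        def can_read_operands(self, srcs):
            # Read-operands stage: stall while any source awaits a write (RAW).
            return all(s not in self.pending_writes for s in srcs)

        def write_result(self, instr_id, unit, dst):
            # Write-result stage frees the unit and clears the pending write.
            self.free_units.add(unit)
            if self.pending_writes.get(dst) == instr_id:
                del self.pending_writes[dst]

    sb = Scoreboard({"adder", "multiplier"})
    sb.issue(0, "multiplier", "r1")           # r1 <- ... on the multiplier
    assert not sb.can_read_operands(("r1",))  # RAW: r1 is not yet written
    assert not sb.can_issue("adder", "r1")    # WAW: r1 already has a writer
    sb.write_result(0, "multiplier", "r1")
    assert sb.can_read_operands(("r1",))      # safe once the result is written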
Overall, one of the main purposes of this implementation is to reduce stalls; however, several factors limit its efficiency. As stated previously, this form of scheduling is useful only within code block segments, since branch conditions must be evaluated before the following instructions can be issued. Moreover, for the scoreboard and other processors that use simultaneous execution, the window size, meaning how far ahead the processor can look in the instruction stream, limits the number of instructions that can be issued in a cycle (4). The kinds and numbers of hardware units are also a limitation, since structural hazards arise during dynamic scheduling. Finally, antidependences and output dependences will still create stalls.
In scoreboarding, results are not forwarded to satisfy dependences: both operand registers must be available before the operands can be read. However, results are written into the register file as soon as execution completes, assuming no WAR hazard. One structural hazard presents itself in the buses: since the scoreboard must communicate with the functional units, the number of units in use cannot exceed the number of buses available. Unfortunately, scoreboarding does not take advantage of compile-time optimization of the code sequence, so a deeper pipeline might seem a good substitute; however, that causes drawbacks in other parts of the system and its functional units. Due to these restraints, better implementations of dynamic scheduling came about that dealt specifically with the hazards and limitations scoreboarding could not handle. Even though scoreboard technology provided a method for dealing with structural and data hazards, it was soon surpassed by the implementation of Tomasulo's algorithm, which avoids WAW and WAR dependences altogether through register renaming.
Tomasulo:
After scoreboard technology gained ground, more thought was put into minimizing the hazards that remained. The scheme provided by Robert Tomasulo aims to take care of exactly these hazards. The main differences from scoreboarding are register renaming, reservation stations, a common data bus, and the fact that the algorithm can operate across branches. The technique has quite a few highlights. For instance, the compiler can leave code unoptimized and the hardware can still achieve high performance. Caches were not yet widespread when this scheme was introduced, and their later addition only compounds the efficiency of the algorithm, because during a cache miss, out-of-order instructions can still be executing without penalty. For these reasons and more, this implementation of dynamic scheduling was heavily adopted during the growth of the 1990s.
Tomasulo's algorithm puts each instruction through three stages: issue, execute, and write result. The first improvement over other schemes appears right at issue. Reservation stations, which buffer the operands of issued instructions (in order to resolve dependences), keep registers available and keep the operands ready to be operated on. From the programmer's point of view, a single named register may be represented by two or more reservation stations simultaneously (6). This eliminates WAR and WAW hazards: if an instruction fetches its operand from the reservation station rather than the register, the register is left open for previously issued instructions to write to. This step effectively renames the registers specified in the code to whichever stations are available that clock cycle. In the next stage, operands are placed into reservation stations as soon as they become available; once all of an instruction's operands have been supplied, the operation can execute. Instructions still need to be delayed, however, to avoid read-after-write hazards. When several instructions have their operands available in the same clock cycle, the functional units take them in arbitrary order, except for loads and stores, which are kept in the order they were written in the code. In the final step, results are written on a "common data bus" (7): instead of waiting for results to be placed in registers, they are broadcast on the bus so that dependent instructions can be satisfied without waiting for the write.
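
The interplay of reservation stations, renaming, and the common data bus can be sketched as follows. The Python below is a compact illustration under my own simplifications (a single operation type and no execution timing); the field names vj/vk and qj/qk follow the textbook convention (4).

    # Sketch of Tomasulo's key ideas: stations, tag-based renaming, CDB broadcast.
    class ReservationStation:
        def __init__(self, tag):
            self.tag = tag
            self.vj = self.vk = None  # operand values, once known
            self.qj = self.qk = None  # tags of producing stations, if waiting

    class Tomasulo:
        def __init__(self, n_stations):
            self.stations = [ReservationStation(f"RS{i}") for i in range(n_stations)]
            self.free = list(self.stations)
            self.reg_status = {}      # register -> tag of station producing it
            self.reg_file = {}        # committed register values

        def issue(self, dst, src1, src2):
            rs = self.free.pop()
            for v_field, q_field, src in (("vj", "qj", src1), ("vk", "qk", src2)):
                if src in self.reg_status:  # value still being computed:
                    setattr(rs, q_field, self.reg_status[src])  # wait on its tag
                else:
                    setattr(rs, v_field, self.reg_file.get(src, 0))
            self.reg_status[dst] = rs.tag   # rename: dst now names this station
            return rs

        def broadcast(self, tag, value, dst):
            # Common data bus: every waiting station snoops the result at once.
            for rs in self.stations:
                if rs.qj == tag:
                    rs.vj, rs.qj = value, None
                if rs.qk == tag:
                    rs.vk, rs.qk = value, None
            if self.reg_status.get(dst) == tag:  # only the newest writer
                self.reg_file[dst] = value       # updates the register file
                del self.reg_status[dst]

    t = Tomasulo(3)
    a = t.issue("r1", "r2", "r3")  # r1 <- r2 op r3
    b = t.issue("r4", "r1", "r2")  # waits on station a's tag, not on r1 itself
    assert b.qj == a.tag
    t.broadcast(a.tag, 42, "r1")
    assert b.vj == 42 and t.reg_file["r1"] == 42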
The IBM 360/91 was designed around Tomasulo's algorithm in order to achieve high floating-point performance. Even though it had only four floating-point registers, long memory delays, and a two-address instruction format, the algorithm minimized register pressure as much as possible by tracking instruction dependences (6).
With the common data bus and the reservation stations, many hazards can be avoided at the cost of an extra clock cycle: instead of results coming straight from a functional unit, they are buffered on the bus before the write stage takes place, and data may not be stored at all unless absolutely necessary. Compared to scoreboarding, data does not need to pass through a register between functional units, which improves how efficiently the system is used. As far as branch prediction is concerned, if branches are correctly predicted the reservation stations work well; for example, multiple iterations of a loop can unfold. Another addition to Tomasulo's algorithm was the reorder buffer. It essentially allows instructions to be issued in order, execute out of order, and then complete in order. On the surface the machine may not appear to be using dynamic scheduling, yet all functional units are kept busy in the execution stage.
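
The reorder-buffer idea can be sketched briefly: entries are allocated in program order, results may arrive in any order, and entries retire strictly from the head. The interface names in this Python sketch are illustrative assumptions.

    # In-order issue, out-of-order completion, in-order retirement.
    from collections import deque

    class ReorderBuffer:
        def __init__(self):
            self.entries = deque()  # program order; the head is the oldest

        def allocate(self, instr):
            entry = {"instr": instr, "done": False, "value": None}
            self.entries.append(entry)  # in-order issue
            return entry

        def complete(self, entry, value):
            entry["done"], entry["value"] = True, value  # out-of-order finish

        def commit(self):
            retired = []
            while self.entries and self.entries[0]["done"]:
                retired.append(self.entries.popleft())  # in-order retirement
            return retired

    rob = ReorderBuffer()
    e1, e2 = rob.allocate("mul r1"), rob.allocate("add r2")
    rob.complete(e2, 7)        # the add finishes first...
    assert rob.commit() == []  # ...but cannot retire past the older mul
    rob.complete(e1, 3)
    assert [e["instr"] for e in rob.commit()] == ["mul r1", "add r2"]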
Furthermore, Tomasulo's approach was built upon heavily by the addition of caches, hardware speculation, and branch prediction. Given the ideas behind the approach, especially its handling of hazards and of unnecessary use of functional units, Tomasulo's design left far more room for growth and advancement than scoreboarding.
Branch Prediction:
In discussing pipeline performance, we assumed that instruction-level parallelism occurs within segments or blocks of code bounded by branches. Because branches hold the conditions that determine which instructions should be issued next, they drastically slow down the pipeline. To reduce this delay, branch prediction is employed. If the prediction is right, there should be no penalty for carrying control dependences through a branch instruction. Prediction splits into two categories: static and dynamic. Static branch prediction can be done when code is compiled; some branching behavior, however, can only be predicted at run time, by the hardware.
On the static side, there are several methods for predicting branches. Since a branch has only two outcomes, the options are easy to enumerate. The simplest method is to predict every branch as taken; however, the resulting accuracy is low and usually less beneficial than other schemes. A better approach is to base branch predictions on previous runs of the executed code. This way a more accurate prediction can be made, especially if a branch is, say, taken 90% of the time.
However, as more conditional branches present themselves, static branch prediction starts to lose its effectiveness. Dynamic schemes are useful because they weigh the delay of waiting on a branch against fetching from multiple candidate PC addresses. The logical way to implement this is to give the hardware a "memory" of past branches, so dynamic prediction schemes are implemented with buffers in hardware. One-bit predictors have been used, but it is preferable to require two incorrect predictions before the prediction flips, so a two-bit buffer is more common. This accommodates branches that strongly favor being taken or not taken (4). Treating the two-bit buffer like a small cache entry, a taken branch moves the counter up one state, and a second taken branch saturates it; two misses in a row are then required to change the prediction. One could speculate about an "n-bit" buffer, but this becomes what is known as "profiling," which reduces to comparing whether "1" or "0" appears more often (8). Instead of a larger counter, it is practical to keep a table, indexed by branch address, of separate 2-bit predictors (9).
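
Such a two-bit saturating counter is easy to express directly. In the Python sketch below (the starting state and the example branch stream are my assumptions), a single not-taken outcome does not flip a strongly taken prediction.

    # Two-bit saturating counter: states 0-1 predict not-taken, 2-3 predict taken.
    class TwoBitPredictor:
        def __init__(self):
            self.state = 1  # start weakly not-taken (assumed)

        def predict(self) -> bool:
            return self.state >= 2  # True means "taken"

        def update(self, taken: bool):
            if taken:
                self.state = min(3, self.state + 1)
            else:
                self.state = max(0, self.state - 1)

    p = TwoBitPredictor()
    outcomes = [True, True, False, True, True]  # a loop branch, mostly taken
    hits = 0
    for taken in outcomes:
        hits += (p.predict() == taken)
        p.update(taken)
    print(f"{hits}/{len(outcomes)} correct")  # one not-taken does not flip it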
Furthermore, a somewhat stronger scheme is to make predictions based on what are called "correlating predictors" or "two-level predictors." These look at the history of previous branch outcomes in order to decide whether to take the current branch. Layered on top of the two-bit buffer scheme, this markedly improves the chance of a correct prediction. For example, based on the outcome of a previous branch, there can be two separate two-bit buffers for the current branch, one correlating with the earlier branch being taken and one with it not being taken.
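
A minimal version of such a predictor is sketched below: the outcomes of the last two branches select which two-bit counter to use for a given branch address. The table sizing and indexing are illustrative assumptions. Note how an alternating branch, which defeats a single two-bit counter, becomes almost perfectly predictable once the pattern is learned.

    # Simple (2, 2) correlating predictor: 2 bits of global history select
    # among per-address two-bit counters. Sizing/indexing are assumed.
    class CorrelatingPredictor:
        def __init__(self):
            self.history = 0   # last 2 branch outcomes packed into 2 bits
            self.tables = {}   # (pc, history) -> 2-bit counter state

        def predict(self, pc: int) -> bool:
            return self.tables.get((pc, self.history), 1) >= 2

        def update(self, pc: int, taken: bool):
            key = (pc, self.history)
            state = self.tables.get(key, 1)
            self.tables[key] = min(3, state + 1) if taken else max(0, state - 1)
            self.history = ((self.history << 1) | int(taken)) & 0b11

    pred = CorrelatingPredictor()
    pattern = [i % 2 == 0 for i in range(20)]  # T, N, T, N, ... alternating
    correct = 0
    for taken in pattern:
        correct += (pred.predict(0x40) == taken)
        pred.update(0x40, taken)
    print(f"{correct}/{len(pattern)} correct")  # only the warm-up steps miss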
Overall, branch prediction is extremely helpful in pipelining strategies because it reduces the delay in executing instructions in parallel. Instead of waiting for conditions to be evaluated before exploiting instruction-level parallelism, resources are kept continually busy, on the theory that it is better to keep functional units working and be wrong a small percentage of the time than to waste time waiting for conditions to be confirmed.
In all, we have seen pipelining grow and continue to prove its usefulness throughout the years. The hazards that presented themselves at the beginning occur less and less often with each new design and algorithm. Even physical limitations are starting to pose less of an obstacle, thanks to increased accuracy in branch prediction. What remains now is the refinement of these technologies to fit the processing needs of our present world.
Works Cited
(1) Michael S. Schlansker and B. Ramakrishna Rau. "EPIC: An Architecture for Instruction-Level Parallel Processors." Compiler and Architecture Research, HP Laboratories, February 2000. http://www.hpl.hp.com/techreports/1999/HPL-1999-111.pdf
(2) Brandon Bell. "Intel's Core 2 Extreme/Duo Processors." July 13, 2006. http://www.firingsquad.com/hardware/intel_core_2_performance/
(3) Scott M. Fulton. "First Intel next-gen news: Lower wattage, fewer pipeline stages." Tom's Hardware US, August 23, 2005. http://www.tomshardware.com/news/intel,1317.html
(4) John L. Hennessy, David A. Patterson, and Andrea C. Arpaci-Dusseau. Computer Architecture: A Quantitative Approach. 4th ed. Amsterdam: Morgan Kaufmann, 2007.
(5) Ralph Grishman. Assembly Language Programming for the Control Data 6000 Series and the Cyber 70 Series. New York, NY: Algorithmics Press, 1974.
(6) Harold S. Stone. High-Performance Computer Architecture (Addison-Wesley Series in Electrical and Computer Engineering). Repr. with corrections. Glenview: Addison-Wesley, 1988.
(7) R. M. Tomasulo. "An efficient algorithm for exploiting multiple arithmetic units." IBM Journal of Research and Development, vol. 11, pp. 25-33, 1967.
(8) P. G. Emma. "Branch prediction." Chapter 3 of D. Kaeli and P.-C. Yew (eds.), Speculative Execution in High Performance Computer Architectures. CRC Press, Boca Raton, FL, 2005.
(9) S. McFarling. "Combining branch predictors." Technical Report WRL TN-36, Western Research Laboratory, Digital Equipment Corporation, Palo Alto, CA, June 1993.