Ryan DiBiasio, Chris Harris
Advanced Pipelining Techniques
Pipelining is a fundamental concept in computing that originated from the need to process data faster than a single-cycle, bus-style architecture allows. The simplest pipeline is the classic 5-stage RISC pipeline, and it illustrates the idea behind every modern microprocessor: dividing the hardware into stages, rather than having a single stage that each instruction must propagate through over multiple clock cycles. The goal of completing one instruction with every clock cycle led to pipelines becoming the universal approach to computer architecture, and the variations on pipelining are now numerous.
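
To make the one-instruction-per-cycle claim concrete, here is a minimal Python sketch (with illustrative numbers, not drawn from any of the cited sources) of the idealized pipeline speedup: with S equal stages, N instructions take S + (N - 1) cycles instead of S * N.

    # Idealized sketch: why pipelining approaches one instruction per cycle.
    def cycles_unpipelined(n_instructions: int, stages: int) -> int:
        # Each instruction propagates through every stage before the next starts.
        return n_instructions * stages

    def cycles_pipelined(n_instructions: int, stages: int) -> int:
        # The first instruction fills the pipeline; each later one finishes per cycle.
        return stages + (n_instructions - 1)

    if __name__ == "__main__":
        n, s = 1_000_000, 5  # classic 5-stage RISC pipeline
        speedup = cycles_unpipelined(n, s) / cycles_pipelined(n, s)
        print(f"speedup ~ {speedup:.2f}x")  # approaches S = 5 for large N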
Many different advanced pipelining techniques have been developed over the years, and more will continue to be developed. The following gives some insight into what these techniques are, and how the knowledge of past failures and successes can be used to predict what may come next in computing.
EPIC:
One influential design philosophy for microprocessors was created by a joint HP and Intel partnership and labeled EPIC, for "Explicitly Parallel Instruction Computing". The EPIC approach to programming a microprocessor had its roots in the Very Long Instruction Word (VLIW) approach to computing. The initial issue with VLIW was that it could not outpace the superscalar processors of the time: the VLIW approach wasted too much memory and had trouble keeping the pipeline full without inserting excessive "bubbles". The engineers at HP developed a middle path between the brute-force hardware parallelism of superscalar computing and the brute-force program-level parallelism of VLIW. The idea was that superscalar technology, though in the past considered a far better alternative to VLIW, would become far too complex for practical use in the near future, with a negative impact on clock speeds (1). This idea, along with the knowledge that a single-die microprocessor would most likely be available in the near future (1), led HP engineers to develop EPIC as a "design philosophy" more than as a specific architecture.
EPIC has its roots in VLIW in that the processor exploits ILP by running multiple instructions through the pipeline at the same time. However, in order to surpass superscalar designs, which had been hailed as far superior to VLIW, EPIC had to encompass the best aspects of superscalar while fixing the pitfalls of VLIW. The initial concept was to simplify the hardware of the microprocessor as much as possible, and in particular to avoid out-of-order execution, a very complex mechanism that was common in superscalar architectures. The other key difference EPIC hoped to address was enabling the same program to run on different architectures by tailoring it to operate as efficiently as possible on a given set of hardware. The VLIW approach did this by arranging the instructions at compile time into fixed ordered groups and filling the voids with blank data, while superscalar approaches did all of the rearranging in hardware, with complex circuits. The EPIC design used a refined spinoff of the VLIW approach that arranged the code in a superscalar-like manner before the instructions were issued to the processor. This was done by letting the compiler use knowledge of the processor design to arrange the code so that it completes as efficiently as possible. One key constraint remained: the processor worked best as an in-order design, since results arriving out of order would complicate the compiler's job. Even so, with such lofty goals, starting from a VLIW approach led to many problems that had to be overcome.
The main issues with the VLIW approach stemmed from how the compiler had to set up the code in order for it to be usable. VLIW segments code into blocks, which leaves holes not only in the pipeline but also in memory, where the code itself is stored. This way of encoding operations led to very large memory usage for program code, much of which was padding. In addition, VLIW approaches worked best when the code was built for a specific VLIW processor, which is not an option for most users, who did not want to buy new software for every architecture (and even worse for programmers, who did not want to rewrite software for every architecture). EPIC addresses each of these individually, reducing the bubbles caused by interrupts with additional instructions for load prediction and extra methods of branch prediction, since stalls in the pipeline of a VLIW-based processor are significantly worse than stalls in other architectures. The other issue for EPIC to address was large code size, which the HP engineers delegated to the compiler. To generate code for the processor, the compiler first optimizes the code using its knowledge of the processor. By estimating how long each instruction will take to complete, the compiler schedules code with minimal holes, so that instructions finish at roughly the same time and a large number of data hazards are avoided.
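
As a rough illustration of this compile-time scheduling, the following Python sketch greedily packs independent instructions into fixed-width issue bundles, padding unused slots with no-ops. The instruction format, the three-slot width, and the hazard test are all simplifying assumptions of mine, not HP's actual algorithm.

    # Hypothetical sketch of compile-time bundling in the VLIW/EPIC spirit.
    WIDTH = 3  # issue slots per long instruction word (assumed)

    def independent(instr, bundle):
        """True if instr has no RAW/WAW/WAR conflict with anything in the bundle."""
        dst, srcs = instr
        for b_dst, b_srcs in bundle:
            if dst == b_dst or dst in b_srcs or b_dst in srcs:
                return False
        return True

    def schedule(instructions):
        bundles, i = [], 0
        while i < len(instructions):
            bundle = [instructions[i]]
            i += 1
            # Fill remaining slots only while the next instruction is independent
            # (a real compiler would search a wider window and reorder code).
            while (len(bundle) < WIDTH and i < len(instructions)
                    and independent(instructions[i], bundle)):
                bundle.append(instructions[i])
                i += 1
            while len(bundle) < WIDTH:
                bundle.append(("nop", ()))  # wasted slot -> the code-bloat problem
            bundles.append(bundle)
        return bundles

    # r3 depends on r1, so it cannot share a bundle with the write to r1:
    code = [("r1", ("r2",)), ("r4", ("r5",)), ("r3", ("r1",))]
    for b in schedule(code):
        print(b)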
The EPIC architecture deals with many issues that are present in all architectures in order to handle hazards completely, but the more interesting solutions are for branch hazards and register renaming. Branch hazards are always an issue for branch prediction, but even more so in an architecture that uses the VLIW approach. To solve this, EPIC uses the compiler to assemble many different branch options into larger pieces, and even to avoid branches entirely where possible (1). The issue with register renaming stems from why it is done on a superscalar architecture in the first place: to handle various data dependencies, a superscalar architecture often renames registers in order to prevent data hazards from presenting themselves. The additional freedom given to the compiler to rewrite the code let the EPIC design fix the problem of using more registers than the instruction set can see: by renaming the registers in the program itself, an EPIC microprocessor could use the same number of registers as a superscalar processor while having all of them visible at all times.
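
The renaming idea can be sketched in a few lines. The Python below is a hypothetical illustration, not HP's scheme: every write receives a fresh register name, which removes WAW and WAR hazards while preserving true (RAW) dependences.

    # Minimal sketch of register renaming done in software (assumptions mine).
    from itertools import count

    def rename(code):
        fresh = count()          # supply of new register names
        current = {}             # architectural name -> latest physical name
        renamed = []
        for dst, srcs in code:
            # RAW: read the most recent physical name for each source.
            new_srcs = tuple(current.get(s, s) for s in srcs)
            # WAW/WAR: every write gets a brand-new destination register.
            new_dst = f"p{next(fresh)}"
            current[dst] = new_dst
            renamed.append((new_dst, new_srcs))
        return renamed

    # r1 is written twice (WAW) and read in between (RAW + WAR):
    code = [("r1", ("r2",)), ("r3", ("r1",)), ("r1", ("r4",))]
    print(rename(code))
    # -> [('p0', ('r2',)), ('p1', ('p0',)), ('p2', ('r4',))]  no false hazards left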
History of pipelining:
Pipelining was developed in the early years of microprocessor architecture, and it is part of what drew the user base away from supercomputers and toward microprocessors. The basic reference design is the 5-stage RISC pipeline, which allows one instruction to be completed per clock cycle and in general improves the throughput of any processor by a significant amount. The chief distinctions among pipeline designs are the number of stages and how the stages are set up, or rather how the new stages divide up the tasks of the older stages. In general, the more numerous and more uniform the pipeline stages, the shorter the critical path per stage, which allows a faster clock cycle. Processor engineers initially aimed to create as many pipeline stages as possible while keeping each stage at roughly the same critical-path delay, so as to minimize the clock period. However, as Intel eventually realized, there are some problems with this philosophy.
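
A back-of-the-envelope model shows the tradeoff. In the Python sketch below, a fixed amount of logic delay is divided among the stages, while each stage adds a fixed pipeline-register overhead; both delay numbers are illustrative assumptions, not measurements of any real processor.

    # Splitting fixed logic across more stages shortens the per-stage critical
    # path, but latch overhead per stage caps the benefit. Numbers are assumed.
    LOGIC_DELAY_NS = 10.0    # total combinational delay of one instruction
    LATCH_OVERHEAD_NS = 0.2  # pipeline-register overhead added per stage

    for stages in (5, 10, 20, 31):
        cycle = LOGIC_DELAY_NS / stages + LATCH_OVERHEAD_NS
        print(f"{stages:>2} stages: cycle = {cycle:.3f} ns "
              f"-> max clock ~ {1.0 / cycle:.2f} GHz")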
Intel's Pentium chip used the basic 5-stage architecture. Intel then progressed quickly: the Pentium Pro ran a 12-stage pipeline. This approach to pipelining allowed them to create a faster processor without significantly altering the underlying hardware technology. The next jump was backwards a bit: the Pentium III used a 10-stage pipeline but featured faster hardware that allowed significantly better clock speeds. However, the P3 design was capped at around 1 GHz, something the engineers at Intel looked to fix with the Pentium 4. To solve this, they built the P4 with a 20-stage pipeline, allowing a drastic increase in attainable clock speeds, with the downside of not actually accomplishing twice as much as a P3 at twice the clock speed. With clock speeds climbing, keeping execution speed growing required an even faster pipeline, which led to a 31-stage design for the Prescott cores in the P4 family. These cores topped out close to 4 GHz, but in the end drew too much power (2).
Intel's next step was to rein in the extreme number of pipeline stages. To keep power usage lower, they developed a newer, smaller architecture and ran the CPU with an efficient 14-stage pipeline. This pipeline was also wider than its predecessors, supporting 4 instructions at a time instead of the 3 supported by previous Intel processors. The shorter pipeline gave Intel fast processors without the issues associated with a very deep pipeline: larger branch misprediction penalties, as well as increased penalties from data hazards involving loads. Overall, Intel's pipelines initially grew in stage count over time, but after the problems with the P4, Intel dropped down to a more manageable size and has not increased it since the move to the 14 stages of the Core architecture (3).
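
The penalty side of deep pipelines is easy to quantify with a simple model. The Python sketch below assumes, hypothetically, that a misprediction flushes the whole pipeline, so the flush penalty grows with depth; the branch frequency, misprediction rate, and clock speeds are illustrative assumptions.

    # Why twice the clock did not mean twice the work: mispredictions cost more
    # cycles in a deeper pipeline. All rates and penalties here are assumed.
    BRANCH_FRACTION = 0.20   # fraction of instructions that are branches
    MISPREDICT_RATE = 0.05   # fraction of branches predicted wrongly

    def effective_cpi(base_cpi: float, flush_penalty_cycles: int) -> float:
        return base_cpi + BRANCH_FRACTION * MISPREDICT_RATE * flush_penalty_cycles

    # Assume the flush penalty scales with pipeline depth:
    for depth, clock_ghz in ((10, 1.0), (20, 2.0), (31, 3.8)):
        cpi = effective_cpi(1.0, depth)
        mips = clock_ghz * 1000 / cpi   # millions of instructions per second
        print(f"{depth:>2}-stage @ {clock_ghz} GHz: CPI = {cpi:.2f}, ~{mips:.0f} MIPS")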
ILP:
Pipelining is our practical means of exploiting instruction-level parallelism. At the heart of this concept is the knowledge that the demands of ILP cannot be perfectly met by hardware and coding schemes: imperfect branch and jump prediction, long memory accesses, and a limited supply of virtual registers will always remain problems. Nevertheless, since the 1960s we have been on a journey to make pipelined processors fit an ideal image of what instruction-level parallelism would look like. Although no processor yet fits that ideal, the techniques essential to today's processors are definite stepping stones toward that level of efficiency and throughput.
Scoreboarding:
There are a few ways to implement dynamic scheduling in a machine. One of the simpler, yet effective, methods for accomplishing multiple tasks simultaneously came with scoreboard technology. The scoreboard exploits instruction-level parallelism by executing the instructions in a basic block simultaneously (a block here refers to a code sequence that contains no branches except at the entry and exit). Scoreboarding allows instructions to execute simultaneously as long as two conditions are met: there must not be any data dependences between the instruction and those recently issued or still being processed, and the hardware resources must be available at the time the scoreboard logs the instruction. Instructions are therefore issued in order; however, if there are no dependences, an instruction may finish earlier than a previously issued one (out-of-order completion). In this way, scoreboarding can be classified as a very simple form of dynamic scheduling (4).
The CDC 6600 was one of the first machines to implement scoreboarding, and at the time it was considered unique, since the concept of superscalar architecture barely existed. Even though the CDC 6600 used a 10 MHz clock, its parallelism let it execute instructions at roughly the speed of a 40 MHz CPU (7). Scoreboarding builds on the MIPS-style pipeline approach in which structural and data hazards are checked in the early stages.
Four stages handle each instruction: issue, read operands, execute, and write result. The issue stage checks that the needed hardware is available and checks for WAW hazards; if one is detected, the instruction is delayed. In the read-operands stage, the source operands are checked to see whether any previously issued instruction will still write to them as a destination (RAW hazards). The instruction may then safely execute in the execute stage, and finally it notifies the scoreboard once a result has been obtained, so the scoreboard knows when it is safe to issue another instruction. After execution, the scoreboard checks for WAR hazards before writing the result to the destination. The scoreboard manages all this by tracking the status of the instructions, functional units, and registers (4).
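
The bookkeeping behind these checks can be sketched compactly. The Python below is my simplification of the scoreboard described above (it omits the WAR check before write-result and tracks only one pending write per register); it is not the CDC 6600's actual table layout.

    # Simplified scoreboard: structural + WAW checks at issue, RAW at read.
    class Scoreboard:
        def __init__(self, functional_units):
            self.free_units = set(functional_units)
            self.pending_writes = {}  # register -> id of instruction writing it

        def can_issue(self, unit, dst):
            # Issue stage: stall on structural hazards and WAW hazards.
            return unit in self.free_units and dst not in self.pending_writes

        def issue(self, instr_id, unit, dst):
            self.free_units.discard(unit)
            self.pending_writes[dst] = instr_id

        def can_read_operands(self, srcs):
            # Read-operands stage: stall while any source awaits a write (RAW).
            return all(s not in self.pending_writes for s in srcs)

        def write_result(self, instr_id, unit, dst):
            # Write-result stage frees the unit and clears the pending write.
            self.free_units.add(unit)
            if self.pending_writes.get(dst) == instr_id:
                del self.pending_writes[dst]

    sb = Scoreboard({"adder", "multiplier"})
    sb.issue(0, "multiplier", "r1")           # r1 <- ... on the multiplier
    assert not sb.can_read_operands(("r1",))  # RAW: r1 is not yet written
    assert not sb.can_issue("adder", "r1")    # WAW: r1 already has a writer
    sb.write_result(0, "multiplier", "r1")
    assert sb.can_read_operands(("r1",))      # safe once the result is written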
Overall, one of the main purposes of this implementation is to reduce stalls; however, several factors limit its efficiency. As stated previously, this form of scheduling is useful only within code block segments, since branch conditions must be evaluated before the following instructions can be issued. Moreover, for the scoreboard and other processors that use simultaneous execution, the window size, meaning how far ahead the processor can look in the instruction stream, limits the number of instructions that can be issued in a cycle (4). The kinds and numbers of hardware units are also a limitation, since structural hazards arise during dynamic scheduling. Finally, antidependences and output dependences will still create stalls.
In scoreboarding, results are not forwarded to satisfy dependences: both operand registers must be available before the operands can be read. However, results are written into the register file as soon as execution completes, assuming no WAR hazard. One structural hazard presents itself in the buses: since the scoreboard must communicate with the functional units, the number of units in use cannot exceed the number of buses available. Unfortunately, scoreboarding does not take advantage of compile-time optimization of the code sequence, so a deeper pipeline might seem a good substitute; however, that causes drawbacks in other parts of the system and its functional units. Due to these restraints, better implementations of dynamic scheduling came about that dealt specifically with the hazards and limitations scoreboarding could not handle. Even though scoreboard technology provided a method for dealing with structural and data hazards, it was soon surpassed by the implementation of Tomasulo's algorithm, which avoids WAW and WAR dependences altogether through register renaming.
Tomasulo:
After scoreboard technology gained ground, more thought was put into minimizing the hazards that remained. The scheme provided by Robert Tomasulo aims to take care of exactly these hazards. The main differences from scoreboarding are register renaming, reservation stations, a common data bus, and the fact that the algorithm can operate across branches. The technique has quite a few highlights. For instance, the compiler can leave code unoptimized and the hardware can still achieve high performance. Caches were not yet widespread when this scheme was introduced, and their later addition only compounds the efficiency of the algorithm, because during a cache miss, out-of-order instructions can still be executing without penalty. For these reasons and more, this implementation of dynamic scheduling was heavily adopted during the growth of the 1990s.
Tomasulo's algorithm puts each instruction through three stages: issue, execute, and write result. The first improvement over other schemes appears right at issue. Reservation stations, which buffer the operands of issued instructions (in order to resolve dependences), keep registers available and keep the operands ready to be operated on. From the programmer's point of view, a single named register may be represented by two or more reservation stations simultaneously (6). This eliminates WAR and WAW hazards: if an instruction fetches its operand from the reservation station rather than the register, the register is left open for previously issued instructions to write to. This step effectively renames the registers specified in the code to whichever stations are available that clock cycle. In the next stage, operands are placed into reservation stations as soon as they become available; once all of an instruction's operands have been supplied, the operation can execute. Instructions still need to be delayed, however, to avoid read-after-write hazards. When several instructions have their operands available in the same clock cycle, the functional units take them in arbitrary order, except for loads and stores, which are kept in the order they were written in the code. In the final step, results are written on a "common data bus" (7): instead of waiting for results to be placed in registers, they are broadcast on the bus so that dependent instructions can be satisfied without waiting for the write.
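
The interplay of reservation stations, renaming, and the common data bus can be sketched as follows. The Python below is a compact illustration under my own simplifications (a single operation type and no execution timing); the field names vj/vk and qj/qk follow the textbook convention (4).

    # Sketch of Tomasulo's key ideas: stations, tag-based renaming, CDB broadcast.
    class ReservationStation:
        def __init__(self, tag):
            self.tag = tag
            self.vj = self.vk = None  # operand values, once known
            self.qj = self.qk = None  # tags of producing stations, if waiting

    class Tomasulo:
        def __init__(self, n_stations):
            self.stations = [ReservationStation(f"RS{i}") for i in range(n_stations)]
            self.free = list(self.stations)
            self.reg_status = {}      # register -> tag of station producing it
            self.reg_file = {}        # committed register values

        def issue(self, dst, src1, src2):
            rs = self.free.pop()
            for v_field, q_field, src in (("vj", "qj", src1), ("vk", "qk", src2)):
                if src in self.reg_status:  # value still being computed:
                    setattr(rs, q_field, self.reg_status[src])  # wait on its tag
                else:
                    setattr(rs, v_field, self.reg_file.get(src, 0))
            self.reg_status[dst] = rs.tag   # rename: dst now names this station
            return rs

        def broadcast(self, tag, value, dst):
            # Common data bus: every waiting station snoops the result at once.
            for rs in self.stations:
                if rs.qj == tag:
                    rs.vj, rs.qj = value, None
                if rs.qk == tag:
                    rs.vk, rs.qk = value, None
            if self.reg_status.get(dst) == tag:  # only the newest writer
                self.reg_file[dst] = value       # updates the register file
                del self.reg_status[dst]

    t = Tomasulo(3)
    a = t.issue("r1", "r2", "r3")  # r1 <- r2 op r3
    b = t.issue("r4", "r1", "r2")  # waits on station a's tag, not on r1 itself
    assert b.qj == a.tag
    t.broadcast(a.tag, 42, "r1")
    assert b.vj == 42 and t.reg_file["r1"] == 42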
The IBM 360/91 was designed around Tomasulo's algorithm in order to achieve high floating-point performance. Even though it had only four floating-point registers, long memory delays, and a two-address instruction format, the algorithm minimized register pressure as much as possible by tracking instruction dependences (6).
With the common data bus and the reservation stations, many hazards can be avoided at the cost of an extra clock cycle: instead of results coming straight from a functional unit, they are buffered on the bus before the write stage takes place, and data may not be stored at all unless absolutely necessary. Compared to scoreboarding, data does not need to pass through a register between functional units, which improves how efficiently the system is used. As far as branch prediction is concerned, if branches are correctly predicted the reservation stations work well; for example, multiple iterations of a loop can unfold. Another addition to Tomasulo's algorithm was the reorder buffer. It essentially allows instructions to be issued in order, execute out of order, and then complete in order. On the surface the machine may not appear to be using dynamic scheduling, yet all functional units are kept busy in the execution stage.
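
The reorder-buffer idea can be sketched briefly: entries are allocated in program order, results may arrive in any order, and entries retire strictly from the head. The interface names in this Python sketch are illustrative assumptions.

    # In-order issue, out-of-order completion, in-order retirement.
    from collections import deque

    class ReorderBuffer:
        def __init__(self):
            self.entries = deque()  # program order; the head is the oldest

        def allocate(self, instr):
            entry = {"instr": instr, "done": False, "value": None}
            self.entries.append(entry)  # in-order issue
            return entry

        def complete(self, entry, value):
            entry["done"], entry["value"] = True, value  # out-of-order finish

        def commit(self):
            retired = []
            while self.entries and self.entries[0]["done"]:
                retired.append(self.entries.popleft())  # in-order retirement
            return retired

    rob = ReorderBuffer()
    e1, e2 = rob.allocate("mul r1"), rob.allocate("add r2")
    rob.complete(e2, 7)        # the add finishes first...
    assert rob.commit() == []  # ...but cannot retire past the older mul
    rob.complete(e1, 3)
    assert [e["instr"] for e in rob.commit()] == ["mul r1", "add r2"]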
Furthermore, Tomasulo's approach was built upon heavily by the addition of caches, hardware speculation, and branch prediction. Given the ideas behind the approach, especially its handling of hazards and of unnecessary use of functional units, Tomasulo's design left far more room for growth and advancement than scoreboarding.
Branch Prediction:
In discussing pipeline performance, we assumed that instruction-level parallelism occurs within segments or blocks of code bounded by branches. Because branches hold the conditions that determine which instructions should be issued next, they drastically slow down the pipeline. To reduce this delay, branch prediction is employed. If the prediction is right, there should be no penalty for carrying control dependences through a branch instruction. Prediction splits into two categories: static and dynamic. Static branch prediction can be done when code is compiled; some branching behavior, however, can only be predicted at run time, by the hardware.
On the static side, there are several methods for predicting branches. Since a branch has only two outcomes, the options are easy to enumerate. The simplest method is to predict every branch as taken; however, the resulting accuracy is low and usually less beneficial than other schemes. A better approach is to base branch predictions on previous runs of the executed code. This way a more accurate prediction can be made, especially if a branch is, say, taken 90% of the time.
However, as more conditional branches present themselves, static branch prediction starts to lose its effectiveness. Dynamic schemes are useful because they weigh the delay of waiting on a branch against fetching from multiple candidate PC addresses. The logical way to implement this is to give the hardware a "memory" of past branches, so dynamic prediction schemes are implemented with buffers in hardware. One-bit predictors have been used, but it is preferable to require two incorrect predictions before the prediction flips, so a two-bit buffer is more common. This accommodates branches that strongly favor being taken or not taken (4). Treating the two-bit buffer like a small cache entry, a taken branch moves the counter up one state, and a second taken branch saturates it; two misses in a row are then required to change the prediction. One could speculate about an "n-bit" buffer, but this becomes what is known as "profiling," which reduces to comparing whether "1" or "0" appears more often (8). Instead of a larger counter, it is practical to keep a table, indexed by branch address, of separate 2-bit predictors (9).
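
Such a two-bit saturating counter is easy to express directly. In the Python sketch below (the starting state and the example branch stream are my assumptions), a single not-taken outcome does not flip a strongly taken prediction.

    # Two-bit saturating counter: states 0-1 predict not-taken, 2-3 predict taken.
    class TwoBitPredictor:
        def __init__(self):
            self.state = 1  # start weakly not-taken (assumed)

        def predict(self) -> bool:
            return self.state >= 2  # True means "taken"

        def update(self, taken: bool):
            if taken:
                self.state = min(3, self.state + 1)
            else:
                self.state = max(0, self.state - 1)

    p = TwoBitPredictor()
    outcomes = [True, True, False, True, True]  # a loop branch, mostly taken
    hits = 0
    for taken in outcomes:
        hits += (p.predict() == taken)
        p.update(taken)
    print(f"{hits}/{len(outcomes)} correct")  # one not-taken does not flip it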
Furthermore, a somewhat stronger scheme is to make predictions based on what are called "correlating predictors" or "two-level predictors." These look at the history of previous branch outcomes in order to decide whether to take the current branch. Layered on top of the two-bit buffer scheme, this markedly improves the chance of a correct prediction. For example, based on the outcome of a previous branch, there can be two separate two-bit buffers for the current branch, one correlating with the earlier branch being taken and one with it not being taken.
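
A minimal version of such a predictor is sketched below: the outcomes of the last two branches select which two-bit counter to use for a given branch address. The table sizing and indexing are illustrative assumptions. Note how an alternating branch, which defeats a single two-bit counter, becomes almost perfectly predictable once the pattern is learned.

    # Simple (2, 2) correlating predictor: 2 bits of global history select
    # among per-address two-bit counters. Sizing/indexing are assumed.
    class CorrelatingPredictor:
        def __init__(self):
            self.history = 0   # last 2 branch outcomes packed into 2 bits
            self.tables = {}   # (pc, history) -> 2-bit counter state

        def predict(self, pc: int) -> bool:
            return self.tables.get((pc, self.history), 1) >= 2

        def update(self, pc: int, taken: bool):
            key = (pc, self.history)
            state = self.tables.get(key, 1)
            self.tables[key] = min(3, state + 1) if taken else max(0, state - 1)
            self.history = ((self.history << 1) | int(taken)) & 0b11

    pred = CorrelatingPredictor()
    pattern = [i % 2 == 0 for i in range(20)]  # T, N, T, N, ... alternating
    correct = 0
    for taken in pattern:
        correct += (pred.predict(0x40) == taken)
        pred.update(0x40, taken)
    print(f"{correct}/{len(pattern)} correct")  # only the warm-up steps miss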
Overall, branch prediction is extremely helpful in pipelining strategies because it reduces the delay in executing instructions in parallel. Instead of waiting for conditions to be evaluated before exploiting instruction-level parallelism, resources are kept continually busy, on the theory that it is better to keep functional units working and be wrong a small percentage of the time than to waste time waiting for conditions to be confirmed.
In all, we have seen pipelining grow and continue to prove its usefulness throughout the years. The hazards that presented themselves at the beginning occur less and less often with each new design and algorithm. Even physical limitations are starting to pose less of an obstacle, thanks to increased accuracy in branch prediction. What remains now is the refinement of these technologies to fit the processing needs of our present world.
Works Cited
(1) Michael S. Schlansker and B. Ramakrishna Rau. "EPIC: An Architecture for Instruction-Level Parallel Processors." Compiler and Architecture Research, HP Laboratories, February 2000. http://www.hpl.hp.com/techreports/1999/HPL-1999-111.pdf
(2) Brandon Bell. "Intel's Core 2 Extreme/Duo Processors." July 13, 2006. http://www.firingsquad.com/hardware/intel_core_2_performance/
(3) Scott M. Fulton. "First Intel next-gen news: Lower wattage, fewer pipeline stages." Tom's Hardware US, August 23, 2005. http://www.tomshardware.com/news/intel,1317.html
(4) John L. Hennessy, David A. Patterson, and Andrea C. Arpaci-Dusseau. Computer Architecture: A Quantitative Approach. 4th ed. Amsterdam: Morgan Kaufmann, 2007.
(5) Ralph Grishman. Assembly Language Programming for the Control Data 6000 Series and the Cyber 70 Series. New York, NY: Algorithmics Press, 1974.
(6) Harold S. Stone. High-Performance Computer Architecture (Addison-Wesley Series in Electrical and Computer Engineering). Repr. with corrections. Glenview: Addison-Wesley, 1988.
(7) R. M. Tomasulo. "An efficient algorithm for exploiting multiple arithmetic units." IBM Journal of Research and Development, vol. 11, pp. 25-33, 1967.
(8) P. G. Emma. "Branch prediction." Chapter 3 of D. Kaeli and P.-C. Yew (eds.), Speculative Execution in High Performance Computer Architectures. CRC Press, Boca Raton, FL, 2005.
(9) S. McFarling. "Combining branch predictors." Technical Report WRL TN-36, Western Research Laboratory, Digital Equipment Corporation, Palo Alto, CA, June 1993.