L27,28-Superscaler

8/6/2019 L27,28-Superscaler

1/28

What is Superscalar?

Common instructions (arithmetic, load/store,

conditional branch) can be initiated

simultaneously and executed independently A superscalar processor is one in which

multiple independent instruction pipelines are

used. Each pipeline consists of multiple

stages, so that each pipeline can handle

multiple instructions at a time.

Equally applicable to RISC & CISC

In practice usually RISC


2/28

Why Superscalar?

Most operations are on scalar quantities

Performance of execution of scalar instructions

can be improved by using superscalar processing.

Superscalar approach is the next step in the

evolution of high-performance general-purposeprocessors.

Superscalar allows parallel fetch execute


3/28

Superpipelined approach

Many pipeline stages need less than half a clock

cycle

Thus, a doubled internal clock speed allows the

performance of two tasks in one external clock

cycle


4/28


5/28

Superscalar implementations achievesInstruction level parallelism (ILP)

ILP can be maximized by optimizing the

compiler & Hardware techniques

Limitations to parallelism are

True data dependency

Procedural dependency

Resource conflicts Output dependency

Antidependency


6/28

Also called as Flow dependency or Write-read

dependency

In case of dependency b/w 1st and 2nd instructions , 2nd

instructions is delayed as many clock cycles as required

to remove data dependency

With no dependency instructions can be fetched,decoded and executed in parallel


7/28

Procedural Dependency

Can not execute instructions following a branch

instruction until the branch is executed

Also, if instruction length is not fixed, instructionshave to be partially decoded before the following

instruction can be fetched

This prevents simultaneous fetches which is

required by superscalar pipeline.


8/28

Resource Conflict

Two or more instructions requiring access to thesame resource at the same time

e.g. of resources include memories, caches,

buses, register-file ports, functional units (likeALU adder)

Can duplicate resources

e.g. have two arithmetic units so two

arithmetic instructions can be executed inparallel


9/28


10/28

Design Issues

Instruction level parallelism & MachineParallelism

Instruction-issue policy

Register Renaming

Branch Prediction


11/28

Design Issues

Instruction level parallelism exists when Instructions in a sequence are independent Executed in parallel by overlapping

Limitations: true data and procedural dependency

Machine Parallelism Measure of ability of processor to take advantage

of instruction level parallelism

Determined by number of instructions that can be

fetched and executed at the same time (no ofpipelines) & by the speed & mechanism that the

processor uses to find independent instructions


12/28

Instruction Issue Policy

Instruction issue refers to the process of initiatinginstruction execution in the processors functional

unit

Instruction-issue policy refers to the protocol

used to issue instructions Following orderings are important:

Order in which instructions are fetched

Order in which instructions are executed Order in which instructions change

registers and memory


13/28

In general, superscalar instruction issue policies are

grouped into the following categories :1. In-order issue with in-order completion

2. In-order issue with out-of-order completion

3. out-of-order issue with out-of-order completion


14/28

1. In-Order Issue In-Order Completion

Issue instructions in the order they occur i.e. issue

instructions in the order that would achieve

sequential execution

Result is written in the same order

Not very efficient

May fetch >1 instruction

Instructions are fetched 2 at a time and passed to the

decode unit.

Fetched Instructions must wait until the pair of

decode pipeline stages has cleared.


15/28

Example:

I1 requires two clock cycles to execute

I3 and I4 conflict for the same functional unit

I5 depends on the value produced by I4

I5 and I6 conflict for a functional unit


16/28


17/28

2. In-Order Issue Out-of-Order Completion

Used in superscalar RISC processors

Requires more complex logic than in-order completion

Output dependency should be avoided. Consider following

instructions:

R3:= R3 + R5; (I1) R4:= R3 + 1; (I2)

R3:= R5 + 1; (I3)

R7:= R3 + R4; (I4)

I2 depends on result of I1 - data dependency

If I3 completes before I1, the result will be wrong this isoutput (Write-write) dependency

To have correct o/p execution of I3 must be stalled if its

result might be overwritten by an older instruction that

takes longer to complete


18/28


19/28

3. Out-of-Order Issue Out-of-Order Completion

Decouple decode & execution stages of the pipeline This is done with a buffer referred to as an

instruction window

When the processor finishes decoding an instruction,

it is placed in the instruction window

Can continue to fetch and decode until this buffer is

full

When a functional unit becomes available an

instruction can be executed

So we can say that processor has look ahead capacity

Identifies independent instructions and bring it to

execution stage irrespective of their original program

order


20/28


21/28

This dependency is similar to that of true data

dependency, but reversed

Here 2nd instruction destroys a value that the 1st

instruction uses


22/28

Register Renaming

Out-of-order issuing and out-of-order completion

of instruction, gives rise to the possibility of

output dependency and antidependencies

Output and antidependencies occur becauseregister contents may not reflect the correct

ordering of values dedicated by the program flow

Both dependencies are examples of storage

conflicts Multiple instructions compete for the use of same

register locations. A solution to this is duplication

of resources. The technique is called as Register

Renaming


23/28

Register Renaming (Contd.)

In this method registers are allocated

dynamically by the processor h/w

When an instruction executes that has aregister as a destination operand, a new

register is allocated for that value

Subsequent instructions that access that value

as a source operand in that register must gothrough a renaming process


24/28

For each instruction that writes to a register X, a newregister X is instantiated

Multiple register Xs can co-exist

Consider

S1: R3 = R3 + R5

S2: R4 = R3 + 1 S3: R3 = R5 + 1

S4: R7 = R3 + R4

becomes

S1: R3b = R3a + R5a S2: R4b = R3b + 1

S3: R3c = R5a + 1

S4: R7b = R3c + R4b


25/28


26/28

Branch Prediction

Any high performance pipelined m/c must use a

method of dealing with branches 80486 fetches both next sequential instruction after branch

and branch target instruction Other methods for prediction are:

Predict never taken

Predict always taken

Predict by opcode

Taken/Not taken switch

Branch History Table


27/28

Conceptual view of Superscalar execution


28/28

Superscalar implementationFor implementing superscalar processors some processor

hardware is required. Following are the key elements:

1) Instruction fetch strategies

2) Logic for determining true dependencies &

mechanism for communicating these values towhere they are needed

3) Mechanism for initiating multiple instructions in

parallel

4) Resources for parallel execution of multipleinstructions

5) Mechanism for committing the process state in

correct order

L27,28-Superscaler

Documents

Transcript of L27,28-Superscaler