2. Pipelined and superscalar processors
Chris Jesshope
Pipelined processors
Two types of concurrency
♣ Pure parallel
In a pure-parallel system multiple operations are performed simultaneously
♣ Pipelined parallel
In a pipelined-parallel system an operation is broken down into a sequence of sub-tasks
The sub-tasks are interleaved in time, i.e. multiple sub-tasks from different operations are performed simultaneously
This has already been illustrated in DRAM access
♣ Parallelism is the only weapon we have to increase throughput and to overcome long latencies in system design
Pipelines
Operations are decomposed into sequential sub-tasks: Op: A ; B ; C ; D
Pipelined concurrency requires a sequence of operations. Given Op_i, 0 ≤ i ≤ n, we execute sub-tasks from successive operations concurrently:
A_{i+3} || B_{i+2} || C_{i+1} || D_i
; denotes sequential composition, || denotes parallel composition
We need to capture intermediate results, and there is a start-up phase and a flush period during which we do not have full concurrency.
Pipeline timing diagram
[Figure: timing diagram showing operations 1-5 flowing through stages A, B, C and D; the concurrency axis shows the pipeline filling, running full, then draining over time]
Only when the pipeline is full is there maximal concurrency
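This fill/drain pattern can be reproduced with a few lines of code. A minimal Python sketch (not from the original slides), assuming unit-time stages:

```python
# Reproduce the timing diagram in ASCII: 5 operations flow through the
# 4 stages A-D, advancing one stage per cycle.
STAGES = ["A", "B", "C", "D"]

def timing_diagram(n_ops):
    total = len(STAGES) + n_ops - 1          # cycles including fill and drain
    print("cycle: " + " ".join(f"{c:>2}" for c in range(1, total + 1)))
    for s, name in enumerate(STAGES):
        cells = []
        for cycle in range(total):
            op = cycle - s                   # operation occupying stage s now
            cells.append(f"{op + 1:>2}" if 0 <= op < n_ops else " .")
        print(f"{name}:     " + " ".join(cells))

timing_diagram(5)    # all four stages are busy only in cycles 4 and 5
```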
Pipeline performance
♣ This pipeline has a length of 4 sub-tasks; assume each sub-task takes t seconds
For a single operation we get no speedup: it takes 4t seconds to complete the four sub-tasks, which is the same as performing each sub-task in sequence on the same hardware
♣ In the general case, for n operations, it takes 4t seconds to produce the first result and t seconds for each subsequent result
Generic Pipeline Performance
♣ For a pipeline of length L, the time to process n operations is:
T(n) = Lt + (n-1)t = (L-1)t + nt
♣ We can characterise all pipelines by two parameters: the start-up time and the maximum rate, i.e. the rate at which operations are completed once the pipeline is full
♣ Another useful parameter measures the number of operations required to reach half the maximum rate - the half-performance vector length
Pipeline Parameters
♣ Maximum rate, r∞ = 1/t (called "r infinity")
♣ Pipeline start-up time, S = (L-1)t
♣ Half-performance vector length, n_1/2, found by setting T(n_1/2) to twice the full-rate time:
(L-1)t + n_1/2 t = 2 n_1/2 t
n_1/2 = L-1 = S/t
♣ It can be seen that n_1/2 is equal to the start-up time in units of the clock period
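A small Python sketch (not from the original slides) evaluating these formulas; the parameter values are illustrative:

```python
# Pipeline performance formulas from the slides: T(n) = (L-1)t + nt,
# r_inf = 1/t, S = (L-1)t and n_1/2 = L-1.
def pipeline_time(n, L, t):
    return (L - 1) * t + n * t

L, t = 4, 1e-9            # 4 stages, 1 ns per stage (illustrative values)
r_inf = 1 / t             # maximum rate once the pipeline is full
S = (L - 1) * t           # start-up time
n_half = L - 1            # half-performance vector length

for n in (1, n_half, 100):
    rate = n / pipeline_time(n, L, t)
    print(f"n = {n:3d}: rate = {rate / r_inf:.2f} * r_inf")   # 0.25, 0.50, 0.97
```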
Asymptotic performance
[Figure: performance vs. number of operations, rising asymptotically towards r∞ = 1/t; the curve reaches r∞/2 at n = n_1/2]
Pipelined computers
♣ Pipelining can be applied to processing sequences of instructions
♣ Or to processing sequences of complex operations
The former is used in scalar and superscalar computers
Vector computers execute multiple operations using one instruction
Instruction execution
♣ To exploit pipelining in instruction processing the instruction set must be regular and simple
♣ Each operation must be similar
E.g. RISC (Reduced Instruction Set Computer), see:
http://www.science.uva.nl/~jesshope/web-site-08236/home.html
Complex instruction set computers can also exploit pipelining, but the complex instructions must first be broken down into RISC-like micro-instructions
♥ E.g. Intel X86
A simple pipeline
♣ We can identify a number of sequential tasks (stages) in the execution of an instruction, leading to pipelined implementations - here is one example:
Instruction fetch (IF)
Instruction decode & register read (RR)
Calculate: result, memory address or branch taken (EX)
Access memory (MEM)
Write back result (WB)
♣ These become the concurrent stages in the pipeline
♣ Note that it requires two memories to read an instruction and read or write data concurrently (IF & MEM stages)
Hence the requirement for separate caches at level 1
Simplified Pipelined Datapath - MIPS ISA
[Figure: the five-stage MIPS datapath separated by the IF/ID, ID/EX, EX/MEM and MEM/WB pipeline registers; the PC addresses instruction memory, the register file is read and written, the 16-bit immediate is sign-extended to 32 bits, the ALU computes results, memory addresses and branch targets, and data memory is read or written; the opcode etc. go to control]
Control signals also pipelined
[Figure: the control unit generates the EX, M and WB control signals at decode; they are carried forward through the ID/EX, EX/MEM and MEM/WB pipeline registers alongside the data]
Pipeline performance
♣ Long instruction sequences mean it should be possible to execute instructions at approaching one instruction per cycle in a pipelined datapath
♣ There are problems in achieving this - they are:
Hazards, e.g. data dependencies between instructions or branches in the instruction stream
Long-latency operations, e.g. memory references or floating point operations
♣ Approaches used to improve instruction execution rate:
using multiple functional units
issuing multiple instructions concurrently
executing instructions out of programmed order
Measuring performance
♣ Performance depends on the pipeline's clock rate
♣ And on the number of cycles required per instruction (CPI), or its inverse, the number of instructions executed per cycle (IPC)
♣ The product of IPC and clock rate gives a measure of a processor's performance
Pipeline performance
♣ Here we have some interesting observations:
the processor's clock is slower the more complex the task that must be completed in a single period
…but operations have a fixed complexity, so to increase the clock rate instructions are broken down into smaller sub-tasks
smaller tasks mean deeper pipelines, so more instructions need to be executed to fill the pipeline
Multiple instruction issue
Multiple execution units
♣ In the simple pipeline, register-to-register operations have a wasted cycle
a memory access is not required, but the MEM stage still requires a cycle to complete the operation
♣ Decoupling memory access from operation execution avoids this, e.g. use an ALU plus a memory unit
… but then either we need two write ports to the register file or arbitration on a single port
[Figure: two parallel pipelines after decode - IF, ID, Ex, WB performs R-to-R ops; IF, ID, Ad, M, WB performs loads and stores]
Pipeline hazards
♣ There are three types of pipeline hazard that must be identified and resolved:
structural hazard - two instructions require the same resource (e.g. a register file port)
control hazard - e.g. branches in the flow of control
data hazard - a read-after-write dependency
♣ Each hazard, when encountered, will cause the pipeline to stall until the hazard is resolved
stall means one or more stages are blocked and cannot accept more instructions
Structural hazard - register
♣ A structural hazard occurs when a resource in the pipeline is required by more than one instruction, e.g. one write port
a resource may be an execution unit or a register port
a multi-function pipeline has a hazard writing registers
Example: lw a addr followed by add b c d - the lw (IF, ID, Ad, M, WB) and the add (IF, ID, Ex, WB) reach WB in the same cycle
[Figure: pipeline timing diagrams showing the write-port hazard and its resolution]
Resolved by stalling the pipeline: a stall or bubble is inserted so the two writes occur in different cycles
Structural hazard - execution unit
♣ Some operations require more than one pipeline cycle
mult is more complex than add (often requires 2 cycles)
floating point is more complex still (~5 cycles)
Example: mult c d e followed by add f g h - the two instructions contend for the same stage
[Figure: pipeline timing diagrams showing the hazard and the bubble inserted in the add's pipeline]
Resolved again by stalling the pipeline
Eliminating structural hazards
♣ Because structural hazards result from contention for hardware resources, they can only be removed by duplicating resources
the register-file hazard can be resolved by adding an additional write port to the register file, so that two writes can occur simultaneously
in the case of the execution unit, the hazard can be solved by using dedicated functional units or by pipelining the functional unit
Can have memory address, integer logical or add, integer multiplication, and floating point as separate functional units
Multiple functional units
♣ This is not a new principle
the CDC 6600 (1963) & CDC 7600 (1970) had 10 & 9 functional units respectively
the 7600's operations were pipelined (except division), i.e. they could accept new operands every cycle
♣ These computers were scalar RISC supercomputers, designed before the term RISC had been invented
[Figure: after IF and ID, instructions are dispatched to parallel pipelines - Add/WB, Adr/Mem/WB, Mul/Mul/WB and a multi-stage Float/WB unit]
The goal was to achieve as close as possible to one instruction executed per cycle: IPC = 1
Control hazard
♣ Branches - in particular conditional branches - can also cause pipeline hazards
the outcome of a conditional branch is not known until the Ex stage, but it is required at IF to load another instruction and keep the pipeline full
A simple solution is to assume by default that the branch falls through - i.e. is not taken
Branch not taken: continue to fetch but stall ID until the branch target is known - one cycle lost
Branch is taken: the default fetch was to the wrong target, so we need to refetch at the new target - two cycles lost
Eliminating control hazards
♣ Control hazards are more difficult to eliminate, as they are procedural rather than structural
♣ Options include…
Embedding a delay in the action of a branch - the branch delay slot
Predicting the branch target, i.e. whether the branch is taken
♥ Statically - the compiler decides
♥ Dynamically - a huge research topic, discussed later
Fetching from both targets - requires a branch target address prediction
Eliminating branches through predication
Branch delay slot
♣ Assume the branch takes effect on the second cycle after executing the branch rather than the next
This eliminates the stall if the branch is taken, by filling the stalled cycle with an independent instruction
Without the delay slot filled (1 cycle stall on all but the last instruction):
L1: lw a x[i]
    add a a a
    sw a x[i]
    sub i i 4
    beq i 0 L2
    j L1
L2: ---
With the sub moved into the delay slot (no stalls, but the last iteration executes an unnecessary sub):
L1: lw a x[i]
    add a a a
    sw a x[i]
    beq i 1 L2
    sub i i 4
    j L1
L2: ---
The branch delay slot is always executed
n.b. add b c d ==> (b = c + d)
Fetching both targets
♣ Using an additional I-cache port, both the fall-through and branch-taken addresses can be fetched
Then at EX a choice is made as to which is decoded
When coupled with a branch delay slot this eliminates all wasted cycles but…
♣ multiple conditional branches in the pipeline will break this solution
longer pipelines - say 20 stages - might contain several branches in the pipe prior to EX
every new branch doubles the number of paths fetched
Predicated instruction execution
♣ Control flow can (in some cases) be replaced by guarded or predicated instruction execution…
a condition sets a predicate register (boolean)
instructions are predicated on that register
any state change (WB or Mem write) only occurs if the predicate is true
this moves the conditional operation from the IF stage to the WB stage
useful in long pipelines where branch hazards can be costly
it removes a control hazard at the expense of redundant instruction execution
Predication - an example
Branching code: instruction i is followed by a conditional branch at i+1; the false path executes i+2 and i+3, then branches over the true path; the true path executes i+5 and i+6; both paths rejoin at i+7
Predicated code: instruction i sets predicate C; i+2 and i+3 are predicated on !C, i+5 and i+6 on C, and i+7 executes unconditionally
No hazards but redundant execution - instructions from both paths occupy the pipeline
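The transformation can be sketched in software. A minimal Python sketch (not the author's example), computing |x| both ways:

```python
# Branching vs predicated evaluation of: if x < 0 then y = -x else y = x.
def branching(x):
    if x < 0:              # conditional branch - a control hazard
        return -x
    return x

def predicated(x):
    c = x < 0              # a condition sets the predicate register
    t = -x                 # both paths execute unconditionally...
    f = x                  # ...so some of this work is redundant
    return t if c else f   # only the predicated writeback takes effect

assert all(branching(v) == predicated(v) for v in (-3, 0, 7))
```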
Data hazard
♣ Data hazards are caused when the output from one instruction is required as an operand by a subsequent one
this is called a data dependency
♣ The hazard occurs because of the latency in the pipeline
the result (output from one instruction) is not written back to the register file until the last stage of the pipe
the operand (input of a subsequent instruction) is required at register read - some cycles prior to writeback
the longer the RR-to-WB delay, the more cycles the pipe must be stalled to execute the subsequent instruction
Eliminating data hazards
Example: mult a b c followed by add d a f - the add depends on a (n.b. register read is in the ID stage)
In a data hazard (RAW) the data from EX is required immediately, but it takes two cycles to write it to the register file and read it again from the same register, so two bubbles are inserted
The solution is to take the data directly from the pipeline register using a so-called bypass bus
[Figure: timing diagrams of the stalled and bypassed cases - with bypassing, data is forwarded from the pipeline register to the execution unit and no bubbles are needed]
Bypass busses
[Figure: the register file and ALU with bypass buses returning results from one, two and three-or-more cycles after EX back to the ALU inputs]
The operand is taken from the pipeline register and input directly to the ALU on the subsequent cycle
Because there are two stall cycles, a dependency to the next-but-one instruction also requires bypassing
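A minimal Python sketch (not from the original slides) of the selection a forwarding unit performs, assuming the five-stage pipeline above; the field names are illustrative:

```python
# Forwarding-unit logic: compare an operand's source register against the
# destinations held in the EX/MEM and MEM/WB pipeline registers.
def forward_source(src_reg, ex_mem, mem_wb):
    if ex_mem and ex_mem["writes"] == src_reg:
        return "EX/MEM bypass"   # producer is one cycle ahead
    if mem_wb and mem_wb["writes"] == src_reg:
        return "MEM/WB bypass"   # producer is two cycles ahead
    return "register file"       # no in-flight producer: normal read

ex_mem = {"op": "mult", "writes": "a"}       # mult a b c, currently in MEM
mem_wb = {"op": "lw", "writes": "b"}         # lw b ..., currently in WB
print(forward_source("a", ex_mem, mem_wb))   # EX/MEM bypass
print(forward_source("c", ex_mem, mem_wb))   # register file
```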
Data dependencies
♣ Data dependencies can also be eliminated by appropriate scheduling, if independent operations can be rearranged
scheduling in this context is the process of sequencing the operations according to the data-dependency graph
instructions are free to be scheduled in any order provided that all dependencies are met - concurrency of operation
♣ Scheduling can be dynamic or static
static scheduling is performed by the compiler
dynamic scheduling is performed by hardware
♣ Scheduling is generally an NP-hard problem
Example of a dependency graph
a := (sqrt(b*b - 4ac))/2a
[Figure: dataflow graph - the multiplications b*b, 4*a*c and 2*a can execute in any order; the subtraction, sqrt and division cannot, as each depends on the previous result]
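A minimal Python sketch (not from the original slides) of scheduling from this graph, assuming unit-latency operations and unlimited functional units:

```python
# List-scheduling the dependency graph for a := sqrt(b*b - 4*a*c) / (2*a):
# any operation whose inputs are complete may issue in the current cycle.
deps = {
    "t1 = b*b":      [],
    "t2 = 4*a*c":    [],
    "t3 = 2*a":      [],
    "t4 = t1 - t2":  ["t1 = b*b", "t2 = 4*a*c"],
    "t5 = sqrt(t4)": ["t4 = t1 - t2"],
    "a = t5/t3":     ["t5 = sqrt(t4)", "t3 = 2*a"],
}

done, cycle = set(), 0
while len(done) < len(deps):
    ready = [op for op, d in deps.items()
             if op not in done and all(x in done for x in d)]
    print(f"cycle {cycle}: {ready}")   # the three multiplies issue together
    done.update(ready)
    cycle += 1
```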
Superscalar processors
Dynamic dependency analysis
Pipeline design space issues
♣ Depth of pipeline - Superpipelining
further dividing pipeline stages increases frequency
but introduces more scope for hazards
and higher frequency means more power dissipated
♣ Number of functional units - Scalar ILP
instructions fetched and decoded in sequence
multiple operations executed in parallel
avoids waiting for long operations to complete
♣ Concurrent issue of instructions - Superscalar ILP
multiple instructions fetched and decoded concurrently
issued to multiple functional units
Scalar vs Superscalar
♣ Scalar processors issue instructions in order
but instructions may complete out of order
♣ Superscalar processors issue instructions concurrently
and possibly out of programmed order
Superscalar processors
♣ Most modern microprocessors are superscalar
♣ Issues covered in this section are:
parallel fetch, decoding and issue
♥ 100s of instructions in-flight simultaneously
out-of-order execution and sequential consistency
♥ exceptions and false dependencies
finding parallelism and scheduling its execution
application-specific engines
♥ e.g. SIMD & prefetching
Principles
♣ Superscalar example based on a 3-stage pipeline, i.e. fetch (f), decode (d) and execute (e), processing multiple instructions simultaneously
Here we see a 3-stage pipeline with an issue width of 1 (scalar) and 2 (superscalar)
[Figure: timing diagrams - the scalar processor starts one instruction per cycle (max IPC = 1); the superscalar processor starts two per cycle (max IPC = 2)]
Instruction-level parallelism - ILP
♣ ILP is the number of instructions issued per cycle (issue concurrency)
♣ IPC, the number of instructions executed per cycle, is limited by:
the ILP
the number of true dependencies
the number of branches in relation to other instructions
the latency of operations in conjunction with dependencies
♣ Current microprocessors issue from 4-8 instructions to up to 12 functional units concurrently, but achieve an IPC of typically 2-3
[Figure: timing diagrams of long execution times - one with a resource dependency on a pipelined functional unit, one with a true data dependency]
Instruction issue policy
♣ Just widening the processor's pipeline does not necessarily improve its performance
♣ The processor's policy in fetching, decoding and executing instructions also has a significant effect on its performance
♣ The instruction issue policy is determined by its look-ahead capability in the instruction stream
For example, with no look-ahead, if a resource conflict halts instruction fetching, the processor is not able to find any further instructions until the conflict is resolved
If the processor is able to continue fetching instructions, it may find an independent instruction that can be executed on a free resource, out of programmed order
♣ This is a decoupling of instruction fetching from instruction execution, and it is the basis for most modern microprocessors
1. In-order issue with in-order completion
♣ Simplest instruction issue policy
Instructions are issued in exact program order, with results written in the same order
This is shown for comparison purposes only, as very few pipelines use in-order completion
1. In-order issue with in-order completion
♣ Assume a three-stage pipeline (decode, execute, writeback) that can issue two instructions, execute three instructions and write back two results every cycle… assume:
I1 requires 2 cycles to execute
I3 and I4 are in conflict for a functional unit
I5 depends on the value produced by I4
I5 and I6 are in conflict for a functional unit
[Figure: decode/execute/writeback timing table]
6 instructions require 8 cycles: IPC = 6/8 = 0.75
2. In-order issue with out-of-order completion
♣ Out-of-order completion improves the performance of instructions with long-latency operations such as loads and floating point ops
♣ The modifications made to execution are:
any number of instructions is allowed in the execution stage, up to the total number of pipeline stages for all functional units
instruction issue is not stalled when an instruction takes more than one cycle to complete
2. In-order issue with out-of-order completion
♣ Again assume a processor issues two instructions, executes three instructions and writes back two results every cycle
I1 requires 2 cycles to execute
I3 and I4 are in conflict for a functional unit
I5 depends on the value produced by I4
I5 and I6 are in conflict for a functional unit
[Figure: decode/execute/writeback timing table]
6 instructions require 7 cycles: IPC = 6/7 = 0.86
2. In-order issue with out-of-order completion
♣ In a processor with out-of-order completion, instruction issue is stalled when:
There is a conflict for a functional unit
An instruction depends on a result that is not yet computed - a dependency
♥ register specifiers can be used to detect dependencies between instructions, with logic to ensure synchronisation between producer and consumer instructions - e.g. scoreboard logic
♣ There is one other situation when issue must be stalled…
it is a new type of dependency, caused by out-of-order completion
It is called an output dependency
Output Dependency
♣ Consider the code below:
R3 := R3 op R5
R4 := R3 + 1
R3 := R5 + 1
R7 := R3 op R4
the 1st instruction must complete before the 3rd, otherwise the 4th instruction may receive the wrong result!
this is a new type of dependency, caused by allowing out-of-order completion
the result of the 3rd instruction has an output dependency on the 1st instruction
the 3rd instruction must be stalled if its result may be overwritten by a previous instruction which takes longer to complete
3. Out-of-order issue with out-of-order completion
♣ In-order issue stalls when the decoded instruction has a resource conflict, a true data dependency or an output dependency on an uncompleted instruction
this is true even if instructions after the stalled one could execute
to avoid stalling, decode must be decoupled from execution
♣ Conceptually, out-of-order issue decouples the decode/issue stage from instruction execution
it requires an instruction window between the decode and execute stages to buffer decoded or partly pre-decoded instructions
this buffer serves as a pool of instructions, giving the processor a look-ahead facility the size of the buffer (n.b. in this simple pipeline it is not a pipeline stage)
instructions are issued from the buffer in any order, provided there are no resource conflicts or dependencies with executing instructions
3. Out-of-order issue with out-of-order completion
♣ Again assume a processor issues two instructions, executes three instructions and writes back two results every cycle, but now with an issue window of at least three instructions
I1 requires 2 cycles to execute
I3 and I4 are in conflict for a functional unit
I5 depends on the value produced by I4
I5 and I6 are in conflict for a functional unit
[Figure: decode/instruction-window/execute/writeback timing table]
6 instructions require 6 cycles: IPC = 1
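A minimal Python sketch (not from the original slides) of issue from an instruction window: any instruction whose operands are complete may issue, out of program order. Functional-unit conflicts and multi-cycle execution are ignored for brevity:

```python
# Out-of-order issue from a 3-entry instruction window, issue width 2.
instrs = {                    # instruction -> instructions it depends on
    "I1": [], "I2": [], "I3": [], "I4": [],
    "I5": ["I4"],             # I5 needs the value produced by I4
    "I6": [],
}
WIDTH = 2

program = list(instrs)
window, completed, cycle = [], set(), 0
while program or window:
    while program and len(window) < 3:      # fill the window in program order
        window.append(program.pop(0))
    ready = [i for i in window if all(d in completed for d in instrs[i])]
    issued = ready[:WIDTH]                  # issue up to WIDTH, any order
    print(f"cycle {cycle}: issue {issued}")
    for i in issued:
        window.remove(i)
        completed.add(i)                    # assume single-cycle execution
    cycle += 1
```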
Anti-dependency
♣ Out-of-order issue introduces yet another dependency - called an anti-dependency
Consider the same code:
R3 := R3 op R5
R4 := R3 + 1
R3 := R5 + 1
R7 := R3 op R4
the 3rd instruction cannot be completed until the 2nd instruction has read its operands
otherwise the 3rd instruction may overwrite the operand of the 2nd instruction
we say that the result of the 3rd instruction has an anti-dependency on the 1st operand of the 2nd instruction
this is like a true dependency but reversed
Dependencies or storage conflicts
♣ We have now seen three kinds of dependencies:
True (data) dependencies … read after write (RAW)
Output dependencies … write after write (WAW) - out-of-order completion
Anti-dependencies … write after read (WAR) - out-of-order issue
♣ Only true dependencies reflect the flow of data in a program, and only true dependencies should require the pipeline to stall
when instructions are issued and completed out of order, the one-to-one relationship between registers and values at any given time is lost
new dependencies arise because registers hold different values from independent computations at different times - a resource dependency
♣ Anti-dependencies and output dependencies, then, are really just storage conflicts, and can be eliminated by introducing new registers to re-establish the one-to-one relationship between registers and values at a given time
Renaming - example code
♣ Renaming dynamically rewrites the assembly code using a larger register set
A renamed register is allocated somehow and remains in force until commit
Subsequent use of a register name as an operand uses the latest rename of it
may need to keep several renamed registers if using branch prediction
The example, renamed (R3 -> R3b -> R3c; the scope of R3b is the first two instructions):
R3b := R3 op R5
R4 := R3b + 1
R3c := R5 + 1
R7 := R3c op R4
Register renaming
♣ Storage conflicts can be removed in out-of-order issue microprocessors by renaming registers
this requires additional registers, e.g. a rename buffer or an extended register file, not identified by the program
the relationship between program name and actual location has to be maintained while the instructions are executing
♣ Instructions are executed out of sequence from the instruction window using the renamed registers
allocate a new register on multiple use of the same target register
store the mapping from instruction register to architectural register
instructions executing after a rename use the renamed register rather than the instruction-specified register as an operand
♣ A commit stage is used to preserve sequential machine state by storing values to architectural registers in program order
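A minimal Python sketch (not from the original slides) of the renaming idea, assuming an unbounded pool of physical registers; it reproduces the renamed code from the previous slide:

```python
# Register renaming: every write to an architectural register allocates a
# fresh physical register; operand reads use the latest mapping.
from itertools import count

def rename(program):
    phys = count()                        # free list of physical register ids
    mapping = {}                          # architectural -> physical

    def lookup(reg):                      # operands read the latest mapping
        if reg not in mapping:
            mapping[reg] = f"p{next(phys)}"
        return mapping[reg]

    renamed = []
    for dst, srcs in program:
        srcs = [lookup(s) for s in srcs]
        mapping[dst] = f"p{next(phys)}"   # fresh register: removes WAW and WAR
        renamed.append((mapping[dst], srcs))
    return renamed

prog = [("R3", ["R3", "R5"]),   # R3 := R3 op R5
        ("R4", ["R3"]),         # R4 := R3 + 1
        ("R3", ["R5"]),         # R3 := R5 + 1   (would be WAW/WAR without renaming)
        ("R7", ["R3", "R4"])]   # R7 := R3 op R4
for dst, srcs in rename(prog):
    print(dst, "<-", srcs)
```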
Register renaming
♣ Can either rename at instruction issue
explicit renaming maps an architectural register to a physical register
used in conjunction with a scoreboard to track dependencies
♣ Can remap implicitly, using reservation stations at the execute stage and a reorder buffer on instruction completion
reservation stations use dataflow to manage true dependencies and implicitly rename registers
the reorder buffer holds the data until all previous instructions have completed, then writes it to the architectural register specified
Explicit renaming
♣ Register renaming logic maps an ISA register address to a physical register address
♣ At any point in the fetched code, there is just one valid mapping for a given logical register specifier
subsequent instructions will use that mapping to read their operands until a further remapping is made
♣ Two implementations of the mapping function are possible:
1. RAM, where the physical address is stored in a table the size of the ISA register file and read using the ISA register address
2. CAM (content addressable memory), where the ISA address is stored in a content-addressable table the size of the physical register file - in this case the ISA address is used as a tag and multiple matches to this tag are resolved using a currently-valid bit
Implementation of renaming
[Figure: the RAM scheme indexes a table of ISA-register-file size with the ISA target id and reads out a physical id; the CAM scheme stores (ISA id, physical id, valid bit) entries in a table of physical-register-file size, allocating from a freelist of physical register ids at decode and freeing at commit; previous mappings are held from branch prediction until confirmation]
Dynamic scheduling
Out-of-order issue (OoO)
Abstract OoO instruction issue
[Figure: instructions flow from the I-cache through fetch & decode into an instruction window that performs synchronisation and scheduling; window slots are empty, waiting for data, or ready to execute; ready instructions issue in parallel to the Lw/Sw, Integer, Branch and Float execution units; writes resolve dependencies and allow new instructions to be scheduled]
This is a form of dataflow instruction execution… maintain the state of register synchronisation and schedule instructions (issue to execution units)
Dynamic techniques for OoO
♣ Out-of-order issue provides exploitation of difficult ILP
works when dependencies are not known at compile time
gives binary code compatibility across generations of an ISA
♣ Scoreboard - originally implemented in the CDC 6600 in 1963
centralized control structure
the CDC 6600 had no register renaming and no forwarding
the pipeline stalls for WAR and WAW hazards
♣ Reservation stations - first used in the IBM 360/91 in 1966
distributed control structures
implicit renaming of registers, i.e. dispatched pointers into the RF that are held in reservation stations associated with each of the functional units
♥ these tags are matched with result tags using a common data bus
♥ results are broadcast to all reservation stations for RAW
WAR and WAW hazards are eliminated by register renaming
Reservation stations
[Figure: a functional unit with reservation-station entries, each holding two (data/tag, v) operand slots, connected to the common data bus]
The tag identifies the source of the data - i.e. which functional unit will produce it; v is a valid bit
Instructions can be issued before data is available, with the valid bit set to 0
Only when an instruction has two valid operands can it be sent to the functional unit
Every result is matched against all left/right tags in every functional unit, and the data is grabbed if a match occurs
Scoreboard approach (CDC 6600)
[Figure: functional units (two FP Mult, FP Divide, FP Add, Integer) connected to the register file; the scoreboard logic tracks functional unit use & dependencies]
Four-stage Scoreboard Control
♣ Issue - decode instructions & check for structural hazards (ID1)
instructions are issued in program order (for hazard checking)
don't issue if there is a structural hazard
don't issue if the instruction is output dependent on any previously issued but uncompleted instruction (i.e. WAW hazards)
♣ Read operands - wait until no data hazards, then read operands (ID2)
all dependencies (RAW) are resolved in this stage - the scoreboard identifies when a register is valid from prior instructions' WB
no forwarding of data (in the CDC 6600)
♣ Execution - operate on operands (EX)
the functional unit begins execution upon receiving operands and signals the scoreboard on completion
♣ Write result - finish execution (WB)
stall until there are no WAR hazards on previous instructions
Example: DIVD F0,F2,F4; ADDD F10,F0,F8; SUBD F8,F8,F14
The CDC 6600 scoreboard would stall SUBD until ADDD reads its operands
More information: http://www.cs.umd.edu/class/fall2001/cmsc411/projects/dynamic/example1.html
Tomasulo's algorithm organization
[Figure: an FP Op Queue and the FP Registers feed reservation stations (Add1-Add3 for the FP adders, Mult1-Mult2 for the FP multipliers); load buffers bring data from memory and store buffers send it to memory; all results are broadcast on the Common Data Bus (CDB)]
Three-stage Tomasulo Algorithm
1. Issue - get an instruction from the FP Op Queue
if a reservation station is free (no structural hazard), control issues the instruction and operands (renaming registers)
2. Execution - operate on operands (EX)
when both operands are ready, execute;
if not ready, watch the Common Data Bus for the result
3. Write result - finish execution (WB)
write on the Common Data Bus to all awaiting units
mark the reservation station available
♣ A normal data bus is data + destination; the common data bus is data + source
64 bits of data + 4 bits of functional-unit address
a reservation station writes if it expects a result from that functional unit
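A minimal Python sketch (not from the original slides) of the reservation-station mechanism: operands are either values or tags naming the station that will produce them, and a CDB broadcast fills every matching slot:

```python
# Reservation stations with tag matching on a common data bus (CDB).
class Station:
    def __init__(self, name, op):
        self.name, self.op = name, op
        self.operands = []          # each entry: ("val", x) or ("tag", station)

    def ready(self):
        return all(kind == "val" for kind, _ in self.operands)

    def capture(self, tag, value):  # snoop the CDB: grab data on a tag match
        self.operands = [("val", value) if kind == "tag" and v == tag
                         else (kind, v) for kind, v in self.operands]

# MUL 2*3 then ADD of that result and 10: the ADD waits on the MUL's tag,
# so no architectural register is needed for the intermediate value.
mul = Station("Mult1", lambda a, b: a * b)
mul.operands = [("val", 2), ("val", 3)]
add = Station("Add1", lambda a, b: a + b)
add.operands = [("tag", "Mult1"), ("val", 10)]

assert not add.ready()
result = mul.op(*(v for _, v in mul.operands))   # Mult1 completes...
add.capture("Mult1", result)                     # ...and broadcasts on the CDB
assert add.ready()
print(add.op(*(v for _, v in add.operands)))     # 16
```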
More information
♣ Demo using a web applet from the University of Edinburgh:
http://www.dcs.ed.ac.uk/home/hase/webhase/demo/tomasulo.html
♣ Another demo from the University of Massachusetts:
http://www.ecs.umass.edu/ece/koren/architecture/Tomasulo1/tomasulo.htm
Recent example of Tomasulo
♣ The PPC 620 is an implementation of the 64-bit PowerPC architecture
It can fetch, issue and retire up to 4 instructions per cycle
♣ It has 6 functional units which can issue instructions independently from reservation stations:
2 simple integer units (single cycle)
1 integer unit for multiply and divide (3-20 cycles latency)
1 load/store unit (1 cycle integer, 2 cycles floating point)
1 floating point unit (2-31 cycles for divide)
1 branch unit
♣ N.b. the G5 processor is a successor to the PPC 620; see Blackboard for more details
[Figure: PPC 620 organization - the fetch unit and instruction cache feed a dispatch unit with an 8-entry instruction queue; instruction dispatch buses lead to the reservation stations of the six functional units (Int 0, Int 1, Mpy, LSU, FPU, BPU) and to the integer and floating-point register files; the LSU connects to the data cache; a result status bus and reorder buffer information feed a completion unit with reorder buffer]
Machine state
♣ Issuing instructions out of order means the sequential-order machine state is lost
this can cause problems when exceptions occur
how can we define the machine state if many instructions are in flight - not necessarily in program order?
♣ To checkpoint the sequential state we must retire or commit instructions only when all previous instructions have finished
only at this stage can the result of an instruction be written into the architectural register
♣ This is managed in a reorder buffer that provides sequential state out of the end of the pipeline
Reorder buffer
♣ The reorder buffer stores information about all instructions executing between the issue and retire stages
it can also store the results of those instructions pending a write to the architectural register file, which is another implicit form of register renaming
♣ The reorder buffer is a queue or FIFO
instructions are written to it at the tail, in program order, at the issue stage
instructions are removed from it at the head, but only when they have completed execution
at this stage the results are written into the user-specified register
the instruction is then said to be retired or committed
Reorder buffer
[Figure: a FIFO of instruction entries from tail to head, each with a flag v that records whether its result has been written; entries at the head with v = 1 can commit, but an entry with v = 0 blocks commit of everything behind it; results held in the buffer are available for bypassing or forwarding]
Variations
♣ There are many variations on the implementation of the dynamic OoO scheduling algorithm
can combine explicit register renaming with a scoreboard and a standard pipeline (e.g. Alpha 21264)
[Figure: Fetch, then Decode/Rename using a scoreboard + rename table, then Execute]
♣ Issue/completion state can occur in many different places
one combined (architectural + rename) register file
architectural register file + reservation stations + reorder buffer
etc.
Explicit renaming
♣ Requires rapid access to a table of translations
♣ And a physical register file that has more registers than specified by the ISA
♣ And the ability to figure out which physical registers are free
if there are no free registers then stall on issue
♣ Register renaming does not require reservation stations, but…
many modern architectures use explicit register renaming + Tomasulo-like reservation stations to control execution
Memory state
♣ Note that although a consistent register state may be identified using a reorder buffer, the memory state is a different matter
this is because of memory delays, cache writeback strategies, etc.
♣ Modern microprocessors hold reads and writes in buffers similar to reservation stations
these will match reads with writes and also bypass data, so that a read to a location whose buffered write has not yet reached memory can take its value from the buffer
♣ This allows load/store reordering and can improve the locality of memory accesses
Load store reordering
♣ Can safely allow a load to bypass a store, as long as the addresses of the load and store are different
♣ If the addresses are not known, then either
do not allow bypassing, or
speculatively bypass the store but squash the load if the address turns out to be the same
♣ Can also allow loads to bypass loads on cache misses
This is called a lock-up-free cache, but it can complicate the cache coherence protocol
Control hazards revisited
♣ Superscalar pipelines are typically super-pipelined and have many stages, ~10-20
♣ They also have wide issue widths, ~4-8
we may therefore have 40-160 instructions in flight
♣ The latency to resolve a branch condition is large
17 cycles for a conditional branch in the Pentium 4!
hence many pipeline slots will be filled with instructions from the wrong target
this has to be cleaned up when the wrong target is chosen
this includes register renames
Grohoski's estimate
♣ 1 in 5 instructions are branches
1 in 3 branches are unconditional - fully predictable
1 in 3 branches are loop-closing conditionals - predictable
♥ n-1 of these are taken; only the last is not
1 in 3 branches are random branches - not so predictable
♥ 50% of these are taken - on average, if truly random
therefore 1 in 6 branches (1 in 30 instructions) cause problems, and it is unlikely that the pipe will ever remain full
♥ even with 8-way issue, IPC will not get much above 2-3
♣ This requires sophisticated branch prediction algorithms
Dynamic branch prediction
♣ Branch prediction can be used to improve all but the random branches
♣ The prediction is stored in the I-cache, a branch history table (BHT) or a branch target buffer (BTB)
A 1-bit predictor associated with an instruction address or partial address tells us what the last branch did
♥ i.e. taken (T) or not taken (N), which is used to predict the next branch
A 1-bit-predicted loop will fail on the first and last iterations
♥ initially the entry is N; after the first iteration the entry is T and will remain at T till the last iteration, when it fails again as the branch is not taken
♥ NTTTTT…TTTN
2-bit predictors - bimodal counters
♣ Use two bits to represent the last two attempts (~90% accurate) … there are various schemes
TT, TN, NT, NN are the predictor states
E.g. change the prediction only if we mispredict twice, but return in one step - one of several strategies
[Figure: four-state diagram - TT and TN predict taken, NT and NN predict not taken, with T/N transitions between the states]
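A minimal Python sketch (not from the original slides) of one common 2-bit scheme, a saturating counter per branch indexed by (part of) the branch address:

```python
# 2-bit saturating counters: states 0-1 predict not taken, 2-3 predict taken.
counters = {}                        # branch address (partial) -> 2-bit state

def predict(addr):
    return counters.get(addr, 0) >= 2          # True = predict taken

def update(addr, taken):
    c = counters.get(addr, 0)
    counters[addr] = min(c + 1, 3) if taken else max(c - 1, 0)

# A loop branch: taken n-1 times, then not taken at loop exit. Once warmed
# up, the predictor misses only the final, loop-exit branch.
outcomes = [True] * 9 + [False]
hits = 0
for taken in outcomes * 2:           # run the loop twice
    hits += predict(0x40) == taken
    update(0x40, taken)
print(f"{hits}/{len(outcomes) * 2} correct")    # 16/20
```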
Branch history table/target buffer
♣ Stores the prediction state in a table, either associatively addressed or indexed on a small number of address bits
Can also store the branch target if it is associative
Get the prediction at the IF stage and update the prediction when the condition is resolved
[Figure: the PC indexes the table to yield the prediction state (e.g. TT) and, optionally, the target address]
N.b. in a loop, on every conditional branch, the same address is calculated
Correlated or global predictors
♣ There may be correlation between different branches, e.g. nested loops
♣ Normally predictors are indexed on address bits of the branch instruction
♣ Correlation can be tracked by so-called global predictors, which maintain a register of the history of recent branches taken and use this to address the prediction
♣ This scheme, by itself, is only better than the bimodal scheme for large table sizes, and is never as good as local prediction
♣ Alternatively, we can index the table of bimodal counters with the recent history concatenated with a few bits of the branch instruction's address - the Gselect predictor - or XOR the recent history with the instruction address (Gshare)
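A minimal Python sketch (not from the original slides) of the two index functions just described, for a 1024-entry table of 2-bit counters; the bit widths are illustrative:

```python
# Gselect vs Gshare indexing into a table of bimodal counters.
HIST_BITS, ADDR_BITS, TABLE_BITS = 6, 4, 10
history = 0                          # global history register of recent outcomes

def gselect_index(pc):
    # concatenate recent history with a few low bits of the branch address
    return ((history & ((1 << HIST_BITS) - 1)) << ADDR_BITS) \
           | (pc & ((1 << ADDR_BITS) - 1))

def gshare_index(pc):
    # XOR the recent history with the branch address
    return (history ^ pc) & ((1 << TABLE_BITS) - 1)

def record(taken):                   # shift the latest outcome into the history
    global history
    history = ((history << 1) | int(taken)) & ((1 << HIST_BITS) - 1)

record(True); record(False); record(True)      # history = 0b101
print(gselect_index(0x4a4), gshare_index(0x4a4))
```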
Example microprocessors
Compaq (DEC) Alpha 21264 - EV6
♣ The EV6 was the last Alpha microprocessor to be manufactured
Alpha has a very clean RISC ISA that uses separate integer and floating-point register files
Alpha was unique in supporting a high clock rate and a short pipeline through good ISA and silicon design
♣ The EV6 uses out-of-order issue in a 7-stage pipeline
4 instructions per cycle can be fetched (speculatively), and up to 6 instructions issued out of order
sophisticated branch predictor
uses scoreboarding and explicit renaming techniques to track dependencies and avoid false dependencies
Alpha 21264 pipeline
[Figure: the 21264's seven-stage pipeline]
Saving ports of the RF
♣ Later in this section we will see that the register file area grows quadratically with the number of ports
the number of ports is proportional to the number of functional units allowed to read and write in one cycle
multiple-issue processors require two read ports and one write port per concurrent functional unit
♣ Ports can be minimised by:
separating floating point and integer operations, as long as data can be moved between the two
duplicating registers
♣ The Alpha uses both techniques
72 Fl.Pt. registers and 80 integer registers, the latter duplicated to minimise ports
[Figure: the Alpha groups two functional units with each of two copies of the 80-entry integer register file]
Each copy needs 4 read and 4 write ports: area = 2·c·8² = 128c
Instead of one file with 8 read and 4 write ports: area = c·12² = 144c
Only a small saving in area, but this also aids locality of signals on the critical read-to-functional-unit path
c is a constant based on line width/spacing
Stage 0 - branch prediction
♣ Hybrid tournament branch predictor
♣ 90-100% accuracy
[Figure: the PC indexes a 1024x10 local history table, which feeds a 1024x3 local prediction table; the path history indexes a 1024x2 global prediction table; a 4096x2 choice prediction selects between the local and global predictions - 23 Kbits in total]
Instruction fetching
♣ The I-cache is a 2-way set-associative cache
a cache block fetch is four instructions
it uses line and set prediction
♥ it predicts where to fetch the next block
♥ and in which set to place the block
accuracy of 85%; a line misprediction typically costs a 1-cycle bubble
AMD Athlon
♣ An implementation of the X86 instruction set
♣ 3 instructions fetched speculatively
uses a branch history table and branch target buffer
♣ Uses a reorder buffer for renaming
integer operations merged with the architectural register file
♣ Long floating point operations are renamed and scheduled independently
the floating point pipes also implement the MMX extensions
♣ Uses an 8-way banked L1 D-cache
both I- and D-caches are 2-way set associative
AMD Athlon
[Figure: Athlon block diagram]
AMD 64-bit architecture
♣ The Opteron, introduced in 2003, moved AMD to a 64-bit architecture
♣ In 2005 AMD introduced the first dual-core processor
♣ In 2007 AMD introduced the quad core and now offers a 6-core version
6-core Opteron
904M transistors, 346 mm²
45nm technology
Power PC G5 (IBM/Apple)
♣ The PowerPC G5 was fabricated using 0.13 µm (130 nm) circuitry … about 350 meters of wiring!
♣ In-order dispatch of up to five operations into a distributed issue queue structure - reservation stations
♣ Simultaneous issue of up to 10 out-of-order operations to 12 functional units
N.b. in-order dispatch / out-of-order issue!
G5 Specifications
♣ Velocity Engine - 128 bits wide
♣ Other datapaths - 64 bit
♣ 64 KB L1 I-cache
♣ 32 KB L1 D-cache
♣ 512 KB L2 cache
♣ Uses a hybrid branch predictor like the Alpha
♣ Support for up to 8 outstanding L1 cache misses
♣ Hardware-initiated instruction prefetching from the L2 cache
♣ Hardware- or software-initiated data stream prefetching - we look at this later
support for up to eight active streams
Intel Pentium 4
♣ This has a very deep pipeline and hence a high clock rate - 4 GHz compared to the G5's 2 GHz
♣ Problems are exacerbated by the X86 CISC instruction set, which is not suitable for pipelining
♣ X86 instructions are translated into µops
these are regular and uniform - like a traditional RISC ISA
a trace cache caches translated µop sequences along program execution paths
Pentium P4 pipeline overview
♣ The decode stage translates X86 instructions into µops
♣ The trace cache stores µop traces
i.e. a cache of instructions as executed, rather than as stored in memory
♣ Branch history table
branch target buffer + static prediction
♣ Now adopts simultaneous multi-threading - hyperthreading
fetches instructions simultaneously from two threads
Trace cache
♣ Caches branch sequences rather than memory sequences - i.e. the sequence of instructions executed rather than their order in memory
Pentium 4 pipeline
♣ 6-way out-of-order execution, 20-stage pipeline
2 generations compared [figure: the two pipelines, note the superpipelining]
♣ 126-entry reorder buffer (registers and result status)
Multi-core
♣ The Pentium 4D increased the pipeline length to 31 stages in an attempt to push clock speeds to 4 GHz
♣ Since then Intel has moved to multi-core - shorter pipelines, slower clocks
2006 Core 2 Duo, 2.93 GHz, 291M transistors, 65nm
2007 Quad-core Xeon, 2.66 GHz, 582M transistors, 65nm
2007 Quad-core Xeon Penryn, >3 GHz, 820M transistors, 45nm
2008 Quad-core i7 Nehalem, 731M transistors, 45nm
Penryn
♣ 2 or 4 cores
♣ 14-stage pipeline
♣ 4-way instruction issue
♣ No hyperthreading
♣ Large 12 Mbyte L2 cache
Intel Core i7 - Nehalem
♣ 2 or 4 cores, up to 3 GHz
♣ 14-stage pipeline with stream prefetching
♣ Return of hyper-threading
♣ 32 KB L1 I-cache & 32 KB L1 D-cache (8-way set associative)
♣ 256 KB L2 cache per core (8-way set associative)
♣ Shared 8 MB L3 cache (16-way associative)
Evolution of Intel processors
Download from: http://download.intel.com/pressroom/kits/IntelProcessorHistory.pdf
Architectural Optimisations
Application-specific optimisations
♣ Multimedia applications require intensive compute power
They are also very regular in their structure, i.e. they perform the same operations on large regular data sets
♣ A number of optimisations can be made to accelerate these applications - we consider two:
SIMD extensions
Streaming data
SIMD extensions
♣ Many modern microprocessors provide SIMD extensions for multimedia processing
E.g. MMX, Altivec, Velocity Engine
♣ Multimedia applications have specific characteristics:
low bit precision (often 8 or 16 bits)
regular concurrent operations (vector operations)
MACs - not burgers but multiply-accumulates
♥ these result from inner-product operations in linear algebra
Example of SIMD extensions
♣ Example - PPC G5 Processor
Altivec - the Velocity Engine processes up to 128 bits of data as:
♥ 4 by 32-bit integers,
♥ 8 by 16-bit integers,
♥ 16 by 8-bit integers, or
♥ 4 by 32-bit single-precision floating-point values
Pipelined single-cycle operations
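A minimal Python sketch (not from the original slides) of the SIMD idea: one 128-bit word treated as 16 independent 8-bit lanes, added lane by lane. In hardware all lanes update in parallel; the loop here only models the lane partitioning:

```python
# Lane-wise modulo add of two 128-bit values holding 16 x 8-bit integers.
LANES, WIDTH = 16, 8
MASK = (1 << WIDTH) - 1

def simd_add8(x, y):
    result = 0
    for lane in range(LANES):
        a = (x >> (lane * WIDTH)) & MASK
        b = (y >> (lane * WIDTH)) & MASK
        result |= ((a + b) & MASK) << (lane * WIDTH)   # wrap within the lane
    return result

x = int.from_bytes(bytes(range(16)), "little")
y = int.from_bytes(bytes([250] * 16), "little")
print(simd_add8(x, y).to_bytes(16, "little").hex())    # per-lane wrap-around
```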
SIMD instructions
We also need to be able to shift data to align it for processing where indices are offset
Register set
♣ As with the separation of integer and floating point in a general-purpose processor, SIMD extensions may use existing or dedicated registers
the Altivec 128-bit SIMD extensions use separate registers
♣ Usually provide both modulo and saturated arithmetic
modulo wraps around in the field specified
saturation, as the name suggests, is limited at either end of the range (see the sketch below)
♣ The extensions of course need compiler support
support varies from automatic vectorisation…
… to support for vector types and operations
♣ in the latter case, the many different instruction sets do not help portability
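A minimal Python sketch (not from the original slides) of the two arithmetic modes for a single 8-bit lane:

```python
# Modulo arithmetic wraps around; saturated arithmetic clamps at the range end.
def add8_modulo(a, b):
    return (a + b) & 0xFF            # wraps: 250 + 10 -> 4

def add8_saturated(a, b):
    return min(a + b, 0xFF)          # clamps: 250 + 10 -> 255

print(add8_modulo(250, 10), add8_saturated(250, 10))
```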
SIMD computers
♣ Historically there have been many pure SIMD computers built:
the Illiac 4 was a 64-element array of 64-bit processors, circa 1975
the ICL DAP was a 4096-element array of 1-bit processors, circa 1980
♣ The technique of reconfiguration is also not new:
the ICL DAP could be configured between 4096 1-bit processors and 32 32-bit processors
the RPA (1984-89) was more flexible:
♥ Jesshope C R et al. (1989) Design of SIMD microprocessor array, IEE Proceedings E, Computers and Digital Techniques, 136(3), pp 197-204, ISSN 0143-7062
♥ Rushton R (1989) Reconfigurable Processor-Array: A Bit-Sliced Parallel Computer, MIT Press (Ed. Jesshope, C R), ISBN 0262680572
Streaming applications
♣ Multimedia (and other streaming) applications suffer badly from compulsory cache misses
the computation uses a stream of data that is used once, transformed, and never used again
very little opportunity for reuse
♣ The data stream is very regular however
typically data addresses are of the form: S_i = A_0 + i*stride, where stride defines the offset between subsequent locations
♣ A solution is to bring data into the cache before it is required
this is called stream prefetching
Stream prefetching
♣ The process requires two actions:
1. detecting the stream - i.e. detecting regular strides between memory accesses
2. issuing prefetch instructions to bring data into the cache
♣ The options are:
static detection / static prefetch - SW (user or compiler generated)
static detection / dynamic prefetch - HW/SW
dynamic detection / dynamic prefetch - HW solution
Static detection, static prefetch
♣ Cheap and cheerful
Requires an ISA that supports prefetch
♥ a prefetch instruction touches the cache with the required address before the data is required, but does not load a register
Disadvantages:
♥ requires additional instructions to be executed - critical in real-time signal processing
♥ the frequency of access is based on cache/memory parameters and requires loop unrolling - this can interfere with other optimisations
♥ the prefetch is not binding, i.e. the line can be swapped out again by another miss into the cache - depends on the replacement policy
Dynamic detection, dynamic issue
♣ Requires hardware to detect strides between consecutive loads for a given instruction
Many streams may be interleaved
♥ a stride-prediction table (SPT) keeps a history of instruction address and memory address and detects stream candidates
♣ Once candidates are identified, the hardware can touch the cache without taking an instruction slot in the pipeline
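A minimal Python sketch (not from the original slides) of a stride-prediction table: for each load instruction (by PC) it records the last address and last stride, and issues a prefetch once the same stride is seen twice:

```python
# Stride-prediction table: pc -> (last address, last stride).
spt = {}

def observe(pc, addr):
    last = spt.get(pc)
    if last:
        last_addr, last_stride = last
        stride = addr - last_addr
        if stride and stride == last_stride:       # stream confirmed
            print(f"pc={pc:#x}: prefetch {addr + stride:#x}")
        spt[pc] = (addr, stride)
    else:
        spt[pc] = (addr, None)

# A load walking an array with a 16-byte stride: S_i = A0 + i*stride
for i in range(5):
    observe(pc=0x400, addr=0x10000 + i * 16)
```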
Dynamic solution
♣ Advantages
no programmer intervention - supports legacy code
♣ Disadvantages
a large SPT is required
an even larger SPT if we perform loop unrolling
still a potential for cache thrashing
♣ To avoid thrashing, separate stream caches can be used
the replacement policy for a stream cache is different from that for normal locality
♥ e.g. evict most-recently-used rather than least-recently-used
can also look at longer history sequences
Static detection, dynamic issue
♣ Static detection of streams is trivial, as the stream is implicit in the loop parameters
♣ Again, either the compiler or the programmer signals which streams to prefetch
instead of using prefetch instructions, the instructions set up a prefetch engine to prefetch the stream
requires only a small prefetch table
no additional instructions are required in the inner loop
Technology issues
Cancelled projects
♣ Both Intel and Compaq have recently cancelled wide superscalar projects
the Pentium NetBurst and the Alpha 21464
♣ The trend is now to develop independent microprocessors on chip - multi-cores
this is covered in the next section on explicit concurrency
♣ But why is this?
Technology issues
♣ Moore's law - applied to CMOS silicon
gate concurrency scales as e^{kt} (0.35 ≤ k ≤ 0.46)
♥ t in years, i.e. a doubling period of 1.5 to 2 years
gate speed scales as e^{kt/2}
most performance has come from gate speed, with density providing support in the form of large caches
♣ Industry predicts 10 to 15 years of scaling left (from the SIA roadmap)
pessimistically, 32-fold concurrency and 6-fold speed
optimistically, 1024-fold concurrency and 32-fold speed
Power dissipated
♣ The power dissipated in a silicon CMOS circuit comprises several components, and the major component has been dynamic power
♣ Dynamic power = kV²⟨f⟩
Ronen et al. (2001) Coming Challenges in Microarchitecture and Architecture, Proc IEEE, 89(3), pp 325-340
Architecture
♣ Past performance gains came mostly from clock rate
the clock improved faster than circuit speed - super-pipelining
♣ Gate concurrency has not translated to instruction issue width - the example below is quite revealing
1987 - T800 transputer, 0.5M transistors
♥ 20 MHz and single issue, 64-bit floating point
2004 - P4 Extreme, 178M transistors
♥ 3 GHz and 6-way issue

Scaling                        Concurrency   Speed
Predicted from Moore's law     387           20
Transistor scaling             356           ?
Instruction issue scaling      6             150

From known scaling, instead of one P4 we could have had 387 400 MHz T800 transputers
Speed is 7-8 times that expected - superpipelining!
Issue width is 64 times less than expected!
Voltage - frequency scaling
♣ Within some voltage range:
frequency increases with supply voltage - k1·V^(µ-1), µ ≥ 1
but power increases more quickly - k2·V^(µ+2)
♣ This can be used to trade performance against power by reducing the voltage
if frequency is reduced, the voltage can be reduced, and hence the power dissipated is reduced
♣ With scalable concurrency, concurrent execution can reduce the total energy of a computation by lowering the required frequency
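A minimal Python sketch (not from the original slides) of this trade-off, assuming dynamic power P = k·V²·f and, for simplicity, frequency proportional to voltage (i.e. µ = 2); the constants are illustrative:

```python
# Two cores at half voltage and half frequency match one core's throughput
# for roughly a quarter of the power.
k = 1.0

def power(V, f):
    return k * V**2 * f          # dynamic power, as on the previous slide

V, f = 1.2, 3e9                  # one core at full voltage and frequency
one_core = power(V, f)
two_cores = 2 * power(V / 2, f / 2)   # same total throughput: 2 * f/2
print(f"relative power at equal throughput: {two_cores / one_core:.2f}")  # 0.25
```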
Signal propagation
[Figure: the fraction of a 20 mm chip edge whose bits are reachable in one clock cycle, shown analytically and qualitatively at 130, 100, 70 and 35 nm; the clock cycle is defined by 8, 16 or the optimal (f_SIA) number of fan-out-of-4 gate delays]
Superscalar support structures
♣ Superscalar register access is not scalable
for n instructions issued per cycle the number of registers grows as O(n) - otherwise we run out of registers
for n read/write ports in a register file, a one-bit register cell's area grows as O(n²)
register file area and latency therefore grow as O(n³)
♣ Superscalar issue is not scalable
it requires logic which grows at least as O(n²) to issue n instructions per cycle
multiported rename logic must be able to rename the same register multiple times in one cycle
rename logic is one of the key complexities in multi-issue!
♣ Power dissipated in these structures does not scale well
More architecture
♣ The high-profile project cancellations were due to power density and circuit complexity issues arising from the poor scalability of dynamic concurrency - out-of-order instruction issue
NetBurst, the Pentium 4 - Xeon evolution: "The chips are apparently too hot to go to market"
the Alpha 21464, an 8-way issue SMT superscalar
♣ This should not have been a surprise!
it was predicted in 1999 that the Alpha 21464's instruction issue would consume 25% of the chip's power budget
Alpha 21464 project
♣ A quick look at the cancelled Alpha 21464
R P Preston et al. (2002) Design of an 8-wide superscalar RISC microprocessor with simultaneous multithreading, ISSCC Digest and Visuals Supplement
♣ This was a 64-bit microprocessor designed to support 4 threads and to issue/retire 8 instructions out of order to 12 functional units in every cycle
♣ The floor plan below shows the large area dedicated to instruction issue and the register file
Alpha 21464 [Figure: floor plan]
Summary
♣ Out-of-order issue exploits instruction-level concurrency from a sequential instruction stream (implicit concurrency)
it supports a large number of instructions in execution simultaneously, in multiple functional units
instructions are dynamically scheduled at issue, executing out of order while honouring dependencies
dependencies introduced by completing and issuing instructions out of order are removed by register renaming
♣ Out-of-order issue is costly, only appropriate for low levels of concurrency, and is inefficient on irregularly branching code
it is difficult to get an IPC of much more than 2, even for 8-way issue
♣ The clear advantage is concurrency with binary-code compatibility
These issues should be considered when programming for efficiency!
History of High Performance Computers
[Figure: supercomputer performance (FLOPS, 10M to 1P) and CPU frequencies (10 MHz to 10 GHz) plotted against year, 1980-2010, for machines from the Cray X-MP and Fujitsu VP-200 through the NEC Earth Simulator; the gap between single-CPU performance (pipelining & instruction parallelism, Moore's law) and system performance is filled by system parallelism. Data courtesy Jack Dongarra]