Computer Architecture
R. Poss
ca02 - 10 september 2015
Performance
Processor performance equations
Performance = Results / Second = (Results / Instruction) × (Instructions / Cycle) × (Cycles / Second)
Execution time = Seconds / Result = (Instructions / Result) × (Cycles / Instruction) × (Seconds / Cycle)
Performance and execution time are related - how?
Hint: think throughput vs latency, 1 result vs many
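A worked example of the equations above, as a minimal sketch; the instruction count, CPI and frequency are illustrative numbers, not taken from the slides:

```python
# Hypothetical workload: all numbers are illustrative.
instructions_per_result = 1_000_000   # instructions / result
cpi = 1.25                            # cycles / instruction
freq_hz = 2.0e9                       # cycles / second

# Execution time = Instructions/Result x Cycles/Instruction x Seconds/Cycle
exec_time = instructions_per_result * cpi * (1.0 / freq_hz)

# Performance = Results/Second; for a single result at a time,
# throughput is the reciprocal of latency.
performance = 1.0 / exec_time

print(exec_time)    # seconds per result
print(performance)  # results per second
```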
(Diagram labels: frequency (Hz), IPC, CPI.)
Processor performance
Latency: expressed as CPI = cycles per instruction; divide by frequency to obtain the absolute latency
Throughput: expressed as IPC = instructions per cycle; multiply by frequency to obtain the absolute throughput
Pipelining objective: increase IPC, and also decrease CPI. As we will see, decreasing CPI and increasing IPC are conflicting requirements
Who controls what?
Results vs instructions: software; task of programmer, compiler, instruction set
Instructions vs cycles: micro-architecture; task of processor designer, instruction set, partly compiler
Cycles vs seconds: technology; task of circuit designer, manufacturer
Performance comparison
“X has a speedup of n relative to reference Y”, “X is n times faster than Y”:
n = perf(X) / perf(Y) (in the general case) = exectime(Y) / exectime(X) (for 1 program)
NB: performance encompasses software + hardware
Can’t compare the performance of software alone without specifying which hardware is used to measure
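The single-program speedup formula above, as a minimal sketch; the two timings are hypothetical measurements:

```python
def speedup(exectime_ref, exectime_new):
    """n such that 'new is n times faster than ref' (for 1 program)."""
    return exectime_ref / exectime_new

# Hypothetical timings of the same program on machines Y (reference) and X.
t_Y = 12.0  # seconds on Y
t_X = 4.0   # seconds on X
n = speedup(t_Y, t_X)
print(n)  # X is 3.0 times faster than Y
```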
Components and vocabulary
Complexity vs critical path
Complexity: number of operations/circuits needed to produce an output from an input
Critical path: longest chain of operations from input to output
Complexity and critical path are different – think parallelism again
But: the critical path establishes a lower bound on (circuit) complexity
In component design, the critical path sets a lower bound on cycle time
Von Neumann vs Harvard
Data memory vs code memory: where are the instructions coming from?
Harvard architecture: separate memory for code and data
Von Neumann architecture: same memory for both
Most architectures nowadays use a hybrid system:
Separate instruction and data caches (I-Cache / D-Cache)
Data in the I-Cache is not automatically updated when writing to the D-Cache, giving the illusion of partially separate memories
Buffer vs memory
Buffer: one or more memory cells arranged in a FIFO
Producer-consumer interface between two or more hardware components
Memory: one or more memory cells arranged in an array
With address selection circuit to determine which cell is read from/written to in one access operation
Register vs main memory
Register file:
on-chip, close to the processor; SRAM: faster, larger/bit, small capacity
indexed using fixed offsets in instruction codes
Main memory or scratchpads in MPSoCs
off-chip or farther from the processor; DRAM: slower, smaller/bit, large capacity
indexed using variable offsets in registers or other memory cells
Both are forms of “memory” from the architect’s perspective: the word “memory” designates any component with an address / value interface.
NB: from the software/compiler/OS perspective “memory” is not registers
Processors: RISC pipelines
Instruction set architecture
ISA = instruction set + operational semantics
Instruction set = all possible instruction encodings
Described with instruction formats and decode logic
Defines how operands and operations are derived from the instruction codes
Operational semantics = “what instructions do”
Described with pseudo-code, also called Register Transfer Language (RTL)
Defines how results are produced from operands
Example instruction set: MIPS
First bits determine the op and the encoding of the rest
More regularity = simpler decode logic
Op    Format   Opx   Insn
0     R-R      0x20  add
0     R-R      0x21  addu
0     R-R      0x22  sub
0     R-R      0x23  subu
8     R-I      -     addi
9     R-I      -     addiu
0x23  R-I      -     lw
0x2B  R-I      -     sw
4     B (R-I)  -     beq
3     J        -     jal
Example encodings
© Chris Jesshope 2008-2011, Raphael Poss 2011
Pipelines
Each instruction / operation can be decomposed into sub-tasks: Op = A ; B ; C ; D
Considering an instruction stream [Op1; Op2; ...], at each cycle n we can run in parallel: A(n+3) ∥ B(n+2) ∥ C(n+1) ∥ D(n)
(Diagram: at the start of cycle n, instruction n+3 enters stage A, n+2 enters B, n+1 enters C, and n completes in D.)
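The overlap above can be sketched as follows; the four stage names and the schedule rule (stage s of instruction i runs in cycle i + s) mirror the slide, everything else is illustrative:

```python
STAGES = ["A", "B", "C", "D"]

def schedule(n_insns):
    """Map cycle -> list of (stage, instruction index) active that cycle."""
    sched = {}
    for i in range(n_insns):
        for s, stage in enumerate(STAGES):
            # Stage s of instruction i occupies cycle i + s.
            sched.setdefault(i + s, []).append((stage, i))
    return sched

sched = schedule(6)
# In cycle 3 all four stages are busy: D finishes insn 0 while A starts insn 3.
print(sched[3])
```

Note that 6 instructions occupy 6 + 4 - 1 = 9 cycles in total: the first 3 cycles fill the pipeline, then one instruction completes per cycle.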
Origins of pipelining
Idea: “assembly line”
Different phases of two instructions can occur at the same time
e.g. read the operands of one instruction while the next is being fetched
More steps in RTL = potentially more stages in pipeline
MIPS Pipeline
Pipeline trade-offs
Observations:
more complexity per sub-task requires more time per cycle
conversely, as the sub-tasks become simpler the cycle time can be reduced
so to increase the clock rate instructions must be broken down into smaller sub-tasks
…but operations have a fixed complexity
smaller sub-tasks mean deeper pipelines = more stages ⇒ more instructions need to be executed to fill the pipeline
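The trade-off above can be sketched numerically: with k stages and N instructions, an ideal pipeline needs N + k - 1 cycles, and the cycle time shrinks with k but keeps a fixed per-stage latch overhead. The task delay and latch overhead values are arbitrary illustrative constants:

```python
def exec_time(n_insns, k_stages, task=10.0, latch=1.0):
    # Cycle time = sub-task delay + per-stage latch overhead (illustrative).
    cycle_time = task / k_stages + latch
    # Ideal pipeline: k - 1 cycles to fill, then one instruction per cycle.
    return (n_insns + k_stages - 1) * cycle_time

# A deep pipeline wins on a long instruction run...
print(exec_time(10_000, 20) < exec_time(10_000, 5))  # True
# ...but loses on a short burst, where the fill cost dominates.
print(exec_time(4, 20) > exec_time(4, 5))            # True
```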
Control hazards
Control hazards
Branches – in particular conditional branches – cause pipeline hazards
the outcome of a conditional branch is not known until the end of the EX stage, but is required at IF to load another instruction and keep the pipeline full
A simple solution: assume by default that the branch falls through – i.e. is not taken – then continue speculatively until the target of the branch is known
Branch not taken:
  beq:  IF ID EX WB
  next:    IF ID EX WB
Continue to fetch but stall ID until the branch outcome is known – one cycle lost.

Branch is taken:
  beq:    IF ID EX WB
  wrong:     IF (wrong target, squashed)
  target:          IF ID EX WB
Need to refetch at the new target – two cycles lost.
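The penalties above (one cycle lost when the branch falls through, two when taken) translate into an effective CPI; the branch and taken fractions below are illustrative:

```python
def effective_cpi(branch_frac, taken_frac, base_cpi=1.0,
                  not_taken_penalty=1, taken_penalty=2):
    # Average stall cycles contributed by each instruction.
    penalty = branch_frac * (taken_frac * taken_penalty
                             + (1 - taken_frac) * not_taken_penalty)
    return base_cpi + penalty

# Hypothetical mix: 20% branches, 60% of which are taken.
print(effective_cpi(0.20, 0.60))  # 1.32 instead of the ideal 1.0
```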
How to overcome
Different solutions:
Stall (wait) from decode stage when a branch is encountered
Flush: erase instructions in pipeline after the branch when/if the branch is taken
Expose the branch delay to the programmer / compiler: branch delay slots (MIPS, SPARC, PA-RISC)
Predict whether the branch is taken or not and its target: branch prediction
Eliminate branches altogether via predication (most GPUs)
Execute instructions from other threads: hardware multithreading (e.g. Niagara, cf. next lecture)
Branch delay slots
Specify in the ISA that a branch takes effect N instructions later, then let the compiler / programmer fill the empty slot

Slot left empty – 1 cycle wasted at each iteration:
  L1: lw   a, x[i]
      add  a, a, a
      sw   a, x[i]
      sub  i, i, 4
      bne  i, 0, L1
      nop            ; delay slot
  L2: ...

Slot filled – no bubble, but one extra sub at the last iteration:
  L1: lw   a, x[i]
      add  a, a, a
      sw   a, x[i]
      bne  i, 4, L1
      sub  i, i, 4   ; delay slot
  L2: ...
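The saving from filling the slot can be tallied directly: assuming 1 cycle per instruction slot and no other stalls (an idealization), the empty-slot loop spends 6 slots per iteration, the filled one 5:

```python
def cycles_with_nop(iterations):
    # lw, add, sw, sub, bne, nop: 6 slots, one wasted per iteration.
    return 6 * iterations

def cycles_filled(iterations):
    # lw, add, sw, bne, sub-in-delay-slot: 5 useful slots per iteration.
    return 5 * iterations

n = 1000
print(cycles_with_nop(n) - cycles_filled(n))  # 1000 cycles saved
```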
Grohoski’s estimate
Summary
Control hazards are situations where the pipeline is not fully utilized due to branch instructions
Branch prediction, delay slots and predication are architectural solutions to overcome control hazards
Only branch prediction is fully invisible to software
Without these solutions, software must avoid branches by inlining and loop unrolling; with a trade-off: more code means more pressure on I-Cache
Data hazards
Data hazard
Occurs when the output of one operation is the input of a subsequent operation
The hazard occurs because of the latency in the pipeline
the result (output from one instruction) is not written back to the register file until the last stage of the pipe
the operand (input of a subsequent instruction) is required at register read – some cycles prior to writeback
the longer the RR to WB delay, the more cycles there must be between the writeback of the producer instruction and the read from the consumer
Example: (diagram not transcribed)
How to overcome
Do nothing, i.e. expose the hazard to the programmer (e.g. MIPS I, DSPs, VLIW)
Stall the read stage
Bypass buses:
The operand is taken from the pipeline register and input directly to the ALU on the subsequent cycle
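The bypass decision above can be sketched as a simple check: if the producer's destination register matches one of the consumer's sources and the producer has not yet written back, the operand is forwarded from the pipeline register instead of stalling. The (dest, src1, src2) tuple encoding is an illustrative convention, not the MIPS one:

```python
def needs_forwarding(producer_dest, consumer_srcs):
    """True if the consumer must take its operand from the bypass bus."""
    return producer_dest in consumer_srcs

producer = ("r3", "r1", "r2")   # add r3, r1, r2
consumer = ("r5", "r3", "r4")   # sub r5, r3, r4 (issued the next cycle)
print(needs_forwarding(producer[0], consumer[1:]))  # True: bypass r3
```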
Structural hazards and scalar ILP
Scalar pipelines
In the simple pipeline, register-to-register operations have a wasted cycle:
a memory access is not required, but the memory stage still takes a cycle to complete
Decoupling memory access from operation execution avoids this, e.g. use an ALU plus a memory unit – this is scalar ILP
… note: either we need two write ports to the register file, or arbitration on a single port
Structural hazard - registers
A structural hazard occurs when a resource in the pipeline is required by more than one instruction
a resource may be an execution unit or a register port
Example with only one write port: lw a, addr and add b, c, d both need to write a register in the same cycle
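The write-port conflict between the lw and the add can be sketched by tracking when each instruction reaches writeback; the issue cycles and latencies below are illustrative:

```python
def writeback_cycles(issue_and_latency):
    """Cycle in which each instruction writes its result back."""
    return [issue + lat for issue, lat in issue_and_latency]

# lw issued in cycle 0 (3 cycles to writeback via the memory unit),
# add issued in cycle 1 (2 cycles via the ALU): both write back in cycle 3.
wb = writeback_cycles([(0, 3), (1, 2)])
conflict = len(wb) != len(set(wb))
print(conflict)  # True: the single write port is requested twice in cycle 3
```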
How to overcome
They result from contention ⇒ they can be removed by adding more hardware resources
register write hazard: add more write ports
execution unit: add more execution units
Example: CDC 6600 (1963) 10 units, 4 write ports, only FP div not pipelined
Note:
more resources = more cost (area, power)
Superscalar processors
Introduction / overview
Pipelining - summary
Depth of pipeline - Superpipelining
further dividing pipeline stages increases frequency
but introduces more scope for hazards
and higher frequency means more power dissipated
Branching out to different functional units - Scalar pipelining - avoids waiting for long operations to complete
instructions fetched and decoded in sequence
multiple operations executed in parallel
Concurrent issue of instructions - Superscalar ILP
multiple instructions fetched and decoded concurrently
new ordering issues and new data hazards
Scalar vs. superscalar
Scalar: in-order issue
Superscalar: concurrent issue, possibly out of order
Most “complex” general-purpose processors are superscalar
Instruction-level parallelism
The number of instructions issued per cycle is called issue parallelism / issue width
IPC, the number of instructions executed per cycle, is limited by:
the issue width
the number of true dependencies
the number of branches in relation to other instructions
the latency of operations in conjunction with dependencies
Current microprocessors: max issue width of 4-8 and around 12 functional units, yet a typical IPC of only 2-3
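Why IPC falls short of the issue width can be sketched with a greedy scheduler that issues up to `width` ready instructions per cycle, limited only by true dependencies (unit latency assumed). The tiny dependence graph is illustrative:

```python
def schedule_cycles(deps, width):
    """deps[i] = set of instructions that i depends on (unit latency)."""
    done, cycles = set(), 0
    while len(done) < len(deps):
        ready = [i for i in range(len(deps))
                 if i not in done and deps[i] <= done]
        done |= set(ready[:width])   # issue at most `width` per cycle
        cycles += 1
    return cycles

# 6 instructions forming two independent dependent chains of length 3.
deps = [set(), {0}, {1}, set(), {3}, {4}]
cycles = schedule_cycles(deps, width=4)
print(len(deps) / cycles)  # IPC = 2.0 despite an issue width of 4
```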
Challenges of superscalar execution
parallel fetch, decode and issue
100s of instructions in-flight simultaneously
out-of-order execution and sequential consistency
Exceptions and false dependencies
finding parallelism and scheduling its execution
application specific engines, e.g. SIMD & prefetching
Fundamental limitations of superscalar designs
Power dissipation is one of the major issues in obtaining performance
particularly true with constraints on embedded systems
Concurrency can be used to reduce power and get performance
reduce frequency which in turn allows a voltage reduction
does not apply to pipelined concurrency where more concurrency increases power due to shared structures
OoO not scalable – what else?
It is clear that out-of-order issue is not scalable
register file and issue logic scale as ILP³ and ILP² respectively
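The scaling claim above gives a quick back-of-the-envelope: moving from a 2-issue baseline to an 8-issue design (constants and the baseline choice are arbitrary):

```python
def relative_cost(ilp, base_ilp=2):
    """Cost of issue logic (~ILP^2) and register file (~ILP^3),
    relative to a baseline issue width (constants are illustrative)."""
    issue = (ilp / base_ilp) ** 2
    regfile = (ilp / base_ilp) ** 3
    return issue, regfile

issue, regfile = relative_cost(8)  # 4x wider than the 2-issue baseline
print(issue, regfile)  # 16.0x issue logic, 64.0x register file
```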
A number of approaches have been followed to increase the utilisation of on-chip parallelism; these include:
VLIW processors
Speculative VLIW processors - Intel’s IA64 - EPIC
multi-core and multi-threaded processors
these are collectively called explicit concurrency
Explicit concurrency as the (a?) way forward
VLIW is the simplest form of explicit concurrency: it reduces hardware complexity, hence power requirements
drawbacks are code compatibility and intolerance to cache misses
Multiple instruction streams / multi-programming:
Hardware multithreading (HMT) provides more scalability on single cores
Multi-cores are even more scalable, best results combined with HMT
However, the software stack is not ready yet: still too many ad-hoc APIs/frameworks; multi-programming still needs new programming models
the big question is - can we design general purpose multi-cores?