Computer Architecture
R. Poss
ca02 - 10 september 2015
Performance
Processor performance equations
Performance = Results / Second = (Results / Instruction) × (Instructions / Cycle) × (Cycles / Second)
Execution time = Seconds / Result = (Instructions / Result) × (Cycles / Instruction) × (Seconds / Cycle)
Performance and execution time are related - how?
Hint: think throughput vs latency, 1 result vs many
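A worked example of the equations above, as a minimal sketch; the instruction count, CPI and frequency are illustrative numbers, not taken from the slides:

```python
# Hypothetical workload: all numbers are illustrative.
instructions_per_result = 1_000_000   # instructions / result
cpi = 1.25                            # cycles / instruction
freq_hz = 2.0e9                       # cycles / second

# Execution time = Instructions/Result x Cycles/Instruction x Seconds/Cycle
exec_time = instructions_per_result * cpi * (1.0 / freq_hz)

# Performance = Results/Second; for a single result at a time,
# throughput is the reciprocal of latency.
performance = 1.0 / exec_time

print(exec_time)    # seconds per result
print(performance)  # results per second
```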
(Diagram labels: frequency (Hz), IPC, CPI.)
Processor performance
Latency: expressed as CPI = cycles per instruction; divide by frequency to obtain the absolute latency
Throughput: expressed as IPC = instructions per cycle; multiply by frequency to obtain the absolute throughput
Pipelining objective: increase IPC, and also decrease CPI. As we will see, decreasing CPI and increasing IPC are conflicting requirements
Who controls what?
Results vs instructions: software; task of programmer, compiler, instruction set
Instructions vs cycles: micro-architecture; task of processor designer, instruction set, partly compiler
Cycles vs seconds: technology; task of circuit designer, manufacturer
Performance comparison
“X has a speedup of n relative to reference Y”, “X is n times faster than Y”:
n = perf(X) / perf(Y) (in the general case) = exectime(Y) / exectime(X) (for 1 program)
NB: performance encompasses software + hardware
Can’t compare the performance of software alone without specifying which hardware is used to measure
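The single-program speedup formula above, as a minimal sketch; the two timings are hypothetical measurements:

```python
def speedup(exectime_ref, exectime_new):
    """n such that 'new is n times faster than ref' (for 1 program)."""
    return exectime_ref / exectime_new

# Hypothetical timings of the same program on machines Y (reference) and X.
t_Y = 12.0  # seconds on Y
t_X = 4.0   # seconds on X
n = speedup(t_Y, t_X)
print(n)  # X is 3.0 times faster than Y
```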
Components and vocabulary
Complexity vs critical path
Complexity: number of operations/circuits needed to produce an output from an input
Critical path: longest chain of operations from input to output
Complexity and critical path are different – think parallelism again
But: the critical path establishes a lower bound on (circuit) complexity
In component design, the critical path sets a lower bound on cycle time
Von Neumann vs Harvard
Data memory vs code memory: where are the instructions coming from?
Harvard architecture: separate memory for code and data
Von Neumann architecture: same memory for both
Most architectures nowadays use a hybrid system:
Separate instruction and data caches (I-Cache / D-Cache)
Data in the I-Cache is not automatically updated when writing to the D-Cache, giving the illusion of partially separate memories
Buffer vs memory
Buffer: one or more memory cells arranged in a FIFO
Producer-consumer interface between two or more hardware components
Memory: one or more memory cells arranged in an array
With address selection circuit to determine which cell is read from/written to in one access operation
Register vs main memory
Register file:
on-chip, close to the processor; SRAM: faster, larger/bit, small capacity
indexed using fixed offsets in instruction codes
Main memory or scratchpads in MPSoCs
off-chip or farther from the processor; DRAM: slower, smaller/bit, large capacity
indexed using variable offsets in registers or other memory cells
Both are forms of “memory” from the architect’s perspective: the word “memory” designates any component with an address / value interface.
NB: from the software/compiler/OS perspective “memory” is not registers
Processors: RISC pipelines
Instruction set architecture
ISA = instruction set + operational semantics
Instruction set = all possible instruction encodings
Described with instruction formats and decode logic
Defines how operands and operations are derived from the instruction codes
Operational semantics = “what instructions do”
Described with pseudo-code, also called Register Transfer Language (RTL)
Defines how results are produced from operands
Example instruction set: MIPS
First bits determine the op and the encoding of the rest
More regularity = simpler decode logic
Op    Format   Opx   Insn
0     R-R      0x20  add
0     R-R      0x21  addu
0     R-R      0x22  sub
0     R-R      0x23  subu
8     R-I      -     addi
9     R-I      -     addiu
0x23  R-I      -     lw
0x2B  R-I      -     sw
4     B (R-I)  -     beq
3     J        -     jal
Example encodings
© Chris Jesshope 2008-2011, Raphael Poss 2011
Pipelines
Each instruction / operation can be decomposed into sub-tasks: Op = A ; B ; C ; D
Considering an instruction stream [Op1; Op2; ...], at each cycle n we can run in parallel: A(n+3) ∥ B(n+2) ∥ C(n+1) ∥ D(n)
(Diagram: at the start of cycle n, instruction n+3 enters stage A, n+2 enters B, n+1 enters C, and n completes in D.)
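The overlap above can be sketched as follows; the four stage names and the schedule rule (stage s of instruction i runs in cycle i + s) mirror the slide, everything else is illustrative:

```python
STAGES = ["A", "B", "C", "D"]

def schedule(n_insns):
    """Map cycle -> list of (stage, instruction index) active that cycle."""
    sched = {}
    for i in range(n_insns):
        for s, stage in enumerate(STAGES):
            # Stage s of instruction i occupies cycle i + s.
            sched.setdefault(i + s, []).append((stage, i))
    return sched

sched = schedule(6)
# In cycle 3 all four stages are busy: D finishes insn 0 while A starts insn 3.
print(sched[3])
```

Note that 6 instructions occupy 6 + 4 - 1 = 9 cycles in total: the first 3 cycles fill the pipeline, then one instruction completes per cycle.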
Origins of pipelining
Idea: “assembly line”
Different phases of two instructions can occur at the same time
e.g. read the operands of one instruction while the next is being fetched
More steps in RTL = potentially more stages in pipeline
MIPS Pipeline
Pipeline trade-offs
Observations:
more complexity per sub-task requires more time per cycle
conversely, as the sub-tasks become simpler the cycle time can be reduced
so to increase the clock rate instructions must be broken down into smaller sub-tasks
…but operations have a fixed complexity
smaller sub-tasks mean deeper pipelines = more stages ⇒ more instructions need to be executed to fill the pipeline
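The trade-off above can be sketched numerically: with k stages and N instructions, an ideal pipeline needs N + k - 1 cycles, and the cycle time shrinks with k but keeps a fixed per-stage latch overhead. The task delay and latch overhead values are arbitrary illustrative constants:

```python
def exec_time(n_insns, k_stages, task=10.0, latch=1.0):
    # Cycle time = sub-task delay + per-stage latch overhead (illustrative).
    cycle_time = task / k_stages + latch
    # Ideal pipeline: k - 1 cycles to fill, then one instruction per cycle.
    return (n_insns + k_stages - 1) * cycle_time

# A deep pipeline wins on a long instruction run...
print(exec_time(10_000, 20) < exec_time(10_000, 5))  # True
# ...but loses on a short burst, where the fill cost dominates.
print(exec_time(4, 20) > exec_time(4, 5))            # True
```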
Control hazards
Control hazards
Branches – in particular conditional branches – cause pipeline hazards
the outcome of a conditional branch is not known until the end of the EX stage, but is required at IF to load another instruction and keep the pipeline full
A simple solution: assume by default that the branch falls through – i.e. is not taken – then continue speculatively until the target of the branch is known
Branch not taken:
  beq:  IF ID EX WB
  next:    IF ID EX WB
Continue to fetch but stall ID until the branch outcome is known – one cycle lost.

Branch is taken:
  beq:    IF ID EX WB
  wrong:     IF (wrong target, squashed)
  target:          IF ID EX WB
Need to refetch at the new target – two cycles lost.
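The penalties above (one cycle lost when the branch falls through, two when taken) translate into an effective CPI; the branch and taken fractions below are illustrative:

```python
def effective_cpi(branch_frac, taken_frac, base_cpi=1.0,
                  not_taken_penalty=1, taken_penalty=2):
    # Average stall cycles contributed by each instruction.
    penalty = branch_frac * (taken_frac * taken_penalty
                             + (1 - taken_frac) * not_taken_penalty)
    return base_cpi + penalty

# Hypothetical mix: 20% branches, 60% of which are taken.
print(effective_cpi(0.20, 0.60))  # 1.32 instead of the ideal 1.0
```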
How to overcome
Different solutions:
Stall (wait) from decode stage when a branch is encountered
Flush: erase instructions in pipeline after the branch when/if the branch is taken
Expose the branch delay to the programmer / compiler: branch delay slots (MIPS, SPARC, PA-RISC)
Predict whether the branch is taken or not and its target: branch prediction
Eliminate branches altogether via predication (most GPUs)
Execute instructions from other threads: hardware multithreading (e.g. Niagara, cf. next lecture)
Branch delay slots
Specify in the ISA that a branch takes effect N instructions later, then let the compiler / programmer fill the empty slot

Slot left empty – 1 cycle wasted at each iteration:
  L1: lw   a, x[i]
      add  a, a, a
      sw   a, x[i]
      sub  i, i, 4
      bne  i, 0, L1
      nop            ; delay slot
  L2: ...

Slot filled – no bubble, but one extra sub at the last iteration:
  L1: lw   a, x[i]
      add  a, a, a
      sw   a, x[i]
      bne  i, 4, L1
      sub  i, i, 4   ; delay slot
  L2: ...
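The saving from filling the slot can be tallied directly: assuming 1 cycle per instruction slot and no other stalls (an idealization), the empty-slot loop spends 6 slots per iteration, the filled one 5:

```python
def cycles_with_nop(iterations):
    # lw, add, sw, sub, bne, nop: 6 slots, one wasted per iteration.
    return 6 * iterations

def cycles_filled(iterations):
    # lw, add, sw, bne, sub-in-delay-slot: 5 useful slots per iteration.
    return 5 * iterations

n = 1000
print(cycles_with_nop(n) - cycles_filled(n))  # 1000 cycles saved
```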
Grohoski’s estimate
Summary
Control hazards are situations where the pipeline is not fully utilized due to branch instructions
Branch prediction, delay slots and predication are architectural solutions to overcome control hazards
Only branch prediction is fully invisible to software
Without these solutions, software must avoid branches by inlining and loop unrolling; with a trade-off: more code means more pressure on I-Cache
Data hazards
Data hazard
Occurs when the output of one operation is the input of a subsequent operation
The hazard occurs because of the latency in the pipeline
the result (output from one instruction) is not written back to the register file until the last stage of the pipe
the operand (input of a subsequent instruction) is required at register read – some cycles prior to writeback
the longer the RR to WB delay, the more cycles there must be between the writeback of the producer instruction and the read from the consumer
Example: (diagram not transcribed)
How to overcome
Do nothing, i.e. expose the hazard to the programmer (e.g. MIPS I, DSPs, VLIW)
Stall the read stage
Bypass buses:
The operand is taken from the pipeline register and input directly to the ALU on the subsequent cycle
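The bypass decision above can be sketched as a simple check: if the producer's destination register matches one of the consumer's sources and the producer has not yet written back, the operand is forwarded from the pipeline register instead of stalling. The (dest, src1, src2) tuple encoding is an illustrative convention, not the MIPS one:

```python
def needs_forwarding(producer_dest, consumer_srcs):
    """True if the consumer must take its operand from the bypass bus."""
    return producer_dest in consumer_srcs

producer = ("r3", "r1", "r2")   # add r3, r1, r2
consumer = ("r5", "r3", "r4")   # sub r5, r3, r4 (issued the next cycle)
print(needs_forwarding(producer[0], consumer[1:]))  # True: bypass r3
```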
Structural hazards and scalar ILP
Scalar pipelines
In the simple pipeline, register-to-register operations have a wasted cycle:
a memory access is not required, but the memory stage still takes a cycle to complete
Decoupling memory access from operation execution avoids this, e.g. use an ALU plus a memory unit – this is scalar ILP
… note: either we need two write ports to the register file, or arbitration on a single port
Structural hazard - registers
A structural hazard occurs when a resource in the pipeline is required by more than one instruction
a resource may be an execution unit or a register port
Example with only one write port: lw a, addr and add b, c, d both need to write a register in the same cycle
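The write-port conflict between the lw and the add can be sketched by tracking when each instruction reaches writeback; the issue cycles and latencies below are illustrative:

```python
def writeback_cycles(issue_and_latency):
    """Cycle in which each instruction writes its result back."""
    return [issue + lat for issue, lat in issue_and_latency]

# lw issued in cycle 0 (3 cycles to writeback via the memory unit),
# add issued in cycle 1 (2 cycles via the ALU): both write back in cycle 3.
wb = writeback_cycles([(0, 3), (1, 2)])
conflict = len(wb) != len(set(wb))
print(conflict)  # True: the single write port is requested twice in cycle 3
```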
How to overcome
They result from contention ⇒ they can be removed by adding more hardware resources
register write hazard: add more write ports
execution unit: add more execution units
Example: CDC 6600 (1963) 10 units, 4 write ports, only FP div not pipelined
Note:
more resources = more cost (area, power)
Superscalar processors
Introduction / overview
Pipelining - summary
Depth of pipeline - Superpipelining
further dividing pipeline stages increases frequency
but introduces more scope for hazards
and higher frequency means more power dissipated
Branching out to different functional units - Scalar pipelining - avoids waiting for long operations to complete
instructions fetched and decoded in sequence
multiple operations executed in parallel
Concurrent issue of instructions - Superscalar ILP
multiple instructions fetched and decoded concurrently
new ordering issues and new data hazards
Scalar vs. superscalar
Scalar: in-order issue
Superscalar: concurrent issue, possibly out of order
Most “complex” general-purpose processors are superscalar
Instruction-level parallelism
The number of instructions issued per cycle is called issue parallelism / issue width
IPC, the number of instructions executed per cycle, is limited by:
the issue width
the number of true dependencies
the number of branches in relation to other instructions
the latency of operations in conjunction with dependencies
Current microprocessors: max issue width of 4-8 and around 12 functional units, yet a typical IPC of only 2-3
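Why IPC falls short of the issue width can be sketched with a greedy scheduler that issues up to `width` ready instructions per cycle, limited only by true dependencies (unit latency assumed). The tiny dependence graph is illustrative:

```python
def schedule_cycles(deps, width):
    """deps[i] = set of instructions that i depends on (unit latency)."""
    done, cycles = set(), 0
    while len(done) < len(deps):
        ready = [i for i in range(len(deps))
                 if i not in done and deps[i] <= done]
        done |= set(ready[:width])   # issue at most `width` per cycle
        cycles += 1
    return cycles

# 6 instructions forming two independent dependent chains of length 3.
deps = [set(), {0}, {1}, set(), {3}, {4}]
cycles = schedule_cycles(deps, width=4)
print(len(deps) / cycles)  # IPC = 2.0 despite an issue width of 4
```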
Challenges of superscalar execution
parallel fetch, decode and issue
100s of instructions in-flight simultaneously
out-of-order execution and sequential consistency
Exceptions and false dependencies
finding parallelism and scheduling its execution
application specific engines, e.g. SIMD & prefetching
Fundamental limitations of superscalar designs
Power dissipation is one of the major issues in obtaining performance
particularly true with constraints on embedded systems
Concurrency can be used to reduce power and get performance
reduce frequency which in turn allows a voltage reduction
does not apply to pipelined concurrency where more concurrency increases power due to shared structures
OoO not scalable – what else?
It is clear that out-of-order issue is not scalable
register file and issue logic scale as ILP³ and ILP² respectively
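The scaling claim above gives a quick back-of-the-envelope: moving from a 2-issue baseline to an 8-issue design (constants and the baseline choice are arbitrary):

```python
def relative_cost(ilp, base_ilp=2):
    """Cost of issue logic (~ILP^2) and register file (~ILP^3),
    relative to a baseline issue width (constants are illustrative)."""
    issue = (ilp / base_ilp) ** 2
    regfile = (ilp / base_ilp) ** 3
    return issue, regfile

issue, regfile = relative_cost(8)  # 4x wider than the 2-issue baseline
print(issue, regfile)  # 16.0x issue logic, 64.0x register file
```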
A number of approaches have been followed to increase the utilisation of on-chip parallelism; these include:
VLIW processors
Speculative VLIW processors - Intel’s IA64 - EPIC
multi-core and multi-threaded processors
these are collectively called explicit concurrency
Explicit concurrency as the (a?) way forward
VLIW is the simplest form of explicit concurrency: it reduces hardware complexity, hence power requirements
drawbacks are code compatibility and intolerance to cache misses
Multiple instruction streams / multi-programming:
Hardware multithreading (HMT) provides more scalability on single cores
Multi-cores are even more scalable, best results combined with HMT
However, the software stack is not ready yet: still too many ad-hoc APIs/frameworks; multi-programming still needs new programming models
the big question is - can we design general purpose multi-cores?