ELEC 669 Low Power Design Techniques Lecture 1
2
ELEC 669: Low Power Design Techniques
Instructor: Amirali Baniasadi
EOW 441, by appointment only. Call or email with your schedule.
Email: [email protected]
Office Tel: 721-8613
Web page for this class will be at http://www.ece.uvic.ca/~amirali/courses/ELEC669/elec669.html
Will use paper reprints
Lecture notes will be posted on the course web page.
3
Course Structure
Lectures:
- 1-2 weeks on processor review
- 5 weeks on low power techniques
- 6 weeks: discussion, presentation, meetings
Reading: a paper is posted on the web for each week. Bring a 1-page review of the papers.
Presentations: each student should give two presentations in class.
4
Course Philosophy
Papers to be used as supplement for lectures (If a topic is not covered in the class, or a detail not presented in the class, that means I expect you to read on your own to learn those details)
One Project (50%)
Presentation (30%): will be announced in advance.
Final Exam: take home (20%)
IMPORTANT NOTE: Must get passing grade in all components to pass the course. Failing any of the three components will result in failing the course.
5
Project
More on project later
6
Topics
- High Performance Processors?
- Low-Power Design
- Low-Power Branch Prediction
- Low-Power Register Renaming
- Low-Power SRAMs
- Low-Power Front-End
- Low-Power Back-End
- Low-Power Issue Logic
- Low-Power Commit
- AND more…
7
A Modern Processor
[Figure: pipeline stages Fetch, Decode, Issue, Complete, Commit, grouped into Front-end and Back-end]
1- What does each do?
2- Possible power optimizations?
8
Power Breakdown
Processor      Front-end   Back-end   Rest
PentiumPro     28%         35%        37%
Alpha 21464    6%          68%        26%
9
Instruction Set Architecture (ISA)
Instruction Execution Cycle:
1. Fetch instruction from memory
2. Decode instruction: determine its size & action
3. Fetch operand data
4. Execute instruction & compute results or status
5. Store result in memory
6. Determine next instruction’s address
10
What Should we Know?
A specific ISA (MIPS)
Performance issues - vocabulary and motivation
Instruction-Level Parallelism
How to Use Pipelining to improve performance
Exploiting Instruction-Level Parallelism w/ Dynamic Approach
Memory: caches and virtual memory
11
What is Expected From You?
• Read papers!
• Be up-to-date!
• Come back with your input & questions for discussion!
12
Power?
Everything is done by tiny switches
Their charge represents logic values
Changing charge -> energy
Power = energy over time
Devices are non-ideal: power -> heat
Excess heat -> circuits break down
Need to keep power within acceptable limits
13
POWER in the real world
[Figure: power density in W/cm2, log scale from 1 to 1000]
14
Power as a Performance Limiter
Conventional Performance Scaling:
Goal: Max. performance w/ min cost/complexity
How: -More and faster transistors.
-More complex structures.
Power: Don’t fix if it ain’t broken
Not True Anymore: Power has increased rapidly
Power-Aware Architecture a Necessity
15
Power-Aware Architecture
Conventional Architecture:
Goal: Max. performance
How: Do as much as you can.
This Work: Power-Aware Architecture
Goal: Min. Power and Maintain Performance
How: Do as little as you can, while maintaining performance
Challenging and new area
16
Why is this challenging
Identify actions that can be delayed/eliminated
Don’t touch those that boost performance
Cost/Power of doing so must not outweigh the benefits
17
Definitions
Performance is in units of things-per-second: bigger is better
If we are primarily concerned with response time:
    performance(x) = 1 / execution_time(x)
"X is n times faster than Y" means:
    n = Performance(X) / Performance(Y)
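The definition above can be checked with a few lines of Python. This is a minimal sketch; the function names and the sample execution times (10 s and 15 s) are illustrative assumptions, not values from the lecture.

```python
def performance(execution_time: float) -> float:
    """Response-time view: performance is the reciprocal of execution time."""
    return 1.0 / execution_time


def times_faster(time_x: float, time_y: float) -> float:
    """Return n such that 'X is n times faster than Y'."""
    return performance(time_x) / performance(time_y)


# Hypothetical measurements: X takes 10 s, Y takes 15 s on the same task.
n = times_faster(time_x=10.0, time_y=15.0)
print(f"X is {n:.2f} times faster than Y")
```

Because performance is a reciprocal, the ratio of performances equals the inverse ratio of execution times.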
04/20/23
Amdahl's Law
Speedup due to enhancement E:
Speedup(E) = ExTime w/o E / ExTime w/ E = Performance w/ E / Performance w/o E
Suppose that enhancement E accelerates a fraction F of the task
by a factor S and the remainder of the task is unaffected. Then:
ExTime(with E) = ((1-F) + F/S) x ExTime(without E)
Speedup(with E) = ExTime(without E) / (((1-F) + F/S) x ExTime(without E))
Speedup(with E) = 1 / ((1-F) + F/S)
Amdahl's Law-example
A new CPU used for Web serving is 10 times faster on computation. The old CPU spent 40% of the time on computation and 60% on waiting for I/O. What is the overall speedup?
Fraction enhanced = 0.4
Speedup enhanced = 10
Speedup overall = 1 / (0.6 + 0.4/10) = 1 / 0.64 = 1.56
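The worked example can be reproduced in a short Python sketch. The function name `amdahl_speedup` is an illustrative choice; the inputs are the fraction (0.4) and factor (10) from the example.

```python
def amdahl_speedup(fraction_enhanced: float, speedup_enhanced: float) -> float:
    """Overall speedup when a fraction F of the task is accelerated by S."""
    return 1.0 / ((1.0 - fraction_enhanced) + fraction_enhanced / speedup_enhanced)


overall = amdahl_speedup(fraction_enhanced=0.4, speedup_enhanced=10)
print(f"Overall speedup: {overall:.4f}")

# Even an arbitrarily large S cannot beat 1 / (1 - F).
limit = 1.0 / (1.0 - 0.4)
print(f"Upper bound for F = 0.4: {limit:.4f}")
```

The second print shows why the unimproved 60% dominates: the speedup can never exceed 1/0.6 no matter how fast the computation becomes.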
Why Do Benchmarks?
• How we evaluate differences
  - Different systems
  - Changes to a single system
• Provide a target
  - Benchmarks should represent large class of important programs
  - Improving benchmark performance should help many programs
• For better or worse, benchmarks shape a field
• Good ones accelerate progress
  - good target for development
• Bad benchmarks hurt progress
  - help real programs v. sell machines/papers?
  - Inventions that help real programs don’t help benchmark
SPEC first round
First round 1989; 10 programs, single number to summarize performance
One program: 99% of time in single line of code
New front-end compiler could improve dramatically
[Figure: SPEC Perf (axis 0 to 800) per benchmark: gcc, espresso, spice, doduc, nasa7, li, eqntott, matrix300, fpppp, tomcatv]
23
SPEC95
Eighteen application benchmarks (with inputs) reflecting a technical computing workload
• Eight integer: go, m88ksim, gcc, compress, li, ijpeg, perl, vortex
• Ten floating-point intensive: tomcatv, swim, su2cor, hydro2d, mgrid, applu, turb3d, apsi, fpppp, wave5
• Must run with standard compiler flags
  - eliminate special undocumented incantations that may not even generate working code for real programs
Summary
Time is the measure of computer performance!
Remember Amdahl’s Law: Improvement is limited by unimproved part of program
CPU time = Seconds / Program = (Instructions / Program) x (Cycles / Instruction) x (Seconds / Cycle)
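The CPU time equation can be evaluated directly. The sketch below assumes hypothetical numbers (one million instructions, a CPI of 2, a 1 GHz clock); none of them come from the lecture.

```python
def cpu_time(instructions: float, cycles_per_instruction: float, clock_rate_hz: float) -> float:
    """CPU time = instructions/program x cycles/instruction x seconds/cycle."""
    seconds_per_cycle = 1.0 / clock_rate_hz
    return instructions * cycles_per_instruction * seconds_per_cycle


# Hypothetical program: 1,000,000 instructions, CPI = 2.0, 1 GHz clock.
t = cpu_time(instructions=1_000_000, cycles_per_instruction=2.0, clock_rate_hz=1e9)
print(f"CPU time: {t * 1e3:.3f} ms")
```

Halving any one of the three factors halves CPU time, which is why instruction count, CPI, and clock rate are usually discussed together.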
25
Execution Cycle
1. Instruction Fetch: Obtain instruction from program storage
2. Instruction Decode: Determine required actions and instruction size
3. Operand Fetch: Locate and obtain operand data
4. Execute: Compute result value or status
5. Result Store: Deposit results in storage for later use
6. Next Instruction: Determine successor instruction
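The six steps of the execution cycle can be sketched as a tiny interpreter loop. The accumulator machine below (opcodes LOAD, ADD, STORE, HALT and the `run` function) is a made-up toy used only to label the steps; it is not MIPS or any ISA from the course.

```python
def run(program, memory):
    """Execute a toy accumulator program; each loop pass is one execution cycle."""
    pc = 0      # program counter
    acc = 0     # single accumulator register
    while pc < len(program):
        instr = program[pc]                      # 1. instruction fetch
        op, addr = instr                         # 2. instruction decode
        operand = memory[addr] if op in ("LOAD", "ADD") else None  # 3. operand fetch
        if op == "LOAD":                         # 4. execute
            result = operand
        elif op == "ADD":
            result = acc + operand
        elif op == "STORE":
            result = acc
        elif op == "HALT":
            break
        else:
            raise ValueError(f"unknown opcode {op!r}")
        if op == "STORE":                        # 5. result store
            memory[addr] = result
        else:
            acc = result
        pc += 1                                  # 6. next instruction
    return memory


# memory[2] = memory[0] + memory[1]
program = [("LOAD", 0), ("ADD", 1), ("STORE", 2), ("HALT", 0)]
print(run(program, [2, 3, 0]))
```

A real ISA adds branches to step 6 and more operand locations to step 3, which is what the next slide asks the designer to specify.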
26
What Must be Specified?
[Figure: execution cycle: Instruction Fetch, Instruction Decode, Operand Fetch, Execute, Result Store, Next Instruction]
° Instruction Format or Encoding
– how is it decoded?
° Location of operands and result
– where other than memory?
– how many explicit operands?
– how are memory operands located?
– which can or cannot be in memory?
° Data type and Size
° Operations
– what are supported
° Successor instruction
– jumps, conditions, branches
27
What Is an ILP?
Principle: Many instructions in the code do not depend on each other
Result: Possible to execute them in parallel
ILP: Potential overlap among instructions (so they can be evaluated in parallel)
Issues:
- Building compilers to analyze the code
- Building special/smarter hardware to handle the code
ILP: Increase the amount of parallelism exploited among instructions
Seeks Good Results out of Pipelining
28
What Is ILP?
CODE A:             CODE B:
LD  R1, (R2)100     LD  R1, (R2)100
ADD R4, R1          ADD R4, R1
SUB R5, R1          SUB R5, R4
CMP R1, R2          SW  R5, (R2)100
ADD R3, R1          LD  R1, (R2)100
Code A: Possible to execute 4 instructions in parallel.
Code B: Can’t execute more than one instruction per cycle.
Code A has Higher ILP
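One way to see the difference between Code A and Code B is to compute, for each instruction, how deep it sits in the chain of read-after-write (RAW) dependences. The sketch below makes several assumptions that are not stated on the slide: two-operand instructions read and write their first operand, CMP writes a FLAGS pseudo-register, memory is modeled as a single MEM pseudo-register, and WAR/WAW hazards are ignored.

```python
from collections import Counter


def dependence_levels(instrs):
    """Level 0 has no producer in the code; level k depends on a level k-1 result."""
    last_writer_level = {}
    levels = []
    for _text, writes, reads in instrs:
        level = 0
        for name in reads:                      # RAW dependences only
            if name in last_writer_level:
                level = max(level, last_writer_level[name] + 1)
        for name in writes:
            last_writer_level[name] = level
        levels.append(level)
    return levels


def widest_group(instrs):
    """Largest number of instructions that share a level (could run together)."""
    return max(Counter(dependence_levels(instrs)).values())


# (text, registers written, registers read); operand roles are assumptions.
CODE_A = [
    ("LD  R1, (R2)100", ["R1"], ["R2", "MEM"]),
    ("ADD R4, R1", ["R4"], ["R4", "R1"]),
    ("SUB R5, R1", ["R5"], ["R5", "R1"]),
    ("CMP R1, R2", ["FLAGS"], ["R1", "R2"]),
    ("ADD R3, R1", ["R3"], ["R3", "R1"]),
]
CODE_B = [
    ("LD  R1, (R2)100", ["R1"], ["R2", "MEM"]),
    ("ADD R4, R1", ["R4"], ["R4", "R1"]),
    ("SUB R5, R4", ["R5"], ["R5", "R4"]),
    ("SW  R5, (R2)100", ["MEM"], ["R5", "R2"]),
    ("LD  R1, (R2)100", ["R1"], ["R2", "MEM"]),
]

print("Code A levels:", dependence_levels(CODE_A), "widest:", widest_group(CODE_A))
print("Code B levels:", dependence_levels(CODE_B), "widest:", widest_group(CODE_B))
```

Under these assumptions the four instructions after the load in Code A share one level, while Code B forms a single chain.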
29
Out of Order Execution
Programmer: Instructions execute in-order
Processor: Instructions may execute in any order if results remain the same at the end
[Figure: dependence graph over instructions A, B, C, D]
In-Order:
A: LD  R1, (R2)
B: ADD R3, R4
C: ADD R3, R5
D: CMP R3, R1
Out-of-Order:
B: ADD R3, R4
C: ADD R3, R5
A: LD  R1, (R2)
D: CMP R3, R1
30
Assumptions
• Five-stage integer pipeline
• Branches have delay of one clock cycle
  - ID stage: Comparisons done, decisions made and PC loaded
• No structural hazards
  - Functional units are fully pipelined or replicated (as many times as the pipeline depth)
• FP Latencies:
Source instruction   Dependent instruction   Latency (clock cycles)
FP ALU op            Another FP ALU op       3
FP ALU op            Store double            2
Load double          FP ALU op               1
Load double          Store double            0
• Integer load latency: 1; Integer ALU operation latency: 0
31
Simple Loop & Assembler Equivalent
for (i=1000; i>0; i--) x[i] = x[i] + s;
Loop: LD   F0, 0(R1)      ;F0=array element
      ADDD F4, F0, F2     ;add scalar in F2
      SD   F4, 0(R1)      ;store result
      SUBI R1, R1, #8     ;decrement pointer 8 bytes (DW)
      BNE  R1, R2, Loop   ;branch R1!=R2
• x[i] & s are double/floating point type
• R1 initially address of array element with the highest address
• F2 contains the scalar value s
• Register R2 is pre-computed so that 8(R2) is the last element to operate on
32
Where are the stalls?
Unscheduled: 10 clock cycles
Loop: LD   F0, 0(R1)
      stall
      ADDD F4, F0, F2
      stall
      stall
      SD   F4, 0(R1)
      SUBI R1, R1, #8
      stall
      BNE  R1, R2, Loop
      stall
Can we minimize?
Scheduled: 6 clock cycles (3 cycles: actual work; 3 cycles: overhead)
Loop: LD   F0, 0(R1)
      SUBI R1, R1, #8
      ADDD F4, F0, F2
      stall
      BNE  R1, R2, Loop
      SD   F4, 8(R1)
Can we minimize further?
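The cycle counts for the unscheduled and scheduled loop can be reproduced with a small in-order issue model. This is a sketch under stated assumptions: at most one instruction issues per cycle, a consumer waits `latency` extra cycles after its producer, and the one-cycle stall between an integer ALU op and a dependent branch (visible on the slide between SUBI and BNE) is encoded as an extra table entry that is not in the FP latency table. The instruction kinds and the `count_cycles` helper are illustrative names.

```python
# Stall cycles required between a producer kind and a dependent consumer kind.
LATENCY = {
    ("fp_alu", "fp_alu"): 3,
    ("fp_alu", "store"): 2,
    ("load", "fp_alu"): 1,
    ("load", "store"): 0,
    ("int_alu", "branch"): 1,  # assumption: inferred from the SUBI/BNE stall
}


def count_cycles(instrs):
    """Issue instructions in order and return the cycle in which the last one issues."""
    last_writer = {}  # register -> (issue cycle, producer kind)
    cycle = 0
    for kind, dest, srcs in instrs:
        earliest = cycle + 1
        for reg in srcs:
            if reg in last_writer:
                issued, producer_kind = last_writer[reg]
                stall = LATENCY.get((producer_kind, kind), 0)
                earliest = max(earliest, issued + stall + 1)
        cycle = earliest
        if dest is not None:
            last_writer[dest] = (cycle, kind)
    return cycle


# (kind, destination register, source registers)
UNSCHEDULED = [
    ("load", "F0", ["R1"]),           # LD   F0, 0(R1)
    ("fp_alu", "F4", ["F0", "F2"]),   # ADDD F4, F0, F2
    ("store", None, ["F4", "R1"]),    # SD   F4, 0(R1)
    ("int_alu", "R1", ["R1"]),        # SUBI R1, R1, #8
    ("branch", None, ["R1", "R2"]),   # BNE  R1, R2, Loop
    ("nop", None, []),                # unfilled branch delay slot
]
SCHEDULED = [
    ("load", "F0", ["R1"]),           # LD   F0, 0(R1)
    ("int_alu", "R1", ["R1"]),        # SUBI R1, R1, #8
    ("fp_alu", "F4", ["F0", "F2"]),   # ADDD F4, F0, F2
    ("branch", None, ["R1", "R2"]),   # BNE  R1, R2, Loop
    ("store", None, ["F4", "R1"]),    # SD   F4, 8(R1) in the delay slot
]

print("unscheduled:", count_cycles(UNSCHEDULED), "cycles")
print("scheduled:  ", count_cycles(SCHEDULED), "cycles")
```

The model places the remaining stall of the scheduled loop just before the SD rather than before the BNE as drawn on the slide, since it does not force the delay-slot instruction to follow the branch immediately; the total per-iteration count is the same.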
33
Loop Unrolling
Four copies of loop (the loop body below, repeated four times):
      LD   F0, 0(R1)
      ADDD F4, F0, F2
      SD   F4, 0(R1)
      SUBI R1, R1, #8
      BNE  R1, R2, Loop
Four iteration code:
      LD   F0, 0(R1)
      ADDD F4, F0, F2
      SD   F4, 0(R1)
      SUBI R1, R1, #8
      BNE  R1, R2, Loop
      LD   F0, -8(R1)
      ADDD F4, F0, F2
      SD   F4, -8(R1)
      SUBI R1, R1, #8
      BNE  R1, R2, Loop
      LD   F0, -16(R1)
      ADDD F4, F0, F2
      SD   F4, -16(R1)
      SUBI R1, R1, #8
      BNE  R1, R2, Loop
      LD   F0, -24(R1)
      ADDD F4, F0, F2
      SD   F4, -24(R1)
      SUBI R1, R1, #32
      BNE  R1, R2, Loop
Eliminate Incr, Branch:
Loop: LD   F0, 0(R1)
      ADDD F4, F0, F2
      SD   F4, 0(R1)
      LD   F6, -8(R1)
      ADDD F8, F6, F2
      SD   F8, -8(R1)
      LD   F10, -16(R1)
      ADDD F12, F10, F2
      SD   F12, -16(R1)
      LD   F14, -24(R1)
      ADDD F16, F14, F2
      SD   F16, -24(R1)
      SUBI R1, R1, #32
      BNE  R1, R2, Loop
Assumption: R1 is initially a multiple of 32 or number of loop iterations is a multiple of 4
34
Loop Unroll & Schedule
Unscheduled: 28 clock cycles or 7 per iteration
Loop: LD   F0, 0(R1)
      stall
      ADDD F4, F0, F2
      stall
      stall
      SD   F4, 0(R1)
      LD   F6, -8(R1)
      stall
      ADDD F8, F6, F2
      stall
      stall
      SD   F8, -8(R1)
      LD   F10, -16(R1)
      stall
      ADDD F12, F10, F2
      stall
      stall
      SD   F12, -16(R1)
      LD   F14, -24(R1)
      stall
      ADDD F16, F14, F2
      stall
      stall
      SD   F16, -24(R1)
      SUBI R1, R1, #32
      stall
      BNE  R1, R2, Loop
      stall
Can we minimize further?
Scheduled: No stalls! 14 clock cycles or 3.5 per iteration
Loop: LD   F0, 0(R1)
      LD   F6, -8(R1)
      LD   F10, -16(R1)
      LD   F14, -24(R1)
      ADDD F4, F0, F2
      ADDD F8, F6, F2
      ADDD F12, F10, F2
      ADDD F16, F14, F2
      SD   F4, 0(R1)
      SD   F8, -8(R1)
      SD   F12, -16(R1)
      SUBI R1, R1, #32
      BNE  R1, R2, Loop
      SD   F16, 8(R1)
Can we minimize further?
35
Summary
Cycles per iteration:
Original loop: 10 cycles
After scheduling: 6 cycles
After unrolling: 7 cycles
After unrolling and scheduling: 3.5 cycles (No stalls)
36
Multiple Issue
• Multiple Issue is the ability of the processor to start more than one instruction in a given cycle.
• Superscalar processors
• Very Long Instruction Word (VLIW) processors
37
A Modern Processor
[Figure: pipeline stages Fetch, Decode, Issue, Complete, Commit, grouped into Front-end and Back-end, with Multiple Issue]
38
1990’s: Superscalar Processors
Bottleneck: CPI >= 1
- Limit on scalar performance (single instruction issue)
- Hazards
- Superpipelining? Diminishing returns (hazards + overhead)
How can we make the CPI = 0.5?
- Multiple instructions in every pipeline stage (super-scalar)
Cycle   1    2    3    4    5    6    7
Inst0   IF   ID   EX   MEM  WB
Inst1   IF   ID   EX   MEM  WB
Inst2        IF   ID   EX   MEM  WB
Inst3        IF   ID   EX   MEM  WB
Inst4             IF   ID   EX   MEM  WB
Inst5             IF   ID   EX   MEM  WB
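The pipeline diagram suggests a simple cycle count for an ideal (hazard-free) pipeline: the first group needs one cycle per stage, and each further group adds one cycle. The sketch below assumes no stalls of any kind; `ideal_cycles` is an illustrative helper, not a formula stated in the lecture.

```python
import math


def ideal_cycles(n_instructions: int, pipeline_depth: int = 5, issue_width: int = 1) -> int:
    """Cycles to finish n instructions on a stall-free pipeline."""
    groups = math.ceil(n_instructions / issue_width)  # instructions issued together
    return pipeline_depth + groups - 1


# Six instructions on a five-stage pipeline, as in the diagram.
print("scalar:    ", ideal_cycles(6, issue_width=1), "cycles")
print("two-issue: ", ideal_cycles(6, issue_width=2), "cycles")

# For long instruction streams the two-issue CPI approaches 0.5.
n = 1_000_000
print("two-issue CPI:", ideal_cycles(n, issue_width=2) / n)
```

The pipeline fill time is amortized over long runs, so the ideal CPI tends to 1 divided by the issue width.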
39
Elements of Advanced Superscalars
• High performance instruction fetching
  - Good dynamic branch and jump prediction
  - Multiple instructions per cycle, multiple branches per cycle?
• Scheduling and hazard elimination
  - Dynamic scheduling (not necessarily: Alpha 21064 & Pentium were statically scheduled)
  - Register renaming to eliminate WAR and WAW
• Parallel functional units, paths/buses/multiple register ports
• High performance memory systems
• Speculative execution
40
SS + DS + Speculation
Superscalar + Dynamic scheduling + Speculation
Three great tastes that taste great together
• CPI >= 1?
  - Overcome with superscalar
• Superscalar increases hazards
  - Overcome with dynamic scheduling
• RAW dependences still a problem?
  - Overcome with a large window
• Branches a problem for filling large window?
  - Overcome with speculation
41
The Big Picture
[Figure: static program; fetch & branch predict; issue; execution; reorder & commit]
42
Superscalar Microarchitecture
[Block diagram labels: Pre-decode; Inst. cache; Inst. buffer; Decode, rename, dispatch; Integer/address inst. buffer; Floating point inst. buffer; Integer register file; Floating point register file; Functional units; Functional units and data cache; Memory interface; Reorder and commit]
43
Register renaming methods
First Method:
• Physical register file vs. logical (architectural) register file.
• Mapping table used to associate physical reg w/ current value of log. reg
• Use a free list of physical registers
• Physical register file bigger than log register file
Second Method:
• Physical register file same size as logical
• Also, use a buffer w/ one entry per inst.: the reorder buffer.
44
Register Renaming Example
Unscheduled: 28 clock cycles or 7 per iteration
Loop: LD   F0, 0(R1)
      stall
      ADDD F4, F0, F2
      stall
      stall
      SD   F4, 0(R1)
      LD   F6, -8(R1)
      stall
      ADDD F8, F6, F2
      stall
      stall
      SD   F8, -8(R1)
      LD   F10, -16(R1)
      stall
      ADDD F12, F10, F2
      stall
      stall
      SD   F12, -16(R1)
      LD   F14, -24(R1)
      stall
      ADDD F16, F14, F2
      stall
      stall
      SD   F16, -24(R1)
      SUBI R1, R1, #32
      stall
      BNE  R1, R2, Loop
      stall
Can we minimize further?
Scheduled: No stalls! 14 clock cycles or 3.5 per iteration
Loop: LD   F0, 0(R1)
      LD   F6, -8(R1)
      LD   F10, -16(R1)
      LD   F14, -24(R1)
      ADDD F4, F0, F2
      ADDD F8, F6, F2
      ADDD F12, F10, F2
      ADDD F16, F14, F2
      SD   F4, 0(R1)
      SD   F8, -8(R1)
      SD   F12, -16(R1)
      SUBI R1, R1, #32
      BNE  R1, R2, Loop
      SD   F16, 8(R1)
Can we minimize further?
45
Register renaming: first method
Instruction: Add r3,r3,4
Before:
Mapping table: r0 -> R8, r1 -> R7, r2 -> R5, r3 -> R1, r4 -> R9
Free List: R2 R6 R13
After:
Mapping table: r0 -> R8, r1 -> R7, r2 -> R5, r3 -> R2, r4 -> R9
Free List: R6 R13
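The first renaming method can be sketched with a dictionary for the mapping table and a queue for the free list. The `Renamer` class and its method names are illustrative assumptions, and the initial mapping (r3 held in R1, free list R2, R6, R13) is this sketch's reading of the figure. Returning old physical registers to the free list is left out.

```python
from collections import deque


class Renamer:
    """Mapping table plus free list of physical registers (sketch only)."""

    def __init__(self, mapping, free_list):
        self.mapping = dict(mapping)      # logical register -> physical register
        self.free = deque(free_list)      # unused physical registers

    def rename(self, dest, srcs):
        """Rename one instruction; return (physical dest, physical sources)."""
        # Sources are looked up first so they see the mapping before this write.
        phys_srcs = [self.mapping[s] for s in srcs]
        new_dest = self.free.popleft()    # take a fresh physical register
        self.mapping[dest] = new_dest
        return new_dest, phys_srcs


renamer = Renamer(
    {"r0": "R8", "r1": "R7", "r2": "R5", "r3": "R1", "r4": "R9"},
    ["R2", "R6", "R13"],
)
dest, srcs = renamer.rename("r3", ["r3"])   # Add r3,r3,4
print(f"add {dest}, {srcs[0]}, 4")
print("mapping:", renamer.mapping)
print("free list:", list(renamer.free))
```

Because every write gets a fresh physical register, later writers of r3 no longer conflict with earlier readers, which is how renaming removes WAR and WAW hazards.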
46
Superscalar Processors
• Issues varying number of instructions per clock
• Scheduling: Static (by the compiler) or dynamic(by the hardware)
• Superscalar has a varying number of instructions/cycle (1 to 8), scheduled by compiler or by HW (Tomasulo).
• IBM PowerPC, Sun UltraSparc, DEC Alpha, HP 8000
47
More Realistic HW: Register Impact
Effect of limiting the number of renaming registers
Instruction issues per cycle, by program and number of renaming registers:
Renaming registers   gcc   espresso   li   fpppp   doduc   tomcatv
Infinite             11    15         12   59      29      54
256                  10    15         12   49      16      45
128                  10    13         12   35      15      44
64                   9     10         11   20      11      28
32                   5     5          6    5       5       7
None                 4     4          5    4       5       5
IPC: Integer: 5 - 15; FP: 11 - 45
48
Reorder Buffer
Place data in entry when execution finished
Reserve entry at tail when dispatched
Remove from head when complete
Bypass to other instructions when needed
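The four reorder buffer operations listed above can be sketched as a FIFO of entries. The class below is an illustrative model only: the entry fields (`tag`, `value`, `done`) and method names are assumptions, and real hardware uses a circular buffer with head and tail pointers rather than a Python deque.

```python
from collections import deque


class ReorderBuffer:
    """In-order allocation and retirement with out-of-order completion (sketch)."""

    def __init__(self, size):
        self.size = size
        self.entries = deque()            # head is entries[0], tail is entries[-1]

    def dispatch(self, tag):
        """Reserve an entry at the tail when an instruction is dispatched."""
        if len(self.entries) >= self.size:
            raise RuntimeError("reorder buffer full")
        entry = {"tag": tag, "value": None, "done": False}
        self.entries.append(entry)
        return entry

    def finish(self, entry, value):
        """Place data in the entry when execution finishes."""
        entry["value"] = value
        entry["done"] = True

    def bypass(self, tag):
        """Forward the newest value for tag, or None if absent or not yet finished."""
        for entry in reversed(self.entries):
            if entry["tag"] == tag:
                return entry["value"] if entry["done"] else None
        return None

    def commit(self):
        """Remove finished entries from the head, preserving program order."""
        committed = []
        while self.entries and self.entries[0]["done"]:
            entry = self.entries.popleft()
            committed.append((entry["tag"], entry["value"]))
        return committed


rob = ReorderBuffer(size=4)
first = rob.dispatch("r1")
second = rob.dispatch("r3")
rob.finish(second, 42)        # younger instruction finishes first
print(rob.commit())           # nothing retires: the head is not finished
rob.finish(first, 7)
print(rob.commit())           # both retire, in program order
```

The head-only removal is what keeps the architectural state in program order even when execution finishes out of order.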
49
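register renaming: reorder buffer
Instruction: add r3,r3,4
Before:
Mapping table: r0 -> R8, r1 -> R7, r2 -> R5, r3 -> rob6, r4 -> R9
After:
Mapping table: r0 -> R8, r1 -> R7, r2 -> R5, r3 -> rob8, r4 -> R9
Renamed instruction: add r3, rob6, 4 (result placed in rob8)
[Figure: reorder buffer entries numbered 8 7 6 ... 0, with r3 held in entry 6 before the add and in entries 6 and 8 after it]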
50
Instruction Buffers
[Block diagram labels: Pre-decode; Inst. cache; Inst. buffer; Decode, rename, dispatch; Integer/address inst. buffer; Floating point inst. buffer; Integer register file; Floating point register file; Functional units; Functional units and data cache; Memory interface; Reorder and commit]
51
Issue Buffer Organization
a) Single, shared queue
- No out-of-order
- No renaming
b) Multiple queues; one per inst. type
- No out-of-order inside queues
- Queues issue out of order
52
Issue Buffer Organization
c) Multiple reservation stations (one per instruction type or big pool)
- No FIFO ordering
- Ready operands, hardware available -> execution starts
- Proposed by Tomasulo
[Figure label: From Instruction Dispatch]
53
Typical reservation station
Operation | source 1 | data 1 | valid 1 | source 2 | data 2 | valid 2 | destination
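The reservation station fields map naturally onto a small record type. The sketch below is an assumption-laden illustration: the `capture` and `ready` methods, and the idea that a source is named by the register tag whose result it waits for, are not spelled out on the slide.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ReservationStation:
    """One entry with the fields listed on the slide."""
    operation: str
    source1: str                  # tag of the producer of operand 1
    source2: str                  # tag of the producer of operand 2
    destination: str
    data1: Optional[int] = None
    valid1: bool = False
    data2: Optional[int] = None
    valid2: bool = False

    def capture(self, tag: str, value: int) -> None:
        """Latch a broadcast result into any operand still waiting on this tag."""
        if not self.valid1 and self.source1 == tag:
            self.data1, self.valid1 = value, True
        if not self.valid2 and self.source2 == tag:
            self.data2, self.valid2 = value, True

    def ready(self) -> bool:
        """The entry may start executing once both operands are valid."""
        return self.valid1 and self.valid2


rs = ReservationStation("ADDD", source1="F0", source2="F2", destination="F4")
rs.capture("F2", 3)
print(rs.ready())     # one operand still missing
rs.capture("F0", 10)
print(rs.ready())     # both operands valid
```

Since readiness depends only on the valid bits, entries can start in any order, which matches the "no FIFO ordering" point on the previous slide.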
54
Memory Hazard Detection Logic
[Block diagram labels: Instruction issue; Address add & translation; Load address buffer (loads); Store address buffer (stores); Address compare; Hazard Control; To memory]
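The address compare step in the diagram can be sketched as a check of a load address against the addresses of older, still-pending stores. The policy below is one conservative assumption, not necessarily the one in the lecture: a load waits if any older store has a matching address or an address that is not yet known (represented here as `None`).

```python
from typing import Optional, Sequence


def load_may_issue(load_addr: int, older_store_addrs: Sequence[Optional[int]]) -> bool:
    """Return True if the load has no possible conflict with older pending stores."""
    for store_addr in older_store_addrs:
        if store_addr is None:          # store address not computed yet: be conservative
            return False
        if store_addr == load_addr:     # same location: the load must wait for the store
            return False
    return True


# Hypothetical store address buffer contents.
print(load_may_issue(0x100, [0x200, 0x300]))   # no match
print(load_may_issue(0x100, [0x200, 0x100]))   # matching store address
print(load_may_issue(0x100, [None]))           # unknown store address
```

More aggressive designs forward the store data to a matching load or speculate past unknown addresses; this sketch only models the hazard detection itself.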