ELEC 669 Low Power Design Techniques Lecture 1

ELEC 669 Low Power Design Techniques Lecture 1 Amirali Baniasadi [email protected]


Page 1: ELEC 669 Low Power Design Techniques Lecture 1

ELEC 669: Low Power Design Techniques

Lecture 1

Amirali Baniasadi

[email protected]

Page 2: ELEC 669 Low Power Design Techniques Lecture 1

ELEC 669: Low Power Design Techniques

Instructor: Amirali Baniasadi, EOW 441. Only by appointment: call or email with your schedule.
Email: [email protected]
Office Tel: 721-8613
Web page for this class: http://www.ece.uvic.ca/~amirali/courses/ELEC669/elec669.html

Will use paper reprints

Lecture notes will be posted on the course web page.

Page 3: ELEC 669 Low Power Design Techniques Lecture 1

Course Structure

Lectures:
• 1-2 weeks on processor review
• 5 weeks on low power techniques
• 6 weeks: discussion, presentation, meetings

Reading: a paper is posted on the web for each week. Bring a one-page review of the papers.

Presentations: each student should give two presentations in class.

Page 4: ELEC 669 Low Power Design Techniques Lecture 1

Course Philosophy

Papers to be used as supplement for lectures (If a topic is not covered in the class, or a detail not presented in the class, that means I expect you to read on your own to learn those details)

Grading:
• One project (50%)
• Presentation (30%): will be announced in advance.
• Final exam: take-home (20%)

IMPORTANT NOTE: Must get passing grade in all components to pass the course. Failing any of the three components will result in failing the course.

Page 5: ELEC 669 Low Power Design Techniques Lecture 1


Project

More on project later

Page 6: ELEC 669 Low Power Design Techniques Lecture 1

Topics

• High-Performance Processors?
• Low-Power Design
• Low-Power Branch Prediction
• Low-Power Register Renaming
• Low-Power SRAMs
• Low-Power Front-End
• Low-Power Back-End
• Low-Power Issue Logic
• Low-Power Commit
• And more…

Page 7: ELEC 669 Low Power Design Techniques Lecture 1

A Modern Processor

[Figure: pipeline stages Fetch → Decode → Issue → Complete → Commit, grouped into front-end and back-end]

1. What does each stage do?
2. Possible power optimizations?

Page 8: ELEC 669 Low Power Design Techniques Lecture 1

Power Breakdown

Processor      Front-end   Back-end   Rest
PentiumPro     28%         35%        37%
Alpha 21464    6%          68%        26%

Page 9: ELEC 669 Low Power Design Techniques Lecture 1

Instruction Set Architecture (ISA)

Instruction Execution Cycle:
1. Fetch instruction from memory
2. Decode instruction: determine its size & action
3. Fetch operand data
4. Execute instruction & compute results or status
5. Store result in memory
6. Determine next instruction's address

Page 10: ELEC 669 Low Power Design Techniques Lecture 1


What Should we Know?

A specific ISA (MIPS)

Performance issues - vocabulary and motivation

Instruction-Level Parallelism

How to Use Pipelining to improve performance

Exploiting Instruction-Level Parallelism w/ Dynamic Approach

Memory: caches and virtual memory

Page 11: ELEC 669 Low Power Design Techniques Lecture 1

What is Expected From You?

• Read papers!
• Be up-to-date!
• Come back with your input & questions for discussion!

Page 12: ELEC 669 Low Power Design Techniques Lecture 1

Power?

Everything is done by tiny switches
• Their charge represents logic values
• Changing charge → energy
• Power = energy over time
• Devices are non-ideal: power → heat
• Excess heat → circuits break down

Need to keep power within acceptable limits

Page 13: ELEC 669 Low Power Design Techniques Lecture 1

POWER in the real world

[Figure: power density in W/cm², log-scale axis from 1 to 1000]

Page 14: ELEC 669 Low Power Design Techniques Lecture 1

Power as a Performance Limiter

Conventional Performance Scaling:

Goal: Max. performance w/ min. cost/complexity

How:
– More and faster transistors.
– More complex structures.

Power: Don't fix it if it ain't broken

Not True Anymore: Power has increased rapidly

Power-Aware Architecture a Necessity

Speaker note: Dealing with power was viewed as an additional complexity. At the end, make the point that power-aware architecture is one approach; others, especially at the circuit level, are also necessary and probably more important.
Page 15: ELEC 669 Low Power Design Techniques Lecture 1

Power-Aware Architecture

Conventional Architecture:

Goal: Max. performance

How: Do as much as you can.

This Work: Power-Aware Architecture

Goal: Min. power and maintain performance

How: Do as little as you can, while maintaining performance

Challenging and new area
Page 16: ELEC 669 Low Power Design Techniques Lecture 1

Why is this challenging?

Identify actions that can be delayed/eliminated

Don't touch those that boost performance

Cost/power of doing so must not outweigh the benefits

Page 17: ELEC 669 Low Power Design Techniques Lecture 1

Definitions

Performance is in units of things-per-second: bigger is better.

If we are primarily concerned with response time:

$$\text{performance}(x) = \frac{1}{\text{execution\_time}(x)}$$

"X is n times faster than Y" means:

$$n = \frac{\text{performance}(X)}{\text{performance}(Y)}$$

Page 18: ELEC 669 Low Power Design Techniques Lecture 1

Amdahl's Law

Speedup due to enhancement E:

$$\text{Speedup}(E) = \frac{\text{ExTime}_{\text{without } E}}{\text{ExTime}_{\text{with } E}} = \frac{\text{Performance}_{\text{with } E}}{\text{Performance}_{\text{without } E}}$$

Suppose that enhancement E accelerates a fraction F of the task by a factor S, and the remainder of the task is unaffected. Then:

$$\text{ExTime}_{\text{with } E} = \left((1-F) + \frac{F}{S}\right) \times \text{ExTime}_{\text{without } E}$$

$$\text{Speedup}(E) = \frac{\text{ExTime}_{\text{without } E}}{\left((1-F) + \frac{F}{S}\right) \times \text{ExTime}_{\text{without } E}} = \frac{1}{(1-F) + \frac{F}{S}}$$

Page 19: ELEC 669 Low Power Design Techniques Lecture 1

Amdahl's Law: example

A new CPU is 10 times faster on computation in a Web-serving workload. The old CPU spent 40% of the time on computation and 60% waiting for I/O. What is the overall speedup?

Fraction enhanced: F = 0.4

Speedup enhanced: S = 10

$$\text{Speedup}_{\text{overall}} = \frac{1}{0.6 + \frac{0.4}{10}} = \frac{1}{0.64} \approx 1.56$$
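The arithmetic of this example can be checked with a few lines of Python. This is only a sketch: the function name `amdahl_speedup` and its parameter names are illustrative choices, not something defined in the lecture.

```python
def amdahl_speedup(fraction_enhanced: float, speedup_enhanced: float) -> float:
    """Overall speedup when a fraction F of the task is accelerated by a factor S."""
    if not 0.0 <= fraction_enhanced <= 1.0:
        raise ValueError("fraction_enhanced must be between 0 and 1")
    if speedup_enhanced <= 0:
        raise ValueError("speedup_enhanced must be positive")
    # Speedup = 1 / ((1 - F) + F / S)
    return 1.0 / ((1.0 - fraction_enhanced) + fraction_enhanced / speedup_enhanced)


# Web-serving example: 40% of the time is computation, made 10 times faster.
overall = amdahl_speedup(0.4, 10)
print(f"overall speedup = {overall:.2f}")  # prints "overall speedup = 1.56"

# Upper bound: even an infinitely fast CPU cannot beat 1 / (1 - F).
limit = 1.0 / (1.0 - 0.4)
print(f"limit with infinite CPU speedup = {limit:.2f}")
```

The second print shows why the unimproved 60% dominates: no CPU improvement alone can push the overall speedup past roughly 1.67.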

Page 20: ELEC 669 Low Power Design Techniques Lecture 1

Why Do Benchmarks?

• How we evaluate differences
  – Different systems
  – Changes to a single system
• Provide a target
  – Benchmarks should represent a large class of important programs
  – Improving benchmark performance should help many programs
• For better or worse, benchmarks shape a field
• Good ones accelerate progress
  – Good target for development
• Bad benchmarks hurt progress
  – Help real programs vs. sell machines/papers?
  – Inventions that help real programs don't help the benchmark

Page 21: ELEC 669 Low Power Design Techniques Lecture 1

SPEC first round

• First round 1989; 10 programs, single number to summarize performance
• One program: 99% of time in a single line of code
• New front-end compiler could improve dramatically

[Figure: bar chart of SPEC performance ratio (axis 0 to 800) per benchmark: gcc, espresso, spice, doduc, nasa7, li, eqntott, matrix300, fpppp, tomcatv]

Page 22: ELEC 669 Low Power Design Techniques Lecture 1

SPEC95

• Eighteen application benchmarks (with inputs) reflecting a technical computing workload
• Eight integer: go, m88ksim, gcc, compress, li, ijpeg, perl, vortex
• Ten floating-point intensive: tomcatv, swim, su2cor, hydro2d, mgrid, applu, turb3d, apsi, fpppp, wave5
• Must run with standard compiler flags
  – Eliminates special undocumented incantations that may not even generate working code for real programs

Page 23: ELEC 669 Low Power Design Techniques Lecture 1

Summary

• Time is the measure of computer performance!
• Remember Amdahl's Law: improvement is limited by the unimproved part of the program

$$\text{CPU time} = \frac{\text{Seconds}}{\text{Program}} = \frac{\text{Instructions}}{\text{Program}} \times \frac{\text{Cycles}}{\text{Instruction}} \times \frac{\text{Seconds}}{\text{Cycle}}$$
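As a quick illustration of the CPU time equation, the sketch below multiplies the three factors for two made-up configurations. The instruction count, CPI values, and clock rate are hypothetical numbers chosen for easy arithmetic, not measurements from the lecture.

```python
def cpu_time_seconds(instruction_count, cycles_per_instruction, clock_rate_hz):
    """CPU time = instructions/program x cycles/instruction x seconds/cycle."""
    seconds_per_cycle = 1.0 / clock_rate_hz
    return instruction_count * cycles_per_instruction * seconds_per_cycle


# Hypothetical program: 2 billion instructions on a 1 GHz clock.
baseline = cpu_time_seconds(2_000_000_000, 1.5, 1_000_000_000)  # CPI of 1.5
improved = cpu_time_seconds(2_000_000_000, 1.2, 1_000_000_000)  # CPI of 1.2

# "X is n times faster than Y" compares performance, i.e. inverse execution time.
n = baseline / improved
print(f"baseline = {baseline:.2f} s, improved = {improved:.2f} s, n = {n:.2f}")
```

Lowering any one factor (instruction count, CPI, or cycle time) reduces CPU time only if the other two do not grow to compensate.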

Page 24: ELEC 669 Low Power Design Techniques Lecture 1

Execution Cycle

Stage                 Action
Instruction Fetch     Obtain instruction from program storage
Instruction Decode    Determine required actions and instruction size
Operand Fetch         Locate and obtain operand data
Execute               Compute result value or status
Result Store          Deposit results in storage for later use
Next Instruction      Determine successor instruction

Page 25: ELEC 669 Low Power Design Techniques Lecture 1

What Must be Specified?

[Figure: execution cycle stages: Instruction Fetch, Instruction Decode, Operand Fetch, Execute, Result Store, Next Instruction]

° Instruction Format or Encoding

– how is it decoded?

° Location of operands and result

– where other than memory?

– how many explicit operands?

– how are memory operands located?

– which can or cannot be in memory?

° Data type and Size

° Operations

– what are supported

° Successor instruction

– jumps, conditions, branches

Page 26: ELEC 669 Low Power Design Techniques Lecture 1

What Is ILP?

Principle: Many instructions in the code do not depend on each other

Result: Possible to execute them in parallel

ILP: Potential overlap among instructions (so they can be evaluated in parallel)

Issues:
• Building compilers to analyze the code
• Building special/smarter hardware to handle the code

ILP: Increase the amount of parallelism exploited among instructions

Seeks good results out of pipelining

Page 27: ELEC 669 Low Power Design Techniques Lecture 1

What Is ILP?

CODE A:              CODE B:
LD  R1, (R2)100      LD  R1, (R2)100
ADD R4, R1           ADD R4, R1
SUB R5, R1           SUB R5, R4
CMP R1, R2           SW  R5, (R2)100
ADD R3, R1           LD  R1, (R2)100

Code A: Possible to execute 4 instructions in parallel.
Code B: Can't execute more than one instruction per cycle.

Code A has higher ILP
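The difference between Code A and Code B can be made concrete by computing, for each instruction, the earliest step at which it could run if only true (read-after-write) dependences matter. The sketch below is an illustration under stated assumptions: two-operand instructions such as `ADD R4, R1` are taken to mean R4 = R4 + R1, `CMP` is assumed to write a `FLAGS` register, and all of memory is modelled as a single location `MEM`. The helper name `dependence_levels` is hypothetical.

```python
from collections import Counter


def dependence_levels(instructions):
    """Earliest step at which each instruction can execute, considering only
    read-after-write dependences. Each instruction is a (reads, writes) pair."""
    level_of_last_write = {}
    levels = []
    for reads, writes in instructions:
        # One step after the latest producer of anything this instruction reads.
        level = 1 + max((level_of_last_write.get(name, 0) for name in reads), default=0)
        for name in writes:
            level_of_last_write[name] = level
        levels.append(level)
    return levels


CODE_A = [
    ({"R2", "MEM"}, {"R1"}),     # LD  R1, (R2)100
    ({"R4", "R1"}, {"R4"}),      # ADD R4, R1
    ({"R5", "R1"}, {"R5"}),      # SUB R5, R1
    ({"R1", "R2"}, {"FLAGS"}),   # CMP R1, R2
    ({"R3", "R1"}, {"R3"}),      # ADD R3, R1
]
CODE_B = [
    ({"R2", "MEM"}, {"R1"}),     # LD  R1, (R2)100
    ({"R4", "R1"}, {"R4"}),      # ADD R4, R1
    ({"R5", "R4"}, {"R5"}),      # SUB R5, R4
    ({"R5", "R2"}, {"MEM"}),     # SW  R5, (R2)100
    ({"R2", "MEM"}, {"R1"}),     # LD  R1, (R2)100 (reads what SW just stored)
]

levels_a = dependence_levels(CODE_A)
levels_b = dependence_levels(CODE_B)
width_a = max(Counter(levels_a).values())  # most instructions sharing one step
width_b = max(Counter(levels_b).values())
print(levels_a, width_a)  # [1, 2, 2, 2, 2] 4
print(levels_b, width_b)  # [1, 2, 3, 4, 5] 1
```

In Code A everything after the load depends only on R1, so four instructions share a step; in Code B each instruction consumes the result of the one before it.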

Page 28: ELEC 669 Low Power Design Techniques Lecture 1

Out of Order Execution

Programmer: Instructions execute in order

Processor: Instructions may execute in any order if results remain the same at the end

[Figure: dependence graph over instructions A, B, C, D]

In-Order:
A: LD  R1, (R2)
B: ADD R3, R4
C: ADD R3, R5
D: CMP R3, R1

Out-of-Order:
B: ADD R3, R4
C: ADD R3, R5
A: LD  R1, (R2)
D: CMP R3, R1

Speaker note: this slide is about ordering only. Goal: I can execute instructions in any order I like so long as I produce the right result at the end.
Page 29: ELEC 669 Low Power Design Techniques Lecture 1

Assumptions

• Five-stage integer pipeline
• Branches have a delay of one clock cycle
  – ID stage: comparisons done, decisions made and PC loaded
• No structural hazards
  – Functional units are fully pipelined or replicated (as many times as the pipeline depth)
• FP latencies:

Source instruction   Dependent instruction   Latency (clock cycles)
FP ALU op            Another FP ALU op       3
FP ALU op            Store double            2
Load double          FP ALU op               1
Load double          Store double            0

• Integer load latency: 1; integer ALU operation latency: 0

Page 30: ELEC 669 Low Power Design Techniques Lecture 1

Simple Loop & Assembler Equivalent

for (i=1000; i>0; i--)
    x[i] = x[i] + s;

Loop: LD   F0, 0(R1)     ; F0 = array element
      ADDD F4, F0, F2    ; add scalar in F2
      SD   F4, 0(R1)     ; store result
      SUBI R1, R1, #8    ; decrement pointer by 8 bytes (DW)
      BNE  R1, R2, Loop  ; branch if R1 != R2

• x[i] & s are double/floating point type
• R1 initially holds the address of the array element with the highest address
• F2 contains the scalar value s
• Register R2 is pre-computed so that 8(R2) is the last element to operate on

Page 31: ELEC 669 Low Power Design Techniques Lecture 1

Where are the stalls?

Unscheduled:
Loop: LD   F0, 0(R1)
      stall
      ADDD F4, F0, F2
      stall
      stall
      SD   F4, 0(R1)
      SUBI R1, R1, #8
      stall
      BNE  R1, R2, Loop
      stall

10 clock cycles. Can we minimize?

Scheduled:
Loop: LD   F0, 0(R1)
      SUBI R1, R1, #8
      ADDD F4, F0, F2
      stall
      BNE  R1, R2, Loop
      SD   F4, 8(R1)

6 clock cycles: 3 cycles of actual work, 3 cycles of overhead. Can we minimize further?
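The 10-cycle and 6-cycle counts can be reproduced with a small single-issue, in-order timing model. This is a sketch under assumptions: the FP latencies are the ones from the Assumptions slide, while the one-cycle gap between an integer ALU op and a dependent branch is inferred from the stall shown between SUBI and BNE. The names `LATENCY` and `loop_cycles` are illustrative.

```python
# Stall cycles required between a producer and a dependent consumer.
LATENCY = {
    ("load", "fpalu"): 1,
    ("load", "store"): 0,
    ("fpalu", "fpalu"): 3,
    ("fpalu", "store"): 2,
    ("intalu", "branch"): 1,  # assumption inferred from the SUBI/BNE stall
}


def loop_cycles(instructions):
    """Cycles for one pass through a loop body on a single-issue in-order
    pipeline with one branch delay slot.
    Each instruction is (kind, destination register or None, source registers)."""
    produced = {}  # register -> (producer kind, issue cycle)
    cycle = 0
    for kind, dest, sources in instructions:
        earliest = cycle + 1
        for reg in sources:
            if reg in produced:
                producer_kind, issued = produced[reg]
                stall = LATENCY.get((producer_kind, kind), 0)
                earliest = max(earliest, issued + 1 + stall)
        cycle = earliest
        if dest is not None:
            produced[dest] = (kind, cycle)
    if instructions[-1][0] == "branch":
        cycle += 1  # nothing fills the branch delay slot
    return cycle


unscheduled = [
    ("load", "F0", ["R1"]),          # LD   F0, 0(R1)
    ("fpalu", "F4", ["F0", "F2"]),   # ADDD F4, F0, F2
    ("store", None, ["F4", "R1"]),   # SD   F4, 0(R1)
    ("intalu", "R1", ["R1"]),        # SUBI R1, R1, #8
    ("branch", None, ["R1", "R2"]),  # BNE  R1, R2, Loop
]
scheduled = [
    ("load", "F0", ["R1"]),          # LD   F0, 0(R1)
    ("intalu", "R1", ["R1"]),        # SUBI R1, R1, #8
    ("fpalu", "F4", ["F0", "F2"]),   # ADDD F4, F0, F2
    ("branch", None, ["R1", "R2"]),  # BNE  R1, R2, Loop
    ("store", None, ["F4", "R1"]),   # SD   F4, 8(R1), in the delay slot
]

print(loop_cycles(unscheduled))  # 10
print(loop_cycles(scheduled))    # 6
```

In the scheduled case the model places the remaining stall just before the store rather than before the branch; the total is the same 6 cycles.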

Page 32: ELEC 669 Low Power Design Techniques Lecture 1

Loop Unrolling

Four copies of loop (the same body repeated four times):

      LD   F0, 0(R1)
      ADDD F4, F0, F2
      SD   F4, 0(R1)
      SUBI R1, R1, #8
      BNE  R1, R2, Loop

Four iteration code (offsets adjusted to 0, -8, -16, -24; the first three SUBI/BNE pairs are the ones to eliminate):

      LD F0, 0(R1)     ADDD F4, F0, F2   SD F4, 0(R1)     SUBI R1, R1, #8    BNE R1, R2, Loop
      LD F0, -8(R1)    ADDD F4, F0, F2   SD F4, -8(R1)    SUBI R1, R1, #8    BNE R1, R2, Loop
      LD F0, -16(R1)   ADDD F4, F0, F2   SD F4, -16(R1)   SUBI R1, R1, #8    BNE R1, R2, Loop
      LD F0, -24(R1)   ADDD F4, F0, F2   SD F4, -24(R1)   SUBI R1, R1, #32   BNE R1, R2, Loop

Eliminate Incr, Branch:

Loop: LD   F0, 0(R1)
      ADDD F4, F0, F2
      SD   F4, 0(R1)
      LD   F6, -8(R1)
      ADDD F8, F6, F2
      SD   F8, -8(R1)
      LD   F10, -16(R1)
      ADDD F12, F10, F2
      SD   F12, -16(R1)
      LD   F14, -24(R1)
      ADDD F16, F14, F2
      SD   F16, -24(R1)
      SUBI R1, R1, #32
      BNE  R1, R2, Loop

Assumption: R1 is initially a multiple of 32, or the number of loop iterations is a multiple of 4
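The same transformation can be sketched at the source level. The two functions below are hypothetical illustrations (their names do not come from the lecture): one is the original loop, the other performs four element updates per iteration with a single index decrement and a single loop test, relying on the same assumption that the iteration count is a multiple of 4.

```python
def add_scalar_rolled(x, s):
    """Original loop: one element, one decrement, one branch per iteration."""
    i = len(x) - 1
    while i >= 0:
        x[i] = x[i] + s
        i -= 1
    return x


def add_scalar_unrolled4(x, s):
    """Unrolled by four: the intermediate decrements are folded into the
    offsets 0, -1, -2, -3 and only one decrement and test remain."""
    if len(x) % 4 != 0:
        raise ValueError("this sketch assumes the length is a multiple of 4")
    i = len(x) - 1
    while i >= 0:
        x[i] = x[i] + s
        x[i - 1] = x[i - 1] + s
        x[i - 2] = x[i - 2] + s
        x[i - 3] = x[i - 3] + s
        i -= 4
    return x


data = [float(k) for k in range(8)]
rolled = add_scalar_rolled(list(data), 2.5)
unrolled = add_scalar_unrolled4(list(data), 2.5)
print(rolled == unrolled)  # True
```

Both versions do the same useful work; the unrolled one simply executes the loop overhead (decrement and branch) a quarter as often.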

Page 33: ELEC 669 Low Power Design Techniques Lecture 1

Loop Unroll & Schedule

Unrolled, unscheduled:
Loop: LD   F0, 0(R1)
      stall
      ADDD F4, F0, F2
      stall
      stall
      SD   F4, 0(R1)
      LD   F6, -8(R1)
      stall
      ADDD F8, F6, F2
      stall
      stall
      SD   F8, -8(R1)
      LD   F10, -16(R1)
      stall
      ADDD F12, F10, F2
      stall
      stall
      SD   F12, -16(R1)
      LD   F14, -24(R1)
      stall
      ADDD F16, F14, F2
      stall
      stall
      SD   F16, -24(R1)
      SUBI R1, R1, #32
      stall
      BNE  R1, R2, Loop
      stall

28 clock cycles, or 7 per iteration. Can we minimize further?

Unrolled, scheduled:
Loop: LD   F0, 0(R1)
      LD   F6, -8(R1)
      LD   F10, -16(R1)
      LD   F14, -24(R1)
      ADDD F4, F0, F2
      ADDD F8, F6, F2
      ADDD F12, F10, F2
      ADDD F16, F14, F2
      SD   F4, 0(R1)
      SD   F8, -8(R1)
      SUBI R1, R1, #32
      SD   F12, 16(R1)     ; 16 - 32 = -16
      BNE  R1, R2, Loop
      SD   F16, 8(R1)      ; 8 - 32 = -24

No stalls! 14 clock cycles, or 3.5 per iteration. Can we minimize further?

Page 34: ELEC 669 Low Power Design Techniques Lecture 1

Summary

Cycles per iteration:

Original loop: 10 cycles
  → Scheduling: 6 cycles
  → Unrolling: 7 cycles
      → Scheduling: 3.5 cycles (no stalls)

Page 35: ELEC 669 Low Power Design Techniques Lecture 1


Multiple Issue

• Multiple Issue is the ability of the processor to start more than one instruction in a given cycle.

• Superscalar processors

• Very Long Instruction Word (VLIW) processors

Page 36: ELEC 669 Low Power Design Techniques Lecture 1

A Modern Processor

[Figure: pipeline stages Fetch → Decode → Issue → Complete → Commit, grouped into front-end and back-end, annotated with "Multiple Issue"]

Page 37: ELEC 669 Low Power Design Techniques Lecture 1

1990's: Superscalar Processors

• Bottleneck: CPI >= 1
  – Limit on scalar performance (single instruction issue)
  – Hazards
  – Superpipelining? Diminishing returns (hazards + overhead)
• How can we make the CPI = 0.5?
  – Multiple instructions in every pipeline stage (super-scalar)

Cycle   1    2    3    4    5    6    7
Inst0   IF   ID   EX   MEM  WB
Inst1   IF   ID   EX   MEM  WB
Inst2        IF   ID   EX   MEM  WB
Inst3        IF   ID   EX   MEM  WB
Inst4             IF   ID   EX   MEM  WB
Inst5             IF   ID   EX   MEM  WB

Page 38: ELEC 669 Low Power Design Techniques Lecture 1

Elements of Advanced Superscalars

• High performance instruction fetching
  – Good dynamic branch and jump prediction
  – Multiple instructions per cycle, multiple branches per cycle?
• Scheduling and hazard elimination
  – Dynamic scheduling
  – Not necessarily: Alpha 21064 & Pentium were statically scheduled
  – Register renaming to eliminate WAR and WAW
• Parallel functional units, paths/buses/multiple register ports
• High performance memory systems
• Speculative execution

Page 39: ELEC 669 Low Power Design Techniques Lecture 1

SS + DS + Speculation

Superscalar + Dynamic scheduling + Speculation: three great tastes that taste great together

• CPI >= 1?
  – Overcome with superscalar
• Superscalar increases hazards
  – Overcome with dynamic scheduling
• RAW dependences still a problem?
  – Overcome with a large window
• Branches a problem for filling large window?
  – Overcome with speculation

Page 40: ELEC 669 Low Power Design Techniques Lecture 1

The Big Picture

[Figure: static program → fetch & branch predict → issue → execution → reorder & commit]

Page 41: ELEC 669 Low Power Design Techniques Lecture 1

Superscalar Microarchitecture

[Figure: block diagram with the following blocks: pre-decode, inst. cache, inst. buffer, decode/rename/dispatch, floating point inst. buffer, integer/address inst. buffer, functional units, functional units and data cache, memory interface, integer register file, floating point register file, reorder and commit]

Page 42: ELEC 669 Low Power Design Techniques Lecture 1

Register renaming methods

First method:
• Physical register file vs. logical (architectural) register file
• Mapping table used to associate a physical register with the current value of a logical register
• Use a free list of physical registers
• Physical register file is bigger than the logical register file

Second method:
• Physical register file is the same size as the logical one
• Also, use a buffer with one entry per instruction: the reorder buffer

Page 43: ELEC 669 Low Power Design Techniques Lecture 1

Register Renaming Example

Unrolled, unscheduled:
Loop: LD   F0, 0(R1)
      stall
      ADDD F4, F0, F2
      stall
      stall
      SD   F4, 0(R1)
      LD   F6, -8(R1)
      stall
      ADDD F8, F6, F2
      stall
      stall
      SD   F8, -8(R1)
      LD   F10, -16(R1)
      stall
      ADDD F12, F10, F2
      stall
      stall
      SD   F12, -16(R1)
      LD   F14, -24(R1)
      stall
      ADDD F16, F14, F2
      stall
      stall
      SD   F16, -24(R1)
      SUBI R1, R1, #32
      stall
      BNE  R1, R2, Loop
      stall

28 clock cycles, or 7 per iteration. Can we minimize further?

Unrolled, scheduled:
Loop: LD   F0, 0(R1)
      LD   F6, -8(R1)
      LD   F10, -16(R1)
      LD   F14, -24(R1)
      ADDD F4, F0, F2
      ADDD F8, F6, F2
      ADDD F12, F10, F2
      ADDD F16, F14, F2
      SD   F4, 0(R1)
      SD   F8, -8(R1)
      SUBI R1, R1, #32
      SD   F12, 16(R1)     ; 16 - 32 = -16
      BNE  R1, R2, Loop
      SD   F16, 8(R1)      ; 8 - 32 = -24

No stalls! 14 clock cycles, or 3.5 per iteration. Can we minimize further?

Page 44: ELEC 669 Low Power Design Techniques Lecture 1

Register renaming: first method

Instruction: add r3, r3, 4

                 Before          After
Mapping table    r0 → R8         r0 → R8
                 r1 → R7         r1 → R7
                 r2 → R5         r2 → R5
                 r3 → R1         r3 → R2
                 r4 → R9         r4 → R9
Free list        R2, R6, R13     R6, R13
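A minimal sketch of this mapping-table-plus-free-list scheme follows. The class name `Renamer` and its method are hypothetical; the sketch only covers the renaming step itself and deliberately leaves out returning old physical registers to the free list, which a real design must also handle.

```python
from collections import deque


class Renamer:
    """Mapping table from logical to physical registers, plus a free list."""

    def __init__(self, mapping, free_list):
        self.mapping = dict(mapping)
        self.free_list = deque(free_list)

    def rename(self, dest, sources):
        # Sources are looked up first, so "r3" as a source still means the old value.
        # Anything not in the table (e.g. an immediate) passes through unchanged.
        renamed_sources = [self.mapping.get(src, src) for src in sources]
        if not self.free_list:
            raise RuntimeError("no free physical register: rename must stall")
        new_physical = self.free_list.popleft()
        self.mapping[dest] = new_physical
        return new_physical, renamed_sources


renamer = Renamer(
    {"r0": "R8", "r1": "R7", "r2": "R5", "r3": "R1", "r4": "R9"},
    ["R2", "R6", "R13"],
)
dest, srcs = renamer.rename("r3", ["r3", 4])  # add r3, r3, 4
print(dest, srcs)               # R2 ['R1', 4]
print(renamer.mapping["r3"])    # R2
print(list(renamer.free_list))  # ['R6', 'R13']
```

With these tables, `add r3, r3, 4` reads the old physical register R1 and writes the freshly allocated R2, which is what removes WAR and WAW hazards on r3.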

Page 45: ELEC 669 Low Power Design Techniques Lecture 1

Superscalar Processors

• Issues varying number of instructions per clock

• Scheduling: static (by the compiler) or dynamic (by the hardware)

• Superscalar has a varying number of instructions/cycle (1 to 8), scheduled by compiler or by HW (Tomasulo).

• IBM PowerPC, Sun UltraSparc, DEC Alpha, HP 8000

Page 46: ELEC 669 Low Power Design Techniques Lecture 1

More Realistic HW: Register Impact

[Figure: bar chart of IPC (instruction issues per cycle, axis 0 to 60) for gcc, espresso, li, fpppp, doduc, tomcatv, with the number of renaming registers set to Infinite, 256, 128, 64, 32, and None]

Effect of limiting the number of renaming registers

Integer: 5 - 15

FP: 11 - 45

Page 47: ELEC 669 Low Power Design Techniques Lecture 1


Reorder Buffer

Place data in entry when execution finished

Reserve entry at tail when dispatched

Remove from head when complete

Bypass to other instructions when needed
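The four operations listed on this slide can be sketched with a FIFO queue. The class and method names below are illustrative assumptions, and the entry layout (destination, value, done flag) is a simplification rather than a description of any particular processor.

```python
from collections import deque


class ReorderBuffer:
    """FIFO of in-flight instructions: allocate at the tail, retire from the head."""

    def __init__(self):
        self.entries = deque()

    def dispatch(self, dest):
        # Reserve an entry at the tail when the instruction is dispatched.
        entry = {"dest": dest, "value": None, "done": False}
        self.entries.append(entry)
        return entry

    def finish(self, entry, value):
        # Place data in the entry when execution finishes (possibly out of order).
        entry["value"] = value
        entry["done"] = True

    def bypass(self, dest):
        # Forward the newest value for dest to a later instruction, if available.
        for entry in reversed(self.entries):
            if entry["dest"] == dest:
                return entry["value"] if entry["done"] else None
        return None

    def commit(self, register_file):
        # Remove from the head only while the oldest instruction is complete.
        committed = []
        while self.entries and self.entries[0]["done"]:
            entry = self.entries.popleft()
            register_file[entry["dest"]] = entry["value"]
            committed.append(entry["dest"])
        return committed


rob = ReorderBuffer()
registers = {}
first = rob.dispatch("r1")
second = rob.dispatch("r2")
rob.finish(second, 20)                 # the younger instruction finishes first
blocked = rob.commit(registers)        # head not done yet, so nothing retires
forwarded = rob.bypass("r2")           # value is still available to consumers
rob.finish(first, 10)
in_order = rob.commit(registers)       # both retire, oldest first
print(blocked, forwarded, in_order, registers)
```

The key property is that results may arrive in any order but the register file is only updated in program order.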

Page 48: ELEC 669 Low Power Design Techniques Lecture 1

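Register renaming: reorder buffer

Before renaming: add r3, r3, 4
After renaming: add r3, rob6, 4, i.e. add rob8, rob6, 4

                 Before                  After
Mapping table    r0 → R8                 r0 → R8
                 r1 → R7                 r1 → R7
                 r2 → R5                 r2 → R5
                 r3 → rob6               r3 → rob8
                 r4 → R9                 r4 → R9
Reorder buffer   entries 7, 6, 0         entries 8, 7, 6, 0
                 (entry 6 holds r3)      (entries 8 and 6 hold r3)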

Page 49: ELEC 669 Low Power Design Techniques Lecture 1

Instruction Buffers

[Figure: block diagram with the following blocks: pre-decode, inst. cache, inst. buffer, decode/rename/dispatch, floating point inst. buffer, integer/address inst. buffer, functional units, functional units and data cache, memory interface, integer register file, floating point register file, reorder and commit]

Page 50: ELEC 669 Low Power Design Techniques Lecture 1

Issue Buffer Organization

a) Single, shared queue
   – No out-of-order
   – No renaming

b) Multiple queues, one per instruction type
   – No out-of-order inside queues
   – Queues issue out of order

Page 51: ELEC 669 Low Power Design Techniques Lecture 1

Issue Buffer Organization

c) Multiple reservation stations (one per instruction type or big pool)
   – No FIFO ordering
   – Ready operands, hardware available → execution starts
   – Proposed by Tomasulo

[Figure: reservation stations fed from instruction dispatch]

Page 52: ELEC 669 Low Power Design Techniques Lecture 1

Typical reservation station

Operation | source 1 | data 1 | valid 1 | source 2 | data 2 | valid 2 | destination
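The fields of such an entry map naturally onto a small record type. The sketch below is an assumption-laden illustration: the class name, the `capture` and `ready` methods, and the use of tags such as `rob6` for pending operands are all hypothetical choices made for the example.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ReservationStation:
    operation: str
    source1: str               # tag of the producer of operand 1
    source2: str               # tag of the producer of operand 2
    destination: str           # tag under which the result will be broadcast
    data1: Optional[int] = None
    valid1: bool = False
    data2: Optional[int] = None
    valid2: bool = False

    def capture(self, tag, value):
        """Grab a broadcast result if this entry is still waiting for that tag."""
        if not self.valid1 and self.source1 == tag:
            self.data1, self.valid1 = value, True
        if not self.valid2 and self.source2 == tag:
            self.data2, self.valid2 = value, True

    def ready(self):
        """Execution may start once both operands are valid (and a unit is free)."""
        return self.valid1 and self.valid2


# add rob8, rob6, 4: the immediate is valid at dispatch, rob6 is still pending.
station = ReservationStation("add", "rob6", "imm", "rob8", data2=4, valid2=True)
waiting = station.ready()      # False: operand 1 has not arrived
station.capture("rob6", 10)    # rob6 result is broadcast
can_issue = station.ready()    # True
result = station.data1 + station.data2
print(waiting, can_issue, result)
```

Because each entry watches for its own operands, entries become ready in any order, which is why no FIFO ordering is needed among reservation stations.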

Page 53: ELEC 669 Low Power Design Techniques Lecture 1

Memory Hazard Detection Logic

[Figure: block diagram with the following blocks: instruction issue, address add & translation, load address buffer (loads), store address buffer (stores), address compare, hazard control, to memory]