Low Power Design Techniques for Microprocessors

Post on 27-Dec-2021

2 views 0 download

Transcript of Low Power Design Techniques for Microprocessors

ISSCC2001

1

Low Power Design Techniquesfor Microprocessors

ISSCC, Feb 4th 2001

Simon Segars

VP Engineering, ARM Inc.

ISSCC2001

2

Agenda• Introduction

• why is power consumption an importantdesign criteria of a microprocessor?

• System design

• CPU microarchitecture

• Where does the power go?• A look at ARM920T and ARM9TDMI

• Design tools

• Conclusions• References

ISSCC2001

3

Why should a microprocessordesigner care about power

consumption ?

Isn’t it all just about performance?

ISSCC2001

4

The digital age

• The internet and wireless services are gettingmarried

• Their offspring are handheld digital communicationdevices that will rapidly evolve in complexity• microprocessors becoming pervasive

• Darwinian theory will hold true• species which are too big, too heavy, have poor

performance or need regular recharging will die..

ISSCC2001

5

Microprocessors Everywhere..

In-Car Digital PlayerHarzoom

MP3 Player

NintendoGameboy Advance

Litronic Forté CardLinux watch

EricssonScreenphone

ISSCC2001

6

Battery technology• Battery technology moves very slowly

• Moore’s law does not seem to apply

• Li-Ion and NiMh still the dominant technologies

• Batteries still contribute significantly to the weight ofa mobile device• Nokia 61xx - 33%

• Toshiba Portage 3110 - 20%

• Handspring - 10%

• This is getting better over time, but due mainly toadvances in other technology

ISSCC2001

7

Its not just about mobile devices

• Power consumption important in ‘tetheredapplications’ too

• Dissipating heat has a large impact on packagingtechnology and cost

• Excessive switching currents effect reliability

• Cooling fans create noise and add to cost

• Energy Star partnership created to promote lowerpower devices for environmental reasons

ISSCC2001

8

Impacts for Microprocessors

• Processing requirements forportable devices are rapidlyincreasing• functional integration going up

• But power requirements areeven more stringent• who wants to spend time

plugged into the wall?

• The life of the microprocessordesigner has got harder

ISSCC2001

9

How do we measure Power?

• As with processing performance, there are no goodpower consumption metrics

• MIPS per Watt is the most common• this tells you nothing about performance

• MIPS2 per Watt tells you more• but how useful are MIPS?

• Motorola invented ‘Powerstone’(1) for Mcore• collection of 15 embedded applications

• EEMBC(2) scores per Watt may be better

ISSCC2001

10

Other power Metrics

• At the circuit or device level, power-delay-productoften used• measure the power consumption, measure the

propagation delay and multiply

• Useful comparative metric between devices ofidentical functionality

• Often used in D-Type, adder or multiplier analysis• not much use for an entire CPU

ISSCC2001

11

Use power metrics wisely...• MIPS per Watt is likely to have no meaning for the

application you are running

• The best way to determine the power consumptionof a processor in your application is to run yourapplication

• Cache modes, bus activity, data usage all havelarge bearing on power consumption

• System designers require high-level power modelsof processors to make system tradeoffs• which no one has...

ISSCC2001

12

Dynamic and Static Power• Switching the capacitance is a dominant

source of power consumptionP=CV2F

• F is the effective frequency

• So, to reduce your powerconsumption:• Lower your supply voltage

• Minimise your capacitance

• Switch your circuit only whenyou need to

• Static leakage current is now becoming asignificant effect in low voltage processes

In Out

GND

Vdd

DynamicCharging Current

DynamicShort-Circuit

CurrentCload

StaticleakageCurrent

ISSCC2001

13

Process Technology• As process shrinks, dynamic power is reduced

• Lower Vt transistors mean that Vdd can be less andswitching speed is maintained• Power consumption reduced by Vdd2

• Shrinking die size also reduces capacitance aswires get shorter• Thinner in one dimension reduces capacitance

• Thicker in the other dimension and closer togetherincreases capacitance

• Capacitance reduction not proportional to area

• But, with low Vt, leakage becomes a problem

ISSCC2001

14

So why should you care?• High functionality devices means your processors

have to deliver more performance

• Size, weight, packaging, battery life mean that yourdevices have to become more power efficient

• Battery technology is not helping much

• Process technology helps but introduces newissues• static leakage power is becoming as big a problem as

dynamic power

• So where do you start?

ISSCC2001

15

System design

The higher the level ofabstraction, the more scope there

is for saving power

ISSCC2001

16

System Issues• Choice of algorithm, ISA, system partitioning, data

representation etc. have largest impact on overallsystem power

• The processor designer needs to optimize thosefactors under his control

• Operating systems are becoming more poweraware and this impacts the CPU

• A low power processor is no use if a low powersystem cannot be built around it

ISSCC2001

17

System Partitioning

• By isolating high and lowbandwidth devices, systempower can be reduced

• The CPU’s bus interfaceneeds facilitate differentspeed responses• split transactions

• fire and forget

• wait for critical response

CPU

High-speed bus

ExtBusI/F

DMA

ROMRAM

Peripheral

Peripheral

Bridge

Low-speed,narrow bus

ISSCC2001

18

Instruction Sets

• Instruction set efficiency has large impact onsystem power• the fewer memory accesses taken to complete a given

function the better

• The higher your code density the better• more code in your cache prevents costly external

accesses

• Therefore, RISC is potentially bad for powerconsumption, and CISC is good!

ISSCC2001

19

0

0.2

0.4

0.6

0.8

1

1.2

1.4

ARM

Thumb

Instructions Memory Bit Accesses

System Power vs CPU Power

• The 16 bit Thumb(3) instruction set is an encoding of asubset of the 32 bit ARM instruction set

• Dhrystone 2.1:• 21% more Thumb

instruction fetches

• 39% less memory bitreferences

• Saving in system poweroutweighs increasedCPU power

ISSCC2001

20

Dedicated hardware vsprogrammable microprocessors

• Dedicated hardware functions can be more efficientthan a very flexible microprocessor

• Hardware multipliers are more efficient thansoftware routines due to reduced instruction fetchesand register reads• Dedicated DSP functions such as Viterbi very efficient

• Reconfigurable processors can offer power savings• instructions and functionality can be added on a case by

case basis allowing the system designer to trade offhardware and software complexity

ISSCC2001

21

System-level Power control• Complex systems require power-aware operating

systems to minimize power consumption

• The processor needs to provide functionality tointeract with the system and the OS

• Functions can range from static ‘sleep’ modes todynamic voltage scaling

• Sleep modes commonplace in embedded systems• On/off power control

• clocks stopped

• state saved then power removed

ISSCC2001

22

Dynamic Voltage Scaling• Voltage scaling appearing in mobile x86 market

• Intel - SpeedStep

• AMD - PowerNow

• Transmeta - Longrun

• By optimizing voltage and frequency for a givenwork load, power may be reduced

• Introduces a new challenge - designers need toensure that their circuits will operate linearly acrossthe voltage range(4)

ISSCC2001

23

LongRun(5)

• Combined software/hardware approach

• Software monitors the processing requirement andadjusts voltage and frequency to match• frequency range 200-700MHz

• voltage range 1.1 - 1.6V

• On chip power management hardware uses anexternal interface to communicate with a systemlevel voltage regulator

• The result is significant temperature and averagepower reductions

ISSCC2001

24

System design - summary

• The CPU designer will always consider systemperformance when designing the externalinterfaces• But power is equally important

• A top down approach must be taken to achieve lowpower end systems• interaction between software and hardware becoming

critical

• Allowing for voltage scaling changes how you mustdesign your circuits

ISSCC2001

25

CPU microarchitecture

Pipelines, parallelism and prediction

ISSCC2001

26

Microarchitecure effects on power• Complexity = Power

• Pipeline depth• more stages increase D-type count and clock load

• Parallelism/Scalarity• parallel execution units increase overall capacitance

• Speculation• guessing wrong wastes power

• But, performance requirements are going up• so we need to work out how to increase

microarchitectural complexity without sacrificing power

ISSCC2001

27

Pipeline depth• Deep pipelines involve large numbers of clocked

registers throughout a design

• More D-types = greater clock load = higher power• D-type design activity for low power has increased

significantly over the last few years

• Pentium4 has a 20 stage pipeline, 42M transistorsand consumes 66W(6)

• Driving the clock in a CPU forms a significantproportion of the overall power in a CPU• clock power accounts for 30-40% of the total

consumption in Alpha 21264(7)

ISSCC2001

28

D-type power issues• Power consumption within a D-type arises from

clock load and data activity• designs can be optimised for low clock power, low data

power or speed, but hard to optimise for all three

• Therefore different D-types should be used indifferent areas of the design• only use fast, power hungry designs on critical paths

• Stojanovic et al(8) compare D-type designs• large spread of power, delay, and power-delay product

• D-type selection critical in heavily pipelined designs

ISSCC2001

29

Speculation bad, simplicity good• Branch prediction often added to a processor to

improve CPI

• Complex schemes involving history tables areexpensive on power• effectively there is another cache in the system

• Simple static schemes yield lower performance yethave low power cost

• In both cases guessing wrong cost power• any speculative processing has to be thrown away an

original machine state restored

ISSCC2001

30

Scalarity• Superscalar and VLIW cores add ‘C’ to increase

instruction throughput• parallel execution units

• hardware resolution of resource constraints

• But the added complexity can significantly increasepower consumption• out-of-order issue logic in Alpha 21264 consumes 18% of

the total chip power

• VLIW cores also suffer from poor code density• leads to high power consumption in the memory system

ISSCC2001

31

So simpler is better?• Yes and no

• If you are able to adjust operating voltage then:• build the highest performance processor you can given

the area budget,• but take care to control effective Frequency

• make sure your circuits can handle low voltages

• and run it at the lowest possible voltage

• The savings in V2 will outweigh the increase in C and F

• However, cost and system issues often don’t allowvariable operating voltages• in which case, keep it simple

ISSCC2001

32

Where does the power go?

An analysis of ARM920T

ISSCC2001

33

ARM920T• Harvard cached processor

• ARM9TDMI core

• 2x16K caches

• MMU support for VM

• Support for write-backand write-through caches• write-through for coherency

• write-back for low power

• 2.5 Million transistors

• TSMC 0.18µm:• 1mW/MHz (1.8V)

• 200MHz (1.65V)

• 11.8mm2

BIU

ARM9TDMI

16K I$ 16K D$

IMMU

DMMU

PA-TAG

AHB

ISSCC2001

34

Power Analysis of ARM920T

I Cache25%

D MMU5%

I MMU4%

arm925%

PATag RAM1%

CP152%

BIU8%

SysCtl3%

Clocks4% Other

4% D Cache19%

ISSCC2001

35

Low Power Caches• Almost half the total power is consumed by the

caches

• ARM920T uses a high-associativity cache segment• modular approach allows cache size reconfiguration

• careful layout provides low power implementation

• Many other processors use low (2 or 4 way) setassociative caches• including ARM’s soft cores

• Sequentiality can be used to reduce cache powerfor both types of cache

ISSCC2001

36

High associativity cache• Low order address bits

select the segment on anon-sequential access

• A pair of words are read

• Data is taken from theword-pair if sequential• saves CAM and RAM power

• Data RAM is not activatedon a cache miss

• Self-timed, full custom,high performance yet lowpower

64 WayCAM

Data RAM(2KB)

Sense Amps

Data[31:0]

A[31:7]

A[4:0]Enable

ISSCC2001

37

Low-swing bit lines• The cache bit lines only discharge

to the point where the sense amptriggers and then recharges• self-timed feedback path

• bit-lines only drop ~15% of full railvalue

• Low swing lines applicable toother areas of CPUs• particularly long interconnect busses

eg. core cache

• 10x saving in bus power in Alpha 21264(r)

• Also helps improve performanceB

it-lin

e V

olta

ge

Time

Sense AmpTrigger Voltage

Discharge

Pre-charge

ISSCC2001

38

Low associativity caches

• More suitable for ASIC flows

• All TAG arrays looked upevery non-sequential access

• RAMs often looked upspeculatively in parallel forperformance

• Only addressed setaccessed on sequentialcycles

• Again, whole line can belatched

RAM_0

TAG_3

RAM_3RAM_2RAM_1

TAG_0 TAG_1 TAG_2

RAM_0

TAG_3

RAM_3RAM_2RAM_1

TAG_0 TAG_1 TAG_2

Non-sequential Access

Sequential Access

ISSCC2001

39

Associativity effect on power• Higher number of cache ways increase power

• more capacitance switched per cache access

• Most CPUs fix the associativity• but best cache architecture is algorithm dependant

• MCore M340 provides cache reconfiguration toalter the number of active ways(9) in its cache

• Results show that by enabling the optimum numberof ways power can be reduced by up to 50%• requires an RTOS to reprogram the cache control

register on a task switch

ISSCC2001

40

Virtually addressed caches

• ARM920T uses a virtually addressed cache• address translation only occurs on a cache miss

• In write-back mode, the physical address needs tobe recalculated when writing dirty data to memory• need to re-access the TLB which may miss

• A ‘physical address TAG’ is used to cachetranslated data addresses to accelerate theprocess and save power• no need for TLB look up or external page table walk

ISSCC2001

41

The CPU Core - ARM9TDMI• 5 stage pipeline

• Harvard architecture

• ARM v4T compliant• ARM and Thumb instruction

decoders

• 110,000 transistors

• TSMC 0.18µm:• 0.3mW/MHz (1.8V)

• 220MHz (1.65V)

• 1mm2

ISSCC2001

42

Power Analysis of ARM9TDMI

RegBank13% ByteRot

1%

Amux3%

Bmux5%

Shifter7%

AU17%

LU1%

ICEdp17%

Mul3%

psrdp3%

DAOut/Scan5%

IAOut/Scan7%

DDOut/Scan6%

IDScan5%

CoreDPMisc4%

ALUPipe3%

• Datapath: 43%• RegBank+AU+

Debug = 47%

• Control logic: 53%

• Clock driver:4%

Datapath analysis

ISSCC2001

43

Controlling Sub-block power

• The golden rule of low power design:• don’t clock or toggle the inputs of an inactive block

• Partition your functional blocks so that they can beisolated

• Derive control signals early in your instructiondecode to allow blocks to be isolated

• Don’t take this too far• make sure the power saved is not spent generating

control signals

ISSCC2001

44

ARM7 Core ALU

• Analysis shows• AU used 61% of cycles

• LU used 39% of cycles

• When LU in use theAU is unnecessarilydriven• when calculating a logical result

40% of the total power is wastedby the adder

• Therefore isolating theAU can save significant power

RegisterFile

Shi

fter

LogicUnit

Arith.Unit

Res

ult M

ux

Ctl

Latches

ISSCC2001

45

Modified Design

RegisterFile

Shi

fter

LogicUnit

Arith.Unit

Latches

Res

ult M

ux

Result

Ctl

• Potential for 20% saving in power in the ALU• no cost in performance

• extra 64 latches

• Similar scheme implementedin ARM9TDMI

ISSCC2001

46

Glitch control• Within combinatorial logic blocks 15-20% of the

power can be caused by glitches(10)

• Logic optimisation can reduce this• wide gates have low switching probabilities

• Path matching very difficult as signal arrival timeshighly layout dependant

• Latches can be used to isolate blocks of logic(11)

• but expensive in terms of area

• Benini et al(12) suggest modifying gates to isolatestages of logic until inputs are stable

ISSCC2001

47

Freeze Gates

Clock

∆∆

Logic1 Logic2

F-Gate

Clock

A

∆∆ >= propagation delay of Logic1

A

• Logic2’s inputs frozen while Logic1 evaluates

• Cells grouped together under common delayed clock signal

• Minimal impact on speed and area

∆∆

ISSCC2001

48

ARM9TDMI Clock Domains

Datapath

Control Logic

Debugdebug clock

control clock

dp clock

ClkDebug En

nWait

Clock Block

• ARM9TDMI has simple clock driverto generate clocks for control logic,datapath and debug logic

• Buffered clock tree distributes low-skew clock across core

• Global wait signal factored into allclocks

ISSCC2001

49

Clock Power

0

10

20

30

40

50

60

%ag

e C

ore

Po

wer

C

on

sum

pti

on

Control Datapath Clockblock

Clock DriversLogic

• Clock driver power only 4% oftotal core consumption

• Power for entire clock treetotals 32% of core• excludes clk logic within latches

• Early gating of debug clockallows all debug logic to befrozen when not in use• saves ~10% of total core power

• Global wait signal factoredinto all clocks

ISSCC2001

50

So why have a clock?• Processors built using asynchronous design styles

offer the potential for low power implementations• No clock, so no clock-tree power

• side benefit is low electro-magnetic radiation

• Functional units automatically power up dependingon which instruction is in execution• much finer resolution and no explicit idle-state decode

logic

• Manchester University leading with AMULET3i(13)

• Latest in family with performance similar to ARM9TDMI

• First commercial use in DRACO wireless DECT chip

ISSCC2001

51

Future of Asynchronous• Mainly still a research activity

• Few implementations to show that the technologycan be employed in a commercially interesting way

• Designers brought up on synchronous techniques

• No commercial design tools to help

• But, great potential• low power, low emissions, no clock to distribute

• Using asynchronous interfaces betweensynchronous processing elements on a large SoCcould provide a home for this technology

ISSCC2001

52

Leakage• On low voltage (low Vt) processes leakage currents

can become significant to the overall powerconsumption• typically 10-20pA/transistor when Vt~0.7V

• increases to 10-20nA/transistor when Vt~0.2-0.3V

• Combination of microarchitecture and circuittechniques required to control leakage currents

• Very real problem - leakage becoming commonproblem even on commodity processes• many designs in progress today targeting leaky

processes

ISSCC2001

53

Leakage control strategies• Processor sleep modes remove power to leaky

sections of the design when not in use• eg a multiplier or a cache

• State often needs to be saved to memory• system trade off of power cost to write out to memory and

then reload vs leakage power saved

• need OS to determine how long the sleep period is likelyto be

• At circuit level substrate biasing and MTCMOS aremain approaches in addition to careful transistorsizing

ISSCC2001

54

Local State saving

• Balloon logic MTCMOStechnique proposed by NTT(14)

• Low Vt devices used in D-type• Connected to gated Vdd

• High Vt devices used in‘balloon’• Connected to Vdd

• ‘Balloon’ enabled prior toentering sleep mode

• Allows rapid switch from sleepto regular mode with no loss ofstate

High Vt ‘Balloon’

Low Vt DFF

Vdd

ISSCC2001

55

Summary• Understand where the power goes in your CPU

from historical data and then optimise your design

• Caches, ALUs and register files are big powerburners

• Clocks contribute significantly to the power sochoose your d-types well and clock gate early• or design out the clock all together!

• Ensure that infrequently used logic is clock-frozen,isolated from input activity and even remove Vdd

• Leakage is now a significant issue which affectsyour microarchitecture and your circuits

ISSCC2001

56

Design tools

The EDA industry needs to help

ISSCC2001

57

The design space

Tx

Gate

RTL

µArch

Arch

Level of abstraction

Sco

pe f

or p

ower

Sav

ing

Commercial tool coverage:• Analysis tools• Optimization tools

System

ISSCC2001

58

Transistor Level• Post-layout analysis tools predict what your power

consumption will be, but too late in the project toallow large scale changes• Synopsys-EPIC PowerMill(15)

• Mentor Mach PA(16)

• Claim close correlation with Spice and real silicon• better than 5% accurate

• AMPS(17) offers transistor size optimization forpower and performance

• Avant! Mars-Rail(18) for power driven Place andRoute

ISSCC2001

59

Gate and RTL level Analysis• Main tool offerings from Synopsys and Sequence

• PrimePower(19)

• WattWatcher(20)

• Gate level simulations provide a graphical displayof switching activity across a circuit to allow designoptimizations for power

• Rely on using a power characterized cell library

• WattWatcher can also perform analysis on RTL• logic complexity estimated by analyzing RTL

ISSCC2001

60

Gate and RTL Optimization

• Synopsys’ PowerCompiler(21)

insert gated clocks onregisters• can also optimize logic gate

selection based on librarypower characterization data

• Sequence’s WattSmith(22)

analyses the RTL and itsWattBots then propose RTLlevel power optimizations

Clk

D QIn

Out

Ctl

D Q

Clk

In

Ctl

Out

ISSCC2001

61

Higher levels of abstraction• Current design tools only cover low end of the

design space today

• No tools exist today for microarchitecture andarchitecture analysis and optimization• the EDA industry needs to keep pushing up the

abstraction graph

• The system designer needs power models of CPUsto make high level trade-offs• model the dependency on instruction and data patterns

• allow the effects of code and data representationchanges to be determined

ISSCC2001

62

Conclusions

ISSCC2001

63

Conclusions• Power consumption is as important a design criteria

as performance• even if your application is plugged into the wall

• Low power design starts at the system level• a top down approach will yield greatest results

• Power management is a system issue that requirescircuit, microarchitecture and software interaction

• Performance requirements are increasing and somicroarchitectural complexity must go up withoutsacrificing power

ISSCC2001

64

Conclusions cont.

• Set your power budget at the start of the designand measure it as you go• EDA tools are becoming available to help

• But the EDA industry needs to be encouraged to do more

• Understand where the power goes in your designstoday and use the data to improve future products

• Low power design presents new challenges• still an immature technology

• reducing mW is much more interesting than increasingMHz!

ISSCC2001

65

References• (1) “Designing the Low- Power M·CORE(TM) Architecture” Scott, Hwang

Lee, Arends, Moyer, www.mot.com/SPS/MCORE/MPU/designwp.pdf

• (2) http://eembc.org/

• (3) “Embedded Control Problems, Thumb and the ARM7TDMI”, Segars,Clarke and Goudge, IEEE Micro, October 1995, p22-30

• (4) “Design Issues for Dynamic Voltage Scaling”, Burd and Brodersen,proceedings of ISLPD 2000, p9-14

• (5) www.transmeta.com/crusoe/lowpower/longrun.html

• (6) “Pentium 4 (Partially) Previewed”, Glaskowsky, MicroprocessorReport August 2000 p1,11-13

• (7) - “Power Considerations in the Design of the Alpha 21264” Gowan,Brio, and Jackson, DAC 1998

• (8) “Comparative Analysis of Master-Slave Latches and Flip-flops forHigh Performance and Low-Power Systems”, Stojanovic and Oklobdzija,IEEE JSSC, April 1999, p536-548

ISSCC2001

66

References cont.• (9) “A Low Power Unified cache Architecture Providing Power and

Performance Flexibility”, Malik, Moyer and Cermak, proceedings ofISLPED 2000 p241

• (10) “Analysis of hazard contributions to power dissipation in CMOS Ics”,Benini, Favalli and Ricco, IWLPD 1994 p27-32

• (11) “A Low Power 16 by 16 Multiplier Using Transistion ReductionCircuitry” Lemonds and Mahant Shetti, IWLPD 1994

• (12) “Glitch Power Minimization by Selective Gate Freezing”, Benini, DeMicheli, Macii, Macii, Poncino and Scarsi, IEEE Transactions on VLSISystems, June 2000, p287-298

• (13) http://www.cs.man.ac.uk/amulet/

• (14) “A 1-V high-speed MTCMOS circuit scheme for power-downapplications”, Shigematsu et al, Symposium on VLSI Circuits, 1995

ISSCC2001

67

References cont.• (15) http://www.synopsys.com/products/etg/powermill_ds.html

• (16) http://www.mentor.com/mach/

• (17) http://www.synopsys.com/products/etg/amps_ds.html

• (18) http://www.avanticorp.com/Avant!/SolutionsProducts/Products/Item/1,1172,12,00.html

• (19) http://www.synopsys.com/products/etg/primepower_ds.html

• (20) http://www.sequencedesign.com/2_solutions/2b1_wwatcher.html

• (21) http://www.synopsys.com/products/power/power_bkg.html

• (22) http://www.sequencedesign.com/2_solutions/2b2_wsmith.html

• See also:• “Low-Power CMOS Digital Design”, Chandrakasan, Sheng and Brodersen,

IEEE JSSC, April 1992 p473-484

• “Low-Power CMOS Design”, Chandrakasan and Brodersen, IEEE Press,ISBN 0-7803-3429-9, 1998