Low Power Design Techniques for Microprocessors
Transcript of Low Power Design Techniques for Microprocessors
ISSCC2001
1
Low Power Design Techniquesfor Microprocessors
ISSCC, Feb 4th 2001
Simon Segars
VP Engineering, ARM Inc.
ISSCC2001
2
Agenda• Introduction
• why is power consumption an importantdesign criteria of a microprocessor?
• System design
• CPU microarchitecture
• Where does the power go?• A look at ARM920T and ARM9TDMI
• Design tools
• Conclusions• References
ISSCC2001
3
Why should a microprocessordesigner care about power
consumption ?
Isn’t it all just about performance?
ISSCC2001
4
The digital age
• The internet and wireless services are gettingmarried
• Their offspring are handheld digital communicationdevices that will rapidly evolve in complexity• microprocessors becoming pervasive
• Darwinian theory will hold true• species which are too big, too heavy, have poor
performance or need regular recharging will die..
ISSCC2001
5
Microprocessors Everywhere..
In-Car Digital PlayerHarzoom
MP3 Player
NintendoGameboy Advance
Litronic Forté CardLinux watch
EricssonScreenphone
ISSCC2001
6
Battery technology• Battery technology moves very slowly
• Moore’s law does not seem to apply
• Li-Ion and NiMh still the dominant technologies
• Batteries still contribute significantly to the weight ofa mobile device• Nokia 61xx - 33%
• Toshiba Portage 3110 - 20%
• Handspring - 10%
• This is getting better over time, but due mainly toadvances in other technology
ISSCC2001
7
Its not just about mobile devices
• Power consumption important in ‘tetheredapplications’ too
• Dissipating heat has a large impact on packagingtechnology and cost
• Excessive switching currents effect reliability
• Cooling fans create noise and add to cost
• Energy Star partnership created to promote lowerpower devices for environmental reasons
ISSCC2001
8
Impacts for Microprocessors
• Processing requirements forportable devices are rapidlyincreasing• functional integration going up
• But power requirements areeven more stringent• who wants to spend time
plugged into the wall?
• The life of the microprocessordesigner has got harder
ISSCC2001
9
How do we measure Power?
• As with processing performance, there are no goodpower consumption metrics
• MIPS per Watt is the most common• this tells you nothing about performance
• MIPS2 per Watt tells you more• but how useful are MIPS?
• Motorola invented ‘Powerstone’(1) for Mcore• collection of 15 embedded applications
• EEMBC(2) scores per Watt may be better
ISSCC2001
10
Other power Metrics
• At the circuit or device level, power-delay-productoften used• measure the power consumption, measure the
propagation delay and multiply
• Useful comparative metric between devices ofidentical functionality
• Often used in D-Type, adder or multiplier analysis• not much use for an entire CPU
ISSCC2001
11
Use power metrics wisely...• MIPS per Watt is likely to have no meaning for the
application you are running
• The best way to determine the power consumptionof a processor in your application is to run yourapplication
• Cache modes, bus activity, data usage all havelarge bearing on power consumption
• System designers require high-level power modelsof processors to make system tradeoffs• which no one has...
ISSCC2001
12
Dynamic and Static Power• Switching the capacitance is a dominant
source of power consumptionP=CV2F
• F is the effective frequency
• So, to reduce your powerconsumption:• Lower your supply voltage
• Minimise your capacitance
• Switch your circuit only whenyou need to
• Static leakage current is now becoming asignificant effect in low voltage processes
In Out
GND
Vdd
DynamicCharging Current
DynamicShort-Circuit
CurrentCload
StaticleakageCurrent
ISSCC2001
13
Process Technology• As process shrinks, dynamic power is reduced
• Lower Vt transistors mean that Vdd can be less andswitching speed is maintained• Power consumption reduced by Vdd2
• Shrinking die size also reduces capacitance aswires get shorter• Thinner in one dimension reduces capacitance
• Thicker in the other dimension and closer togetherincreases capacitance
• Capacitance reduction not proportional to area
• But, with low Vt, leakage becomes a problem
ISSCC2001
14
So why should you care?• High functionality devices means your processors
have to deliver more performance
• Size, weight, packaging, battery life mean that yourdevices have to become more power efficient
• Battery technology is not helping much
• Process technology helps but introduces newissues• static leakage power is becoming as big a problem as
dynamic power
• So where do you start?
ISSCC2001
15
System design
The higher the level ofabstraction, the more scope there
is for saving power
ISSCC2001
16
System Issues• Choice of algorithm, ISA, system partitioning, data
representation etc. have largest impact on overallsystem power
• The processor designer needs to optimize thosefactors under his control
• Operating systems are becoming more poweraware and this impacts the CPU
• A low power processor is no use if a low powersystem cannot be built around it
ISSCC2001
17
System Partitioning
• By isolating high and lowbandwidth devices, systempower can be reduced
• The CPU’s bus interfaceneeds facilitate differentspeed responses• split transactions
• fire and forget
• wait for critical response
CPU
High-speed bus
ExtBusI/F
DMA
ROMRAM
Peripheral
Peripheral
Bridge
Low-speed,narrow bus
ISSCC2001
18
Instruction Sets
• Instruction set efficiency has large impact onsystem power• the fewer memory accesses taken to complete a given
function the better
• The higher your code density the better• more code in your cache prevents costly external
accesses
• Therefore, RISC is potentially bad for powerconsumption, and CISC is good!
ISSCC2001
19
0
0.2
0.4
0.6
0.8
1
1.2
1.4
ARM
Thumb
Instructions Memory Bit Accesses
System Power vs CPU Power
• The 16 bit Thumb(3) instruction set is an encoding of asubset of the 32 bit ARM instruction set
• Dhrystone 2.1:• 21% more Thumb
instruction fetches
• 39% less memory bitreferences
• Saving in system poweroutweighs increasedCPU power
ISSCC2001
20
Dedicated hardware vsprogrammable microprocessors
• Dedicated hardware functions can be more efficientthan a very flexible microprocessor
• Hardware multipliers are more efficient thansoftware routines due to reduced instruction fetchesand register reads• Dedicated DSP functions such as Viterbi very efficient
• Reconfigurable processors can offer power savings• instructions and functionality can be added on a case by
case basis allowing the system designer to trade offhardware and software complexity
ISSCC2001
21
System-level Power control• Complex systems require power-aware operating
systems to minimize power consumption
• The processor needs to provide functionality tointeract with the system and the OS
• Functions can range from static ‘sleep’ modes todynamic voltage scaling
• Sleep modes commonplace in embedded systems• On/off power control
• clocks stopped
• state saved then power removed
ISSCC2001
22
Dynamic Voltage Scaling• Voltage scaling appearing in mobile x86 market
• Intel - SpeedStep
• AMD - PowerNow
• Transmeta - Longrun
• By optimizing voltage and frequency for a givenwork load, power may be reduced
• Introduces a new challenge - designers need toensure that their circuits will operate linearly acrossthe voltage range(4)
ISSCC2001
23
LongRun(5)
• Combined software/hardware approach
• Software monitors the processing requirement andadjusts voltage and frequency to match• frequency range 200-700MHz
• voltage range 1.1 - 1.6V
• On chip power management hardware uses anexternal interface to communicate with a systemlevel voltage regulator
• The result is significant temperature and averagepower reductions
ISSCC2001
24
System design - summary
• The CPU designer will always consider systemperformance when designing the externalinterfaces• But power is equally important
• A top down approach must be taken to achieve lowpower end systems• interaction between software and hardware becoming
critical
• Allowing for voltage scaling changes how you mustdesign your circuits
ISSCC2001
25
CPU microarchitecture
Pipelines, parallelism and prediction
ISSCC2001
26
Microarchitecure effects on power• Complexity = Power
• Pipeline depth• more stages increase D-type count and clock load
• Parallelism/Scalarity• parallel execution units increase overall capacitance
• Speculation• guessing wrong wastes power
• But, performance requirements are going up• so we need to work out how to increase
microarchitectural complexity without sacrificing power
ISSCC2001
27
Pipeline depth• Deep pipelines involve large numbers of clocked
registers throughout a design
• More D-types = greater clock load = higher power• D-type design activity for low power has increased
significantly over the last few years
• Pentium4 has a 20 stage pipeline, 42M transistorsand consumes 66W(6)
• Driving the clock in a CPU forms a significantproportion of the overall power in a CPU• clock power accounts for 30-40% of the total
consumption in Alpha 21264(7)
ISSCC2001
28
D-type power issues• Power consumption within a D-type arises from
clock load and data activity• designs can be optimised for low clock power, low data
power or speed, but hard to optimise for all three
• Therefore different D-types should be used indifferent areas of the design• only use fast, power hungry designs on critical paths
• Stojanovic et al(8) compare D-type designs• large spread of power, delay, and power-delay product
• D-type selection critical in heavily pipelined designs
ISSCC2001
29
Speculation bad, simplicity good• Branch prediction often added to a processor to
improve CPI
• Complex schemes involving history tables areexpensive on power• effectively there is another cache in the system
• Simple static schemes yield lower performance yethave low power cost
• In both cases guessing wrong cost power• any speculative processing has to be thrown away an
original machine state restored
ISSCC2001
30
Scalarity• Superscalar and VLIW cores add ‘C’ to increase
instruction throughput• parallel execution units
• hardware resolution of resource constraints
• But the added complexity can significantly increasepower consumption• out-of-order issue logic in Alpha 21264 consumes 18% of
the total chip power
• VLIW cores also suffer from poor code density• leads to high power consumption in the memory system
ISSCC2001
31
So simpler is better?• Yes and no
• If you are able to adjust operating voltage then:• build the highest performance processor you can given
the area budget,• but take care to control effective Frequency
• make sure your circuits can handle low voltages
• and run it at the lowest possible voltage
• The savings in V2 will outweigh the increase in C and F
• However, cost and system issues often don’t allowvariable operating voltages• in which case, keep it simple
ISSCC2001
32
Where does the power go?
An analysis of ARM920T
ISSCC2001
33
ARM920T• Harvard cached processor
• ARM9TDMI core
• 2x16K caches
• MMU support for VM
• Support for write-backand write-through caches• write-through for coherency
• write-back for low power
• 2.5 Million transistors
• TSMC 0.18µm:• 1mW/MHz (1.8V)
• 200MHz (1.65V)
• 11.8mm2
BIU
ARM9TDMI
16K I$ 16K D$
IMMU
DMMU
PA-TAG
AHB
ISSCC2001
34
Power Analysis of ARM920T
I Cache25%
D MMU5%
I MMU4%
arm925%
PATag RAM1%
CP152%
BIU8%
SysCtl3%
Clocks4% Other
4% D Cache19%
ISSCC2001
35
Low Power Caches• Almost half the total power is consumed by the
caches
• ARM920T uses a high-associativity cache segment• modular approach allows cache size reconfiguration
• careful layout provides low power implementation
• Many other processors use low (2 or 4 way) setassociative caches• including ARM’s soft cores
• Sequentiality can be used to reduce cache powerfor both types of cache
ISSCC2001
36
High associativity cache• Low order address bits
select the segment on anon-sequential access
• A pair of words are read
• Data is taken from theword-pair if sequential• saves CAM and RAM power
• Data RAM is not activatedon a cache miss
• Self-timed, full custom,high performance yet lowpower
64 WayCAM
Data RAM(2KB)
Sense Amps
Data[31:0]
A[31:7]
A[4:0]Enable
ISSCC2001
37
Low-swing bit lines• The cache bit lines only discharge
to the point where the sense amptriggers and then recharges• self-timed feedback path
• bit-lines only drop ~15% of full railvalue
• Low swing lines applicable toother areas of CPUs• particularly long interconnect busses
eg. core cache
• 10x saving in bus power in Alpha 21264(r)
• Also helps improve performanceB
it-lin
e V
olta
ge
Time
Sense AmpTrigger Voltage
Discharge
Pre-charge
ISSCC2001
38
Low associativity caches
• More suitable for ASIC flows
• All TAG arrays looked upevery non-sequential access
• RAMs often looked upspeculatively in parallel forperformance
• Only addressed setaccessed on sequentialcycles
• Again, whole line can belatched
RAM_0
TAG_3
RAM_3RAM_2RAM_1
TAG_0 TAG_1 TAG_2
RAM_0
TAG_3
RAM_3RAM_2RAM_1
TAG_0 TAG_1 TAG_2
Non-sequential Access
Sequential Access
ISSCC2001
39
Associativity effect on power• Higher number of cache ways increase power
• more capacitance switched per cache access
• Most CPUs fix the associativity• but best cache architecture is algorithm dependant
• MCore M340 provides cache reconfiguration toalter the number of active ways(9) in its cache
• Results show that by enabling the optimum numberof ways power can be reduced by up to 50%• requires an RTOS to reprogram the cache control
register on a task switch
ISSCC2001
40
Virtually addressed caches
• ARM920T uses a virtually addressed cache• address translation only occurs on a cache miss
• In write-back mode, the physical address needs tobe recalculated when writing dirty data to memory• need to re-access the TLB which may miss
• A ‘physical address TAG’ is used to cachetranslated data addresses to accelerate theprocess and save power• no need for TLB look up or external page table walk
ISSCC2001
41
The CPU Core - ARM9TDMI• 5 stage pipeline
• Harvard architecture
• ARM v4T compliant• ARM and Thumb instruction
decoders
• 110,000 transistors
• TSMC 0.18µm:• 0.3mW/MHz (1.8V)
• 220MHz (1.65V)
• 1mm2
ISSCC2001
42
Power Analysis of ARM9TDMI
RegBank13% ByteRot
1%
Amux3%
Bmux5%
Shifter7%
AU17%
LU1%
ICEdp17%
Mul3%
psrdp3%
DAOut/Scan5%
IAOut/Scan7%
DDOut/Scan6%
IDScan5%
CoreDPMisc4%
ALUPipe3%
• Datapath: 43%• RegBank+AU+
Debug = 47%
• Control logic: 53%
• Clock driver:4%
Datapath analysis
ISSCC2001
43
Controlling Sub-block power
• The golden rule of low power design:• don’t clock or toggle the inputs of an inactive block
• Partition your functional blocks so that they can beisolated
• Derive control signals early in your instructiondecode to allow blocks to be isolated
• Don’t take this too far• make sure the power saved is not spent generating
control signals
ISSCC2001
44
ARM7 Core ALU
• Analysis shows• AU used 61% of cycles
• LU used 39% of cycles
• When LU in use theAU is unnecessarilydriven• when calculating a logical result
40% of the total power is wastedby the adder
• Therefore isolating theAU can save significant power
RegisterFile
Shi
fter
LogicUnit
Arith.Unit
Res
ult M
ux
Ctl
Latches
ISSCC2001
45
Modified Design
RegisterFile
Shi
fter
LogicUnit
Arith.Unit
Latches
Res
ult M
ux
Result
Ctl
• Potential for 20% saving in power in the ALU• no cost in performance
• extra 64 latches
• Similar scheme implementedin ARM9TDMI
ISSCC2001
46
Glitch control• Within combinatorial logic blocks 15-20% of the
power can be caused by glitches(10)
• Logic optimisation can reduce this• wide gates have low switching probabilities
• Path matching very difficult as signal arrival timeshighly layout dependant
• Latches can be used to isolate blocks of logic(11)
• but expensive in terms of area
• Benini et al(12) suggest modifying gates to isolatestages of logic until inputs are stable
ISSCC2001
47
Freeze Gates
Clock
∆∆
Logic1 Logic2
F-Gate
Clock
A
∆∆ >= propagation delay of Logic1
A
• Logic2’s inputs frozen while Logic1 evaluates
• Cells grouped together under common delayed clock signal
• Minimal impact on speed and area
∆∆
ISSCC2001
48
ARM9TDMI Clock Domains
Datapath
Control Logic
Debugdebug clock
control clock
dp clock
ClkDebug En
nWait
Clock Block
• ARM9TDMI has simple clock driverto generate clocks for control logic,datapath and debug logic
• Buffered clock tree distributes low-skew clock across core
• Global wait signal factored into allclocks
ISSCC2001
49
Clock Power
0
10
20
30
40
50
60
%ag
e C
ore
Po
wer
C
on
sum
pti
on
Control Datapath Clockblock
Clock DriversLogic
• Clock driver power only 4% oftotal core consumption
• Power for entire clock treetotals 32% of core• excludes clk logic within latches
• Early gating of debug clockallows all debug logic to befrozen when not in use• saves ~10% of total core power
• Global wait signal factoredinto all clocks
ISSCC2001
50
So why have a clock?• Processors built using asynchronous design styles
offer the potential for low power implementations• No clock, so no clock-tree power
• side benefit is low electro-magnetic radiation
• Functional units automatically power up dependingon which instruction is in execution• much finer resolution and no explicit idle-state decode
logic
• Manchester University leading with AMULET3i(13)
• Latest in family with performance similar to ARM9TDMI
• First commercial use in DRACO wireless DECT chip
ISSCC2001
51
Future of Asynchronous• Mainly still a research activity
• Few implementations to show that the technologycan be employed in a commercially interesting way
• Designers brought up on synchronous techniques
• No commercial design tools to help
• But, great potential• low power, low emissions, no clock to distribute
• Using asynchronous interfaces betweensynchronous processing elements on a large SoCcould provide a home for this technology
ISSCC2001
52
Leakage• On low voltage (low Vt) processes leakage currents
can become significant to the overall powerconsumption• typically 10-20pA/transistor when Vt~0.7V
• increases to 10-20nA/transistor when Vt~0.2-0.3V
• Combination of microarchitecture and circuittechniques required to control leakage currents
• Very real problem - leakage becoming commonproblem even on commodity processes• many designs in progress today targeting leaky
processes
ISSCC2001
53
Leakage control strategies• Processor sleep modes remove power to leaky
sections of the design when not in use• eg a multiplier or a cache
• State often needs to be saved to memory• system trade off of power cost to write out to memory and
then reload vs leakage power saved
• need OS to determine how long the sleep period is likelyto be
• At circuit level substrate biasing and MTCMOS aremain approaches in addition to careful transistorsizing
ISSCC2001
54
Local State saving
• Balloon logic MTCMOStechnique proposed by NTT(14)
• Low Vt devices used in D-type• Connected to gated Vdd
• High Vt devices used in‘balloon’• Connected to Vdd
• ‘Balloon’ enabled prior toentering sleep mode
• Allows rapid switch from sleepto regular mode with no loss ofstate
High Vt ‘Balloon’
Low Vt DFF
Vdd
ISSCC2001
55
Summary• Understand where the power goes in your CPU
from historical data and then optimise your design
• Caches, ALUs and register files are big powerburners
• Clocks contribute significantly to the power sochoose your d-types well and clock gate early• or design out the clock all together!
• Ensure that infrequently used logic is clock-frozen,isolated from input activity and even remove Vdd
• Leakage is now a significant issue which affectsyour microarchitecture and your circuits
ISSCC2001
56
Design tools
The EDA industry needs to help
ISSCC2001
57
The design space
Tx
Gate
RTL
µArch
Arch
Level of abstraction
Sco
pe f
or p
ower
Sav
ing
Commercial tool coverage:• Analysis tools• Optimization tools
System
ISSCC2001
58
Transistor Level• Post-layout analysis tools predict what your power
consumption will be, but too late in the project toallow large scale changes• Synopsys-EPIC PowerMill(15)
• Mentor Mach PA(16)
• Claim close correlation with Spice and real silicon• better than 5% accurate
• AMPS(17) offers transistor size optimization forpower and performance
• Avant! Mars-Rail(18) for power driven Place andRoute
ISSCC2001
59
Gate and RTL level Analysis• Main tool offerings from Synopsys and Sequence
• PrimePower(19)
• WattWatcher(20)
• Gate level simulations provide a graphical displayof switching activity across a circuit to allow designoptimizations for power
• Rely on using a power characterized cell library
• WattWatcher can also perform analysis on RTL• logic complexity estimated by analyzing RTL
ISSCC2001
60
Gate and RTL Optimization
• Synopsys’ PowerCompiler(21)
insert gated clocks onregisters• can also optimize logic gate
selection based on librarypower characterization data
• Sequence’s WattSmith(22)
analyses the RTL and itsWattBots then propose RTLlevel power optimizations
Clk
D QIn
Out
Ctl
D Q
Clk
In
Ctl
Out
ISSCC2001
61
Higher levels of abstraction• Current design tools only cover low end of the
design space today
• No tools exist today for microarchitecture andarchitecture analysis and optimization• the EDA industry needs to keep pushing up the
abstraction graph
• The system designer needs power models of CPUsto make high level trade-offs• model the dependency on instruction and data patterns
• allow the effects of code and data representationchanges to be determined
ISSCC2001
62
Conclusions
ISSCC2001
63
Conclusions• Power consumption is as important a design criteria
as performance• even if your application is plugged into the wall
• Low power design starts at the system level• a top down approach will yield greatest results
• Power management is a system issue that requirescircuit, microarchitecture and software interaction
• Performance requirements are increasing and somicroarchitectural complexity must go up withoutsacrificing power
ISSCC2001
64
Conclusions cont.
• Set your power budget at the start of the designand measure it as you go• EDA tools are becoming available to help
• But the EDA industry needs to be encouraged to do more
• Understand where the power goes in your designstoday and use the data to improve future products
• Low power design presents new challenges• still an immature technology
• reducing mW is much more interesting than increasingMHz!
ISSCC2001
65
References• (1) “Designing the Low- Power M·CORE(TM) Architecture” Scott, Hwang
Lee, Arends, Moyer, www.mot.com/SPS/MCORE/MPU/designwp.pdf
• (2) http://eembc.org/
• (3) “Embedded Control Problems, Thumb and the ARM7TDMI”, Segars,Clarke and Goudge, IEEE Micro, October 1995, p22-30
• (4) “Design Issues for Dynamic Voltage Scaling”, Burd and Brodersen,proceedings of ISLPD 2000, p9-14
• (5) www.transmeta.com/crusoe/lowpower/longrun.html
• (6) “Pentium 4 (Partially) Previewed”, Glaskowsky, MicroprocessorReport August 2000 p1,11-13
• (7) - “Power Considerations in the Design of the Alpha 21264” Gowan,Brio, and Jackson, DAC 1998
• (8) “Comparative Analysis of Master-Slave Latches and Flip-flops forHigh Performance and Low-Power Systems”, Stojanovic and Oklobdzija,IEEE JSSC, April 1999, p536-548
ISSCC2001
66
References cont.• (9) “A Low Power Unified cache Architecture Providing Power and
Performance Flexibility”, Malik, Moyer and Cermak, proceedings ofISLPED 2000 p241
• (10) “Analysis of hazard contributions to power dissipation in CMOS Ics”,Benini, Favalli and Ricco, IWLPD 1994 p27-32
• (11) “A Low Power 16 by 16 Multiplier Using Transistion ReductionCircuitry” Lemonds and Mahant Shetti, IWLPD 1994
• (12) “Glitch Power Minimization by Selective Gate Freezing”, Benini, DeMicheli, Macii, Macii, Poncino and Scarsi, IEEE Transactions on VLSISystems, June 2000, p287-298
• (13) http://www.cs.man.ac.uk/amulet/
• (14) “A 1-V high-speed MTCMOS circuit scheme for power-downapplications”, Shigematsu et al, Symposium on VLSI Circuits, 1995
ISSCC2001
67
References cont.• (15) http://www.synopsys.com/products/etg/powermill_ds.html
• (16) http://www.mentor.com/mach/
• (17) http://www.synopsys.com/products/etg/amps_ds.html
• (18) http://www.avanticorp.com/Avant!/SolutionsProducts/Products/Item/1,1172,12,00.html
• (19) http://www.synopsys.com/products/etg/primepower_ds.html
• (20) http://www.sequencedesign.com/2_solutions/2b1_wwatcher.html
• (21) http://www.synopsys.com/products/power/power_bkg.html
• (22) http://www.sequencedesign.com/2_solutions/2b2_wsmith.html
• See also:• “Low-Power CMOS Digital Design”, Chandrakasan, Sheng and Brodersen,
IEEE JSSC, April 1992 p473-484
• “Low-Power CMOS Design”, Chandrakasan and Brodersen, IEEE Press,ISBN 0-7803-3429-9, 1998