Optimizing Power @ Design Time Interconnect and Clocks
Jan M. Rabaey
Low Power Design Essentials ©2008 Chapter 6
Optimizing Power @ Design Time
Interconnect and Clocks
Chapter Outline
Trends and bounds
An OSI approach to interconnect optimization
– Physical layer
– Data link and MAC
– Network
– Application
Clock distribution
ITRS Projections
Calendar Year                            2012         2018         2020
Interconnect One Half Pitch              35 nm        18 nm        14 nm
MOSFET Physical Gate Length              14 nm        7 nm         6 nm
Number of Interconnect Levels            12-16        14-18        14-18
On-Chip Local Clock                      20 GHz       53 GHz       73 GHz
Chip-to-Board Clock                      15 GHz       56 GHz       89 GHz
# of Hi Perf. ASIC Signal I/O Pads       2500         3100         3100
# of Hi Perf. ASIC Power/Ground Pads     2500         3100         3100
Supply Voltage                           0.7-0.9 V    0.5-0.7 V    0.5-0.7 V
Supply Current                           283-220 A    396-283 A    396-283 A
[Source: ITRS Roadmap, 2004, 2005]
Increasing Impact of Interconnect
Interconnect is now exceeding transistors in
– Latency
– Power dissipation
– Manufacturing complexity
Direct consequence of scaling
Communication Dominant Part of Power Budget
[Figure: pie charts of the power breakdown for three chip types. FPGA: Interconnect 65%, Clock 21%, I/O 9%, CLB 5%. µProcessor: Clocks, Caches, Execution Units, Control, I/O Drivers (slices of 40%, 20%, 15%, 15%, and 10%). Signal processor (ASSP): Clock, Logic, Memory, I/O.]
Idealized Wire Scaling Model
Parameter    Relation    Local Wire     Constant Length    Global Wire
W, H, t      -           1/S            1/S                1/S
L            -           1/S            1                  1/S_C
C            LW/t        1/S            1                  1/S_C
R            L/WH        S              S^2                S^2/S_C
t_p ~ CR     L^2/Ht      1              S^2                S^2/S_C^2
E            CV^2        1/(S U^2)      1/U^2              1/(S_C U^2)
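The relations in the table can be checked numerically. The following is a minimal Python sketch, not part of the original slides; the function name `wire_scaling` and the dictionary layout are illustrative assumptions. It applies the "Relation" column to each wire class, taking S as the dimension scaling factor, U as the voltage scaling factor, and S_C as the chip-size factor for global wires.

```python
def wire_scaling(S, U=1.0, S_C=1.0):
    """Scaling factors of the idealized wire model for each wire class.

    S   : scaling factor of the wire cross-section (W, H, t all scale as 1/S)
    U   : voltage scaling factor (V scales as 1/U)
    S_C : chip-size factor; global wire length scales as 1/S_C
    """
    lengths = {"local": 1.0 / S, "constant": 1.0, "global": 1.0 / S_C}
    result = {}
    for kind, L in lengths.items():
        W = H = t = 1.0 / S
        C = L * W / t           # capacitance relation LW/t
        R = L / (W * H)         # resistance relation L/WH
        result[kind] = {
            "L": L,
            "C": C,
            "R": R,
            "tp": R * C,        # delay ~ CR
            "E": C / U**2,      # energy ~ C V^2 with V scaled by 1/U
        }
    return result


if __name__ == "__main__":
    for kind, factors in wire_scaling(S=2.0, U=2.0, S_C=0.5).items():
        print(kind, factors)
```

With S = 2, the local-wire delay factor stays at 1 while the constant-length wire delay grows by S^2 = 4, which is the trend the table summarizes.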
Distribution of Wire Lengths on Chip
[Ref: J. Davis, C&S’98]
© IEEE 1998
Technology Innovations
Reduce dielectric permittivity (e.g. Aerogels or air)
Reduce resistivity (e.g. Copper)
Reduce wirelengths through 3D-integration
Novel interconnect media (carbon nanotubes, optical)
(Pictures courtesy of IBM and IFC FCRP)
© IEEE 1998
Logic Scaling
[Figure: power P [W] versus delay tp [s] on logarithmic axes (delay from 10^-12 to 10^0 s, power from 10^-15 to 10^0 W), with constant-energy diagonals from 10^-18 J to 10^-6 J. Trend: P·tp ~ 1/S^3]
[Ref: J. Davis, Proc’01]
Interconnect Scaling
[Figure: (Length)^-2, L^-2 [cm^-2] (10^-4 to 10^10) and Length, L [cm] (10^-5 to 10^2) versus delay t [s] (10^-18 to 10^-3) on logarithmic axes. Lines of constant L^-2 t are labeled 10^-13 (1000µ), 10^-11 (100µ), 10^-9 (10µ), 10^-7 (1µ), and L^-2 t = 10^-5 [s/cm^2] (F = 0.1µ). Trend: L^-2 t ~ S^2]
[Ref: J. Davis, Proc’01]
Lower Bounds on Interconnect Energy
Claude Shannon
Shannon’s theorem on maximum capacity of communication channel:
$C \le B \log_2\left(1 + \frac{P_S}{kTB}\right)$
C: capacity in bits/sec; B: bandwidth; P_S: average signal power
$E_{bit} = P_S / C$
$E_{bit}(\min) = E_{bit}(C/B \to 0) = kT \ln(2)$
Valid for an “infinitely long” bit transition (C/B → 0)
Equals 4·10^-21 J/bit at room temperature
[Ref: J. Davis, Proc’01]
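The bound can be evaluated directly. The sketch below is an illustration rather than material from the slides: it assumes the channel operates exactly at capacity, so that P_S = kTB(2^(C/B) - 1), and the function names and the 300 K default are assumptions. At 300 K, kT is about 4.1·10^-21 J and kT ln(2) is about 2.9·10^-21 J, so the figure quoted on the slide is best read as an order-of-magnitude value.

```python
import math

K_BOLTZMANN = 1.380649e-23  # J/K


def min_energy_per_bit(temperature_k=300.0):
    """Limit of E_bit for C/B -> 0: kT ln(2)."""
    return K_BOLTZMANN * temperature_k * math.log(2)


def energy_per_bit(c_over_b, temperature_k=300.0):
    """E_bit = P_S / C for a channel running at capacity.

    Inverting C = B log2(1 + P_S/(kTB)) gives P_S = kTB (2^(C/B) - 1),
    hence E_bit = kT (2^(C/B) - 1) / (C/B).
    """
    kt = K_BOLTZMANN * temperature_k
    # expm1 keeps precision when C/B is very small
    return kt * math.expm1(c_over_b * math.log(2)) / c_over_b


if __name__ == "__main__":
    print("kT ln2 at 300 K:", min_energy_per_bit())
    for ratio in (1e-6, 0.1, 1.0, 4.0):
        print("C/B =", ratio, "-> E_bit =", energy_per_bit(ratio))
```

The energy per bit rises as C/B increases, which is why the minimum is reached only for an arbitrarily slow bit transition.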
Reducing Interconnect Power/Energy
Same philosophy as with logic: reduce capacitance, voltage (or voltage swing), and/or activity
A major difference: sending bits from one point to another is fundamentally a communications/networking problem, and it helps to treat it as such.
Abstraction layers are different:
– For computation: device, gate, logic, micro-architecture
– For communication: wire, link, network, transport
It helps to organize along abstraction layers that are well understood in the networking world: the OSI protocol stack
OSI Protocol Stack
Reference model for wired and wireless protocol design; also a useful guide for the conception and optimization of on-chip communication
Layered approach allows for orthogonalization of concerns and decomposition of constraints
[Figure: OSI stack, top to bottom: Presentation/Application, Session, Transport, Network, Data Link, Physical]
No requirement to implement all layers of the stack
Layered structure need not be maintained in the final implementation
[Ref: M. Sgroi, DAC’01]
The Physical Layer
Transmit bits over physical interconnect medium (wire)
Physical medium
– Material choice, repeater insertion
Signal waveform
– Discrete levels, pulses, modulated sinusoids
Voltages
– Reduced swing
Timing, synchronization
So far, on-chip communication almost exclusively “level-based”
Repeater Insertion
Optimal repeater insertion results in wire delay linear with L:
$t_p \propto L \sqrt{(R_d C_d)(r_w c_w)}$
with R_d C_d and r_w c_w the intrinsic delays of inverter and wire, respectively
But: at a major energy cost!
Repeater Insertion - Example
1 cm Cu wire in 90 nm technology (on intermediate layers)
– r_w = 250 Ω/mm; c_w = 200 fF/mm
– t_p = 0.69 r_w c_w L^2 = 3.45 nsec
Optimal driver insertion:
– t_p,opt = 0.5 nsec
– Requires insertion of 13 repeaters
– Energy per transition 8 times larger than just charging the wire (6 pJ versus 0.75 pJ)!
It pays to back off!
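The arithmetic of this example can be reproduced in a few lines. This is a small Python sketch using only the numbers quoted on the slide; the function name `unrepeated_wire_delay` is an illustrative assumption.

```python
def unrepeated_wire_delay(r_per_mm, c_per_mm, length_mm):
    """Distributed-RC delay of an unrepeated wire: 0.69 * r_w * c_w * L^2."""
    return 0.69 * r_per_mm * c_per_mm * length_mm**2


R_W = 250.0        # ohm/mm, from the slide
C_W = 200e-15      # F/mm, from the slide
LENGTH_MM = 10.0   # 1 cm

t_wire = unrepeated_wire_delay(R_W, C_W, LENGTH_MM)
t_opt = 0.5e-9               # s, delay with optimal repeater insertion (slide value)
e_repeated = 6e-12           # J per transition with repeaters (slide value)
e_wire_only = 0.75e-12       # J per transition for the wire alone (slide value)

if __name__ == "__main__":
    print("unrepeated delay [ns]:", t_wire * 1e9)
    print("speed-up from repeaters:", t_wire / t_opt)
    print("energy penalty:", e_repeated / e_wire_only)
```

The repeaters buy roughly a 7x reduction in delay for an 8x increase in energy, which is the motivation for backing off from the minimum-delay point.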
Wire Energy-Delay Trade-off
[Figure: normalized energy eNorm (0.1 to 1) versus normalized delay dNorm (1 to 8) for L = 1 cm (Cu) in 90 nm CMOS. The curve starts at (dMin, eMax); the gap between the curve and the "wire energy only" level is the repeater overhead.]
Multi-dimensional Optimization
Design parameters: voltage, number of stages, buffer sizes
Voltage scaling has the largest impact, followed by selection of the number of repeaters
Transistor sizing is secondary.
[Figure: number of stages (0 to 12) and VDD (0.5 V to 1.2 V) versus normalized delay dNorm (1 to 8)]
Reduced Swing
$E_{bit} = C V_{DD} V_{swing}$
Concerns:
– Overhead (area, delay)
– Robustness (supply noise, crosstalk, process variations)
– Repeaters?
[Figure: transmitter (TX) driving a wire into a receiver (RX)]
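To get a feel for the numbers, the energy expression can be evaluated for the 1 cm wire of the repeater example (2 pF at 200 fF/mm). This is a minimal sketch; the 1.0 V supply and the 200 mV swing are assumed values chosen for illustration, and `energy_per_bit_swing` is a hypothetical helper name.

```python
def energy_per_bit_swing(c_wire, vdd, v_swing):
    """E_bit = C * VDD * Vswing: charge C*Vswing is drawn from the VDD supply."""
    return c_wire * vdd * v_swing


C_WIRE = 2e-12   # F, 1 cm of wire at 200 fF/mm
VDD = 1.0        # V, assumed supply voltage
V_SWING = 0.2    # V, assumed reduced swing

e_full = energy_per_bit_swing(C_WIRE, VDD, VDD)
e_reduced = energy_per_bit_swing(C_WIRE, VDD, V_SWING)

if __name__ == "__main__":
    print("full swing    [pJ]:", e_full * 1e12)
    print("reduced swing [pJ]:", e_reduced * 1e12)
    print("saving factor:", e_full / e_reduced)
```

With the swing derived from the full supply, the saving is linear in the swing reduction; the overhead of the transmitter and receiver circuits is not included in this estimate.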
Traditional Level Converter
Requires two discrete voltage levels
Asynchronous level conversion adds extra delay
[Figure: level-converter circuit with supplies VDDH and VDDL, input in, wire load CL, and outputs OUT]
[Ref: H. Zhang, TVLSI’00]
Avoiding Extra References
[Figure: single-supply (VDD) reduced-swing circuit with transistors N1-N3 and P1-P3, nodes in, in2, and out, and load CL; insets show the VTC and the transient response]
[Ref: H. Zhang, TVLSI’00]
Differential (Clocked) Signaling
Allows for very low swings (200 mV)
Robust
Quadratic energy savings
But: doubling the wiring, extra clock signal, complexity
[Figure: clocked differential driver and receiver with signals in, REF, clk, d, d_b, out, and out_b, supply VDD, and wire loads CL]
[Ref: T. Burd, UCB’01]
Lower Bound on Signal Swing?
Reduction of signal swing translates into higher power dissipation in receiver – trade-off between wire and receiver energy dissipation
Reduced SNR impacts reliability – current on-chip interconnect strategies require Bit Error Rate (BER) of zero (in contrast to communication and network links)
– Noise sources: power supply noise, crosstalk
Swings as low as 200 mV have been reported [Ref: Burd’00], 100 mV definitely possible
Further reduction requires crosstalk suppression
[Figure: crosstalk suppression by shielding (GND wires next to signal wires) and by folding]
Quasi-Adiabatic Charging
• Uses stepwise approximation of adiabatic (dis)charging
• Capacitors acting as “charge reservoir”
• Energy drawn from supply reduced by factor N
[Figure: voltage V versus time t, rising to VDD in steps of VDD/N; tank capacitors CT1, CT2, ..., CTN-1 hold the intermediate levels]
[Ref: L. Svensson, ISLPED’96]
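The factor-N saving follows from summing the loss of each small step. The sketch below is an idealized illustration, assuming the tank capacitors are large enough to behave as ideal voltage sources at multiples of VDD/N and that each step fully settles; the function name is hypothetical.

```python
def stepwise_charging_loss(c_load, vdd, n_steps):
    """Energy dissipated while charging c_load from 0 to vdd in n equal steps.

    Charging a capacitor by dv through a resistive switch from an ideal
    source dissipates 0.5 * C * dv^2, independent of the switch resistance.
    """
    dv = vdd / n_steps
    loss = 0.0
    for _ in range(n_steps):
        loss += 0.5 * c_load * dv**2
    return loss


if __name__ == "__main__":
    C_LOAD, VDD = 1e-12, 1.0   # assumed example values
    conventional = stepwise_charging_loss(C_LOAD, VDD, 1)
    for n in (1, 2, 4, 8):
        loss = stepwise_charging_loss(C_LOAD, VDD, n)
        print(n, "steps: loss =", loss, "J, reduction =", conventional / loss)
```

The total loss is C·VDD^2/(2N) per charging event, so the reduction relative to conventional charging equals the number of steps N; the cost is the extra switches, the tank capacitors, and a slower transition.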
Charge Redistribution Schemes
Charge recycled from top to bottom
Precharge phase equalizes differential lines
Energy/bit = 2C(VDD/N)^2
Challenges: receiver design, noise margins
[Figure: differential bus pairs B0 and B1 stacked between VDD and GND with precharge (P) and evaluate (E) switches and receivers RX0 and RX1; waveforms over the Precharge, Eval, Precharge phases for B0 = 0 and B1 = 1, with levels VDD/4, VDD/2, and 3VDD/4]
[Ref: H. Yamauchi, JSSC’95]
Alternative Communication Schemes
Example: Capacitively-driven wires
Offers some compelling advantages
– Reduced swing: swing is VDD/(n+1) without extra supply
– Reduced load: allows for smaller driver
– Reduced delay: capacitor pre-emphasizes edges
Pitchfork capacitors exploit sidewall capacitance
[Ref: D. Hopkins, ISSCC’07]
Signaling Protocols
Globally asynchronous self-timed handshaking protocol
Allows individual modules to dynamically trade off performance for energy-efficiency
[Figure: processor module (µProc, ALU, MPY, SRAM…) connected to the network through din/reqin/ackin and dout/reqout/ackout; internal signals Din, REQin, and done]
Signaling Protocols
Locally synchronous, globally asynchronous
[Figure: a physical layer interface module sits between the processor module (µProc, ALU, MPY, SRAM…) and the network. The network side uses din/reqin/ackin and dout/reqout/ackout; the module side uses din, dout, and clk, with internal signals Din, REQin, Clk, and done]
The Data Link / Media Access Layer
Reliable transmission over physical link and sharing interconnect medium between multiple sources and destinations (MAC)
– Bundling, serialization, packetizing
– Error detection and correction
– Coding
– Multiple-access schemes
Coding
[Figure: N-bit data enters the encoder at TX, travels over the link as N + k bits, and is restored to N bits by the decoder at RX]
Adding redundancy to communication link (extra bits) to:
– Reduce transitions (activity encoding)
– Reduce energy/bit (error-correcting coding)
Activity Reduction Through Coding
Example: Bus-Invert Coding
[Figure: encoder maps the N-bit word D onto N + 1 wires (Denc plus invert bit p); decoder recovers D]
Data word D inverted if Hamming distance from previous is larger than N/2.
D           # T    Denc        p    # T
00101010    -      00101010    0    -
00111011    2      00111011    0    2
11010100    7      00101011    1    1+1
00001101    5      00001101    0    3+1
01110110    6      10001001    1    2+1
…
[Ref: M. Stan, TVLSI’95]
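The encoding rule is compact enough to express in a few lines. The following Python sketch is an illustration, not the authors' implementation: it compares each new word with the previous word on the bus and inverts it when more than N/2 data lines would toggle. The function names are assumptions. Run on the five example words, it reproduces the Denc and p columns of the example.

```python
def hamming_distance(a, b):
    """Number of bit positions in which a and b differ."""
    return bin(a ^ b).count("1")


def bus_invert_encode(words, n_bits):
    """Return a list of (encoded_word, invert_bit) pairs."""
    mask = (1 << n_bits) - 1
    previous = None          # last value driven on the data wires
    encoded = []
    for d in words:
        if previous is not None and hamming_distance(d, previous) > n_bits / 2:
            enc, p = ~d & mask, 1   # send the complement and raise the invert line
        else:
            enc, p = d, 0
        encoded.append((enc, p))
        previous = enc
    return encoded


def bus_invert_decode(encoded, n_bits):
    """Recover the original words from (encoded_word, invert_bit) pairs."""
    mask = (1 << n_bits) - 1
    return [(~enc & mask) if p else enc for enc, p in encoded]


if __name__ == "__main__":
    data = [0b00101010, 0b00111011, 0b11010100, 0b00001101, 0b01110110]
    for enc, p in bus_invert_encode(data, 8):
        print(format(enc, "08b"), p)
```

Decoding needs only the invert bit, so the receiver is a row of XOR gates; the encoder needs a Hamming-distance computation (a majority vote over the XOR of old and new words).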
Bus-Invert Coding
Gain: 25% (at best – for random data)
Overhead:
– Extra wire (and activity)
– Encoder, decoder
Not effective for correlated data
[Figure: data D passes through the encode logic and a register onto the bus as Denc plus invert bit p; decode logic at the far end recovers D]
[Ref: M. Stan, TVLSI’95]
Other Transition Coding Schemes
Advanced bus-invert coding (e.g. partition bus into sub-components) (e.g. [M. Stan, TVLSI’97])
Coding for address busses (which often display sequentiality) (e.g. [L. Benini, DATE’98])
Full-fledged channel coding, borrowed from communication links (e.g. [S. Ramprasad, TVLSI’99])
Coding to reduce impact of Miller capacitance between neighboring wires [Ref: Sotiriadis, ASPDAC’01]
bit k-1    bit k    bit k+1    Delay factor g
↑          ↑        ↑          1
↑          ↑        −          1 + r
↑          ↑        ↓          1 + 2r
−          ↑        −          1 + 2r
−          ↑        ↓          1 + 3r
↓          ↑        ↓          1 + 4r
(↑ rising, ↓ falling, − no transition)
The last row is the maximum capacitance transition – can be avoided by coding
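The six rows of the table follow a simple pattern: each neighbor adds r times the difference between its transition and that of the middle wire. The Python sketch below encodes that pattern; the closed-form expression is inferred here from the table rather than quoted from the reference, and r is taken to be the ratio of inter-wire (coupling) capacitance to ground capacitance, which the slide does not state explicitly.

```python
UP, NONE, DOWN = 1, 0, -1   # rising, no transition, falling


def delay_factor(left, middle, right, r):
    """Delay factor g of the middle wire for given neighbor transitions.

    Each neighbor contributes r * |middle - neighbor|: 0 when it switches
    in the same direction, r when it is quiet, 2r when it switches the
    opposite way (Miller effect).
    """
    return 1 + r * (abs(middle - left) + abs(middle - right))


if __name__ == "__main__":
    r = 2.0   # assumed coupling-to-ground capacitance ratio
    patterns = [
        (UP, UP, UP), (UP, UP, NONE), (UP, UP, DOWN),
        (NONE, UP, NONE), (NONE, UP, DOWN), (DOWN, UP, DOWN),
    ]
    for left, middle, right in patterns:
        print(left, middle, right, "-> g =", delay_factor(left, middle, right, r))
```

A code that forbids opposite transitions on adjacent wires caps g at 1 + 2r, which is how coding can remove the worst-case row.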
Error-Correcting Codes
[Figure: encoder maps N-bit data D onto N + k bits Denc; decoder recovers D]
Adding redundancy allows for more aggressive scaling of signal swings and/or timing
Simpler codes such as Hamming prove most effective
Example: (4,3,1) Hamming Code
Code word: P1 P2 B3 P4 B5 B6 B7, with
P1 + B3 + B5 + B7 = 0
P2 + B3 + B6 + B7 = 0
P4 + B5 + B6 + B7 = 0
e.g. if the three checks (P1, P2, P4) evaluate to (1, 1, 0), the error position is 1 + 2 = 3: B3 is wrong
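The three parity equations are enough to build a working single-error corrector. The Python sketch below is an illustration with hypothetical function names; it uses the standard position assignment in which P1 checks positions 1, 3, 5, 7, P2 checks positions 2, 3, 6, 7, and P4 checks positions 4, 5, 6, 7, with all sums taken modulo 2.

```python
def hamming_encode(b3, b5, b6, b7):
    """Return the 7-bit code word [P1, P2, B3, P4, B5, B6, B7]."""
    p1 = b3 ^ b5 ^ b7
    p2 = b3 ^ b6 ^ b7
    p4 = b5 ^ b6 ^ b7
    return [p1, p2, b3, p4, b5, b6, b7]


def hamming_correct(word):
    """Correct a single-bit error; return (corrected_word, error_position).

    error_position is 0 when all three parity checks pass, otherwise the
    1-based index of the flipped bit.
    """
    w = list(word)
    s1 = w[0] ^ w[2] ^ w[4] ^ w[6]   # check on P1, B3, B5, B7
    s2 = w[1] ^ w[2] ^ w[5] ^ w[6]   # check on P2, B3, B6, B7
    s4 = w[3] ^ w[4] ^ w[5] ^ w[6]   # check on P4, B5, B6, B7
    position = s1 + 2 * s2 + 4 * s4
    if position:
        w[position - 1] ^= 1
    return w, position


if __name__ == "__main__":
    code = hamming_encode(1, 0, 1, 1)
    received = list(code)
    received[2] ^= 1                  # corrupt B3 on the link
    print(code, received, hamming_correct(received))
```

In the usage example the checks evaluate to (1, 1, 0), which points at position 3, matching the slide's example of a wrong B3.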
Media Access
Sharing of physical media over multiple data streams increases capacitance and activity (see Chapter 5), but reduces area
Many multi-access schemes known from communications
– Time domain: Time-Division Multiple Access (TDMA)
– Frequency domain: narrow band, code division multiplexing
Buses based on arbitration-based TDMA most common in today’s ICs
Bus Protocols and Energy
Some lessons from the communications world:
– When utilization is low, simple schemes are more effective
– When traffic is intense, reservation of resources minimizes overhead and latency (collisions, resends)
Combining the two leads to energy efficiency
Example: SiliconBackplane MicroNetwork
Independent arbitration for every cycle includes two phases:
- Distributed TDMA for guaranteed latency/bandwidth
- Round robin for random access
[Figure: bus timeline showing the current slot with arbitration and command phases]
[Courtesy: Sonics, Inc]
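The two-phase idea can be illustrated with a toy arbiter. The Python sketch below is a generic model written for this transcript, not the Sonics implementation: the slot table, the agent numbering, and the function `arbitrate` are all hypothetical. Each cycle is first offered to the owner of the current TDMA slot; if the owner has nothing to send, the cycle falls back to round robin among the requesting agents.

```python
def arbitrate(cycle, tdma_slots, requesting, rr_next, n_agents):
    """Pick the bus winner for one cycle.

    cycle      : cycle index
    tdma_slots : list of slot owners (agent id, or None for an unreserved slot)
    requesting : set of agent ids that want the bus this cycle
    rr_next    : agent id at which the round-robin search starts
    n_agents   : total number of agents

    Returns (winner or None, updated rr_next).
    """
    owner = tdma_slots[cycle % len(tdma_slots)]
    if owner is not None and owner in requesting:
        return owner, rr_next                   # phase 1: reserved slot is used
    for offset in range(n_agents):              # phase 2: round-robin fallback
        agent = (rr_next + offset) % n_agents
        if agent in requesting:
            return agent, (agent + 1) % n_agents
    return None, rr_next                        # bus stays idle


if __name__ == "__main__":
    slots = [0, None, 1, None]                  # agents 0 and 1 hold reservations
    rr = 0
    traffic = [{0, 2}, {2, 3}, {3}, set(), {1, 2}]
    for cycle, req in enumerate(traffic):
        winner, rr = arbitrate(cycle, slots, req, rr, n_agents=4)
        print("cycle", cycle, "requests", sorted(req), "-> winner", winner)
```

Reserved slots give their owners a latency bound, while unused slots are not wasted, which is the combination of the two lessons listed above.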
The Network Layer
Topology-independent end-to-end communication over multiple data links (routing, bridging, repeaters)
– Topology
– Static versus dynamic configuration/routing
Becoming more important in today’s complex multi-processor designs: “The Network-on-a-Chip (NoC)”
[Ref: G. De Micheli, Morgan Kaufmann’06]
Network-on-a-Chip (NoC)
Dedicated networks with reserved links preferable for high-traffic channels – but: limited connectivity, area overhead
Flexibility an increasing requirement in multi-core (many-core) chip implementations
The Network Trade-offs
Interconnect-oriented architecture trades off flexibility, latency, energy and area-efficiency through the following concepts
– Locality: eliminate global structures
– Hierarchy: expose locality in communication requirements
– Concurrency/Multiplexing
Very similar to architectural space trade-offs
[Figure: dedicated wiring compared with a network-on-a-chip in which each tile has a processor, local logic, and a router, connected by network wires]
[Courtesy: B. Dally, Stanford]
Networking Topology
Homogeneous
– Crossbar, Butterfly, Torus, Mesh, Tree, …
Heterogeneous
– Hierarchy
[Figure: example topologies: crossbar, mesh (FPGA), tree]
Network Topology Exploration
[Figure: two plots of energy x delay versus Manhattan distance; the first compares mesh and binary tree, the second adds mesh + inverse]
Short connections in tree are redundant
Inverse clustering complements mesh
[Ref: V. George, Springer’01]
Circuit-Switched versus Packet-Based
On-chip reality: wires (bandwidth) are relatively cheap, buffering and routing expensive
Packet-switched approach versatile
– Preferred approach in large networks
– But … routers come with large overhead
– Case study Intel: 18% of power in link, 82% in router
Circuit-switched approach attractive for high-data-rate quasi-static links
Hierarchical combination often preferred choice
[Figure: clusters (C) share a bus to connect over short distances; routers (R) form hierarchical circuit- and packet-switched networks for longer connections]
Example: The Pleiades Network-on-a-Chip
• Configurable platform for low-energy communication and signal-processing applications (see Chapter 5)
• Allows for dynamic task-level reconfiguration of process network
Energy-efficient flexible network essential to the concept
[Figure: µP, arithmetic modules, dedicated arithmetic, and configurable logic attached through network interfaces to a configurable interconnect; a configuration bus distributes the configuration]
[Ref: H. Zhang, JSSC’00]
Pleiades Network Layer
Hierarchical reconfigurable mesh network
• Network statically configured at start of session and ripped up at end
• Structured approach reduces interconnect energy by a factor of 7 over a straightforward crossbar
[Figure: level-1 mesh with universal switchboxes inside the clusters; level-2 mesh with hierarchical switchboxes between clusters]
Top Layers of the OSI Stack
Abstracts communication architecture to system and performs data formatting and conversion
Establishes and maintains end-to-end communications – flow control, message reordering, packet segmentation and reassembly
Example: establish, maintain and rip up connections in dynamically reconfigurable Systems-on-a-Chip – important in power management
What About Clock Distribution?
Clock easily the most energy-consuming signal of a chip
– Largest length
– Largest fanout
– Most activity (α = 1)
Skew control adding major overhead
– Intermediate clock repeaters
– De-skewing elements
Opportunities
– Reduced swing
– Alternative clock distribution schemes
– Avoiding a global clock altogether
Reduced-Swing Clock Distribution
Similar to reduced-swing interconnect
Relatively easy to implement
But: extra delay in flip-flops adds directly to clock period
Example: half-swing clock distribution scheme
[Figure: waveforms of the NMOS clock and PMOS clock between GND and VDD, for the regular 2-phase clock and for the half-swing clock]
[Ref: H. Kojima, JSSC’95]
© IEEE 1995
Alternative Clock Distribution Schemes
Canceling skew in perfect transmission line scenario
Example: Transmission-Line Based Clock Distribution
[Ref: V. Prodanov, CICC’06]
© IEEE 2006
Summary
Interconnect important component of overall power dissipation
Structured approach with exploration at different abstraction layers most effective
Lot to be learned from communications and networking community – yet, techniques must be applied judiciously
– Cost relationship between active and passive components different
Some exciting possibilities for the future: 3D-integration, novel interconnect materials, optical or wireless I/O
References
Books and Book Chapters
T. Burd, “Energy-Efficient Processor System Design,” http://bwrc.eecs.berkeley.edu/Publications/2001/THESES/energ_eff_process-sys_des/index.htm, UCB, 2001.
G. De Micheli and L. Benini, “Networks on Chips: Technology and Tools,” Morgan Kaufmann, 2006.
V. George and J. Rabaey, “Low-Energy FPGAs: Architecture and Design,” Springer, 2001.
J. Rabaey, A. Chandrakasan, B. Nikolic, “Digital Integrated Circuits: A Design Perspective,” 2nd ed., Prentice Hall, 2003.
C. Svensson, “Low-Power and Low-Voltage Communication for SoC’s,” in C. Piguet, Low-Power Electronics Design, Ch. 14, CRC Press, 2005.
L. Svensson, “Adiabatic and Clock-Powered Circuits,” in C. Piguet, Low-Power Electronics Design, Ch. 15, CRC Press, 2005.
G. Yeap, “Special Techniques,” in Practical Low Power Digital VLSI Design, Ch. 6, Kluwer Academic Publishers, 1998.
Articles
L. Benini et al., “Address bus encoding techniques for system-level power optimization,” Proceedings DATE’98, pp. 861-867, Paris, February 1998.
T. Burd et al., “A Dynamic Voltage Scaled Microprocessor System,” IEEE ISSCC Digest of Technical Papers, pp. 294-295, Feb. 2000.
M. Chang et al., “CMP Network-on-Chip Overlaid with Multi-Band RF Interconnect,” International Symposium on High-Performance Computer Architecture, Feb. 2008.
D.M. Chapiro, “Globally Asynchronous Locally Synchronous Systems,” PhD thesis, Stanford University, 1984.
W. Dally, “Route Packets, Not Wires: On-Chip Interconnect Networks,” Proceedings DAC 2001, pp. 684-689, Las Vegas, June 2001.
J. Davis and J. Meindl, “Is Interconnect the Weak Link?,” IEEE Circuits and Systems Magazine, pp. 30-36, March 1998.
J. Davis et al., “Interconnect Limits on Gigascale Integration (GSI) in the 21st Century,” Proceedings of the IEEE, Vol. 89, No. 3, pp. 305-324, March 2001.
D. Hopkins et al., “Circuit techniques to enable 430Gb/s/mm2 proximity communication,” IEEE International Solid-State Circuits Conference, vol. XL, pp. 368-369, February 2007.
H. Kojima et al., “Half-Swing Clocking Scheme for 75% Power Saving in Clocking Circuitry,” IEEE Journal of Solid-State Circuits, vol. 30, no. 4, pp. 432-435, April 1995.
E. Kusse and J. Rabaey, “Low-energy embedded FPGA structures,” Proceedings ISLPED’98, pp. 155-160, Monterey, Aug. 1998.
V. Prodanov and M. Banu, “GHz Serial Passive Clock Distribution in VLSI using Bidirectional Signaling,” Proceedings CICC 06.
S. Ramprasad et al., “A coding framework for low-power address and data busses,” IEEE Transactions on VLSI Systems, Vol. 7, No. 2, pp. 212-221, June 1999.
M. Sgroi et al., “Addressing the System-on-a-Chip Woes Through Communication-Based Design,” Proceedings DAC 2001, pp. 678-683, Las Vegas, June 2001.
P. Sotiriadis and A. Chandrakasan, “Reducing Bus Delay in Submicron Technology Using Coding,” Proceedings ASPDAC Conference, Yokohama, January 2001.
M. Stan and W. Burleson, “Bus-Invert Coding for Low-Power I/O,” IEEE Transactions on VLSI, pp. 48-58, March 1995.
M. Stan and W. Burleson, “Low-Power Encodings for Global Communication in CMOS VLSI,” IEEE Transactions on VLSI Systems, pp. 444-455, Dec. 1997.
V. Sathe, J.-Y. Chueh, and M. C. Papaefthymiou, “Energy-Efficient GHz-Class Charge-Recovery Logic,” IEEE JSSC, vol. 42, no. 1, pp. 38-47, January 2007.
L. Svensson et al., “A sub-CV2 pad Driver with 10 ns Transition Time,” Proc. ISLPED 96, Monterey, Aug. 12-14, 1996.
D. Wingard, “Micronetwork-Based Integration for SOCs,” Proceedings DAC 01, pp. 673-677, Las Vegas, June 2001.
H. Yamauchi et al., “An Asymptotically Zero Power Charge Recycling Bus,” IEEE Journal of Solid-State Circuits, vol. 30, no. 4, pp. 423-431, April 1995.
H. Zhang, V. George and J. Rabaey, “Low-Swing on-chip Signaling Techniques: Effectiveness and Robustness,” IEEE Transactions on VLSI Systems, Vol. 8, No. 3, pp. 264-272, June 2000.
H. Zhang et al., “A 1V Heterogeneous Reconfigurable Processor IC for Baseband Wireless Applications,” IEEE Journal of Solid-State Circuits, vol. 35, no. 11, pp. 1697-1704, Nov. 2000.