1
The Optimization of High-Performance Digital Circuits
Andrew Conn (with Michael Henderson and Chandu Visweswariah)
IBM Thomas J. Watson Research Center
Yorktown Heights, NY
4
Dynamic vs. static optimization
Dynamic: Nonlinear optimizer <-> Simulator
Static: Nonlinear optimizer <-> Static timing analyzer
In both loops the optimizer supplies transistor and wire sizes and receives function and gradient values.
5
Dynamic vs. static optimization
Dynamic                                             Static
Examples: DELIGHT.SPICE, JiffyTune                  Examples: TILOS, Contrast, EinsTuner
Optimizes specified paths                           Optimizes all paths
Needs input vectors                                 None required
Requires carefully thought-out problem statement    Automatic
Hard to use                                         Easy
Not susceptible to false paths, pessimism           Susceptible
Easier to add noise/power considerations            Harder
7
EinsTuner: formal static optimizer
Static transistor-level timer: EinsTLT
Embedded time-domain simulator: SPECS
Nonlinear optimizer: LANCELOT
The optimizer supplies transistor and wire sizes and receives function and gradient values.
8
Components of EinsTuner
1. Read netlist; create timing graph (EinsTLT)
2. Formulate pruned optimization problem
3. Feed problem to nonlinear optimizer (LANCELOT)
4. Solve optimization problem, call simulator for delays/slews and gradients thereof
5. Obtain converged solution
6. Snap-to-grid; back-annotate; re-time
Fast simulation and incremental sensitivity computation (SPECS) support the solve step.
9
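Static optimization formulation
\[
\begin{aligned}
\min\ & [\max(AT_7, AT_8)] \\
\text{s.t.}\ & AT_7 = \max[AT_5 + d_{57},\ AT_6 + d_{67}] \\
\text{s.t.}\ & AT_8 = \max[AT_5 + d_{58},\ AT_6 + d_{68}] \\
\text{s.t.}\ & AT_5 = \max[AT_1 + d_{15},\ AT_2 + d_{25}] \\
\text{s.t.}\ & AT_6 = \max[AT_3 + d_{36},\ AT_4 + d_{46}]
\end{aligned}
\]
[Figure: timing graph. Inputs AT_1 to AT_4 feed nodes 5 and 6 through delays d_15, d_25, d_36, d_46; nodes 5 and 6 feed outputs 7 and 8 through delays d_57, d_58, d_67, d_68.]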
10
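Digression: minimax optimization
\[
\min_x\ [\max\{f_1(x), f_2(x)\}]
\quad\Longleftrightarrow\quad
\begin{aligned}
\min_{x,z}\ & z \\
\text{s.t.}\ & z \ge f_1(x) \\
\text{s.t.}\ & z \ge f_2(x)
\end{aligned}
\]
[Figure: plot of f_1 and f_2 against x, marking x_0, x^*, z and z^*.]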
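The equivalence can be checked numerically. The sketch below is only an illustration: the two functions f1 and f2 are hypothetical stand-ins (not from the talk), and a plain golden-section search replaces a real nonlinear optimizer. It minimizes the nonsmooth max directly, then confirms that the resulting z satisfies the smooth constraints z >= f1(x) and z >= f2(x).

```python
import math

def f1(x):
    return (x - 1.0) ** 2      # hypothetical example function

def f2(x):
    return (x + 1.0) ** 2      # hypothetical example function

def golden_section_min(func, lo, hi, tol=1e-9):
    """Minimize a unimodal function on [lo, hi] by golden-section search."""
    inv_phi = (math.sqrt(5.0) - 1.0) / 2.0
    a, b = lo, hi
    while b - a > tol:
        c = b - inv_phi * (b - a)
        d = a + inv_phi * (b - a)
        if func(c) < func(d):
            b = d
        else:
            a = c
    return 0.5 * (a + b)

# Original (nonsmooth) form: min over x of max{f1(x), f2(x)}.
x_star = golden_section_min(lambda x: max(f1(x), f2(x)), -3.0, 3.0)

# Remapped (smooth) form: min z s.t. z >= f1(x), z >= f2(x).
# For a fixed x the smallest feasible z is max{f1(x), f2(x)}.
z_star = max(f1(x_star), f2(x_star))
print(f"x* = {x_star:.6f}, z* = {z_star:.6f}")
```

For these two parabolas the minimizer sits where the curves cross (x near 0, z near 1), which is the kink that makes the original form awkward for gradient-based methods.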
11
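Remapped problem
Original problem:
\[
\begin{aligned}
\min\ & [\max(AT_7, AT_8)] \\
& AT_8 = \max[AT_5 + d_{58},\ AT_6 + d_{68}] \\
& AT_7 = \max[AT_5 + d_{57},\ AT_6 + d_{67}] \\
& AT_5 = \max[AT_1 + d_{15},\ AT_2 + d_{25}] \\
& AT_6 = \max[AT_3 + d_{36},\ AT_4 + d_{46}]
\end{aligned}
\]
Remapped problem:
\[
\begin{aligned}
\min\ & z \\
\text{s.t.}\ & z \ge AT_7, \quad z \ge AT_8 \\
& AT_8 \ge AT_5 + d_{58}, \quad AT_8 \ge AT_6 + d_{68} \\
& AT_7 \ge AT_5 + d_{57}, \quad AT_7 \ge AT_6 + d_{67} \\
& AT_5 \ge AT_1 + d_{15}, \quad AT_5 \ge AT_2 + d_{25} \\
& AT_6 \ge AT_3 + d_{36}, \quad AT_6 \ge AT_4 + d_{46}
\end{aligned}
\]
[Figure: the same eight-node timing graph, with arrival times AT_1 to AT_8 and delays d_15, d_25, d_36, d_46, d_57, d_58, d_67, d_68.]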
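For fixed delays, the smallest arrival times that satisfy the remapped inequalities are the ones a block-based timer would propagate. The sketch below illustrates this on the eight-node example graph; the delay values and the zero input arrival times are hypothetical, and in the real problem the delays depend on the transistor and wire sizes being tuned.

```python
# Hypothetical delays (arbitrary units) for the eight-node example graph.
delays = {
    (1, 5): 2.0, (2, 5): 3.0,
    (3, 6): 1.0, (4, 6): 4.0,
    (5, 7): 2.0, (6, 7): 1.0,
    (5, 8): 3.0, (6, 8): 2.0,
}
# Arrival times at the primary inputs (assumed zero here).
at = {1: 0.0, 2: 0.0, 3: 0.0, 4: 0.0}

# Block-based propagation: AT_j = max over fanins i of (AT_i + d_ij).
for j in (5, 6, 7, 8):  # nodes in topological order
    at[j] = max(at[i] + d for (i, k), d in delays.items() if k == j)

z = max(at[7], at[8])

# Every inequality of the remapped problem holds at this point.
assert all(at[j] >= at[i] + d for (i, j), d in delays.items())
assert z >= at[7] and z >= at[8]
print(at, z)
```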
13
Springs and planks analogy
[Figure: four frames of the springs-and-planks picture of the timing graph, labeled with z, AT_1 to AT_8, and delays d_15, d_25, d_36, d_46, d_57, d_67, d_58, d_68.]
14
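Algorithm animation: inv3
[Figure: animation frame plotting delay against logic stages, with PIs ordered by criticality; gates and wires are drawn separately.]
Legend: Red = critical; Green = non-critical; Curvature = sensitivity; Thickness = transistor size
• One such frame per iteration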
16
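Constraint generation
\[
\begin{aligned}
AT_j^f &\ge AT_i^r + d_{ij}^f(w_n, w_p, cout_j, slew_i^r) \\
AT_j^r &\ge AT_i^f + d_{ij}^r(w_n, w_p, cout_j, slew_i^f) \\
slew_j^f &\ge s_{ij}^f(w_n, w_p, cout_j, slew_i^r) \\
slew_j^r &\ge s_{ij}^r(w_n, w_p, cout_j, slew_i^f)
\end{aligned}
\]
[Figure: a gate between nodes i and j, with transistor widths w_n, w_p and load capacitance cout_j.]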
17
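Statement of the problem
\[
\begin{aligned}
\min\ & z \\
\text{s.t.}\ & z \ge AT_i - RAT_i && \text{for all POs} \\
\text{s.t.}\ & AT_j \ge AT_i + d_{ij}(W, S) && \text{for each timing arc} \\
\text{s.t.}\ & s_j \ge s_{ij}(W, S) && \text{for each timing arc} \\
\text{s.t.}\ & S_i \le S_{\max}^{PO} && \text{for all POs} \\
\text{s.t.}\ & S_i \le S_{\max}^{internal} && \text{for all internal nets} \\
\text{s.t.}\ & W_{\min} \le W_i \le W_{\max} && \text{for all FETs} \\
\text{s.t.}\ & \beta_{\min} \le \beta(W) \le \beta_{\max} && \text{for all gates} \\
\text{s.t.}\ & pincap_i(W) \le target_i && \text{for all PIs} \\
\text{s.t.}\ & area(W) \le target
\end{aligned}
\]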
18
SPECS: fast simulation
• Two orders of magnitude faster than SPICE
• 5% typical stage delay and slew accuracy; 20% worst-case
• Event-driven algorithm
• Simplified device models
• Specialized integration methods
• Invoked via a programming interface
• Accurate gradients indispensable
20
LANCELOT algorithms
• Uses augmented Lagrangian for nonlinear constraints
\[ \Phi(x, \lambda) = f(x) + \sum_i \left[ \lambda_i c_i(x) + c_i(x)^2 / (2\mu) \right] \]
• Simple bounds handled explicitly
• Adds slacks to inequalities
• Trust region method
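A textbook augmented Lagrangian iteration can be sketched on a toy problem. This is not LANCELOT's implementation (which minimizes within a trust region, subject to bounds, with its own update rules); the problem min x1^2 + x2^2 s.t. x1 + x2 - 1 = 0, the closed-form inner minimizer, and the fixed penalty parameter mu are all assumptions made for illustration.

```python
def c(x1, x2):
    return x1 + x2 - 1.0          # equality constraint c(x) = 0

def phi(x1, x2, lam, mu):
    """Augmented Lagrangian f(x) + lam*c(x) + c(x)^2 / (2*mu) for f = x1^2 + x2^2."""
    return x1 ** 2 + x2 ** 2 + lam * c(x1, x2) + c(x1, x2) ** 2 / (2.0 * mu)

def inner_minimizer(lam, mu):
    """Closed-form minimizer of phi over x for this toy problem (x1 = x2 = t)."""
    t = (1.0 - lam * mu) / (2.0 * mu + 2.0)
    return t, t

lam, mu = 0.0, 0.1
for _ in range(30):
    x1, x2 = inner_minimizer(lam, mu)
    lam += c(x1, x2) / mu         # first-order multiplier update

print(x1, x2, lam)
```

The iterates approach x1 = x2 = 0.5 with multiplier -1, the solution of the constrained problem, without mu having to be driven to zero.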
21
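LANCELOT algorithms continued
\[
\begin{aligned}
\min\ & f(x_1, x_2) \\
\text{s.t.}\ & 0 \le x_1 \le a \\
& 0 \le x_2 \le b
\end{aligned}
\]
[Figure: the (x_1, x_2) plane showing the simple bounds a and b, the trust region around the iterate x_k, the points u, v and w, and contours of f.]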
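How simple bounds and a trust region combine can be sketched as follows. The sketch assumes an infinity-norm (box-shaped) trust region, so that its intersection with the bounds is again a box and clipping each coordinate is an exact projection; the function name and arguments are hypothetical and do not correspond to any LANCELOT routine.

```python
def clip_step(x, step, lower, upper, radius):
    """Clip a trial step to an infinity-norm trust region, then to simple bounds."""
    new_x = []
    for xi, si, lo, hi in zip(x, step, lower, upper):
        si = max(-radius, min(radius, si))          # trust region: |s_i| <= radius
        new_x.append(max(lo, min(hi, xi + si)))     # bounds: lo <= x_i <= hi
    return new_x

# The first coordinate is limited by the trust region, the second by its bound.
print(clip_step([0.5, 0.1], [2.0, -0.3], [0.0, 0.0], [1.0, 1.0], 0.3))
```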
22
Customization of LANCELOT
• Cannot just use as a black box
• Non-standard options may be preferable
e.g., solve the BQP subproblem accurately
• Magic Steps
• Noise considerations
• Structured Secant Updates
• Adjoint computations
• Preprocessing (Pruning)
• Failure recovery in conjunction with SPECS
23
LANCELOT
• State-of-the-art large-scale nonlinear optimization package
• Group partial separability is heavily exploited in our formulation
• Two-step updates applied to linear variables
• Specialized criteria for initializations, updates, adjoint computations, stopping and dealing with numerical noise
24
Aids to convergence
• Initialization of multipliers and variables
• Scaling, choice of units
• Choice of simple bounds on arrival times, z
• Reduction of numerical noise
• Reduction of dimensionality
• Treating fanout capacitances as “internal variables” of the optimization
• Tuning of LANCELOT to be aggressive
• Accurate solution of BQP
28
Pruning of the timing graph
• The timing graph can be manipulated
– to reduce the number of arrival time variables
– to reduce the number of timing constraints
– most of all, to reduce degeneracy
• No loss in generality or accuracy
• Bottom line: average reductions of 18.3x in AT variables, 33% in variables, 43% in timing constraints, 22% in constraints, and 1.7x to 4.1x in run time on large problems
29
Pruning strategy
• During pruning, number of fanins of any un-pruned node monotonically increases
• During pruning, number of fanouts of any un-pruned node monotonically increases
• Therefore, if a node is not pruned in the first pass, it will never be pruned
• Therefore, a one-pass algorithm can be used for a given pruning criterion
30
Pruning strategy
• The order of pruning provably produces different (possibly sub-optimal) results
• Greedy 3-pass pruning produces a “very good” (but perhaps non-optimal) result
• We have not been able to demonstrate a better result than greedy 3-pass pruning
• However, the quest for a provably optimal solution continues...
31
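Pruning: an example
Before pruning:
\[
\begin{aligned}
\min\ & z \\
\text{s.t.}\ & AT_2^f \ge AT_1^r + d_{12}^f, \quad AT_2^r \ge AT_1^f + d_{12}^r \\
\text{s.t.}\ & AT_3^f \ge AT_2^r + d_{23}^f, \quad AT_3^r \ge AT_2^f + d_{23}^r \\
\text{s.t.}\ & AT_4^f \ge AT_3^r + d_{34}^f, \quad AT_4^r \ge AT_3^f + d_{34}^r \\
\text{s.t.}\ & z \ge AT_4^f, \quad z \ge AT_4^r
\end{aligned}
\]
After pruning:
\[
\begin{aligned}
\min\ & z \\
\text{s.t.}\ & z \ge AT_1^r + d_{12}^f + d_{23}^r + d_{34}^f \\
\text{s.t.}\ & z \ge AT_1^f + d_{12}^r + d_{23}^f + d_{34}^r
\end{aligned}
\]
[Figure: chain of nodes 1, 2, 3, 4.]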
32
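Block-based vs. path-based timing
Block-based:
\[
\begin{aligned}
AT_4 &\ge AT_1 + d_{14} \\
AT_4 &\ge AT_2 + d_{24} \\
AT_4 &\ge AT_3 + d_{34} \\
AT_5 &\ge AT_4 + d_{45} \\
AT_6 &\ge AT_4 + d_{46}
\end{aligned}
\]
Path-based:
\[
\begin{aligned}
AT_5 &\ge AT_1 + d_{14} + d_{45} \\
AT_5 &\ge AT_2 + d_{24} + d_{45} \\
AT_5 &\ge AT_3 + d_{34} + d_{45} \\
AT_6 &\ge AT_1 + d_{14} + d_{46} \\
AT_6 &\ge AT_2 + d_{24} + d_{46} \\
AT_6 &\ge AT_3 + d_{34} + d_{46}
\end{aligned}
\]
[Figure: nodes 1, 2, 3 fan in to node 4, which fans out to nodes 5 and 6.]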
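That the two formulations give the same arrival times can be checked on the six-node example. The delays and input arrival times below are hypothetical, and rise/fall transitions are ignored to keep the sketch short.

```python
# Hypothetical delays for the six-node example: fanins 1, 2, 3 -> node 4 -> fanouts 5, 6.
d_in = {1: 2.0, 2: 5.0, 3: 3.0}    # d_14, d_24, d_34
d_out = {5: 1.0, 6: 4.0}           # d_45, d_46
at_in = {1: 0.0, 2: 1.0, 3: 2.0}   # arrival times AT_1, AT_2, AT_3

# Block-based: propagate through node 4 (m + n constraints, one extra variable AT_4).
at4 = max(at_in[i] + d_in[i] for i in d_in)
block = {j: at4 + d_out[j] for j in d_out}

# Path-based: node 4 eliminated, one constraint per fanin/fanout pair (m * n constraints).
path = {j: max(at_in[i] + d_in[i] + d_out[j] for i in d_in) for j in d_out}

assert block == path
print(block)
```

Eliminating node 4 removes a variable at the cost of more constraints, which is the trade-off the pruning criterion weighs.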
33
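Block-based & path-based timing
• In timing graph, if node has n fanins, m fanouts, eliminating it causes 2mn constraints instead of 2(m+n)
• Criterion: if 2mn ≤ 2(m+n)+2, prune!
[Figure: before pruning, nodes 1, 2, 3 reach node 4 through d_14, d_24, d_34, and node 4 reaches nodes 5 and 6 through d_45, d_46. After pruning node 4, nodes 1, 2, 3 connect directly to nodes 5 and 6 with edge delays d_14 + d_45, d_24 + d_45, d_34 + d_45, d_14 + d_46, d_24 + d_46, d_34 + d_46.]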
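The criterion is easy to tabulate. The sketch below reads it as 2mn <= 2(m+n) + 2 (the relation symbol is taken to be "less than or equal"); the function names are hypothetical.

```python
def constraints_if_kept(n_fanins, n_fanouts):
    """Timing constraints touching the node when it stays in the graph: 2(m+n)."""
    return 2 * (n_fanins + n_fanouts)

def constraints_if_pruned(n_fanins, n_fanouts):
    """Timing constraints that replace them when the node is eliminated: 2mn."""
    return 2 * n_fanins * n_fanouts

def should_prune(n_fanins, n_fanouts):
    """Pruning criterion, read as 2mn <= 2(m+n) + 2."""
    return (constraints_if_pruned(n_fanins, n_fanouts)
            <= constraints_if_kept(n_fanins, n_fanouts) + 2)

for n, m in [(1, 1), (3, 2), (3, 3)]:
    print(n, m, should_prune(n, m))
```

A node with three fanins and two fanouts, as in the six-node example, sits exactly on the boundary (12 <= 12) and is pruned; a node with three fanins and three fanouts is kept.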
35
Detailed pruning example
[Figure: timing graph from Source to Sink through nodes 1 to 16.]
Edges = 26, Nodes = 16 (+2)
36
Detailed pruning example
[Figure: timing graph from Source to Sink after a pruning step.]
Edges = 26 → 20, Nodes = 16 → 10
37
Detailed pruning example
[Figure: timing graph from Source to Sink after a further pruning step.]
Edges = 20 → 17, Nodes = 10 → 7
38
Detailed pruning example
[Figure: timing graph from Source to Sink after a further pruning step; merged edges are labeled 1,7; 2,7; 3,7.]
Edges = 17 → 16, Nodes = 7 → 6
39
Detailed pruning example
[Figure: timing graph from Source to Sink after a further pruning step; merged edges now include 11,14.]
Edges = 16 → 15, Nodes = 6 → 5
40
Detailed pruning example
[Figure: timing graph from Source to Sink after a further pruning step; merged edges now include 13,16.]
Edges = 15 → 14, Nodes = 5 → 4
41
Detailed pruning example
[Figure: timing graph from Source to Sink after a further pruning step; merged edges now include 10,13,16.]
Edges = 14 → 13, Nodes = 4 → 3
42
Detailed pruning example
[Figure: final pruned timing graph from Source to Sink; merged edges are labeled 1,7; 2,7; 3,7; 11,14; 10,13,16; 10,12,14; 10,12,15; 12,14; 12,15.]
Edges = 13 → 13, Nodes = 3 → 2
44
Adjoint Lagrangian mode
– gradient computation is the bottleneck
– if the problem has m measurements and n tunable transistor/wire sizes:
• traditional direct method: n sensitivity simulations
• traditional adjoint method: m adjoint simulations
– adjoint Lagrangian method computes all gradients in a single adjoint simulation!
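The counting argument can be illustrated on a static linear analog. This is not the time-domain adjoint analysis performed by the simulator: the sketch assumes a state u that solves A u = B x, measurements f_k = C[k] . u, and a multiplier-weighted merit function sum_k lam_k f_k, with all matrices and multipliers chosen arbitrarily. The direct method needs one solve per size (n solves); weighting the adjoint sources by the multipliers needs a single solve.

```python
def solve2(A, b):
    """Solve a 2x2 linear system A u = b by Cramer's rule."""
    det = A[0][0] * A[1][1] - A[0][1] * A[1][0]
    return [(b[0] * A[1][1] - A[0][1] * b[1]) / det,
            (A[0][0] * b[1] - b[0] * A[1][0]) / det]

def transpose(M):
    return [list(row) for row in zip(*M)]

A = [[4.0, 1.0], [2.0, 3.0]]            # hypothetical system matrix
B = [[1.0, 0.0, 2.0], [0.0, 1.0, 1.0]]  # how n = 3 sizes drive the state
C = [[1.0, 0.0], [1.0, 1.0]]            # m = 2 measurements f_k = C[k] . u
lam = [0.5, 2.0]                        # multipliers weighting the measurements

n, m = len(B[0]), len(C)

# Direct method: one sensitivity solve per size (n solves).
grad_direct = []
for j in range(n):
    du = solve2(A, [B[0][j], B[1][j]])
    grad_direct.append(
        sum(lam[k] * (C[k][0] * du[0] + C[k][1] * du[1]) for k in range(m)))

# Adjoint Lagrangian analog: a single adjoint solve with the weighted source.
rhs = [sum(lam[k] * C[k][i] for k in range(m)) for i in range(2)]
a = solve2(transpose(A), rhs)
grad_adjoint = [B[0][j] * a[0] + B[1][j] * a[1] for j in range(n)]

print(grad_direct)
print(grad_adjoint)
```

Both routes give the same gradient of the merit function; the single-solve route does not grow with either m or n.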
45
Adjoint Lagrangian mode
– useful for large circuits
– implication: additional timing/noise constraints at no extra cost!
– is predicated on close software integration between the optimizer and the simulator
– gradient computation is 8% of total run time on average
46
Noise considerations
– noise is important during tuning
– semi-infinite problem
\[ v(x, t) \le NM_L \quad \text{for all } t \in [t_1, t_2] \]
[Figure: waveform v against t with the noise margin NM_L and the window t_1 to t_2 marked; area = c(x).]
47
Noise considerations
• Trick: remap infinite number of constraints to a single integral constraint c(x) = 0
• In adjoint Lagrangian mode, any number of noise constraints almost for free!
• General (constraints, objectives, minimax)
• Tradeoff analysis for dynamic library cells
[Figure: waveform v against t with the noise margin NM_L and the window t_1 to t_2 marked; area = c(x).]
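One way to read the integral constraint is as the area of the waveform above the noise margin, which is zero exactly when the waveform never exceeds NM_L in the window. The sketch below assumes that reading, c(x) = integral over [t1, t2] of max(0, v(x,t) - NM_L) dt, and estimates it from sampled points; the function name and the sample waveforms are hypothetical.

```python
def noise_violation_area(times, volts, nm_l):
    """Trapezoidal estimate of the area of the waveform above the noise margin nm_l."""
    excess = [max(0.0, v - nm_l) for v in volts]
    area = 0.0
    for k in range(1, len(times)):
        area += 0.5 * (excess[k - 1] + excess[k]) * (times[k] - times[k - 1])
    return area

times = [0.0, 1.0, 2.0, 3.0, 4.0]
quiet = [0.0, 0.1, 0.3, 0.1, 0.0]   # stays below the margin: constraint satisfied
bump = [0.0, 0.2, 0.6, 0.2, 0.0]    # exceeds the margin: constraint violated

print(noise_violation_area(times, quiet, 0.4))
print(noise_violation_area(times, bump, 0.4))
```

Clipping the samples before integrating is only a coarse estimate on a sparse time grid; a simulator would integrate the waveform at its own time-step resolution.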
50
Some numerical results - Dynamic
Name     # gates   Start delay (ps)   End delay (ps)   % improvement   # iterations   CPU time (s)
s390-1        22        269.4             255.5              5.2              51          100.2
s390-2        24        388.7             350.0             10.0             110          945.4
s390-3        41       1568.4            1548.6             11.4              49          462.7
ppc-1        155        584.2             526.4              9.9              35         2247.9
s390-4       175        306.2             231.8             24.3              55        17664.8
ppc-2        218        627.6             555.5             11.5              74        25342.2
ppc-3        408        589.8             584.4              0.9              30         2950.8
s390-5       430        943.5             623.4             33.9             228        27125.4
s390-6       559        884.1             742.6             16.0             107        52154.0
ppc-4        628       1276.7            1091.3             14.5             159       284445.0
51
Some numerical results - Static
Name     # gates   Start slack (ps)   End slack (ps)   % improvement   # iterations   CPU time (s)
s390-1        22        645               668                3.5              34           29.3
s390-2        24        583               617                5.8              46          141.3
ppc-1        155        514               555                7.9              56          528.9
s390-4       175        609               684               12.3              59          355.5
ppc-2        218        454               525               15.6              64         1003
x1           138       -129.1             -41.36            68                49         1321
ppc-3        408        447               512               14.5              68          861.4
s390-5       430          5               424             8380               131         3600
x2           507        -12.53            166             1425                59         4096
x3           743       -265.8             -86.31            67.5              73        30850
52
Lagrangian relaxation
– Motivation
• Tuning community loves the approach
– reduces the size of the problem
– reduces redundancy and degeneracy
– BUT … Never get something for nothing
53
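Lagrangian relaxation (continued)
\[
\begin{aligned}
AT_j^f &\ge AT_i^r + d_{ij}^f(w_n, w_p, cout_j, slew_i^r) \\
AT_j^r &\ge AT_i^f + d_{ij}^r(w_n, w_p, cout_j, slew_i^f) \\
slew_j^f &\ge s_{ij}^f(w_n, w_p, cout_j, slew_i^r) \\
slew_j^r &\ge s_{ij}^r(w_n, w_p, cout_j, slew_i^f) \\
z &\ge AT_i^r - RAT_i^r \\
z &\ge AT_i^f - RAT_i^f
\end{aligned}
\]
• Complicating constraints
• Relax into objective function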
Lagrangian relaxation (continued)
• Substituting the first-order optimality conditions removes the dependence on z and the ATs, because of the problem’s structure!
55
Lagrangian Relaxation
[Figure: number of variables (0 to 10000) against number of transistors (0 to 4000) for three formulations: Lagrangian Relaxation, Pruning, Neither.]
56
Lagrangian relaxation (continued)
Name     FETs    Iterations              CPU (s)
                 w/o     w       w/      w/o       w
inv3        8     61      2      36        5.1       5.3
a3.2       34     41     17     215       21.5     138.2
Agrph      34     30     28     303       38.1     434.7
ldder      46     41     16     182       21.5     121.1
s3901      72     40     34     109       87.1     479.7
s3902     102     67     61     471      990.5     842.6
s3903     154     26     29     131      236.8    1502
ppc-1     824     37    117    1181     1818     57430
s3904     882     20    156     386      532.8   14553
c8        584     53    194    1372     1004     37068
57
Future work and conclusions
– Elaine/IPOPT (Andreas Waechter)
• Handle linear constraints as before/directly
• Nonlinear constraints as before/filter method
• Simple bounds via primal-dual interior point method
• Spherical trust region scaled appropriately/line search method
58
References
C. Visweswariah, A. R. Conn and L. Silva
Exploiting Optimality Conditions in Accurate Static Circuit Tuning
in High Performance Algorithms and Software for Nonlinear Optimization,
G. DiPillo and A. Murli, Eds, pages 1-19, Kluwer, 2002 (to appear)
A. R. Conn, P. K. Coulman, R. A. Haring, G. L. Morrill, and C. Visweswariah.
Optimization of custom MOS Circuits by transistor sizing.
To appear (2002) in the book "The Best of ICCAD - 20 Years of Excellence in
Computer Aided Design". Originally appeared as IEEE International Conference
on Computer-Aided Design, pages 174--180, Nov 1996
A.R. Conn and C. Visweswariah
Overview of continuous optimization advances and applications to circuit tuning.
Proceedings International Symposium on Physical Design
(2001), pp. 74-81, ACM Press, New York.
59
References
C. Visweswariah, R. A. Haring, and A. R. Conn.
Noise considerations in circuit optimization.
IEEE Transactions on Computer-Aided Design of ICs and Systems,
Vol. 19, pages 679-690, June 2000.
C. Visweswariah and A. R. Conn
Formulation of static circuit optimization with reduced size, degeneracy and
redundancy by timing graph manipulation.
IEEE International Conference on Computer-Aided Design, pages 244--251, Nov. 1999.
A. R. Conn, L. N. Vicente, and C. Visweswariah.
Two-step algorithms for nonlinear optimization with structured applications.
SIAM Journal on Optimization, volume 9, number 4, pages 924--947, September 1999.
60
References
A. R. Conn, I. M. Elfadel, W. W. Molzen, Jr., P. R. O'Brien, P. N. Strenski,
C. Visweswariah, and C. B. Whan
Gradient-based optimization of custom circuits using a static-timing formulation.
Proc. Design Automation Conference, pages 452--459, June 1999.
A. R. Conn, R. A. Haring, and C. Visweswariah
Noise considerations in circuit optimization.
IEEE International Conference on Computer-Aided Design, pages 220--227, Nov 1998.
A. R. Conn, P. K. Coulman, R. A. Haring, G. L. Morrill, C. Visweswariah, and
C. W. Wu.
JiffyTune: circuit optimization using time-domain sensitivities.
IEEE Transactions on Computer-Aided Design of ICs and Systems, number 12,
volume 17, pages 1292--1309, December 1998.
61
References
A. R. Conn, R. A. Haring, C. Visweswariah, and C. W. Wu.
Circuit optimization via adjoint Lagrangians.
IEEE International Conference on Computer-Aided Design, pages 281--288, Nov 1997.
A. R. Conn, R. A. Haring, and C. Visweswariah.
Efficient time-domain simulation and optimization of digital FET circuits.
Mathematical Theory of Networks and Systems, May 1996.
A popular article on our work by Stewart Wolpin
http://www.research.ibm.com/thinkresearch/pages/2002/20020625_einstuner.shtml