Parallelization by SimPLification: A Case Study in VLSI Placement

Myung-Chul Kim, Dong-Jin Lee and Igor L. Markov, Dept. of EECS, University of Michigan. PAPA2011, University of Michigan


Page 1: Parallelization by SimPLification: A Case Study in VLSI Placement

PAPA2011, University of Michigan

Parallelization by SimPLification: A Case Study in VLSI Placement

Myung-Chul Kim, Dong-Jin Lee and Igor L. Markov
Dept. of EECS, University of Michigan

Page 2: Parallelization by SimPLification: A Case Study in VLSI Placement

Complexities of Parallel Algorithms & SW

1. Objectives of parallelization
A. Improve completion time by using multiple cores in parallel
B. Improve throughput by using stream processing (latency may increase and become less predictable)
C. Improve power consumption (by decreasing the clock rate)

2. Not an objective (a pitfall)
− Come up with a slow algorithm that is easy to parallelize

■ In this talk: how to accomplish 1.A without 2
− Take a leading algorithm and speed up its bottlenecks
− Design a new algorithm that is (a) better, (b) easy to parallelize

Page 3: Parallelization by SimPLification: A Case Study in VLSI Placement

CAD Algorithms

■ Sequence of optimizations
− Subject to Amdahl’s law
− The more stages there are, the harder it is to parallelize effectively

■ Additional complications
− Elaborate data structures may entail overhead for parallel access
− When processing is light, memory bandwidth may become a bottleneck (with 4+ threads)

■ Recommendations
− A simpler algorithm is often easier to parallelize (fewer stages, simpler data structures)
− Using standard solvers, e.g., linear algebra, helps reuse previous work on parallelization
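Amdahl's law bounds the speed-up when only a fraction p of the runtime scales over n threads: S(n) = 1 / ((1 − p) + p/n). The sketch below is a minimal illustration, not part of the talk; the function name is hypothetical, and p = 0.31 is borrowed from the CG solver share in the runtime profile shown later in the slides, under the assumption that only that one stage is parallelized.

```python
def amdahl_speedup(p: float, n: int) -> float:
    """Upper bound on speed-up when a fraction p of the runtime scales over n threads."""
    if not 0.0 <= p <= 1.0 or n < 1:
        raise ValueError("need 0 <= p <= 1 and n >= 1")
    return 1.0 / ((1.0 - p) + p / n)


# Assumption for illustration: only a 31% stage is parallelized.
for n_threads in (1, 2, 4, 8):
    print(n_threads, round(amdahl_speedup(0.31, n_threads), 3))
```

However many threads are used, the bound stays below 1 / (1 − p), which is why a flow with many sequential stages is hard to speed up as a whole.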

Page 4: Parallelization by SimPLification: A Case Study in VLSI Placement

Global Placement: Motivation

■ Interconnect lagging in performance while transistors continue scaling
− Circuit delay, power dissipation and area dominated by interconnect
− Routing quality highly controlled by placement

■ Circuit size and complexity rapidly increasing
− Scalable placement algorithm is critical
− Simplicity, integration with other optimizations

[Figure labels: Unloaded, Coupling, IR drop, RC delay]

Page 5: Parallelization by SimPLification: A Case Study in VLSI Placement

Goals in Placement

■ Find good relative ordering of cells
− Minimize wire length and congestion
− Maximize timing slack

■ Find good spacing of cells
− Eliminate wiring congestion problems
− Provide space for post-placement stages
– clock trees
– buffer insertion
– timing correction

■ Find good global position

Page 6: Parallelization by SimPLification: A Case Study in VLSI Placement

Optimize Relative Order

[Figure: cells A, B, C]

Page 7: Parallelization by SimPLification: A Case Study in VLSI Placement

To spread ...

[Figure: cells A, B, C]

Page 8: Parallelization by SimPLification: A Case Study in VLSI Placement

... or not to spread

[Figure: cells A, B, C]

Page 9: Parallelization by SimPLification: A Case Study in VLSI Placement

Place to the left

[Figure: cells A, B, C]

Page 10: Parallelization by SimPLification: A Case Study in VLSI Placement

... or to the right

[Figure: cells A, B, C]

Page 11: Parallelization by SimPLification: A Case Study in VLSI Placement

Optimize Relative Order

[Figure: cells A, B, C]

Without whitespace, placement is dominated by ordering

Page 12: Parallelization by SimPLification: A Case Study in VLSI Placement

Example of Global Placement (APlace 2.04 from UCSD)

Page 13: Parallelization by SimPLification: A Case Study in VLSI Placement

Example of Global Placement (mFAR from UCSB)

Page 14: Parallelization by SimPLification: A Case Study in VLSI Placement

Placement Formulation

■ Objective: Minimize estimated wirelength
− Half-perimeter wirelength (HPWL)
− (max X - min X) + (max Y - min Y)

■ Subject to constraints:
− Legality: Row-based placement with no overlaps
− Routability: Limiting local interconnect congestion for successful routing
− Timing: Meeting performance target of a design

[Figure: x and y axes]
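The HPWL objective above can be sketched in a few lines. This is a minimal illustration with hypothetical function names; it treats each net as a list of (x, y) pin locations and ignores pin offsets within cells.

```python
from typing import Iterable, List, Tuple

Pin = Tuple[float, float]


def hpwl(pins: List[Pin]) -> float:
    """Half-perimeter wirelength of one net: (max X - min X) + (max Y - min Y)."""
    xs, ys = zip(*pins)
    return (max(xs) - min(xs)) + (max(ys) - min(ys))


def total_hpwl(nets: Iterable[List[Pin]]) -> float:
    """Sum of per-net HPWL over the whole netlist."""
    return sum(hpwl(net) for net in nets)


net = [(1.0, 2.0), (4.0, 6.0), (3.0, 1.0)]
print(hpwl(net))  # (4 - 1) + (6 - 1) = 8.0
```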

Page 15: Parallelization by SimPLification: A Case Study in VLSI Placement

Quadratic Placement

■ Consider a graph first, not a hypergraph
■ Minimize $\sum_{e_{ij}} \left[(x_i - x_j)^2 + (y_i - y_j)^2\right]$ (the sum is over edges $e_{ij}$)
− Seems unrelated to $\sum_{e_{ij}} \left(|x_i - x_j| + |y_i - y_j|\right)$ but can still be separated into x- and y-components

■ Physical analogy: Hooke’s law
− Consider an elastic spring, stretched by x
− Force $F = -kx$ (k is the spring constant)
− Energy $E = \tfrac{1}{2} k x^2$
− Our goal: minimize the energy of the system

A system of springs will only settle in a minimum
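To make the spring analogy concrete: setting the derivative of the quadratic objective to zero places every movable cell at the weighted average of its neighbors. The sketch below solves a tiny 1D instance with Gauss-Seidel sweeps, chosen here for brevity; the talk itself uses a CG solver. All names and the example netlist are hypothetical, and every movable cell is assumed to have at least one edge.

```python
def solve_quadratic_1d(edges, fixed, movable, sweeps=200):
    """Minimize sum of w * (x_i - x_j)^2 over edges, with fixed pads, in 1D."""
    x = dict(fixed)
    x.update({cell: 0.0 for cell in movable})
    neighbors = {cell: [] for cell in movable}
    for i, j, w in edges:
        if i in neighbors:
            neighbors[i].append((j, w))
        if j in neighbors:
            neighbors[j].append((i, w))
    for _ in range(sweeps):
        for cell in movable:
            total_w = sum(w for _, w in neighbors[cell])
            # Zero gradient: the cell sits at the weighted mean of its neighbors.
            x[cell] = sum(w * x[n] for n, w in neighbors[cell]) / total_w
    return x


# Two movable cells chained between pads at x = 0 and x = 3, unit weights.
edges = [("pad_l", "a", 1.0), ("a", "b", 1.0), ("b", "pad_r", 1.0)]
pos = solve_quadratic_1d(edges, fixed={"pad_l": 0.0, "pad_r": 3.0}, movable=["a", "b"])
print(round(pos["a"], 6), round(pos["b"], 6))  # 1.0 2.0
```

The same computation for y is independent of x, which is the separability the slide refers to.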

Page 16: Parallelization by SimPLification: A Case Study in VLSI Placement

Iterative Optimization

Page 17: Parallelization by SimPLification: A Case Study in VLSI Placement

Prior Work

■ Ideal Placer
− Low runtime without sacrificing solution quality
− Simplicity, integration with other optimizations

[Figure: Speed vs. Solution Quality. Quadratic and force-directed: mFAR, Kraftwerk2, FastPlace3. Non-convex optimization: mPL6, APlace2, NTUPlace3. Also marked: Ideal placer.]

Page 18: Parallelization by SimPLification: A Case Study in VLSI Placement

Key features of SimPL

■ Flat quadratic placement
■ Primal-dual optimization
− Closing the gap between upper and lower bounds

[Figure: Wirelength vs. Iteration. Labels: Initial WL Opt.; Upper-Bound Solution by Look-ahead Legalization; Lower-Bound Solution by Linear System Solver; Final Solution; Final Legal Solution.]

Page 19: Parallelization by SimPLification: A Case Study in VLSI Placement

Common Analytical Placement Flow

[Flowchart: Placement Instance → Initial WL Optimization → Global Placement → Converge? (no: repeat Global Placement; yes: continue) → Legalization and Detailed Placement]

Page 20: Parallelization by SimPLification: A Case Study in VLSI Placement

SimPL Flow

[Flowchart]
Placement Instance
→ Initial WL Optimization (repeat until WL Converge = yes): B2B Graph Building, Linear System Solver
→ Global Placement (repeat until Converge = yes): Look-ahead Legalization (Upper-Bound), Pseudonet Insertion, B2B Graph Building, Linear System Solver (Lower-Bound)
→ Legalization and Detailed Placement

B2B net model [P. Spindler, et al, “Kraftwerk2 - A Fast Force-Directed Quadratic Placement Approach Using an Accurate Net Model,” TCAD 2008]

We delegate final legalization and detailed placement to FastPlace-DP [M. Pan, et al, “An Efficient and Effective Detailed Placement Algorithm”, ICCAD2005]

Page 21: Parallelization by SimPLification: A Case Study in VLSI Placement

SimPL: Look-ahead Legalization

■ Purpose: Produces almost-legal placement (Upper-Bound) while preserving the relative cell ordering given by linear system solver (Lower-Bound)

■ Identify target region
− Find overflow bin b
− Create a minimal wide enough bin cluster B around b

■ Perform geometric top-down partitioning
− Find cell-area median (Cc) and whitespace median (CB)
− Assign cells (Cc) to corresponding partitions (CB)

■ Non-linear scaling
− Form stripe regions
− Move cells across stripe regions in-order based on whitespace
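One level of the partitioning step can be sketched as follows. This is a heavily simplified 1D illustration, not the talk's implementation: cells are (x, area) pairs, the bin cluster is a left-to-right list of (x_lo, x_hi, free_area) bins, free area is assumed uniform inside each bin, and all names are hypothetical. The real algorithm works in 2D, recurses on both halves, and follows up with non-linear scaling.

```python
def cell_area_median(cells):
    """Split x-sorted cells so each side holds about half of the total cell area (Cc)."""
    order = sorted(cells, key=lambda c: c[0])  # relative ordering is preserved
    half = sum(area for _, area in order) / 2.0
    acc = 0.0
    for k, (_, area) in enumerate(order):
        acc += area
        if acc >= half:
            return order[:k + 1], order[k + 1:]
    return order, []


def whitespace_median(bins):
    """x-coordinate that splits the free capacity of the bin cluster in half (CB)."""
    half = sum(free for _, _, free in bins) / 2.0
    acc = 0.0
    for lo, hi, free in bins:
        if free > 0 and acc + free >= half:
            return lo + (hi - lo) * (half - acc) / free
        acc += free
    return bins[-1][1]


# Four equal cells crowded on the left; the right bin has more free space.
cells = [(1.0, 2.0), (1.2, 2.0), (1.4, 2.0), (1.6, 2.0)]
bins = [(0.0, 2.0, 2.0), (2.0, 4.0, 6.0)]
left, right = cell_area_median(cells)
cut = whitespace_median(bins)
print(len(left), len(right), round(cut, 3))
```

Cells in `left` would then be assigned to the sub-region left of `cut`, and cells in `right` to the sub-region right of it, so both halves receive matching shares of cell area and whitespace.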

Page 22: Parallelization by SimPLification: A Case Study in VLSI Placement

SimPL: Look-ahead Legalization (1)

Performing geometric top-down partitioning

[Figure labels: Overfilled bin; Bin cluster (B); Cell-area median (Cc); Whitespace median (CB); sub-regions B0 and B1]

Page 23: Parallelization by SimPLification: A Case Study in VLSI Placement

SimPL: Look-ahead Legalization (2)

[Figure labels: Cell-area median (Cc); Whitespace median (CB); B0]

Page 24: Parallelization by SimPLification: A Case Study in VLSI Placement

SimPL: Look-ahead Legalization (2)

[Figure: cells numbered 1 to 8 inside region CB. Labels: Obstacle borders; Uniform cutlines; Cell Ordering; Per-stripe Linear Scaling]

Page 25: Parallelization by SimPLification: A Case Study in VLSI Placement

SimPL: Look-ahead Legalization (3)

■ Example (adaptec1)

Look-ahead legalization stops when target regions become small enough

Page 26: Parallelization by SimPLification: A Case Study in VLSI Placement

SimPL: Using legal locations as anchors

■ Purpose: Gradually perturb the linear system to generate lower-bound solutions with less overlap

■ Anchors and Pseudonets
− Look-ahead locations used as fixed, zero-area anchors
− Anchors and original cells connected with 2-pin pseudonets
− Pseudonet weights grow linearly with iterations
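The effect of a growing pseudonet weight can be seen on a single cell in 1D. In the sketch below the cell is pulled toward its netlist neighbors by ordinary nets and toward its look-ahead location by one 2-pin pseudonet. The slide does not give the exact weight formula, so the linear rule `alpha * k` and the value of `alpha` are assumptions for illustration, as are all names.

```python
def place_with_anchor(net_targets, anchor, pseudo_w):
    """1D minimizer of sum_j w_j * (x - t_j)^2 + pseudo_w * (x - anchor)^2."""
    num = sum(w * t for t, w in net_targets) + pseudo_w * anchor
    den = sum(w for _, w in net_targets) + pseudo_w
    return num / den


net_targets = [(0.0, 1.0), (2.0, 1.0)]  # (neighbor position, net weight)
anchor = 5.0                            # look-ahead legalized location of the cell
alpha = 0.1                             # hypothetical growth rate per iteration

trace = [place_with_anchor(net_targets, anchor, alpha * k) for k in range(0, 41, 10)]
print([round(x, 3) for x in trace])
```

With zero pseudonet weight the cell sits at the wirelength optimum (x = 1); as the weight grows, each new lower-bound solution moves closer to the anchor, which is the tug-of-war between low-wirelength and legalized placements.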

Page 27: Parallelization by SimPLification: A Case Study in VLSI Placement

Next illustration: Tug-of-war between low-wirelength and legalized placements

Page 28: Parallelization by SimPLification: A Case Study in VLSI Placement

SimPL Iterations on Adaptec1 (1)

[Figure panels: Iteration=0 (Init WL Opt.); Iteration=1 (Upper Bound); Iteration=2 (Lower Bound); Iteration=3 (Upper Bound)]

Page 29: Parallelization by SimPLification: A Case Study in VLSI Placement

SimPL Iterations on Adaptec1 (2)

[Figure panels: Iteration=10 (Lower Bound); Iteration=11 (Upper Bound); Iteration=20 (Lower Bound); Iteration=21 (Upper Bound)]

Page 30: Parallelization by SimPLification: A Case Study in VLSI Placement

SimPL Iterations on Adaptec1 (3)

[Figure panels: Iteration=30 (Lower Bound); Iteration=31 (Upper Bound); Iteration=40 (Lower Bound); Iteration=41 (Upper Bound)]

Page 31: Parallelization by SimPLification: A Case Study in VLSI Placement

Convergence of SimPL

■ Legal solution is formed between two bounds

Page 32: Parallelization by SimPLification: A Case Study in VLSI Placement

Empirical Results: ISPD05 Benchmarks

■ Experimental setup
− Single-threaded runs on a 3.2GHz Intel Core i7 Quad CPU Q660 Linux workstation
− HPWL is computed by GSRC Bookshelf Evaluator

■ < 5000 lines of code in C++, including CG solver for sparse linear systems (with Jacobi preconditioner)

Page 33: Parallelization by SimPLification: A Case Study in VLSI Placement

Speeding Up Placement Using Parallelism

■ SimPL has very few components (5 KLOC)
■ Each bottleneck is amenable to some form of parallelism
− Thread-level
− Instruction-level

[Figure: runtime profile]
Initial placement: 8%
CG solver: 31%
Sparse matrix and B2B net modeling: 8%
Look-ahead legalization: 14%
Pseudo-net insertion: 1%
Post Global Placement: 38%
IO: 0%

Page 34: Parallelization by SimPLification: A Case Study in VLSI Placement

Parallelism in Conjugate Gradient Solver

■ Coarse-grain row partitioning
− Implemented using OpenMP 3.0 compiler directives

■ SSE2 (Streaming SIMD Extensions) instructions
− Process 4 data elements with a single instruction
− Marginal runtime improvement in SpMxV

■ Reducing memory bandwidth demand of SpMxV
− CSR (Compressed Sparse Row) format
Y. Saad, “Iterative Methods for Sparse Linear Systems,” SIAM 2003
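The coarse-grain row partitioning of SpMxV over a CSR matrix can be sketched as below. The talk implements this in C++ with OpenMP; this Python version uses `concurrent.futures` only to show how contiguous row blocks map to threads, and is not expected to run faster in CPython. Function names and the example matrix are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor


def csr_matvec_rows(indptr, indices, data, x, row_lo, row_hi, y):
    """Compute rows [row_lo, row_hi) of y = A @ x for A stored in CSR form."""
    for r in range(row_lo, row_hi):
        acc = 0.0
        for k in range(indptr[r], indptr[r + 1]):
            acc += data[k] * x[indices[k]]
        y[r] = acc  # each row is written by exactly one task


def csr_matvec_parallel(indptr, indices, data, x, n_threads=2):
    n_rows = len(indptr) - 1
    y = [0.0] * n_rows
    chunk = max(1, (n_rows + n_threads - 1) // n_threads)  # contiguous row blocks
    with ThreadPoolExecutor(max_workers=n_threads) as pool:
        futures = [
            pool.submit(csr_matvec_rows, indptr, indices, data, x,
                        lo, min(lo + chunk, n_rows), y)
            for lo in range(0, n_rows, chunk)
        ]
        for f in futures:
            f.result()  # re-raise any worker exception
    return y


# A = [[4, -1, 0], [-1, 4, -1], [0, -1, 4]] in CSR form
indptr = [0, 2, 5, 7]
indices = [0, 1, 0, 1, 2, 1, 2]
data = [4.0, -1.0, -1.0, 4.0, -1.0, -1.0, 4.0]
y = csr_matvec_parallel(indptr, indices, data, [1.0, 2.0, 3.0])
print(y)  # [2.0, 4.0, 10.0]
```

Because each task writes a disjoint slice of `y`, no locking is needed; CSR keeps each row's values and column indices contiguous, which is what reduces memory traffic per row.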

Page 35: Parallelization by SimPLification: A Case Study in VLSI Placement

Parallelism in CG Solver - Example

Page 36: Parallelization by SimPLification: A Case Study in VLSI Placement

Parallelism in B2B Model Update

■ B2B net model update
− B2B model is separable
− Can process the x and y cases in parallel
− Additionally, split the nets of the netlist into equal groups that can be processed by multiple threads
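The separability of the B2B (Bound2Bound) model can be illustrated per axis. The slides only cite the Kraftwerk2 paper for the model, so the weight convention used below, w = 1 / ((P − 1) · |x_i − x_j|) with cost Σ w · (x_i − x_j)², is an assumption; with it, the quadratic cost of a net equals its span along that axis. All function names are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor


def b2b_edges(coords, eps=1e-9):
    """B2B edges (i, j, weight) of one net along one axis."""
    p = len(coords)
    if p < 2:
        return []
    lo = min(range(p), key=coords.__getitem__)
    hi = max(range(p), key=coords.__getitem__)
    if lo == hi:  # all pins coincide, so both indices are 0
        hi = 1
    edges = []

    def add(i, j):
        dist = max(abs(coords[i] - coords[j]), eps)
        edges.append((i, j, 1.0 / ((p - 1) * dist)))

    add(lo, hi)  # boundary pin to boundary pin
    for k in range(p):
        if k != lo and k != hi:  # every inner pin connects to both boundary pins
            add(k, lo)
            add(k, hi)
    return edges


def quadratic_cost(coords, edges):
    return sum(w * (coords[i] - coords[j]) ** 2 for i, j, w in edges)


def build_axis(nets, axis):
    """Build B2B edges for all nets along one axis (0 = x, 1 = y)."""
    return [b2b_edges([pin[axis] for pin in net]) for net in nets]


nets = [[(1.0, 2.0), (4.0, 6.0), (3.0, 1.0)]]
with ThreadPoolExecutor(max_workers=2) as pool:  # x and y are independent
    fx = pool.submit(build_axis, nets, 0)
    fy = pool.submit(build_axis, nets, 1)
    edges_x, edges_y = fx.result(), fy.result()

cost_x = quadratic_cost([p[0] for p in nets[0]], edges_x[0])
cost_y = quadratic_cost([p[1] for p in nets[0]], edges_y[0])
print(round(cost_x, 6), round(cost_y, 6))  # 3.0 5.0, i.e. the x-span and y-span
```

Since each net's edges depend only on that net's pins, `build_axis` could equally be called on equal-sized groups of nets, one group per thread.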

Page 37: Parallelization by SimPLification: A Case Study in VLSI Placement

SSE optimization affects Runtime Profile

[Figure: two runtime-profile charts; the second has the same values as the earlier runtime profile]

Component                            First chart   Second chart
Initial placement                    5%            8%
CG solver                            19%           31%
Sparse matrix and B2B net modeling   10%           8%
Look-ahead legalization              18%           14%
Pseudo-net insertion                 1%            1%
Post Global Placement                46%           38%
IO                                   1%            0%

Page 38: Parallelization by SimPLification: A Case Study in VLSI Placement

Parallelism in Look-ahead Legalization (1)

■ Look-ahead legalization (LAL) started consuming a significant fraction of overall runtime

■ Top-down geometric partitioning and non-linear scaling (T&N) are amenable to parallelization
− Top-down partitioning generates an increasing number of subtasks of similar sizes which can be solved in parallel
− After each level of T&N on a bin cluster, each thread generates two sub-clusters with similar numbers of cells

Page 39: Parallelization by SimPLification: A Case Study in VLSI Placement

Parallelism in Look-ahead Legalization (2)

■ LAL keeps a global queue of bin clusters Q

■ Static partitioning
− Assign initial bin clusters to available threads such that each thread has a similar number of bin clusters to start

■ Subtask updates
− Thread ti processes one of two sub-clusters (for the next level of T&N), the remainder is added to the global cluster queue Q

■ Dynamic task scheduling
− When thread ti is idle, it dynamically retrieves clusters from the global cluster queue Q. The number of clusters to be retrieved N = max(Q.size()/N_threads, 1)
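The scheduling scheme above can be sketched with a shared queue and a lock. This is an illustrative model under stated assumptions, not the talk's code: a "cluster" is just a list of cell ids, halving the list stands in for one level of T&N, and the names `run_lal_tasks`, `min_size`, and the pending counter used for termination are hypothetical.

```python
import threading
import time
from collections import deque


def run_lal_tasks(initial_clusters, n_threads, min_size=1):
    assert min_size >= 1 and n_threads >= 1
    queue = deque()                    # global cluster queue Q
    lock = threading.Lock()
    done = []                          # clusters that became small enough
    pending = [len(initial_clusters)]  # clusters created but not yet finished

    def process(cluster):
        # Keep one sub-cluster, hand the other to the global queue.
        while len(cluster) > min_size:
            mid = len(cluster) // 2
            with lock:
                queue.append(cluster[mid:])
                pending[0] += 1
            cluster = cluster[:mid]
        with lock:
            done.append(cluster)
            pending[0] -= 1

    def worker(own_clusters):
        for c in own_clusters:         # static partitioning: initial share
            process(c)
        while True:                    # dynamic scheduling once idle
            with lock:
                if pending[0] == 0:
                    return
                n = max(len(queue) // n_threads, 1)  # N = max(Q.size()/N_threads, 1)
                batch = [queue.popleft() for _ in range(min(n, len(queue)))]
            if not batch:
                time.sleep(0.001)      # other threads are still splitting clusters
            for c in batch:
                process(c)

    threads = [threading.Thread(target=worker, args=(initial_clusters[t::n_threads],))
               for t in range(n_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return done


result = run_lal_tasks([list(range(8)), list(range(8, 13))], n_threads=3)
print(len(result))  # 13 single-cell clusters
```

Taking a batch proportional to the queue length lets an idle thread grab more work when the queue is long, while still leaving clusters for the other threads.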

Page 40: Parallelization by SimPLification: A Case Study in VLSI Placement

Empirical Results: Overall Speed-ups

■ Experimental setup
− Multithreaded runs on an 8-core AMD-based system with four dual-core CPUs and 16GByte RAM
− Each CPU was an Opteron 880 processor running at 2.4GHz with 1024KB cache

Page 41: Parallelization by SimPLification: A Case Study in VLSI Placement

Empirical Results: Component Speed-ups

Page 42: Parallelization by SimPLification: A Case Study in VLSI Placement

Empirical Results: Component Speed-ups

Page 43: Parallelization by SimPLification: A Case Study in VLSI Placement

Extending the Routability-driven Placement

■ Ongoing work: simultaneous place-and-route

Page 44: Parallelization by SimPLification: A Case Study in VLSI Placement

Simultaneous Place-and-Route

■ After Look-Ahead Legalization (LAL) perform Look-Ahead Routing (LAR)
− Integrate an in-house router through a clean API
− Cell locations in, accurate congestion maps out
− The placer accounts for congestion in addition to density (slightly modified formulas, almost no extra work)

■ ISPD 2011 contest organized by IBM Research
− New, large benchmarks
− Placements evaluated by a common global router

Page 45: Parallelization by SimPLification: A Case Study in VLSI Placement

SimPL → SimPLR

■ Key metric is #overflows (OF)
■ Also shown: routed WL (RtWL)

Page 46: Parallelization by SimPLification: A Case Study in VLSI Placement

Conclusions

■ New flat quadratic placement algorithm: SimPL
− Novel primal-dual based approach
− Amenable to integration with physical synthesis

■ Self-contained, compact implementation
− Fastest among available academic placers
− Highly competitive solution quality
− Amenable to parallelism
− Easy to extend to simultaneous place-and-route

Page 47: Parallelization by SimPLification: A Case Study in VLSI Placement

Questions and Answers

Thank you! Time for Questions