
Toward More Efficient Annealing-Based Placement for Heterogeneous FPGAs

by

Yingxuan Liu

A thesis submitted in conformity with the requirements for the degree of Master of Applied Science

Graduate Department of Electrical and Computer Engineering
University of Toronto

© Copyright 2014 by Yingxuan Liu

Abstract

Toward More Efficient Annealing-Based Placement for Heterogeneous FPGAs

Yingxuan Liu

Master of Applied Science

Graduate Department of Electrical and Computer Engineering

University of Toronto

2014

Simulated Annealing (SA) is a popular placement heuristic used in many commercial and academic

FPGA CAD tools. However, SA placement requires long runtimes to produce good quality results. As

FPGAs continue to grow in size, runtime has become more crucial for SA-based placers. This thesis aims

to improve SA placement by making it more efficient in two key areas for large heterogeneous FPGAs:

move generation and placement cost evaluation.

This work shows that by using Median Region moves for heterogeneous architectures, the wirelength

is reduced by 5% while maintaining the same critical path delay. We also show that by using a better

search data structure the runtime is improved by 5%.

A new Timing Cost function incorporating Delay Budgets is introduced in this work, and it improves

the circuit speed by 4% with no degradation in wirelength. This cost function also performs better on

large circuits, where it improves circuit speed by 7%.


Acknowledgements

I would like to thank my supervisor, Dr. Vaughn Betz, for his guidance and support over the past two years. He has been a wonderful mentor to me and has shown a great passion for teaching.

My thanks also go to my wife Vivian and my family for their continued patience and encouragement over the years.

I am also grateful to have spent the past two years with all the wonderful people here at UofT. It has been a great experience for me.


Contents

1 Introduction
1.1 Motivation
1.2 Contributions
1.3 Organization

2 Background
2.1 FPGA Overview
2.2 FPGA CAD Flow
2.2.1 Logic Synthesis
2.2.2 Technology Mapping
2.2.3 Clustering
2.2.4 Placement
2.2.5 Routing
2.3 VPR and Placement Quality
2.3.1 Placement Quality Metrics
2.3.2 Cost Function
2.3.3 VPR Placement Flow
2.4 Prior Work on Annealing Improvement
2.4.1 Homogeneous Directed Moves
2.4.2 Delay Budgets

3 Move Generation
3.1 Overview
3.2 Median Region Moves
3.2.1 Motivations
3.2.2 Implementation
3.2.3 Experimental Results
3.2.4 Conclusion
3.3 Smart Search Space
3.3.1 Motivations
3.3.2 Implementation
3.3.3 Experimental Results
3.4 Conclusion

4 Cost Function
4.1 Cost Function Tuning
4.1.1 Motivations
4.1.2 Experimental Results
4.1.3 Conclusion
4.2 Delay Budget
4.2.1 Motivations
4.2.2 Delay Budget Timing Cost Function
4.2.3 Delay Budget Computation
4.2.4 Iterative-minimax-PERT: Implementation
4.2.5 Iterative-minimax-PERT: Experimental Results
4.2.6 Direct Slack: Implementation
4.2.7 Direct Slack: Experimental Results
4.2.8 Conclusion

5 Conclusion
5.1 Summary
5.2 Future Directions

Bibliography


List of Tables

3.1 Median Region move experiment architecture stats.
3.2 Stats for the largest 5 circuits from the VTR benchmark suite [24].
3.3 MR QoR comparison between high and medium chip utilization (VTR Benchmarks).
3.4 Placement QoR comparison between not omitting and omitting high fan-out nets for MR moves (VTR Benchmarks, normalized vs. original VPR results).
3.5 Smart Search Space results ranked by chip size.
4.1 Delta Timing Cost using the original VPR Timing Cost function.
4.2 Delta Timing Cost using the new VPR Timing Cost function.
4.3 PERT Delay Budget experiment architecture stats.
4.4 PERT Delay Budget experiment architecture files.
4.5 QoR of Delay Budget with the PERT method vs. original VPR using “arch1” and VTR benchmarks.
4.6 QoR of Delay Budget with the PERT method vs. original VPR using “arch2” and VTR benchmarks.
4.7 QoR of Delay Budget with the PERT method vs. original VPR using “arch3” and MCNC benchmarks.
4.8 Direct Slack results using N10 VTR benchmarks.
4.9 Direct Slack results using N10 VTR benchmarks, normalized vs. original VPR.
4.10 Direct Slack results using N10 VTR benchmarks, normalized vs. original VPR (largest 5).
4.11 Direct Slack results using N4 VTR benchmarks.
4.12 Direct Slack results using N4 MCNC benchmarks.


List of Figures

1.1 Typical FPGA CAD flow.
1.2 FPGA capacity and CPU speed growth in the 2000s [2].
2.1 FPGA Basic Logic Element and Configurable Logic Block [3].
2.2 Island Style Homogeneous FPGA Architecture.
2.3 FPGA Connection and Switch Boxes.
2.4 Heterogeneous FPGA Architecture [13].
2.5 VTR CAD Flow [24].
2.6 VPR Placement Pseudo Code [3].
2.7 Median Region Move [28].
2.8 MR Moves: Cell Shuffling [28].
3.1 CLB “Cell shuffling” path blocked by a RAM block.
3.2 Cell Rippling with 3 MFs.
3.3 QoR vs. fraction of moves that are MR type for various placer effort levels (VTR Benchmarks).
3.4 QoR vs. fraction of moves that are MR type for various placer effort levels (MCNC Benchmarks).
3.5 Median Region Cell Rippling Move Fragment sweep (VTR Benchmarks).
3.6 Median Region Cell Rippling MF sweep for medium chip utilization (VTR Benchmarks).
3.7 Median Region produces the same QoR as original VPR at lower runtime (VTR Benchmarks).
3.8 Size-1 range limit for CLBs in a Homogeneous Architecture.
3.9 Size-1 “window” for a CLB in a Heterogeneous Architecture.
3.10 Size-1 “window” for RAM.
3.11 Size-1 “window” for a CLB with SSS.
3.12 Size-1 “window” for RAM with SSS.
4.1 Cost Function Trade-off Sweep (VTRBM).
4.2 Cost Function Trade-off Sweep, 0.1 to 0.9 (VTRBM).
4.3 Criticality Exponent Sweep WDP Comparison (λ 0.1 to 0.9).
4.4 Criticality Exponent Sweep WDP Comparison (λ 0.1 to 0.5).
4.5 Criticality Exponent Sweep Delay Comparison (λ 0.1 to 0.5).
4.6 Criticality Exponent Sweep Wirelength Comparison (λ 0.1 to 0.5).
4.7 Criticality Exponent Sweep QoR Comparison (λ = 0.2).
4.8 Criticality Exponent Sweep QoR Comparison (λ = 0.5).
4.9 Changing Criticality Values for Various Critical Paths over an Annealing Process (final CE = 8).
4.10 Changing Criticality Values for Various Critical Paths over an Annealing Process (final CE = 30).
4.11 QoR comparison between final CE = 8 and 30 for various placer efforts (λ = 0.2).
4.12 QoR comparison between final CE = 8 and 30 for various placer efforts (λ = 0.5).
4.13 Example where the Timing Cost Function fails to identify a “bad move”.


Chapter 1

Introduction

Over the past several decades, Integrated Circuits (ICs) have been used in a wide range of applications

from personal mobile devices to large medical equipment. The size and performance of modern ICs have

been growing as per Moore’s law, with the transistor count doubling every two years, and this trend is

expected to continue [22]. Field-Programmable Gate Arrays (FPGAs), as one of the major IC types,

have been following the same trend. Their once simple architecture, which only consisted of hundreds

of Logic and Input/Output (I/O) elements, has grown into more complex architectures that contain up

to millions of logic elements along with various hard blocks such as Random Access Memory (RAM),

Digital Signal Processing (DSP) units, and more. Manual implementation of a design in an FPGA is no

longer manageable.

Computer-Aided Design (CAD) tools were invented to help IC designers deal with large IC designs.

Furthermore, CAD tools increase design efficiency and shorten turn-around time. As the size and

complexity of FPGAs continue to grow, it is more important for FPGA CAD tools to be able to deliver a

high Quality-of-Result (QoR) solution while maintaining reasonable runtime. Many steps of the FPGA

CAD flow are runtime expensive; thus much research has focused on finding high quality heuristics.

However, most of the previous research has used homogeneous architectures rather than the modern

heterogeneous architectures (employed by most commercial FPGAs). The move to heterogeneous FPGAs

has created a new set of FPGA CAD research challenges and opportunities, and the ever increasing size

of FPGA designs motivates new research into fast but high quality CAD.

1.1 Motivation

The FPGA CAD flow is normally broken into 5 steps as shown in Figure 1.1. Placement, the step

which connects the upstream clustering and downstream routing stages, is responsible for determining

the locations of all the blocks of the circuit. Interconnect delay has become the dominant component

of circuit delay in the deep submicron process era, and in an FPGA this delay is largely determined by

placement. Placing connected blocks closer together reduces delays and improves circuit performance.

The quality of placement also directly affects routability, congestion, and power.

Placement is one of the most time-consuming steps in the FPGA CAD flow. Depending on the

circuit size and optimization level, it can take hours or even days to complete [15] [23]. For example, one

circuit from the Titan benchmark suite, “sparcT1_chip2”, which contains approximately 815,000 blocks,



Figure 1.1: Typical FPGA CAD flow.

takes Altera’s Quartus II CAD system more than a day to pack, place and route [23]. With continued

growth in chip size, placement runtime is expected to become worse. Faster computer processors used

to provide enough speed for CAD tools to keep up with FPGA capacity growth until the mid-2000s as

shown in Figure 1.2. Recent trends suggest that processor speed improvement will continue at a growth

rate lower than that of FPGA capacity. Accordingly, we need placement heuristics that produce the

same QoR with a smaller runtime.

In addition to using CAD tools to implement FPGA designs in known FPGA architectures, re-

searchers also use academic CAD tools to explore different or new FPGA architectures. Many features

can be modified, such as cluster size, block types, wiring structures, and device floorplan. Placement

heuristics must be able to adapt to these changes and deliver good QoR regardless of the architectural

differences. Commercial FPGAs have moved to heterogeneous architectures, whereas most academic

placement research has used simpler homogeneous architectures. With multiple block types available

in heterogeneous architectures, many homogeneous architecture assumptions, such as that the core of a

FPGA chip consists only one type of block, are no longer valid. In addition, placement optimization for

homogeneous architectures may not be as effective when applied to heterogeneous architectures.

In this thesis, we focus on one type of FPGA placement algorithm, Simulated Annealing (SA), in a

popular academic place and route tool called Versatile Place and Route (VPR) [3]. The cost function

and move generation are two important components of a SA-type placer, which we try to optimize in this

work. Furthermore, we look at the impact moving to heterogeneous architectures has on some previous

homogeneous placement optimization techniques.


Figure 1.2: FPGA capacity and CPU speed growth in the 2000s [2].

1.2 Contributions

With millions of logic elements on a chip, the compile time of clustering-placement-routing has gone from

minutes to hours. This increases the design turn-around time using FPGAs, and it will increase further

with future technology generations, motivating recent research into increasing placement efficiency. One such method is to use “directed moves” as opposed to random moves [28] [8]. Both of these prior works,

however, used only homogeneous architectures. In our work, we implement Median Region moves from

[28] but adapt them to heterogeneous architectures. We also develop a technique to find valid “move-to”

locations more efficiently than the original random search in VPR.

For Simulated Annealing (SA) placement, the cost function quantifies the quality of a design implementation proposed during annealing. Each cost component can contribute a different fraction of the total cost, controlled by a tradeoff weight. Wirelength and critical path delay are two of the most

commonly used design metrics for FPGA placement. Besides the cost function tradeoff, there are a few

other parameters that can affect the QoR in SA placement. In this work we also examine the QoR effect

of those parameters.

For a practical design, the maximum operating frequency (Fmax) is inversely proportional to the critical path delay, i.e. the delay of the slowest path in the circuit. The default cost function in the

VPR placer uses criticality to identify connections that would more likely affect the critical path delay.

However, the current implementation gives limited freedom of movement to less critical connections

and this could restrict some potentially better moves for highly critical connections. In this work we

implement a new timing cost function in VPR that uses the “Delay Budget” concept to increase freedom

of movement.

1.3 Organization

The rest of this thesis is organized as follows. Chapter 2 describes prior work in FPGA architecture and

CAD, particularly placement, and summarizes VPR. Chapter 3 describes the first group of techniques

which relate to placement move generation. Chapter 4 introduces the second group of optimization

improvements which relate to the placement cost function. Finally Chapter 5 concludes, summarizes the

results, and describes some possible future directions.

Chapter 2

Background

2.1 FPGA Overview

Unlike other major IC types, such as Application Specific Integrated Circuits (ASICs), all blocks and

interconnects on an FPGA are prefabricated. The fundamental element on a modern FPGA is a Basic

Logic Element (BLE), as shown in Figure 2.1a. A typical BLE consists of a k -input Lookup Table

(LUT), a flipflop, and a MUX. The k -input LUT can be programmed to implement any k -input Boolean

function. The output of a BLE can be either synchronous or asynchronous depending on the MUX

selection. In modern FPGA architectures it is common that a number of BLEs are grouped in a single

Configurable Logic Block (CLB), as shown in Figure 2.1b. Each CLB has N BLEs, I input ports and N

output ports. There are also connections between some BLE outputs and BLE inputs. Crossbars and

MUXes are placed at BLE inputs and outputs to allow flexible connections. Modern FPGAs often have

a size-6 fracturable LUT inside each BLE, and 10 or more BLEs inside a single CLB.
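As a small illustration of the paragraph above, a k-input LUT can be modeled as a 2^k-entry truth table indexed by the packed input bits. This is an illustrative sketch, not VPR code; the function name `lut_eval` and the bit ordering are assumptions made here.

```python
def lut_eval(truth_table, inputs):
    """Evaluate a k-input LUT.

    truth_table: list of 2**k output bits, one per input combination.
    inputs: list of k input bits; inputs[0] is treated as the LSB.
    """
    k = len(inputs)
    assert len(truth_table) == 2 ** k
    # Pack the input bits into an integer index into the truth table.
    index = sum(bit << i for i, bit in enumerate(inputs))
    return truth_table[index]

# Program a 2-input LUT as XOR: outputs for inputs 00, 01, 10, 11.
xor_table = [0, 1, 1, 0]
print(lut_eval(xor_table, [1, 0]))  # -> 1
print(lut_eval(xor_table, [1, 1]))  # -> 0
```

Programming the LUT is simply a matter of loading a different truth table; any k-input Boolean function fits in the same structure.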

For classic homogeneous FPGAs, as shown in Figure 2.2, wires connecting the CLBs run in both

the horizontal and vertical directions in the space between the CLBs. CLBs are arranged to form a

grid with IO blocks generally located on the periphery. This is often referred to as the Island Style

FPGA [6], and many commercial FPGAs are based on a similar style. Each wiring channel consists of a

number of wiring tracks, and this number is the Channel Width (W). The length of each wire segment

is represented as the number of CLBs it spans. Multiple wire lengths are common in FPGAs to allow

faster connections and better routing flexibility.

CLBs are connected to their adjacent wiring channels through Connection Boxes (CBs) as shown in

Figure 2.3. Switch Boxes (SBs) are used to connect different wire segments together along the same or

different directions. Connections are programmable inside CBs and SBs by using SRAMs. Both the CB

pin connection pattern and SB switching pattern can affect routability and the final CAD QoR, which

is a research topic on its own [20].

For modern FPGAs, the main structure is still Island style but many new block types are incorporated

in the FPGA besides logic blocks. As shown in Figure 2.4, an example from Altera’s Stratix family [13],

the majority of the chip is occupied by soft logic blocks (called “LABs” in Altera devices), but hard blocks

such as RAMs and DSPs are also found among them. These hard blocks are normally larger than CLBs

in size and are distributed more sparsely across the chip. Such FPGA architectures, i.e. heterogeneous

architectures, are relatively new to academic research. Most of the past, as well as some of the current



(a) Basic Logic Element (BLE).

(b) Logic Cluster.

Figure 2.1: FPGA Basic Logic Element and Configurable Logic Block [3].

Figure 2.2: Island Style Homogeneous FPGA Architecture.


Figure 2.3: FPGA Connection and Switch Boxes.

FPGA CAD research, has been using homogeneous as opposed to heterogeneous architectures.

2.2 FPGA CAD Flow

The basic goal of the Very Large Scale Integration (VLSI) design flow is to transform a description

of the design into an implementation on a target platform. In the past, when there were only a few

hundreds or thousands of elements in a design, designers were able to manually lay out the whole design

with a reasonable amount of effort, and still achieve good performance. However, in modern FPGAs,

hundreds of thousands or even millions of elements may be involved in a design; it is no longer suitable

for humans to accomplish such a task manually. This problem is taken care of by CAD tools which

help designers their increase productivity, reduce human design errors, and most importantly, find high

quality solutions more effectively.

Furthermore, to achieve high performance, design optimization for speed, area, and/or power is

beneficial at each stage of the CAD flow. With increasing chip size and more complex structures, it has become more difficult for CAD tools to continue delivering high quality results within a reasonable

runtime.

Generally the CAD flow is broken into several steps so each step focuses on one task to achieve high

quality results. For a typical FPGA CAD flow, as shown in Figure 1.1, there are normally 5 steps: logic

synthesis, technology mapping, clustering (packing), placement, and routing. Each step takes the result

from the previous step as its input, and passes its solution to the next step, typically with no backtracking. The following is a brief description of each FPGA CAD flow step.

2.2.1 Logic Synthesis

Circuit designs are often represented in Hardware Description Language (HDL) format. There are

currently two major HDLs, VHSIC Hardware Description Language (VHDL) and Verilog. During the

Synthesis step, the HDL description is translated into gate-level format and optimized to reduce circuit

resource demand. A common academic format is Berkeley Logic Interchange Format (BLIF) [1]. ODIN


Figure 2.4: Heterogeneous FPGA Architecture [13].


II is a popular academic HDL synthesis tool [14], which provides support for FPGA soft logic blocks as

well as several modern hard blocks.

2.2.2 Technology Mapping

With a netlist of the design produced from the Synthesis step, an FPGA CAD tool then maps each

design element onto the primitive blocks (such as LUTs, flip-flops, and RAMs) provided by the

target FPGA architecture. Various logic optimizations with goals such as reducing the number of LUT

inputs, LUT depth, and point-to-point delays, are performed during this step to simplify the design

complexity [7]. A commonly used academic technology mapping tool is ABC [5].

2.2.3 Clustering

The output of Technology Mapping is a netlist with all of its elements mapped to the target FPGA

architecture. During Clustering (Packing), the goal is to transform a netlist of primitives into a netlist

of cluster-level blocks. In general a CLB provides faster interconnects (intra-CLB connections) between

its resident BLEs than the general routing connections between CLBs (inter-CLB connections). Thus

the packer tries to pack as many BLEs into the same CLB as possible to improve delays. Also, by doing

so, more nets are absorbed, which reduces the stress on both placement and routing.

An early popular FPGA clustering algorithm is VPack [3], which has an objective of minimizing logic

cluster count. VPack starts by packing the unclustered logic cell with the largest number of inputs. This

logic cell is called the “seed” of the cluster. The packer then computes the “attraction” value of all the

unclustered logic cells with respect to the current cluster. The “attraction” is defined as the number

of inputs and outputs shared by an unclustered logic cell and the target cluster. Logic cells with the

highest “attraction” values are packed until the current cluster is full. The packer then starts with a

new seed cell and cluster. This process is repeated until all the logic cells are packed.
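The greedy loop described above can be sketched as follows. This is an illustrative approximation of VPack's flow, not its actual code: "attraction" is simplified here to the number of shared nets, and the function and variable names are hypothetical.

```python
def vpack_sketch(cells, cell_nets, cluster_size):
    """Greedy VPack-style clustering sketch.

    cells: iterable of cell ids.
    cell_nets: dict mapping each cell to the set of nets it touches.
    "Attraction" is simplified to the number of nets a candidate cell
    shares with the cluster currently being filled.
    """
    unclustered = set(cells)
    clusters = []
    while unclustered:
        # Seed each new cluster with the cell touching the most nets.
        seed = max(unclustered, key=lambda c: len(cell_nets[c]))
        unclustered.remove(seed)
        cluster = [seed]
        cluster_nets = set(cell_nets[seed])
        while len(cluster) < cluster_size and unclustered:
            # Highest-attraction candidate: most nets shared with cluster.
            best = max(unclustered, key=lambda c: len(cell_nets[c] & cluster_nets))
            if not cell_nets[best] & cluster_nets:
                break  # nothing connected remains; start a fresh cluster
            unclustered.remove(best)
            cluster.append(best)
            cluster_nets |= cell_nets[best]
        clusters.append(cluster)
    return clusters

nets = {"a": {1, 2}, "b": {2, 3}, "c": {3}, "d": {9}}
print([sorted(c) for c in vpack_sketch("abcd", nets, 3)])
# -> [['a', 'b', 'c'], ['d']]
```

T-VPack's extension would add a timing-criticality term to the attraction value computed in the inner loop.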

VPack is effective at reducing the number of clusters, but has the disadvantage of ignoring timing

information during clustering. An enhanced version of VPack, “T-VPack”, was developed to take timing

into account [19]. T-VPack added timing criticality information to the “attraction” formulation. The

criteria for packing a cluster is based on both connectivity and timing criticality. Results from T-VPack

showed good improvement in both wirelength and circuit delay compared to VPack.

Neither VPack nor T-VPack, however, was designed to deal with heterogeneous architectures and

blocks other than CLBs. Hard blocks often have different properties than CLBs so homogeneous packers

cannot be directly applied. Furthermore, both VPack and T-VPack assume that a full crossbar is

available in each CLB so full connectivity exists between all resident BLEs. This optimistic assumption

allows packers to avoid a routing legality check inside CLBs thus shortening runtime. In reality, however,

there is only a partial crossbar available inside each CLB to keep wiring and area costs in check. A new

timing-driven packer, AAPack [16], was developed to deal with these issues. AAPack is similar to VPack

and T-VPack in its basic CLB packing algorithm with a few additional greedy techniques to improve

QoR and runtime. It supports more general packing, including packing hard blocks like RAMs and

DSPs. Furthermore, it performs a routing legality check for connections inside blocks to ensure packing

correctness. Overall AAPack can handle heterogeneous architectures and produces better QoR than its

predecessors.


Prior research has also proposed the idea of combining clustering with placement [8] to allow logic cells to move between clusters during placement to “repair” earlier low-quality clustering decisions.

2.2.4 Placement

During placement, the placer tries to find a legal solution such that all blocks are assigned to a valid

location on the target FPGA chip. It also optimizes for several design goals, such as wirelength, circuit

delay, routing congestion, and power consumption. Because the locations of the blocks are normally

fixed after the placement stage, and placement has a high impact on routing quality, it is very important

that the placement heuristic can produce a high quality solution. Most FPGA placement algorithms

can be categorized into three types: Simulated Annealing, Analytical, and Partitioning-Based.

Simulated Annealing

Simulated Annealing (SA), as its name implies, mimics the metal annealing process. During annealing,

metal is heated to a high temperature to allow atoms to move more freely to find better locations and

relative arrangements. The metal is then cooled down to let atoms settle and form strong bonds between

them. At the beginning of the SA process, a random valid placement is used as the starting point along

with an initial temperature. The block movements are controlled by a cost (objective) function, which

can be defined to focus on one or a few design goals.

The placer proposes moves randomly during the SA process. Each potential move, which is either

an exchange of locations between two blocks of the same type, or a movement of a block to a valid

empty space, would induce a change in the placement cost. This change, or “∆C”, is used by the

annealer to make move decisions. Similar to the metal annealing process, moves have a higher chance

to be accepted at higher temperatures even if the delta cost is positive (increasing the total cost, i.e. a

“bad move”). It is necessary to do this (“hill climbing”) during SA to reduce the chance of becoming

entrapped at local minima. “Good moves”, moves that reduce the overall cost, are always accepted. As

the temperature drops, the annealer will accept fewer “bad moves”. The SA process will eventually stop

when a predefined exit criterion is met.
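The move-acceptance rule described above can be sketched in a few lines. This is a generic Metropolis-style sketch under simple assumptions (a scalar ∆C and temperature); VPR's actual annealing schedule and range limiting add more machinery, and the name `sa_accept` is illustrative.

```python
import math
import random

def sa_accept(delta_cost, temperature, rng=random.random):
    """Metropolis-style acceptance rule for an SA placer.

    Good moves (delta_cost <= 0) are always accepted.  Bad moves are
    accepted with probability exp(-delta_cost / temperature), so hill
    climbing is common at high temperatures and rare at low ones.
    """
    if delta_cost <= 0:
        return True
    if temperature <= 0:
        return False
    return rng() < math.exp(-delta_cost / temperature)

# At a high temperature a bad move is almost always accepted ...
print(sa_accept(1.0, 100.0, rng=lambda: 0.5))  # -> True (exp(-0.01) ~ 0.99)
# ... while near zero temperature the same move is rejected.
print(sa_accept(1.0, 0.01, rng=lambda: 0.5))   # -> False (exp(-100) ~ 0)
```

The same function serves the whole anneal: only the temperature passed in changes as the schedule cools.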

One of the advantages of SA placement is the versatility of its cost function. This allows it to

combine different optimization goals and arbitrary constraints for different architectures. However,

producing good QoR using SA placement generally requires a long annealing time. Therefore much

research effort has been on reducing runtime for SA placement. In our work, the main focus is also

on reducing runtime, while minimizing the impact on quality. A popular academic FPGA CAD tool,

Versatile Place and Route (VPR), which uses SA as its placement heuristic, is used throughout this

work. VPR and its SA placement algorithm will be described in detail in Section 2.3.

Analytical Placement

Analytical placement (AP) is another popular placement technique originally developed for ASICs. It

models all multi-pin nets using a net model, such as the clique or star model. It uses an objective

function, commonly the sum of the squared wirelength between every pair of nodes in each net. The

optimal locations for all nodes are obtained by solving the quadratic function for the X and Y dimensions

separately. The solution, however, normally contains block overlaps, and additional legalization steps are

required to remove the overlaps. One of the AP legalization techniques involves adding fixed “pseudo”

Chapter 2. Background 10

pins along the direction of desired movement of each overlapped block. By including those “pseudo”

pins as part of the netlist and solving for a new placement, the overlapped blocks are “spread out” by

the force from the “pseudo” pins. Several iterations of this legalization step may be necessary to obtain

a high quality result. A leading FPGA AP placement tool is FastPlace [26].

The main advantage of AP is its faster runtime compared to SA placers. However, its cost function cannot easily model discrete or non-quadratic design goals, such as connection delays. These goals are normally added to the objective function as weights. Furthermore, heterogeneous architectures have hard blocks which differ from traditional soft logic blocks in both size and distribution. Normally there are only specific locations on an FPGA chip where hard blocks can be placed. Solving all block types together results in an invalid solution which requires more legalization steps. Recently, a new AP placer, HeAP [12], was developed to handle heterogeneity. It alternates traditional global AP solves over all block types with AP solves for each block type individually. Experiments showed that HeAP produces similar QoR to a leading commercial SA placer, with less runtime. However, at high placer effort, HeAP still lags an SA placer in QoR by 9%.

Partitioning Placement

Partitioning-based placement (PP) divides a circuit into subcircuits and assigns them to subregions of

the FPGA chip. A typical PP placer, such as Capo [25], starts partitioning by dividing the circuit into

two subcircuits (“bisection”) and the chip region into two subregions. The placer then assigns each

subcircuit to a single subregion. This process continues until the subregions and subcircuits are small enough that all modules can be placed easily. During each bisection, the placer tries to minimize the number of connections between the subcircuits to reduce wire usage. At the end of the partitioning process, additional steps are required to complete the detailed placement inside each subregion and further improve the result.

Like the AP placement algorithms, most PP placers were created for ASICs and designed for homogeneous architectures. The partitioning process works well on chips that have a balanced resource distribution, which is true for homogeneous architectures. For modern heterogeneous architectures, hard blocks are not always distributed equally across the chip, so it is difficult for PP placers to efficiently "cut" the chip region such that resources are balanced in each subregion.

Other Placement Techniques

In the past, the speed increase of computer processors was able to provide enough speed up for the

FPGA CAD tools to keep up with the FPGA chip capacity growth. However, in the last decade, the

rate of computer processor speed increase has slowed down. Instead of making faster cores, most CPU

manufacturers now provide multiple cores to increase system parallel performance. Some research has

focused on implementing parallel FPGA placement algorithms. However, determinism and data conflict

between threads have been the two main challenges when performing placement in parallel. Determinism

requires a placement tool to always produce an identical sequence of moves and hence an identical result

with the same set of inputs. This is important for measuring placement tool quality and program

debugging. Thread data conflicts can happen when different threads propose moves that affect the same

block or net.

In [15], the parallelized SA placer processes multiple move proposals at the same time so many moves

can be evaluated and possibly accepted. This implementation achieves up to 2.1x speedup over the


sequential version while maintaining determinism. However, the performance does not scale well beyond

4 cores (threads).

Another promising parallel implementation is [29]. This method uses a partitioning-like approach to

divide the FPGA chip into regions. The number of regions equals the number of processor cores available.

Each region is further divided into subregions and only one subregion from each region is optimized at

a given time. This guarantees there will be no overlap between two subregions so no move conflicts can

occur. Each subregion also has a small “extended region” overlapping with the neighboring subregions

to allow some inter-region movement. The placer periodically pauses and allows subregions to exchange placement updates with the other subregions, limiting how stale placement information from other regions

can become. This algorithm can achieve more than 35x speedup on a 25-core system with 8% increase

in delay and 11% increase in wirelength. Further experiments showed good scalability beyond 25 cores.

The follow-up work [29] improves the region decomposition so that inter-region and cross-chip movements are

easier and more frequent. The improved algorithm achieves 51x speedup on a 16-core system. Despite a

great speedup in runtime, this implementation has a significant QoR degradation and was only applied

to a homogeneous architecture.

2.2.5 Routing

Routing is the final step of the FPGA CAD flow, and its main goal is to establish all pin-to-pin connections between all connected blocks. FPGA routing is different from ASIC routing because it can only use the prefabricated resources, such as wire segments, Connection Boxes (CBs), and Switch Boxes (SBs). Routability is a measure of how easily a router can successfully route all connections. Other important optimization parameters include wirelength, delay, congestion, and power. Modern FPGA routers perform global and detailed routing in one step but normally require several iterations to remove congestion and achieve good QoR [3] [21].

2.3 VPR and Placement Quality

Versatile Place and Route (VPR) [3] [17] is, perhaps, the most popular and successful academic FPGA

place and route tool available. Introduced more than ten years ago, VPR is open-source software, and

over the years developers have been adding features to allow it to keep up with new FPGA technology

trends. Although there is a significant quality gap between VPR and the leading commercial FPGA

CAD tools [23], VPR provides accurate feedback and a useful framework for much research. It offers the

flexibility to explore different architectural designs, which the commercial tools lack. VPR is also part

of the Verilog-to-Routing (VTR) flow [24], which is an open source FPGA CAD flow package that offers

a complete synthesis flow solution, as shown in Figure 2.5. All work and experiments in this thesis have

been conducted with the VTR flow.

2.3.1 Placement Quality Metrics

Design cost and circuit speed are the two most important metrics for most IC designs. The cost of a

design is proportional to the area required. The area of a design is directly related to the number of logic

elements (CLBs, RAMs, DSPs, etc.) used in a design. This count is normally determined by the upstream logic synthesis tools and clustering flow, which try to ensure the minimum resources are required. Cluster


Figure 2.5: VTR CAD Flow [24].


count generally stays unchanged throughout the rest of the flow after clustering (with the exception of

Physical Synthesis, which VPR does not employ). Another important design parameter is power, which

is affected by the amount of wiring used. A high quality placement solution usually requires fewer wires

in a wiring channel.

Circuit speed has been one of the two main objectives in academic FPGA placement tools along with

wirelength. This is commonly measured as “critical path delay”, which is the delay of the slowest path

of the circuit. By using a cost (objective) function, the placer can focus on minimizing this delay so the

final circuit can run at maximum speed without any timing violation.

Another metric for measuring placement quality is the amount of CPU runtime required. High

runtime means a longer design turn-around cycle, and thus can increase the design cost. Besides its own

runtime, placement quality may also impact the runtime required for the routing stage. A high quality

placement solution often provides better routability, which leads to reduced router runtime.

2.3.2 Cost Function

The ability to model different objectives is essential to produce high quality placement solutions. How-

ever, to accurately model every objective requires a lot of resources and high runtime. Also, the model

has to be flexible so that more than one type of metric, or any new type of metric, can be easily integrated.

There are a few different ways to model the metrics for different types of placers. In this thesis we are

focusing on Simulated Annealing placement and VPR, which uses a cost function to model wirelength

and circuit delay.

The current timing-driven cost function in VPR is the sum of the normalized Bounding Box (BB) and

timing costs. A “trade-off” parameter λ is used to adjust the fraction of placement effort on each cost

term during annealing. In Equation 2.1, when λ is 0 all placer focus is on BB (wirelength) minimization,

while λ set to 1 makes the placer focus on timing optimization only. By default λ is set to 0.5, which is

believed to have a balanced placement effort on both cost terms [18].

Total Cost = (1 − λ) ∗ BB Cost + λ ∗ Timing Cost (2.1)

Bounding Box Cost

The most accurate wire model would be a full routing, but this leads to an extremely high runtime as the

routing must be re-computed for each placement perturbation. Most SA placement algorithms use the

Half Perimeter Wire Length (HPWL) to estimate routed wirelength. As its name implies, the HPWL is

half of the perimeter of the smallest rectangle (bounding box) that encloses all the pins of a given net.

Placers only need to keep track of the minimum and maximum values of the X and Y coordinates for

each net (4 values per net).

In VPR, the total BB cost is calculated once in the beginning of annealing. For any movement

of a net’s terminals, the BB is only affected when either a block is moved outside of the original BB

(increasing the BB size) or a block on the BB boundary is moved inside of the BB (possibly decreasing

the BB size). In either case, if an update is required, the placer only needs to update the coordinate

information of the BB by looking at a limited number of net terminals instead of all the terminals of

a net. This technique is described as the "Incremental Bounding Box" [4], which runs in O(1) time on average for any n-terminal net.
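To make the bookkeeping concrete, the HPWL and incremental update described above can be sketched as follows. This is a minimal illustration, not VPR's actual code: the Pin and BBox structures and all function names are hypothetical, and the fallback rescan here stands in for the per-edge terminal counts VPR keeps to make the average update cost O(1).

```cpp
#include <algorithm>
#include <cassert>
#include <climits>
#include <vector>

// Hypothetical pin location on the FPGA grid.
struct Pin { int x, y; };

// Bounding box of a net: only 4 values need to be stored per net.
struct BBox { int xmin, xmax, ymin, ymax; };

// Full O(n) scan over an n-terminal net (done once at the start of annealing).
BBox compute_bbox(const std::vector<Pin>& pins) {
    BBox bb{INT_MAX, INT_MIN, INT_MAX, INT_MIN};
    for (const Pin& p : pins) {
        bb.xmin = std::min(bb.xmin, p.x);
        bb.xmax = std::max(bb.xmax, p.x);
        bb.ymin = std::min(bb.ymin, p.y);
        bb.ymax = std::max(bb.ymax, p.y);
    }
    return bb;
}

// Half-Perimeter Wire Length: half the perimeter of the bounding box.
int hpwl(const BBox& bb) {
    return (bb.xmax - bb.xmin) + (bb.ymax - bb.ymin);
}

// Simplified incremental update when one terminal moves from `from` to `to`.
// Growing the box is O(1); shrinking triggers a full rescan here, whereas VPR
// also tracks how many terminals sit on each edge so the average stays O(1).
BBox update_bbox(const BBox& bb, const std::vector<Pin>& pins_after_move,
                 Pin from, Pin to) {
    bool was_on_boundary = (from.x == bb.xmin || from.x == bb.xmax ||
                            from.y == bb.ymin || from.y == bb.ymax);
    if (!was_on_boundary) {  // the box can only grow
        BBox nb = bb;
        nb.xmin = std::min(nb.xmin, to.x);
        nb.xmax = std::max(nb.xmax, to.x);
        nb.ymin = std::min(nb.ymin, to.y);
        nb.ymax = std::max(nb.ymax, to.y);
        return nb;
    }
    return compute_bbox(pins_after_move);  // fallback: full rescan
}
```

Moving an interior terminal leaves the box, and hence the HPWL estimate, unchanged, which is exactly why only boundary moves need any real work.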


Critical Path Delay and Timing Cost

A path delay in a circuit is the amount of time required for a signal to travel from a source (a primary

input or register output pin) to a sink (a primary output or register input pin). The critical path is the path with the least amount of slack (in our experiments, only one timing domain is present, so the critical path is also the largest-delay path in the circuit). This delay directly determines the maximum speed (frequency) at which the circuit can operate.

To accurately model the delay of each connection, the placer builds a delay lookup table at the

beginning of annealing. The delays in this table are calculated according to every possible ∆X and ∆Y

distance between any two blocks in the circuit. The placer uses a router to route a connection of each

distance pair to evaluate the delays (i.e. it assumes the best routing path can be achieved). The delay

values are then stored in a ∆X-∆Y lookup table. The placer can quickly retrieve the delay values from

this table given the ∆X and ∆Y distance between two locations.

Timing information is modeled using a directed graph G(V,E), which is a “timing graph”. Each

pin is a node of the graph, and nets connecting them become edges between the nodes. Each edge has

a delay value assigned to it reflecting a cell delay or estimated routing delay. The “timing graph” is

also built once at the beginning of annealing, and the edge delays on it are periodically updated during

annealing.

The annealer recomputes the timing information periodically through "Timing Analysis". The following is a brief description. For a given node j, its arrival time is defined as in Equation 2.2:

Tarr(j) = 0, for all source nodes j
Tarr(j) = max {Tarr(i) + delay(i, j)}, over all edges (i, j) in E (2.2)

Once the Arrival Time (Tarr) for all nodes is assigned by a forward traversal of the timing graph,

the node with the highest Tarr terminates the path with the highest Path Delay (Dmax). This path is the most critical

path of the circuit, and Dmax is the “critical path delay”. The timing analyzer then sets the Required

Time (Treq) for each sink to Dmax, and does a backward traversal to find all the Treq for each upstream

node. At a node i with an outgoing edge from node i to j, Treq is:

Treq(i) = min {Treq(j) − delay(i, j)}, over all outgoing edges (i, j) (2.3)

Slack is the difference between the Required Time (Treq) and Arrival Time (Tarr). The critical path

is the path with the least slack. Slack also indicates how much more delay a given edge can add before

it becomes critical.

Slack(i, j) = Treq(j)− Tarr(i)− delay(i, j) (2.4)

Criticality is an important parameter for calculating the timing cost. It indicates how close in timing

a given path or connection is to the most critical path. For the same amount of delay change, the timing

cost is larger for more critical connections than less critical ones. In VPR, the criticality values are

calculated as per Equation 2.5. Criticality decreases as the slack value increases. The criticality difference

between high and low criticality connections is controlled by the exponent term “crit exp”, the Criticality

Exponent. For example, when this value is set to 0, all connections have the same criticality value. When

it is set to 1, each connection is assigned a criticality value as a linear function of slack. The higher the


criticality exponent is set, the larger the criticality gap between the high and low slack connections.

Criticality(i, j) = (1 − slack(i, j) / Dmax)^crit_exp (2.5)

Finally, the timing cost is the sum of the product of Criticality and Delay for every connection in

the circuit.

Timing Cost = Σ Criticality(i, j) ∗ delay(i, j), over all connections (i, j) (2.6)
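Equations 2.2 through 2.6 can be combined into one small timing-analysis sketch. This is an illustrative simplification rather than VPR's implementation: it assumes the timing-graph nodes are numbered in topological order (sources first) and the edge list is sorted by source node, and the Edge and TimingInfo names are hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <vector>

// Hypothetical timing-graph edge: nodes (pins) are identified by index.
struct Edge { int src, dst; double delay; };

struct TimingInfo {
    std::vector<double> t_arr, t_req;
    double d_max;  // critical path delay
};

// One pass of timing analysis, following Equations 2.2 and 2.3.  Requires
// nodes numbered in topological order and edges sorted by source index.
TimingInfo timing_analysis(int num_nodes, const std::vector<Edge>& edges) {
    TimingInfo t;
    t.t_arr.assign(num_nodes, 0.0);          // sources start at 0
    for (const Edge& e : edges)              // forward traversal: Eq. 2.2
        t.t_arr[e.dst] = std::max(t.t_arr[e.dst], t.t_arr[e.src] + e.delay);
    t.d_max = *std::max_element(t.t_arr.begin(), t.t_arr.end());
    // Initialize every node's required time to Dmax; the backward traversal
    // (Eq. 2.3) then tightens the interior nodes, leaving sinks at Dmax.
    t.t_req.assign(num_nodes, t.d_max);
    for (auto it = edges.rbegin(); it != edges.rend(); ++it)
        t.t_req[it->src] =
            std::min(t.t_req[it->src], t.t_req[it->dst] - it->delay);
    return t;
}

double slack(const TimingInfo& t, const Edge& e) {         // Eq. 2.4
    return t.t_req[e.dst] - t.t_arr[e.src] - e.delay;
}

double criticality(const TimingInfo& t, const Edge& e, double crit_exp) {
    return std::pow(1.0 - slack(t, e) / t.d_max, crit_exp);  // Eq. 2.5
}

double timing_cost(const TimingInfo& t, const std::vector<Edge>& edges,
                   double crit_exp) {                      // Eq. 2.6
    double cost = 0.0;
    for (const Edge& e : edges) cost += criticality(t, e, crit_exp) * e.delay;
    return cost;
}
```

Note that with crit_exp set to 0, criticality evaluates to 1 for every edge and the timing cost degenerates to the total estimated delay, matching the behavior of the criticality exponent described above.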

2.3.3 VPR Placement Flow

VPR uses Simulated Annealing as its placement heuristic [3], and its pseudo code is shown in Figure

2.6. It starts with a random legal placement, and the cost of the initial placement is computed. After

each placement change, the new cost is compared with the current cost to determine the change in cost,

or ∆C. An initial temperature value is determined to indicate the starting temperature of the annealing

process. The number of total moves before reassessing the annealing temperature is determined from

both the total number of blocks and an inner loop factor, “inner num”, as shown in Equation 2.7.

move lim = inner num ∗ (Nblock)^(4/3) (2.7)

"Inner num" is a user-settable parameter that controls how much placer effort (the number of moves performed at each annealing temperature) is spent per temperature iteration. The move range limit,

“rlim”, which restricts how far a block can move away from its initial location, is set to the size of the

whole chip initially to allow blocks to move across the entire chip. The annealer updates the temperature

and rlim after each iteration.

The new temperature and rlim depend on the acceptance rate of the proposed moves of the previous

temperature. During each loop of the annealing process, the placer calls its move generation function to

propose a potential move. The move generator chooses a random block and then tries to find a random

block (or empty space) for this block to move to within the current rlim limits. If a block (or empty

space) is found, the placer uses its cost functions to calculate the delta cost of the potential move. The

placer then decides whether to accept or reject the proposed move based on Equation 2.8. The acceptance

probability depends on the delta cost and current annealing temperature. The move is always accepted

when its delta cost is negative; that is the potential move can reduce the total placement cost. The move

can be either rejected or accepted when its delta cost is positive depending on the annealing temperature.

A “bad move” has higher chance to be accepted when the annealing temperature is high, which makes

the annealer less likely to become entrapped in local minima.

Move Accept Probability = e^(−∆C/T) (2.8)

The placer needs to update the location information of the affected blocks and update the overall

cost of the current placement after accepting a move. Otherwise no action is required if the move is

rejected. To save runtime, the placer uses the delta cost to update the total cost incrementally during

each iteration. As the temperature drops, the rlim shrinks to restrict block movement so that move

proposals have a higher chance to be accepted. This is because prior research has shown that as the

annealing proceeds, more blocks are already placed at or near their optimal regions, so proposals to move them outside of those regions would more likely be rejected by the placer.
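The two numeric rules above, Equations 2.7 and 2.8, can be sketched as small helper functions. These are hypothetical illustrations of the formulas only, not VPR's code; rand01 is assumed to be a uniform random number in [0, 1) supplied by the caller.

```cpp
#include <cassert>
#include <cmath>

// Moves per temperature, following Equation 2.7: inner_num * Nblock^(4/3).
int move_lim(double inner_num, int num_blocks) {
    return static_cast<int>(inner_num * std::pow(num_blocks, 4.0 / 3.0));
}

// Move acceptance test, following Equation 2.8: good moves (delta_c <= 0)
// are always accepted; bad moves are accepted with probability e^(-dC/T),
// which shrinks as the temperature drops.
bool accept_move(double delta_c, double temperature, double rand01) {
    if (delta_c <= 0.0) return true;
    return rand01 < std::exp(-delta_c / temperature);
}
```

At high temperature even a sizable positive delta cost passes the test most of the time (the hill-climbing behavior described above), while near the end of annealing the same delta cost is almost always rejected.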


Figure 2.6: VPR Placement Pseudo Code [3].


2.4 Prior Work on Annealing Improvement

Much research has been focused on producing the same QoR with a reduced amount of runtime, i.e. to

make the placement process more efficient. The majority of SA placement research can be divided into

two main groups, “move generation” and “cost function”, where each focuses on annealing improvement

from a different perspective.

The “move generation” group focuses on how a SA placer proposes block movements. Originally SA

implementations proposed only random moves and relied on high runtime to converge to good QoR.

By “directing” the movements, prior “directed moves” techniques have been shown to be effective in

producing the same level of QoR with less runtime.

SA placers use cost (objective) functions to judge the quality of the current placement and of each

proposed move. Prior work has focused on either improving the accuracy, or adding more flexibility in

the cost functions. Those works have shown improvements in QoR with a small runtime penalty.

2.4.1 Homogeneous Directed Moves

In [28], the authors experiment with several “directed move” techniques for improving both delay and

wirelength cost. For wirelength cost, they used a concept from their prior work [27], called "Median Region (MR)", which places a block in one of the optimal wirelength locations given the current placement of the blocks to which it is connected. This technique first gathers the X- and Y-direction minima and maxima of the bounding boxes of all of a given block's connected nets, recomputed without the block's own coordinates. The MR is bounded by the median ([n/2], where n is the total number of X- or Y-direction bounding box coordinates) index and median-plus-one ([n/2 + 1]) index for both the X and Y directions.

The effectiveness of MR moves is demonstrated in Figure 2.7. The original Bounding Box cost for the "Selected Block" is 36 units, and it is connected to 3 nets. To calculate the selected block's median region, the first step is to exclude the selected block from all of its nets as shown in Figure 2.7b. The next step is to collect the new X and Y minima and maxima. The MR is then bounded by the "median" and "median+1" indices of the new X and Y coordinates, as shown in Figure 2.7c. Finally, after moving the selected block to a random location inside its "median region", the new Bounding Box cost is 32 units, a 4-unit reduction from the original.

In [28] it is shown that the best QoR occurs when MR moves are mixed with a certain amount of "random moves", as placement could become entrapped in a local minimum if too many MR moves are

performed during the earlier stages of annealing. Another observation [28] made is that when swapping

two blocks, as shown in Figure 2.8 for block A and B, the forward MR move (move block A to B)

produces a reduction in total cost, but the effect of the return move (B to A) is unknown. The return

move can result in an increase in cost which would reduce the effectiveness of MR moves. In [28], they

introduced the idea of “shuffling” to deal with this issue. During “shuffling”, as shown in Figure 2.8a,

the placer tries to find an empty spot (location E) within a certain range from block B towards block

A. If such a location is available, all the blocks along the path from block B to the empty location,

including block B itself, are shuffled towards the empty spot, and the result is shown in Figure 2.8b.

This technique reduces the possible disturbance to the local placement.

By using MR with this “shuffling” technique mixed with “random moves”, [28] is able to achieve a

5% reduction in wirelength cost with a 2% increase in critical path delay on homogeneous architectures.


(a) Selected Block belongs to 3 nets (BB cost: 36 units).

(b) Exclude "selected block" to recompute the BB for each net.

(c) Use new BBs to compute the "median region" for the "selected block".

(d) Move "selected block" to a random location inside its MR (new BB cost: 32 units).

Figure 2.7: Median Region Move [28].

To overcome the delay cost penalty, they found that by moving blocks into their “feasible regions”, an

idea that was introduced in [8], the critical path delay can be improved by 3%. The “feasible region”

is the overlapping area of all bounding boxes that enclose a block (on the critical path) with all of its

critical inputs and its immediate output. By moving blocks along the critical path to their “feasible

regions”, the connections between all critical path blocks are shortened and thus the critical path delay

is reduced.

[28] also showed that MR moves yield better QoR when applied to medium-utilization (60%) FPGAs

versus high-utilization (more than 95%) designs. This is likely due to the higher probability of locating

empty spots inside median regions.

2.4.2 Delay Budgets

In timing-driven placement, slack is the difference between the critical path delay and the delay of a given

path, and indicates how much extra delay a path can accommodate before it becomes critical. Criticality

is computed using slack to estimate the timing importance of connections in the current placement. A

timing-driven SA placer uses criticality to compute the Timing Cost to guide its placement decisions.

Traditionally, most SA timing cost functions, such as the one in VPR, focus on high criticality connections

but pay little attention to low criticality ones, as observed by [11]. This neglect may cause some low-criticality paths to become highly critical, or even the new critical paths.

In [11], the authors implemented a new timing cost function to solve this issue in routing. The new

timing cost function uses Delay Budgets instead of absolute delay to measure connection criticality. For

each connection, the Delay Budget is the sum of the connection delay and some amount of path slack

assigned to it. It indicates the maximum delay each connection is allowed to have before it becomes

critical. In [11] the modified router has a slack allocation step to assign path slacks to connection delay

budgets. There are many slack allocation methods available, but most of them are runtime intensive.

The Iterative-minimax-PERT method [31] is used in [11], but it requires many iterations and repeating


(a) Before “cell shuffle”.

(b) After “cell shuffle”.

Figure 2.8: MR Moves: Cell Shuffling [28].


timing analysis to complete the slack allocation. As reported, the new router achieves a 3.2% delay

improvement with 0.7% more wirelength and a 5.3% runtime penalty.
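For illustration, a much simpler single-pass proportional allocation can be sketched: each connection's delay is scaled by Dmax over the longest path delay through that connection, which equals Dmax − slack. This is not the iterative minimax-PERT method of [31] or the weighted method of [10]; the Conn structure and function name are hypothetical.

```cpp
#include <cassert>

// One connection: its placement-estimated delay and its computed slack.
struct Conn { double delay, slack; };

// Proportional (single-pass) slack allocation.  The longest path through a
// connection has delay Dmax - slack, so scaling by Dmax / (Dmax - slack)
// guarantees the budgets along any path sum to at most Dmax.  A critical
// connection (slack = 0) keeps exactly its current delay as its budget.
double delay_budget(const Conn& c, double d_max) {
    return c.delay * d_max / (d_max - c.slack);
}
```

For example, a two-connection path with delays of 2 each and slack of 2 (against a Dmax of 6) receives budgets of 3 and 3, consuming the entire 6 units of allowed path delay.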

A different slack allocation approach was introduced in [10], which involves using a weight function.

In their proposed method, each connection is assigned a weight equal to the amount of potential delay

improvement such that connections with higher potentials will get more slack allocated than those with

lower potentials. This method was implemented in FPGA routing and achieved an average 14% delay

improvement. However, this method is similar to the Iterative-minimax-PERT method as it requires

multiple iterations to converge, and thus is runtime expensive.

Chapter 3

Move Generation

3.1 Overview

Traditional SA placement algorithms rely on a large number of randomly generated moves to produce

good QoR. For modern large FPGA designs, the amount of runtime required to produce good QoR can

reach an unacceptable level [23]. This chapter investigates several move generation methods that aim at

improving Simulated Annealing (SA) placement heuristic efficiency. These methods were implemented

and tested in VPR in a C++ environment.

As described in Section 2.4, adding some Median Region (MR) moves to an annealer improves wirelength on homogeneous architectures. In our work, we improve on the previous work by adapting MR to work on heterogeneous architectures. In addition, we have also implemented a new "Cell Rippling" technique to avoid bad "return moves", as the previous "shuffling" technique has limitations in a heterogeneous environment.

“Smart Search Space (SSS)” is the second enhancement introduced in this chapter. One of the new

challenges posed by heterogeneous architectures is that the placer can no longer assume the core of the

chip only contains one type of block (CLB), and blocks can be entrapped in regions bounded by other

types of blocks. Most SA placers, including the one in VPR, use a range limit to define a “window”

within which moves can occur at the current annealing temperature. This is effective at avoiding low

acceptance probability moves in the low temperature annealing stages but it also limits block movement.

SSS separates each block type into an individual data structure to avoid inter-type interference in the

move search space. In this section we examine its effectiveness on heterogeneous architectures.
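The idea behind SSS can be illustrated with a per-type container, sketched below. This is a simplified stand-in using a linear scan rather than the actual search data structure evaluated later; the class and member names are hypothetical.

```cpp
#include <cassert>
#include <cstdlib>
#include <map>
#include <string>
#include <vector>

struct Loc { int x, y; };

// Hypothetical "Smart Search Space": one location list per block type, so a
// target search for a CLB never has to step over RAM or DSP columns.
class SmartSearchSpace {
    std::map<std::string, std::vector<Loc>> locs_by_type_;
public:
    void add(const std::string& type, Loc l) {
        locs_by_type_[type].push_back(l);
    }

    // Candidate target locations of the requested type inside the current
    // rlim "window" around the source block.
    std::vector<Loc> candidates(const std::string& type, Loc src,
                                int rlim) const {
        std::vector<Loc> out;
        auto it = locs_by_type_.find(type);
        if (it == locs_by_type_.end()) return out;
        for (Loc l : it->second)
            if (std::abs(l.x - src.x) <= rlim && std::abs(l.y - src.y) <= rlim)
                out.push_back(l);
        return out;
    }
};
```

Because each type keeps its own list, a small rlim window that happens to be "fenced in" by hard-block columns still yields only valid same-type candidates instead of repeated failed probes.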

3.2 Median Region Moves

3.2.1 Motivations

Although random moves are cheap in terms of the runtime per move, the placer can still spend a lot of time searching previously visited or low acceptance probability portions of the search space [28]. Prior work has proposed "Directed Moves" instead of the traditional random moves [28], and it has answered a few questions regarding MR moves.

1. It is best to select the source block randomly.

21


2. The return move may hurt the cost gained from the forward move.

3. Too high a fraction of total moves being MR moves can cause placement oscillation and entrapment

in local minima.

Most modern FPGA research has shifted towards heterogeneous architectures, and it is unknown whether MR moves will still be as effective on such architectures. This section investigates the effectiveness of MR moves for both homogeneous and heterogeneous architectures. In addition, a new cell rippling implementation suitable for heterogeneous architectures is described.

3.2.2 Implementation

As described before, VPR implements SA as its placement heuristic. A “move”, or a “swap” in VPR,

is either an exchange in locations between two blocks or between a block and an empty spot. Like most

SA placers, a VPR “swap” begins by randomly choosing a block as its source block and then randomly

choosing a same-type block (or empty spot) as its target location. The choice of the target location

is bounded by a move range limit, which forms a “window” around the source block, to ensure higher

acceptance probability.

To implement MR in VPR, a few additional steps were added. After choosing the source block,

the placer uses two vectors to hold the X-Y coordinates for all connected nets excluding the source

coordinates. The median values of the new X and Y coordinates are computed, and the MR for the

source block is enclosed by the “median” and “median plus 1” index values. The placer then chooses a

random location within the MR to complete the “swap”, as shown in Figure 2.7. On average the MR

computation takes O(P ∗ K) time, where P is the number of pins of the source block and K is the

average fanout of each net to which the block is connected.
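The median computation just described can be sketched as follows. The Rect structure and function name are hypothetical, and the per-net bounding boxes are assumed to already be recomputed without the moving block's own coordinates.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Per-net bounding box, recomputed excluding the moving block.
struct Rect { int xmin, xmax, ymin, ymax; };

// Median Region sketch: collect the min/max coordinates of every connected
// net's bounding box, sort them, and bound the region by the median and
// median+1 values in each direction (indices [n/2] and [n/2 + 1], 1-based,
// as in the text; n is even since each net contributes a min and a max).
Rect median_region(const std::vector<Rect>& net_bbs) {
    std::vector<int> xs, ys;
    for (const Rect& r : net_bbs) {
        xs.push_back(r.xmin); xs.push_back(r.xmax);
        ys.push_back(r.ymin); ys.push_back(r.ymax);
    }
    std::sort(xs.begin(), xs.end());
    std::sort(ys.begin(), ys.end());
    size_t m = xs.size() / 2;  // median index, converted to 0-based below
    return Rect{xs[m - 1], xs[m], ys[m - 1], ys[m]};
}
```

The placer would then pick a random same-type location (or empty spot) inside the returned rectangle as the swap target.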

Cell Rippling

The movement of a source block to a target location is a forward move, and the move of the target block

(if the target location is not empty) back to the source location is a return move. With MR, the forward

move is likely to improve the total cost but the return move has an unknown effect on cost. In many

cases, the return move can offset some or all of the cost reduction gained from the forward move.

In [28], “cell shuffling” was designed to handle such an issue in homogeneous architectures. The

placer shuffles the target block and blocks along the path towards the closest empty location instead

of moving it to the source location. This reduces the negative effect of a bad return move but risks

creating undesired disruption to the surrounding blocks. As well, it is not always possible to locate an

empty spot within the range limits. This technique works for homogeneous architectures where the core

of the chip is occupied by CLBs, so in most cases there exists a path from the target location to the

chosen empty location. In a heterogeneous FPGA chip, however, this is not always the case. CLBs still

occupy the majority of the chip area but there are also other block types (RAM, DSP, etc.). A "cell

shuffling” path can be blocked by a “fence” formed by a different block type, such as shown in Figure

3.1.

In this work, an alternative technique called “cell rippling” is implemented. “Cell rippling” divides

each move proposal into multiple small move fragments (MFs), where each MF is a move itself. Instead

of moving the target block back to the source location or “shuffling” it to a neighboring empty location,

“cell rippling” moves the target block to its MR. If this displaces another block, that block is also moved


Figure 3.1: CLB “Cell shuffling” path blocked by RAM block.

to its MR in a new MF, and so on. This ripple movement continues until either an empty spot is

found, or a preset maximum allowed MF limit has been reached. This method ensures most blocks along the

“rippling path” are located in their MRs.

An example of “cell rippling” is shown in Figure 3.2. In this example, the maximum allowed MF is

set at 3. In the first step, block A is chosen as the source block, and a random location is chosen from

its MR as the target location. The target location chosen by block A is occupied by block B. For the

next MF, the placer computes the MR for block B, and moves it to a location in its MR that is occupied

by block C. Finally for the last MF, block C will move to a location within its MR if an empty location

is available as in Figure 3.2a, or to block A's original location (if no empty location exists in C's MR) to complete the "move", as in Figure 3.2b.
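The rippling chain can be sketched as a loop over move fragments. This is a simplified illustration with hypothetical names: the grid is modeled as a map from occupied locations to block ids, and pick_mr stands in for choosing a random location inside a block's Median Region.

```cpp
#include <cassert>
#include <map>
#include <utility>
#include <vector>

using Loc = std::pair<int, int>;
using Grid = std::map<Loc, int>;  // occupied location -> block id

// One proposal becomes a chain of move fragments (MFs): each displaced block
// is itself moved toward its MR until an empty spot is found or the MF limit
// is reached, in which case the last block takes the original source spot.
template <typename PickMR>
std::vector<std::pair<int, Loc>> cell_ripple(Grid& grid, int src_block,
                                             Loc src_loc, PickMR pick_mr,
                                             int max_mf) {
    std::vector<std::pair<int, Loc>> moves;  // (block, new location)
    grid.erase(src_loc);                     // the source spot is now free
    int block = src_block;
    for (int mf = 0; mf < max_mf; ++mf) {
        Loc target = pick_mr(block);         // a spot inside this block's MR
        auto it = grid.find(target);
        moves.push_back({block, target});
        if (it == grid.end()) {              // empty spot: chain terminates
            grid[target] = block;
            return moves;
        }
        int displaced = it->second;
        grid[target] = block;
        block = displaced;                   // ripple on with displaced block
    }
    moves.push_back({block, src_loc});       // fall back to the source spot
    grid[src_loc] = block;
    return moves;
}
```

In a real annealer the whole chain would be cost-evaluated as one proposal and accepted or rejected atomically; this sketch only shows how the fragments cascade.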

3.2.3 Experimental Results

The main advantage of using MR moves is that they can produce good QoR with less runtime. In VPR,

placement effort is controlled by the parameter “inner num”, which sets the number of moves to perform

for each temperature. A higher “inner num” means more moves are performed, and QoR is likely

to improve but a longer runtime is required. In this work we perform “inner num” sweeps for each

experiment to compare the QoR of the original VPR and the MR version at various runtime levels.

In prior work, experiments were performed using homogeneous architectures only. In this work, a

homogeneous architecture is also used to ensure the correctness of our implementation. The 20 largest

benchmark circuits from the MCNC [30] suite were used in the homogeneous architecture experiments. A

standard size-4-CLB homogeneous architecture (“k4 N4 90nm.xml”) from the VTR project [24] is used.

The details of the architectures are listed in Table 3.1. The benchmark circuits have an average size of

750 CLBs when using size-4 CLBs. For heterogeneous architecture tests, the 5 largest VTR benchmark

circuits with high block type diversity were chosen from the VTR benchmark set [24], as listed in Table

3.2. The heterogeneous experiment uses a size-10-CLB architecture (“k6 frac N10 mem32K 40nm.xml”)

from the VTR project. This architecture has support for many hard blocks such as RAM and DSP blocks.


(a) Cell Rippling finds an empty spot to complete.

(b) Cell Rippling moves to original source location if no empty spot found.

Figure 3.2: Cell Rippling with 3 MFs.

Table 3.1: Median Region move experiment architecture stats.

                        Homogeneous Arch   Heterogeneous Arch
LUT size (K)            4                  5/6
BLE per cluster (N)     4                  10
Inputs per cluster (I)  10                 40
Fc in                   0.15               0.15
Fc out                  0.25               0.10
Fs                      3                  3
Wire Segment (L)        1                  4


Table 3.2: Stats for the largest 5 circuits from the VTR benchmark suite [24].

Circuit Name    Chip Size  Nets   IO   CLB   RAM  DSP  Total Clustered Blocks
bgm             63         21094  289  2930  0    11   3230
LU8PEEng        53         16278  216  2104  45   8    2373
LU32PEEng       98         54219  216  7544  168  32   7960
mcml            95         52415  69   6615  159  30   6873
stereovision2   86         34476  331  2395  0    213  2939

To fully understand the effect of the MR moves and “cell rippling” implementation in this work,

each benchmark set went through 3 different groups of tests. Test Condition 1 sweeps the percentage of

MR moves and compares QoR to the original “random moves” in VPR. Test Condition 2 takes the best

2 MR move percentages from the Condition 1 test, and sweeps MF from 1 to 5 to find the best value

for MF; values of MF greater than 1 incorporate “cell rippling”. Finally Test Condition 3 repeats the

Condition 2 test with 70% FPGA utilization to check the effect of lower chip utilization on MR moves.

The placement and post-routing results of each test are used to compare the QoR improvement.

All tests were performed using both homogeneous and heterogeneous architectures, and all results are

geometric averages over 5 different placement seeds to minimize CAD noise.

Condition 1: MR Move Fraction Sweep

MR moves have been proven effective at reducing wirelength cost. Intuition would lead one to believe

that using more or entirely MR moves in annealing would achieve the best QoR. However, prior work

has shown that a high percentage of MR moves in fact degrades QoR. To confirm this effect and find

the best MR move percentage, a sweep test of MR percentage is performed in this work. In Figure

3.3, which shows the comparison using the VTR benchmarks, original VPR is shown as the “Org (0%

MR)” points and “100% MR” represents the case when all moves are MR moves. The experiment was run with five “inner num” values (0.1, 0.3, 1, 3, 10) to compare QoR at different CPU effort levels.

The QoR values are geometric averages normalized against the original VPR results at “inner num =

3”. The lower graph of Figure 3.4 demonstrates how the results are clustered into different “inner num”

groups, and the other plots follow the same pattern.

From both Figures 3.3 and 3.4, it is clear that using entirely MR moves degrades QoR. “100% MR”

produced nearly 6% worse delay with no improvement in wirelength over the original VPR. The lower

graph of Figure 3.4 also shows that MR moves are more runtime expensive than “random moves”.

At “inner num = 10”, most MR runs took 10% to 40% longer than original VPR. However, at lower

“inner num” values, MR runs have a much smaller impact on runtime. At “30%”, MR runs gained about

4% in wirelength at “inner num = 0.1” without giving up any delay quality. At higher “inner num”

values, the wirelength gain is smaller. Combining the results from both the VTR and MCNC runs, it is clear that MR move fractions between 30% and 50% give the best trade-off between QoR and runtime.

Condition 2: Cell Rippling MF Sweep

Condition 1 tests suggested that MR moves are most effective when only 30% to 50% of moves are of

MR type. Therefore those were chosen to be the MR percentage values for the Condition 2 test. The

“cell rippling” technique implemented in this work involves triggering a sequence of MR moves until


Figure 3.3: QoR vs. fraction of moves that are MR type for various placer effort levels (VTR Benchmarks).


Figure 3.4: QoR vs. fraction of moves that are MR type for various placer effort levels (MCNC Benchmarks).


either an empty spot is found or the maximum number of MFs is reached. This test sweeps “maximum

allowed MFs” from 1 to 5 in searching for the best QoR and runtime trade-off point.

Similar to the Condition 1 tests, all values are geometric averages over 5 seeds and normalized

against the original VPR values at “inner num = 3”. In Figure 3.5, “1mf” represents experiments with

the maximum allowed MFs set to 1, which corresponds to a “swap” (a forward move followed by a return

move if the target location has a block).

Comparing MF = 3 or 4 to MF = 1 (which has no “cell rippling”) at inner nums 0.1 and 0.3, we can

see that wirelength is reduced by a further 2% as shown in Figure 3.5. Placement runtime is increased

by 9% to 20% at each inner num respectively.

Condition 3: QoR Effect of Chip Utilization

For real world FPGA applications, a typical design only uses about 70% or less of the whole chip.

Therefore it is important to evaluate MR moves at a lower chip utilization level. Figure 3.6 shows the

results of the Condition 2 tests repeated with the chip utilization set to 70%. At lower “inner num”

values, there is up to 5% improvement in wirelength.

Table 3.3 lists the QoR comparisons between the High and Medium chip utilization when using 50%

MR moves for “inner nums” from 0.1 to 1. Each group of results is normalized against the original

placement results (0% of MR moves) at each utilization level. The results show that with Medium chip

utilization, 50% MR moves produces 1% to 2% better wirelength than at High utilization for every placer

effort level. The runtime shows no difference between the two conditions. This experiment shows that MR moves are more effective at improving QoR at lower chip utilization levels, as the chance of finding an empty spot in each median region is higher.

Reducing Runtime: Omitting High Fan-out Nets

The previous experiments show that MR moves are effective at producing lower wirelength than the

original VPR placement. However, these moves normally require 6% to 33% extra placement runtime

depending on the MF level.

Most of the extra runtime is spent in computing MRs for the source blocks, and higher fan-out nets

require a longer runtime to compute their contribution to the MR. In reality, however, high fan-out nets

do not provide much useful information for MR computation as they normally cover much of the chip

area (i.e. their coordinates are unlikely to affect the median values). Therefore we investigate omitting

high fan-out nets from the MR computation. From a quick sweep of the omitted net fan-out size, we found that omitting nets with a fan-out of 5 or higher does not affect QoR.
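A sketch of an MR computation with this fan-out filter follows. The helper names (`nets_of`, `terminals_of`, `loc_of`) are hypothetical, and the MR here spans the per-axis medians of the connected nets' bounding-box edges, one interval per axis:

```python
from statistics import median

FANOUT_LIMIT = 5  # nets this large or larger are omitted, per the sweep above

def median_region(block, nets_of, terminals_of, loc_of):
    """For each small net on `block`, take the bounding box of the net's
    other terminals; the MR spans the medians of those box edges."""
    xmins, xmaxs, ymins, ymaxs = [], [], [], []
    for net in nets_of(block):
        others = [t for t in terminals_of(net) if t != block]
        if not others or len(others) >= FANOUT_LIMIT:
            continue  # high fan-out nets barely shift the medians: skip them
        xs = [loc_of[t][0] for t in others]
        ys = [loc_of[t][1] for t in others]
        xmins.append(min(xs)); xmaxs.append(max(xs))
        ymins.append(min(ys)); ymaxs.append(max(ys))
    if not xmins:
        return None  # no usable nets: the placer can fall back to a random move
    return ((median(xmins), median(xmaxs)), (median(ymins), median(ymaxs)))
```

Because the skipped nets contribute only edges far from the median anyway, dropping them trims the per-move cost without moving the target region much.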

To demonstrate the effect of omitting high fan-out nets, we picked one condition from Table 3.3,

“50% MR with 95% Chip Utilization and 3MF at inner num of 0.3”, which uses 36% more placement

runtime than the original VPR. The results are shown in Table 3.4. The results with high fan-out nets omitted are normalized against both the original VPR and the original MR version (column “All Nets”). The results

show there is no QoR change by omitting the high fan-out nets, and the runtime is reduced by 12% on

average. This is encouraging for making MR moves a practical optimization for SA placement.


Figure 3.5: Median Region Cell Rippling Move Fragment sweep (VTR Benchmarks).


Figure 3.6: Median Region Cell Rippling MF sweep for medium chip utilization (VTR Benchmarks).


Table 3.3: MR QoR Comparison between High and Medium chip utilization (VTR Benchmarks).

Table 3.4: Placement QoR comparison between not omitting and omitting high fan-out nets for MR moves (VTR Benchmarks, normalized vs. original VPR results).


Figure 3.7: Median Region produces the same QoR as the original VPR at lower runtime (VTR Benchmarks).

3.2.4 Conclusion

From all 3 comparisons, MR moves have the ability to reduce wirelength without degrading delay at low

runtimes. By using 50% MR moves, there is an up to 5% improvement in wirelength at low runtimes

(inner num = 0.1, 0.3) and up to 2% improvement in wirelength at medium runtimes (inner num = 1),

while maintaining a similar delay cost to all random moves. However, given a sufficiently long runtime,

MR and random moves converge to a similar QoR level. At inner num = 10, MR runs use 30% to 50%

more runtime to produce the same QoR as random moves, due to the more computationally expensive

move proposals.

At inner num = 3, the runtime gap between using MR moves and the original implementation is

about 20% to 40%. MR moves, however, produce a 2% improvement in QoR at inner num = 3. This

is the same QoR level that “random moves” would produce at inner num = 10, which requires 140%

more runtime. To help clarify this observation, a horizontal line is drawn across the plot in Figure 3.7.

It shows that for the same QoR there is a 50% runtime savings by doing 30% MR moves at inner num

= 3 instead of all random moves at inner num = 10.

Overall 30% to 50% of MR moves are effective at improving wirelength at lower placer effort levels,

and reducing the runtime required to produce high quality results at higher placer effort levels. We also

show that by omitting high fan-out nets in MR computations, the placement runtime can be reduced

by a significant amount without any QoR loss.


Figure 3.8: Size-1 range limit for CLBs in a Homogeneous Architecture.

3.3 Smart Search Space

3.3.1 Motivations

In VPR, the size of the “window”, or move range, is a region defined by rlim grid units in all 4 directions

from the source block coordinates, and clipped to the edges of the chip. The minimum grid unit is

defined by the size of the smallest block type. For example, a move range of 4 means a size-1 block can

move within a 9-by-9 box centred on its current location, and clipped to the chip boundaries. At the

beginning of annealing the size of the “window” covers the entire chip. As temperature drops, the size

of the move range limit shrinks, and eventually drops to the minimum size of 1 grid unit, which means

blocks are only allowed to move within a 3-by-3 box as shown in Figure 3.8.
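A minimal sketch of this clipped-window computation on a 0-indexed nx-by-ny grid (the exact VPR coordinate conventions may differ):

```python
def move_window(src_x, src_y, rlim, nx, ny):
    """Classic VPR-style move range: a (2*rlim + 1)-square box centred on the
    source block, clipped to the chip boundaries (grid 0..nx-1, 0..ny-1)."""
    x_lo, x_hi = max(0, src_x - rlim), min(nx - 1, src_x + rlim)
    y_lo, y_hi = max(0, src_y - rlim), min(ny - 1, src_y + rlim)
    return (x_lo, x_hi), (y_lo, y_hi)
```

For example, rlim = 4 on an interior block yields the 9-by-9 box described above, while rlim = 1 yields the final 3-by-3 box of Figure 3.8.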

In classic homogeneous architectures, the majority of the chip is occupied by CLB blocks, and thus

in most cases all valid locations enclosed by a “window” are the same type. The source block in Figure

3.8, for example, has a total of 8 valid locations to which it can move within a size-1 “window”. In a

heterogeneous architecture, however, there is more than one block type available across the chip. In Figure 3.9, blocks are arranged in columns so that all blocks in each column are the same type. In this example, columns 1, 2, and 4 are CLB type, and column 3 is RAM type. A size-1 “window” still

covers the same amount of area as before, but there are 3 invalid locations as the “window” covers part

of a RAM column. Instead of 8 valid CLB locations, the source CLB block has only 5 valid locations

from which to choose. This reduces the placer search space and prevents the CLB blocks from crossing

the “fence” formed by the RAM column (in Figure 3.9, for example, no CLB from the left of column 3

can reach any location on the right side of column 3, and vice versa).

Hard blocks, such as the RAM block in Figure 3.10, are often spread across the chip and separated by

other block types. These blocks are normally larger than CLB blocks. For example, in Figure 3.10 the size of the RAM block is 4 grid units, which means it is 4 times the size of a CLB block. In


Figure 3.9: Size-1 “window” for a CLB in a Heterogeneous Architecture.

this example, a source is chosen from the middle of a RAM block at column 5. When the “window” has

a size of 1, the source RAM block cannot reach any valid RAM location either horizontally or vertically.

This causes the annealer to waste runtime searching for a non-existent valid location.

These examples show that the classic “window” technique has problems when dealing with blocks

in heterogeneous architectures. A more adaptive method, Smart Search Space (SSS), is developed in

this work to solve these issues.

3.3.2 Implementation

In most VPR architectures, blocks are oriented in columns. In reality, however, blocks can be arranged

in any orientation the FPGA manufacturer may choose. Furthermore, academic researchers often use

tools like VPR to explore new architectures, and this can introduce new non-column based orientations.

Therefore our SSS implementation can handle arbitrary FPGA floorplans and has no column orientation

assumption.

Before annealing starts, SSS scans the entire chip to record every possible X location for each block

type (CLB, IO, RAM, etc.) in a data structure. For each valid X location, SSS also records all the

possible Y values for each block type in the new data structure. The final SSS data is a compressed

search space of all valid locations for each block type. The move range limit is then translated from chip grid floorplan units to an X or Y index range in the data structure (SSS index), as shown in

Equation 3.1. This equation ensures that the original annealing move range limit in physical distance is

mostly maintained for SSS.

For a given source block, SSS first calculates the move range limits in units of SSS index, and then

randomly chooses an index from the resulting X-direction range. The SSS data structure then gives a

list of all possible Y locations at the chosen X coordinate. SSS stores all X and Y indexes in an n×n lookup table, where n is the dimension of the grid. The complexity of locating the source block in the


Figure 3.10: Size-1 “window” for RAM.

SSS data structure is O(1). SSS completes the search by randomly choosing a Y-direction location from

the list bounded by Y-direction move range limits.

SSS Move Range = max(Global Move Range / Average X (or Y) Gap, 1)     (3.1)

Looking at the previous example again in Figure 3.11, with SSS the source CLB block now has 8

valid locations in its “window” and it can go across the RAM column to the other side. For the RAM

example in Figure 3.12, SSS enables it to move to 8 possible locations as opposed to the original 0. SSS increases the search space efficiency of the annealer and the chance of finding valid moves.
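The SSS construction and lookup described above can be sketched as follows. For brevity, this sketch locates the source block's column with a binary search rather than the O(1) n×n lookup table, and Equation 3.1 appears in the range conversion:

```python
import bisect
import random
from collections import defaultdict

def build_sss(grid):
    """Build the compressed Smart Search Space from a chip floorplan.
    `grid` maps (x, y) -> block type. For each type we record the sorted
    valid x columns and, per column, the sorted valid y locations."""
    cols_of = defaultdict(set)
    ys_of = defaultdict(lambda: defaultdict(list))
    for (x, y), btype in grid.items():
        cols_of[btype].add(x)
        ys_of[btype][x].append(y)
    sss = {}
    for btype in cols_of:
        cols = sorted(cols_of[btype])
        sss[btype] = (cols, {x: sorted(ys_of[btype][x]) for x in cols})
    return sss

def sss_target(sss, btype, src_x, grid_rlim, avg_gap):
    """Pick a target for a block of `btype`: convert the grid-unit range
    limit into an index range (Equation 3.1), choose a column index within
    that range, then choose a y location from the chosen column."""
    cols, col_ys = sss[btype]
    idx = bisect.bisect_left(cols, src_x)       # O(log n) here; O(1) with a table
    rlim = max(round(grid_rlim / avg_gap), 1)   # Equation 3.1
    lo, hi = max(0, idx - rlim), min(len(cols) - 1, idx + rlim)
    x = cols[random.randint(lo, hi)]
    return x, random.choice(col_ys[x])
```

Because the index range is over valid columns only, a CLB next to a RAM column can now land on the far side of the “fence”, and a RAM block always has at least one reachable RAM location.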

3.3.3 Experimental Results

SSS mainly targets the “window” issues arising in heterogeneous architectures, so we have tested SSS on such architectures. The architecture used has size-10 CLBs and support for RAM, DSP and other block types (“k6 frac N10 mem32K 40nm.xml”).

To minimize CAD noise, both the original VPR and the SSS version were run with 3 different

placement seeds and all results are geometric averages over three seeds. Table 3.5 shows the comparison

between the two sets of results with all values normalized against the original VPR results. The circuits

are sorted in ascending order of their chip size in terms of X-Y grid size (e.g. size = 5 means a 5×5 grid).

Overall SSS produces similar QoR to the original VPR with 2% less Minimum Channel Width (W). SSS also requires 7% fewer moves compared to the original VPR, which leads to 5% placement runtime and 4% total runtime savings.

Figure 3.11: Size-1 “window” for a CLB with SSS.

Figure 3.12: Size-1 “window” for a RAM block with SSS.

Table 3.5: Smart Search Space results ranked by chip size.

Circuit Name         Chip Size  Min W  Delay  Wirelength  Total Swaps  PL Time  RT Time  Total Runtime
stereovision3             5     0.96   0.95      0.99        0.77        0.85     1.00       0.87
ch intrinsics             8     0.99   0.99      1.02        0.88        0.92     0.94       0.93
diffeq1                  14     1.01   0.99      0.97        1.00        1.00     1.09       1.01
diffeq2                  14     1.00   1.01      0.98        0.93        1.00     1.05       1.02
raygentop                17     1.02   0.99      1.02        0.87        0.92     1.03       0.96
sha                      17     1.01   1.01      0.96        0.85        0.88     1.02       0.90
boundtop                 18     1.02   1.01      1.00        1.00        1.02     0.99       1.02
mkSMAdapter4B            18     0.99   1.01      1.02        0.86        0.91     1.03       0.94
or1200                   25     0.99   1.00      1.01        1.00        1.00     0.80       0.95
mkPktMerge               26     0.82   1.00      1.05        1.02        1.02     0.84       0.94
blob merge               28     1.00   0.99      1.01        0.90        0.92     0.98       0.93
stereovision0            35     0.96   0.97      1.00        0.90        0.92     1.03       0.93
stereovision1            38     0.99   1.00      0.98        0.90        0.94     0.97       0.95
mkDelayWorker32B         48     0.97   1.04      1.03        1.00        0.97     1.04       1.00
LU8PEEng                 53     0.98   0.99      0.98        0.90        0.95     1.08       0.99
bgm                      63     0.98   0.98      0.98        0.95        0.96     1.01       0.97
stereovision2            86     1.01   0.99      0.98        0.94        1.00     1.02       1.01
mcml                     95     0.96   0.99      0.95        0.98        0.97     1.01       0.98
LU32PEEng                98     0.98   1.01      0.99        0.98        0.95     0.95       0.97
Geomean                   —     0.98   1.00      1.00        0.93        0.95     0.99       0.96
Geomean (Largest 5)       —     0.98   0.99      0.98        0.96        0.97     1.01       0.98

More importantly, SSS has better performance for larger circuits. For the largest 5 circuits in this

benchmark suite (bgm, LU8PEEng, LU32PEEng, mcml, stereovision2 ), SSS produces 1% better delay

and 2% better wirelength (as shown in the last row of Table 3.5). There are 5 circuits for which SSS pro-

duces worse QoR than the original VPR (boundtop, mkDelayWorker32B, mkPktMerge, mkSMAdapter4B,

or1200 ) with 1% longer delay and 2% more wirelength on average. These circuits have the highest IO

block count among the entire suite. Therefore this QoR degradation is likely caused by the special

arrangement of IO blocks in VPR, which assumes IO blocks are always on the chip periphery. This

arrangement is very different from all the other block types, which are arranged in columns. Upon further observation, circuits with lower IO block ratios tend to have better QoR when using SSS. However,

IO blocks are not always arranged on the periphery in commercial FPGAs [9], and we expect SSS to

perform slightly better for such architectures.

3.3.4 Conclusion

Smart Search Space is designed to better search the placement space in heterogeneous architectures. By

using SSS, moves of each block type are no longer interfered with by other block types when searching for a valid chip location. It also guarantees the move generator will not search within a range in which no available location exists, and hence increases placer efficiency and requires less placer effort to achieve

the same QoR. The results have shown that SSS produces the same or improved QoR with less runtime.

Moreover, SSS produces better QoR on larger circuits, which is promising for the future.

Chapter 4

Cost Function

4.1 Cost Function Tuning

4.1.1 Motivations

In Section 2.3 we described the timing-driven placement Cost Function of VPR. It consists of both BB

cost and timing cost terms, and each contributes half of the total cost by default. The cost trade-off

parameter, λ, controls the weight of each cost. In this work we sweep different trade-off values to check

whether 0.5, the VPR default, is still the best default value.

Another parameter, “Criticality Exponent (CE)”, controls the criticality magnitude difference be-

tween the high and low criticality paths. In VPR, this parameter changes from the start to the end of

the anneal. CE is set to 1 in the beginning (VPR parameter “td place exp first”) to give only slightly

more focus on high criticality paths than the low ones. As annealing progresses, critical path information

becomes more stable so CE becomes higher to put most annealer focus on highly critical paths. The

schedule by which CE increases is related to the move range limit. The final CE (“td place exp last”)

is currently set to 8 by default in VPR. To the best of our knowledge, no prior research has shown

a comparison between different CE values for heterogeneous architectures ([18] made its comparison using a homogeneous architecture). In this work, we compare different CE values and show their effects on

QoR.
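A small sketch shows how the criticality exponent controls the spread between near-critical and non-critical connections, assuming the standard VPR form in which the base criticality (1 − slack/Dmax) is raised to the power CE:

```python
def criticality(slack, d_max, crit_exp):
    """Connection criticality, assuming the standard VPR form
    (1 - slack/Dmax)^CE: a higher criticality exponent (CE) widens the
    gap between near-critical and non-critical connections."""
    return (1.0 - slack / d_max) ** crit_exp

# A connection with 20% slack barely registers once CE is large:
base = criticality(2.0, 10.0, 1)    # 0.8
mid = criticality(2.0, 10.0, 8)     # ~0.168
high = criticality(2.0, 10.0, 30)   # ~0.0012
```

A zero-slack connection stays at criticality 1.0 for any CE, so raising CE only suppresses the non-critical connections.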

4.1.2 Experimental Results

Cost Function Trade-off Value Comparison

The first experiment sweeps different λ values for the current VPR placement cost function. All the other

parameters are set to their defaults (“td place exp first” set to 1, “td place exp last” set to 8, and “inner num” set to 1). For this experiment and the subsequent experiments in this chapter we

use VTR benchmark circuits, and a default VTR architecture file (“k6 frac N10 mem32K 40nm.xml”).

Figure 4.1 shows the result of the experiment. To save runtime, only placement estimated wirelength

and critical path delay values are shown (i.e. routing was not performed). The X-axis indicates the λ

values are swept from 0.0 (all placement effort on minimizing BB cost) to 1.0 (all placement effort on

minimizing Timing cost), and the Y-axis shows the values normalized against the QoR at λ = 0.5.

Figure 4.1: Cost Function Trade-off Sweep (VTRBM).

The blue line with the “diamond” shape indicators shows the wirelength, and the red line with “square”

shape indicators shows the critical path delay. We have noticed that the wirelength-delay trade-off is

not obvious, so a third value, the Wirelength-Delay Product (WDP), is used to help visualize the overall

effect. In this figure, a green dotted line indicates the WDP. In the rest of this chapter we will use a

similar style for QoR comparisons.

From Figure 4.1, it is shown that by solely focusing on either wirelength (BB cost) or critical path

delay (timing cost) the QoR degradation is large. At λ = 0.0, the placer achieves 4% better wirelength

but slows down the circuit by 12%. At the other extreme, when λ = 1.0, the result shows no improvement

in critical path delay but there is a 35% increase in wirelength.

Figure 4.2 shows the same result as Figure 4.1 but without the two extreme cases. In this range,

the trade-off relationship between wirelength and critical path delay is near linear. The lowest achieved

wirelength is 4% less than the default (at λ = 0.5) at λ = 0.1 or 0.2, and the best achieved critical path

delay is only 1% lower at λ = 0.7 or 0.8. The best WDP is at λ = 0.2 or 0.3 with 1% improvement over

the default but with 1% to 3% worse delay. Overall at λ = 0.5 the placement QoR is close to both the

best critical path delay and WDP.

Criticality Exponent Value Comparison

This experiment aims to compare various CE values and find the value that produces the best QoR. The

initial CE value (“td place exp first”) is held at the default value of 1. A wide range of final CE values

(“td place exp last”) are swept: 1, 5, 8 (default value), 10, 15, 20, 25, 30, 35. λ is also swept from 0.1

to 0.9.

The VTR benchmark suite is used, and placement “inner num” is set to 1.0 for this experiment.

Figure 4.3 shows the WDP comparison between all different CE values and at various λ values. The

Y-axis indicates the WDP normalized against VPR placement at default settings.

Figure 4.2: Cost Function Trade-off Sweep 0.1 to 0.9 (VTRBM).

From this experiment, we see that when λ is equal to 0.7 or higher, the placement QoR becomes quite poor. These λ values are removed from further experiments. The next worst WDP is at λ = 0.6, which offers no favorable QoR over the lower λ lines, so it is also removed.

Figure 4.4 shows the WDP comparison after removing λ = 0.6 and higher. It shows that the best WDP occurs when the final CE is around 30. λ = 0.2 or 0.3 offers the best WDP, which is

about 4.5% lower than when final CE = 8. The WDP line flattens out after the CE = 25 point and does

not improve after CE = 30.

Figure 4.5 compares the critical path delay for λ values 0.1 to 0.5. The plot shows that λ = 0.1 always produces worse delay than the default for all final CE values. λ = 0.5 produces the best

delay, and the lowest delay happens when the final CE is set to 30.

Figure 4.6 shows the wirelength comparison for λ values 0.1 to 0.5. It shows the opposite trend to the delay plot: λ = 0.1 produces the best wirelength and λ = 0.5 the worst. All conditions, however,

produce similar or better wirelength than the baseline value.

Overall this experiment suggests that by setting the final CE to 30, VPR produces better QoR than

the default setting for all λ values. λ = 0.2 produces the best WDP and wirelength, and λ = 0.5 produces

the best delay when setting the final CE to 30. The QoR plots of λ = 0.2 and 0.5 are shown in Figure

4.7 and Figure 4.8.

Figures 4.9 and 4.10 show a comparison of a typical change in criticality over the entire annealing

process (using circuit “or1200 ” from the VTR benchmark suite in this example) for two final CE values,

8 and 30. In VPR, the rate of change of CE is inversely proportional to the rate of change of the move

range limit, rlim, which is plotted as the dotted black line in both graphs. A total of 9 different plots

are shown corresponding to 9 paths with 9 different initial (1 − slack(i,j)/Dmax) values, and we assume their (1 − slack(i,j)/Dmax) values stay constant throughout the annealing.

In both cases, the annealing starts with a move range limit that covers the entire chip and CE =

1. Neither rlim nor CE changes for the first 1/3 of annealing. Once the move range limit starts to

decrease, CE increases inversely to the move range limit and the criticality values drop exponentially.


Figure 4.3: Criticality Exponent Sweep WDP Comparison (λ 0.1 to 0.9).

Figure 4.4: Criticality Exponent Sweep WDP Comparison (λ 0.1 to 0.5).


Figure 4.5: Criticality Exponent Sweep Delay Comparison (λ 0.1 to 0.5).

Figure 4.6: Criticality Exponent Sweep Wirelength Comparison (λ 0.1 to 0.5).


Figure 4.7: Criticality Exponent Sweep QoR Comparison (λ = 0.2).

Figure 4.8: Criticality Exponent Sweep QoR Comparison (λ = 0.5).


Figure 4.9: Changing Criticality Values for Various Critical Paths over an Annealing Process (final CE = 8).

Criticality values at CE = 30 drop much more sharply than at CE = 8.

During the final 40% of annealing, the move range limit reaches its minimum (1 grid unit in VPR),

and the CE and path criticality values also reach their final values. This region shows the biggest

difference between the two cases. For CE = 8, medium criticality paths ((1 − slack(i,j)/Dmax) of 0.7 to 0.9) still have

some impact on timing cost. However, any placement perturbation of low criticality paths at this stage is

unlikely to affect the design timing quality.

For CE = 30, however, the criticality values for all medium and low criticality paths are near 0 in

the final annealing stage. This allows the placer to focus only on fine tuning the timing cost of the

connections associated with the most critical paths. At the end, the placer at CE = 30 produces better

timing quality than CE = 8 with the same wirelength quality and placement effort.

Note that the results also suggest that for timing cost purposes, medium and low criticality paths can be

ignored when the move range limit is so small. Intuitively, when only small perturbations are permitted

(controlled by move range limit), it is unlikely any move will produce a change in delay so large that a

medium or low criticality path becomes more critical than the current critical path.

Figures 4.11 and 4.12 show QoR sweeps for different “inner num”s to ensure the QoR gap between

CE = 30 and CE = 8 holds true at different placer efforts. Critical Path Delay and wirelength are

plotted as solid lines while WDP is plotted as a dotted line. As the plots show, the QoR gap between final

CE = 8 and 30 is consistent through various placer effort levels.

4.1.3 Conclusion

The cost function trade-off (λ) sweep suggests that the QoR at λ = 0.5 is still the most balanced when

using the default VPR placement settings. It produces nearly the best delay and WDP while maintaining a good level of wirelength usage.

Figure 4.10: Changing Criticality Values for Various Critical Paths over an Annealing Process (final CE = 30).

Figure 4.11: QoR comparison between final CE = 8 and 30 for various placer efforts (λ = 0.2).

Figure 4.12: QoR comparison between final CE = 8 and 30 for various placer efforts (λ = 0.5).

The Criticality Exponent Sweep shows that by setting the final CE

to 30 the placer puts more focus on improving the critical path delay during the final annealing stages.

This setting outperforms the default value of 8 in VPR with 3% better critical path delay when λ =

0.5, or 4.6% better WDP when λ = 0.2. Furthermore, we have shown that this trend is consistent over

various levels of placer effort.

4.2 Delay Budget

4.2.1 Motivations

Circuit speed is one of the two main design objectives in the FPGA CAD flow. Traditionally the

placement cost function in VPR lacks the ability to focus on low criticality paths which potentially can

become critical and slow down the design, as observed in [11]. A Delay Budget, which indicates the

maximum delay allowed for a given connection before it becomes critical, is used in [11] to improve

circuit speed and close timing constraints in a commercial FPGA CAD routing tool. To the best of

our knowledge, no prior work has used Delay Budgets in timing-driven SA placement. In this work we

investigate some Delay Budget assignment methods and a new timing cost function for placement in

VPR.

To obtain Delay Budgets for each connection, a Slack Allocation step is necessary. Slack Allocation

assigns all of a path’s slack to its connections, but it is normally runtime expensive. In [11], the Iterative-minimax-PERT method was used, and a few iterations are necessary to finish the allocation. In this work

we first evaluate this method, and then compare it with a new Slack Allocation method that produces

overly-optimistic Delay Budgets.
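For intuition, here is a deliberately simple single-path allocation, not the Iterative-minimax-PERT method of [11]: the path slack is spread over the path's connections in proportion to their delays, so the budgets sum exactly to the required time:

```python
def proportional_budgets(conn_delays, required_time):
    """Single-path slack allocation sketch: each connection's Delay Budget is
    its delay plus a share of the path slack proportional to that delay."""
    path_delay = sum(conn_delays)
    slack = required_time - path_delay
    return [d + slack * d / path_delay for d in conn_delays]
```

In a real timing graph, a connection shared by several paths would have to take the tightest budget over all of its paths, which is roughly why iterative allocation methods are needed.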


Table 4.1: Delta Timing Cost using Original VPR Timing Cost Function.

4.2.2 Delay Budget Timing Cost Function

VPR uses a timing cost as shown in Equation 2.6. For a high criticality connection, any change in

connection delay will result in a large change in timing cost. Figure 4.13 gives an example of when the

current timing cost function fails to identify a “bad move” (for simplicity, we ignore the exact unit of

each value).

In this example, the initial placement (Figure 4.13a) has a total timing cost of 14.01 as shown in

Table 4.1, and path B-C-D is the critical path with a path delay of 12. After the annealer proposes a

swap, the new placement timing graph is shown in Figure 4.13b with all the delay changes indicated on

the edges. The total timing cost of the new placement is 13.86, and the change in timing cost is -0.15.

However, as shown in Figure 4.13b, path A-C-D now has a path delay of 13, which surpasses the original

critical path delay of 12. Therefore the new placement has a new critical path which the timing cost

function failed to identify without a timing analysis.
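This failure mode can be reproduced in a few lines. The connection values below are illustrative stand-ins, not the exact numbers of Figure 4.13:

```python
def vpr_timing_cost(conns):
    """Original VPR timing cost: sum over connections of criticality * delay."""
    return sum(crit * delay for crit, delay in conns)

# Hypothetical (criticality, delay) values for connections B-C, A-C, C-D,
# before and after a move.
before = [(1.0, 6.0), (0.2, 4.0), (1.0, 6.0)]
after = [(1.0, 5.0), (0.2, 7.0), (1.0, 6.0)]

delta = vpr_timing_cost(after) - vpr_timing_cost(before)
# delta is negative, so the move looks good to the cost function, yet the
# low-criticality path A-C-D grew from 4 + 6 = 10 to 7 + 6 = 13, passing
# the old critical path delay of 12 (B-C plus C-D).
```

The weighted sum improves because the large change on the low-criticality connection is heavily discounted, even though it creates a new critical path.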

In this work, we propose a new timing cost function for VPR placement as shown in Equation 4.1.

Timing_Cost_c = Criticality_c * ( Delay_c / Delay_Budget_c + max(0, Delay_c / (db_ratio * Delay_Budget_c) - 1.0) )   (4.1)

The first factor, Criticality, is the same as in the original VPR Timing Cost function. We still use the connection criticality to identify highly critical paths and control placer focus. The second factor,

however, is new and has two components. The first component is the “delay budget normalized delay

value”, which is a ratio of the connection delay and its Delay Budget. The second component is a

“penalty” term which is used to add extra cost to connections that use a high fraction of their Delay Budgets. The fraction level is set by the db_ratio term.

The first component means that connections with a larger Delay_c / Delay_Budget_c ratio will have more weight placed on their delay, while other connections gain more movement freedom. This freedom allows the placer


(a) Timing Information before a placement change.

(b) Timing information after a placement change but before a re-timing analysis.

Figure 4.13: Example where the Timing Cost Function fails to identify a “bad move”.


Table 4.2: Delta Timing Cost using New VPR Timing Cost Function.

to explore a broader search space than with the original cost function, and gives a better chance to find

higher quality placement solutions.

To illustrate this, the Delay Budget values of the connections from the previous example are shown

in the third column of Table 4.2. Note that connection C-D has a smaller Delay Budget value than

connection B-C. This ensures that for the same amount of delay change, connection C-D will have a

larger effect on timing cost than connection B-C even though both have the same criticality. Intuitively,

connection C-D belongs to 2 paths which means its delay change can potentially affect the timing of both

paths, whereas connection B-C only belongs to 1 path so it has less potential impact on path delays.

The second component, the penalty term, makes sure the delay of each connection does not go too

high and become speed limiting. The placer will put less focus on lower criticality connections until

they become too slow, which gives more movement freedom to the blocks which they connect. This

penalty term also effectively tries to speed up slow connections towards the “target” delay values set by db_ratio * Delay_Budget_c.

Finally, the last row in Table 4.2 shows the Delta Timing Cost now is +0.24, which correctly identifies

this move as a “bad move”.
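To make the behavior of Equation 4.1 concrete, the following is a minimal Python sketch of the cost computation for a single connection. The function and variable names are illustrative, not VPR's actual identifiers:

```python
def timing_cost(criticality, delay, delay_budget, db_ratio=0.5):
    """Delay Budget timing cost for one connection (Equation 4.1, sketch).

    Connections well under their budget pay only the small normalized
    term; once delay exceeds db_ratio * delay_budget, the penalty term
    kicks in and grows linearly with delay.
    """
    normalized = delay / delay_budget
    penalty = max(0.0, delay / (db_ratio * delay_budget) - 1.0)
    return criticality * (normalized + penalty)

# A connection at half its budget (db_ratio = 0.5) sits exactly at the
# target delay, so the penalty term is still zero:
print(timing_cost(1.0, 2.0, 4.0))  # 0.5
# Beyond the target, the same connection starts accruing penalty:
print(timing_cost(1.0, 3.0, 4.0))  # 0.75 + 0.5 = 1.25
```

Note how a connection with a smaller Delay Budget (such as C-D in Table 4.2) automatically sees a larger cost change for the same delay change, even at equal criticality.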

4.2.3 Delay Budget Computation

After identifying the problem in the current VPR Timing Cost function and replacing it with an improved

function, the next step is to develop a method to compute the Delay Budget for each connection. In this

work, we first implemented and tested the Iterative-minimax-PERT (PERT) method as used in prior

work in routing [11], and then we introduced a fast but overly optimistic Delay Budget computation

method called Direct Slack.


Table 4.3: PERT Delay Budget Experiment Architecture Stats.

Experiment               exp1    exp2    exp3    exp4
Benchmark                VTR             MCNC
Arch File                arch1   arch2   arch3   arch2
LUT size (K)             5/6     5/6     4       5/6
BLE per cluster (N)      10      4       4       4
Inputs per cluster (I)   40      16      10      16
Fc_in                    0.15    0.15    0.15    0.15
Fc_out                   0.10    0.10    0.25    0.10
Fs                       3       3       3       3
Wire Segment (L)         4       4       1       4

4.2.4 Iterative-minimax-PERT: Implementation

The goal of the PERT method is to assign all path slack to connections (by increasing connection Delay

Budgets) until no slack is left or a predefined minimum slack level is reached. It only assigns a fraction

of the total path slack at each iteration to ensure the total slack allocated is no more than the initial

available path slack.

The first step is to traverse the timing graph forwards and backwards to determine the longest path

(in number of connections) passing through each connection. This number is used to divide the path

slack when PERT assigns slack to a connection. For example, in Figure 4.13, the longest path that goes

through connection A-C has 2 connections and has a path slack of 4.0, so after the first iteration the

amount of slack allocated for A-C is 4.0 / 2 = 2.0 and the Delay Budget for connection A-C is 3.0 + 2.0

= 5.0.

A timing analysis is necessary at the end of each PERT iteration to calculate the new slack for each

connection. In our version of the PERT method, the slack allocation process stops either after a user-defined maximum number of iterations (10 in our experiment) or when the remaining slack is smaller than

a predefined minimum threshold (less than 1 picosecond). The Delay Budget value of each connection

for a simple example is shown in Table 4.2.

PERT is runtime expensive due to the iterative slack assignment and the repeated timing analyses. Therefore it is only practical to perform a PERT Delay Budget computation periodically. In our experiment,

the annealer calls PERT at the beginning of each temperature stage.
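The allocation step described above can be sketched as follows. This is an illustrative Python fragment (names are hypothetical), with the per-iteration timing analysis that recomputes slacks omitted:

```python
def pert_allocation_step(budgets, slacks, longest_path_len):
    """One Iterative-minimax-PERT allocation step (sketch).

    Each connection's Delay Budget grows by its current slack divided
    by the number of connections on the longest path through it. A
    timing analysis must then recompute the slacks before the next
    iteration can run.
    """
    return {c: budgets[c] + slacks[c] / longest_path_len[c]
            for c in budgets}

# Connection A-C from Figure 4.13: delay 3.0, path slack 4.0, and the
# longest path through it has 2 connections, so its budget after the
# first iteration is 3.0 + 4.0 / 2 = 5.0.
budgets = pert_allocation_step({"A-C": 3.0}, {"A-C": 4.0}, {"A-C": 2})
print(budgets["A-C"])  # 5.0
```

Dividing by the longest-path length is what keeps the total slack handed out in one iteration from exceeding the path slack actually available.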

4.2.5 Iterative-minimax-PERT: Experimental Results

To fully test the effectiveness of the new timing cost function and Delay Budget, we run 3 sets of

experiments using different architectures and benchmark circuits. The details of each architecture file

and benchmark suite are listed in Table 4.3, and the names of the architecture files are shown in Table

4.4. The first experiment uses the VTR benchmark suite with a size-10 (N10) cluster heterogeneous

architecture “arch1”. The second experiment uses the VTR benchmarks with a size-4 (N4) heterogeneous

architecture “arch2”. The third experiment uses the MCNC benchmark suite with a size-4 (N4) cluster

homogeneous architecture “arch3”.

For each experiment, we sweep db_ratio (from 0.1 to 1.0) to find the ratio that produces the best QoR. The VPR placement settings are left at their defaults (Criticality Exponent starts at 1 and ends at 8, inner_num = 1, Cost Function Tradeoff λ = 0.5). All data shown here are post-placement results


Table 4.4: PERT Delay Budget Experiment Architecture Files.

Arch Name   File Name
arch1       k6_frac_N10_mem32K_40nm
arch2       k6_frac_N4_mem32K_40nm
arch3       k4_N4_90nm

Table 4.5: QoR of Delay Budget with PERT method vs Original VPR using “arch1” and VTR benchmarks.

DB Ratio Delay WL WDP Total Swaps Runtime

0.1 0.983 1.013 0.996 1.03 4.44

0.2 0.982 1.019 1.001 1.05 4.52

0.3 0.984 1.026 1.010 1.05 4.51

0.4 0.981 1.027 1.007 1.05 4.51

0.5 0.979 1.020 0.999 1.04 4.47

0.6 0.979 1.019 0.998 1.04 4.45

0.7 0.983 1.013 0.995 1.03 4.43

0.8 0.982 1.015 0.996 1.03 4.42

0.9 0.980 1.015 0.995 1.02 4.37

1.0 0.986 1.011 0.997 1.01 4.39

Geomean 0.982 1.018 0.999 1.03 4.45

normalized against the original VPR implementation, so both Critical Path Delay and Wirelength are

placement estimated values. Tables 4.5, 4.6, and 4.7 show the results of each experiment, respectively.

Column 5 “Total Swaps” shows the ratio of total annealing swaps examined compared to original VPR

to ensure any QoR gain is not caused by performing significantly more swaps.

Tables 4.5 and 4.6 show the results of experiments using the VTR benchmarks. The best circuit speed occurs at db_ratio = 0.6 when using the “arch1” (N10) architecture, where we obtained a 2.1% improvement in critical path delay while giving up 1.9% in wirelength cost. For the “arch2” (N4) architecture, the best circuit speed is at db_ratio = 0.7, where PERT delay budgets produce a 3.3%

improvement in delay with no degradation in wirelength. On average there is a 2.8% delay improvement

with only 0.8% extra wirelength usage. Overall PERT delay budgets improve delay by 2% to 3% for

the heterogeneous benchmarks with a small degradation in wirelength. In terms of number of swaps,

the PERT delay budget implementation uses only 3% more swaps than original VPR. However,

the extra PERT iterations and the timing analyses they entail cause a minimum 4.5x runtime increase

versus original VPR.

The results of experiments using MCNC benchmarks are listed in Table 4.7. Using a standard

homogeneous N4 architecture (“arch3”) from the VPR package, PERT produces the best QoR at db_ratio

= 0.9, which reduces delay by 7.2% and only uses 2.6% extra wirelength. However, it also requires a

large amount of extra runtime.

4.2.6 Direct Slack: Implementation

The biggest drawback of the PERT method is its high runtime; a 4x to 6x runtime increase is unacceptable. We found, however, that the Delay Budgets need not be exactly accurate, meaning the


Table 4.6: QoR of Delay Budget with PERT method vs Original VPR using “arch2” and VTR benchmarks.

DB Ratio Delay WL WDP Total Swaps Runtime

0.1 0.975 1.001 0.976 1.01 5.42

0.2 0.977 1.004 0.980 1.01 4.93

0.3 0.972 0.996 0.969 1.02 4.66

0.4 0.970 1.000 0.970 1.01 4.53

0.5 0.974 1.006 0.980 1.02 5.27

0.6 0.968 1.014 0.982 1.03 5.52

0.7 0.967 1.014 0.980 1.04 5.44

0.8 0.968 1.016 0.984 1.04 5.44

0.9 0.974 1.012 0.985 1.06 6.12

1.0 0.977 1.012 0.989 1.05 6.45

Geomean 0.972 1.008 0.980 1.03 5.35

Table 4.7: QoR of Delay Budget with PERT method vs Original VPR using “arch3” and MCNC benchmarks.

DB Ratio Delay WL WDP Total Swaps Runtime

0.1 0.968 1.025 0.992 1.01 5.84

0.2 0.968 1.038 1.005 1.01 5.96

0.3 0.963 1.050 1.011 1.03 5.86

0.4 0.949 1.062 1.008 1.03 5.85

0.5 0.929 1.064 0.988 1.02 5.80

0.6 0.928 1.058 0.982 1.01 5.74

0.7 0.930 1.049 0.975 0.99 5.71

0.8 0.925 1.037 0.959 0.98 6.37

0.9 0.928 1.026 0.952 0.98 6.43

1.0 0.954 1.018 0.971 0.98 6.93

Geomean 0.944 1.043 0.984 1.00 6.04


total slack assigned can even exceed the available path slack while the new timing cost function still works well.

In the previous example (Table 4.2), we see that path A-C-D has a path slack of 4.0, and this entire value is assigned to connection A-C's Delay Budget because connection C-D belongs to the critical path B-C-D. Inspired by this, we propose to assign the path slack directly to each of its connections' Delay Budgets in one shot, as shown in Equation 4.2, which we call the Direct Slack (DS) method. The DS

method in fact maintains the properties of the Delay Budget cost function, and it puts even more placer

focus on high criticality connections.

Delay_Budget_c = Delay_c + Slack_c   (4.2)

From an implementation point of view, DS requires much less runtime compared to the PERT

method. Both connection delay and slack values are already available during annealing. The timing cost

computation for DS only needs to add these two values together for each connection during the regular

timing cost update, and only one pass is necessary to update all delay budget values. Multiple iterations and timing analyses, as required by PERT, are no longer necessary.
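In code terms, the whole allocation collapses to a single pass. The following Python sketch uses illustrative names and assumes per-connection delay and slack maps are already available from the annealer's timing data:

```python
def direct_slack_budgets(delays, slacks):
    """Direct Slack method (Equation 4.2, sketch): each connection is
    handed its full path slack in one shot. The resulting budgets are
    overly optimistic, since the same path slack may be granted to
    several connections on the path, but no iteration or repeated
    timing analysis is needed.
    """
    return {c: delays[c] + slacks[c] for c in delays}

# Connection A-C from Figure 4.13 (delay 3.0, slack 4.0) receives the
# entire path slack, for a budget of 3.0 + 4.0 = 7.0:
print(direct_slack_budgets({"A-C": 3.0}, {"A-C": 4.0})["A-C"])  # 7.0
```

Compared with the PERT allocation of Section 4.2.4, which spreads slack over the connections of a path, this one-shot assignment trades budget accuracy for runtime.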

4.2.7 Direct Slack: Experimental Results

We first test DS using the VTR benchmark suite on the “arch1” (N10) heterogeneous architecture. All

VPR placement parameters are set to their default values (inner_num = 1, etc.), and the db_ratio parameter of the delay budget timing cost function is set to 0.5. Each run goes through placement,

a binary search routing (to find minimum required channel width W), and a “low stress” routing with

1.3x “minimum W”. The flow is repeated with 3 different placement seeds to minimize CAD noise.

Table 4.8 shows the raw data from this experiment, and Table 4.9 shows the same result normalized

against the original VPR placement result. All normalized critical path delay and wirelength values use

3 digits to show more detail. Columns 2 and 3 are the Placement QoR. On average placement estimated

critical path delay is improved by 3.9% and wirelength stays the same compared to original VPR. In

the “Min W” column, we see that placement using the DS method does not cause the VPR router to

use higher routing channel width. The next two columns are the post-routing results. Final routing

critical path delay is reduced by 1.9% and routing wirelength remains the same compared to original

VPR. The last three columns are runtime required for placement, routing, and the overall flow. Using

the DS method the placement runtime is increased by 9% and total runtime is 6% higher.

For “arch1” (N10) architecture, DS is effective at improving circuit speed with no extra wirelength

cost, but there is about a 2% gap between the placement and routing QoR. We have three hypotheses for

the possible causes for this gap. The first possible cause is an inaccurate delay model in VPR placement.

To verify this, we compare the placement QoR of earlier results from the SSS section, and the difference

between placement and routing delay is only 0.6% on average. This is significantly lower than the 2%

from the DS results. The second possible cause is that in this experiment we use the DS cost function

in VPR placement. However, the timing cost function in the VPR router is still the same as that of

the original VPR placement, so this discrepancy may cause the VPR router to disagree with certain

decisions from the delay budget placement.

The last possible cause is related to more paths having delays equal to, or close to, the critical path delay.

Although Delay Budgets are effective at reducing the critical path delay in placement, by doing so more

paths may become near critical than with the original timing cost function. The subsequent effect is


Table 4.8: Direct Slack Results using N10 VTR benchmarks.

that the router now has to focus on improving delay on more paths than before, which increases the

routing difficulty and makes it harder for the router to produce critical path delay improvements that

match those predicted by the placer.

Table 4.10 shows the normalized QoR for the 5 largest VTR circuits. The placement estimated delay

is reduced by 6.6%, and placement wirelength remains the same. The routing results show a similar behavior, with a 3.6% reduction in delay and no change in wirelength. In terms of runtime,

the largest 5 circuits take 12% more time than original VPR, which is 6% higher than the benchmark

average. Overall DS performs better for larger circuits than smaller circuits but with more runtime.

To verify DS performs well for different architectures we also run experiments with “arch2” (N4)

heterogeneous architecture with the VTR benchmarks, and “arch3” (N4) homogeneous architecture with

the MCNC benchmarks. Table 4.11 shows the N4 VTR experiment results. The first four columns are

the raw values of the placement results, and the last four columns are the placement results normalized

against original VPR results. Both Placement Estimated Delay and Wirelength show similar trends as

in the N10 VTR experiment results, and the critical path delay is improved by 4.5% and wirelength is

only increased by 0.8%. Placement runtime is only increased by 5%. DS again performs well for larger

circuits with 7.2% improvement in delay and no degradation in wirelength. However, it requires about

12% more placement runtime when working on the larger circuits.

Table 4.12 shows the DS results with the “arch3” homogeneous architecture (file k4_N4_90nm.xml) on


Table 4.9: Direct Slack Results using N10 VTR benchmarks normalized vs Original VPR.

Table 4.10: Direct Slack Results using N10 VTR benchmarks normalized vs Original VPR (Largest 5).


Table 4.11: Direct Slack Results using N4 VTR benchmarks.


Table 4.12: Direct Slack Results using N4 MCNC benchmarks.

the largest 20 MCNC benchmark circuits. The results show a trend similar to that of the heterogeneous

experiments. Placement Estimated Delay is reduced by 6% while wirelength increased by 3%. Runtime

is 7% higher than the original VPR.

4.2.8 Conclusion

This section identified an issue with the current VPR timing cost function and addressed it with a new timing cost function that incorporates Delay Budgets. Delay Budgets have been shown to be effective in

some prior work but traditional Slack Allocation techniques are often runtime expensive. In this section

we compared both the traditional Iterative-Minimax-PERT method and the new Direct Slack method.

Experiments using both heterogeneous and homogeneous benchmarks show that PERT improves circuit

speed by 2 to 3% while giving up some wirelength cost. However, this technique is impractical as it

requires about 4x to 6x more runtime than the original VPR placement. The Direct Slack method, on

the other hand, improves circuit speed by 4% with no degradation in wirelength. More importantly,

Direct Slack only increases runtime by 6 to 7% compared to original VPR, which makes it a practical

optimization.

The post-placement results also show that Direct Slack performs better with larger designs, where it


improves circuit speed by 6% to 7% with no extra wirelength cost. This aligns well with the main goal

of this thesis, which is to optimize placement algorithms to handle large designs better.

Further experiments also confirm that Direct Slack is versatile across different architectures and benchmarks. It produces similar QoR gains for smaller-cluster heterogeneous and homogeneous architectures.

Chapter 5

Conclusion

5.1 Summary

The continued growth in FPGA chip size and architectural complexity motivates exploration into better

and faster CAD algorithms. In this thesis we examined ways to improve Simulated Annealing placement

by either producing better QoR or shortening runtime. We divided our focus into two main categories:

Move Generation and Cost Function improvements.

In Move Generation we examined both Median Region Moves and Smart Search Space. MR aims

to reduce placement wirelength by placing blocks into their Median Regions, which are the regions

that have the shortest overall distance to all of a block's connected net Bounding Boxes. In this work

we implement MR moves in VPR to handle heterogeneous designs, and the experiments show that by

making 50% of total moves MR moves the wirelength is improved by 2 to 5% depending on placement

effort. Alternatively, one can use MR to produce the same QoR with less runtime; for example, MR produces the same QoR as the original SA placer (at inner_num = 10) with approximately half the runtime, using inner_num = 3.

We also implemented the “Smart Search Space (SSS)” move generation to make move generation

more effective for heterogeneous architectures. SSS compresses the search space for each block type so

that move proposals are not interfered with by any boundary formed by the other block types. It also

guarantees the move generator will always search within valid locations regardless of the move range

limit in VPR. Experiments show that SSS is effective at producing the same or better QoR than the

original VPR placement algorithm with no extra runtime required. It also performs better on larger

designs, and reduces the total annealing moves by 7%.

Timing driven SA placers use criticality to determine the timing focus on different connections. We

swept several placement timing cost function parameters, and found that by changing the final criticality

exponent from 8 to 30 the placement estimated circuit speed is improved by 3%. This is a result of putting

more placer focus on high criticality connections during the later stages of annealing.

The current timing cost function in VPR fails to identify certain “bad moves” that are associated with

low criticality connections. By using the Direct Slack method along with a new timing cost function, the

post-placement circuit speed is improved by 4% while maintaining the same wirelength as the original

VPR. Direct Slack also performs better for larger designs; the circuit speed is improved by an average

6% to 7% in post-placement results for the 5 largest VTR benchmarks. Direct Slack also has a very


small runtime overhead due to its simplicity, and only adds about 6% extra runtime. This makes Direct

Slack a more practical optimization for placement than the other Delay Budget assignment methods.

5.2 Future Directions

Although Median Region moves showed good performance in our experiments, they are still runtime expensive. By capping the net size during MR computation, the runtime was improved by 12% on average. As future work, we need to develop a more efficient way to extract the coordinates of all connected nets. A possible solution could be to integrate MR computation with the Incremental

Bounding Box method described in [4], so that BB coordinates only need to be recomputed if the “source

block” is located on the boundary of a BB. Another issue is that given enough placer effort, random

moves are able to produce similar QoR as MR moves. One possible use of MR moves at high placer

efforts is to use MR moves during the later stages of the annealing to “repair” badly placed blocks.

During experiments we noticed that the current SSS implementation does not perform well for designs

with IO blocks dominating the block population. This is likely caused by the special orientation of

IO blocks in the current VPR architecture files, which only allows IO blocks to be located on the chip

periphery. A special move generator can be designed in the future to handle this unique block orientation.

Our experiments showed that the new timing cost function with Direct Slack method is effective

at improving circuit speed, but there remains a gap between post-placement and post-routing QoR

improvements. This is likely due to the difference in timing cost function between the placement and

routing stages. The Direct Slack method may also produce more critical paths during placement which

increases the routing stress. A possible future work would be to bring a timing cost function similar to the one we introduced for placement into the VPR router, so that it can better handle multiple critical paths, as described in [11].

An issue we faced in this work, one shared by much other FPGA CAD research, is that the available benchmarks are small compared to modern commercial designs. The largest benchmark available,

LU32PEEng from the VTR benchmarks, has 7544 cluster-level blocks and occupies a 98*98 grid when

mapped to a size-10 cluster architecture. To put this in perspective, LU32PEEng occupies less than 5%

of the chip area of Altera’s Stratix V device [23]. This limitation restricts the CAD problem size we

could model in this work. The Titan benchmark set [23], which consists of 23 benchmark circuits that

are 44x larger than the VTR benchmarks on average, would be a better option. However, at the time this work was conducted, a timing model had not been made available for the Titan benchmarks. As future work, it will be important to test the quality of the optimizations we present on the Titan benchmarks.

Bibliography

[1] University of California, Berkeley. Berkeley Logic Interchange Format (BLIF). Technical report, 1998.

[2] Vaughn Betz. FPGA challenges and opportunities at 40nm and beyond. In FPL, page 4, 2009.

[3] Vaughn Betz and Jonathan Rose. VPR: A new packing, placement and routing tool for FPGA research. In Field-Programmable Logic and Applications, pages 213–222. Springer, 1997.

[4] Vaughn Betz, Jonathan Rose, and Alexander Marquardt. Architecture and CAD for deep-submicron

FPGAs. Kluwer Academic Publishers, 1999.

[5] Robert Brayton and Alan Mishchenko. ABC: An academic industrial-strength verification tool. In Computer Aided Verification, pages 24–40. Springer, 2010.

[6] Stephen Brown and Jonathan Rose. Architecture of FPGAs and CPLDs: A tutorial. IEEE Design and Test of Computers, 13(2):42–57, 1996.

[7] Deming Chen, Jason Cong, and Peichen Pan. FPGA design automation: A survey. Foundations and Trends in Electronic Design Automation, 1(3):139–169, 2006.

[8] Gang Chen and Jason Cong. Simultaneous timing-driven placement and duplication. In Proceedings

of the 2005 ACM/SIGDA 13th international symposium on Field-programmable gate arrays, pages

51–59. ACM, 2005.

[9] Patrick Dorsey. Xilinx stacked silicon interconnect technology delivers breakthrough fpga capacity,

bandwidth, and power efficiency. Xilinx White Paper: Virtex-7 FPGAs, pages 1–10, 2010.

[10] Jon Frankle. Iterative and adaptive slack allocation for performance-driven layout and FPGA routing. In Proceedings of the 29th ACM/IEEE Design Automation Conference, pages 536–542. IEEE Computer Society Press, 1992.

[11] Ryan Fung, Vaughn Betz, and William Chow. Slack allocation and routing to improve FPGA timing while repairing short-path violations. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, 27(4):686–697, 2008.

[12] Marcel Gort and Jason H. Anderson. Analytical placement for heterogeneous FPGAs. In Field Programmable Logic and Applications (FPL), 2012 22nd International Conference on, pages 143–150. IEEE, 2012.

[13] Altera Corporation. Stratix Device Handbook. San Jose, CA [Online]. Available: www.altera.com.


[14] Peter Jamieson, Kenneth B. Kent, Farnaz Gharibian, and Lesley Shannon. ODIN II: an open-source Verilog HDL synthesis tool for CAD research. In Field-Programmable Custom Computing Machines (FCCM), 2010 18th IEEE Annual International Symposium on, pages 149–156. IEEE, 2010.

[15] Adrian Ludwin, Vaughn Betz, and Ketan Padalia. High-quality, deterministic parallel placement for FPGAs on commodity hardware. In Proceedings of the 16th International ACM/SIGDA Symposium on Field Programmable Gate Arrays, pages 14–23. ACM, 2008.

[16] Jason Luu, Jason Helge Anderson, and Jonathan Scott Rose. Architecture description and packing for logic blocks with hierarchy, modes and complex interconnect. In Proceedings of the 19th ACM/SIGDA International Symposium on Field Programmable Gate Arrays, pages 227–236. ACM, 2011.

[17] Jason Luu, Ian Kuon, Peter Jamieson, Ted Campbell, Andy Ye, Wei Mark Fang, Kenneth Kent, and Jonathan Rose. VPR 5.0: FPGA CAD and architecture exploration tools with single-driver routing, heterogeneity and process scaling. ACM Transactions on Reconfigurable Technology and Systems (TRETS), 4(4):32, 2011.

[18] Alexander Marquardt, Vaughn Betz, and Jonathan Rose. Timing-driven placement for FPGAs. In Proceedings of the 2000 ACM/SIGDA Eighth International Symposium on Field Programmable Gate Arrays, pages 203–213. ACM, 2000.

[19] Alexander Sandy Marquardt, Vaughn Betz, and Jonathan Rose. Using cluster-based logic blocks and timing-driven packing to improve FPGA speed and density. In Proceedings of the 1999 ACM/SIGDA Seventh International Symposium on Field Programmable Gate Arrays, pages 37–46. ACM, 1999.

[20] M. Imran Masud and Steven J.E. Wilton. A new switch block for segmented FPGAs. In Field Programmable Logic and Applications, pages 274–281. Springer, 1999.

[21] Larry McMurchie and Carl Ebeling. PathFinder: a negotiation-based performance-driven router for FPGAs. In Proceedings of the 1995 ACM Third International Symposium on Field-Programmable Gate Arrays, pages 111–117. ACM, 1995.

[22] G.E. Moore. Cramming more components onto integrated circuits. Proceedings of the IEEE, 86(1):82–85, 1998.

[23] K.E. Murray, S. Whitty, S. Liu, J. Luu, and V. Betz. From Quartus to VPR: Converting HDL to BLIF with the Titan flow. In Field Programmable Logic and Applications (FPL), 2013.

[24] Jonathan Rose, Jason Luu, Chi Wai Yu, Opal Densmore, Jeffrey Goeders, Andrew Somerville, Kenneth B. Kent, Peter Jamieson, and Jason Anderson. The VTR project: architecture and CAD for FPGAs from Verilog to routing. In Proceedings of the ACM/SIGDA International Symposium on Field Programmable Gate Arrays, pages 77–86. ACM, 2012.

[25] Jarrod A Roy, David A Papa, Saurabh N Adya, Hayward H Chan, Aaron N Ng, James F Lu, and

Igor L Markov. Capo: robust and scalable open-source min-cut floorplacer. In Proceedings of the

2005 international symposium on Physical design, pages 224–226. ACM, 2005.

[26] Natarajan Viswanathan and C.C.-N. Chu. FastPlace: Efficient analytical placement using cell shifting, iterative local refinement, and a hybrid net model. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, 24(5):722–733, 2005.


[27] Kristofer Vorwerk, Andrew Kennings, Jonathan Greene, and Doris T. Chen. Improving annealing via directed moves. In Field Programmable Logic and Applications, 2007. FPL 2007. International Conference on, pages 363–370. IEEE, 2007.

[28] Kristofer Vorwerk, Andrew Kennings, and Jonathan W. Greene. Improving simulated annealing-based FPGA placement with directed moves. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, 28(2):179–192, 2009.

[29] Chris C. Wang and Guy G.F. Lemieux. Scalable and deterministic timing-driven parallel placement for FPGAs. In Proceedings of the 19th ACM/SIGDA International Symposium on Field Programmable Gate Arrays, pages 153–162. ACM, 2011.

[30] Saeyang Yang. Logic synthesis and optimization benchmarks user guide: version 3.0. Citeseer, 1991.

[31] Habib Youssef and Eugene Shragowitz. Timing constraints for correct performance. In Computer-

Aided Design, 1990. ICCAD-90. Digest of Technical Papers., 1990 IEEE International Conference

on, pages 24–27. IEEE, 1990.