Gate Transfer Level Synthesis as an Automated Approach to Fine-Grain Pipelining

Alexander Smirnov, Alexander Taubin, Mark Karpovsky, Leonid Rozenblyum

Presentation goals

- Present an overview of the synthesis framework
- Demonstrate a high-level pipeline model
- Demonstrate the synthesis correctness
- Illustrate how the correctness is guaranteed
- Present experimental results
- Conclusions
- Future work

Objective

Industrial-quality EDA flow for automated synthesis of fine-grain pipelined robust circuits from high-level specifications

- Industrial quality
  - Easy to integrate into an RTL-oriented environment
  - Capable of handling very large designs (scalability)
- Automated fine-grain pipelining
  - To achieve high performance (throughput)
  - Automated to reduce design time

Choice of paradigm

- Synchronous RTL
  - 8 logic levels per stage is the limit
    - Due to register, clock skew and jitter overhead
  - Timing closure
    - No pipelining automation available: stage balancing is difficult
  - Performance limitations
    - To guarantee correctness with process variation etc.
- Asynchronous GTL
  - Lower design time
    - Automated pipelining possible from RTL specification
  - Higher performance
    - Gate-level (finest possible) pipelining achievable
  - Controllable power consumption
    - Smoothly slows down in case of voltage reduction
  - Improved yield
    - Correct operation regardless of variations

Easy integration & scalability: Weaver flow architecture

- RTL tools reuse
  - Creates the impression that nothing has changed
  - Saves development effort
- Substitution-based transformations
  - Linear complexity
  - Enabled by using functionally equivalent DR (dual-rail: physical) and SR (single-rail: virtual) libraries

[Figure: Weaver flow diagram. Labels: HDL design specification; host synthesis engine; single-rail (synchronous) netlist synthesis; srGTL lib; single-rail netlist; Weaver; weaving; GTL netlist; GTL netlist mapping; GTL lib (VHDL packages, lib cells, GTL stages); mapped QDI netlist and P&R constraints; P&R and other tools following synthesis in the EDA flow.]

Easy integration & scalability: Weaver flow architecture

- Synthesis flow
  - Interfacing with the host synthesis engine
  - Transforming synchronous RTL to asynchronous GTL (weaving)
- Dedicated library(ies)
  - Dual-rail encoded data logic
  - Cells comprising entire stages
  - Internal delay assumptions only

Automated fine-grain pipelining: Gate Transfer Level (GTL)

- Gate-level pipeline
- Let gates communicate asynchronously and independently

[Figure: a register (REG) followed by combinational logic.]

Many pipeline styles can be used

- Templates already exist

[Figure: example pipeline stage templates. Labels: C, M, CD, PC, ACK, F; handshake signals LReq, LAck, RReq, RAck; dual-rail inputs A0, A1, B0, B1 and outputs Z0, Z1; parasitic capacitances; weak.]

Weaving

- Critical transformations
  - Mapping combinational gates (basic weaving)
  - Mapping sequential gates
  - Initialization preserving liveness and safeness
- Optimizations
  - Performance optimization
    - Fine-grain pipelining (natural)
    - Slack matching
  - Area optimization
    - Optimizing out identity function stages

Basic Weaving

- De Morgan transformation
- Dual-rail expansion
- Gate substitution
- Generating req/ack signals
  - Merge insertion
  - Fork insertion
- Reset routing

[Figure: weaving an AND2 gate into a GTL_AND2 stage (dual-rail implementation). Labels: AND2 with A, B, X, Z; stage blocks F: AND2, M, PC, CD, ACK; dual-rail data A0, A1, B0, B1, X0, X1, Z0, Z1; handshake signals AReq/AAck, BReq/BAck, XReq/XAck, ZReq/ZAck.]
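
The De Morgan transformation and dual-rail expansion steps can be illustrated with a small functional model. This is only a sketch under assumptions: the names `encode`, `decode` and `dr_and2` are hypothetical, the rail order (rail0, rail1) is chosen for illustration, and the gate is assumed to wait for both inputs before producing data. It models the data function only, not the GTL library cell or its req/ack handshake.

```python
NULL = (0, 0)  # spacer: neither rail asserted


def encode(bit):
    """Encode a Boolean as a dual-rail pair (rail0, rail1)."""
    return (0, 1) if bit else (1, 0)


def decode(pair):
    """Return the Boolean carried by a valid pair, or None for the NULL spacer."""
    if pair == NULL:
        return None
    if pair == (1, 1):
        raise ValueError("illegal dual-rail code (1, 1)")
    return pair == (0, 1)


def dr_and2(a, b):
    """Dual-rail AND2 (assumed to wait for both inputs before producing data)."""
    if a == NULL or b == NULL:
        return NULL
    a0, a1 = a
    b0, b1 = b
    z1 = a1 & b1  # true rail: A AND B
    z0 = a0 | b0  # false rail, by De Morgan: NOT(A AND B) = (NOT A) OR (NOT B)
    return (z0, z1)
```

Because the false rail is computed from the false rails of the inputs, no inverters are needed, which is why the De Morgan step precedes gate substitution in this sketch.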

Basic Weaving: example (C17 MCNC benchmark)

[Figure: the C17 netlist of AND2 and OR2 gates is woven into a netlist of GTLAND2 and GTLOR2 stages. Each connection becomes a channel consisting of dual-rail data (dr_data: data0, data1) and a handshake (req, ack).]

Linear pipeline (RTL)

[Figure: a linear RTL pipeline with flip-flops DFF1 to DFF3 and combinational logic blocks CL1 to CL3 on a common clock clk, and the latch-based version with D-latches DL1 to DL5 on two clock phases, clk0 and clk1.]

Linear pipeline

[Figure: the RTL pipeline (DFF1 to DFF3, CL1 to CL3, clk) and its Petri net models, with transitions t1 to t6 and clock events clk0 (clk<='0') and clk1 (clk<='1').]

- Pipeline PN model with global synchronization
- Pipeline PN (PPN) model with local handshake
- PPN models asynchronous full-buffer pipelines


Linear pipeline

[Figure: the RTL implementation (DFF1 to DFF3, CL1 to CL3, clk) next to the GTL implementation, a chain of stages built from M, PC, CD and ACK blocks around the combinational logic.]

Correctness

- Safeness
  - Guarantees that the number of data portions (tokens) stays the same over time
- Liveness
  - Guarantees that the system operates continuously
- Flow equivalence
  - In both RTL and GTL implementations, corresponding sequential elements hold the same data values
    - On the same iterations (order-wise)
    - For the same input stream

Non-linear pipelines

- Deterministic token flow
  - Broadcasting tokens to all channels at Forks
  - Synchronizing at Merges
- Data-dependent token flow
  - Ctrl is also a dual-rail channel
  - To guarantee liveness, MUXes need to match deMUXes, which is computationally hard

[Figure: Fork and Merge; MUX and deMUX, each with in, out and ctrl channels.]

Non-linear pipeline liveness

- Currently guaranteed for deterministic token flow only, by construction (weaving)
- A marking of a marked graph is live if each directed PN circuit has a marker
- Linear closed pipelines can be considered instead
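
The marked-graph condition above can be checked mechanically: every directed circuit carries a marker exactly when the subgraph made of unmarked places contains no cycle. The sketch below assumes a hypothetical representation in which each place is an arc `(src_transition, dst_transition, tokens)`; the function name `marking_is_live` is illustrative and not part of the Weaver tool.

```python
from collections import defaultdict


def marking_is_live(arcs):
    """Return True if every directed circuit of the marked graph holds a token.

    arcs: iterable of (src_transition, dst_transition, tokens), one per place.
    """
    succs = defaultdict(list)
    indeg = defaultdict(int)
    nodes = set()
    for src, dst, tokens in arcs:
        nodes.update((src, dst))
        if tokens == 0:  # keep only the unmarked places
            succs[src].append(dst)
            indeg[dst] += 1

    # Kahn's algorithm: the token-free subgraph is acyclic
    # if and only if every transition can be removed.
    ready = [n for n in nodes if indeg[n] == 0]
    removed = 0
    while ready:
        n = ready.pop()
        removed += 1
        for m in succs[n]:
            indeg[m] -= 1
            if indeg[m] == 0:
                ready.append(m)
    return removed == len(nodes)
```

For a two-transition loop, marking either place makes the check pass, while leaving both places empty makes it fail.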

Closed linear PPN

[Figure: a closed linear PPN with transitions t1 and t2 and markers (*), alongside an implementation loop containing DFF2 and CL.]

- Every PPN “stage” is a circuit and has a marker by definition
- Each implementation loop forms two directed circuits
  - Forward: has at least one token inferred for a DFF
  - Feedback: has at least one NULL inferred from CL or added explicitly

Closed linear PPN pipeline is live iff (for full-buffer pipelines)

- Every loop has at least 2 stages
- Token capacity C for any loop of N stages: $1 \le C \le N - 1$
- Assumption we made: every loop in a synchronous circuit has a DFF
  - A loop with no CL is meaningless
- Liveness conditions hold by construction (weaving)

[Figure: closed linear PPN with transitions t1 and t2 and markers (*); a loop containing DFF2 and CL.]
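
The capacity bound can be illustrated with a toy simulation of a closed full-buffer ring. This is a sketch under simplifying assumptions, not the PPN model itself: each stage is just full or empty, and a token advances when the next stage is empty. The names `step` and `ring_is_live` are hypothetical.

```python
def step(ring):
    """Advance every token whose next stage is empty.

    Returns the new ring, or None if no token can move (deadlock).
    """
    n = len(ring)
    movers = [i for i in range(n) if ring[i] and not ring[(i + 1) % n]]
    if not movers:
        return None
    new = list(ring)
    for i in movers:
        new[i] = False
        new[(i + 1) % n] = True
    return new


def ring_is_live(n_stages, n_tokens, steps=100):
    """Simulate a closed ring and report whether it keeps making progress."""
    ring = [i < n_tokens for i in range(n_stages)]
    for _ in range(steps):
        ring = step(ring)
        if ring is None:
            return False
    return True
```

In this model a ring with no tokens or with no empty stage can never move, which matches the bounds 1 and N - 1; a single-stage ring cannot be live at all.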

Initialization: example

    library ieee;
    use ieee.std_logic_1164.all;

    entity seqr is
      port (D_out     : out std_logic;
            D_in, clk : in  std_logic);
    end seqr;

    architecture behavior of seqr is
      signal Data : std_logic_vector(2 downto 0);  -- 3-bit shift register
    begin
      process(clk)
      begin
        -- Shift D_in into the register on the falling clock edge
        if (clk='0' and clk'event) then
          Data <= (D_in, Data(2), Data(1));
        end if;
        -- Re-evaluated on every clk event: detect "111" in the register
        if (Data = "111") then
          D_out <= '1';
        else
          D_out <= '0';
        end if;
      end process;
    end behavior;

[Figure: the synthesized RTL is a chain of three DFFs (D_in, clk, D_out). After weaving, each connection is a channel (data0, data1, req, ack) and the stages are labeled F, HB and FB.]

Initialization: FSM example

[Figure: an FSM between in and out with Next state CL, Output CL and a State REG clocked by clk; the woven version uses a Fork and a chain of HB stages.]

Flow equivalence

- GTL data flow structure is equivalent to the source RTL by weaving
  - No data dependencies are removed
  - No additional dependencies are introduced
- In deterministic flow architecture
  - There are no token races (tokens cannot pass each other)
  - All forks are broadcast and all joins are synchronizers
- Flow equivalence is preserved by construction

Flow equivalence

[Figure: animated example tracing numbered tokens (1, 2, 3, 4) and NULLs (N) step by step through the RTL and GTL implementations of the same circuit.]

- GTL initialization is the same as RTL, but token propagation is independent
- In GTL, “3” hits the first top register output
- In GTL, “3” hits the first bottom register output
- In GTL, “2” hits the second register output
- In RTL, “3” and “2” moved one stage ahead
- Timing is independent; the order is unchanged
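
The idea that timing may differ while order is preserved can be expressed as a check on observation traces. The sketch below is illustrative only: the trace format (a dict from register name to a list of observed values, with `None` standing for a NULL spacer in GTL) and the names `token_sequence` and `flow_equivalent` are assumptions, not part of the presented flow.

```python
def token_sequence(trace):
    """Drop spacers (None) from a GTL observation trace, keeping token order."""
    return [value for value in trace if value is not None]


def flow_equivalent(rtl, gtl):
    """Compare per-register value sequences, ignoring when each value appears.

    rtl: register name -> values, one per clock cycle.
    gtl: register name -> observations, with None for a NULL spacer.
    Only the common prefix of the two sequences is compared.
    """
    if rtl.keys() != gtl.keys():
        return False
    for reg, rtl_values in rtl.items():
        gtl_values = token_sequence(gtl[reg])
        k = min(len(rtl_values), len(gtl_values))
        if rtl_values[:k] != gtl_values[:k]:
            return False
    return True
```

A GTL trace such as `[1, None, 2, None, 3]` matches an RTL trace `[1, 2, 3]` for the same register, whereas `[2, None, 1]` does not.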

Optimizations

- Area
  - Optimizing out identity function stages
- Performance
  - Fine-grain pipelining (natural)
  - Slack matching

Optimizing out identity function stages

- Identity function stages (buffers) are inferred for clocked DFFs and D-latches
- They implement no functionality
- They can be removed as long as
  - The token capacity is not decreased below the RTL level
  - The resulting circuit can still be properly initialized

Optimizing out identity function stages: example

[Figure: an RTL pipeline of CL blocks separated by DFFs, and the corresponding chain of HB stages.]

- The final implementation is the same as if the RTL had not been pipelined (except for initialization)
- Saves pipelining effort

Slack matching implementation

- Adjusting the pipeline slack to optimize its throughput
- Implementation
  - Leveling gates according to their shortest paths from primary inputs (outputs)
  - Inserting buffer stages to break long dependencies
  - Buffer stages initialized to NULL
- Currently performed for circuits with no loops only
- Complexity $O(|X| \cdot |C|^2)$
  - |X|: the number of primary inputs
  - |C|: the number of connection points in the netlist
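
The leveling and buffer insertion steps can be sketched on a loop-free netlist. Note the swapped-in technique: the slides level gates by shortest paths and do not detail the algorithm, whereas this sketch uses longest-path (ASAP) leveling as a stand-in assumption, because it makes the number of buffers per connection easy to read off. The netlist format (a list of `(driver, load)` pairs) and the names `level_gates` and `buffers_needed` are hypothetical.

```python
from collections import defaultdict, deque


def level_gates(edges, primary_inputs):
    """Assign each node its longest-path distance from the sources (assumed leveling)."""
    succs = defaultdict(list)
    indeg = defaultdict(int)
    nodes = set(primary_inputs)
    for u, v in edges:
        succs[u].append(v)
        indeg[v] += 1
        nodes.update((u, v))

    level = {n: 0 for n in nodes}
    queue = deque(n for n in nodes if indeg[n] == 0)
    while queue:  # topological traversal, valid for loop-free netlists only
        u = queue.popleft()
        for v in succs[u]:
            level[v] = max(level[v], level[u] + 1)
            indeg[v] -= 1
            if indeg[v] == 0:
                queue.append(v)
    return level


def buffers_needed(edges, primary_inputs):
    """Map each connection that skips levels to the number of buffer stages to insert."""
    level = level_gates(edges, primary_inputs)
    return {
        (u, v): level[v] - level[u] - 1
        for u, v in edges
        if level[v] - level[u] > 1
    }
```

In a netlist where input `a` feeds both gate `g1` and, directly, the gate `g2` that follows `g1`, the connection from `a` to `g2` skips one level and would receive one buffer stage, initialized to NULL as stated above.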

Slack matching correctness

- Increases the token capacity
  - Potentially increases performance
- Does not affect the number of initial tokens
  - Liveness is not affected
- Does not affect the system structure
  - The flow equivalence is not affected

Experimental results: MCNC

[Figure: bar chart of performance (MHz, axis 0 to 600) for the RTL and GTL implementations of the benchmarks C1355, C5315, cm163a and lal.]

- RTL implementation
  - Not pipelined
- GTL implementation
  - Naturally fine-grain pipelined
  - Slack matching performed
- Both implementations obtained automatically from the same VHDL behavioral specification
- On average, ~4x better performance

Experimental results: AES

[Figure: bar charts for the RTL and GTL implementations. Performance (MHz, axis 0 to 800) for the modules Inverter, keyexpansn and aes10rnds; area (sq. um x 10^6, axis 0 to 4) for the modules Inverter and keyexpansn.]

- ~36x better performance
- ~12x larger

Bottom line

- Demonstrated automatic synthesis of QDI (robust to variations), automatically gate-level pipelined implementations from large behavioral specifications
- Synthesis run time is comparable with RTL synthesis (~2.5x slower), so design time could be reduced
- Resulting circuits feature increased performance (depth dependent, ~4x for MCNC) at the cost of area overhead
- Practical solution: first prerelease at http://async.bu.edu/weaver/
- Demonstrated correctness of transformations (weaving)

Future work

- Library design
  - Dynamic (domino-like) library design
  - Low-leakage library design to combine the high performance of fine-grain pipelining with low power from very aggressive voltage reduction
  - Balanced library for security-related applications
- Extending the concept to other technologies
  - Automated asynchronous fine-grain pipelining for standard FPGAs
- Synthesis flow development
  - Integration of efficient GTL “design-ware” and architectures

Thank you!

Questions? Comments? Suggestions?

Backup slides

- Slack matching animated example
- Similar work
- FSM + datapath example (1-round AES)
- Experiments setup
- Linear HB PPN
- Non-linear HB PPN
- Closed linear HB pipeline liveness

Slack matching: example (C17)

[Figure: animated example on the woven C17 netlist of GTLAND2 and GTLOR2 stages connected by channels (dr_data plus handshake); GTL1 buffer stages are inserted step by step.]

Similar work: the difference

- Null Convention Logic
  - Coarse-grain
  - Slow and large synchronization trees
- Phased logic
  - Different encoding provides less switching activity
  - Complicated synthesis algorithm due to encoding
- De-synchronization
  - Bundled data
  - Coarse-grain
- None of the above provides support for automated fine-grain pipelining

Example: data path

[Figure: animated data path example. Labels across the frames: FSM, CL blocks, REG, MUX and deMUX.]

Experiments setup

- Standard gates: vtvt library from Virginia Tech, TSMC 0.25 um
- C-elements: derived from the PCHB library from USC and simulated to obtain performance

[Figure: stage template. Labels: C (C-elements), Storage, CD, PC, ACK, F; handshake signals LReq, LAck, RReq, RAck; dual-rail inputs A0, A1, B0, B1 and outputs Z0, Z1.]

All correctness prerequisites

1. No additional data dependencies are added and no existing data dependencies are removed during weaving.
2. Every gate implementing a logical function is mapped to a GTL gate (stage) implementing an equivalent function for dual-rail encoded data and initialized to NULL (spacer).
3. The maximum token capacity of a closed asynchronous HB pipeline is S/2 - 1 (where S is the number of HB stages).
4. The maximum token capacity of a closed asynchronous FB pipeline is S - 1 (where S is the number of stages).
5. In HB pipelines distinct tokens are always separated by spacers (there are no two distinct tokens in any two adjacent stages).
6. For each DFF in the RTL implementation there exist two HB stages in the GTL implementation, one initialized to a spacer and the other to a token.
7. The number of HB pipeline stages in any cycle of the GTL implementation is greater than the number of DLs (or half-DFFs) in the corresponding synchronous RTL implementation.
8. The GTL pipeline token capacity is greater than or equal to that of the synchronous implementation.
9. No stage state is shared between any two stages.
10. Exactly one place is marked in every stage state.
11. An HB PPN marking is valid iff every FB-stage in the HB PPN has exactly one marker.
12. A GTL-style pipeline is properly modeled by an HB PPN.
13. A live closed HB PPN is at least 3 HB stages long.
14. A live closed HB PPN has at least one token and at most S/2 - 1 tokens.
15. The token flow is deterministic and does not depend on the data itself.
16. A marked graph is live iff M0 assigns at least one token to each directed loop (or circuit).
17. For an HB PPN to be live, each of its directed circuits composed of forward arcs must, as a closed HB PPN, satisfy conditions 11, 13 and 14.
18. Every feedback loop in the synchronous implementation contains at least one DFF (or a pair of DLs).
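
Prerequisites 2, 5 and 6 together describe how the initial state of a woven pipeline is formed. The sketch below builds such an initial marking for a chain of elements and checks token separation. It rests on assumptions not stated in the slides: the element kinds `"DFF"` and `"GATE"`, the order of the two HB stages inferred for a DFF (spacer first, then token), and the function names are all hypothetical.

```python
TOKEN, SPACER = "T", "N"


def initial_marking(elements):
    """Build the initial HB stage marking for a chain of RTL elements."""
    marking = []
    for kind in elements:
        if kind == "DFF":
            # Two HB stages per DFF: one spacer, one token (order assumed).
            marking += [SPACER, TOKEN]
        elif kind == "GATE":
            # Combinational gates map to one stage initialized to NULL.
            marking.append(SPACER)
        else:
            raise ValueError(f"unknown element kind: {kind!r}")
    return marking


def tokens_separated(marking, closed=False):
    """Check that no two adjacent HB stages both hold a token."""
    pairs = list(zip(marking, marking[1:]))
    if closed and len(marking) > 1:
        pairs.append((marking[-1], marking[0]))  # wrap around for a loop
    return all(not (a == TOKEN and b == TOKEN) for a, b in pairs)
```

Under these assumptions the number of initial tokens equals the number of DFFs, and the marking of a chain or loop built this way always keeps tokens apart.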

Linear pipeline

[Figure: the RTL pipeline (DFF1 to DFF3, CL1 to CL3, clk) with its PPN model (transitions t1, t3, t4), its HB PPN model (transitions t1 to t8) and the GTL implementation built from M, PC, CD and ACK blocks.]

- PPN models full-buffer pipelines; a PPN stage has two states
- HB PPN models half-buffer pipelines; an HB PPN stage has three states
- HB PPN properly models the HB GTL implementation

Non-linear pipeline

[Figure: a non-linear pipeline with a fork and a merge (transitions t1, t2, t3) and its HB PPN model.]

- PPN is equivalent to HB PPN except for token capacity
- MG PN is equivalent to HB PPN except for token capacity

Closed linear HB pipeline is live iff

- Every loop has at least 3 stages
- Token capacity C for any loop of N stages: $1 \le C \le N/2 - 1$
- Assumption we made: every loop in a synchronous circuit has a DFF
  - A loop with no CL is meaningless
- Liveness conditions hold

[Figure: closed linear HB PPN with transitions t1, t2 and t3; a loop containing DFF2 and CL.]