Gate Transfer Level Synthesis as an Automated Approach to Fine-Grain Pipelining
Alexander Smirnov
Alexander Taubin
Mark Karpovsky
Leonid Rozenblyum
Presentation goals
- Present an overview of the synthesis framework
- Demonstrate a high-level pipeline model
- Demonstrate the synthesis correctness
- Illustrate how the correctness is guaranteed
- Present experimental results
- Conclusions
- Future work
Objective
Industrial-quality EDA flow for automated synthesis of fine-grain pipelined robust circuits from high-level specifications
- Industrial quality
  - Easy to integrate into an RTL-oriented environment
  - Capable of handling very large designs (scalability)
- Automated fine-grain pipelining
  - To achieve high performance (throughput)
  - Automated to reduce design time
Choice of paradigm
Synchronous RTL
- 8 logic levels per stage is the limit
  - Due to register, clock skew, and jitter overhead
- Timing closure
  - No pipelining automation available; stage balancing is difficult
- Performance limitations
  - To guarantee correctness under process variation, etc.
Asynchronous GTL
- Lower design time
  - Automated pipelining possible from RTL specification
- Higher performance
  - Gate-level (finest possible) pipelining achievable
- Controllable power consumption
  - Smoothly slows down in case of voltage reduction
- Improved yield
  - Correct operation regardless of variations
Easy integration & scalability: Weaver flow architecture
- RTL tools reuse
  - Creates the impression that nothing has changed
  - Saves development effort
- Substitution-based transformations
  - Linear complexity
  - Enabled by using functionally equivalent DR (dual-rail: physical) and SR (single-rail: virtual) libraries
[Figure: Weaver flow architecture. An HDL design specification enters the host synthesis engine, which performs single-rail (synchronous) netlist synthesis using the srGTL lib and VHDL packages. Weaver weaves the single-rail netlist into a GTL netlist and maps it with the GTL lib (lib cells, GTL stages), producing a mapped QDI netlist and P&R constraints for the P&R and other tools that follow synthesis in the EDA flow.]
Easy integration & scalability: Weaver flow architecture
- Synthesis flow
  - Interfacing with the host synthesis engine
  - Transforming synchronous RTL to asynchronous GTL (weaving)
- Dedicated library(ies)
  - Dual-rail encoded data logic
  - Cells comprising entire stages
  - Internal delay assumptions only
Automated fine-grain pipelining: Gate Transfer Level (GTL)
[Figure: gate-level pipeline; labels: REG, Combinational logic]
[Figure: GTL stage template. Blocks labeled C, M, CD, PC, ACK, and F; dual-rail inputs A0, A1, B0, B1; dual-rail outputs Z0, Z1; handshake signals LReq, LAck, RReq, RAck; annotations "parasitic capacitances" and "weak".]
Automated fine-grain pipelining: Gate Transfer Level (GTL)
- Gate-level pipeline
- Let gates communicate asynchronously and independently
- Many pipeline styles can be used
- Templates already exist
Weaving
- Critical transformations
  - Mapping combinational gates (basic weaving)
  - Mapping sequential gates
  - Initialization preserving liveness and safeness
- Optimizations
  - Performance optimization
    - Fine-grain pipelining (natural)
    - Slack matching
  - Area optimization
    - Optimizing out identity function stages
Basic Weaving
- De Morgan transformation
- Dual-rail expansion
- Gate substitution
- Generating req/ack signals
  - Merge insertion
  - Fork insertion
- Reset routing
[Figure: weaving an AND2 gate (signals A, B, X, Z) into a GTL_AND2 stage (F: AND2, with ACK, PC, CD, and M blocks). Dual-rail implementation: data rails A0/A1, B0/B1, X0/X1, Z0/Z1 and handshake signals AReq/AAck, BReq/BAck, XReq/XAck, ZReq/ZAck.]
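The dual-rail expansion and De Morgan steps listed for basic weaving can be illustrated with a small behavioral model. This is only a sketch under stated assumptions, not the Weaver library: the encoding (rail 0 asserted for logic 0, rail 1 asserted for logic 1, both low for the NULL spacer), the strict "wait for both inputs" evaluation, and the names `encode`, `decode`, `dr_not`, `dr_and2`, and `dr_or2` are all hypothetical and chosen for illustration.

```python
# Behavioral sketch of dual-rail gates (assumed encoding, not the GTL library).
NULL = (0, 0)  # spacer: neither rail asserted


def encode(bit):
    """Return the (rail0, rail1) pair for a Boolean value."""
    return (0, 1) if bit else (1, 0)


def decode(pair):
    """Return the Boolean value of a valid dual-rail pair."""
    if pair == (0, 1):
        return 1
    if pair == (1, 0):
        return 0
    raise ValueError("not a valid dual-rail data value: %r" % (pair,))


def dr_not(a):
    """Dual-rail inversion is a rail swap; NULL stays NULL."""
    return (a[1], a[0])


def dr_and2(a, b):
    """Dual-rail AND2 that stays NULL until both inputs carry data (assumption)."""
    if a == NULL or b == NULL:
        return NULL
    a0, a1 = a
    b0, b1 = b
    # Rail 1 fires when both inputs are 1; rail 0 fires when any input is 0.
    return (a0 | b0, a1 & b1)


def dr_or2(a, b):
    """OR2 derived from AND2 by De Morgan: a OR b = NOT(NOT a AND NOT b)."""
    return dr_not(dr_and2(dr_not(a), dr_not(b)))
```

Because inversion costs only a rail swap in this encoding, one AND-type function block is enough to derive the OR gate, which is one plausible reason for a De Morgan step to appear before gate substitution.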
Basic Weaving: example (C17 MCNC benchmark)
[Figure: weaving the C17 netlist of AND2 and OR2 gates into GTLAND2 and GTLOR2 stages. Each connection becomes a channel that carries dual-rail data (dr_data: data0, data1) and a handshake (req, ack).]
Linear pipeline (RTL)
[Figure: linear RTL pipeline with flip-flops DFF1, DFF2, DFF3 and combinational logic blocks CL1, CL2, CL3 on a single clock clk, shown together with a latch-based version using D-latches DL1 to DL5 on clocks clk0 and clk1.]
Linear pipeline
[Figure: RTL pipeline (DFF1, DFF2, DFF3, CL1, CL2, CL3, clk) with two Petri net models: a pipeline PN (PPN) model with local handshake and a pipeline PN model with global synchronization. Transition labels include t1, t3, t4, t5, t6, clk0/clk<='0', and clk1/clk<='1'.]
- PPN models asynchronous full-buffer pipelines
Linear pipeline
[Figure: the RTL implementation (DFF1, DFF2, DFF3, CL1, CL2, CL3, clk) next to its GTL implementation, a chain of GTL stages built from ACK, PC, CD, and M blocks around the combinational logic.]
Correctness
- Safeness
  - Guarantees that the number of data portions (tokens) stays the same over time
- Liveness
  - Guarantees that the system operates continuously
- Flow equivalence
  - In both RTL and GTL implementations, corresponding sequential elements hold the same data values
    - On the same iterations (order-wise)
    - For the same input stream
Non-linear pipelines
- Deterministic token flow
  - Broadcasting tokens to all channels at Forks
  - Synchronizing at Merges
- Data-dependent token flow
  - Ctrl is also a dual-rail channel
  - To guarantee liveness, MUXes need to match deMUXes (computationally hard)
[Figure: Fork, Merge, MUX (in, out, ctrl), and deMUX (in, out, ctrl) symbols.]
Non-linear pipeline liveness
- Currently guaranteed for deterministic token flow only, by construction (weaving)
- A marking of a marked graph is live if each directed PN circuit has a marker
- Linear closed pipelines can be considered instead
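The marked-graph condition quoted above (every directed circuit carries a marker) can be checked without enumerating circuits: it holds exactly when the arcs that carry no token form an acyclic graph. The sketch below assumes a hypothetical representation in which each place is an arc `(source_transition, destination_transition, tokens)`; the function name `is_live_marked_graph` is illustrative and not part of Weaver.

```python
def is_live_marked_graph(arcs):
    """Return True if every directed circuit contains at least one token.

    arcs: iterable of (src_transition, dst_transition, tokens) tuples
    (an assumed representation: one arc per place of the marked graph).
    """
    successors = {}
    nodes = set()
    for src, dst, tokens in arcs:
        nodes.update((src, dst))
        if tokens == 0:
            # Only token-free arcs can form a token-free circuit.
            successors.setdefault(src, []).append(dst)

    WHITE, GREY, BLACK = 0, 1, 2
    color = {node: WHITE for node in nodes}
    for root in nodes:
        if color[root] != WHITE:
            continue
        color[root] = GREY
        stack = [(root, iter(successors.get(root, ())))]
        while stack:
            node, children = stack[-1]
            for child in children:
                if color[child] == GREY:
                    return False  # token-free circuit found
                if color[child] == WHITE:
                    color[child] = GREY
                    stack.append((child, iter(successors.get(child, ()))))
                    break
            else:
                color[node] = BLACK
                stack.pop()
    return True
```

An iterative depth-first search is used so that large netlists do not hit the recursion limit; the run time is linear in the number of arcs.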
Closed linear PPN
[Figure: closed linear PPN with transitions t1 and t2; markers shown as *. Implementation loop through DFF2 and CL.]
- Every PPN "stage" is a circuit and has a marker by definition
- Each implementation loop forms two directed circuits
  - Forward: has at least one token inferred for a DFF
  - Feedback: has at least one NULL inferred from CL or added explicitly
Closed linear PPN pipeline is live iff (for full-buffer pipelines)
- Every loop has at least 2 stages
- Token capacity for any loop: $1 \le C \le N - 1$
- Assumption we made: every loop in the synchronous circuit has a DFF
  - A loop with no CL is meaningless
- Liveness conditions hold by construction (weaving)
[Figure: closed linear PPN with transitions t1 and t2; loop through DFF2 and CL.]
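The full-buffer loop condition stated above (at least 2 stages, and a token count C with 1 ≤ C ≤ N − 1) is easy to express as a check over the loops of a design. The sketch below is only an illustration: the loop table, its names, and the helpers `fb_loop_is_live` and `dead_loops` are hypothetical, and extracting the loops from a netlist is not shown.

```python
def fb_loop_is_live(stages, tokens):
    """Full-buffer loop condition from the slide: N >= 2 and 1 <= C <= N - 1."""
    return stages >= 2 and 1 <= tokens <= stages - 1


def dead_loops(loops):
    """Return the names of loops that violate the condition.

    loops: dict mapping a loop name to (number_of_stages, initial_tokens);
    this table format is an assumption made for the example.
    """
    return sorted(name for name, (stages, tokens) in loops.items()
                  if not fb_loop_is_live(stages, tokens))
```

For example, a loop woven from one DFF and one combinational gate has two stages and one token, so it passes; a loop in which every stage holds a token has no room for a token to move, and is reported.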
Initialization: example

library ieee;
use ieee.std_logic_1164.all;

entity seqr is
  port (
    D_out     : out std_logic;
    D_in, clk : in  std_logic
  );
end seqr;

architecture behavior of seqr is
  signal Data : std_logic_vector(2 downto 0);
begin
  process(clk)
  begin
    -- Shift D_in into the 3-bit register on the falling clock edge
    if (clk = '0' and clk'event) then
      Data <= (D_in, Data(2), Data(1));
    end if;
    -- Assert D_out when all three stored bits are '1'
    if (Data = "111") then
      D_out <= '1';
    else
      D_out <= '0';
    end if;
  end process;
end behavior;

[Figure: the synthesized circuit with three DFFs (D_in, clk, D_out) and, after weaving, a GTL pipeline of F, HB, and FB stages from D_in to D_out. Each channel carries data0, data1, req, and ack.]
Initialization: FSM example
[Figure: FSM with Next state CL, Output CL, and State REG clocked by clk (in, out), and its GTL counterpart with a Fork and chains of HB stages.]
Flow equivalence
- GTL data flow structure is equivalent to the source RTL by weaving
  - No data dependencies are removed
  - No additional dependencies are introduced
- In a deterministic flow architecture
  - There are no token races (tokens cannot pass each other)
  - All forks are broadcast and all joins are synchronizers
- Flow equivalence is preserved by construction
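Flow equivalence, as defined earlier in the deck, compares what each sequential element holds, in order, and ignores when it holds it. A minimal sketch of such a comparison is given below. It assumes a hypothetical trace format, a time-ordered list of `(element, value)` events in which `None` stands for a NULL spacer; the names `held_values` and `flow_equivalent` are illustrative and do not come from Weaver.

```python
def held_values(trace):
    """Collect, per sequential element, the ordered list of data values it held.

    trace: time-ordered list of (element_name, value) events; value None marks
    a NULL spacer and is skipped (assumed trace format).
    """
    sequences = {}
    for element, value in trace:
        if value is not None:
            sequences.setdefault(element, []).append(value)
    return sequences


def flow_equivalent(rtl_trace, gtl_trace):
    """True if corresponding elements hold the same values in the same order."""
    return held_values(rtl_trace) == held_values(gtl_trace)
```

Interleaving between different elements is deliberately not compared, which matches the point that token propagation in GTL is independent of RTL timing while the per-element order is unchanged.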
Flow equivalence
[Animation: tokens 1 to 4 and NULLs (N) propagating frame by frame through the RTL and GTL pipelines. Frame captions, in order:]
- GTL initialization is the same as RTL, but token propagation is independent
- In GTL, "3" hits the first top register output
- In GTL, "3" hits the first bottom register output
- In GTL, "2" hits the second register output
- In RTL, "3" and "2" moved one stage ahead
- Timing is independent; the order is unchanged
Optimizations
- Area
  - Optimizing out identity function stages
- Performance
  - Fine-grain pipelining (natural)
  - Slack matching
Optimizing out identity function stages
- Identity function stages (buffers) are inferred for clocked DFFs and D-latches
- Implement no functionality
- Can be removed as long as
  - The token capacity is not decreased below the RTL level
  - The resulting circuit can still be properly initialized
Optimizing out identity function stages: example
[Figure: CL blocks separated by DFFs, woven into chains of HB stages.]
- Final implementation is the same as if the RTL had not been pipelined (except for initialization)
- Saves pipelining effort
Slack matching implementation
- Adjusting the pipeline slack to optimize its throughput
- Implementation
  - Leveling gates according to their shortest paths from primary inputs (outputs)
  - Inserting buffer stages to break long dependencies
  - Buffer stages initialized to NULL
- Currently performed for circuits with no loops only
- Complexity $O(|X| \cdot |C|^2)$
  - |X|: the number of primary inputs
  - |C|: the number of connection points in the netlist
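The leveling-and-buffering idea can be sketched on a loop-free netlist as follows. Note the swapped-in technique: this sketch levels each gate by its longest path from the primary inputs (a common textbook formulation), whereas the slide speaks of shortest paths, so it should be read as an illustration of the general idea rather than the Weaver algorithm or its stated complexity. The netlist format and the function name `slack_match` are assumptions.

```python
from collections import defaultdict


def slack_match(gates, edges):
    """Level a loop-free netlist and count buffer stages needed per edge.

    gates: list of gate names; edges: list of (driver, receiver) pairs.
    Returns (level, buffers), where buffers[(driver, receiver)] is the number
    of buffer stages to insert so every edge spans exactly one level.
    Longest-path leveling is an assumption made for this sketch.
    """
    successors = defaultdict(list)
    indegree = {gate: 0 for gate in gates}
    for driver, receiver in edges:
        successors[driver].append(receiver)
        indegree[receiver] += 1

    level = {gate: 0 for gate in gates}
    ready = [gate for gate in gates if indegree[gate] == 0]
    visited = 0
    while ready:  # topological traversal (Kahn's algorithm)
        gate = ready.pop()
        visited += 1
        for receiver in successors[gate]:
            level[receiver] = max(level[receiver], level[gate] + 1)
            indegree[receiver] -= 1
            if indegree[receiver] == 0:
                ready.append(receiver)
    if visited != len(gates):
        raise ValueError("netlist contains a loop; only loop-free circuits are handled")

    buffers = {(d, r): level[r] - level[d] - 1 for d, r in edges}
    return level, buffers
```

In a three-gate example where `a` drives both `b` and `c`, and `b` drives `c`, the direct `a` to `c` connection skips one level and therefore receives one buffer stage, which per the slide would be initialized to NULL.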
Slack matching correctness
- Increases the token capacity
  - Potentially increases performance
- Does not affect the number of initial tokens
  - Liveness is not affected
- Does not affect the system structure
  - Flow equivalence is not affected
Experimental results: MCNC
[Chart: performance in MHz (axis 0 to 600) of the RTL and GTL implementations for benchmarks C1355, C5315, cm163a, and lal.]
- RTL implementation
  - Not pipelined
- GTL implementation
  - Naturally fine-grain pipelined
  - Slack matching performed
- Both implementations obtained automatically from the same VHDL behavioral specification
- On average, about 4x better performance
Experimental results: AES
[Chart: performance in MHz (axis 0 to 800) of the RTL and GTL implementations for modules Inverter, keyexpansn, and aes10rnds.]
[Chart: area in sq. um x 10^6 (axis 0 to 4) of the RTL and GTL implementations for modules Inverter and keyexpansn.]
- About 36x better performance
- About 12x larger area
Base line
- Demonstrated automatic synthesis of QDI (robust to variations), automatically gate-level pipelined implementations from large behavioral specifications
- Synthesis run time is comparable with RTL synthesis (~2.5x slower), so design time could be reduced
- Resulting circuits feature
  - Increased performance (depth-dependent, ~4x for MCNC)
  - Area overhead
- Practical solution: first prerelease at http://async.bu.edu/weaver/
- Demonstrated correctness of transformations (weaving)
Future work
- Library design
  - Dynamic (domino-like) library design
  - Low-leakage library design to combine the high performance of fine-grain pipelining with low power from very aggressive voltage reduction
  - Balanced library for security-related applications
- Extending the concept to other technologies
  - Automated asynchronous fine-grain pipelining for standard FPGAs
- Synthesis flow development
  - Integration of efficient GTL "design-ware" and architectures
Thank you!
Questions? Comments? Suggestions?
Backup slides
- Slack matching animated example
- Similar work
- FSM + datapath example (1-round AES)
- Experiments setup
- Linear HB PPN
- Non-linear HB PPN
- Closed linear HB pipeline liveness
Slack matching: example (C17)
[Animation: the woven C17 pipeline of GTLAND2 and GTLOR2 stages, connected by channels (dr_data plus handshake), with GTL1 stages inserted step by step.]
Similar work: the difference
- Null Convention Logic
  - Coarse-grain
  - Slow and large synchronization trees
- Phased logic
  - Different encoding provides less switching activity
  - Complicated synthesis algorithm due to encoding
- De-synchronization
  - Bundled data
  - Coarse-grain
- None of the above provides support for automated fine-grain pipelining
Example: data path
[Figure sequence: data path built from an FSM, CL blocks, MUXes, and REGs; the last frame shows MUX and DEMUX blocks.]
Experiments setup
- Standard gates: vtvt library from Virginia Tech, TSMC 0.25 µm
- C-elements: derived from the PCHB library from USC and simulated to obtain performance
[Figure: stage template. Blocks labeled C, Storage, CD, PC, ACK, and F; dual-rail inputs A0, A1, B0, B1; dual-rail outputs Z0, Z1; handshake signals LReq, LAck, RReq, RAck.]
All correctness prerequisites
1. No additional data dependencies are added and no existing data dependencies are removed during weaving.
2. Every gate implementing a logical function is mapped to a GTL gate (stage) implementing an equivalent function for dual-rail encoded data and initialized to NULL (spacer).
3. The maximum token capacity of a closed asynchronous HB pipeline is S/2 - 1 (where S is the number of HB stages).
4. The maximum token capacity of a closed asynchronous FB pipeline is S - 1 (S is the number of HB stages).
5. In HB pipelines, distinct tokens are always separated by spacers (no two distinct tokens occupy any two adjacent stages).
6. For each DFF in the RTL implementation, the GTL implementation contains two HB stages, one initialized to a spacer and the other to a token.
7. The number of HB pipeline stages in any cycle of the GTL implementation is greater than the number of DLs (or half-DFFs) in the corresponding synchronous RTL implementation.
8. The GTL pipeline token capacity is greater than or equal to that of the synchronous implementation.
9. No stage state is shared between any two stages.
10. Exactly one place is marked in every stage state.
11. An HB PPN marking is valid iff every FB-stage in the HB PPN has exactly one marker.
12. A GTL-style pipeline is properly modeled by an HB PPN.
13. A live closed HB PPN is at least 3 HB stages long.
14. A live closed HB PPN has at least one token and at most S/2 - 1 tokens.
15. The token flow is deterministic and does not depend on the data itself.
16. A marked graph is live iff M0 assigns at least one token to each directed loop (or circuit).
17. For an HB PPN to be live, each of its directed circuits composed of forward arcs, taken as a closed HB PPN, must satisfy conditions 11, 13, and 14.
18. Every feedback loop in the synchronous implementation contains at least one DFF (or a pair of DLs).
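Prerequisites 13 and 14 above bound the length and token count of a live closed half-buffer loop. A small sketch of that check follows; the function name `hb_loop_is_live` and its arguments are hypothetical. The upper bound C ≤ S/2 − 1 is written as 2·C ≤ S − 2 so that no rounding rule has to be assumed for odd S.

```python
def hb_loop_is_live(stages, tokens):
    """Check prerequisites 13 and 14 for a closed half-buffer loop.

    stages: number of HB stages S in the loop; tokens: initial token count C.
    The bound C <= S/2 - 1 is tested as 2*C <= S - 2 to stay in integers.
    """
    long_enough = stages >= 3                              # prerequisite 13
    capacity_ok = tokens >= 1 and 2 * tokens <= stages - 2  # prerequisite 14
    return long_enough and capacity_ok
```

For instance, a loop of four HB stages holding one token satisfies both bounds, while the same loop holding two tokens does not.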
Linear pipeline
[Figure: RTL pipeline (DFF1, DFF2, DFF3, CL1, CL2, CL3, clk) with its PPN model (transitions t1, t3, t4) and its HB PPN model (transitions t1 to t8), followed by the HB GTL implementation built from ACK, PC, CD, and M blocks.]
- PPN models full-buffer pipelines; a PPN stage has two states
- HB PPN models half-buffer pipelines; an HB PPN stage has three states
- HB PPN properly models the HB GTL implementation
Non-linear pipeline
[Figure: fork and merge structures with transitions t1, t2, t3 and their HB PPN model.]
- PPN is equivalent to HB PPN except for token capacity
- MG PN is equivalent to HB PPN except for token capacity
Closed linear HB pipeline is live iff
- Every loop has at least 3 stages
- Token capacity for any loop: $1 \le C \le N/2 - 1$
- Assumption we made: every loop in the synchronous circuit has a DFF
  - A loop with no CL is meaningless
- Liveness conditions hold
[Figure: closed HB pipeline with transitions t1, t2, t3; loop through DFF2 and CL.]