Bounded Dataflow Networks and Latency Insensitive Circuits Cont…

Arvind
Computer Science and Artificial Intelligence Laboratory, MIT

Based on the work of Murali Vijayaraghavan and Arvind [MEMOCODE 2009]

November 19, 2009
http://csg.csail.mit.edu/korea

Modular transformation

[Figure: an SSM composed of SSM1, SSM2 and SSM3, and a BDN composed of BDN1, BDN2 and BDN3]

Is this transformation correct?

Yes. Provided each BDNi implements SSMi and is latency insensitive, the resulting BDN implements the SSM and is latency insensitive.


BDN Implementing an SSM

A BDN is said to implement an SSM iff:

1. There is a bijective mapping between the inputs (outputs) of the SSM and those of the BDN

2. The output histories of the SSM and the BDN match whenever the input histories match

3. The BDN is deadlock-free

[Figure: an SSM and a BDN with corresponding inputs and outputs]


Latency-Insensitive BDN (LI-BDN)

A BDN implementing an SSM is an LI-BDN iff it has both:

The No-Extraneous Dependency (NED) property

The Self-Cleaning (SC) property

Theorem: A BDN where all the nodes are LI-BDNs will not deadlock


No-Extraneous Dependency (NED) property

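[Figure: an SSM in which out is combinationally connected to a subset of the inputs, and the corresponding BDN with output FIFO outQ]

In the BDN, production of outQ waits only for the input FIFOs that correspond to the inputs combinationally connected to out.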


Self-Cleaning (SC) property

If the BDN has enqueued all its outputs, it will dequeue all its inputs


Modular refinement - revisited

[Figure: the refinement in three stages]

SSM1 (module to be refined) and SSM2 (rest of the design)

LI-BDN1 (implementing SSM1) and LI-BDN2 (automatically generated)

LI-BDN1 (refined manually) and LI-BDN2


Writing an LI-BDN wrapper for an SSM

Given the SSM:

oj(t) = fj(ij1(t), ..., ijIj(t), s(t))

// ij1, ij2, ..., ijIj are the inputs combinationally connected to oj

s(t+1) = g(i1(t), i2(t), ..., s(t))

LI-BDN:

Introduce a done flag and a rule for each output:

rule Oj when (!donej)
  donej <= True;
  oj.enq( fj(ij1.first, ..., ijIj.first, s) )

Introduce the Finish rule:

rule Finish when (done1 && done2 && ...)
  done1 <= False; done2 <= False; ...;
  s <= g(i1.first, i2.first, ..., s);
  i1.deq; i2.deq; ...
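The wrapper recipe can be tried out in software. The following is a minimal Python sketch, not the authors' implementation: the class `LIBDNWrapper`, the example SSM (o1 = a + s, o2 = s, s' = s + b), the FIFO depth of 2 and the random scheduler are all assumptions made for illustration. It fires the output rules and the Finish rule in an arbitrary order and records the output histories so they can be compared with a cycle-by-cycle run of the SSM.

```python
from collections import deque
import random


class LIBDNWrapper:
    """Wrapper with one rule per output plus a Finish rule (hypothetical model).

    outputs maps an output name to (input names it depends on, f_j);
    next_state is g, called with a dict of all input heads and the state s.
    """

    def __init__(self, input_names, outputs, next_state, init_state, depth=2):
        self.inq = {i: deque() for i in input_names}
        self.outq = {j: deque() for j in outputs}
        self.outputs, self.next_state = outputs, next_state
        self.s, self.depth = init_state, depth
        self.done = {j: False for j in outputs}

    def rule_output(self, j):
        deps, f = self.outputs[j]
        # Guard: not done yet, the inputs o_j depends on are present,
        # and the output FIFO has room. No other input is consulted.
        if (self.done[j] or len(self.outq[j]) >= self.depth
                or not all(self.inq[i] for i in deps)):
            return False
        self.outq[j].append(f(*[self.inq[i][0] for i in deps], self.s))
        self.done[j] = True
        return True

    def rule_finish(self):
        # Guard: every output produced and every input present.
        if not (all(self.done.values()) and all(self.inq.values())):
            return False
        heads = {i: q[0] for i, q in self.inq.items()}
        self.s = self.next_state(heads, self.s)
        for q in self.inq.values():
            q.popleft()
        self.done = {j: False for j in self.done}
        return True


def ssm_reference(a_s, b_s):
    """Cycle-by-cycle example SSM: o1 = a + s, o2 = s, s' = s + b."""
    s, hist = 0, {"o1": [], "o2": []}
    for a, b in zip(a_s, b_s):
        hist["o1"].append(a + s)
        hist["o2"].append(s)
        s = s + b
    return hist


def make_example():
    return LIBDNWrapper(
        ["a", "b"],
        {"o1": (["a"], lambda a, s: a + s), "o2": ([], lambda s: s)},
        lambda heads, s: s + heads["b"],
        init_state=0)


def run_randomly(w, streams, seed=0):
    """Fire environment actions and rules in a random order; record outputs."""
    rng = random.Random(seed)
    pending = {i: deque(v) for i, v in streams.items()}
    hist = {j: [] for j in w.outq}
    n = min(len(v) for v in streams.values())
    while any(len(h) < n for h in hist.values()):
        kind = rng.choice(["feed", "drain", "output", "finish"])
        if kind == "feed":
            i = rng.choice(sorted(pending))
            if pending[i] and len(w.inq[i]) < w.depth:
                w.inq[i].append(pending[i].popleft())
        elif kind == "drain":
            j = rng.choice(sorted(hist))
            if w.outq[j] and len(hist[j]) < n:  # record only the first n outputs
                hist[j].append(w.outq[j].popleft())
        elif kind == "output":
            w.rule_output(rng.choice(sorted(hist)))
        else:
            w.rule_finish()
    return hist
```

In this sketch, o1 can be produced as soon as a token for a arrives, even if b is still missing, while Finish waits for every input. Comparing `run_randomly(make_example(), {"a": ..., "b": ...})` with `ssm_reference(...)` for different seeds is a quick way to see that the order in which rules fire does not change the output histories.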


Wrapper circuit

[Figure: wrapper circuit around a patient SSM. Labels: input FIFOs Ii (first, not-empty, deq), output FIFOs Oj (value, not-full, enq), a donej flag per output, Depends-on(Oj), all dones, all inputs, enable.]


Patient SSM

[Figure: an SSM (Inputs, Combinational Logic, Outputs) and the corresponding patient SSM, which adds an Enable input]


Example

3-port and 1-port Register Files

interface RegisterFile3Ports;
  method Value rd0(Addr a);
  method Value rd1(Addr a);
  method Action wr(Addr a, Value x);
endinterface

[Figure: rf with inputs ra0, ra1, wen, wa, wd and outputs rd0, rd1]

interface RegisterFile1port;
  method ActionValue#(Value) access(Req r);
endinterface
// Response to a write access is unconstrained

typedef union tagged {
  struct { Addr a; Value v; } W;
  struct { Addr a; } R;
} Req;

[Figure: rf with inputs en, R/W, a, d and output out]


LI-BDN for a 3-port register file

rule RD0 when (!rd0Done)
  rd0.enq(rf.rd0(ra0.first)); rd0Done <= True;

rule RD1 when (!rd1Done)
  rd1.enq(rf.rd1(ra1.first)); rd1Done <= True;

rule finish when (rd0Done && rd1Done)
  ra0.deq; ra1.deq; wen.deq; wa.deq; wd.deq;
  if (wen.first) rf.wr(wa.first, wd.first);
  rd0Done <= False; rd1Done <= False;

[Figure: rf with input FIFOs ra0, ra1, wen, wa, wd, output FIFOs rd0, rd1, and flags rd0Done, rd1Done]
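The three rules for the 3-port register file can be modeled directly. This is a rough Python sketch under stated assumptions: the class name `RegFile3PortLIBDN`, the four-entry register file initialized to zero, and the unbounded `deque` FIFOs are illustrative simplifications, not part of the original design.

```python
from collections import deque


class RegFile3PortLIBDN:
    """Software model of the RD0, RD1 and finish rules (hypothetical names)."""

    def __init__(self, nregs=4):
        self.rf = [0] * nregs
        self.ra0, self.ra1, self.wen, self.wa, self.wd = (deque() for _ in range(5))
        self.rd0, self.rd1 = deque(), deque()
        self.rd0_done = self.rd1_done = False

    def rule_rd0(self):
        # Waits only for ra0: the one input rd0 depends on combinationally.
        if self.rd0_done or not self.ra0:
            return False
        self.rd0.append(self.rf[self.ra0[0]])
        self.rd0_done = True
        return True

    def rule_rd1(self):
        if self.rd1_done or not self.ra1:
            return False
        self.rd1.append(self.rf[self.ra1[0]])
        self.rd1_done = True
        return True

    def rule_finish(self):
        ins = (self.ra0, self.ra1, self.wen, self.wa, self.wd)
        if not (self.rd0_done and self.rd1_done and all(ins)):
            return False
        if self.wen[0]:
            self.rf[self.wa[0]] = self.wd[0]  # write becomes visible next cycle
        for q in ins:
            q.popleft()
        self.rd0_done = self.rd1_done = False
        return True


def fire_until_quiescent(m):
    """Keep firing any enabled rule until none can fire."""
    while m.rule_rd0() or m.rule_rd1() or m.rule_finish():
        pass
```

A read of an address written in the same model cycle returns the old value, because the write happens in the finish rule after both reads have been enqueued.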


Refinement into a one-ported register file LI-BDN

rule RD0 when (!rd0Done)
  let x <- rf.access(tagged R {a: ra0.first}); rd0.enq(x); rd0Done <= True;

rule RD1 when (!rd1Done)
  let x <- rf.access(tagged R {a: ra1.first}); rd1.enq(x); rd1Done <= True;

rule finish when (rd0Done && rd1Done)
  ra0.deq; ra1.deq; wen.deq; wa.deq; wd.deq;
  if (wen.first) rf.access(tagged W {a: wa.first, v: wd.first});
  rd0Done <= False; rd1Done <= False;

This uses 1 port

[Figure: one-ported rf (en, R/W, a, d, out) with input FIFOs ra0, ra1, wen, wa, wd, output FIFOs rd0, rd1, and flags rd0Done, rd1Done]
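To see that the one-ported refinement keeps the timing contract of the 3-port SSM, the two can be compared in a small Python model. Everything here is an illustrative assumption: the names `OnePortRF`, `RegFileLIBDN1Port` and `ssm_3port`, the tuple encoding of requests, and the `clock` method that lets at most one rule fire per implementation clock because every rule needs the single port.

```python
from collections import deque


class OnePortRF:
    """Register file with a single access port; counts how often it is used."""

    def __init__(self, nregs=4):
        self.mem, self.accesses = [0] * nregs, 0

    def access(self, req):
        self.accesses += 1
        kind, addr, value = req
        if kind == "W":
            self.mem[addr] = value
            return None  # the response to a write is unconstrained
        return self.mem[addr]


class RegFileLIBDN1Port:
    def __init__(self, nregs=4):
        self.rf = OnePortRF(nregs)
        self.ra0, self.ra1, self.wen, self.wa, self.wd = (deque() for _ in range(5))
        self.rd0, self.rd1 = deque(), deque()
        self.rd0_done = self.rd1_done = False

    def rule_rd0(self):
        if self.rd0_done or not self.ra0:
            return False
        self.rd0.append(self.rf.access(("R", self.ra0[0], None)))
        self.rd0_done = True
        return True

    def rule_rd1(self):
        if self.rd1_done or not self.ra1:
            return False
        self.rd1.append(self.rf.access(("R", self.ra1[0], None)))
        self.rd1_done = True
        return True

    def rule_finish(self):
        ins = (self.ra0, self.ra1, self.wen, self.wa, self.wd)
        if not (self.rd0_done and self.rd1_done and all(ins)):
            return False
        if self.wen[0]:
            self.rf.access(("W", self.wa[0], self.wd[0]))
        for q in ins:
            q.popleft()
        self.rd0_done = self.rd1_done = False
        return True

    def clock(self):
        """One implementation clock: at most one rule fires (single port)."""
        return self.rule_rd0() or self.rule_rd1() or self.rule_finish()


def ssm_3port(cycles, nregs=4):
    """Reference 3-port SSM: both reads see the state before the write."""
    rf, out0, out1 = [0] * nregs, [], []
    for ra0, ra1, wen, wa, wd in cycles:
        out0.append(rf[ra0])
        out1.append(rf[ra1])
        if wen:
            rf[wa] = wd
    return out0, out1


def run(cycles):
    m = RegFileLIBDN1Port()
    for ra0, ra1, wen, wa, wd in cycles:
        m.ra0.append(ra0); m.ra1.append(ra1); m.wen.append(wen)
        m.wa.append(wa); m.wd.append(wd)
    clocks = 0
    while m.clock():
        clocks += 1
    return list(m.rd0), list(m.rd1), clocks, m.rf.accesses
```

In this model each cycle of the 3-port SSM takes three implementation clocks, yet the sequences of values on rd0 and rd1 are the same as those of the reference, which is the sense in which the refinement preserves cycle accuracy.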


Pipelining combinational circuits

[Figure: a circuit with blocks f1, f2, f3 and S1, S2, S3 over signals a, b, c, d, e, and the same circuit with blocks f1, f2, f3 and R1, R2, R3]

Can potentially reduce the critical path of the entire circuit


Optimizing an LI-BDN mux

[Figure: a mux with signals a, b, c, d, and its optimized LI-BDN version]

Does not wait for don't-care inputs

Counters are used to keep track of how many inputs to drop

Can potentially increase the throughput
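The idea of dropping don't-care inputs with counters can be sketched in Python. The slide does not say which of a, b, c, d plays which role, so this sketch assumes c is the select, a and b are the data inputs, and d is the output; the class name `OptimizedMux`, the unbounded FIFOs and the unbounded counters are also simplifying assumptions.

```python
from collections import deque


class OptimizedMux:
    """Mux LI-BDN that does not wait for the unselected input (hypothetical)."""

    def __init__(self):
        self.a, self.b, self.c, self.d = deque(), deque(), deque(), deque()
        self.drop_a = 0  # stale tokens of a still to be discarded
        self.drop_b = 0  # stale tokens of b still to be discarded

    def rule_drop_a(self):
        if self.drop_a > 0 and self.a:
            self.a.popleft()
            self.drop_a -= 1
            return True
        return False

    def rule_drop_b(self):
        if self.drop_b > 0 and self.b:
            self.b.popleft()
            self.drop_b -= 1
            return True
        return False

    def rule_select(self):
        if not self.c:
            return False
        if self.c[0]:
            # The head of a belongs to this cycle only if no drops are pending.
            if self.drop_a > 0 or not self.a:
                return False
            self.d.append(self.a.popleft())
            self.drop_b += 1  # this cycle's b token is a don't-care
        else:
            if self.drop_b > 0 or not self.b:
                return False
            self.d.append(self.b.popleft())
            self.drop_a += 1
        self.c.popleft()
        return True


def mux_reference(c_s, a_s, b_s):
    """Cycle-by-cycle mux SSM: d = a if c else b."""
    return [a if c else b for c, a, b in zip(c_s, a_s, b_s)]
```

With this arrangement the output for a cycle that selects a can be produced before the matching b token has even arrived; the counter remembers that the late token must be discarded when it shows up.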


Summary

Latency Insensitive BDNs allow true modular refinement of a system, where even the timing contract of a module can be changed without affecting the rest of the system


A Design Flow issue

We can apply the technique discussed to refine this design

But where does this design come from in the first place?

Verilog? Verilog compiler output? Bluespec?

[Figure: processor pipeline with blocks Fetch1, Fetch2, Branch Pred, Crack, Decode, RegFile, AddrCalc/BranchResolve, Mem1, Mem2/ALU/ExceptionHandler, and paths labeled Branch Resolution, Branch Prediction, Exception and Register Write. Annotations: pipelined multiplier, multicycle divider, register file implemented as a BRAM.]


Design Flow Issues

Generation of appropriate RTL is the major problem

RTL / Specifications should be written in such a way that they are amenable to refinements

Latency Insensitive Design Methodology


The PowerPC Project

Cycle-accurate modeling of PowerPC on FPGAs


PPC In-order Pipeline

The designer specifies the FSM for each stage

The FIFOs are latency-insensitive, that is, the correctness of the specification does not depend upon the depth of FIFOs or the number of stages

[Figure: in-order pipeline PC, Fetch, BrPred, Crack, Decode, RegRd, AddrCalc/BrRes, Mem1, Mem2/ALU/Excep, RegWr, with I$/ITlb1, I$/ITlb2 and D$/DTlb1, D$/DTlb2 connected to Mem. Labels: stall, bypass, epochs.]


The steps in Cycle-accurate implementation on FPGAs

The specs are turned into Bluespec code to give a target SSM

Once the size of FIFOs is fixed the whole design has a precise timing specification

If the FPGA implementation requires refining some stages then cuts are made in the design to isolate the stages (SSMs) to be refined

Each SSM is turned into a BDN by introducing FIFOs for each input and output wire, including the wires going in and out of model FIFOs of the SSM

This converts the nth time cycle of the SSM into the nth enqueue into input FIFOs and nth dequeue from output FIFOs

Atomic rules for the operation of each BDN are defined so that no extraneous dependencies are introduced

This also ensures deadlock-free operation

Can be mechanized


Initial results using XUPV5 FPGA

Direct prototype | Cycle-accurate model using BDN theory
Developed mainly at IBM (Kattamuri Ekanadham, Jessica Tseng) | Developed mainly at MIT (Asif Khan, Yuan Tang), re-using a lot of components
92% LUT resources | 24% LUT resources
20 MHz, can possibly be increased to 40 MHz | 125 MHz
Boots Linux | Boots Linux


Detailed Preliminary Results

Asif Khan & Murali Vijayaraghavan (June 2009)

Cycle-accurate refinements onto Xilinx XUPV5

Slice Logic Utilization:
  Number of Slice Registers: 15448 out of 69120 (22%)
  Number of Slice LUTs: 16702 out of 69120 (24%)

Specific Feature Utilization:
  Number of Block RAM/FIFO: 1 out of 148 (0%) (only 1 BRAM for the register file)
  Number of DSP48Es: 12 out of 64 (18%) (these are used for the divider)

Minimum period: 7.988 ns (Maximum Frequency: 125.188 MHz)

Partially verified by running a 50-instruction program

Compared to Jessica's port onto Xilinx XUPV5: takes up 92% of the area; 20 MHz, possibly 40 MHz

No numbers yet for actual work done


Conclusion

Cycle-accurate modeling of processors on FPGAs is feasible and offers a three-orders-of-magnitude improvement in performance over software simulators

BDNs offer a way to refine RTL without losing cycle-accuracy

Bluespec makes quick RTL generation feasible

The generation of BDNs can be automated

We plan to release our Bluespec designs under open source licensing to strengthen the PowerPC ecosystem.

