Bounded Dataflow Networks and Latency Insensitive Circuits Cont…
Arvind
Computer Science and Artificial Intelligence Laboratory, MIT
Based on the work of Murali Vijayaraghavan and Arvind [MEMOCODE 2009]
November 19, 2009 L23-1 http://csg.csail.mit.edu/korea
Modular transformation
[Figure: an SSM composed of SSM1, SSM2, and SSM3, and the corresponding BDN composed of BDN1, BDN2, and BDN3]
Is this transformation correct?
Yes, provided each BDNi implements SSMi and is latency insensitive; the resulting BDN then implements the SSM and is latency insensitive.
BDN Implementing an SSM
A BDN is said to implement an SSM iff
1. There is a bijective mapping between inputs (outputs) of the SSM and BDN
2. The output histories of the SSM and BDN match whenever the input histories match
3. The BDN is deadlock-free
[Figure: an SSM and a BDN side by side, each with its inputs and outputs]
Latency-Insensitive BDN (LI-BDN)
A BDN implementing an SSM is an LI-BDN iff it has both the No-Extraneous-Dependencies (NED) property and the Self-Cleaning (SC) property
Theorem: A BDN where all the nodes are LI-BDNs will not deadlock
No-Extraneous Dependency (NED) property
SSM: some inputs are combinationally connected to out
BDN: production of outQ waits only for these input FIFOs
Self-Cleaning (SC) property
If the BDN has enqueued all its outputs, it will dequeue all its inputs
Modular refinement - revisited
SSM1: module to be refined
SSM2: rest of the design
LI-BDN1 (implementing SSM1): refined manually
LI-BDN2: automatically generated
Writing an LI-BDN wrapper for an SSM
Given the SSM:
  oj(t) = fj(ij1(t), ..., ijIj(t), s(t))
  // ij1, ij2, ..., ijIj are the inputs combinationally connected to oj
  s(t+1) = g(i1(t), i2(t), ..., s(t))
LI-BDN:
  // introduce a done flag and a rule for each output
  rule Oj when (!donej)
    donej <= True;
    oj.enq( fj(ij1.first, ..., ijIj.first, s) )
  // introduce the Finish rule
  rule Finish when (done1 && done2 && ...)
    done1 <= False; done2 <= False; ...;
    s <= g(i1.first, i2.first, ..., s);
    i1.deq; i2.deq; ...
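The wrapper recipe can be tried out in software. The following is a minimal Python sketch, not the authors' implementation: the class name LIBDNWrapper, the helper run, and the example SSM (y = a + s, z = b + s, s' = s + b) are all hypothetical, and the output FIFOs are modeled as unbounded, so the not-full check is omitted. It illustrates that each output rule waits only for the inputs it depends on, and that the Finish rule dequeues every input once all outputs have been enqueued.

```python
from collections import deque

class LIBDNWrapper:
    """LI-BDN wrapper around an SSM given as output functions and a next-state function.

    out_fns maps an output name to (deps, f): deps lists the inputs that are
    combinationally connected to that output, and f(*dep_values, s) computes it.
    g(all_inputs_dict, s) computes the next state.
    """
    def __init__(self, input_names, out_fns, g, s0):
        self.inq = {n: deque() for n in input_names}
        self.outq = {j: deque() for j in out_fns}   # unbounded (assumption)
        self.out_fns, self.g, self.s = out_fns, g, s0
        self.done = {j: False for j in out_fns}

    def step(self):
        """Try each rule once, atomically and in order; report whether any fired."""
        fired = False
        for j, (deps, f) in self.out_fns.items():
            # rule Oj: waits only for the inputs oj depends on (NED property)
            if not self.done[j] and all(self.inq[d] for d in deps):
                self.outq[j].append(f(*(self.inq[d][0] for d in deps), self.s))
                self.done[j] = True
                fired = True
        # rule Finish: every output is enqueued, so consume one token per input (SC)
        if all(self.done.values()) and all(self.inq.values()):
            self.s = self.g({n: q[0] for n, q in self.inq.items()}, self.s)
            for q in self.inq.values():
                q.popleft()
            self.done = {j: False for j in self.done}
            fired = True
        return fired

def run(w):
    """Fire rules until none is enabled."""
    while w.step():
        pass

# Hypothetical example SSM: y(t) = a(t) + s(t), z(t) = b(t) + s(t), s(t+1) = s(t) + b(t)
w = LIBDNWrapper(
    ["a", "b"],
    {"y": (["a"], lambda a, s: a + s), "z": (["b"], lambda b, s: b + s)},
    g=lambda ins, s: s + ins["b"],
    s0=0,
)
w.inq["a"].extend([1, 2, 3])      # only input a has arrived so far
run(w)
early_y, early_z = list(w.outq["y"]), list(w.outq["z"])
w.inq["b"].extend([10, 20, 30])   # input b arrives later
run(w)
final_y, final_z = list(w.outq["y"]), list(w.outq["z"])
```

With only a present, y for the first cycle is already produced while z waits for b; once b arrives, the output histories are the ones the SSM would produce cycle by cycle, regardless of when the tokens showed up.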
Wrapper circuit
[Figure: wrapper circuit around a Patient SSM. Signal labels: input FIFOs Ii (first, not-empty, deq), output FIFOs Oj (value, enq, not-full), donej, Depends-on(Oj), All dones, enable, and a 1/0 multiplexer]
Patient SSM
[Figure: two block diagrams of combinational logic with Inputs and Outputs; one has an additional Enable signal]
Example
3-port and 1-port Register Files
interface RegisterFile3Ports
  method Value rd0(Addr a);
  method Value rd1(Addr a);
  method Action wr(Addr a, Value x);
endinterface
[Figure: rf with inputs ra0, ra1, wen, wa, wd and outputs rd0, rd1]

interface RegisterFile1port
  method ActionValue#(Value) access(Req r);
endinterface
// Response to write access is unconstrained
typedef union tagged {
  W struct{a:Addr, v:Value};
  R struct{a:Addr};
} Req;
[Figure: rf with inputs en, R/W, a, d and output out]
LI-BDN for a 3-port register file
rule RD0 when (!rd0Done)
  rd0.enq(rf.rd0(ra0.first)); rd0Done <= True;
rule RD1 when (!rd1Done)
  rd1.enq(rf.rd1(ra1.first)); rd1Done <= True;
rule finish when (rd0Done && rd1Done)
  ra0.deq; ra1.deq; wen.deq; wa.deq; wd.deq;
  if (wen.first) rf.wr(wa.first, wd.first);
  rd0Done <= False; rd1Done <= False;
[Figure: rf with done flags rd0Done and rd1Done, input FIFOs ra0, ra1, wen, wa, wd and output FIFOs rd0, rd1]
Refinement into a one-ported register file LI-BDN
[Figure: input FIFOs ra0, ra1, wen, wa, wd and output FIFOs rd0, rd1, with done flags rd0Done and rd1Done, around an rf that uses 1 port (en, R/W, a, d, out)]
rule RD0 when (!rd0Done)
  let x <- rf.access(R {a: ra0.first}); rd0.enq(x); rd0Done <= True;
rule RD1 when (!rd1Done)
  let x <- rf.access(R {a: ra1.first}); rd1.enq(x); rd1Done <= True;
rule finish when (rd0Done && rd1Done)
  ra0.deq; ra1.deq; wen.deq; wa.deq; wd.deq;
  if (wen.first) rf.access(W {a: wa.first, v: wd.first});
  rd0Done <= False; rd1Done <= False;
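To see why this refinement preserves the timing contract, the following Python sketch compares a cycle-by-cycle 3-port reference with a 1-port version that fires one rule per host cycle. It is only an illustration under assumptions: the names ssm_3port, OnePortRF, and libdn_1port are hypothetical, the five input FIFOs are collapsed into one queue of tuples, the output FIFOs are unbounded lists, and the finish rule is assumed to occupy a host cycle even when there is no write.

```python
from collections import deque

def ssm_3port(rf, cycles):
    """Reference SSM: two reads and an optional write per model cycle."""
    rf = dict(rf)
    out = []
    for ra0, ra1, wen, wa, wd in cycles:
        out.append((rf[ra0], rf[ra1]))   # reads see the contents before the write
        if wen:
            rf[wa] = wd
    return out

class OnePortRF:
    """Hypothetical 1-port memory: a single access method for reads and writes."""
    def __init__(self, rf):
        self.rf = dict(rf)

    def access(self, req):
        if req[0] == "W":
            self.rf[req[1]] = req[2]
            return None                  # response to a write is unconstrained
        return self.rf[req[1]]

def libdn_1port(rf, cycles):
    """Fire one rule (RD0, RD1, or finish) per host cycle, as the single port allows."""
    mem = OnePortRF(rf)
    inq = deque(cycles)                  # simplification: one queue of input tuples
    rd0, rd1 = [], []
    rd0_done = rd1_done = False
    host_cycles = 0
    while inq:
        ra0, ra1, wen, wa, wd = inq[0]
        host_cycles += 1
        if not rd0_done:                 # rule RD0
            rd0.append(mem.access(("R", ra0)))
            rd0_done = True
        elif not rd1_done:               # rule RD1
            rd1.append(mem.access(("R", ra1)))
            rd1_done = True
        else:                            # rule finish
            if wen:
                mem.access(("W", wa, wd))
            inq.popleft()
            rd0_done = rd1_done = False
    return list(zip(rd0, rd1)), host_cycles

rf0 = {0: 5, 1: 7, 2: 9}
cycles = [(0, 1, True, 0, 42), (0, 2, False, 0, 0), (1, 0, True, 2, 8), (2, 2, False, 0, 0)]
expected = ssm_3port(rf0, cycles)
got, host_cycles = libdn_1port(rf0, cycles)
```

In this sketch the 1-port version spends three host cycles per model cycle, yet the sequence of values enqueued on rd0 and rd1 is identical to the 3-port reference, which is the sense in which the model stays cycle-accurate.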
Pipelining combinational circuits
[Figure: a combinational circuit with functions f1, f2, f3 over signals a, b, c, d, e, drawn once with blocks S1, S2, S3 and once with blocks R1, R2, R3]
Can potentially reduce the critical path of the entire circuit
Optimizing an LI-BDN mux
[Figure: a mux with ports a, b, c, d, shown before and after optimization]
Does not wait for don’t-care inputs
Counters used to keep track of how many inputs to drop
Can potentially increase the throughput
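One way to read the optimization is sketched below in Python. This is an assumption-laden illustration rather than the slide's circuit: the port names sel, in0, in1, and out, the class OptimizedMux, and the rule ordering are hypothetical, and FIFOs are unbounded. The output rule needs only the select token and the selected data token; the unselected input is a don't-care, so a counter records that one of its tokens must be dropped whenever it eventually arrives.

```python
from collections import deque

class OptimizedMux:
    """LI-BDN mux sketch: out(t) = in1(t) if sel(t) else in0(t)."""
    def __init__(self):
        self.sel, self.in0, self.in1, self.out = deque(), deque(), deque(), deque()
        self.drop = [0, 0]   # don't-care tokens still owed by in0 / in1

    def step(self):
        """Try each rule once; report whether any fired."""
        fired = False
        data = (self.in0, self.in1)
        # drop rules: discard don't-care tokens once they arrive
        for k in (0, 1):
            if self.drop[k] and data[k]:
                data[k].popleft()
                self.drop[k] -= 1
                fired = True
        # output rule: needs sel and only the selected input, and only once
        # that input has no stale tokens ahead of the current one
        if self.sel:
            k = self.sel[0]
            if self.drop[k] == 0 and data[k]:
                self.out.append(data[k].popleft())
                self.sel.popleft()
                self.drop[1 - k] += 1   # the other input's token is a don't-care
                fired = True
        return fired

def run(m):
    while m.step():
        pass

mux = OptimizedMux()
mux.sel.extend([0, 0, 1])
mux.in0.extend([10, 11, 12])
run(mux)                          # in1 has not arrived yet
early_out = list(mux.out)
mux.in1.extend([20, 21, 22])      # late don't-care tokens, then the one that matters
run(mux)
final_out = list(mux.out)
```

Here the first two outputs are produced before any in1 token exists, which is where the potential throughput gain comes from; the counters ensure the late tokens 20 and 21 are discarded so that 22 is matched with the third select.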
Summary
Latency Insensitive BDNs allow true modular refinement of a system, where even the timing contract of a module can be changed without affecting the rest of the system
A Design Flow issue
We can apply the technique discussed to refine this design
But where does this design come from in the first place?
Verilog? Verilog Compiler Output? Bluespec?
[Figure: processor pipeline with stages Fetch1, Fetch2, Branch Pred, Crack, Decode, RegFile, AddrCalc/BranchResolve, Mem1, and Mem2/ALU/ExceptionHandler; paths labeled Branch Prediction, Branch Resolution, Exception, and Register Write. Annotations: pipelined multiplier; multicycle divider; register file implemented as a BRAM]
Design Flow Issues
Generation of appropriate RTL is the major problem
RTL / Specifications should be written in such a way that they are amenable to refinements
Latency Insensitive Design Methodology
The PowerPC Project
Cycle-accurate modeling of PowerPC on FPGAs
PPC In-order Pipeline
The designer specifies the FSM for each stage
The FIFOs are latency-insensitive, that is, the correctness of the specification does not depend upon the depth of FIFOs or the number of stages
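[Figure: PPC in-order pipeline: PC, Fetch, BrPred, Crack, Decode, RegRd, AddrCalc/BrRes, Mem1, Mem2/ALU/Excep, RegWr; I$/ITlb1 and I$/ITlb2 connected to Mem on the fetch side, D$/DTlb1 and D$/DTlb2 connected to Mem on the data side; signals labeled stall, bypass, and epochs]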
The steps in Cycle-accurate implementation on FPGAs
The specs are turned into Bluespec code to give a target SSM
Once the size of FIFOs is fixed the whole design has a precise timing specification
If the FPGA implementation requires refining some stages then cuts are made in the design to isolate the stages (SSMs) to be refined
Each SSM is turned into a BDN by introducing FIFOs for each input and output wire, including the wires going in and out of model FIFOs of the SSM
This converts the nth time cycle of the SSM into the nth enqueue into input FIFOs and nth dequeue from output FIFOs
Atomic rules for the operation of each BDN are defined so that no extraneous dependencies are introduced
This also ensures deadlock-free operation
Can be mechanized
Initial results using XUPV5 FPGA
Direct prototype:
  Developed mainly at IBM (Kattamuri Ekanadham, Jessica Tseng)
  92% LUT resources
  20 MHz, can possibly be increased to 40 MHz
  Boots Linux
Cycle-accurate model using BDN theory:
  Developed mainly at MIT (Asif Khan, Yuan Tang), re-using a lot of components
  24% LUT resources
  125 MHz
  Boots Linux
Detailed Preliminary Results
Asif Khan & Murali Vijayaraghavan (June 2009)
Cycle-accurate refinements onto Xilinx XUPV5
Slice Logic Utilization:
  Number of Slice Registers: 15448 out of 69120 (22%)
  Number of Slice LUTs: 16702 out of 69120 (24%)
Specific Feature Utilization:
  Number of Block RAM/FIFO: 1 out of 148 (0%) (only 1 BRAM for the register file)
  Number of DSP48Es: 12 out of 64 (18%) (these are used for the divider)
Minimum period: 7.988 ns (Maximum Frequency: 125.188 MHz)
Partially verified by running a 50-instruction program
Compared to Jessica's port onto Xilinx XUPV5: takes up 92% of the area; 20 MHz, possibly 40 MHz
No numbers yet for actual work done
Conclusion
Cycle-accurate modeling of processors on FPGAs is feasible and offers a three-orders-of-magnitude improvement in performance over software simulators
BDNs offer a way to refine RTL without losing cycle-accuracy
Bluespec makes quick RTL generation feasible
The generation of BDNs can be automated
We plan to release our Bluespec designs under open source licensing to strengthen the PowerPC ecosystem.