VLSI Programming 2016: Lecture 5

03/05/16 1

Course: 2IMN35

Teachers: Kees van Berkel, Rudolf Mak

Lab: Kees van Berkel, Rudolf Mak, Alok Lele

www: http://www.win.tue.nl/~wsinmak/Education/2IMN35/

Lecture 5: folding

VLSI Programming (2IMN35): time table 2016

Columns: in | Tue: h5-h8; MF.07 | out | in | Thu: h1-h4; Gemini-Z3A-08/10/13 | out

19-Apr: introduction, DSP graphs, bounds, …
21-Apr: pipelining, retiming, transposition, J-slow, unfolding; T1+T2
26-Apr: tools installed; Introductions to FPGA and Verilog; L1: audio filter simulation; L1, L2
28-Apr: T1+T2; unfolding, look-ahead, strength reduction; L1 continued; T3+T4
3-May: folding; L2: audio filter on XUP board
5-May: (no entry)
10-May: T3+T4; DSP processors; L2 continued; L3
12-May: L3: sequential FIR + strength-reduced FIR
17-May: L3 continued
19-May: L3 continued; L4
24-May: systolic computation; T5
26-May: L3; L4
31-May: T5; L4: audio sample rate converter
2-Jun: L4 continued; L5
7-Jun: L5: 1024x audio sample rate converter
9-Jun: L4; L5 continued
14-Jun: (no entry)
16-Jun: L5; deadline report L5

Outline Lecture 5

•  Folding Transformation

Mandatory reading (reminder):

•  Edward A. Lee and David G. Messerschmitt. Synchronous Data Flow. Proc. of the IEEE, Vol. 75, No. 9, Sept 1987, pp 1235-1245.

FOLDING (TRANSFORMATION)

Folding versus unfolding

• Unfolding (a.k.a. block processing, MIMO processing): increase throughput (#samples per time unit), at the expense of extra hardware resources, by processing blocks of L samples.

• Folding: reduce hardware resources (#multipliers, #adders), at the expense of reduced throughput, by mapping multiple operations onto the same hardware resource.

• Folding is not the same as going from MIMO back to SISO.

• Typically: given the required throughput (latency) of a function (application), apply folding or unfolding (along with other transformations) to achieve the required throughput at minimal hardware cost.

Folding transformation [Parhi]

Goal: N× lower throughput, N× fewer resources (best case)

Given a data-flow graph, folding amounts to:

1.  choose a folding factor N

2.  choose folding sets S (resource binding + order): ordered sets of ≤N data-flow operations that share the same functional unit (hardware)

3.  calculate the folded delay D_F for each data-flow edge

4.  folding is successful if all D_F ≥ 0, i.e. production precedes consumption

When D_F < 0 for some edge: try step 2 again. Step 2 is a creative step; it may involve some trial and error.
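
The four steps above can be sketched in a few lines of Python. This is only an illustrative sketch: the graph, its node names (A1, A2, M1, M2), the edge delays and the two candidate orderings are hypothetical and are not the example from the lecture. The folded delay is computed per edge as N times the edge delays, minus the latency of the producing unit, plus the folding order of the consumer, minus the folding order of the producer.

```python
from typing import Dict, List, Tuple

# (source node U, destination node V, number of delays w_e on the edge)
Edge = Tuple[str, str, int]


def folded_delays(n: int, edges: List[Edge],
                  latency: Dict[str, int],
                  order: Dict[str, int]) -> List[int]:
    """Step 3: D_F(U -> V) = N*w_e - P_u + v - u, one value per edge."""
    return [n * w - latency[u] + order[v] - order[u] for u, v, w in edges]


def folding_succeeds(delays: List[int]) -> bool:
    """Step 4: every folded delay must be non-negative."""
    return all(d >= 0 for d in delays)


# Hypothetical graph: adders A1, A2 (latency 1), multipliers M1, M2 (latency 2).
N = 2  # step 1: folding factor
EDGES: List[Edge] = [("A1", "A2", 1), ("A1", "M1", 1), ("M1", "A2", 1),
                     ("A2", "M2", 1), ("M2", "A1", 2)]
LATENCY = {"A1": 1, "A2": 1, "M1": 2, "M2": 2}

# Step 2, first attempt: folding sets (A2, A1) and (M2, M1).
first_try = {"A2": 0, "A1": 1, "M2": 0, "M1": 1}
# Step 2, second attempt: folding sets (A1, A2) and (M1, M2).
second_try = {"A1": 0, "A2": 1, "M1": 0, "M2": 1}

first_delays = folded_delays(N, EDGES, LATENCY, first_try)
second_delays = folded_delays(N, EDGES, LATENCY, second_try)

print(first_delays, folding_succeeds(first_delays))    # [0, 1, -1, 1, 3] False
print(second_delays, folding_succeeds(second_delays))  # [2, 1, 1, 1, 1] True
```

In this made-up graph the first ordering gives a negative folded delay on the edge M1 → A2, so step 2 is repeated with a different order, which succeeds.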

•  S1 = adder, latency=1; S2 = multiplier, latency=2

•  smart order: add4 before add2; add3 before add1

•  add2 before add3: saves register

•  One iteration now takes 4 cycles (= max(|S1|,|S2|)), initiating one addition and one multiplication in each cycle.
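
The relation between folding sets, folding orders and cycles per iteration can be made concrete with a small Python sketch. The adder order below follows the bullets above (add4, add2, add3, add1). The multiplier names, their number and their order are assumptions made only for illustration, since they are not given in the text.

```python
from typing import Dict, List


def folding_orders(folding_sets: Dict[str, List[str]]) -> Dict[str, int]:
    """Folding order of an operation = its position within its folding set."""
    return {op: pos
            for ops in folding_sets.values()
            for pos, op in enumerate(ops)}


def cycles_per_iteration(folding_sets: Dict[str, List[str]]) -> int:
    """One iteration takes as many cycles as the largest folding set has entries."""
    return max(len(ops) for ops in folding_sets.values())


folding_sets = {
    # Order taken from the bullets: add4 before add2 before add3 before add1.
    "S1_adder": ["add4", "add2", "add3", "add1"],
    # Hypothetical: multiplier names and order are not given in the text.
    "S2_multiplier": ["mul1", "mul2", "mul3", "mul4"],
}

orders = folding_orders(folding_sets)
n = cycles_per_iteration(folding_sets)

print(n)  # 4
print([orders[a] for a in ("add4", "add2", "add3", "add1")])  # [0, 1, 2, 3]
```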

$D_F(U \to V) = N\,w_e - P_u + v - u$


Folding transformation

Folding equation, one per edge of the original graph:

$D_F(U \to V) = N\,w_e - P_u + v - u$

where

•  N = folding factor

•  U and V are two nodes in the graph connected by edge e

•  w_e = #delays on edge e

•  P_u = latency (number of pipeline stages) of the functional unit that executes U (e.g. an adder)

•  u and v are the folding orders of nodes U and V: the phase in [0..N-1] in which each operation is scheduled
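
As a quick check of the folding equation, consider a single hypothetical edge. The numbers are illustrative assumptions, not values from the lecture's example graph: folding factor N = 4, one delay on the edge (w_e = 1), a producing unit with latency P_u = 1, and folding orders u = 3 and v = 1.

```latex
\[
D_F(U \to V) = N\,w_e - P_u + v - u = 4 \cdot 1 - 1 + 1 - 3 = 1
\]
```

The result is non-negative, so this edge can be folded: the value produced by the unit executing U is held in one register before the unit executing V consumes it.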

2IMN35: reporting guidelines 2016 (1)

1.  Submit one report per team (2 students)

2.  Respect deadlines:
    • Assignment L3: Thursday May 26, 2016
    • Assignment L4: Thursday June 9, 2016
    • Assignment L5: Thursday June 16, 2016

3.  Make sure that assignments L3, L4, and L5 are demonstrated to and signed off by Alok, Rudolf, or Kees.

4.  Report on lab assignments L3, L4, and L5.

5.  Submit the reports using Peach (paper copies will not be accepted).

General guidelines (each assignment), to be followed strictly:

6.  Analyze the specifications and requirements.

7.  Present/motivate key ideas/decisions, design options, alternatives, trade-offs.

8.  Draw architecture block diagram (= picture!).

9.  Explain functional correctness of your Verilog programs (include your complete Verilog programs in an appendix).

10. Explain #clock cycles per sample time Ts. Include waveforms.

11. Report, analyze & explain FPGA-resource usage and utilization {#multipliers, #BRAMs, #LUTs} in relation to your design.

12. Report, analyze & explain (min) sample time Ts and (max) sample frequency fs, both after synthesis and after placement & routing.


13. Include simulation results: waveforms both in the time domain and in the frequency domain (apply FFT) (assignments 3 and 4 only).

14. Include answers to the inline questions.

15. Annotate all graphs to include, for both axes:
    - quantity (weight, distance, duration, …)
    - unit (ounce, light year, century, …)
    - linear/log/... (ok to assume linear)

THANK YOU
