CDA 4253 FPGA System Design RTL Design Methodology 1 Hao Zheng Comp Sci & Eng USF.

CDA 4253 FPGA System DesignRTL Design Methodology

Hao ZhengComp Sci & Eng

Structure of a Typical Digital Design

Datapath(Execution

Controller(Control

Data Inputs

Data Outputs

Control Inputs

Control Outputs

Control Signals

StatusSignals

Hardware Design with RTL VHDL

Pseudocode

Datapath Controller

Blockdiagram

State diagramor ASM chart

VHDL code VHDL code VHDL code

Interface

Hardware Design with RTL VHDL

Steps of the Design Process1. Text description2. Interface3. Pseudocode4. Block diagram of the Datapath5. Interface divided into Datapath and Controller 6. FSM of the Controller7. RTL VHDL code of the Datapath, Controller, and Top-

Level Unit8. Testbench for the Datapath, Controller, and Top-Level

Unit9. Functional simulation and debugging10. Synthesis and post-synthesis simulation11. Implementation and timing simulation12. Experimental testing using FPGA board

Min_Max_Average

Pseudocode Input: M[i]Outputs: max, min, average

max = 0min = MAX // the maximal constantsum = 0for i=0 to 31 do d = M[i]; sum = sum + d if (d < min) then min = d endif if (d > max) then max = d endifendforaverage = sum/32

Data M[i] are stored in memory.

Results are stored in the internal registers.

Circuit Interface

in_data

in_addr

out_data

out_addrMIN_MAX_AVR

Interface Table

Port Width Meaning

clk 1 System clock

reset 1 System reset – clears internal registers

in_data n Input data bus

in_addr 5 Address of the internal memory where input data is stored

write 1 Synchronous write control signal

START 1 Starts the computations

DONE 1 Asserted when all results are ready

out_data n Output data bus used to read results

out_addr 2 01 – reading minimum10 – reading maximum11 – reading average

Datapath

Input: M[i]Outputs: max, min, average

max = 0min = MAXsum = 0for i=0 to 31 do d = M[i]; sum = sum + d if (d < min) then min = d endif if (d > max) then max = d endifendforaverage = sum/32

Datapath

average

min max

d min d max

min max

mux mux

State Diagram for Controller

start / rst=1

i==32 / done=1

i < 32 / i++

start’/

done=0

Sorting

Beforesorting

During Sorting Aftersorting

3 3 2 2 1 1 1 12 2 3 3 3 3 2 24 4 4 4 4 4 4 31 1 1 1 2 2 3 4

i=0 i=0 i=0 i=1 i=1 i=2j=1 j=2 j=3 j=2 j=3 j=3

Legend: position of memory indexed by i

position of memory indexed by j

Sorting - Example

Pseudocode

for i=0 to k-2 doA = M[i]for j=i+1 to k-1 do

B = M[j]if A > B then

M[i] = B

M[j] = A

A = Bend if

end forend for

Sorting - Interface

resetdin

startMemory

Datapathfor i=0 to k-2 do

A = M[i]for j=i+1 to k-1 do

M[i] = B

M[j] = A

A = Bend if

end forend for

• Registers to hold A, B, and memory addresses i and j.

• Incrementor.• Comparator.

M[i] = B

M[j] = A

A = Bend if

end forend for

counteri

enable

en_regj

M[i] = B

M[j] = A

A = Bend if

end forend for

regA regBenB

State Diagram for the Controller

M[i] = B

M[j] = A

A = Bend if

end forend for

• Nested loops by two FSMs: one for the outer loop controls the one for the inner loop.

• Reuse the FSM for the single for loop in the previous example.

State Diagram for the Controller

M[i] = B

M[j] = A

A = Bend if

end forend for

start / rst=1

j<k-1 / …

i<k-2 / A <= M[i]j <= i+1

start’/

done=0

outeri==k-2 /done=1end

j==k-1 / …

Optimization for Performance

Performance Definitions

• Throughput: the number of inputs processed per unit time.

• Latency: the amount of time for an input to be processed.

• Maximizing throughput and minimizing latency in conflict.• Both require timing optimization:

- Reduce delay of the critical path

Achieving High Throughput: Pipelining

• Divide data processing into stages• Process different data inputs in different stages

simultaneously.

xpower = 1;for (i = 0; i < 3; i++) xpower = x * xpower;

process (clk) begin if rising_edge(clk) then if start=‘1’ then

cnt <= 3; end if; if cnt > 0 then

cnt <= cnt – 1;xpower <= xpower * x;

elsif cnt = 0 then done <= ‘1’; end if;end process;

Throughput: 1 data / 3 cycles = 0.33 data / cycle .Latency: 3 cycles.Critical path delay: 1 multiplier delay

xpower = 1;for (i = 0; i < 3; i++) xpower = x * xpower;

process (clk, rst) begin if rising_edge(clk) then if start=‘1’ then -- stage 1

x1 <= x;xpower1 <= x;done1 <= start;

end if; -- stage 2 x2 <= x1; xpower2 <= xpower1 * x1; done2 <= done1; -- stage 3 xpower <= xpower2 * x2; done <= done2; end if;end process;

Throughput: 1 data / cycleLatency: 3 cycles + register delays.Critical path delay: 1 multiplier delay

simultaneously.

din dout

simultaneously.

din dout…stage 1 stage 2 stage n

Penalty: increase in area as logic needs to be duplicated for different stages

registers

Reducing Latency

• Closely related to reducing critical path delay.• Reducing pipeline registers reduces latency.

registers

Reducing Latency

• Closely related to reducing critical path delay.• Reducing pipeline registers reduces latency.

Timing Optimization

• Maximal clock frequency determined by the longest path delay in any combinational logic blocks.

• Pipelining is one approach.

registers

din dout

Timing Optimization: Spatial Computing

• Extract independent operations• Execute independent operations in parallel.

X = A + B + C + D

process (clk, rst) begin if rising_edge(clk) then X1 := A + B; X2 := X1 + C; X <= X2 + D; end if;end process;

process (clk, rst) begin if rising_edge(clk) then X1 <= A + B; X2 <= C + D; X <= X1 + X2; end if;end process;

Critical path delay: 3 adders Critical path delay: 2 adders

Timing Optimization: Avoid Unwanted Priority

process (clk, rst) begin if rising_edge(clk) then if c0=‘1’ then r0 <= din; elsif c1=‘1’ then r1 <= din; elsif c2=‘1’ then r2 <= din; elsif c3=‘1’ then r3 <= din; end if; end if;end process;

Critical path delay: 4-input AND gate + 4x1 MUX.

Timing Optimization: Avoid Unwanted Priority

Critical path delay: 2x1 MUX

process (clk, rst) begin if rising_edge(clk) then if c0=‘1’ then r0 <= din; end if; if c1=‘1’ then r1 <= din; end if; if c2=‘1’ then r2 <= din; end if; if c3=‘1’ then r3 <= din; end if; end if;end process;

Timing Optimization: Register Balancing• Maximal clock frequency determined by the longest path

delay in any combinational logic blocks.

din block 1 block 2 dout

Timing Optimization: Register Balancing

process (clk, rst) begin if rising_edge(clk) then rA <= A; rB <= B; rC <= C; sum <= rA + rB + rC; end if;end process;

process (clk, rst) begin if rising_edge(clk) then sumAB <= A + B; rC <= C; sum <= sumAB + rC; end if;end process;

Optimization for Area

Area Optimization: Resource Sharing• Rolling up pipleline: share common resources at different

time – a form of temporal computing

dindout

Block including all all logic in stage 1 to n.

Area Optimization: Resource Sharing

• Use registers to hold inputs• Develop FSM to select which inputs to process in each

cycle.X = A + B + C + D

• Use registers to hold inputs• Develop FSM to select which inputs to process in each

cycle.X = A + B + C + D

A, B, C, D need to hold steady until X is processed

control

Merge duplicate components

together

Merge duplicate components

together

Impact of Reset on Area

Resetting Block RAM

• Block RAM only supports synchronous reset.• Suppose that Mem is 256x16b RAM. • Implementations of Mem with synchronous and

asynchronous reset on Xilinx Virtex-4.

Optimization for Power

Power Reduction Techniques

• In general, FPGAs are power hungry. • Power consumption is determined by

where V is voltage, C is load capacitance, and f is switching frequency• In FPGAs, V is usually fixed, C depends on the number of

switching gates and length of wires connecting all gates.• To reduce power,

• turn off gates not actively used,• have multiple clock domains,• reduce f.

Dual-EdgeTriggered FFs

• A design that is active on both clock edges can reduce clock frequency by 50%.

din doutstage 1 stage 2 stage nstage 4

Example 1

Example 2 positively triggerednegatively triggered

CDA 4253 FPGA System Design RTL Design Methodology 1 Hao Zheng Comp Sci & Eng USF.

Documents

Transcript of CDA 4253 FPGA System Design RTL Design Methodology 1 Hao Zheng Comp Sci & Eng USF.

USF SculptureWalk

Summary of USF ICC Reform - Intro USF

TV = “Super Media” - RTL Group...Germany RTL Interactive launches Gamechannel.de Netherlands RTL Nederland launches Radio 53Z8 Luxembourg RTL Group prepares for this year’s Télévie

Ersatz-RTL-Thermostat-Kopf - imi-hydronic.com Thermostatic Control ENGINEERING ADVANTAGE Tartalék RTL termosztátfej Rezervna RTL-Termostat glava H HR GR MielŒtt a tartalék RTL

Tcvn 4253 2012 907187

CDA 4253 FPGA System Design Sequential Circuit Building Blocks Hao Zheng Dept of Comp Sci & Eng USF.

Buy RTL-SDR Dongles (RTL2832U) - rtl-sdr.pdf

Primera Línea 4253 03 09 14

RTL - Home

RTL Report

rtl- " . L

WordPress RTL

IS - 4253-2004

RTL - Exterity

APREEMINENT USF SYSTEM RESEARCH · 2018-10-05 · NEW STUDENT PROFILE (FTIC) USF System USF USF Tampa St. Petersburg USF Sarasota-FIRST TIME IN COLLEGE (FTIC) ENROLLMENT Manatee *Does

GCC internals intro y optimizationesnicolasw/Docencia/CP/gcc_int_intro.pdf · Optimizaciones Back-end Backend: – RTL tree + RTL language + RTL Engine + RTL Compiler [Middle-End]

Antimicoticos usf

CDA 4253 FPGA System Design Hao Zheng Dept of Comp Sci & Eng USF.

Spoelbakken & toebehoren - Linum catalogusdelen...RTL-0101* RTL-001 360 358 160 84 1”1/2 40 17 15,0 3 RTL-0102 RTL-002 380 378 167 84 1”1/2 45 16 16,5 5 RTL-0103 RTL-003 420 418

Hao Zheng Comp Sci & Eng USF CDA 4253 FPGA System Design Chapter 5 Finite State Machines.