FPGA BASED REAL TIME VIDEO PROCESSING

Final B presentation

Duration: one year

Presented by: Roman Kofman, Sergey Kleyman

Supervisor: Mike Sumszyk

AGENDA

Project Objectives

Algorithm – review

Data flow

Part B motivation

Implementation

Upscaling

Insights and conclusions

Project Objectives

Improve video quality

Study the non-linear diffusion algorithm

Adjust the algorithm to real-time demands

Implement on a ProcstarII board with Altera’s FPGA

Display real time results





Algorithm - review

Based on the 2D non-linear Diffusion equation

Iterative solution

Good filtering: smooths noise

Keeps borders intact

$$\partial_t I(x,y,t) = \operatorname{div}\left( g\left(|\nabla I|^2\right) \nabla I(x,y,t) \right)$$

Matlab simulation

[Figure: original image and filtered image after non-linear diffusion filtering, iteration nb: 3]

Good filtering

Keeps borders intact

Thomas algorithm

Inverts a tridiagonal matrix (solves the tridiagonal system)

Makes real-time implementation hard

Thomas algorithm

Consists of three loops, one of which runs in reverse

β, m and y are vectors of length N (the frame resolution) and must be stored in memory, since the reversed loop reads them backwards

The original Thomas algorithm requires memory of at least four times the frame size.

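% Loop 1 (forward): LU factorisation of the tridiagonal matrix
m(1) = alpha(1);   % initial pivot (not shown on the slide)
for i = 1:N-1
    l(i) = gamma(i)/m(i);
    m(i+1) = alpha(i+1) - l(i)*beta(i);
end

% Loop 2 (forward): forward substitution
y = u;             % preallocate y with the size of u
y(1) = d(1);
for i = 2:N
    y(i) = d(i) - l(i-1)*y(i-1);
end

% Loop 3 (reversed): back substitution
u(N) = y(N)/m(N);
for i = N-1:-1:1
    u(i) = (y(i) - beta(i)*u(i+1))/m(i);
end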

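$$
\begin{pmatrix}
\alpha(1) & \beta(1) & 0 & \cdots & 0 \\
\gamma(1) & \alpha(2) & \beta(2) & \ddots & \vdots \\
0 & \gamma(2) & \alpha(3) & \ddots & 0 \\
\vdots & \ddots & \ddots & \ddots & \beta(N-1) \\
0 & \cdots & 0 & \gamma(N-1) & \alpha(N)
\end{pmatrix}
$$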

Memory efficient implementation

The inverse of a block diagonal matrix is another block diagonal matrix, composed of the inverse of each block

We solve and flip each row separately, so the internal memory is sufficient

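Requires selective treatment of borders

$$A = \operatorname{diag}(A_1, A_2, \dots)$$

where $A_i$ is a matrix of

$$
A_i = \begin{pmatrix}
\alpha_{i,1} & \beta_{i,1} & & 0 \\
\gamma_{i,1} & \alpha_{i,2} & \ddots & \\
 & \ddots & \ddots & \beta_{i,N-1} \\
0 & & \gamma_{i,N-1} & \alpha_{i,N}
\end{pmatrix}
$$

and N is the number of pixels in a row (size of a column), i is the row number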
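The row-by-row idea can be sketched in Python (used here only for illustration; the slides use Matlab). This is a minimal sketch, assuming each row has its own tridiagonal block with main diagonal alpha, upper diagonal beta and lower diagonal gamma. The names `thomas_solve` and `solve_rows` are hypothetical and not taken from the project.

```python
def thomas_solve(alpha, beta, gamma, d):
    """Solve one tridiagonal system for a single row.

    alpha: main diagonal (length N)
    beta:  upper diagonal (length N-1)
    gamma: lower diagonal (length N-1)
    d:     right-hand side (length N)
    """
    n = len(alpha)
    m = [0.0] * n
    l = [0.0] * (n - 1)
    y = [0.0] * n
    u = [0.0] * n

    # Loop 1 (forward): factorise the tridiagonal block.
    m[0] = alpha[0]
    for i in range(n - 1):
        l[i] = gamma[i] / m[i]
        m[i + 1] = alpha[i + 1] - l[i] * beta[i]

    # Loop 2 (forward): forward substitution.
    y[0] = d[0]
    for i in range(1, n):
        y[i] = d[i] - l[i - 1] * y[i - 1]

    # Loop 3 (reversed): back substitution, reads m, beta and y backwards.
    u[n - 1] = y[n - 1] / m[n - 1]
    for i in range(n - 2, -1, -1):
        u[i] = (y[i] - beta[i] * u[i + 1]) / m[i]
    return u


def solve_rows(rhs_rows, coeff_rows):
    """Solve every row independently; only one row is held at a time.

    coeff_rows[i] is the (alpha, beta, gamma) tuple of row i (assumed layout).
    """
    return [thomas_solve(a, b, g, d)
            for (a, b, g), d in zip(coeff_rows, rhs_rows)]
```

Because each block is solved on its own, the working vectors only need the length of one row rather than the length of the whole frame.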

Error criteria

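We use the relative Root Mean Square Error as the error criterion. The squared relative RMSE is defined by:

$$\text{Relative RMSE}^2 = \frac{\frac{1}{\text{Number of pixels}} \sum_{\text{All pixels}} (X_A - X_R)^2}{\frac{1}{\text{Number of pixels}} \sum_{\text{All pixels}} X_R^2}$$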

Where:

$X_A$ is the filtered image

$X_R$ is the reference image
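A minimal Python sketch of this error criterion follows. It assumes the squared error is normalised by the energy of the reference image, which is one common definition of relative RMSE; the function name `relative_rmse` and the flat pixel lists are illustrative assumptions.

```python
import math


def relative_rmse(filtered, reference):
    """Relative RMSE between two images given as flat lists of pixels.

    The 1/(number of pixels) factors of numerator and denominator cancel,
    so they are omitted here.
    """
    error_energy = sum((a - r) ** 2 for a, r in zip(filtered, reference))
    reference_energy = sum(r ** 2 for r in reference)
    return math.sqrt(error_energy / reference_energy)
```

For example, comparing a fixed-point result against a full-precision result would pass the fixed-point pixels as `filtered` and the full-precision pixels as `reference`.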

Fixed point considerations

Matlab simulations showed that the calculations require 16 bits.

[Figure: original image, full-precision result and fixed-point result; dt=5, 4 iterations, RMSE=0.0416]





Data Flow - One Iteration

Two parallel channels

[Diagram: DVI IN and DVI OUT; a Lines stage and a Columns stage, each with PIPE, Thomas 3 and M4K line-reverse buffers (write/read); two transpose blocks T’]

How to implement T’ in real time?





Project B - “divide and conquer”

Divide into two problems: algorithm and memory access

Implement the algorithm on a small frame (our part)

Complete implementation of the transpose using DDR (Neta and Hillel)

Integrate both parts into one full data path

Project B motivation and goals

Implement the algorithm in a scalable manner

Display results for a small frame

Implement the transpose in internal memory

Implement blocks that will create the mini frame at the beginning and generate the full frame at the end

[Diagram: DVI IN, Create mini frame, Algorithm, Generate large frame, DVI OUT]

Mini from full: using internal memory we select a small frame and send it in a burst to the pipe

To fit the internal memory of the FPGA, we chose a 100×100 mini frame

The implementation of the algorithm is scalable to larger frames

Full out of mini: after the algorithm we generate a large frame by zero padding the mini frame
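The two helper blocks can be illustrated with a short Python sketch. It assumes the mini frame is a rectangular window cut from the full frame and later placed back at the same position, with zeros everywhere else; the function names and the `top`/`left` parameters are hypothetical.

```python
def create_mini_frame(full, top, left, size):
    """Select a size x size window from the full frame (list of rows)."""
    return [row[left:left + size] for row in full[top:top + size]]


def generate_full_frame(mini, full_rows, full_cols, top=0, left=0):
    """Zero pad the mini frame back to the full frame dimensions."""
    full = [[0] * full_cols for _ in range(full_rows)]
    for r, row in enumerate(mini):
        for c, pixel in enumerate(row):
            full[top + r][left + c] = pixel
    return full
```

In the project the window is 100×100; the sketch keeps the size as a parameter to mirror the scalable design.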

[Diagram: normal and transpose read of mini frame (MRAM TriPort with control), pipeline stages with Thomas loops 1, 2 and 3, row flips on M4K with control, sync signals passed between blocks, full frame generation]

Data path for mini frame processing

Upscalable to full frame processing





Synchronization signals

As mentioned, we must treat borders differently than normal pixels.

Therefore, throughout the entire pipeline we must distinguish between border and non-border pixels, and whether a border is a start or an end.

To do so, we generate four sync signals that describe every pixel:

Start of line

End of line

Start of column

End of column

The need for these signals arose from the algorithm, but we can now also use them for memory sync and frame generation sync

From the four signals we can easily derive a “start frame” signal, and also an “end frame” signal.
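One possible way to model the four signals and the derived frame signals is sketched below in Python. It assumes "start/end of line" mark the first/last pixel of a row and "start/end of column" mark the first/last row, and that the frame signals are the AND of the matching pair; these definitions and all names are assumptions, not taken from the design.

```python
from collections import namedtuple

Sync = namedtuple(
    "Sync", "start_of_line end_of_line start_of_column end_of_column")


def sync_signals(row, col, n_rows, n_cols):
    """Return the four sync flags for the pixel at (row, col)."""
    return Sync(col == 0, col == n_cols - 1, row == 0, row == n_rows - 1)


def start_frame(sync):
    """First pixel of the frame: start of a line and start of a column."""
    return sync.start_of_line and sync.start_of_column


def end_frame(sync):
    """Last pixel of the frame: end of a line and end of a column."""
    return sync.end_of_line and sync.end_of_column


def is_border(sync):
    """A pixel is a border pixel if any of the four flags is set."""
    return any(sync)
```

Carrying the flags alongside each pixel lets every block decide locally whether border handling applies, without counting pixels itself.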

Transpose in internal memory

Transpose the image during reading

Read address is a sum of two weighted values: row and column pointers

Transpose the image by switching the pointers

Normal read address = Row pointer × weight + Column pointer

Transposed read address = Column pointer × weight + Row pointer
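The pointer switch can be illustrated with a small Python sketch. It assumes a square frame stored row by row, so the weight equals the row length; `read_address` and `read_frame` are hypothetical names.

```python
def read_address(row_ptr, col_ptr, weight, transposed=False):
    """Weighted sum of the two pointers; switching them transposes the read."""
    if transposed:
        row_ptr, col_ptr = col_ptr, row_ptr
    return row_ptr * weight + col_ptr


def read_frame(memory, size, transposed=False):
    """Read a size x size frame from flat memory in raster order."""
    return [memory[read_address(r, c, size, transposed)]
            for r in range(size)
            for c in range(size)]
```

The stored data never moves; only the address generator changes, which is why the transpose costs no extra memory.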

Row flip on internal memory

Use M4K memory to reverse the order of incoming data for an entire row

Implement a scalable design to be used on different row sizes

Use sync signals as inputs and generate them for the next block at the output
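A software model of the row flip is sketched below in Python. It assumes a double-buffered scheme: while one buffer is written with the incoming row, the other is read backwards. Sync signal handling is omitted, and the name `row_flip_stream` is hypothetical.

```python
def row_flip_stream(pixels, row_len):
    """Yield the pixel stream with every complete row reversed.

    Two row buffers emulate the memory: one is written in order while
    the previously filled one is read in reverse order.
    """
    buffers = [[None] * row_len, [None] * row_len]
    write_sel = 0      # index of the buffer currently being written
    count = 0          # position inside the current row
    primed = False     # True once a full row is available for reading

    for pixel in pixels:
        if primed:
            # Read the previous row backwards from the other buffer.
            yield buffers[1 - write_sel][row_len - 1 - count]
        buffers[write_sel][count] = pixel
        count += 1
        if count == row_len:
            count = 0
            write_sel = 1 - write_sel
            primed = True

    # Flush the last complete row once the input has ended.
    if primed:
        for i in range(row_len - 1, -1, -1):
            yield buffers[1 - write_sel][i]
```

The output lags the input by exactly one row, and `row_len` is the only parameter that changes with the row size.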

From Matlab to HDL - Simulink phase

From serial code to a combination of sequential and parallel blocks

“Close to hardware” implementation

Real-time simulation and comparison to the code (non-real-time) results

Simulation in Simulink

Data from DVI: using a repeating sequence block

Emulating row flip: using an enabled system for sync; buffer depth is the number of pixels in a row

Simulink data path

[Diagram: derivatives calculation, building tri-diagonal matrix, Thomas forward loops, double buffered flip (memory emulation), Thomas reversed loop, double buffered flip (memory emulation)]

From Matlab to HDL – Synplify DSP phase

From full-precision Simulink blocks to fixed-point blocks that represent hardware

Pipelining and frequency considerations

Generate HDL code

From Matlab to HDL – Synplify DSP phase

After converting every part of the data path from Simulink blocks to Synplify DSP blocks and synthesizing, not all parts achieved the required frequency (DVI clk ~25MHz)

The critical paths were in the feedback loops of the design; unfortunately, a feedback loop cannot be pipelined

Solution: simplify every loop as much as possible

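Simplifying loops – Thomas 1

Loops with heavy calculations become critical paths

Move multiplier out of loop

Pipeline

$$m_{i+1} = \alpha_{i+1} - \frac{\gamma_i}{m_i}\,\beta_i \quad\Longrightarrow\quad m_{i+1} = \alpha_{i+1} - \frac{\gamma_i \beta_i}{m_i}$$

(the product $\gamma_i \beta_i$ does not depend on $m$, so it can be computed before the loop)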

Simplifying loops

After these changes the maximum frequency is still too low

Solution: replace the Synplify DSP division block with a faster but less accurate implementation, the AEGP division algorithm

Anderson, Earle, Goldschmidt and Powers (AEGP) division algorithm

Iterative algorithm that calculates N/D

Adding iterations increases precision but uses more resources (extra multipliers and adders)

$$\frac{N}{D} \approx x$$

where x is a result of the iterations:

$$f_{i+1} = 2 - t_i, \qquad x_{i+1} = x_i f_{i+1}, \qquad t_{i+1} = t_i f_{i+1}$$

with the start conditions:

$$x_0 = N, \qquad t_0 = D$$

D must be in the range 1 ≤ D < 2 or 0.5 ≤ D < 1
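The iteration can be tried out with a short floating-point Python sketch. It assumes the standard Goldschmidt form (factor 2 − t applied to both the numerator and the denominator), with D already scaled into the allowed range; the function name `aegp_divide` and the default iteration count are illustrative choices, and the fixed-point behaviour of the hardware block is not modelled.

```python
def aegp_divide(n, d, iterations=4):
    """Approximate n / d with multiplications and subtractions only.

    x starts at n and t starts at d; each step multiplies both by
    f = 2 - t, driving t towards 1 and x towards n / d.
    """
    if not 0.5 <= d < 2.0:
        raise ValueError("d must be scaled into the range [0.5, 2)")
    x, t = n, d
    for _ in range(iterations):
        f = 2.0 - t
        x *= f
        t *= f
    return x
```

Each extra iteration costs two more multiplications and one subtraction, which is the precision versus resources trade-off mentioned above.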

Simplifying loops – Thomas 2

The change in Thomas 1 influenced Thomas 2

[Equations: original l versus new l, both expressed through m, with a "divide by" correction]

Simplifying loops – Thomas 3

$\frac{a}{b}$ is slow and part of the loop

$a \cdot \frac{1}{b}$: the multiplier is faster

The $\frac{1}{b}$ division is done before the loop
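As a sketch of this idea, the reversed loop can be written in Python with the reciprocals computed ahead of the loop. It assumes that a/b corresponds to the division by m(i) in the back substitution, which is an interpretation of the slide; the function name `back_substitute` is hypothetical.

```python
def back_substitute(y, beta, m):
    """Reversed Thomas loop with the divisions moved out of the loop.

    The reciprocals 1/m[i] do not depend on u, so they are computed
    first; the loop itself then uses only a multiply and a subtract.
    """
    n = len(y)
    inv_m = [1.0 / value for value in m]   # divisions before the loop
    u = [0.0] * n
    u[n - 1] = y[n - 1] * inv_m[n - 1]
    for i in range(n - 2, -1, -1):
        u[i] = (y[i] - beta[i] * u[i + 1]) * inv_m[i]
    return u
```

In hardware the reciprocal can be pipelined freely because it sits outside the feedback path, while the loop keeps only the fast multiplier.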

Integrating the Synplify DSP projects into one Quartus project

Synplify DSP generates registers and blocks with similar default names

A simple combine in Quartus will not work; combining several blocks in Quartus demands a different approach

Manually changing names: not realistic, given the huge number of blocks

Combine all the Synplify DSP projects into one and use black boxes to include the VHDL code for the row flip: works, but synthesis is very long (could take days)

Use design partitions (our recommended method): dramatically shortens synthesis and allows a simple modular design

Functional and timing simulations

Using the design partitions method, we created test benches to test every block in Modelsim (functional simulation)

After a correct functional simulation in Modelsim, we repeated the simulation in the Quartus simulator tool

We then ran a timing simulation in the Quartus simulator tool

Functional and timing simulations

Algorithm:

Tested the Simulink phase against the theoretical Matlab code

Tested the Synplify DSP phase against the theoretical Matlab code, and tested synchronization

Tested in Modelsim for functional operation

Tested timing in Quartus

Viewed real-time results on screen

Mini frame related:

Tested in Modelsim for functional operation

Tested timing in Quartus

Viewed real-time results on screen

Results

The algorithm works correctly and is ready to be upscaled

For large dt values we get artifacts, which are also seen in the Matlab fixed-point simulations

Unresolved sync problem in the mini frame creation; it only affects displaying the mini frame





Upscaling to a full frame

Change parameters in Synplify DSP (N_col, N_row), synthesize and create design partitions

Change memory size for the row flip

Change parameters throughout the design (generic parameters)

Remove mini from full and full from mini, and integrate with Neta’s and Hillel’s data path

Synthesize, place and route in Quartus





Achievements

Algorithm and pipeline working at ~26MHz (a bit higher than the DVI clk)

One iteration is enough to see results. Maximum stable dt is ~5 (compared to the semi-implicit design, where dt was limited to only 0.5)

Display real-time results and improve video quality

Unfortunately, some sync problems remain unresolved


Insights and conclusions

Signal processing in real time is quite hard and demands precise planning and design

Test benches must be used to test every part of the design

The principle of “divide and conquer” is a gateway to success

You should be familiar with the available tools before starting work

Documentation is, unfortunately, incomplete and sometimes inaccurate

Compatibility, versions and lack of remote access

Thank you for listening…

We invite you to join us in the lab for a short demonstration
