FPGA BASED REAL TIME VIDEO PROCESSING
![Page 1: FPGA BASED REAL TIME VIDEO PROCESSING](https://reader030.fdocuments.net/reader030/viewer/2022033101/56816492550346895dd6649f/html5/thumbnails/1.jpg)
FPGA BASED REAL TIME VIDEO PROCESSING
Final B presentation
Duration: one year
Presented by: Roman Kofman, Sergey Kleyman
Supervisor: Mike Sumszyk
![Page 2: FPGA BASED REAL TIME VIDEO PROCESSING](https://reader030.fdocuments.net/reader030/viewer/2022033101/56816492550346895dd6649f/html5/thumbnails/2.jpg)
AGENDA
Project Objectives
Algorithm – review
Data flow
Part B motivation
Implementation
Upscaling
Insights and conclusions
![Page 3: FPGA BASED REAL TIME VIDEO PROCESSING](https://reader030.fdocuments.net/reader030/viewer/2022033101/56816492550346895dd6649f/html5/thumbnails/3.jpg)
Project Objectives
Improve video quality
Study the nonlinear diffusion algorithm
Adjust the algorithm to real time demands
Implement on a ProcstarII board with an Altera FPGA
Display real time results
![Page 5: FPGA BASED REAL TIME VIDEO PROCESSING](https://reader030.fdocuments.net/reader030/viewer/2022033101/56816492550346895dd6649f/html5/thumbnails/5.jpg)
Algorithm - review
Based on the 2D non-linear Diffusion equation
Iterative solution
Good filtering – smooths noise
Keeps borders intact
$$\partial_t I(x,y,t) = \operatorname{div}\!\left( g\!\left(|\nabla I|^2\right) \nabla I(x,y,t) \right)$$
![Page 6: FPGA BASED REAL TIME VIDEO PROCESSING](https://reader030.fdocuments.net/reader030/viewer/2022033101/56816492550346895dd6649f/html5/thumbnails/6.jpg)
Matlab simulation
Non linear diffusion filtering, number of iterations: 3
[Figure: original image and filtered image]
Good filtering
Keeps borders intact
![Page 7: FPGA BASED REAL TIME VIDEO PROCESSING](https://reader030.fdocuments.net/reader030/viewer/2022033101/56816492550346895dd6649f/html5/thumbnails/7.jpg)
Thomas algorithm
Inverts a tridiagonal matrix
Makes real time implementation hard
![Page 8: FPGA BASED REAL TIME VIDEO PROCESSING](https://reader030.fdocuments.net/reader030/viewer/2022033101/56816492550346895dd6649f/html5/thumbnails/8.jpg)
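Thomas algorithm
Consists of three loops, one of which is reversed
β, m and y are vectors of length N (frame resolution) and need to be stored in memory since they are read backwards
The original Thomas algorithm requires memory of at least four times the frame size.

    m(1) = alpha(1);              % first pivot (initialization assumed, not shown on the slide)
    for i = 1:N-1                 % loop 1 (forward): factorization
        l(i) = gamma(i)/m(i);
        m(i+1) = alpha(i+1) - l(i)*beta(i);
    end
    y = u;                        % allocate y with the same size as u
    y(1) = d(1);
    for i = 2:N                   % loop 2 (forward): forward substitution
        y(i) = d(i) - l(i-1)*y(i-1);
    end
    u(N) = y(N)/m(N);
    for i = N-1:-1:1              % loop 3 (reversed): back substitution
        u(i) = (y(i) - beta(i)*u(i+1))/m(i);
    end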
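$$\begin{pmatrix}
\alpha_1 & \beta_1 & 0 & 0 & \cdots \\
\gamma_1 & \alpha_2 & \beta_2 & 0 & \cdots \\
0 & \gamma_2 & \alpha_3 & \beta_3 & \cdots \\
\vdots & & \ddots & \ddots & \beta_{N-1} \\
0 & 0 & 0 & \gamma_{N-1} & \alpha_N
\end{pmatrix}$$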
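For readers who want to try the three loops outside Matlab, here is a minimal Python sketch of the same Thomas solver. The function name `thomas_solve` and the list-based interface are assumptions made for illustration; `alpha` is taken as the main diagonal, `beta` as the super-diagonal and `gamma` as the sub-diagonal, matching the way the slide's loops use them.

```python
def thomas_solve(alpha, beta, gamma, d):
    """Solve a tridiagonal system A u = d (hypothetical helper).

    alpha: main diagonal, length N
    beta:  super-diagonal, length N-1
    gamma: sub-diagonal, length N-1
    d:     right-hand side, length N
    """
    n = len(alpha)
    m = [0.0] * n
    l = [0.0] * (n - 1)
    m[0] = alpha[0]
    # Loop 1 (forward): compute the factors l and the pivots m.
    for i in range(n - 1):
        l[i] = gamma[i] / m[i]
        m[i + 1] = alpha[i + 1] - l[i] * beta[i]
    # Loop 2 (forward): forward substitution.
    y = [0.0] * n
    y[0] = d[0]
    for i in range(1, n):
        y[i] = d[i] - l[i - 1] * y[i - 1]
    # Loop 3 (reversed): back substitution, reads beta, m and y backwards.
    u = [0.0] * n
    u[n - 1] = y[n - 1] / m[n - 1]
    for i in range(n - 2, -1, -1):
        u[i] = (y[i] - beta[i] * u[i + 1]) / m[i]
    return u
```

The reversed third loop is what forces `beta`, `m` and `y` to be buffered before the result can be produced.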
![Page 9: FPGA BASED REAL TIME VIDEO PROCESSING](https://reader030.fdocuments.net/reader030/viewer/2022033101/56816492550346895dd6649f/html5/thumbnails/9.jpg)
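Memory efficient implementation
The inverse of a block diagonal matrix is another block diagonal matrix, composed of the inverse of each block
We flip each row separately, so internal memory is sufficient
Requires selective treatment of borders
$$A_i = \begin{pmatrix}
\alpha_{i,1} & \beta_{i,1} & 0 & \cdots & 0 \\
\gamma_{i,1} & \alpha_{i,2} & \ddots & & \vdots \\
0 & \ddots & \ddots & \ddots & 0 \\
\vdots & & \ddots & \ddots & \beta_{i,N-1} \\
0 & \cdots & 0 & \gamma_{i,N-1} & \alpha_{i,N}
\end{pmatrix}$$
where $A_i$ is an $N \times N$ matrix, $N$ is the number of pixels in a row (the number of columns), and $i$ is the row number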
![Page 10: FPGA BASED REAL TIME VIDEO PROCESSING](https://reader030.fdocuments.net/reader030/viewer/2022033101/56816492550346895dd6649f/html5/thumbnails/10.jpg)
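Error criterion
We use the relative Root Mean Square Error (RRMSE) as the error criterion. The RRMSE is defined by:
$$\text{Relative RMSE} = \frac{\sqrt{\dfrac{1}{\text{Number of pixels}} \sum_{\text{All pixels}} \left( X_A - X_R \right)^2}}{\dfrac{1}{\text{Number of pixels}} \sum_{\text{All pixels}} X_R}$$
Where:
$X_A$ is the filtered image
$X_R$ is the reference image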
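A minimal sketch of a relative RMSE computation, assuming the RMSE is normalized by the mean of the reference image. The function name `relative_rmse` and the flat lists of pixel values are assumptions for illustration, not part of the project code.

```python
import math


def relative_rmse(filtered, reference):
    """RMSE between two images, divided by the mean of the reference.

    Both images are given as flat lists of pixel values of equal length
    (an assumption made to keep the sketch dependency-free).
    """
    n = len(reference)
    mse = sum((a - r) ** 2 for a, r in zip(filtered, reference)) / n
    mean_ref = sum(reference) / n
    return math.sqrt(mse) / mean_ref
```

Identical images give 0; the value grows with the pixel-wise difference, independent of the overall brightness scale.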
![Page 11: FPGA BASED REAL TIME VIDEO PROCESSING](https://reader030.fdocuments.net/reader030/viewer/2022033101/56816492550346895dd6649f/html5/thumbnails/11.jpg)
Fixed point considerations
Matlab simulations showed that during calculations we need to work with 16 bits.
[Figure: original image, full precision result and fixed point result for dt=5, 4 iterations; RMSE=0.0416]
![Page 13: FPGA BASED REAL TIME VIDEO PROCESSING](https://reader030.fdocuments.net/reader030/viewer/2022033101/56816492550346895dd6649f/html5/thumbnails/13.jpg)
Data Flow - One Iteration
Two parallel channels
[Block diagram: DVI IN feeds a lines pipe and a columns pipe; each pipe ends in Thomas 3 and uses M4K line reverse buffers (write/read); two transpose stages (T’) appear in the path; the result goes to DVI OUT]
How to implement T’ in real time?
![Page 15: FPGA BASED REAL TIME VIDEO PROCESSING](https://reader030.fdocuments.net/reader030/viewer/2022033101/56816492550346895dd6649f/html5/thumbnails/15.jpg)
Project B - “divide and conquer”
Divide between 2 problems: Algorithm and Memory access
Implement algorithm on a small frame – our part
Complete implementation of transpose using DDR – Neta and Hillel
Integrate both parts into one full data path
![Page 16: FPGA BASED REAL TIME VIDEO PROCESSING](https://reader030.fdocuments.net/reader030/viewer/2022033101/56816492550346895dd6649f/html5/thumbnails/16.jpg)
Project B motivation and goals
Implement the algorithm in a scalable manner
Display results for a small frame
Implement the transpose in internal memory
Implement blocks that will create the mini frame at the beginning and generate the full frame at the end
[Block diagram: DVI IN → Create mini frame → Algorithm → Generate large frame → DVI OUT]
![Page 17: FPGA BASED REAL TIME VIDEO PROCESSING](https://reader030.fdocuments.net/reader030/viewer/2022033101/56816492550346895dd6649f/html5/thumbnails/17.jpg)
Mini from full: using internal memory we select a small frame and send it in a burst to the pipe
To fit the internal memory of the FPGA, we chose a 100×100 mini frame
The implementation of the algorithm is scalable to larger frames
Full out of mini: after the algorithm we generate a large frame by zero padding the mini frame
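The two helper blocks can be sketched in a few lines of Python. The function names, the `top`/`left` window position and the list-of-rows frame representation are all assumptions for illustration; the slides only state that a 100×100 window is selected and later zero padded back to full size.

```python
def create_mini_frame(frame, top, left, size=100):
    """Select a size x size window from a full frame (list of rows)."""
    return [row[left:left + size] for row in frame[top:top + size]]


def generate_full_frame(mini, full_rows, full_cols, top, left):
    """Place the mini frame into an all-zero full-size frame."""
    full = [[0] * full_cols for _ in range(full_rows)]
    for r, row in enumerate(mini):
        # The window is assumed to fit entirely inside the full frame.
        full[top + r][left:left + len(row)] = row
    return full
```

Keeping the window size a parameter mirrors the scalability goal: only `size` changes when moving to larger frames.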
![Page 18: FPGA BASED REAL TIME VIDEO PROCESSING](https://reader030.fdocuments.net/reader030/viewer/2022033101/56816492550346895dd6649f/html5/thumbnails/18.jpg)
Data path for mini frame processing
[Block diagram, blocks as labeled: normal and transpose read of the mini frame (MRAM TriPort with control), pipeline, row flip on M4K, row flip, pipeline, full frame generation; the pipelines contain Thomas loop 1, Thomas loop 2 and Thomas loop 3; each block has its own control and passes sync signals to the next]
Upscalable to full frame processing
![Page 20: FPGA BASED REAL TIME VIDEO PROCESSING](https://reader030.fdocuments.net/reader030/viewer/2022033101/56816492550346895dd6649f/html5/thumbnails/20.jpg)
Synchronization signals
As mentioned, we must treat borders differently than normal pixels.
Therefore we must distinguish, throughout the entire pipeline, between border and non-border pixels, and whether a border is a start or an end.
To do so, we generate four sync signals that describe every pixel:
Start of line
End of line
Start of column
End of column
![Page 21: FPGA BASED REAL TIME VIDEO PROCESSING](https://reader030.fdocuments.net/reader030/viewer/2022033101/56816492550346895dd6649f/html5/thumbnails/21.jpg)
The need for these signals arose in the algorithm, but we can now also use them for memory sync and frame generation sync
From the four signals we can easily derive a “start frame” signal and an “end frame” signal.
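One possible way to model the four sync signals and the derived frame signals is sketched below. The function names and the dictionary keys are hypothetical, and the derivation (start frame = start of line and start of column together, end frame = end of line and end of column together) is an assumption about what "easily derive" means here.

```python
def sync_signals(row, col, n_rows, n_cols):
    """Four per-pixel sync flags for the pixel at (row, col)."""
    return {
        "start_of_line": col == 0,
        "end_of_line": col == n_cols - 1,
        "start_of_column": row == 0,
        "end_of_column": row == n_rows - 1,
    }


def frame_flags(s):
    """Derive (start_frame, end_frame) from the four sync flags."""
    start_frame = s["start_of_line"] and s["start_of_column"]
    end_frame = s["end_of_line"] and s["end_of_column"]
    return start_frame, end_frame
```

Because the flags travel with every pixel, each block in the pipeline can make border decisions locally without its own counters.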
![Page 22: FPGA BASED REAL TIME VIDEO PROCESSING](https://reader030.fdocuments.net/reader030/viewer/2022033101/56816492550346895dd6649f/html5/thumbnails/22.jpg)
Transpose in internal memory
Transpose the image during reading
The read address is a sum of two weighted values: the row and column pointers
Transpose the image by switching the pointers
Normal read address = row pointer × weight + column pointer
Transposed read address = column pointer × weight + row pointer
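The pointer switch can be illustrated with a small Python model of the read address generator. The names `read_address` and `read_frame`, and the choice of the row length as the weight for a square frame stored row by row, are assumptions for this sketch.

```python
def read_address(row_ptr, col_ptr, weight, transpose=False):
    """Weighted sum of the two pointers; swapping them transposes."""
    if transpose:
        row_ptr, col_ptr = col_ptr, row_ptr
    return row_ptr * weight + col_ptr


def read_frame(memory, n, transpose=False):
    """Read an n x n frame stored row by row in a flat memory."""
    return [
        [memory[read_address(r, c, n, transpose)] for c in range(n)]
        for r in range(n)
    ]
```

No data is moved: the frame is written once in normal order, and only the read side decides whether it comes out transposed.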
![Page 23: FPGA BASED REAL TIME VIDEO PROCESSING](https://reader030.fdocuments.net/reader030/viewer/2022033101/56816492550346895dd6649f/html5/thumbnails/23.jpg)
Row flip on internal memory
Use M4K memory to reverse the order of incoming data for an entire row
Implement a scalable design to be used on different row sizes
Use sync signals as inputs and generate them for the next block at the output
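A behavioral sketch of a row flip, assuming two line buffers used in ping-pong fashion: one buffer is written in forward order while the previously filled one is read backwards. The function name `row_flip` and the software loop structure are illustrative assumptions, not the VHDL block itself.

```python
def row_flip(pixels, row_len):
    """Reverse every row of a pixel stream using two line buffers."""
    assert len(pixels) % row_len == 0  # the sketch assumes whole rows
    buffers = [[None] * row_len, [None] * row_len]
    write_sel = 0          # index of the buffer currently being written
    have_full_row = False  # True once the other buffer holds a full row
    out = []
    n_rows = len(pixels) // row_len
    for r in range(n_rows + 1):  # one extra row time to flush the last row
        for k in range(row_len):
            if have_full_row:
                # Read the previously written row backwards.
                out.append(buffers[1 - write_sel][row_len - 1 - k])
            if r < n_rows:
                # Write the incoming row in forward order.
                buffers[write_sel][k] = pixels[r * row_len + k]
        have_full_row = r < n_rows
        write_sel = 1 - write_sel  # swap the roles of the two buffers
    return out
```

The output lags the input by one row, and the buffer depth equals the row length, which is why changing the row size only changes the memory depth.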
![Page 24: FPGA BASED REAL TIME VIDEO PROCESSING](https://reader030.fdocuments.net/reader030/viewer/2022033101/56816492550346895dd6649f/html5/thumbnails/24.jpg)
From Matlab to HDL - Simulink phase
From serial code to a combination of sequential and parallel blocks
“Close to hardware” implementation
Real time simulation and comparison to code (non real time) results
![Page 25: FPGA BASED REAL TIME VIDEO PROCESSING](https://reader030.fdocuments.net/reader030/viewer/2022033101/56816492550346895dd6649f/html5/thumbnails/25.jpg)
Simulation in Simulink
Data from DVI: using a repeating sequence block
Emulating row flip: using an enabled system for sync; buffer depth is the number of pixels in a row
![Page 26: FPGA BASED REAL TIME VIDEO PROCESSING](https://reader030.fdocuments.net/reader030/viewer/2022033101/56816492550346895dd6649f/html5/thumbnails/26.jpg)
Simulink data path
[Block diagram, blocks as labeled: derivatives calculation, building tri-diagonal matrix, Thomas forward loops, double buffered flip (memory emulation), Thomas reversed loop, double buffered flip (memory emulation)]
![Page 27: FPGA BASED REAL TIME VIDEO PROCESSING](https://reader030.fdocuments.net/reader030/viewer/2022033101/56816492550346895dd6649f/html5/thumbnails/27.jpg)
From Matlab to HDL – Synplify DSP phase
From full precision Simulink blocks to fixed point blocks that represent hardware
Pipelining and frequency considerations
Generate HDL code
![Page 28: FPGA BASED REAL TIME VIDEO PROCESSING](https://reader030.fdocuments.net/reader030/viewer/2022033101/56816492550346895dd6649f/html5/thumbnails/28.jpg)
From Matlab to HDL – Synplify DSP phase
After transforming every part of the data path from Simulink blocks to Synplify DSP blocks and synthesizing, not all parts achieved the required frequency (DVI clk ~25MHz)
The critical paths were in the loops in the design. Unfortunately a feedback loop cannot simply be pipelined, because each iteration needs the result of the previous one
Solution: Simplify every loop as much as possible
![Page 29: FPGA BASED REAL TIME VIDEO PROCESSING](https://reader030.fdocuments.net/reader030/viewer/2022033101/56816492550346895dd6649f/html5/thumbnails/29.jpg)
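Simplifying loops – Thomas 1
Loops with heavy calculations become critical paths
Move multiplier out of loop
Pipeline
$$m_{i+1} = \alpha_{i+1} - \frac{\gamma_i}{m_i}\,\beta_i \quad\longrightarrow\quad m_{i+1} = \alpha_{i+1} - \frac{\beta_i \gamma_i}{m_i}$$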
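The effect of moving the multiplier out of the loop can be sketched in Python. This assumes the restructuring replaces the in-loop product of the factor with beta by a precomputed product of beta and gamma, leaving only a divide and a subtract in the feedback path; the function names are hypothetical.

```python
def thomas1_original(alpha, beta, gamma):
    """Forward loop as in the serial code: divide, then multiply, per step."""
    m = [alpha[0]]
    for i in range(len(alpha) - 1):
        l = gamma[i] / m[i]
        m.append(alpha[i + 1] - l * beta[i])
    return m


def thomas1_restructured(alpha, beta, gamma):
    """Same recurrence with the product taken out of the feedback loop."""
    # This product has no dependence on m, so it can be pipelined freely.
    bg = [b * g for b, g in zip(beta, gamma)]
    m = [alpha[0]]
    for i in range(len(alpha) - 1):
        # Only one divide and one subtract remain inside the loop.
        m.append(alpha[i + 1] - bg[i] / m[i])
    return m
```

Both versions compute the same pivots; the second simply shortens the chain of operations that must finish within one clock cycle.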
![Page 30: FPGA BASED REAL TIME VIDEO PROCESSING](https://reader030.fdocuments.net/reader030/viewer/2022033101/56816492550346895dd6649f/html5/thumbnails/30.jpg)
Simplifying loops
After these moves the maximum frequency is still too low
Solution: Replace the Synplify DSP division block with a faster but less accurate implementation – the AEGP division algorithm
![Page 31: FPGA BASED REAL TIME VIDEO PROCESSING](https://reader030.fdocuments.net/reader030/viewer/2022033101/56816492550346895dd6649f/html5/thumbnails/31.jpg)
Anderson, Earle, Goldschmidt and Powers division algorithm
Iterative algorithm that calculates N/D
Adding iterations increases precision but uses more resources (extra multipliers and sums)
$$\frac{N}{D} \approx x$$
where $x$ is a result of the iterations:
$$f_{i+1} = 2 - t_i, \qquad x_{i+1} = x_i f_{i+1}, \qquad t_{i+1} = t_i f_{i+1}$$
with the start conditions:
$$x_0 = N, \qquad t_0 = D$$
$D$ must be in the range $1 \le D < 2$ or $0.5 \le D < 1$
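A minimal floating point sketch of this multiplicative division scheme in Python. The function name `goldschmidt_divide` and the default of three iterations are assumptions; the hardware version works in fixed point, which this sketch does not model.

```python
def goldschmidt_divide(n, d, iterations=3):
    """Approximate n / d by repeated multiplication.

    d is expected to be pre-scaled into [0.5, 1) or [1, 2);
    each iteration drives t toward 1 and x toward n / d.
    """
    x, t = n, d
    for _ in range(iterations):
        f = 2.0 - t   # correction factor from the current denominator
        x = x * f     # numerator track
        t = t * f     # denominator track, converges to 1
    return x
```

Each iteration costs two multipliers and one subtractor, and the error of `t` relative to 1 is squared every step, which is the precision versus resources trade-off noted above.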
![Page 32: FPGA BASED REAL TIME VIDEO PROCESSING](https://reader030.fdocuments.net/reader030/viewer/2022033101/56816492550346895dd6649f/html5/thumbnails/32.jpg)
Simplifying loops – Thomas 2
The change in Thomas 1 influenced Thomas 2
[Figure: original and new expressions relating l and m, with a “divide by” annotation]
![Page 33: FPGA BASED REAL TIME VIDEO PROCESSING](https://reader030.fdocuments.net/reader030/viewer/2022033101/56816492550346895dd6649f/html5/thumbnails/33.jpg)
Simplifying loops – Thomas 3
$\frac{a}{b}$ is slow and part of the loop
$\frac{1}{b} \cdot a$: the multiplier is faster
The $\frac{1}{b}$ division is done before the loop
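The same idea in a Python sketch of the reversed loop, assuming that the slow division is the division by the pivot m(i) in the back substitution. The function name `thomas3_with_reciprocals` is hypothetical.

```python
def thomas3_with_reciprocals(y, beta, m):
    """Back substitution with all divisions hoisted out of the loop."""
    # The reciprocals do not depend on u, so they can be computed
    # ahead of the feedback loop (for example in a pipelined divider).
    inv_m = [1.0 / mi for mi in m]
    n = len(y)
    u = [0.0] * n
    u[n - 1] = y[n - 1] * inv_m[n - 1]
    for i in range(n - 2, -1, -1):
        # Only multiplies and a subtract remain inside the loop.
        u[i] = (y[i] - beta[i] * u[i + 1]) * inv_m[i]
    return u
```

The arithmetic result is the same up to rounding; what changes is that the loop's critical path no longer contains a divider.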
![Page 34: FPGA BASED REAL TIME VIDEO PROCESSING](https://reader030.fdocuments.net/reader030/viewer/2022033101/56816492550346895dd6649f/html5/thumbnails/34.jpg)
Integrating the Synplify projects into one Quartus project
Synplify DSP generates registers and blocks with similar default names
A simple combine in Quartus will not work
Combining several blocks in Quartus demands a different approach:
Manually change names – not realistic given the huge number of blocks
Combine all the Synplify projects into one and use black boxes to include the VHDL code for the row flip – works, but synthesis is very long (could be days)
Use design partitions – our recommended method – dramatically shortens synthesis and allows a simple modular design
![Page 35: FPGA BASED REAL TIME VIDEO PROCESSING](https://reader030.fdocuments.net/reader030/viewer/2022033101/56816492550346895dd6649f/html5/thumbnails/35.jpg)
Functional and timing simulations
Using the design partitions method, we created test benches to test every block in Modelsim (functional simulation)
After a correct functional simulation in Modelsim, we repeated the simulation in the Quartus simulator tool
We then ran a timing simulation in the simulator tool
![Page 36: FPGA BASED REAL TIME VIDEO PROCESSING](https://reader030.fdocuments.net/reader030/viewer/2022033101/56816492550346895dd6649f/html5/thumbnails/36.jpg)
Functional and timing simulations
Algorithm:
Tested the Simulink phase in comparison with theoretical Matlab code
Tested the Synplify DSP phase in comparison with theoretical Matlab code and tested synchronization
Tested in Modelsim for functional operation
Tested timing in Quartus
Viewed real time results on screen
Mini frame related:
Tested in Modelsim for functional operation
Tested timing in Quartus
Viewed real time results on screen
![Page 37: FPGA BASED REAL TIME VIDEO PROCESSING](https://reader030.fdocuments.net/reader030/viewer/2022033101/56816492550346895dd6649f/html5/thumbnails/37.jpg)
Results
Algorithm works correctly and gives correct results – ready to be upscaled
For large dt values we get artifacts, which are also seen in the Matlab fixed point simulations
Unresolved sync problem in the mini frame creation, which only affects displaying the mini frame
![Page 39: FPGA BASED REAL TIME VIDEO PROCESSING](https://reader030.fdocuments.net/reader030/viewer/2022033101/56816492550346895dd6649f/html5/thumbnails/39.jpg)
Upscaling to a full frame
Change parameters in Synplify DSP (N_col, N_row), synthesize and create design partitions
Change memory size for the row flip
Change parameters throughout the design (generic parameters)
Remove mini from full and full from mini, and integrate with Neta’s and Hillel’s data path
Synthesize, place and route in Quartus
![Page 41: FPGA BASED REAL TIME VIDEO PROCESSING](https://reader030.fdocuments.net/reader030/viewer/2022033101/56816492550346895dd6649f/html5/thumbnails/41.jpg)
Achievements
Algorithm and pipeline working at ~26MHz (a bit higher than the DVI clk)
One iteration is enough to see results. Maximum stable dt is ~5 (in comparison to the semi implicit design, where dt was limited to 0.5)
Display real time results and improve video quality
Unfortunately, we still have some unresolved problems with sync
![Page 42: FPGA BASED REAL TIME VIDEO PROCESSING](https://reader030.fdocuments.net/reader030/viewer/2022033101/56816492550346895dd6649f/html5/thumbnails/42.jpg)
Insights and conclusions
Signal processing in real time is quite hard and demands precise planning and designing
Test benches must be used to test every part of the design
The principle of “divide and conquer” is a gateway to success
You should be familiar with the available tools before starting work
Documentation is, unfortunately, not complete and sometimes not accurate
Compatibility, versions and lack of remote access
![Page 43: FPGA BASED REAL TIME VIDEO PROCESSING](https://reader030.fdocuments.net/reader030/viewer/2022033101/56816492550346895dd6649f/html5/thumbnails/43.jpg)
Thank you for listening…
We invite you to join us in the lab for a short demonstration