Flexible Partial Reconfiguration based Design Architecture ...

Post on 16-Oct-2021

6 views 0 download

Transcript of Flexible Partial Reconfiguration based Design Architecture ...

Flexible Partial Reconfiguration based Design Architecture for Dataflow Computation

Mihir Shah

Advisor: Benjamin Carrion Schaefer

1

Thesis Index

2

SR. NO TITLE

1 Thesis Motivation

2 Thesis Contribution

5 Proposed Design Methodology

6 Design Implementations โ€“ Spatial & Partial Reconfiguration based

7 Comparative Study & Analysis

8 Conclusion & Future Works

3

Thesis Motivation

DCT Quantz RLE Huffman

Inp

ut

Stre

am

Ou

tpu

t St

ream

Dataflow Computing (DC) specific model of computation

Target application described as Data Flow Graph (DFG)

Used Extensively in High Frequency Trading, Image, Signal processing based applications

JPEG Encoder as Dataflow Process

Node

Node: Processing Element (PE), Execution

kernel/accelerator/

Links : FIFO (first-in first-out) queue or buffer

4

Thesis Motivation

4

Partial Reconfiguration

Allows modification of an operating FPGA design by loading a partial configuration file,

usually a partial BIT file

Time Multiplex several Processing Elements in a dataflow computation process

๐‘ท๐‘ฌ๐ŸŽ

๐‘ท๐‘ฌ๐Ÿ

๐‘ท๐‘ฌ๐’Œ

Container

FPGA Fabric

Reconfigurable Partition Areaโ€ฆโ€ฆ

Partial

Reconfiguration

Controllers

Config

Port

Matrix Multiplication

55

Thesis Motivation

5

Partial Reconfiguration

Allows modification of an operating FPGA design by loading a partial configuration file,

usually a partial BIT file

Time Multiplex several Processing Elements in a dataflow computation process

๐‘ท๐‘ฌ๐ŸŽ

Container

FPGA Fabric

Reconfigurable Partition Areaโ€ฆโ€ฆ

Partial

Reconfiguration

Controllers

Config

Port

Matrix Multiplication

๐‘ท๐‘ฌ๐ŸŽ

๐‘ท๐‘ฌ๐Ÿ

๐‘ท๐‘ฌ๐’Œ

666

Thesis Motivation

6

Partial Reconfiguration

Allows modification of an operating FPGA design by loading a partial configuration file,

usually a partial BIT file

Time Multiplex several Processing Elements in a dataflow computation process

Container

FPGA Fabric

๐‘ท๐‘ฌ๐ŸŽโ€ฆโ€ฆPartial

Reconfiguration

Controllers

Config

Port

Matrix Multiplication

๐‘ท๐‘ฌ๐ŸŽ

๐‘ท๐‘ฌ๐Ÿ

๐‘ท๐‘ฌ๐’Œ

๐‘ท๐‘ฌ๐Ÿ

7777

Thesis Motivation

7

Partial Reconfiguration

Allows modification of an operating FPGA design by loading a partial configuration file,

usually a partial BIT file

Time Multiplex several Processing Elements in a dataflow computation process

Container

FPGA Fabric

๐‘ท๐‘ฌ๐Ÿโ€ฆโ€ฆPartial

Reconfiguration

Controllers

Config

Port

Matrix Multiplication

๐‘ท๐‘ฌ๐ŸŽ

๐‘ท๐‘ฌ๐Ÿ

๐‘ท๐‘ฌ๐’Œ

88

Thesis Motivation

๐‘ท๐‘ฌ๐ŸŽ ๐‘ท๐‘ฌ๐Ÿ ๐‘ท๐‘ฌ๐Ÿ ๐‘ท๐‘ฌ๐’Œโ€ฆโ€ฆโ€ฆโ€ฆโ€ฆ. Spatially placed FPGA Design

Problem ? Area Utilization HighDataflow Process on FPGA Fabric

๐‘ท๐‘ฌ๐ŸŽ

FPGA Fabric

Reconfigurable Partition Area ARM

Processor

DDR3 Memory

๐‘ท๐‘ฌ๐Ÿ

๐‘ท๐‘ฌ๐’Œ

โ€ฆโ€ฆ

Container

PR based design using External Off chip DDR

memory as FIFOSolves: Area Utilization Issue

Problem ? Runtime & Latency Hit !!

PRDDR design method

Static design method

99

Thesis Motivation

๐‘ท๐‘ฌ๐ŸŽ

FPGA Fabric

Reconfigurable Partition Area

๐‘ท๐‘ฌ๐Ÿ

๐‘ท๐‘ฌ๐’Œ

โ€ฆโ€ฆBRAM Memory

THUS WE PROPOSE - PR based design using Internal On chip BRAM memory as FIFO for dataflow process

Improves Runtime & Latency compared to PRDDR

with reduced FPGA resource savings

PRBRAM design method

Container

1. Semi-automatic design methodology for dataflow computation

Fixed overlay Static architecture - PRBRAM

Support for partial re-configurability

Input is behavior description language for HLS

2. Prototyped on Xilinx Zynq FPGA

JPEG Encoder given in SystemC

Three Testcase images

3. Extensive experimental results

Measure hardware running time vs. area characteristics of static and PR-based methods.

Comparative study between PRBRAM and PRDDR methods

Hardware Running Time vs. Varying Size of Pblock or Reconfigurable Partitions

10

Thesis Contribution

11

JPEG Encoder: A Dataflow Process for Comparative Study

11

DPCM

RLC

Entropy Coding

HeaderTables

Data

CodingTables

Quantโ€ฆTables

DCTpixel(i, j)

8 x 8

DCT(i, j)

8 x 8

QuantizationDCTq(i, j)

Original_Image.bmp 512 X 512 pixels

Size: 258 KB

Compressed_Image.jpg 512 X 512 pixels

Size: 36 KBCompression Ratio : 7.17:1

Run-Length Zig-Zag Scan

DC

AC

JPEG File Format

PE1 : DCT

PE2: Quantization

PE3: RunLengthEncoding

PE4: Huffman Encoding

12

Using Xilinx SDKApp.cpp +

๐‘บ๐’•๐’‚๐’•๐’Š๐’„๐’ƒ๐’๐’‚๐’๐’Œ.bit = BOOT. bin

Overview of the Proposed Design Methodology

12

PRBRAM Static Architecture

Stage 1: Behavioral Algorithm Description to RTL Generation

1313

Using Xilinx SDKApp.cpp +

๐‘บ๐’•๐’‚๐’•๐’Š๐’„๐’ƒ๐’๐’‚๐’๐’Œ.bit = BOOT. bin

13

PRBRAM Static Architecture

Stage 1: Behavioral Algorithm Description to RTL Generation

Stage 2: Validation and Creation of Custom IPs

Overview of the Proposed Design Methodology

141414

Using Xilinx SDKApp.cpp +

๐‘บ๐’•๐’‚๐’•๐’Š๐’„๐’ƒ๐’๐’‚๐’๐’Œ.bit = BOOT. bin

14

PRBRAM Static Architecture

Stage 1: Behavioral Algorithm Description to RTL Generation

Stage 2: Validation and Creation of Custom IPs

Stage 3: TCL Automated Floorplan for PR Designs

Overview of the Proposed Design Methodology

15151515

Using Xilinx SDKApp.cpp +

๐‘บ๐’•๐’‚๐’•๐’Š๐’„๐’ƒ๐’๐’‚๐’๐’Œ.bit = BOOT. bin

15

PRBRAM Static Architecture

Stage 1: Behavioral Algorithm Description to RTL Generation

Stage 2: Validation and Creation of Custom IPs

Stage 3: TCL Automated Floorplan for PR Designs

Stage 4: Deploying the Binaries on Zynq-7000: PRBRAM Static

Architecture

Overview of the Proposed Design Methodology

16

Stage 1: Behavioral Description of Dataflow to RTL Generation

Key-points when describing dataflow application using BDL:

Uniformity in the number, direction and data-widths of I/O all the PEโ€™s

Control interface signals - done, reset and start : Close Loop Feedback when Context Switching

16

๐‘ท๐‘ฌ๐ŸŽ. ๐’„ ๐‘ท๐‘ฌ๐Ÿ. ๐’„ ๐‘ท๐‘ฌ๐Ÿ. ๐’„ ๐‘ท๐‘ฌ๐’Œ. ๐’„โ€ฆโ€ฆโ€ฆโ€ฆโ€ฆ.

Behavioral Description of Dataflow Algorithm using SystemC/C++/C

High Level Synthesis

RTL GENERATION

Raw Input Data

Processed Output Data

โ€ฆโ€ฆโ€ฆโ€ฆโ€ฆ.

โ€ฆโ€ฆโ€ฆโ€ฆโ€ฆ.๐‘ท๐‘ฌ๐ŸŽ. ๐’—, ๐‘ท๐‘ฌ๐Ÿ. ๐’—, ๐‘ท๐‘ฌ๐Ÿ. ๐’—, ๐‘ท๐‘ฌ๐Ÿ‘. ๐’—

Vivado HLS (Xilinx),

Stratus (Cadence),

CyberWorkBench(NEC),

Catapult(Mentor)

17

Stage 2: Validation and Creation of Custom IPs

17

Logical Synthesis & Simulation

Packaging Design using Vivado IP Packager

Creating System Level Design with Xilinx IP

Integrator

Validating the IP design using Xilinx SDK

Structural RTL to

Xilinx Primitives

Writing Test

Benches

Ensure Target

Platform Function

& Timing Violations

Step.1 Step.2 Step.3 Step.4

Design Re-usability

Xilinx Supports

AMBA AXI

I/O instance port-

map to internal

slave registers

Helps to Write

Software Code for

the IP

Compare results

with Simulation

Block Design with

ZYNQ7 PS

Create Top-Level

Wrapper

Export to SDK to

create BSP &

drivers

18

Stage 2: Validation and Creation of Custom IPs

18

PE0.v, PE1.v . . . . PEk.v PE0 PE1 PEk

Post Logical Synthesis

Functional Simulation

Match

?

. . . .

Hardware Results

Integrating IP to ARM

Dedicated IPsRTL

Goal: Ensures correct memory maps in IPs

Validates Control Signals & Functionality of the Module before integrating into tighter PR Flow

19

Stage 3: TCL Automated Floorplan for PR Designs

19

1. Synthesized Static & RMs separately

2. Create physical constraints to create

pblock RP

3. Set property HD.RECONFGIURABLE

on RP

4. Implement static + one RM โ€“ complete

design

5. Save the DCP after routing

6. Remove the RM from RP and save

static DCP

7. Lock static placement & routing

8. Add New RM to the static design &

Save the DCP

All RMs covered ?

No

9. Run pr_verifyYes

10. Create bitstreams for each configuration

1) fplan.xdc

2) PEk.v

3) Static.v

1) partial binaries (.bin)

2) blanking.bit

**Processing Element (PE) will be referred as, Reconfigurable Module(RM) in PR Designs

20

Stage 4: Deploying the Binaries on Zynq-7000

20

BOOT.bin = first stage image for PL side + User application

software

After power-on reset, the Boot ROM determines the boot mode (SD flash memory) and the encryption status (non-secure)

Load First Stage Boot Loader (FSBL) into on-chip

RAM (OCM).

Releases CPU control to the FSBL which in turn

configures the PL with the full Staticblank.bit via

PCAP (Processor Control Access Port)

SD Card

2121

Stage 4: Deploying the Binaries on Zynq-7000

21

Partial bitstreams are loaded into DDR memory from SD

card to maximize throughput during configuration

The Raw Image data which is the input to the dataflow

process in JPEG Encoder is also transferred to the DDR

Memory

After this step, the Reconfigurable Modules can

be loaded into the Reconfigurable Partition to start

computation.

2222

Stage 4: Deploying the Binaries on Zynq-7000

Load the BRAM Memory with Raw Data from DDR

Mux Sel=0

232323

Stage 4: Deploying the Binaries on Zynq-7000

PE0.bin

PCAP fetches partial bitfile (PE0.bin) from ddr3 memory to load into configuration port

24242424

Stage 4: Deploying the Binaries on Zynq-7000

PE0.bin

Read the Raw Image data from BRAM Memory and Input it to PE0.bin

Start/Done BRAM Read Sel Mux=1

2525252525

Stage 4: Deploying the Binaries on Zynq-7000

PE0.bin

Enable โ€˜start computationโ€™ signal from ARM for PE0.bin

262626262626

Stage 4: Deploying the Binaries on Zynq-7000

PE0.bin

Read โ€˜done computationโ€™ signal to ARM for PE0.bin

27272727272727

Stage 4: Deploying the Binaries on Zynq-7000

PE0.bin

Write the output generated by PE0.bin to BRAM memory

Start/Done BRAM Write

282828

Stage 4: Deploying the Binaries on Zynq-7000

PEk.bin

The process to Load RM, Read BRAM, Compute & Write BRAM continues till the terminal PE โ€ฆ Finally, PCAP fetches partial bitfile (PEk.bin) from ddr3 memory to load into configuration port

292929

Stage 4: Deploying the Binaries on Zynq-7000

Load the Results generated by PEk.bin to SD Card from BRAM Memory

Mux Sel=0

30

JPEG Encoder PRBRAM Design Implementation

Utilizing 91.42 % of BRAM memory to store intermediate results

Total memory requirements (1048.576 Kbytes) > Available BRAM memory (140 blocks X 36Kb = 630 Kbytes)

Minimum number of partial reconfigurations = 8 (4 (Reconfigurable Modules) * 2(Dividing factor))

30

2048 blocks

4096 8 X 8 pixel

blocks

Divided the image dataset into 2 & Reload BRAM

2048 blocks

3131

PRBRAM Design Implementation โ€“ Experimental Result

32

๐‘ป๐’“๐’–๐’๐’•๐’Š๐’Ž๐’† = ๐‘ป๐’‹๐’‘๐’†๐’ˆโˆ’๐’„๐’๐’Ž๐’‘๐’–๐’•๐’Š๐’๐’ˆ + ๐‘ป๐’๐’—๐’†๐’“๐’‰๐’†๐’‚๐’… + ๐‘ป๐’ƒ๐’Š๐’ โˆ— ๐‘ต๐’ƒ๐’Š๐’

Tjpeg-computing: actual computing time it takes for processing all the inputs of each reconfigurable module

Tbin : time it takes to partially configure the bitstream

Nbin : number of times reconfiguration occurs

Toverhead : time it takes to load the partial binaries and raw image data from SD card to DDR memory

The values Tbin = 0.1975 s, Tjpeg-computing = 2.994 s and Toverhead = 1.675 s are obtained experimentally

PRBRAM Design Implementation โ€“ Experimental Results II

Cases of no. of reconfigurations (1) 32 (2) 64

(3) 128 (4) 256 (5) 512

32

33

RTBRAM values for varying RPBitsize

PRBRAM Design Implementation โ€“ Experimental Results III

33

The size of the pblock or reconfigurable partition affects Tbin

There is a linear relationship

The Hardware running time for reconfigurable architectures is significantly impacted by this.

34

PRBRAM Design Implementation โ€“ Experimental Results IV

[Case 1: RPBitsize = 1598.896 KB, Case 2: RPBitsize = 1306.272 KB, Case 3: RPBitsize = 786.664 KB]

34

4.57

3.86706

3.30284

2

2.5

3

3.5

4

4.5

5

0 1 2 3 4

FPG

A_R

UN

TIM

E (s

)

CASE

FPGA_RUNTIME vs RP_BITSIZE :No. of Reconfiguration = 8

9.305 7.71959

5.62322

4

5

6

7

8

9

10

0 1 2 3 4

FPG

A_R

UN

TIM

E (s

)

CASE

FPGA_RUNTIME vs RP_BITSIZE:No. of Reconfiguration = 32

28.167

23.13036

14.9039

0

5

10

15

20

25

30

0 1 2 3 4

FPG

A_R

UN

TIM

E (s

)

CASE

FPGA_RUNTIME vs RP_BITSIZE:No. of Reconfiguration = 128

103.619

83.11808

52.02699

0

20

40

60

80

100

120

0 1 2 3 4

FPG

A_R

UN

TIM

E (s

)

CASE

FPGA_RUNTIME vs RP_BITSIZE:No. of Reconfiguration = 512

35

Objective:

Prove area utilization efficacy of PRBRAM

Objective:

Prove PRBRAM is runtime and latency efficient compared to PRDDR

JPEG Encoder PRDDR MethodJPEG Encoder Spatial Method

35

DCT

Quantization

IP containing

Reconfigurable

Partition

36

Calculating Area Utilization

๐ด๐‘ก๐‘œ๐‘ก๐‘Ž๐‘™ = ๐ด๐‘ ๐‘ก๐‘Ž๐‘ก๐‘–๐‘ + ๐ด๐‘‘๐‘ฆ๐‘›๐‘Ž๐‘š๐‘–๐‘ ๐‘š๐‘Ž๐‘ฅ ๐‘ƒ๐ธ0, ๐‘ƒ๐ธ1, ๐‘ƒ๐ธ2, โ€ฆ . ๐‘ƒ๐ธ๐‘˜

36

๐ด๐‘ก๐‘œ๐‘ก๐‘Ž๐‘™ = ๐‘ƒ๐ธ0, ๐‘ƒ๐ธ1, ๐‘ƒ๐ธ2, โ€ฆ . ๐‘ƒ๐ธ๐‘˜

PR Based Designs

Static Designs

*PE: Process Element

RunLength Encoding PE

Astatic of PRBRAM is high due to additional IPs in the overlay architecture

Max. BRAM Utilization

24759 (Spatial)

18985 (PR_BRAM)

16442 (PR_DDR)

14000

15000

16000

17000

18000

19000

20000

21000

22000

23000

24000

25000

26000

1.5 1.6 1.7 1.8 1.9 2 2.1 2.2 2.3

AR

EA_F

F

FPGA_RUNTIME (s)

FPGA_RUNTIME vs AREA_FF

37

I. COMPARATIVE STUDY โ€“ AREA UTILIZATION vs FPGA RUNTIME

37

14997 (Spatial)

12373 (PR_BRAM)

10811 (PR_DDR)

8000

9000

10000

11000

12000

13000

14000

15000

16000

1.2 1.4 1.6 1.8 2 2.2 2.4 2.6 2.8 3 3.2 3.4 3.6 3.8 4 4.2 4.4 4.6 4.8 5 5.2 5.4 5.6

AR

EA_L

UT

FPGA_RUNTIME (s)

FPGA_RUNTIME vs AREA_LUT

PRBRAM design method requires slightly more area compared to PRDDR due additional IPs - ARM-FPGA

Control Bus, ARM-Side BRAM Control, MUX and Block RAM Memory modules.

Comparing spatial design implementation, the utilization of PRBRAM is significantly low, which is as expected.

3838

1.53

1.163

0.947 0.88 0.832

2.0291.854

1.767 1.722 1.701

0

0.25

0.5

0.75

1

1.25

1.5

1.75

2

2.25

0 16 32 48 64 80 96 112 128 144 160 176 192 208 224 240 256 272

LATE

NC

Y (

s)

NUMBER_RECONFIGURATIONS

LATENCY COMPARISON

LATENCY_BRAM LATENCY_DDR

7.80710.96317.249

29.824

54.975

10.19416.915

30.344

57.211

110.937

05

101520253035404550556065707580859095

100105110115120

0 16 32 48 64 80 96 112 128 144 160 176 192 208 224 240 256 272

FPG

A_R

UN

TIM

E (s

)

NUMBER_RECONFIGURATIONS

FPGA_RUNTIME COMPARISON

RUNTIME_BRAM RUNTIME_DDR

Experiment with unequal RPBitsize

RPBitsize = 3416.088 KB for PRDDR

RPBitsize = 1598.896 KB for PRBRAM

Non-Linear Relationship in Runtime between the

graphs due to Tbin * Nbin not constant

II. COMPARATIVE STUDY โ€“ RUNNING TIME vs LATENCY

39

PRBRAM & PRDDR Static.dcp Floorplan View โ€“ Equal Pblock Sizes

3939

Partition Pins

BRAM Blocks

Higher utilization of

logic elements due to

additional IPs.

4040

II. COMPARATIVE STUDY โ€“ RUNNING TIME vs LATENCY

Experiment with equal RPBitsize = 1306.272 KB for both PR implementations.

Average improvement in runtime is 0.529s

Runtime varies linearly because Nbin * Tbin is constant

40

1.106429

0.874237

0.758173 0.700145

1.322754

0.982405

0.812247

0.727149

0.650.7

0.750.8

0.850.9

0.951

1.051.1

1.151.2

1.251.3

1.351.4

0 10 20 30 40 50 60 70

LATE

NC

Y (

s)

NUMBER_RECONFIGURATIONS

LATENCY COMPARISON

LATENCY_BRAM LATENCY_DDR

2.2128593.496951

6.065391

11.20233

2.685509

3.969622

6.537978

11.67439

22.5

33.5

44.5

55.5

66.5

77.5

88.5

99.510

10.511

11.512

0 8 16 24 32 40 48 56 64 72

FPG

A_R

UN

TIM

E (s

)

NUMBER_RECONFIGURATIONS

FPGA_RUNTIME COMPARISON

RUNTIME_BRAM RUNTIME_DDR

41

CONCLUSION

41

Improvement in average hardware running (PRBRAM ) of 0.529s vs. PRDDR

Implementation with the proposed Architecture PRBRAM is area efficient compared to spatial

implementation with

LUT area savings up to 21.20 % & FF area savings up to 30.41 % for 1306.272 KB

These %โ€™s are including the additional resources utilized by proposed static architecture

Implemented JPEG Encoder on Zynq -7000

For three testcase images-Lena, Peppers and Goldhill

Using Spatial, PRDDR & PRBRAM for Comparative Study & Analysis

Novel design methodology for dataflow computation with proposed PRBRAM overlay static architecture

Including TCL based automated floorplanning + User software application algorithms

42

FUTURE WORKS

Sophisticated Partial Reconfiguration Controllers

Minimize the time required for reconfiguring

Enhanced parallelism of operations in hardware accelerators/Processing Elements due

to saved resources in reconfigurable architecture.

In extremely data-intensive applications, exploring performance impact on PRBRAM

BRAM + Distributed RAM to deal with limitations of on-chip memory

42

43

THANK YOU

43

44

APPENDIX

45

Experimental Results โ€“ Spatial, PRBRAM & PRDDR

Table IV: JPEG Encoder Results for lena.bmp with PRDDR and PRBRAM Design Implementation

Table III: JPEG Encoder Results with Spatial Design Implementation

45

4646

Partial Reconfiguration โ€“ Full vs Partial Bitstream

4646

Figure.2 Configuration Process & Contents of (a) Full and

(b) Partial bitstreams

(a) (b)

46

4747

Partial Reconfiguration โ€“ Method to configure Partial Bitstreams

Internal Configuration Access Port (ICAP) :

User configuration solutions

Requires ICAP controller + Logic to drive the ICAP interface

JTAG Port :

Quick Testing or Debug

Driven using iMPACT or ChipScope Analyzer

Processor Configuration Port (PCAP) :

Configuration mechanism for all Zynq-7000 designs.

Figure.3 (a) ICAP (b) PCAP (c) JTAG

(a) (b) (c)

47

48

JPEG Encoder Hardware Accelerators : DCT & Quantization

Discrete Cosine Transform (DCT):

Converts spatial domain to frequency domain

๐‘ซ๐‘ช๐‘ป ๐’Š, ๐’‹ =๐Ÿ

๐Ÿ’๐‘ช ๐’Š ๐‘ช ๐’‹

๐’™=๐ŸŽ

๐Ÿ•

๐’š=๐ŸŽ

๐Ÿ•

๐’‘๐’Š๐’™๐’†๐’ ๐’™, ๐’š ๐œ๐จ๐ฌ๐Ÿ๐’™ + ๐Ÿ ๐’Š๐œซ

๐Ÿ๐Ÿ”๐œ๐จ๐ฌ

๐Ÿ๐’š + ๐Ÿ ๐’‹๐œซ

๐Ÿ๐Ÿ”๐Ÿ ,๐–๐ก๐ž๐ซ๐ž, ๐‘ช ๐’Œ =

๐Ÿ

๐Ÿ๐’Š๐’‡ ๐’Œ = ๐ŸŽ & ๐‘ช ๐’Œ = ๐Ÿ ๐’๐’•๐’‰๐’†๐’“๐’˜๐’Š๐’”๐’†

Quantization:

Dividing transformed image DCT matrix by quantization matrix used and rounding off

Aims at reducing most of the less important high frequency DCT coefficients to zero

๐‘ซ๐‘ช๐‘ป๐‘ธ ๐’Š, ๐’‹ = ๐‘น๐’๐’–๐’๐’…๐‘ซ๐‘ช๐‘ป(๐’Š, ๐’‹)

๐‘ธ(๐’Š, ๐’‹)(๐Ÿ)

DCT Quantz RLE Huffman

Inp

ut

Stre

am

Ou

tpu

t St

ream

48

49

JPEG Encoder Hardware Accelerators : RunLength Encoding

DCT Quantz RLE Huffman

Inp

ut

Stre

am

Ou

tpu

t St

ream

Zig Zag Scan

Encoding the difference between current and the previous 8 X 8 block

Encodes a series of zeros as a (skip, value) pair

Differential Pulse Code Modulation (DPCM)

RunLength on AC Components (RLC)

49

50

JPEG Encoder Hardware Accelerators : Entropy Coding

DCT Quantz RLE Huffman

Inp

ut

Stre

am

Ou

tpu

t St

ream

SIZE Value Code

0 0 ---

1 -1,1 0,1

2 -3, -2, 2,3 00,01,10,11

3 -7,โ€ฆ, -4, 4,โ€ฆ, 7 000,โ€ฆ, 011, 100,โ€ฆ111

4 -15,โ€ฆ, -8, 8,โ€ฆ, 15

0000,โ€ฆ, 0111, 1000,โ€ฆ, 1111

. .

. .

11 -2047,โ€ฆ, -1024, 1024,โ€ฆ 2047

โ€ฆ

DC Components

DC components are differentially coded as

(SIZE, Value)

Code for a Value is derived from the

Size_and_Value Table (Table.1)

Code for a SIZE is derived from Table.2

Example: If a DC component is 40 and the previous DC component is 48. The difference is -8. Huffman coded as: 1010111

0111: The value for representing โ€“8

(Size_and_Value table)

101: The size from the same table reads

4, which corresponds to 101 from Table.2

Table.1 Size_and_Value

SIZE Code

Length

Huffman Code

0 2 00

1 3 010

2 3 011

3 3 100

4 3 101

5 3 110

6 4 1110

7 5 11110

8 6 111110

9 7 1111110

10 8 11111110

11 9 111111110

Table.2 Huffman Table for DC component SIZE field

50

5151

JPEG Encoder Hardware Accelerators : Entropy Coding

DCT Quantz RLE Huffman

Inp

ut

Stre

am

Ou

tpu

t St

ream

AC Components: Coded as (S1,S2 pairs)

S1: (RunLength/SIZE) , where RunLength : length of the

consecutive zero values [0..15] & SIZE : No. of bits needed to code

the next nonzero AC componentโ€™s value

S2: (Value), where Value is the value of the AC component from

Table.1

Zig-Zag order -> 12,10, 1, -7 2 0s, -4, 56 zeros

12: read as zero 0s,12: (0/4)12 10111100

1011: The code for (0/4 from Table.3)

1100: The code for 12 from the Table.1

56 0s: (0,0) 1010 (Rest of the components are zeros therefore we simply put the EOB to signify this fact)

Run/

SIZE

Code

Length

Code

0/0 4 1010

0/1 2 00

0/2 2 01

0/3 3 100

0/4 4 1011

0/5 5 11010

0/6 7 1111000

0/7 8 11111000

0/8 10 1111110110

0/9 16 1111111110000010

0/A 16 1111111110000011

Run/

SIZE

Code

Length

Code

1/1 4 1100

1/2 5 11011

1/3 7 1111001

1/4 9 111110110

1/5 11 11111110110

1/6 16 1111111110000100

1/7 16 1111111110000101

1/8 16 1111111110000110

1/9 16 1111111110000111

1/A 16 1111111110001000

โ€ฆ 15/A More Such rows

Table.3 Huffman Table for AC component SIZE field

40 12 0 0 0 0 0 010 -7 -4 0 0 0 0 01 0 0 0 0 0 0 00 0 0 0 0 0 0 00 0 0 0 0 0 0 00 0 0 0 0 0 0 00 0 0 0 0 0 0 00 0 0 0 0 0 0 0

Figure.9 Example of 8 X 8 block after quantization

51

52

PRDDR Design Implementation โ€“ Floorplan View Post P & R

Figure.24 Design Checkpoints after performing Place & Route (a) Staticddr.dcp (b) DCTddr.dcp (c) Quantizationddr.dcp (d) RLEddr.dcp (e) Huffmanddr.dcp for RPBitsize = 1306.272 KB case

52

53

PRDDR Design Implementation โ€“ Experimental Results II

Table.13 JPEG Encoder Results for lena.bmp with PRDDR Design Implementation

Additional experiment results were tabulated testcase images of peppers.bmp & goldhill.bmp:

Table.14 peppers.bmp Table.15 goldhill.bmp

53

54

SPATIAL Design Implementation โ€“ Experimental Results

Table.11 JPEG Encoder results with Spatial Design Implementation

Figure.22 SSIM Maps (a) Lena (b) Goldhill (c) Peppers

(a) (b) (c)

54

55

SPATIAL Design Implementation - System Implementation and Setup

Using the Vivado IP Integrator, custom IPs of JPEG Encoder are connected

using AXI4 spatially.

Clock frequency : 50 MHz

Additional Hardware Resources :

SD Card and DDR3 connected to the external interfaces on the

Processing Side (PS) of the Zynq FPGA for storing data

Figure.21 Floorplan View

Table.10 Utilization Report

55

56

PRDDR Design Implementation - System Implementation and Setup

2 Micron DDR3 128 Megabit x 16 memory components

creating a 32-bit interface, totaling 512 MB.

The DDR3 is connected to the hard memory controller in

the Processor Subsystem (PS).

DDR3 memory is referenced using pointers in user software

application

address mappings provided in the systems.hdf file

In JPEG Encoder Implemented,

Max. No of I/O: RLE RM block &

Max. data-widths of I/O : Huffman Encoding RM block

Figure.23 PRDDR design methodology Block Diagram for dataflow computation

56

Table.12 Utilization Report Post P & R

5757

Partial Reconfiguration - Ideology & Benefits

Reduced Resource and Power Consumption:

Integrating the design into a lower FPGA IC count.

Power savings due to Reduction in off-chip communication.

Performance Improvements and Flexibility:

Computation capacity of the system adapted at run time

Additional resources for speeding up the operation of the kernel

More number of kernels to perform the operation in parallel.

Improved Fault Tolerance and Dependability:

Safety critical systems - aerospace & defense industries.

Self Adapting Hardware Designs:

Adapt to changing operating and environmental conditions based on AI &

learning.

Figure.1 Basic Ideology of PR

Partial Reconfiguration Definitionโ€“Allows the modification of an operating FPGA

by downloading Bitfile

58

PRBRAM Design Implementation โ€“ Experimental Results I

Table.7 JPEG Encoder Results for lena.bmp with PRBRAM Design Implementation

Table.8 JPEG Encoder Results for goldhill.bmp and peppers.bmp with PRBRAM Design Implementation

Figure.16 PRBRAM results for lena.bmp testcase with RPBitsize = 1306.272 KB

58

Runtime and Latency have inverse relationship

Latency and Throughput have linear relationship