
MEMOCODE 2007 HW/SW Co-design Contest

Documentation of the submission by

Eric Simpson, Pengyuan Yu, Sumit Ahuja, Sandeep Shukla, Patrick Schaumont

Electrical and Computer Engineering Department, Virginia Tech

Table of Contents

Section 1 Performance Evaluation and Analysis
Section 2 Matrix Multiplication Algorithm Optimization
Section 3 HW/SW System Implementation
Section 4 Co-design Flow and Methodology
Section 5 Conclusion

Section 1 Performance Evaluation and Analysis

Performance Results

Section 1 Performance Evaluation and Analysis

Run Time (sec)

Matrix Size             64       128      256      512      1024
Our Design (Average)    0.0052   0.0322   0.2170   1.5176   11.882
Reference               0.0346   0.6697   5.3133   42.302   338.72
Speedup                 6.65     20.8     24.5     26.9     28.5

Device Utilization

BRAM: 80 (64 coprocessor + 16 on-chip memory)
Mult: 128

Performance Calculation

F_CPU-Speed = 1 (we used the 300 MHz PPC)
F_FPGA-Capacity = 1 (we used the XUP's XC2VP30)
F_FPGA-Speed = 1 (we used a 100 MHz clock for the bus and the coprocessor)

Time_Effective = (T_meas,N=1024 + T_meas,N=256 * 64) * F_CPU-Speed * F_FPGA-Capacity * F_FPGA-Speed
               = (11.882 + 64 * 0.217) * 1 * 1 * 1
               = 25.77 seconds

Section 1 Performance Evaluation and Analysis
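For concreteness, the effective-time metric above can be reproduced with a few lines of C, using the factor values and the measured run times from the table; the variable names are ours, not part of the contest code.

#include <stdio.h>

int main(void) {
    /* All normalization factors are 1 for our setup: 300 MHz PPC,
       XC2VP30 device, 100 MHz bus/coprocessor clock (see above). */
    const double f_cpu_speed     = 1.0;
    const double f_fpga_capacity = 1.0;
    const double f_fpga_speed    = 1.0;

    const double t_meas_n1024 = 11.882; /* measured run time, N = 1024 (sec) */
    const double t_meas_n256  = 0.217;  /* measured run time, N = 256 (sec)  */

    /* Time_Effective = (T_meas,N=1024 + 64 * T_meas,N=256) * factors */
    double t_eff = (t_meas_n1024 + 64.0 * t_meas_n256)
                   * f_cpu_speed * f_fpga_capacity * f_fpga_speed;

    printf("TimeEffective = %.2f seconds\n", t_eff); /* prints 25.77 */
    return 0;
}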

Performance Results

Section 1 Performance Evaluation and Analysis

[Chart: speedup factor against the reference design (blocked by 16), by test-case matrix size. X-axis: matrix size (64, 128, 256, 512, 1024); Y-axis: speedup factor (0 to 30). Data points: 6.66, 20.79, 24.49, 26.92, 28.51.]

Section 2 Matrix Multiplication Algorithm Optimization

Algorithm Optimization

The algorithm is optimized for the target platform (Virtex-II Pro VP30).

Optimization goal: best utilize the slow DDR memory interface.
• Optimally, 128-bit/cycle transfers => 4 complex numbers per cycle.
• Linear accesses result in better throughput (illustrated by the sketch after this slide).

Utilize as many of the fast discrete FPGA resources as possible:
• 136 18x18 hardware multipliers
• 136 18-kbit Block RAMs

Section 2 Matrix Multiplication Algorithm Optimization
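As a small C illustration (ours, not from the slides) of the linear-access point: the two functions below traverse the same N x N array, once sequentially and once column-wise. DDR delivers far better throughput on the sequential pattern, which is why the algorithm is arranged to scan B linearly.

#include <stddef.h>

/* Sequential traversal: addresses increase by one word each access,
   the DDR-friendly pattern the design aims for. */
long sum_linear(const int *m, size_t n) {
    long s = 0;
    for (size_t i = 0; i < n * n; i++)
        s += m[i];
    return s;
}

/* Column-wise traversal: addresses jump by n words each access,
   repeatedly reopening DDR rows and wasting bandwidth. */
long sum_strided(const int *m, size_t n) {
    long s = 0;
    for (size_t j = 0; j < n; j++)
        for (size_t i = 0; i < n; i++)
            s += m[i * n + j];
    return s;
}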

Optimized Algorithm

[Animated figure, advanced step by step in the original slides: matrices A, B, and C, highlighted according to this legend:]
• [A] currently in coprocessor
• [A] currently used for calculation
• [B] currently used for calculation
• [C] stored and accumulated in BRAM
• [C] being multiplied and accumulated

Section 2 Matrix Multiplication Algorithm Optimization


Bring in 4 complex numbers from “A”


• Bring in four numbers from "B" and perform the following calculations (a C sketch of this step follows the list):

C[0][0] = C[0][0] + A[0][0]*B[0][0]
C[0][1] = C[0][1] + A[0][0]*B[0][1]
C[0][2] = C[0][2] + A[0][0]*B[0][2]
C[0][3] = C[0][3] + A[0][0]*B[0][3]
…
C[7][0] = C[7][0] + A[7][0]*B[0][0]
C[7][1] = C[7][1] + A[7][0]*B[0][1]
C[7][2] = C[7][2] + A[7][0]*B[0][2]
C[7][3] = C[7][3] + A[7][0]*B[0][3]

• Here "A*B" denotes a complex multiplication.

• 32 complex multiplications in parallel = 128 multiplies, 64 additions/subtractions, and 64 accumulates per cycle.
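A minimal C sketch of one such parallel update step (ours, not the contest code). We assume 16-bit fixed-point real and imaginary parts, consistent with the 128-bit / 4-complex-number transfers noted earlier; the hardware's exact number format and rounding are assumptions.

typedef struct { short re, im; } cplx;  /* 16+16-bit fixed point (assumed) */

/* One complex multiply-accumulate, c += a * b:
   4 real multiplies, 2 adds/subs for the product, 2 adds to accumulate. */
static void cmac(cplx *c, cplx a, cplx b) {
    int re = (int)a.re * b.re - (int)a.im * b.im;
    int im = (int)a.re * b.im + (int)a.im * b.re;
    c->re = (short)(c->re + re);  /* truncation behavior is an assumption */
    c->im = (short)(c->im + im);
}

/* One cycle's worth of work: an 8-element column of the A block against
   4 numbers of B, accumulated into an 8x4 patch of the C slice.
   8 * 4 = 32 cmac calls = 128 multiplies, 64 adds/subs, 64 accumulates. */
void block_update(cplx C[8][4], const cplx Acol[8], const cplx B4[4]) {
    for (int i = 0; i < 8; i++)
        for (int j = 0; j < 4; j++)
            cmac(&C[i][j], Acol[i], B4[j]);
}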


At this point, we have completed calculating the first 8xN rows of C in our coprocessor, and we write the results back to RAM.


Next, we repeat the previous algorithm to calculate the next 8xN C-slice.

Optimized Algorithm

Performs 128 MACs per cycle (utilizing 128 of the 136 hard multipliers)

Linear scan through the B matrix (optimizing the interface to DDR storage)

A behavioral C sketch of the full blocked loop is given below.

Section 2 Matrix Multiplication Algorithm Optimization
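The sketch below is a software model of the whole blocking scheme summarized above, reusing cplx and cmac from the earlier sketch. It mirrors the coprocessor's schedule but is plain C, not the GEZEL hardware; row-major storage, a zero-initialized C, and N divisible by the block sizes are our assumptions.

enum { A_ROWS = 8,   /* rows per A sub-block, i.e. per C slice */
       B_COLS = 4 }; /* B numbers consumed per step            */

void block_mmul(int N, const cplx *A, const cplx *B, cplx *C) {
    for (int i0 = 0; i0 < N; i0 += A_ROWS) {        /* one 8xN C slice */
        for (int k = 0; k < N; k++) {               /* one column of the A block */
            for (int j0 = 0; j0 < N; j0 += B_COLS)  /* linear scan of B row k */
                for (int i = 0; i < A_ROWS; i++)
                    for (int j = 0; j < B_COLS; j++)
                        cmac(&C[(i0 + i) * N + (j0 + j)],
                             A[(i0 + i) * N + k],
                             B[k * N + (j0 + j)]);
        }
        /* In hardware, the finished 8xN slice is written back to RAM here;
           in this software model, C[] is simply updated in place. */
    }
}

Because k advances outermost within a slice and j0 sweeps each row of B in order, B's addresses increase sequentially, which is exactly the linear DDR scan claimed above.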

Section 3 HW/SW System Implementation

System Architecture

[Figure: system architecture; the components communicate over the Processor Local Bus.]

Section 3 HW/SW System Implementation

Coprocessor Architecture vs. Optimized Algorithm

Minor deviation from the proposed algorithm. The I/O size for the coprocessor changed: B elements are loaded 2 at a time instead of 4.
• The PLB DMA failed to function, resulting in a much slower {DDR->PPC->Coprocessor FIFO} datapath.
• The FIFO width of 64 bits => 2-number sends from the PPC to the coprocessor FIFO.

To maintain the SAME calculation capacity:
• The A-block dimension is doubled from 8x4 to 16x4.
• The C-slice is doubled from 8xN to 16xN.
• This still utilizes 128 hardware multipliers (16 rows x 2 B values gives the same 32 parallel complex multiplications).

A sketch of the software side of this datapath follows below.

Section 3 HW/SW System Implementation
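For illustration, a hedged C sketch of the software half of the {DDR->PPC->Coprocessor FIFO} datapath, streaming B two 32-bit complex numbers per 64-bit FIFO word. The two addresses match the HW-SW interface example in Section 4, but the status-word meaning and the packing order are our assumptions, not the contest code.

#include <stdint.h>

#define FIFO_WRITE  (*(volatile uint32_t *)0x80000000)
#define FIFO_STATUS (*(volatile uint32_t *)0x80000004)

/* Stream one strip of B into the coprocessor FIFO, two 32-bit complex
   numbers per 64-bit FIFO word (hence the "2-number sends"). We assume
   count is even and that a nonzero status means the FIFO is not ready. */
static void stream_b(const uint32_t *b, int count) {
    for (int i = 0; i < count; i += 2) {
        while (FIFO_STATUS != 0)  /* poll until the FIFO can accept data */
            ;
        FIFO_WRITE = b[i];        /* low half of the 64-bit FIFO word  */
        FIFO_WRITE = b[i + 1];    /* high half                         */
    }
}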

Coprocessor Architecture

The coprocessor is scalable! Reduce the depth of the A-matrix sub-block to reduce the number of MACs needed; for example, halving the sub-block depth from 16 to 8 rows halves the multiplier usage from 128 to 64.

Section 3 HW/SW System Implementation

Coprocessor Architecture

[Figure: block diagram of the coprocessor architecture.]

Section 3 HW/SW System Implementation

MAC Unit Architecture

[Figure, shown twice with callouts: the MAC array consists of rows Row[0] through Row[15]. Input "B" values B(0,0) and B(0,1) are broadcast to every row; each row's "A" value A(i,0) feeds two complex multiply-accumulate units (marked X) that update C(i,0) and C(i,1). The callouts label the complex multiply, the accumulate, and the BlockRAM storage for the current "C" value.]

Section 3 HW/SW System Implementation

Section 4 Co-design Flow and Methodology

Design Flow

[Flow diagram: a Rectangular-Block Transformation turns the Reference C Algorithm into the Optimized C Algorithm; Manual Partitioning splits it into a Driver C Algorithm and a GEZEL Coprocessor; these are verified by Cosimulation, then Synthesis produces VHDL and a PPC Binary for the XUP Board, which feeds Performance Analysis.]

Section 4 Co-design Flow and Methodology

Simulation

[Flow diagram, annotated with execution platforms: the Reference and Optimized C Algorithms run on the workstation; the Driver C Algorithm and GEZEL Coprocessor run in the cycle-based instruction-set cosimulator; the VHDL and PPC Binary run on the FPGA of the XUP Board.]

Section 4 Co-design Flow and Methodology

Simulation

Simulation-based verification on three levels:
• workstation (behavioral)
• cycle-based ISS (functional model of coprocessor)
• FPGA board (skipping VHDL simulation, since synthesis is swift and easy)

Drawback: simulations capture only behavior, not the architecture.
• Example: it is hard to estimate post-synthesis timing.
• Example: it is hard to reflect memory-bus behavior (DMA, DDR, ...) in a C simulation model.

Section 4 Co-design Flow and Methodology

Cycle-based Instruction-set Simulation

Uses the GEZEL cosimulation tool: http://rijndael.ece.vt.edu/gezel2

[Figure: cosimulation setup. The Application SW (C code) runs on an executable instruction-set simulator (uP with DDR); cosimulation interfaces (an "N" register, FIFO IN, FIFO OUT) connect it to the coprocessor hardware.]

Section 4 Co-design Flow and Methodology

Cycle-based Instruction-set Simulation

We need cycle-based cosimulation of software and hardware before synthesis.
• The coprocessor is mapped in FSMD semantics (a modular, bottom-up hardware description).
• The cosimulation interfaces are captured with GEZEL simulation primitives:
  • memory-mapped register
  • FIFO based (with request/acknowledge handshake)

Section 4 Co-design Flow and Methodology

HW-SW Interface Example

GEZEL code for the interface block:

ipblock fsl1(out data   : ns(32);
             out exists : ns(1);
             in  read   : ns(1)) {
  iptype "armfslslave";
  ipparm "core=ppc";
  ipparm "write=0x80000000";
  ipparm "status=0x80000004";
}

[Figure: fsl1 sits between the ISS and the coprocessor; its data, exists, and read ports connect to the coprocessor, while the ipparm addresses connect it to the ISS.]

• PPC SW can write to address 0x80000000; this drives the data output and performs the handshake.
• PPC SW can check the status with a read from 0x80000004.
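On the PPC side, this interface reduces to plain memory-mapped accesses. A minimal C sketch (ours) using the two addresses from the GEZEL code above; the exact meaning of the status word is an assumption (we treat nonzero as "not ready").

#include <stdint.h>

#define FSL1_WRITE  (*(volatile uint32_t *)0x80000000) /* drives 'data', handshakes */
#define FSL1_STATUS (*(volatile uint32_t *)0x80000004) /* channel status            */

/* Send one 32-bit word to the coprocessor through fsl1: the write at
   0x80000000 drives the data port and performs the handshake. */
static void fsl1_send(uint32_t word) {
    while (FSL1_STATUS != 0)   /* poll status at 0x80000004 (assumed polarity) */
        ;
    FSL1_WRITE = word;
}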

Section 4 Co-design Flow and Methodology

Synthesis

[Figure: the same cosimulation setup as above (Application SW on the instruction-set simulator, cosimulation interfaces, coprocessor hardware), now carried through synthesis.]

• Automatic conversion to hierarchical RTL-VHDL, with black-boxes for the cosimulation interfaces.
• Xilinx EDK + ISE.

Section 4 Co-design Flow and Methodology

Conclusions

Matrix multiplication can be sped up by 25 times over the standard reference C implementation, thanks to:
• Rectangular blocking
• Dedicated, highly scalable coprocessor hardware
• An integrated design flow

Conclusions

Remaining challenges:
• Memory bottleneck (hardware/software codesign yields ~7% computation time and 93% memory access time).
• Further optimization is possible using DMA and data-caching schemes.

Conclusions

Challenge to the MEMOCODE community: accurate system-level modeling of platform artifacts to support the designer.