MEMOCODE 2007 HW/SW Co-design Contest Documentation of the submission by Eric Simpson Pengyuan Yu...

57
MEMOCODE 2007 HW/SW Co-design Contest Documentation of the submission by Eric Simpson Pengyuan Yu Sumit Ahuja Sandeep Shukla Patrick Schaumont Electrical and Computer Engineering Department Virginia Tech
  • date post

    15-Jan-2016
  • Category

    Documents

  • view

    220
  • download

    0

Transcript of MEMOCODE 2007 HW/SW Co-design Contest Documentation of the submission by Eric Simpson Pengyuan Yu...

Page 1: MEMOCODE 2007 HW/SW Co-design Contest Documentation of the submission by Eric Simpson Pengyuan Yu Sumit Ahuja Sandeep Shukla Patrick Schaumont Electrical.

MEMOCODE 2007HW/SW Co-design Contest

Documentation of the submission by

Eric Simpson

Pengyuan Yu

Sumit Ahuja

Sandeep Shukla

Patrick Schaumont

Electrical and Computer Engineering

Department

Virginia Tech

Page 2: MEMOCODE 2007 HW/SW Co-design Contest Documentation of the submission by Eric Simpson Pengyuan Yu Sumit Ahuja Sandeep Shukla Patrick Schaumont Electrical.

Table of Contents

Section 1 Performance Evaluation and Analysis Section 2 Matrix Multiplication Algorithm Optimization Section 3 HW/SW System Implementation Section 4 Co-design Flow and Methodology Section 5 Conclusion

Page 3: MEMOCODE 2007 HW/SW Co-design Contest Documentation of the submission by Eric Simpson Pengyuan Yu Sumit Ahuja Sandeep Shukla Patrick Schaumont Electrical.

Section 1Performance Evaluation and

Analysis

Page 4: MEMOCODE 2007 HW/SW Co-design Contest Documentation of the submission by Eric Simpson Pengyuan Yu Sumit Ahuja Sandeep Shukla Patrick Schaumont Electrical.

Performance Results

Section 1 Performance Evaluation and Analysis

Matrix Size 64 128 256 512 1024

Run Time

(sec)

Our Design

(Average)0.0052 0.0322 0.2170 1.5176 11.882

Reference 0.0346 0.6697 5.3133 42.302 338.72

SpeedUp 6.65 20.8 24.5 26.9 28.5

Device

Utilization

BRAM 80 (64 Coprocessor + 16 On-Chip-Memory)

Mult 128

Page 5: MEMOCODE 2007 HW/SW Co-design Contest Documentation of the submission by Eric Simpson Pengyuan Yu Sumit Ahuja Sandeep Shukla Patrick Schaumont Electrical.

Performance Calculation

FCPU-Speed = 1, we used 300Mhz PPC

FFPGA-Capacity = 1, we used XUP’s XC2VP30

FFPGA-speed = 1, we used 100Mhz clock for bus and coprocessor

TimeEffective = (Tmeas,N=1024 + Tmeas,N=256 * 64) *

FCPU-Speed * FFPGA-Capacity * FFPGA-speed

= (11.882 + 64*0.217) * 1 * 1 * 1

= 25.77 secondsSection 1 Performance Evaluation and Analysis

Page 6: MEMOCODE 2007 HW/SW Co-design Contest Documentation of the submission by Eric Simpson Pengyuan Yu Sumit Ahuja Sandeep Shukla Patrick Schaumont Electrical.

Performance Results

Section 1 Performance Evaluation and Analysis

Speed Up Factor Against Reference Design (Blocked by 16)

6.66

20.79

24.4926.92

28.51

0.00

5.00

10.00

15.00

20.00

25.00

30.00

64 128 256 512 1,024

Test Case Matrix Size

Spe

edU

p Fa

ctor

Page 7: MEMOCODE 2007 HW/SW Co-design Contest Documentation of the submission by Eric Simpson Pengyuan Yu Sumit Ahuja Sandeep Shukla Patrick Schaumont Electrical.

Section 2Matrix Multiplication

Algorithm Optimization

Page 8: MEMOCODE 2007 HW/SW Co-design Contest Documentation of the submission by Eric Simpson Pengyuan Yu Sumit Ahuja Sandeep Shukla Patrick Schaumont Electrical.

Algorithm Optimization

Algorithm is optimized based on targeting platform (Virtex2 Pro VP30)

Optimization goal: Best utilized the slow DDR Memory Interface

Optimally 128-bit/cycle transfers => 4 Complex Numbers Linear accesses result in better throughput

Utilize as many fast discrete FPGA Resources as possible 136 18x18-Hardware Multipliers 136 18kbits Block Rams

Section 2 Matrix Multiplication Algorithm Optimization

Page 9: MEMOCODE 2007 HW/SW Co-design Contest Documentation of the submission by Eric Simpson Pengyuan Yu Sumit Ahuja Sandeep Shukla Patrick Schaumont Electrical.

• [A] currently in coprocessor

• [A] currently used for calculation

• [B] currently used for calculation

• [C] stored and accumulated in BRAM

• [C] being multiplied and accumulated

Optimized Algorithm

A

B

Section 2 Matrix Multiplication Algorithm Optimization

C

Page 10: MEMOCODE 2007 HW/SW Co-design Contest Documentation of the submission by Eric Simpson Pengyuan Yu Sumit Ahuja Sandeep Shukla Patrick Schaumont Electrical.

Optimized Algorithm

A

B

C

Bring in 4 complex numbers from “A”

• [A] currently in coprocessor

• [A] currently used for calculation

• [B] currently used for calculation

• [C] stored and accumulated in BRAM

• [C] being multiplied and accumulated

Section 2 Matrix Multiplication Algorithm Optimization

C

Page 11: MEMOCODE 2007 HW/SW Co-design Contest Documentation of the submission by Eric Simpson Pengyuan Yu Sumit Ahuja Sandeep Shukla Patrick Schaumont Electrical.

Optimized Algorithm

A

B

C

• [A] currently in coprocessor

• [A] currently used for calculation

• [B] currently used for calculation

• [C] stored and accumulated in BRAM

• [C] being multiplied and accumulated

Section 2 Matrix Multiplication Algorithm Optimization

C

Page 12: MEMOCODE 2007 HW/SW Co-design Contest Documentation of the submission by Eric Simpson Pengyuan Yu Sumit Ahuja Sandeep Shukla Patrick Schaumont Electrical.

Optimized Algorithm

A

B

C

• [A] currently in coprocessor

• [A] currently used for calculation

• [B] currently used for calculation

• [C] stored and accumulated in BRAM

• [C] being multiplied and accumulated

Section 2 Matrix Multiplication Algorithm Optimization

C

Page 13: MEMOCODE 2007 HW/SW Co-design Contest Documentation of the submission by Eric Simpson Pengyuan Yu Sumit Ahuja Sandeep Shukla Patrick Schaumont Electrical.

Optimized Algorithm

A

B

C

• [A] currently in coprocessor

• [A] currently used for calculation

• [B] currently used for calculation

• [C] stored and accumulated in BRAM

• [C] being multiplied and accumulated

Section 2 Matrix Multiplication Algorithm Optimization

C

Page 14: MEMOCODE 2007 HW/SW Co-design Contest Documentation of the submission by Eric Simpson Pengyuan Yu Sumit Ahuja Sandeep Shukla Patrick Schaumont Electrical.

Optimized Algorithm

A

B

C

• [A] currently in coprocessor

• [A] currently used for calculation

• [B] currently used for calculation

• [C] stored and accumulated in BRAM

• [C] being multiplied and accumulated

Section 2 Matrix Multiplication Algorithm Optimization

C

Page 15: MEMOCODE 2007 HW/SW Co-design Contest Documentation of the submission by Eric Simpson Pengyuan Yu Sumit Ahuja Sandeep Shukla Patrick Schaumont Electrical.

Optimized Algorithm

A

B

C

• [A] currently in coprocessor

• [A] currently used for calculation

• [B] currently used for calculation

• [C] stored and accumulated in BRAM

• [C] being multiplied and accumulated

Section 2 Matrix Multiplication Algorithm Optimization

C

• Bring in four numbers from “B” and perform the following calculations:

C[0][0] = C[0][0] + A[0][0]*B[0][0]C[0][1] = C[0][0] + A[0][0]*B[0][1]C[0][2] = C[0][0] + A[0][0]*B[0][2]C[0][3] = C[0][0] + A[0][0]*B[0][3]…C[8][0] = C[8][0] + A[8][0]*B[0][0]C[8][1] = C[8][0] + A[8][0]*B[0][1]C[8][2] = C[8][0] + A[8][0]*B[0][2]C[8][3] = C[8][0] + A[8][0]*B[0][3]

• Where “A*B” is a complex multiplication.

• 32 Complex multiplication in parallel = 128 multiplies, 64 additions/subtractions and 64 accumulates per cycle

Page 16: MEMOCODE 2007 HW/SW Co-design Contest Documentation of the submission by Eric Simpson Pengyuan Yu Sumit Ahuja Sandeep Shukla Patrick Schaumont Electrical.

Optimized Algorithm

A

B

C

• [A] currently in coprocessor

• [A] currently used for calculation

• [B] currently used for calculation

• [C] stored and accumulated in BRAM

• [C] being multiplied and accumulated

Section 2 Matrix Multiplication Algorithm Optimization

C

Page 17: MEMOCODE 2007 HW/SW Co-design Contest Documentation of the submission by Eric Simpson Pengyuan Yu Sumit Ahuja Sandeep Shukla Patrick Schaumont Electrical.

Optimized Algorithm

A

B

C

• [A] currently in coprocessor

• [A] currently used for calculation

• [B] currently used for calculation

• [C] stored and accumulated in BRAM

• [C] being multiplied and accumulated

Section 2 Matrix Multiplication Algorithm Optimization

C

Page 18: MEMOCODE 2007 HW/SW Co-design Contest Documentation of the submission by Eric Simpson Pengyuan Yu Sumit Ahuja Sandeep Shukla Patrick Schaumont Electrical.

Optimized Algorithm

A

B

C

• [A] currently in coprocessor

• [A] currently used for calculation

• [B] currently used for calculation

• [C] stored and accumulated in BRAM

• [C] being multiplied and accumulated

Section 2 Matrix Multiplication Algorithm Optimization

C

Page 19: MEMOCODE 2007 HW/SW Co-design Contest Documentation of the submission by Eric Simpson Pengyuan Yu Sumit Ahuja Sandeep Shukla Patrick Schaumont Electrical.

Optimized Algorithm

A

B

C

• [A] currently in coprocessor

• [A] currently used for calculation

• [B] currently used for calculation

• [C] stored and accumulated in BRAM

• [C] being multiplied and accumulated

Section 2 Matrix Multiplication Algorithm Optimization

C

Page 20: MEMOCODE 2007 HW/SW Co-design Contest Documentation of the submission by Eric Simpson Pengyuan Yu Sumit Ahuja Sandeep Shukla Patrick Schaumont Electrical.

Optimized Algorithm

A

B

C

• [A] currently in coprocessor

• [A] currently used for calculation

• [B] currently used for calculation

• [C] stored and accumulated in BRAM

• [C] being multiplied and accumulated

Section 2 Matrix Multiplication Algorithm Optimization

C

Page 21: MEMOCODE 2007 HW/SW Co-design Contest Documentation of the submission by Eric Simpson Pengyuan Yu Sumit Ahuja Sandeep Shukla Patrick Schaumont Electrical.

Optimized Algorithm

A

B

C

• [A] currently in coprocessor

• [A] currently used for calculation

• [B] currently used for calculation

• [C] stored and accumulated in BRAM

• [C] being multiplied and accumulated

Section 2 Matrix Multiplication Algorithm Optimization

C

Page 22: MEMOCODE 2007 HW/SW Co-design Contest Documentation of the submission by Eric Simpson Pengyuan Yu Sumit Ahuja Sandeep Shukla Patrick Schaumont Electrical.

Optimized Algorithm

A

B

C

• [A] currently in coprocessor

• [A] currently used for calculation

• [B] currently used for calculation

• [C] stored and accumulated in BRAM

• [C] being multiplied and accumulated

Section 2 Matrix Multiplication Algorithm Optimization

C

Page 23: MEMOCODE 2007 HW/SW Co-design Contest Documentation of the submission by Eric Simpson Pengyuan Yu Sumit Ahuja Sandeep Shukla Patrick Schaumont Electrical.

Optimized Algorithm

A

B

C

• [A] currently in coprocessor

• [A] currently used for calculation

• [B] currently used for calculation

• [C] stored and accumulated in BRAM

• [C] being multiplied and accumulated

Section 2 Matrix Multiplication Algorithm Optimization

C

Page 24: MEMOCODE 2007 HW/SW Co-design Contest Documentation of the submission by Eric Simpson Pengyuan Yu Sumit Ahuja Sandeep Shukla Patrick Schaumont Electrical.

Optimized Algorithm

A

B

C

• [A] currently in coprocessor

• [A] currently used for calculation

• [B] currently used for calculation

• [C] stored and accumulated in BRAM

• [C] being multiplied and accumulated

Section 2 Matrix Multiplication Algorithm Optimization

C

Page 25: MEMOCODE 2007 HW/SW Co-design Contest Documentation of the submission by Eric Simpson Pengyuan Yu Sumit Ahuja Sandeep Shukla Patrick Schaumont Electrical.

Optimized Algorithm

A

B

C

• [A] currently in coprocessor

• [A] currently used for calculation

• [B] currently used for calculation

• [C] stored and accumulated in BRAM

• [C] being multiplied and accumulated

Section 2 Matrix Multiplication Algorithm Optimization

C

Page 26: MEMOCODE 2007 HW/SW Co-design Contest Documentation of the submission by Eric Simpson Pengyuan Yu Sumit Ahuja Sandeep Shukla Patrick Schaumont Electrical.

Optimized Algorithm

A

B

C

• [A] currently in coprocessor

• [A] currently used for calculation

• [B] currently used for calculation

• [C] stored and accumulated in BRAM

• [C] being multiplied and accumulated

Section 2 Matrix Multiplication Algorithm Optimization

C

Page 27: MEMOCODE 2007 HW/SW Co-design Contest Documentation of the submission by Eric Simpson Pengyuan Yu Sumit Ahuja Sandeep Shukla Patrick Schaumont Electrical.

Optimized Algorithm

A

B

C

• [A] currently in coprocessor

• [A] currently used for calculation

• [B] currently used for calculation

• [C] stored and accumulated in BRAM

• [C] being multiplied and accumulated

Section 2 Matrix Multiplication Algorithm Optimization

C

Page 28: MEMOCODE 2007 HW/SW Co-design Contest Documentation of the submission by Eric Simpson Pengyuan Yu Sumit Ahuja Sandeep Shukla Patrick Schaumont Electrical.

Optimized Algorithm

A

B

C

• [A] currently in coprocessor

• [A] currently used for calculation

• [B] currently used for calculation

• [C] stored and accumulated in BRAM

• [C] being multiplied and accumulated

Section 2 Matrix Multiplication Algorithm Optimization

C

Page 29: MEMOCODE 2007 HW/SW Co-design Contest Documentation of the submission by Eric Simpson Pengyuan Yu Sumit Ahuja Sandeep Shukla Patrick Schaumont Electrical.

Optimized Algorithm

A

B

C

• [A] currently in coprocessor

• [A] currently used for calculation

• [B] currently used for calculation

• [C] stored and accumulated in BRAM

• [C] being multiplied and accumulated

Section 2 Matrix Multiplication Algorithm Optimization

C

Page 30: MEMOCODE 2007 HW/SW Co-design Contest Documentation of the submission by Eric Simpson Pengyuan Yu Sumit Ahuja Sandeep Shukla Patrick Schaumont Electrical.

Optimized Algorithm

A

B

C

• [A] currently in coprocessor

• [A] currently used for calculation

• [B] currently used for calculation

• [C] stored and accumulated in BRAM

• [C] being multiplied and accumulated

Section 2 Matrix Multiplication Algorithm Optimization

C

Page 31: MEMOCODE 2007 HW/SW Co-design Contest Documentation of the submission by Eric Simpson Pengyuan Yu Sumit Ahuja Sandeep Shukla Patrick Schaumont Electrical.

Optimized Algorithm

A

B

C

• [A] currently in coprocessor

• [A] currently used for calculation

• [B] currently used for calculation

• [C] stored and accumulated in BRAM

• [C] being multiplied and accumulated

Section 2 Matrix Multiplication Algorithm Optimization

C

Page 32: MEMOCODE 2007 HW/SW Co-design Contest Documentation of the submission by Eric Simpson Pengyuan Yu Sumit Ahuja Sandeep Shukla Patrick Schaumont Electrical.

Optimized Algorithm

A

B

C

• [A] currently in coprocessor

• [A] currently used for calculation

• [B] currently used for calculation

• [C] stored and accumulated in BRAM

• [C] being multiplied and accumulated

Section 2 Matrix Multiplication Algorithm Optimization

C

Page 33: MEMOCODE 2007 HW/SW Co-design Contest Documentation of the submission by Eric Simpson Pengyuan Yu Sumit Ahuja Sandeep Shukla Patrick Schaumont Electrical.

Optimized Algorithm

A

B

C

• [A] currently in coprocessor

• [A] currently used for calculation

• [B] currently used for calculation

• [C] stored and accumulated in BRAM

• [C] being multiplied and accumulated

Section 2 Matrix Multiplication Algorithm Optimization

C

Page 34: MEMOCODE 2007 HW/SW Co-design Contest Documentation of the submission by Eric Simpson Pengyuan Yu Sumit Ahuja Sandeep Shukla Patrick Schaumont Electrical.

Optimized Algorithm

A

B

C

• [A] currently in coprocessor

• [A] currently used for calculation

• [B] currently used for calculation

• [C] stored and accumulated in BRAM

• [C] being multiplied and accumulated

Section 2 Matrix Multiplication Algorithm Optimization

C

Page 35: MEMOCODE 2007 HW/SW Co-design Contest Documentation of the submission by Eric Simpson Pengyuan Yu Sumit Ahuja Sandeep Shukla Patrick Schaumont Electrical.

Optimized Algorithm

A

B

C

• [A] currently in coprocessor

• [A] currently used for calculation

• [B] currently used for calculation

• [C] stored and accumulated in BRAM

• [C] being multiplied and accumulated

Section 2 Matrix Multiplication Algorithm Optimization

C

Page 36: MEMOCODE 2007 HW/SW Co-design Contest Documentation of the submission by Eric Simpson Pengyuan Yu Sumit Ahuja Sandeep Shukla Patrick Schaumont Electrical.

Optimized Algorithm

A

B

C

• [A] currently in coprocessor

• [A] currently used for calculation

• [B] currently used for calculation

• [C] stored and accumulated in BRAM

• [C] being multiplied and accumulated

Section 2 Matrix Multiplication Algorithm Optimization

C

At this point we have completed calculating the first 8xN rows of C in our coprocessor and we write the results back to RAM

Page 37: MEMOCODE 2007 HW/SW Co-design Contest Documentation of the submission by Eric Simpson Pengyuan Yu Sumit Ahuja Sandeep Shukla Patrick Schaumont Electrical.

Optimized Algorithm

A

B

C

• [A] currently in coprocessor

• [A] currently used for calculation

• [B] currently used for calculation

• [C] stored and accumulated in BRAM

• [C] being multiplied and accumulated

Section 2 Matrix Multiplication Algorithm Optimization

C

Page 38: MEMOCODE 2007 HW/SW Co-design Contest Documentation of the submission by Eric Simpson Pengyuan Yu Sumit Ahuja Sandeep Shukla Patrick Schaumont Electrical.

Optimized Algorithm

A

B

C

• [A] currently in coprocessor

• [A] currently used for calculation

• [B] currently used for calculation

• [C] stored and accumulated in BRAM

• [C] being multiplied and accumulated

Section 2 Matrix Multiplication Algorithm Optimization

Next, we repeat the previous algorithm to

calculate the next “8xN CSlice”

Page 39: MEMOCODE 2007 HW/SW Co-design Contest Documentation of the submission by Eric Simpson Pengyuan Yu Sumit Ahuja Sandeep Shukla Patrick Schaumont Electrical.

Optimized Algorithm

Performs 128 MACs per cycle (utilizing 128 out of 136 hard multipliers)

Linear scan through B matrix (optimizing interface to DDR storage)

Section 2 Matrix Multiplication Algorithm Optimization

Page 40: MEMOCODE 2007 HW/SW Co-design Contest Documentation of the submission by Eric Simpson Pengyuan Yu Sumit Ahuja Sandeep Shukla Patrick Schaumont Electrical.

Section 3HW/SW System Implementation

Page 41: MEMOCODE 2007 HW/SW Co-design Contest Documentation of the submission by Eric Simpson Pengyuan Yu Sumit Ahuja Sandeep Shukla Patrick Schaumont Electrical.

System Architecture

Processor Local Bus

Section 3 HW/SW System Implementation

Page 42: MEMOCODE 2007 HW/SW Co-design Contest Documentation of the submission by Eric Simpson Pengyuan Yu Sumit Ahuja Sandeep Shukla Patrick Schaumont Electrical.

Minor deviation from proposed algorithm I/O size for coprocessor: B elements are loaded 2

at a time instead of 4 PLB DMA failed to function resulting in a much slower

{DDR->PPC->Coprocessor FIFO} datapath. FIFO width of 64-bit => 2-number sends from PPC to

Coprocessor FIFO

To maintain SAME calculation capacity: A-Block dimension doubled from 8x4 to 16x4. C-Slice doubled from 8xN to 16xN Still utilizes 128 Hardware Multipliers.

Coprocessor Architecture vs. Optimized Algorithm

Section 3 HW/SW System Implementation

Page 43: MEMOCODE 2007 HW/SW Co-design Contest Documentation of the submission by Eric Simpson Pengyuan Yu Sumit Ahuja Sandeep Shukla Patrick Schaumont Electrical.

Coprocessor Architecture

Coprocessor is scalable! Reduce the depth of the A-matrix subblock to

reduce the amount of MAC needed

Section 3 HW/SW System Implementation

Page 44: MEMOCODE 2007 HW/SW Co-design Contest Documentation of the submission by Eric Simpson Pengyuan Yu Sumit Ahuja Sandeep Shukla Patrick Schaumont Electrical.

Coprocessor Architecture

Section 3 HW/SW System Implementation

Page 45: MEMOCODE 2007 HW/SW Co-design Contest Documentation of the submission by Eric Simpson Pengyuan Yu Sumit Ahuja Sandeep Shukla Patrick Schaumont Electrical.

MAC Unit ArchitectureB(0,0) B(0,1)

A(0,0)

C(0,1)

X

C(0,0)

X

A(L,0)

C(L,1)

X

C(L,0)

X

......

...

Row[0]

Row[15]

Section 3 HW/SW System Implementation

Page 46: MEMOCODE 2007 HW/SW Co-design Contest Documentation of the submission by Eric Simpson Pengyuan Yu Sumit Ahuja Sandeep Shukla Patrick Schaumont Electrical.

MAC Unit ArchitectureB(0,0) B(0,1)

A(0,0)

C(0,1)

X

C(0,0)

X

A(L,0)

C(L,1)

X

C(L,0)

X

......

...

Row[0]

Row[15]

Complex Multiply

Accumulate

BlockRAM Storage for current “C” value

Input “B” Value

“A” Values

Section 3 HW/SW System Implementation

Page 47: MEMOCODE 2007 HW/SW Co-design Contest Documentation of the submission by Eric Simpson Pengyuan Yu Sumit Ahuja Sandeep Shukla Patrick Schaumont Electrical.

Section 4Co-design Flow and

Methodology

Page 48: MEMOCODE 2007 HW/SW Co-design Contest Documentation of the submission by Eric Simpson Pengyuan Yu Sumit Ahuja Sandeep Shukla Patrick Schaumont Electrical.

Design Flow ReferenceC Algorithm

OptimizedC Algorithm

Driver CAlgorithm

GEZELCoprocessor

VHDLPPC Binary

XUP Board

Manual Partitioning

Rectangular-BlockTransformation

Cosimulation

Synthesis

PerformanceAnalysis

Section 4 Co-design Flow and Methodology

Page 49: MEMOCODE 2007 HW/SW Co-design Contest Documentation of the submission by Eric Simpson Pengyuan Yu Sumit Ahuja Sandeep Shukla Patrick Schaumont Electrical.

Simulation ReferenceC Algorithm

OptimizedC Algorithm

Driver CAlgorithm

GEZELCoprocessor

VHDLPPC Binary

XUP Board

workstation

cycle-basedinstruction-set

cosimulator

FPGA

Section 4 Co-design Flow and Methodology

Page 50: MEMOCODE 2007 HW/SW Co-design Contest Documentation of the submission by Eric Simpson Pengyuan Yu Sumit Ahuja Sandeep Shukla Patrick Schaumont Electrical.

Simulation Simulation-based verification on three levels

workstation (behavioral) cycle-based ISS (functional model of coprocessor) FPGA board (skipping VHDL simulation since

synthesis is swift and easy) Drawback - simulations capture only behavior,

but not the architecture. Example: Hard to estimate post-synthesis timing Example: Hard to reflect memory-bus behavior

(DMA, DDR, ...) in a C simulation model

Section 4 Co-design Flow and Methodology

Page 51: MEMOCODE 2007 HW/SW Co-design Contest Documentation of the submission by Eric Simpson Pengyuan Yu Sumit Ahuja Sandeep Shukla Patrick Schaumont Electrical.

Cycle-based Instruction-set Simulation Uses GEZEL Cosimulation Tool

http://rijndael.ece.vt.edu/gezel2

Application SW(C Code)

uP DDR

“N” Reg FIFO IN FIFO OUT

Coprocessor

ExecutableInstruction

Set simulator

CosimulationInterfaces

CoprocessorHardware

Section 4 Co-design Flow and Methodology

Page 52: MEMOCODE 2007 HW/SW Co-design Contest Documentation of the submission by Eric Simpson Pengyuan Yu Sumit Ahuja Sandeep Shukla Patrick Schaumont Electrical.

Cycle-based Instruction-set Simulation Need cycle-based cosimulation of software

and hardware before synthesis Coprocessor mapped in FSMD semantics

Modular bottom-up hardware description Cosimulation Interfaces captured with GEZEL

simulation primitives Memory-mapped register FIFO based (with request/acknowledge

handshake)

Section 4 Co-design Flow and Methodology

Page 53: MEMOCODE 2007 HW/SW Co-design Contest Documentation of the submission by Eric Simpson Pengyuan Yu Sumit Ahuja Sandeep Shukla Patrick Schaumont Electrical.

HW-SW Interface Example

ipblock fsl1(out data : ns(32); out exists : ns(1); in read : ns(1)) { iptype "armfslslave"; ipparm "core=ppc"; ipparm "write=0x80000000"; ipparm "status=0x80000004"; } to

cop

roce

ssor

data

exists

read

connectedto ISS

PPC SW can write to address 0x80000000 Will drive data output and perform handshake

PPC SW can check status with read from 0x80000004

fsl1

GEZEL Code Hardware

Section 4 Co-design Flow and Methodology

Page 54: MEMOCODE 2007 HW/SW Co-design Contest Documentation of the submission by Eric Simpson Pengyuan Yu Sumit Ahuja Sandeep Shukla Patrick Schaumont Electrical.

Synthesis

Application SW(C Code)

uP DDR

“N” Reg FIFO IN FIFO OUT

Coprocessor

InstructionSet simulator

CosimulationInterfaces

CoprocessorHardware

Automatic conversion to hierarchical RTL-VHDL, withblack-boxes for cosimulation

interfacesXilinx EDK + ISE

Section 4 Co-design Flow and Methodology

Page 55: MEMOCODE 2007 HW/SW Co-design Contest Documentation of the submission by Eric Simpson Pengyuan Yu Sumit Ahuja Sandeep Shukla Patrick Schaumont Electrical.

Conclusions

Matrix Multiplication can be sped up by 25 times over standard reference C implementation Rectangular Blocking Dedicated Coprocessor Hardware, highly scalable Integrated design flow

Page 56: MEMOCODE 2007 HW/SW Co-design Contest Documentation of the submission by Eric Simpson Pengyuan Yu Sumit Ahuja Sandeep Shukla Patrick Schaumont Electrical.

Conclusions

Remaining Challenges Memory bottleneck (hardware/software codesign

yields ~7 % computation time and 93 % memory access time) Further optimization possible using DMA and data

caching schemes

Page 57: MEMOCODE 2007 HW/SW Co-design Contest Documentation of the submission by Eric Simpson Pengyuan Yu Sumit Ahuja Sandeep Shukla Patrick Schaumont Electrical.

Conclusions

Challenge to the MEMOCODE community accurate system-level modeling of platform artifacts to

support the designer