
Holland #215 MAPLD 2005

Survey of C-based Application Mapping Tools for Reconfigurable Computing

Brian Holland, Mauricio Vacas, Vikas Aggarwal,

Ryan DeVille, Ian Troxel, and Alan D. George

High-performance Computing and Simulation (HCS) Research Lab

Department of Electrical and Computer Engineering

University of Florida


Outline

Introduction

General Survey: Ten C-based Application Mappers

Benchmarking & Results: Finite-Impulse Response (FIR), N-Queens, Radix Sort

Lessons Learned

Conclusions

Acknowledgements

References

[Figure: the ten C-based application mappers (Carte, Catapult C, DIME-C, Handel C, Impulse C, Mitrion C, Napa C, SA-C, Streams C, SystemC), grouped into SURVEY and BENCHMARK sets.]


Motivation for Application Mappers

HDL programming has shortcomings: limited applicability to application developers; more involved development process (vs. software); requires training beyond application level.

Instead, can we find and exploit an environment that allows a measure of hardware control along with increased productivity? Can we bring RC performance benefits to application developers? Would this be practical/possible in traditional HDL?

HDL is well below the level of traditional application programming. Consequently, we need to move to a higher level of abstraction.


Introduction

Selecting a Higher Level of Abstraction

CAD tools: visually appealing, but tedious for large projects.

New language: optimal, but requires complete retraining.

Traditional or object-oriented languages: Which? How?

Ideally, use pure ANSI-C, “The Universal Language”: requires no additional knowledge or special training; port existing C programs into hardware implementations (HDL).

Translation can be handled by a hardware compiler, so the programmer concentrates on algorithmic functionality.

[Figure: compilation flow from C code through the compiler to HDL, netlist, and configuration file.]


Commonalities

General characteristics of C-based application mappers:

Companies create proprietary ANSI C-based language

Languages do not have all ANSI C features

Extra pragmas are included for corresponding compilers

Additional libraries of functions/macros for further extensions

Must adhere to specific programming “style” for maximum optimization

Emphasis on both hardware generation and I/O interfaces

ANSI-C:

void FIR(int INPUTA, int OUTPUTB){
    /*user source*/
}

COMPILER

VHDL:

Entity FIR is
    Port( rst, clock:   in  std_logic;
          INPUTA_en:    in  std_logic;
          INPUTA_data:  in  std_logic_vector(31 downto 0);
          OUTPUTB_en:   in  std_logic;
          OUTPUTB_data: out std_logic_vector(31 downto 0));
end;

/*user source in VHDL*/


Spectrum of C-based Application Mappers

[Figure: spectrum of the surveyed mappers. Category labels: Open Standard; Generic HDL, Multiple Platforms; Generic HDL (Optimize for Manufacturer’s Hardware); Targets a Specific Platform/Configuration; RISC/FPGA Hybrid Only. Survey portion: SystemC, Impulse C, Catapult C, Mitrion C, DIME-C, Handel C, Streams C, SA-C, Napa C, Carte. Benchmark section: Handel C, DIME-C, Impulse C, VHDL, ANSI-C.]

The Law of Conservation of Pain

[Figure: Control versus Effort across ANSI-C, DIME-C, Impulse C, Handel C, and VHDL. Region labels: Software, Some HW Pragmas, Many HW Pragmas, HDL. Timing labels: Not Cycle Accurate, Limited Predictability, Deterministic, Cycle Accurate.]


Carte (SRC Computers) [1]

C/Fortran FPGA environment: direct mapping of C/Fortran code to configuration level; software emulation and simulation of compiled code for debugging; capable of multiprocessor and multi-FPGA computational definitions.

Allows explicit data flow control within memory hierarchy.

Targets SRC’s MAP processor: produces “Unified Executables” for HW or SW processor execution; runtime libraries handle required interfacing and management.

Catapult C (Mentor Graphics) [2-3]

Algorithmic synthesis tool for RTL generation: RTL from “pure” untimed C++; no extensions, pragmas, etc.

Compiler uses “wrappers” around algorithmic code. External: manages I/O interface. Internal: constrains synthesis to optimize for chosen interface.

Explicit architectural constraints and optimization.

Output: RTL netlists in VHDL, Verilog, and SystemC.


DIME-C (Nallatech) [4]

FPGA prototyping tool: designs are not cycle-accurate, which allows application synthesis for a higher clock speed.

Compilation/Optimization: pipeline/parallelize where possible; included IEEE-754 FP cores; dedicated (integer) multipliers.

Currently in beta, expected release: 4Q05.

Output: synthesizable VHDL and DIMEtalk components.

Handel C (Celoxica) [5]

Environment for cycle-accurate application development: all operations occur in one deterministic clock cycle, which makes it cycle-accurate, but clock frequency is reduced to that of the slowest operation; decisions/loops are “penalty-free” but can significantly impact timing.

Language has pragmas for explicitly defined parallelism.

Compiler can analyze, optimize, and rewrite code.

Output: VHDL/Verilog, SystemC, or targeted EDIFs.


Impulse C (Impulse Accelerated Technologies) [6]

Language/compiler for modeling sequential apps. Processes: independent, potentially concurrent, computing blocks. Streams: communicate and synchronize processes.

Uses Streams-C methodology; however, focuses on compatibility with C development environments.

Compilation: each process implemented as separate state machine.

Output: generic or FPGA-specific VHDL.

Mitrion C (Mitrion) [7]

“Softcore” processor tactic: “processor” creates abstraction layer between C code and FPGA.

Compilation: C code is mapped to a generic “API” of possible functions; processor instantiated on FPGA, tailored to specific application; custom instruction bit-widths, specific cache and buffer sizes.

Currently in beta, expected release: 4Q05.

Output: a VHDL IP core for target architectures.


Napa C (National Semiconductor) [8]

Language/compiler for RISC/FPGA hybrid processor: capitalize on single-cycle interconnect instead of I/O bus.

Datapath Synthesis Technique: hand-optimized pre-placed, pre-routed module generators; compiler generates hardware pipelines from C loops.

Targets NS NAPA1000 hybrid processor: Fixed-Instruction Processor (FIP), Adaptive Logic Processor (ALP); ALP also compiles to RTL VHDL, structural VHDL, structural Verilog.

SA-C (Colorado State University) [9-12]

High-level, expression-oriented, machine-independent, single-assignment language: designed to implicitly express data-parallel operations; image and signal processing.

Compiler (UC-Irvine, UC-Riverside, Colorado State Univ.): loop optimizations; structural transforms; execution block placement.

Target Platforms: UC Irvine Morphosys; Annapolis WildForce, StarFire, WildFire.


Streams C (Los Alamos National Laboratory) [12-14]

Stream-oriented sequential process modeling: essentially, data elements moving through discrete functional blocks.

Compiler: generates multi-threaded processor executables and multiple FPGA bitstreams.

Allows parallel C program translation into a parallel arch.

Includes functional-level simulation environment.

Output: synthesizable RTL.

SystemC (Open SystemC Initiative, OSCI) [15-16]

Open-source extension of C++ for HW/SW modeling: core language, modules & ports for defining structure, and interfaces & channels.

Supports functional modeling: hierarchical decomposition of a system into modules; structural connectivity between modules using ports/exports; scheduling and synchronization of concurrent processes using events.

Event-driven simulator: events are basic dynamic/static process synchronization objects.


About the Benchmarks

Three classic algorithms used for benchmarking:

Finite-Impulse Response (FIR): simple 51-tap FIR filter for standard DSP applications. Compare compiler solutions and analyze their usage metrics.

N-Queens: classic embarrassingly parallel HPC backtracking search problem. Showcases the potential of optimized implementations.

Radix Sort: sorts using ‘binary bins’, minimizing resources. Illustrates resource metrics in RAM-intensive applications.

Implementation Details

DIME-C, Handel C, Impulse C, VHDL, and ANSI-C (for baseline timing)

Experiments performed on Nallatech BenNUEY-PCI card with VirtexII-6000 FPGA

Resource utilization based on post place-and-route data

Runtime represents communication time (setup and verification I/O is excluded)

Handel C and Impulse C require VHDL wrappers, which can increase resource usage

[Figures: sample plot with vertical axis from -10 to 10 and horizontal axis from 1 to 27; ‘binary bins’ illustration with bin 0 holding 110, 100, 10 and bin 1 holding 111, 101.]


Finite-Impulse Response

FIR filter containing 51 taps, each 16 bits wide (based on algorithms in [4,6]).

Various application-mapper languages do not have a consistent I/O interface: could not create a consistent streaming channel with requisite blocking in every tool. Instead, FIR algorithm operates on values stored in a block RAM.

Obtains speedup through parallel multiplication and efficient memory accesses. The 51 coefficients and variables are stored in local variables.

Additional performance boosts are possible in multi-channel DSP processing.
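As a point of reference for the description above, a block-based 51-tap FIR can be sketched in plain C. This is only an illustration, not the benchmark source: the function name `fir_block`, the 64-bit accumulator, and the choice to treat samples before the start of the block as zero are all assumptions.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define FIR_TAPS 51  /* tap count from the benchmark description */

/*
 * Block-based FIR: out[i] = sum over k of coeff[k] * in[i - k].
 * Samples before the start of the block are treated as zero (assumption).
 * A 64-bit accumulator is used because 51 products of two 16-bit values
 * can exceed the range of a 32-bit integer.
 */
void fir_block(const int16_t coeff[FIR_TAPS],
               const int16_t *in, int64_t *out, size_t n)
{
    assert(coeff != NULL && in != NULL && out != NULL);
    for (size_t i = 0; i < n; i++) {
        int64_t acc = 0;
        /* The taps are independent multiplies; a hardware mapper could
         * unroll this inner loop into parallel multipliers. */
        for (size_t k = 0; k < FIR_TAPS && k <= i; k++) {
            acc += (int64_t)coeff[k] * (int64_t)in[i - k];
        }
        out[i] = acc;
    }
}
```

Feeding a unit impulse through `fir_block` returns the coefficients themselves, which is a convenient sanity check before porting the loop to any of the mappers.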

[Figure: FIR Resource Utilization Statistics. Bar chart of % usage (0 to 100) for Slices, Multipliers, Block RAMs, and Clock Freq; series: DIME-C, Handel C, Impulse C, VHDL.]

[Figure: Speedup over 2.4GHz Xeon. Bar chart (0 to 4) for DIME-C, Handel C, Impulse C, VHDL, gcc -O3, and gcc -O0.]


N-Queens

Represents a purely computational algorithm; virtually no communication overhead.

Algorithm contains several parallelizable code segments, exploitable for speedup.

Implementations are based upon same baseline C code. Every available technique and compiler optimization is employed to boost performance.

Notes:

Handel C N-Queens is a benchmark from our MAPLD’04 paper with additional refinements.

VHDL N-Queens is culmination of a semester-long endeavor into algorithm’s parallelism.

DIME-C and Impulse C N-Queens are results of experimentation with beta compilers.
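To make the backtracking search concrete, here is a compact software sketch that counts N-Queens solutions. It uses a common bitmask formulation chosen for brevity; it is not the baseline C code used in the study, and the names `place_row` and `nqueens_count` are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Place one queen per row. cols, d1, and d2 are bitmasks of the columns
 * attacked in the current row by earlier queens (vertically and along the
 * two diagonal directions). all has the low n bits set.
 */
static uint64_t place_row(uint32_t all, uint32_t cols, uint32_t d1, uint32_t d2)
{
    if (cols == all) {
        return 1;  /* every column is occupied, so all n queens are placed */
    }
    uint64_t count = 0;
    uint32_t free_sq = all & ~(cols | d1 | d2);
    while (free_sq != 0) {
        uint32_t bit = free_sq & (~free_sq + 1u);  /* lowest free column */
        free_sq &= ~bit;
        /* Each choice of column starts an independent subtree; this is the
         * parallelism a hardware implementation can exploit. */
        count += place_row(all, cols | bit, (d1 | bit) << 1, (d2 | bit) >> 1);
    }
    return count;
}

/* Count all solutions for an n x n board (1 <= n <= 31 in this sketch). */
uint64_t nqueens_count(unsigned n)
{
    assert(n >= 1 && n <= 31);
    uint32_t all = (1u << n) - 1u;
    return place_row(all, 0, 0, 0);
}
```

Because the subtrees rooted at each first-row placement never share state, they can be evaluated concurrently and their counts summed afterwards.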

[Figure: N-Queens Resource Utilization Statistics. Bar chart of % usage (0 to 100) for Slices and Clock Freq.]

[Figure: chart with vertical axis from 0 to 6 and horizontal axis N = 13 to 17.]



Radix Sort

Sorts values one bit at a time (saving significant resources vs. sorting one digit at a time).

Represents a “worst-case” legacy algorithm, containing no functional-level parallelism: every element in every iteration depends on every previous element in every iteration. Ideal for software processor with fast cache, challenging in FPGA hardware.

Speedup comes through efficient RAM usage and compiler optimizations/pipelining: reduce quantity and addressing complexity of RAM accesses whenever possible.

Metrics are based on sorting 600 32-bit integers contained within a block RAM.
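The bit-at-a-time scheme can be sketched in plain C as a least-significant-bit-first radix sort with two bins. This is an illustrative software version, not the benchmark source; the name `radix_sort_binary`, the scratch buffer, and the 600-element cap (taken from the metric above) are assumptions.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define SORT_MAX 600  /* element count used for the reported metrics */

/*
 * Binary radix sort: one pass per bit, from least to most significant.
 * Each pass stably partitions the data into a "0" bin followed by a "1" bin.
 */
void radix_sort_binary(uint32_t *data, size_t n)
{
    assert(data != NULL && n <= SORT_MAX);
    uint32_t scratch[SORT_MAX];

    for (unsigned bit = 0; bit < 32; bit++) {
        /* Count the zeros first so the "1" bin can start right after them. */
        size_t zeros = 0;
        for (size_t i = 0; i < n; i++) {
            if (((data[i] >> bit) & 1u) == 0) {
                zeros++;
            }
        }
        size_t next0 = 0;
        size_t next1 = zeros;
        for (size_t i = 0; i < n; i++) {
            if ((data[i] >> bit) & 1u) {
                scratch[next1++] = data[i];
            } else {
                scratch[next0++] = data[i];
            }
        }
        memcpy(data, scratch, n * sizeof data[0]);
    }
}
```

Every pass reads and writes the whole array, and each pass depends on the previous one, which is why the slide describes the algorithm as RAM-bound with no functional-level parallelism.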

[Figure: Radix Sort Resource Utilization Statistics. Bar chart of % usage (0 to 100) for Slices, Block RAMs, and Clock Freq.]

[Figure: chart with vertical axis from 0.0 to 1.0.]



Some Optimization Techniques

Keep expensive computational operations to a minimum: multiplication, division, modulo, greater/less-than comparisons, and floating point are *slow*.

Minimize reliance on arrays

BAD:
for(i=0;i<20;i++){ a[0] = b[i]; }

GOOD:
temp = a[0];
for(i=0;i<20;i++){ temp = b[i]; }
a[0] = temp;

Watch for combinable statements

OK:
if(flag == 1 && test == 1){ solution++; }

BETTER (when flag and test hold only 0 or 1):
solution += (flag&test);

Exploit functional level parallelism

BAD:
for(i=0;i<3;i++){ for(j=0;j<20;j++){ a[i][j] = i+j; } }

GOOD:
for(j=0;j<20;j++){ a[j] = j; }
for(j=0;j<20;j++){ b[j] = 1+j; }
for(j=0;j<20;j++){ c[j] = 2+j; }

Reduce bit-widths to minimal size

OK:
int i; for(i=0;i<255;i++);

BETTER:
short i; for(i=0;i<255;i++);

BEST:
unsigned char i; for(i=0;i<255;i++);
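The combinable-statement rewrite is only safe when both operands are strictly 0 or 1, so it is worth checking the two forms against each other in software before committing to the branch-free version. The sketch below does that; the function names `count_branch` and `count_combined` are hypothetical.

```c
#include <assert.h>

/* Branching form: one comparison pair and a conditional increment. */
int count_branch(const int *flag, const int *test, int n)
{
    int solution = 0;
    for (int i = 0; i < n; i++) {
        if (flag[i] == 1 && test[i] == 1) {
            solution++;
        }
    }
    return solution;
}

/* Combined form: a single AND feeding an adder, with no branch.
 * Valid only when every flag and test value is 0 or 1. */
int count_combined(const int *flag, const int *test, int n)
{
    int solution = 0;
    for (int i = 0; i < n; i++) {
        assert(flag[i] == 0 || flag[i] == 1);
        assert(test[i] == 0 || test[i] == 1);
        solution += (flag[i] & test[i]);
    }
    return solution;
}
```

If either operand can take other values (for example 2), the bitwise AND no longer matches the comparison, and the branching form should be kept.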


Case Study: Dot Product

DIME-C

void Kernel(int a[50], int b[50], int answer){
    int i, temp = 0;
    for(i=0;i<50;i++) {
        temp += a[i] * b[i];
    }
    answer = temp;
}

void dot_product(int a1[50], int b1[50], int a2[50], int b2[50], int answer){
    int answer1, answer2;

    #pragma genusc instance Kernel1
    Kernel(a1,b1,answer1);

    #pragma genusc instance Kernel2
    Kernel(a2,b2,answer2);

    answer = answer1 + answer2;
}

IMPULSE C

void Kernel1(co_stream a1, co_stream b1, co_stream z1){
    int i, a[50], b[50], answer=0;
    co_stream_open(a1,O_RDONLY,INT_TYPE(32)); /*etc*/
    for(i=0;i<50;i++) {
        co_stream_read(a1, &a[i], sizeof(int32));
        co_stream_read(b1, &b[i], sizeof(int32));
    }
    for(i=0;i<50;i++) {
        #pragma CO UNROLL
        answer += a[i] * b[i];
    }
    co_stream_write(z1, &answer, sizeof(int32));
    co_stream_close(a1); /*etc*/
}

void Kernel2(co_stream a2, co_stream b2, co_stream z2){ /* SAME AS IN Kernel1 */ }

void dot_product(co_stream z1, co_stream z2, co_stream ans){
    int i, answer1, answer2, answer;
    co_stream_open(z1,O_RDONLY,INT_TYPE(32)); /*etc*/
    co_stream_read(z1, &answer1, sizeof(int32));
    co_stream_read(z2, &answer2, sizeof(int32));
    answer = answer1 + answer2;
    co_stream_write(ans, &answer, sizeof(int32));
    co_stream_close(z1); /*etc*/
}

HANDEL C

int 32 Kernel1(int 32 a[50], int 32 b[50]){
    static int 32 i, temp[50], answer;
    par(i=0;i<50;i++) {
        temp[i] = a[i] * b[i];
    }
    for(i=0;i<50;i++) {
        answer += temp[i];
    }
    return answer;
}

int 32 Kernel2(int 32 a[50], int 32 b[50]){ /* SAME AS IN Kernel1 */ }

void main() //dot_product
{
    int 32 a1[50]; int 32 b1[50];
    int 32 a2[50]; int 32 b2[50];
    int 32 ans1, ans2;
    int 32 answer;
    interface bus_out() OutputResult(answer);
    par {
        ans1 = Kernel1(a1, b1);
        ans2 = Kernel2(a2, b2);
    }
    answer = ans1 + ans2;
}

*Not all implementations are perfectly optimized. Your mileage will vary.*

Color key on the original slide: green = computation, blue = communication, orange = pragmas.
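For comparison with the three mapper versions, the same computation in plain ANSI-C might look like the sketch below. It is a hypothetical software baseline, not code from the study; the names `kernel` and `dot_product`, the `VEC_LEN` macro, and the 64-bit accumulator (used to avoid signed overflow) are assumptions.

```c
#include <assert.h>
#include <stdint.h>

#define VEC_LEN 50  /* vector length used in the case study */

/* One dot-product kernel: the multiply-accumulate loop that each
 * mapper version pipelines, unrolls, or replicates. */
int64_t kernel(const int32_t a[VEC_LEN], const int32_t b[VEC_LEN])
{
    assert(a != 0 && b != 0);
    int64_t temp = 0;
    for (int i = 0; i < VEC_LEN; i++) {
        temp += (int64_t)a[i] * (int64_t)b[i];
    }
    return temp;
}

/* Two independent kernels whose results are summed. In software they run
 * back to back; the mapper versions instantiate them side by side. */
int64_t dot_product(const int32_t a1[VEC_LEN], const int32_t b1[VEC_LEN],
                    const int32_t a2[VEC_LEN], const int32_t b2[VEC_LEN])
{
    return kernel(a1, b1) + kernel(a2, b2);
}
```

The two `kernel` calls share no data, which is the property the DIME-C instance pragmas, the separate Impulse C processes, and the Handel C `par` block each rely on.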


Lessons Learned

Tools are not near point of automatic translation: programs still require some tweaking for hardware compilation [17].

Optimized Software C ≠ Optimized Hardware C

However, generating VHDL is significantly easier: learning basics of a C-based mapper is straightforward.

At least two major challenges remain:

Input/output interfaces become a limiting factor. Moving generic VHDL to unsupported platforms requires VHDL knowledge. However, once a generic I/O wrapper is generated, it should be reusable.

True hardware debugging remains a challenge. Another level of abstraction means another layer for mistranslation. With no knowledge of internal VHDL signals, tracing becomes difficult.


Conclusions

Advantages of C-based application mappers:

Far broader audience of potential RC users with high-level languages

Required HDL knowledge is significantly reduced or eliminated

Time to preliminary results is much less than manual HDL

Software-to-hardware porting is considerably easier

Visualization of C hardware is far easier for scientific community

Disadvantages:

Mapper instructions are many times more powerful than CPU instructions, but FPGA clocks are many times slower

Mappers can parallelize and pipeline C code; however, they generally cannot automatically instantiate multiple functional units

Optimized C-mapper code is obtained through manual parallelization of existing code using techniques pertinent to algorithm’s structure

Reduced development time can come at cost of performance


Acknowledgements

We thank the following vendors for application mapping tools, information, and technical support: Celoxica (Handel C), Impulse Accelerated Technologies (Impulse C), Nallatech (DIME-C), and Mitrion (Mitrion C).

We thank the following vendors for providing tools and/or hardware that made this study possible: Aldec (Active-HDL & Riviera EDA tools), Intel (Xeon servers), Nallatech (FUSE & DIMEtalk tools, RC boards), and Xilinx (ISE, RC boards, FPGAs).


References

[1] http://www.srccomp.com

[2] http://www.mentor.com/products/c-based_design/catapult_c_synthesis/.

[3] K. Morris, “Catapult C: Mentor Announces Architectural Synthesis,” fpgajournal.com, June 1, 2004.

[4] Nallatech, Inc., “DIME-C User Guide,” Reference Manual, United Kingdom, 2005.

[5] Celoxica, Ltd. “Using Handel-C with DK,” Training Manual, United Kingdom, 2005.

[6] D. Pellerin and S. Thibault, “Practical FPGA Programming in C,” Pearson Education, Inc., Upper Saddle River, NJ, 2005.

[7] Mitrionics AB, Inc, “The Mitrion Processor,” Product Overview, Sweden, 2005.

[8] M. Gokhale, J. Stone and E. Gomersall, “Co-Synthesis to a Hybrid RISC/FPGA Architecture,” Journal of VLSI Signal Processing Systems, 24, pp. 165-180, 2000.

[9] J. Hammes and W. Böhm, “The SA-C Language,” Reference Manual, Colorado State University, 2001.

[10] J. Hammes, M. Chawathe and W. Böhm, “The SA-C Compiler,” Reference Manual, Colorado State University, 2001.

[11] Colorado State Univ. “Cameron Poster for ACS PI Meeting,” Arlington, VA, March 7, 2002.

[12] I. Troxel, “CARMA: An Infrastructure for Reconfigurable High-Performance Computing,” Ph.D. Prospectus, University of Florida, pp. 30-32, 2005.

[13] R. Goering, “Open-source C compiler targets FPGAs,” Embedded.com, October 18, 2002.

[14] J. Frigo, M. Gokhale and D. Lavenier, “Evaluation of Streams-C C-to-FPGA Compiler: An Applications Perspective,” Proc. ACM/SIGDA International Symposium on Field-Programmable Gate Arrays (FPGA), Monterey, CA, February 11-13, 2001.

[15] http://www.systemc.org.

[16] OSCI, “SystemC 2.0.1 Language Reference Manual,” Reference Manual, San Jose, CA, 2003.

[17] D. A. Buell, S. Akella, J. P. Davis, G. Quan, and D. Caliga, “The DARPA boolean equation benchmark on a reconfigurable computer,” Proc. Military Applications of Programmable Logic Devices (MAPLD), Washington, DC, September 8-10, 2004.

[18] V. Aggarwal, I. Troxel, and A. George, “Design and Analysis of Parallel N-Queens on Reconfigurable Hardware with Handel-C and MPI,” Proc. MAPLD, Washington, DC, September 8-10, 2004.

[19] J. Jussel, “The future of programmable SoC design is C-based,” Proc. Engineering of Reconfigurable Systems and Algorithms (ERSA), Las Vegas, NV, June 27-30, 2005.