Holland #215 MAPLD 2005
Survey of C-based Application Mapping Tools for Reconfigurable Computing
Brian Holland, Mauricio Vacas, Vikas Aggarwal,
Ryan DeVille, Ian Troxel, and Alan D. George
High-performance Computing and Simulation (HCS) Research Lab
Department of Electrical and Computer Engineering
University of Florida
Outline
  Introduction
  General Survey
    Ten C-based Application Mappers
  Benchmarking & Results
    Finite-Impulse Response (FIR)
    N-Queens
    Radix Sort
  Lessons Learned
  Conclusions
  Acknowledgements
  References
[Figure: the ten mappers (Carte, Catapult C, DIME-C, Handel C, Impulse C, Mitrion C, Napa C, SA-C, Streams C, SystemC) grouped under SURVEY and BENCHMARK labels]
Motivation for Application Mappers
HDL programming has shortcomings
  Limited applicability to application developers
  More involved development process (vs. software)
  Requires training beyond application level
Instead, can we find and exploit an environment that allows a measure of hardware control along with increased productivity?
  Can we bring RC performance benefits to application developers?
  Would this be practical/possible in traditional HDL?
HDL is well below the level of traditional application programming
  Consequently, we need to move to a higher level of abstraction
Introduction
Selecting a Higher Level of Abstraction
  CAD tools: Visually appealing, but tedious for large projects
  New language: Optimal, but requires complete retraining
  Traditional or Object-Oriented languages: Which? How?
Ideally, use pure ANSI-C, “The Universal Language”
  Requires no additional knowledge or special training
  Port existing C programs into hardware implementations (HDL)
Translation can be handled by a hardware compiler
  Programmer concentrates on algorithmic functionality
[Figure: compilation flow with stages C Code, COMPILER, HDL, Netlist, Configuration File]
Commonalities
General characteristics of C-based application mappers:
  Companies create proprietary ANSI C-based language
  Languages do not have all ANSI C features
  Extra pragmas are included for corresponding compilers
  Additional libraries of functions/macros for further extensions
  Must adhere to specific programming “style” for maximum optimization
  Emphasis on both hardware generation and I/O interfaces
ANSI-C (input to the COMPILER):
  void FIR(int INPUTA, int OUTPUTB){
    /*user source*/
  }
VHDL (output of the COMPILER):
  Entity FIR is
    Port( rst, clock: in std_logic;
          INPUTA_en: in std_logic;
          INPUTA_data: in std_logic_vector(31 downto 0);
          OUTPUTB_en: in std_logic;
          OUTPUTB_data: out std_logic_vector(31 downto 0));
  end;
  /*user source in VHDL*/
Spectrum of C-based Application Mappers
[Figure, survey portion. Category labels: Open Standard; Generic HDL, Multiple Platforms; Generic HDL (Optimize for Manufacturer’s Hardware); Targets a Specific Platform/Configuration; RISC/FPGA Hybrid Only. Tools shown: SystemC, Impulse C, Catapult C, Mitrion C, DIME-C, Handel C, Streams C, SA-C, Napa C, Carte]
[Figure, benchmark section: “The Law of Conservation of Pain”, plotting Control against Effort. Control labels: Not Cycle Accurate, Limited Predictability, Deterministic, Cycle Accurate. Effort labels: Software, Some HW Pragmas, Many HW Pragmas, HDL. Tools shown: ANSI-C, DIME-C, Impulse C, Handel C, VHDL]
Carte, SRC Computers [1]
C/Fortran FPGA environment
  Direct mapping of C/Fortran code to configuration level
  Software emulation and simulation of compiled code for debugging
Capable of multiprocessor and multi-FPGA computational definitions
Allows explicit data flow control within memory hierarchy
Targets SRC’s MAP processor
  Produces “Unified Executables” for HW or SW processor execution
  Runtime libraries handle required interfacing and management
Catapult C, Mentor Graphics [2-3]
Algorithmic synthesis tool for RTL generation
  RTL from “pure” untimed C++
  No extensions, pragmas, etc.
Compiler uses “wrappers” around algorithmic code
  External: manages I/O interface
  Internal: constrains synthesis to optimize for chosen interface
Explicit architectural constraints and optimization
Output: RTL netlists in VHDL, Verilog, and SystemC
DIME-C, Nallatech [4]
FPGA prototyping tool
  Designs are not cycle-accurate
  Allows application synthesis for a higher clock speed
Compilation/Optimization
  Pipeline/parallelize where possible
  Included IEEE-754 FP cores
  Dedicated (integer) multipliers
Currently in beta, expected release: 4Q05
Output: synthesizable VHDL and DIMEtalk components
Handel C, Celoxica [5]
Environment for cycle-accurate application development
  All operations occur in one deterministic clock cycle
  Makes it cycle-accurate, but clock frequency is reduced to that of the slowest operation
  Decisions/Loops are “penalty-free” but can significantly impact timing
Language has pragmas for explicitly defined parallelism
Compiler can analyze, optimize, and rewrite code
Output: VHDL/Verilog, SystemC, or targeted EDIFs
Impulse C, Impulse Accelerated Technologies [6]
Language/compiler for modeling sequential apps.
  Processes: independent, potentially concurrent, computing blocks
  Streams: communicate and synchronize processes
Uses Streams-C methodology
  However, focuses on compatibility with C development environments
Compilation
  Each process implemented as separate state machine
Output: Generic or FPGA-specific VHDL
Mitrion C, Mitrion [7]
“Softcore” processor tactic
  “Processor” creates abstraction layer between C code and FPGA
Compilation
  C code is mapped to a generic “API” of possible functions
  Processor instantiated on FPGA, tailored to specific application
  Custom instruction bit-widths, specific cache and buffer sizes
Currently in beta, expected release: 4Q05
Output: a VHDL IP core for target architectures
Napa C, National Semiconductor [8]
Language/compiler for RISC/FPGA hybrid processor
  Capitalize on single-cycle interconnect instead of I/O bus
Datapath Synthesis Technique
  Hand-optimized pre-placed, pre-routed module generators
  Compiler generates hardware pipelines from C loops
Targets NS NAPA1000 hybrid processor
  Fixed-Instruction Processor (FIP), Adaptive Logic Processor (ALP)
  ALP also compiles to RTL VHDL, structural VHDL, structural Verilog
SA-C, Colorado State University [9-12]
High-level, expression-oriented, machine-independent, single-assignment language
  Designed to implicitly express data-parallel operations
  Image and signal processing
Compiler (UC-Irvine, UC-Riverside, Colorado State Univ.)
  Loop optimizations
  Structural transforms
  Execution block placement
Target Platforms
  UC Irvine Morphosys; Annapolis WildForce, StarFire, WildFire
Streams C, Los Alamos National Laboratory [12-14]
Stream-oriented sequential process modeling
  Essentially, data elements moving through discrete functional blocks
Compiler
  Generates multi-threaded processor executables and multiple FPGA bitstreams
Allows parallel C program translation into a parallel arch.
Includes functional-level simulation environment
Output: synthesizable RTL
SystemC, Open SystemC Initiative (OSCI) [15-16]
Open-source extension of C++ for HW/SW modeling
  Core language, modules & ports for defining structure, and interfaces & channels
Supports functional modeling
  Hierarchical decomposition of a system into modules
  Structural connectivity between modules using ports/exports
  Scheduling and synchronization of concurrent processes using events
Event-driven simulator
  Events are basic dynamic/static process synchronization objects
About the Benchmarks
Three classic algorithms used for benchmarking
Finite-Impulse Response (FIR)
  Simple 51-tap FIR filter for standard DSP applications
  Compare compiler solutions and analyze their usage metrics
N-Queens
  Classic embarrassingly parallel HPC backtracking search problem
  Showcases the potential of optimized implementations
Radix Sort
  Sorts using ‘binary bins’, minimizing resources
  Illustrates resource metrics in RAM-intensive applications
Implementation Details
  DIME-C, Handel C, Impulse C, VHDL, and ANSI-C (for baseline timing)
  Experiments performed on Nallatech BenNUEY-PCI card with VirtexII-6000 FPGA
  Resource utilization based on post place-and-route data
  Runtime represents communication time (setup and verification I/O are excluded)
  Handel C and Impulse C require VHDL wrappers, which can increase resource usage
Finite-Impulse Response
FIR filter containing 51 taps, each 16 bits wide (based on algorithms in [4,6])
Various application-mapper languages do not have a consistent I/O interface
  Could not create a consistent streaming channel with requisite blocking in every tool
  Instead, FIR algorithm operates on values stored in a block RAM
Obtains speedup through parallel multiplication, efficient memory accesses
  The 51 coefficients and variables are stored in local variables
  Additional performance boosts are possible in multi-channel DSP processing
[Figure: FIR Resource Utilization Statistics. % usage of Slices, Multipliers, Block RAMs, and Clock Freq for DIME-C, Handel C, Impulse C, and VHDL]
[Figure: Speedup over 2.4GHz Xeon for DIME-C, Handel C, Impulse C, VHDL, gcc -O3, and gcc -O0]
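As a rough software reference for the filter described above, the following is a minimal ANSI-C sketch of a 51-tap, 16-bit FIR that reads its samples from an array standing in for the block RAM. It is not the benchmark code from the study; the names `fir_block` and `FIR_TAPS`, the zero-padding at the start of the buffer, and the 64-bit accumulator are assumptions made for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define FIR_TAPS 51  /* tap count quoted for the benchmark */

/* Direct-form FIR over a buffer that stands in for the block RAM.
 * out[n] = sum over k of coeff[k] * in[n - k], where samples before
 * the start of the buffer are treated as zero (an assumption). */
void fir_block(const int16_t coeff[FIR_TAPS],
               const int16_t *in, int64_t *out, size_t len)
{
    assert(coeff != NULL && in != NULL && out != NULL);
    for (size_t n = 0; n < len; n++) {
        int64_t acc = 0;  /* scalar accumulator, stored once per output */
        for (size_t k = 0; k < FIR_TAPS && k <= n; k++) {
            /* The 51 products are independent of one another, which is
             * what a mapper can evaluate in parallel. */
            acc += (int64_t)coeff[k] * (int64_t)in[n - k];
        }
        out[n] = acc;
    }
}
```

Feeding a unit impulse through `fir_block` returns the coefficients in order, which is a quick sanity check before porting the loop to a mapper.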
#215 MAPLD 2005Holland 14
N-Queens
Represents a purely computational algorithm; virtually no communication overhead
Algorithm contains several parallelizable code segments, exploitable for speedup
Implementations are based upon same baseline C code
  Every available technique and compiler optimization is employed to boost performance
Notes:
  Handel C N-Queens is a benchmark from our MAPLD’04 paper with additional refinements
  VHDL N-Queens is culmination of a semester-long endeavor into algorithm’s parallelism
  DIME-C and Impulse C N-Queens are results of experimentation with beta compilers
[Figure: N-Queens Resource Utilization Statistics. % usage of Slices and Clock Freq for DIME-C, Handel C, Impulse C, and VHDL]
[Figure: Speedup over 2.4GHz Xeon versus N (13 to 17) for DIME-C, Handel C, Impulse C, VHDL, gcc -O3, and gcc -O0]
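To make the backtracking search concrete, here is a minimal ANSI-C sketch that counts N-Queens solutions with bitmasks. It is not the baseline C code used in the study; the function names and the bitmask formulation are assumptions chosen because shifts and masks tend to map cheaply to hardware.

```c
#include <assert.h>
#include <stdint.h>

/* Place one queen per row. cols, diag1, and diag2 are bitmasks of the
 * columns attacked in the current row by queens already placed. */
static uint64_t place_row(uint32_t all, uint32_t cols,
                          uint32_t diag1, uint32_t diag2)
{
    if (cols == all) {
        return 1;  /* every column holds a queen: one full solution */
    }
    uint64_t count = 0;
    uint32_t free_cols = all & ~(cols | diag1 | diag2);
    while (free_cols != 0) {
        uint32_t bit = free_cols & (0u - free_cols);  /* lowest free column */
        free_cols ^= bit;
        count += place_row(all, cols | bit,
                           ((diag1 | bit) << 1) & all,
                           (diag2 | bit) >> 1);
    }
    return count;
}

/* Number of ways to place n non-attacking queens on an n-by-n board. */
uint64_t count_queens(int n)
{
    assert(n >= 1 && n <= 20);
    uint32_t all = (1u << n) - 1u;
    return place_row(all, 0u, 0u, 0u);
}
```

Each choice of column in the first row starts an independent subtree, so `count_queens(8)`, which returns 92, could in principle be split into eight searches running side by side. That independence is what makes the problem embarrassingly parallel.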
#215 MAPLD 2005Holland 15
Radix Sort
Sorts values one bit at a time (saving significant resources vs. sorting one digit at a time)
Represents a “worst-case” legacy algorithm, containing no functional-level parallelism
  Every element in every iteration depends on every previous element in every iteration
  Ideal for software processor with fast cache, challenging in FPGA hardware
Speedup comes through efficient RAM usage and compiler optimizations/pipelining
  Reduce quantity and addressing complexity of RAM accesses whenever possible
Metrics are based on sorting 600 32-bit integers contained within a block RAM
[Figure: Radix Sort Resource Utilization Statistics. % usage of Slices, Block RAMs, and Clock Freq for DIME-C, Handel C, Impulse C, and VHDL]
[Figure: Speedup over 2.4GHz Xeon for DIME-C, Handel C, Impulse C, VHDL, gcc -O3, and gcc -O0]
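The one-bit-at-a-time scheme can be sketched in ANSI-C as a least-significant-bit-first radix sort with two bins per pass. This is an illustrative version under stated assumptions, not the benchmarked implementation: keys are treated as unsigned, and `radix_sort_bits`, `MAX_KEYS`, and the scratch buffer are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_KEYS 600  /* data-set size quoted for the benchmark */

/* LSD radix sort using one bit per pass. Each pass is a stable split
 * into a zero bin (front of scratch) and a one bin (behind it). */
void radix_sort_bits(uint32_t *keys, size_t len)
{
    uint32_t scratch[MAX_KEYS];
    assert(keys != NULL && len <= MAX_KEYS);
    for (unsigned bit = 0; bit < 32; bit++) {
        /* First sweep: count the zero bin so the one bin's base is known. */
        size_t zeros = 0;
        for (size_t i = 0; i < len; i++) {
            if (((keys[i] >> bit) & 1u) == 0u) {
                zeros++;
            }
        }
        /* Second sweep: scatter keys into the two bins, keeping order. */
        size_t z = 0, o = zeros;
        for (size_t i = 0; i < len; i++) {
            if ((keys[i] >> bit) & 1u) {
                scratch[o++] = keys[i];
            } else {
                scratch[z++] = keys[i];
            }
        }
        for (size_t i = 0; i < len; i++) {
            keys[i] = scratch[i];
        }
    }
}
```

Every pass reads and writes the whole data set, and pass k+1 cannot start until pass k has finished, which reflects the dependency chain the slide describes.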
#215 MAPLD 2005Holland 16
Some Optimization Techniques
Keep expensive computational operations to a minimum
  Multiplication, division, modulo, greater/less than, and floating point are *slow*
Minimize reliance on arrays
  BAD:
    for(i=0;i<20;i++){ a[0] = b[i]; }
  GOOD:
    temp = a[0];
    for(i=0;i<20;i++){ temp = b[i]; }
    a[0] = temp;
Watch for combinable statements
  OK:
    if(flag == 1 && test == 1){ solution++; }
  BETTER:
    solution += (flag&test);
Exploit functional level parallelism
  BAD:
    for(i=0;i<2;i++){ for(j=0;j<20;j++){ a[i][j] = i+j; } }
  GOOD:
    for(j=0;j<20;j++){ a[j] = j; }
    for(j=0;j<20;j++){ b[j] = 1+j; }
    for(j=0;j<20;j++){ c[j] = 2+j; }
Reduce bit-widths to minimal size
  OK:
    int i; for(i=0;i<255;i++);
  BETTER:
    short i; for(i=0;i<255;i++);
  BEST:
    char i; for(i=0;i<255;i++);
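The combinable-statement rewrite holds only when both operands are restricted to 0 or 1. The sketch below places the two forms side by side in ANSI-C so the equivalence can be checked in software before moving to a mapper; the function names are hypothetical and the 0/1 restriction is an assumption the code asserts.

```c
#include <assert.h>

/* Branching form: one compare-and-increment per element. */
int count_branch(const int *flag, const int *test, int n)
{
    int solution = 0;
    for (int i = 0; i < n; i++) {
        if (flag[i] == 1 && test[i] == 1) {
            solution++;
        }
    }
    return solution;
}

/* Combined form: a single AND feeds the adder. It matches the branching
 * form only while every flag and test value is 0 or 1. */
int count_combined(const int *flag, const int *test, int n)
{
    int solution = 0;
    for (int i = 0; i < n; i++) {
        assert((flag[i] == 0 || flag[i] == 1) &&
               (test[i] == 0 || test[i] == 1));
        solution += (flag[i] & test[i]);
    }
    return solution;
}
```

If a flag can take other values, for example 2, the two forms may disagree, so the branching form is the safer starting point.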
#215 MAPLD 2005Holland 17
Case Study: Dot Product

DIME-C
  void Kernel(int a[50], int b[50], int answer){
    int i, temp = 0;
    for(i=0;i<50;i++) { temp += a[i] * b[i]; }
    answer = temp;
  }
  void dot_product(int a1[50], int b1[50], int a2[50], int b2[50], int answer){
    int answer1, answer2;
    #pragma genusc instance Kernel1
    Kernel(a1,b1,answer1);
    #pragma genusc instance Kernel2
    Kernel(a2,b2,answer2);
    answer = answer1 + answer2;
  }

IMPULSE C
  void Kernel1(co_stream a1, co_stream b1, co_stream z1){
    int i, a[50], b[50], answer=0;
    co_stream_open(a1,O_RDONLY,INT_TYPE(32)); /*etc*/
    for(i=0;i<50;i++) {
      co_stream_read(a1, &a[i], sizeof(int32));
      co_stream_read(b1, &b[i], sizeof(int32));
    }
    for(i=0;i<50;i++) {
      #pragma CO UNROLL
      answer += a[i] * b[i];
    }
    co_stream_write(z1, &answer, sizeof(int32));
    co_stream_close(a1); /*etc*/
  }
  void Kernel2(co_stream a2, co_stream b2, co_stream z2){ /* SAME AS IN Kernel1 */ }
  void dot_product(co_stream z1, co_stream z2, co_stream ans){
    int answer1, answer2, answer;
    co_stream_open(z1,O_RDONLY,INT_TYPE(32)); /*etc*/
    co_stream_read(z1, &answer1, sizeof(int32));
    co_stream_read(z2, &answer2, sizeof(int32));
    answer = answer1 + answer2;
    co_stream_write(ans, &answer, sizeof(int32));
    co_stream_close(z1); /*etc*/
  }

HANDEL C
  int 32 Kernel1(int 32 a[50], int 32 b[50]){
    static int 32 i, temp[50], answer;
    par(i=0;i<50;i++) { temp[i] = a[i] * b[i]; }
    for(i=0;i<50;i++) { answer += temp[i]; }
    return answer;
  }
  int 32 Kernel2(int 32 a[50], int 32 b[50]){ /* SAME AS IN Kernel1 */ }
  void main() //dot_product
  {
    int 32 a1[50]; int 32 b1[50];
    int 32 a2[50]; int 32 b2[50];
    int 32 temp1, temp2;
    int 32 answer;
    interface bus_out() OutputResult(answer);
    par {
      temp1 = Kernel1(a1, b1);
      temp2 = Kernel2(a2, b2);
    }
    answer = temp1 + temp2;
  }

*Not all implementations are perfectly optimized. Your mileage will vary.*
Color key: Green = Computation; Blue = Communication; Orange = Pragmas
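For comparison with the three mapper versions, a plain ANSI-C baseline of the same two-kernel dot product might look like the sketch below. It is an assumed software reference rather than code from the study: `kernel`, `dot_product`, and `VEC_LEN` are hypothetical names, and results are returned by value because standard C passes scalar arguments by copy.

```c
#include <assert.h>
#include <stddef.h>

#define VEC_LEN 50  /* vector length used in the case study */

/* One multiply-accumulate kernel over a pair of 50-element vectors. */
int kernel(const int a[VEC_LEN], const int b[VEC_LEN])
{
    assert(a != NULL && b != NULL);
    int temp = 0;  /* scalar accumulator instead of an array element */
    for (int i = 0; i < VEC_LEN; i++) {
        temp += a[i] * b[i];
    }
    return temp;
}

/* The two kernel calls share no data, so a mapper is free to
 * instantiate them as two units running side by side. */
int dot_product(const int a1[VEC_LEN], const int b1[VEC_LEN],
                const int a2[VEC_LEN], const int b2[VEC_LEN])
{
    int answer1 = kernel(a1, b1);
    int answer2 = kernel(a2, b2);
    return answer1 + answer2;
}
```

In software the two calls run back to back. The mapper versions express the same independence explicitly, through instance pragmas in DIME-C, separate processes in Impulse C, and a `par` block in Handel C.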
#215 MAPLD 2005Holland 18
Lessons Learned
Tools are not near point of automatic translation
  Programs still require some tweaking for hardware compilation [17]
  Optimized Software C ≠ Optimized Hardware C
However, generating VHDL is significantly easier
  Learning basics of a C-based mapper is straightforward
At least two major challenges remain:
  Input/output interfaces become a limiting factor
    Moving generic VHDL to unsupported platforms requires VHDL knowledge
    However, once a generic I/O wrapper is generated, it should be reusable
  True hardware debugging remains a challenge
    Another level of abstraction means another layer for mistranslation
    With no knowledge of internal VHDL signals, tracing becomes difficult
#215 MAPLD 2005Holland 19
Conclusions
Advantages of C-based application mappers
  Far broader audience of potential RC users with high-level languages
  Required HDL knowledge is significantly reduced or eliminated
  Time to preliminary results is much less than manual HDL
  Software-to-hardware porting is considerably easier
  Visualization of C hardware is far easier for scientific community
Disadvantages
  Mapper instructions are many times more powerful than CPU instructions, but FPGA clocks are many times slower
  Mappers can parallelize and pipeline C code; however, they generally cannot automatically instantiate multiple functional units
  Optimized C-mapper code is obtained through manual parallelization of existing code using techniques pertinent to algorithm’s structure
  Reduced development time can come at cost of performance
#215 MAPLD 2005Holland 20
Acknowledgements
We thank the following vendors for application mapping tools, information, and technical support:
  Celoxica (Handel C)
  Impulse Accelerated Technologies (Impulse C)
  Nallatech (DIME-C)
  Mitrion (Mitrion C)
We thank the following vendors for providing tools and/or hardware that made this study possible:
  Aldec (Active-HDL & Riviera EDA tools)
  Intel (Xeon servers)
  Nallatech (FUSE & DIMEtalk tools, RC boards)
  Xilinx (ISE, RC boards, FPGAs)
#215 MAPLD 2005Holland 21
References
[1] http://www.srccomp.com
[2] http://www.mentor.com/products/c-based_design/catapult_c_synthesis/.
[3] K. Morris, “Catapult C: Mentor Announces Architectural Synthesis,” fpgajournal.com, June 1, 2004.
[4] Nallatech, Inc., “DIME-C User Guide,” Reference Manual, United Kingdom, 2005.
[5] Celoxica, Ltd. “Using Handel-C with DK,” Training Manual, United Kingdom, 2005.
[6] D. Pellerin and S. Thibault, “Practical FPGA Programming in C,” Pearson Education, Inc., Upper Saddle River, NJ, 2005.
[7] Mitrionics AB, Inc, “The Mitrion Processor,” Product Overview, Sweden, 2005.
[8] M. Gokhale, J. Stone and E. Gomersall, “Co-Synthesis to a Hybrid RISC/FPGA Architecture,” Journal of VLSI Signal Processing Systems, 24, pp. 165-180, 2000.
[9] J. Hammes and W. Böhm, “The SA-C Language,” Reference Manual, Colorado State University, 2001.
[10] J. Hammes, M. Chawathe and W. Böhm, “The SA-C Compiler,” Reference Manual, Colorado State University, 2001.
[11] Colorado State Univ. “Cameron Poster for ACS PI Meeting,” Arlington, VA, March 7, 2002.
[12] I. Troxel, “CARMA: An Infrastructure for Reconfigurable High-Performance Computing,” Ph.D. Prospectus, University of Florida, pp. 30-32, 2005.
[13] R. Goering, “Open-source C compiler targets FPGAs,” Embedded.com, October 18, 2002.
[14] J. Frigo, M. Gokhale and D. Lavenier, “Evaluation of Streams-C C-to-FPGA Compiler: An Applications Perspective,” Proc. ACM/SIGDA International Symposium on Field-Programmable Gate Arrays (FPGA), Monterey, CA, February 11-13, 2001.
[15] http://www.systemc.org.
[16] OSCI, “SystemC 2.0.1 Language Reference Manual,” Reference Manual, San Jose, CA, 2003.
[17] D. A. Buell, S. Akella, J. P. Davis, G. Quan, and D. Caliga, “The DARPA boolean equation benchmark on a reconfigurable computer,” Proc. Military Applications of Programmable Logic Devices (MAPLD), Washington, DC, September 8-10, 2004.
[18] V. Aggarwal, I. Troxel, and A. George, “Design and Analysis of Parallel N-Queens on Reconfigurable Hardware with Handel-C and MPI,” Proc. MAPLD, Washington, DC, September 8-10, 2004.
[19] J. Jussel, “The future of programmable SoC design is C-based,” Proc. Engineering of Reconfigurable Systems and Algorithms (ERSA), Las Vegas, NV, June 27-30, 2005.