Gaj 1 MAPLD 2005/1016
Development and Maintenance of User Libraries for
SRC Reconfigurable Computers
Kris Gaj (1), Tarek El-Ghazawi (2), Paul Gage (3), Dan Poznanovic (3),
Chang Shu (1), Deapesh Misra (1),
Miaoqing Huang (2), Esam El-Araby (2),
Mohamed Taher (2)
(1) George Mason University
(2) The George Washington University
(3) SRC Computers, Inc.
Gaj 2 MAPLD 2005/1016
Reconfigurable Computers
Gaj 3 MAPLD 2005/1016
What is a reconfigurable computer?
[Figure: block diagram of a microprocessor system (µPs, each with its own µP memory, plus I/O and an interface) connected through the interfaces to an FPGA system (FPGAs, each with its own FPGA memory, plus I/O and an interface).]
Gaj 4 MAPLD 2005/1016
Examples of High-End Reconfigurable Computers
• SRC-6E and SRC Hi-Bar Based Systems from SRC Computers, Inc.
• Cray XD1 (formerly OctigaBay 12K) from Cray Inc.
• SGI Altix 3000 from Silicon Graphics
• Star Bridge Hypercomputer from Star Bridge Systems
Gaj 5 MAPLD 2005/1016
SRC MAP™ Reconfigurable Processor
Source: [SRC, MAPLD04]
Gaj 6 MAPLD 2005/1016
SRC-6E Hardware Architecture
[Figure: block diagram of the SRC-6E. µP board: two P4 (2.8 GHz) processors with L2 caches, MIOC, computer memory (8 GB), PCI-X, and SNAP with a DDR interface. ½ MAP board: control FPGA (XC2V6000), on-board memory (24 MB), FPGA 1 and FPGA 2 (both XC2V6000), and chain ports. Link bandwidths labeled in the figure: 22400 MB/s, 4256 MB/s, 2128 MB/s, 1064 MB/s, 4800 MB/s (6 x 64 bits), 2400 MB/s (192 bits), and 2400 MB/s (108 bits) for the chain ports.]
Gaj 7 MAPLD 2005/1016
SRC Hi-Bar Based Systems
• Hi-Bar sustains 1.4 GB/s per port with 180 ns latency per tier
• Up to 256 input and 256 output ports
• Common Memory (CM) has controller with DMA capability
• Up to 8 GB DDR SDRAM supported per CM node
[Figure: SRC-6 system. µP nodes (µP, memory, PCI-X, SNAP™) and MAP® processors (with chaining and GPIO) attach to the SRC Hi-Bar switch together with Common Memory nodes. Gig Ethernet etc. links the system to the customers' existing networks (storage area network, local area network, wide area network, disk).]
Source: [SRC, MAPLD04]
Gaj 8 MAPLD 2005/1016
SRC Programming
[Figure: the application programmer writes HLL (C) code for the µP system and the FPGA system; the library developer writes HDL (VHDL) code for the FPGA system.]
Gaj 9 MAPLD 2005/1016
SRC Program Partitioning
µP system:
• C function for µP (HLL)
FPGA system:
• C function for FPGAs (HLL)
• VHDL macro for FPGAs (HDL)
Gaj 10 MAPLD 2005/1016
Run Time Reconfiguration in SRC
Program in C or Fortran:
    Main program
        Function_1(a, d, e)
        Function_2(d, e, f)
    Function_1
        Macro_1(a, b, c)
        Macro_2(b, d)
        Macro_2(c, e)
    Function_2
        Macro_3(s, t)
        Macro_1(n, b)
        Macro_4(t, k)
[Figure: FPGA contents after the Function_1 call. Macro_1 takes a and produces b and c; two Macro_2 instances turn b into d and c into e.]
Gaj 11 MAPLD 2005/1016
SRC Development Environment
[Figure: build flow. Application sources (.c or .f files) are compiled by the µP compiler and the MAP compiler into object files (.o files). HDL sources and user macro sources (.vhd or .v files), together with generated .v files, pass through logic synthesis (.edf files) and place & route (.bin files) to yield the configuration bitstreams. The linker produces the application executable.]
Gaj 12 MAPLD 2005/1016
Advantages of reconfigurable computers
• can be programmed by mathematicians themselves using traditional programming languages or GUI environments
• encourage innovation and experimentation
• general-purpose: cost distributed among multiple users with different needs
• behave like hardware:
    - parallel processing
    - distributed memory
    - specialized functional units, etc.
Gaj 13 MAPLD 2005/1016
Conditions necessary for the success of reconfigurable computers
• ease of use of library macros and functions
• existence of comprehensive libraries of user macros and functions capable of running on FPGAs
• significant speed-ups ( 100 x) of basic functions running on FPGAs compared to state-of-the-art microprocessors
Gaj 14 MAPLD 2005/1016
Development and Maintenance of SRC
Libraries
Gaj 15 MAPLD 2005/1016
Structure of the macro repository
< top of repository >
    < macros >
        <lib # 1 >   <lib # 2 >   <lib # 3 >
            common   rev_d   rev_e   rev_f
                macro1   macro2   macro3
                    hdlfile   InfoFile   BlkBoxFile   DebugCodeFile   DataSheet
Gaj 16 MAPLD 2005/1016
Macro Types
common:
• These macros have no connections to external pins and do not depend on any feature specific to an FPGA type. This type of macro can be used on any MAP.
rev_d:
• These macros have a specific dependency on the dual MAP.
rev_e:
• These macros have a specific dependency on the single MAP.
rev_f:
• These macros have a specific dependency on the compact MAP.
Gaj 17 MAPLD 2005/1016
Files describing the macro
Platform independent HDL file: macro.v or macro.vh
• Verilog or VHDL code defining the macro
Debug Code File: macro.c
• provides the equivalent C functionality for the macro
Platform dependent Blk Box File: blackbox.v
• Interface (black box) definition for the macro in Verilog
Data sheet file: datasheet
• contains the documentation for the macro
Info File: info
• Info file entry for the given macro, containing macro type, latency, names of input/output/control signals, etc.
Gaj 18 MAPLD 2005/1016
CVS repository
To properly manage a distribution of macros, a CVS repository must be set up.
This keeps source code changes under control and lets multiple developers work on the code.
Gaj 19 MAPLD 2005/1016
The Installed Macro Library Structure
<xxx lib>
    map 3 (built for the Xilinx Virtex2)   map 4 (built for the Xilinx Virtex2Pro)
        common   rev_d   rev_e
            ngo           macro1   macro2   macro3   ......
            blkbox.v      Single blackbox file
            macros.info   Single info file
Obtained by running a special script developed by SRC
Gaj 20 MAPLD 2005/1016
Library Script
Usage:
build_libs [OPTION]
    [-b, --branch br]             Specify CVS branch
    [-c, --checkout]              Checkout only
    [-d, --CVSROOT cvsroot]       Specify CVSROOT
    [-M, --MAP maptype]           Build for MAP maptype
    [-m, --module mod]            Build mod only
    [-r, --restart mmddyy-hhmm]   Restart previous build
    [-s, --step target]           Run build step target
    [-v, --version N.n]           Package as version N.n
    [-V, --vendor vend]           Specify distribution vendor
    [-w, --workspace path]        Create workspace in path
Gaj 21 MAPLD 2005/1016
Building libraries
• build_libs checks out the library and performs a build in /var/tmp/builds, in a folder named with a time stamp (e.g. 080405-1705).
• If there is an error, check the file called ‘output’ in /var/tmp/builds, fix the error, and restart the build:
    build_libs --restart 080405-1705
• You can also do a partial build, for example build only the library and not the CD:
    build_libs --step lib
• To build only a particular subset of a library, use a command such as:
    build_libs --module crypto
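The restart tag follows the mmddyy-hhmm pattern given in the usage text. As an illustration only, a wrapper around the script could check a tag before passing it to --restart. The function is_build_tag below is a hypothetical helper, not part of build_libs.

```c
#include <assert.h>
#include <ctype.h>
#include <string.h>

/* HYPOTHETICAL helper: returns 1 if s looks like mmddyy-hhmm, else 0.
 * Only field ranges are checked, not whether the date exists. */
int is_build_tag(const char *s)
{
    assert(s != NULL);
    if (strlen(s) != 11 || s[6] != '-')
        return 0;
    for (int i = 0; i < 11; i++)
        if (i != 6 && !isdigit((unsigned char)s[i]))
            return 0;
    int month  = (s[0] - '0') * 10 + (s[1] - '0');
    int day    = (s[2] - '0') * 10 + (s[3] - '0');
    int hour   = (s[7] - '0') * 10 + (s[8] - '0');
    int minute = (s[9] - '0') * 10 + (s[10] - '0');
    return month >= 1 && month <= 12 && day >= 1 && day <= 31 &&
           hour <= 23 && minute <= 59;
}
```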
Gaj 22 MAPLD 2005/1016
Structure for the repository of MAP C functions
< top of repository >
    < userlib >
        <lib # 1 >   <lib # 2 >   <lib # 3 >
            common   rev_d   rev_e   rev_f
                routine1   routine2   routine3
Gaj 23 MAPLD 2005/1016
Files describing the MAP C routine
Source file:
• This is the .mc or .mf file defining the MAP routine.
proto.h:
• This file provides a prototype of the MAP routine.
Makefile:
• This is a standard Carte Makefile, with the exception that no BIN environment variable is provided.
Docfile:
• This file provides man-page-format documentation of the MAP routine.
Gaj 24 MAPLD 2005/1016
The Installed MAP Routine Library Structure
<userlib >
    map 3   map 4
        common   rev_d   rev_e
            lib1.a   lib1.so   lib2.a   lib2.so   ......
Gaj 25 MAPLD 2005/1016
Known problems:
No support for variable size of operands
Gaj 26 MAPLD 2005/1016
Problem statement
We would like to be able to create and maintain a library of generic components that work for various operand sizes.
Example:
Basic arithmetic operations (addition, subtraction, multiplication, division) on multiprecision (n-bit) integers.
Gaj 27 MAPLD 2005/1016
Possible solutions
1. Fixed-size interface to a macro
• using streams
• without using streams
2. Variable-size interface to a macro cell
Gaj 28 MAPLD 2005/1016
[Figure: two Process blocks with a 64-bit input and a 64-bit output.]
Gaj 29 MAPLD 2005/1016
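Passing variable-size operands without streams
    for (i = 0; i < 3*N+1; i++) {
        if (i < N) {            /* feed one word of each operand */
            A_in = c[i];
            B_in = d[i];
        } else {                /* operands exhausted: feed zeros */
            A_in = 0;
            B_in = 0;
        }
        mul (i, A_in, B_in, &C_out);
        if (i > N)              /* collect one word of the result */
            e[i-N] = C_out;
    }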
Gaj 30 MAPLD 2005/1016
Passing variable-size operands using streams
    #pragma src section
    {
        for (i=0; i<N; i++) {
            put_stream (&S0, A[i], 1);   // put A[i] to S0
            put_stream (&S1, B[i], 1);   // put B[i] to S1
        }
    }
    #pragma src section
    {
        mul (&S0, &S1, &S2);             // read from S0 and S1, write to S2
    }
    #pragma src section
    {
        for (i=0; i<2*N; i++)
            get_stream (&S2, &C[i]);     // take from S2 and write to C[i]
    }
Gaj 31 MAPLD 2005/1016
Process Process
Gaj 32 MAPLD 2005/1016
Multiprecision Integer Library Generator
[Figure: the Multiprecision Integer Library Generator (C engine) takes the size of operands, N, as input and produces a C/VHDL wrapper, a black box file, an info file, and an in-line MAP C function.]
Gaj 33 MAPLD 2005/1016
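Inline MAP C function for N=2
    int mul (int64_t *A, int64_t *B, int64_t *C, int N)
    {
        int64_t A0, A1;
        int64_t B0, B1;
        int64_t C0, C1, C2, C3;

        /* unpack the two 64-bit words of each operand */
        A0 = A[0];
        A1 = A[1];
        B0 = B[0];
        B1 = B[1];
        Mul_128(A0, A1, B0, B1, &C0, &C1, &C2, &C3);
        /* pack the four 64-bit words of the product */
        C[0] = C0;
        C[1] = C1;
        C[2] = C2;
        C[3] = C3;
    }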
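The Mul_128 call hides the arithmetic itself. As a rough software reference for it, schoolbook multiplication of n-word integers can be sketched as follows. The names mp_mul and mul64 are illustrative and do not come from the SRC libraries. The sketch assumes little-endian word order (word 0 is least significant) and uses unsigned 64-bit words rather than the int64_t shown on the slide.

```c
#include <assert.h>
#include <stdint.h>

/* 64 x 64 -> 128-bit product built from 32-bit halves (portable C99). */
static void mul64(uint64_t a, uint64_t b, uint64_t *hi, uint64_t *lo)
{
    uint64_t a_lo = a & 0xFFFFFFFFu, a_hi = a >> 32;
    uint64_t b_lo = b & 0xFFFFFFFFu, b_hi = b >> 32;
    uint64_t p0 = a_lo * b_lo, p1 = a_lo * b_hi;
    uint64_t p2 = a_hi * b_lo, p3 = a_hi * b_hi;
    uint64_t mid = (p0 >> 32) + (p1 & 0xFFFFFFFFu) + (p2 & 0xFFFFFFFFu);
    *lo = (mid << 32) | (p0 & 0xFFFFFFFFu);
    *hi = p3 + (p1 >> 32) + (p2 >> 32) + (mid >> 32);
}

/* c[0..2n-1] = a[0..n-1] * b[0..n-1], schoolbook method. */
void mp_mul(const uint64_t *a, const uint64_t *b, uint64_t *c, int n)
{
    assert(n > 0);
    for (int i = 0; i < 2 * n; i++)
        c[i] = 0;
    for (int i = 0; i < n; i++) {
        uint64_t carry = 0;
        for (int j = 0; j < n; j++) {
            uint64_t hi, lo;
            mul64(a[i], b[j], &hi, &lo);
            lo += c[i + j];
            if (lo < c[i + j]) hi++;   /* carry out of the first add */
            lo += carry;
            if (lo < carry) hi++;      /* carry out of the second add */
            c[i + j] = lo;
            carry = hi;
        }
        c[i + n] = carry;
    }
}
```

With n = 2 this covers the same 128-bit by 128-bit case as the inline function, and the same routine works for any n, which is the generality the fixed-size macro lacks.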
Gaj 34 MAPLD 2005/1016
Pros and cons of both methods
1. Fixed-size interface to a macro
Pros: Interface independent of the operand size
Cons: input/output overhead
2. Variable-size interface to a macro cell
Pros: minimum overhead
Cons: need to generate automatically several macro files,
need for changes in the compiler
Gaj 35 MAPLD 2005/1016
GMU/GWU Libraries
Gaj 36 MAPLD 2005/1016
Cryptographic Libraries
Secret Key Ciphers
• Secret key ciphers encryption and breaking - SecCiph
Public Key Ciphers
• Elliptic Curve Cryptosystems arithmetic - ECC
• Binary Galois Field GF(2^m) arithmetic in Polynomial Basis - GF2n_PB
• Binary Galois Field GF(2^m) arithmetic in Normal Basis - GF2n_NB
• Multiprecision integer arithmetic (in collaboration with University of South Carolina) - Long_Int
• Operations supporting factorization of large integers using Number Field Sieve - NFS
Gaj 37 MAPLD 2005/1016
Digital Image Processing Libraries
Image Enhancement / Restoration
    Single-Resolution
        Noise Reduction (Convolution Filtering)
            Smoothing (Lowpass)
            Gaussian (Lowpass)
            Blurring (Lowpass)
            Sharpening (Highpass)
        Edge Detection (Derivative Filters)
            Prewitt
            Sobel
    Multi-Resolution
        Discrete Wavelet Transform (DWT)
        Inverse Discrete Wavelet Transform (IDWT)
Similarity Measures
    Correlation
Gaj 38 MAPLD 2005/1016
Miscellaneous Libraries
Sorting
Stream-searching
BMM - Bit Matrix Multiply
DARPA benchmarks
Gaj 39 MAPLD 2005/1016
Performance of selected
applications based on GMU/GWU
libraries
Gaj 40 MAPLD 2005/1016
Classes of applications
1. input/output intensive applications
• bulk data encryption (DES, IDEA, and RC5 encryption)
• image processing (Sobel Edge Detection, Median Filter, Wavelet Hyperspectral Dimension Reduction)
2. computationally intensive applications
• secret-key cipher breaking based on the exhaustive key search (DES, IDEA, RC5 breakers)
• public-key cipher breaking based on factoring
3. latency-critical applications
• cipher key agreement and signature (ECC schemes, RSA)
Gaj 41 MAPLD 2005/1016
Reference Platform
PC based on Pentium IV, 2.4 GHz clock, 512 MB of RAM, 512 KB of cache
Treated as a basic building block of a cluster of microprocessor boards.
Platform used in experiments
SRC-6E from SRC Computers, Inc.
Gaj 42 MAPLD 2005/1016
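Timing Measurements
[Figure: timeline of a call from the .c file to the MAP function in the .mc file. Phases shown: MAP allocation (MAP Alloc.), FPGA configuration (FPGA Configure), DMA data in, FPGA computation, DMA data out, and MAP release (MAP Free). Marked intervals: MAP allocation time, configuration time, end-to-end time (HW), MAP release time, and end-to-end time (SW).]
MAP: SRC Reconfigurable Processor based on two User FPGAs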
Gaj 43 MAPLD 2005/1016
Input/Output Intensive Applications, P3 version of SRC-6E
Throughput in Mbits/s

Application                  Computational  Data transfer in  Data transfer out  End-to-end  End-to-end    Speedup
                             (SRC 6E)       (SRC 6E)          (SRC 6E)           (SRC 6E)    (Pentium IV)
DES Encryption               6,398          2,488             1,705              863         58            14.9
IDEA Encryption              12,788         2,487             1,799              938         165           5.7
RC5 Encryption               6,398          2,505             1,590              836         366           2.3
Sobel Edge Detection         5,680          2,493             1,701              849         76            11.0
Median Filter                5,681          2,484             1,710              850         5             170
Wavelet Hyperspectral        6,395          2,573             1,477              818         67 - 159      5 - 12
Dimension Reduction

Pentium IV range for the wavelet row: 5 levels - 1 level. Speedup range: 1 level - 5 levels.
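The speedup column is the ratio of the two end-to-end throughputs; for DES, 863 / 58 is about 14.9. The sketch below reproduces that ratio and adds a simple estimate that is an assumption of this note, not something stated on the slides: if data transfer in, computation, and data transfer out ran strictly one after another with no other overhead, the end-to-end throughput would be the reciprocal of the summed reciprocals. The function names are illustrative.

```c
#include <assert.h>

/* Speedup as the ratio of two end-to-end throughputs in the same units. */
double speedup(double src_end_to_end, double reference_end_to_end)
{
    assert(reference_end_to_end > 0.0);
    return src_end_to_end / reference_end_to_end;
}

/* ASSUMPTION: the three phases are strictly serial and nothing else
 * costs time, so the per-bit times simply add up. */
double serial_throughput(double transfer_in, double compute, double transfer_out)
{
    assert(transfer_in > 0.0 && compute > 0.0 && transfer_out > 0.0);
    return 1.0 / (1.0 / transfer_in + 1.0 / compute + 1.0 / transfer_out);
}
```

For the DES row, serial_throughput(2488, 6398, 1705) gives roughly 874 Mbits/s, slightly above the measured 863 Mbits/s, which is consistent with the measurement including some overhead beyond the three phases.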
Gaj 44 MAPLD 2005/1016
Wavelet Hyperspectral Dimension Reduction
Time contributions
P3 version of SRC-6E vs. Pentium IV PC
Gaj 45 MAPLD 2005/1016
Input/Output Intensive Applications, P4 version of SRC-6E
Throughput in Mbits/s

Application                  Computational  Data transfer in  Data transfer out  End-to-end  End-to-end    Speedup
                             (SRC 6E)       (SRC 6E)          (SRC 6E)           (SRC 6E)    (Pentium IV)
IDEA Encryption              12,790         10,627            10,583             3,479       165           21
RC5 Encryption               6,398          6,371             6,373              2,098       366           5.7
Sobel Edge Detection         5,683          6,384             6,380              2,044       76            27
Median Filter                5,684          6,384             6,383              2,044       5             409
Wavelet Hyperspectral        6,394          6,349             3,185              1,626       67 - 159      10 - 24
Dimension Reduction

Pentium IV range for the wavelet row: 5 levels - 1 level. Speedup range: 1 level - 5 levels.
Gaj 46 MAPLD 2005/1016
Wavelet Hyperspectral Dimension Reduction
Time contributions
P4 version of SRC-6E vs. Pentium IV PC
Gaj 47 MAPLD 2005/1016
Input/Output Intensive Applications, P4 version of SRC-6E
without and with overlapping computations and data transfers
Throughput in Mbits/s

Application                          Computational  Data transfer in  Data transfer out  End-to-end  End-to-end    Speedup
                                     (SRC 6E)       (SRC 6E)          (SRC 6E)           (SRC 6E)    (Pentium IV)
IDEA Encryption (no overlapping)     12,790         10,627            10,583             3,479       165           21
IDEA Encryption (with overlapping)   10,857         9,792             10,564             4,887       165           30
RC5 Encryption (no overlapping)      6,398          6,371             6,373              2,098       366           5.7
RC5 Encryption (with overlapping)    6,398          6,372             6,349              3,110       366           8.5
Gaj 48 MAPLD 2005/1016
Input/Output Intensive Applications, SRC Hi-Bar Based System
Throughput in Mbits/s

Application                          Computational  Data transfer in  Data transfer out  End-to-end  End-to-end    Speedup
                                     (SRC 6)        (SRC 6)           (SRC 6)            (SRC 6)     (Pentium IV)
DES Encryption (no overlapping)      19,200         11,350            10,760             4,240       58            73
IDEA Encryption (no overlapping)     19,200         11,350            10,760             4,240       165           26
RC5 Encryption (no overlapping)      19,200         11,350            10,760             4,240       366           12
Gaj 49 MAPLD 2005/1016
Computationally Intensive Applications, P3 version of SRC-6E
Throughput in mln keys/s

Application    Computational  Data transfer in  Data transfer out  End-to-end  End-to-end    Speedup
               (SRC 6E)       (SRC 6E)          (SRC 6E)           (SRC 6E)    (Pentium IV)
DES Breaker    800            N/A               N/A                800         0.469         1706
IDEA Breaker   1000           N/A               N/A                500         1.701         294
RC5 Breaker    100            N/A               N/A                100         0.516         194
Gaj 50 MAPLD 2005/1016
Latency-Critical Applications
Latency in μs

Application                          Computational  Data transfer in  Data transfer out  End-to-end  End-to-end    Speedup
                                     (SRC 6E)       (SRC 6E)          (SRC 6E)           (SRC 6E)    (Pentium IV)
ECC DH Key Agreement over GF(2^233), 201            39                17                 592         364,000       615
Optimal Normal Basis
ECC DH Key Agreement over GF(2^233), 560            66                7                  943         31,050        33
Polynomial Basis
Gaj 51 MAPLD 2005/1016
RSA: SRC vs. OpenSSL Software Comparison

Data Size   SW Function Time (ms)   SW Speedup vs. MAP SW
1024        47.248                  4.821 x
1536        138.466                 3.642 x
2048        269.948                 3.321 x
3072        853.050                 3.468 x
4096        1755.266                3.624 x
Gaj 52 MAPLD 2005/1016
Sparse matrix by vector multiplication

Matrix Size            K    One Multiplication Time  One Multiplication Time  Speedup
                            in SW (ns)               in HW (ns)
144x144 (Mesh 12x12)   70   3440                     12                       282

Reference Optimized SW Implementation:
PC, Pentium IV, 2.768 GHz, 1 GB RAM
Gaj 53 MAPLD 2005/1016
Summary &
Conclusions
Gaj 54 MAPLD 2005/1016
Summary

Type of application                                         End-to-end speed-up of SRC vs. P4
Computationally intensive (cipher breaking)                 200-1700
Latency critical
    RSA                                                     0.2-0.3
    ECC polynomial bases, general fields                    33
    ECC polynomial bases, special fields                    12-27
    ECC optimal normal bases                                600
Input/output intensive (secret key encryption/decryption)   3-30
Gaj 55 MAPLD 2005/1016
Summary & conclusions (1)
General methodology for the design and maintenance of SRC user libraries developed and tested
Existing libraries evaluated for three wide classes of applications in terms of
    - performance
    - ease of use
    - flexibility
Initial results very encouraging
Gaj 56 MAPLD 2005/1016
Summary & conclusions (2)
Selected files from the SRC libraries can be used for development of comparable libraries for other reconfigurable computers
Full compatibility with other reconfigurable computers difficult to achieve because of the technical differences and intellectual property constraints