OAK RIDGE NATIONAL LABORATORY
U.S. DEPARTMENT OF ENERGY
Scientific Computing Beyond CPUs: FPGA Implementations of Common Scientific Kernels

Melissa C. Smith(1), Jeffrey S. Vetter(2), Sadaf R. Alam(2), Sreesa Akella(3), Luis Cordova(3)

(1) Engineering Science and Technology Division, ORNL
(2) Computer Science and Mathematics Division, ORNL
(3) University of South Carolina

September 2005
Smith, MAPLD 2005/187
Outline
- Introduction & Motivation
- Candidate Kernels/Apps & Implementation
- Results
- Function Library
- Lessons Learned
- Conclusions
Introduction

Traditional Computing
- Hardware development struggling to keep pace with analysis needs
- Reaching limits on computing speed due to I/O bandwidth and the clock wall
- Managing heat dissipation becoming increasingly difficult

Reconfigurable Computing (RC) with FPGAs
- Faster execution and lower power consumption, all at slower clock speeds
- Exploits inherent parallelism in algorithms
- Matches computation to application data flow (i.e., data-flow graph theory)
- Hardware-like speed with software-like flexibility that can adapt to the needs of the application
- Gate densities now suitable for 64-bit floating point
[Figure: FPGA vs. microprocessor performance-improvement chart, 1996-2005, projecting gains from 1x to 10000x for FPGAs. Image courtesy of SRC.]
Motivation
Many scientific applications at ORNL and elsewhere depend on double-precision operations

Kernel selection and classification:
- compute intensive
- common among many relevant applications
- candidate for hardware implementation

Interface to legacy code (FORTRAN & C) is extremely important

The memory bottleneck in conventional memory hierarchies is throttling performance for scientific applications

With this knowledge:
- Can users harness reconfigurable hardware without (a) becoming hardware experts and (b) completely re-writing their code?
- Can we develop function libraries such as BLAS, VSIPL, or others without loss of generality?
Candidate Kernels & Applications
Initial studies
- Kernels
  - Dense matrix operations (e.g., DGEMM)
  - Sparse matrix operations
- Climate
  - PSTSWM
- Bioinformatics
  - BLAST
  - Fragment assembly
- Molecular dynamics
  - AMBER
  - LAMMPS

(We cannot cover all application studies today.)
DGEMM & SGEMM
The BLAS routines SGEMM and DGEMM perform the matrix-matrix operation:

    C = alpha*A*B + beta*C

where alpha and beta are scalars, and A, B, and C are matrices (A is m x k, B is k x n, and C is m x n).
What makes them difficult and interesting:
- Memory communication bottleneck (limited bandwidth)
- Local storage limitation (for both sequential & parallel machines)

Answer: exploit data reusability and data flow with FPGAs
Implementation

- Fully utilize both user FPGAs (XC2V6000) of the SRC MAPstation
- DGEMM: 12 MAC units per FPGA (SGEMM: 25 MAC units per FPGA)
- Geared to handle arbitrary-size matrices up to 1024x1024
- Matrix operations occur in blocks

How to count FLOPS?
- The FPGA algorithm performs more FLOPS than an efficient software implementation
- Takes advantage of the data-flow architecture
- Later referred to as alternate FLOPS

[Figure: a 6x6 grid of matrix blocks (A00-A55) illustrating the blocking scheme, with the 2x2 sub-blocks A00, A01, A10, A11 highlighted.]
Implementation – Stage 0

Calculations are conducted in two stages; the two FPGAs exchange ownership of the matrix B blocks.

[Figure: FPGA0 and FPGA1 attached to six On-Board Memory (OBM) banks at 800 MB/s per bank. Bank A holds A00, A01; bank B holds B00, B10; bank C holds C00, C01; bank D holds A10, A11; bank E holds B01, B11; bank F holds C10, C11.]
Implementation – Stage 1

In stage two, the two FPGAs have exchanged ownership of the matrix B blocks.

[Figure: the same FPGA/OBM-bank layout as Stage 0, with ownership of the B blocks (banks B and E) swapped between FPGA0 and FPGA1.]
DGEMM Analysis

[Figure: two plots vs. matrix dimension N (0-900) comparing ATLAS cblas_dgemm() (-O3) on the SRC dual-Xeon host against the FPGA implementation measured with RDTSC, measured with FPGA counters, and computation-only with FPGA counters, plus projections for a 2x-faster FPGA. Left: performance in MFlops/s (0 to 7000); right: time to solution in seconds (0 to 0.6), including a projection for a 2x-faster FPGA with 2x-faster data transfer.]

- Data transfer time in/out of hardware is significant and takes away from "time to solution"; hence the interest in other memory systems such as those used in systems by Cray and SGI
- Faster and/or denser FPGAs can significantly improve performance and "time to solution"
- Performance and "time to solution" could potentially be improved with "DMA streaming" of data
[Figure: FPGA vs. microprocessor performance-improvement chart, 1996-2005, repeated from the introduction. Image courtesy of SRC.]
FPGA Opportunity and Potential
Our results using SRC CARTE v1.8
- Dual Xilinx XC2V6000
- 12 64-bit MACs @ 100 MHz (or 25 32-bit MACs)
- 3.5 GFlops (5.3 GFlops alternate FLOPS)

Dou et al.'s results using a hardware description language
- Xilinx XC2VP125-7
- 39 64-bit MACs @ 200 MHz
- 15.6 GFlops

Parts available on the Cray XD1
- Xilinx XC2VP50-7 x 6 nodes
- Up to 200 MHz
- Conservative estimate: 18 64-bit MACs -> 7.2 GFlops per node
- Full utilization of all 6 nodes: potentially 43.2 GFlops

GFlops/MAC ratios: MAPstation = 0.44, Dou's = 0.4
Building Function Libraries for RC
Goal: to assemble a library of user-friendly, familiar, and pertinent scientific functions

Initial functions identified:
- BLAS Level 2 and Level 3 (e.g., DGEMM/SGEMM)
- Sparse matrix operations
- FFT and 3D-FFT
- Bioinformatics query functions

[Figure: layered diagram showing MD, climate, and bioinformatics applications sitting on top of library functions: BLAS, FFT, SpMatVec, iterative solvers, and queries.]
Sparse Matrix Vector Operations
Used in iterative solvers for linear systems

Not efficient on general-purpose microprocessor systems:
- High cache-miss rate due to poor data locality
- Low utilization of the floating-point unit due to a high ratio of load/store to floating-point operations

RC advantage:
- Avoid cache misses with high on-chip and off-chip memory bandwidth
- Local distributed memory banks
- High-density FPGAs
- High-speed host-to-FPGA communication

[Figure: an array of MAC units, each multiplying a non-zero element NZ[i] by the gathered input-vector entry IV[CO[i]] and accumulating into the output vector OV. NZ = non-zero element vector, CO = column-indices vector, IV = input vector, OV = output vector.]

Investigating multiple storage formats (CSR, ELLPACK, and CSRPERM)
Candidate Application: Amber8 Acceleration Strategy
Identified regions of the Amber8 application using detailed profiling and modeling of the code:
- ew_direct.f
- veclib.f

Examining a strategy for mapping this routine onto SRC's two FPGAs

Also investigating acceleration of FFTs using FPGAs:
- ew_recip.f
- ew_fft.f
- pub_fft.f & passb2.f

[Figure: Amber8 call graph (main -> sander -> runmd -> force -> ...) annotated with source files, call counts, and profiled time shares of 73.14% (the O(N^2) direct-sum region on smaller problems), 11.22%, and 3.39%. The 3D FFT time worsens on parallel systems due to communication costs.]
3D FFTs in LAMMPS (Large Scale Atomic/Molecular Massively Parallel Simulator)
fft_3d() proceeds in three passes over an Nfast (1) x Nmid (2) x Nslow (3) grid, each pass a batch of 1D FFTs followed by a data remap:

  1. fftw (plan, total1/length1, data, 1/-1, length1, NULL, 0, 0)
     remap_3d (data, copy, scratch, pre_plan)
  2. fftw (plan, total2/length2, data, 1/-1, length2, NULL, 0, 0)
     remap_3d (data, copy, scratch, pre_plan)
  3. fftw (plan, total3/length3, data, 1/-1, length3, NULL, 0, 0)
     remap_3d (data, copy, scratch, pre_plan)

For a 3x3x3 example: total1/length1 = 1x3x3/3 = 3, total2/length2 = 3x1x3/3 = 3, and total3/length3 = 3x3x1/3 = 3; the 1/-1 argument selects forward or inverse.

[Figure: proposed single-/multi-MAP mapping: an fftw_orchestrator streams planes from OBM through BRAM and feeds an array of "fly" units via the GCM. The data will not necessarily fit on-chip, and there is a penalty for going off-chip; the remap stages are replaced by intelligent access and addressing.]

Depending on the data size, the FPGA implementation of the fftw will resemble its software counterpart, with improved performance and data reuse. The "fly" elements stand for FFT computation units of radix 2, 3, 4, and 5, each with a certain level of parallelism.
Bioinformatics – BLAST
BLAST: Basic Local Alignment Search Tool

Profiling of the NCBI source code to determine the time-consuming functions that could be targeted to FPGA is complete.

Currently investigating the best problem structure and domain for the given RC architecture and bandwidths (analysis of data streams, memory capacity, etc.)
Lessons Learned
Effective use of an HLL (such as the Carte tool used here) to design for FPGAs still requires some hardware knowledge:
- Memory limitations
- FPGA limitations
- 'Tricks' to take advantage of FPGA strengths
- 'Tricks' to take advantage of the RC architecture

Library development requires analysis to determine which functions are appropriate for FPGA implementation

The breakout level of library functions may not always be appropriate for RC implementation (still under investigation):
- Combine or fuse appropriate function calls to form larger functions with more computational weight
Status Review & Future Work
Consider these caveats:
- FPGA growth rates are exceeding those of general-purpose microprocessors
  - These FPGA implementations demonstrate performance with additional power and space savings vs. general-processor implementations
- We restricted our evaluation to compiler-transformed high-level languages
  - No manual VHDL coding
  - Performance comparable with VHDL techniques (adjusting for FPGA size & clock frequency)
- New higher-bandwidth RC architectures promise to dramatically reduce data-transfer costs
- Efforts in 64-bit floating-point computation are just beginning
  - Cores not widely available
- No common tools exist that identify candidate codes or regions in an application for acceleration
  - Must manually profile and model large, complex applications

We expect the performance advantages and applicability of these systems to only improve over the coming years.
Status Review & Future Work (cont.)
The ability to code in C or FORTRAN is a significant benefit for our users

Progress on several application areas:
- Initial studies completed with competitive performance
  - Kernels (dense & sparse matrix), climate
- Actively studying other fruitful areas
  - Molecular dynamics, bioinformatics

Future work will focus on:
- Maximum utilization of FPGA resources
- Additional function/kernel library development
- Resource management for multi-paradigm platforms
- Evaluations of other RC platforms (Cray XD1 and SGI)
End