Overlays: a solution paradigm for FPGA high-level design?
Tarek S. Abdelrahman
The Edward S. Rogers Department of Electrical and Computer Engineering
University of Toronto
Reconfigurable Systems on the Rise
• FPGAs are increasingly integrated in computing systems
– Massive parallelism can lead to high performance
– Lower power
– Customizability
• Newer generations of high-performance systems integrate FPGAs with multicores, targeting data centers
– Example systems from Intel, IBM and Xilinx
– Used mainly by software developers
16-07-11 2
FPGA Programmability Burdens
• FPGAs are programmed using a hardware design abstraction, which is foreign to the bulk of software developers
– HDL, timing, fitting, seed sweeps, etc.
• FPGA development tools lead to extremely long development cycles compared to their software counterparts
– A large circuit can take days to compile (synthesis, place, route, time, etc.) and may need several compiles
• There is a pressing need to alleviate these burdens and make FPGA design accessible to software developers
Tackling the Burden
• High-Level Synthesis (HLS)
– Generated hardware increasingly competitive with HDL design
• High-level programming models
– Dataflow model from Maxeler
• Nonetheless:
– Developer remains exposed to various aspects of hardware design
– Use of FPGA design tools is still required! ⇒ long development cycles
Overlays
• Pre-compiled FPGA circuits that are in themselves configurable/programmable, i.e., run-time configurable
– Examples: soft processors, GPU-on-FPGA, mesh-of-FUs, etc.
[Figure: example overlays: a soft processor, a soft GPGPU, and a 3x3 mesh of PEs. Source: Andryc et al.: FlexGrip: A Soft GPGPU for FPGAs, FPT13]
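FPGA vs. Overlay Design Flows
[Figure: the FPGA flow takes an application (HDL) through the FPGA design tools to a bitstream for the FPGA; the overlay flow takes an application (C, CUDA, DFG, etc.) through application-to-overlay tools to a configuration stream for a pre-compiled overlay on the FPGA. Timing labels: seconds, hours/days, µseconds. Scale: harder to simpler.]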
Mesh-of-FUs Overlays [FPL2013]
[Figure: a 4-NN connected array of cells, each containing a function unit (ADD, SUB, MUL, DIV, EXP, SHF) and routing logic, shown next to a data flow graph with inputs I1 to I6, nodes A to E (ADD, SUB, SUB, MUL, DIV) and outputs O1 and O2]
Mapping DFGs to Overlay – Place
[Figure: the data flow graph nodes A to E (ADD, SUB, SUB, MUL, DIV) are placed on overlay cells with matching function units, together with inputs I1 to I6 and outputs O1 and O2]
Mapping DFGs to Overlay – Route
[Figure: the placed data flow graph on a 3x3 overlay (ADD SUB SUB / ADD MUL ADD / DIV EXP SHF) with the connections between nodes A to E, inputs I1 to I6 and outputs O1 and O2 routed through the mesh; label: pipeline register/FIFO]
Pipelined Execution
[Figure: the routed data flow graph executing on the 3x3 overlay (ADD SUB SUB / ADD MUL ADD / DIV EXP SHF); label: pipeline register/FIFO]
Mesh-of-FUs Tools
• Application-to-overlay tool chain that:
– Extracts DFG of bodies of parallel loops in C code
– Places and routes the DFG nodes onto the overlay
  • Configures the switches to establish DFG connectivity
  • Generates the configuration stream of the overlay
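The place and route steps can be illustrated with a small sketch. This is not the tool chain from the slides: the 3x3 layout mirrors the overlay drawn in the mapping figures, but the DFG edges, the exhaustive placement search and the L-shaped routing are assumptions chosen for brevity, and channel capacity, FIFOs and the configuration stream are ignored.

```python
from itertools import permutations

# Hypothetical 3x3 overlay: the FU type fixed in each cell.
MESH = [["ADD", "SUB", "SUB"],
        ["ADD", "MUL", "ADD"],
        ["DIV", "EXP", "SHF"]]

# Hypothetical DFG: node -> operation, plus producer -> consumer edges.
DFG_OPS = {"A": "ADD", "B": "SUB", "C": "SUB", "D": "MUL", "E": "DIV"}
DFG_EDGES = [("A", "D"), ("B", "D"), ("C", "E"), ("D", "E")]


def manhattan(p, q):
    """Hop count between two cells of a 4-NN connected mesh."""
    return abs(p[0] - q[0]) + abs(p[1] - q[1])


def place(mesh, ops, edges):
    """Try every assignment of nodes to cells with a matching FU and
    keep the one with the smallest total hop count over all edges."""
    cells = [(r, c) for r in range(len(mesh)) for c in range(len(mesh[0]))]
    nodes = list(ops)
    best, best_cost = None, None
    for combo in permutations(cells, len(nodes)):
        if any(mesh[r][c] != ops[n] for n, (r, c) in zip(nodes, combo)):
            continue  # FU type in the cell does not match the node
        placement = dict(zip(nodes, combo))
        cost = sum(manhattan(placement[u], placement[v]) for u, v in edges)
        if best_cost is None or cost < best_cost:
            best, best_cost = placement, cost
    return best, best_cost


def route(placement, edges):
    """Route each edge as an L-shaped path: along rows, then columns."""
    routes = {}
    for u, v in edges:
        (r, c), (tr, tc) = placement[u], placement[v]
        path = [(r, c)]
        while r != tr:
            r += 1 if tr > r else -1
            path.append((r, c))
        while c != tc:
            c += 1 if tc > c else -1
            path.append((r, c))
        routes[(u, v)] = path
    return routes


placement, cost = place(MESH, DFG_OPS, DFG_EDGES)
routes = route(placement, DFG_EDGES)
print(placement, cost)
```

Exhaustive search is only workable for a mesh this small; a real tool for a 288-cell overlay would need a heuristic placer and a congestion-aware router.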
High Performance with no Hardware Design
                            Overlay                       HDL
DFG           Size (nodes)  GFLOPS  Compile Time (sec)    GFLOPS  Compile Time (sec)
n-Body        125           18.72   0.44                  21.52   2724
BlackScholes  131           21.22   1.33                  22.10   2508
MatMul        96            19.66   1.05                  25.21   2045
MatMulAdd     114           22.46   3.80                  28.79   919
• Example mesh-of-FUs overlay on a Stratix IV [FPL2013]
– Single precision floating point operations
– 288 cells implemented as an 18x16 mesh
– fMAX of 312 MHz and 32.4 GFLOPS peak (integer at 415 MHz)
• Others also report high performance results
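To make the trade-off in the table concrete, the sketch below computes, per benchmark, how many times faster the overlay compiles and what fraction of the HDL throughput it retains. The numbers are transcribed from the table; pairing each overlay row with the HDL row in the same position is an assumption about the original layout.

```python
# (overlay GFLOPS, overlay compile sec, HDL GFLOPS, HDL compile sec)
# Row pairing between the overlay and HDL columns is assumed.
RESULTS = {
    "n-Body":       (18.72, 0.44, 21.52, 2724),
    "BlackScholes": (21.22, 1.33, 22.10, 2508),
    "MatMul":       (19.66, 1.05, 25.21, 2045),
    "MatMulAdd":    (22.46, 3.80, 28.79, 919),
}


def compare(results):
    """Return {name: (compile-time speedup, fraction of HDL GFLOPS)}."""
    summary = {}
    for name, (ov_gflops, ov_sec, hdl_gflops, hdl_sec) in results.items():
        summary[name] = (hdl_sec / ov_sec, ov_gflops / hdl_gflops)
    return summary


for name, (speedup, fraction) in compare(RESULTS).items():
    print(f"{name}: compiles {speedup:.0f}x faster, "
          f"{fraction:.0%} of HDL throughput")
```

Under that pairing, compile times drop by two to three orders of magnitude while the overlay keeps most of the HDL throughput.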
Software-Friendly Target
• Overlays raise the level of abstraction of using FPGAs to one that is more familiar to software designers
– C programming for a soft processor
– CUDA/OpenCL for GPU overlays
– Dataflow graphs for mesh-of-FUs
• This opens up opportunities for “standard” software tools to target FPGAs
JIT Compilation to Hardware
• Profile code
• Identify hot segments of code
• Extract DFG and configure the overlay
• Re-write the code
• Transfer execution to the overlay

Original code on the CPU:
    :
    ADD  R9,R7,R10
    BEQZ end
L1: ADD  R1,R3,R7
    MULT R11,R12,R13
    ADD  R8,R1,R11
    SUB  R9,R8,#8
    SLT  R8,R9,R7
    BNZ  R8,L1
    ADD  R7,R6,R1
    :

Re-written code on the CPU:
    :
    ADD  R9,R7,R10
    BEQZ end
L1: ADD  R7,R6,R1
    :

[Figure: a CPU next to an FPGA overlay drawn as a grid of FUs; the DFG extracted from the loop (ADD, MULT, ADD, SUB, SLT) is mapped onto the FUs]
User-Transparent Dynamic Program Acceleration
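The "extract DFG" step can be sketched for the loop body shown in the slides. This is an illustration only, not the prototype's implementation: it assumes the first operand of each instruction is the destination register, treats `#` operands as immediates, leaves out the loop branch, and ignores loop-carried dependences.

```python
# Loop body from the slides, without the closing branch (BNZ R8,L1).
LOOP_BODY = [
    "ADD R1,R3,R7",
    "MULT R11,R12,R13",
    "ADD R8,R1,R11",
    "SUB R9,R8,#8",
    "SLT R8,R9,R7",
]


def extract_dfg(instructions):
    """Build a DFG from straight-line code using register def-use chains.

    Returns (nodes, edges, live_ins): nodes[i] is the opcode of
    instruction i, edges are (producer, consumer) index pairs, and
    live_ins are registers read before any instruction writes them.
    """
    last_def = {}  # register -> index of the instruction that last wrote it
    nodes, edges, live_ins = [], [], set()
    for i, text in enumerate(instructions):
        opcode, operands = text.split(None, 1)
        dest, *sources = [s.strip() for s in operands.split(",")]
        nodes.append(opcode)
        for src in sources:
            if src.startswith("#"):
                continue  # immediate operand, not a register
            if src in last_def:
                edges.append((last_def[src], i))
            else:
                live_ins.add(src)  # value comes from outside the loop body
        last_def[dest] = i
    return nodes, edges, live_ins


nodes, edges, live_ins = extract_dfg(LOOP_BODY)
print(nodes)
print(edges)
print(sorted(live_ins))
```

The five nodes match the operations drawn on the overlay in the slides (ADD, MULT, ADD, SUB, SLT); the live-in registers are the values the re-written CPU code would have to hand to the overlay.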
A Prototype JIT Compiler
• Target: Intel QuickAssist Platform
• The compiler prototype:
– Built around LLVM, targets innermost loops of scientific code
– Mitigates run-time overhead by moving much of it to compile time
• Overlay currently being integrated into the target platform
[Figure, after Intel literature: a Xeon multicore processor (CPUs) and a Stratix FPGA (QPI IP, AFU) connected by the QPI coherent interconnect, with system memory]
Acceleration Potential
[Figure: speedup (y-axis, 0 to 7) per application (x-axis)]
FPGA simulation results based on measured system parameters
Customizability
• One of the key advantages of FPGAs is that they can be customized for applications
• Overlays can also be “user” customizable
– With minimal usage of FPGA design tools
• In the context of our mesh-of-FUs, we can vary the choice of the FU at each location of the mesh, i.e., the functional layout, to make the overlay more efficient for an application
A Library-Based Approach
[Figure: a desired overlay (4x4 layout with rows A M D D, A M S S, A M A M, A M A M) is assembled from a library of pre-placed and pre-routed overlays, which are stitched together into the stitched overlay]
• Bottom-up flow allows (restricted) relocation of pre-placed and pre-routed groups of cells [FPL2014]
• Example 12x15 overlay: 35 minutes vs. 15 hours
Program Analysis for Customization
[Figure: program analysis takes a program DFG (A, M, S and D operations) and produces candidate overlays, e.g., a 4x4 layout with rows A M D D, A M S S, A M A M, A M A M]
Work-to-be-done
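The slides leave this analysis as future work, so the following is only one conceivable starting point, not the authors' method: count the operations in a program DFG and prefer the candidate layout that is missing the fewest FUs of each type. The operation list, the candidate names and the second layout are all hypothetical.

```python
from collections import Counter

# Hypothetical operation mix of a program DFG (A, M, S, D as in the figure).
PROGRAM_OPS = ["A", "M", "A", "M", "S", "S", "M", "D", "A", "D"]

# Hypothetical candidate layouts; "mixed" copies the 4x4 layout in the figure.
CANDIDATES = {
    "mixed": [["A", "M", "D", "D"],
              ["A", "M", "S", "S"],
              ["A", "M", "A", "M"],
              ["A", "M", "A", "M"]],
    "add_mul_only": [["A", "M", "A", "M"],
                     ["A", "M", "A", "M"],
                     ["A", "M", "A", "M"],
                     ["A", "M", "A", "M"]],
}


def fu_shortfall(dfg_ops, layout):
    """Number of DFG operations with no FU of the right type available."""
    need = Counter(dfg_ops)
    have = Counter(fu for row in layout for fu in row)
    return sum(max(0, count - have[op]) for op, count in need.items())


def pick_overlay(dfg_ops, candidates):
    """Return the name of the candidate with the smallest shortfall."""
    return min(candidates, key=lambda name: fu_shortfall(dfg_ops, candidates[name]))


print(pick_overlay(PROGRAM_OPS, CANDIDATES))
```

A usable analysis would also have to account for placement and routing feasibility, which an operation count alone cannot capture.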
System Integration
• Must be able to virtualize the FPGA
– Take snapshots
– Migrate
– Share and manage as a resource
[Figure: applications running on frameworks such as Spark, Hadoop, GraphLab and TensorFlow, on VMs hosted by nodes that each contain CPUs, GPUs and FPGAs]
Overlays Facilitate Virtualization
• FPGA virtualization only now being explored
– Requires specialized hardware
– A very large “state”
• Overlays naturally have a much smaller state, facilitating snapshots and context switching
– We would like to explore this support in our mesh-of-FUs overlay
Challenges to Overlays
• Resource overhead
– That is, the FPGA resources used by the overlay compared to a dedicated circuit (HDL) that implements the same application
– ~4X for our FP overlay and can be higher
– Difficult to quantify design effort
– FPGAs are increasing in size
– Hard floating point units
– Hardening the overlay once design is over?
Challenges to Overlays – Cont’d
• Overlay architectures need more exploration: which architecture for a given application domain
– How to ensure scalability?
– Taking into account the underlying FPGA device constraints
– How to implement well (e.g., data-driven execution, FIFOs, etc.)?
– Fixed function vs. multi-function FUs?
– How to reduce resource overhead?
– Time multiplexed?
– Multiple devices?
Challenges to Overlays – Cont’d
• Evolving the FPGA design tools
– Modular architectures do not lead to modular circuits
  • The tools do not understand the modularity
  • At present we must “fight with them” [FPL2014]
– The tools must evolve to allow developers to express and to recognize the modularity of the architecture
  • Scalable circuits from scalable architectures
Concluding Remarks
• A case for overlays
– Performance, software-friendliness, customizability and system integration
• They can serve as “middle ground” between hardware design and software programming
– Either for production or for debugging and prototyping
• Challenges to architecture, programming models, implementation and resource overhead
Questions?