
DNNExplorer: A Framework for Modeling and Exploring a Novel Paradigm of FPGA-based DNN Accelerator

Xiaofan Zhang1*, Hanchen Ye1*, Junsong Wang2, Yonghua Lin2, Jinjun Xiong3, Wen-mei Hwu1, Deming Chen1
1University of Illinois at Urbana-Champaign, 2Easy-visible, 3IBM Research

{xiaofan3, hanchen8, w-hwu, dchen}@illinois.edu, {junsong.wang, yonghua.lin}@easy-visible.com, [email protected]

ABSTRACT
Existing FPGA-based DNN accelerators typically fall into two design paradigms. Either they adopt a generic reusable architecture to support different DNN networks but leave some performance and efficiency on the table because of the sacrifice of design specificity; or they apply a layer-wise tailor-made architecture to optimize layer-specific demands for computation and resources but lose the scalability to adapt to a wide range of DNN networks. To overcome these drawbacks, this paper proposes a novel FPGA-based DNN accelerator design paradigm and its automation tool, called DNNExplorer, to enable fast exploration of various accelerator designs under the proposed paradigm and deliver optimized accelerator architectures for existing and emerging DNN networks. Three key techniques are essential for DNNExplorer's improved performance, better specificity, and scalability: (1) a unique accelerator design paradigm with both high-dimensional design space support and fine-grained adjustability; (2) a dynamic design space to accommodate different combinations of DNN workloads and targeted FPGAs; and (3) a design space exploration (DSE) engine to generate optimized accelerator architectures following the proposed paradigm by simultaneously considering both FPGAs' computation and memory resources and DNN networks' layer-wise characteristics and overall complexity. Experimental results show that, on the same FPGAs, accelerators generated by DNNExplorer deliver up to 4.2x higher performance (GOP/s) than the state-of-the-art layer-wise pipelined solutions generated by DNNBuilder [1] for a VGG-like DNN with 38 CONV layers. Compared to accelerators with generic reusable computation units, DNNExplorer achieves up to 2.0x and 4.4x higher DSP efficiency than a recently published accelerator design from academia (HybridDNN [2]) and a commercial DNN accelerator IP (Xilinx DPU [3]), respectively.

*These authors made equal contributions.

1 INTRODUCTION
Great successes have been achieved by deep neural networks (DNNs) in a massive number of real-life artificial intelligence (AI) applications. Such success has been driven in part by the continuous improvement of DNN models, which use deeper and more sophisticated layer interconnections to deliver state-of-the-art solutions [4-11]. With the impressive quality of results come DNN models' increasing requirements for computation and memory. This in turn puts stringent design constraints on any domain-specific hardware accelerator, which must not only deliver high inference accuracy but also satisfy inference speed, throughput, and energy efficiency targets.

There are rich studies on customized DNN accelerators that take advantage of different hardware devices, such as exploring kernel optimizations on GPUs and customizing accelerators on ASICs and FPGAs [12-16]. Because FPGAs offer the flexibility to meet stringent energy constraints along with fast time-to-market, FPGA-based solutions have recently become prominent choices for DNN acceleration with improved latency and energy efficiency [17-19].

By investigating recent FPGA-based solutions, we see two popular accelerator design paradigms: one adopts a generic reusable architecture for all DNN layers, while the other customizes the hardware implementation for each DNN layer. When following the first paradigm, such as the designs in [2, 20, 21], a generic computation unit is instantiated on the FPGA and all DNN layers are processed in a recurrent manner. When adopting the second paradigm, such as the designs in [1, 22, 23], a layer-wise pipeline architecture is implemented on the FPGA, where each DNN layer is handled by a dedicated pipeline stage; in other words, the accelerator is better customized for each layer. However, both paradigms may not be suitable for emerging DNN models, which involve more diverse layer configurations and deeper network structures. The first paradigm has difficulty optimizing DNNs with diverse layers that exhibit vastly different computation-to-communication (CTC) ratios, which degrades accelerator performance because such a generic architecture lacks fine-grained adjustments. The second paradigm has an obvious flaw as it requires a hardware instance for each pipeline stage: the more layers in the DNN model, the fewer resources for each stage, which eventually leads to lower performance.

To address these challenges, we propose DNNExplorer, a framework to model and explore DNN accelerator designs with the goal of producing a balanced DNN solution that keeps the advantages of the two aforementioned accelerator design paradigms while addressing their respective flaws. In summary, the main contributions of this paper are as follows.

(1) We introduce a new FPGA-based accelerator design paradigm with higher-dimensional design space support, better specificity, and improved scalability to effectively address the drawbacks of existing FPGA-based DNN accelerator designs.
(2) We propose an automation tool called DNNExplorer to enable fast architecture exploration under the proposed paradigm and deliver optimized accelerator designs in three steps: model analysis, performance modeling, and architecture exploration.
(3) We define a dynamic design space to capture more detailed configurations of the proposed accelerator paradigm. It helps generate a well-defined design space for architecture exploration to better accommodate different input DNN models and targeted FPGAs.



Figure 1: CTC (computation-to-communication) distribution in VGG-16 models (without FC layers) for 12 input resolution cases. Inputs are RGB images with 3 depth channels and the height and width listed as input size.

(4) We propose a two-level (global and local) automatic design space exploration (DSE) engine to efficiently find optimized accelerator architectures by simultaneously considering both FPGAs' computation and memory resources and DNN networks' layer-wise characteristics and overall complexity.
(5) We experimentally demonstrate that DNNExplorer delivers better FPGA-based DNN accelerators with 4.2x higher throughput compared to the pure pipeline design (DNNBuilder) and 4.4x higher DSP efficiency compared to a generic design (Xilinx DPU).

2 RELATED WORK
There are extensive studies on customized DNN accelerator design using FPGAs. To pursue higher energy efficiency, the accelerators in [17, 18] incorporate a dynamic data quantization scheme for both DNN parameters and intermediate feature maps to relax the compute and memory demands. The designs in [24, 25] support binary and ternary quantization, which intend to replace hardware-intensive floating-point multiplications with logical operations. Fast convolution (CONV) algorithms have been investigated to lower the computation complexity and speed up DNN inference, such as Winograd-based solutions, fast Fourier transform (FFT), and hybrid schemes with both spatial CONV and fast CONV algorithms [2, 26, 27]. A fine-grained layer-based pipeline architecture is proposed by DNNBuilder [1], which works with a resource allocation scheme to lower the end-to-end latency when running real-life DNN applications. Recently, more solutions have been developed with both hardware and software optimizations. For example, the authors in [28] apply DNN compression before mapping the DNNs onto FPGAs for higher speedup, while the authors in [11, 21] propose hardware-software co-design to accelerate resource-demanding DNN applications on embedded FPGAs.

Recently published literature also focuses on building automation tools for rapid performance estimation and implementation of FPGA-based DNN accelerators. The work in [29] proposes a unified representation for CONV and fully-connected (FC) layers to facilitate accelerator modeling on targeted FPGAs. A framework proposed in [30] incorporates systolic arrays to improve computing efficiency, while the design in [2] further improves the accelerator design by proposing processing engines with both spatial- and Winograd-CONV support. To help fully explore the available resources of the targeted FPGAs, the authors in [1] introduce an automatic resource allocation tool to generate fine-grained parallelism guidelines for its architecture. Researchers also employ High-Level Synthesis (HLS) in their tools to improve the design efficiency of FPGA-based hardware designs [31-35].

Figure 2: (a) The trends of DSP efficiency when running VGG16 with increasing input sizes on three representative FPGA-based DNN accelerators (batch size = 1); (b) Normalized throughput of the three accelerators when running VGG-like DNNs with 3×224×224 inputs and 13∼38 CONV layers. (The performance of each accelerator is normalized to its baseline case running the 13-layer DNN.)

3 CHALLENGES OF CURRENT SOLUTIONS
Under the first paradigm mentioned in Section 1, different DNN layers are handled by a generic compute unit. Although the majority of computation in DNNs comes from CONV layers, the differences among CONV layers may be significant. For example, Figure 1 shows the layer characteristics of 12 models in terms of their CTC ratio distribution. The medians of CTC show an upward trend with higher input resolutions. From 32×32 to 512×512 inputs, the CTC medians rapidly increase by nearly 256 times following the blue curve, which implies significantly different computation and memory-access patterns. It is also obvious that the CTC range, or variation, is large even within the same model. If a generic compute unit is used to process these layers, we have to accept sub-optimal solutions because of the lack of architecture specificity and limited design spaces. We use DSP efficiency (Eq. 1) to evaluate whether an accelerator is working efficiently, where α represents the number of multiply-accumulate (MAC) operations handled by one DSP in one clock cycle, i.e., α = 2 for 16-bit and α = 4 for 8-bit inputs.

EFFI_{DSP} = \frac{GOP/s}{\alpha \times DSP_{allocated} \times FREQ}  (1)
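For a quick sanity check of Eq. 1, the following Python sketch computes DSP efficiency from a measured throughput. The function name, the choice of GOP/s and GHz as units, and the example call are our own; the example numbers are taken from case 4 of Table 3.

```python
def dsp_efficiency(gops: float, alpha: int, dsp_allocated: int, freq_ghz: float) -> float:
    """Eq. (1): measured GOP/s over the theoretical peak of the allocated DSPs.

    alpha is the number of MAC operations one DSP finishes per cycle
    (2 for 16-bit, 4 for 8-bit inputs); freq_ghz keeps the peak in GOP/s.
    """
    peak_gops = alpha * dsp_allocated * freq_ghz
    return gops / peak_gops

# Example: 1702.3 GOP/s on 4444 DSPs at 0.2 GHz with 16-bit data (alpha = 2)
# gives roughly 0.958, i.e., ~95.8% DSP efficiency (case 4 in Table 3).
print(dsp_efficiency(1702.3, 2, 4444, 0.2))
```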

As shown in Figure 2 (a), both Xilinx DPU [3] and HybridDNN [2] (representing the first paradigm) suffer lower DSP efficiency (up to 64.9% and 53.7% degradation, respectively) compared to the dedicated designs from DNNBuilder [1] (representing the second paradigm). For accelerators following the second paradigm, separate pipeline stages are instantiated on the FPGA for each major DNN layer, and eventually all of them are combined into a pipelined implementation. This allows dedicated layer designs according to each layer's inherent characteristics (e.g., CTC ratio, computation and memory demands), which enables more fine-grained hardware configurations and more in-depth design space exploration. However, its scalability is easily restricted by the number of DNN layers it can support, as a deeper DNN means more pipeline stages and fewer resources for each layer; in this case, performance degradation is expected. Some intuitive examples can be found in Figure 2 (b), where accelerators need to handle 4 VGG-like DNNs with 13∼38 CONV layers. The performance of DNNBuilder decreases by 77.8% on the 38-layer model compared to the shallower network with 13 CONV layers. In contrast, the generic accelerators maintain a stable performance.


Figure 3: The proposed accelerator paradigm to overcome existing drawbacks when mapping DNNs onto FPGAs. It includes SP pipelined stages for layers 1 ∼ SP and a generic accelerator for the rest of the layers. SP is the split-point.

Table 1: Ratio of CTC variances between the first half (V1) and the second half (V2) of DNNs.

Network      Input Size   V1/V2    | Network       Input Size   V1/V2
AlexNet      3x227x227    185.8    | ResNet-18     3x224x224    1607.3
GoogLeNet    3x224x224    3622.8   | ResNet-50     3x224x224    998.7
InceptionV3  3x299x299    6210.6   | SqueezeNet    3x227x227    238.9
VGG-16       3x224x224    489.8    | MobileNet     3x224x224    3904.2
VGG-19       3x224x224    552.6    | MobileNetV2   3x224x224    251.5

4 DNNEXPLORER FRAMEWORK
4.1 The Proposed Novel Paradigm
To overcome these challenges, we propose a new accelerator paradigm to capture the essence of both existing paradigms while addressing their disadvantages. The proposed paradigm, shown in Figure 3, leverages both layer-dedicated and generic reusable designs. For the first part, we implement a pipeline accelerator to generate dedicated layer stages for the first SP layers. It helps guarantee higher DSP efficiency and more fine-grained resource allocation. For the second part, we adopt a generic architecture for the rest of the DNN layers so that deeper DNNs can be supported with the given resources.

To validate the proposed idea, we investigate 10 popular DNNs and summarize their CTC fluctuations in Table 1. Specifically, we divide every DNN into two halves: the first half covers the bottom layers (close to the input layer) that account for 50% of the total MAC operations, while the second half contains the rest of the layers. In each half, we calculate the average value of the squared difference between each layer's CTC and the mean CTC. We use V1 and V2 to represent these results (variances) in the first and second half, respectively. In Table 1, V1 is on average 1806.2 times higher than V2, which means the first half has more CTC fluctuations. This is a common phenomenon during DNN inference, and it causes the low DSP efficiency of FPGA-based accelerators following the first paradigm. In our proposed design, the CTC fluctuations of the first half can be properly resolved by the layer-dedicated pipeline architecture. For the second half, where layers have smaller CTC variances and share higher similarity, a generic structure can successfully eliminate the poor scalability of the pipeline design.
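The split-and-compare metric of Table 1 can be sketched in a few lines of Python; the helper below assumes per-layer MAC counts and CTC ratios are already available from a model profiler, and the function name is illustrative.

```python
def ctc_variance_ratio(macs, ctcs):
    """Sketch of the Table 1 metric: split layers where 50% of total MACs is
    reached, then compare the CTC variance of the first half (V1) to the
    second half (V2). `macs` and `ctcs` are per-layer values in network order."""
    half_macs, acc, split = sum(macs) / 2, 0.0, len(macs)
    for i, m in enumerate(macs):
        acc += m
        if acc >= half_macs:      # first layer index covering 50% of the MACs
            split = i + 1
            break

    def variance(xs):             # population variance, as described in the text
        mean = sum(xs) / len(xs)
        return sum((x - mean) ** 2 for x in xs) / len(xs)

    return variance(ctcs[:split]) / variance(ctcs[split:])
```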

The proposed design is not a simple concatenation of the two existing paradigms, as the fusion of two heterogeneous structures directly causes an exponential increase of the design space and can easily lead to tedious explorations and sub-optimal solutions. The additional design parameters include the task partitioning scheme, the batch size, and the hardware resource allocation between the two structures. Therefore, we propose a highly efficient design tool, called DNNExplorer, to perform automatic architecture exploration under the proposed paradigm and deliver optimized solutions.

Figure 4: The design flow of DNNExplorer, containing 3 steps to generate optimized DNN accelerators given DNN models and FPGA specifications.

4.2 DNNExplorer Design Flow
DNNExplorer generates optimized FPGA-based accelerators according to the input DNN model, the targeted FPGA specification, and the proposed accelerator paradigm. As shown in Figure 4, three steps are included in DNNExplorer to provide Model/HW Analysis, Accelerator Modeling, and Architecture Exploration.

In Model/HW Analysis, DNN definition files and trained parameters are passed to DNNExplorer for model profiling. Layer-wise information is extracted in this step, such as layer type, layer configuration, computation and memory demands, CTC ratio, and quantization scheme, and then packed as DNN info for Architecture Exploration. The input also includes an FPGA specification, which helps set up the boundaries of available resources, such as DSP, BRAM, and external memory bandwidth.

In Accelerator Modeling, we use models corresponding to the pipeline (P) and generic (G) structure of the proposed paradigm. The goal of this step is to adopt highly accurate pre-built analytical models for resource utilization and performance estimation. Eventually, a list of configurable accelerator parameters is collected for exploration in the next step. The analytical models also contribute to the performance evaluation in Architecture Exploration. In-depth introductions of these models are provided in Section 6.

In Architecture Exploration, we adopt a divide-and-conquer strategy and design a two-level automatic DSE engine to efficiently search for optimal accelerators in the massive design space. We use a Resource Allocation Vector (RAV), a 5-dimensional vector R, to describe the resources allocated to the P and G structures. R contains the split-point (SP), which indicates the task partitioning between these two structures, the batch size (Batch), and three major FPGA resources (DSP, BRAM, BW). Given inputs from the last two steps, a global optimization is first performed to explore and update R in every iteration, providing guidelines for building the P and G structures, respectively. After that, local optimizations are executed individually for the P and G structures to explore the best configurations. All selected accelerator parameters are documented in an optimization file that drives the performance evaluation using the analytical models in the modeling step. With the performance feedback, the global optimization algorithm decides whether to continue the exploration. Eventually, DNNExplorer generates the optimized architecture.

5 ACCELERATOR ARCHITECTURE
5.1 Accelerator Overview
In our design, the targeted DNN is mapped into hardware based on the RAV. Given a specific R, we have

R = [SP, Batch, DSP_p, BRAM_p, BW_p]  (2)
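As a concrete illustration of Eq. 2 and of how the leftover resources fall to the generic structure (described below in Section 5.1), here is a small Python container; the class and field names are ours, and the bandwidth unit is an arbitrary choice.

```python
from dataclasses import dataclass

@dataclass
class RAV:
    """Resource Allocation Vector of Eq. (2); field names are illustrative."""
    sp: int        # split-point: layers 1..sp go to the pipeline structure
    batch: int     # batch size shared by both structures
    dsp_p: int     # DSPs reserved for the pipeline structure
    bram_p: int    # BRAM blocks reserved for the pipeline structure
    bw_p: float    # external memory bandwidth (e.g., GB/s) for the pipeline structure

def generic_budget(rav: RAV, dsp_total: int, bram_total: int, bw_total: float):
    """The generic structure gets whatever the pipeline part does not take."""
    return (dsp_total - rav.dsp_p,
            bram_total - rav.bram_p,
            bw_total - rav.bw_p)
```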


Figure 5: The overall dataflow of the proposed accelerator.

Layers 1 ∼ SP are instantiated into a layer-wise tailor-made structure operating in a pipelined manner with batch size = Batch and available resources [DSP_p, BRAM_p, BW_p]. Denoting the total available resources on the targeted FPGA as [DSP, BRAM, BW], the rest of the layers are mapped into a generic reusable structure operating in a recurrent manner with batch size = Batch and available resources [DSP - DSP_p, BRAM - BRAM_p, BW - BW_p]. The pipeline and generic structures share the same clock frequency. We will discuss the selection of the RAV in Section 7.

Figure 6 presents the proposed accelerator design, where the three major FPGA resources are utilized as computation resources (green area), on-chip memory (blue area), and external memory (orange area). Inputs are first streamed into the on-chip buffer from external memory in the first pipeline stage; intermediate results (DNN feature maps) are then passed horizontally through all pipeline stages and reach the feature map buffer of the generic structure. On the other hand, trained DNN parameters are passed vertically with read-only transmission from the external memory. The overall dataflow of the proposed accelerator is shown in Figure 5. We adopt the fine-grained pipeline from [1] to reduce the initial latency when handling layers 1 ∼ SP in our pipeline structure. After that, the generic structure starts working on the rest of the layers. To reach the maximum throughput, we need to balance the latency of each pipeline stage (L_p) and the generic structure (L_g). With a perfectly load-balanced design, we are able to achieve the peak throughput of 1/max(L_p, L_g). The detailed algorithms for searching for such a load-balanced design are introduced in Section 7.

5.2 The Pipeline Structure
The first part of Figure 6 covers SP dedicated pipeline stages to process the major layers in targeted DNNs, such as convolutional (CONV) layers, pooling (POOL) layers, and fully-connected (FC) layers. These layers are likely to dominate most of the computation and memory resources. Other layers, such as batch normalization (BN) and activation layers, are concatenated into the major ones and processed by the same pipeline stage. In each stage, we implement three main components: (1) a computation engine to carry out DNN layer operations with a dedicated parallelism design, (2) a weight buffer to pump in DNN parameters from external memory, and (3) a column/row buffer to cache columns or rows of intermediate results generated by the previous stage. Four configurable parameters are available in every stage: the channel parallelism factor (CPF) and the kernel parallelism factor (KPF) (which are unrolling factors along the input and output channel dimensions), the input data bit-width (DW), and the weight bit-width (WW). With these parameters, we can generate dedicated accelerator designs for different DNN layers.

5.2.1 Computation Engine (CE). As shown in the green parts of Figure 6, CEs handle most of the computations of the DNN layers. Inside each CE, we implement a number of processing elements (PEs) with two-dimensional parallelism (CPF and KPF). Since CONV and FC layers share similar computation patterns, we leverage the same PE for these two layer types. More specifically, assuming layer stage 1 in Figure 6 works for a CONV layer, the CE in this stage contains KPF_1 PEs, and each PE is designed to handle one CPF_1-length vector multiplication in one clock cycle (assuming DW_1 = WW_1 = 16). Once the trained DNN parameters are ready in the weight buffer, we broadcast a CPF_1-length vector of the input feature map to all KPF_1 PEs; after the calculations, the accumulated results (a KPF_1-length vector) are written to the column/row buffer of the next stage.

5.2.2 Fine-grained Pipeline & Column-based Buffer. To ensure lower initial latency and efficient BRAM utilization of the pipeline structure, we adopt the fine-grained layer-based pipeline and column-based cache scheme from [1]. With these techniques, we no longer need to wait for the full completion of intermediate results before continuing to the next pipeline stage. Instead, as shown in Figure 5, the next pipeline stage can be launched once the first few columns or rows of the input frame are ready.

5.3 The Generic Structure
The second part of the proposed architecture is a generic computation unit for processing layers SP + 1 ∼ N. It is composed of a MAC array, a feature map buffer, a weight buffer, an accumulation buffer, and a functional sub-module for activation and pooling operations. A controller module is included for data and instruction transfer and management.

5.3.1 Reusable MAC Array. The key component of the generic structure is the reusable MAC array, which contains CPF_g × KPF_g MAC units (implemented with DSPs on FPGAs). The MAC array is able to calculate one general matrix-vector multiplication (GEMV) between a CPF_g-length vector of the input feature map and a CPF_g × KPF_g matrix of DNN parameters in each clock cycle. The data widths of the on-chip memory buses between the MAC array and the weight, feature map, and accumulation buffers are CPF_g × KPF_g × WW_g, CPF_g × DW_g, and KPF_g × DW_g bits, respectively, where WW_g and DW_g represent the quantization widths of the weights and feature maps of the generic structure.

5.3.2 On-chip Buffer Allocation. Two types of FPGA resources (BRAMs and LUTs) can be utilized to allocate on-chip buffers in the generic structure. BRAMs and LUTs have vastly different characteristics in terms of memory capacity and allocation granularity, which makes them suitable for different usage scenarios. Two main strategies are found in previous literature: (1) allocating BRAMs for the feature map and accumulation buffers while using LUTs for the weight buffer (Xilinx DPU [3]); and (2) allocating BRAMs for all on-chip buffers (VTA [36] and HybridDNN [2]). The first strategy allocates most of the BRAMs to the feature map buffer, which can effectively reduce the frequency of loading/saving intermediate feature maps from/to the external memory. Meanwhile, the first strategy allows a flexible adjustment of the feature map buffer depth to accommodate the BRAM constraint of the targeted FPGA. The second strategy allocates most of the BRAMs to the weight buffer, allowing the controller to schedule computations with more freedom, which can potentially benefit performance but usually consumes more BRAMs than the first strategy. In DNNExplorer, we construct resource and performance models (Section 6) for both strategies to better explore optimized designs.


Figure 6: The proposed accelerator architecture with SP pipeline stages instantiated for processing the first SP DNN layers and a generic computation unit implemented for accelerating the rest of the layers.

5.3.3 Feature Map Partitioning. Ideally, intermediate feature maps should be fully buffered by on-chip memory to reduce the high-cost data transfer between the FPGA and external memory. However, more and more emerging DNN applications take high-resolution image or video inputs, demanding a large on-chip buffer capacity that cannot be accommodated by the limited BRAMs on an FPGA. To support these emerging applications, we follow a line-based partitioning strategy to break down large feature maps into small groups along the height dimension. With this strategy, a group that completes its computations is swapped out of the on-chip buffer to make room for new groups. However, such frequent group swapping occupies additional memory access bandwidth, which can impact accelerator performance. In Section 6, we introduce how we capture these negative impacts in our modeling algorithms to help deliver optimized designs with better trade-offs.

6 ACCELERATOR MODELING
One of the most critical parts of DNNExplorer is the Accelerator Modeling step, which provides fast and accurate estimation of hardware performance and resource utilization. Only with timely feedback can the decision-making algorithm in Architecture Exploration make the most informed decisions. Our modeling supports the most commonly used DNN layers (e.g., CONV, POOL, and FC layers) and can also be extended to support new DNN layers (e.g., depth-wise CONV layers). Considering that CONV is the most computation-intensive operation in modern DNNs, in this section we take a DNN model with N CONV layers as an example to illustrate how we model the pipeline and generic structures. We assume the i-th CONV layer is calculated with a 3-dimensional input feature D (size H × W with C channels) and a 4-dimensional kernel G (size R × S with K output and C input channels) at clock frequency FREQ.

6.1 Pipeline Structure Modeling
In the pipeline structure, we employ dedicated IPs to carry out the computation of the DNN model following a CTC-based resource allocation algorithm (discussed in detail in Section 7). The goal of this algorithm is to ensure a load-balanced pipeline structure with a perfect match between computation and memory resources for maximal performance. Since a two-dimensional parallelism factor (CPF_i, KPF_i) is adopted, the latency L_i of the i-th CONV layer is:

L_i = \frac{H_i \times W_i \times R_i \times S_i \times C_i \times K_i}{CPF_i \times KPF_i \times FREQ}  (3)

Figure 7: (a) Performance estimation errors of the pipeline structure model when mapping 6 DNNs onto Xilinx ZC706. N1∼N3 represent AlexNet, ZF, and YOLO with 16-bit quantization, while N4∼N6 represent the same group of networks using 8-bit quantization. (b) We swap the FPGA to Xilinx KU115 and evaluate the estimation errors on 8 DNNs, where N1∼N4 represent AlexNet, ZF, VGG16, and YOLO with 16-bit quantization while the remaining 4 use 8-bit quantization.

Since the maximum throughput is limited by the pipeline stage with the longest latency, we use Eq. 4 to calculate the overall throughput with a batch size of Batch.

Throughput_p = \frac{Batch}{\max(L_1, L_2, ..., L_N)}  (4)
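The two formulas translate directly into code; the sketch below is a minimal rendering of Eqs. 3 and 4 with our own function names, assuming FREQ is given in Hz so that latencies come out in seconds. The stage shapes in the example are hypothetical.

```python
def conv_stage_latency(h, w, r, s, c, k, cpf, kpf, freq_hz):
    """Eq. (3): cycles needed by one CONV pipeline stage divided by the clock rate."""
    return (h * w * r * s * c * k) / (cpf * kpf * freq_hz)

def pipeline_throughput(stage_latencies, batch):
    """Eq. (4): the slowest stage bounds the steady-state frames per second."""
    return batch / max(stage_latencies)

# Hypothetical example: three stages at 200 MHz; the slowest (~18 ms per image)
# bounds throughput at roughly 55 images per second.
lat = [conv_stage_latency(224, 224, 3, 3, 3, 64, 3, 8, 200e6),
       conv_stage_latency(112, 112, 3, 3, 64, 128, 16, 16, 200e6),
       conv_stage_latency(56, 56, 3, 3, 128, 256, 32, 16, 200e6)]
print(pipeline_throughput(lat, batch=1))
```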

We summarize the estimated throughput and the error of the proposed pipeline model in Figure 7. We evaluate 6 DNN models on an embedded FPGA (Xilinx ZC706) and 8 DNN models on a mid-range FPGA (Xilinx KU115). Results show that the average error between the estimated and the board-level performance is 1.15%, which verifies the accuracy of our proposed analytical model.

6.2 Generic Structure Modeling
As discussed in Section 5, two types of on-chip buffer allocation strategies are considered here. For the first strategy, BRAMs are used for the feature map and accumulation buffers, while for the second strategy, BRAMs are allocated for all on-chip buffers.


Figure 8: Performance estimation errors of the generic structure when mapping 36 cases of CONV layers. Test cases cover commonly used feature map sizes (56, 112, 224), channel sizes (64, 128, 256, 512), and kernel sizes (1, 3, 5, 7).

6.2.1 Allocation Strategy 1. We follow a weight reuse scheme to perform the CONV layers. The reuse factor of each weight is determined by the capacity of the accumulation buffer. Assuming the capacity (in bits) of the accumulation buffer is CAP_abuff, the input and output feature maps are organized as G_fm groups, where:

G_{fm} = \frac{H \times W \times K \times DW_g}{CAP_{abuff}/2}  (5)

In Eq. 5, CAP_abuff is divided by a factor of 2 because ping-pong buffers are used to avoid data pollution. Pixels of the feature map inside the same group can reuse the same weights, which are located on-chip. It means the accelerator needs to fetch weights G_fm times before finishing all computations in this CONV layer. The latency of computation L_comp and weight loading L_w can be calculated as:

L_{comp} = \frac{H \times W \times R \times S \times C \times K}{CPF \times KPF \times FREQ}  (6)

L_w = \frac{R \times S \times C \times K \times WW_g}{BW}  (7)

In Eq. 7, BW represents the external memory bandwidth allocated to the generic structure. Then, the overall latency of this CONV layer, L_layer, is:

L_{layer} = \max(L_{comp}, L_w \times G_{fm})  (8)

However, due to the feature map partitioning strategy discussed in Section 5.3, the real case is more complicated than Eq. 8. When the feature map buffer is insufficient, feature maps need to be partitioned and swapped between on-chip and off-chip memory. To capture the impact of this mechanism, we first divide the external memory bandwidth BW into three portions, BW_w, BW_ifm, and BW_ofm, according to the data access behaviors related to weight loading, feature map swapping in, and feature map swapping out, respectively. The latencies L_ifm and L_ofm can then be calculated as:

L_{ifm} = \frac{H \times W \times C \times DW_g}{BW_{ifm}}, \quad L_{ofm} = \frac{H \times W \times K \times DW_g}{BW_{ofm}}  (9)

while the weight loading latency L_w and the overall latency L_layer are updated as:

L_w = \frac{R \times S \times C \times K \times WW_g}{BW_w}  (10)

L_{layer} = \max(L_{comp}, L_w \times G_{fm}, L_{ifm}, L_{ofm})  (11)
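Putting Eqs. 5-11 together, a rough Python rendering of the Strategy-1 per-layer latency might look as follows. Function and argument names are ours, bandwidths are assumed in bits per second, the rounding of G_fm is our simplification, and in the on-chip case bw_w should be read as the whole BW of Eq. 7.

```python
import math

def strategy1_layer_latency(h, w, c, k, r, s, cpf_g, kpf_g, freq_hz,
                            dw_g, ww_g, cap_abuff_bits,
                            bw_w, bw_ifm, bw_ofm, fm_fits_on_chip):
    """Sketch of the Strategy-1 latency model (Eqs. 5-11), bandwidths in bits/s."""
    # Eq. (5): feature-map groups bounded by half of the ping-pong accumulation buffer
    g_fm = math.ceil(h * w * k * dw_g / (cap_abuff_bits / 2))
    # Eq. (6): compute latency of the CPF_g x KPF_g MAC array
    l_comp = h * w * r * s * c * k / (cpf_g * kpf_g * freq_hz)
    # Eqs. (7)/(10): weights are re-fetched once per feature-map group
    l_w = r * s * c * k * ww_g / bw_w
    if fm_fits_on_chip:
        # Eq. (8): feature maps stay on chip, so only compute and weight loading matter
        return max(l_comp, l_w * g_fm)
    # Eq. (9): partitioned feature maps are swapped in and out of external memory
    l_ifm = h * w * c * dw_g / bw_ifm
    l_ofm = h * w * k * dw_g / bw_ofm
    # Eq. (11)
    return max(l_comp, l_w * g_fm, l_ifm, l_ofm)
```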

6.2.2 Allocation Strategy 2. Similar to Strategy 1, to accommodate the capacity of the on-chip buffers, we partition the input and output feature maps into G_fm groups along the height dimension, and partition the weights into G_w groups along the output channel dimension K. If we use CAP_wbuff to represent the capacity of the weight buffer, G_w can be calculated as:

G_w = \frac{R \times S \times C \times K \times WW_g}{CAP_{wbuff}/2}  (12)

Table 2: Design space of the proposed paradigm

             | Pipeline Structure                   | Generic Structure
Parameters   | CPF_p = {CPF_1, CPF_2, ..., CPF_i}   | CPF_g, KPF_g
             | KPF_p = {KPF_1, KPF_2, ..., KPF_i}   | Buffer-allocation, Dataflow
             | Split-point (SP), Batch
Constraints  | Available DSP, BRAM, BW

Under the second allocation strategy, two types of dataflow are supported: input stationary (IS) and weight stationary (WS). For IS, the computation and memory access behaviors are similar to the case using the first strategy, and the overall latency can be calculated by Eq. 11. However, because most of the BRAMs are allocated to the weight buffer in this case, G_fm is usually larger than with the first strategy, which may hurt the overall latency according to Eq. 11. Meanwhile, the smaller feature map buffer prevents Eq. 11 from reducing to Eq. 8 and increases the external memory bandwidth demands compared to the first strategy. For WS, the proposed accelerator keeps weights on chip, and for each group of weights, all input feature maps need to be loaded before any computations. Since there are G_w groups of weights in total, the overall latency is:

L_{layer} = \max(L_{comp}, L_w, L_{ifm} \times G_w, L_{ofm} \times G_w)  (13)

To evaluate the analytical models of the generic structure, we include 36 cases of CONV layers with different channel, feature map, and kernel configurations as benchmarks. We compare the estimated performance with the measured board-level performance on a Xilinx VU9P FPGA. Results are shown in Figure 8, where only a 2.17% average error is observed across all 36 cases.

7 DESIGN SPACE EXPLORATION
7.1 Design Space Definition
With the new accelerator design paradigm, we are able to explore a significantly larger design space and perform more fine-grained hardware resource allocation. We define a dynamic design space covering all possible accelerator design combinations in Table 2. The split-point defines the task partitioning between the pipeline and generic structures. With more layers distributed to the pipeline structure, more stages are instantiated along with higher dimensions of CPF_p and KPF_p. Given the resource constraints [DSP, BRAM, BW] of the targeted FPGA, DNNExplorer is designed to explore all design parameters and generate the best accelerators for targeted DNN applications. For the efficiency of the exploration, we adopt a divide-and-conquer strategy and propose a two-level automatic DSE engine with global and local optimization, which are introduced in the next two subsections.

7.2 Global Optimization
Global optimization is the first task of the Architecture Exploration step, where an updated RAV is generated to provide task partitioning and resource allocation guidelines for our unique accelerator architecture. To select the best RAV across the high-dimensional design space, we employ a particle swarm optimization (PSO) algorithm to discover the most suitable combination of SP, Batch, and hardware resource distribution between the pipeline and generic structures. The global optimization can also be extended to support other optimization algorithms in the future for different scenarios.

As shown in Algorithm 1, each RAV is considered as a particle P_i, and all of them contribute to the swarm M.


Algorithm 1: Global optimization algorithm
1: Initialize RAV with M population: P(M)
2: Initialize iteration number: N
3: Initialize HW boundary: SP_max, DSP_max, BRAM_max, BW_max
4: Evaluate each RAV: Fit_i = FitnessScore(P_i), where i ∈ M
5: Keep the local best for each RAV: L_i = Fit_i, and the global best: G = max(L_i)
6: While itr < N:
7:   For each P_i in M:
8:     Get local and global velocity: VtoLbest_i, VtoGbest_i
9:     Update velocity: V_i = w·V_i + c1·rand()·VtoLbest_i + c2·rand()·VtoGbest_i
10:    Update RAV: P_i = UpdatePos(P_i, V_i)
11:    Evaluate updated RAV: Fit_i = FitnessScore(P_i)
12:    Update local best: L_i
13:    Update global best: G
14: Output the best RAV
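As a concrete reading of lines 8-10 of Algorithm 1, the Python sketch below performs one PSO velocity/position update over a list-encoded RAV; the constants and the omission of clipping/rounding to legal RAV values are our simplifications.

```python
import random

def pso_step(particles, velocities, local_best, global_best,
             w=0.5, c1=1.5, c2=1.5):
    """One PSO update (Algorithm 1, lines 8-10); particles are list-encoded RAVs."""
    for i, p in enumerate(particles):
        for d in range(len(p)):
            to_local = local_best[i][d] - p[d]      # toward this particle's best position
            to_global = global_best[d] - p[d]       # toward the swarm's best position
            velocities[i][d] = (w * velocities[i][d]
                                + c1 * random.random() * to_local
                                + c2 * random.random() * to_global)
            p[d] += velocities[i][d]
        # A full implementation would now clip/round p to the legal RAV range
        # (integer SP and Batch, resource upper bounds) before re-scoring it.
```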

We use throughput performance as the fitness index to evaluate the performance of each particle. We label the local best for particle i across all iterations as L_i and use G to represent the global best particle. The search contains N iterations, and in the itr-th iteration, the position of each particle is updated according to its current position and updated velocity V_i. The updated velocity is calculated based on the velocities toward the local (VtoLbest_i) and global (VtoGbest_i) best positions. In the update function, rand() generates random numbers between 0 and 1, and we also include the adjustable parameters, inertia weight w and acceleration constants c1 and c2, to fine-tune the search process. Eventually, the best particle is selected, which is the RAV indicating the optimal task partitioning and resource allocation. To improve the search efficiency of the global optimization, we introduce an early-termination feature, which terminates the optimization if G has not improved for two consecutive iterations.

7.3 Local Optimization
Once the RAV is updated by the global optimizer, the local optimization is launched to search for the best parameter combination (e.g., CPF_p, KPF_p, CPF_g, KPF_g, and Batch) given the constraints of the RAV. Since the flexible architecture of our proposed accelerator enables a huge design space, a brute-force search has exponentially increasing complexity and is infeasible for finding the optimal solution. To solve this problem, we propose a two-phase local optimization algorithm as shown in Algorithm 2 (phase 1) and Algorithm 3 (phase 2). In the first phase, we calculate the CPF and KPF of each layer in the pipeline structure based on workload and CTC characteristics, and ensure the computation resources and external memory bandwidth are utilized to the maximum. In the second phase, we increase CPF_g and KPF_g step by step until the performance of the pipeline and the generic structures is balanced, which means the overall latency of the generic structure is less than or equal to the maximum latency of the pipeline structure. If the FPGA resources are exhausted before the balanced point, we roll back and scale down the CPFs and KPFs of the pipeline structure until all resource constraints are met. In the second phase, Algorithm 3 is executed for each on-chip buffer allocation strategy discussed in Section 5.3.2, and the better one becomes the final solution. If the second strategy is selected, the latency update in line 9 of Algorithm 3 automatically selects the better dataflow configuration (IS or WS) for each layer.

Algorithm 2: CTC-based local optimization algorithm for the pipeline structure
1: [SP, Batch, DSP_p, BRAM_p, BW_p] = RAV
2: Initialize latency and resource models, l_pip() and r_pip(), for the pipeline structure.
3: Calculate operation number OP_i and computation reuse factor CTC_i for each layer of the targeted DNN model.
4: BW_total^norm = Σ_{i=1}^{SP} (OP_i / CTC_i)
5: for i in [1, SP]:
6:   PF_i = ⌈OP_i × BW_p / BW_total^norm / FREQ⌉
7: while (Σ_{i=1}^{SP} R_i ≥ [DSP_p, BRAM_p]):
8:   for i in [1, SP]:
9:     PF_i = max(1, PF_i/2), update CPF_i and KPF_i.
10:    Update L_i and R_i with models l_pip() and r_pip().
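Lines 4-6 of Algorithm 2 reduce to a short helper; the sketch below assumes consistent units for operations, bandwidth, and FREQ, and the function name is ours.

```python
import math

def init_pipeline_pf(ops, ctcs, bw_p, freq_hz):
    """Lines 4-6 of Algorithm 2: initial parallelism factor per pipeline stage,
    proportional to each layer's operation count and scaled by the bandwidth budget."""
    bw_norm_total = sum(op / ctc for op, ctc in zip(ops, ctcs))              # line 4
    return [math.ceil(op * bw_p / bw_norm_total / freq_hz) for op in ops]    # lines 5-6
```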

Algorithm 3: Balance-oriented local optimization algorithm for the generic structure
1: Initialize latency and resource models, l_gen() and r_gen(), for the generic structure.
2: Initialize R_total with the total resource boundary of the targeted FPGA.
3: while (1):
4:   PF_g = 1, L_p^max = max(L_1, L_2, ..., L_SP)
5:   while (Σ_{i=SP+1}^{N} L_i > L_p^max and R_g + Σ_{i=1}^{SP} R_i < R_total):
6:     PF_g = PF_g × 2, update CPF_g and KPF_g.
7:     Update R_g with resource model r_gen().
8:     for i in [SP+1, N]:
9:       Update L_i with latency model l_gen().
10:  Batch_p = (R_total - R_g) / Σ_{i=1}^{SP} R_i
11:  if (Σ_{i=SP+1}^{N} L_i > L_p^max or Batch > Batch_p):
12:    for i in [1, SP]:
13:      PF_i = max(1, PF_i/2), update CPF_i and KPF_i.
14:      Update L_i and R_i with models l_pip() and r_pip().
15:  else: break
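The inner loop of Algorithm 3 (lines 4-9) boils down to doubling the generic structure's parallelism until it keeps up with the slowest pipeline stage or resources run out. A simplified Python sketch, with resources flattened to a single scalar and the latency/resource models passed in as callables, is shown below.

```python
def balance_generic_pf(pipeline_latencies, generic_latency_of, generic_resource_of,
                       pipeline_resources, r_total):
    """Simplified inner loop of Algorithm 3 (lines 4-9).

    generic_latency_of(pf)  -> total latency of layers SP+1..N at parallelism pf
    generic_resource_of(pf) -> resources used by the generic structure at pf
    Resources are flattened to one scalar here for brevity."""
    pf_g = 1
    l_max_p = max(pipeline_latencies)
    while (generic_latency_of(pf_g) > l_max_p
           and generic_resource_of(pf_g) + sum(pipeline_resources) < r_total):
        pf_g *= 2   # double CPF_g x KPF_g until balanced or out of resources
    return pf_g
```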

8 EXPERIMENTAL RESULTS
8.1 DSP Efficiency Comparison
To accelerate DNN applications on FPGAs, DSPs are the most important and scarce computation resources. In this subsection, we use DSP efficiency as the criterion to evaluate whether accelerators maximize the usage of their allocated DSPs. We compare our design to two other automatic accelerator design frameworks, DNNBuilder [1] and HybridDNN [2], on the same task: delivering DNN accelerators on a Xilinx KU115 FPGA for 12 VGG-16 (without the last three FC layers) models with different input sizes (the same cases as in Figure 1) to simulate tasks of real-life DNN applications. In our experiments, weights and feature maps are quantized to 16-bit fixed-point in all three frameworks for a fair comparison. Results are shown in Figure 9, where DNNBuilder achieves the highest DSP efficiency, especially for small input sizes (e.g., cases 1 and 2), because of its dedicated pipeline stage design. Our proposed design is slightly behind DNNBuilder for small inputs but reaches the same efficiency level (>95%) from case 3 on. Compared to the generic accelerator design in HybridDNN, our design delivers 2.0x and 1.3x higher efficiency for cases 1 and 2, respectively. We also compare to Xilinx DPU [3], a commercial DNN accelerator IP, for running the first 9 cases on a ZCU102 FPGA (as the last three input sizes are not supported by this IP). The accelerators generated by DNNExplorer achieve on average 1.6x higher DSP efficiency, peaking at 4.4x for case 1. As the input size increases, the efficiency gap decreases (<10% after case 5).

8.2 Performance Comparison
We compare the throughput (GOP/s) among accelerators generated by the proposed DNNExplorer, DNNBuilder, and HybridDNN. First, we target the same DNNs mentioned in the last subsection using a Xilinx KU115 FPGA with a 200 MHz clock frequency. Performance results are listed in Figure 10, where accelerators generated by DNNExplorer achieve better throughput compared to the state-of-the-art solutions.


Table 3: Performance and resource overhead of the DNNExplorer-generated accelerators with batch size = 1.

Case  Input Size   GOP/s   Img./s   R = [SP, DSP, BRAM, BW]     Total DSP  DSP Efficiency  Total BRAM  Avg. Search Time (s)
1     3x32x32      368.5   588.9    [4, 48.2%, 17.7%, 62.9%]    2268       42.3%           2326        86.6
2     3x64x64      890.8   339.1    [5, 46.5%, 13.5%, 65.9%]    2730       77.9%           2560        41.6
3     3x128x128    1702.3  169.5    [9, 50.2%, 38.9%, 54.2%]    4686       90.8%           3589        70.2
4     3x224x224    1702.3  55.4     [12, 63.6%, 53.7%, 67.3%]   4444       95.8%           3296        75.6
5     3x320x320    1702.4  27.1     [13, 73.5%, 71.4%, 63.5%]   4450       95.7%           3224        45.9
6     3x384x384    1702.4  18.8     [14, 73.2%, 70.1%, 55.0%]   4452       95.6%           3436        64.9
7     3x320x480    1702.4  18.1     [14, 78.6%, 74.6%, 58.2%]   4452       95.6%           3296        71.1
8     3x448x448    1702.4  13.8     [13, 75.8%, 72.9%, 36.1%]   4450       95.6%           3552        82.1
9     3x512x512    1702.4  10.6     [13, 80.0%, 70.0%, 80.0%]   4450       95.6%           3678        70.0
10    3x480x800    1702.4  7.2      [13, 80.0%, 72.0%, 80.0%]   4450       95.6%           3678        103.0
11    3x512x1382   1702.5  3.9      [14, 77.4%, 73.1%, 34.3%]   4452       95.6%           3792        105.5
12    3x720x1280   1702.5  3.0      [13, 82.7%, 73.4%, 84.0%]   4450       95.6%           4186        143.9

Figure 9: DSP efficiency comparison when running VGG16 (batch size = 1) with 12 different input sizes.

Figure 10: Throughput comparison when running VGG16 (batch size = 1) with 12 different input sizes.

Figure 11: Throughput comparison when running deeper DNNs with the same 3×224×224 input.

To meet resource constraints and maximize performance, the proposed DSE engine generates configuration guidelines for implementing the proposed architecture.

Detailed results are shown in Table 3, where R indicates the most suitable task partitioning scheme and the resources allocated to the P and G structures of the proposed accelerator. The total DSP and total BRAM columns represent the compute and on-chip memory utilization. To capture the runtime of searching for the optimal architecture given the input model and hardware constraints, we perform five independent searches on an Intel i5-650 CPU for each case and list the average search time. With the divide-and-conquer idea and the early-termination feature, DNNExplorer delivers the best configurations in minutes despite the high-dimensional design space.

Table 4: Performance and resource overhead of the generated accelerators without batch size restriction.

Case  Input Size   Batch  GOP/s   Img./s   DSP   BRAM
1     3x32x32      8      1698.1  2712.7   4464  2228
2     3x64x64      8      1701.5  678.2    4688  3534
3     3x128x128    4      1702.4  169.5    4686  3326
4     3x224x224    2      1702.3  55.4     4686  3619

We extend the search to the batch size in Table 4, where lower DNN complexity (caused by the smaller inputs in cases 1 to 4) creates opportunities for batch size exploration given the limited available FPGA resources.

We also evaluate the scalability of DNNExplorer by targeting deeper DNNs. We prepare 4 DNNs with 13, 18, 28, and 38 CONV layers. The 13-layer one comes directly from VGG16 by removing the last 3 FC layers. Since VGG is composed of 5 CONV groups, where each group has the same CONV configuration (e.g., the number of CONV kernels), we add one CONV layer to each group (maintaining the same configuration) to get the 18-layer (13+5) model. Similarly, we add 3 and 5 CONV layers to each group for the 28- and 38-layer models, respectively. In Figure 11, our designs maintain the highest performance despite targeting deeper networks. Compared to DNNBuilder, we deliver 4.2x higher performance when accelerating the 38-layer VGG-like DNN on the same FPGA platform.

9 CONCLUSION
In this paper, we presented a novel FPGA-based DNN accelerator design paradigm with both a dedicated pipeline structure and a generic reusable structure to overcome the drawbacks of existing designs by providing improved performance, better specificity, and scalability. To achieve the full potential of the new paradigm, we proposed DNNExplorer, an automation tool that enables fast architecture exploration following the new paradigm. Novel technologies were proposed, including a dynamic design space for more fine-grained architecture configurations and a two-level automatic DSE engine for delivering optimized designs. With the above designs, DNNExplorer achieved 4.2x higher performance compared to the state-of-the-art design produced by DNNBuilder when targeting a 38-layer VGG-like DNN on the same FPGA. We also achieved higher DSP efficiency (up to 4.4x) compared to the latest generic-architecture accelerators from academia (HybridDNN) and industry (Xilinx DPU).

ACKNOWLEDGMENT
This work was supported in part by the IBM-Illinois Center for Cognitive Computing System Research (C3SR) – a research collaboration as part of the IBM AI Horizons Network. Xiaofan Zhang is supported by a Google PhD Fellowship.


REFERENCES
[1] Xiaofan Zhang et al. DNNBuilder: an automated tool for building high-performance DNN hardware accelerators for FPGAs. In Proceedings of the International Conference on Computer-Aided Design (ICCAD), 2018.
[2] Hanchen Ye et al. HybridDNN: A framework for high-performance hybrid DNN accelerator design and implementation. In Proceedings of the Design Automation Conference (DAC), 2020.
[3] Xilinx. Zynq DPU v3.1, 2019. Accessed: 2020-5-23.
[4] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E. Hinton. ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems, 2012.
[5] Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.
[6] Christian Szegedy et al. Going deeper with convolutions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2015.
[7] Kaiming He et al. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016.
[8] Joseph Redmon et al. You only look once: Unified, real-time object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016.
[9] Barret Zoph et al. Learning transferable architectures for scalable image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018.
[10] Esteban Real et al. Regularized evolution for image classifier architecture search. In AAAI Conference on Artificial Intelligence (AAAI), 2019.
[11] Xiaofan Zhang et al. SkyNet: a hardware-efficient method for object detection and tracking on embedded systems. In Conference on Machine Learning and Systems (MLSys), 2020.
[12] Dustin Franklin. NVIDIA Jetson AGX Xavier delivers 32 teraops for new era of AI in robotics. NVIDIA Accelerated Computing | Parallel Forall, 2018.
[13] Norman P. Jouppi et al. In-datacenter performance analysis of a tensor processing unit. In Proceedings of the International Symposium on Computer Architecture (ISCA), 2017.
[14] Yu-Hsin Chen et al. Eyeriss: An energy-efficient reconfigurable accelerator for deep convolutional neural networks. In IEEE International Solid-State Circuits Conference (ISSCC), 2016.
[15] Chen Zhang et al. Optimizing FPGA-based accelerator design for deep convolutional neural networks. In Proceedings of the International Symposium on Field-Programmable Gate Arrays (FPGA), 2015.
[16] Pengfei Xu et al. AutoDNNchip: An automated DNN chip predictor and builder for both FPGAs and ASICs. 2020.
[17] Jiantao Qiu et al. Going deeper with embedded FPGA platform for convolutional neural network. In Proceedings of the International Symposium on Field-Programmable Gate Arrays (FPGA), 2016.
[18] Xiaofan Zhang et al. High-performance video content recognition with long-term recurrent convolutional network for FPGA. In Proceedings of the International Conference on Field Programmable Logic and Applications (FPL), 2017.
[19] Xiaofan Zhang et al. Machine learning on FPGAs to face the IoT revolution. In Proceedings of the International Conference on Computer-Aided Design (ICCAD), 2017.
[20] Jialiang Zhang and Jing Li. Improving the performance of OpenCL-based FPGA accelerator for convolutional neural network. In Proceedings of the International Symposium on Field-Programmable Gate Arrays (FPGA), 2017.
[21] Cong Hao et al. FPGA/DNN co-design: An efficient design methodology for IoT intelligence on the edge. In Proceedings of the Design Automation Conference (DAC), 2019.
[22] Huimin Li et al. A high performance FPGA-based accelerator for large-scale convolutional neural networks. In Proceedings of the International Conference on Field Programmable Logic and Applications (FPL), 2016.
[23] Xuechao Wei et al. TGPA: tile-grained pipeline architecture for low latency CNN inference. In Proceedings of the International Conference on Computer-Aided Design (ICCAD), 2018.
[24] Ritchie Zhao et al. Accelerating binarized convolutional neural networks with software-programmable FPGAs. In Proceedings of the International Symposium on Field-Programmable Gate Arrays (FPGA), 2017.
[25] Junsong Wang et al. Design flow of accelerating hybrid extremely low bit-width neural network in embedded FPGA. In Proceedings of the International Conference on Field Programmable Logic and Applications (FPL), 2018.
[26] Qingcheng Xiao et al. Exploring heterogeneous algorithms for accelerating deep convolutional neural networks on FPGAs. In Proceedings of the Design Automation Conference (DAC), 2017.
[27] Chuanhao Zhuge et al. Face recognition with hybrid efficient convolution algorithms on FPGAs. In Proceedings of the Great Lakes Symposium on VLSI (GLSVLSI), 2018.
[28] Song Han et al. ESE: Efficient speech recognition engine with sparse LSTM on FPGA. In Proceedings of the International Symposium on Field-Programmable Gate Arrays (FPGA), 2017.
[29] Chen Zhang et al. Caffeine: Toward uniformed representation and acceleration for deep convolutional neural networks. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, 38(11):2072-2085, 2018.
[30] X. Wei et al. Automated systolic array architecture synthesis for high throughput CNN inference on FPGAs. In Proceedings of the Design Automation Conference (DAC), 2017.
[31] Kyle Rupnow et al. A study of high-level synthesis: Promises and challenges. In IEEE International Conference on ASIC, 2011.
[32] Xinheng Liu et al. High level synthesis of complex applications: An H.264 video decoder. In Proceedings of the International Symposium on Field-Programmable Gate Arrays (FPGA), 2016.
[33] Yijin Guan et al. FP-DNN: An automated framework for mapping deep neural networks onto FPGAs with RTL-HLS hybrid templates. In Proceedings of the International Symposium on Field-Programmable Custom Computing Machines (FCCM), 2017.
[34] Qin Li et al. Implementing neural machine translation with bi-directional GRU and attention mechanism on FPGAs using HLS. In Proceedings of the Asia and South Pacific Design Automation Conference (ASP-DAC), 2019.
[35] Yao Chen et al. Cloud-DNN: An open framework for mapping DNN models to cloud FPGAs. In Proceedings of the International Symposium on Field-Programmable Gate Arrays (FPGA), 2019.
[36] Thierry Moreau, Tianqi Chen, Luis Vega, Jared Roesch, Eddie Yan, Lianmin Zheng, Josh Fromm, Ziheng Jiang, Luis Ceze, Carlos Guestrin, et al. A hardware-software blueprint for flexible deep learning specialization. IEEE Micro, 39(5):8-16, 2019.