2017-12-05 | Andreas Kurth | Hardware Acceleration for Data Processing Seminar
Discussion of the FPGA'17 paper by Intel Corp. (Nurvitadhi et al.)
Can FPGAs beat GPUs in accelerating next-generation Deep Neural Networks?
In short: The situation
➢ Implementing DNNs efficiently is very important.
Image credit: NVIDIA
In short: The problem
Which computing device is most suitable for this task?
An open question depending on many factors, but GPUs (and ASICs such as DaDianNao and the TPU) are the de-facto standard.
Why not FPGAs? They can nominally be more energy-efficient than GPUs, but their inferior memory interface and floating-point performance negate this advantage.
In short: The solution
Intel claims their upcoming FPGA families will address this, as one flagship FPGA (Stratix 10 SX2800) will feature:
● >5k “hard macro” floating-point units (FPUs)
● 28 MB on-chip RAM
● up to 1 TB/s off-chip memory bandwidth (HBM2)
In short: The evaluation
According to Intel’s calculations, the SX2800 matches or outperforms a state-of-the-art GPU (NVIDIA TITAN X) in terms of
● nominal GEMM performance: 9.2 TFLOP/s vs. 11 TFLOP/s (FP32)
● energy efficiency: 60 GFLOP/s/W vs. 45 GFLOP/s/W (FP32),
as well as in benchmarks:
● sparse (85% pruned) AlexNet: 1.1x in performance, 1.9x in energy efficiency
● DNNs with narrow (Int6) data types: 1.5x in performance, 2.1x in energy efficiency
● BinaryNet: 5.4x in performance, 5.0x in energy efficiency
● Ternary ResNet-50 (ImageNet): 2.0x in performance, 2.7x in energy efficiency
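The quoted nominal efficiency figures can be sanity-checked with simple arithmetic. As a sketch: the 250 W figure is the TITAN X TDP, and the ~153 W for the SX2800 is back-derived here from the claimed 60 GFLOP/s/W, i.e., it is an assumption, not a number from the slides.

```python
# Sanity check of the claimed nominal efficiency numbers.
# 153 W is back-derived from Intel's 60 GFLOP/s/W claim (assumption);
# 250 W is the published TITAN X TDP.
def efficiency_gflops_per_watt(tflops, watts):
    """GFLOP/s per watt from peak TFLOP/s and power draw."""
    return tflops * 1000 / watts

fpga = efficiency_gflops_per_watt(9.2, 153)   # Stratix 10 SX2800, ~60 GFLOP/s/W
gpu = efficiency_gflops_per_watt(11.0, 250)   # TITAN X, 44 GFLOP/s/W
```

At these power figures the numbers reproduce the claimed ~60 vs. ~45 GFLOP/s/W ratio of roughly 1.35x in the FPGA’s favor.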
If you only remember one thing from this talk ...
Intel claims their next-generation FPGAs will ... surpass state-of-the-art GPUs in energy efficiency and match them in performance
at SGEMM operations, both nominally and for real DNN workloads.
How does Intel justify this claim?
1) DNN trends could favor FPGAs.
2) FPGA architecture and technology are closing the gap to GPUs.
3) Intel developed a computational template that matches (1) to (2).
DNN trends could favor FPGAs
1) DNNs are getting deeper (more layers) to increase accuracy, but they are not getting larger in terms of memory.
➢ The increased compute density and the irregularity (e.g., in sizes and links) across layers are thought to favor FPGAs.
Table 1: Recent ImageNet challenge winners.
DNN trends could favor FPGAs
2) Compact data types (e.g., FP16 and Int8, but also binary or ternary) reduce the number of computations and the memory footprint at moderate accuracy losses.
➢ Even though modern GPUs support FP16 and Int8, operations on more exotic compact types (e.g., the binary XNOR-net) can be favorable for FPGAs.
Figure 5.b) Binarized matrix multiply implemented with XNOR and bitcount.
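The trick in Figure 5.b) can be sketched in a few lines: with weights and activations restricted to {-1, +1} and packed into machine words, the multiply collapses into an XNOR (signs agree = product +1) and the accumulate into a bitcount. This is an illustrative sketch, not the paper’s kernel.

```python
# Sketch of a binarized dot product (cf. Figure 5.b): {-1,+1} vectors
# packed as bit masks (bit=1 encodes +1, bit=0 encodes -1).
def binary_dot(a_bits, b_bits, n_bits):
    """Dot product of two packed {-1,+1} vectors of width n_bits."""
    mask = (1 << n_bits) - 1
    agree = ~(a_bits ^ b_bits) & mask   # XNOR: 1 wherever signs match
    matches = bin(agree).count("1")     # bitcount / popcount
    return 2 * matches - n_bits         # (#matches) - (#mismatches)

# a = [+1,-1,+1,-1] -> 0b1010, b = [+1,+1,-1,-1] -> 0b1100:
result = binary_dot(0b1010, 0b1100, 4)  # two agreements, two disagreements -> 0
```

One word-wide XNOR plus a popcount thus replaces 32 (or more) multiply-accumulates, which is exactly the kind of bit-level operation an FPGA fabric implements cheaply.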
DNN trends could favor FPGAs
3) Weights and neurons are never 100% non-zero (e.g., in non-fully-connected layers or after ReLU), yet zeros wastefully participate in calculations. Sparsity can additionally be increased by pruning weights that are deemed unimportant.
➢ Above a certain level of sparsity, FPGAs handle the resulting irregular computations more efficiently than GPUs.
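The idea can be illustrated with a compressed dot product: store only the non-zero weights as (index, value) pairs and skip the zero multiplies entirely. The names and format below are illustrative, not from the paper.

```python
# Sketch of a pruned (sparse) dot product: only non-zero weights are
# stored, as parallel index/value lists, so zeros cost nothing.
def sparse_dot(indices, values, dense_vec):
    """Dot product of a compressed sparse weight vector with a dense
    activation vector; one multiply-add per *non-zero* weight."""
    return sum(v * dense_vec[i] for i, v in zip(indices, values))

# An 85%-pruned weight vector of length 20 keeps only 3 non-zeros:
weights_idx = [2, 7, 13]
weights_val = [0.5, -1.0, 2.0]
acts = [1.0] * 20
result = sparse_dot(weights_idx, weights_val, acts)  # 0.5 - 1.0 + 2.0 = 1.5
```

The cost is irregular, data-dependent memory access, which SIMD-oriented GPUs dislike but an FPGA datapath can be shaped around.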
FPGA architecture and technology are closing the gap to GPUs
● Increased on-chip RAM: 28.6 MB on SX2800 vs. 13.5 MB on TITAN X
● On-par bandwidth to main memory (HBM2)
● HyperFlex to increase clock frequencies
● Nearly on-par peak FP32 performance: 9.2 TFLOP/s on SX2800 vs. 11 TFLOP/s on TITAN X
● Larger set of “native” data types through bit-level manipulations and FP16/32.
Customizable Hardware Architecture Template for DNNs
Figure 4: Customizable hardware architecture template for DNNs.
Evaluation: Methodology
Altera Quartus Early Beta for synthesis
Altera Early Power Estimator and post-implementation(?) netlist to estimate performance and power
GPU: nvprof for performance and power numbers on an implementation
Table 2: FPGAs and GPU under study.
Evaluation: Dense DNNs
Numbers from the respective data sheets, not from an implementation. They did not create an optimized FPGA implementation because “FP32 dense matrix multiplications are a sweet spot for GPUs, not FPGAs.”
FPGA frequency not specified.
Evaluation: Sparse (Pruned) DNNs
85% pruned sparsity on AlexNet (<1% accuracy loss) with FP32
GPU implementation: extension of the optimized open-source MAGMA dense matrix multiplication library; performs worse than dense multiplications. (cuSPARSE targets >99% sparsity.)
FPGA implementation: with Sparse PEs, giving 4x speedup.
300 MHz “conservative estimate”, 500 MHz and 700 MHz “moderate” and “aggressive projections”, respectively.
Evaluation: Compact DNNs
6 bit integers for weights and neurons
No GPU implementation, nominal Int8 peak performance used instead.
FPGA implementation based on systolic GEMM, achieving 920 MHz because “well optimized for frequency”.
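As a hypothetical sketch of what “Int6 weights and neurons” means: symmetric linear quantization maps FP32 values onto the 6-bit signed range [-32, 31]. The paper does not specify its exact quantization scheme, so scale and rounding below are assumptions.

```python
# Hypothetical symmetric linear quantization of FP32 values to int6;
# the scale factor and rounding mode are illustrative assumptions.
def quantize_int6(x, scale):
    """Round x/scale to the nearest integer and clamp to the
    6-bit signed range [-32, 31]."""
    q = round(x / scale)
    return max(-32, min(31, q))

q1 = quantize_int6(0.30, 0.01)   # well inside the range
q2 = quantize_int6(1.00, 0.01)   # saturates at the int6 maximum
q3 = quantize_int6(-0.50, 0.01)  # saturates at the int6 minimum
```

On an FPGA, a 6-bit multiplier costs a fraction of an FP32 one, while a GPU must round such operands up to its nearest supported width (Int8), which is why the paper compares against nominal Int8 peak performance.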
Evaluation: Binary DNNs
1 bit types for both weights and neurons
GPU implementation: xnor_gemm kernel from BinaryNet (CUDA threads perform xnor and bitcount operations; 32 32-bit bitcounts per SM per cycle)
FPGA implementation: systolic array of Binary PEs, 256-wide binary dot product operations; synthesized for both Arria and Stratix; measured on Arria, simulated on Stratix
Evaluation: Ternary ResNet-50
Ternary weights (-1, 0, +1), FP32 neurons; within 1% accuracy of full ResNet.
70..80% sparsity across weights and neurons (ideally 3.3..5x op. reduction).
GPU implementation: Torch, batch size 64, cuDNN 5 with the most aggressive performance setting; 3x faster than the closest other implementation.
FPGA implementation: only exploits sparsity in neurons to achieve a simpler design (450 MHz “conservative estimation”, but only 2x op. reduction).
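The appeal of ternary weights can be sketched in a few lines: every multiply degenerates into an add, a subtract, or a skip, so the 70..80% zeros translate directly into operations that never need to be issued. This is an illustrative sketch, not the paper’s accelerator design.

```python
# Sketch of a ternary-weight dot product: weights in {-1, 0, +1},
# activations ("neurons") in FP32. No multiplier is needed at all.
def ternary_dot(weights, acts):
    """Accumulate +a for w=+1, -a for w=-1, and skip w=0 entirely."""
    total = 0.0
    for w, a in zip(weights, acts):
        if w == 1:
            total += a
        elif w == -1:
            total -= a
        # w == 0: no operation issued — this is the exploitable sparsity
    return total

result = ternary_dot([1, 0, -1, 1], [0.5, 9.9, 0.25, 1.0])  # 0.5 - 0.25 + 1.0
```

Note the slide’s caveat: the presented FPGA design only skips zero *neurons*, not zero weights, so it realizes about 2x rather than the ideal 3.3..5x operation reduction.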
Conclusion
Can FPGAs beat GPUs in performance for next-generation DNNs?
Yes, if the Stratix 10 2800 meets Intel’s performance projections:
● sparse (85% pruned) AlexNet: 1.1x in performance, 1.9x in energy efficiency
● DNNs with narrow (Int6) data types: 1.5x in performance, 2.1x in energy efficiency
● BinaryNet: 5.4x in performance, 5.0x in energy efficiency
● Ternary ResNet-50 (ImageNet): 2.0x in performance, 2.7x in energy efficiency
My opinion: The good
Intel is trying to accelerate innovation in FPGAs (e.g., memory, architectural mix) and wants to challenge the market lead of GPUs for DNNs.
Concretely, Intel proposes an accelerator template and evaluates it in a promising case study.
They use a competitive baseline for the GPU or use GPU peak numbers where no such baseline was available.
My opinion: The bad
The paper is more marketing than science:
Core methods are based on unreleased tools, devices, and benchmarks, making the results irreproducible by the community and thus the claims unfalsifiable.
Comparison to related work is basically “all other work is based on obsolete platforms and/or not focused on emerging DNNs”.
Energy efficiency is important, but so are time-to-market and cost scaling. It remains unclear how and whether the proposed accelerator template can be integrated into a productive development framework, and whether Intel can price FPGAs competitively for wide adoption in HPC (SX2800 vs. TITAN X: $15k vs. $1.5k).
My opinion: The ugly
They compare their next-generation devices, which still had not hit the market by Q4 2017, to a GPU that has been in mass production since Q3 2016.
Their main figure of merit, energy efficiency, is based on preliminary estimations, not measurements. Moreover, really significant advantages only appear under the very aggressive projections.
Outlook: The systems perspective
The Intel Programmable Acceleration Card with Intel Arria 10 GX FPGA plugs into a server to accelerate workloads. Announced for Q2 2018. Image credit: Intel
● GPU-like system integration on the PCIe bus
● In-package integration on the server socket
Intel Arria 10 GX MCP co-integrated in a single package with a 15-core Broadwell EP, interconnected with QPI (?) for high-bandwidth, low-latency shared memory. Image credit: TheNextPlatform