2017-12-05 | Andreas Kurth | Hardware Acceleration for Data Processing Seminar
Discussion of the FPGA'17 paper by Intel Corp. (Nurvitadhi et al.)
Can FPGAs beat GPUs in accelerating next-generation Deep Neural Networks?
In short: The situation
➢ Implementing DNNs efficiently is very important.
Image credit: NVIDIA
In short: The problem
Which computing device is most suitable for this task?
An open question depending on many factors, but GPUs (and ASICs such as DaDianNao and the TPU) are the de-facto standard.
Why not FPGAs? They can nominally be more energy-efficient than GPUs, but their inferior memory interface and floating-point performance negate this advantage.
In short: The solution
Intel claims their upcoming FPGA families will address this, as one flagship FPGA (Stratix 10 SX2800) will feature:
● >5k “hard macro” floating-point units (FPUs)
● 28 MB on-chip RAM
● up to 1 TB/s off-chip memory bandwidth (HBM2)
In short: The evaluation
According to Intel’s calculations, the SX2800 matches or outperforms a state-of-the-art GPU (NVIDIA TITAN X) in terms of
● nominal GEMM performance: 9.2 TFLOP/s vs. 11 TFLOP/s (FP32)
● energy efficiency: 60 GFLOP/s/W vs. 45 GFLOP/s/W (FP32),
as well as in benchmarks:
● sparse (85% pruned) AlexNet: 1.1x in performance, 1.9x in energy efficiency
● DNNs with narrow (Int6) data types: 1.5x in performance, 2.1x in energy efficiency
● BinaryNet: 5.4x in performance, 5.0x in energy efficiency
● Ternary ResNet-50 (ImageNet): 2.0x in performance, 2.7x in energy efficiency
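The quoted nominal efficiency figures can be sanity-checked with simple arithmetic. As a sketch: the 250 W figure is the TITAN X TDP, and the ~153 W for the SX2800 is back-derived here from the claimed 60 GFLOP/s/W, i.e., it is an assumption, not a number from the slides.

```python
# Sanity check of the claimed nominal efficiency numbers.
# 153 W is back-derived from Intel's 60 GFLOP/s/W claim (assumption);
# 250 W is the published TITAN X TDP.
def efficiency_gflops_per_watt(tflops, watts):
    """GFLOP/s per watt from peak TFLOP/s and power draw."""
    return tflops * 1000 / watts

fpga = efficiency_gflops_per_watt(9.2, 153)   # Stratix 10 SX2800, ~60 GFLOP/s/W
gpu = efficiency_gflops_per_watt(11.0, 250)   # TITAN X, 44 GFLOP/s/W
```

At these power figures the numbers reproduce the claimed ~60 vs. ~45 GFLOP/s/W ratio of roughly 1.35x in the FPGA’s favor.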
If you only remember one thing from this talk ...
Intel claims their next-generation FPGAs will ... surpass state-of-the-art GPUs in energy efficiency and match them in performance
at SGEMM operations, both nominally and for real DNN workloads.
How does Intel justify this claim?
1) DNN trends could favor FPGAs.
2) FPGA architecture and technology are closing the gap to GPUs.
3) Intel developed a computational template that matches (1) to (2).
DNN trends could favor FPGAs
1) DNNs are getting deeper (more layers) to increase accuracy, but they are not getting larger in terms of memory.
➢ The increased compute density and the irregularity (e.g., in sizes and links) across layers are thought to favor FPGAs.
Table 1: Recent ImageNet challenge winners.
DNN trends could favor FPGAs
2) Compact data types (e.g., FP16 and Int8, but also binary or ternary) reduce the number of computations and the memory footprint at moderate accuracy losses.
➢ Even though modern GPUs support FP16 and Int8, operations on more exotic compact types (e.g., the binary XNOR-net) can be favorable for FPGAs.
Figure 5.b) Binarized matrix multiply implemented with XNOR and bitcount.
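The trick in Figure 5.b) can be sketched in a few lines: with weights and activations restricted to {-1, +1} and packed into machine words, the multiply collapses into an XNOR (signs agree = product +1) and the accumulate into a bitcount. This is an illustrative sketch, not the paper’s kernel.

```python
# Sketch of a binarized dot product (cf. Figure 5.b): {-1,+1} vectors
# packed as bit masks (bit=1 encodes +1, bit=0 encodes -1).
def binary_dot(a_bits, b_bits, n_bits):
    """Dot product of two packed {-1,+1} vectors of width n_bits."""
    mask = (1 << n_bits) - 1
    agree = ~(a_bits ^ b_bits) & mask   # XNOR: 1 wherever signs match
    matches = bin(agree).count("1")     # bitcount / popcount
    return 2 * matches - n_bits         # (#matches) - (#mismatches)

# a = [+1,-1,+1,-1] -> 0b1010, b = [+1,+1,-1,-1] -> 0b1100:
result = binary_dot(0b1010, 0b1100, 4)  # two agreements, two disagreements -> 0
```

One word-wide XNOR plus a popcount thus replaces 32 (or more) multiply-accumulates, which is exactly the kind of bit-level operation an FPGA fabric implements cheaply.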
DNN trends could favor FPGAs
3) Weights and neurons are never 100% non-zero (e.g., in non-fully-connected layers or after ReLU), yet zeros wastefully participate in calculations. Sparsity can additionally be increased by pruning weights that are deemed unimportant.
➢ Above a certain level of sparsity, FPGAs handle the resulting irregular computations more efficiently than GPUs.
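The idea can be illustrated with a compressed dot product: store only the non-zero weights as (index, value) pairs and skip the zero multiplies entirely. The names and format below are illustrative, not from the paper.

```python
# Sketch of a pruned (sparse) dot product: only non-zero weights are
# stored, as parallel index/value lists, so zeros cost nothing.
def sparse_dot(indices, values, dense_vec):
    """Dot product of a compressed sparse weight vector with a dense
    activation vector; one multiply-add per *non-zero* weight."""
    return sum(v * dense_vec[i] for i, v in zip(indices, values))

# An 85%-pruned weight vector of length 20 keeps only 3 non-zeros:
weights_idx = [2, 7, 13]
weights_val = [0.5, -1.0, 2.0]
acts = [1.0] * 20
result = sparse_dot(weights_idx, weights_val, acts)  # 0.5 - 1.0 + 2.0 = 1.5
```

The cost is irregular, data-dependent memory access, which SIMD-oriented GPUs dislike but an FPGA datapath can be shaped around.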
FPGA architecture and technology are closing the gap to GPUs
● Increased on-chip RAM: 28.6 MB on SX2800 vs. 13.5 MB on TITAN X
● On-par bandwidth to main memory (HBM2)
● HyperFlex to increase clock frequencies
● Nearly on-par peak FP32 performance: 9.2 TFLOP/s on SX2800 vs. 11 TFLOP/s on TITAN X
● Larger set of “native” data types through bit-level manipulations and FP16/32.
Customizable Hardware Architecture Template for DNNs
Figure 4: Customizable hardware architecture template for DNNs.
Evaluation: Methodology
Altera Quartus Early Beta for synthesis
Altera Early Power Estimator and post-implementation(?) netlist to estimate performance and power
GPU: nvprof for performance and power numbers on an implementation
Table 2: FPGAs and GPU under study.
Evaluation: Dense DNNs
Numbers from the respective data sheets, not from an implementation. They did not create an optimized FPGA implementation because “FP32 dense matrix multiplications are a sweet spot for GPUs, not FPGAs.”
FPGA frequency not specified.
Evaluation: Sparse (Pruned) DNNs
85% pruned sparsity on AlexNet (<1% accuracy loss) with FP32
GPU implementation: extension of the optimized open-source MAGMA dense matrix multiplication library; performs worse than dense multiplications. (cuSPARSE targets >99% sparsity.)
FPGA implementation: with Sparse PEs, giving 4x speedup.
300 MHz “conservative estimate”, 500 MHz and 700 MHz “moderate” and “aggressive projections”, respectively.
Evaluation: Compact DNNs
6 bit integers for weights and neurons
No GPU implementation, nominal Int8 peak performance used instead.
FPGA implementation based on systolic GEMM, achieving 920 MHz because “well optimized for frequency”.
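As a hypothetical sketch of what “Int6 weights and neurons” means: symmetric linear quantization maps FP32 values onto the 6-bit signed range [-32, 31]. The paper does not specify its exact quantization scheme, so scale and rounding below are assumptions.

```python
# Hypothetical symmetric linear quantization of FP32 values to int6;
# the scale factor and rounding mode are illustrative assumptions.
def quantize_int6(x, scale):
    """Round x/scale to the nearest integer and clamp to the
    6-bit signed range [-32, 31]."""
    q = round(x / scale)
    return max(-32, min(31, q))

q1 = quantize_int6(0.30, 0.01)   # well inside the range
q2 = quantize_int6(1.00, 0.01)   # saturates at the int6 maximum
q3 = quantize_int6(-0.50, 0.01)  # saturates at the int6 minimum
```

On an FPGA, a 6-bit multiplier costs a fraction of an FP32 one, while a GPU must round such operands up to its nearest supported width (Int8), which is why the paper compares against nominal Int8 peak performance.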
Evaluation: Binary DNNs
1 bit types for both weights and neurons
GPU implementation: xnor_gemm kernel from BinaryNet (CUDA threads perform xnor and bitcount operations; 32 32-bit bitcounts per SM per cycle)
FPGA implementation: systolic array of Binary PEs, 256-wide binary dot product operations; synthesized for both Arria and Stratix; measured on Arria, simulated on Stratix
Evaluation: Ternary ResNet-50
Ternary weights (-1, 0, +1), FP32 neurons; within 1% accuracy of full ResNet.
70..80% sparsity across weights and neurons (ideally 3.3..5x op. reduction).
GPU implementation: Torch, batch size 64, cuDNN 5 with the most aggressive performance setting; 3x faster than the closest other implementation.
FPGA implementation: only exploits sparsity in neurons to achieve a simpler design (450 MHz “conservative estimation”, but only 2x op. reduction).
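The appeal of ternary weights can be sketched in a few lines: every multiply degenerates into an add, a subtract, or a skip, so the 70..80% zeros translate directly into operations that never need to be issued. This is an illustrative sketch, not the paper’s accelerator design.

```python
# Sketch of a ternary-weight dot product: weights in {-1, 0, +1},
# activations ("neurons") in FP32. No multiplier is needed at all.
def ternary_dot(weights, acts):
    """Accumulate +a for w=+1, -a for w=-1, and skip w=0 entirely."""
    total = 0.0
    for w, a in zip(weights, acts):
        if w == 1:
            total += a
        elif w == -1:
            total -= a
        # w == 0: no operation issued — this is the exploitable sparsity
    return total

result = ternary_dot([1, 0, -1, 1], [0.5, 9.9, 0.25, 1.0])  # 0.5 - 0.25 + 1.0
```

Note the slide’s caveat: the presented FPGA design only skips zero *neurons*, not zero weights, so it realizes about 2x rather than the ideal 3.3..5x operation reduction.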
Conclusion
Can FPGAs beat GPUs in performance for next-generation DNNs?
Yes, if the Stratix 10 2800 meets Intel’s performance projections:
● sparse (85% pruned) AlexNet: 1.1x in performance, 1.9x in energy efficiency
● DNNs with narrow (Int6) data types: 1.5x in performance, 2.1x in energy efficiency
● BinaryNet: 5.4x in performance, 5.0x in energy efficiency
● Ternary ResNet-50 (ImageNet): 2.0x in performance, 2.7x in energy efficiency
My opinion: The good
Intel is trying to accelerate innovation in FPGAs (e.g., memory, architectural mix) and wants to challenge the market lead of GPUs for DNNs.
Concretely, Intel proposes an accelerator template and evaluates it in a promising case study.
They use a competitive baseline for the GPU or use GPU peak numbers where no such baseline was available.
My opinion: The bad
The paper is more marketing than science:
Core methods are based on unreleased tools, devices, and benchmarks, making the results irreproducible by the community and thus the claims unfalsifiable.
Comparison to related work is basically “all other work is based on obsolete platforms and/or not focused on emerging DNNs”.
Energy efficiency is important, but so are time-to-market and cost scaling. It remains unclear how and whether the proposed accelerator template can be integrated into a productive development framework, and whether Intel can price FPGAs competitively for wide adoption in HPC (SX2800 vs. TITAN X: $15k vs. $1.5k).
My opinion: The ugly
They compare their next-generation devices, which still had not hit the market by Q4 2017, to a GPU that has been in mass production since Q3 2016.
Their main figure of merit, energy efficiency, is based on preliminary estimations, not measurements. Moreover, really significant advantages only appear under the very aggressive projections.
Outlook: The systems perspective
The Intel Programmable Acceleration Card with Intel Arria 10 GX FPGA plugs into a server to accelerate workloads. Announced for Q2 2018. Image credit: Intel
● GPU-like system integration on the PCIe bus
● In-package integration on the server socket
Intel Arria 10 GX MCP co-integrated in a single package with a 15-core Broadwell EP, interconnected with QPI (?) for high-bandwidth, low-latency shared memory. Image credit: TheNextPlatform