Can FPGAs beat GPUs in accelerating next-generation Deep Neural Networks?

Hardware Acceleration for Data Processing Seminar | Andreas Kurth | 2017-12-05

Transcript of the presentation (23 slides).

Page 1:

Discussion of the FPGA‘17 paper by Intel Corp. (Nurvitadhi et al.)

Can FPGAs beat GPUs in accelerating next-generation Deep Neural Networks?

Page 2:

In short: The situation

➢ Implementing DNNs efficiently is very important.

Image credit: NVIDIA

Page 3:

In short: The problem

Which computing device is most suitable for this task?

An open question that depends on many factors, but GPUs (and ASICs such as DaDianNao and the TPU) are the de-facto standard.

Why not FPGAs? They can nominally be more energy-efficient than GPUs, but their inferior memory interface and floating-point performance negate this advantage.

Page 4:

In short: The solution

Intel claims their upcoming FPGA families will address this, as one flagship FPGA (Stratix 10 SX2800) will feature:

● >5k “hard macro” floating-point units (FPUs)
● 28 MB on-chip RAM
● up to 1 TB/s off-chip memory bandwidth (HBM2)

Page 5:

In short: The evaluation

According to Intel’s calculations, the SX2800 matches or outperforms a state-of-the-art GPU (NVIDIA TITAN X) in terms of

● nominal GEMM performance: 9.2 TFLOP/s vs. 11 TFLOP/s (FP32) and
● energy efficiency: 60 GFLOP/s/W vs. 45 GFLOP/s/W (FP32),

as well as in benchmarks:

● sparse (85% pruned) AlexNet: 1.1x in performance, 1.9x in energy efficiency
● DNNs with narrow (Int6) data types: 1.5x in performance, 2.1x in energy efficiency
● BinaryNet: 5.4x in performance, 5.0x in energy efficiency
● Ternary ResNet-50 (ImageNet): 2.0x in performance, 2.7x in energy efficiency
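As a sanity check, the nominal numbers above imply the following chip power draws (simple arithmetic on the slide’s figures, not values reported in the paper):

```python
# Implied power from the nominal numbers on this slide:
# power [W] = performance [GFLOP/s] / efficiency [GFLOP/s/W]
for name, tflops, gflops_per_w in [
    ("Stratix 10 SX2800", 9.2, 60),
    ("TITAN X",           11.0, 45),
]:
    watts = tflops * 1000 / gflops_per_w
    print(f"{name}: {watts:.0f} W")
# Stratix 10 SX2800: 153 W
# TITAN X: 244 W
```

So the claimed efficiency advantage corresponds to roughly 90 W less power at comparable throughput.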

Page 6:

If you only remember one thing from this talk ...

Intel claims their next-generation FPGAs will ... surpass state-of-the-art GPUs in terms of energy efficiency and match them in performance

at SGEMM operations both nominally and for real DNN workloads.

Page 7:

How does Intel justify this claim?

1) DNN trends could favor FPGAs.

2) FPGA architecture and technology is closing the gap to GPUs.

3) Intel developed a computational template that matches (1) to (2).

Page 8:

DNN trends could favor FPGAs

1) DNNs are getting deeper (more layers) to increase accuracy, but they are not getting larger in terms of memory.

➢ The increased compute density and the employed irregularity (e.g., sizes, links) across layers are thought to be favorable for FPGAs.

Table 1: Recent ImageNet challenge winners.

Page 9:

DNN trends could favor FPGAs

2) Compact data types (e.g., FP16 and Int8, but also binary or ternary) reduce the number of computations and the memory footprint at moderate accuracy losses.

➢ Even though modern GPUs support FP16 and Int8, operations on data types beyond FP32/64 (e.g., the binary XNOR-Net) can be favorable for FPGAs.

Figure 5.b) Binarized matrix multiply implemented with XNOR and bitcount.
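The binarized multiply of Figure 5.b can be sketched in plain Python (a hypothetical illustration of the XNOR-and-bitcount idea, not the paper’s implementation): vectors over {-1, +1} are packed into machine words, and a dot product reduces to an XNOR followed by a population count.

```python
def binary_dot(a_bits: int, b_bits: int, n: int) -> int:
    """Dot product of two n-element {-1,+1} vectors packed as bitmasks
    (bit = 1 encodes +1, bit = 0 encodes -1), via XNOR and bitcount."""
    mask = (1 << n) - 1
    xnor = ~(a_bits ^ b_bits) & mask   # bit i is 1 where a_i == b_i
    matches = bin(xnor).count("1")     # popcount of the XNOR result
    # Each matching element contributes +1, each mismatch -1:
    return 2 * matches - n

# Two 4-element vectors whose true dot product is 0:
print(binary_dot(0b1011, 0b1101, 4))  # -> 0
```

Packing 32 or 256 such elements per word is what lets a GPU thread (or an FPGA PE) process an entire sub-vector per cycle.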

Page 10:

DNN trends could favor FPGAs

3) Weights and neurons are never 100% non-zero (e.g., in non-fully-connected layers or after ReLU), yet zeros wastefully participate in calculations. Sparsity can additionally be increased by pruning weights that are deemed unimportant.

➢ Above a certain level of sparsity, FPGAs support sparse calculations more efficiently than GPUs because they handle the resulting irregular computations better.
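The pruning step can be sketched as follows (a minimal illustration assuming simple magnitude-based pruning toward the 85% target used in the AlexNet experiment; the paper does not detail Intel’s exact method):

```python
import numpy as np

def prune_by_magnitude(weights: np.ndarray, sparsity: float) -> np.ndarray:
    """Zero out the smallest-magnitude weights until the given
    fraction of entries is zero."""
    k = int(sparsity * weights.size)   # number of weights to drop
    if k == 0:
        return weights.copy()
    threshold = np.sort(np.abs(weights), axis=None)[k - 1]
    pruned = weights.copy()
    pruned[np.abs(pruned) <= threshold] = 0.0
    return pruned

rng = np.random.default_rng(0)
w = rng.standard_normal((64, 64)).astype(np.float32)
w85 = prune_by_magnitude(w, 0.85)
print(1.0 - np.count_nonzero(w85) / w85.size)  # ~0.85
```

After pruning, the network is typically fine-tuned to recover accuracy (hence the "<1% accuracy loss" reported later for AlexNet).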

Page 11:

FPGA architecture and technology is closing the gap to GPUs

● Increased on-chip RAM: 28.6 MB on SX2800 vs. 13.5 MB on TITAN X

● On-par bandwidth to main memory (HBM2)

● HyperFlex to increase clock frequencies

● Nearly on-par peak FP32 performance: 9.2 TFLOP/s on SX2800 vs. 11 TFLOP/s on TITAN X

● Larger set of “native” data types through bit-level manipulations and FP16/32.

Page 12:

Customizable Hardware Architecture Template for DNNs

Figure 4: Customizable hardware architecture template for DNNs.

Page 13:

Evaluation: Methodology

Altera Quartus Early Beta for synthesis

Altera Early Power Estimator and post-implementation(?) netlist to estimate performance and power

GPU: nvprof for performance and power numbers on an implementation

Table 2: FPGAs and GPU under study.

Page 14:

Evaluation: Dense DNNs

Numbers are from the respective data sheets, not from an implementation. They did not create an optimized FPGA implementation because “FP32 dense matrix multiplications are a sweet spot for GPUs, not FPGAs.”

FPGA frequency not specified.

Page 15:

Evaluation: Sparse (Pruned) DNNs

85% pruned sparsity on AlexNet (<1% accuracy loss) with FP32

GPU implementation: extension of the optimized open-source MAGMA dense matrix multiplication library; performs worse than dense multiplications. (cuSPARSE targets >99% sparsity.)

FPGA implementation: with Sparse PEs, giving a 4x speedup.

300 MHz is a “conservative estimate”; 500 MHz and 700 MHz are “moderate” and “aggressive” projections, respectively.

Page 16:

Evaluation: Compact DNNs

6-bit integers for weights and neurons

No GPU implementation; nominal Int8 peak performance is used instead.

FPGA implementation: based on a systolic GEMM, achieving 920 MHz because it is “well optimized for frequency”.
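A narrow-integer scheme like the Int6 one can be sketched as symmetric linear quantization (a hypothetical illustration; the paper does not spell out its quantization recipe):

```python
import numpy as np

def quantize_symmetric(x: np.ndarray, bits: int = 6):
    """Symmetric linear quantization of FP32 values to signed `bits`-bit
    integer codes; returns the codes and the scale to dequantize."""
    qmax = 2 ** (bits - 1) - 1                 # e.g. 31 for Int6
    max_abs = float(np.max(np.abs(x)))
    scale = max_abs / qmax if max_abs > 0 else 1.0
    q = np.clip(np.round(x / scale), -qmax, qmax).astype(np.int8)
    return q, scale

x = np.array([0.9, -0.5, 0.1, -1.2], dtype=np.float32)
q, s = quantize_symmetric(x, bits=6)
print(q)        # integer codes in [-31, 31]
print(q * s)    # dequantized approximation of x
```

Narrow codes shrink both the multipliers and the on-chip storage, which is why an FPGA can instantiate many more Int6 PEs than FP32 ones.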

Page 17:

Evaluation: Binary DNNs

1-bit types for both weights and neurons

GPU implementation: xnor_gemm kernel from BinaryNet (CUDA threads perform xnor and bitcount operations; 32 32-bit bitcounts per SM per cycle)

FPGA implementation: systolic array of Binary PEs, 256-wide binary dot product operations; synthesized for both Arria and Stratix; measured on Arria, simulated on Stratix

Page 18:

Evaluation: Ternary ResNet-50

Ternary weights (-1, 0, +1), FP32 neurons; within 1% accuracy of full ResNet

70..80% sparsity across weights and neurons (ideally a 3.3..5x reduction in operations)

GPU implementation: Torch, batch size 64, cuDNN 5 with the most aggressive performance setting; 3x faster than the closest other implementation

FPGA implementation: exploits sparsity only in neurons to achieve a simpler design (450 MHz “conservative estimate”, but only a 2x reduction in operations)
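Ternarization can be sketched as thresholding weights into {-1, 0, +1} with a per-tensor scale (a simplified illustration in the style of Ternary Weight Networks; not necessarily the exact recipe behind Ternary ResNet-50):

```python
import numpy as np

def ternarize(w: np.ndarray, delta_factor: float = 0.7):
    """Map FP32 weights to {-1, 0, +1} * alpha. Weights with
    |w| <= delta become 0, creating the sparsity the FPGA exploits."""
    delta = delta_factor * np.mean(np.abs(w))   # TWN-style threshold heuristic
    t = np.where(np.abs(w) > delta, np.sign(w), 0.0)
    nonzero = np.abs(w) > delta
    alpha = float(np.mean(np.abs(w[nonzero]))) if nonzero.any() else 0.0
    return t, alpha

rng = np.random.default_rng(1)
w = rng.standard_normal(10_000).astype(np.float32)
t, alpha = ternarize(w)
print(sorted(set(t.tolist())))             # [-1.0, 0.0, 1.0]
print(1.0 - np.count_nonzero(t) / t.size)  # fraction of zeroed weights
```

The zeros let multiplications be skipped entirely, and the remaining ±1 weights reduce each multiply to a sign flip plus one scale per tensor.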

Page 19:

Conclusion

Can FPGAs beat GPUs in performance for next-generation DNNs?

Yes, if the Stratix 10 2800 meets Intel’s performance projections:

● sparse (85% pruned) AlexNet: 1.1x in performance, 1.9x in energy efficiency
● DNNs with narrow (Int6) data types: 1.5x in performance, 2.1x in energy efficiency
● BinaryNet: 5.4x in performance, 5.0x in energy efficiency
● Ternary ResNet-50 (ImageNet): 2.0x in performance, 2.7x in energy efficiency

Page 20:

My opinion: The good

Intel is trying to accelerate innovation in FPGAs (e.g., memory, architectural mix) and wants to challenge the market lead of GPUs for DNNs.

Concretely, Intel proposes an accelerator template and evaluates it in a promising case study.

They use a competitive baseline for the GPU, or GPU peak numbers where no such baseline was available.

Page 21:

My opinion: The bad

The paper is more marketing than science:

Core methods are based on unreleased tools, devices, and benchmarks, making the results not reproducible by the community and the claims therefore not falsifiable.

Comparison to related work is basically “all other work is based on obsolete platforms and/or not focused on emerging DNNs”.

Energy efficiency is important, but so are time-to-market and cost scaling. It remains unclear whether and how the proposed accelerator template can be integrated into a productive development framework, and whether Intel can price FPGAs competitively for wide adoption in HPC (SX2800 vs. TITAN X: $15k vs. $1.5k).

Page 22:

My opinion: The ugly

They compare their next-generation devices, which still have not hit the market as of Q4 2017, to a GPU that was in mass production in Q3 2016.

Their main figure of merit, energy efficiency, is based on preliminary estimates, not measurements. Moreover, truly significant advantages result only under the very aggressive projections.

Page 23:

Outlook: The systems perspective

The Intel Programmable Acceleration Card with Intel Arria 10 GX FPGA plugs into a server to accelerate workloads. Announced for Q2 2018. Image credit: Intel

GPU-like system integration on the PCIe bus vs. in-package integration on the server socket

Intel Arria 10 GX MCP co-integrated in a single package with a 15-core Broadwell EP, interconnected with QPI (?) for high-bandwidth, low-latency shared memory. Image credit: TheNextPlatform