Programming and Benchmarking FPGAs with Software-Centric Design Entries

Cathal McCabe, Xilinx University Program
DATE, March 2016

© Copyright 2015 Xilinx.

Agenda

FPGA/CPU scaling

Productivity Gap

Heterogeneous computing

SDx

Benchmarking

The legal speak

Moore’s Law
– Number of transistors doubles every 1/1.5/2 years

Moore’s Second Law (Rock’s law)
– Cost of fabs increases exponentially

Moore’s Law’s Law
– Number of people predicting the end of Moore’s law doubles every year

Dennard’s Law (Scaling)
– As transistors get smaller their power density remains constant

*“Cramming More Components onto Integrated Circuits,” Electronics, pp. 114–117, April 19, 1965.

Transistors, Clock Speed, Power

The world around us hasn’t stopped scaling

*Cisco Visual Networking Index: Global Mobile Data Traffic Forecast Update, 2015–2020 White Paper

FPGA evolution

– 2.5 micron XC2064: 85,000 transistors
– 20nm Virtex UltraScale: over 20,000,000,000 transistors

[Figure: Maximum density of Xilinx FPGA by node]

FPGA Scaling

[Figure panels:]
– Maximum Xilinx SerDes rate per pin
– Aggregate Xilinx SerDes bandwidth
– High-end PC per-pin DDR rate
– Aggregate Xilinx FPGA memory BW per node

*Liam Madden, “The FPGA, an engine for innovation in silicon and packaging technology,” FPL 2014

Trends: Increasingly heterogeneous and integrated

Heterogeneous compute is required to facilitate smarter, faster, greener computing
– A new generation of heterogeneous compute platforms is emerging
• HP Moonshot & The Machine, IBM’s OpenPOWER

Accelerator integration transitions from
– Loosely coupled IO devices, to
– Tightly coupled QPI or CAPI accelerators, to
– On-chip integration with processors & memory
• Example: Xilinx Zynq MPSoC & Intel/Altera hybrid
• NVDIMMs, Micron’s Automata processor, MoSys BWE

Productivity Gap: Application Perspective
(David Thomas, Imperial College, UK)

[Figure: relative performance (0.25 to 256, log scale) against design time (hour, day, week, month, year) for CPU, GPU and FPGA, annotated with the phases Initial Design, Parallelisation and Clock Rate]

FPGAs provide large speed-up and power savings – at a price!
– Days or weeks to get an initial version working
– Multiple optimisation and verification cycles to get high performance

Evolution of Design Environments

Legacy
• ISE, RTL-based design entry with IP library

Embedded CPU integration
• MicroBlaze, SDK, EDK

Raised abstraction for accelerators
• Vivado HLS
• SDNet (DSL PX), SDAccel (OpenCL)
• Vivado, block stitching and manual integration of platform in RTL

Simplified host integration & automated infrastructure creation
• SDSoC, SDNet, SDAccel
• Predefined methods for data transfer & automated implementation

[Diagram axes: Time, Abstraction]

SDx - SDSoC, SDAccel

Legacy MPSoC Design

Component                    Tool             Design entry
Application                  SDK              C/C++
Driver                       SDK, OS Tools    C
Datamover, PS-PL interface   IP Integrator    IPI project
IP                           Vivado, HLS      Verilog, VHDL

[Diagram labels: PS, PL, HW-SW partition, spec, “Met Req?”]

Need to modify multiple levels of design entry

SDSoC: HW Acceleration from C/C++

– Software IDE
– Migrate C/C++ functions to hardware / simple hardware-software partitioning
– Fast estimation
– Full system generation including OS, drivers, hardware connectivity, data movers
– System-level debug and profiling

[Diagram labels: C/C++ Applications, System-level Profiling, Specify Functions for Hardware, Rapid System Level Performance Estimation, Full System Generation]

After SDSoC: Automatic System Generation

C/C++ to system in hours or days

[Diagram: from a single C/C++ source, SDSoC generates the PS side (Application, Driver) and the PL side (IP, Datamover, PS-PL interface), with a “Met Req?” check looping back to the HW-SW partition spec]

    func1_sw();
    func2_hw();
    func3_hw();

– HW-SW partition spec
– Fast Estimate
– Automatic HW/SW interfaces, Data Movers

OpenCL

Evolving standard since 2008
Defined by Khronos Group
– Broad vendor adoption: CPU, GPU, DSP, FPGA
– Application developers: Adobe, Huawei, Baidu, Fujitsu, …
Initial GPU-favourable hardware abstractions, now moving towards a more neutral standard

[Diagram annotations:]
– Data-parallel execution of all PEs within 1 CU
– Task-parallel execution only between compute units
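To make the two levels of parallelism on this slide concrete, here is a minimal sketch in plain Python (not OpenCL) that emulates how a one-dimensional NDRange is split into work-groups, which are scheduled onto compute units, and work-items, which run on the processing elements of one compute unit. The helper names `run_ndrange` and `vadd_kernel` are hypothetical and exist only for this illustration; the loops run sequentially, whereas real hardware would execute them in parallel.

```python
def run_ndrange(kernel, global_size, local_size, *args):
    """Sequentially emulate a 1-D NDRange launch (illustration only)."""
    if global_size % local_size != 0:
        raise ValueError("global_size must be a multiple of local_size")
    num_groups = global_size // local_size
    # Outer loop: work-groups are independent of each other and can be
    # distributed across compute units (CUs).
    for group_id in range(num_groups):
        # Inner loop: work-items of one group run data-parallel on the
        # processing elements (PEs) of a single CU.
        for local_id in range(local_size):
            # Same relation OpenCL uses for get_global_id(0) with a zero offset.
            global_id = group_id * local_size + local_id
            kernel(global_id, *args)
    return num_groups


def vadd_kernel(gid, a, b, out):
    """Body of a vector-add kernel: one work-item handles one element."""
    out[gid] = a[gid] + b[gid]


a = list(range(8))
b = [10 * x for x in range(8)]
out = [0] * 8
groups = run_ndrange(vadd_kernel, 8, 4, a, b, out)
print(groups, out)
```

With a global size of 8 and a local size of 4, the emulation visits two work-groups of four work-items each.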

SDAccel: Development Environment for C, C++ and OpenCL

Readily Available Boards

Libraries                      Availability
OpenCL built-ins               Included
Video, DSP, Linear Algebra     Included
OpenCV, BLAS                   Provided by Auviz Systems

Now that we have platform compatibility*, we can figure out which applications work on which platform!

*for a subset of applications that fit the OpenCL execution model, and don’t run into resource limitations, and don’t use vendor-specific extensions

Challenges in Benchmarking with OpenCL

No performance portability
– Numerous implementations for each application yield very different results
• High discrepancy between out-of-the-box and optimized performance
• Optimizations for different platforms are very different
• Many benchmarks are biased
– Running GPU benchmarks on FPGAs

Figures of merit/cost functions for a diverse set of hardware architectures
– Instruction sets are fundamentally different
– Different hardware counters, different profiling tools for different platforms

Correlating theoretical numbers (hardware-neutral) with measured results

What is an acceptable development effort?

Over what do we normalize: cost, technology node?

Key Concepts

Characterization (broad)
– Profile & characterize a set of potential hardware platforms
• Leveraging and extending the UC Berkeley roofline model (Cabezas et al., “Extending the roofline model: Bottleneck analysis with microarchitectural constraints,” IISWC 2014)
– Analyse and profile a broad set of applications
• Compute load and memory requirements
– Correlation yields performance insights

Benchmarking (deep)
– Out-of-the-box implementation
• Static code analysis
• Profiling on all platforms
– Optimize implementations for all platforms until
• Optimal performance is reached, or discrepancy to optimal performance is fully understood
• Static code analysis
• Profiling on all platforms

Abstractions

[Diagram labels:]
– Hardware Platform executing some work
– Host interface (multiple channels)
– Memory interface (multiple channels)
– Network interface (multiple channels)
– Currently only considering:
– Cache
– Implementation of an Application

Figures of Merit

Implementation I of application A
– Work W [OPS], float and non-float
– Device memory traffic DMT [Bytes]
– Operational Intensity OI = W/DMT
– Execution time T [ms] (memory, accelerator, memory)
– Performance P [OPS/s]
– Power consumption PWR [Watt]
– Energy consumption E [Joules]
– Ratio R of out-of-the-box : optimized
– Number of design revisions REV

Hardware Platform
– Theoretical and measured peak performance P [OPS/s], float and non-float*
– Theoretical and measured peak memory bandwidth BW [Bytes/s]
– Peak power consumption TDP [Watt]

*Max of LINPACK, SHOC L0, CLPeak
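The per-implementation quantities listed on this slide are related by simple arithmetic. The following minimal Python sketch shows one way to derive OI, P and E from the measured W, DMT, T and PWR. The class name `Measurement`, the helper `ratio_r`, and all numeric values are hypothetical. The sketch also assumes that P = W/T, that E = PWR × T with PWR as an average over the run, and that R compares the performance of the two implementations; the slides do not spell these out.

```python
from dataclasses import dataclass


@dataclass
class Measurement:
    """Raw measurements for one implementation I of an application A."""
    work_ops: float        # W   [OPS]
    traffic_bytes: float   # DMT [Bytes]
    time_ms: float         # T   [ms]
    power_w: float         # PWR [Watt], assumed to be the average over the run

    @property
    def operational_intensity(self) -> float:
        """OI = W / DMT [OPS/Byte]."""
        return self.work_ops / self.traffic_bytes

    @property
    def performance(self) -> float:
        """P = W / T [OPS/s] (assumed definition)."""
        return self.work_ops / (self.time_ms / 1000.0)

    @property
    def energy_j(self) -> float:
        """E = PWR * T [Joules] (assumed definition)."""
        return self.power_w * (self.time_ms / 1000.0)


def ratio_r(out_of_the_box: Measurement, optimized: Measurement) -> float:
    """R as out-of-the-box performance over optimized performance (assumption)."""
    return out_of_the_box.performance / optimized.performance


# Hypothetical numbers, chosen only to exercise the formulas.
baseline = Measurement(work_ops=2e9, traffic_bytes=5e8, time_ms=400.0, power_w=25.0)
tuned = Measurement(work_ops=2e9, traffic_bytes=5e8, time_ms=100.0, power_w=25.0)
print(tuned.operational_intensity, tuned.performance, tuned.energy_j)
print(ratio_r(baseline, tuned))
```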

Rooflines for Hardware Platforms

Peak performance as a function of operational intensity
– P_attainable = min{OI*BW, P_peak}

[Figure: roofline plot of achievable performance (GOPS/sec, log) against the operational intensity of an implementation (OPS:Byte, log); a sloped memory-bound region rises to the flat compute-bound region at maximum performance]

Example
– Hardware: P_peak = 100 GOPS/s, BW = 1 GB/s
– Implementation: OI = 1 OPS/Byte
– Estimated peak performance for I: min{1 OPS/Byte * 1 GB/s, 100 GOPS/s} = 1 GOPS/s
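The roofline estimate on this slide can be reproduced with a few lines of Python. This is a minimal sketch using the slide's own example values; the function names are hypothetical, and the ridge point (the operational intensity at which the memory-bound and compute-bound regions meet) is a standard roofline quantity added here for illustration, not something the slide states.

```python
def attainable_performance(oi, bw, p_peak):
    """Roofline bound: min(OI * BW, P_peak).

    oi     -- operational intensity [OPS/Byte]
    bw     -- peak memory bandwidth [GB/s]
    p_peak -- peak compute performance [GOPS/s]
    """
    return min(oi * bw, p_peak)


def ridge_point(bw, p_peak):
    """Operational intensity above which the platform is compute bound."""
    return p_peak / bw


# Values from the slide: P = 100 GOPS/s, BW = 1 GB/s, OI = 1 OPS/Byte.
estimate = attainable_performance(oi=1.0, bw=1.0, p_peak=100.0)
ridge = ridge_point(bw=1.0, p_peak=100.0)
print(estimate, ridge)
```

For the example hardware, the implementation sits far to the left of the ridge point, so it is memory bound and the estimate equals OI * BW.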

Applications in Rooflines

Allows performance estimates, tracking of optimizations

[Figure: roofline plot of achievable performance (GOPS/sec, log) against operational intensity (OPS:Byte, log), marking the estimated peak performance of an implementation of Application A together with its measurements]

Considered Platforms

Considered Hardware Platforms

Xeon processors (E5-2660 Ivy Bridge)
– Few beefy cores at high rates & out-of-order execution (10 cores @ 2.2GHz)
– Large memory @ medium bandwidth (64GB @ 60GBps)
– Fully coherent memory subsystem (automatically managed)
– 256b vector units per core

Xeon Phi (5110P)
– More beefy (in-order processing) cores (60 cores @ 1.053GHz)
– Fast external memory with lower density (8GB @ 320GBps)
– Fully coherent memory system
– 512b vector processing units per core

Nvidia K20x
– Huge number of light-weight threads, SIMT (13*192 @ 732MHz)
– Fast external memory with lower density (6GB @ 250GBps)
– Specialized hardware DP units

FPGA ADM_PCIe_7V3
– Massively parallel and fine-grained architecture
– Medium memory at slow speed (16GB @ 21.3GBps)
– High power efficiency (25W)
– Massive network connectivity

Roofline for Non-Float: Theoretical Peak Performance

[Figure: rooflines for FPGA, Phi, GPU and CPU]

Roofline for Non-Float: Theoretical Peak Performance/Peak Power

[Figure: rooflines in GOPS/Watt for FPGA, Phi, GPU and CPU]

The Applications

Characterization of Application (out-of-the-box): Initial Estimates

Application        Type          Operational Intensity (min)   Operational Intensity (max)
Video Scaler       Fixed point   ≈ 1 for (1,1,1,1)             ≈ 446 for (16,16,16,16)
BobJenkins Hash    Non-float     3.1                           4.5
Memcached          Non-float     3.65                          300
….

Video Scaler: OI = 7*Vtaps*Htaps*sh*sv / (4*(sv*sh+1)), evaluated for (sv, sh, Vtaps, Htaps)

Example 1: Polyphase Video Scaler

Generic Image Filtering Equation

Theoretical Operational Intensity:
– OI = 7*Vtaps*Htaps*sh*sv / (4*(sv*sh+1))

sv, sh, Vtaps, Htaps    TOI
1, 1, 1, 1              1
4, 4, 4, 4              26
16, 16, 16, 16          446

Beyond that, this doesn’t matter …

Results
– Complexity independent of image size
– Complexity increases with larger window and scaling factor
– Well suited for FPGA acceleration
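As a quick check of the theoretical operational intensity values tabulated on this slide, the formula can be evaluated directly. The sketch below is a minimal Python version; the function name `scaler_oi` is hypothetical, and the formula is read as 7*Vtaps*Htaps*sh*sv divided by 4*(sv*sh+1), which is the grouping that reproduces the slide's rounded values.

```python
def scaler_oi(sv, sh, vtaps, htaps):
    """Theoretical operational intensity of the polyphase video scaler.

    Reads the slide's formula as 7*Vtaps*Htaps*sh*sv / (4*(sv*sh+1)).
    """
    return 7 * vtaps * htaps * sh * sv / (4 * (sv * sh + 1))


# Parameter sets from the slide, in the order (sv, sh, vtaps, htaps).
configs = [(1, 1, 1, 1), (4, 4, 4, 4), (16, 16, 16, 16)]
toi = {cfg: scaler_oi(*cfg) for cfg in configs}
for cfg, value in toi.items():
    print(cfg, round(value))
```

The unrounded values are 0.875, about 26.35 and about 446.26, which round to the 1, 26 and 446 shown in the table.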

Example 2: Bob Jenkins Hash Algorithm

Theoretical Operational Intensity:
– OI = 4.5 (for maximum key size)

Results
– Memory bound => poor performance on FPGA
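The memory-bound conclusion can be illustrated by combining OI = 4.5 with the memory bandwidths quoted on the Considered Hardware Platforms slide. The following minimal Python sketch computes only the bandwidth ceiling OI * BW for each platform; it ignores the compute ceiling, whose peak values are not listed in this transcript, so the numbers are upper bounds under that assumption and not measured results. The dictionary keys are shorthand labels chosen for this example.

```python
# Memory bandwidths in GB/s, taken from the "Considered Hardware Platforms" slide.
bandwidth_gbps = {
    "CPU": 60.0,    # Xeon E5-2660
    "Phi": 320.0,   # Xeon Phi 5110P
    "GPU": 250.0,   # Nvidia K20x
    "FPGA": 21.3,   # ADM_PCIe_7V3
}

OI_JENKINS = 4.5  # OPS/Byte, theoretical OI for maximum key size

# Bandwidth ceiling of the roofline: attainable GOPS/s cannot exceed OI * BW.
memory_bound = {name: OI_JENKINS * bw for name, bw in bandwidth_gbps.items()}
lowest = min(memory_bound, key=memory_bound.get)

for name, gops in sorted(memory_bound.items(), key=lambda item: item[1]):
    print(f"{name}: at most {gops:.2f} GOPS/s")
```

With the lowest external memory bandwidth of the four platforms, the FPGA board has the lowest bandwidth ceiling at this operational intensity, which is consistent with the slide's observation.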

Example 3: Key-Value Stores
(HotStorage 2015, “Scaling out to a Single-Node 80Gbps Memcached Server with 40Terabytes of Memory”)

Streaming application
Network interface, SSD interface

Theoretical Operational Intensity (per packet, GET operation):
– OI = [3.65, 300]

Results
– 80Gbps performance
– 40TBytes

Observations

Preliminary results within the constraints of the OpenCL 1.2 execution model
– FPGAs excel at highly compute-intensive applications where non-float operations dominate performance
– Applications that cannot reach peak performance on other platforms
• For example, applications with a data-flow compute model, high branch conversion, asymmetric data types
– Video scaler, CNN, Smith-Waterman

Many other applications fall outside this model

Shortcomings & Next Steps

– Operations are not directly comparable on different platforms
– Integrating network and host interface bottlenecks and local memory characteristics into the roofline
– Fairness of power measurements
– Normalization
– Extension to other design entries

Summary

New high-abstraction design flows are emerging for FPGAs
– Raise abstraction for accelerators and simplify host integration

Benchmarking diverse hardware architectures poses numerous challenges
– Community effort needed

We propose a benchmarking methodology with cost functions to
– Quantify and measure work, performance, design productivity & complexity
– Across a diverse range of hardware platforms
– Characterization of applications across these platforms

SDAccel Resources

Resource               Link
Documentation          http://www.xilinx.com/support/documentation-navigation/development-tools/software-development/sdaccel.html
Examples               https://github.com/Xilinx/SDx
Product information    www.xilinx.com/sdaccel

Application Examples on Github: https://github.com/Xilinx/SDx

Category: Getting Started
– Hello: Hello application demonstrating the use of printf in a kernel
– Vector Add: Example of vector addition in OpenCL
– Vector Dot Product: Example of vector dot product operation
– Vector Mult/Add: Reuse of data stored in the DDR across kernels in different binary containers

Category: Security
– AES Decryption: Optimized implementation of an AES-128 ECB Encrypt in Software, followed by OpenCL Decryption
– RSA: Optimized implementation of the RSA decryption algorithm
– SHA1: Optimized implementation of SHA1 secure hashing algorithm
– Tiny encryption: Optimized implementation of the tiny encryption algorithm

Application Examples on Github: https://github.com/Xilinx/SDx

Category: Imaging
– Sobel Edge Detector: Implementation of a Sobel Edge Detector
– Histogram Equalizer: Optimized implementation of a 12-bit histogram equalizer
– Huffman Codec: Optimized implementation of Huffman encoding/decoding algorithm
– Median Filter: Optimized implementation of a median filter for image noise reduction
– Watermarking: Example of image watermarking

Category: Acceleration
– Bitcoin Miner: This is a patch and build infrastructure for building the bfgminer application
– Nearest Neighbor: Optimized implementation of nearest neighbor linear search algorithm
– Smith-Waterman: Genetic Sequencing and Matching

Thank you