Programming and Benchmarking FPGAs with Software-Centric...

Cathal McCabe, Xilinx University Program

DATE, March 2016

Programming and Benchmarking FPGAs with Software-Centric Design Entries

© Copyright 2015 Xilinx.

Agenda

FPGA/CPU scaling

Productivity Gap

Heterogeneous computing

SDx

Benchmarking


The legal speak

Moore’s Law

– Number of transistors doubles every 1/1.5/2 years

Moore’s Second Law (Rock’s law)

– Cost of fabs increases exponentially

Moore’s Law’s Law

– Number of people predicting the end of Moore’s law doubles every year

Dennard’s Law (Scaling)

– As transistors get smaller their power density remains constant

*“Cramming More Components onto Integrated Circuits,” Electronics, pp. 114–117, April 19, 1965.


Transistors, Clock Speed, Power


The world around us hasn’t stopped scaling

*Cisco Visual Networking Index: Global Mobile Data Traffic Forecast Update, 2015–2020 White Paper


FPGA evolution

– 2.5 micron XC2064: 85,000 transistors

– 20nm Virtex UltraScale: over 20,000,000,000 transistors

[Figure: Maximum Density of Xilinx FPGA by node]
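
As a rough illustration of what these two transistor counts imply, the sketch below counts the doublings between them and converts that into elapsed time for each of the doubling periods quoted for Moore's Law. It uses only the figures from the slides; the function name `doublings` is hypothetical, and since the larger count is quoted as "over" 20 billion, the result is a lower bound.

```python
import math

# Transistor counts quoted on the FPGA evolution slide.
XC2064_TRANSISTORS = 85_000
VIRTEX_ULTRASCALE_TRANSISTORS = 20_000_000_000  # slide says "over" this value

def doublings(start: float, end: float) -> float:
    """Number of doublings needed to grow from start to end."""
    return math.log2(end / start)

n = doublings(XC2064_TRANSISTORS, VIRTEX_ULTRASCALE_TRANSISTORS)
print(f"growth factor: {VIRTEX_ULTRASCALE_TRANSISTORS / XC2064_TRANSISTORS:,.0f}x")
print(f"doublings: {n:.1f}")

# Implied elapsed time for each doubling period quoted for Moore's Law.
for period_years in (1.0, 1.5, 2.0):
    print(f"{period_years} years per doubling -> {n * period_years:.1f} years")
```

The growth factor is roughly 235,000x, which corresponds to a little under 18 doublings.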


FPGA Scaling

[Figure labels: Maximum Xilinx SerDes Rate per pin; Aggregate Xilinx SerDes Bandwidth; High End PC per pin DDR rate; Aggregate Xilinx FPGA Memory BW per node]

*Liam Madden, “The FPGA, an engine for innovation in silicon and packaging technology,” FPL 2014


Trends: Increasingly heterogeneous and integrated

Heterogeneous compute is required to facilitate smarter, faster, greener computing

– A new generation of heterogeneous compute platforms is emerging

• HP Moonshot & The Machine, IBM’s OpenPower

Accelerator integration transitions from loosely coupled IO devices, to tightly coupled QPI or CAPI accelerators, to on-chip integration with processors & memory

– Example: Xilinx Zynq MPSoC & Intel/Altera hybrid

– NVDIMMs, Micron’s Automata processor, Mosys BWE


Productivity Gap: Application Perspective (David Thomas, Imperial College, UK)

[Figure: relative performance (log scale, 0.25 to 256) against design time (hour, day, week, month, year) for CPU, GPU and FPGA; labels: Initial Design, Parallelisation, Clock Rate]

FPGAs provide large speed-up and power savings – at a price!

– Days or weeks to get an initial version working

– Multiple optimisation and verification cycles to get high performance


Evolution of Design Environments

Legacy

• ISE, RTL-based design entry with IP library

Embedded CPU integration

• Microblaze, SDK, EDK

Raised abstraction for accelerators

• Vivado HLS

• SDNet (DSL PX), SDAccel (OpenCL)

• Vivado, Block stitching and manual integration of platform in RTL

Simplified host integration & automated infrastructure creation

• SDSoC, SDNet, SDAccel

• Predefined methods for data transfer & automated implementation

[Figure axes: Time, Abstraction]

SDx - SDSoC, SDAccel


Legacy MPSoC Design

HW-SW partition spec

PS

– Application: SDK (C/C++)

– Driver: SDK, OS Tools (C)

PL

– Datamover, PS-PL interface: IP Integrator (IPI project)

– IP: Vivado, HLS (Verilog, VHDL)

Met Req?

Need to modify multiple levels of design entry


SDSoC: HW Acceleration from C/C++

Software IDE

Migrate C/C++ functions to hardware / simple hardware-software partitioning

Fast Estimation

Full system generation including OS, drivers, hardware connectivity, data movers

System-level debug and profiling

[Figure labels: C/C++ Applications; System-level Profiling; Specify Functions for Hardware; Rapid System Level Performance Estimation; Full System Generation]


After SDSoC: Automatic System Generation

C/C++ to system in hours or days

HW-SW partition spec

func1_sw();

func2_hw();

func3_hw();

Fast Estimate

Automatic HW/SW interfaces, Data Movers

[Figure labels: C/C++; SDSoC; PS: Application, Driver; PL: IP, Datamover, PS-PL interface; Met Req?]


OpenCL

Evolving standard since 2008

Defined by Khronos Group

– Broad Vendor Adoption: CPU, GPU, DSP, FPGA

– Application Developers: Adobe, Huawei, Baidu, Fujitsu, …

Initial GPU-favourable hardware abstractions, now moving towards a more neutral standard

Data parallel execution of all processing elements (PEs) within one compute unit (CU)

Task parallel execution only between compute units
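
To make the PE/CU split concrete, here is a minimal plain-Python sketch (not OpenCL API code) of how a one-dimensional index space breaks down into work-groups and work-items. It assumes a zero global offset and a global size that is a multiple of the local size, as OpenCL 1.2 requires. The round-robin placement of work-groups on compute units is a purely illustrative assumption, since real runtimes schedule work-groups themselves. Both function names are hypothetical.

```python
from typing import List, Tuple

def decompose(global_size: int, local_size: int) -> List[Tuple[int, int, int]]:
    """Return (global_id, group_id, local_id) for a 1-D index space.

    Mirrors the relation global_id = group_id * local_size + local_id
    that holds for a zero global offset.
    """
    if global_size % local_size != 0:
        raise ValueError("global_size must be a multiple of local_size")
    return [(g, g // local_size, g % local_size) for g in range(global_size)]

def assign_groups_to_cus(num_groups: int, num_cus: int) -> List[List[int]]:
    """Hypothetical round-robin placement of work-groups on compute units."""
    cus: List[List[int]] = [[] for _ in range(num_cus)]
    for group in range(num_groups):
        cus[group % num_cus].append(group)
    return cus

# Eight work-items in groups of four: two work-groups.
items = decompose(global_size=8, local_size=4)
placement = assign_groups_to_cus(num_groups=2, num_cus=2)

for global_id, group_id, local_id in items:
    print(f"work-item {global_id}: work-group {group_id}, local id {local_id}")
print("work-groups per compute unit:", placement)
```

Work-items that share a group id run data-parallel on the PEs of one CU. Different work-groups are the unit of task parallelism across CUs.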


SDAccel: Development Environment for C, C++ and OpenCL

Readily Available Boards

http://www.conveycomputer.com/

Libraries | Availability

OpenCL built-ins | Included

Video, DSP, Linear Algebra | Included

OpenCV, BLAS | Provided by Auviz Systems


Now that we have platform compatibility*, … we can figure out which applications work on which platform!

*for a subset of applications that fit the OpenCL execution model, don’t run into resource limitations, and don’t use vendor-specific extensions


Challenges in Benchmarking with OpenCL

No performance portability

– Numerous implementations for each application yield very different results

• High discrepancy between out-of-the-box and optimized performance

• Optimizations for different platforms are very different

• Many benchmarks are biased

– Running GPU benchmarks on FPGAs

Figures of merit/cost functions for a diverse set of hardware architectures

– Instruction sets are fundamentally different

– Different hardware counters, different profiling tools for different platforms

Correlating theoretical numbers (hardware-neutral) with measured results

What is an acceptable development effort?

Over what do we normalize: cost, technology node?


Characterization (broad)

– Profile & characterize a set of potential hardware platforms

• Leveraging and extending UC Berkeley roofline model (Cabezas et al., “Extending the roofline model: Bottleneck analysis with microarchitectural constraints,” IISWC 2014)

– Analyse and profile a broad set of applications

• Compute load and memory requirements

– Correlation yields performance insights

Benchmarking (deep)

– Out-of-the-box implementation

• Static code analysis

• Profiling on all platforms

– Optimize implementations for all platforms until

• Optimal performance is reached, or discrepancy to optimal performance is fully understood

• Static code analysis

• Profiling on all platforms

Key Concepts


Abstractions

Hardware Platform executing some work

Host interface (multiple channels)

Memory interface (multiple channels)

Network interface (multiple channels)

Currently only considering:

Cache

Implementation of an Application


Figures of Merit

Implementation I of application A

– Work W [OPS], float and non-float

– Device memory traffic DMT [Bytes]

– Operational Intensity OI = W/DMT

– Execution time T [ms] (memory, accelerator, memory)

– Performance P [OPS/s]

– Power consumption PWR [Watt]

– Energy consumption E [Joules]

– Ratio R of out-of-the-box : optimized

– Number of design revisions REV

Hardware Platform

– Theoretical and measured peak performance P [OPS/s], float and non-float*

– Theoretical and measured peak memory bandwidth BW [Bytes/s]

– Peak power consumption TDP [Watt]

*Max of LINPACK, SHOC L0, CLPeak
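
The per-implementation metrics can be derived from four measured quantities. The sketch below shows one way to do that in Python. The class and function names and the sample numbers are hypothetical. It assumes PWR is the average power over the execution time T, so that E = PWR * T, and it reads the ratio R as out-of-the-box performance divided by optimized performance, which the slides do not spell out.

```python
from dataclasses import dataclass

@dataclass
class Measurement:
    """One implementation I of an application A on one platform."""
    work_ops: float          # W [OPS]
    device_mem_bytes: float  # DMT [Bytes]
    time_ms: float           # T [ms]
    power_watt: float        # PWR [Watt], assumed to be the average over T

    @property
    def operational_intensity(self) -> float:
        """OI = W / DMT, in OPS per byte."""
        return self.work_ops / self.device_mem_bytes

    @property
    def performance_ops_per_s(self) -> float:
        """P = W / T, with T converted from ms to s."""
        return self.work_ops / (self.time_ms / 1e3)

    @property
    def energy_joules(self) -> float:
        """E = PWR * T (assumption: PWR is the average power over T)."""
        return self.power_watt * (self.time_ms / 1e3)

def ratio_out_of_box_to_optimized(baseline: Measurement, tuned: Measurement) -> float:
    """One possible reading of R: out-of-the-box performance over optimized performance."""
    return baseline.performance_ops_per_s / tuned.performance_ops_per_s

# Hypothetical numbers, for illustration only.
baseline = Measurement(work_ops=4e9, device_mem_bytes=1e9, time_ms=2000, power_watt=25)
tuned = Measurement(work_ops=4e9, device_mem_bytes=1e9, time_ms=500, power_watt=25)

print("OI:", baseline.operational_intensity)
print("P out-of-the-box:", baseline.performance_ops_per_s)
print("E out-of-the-box:", baseline.energy_joules)
print("R:", ratio_out_of_box_to_optimized(baseline, tuned))
```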


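Rooflines for Hardware Platforms

Peak performance as a function of operational intensity

– P(OI) = min{ OI*BW, P_peak }, where P_peak is the platform’s peak performance and BW its peak memory bandwidth

[Figure: roofline plot; x-axis: operational intensity of an implementation, OPS:Byte (log); y-axis: achievable performance, GOPS/sec (log); regions: memory bound, compute bound; ceiling: maximum performance]

Example

– Hardware: P = 100 GOPS/s, BW = 1 GB/s

– Implementation: OI = 1 OPS/Byte

– Estimated peak performance for I: 1 GOPS/s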
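
The roofline bound is simple enough to check numerically. The following sketch evaluates it for the example platform (100 GOPS/s peak, 1 GB/s memory bandwidth) and an implementation with an operational intensity of 1 OPS/Byte. The function names are hypothetical. The ridge point, the operational intensity at which the memory-bound and compute-bound regions meet, is a standard roofline quantity that the slide does not state explicitly.

```python
def attainable_performance(oi_ops_per_byte: float,
                           bw_bytes_per_s: float,
                           peak_ops_per_s: float) -> float:
    """Roofline bound: the lower of the memory ceiling and the compute ceiling."""
    return min(oi_ops_per_byte * bw_bytes_per_s, peak_ops_per_s)

def ridge_point(peak_ops_per_s: float, bw_bytes_per_s: float) -> float:
    """Operational intensity above which the platform is compute bound."""
    return peak_ops_per_s / bw_bytes_per_s

PEAK = 100e9  # 100 GOPS/s
BW = 1e9      # 1 GB/s

estimate = attainable_performance(1.0, BW, PEAK)
ridge = ridge_point(PEAK, BW)

print(f"estimated peak for OI = 1: {estimate / 1e9:.0f} GOPS/s")
print(f"ridge point: {ridge:.0f} OPS/Byte")
for oi in (1.0, 10.0, 100.0, 1000.0):
    bound = "memory bound" if oi < ridge else "compute bound"
    print(f"OI = {oi:>6}: {attainable_performance(oi, BW, PEAK) / 1e9:>5.0f} GOPS/s ({bound})")
```

With these numbers an implementation needs at least 100 OPS/Byte before the compute ceiling, rather than memory bandwidth, limits performance.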


Applications in Rooflines

Allows performance estimates, tracking of optimizations

[Figure: roofline plot; x-axis: OPS:Byte (log); y-axis: achievable performance, GOPS/sec (log); marked: estimated peak performance, implementation of Application A, measurements]

Considered Platforms


Considered Hardware Platforms

Xeon processors (E5-2660, Ivy Bridge)

– Few beefy cores at high rates & out-of-order execution (10 cores @ 2.2GHz)

– Large memory @ medium bandwidth (64GB @ 60GBps)

– Fully coherent memory subsystem (automatically managed)

– 256b vector units per core

Xeon Phi (5110P)

– More cores, with in-order processing (60 cores @ 1.053GHz)

– Fast external memory with lower density (8GB @ 320GBps)

– Fully coherent memory system

– 512b vector processing units per core

Nvidia K20x

– Huge number of light-weight threads, SIMT (13*192 @ 732MHz)

– Fast external memory with lower density (6GB @ 250GBps)

– Specialized hardware DP units

FPGA ADM_PCIe_7V3

– Massively parallel and fine-grained architecture

– Medium memory at slow speed (16GB @ 21.3GBps)

– High power efficiency (25W)

– Massive network connectivity


Roofline for Non-Float. Theoretical Peak Performance

[Figure: roofline plot; series: FPGA, Phi, GPU, CPU]


Roofline for Non-Float. Theoretical Peak Performance/Peak Power

[Figure: roofline plot; y-axis: GOPS/Watt; series: FPGA, Phi, GPU, CPU]

The Applications


Characterization of Application (out-of-the-box): Initial Estimates

Application | Type | Operational Intensity (min) | Operational Intensity (max)

Video Scaler | Fixed point | 1 at (sv,sh,Vtaps,Htaps) = (1,1,1,1) | 446 at (16,16,16,16)

(Video Scaler: OI = 7*Vtaps*Htaps*sh*sv/(4*(sv*sh+1)))

BobJenkins Hash | Non-float | 3.1 | 4.5

Memcached | Non-float | 3.65 | 300

….


Example 1: Polyphase Video Scaler

Generic Image Filtering Equation

Theoretical Operational Intensity (TOI):

– OI = 7*Vtaps*Htaps*sh*sv/(4*(sv*sh+1))

sv, sh, vtaps, htaps | TOI

1,1,1,1 | 1

4,4,4,4 | 26

16,16,16,16 | 446

Beyond that, this doesn’t matter …

Results

– Complexity independent of image size

– Complexity increases with larger window and scaling factor

– Well suited for FPGA acceleration
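
The tabulated intensities can be reproduced from the formula, provided the denominator is read as 4*(sv*sh+1). That reading is an assumption, but it is the one that matches all three table rows after rounding. A minimal Python check, with a hypothetical function name:

```python
def scaler_oi(sv: int, sh: int, vtaps: int, htaps: int) -> float:
    """Theoretical operational intensity of the polyphase video scaler.

    Assumes the denominator is 4 * (sv * sh + 1).
    """
    return 7 * vtaps * htaps * sh * sv / (4 * (sv * sh + 1))

configs = [(1, 1, 1, 1), (4, 4, 4, 4), (16, 16, 16, 16)]
rounded = [round(scaler_oi(*cfg)) for cfg in configs]

for cfg, value in zip(configs, rounded):
    print(cfg, "->", value)
```

The unrounded values are 0.875, about 26.35 and about 446.26, so the intensity grows steeply with window size and scaling factor.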



Example 2: Bob Jenkins Hash Algorithm

– OI = 4.5 (for maximum key size)

Results

– Memory bound => poor performance on FPGA
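
To see why a memory-bound kernel penalises the FPGA card, the sketch below multiplies the hash's maximum operational intensity (4.5 OPS/Byte) by the memory bandwidth quoted for each platform in this deck. This gives only the memory ceiling of the roofline. Compute peaks are not listed in the text, so they are ignored here, and measured bandwidths would be lower than these theoretical figures. The dictionary keys and function name are hypothetical.

```python
# Memory bandwidths in GB/s, as quoted for the considered hardware platforms.
MEMORY_BW_GBPS = {
    "CPU (Xeon E5-2660)": 60.0,
    "Xeon Phi (5110P)": 320.0,
    "GPU (Nvidia K20x)": 250.0,
    "FPGA (ADM_PCIe_7V3)": 21.3,
}

HASH_OI = 4.5  # OPS/Byte, for maximum key size

def memory_ceiling_gops(oi_ops_per_byte: float, bw_gbps: float) -> float:
    """Memory-bound performance ceiling in GOPS/s (OPS/Byte times GB/s)."""
    return oi_ops_per_byte * bw_gbps

ceilings = {name: memory_ceiling_gops(HASH_OI, bw) for name, bw in MEMORY_BW_GBPS.items()}

for name, gops in sorted(ceilings.items(), key=lambda item: item[1]):
    print(f"{name:<22} {gops:8.2f} GOPS/s")
```

Under these assumptions the FPGA card's ceiling is the lowest of the four because its external memory bandwidth is the smallest.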


Example 3: Key-Value Stores

HotStorage 2015, “Scaling out to a Single-Node 80Gbps Memcached Server with 40Terabytes of Memory”

Streaming application

Network interface, SSD interface

(per packet, GET operation)

– OI = [3.65, 300]

Results

– 80Gbps performance

– 40TBytes


Observations

Preliminary results within the constraints of the OpenCL 1.2 execution model

– FPGAs excel at highly compute-intensive applications with non-float dominating performance

– Applications that cannot reach peak performance on other platforms

• For example, applications with a data-flow compute model, high branch conversion, asymmetric data types

– Video scaler, CNN, Smith-Waterman

Many other applications outside this model


Shortcomings & Next Steps

Operations are not directly comparable on different platforms

Integrating network and host interface bottlenecks and local memory characteristics into the roofline

Fairness of power measurements

Normalization

Extension to other design entries


Summary

New high-abstraction design flows are emerging for FPGAs

– Raise abstraction for accelerators and simplify host integration

Benchmarking diverse hardware architectures poses numerous challenges

– Community effort needed

We propose a benchmarking methodology with cost functions to

– Quantify and measure work, performance, design productivity & complexity

– Across a diverse range of hardware platforms

– Characterization of applications across these platforms


SDAccel Resources | Link

Documentation | http://www.xilinx.com/support/documentation-navigation/development-tools/software-development/sdaccel.html

Examples | https://github.com/Xilinx/SDx

Product information | www.xilinx.com/sdaccel


Application Examples on Github: https://github.com/Xilinx/SDx

Category | Name | Description

Getting Started | Hello | Hello application demonstrating the use of printf in a kernel

Getting Started | Vector Add | Example of vector addition in OpenCL

Getting Started | Vector Dot Product | Example of vector dot product operation

Getting Started | Vector Mult/Add | Reuse of data stored in the DDR across kernels in different binary containers

Security | AES Decryption | Optimized implementation of an AES-128 ECB Encrypt in Software, followed by OpenCL Decryption

Security | RSA | Optimized implementation of the RSA decryption algorithm

Security | SHA1 | Optimized implementation of SHA1 secure hashing algorithm

Security | Tiny encryption | Optimized implementation of the tiny encryption algorithm


Application Examples on Github: https://github.com/Xilinx/SDx

Category | Name | Description

Imaging | Sobel Edge Detector | Implementation of a Sobel Edge Detector

Imaging | Histogram Equalizer | Optimized implementation of a 12-bit histogram equalizer

Imaging | Huffman Codec | Optimized implementation of Huffman encoding/decoding algorithm

Imaging | Median Filter | Optimized implementation of a median filter for image noise reduction

Imaging | Watermarking | Example of image watermarking

Acceleration | Bitcoin Miner | Patch and build infrastructure for building the bfgminer application

Acceleration | Nearest Neighbor | Optimized implementation of nearest neighbor linear search algorithm

Acceleration | Smith-Waterman | Genetic Sequencing and Matching


Thank you
