Cathal McCabe, Xilinx University Program
DATE, March 2016
Programming and Benchmarking FPGAs with Software-Centric Design Entries
© Copyright 2015 Xilinx.
Agenda
FPGA/CPU scaling
Productivity Gap
Heterogeneous computing
SDx
Benchmarking
The legal speak
Moore’s Law
– Number of transistors doubles every 1/1.5/2 years
Moore’s Second Law (Rock’s law)
– Cost of fabs increases exponentially
Moore’s Law’s Law
– Number of people predicting the end of Moore’s law doubles every year
Dennard’s Law (Scaling)
– As transistors get smaller their power density remains constant
*“Cramming More Components onto Integrated Circuits,” Electronics, pp. 114–117, April 19, 1965.
Transistors, Clock Speed, Power
The world around us hasn’t stopped scaling
*Cisco Visual Networking Index: Global Mobile Data Traffic Forecast Update, 2015–2020 White Paper
FPGA evolution
– 2.5 micron XC2064: 85,000 transistors
– 20nm Virtex UltraScale: over 20,000,000,000 transistors
[Figure: Maximum Density of Xilinx FPGA by node]
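To put the two transistor counts into Moore's-law terms, the short sketch below counts how many doublings separate them and how many years that growth would take under each doubling period quoted earlier (1, 1.5 or 2 years). It uses only the figures from the slides; it does not assume any release dates, and the function name is illustrative.

```python
import math

# Transistor counts quoted on the slide.
XC2064_TRANSISTORS = 85_000
ULTRASCALE_TRANSISTORS = 20_000_000_000


def doublings(start: float, end: float) -> float:
    """Number of times `start` has to double to reach `end`."""
    return math.log2(end / start)


n = doublings(XC2064_TRANSISTORS, ULTRASCALE_TRANSISTORS)
print(f"growth factor: {ULTRASCALE_TRANSISTORS / XC2064_TRANSISTORS:,.0f}x")
print(f"doublings:     {n:.1f}")

# Years this growth would take for each doubling period named on the Moore's Law slide.
for period_years in (1.0, 1.5, 2.0):
    print(f"doubling every {period_years} years -> {n * period_years:.1f} years")
```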
FPGA Scaling
[Figure panels: Maximum Xilinx SerDes Rate per pin; Aggregate Xilinx SerDes Bandwidth; High End PC per pin DDR rate; Aggregate Xilinx FPGA Memory BW per node]
*FPL 2014: The FPGA, an engine for innovation in silicon and packaging technology; Liam Madden, FPL 2014
Trends: Increasingly heterogeneous and integrated
Heterogeneous compute is required to facilitate smarter, faster, greener computing
– New generation of heterogeneous compute platforms emerge
• HP Moonshot & The Machine, IBM’s OpenPower
Accelerator integration transitions from loosely coupled IO devices, to tightly coupled QPI or CAPI accelerators, to on-chip integration with processors & memory
– Example: Xilinx Zynq MPSoC & Intel/Altera hybrid
– NVDIMMs, Micron’s Automata processor, Mosys BWE
Productivity Gap: Application Perspective
(David Thomas, Imperial College, UK)
[Figure: relative performance (0.25 to 256, in steps of 4x) against design time (hour, day, week, month, year) for CPU, GPU and FPGA, with the phases initial design, parallelisation and clock rate]
FPGAs provide large speed-up and power savings – at a price!
Days or weeks to get an initial version working
Multiple optimisation and verification cycles to get high performance
Evolution of Design Environments
Legacy
• ISE, RTL-based design entry with IP library
Embedded CPU integration
• Microblaze, SDK, EDK
Raised abstraction for accelerators
• Vivado HLS
• SDNet (DSL PX), SDAccel (OpenCL)
• Vivado, Block stitching and manual integration of platform in RTL
Simplified host integration & automated infrastructure creation
• SDSoC, SDNet, SDAccel
• Predefined methods for data transfer & automated implementation
[Figure axes: abstraction against time]
SDx - SDSoC, SDAccel
Legacy MPSoC Design
HW-SW partition spec
PS
– Application: SDK, C/C++
– Driver: SDK, OS Tools, C
PL
– Datamover, PS-PL interface: IP Integrator, IPI project
– IP: Vivado, HLS, Verilog, VHDL
Met Req?
Need to modify multiple levels of design entry
SDSoC: HW Acceleration from C/C++
Software IDE
Migrate C/C++ functions to hardware / simple hardware-software partitioning
Fast estimation
Full system generation including OS, drivers, hardware connectivity, data movers
System-level debug and profiling
[Figure labels: C/C++ Applications; System-level Profiling; Specify Functions for Hardware; Rapid System Level Performance Estimation; Full System Generation]
After SDSoC: Automatic System Generation
C/C++ to System in hours, days
HW-SW partition spec
func1_sw();
func2_hw();
func3_hw();
[Figure: from C/C++, SDSoC generates the PS side (application, driver) and the PL side (IP, datamover, PS-PL interface); Met Req?]
Fast Estimate
Automatic HW/SW interfaces, Data Movers
OpenCL
Evolving standard since 2008
Defined by Khronos Group
– Broad Vendor Adoption: CPU, GPU, DSP, FPGA
– Application Developers: Adobe, Huawei, Baidu, Fujitsu
Initial GPU-favourable hardware abstractions, now moving towards a more neutral standard
Data parallel execution of all PEs within 1 CU
Task parallel execution only between compute units
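The split between data parallelism inside a compute unit (CU) and task parallelism between CUs can be pictured with a small sequential emulation of a one-dimensional NDRange. This is only a sketch in plain Python, not the OpenCL API: `run_ndrange` and `vector_add` are illustrative names, and the loops stand in for hardware that would run the work-items of a group concurrently on the CU's processing elements (PEs).

```python
def run_ndrange(kernel, global_size, local_size, *args):
    """Emulate a 1-D OpenCL NDRange sequentially.

    Each work-group would be scheduled onto one compute unit (CU); the
    work-items inside a group all run the same kernel on that CU's PEs.
    """
    assert global_size % local_size == 0, "global size must be a multiple of local size"
    for group_id in range(global_size // local_size):  # work-groups: spread over CUs
        for local_id in range(local_size):             # work-items: data parallel on PEs
            global_id = group_id * local_size + local_id
            kernel(global_id, *args)


def vector_add(gid, a, b, out):
    """Kernel body: each work-item handles exactly one element."""
    out[gid] = a[gid] + b[gid]


a = [1, 2, 3, 4, 5, 6, 7, 8]
b = [10, 20, 30, 40, 50, 60, 70, 80]
out = [0] * len(a)
run_ndrange(vector_add, len(a), 4, a, b, out)  # 2 work-groups of 4 work-items
print(out)
```

With a local size of 4, the eight work-items form two work-groups, so at most two CUs would be kept busy by this launch.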
SDAccel: Development Environment for C, C++ and OpenCL
Readily Available Boards
Libraries and their availability
– OpenCL built-ins: Included
– Video, DSP, Linear Algebra: Included
– OpenCV, BLAS: Provided by Auviz Systems
Now that we have platform compatibility*,
…
we can figure out which applications work on which platform!
*for a subset of applications that fit the OpenCL execution model, and don’t run into resource limitations, and don’t use vendor specific extensions
Challenges in Benchmarking with OpenCL
No performance portability
– Numerous implementations for each application yield very different results
• High discrepancy between out-of-the-box and optimized performance
• Optimizations for different platforms are very different
• Many benchmarks are biased
– Running GPU benchmarks on FPGAs
Figures of merit/cost functions for a diverse set of hardware architectures
– Instruction sets are fundamentally different
– Different hardware counters, different profiling tools for different platforms
Correlating theoretical numbers (hardware-neutral) with measured results
What is an acceptable development effort?
Over what do we normalize: cost, technology node?
Key Concepts
Characterization (broad)
– Profile & characterize a set of potential hardware platforms
• Leveraging and extending UC Berkeley roofline model (Cabezas et al., “Extending the roofline model: Bottleneck analysis with microarchitectural constraints,” IISWC 2014)
– Analyse and profile a broad set of applications
• Compute load and memory requirements
– Correlation yields performance insights
Benchmarking (deep)
– Out-of-the-box implementation
• Static code analysis
• Profiling on all platforms
– Optimize implementations for all platforms until
• Optimal performance is reached, or discrepancy to optimal performance is fully understood
• Static code analysis
• Profiling on all platforms
Abstractions
Hardware Platform executing some work
Host interface (multiple channels)
Memory interface (multiple channels)
Network interface (multiple channels)
Currently only considering:
Cache
Implementation of an Application
Figures of Merit
Implementation I of application A
– Work W [OPS], float and non-float
– Device memory traffic DMT [Bytes]
– Operational Intensity OI = W/DMT [OPS/Byte]
– Execution time T [ms] (memory, accelerator, memory)
– Performance P [OPS/s]
– Power consumption PWR [Watt]
– Energy consumption E [Joules]
– Ratio R of out-of-the-box : optimized
– Number of design revisions REV
Hardware Platform
– Theoretical and measured peak performance P [OPS/s], float and non-float*
– Theoretical and measured peak memory bandwidth BW [Bytes/s]
– Peak power consumption TDP [Watt]
*Max of LINPACK, SHOC L0, CLPeak
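The per-implementation figures can be collected in a small record, sketched below. Only OI = W/DMT is stated on the slide; the relations P = W/T and E = PWR*T (with PWR read as average power over the run) and the direction of the ratio R are assumptions made for this sketch, and all numbers are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Implementation:
    """Measured quantities for one implementation I of an application A."""
    work_ops: float     # W [OPS]
    dmt_bytes: float    # DMT [Bytes]
    time_ms: float      # T [ms]
    power_watt: float   # PWR [Watt], assumed to be the average over the run

    @property
    def operational_intensity(self) -> float:
        """OI = W / DMT [OPS/Byte] (as defined on the slide)."""
        return self.work_ops / self.dmt_bytes

    @property
    def performance(self) -> float:
        """P = W / T [OPS/s] (assumed relation)."""
        return self.work_ops / (self.time_ms / 1e3)

    @property
    def energy_joules(self) -> float:
        """E = PWR * T [Joules] (assumed relation)."""
        return self.power_watt * (self.time_ms / 1e3)


def ratio_out_of_the_box(baseline: Implementation, optimized: Implementation) -> float:
    """R, taken here as out-of-the-box performance divided by optimized performance."""
    return baseline.performance / optimized.performance


# Hypothetical measurements, for illustration only.
baseline = Implementation(work_ops=2e9, dmt_bytes=1e9, time_ms=500, power_watt=20)
optimized = Implementation(work_ops=2e9, dmt_bytes=0.5e9, time_ms=100, power_watt=25)

print(baseline.operational_intensity, optimized.operational_intensity)
print(baseline.energy_joules, optimized.energy_joules)
print(ratio_out_of_the_box(baseline, optimized))
```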
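Rooflines for Hardware Platforms
Peak performance as a function of operational intensity
– P(OI) = min{OI*BW; P_peak}, where P_peak is the platform’s peak performance and BW its peak memory bandwidth
[Figure: roofline plot; x-axis: operational intensity of an implementation, OPS:Byte (log); y-axis: achievable performance, GOPS/sec (log); the roof rises in the memory-bound region and is flat at maximum performance in the compute-bound region]
Hardware:
– P = 100 GOPS/s
– BW = 1 GB/s
Implementation:
– OI = 1 OPS/Byte
Estimated peak performance for I: min{1 OPS/Byte * 1 GB/s; 100 GOPS/s} = 1 GOPS/s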
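The roofline estimate is a one-line computation. The sketch below reproduces the slide's example (100 GOPS/s peak, 1 GB/s bandwidth, OI of 1 OPS/Byte) and sweeps a few other intensities; the function names and the `ridge_point` helper are illustrative additions, not part of the slides.

```python
def attainable_performance(oi: float, bw_gbs: float, peak_gops: float) -> float:
    """Roofline estimate in GOPS/s: min(OI * BW, peak).

    oi is in OPS/Byte, bw_gbs in GB/s, peak_gops in GOPS/s.
    """
    return min(oi * bw_gbs, peak_gops)


def ridge_point(bw_gbs: float, peak_gops: float) -> float:
    """Operational intensity at which a platform turns from memory bound to compute bound."""
    return peak_gops / bw_gbs


PEAK_GOPS = 100.0  # hardware peak from the slide example
BW_GBS = 1.0       # memory bandwidth from the slide example

print(attainable_performance(1.0, BW_GBS, PEAK_GOPS))  # the slide's implementation, OI = 1
print(ridge_point(BW_GBS, PEAK_GOPS))

for oi in (1.0, 10.0, 100.0, 1000.0):
    bound = "memory bound" if oi < ridge_point(BW_GBS, PEAK_GOPS) else "compute bound"
    print(f"OI={oi:7.1f} OPS/Byte -> {attainable_performance(oi, BW_GBS, PEAK_GOPS):6.1f} GOPS/s ({bound})")
```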
Applications in Rooflines
Allows performance estimates, tracking of optimizations
[Figure: roofline plot; x-axis: OPS:Byte (log); y-axis: achievable performance, GOPS/sec (log); shows the estimated peak performance and the measurements for an implementation of application A]
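One way to track optimizations against the roofline is to report, for every design revision, the measured performance as a fraction of the estimate for that revision's operational intensity. The sketch below does this for the 100 GOPS/s, 1 GB/s example platform; the revision names and measurements are hypothetical and only illustrate the bookkeeping.

```python
def roofline(oi: float, bw_gbs: float, peak_gops: float) -> float:
    """Estimated peak performance in GOPS/s for a given operational intensity."""
    return min(oi * bw_gbs, peak_gops)


BW_GBS = 1.0       # example platform: memory bandwidth
PEAK_GOPS = 100.0  # example platform: peak performance

# Hypothetical revisions: (name, OI [OPS/Byte], measured performance [GOPS/s]).
revisions = [
    ("out-of-the-box", 1.0, 0.2),
    ("burst memory access", 1.0, 0.8),
    ("on-chip data reuse", 8.0, 6.0),
]


def efficiency(oi: float, measured_gops: float) -> float:
    """Measured performance as a fraction of the roofline estimate."""
    return measured_gops / roofline(oi, BW_GBS, PEAK_GOPS)


for name, oi, measured in revisions:
    estimate = roofline(oi, BW_GBS, PEAK_GOPS)
    print(f"{name:20s} estimate {estimate:5.1f} GOPS/s, measured {measured:4.1f} GOPS/s, "
          f"{efficiency(oi, measured):.0%} of roofline")
```

A revision that raises OI moves the implementation to the right in the plot, so its estimate grows as well; the fraction shows how much of that new headroom was actually realized.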
Considered Platforms
Considered Hardware Platforms
Xeon processors (E5-2660, Ivy Bridge)
– Few beefy cores at high rates & out-of-order execution (10 cores @ 2.2GHz)
– Large memory @ medium bandwidth (64GB @ 60GB/s)
– Fully coherent memory subsystem (automatically managed)
– 256b vector units per core
Xeon Phi (5110P)
– More beefy cores with in-order processing (60 cores @ 1.053GHz)
– Fast external memory with lower density (8GB @ 320GB/s)
– Fully coherent memory system
– 512b vector processing units per core
Nvidia K20x
– Huge number of light-weight threads, SIMT (13*192 @ 732MHz)
– Fast external memory with lower density (6GB @ 250GB/s)
– Specialized hardware DP units
FPGA ADM_PCIe_7V3
– Massively parallel and fine-grained architecture
– Medium memory at slow speed (16GB @ 21.3GB/s)
– High power efficiency (25W)
– Massive network connectivity
Roofline for Non-Float. Theoretical Peak Performance
[Figure: rooflines for FPGA, Phi, GPU and CPU]
Roofline for Non-Float. Theoretical Peak Performance/Peak Power
[Figure: rooflines in GOPS/Watt for FPGA, Phi, GPU and CPU]
The Applications
Characterization of Application (out-of-the-box), Initial Estimates
Application | Type | Operational Intensity (min) | Operational Intensity (max)
Video Scaler | Fixed point | 7*Vtaps*Htaps*sh*sv/(4*(sv*sh+1)) = 1 for (1,1,1,1) | 7*Vtaps*Htaps*sh*sv/(4*(sv*sh+1)) = 446 for (16,16,16,16)
BobJenkins Hash | Non-float | 3.1 | 4.5
Memcached | Non-float | 3.65 | 300
….
Example 1: Polyphase Video Scaler
Generic Image Filtering Equation
Theoretical Operational Intensity (TOI):
– OI = 7*Vtaps*Htaps*sh*sv / (4*(sv*sh+1))
sv, sh, Vtaps, Htaps | TOI
1, 1, 1, 1 | 1
4, 4, 4, 4 | 26
16, 16, 16, 16 | 446
Beyond that, this doesn’t matter …
Results
– Complexity independent of image size
– Complexity increases with larger window and scaling factor
– Well suited for FPGA acceleration
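The three tabulated values can be checked directly from the formula. The sketch below reads the expression as 7*Vtaps*Htaps*sh*sv divided by 4*(sv*sh+1); this grouping is an assumption, chosen because it reproduces the slide's values of 1, 26 and 446 after rounding. The function name is illustrative.

```python
def video_scaler_oi(sv: int, sh: int, vtaps: int, htaps: int) -> float:
    """Theoretical operational intensity of the polyphase video scaler [OPS/Byte].

    Assumes the slide formula groups as 7*Vtaps*Htaps*sh*sv / (4*(sv*sh + 1)).
    """
    return 7 * vtaps * htaps * sh * sv / (4 * (sv * sh + 1))


for params in ((1, 1, 1, 1), (4, 4, 4, 4), (16, 16, 16, 16)):
    oi = video_scaler_oi(*params)
    print(f"sv, sh, Vtaps, Htaps = {params}: OI = {oi:.2f} (rounded: {round(oi)})")
```

Image dimensions do not appear in the expression, which matches the observation that complexity is independent of image size.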
Example 2: Bob Jenkins Hash Algorithm
Theoretical Operational Intensity:
– OI = 4.5 (for maximum key size)
Results
– Memory bound => poor performance on FPGA
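Why a low operational intensity penalizes the FPGA board can be seen from the memory-side ceiling OI*BW alone. The sketch below applies OI = 4.5 to the memory bandwidths listed for the considered platforms. It deliberately ignores each platform's compute peak, which this transcript does not state, so the values are upper bounds from the memory side only, not predicted results.

```python
OI_BOB_JENKINS = 4.5  # OPS/Byte, maximum key size (from the slide)

# Peak memory bandwidth in GB/s, as listed for the considered hardware platforms.
MEMORY_BW_GBS = {
    "Xeon E5-2660": 60.0,
    "Xeon Phi 5110P": 320.0,
    "Nvidia K20x": 250.0,
    "FPGA ADM_PCIe_7V3": 21.3,
}


def memory_bound_ceiling(oi: float, bw_gbs: float) -> float:
    """Memory-side roofline ceiling in GOPS/s (compute peak not considered)."""
    return oi * bw_gbs


ceilings = {
    name: memory_bound_ceiling(OI_BOB_JENKINS, bw)
    for name, bw in MEMORY_BW_GBS.items()
}

# Print platforms from the lowest to the highest memory-side ceiling.
for name, gops in sorted(ceilings.items(), key=lambda item: item[1]):
    print(f"{name:20s} {gops:8.2f} GOPS/s")
```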
Example 3: Key-Value Stores
HotStorage 2015, Scaling out to a Single-Node 80Gbps Memcached Server with 40Terabytes of Memory
Streaming application
Network interface, SSD interface
Theoretical Operational Intensity (per packet, GET operation):
– OI = [3.65, 300]
Results
– 80Gbps performance
– 40TBytes
Observations
Preliminary results within the constraints of the OpenCL 1.2 execution model
– FPGAs excel at highly compute-intensive applications where non-float operations dominate performance
– Applications that cannot reach peak performance on other platforms
• For example, applications with a data-flow compute model, high branch conversion, asymmetric data types
– Video scaler, CNN, Smith-Waterman
Many other applications outside this model
Shortcomings & Next Steps
Operations are not directly comparable on different platforms
Integrating network and host interface bottlenecks and local memory characteristics into the roofline
Fairness of power measurements
Normalization
Extension to other design entries
Summary
New high-abstraction design flows are emerging for FPGAs
– Raise abstraction for accelerators and simplify host integration
Benchmarking diverse hardware architectures poses numerous challenges
– Community effort needed
We propose a benchmarking methodology with cost functions to
– Quantify and measure work, performance, design productivity & complexity
– Across a diverse range of hardware platforms
– Characterization of applications across these platforms
SDAccel Resources
– Documentation: http://www.xilinx.com/support/documentation-navigation/development-tools/software-development/sdaccel.html
– Examples: https://github.com/Xilinx/SDx
– Product information: www.xilinx.com/sdaccel
Application Examples on Github: https://github.com/Xilinx/SDx
Category | Name | Description
Getting Started | Hello | Hello application demonstrating the use of printf in a kernel
Getting Started | Vector Add | Example of vector addition in OpenCL
Getting Started | Vector Dot Product | Example of vector dot product operation
Getting Started | Vector Mult/Add | Reuse of data stored in the DDR across kernels in different binary containers
Security | AES Decryption | Optimized implementation of an AES-128 ECB Encrypt in Software, followed by OpenCL Decryption
Security | RSA | Optimized implementation of the RSA decryption algorithm
Security | SHA1 | Optimized implementation of SHA1 secure hashing algorithm
Security | Tiny encryption | Optimized implementation of the tiny encryption algorithm
Application Examples on Github: https://github.com/Xilinx/SDx
Category | Name | Description
Imaging | Sobel Edge Detector | Implementation of a Sobel Edge Detector
Imaging | Histogram Equalizer | Optimized implementation of a 12-bit histogram equalizer
Imaging | Huffman Codec | Optimized implementation of Huffman encoding/decoding algorithm
Imaging | Median Filter | Optimized implementation of a median filter for image noise reduction
Imaging | Watermarking | Example of image watermarking
Acceleration | Bitcoin Miner | This is a patch and build infrastructure for building the bfgminer application
Acceleration | Nearest Neighbor | Optimized implementation of nearest neighbor linear search algorithm
Acceleration | Smith-Waterman | Genetic Sequencing and Matching
Thank you