Power Efficient Solutions w/ FPGAs
Bill Jenkins
Altera Sr. Product Specialist for Programming Language Solutions
System Challenges

[Figure: a CPU surrounded by I/O and memory bottlenecks]

CPU architecture is inefficient for most parallel computing applications (big data, search). I/O and memory bottlenecks are starving the CPU for data. The results: slow performance (high latency) and excessive power consumption.

Market reaction: growth of customized hardware and architectures…
Role of FPGA

Resource Sharing: virtualization of computation, storage, networking
Accelerators: network acceleration, hypervisor offload, data access acceleration, algorithm acceleration
Cluster Computing: CPU and FPGA

[Figure: host CPU and FPGA, each with DRAM, attached to the cluster fabric and cluster interconnect]

FPGAs can greatly enhance CPU-based data center processing by accelerating algorithms and minimizing bottlenecks.
FPGAs Increase Efficiency in the Data Center

Massively parallel architecture
– Has 10 to 100 times the number of computational units
– Enables pipelined designs that perform multiple / different instructions in a single clock cycle
Better localized memory avoids bottlenecks
Programmability enables application-specific accelerators

10X+ increase in performance per watt
>5M logic elements
1.5 TFLOPs floating-point DSP
Programmable I/O
3200 Mbps DDR4 SDRAM / 2.5 Tbps HMC
Mapping a simple program to an FPGA

High-level code:
  Mem[100] += 42 * Mem[101]

CPU instructions:
  R0 ← Load Mem[100]
  R1 ← Load Mem[101]
  R2 ← Load #42
  R2 ← Mul R1, R2
  R0 ← Add R2, R0
  Store R0 → Mem[100]
First let’s take a look at execution on a simple CPU

[Figure: a simple CPU data-path with instruction fetch, register file (Aaddr/Baddr/Caddr), ALU, and load/store unit]

Fixed and general architecture:
– General “cover-all-cases” data-paths
– Fixed data-widths
– Fixed operations
Load constant value into register

[Figure: the same CPU data-path; only the small portion that loads a constant into a register is active]

Very inefficient use of hardware!
CPU activity, step by step

Each instruction executes on the same hardware, one after another in time:
  R0 ← Load Mem[100]
  R1 ← Load Mem[101]
  R2 ← Load #42
  R2 ← Mul R1, R2
  R0 ← Add R2, R0
  Store R0 → Mem[100]
On the FPGA we unroll the CPU hardware…

One copy of the CPU data-path is laid out in space for each instruction:
  R0 ← Load Mem[100]
  R1 ← Load Mem[101]
  R2 ← Load #42
  R2 ← Mul R1, R2
  R0 ← Add R2, R0
  Store R0 → Mem[100]
… and specialize by position

Because each copy of the hardware only ever executes its one fixed instruction, the design can be specialized step by step:
1. Instructions are fixed. Remove “Fetch”.
2. Remove unused ALU ops.
3. Remove unused Load / Store.
4. Wire up registers properly! And propagate state.
5. Remove dead data.
6. Reschedule!
Custom Data-Path on the FPGA Matches Your Algorithm!

High-level code:
  Mem[100] += 42 * Mem[101]

Custom data-path: two loads feeding a multiply by the constant 42, an add, and a store.

Build exactly what you need:
– Operations
– Data widths
– Memory size & configuration

Efficiency: throughput / latency / power
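As a minimal sketch (not from the slides), the same statement could be expressed as an OpenCL kernel for the FPGA compiler to turn into this custom data-path; the kernel name and buffer argument are hypothetical:

  __kernel void scale_and_accumulate(__global int * restrict mem)
  {
      // The compiler builds exactly the data-path above:
      // two loads, a multiply by the constant 42, an add, and a store.
      mem[100] += 42 * mem[101];
  }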
Architectural Example: Image Processing

Convolutions: dataflow can proceed in pipelined fashion
– No need to wait until the entire execution is complete
– Start a new set of data calculations as soon as the first stage completes its execution

I_{\text{new}}(x, y) = \sum_{x'=-1}^{1} \sum_{y'=-1}^{1} I_{\text{old}}(x + x', y + y') \times F(x', y')

For example, with the uniform filter F ≡ 1/9, each output pixel is the average of its 3×3 neighborhood (a box blur).
Processor (CPU/GPU) Implementation

[Figure: processor reaching main memory through a cache]

A cache can hide poor memory access patterns:

for (int y = 1; y < height - 1; ++y) {
    for (int x = 1; x < width - 1; ++x) {
        for (int y2 = -1; y2 <= 1; ++y2) {          // 3x3 neighborhood
            for (int x2 = -1; x2 <= 1; ++x2) {
                i2[y][x] += i[y + y2][x + x2] * filter[y2 + 1][x2 + 1];
            }
        }
    }
}
FPGA Implementation

Example performance point: 1 pixel per cycle
Cache requirements: 9 reads + 1 write per cycle
Expensive hardware!
– Power overhead
– Cost overhead: more built-in addressing flexibility than we need

Why not customize the cache for the application?

[Figure: custom data-path fed by a cache with 9 read ports on memory]
Optimizing the “Cache”

Start out with the initial picture that is W pixels wide.
Remove all the lines that aren’t in the neighborhood of the window.
Take the remaining lines and arrange them as a 1D array of pixels.
Remove the pixels at the edges that we don’t need for the computation.

What happens when we move the window one pixel to the right? Every stored pixel simply shifts over by one position. We have created a shift register implementation with nine taps (data_out[9]).
Shift Registers in Software

pixel_t sr[2*W + 3];                 // line buffer: two image lines plus three pixels
while (keep_going) {
    // Shift data in, from the end so each element moves before being overwritten
    // (fully unrolled, so in hardware all moves happen in one cycle)
    #pragma unroll
    for (int i = 2*W + 2; i > 0; --i)
        sr[i] = sr[i - 1];
    sr[0] = data_in;

    // Tap output data: the nine pixels of the 3x3 window
    data_out = {sr[0],   sr[1],       sr[2],
                sr[W],   sr[W + 1],   sr[W + 2],
                sr[2*W], sr[2*W + 1], sr[2*W + 2]};
    // ...
}

[Figure: data_in feeds sr[0]; pixels shift through to sr[2*W+2], with taps at the window positions]
Managing data movement to match the FPGA’s architectural strengths is key to obtaining high performance.
Traditional OpenCL Implementation of a Pipeline (CPU/GPU)

High latency: requires access to global memory
High memory bandwidth
Requires host coordination to pass buffers from one kernel to another

With a particular design example we achieved 183 images/s on a Stratix V PCIe card.

[Figure: Kernel 1, Kernel 2, and Kernel 3 each read and write buffers in global memory (DDR)]
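For context, a hedged host-side sketch (not from the slides) of the coordination this requires: the host launches each kernel in turn and hands the intermediate buffers along. All variable names are hypothetical:

  // The host owns the pipeline: buf_a carries kernel 1's output to kernel 2,
  // and buf_b carries kernel 2's output to kernel 3.
  clSetKernelArg(k1, 0, sizeof(cl_mem), &buf_in);
  clSetKernelArg(k1, 1, sizeof(cl_mem), &buf_a);
  clEnqueueNDRangeKernel(queue, k1, 1, NULL, &gsize, NULL, 0, NULL, NULL);

  clSetKernelArg(k2, 0, sizeof(cl_mem), &buf_a);
  clSetKernelArg(k2, 1, sizeof(cl_mem), &buf_b);
  clEnqueueNDRangeKernel(queue, k2, 1, NULL, &gsize, NULL, 0, NULL, NULL);

  clSetKernelArg(k3, 0, sizeof(cl_mem), &buf_b);
  clSetKernelArg(k3, 1, sizeof(cl_mem), &buf_out);
  clEnqueueNDRangeKernel(queue, k3, 1, NULL, &gsize, NULL, 0, NULL, NULL);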
Leveraging Kernel-to-Kernel Channels

Low-latency communication between kernels
Significantly lower memory bandwidth requirements
Host is not involved in coordinating communication between kernels

This implementation on the same Stratix V PCIe card resulted in 400 images/s.

[Figure: Kernel 1, Kernel 2, and Kernel 3 connected by channels; only the ends of the pipeline touch buffers in global memory (DDR)]
• Channel declaration: create a queue
    channel int my_channel;
• Channel write: push data into the queue
    void write_channel_altera(channel &ch, value_type data);
    write_channel_altera(my_channel, x);
• Channel read: pop the first element from the queue
    value_type read_channel_altera(channel &ch);
    int y = read_channel_altera(my_channel);
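To make the flow concrete, here is a hedged sketch (not from the slides) of two kernels connected by a channel; the kernel names, channel name, and the doubling operation are hypothetical:

  #pragma OPENCL EXTENSION cl_altera_channels : enable

  channel float stage_channel;

  __kernel void producer(__global const float * restrict in, int n)
  {
      for (int i = 0; i < n; ++i)
          write_channel_altera(stage_channel, 2.0f * in[i]);   // push into the queue
  }

  __kernel void consumer(__global float * restrict out, int n)
  {
      for (int i = 0; i < n; ++i)
          out[i] = read_channel_altera(stage_channel);         // pop from the queue
  }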
FPGA Code

Kernels are written as standard building blocks that are connected together through channels.

The concept of having multiple concurrent kernels executing simultaneously and communicating directly on a device is currently unique to FPGAs
– Offered as a vendor extension
– Portable in OpenCL 2.0 through the concept of “OpenCL Pipes”

#pragma OPENCL EXTENSION cl_altera_channels : enable

// Declaration of Channel API data types
channel float prod_k1_channel;
channel float k1_k2_channel;
channel float k2_k3_channel;
channel float k3_res_channel;

__kernel void convolution_prod(int batch_id_begin,
                               int batch_id_end,
                               __global const volatile float * restrict input_global)
{
    for (...) {
        write_channel_altera(prod_k1_channel, input_global[...]);
        write_channel_altera(k1_k2_channel, input_global[...]);
        ...
    }
}
Migration Between FPGAs

In OpenCL, a float is implemented in soft logic on older FPGAs.
Gen 10 FPGAs have hardened floating-point logic built into the DSP blocks.
On Arria 10, the same code results in processing 6800 images/s.

Stratix 10 expectations:
– Large increase in floating-point resources
– Higher internal frequencies achievable
– 1.6x-2x performance increase
– 12x-16x performance/watt efficiency versus Stratix V
Additional Improvements: IO Channels

Kernel channels are between OpenCL kernels.
IO channels take data directly from and to IO interfaces on the FPGA
– A camera or video feed could be processed directly in the FPGA without going through the host
– The result could be passed out to the graphics card to be displayed, or back to host memory for the host to use

Private, local and global memory can be used to buffer as needed.

[Figure: Kernel 1, Kernel 2, and Kernel 3 linked by kernel channels/pipes inside the FPGA, with IO channels at both ends]
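As a hedged sketch (not from the slides): in the Altera OpenCL SDK an IO channel is bound to a board interface with an io attribute; the interface name (“eth0_in”) and the kernel below are assumptions for illustration, and the real name comes from the board support package:

  #pragma OPENCL EXTENSION cl_altera_channels : enable

  // Assumed board-specific interface name, defined by the board support package
  channel float video_in __attribute__((io("eth0_in")));

  __kernel void ingest(__global float * restrict out, int n)
  {
      for (int i = 0; i < n; ++i)
          out[i] = read_channel_altera(video_in);  // data arrives straight from the IO interface, bypassing the host
  }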
Lessons Learned

Exploiting pipelining on the FPGA requires some attention to coding style to overcome the inherent assumptions of writing “software”
– FPGAs do not have caches
– Need to exploit data reuse in a more explicit way

The concept of dataflow pipelining will not realize its full potential if we write intermediate results to memory
– Bandwidth limitations begin to dominate compute
– Use direct kernel-to-kernel communication, called channels

Native support for floating point on the FPGA allows an order-of-magnitude performance increase.
Code can be ported to newer FPGAs without modification to get a performance increase.
IO channels can lower latency and improve performance even further by taking the host out of the processing chain entirely.