DRAF: A Low-Power DRAM-based Reconfigurable Acceleration Fabric
Mingyu Gao, Christina Delimitrou, Dimin Niu, Krishna Malladi, Hongzhong Zheng, Bob Brennan, Christos Kozyrakis
ISCA – June 22, 2016
FPGA-Based Accelerators
• Improve performance and energy efficiency
• Good balance between flexibility (CPUs) and efficiency (ASICs)
• Recently used for many datacenter apps
  o Image/video processing, web search, neural networks, …
Pictures: Putnam et al., "A Reconfigurable Fabric for Accelerating Large-Scale Datacenter Services," ISCA'14
Motivation
• Deploy FPGAs in cost- and power-constrained systems
• Datacenter systems
  o High-density FPGAs for large accelerators for multiple apps
  o Low-power FPGAs to simplify integration in servers and racks
• Mobile systems
  o High-density FPGAs for accelerators for multiple apps
  o Low-power FPGAs for low cost and long battery life
DRAF in a Nutshell
• A high-density & low-power FPGA
  o Bit-level reconfigurable, just like conventional FPGAs
• Uses dense DRAM technology for lookup tables
  o Replacing the SRAM technology in conventional FPGAs
• DRAF vs. FPGA
  o 10-100x logic density
  o 1/3 power consumption
  o Multi-context support with fast context switch
Challenges of Building DRAM-based FPGAs
DRAM Array Structure
[Figure: a DRAM subarray made of MATs, with a row decoder, master wordline, local wordlines, bitlines, and sense-amps.]
A DRAM subarray is naturally a lookup table
[Figure: the same subarray annotated with Input and Output labels.]
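To make the "subarray as lookup table" idea concrete, the following is a minimal functional sketch in Python. The class name `SubarrayLUT` and its methods are hypothetical and do not come from the talk; the model ignores timing, refresh, and destructive reads, and only shows that addressing a row with the input bits and reading the stored row behaves like a LUT.

```python
class SubarrayLUT:
    """Toy functional model: a memory array used as a lookup table.

    Hypothetical illustration only; it ignores timing, refresh, and
    the destructive nature of real DRAM reads.
    """

    def __init__(self, num_inputs, num_outputs):
        self.num_inputs = num_inputs
        self.num_outputs = num_outputs
        # One row per input combination, each row holding the output bits.
        self.rows = [0] * (1 << num_inputs)

    def program(self, func):
        """Fill every row with func(address), masked to the output width."""
        mask = (1 << self.num_outputs) - 1
        for addr in range(len(self.rows)):
            self.rows[addr] = func(addr) & mask

    def read(self, input_bits):
        """The input bits act as the row address; the stored row is the output."""
        return self.rows[input_bits]


def full_adder(x):
    """3-input function: returns a 2-bit value (bit 1 = carry, bit 0 = sum)."""
    a, b, cin = (x >> 2) & 1, (x >> 1) & 1, x & 1
    return a + b + cin


# Program a 3-input, 2-output LUT as a full adder.
lut = SubarrayLUT(num_inputs=3, num_outputs=2)
lut.program(full_adder)
```

Reading `lut.read(0b111)` returns the carry and sum bits for three set inputs. A real subarray with about 1k rows and an 8k-bit row is far wider than this, which is the size mismatch the next slide raises.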
Challenges
[Figure: a DRAM subarray with a ~10-bit input selecting among ~1k rows, a ~8k-bit output, and a 10-30 ns access delay.]
• Mismatched LUT size → an 8192-bit LUT?
• Slow speed → a LUT with 10 ns delay?
• Destructive access → data lost after access?
Destructive Access
• Explicit activation, restoration, and precharge operations
  o Longer access delay due to serialization
• Issue of LUT chaining: order of LUT access
  o Must activate L2 after L1
  o Must activate L4 after both L2 & L3
[Figure: a physical path within one user clock, running from R1 to R2 through LUTs L1, L2, L3, and L4.]
DRAF Architecture
Basic Logic Element, Multi-Context Support, Timing
DRAF Overview
• Same island layout and configurable interconnect as FPGA
[Figure: DRAF island layout with three annotated block types.]
  o DSP: in DRAM technology; slower, but not critical
  o Block RAM: uses DRAM arrays
  o CLB: contains multiple basic logic elements (BLEs)
Basic Logic Element
• 7-10 bits input, 2-4 bits output
[Figure: a BLE built from a DRAM subarray (MATs, row decoder, master wordline, local wordlines, bitlines, sense-amps) followed by column logic and flip-flops.]
• Narrower MAT: 1k bits to 8-16 bits
• Single-MAT access: multi-context
• Specialized column logic: better flexibility
• Additional FFs & MUXs: registering & retiming
Multi-Context Support
• DRAF supports 8-16 contexts per chip
  o Context: one MAT per BLE
  o Efficient use of MATs with little area and power overhead
• Instant switch between active contexts
  o Similar to a context switch between processes on a CPU
• Context uses
  o One context per accelerator design or application
  o One context per part of a very large accelerator design
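The "one MAT per BLE per context" organization can be sketched as a table-of-tables, where switching contexts only changes which table is selected. This is a toy Python model under that assumption; `MultiContextBLE` and its methods are hypothetical names, not part of DRAF's tooling.

```python
class MultiContextBLE:
    """Toy model of a basic logic element with one table (MAT) per context."""

    def __init__(self, num_contexts, num_inputs):
        self.num_inputs = num_inputs
        # One independent truth table per context.
        self.mats = [[0] * (1 << num_inputs) for _ in range(num_contexts)]
        self.active = 0

    def program(self, context, func):
        """Load the truth table of func into the MAT of the given context."""
        for addr in range(1 << self.num_inputs):
            self.mats[context][addr] = func(addr)

    def switch(self, context):
        """Select another context; no table data is copied or reloaded."""
        if not 0 <= context < len(self.mats):
            raise ValueError("no such context")
        self.active = context

    def read(self, input_bits):
        """Look up the output in the currently active context only."""
        return self.mats[self.active][input_bits]


ble = MultiContextBLE(num_contexts=8, num_inputs=7)
ble.program(0, lambda x: bin(x).count("1") & 1)  # context 0: 7-input parity
ble.program(1, lambda x: 1 if x == 0x7F else 0)  # context 1: 7-input AND
```

After `ble.switch(1)`, the same input bits are answered from the AND table instead of the parity table, and only the selected MAT is accessed on each read.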
Timing – Destructive Access
• Issue of LUT chaining: order of LUT access
• Solution: phase – similar to critical path finding
[Figure: the physical path from R1 to R2 through LUTs L1-L4 within one user clock, with the LUTs grouped into Phase 0, Phase 1, and Phase 2; a phase timeline shows when LUT-1 and LUT-2 are accessed.]
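One way to read "phase, similar to critical path finding" is as a longest-path levelization of the LUT dependency graph: a LUT's phase is one more than the latest phase among the LUTs feeding it. The Python sketch below uses the dependencies stated on the Destructive Access slide; that L1 and L3 have no LUT predecessors is an assumption, and the function name `assign_phases` is hypothetical. The paper's actual algorithm may differ.

```python
from functools import lru_cache

# LUT -> LUTs whose outputs it consumes.
# From the slides: L2 after L1; L4 after both L2 and L3.
# Assumption: L1 and L3 are fed directly from registers.
deps = {
    "L1": [],
    "L2": ["L1"],
    "L3": [],
    "L4": ["L2", "L3"],
}


def assign_phases(deps):
    """Return {lut: phase} using longest-path levelization.

    Assumes the dependency graph is acyclic (combinational logic
    between registers).
    """

    @lru_cache(maxsize=None)
    def phase(lut):
        preds = deps[lut]
        # A LUT with no LUT predecessors can be activated in phase 0.
        return 0 if not preds else 1 + max(phase(p) for p in preds)

    return {lut: phase(lut) for lut in deps}


phases = assign_phases(deps)
```

With these dependencies the result uses three phases (0, 1, and 2), which matches the number of phases shown in the figure.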
Timing – Latency Optimization
• Issue: precharge and restore delays
• Solution: 3-way delay overlapping
  o Hide PRE/RST delays with wire propagation delay
• Performance gap between DRAF and FPGA reduces from >10x to 2-4x
[Figure: timelines for LUT-1 and LUT-2 showing ACT, PRE, RST, and wire delays, first serialized and then overlapped, with the saved delay marked.]
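A rough way to see why overlapping helps is to compare two simple delay models for a chain of LUTs. The sketch below is an assumed model, not the paper's timing analysis: the serialized case pays ACT, PRE, RST, and wire delay back to back, while the overlapped case lets PRE, RST, and wire delay proceed concurrently. All delay values are hypothetical placeholders in nanoseconds.

```python
def serialized_delay(n_luts, act, pre, rst, wire):
    """Every LUT hop pays ACT, PRE, RST, and wire delay back to back."""
    return n_luts * (act + pre + rst + wire)


def overlapped_delay(n_luts, act, pre, rst, wire):
    """Assumed model: PRE, RST, and wire delay run concurrently,
    so each hop pays ACT plus only the longest of the three."""
    return n_luts * (act + max(pre, rst, wire))


# Hypothetical per-operation delays in ns (not taken from the talk).
ACT, PRE, RST, WIRE = 4, 3, 3, 3
CHAIN = 3  # number of chained LUTs on the path

before = serialized_delay(CHAIN, ACT, PRE, RST, WIRE)
after = overlapped_delay(CHAIN, ACT, PRE, RST, WIRE)
saved = before - after
```

Under this model the saving grows with the number of chained LUTs, since each hop hides two of the three overlapped delays.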
Summary
• Challenges → solutions
  o Mismatched LUT size → multi-context BLE
  o Destructive access → phase-based timing
  o Slow speed → 3-way delay overlapping
• Other design features (see paper)
  o Sense-amp as register
  o Time-multiplexed routing
  o Handling DRAM refresh
Evaluation
Area, power, performance against FPGA and CPU
Methodology
• Synthesize, place & route with Yosys + VTR
• CACTI-3DD with 45 nm power and area models
• Comparisons
  o 70 mm² FPGA based on Xilinx Virtex-6
  o 70 mm² DRAF device, 8-context
  o Intel Xeon E5-2630 multi-core processor (2.3 GHz)
• 18 accelerator designs
  o MachSuite, Sirius, Vivado HLS Video Library, VTR benchsuite
  o Web service, image processing, analytics, neural networks, …
DRAF Chip Area & Power
[Figure: two log-scale plots comparing FPGA and DRAF: chip area (mm²) and peak chip power (W), each against logic capacity (in million 6-LUT equivalents).]
• 10x area improvement
• 50x peak power reduction
FPGA vs. DRAF (Area)
• 8-context DRAF occupies 19% less area than 1-context FPGA
  o 10x area efficiency: 8 designs in less silicon area than 1 design before
  o But only one context can be active at a time
[Figure: normalized minimum bounding area per benchmark (aes, backprop, gemm, gmm, harris, stemmer, stencil, viterbi, editdist), split into FPGA logic, FPGA routing, DRAF logic, and DRAF routing. Annotations: "Inefficient use of larger DRAM LUT" and "exp/log functions".]
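The "10x area efficiency" figure follows from the two numbers on the slide. A short Python check, with areas normalized so that the 1-context FPGA is 1.0 (variable names are illustrative only):

```python
fpga_area = 1.0                      # normalized area of a 1-context FPGA
draf_area = fpga_area * (1 - 0.19)   # 8-context DRAF: 19% less area
contexts = 8                         # designs that fit in the DRAF device

# Area charged to each design when all 8 contexts hold a design.
area_per_design = draf_area / contexts

# How many times less area one design costs in DRAF than in the FPGA.
efficiency = fpga_area / area_per_design
```

This gives roughly 9.9x, which the slide rounds to 10x. The gain applies to stored designs, since only one context is active at a time.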
FPGA vs. DRAF (Power)
• Use one context in DRAF
• DRAF consumes 1/3 the power of FPGA and 15% less energy
  o Note: current CAD tools are less efficient with DRAF
[Figure: normalized power per benchmark (aes, backprop, gemm, gmm, harris, stemmer, stencil, viterbi, editdist), split into FPGA logic, FPGA routing, DRAF logic, and DRAF routing.]
Performance
• DRAF is 2.7x slower than FPGA
• DRAF is 13.5x faster than CPU, 3.4x faster than ideal 4-core
[Figure: normalized throughput (log scale) per benchmark (aes, backprop, gemm, gmm, harris, stemmer, stencil, viterbi) for CPU, 4 CPU, FPGA, and DRAF. Annotations: "exp/log functions" and "Efficient line buffer".]
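The headline ratios can be cross-checked against each other. The Python snippet below assumes that "CPU" means a single core and that the ideal 4-core case is exactly 4x one core; both are readings of the slide, not statements from it.

```python
draf_vs_cpu = 13.5    # DRAF throughput relative to one CPU core
fpga_vs_draf = 2.7    # FPGA throughput relative to DRAF
ideal_cores = 4       # assumed perfect scaling across 4 cores

# Implied FPGA speedup over one CPU core.
fpga_vs_cpu = draf_vs_cpu * fpga_vs_draf

# Implied DRAF speedup over an ideal 4-core CPU.
draf_vs_4core = draf_vs_cpu / ideal_cores
```

The second value comes out at about 3.4, consistent with the slide, and the first implies the FPGA is roughly 36x faster than one core on these benchmarks.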
Conclusions
• DRAF: high-density and low-power reconfigurable fabric
  o Based on dense DRAM technology
  o Optimized timing + multi-context support
• DRAF targets cost- and power-constrained applications
  o E.g., datacenters and mobile systems
• DRAF trades off some performance for area & power efficiency
  o 10x smaller area, 3x less power, and 2.7x slower than FPGA
  o Still 13x speedup over Xeon cores
Thanks! Questions?
Backup
Design flow
• Verilog/VHDL programming and similar synthesis flow
  o DRAF has the same primitives (LUT, FF, DSP, BRAM) as FPGA
• Specific tweaks
  o Wider LUT: more efficient packing
  o Optimize for latency rather than area
    - Routing delay is easier to handle
  o Additional timing requirements, e.g., phases
Multi-Context
• Why not do multi-context in SRAM FPGAs?
• Store contexts in-place
  o High area overhead; that area could be used to implement more normal LUTs
  o In DRAF: little overhead due to dense DRAM MAT array
• On-chip backup storage
  o Significant context switch overheads in power and latency
  o In DRAF: zero latency and power for context switch
Design Exploration
• Lots of data in paper
• Main tradeoff is between area and latency
  o Larger LUT: better area, worse latency
  o Smaller LUT: worse area, better latency
• A major limitation is the CAD tool
  o Cannot efficiently map applications to large LUTs
• Final LUT size
  o 7-input, 2-output, 8-context
  o 64 rows, 32 columns, 2048-bit subarray
DRAF: A Low-Power DRAM-based Reconfigurable Acceleration Fabric
Mingyu Gao, Christina Delimitrou, Dimin Niu, Krishna Malladi, Hongzhong Zheng, Bob Brennan, Christos Kozyrakis
Session 8A, Wednesday 9am
The Need for High-density & Low-power FPGAs
• FPGA accelerators improve performance and energy efficiency
  o Recently used for many datacenter apps (Microsoft, Baidu, …)
• Datacenter systems
  o Need high-density FPGAs for large accelerators for multiple apps
  o Need low-power FPGAs to simplify integration in servers and racks
• Mobile systems
  o Need high-density FPGAs for accelerators for multiple apps
  o Need low-power FPGAs for low cost and long battery life
DRAF: A High-density & Low-power FPGA
• Based on dense DRAM arrays instead of SRAM LUTs
  o 10-100x density of conventional FPGAs
  o 1/3 power consumption of conventional FPGAs
  o 13x speedup over Xeon cores
• Come to the talk to learn about
  o Dense, slow DRAM arrays as small, fast LUTs
  o Phase-based timing to address the problem of destructive reads
  o Multi-context support with instantaneous context switch
• Session 8A, Wednesday 9am