DRAF: A Low-Power DRAM-based Reconfigurable Acceleration...

31
DRAF: A Low-Power DRAM-based Reconfigurable Acceleration Fabric Mingyu Gao, Christina Delimitrou, Dimin Niu, Krishna Malladi, Hongzhong Zheng, Bob Brennan, Christos Kozyrakis ISCA – June 22, 2016

Transcript of DRAF: A Low-Power DRAM-based Reconfigurable Acceleration...

Page 1: DRAF: A Low-Power DRAM-based Reconfigurable Acceleration ...delimitrou/slides/2016.isca.draf.slides.p… · DRAF in a Nutshell ! A high-density & low-power FPGA o Bit-level reconfigurable,

DRAF: A Low-Power DRAM-based Reconfigurable Acceleration Fabric

Mingyu Gao, Christina Delimitrou, Dimin Niu, Krishna Malladi, Hongzhong Zheng, Bob Brennan, Christos Kozyrakis

ISCA – June 22, 2016

Page 2: DRAF: A Low-Power DRAM-based Reconfigurable Acceleration ...delimitrou/slides/2016.isca.draf.slides.p… · DRAF in a Nutshell ! A high-density & low-power FPGA o Bit-level reconfigurable,

FPGA-Based Accelerators q  Improve performance and energy efficiency q  Good balance between flexibility (CPUs) and efficiency (ASICs)

q  Recently used for many datacenter apps o  Image/video processing, websearch, neural networks, …

2  Pictures:  Putnam,  et  al.  A  Reconfigurable  Fabric  for  Accelera:ng  Large-­‐Scale  Datacenter  Services.  ISCA’14  

Page 3: DRAF: A Low-Power DRAM-based Reconfigurable Acceleration ...delimitrou/slides/2016.isca.draf.slides.p… · DRAF in a Nutshell ! A high-density & low-power FPGA o Bit-level reconfigurable,

Motivation q  Deploy FPGAs in cost & power constrained systems

q  Datacenter systems o  High-density FPGAs for large accelerators for multiple apps o  Low-power FPGAs to simplify integration in servers and racks

q  Mobile systems o  High-density FPGAs for accelerators for multiple apps o  Low-power FPGAs for low cost and long battery life

3  

Page 4: DRAF: A Low-Power DRAM-based Reconfigurable Acceleration ...delimitrou/slides/2016.isca.draf.slides.p… · DRAF in a Nutshell ! A high-density & low-power FPGA o Bit-level reconfigurable,

DRAF in a Nutshell q  A high-density & low-power FPGA

o  Bit-level reconfigurable, just like conventional FPGAs

q  Uses dense DRAM technology for lookup tables o  Replacing the SRAM technology in conventional FPGAs

q  DRAF vs. FPGA o  10 – 100x logic density o  1/3 power consumption o  Multi-context support with fast context switch

4  

Page 5: DRAF: A Low-Power DRAM-based Reconfigurable Acceleration ...delimitrou/slides/2016.isca.draf.slides.p… · DRAF in a Nutshell ! A high-density & low-power FPGA o Bit-level reconfigurable,

Challenges of Building DRAM-based FPGAs

5  

Page 6: DRAF: A Low-Power DRAM-based Reconfigurable Acceleration ...delimitrou/slides/2016.isca.draf.slides.p… · DRAF in a Nutshell ! A high-density & low-power FPGA o Bit-level reconfigurable,

DRAM Array Structure

6  

Subarray  

……

MAT

……

Sense-­‐amp

……

Master  wordline

Row  decod

er

Local  wordlinebitline

A DRAM subarray is naturally a lookup-table

Input  

Output  

Page 7: DRAF: A Low-Power DRAM-based Reconfigurable Acceleration ...delimitrou/slides/2016.isca.draf.slides.p… · DRAF in a Nutshell ! A high-density & low-power FPGA o Bit-level reconfigurable,

……

MAT

……

Sense-­‐amp

……

Master  wordline

Row  decod

er

Local  wordlinebitline

Challenges

7  

~8k-bit output

~1k rows ~10-bit input

Mismatch LUT size a 8192-bit LUT?

Slow speed a LUT with 10 ns delay?

10-­‐30  ns  delay  

Destructive access data lost after access?

Page 8: DRAF: A Low-Power DRAM-based Reconfigurable Acceleration ...delimitrou/slides/2016.isca.draf.slides.p… · DRAF in a Nutshell ! A high-density & low-power FPGA o Bit-level reconfigurable,

User  Clock

Physical  Path

R1 R2

L1 L2

L3

L4

Destructive Access q  Explicit activation, restoration, and precharge operations

o  Longer access delay due to serialization

q  Issue of LUT chaining: order of LUT access

8  

Must activate L2 after L1

Must activate L4 after both L2 & L3

Page 9: DRAF: A Low-Power DRAM-based Reconfigurable Acceleration ...delimitrou/slides/2016.isca.draf.slides.p… · DRAF in a Nutshell ! A high-density & low-power FPGA o Bit-level reconfigurable,

DRAF Architecture Basic Logic Element Multi-Context Support Timing

9  

Page 10: DRAF: A Low-Power DRAM-based Reconfigurable Acceleration ...delimitrou/slides/2016.isca.draf.slides.p… · DRAF in a Nutshell ! A high-density & low-power FPGA o Bit-level reconfigurable,

DRAF Overview q  Same island layout and configurable interconnect as FPGA

10  

DSP In DRAM technology Slower but not critical

Block RAM Uses DRAM arrays

CLB Contains multiple basic logic elements (BLEs)

Page 11: DRAF: A Low-Power DRAM-based Reconfigurable Acceleration ...delimitrou/slides/2016.isca.draf.slides.p… · DRAF in a Nutshell ! A high-density & low-power FPGA o Bit-level reconfigurable,

Basic Logic Element 7-10 bits input 2-4 bits output

11  

Subarray

……

MAT

……

Sense-­‐amp

……

Master  wordlineRo

w  decod

er

Local  wordline

bitline Narrower MAT

1k bits to 8-16 bits

Col  logic

6

14

4x2 4x2

Specialized column logic Better flexibility

FFs

4 4Additional FFs & MUXs Registering & retiming

Single-MAT access Multi-context

3

4

Page 12: DRAF: A Low-Power DRAM-based Reconfigurable Acceleration ...delimitrou/slides/2016.isca.draf.slides.p… · DRAF in a Nutshell ! A high-density & low-power FPGA o Bit-level reconfigurable,

Multi-Context Support q  DRAF supports 8-16 contexts per chip

o  Context: one MAT per BLE o  Efficient use of MATs with little area and power overhead

q  Instant switch between active contexts o  Similar to context-switch between processes on CPU

q  Context uses o  One context per accelerator design or application o  One context per part of a very large accelerator design

12  

Page 13: DRAF: A Low-Power DRAM-based Reconfigurable Acceleration ...delimitrou/slides/2016.isca.draf.slides.p… · DRAF in a Nutshell ! A high-density & low-power FPGA o Bit-level reconfigurable,

User  Clock

Physical  Path

R1 R2

L1 L2

L3

L4

Timing – Destructive Access q  Issue of LUT chaining: order of LUT access q  Solution: phase – similar to critical path finding

13  

Phase  0   Phase  1  

Phase  Timeline  

Phase  0  

Phase  2  

Phase  0   Phase  1   Phase  2  

Page 14: DRAF: A Low-Power DRAM-based Reconfigurable Acceleration ...delimitrou/slides/2016.isca.draf.slides.p… · DRAF in a Nutshell ! A high-density & low-power FPGA o Bit-level reconfigurable,

LUT-­‐1  

LUT-­‐2  

Timing – Latency Optimization q  Issue: precharge and restore delays q  Solution: 3-way delay overlapping

o  Hide PRE/RST delays with wire propagation delay q  Performance gap between DRAF and FPGA reduces from >10x to 2-4x

14  

ACT  PRE   RST  Wire  

ACT  PRE   RST  

LUT-­‐1  

LUT-­‐2  

ACT  PRE   RST  Wire  

ACT  PRE   RST  

Saved  delay  

Page 15: DRAF: A Low-Power DRAM-based Reconfigurable Acceleration ...delimitrou/slides/2016.isca.draf.slides.p… · DRAF in a Nutshell ! A high-density & low-power FPGA o Bit-level reconfigurable,

Summary q  Challenges à solutions

o  Mismatch LUT size à multi-context BLE o  Destructive access à phase-based timing o  Slow speed à 3-way delay overlapping

q  Other design features (see paper) o  Sense-amp as register o  Time-multiplexed routing o  Handling DRAM Refresh

15  

Page 16: DRAF: A Low-Power DRAM-based Reconfigurable Acceleration ...delimitrou/slides/2016.isca.draf.slides.p… · DRAF in a Nutshell ! A high-density & low-power FPGA o Bit-level reconfigurable,

Evaluation Area, power, performance against FPGA and CPU

16  

Page 17: DRAF: A Low-Power DRAM-based Reconfigurable Acceleration ...delimitrou/slides/2016.isca.draf.slides.p… · DRAF in a Nutshell ! A high-density & low-power FPGA o Bit-level reconfigurable,

Methodology q  Synthesize, place & route with Yosys + VTR q  CACTI-3DD with 45 nm power and area models

q  Comparisons o  70 mm2 FPGA based on Xilinx Virtex-6 o  70 mm2 DRAF device, 8-context o  Intel Xeon E5-2630 multi-core processor (2.3 GHz)

q  18 accelerator designs o  MachSuite, Sirius, Vivado HLS Video Library, VTR benchsuite o  Web service, image processing, analytics, neural networks, …

17  

Page 18: DRAF: A Low-Power DRAM-based Reconfigurable Acceleration ...delimitrou/slides/2016.isca.draf.slides.p… · DRAF in a Nutshell ! A high-density & low-power FPGA o Bit-level reconfigurable,

DRAF Chip Area & Power

18  

0.01  

0.1  

1  

10  

100  

1000  

0   0.5   1   1.5  Chip  Area  (m

m2)  

Logic  Capacity  (in  million  6-­‐LUT  equivalents)    

FPGA  

DRAF  

0.01  

0.1  

1  

10  

100  

1000  

0   0.5   1   1.5  

Peak  Chip  Po

wer  (W

)  

Logic  Capacity  (in  million  6-­‐LUT  equivalents)    

FPGA  

DRAF  

10x area improvement 50x peak power reduction

Page 19: DRAF: A Low-Power DRAM-based Reconfigurable Acceleration ...delimitrou/slides/2016.isca.draf.slides.p… · DRAF in a Nutshell ! A high-density & low-power FPGA o Bit-level reconfigurable,

FPGA vs. DRAF (Area) q  8-context DRAF occupies 19% less area than 1-context FPGA

o  10x area efficiency: 8 designs in less silicon area than 1 design before o  But only one context can be active at a time

19  

0  0.2  0.4  0.6  0.8  1  

1.2  1.4  

aes   backprop   gemm   gmm   harris   stemmer   stencil   viterbi   editdist  

Normalize

d  Min  

Boun

ding  Area  

FPGA  Logic   FPGA  Rou:ng   DRAF  Logic   DRAF  Rou:ng  

Inefficient use of larger DRAM LUT

exp/log functions

Page 20: DRAF: A Low-Power DRAM-based Reconfigurable Acceleration ...delimitrou/slides/2016.isca.draf.slides.p… · DRAF in a Nutshell ! A high-density & low-power FPGA o Bit-level reconfigurable,

FPGA vs. DRAF (Power) q  Use one context in DRAF q  DRAF consumes 1/3 power of FPGA and 15% less energy

o  Note: current CAD tools are less efficient with DRAF

20  

0  

0.2  

0.4  

0.6  

0.8  

1  

aes   backprop   gemm   gmm   harris   stemmer   stencil   viterbi   editdist  

Normalize

d  Po

wer  

 

FPGA  Logic   FPGA  Rou:ng   DRAF  Logic   DRAF  Rou:ng  

Page 21: DRAF: A Low-Power DRAM-based Reconfigurable Acceleration ...delimitrou/slides/2016.isca.draf.slides.p… · DRAF in a Nutshell ! A high-density & low-power FPGA o Bit-level reconfigurable,

Performance q  DRAF is 2.7x slower than FPGA q  DRAF is 13.5x faster than CPU, 3.4x faster than ideal 4-core

21  

0.1  

1  

10  

100  

1000  

aes   backprop   gemm   gmm   harris   stemmer   stencil   viterbi  Normalize

d  Throughp

ut  

CPU   4  CPU   FPGA   DRAF  

exp/log functions

Efficient line buffer

Page 22: DRAF: A Low-Power DRAM-based Reconfigurable Acceleration ...delimitrou/slides/2016.isca.draf.slides.p… · DRAF in a Nutshell ! A high-density & low-power FPGA o Bit-level reconfigurable,

Conclusions q  DRAF: high-density and low-power reconfigurable fabric

o  Based on dense DRAM technology o  Optimized timing + multi-context support

q  DRAF targets cost and power constrained applications o  E.g., datacenters and mobile systems

q  DRAF trades off some performance for area & power efficiency o  10x smaller area, 3x less power, and 2.7x slower than FPGA o  Still 13x speedup over Xeon cores

22  

Page 23: DRAF: A Low-Power DRAM-based Reconfigurable Acceleration ...delimitrou/slides/2016.isca.draf.slides.p… · DRAF in a Nutshell ! A high-density & low-power FPGA o Bit-level reconfigurable,

Thanks! Questions?

Page 24: DRAF: A Low-Power DRAM-based Reconfigurable Acceleration ...delimitrou/slides/2016.isca.draf.slides.p… · DRAF in a Nutshell ! A high-density & low-power FPGA o Bit-level reconfigurable,

Backup

Page 25: DRAF: A Low-Power DRAM-based Reconfigurable Acceleration ...delimitrou/slides/2016.isca.draf.slides.p… · DRAF in a Nutshell ! A high-density & low-power FPGA o Bit-level reconfigurable,

Design flow q  Verilog/VHDL programming and similar synthesis flow

o  DRAF has the same primitives (LUT, FF, DSP, BRAM) as FPGA

q  Specific tweaks o  Wider LUT: more efficient packing o  Optimize for latency rather than area

•  Routing delay is easier to handle o  Additional timing requirements, e.g. phase, etc.

Page 26: DRAF: A Low-Power DRAM-based Reconfigurable Acceleration ...delimitrou/slides/2016.isca.draf.slides.p… · DRAF in a Nutshell ! A high-density & low-power FPGA o Bit-level reconfigurable,

Multi-Context q  Why not do multi-context in SRAM FPGAs? q  Store contexts in-place

o  High area overhead, can be use to implement more normal LUTs o  In DRAF: little overhead due to dense DRAM MAT array

q  On-chip backup storage o  Significant context switch overheads in power and latency o  In DRAF: zero latency and power for context switch

Page 27: DRAF: A Low-Power DRAM-based Reconfigurable Acceleration ...delimitrou/slides/2016.isca.draf.slides.p… · DRAF in a Nutshell ! A high-density & low-power FPGA o Bit-level reconfigurable,

Design Exploration q  Lots of data in paper q  Main tradeoff is between area and latency

o  Larger LUT: better area, worse latency o  Smaller LUT: worse area, better latency

q  A major limitation is the CAD tool o  Cannot efficiently map applications to large LUTs

q  Final LUT size o  7-input, 2-output, 8-context o  64 rows, 32 columns, 2048-bit subarray

Page 28: DRAF: A Low-Power DRAM-based Reconfigurable Acceleration ...delimitrou/slides/2016.isca.draf.slides.p… · DRAF in a Nutshell ! A high-density & low-power FPGA o Bit-level reconfigurable,
Page 29: DRAF: A Low-Power DRAM-based Reconfigurable Acceleration ...delimitrou/slides/2016.isca.draf.slides.p… · DRAF in a Nutshell ! A high-density & low-power FPGA o Bit-level reconfigurable,

DRAF: A Low-Power DRAM-based Reconfigurable Acceleration Fabric

Mingyu Gao, Christina Delimitrou, Dimin Niu, Krishna Malladi, Hongzhong Zheng, Bob Brennan, Christos Kozyrakis

Session 8A, Wednesday 9am

Page 30: DRAF: A Low-Power DRAM-based Reconfigurable Acceleration ...delimitrou/slides/2016.isca.draf.slides.p… · DRAF in a Nutshell ! A high-density & low-power FPGA o Bit-level reconfigurable,

The Need for High-density & Low-power FPGAs

q  FPGA accelerators improve performance and energy efficiency o  Recently used for many datacenter apps (Microsoft, Baidu, …)

q  Datacenter systems o  Need high-density FPGAs for large accelerators for multiple apps o  Need low-power FPGAs to simplify integration in servers and racks

q  Mobile systems o  Need high-density FPGAs for accelerators for multiple apps o  Need low-power FPGAs for low cost and long battery life

Page 31: DRAF: A Low-Power DRAM-based Reconfigurable Acceleration ...delimitrou/slides/2016.isca.draf.slides.p… · DRAF in a Nutshell ! A high-density & low-power FPGA o Bit-level reconfigurable,

DRAF: A High-density & Low-power FPGA q  Based on dense DRAM arrays instead of SRAM LUTs

o  10-100x density of convectional FPGAs o  1/3 power consumption of convectional FPGAs o  13x speedup over Xeon cores

q  Come to the talk to learn about o  Dense, slow DRAM arrays as small, fast LUTs o  Phase-based timing to address the problem of destructive reads o  Multi-context support with instantaneous context switch

q  Session 8A, Wednesday 9am