Towards Accelerator-Rich Architectures and Systems
Zhenman Fang, Postdoc, Computer Science Department, UCLA
Center for Domain-Specific Computing
Center for Future Architectures Research
https://sites.google.com/site/fangzhenman/
Specialized accelerators, e.g., audio, video, face, imaging, DSP, …
[Die photo of Apple A8 SoC, showing CPU, GPU, and specialized accelerator blocks: www.anandtech.com/show/8562/chipworks-a8]
The Trend of Accelerator-Rich Chips
Fixed-function accelerators (ASIC: Application-Specific Integrated Circuit) instead of general-purpose processors
Maltiel Consulting estimates
Harvard’s estimates [Shao, IEEE Micro'15]
[Bar chart: # of IP blocks across Apple SoC generations, from the A4 (2010) through the A10 (2016), rising from about 5 to nearly 40]
Increasing # of Accelerators in Apple SoC (Estimated)
Cloud service providers begin to deploy FPGAs in their datacenters
The Trend of Accelerator-Rich Cloud
FPGA: 2x throughput improvement!
[Putnam,ISCA'14]
Field-Programmable Gate Array (FPGA) accelerators
✓ Reconfigurable commodity HW
✓ Energy-efficient: a high-end board consumes only ~25W
Accelerators are becoming first-class citizens
§ Intel expectation: 30% of datacenter nodes with FPGAs by 2020, after its $16.7 billion acquisition of Altera
Post-Moore Era: Potential for Customized Accelerators
Source: Bob Brodersen, Berkeley Wireless group
Accelerators (ASICs, FPGAs) promise 10x to 1000x gains in performance per watt by trading off flexibility for performance!
Moore's law is dead!
Challenges in Making Accelerator-Rich Architectures and Systems Mainstream
“Extended” Amdahl’s law:
overall_speedup = 1 / ( kernel% / acc_speedup + (1 − kernel%) + integration_overhead )
(accelerator term, CPU term, and integration overhead, respectively)
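The extended law above is easy to evaluate numerically. A minimal sketch (function name and sample numbers are illustrative, not from the talk):

```python
def overall_speedup(kernel_frac, acc_speedup, integration_overhead=0.0):
    """'Extended' Amdahl's law: the accelerated kernel fraction is sped up
    by acc_speedup, the rest runs on the CPU at original speed, and
    integration adds a fixed fractional overhead."""
    return 1.0 / (kernel_frac / acc_speedup
                  + (1.0 - kernel_frac)
                  + integration_overhead)

# Even a 100x kernel accelerator is capped by the serial fraction:
print(round(overall_speedup(0.9, 100), 2))       # 9.17x with no overhead
print(round(overall_speedup(0.9, 100, 0.1), 2))  # 4.78x with 10% integration overhead
```

The second call shows why integration overhead matters: a modest 10% overhead nearly halves the overall speedup.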
How to characterize and accelerate killer applications?
How to efficiently integrate accelerators into future chips?
§ E.g., a naïve integration only achieves 12% of ideal performance [HPCA'17]
How to deploy commodity accelerators in big data systems?
§ E.g., a naïve integration may lead to a 1000x slowdown [HotCloud'16]
How to program such architectures and systems?
Overview of My Research
1. Application Drivers
   • Workload characterization and acceleration
2. Accelerator-Rich Architectures (ARA)
   • Modeling and optimizing CPU-accelerator interaction
3. Accelerator-Rich Systems
   • Accelerator-as-a-Service (AaaS) in cloud deployment
4. Compiler Support
   • From many-core to accelerator-rich architectures
Dimension #1: Application Drivers
image processing [ISPASS'11]:
✓ Analysis and combination of task, pipeline, and data parallelism
✓ 13x speedup on a 16-core CPU
✓ 46x speedup on GPU
deep learning [ICCAD'16]:
✓ Caffeine: FPGA engine for Caffe
✓ 1.46 TOPS for 8-bit CONV layer
✓ 100x speedup for FCN layer
✓ 5.7x energy savings over GPU
genomics [D&T'17]:
✓ 2.6x speedup for in-memory genome sort (Samtool)
✓ Record 9.6 GB/s throughput for genome compression on Intel-Altera HARPv2; 50x speedup over Zlib
How do accelerators achieve such speedup?
[Excerpt from the Caffeine manuscript, submitted to IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, Dec 2016]
Fig. 1: Overview of Caffeine Framework
e.g., the definitions used in Caffe, which can be compiled into the underlying hardware. On the left side of Figure 1 is the existing CNN layer representation and optimization libraries on CPU and GPU devices. Caffeine complements existing frameworks with an FPGA engine. In summary, this paper makes the following contributions.
1. We propose a uniformed mathematical representation (convolutional MM) for efficient FPGA acceleration of both CONV and FCN layers in CNN/DNN. In addition, we also propose a novel optimization framework based on the roofline model to find the optimal mapping of the uniformed representation to the specialized accelerator.
2. We customize a HW/SW co-designed, efficient, and reusable CNN/DNN engine called Caffeine, where the FPGA accelerator maximizes the utilization of computing and bandwidth resources. Caffeine achieves a peak performance of 1,460 GOPS for the CONV layer and 346 GOPS for the FCN layer with 8-bit fixed-point operations on a medium-sized FPGA board (KU060).
3. We provide an automation flow for users to program CNN in high-level network definitions, and the flow directly generates the final FPGA accelerator. We also provide the Caffe-Caffeine integration, which achieves 29x and 150x end-to-end performance and energy gains over a 12-core CPU, and 5.2x better energy efficiency over a GPU.
The remainder of this paper is organized as follows. Section II presents an overview of CNN and analyzes the computation and bandwidth requirements in different CNN layers. Section III presents the microarchitecture design of the convolution FPGA accelerator. Sections IV and V elaborate on the uniformed representation for both CONV and FCN layers, and the corresponding design space explorations. Section VI presents the automation flow to compile the high-level network definitions into the final CNN accelerator. Section VII evaluates the end-to-end performance of Caffeine and its Caffe integration with quantitative comparison to state-of-the-art studies. Finally, Section VIII concludes the paper.
II. CNN OVERVIEW AND ANALYSIS
A. Algorithm of CNNs
As a typical supervised learning algorithm, there are two major phases in CNN: a training phase and an inference (aka feed-forward) phase. Since many industry applications train CNN in the background and only perform inferences in a real-time scenario, we mainly focus on the inference phase in this paper. The aim of the CNN inference phase is to get
Fig. 2: Inference (aka feedforward) phase in CNN
a correct inference of classification for input images. As shown in Figure 2, it is composed of multiple layers, where each image is fed to the first layer. Each layer receives a number of feature maps from a previous layer and outputs a new set of feature maps after filtering by certain kernels. The convolutional layer, activation layer, and pooling layer are for feature map extraction, and the fully connected layers are for classification.
Convolutional (CONV) layers are the main components of CNN. The computation of a CONV layer is to extract feature information by adopting a filter on feature maps from a previous layer. It receives N feature maps as input and outputs M feature maps. A set of N kernels, each sized K1 × K2, slides across the corresponding input feature maps with element-wise multiplication-accumulation to filter out one output feature map. S1 and S2 are constants representing the sliding strides. M sets of such kernels can generate M output feature maps. The following expression describes its computation pattern.
Out[m][r][c] = Σ_{n=0}^{N} Σ_{i=0}^{K1} Σ_{j=0}^{K2} W[m][n][i][j] * In[n][S1*r+i][S2*c+j]
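The CONV computation pattern above maps directly onto nested loops. A plain-Python sketch (the argument names mirror the paper's notation; following the usual convention, the sums run over n < N, i < K1, j < K2):

```python
def conv_layer(In, W, M, N, R, C, K1, K2, S1, S2):
    # Out[m][r][c] = sum over n, i, j of W[m][n][i][j] * In[n][S1*r+i][S2*c+j]
    Out = [[[0.0] * C for _ in range(R)] for _ in range(M)]
    for m in range(M):                       # output feature maps
        for r in range(R):                   # output rows
            for c in range(C):               # output columns
                acc = 0.0
                for n in range(N):           # input feature maps
                    for i in range(K1):      # kernel rows
                        for j in range(K2):  # kernel columns
                            acc += W[m][n][i][j] * In[n][S1 * r + i][S2 * c + j]
                Out[m][r][c] = acc
    return Out
```

For example, a single 2x2 all-ones kernel sliding with stride 1 over a 3x3 all-ones map yields a 2x2 output of 4.0s.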
Pooling (POOL) layers are used to achieve spatial invariance by sub-sampling neighboring pixels, usually finding the maximum value in a neighborhood in each input feature map. So in a pooling layer, the number of output feature maps is identical to that of input feature maps, while the dimensions of each feature map scale down according to the size of the sub-sampling window.
Activation (ReLU) layers are used to adopt an activation function (e.g., a ReLU function) on each pixel of the feature maps from previous layers to mimic the biological neuron's activation [8].
Fully connected (FCN) layers are used to make final predictions. An FCN layer takes "features" in the form of a vector from a prior feature extraction layer, multiplies a weight matrix, and outputs a new feature vector, whose computation pattern is a dense matrix-vector multiplication. A few cascaded FCNs finally output the classification result of CNN. Sometimes, multiple input vectors are processed simultaneously in a single batch to increase the overall throughput, as shown in the following expression when the batch size h is greater than 1. Note that the FCN layers are also the major components of deep neural networks (DNN) that are widely used in speech recognition.
Out[m][h] = Σ_{n=0}^{N} Wght[m][n] * In[n][h];    (1)
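Likewise, the FCN computation is a dense matrix-vector product, or a matrix-matrix product once the batch size h exceeds 1. A minimal sketch (function and argument names are illustrative):

```python
def fcn_layer(Wght, In, M, N, H):
    # Out[m][h] = sum over n of Wght[m][n] * In[n][h]
    # With batch size H > 1 this is a dense matrix-matrix multiply;
    # with H == 1 it degenerates to a matrix-vector multiply.
    Out = [[0.0] * H for _ in range(M)]
    for m in range(M):
        for h in range(H):
            Out[m][h] = sum(Wght[m][n] * In[n][h] for n in range(N))
    return Out
```

Batching amortizes the cost of streaming the (large) weight matrix from memory across several inputs, which is why it raises throughput.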
B. Analysis of Real-Life CNNs
State-of-the-art CNNs for large visual recognition tasks usually contain billions of neurons and show a trend to go deeper and larger. Table I lists some of the CNN models that have won the ILSVRC (ImageNet Large-Scale Visual Recognition Challenge) …
Dimension #1: Application Drivers
image processing [ISPASS'11] deep learning [ICCAD'16] genomics [D&T'17]
E.g., a convolutional accelerator on-chip
Kernel: convolutional matrix-multiplication
[Diagram: weights, inputs, and outputs stream between DRAM and the accelerator; multipliers (Input 0/1 × Weight 0..3) feed adder trees that produce Output 0/1]
Optimization techniques:
#1 Caching
#2 Customized pipeline
#3 Parallelization
#4 Double buffering
#5 DRAM re-organization
#6 Precision customization
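Technique #4, double buffering, overlaps the DRAM transfer of the next tile with computation on the current one. A minimal ping-pong sketch (the function and its arguments are illustrative, not from the talk; sequential Python can only model the buffer rotation, not the actual concurrency):

```python
def process_tiles(tiles, load, compute):
    """Ping-pong (double) buffering: while compute() works on the tile in
    one buffer, the next tile is loaded into the other buffer, so memory
    transfer latency is hidden behind computation."""
    results = []
    buffers = [None, None]
    buffers[0] = load(tiles[0])                # prologue: fill the first buffer
    for k in range(len(tiles)):
        cur = buffers[k % 2]
        if k + 1 < len(tiles):
            # In hardware this load runs concurrently with compute() below.
            buffers[(k + 1) % 2] = load(tiles[k + 1])
        results.append(compute(cur))
    return results

# e.g. process_tiles([1, 2, 3], load=lambda t: t * 10, compute=lambda b: b + 1)
```

With enough compute work per tile, the steady-state cost is max(load, compute) per tile rather than their sum.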
Dimension #1: Application Drivers
image processing [ISPASS'11] deep learning [ICCAD'16] genomics [D&T'17]
[Bar chart: GFLOPS of successive design versions: 3.2, 0.005, 1.8, 7.6, 36.5, 68.3, and 100 GFLOPS; the 8-bit fixed-point design reaches 1.46 TOPS]
✓ Programmed in Xilinx High-Level Synthesis (HLS)
✓ Results collected on an Alpha Data PCIe-7v3 FPGA board
Dimension #2: Accelerator-Rich Architectures (ARA)
GAM: Global Accelerator Manager
SPM: Scratchpad Memory
ISA extension
[Diagram: overview of an accelerator-rich architecture. Accelerators Acc1…Accn (each with SPM and DMA) and cores C1…Cm (each with an L1 cache) connect through a customizable network-on-chip to the GAM, a TLB, the shared LLC, and the DRAM controller]
Dimension #2: Accelerator-Rich Architectures (ARA)
ARA modeling:
✓ PARADE simulator: gem5 + HLS [ICCAD'15]
✓ Fast ARAPrototyper flow on FPGA-SoC [arXiv'16]
Multicore modeling:
✓ Transformer simulator [DAC'12, LCTES'12]
ARA optimization:
✓ Sources of accelerator gains [FCCM'16]
✓ CPU-Acc co-design: address translation for a unified memory space, 7.6x speedup, 6.4% gap to ideal [HPCA'17 best paper nominee]
✓ AIM: near-memory acceleration gives another 4x speedup [MemSys'17]
More information in the ISCA'15 & MICRO'16 tutorials: http://accelerator.eecs.harvard.edu/micro16tutorial/
PARADE is open source: http://vast.cs.ucla.edu/software/parade-ara-simulator
Dimension #3: Accelerator-Rich Systems
Accelerator designer (e.g., FPGA)
Cloud service provider w/ accelerator-enabled cloud
Big data application developer (e.g., Spark)
Easy accelerator registration into the cloud
Easy and efficient accelerator invocation and sharing
Accelerator-as-a-Service: the Blaze prototype shows 1 server w/ FPGA ~= 3 CPU servers [HotCloud'16, ACM SOCC'16]
Blaze works with Spark and YARN and is open source: https://github.com/UCLA-VAST/blaze
CPU-FPGA platform choice [DAC'16]: 1) mainstream PCIe, 2) coherent PCIe (CAPI), or 3) Intel-Altera HARP (coherent, in one package)
Dimension #4: Compiler Support
[ICS'14, TACO'15, ongoing]
Memory system improvement: a source-to-source compiler for coordinated data prefetching; 1.5x speedup on the Xeon Phi many-core processor
Accelerator-Rich Architectures & Systems
Future work
Overview of My Research
Application drivers: image processing [ISPASS'11], deep learning [ICCAD'16], genomics [D&T'17]
Accelerator-Rich Architectures: PARADE [ICCAD'15], ARAPrototyper [arXiv'16], Transformer [DAC'12, LCTES'12], sources of gains [FCCM'16], CPU-Acc address translation [HPCA'17 best paper nominee], near-memory acceleration [MemSys'17], tutorials [ISCA'15 & MICRO'16]
Accelerator-Rich Systems: FPGA AaaS: Blaze [HotCloud'16, ACM SOCC'16]; CPU-FPGA: PCIe or QPI? [DAC'16]
Compiler support: memory system improvement [ICS'14, TACO'15, ongoing]
Tool: system-level automation
Chip-Level CPU-Accelerator Co-design: Address Translation for Unified Memory Space
[HPCA'17 Best Paper Nominee]
Better programmability and performance
Virtual Memory and Address Translation 101
Virtual memory and its benefits:
§ Shared memory for multi-process
§ Memory isolation for security
§ Conceptually more memory
[Diagram: Core → MMU with TLB → Memory; per-process virtual memory maps to physical memory through address translation]
Memory Management Unit (MMU): virtual-to-physical address translation
Translation Lookaside Buffer (TLB): caches address translation results
Page table: virtual-to-physical address mapping at page granularity
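The translation path just described can be modeled in a few lines (page size, the table's contents, and the unbounded TLB dictionary are illustrative assumptions):

```python
PAGE_SIZE = 4096  # 4 KB pages

page_table = {0: 7, 1: 3, 2: 9}   # virtual page number -> physical frame number
tlb = {}                          # caches recent translations

def translate(vaddr):
    vpn, offset = divmod(vaddr, PAGE_SIZE)
    if vpn in tlb:                # TLB hit: no page-table access needed
        pfn = tlb[vpn]
    else:                         # TLB miss: walk the page table, then cache
        pfn = page_table[vpn]     # a missing key would model a page fault
        tlb[vpn] = pfn
    return pfn * PAGE_SIZE + offset

# translate(4100): page 1, offset 4 -> frame 3 -> 3*4096 + 4 = 12292
```

The page offset passes through unchanged; only the page number is remapped, which is what lets a TLB of a few dozen entries cover megabytes of hot data.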
Inefficiency in Today's ARA Address Translation
[Diagram: cores (each with MMU and TLB) and accelerators (each with scratchpad, datapath, DMA, and TLB) share an interconnect to main memory; accelerator translations go through an IOMMU with a small IOTLB (e.g., 32 entries)]
#1 Inefficient TLB support: TLBs are not specialized to provide low latency and capture page locality
#2 High page walk latency: on an IOTLB miss, 4 main memory accesses are required to walk the page table
[Bar chart: performance relative to ideal address translation under the IOMMU, across medical imaging (Deblur, Denoise, Regist., Segment.), commercial (Black., Stream., Swapt.), vision (DispMap, LPCIP), and navigation (EKFSLAM, RobLoc) benchmarks, plus gmean]
The IOMMU only achieves 12% of the performance of ideal address translation
Must provide efficient address translation support
[Line chart: gmean performance relative to ideal address translation versus translation latency, from 0 to 1024 cycles]
Accelerator Performance Is Highly Sensitive to Address Translation Latency
Opportunities for relatively simple TLB and page walker designs
Characteristic #1: Regular Bulk Transfer of Consecutive Data (Pages)
[Plot: TLB miss behavior of the BlackScholes benchmark, showing accesses to the consecutive pages of one large memory reference]
A shared TLB can be very helpful
Characteristic #2: Impact of Data Tiling, Breaking a Page Across Multiple Accelerators
Original: 32 × 32 × 32 data array; rectangular tiling: 16 × 16 × 16 tiles
[Diagram: pages 0 through 31 of the array; each tile is mapped to a different accelerator for parallel processing, but one page is split across 4 accelerators!]
Our Two-Level TLB Design
[Diagram: each accelerator has a private TLB; misses go to a shared TLB, which forwards to the IOMMU]
✓ 32-entry private TLB per accelerator
✓ 512-entry shared TLB
The utilization wall limits the number of simultaneously powered accelerators
[Bar chart: performance relative to ideal address translation for the IOMMU, private TLB, and two-level TLB designs across the benchmark suite]
Still only achieves half the ideal performance => need to improve the page walker design
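The two-level lookup can be sketched as follows (entry counts come from the slide; the LRU dictionaries and the miss-handler callback are illustrative):

```python
from collections import OrderedDict

class Tlb:
    """Small fully-associative TLB with LRU replacement."""
    def __init__(self, entries):
        self.entries = entries
        self.map = OrderedDict()          # vpn -> pfn, oldest first

    def lookup(self, vpn):
        if vpn in self.map:
            self.map.move_to_end(vpn)     # refresh LRU position
            return self.map[vpn]
        return None

    def fill(self, vpn, pfn):
        self.map[vpn] = pfn
        self.map.move_to_end(vpn)
        if len(self.map) > self.entries:
            self.map.popitem(last=False)  # evict the LRU entry

def translate_two_level(vpn, private, shared, walk_page_table):
    pfn = private.lookup(vpn)             # level 1: 32-entry private TLB
    if pfn is None:
        pfn = shared.lookup(vpn)          # level 2: 512-entry shared TLB
        if pfn is None:
            pfn = walk_page_table(vpn)    # miss in both: full page walk
            shared.fill(vpn, pfn)
        private.fill(vpn, pfn)
    return pfn

# e.g. private, shared = Tlb(32), Tlb(512)
```

The shared level is what recovers the tiling case above: once one accelerator walks a split page, its siblings hit in the shared TLB instead of walking again.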
Page Walker Design Alternatives
#1 Improve the IOMMU design to reduce page walk latency
§ Needs a more complex IOMMU, e.g., a GPU MMU with parallel page walkers [Power, HPCA'14]
#2 Leverage the MMU of the host core that launches the accelerators
§ Very simple and efficient, as the host core has an MMU cache & data cache
[Diagram: 4-level page walk of a 64-bit virtual address (L4/L3/L2/L1 indices plus page offset), starting from the page table base address in CR3 and accelerated by the MMU cache; one data cache line prefetches entries for 8 consecutive pages]
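The 4-level x86-64 walk splits the 48-bit virtual address into four 9-bit table indices plus a 12-bit page offset. A sketch (index names follow the slide's L4..L1):

```python
def split_va(vaddr):
    """Split a 48-bit virtual address into 4-level page-walk indices.

    Each level indexes a 512-entry (9-bit) table; the low 12 bits are the
    offset within a 4 KB page. One 64-byte cache line of the last-level
    table holds 8 consecutive 8-byte entries, which is the prefetch
    effect mentioned on the slide: fetching one entry brings in the
    translations for 8 consecutive pages."""
    offset = vaddr & 0xFFF
    l1 = (vaddr >> 12) & 0x1FF
    l2 = (vaddr >> 21) & 0x1FF
    l3 = (vaddr >> 30) & 0x1FF
    l4 = (vaddr >> 39) & 0x1FF
    return l4, l3, l2, l1, offset

# Consecutive 4 KB pages differ only in the L1 index, so the walker's
# upper-level lookups (L4..L2) can be served by the MMU cache.
```

This is why the host-core path is attractive for the bulk-transfer pattern of Characteristic #1: consecutive pages reuse both the MMU cache and the data cache line.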
Final Proposal: Two-Level TLB + Host Page Walk [HPCA'17 Best Paper Nominee]
[Diagram: accelerators with private TLBs share a TLB whose misses are forwarded to the MMU of the host core that launched the accelerators]
[Bar chart: performance relative to ideal address translation for the IOMMU, private TLB, two-level TLB, and two-level TLB + host page walk designs across the benchmark suite]
On average: 7.6x speedup over the naïve IOMMU design, with only a 6.4% gap to ideal translation
Datacenter-Level: Deploying FPGA Accelerators at Cloud Scale
Deploying Accelerators in Datacenters
Accelerator designer (e.g., FPGA)
Cloud service provider
Big data application developer (e.g., Spark)
How to install my accelerators…?
How to acquire accelerator resources…?
How to program with your accelerators…?
Programming challenges:
✓ Java/Scala vs. OpenCL/C/C++
✓ Explicit accelerator sharing by multiple threads & apps
Performance challenges:
✓ JVM-to-accelerator communication overhead
✓ FPGA reconfiguration overhead
Blaze Proposal: Accelerator-as-a-Service [ACM SOCC'16, C-FAR Best Demo Award 3/49]
[Diagram: big data applications (e.g., Spark programs) call Blaze programming APIs; a client talks to the YARN Resource Manager (RM) and Application Master (AM); each Node Manager (NM) runs containers alongside a Node Accelerator Manager (NAM) fronting the FPGA, all coordinated by the Global Accelerator Manager (GAM)]
GAM: Global Accelerator Manager, accelerator-centric scheduling
NAM: Node Accelerator Manager, local accelerator service
RM: Resource Manager; NM: Node Manager; AM: Application Master
Blaze works with Apache Spark and YARN; open source: https://github.com/UCLA-VAST/blaze
Blaze Programming Overview
Big data application (e.g., Spark programs)
[Diagram: applications submit ACC labels to the Global ACC Manager and get back container info; the Node ACC Manager returns ACC info and handles ACC invocation with input/output data on FPGA, GPU, or other ACCs]
Register accelerators:
§ APIs to add an accelerator service to the corresponding nodes
Request accelerators:
§ APIs to invoke accelerators through acc_id
§ GAM allocates the corresponding nodes to applications
Transparent and Efficient Accelerator Sharing
[Diagram: an application scheduler feeds apps into per-platform queues; a task scheduler dispatches per-application task queues onto FPGA1 and FPGA2]
#1 Overlapping (pipelining) computation and communication from multiple threads
#2 Data caching on FPGA device memory
#3 Delayed scheduling: the same logical tasks are scheduled to the same FPGA to avoid reprogramming
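Technique #3 amounts to keeping a logical-function-to-FPGA affinity map. A minimal sketch (the round-robin fallback and all names are illustrative, not Blaze's actual policy):

```python
def schedule(tasks, fpgas):
    """Delayed (affinity) scheduling: tasks with the same logical function
    go to the FPGA already programmed with that function's bitstream,
    avoiding costly reconfiguration; unseen functions fall back to
    round-robin placement."""
    affinity = {}                 # logical function -> fpga id
    placements, next_fpga = [], 0
    for func in tasks:
        if func not in affinity:  # first occurrence: pick an FPGA and pin it
            affinity[func] = fpgas[next_fpga % len(fpgas)]
            next_fpga += 1
        placements.append((func, affinity[func]))
    return placements

# schedule(["LR", "KM", "LR", "LR", "KM"], ["fpga1", "fpga2"])
# keeps every "LR" task on fpga1 and every "KM" task on fpga2.
```

Since FPGA reconfiguration takes orders of magnitude longer than a task, pinning functions this way is what makes sharing one board across applications practical.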
CDSC FPGA-Accelerated Cluster
A 22-node cluster with FPGA-based accelerators: 1 master/driver, 20 workers, 1 file server, 1 10GbE switch
Each node:
1. Two Xeon processors
2. One FPGA PCIe card (Alpha Data)
3. 64 GB RAM
4. 10GbE NIC
Alpha Data board:
1. Virtex-7 FPGA
2. 16 GB on-board RAM
Spark: computation framework, an in-memory MapReduce system
HDFS: distributed storage framework
Programming Efforts Reduction with Blaze
Measured in lines of code (LOC) reduction for accelerator management:
• Logistic Regression: 325
• K-Means: 364
Computational genomics:
• Genome Sequence Alignment [HotCloud'16]: 896
• Genome Compression: 360
Performance of a Single Accelerator with Blaze
[Bar chart: speedup of the FPGA task over the CPU (up to 4.3x) for Logistic Regression, K-Means, Genome Sequence Alignment, and Genome Compression, comparing naïve, pipelining, and pipelining + caching versions against the CPU baseline]
With Blaze, a server with an FPGA can replace 1.7~4.3 CPU servers while providing the same throughput
Performance of Multi-Accelerator Scheduling w/ Blaze
[Line chart: normalized throughput versus the ratio of LR in mixed Logistic Regression & K-Means workloads (from 1 down to 0), comparing:]
• Theoretical optimal
• Static partition: half the nodes for LR, half for KM
• CPU-sharing: default scheduling policy for CPUs
• Blaze-GAM: accelerator-centric delayed scheduling
Static or CPU-style sharing cannot handle dynamic workload distributions; Blaze-GAM performs well in most cases.
Summary So Far
Great promise of accelerator-rich architectures and systems
§ Orders-of-magnitude performance and energy gains in customized chips
§ Severalfold consolidation of datacenter size with commodity FPGAs
My contributions to chip-level Accelerator-Rich Architectures
§ Developed the open-source ARA simulator PARADE
§ Analyzed sources of performance gains for customized accelerators
§ Proposed an efficient and unified address translation scheme for ARA
My contributions to datacenter-level accelerator deployment
§ Proposed accelerator-as-a-service in the cloud
§ Contributed the open-source Blaze system
Lots of opportunities to be explored…
When the Internet-of-Things (IoT) Marries Accelerators
IoT devices are very sensitive to power/energy consumption
The IoT cloud handles big data for real-time analytics
Customizable chips and customized datacenters: a trillions-of-dollars market
Communication costs more energy than computation in IoT, especially after acceleration
Communication-Efficient Accelerator-Rich IoT (CearIoT)
IoT devices: local low-power accelerators to preprocess data (e.g., filtering, compression)
Regional edge devices: simple processing & data aggregation (e.g., genome to variants, image to neural bits, request aggregation)
Cloud: large-scale data processing with customized datacenters; near-memory/storage computing for big data
Communication-Efficient Accelerator-Rich IoT (CearIoT)
#1 Architecture support
#2 Programming support
#3 Runtime support
#4 Security support
Lots of opportunities…