Towards Accelerator-Rich Architectures and Systems
Zhenman Fang, Postdoc, Computer Science Department, UCLA
Center for Domain-Specific Computing
Center for Future Architectures Research
https://sites.google.com/site/fangzhenman/
Specialized accelerators, e.g., audio, video, face, imaging, DSP, …
[Die photo of Apple A8 SoC, showing CPU, GPU, and specialized accelerator blocks: www.anandtech.com/show/8562/chipworks-a8]
The Trend of Accelerator-Rich Chips
Fixed-function accelerators (ASIC: Application-Specific Integrated Circuit) instead of general-purpose processors
Maltiel Consulting estimates
Harvard’s estimates [Shao, IEEE Micro'15]
[Bar chart: # of IP blocks across Apple SoC generations, from the A4 (2010) through the A10 (2016), rising from about 5 to nearly 40]
Increasing # of Accelerators in Apple SoC (Estimated)
Cloud service providers begin to deploy FPGAs in their datacenters
The Trend of Accelerator-Rich Cloud
FPGA: 2x throughput improvement!
[Putnam,ISCA'14]
Field-Programmable Gate Array (FPGA) accelerators
✓ Reconfigurable commodity HW
✓ Energy-efficient: a high-end board consumes only ~25W
Accelerators are becoming first-class citizens
§ Intel expectation: 30% of datacenter nodes with FPGAs by 2020, after its $16.7 billion acquisition of Altera
Post-Moore Era: Potential for Customized Accelerators
Source: Bob Brodersen, Berkeley Wireless group
Accelerators (ASICs, FPGAs) promise 10x to 1000x gains in performance per watt by trading off flexibility for performance!
Moore's law is dead!
Challenges in Making Accelerator-Rich Architectures and Systems Mainstream
“Extended” Amdahl’s law:
overall_speedup = 1 / ( kernel% / acc_speedup + (1 − kernel%) + integration_overhead )
(accelerator term, CPU term, and integration overhead, respectively)
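The extended law above is easy to evaluate numerically. A minimal sketch (function name and sample numbers are illustrative, not from the talk):

```python
def overall_speedup(kernel_frac, acc_speedup, integration_overhead=0.0):
    """'Extended' Amdahl's law: the accelerated kernel fraction is sped up
    by acc_speedup, the rest runs on the CPU at original speed, and
    integration adds a fixed fractional overhead."""
    return 1.0 / (kernel_frac / acc_speedup
                  + (1.0 - kernel_frac)
                  + integration_overhead)

# Even a 100x kernel accelerator is capped by the serial fraction:
print(round(overall_speedup(0.9, 100), 2))       # 9.17x with no overhead
print(round(overall_speedup(0.9, 100, 0.1), 2))  # 4.78x with 10% integration overhead
```

The second call shows why integration overhead matters: a modest 10% overhead nearly halves the overall speedup.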
How to characterize and accelerate killer applications?
How to efficiently integrate accelerators into future chips?
§ E.g., a naïve integration only achieves 12% of ideal performance [HPCA'17]
How to deploy commodity accelerators in big data systems?
§ E.g., a naïve integration may lead to a 1000x slowdown [HotCloud'16]
How to program such architectures and systems?
Overview of My Research
1. Application Drivers
   • Workload characterization and acceleration
2. Accelerator-Rich Architectures (ARA)
   • Modeling and optimizing CPU-accelerator interaction
3. Accelerator-Rich Systems
   • Accelerator-as-a-Service (AaaS) in cloud deployment
4. Compiler Support
   • From many-core to accelerator-rich architectures
Dimension #1: Application Drivers
image processing [ISPASS'11]:
✓ Analysis and combination of task, pipeline, and data parallelism
✓ 13x speedup on a 16-core CPU
✓ 46x speedup on GPU
deep learning [ICCAD'16]:
✓ Caffeine: FPGA engine for Caffe
✓ 1.46 TOPS for 8-bit CONV layer
✓ 100x speedup for FCN layer
✓ 5.7x energy savings over GPU
genomics [D&T'17]:
✓ 2.6x speedup for in-memory genome sort (Samtool)
✓ Record 9.6 GB/s throughput for genome compression on Intel-Altera HARPv2; 50x speedup over Zlib
How do accelerators achieve such speedup?
[Excerpt from the Caffeine manuscript, submitted to IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, Dec 2016]
Fig. 1: Overview of Caffeine Framework
e.g., the definitions used in Caffe, which can be compiled into the underlying hardware. On the left side of Figure 1 is the existing CNN layer representation and optimization libraries on CPU and GPU devices. Caffeine complements existing frameworks with an FPGA engine. In summary, this paper makes the following contributions.
1. We propose a uniformed mathematical representation (convolutional MM) for efficient FPGA acceleration of both CONV and FCN layers in CNN/DNN. In addition, we also propose a novel optimization framework based on the roofline model to find the optimal mapping of the uniformed representation to the specialized accelerator.
2. We customize a HW/SW co-designed, efficient, and reusable CNN/DNN engine called Caffeine, where the FPGA accelerator maximizes the utilization of computing and bandwidth resources. Caffeine achieves a peak performance of 1,460 GOPS for the CONV layer and 346 GOPS for the FCN layer with 8-bit fixed-point operations on a medium-sized FPGA board (KU060).
3. We provide an automation flow for users to program CNN in high-level network definitions, and the flow directly generates the final FPGA accelerator. We also provide the Caffe-Caffeine integration, which achieves 29x and 150x end-to-end performance and energy gains over a 12-core CPU, and 5.2x better energy efficiency over a GPU.
The remainder of this paper is organized as follows. Section II presents an overview of CNN and analyzes the computation and bandwidth requirements in different CNN layers. Section III presents the microarchitecture design of the convolution FPGA accelerator. Sections IV and V elaborate on the uniformed representation for both CONV and FCN layers, and the corresponding design space explorations. Section VI presents the automation flow to compile the high-level network definitions into the final CNN accelerator. Section VII evaluates the end-to-end performance of Caffeine and its Caffe integration with quantitative comparison to state-of-the-art studies. Finally, Section VIII concludes the paper.
II. CNN OVERVIEW AND ANALYSIS
A. Algorithm of CNNs
As a typical supervised learning algorithm, there are two major phases in CNN: a training phase and an inference (aka feed-forward) phase. Since many industry applications train CNN in the background and only perform inferences in a real-time scenario, we mainly focus on the inference phase in this paper. The aim of the CNN inference phase is to get
Fig. 2: Inference (aka feedforward) phase in CNN
a correct inference of classification for input images. As shown in Figure 2, it is composed of multiple layers, where each image is fed to the first layer. Each layer receives a number of feature maps from a previous layer and outputs a new set of feature maps after filtering by certain kernels. The convolutional layer, activation layer, and pooling layer are for feature map extraction, and the fully connected layers are for classification.
Convolutional (CONV) layers are the main components of CNN. The computation of a CONV layer is to extract feature information by adopting a filter on feature maps from a previous layer. It receives N feature maps as input and outputs M feature maps. A set of N kernels, each sized K1 × K2, slides across the corresponding input feature maps with element-wise multiplication-accumulation to filter out one output feature map. S1 and S2 are constants representing the sliding strides. M sets of such kernels can generate M output feature maps. The following expression describes its computation pattern.
Out[m][r][c] = Σ_{n=0}^{N} Σ_{i=0}^{K1} Σ_{j=0}^{K2} W[m][n][i][j] * In[n][S1*r+i][S2*c+j]
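The CONV computation pattern above maps directly onto nested loops. A plain-Python sketch (the argument names mirror the paper's notation; following the usual convention, the sums run over n < N, i < K1, j < K2):

```python
def conv_layer(In, W, M, N, R, C, K1, K2, S1, S2):
    # Out[m][r][c] = sum over n, i, j of W[m][n][i][j] * In[n][S1*r+i][S2*c+j]
    Out = [[[0.0] * C for _ in range(R)] for _ in range(M)]
    for m in range(M):                       # output feature maps
        for r in range(R):                   # output rows
            for c in range(C):               # output columns
                acc = 0.0
                for n in range(N):           # input feature maps
                    for i in range(K1):      # kernel rows
                        for j in range(K2):  # kernel columns
                            acc += W[m][n][i][j] * In[n][S1 * r + i][S2 * c + j]
                Out[m][r][c] = acc
    return Out
```

For example, a single 2x2 all-ones kernel sliding with stride 1 over a 3x3 all-ones map yields a 2x2 output of 4.0s.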
Pooling (POOL) layers are used to achieve spatial invariance by sub-sampling neighboring pixels, usually finding the maximum value in a neighborhood in each input feature map. So in a pooling layer, the number of output feature maps is identical to that of input feature maps, while the dimensions of each feature map scale down according to the size of the sub-sampling window.
Activation (ReLU) layers are used to adopt an activation function (e.g., a ReLU function) on each pixel of the feature maps from previous layers to mimic the biological neuron's activation [8].
Fully connected (FCN) layers are used to make final predictions. An FCN layer takes "features" in the form of a vector from a prior feature extraction layer, multiplies a weight matrix, and outputs a new feature vector, whose computation pattern is a dense matrix-vector multiplication. A few cascaded FCNs finally output the classification result of CNN. Sometimes, multiple input vectors are processed simultaneously in a single batch to increase the overall throughput, as shown in the following expression when the batch size h is greater than 1. Note that the FCN layers are also the major components of deep neural networks (DNN) that are widely used in speech recognition.
Out[m][h] = Σ_{n=0}^{N} Wght[m][n] * In[n][h];    (1)
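Likewise, the FCN computation is a dense matrix-vector product, or a matrix-matrix product once the batch size h exceeds 1. A minimal sketch (function and argument names are illustrative):

```python
def fcn_layer(Wght, In, M, N, H):
    # Out[m][h] = sum over n of Wght[m][n] * In[n][h]
    # With batch size H > 1 this is a dense matrix-matrix multiply;
    # with H == 1 it degenerates to a matrix-vector multiply.
    Out = [[0.0] * H for _ in range(M)]
    for m in range(M):
        for h in range(H):
            Out[m][h] = sum(Wght[m][n] * In[n][h] for n in range(N))
    return Out
```

Batching amortizes the cost of streaming the (large) weight matrix from memory across several inputs, which is why it raises throughput.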
B. Analysis of Real-Life CNNs
State-of-the-art CNNs for large visual recognition tasks usually contain billions of neurons and show a trend to go deeper and larger. Table I lists some of the CNN models that have won the ILSVRC (ImageNet Large-Scale Visual Recognition Challenge) …
Dimension #1: Application Drivers
image processing [ISPASS'11] deep learning [ICCAD'16] genomics [D&T'17]
E.g., a convolutional accelerator on-chip
Kernel: convolutional matrix-multiplication
[Diagram: weights, inputs, and outputs stream between DRAM and the accelerator; multipliers (Input 0/1 × Weight 0..3) feed adder trees that produce Output 0/1]
Optimization techniques:
#1 Caching
#2 Customized pipeline
#3 Parallelization
#4 Double buffering
#5 DRAM re-organization
#6 Precision customization
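Technique #4, double buffering, overlaps the DRAM transfer of the next tile with computation on the current one. A minimal ping-pong sketch (the function and its arguments are illustrative, not from the talk; sequential Python can only model the buffer rotation, not the actual concurrency):

```python
def process_tiles(tiles, load, compute):
    """Ping-pong (double) buffering: while compute() works on the tile in
    one buffer, the next tile is loaded into the other buffer, so memory
    transfer latency is hidden behind computation."""
    results = []
    buffers = [None, None]
    buffers[0] = load(tiles[0])                # prologue: fill the first buffer
    for k in range(len(tiles)):
        cur = buffers[k % 2]
        if k + 1 < len(tiles):
            # In hardware this load runs concurrently with compute() below.
            buffers[(k + 1) % 2] = load(tiles[k + 1])
        results.append(compute(cur))
    return results

# e.g. process_tiles([1, 2, 3], load=lambda t: t * 10, compute=lambda b: b + 1)
```

With enough compute work per tile, the steady-state cost is max(load, compute) per tile rather than their sum.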
Dimension #1: Application Drivers
image processing [ISPASS'11] deep learning [ICCAD'16] genomics [D&T'17]
[Bar chart: GFLOPS of successive design versions: 3.2, 0.005, 1.8, 7.6, 36.5, 68.3, and 100 GFLOPS; the 8-bit fixed-point design reaches 1.46 TOPS]
✓ Programmed in Xilinx High-Level Synthesis (HLS)
✓ Results collected on an Alpha Data PCIe-7v3 FPGA board
Dimension #2: Accelerator-Rich Architectures (ARA)
GAM: Global Accelerator Manager
SPM: Scratchpad Memory
ISA extension
[Diagram: overview of an accelerator-rich architecture. Accelerators Acc1…Accn (each with SPM and DMA) and cores C1…Cm (each with an L1 cache) connect through a customizable network-on-chip to the GAM, a TLB, the shared LLC, and the DRAM controller]
Dimension #2: Accelerator-Rich Architectures (ARA)
ARA modeling:
✓ PARADE simulator: gem5 + HLS [ICCAD'15]
✓ Fast ARAPrototyper flow on FPGA-SoC [arXiv'16]
Multicore modeling:
✓ Transformer simulator [DAC'12, LCTES'12]
ARA optimization:
✓ Sources of accelerator gains [FCCM'16]
✓ CPU-Acc co-design: address translation for a unified memory space, 7.6x speedup, 6.4% gap to ideal [HPCA'17 best paper nominee]
✓ AIM: near-memory acceleration gives another 4x speedup [MemSys'17]
More information in the ISCA'15 & MICRO'16 tutorials: http://accelerator.eecs.harvard.edu/micro16tutorial/
PARADE is open source: http://vast.cs.ucla.edu/software/parade-ara-simulator
Dimension #3: Accelerator-Rich Systems
Accelerator designer (e.g., FPGA)
Cloud service provider w/ accelerator-enabled cloud
Big data application developer (e.g., Spark)
Easy accelerator registration into the cloud
Easy and efficient accelerator invocation and sharing
Accelerator-as-a-Service: the Blaze prototype shows 1 server w/ FPGA ~= 3 CPU servers [HotCloud'16, ACM SOCC'16]
Blaze works with Spark and YARN and is open source: https://github.com/UCLA-VAST/blaze
CPU-FPGA platform choice [DAC'16]: 1) mainstream PCIe, 2) coherent PCIe (CAPI), or 3) Intel-Altera HARP (coherent, in one package)
Dimension #4: Compiler Support
[ICS'14, TACO'15, ongoing]
Memory system improvement: a source-to-source compiler for coordinated data prefetching; 1.5x speedup on the Xeon Phi many-core processor
Accelerator-Rich Architectures & Systems
Future work
Overview of My Research
Application drivers: image processing [ISPASS'11], deep learning [ICCAD'16], genomics [D&T'17]
Accelerator-Rich Architectures: PARADE [ICCAD'15], ARAPrototyper [arXiv'16], Transformer [DAC'12, LCTES'12], sources of gains [FCCM'16], CPU-Acc address translation [HPCA'17 best paper nominee], near-memory acceleration [MemSys'17], tutorials [ISCA'15 & MICRO'16]
Accelerator-Rich Systems: FPGA AaaS: Blaze [HotCloud'16, ACM SOCC'16]; CPU-FPGA: PCIe or QPI? [DAC'16]
Compiler support: memory system improvement [ICS'14, TACO'15, ongoing]
Tool: system-level automation
Chip-Level CPU-Accelerator Co-design: Address Translation for Unified Memory Space
[HPCA'17 Best Paper Nominee]
Better programmability and performance
Virtual Memory and Address Translation 101
Virtual memory and its benefits:
§ Shared memory for multi-process
§ Memory isolation for security
§ Conceptually more memory
[Diagram: Core → MMU with TLB → Memory; per-process virtual memory maps to physical memory through address translation]
Memory Management Unit (MMU): virtual-to-physical address translation
Translation Lookaside Buffer (TLB): caches address translation results
Page table: virtual-to-physical address mapping at page granularity
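The translation path just described can be modeled in a few lines (page size, the table's contents, and the unbounded TLB dictionary are illustrative assumptions):

```python
PAGE_SIZE = 4096  # 4 KB pages

page_table = {0: 7, 1: 3, 2: 9}   # virtual page number -> physical frame number
tlb = {}                          # caches recent translations

def translate(vaddr):
    vpn, offset = divmod(vaddr, PAGE_SIZE)
    if vpn in tlb:                # TLB hit: no page-table access needed
        pfn = tlb[vpn]
    else:                         # TLB miss: walk the page table, then cache
        pfn = page_table[vpn]     # a missing key would model a page fault
        tlb[vpn] = pfn
    return pfn * PAGE_SIZE + offset

# translate(4100): page 1, offset 4 -> frame 3 -> 3*4096 + 4 = 12292
```

The page offset passes through unchanged; only the page number is remapped, which is what lets a TLB of a few dozen entries cover megabytes of hot data.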
Inefficiency in Today's ARA Address Translation
[Diagram: cores (each with MMU and TLB) and accelerators (each with scratchpad, datapath, DMA, and TLB) share an interconnect to main memory; accelerator translations go through an IOMMU with a small IOTLB (e.g., 32 entries)]
#1 Inefficient TLB support: TLBs are not specialized to provide low latency and capture page locality
#2 High page walk latency: on an IOTLB miss, 4 main memory accesses are required to walk the page table
[Bar chart: performance relative to ideal address translation under the IOMMU, across medical imaging (Deblur, Denoise, Regist., Segment.), commercial (Black., Stream., Swapt.), vision (DispMap, LPCIP), and navigation (EKFSLAM, RobLoc) benchmarks, plus gmean]
The IOMMU only achieves 12% of the performance of ideal address translation
Must provide efficient address translation support
[Line chart: gmean performance relative to ideal address translation versus translation latency, from 0 to 1024 cycles]
Accelerator Performance Is Highly Sensitive to Address Translation Latency
Opportunities for relatively simple TLB and page walker designs
Characteristic #1: Regular Bulk Transfer of Consecutive Data (Pages)
[Plot: TLB miss behavior of the BlackScholes benchmark, showing accesses to the consecutive pages of one large memory reference]
A shared TLB can be very helpful
Characteristic #2: Impact of Data Tiling, Breaking a Page Across Multiple Accelerators
Original: 32 × 32 × 32 data array; rectangular tiling: 16 × 16 × 16 tiles
[Diagram: pages 0 through 31 of the array; each tile is mapped to a different accelerator for parallel processing, but one page is split across 4 accelerators!]
Our Two-Level TLB Design
[Diagram: each accelerator has a private TLB; misses go to a shared TLB, which forwards to the IOMMU]
✓ 32-entry private TLB per accelerator
✓ 512-entry shared TLB
The utilization wall limits the number of simultaneously powered accelerators
[Bar chart: performance relative to ideal address translation for the IOMMU, private TLB, and two-level TLB designs across the benchmark suite]
Still only achieves half the ideal performance => need to improve the page walker design
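The two-level lookup can be sketched as follows (entry counts come from the slide; the LRU dictionaries and the miss-handler callback are illustrative):

```python
from collections import OrderedDict

class Tlb:
    """Small fully-associative TLB with LRU replacement."""
    def __init__(self, entries):
        self.entries = entries
        self.map = OrderedDict()          # vpn -> pfn, oldest first

    def lookup(self, vpn):
        if vpn in self.map:
            self.map.move_to_end(vpn)     # refresh LRU position
            return self.map[vpn]
        return None

    def fill(self, vpn, pfn):
        self.map[vpn] = pfn
        self.map.move_to_end(vpn)
        if len(self.map) > self.entries:
            self.map.popitem(last=False)  # evict the LRU entry

def translate_two_level(vpn, private, shared, walk_page_table):
    pfn = private.lookup(vpn)             # level 1: 32-entry private TLB
    if pfn is None:
        pfn = shared.lookup(vpn)          # level 2: 512-entry shared TLB
        if pfn is None:
            pfn = walk_page_table(vpn)    # miss in both: full page walk
            shared.fill(vpn, pfn)
        private.fill(vpn, pfn)
    return pfn

# e.g. private, shared = Tlb(32), Tlb(512)
```

The shared level is what recovers the tiling case above: once one accelerator walks a split page, its siblings hit in the shared TLB instead of walking again.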
Page Walker Design Alternatives
#1 Improve the IOMMU design to reduce page walk latency
§ Needs a more complex IOMMU, e.g., a GPU MMU with parallel page walkers [Power, HPCA'14]
#2 Leverage the MMU of the host core that launches the accelerators
§ Very simple and efficient, as the host core has an MMU cache & data cache
[Diagram: 4-level page walk of a 64-bit virtual address (L4/L3/L2/L1 indices plus page offset), starting from the page table base address in CR3 and accelerated by the MMU cache; one data cache line prefetches entries for 8 consecutive pages]
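The 4-level x86-64 walk splits the 48-bit virtual address into four 9-bit table indices plus a 12-bit page offset. A sketch (index names follow the slide's L4..L1):

```python
def split_va(vaddr):
    """Split a 48-bit virtual address into 4-level page-walk indices.

    Each level indexes a 512-entry (9-bit) table; the low 12 bits are the
    offset within a 4 KB page. One 64-byte cache line of the last-level
    table holds 8 consecutive 8-byte entries, which is the prefetch
    effect mentioned on the slide: fetching one entry brings in the
    translations for 8 consecutive pages."""
    offset = vaddr & 0xFFF
    l1 = (vaddr >> 12) & 0x1FF
    l2 = (vaddr >> 21) & 0x1FF
    l3 = (vaddr >> 30) & 0x1FF
    l4 = (vaddr >> 39) & 0x1FF
    return l4, l3, l2, l1, offset

# Consecutive 4 KB pages differ only in the L1 index, so the walker's
# upper-level lookups (L4..L2) can be served by the MMU cache.
```

This is why the host-core path is attractive for the bulk-transfer pattern of Characteristic #1: consecutive pages reuse both the MMU cache and the data cache line.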
Final Proposal: Two-Level TLB + Host Page Walk [HPCA'17 Best Paper Nominee]
[Diagram: accelerators with private TLBs share a TLB whose misses are forwarded to the MMU of the host core that launched the accelerators]
[Bar chart: performance relative to ideal address translation for the IOMMU, private TLB, two-level TLB, and two-level TLB + host page walk designs across the benchmark suite]
On average: 7.6x speedup over the naïve IOMMU design, with only a 6.4% gap to ideal translation
Datacenter-Level: Deploying FPGA Accelerators at Cloud Scale
Deploying Accelerators in Datacenters
Accelerator designer (e.g., FPGA)
Cloud service provider
Big data application developer (e.g., Spark)
How to install my accelerators…?
How to acquire accelerator resources…?
How to program with your accelerators…?
Programming challenges:
✓ Java/Scala vs. OpenCL/C/C++
✓ Explicit accelerator sharing by multiple threads & apps
Performance challenges:
✓ JVM-to-accelerator communication overhead
✓ FPGA reconfiguration overhead
Blaze Proposal: Accelerator-as-a-Service [ACM SOCC'16, C-FAR Best Demo Award 3/49]
[Diagram: big data applications (e.g., Spark programs) call Blaze programming APIs; a client talks to the YARN Resource Manager (RM) and Application Master (AM); each Node Manager (NM) runs containers alongside a Node Accelerator Manager (NAM) fronting the FPGA, all coordinated by the Global Accelerator Manager (GAM)]
GAM: Global Accelerator Manager, accelerator-centric scheduling
NAM: Node Accelerator Manager, local accelerator service
RM: Resource Manager; NM: Node Manager; AM: Application Master
Blaze works with Apache Spark and YARN; open source: https://github.com/UCLA-VAST/blaze
Blaze Programming Overview
Big data application (e.g., Spark programs)
[Diagram: applications submit ACC labels to the Global ACC Manager and get back container info; the Node ACC Manager returns ACC info and handles ACC invocation with input/output data on FPGA, GPU, or other ACCs]
Register accelerators:
§ APIs to add an accelerator service to the corresponding nodes
Request accelerators:
§ APIs to invoke accelerators through acc_id
§ GAM allocates the corresponding nodes to applications
Transparent and Efficient Accelerator Sharing
[Diagram: an application scheduler feeds apps into per-platform queues; a task scheduler dispatches per-application task queues onto FPGA1 and FPGA2]
#1 Overlapping (pipelining) computation and communication from multiple threads
#2 Data caching on FPGA device memory
#3 Delayed scheduling: the same logical tasks are scheduled to the same FPGA to avoid reprogramming
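Technique #3 amounts to keeping a logical-function-to-FPGA affinity map. A minimal sketch (the round-robin fallback and all names are illustrative, not Blaze's actual policy):

```python
def schedule(tasks, fpgas):
    """Delayed (affinity) scheduling: tasks with the same logical function
    go to the FPGA already programmed with that function's bitstream,
    avoiding costly reconfiguration; unseen functions fall back to
    round-robin placement."""
    affinity = {}                 # logical function -> fpga id
    placements, next_fpga = [], 0
    for func in tasks:
        if func not in affinity:  # first occurrence: pick an FPGA and pin it
            affinity[func] = fpgas[next_fpga % len(fpgas)]
            next_fpga += 1
        placements.append((func, affinity[func]))
    return placements

# schedule(["LR", "KM", "LR", "LR", "KM"], ["fpga1", "fpga2"])
# keeps every "LR" task on fpga1 and every "KM" task on fpga2.
```

Since FPGA reconfiguration takes orders of magnitude longer than a task, pinning functions this way is what makes sharing one board across applications practical.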
CDSC FPGA-Accelerated Cluster
A 22-node cluster with FPGA-based accelerators: 1 master/driver, 20 workers, 1 file server, 1 10GbE switch
Each node:
1. Two Xeon processors
2. One FPGA PCIe card (Alpha Data)
3. 64 GB RAM
4. 10GbE NIC
Alpha Data board:
1. Virtex-7 FPGA
2. 16 GB on-board RAM
Spark: computation framework, an in-memory MapReduce system
HDFS: distributed storage framework
Programming Efforts Reduction with Blaze
Measured in lines of code (LOC) reduction for accelerator management:
• Logistic Regression: 325
• K-Means: 364
Computational genomics:
• Genome Sequence Alignment [HotCloud'16]: 896
• Genome Compression: 360
Performance of a Single Accelerator with Blaze
[Bar chart: speedup of the FPGA task over the CPU (up to 4.3x) for Logistic Regression, K-Means, Genome Sequence Alignment, and Genome Compression, comparing naïve, pipelining, and pipelining + caching versions against the CPU baseline]
With Blaze, a server with an FPGA can replace 1.7~4.3 CPU servers while providing the same throughput
Performance of Multi-Accelerator Scheduling w/ Blaze
[Line chart: normalized throughput versus the ratio of LR in mixed Logistic Regression & K-Means workloads (from 1 down to 0), comparing:]
• Theoretical optimal
• Static partition: half the nodes for LR, half for KM
• CPU-sharing: default scheduling policy for CPUs
• Blaze-GAM: accelerator-centric delayed scheduling
Static or CPU-style sharing cannot handle dynamic workload distributions; Blaze-GAM performs well in most cases.
Summary So Far
Great promise of accelerator-rich architectures and systems
§ Orders-of-magnitude performance and energy gains in customized chips
§ Severalfold consolidation of datacenter size with commodity FPGAs
My contributions to chip-level Accelerator-Rich Architectures
§ Developed the open-source ARA simulator PARADE
§ Analyzed sources of performance gains for customized accelerators
§ Proposed an efficient and unified address translation scheme for ARA
My contributions to datacenter-level accelerator deployment
§ Proposed accelerator-as-a-service in the cloud
§ Contributed the open-source Blaze system
Lots of opportunities to be explored…
When the Internet-of-Things (IoT) Marries Accelerators
IoT devices are very sensitive to power/energy consumption
The IoT cloud handles big data for real-time analytics
Customizable chips and customized datacenters: a trillions-of-dollars market
Communication costs more energy than computation in IoT, especially after acceleration
Communication-Efficient Accelerator-Rich IoT (CearIoT)
IoT devices: local low-power accelerators to preprocess data (e.g., filtering, compression)
Regional edge devices: simple processing & data aggregation (e.g., genome to variants, image to neural bits, request aggregation)
Cloud: large-scale data processing with customized datacenters; near-memory/storage computing for big data
Communication-Efficient Accelerator-Rich IoT (CearIoT)
#1 Architecture support
#2 Programming support
#3 Runtime support
#4 Security support
Lots of opportunities…