A 481pJ/decision 3.4M decision/s Multifunctional Deep In-memory Inference Processor
using Standard 6T SRAM Array
Mingu Kang, Sujan Gonugondla, Ameya Patil, and Naresh Shanbhag
Dept. of Electrical and Computer Engineering, University of Illinois at Urbana-Champaign
Abstract
This paper describes a multi-functional deep in-memory processor for inference applications. Deep in-
memory processing is achieved by embedding pitch-matched low-SNR analog processing into a standard
6T 16KB SRAM array in 65 nm CMOS. Four applications are demonstrated. The prototype achieves up to
5.6X (9.7X estimated for multi-bank scenario) energy savings with negligible (≤1%) accuracy degradation
in all four applications as compared to the conventional architecture.
arXiv:1610.07501v1 [cs.AR] 24 Oct 2016
Emerging inference applications require processing of huge data volumes [1]. A conventional
inference architecture (Fig. 1) implements memory access, data transfer from memory to processor, data
aggregation, and slicing. In such architectures, memory access energy dominates, e.g., an 8-b SRAM read
access and an 8-b MAC consume 5 pJ and 1 pJ in 65 nm CMOS, respectively. Additionally, the memory-
processor interface presents a severe throughput bottleneck. The deep in-memory signal processing concept
was proposed in [2] to overcome these challenges by embedding mixed-signal processing in the periphery
of the SRAM bit-cell array (BCA). However, an IC implementation needs to address a host of new
challenges including the stringent row & column pitch-matching requirements imposed by the BCA without
altering its storage density or its read/write functionality, and enabling multiple functions with mixed-signal
circuitry. Recently, a single-function 5×1-b in-memory classifier IC was demonstrated [3].
The proposed deep in-memory inference architecture has four stages (Fig. 1): 1) multi-row
functional read (MR-FR), 2) bit-line (BL) processing (BLP), 3) cross BL processing (CBLP), and 4) ADC and
slicing. The MR-FR accesses multiple rows in one pre-charge cycle using pulse-width modulated word-line
(PWM-WL) signals to generate a BL voltage drop proportional to a weighted sum of multiple bits stored in
multiple rows in the column, and also performs word-level add/subtract. The BLP implements
reconfigurable column pitch-matched mixed-signal circuits to execute computations such as
multiply/absolute value/comparison on the BL voltages, in a massively column-parallel fashion. The CBLP
aggregates the BLP outputs into a scalar which is sliced to obtain the final decision. The BLP and CBLP
can be reconfigured to operate the architecture in either a dot product (DP) mode or Manhattan distance
(MD) mode. Reconfigurable stages enable multiple functions (Fig. 1 table) including normal read/write.
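As a purely functional reference for the two modes described above, the following Python sketch computes the dot product and the Manhattan distance digitally and then slices the result into a decision. The function names, the threshold argument, and the choice of minimum-distance slicing are illustrative assumptions; the chip evaluates these quantities with analog BL voltages rather than integer arithmetic.

```python
from typing import Sequence


def dot_product(d: Sequence[int], p: Sequence[int]) -> int:
    """DP mode, functionally: sum of D_i * P_i over all elements."""
    return sum(di * pi for di, pi in zip(d, p))


def manhattan_distance(d: Sequence[int], p: Sequence[int]) -> int:
    """MD mode, functionally: sum of |D_i - P_i| over all elements."""
    return sum(abs(di - pi) for di, pi in zip(d, p))


def threshold_decision(d: Sequence[int], p: Sequence[int], threshold: int) -> int:
    """Slice a DP-mode scalar into a binary decision (threshold is hypothetical)."""
    return 1 if dot_product(d, p) >= threshold else 0


def nearest_template(templates: Sequence[Sequence[int]], p: Sequence[int]) -> int:
    """Slice MD-mode scalars by picking the index of the smallest distance."""
    distances = [manhattan_distance(t, p) for t in templates]
    return distances.index(min(distances))
```

For example, `nearest_template([[0, 0], [10, 10]], [9, 8])` selects the second template because its Manhattan distance to the query is 3 rather than 17.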
The chip architecture (Fig. 2) includes a digital controller (CTRL) and a CORE. The normal
read/write circuitry (lower) and in-memory processing blocks (upper) are physically separated to maintain
functionality. Functional WL drivers generate PWM-WL signals while the reconfiguration word (RCFG)
reconfigures local controllers in the CTRL. The in-memory processing chain is pipelined to enable BL pre-
charge when the MR-FR step is complete. The architecture processes 128 8-b words per access cycle
requiring two consecutive access cycles to process a 256-dimensional vector with 8-b elements. Thus, two
consecutive CBLP outputs are sampled on different sampling capacitors and charge-shared before
conversion by the ADC. Four slow but energy-efficient 8-b single-slope ADCs operate in parallel to
process 36 128-dimensional vectors/µs.
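The two-cycle handling of a 256-dimensional vector can be sketched in Python as follows. The sketch assumes the two sampling capacitors are equal, so that charge sharing yields the average of the two sampled values; that equality, and all names used here, are assumptions for illustration and are not stated in the text.

```python
WORDS_PER_ACCESS = 128  # 8-b words processed per access cycle (from the text)


def split_into_access_cycles(vector):
    """Break a vector into chunks that each fit one access cycle."""
    return [vector[i:i + WORDS_PER_ACCESS]
            for i in range(0, len(vector), WORDS_PER_ACCESS)]


def charge_share(samples):
    """Idealized charge sharing across equal capacitors: the mean of the samples."""
    return sum(samples) / len(samples)


def two_cycle_dot_product(d, p):
    """Average of per-cycle partial dot products, as an equal-capacitor model."""
    partials = [sum(x * y for x, y in zip(dc, pc))
                for dc, pc in zip(split_into_access_cycles(d),
                                  split_into_access_cycles(p))]
    return charge_share(partials)
```

Under this model the shared value is the full dot product divided by the number of access cycles, so the scalar seen by the ADC remains proportional to the 256-dimensional result.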
The MR-FR step (Fig. 3) generates BL swing ΔVBL proportional to binary-weighted bits (di) in a
column [2] via the use of PWM-WL signals. A precision of 8 b is chosen for both the array data (𝐷) and the
streamed input (𝑃) to satisfy the requirements of many inference applications. The longest PWM-WL pulse width with
VWL < VDD is chosen to be less than 40% of the BL RC time constant in order to ensure sufficient linearity and
prevent destructive read [2]. The shortest pulse width needs to be <250 ps while driving a large RC WL,
which is challenging due to the row pitch-matching constraints. Hence, a sub-ranged read is proposed where
the 4 MSBs and 4 LSBs are stored in adjacent columns (a column pair) and read simultaneously on BLMSB and
BLLSB. Then, the charge on BLMSB is shared with 1/16 of the BLLSB charge via switches øcon and ømerge.
Capacitors attached to the BLs enable fine-tuning of the 1/16 capacitance ratio. The sub-ranged MR-FR (Fig. 3)
achieves a maximum INL of 0.03 LSB.
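The arithmetic behind the sub-ranged read can be checked with a short Python sketch. It is an idealized model that assumes binary-weighted pulse widths (pulse i proportional to 2^i) and a perfectly linear BL discharge; the function names are hypothetical. Under these assumptions, merging the MSB value with 1/16 of the LSB value reproduces the full 8-b word scaled by 1/16, while the required ratio between the longest and shortest pulse drops from 128:1 to 8:1.

```python
def split_nibbles(word: int):
    """Split an 8-b word into its 4 MSBs and 4 LSBs (stored in a column pair)."""
    if not 0 <= word <= 255:
        raise ValueError("word must fit in 8 bits")
    return word >> 4, word & 0xF


def pwm_weighted_sum(bits_msb_first):
    """Ideal BL swing (arbitrary units) for binary-weighted word-line pulses."""
    n = len(bits_msb_first)
    return sum(bit << (n - 1 - i) for i, bit in enumerate(bits_msb_first))


def pulse_width_ratio(n_bits: int) -> int:
    """Longest-to-shortest pulse width ratio for n binary-weighted pulses."""
    return 2 ** (n_bits - 1)


def sub_ranged_read(word: int) -> float:
    """Merge the MSB swing with 1/16 of the LSB swing (ideal charge sharing)."""
    msb, lsb = split_nibbles(word)
    return msb + lsb / 16
```

For instance, `sub_ranged_read(0xAB)` gives 10 + 11/16 = 10.6875, which equals 171/16.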
In the MD mode, the MR-FR enables D-to-A conversion, and replica cell read performs word-level
add/subtract by reading 𝑃 (P̄ for subtract) from the replica bit-cell array simultaneously with 𝐷 (Fig. 3)
[2]. The replica bit-cell array stores the streamed data 𝑃 and can be written directly via the write BL (WBL),
reducing energy and latency overheads.
The BLP (Fig. 4) can be reconfigured to operate in either the DP or the MD mode for dot product
or absolute-value computation, respectively. In the MD mode, an analog comparator and a mux are used to obtain
the absolute value, and the multiplier circuit is reconfigured as a BL-wise sampler. In the DP mode, the
comparator is bypassed and BLB is chosen by the mux. The mixed-signal capacitive multiplier [4] uses
identical bit capacitors to meet the column pitch constraints, necessitating sequential processing of
multiplicand bits (pi) and thereby limiting the throughput. Sub-ranged multiplication alleviates this problem
by employing two 4-b MSB/LSB multipliers operating in parallel. The BLP outputs are charge-shared in the
CBLP, sampled, and later converted into a digital value by the ADC. In the CBLP, the MSB and LSB
rails are charge-shared by first opening øcon_rail and then closing ømerge_rail to obtain the final output as a
weighted sum. The measured accuracy of BLP and CBLP (including MR-FR) (Fig. 4) shows that the
maximum error magnitude in the DP (MD) mode is 5.8% (8.6%) of output dynamic range.
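The two BLP operations can be modeled numerically with the Python sketch below. It assumes, for illustration only, that the sub-ranging splits the streamed input P into 4-b MSB and LSB halves that are multiplied by D in parallel and merged with a 1/16 weight, and that the comparator and mux simply select the larger of the two differential swings, following the relation max(P-D, D-P) = |D-P| shown in Fig. 4. All names are hypothetical and the model ignores analog non-idealities.

```python
def abs_diff_via_comparator(d: int, p: int) -> int:
    """MD mode: pick the larger of the two complementary swings."""
    swing_bl = p - d    # models a BL swing proportional to P - D
    swing_blb = d - p   # models a BLB swing proportional to D - P
    # The comparator output drives the mux select line.
    return swing_bl if swing_bl > swing_blb else swing_blb


def sub_ranged_multiply(d: int, p: int) -> float:
    """DP mode: two 4-b multiplications merged with a 1/16 LSB weight."""
    p_msb, p_lsb = p >> 4, p & 0xF
    msb_rail = d * p_msb   # result of the MSB multiplier
    lsb_rail = d * p_lsb   # result of the LSB multiplier
    return msb_rail + lsb_rail / 16
```

In this idealized model `sub_ranged_multiply(d, p)` equals `d * p / 16` for every pair of 8-b operands, so the merged rail value stays proportional to the full 8-b product.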
The proposed architecture requires 16X fewer read accesses (and precharges) as compared to
the conventional architecture for a fixed volume of data, resulting in up to 5.8X throughput enhancement.
This is because MR-FR and BLP process data in a massively parallel manner (128 8-b words per precharge),
whereas the normal SRAM mode fetches only 8 8-b words through 4:1 column muxing. A smaller ΔVBL and
fewer read accesses reduce data access energy. Charge redistribution-based low-swing computation adds
to the energy savings.
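The 16X figure follows from the array geometry, as the short Python calculation below illustrates. The constants come from the text (256 columns, 8-b words, 4:1 column muxing, one column pair per word in functional read); the function names are hypothetical.

```python
ARRAY_COLUMNS = 256          # 512 x 256-b bitcell array
WORD_BITS = 8                # 8-b words
COLUMN_MUX_RATIO = 4         # 4:1 column muxing in normal read mode
COLUMNS_PER_WORD_MRFR = 2    # MSB column + LSB column (one column pair)


def words_per_normal_read() -> int:
    """Words fetched per precharge in normal SRAM mode."""
    return ARRAY_COLUMNS // COLUMN_MUX_RATIO // WORD_BITS


def words_per_functional_read() -> int:
    """Words processed per precharge by MR-FR and BLP."""
    return ARRAY_COLUMNS // COLUMNS_PER_WORD_MRFR


def access_reduction() -> int:
    """Ratio of read accesses needed for a fixed volume of data."""
    return words_per_functional_read() // words_per_normal_read()
```

With these numbers a normal read returns 8 words, a functional read covers 128 words, and the ratio is 16.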
Fig. 5 (left) indicates that CORE energy and decision accuracy trade off against each other as ΔVBL varies. For binary
(64-class) decisions, ΔVBL > 15 mV (> 25 mV) results in > 90% detection accuracy, and the CORE energy
reduces by 0.2 pJ (0.4 pJ) per 20 mV reduction in ΔVBL. The energy breakdown in Fig. 5 (right) indicates that
much of the savings is due to MR-FR. The CTRL energy will be amortized in a multi-bank scenario. The
measured energy in DP (MD) mode is 5.6× (3.7×) lower than that of the conventional architecture, with savings up
to 9.7× (5.4×) in a multi-bank scenario.
The multifunctional IC (Fig. 6) implements four different algorithms with 2-, 4-, and 64-class
decisions in a 512×256 SRAM array, achieving better decision accuracy than, and EDP (scaled to
65 nm) comparable to, single-function ICs [1,3]. The chip micrograph (Fig. 7) shows that the deep in-memory circuitry
incurs an area overhead of 25%, not counting the CTRL.
Acknowledgement
This work was supported by Systems on Nanoscale Information fabriCs (SONIC), one of the six SRC
STARnet Centers, sponsored by SRC and DARPA.
References
[1] H. Kaul, et al., “A 21.5M-Query-Vectors/s 3.37nJ/Vector Reconfigurable k-Nearest-Neighbor Accelerator with Adaptive Precision in 14nm Tri-Gate CMOS,” ISSCC Dig. Tech. Papers, pp. 260-261, 2016.
[2] M. Kang, et al., “An Energy-efficient VLSI Architecture for Pattern Recognition via Deep Embedding of Computation in SRAM,” ICASSP, pp. 8326-8330, 2014.
[3] J. Zhang, et al., “A Machine-learning Classifier Implemented in a Standard 6T SRAM Array,” Dig. Symp. VLSI Circuits, June 2016.
[4] M. Kang, et al., “An Energy-efficient Memory-based High-throughput VLSI Architecture for Convolutional Networks,” ICASSP, pp. 1037-1041, 2015.
[Figure 1 graphic. Conventional path: single-row access of the bitcell array (sense amp, column mux), data transfer over a B-bit bus, digital processor for aggregation, slicing to a decision. In-memory path: multi-row functional read (MR-FR), bit-line processing (BLP), cross BL processing (CBLP), A/D, slicing. Panel "Multi-functions in each stage" lists normal read, digital-to-analog conversion, add/subtract, multiply, BL sample, compare/absolute value, sub-ranging, add, min/max, and linear combination. Panel "Configurations of each stage for 4 algorithms" covers support vector machine (SVM), k-nearest neighbor (KNN), template matching (TM), and matched filter (MF) in dot product (DP) mode (ΣDi Pi) and Manhattan distance (MD) mode (Σ|Di – Pi|). B: bit precision.]
Figure 1: Conventional and proposed multi-functional deep in-memory architecture for inference applications.
[Figure 2 graphic. CORE: 512 × 256-b 6T SRAM bitcell array with BL0/BLB0 to BL255/BLB255, 4 × 256-b replica bitcell array with WBL[0] to WBL[255], functional WL drivers, X-decoders and pulse generators, reconfigurable analog processor consisting of BL processors (BLP: mult / comp / abs) and the cross BL processor (CBLP), Sampler[0:3], ADC[0:3], slicer, 16 × 128-b streamed input buffer register, and normal read/write circuitry (precharge, 4:1 column mux, sense amp/write driver, Y-decoder). CTRL: main, (C)BLP, ADC, R/W, and mode controllers with fetch and execute stages, instruction set register, serial-in RCFG word, and decision scan-out.]
Figure 2: Deep in-memory processor architecture.
[Figure 3 graphic. Circuit: 512 × 256 bitcell array organized in column pairs (MSB column and LSB column) holding words D0,0 to D127,127, with bits d7 to d4 and d3 to d0 on word lines WL0 to WL3 (VWL < VDD); 4 × 256 replica bitcell array holding P0 to P127 with bits p7 to p0 on WLP0 to WLP3, written through WBL_MSB, WBL_LSB, and WWL0 to WWL3. Replica cell write and read are used only in MD mode. Sub-range switches øcon and ømerge share charge between BL_MSB and BL_LSB using capacitances (1/16)CBL and (15/16)CBL.
Multi-row functional read (MR-FR): ΔVBL_M(L)SB ∝ DM(L)SB, ΔVBLB_M(L)SB ∝ DM(L)SB.
Sub-ranged read: ΔVBL_MERGE ∝ ΔVBL_MSB + ΔVBL_LSB/16, ΔVBLB_MERGE ∝ ΔVBLB_MSB + ΔVBLB_LSB/16.
MD mode with replica cell read: ΔVBL_M(L)SB ∝ PM(L)SB – DM(L)SB, ΔVBLB_M(L)SB ∝ DM(L)SB – PM(L)SB.
Plot: ΔVBLB (V) from MR-FR (0 to 0.6) and INL (LSB) (-0.2 to 0.2) versus 4-b DMSB (0 to 16).]
Figure 3: Sub-ranged multi-row functional read (MR-FR) and measured accuracy.
[Figure 4 graphic. BL processor (BLP), pitch matched to the SRAM bitcell: per column pair (column pair[0] to column pair[127]), a comparator (COMP_EN) and mux on BL_MSB/BLB_MSB followed by reconfigurable MSB 4-b and LSB 4-b multipliers with WBL inputs, controlled by DP_MODE, MD_MODE, and multiplier control signals. Cross BL processor (CBLP): MSB_Rail and LSB_Rail joined through øcon_rail and ømerge_rail to produce Vsample.
DP mode: VMUX_OUT = VBLB_MSB; Vsample = VMSB_Rail + VLSB_Rail/16; ΣDi Pi ∝ ∆VSample.
MD mode: VMUX_OUT = max(VBL_MSB, VBLB_MSB) ∝ max(P-D, D-P) = |D-P|; Vsample = VMSB_Rail + VLSB_Rail; Σ|Di – Pi| ∝ Vsample.
Plots: ∆VSample [V] (0 to 0.35) versus 8-b D (0 to 300) for 8-b P = 0, 50, 100, 150, 200, 250 in DP mode; VSample [V] (0.6 to 1) versus 8-b D (0 to 300) for 8-b P = 0, 50, 100, 150, 200, 250 in MD mode.]
Figure 4: Pitch-matched BL processing (BLP), cross BL processing (CBLP), and measured accuracy (@ D0 = D1 = … = D255, P0 =
P1 = … = P255).
[Figure 5 graphic. Left: probability of detection [%] (0 to 120) versus CORE energy per 8-b word [pJ] (0.5 to 1.1) for face detection (SVM) and face recognition (TM), with BL swing per LSB from 0 to 30 mV. Right: CORE energy breakdown from post-layout simulations, CORE energy per 8-b word [pJ] (0 to 9) for the conventional and proposed architectures in MD and DP modes, split into (in-memory) computation and READ (MR-FR), with annotated reductions of ×6.2 and ×10.4.]
Figure 5: Measured energy vs. detection accuracy trade-offs, and CORE energy breakdown.
Comparison with prior art:

Design         Tech. (nm)    # of algorithms                                Memory size        Precision (b)
This work      65 CMOS       4 (SVM, MF, KNN, TM)                           SRAM 512 × 256-b   8×8
8-b digital*   65 CMOS       synthesized dedicated processor per algorithm  SRAM 512 × 256-b   8×8
[1]††          14 Tri-gate   1 (KNN)                                        128 byte           8×8
[3]**          130 CMOS      1 (Adaboost)                                   SRAM 128 × 128-b   5×1

Design         Algorithm  Decision throughput  Decision energy   Decision EDP    Accuracy
                          (decisions/s)        (pJ/decision)     (fJ·s)
This work      SVM        9.3M                 963.1 / 462.4†    0.1 / 0.05†     95 %
This work      MF         18.5M                481.5 / 231.2†    0.03 / 0.01†    100 %
This work      TM         312.5K               33.6K / 17.5K†    107.3 / 56.0†   100 %
This work      KNN        312.5K               33.6K / 17.5K†    107.4 / 56.0†   92 %
8-b digital*   SVM        1.7M                 4.5K              2.6             96 %
8-b digital*   MF         3.4M                 2.2K              0.6             100 %
8-b digital*   TM         54.3K                93.0K             1715.3          100 %
8-b digital*   KNN        54.3K                93.0K             1715.3          90 %
[1]††          KNN        21.5M                3.4K              0.2             Not reported
[3]**          Adaboost   50M                  633.4             0.01            90 %

Applications, algorithms, and datasets (P: query input, D: data stored in array):

1. Face detection (binary class), support vector machine (SVM). MIT CBCL dataset.
   P: 23 × 22 8-b pixel image (face / non-face), 100 query inputs tested.
   D: feature extractor and classifier combined, 23 × 22 8-b word coefficients.
2. Event (gun shot) detection (binary class), matched filter (MF).
   P1: gun shot sound contaminated by AWGN with 3 dB SNR, or P2: only AWGN with equal power of "signal + AWGN" in P1; total 100 query inputs tested.
   D: gun shot mono sound data with 256 8-b words.
3. Face recognition (64 classes), template matching with Manhattan distance (TM). MIT CBCL dataset, 16 × 16 8-b pixel image for P and D.
   P: one of the 64 candidate faces in D, 64 query inputs tested.
   D: 64 candidate faces.
4. Hand-written number recognition (4 classes), k-nearest neighbor (KNN). MNIST dataset, 4 classes from "0" to "3" (due to array size limit), 16 × 16 8-b pixel image for P and D.
   P: image from 4 classes, 100 query inputs tested.
   D: 16 images per class.

* memory (digital) energy & delay measured from prototype IC (post-layout simulations); † assumes a 32 bank configuration;
†† single function with SRAM memory access cost not included; ** single function with 1b weight vector
Figure 6: Application level gains in energy efficiency, delay, accuracy, and comparison with prior arts.
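The EDP entries in Fig. 6 can be cross-checked from the energy and throughput columns, taking the delay per decision as the reciprocal of the decision throughput. The Python sketch below makes that assumption explicit; the function name is hypothetical, and small mismatches against the tabulated values are expected because the table entries are rounded.

```python
def edp_fj_s(energy_pj: float, throughput_per_s: float) -> float:
    """Energy-delay product in fJ*s, assuming delay = 1 / throughput."""
    energy_j = energy_pj * 1e-12          # pJ -> J
    delay_s = 1.0 / throughput_per_s      # seconds per decision (assumption)
    return energy_j * delay_s * 1e15      # J*s -> fJ*s


if __name__ == "__main__":
    # (label, energy in pJ/decision, throughput in decisions/s), from Fig. 6
    rows = [
        ("This work, SVM", 963.1, 9.3e6),
        ("This work, MF", 481.5, 18.5e6),
        ("This work, TM", 33.6e3, 312.5e3),
        ("8-b digital, SVM", 4.5e3, 1.7e6),
    ]
    for label, energy, throughput in rows:
        print(f"{label}: {edp_fj_s(energy, throughput):.3f} fJ*s")
```

The computed values land close to the tabulated 0.1, 0.03, 107.3, and 2.6 fJ·s, which supports reading the EDP column as energy per decision divided by decision throughput.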
Chip summary:

Technology             65 nm CMOS
Die size               1.2 mm × 1.2 mm
CTRL operating freq.   1 GHz
SRAM capacity          16 KB (1 bank of 512 × 256-b)
Bitcell dimension      2.11 × 0.92 um2
Supply voltage         CORE: 1.0 V, CTRL: 0.85 V

Energy per decision (pJ):
  SVM                  963.1
  Matched filter       481.5
  KNN                  33.6K
  Template matching    33.6K

Decision throughput (decisions/s):
  SVM                  1.7M
  Matched filter       3.4M
  KNN                  54.3K
  Template matching    54.3K

[Figure 7 graphic. Die micrograph, 1.2 mm × 1.2 mm, with labeled blocks: bitcell arrays, BLP & CBLP, digital CTRL, R/W, test block, sampler & ADC, input buffer register, decision, replica bitcell array, and WL driver & pulse gen.]
Figure 7: Die micrograph and chip summary.