A 481pJ/decision 3.4M decision/s Multifunctional Deep In ... · Mingu Kang, Sujan Gonugondla, Ameya...


A 481pJ/decision 3.4M decision/s Multifunctional Deep In-memory Inference Processor using Standard 6T SRAM Array

Mingu Kang, Sujan Gonugondla, Ameya Patil, and Naresh Shanbhag

Dept. of Electrical and Computer Engineering, University of Illinois at Urbana-Champaign

Abstract

This paper describes a multi-functional deep in-memory processor for inference applications. Deep in-memory processing is achieved by embedding pitch-matched low-SNR analog processing into a standard 6T 16KB SRAM array in 65 nm CMOS. Four applications are demonstrated. The prototype achieves up to 5.6X (9.7X estimated for multi-bank scenario) energy savings with negligible (≤1%) accuracy degradation in all four applications as compared to the conventional architecture.

arXiv:1610.07501v1 [cs.AR] 24 Oct 2016

Emerging inference applications require processing of huge data volumes [1]. A conventional inference architecture (Fig. 1) implements memory access, data transfer from memory to processor, data aggregation, and slicing. In such architectures, memory access energy dominates, e.g., an 8-b SRAM read access and an 8-b MAC consume 5 pJ and 1 pJ in 65 nm CMOS, respectively. Additionally, the memory-processor interface presents a severe throughput bottleneck. The deep in-memory signal processing concept was proposed in [2] to overcome these challenges by embedding mixed-signal processing in the periphery of the SRAM bit-cell array (BCA). However, an IC implementation needs to address a host of new challenges, including meeting the stringent row and column pitch-matching requirements imposed by the BCA without altering its storage density or its read/write functionality, and enabling multiple functions with mixed-signal circuitry. Recently, a single-function 5×1-b in-memory classifier IC was demonstrated [3].
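
As a rough illustration of why memory access dominates, the per-operation figures quoted above can be combined for an N-element dot product. This is a back-of-the-envelope sketch only: it assumes one 8-b read and one 8-b MAC per element and ignores data transfer and control energy. The function name and the 256-element example are illustrative and do not come from the paper.

```python
# Per-operation energies quoted in the text for 65 nm CMOS.
E_READ_PJ = 5.0  # one 8-b SRAM read access
E_MAC_PJ = 1.0   # one 8-b multiply-accumulate


def conventional_energy_pj(n_elements: int) -> dict:
    """Energy split for an n-element dot product (assumed: one read + one MAC per element)."""
    read = n_elements * E_READ_PJ
    mac = n_elements * E_MAC_PJ
    return {
        "read_pJ": read,
        "mac_pJ": mac,
        "read_fraction": read / (read + mac),
    }


if __name__ == "__main__":
    print(conventional_energy_pj(256))
```

Under these assumptions the read energy is five sixths of the total regardless of the vector length, which is the imbalance the in-memory approach targets.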

The proposed deep in-memory inference architecture has four stages (Fig. 1): 1) multi-row functional read (MR-FR), 2) bit-line (BL) processing (BLP), 3) cross BL processing (CBLP), and 4) ADC and slicing. The MR-FR accesses multiple rows in one pre-charge cycle using pulse-width modulated word-line (PWM-WL) signals to generate a BL voltage drop proportional to a weighted sum of multiple bits stored in multiple rows in the column, and also performs word-level add/subtract. The BLP implements reconfigurable column pitch-matched mixed-signal circuits to execute computations such as multiply, absolute value, and comparison on the BL voltages in a massively column-parallel fashion. The CBLP aggregates the BLP outputs into a scalar, which is sliced to obtain the final decision. The BLP and CBLP can be reconfigured to operate the architecture in either a dot product (DP) mode or a Manhattan distance (MD) mode. The reconfigurable stages enable multiple functions (Fig. 1 table), including normal read/write.
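
The two modes can be summarized by an idealized, purely digital behavioral model. The sketch below ignores all analog non-idealities. The threshold slicer in the DP mode and the minimum-distance slicer in the MD mode are assumptions made for illustration, and the function names are hypothetical.

```python
def dot_product_mode(d, p, threshold=0):
    """DP mode: aggregate sum(D_i * P_i), then slice against a threshold (assumed slicer)."""
    score = sum(di * pi for di, pi in zip(d, p))
    return score, score >= threshold


def manhattan_mode(d_rows, p):
    """MD mode: sum(|D_i - P_i|) per stored vector; pick the closest one (assumed slicer)."""
    distances = [sum(abs(di - pi) for di, pi in zip(row, p)) for row in d_rows]
    best = min(range(len(distances)), key=distances.__getitem__)
    return distances, best


if __name__ == "__main__":
    print(dot_product_mode([1, 2, 3], [4, 5, 6], threshold=30))
    print(manhattan_mode([[0, 0], [5, 5]], [4, 6]))
```

In this model the DP mode corresponds to the SVM and matched-filter computations and the MD mode to the template-matching and KNN computations named in the paper.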

The chip architecture (Fig. 2) includes a digital controller (CTRL) and a CORE. The normal read/write circuitry (lower) and in-memory processing blocks (upper) are physically separated to maintain functionality. Functional WL drivers generate PWM-WL signals while the reconfiguration word (RCFG) reconfigures local controllers in the CTRL. The in-memory processing chain is pipelined so that BL pre-charge can begin as soon as the MR-FR step is complete. The architecture processes 128 8-b words per access cycle, so a 256-dimensional vector with 8-b elements requires two consecutive access cycles. Thus, two consecutive CBLP outputs are sampled on different sampling capacitors and charge-shared before conversion by the ADC. Four 8-b single-slope ADCs, which are slow but energy-efficient, operate in parallel to process 36 128-dimensional vectors/µs.
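
The two-cycle handling of a 256-dimensional vector can be sketched as follows. The model assumes ideal charge sharing between two equal sampling capacitors, so that the shared value is the average of the two partial results. The equal-capacitor assumption and the function names are illustrative and are not stated in the paper.

```python
WORDS_PER_ACCESS = 128  # 8-b words processed per access cycle (from the text)


def partial_dot(d, p):
    """Idealized CBLP output for one access cycle."""
    return sum(di * pi for di, pi in zip(d, p))


def two_cycle_dot(d, p):
    """Process a 256-element vector as two 128-word access cycles, then charge-share."""
    assert len(d) == len(p) == 2 * WORDS_PER_ACCESS
    first = partial_dot(d[:WORDS_PER_ACCESS], p[:WORDS_PER_ACCESS])
    second = partial_dot(d[WORDS_PER_ACCESS:], p[WORDS_PER_ACCESS:])
    # Sharing charge between two equal capacitors averages the two samples (assumption).
    return (first + second) / 2


if __name__ == "__main__":
    d = list(range(256))
    p = [1] * 256
    print(two_cycle_dot(d, p), partial_dot(d, p) / 2)
```

The averaged value equals half of the full 256-element dot product, so the result stays proportional to the desired sum and the factor of two can be absorbed into the ADC reference or the decision threshold.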

The MR-FR step (Fig. 3) generates a BL swing ΔVBL proportional to the binary-weighted bits (d_i) in a column [2] via the use of PWM-WL signals. An 8-b precision is chosen for both the array data (D) and the streamed input (P) to satisfy the requirements of many inference applications. The longest PWM-WL pulse width with VWL < VDD is chosen to be less than 40% of the BL RC time constant in order to ensure sufficient linearity and prevent destructive reads [2]. The shortest pulse width needs to be <250 ps while driving a large RC WL, which is challenging due to the row pitch-matching constraints. Hence, a sub-ranged read is proposed in which the 4 MSBs and 4 LSBs are stored in adjacent columns (a column pair) and read simultaneously on BL_MSB and BL_LSB. Then, the charge on BL_MSB is shared with 1/16 of the BL_LSB charge via switches ø_con and ø_merge. Capacitors attached to the BLs enable fine-tuning of the 1/16 capacitance ratio. The sub-ranged MR-FR (Fig. 3) achieves a maximum INL of 0.03 LSB.
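
A minimal numerical sketch of the sub-ranged read is given below. It assumes an ideal, perfectly linear BL discharge in which each stored 1 contributes a swing proportional to its word-line pulse width. The function names and the per-LSB swing parameter are hypothetical.

```python
def pwm_pulse_widths(t_lsb, n_bits=4):
    """Binary-weighted word-line pulse widths, LSB row first."""
    return [t_lsb * (1 << i) for i in range(n_bits)]


def mr_fr_swing(nibble, dv_per_lsb):
    """Ideal BL swing for a 4-b value read with binary-weighted pulses."""
    bits = [(nibble >> i) & 1 for i in range(4)]
    return dv_per_lsb * sum(b << i for i, b in enumerate(bits))


def sub_ranged_read(word, dv_per_lsb):
    """Read the MSB and LSB nibbles on a column pair and merge them with a 1/16 weight."""
    msb, lsb = word >> 4, word & 0xF
    return mr_fr_swing(msb, dv_per_lsb) + mr_fr_swing(lsb, dv_per_lsb) / 16


if __name__ == "__main__":
    print(pwm_pulse_widths(1, 4), pwm_pulse_widths(1, 8)[-1])
    print(sub_ranged_read(0xA7, 1.0) * 16)
```

In this model the merged swing is exactly the 8-b word scaled by 1/16, while the ratio between the longest and shortest pulse drops from 128:1 for a direct 8-b read to 8:1 for each 4-b sub-range.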

In the MD mode, the MR-FR performs D-to-A conversion, and a replica cell read performs word-level add/subtract by reading P (P̄ for subtract) from the replica bit-cell array simultaneously with D (Fig. 3) [2]. The replica bit-cell array stores the streamed data P and can be written directly through the write BL (WBL), reducing energy and latency overheads.
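
The role of the complemented input in the subtract operation can be illustrated with ordinary 8-b arithmetic. The sketch below is a digital abstraction under the assumption that a functional read of two words yields a quantity proportional to their sum. The constant offset it exposes is a property of this abstraction and is not a measured circuit quantity.

```python
FULL_SCALE = 255  # largest 8-b value


def complement(p):
    """Bitwise inverse of an 8-b word, i.e. the value stored for a subtract."""
    return FULL_SCALE - p


def functional_add(d, p):
    """Idealized simultaneous read of D and P."""
    return d + p


def functional_subtract(d, p):
    """Idealized simultaneous read of D and the complement of P: (D - P) + 255."""
    return d + complement(p)


if __name__ == "__main__":
    print(functional_subtract(200, 50) - FULL_SCALE)
```

Removing the fixed offset of 255 recovers D - P, so the difference is available up to a known constant.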

The BLP (Fig. 4) can be reconfigured to operate in either the DP or the MD mode for dot product or absolute-value computation, respectively. In the MD mode, an analog comparator and a mux are used to obtain the absolute value, and the multiplier circuit is reconfigured as a BL-wise sampler. In the DP mode, the comparator is bypassed and BLB is chosen by the mux. The mixed-signal capacitive multiplier [4] uses identical bit capacitors to meet the column pitch constraints, which necessitates sequential processing of the multiplicand bits (p_i) and thereby limits the throughput. Sub-ranged multiplication alleviates this problem by employing two 4-b MSB/LSB multipliers operating in parallel. The BLP outputs are charge-shared in the CBLP, sampled, and later converted into a digital value by the ADC. In the CBLP, the MSB and LSB rails are charge-shared by first opening ø_con_rail and then closing ø_merge_rail to obtain the final output as a weighted sum. The measured accuracy of the BLP and CBLP (including MR-FR) (Fig. 4) shows that the maximum error magnitude in the DP (MD) mode is 5.8% (8.6%) of the output dynamic range.
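
Why identical capacitors force bit-serial operation can be seen from a generic charge-redistribution multiplier model. The sketch below is one common scheme, assumed here for illustration: at each step the running voltage is shared between two equal capacitors, one of which is first charged to the input if the current multiplicand bit is 1. It is not confirmed to be the exact circuit of [4]. The comparator-and-mux absolute value of the MD mode is modeled as a maximum of the two rails.

```python
def bit_serial_multiply(v_in, p, n_bits=4):
    """Charge-sharing multiply: returns v_in * p / 2**n_bits, one step per bit, LSB first."""
    v = 0.0
    for i in range(n_bits):
        bit = (p >> i) & 1
        # Equal capacitors: sharing averages the held value with bit * v_in.
        v = (v + bit * v_in) / 2
    return v


def abs_select(v_bl, v_blb):
    """Comparator plus mux: forward the larger of the two rail voltages."""
    return v_bl if v_bl > v_blb else v_blb


if __name__ == "__main__":
    print(bit_serial_multiply(1.0, 0b1011))
    print(abs_select(0.2, 0.5))
```

In this model an n-bit multiplicand costs n sequential sharing steps, which is consistent with the motivation for splitting the work across two 4-b multipliers running in parallel.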

The proposed architecture requires 16X fewer read accesses (and precharges) than the conventional architecture for a fixed volume of data, resulting in up to 5.8X throughput enhancement. This is because MR-FR and BLP process data in a massively parallel manner (128 8-b words per precharge), whereas the normal SRAM mode fetches only 8 8-b words through 4:1 column muxing. The smaller ΔVBL and fewer read accesses reduce the data access energy. Charge redistribution-based low-swing computation adds to the energy savings.
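
The 16X figure follows directly from the array geometry described in the text. The short calculation below reproduces it, taking the 256-column array, the 4:1 column mux, and the two-column (MSB/LSB) storage of each 8-b word as given. The function names are illustrative.

```python
def words_per_normal_read(columns=256, mux_ratio=4, word_bits=8):
    """8-b words delivered by one conventional row read through the column mux."""
    return columns // mux_ratio // word_bits


def words_per_functional_read(columns=256, columns_per_word=2):
    """8-b words covered by one MR-FR precharge (one word per column pair)."""
    return columns // columns_per_word


if __name__ == "__main__":
    normal = words_per_normal_read()
    functional = words_per_functional_read()
    print(normal, functional, functional // normal)
```

With these parameters a normal read returns 8 words and a functional read covers 128 words, which gives the 16X reduction in read accesses.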

Fig. 5 (left) indicates that CORE energy and decision accuracy trade off with ΔVBL. For binary (64-class) decisions, ΔVBL > 15 mV (> 25 mV) results in > 90% detection accuracy, and the CORE energy reduces by 0.2 pJ (0.4 pJ) per 20 mV reduction in ΔVBL. The energy breakdown in Fig. 5 (right) indicates that much of the savings is due to MR-FR. The CTRL energy will be amortized in a multi-bank scenario. The measured energy in the DP (MD) mode is 5.6× (3.7×) smaller than that of the conventional architecture, with savings of up to 9.7× (5.4×) in a multi-bank scenario.

The multifunctional IC (Fig. 6) implements four different algorithms with 2-, 4-, and 64-class decisions in a 512×256 SRAM array, achieving better decision accuracy than, and EDP (scaled for 65 nm) comparable to, single-function ICs [1,3]. The chip micrograph (Fig. 7) shows that the deep in-memory circuitry incurs an area overhead of 25%, not counting the CTRL.
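
The EDP values in Fig. 6 appear to be consistent with the product of the energy per decision and the delay per decision, where the delay is taken as the reciprocal of the decision throughput. That relation is an inference from the reported numbers and is not stated explicitly in the paper. The sketch below applies it to three entries of the comparison table; the function name is illustrative.

```python
def edp_fj_s(energy_pj, decisions_per_s):
    """Energy-delay product in fJ*s, assuming delay = 1 / throughput."""
    energy_j = energy_pj * 1e-12
    delay_s = 1.0 / decisions_per_s
    return energy_j * delay_s * 1e15


if __name__ == "__main__":
    print(edp_fj_s(963.1, 9.3e6))    # this work, SVM, single bank
    print(edp_fj_s(481.5, 18.5e6))   # this work, MF, single bank
    print(edp_fj_s(4500.0, 1.7e6))   # 8-b digital reference, SVM
```

The results round to the 0.1, 0.03, and 2.6 fJ·s entries reported for these three cases.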

Acknowledgement

This work was supported by Systems on Nanoscale Information fabriCs (SONIC), one of the six SRC

STARnet Centers, sponsored by SRC and DARPA.

References

[1] H. Kaul, et al., “A 21.5M-Query-Vectors/s 3.37nJ/Vector Reconfigurable k-Nearest-Neighbor Accelerator with Adaptive Precision in 14nm Tri-Gate CMOS,” ISSCC Dig. Tech. Papers, pp. 260-261, 2016.

[2] M. Kang, et al., “An Energy-efficient VLSI Architecture for Pattern Recognition via Deep Embedding of Computation in SRAM,” ICASSP, pp. 8326-8330, 2014.

[3] J. Zhang, et al., “A Machine-learning Classifier Implemented in a Standard 6T SRAM Array,” Dig. Symp. VLSI Circuits, June 2016.

[4] M. Kang, et al., “An Energy-efficient Memory-based High-throughput VLSI Architecture for Convolutional Networks,” ICASSP, pp. 1037-1041, 2015.

[Figure 1 (graphic not reproduced). Top: the conventional data path (single-row access of the bitcell array through sense amplifiers and a column mux, transfer over a bus to a digital processor, aggregation, slicing) compared with the in-memory data path (MR-FR, BLP, CBLP, ADC and slicing), in which data access and aggregation are combined. Bottom: the functions available in each stage and the stage configurations used for the four algorithms: support vector machine (SVM), k-nearest neighbor (KNN), template matching (TM), and matched filter (MF). The dot product (DP) mode computes ΣDi·Pi; the Manhattan distance (MD) mode computes Σ|Di - Pi|. B: bit precision.]

Figure 1: Conventional and proposed multi-functional deep in-memory architecture for inference applications.

[Figure 2 (graphic not reproduced). Block diagram of the CORE and CTRL. CORE: 512 × 256-b 6T SRAM bitcell array (BL0/BLB0 to BL255/BLB255), 4 × 256-b replica bitcell array written through WBL[0] to WBL[255], functional WL drivers with X-decoder and pulse generator, BL processors (mult / comp / abs), cross BL processor (CBLP), Sampler[0:3], ADC[0:3], slicer, 16 × 128-b streamed input buffer register, and the normal read/write circuitry (precharge, 4:1 column mux, sense amplifier/write driver, Y-decoder). CTRL: main, mode, (C)BLP, ADC, and R/W controllers with fetch and execute stages, an instruction set register, serial-in RCFG word, and decision scan-out.]

Figure 2: Deep in-memory processor architecture.

[Figure 3 (graphic not reproduced). Circuit and timing of the sub-ranged MR-FR on a column pair (MSB column and LSB column) of the 512 × 256 bitcell array and the 4 × 256 replica bitcell array, with PWM word lines WL0 to WL3 (VWL < VDD), replica word lines WLP0 to WLP3, write word lines WWL0 to WWL3, and sub-range switches ø_con and ø_merge that share charge between (15/16)C_BL and (1/16)C_BL. The replica cell write and read are used only in the MD mode. Relations given in the figure:
  MR-FR: ΔVBLB_M(L)SB ∝ D_M(L)SB
  Sub-ranged read: ΔVBL_MERGE ∝ ΔVBL_MSB + ΔVBL_LSB/16, ΔVBLB_MERGE ∝ ΔVBLB_MSB + ΔVBLB_LSB/16
  MD mode: ΔVBL_M(L)SB ∝ P_M(L)SB - D_M(L)SB, ΔVBLB_M(L)SB ∝ D_M(L)SB - P_M(L)SB
Measured plot: ΔVBLB (V) from MR-FR and INL (LSB) versus the 4-b D_MSB code.]

Figure 3: Sub-ranged multi-row functional read (MR-FR) and measured accuracy.

[Figure 4 (graphic not reproduced). Schematic of the BLP for column pairs 0 to 127, pitch matched to the SRAM bitcell: a reconfigurable 4-b multiplier on each of the MSB and LSB columns, a comparator (COMP_EN) and mux selecting between BL_MSB and BLB_MSB, and the CBLP MSB and LSB rails joined through ø_con_rail and ø_merge_rail. Relations given in the figure:
  DP mode: V_MUX_OUT = V_BLB_MSB; ΣDi·Pi ∝ ΔV_sample
  MD mode: V_MUX_OUT = max(V_BL_MSB, V_BLB_MSB) ∝ max(P - D, D - P) = |D - P|; Σ|Di - Pi| ∝ V_sample
Measured plots: ΔV_sample [V] (DP mode) and V_sample [V] (MD mode) versus 8-b D, for 8-b P = 0, 50, 100, 150, 200, 250.]

Figure 4: Pitch-matched BL processing (BLP), cross BL processing (CBLP), and measured accuracy (@ D0 = D1 = … = D255, P0 = P1 = … = P255).

[Figure 5 (graphic not reproduced). Left: probability of detection [%] versus CORE energy per 8-b word [pJ] for face detection (SVM) and face recognition (TM), with the BL swing per LSB swept over 0~30 mV. Right: CORE energy per 8-b word [pJ], from post-layout simulations, for the conventional and proposed architectures in the MD and DP modes, broken down into READ (MR-FR) and (in-memory) computation, annotated with ×6.2↓ and ×10.4↓ reductions.]

Figure 5: Measured energy vs. detection accuracy trade-offs, and CORE energy breakdown.

Comparison with prior art:

Design        Tech. (nm)   # of algorithms                                 Memory size        Precision (b)
This work     65 CMOS      4 (SVM, MF, KNN, TM)                            SRAM 512 × 256-b   8x8
8-b digital*  65 CMOS      synthesized dedicated processor per algorithm   SRAM 512 × 256-b   8x8
[1]††         14 Tri-gate  1 (KNN)                                         128 byte           8x8
[3]**         130 CMOS     1 (Adaboost)                                    SRAM 128 × 128-b   5x1

Design        Algorithm  Throughput (decisions/s)  Energy (pJ/decision)  EDP (fJ·s)      Accuracy
This work     SVM        9.3M                      963.1 / 462.4†        0.1 / 0.05†     95 %
This work     MF         18.5M                     481.5 / 231.2†        0.03 / 0.01†    100 %
This work     TM         312.5K                    33.6K / 17.5K†        107.3 / 56.0†   100 %
This work     KNN        312.5K                    33.6K / 17.5K†        107.4 / 56.0†   92 %
8-b digital*  SVM        1.7M                      4.5K                  2.6             96 %
8-b digital*  MF         3.4M                      2.2K                  0.6             100 %
8-b digital*  TM         54.3K                     93.0K                 1715.3          100 %
8-b digital*  KNN        54.3K                     93.0K                 1715.3          90 %
[1]††         KNN        21.5M                     3.4K                  0.2             Not reported
[3]**         Adaboost   50M                       633.4                 0.01            90 %

* memory (digital) energy & delay measured from prototype IC (post-layout simulations); † assumes a 32 bank configuration; †† single function with SRAM memory access cost not included; ** single function with 1b weight vector

Applications, algorithms, and datasets (P: query input, D: data stored in array):

1. Face detection (binary class). Algorithm: support vector machine (SVM). Dataset: MIT CBCL dataset.
   P: 23 × 22 8-b pixel image (face / non-face), 100 query inputs tested.
   D: feature extractor and classifier combined, 23 × 22 8-b word coefficient.

2. Event (gun shot) detection (binary class). Algorithm: matched filter (MF).
   P1: gun shot sound contaminated by AWGN with 3 dB SNR, or P2: only AWGN with equal power of "signal + AWGN" in P1; total 100 query inputs tested.
   D: gun shot mono sound data with 256 8-b words.

3. Face recognition (64 classes). Algorithm: template matching with Manhattan distance (TM). Dataset: MIT CBCL dataset, 16 × 16 8-b pixel image for P and D.
   P: one of the 64 candidate faces in D, 64 query inputs tested.
   D: 64 candidate faces.

4. Hand-written number recognition (4 classes). Algorithm: k-nearest neighbor (KNN). Dataset: MNIST dataset, 4 classes from "0" to "3" (due to array size limit), 16 × 16 8-b pixel image for P and D.
   P: image from 4 classes, 100 query inputs tested.
   D: 16 images per class.

Figure 6: Application level gains in energy efficiency, delay, accuracy, and comparison with prior arts.

Chip summary:

Technology                       65 nm CMOS
Die size                         1.2 mm × 1.2 mm
CTRL operating freq.             1 GHz
SRAM capacity                    16 KB (1 bank of 512 × 256-b)
Bitcell dimension                2.11 × 0.92 µm²
Supply voltage                   CORE: 1.0 V, CTRL: 0.85 V
Energy per decision (pJ)         SVM 963.1; Matched filter 481.5; KNN 33.6K; Template matching 33.6K
Decision throughput (decisions/s) SVM 1.7M; Matched filter 3.4M; KNN 54.3K; Template matching 54.3K

[Die micrograph (graphic not reproduced), 1.2 mm × 1.2 mm, with labeled blocks: bitcell array, replica bitcell array, WL driver & pulse gen, BLP & CBLP, sampler & ADC, decision, input buffer register, R/W, digital CTRL, and test block.]

Figure 7: Die micrograph and chip summary.
