The Dot-Product Engine (DPE): exploring high efficiency analog multiplication with memristor arrays

John Paul Strachan

Hewlett Packard Labs

December 11, 2015

2

Outline

DPE concept and applications

Material requirements

DPE demonstration efforts

Integrating CMOS/memristors

Control circuitry for programming/reading/computing

Mapping matrix values to conductances

Impact of noise sources

Preliminary DPE performance simulations and benchmarking

3

Dot-Product engine (DPE) Concept

Input 1: Vector of voltages $V_i^I$

Input 2: Array of conductances $G_{ij}$

Output: Vector of currents $I_j^O$

$$I_j^O = \sum_i G_{ij} \, V_i^I$$
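The concept above can be sketched numerically. This is a minimal, idealized illustration assuming perfectly linear devices and zero wire resistance; the function name `crossbar_dot_product` and the example values are hypothetical and do not come from the slides.

```python
import numpy as np

def crossbar_dot_product(voltages, conductances):
    """Ideal crossbar read-out: I_j = sum_i G[i, j] * V[i].

    Ohm's law gives the current through each device, and the currents of all
    devices sharing a column wire add up (Kirchhoff's current law).
    """
    voltages = np.asarray(voltages, dtype=float)
    conductances = np.asarray(conductances, dtype=float)
    return voltages @ conductances

# Hypothetical 3x2 array: 3 input rows, 2 output columns.
G = np.array([[1e-4, 2e-4],
              [5e-5, 1e-4],
              [2e-4, 5e-5]])      # conductances in siemens
V = np.array([0.1, 0.2, 0.05])    # input voltages in volts
I_out = crossbar_dot_product(V, G)
print(I_out)                      # column currents in amperes
```

Every column current is produced in the same read step, which is where the parallel multiply-and-accumulate comes from.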

4

Dot-Product engine (DPE) Concept

Memristor array naturally represents a matrix

Compute dot product through Ohm’s Law

Highly parallel multiply & accumulate –favorable scaling with array size


Present work explores the challenges! What about noise and device variability? Issues with accuracy? Actual performance numbers?

Found a large class of well-matched applications. Performance estimates surpass custom ASICs.

For a 512x512 array including peripheral circuits:

10 PetaOPS (ASIC = 0.1)

>100 PetaOPS/Watt (ASIC = 1-10)

5

Killer App #1: Deep Learning Neural networks

1) 70-90% of computation time is consumed in the convolution layers [1]

2) Recent work shows that only 10-12 bit representations are required to maintain state-of-the-art classification accuracy [2]

Processing pipeline

[1] F. Abuzaid, et al., “Caffe con Troll: Shallow Ideas to Speed Up Deep Learning” arXiv:1504.04343 [cs.LG]

[2] M. Courbariaux, J.P. David, Y. Bengio “Low precision storage for deep learning” ICLR 2015

Human-level accuracy in image classification

6

Killer App #2: Any Linear Transformation

Example: Discrete Fourier transforms are just vector-matrix multiplication

Each row of the DFT matrix computes a single frequency component:

$$\mathrm{DFT} = \begin{pmatrix} w^0 & w^0 & w^0 & w^0 & \cdots \\ w^0 & w^1 & w^2 & w^3 & \cdots \\ w^0 & w^2 & w^4 & w^6 & \cdots \\ w^0 & w^3 & w^6 & w^9 & \cdots \\ \vdots & \vdots & \vdots & \vdots & \ddots \end{pmatrix}, \qquad w = e^{-2\pi i / N}$$

Compared to the Fast Fourier Transform, a matrix implementation allows

1) Computation of the N-point DFT in constant time O(1) vs O(N log N)

2) The flexibility to only include the frequency components of interest

3) Inverse Fourier transform is a symmetric and easily implemented reverse operation

4) The ability to handle non-square transforms (rectangular DFT matrices)

5) The ability to take the DFT for input signals that have non-uniform spacing
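As a rough illustration of the DFT-as-matrix idea, the sketch below builds the matrix with entries $w^{jk}$ and checks it against NumPy's FFT. The helper name `dft_matrix` and the test signal are assumptions made for this example. It uses complex arithmetic directly and does not model how complex or negative entries would be mapped onto conductances.

```python
import numpy as np

def dft_matrix(n):
    """N x N DFT matrix with entries w**(j*k), where w = exp(-2*pi*i/N)."""
    idx = np.arange(n)
    w = np.exp(-2j * np.pi / n)
    return w ** np.outer(idx, idx)

n = 8
x = np.cos(2 * np.pi * 2 * np.arange(n) / n)   # test signal sitting in frequency bin 2

# Full transform as a single matrix-vector product.
X_matrix = dft_matrix(n) @ x

# Keeping only the rows of interest gives a rectangular (3 x N) transform.
X_subset = dft_matrix(n)[1:4] @ x
```

`X_matrix` should agree with `np.fft.fft(x)`, and `X_subset` with bins 1 to 3 of that result.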

7

IARPA seedling to look at materials issues for implementing DPE

Materials engineering

• High resistance

• Large OFF/ON ratio

• Low switching current

• Linear electronic transport

Cell integration, arrays, and circuits

• Integration/fab of memristors with selectors/transistors

• Optimizing reading and writing circuits

• Construct platform to write/read device arrays

• Compact modeling of memristor characteristics

Algorithms, Simulations, Operation

• Quantifying the total sources of error

• Matching DPE capability to favorable applications

• Securing IP for key application spaces

• Benchmarking and optimizing DPE performance; GOPS/Watt

8

What are the memristor device properties needed?

J. Joshua Yang et al., Nature Nanotechnology 8, 13 (2013)

Key properties

Large OFF/ON resistance ratio = More bits

Linearity of resistance states = Higher accuracy computation

Number of stable resistance states = More bits

Higher resistances = Lower energy, but reading challenges

9

Two modes of operation:

1) Dot-product computation

2) Programming memristor array (analog values)

Key Assumption:

Input vectors change frequently, but crossbar matrix values are relatively fixed

This assumption enables higher bit-accuracy but through closed-loop (expensive) programming

10

Empirical model for current-voltage relationship in TaOx memristors

Electronic transport described by two thermally-activated mechanisms in parallel:

Schottky-like emission + Frenkel-Poole hopping

State variable is a defect density

[Figure: current (A, log scale from 10^-12 to 10^-6) vs. voltage (V, 0 to 0.25) at temperatures between 100 K and 500 K; experimental device data compared with model results]

$$I = A_{SCH}\, T^2\, e^{-E_{SCH}/kT}\, e^{B_{SCH}\sqrt{V}/kT} + A_{FP}\, V\, e^{-E_{FP}/kT}\, e^{B_{FP}\sqrt{V}/kT}$$

11

Predictive model for I-V shape for any resistance state

Captures device non-linearities and temperature dependence

Model used in array-level simulations

[Figure: model current (A, log scale from 10^-12 to 10^-2) vs. voltage (V, 0 to 1), one curve per resistance state, with an arrow marking increasing resistance]

Model parameters for the plotted curves:

ASCH=1.3959e-011, ESCH=0.069431, BSCH=0.26648, AFP=0.0079649, EFP=0, BFP=0

ASCH=3.2666e-011, ESCH=0.12072, BSCH=0.27369, AFP=0.0026109, EFP=0, BFP=0.0008913

ASCH=7.6443e-011, ESCH=0.172, BSCH=0.28089, AFP=0.00085583, EFP=0.00059054, BFP=0.0021953

ASCH=1.7889e-010, ESCH=0.22329, BSCH=0.28809, AFP=0.00028054, EFP=0.0015563, BFP=0.0034992

ASCH=4.1863e-010, ESCH=0.27458, BSCH=0.29529, AFP=9.196e-005, EFP=0.002522, BFP=0.0048032

12

Multilevel (analog) capability

Repeated access to each target resistance level

Getting to each target level used closed-loop programming

1% tolerance – limited by current measuring electronics used

[Figure: three panels of resistance (Ω, 0 to 2.0x10^6) vs. cycle (0 to 3000): 8 levels (2 kΩ-200 kΩ), 16 levels (2 kΩ-2 MΩ), 32 levels (2 kΩ-2 MΩ)]
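A closed-loop (write-verify) programming loop of the kind mentioned on this slide might look like the sketch below. The slides do not describe the pulse scheme or the device response, so `ToyMemristor`, its random response, and the 5% per-pulse limit are hypothetical stand-ins chosen only to make the loop runnable. The 1% tolerance is taken from the slide.

```python
import random

class ToyMemristor:
    """Hypothetical device model (not from the slides): each pulse realizes a
    random 50-100% of the requested relative resistance change."""

    def __init__(self, resistance_ohm, seed=0):
        self.resistance_ohm = resistance_ohm
        self._rng = random.Random(seed)

    def read(self):
        return self.resistance_ohm

    def pulse(self, relative_change):
        realized = relative_change * self._rng.uniform(0.5, 1.0)
        self.resistance_ohm *= 1.0 + realized


def program_closed_loop(device, target_ohm, tolerance=0.01, max_pulses=1000):
    """Read, compare with the target, pulse, and repeat until within tolerance.

    Returns the number of pulses used; raises if the pulse budget runs out.
    """
    pulses = 0
    while True:
        error = (device.read() - target_ohm) / target_ohm
        if abs(error) <= tolerance:
            return pulses
        if pulses == max_pulses:
            raise RuntimeError("target not reached within pulse budget")
        # Push toward the target, limiting each pulse to a 5% change (assumed).
        device.pulse(-max(-0.05, min(0.05, error)))
        pulses += 1


device = ToyMemristor(resistance_ohm=2e3)              # start at 2 kΩ
pulses_used = program_closed_loop(device, target_ohm=2e5)
final_error = abs(device.read() - 2e5) / 2e5
```

The verify step before every pulse is what makes this kind of programming accurate and also what makes it expensive, since each attempt costs a read as well as a write.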

13

Developing a demonstration platform

Initial choice to use a 1T1R array

Allows faster and more accurate memristor programming

All transistors turned ON for DPE computation (use depletion mode transistors)

[Figure: tape-out layout design and wafer image of 1T1R arrays (128x64, 64x64, 32x32, 16x16), plus an example 4x4 matrix schematic with lines labelled TE (1 to n), G (1 to n), S (1 to n), Gnd, BE, and Transistor Gate]

14

Back End Of Line (BEOL) fabrication of memristors

15

After BEOL memristor-transistor integration

Individual transistor tests

1Transistor-1Memristor, Multilevel control

16

32 levels in BEOL fabricated TaOx memristors

1T1R allows rapid and precise conductance tuning

Programming within 1%

Programming with 1 µs pulses

[Figure, first panel: number of training attempts (2 to 18) vs. target current (2.0x10^-5 to 1.0x10^-4 A) for 8, 16, and 32 levels. Second panel: programmed current (2.0x10^-5 to 1.1x10^-4 A) vs. level (up to 32)]

Major challenge – fluctuations in resistance levels

Random Telegraph Noise (RTN) is ubiquitous in nanoscale atomic/electronic systems

Drift and RTN measured over 2 seconds while bias applied

Fluctuations depend on resistance level, voltage applied, and time scale

Up to 20% for high resistance states

But <0.4% for <10 kΩ

[Figure: spread of ΔG/G_Initial (0 to 0.25) vs. G_Initial (log scale from 10^-6 to 10^-3), measured at -0.1 V and -0.2 V]

Shinhyun Choi, Yuchao Yang, and Wei Lu, “Random Telegraph Noise and Resistance Switching Analysis of Oxide Based Resistive Memory,” Nanoscale, 2014, 6, 400-404.

18

Putting the pieces together: Signal flow of the DPE demonstrator

Probe card PCB

Cantilever probes

Workstation (running application)

Memristor 1T1R Array

Control boards

Needs to drive two modes of operation:

1) Dot-product computation

2) Programming the matrix (analog, not binary values!)

19

Overall performance specifications

• Computation (DPE) time – 100 ns

• Ultimate limit will be parasitic RC time <10 ns

• Can utilize trade-off with ADC conversion time vs needed accuracy

• Voltage range: -10 V to +10V, Current read: 10nA to 2.5mA

• Currently measures 128x64 arrays

• Extensible design: just add more column or row boards

• Microprocessor configures, programs and sets registers for DPE operation

• Dot-product computation time (with pre-configuring) < 100ns (10MHz)
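To relate these timing figures to throughput, the arithmetic below counts operations per dot product. The slides do not say how operations are counted, so treating each multiply-accumulate as two operations is an assumption, as is the function name `crossbar_gops`.

```python
def crossbar_gops(n_rows, n_cols, compute_time_s, ops_per_mac=2):
    """Throughput in GOPS for one n_rows x n_cols dot product per compute_time_s.

    ops_per_mac=2 counts a multiply and an add separately (assumed convention).
    """
    total_ops = n_rows * n_cols * ops_per_mac
    return total_ops / compute_time_s / 1e9

gops_128x64 = crossbar_gops(128, 64, 100e-9)    # array size currently measured
gops_512 = crossbar_gops(512, 512, 100e-9)      # larger array at the same 100 ns
```

Under these assumptions a 128x64 array at 100 ns per dot product corresponds to 163.84 GOPS, and a 512x512 array to about 5,243 GOPS. The quadratic growth of operations with array side at a fixed compute time is the favorable scaling the deck refers to.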

20

Programming the array

How do we map numbers to conductances?

Linear Mapping (LM)

Device non-linearity and finite wire resistances kill the accuracy

[Figure: final DPE bit-accuracy (0 to 6) vs. number of rows (0 to 300, array = N x N) for linear mapping]
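One way to picture a linear mapping is the sketch below, which rescales matrix values onto a conductance window and then undoes the affine offset after read-out. It assumes ideal linear devices and no wire resistance, which are exactly the effects this slide says degrade accuracy in practice. The helper names and the 2 kΩ to 2 MΩ window (borrowed from the multilevel slide) are choices made for this example, not the authors' stated mapping.

```python
import numpy as np

def linear_map(weights, g_min, g_max):
    """Map matrix values linearly onto [g_min, g_max]; needs at least two distinct values."""
    weights = np.asarray(weights, dtype=float)
    w_min, w_max = weights.min(), weights.max()
    scale = (g_max - g_min) / (w_max - w_min)
    conductances = g_min + scale * (weights - w_min)
    return conductances, scale, w_min

def recover_product(currents, voltages, scale, w_min, g_min):
    """Undo the mapping: I = scale * (V @ W) + (g_min - scale * w_min) * sum(V)."""
    offset = (g_min - scale * w_min) * np.sum(voltages)
    return (currents - offset) / scale

W = np.array([[0.5, -1.0],
              [2.0, 0.0],
              [-0.5, 1.5]])
G_MIN, G_MAX = 5e-7, 5e-4            # 2 MΩ and 2 kΩ expressed as conductances
G, scale, w_min = linear_map(W, G_MIN, G_MAX)

V = np.array([0.1, 0.2, 0.05])
I = V @ G                             # ideal crossbar read-out
result = recover_product(I, V, scale, w_min, G_MIN)
```

In this ideal case `result` equals `V @ W`. Negative matrix entries are handled by the offset term rather than by negative conductances.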

© Copyright 2014 Hewlett-Packard Development Company, L.P. The information contained herein is subject to change without notice.

Gradient-based conversion algorithm

Gradient-based optimization

– Evaluate Kirchhoff's Current Law (KCL) at every cross-point

– Tweak conductance to minimize error between actual current and ideal current

– Pre-calculated Jacobian of the crossbar for faster optimization

Current error I_Error << 1%

[Figure panels: desired matrix W; ideal current at each node; calibrated conductance G'; remaining absolute error]

M. Hu et al., ICCAD (2015)


Improvements with Conversion algorithm

• Test case: 128x128 crossbar, 10 ohm wire segment, calibrated at 0.25 V

[Figure: histogram of actual value / ideal value for a DCT matrix, axis from 0.98 to 1.02; Std = 0.0045]


23

Noise – The bane of the analog world

Sources of Noise

Johnson noise: $I_{RMS} = \sqrt{4kT\Delta f / R}$

Shot noise: $I_{RMS} = \sqrt{2qI\Delta f}$

Random telegraph noise (RTN): $G \rightarrow G + \Delta G$

Wire block resistance

Input resistance

Output resistance
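To get a feel for the magnitudes, the sketch below evaluates the standard textbook expressions for Johnson (thermal) and shot noise current at one operating point. The operating point (a 10 kΩ device read at 0.2 V, 10 MHz bandwidth, 300 K) is an assumption for illustration, not a value from the slides.

```python
import math

K_B = 1.380649e-23       # Boltzmann constant, J/K
Q_E = 1.602176634e-19    # elementary charge, C

def johnson_noise_rms(resistance_ohm, bandwidth_hz, temperature_k=300.0):
    """RMS thermal noise current: sqrt(4 k T df / R)."""
    return math.sqrt(4.0 * K_B * temperature_k * bandwidth_hz / resistance_ohm)

def shot_noise_rms(current_a, bandwidth_hz):
    """RMS shot noise current: sqrt(2 q I df)."""
    return math.sqrt(2.0 * Q_E * current_a * bandwidth_hz)

# Assumed operating point (hypothetical).
R, V, BW = 1e4, 0.2, 1e7
I_signal = V / R                       # 20 µA through the device
johnson = johnson_noise_rms(R, BW)
shot = shot_noise_rms(I_signal, BW)
print(johnson / I_signal, shot / I_signal)
```

At this operating point both noise currents are a few nanoamperes against a 20 µA signal, well under 0.1% per device.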


Impact of Johnson and Shot noise negligible

Example: 256x256 array

Repeated simulations randomizing the input vector and sampling noise distribution

Error is dominated by nonlinearity and finite wire resistances

[Figure: two histograms of occurrence. With noise: standard deviation = 0.0046994. Without noise: standard deviation = 0.0040643]


Impact of RTN Resistance fluctuations on DPE Accuracy

[Figure: standard deviation of DPE result vs. size of fluctuation Δ (0 to 0.3), where G → G (1 ± Δ)]
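The experiment on this slide can be imitated with a small Monte Carlo sketch: every device is perturbed to G (1 ± Δ) and the spread of the resulting output is recorded. The array size, conductance window, input range, and the equal-probability ± sign are all assumptions of this example, and the ideal crossbar model ignores nonlinearity and wire resistance.

```python
import numpy as np

def rtn_output_spread(delta, n=64, trials=500, seed=0):
    """Std of (perturbed output / ideal output) when each device is G * (1 ± delta)."""
    rng = np.random.default_rng(seed)
    G = rng.uniform(5e-7, 5e-4, size=(n, n))          # assumed conductance window
    ratios = []
    for _ in range(trials):
        V = rng.uniform(0.0, 0.25, size=n)            # random input vector
        signs = rng.choice([-1.0, 1.0], size=(n, n))  # each device up or down
        G_noisy = G * (1.0 + delta * signs)
        ratios.append((V @ G_noisy) / (V @ G))
    return float(np.std(np.concatenate(ratios)))

spread = {d: rtn_output_spread(d) for d in (0.0, 0.1, 0.3)}
```

Because each output sums many independently perturbed devices, the relative spread of the result comes out well below the per-device fluctuation Δ in this toy model.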


26

Simulation of a Neural network - MNIST

Algorithm

[Figure: network pipeline. data (1×784) × w1 (784×500) + Bias_w1 → softmax → × w2 (500×500) + Bias_w2 → softmax → × w3 (500×2000) + Bias_w3 → softmax → × w_class (2000×10) + Bias_w_class → maxout → Result]

Softmax (logistic sigmoid form): y = 1/(1+exp(-x))

Break problem across 128x128 arrays with analog buffers + biasing

[Figure: w1 (784×500) split across crossbar tiles labelled Xbar(1,4), Xbar(1,5), ...; analog signal from previous layer → crossbar tiles → analog processing (sum + bias + softmax) → analog signal to next layer]

Random Telegraph Noise

[Figure: total errors out of 10,000 (axis from 100 to 200) vs. RTN fluctuation size (0% to 30%), showing Min_Error, Avg_Error, and Max_Error]

Even with 30% noise fluctuations in every device, worst case error goes from 1.4% to 1.9%
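Splitting a layer across 128x128 arrays can be sketched as a blocked matrix-vector product in which partial outputs from tiles sharing the same output columns are summed. The tiling order, the random weights, and the function names below are assumptions for illustration; the analog buffers and biasing circuits from the slide are represented only by ordinary additions.

```python
import numpy as np

def tiled_matvec(x, W, tile=128):
    """Compute x @ W by splitting W into tile x tile blocks and summing partial outputs."""
    n_in, n_out = W.shape
    y = np.zeros(n_out)
    for r in range(0, n_in, tile):
        for c in range(0, n_out, tile):
            block = W[r:r + tile, c:c + tile]         # weights held by one crossbar
            y[c:c + tile] += x[r:r + tile] @ block    # tiles in the same column add up
    return y

def layer(x, W, b):
    """Sum + bias + logistic activation y = 1/(1+exp(-x)), as given on the slide."""
    return 1.0 / (1.0 + np.exp(-(tiled_matvec(x, W) + b)))

rng = np.random.default_rng(0)
x = rng.random(784)                                   # stand-in for one MNIST image
W1 = rng.normal(scale=0.05, size=(784, 500))          # random stand-in for w1
b1 = rng.normal(scale=0.05, size=500)
h1 = layer(x, W1, b1)

n_tiles = int(np.ceil(784 / 128)) * int(np.ceil(500 / 128))
```

With this particular tiling the 784×500 layer occupies 7 × 4 = 28 arrays, with the edge tiles only partly filled.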

27

Comparison to a 22nm digital ASIC circuit

Speed and energy comparison between DPE and state-of-the-art 8-bit ASIC*

[Figure, first panel: throughput (GOPS, log scale from 10^-1 to 10^4) vs. crossbar size N×N (32 to 512) for DPE, ASIC at max speed, and ASIC at max efficiency; annotated "100x gain". Second panel: efficiency (GOPS/W, log scale from 10^3 to 10^6) vs. crossbar size N×N (32 to 512); annotated "10-100x gain"]

Power consumption dominated by peripherals (TIA, ADC, DAC) – room for optimization of circuitry

* Hsu, S. K., et al., IEEE Journal of Solid-State Circuits, 48, 118 (2013)

Results and future work

Quantified noise/fluctuations, impact of nonlinearity and parasitics

Robust Neural network classification accuracy despite device state noise

Outperforms state-of-the-art Digital ASIC

Throughput 100x improvement

GOPS/Watt 10x improvement

Working on full DPE experimental demonstrations and further application-level benchmarking

Thanks to the DPE Team

Nora Davilla

Emma Merced Grafals

Ning Ge

Cat Graves

Miao Hu

Sity Lam

Eric Montgomery

Stan Williams

Joshua Yang (UMass)

IARPA Program Manager: Karl Roenigk
