The Dot-Product Engine (DPE): exploring high efficiency analog multiplication with memristor arrays
John Paul Strachan
Hewlett Packard Labs
December 11, 2015
2
Outline
DPE concept and applications
Material requirements
DPE demonstration efforts
Integrating CMOS/memristors
Control circuitry for programming/reading/computing
Mapping matrix values to conductances
Impact of noise sources
Preliminary DPE performance simulations and benchmarking
3
Dot-Product engine (DPE) Concept
Input 1: Vector of voltages $V_i^I$
Input 2: Array of conductances $G_{ij}$
Output: Vector of currents $I_j^O$
$I_j^O = \sum_i G_{ij} \, V_i^I$
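The concept above can be sketched numerically. This is a minimal, idealized illustration (no wire resistance, no device non-linearity, no noise); the function name and the example values are assumptions chosen for illustration, not taken from the slides. Each column current is the sum over rows of conductance times row voltage.

```python
def crossbar_dot_product(voltages, conductances):
    """Ideal crossbar: column current I_j = sum_i G[i][j] * V[i].

    voltages     -- list of row input voltages (V), length n_rows
    conductances -- n_rows x n_cols nested list of conductances (S)
    """
    n_rows = len(conductances)
    n_cols = len(conductances[0])
    if len(voltages) != n_rows:
        raise ValueError("need one input voltage per crossbar row")
    return [
        sum(conductances[i][j] * voltages[i] for i in range(n_rows))
        for j in range(n_cols)
    ]


# Hypothetical 2x3 example: two input rows, three output columns.
V = [0.1, 0.2]                      # volts
G = [[1e-4, 2e-4, 0.0],
     [3e-4, 0.0, 1e-4]]             # siemens
currents = crossbar_dot_product(V, G)
print(currents)
```

All columns are evaluated from the same applied voltages, which is why the multiply-accumulate count grows with array area while the read remains a single step.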
4
Dot-Product engine (DPE) Concept
Memristor array naturally represents a matrix
Compute dot product through Ohm’s Law
Highly parallel multiply & accumulate –favorable scaling with array size
Present work explores the challenges!
What about noise and device variability? Issues with accuracy? Actual performance numbers?
Found large class of well-matched applications
Performance estimates surpass custom ASICs
For 512x512 array including peripheral circuits:
10 PetaOPS (ASIC = 0.1)
>100 PetaOPS/Watt (ASIC = 1-10)
5
Killer App #1: Deep Learning Neural networks
1) 70-90% of computation time consumed in the Convolution layers [1]
2) Recent work shows that only 10-12 bit representations are required to maintain state-of-the-art classification accuracy [2]
Processing pipeline
Human-level accuracy in image classification
[1] F. Abuzaid, et al., “Caffe con Troll: Shallow Ideas to Speed Up Deep Learning” arXiv:1504.04343 [cs.LG]
[2] M. Courbariaux, J.P. David, Y. Bengio “Low precision storage for deep learning” ICLR 2015
6
Killer App #2: Any Linear Transformation
Example: Discrete Fourier transforms are just vector-matrix multiplication
Each row of the DFT matrix computes a single frequency component:
$$\mathrm{DFT} = \begin{pmatrix} w^0 & w^0 & w^0 & w^0 & \cdots \\ w^0 & w^1 & w^2 & w^3 & \cdots \\ w^0 & w^2 & w^4 & w^6 & \cdots \\ w^0 & w^3 & w^6 & w^9 & \cdots \\ \vdots & \vdots & \vdots & \vdots & \ddots \end{pmatrix}, \qquad w = e^{-2\pi i/N}$$
Compared to the Fast Fourier Transform, a matrix implementation allows
1) Computation of the N-point DFT in constant time O(1) vs O(N log N)
2) The flexibility to only include the frequency components of interest
3) Inverse Fourier transform is a symmetric and easily implemented reverse operation
4) The ability to handle non-square transforms (rectangular DFT matrices)
5) The ability to take the DFT for input signals that have non-uniform spacing
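A short software sketch of the DFT-as-matrix idea, using only the Python standard library. The helper names (`dft_matrix`, `matvec`) and the test signal are assumptions for illustration. The `freqs` argument shows point 2 above: keeping only the rows for the frequency components of interest gives a rectangular matrix. Mapping complex and negative entries onto physical conductances is not addressed here.

```python
import cmath


def dft_matrix(n, freqs=None):
    """Rows of w**(k*m), with w = exp(-2*pi*i/n); one row per frequency k."""
    if freqs is None:
        freqs = range(n)
    w = cmath.exp(-2j * cmath.pi / n)
    return [[w ** (k * m) for m in range(n)] for k in freqs]


def matvec(matrix, vector):
    """Plain matrix-vector product: one dot product per matrix row."""
    return [sum(a * x for a, x in zip(row, vector)) for row in matrix]


signal = [1.0, 0.0, -1.0, 0.0]
full = matvec(dft_matrix(4), signal)               # all four components
partial = matvec(dft_matrix(4, freqs=[1]), signal) # only component k = 1
print([round(abs(x), 6) for x in full])
```

For this signal the four components have magnitudes 0, 2, 0, 2, and the single-row matrix returns just the k = 1 component.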
7
IARPA seedling to look at materials issues for implementing DPE
Materials engineering
• High resistance
• Large OFF/ON ratio
• Low switching current
• Linear electronic transport
Cell integration, arrays, and circuits
• Integration/fab of memristors with selectors/transistors
• Optimizing reading and writing circuits
• Construct platform to write/read device arrays
• Compact modeling of memristor characteristics
Algorithms, Simulations, Operation
• Quantifying the total sources of error
• Matching DPE capability to favorable applications
• Securing IP for key application spaces
• Benchmarking and optimizing DPE performance; GOPS/Watt
8
What are the memristor device properties needed?
J. Joshua Yang et al., Nature Nanotechnology 8, 13 (2013)
Key properties
Large OFF/ON resistance ratio = More bits
Linearity of resistance states = Higher accuracy computation
Number of stable resistance states = More bits
Higher resistances = Lower energy, but reading challenges
9
Two modes of operation:
1) Dot-product computation
2) Programming memristor array (analog values)
Key Assumption:
Input vectors change frequently, but crossbar matrix values are relatively fixed
This assumption enables higher bit-accuracy, but only through (expensive) closed-loop programming
10
Empirical model for current-voltage relationship in TaOx memristors
Electronic transport described by two thermally-activated mechanisms in parallel:
Schottky-like emission + Frenkel-Poole hopping
State variable is a defect density
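[Figure: current (A, log scale, 10^-12 to 10^-6) vs. voltage (0 to 0.25 V), curves labeled 100 K and 500 K; experimental device data compared with model results]
$$I = A_{SCH}\, T^2\, e^{-E_{SCH}/kT}\, e^{B_{SCH}\sqrt{V}/kT} + A_{FP}\, V\, e^{-E_{FP}/kT}\, e^{B_{FP}\sqrt{V}/kT}$$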
11
[Figure: model current (A, log scale, 10^-12 to 10^-2) vs. voltage (0 to 1 V), one curve per parameter set listed below]
ASCH=1.3959e-011, ESCH=0.069431, BSCH=0.26648, AFP=0.0079649, EFP=0, BFP=0
ASCH=3.2666e-011, ESCH=0.12072, BSCH=0.27369, AFP=0.0026109, EFP=0, BFP=0.0008913
ASCH=7.6443e-011, ESCH=0.172, BSCH=0.28089, AFP=0.00085583, EFP=0.00059054, BFP=0.0021953
ASCH=1.7889e-010, ESCH=0.22329, BSCH=0.28809, AFP=0.00028054, EFP=0.0015563, BFP=0.0034992
ASCH=4.1863e-010, ESCH=0.27458, BSCH=0.29529, AFP=9.196e-005, EFP=0.002522, BFP=0.0048032
ASCH=9.7965e-010, ESCH=0.32587, BSCH=0.3025, AFP=3.0144e-005, EFP=0.0034877, BFP=0.0061072
ASCH=2.2925e-009, ESCH=0.37715, BSCH=0.3097, AFP=9.8811e-006, EFP=0.0044534, BFP=0.0074111
ASCH=5.3649e-009, ESCH=0.42844, BSCH=0.3169, AFP=3.239e-006, EFP=0.0054191, BFP=0.0087151
ASCH=1.2555e-008, ESCH=0.47973, BSCH=0.3241, AFP=1.0617e-006, EFP=0.0063848, BFP=0.010019
ASCH=2.938e-008, ESCH=0.53101, BSCH=0.33131, AFP=3.4803e-007, EFP=0.0073505, BFP=0.011323
ASCH=6.8753e-008, ESCH=0.5823, BSCH=0.33851, AFP=1.1408e-007, EFP=0.0083162, BFP=0.012627
ASCH=1.6089e-007, ESCH=0.63359, BSCH=0.34571, AFP=3.7396e-008, EFP=0.0092819, BFP=0.013931
ASCH=3.7651e-007, ESCH=0.68487, BSCH=0.35291, AFP=1.2258e-008, EFP=0.010248, BFP=0.015235
ASCH=8.811e-007, ESCH=0.73616, BSCH=0.36012, AFP=4.0182e-009, EFP=0.011213, BFP=0.016539
ASCH=2.0619e-006, ESCH=0.78745, BSCH=0.36732, AFP=1.3172e-009, EFP=0.012179, BFP=0.017843
Predictive model for I-V shape for any resistance state
Captures device non-linearities and temperature dependence
Model used in Array-level simulations
Increasing Resistance
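The model can be evaluated directly for any listed parameter set. The sketch below assumes the two-term form (Schottky-like emission plus Frenkel-Poole hopping) with a square-root voltage dependence in both field-enhancement exponents, energies in eV, temperature in kelvin, and non-negative voltage. The units are not stated on the slides, so these are assumptions. The two parameter sets are the first and last rows of the list above.

```python
import math

K_B_EV = 8.617333262e-5  # Boltzmann constant in eV/K (assumes energies in eV)


def memristor_current(v, t, a_sch, e_sch, b_sch, a_fp, e_fp, b_fp):
    """Two parallel thermally activated terms; v >= 0 in volts, t in kelvin."""
    kt = K_B_EV * t
    root_v = math.sqrt(v)
    schottky = a_sch * t ** 2 * math.exp(-e_sch / kt) * math.exp(b_sch * root_v / kt)
    frenkel_poole = a_fp * v * math.exp(-e_fp / kt) * math.exp(b_fp * root_v / kt)
    return schottky + frenkel_poole


# First (lowest-resistance) and last (highest-resistance) listed states.
LOW_R = dict(a_sch=1.3959e-11, e_sch=0.069431, b_sch=0.26648,
             a_fp=0.0079649, e_fp=0.0, b_fp=0.0)
HIGH_R = dict(a_sch=2.0619e-6, e_sch=0.78745, b_sch=0.36732,
              a_fp=1.3172e-9, e_fp=0.012179, b_fp=0.017843)

i_low = memristor_current(0.2, 300.0, **LOW_R)
i_high = memristor_current(0.2, 300.0, **HIGH_R)
print(f"low-R state: {i_low:.3e} A, high-R state: {i_high:.3e} A")
```

Under these assumptions the lowest-resistance state is dominated by the term linear in V, while the high-resistance state is strongly non-linear and temperature dependent.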
12
Multilevel (analog) capability
[Figure: three panels of resistance (Ω, 0 to 2.0x10^6) vs. cycle (0 to 3000)]
8 levels – 2kΩ-200kΩ
16 levels – 2kΩ-2MΩ
32 levels – 2kΩ-2MΩ
1% tolerance – limited by current measuring electronics used
Repeated access to each target resistance level
Getting to each target level used closed-loop programming
13
Developing a demonstration platform
Tape-out
[Figure: example 4x4 matrix schematic with lines TE(1, 2, 3, 4, ..., n), G(1, 2, 3, 4, ..., n), S(1, 2, 3, 4, ..., n) and Gnd pads; wafer image with BE, TE, and transistor gate labeled; layout design of 1T1R arrays]
Array sizes: 128x64, 64x64, 32x32, 16x16
Initial choice to use a 1T1R array
Allows faster and more accurate memristor programming
All transistors turned ON for DPE computation (use depletion mode transistors)
14
Back End Of Line (BEOL) fabrication of memristors
15
After BEOL memristor-transistor integration
Individual transistor tests
1Transistor-1Memristor, Multilevel control
16
32 levels in BEOL fabricated TaOx memristors
[Figure: number of training attempts (2 to 18) vs. target current (2.0x10^-5 to 1.0x10^-4 A), for 8, 16, and 32 levels]
[Figure: programmed current (2.0x10^-5 to 1.1x10^-4 A) vs. level (2 to 32)]
Programming within 1%
Programming with 1µs pulses
1T1R allows rapid and precise conductance tuning
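The pulse-and-verify idea behind closed-loop programming can be sketched as a loop that reads the cell, compares against the target, and applies another pulse until the read current is within tolerance. Everything about the device here is hypothetical: `ToyCell` is a made-up model in which each pulse closes a random 30-70% of the remaining error, and the function and parameter names are assumptions. Only the 1% tolerance and the 20-100 µA current range come from the slides.

```python
import random


def program_closed_loop(read_current, apply_pulse, target,
                        tolerance=0.01, max_pulses=50):
    """Pulse and verify until the read current is within tolerance of target.

    Returns the number of pulses applied.
    """
    pulses = 0
    while True:
        error = target - read_current()
        if abs(error) <= tolerance * target:
            return pulses
        if pulses >= max_pulses:
            raise RuntimeError("target not reached within max_pulses")
        apply_pulse(error)
        pulses += 1


class ToyCell:
    """Hypothetical device: each pulse closes 30-70% of the remaining error."""

    def __init__(self, current, seed=0):
        self.current = current
        self._rng = random.Random(seed)

    def read(self):
        return self.current

    def pulse(self, error):
        self.current += self._rng.uniform(0.3, 0.7) * error


cell = ToyCell(current=20e-6)
pulses = program_closed_loop(cell.read, cell.pulse, target=100e-6)
print(pulses, cell.current)
```

The number of attempts depends on how close each pulse gets to the target, which is why finer level spacing tends to need more training attempts.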
Major challenge – fluctuations in resistance levels
Random Telegraph Noise (RTN) is ubiquitous in nanoscale atomic/electronic systems
Drift and RTN measured over 2 seconds while bias applied
Fluctuations depend on resistance level, voltage applied, and time scale
Up to 20% for high resistance states
But <0.4% for <10 kΩ
[Figure: fluctuation of G/G_Initial (0 to 0.25) vs. G_Initial (10^-6 to 10^-3, log scale), measured at -0.1 V and -0.2 V]
Shinhyun Choi, Yuchao Yang, and Wei Lu, “Random Telegraph Noise and Resistance Switching Analysis of Oxide Based Resistive Memory”, Nanoscale, 2014, 6, 400-404.
18
Putting the pieces together: Signal flow of the DPE demonstrator
[Figure: demonstrator components: workstation (running application), control boards, probe card PCB, cantilever probes, memristor 1T1R array]
Needs to drive two modes of operation:
1) Dot-product computation
2) Programming the matrix (analog, not binary values!)
19
Overall performance specifications
• Computation (DPE) time – 100ns
• Ultimate limit will be parasitic RC time <10ns
• Can utilize trade-off with ADC conversion time vs needed accuracy
• Voltage range: -10 V to +10V, Current read: 10nA to 2.5mA
• Currently measures 128x64 arrays
• Extensible design: just add more column or row boards
• Microprocessor configures, programs and sets registers for DPE operation
• Dot-product computation time (with pre-configuring) < 100ns (10MHz)
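A back-of-the-envelope throughput estimate follows from the 100 ns computation time. The sketch assumes each cross-point contributes one multiply and one add per dot-product operation (2·rows·cols operations); that counting convention is an assumption, and the function name is illustrative.

```python
def dpe_throughput_gops(n_rows, n_cols, compute_time_s=100e-9):
    """Estimated throughput in GOPS for one crossbar read.

    Assumes 2 operations (multiply + accumulate) per cross-point.
    """
    ops = 2 * n_rows * n_cols
    return ops / compute_time_s / 1e9


for rows, cols in [(128, 64), (128, 128), (256, 256), (512, 512)]:
    print(f"{rows}x{cols}: {dpe_throughput_gops(rows, cols):.2f} GOPS")
```

Because the read time does not grow with array size in this idealized estimate, throughput scales with the number of cross-points.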
20
Programming the array
How do we map numbers to conductances?
[Figure: final DPE bit-accuracy (0 to 6) vs. number of rows (array = N x N, 0 to 300) for Linear Mapping (LM)]
Device non-linearity and finite wire resistances kill the accuracy
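A minimal sketch of linear mapping, assuming an ideal crossbar: matrix values are scaled and shifted onto the available conductance window, and the known scale and offset are removed from the measured currents afterwards. The function names and the example matrix are assumptions; the conductance window corresponds to the 2 kΩ to 2 MΩ range shown earlier. In a real array, the non-linearity and wire resistance noted above make the recovered values deviate from the ideal ones.

```python
def linear_map(matrix, g_min, g_max):
    """Map matrix values linearly onto [g_min, g_max] (siemens)."""
    flat = [x for row in matrix for x in row]
    w_min, w_max = min(flat), max(flat)
    if w_max == w_min:
        raise ValueError("matrix values are all equal; mapping is undefined")
    scale = (g_max - g_min) / (w_max - w_min)
    offset = g_min - scale * w_min
    conductances = [[scale * x + offset for x in row] for row in matrix]
    return conductances, scale, offset


def ideal_currents(voltages, conductances):
    """Ideal column currents I_j = sum_i G[i][j] * V[i]."""
    n_cols = len(conductances[0])
    return [sum(row[j] * v for row, v in zip(conductances, voltages))
            for j in range(n_cols)]


def recover_product(currents, voltages, scale, offset):
    """Undo the mapping: sum_i W[i][j] V[i] = (I_j - offset * sum(V)) / scale."""
    v_sum = sum(voltages)
    return [(i - offset * v_sum) / scale for i in currents]


W = [[-1.0, 0.5],
     [2.0, 0.0]]                       # hypothetical matrix values
V = [0.1, 0.2]                         # input voltages
G, scale, offset = linear_map(W, g_min=1 / 2e6, g_max=1 / 2e3)
result = recover_product(ideal_currents(V, G), V, scale, offset)
print(result)
```

The offset term lets negative matrix values be represented with strictly positive conductances, at the cost of subtracting a known input-dependent baseline.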
© Copyright 2014 Hewlett-Packard Development Company, L.P. The information contained herein is subject to change without notice.
Gradient-based conversion algorithm
Gradient-based optimization
– Evaluate Kirchhoff's Current Law (KCL) at every cross-point
– Tweak conductance to minimize error between actual current and ideal current
– Pre-calculated Jacobian of the crossbar for faster optimization
[Figure panels: desired matrix W; ideal current at each node; calibrated conductance G'; remaining absolute error (I_Error << 1%)]
M. Hu, et al., ICCAD (2015)
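The "tweak conductance until the actual current matches the ideal current" step can be illustrated with a deliberately tiny toy problem. This is not the crossbar-wide algorithm from the cited work: it assumes a single device in series with one lumped wire resistance, and uses a Newton-style update with the analytic derivative standing in for the pre-calculated Jacobian. All names and values other than the 0.25 V calibration voltage are assumptions.

```python
def cell_current(g, v, r_series):
    """Current through a device of conductance g in series with r_series."""
    return v * g / (1.0 + g * r_series)


def calibrate_conductance(g_target, v, r_series, iterations=20):
    """Adjust g so the actual current equals the ideal current v * g_target.

    Requires g_target * r_series < 1, otherwise no conductance can
    deliver the ideal current through the series resistance.
    """
    i_ideal = v * g_target
    g = g_target
    for _ in range(iterations):
        error = cell_current(g, v, r_series) - i_ideal
        slope = v / (1.0 + g * r_series) ** 2   # d(current)/d(g)
        g -= error / slope
    return g


G_TARGET = 1e-4      # siemens (hypothetical)
V_CAL = 0.25         # volts, calibration voltage
R_WIRE = 1000.0      # ohms, exaggerated to make the effect visible
g_cal = calibrate_conductance(G_TARGET, V_CAL, R_WIRE)
print(g_cal, cell_current(g_cal, V_CAL, R_WIRE))
```

The calibrated conductance comes out higher than the target, compensating for the voltage dropped across the wire.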
Improvements with Conversion algorithm
• Test case: 128x128 crossbar, 10 ohm wire segment, calibrated at 0.25 V
[Figure: histogram of actual value / ideal value for DCT matrix (axis 0.98 to 1.02); Std = 0.0045]
M. Hu, et al., ICCAD (2015)
23
Noise – The bane of the analog world
Sources of Noise
Johnson noise: $I_{RMS} = \sqrt{4kT\Delta f / R}$
Shot noise: $I_{RMS} = \sqrt{4qI\Delta f}$
Random telegraph noise (RTN): $G \rightarrow G + \Delta G$
Wire block resistance
Output resistance
Input resistance
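To get a feel for the magnitudes, the two RMS expressions can be evaluated for a representative operating point. The operating point is an assumption: a 10 kΩ device read at 0.25 V with a 10 MHz bandwidth at 300 K. The shot-noise prefactor of 4 follows the slide; the common textbook form uses 2qIΔf, which would lower that value by a factor of √2 (pass `prefactor=2.0` to compare).

```python
import math

K_B = 1.380649e-23      # Boltzmann constant, J/K
Q_E = 1.602176634e-19   # elementary charge, C


def johnson_noise_rms(resistance, bandwidth, temperature=300.0):
    """RMS thermal noise current: sqrt(4 k T df / R)."""
    return math.sqrt(4 * K_B * temperature * bandwidth / resistance)


def shot_noise_rms(current, bandwidth, prefactor=4.0):
    """RMS shot noise current: sqrt(prefactor * q * I * df)."""
    return math.sqrt(prefactor * Q_E * current * bandwidth)


R = 10e3                 # ohms (assumed device resistance)
V_READ = 0.25            # volts (assumed read voltage)
BANDWIDTH = 10e6         # hertz (assumed)
signal = V_READ / R      # 25 microamps

i_johnson = johnson_noise_rms(R, BANDWIDTH)
i_shot = shot_noise_rms(signal, BANDWIDTH)
print(f"signal {signal:.2e} A, Johnson {i_johnson:.2e} A, shot {i_shot:.2e} A")
```

Under these assumptions both noise currents are several orders of magnitude below the signal current.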
Impact of Johnson and Shot noise negligible
Example: 256x256 array
Repeated simulations randomizing the input vector and sampling noise distribution
Error is dominated by nonlinearity and finite wire resistances
[Figure: two histograms of occurrence, labeled "With noise" and "Without noise"; standard deviations 0.0046994 and 0.0040643]
Impact of RTN Resistance fluctuations on DPE Accuracy
[Figure: standard deviation of DPE result vs. size of fluctuation Δ (0 to 0.3), where G → G(1 ± Δ)]
M. Hu, et al., ICCAD (2015)
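A small Monte Carlo sketch of this kind of experiment, assuming an ideal crossbar in which every device independently takes the value G(1 + Δ) or G(1 − Δ) on each read. The array size, value ranges, trial count, and function name are assumptions for illustration; this is not the simulation from the cited work, which also includes non-linearity and wire resistance.

```python
import random
import statistics


def rtn_result_spread(n, delta, trials=200, seed=0):
    """Std. dev. of (noisy output / ideal output) for an n x n ideal crossbar."""
    rng = random.Random(seed)
    g = [[rng.uniform(0.1, 1.0) for _ in range(n)] for _ in range(n)]
    ratios = []
    for _ in range(trials):
        v = [rng.uniform(0.0, 1.0) for _ in range(n)]
        for j in range(n):
            ideal = sum(g[i][j] * v[i] for i in range(n))
            noisy = sum(
                g[i][j] * (1 + rng.choice((-delta, delta))) * v[i]
                for i in range(n)
            )
            ratios.append(noisy / ideal)
    return statistics.pstdev(ratios)


for delta in (0.0, 0.1, 0.3):
    print(delta, rtn_result_spread(16, delta))
```

Because each output sums many independently fluctuating devices, the relative spread of the result is smaller than the per-device fluctuation Δ.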
26
Simulation of a Neural network - MNIST
Network: data (1×784) → w1 (784×500) + Bias_w1 → softmax → w2 (500×500) + Bias_w2 → softmax → w3 (500×2000) + Bias_w3 → softmax → w_class (2000×10) + Bias_w_class → maxout → Result
Softmax: y = 1/(1+exp(-x)) (i.e., the logistic sigmoid)
Algorithm
Break problem across 128x128 arrays with analog buffers + biasing
[Figure: layer w1 (784×500) split across crossbar tiles (Xbar(1,4), Xbar(1,5), ...); analog signal from previous layer → crossbars → analog processing (sum + bias + softmax) → analog signal to next layer]
Random Telegraph Noise
[Figure: total errors out of 10,000 (100 to 200) vs. RTN fluctuation (0% to 30%); curves for Min_Error, Max_Error, Avg_Error]
Even with 30% noise fluctuations in every device, worst case error goes from 1.4% → 1.9%
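Splitting each weight matrix across 128x128 arrays can be sketched as follows. The tile count simply rounds each dimension up to a multiple of the tile size; bias rows, buffers, and any negative-weight encoding are ignored, which is an assumption. `tiled_matvec` shows that summing partial results from the tiles reproduces the full vector-matrix product. All function names are illustrative.

```python
import math


def tiles_needed(rows, cols, tile=128):
    """Number of tile x tile crossbars needed to hold a rows x cols matrix."""
    return math.ceil(rows / tile) * math.ceil(cols / tile)


def tiled_matvec(vector, matrix, tile):
    """Vector-matrix product accumulated tile by tile."""
    rows, cols = len(matrix), len(matrix[0])
    out = [0.0] * cols
    for r0 in range(0, rows, tile):
        for c0 in range(0, cols, tile):
            # One crossbar tile: partial sums over its rows, for its columns.
            for j in range(c0, min(c0 + tile, cols)):
                out[j] += sum(matrix[i][j] * vector[i]
                              for i in range(r0, min(r0 + tile, rows)))
    return out


LAYERS = {"w1": (784, 500), "w2": (500, 500),
          "w3": (500, 2000), "w_class": (2000, 10)}
tile_counts = {name: tiles_needed(r, c) for name, (r, c) in LAYERS.items()}
print(tile_counts, sum(tile_counts.values()))

small = tiled_matvec([1, 2, 3], [[1, 2], [3, 4], [5, 6]], tile=2)
print(small)
```

Under these assumptions, w1 needs 7 x 4 = 28 tiles, and partial sums from tiles sharing the same output columns are added together before the bias and activation are applied.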
27
Comparison to a 22nm digital ASIC circuit
Speed and energy comparison between DPE and state of the art 8bit ASIC*
* Hsu, S. K., et al., IEEE Journal of Solid-State Circuits, 48, 118 (2013)
[Figure: GOPS (log scale, 10^-1 to 10^4) vs. crossbar size N×N (32 to 512) for DPE; ASIC, max speed; ASIC, max efficiency. Annotation: 100x gain]
[Figure: GOPS/W (log scale, 10^3 to 10^6) vs. crossbar size N×N (32 to 512). Annotation: 10-100x gain]
Power consumption dominated by peripherals (TIA, ADC, DAC) – room for optimization of circuitry
Results and future work
Quantified noise/fluctuations, impact of nonlinearity and parasitics
Robust Neural network classification accuracy despite device state noise
Outperforms state-of-the-art Digital ASIC
Throughput 100x improvement
GOPS/Watt 10x improvement
Working on full DPE experimental demonstrations and further application-level benchmarking
Thanks to the DPE Team
Nora Davilla
Emma Merced Grafals
Ning Ge
Cat Graves
Miao Hu
Sity Lam
Eric Montgomery
Stan Williams
Joshua Yang (UMass)
IARPA Program Manager: Karl Roenigk