14.5: ENVISION: A 0.26-to-10 TOPS/W Subword-Parallel Dynamic-Voltage-Accuracy-Frequency-Scalable CNN Processor in 28nm FDSOI
© 2017 IEEE International Solid-State Circuits Conference 1 of 56
ENVISION: A 0.26-to-10 TOPS/W Subword-Parallel Dynamic-Voltage-Accuracy-Frequency-Scalable CNN Processor in 28nm FDSOI
Bert Moons, Roel Uytterhoeven, Wim Dehaene, Marian Verhelst
ESAT/MICAS - KU Leuven
Embedded Neural Networks
Applications: Augmented Reality, Face Recognition, Artificial Intelligence
[Figure: cloud GPU processing; labels: Raw Data, Information]
Embedded Neural Networks
Applications: Augmented Reality, Face Recognition, Artificial Intelligence
Local Processing
1-to-10 TOPS/W CNN processing is crucial for always-on embedded operation.
Always-on Neural Networks
Large-scale, highly accurate CNNs are too expensive for embedded always-on operation.
VGG-16 recognition on LFW*
• Classes: 5760
• Accuracy: 92.5%
• Complexity: 15.4 GMACs
• Model size: 15 MB
• Processing energy per frame @ 1 TOPS/W: ~30 mJ/f
• Power @ 30 fps: ~900 mW
• Battery: 1200 mAh at 1.5 V, drains in 2 h
[*] Labeled Faces in the Wild data set
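The energy, power and battery-life figures on this slide can be reproduced with a short calculation. The sketch below uses the slide's numbers as inputs and assumes that one MAC counts as two operations; that convention is an assumption, not something stated on the slide.

```python
# Inputs taken from the slide.
GMACS_PER_FRAME = 15.4        # VGG-16 complexity
EFFICIENCY_TOPS_PER_W = 1.0   # efficiency used on the slide
FRAMES_PER_SECOND = 30
BATTERY_MAH = 1200
BATTERY_VOLT = 1.5

# Assumption (not on the slide): 1 MAC = 2 operations (multiply + add).
OPS_PER_MAC = 2

ops_per_frame = GMACS_PER_FRAME * 1e9 * OPS_PER_MAC
energy_per_frame_j = ops_per_frame / (EFFICIENCY_TOPS_PER_W * 1e12)
power_w = energy_per_frame_j * FRAMES_PER_SECOND
battery_wh = BATTERY_MAH / 1000 * BATTERY_VOLT
runtime_h = battery_wh / power_w

print(f"{energy_per_frame_j * 1e3:.1f} mJ/frame")
print(f"{power_w * 1e3:.0f} mW at {FRAMES_PER_SECOND} fps")
print(f"battery lasts {runtime_h:.1f} h")
```

Under these assumptions the result is about 31 mJ per frame, a little over 0.9 W at 30 fps, and just under 2 h of battery life, in line with the rounded values on the slide.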
Presentation Outline
A. 1. Hierarchical Recognition
   2. DVAFS: Dynamic-Voltage-Accuracy-Frequency-Scaling
B. 1. Hardware Implementation
   2. Results
Hierarchical recognition
Hierarchical processing enables always-on CNN-based visual recognition
Hierarchical Face Recognition
Hierarchical processing enables always-on compute
[Figure: cascade Face detected? -> Owner detected? -> Friend detected? -> Large-scale recognition, with Y/N branches at each decision; stage complexities 6 MMACs, 12 MMACs, 500 MMACs, 15.4 GMACs.]
Increasing # classes / network size / FP precision / energy per frame along the cascade:
Stage                     Active       Network   Complexity   Model size   Values = 0   Precision   Accuracy
Face detected?            Always-on    CONV-1    6 MMACs      22 kB        5-44%        2-4b ops    94%
Owner detected?           ~1% on       CONV-2    12 MMACs     42 kB        8-45%        3-4b ops    96%
Friend detected?          ~0.1% on     CONV-3    500 MMACs    742 kB       8-47%        4-6b ops    94%
Large-scale recognition   ~0.01% on    CONV-4    15 GMACs     15 MB        5-82%        4-6b ops    92.5%
DVAFS: Dynamic-Voltage-Accuracy-Frequency-Scaling
A run-time energy-vs-computational-precision trade-off
Precision Scaling - DVAS
DVAS: Dynamic-Voltage-Accuracy-Scaling [4]
[Figure: standard multiplier computing x3x2x1x0 * y3y2y1y0. The LSBs of both inputs are gated (x1, x0, y1, y0 can be forced to 0), so the product becomes z3z2z1z0 0000.]
As in [4] Moons, VLSI2016; Moons, JSSC2016
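The LSB gating shown in the figure can be illustrated in software. This is a minimal behavioural sketch for unsigned operands only; the function names `gate_lsbs` and `dvas_multiply` are hypothetical, and the sketch says nothing about how the circuit in [4] is built or how much energy it saves.

```python
def gate_lsbs(value: int, width: int, kept_bits: int) -> int:
    """Force the (width - kept_bits) least-significant bits of an unsigned value to 0."""
    if not 0 < kept_bits <= width:
        raise ValueError("kept_bits must be in 1..width")
    dropped = width - kept_bits
    return (value >> dropped) << dropped


def dvas_multiply(x: int, y: int, width: int = 4, kept_bits: int = 2) -> int:
    """Multiply two unsigned width-bit inputs after gating their LSBs."""
    return gate_lsbs(x, width, kept_bits) * gate_lsbs(y, width, kept_bits)


# Full precision: nothing is gated, 11 * 6.
full = dvas_multiply(0b1011, 0b0110, kept_bits=4)
# Reduced precision: the two LSBs of each input are gated, 0b1000 * 0b0100.
scaled = dvas_multiply(0b1011, 0b0110, kept_bits=2)
```

With two of four bits kept on each input, the four low bits of the product are always zero, matching the `z3z2z1z0 0000` pattern in the figure.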
Precision Scaling - DVAFS
DVAFS: Dynamic-Voltage-Accuracy-Frequency-Scaling
[Figure: subword-parallel multiplier. The multiplier computes two independent reduced-precision products at once: x11x01 * y11y01 = z31z21z11z01 and x10x00 * y10y00 = z30z20z10z00.]
DVAFS is a dynamic precision technique, lowering all run-time adaptable parameters: activity, frequency and supply voltage
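Subword-parallel multiplication can be sketched behaviourally as follows. The helper names are hypothetical, operands are treated as unsigned, and the frequency helper only expresses the idea that N products per cycle allow an N times lower clock at constant throughput; it is an illustration, not the processor's control logic.

```python
def split_subwords(word: int, n: int, total_bits: int = 16) -> list[int]:
    """Split an unsigned total_bits-wide word into n equal subwords, LSB subword first."""
    if total_bits % n != 0:
        raise ValueError("total_bits must be divisible by n")
    sub_bits = total_bits // n
    mask = (1 << sub_bits) - 1
    return [(word >> (i * sub_bits)) & mask for i in range(n)]


def subword_multiply(x_word: int, y_word: int, n: int, total_bits: int = 16) -> list[int]:
    """Return the n independent products of matching subwords of x_word and y_word."""
    xs = split_subwords(x_word, n, total_bits)
    ys = split_subwords(y_word, n, total_bits)
    return [x * y for x, y in zip(xs, ys)]


def clock_for_constant_throughput(f_nominal_mhz: float, n: int) -> float:
    """Clock needed to keep words/second constant when n products finish per cycle."""
    return f_nominal_mhz / n


# Two 8b subwords per 16b word: x = (3, 5), y = (7, 2), written MSB subword first.
x_word = (3 << 8) | 5
y_word = (7 << 8) | 2
products = subword_multiply(x_word, y_word, n=2)  # [5 * 2, 3 * 7]
```

In this sketch the N = 1 case degenerates to an ordinary 16b multiplication, and N = 2 yields two 8b products from the same pair of input words.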
Precision Scaling – System Level
DVAFS outperforms DVAS as it minimizes non-compute overheads at low precision
[Figure: energy/word breakdown into CTRL & transfer, compute and memory (compute vs. overhead), shown at high precision and at low precision for DVAS and DVAFS.]
[Figure: relative energy per operation [-] versus precision [bits] at T = 76 GOPS; annotated reductions: 8x in DVAS, 20x in DVAFS.]
Precision Scaling – BB in FDSOI
• DVAFS modulates leakage-vs-dynamic balance
• Body-Bias tuning allows minimizing energy
[Figure: power @ f split into dynamic and leakage, at BBnom and BBoptimal, with the dominant component marked. High precision: reduce VT, V @ constant (V - VT) and f. Low precision: increase VT, V @ constant (V - VT) and f.]
Processor Architecture
Exploits:
A. Parallelism and data reuse;
B. Network sparsity;
C. Varying precision through DVAFS.
Optimization: CNN Characteristics (A)
• Convolution operators are highly parallel
• Algorithm allows inherent data reuse
Three types of reuse supported in Envision [3]:
• Convolutional reuse: one filter, one image
• Image reuse: one image, multiple filters
• Filter reuse: one filter, multiple images
[3] Chen, ISSCC2016
Optimization: CNN Characteristics (B,C)
• CNN weights and activations are sparse.
• Precision varies between apps, networks, layers
Sparsity (ReLU activations), network sparsity:
  LeNet-5      26-87%
  AlexNet      5-90%
  VGG          5-82%
Varying precision, network precision (non-uniform precision @ 99*% relative benchmark accuracy):
  LeNet-5      1-5 bits
  AlexNet      4-9 bits
  VGG (*95%)   4-6 bits
A 2D-SIMD DVAFS Architecture
[Figure: block diagram. RISC CTRL with PM; DMA with IO en/de-coder; data memories DMA, DMB, DMC, DMD; GRD memories; input processing; 2D-SIMD MAC array; 1D-SIMD ALU: ReLU, max-pool, MAC.]
A 2D-SIMD DVAFS Architecture
[Figure: Filter * Image = Partial Sum]
Data reuse, built up over successive slides (M = MAC unit):
• No reuse in scalar solution: 1 feature * 1 weight; filter SRAM 1x16b, feature SRAM 1x16b.
• Convolutional reuse in 1D-SIMD: 16 features * 1 weight; filter SRAM 1x16b, feature SRAM 16x16b / 1x16b; a FIFO is added in front of the MAC row.
• Convolutional + image reuse in 2D-SIMD: 16 features * 16 weights; filter SRAM 16x16b, feature SRAM 16x16b / 1x16b, with FIFO.
A 2D-SIMD DVAFS Architecture
Cnv. + image + filter reuse in 2D-SIMD DVAFS
[Figure: 2D array of MAC units (M) with FIFO, each MAC split into N subword MACs; filter SRAM 16x(Nx16b/N), feature SRAM 16x(Nx16b/N) / 1x(Nx16b/N): 16N features * 16N weights. SR = status register.]
Subword modes:
  N = 1   1x16b operands   48b accumulate       256 MAC units
  N = 2   2x8b operands    2x24b accumulate     512 MAC units    (parts of the multiplier marked unused)
  N = 4   4x4b operands    4x12b accumulate     1024 MAC units   (parts of the multiplier marked unused)
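The three subword modes follow a simple pattern: operand and accumulator widths divide by N while the number of MAC units multiplies by N. The sketch below encodes that pattern; the `SubwordMode` type and the 16x16 array dimensions are assumptions chosen to match the 256 MAC units quoted for N = 1.

```python
from dataclasses import dataclass

WORD_BITS = 16          # operand word width from the slides
ACCUMULATOR_BITS = 48   # accumulator width from the slides
ARRAY_ROWS = 16         # assumption: 16 x 16 = 256 physical MAC units
ARRAY_COLS = 16


@dataclass(frozen=True)
class SubwordMode:
    n: int
    operand_bits: int
    accumulator_bits: int
    mac_units: int


def subword_mode(n: int) -> SubwordMode:
    """Describe the N-way subword-parallel configuration of the MAC array."""
    if n not in (1, 2, 4):
        raise ValueError("the slides only show N = 1, 2 and 4")
    return SubwordMode(
        n=n,
        operand_bits=WORD_BITS // n,
        accumulator_bits=ACCUMULATOR_BITS // n,
        mac_units=ARRAY_ROWS * ARRAY_COLS * n,
    )


modes = [subword_mode(n) for n in (1, 2, 4)]
for mode in modes:
    print(mode)
```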
A 2D-SIMD DVAFS Architecture
Guard SRAM and 2D-array from sparse operators [4]
[Figure: 2D-SIMD MAC array with FIFO, filter SRAM and feature SRAM, plus a GRD SRAM holding guard flags (0 ... 1) along both dimensions of the array.]
[4] Moons, VLSI 2016
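Operand guarding can be illustrated with a dot product that skips every multiply-accumulate whose feature or weight is flagged as zero. This is a behavioural sketch with hypothetical names; in hardware the flags would come from the GRD SRAM and would also suppress the memory fetch, which the sketch does not model.

```python
def nonzero_flags(values):
    """One guard flag per value: 1 if the value is non-zero, 0 otherwise."""
    return [1 if v != 0 else 0 for v in values]


def guarded_dot(features, weights):
    """Dot product that only executes MACs whose two guard flags are both set."""
    feature_flags = nonzero_flags(features)
    weight_flags = nonzero_flags(weights)
    acc = 0
    executed = 0
    for f, w, ff, wf in zip(features, weights, feature_flags, weight_flags):
        if ff and wf:          # guard: skip the MAC when either operand is zero
            acc += f * w
            executed += 1
    return acc, executed


acc, executed = guarded_dot([3, 0, 2, 0], [1, 5, 0, 4])
```

The result is identical to the unguarded dot product, but only one of the four MACs is executed in this example.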
Flexible Memory / IO compression
[Figure: block diagram. RISC CTRL with PM; DMA with IO en/de-coder; data memories DMA, DMB, DMC, DMD; GRD memories; input processing; 2D-SIMD MAC array; 1D-SIMD ALU: ReLU, max-pool, MAC.]
As in [4] Moons, VLSI2016:
• C-programmable
• 16b instructions
• Huffman-based IO compression, up to 5.8x in AlexNet
• 16 kB PM
• 128 kB DM
  o 3-wise parallel acc.
• 4 kB GRD SRAM
  o sparsity flags
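Why Huffman coding helps on sparse data can be shown with a generic textbook Huffman construction. The sketch below is not the encoder used in the chip, whose symbol set and codebook are not given on the slides; the input data and the 16b raw word size are illustrative assumptions.

```python
import heapq
from collections import Counter


def huffman_code_lengths(symbols):
    """Return {symbol: code length in bits} for a Huffman code over the input."""
    counts = Counter(symbols)
    if len(counts) == 1:
        return {next(iter(counts)): 1}
    # Heap entries: (count, tie-breaker, symbols in this subtree).
    heap = [(c, i, [s]) for i, (s, c) in enumerate(counts.items())]
    heapq.heapify(heap)
    lengths = dict.fromkeys(counts, 0)
    tie = len(heap)
    while len(heap) > 1:
        c1, _, s1 = heapq.heappop(heap)
        c2, _, s2 = heapq.heappop(heap)
        for s in s1 + s2:      # every symbol below the merged node gains one bit
            lengths[s] += 1
        heapq.heappush(heap, (c1 + c2, tie, s1 + s2))
        tie += 1
    return lengths


def compressed_bits(symbols):
    """Total payload size in bits when each symbol uses its Huffman code."""
    lengths = huffman_code_lengths(symbols)
    return sum(lengths[s] for s in symbols)


# Illustrative sparse data: mostly zeros, stored as 16b words when uncompressed.
data = [0] * 12 + [1, 1, 2, 3]
raw_bits = 16 * len(data)
ratio = raw_bits / compressed_bits(data)
```

The frequent zero symbol receives the shortest code, so the compression ratio grows with sparsity; the codebook itself is not counted in this sketch.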
Physical Implementation
Efficiency and scalability through granular power and body-bias domains
Physical Implementation – 28 FDSOI
[Figure: block diagram annotated with supply and body-bias domains: VMEM / BBGND, V2D / BB1, VCTRL / BB2.]
[Figure: die photo, 1.29 mm x 1.45 mm (1.87 mm2), showing the 2D-SIMD MAC array, RISC and DMA, and MEM.]
Measurement Results
Efficiencies from 0.25-to-10 TOPS/W depending on Precision and Network Sparsity
Measurement Results
[Figure: efficiency [TOPS/W] and supply voltage [V] versus throughput [GOPS] at BBnom, for modes 1x16b, 2x8b, 4x4b and 30-60% sparse 4x3-4b.]
Annotated operating points at BBnom:
  1x16b                   0.25 TOPS/W   1.05 V
  2x8b                    1 TOPS/W      0.8 V
  4x4b                    4 TOPS/W      0.67 V
  30-60% sparse 4x3-4b    8.2 TOPS/W    0.61 V
Measurement Results
[Figure: efficiency [TOPS/W] and voltage [V] versus throughput [GOPS] at BBnom and BBopt, for modes 1x16b, 2x8b, 4x4b and 30-60% sparse 4x3-4b, with power @ f, T bars (labelled L and D) for BBnom and BBopt.]
Body-bias tuning, annotated operating points:
  BBnom = +/- 0.6 V, V = 0.85 V: 0.33 TOPS/W; BBopt = +/- 1.2 V, V = 0.70 V: 0.53 TOPS/W (1.6x)
  BBnom = +/- 0.6 V, V = 0.61 V: 8.2 TOPS/W; BBopt = +/- 0.2 V, V = 0.63 V: 10 TOPS/W (1.2x)
Annotated overall range: 40x
Hierarchical Face Recognition Revisited
Hierarchical processing enables always-on compute
• Stage 1 (always-on): 3 uJ/f, 2-4b CONV, 4.2 TOPS/W
• Stage 2 (~1% on): 6 uJ/f CONV, 4 TOPS/W
• Stage 3 (~0.1% on): 500 uJ/f CONV, 1.8 TOPS/W
• Stage 4 (~0.01% on): 23100 uJ/f, 4-6b CONV, 1.3 TOPS/W
This functionality is always-on at 6 uJ/frame average CONV-layer energy consumption.
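The 6 uJ/frame average follows from weighting each stage's CONV energy by how often that stage runs. The sketch below uses the slide's values and treats the approximate duty cycles ("~1%", "~0.1%", "~0.01%") as exact, which is an assumption.

```python
# (fraction of frames on which the stage runs, CONV energy per frame in uJ)
stages = [
    (1.0,    3.0),      # stage 1, always-on
    (0.01,   6.0),      # stage 2, ~1% on
    (0.001,  500.0),    # stage 3, ~0.1% on
    (0.0001, 23100.0),  # stage 4, ~0.01% on
]

# Expected CONV energy per frame for the whole cascade.
average_uj = sum(duty * energy for duty, energy in stages)

# Reference: running the largest network on every frame.
always_large_uj = stages[-1][1]
saving = always_large_uj / average_uj

print(f"average: {average_uj:.2f} uJ/frame, {saving:.0f}x below always-on stage 4")
```

Under these assumptions the average is just under 6 uJ per frame, several thousand times less than running the largest network on every frame.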
Comparison
A. Highest scalability of Energy-vs-Computational Precision (40x)
B. Efficiencies up to 10 TOPS/W
Comparison with SotA
                         Eyeriss [3], ISSCC '16      Moons [4], VLSI '16        This work (N = 1, 2 or 4)
Technology               65nm LP                     40nm LP                    28nm FDSOI
fnom                     200 MHz                     200 MHz                    200 MHz
V @ fnom                 1 V                         1.1 V                      1 V
Peak GOPS                67                          102                        N x 102
ANet CONV                278 mW @ 35 fps             76 mW @ 47 fps             44 mW @ 47 fps
VGG CONV                 -                           -                          26 mW @ 1.7 fps
Power [mW] @ GOPSnom     235-332 (1.5x) @ 46 GOPS    35-300 (8.5x) @ 80 GOPS    7.5-300 (40x) @ 76 GOPS
Min. eff.                0.17 TOPS/W                 0.27 TOPS/W                0.25 TOPS/W
Max. eff.                0.25 TOPS/W                 2.60 TOPS/W                10.0 TOPS/W
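The "This work" column can be cross-checked from numbers elsewhere in the deck. The sketch assumes 256 MAC units at N = 1 (from the architecture slides) and two operations per MAC; the latter is a common convention but is not stated in the table.

```python
MAC_UNITS_N1 = 256   # MAC units at N = 1, from the architecture slides
F_NOM_HZ = 200e6     # fnom from the comparison table
OPS_PER_MAC = 2      # assumption: 1 MAC = multiply + add


def peak_gops(n: int) -> float:
    """Peak throughput in GOPS for N-way subword-parallel operation."""
    return MAC_UNITS_N1 * n * OPS_PER_MAC * F_NOM_HZ / 1e9


def efficiency_tops_per_w(gops: float, power_mw: float) -> float:
    """Energy efficiency in TOPS/W for a given throughput and power."""
    return (gops * 1e9) / (power_mw * 1e-3) / 1e12


min_eff = efficiency_tops_per_w(76, 300)   # highest power in the table
max_eff = efficiency_tops_per_w(76, 7.5)   # lowest power in the table

print(f"peak: {peak_gops(1):.1f} GOPS at N = 1")
print(f"efficiency: {min_eff:.2f} to {max_eff:.1f} TOPS/W ({max_eff / min_eff:.0f}x)")
```

The computed values agree with the table's N x 102 GOPS peak, its 0.25 and 10.0 TOPS/W bounds, and the 40x power range at 76 GOPS.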
Comparison with SotA
[Figure: energy efficiency [TOPS/W] versus throughput [GOPS], log-log, with 4-bit, 8-bit and 16-bit points. Designs shown: This work, ID14.1, ID14.2, ID14.6, Moons [4], Chen [3]; legend years 2016 and 2017.]
homes.esat.kuleuven.be/~mverhels/DLICsurvey.html
Summary
Envision: A 0.25-to-10 TOPS/W CNN processor, trading energy-vs-computational precision
• Always-on through hierarchical computing.
• An energy-efficient CNN architecture:
  1. 2D-SIMD baseline;
  2. DVAFS-compatible;
  3. Operator guarding and IO compression.
• Envision: 0.25-to-10 TOPS/W @ 76 GOPS, varying with the required network precision.
Acknowledgement: This work was partly funded by FWO and Intel Corporation. We thank Synopsys for tool support and STMicroelectronics for silicon donation.