Massachusetts Institute of Technology
Han Cai, Chuang Gan, Tianzhe Wang, Zhekai Zhang, Song Han
Once for All: Train One Network and Specialize it for Efficient Deployment
Once-for-All, ICLR’20
Challenge: Efficient Inference on Diverse Hardware Platforms
• Different hardware platforms have different resource constraints, with less and less resource from cloud to tiny devices:
• Cloud AI: Memory 32 GB, Computation 10^12 FLOPS
• Mobile AI: Memory 4 GB, Computation 10^9 FLOPS
• Tiny AI (AIoT): Memory 100 KB, Computation < 10^6 FLOPS
• We need to customize our models for each platform to achieve the best accuracy-efficiency trade-off, especially on resource-constrained edge devices.
• Specializing models is expensive. Under the assumption of using MnasNet [1], the design cost is about 40K GPU hours per deployment scenario and grows linearly with the number of platforms: 40K GPU hours (11.4k lbs CO2 emission), 160K GPU hours (45.4k lbs), 1,600K GPU hours (454.4k lbs) as coverage extends across Cloud AI (10^12 FLOPS), Mobile AI (10^9 FLOPS), Tiny AI (10^6 FLOPS), and more.
• How can we support all these diverse hardware platforms without an exploding design cost? Answer: the Once-for-All Network.
[1] Tan, Mingxing, et al. "MnasNet: Platform-aware neural architecture search for mobile." CVPR 2019.
1 GPU hour translates to 0.284 lbs CO2 emission according to Strubell, Emma, et al. "Energy and policy considerations for deep learning in NLP." ACL 2019.
Once-for-All Network: Decouple Model Training and Architecture Design
• Conventional approach: repeat the architecture design and training for every platform, so the cost grows linearly with the number of deployment scenarios.
• Once-for-all: train a single once-for-all network that supports many sub-networks of different sizes; for each deployment scenario, directly derive a specialized sub-network without additional training (a conceptual sketch follows below).
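A conceptual sketch of the decoupling, with entirely hypothetical helper names (train_once_for_all, search_subnet, and deploy are illustrative, not the released API):

```python
# Train once: the expensive step happens a single time.
ofa_network = train_once_for_all()

# Specialize many times: per-platform selection is cheap, since the chosen
# sub-network inherits its weights directly from the once-for-all network.
for platform, latency_budget_ms in [("cloud_gpu", 10), ("mobile_cpu", 60), ("tiny_mcu", 300)]:
    subnet = search_subnet(ofa_network, platform, latency_budget_ms)  # no retraining
    deploy(subnet, platform)
```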
Progressive Shrinking for Training OFA Networks
• More than 10^19 different sub-networks in a single once-for-all network, covering 4 different dimensions: resolution, kernel size, depth, width (a back-of-the-envelope count follows below).
• Directly optimizing the once-for-all network from scratch is much more challenging than training a normal neural network, given so many sub-networks to support.
• Small sub-networks are nested in large sub-networks.
• Cast the training of the once-for-all network as a progressive shrinking and joint fine-tuning process: train the full model, shrink the model along the 4 dimensions, then jointly fine-tune both large and small sub-networks.
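As a sanity check on the 10^19 figure, a back-of-the-envelope count, assuming the search space described in the OFA paper (5 units; depth per unit in {2, 3, 4}; per-layer kernel size in {3, 5, 7} and width expansion ratio in {3, 4, 6}; elastic resolution adds a further factor on top):

```python
# Count the sub-networks in the OFA search space (resolution excluded).
kernel_choices = 3          # kernel size in {3, 5, 7}
width_choices = 3           # width expansion ratio in {3, 4, 6}
depth_choices = (2, 3, 4)   # number of layers per unit
num_units = 5

# Each layer independently picks a kernel size and a width;
# each unit independently picks a depth.
per_unit = sum((kernel_choices * width_choices) ** d for d in depth_choices)
total = per_unit ** num_units
print(f"{total:.1e}")  # ~2.2e+19, i.e. more than 10^19 sub-networks
```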
Progressive Shrinking: Connection to Network Pruning
• Network pruning: train the full model → shrink the model (only width) → fine-tune the small net → a single pruned network.
• Progressive shrinking: train the full model → shrink the model (4 dimensions) → fine-tune both large and small sub-nets → a once-for-all network.
• Progressive shrinking can be viewed as a generalized network pruning with much higher flexibility across 4 dimensions (a sketch of the joint fine-tuning step follows below).
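A minimal sketch of that joint fine-tuning step, assuming a set_active_subnet-style interface as in the released code; the actual implementation additionally uses knowledge distillation from the full network and samples several sub-networks per step, omitted here:

```python
import random

def finetune_step(ofa_network, images, labels, criterion, optimizer):
    # Sample one random sub-network configuration for this step. Because
    # small sub-networks are nested inside large ones and share weights,
    # every step also improves many other sub-networks.
    ofa_network.set_active_subnet(
        ks=random.choice([3, 5, 7]),  # kernel size
        e=random.choice([3, 4, 6]),   # width expansion ratio
        d=random.choice([2, 3, 4]),   # depth per unit
    )
    optimizer.zero_grad()
    loss = criterion(ofa_network(images), labels)
    loss.backward()
    optimizer.step()
    return loss.item()
```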
Progressive Shrinking: Elastic Resolution
• Elasticity is enabled one dimension at a time: resolution is sampled from the start, while kernel size, depth, and width each go from full to partial in successive stages.
• Elastic Resolution: randomly sample the input image size for each batch (see the sketch below).
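A minimal sketch of elastic resolution, assuming a PyTorch training loop; the candidate sizes below are illustrative (the paper samples sizes between 128 and 224):

```python
import random
import torch.nn.functional as F

resolutions = [128, 160, 192, 224]  # illustrative subset of candidate sizes

def train_batch(model, images, labels, criterion, optimizer):
    # Randomly sample an input resolution for this batch and resize on the fly.
    size = random.choice(resolutions)
    images = F.interpolate(images, size=(size, size),
                           mode="bilinear", align_corners=False)
    optimizer.zero_grad()
    loss = criterion(model(images), labels)
    loss.backward()
    optimizer.step()
    return loss.item()
```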
Progressive Shrinking: Elastic Kernel Size
• Start with the full kernel size (7x7). A smaller kernel takes the centered weights via a transformation matrix: a 25x25 matrix transforms the centered 5x5 weights of the 7x7 kernel, and a 9x9 matrix transforms the centered 3x3 weights of the 5x5 kernel (a sketch follows).
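A sketch of the kernel transformation under the assumptions above (shared full 7x7 weights; one learned matrix per transition, as on the slide); this is a simplified depthwise module, not the released implementation:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ElasticKernelConv(nn.Module):
    def __init__(self, channels):
        super().__init__()
        # Depthwise conv weights stored at the full kernel size.
        self.weight = nn.Parameter(torch.randn(channels, 1, 7, 7) * 0.1)
        self.transform_7to5 = nn.Parameter(torch.eye(25))  # 25x25 matrix
        self.transform_5to3 = nn.Parameter(torch.eye(9))   # 9x9 matrix

    def active_weight(self, kernel_size):
        w = self.weight
        if kernel_size <= 5:  # centered 5x5 weights, then learned transform
            w = w[:, :, 1:6, 1:6].reshape(w.size(0), 1, 25) @ self.transform_7to5
            w = w.reshape(w.size(0), 1, 5, 5)
        if kernel_size == 3:  # centered 3x3 weights, then learned transform
            w = w[:, :, 1:4, 1:4].reshape(w.size(0), 1, 9) @ self.transform_5to3
            w = w.reshape(w.size(0), 1, 3, 3)
        return w

    def forward(self, x, kernel_size=7):
        w = self.active_weight(kernel_size)
        return F.conv2d(x, w, padding=kernel_size // 2, groups=x.size(1))
```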
Progressive Shrinking: Elastic Depth
• Gradually allow later layers in each unit to be skipped to reduce the depth: each unit is first trained with full depth, then shallower depths are progressively enabled (a sketch follows).
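A minimal sketch of depth elasticity (an illustrative module, not the released code): the unit keeps its maximum number of layers, and shallower sub-networks simply run the first D of them.

```python
import torch.nn as nn

class ElasticDepthUnit(nn.Module):
    def __init__(self, blocks):
        super().__init__()
        self.blocks = nn.ModuleList(blocks)   # e.g. 4 residual blocks
        self.active_depth = len(blocks)       # full depth by default

    def set_active_depth(self, depth):
        # Progressive shrinking lowers this bound, e.g. 4 -> 3 -> 2.
        self.active_depth = depth

    def forward(self, x):
        # The first `active_depth` blocks are shared with deeper
        # sub-networks; later blocks are skipped when the depth shrinks.
        for block in self.blocks[:self.active_depth]:
            x = block(x)
        return x
```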
Progressive Shrinking: Elastic Width
• Gradually shrink the width. Keep the most important channels when shrinking via channel sorting: channels are ranked by an importance score (e.g. 0.85, 0.63, 0.15, 0.02 in the figure), the weights are reorganized accordingly, and shrinking keeps the top-ranked channels (a sketch follows).
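A sketch of channel sorting under these assumptions (importance as the L1 norm of each output channel's weights, as in the paper; a simplified two-layer case that ignores BatchNorm bookkeeping):

```python
import torch

def reorganize_by_channel_importance(conv_w, next_w):
    # conv_w: (C_out, C_in, k, k) weights of the current layer
    # next_w: (C_next, C_out, k, k) weights of the following layer
    importance = conv_w.abs().sum(dim=(1, 2, 3))        # L1 norm per channel
    order = torch.argsort(importance, descending=True)  # channel sorting
    return conv_w[order], next_w[:, order]              # reorg. both layers

# After sorting, shrinking to width w keeps the most important channels:
#   small_conv = conv_w[:w]
#   small_next = next_w[:, :w]
```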
Progressive Shrinking: Performance of Sub-networks on ImageNet
[Figure: ImageNet top-1 accuracy (%) of sub-networks under various architecture configurations (D: depth, W: width, K: kernel size), from D=2/W=3/K=3 up to D=4/W=6/K=7, trained with (w/ PS) and without (w/o PS) progressive shrinking. PS improves top-1 accuracy by 2.5% to 3.7% across all configurations.]
• Progressive shrinking consistently improves accuracy of sub-networks on ImageNet.
OFA: 80% Top-1 Accuracy on ImageNet
[Figure: ImageNet top-1 accuracy (%, the higher the better) vs. MACs (billions, the lower the better) for handcrafted models (MobileNetV1/V2, ShuffleNet, IGCV3-D, InceptionV2/V3, DenseNet-121/169/264, ResNet-50/101, ResNeXt-50/101, Xception, DPN-92) and AutoML models (NASNet-A, AmoebaNet, PNASNet, DARTS, ProxylessNAS, MobileNetV3, EfficientNet); marker size indicates model size (2M to 64M parameters). Once-for-All (ours) reaches 80.0% top-1 accuracy at 595M MACs, with 14x less computation than models of comparable accuracy.]
• Once-for-all sets a new state-of-the-art 80% ImageNet top-1 accuracy under the mobile setting (< 600M MACs).
Comparison with EfficientNet and MobileNetV3
22
Top-
1 Im
ageN
et A
cc (%
)
76
77
78
79
80
81
0 50 100 150 200 250 300 350 400
OFA EfficientNet
76.3
78.8
79.879.8
78.7
Google Pixel1 Latency (ms)
80.1 2.6x faster
3.8% higher accuracy
Google Pixel1 Latency (ms)
Top-
1 Im
ageN
et A
cc (%
)
67
69
71
73
75
77
18 24 30 36 42 48 54 60
OFA MobileNetV3
75.2
73.3
70.4
67.4
76.4
74.9
73.3
71.4
4% higher accuracy
1.5x faster
• Once-for-all is 2.6x faster than EfficientNet and 1.5x faster than MobileNetV3 on Google Pixel1 without loss of accuracy.
OFA for Fast Specialization on Diverse Hardware Platforms
[Figure: ImageNet top-1 accuracy (%) vs. measured latency for specialized OFA sub-networks against MobileNetV2 and MobileNetV3 on six platforms: Samsung S7 Edge, Google Pixel2, and LG G8 mobile CPUs; NVIDIA 1080Ti GPU (batch size 64); Intel Xeon CPU (batch size 1); and Xilinx ZU3EG FPGA (batch size 1, quantized). Specialized OFA sub-networks dominate the baselines on every platform.]
OFA Saves Orders of Magnitude Design Cost
• Green AI is important. The computation cost of OFA stays constant as the number of hardware platforms grows, reducing the carbon footprint by 1,335x compared to MnasNet under 40 platforms.
OFA for FPGA Accelerators
[Figure: Measured results on the Xilinx ZU3EG FPGA. Compared with MobileNetV2 and MnasNet, the specialized OFA model achieves 40% higher arithmetic intensity (OPS/Byte) and 57% higher throughput (GOPS/s).]
• Non-specialized neural networks do not fully utilize the hardware resource. There is a large room for improvement via neural network specialization.
Summary
• We introduce the once-for-all network for efficient inference on diverse hardware platforms.
• We present an effective progressive shrinking approach for training once-for-all networks.
• Released the training code & a pre-trained OFA network that provides diverse sub-networks without retraining: ofa_network = ofa_net(net_id, pretrained=True)
• Released 50 different pre-trained OFA models specialized for diverse hardware platforms (CPU/GPU/FPGA/DSP): net, image_size = ofa_specialized(net_id, pretrained=True) (see the usage sketch after these bullets)
Project Page: https://ofa.mit.edu
• The once-for-all network surpasses MobileNetV3 and EfficientNet by a large margin under all scenarios, setting a new state-of-the-art 80% ImageNet top-1 accuracy under the mobile setting (< 600M MACs).
• First place in the 3rd Low-Power Computer Vision Challenge, DSP track @ICCV'19.
• First place in the 4th Low-Power Computer Vision Challenge @NeurIPS'19.
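A usage sketch based on the released ofa package; the import path and method names follow the project README, but the exact net_id strings may differ between versions:

```python
from ofa.model_zoo import ofa_net, ofa_specialized

# Load the pre-trained once-for-all network (net_id assumed from the repo).
ofa_network = ofa_net("ofa_mbv3_d234_e346_k357_w1.0", pretrained=True)

# Derive a sub-network without any training: choose kernel size,
# expansion ratio, and depth, then extract the sub-network with its weights.
ofa_network.set_active_subnet(ks=7, e=6, d=4)
subnet = ofa_network.get_active_subnet(preserve_weight=True)

# Or load one of the 50 released models already specialized for a target
# platform (net_id string is illustrative).
net, image_size = ofa_specialized("pixel1_lat@20ms_top1@71.4_finetune@25",
                                  pretrained=True)
```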