
Energy-Aware Neural Architecture Optimization with Splitting Steepest Descent

Dilin Wang1, Lemeng Wu1, Meng Li2, Vikas Chandra2, Qiang Liu1

1 UT Austin 2 Facebook

Neural architecture optimization

Splitting Yields Adaptive Net Structure Optimization

I Starting from a small net, gradually grow the net during training.

I Grow by “splitting” existing neurons into multiple off-springs.

Why, when and how?

I Why splitting? Does splitting decrease the loss? How much?

I When to split? What neurons should be split first?

I How to split a neuron optimally? How many copies to split into?

Why & when: escaping local minima

I Optimization view: a local optimum in the lower-dimensional space can be turned into a saddle point in the higher-dimensional space of the augmented (split) network.

I Architecture view: lower-dimensional space → smaller networks; higher-dimensional space → larger networks.

How: splitting steepest descent

I Consider a single-neuron network

L(θ) := Ex∼D[Φ(σ(θ, x))],

where Φ(·) is the map from the output of the neuron to the final loss.

I Split θ into m off-springs, θ → {θi, wi}, i = 1, …, m; the loss becomes

L({θi, wi}) := Ex∼D[ Φ( ∑_{i=1}^m wi σ(θi, x) ) ].

I Smooth loss change: the split changes the loss only smoothly provided the off-spring weights satisfy ∑_{i=1}^m wi = 1 and each off-spring stays within an ε-ball of the original neuron, ||θi − θ|| ≤ ε, ∀i.
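To make the splitting matrix concrete, below is a minimal NumPy sketch (not from the poster) that forms S(θ) for a single neuron, assuming σ(θ, x) = tanh(θᵀx) and a squared loss Φ(s) = ½(s − y)²; the function name, the activation choice, and the synthetic data are illustrative assumptions.

```python
import numpy as np

def single_neuron_splitting_matrix(theta, X, y):
    """Splitting matrix S(theta) = E_x[ dPhi/dsigma * d^2 sigma / dtheta^2 ]
    for sigma(theta, x) = tanh(theta . x) and Phi(s) = 0.5 * (s - y)^2.
    (Illustrative sketch; not the authors' implementation.)"""
    z = X @ theta                        # pre-activations, shape (N,)
    t = np.tanh(z)                       # neuron outputs sigma(theta, x)
    dPhi = t - y                         # dPhi/dsigma for the squared loss
    d2sigma = -2.0 * t * (1.0 - t**2)    # tanh''(z)
    # d^2 sigma / dtheta dtheta = tanh''(z) * x x^T; average over the data
    S = (X * (dPhi * d2sigma)[:, None]).T @ X / X.shape[0]
    return S

# Example: decide whether this neuron is worth splitting.
rng = np.random.default_rng(0)
X = rng.normal(size=(256, 5))
y = np.sin(X @ rng.normal(size=5))       # synthetic targets
theta = rng.normal(size=5)
S = single_neuron_splitting_matrix(theta, X, y)
lam_min = np.linalg.eigvalsh(S)[0]       # minimum eigenvalue
print("splitting index (lambda_min):", lam_min)  # negative => splitting helps
```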

Deriving optimal splitting strategies

I Structural descent at stable local minima:

min_{m, {θi}, {wi}}  { Lm({θi, wi}) − L(θ) }   s.t.  ||θi − θ|| ≤ ε,  ∑_{i=1}^m wi = 1,  wi > 0.   (1)

I The optimum of Eqn. (1) is determined by

min_{m, {θi}, {wi}}  { Lm({θi, wi}) − L(θ) }  =  (ε²/2) · min{ λmin(S(θ)), 0 }  +  O(ε³),

with the splitting matrix

S(θ) = Ex∼D[ ∇σΦ(σ(θ, x)) ∇²θθ σ(θ, x) ],

where λmin(S(θ)) denotes the minimum eigenvalue of S(θ), called the splitting index.

Optimal splitting

I When λmin(S(θ)) ≥ 0: no splitting (splitting cannot decrease the loss).

I When λmin(S(θ)) < 0: split into m = 2 off-springs along the minimum eigenvector vmin(S(θ)) of the splitting matrix:

m = 2,  θ1 = θ + ε·vmin(S(θ)),  θ2 = θ − ε·vmin(S(θ)),  w1 = w2 = 1/2.

The corresponding maximum decrease of loss is ε²·λmin(S(θ))/2.
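As a hedged illustration (not the authors' code), the optimal two-way split follows directly from the eigen-decomposition of S(θ); the helper split_neuron below is a hypothetical name.

```python
import numpy as np

def split_neuron(theta, S, eps=0.05):
    """If the splitting index is negative, split theta into two off-springs
    along the minimum eigen-direction of S(theta); otherwise keep the neuron.
    Returns a list of (theta_i, w_i) pairs. (Illustrative sketch.)"""
    lam, V = np.linalg.eigh(S)           # eigenvalues in ascending order
    lam_min, v_min = lam[0], V[:, 0]
    if lam_min >= 0:                     # splitting-stable: no gain from splitting
        return [(theta, 1.0)]
    # predicted loss decrease: eps**2 * lam_min / 2 (a negative number)
    return [(theta + eps * v_min, 0.5),
            (theta - eps * v_min, 0.5)]
```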

Energy-aware splitting

I Our formulation (select which neurons to split under a FLOPs budget):

min_β  ∑_{ℓ=1}^n βℓ · λmin(S(θℓ))        (gain of splitting neuron ℓ)

s.t.  βℓ ∈ {0, 1},   ∑_{ℓ=1}^n eℓ · βℓ ≤ budget,

where eℓ is the energy (FLOPs) cost of splitting neuron ℓ.
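One simple way to approximate this 0/1 selection problem is a greedy gain-per-cost heuristic, sketched below; this is an illustrative assumption, not necessarily the solver used by the authors.

```python
def select_neurons_to_split(lambda_mins, costs, budget):
    """Greedy heuristic for the budgeted selection problem: among neurons
    with a negative splitting index, take the most gain per unit FLOPs first
    until the budget is exhausted. (Sketch; an exact knapsack solver could
    be substituted.)"""
    candidates = [(lam / e, lam, e, idx)
                  for idx, (lam, e) in enumerate(zip(lambda_mins, costs))
                  if lam < 0]
    candidates.sort()                    # most negative gain-per-cost first
    chosen, spent = [], 0.0
    for _, lam, e, idx in candidates:
        if spent + e <= budget:
            chosen.append(idx)
            spent += e
    return chosen
```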

Main algorithm
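The poster's algorithm figure is not reproduced in this transcript; the sketch below summarizes the incremental loop implied by the formulation above. All helper names (train, splitting_matrices, flops_cost, split_in_place) are hypothetical placeholders, and the selector reuses the greedy sketch shown earlier.

```python
import numpy as np

def grow_by_splitting(net, data, budget_per_round, n_rounds, eps=0.05):
    """Incremental training with splitting steepest descent (sketch).
    Alternates ordinary parametric training with energy-aware neuron splitting."""
    for _ in range(n_rounds):
        train(net, data)                               # descend to a (local) minimum
        S_list = splitting_matrices(net, data)         # one S(theta_l) per neuron
        lam_mins = [np.linalg.eigvalsh(S)[0] for S in S_list]
        costs = [flops_cost(net, l) for l in range(len(S_list))]
        chosen = select_neurons_to_split(lam_mins, costs, budget_per_round)
        for l in chosen:
            split_in_place(net, l, S_list[l], eps)     # replace neuron l by two off-springs
    return net
```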

Experiments

Results on CIFAR100

I We apply splitting to a small version of MobileNetV1 (Howard et al., 2017).

[Figure: test accuracy vs. Log10(FLOPs) on CIFAR-100. Panel (a) compares Baseline (full size), Width multiplier, Pruning (L1), Pruning (Bn), MorphNet, and Splitting (ours); panel (b) compares Pruning (Bn) and Splitting (ours).]

Results on ImageNet

I MobileNetV1

Model                   MACs (G)   Top-1 Accuracy   Top-5 Accuracy
MobileNetV1 (1.0x)      0.569      72.93            91.14
Splitting-4             0.561      73.96            91.49
MobileNetV1 (0.75x)     0.317      70.25            89.49
AMC (He et al., 2018)   0.301      70.50            89.30
Splitting-3             0.292      71.47            89.67
MobileNetV1 (0.5x)      0.150      65.20            86.34
Splitting-2             0.140      68.26            87.93
Splitting-1             0.082      64.06            85.30
Splitting-0 (seed)      0.059      59.20            81.82

I MobileNetV2

Model                   MACs (G)   Top-1 Accuracy   Top-5 Accuracy
MobileNetV2 (1.0x)      0.300      72.04            90.57
Splitting-3             0.298      72.84            90.83
MobileNetV2 (0.75x)     0.209      69.80            89.60
AMC (He et al., 2018)   0.210      70.85            89.91
Splitting-2             0.208      71.76            90.07
MobileNetV2 (0.5x)      0.097      65.40            86.40
Splitting-1             0.095      66.53            87.00
Splitting-0 (seed)      0.039      55.61            79.55

Conclusion

I Incremental training with splitting gradient.

I Simple and fast, promising in practice.

I Opens a new dimension for energy-efficient NAS.

The 5th Workshop on Energy Efficient Machine Learning and Cognitive Computing @ NeurIPS 2019