Optimization as a Model for Few-Shot Learning - ICLR 2017 reading seminar


Transcript of Optimization as a Model for Few-Shot Learning - ICLR 2017 reading seminar

2016/06/17 @ DeNA, Shibuya Hikarie

Hokuto Kagaya (@_hokkun_)

Optimization as a Model for Few-Shot Learning

1

TL;DR
• Purpose: better inference for the few-shot/one-shot learning problem
• Method: LSTM-based meta-learning for deep neural networks
• Result: competitive with deep metric-learning techniques

2

Background (1)
• Why has deep learning succeeded?
  • Machine power
  • Amount of data
• Large datasets
  • ImageNet (images)
  • Microsoft COCO Captions (images & captions)
  • YouTube-8M (video)
  • WikiText (text)

3

Background (2)
• However, in many fields, collecting a large amount of training samples is:
  • difficult
    • Ex: fine-grained recognition (cars, birds, food, ...)
  • time-consuming
    • scraping, crawling, annotating, ...
• Yet human beings can generalize from only a few samples of a target.

4

Problem & Purpose (1)
• How can we acquire a well-generalized model using few samples and a fixed number of updates?
  • Existing gradient-based training algorithms (SGD, Adam, AdaGrad, ...) are not designed for a fixed, small number of parameter updates.
• Put more simply, the authors want to find good initial parameters for the NN.
  • cf. review comments: it would be even better to be able to find architectural parameters of the NN.

5

Problem & Purpose (2)
• How? Meta-learning
  • "Learning to learn": train the learner itself.
• There is a variety of meta-learning:
  • Transfer learning
    • Use the experience from a different domain
    • Popular in image classification, especially for fine-grained visual classification
  • Ensemble classifiers
    • Combine multiple classifiers

6

- This article is a good introduction to meta-learning: http://www.scholarpedia.org/article/Metalearning

Proposed Method

• LSTM-based meta-learning

7

* Prerequisites
• What is an LSTM?
  • Long Short-Term Memory
  • We want to handle sequences, but the error gradients explode/vanish.
  • Fix the weight on past data to 1 so it is never forgotten, and perform input/output selectively ('97).
  • However, this could not handle abrupt changes in the situation (?), so a forget gate was added so that past memory can be erased selectively ('99).
• Reference (in Japanese): http://qiita.com/t_Signull/items/21b82be280b46f467d1b
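As a quick illustration of the gating summarized above, here is a minimal NumPy sketch of a single LSTM cell step (the variable names and weight layout are my own, not tied to any particular library):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_cell_step(x, h_prev, c_prev, W, b):
    """One LSTM step with input, forget, and output gates plus a candidate cell state.
    W has shape (4*H, X+H) and b has shape (4*H,), where H is the hidden size."""
    H = h_prev.shape[0]
    z = W @ np.concatenate([x, h_prev]) + b
    i = sigmoid(z[0:H])            # input gate: how much new information to write
    f = sigmoid(z[H:2*H])          # forget gate (the '99 addition): how much old memory to keep
    o = sigmoid(z[2*H:3*H])        # output gate: how much of the cell state to expose
    c_tilde = np.tanh(z[3*H:4*H])  # candidate cell state
    c = f * c_prev + i * c_tilde   # when f is close to 1, past memory is carried without decay
    h = o * np.tanh(c)
    return h, c
```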

8

* Data Separation
• meta-train dataset
• meta-test dataset
• meta sample

[Figure: one meta sample (a.k.a. episode) consists of target training samples and target testing samples]
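For concreteness, a rough sketch of how one meta sample (episode) could be assembled for N-way k-shot classification; the function below and its split sizes are illustrative assumptions, not the authors' code:

```python
import random
from collections import defaultdict

def sample_episode(dataset, n_way=5, k_shot=1, n_query=15):
    """dataset: list of (example, label) pairs from the meta-train (or meta-test) classes.
    Returns D_train (target training samples) and D_test (target testing samples)."""
    by_class = defaultdict(list)
    for x, y in dataset:
        by_class[y].append(x)
    classes = random.sample(list(by_class), n_way)   # pick N classes for this episode
    d_train, d_test = [], []
    for new_label, c in enumerate(classes):
        chosen = random.sample(by_class[c], k_shot + n_query)
        d_train += [(x, new_label) for x in chosen[:k_shot]]
        d_test  += [(x, new_label) for x in chosen[k_shot:]]
    return d_train, d_test
```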

Proposed Method (2)

Normal SGD update:
$\theta_t = \theta_{t-1} - \alpha_t \nabla_{\theta_{t-1}} \mathcal{L}_t$

Metaphor: the LSTM cell-state update
$c_t = f_t \odot c_{t-1} + i_t \odot \tilde{c}_t$
where
$i_t = \sigma(W_I \cdot [\nabla_{\theta_{t-1}} \mathcal{L}_t, \mathcal{L}_t, \theta_{t-1}, i_{t-1}] + b_I)$
$f_t = \sigma(W_F \cdot [\nabla_{\theta_{t-1}} \mathcal{L}_t, \mathcal{L}_t, \theta_{t-1}, f_{t-1}] + b_F)$
i.e., the gates take the current gradient, the current loss, the previous $\theta$, and their own previous value ($i$, $f$) as input.

The forget gate is not a constant 1; it can shrink the previous parameters so the learner escapes from bad local optima.

10
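A minimal NumPy sketch of the learned update rule above, applied coordinate-wise; the feature packing and weight shapes are my assumptions, and the candidate cell state is taken to be the negative gradient so that $i_t$ plays the role of the learning rate $\alpha_t$:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def meta_update(theta_prev, grad, loss, i_prev, f_prev, W_i, b_i, W_f, b_f):
    """One learned parameter update. Each coordinate of theta is updated from
    [gradient, loss, previous theta, previous gate value], mirroring the equations above.
    (Assumed coordinate-wise layout: W_i, W_f have shape (4,); b_i, b_f are scalars.)"""
    loss_vec = np.full_like(theta_prev, loss)
    i_t = sigmoid(np.stack([grad, loss_vec, theta_prev, i_prev], axis=-1) @ W_i + b_i)
    f_t = sigmoid(np.stack([grad, loss_vec, theta_prev, f_prev], axis=-1) @ W_f + b_f)
    # c_t = f_t * c_{t-1} + i_t * c~_t, with the cell state playing the role of theta
    theta_t = f_t * theta_prev - i_t * grad   # i_t acts like alpha_t; f_t is not forced to 1
    return theta_t, i_t, f_t
```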

Proposed Method (3)

11

[Figure: the unrolled computation graph; one axis shows the meta-learner's iterations, the other the learner's iterations]

The (meta) loss is computed from the final state of the LSTM (= the parameters of the target model) together with the data and labels of $D_{test}$.
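The two nested loops in the figure can be written as rough Python pseudocode; `meta_learner`, `loss_and_grad`, and their methods are hypothetical interfaces for illustration, not the authors' code:

```python
def meta_train_step(meta_learner, d_train, d_test, loss_and_grad, T=12):
    """One meta-training step: the inner loop runs the learner for T iterations on
    D_train, then the meta-learner is updated from the loss on D_test."""
    theta = meta_learner.initial_params()      # learned initialization (the LSTM's c_0)
    state = meta_learner.initial_state()
    for batch in meta_learner.minibatches(d_train, steps=T):   # learner's iterations
        loss, grad = loss_and_grad(theta, batch)
        theta, state = meta_learner.update(theta, grad, loss, state)
    meta_loss, _ = loss_and_grad(theta, d_test)   # final theta scored on D_test
    meta_learner.backprop_and_step(meta_loss)     # update the meta-learner's own parameters
    return meta_loss
```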

Proposed Method (4)

12

• Figure from the author's slides

Proposed Method (5)
• What is improved gradually?
  • First: the LSTM parameters (a.k.a. the meta-learner's parameters)
    • i.e., "how should we update the target model?"
  • Second: the LSTM states (outputs?)
    • The final $\theta_T$ is shared across batches, so learning proceeds rapidly thanks to the good initialization.

13

Other Topics
• Coordinate-wise LSTM
• Preprocessing of the LSTM inputs
  • For both topics, see [Andrychowicz, NIPS 2016] (the preprocessing is in the appendix); a sketch follows this list.
  • Adjusts the scaling of gradients and losses
  • Separates the information of magnitude and sign
• Batch normalization
  • Avoid "dataset"- (episode-) level leakage of information
• Related work: metric learning
  • Ex: Siamese networks
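For reference, the preprocessing described in the appendix of [Andrychowicz, NIPS 2016] maps each gradient or loss value to a bounded (log-magnitude, sign) pair so that inputs on very different scales become comparable LSTM inputs; a minimal sketch:

```python
import numpy as np

def preprocess(x, p=10.0):
    """Split each value into log-magnitude and sign, as in Andrychowicz et al.'s appendix.
    Values smaller than exp(-p) are mapped to (-1, exp(p) * x) to stay bounded."""
    x = np.asarray(x, dtype=float)
    big = np.abs(x) >= np.exp(-p)
    mag = np.where(big, np.log(np.abs(x) + 1e-12) / p, -1.0)
    sgn = np.where(big, np.sign(x), np.exp(p) * x)
    return np.stack([mag, sgn], axis=-1)   # shape (..., 2): two features per coordinate
```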

14

Evaluation Method
• Baseline 1: nearest neighbor
  • meta-train: train a neural network using all samples
  • meta-test: feed the training samples through the NN and compare their outputs with those of the testing samples (a sketch follows this list)
• Baseline 2: fine-tune
  • meta-train: in addition to 1, use the meta-validation dataset for hyper-parameter search and fine-tune the network from 1
• Baseline 3: Matching Networks
  • The SOTA of metric learning
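A rough sketch of how Baseline 1 can be evaluated (the `embed` feature extractor and the use of L2 distance are my assumptions): embed both splits with the meta-trained network and classify each test sample by its nearest training sample.

```python
import numpy as np

def nearest_neighbor_accuracy(embed, d_train, d_test):
    """embed: a feature extractor trained on all meta-train samples (Baseline 1)."""
    train_feats = np.stack([embed(x) for x, _ in d_train])
    train_labels = [y for _, y in d_train]
    correct = 0
    for x, y in d_test:
        dists = np.linalg.norm(train_feats - embed(x), axis=1)   # L2 distance in feature space
        correct += int(train_labels[int(np.argmin(dists))] == y)
    return correct / len(d_test)
```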

15

Evaluation Result

16

Visualization and Insight
• Input gates
  • 1. Different among datasets
    • = the meta-learner isn't simply learning a fixed optimization strategy
  • 2. Different among tasks
    • = the meta-learner uses different ways to solve each setting
• Forget gates
  • Simple decay
  • In the end, mostly constant

17

Visualization and Insight

18

Conclusion
• Proposed an LSTM-based model that learns a learner, inspired by the metaphor between SGD updates and the LSTM cell update.
• The meta-learner is trained to discover:
  • 1. A good initialization of the learner
  • 2. A good mechanism for updating the learner's parameters
• Experimental results are competitive with SOTA metric-learning methods.

19

Future work
• Few samples / many classes
• More challenging scenarios
• From the review comments:
  • It would be even better to be able to find architectural parameters of the NN.

20

Impressions
• Viewing the transfer-learning task of "exploiting experience from a different domain" as a kind of "sequence learning" and training it as an LSTM model felt natural to me.
  • Was this idea already out there? I didn't have time to read through the related work...
• As suggested in the review comments, it would be great if the method could go as far as optimizing the architecture.
  • Though there is also the argument that simply stacking many simple filters works well...
  • It brought back memories of my undergraduate struggles trying out many hyper-parameters with cuda-convnet.

21

Things I'm probably still missing
• In the end, what exactly did this paper show for the first time? Using an LSTM for learning-to-learn is probably not a first, is it?
  • For example, Andrychowicz+'16 trains an LSTM that takes gradients as input and outputs the parameter updates of the target learner.
  • Is the novelty that it directly outputs the parameters themselves?

22