TADAM: Task dependent adaptive metric for improved few-shot learning


Transcript of TADAM: Task dependent adaptive metric for improved few-shot learning

Page 1: TADAM: Task dependent adaptive metric for improved few-shot learning (mlg.postech.ac.kr/~readinglist/slides/20190311.pdf, 11-03-2019)

1/22

TADAM: Task dependent adaptive metricfor improved few-shot learning (NeurIPS-2018)

B. N. Oreshkin, P. Rodriguez, and A. Lacoste (Element AI)

Jungtaek Kim ([email protected])

Machine Learning Group,Department of Computer Science and Engineering, POSTECH,

77 Cheongam-ro, Nam-gu, Pohang 37673,Gyeongsangbuk-do, Republic of Korea

March 11, 2019

Page 2

Table of Contents

Motivation

Contributions

Metric Scaling

Task Conditioning

Auxiliary Task Co-Training

Overall Architecture

Experimental Results

Feature-wise Linear Modulation (FiLM)

Page 3

Motivation

- Two recent approaches have attracted significant attention in the few-shot learning domain: Matching Networks and Prototypical Networks.

- In both approaches, the support set and the query set are embedded with a neural network, and nearest-neighbor classification is performed using a metric in the embedded space.

- This paper extends the very notion of the metric space by making it task dependent, conditioning the feature extractor on the specific task.

- The authors find a solution in exploiting the interaction between the conditioned feature extractor and a training procedure based on auxiliary co-training on a simpler task.
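The nearest-prototype classification both approaches build on can be sketched as follows. This is an illustrative toy, not the paper's code: the embeddings are hand-made 2-D points, and the helper names (`prototypes`, `classify`) are my own.

```python
import numpy as np

def prototypes(support_emb, support_labels, n_classes):
    """Class centroids c_k: mean embedding of each class's support samples."""
    return np.stack([support_emb[support_labels == k].mean(axis=0)
                     for k in range(n_classes)])

def classify(query_emb, protos):
    """Assign each query to the class of the nearest prototype (Euclidean)."""
    # dists[i, k] = ||query_i - c_k||^2
    dists = ((query_emb[:, None, :] - protos[None, :, :]) ** 2).sum(axis=-1)
    return dists.argmin(axis=1)

# Toy 2-way, 2-shot episode with 2-D "embeddings"
support = np.array([[0., 0.], [0., 1.], [5., 5.], [5., 6.]])
labels = np.array([0, 0, 1, 1])
protos = prototypes(support, labels, n_classes=2)
queries = np.array([[0.2, 0.5], [4.8, 5.4]])
print(classify(queries, protos))  # [0 1]
```

In the real setting the embeddings come from the trained feature extractor, and a new episode only requires recomputing the centroids, with no gradient steps.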

Page 4

Contributions

- Metric scaling: This paper proposes metric scaling to improve the performance of few-shot algorithms, and mathematically analyzes its effect on the objective-function updates.

- Task conditioning: It uses a task encoding network to extract a task representation based on the support set, which influences the behavior of the feature extractor through FiLM.

- Auxiliary task co-training: The authors show that co-training the feature extractor on a conventional classification task reduces training complexity and provides better generalization.

Page 5

Metric Scaling

- Prototypical Networks, based on Euclidean distance, perform better than Matching Networks, based on cosine distance.

- The authors hypothesize that the improvement can be directly attributed to the interaction of the different scaling of the metrics with the softmax.

- Moreover, the dimensionality of the output is known to have a direct impact on the output scale, even for the Euclidean distance.

- This paper proposes to scale the distance metric by a learnable temperature α.

Page 6

Metric Scaling

- Therefore, a class probability can be written as

  p_{φ,α}(y = k | x) = softmax(−α d(z, c_k)).  (1)

- The class-wise cross-entropy function is

  J_k(φ, α) = Σ_{x_i ∈ Q_k} [ α d(f_φ(x_i), c_k) + log Σ_j exp(−α d(f_φ(x_i), c_j)) ].  (2)
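Equations (1) and (2) can be sketched numerically. This is a minimal illustration with made-up distances and a fixed α (in the paper α is learnable); the function names are my own.

```python
import numpy as np

def scaled_softmax_probs(dists, alpha):
    """p(y=k|x) = softmax(-alpha * d(z, c_k)) over classes k, as in Eq. (1)."""
    logits = -alpha * dists
    logits -= logits.max(axis=-1, keepdims=True)  # shift for numerical stability
    e = np.exp(logits)
    return e / e.sum(axis=-1, keepdims=True)

def class_loss(dists_k, alpha, k):
    """Class-wise cross-entropy J_k of Eq. (2), summed over queries of class k."""
    # dists_k[i, j] = d(f(x_i), c_j) for queries x_i in Q_k
    return (alpha * dists_k[:, k]
            + np.log(np.exp(-alpha * dists_k).sum(axis=1))).sum()

dists = np.array([[0.1, 2.0], [0.3, 1.5]])  # two queries whose true class is 0
print(scaled_softmax_probs(dists, alpha=1.0))
print(class_loss(dists, alpha=1.0, k=0))
```

Note that J_k is exactly the sum of −log p_{φ,α}(y = k | x_i) over the queries of class k, so scaling by α reshapes the softmax rather than adding a new loss term.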

Page 7

Metric Scaling

Page 8

Metric Scaling

- From Eq. (3), it is clear that for small α values, the first term minimizes the embedding distance between query samples and their corresponding prototypes, while the second term maximizes the embedding distance between the samples and the prototypes of the non-belonging categories.

- For large α values (Eq. (4)), the first term is the same as in Eq. (3), while the second term maximizes the distance of each sample from the closest wrongly assigned prototype c_{j*_i}.

Page 9

Task Conditioning

- Up until now, the feature extractor f_φ(·) was assumed to be task-independent.

- The authors define a dynamic feature extractor f_φ(x, Γ), where Γ is a set of parameters predicted from a task representation such that the performance of f_φ(x, Γ) is optimized given the support set S.

- This is related to the FiLM conditioning layer and conditional batch normalization, of the form h_{l+1} = γ ⊙ h_l + β.

- The task representation, defined as the mean of the task class centroids,

  c̄ = (1/K) Σ_k c_k,  (3)

  (i) reduces the dimensionality of the input to the task embedding network (TEN) and (ii) replaces expensive RNN/CNN/attention modeling.
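The conditioning pipeline can be sketched as below. This is a simplified stand-in: the TEN here is a single linear map (`W_gamma`, `W_beta` are made-up weights), whereas the paper uses small learned networks with extra post-multipliers; only the overall flow, task centroid mean → predicted (γ, β) → feature-wise modulation, follows the slide.

```python
import numpy as np

def task_representation(protos):
    """Task embedding: the mean of the class centroids, Eq. (3)."""
    return protos.mean(axis=0)

def ten_predict(task_emb, W_gamma, W_beta):
    """Hypothetical linear TEN head mapping the task embedding to FiLM params."""
    return task_emb @ W_gamma, task_emb @ W_beta

def film(h, gamma, beta):
    """FiLM / conditional batch-norm style modulation: h' = gamma * h + beta."""
    return gamma * h + beta

rng = np.random.default_rng(0)
protos = rng.normal(size=(5, 8))            # 5-way episode, 8-D class centroids
task_emb = task_representation(protos)
W_gamma = rng.normal(size=(8, 8))
W_beta = rng.normal(size=(8, 8))
gamma, beta = ten_predict(task_emb, W_gamma, W_beta)
h = rng.normal(size=(4, 8))                 # intermediate features of 4 inputs
print(film(h, gamma, beta).shape)           # (4, 8)
```

Because (γ, β) depend only on the support set, the same feature extractor realizes a different metric space for each task at no per-query cost.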

Page 10

Task Conditioning

Page 11

Task Conditioning

Page 12

Task Conditioning

Page 13

Auxiliary Task Co-Training

- The TEN introduces additional complexity into the architecture via task-conditioning layers inserted after the convolutional and batch-norm blocks.

- The TEN is difficult to train; thus, the authors use a technique called auxiliary task co-training.

- It applies auxiliary co-training with an additional logit head (ordinary 64-way classification in the mini-ImageNet case).

- The auxiliary task's weight is annealed using an exponential decay schedule of the form 0.9^⌊20t/T⌋.
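The decay schedule above is easy to evaluate directly. A minimal sketch, assuming t is the current training step and T the total number of steps (the slide does not spell out the units):

```python
import math

def aux_weight(t, T):
    """Exponential decay 0.9**floor(20*t/T) for the auxiliary-task weight."""
    return 0.9 ** math.floor(20 * t / T)

T = 1000
print([round(aux_weight(t, T), 3) for t in (0, 250, 500, 1000)])
# steps through 0.9**0, 0.9**5, 0.9**10, 0.9**20
```

The floor makes the weight drop in 20 discrete stages over training, so the auxiliary classification loss dominates early and fades to about 0.12 of its initial weight by the end.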

Page 14

Overall Architecture

Page 15

Experimental Results

Page 16

Experimental Results

Page 17

Experimental Results

Page 18

Experimental Results

Page 19

Feature-wise Linear Modulation (FiLM)

- The FiLM paper proposes a general-purpose model that can be used to learn visual reasoning.

- A FiLM layer carries out a simple, feature-wise affine transformation of a neural network's intermediate features, conditioned on an arbitrary input.

- The FiLM model consists of a FiLM-generating linguistic pipeline and a FiLM-ed visual pipeline.
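The feature-wise affine transformation can be sketched on a convolutional feature map. This is illustrative only: the feature map and (γ, β) are hand-made, whereas in the FiLM model they are produced by the linguistic pipeline conditioned on the question.

```python
import numpy as np

def film_2d(features, gamma, beta):
    """Feature-wise affine: per-channel scale and shift of a conv feature map."""
    # features: (C, H, W); gamma, beta: (C,) from the conditioning network
    return gamma[:, None, None] * features + beta[:, None, None]

feats = np.ones((3, 2, 2))        # 3 channels of 2x2 feature maps
gamma = np.array([0., 1., 2.])    # channel 0 zeroed out, 1 kept, 2 doubled
beta = np.array([1., 0., 0.])
out = film_2d(feats, gamma, beta)
print(out[:, 0, 0])  # [1. 1. 2.]
```

A single (γ, β) pair per channel is what makes FiLM cheap: the conditioning input can amplify, suppress, or shift entire feature maps without touching the convolutional weights themselves.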

Page 20

Feature-wise Linear Modulation (FiLM)

Page 21

Feature-wise Linear Modulation (FiLM)

Page 22

Feature-wise Linear Modulation (FiLM)