Properties of Good Meta-Learning Algorithms
Transcript of the talk slides.
[Page 1] Source: ai.stanford.edu/~cbfinn/_files/icml2018_automl_35min.pdf
Chelsea Finn, UC Berkeley & Google Brain → Stanford
Properties of Good Meta-Learning Algorithms (And How to Achieve Them)
[Page 2]
Why Learn to Learn?
- effectively reuse data on other tasks
- replace manual engineering of architecture, hyperparameters, etc.
- learn to quickly adapt to unexpected scenarios (inevitable failures, long tail)
- learn how to learn with weak supervision
[Page 3]
Problem Domains:
- few-shot classification & generation
- hyperparameter optimization
- architecture search
- faster reinforcement learning
- domain generalization
- learning structure
- …

Approaches:
- recurrent networks
- learning optimizers or update rules
- learning initial parameters & architecture
- acquiring metric spaces
- Bayesian models
- …

What is the meta-learning problem statement?
[Page 4]
The Meta-Learning Problem

Supervised Learning:
- Inputs: x    Outputs: y    Data: D = {(x_k, y_k)}
- f(x) → y

Meta-Supervised Learning:
- Inputs: D_train, x_test    Outputs: y_test    Data: {D_i}
- f(D_train, x_test) → y_test

Why is this view useful? It reduces the problem to the design & optimization of f.
Finn, Levine. Meta-learning and Universality: Deep Representation… ICLR 2018
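To make the reduction concrete, here is a minimal, hypothetical instance of f with the meta-supervised signature f(D_train, x_test) → y_test: a 1-nearest-neighbor "learner as a function". The name and setup are illustrative, not from the talk.

```python
# A trivial, hypothetical instance of meta-supervised learning's f:
# f(D_train, x_test) -> y_test, implemented here as 1-nearest neighbor.
# Any function with this signature is a candidate learning procedure.

def f_nearest_neighbor(train_data, x_test):
    """train_data: list of (x, y) pairs; x_test: a feature vector."""
    sq_dist = lambda a, b: sum((u - v) ** 2 for u, v in zip(a, b))
    # Predict the label of the closest training point.
    _, y = min(train_data, key=lambda pair: sq_dist(pair[0], x_test))
    return y
```

Designing f then means choosing a parametric family for this function and optimizing it over many datasets D_i.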
[Page 5]
Example: Few-Shot Classification
diagram adapted from Ravi & Larochelle ‘17
Given 1 example of each of 5 classes: classify new examples.
[Figure: meta-training over training classes; each task has its own training data and test set]
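The meta-training setup in the figure can be sketched as episode construction. This is a hypothetical sketch, not the talk's code; `pool` is assumed to map a class id to its list of examples.

```python
import random

def sample_episode(pool, n_way=5, k_shot=1, n_query=1, rng=random):
    """Build one N-way, K-shot task: a small training set and a test set
    drawn from n_way classes sampled from the meta-training classes."""
    classes = rng.sample(sorted(pool), n_way)
    train, test = [], []
    for new_label, c in enumerate(classes):
        # Disjoint support (train) and query (test) examples per class.
        examples = rng.sample(pool[c], k_shot + n_query)
        train += [(x, new_label) for x in examples[:k_shot]]
        test += [(x, new_label) for x in examples[k_shot:]]
    return train, test
```

Meta-training repeatedly samples such episodes so that f is optimized across tasks rather than within a single one.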
[Page 6]
What do we want from our meta-learning algorithms?
- Expressive power: the ability for f to represent a range of learning procedures
- Consistency: the learned learning procedure will solve the task given enough data; result: reasonable performance on out-of-distribution tasks
- Uncertainty awareness: the ability to reason about ambiguity during learning
- No need to differentiate through learning: better scalability to long learning processes
[Page 7]
Design of f? Recurrent network (LSTM, NTM, Conv)
Santoro et al. ’16, Duan et al. ’17, Wang et al. ’17, Munkhdalai & Yu ’17, Mishra et al. ’17, …
[Page 8]
Learned optimizer (often uses recurrence)
Schmidhuber et al. ’87, Bengio et al. ’90, Hochreiter et al. ’01, Li & Malik ’16, Andrychowicz et al. ’16, Ha et al. ’17, Ravi & Larochelle ’17, …
[Page 9]
Black box approaches
Expressive power: ✓ Consistency: ✗
Can we incorporate structure into the learning procedure?
[Page 10]
Nearest Neighbors (in a learned metric space)
Koch ’15, Vinyals et al. ’16, Snell et al. ’17, Reed et al. ’17, Li et al. ’17, …
Vinyals et al. ’16
Expressive power: ? Consistency: ✓
Approaches that incorporate learning structure
Can we get both?
Koch ’15
[Page 11]
Nearest Neighbors (in a learned metric space)
Koch ’15, Vinyals et al. ’16, Snell et al. ’17, Reed et al. ’17, Li et al. ’17, …
Gradient Descent (from learned initialization)
Finn et al. ’17, Grant et al. ’17, Reed et al. ’17, Li et al. ’17, …
Key idea: train over many tasks to learn a parameter vector θ that transfers via fine-tuning.

Fine-tuning [test-time]: starting from pretrained parameters θ, take gradient steps on the training data for the test task:
  φ ← θ − α ∇_θ L(θ, D^train)

Model-Agnostic Meta-Learning (MAML):
  min_θ Σ_i L(θ − α ∇_θ L(θ, D_i^train), D_i^test)

Finn, Abbeel, Levine. Model-Agnostic Meta-Learning for Fast Adaptation of Deep Networks. ICML 2017
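For a linear model with squared error, one MAML meta-training step can be written out exactly, including the chain rule through the inner update. This is an illustrative NumPy sketch under those assumptions, not the paper's implementation.

```python
import numpy as np

def inner_grad(theta, X, y):
    """Gradient of mean squared error for the linear model X @ theta."""
    return 2.0 * X.T @ (X @ theta - y) / len(y)

def maml_step(theta, tasks, inner_lr=0.05, outer_lr=0.01):
    """One outer update over tasks = [(X_train, y_train, X_test, y_test), ...]."""
    meta_grad = np.zeros_like(theta)
    for X_tr, y_tr, X_te, y_te in tasks:
        # Inner loop: one gradient step from the shared initialization theta.
        theta_prime = theta - inner_lr * inner_grad(theta, X_tr, y_tr)
        # Outer gradient: chain rule through the inner update; for this
        # quadratic loss, d theta'/d theta = I - inner_lr * H exactly.
        H = 2.0 * X_tr.T @ X_tr / len(y_tr)
        meta_grad += (np.eye(len(theta)) - inner_lr * H) @ inner_grad(theta_prime, X_te, y_te)
    return theta - outer_lr * meta_grad / len(tasks)
```

For deep networks the same chain rule is applied by automatic differentiation instead of by hand.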
Approaches that incorporate learning structure
[Page 12]
Learning procedure: run gradient descent from the learned initialization θ.

Approaches that incorporate learning structure:
- Nearest Neighbors (in a learned metric space): Koch ’15, Vinyals et al. ’16, Snell et al. ’17, Reed et al. ’17, Li et al. ’17, …
- Gradient Descent (from learned initialization): Finn et al. ’17, Grant et al. ’17, Reed et al. ’17, Li et al. ’17, …

Expressive power: ? Consistency: ✓

Does consistency come at a cost?
[Page 13]
Universal Function Approximation Theorem
Hornik et al. ’89, Cybenko ’89, Funahashi ’89: A neural network with one hidden layer of finite width can approximate any continuous function.
How can we define a notion of expressive power for meta-learning?
“universal function approximator”
Recurrent network Learned optimizer
“universal learning procedure approximator”
Finn, Abbeel, Levine. Model-Agnostic Meta-Learning for Fast Adaptation of Deep Networks. ICML 2017
[Page 14]
For a sufficiently deep f, the MAML learning procedure can approximate any function of D^train and x^test.
Finn & Levine, ICLR 2018
Why is this interesting? MAML has the benefit of consistency without losing expressive power.
Recurrent network MAML
Assumptions:
- nonzero learning rate α
- the loss-function gradient does not lose information about the label
- datapoints in D^train are unique
Finn, Levine. Meta-learning and Universality: Deep Representation… ICLR 2018
[Page 15]
Black box approaches: Expressive power ✓
MAML: Expressive power ✓, Consistency ✓
Empirically, what does consistency get you?
[Page 16]
How well can methods generalize to similar, but extrapolated tasks? The world is non-stationary.

Omniglot image classification: [plot of performance vs. task variability, comparing MAML with TCML and MetaNetworks]
Finn, Levine. Meta-learning and Universality: Deep Representation… ICLR 2018
[Page 17]
How well can methods generalize to similar, but extrapolated tasks? The world is non-stationary.

Sinusoid curve regression: [plot of error vs. task variability, comparing MAML with TCML]

Takeaway: Strategies learned with MAML consistently generalize better to out-of-distribution tasks.
Finn, Levine. Meta-learning and Universality: Deep Representation… ICLR 2018
[Page 18]
What do we want from our meta-learning algorithms? (recap of the property list)
[Page 19]
What do we want from our meta-learning algorithms? (recap of the property list)
[Page 20]
Model-Agnostic Meta-Learning (MAML)

What if we stop the gradient through this term (the gradient of the inner-loop loss), i.e., use a first-order approximation?

Note: In some tasks, we have observed a more substantial drop.

Finn, Abbeel, Levine. Model-Agnostic Meta-Learning for Fast Adaptation of Deep Networks. ICML 2017
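In a linear-regression setting the stop-gradient is easy to see: dropping the d θ'/d θ term leaves only the test-set gradient at the adapted parameters. This is an illustrative sketch assuming a linear model with squared error, not the paper's code.

```python
import numpy as np

def inner_grad(theta, X, y):
    """Gradient of mean squared error for the linear model X @ theta."""
    return 2.0 * X.T @ (X @ theta - y) / len(y)

def meta_grad(theta, X_tr, y_tr, X_te, y_te, a=0.1, stop_gradient=False):
    """Meta-gradient of the test loss after one inner step of size a."""
    theta_prime = theta - a * inner_grad(theta, X_tr, y_tr)
    g_test = inner_grad(theta_prime, X_te, y_te)
    if stop_gradient:
        # First-order MAML: treat theta_prime as a constant w.r.t. theta.
        return g_test
    # Full MAML: chain rule through the inner update (H is the inner Hessian).
    H = 2.0 * X_tr.T @ X_tr / len(y_tr)
    return (np.eye(len(theta)) - a * H) @ g_test
```

The two versions differ by the factor (I − aH), so when the inner step aH is small the first-order approximation stays close to the full meta-gradient.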
[Page 21]
What do we want from our meta-learning algorithms? (recap of the property list)
[Page 22]
What do we want from our meta-learning algorithms? (recap of the property list)
No uncertainty awareness in original MAML
[Page 23]
A Probabilistic Interpretation of MAML

That’s nice, but is it… Bayesian? Kind of… but it can be more Bayesian.
Erin Grant
(Santos, 1996)
Grant, Finn, Levine, Darrell, Griffiths. Recasting gradient-based meta-learning as hierarchical Bayes. ICLR ‘18
[Page 24]
Erin Grant

MAP inference in this model.
Can we do better than MAP? Can use a Laplace estimate: estimate the Hessian for neural nets with K-FAC.
A Probabilistic Interpretation of MAML
Grant, Finn, Levine, Darrell, Griffiths. Recasting gradient-based meta-learning as hierarchical Bayes. ICLR ‘18
[Page 25]
Modeling ambiguity

Can we sample classifiers?

Intuition: we want to learn a prior where a random kick can put us in different modes.

[Examples: smiling + hat; smiling + young]
[Page 26]
Meta-learning with ambiguity
Finn*, Xu*, Levine. Probabilistic Model-Agnostic Meta-Learning
[Page 27]
Sampling parameter vectors

Approximate with MAP: this is extremely crude, but extremely convenient!

Finn*, Xu*, Levine. Probabilistic Model-Agnostic Meta-Learning
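The sampling idea can be sketched as: draw an initialization from a learned Gaussian over parameters (the "random kick"), then adapt it with a few gradient steps. This is an illustrative simplification, not the exact PLATIPUS procedure; `mu`, `log_sigma`, and `grad_fn` are hypothetical stand-ins for the learned prior and task loss.

```python
import numpy as np

def sample_adapted_params(mu, log_sigma, grad_fn, data,
                          inner_lr=0.1, steps=5, seed=None):
    """Ancestral-sampling sketch: theta ~ N(mu, sigma^2), then adaptation."""
    rng = np.random.default_rng(seed)
    # The "random kick": sample an initialization from the learned prior.
    theta = mu + np.exp(log_sigma) * rng.normal(size=mu.shape)
    # Gradient-based adaptation on the task data.
    for _ in range(steps):
        theta = theta - inner_lr * grad_fn(theta, data)
    return theta
```

Different samples of θ can land in different modes, which is how ambiguity (smiling + hat vs. smiling + young) shows up as different sampled classifiers.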
[Page 28]
What does ancestral sampling look like?
[Examples: smiling + hat; smiling + young]
Finn*, Xu*, Levine. Probabilistic Model-Agnostic Meta-Learning
Sampling parameter vectors
[Page 29]
PLATIPUS: Probabilistic LATent model for Incorporating Priors and Uncertainty in few-Shot learning
Ambiguous regression:
Ambiguous classification:
Finn*, Xu*, Levine. Probabilistic Model-Agnostic Meta-Learning
[Page 30]
PLATIPUS: Probabilistic LATent model for Incorporating Priors and Uncertainty in few-Shot learning
Finn*, Xu*, Levine. Probabilistic Model-Agnostic Meta-Learning
[Page 31]
Related concurrent work
Kim et al. “Bayesian Model-Agnostic Meta-Learning”: uses Stein variational gradient descent for sampling parameters
Gordon et al. “Decision-Theoretic Meta-Learning”: unifies a number of meta-learning algorithms under a variational inference framework
[Page 32]
What do we want from our meta-learning algorithms? (recap of the property list)
[Page 33]
We can build meta-learning algorithms with many nice properties (expressive power, consistency, first-order, uncertainty awareness)
[Page 34]
Collaborators: Sergey Levine, Pieter Abbeel, Tianhe Yu, Trevor Darrell, Tom Griffiths, Erin Grant, Tianhao Zhang, Josh Abbott, Josh Peterson, Annie Xie, Sudeep Dasari, Kelvin Xu

Questions?

Blog post, code, and papers: eecs.berkeley.edu/~cbfinn
[Page 35]