Properties of Good Meta-Learning Algorithms
Transcript of the talk slides.
[Page 1] Source: ai.stanford.edu/~cbfinn/_files/icml2018_automl_35min.pdf
Chelsea Finn, UC Berkeley & Google Brain → Stanford
Properties of Good Meta-Learning Algorithms (And How to Achieve Them)
[Page 2]
Why Learn to Learn?
- effectively reuse data on other tasks
- replace manual engineering of architecture, hyperparameters, etc.
- learn to quickly adapt to unexpected scenarios (inevitable failures, long tail)
- learn how to learn with weak supervision
[Page 3]
Problem Domains:
- few-shot classification & generation
- hyperparameter optimization
- architecture search
- faster reinforcement learning
- domain generalization
- learning structure
- …

Approaches:
- recurrent networks
- learning optimizers or update rules
- learning initial parameters & architecture
- acquiring metric spaces
- Bayesian models
- …

What is the meta-learning problem statement?
[Page 4]
The Meta-Learning Problem

Supervised Learning:
- Inputs: x    Outputs: y    Data: D = {(x_k, y_k)}
- f(x) → y

Meta-Supervised Learning:
- Inputs: D_train, x_test    Outputs: y_test    Data: {D_i}
- f(D_train, x_test) → y_test

Why is this view useful? It reduces the problem to the design & optimization of f.
Finn, Levine. Meta-learning and Universality: Deep Representation… ICLR 2018
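To make the reduction concrete, here is a minimal, hypothetical instance of f with the meta-supervised signature f(D_train, x_test) → y_test: a 1-nearest-neighbor "learner as a function". The name and setup are illustrative, not from the talk.

```python
# A trivial, hypothetical instance of meta-supervised learning's f:
# f(D_train, x_test) -> y_test, implemented here as 1-nearest neighbor.
# Any function with this signature is a candidate learning procedure.

def f_nearest_neighbor(train_data, x_test):
    """train_data: list of (x, y) pairs; x_test: a feature vector."""
    sq_dist = lambda a, b: sum((u - v) ** 2 for u, v in zip(a, b))
    # Predict the label of the closest training point.
    _, y = min(train_data, key=lambda pair: sq_dist(pair[0], x_test))
    return y
```

Designing f then means choosing a parametric family for this function and optimizing it over many datasets D_i.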
[Page 5]
Example: Few-Shot Classification
diagram adapted from Ravi & Larochelle ‘17
Given 1 example of each of 5 classes: classify new examples.
[Figure: meta-training over training classes; each task has its own training data and test set]
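The meta-training setup in the figure can be sketched as episode construction. This is a hypothetical sketch, not the talk's code; `pool` is assumed to map a class id to its list of examples.

```python
import random

def sample_episode(pool, n_way=5, k_shot=1, n_query=1, rng=random):
    """Build one N-way, K-shot task: a small training set and a test set
    drawn from n_way classes sampled from the meta-training classes."""
    classes = rng.sample(sorted(pool), n_way)
    train, test = [], []
    for new_label, c in enumerate(classes):
        # Disjoint support (train) and query (test) examples per class.
        examples = rng.sample(pool[c], k_shot + n_query)
        train += [(x, new_label) for x in examples[:k_shot]]
        test += [(x, new_label) for x in examples[k_shot:]]
    return train, test
```

Meta-training repeatedly samples such episodes so that f is optimized across tasks rather than within a single one.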
[Page 6]
What do we want from our meta-learning algorithms?
- Expressive power: the ability for f to represent a range of learning procedures
- Consistency: the learned learning procedure will solve the task given enough data; result: reasonable performance on out-of-distribution tasks
- Uncertainty awareness: the ability to reason about ambiguity during learning
- No need to differentiate through learning: better scalability to long learning processes
[Page 7]
Design of f? Recurrent network (LSTM, NTM, Conv)
Santoro et al. ’16, Duan et al. ’17, Wang et al. ’17, Munkhdalai & Yu ’17, Mishra et al. ’17, …
[Page 8]
Learned optimizer (often uses recurrence)
Schmidhuber et al. ’87, Bengio et al. ’90, Hochreiter et al. ’01, Li & Malik ’16, Andrychowicz et al. ’16, Ha et al. ’17, Ravi & Larochelle ’17, …
[Page 9]
Black box approaches
Expressive power: ✓ Consistency: ✗
Can we incorporate structure into the learning procedure?
[Page 10]
Nearest Neighbors (in a learned metric space)
Koch ’15, Vinyals et al. ’16, Snell et al. ’17, Reed et al. ’17, Li et al. ’17, …
Vinyals et al. ’16
Expressive power: ? Consistency: ✓
Approaches that incorporate learning structure
Can we get both?
Koch ’15
[Page 11]
Nearest Neighbors (in a learned metric space)
Koch ’15, Vinyals et al. ’16, Snell et al. ’17, Reed et al. ’17, Li et al. ’17, …
Gradient Descent (from learned initialization)
Finn et al. ’17, Grant et al. ’17, Reed et al. ’17, Li et al. ’17, …
Key idea: train over many tasks to learn a parameter vector θ that transfers via fine-tuning.

Fine-tuning [test-time]: starting from pretrained parameters θ, take gradient steps on the training data for the test task:
  φ ← θ − α ∇_θ L(θ, D^train)

Model-Agnostic Meta-Learning (MAML):
  min_θ Σ_i L(θ − α ∇_θ L(θ, D_i^train), D_i^test)

Finn, Abbeel, Levine. Model-Agnostic Meta-Learning for Fast Adaptation of Deep Networks. ICML 2017
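For a linear model with squared error, one MAML meta-training step can be written out exactly, including the chain rule through the inner update. This is an illustrative NumPy sketch under those assumptions, not the paper's implementation.

```python
import numpy as np

def inner_grad(theta, X, y):
    """Gradient of mean squared error for the linear model X @ theta."""
    return 2.0 * X.T @ (X @ theta - y) / len(y)

def maml_step(theta, tasks, inner_lr=0.05, outer_lr=0.01):
    """One outer update over tasks = [(X_train, y_train, X_test, y_test), ...]."""
    meta_grad = np.zeros_like(theta)
    for X_tr, y_tr, X_te, y_te in tasks:
        # Inner loop: one gradient step from the shared initialization theta.
        theta_prime = theta - inner_lr * inner_grad(theta, X_tr, y_tr)
        # Outer gradient: chain rule through the inner update; for this
        # quadratic loss, d theta'/d theta = I - inner_lr * H exactly.
        H = 2.0 * X_tr.T @ X_tr / len(y_tr)
        meta_grad += (np.eye(len(theta)) - inner_lr * H) @ inner_grad(theta_prime, X_te, y_te)
    return theta - outer_lr * meta_grad / len(tasks)
```

For deep networks the same chain rule is applied by automatic differentiation instead of by hand.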
Approaches that incorporate learning structure
[Page 12]
Learning procedure: run gradient descent from the learned initialization θ.

Approaches that incorporate learning structure:
- Nearest Neighbors (in a learned metric space): Koch ’15, Vinyals et al. ’16, Snell et al. ’17, Reed et al. ’17, Li et al. ’17, …
- Gradient Descent (from learned initialization): Finn et al. ’17, Grant et al. ’17, Reed et al. ’17, Li et al. ’17, …

Expressive power: ? Consistency: ✓

Does consistency come at a cost?
[Page 13]
Universal Function Approximation Theorem
Hornik et al. ’89, Cybenko ’89, Funahashi ’89: A neural network with one hidden layer of finite width can approximate any continuous function.
How can we define a notion of expressive power for meta-learning?
“universal function approximator”
Recurrent network Learned optimizer
“universal learning procedure approximator”
Finn, Abbeel, Levine. Model-Agnostic Meta-Learning for Fast Adaptation of Deep Networks. ICML 2017
[Page 14]
For a sufficiently deep f, the MAML learning procedure can approximate any function of D^train and x^test.
Finn & Levine, ICLR 2018
Why is this interesting? MAML has the benefit of consistency without losing expressive power.
Recurrent network MAML
Assumptions:
- nonzero learning rate α
- the loss-function gradient does not lose information about the label
- datapoints in D^train are unique
Finn, Levine. Meta-learning and Universality: Deep Representation… ICLR 2018
[Page 15]
Black box approaches: Expressive power ✓
MAML: Expressive power ✓, Consistency ✓
Empirically, what does consistency get you?
[Page 16]
How well can methods generalize to similar, but extrapolated tasks? The world is non-stationary.

Omniglot image classification: [plot of performance vs. task variability, comparing MAML with TCML and MetaNetworks]
Finn, Levine. Meta-learning and Universality: Deep Representation… ICLR 2018
[Page 17]
How well can methods generalize to similar, but extrapolated tasks? The world is non-stationary.

Sinusoid curve regression: [plot of error vs. task variability, comparing MAML with TCML]

Takeaway: Strategies learned with MAML consistently generalize better to out-of-distribution tasks.
Finn, Levine. Meta-learning and Universality: Deep Representation… ICLR 2018
[Page 18]
What do we want from our meta-learning algorithms? (recap of the property list)
[Page 19]
What do we want from our meta-learning algorithms? (recap of the property list)
[Page 20]
Model-Agnostic Meta-Learning (MAML)

What if we stop the gradient through this term (the gradient of the inner-loop loss), i.e., use a first-order approximation?

Note: In some tasks, we have observed a more substantial drop.

Finn, Abbeel, Levine. Model-Agnostic Meta-Learning for Fast Adaptation of Deep Networks. ICML 2017
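In a linear-regression setting the stop-gradient is easy to see: dropping the d θ'/d θ term leaves only the test-set gradient at the adapted parameters. This is an illustrative sketch assuming a linear model with squared error, not the paper's code.

```python
import numpy as np

def inner_grad(theta, X, y):
    """Gradient of mean squared error for the linear model X @ theta."""
    return 2.0 * X.T @ (X @ theta - y) / len(y)

def meta_grad(theta, X_tr, y_tr, X_te, y_te, a=0.1, stop_gradient=False):
    """Meta-gradient of the test loss after one inner step of size a."""
    theta_prime = theta - a * inner_grad(theta, X_tr, y_tr)
    g_test = inner_grad(theta_prime, X_te, y_te)
    if stop_gradient:
        # First-order MAML: treat theta_prime as a constant w.r.t. theta.
        return g_test
    # Full MAML: chain rule through the inner update (H is the inner Hessian).
    H = 2.0 * X_tr.T @ X_tr / len(y_tr)
    return (np.eye(len(theta)) - a * H) @ g_test
```

The two versions differ by the factor (I − aH), so when the inner step aH is small the first-order approximation stays close to the full meta-gradient.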
[Page 21]
What do we want from our meta-learning algorithms? (recap of the property list)
[Page 22]
What do we want from our meta-learning algorithms? (recap of the property list)
No uncertainty awareness in original MAML
[Page 23]
A Probabilistic Interpretation of MAML

That’s nice, but is it… Bayesian? Kind of… but it can be more Bayesian.
Erin Grant
(Santos, 1996)
Grant, Finn, Levine, Darrell, Griffiths. Recasting gradient-based meta-learning as hierarchical Bayes. ICLR ‘18
[Page 24]
Erin Grant

MAP inference in this model.
Can we do better than MAP? Can use a Laplace estimate: estimate the Hessian for neural nets with K-FAC.
A Probabilistic Interpretation of MAML
Grant, Finn, Levine, Darrell, Griffiths. Recasting gradient-based meta-learning as hierarchical Bayes. ICLR ‘18
[Page 25]
Modeling ambiguity

Can we sample classifiers?

Intuition: we want to learn a prior where a random kick can put us in different modes.

[Examples: smiling + hat; smiling + young]
[Page 26]
Meta-learning with ambiguity
Finn*, Xu*, Levine. Probabilistic Model-Agnostic Meta-Learning
[Page 27]
Sampling parameter vectors

Approximate with MAP: this is extremely crude, but extremely convenient!

Finn*, Xu*, Levine. Probabilistic Model-Agnostic Meta-Learning
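The sampling idea can be sketched as: draw an initialization from a learned Gaussian over parameters (the "random kick"), then adapt it with a few gradient steps. This is an illustrative simplification, not the exact PLATIPUS procedure; `mu`, `log_sigma`, and `grad_fn` are hypothetical stand-ins for the learned prior and task loss.

```python
import numpy as np

def sample_adapted_params(mu, log_sigma, grad_fn, data,
                          inner_lr=0.1, steps=5, seed=None):
    """Ancestral-sampling sketch: theta ~ N(mu, sigma^2), then adaptation."""
    rng = np.random.default_rng(seed)
    # The "random kick": sample an initialization from the learned prior.
    theta = mu + np.exp(log_sigma) * rng.normal(size=mu.shape)
    # Gradient-based adaptation on the task data.
    for _ in range(steps):
        theta = theta - inner_lr * grad_fn(theta, data)
    return theta
```

Different samples of θ can land in different modes, which is how ambiguity (smiling + hat vs. smiling + young) shows up as different sampled classifiers.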
[Page 28]
What does ancestral sampling look like?
[Examples: smiling + hat; smiling + young]
Finn*, Xu*, Levine. Probabilistic Model-Agnostic Meta-Learning
Sampling parameter vectors
[Page 29]
PLATIPUS: Probabilistic LATent model for Incorporating Priors and Uncertainty in few-Shot learning
Ambiguous regression:
Ambiguous classification:
Finn*, Xu*, Levine. Probabilistic Model-Agnostic Meta-Learning
[Page 30]
PLATIPUS: Probabilistic LATent model for Incorporating Priors and Uncertainty in few-Shot learning
Finn*, Xu*, Levine. Probabilistic Model-Agnostic Meta-Learning
[Page 31]
Related concurrent work
Kim et al. “Bayesian Model-Agnostic Meta-Learning”: uses Stein variational gradient descent for sampling parameters
Gordon et al. “Decision-Theoretic Meta-Learning”: unifies a number of meta-learning algorithms under a variational inference framework
[Page 32]
What do we want from our meta-learning algorithms? (recap of the property list)
[Page 33]
We can build meta-learning algorithms with many nice properties (expressive power, consistency, first-order, uncertainty awareness)
[Page 34]
Collaborators: Sergey Levine, Pieter Abbeel, Tianhe Yu, Trevor Darrell, Tom Griffiths, Erin Grant, Tianhao Zhang, Josh Abbott, Josh Peterson, Annie Xie, Sudeep Dasari, Kelvin Xu

Questions?

Blog post, code, and papers: eecs.berkeley.edu/~cbfinn
[Page 35]