The Meta-Learning Problem & Black-Box Meta-Learning
Transcript of The Meta-Learning Problem & Black-Box Meta-Learning
CS 330
Logistics
Homework 1 to be posted today, due Wednesday, October 6
One more CA!
Optional Homework 0 due tonight.
Azure guide to be posted today
Plan for Today
Transfer Learning - Problem formulation - Fine-tuning
Meta-Learning - Problem formulation - General recipe of meta-learning algorithms - Black-box adaptation approaches - Case study of GPT-3 (time-permitting)
Goals for by the end of lecture: - Differences between multi-task learning, transfer learning, and meta-learning problems - Basics of transfer learning via fine-tuning - Training set-up for few-shot meta-learning algorithms - How to implement black-box meta-learning techniques
Topic of Homework 1!
Multi-Task Learning vs. Transfer Learning
min_θ ∑_{i=1}^{T} ℒ_i(θ, 𝒟_i)
Multi-Task Learning
Solve multiple tasks 𝒯1, ⋯, 𝒯T at once.
Transfer Learning
Solve target task 𝒯b after solving source task 𝒯a, by transferring knowledge learned from 𝒯a
Key assumption: Cannot access data 𝒟a during transfer.
Transfer learning is a valid solution to multi-task learning (but not vice versa).
Question: What are some problems/applications where transfer learning might make sense?
when 𝒟a is very large (don't want to retain & retrain on 𝒟a)
when you don't care about solving 𝒯a & 𝒯b simultaneously
training data for new task 𝒯b
Parameters pre-trained on 𝒟a
ϕ ← θ − α∇θ ℒ(θ, 𝒟tr)
Where do you get the pre-trained parameters? - ImageNet classification - Models trained on large language corpora (BERT, LMs) - Other unsupervised learning techniques - Whatever large, diverse dataset you might have
(typically for many gradient steps)
Some common practices - Fine-tune with a smaller learning rate - Smaller learning rate for earlier layers - Freeze earlier layers, gradually unfreeze - Reinitialize last layer - Search over hyperparameters via cross-val - Architecture choices matter (e.g. ResNets)
Pre-trained models often available online.
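Two of the common practices above (freezing the earliest layer and using smaller learning rates for earlier layers) can be sketched in a few lines. This is a minimal numpy illustration, not a real training loop; the three-layer "network" and the learning-rate values are hypothetical.

```python
import numpy as np

# Hypothetical 3-layer linear network; W[0] is the earliest layer.
rng = np.random.default_rng(0)
W = [rng.standard_normal((4, 4)) * 0.1 for _ in range(3)]

# Layer 0 frozen (lr = 0); later layers get progressively larger lrs.
layer_lrs = [0.0, 1e-4, 1e-3]

def finetune_step(weights, grads, lrs):
    """One gradient step with a per-layer learning rate."""
    return [w - lr * g for w, g, lr in zip(weights, grads, lrs)]

grads = [np.ones_like(w) for w in W]   # stand-in gradients
W_new = finetune_step(W, grads, layer_lrs)
# The frozen layer is unchanged; later layers move more per step.
```

Gradually unfreezing then just means raising a layer's entry in `layer_lrs` from zero as training proceeds.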
What makes ImageNet good for transfer learning? Huh, Agrawal, Efros. '16
Transfer learning via fine-tuning
Universal Language Model Fine-Tuning for Text Classification. Howard, Ruder. '18
Fine-tuning doesn’t work well with very small target task datasets
This is where meta-learning can help.
Plan for Today
Transfer Learning - Problem formulation - Fine-tuning
Meta-Learning - Problem formulation - General recipe of meta-learning algorithms - Black-box adaptation approaches - Case study of GPT-3 (time-permitting)
The Meta-Learning Problem Statement (that we will consider in this class)
Two ways to view meta-learning algorithms
Mechanistic view
➢ Deep network that can read in an entire dataset and make predictions for new datapoints
➢ Training this network uses a meta-dataset, which itself consists of many datasets, each for a different task
Probabilistic view
➢ Extract prior knowledge from a set of tasks that allows efficient learning of new tasks
➢ Learning a new task uses this prior and (small) training set to infer most likely posterior parameters
For now: Focus primarily on the mechanistic view.
(Bayes will come back later)
How does meta-learning work? An example.
Given 1 example of 5 classes: Classify new examples
training data test set
How does meta-learning work? An example.
meta-training: training classes
meta-testing: 𝒯test
Given 1 example of 5 classes: Classify new examples
training data test set
Can replace image classification with: regression, language generation, skill learning, any ML problem
The Meta-Learning Problem
Given data from 𝒯1, …, 𝒯n, quickly solve new task 𝒯test
Key assumption: meta-training tasks and meta-test task drawn i.i.d. from same task distribution: 𝒯1, …, 𝒯n ∼ p(𝒯), 𝒯test ∼ p(𝒯)
What do the tasks correspond to? - recognizing handwritten digits from different languages (see homework 1!)
- giving feedback to students on different exams
- classifying species in different regions of the world
- a robot performing different tasks
How many tasks do you need? The more the better (analogous to more data in ML).
Like before, tasks must share structure.
Some terminology
𝒟i^tr: task training set ("support set")
𝒟i^test: task test dataset ("query set")
k-shot learning: learning with k examples per class (or k examples total for regression)
N-way classification: choosing between N classes
Question: What are k and N for the above example?
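To make the episode structure concrete, here is a minimal sampler for an N-way, k-shot episode with a disjoint query set. Pure Python; the toy data, names, and defaults are illustrative only.

```python
import random

def sample_episode(data_by_class, n_way=5, k_shot=1, q_queries=2, seed=0):
    """Sample an N-way, k-shot episode: a support set with k examples
    for each of N sampled classes, plus a disjoint query set."""
    rng = random.Random(seed)
    classes = rng.sample(sorted(data_by_class), n_way)
    support, query = [], []
    for label, cls in enumerate(classes):
        examples = rng.sample(data_by_class[cls], k_shot + q_queries)
        support += [(x, label) for x in examples[:k_shot]]
        query += [(x, label) for x in examples[k_shot:]]
    return support, query

# Toy data: 10 classes with 20 examples each.
data = {c: [f"img_{c}_{i}" for i in range(20)] for c in range(10)}
support, query = sample_episode(data)
# 5 classes x 1 shot = 5 support examples; 5 x 2 = 10 query examples.
```

Note that labels are re-indexed 0..N-1 per episode, so the meta-learner must read the support set to know what the classes mean.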
min_θ ∑_{i=1}^{T} ℒ_i(θ, 𝒟_i)
Multi-Task Learning
Solve multiple tasks 𝒯1, ⋯, 𝒯T at once.
Transfer Learning
Solve target task 𝒯b after solving source task 𝒯a, by transferring knowledge learned from 𝒯a
The Meta-Learning Problem
Given data from 𝒯1, …, 𝒯n, quickly solve new task 𝒯test
In all settings: tasks must share structure.
In transfer learning and meta-learning: generally impractical to access data from prior tasks
Problem Settings Recap
Plan for Today
Transfer Learning - Problem formulation - Fine-tuning
Meta-Learning - Problem formulation - General recipe of meta-learning algorithms - Black-box adaptation approaches - Case study of GPT-3 (time-permitting)
One View on the Meta-Learning Problem
Supervised Learning: Inputs: x, Outputs: y, Data: 𝒟 = {(x, y)}
Meta Supervised Learning: Inputs: 𝒟i^tr, x^ts, Outputs: y^ts, Data: {𝒟i}
Why is this view useful? Reduces the meta-learning problem to the design & optimization of f.
Finn. Learning to Learn with Gradients. PhD Thesis. 2018
𝒟tr
General recipe: How to design a meta-learning algorithm
1. Choose a form of p(ϕi | 𝒟i^tr, θ)
2. Choose how to optimize meta-parameters θ w.r.t. max-likelihood objective using meta-training data
Plan for Today
Transfer Learning - Problem formulation - Fine-tuning
Meta-Learning - Problem formulation - General recipe of meta-learning algorithms - Black-box adaptation approaches - Case study of GPT-3 (time-permitting)
Black-Box Adaptation
Key idea: Train a neural network fθ (the "learner") to represent ϕi = fθ(𝒟i^tr)
Train with standard supervised learning!
max_θ ∑_{𝒯i} ℒ(fθ(𝒟i^tr), 𝒟i^test) = max_θ ∑_{𝒯i} ∑_{(x,y)∼𝒟i^test} log g_ϕi(y|x)
Predict test points with y^ts = g_ϕi(x^ts)
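In code, this is a set-to-vector network: fθ reads the entire support set 𝒟i^tr and emits ϕi, which then parameterizes the predictor g. A tiny numpy sketch; all shapes and the particular encoder/predictor forms are illustrative, not the canonical architecture.

```python
import numpy as np

rng = np.random.default_rng(0)
theta = rng.standard_normal((6, 8)) * 0.1   # params of f_theta

def f_theta(theta, D_tr):
    """Encode each (x, y) pair, aggregate over the set, output phi_i."""
    pairs = np.array([np.concatenate([x, y]) for x, y in D_tr])  # (k, 6)
    encoded = np.tanh(pairs @ theta)                             # (k, 8)
    return encoded.mean(axis=0)          # phi_i: set-level summary

def g_phi(phi_i, x_ts):
    """phi_i parameterizes a linear predictor over the 4-dim input."""
    W = phi_i[:4].reshape(1, 4)
    b = phi_i[4]
    return (x_ts @ W.T + b).ravel()      # y_ts predictions

D_tr = [(rng.standard_normal(4), rng.standard_normal(2)) for _ in range(3)]
phi = f_theta(theta, D_tr)               # read the whole support set
y_ts = g_phi(phi, rng.standard_normal((5, 4)))
```

The key structural point: ϕi is a function of the dataset, not of any single example, so adaptation to a task happens in a single forward pass.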
Black-Box Adaptation
Key idea: Train a neural network to represent ϕi = fθ(𝒟i^tr).
1. Sample task 𝒯i (or mini-batch of tasks)
2. Sample disjoint datasets 𝒟i^tr, 𝒟i^test from 𝒟i
3. Compute ϕi ← fθ(𝒟i^tr)
4. Update θ using ∇θ ℒ(ϕi, 𝒟i^test)
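A toy end-to-end run of these four steps. Everything is deliberately simplified for illustration: tasks are constant functions, fθ is a single scalar meta-parameter, and the gradient in step 4 is written out by hand rather than via autodiff.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy 1-D regression tasks: task i is the constant function y = a_i.
# f_theta summarizes the support set as phi_i = theta * mean(y_tr),
# and g_phi predicts the constant phi_i for every query point.
def sample_task():
    a = rng.uniform(-1, 1)
    D_tr = a + 0.01 * rng.standard_normal(3)     # support labels
    D_test = a + 0.01 * rng.standard_normal(3)   # disjoint query labels
    return D_tr, D_test

theta = 0.0   # meta-parameter; the optimum here is theta = 1
lr = 0.5
for step in range(200):
    D_tr, D_test = sample_task()                       # steps 1-2
    phi = theta * D_tr.mean()                          # step 3
    grad = np.mean(2 * (phi - D_test) * D_tr.mean())   # d(MSE)/d(theta)
    theta -= lr * grad                                 # step 4
```

After meta-training, θ is close to 1, i.e. the learned rule is "predict the mean of the support labels," which is exactly the right adaptation procedure for this task family.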
Black-Box Adaptation
Challenge: Outputting all neural net parameters does not seem scalable?
Idea: Do not need to output all parameters of neural net, only sufficient statistics (Santoro et al. MANN, Mishra et al. SNAIL)
low-dimensional vector hi represents contextual task information
ϕi = {hi, θg}
general form: y^ts = fθ(𝒟i^tr, x^ts)
recall: Key idea: Train a neural network to represent ϕi = fθ(𝒟i^tr).
What architecture should we use for fθ?
Black-Box Adaptation Architectures
LSTMs or Neural Turing machine (NTM): Meta-Learning with Memory-Augmented Neural Networks. Santoro, Bartunov, Botvinick, Wierstra, Lillicrap. ICML '16
Feedforward + average: Conditional Neural Processes. Garnelo, Rosenbaum, Maddison, Ramalho, Saxton, Shanahan, Teh, Rezende, Eslami. ICML '18
Other external memory mechanisms: Meta Networks. Munkhdalai, Yu. ICML '17
Convolutions & attention: A Simple Neural Attentive Meta-Learner. Mishra, Rohaninejad, Chen, Abbeel. ICLR '18
Question: Why might feedforward+average be better than a recurrent model?
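One answer: averaging per-example encodings is permutation-invariant, so the order in which support examples arrive cannot change the task representation, whereas an RNN's hidden state generally depends on order. A minimal numpy sketch (the random encoder weights are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.standard_normal((4, 8))   # hypothetical encoder weights

def encode_set(D, W):
    """Encode each example with a feedforward layer, then average."""
    return np.tanh(np.asarray(D) @ W).mean(axis=0)

D = rng.standard_normal((5, 4))   # a 5-example support set
h1 = encode_set(D, W)
h2 = encode_set(D[::-1], W)       # same set, reversed order
# h1 == h2: the aggregated representation ignores example order,
# matching the fact that a dataset is a set, not a sequence.
```

Averaging also handles variable support-set sizes for free, at the cost of some expressiveness relative to attention over the set.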
HW 1: - implement data processing - implement simple black-box meta-learner - train few-shot Omniglot classifier
Let's run through an example
Omniglot dataset (Lake et al. Science 2015): the "transpose" of MNIST
many classes, few examples: 1623 characters from 50 different alphabets
20 instances of each character; statistics more reflective of the real world
More few-shot image recognition datasets: tieredImageNet, CIFAR, CUB, CelebA, ORBIT, others
More benchmarks: molecular property prediction (Nguyen et al. '20), object pose prediction (Yin et al. ICLR '20), channel coding (Li et al. '21)
*whiteboard*
Black-Box Adaptation
Key idea: Train a neural network to represent ϕi = fθ(𝒟i^tr).
+ expressive
+ easy to combine with variety of learning problems (e.g. SL, RL)
- complex model with complex task: challenging optimization problem
- often data-inefficient
How else can we represent ϕi = fθ(𝒟i^tr)?
Next time (Wednesday): What if we treat it as an optimization procedure?
Plan for Today
Transfer Learning - Problem formulation - Fine-tuning
Meta-Learning - Problem formulation - General recipe of meta-learning algorithms - Black-box adaptation approaches - Case study of GPT-3 (time-permitting)
Case Study: GPT-3
May 2020
What is GPT-3?
black-box meta-learner trained on language generation tasks
architecture: giant "Transformer" network with 175 billion parameters, 96 layers, and a 3.2M-token batch size
𝒟i^tr: a sequence of characters; 𝒟i^ts: the following sequence of characters
[meta-training] dataset: crawled data from the internet, English-language Wikipedia, two books corpora
a language model
What do different tasks correspond to? - spelling correction - simple math problems - translating between languages - a variety of other tasks
How can those tasks all be solved by a single architecture? Put them all in the form of text!
spelling correction, simple math problems, translating between languages
Why is that a good idea? Very easy to get a lot of meta-training data.
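Putting tasks in the form of text means 𝒟i^tr becomes a string of in-context examples and x^ts becomes a prompt for the model to complete. A hypothetical formatting helper (the `->` convention is made up for illustration; GPT-3 prompts are free-form text):

```python
def build_prompt(d_tr, x_ts):
    """Render a support set as in-context examples, then append the
    query input for the language model to complete."""
    lines = [f"{x} -> {y}" for x, y in d_tr]
    lines.append(f"{x_ts} ->")
    return "\n".join(lines)

# Simple math problems posed as text completion:
prompt = build_prompt([("2+2", "4"), ("3+5", "8")], "4+7")
# prompt:
# 2+2 -> 4
# 3+5 -> 8
# 4+7 ->
```

Any task that can be rendered this way becomes a few-shot episode for the model, with no parameter updates at test time.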
Some Results
One-shot learning from dictionary definitions:
Few-shot language editing:
Non-few-shot learning tasks:
General Notes & Takeaways
The results are extremely impressive.
The model fails in unintuitive ways.
Source: https://lacker.io/ai/2020/07/06/giving-gpt-3-a-turing-test.html
Source: https://github.com/shreyashankar/gpt3-sandbox/blob/master/docs/priming.md
The choice of 𝒟i^tr at test time is important ("priming").
The model is far from perfect.
Plan for Today
Transfer Learning - Problem formulation - Fine-tuning
Meta-Learning - Problem formulation - General recipe of meta-learning algorithms - Black-box adaptation approaches - Case study of GPT-3 (time-permitting)
Goals for by the end of lecture: - Differences between multi-task learning, transfer learning, and meta-learning problems - Basics of transfer learning via fine-tuning - Training set-up for few-shot meta-learning algorithms - How to implement black-box meta-learning techniques
Topic of Homework 1!
Reminders
Next time: Optimization-based meta-learning
Homework 1 to be posted today, due Wednesday, October 6
Optional Homework 0 due tonight.
Azure guide to be posted today