The Meta-Learning Problem & Black-Box Meta-Learning
Transcript of The Meta-Learning Problem & Black-Box Meta-Learning
CS 330
Logistics
Homework 1 to be posted today, due Wednesday, October 6
One more CA!
Optional Homework 0 due tonight.
Azure guide to be posted today
Plan for Today
Transfer Learning - Problem formulation - Fine-tuning
Meta-Learning - Problem formulation - General recipe of meta-learning algorithms - Black-box adaptation approaches - Case study of GPT-3 (time-permitting)
Goals for by the end of lecture: - Differences between multi-task learning, transfer learning, and meta-learning problems - Basics of transfer learning via fine-tuning - Training set-up for few-shot meta-learning algorithms - How to implement black-box meta-learning techniques
Topic of Homework 1!
Multi-Task Learning vs. Transfer Learning
min_θ ∑_{i=1}^{T} ℒ_i(θ, 𝒟_i)
Multi-Task Learning
Solve multiple tasks 𝒯1, ⋯, 𝒯T at once.
Transfer Learning
Solve target task 𝒯b after solving source task 𝒯a, by transferring knowledge learned from 𝒯a
Key assumption: Cannot access data 𝒟a during transfer.
Transfer learning is a valid solution to multi-task learning (but not vice versa).
Question: What are some problems/applications where transfer learning might make sense?
when 𝒟a is very large (don't want to retain & retrain on 𝒟a)
when you don't care about solving 𝒯a & 𝒯b simultaneously
training data for new task 𝒯b
Parameters pre-trained on 𝒟a
ϕ ← θ − α∇θ ℒ(θ, 𝒟tr)
Where do you get the pre-trained parameters? - ImageNet classification - Models trained on large language corpora (BERT, LMs) - Other unsupervised learning techniques - Whatever large, diverse dataset you might have
(typically for many gradient steps)
Some common practices - Fine-tune with a smaller learning rate - Smaller learning rate for earlier layers - Freeze earlier layers, gradually unfreeze - Reinitialize last layer - Search over hyperparameters via cross-val - Architecture choices matter (e.g. ResNets)
Pre-trained models often available online.
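Two of the common practices above (freezing the earliest layer and using smaller learning rates for earlier layers) can be sketched in a few lines. This is a minimal numpy illustration, not a real training loop; the three-layer "network" and the learning-rate values are hypothetical.

```python
import numpy as np

# Hypothetical 3-layer linear network; W[0] is the earliest layer.
rng = np.random.default_rng(0)
W = [rng.standard_normal((4, 4)) * 0.1 for _ in range(3)]

# Layer 0 frozen (lr = 0); later layers get progressively larger lrs.
layer_lrs = [0.0, 1e-4, 1e-3]

def finetune_step(weights, grads, lrs):
    """One gradient step with a per-layer learning rate."""
    return [w - lr * g for w, g, lr in zip(weights, grads, lrs)]

grads = [np.ones_like(w) for w in W]   # stand-in gradients
W_new = finetune_step(W, grads, layer_lrs)
# The frozen layer is unchanged; later layers move more per step.
```

Gradually unfreezing then just means raising a layer's entry in `layer_lrs` from zero as training proceeds.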
What makes ImageNet good for transfer learning? Huh, Agrawal, Efros. '16
Transfer learning via fine-tuning
Universal Language Model Fine-Tuning for Text Classification. Howard, Ruder. '18
Fine-tuning doesn’t work well with very small target task datasets
This is where meta-learning can help.
Plan for Today
Transfer Learning - Problem formulation - Fine-tuning
Meta-Learning - Problem formulation - General recipe of meta-learning algorithms - Black-box adaptation approaches - Case study of GPT-3 (time-permitting)
The Meta-Learning Problem Statement (that we will consider in this class)
Two ways to view meta-learning algorithms
Mechanistic view
➢ Deep network that can read in an entire dataset and make predictions for new datapoints
➢ Training this network uses a meta-dataset, which itself consists of many datasets, each for a different task
Probabilistic view
➢ Extract prior knowledge from a set of tasks that allows efficient learning of new tasks
➢ Learning a new task uses this prior and (small) training set to infer most likely posterior parameters
For now: Focus primarily on the mechanistic view.
(Bayes will come back later)
How does meta-learning work? An example.
Given 1 example of 5 classes: Classify new examples
training data test set
How does meta-learning work? An example.
meta-training: training classes
meta-testing: 𝒯test
Given 1 example of 5 classes: Classify new examples
training data test set
Can replace image classification with: regression, language generation, skill learning, any ML problem
The Meta-Learning Problem
Given data from 𝒯1, …, 𝒯n, quickly solve new task 𝒯test
Key assumption: meta-training tasks and meta-test task drawn i.i.d. from same task distribution: 𝒯1, …, 𝒯n ∼ p(𝒯), 𝒯test ∼ p(𝒯)
What do the tasks correspond to? - recognizing handwritten digits from different languages (see homework 1!)
- giving feedback to students on different exams
- classifying species in different regions of the world
- a robot performing different tasks
How many tasks do you need? The more the better (analogous to more data in ML).
Like before, tasks must share structure.
Some terminology
𝒟i^tr: task training set ("support set")
𝒟i^test: task test dataset ("query set")
k-shot learning: learning with k examples per class (or k examples total for regression)
N-way classification: choosing between N classes
Question: What are k and N for the above example?
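To make the episode structure concrete, here is a minimal sampler for an N-way, k-shot episode with a disjoint query set. Pure Python; the toy data, names, and defaults are illustrative only.

```python
import random

def sample_episode(data_by_class, n_way=5, k_shot=1, q_queries=2, seed=0):
    """Sample an N-way, k-shot episode: a support set with k examples
    for each of N sampled classes, plus a disjoint query set."""
    rng = random.Random(seed)
    classes = rng.sample(sorted(data_by_class), n_way)
    support, query = [], []
    for label, cls in enumerate(classes):
        examples = rng.sample(data_by_class[cls], k_shot + q_queries)
        support += [(x, label) for x in examples[:k_shot]]
        query += [(x, label) for x in examples[k_shot:]]
    return support, query

# Toy data: 10 classes with 20 examples each.
data = {c: [f"img_{c}_{i}" for i in range(20)] for c in range(10)}
support, query = sample_episode(data)
# 5 classes x 1 shot = 5 support examples; 5 x 2 = 10 query examples.
```

Note that labels are re-indexed 0..N-1 per episode, so the meta-learner must read the support set to know what the classes mean.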
min_θ ∑_{i=1}^{T} ℒ_i(θ, 𝒟_i)
Multi-Task Learning
Solve multiple tasks 𝒯1, ⋯, 𝒯T at once.
Transfer Learning
Solve target task 𝒯b after solving source task 𝒯a, by transferring knowledge learned from 𝒯a
The Meta-Learning Problem
Given data from 𝒯1, …, 𝒯n, quickly solve new task 𝒯test
In all settings: tasks must share structure.
In transfer learning and meta-learning: generally impractical to access data from prior tasks
Problem Settings Recap
Plan for Today
Transfer Learning - Problem formulation - Fine-tuning
Meta-Learning - Problem formulation - General recipe of meta-learning algorithms - Black-box adaptation approaches - Case study of GPT-3 (time-permitting)
One View on the Meta-Learning Problem
Supervised Learning: Inputs: x, Outputs: y, Data: 𝒟 = {(x, y)}
Meta Supervised Learning: Inputs: 𝒟i^tr, x^ts, Outputs: y^ts, Data: {𝒟i}
Why is this view useful? Reduces the meta-learning problem to the design & optimization of f.
Finn. Learning to Learn with Gradients. PhD Thesis. 2018
𝒟tr
General recipe: How to design a meta-learning algorithm
1. Choose a form of p(ϕi | 𝒟i^tr, θ)
2. Choose how to optimize meta-parameters θ w.r.t. max-likelihood objective using meta-training data
Plan for Today
Transfer Learning - Problem formulation - Fine-tuning
Meta-Learning - Problem formulation - General recipe of meta-learning algorithms - Black-box adaptation approaches - Case study of GPT-3 (time-permitting)
Black-Box Adaptation
Key idea: Train a neural network fθ (the "learner") to represent ϕi = fθ(𝒟i^tr)
Train with standard supervised learning!
max_θ ∑_{𝒯i} ℒ(fθ(𝒟i^tr), 𝒟i^test) = max_θ ∑_{𝒯i} ∑_{(x,y)∼𝒟i^test} log g_ϕi(y|x)
Predict test points with y^ts = g_ϕi(x^ts)
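In code, this is a set-to-vector network: fθ reads the entire support set 𝒟i^tr and emits ϕi, which then parameterizes the predictor g. A tiny numpy sketch; all shapes and the particular encoder/predictor forms are illustrative, not the canonical architecture.

```python
import numpy as np

rng = np.random.default_rng(0)
theta = rng.standard_normal((6, 8)) * 0.1   # params of f_theta

def f_theta(theta, D_tr):
    """Encode each (x, y) pair, aggregate over the set, output phi_i."""
    pairs = np.array([np.concatenate([x, y]) for x, y in D_tr])  # (k, 6)
    encoded = np.tanh(pairs @ theta)                             # (k, 8)
    return encoded.mean(axis=0)          # phi_i: set-level summary

def g_phi(phi_i, x_ts):
    """phi_i parameterizes a linear predictor over the 4-dim input."""
    W = phi_i[:4].reshape(1, 4)
    b = phi_i[4]
    return (x_ts @ W.T + b).ravel()      # y_ts predictions

D_tr = [(rng.standard_normal(4), rng.standard_normal(2)) for _ in range(3)]
phi = f_theta(theta, D_tr)               # read the whole support set
y_ts = g_phi(phi, rng.standard_normal((5, 4)))
```

The key structural point: ϕi is a function of the dataset, not of any single example, so adaptation to a task happens in a single forward pass.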
Black-Box Adaptation
Key idea: Train a neural network to represent ϕi = fθ(𝒟i^tr).
1. Sample task 𝒯i (or mini-batch of tasks)
2. Sample disjoint datasets 𝒟i^tr, 𝒟i^test from 𝒟i
3. Compute ϕi ← fθ(𝒟i^tr)
4. Update θ using ∇θ ℒ(ϕi, 𝒟i^test)
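A toy end-to-end run of these four steps. Everything is deliberately simplified for illustration: tasks are constant functions, fθ is a single scalar meta-parameter, and the gradient in step 4 is written out by hand rather than via autodiff.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy 1-D regression tasks: task i is the constant function y = a_i.
# f_theta summarizes the support set as phi_i = theta * mean(y_tr),
# and g_phi predicts the constant phi_i for every query point.
def sample_task():
    a = rng.uniform(-1, 1)
    D_tr = a + 0.01 * rng.standard_normal(3)     # support labels
    D_test = a + 0.01 * rng.standard_normal(3)   # disjoint query labels
    return D_tr, D_test

theta = 0.0   # meta-parameter; the optimum here is theta = 1
lr = 0.5
for step in range(200):
    D_tr, D_test = sample_task()                       # steps 1-2
    phi = theta * D_tr.mean()                          # step 3
    grad = np.mean(2 * (phi - D_test) * D_tr.mean())   # d(MSE)/d(theta)
    theta -= lr * grad                                 # step 4
```

After meta-training, θ is close to 1, i.e. the learned rule is "predict the mean of the support labels," which is exactly the right adaptation procedure for this task family.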
Black-Box Adaptation
Challenge: Outputting all neural net parameters does not seem scalable?
Idea: Do not need to output all parameters of neural net, only sufficient statistics (Santoro et al. MANN, Mishra et al. SNAIL)
low-dimensional vector hi represents contextual task information
ϕi = {hi, θg}
general form: y^ts = fθ(𝒟i^tr, x^ts)
recall: Key idea: Train a neural network to represent ϕi = fθ(𝒟i^tr).
What architecture should we use for fθ?
Black-Box Adaptation Architectures
LSTMs or Neural Turing machine (NTM): Meta-Learning with Memory-Augmented Neural Networks. Santoro, Bartunov, Botvinick, Wierstra, Lillicrap. ICML '16
Feedforward + average: Conditional Neural Processes. Garnelo, Rosenbaum, Maddison, Ramalho, Saxton, Shanahan, Teh, Rezende, Eslami. ICML '18
Other external memory mechanisms: Meta Networks. Munkhdalai, Yu. ICML '17
Convolutions & attention: A Simple Neural Attentive Meta-Learner. Mishra, Rohaninejad, Chen, Abbeel. ICLR '18
Question: Why might feedforward+average be better than a recurrent model?
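One answer: averaging per-example encodings is permutation-invariant, so the order in which support examples arrive cannot change the task representation, whereas an RNN's hidden state generally depends on order. A minimal numpy sketch (the random encoder weights are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.standard_normal((4, 8))   # hypothetical encoder weights

def encode_set(D, W):
    """Encode each example with a feedforward layer, then average."""
    return np.tanh(np.asarray(D) @ W).mean(axis=0)

D = rng.standard_normal((5, 4))   # a 5-example support set
h1 = encode_set(D, W)
h2 = encode_set(D[::-1], W)       # same set, reversed order
# h1 == h2: the aggregated representation ignores example order,
# matching the fact that a dataset is a set, not a sequence.
```

Averaging also handles variable support-set sizes for free, at the cost of some expressiveness relative to attention over the set.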
HW 1: - implement data processing - implement simple black-box meta-learner - train few-shot Omniglot classifier
Let's run through an example
Omniglot dataset (Lake et al. Science 2015): the "transpose" of MNIST
many classes, few examples: 1623 characters from 50 different alphabets
20 instances of each character; statistics more reflective of the real world
More few-shot image recognition datasets: tieredImageNet, CIFAR, CUB, CelebA, ORBIT, others
More benchmarks: molecular property prediction (Nguyen et al. '20), object pose prediction (Yin et al. ICLR '20), channel coding (Li et al. '21)
*whiteboard*
Black-Box Adaptation
Key idea: Train a neural network to represent ϕi = fθ(𝒟i^tr).
+ expressive
+ easy to combine with variety of learning problems (e.g. SL, RL)
- complex model with complex task: challenging optimization problem
- often data-inefficient
How else can we represent ϕi = fθ(𝒟i^tr)?
Next time (Wednesday): What if we treat it as an optimization procedure?
Plan for Today
Transfer Learning - Problem formulation - Fine-tuning
Meta-Learning - Problem formulation - General recipe of meta-learning algorithms - Black-box adaptation approaches - Case study of GPT-3 (time-permitting)
Case Study: GPT-3
May 2020
What is GPT-3?
black-box meta-learner trained on language generation tasks
architecture: giant "Transformer" network with 175 billion parameters, 96 layers, and a 3.2M-token batch size
𝒟i^tr: a sequence of characters; 𝒟i^ts: the following sequence of characters
[meta-training] dataset: crawled data from the internet, English-language Wikipedia, two books corpora
a language model
What do different tasks correspond to? - spelling correction - simple math problems - translating between languages - a variety of other tasks
How can those tasks all be solved by a single architecture? Put them all in the form of text!
spelling correction, simple math problems, translating between languages
Why is that a good idea? Very easy to get a lot of meta-training data.
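Putting tasks in the form of text means 𝒟i^tr becomes a string of in-context examples and x^ts becomes a prompt for the model to complete. A hypothetical formatting helper (the `->` convention is made up for illustration; GPT-3 prompts are free-form text):

```python
def build_prompt(d_tr, x_ts):
    """Render a support set as in-context examples, then append the
    query input for the language model to complete."""
    lines = [f"{x} -> {y}" for x, y in d_tr]
    lines.append(f"{x_ts} ->")
    return "\n".join(lines)

# Simple math problems posed as text completion:
prompt = build_prompt([("2+2", "4"), ("3+5", "8")], "4+7")
# prompt:
# 2+2 -> 4
# 3+5 -> 8
# 4+7 ->
```

Any task that can be rendered this way becomes a few-shot episode for the model, with no parameter updates at test time.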
Some Results
One-shot learning from dictionary definitions:
Few-shot language editing:
Non-few-shot learning tasks:
General Notes & Takeaways
The results are extremely impressive.
The model fails in unintuitive ways.
Source: https://lacker.io/ai/2020/07/06/giving-gpt-3-a-turing-test.html
Source: https://github.com/shreyashankar/gpt3-sandbox/blob/master/docs/priming.md
The choice of 𝒟i^tr at test time is important ("priming").
The model is far from perfect.
Plan for Today
Transfer Learning - Problem formulation - Fine-tuning
Meta-Learning - Problem formulation - General recipe of meta-learning algorithms - Black-box adaptation approaches - Case study of GPT-3 (time-permitting)
Goals for by the end of lecture: - Differences between multi-task learning, transfer learning, and meta-learning problems - Basics of transfer learning via fine-tuning - Training set-up for few-shot meta-learning algorithms - How to implement black-box meta-learning techniques
Topic of Homework 1!
Reminders
Next time: Optimization-based meta-learning
Homework 1 to be posted today, due Wednesday, October 6
Optional Homework 0 due tonight.
Azure guide to be posted today