UCL x DeepMind lecture series
Transcript of UCL x DeepMind lecture series
In this lecture series, leading research scientists from the leading AI research lab DeepMind will give 12 lectures on an exciting selection of topics in Deep Learning, ranging from the fundamentals of training neural networks, via advanced ideas around memory, attention, and generative modelling, to the important topic of responsible innovation.
Please join us for a deep dive lecture series into Deep Learning!
#UCLxDeepMind
WELCOME TO THE UCL x DeepMind lecture series
TODAY’S SPEAKER
Felix Hill
Felix Hill has been a Research Scientist at DeepMind since 2016. He did his undergraduate and master's degrees in pure maths at the University of Oxford, and his PhD in Computer Science at the University of Cambridge. His PhD focused on unsupervised learning from text in neural networks. Since joining DeepMind, he has worked on abstract and relational reasoning, and on situated models of human language learning and usage, primarily in simulated environments.
TODAY’S LECTURE
Deep Learning for Language Understanding
Since the late 1970s, a key motivation for research into artificial neural networks has been their potential as models of the highly context-dependent nature of semantic cognition. Today, their intrinsically parallel, interactive and context-dependent processing allows deep nets to power many state-of-the-art language technology applications. In this lecture we explain why neural networks can be such an effective tool for modelling language, focusing on the Transformer, a recently proposed architecture for sequence processing. We also dive into recent research on embodied neural-network 'agents' that can relate language to the world around them.
UCL x DeepMind Lectures
Deep Learning for Language Understanding
Felix Hill
Plan for this lecture

01 Background: deep learning and language
02 The Transformer
03 Unsupervised and transfer learning with BERT
04 Grounded language learning at DeepMind: towards language understanding in a situated agent

Content: High-level plan of the lecture
Goal: Prepare people for the overall structure; make sure they know what to expect
What is not covered in this lecture

01 Recurrent networks
- Seq-to-seq models and neural MT
- Speech recognition or synthesis

02 Many NLP tasks and applications
- Machine comprehension
- Question answering
- Dialogue

03 Grounding in image/video
- Visual question-answering or captioning
- CLEVR and visual reasoning
- Video captioning
Content: Neural networks in the wild
Goal: Provide people with a slightly broader perspective, and also a sneak peek into future lectures
1 Background: Deep learning and language

Extra notes/ideas: I removed most of the "explicit" branding; only the first slide has the DeepMind name on it, everything else has a small logo in the corner. Also no "confidential" info.
Content: Various successes
Goal: Provides a bit of extra excitement, and makes sure people see applications before getting into technicalities
Machine translation
Speech recognition
Question answering
Speech synthesis
Summarization
Text classification
Search / information retrieval
Dialogue systems
Home assistants

(ranging from very neural to less / non-neural)
Why is Deep Learning such an effective tool for language processing?
Can it be improved?
Some things about language...
1. Words are not discrete symbols
● Did you see the look on her face1 ?
● We could see the clock face2 from below
● It could be time to face3 his demons
● There are a few new faces4 in the office today
Face 1
● The most important side (of the head)
● Represents you / yourself
● Used to inform / communicate
● Points forward when you address/confront something

Face 2
● The most important side (of the head)
● Used to inform / communicate

Face 3
● Points forward when you address/confront something

Face 4
● Represents you / yourself
1. Words are not discrete symbols
2. Disambiguation depends on context
Rumelhart, D. E. (1976). Toward an interactive model of reading
1. Words are not discrete symbols
2. Disambiguation depends on context
3. Important interactions can be non-local
The man who ate the pepper sneezed
But note: The cat who bit the dog barked
1. Words are not discrete symbols
2. Disambiguation depends on context
3. Important interactions can be non-local
4. How meanings combine depends on those meanings
PET: brown, white, black
FISH: silver, grey
PET FISH: orange, green, blue, purple, yellow
PLANT: green, leaves, grows
CARNIVORE: eats meat, sharp teeth
CARNIVOROUS PLANT: green, leaves, grows, sharp teeth, eats insects
1. Words have many related senses
2. Disambiguation depends on context
3. Important interactions can be non-local
4. How meanings combine depends on those meanings
2 The Transformer

Vaswani et al. 2017
Distributed representations of words
Input: the girl smiled → embeddings e_the, e_girl, e_smiled
Vocabulary size = V units (input layer)
Embedding dimension = D units
The input layer has V x D weights
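A minimal numpy sketch of the V x D input layer described above. The sizes and the word id are purely illustrative, not taken from the lecture.

```python
import numpy as np

V, D = 10_000, 512               # illustrative vocabulary size and embedding dimension
rng = np.random.default_rng(0)
E = rng.normal(size=(V, D))      # the V x D input weight matrix (embedding table)

# A word is a one-hot vector over the vocabulary; multiplying it by E
# selects a single row of E, so in practice we index instead of multiplying.
word_id = 42                     # hypothetical id for "girl"
one_hot = np.zeros(V)
one_hot[word_id] = 1.0
assert np.allclose(one_hot @ E, E[word_id])
```

The lookup view makes clear why the input layer costs V x D parameters: one D-dimensional row per vocabulary item.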
The Transformer builds on solid foundations
Miikkulainen & Dyer, 1991
Elman, Finding Structure in Time, 1990
Emergent semantic and syntactic structure in distributed representations
Early contextual word embeddings
The Transformer: self-attention over word input embeddings
Example input: the beetle drove off
From the embedding e_beetle, three vectors are computed:
q = e_beetle W_q
k = e_beetle W_k
v = e_beetle W_v
Attention weights λ1, λ2, λ3, λ4 are computed by comparing the query for 'beetle' with the keys of each word in the input; the output v' is the λ-weighted sum of the value vectors.

Compute self-attention for all words in the input (in parallel).
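The computation above can be sketched in a few lines of numpy: project the embeddings to queries, keys and values, turn the query-key dot products into weights λ via a softmax, and take the λ-weighted sum of the values. The sizes (T = 4 words, as in "the beetle drove off") are illustrative.

```python
import numpy as np

def self_attention(E, Wq, Wk, Wv):
    """Single-head self-attention over a sequence of input embeddings.

    E: (T, D) word embeddings; Wq, Wk, Wv: (D, Dh) projection matrices.
    Returns (T, Dh): for each position, a weighted sum of value vectors.
    """
    Q, K, V = E @ Wq, E @ Wk, E @ Wv
    scores = Q @ K.T / np.sqrt(K.shape[-1])       # query-key dot products
    lam = np.exp(scores - scores.max(-1, keepdims=True))
    lam = lam / lam.sum(-1, keepdims=True)        # softmax -> weights λ
    return lam @ V                                # λ-weighted sum of values

rng = np.random.default_rng(0)
T, D, Dh = 4, 8, 8                                # e.g. "the beetle drove off"
E = rng.normal(size=(T, D))
out = self_attention(E, *(rng.normal(size=(D, Dh)) for _ in range(3)))
assert out.shape == (T, Dh)
```

Because every position attends to every other, context can flow between arbitrarily distant words in a single step.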
Multi-head self-attention (H = 4)
Each head has its own W_q, W_k, W_v and operates on D / 4 units. The outputs of the four heads are concatenated (D units) and projected by W_O back to D units.
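A small numpy sketch of the multi-head computation, with illustrative sizes (T = 4 words, D = 8, H = 4 heads, so each head works in D / H = 2 units; the slide's D / 4).

```python
import numpy as np

rng = np.random.default_rng(0)
T, D, H = 4, 8, 4            # sequence length, model width, number of heads
Dh = D // H                  # units per head (D / 4 on the slide)
E = rng.normal(size=(T, D))

def attend(E, Wq, Wk, Wv):
    # Single head: softmax over query-key dot products, weighted sum of values.
    Q, K, V = E @ Wq, E @ Wk, E @ Wv
    s = Q @ K.T / np.sqrt(Dh)
    lam = np.exp(s - s.max(-1, keepdims=True))
    return (lam / lam.sum(-1, keepdims=True)) @ V

# Each head has its own Wq, Wk, Wv projecting D -> D/H units.
heads = [attend(E, *(rng.normal(size=(D, Dh)) for _ in range(3))) for _ in range(H)]
concat = np.concatenate(heads, axis=-1)   # back to D units
W_O = rng.normal(size=(D, D))             # output projection
out = concat @ W_O
assert out.shape == (T, D)
```

Separate heads let the model attend to several different aspects of the context (e.g. different senses or relations) at once.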
Feedforward layer
The same two-layer network (W1, ReLU, W2) is applied independently at each position of 'the beetle drove off'.
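As a sketch, the position-wise feedforward layer is just one matrix multiply, a ReLU, and a second matrix multiply, broadcast over positions; the widths here are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
T, D, Dff = 4, 8, 32                     # illustrative sequence length and widths
X = rng.normal(size=(T, D))              # one row per input position
W1 = rng.normal(size=(D, Dff))
W2 = rng.normal(size=(Dff, D))

# The same W1 / ReLU / W2 network is applied at every position independently.
out = np.maximum(X @ W1, 0.0) @ W2
assert out.shape == (T, D)
```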
A complete Transformer block (input: the beetle drove off):
1. Multi-head self-attention
2. + (skip connection), Layer Norm
3. Feedforward layer
4. + (skip connection), Layer Norm

Skip-connections: for "top-down" influences.
The Transformer: position encoding of words
● A fixed quantity is added to the activations of each input embedding (e.g. e_beetle)
● The quantity added to each input embedding unit ∈ [-1, 1] depends on:
  ○ the dimension of the unit within the embedding
  ○ the (absolute) position of the word in the input
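The standard way to realise this, from the original Transformer paper, is a sinusoidal encoding; a compact numpy sketch with illustrative sizes:

```python
import numpy as np

def positional_encoding(T, D):
    """Sinusoidal position encodings (Vaswani et al. 2017): every value lies
    in [-1, 1] and depends on the word's position and on the dimension index."""
    pos = np.arange(T)[:, None]                 # position of the word in the input
    i = np.arange(D // 2)[None, :]              # dimension index within the embedding
    angles = pos / (10000 ** (2 * i / D))
    pe = np.zeros((T, D))
    pe[:, 0::2] = np.sin(angles)                # even dimensions
    pe[:, 1::2] = np.cos(angles)                # odd dimensions
    return pe

pe = positional_encoding(T=4, D=8)
assert pe.shape == (4, 8)
# The encoding is simply added to the word embeddings: E + pe
```

Without it, self-attention is permutation-invariant and could not tell "the beetle drove off" from a shuffled version of the same words.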
1. Words are not discrete symbols → Distributed representations
2. Disambiguation depends on context → Self-attention
3. Important interactions can be non-local → Self-attention, Multiple layers
4. How meanings combine depends on those meanings → Multi-head processing, Skip connections, Distributed representations
3 Unsupervised Learning With Transformers (BERT)
Time flies like an arrow
Fruit flies like a banana
Fido likes having his tummy rubbed
Grandma likes a good cuppa
John works like a trojan
The trains run like clockwork
1. Words have many related senses
2. Disambiguation depends on context
3. Relevant context can be non-local
4. 'Composition' depends on what words mean
5. Understanding is balancing input with knowledge
sucking up knowledge from words ...
BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding (Devlin et al. 2019)
Masked language model pretraining: BERT (Devlin et al. 2019)
sucking up _[MASK]_ from words ... → knowledge
15% of words are masked: predict them!
In 10% of these instances, the word is left in place.
Next sentence prediction pretraining: BERT (Devlin et al. 2019)
_[CLS]_ Sid went outside . _[SEP]_ It began to rain . ✅
_[CLS]_ Sid went outside . _[SEP]_ Unfortunately it wasn't . ❌
BERT pretraining: _[CLS]_ Masses of text . _[SEP]_ From the internet .
BERT fine-tuning: _[CLS]_ A small amount of . _[SEP]_ Task-specific data
BERT supercharges transfer learning
![Page 84: UCL x DeepMind lecture series](https://reader033.fdocuments.net/reader033/viewer/2022052013/6285e8fa18c0fb41315aede5/html5/thumbnails/84.jpg)
1. Words have many related senses
2. Disambiguation depends on context
3. Relevant context can be non-local
4. 'Composition' depends on what words mean
5. Understanding is balancing input with knowledge → Transfer with unsupervised learning
4 Extracting language-relevant knowledge from the environment
Building a multi-sensory understanding of the world
Language Vision Actions
Peters, Matthew E., et al. "Deep contextualized word representations." arXiv:1802.05365 (2018).
Devlin, Jacob, et al. "BERT: Pre-training of deep bidirectional transformers for language understanding." arXiv:1810.04805 (2018).
Pathak, Deepak, et al. "Context encoders: Feature learning by inpainting." CVPR 2016.
Pathak, Deepak, et al. "Learning features by watching objects move." CVPR 2017.
Wayne, Greg, et al. "Unsupervised predictive memory in a goal-directed agent." arXiv:1803.10760 (2018).
Ha, David, and Jürgen Schmidhuber. "World models." arXiv:1803.10122 (2018).
http://jalammar.github.io/illustrated-bert
![Page 88: UCL x DeepMind lecture series](https://reader033.fdocuments.net/reader033/viewer/2022052013/6285e8fa18c0fb41315aede5/html5/thumbnails/88.jpg)
Knowledge aggregation from prediction
- Helmholtz
- William James
- P. Elias: Predictive coding
- H. Barlow: Cortex as a model builder
- U. Neisser: Analysis by synthesis
- McClelland and Rumelhart: Interactive activation model
- Rao and Ballard
- Dayan and Hinton: Helmholtz machine
- J. Elman: Finding structure in time
Questions to diagnose knowledge acquisition
Predictive agents
Results
Oracle:
● Without stop-gradient
Baselines:
● Question only
● LSTM
Predictive models:
● SimCore
● CPC|A
To conclude
![Page 95: UCL x DeepMind lecture series](https://reader033.fdocuments.net/reader033/viewer/2022052013/6285e8fa18c0fb41315aede5/html5/thumbnails/95.jpg)
1. Words have many related senses
2. Disambiguation depends on context
3. Relevant context can be non-local and non-linguistic
4. 'Composition' depends on what words mean
5. Understanding requires background knowledge... not always from language
Addressed by:
Transformers
Self-supervised / unsupervised learning
Embodied learning
Understanding intentions
Social learning
Event cognition
Fast-mapping
Goal-directed dialogue
The 'pipeline' view of language processing:
letters → words → syntax → meaning → prediction/response
An alternative schematic model of language processing:
External stimulus (letters / sounds, words, syntax) and the knowledge we have (context, general knowledge / model of the world) interact to produce meaning and a prediction/response.
See e.g. McClelland et al. 2019
Selected references

Early treatment of distributed representations in neural language models:
Miikkulainen, Risto, and Michael G. Dyer. "Natural language processing with modular PDP networks and distributed lexicon." Cognitive Science 1991.

The Transformer architecture:
Vaswani, Ashish, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, Illia Polosukhin. "Attention Is All You Need." NeurIPS 2017.

BERT:
Devlin, Jacob, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. "BERT: Pre-training of deep bidirectional transformers for language understanding." NAACL 2019.

Embodied language learning at DeepMind:
Hill et al. "Environmental Drivers of Generalization and Systematicity in the Situated Agent." ICLR 2020.
Hill et al. "Robust Instruction-Following in a Situated Agent via Transfer Learning from Text." Under review.
Carnevale et al. "Probing emergent semantic knowledge in predictive agents via question answering." Under review.
Thank you