The Uphill Battles
Transcript of The Uphill Battles
The Uphill Battles of Doing NLP without Psychology and Theoretical Machine Learning
7th Swedish Language Technology Conference
Anders Søgaard, University of Copenhagen, Department of Computer Science
NLP WHEN I STARTED vs. NLP TODAY
NLP TODAY = NLP WHEN I STARTED + NAÏVE EMPIRICISM + LESS INTUITIVE METHODS
Outline
Three examples of psychologically naïve NLP, and two examples of machine learningly naïve NLP:
1 The Dot Plot: Syntactic parsing
2 The Bad Fortune Speller: Word recognition
3 Attention without Attention: Sentiment
4 Flavors of Failure: Dictionary induction
5 Flavors of Success: Multi-modal MT
[Slide banner: MODELS / INTERPRETATION]
Psychologically naïve NLP
Ur Example
Gold (1967) showed that even regular languages are unlearnable from finite samples.
That is, in the sense of inducing the exact grammar that generated the observed strings.
Problem: We observe more than strings, the strings are generated by many grammars, and there is no need to induce identical grammars.
Gold (1967). Language Identification in the Limit. Information and Control 10(5): 447-474.
The Dot Plot
Modern parsers are trained to analyze sentences written by journalists.
[Pierre Vinken]NP , [61 years old]ADJP , [will join the board]VP .
Parsers learn to rely on near-perfect punctuation, a give-away of the syntactic analysis.
Poor performance on texts with non-standard punctuation.
Humans are not sensitive to non-standard punctuation (Baldwin and Coady, 1978).
Hypothesis: Punctuation prevents parsers from learning deeper generalisations.
Baldwin and Coady (1978). Psycholinguistic approaches to a theory of punctuation. Journal of Literacy Research 10(4): 363-376.
Søgaard, de Lhoneux & Augenstein (2018). Nightmare at Test Time. BlackBox@EMNLP.
[Bar chart: labeled attachment scores (50-100) for UUParser and MaltParser under punctuation perturbation conditions d=0,c=0; nopunct; d=.01,c=.01; d=.01,c=.05; d=.05,c=.01; d=.05,c=.05; d=.1,c=.1, with minimum and maximum scores per condition; NOPUNCT baseline at 89.8.]
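To make the perturbation setup concrete, here is a minimal Python sketch of punctuation corruption of the kind the conditions above suggest; reading d as a punctuation-deletion rate and c as a comma-insertion rate is an assumption, not necessarily the published scheme.

    import random
    import string

    def perturb_punctuation(tokens, d=0.05, c=0.05, seed=0):
        """Sketch: randomly delete punctuation tokens with probability d and
        insert spurious commas with probability c (assumed reading of the
        d/c conditions above; the published perturbations may differ)."""
        rng = random.Random(seed)
        out = []
        for tok in tokens:
            if tok in string.punctuation and rng.random() < d:
                continue  # drop this punctuation mark
            out.append(tok)
            if rng.random() < c:
                out.append(",")  # inject a spurious comma
        return out

    print(perturb_punctuation("Pierre Vinken , 61 years old , will join the board .".split(), d=0.5, c=0.1))

Training on clean treebank text and testing on such perturbed versions (or vice versa) is one way to check whether a parser has learned to lean on punctuation.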
The Bad Fortune Speller
Modern NLP is often trained on sentences written by journalists.
This includes near-perfect spelling, making word recognition easy.
Poor performance on texts with non-standard spelling.
Huamns are not snesitvie to non-stnadard spellign (Forster et al., 1987).
Hypothesis: Near-perfect spelling prevents models from learning deeper generalisations.
Didn't character-based RNNs solve this?
[Bar chart: POS tagging and MT scores under scrambling/noise conditions s=0,f=0; s=0.1/0.05,f=0; s=0,f=0.1/0.05; performance drops clearly under both perturbations.]
(Semi-character RNNs)
Forster et al. (1987). Masked priming with graphemically related words. The Quarterly Journal of Experimental Psychology 32(4): 211-251.
Heigold et al. (2018). How robust are character-based word embeddings in tagging and MT against wrod scramlbing or random nouse? AMTA.
Sakaguchi et al. (2017). Robsut Wrod Reocginiton via semi-Character RNN. AAAI.
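The semi-character idea can be sketched in a few lines of Python, assuming the first/internal-bag/last decomposition described by Sakaguchi et al. (2017); the actual model feeds such vectors to an RNN tagger.

    from collections import Counter

    ALPHABET = "abcdefghijklmnopqrstuvwxyz"

    def semi_character_vector(word):
        """Sketch of a semi-character representation: one-hot first character,
        bag-of-characters for the middle, one-hot last character. Scrambling
        the internal characters leaves the vector unchanged."""
        word = word.lower()
        first = [1.0 if ch == word[0] else 0.0 for ch in ALPHABET]
        last = [1.0 if ch == word[-1] else 0.0 for ch in ALPHABET]
        middle_counts = Counter(word[1:-1])
        middle = [float(middle_counts[ch]) for ch in ALPHABET]
        return first + middle + last

    # "snesitvie" and "sensitive" share first/last letters and the internal bag:
    assert semi_character_vector("snesitvie") == semi_character_vector("sensitive")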
Attention without Attention
LSTMs with attention functions are popular models.
Attention functions are latent variables with no direct supervision.
Problem: Attention functions are prone to over-fitting (Rei and Søgaard, 2018).
Gaze data reflects word-level human attention, or relevance (Loboda et al., 2011).
Hypothesis: Gaze can be used to regularize neural attention (Barrett et al., 2018).
[Bar chart: classification scores on Sentiment, Error detection, and Abusive language tasks for learned attention vs. human (gaze-regularized) attention; the gaze-regularized models score slightly higher on all three tasks.]
Rei & Søgaard (2018). Zero-shot sequence labeling. NAACL.
Loboda et al. (2011). Inferring word relevance from eye-movements of readers. IUI.
Barrett, Bingel, Hollenstein, Rei & Søgaard (2018). Sentence classification with human attention. CoNLL.
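One simple way to operationalize the hypothesis is an auxiliary loss that pulls the model's attention weights toward normalized fixation durations; the sketch below illustrates that general idea and is not the multi-task setup actually used by Barrett et al. (2018).

    import numpy as np

    def gaze_attention_loss(attention, fixation_durations, eps=1e-8):
        """Sketch: cross-entropy between normalized human fixation durations
        over the tokens of a sentence and the model's attention distribution."""
        gaze = np.asarray(fixation_durations, dtype=float)
        gaze = gaze / (gaze.sum() + eps)      # turn fixations into a distribution
        att = np.asarray(attention, dtype=float)
        att = att / (att.sum() + eps)
        return -float(np.sum(gaze * np.log(att + eps)))

    # total_loss = task_loss + lambda_gaze * gaze_attention_loss(att_weights, fixations)
    print(gaze_attention_loss([0.1, 0.7, 0.2], [120, 480, 200]))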
Summary
Human reading is insensitive to punctuation.
Human reading is insensitive to (a lot of) spelling variation.
Human attention during reading is partial and systematic.
Takeaways: Ignore punctuation. Use semi-character RNNs. Regularize attention.
Machine learningly naïve NLP
Flavors of Failure
Conneau et al. (2018) show that GANs with linear generators can align cross-lingual embedding spaces.
Søgaard et al. (2018) show their approach is challenged by some language pairs.
Conneau et al. (2018). Word translation without parallel data. ICLR.
Søgaard et al. (2018). On the limitations of unsupervised bilingual dictionary induction. ACL.
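The adversarial setup can be sketched in a few lines of PyTorch, assuming two pre-trained monolingual embedding matrices X and Y as float tensors; the refinements of the actual MUSE system (orthogonalization, Procrustes, CSLS retrieval) are left out.

    import torch
    import torch.nn as nn

    def adversarial_align(X, Y, dim=300, steps=1000, lr=0.1, batch=32):
        """Sketch of adversarial alignment with a linear generator: W maps source
        vectors into the target space while a discriminator tries to tell W(x)
        from real target vectors y."""
        W = nn.Linear(dim, dim, bias=False)                                   # linear generator
        D = nn.Sequential(nn.Linear(dim, 256), nn.ReLU(), nn.Linear(256, 1))  # discriminator
        opt_w = torch.optim.SGD(W.parameters(), lr=lr)
        opt_d = torch.optim.SGD(D.parameters(), lr=lr)
        bce = nn.BCEWithLogitsLoss()
        for _ in range(steps):
            x = X[torch.randint(len(X), (batch,))]
            y = Y[torch.randint(len(Y), (batch,))]
            # 1) train the discriminator to separate mapped source from target vectors
            d_loss = bce(D(W(x).detach()), torch.ones(batch, 1)) + bce(D(y), torch.zeros(batch, 1))
            opt_d.zero_grad(); d_loss.backward(); opt_d.step()
            # 2) train the generator to fool the discriminator
            g_loss = bce(D(W(x)), torch.zeros(batch, 1))
            opt_w.zero_grad(); g_loss.backward(); opt_w.step()
        return W.weight.detach()   # the learned source-to-target map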
[Bar chart, 0-90 scale: bilingual dictionary induction scores for English paired with es, et, fi, el, hu, pl and tr; around 82 for Spanish, with several of the other languages at or near zero.]
It's the morphology, stupid!
Problem: How do we know when to blame morphology?
Chocolate consumption correlates with Nobel laureates.
Morphology correlates with siestas (Roberts and Winters, 2013).
Geographical diffusion caused by borrowing and common ancestors (Galton's problem: https://en.wikipedia.org/wiki/Galton%27s_problem).
Control for it?
Roberts & Winters (2013). Linguistic Diversity and Traffic Accidents. PLOS ONE.
Hartmann et al. (2018) show failure cases even for English-English alignment.
Hartmann et al. (2019) show the loss landscapes of these cases.
[Figure: discriminator performance over the loss landscape, contrasting a local minimum with the global minimum.]
Diagnostic for loss functions based on Wasserstein or Sinkhorn.
Hartmann et al. (2018). Why is unsupervised alignment of English embeddings from different algorithms so hard? EMNLP.
Hartmann et al. (2019). Alignability of word vector spaces. arXiv.
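As an illustration of what such a diagnostic might involve, here is a small NumPy sketch of an entropic-regularized (Sinkhorn) transport cost between two samples of length-normalized word vectors; a high cost after the best attainable alignment suggests the spaces are hard to align. The actual diagnostic in Hartmann et al. (2019) may be defined differently.

    import numpy as np

    def sinkhorn_cost(X, Y, reg=0.1, n_iters=200):
        """Sketch: entropic-regularized optimal-transport cost between two point
        clouds of word vectors, computed with Sinkhorn-Knopp iterations."""
        X = X / np.linalg.norm(X, axis=1, keepdims=True)
        Y = Y / np.linalg.norm(Y, axis=1, keepdims=True)
        C = 1.0 - X @ Y.T                            # cosine distance cost matrix
        K = np.exp(-C / reg)
        a = np.full(X.shape[0], 1.0 / X.shape[0])    # uniform source weights
        b = np.full(Y.shape[0], 1.0 / Y.shape[0])    # uniform target weights
        u = np.ones_like(a)
        for _ in range(n_iters):
            v = b / (K.T @ u)
            u = a / (K @ v)
        P = np.diag(u) @ K @ np.diag(v)              # transport plan
        return float((P * C).sum())

    rng = np.random.default_rng(0)
    print(sinkhorn_cost(rng.normal(size=(100, 50)), rng.normal(size=(100, 50))))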
Flavors of Success
Caglayan et al. (2017) use an attentive encoder-decoder for image-aware MT.
Example caption: Man with Mardi Gras beads around his neck holding pole with banner.
They report a 1.2-point METEOR improvement from adding related images.
… but in what sense is the system aware of the related image?
Elliott (2018) shows that pairing texts with random images obtains the same improvements.
Hypothesis: The images simply help the optimizer.
Rationale: Over-parameterisation and skip connections help the optimizer.
Caglayan et al. (2017). LIUM-CVC Submissions for WMT17 Multimodal Translation Task. WMT.
Elliott (2018). Adversarial evaluation of multimodal machine translation. EMNLP.
Li et al. (2017). Visualizing the loss landscapes of neural nets. ICLR.
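The adversarial evaluation can be sketched as a simple harness: score the system once with the congruent images and once with randomly shuffled ones, and inspect the gap. Here score_fn, sentences and image_features are hypothetical placeholders for a real multimodal MT system and a metric such as METEOR.

    import random

    def awareness_gap(score_fn, sentences, image_features, seed=0):
        """Sketch of an Elliott (2018)-style congruence test: a gap near zero
        suggests the model is not really 'aware' of the image content."""
        rng = random.Random(seed)
        shuffled = list(image_features)
        rng.shuffle(shuffled)                        # incongruent (random) image pairing
        congruent = score_fn(sentences, image_features)
        incongruent = score_fn(sentences, shuffled)
        return congruent - incongruent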
Summary
The Nobel-Chocolate Fallacy: Is it really the morphology?
Are our models really aware of external information?
Understand the limitations of models and optimizers.
In high-dimensional space, we have limited intuitions.
Bingel & Søgaard (2017), e.g., show that linguistic intuitions are a poor predictor of multi-task gains.
Bingel & Søgaard (2017). Identifying beneficial task relations for multi-task learning in deep neural networks. EACL.
What’s wrong with being naïve?
The scientific dance
Simplification ↔ Anomaly
Common simplification: Random finite samples are representative → anomalies.
Alternative simplification: Controlled/synthetic language → anomalies.
Hegel’s holiday?
Controlled language (NLP WHEN I STARTED) vs. finite samples (NLP TODAY).
NLP TOMORROW: controlled language and finite samples combined?
Questions?
Supplementary slides
John Dewey (1910)
A: “It will probably rain tomorrow.”
B: “Why do you think so?”
A: “Because the sky was lowering at sunset.”
B: “What has that to do with it?”
A: “I do not know, but it generally does rain after such a sunset.”
John Dewey (1910)
[The scientific] method of proceeding is by varying conditions one by one so far as possible, and noting just what happens when a given condition is eliminated. There are two methods for varying conditions. The first […] consists in comparing very carefully the results of a great number of observations which have occurred under accidentally different conditions. […] [This] method […] is, however, badly handicapped; it can do nothing until it is presented with a certain number of diversified cases. […] The method is passive and dependent upon external accidents. Hence the superiority of the active or experimental method. Even a small number of observations may suggest an explanation - a hypothesis or theory. Working upon this suggestion, the scientist may then intentionally vary conditions and note what happens.
• NON-SCIENTIFIC METHOD: Extract regularities from available data (Penn Treebank).
• EMPIRICAL METHOD: Get more data (Web Treebank).
• EXPERIMENTAL METHOD: Create data (punctuation injection, etc.).