The Uphill Battles

The Uphill Battles of Doing NLP without Psychology and Theoretical Machine Learning. 7th Swedish Language Technology Conference. Anders Søgaard, University of Copenhagen, Dept. of Computer Science.

Transcript of The Uphill Battles

Page 1: The Uphill Battles

The Uphill Battles of Doing NLP without Psychology and Theoretical Machine Learning

7th Swedish Language Technology Conference

Anders Søgaard, University of Copenhagen, Dept. of Computer Science

Page 2: The Uphill Battles


[Two images, labeled NLP WHEN I STARTED and NLP TODAY]

Page 3: The Uphill Battles


NLP WHEN I STARTED + NAÏVE EMPIRICISM + LESS INTUITIVE METHODS

Page 4: The Uphill Battles

Outline

Three examples of psychologically naïve NLP:

1. The Dot Plot (syntactic parsing)
2. The Bad Fortune Speller (word recognition)
3. Attention without Attention (sentiment)

Two examples of machine-learningly naïve NLP:

4. Flavors of Failure (dictionary induction)
5. Flavors of Success (multi-modal MT)

Page 5: The Uphill Battles


[Diagonal overlay across the outline: MODELS INTERPRETATION]

Page 6: The Uphill Battles

Psychologically naïve NLP

Page 7: The Uphill Battles

Gold (1967) showed that even regular languages are unlearnable from finite samples.

Gold (1967). Language Identification in the Limit. Information and Control 10(5): 447-474.

Ur Example

Page 8: The Uphill Battles


That is, in the sense of inducing the exact grammar that generated the observed strings.


Page 9: The Uphill Battles


Problem: Human learners observe more than bare strings, the strings they observe are generated by many different grammars, and successful learning does not require inducing an identical grammar.

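To make the unlearnability claim concrete, here is a minimal sketch (the sample and the two grammars are hypothetical, not from Gold's paper): any finite sample is consistent with several distinct regular languages, so the data alone cannot identify the generating grammar.

```python
import re

# A finite sample of observed strings (hypothetical).
sample = ["a", "aa", "aaa"]

# Two distinct regular languages, both consistent with the sample:
L1 = re.compile(r"a+")      # L1: one or more a's (infinite language)
L2 = re.compile(r"a{1,3}")  # L2: at most three a's (finite language)

assert all(L1.fullmatch(s) for s in sample)
assert all(L2.fullmatch(s) for s in sample)

# The grammars disagree on unseen strings, so no finite sample
# can identify which one generated the data.
print(bool(L1.fullmatch("aaaa")), bool(L2.fullmatch("aaaa")))  # True False
```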

Page 10: The Uphill Battles

The Dot Plot

Page 11: The Uphill Battles

Modern parsers are trained to analyze sentences written by journalists.

Søgaard, de Lhoneux & Augenstein. (2018). Nightmare at Test Time. BlackBox@EMNLP.


[Pierre Vinken]NP , [61 years old]ADJP , [will join the board]VP .


Page 13: The Uphill Battles


Parsers learn to rely on near-perfect punctuation, a give-away of the syntactic analysis.

Poor performance on texts with non-standard punctuation.


Page 14: The Uphill Battles


Humans are not sensitive to non-standard punctuation (Baldwin and Coady, 1978).

Hypothesis: Punctuation prevents parsers from learning deeper generalisations.

Baldwin and Coady (1978). Psycholinguistic approaches to a theory of punctuation. Journal of Literacy Research 10 (4): 363-376.
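A rough sketch of this kind of stress test (the parameter names d and c follow the chart on the next slide; the exact perturbation protocol is in Søgaard et al. 2018, and the probabilities here are illustrative): delete punctuation with probability d, insert spurious commas with probability c, then re-evaluate the parser.

```python
import random

PUNCT = {",", ".", ";", ":", "!", "?"}

def perturb_punctuation(tokens, d=0.05, c=0.05, seed=0):
    """Delete each punctuation token with probability d; insert a
    spurious comma after each word token with probability c."""
    rng = random.Random(seed)
    out = []
    for tok in tokens:
        if tok in PUNCT:
            if rng.random() < d:   # drop the punctuation mark
                continue
            out.append(tok)
        else:
            out.append(tok)
            if rng.random() < c:   # add a misleading comma
                out.append(",")
    return out

tokens = "Pierre Vinken , 61 years old , will join the board .".split()
print(" ".join(perturb_punctuation(tokens, d=0.5, c=0.1, seed=3)))
```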

Page 15: The Uphill Battles


[Bar chart: labeled attachment scores (y-axis 50-100) for UUParser and MaltParser under punctuation perturbations (d=0,c=0; nopunct; d=.01,c=.01; d=.01,c=.05; d=.05,c=.01; d=.05,c=.05; d=.1,c=.1), showing per-condition minimum and maximum scores between 67.5 and 91.8; training on NOPUNCT data reaches 89.8.]

Page 16: The Uphill Battles

The Bad Fortune Speller

Page 17: The Uphill Battles

Modern NLP is often trained on sentences written by journalists.

This includes near-perfect spelling, making word recognition easy.

Poor performance on texts with non-standard spelling.


Page 18: The Uphill Battles


Huamns are not snesitvie to non-stnadard spellign (Forster et al., 1987).

Hypothesis: Near-perfect spelling prevents models from learning deeper generalisations.

Forster et al. (1987). Masked priming with graphemically related words. The Quarterly Journal of Experimental Psychology 32(4): 211-251.
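The manipulation behind the scrambled sentence above is easy to reproduce. A minimal sketch (my reconstruction, not the scripts used in the cited studies): shuffle word-internal letters while keeping the first and last letter in place.

```python
import random

def scramble_word(word, rng):
    """Shuffle interior letters, keeping the first and last fixed."""
    if len(word) <= 3:
        return word
    middle = list(word[1:-1])
    rng.shuffle(middle)
    return word[0] + "".join(middle) + word[-1]

def scramble(sentence, seed=1):
    rng = random.Random(seed)
    return " ".join(scramble_word(w, rng) for w in sentence.split())

print(scramble("Humans are not sensitive to non-standard spelling"))
```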

Page 19: The Uphill Battles

Didn’t character-based RNNs solve this?


Page 20: The Uphill Battles


Heigold et al. (2018). How robust are character-based word embeddings in tagging and MT against wrod scramlbing or random nouse? AMTA.

[Bar chart: POS tagging and MT scores under a clean condition (s=0,f=0) and two noise conditions (s=0.1/0.05,f=0; s=0,f=0.1/0.05); POS scores range 82-94, MT scores 21.7-30.7.]

Page 21: The Uphill Battles


Sakaguchi et al. (2017). Robsut Wrod Reocginiton via semi-Character RNN. AAAI.


(Semi-character RNNs)
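The semi-character representation of Sakaguchi et al. (2017) encodes a word as a one-hot first character, an unordered bag of internal characters, and a one-hot last character, roughly the information human readers appear to rely on. A minimal sketch (lowercase ASCII only, for illustration):

```python
import string
import numpy as np

ALPHABET = string.ascii_lowercase
IDX = {ch: i for i, ch in enumerate(ALPHABET)}

def semi_character_vector(word):
    """One-hot first char + bag of internal chars + one-hot last char."""
    word = word.lower()
    first = np.zeros(len(ALPHABET)); first[IDX[word[0]]] = 1
    last = np.zeros(len(ALPHABET));  last[IDX[word[-1]]] = 1
    bag = np.zeros(len(ALPHABET))
    for ch in word[1:-1]:
        bag[IDX[ch]] += 1
    return np.concatenate([first, bag, last])

# Interior scrambling is invisible to this representation:
print(np.array_equal(semi_character_vector("sensitive"),
                     semi_character_vector("snesitvie")))  # True
```

Feeding these vectors to an RNN yields a word-recognition model that, like human readers, ignores interior letter order.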

Page 22: The Uphill Battles

Attention without Attention

Page 23: The Uphill Battles

LSTMs with attention functions are popular models.


Barrett, Bingel, Hollenstein, Rei & Søgaard (2018). Sentence classification with human attention. CoNLL.

Page 24: The Uphill Battles


Attention functions are latent variables with no direct supervision.

Problem: Attention functions are prone to over-fitting (Rei and Søgaard, 2018).


Rei & Søgaard (2018). Zero-shot sequence labeling. NAACL.

Page 25: The Uphill Battles


Gaze data reflects word-level human attention, or relevance (Loboda et al., 2011).

Hypothesis: Gaze can be used to regularize neural attention (Barrett et al., 2018).

Loboda et al. (2011). Inferring word relevance from eye-movements of readers. IUI.
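A minimal sketch of the idea (the weight `lam` and the plain MSE penalty are my assumptions; Barrett et al.'s actual multi-task setup differs in detail): add an auxiliary term that pulls the model's attention distribution toward human attention estimated from normalized fixation durations.

```python
import torch
import torch.nn.functional as F

def gaze_regularized_loss(logits, labels, attention, fixation_durations, lam=0.1):
    """Task loss plus a penalty for diverging from human attention.

    attention:          (batch, seq_len) model attention weights, rows sum to 1
    fixation_durations: (batch, seq_len) total gaze duration per token
    """
    task_loss = F.cross_entropy(logits, labels)
    human_attention = fixation_durations / fixation_durations.sum(dim=-1, keepdim=True)
    attention_loss = F.mse_loss(attention, human_attention)
    return task_loss + lam * attention_loss
```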

Page 26: The Uphill Battles


[Bar chart: scores on Sentiment, Error detection, and Abusive-language tasks, comparing learned attention with human (gaze) attention; per-task score pairs: 72.45/71.23 (Sentiment), 84.28/83.84 (Error), 52.37/50.05 (Abusive).]

Page 27: The Uphill Battles

Summary

Human reading insensitive to punctuation.

Human reading insensitive to (a lot of) spelling variation.

Human attention during reading is partial and systematic.

Page 28: The Uphill Battles

Summary

Ignore punctuation. Semi-character RNNs. Regularize attention.

Page 29: The Uphill Battles

Machine-learningly naïve NLP

Page 30: The Uphill Battles

Flavors of Failure

Page 31: The Uphill Battles

Conneau et al. (2018) show GANs with linear generators align cross-lingual embedding spaces.


Conneau et al. (2018). Word translation without parallel data. ICLR.
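Schematically, the approach looks as follows (a bare-bones sketch; the paper adds an orthogonality constraint on W, CSLS retrieval, and other refinements):

```python
import torch
import torch.nn as nn

d = 300                                   # embedding dimensionality
W = nn.Linear(d, d, bias=False)           # linear generator: source -> target space
D = nn.Sequential(nn.Linear(d, 128), nn.ReLU(), nn.Linear(128, 1))  # discriminator

opt_w = torch.optim.SGD(W.parameters(), lr=0.1)
opt_d = torch.optim.SGD(D.parameters(), lr=0.1)
bce = nn.BCEWithLogitsLoss()

def adversarial_step(x_src, x_tgt):
    """One step: D learns to separate mapped source vectors from target
    vectors; W learns to fool D, which aligns the two spaces."""
    opt_d.zero_grad()
    d_loss = bce(D(W(x_src).detach()), torch.zeros(len(x_src), 1)) \
           + bce(D(x_tgt), torch.ones(len(x_tgt), 1))
    d_loss.backward()
    opt_d.step()

    opt_w.zero_grad()
    w_loss = bce(D(W(x_src)), torch.ones(len(x_src), 1))  # make W(x) look "target"
    w_loss.backward()
    opt_w.step()
```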


Page 33: The Uphill Battles


Søgaard et al. (2018) show their approach is challenged by some language pairs.

Søgaard et al. (2018). On the limitations of unsupervised bilingual dictionary induction. ACL.


[Bar chart: precision@1 (y-axis 0-90) for unsupervised dictionary induction from English into es, et, fi, el, hu, pl, and tr; Spanish reaches 82, three pairs land in the 33-47 range, and three pairs score 0.]

Page 34: The Uphill Battles


It’s the morphology, stupid!


Page 35: The Uphill Battles


Problem: How do we know when to blame morphology?


Page 36: The Uphill Battles

Chocolate consumption correlates with Nobel laureates.


Page 37: The Uphill Battles


https://en.wikipedia.org/wiki/Galton%27s_problem

Geographical diffusion, via borrowing and common ancestors, produces such spurious correlations (Galton's problem).

Page 38: The Uphill Battles

Morphology correlates with siestas (Roberts and Winters, 2013).

Roberts & Winters (2013). Linguistic Diversity and Traffic Accidents. PLOS ONE.


Page 40: The Uphill Battles


Control for it?


Page 41: The Uphill Battles

Hartmann et al. (2018) show failure cases for English-English.


Hartmann et al. (2018). Why is unsupervised alignment of English embeddings from different algorithms so hard? EMNLP.

Page 42: The Uphill Battles


Hartmann et al. (2019) show the loss landscapes of these cases.


Hartmann et al. (2019). Alignability of word vector spaces. arXiv.


[Figure: discriminator performance over the alignment search space, showing a local minimum and the global minimum.]


Page 44: The Uphill Battles


This yields a diagnostic based on Wasserstein or Sinkhorn loss functions.

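A rough sketch of such a diagnostic (uniform weights, squared Euclidean cost, plain Sinkhorn iterations; Hartmann et al.'s actual procedure differs): compute the entropy-regularized optimal-transport cost between two embedding clouds, where a high cost suggests the spaces are hard to align.

```python
import numpy as np

def sinkhorn_cost(X, Y, eps=0.05, n_iters=200):
    """Entropy-regularized OT cost between two point clouds."""
    C = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(-1)   # pairwise costs
    C = C / C.max()                            # normalize for numerical stability
    K = np.exp(-C / eps)
    a = np.full(len(X), 1.0 / len(X))
    b = np.full(len(Y), 1.0 / len(Y))
    u = np.ones_like(a)
    for _ in range(n_iters):                   # Sinkhorn fixed-point iterations
        v = b / (K.T @ u)
        u = a / (K @ v)
    P = u[:, None] * K * v[None, :]            # transport plan
    return (P * C).sum()

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 10))
print(sinkhorn_cost(X, X + 0.01))                    # near 0: easy to align
print(sinkhorn_cost(X, rng.normal(size=(100, 10))))  # larger: harder
```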

Page 45: The Uphill Battles

Flavors of Success

Page 46: The Uphill Battles

Caglayan et al. (2017) use an attentive encoder-decoder for image-aware MT.


Caglayan et al. (2017). LIUM-CVC Submissions for WMT17 Multimodal Translation Task. WMT.

[Image, captioned: "Man with Mardi Gras beads around his neck holding pole with banner."]

Page 47: The Uphill Battles


They report a 1.2 point METEOR improvement from adding related images.

… but in what sense is the system aware of the related image?


Page 48: The Uphill Battles


Elliott (2018). Adversarial evaluation of multimodal machine translation. EMNLP.


Elliott (2018) shows that pairing texts with random images obtains the same improvements.

Hypothesis: The images simply help the optimizer.
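Elliott's adversarial evaluation can be sketched as follows (the `model.translate(src, image)` interface and the dataset fields are hypothetical; the paper uses METEOR over congruent vs. incongruent image pairings):

```python
import random

def image_awareness(model, dataset, metric, seed=0):
    """Score with the paired image vs. a randomly mismatched image.
    A gap near zero suggests the model is not really using the image."""
    rng = random.Random(seed)
    images = [ex.image for ex in dataset]
    congruent, incongruent = [], []
    for ex in dataset:
        congruent.append(metric(model.translate(ex.src, ex.image), ex.ref))
        wrong_image = rng.choice(images)               # adversarial pairing
        incongruent.append(metric(model.translate(ex.src, wrong_image), ex.ref))
    return sum(congruent) / len(congruent) - sum(incongruent) / len(incongruent)
```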


Page 50: The Uphill Battles

Rationale: Over-parameterisation and skip connections help the optimizer.


Li et al. (2017). Visualizing the loss landscapes of neural nets. ICLR.

Page 51: The Uphill Battles

Summary

The Nobel-Chocolate Fallacy: Is it really the morphology?

Are our models really aware of external information?

Page 52: The Uphill Battles

Summary

Understand the limitations of models and optimizers

In high-dimensional space, we have limited intuitions.

Page 53: The Uphill Battles

Summary


Bingel & Søgaard (2017), for example, show that linguistic intuitions are a poor predictor of multi-task learning gains.

Bingel & Søgaard (2017). Identifying beneficial task relations for multi-task learning in deep neural networks. EACL.

Page 54: The Uphill Battles

What’s wrong with being naïve?

Page 55: The Uphill Battles

The scientific dance

[Diagram: a "Simplification" box facing an "Anomaly" box]

Page 56: The Uphill Battles

The scientific dance

[Diagram: the common simplification, "random finite samples are representative", facing an anomaly]

Page 57: The Uphill Battles

The scientific dance

[Diagram: the alternative simplification, "controlled/synthetic language", facing an anomaly]

Page 58: The Uphill Battles

Hegel’s holiday?

[Two columns: NLP WHEN I STARTED, controlled language; NLP TODAY, finite samples]

Page 59: The Uphill Battles


NLP TOMORROW

Page 60: The Uphill Battles

Questions?

Page 61: The Uphill Battles

Supplementary slides

Page 62: The Uphill Battles

John Dewey (1910)

A: “It will probably rain tomorrow.”
B: “Why do you think so?”
A: “Because the sky was lowering at sunset.”
B: “What has that to do with it?”
A: “I do not know, but it generally does rain after such a sunset.”

Page 63: The Uphill Battles

John Dewey (1910)

[The scientific] method of proceeding is by varying conditions one by one so far as possible, and noting just what happens when a given condition is eliminated. There are two methods for varying conditions. The first […] consists in comparing very carefully the results of a great number of observations which have occurred under accidentally different conditions. […]  [This] method […] is, however, badly handicapped; it can do nothing until it is presented with a certain number of diversified cases. […] The method is passive and dependent upon external accidents. Hence the superiority of the active or experimental method. Even a small number of observations may suggest an explanation - a hypothesis or theory. Working upon this suggestion, the scientist may then intentionally vary conditions and note what happens.

Page 64: The Uphill Battles

• NON-SCIENTIFIC METHOD: Extract regularities from available data (Penn Treebank).

• EMPIRICAL METHOD: Get more data (Web Treebank).

• EXPERIMENTAL METHOD: Create data (punctuation injection, etc.).