The Uphill Battles
Transcript of The Uphill Battles
The Uphill Battles of Doing NLP without Psychology and Theoretical Machine Learning
7th Swedish Language Technology Conference
Anders Søgaard, University of Copenhagen, Department of Computer Science
NLP WHEN I STARTED vs. NLP TODAY
NLP TODAY = NLP WHEN I STARTED + NAÏVE EMPIRICISM + LESS INTUITIVE METHODS
Outline
Three examples of psychologically naïve NLP, and two examples of machine learningly naïve NLP:
1 The Dot Plot: Syntactic parsing
2 The Bad Fortune Speller: Word recognition
3 Attention without Attention: Sentiment
4 Flavors of Failure: Dictionary induction
5 Flavors of Success: Multi-modal MT
[Slide banner: MODELS / INTERPRETATION]
Psychologically naïve NLP
Ur Example
Gold (1967) showed that even regular languages are unlearnable from finite samples.
That is, in the sense of inducing the exact grammar that generated the observed strings.
Problem: We observe more than strings, the strings are generated by many grammars, and there is no need to induce identical grammars.
Gold (1967). Language Identification in the Limit. Information and Control 10(5): 447-474.
The Dot Plot
Modern parsers are trained to analyze sentences written by journalists.
[Pierre Vinken]NP , [61 years old]ADJP , [will join the board]VP .
Parsers learn to rely on near-perfect punctuation, a give-away of the syntactic analysis.
Poor performance on texts with non-standard punctuation.
Humans are not sensitive to non-standard punctuation (Baldwin and Coady, 1978).
Hypothesis: Punctuation prevents parsers from learning deeper generalisations.
Baldwin and Coady (1978). Psycholinguistic approaches to a theory of punctuation. Journal of Literacy Research 10(4): 363-376.
Søgaard, de Lhoneux & Augenstein (2018). Nightmare at Test Time. BlackBox@EMNLP.
[Bar chart: labeled attachment scores (50-100) for UUParser and MaltParser under punctuation perturbation conditions d=0,c=0; nopunct; d=.01,c=.01; d=.01,c=.05; d=.05,c=.01; d=.05,c=.05; d=.1,c=.1, with minimum and maximum scores per condition; NOPUNCT baseline at 89.8.]
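To make the perturbation setup concrete, here is a minimal Python sketch of punctuation corruption of the kind the conditions above suggest; reading d as a punctuation-deletion rate and c as a comma-insertion rate is an assumption, not necessarily the published scheme.

    import random
    import string

    def perturb_punctuation(tokens, d=0.05, c=0.05, seed=0):
        """Sketch: randomly delete punctuation tokens with probability d and
        insert spurious commas with probability c (assumed reading of the
        d/c conditions above; the published perturbations may differ)."""
        rng = random.Random(seed)
        out = []
        for tok in tokens:
            if tok in string.punctuation and rng.random() < d:
                continue  # drop this punctuation mark
            out.append(tok)
            if rng.random() < c:
                out.append(",")  # inject a spurious comma
        return out

    print(perturb_punctuation("Pierre Vinken , 61 years old , will join the board .".split(), d=0.5, c=0.1))

Training on clean treebank text and testing on such perturbed versions (or vice versa) is one way to check whether a parser has learned to lean on punctuation.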
The Bad Fortune Speller
Modern NLP is often trained on sentences written by journalists.
This includes near-perfect spelling, making word recognition easy.
Poor performance on texts with non-standard spelling.
Huamns are not snesitvie to non-stnadard spellign (Forster et al., 1987).
Hypothesis: Near-perfect spelling prevents models from learning deeper generalisations.
Didn't character-based RNNs solve this?
[Bar chart: POS tagging and MT scores under scrambling/noise conditions s=0,f=0; s=0.1/0.05,f=0; s=0,f=0.1/0.05; performance drops clearly under both perturbations.]
(Semi-character RNNs)
Forster et al. (1987). Masked priming with graphemically related words. The Quarterly Journal of Experimental Psychology 32(4): 211-251.
Heigold et al. (2018). How robust are character-based word embeddings in tagging and MT against wrod scramlbing or random nouse? AMTA.
Sakaguchi et al. (2017). Robsut Wrod Reocginiton via semi-Character RNN. AAAI.
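The semi-character idea can be sketched in a few lines of Python, assuming the first/internal-bag/last decomposition described by Sakaguchi et al. (2017); the actual model feeds such vectors to an RNN tagger.

    from collections import Counter

    ALPHABET = "abcdefghijklmnopqrstuvwxyz"

    def semi_character_vector(word):
        """Sketch of a semi-character representation: one-hot first character,
        bag-of-characters for the middle, one-hot last character. Scrambling
        the internal characters leaves the vector unchanged."""
        word = word.lower()
        first = [1.0 if ch == word[0] else 0.0 for ch in ALPHABET]
        last = [1.0 if ch == word[-1] else 0.0 for ch in ALPHABET]
        middle_counts = Counter(word[1:-1])
        middle = [float(middle_counts[ch]) for ch in ALPHABET]
        return first + middle + last

    # "snesitvie" and "sensitive" share first/last letters and the internal bag:
    assert semi_character_vector("snesitvie") == semi_character_vector("sensitive")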
Attention without Attention
LSTMs with attention functions are popular models.
Attention functions are latent variables with no direct supervision.
Problem: Attention functions are prone to over-fitting (Rei and Søgaard, 2018).
Gaze data reflects word-level human attention, or relevance (Loboda et al., 2011).
Hypothesis: Gaze can be used to regularize neural attention (Barrett et al., 2018).
[Bar chart: classification scores on Sentiment, Error detection, and Abusive language tasks for learned attention vs. human (gaze-regularized) attention; the gaze-regularized models score slightly higher on all three tasks.]
Rei & Søgaard (2018). Zero-shot sequence labeling. NAACL.
Loboda et al. (2011). Inferring word relevance from eye-movements of readers. IUI.
Barrett, Bingel, Hollenstein, Rei & Søgaard (2018). Sentence classification with human attention. CoNLL.
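One simple way to operationalize the hypothesis is an auxiliary loss that pulls the model's attention weights toward normalized fixation durations; the sketch below illustrates that general idea and is not the multi-task setup actually used by Barrett et al. (2018).

    import numpy as np

    def gaze_attention_loss(attention, fixation_durations, eps=1e-8):
        """Sketch: cross-entropy between normalized human fixation durations
        over the tokens of a sentence and the model's attention distribution."""
        gaze = np.asarray(fixation_durations, dtype=float)
        gaze = gaze / (gaze.sum() + eps)      # turn fixations into a distribution
        att = np.asarray(attention, dtype=float)
        att = att / (att.sum() + eps)
        return -float(np.sum(gaze * np.log(att + eps)))

    # total_loss = task_loss + lambda_gaze * gaze_attention_loss(att_weights, fixations)
    print(gaze_attention_loss([0.1, 0.7, 0.2], [120, 480, 200]))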
Summary
Human reading is insensitive to punctuation.
Human reading is insensitive to (a lot of) spelling variation.
Human attention during reading is partial and systematic.
Takeaways: Ignore punctuation. Use semi-character RNNs. Regularize attention.
Machine learningly naïve NLP
Flavors of Failure
Conneau et al. (2018) show that GANs with linear generators can align cross-lingual embedding spaces.
Søgaard et al. (2018) show their approach is challenged by some language pairs.
Conneau et al. (2018). Word translation without parallel data. ICLR.
Søgaard et al. (2018). On the limitations of unsupervised bilingual dictionary induction. ACL.
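The adversarial setup can be sketched in a few lines of PyTorch, assuming two pre-trained monolingual embedding matrices X and Y as float tensors; the refinements of the actual MUSE system (orthogonalization, Procrustes, CSLS retrieval) are left out.

    import torch
    import torch.nn as nn

    def adversarial_align(X, Y, dim=300, steps=1000, lr=0.1, batch=32):
        """Sketch of adversarial alignment with a linear generator: W maps source
        vectors into the target space while a discriminator tries to tell W(x)
        from real target vectors y."""
        W = nn.Linear(dim, dim, bias=False)                                   # linear generator
        D = nn.Sequential(nn.Linear(dim, 256), nn.ReLU(), nn.Linear(256, 1))  # discriminator
        opt_w = torch.optim.SGD(W.parameters(), lr=lr)
        opt_d = torch.optim.SGD(D.parameters(), lr=lr)
        bce = nn.BCEWithLogitsLoss()
        for _ in range(steps):
            x = X[torch.randint(len(X), (batch,))]
            y = Y[torch.randint(len(Y), (batch,))]
            # 1) train the discriminator to separate mapped source from target vectors
            d_loss = bce(D(W(x).detach()), torch.ones(batch, 1)) + bce(D(y), torch.zeros(batch, 1))
            opt_d.zero_grad(); d_loss.backward(); opt_d.step()
            # 2) train the generator to fool the discriminator
            g_loss = bce(D(W(x)), torch.zeros(batch, 1))
            opt_w.zero_grad(); g_loss.backward(); opt_w.step()
        return W.weight.detach()   # the learned source-to-target map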
[Bar chart, 0-90 scale: bilingual dictionary induction scores for English paired with es, et, fi, el, hu, pl and tr; around 82 for Spanish, with several of the other languages at or near zero.]
It's the morphology, stupid!
Problem: How do we know when to blame morphology?
Chocolate consumption correlates with Nobel laureates.
Morphology correlates with siestas (Roberts and Winters, 2013).
Geographical diffusion caused by borrowing and common ancestors (Galton's problem: https://en.wikipedia.org/wiki/Galton%27s_problem).
Control for it?
Roberts & Winters (2013). Linguistic Diversity and Traffic Accidents. PLOS ONE.
Hartmann et al. (2018) show failure cases even for English-English alignment.
Hartmann et al. (2019) show the loss landscapes of these cases.
[Figure: discriminator performance over the loss landscape, contrasting a local minimum with the global minimum.]
Diagnostic for loss functions based on Wasserstein or Sinkhorn.
Hartmann et al. (2018). Why is unsupervised alignment of English embeddings from different algorithms so hard? EMNLP.
Hartmann et al. (2019). Alignability of word vector spaces. arXiv.
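As an illustration of what such a diagnostic might involve, here is a small NumPy sketch of an entropic-regularized (Sinkhorn) transport cost between two samples of length-normalized word vectors; a high cost after the best attainable alignment suggests the spaces are hard to align. The actual diagnostic in Hartmann et al. (2019) may be defined differently.

    import numpy as np

    def sinkhorn_cost(X, Y, reg=0.1, n_iters=200):
        """Sketch: entropic-regularized optimal-transport cost between two point
        clouds of word vectors, computed with Sinkhorn-Knopp iterations."""
        X = X / np.linalg.norm(X, axis=1, keepdims=True)
        Y = Y / np.linalg.norm(Y, axis=1, keepdims=True)
        C = 1.0 - X @ Y.T                            # cosine distance cost matrix
        K = np.exp(-C / reg)
        a = np.full(X.shape[0], 1.0 / X.shape[0])    # uniform source weights
        b = np.full(Y.shape[0], 1.0 / Y.shape[0])    # uniform target weights
        u = np.ones_like(a)
        for _ in range(n_iters):
            v = b / (K.T @ u)
            u = a / (K @ v)
        P = np.diag(u) @ K @ np.diag(v)              # transport plan
        return float((P * C).sum())

    rng = np.random.default_rng(0)
    print(sinkhorn_cost(rng.normal(size=(100, 50)), rng.normal(size=(100, 50))))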
Flavors of Success
Caglayan et al. (2017) use an attentive encoder-decoder for image-aware MT.
Example caption: Man with Mardi Gras beads around his neck holding pole with banner.
They report a 1.2-point METEOR improvement from adding related images.
… but in what sense is the system aware of the related image?
Elliott (2018) shows that pairing texts with random images obtains the same improvements.
Hypothesis: The images simply help the optimizer.
Rationale: Over-parameterisation and skip connections help the optimizer.
Caglayan et al. (2017). LIUM-CVC Submissions for WMT17 Multimodal Translation Task. WMT.
Elliott (2018). Adversarial evaluation of multimodal machine translation. EMNLP.
Li et al. (2017). Visualizing the loss landscapes of neural nets. ICLR.
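The adversarial evaluation can be sketched as a simple harness: score the system once with the congruent images and once with randomly shuffled ones, and inspect the gap. Here score_fn, sentences and image_features are hypothetical placeholders for a real multimodal MT system and a metric such as METEOR.

    import random

    def awareness_gap(score_fn, sentences, image_features, seed=0):
        """Sketch of an Elliott (2018)-style congruence test: a gap near zero
        suggests the model is not really 'aware' of the image content."""
        rng = random.Random(seed)
        shuffled = list(image_features)
        rng.shuffle(shuffled)                        # incongruent (random) image pairing
        congruent = score_fn(sentences, image_features)
        incongruent = score_fn(sentences, shuffled)
        return congruent - incongruent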
Summary
The Nobel-Chocolate Fallacy: Is it really the morphology?
Are our models really aware of external information?
Understand the limitations of models and optimizers.
In high-dimensional space, we have limited intuitions.
Bingel & Søgaard (2017), e.g., show that linguistic intuitions are a poor predictor of multi-task gains.
Bingel & Søgaard (2017). Identifying beneficial task relations for multi-task learning in deep neural networks. EACL.
What’s wrong with being naïve?
The scientific dance
Simplification ↔ Anomaly
Common simplification: Random finite samples are representative → anomalies.
Alternative simplification: Controlled/synthetic language → anomalies.
Hegel’s holiday?
Controlled language (NLP WHEN I STARTED) vs. finite samples (NLP TODAY).
NLP TOMORROW: controlled language and finite samples combined?
Questions?
Supplementary slides
John Dewey (1910)
A: “It will probably rain tomorrow.”
B: “Why do you think so?”
A: “Because the sky was lowering at sunset.”
B: “What has that to do with it?”
A: “I do not know, but it generally does rain after such a sunset.”
John Dewey (1910)
[The scientific] method of proceeding is by varying conditions one by one so far as possible, and noting just what happens when a given condition is eliminated. There are two methods for varying conditions. The first […] consists in comparing very carefully the results of a great number of observations which have occurred under accidentally different conditions. […] [This] method […] is, however, badly handicapped; it can do nothing until it is presented with a certain number of diversified cases. […] The method is passive and dependent upon external accidents. Hence the superiority of the active or experimental method. Even a small number of observations may suggest an explanation - a hypothesis or theory. Working upon this suggestion, the scientist may then intentionally vary conditions and note what happens.
• NON-SCIENTIFIC METHOD: Extract regularities from available data (Penn Treebank).
• EMPIRICAL METHOD: Get more data (Web Treebank).
• EXPERIMENTAL METHOD: Create data (punctuation injection, etc.).