What’s so Hard about Natural Language Understanding?

Alan Ritter
Computer Science and Engineering, The Ohio State University

Collaborators: Jiwei Li, Dan Jurafsky (Stanford); Bill Dolan, Michel Galley, Jianfeng Gao (MSR); Colin Cherry (Google); Jeniya Tabassum, Alexander Konovalov, Wei Xu (Ohio State); Brendan O’Connor (UMass)


Q: Why are we so good at Speech and MT (but bad at NLU)?
A: People naturally translate and transcribe.

Q: Large, End-to-End Datasets for NLU?
• Web-scale Conversations?
• Web-scale Structured Data?

Data-Driven Conversation

• Twitter: ~500 million public SMS-style conversations per month
• Goal: learn conversational agents directly from massive volumes of data.

Noisy Channel Model
[Ritter, Cherry, Dolan EMNLP 2011]

Input: Who wants to come over for dinner tomorrow?
Output: { Yum ! I } { want to } { be there } { tomorrow ! }

(The reply is built up phrase by phrase, treating response generation as phrase-based statistical machine translation: a channel model scores how well the reply "translates" the input, and a language model keeps the reply fluent.)
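Concretely, the noisy channel decomposition ranks a candidate reply r for an input s by log p(s | r) + log p(r). A minimal sketch of that ranking step, assuming hypothetical `translation_logprob` (phrase table) and `lm_logprob` (response language model) functions:

```python
def channel_score(input_msg, reply, translation_logprob, lm_logprob):
    """Noisy channel score: log p(input | reply) + log p(reply).

    translation_logprob and lm_logprob are hypothetical stand-ins for
    a phrase-based translation model and a response language model.
    """
    return translation_logprob(input_msg, reply) + lm_logprob(reply)


def best_reply(input_msg, candidates, translation_logprob, lm_logprob):
    # Rank candidate replies by the noisy channel score and keep the best.
    return max(candidates,
               key=lambda r: channel_score(input_msg, r,
                                           translation_logprob, lm_logprob))
```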

Neural Conversation

[Sordoni et al. 2015] [Li et al. 2016] [Xu et al. 2016] [Wen et al. 2016] [Kannan et al. 2016] [Serban et al. 2016]

How old are you?
i 'm 16 .
16 ?
i don 't know what you 're talking about
you don 't know what you 're saying
i don 't know what you 're talking about
…

Slide Credit: Jiwei Li

Bad Action → Outcome

Framed as reinforcement learning: the encoded dialogue history ("how old are you") is the state, each generated reply is an action, and a repetitive reply like the one above is a bad action that leads to a bad outcome.

Deep Reinforcement Learning
[Li, Monroe, Ritter, Galley, Gao, Jurafsky EMNLP 2016]

[Diagram: a seq2seq model. Encoding: the dialogue history "how old are you" is read in and encoded. Decoding: the reply "I’m 16 . EOS" is generated token by token; the generated reply is the action.]
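The architecture in the diagram is a standard encoder-decoder; a minimal PyTorch sketch, with layer sizes and names chosen for illustration rather than taken from the talk:

```python
import torch.nn as nn

class Seq2Seq(nn.Module):
    """Minimal encoder-decoder: encode the dialogue history, decode a reply."""

    def __init__(self, vocab_size, hidden=256):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, hidden)
        self.encoder = nn.LSTM(hidden, hidden, batch_first=True)
        self.decoder = nn.LSTM(hidden, hidden, batch_first=True)
        self.out = nn.Linear(hidden, vocab_size)

    def forward(self, src_ids, tgt_ids):
        # Encoding: read "how old are you" into a final hidden state.
        _, state = self.encoder(self.embed(src_ids))
        # Decoding: generate "I'm 16 . EOS" conditioned on that state.
        dec_out, _ = self.decoder(self.embed(tgt_ids), state)
        return self.out(dec_out)  # per-step vocabulary logits
```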

Learning: Policy Gradient

REINFORCE Algorithm (Williams, 1992)

What we want to learn: a policy that maps a dialogue state ("How old are you?") to an action (the reply "i 'm 16 .").

Q: Rewards? A: Turing Test
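REINFORCE treats the decoder as a stochastic policy: sample a reply, observe a scalar reward, and scale the gradient of the sampled tokens' log-likelihood by that reward. A minimal sketch, assuming a model like the one above and hypothetical `sample_reply` and `reward_fn` helpers:

```python
def reinforce_step(model, optimizer, src_ids, sample_reply, reward_fn,
                   baseline=0.0):
    """One REINFORCE update (Williams, 1992) for a seq2seq policy.

    sample_reply(model, src_ids) -> (token_ids, log_probs) is assumed to
    sample a reply token by token and return per-token log-probabilities;
    reward_fn(src_ids, token_ids) -> float is the scalar reward.
    """
    tokens, log_probs = sample_reply(model, src_ids)
    reward = reward_fn(src_ids, tokens)
    # Policy gradient: maximize E[reward], i.e. minimize -(reward - b) * sum(log p).
    loss = -(reward - baseline) * log_probs.sum()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return reward
```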

Adversarial Learning (Goodfellow et al., 2014)

Adversarial Learning for Neural Dialogue
[Li, Monroe, Shi, Jean, Ritter, Jurafsky EMNLP 2016]

[Diagram: real-world conversations supply sampled human responses; the Response Generator generates responses; a Discriminator judges each response: Real or Fake?]

(Alternate between training the generator and the discriminator; the generator is updated with the REINFORCE algorithm (Williams, 1992), using the discriminator as the reward signal.)
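A minimal sketch of the alternating loop, reusing the `reinforce_step` above and assuming a binary `disc(context, reply)` that returns p(real); all names are illustrative, not the paper's code:

```python
import torch
import torch.nn.functional as F

def adversarial_epoch(gen, disc, gen_opt, disc_opt, data, sample_reply):
    for context, human_reply in data:
        # --- Discriminator step: tell human replies from generated ones. ---
        fake_reply, _ = sample_reply(gen, context)
        real_p = disc(context, human_reply)
        fake_p = disc(context, fake_reply)
        d_loss = (F.binary_cross_entropy(real_p, torch.ones_like(real_p))
                  + F.binary_cross_entropy(fake_p, torch.zeros_like(fake_p)))
        disc_opt.zero_grad()
        d_loss.backward()
        disc_opt.step()

        # --- Generator step: REINFORCE, with p(real) as the reward. ---
        reward_fn = lambda ctx, reply: disc(ctx, reply).item()
        reinforce_step(gen, gen_opt, context, sample_reply, reward_fn)
```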

Adversarial Learning Improves Response Generation

Machine Evaluator — Adversarial Success (how often you can fool a machine):
• Adversarial Learning: 8.0%
• Standard Seq2Seq model: 4.9%

Human Evaluator — vs. vanilla generation model:
• Adversarial Win: 62%  • Adversarial Lose: 18%  • Tie: 20%

Slide Credit: Jiwei Li

[Bowman et al. 2016]

Q: Why are we so good at Speech and MT (but bad at NLU)?
A: People naturally translate and transcribe.

Q: Large, End-to-End Datasets for NLU?
• Web-scale Conversations? → Generates fluent open-domain replies. But is this really natural language understanding?
• Web-scale Structured Data?

Learning from Distant Supervision
[Mintz et al. 2009]

1) Named Entity Recognition — Challenge: highly ambiguous labels [Ritter et al. EMNLP 2011]
2) Relation Extraction — Challenge: missing data [Ritter et al. TACL 2013]
3) Time Normalization — Challenge: diversity in noisy text [Tabassum, Ritter, Xu EMNLP 2016]
4) Event Extraction — Challenge: lack of negative examples [Ritter et al. WWW 2015] [Konovalov et al. WWW 2017]

The training objective combines a log-likelihood term with label and L2 regularization:

$$O(\theta) \;=\; \underbrace{\sum_{i}^{N} \log p_\theta(y_i \mid x_i)}_{\text{log likelihood}} \;-\; \underbrace{\lambda_U \, D\big(\tilde{p} \,\big\|\, \hat{p}^{\,\text{unlabeled}}_\theta\big)}_{\text{label regularization}} \;-\; \underbrace{\lambda_{L2} \sum_j w_j^2}_{\text{L2 regularization}}$$
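A minimal PyTorch sketch of this objective, assuming a generic classifier `model` that returns label logits; `p_tilde` is the prior label distribution targeted by the label-regularization term, and the λ values are hyperparameters (names are mine):

```python
import torch
import torch.nn.functional as F

def objective_loss(model, x_labeled, y_labeled, x_unlabeled, p_tilde,
                   lambda_u=1.0, lambda_l2=1e-4):
    """Negative of O(theta), suitable for gradient-descent minimization."""
    # Log likelihood on (distantly) labeled examples.
    log_lik = -F.cross_entropy(model(x_labeled), y_labeled, reduction="sum")

    # Label regularization: KL(p_tilde || mean predicted label distribution
    # over unlabeled data), discouraging degenerate label proportions.
    p_hat = F.softmax(model(x_unlabeled), dim=-1).mean(dim=0)
    label_reg = torch.sum(p_tilde * (p_tilde.log() - p_hat.log()))

    # L2 regularization over all model weights.
    l2 = sum((w ** 2).sum() for w in model.parameters())

    return -(log_lik - lambda_u * label_reg - lambda_l2 * l2)
```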

Time Normalization: Distant Supervision (no human labels or rules!)
[Tabassum, Ritter, Xu EMNLP 2016]

[Example: a time expression resolved to the calendar date 1 Jan 2016.]

State-of-the-art time resolvers: { TempEX, HeidelTime, SUTime, UWTime }

Distant Supervision Assumption

[Diagram: the Mercury Transit, event date May 9, 2016; tweets mentioning the event are posted on 8 May, 9 May, and 10 May. The assumption: time expressions in tweets that mention a known event can be resolved against the event's database date.]
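Under this assumption, a tweet about a known event can be labeled automatically by comparing the event's database date to the tweet's timestamp. A small sketch of how such distant labels could be derived (the field names are illustrative, not the paper's):

```python
from datetime import date

def distant_labels(event_date: date, tweet_date: date) -> dict:
    """Sentence-level tags implied by a known event date."""
    if tweet_date < event_date:
        tense = "Future"       # tweet written before the event
    elif tweet_date == event_date:
        tense = "Present"
    else:
        tense = "Past"
    return {
        "tense": tense,
        "day_of_month": event_date.day,
        "month": event_date.strftime("%B"),
        "day_of_week": event_date.strftime("%a"),
    }

# A tweet about the Mercury Transit (2016-05-09) posted the day before:
print(distant_labels(date(2016, 5, 9), date(2016, 5, 8)))
# {'tense': 'Future', 'day_of_month': 9, 'month': 'May', 'day_of_week': 'Mon'}
```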

Multiple Instance Learning Tagger
[Hoffmann et al. 2011]

[Diagram: words $w_1 \dots w_n$ of a tweet, matched to an event database entry [ Mercury, 5/9/2016 ]. A local classifier $\exp(\theta \cdot f(w_i, z_i))$ assigns each word a latent word-level tag $z_1 \dots z_n$; a deterministic OR aggregates the word-level tags into sentence-level tags $t_1 \dots t_4$, which range over day of month (1–31), day of week (Mon–Sun), month (1–12), and tense (Past / Present / Future).]

Learning maximizes the conditional likelihood, marginalizing over the latent word-level tags:

$$\sum_{z} P(z, t \mid w, \theta)$$
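The deterministic OR says a sentence-level tag value fires iff at least one word-level tag selects it. A minimal sketch of that aggregation at prediction time (training instead marginalizes over the latent z, per the likelihood above); `word_tag_scores` is a hypothetical stand-in for the local classifier:

```python
def sentence_tags(words, word_tag_scores, tag_values):
    """Aggregate word-level tags into sentence-level tags via deterministic OR."""
    fired = set()
    for w in words:
        scores = word_tag_scores(w)          # {tag value: score} for this word
        z = max(scores, key=scores.get)      # hard word-level tag assignment
        if z != "NA":
            fired.add(z)
    # A sentence-level tag value is on iff some word's tag selected it.
    return {t: (t in fired) for t in tag_values}
```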

Sentence-Level Tags (example): TL = Future, MOY = May, DOM = 9, DOW = Mon

Missing Data Problem

[Diagram: aggregated sentence-level tags $t_1 \dots t_4$ sit on top of word-level tags $z_1 \dots z_n$ and words $w_1 \dots w_n$, and are clamped to the event database.]

Missing Data Extension
[Missing data problem in distant supervision: Ritter et al. TACL 2013]

[Diagram: the tags mentioned in the text and the tags $t'_1 \dots t'_4$ implied by the event date are connected through latent variables $m_1 \dots m_4$ that encourage agreement between the two sides rather than forcing it, since either side may be missing.]
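One way to read the extension: a latent "mentioned" switch lets either side be missing, so text/database disagreement is penalized instead of forbidden. A toy scoring sketch under that reading (my simplification, not the paper's exact factor graph):

```python
def alignment_score(text_tag, db_tag, penalty=2.0):
    """Log-score for one tag slot with a latent 'mentioned' switch.

    Agreement costs nothing; disagreement (a database fact not mentioned in
    the text, or a text tag not implied by the event date) pays a penalty
    rather than making the assignment impossible.
    """
    return 0.0 if text_tag == db_tag else -penalty

def sentence_score(text_tags, db_tags):
    # Encourage (soft) agreement across all sentence-level tag slots.
    return sum(alignment_score(t, d) for t, d in zip(text_tags, db_tags))
```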


Example Tags

Word: Im | Hella | excited | for | tomorrow
Tag:  NA | NA    | Future  | NA  | Future

Word: Thnks | for | a  | Christmas | party | on | fri
Tag:  NA    | NA  | NA | December  | NA    | NA | Friday

Evaluation: 17% increase in F-score over SUTime

Where can we find NLU? Follow the data!

Opportunistically Gathered Data:
• Twitter Events (Time Normalization)
• Billions of Internet Conversations

Design Models for the Data (rather than the other way around)

Thank You!