What’s so Hard about Natural Language Understanding?
Alan Ritter
Computer Science and Engineering
The Ohio State University
Collaborators: Jiwei Li, Dan Jurafsky (Stanford), Bill Dolan, Michel Galley, Jianfeng Gao (MSR), Colin Cherry (Google), Jeniya Tabassum (Ohio State), Alexander Konovalov (Ohio State), Wei Xu (Ohio State), Brendan O’Connor (UMass)
Q: Why are we so good at Speech, MT (but bad at NLU)?
People naturally translate and transcribe.
Q: Large, End-to-End Datasets for NLU?
• Web-scale Conversations?
• Web-scale Structured Data?
Data-Driven Conversation
• Twitter: ~500 million public SMS-style conversations per month
• Goal: Learn conversational agents directly from massive volumes of data.
Noisy Channel Model
[Ritter, Cherry, Dolan EMNLP 2011]
Input: Who wants to come over for dinner tomorrow?
Output: [Yum ! I] [want to] [be there] [tomorrow !]
(The reply is generated phrase by phrase from the input, as in phrase-based statistical MT.)
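The noisy channel framing scores a candidate reply by how well it "translates back" to the input status and how fluent it is on its own. A minimal sketch of that ranking idea (the scoring functions here are hypothetical stand-ins, not the paper's phrase-based models):

```python
def noisy_channel_score(status, response, channel_logprob, lm_logprob):
    """Noisy channel score: log p(status | response) + log p(response).

    The channel model rewards relevance to the input status; the
    language model rewards fluency of the response itself."""
    return channel_logprob(status, response) + lm_logprob(response)

def rank_responses(status, candidates, channel_logprob, lm_logprob):
    # Return candidate replies ordered from best to worst under the model.
    return sorted(
        candidates,
        key=lambda r: noisy_channel_score(status, r, channel_logprob, lm_logprob),
        reverse=True,
    )
```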
Neural Conversation
[Sordoni et al. 2015] [Li et al. 2016] [Xu et al. 2016] [Wen et al. 2016] [Kannan et al. 2016] [Serban et al. 2016]
A known failure mode: the learned model drifts into generic, repetitive exchanges.

How old are you?
i 'm 16 .
16 ?
i don 't know what you 're talking about
you don 't know what you 're saying
i don 't know what you 're talking about
…

Bad Action: the generic turn "i don 't know what you 're talking about". Outcome: the conversation falls into a repetitive loop.
Slide Credit: Jiwei Li
Deep Reinforcement Learning
[Li, Monroe, Ritter, Galley, Gao, Jurafsky EMNLP 2016]
[Figure: a seq2seq encoder-decoder. Encoding the dialogue history ("how old are you") gives the State; the decoded reply ("I’m 16 . EOS") is the Action.]
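Here the encoder's final hidden state plays the role of the RL State and the decoded reply is the Action. A minimal seq2seq sketch in PyTorch (hypothetical vocabulary and hidden sizes, a single GRU layer; not the paper's configuration):

```python
import torch
import torch.nn as nn

class Seq2Seq(nn.Module):
    """Encoder-decoder: encode the dialogue history (State), decode a reply (Action)."""
    def __init__(self, vocab_size, hidden_size=256):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, hidden_size)
        self.encoder = nn.GRU(hidden_size, hidden_size, batch_first=True)
        self.decoder = nn.GRU(hidden_size, hidden_size, batch_first=True)
        self.out = nn.Linear(hidden_size, vocab_size)

    def forward(self, history_ids, reply_ids):
        # State: final hidden state after reading the dialogue history.
        _, state = self.encoder(self.embed(history_ids))
        # Action: the reply, decoded token by token conditioned on the state.
        decoded, _ = self.decoder(self.embed(reply_ids), state)
        return self.out(decoded)  # per-token logits over the vocabulary
```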
Learning: Policy Gradient
REINFORCE Algorithm (Williams, 1992)
What we want to learn: the policy that maps a State ("How old are you?") to an Action ("i 'm 16 .").
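REINFORCE estimates the policy gradient as R · ∇ log p_θ(action | state): sample a reply, observe a scalar reward, and scale the log-likelihood gradient by the reward. A minimal update step (the sampling helper and reward function are assumed, not the paper's reward design):

```python
def reinforce_step(model, optimizer, history, sample_reply, reward_fn):
    """One REINFORCE update: loss = -R * log p_theta(reply | history)."""
    reply, logprob = sample_reply(model, history)  # sample an action; keep its log-prob
    reward = reward_fn(history, reply)             # scalar reward; no gradient through it
    loss = -reward * logprob                       # descending this ascends expected reward
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return reward
```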
Adversarial Learning for Neural Dialogue
[Li, Monroe, Shi, Jean, Ritter, Jurafsky EMNLP 2017]
[Figure: the Response Generator generates a response; the Discriminator receives either that generated response or a sampled human response from real-world conversations and predicts Real or Fake.]
(Alternate between training the Generator and the Discriminator; the Generator is trained with the REINFORCE algorithm (Williams, 1992), treating the Discriminator's judgment as the reward.)
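A sketch of the alternating training loop (the generator/discriminator interfaces and the `sample_pair` helper are hypothetical; the real system operates over token sequences with a seq2seq generator and a neural discriminator):

```python
def adversarial_training(generator, discriminator, conversations,
                         g_opt, d_opt, steps, sample_pair):
    for _ in range(steps):
        history, human_reply = sample_pair(conversations)

        # 1) Discriminator step: distinguish human replies from generated ones.
        fake_reply = generator.generate(history)
        d_loss = (discriminator.loss(history, human_reply, real=True) +
                  discriminator.loss(history, fake_reply, real=False))
        d_opt.zero_grad()
        d_loss.backward()
        d_opt.step()

        # 2) Generator step: REINFORCE, with the discriminator's
        #    probability that the reply is human used as the reward.
        reply, logprob = generator.sample(history)
        reward = discriminator.prob_real(history, reply).detach()
        g_loss = -reward * logprob
        g_opt.zero_grad()
        g_loss.backward()
        g_opt.step()
```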
Adversarial Learning Improves Response Generation

Machine Evaluator: Adversarial Success (how often the generator fools a machine evaluator)
  Adversarial Learning: 8.0%
  Standard Seq2Seq model: 4.9%

Human Evaluator: vs. the vanilla generation model
  Adversarial Win: 62%   Lose: 18%   Tie: 20%

Slide Credit: Jiwei Li
[Bowman et al. 2016]
Q: Why are we so good at Speech, MT (but bad at NLU)?
People naturally translate and transcribe.
Q: Large, End-to-End Datasets for NLU?
• Web-scale Conversations? Generates fluent open-domain replies, but is it really Natural Language Understanding?
• Web-scale Structured Data?
Learning from Distant Supervision
[Mintz et al. 2009]
1) Named Entity Recognition. Challenge: highly ambiguous labels [Ritter et al. EMNLP 2011]
2) Relation Extraction. Challenge: missing data [Ritter et al. TACL 2013]
3) Time Normalization. Challenge: diversity in noisy text [Tabassum, Ritter, Xu EMNLP 2016]
4) Event Extraction. Challenge: lack of negative examples [Ritter et al. WWW 2015] [Konovalov et al. WWW 2017]

O(\theta) = \underbrace{\sum_{i}^{N} \log p_\theta(y_i \mid x_i)}_{\text{Log likelihood}} - \underbrace{\lambda_U \, D(\tilde{p} \,\|\, \hat{p}^{\text{unlabeled}}_\theta)}_{\text{Label regularization}} - \underbrace{\lambda_{L2} \sum_j w_j^2}_{\text{L2 regularization}}
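A sketch of computing this objective (numpy; the array shapes and KL direction are assumptions consistent with the formula above, not a reproduction of the papers' code):

```python
import numpy as np

def objective(weights, logliks, p_expected, p_predicted, lam_u, lam_l2):
    """Distantly supervised objective: log-likelihood on distantly labeled
    examples, minus a KL label-regularization term pulling the model's
    predicted label distribution on unlabeled data toward an expected
    prior, minus L2 weight decay."""
    log_likelihood = np.sum(logliks)                                   # sum_i log p(y_i | x_i)
    label_reg = np.sum(p_expected * np.log(p_expected / p_predicted))  # D(p_expected || p_predicted)
    l2 = np.sum(weights ** 2)                                          # sum_j w_j^2
    return log_likelihood - lam_u * label_reg - lam_l2 * l2
```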
Time Normalization: Distant Supervision
(no human labels or rules!)
[Tabassum, Ritter, Xu EMNLP 2016]
Task: resolve time expressions in text to calendar dates (e.g., 1 Jan 2016).
Compared against state-of-the-art time resolvers: { TempEX, HeidelTime, SUTime, UWTime }
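Distant supervision here means labels come from a database of events with known dates, paired with tweets that mention them. A sketch of how an event date could induce sentence-level tags (assumed heuristics for illustration, not the paper's exact rules):

```python
from datetime import date

def sentence_tags(event_date: date, tweet_date: date):
    """Derive sentence-level tags for a tweet from its event's known date."""
    if event_date > tweet_date:
        tense = "Future"
    elif event_date < tweet_date:
        tense = "Past"
    else:
        tense = "Present"
    return {
        "day_of_month": event_date.day,            # 1-31
        "day_of_week": event_date.strftime("%a"),  # Mon-Sun
        "month": event_date.month,                 # 1-12
        "tense": tense,                            # Past / Present / Future
    }

# e.g. a tweet posted 2016-05-08 about the Mercury transit on 2016-05-09:
print(sentence_tags(date(2016, 5, 9), date(2016, 5, 8)))
# {'day_of_month': 9, 'day_of_week': 'Mon', 'month': 5, 'tense': 'Future'}
```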
Multiple Instance Learning Tagger
[Hoffmann et al. 2011]
[Figure: the words w1 … wn of a tweet about a known event ([Mercury, 5/9/2016] in the Event Database) receive word-level tags z1 … zn from a local classifier exp(θ · f(wi, zi)); a deterministic OR aggregates the word-level tags into sentence-level tags t1 … t4 (day of month 1–31, day of week Mon–Sun, month 1–12, and Past/Present/Future), which are supervised by the event database.]
Learning maximizes the conditional likelihood, marginalizing over the word-level tags: \sum_z P(z, t \mid w, \theta)
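The deterministic OR is the aggregation step: a sentence-level tag holds exactly when at least one word-level tag asserts it. A minimal sketch:

```python
def deterministic_or(word_tags, tag):
    """Sentence-level tag = OR over word-level tags: true iff at least
    one word in the tweet carries the tag."""
    return any(z == tag for z in word_tags)

# The tweet is tagged Future because one word ("tomorrow") is tagged Future.
assert deterministic_or(["NA", "NA", "NA", "Future"], "Future")
assert not deterministic_or(["NA", "NA"], "Past")
```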
Missing Data Extension
Missing Data Problem in Distant Supervision [Ritter et al. TACL 2013]
[Figure: alongside the aggregated sentence-level tags, latent variables m1 … m4 mediate between the tags t′1 … t′4 implied by the event date and the tags actually mentioned in text; the model encourages agreement between the two rather than requiring every database fact to appear in the tweet.]
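A sketch of the "encourage agreement" idea (the penalty constant is hypothetical; in the TACL 2013 model agreement is governed by learned latent mention variables): a mismatch between a database-implied tag and a text-mentioned tag is penalized rather than forbidden, since a fact can be true of the event yet go unmentioned in any one tweet.

```python
def agreement_score(implied_tags, mentioned_tags, penalty=2.0):
    """Score how well tags implied by the event date line up with tags
    mentioned in text. Each mismatch costs `penalty`; the soft cost lets
    the model tolerate facts that are simply not mentioned."""
    return sum(0.0 if a == b else -penalty
               for a, b in zip(implied_tags, mentioned_tags))

# One mismatch between implied and mentioned tags costs 2.0.
assert agreement_score(["Mon", "Future"], ["Mon", "NA"]) == -2.0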
Example Tags

Word: Im  Hella  excited  for  tomorrow
Tag:  NA  NA     Future   NA   Future

Word: Thnks  for  a   Christmas  party  on  fri
Tag:  NA     NA   NA  December   NA     NA  Friday
Where can we find NLU? Follow the data!
Opportunistically Gathered Data:
• Twitter Events (Time Normalization)
• Billions of Internet Conversations
Design Models for the Data (rather than the other way around)