DRL4NLP: Deep Reinforcement Learning for Natural Language Processing
Jiwei Li, Shannon.ai
William Wang, UC Santa Barbara
Xiaodong He, JD AI Research
Tutorial Outline
• Introduction
• Fundamentals and Overview (William Wang)
• Deep Reinforcement Learning for Dialog (Jiwei Li)
• Challenges (Xiaodong He)
• Conclusion
Introduction
DRL for Atari Games (Mnih et al., 2015)
AlphaGo Zero (Oct., 2017)
Reinforcement Learning
• RL is a general-purpose framework for decision making
• RL is for an agent with the capacity to act
• Each action influences the agent's future state
• Success is measured by a scalar reward signal
Big three: action, state, reward
Some slides adapted from David Silver.
Agent and Environment
[Diagram: the agent-environment loop. At each step the agent receives an observation o_t and a reward r_t from the environment and takes an action a_t (e.g., MoveLeft / MoveRight).]
Major Components in an RL Agent
• An RL agent may include one or more of these components:
  • Policy: agent's behavior function
  • Value function: how good is each state and/or action
  • Model: agent's representation of the environment
Reinforcement Learning Approach
• Policy-based RL
  • Search directly for the optimal policy π*
  • π* is the policy achieving maximum future reward
• Value-based RL
  • Estimate the optimal value function Q*(s, a)
  • Q*(s, a) is the maximum value achievable under any policy
• Model-based RL
  • Build a model of the environment
  • Plan (e.g., by lookahead) using the model
RL Agent Taxonomy
Deep Reinforcement Learning
• Idea: deep learning for reinforcement learning
• Use deep neural networks to represent:
  • Value function
  • Policy
  • Model
• Optimize the loss function by SGD
Value-Based Deep RL: Estimate How Good Each State and/or Action Is
Value Function Approximation
• Value functions are represented by a lookup table
  • too many states and/or actions to store
  • not able to learn the value of each entry individually
• Values can be estimated with function approximation
Q-Networks
• Q-networks represent value functions with weights w
  • generalize from seen states to unseen states
  • update the parameters w for function approximation
Q-Learning
• Goal: estimate the optimal Q-values
• Optimal Q-values obey a Bellman equation: Q*(s, a) = E_{s'}[ r + γ max_{a'} Q*(s', a') | s, a ], where r + γ max_{a'} Q*(s', a') is the learning target
• Value iteration algorithms solve the Bellman equation
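As a concrete illustration, here is a minimal sketch of tabular Q-learning driven by that Bellman learning target; the toy environment interface (reset / actions / step) is an assumption for illustration, not part of the tutorial.

```python
# Minimal tabular Q-learning sketch (illustrative toy env interface assumed).
import random
from collections import defaultdict

def q_learning(env, episodes=500, alpha=0.1, gamma=0.99, epsilon=0.1):
    Q = defaultdict(float)                      # Q[(state, action)] -> value
    for _ in range(episodes):
        s = env.reset()
        done = False
        while not done:
            # epsilon-greedy behaviour policy
            if random.random() < epsilon:
                a = random.choice(env.actions(s))
            else:
                a = max(env.actions(s), key=lambda act: Q[(s, act)])
            s_next, r, done = env.step(s, a)
            # Bellman backup: the learning target is r + gamma * max_a' Q(s', a')
            target = r if done else r + gamma * max(Q[(s_next, a2)] for a2 in env.actions(s_next))
            Q[(s, a)] += alpha * (target - Q[(s, a)])
            s = s_next
    return Q
```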
Deep Q-Networks (DQN)
• Represent the value function by a deep Q-network with weights w: Q(s, a, w) ≈ Q*(s, a)
• The objective is to minimize the MSE loss by SGD: L(w) = E[(r + γ max_{a'} Q(s', a', w) − Q(s, a, w))²]
  • Start with state s, take action a, get reward r, and see s'; but we want w to give us a better Q early in the game
• This leads to the standard Q-learning gradient (a sketch of one gradient step follows below)
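A minimal PyTorch sketch of one DQN update, computing the learning target and taking an SGD step on the MSE loss; the tiny network and the tensor shapes are illustrative assumptions.

```python
import torch
import torch.nn as nn

# Online Q-network Q(s, .; w); the 4-dim state and 2 actions are toy choices.
q_net = nn.Sequential(nn.Linear(4, 64), nn.ReLU(), nn.Linear(64, 2))
optimizer = torch.optim.SGD(q_net.parameters(), lr=1e-3)

def dqn_step(s, a, r, s_next, done, gamma=0.99):
    """One SGD step on the MSE loss between Q(s, a; w) and the learning target."""
    with torch.no_grad():  # no gradient flows through the target
        target = r + gamma * (1.0 - done) * q_net(s_next).max(dim=1).values
    q_sa = q_net(s).gather(1, a.unsqueeze(1)).squeeze(1)
    loss = nn.functional.mse_loss(q_sa, target)
    optimizer.zero_grad()
    loss.backward()          # backpropagate the Q-learning gradient
    optimizer.step()
    return loss.item()

# Example shapes: s, s_next: FloatTensor [B, 4]; a: LongTensor [B]; r, done: FloatTensor [B].
```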
Stability Issues with Deep RL
• Naive Q-learning oscillates or diverges with neural nets
1. Data is sequential
  • Successive samples are correlated, non-iid (independent and identically distributed)
2. Policy changes rapidly with slight changes to Q-values
  • Policy may oscillate
  • Distribution of data can swing from one extreme to another
3. Scale of rewards and Q-values is unknown
  • Naive Q-learning gradients can be unstable when backpropagated
Stable Solutions for DQN
• DQN provides a stable solution to deep value-based RL
1. Use experience replay
  • Break correlations in data, bringing us back to the iid setting
  • Learn from all past policies
2. Freeze the target Q-network
  • Avoid oscillation
  • Break correlations between the Q-network and the target
3. Clip rewards or normalize the network adaptively to a sensible range
  • Robust gradients
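A sketch of the two main stabilizers: an experience-replay buffer (with reward clipping) and a frozen target network that is only synchronized occasionally. Buffer capacity and the sync interval are illustrative assumptions.

```python
import copy
import random
from collections import deque

class ReplayBuffer:
    """Store past transitions and sample them uniformly to break temporal correlations."""
    def __init__(self, capacity=100_000):
        self.buffer = deque(maxlen=capacity)

    def push(self, s, a, r, s_next, done):
        # Reward clipping to [-1, 1] keeps gradient scales sensible across tasks.
        self.buffer.append((s, a, max(-1.0, min(1.0, r)), s_next, done))

    def sample(self, batch_size):
        return random.sample(self.buffer, batch_size)

def make_target_net(online_net):
    """A frozen copy of the online network, used only to compute learning targets."""
    return copy.deepcopy(online_net)

def maybe_sync(online_net, target_net, step, sync_every=10_000):
    # Refresh the frozen target network only occasionally to avoid oscillation.
    if step % sync_every == 0:
        target_net.load_state_dict(online_net.state_dict())
```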
Policy-Based Deep RL: Estimate How Good an Agent’s Behavior Is
Deep Policy Networks
• Represent the policy by a deep network with weights u
• The objective is to maximize the total discounted reward by SGD
• A stochastic policy samples a ∼ π(a | s, u); a deterministic policy outputs a = π(s, u)
Policy Gradient
• The gradient of a stochastic policy is given by
• The gradient of a deterministic policy is given by
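In the standard formulation used in David Silver's course material (which these slides adapt), with L(u) the expected total reward of a policy with weights u, the two gradients are:

$$\frac{\partial L(u)}{\partial u} = \mathbb{E}\left[\frac{\partial \log \pi(a \mid s, u)}{\partial u}\, Q^{\pi}(s, a)\right] \quad \text{(stochastic policy)}$$

$$\frac{\partial L(u)}{\partial u} = \mathbb{E}\left[\frac{\partial Q^{\pi}(s, a)}{\partial a}\, \frac{\partial a}{\partial u}\right] \quad \text{(deterministic policy, } a = \pi(s, u)\text{)}$$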
Actor-Critic (Value-Based + Policy-Based)
• Estimate the value function
• Update the policy parameters by SGD
  • Stochastic policy
  • Deterministic policy
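A minimal sketch of one stochastic actor-critic update in PyTorch; the toy network sizes, tensor shapes, and hyper-parameters are illustrative assumptions.

```python
import torch
import torch.nn as nn

actor = nn.Sequential(nn.Linear(4, 64), nn.ReLU(), nn.Linear(64, 2))   # logits over 2 actions
critic = nn.Sequential(nn.Linear(4, 64), nn.ReLU(), nn.Linear(64, 2))  # Q(s, a; w)
opt_actor = torch.optim.SGD(actor.parameters(), lr=1e-3)
opt_critic = torch.optim.SGD(critic.parameters(), lr=1e-3)

def actor_critic_step(s, a, r, s_next, gamma=0.99):
    # Critic: move Q(s, a; w) toward the TD target r + gamma * Q(s', a'; w).
    with torch.no_grad():
        a_next = torch.distributions.Categorical(logits=actor(s_next)).sample()
        td_target = r + gamma * critic(s_next).gather(1, a_next.unsqueeze(1)).squeeze(1)
    q_sa = critic(s).gather(1, a.unsqueeze(1)).squeeze(1)
    critic_loss = nn.functional.mse_loss(q_sa, td_target)
    opt_critic.zero_grad(); critic_loss.backward(); opt_critic.step()

    # Actor: ascend the policy gradient, weighting log pi(a|s) by the critic's Q(s, a).
    log_prob = torch.distributions.Categorical(logits=actor(s)).log_prob(a)
    actor_loss = -(log_prob * q_sa.detach()).mean()
    opt_actor.zero_grad(); actor_loss.backward(); opt_actor.step()
```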
Reinforcement Learning in Action
DRL4NLP: Overview of Applications
• Information Extraction
  • Narasimhan et al., EMNLP 2016
• Relational Reasoning
  • DeepPath (Xiong et al., EMNLP 2017)
• Sequence Learning
  • MIXER (Ranzato et al., ICLR 2016)
• Text Classification
  • Learning to Active Learn (Fang et al., EMNLP 2017)
  • Reinforced Co-Training (Wu et al., NAACL 2018)
  • Relation Classification (Qin et al., ACL 2018)
DRL4NLP: Overview of Applications
• Coreference Resolution
  • Clark and Manning (EMNLP 2016)
  • Yin et al. (ACL 2018)
• Summarization
  • Paulus et al. (ICLR 2018)
  • Celikyilmaz et al. (ACL 2018)
• Language and Vision
  • Video Captioning (Wang et al., CVPR 2018)
  • Visual-Language Navigation (Xiong et al., IJCAI 2018)
  • Model-Free + Model-Based RL (Wang et al., ECCV 2018)
Tutorial Outline
• Introduction
• Fundamentals and Overview (William Wang)
• Deep Reinforcement Learning for Dialog (Jiwei Li)
• Challenges (Xiaodong He)
• Conclusion
Fundamentals and Overview
• Why DRL4NLP?
• Important Directions of DRL4NLP
Why does one use (D)RL in NLP?
1. Learning to search and reason.
2. Instead of minimizing a surrogate loss (e.g., XE, hinge loss), optimize the end metric (e.g., BLEU, ROUGE) directly.
3. Select the right (unlabeled) data.
4. Back-propagate the reward to update the model.
Learning to Search and Reason
SEARN (Daumé III et al., 2009): Learning to Search
1. Use a good initial policy at training time to produce a sequence of actions (e.g., the choice of the next word).
2. A search algorithm is run to determine the optimal action at each time step.
3. A new classifier (a.k.a. policy) is trained to predict that action.
Reasoning on Knowledge Graph
Query node: Band of Brothers
Query relation: tvProgramLanguage
tvProgramLanguage(Band of Brothers, ?)
[Diagram: a knowledge-graph neighborhood of Band of Brothers, with relations such as tvProgramCreator, tvProgramGenre, writtenBy, music, countryOfOrigin, castActor, awardWorkWinner, profession, personLanguages, and serviceLocation⁻¹ linking entities such as HBO, Mini-Series, Graham Yost, Michael Kamen, United States, Neal McDonough, Tom Hanks, Actor, Caesars Entertainment, and English; the answer English can be reached by a multi-hop path, e.g., through castActor and personLanguages.]
DeepPath: DRL for KG Reasoning (Xiong et al., EMNLP 2017)
Components of MDP
• Markov decision process ⟨S, A, P, R⟩
  • S: continuous states represented with embeddings
  • A: action space (relations)
  • P(S_{t+1} = s′ | S_t = s, A_t = a): transition probability
  • R(s, a): reward received for each step taken
• With pretrained KG embeddings:
  • s_t = e_t ⊕ (e_target − e_t)
  • A = {r_1, r_2, …, r_n}, all relations in the KG
Reward Functions
• Global Accuracy
• Path Efficiency
• Path Diversity
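The exact reward terms are defined in the DeepPath paper; roughly, for a found path $p$ and the set $F$ of paths found so far, they take the form:

$$r_{\text{GLOBAL}} = \begin{cases} +1 & \text{if the agent reaches the target entity} \\ -1 & \text{otherwise} \end{cases} \qquad r_{\text{EFFICIENCY}} = \frac{1}{\text{length}(p)} \qquad r_{\text{DIVERSITY}} = -\frac{1}{|F|}\sum_{i=1}^{|F|} \cos(\mathbf{p}, \mathbf{p}_i)$$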
Training with Policy Gradient
• Monte-Carlo Policy Gradient (REINFORCE; Williams, 1992)
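A hedged sketch of how such a path-finding policy could be trained with REINFORCE, using the state construction from the MDP slide; the environment interface and network sizes are illustrative assumptions, not the released DeepPath code.

```python
import torch
import torch.nn as nn

EMB_DIM, NUM_RELATIONS = 100, 200                      # illustrative sizes
policy = nn.Sequential(nn.Linear(2 * EMB_DIM, 512), nn.ReLU(),
                       nn.Linear(512, NUM_RELATIONS))  # pi(relation | state)
optimizer = torch.optim.Adam(policy.parameters(), lr=1e-3)

def make_state(e_current, e_target):
    # s_t = e_t (+) (e_target - e_t), as defined on the MDP slide
    return torch.cat([e_current, e_target - e_current], dim=-1)

def reinforce_episode(env, max_steps=50):
    e_current, e_target = env.reset()                  # entity embeddings (assumed env API)
    log_probs = []
    for _ in range(max_steps):
        dist = torch.distributions.Categorical(logits=policy(make_state(e_current, e_target)))
        relation = dist.sample()                       # choose the next relation to follow
        log_probs.append(dist.log_prob(relation))
        e_current, done = env.step(relation.item())    # walk one hop in the KG
        if done:
            break
    reward = env.episode_reward()                      # e.g., global + efficiency + diversity terms
    loss = -reward * torch.stack(log_probs).sum()      # REINFORCE objective
    optimizer.zero_grad(); loss.backward(); optimizer.step()
    return reward
```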
Learning Data Selection Policy with DRL
DRL for Information Extraction (Narasimhan et al., EMNLP 2016)
Can DRL help select unlabeled data for semi-supervised text classification?
A Classic Example of Semi-Supervised Learning
• Co-Training (Blum and Mitchell, 1998)
Challenges
• The two classifiers in co-training have to be independent.
• Choosing highly-confident self-labeled examples could be suboptimal.
• Sampling bias shift is common.
Reinforced SSL
• Assumption: not all the unlabeled data are useful.
• Idea: performance-driven semi-supervised learning that learns an unlabeled-data selection policy with RL, instead of using random sampling (a minimal sketch of the loop follows below).
  1. Partition the unlabeled data space
  2. Train an RL agent to select useful unlabeled data
  3. Reward: the change in accuracy on the validation set
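A minimal sketch of the performance-driven selection loop described above; the agent and classifier interfaces are hypothetical stand-ins, not the authors' implementation.

```python
# Reward-driven selection of unlabeled data partitions (assumed interfaces).
def reinforced_data_selection(agent, classifier, unlabeled_partitions,
                              labeled_data, val_data, iterations=100):
    prev_acc = classifier.fit(labeled_data).evaluate(val_data)
    for _ in range(iterations):
        state = classifier.state_features()            # e.g., summary statistics of the current model
        k = agent.select(state, unlabeled_partitions)  # 2. pick a partition of unlabeled data
        pseudo = classifier.pseudo_label(unlabeled_partitions[k])
        acc = classifier.fit(labeled_data + pseudo).evaluate(val_data)
        reward = acc - prev_acc                        # 3. reward = change in validation accuracy
        agent.update(state, k, reward)                 # policy-gradient (or Q-learning) update
        prev_acc = acc
```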
Reinforced Co-Training (Wu et al., NAACL 2018)
Direct Optimization of the Metric
MIXER (Ranzato et al., ICLR 2016)
• Optimize the cross-entropy loss and the BLEU score directly using REINFORCE (Williams, 1992).
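A sketch of a sequence-level REINFORCE step that optimizes a BLEU-style reward directly (MIXER additionally anneals from cross-entropy training into this objective, which is omitted here). The model interface is an assumption; the BLEU call is NLTK's sentence_bleu.

```python
import torch
from nltk.translate.bleu_score import sentence_bleu

def reinforce_bleu_step(model, optimizer, src, ref_tokens):
    # Assumed interface: sample() returns generated tokens and their per-token log-probs.
    sampled_tokens, log_probs = model.sample(src)
    reward = sentence_bleu([ref_tokens], sampled_tokens)  # the end metric as the reward
    loss = -reward * torch.stack(log_probs).sum()         # REINFORCE gradient estimator
    optimizer.zero_grad(); loss.backward(); optimizer.step()
    return reward
```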
Backprop Reward via One-Step RL
One-Step Reinforcement Learning in Action
KBGAN: Learning to Generate High-Quality Negative Examples (Cai and Wang, NAACL 2018)
• Idea: use adversarial learning to iteratively learn better negative examples.
KBGAN: Overview
• Both G and D are KG embedding models.
• Input:
  • Pre-trained generator G with score function f_G(h, r, t)
  • Pre-trained discriminator D with score function f_D(h, r, t)
• Adversarial Learning:
  • Use softmax to score and rank negative triples.
  • Update D with the original positive examples and highly-ranked negative examples.
  • Pass the reward for the policy-gradient update of G.
• Output:
  • Adversarially trained KG embedding discriminator D.
KBGAN: Adversarial Negative Training
For each positive triple from the minibatch:
1. Generator: rank negative examples.
2. Discriminator: standard margin-based update.
KBGAN: One-Step RL for Updating the Generator
3. Compute the reward for the generator: r = −f_D(h′, r, t′).
4. Policy-gradient update for the generator.
  • The baseline b is the sum of rewards divided by the mini-batch size (i.e., the average reward).
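A sketch of this one-step policy-gradient update for G with the average-reward baseline; the generator/discriminator interfaces are assumptions consistent with the slide, not the released KBGAN code.

```python
import torch

def kbgan_generator_step(generator, discriminator, optimizer, pos_triples, num_neg=64):
    cached, rewards = [], []
    for (h, r, t) in pos_triples:
        # Assumed interface: sample one negative triple via softmax over candidates,
        # returning the sampled (h', r, t') and its log-probability under G.
        neg, log_prob = generator.sample_negative(h, r, t, num_neg)
        reward = -discriminator.score(*neg).detach()   # r = -f_D(h', r, t')
        rewards.append(reward)
        cached.append((reward, log_prob))
    baseline = sum(rewards) / len(rewards)             # b = average reward over the mini-batch
    loss = -sum((reward - baseline) * log_prob for reward, log_prob in cached)
    optimizer.zero_grad(); loss.backward(); optimizer.step()
```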
Important Directions in DRL
• Learning from Demonstration
• Hierarchical DRL
• Inverse DRL
• Sample Efficiency
Learning from Demonstration
Motivations
• Exploitation vs. exploration.
• Cold-start DRL is very challenging.
• Pre-training (a.k.a. demonstration) is common.
• Some entropy in an agent's policy is healthy.
Scheduled Policy Optimization (Xiong et al., IJCAI-ECAI 2018)
Hierarchical Deep Reinforcement Learning
Hierarchical Deep Reinforcement Learning for Video Captioning (Wang et al., CVPR 2018)
Inverse Deep Reinforcement Learning
No Metrics Are Perfect: Adversarial Reward Learning (Wang et al., ACL 2018)
• Task: visual storytelling (generate a story from a sequence of images in a photo album).
• Difficulty: how to quantify a good story?
• Idea: given a policy, learn the reward function.
When will IRL work?
• When the optimization target is complex.
• There are no easy formulations of the reward.
• If you can clearly define the reward, don’t use IRL (it will not work well in that case).
Improving Sample Efficiency: Combine Model-Free and Model-Based RL
Vision and Language Navigation (Anderson et al., CVPR 2018)
Look Before You Leap: Combine Model-Free and Model-Based RL for Look-Ahead Search (Wang et al., ECCV 2018)
Conclusion of Part I
• We provided a gentle introduction to DRL.
• We showed the current landscape of DRL4NLP research.
• What do (NLP) people use DRL for?
• Intriguing directions in DRL4NLP.
Open-sourced software:
• DeepPath: https://github.com/xwhan/DeepPath
• KBGAN: https://github.com/cai-lw/KBGAN
• Scheduled Policy Optimization: https://github.com/xwhan/walk_the_blocks
• AREL: https://github.com/littlekobe/AREL
Outline
• Introduction
• Fundamentals and Overview (William Wang)
• Deep Reinforcement Learning for Dialog (Jiwei Li)
• Challenges (Xiaodong He)
• Conclusion
Seq2Seq Models for Response Generation
[Diagram: a Seq2Seq model encodes the input message "how are you ?" and then decodes the response word by word ("I’m fine . EOS"), feeding each generated word back in as the next decoder input.]
(Sutskever et al., 2014; Jean et al., 2014; Luong et al., 2015)
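A minimal sketch of the decoding loop pictured above; the encoder/decoder objects, vocabulary lookup, and eos handling are illustrative assumptions.

```python
import torch

def generate_response(encoder, decoder, vocab, message_ids, max_len=20):
    state = encoder(message_ids)                      # encode "how are you ?"
    token = vocab["<eos>"]                            # decoding starts from the eos/start symbol
    response = []
    for _ in range(max_len):
        logits, state = decoder(token, state)         # predict the next word given the history
        token = int(torch.argmax(logits, dim=-1))     # greedy choice ("I'm", "fine", ".", ...)
        if token == vocab["<eos>"]:
            break
        response.append(token)
    return response
```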
Issues
How do we handle long-term dialogue success?
• Problem 1: Repetitive responses.
Two bots talk with each other
Repetitive responses:
  Shut up !
  No, you shut up !
  No, you shut up !
  No, you shut up !
  ……

  See you later !
  See you later !
  See you later !
  See you later !
  ……
Issues
How do we handle long-term dialogue success?
• Problem 1: Repetitive responses.
• Problem 2: Short-sighted conversation decisions.
Short-sighted conversation decisions:
  How old are you ?
  i 'm 16 .
  16 ?   (a bad action)
  i don 't know what you 're talking about
  you don 't know what you 're saying
  i don 't know what you 're talking about
  you don 't know what you 're saying
  (The outcome of the bad action does not emerge until a few turns later.)
Reinforcement Learning
Notations: State
• State: the encoding of the preceding dialogue message, e.g., "How old are you ?".
Notations: Action
• Action: the response to generate, e.g., "i 'm 16 ." in reply to "How old are you ?".
Reward
1. Ease of answering: avoid turns that are likely to be answered with a dull response such as "I don’t know what you are talking about".
2. Information flow: penalize repetitive exchanges such as "See you later !" / "See you later !" / "See you later !" / ……
3. Meaningfulness: the response should be coherent with the message (e.g., message "How old are you ?", response "i 'm 16 .").
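In Li et al. (2016) these three terms take roughly the following form, where $a$ is the candidate response, $q_i$ and $p_i$ are the preceding turns, $S$ is a hand-picked set of dull responses, and $h$ denotes a turn's hidden representation:

$$r_1 = -\frac{1}{N_S}\sum_{s \in S} \frac{1}{N_s} \log P_{\text{seq2seq}}(s \mid a)$$

$$r_2 = -\log \cos\big(h_{p_i}, h_{p_{i+1}}\big)$$

$$r_3 = \frac{1}{N_a} \log P_{\text{seq2seq}}(a \mid q_i, p_i) + \frac{1}{N_{q_i}} \log P^{\text{backward}}_{\text{seq2seq}}(q_i \mid a)$$

with the total reward a weighted sum $r = \lambda_1 r_1 + \lambda_2 r_2 + \lambda_3 r_3$.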
Simulation
Starting from a message sampled from the training set, the agent encodes the dialogue history and decodes the next turn, then encodes the new history and decodes again, producing simulated turns S1, S2, …, Sn over N turns. The accumulated reward R(S1, S2, …, Sn) is computed over the whole simulated dialogue, and the policy is trained with the REINFORCE algorithm (Williams, 1992); a sketch of this loop follows below.
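A minimal sketch of this simulation-and-update loop; the agent interface (generate, dialogue_reward) is an assumption for illustration.

```python
import torch

def simulate_and_update(agent, optimizer, seed_message, num_turns=5):
    history, log_probs = [seed_message], []
    for _ in range(num_turns):
        # Encode the conversation so far, decode the next turn, keep its log-probability.
        turn, turn_log_prob = agent.generate(history)
        log_probs.append(turn_log_prob)
        history.append(turn)
    # R(S1, ..., Sn): e.g., ease of answering + information flow + meaningfulness.
    reward = agent.dialogue_reward(history)
    loss = -reward * torch.stack(log_probs).sum()   # REINFORCE update
    optimizer.zero_grad(); loss.backward(); optimizer.step()
    return reward
```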
Evaluation
| | RL-win | RL-lose | Tie |
|---|---|---|---|
| Single-Turn Quality | 0.40 | 0.36 | 0.24 |
| Multi-Turn Quality | 0.72 | 0.12 | 0.16 |
Results
| Input | Mutual Information | The RL model |
|---|---|---|
| How old are you ? | I’m 16. | I’m 16. why are you asking ? |
| what is your full name ? | i have no idea | what 's yours ? |
| I don 't want to go home tonight . | Really ? | Why ? |
| Do you have any feelings for me ? | I don’t know what you are talking about. | Would I see you if I didn 't ? |
Reward for Good Dialogue
1. Easy to answer
2. Information Flow
3. Meaningfulness
What Should Rewards for Good Dialogue Be Like?
Turing Test as the Reward for Good Dialogue
A human evaluator (judge) reads the message "How old are you ?" and the candidate responses: "I’m 25." is judged human-generated with high probability (P = 90%), while "I don’t know what you are talking about" is judged human-generated with low probability (P = 10%). This probability of being human-generated serves as the reward.
Adversarial Learning in Image Generation (Goodfellow et al., 2014)
Model Breakdown
• Generative model (G): a Seq2Seq model that encodes the message ("how are you ?") and decodes the response ("I’m fine . EOS").
• Discriminative model (D): reads the message together with the generated response and outputs the probability that the response is human-generated (e.g., P = 90%).
• This probability is passed back to the generator as its reward.
Policy Gradient
• The generator (the Seq2Seq model) is updated with the REINFORCE algorithm (Williams, 1992), using the discriminator's output as the reward.
Adversarial Learning for Neural Dialogue Generation
• Update the discriminator.
• Update the generator (a minimal training-loop sketch follows below).
• The discriminator forces the generator to produce correct responses.
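A minimal sketch of the alternating updates; all interfaces (sample, loss, prob_human) are assumptions for illustration, not the paper's code.

```python
import torch

def adversarial_step(generator, discriminator, g_opt, d_opt, message, human_response):
    # 1. Update the discriminator on one human example and one machine example.
    fake_response, log_prob = generator.sample(message)   # assumed: response + its log-probability
    d_loss = (discriminator.loss(message, human_response, label=1) +
              discriminator.loss(message, fake_response, label=0))
    d_opt.zero_grad(); d_loss.backward(); d_opt.step()

    # 2. Update the generator: the reward is D's probability that the response is human-generated.
    reward = discriminator.prob_human(message, fake_response).detach()
    g_loss = -reward * log_prob                            # REINFORCE update for the generator
    g_opt.zero_grad(); g_loss.backward(); g_opt.step()
```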
Human Evaluation
The previous RL model only performs better on multi-turn conversations.
Results: Adversarial Learning Improves Response Generation
Human evaluation vs. a vanilla generation model:

| Adversarial Win | Adversarial Lose | Tie |
|---|---|---|
| 62% | 18% | 20% |
Sample responseTell me ... how long have you had this falling sickness ?
System Response
129
![Page 130: Deep Reinforcement Learning for Natural Language …william/papers/ACL2018DRL4NLP.pdf · • In NLP tasks, usually the action space is unbounded since it is characterized by natural](https://reader030.fdocuments.net/reader030/viewer/2022020204/5b5a343c7f8b9ad0048e1abd/html5/thumbnails/130.jpg)
Sample response: Tell me ... how long have you had this falling sickness?

System Response
Vanilla-Seq2Seq: I don't know what you are talking about.
130
![Page 131: Deep Reinforcement Learning for Natural Language …william/papers/ACL2018DRL4NLP.pdf · • In NLP tasks, usually the action space is unbounded since it is characterized by natural](https://reader030.fdocuments.net/reader030/viewer/2022020204/5b5a343c7f8b9ad0048e1abd/html5/thumbnails/131.jpg)
Sample response: Tell me ... how long have you had this falling sickness?

System Response
Vanilla-Seq2Seq: I don't know what you are talking about.
Mutual Information: I'm not a doctor.
131
![Page 132: Deep Reinforcement Learning for Natural Language …william/papers/ACL2018DRL4NLP.pdf · • In NLP tasks, usually the action space is unbounded since it is characterized by natural](https://reader030.fdocuments.net/reader030/viewer/2022020204/5b5a343c7f8b9ad0048e1abd/html5/thumbnails/132.jpg)
Sample response: Tell me ... how long have you had this falling sickness?

System Response
Vanilla-Seq2Seq: I don't know what you are talking about.
Mutual Information: I'm not a doctor.
Adversarial Learning: A few months, I guess.
132
![Page 133: Deep Reinforcement Learning for Natural Language …william/papers/ACL2018DRL4NLP.pdf · • In NLP tasks, usually the action space is unbounded since it is characterized by natural](https://reader030.fdocuments.net/reader030/viewer/2022020204/5b5a343c7f8b9ad0048e1abd/html5/thumbnails/133.jpg)
Outline
• Introduction• Fundamentals and Overview (William Wang)• Deep Reinforcement Learning for Dialog (Jiwei Li)• Frontiers and Challenges (Xiaodong He)• Conclusion
133
![Page 134: Deep Reinforcement Learning for Natural Language …william/papers/ACL2018DRL4NLP.pdf · • In NLP tasks, usually the action space is unbounded since it is characterized by natural](https://reader030.fdocuments.net/reader030/viewer/2022020204/5b5a343c7f8b9ad0048e1abd/html5/thumbnails/134.jpg)
Frontiers and Challenges
• NLP problems that present new challenges to RL
  • An unbounded action space defined by natural language
  • Dealing with combinatorial actions and external knowledge
  • Learning reward functions for NLG
• RL problems that are particularly relevant to NLP
  • Sample complexity
  • Model-based vs. model-free RL
  • Acquiring rewards
134
![Page 135: Deep Reinforcement Learning for Natural Language …william/papers/ACL2018DRL4NLP.pdf · • In NLP tasks, usually the action space is unbounded since it is characterized by natural](https://reader030.fdocuments.net/reader030/viewer/2022020204/5b5a343c7f8b9ad0048e1abd/html5/thumbnails/135.jpg)
Consider a Sequential Decision Making Problem in NLP
• E.g., playing text-based games, webpage navigation, task completion, …
• At time t:
  • The agent observes the state as a string of text, e.g., the state-text $s_t$
  • The agent also knows a set of possible actions, each described as a string of text, e.g., the action-texts
  • The agent tries to understand the "state text" and all possible "action texts", and takes the right action – to maximize the long-term reward
  • Then the environment transitions to a new state, the agent receives an immediate reward, and time moves on to t+1 (see the interaction-loop sketch below)
135
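A minimal sketch of this interaction loop, assuming a hypothetical text environment whose reset()/step() methods exchange strings; the environment, its action texts, and the random agent are all toy stand-ins for illustration.

```python
import random

class ToyTextEnv:
    """Stand-in text environment: one state text and two fixed action texts."""
    def reset(self):
        return "You are standing at a fork in the road.", ["Go left.", "Go right."]

    def step(self, action_text):
        reward = 1.0 if action_text == "Go right." else -1.0
        done = True
        return "The path ends here.", [], reward, done

def random_agent(state_text, action_texts):
    # A real agent would score every action text against the state text.
    return random.choice(action_texts)

env = ToyTextEnv()
state_text, action_texts = env.reset()
total_reward, done = 0.0, False
while not done:
    action = random_agent(state_text, action_texts)
    state_text, action_texts, reward, done = env.step(action)
    total_reward += reward
print("episode reward:", total_reward)
```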
![Page 136: Deep Reinforcement Learning for Natural Language …william/papers/ACL2018DRL4NLP.pdf · • In NLP tasks, usually the action space is unbounded since it is characterized by natural](https://reader030.fdocuments.net/reader030/viewer/2022020204/5b5a343c7f8b9ad0048e1abd/html5/thumbnails/136.jpg)
RL for Natural Language Understanding Tasks
• Reinforcement learning (RL) with a natural language state and action space
• Applications such as text games, webpage navigation, dialog systems
• Challenging because the potential state and action spaces are large and sparse
• An example: text-based game
State text
Action texts
136
![Page 137: Deep Reinforcement Learning for Natural Language …william/papers/ACL2018DRL4NLP.pdf · • In NLP tasks, usually the action space is unbounded since it is characterized by natural](https://reader030.fdocuments.net/reader030/viewer/2022020204/5b5a343c7f8b9ad0048e1abd/html5/thumbnails/137.jpg)
DQN for RL in NLP
• LSTM-DQN
  • The state is represented by a continuous vector (computed by an LSTM)
  • Actions and objects are treated as independent symbols
  • Tested on a MUD-style text-based game playing benchmark
Narasimhan, K., Kulkarni, T. and Barzilay, R., 2015. Language understanding for text-based games using deep reinforcement learning. EMNLP. 137
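A hedged sketch of the LSTM-DQN idea: the state text is encoded by an LSTM into a vector, and separate linear Q-heads score the action and object symbols. The vocabulary, layer sizes, and mean-pooling below are assumptions for illustration, not the paper's exact configuration.

```python
import torch
import torch.nn as nn

class LSTMDQN(nn.Module):
    """Sketch of an LSTM-DQN-style model: LSTM state encoder plus independent
    Q-heads over action symbols and object symbols (sizes are assumptions)."""
    def __init__(self, vocab_size=1000, embed_dim=64, hidden_dim=128,
                 num_actions=10, num_objects=20):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.q_action = nn.Linear(hidden_dim, num_actions)
        self.q_object = nn.Linear(hidden_dim, num_objects)

    def forward(self, state_tokens):
        emb = self.embed(state_tokens)              # (batch, seq, embed_dim)
        out, _ = self.lstm(emb)                     # (batch, seq, hidden_dim)
        state_vec = out.mean(dim=1)                 # pool over tokens
        return self.q_action(state_vec), self.q_object(state_vec)

model = LSTMDQN()
state = torch.randint(0, 1000, (1, 12))             # a fake tokenized state text
q_act, q_obj = model(state)
print(q_act.argmax(dim=1), q_obj.argmax(dim=1))      # greedy action and object
```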
![Page 138: Deep Reinforcement Learning for Natural Language …william/papers/ACL2018DRL4NLP.pdf · • In NLP tasks, usually the action space is unbounded since it is characterized by natural](https://reader030.fdocuments.net/reader030/viewer/2022020204/5b5a343c7f8b9ad0048e1abd/html5/thumbnails/138.jpg)
Unbounded action space in RL for NLP
Not only is the state space huge; the action space is huge, too – each action is characterized by an unbounded natural language description.
138
![Page 139: Deep Reinforcement Learning for Natural Language …william/papers/ACL2018DRL4NLP.pdf · • In NLP tasks, usually the action space is unbounded since it is characterized by natural](https://reader030.fdocuments.net/reader030/viewer/2022020204/5b5a343c7f8b9ad0048e1abd/html5/thumbnails/139.jpg)
The Reinforcement Learning for NL Problem
• RL for text understanding
  • Unbounded state and action spaces (both given as text)
  • Time-varying feasible action set
    • At each time step, the actions are different texts.
    • At each time step, the number of actions is different.
[Figure: at each step the state text $s_t$ and each candidate action text $a_t^{(1)}, \dots, a_t^{(|A_t|)}$ are scored by a relevance function; the environment then transitions via $p(s_{t+1} \mid s_t, a_t)$ and emits a reward $r_t$]
139
![Page 140: Deep Reinforcement Learning for Natural Language …william/papers/ACL2018DRL4NLP.pdf · • In NLP tasks, usually the action space is unbounded since it is characterized by natural](https://reader030.fdocuments.net/reader030/viewer/2022020204/5b5a343c7f8b9ad0048e1abd/html5/thumbnails/140.jpg)
Baselines: Variants of Deep Q-Network
• Q-function: a single deep neural network as the function approximator
  • Input: concatenated state-action text (BoW)
  • Output: Q-values for the different actions
• Two variants: Max-action DQN and Per-action DQN → take the max over the candidate actions $a_t^{(i)}$
140
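A hedged sketch of the per-action baseline: a small Q-network over the concatenated state/action bag-of-words features, evaluated once per candidate action and then maximized. The tiny vocabulary, random (untrained) weights, and network shape are assumptions for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)
VOCAB = ["you", "move", "forward", "look", "up", "ignore", "alarm", "people", "flee"]

def bow(text):
    """Bag-of-words vector over a tiny assumed vocabulary."""
    words = text.lower().split()
    return np.array([words.count(w) for w in VOCAB], dtype=float)

# One-hidden-layer Q-network over concatenated state/action BoW features
# (weights are random here; training would use the usual DQN loss).
W1 = rng.normal(size=(2 * len(VOCAB), 16)); b1 = np.zeros(16)
W2 = rng.normal(size=(16, 1)); b2 = np.zeros(1)

def q_value(state_text, action_text):
    x = np.concatenate([bow(state_text), bow(action_text)])
    h = np.tanh(x @ W1 + b1)
    return float(h @ W2 + b2)

state = "people flee you move forward"
actions = ["ignore alarm move forward", "look up"]
q_values = [q_value(state, a) for a in actions]        # per-action Q-values
best = int(np.argmax(q_values))                        # max over candidate actions
print(q_values, "->", actions[best])
```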
![Page 141: Deep Reinforcement Learning for Natural Language …william/papers/ACL2018DRL4NLP.pdf · • In NLP tasks, usually the action space is unbounded since it is characterized by natural](https://reader030.fdocuments.net/reader030/viewer/2022020204/5b5a343c7f8b9ad0048e1abd/html5/thumbnails/141.jpg)
Deep Reinforcement Relevance Network (DRRN)
• Similar to the DSSM (deep structured semantic model): project both s and a into a continuous space
  • Separate state and action embeddings
  • Interaction in the embedding space
• Motivation:
  – Language is different in these two contexts.
  – Text similarity does NOT always lead to the best action.
[Huang, He, Gao, Deng, Acero, Heck, 2013. "Learning Deep Structured Semantic Models for Web Search using Clickthrough Data," CIKM]; [He, Chen, He, Gao, Li, Deng, Ostendorf, 2016. "Deep Reinforcement Learning with a Natural Language Action Space," ACL] 141
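A hedged sketch of the DRRN idea: separate networks embed the state text and each candidate action text into a common space, and Q(s, a) is their inner product (relevance). The bag-of-words featurization and layer sizes are assumptions; the cited papers describe the actual architectures.

```python
import torch
import torch.nn as nn

class DRRN(nn.Module):
    """Sketch of a DRRN-style model: separate state and action embedding
    networks, with Q(s, a) given by an inner product in the shared space."""
    def __init__(self, vocab_size=500, hidden_dim=64):
        super().__init__()
        self.state_net = nn.Sequential(nn.Linear(vocab_size, hidden_dim), nn.ReLU(),
                                       nn.Linear(hidden_dim, hidden_dim))
        self.action_net = nn.Sequential(nn.Linear(vocab_size, hidden_dim), nn.ReLU(),
                                        nn.Linear(hidden_dim, hidden_dim))

    def forward(self, state_bow, action_bows):
        s = self.state_net(state_bow)                    # (hidden_dim,)
        a = self.action_net(action_bows)                 # (num_actions, hidden_dim)
        return a @ s                                      # one Q-value per action

model = DRRN()
state = torch.rand(500)                                   # fake state BoW features
actions = torch.rand(3, 500)                              # 3 candidate action texts
q = model(state, actions)
print(q, "-> pick action", q.argmax().item())
```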
![Page 142: Deep Reinforcement Learning for Natural Language …william/papers/ACL2018DRL4NLP.pdf · • In NLP tasks, usually the action space is unbounded since it is characterized by natural](https://reader030.fdocuments.net/reader030/viewer/2022020204/5b5a343c7f8b9ad0048e1abd/html5/thumbnails/142.jpg)
Reflection: DRRN
• Prior DQN work (e.g., Atari games, AlphaGo): state space unbounded, action space bounded.
• In NLP tasks, usually the action space is unbounded since it is characterized by natural language, which is discrete and nearly unconstrained.
• New DRRN (Deep Reinforcement Relevance Network):
  • Project both the state and the action into a continuous space
  • The Q-function is a relevance function of the state vector and the action vector
142
![Page 143: Deep Reinforcement Learning for Natural Language …william/papers/ACL2018DRL4NLP.pdf · • In NLP tasks, usually the action space is unbounded since it is characterized by natural](https://reader030.fdocuments.net/reader030/viewer/2022020204/5b5a343c7f8b9ad0048e1abd/html5/thumbnails/143.jpg)
Experiments: Tasks
• Two text games
• Hand-annotated rewards for distinct endings
• Simulators available at: https://github.com/jvking/text-games
143
![Page 144: Deep Reinforcement Learning for Natural Language …william/papers/ACL2018DRL4NLP.pdf · • In NLP tasks, usually the action space is unbounded since it is characterized by natural](https://reader030.fdocuments.net/reader030/viewer/2022020204/5b5a343c7f8b9ad0048e1abd/html5/thumbnails/144.jpg)
Experiments• Tasks: Text Games/Interactive Fictions
Task 2: “Machine of Death”
Task 1: “Save John”
144
![Page 145: Deep Reinforcement Learning for Natural Language …william/papers/ACL2018DRL4NLP.pdf · • In NLP tasks, usually the action space is unbounded since it is characterized by natural](https://reader030.fdocuments.net/reader030/viewer/2022020204/5b5a343c7f8b9ad0048e1abd/html5/thumbnails/145.jpg)
Learning curve: DRRN vs. DQN
145
![Page 146: Deep Reinforcement Learning for Natural Language …william/papers/ACL2018DRL4NLP.pdf · • In NLP tasks, usually the action space is unbounded since it is characterized by natural](https://reader030.fdocuments.net/reader030/viewer/2022020204/5b5a343c7f8b9ad0048e1abd/html5/thumbnails/146.jpg)
Experiments: Final Performance
[Bar charts: final rewards on Game 1 "Saving John" and Game 2 "Machine of Death", comparing PADQN (L=1, 2), MADQN (L=1, 2), and DRRN (L=1, 2) at n_hidden = 20, 50, 100]
The DRRN performs consistently better than all baselines, and often with a lower variance. Big gain from having separate state & action embedding spaces (DQN vs. DRRN).
146
![Page 147: Deep Reinforcement Learning for Natural Language …william/papers/ACL2018DRL4NLP.pdf · • In NLP tasks, usually the action space is unbounded since it is characterized by natural](https://reader030.fdocuments.net/reader030/viewer/2022020204/5b5a343c7f8b9ad0048e1abd/html5/thumbnails/147.jpg)
Visualization of the learned continuous space
147
![Page 148: Deep Reinforcement Learning for Natural Language …william/papers/ACL2018DRL4NLP.pdf · • In NLP tasks, usually the action space is unbounded since it is characterized by natural](https://reader030.fdocuments.net/reader030/viewer/2022020204/5b5a343c7f8b9ad0048e1abd/html5/thumbnails/148.jpg)
Experiments: Generalization
• In the testing stage, use unseen paraphrased actions
[Bar chart: test rewards on Game 2 "Machine of Death" with unseen paraphrased actions, comparing PADQN (L=2), MADQN (L=2), and DRRN (L=2) at n_hidden = 20, 50, 100]
148
![Page 149: Deep Reinforcement Learning for Natural Language …william/papers/ACL2018DRL4NLP.pdf · • In NLP tasks, usually the action space is unbounded since it is characterized by natural](https://reader030.fdocuments.net/reader030/viewer/2022020204/5b5a343c7f8b9ad0048e1abd/html5/thumbnails/149.jpg)
Q-function example values after convergence

| Text (with predicted Q-values) | |
| --- | --- |
| State | As you move forward, the people surrounding you suddenly look up with terror in their faces, and flee the street. |
| Actions in the original game | Ignore the alarm of others and continue moving forward. (-21.5) / Look up. (16.6) |
| Paraphrased actions (not original) | Disregard the caution of others and keep pushing ahead. (-11.9) / Turn up and look. (17.5) |
| Fake actions (not original) | Stay there. (2.8) / Stay calmly. (2.0) / Screw it. I'm going carefully. (-17.4) / Yell at everyone. (-13.5) / Insert a coin. (-1.4) / Throw a coin to the ground. (-3.6) |
149
![Page 150: Deep Reinforcement Learning for Natural Language …william/papers/ACL2018DRL4NLP.pdf · • In NLP tasks, usually the action space is unbounded since it is characterized by natural](https://reader030.fdocuments.net/reader030/viewer/2022020204/5b5a343c7f8b9ad0048e1abd/html5/thumbnails/150.jpg)
From games to large-scale real-world scenarios

• Task: Build an agent that runs on the real-world Reddit dataset (https://www.reddit.com/), reads Reddit posts, and recommends threads in real time with the most future popularity
• Approach:• RL with specially designed Q-function for combinatorial action spaces
150
![Page 151: Deep Reinforcement Learning for Natural Language …william/papers/ACL2018DRL4NLP.pdf · • In NLP tasks, usually the action space is unbounded since it is characterized by natural](https://reader030.fdocuments.net/reader030/viewer/2022020204/5b5a343c7f8b9ad0048e1abd/html5/thumbnails/151.jpg)
Motivation
• We consider Reddit popularity prediction, which differs from newsfeed recommendation in two respects:
• Making recommendations based on the anticipated long-term interest level of a broad group of readers from a target community, rather than for individuals.
• Community interest level is often not immediately clear -- there is a time lag before the level of interest starts to take off. Here, the goal is recommendation in real time – attempting to identify hot updates before they become hot, to keep the reader at the leading edge.
151
![Page 152: Deep Reinforcement Learning for Natural Language …william/papers/ACL2018DRL4NLP.pdf · • In NLP tasks, usually the action space is unbounded since it is characterized by natural](https://reader030.fdocuments.net/reader030/viewer/2022020204/5b5a343c7f8b9ad0048e1abd/html5/thumbnails/152.jpg)
Solution
• The problem fits the reinforcement learning paradigm
• Combinatorial action space
  • A sub-action is a post
  • An action is a set of interdependent documents
• Two problems: i) potentially high computational complexity, ii) estimating the long-term reward (the Q-value in reinforcement learning) from a combination of sub-actions characterized by natural language.
• The paper focuses on (ii).
152
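A hedged sketch of how the Q-value of a combinatorial action (a set of sub-action posts) could be composed from sub-action embeddings, in the spirit of the DRRN-Sum and DRRN-BiLSTM variants discussed later in this section. The featurization, layer sizes, and pooling choices below are my assumptions, not the paper's exact model.

```python
import torch
import torch.nn as nn

class CombinatorialQ(nn.Module):
    """Sketch: embed the state and each sub-action (post); combine sub-action
    embeddings either by summation (DRRN-Sum-style) or with a BiLSTM
    (DRRN-BiLSTM-style), then take an inner product with the state vector."""
    def __init__(self, feat_dim=100, hidden_dim=32, mode="bilstm"):
        super().__init__()
        self.mode = mode
        self.state_net = nn.Linear(feat_dim, hidden_dim)
        self.sub_net = nn.Linear(feat_dim, hidden_dim)
        self.bilstm = nn.LSTM(hidden_dim, hidden_dim // 2, batch_first=True,
                              bidirectional=True)

    def forward(self, state_feat, sub_action_feats):
        s = self.state_net(state_feat)                       # (hidden_dim,)
        subs = self.sub_net(sub_action_feats)                # (k, hidden_dim)
        if self.mode == "sum":
            combined = subs.sum(dim=0)                       # order-insensitive combination
        else:
            out, _ = self.bilstm(subs.unsqueeze(0))          # models sub-action interdependency
            combined = out.squeeze(0).mean(dim=0)
        return combined @ s                                   # scalar Q(s, {a_1, ..., a_k})

model = CombinatorialQ(mode="bilstm")
state = torch.rand(100)                                       # e.g., previously recommended comments
sub_actions = torch.rand(3, 100)                              # a candidate set of 3 new comments
print(model(state, sub_actions))
```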
![Page 153: Deep Reinforcement Learning for Natural Language …william/papers/ACL2018DRL4NLP.pdf · • In NLP tasks, usually the action space is unbounded since it is characterized by natural](https://reader030.fdocuments.net/reader030/viewer/2022020204/5b5a343c7f8b9ad0048e1abd/html5/thumbnails/153.jpg)
Problem Setting
• Registered Reddit users initiate a post and people respond with comments. Together, the comments and the original post form a discussion tree.
• Comments (and posts) are associated with positive and negative votes (i.e., likes and dislikes) that are combined to get a karma score, which can be used as a measure for popularity.
• As in Fig. 1, it is quite common that a lower-karma comment will lead to more children and more popular comments in the future (e.g., "true dat").
• In a real-time comment recommendation system, the eventual karma of a comment is not immediately available, so prediction of popularity is based on the text in the comment in the context of prior comments in the subtree and other comments in the current time window.
153
![Page 154: Deep Reinforcement Learning for Natural Language …william/papers/ACL2018DRL4NLP.pdf · • In NLP tasks, usually the action space is unbounded since it is characterized by natural](https://reader030.fdocuments.net/reader030/viewer/2022020204/5b5a343c7f8b9ad0048e1abd/html5/thumbnails/154.jpg)
Solution
• State
  • The collection of comments previously recommended.
• Action
  • Picking a new set of comments. Note that we only consider new comments associated with the threads of the discussion that we are currently following, with the assumption that prior context is needed to interpret the comments.
• Reward
  • Long-term Reddit voting scores, e.g., karma scores after the thread settles down.
• Environment
  • The partially observed discussion tree
154
![Page 155: Deep Reinforcement Learning for Natural Language …william/papers/ACL2018DRL4NLP.pdf · • In NLP tasks, usually the action space is unbounded since it is characterized by natural](https://reader030.fdocuments.net/reader030/viewer/2022020204/5b5a343c7f8b9ad0048e1abd/html5/thumbnails/155.jpg)
Model
[He, Ostendorf, He, Chen, Gao, Li, Deng, 2016. “Deep Reinforcement Learning with a Combinatorial Action Space for Predicting Popular Reddit Threads,” EMNLP]
155
![Page 156: Deep Reinforcement Learning for Natural Language …william/papers/ACL2018DRL4NLP.pdf · • In NLP tasks, usually the action space is unbounded since it is characterized by natural](https://reader030.fdocuments.net/reader030/viewer/2022020204/5b5a343c7f8b9ad0048e1abd/html5/thumbnails/156.jpg)
Experiments
• Data and stats
156
![Page 157: Deep Reinforcement Learning for Natural Language …william/papers/ACL2018DRL4NLP.pdf · • In NLP tasks, usually the action space is unbounded since it is characterized by natural](https://reader030.fdocuments.net/reader030/viewer/2022020204/5b5a343c7f8b9ad0048e1abd/html5/thumbnails/157.jpg)
Results
• On the askscience sub-reddit
157
![Page 158: Deep Reinforcement Learning for Natural Language …william/papers/ACL2018DRL4NLP.pdf · • In NLP tasks, usually the action space is unbounded since it is characterized by natural](https://reader030.fdocuments.net/reader030/viewer/2022020204/5b5a343c7f8b9ad0048e1abd/html5/thumbnails/158.jpg)
Example
• In Table 6, when the second sub-action is combined with the first, compared to choosing just the first sub-action alone, DRRN-Sum and DRRN-BiLSTM predict an 86% and a 26% relative increase in action-value, respectively. Since these two sub-actions are highly redundant, we hypothesize that DRRN-BiLSTM is better than DRRN-Sum at capturing the interdependency between sub-actions.
158
![Page 159: Deep Reinforcement Learning for Natural Language …william/papers/ACL2018DRL4NLP.pdf · • In NLP tasks, usually the action space is unbounded since it is characterized by natural](https://reader030.fdocuments.net/reader030/viewer/2022020204/5b5a343c7f8b9ad0048e1abd/html5/thumbnails/159.jpg)
Results on more sub-reddit domains

Average karma score gains over the baseline and standard deviation across different subreddits (N = 10, K = 3)
159
![Page 160: Deep Reinforcement Learning for Natural Language …william/papers/ACL2018DRL4NLP.pdf · • In NLP tasks, usually the action space is unbounded since it is characterized by natural](https://reader030.fdocuments.net/reader030/viewer/2022020204/5b5a343c7f8b9ad0048e1abd/html5/thumbnails/160.jpg)
Incorporating External Knowledge
• In many NLP tasks, such as Reddit post understanding, external knowledge (such as world knowledge) is helpful
• How to incorporate that knowledge into an RL framework is an interesting question
• How to retrieve complementary knowledge to enrich the state?
160
![Page 161: Deep Reinforcement Learning for Natural Language …william/papers/ACL2018DRL4NLP.pdf · • In NLP tasks, usually the action space is unbounded since it is characterized by natural](https://reader030.fdocuments.net/reader030/viewer/2022020204/5b5a343c7f8b9ad0048e1abd/html5/thumbnails/161.jpg)
Reinforcement Learning with External Knowledge
[He, J., Ostendorf, M. and He, X., 2017. Reinforcement Learning with External Knowledge and Two-Stage Q-functions for Predicting Popular Reddit Threads. arXiv:1704.06217.]
• Retrieve external knowledge to augment the state-side representation
• An attention-like approach is used
• Not content-based retrieval, but event-based knowledge retrieval
• Event features:
  • Timing feature
  • Semantic similarity
  • Popularity
161
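A hedged sketch of attention-like, event-based knowledge retrieval using the three event features listed above. The linear scoring, the softmax pooling, and the feature weights are my assumptions; the paper's actual scoring function may differ.

```python
import numpy as np

def retrieve_knowledge(candidates, weights=(1.0, 1.0, 1.0)):
    """Score each external-knowledge candidate with an attention-like weight
    built from event features (timing, semantic similarity, popularity), then
    return a weighted combination of their vector representations."""
    feats = np.array([[c["timing"], c["semantic_sim"], c["popularity"]]
                      for c in candidates])                  # (n, 3) event features
    scores = feats @ np.array(weights)                       # attention logits
    attn = np.exp(scores - scores.max())
    attn /= attn.sum()                                       # softmax over candidates
    vecs = np.stack([c["vector"] for c in candidates])       # (n, d) knowledge vectors
    return attn, attn @ vecs                                  # weights and pooled vector

candidates = [
    {"timing": 0.9, "semantic_sim": 0.2, "popularity": 0.7, "vector": np.ones(4)},
    {"timing": 0.1, "semantic_sim": 0.9, "popularity": 0.3, "vector": np.zeros(4)},
]
attn, knowledge_vec = retrieve_knowledge(candidates)
print(attn, knowledge_vec)   # the pooled vector would augment the state representation
```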
![Page 162: Deep Reinforcement Learning for Natural Language …william/papers/ACL2018DRL4NLP.pdf · • In NLP tasks, usually the action space is unbounded since it is characterized by natural](https://reader030.fdocuments.net/reader030/viewer/2022020204/5b5a343c7f8b9ad0048e1abd/html5/thumbnails/162.jpg)
Incorporating external knowledge
DRRN (with different ways of incorporating knowledge) performance gains over baseline DRRN (without external knowledge) across 5 different subreddits
• External knowledge helps in general.
• The most useful knowledge is not necessarily the most "semantically similar" knowledge!
• Event-based knowledge retrieval is effective
162
![Page 163: Deep Reinforcement Learning for Natural Language …william/papers/ACL2018DRL4NLP.pdf · • In NLP tasks, usually the action space is unbounded since it is characterized by natural](https://reader030.fdocuments.net/reader030/viewer/2022020204/5b5a343c7f8b9ad0048e1abd/html5/thumbnails/163.jpg)
Examples
163
![Page 164: Deep Reinforcement Learning for Natural Language …william/papers/ACL2018DRL4NLP.pdf · • In NLP tasks, usually the action space is unbounded since it is characterized by natural](https://reader030.fdocuments.net/reader030/viewer/2022020204/5b5a343c7f8b9ad0048e1abd/html5/thumbnails/164.jpg)
RL in Long Text Generation Tasks
The challenges:
• Multi-sentence
• Weak correspondence between input and output
• Structured language requires the correct order of events and awareness of state changes!
Kiddon, Zettlemoyer, Choi. 2016. "Globally coherent text generation with neural checklist models." EMNLP
Generating Recipes
• Preheat pan over medium heat. • Generously butter one side of a slice of bread. • Place bread butter-side-down onto skillet bottom
and add 1 slice of cheese. • Butter a second slice of bread on one side and
place butter-side-up on top of sandwich. • Grill until lightly browned and flip over; continue
grilling until cheese is melted.
“Grilled Cheese Sandwich”
Ingredients:
4 slices of white bread
2 slices of Cheddar cheese
3 tablespoons butter, divided
Recipe
164
![Page 165: Deep Reinforcement Learning for Natural Language …william/papers/ACL2018DRL4NLP.pdf · • In NLP tasks, usually the action space is unbounded since it is characterized by natural](https://reader030.fdocuments.net/reader030/viewer/2022020204/5b5a343c7f8b9ad0048e1abd/html5/thumbnails/165.jpg)
Challenges in Long Form Text Generation
Sequence-to-Sequence Training Methods:
• MLE
• RL (policy gradient)
• GAN (!)

Issues:
• Designed for short-form generation (e.g., MT or dialog response)
• Loss functions do not reflect high-level semantics for long-form generation
• No direct metric optimization, exposure bias, credit assignment, struggle maintaining coherence, objective-function balancing, …
RL has been applied in text generation -- the challenge, however, is to define a global score that can measure the complex aspects of text quality beyond local n-gram patterns. 165
![Page 166: Deep Reinforcement Learning for Natural Language …william/papers/ACL2018DRL4NLP.pdf · • In NLP tasks, usually the action space is unbounded since it is characterized by natural](https://reader030.fdocuments.net/reader030/viewer/2022020204/5b5a343c7f8b9ad0048e1abd/html5/thumbnails/166.jpg)
Neural Reward Functions for Long Form Text Generation
Goal:
• Capture individual semantic properties of the generation task
• Capture the coherence and long-term dependencies among sentences
• Generate temporally correct text

Approach:
• Use policy gradients
• Train neural reward functions as teachers
• Generate task-specific rewards
• Does an ensemble of rewards provide a better signal?
Bosselut, Celikyilmaz, Huang, He, Choi, 2018. “Discourse-Aware Neural Rewards for Coherent Text Generation”, NAACL
The generator is rewarded for imitating the discourse structure of the gold sequence
166
![Page 167: Deep Reinforcement Learning for Natural Language …william/papers/ACL2018DRL4NLP.pdf · • In NLP tasks, usually the action space is unbounded since it is characterized by natural](https://reader030.fdocuments.net/reader030/viewer/2022020204/5b5a343c7f8b9ad0048e1abd/html5/thumbnails/167.jpg)
Train the teacher
The teacher encodes the sentences of the document in the forward and reverse order
Two neural teachers that can learn to score an ordered sequence of sentences:
1. Absolute Order Teacher
   – evaluates the temporal coherence of the entire generation
2. Relative Order Teacher
   – rewards how a sentence fits with its surrounding sentences
A DSSM-like architecture is used in the implementation.
167
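A hedged sketch of an order-based teacher reward: encode a sequence of sentence vectors with a recurrent encoder, and reward generations whose encoding is close to the forward (correct) ordering of the gold sentences and far from the reversed ordering. The GRU encoder, the cosine-similarity reward, and the use of pre-computed sentence vectors are simplifying assumptions, not the paper's exact teachers.

```python
import torch
import torch.nn as nn

class OrderTeacher(nn.Module):
    """Toy absolute-order teacher: encode a sentence sequence with a GRU and
    compare a generated ordering against the gold forward/reverse orderings."""
    def __init__(self, sent_dim=32, hidden_dim=32):
        super().__init__()
        self.gru = nn.GRU(sent_dim, hidden_dim, batch_first=True)

    def encode(self, sent_vecs):                      # (num_sents, sent_dim)
        _, h = self.gru(sent_vecs.unsqueeze(0))
        return h.squeeze()                            # (hidden_dim,)

    def reward(self, generated, gold):
        g = self.encode(generated)
        fwd = self.encode(gold)                       # forward (correct) order
        rev = self.encode(torch.flip(gold, dims=[0])) # reverse order
        # reward: closer to the forward encoding than to the reverse encoding
        return (torch.cosine_similarity(g, fwd, dim=0)
                - torch.cosine_similarity(g, rev, dim=0))

teacher = OrderTeacher()
gold = torch.rand(5, 32)                              # 5 gold recipe sentences (as vectors)
shuffled = gold[torch.randperm(5)]
print(teacher.reward(gold, gold).item(), teacher.reward(shuffled, gold).item())
```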
![Page 168: Deep Reinforcement Learning for Natural Language …william/papers/ACL2018DRL4NLP.pdf · • In NLP tasks, usually the action space is unbounded since it is characterized by natural](https://reader030.fdocuments.net/reader030/viewer/2022020204/5b5a343c7f8b9ad0048e1abd/html5/thumbnails/168.jpg)
Policy Learning to Optimize the Reward
• The model generates a recipe by sampling
• Also greedily decodes a baseline recipe
• The teacher yields a reward for each sentence (see the sketch below)
168
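A hedged sketch of this update in a self-critical policy-gradient style: sample a recipe, greedily decode a baseline recipe, and weight the sampled sequence's log-probability by the difference in teacher rewards. The interfaces `model.sample`, `model.greedy`, and `teacher.reward` are hypothetical names, not from the slides.

```python
import torch

def self_critical_step(model, teacher, optimizer, ingredients):
    """One hypothetical policy-gradient update. Assumed interfaces:
    model.sample(x)  -> (sentences, per-sentence log-probs),
    model.greedy(x)  -> baseline sentences,
    teacher.reward(sentences) -> per-sentence rewards
    (assuming the sampled and baseline outputs have matching sentence counts)."""
    sampled, log_probs = model.sample(ingredients)          # y_s ~ pi_theta
    with torch.no_grad():
        baseline = model.greedy(ingredients)                # greedy-decoded baseline
        r_sample = teacher.reward(sampled)                   # teacher reward per sentence
        r_base = teacher.reward(baseline)
    advantage = r_sample - r_base                            # self-critical baseline
    loss = -(advantage * log_probs).sum()                    # policy-gradient loss
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```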
![Page 169: Deep Reinforcement Learning for Natural Language …william/papers/ACL2018DRL4NLP.pdf · • In NLP tasks, usually the action space is unbounded since it is characterized by natural](https://reader030.fdocuments.net/reader030/viewer/2022020204/5b5a343c7f8b9ad0048e1abd/html5/thumbnails/169.jpg)
Results
169
![Page 170: Deep Reinforcement Learning for Natural Language …william/papers/ACL2018DRL4NLP.pdf · • In NLP tasks, usually the action space is unbounded since it is characterized by natural](https://reader030.fdocuments.net/reader030/viewer/2022020204/5b5a343c7f8b9ad0048e1abd/html5/thumbnails/170.jpg)
Challenges and opportunities
• Open questions in RL that are important to NLP
  • Sample complexity
  • Model-based RL vs. model-free RL
  • Acquiring rewards for many NLP tasks
170
![Page 171: Deep Reinforcement Learning for Natural Language …william/papers/ACL2018DRL4NLP.pdf · • In NLP tasks, usually the action space is unbounded since it is characterized by natural](https://reader030.fdocuments.net/reader030/viewer/2022020204/5b5a343c7f8b9ad0048e1abd/html5/thumbnails/171.jpg)
Reducing Sample Complexity
• One of the core problems of RL: estimation with sampling.
• The problem: high variance and slow convergence

$$\nabla_\theta J(\theta) \approx \frac{1}{N}\sum_{i=1}^{N} \nabla_\theta \log \pi_\theta(\tau_i)\, r(\tau_i)$$

Toy example (a linear-Gaussian policy with a quadratic reward):

$$\log \pi_\theta(a_t \mid s_t) = -\frac{1}{2\sigma^2}\,(k s_t - a_t)^2 + \text{const}, \qquad r(s_t, a_t) = -s_t^2 - a_t^2$$

[Figure: the resulting learning curve shows slow convergence]
171
![Page 172: Deep Reinforcement Learning for Natural Language …william/papers/ACL2018DRL4NLP.pdf · • In NLP tasks, usually the action space is unbounded since it is characterized by natural](https://reader030.fdocuments.net/reader030/viewer/2022020204/5b5a343c7f8b9ad0048e1abd/html5/thumbnails/172.jpg)
Reducing Sample Complexity
• Variance reduction using a value function

$$\nabla_\theta J(\theta) \approx \frac{1}{N}\sum_{i=1}^{N} \nabla_\theta \log \pi_\theta(\tau_i)\,\big[r(\tau_i) - b\big]$$

Subtracting a baseline is unbiased in expectation but reduces variance greatly:

$$E\big[\nabla_\theta \log \pi_\theta(\tau)\, b\big] = \int \pi_\theta(\tau)\, \nabla_\theta \log \pi_\theta(\tau)\, b \, d\tau = \int \nabla_\theta \pi_\theta(\tau)\, b \, d\tau = b\, \nabla_\theta \!\int \pi_\theta(\tau)\, d\tau = b\, \nabla_\theta 1 = 0$$

Various forms for $b$:

(1) Average return: $b = \frac{1}{N}\sum_{i=1}^{N} r(\tau_i)$

(2) Value function, $b = V^{\pi,\gamma}(s_t)$:

$$\nabla_\theta J(\theta) \approx \frac{1}{N}\sum_{i=1}^{N}\sum_{t=0}^{T} \nabla_\theta \log \pi_\theta(a_t^i \mid s_t^i)\, A^{\pi,\gamma}(s_t, a_t) \qquad \text{[GAE, Schulman et al. 2016]}$$

(3) State-action baseline, $b = b(s_t, a_t)$:

$$\nabla_\theta J(\theta) \approx \frac{1}{N}\sum_{i=1}^{N}\sum_{t=0}^{T} \nabla_\theta \log \pi_\theta(a_t^i \mid s_t^i)\,\big[\hat{Q}_{i,t} - b(s_t^i, a_t^i)\big] + \frac{1}{N}\sum_{i}\sum_{t} \nabla_\theta E_{a \sim \pi_\theta(\cdot \mid s_t^i)}\big[b(s_t^i, a)\big] \qquad \text{[Q-Prop, Gu et al. 2016]}$$
172
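A small numerical illustration of why subtracting a baseline reduces the variance of the policy-gradient estimator while leaving its expectation (essentially) unchanged. The toy Gaussian policy, the quadratic reward, and the batch-average baseline are assumptions for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)
theta, sigma = 0.5, 1.0                            # toy Gaussian policy: a ~ N(theta * s, sigma^2)

def sample_gradients(n, use_baseline):
    s = rng.normal(size=n)
    a = theta * s + sigma * rng.normal(size=n)
    r = -(s ** 2) - (a ** 2)                       # reward r(s, a) = -s^2 - a^2
    b = r.mean() if use_baseline else 0.0          # baseline: average return over the batch
    grad_logp = (a - theta * s) * s / sigma ** 2   # d/dtheta log pi_theta(a | s)
    return grad_logp * (r - b)                     # per-sample REINFORCE gradient estimates

for use_baseline in (False, True):
    g = sample_gradients(100_000, use_baseline)
    print(f"baseline={use_baseline}: mean={g.mean():.3f}, var={g.var():.3f}")
```

Both settings should print roughly the same mean gradient, but the variance should be noticeably smaller with the baseline.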
![Page 173: Deep Reinforcement Learning for Natural Language …william/papers/ACL2018DRL4NLP.pdf · • In NLP tasks, usually the action space is unbounded since it is characterized by natural](https://reader030.fdocuments.net/reader030/viewer/2022020204/5b5a343c7f8b9ad0048e1abd/html5/thumbnails/173.jpg)
Reducing Sample Complexity
• Improve the convergence rate via constrained optimization

$$\theta \leftarrow \theta + \alpha \nabla_\theta J(\theta)$$

Problem with direct SGD on $\pi_\theta(a_t \mid s_t)$: some parameters change the action probabilities much more than others.

• Rescale the gradient with a divergence constraint:

$$\theta' \leftarrow \arg\max_{\theta'} \; (\theta' - \theta)^{\top} \nabla_\theta J(\theta) \quad \text{s.t.} \quad D_{KL}(\pi_{\theta'}, \pi_\theta) \le \epsilon$$

$$D_{KL}(\pi_{\theta'}, \pi_\theta) = E_{\pi_{\theta'}}\big[\log \pi_{\theta'} - \log \pi_\theta\big] \approx (\theta' - \theta)^{\top} F\, (\theta' - \theta)$$

Fisher information matrix $F = E_{\pi_\theta}\big[\nabla_\theta \log \pi_\theta(a \mid s)\, \nabla_\theta \log \pi_\theta(a \mid s)^{\top}\big]$; equivalence with the natural gradient! [TRPO, Schulman et al. 2015]

• Use a penalty instead of a constraint:

$$\max_\theta \sum_{i=1}^{N} \frac{\pi_\theta(a_i \mid s_i)}{\pi_{\theta_{old}}(a_i \mid s_i)}\, \hat{A}_i \;-\; \beta\, D_{KL}\big[\pi_{\theta_{old}}, \pi_\theta\big]$$

Increase/decrease $\beta$ if the KL is too high/low. [PPO, Schulman et al. 2017]
173
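A hedged sketch of the adaptive KL-penalty idea on a toy categorical policy. Everything here (the toy policy, the fixed fake advantages, the β-adaptation thresholds) is an assumption for illustration, not the exact PPO recipe.

```python
import torch

# Toy categorical policy over 4 actions; fixed fake advantages for illustration.
logits = torch.nn.Parameter(torch.zeros(4))
old_logits = logits.detach().clone()
advantages = torch.tensor([1.0, -0.5, 0.2, -1.0])
beta, kl_target = 1.0, 0.01
optimizer = torch.optim.Adam([logits], lr=0.05)

for step in range(50):
    pi = torch.softmax(logits, dim=0)
    pi_old = torch.softmax(old_logits, dim=0)
    ratio = pi / pi_old                                               # importance ratio
    kl = torch.sum(pi_old * (torch.log(pi_old) - torch.log(pi)))      # KL(pi_old || pi)
    # maximize the ratio-weighted advantage minus a KL penalty => minimize the negative
    loss = -(torch.sum(pi_old * ratio * advantages) - beta * kl)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    # adapt beta: increase if the KL is too high, decrease if too low
    with torch.no_grad():
        pi_new = torch.softmax(logits, dim=0)
        kl_now = torch.sum(pi_old * (torch.log(pi_old) - torch.log(pi_new)))
        if kl_now > 1.5 * kl_target:
            beta *= 2.0
        elif kl_now < kl_target / 1.5:
            beta /= 2.0

print(torch.softmax(logits, dim=0), "beta =", beta)
```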
![Page 174: Deep Reinforcement Learning for Natural Language …william/papers/ACL2018DRL4NLP.pdf · • In NLP tasks, usually the action space is unbounded since it is characterized by natural](https://reader030.fdocuments.net/reader030/viewer/2022020204/5b5a343c7f8b9ad0048e1abd/html5/thumbnails/174.jpg)
Model-based vs. Model-free RL
• Improve sample efficiency via fast model-based RL

| | Pros | Cons |
| --- | --- | --- |
| Model-free RL | Handles arbitrary dynamic systems with minimal bias | Substantially less sample-efficient |
| Model-based RL | Sample-efficient planning when given accurate dynamics | Cannot handle unknown dynamical systems that might be hard to model |
Can we combine model-based and model-free RL? Guided policy search:

$$\min_\theta \sum_{t=1}^{T} E_{\pi_\theta(x_t, u_t)}\big[\ell(x_t, u_t)\big]$$

$$\min_{\theta,\, q_1, \ldots, q_N} \; \sum_{i=1}^{N} \sum_{t=1}^{T} E_{q_i(x_t, u_t)}\big[\ell(x_t, u_t)\big] \quad \text{s.t.} \quad q_i(u_t \mid x_t) = \pi_\theta(u_t \mid x_t) \;\; \forall\, x_t, u_t, t, i$$

$$q_i(u_t \mid x_t) \sim \mathcal{N}\big(k_t + K_t x_t,\; Q_{u_t,u_t}^{-1}\big)$$

Plan $q_i(u_t \mid x_t)$ through a local approximate dynamics model $p(x_{t+1} \mid x_t, u_t)$, using Differential Dynamic Programming (DDP).
174
![Page 175: Deep Reinforcement Learning for Natural Language …william/papers/ACL2018DRL4NLP.pdf · • In NLP tasks, usually the action space is unbounded since it is characterized by natural](https://reader030.fdocuments.net/reader030/viewer/2022020204/5b5a343c7f8b9ad0048e1abd/html5/thumbnails/175.jpg)
• Improve sample-efficiency via fast model-based RL
[Levine&Abbeel, NIPS 2014]
Model-based vs. Model-free RL
175
![Page 176: Deep Reinforcement Learning for Natural Language …william/papers/ACL2018DRL4NLP.pdf · • In NLP tasks, usually the action space is unbounded since it is characterized by natural](https://reader030.fdocuments.net/reader030/viewer/2022020204/5b5a343c7f8b9ad0048e1abd/html5/thumbnails/176.jpg)
Acquiring Rewards
• How can we acquire rewards for complex real-world tasks?
• For many tasks, it is easier to provide expert data than a reward function
• Inverse RL: infer the reward function from roll-outs of an expert policy
176
![Page 177: Deep Reinforcement Learning for Natural Language …william/papers/ACL2018DRL4NLP.pdf · • In NLP tasks, usually the action space is unbounded since it is characterized by natural](https://reader030.fdocuments.net/reader030/viewer/2022020204/5b5a343c7f8b9ad0048e1abd/html5/thumbnails/177.jpg)
• Inverse RL: infer reward function from demonstrations
Given:
– state & action space
– roll-outs from $\pi^*$
– dynamics model [sometimes]

Goal:
– recover the reward function
– then use the reward to get a policy

Challenges:
– underdefined problem
– difficult to evaluate a learned reward
– demonstrations may not be precisely optimal
• Newer work: combining inverse RL with generative adversarial networks
[Kalman ’64, Ng & Russell ’00]
Similar to inverse RL, GANs learn an objective for generative modeling
[Finn*, Christiano*, et al. ’16]
Acquiring Rewards
177
![Page 178: Deep Reinforcement Learning for Natural Language …william/papers/ACL2018DRL4NLP.pdf · • In NLP tasks, usually the action space is unbounded since it is characterized by natural](https://reader030.fdocuments.net/reader030/viewer/2022020204/5b5a343c7f8b9ad0048e1abd/html5/thumbnails/178.jpg)
• Generative adversarial inverse RL
Reward/discriminator optimization [Guided cost learning, Finn et al. ICML '16] [GAIL, Ho & Ermon NIPS '16]:

$$L_D(\psi) = E_{\tau \sim p^*}\big[-\log D_\psi(\tau)\big] + E_{\tau \sim \pi_\theta}\big[-\log\big(1 - D_\psi(\tau)\big)\big]$$

Policy/generator optimization:

$$L_G(\theta) = E_{\tau \sim \pi_\theta}\big[\log\big(1 - D_\psi(\tau)\big) - \log D_\psi(\tau)\big]$$
Acquiring Rewards
178
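A hedged sketch of computing the two losses above, assuming $D_\psi$ outputs the probability that a trajectory comes from the expert. The trajectory featurization, the stand-in discriminator network, and the random "trajectories" are assumptions for illustration; in practice the generator term is used as a reward inside a policy-gradient update rather than differentiated directly.

```python
import torch
import torch.nn as nn

# Stand-in discriminator D_psi over trajectory features (dimension assumed).
feat_dim = 8
D = nn.Sequential(nn.Linear(feat_dim, 16), nn.ReLU(), nn.Linear(16, 1), nn.Sigmoid())

expert_traj = torch.rand(32, feat_dim)      # tau ~ expert demonstrations
policy_traj = torch.rand(32, feat_dim)      # tau ~ current policy pi_theta

d_expert = D(expert_traj)                   # D_psi(tau) for expert trajectories
d_policy = D(policy_traj)                   # D_psi(tau) for policy trajectories

# Discriminator loss: L_D = E_expert[-log D] + E_policy[-log(1 - D)]
loss_D = -torch.log(d_expert + 1e-8).mean() - torch.log(1 - d_policy + 1e-8).mean()

# Policy/generator objective: L_G = E_policy[log(1 - D) - log D]
loss_G = (torch.log(1 - d_policy + 1e-8) - torch.log(d_policy + 1e-8)).mean()

print(loss_D.item(), loss_G.item())
```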
![Page 179: Deep Reinforcement Learning for Natural Language …william/papers/ACL2018DRL4NLP.pdf · • In NLP tasks, usually the action space is unbounded since it is characterized by natural](https://reader030.fdocuments.net/reader030/viewer/2022020204/5b5a343c7f8b9ad0048e1abd/html5/thumbnails/179.jpg)
Session Summary
• Learn a Q-function in a common vector space for states and actions
• Add external knowledge to help NL understanding
• The reward can be learned to reflect the goal of long-form text generation
• Open questions in RL that are important to NLP
  • Sample complexity
  • Model-based RL vs. model-free RL
  • Acquiring rewards for many NLP tasks
179
![Page 180: Deep Reinforcement Learning for Natural Language …william/papers/ACL2018DRL4NLP.pdf · • In NLP tasks, usually the action space is unbounded since it is characterized by natural](https://reader030.fdocuments.net/reader030/viewer/2022020204/5b5a343c7f8b9ad0048e1abd/html5/thumbnails/180.jpg)
Conclusion
• Deep Reinforcement Learning is a very natural solution for many NLP applications.
• DRL can be interpreted in many different ways.
• We have seen many exciting research directions.
• In particular, DRL for dialog is a very promising direction.
• Opportunities and challenges are ahead of us.
180
![Page 181: Deep Reinforcement Learning for Natural Language …william/papers/ACL2018DRL4NLP.pdf · • In NLP tasks, usually the action space is unbounded since it is characterized by natural](https://reader030.fdocuments.net/reader030/viewer/2022020204/5b5a343c7f8b9ad0048e1abd/html5/thumbnails/181.jpg)
Questions?
181