Deep Reasoning
References
1. [Sukhbaatar, 2015] Sukhbaatar, Szlam, Weston, Fergus. "End-To-End Memory Networks." Advances in Neural Information Processing Systems. 2015.
2. [Hill, 2015] Hill, Bordes, Chopra, Weston. "The Goldilocks Principle: Reading Children's Books with Explicit Memory Representations." arXiv preprint arXiv:1511.02301 (2015).
3. [Kumar, 2015] Kumar, Irsoy, Ondruska, Iyyer, Bradbury, Gulrajani, Zhong, Paulus, Socher. "Ask Me Anything: Dynamic Memory Networks for Natural Language Processing." arXiv preprint arXiv:1506.07285 (2015).
4. [Xiong, 2016] Xiong, Merity, Socher. "Dynamic Memory Networks for Visual and Textual Question Answering." arXiv preprint arXiv:1603.01417 (2016).
5. [Yin, 2015] Yin, Schütze, Xiang, Zhou. "ABCNN: Attention-Based Convolutional Neural Network for Modeling Sentence Pairs." arXiv preprint arXiv:1512.05193 (2015).
6. [Yu, 2015] Yu, Zhang, Hang, Xiang, Zhou. "Empirical Study on Deep Learning Models for Question Answering." arXiv preprint arXiv:1510.07526 (2015).
7. [Hermann, 2015] Hermann, Kočiský, Grefenstette, Espeholt, Kay, Suleyman, Blunsom. "Teaching Machines to Read and Comprehend." arXiv preprint arXiv:1506.03340 (2015).
8. [Kadlec, 2016] Kadlec, Schmid, Bajgar, Kleindienst. "Text Understanding with the Attention Sum Reader Network." arXiv preprint arXiv:1603.01547 (2016).
9. [Miao, 2015] Miao, Yu, Blunsom. "Neural Variational Inference for Text Processing." arXiv preprint arXiv:1511.06038 (2015).
10. [Kingma, 2013] Kingma, Welling. "Auto-Encoding Variational Bayes." arXiv preprint arXiv:1312.6114 (2013).
11. [Sohn, 2015] Sohn, Lee, Yan. "Learning Structured Output Representation Using Deep Conditional Generative Models." Advances in Neural Information Processing Systems. 2015.
Models Considered

Answer selection (WikiQA) | General QA (CNN)             | Transitive inference (bAbI)
ABCNN                     | E2E MN                       | E2E MN
Variational               | Impatient Attentive Reader   | DMN
Attentive Pooling         | Attentive (Impatient) Reader | ReasoningNet
                          | Attention Sum Reader         | NTM
End-to-End Memory Network [Sukhbaatar, 2015]
[Figure: single-hop End-to-End Memory Network. The story sentences ("I go to school. He gets ball. …") are embedded twice: Embed A produces the input memory and Embed C the output memory. The question ("Where does he go?") is embedded by Embed B into the query u. Attention weights come from a softmax over the inner products of u with the input memories; the output o is the weighted sum of the output memories, and u + o is passed through a final linear + softmax layer.]
[Figure: multi-hop (stacked) version of the model: the output u + o of one memory hop becomes the query for the next hop, with a final linear + softmax producing the answer.]

Sentence representation for the i-th sentence x_i = {x_{i1}, x_{i2}, ..., x_{in}}:
- BoW: m_i = Σ_j A x_{ij}
- Position Encoding: m_i = Σ_j l_j · A x_{ij} (element-wise weights l_j depending on word position)
- Temporal Encoding: m_i = Σ_j A x_{ij} + T_A(i)
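A single memory hop with the BoW and temporal encodings above can be sketched as follows. This is a minimal NumPy illustration with random, untrained embeddings; the function names and toy vocabulary are mine, not the paper's code:

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(v):
    e = np.exp(v - v.max())
    return e / e.sum()

def encode(sentences, embed, T=None):
    """BoW sentence memories m_i = sum_j A x_ij (+ T(i) for temporal encoding)."""
    mems = []
    for i, sent in enumerate(sentences):
        m = embed[sent].sum(axis=0)      # sum of word embeddings (BoW)
        if T is not None:
            m = m + T[i]                 # temporal encoding T_A(i)
        mems.append(m)
    return np.stack(mems)

def memory_hop(sentences, question, A, B, C):
    """One E2E-MN hop: attention = softmax of inner products, o = weighted sum."""
    m = encode(sentences, A)             # input memory (Embed A)
    c = encode(sentences, C)             # output memory (Embed C)
    u = B[question].sum(axis=0)          # query embedding u (Embed B)
    p = softmax(m @ u)                   # attention over memories
    o = p @ c                            # weighted sum of output memories
    return u + o                         # fed to the next hop or final softmax

# toy usage: 2 sentences of word ids, vocabulary of 10 words, embedding dim 4
V, d = 10, 4
A, B, C = (rng.normal(size=(V, d)) for _ in range(3))
out = memory_hop([[1, 2, 3], [4, 5]], [6, 7], A, B, C)
```

Stacking hops amounts to feeding `out` back as the next hop's query.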
Training details
Linear Start (LS) helps avoid local minima:
- First train with the softmax in each memory layer removed, making the model entirely linear except for the final softmax.
- When the validation loss stops decreasing, the softmax layers are re-inserted and training recommences.

RNN-style layer-wise weight tying:
- The input and output embeddings are the same across different layers.

Learning time invariance by injecting random noise:
- Jitter the time index with random empty memories, i.e. add "dummy" memories to regularize T_A(i).
The Goldilocks Principle: Reading Children's Books with Explicit Memory Representations [Hill, 2015]
• Context sentences: S = {s_1, s_2, ..., s_n}, each s_i a BoW word representation
• Encoded memory: m_s = φ(s), ∀s ∈ S

• Lexical memory
  - Each word occupies a separate slot in the memory: s is a single word and φ(s) has only one non-zero feature
  - Multiple hops are only beneficial in this memory model
• Window memory (best)
  - s corresponds to a window of text from the context S centered on an individual mention of a candidate c in S:
    m_s = {w_{i−(b−1)/2}, ..., w_i, ..., w_{i+(b−1)/2}}
    where w_i ∈ C is an instance of one of the candidate words and b is the window size
• Sentential memory
  - Same as the original implementation of the Memory Network
Self-supervision for window memories
- Memory supervision (knowing which memories to attend to) is not provided at training time.
- Instead, SGD steps force the model to give a higher score to the supporting memory m* (the window containing the true answer) than to any memory of any other candidate, using:
  - Hard attention (training and testing): m̂ = argmax_{i=1,...,n} c_i^T q
  - Soft attention (testing): m̂ = Σ_{i=1..n} α_i m_i, with α_i = exp(c_i^T q) / Σ_j exp(c_j^T q)
- If m̂ differs from the supporting memory m*, the model is updated.
- This can be understood as a way of achieving hard attention over memories without any label information beyond the training data.
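The self-supervised update can be sketched as follows. This is a toy: it updates only the query vector q with a softmax cross-entropy gradient, whereas the paper updates the embedding parameters; `self_supervision_step` is an illustrative name:

```python
import numpy as np

def softmax(v):
    e = np.exp(v - v.max())
    return e / e.sum()

def self_supervision_step(memories, q, supporting, lr=0.5):
    """If the hard-attention pick argmax_i c_i^T q is not the supporting
    memory m*, take one SGD step that pushes m*'s score up."""
    scores = memories @ q                # c_i^T q for every window memory
    picked = int(np.argmax(scores))      # hard attention
    if picked == supporting:
        return q, picked                 # prediction already correct
    p = softmax(scores)
    target = np.zeros_like(p)
    target[supporting] = 1.0
    grad_q = memories.T @ (p - target)   # gradient of CE loss w.r.t. q
    return q - lr * grad_q, picked

# toy run: memory 1 contains the true answer but memory 0 scores higher
memories = np.array([[1.0, 0.0], [0.0, 1.0]])
q = np.array([1.0, 0.0])
for _ in range(50):
    q, picked = self_supervision_step(memories, q, supporting=1)
```

After a few steps the hard-attention pick coincides with the supporting memory and the updates stop.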
Ask Me Anything: Dynamic Memory Networks for Natural Language Processing [Kumar, 2015]
[Figure: DMN overview. The Input Module encodes the story ("I go to school. He gets ball. …") with GloVe embeddings and a GRU; the Question Module encodes "Where does he go?" into q; the Episodic Memory Module attends over the input representations with gates g_t^i; the Answer Module is a GRU that emits the answer tokens followed by <EOS>.]
[Figure: the DMN diagram, with the Input and Question Modules highlighted.]

Input Module
- Each story word embedding is fed to a GRU, and the hidden states are kept as input representations: GRU(L[w_t^I], h_{t-1}) = h_t = c_t, where L is the embedding matrix.

Question Module
- The question words are encoded by a GRU: q_t = GRU(L[w_t^Q], q_{t-1}); the final state is the question vector q.

Episodic Memory Module (modified, GRU-like update)
- Gate: g_t^i = G(c_t, m^{i-1}, q)
- h_t^i = g_t^i · GRU(c_t, h_{t-1}^i) + (1 − g_t^i) · h_{t-1}^i
- Episode: e^i = h_T^i, which produces the new memory m^i

Answer Module
- A GRU with a softmax output generates the answer tokens until <EOS>.
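The input and question encoders are ordinary GRUs over word embeddings. A minimal sketch with random, untrained weights (the class and function names are mine, for shape illustration only):

```python
import numpy as np

rng = np.random.default_rng(1)

def sigmoid(v):
    return 1.0 / (1.0 + np.exp(-v))

class GRUCell:
    """Minimal GRU cell with random weights."""
    def __init__(self, d_in, d_h):
        self.d_h = d_h
        self.Wz, self.Wr, self.Wh = (rng.normal(0, 0.1, (d_h, d_in + d_h))
                                     for _ in range(3))

    def step(self, x, h):
        xh = np.concatenate([x, h])
        z = sigmoid(self.Wz @ xh)                          # update gate
        r = sigmoid(self.Wr @ xh)                          # reset gate
        h_tilde = np.tanh(self.Wh @ np.concatenate([x, r * h]))
        return z * h_tilde + (1 - z) * h

def run_gru(cell, embedded_words):
    """Input module: returns every hidden state c_t = h_t.
    Question module: the last state is the question vector q."""
    h = np.zeros(cell.d_h)
    states = []
    for x in embedded_words:
        h = cell.step(x, h)
        states.append(h)
    return np.stack(states)

d_word, d_h = 3, 5
cell = GRUCell(d_word, d_h)
c = run_gru(cell, rng.normal(size=(4, d_word)))      # c_t for a 4-word story
q = run_gru(cell, rng.normal(size=(3, d_word)))[-1]  # question vector q
```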
Attention Mechanism (Gate)
- g_t^i = G(c_t, m^{i-1}, q)
- A feature vector captures similarities between c_t, m^{i-1}, and q.
- G is a two-layer feedforward neural network applied to that feature vector.
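The gate scorer G can be sketched as below. The feature set in the concatenation is a representative subset of the similarity features used in the paper (element-wise products and absolute differences); the weights here are random stand-ins:

```python
import numpy as np

def gate(c, m, q, W1, b1, W2, b2):
    """g_t^i = sigmoid(W2 tanh(W1 z + b1) + b2): a two-layer feedforward
    network over a similarity feature vector z(c, m, q)."""
    z = np.concatenate([c, m, q, c * q, c * m,
                        np.abs(c - q), np.abs(c - m)])
    hidden = np.tanh(W1 @ z + b1)
    return float(1.0 / (1.0 + np.exp(-(W2 @ hidden + b2))))

rng = np.random.default_rng(2)
d, d_hid = 4, 8
W1 = rng.normal(0, 0.1, (d_hid, 7 * d))   # 7 feature groups of dim d each
b1 = np.zeros(d_hid)
W2 = rng.normal(0, 0.1, d_hid)
b2 = 0.0
g = gate(rng.normal(size=d), rng.normal(size=d), rng.normal(size=d),
         W1, b1, W2, b2)
```

The sigmoid keeps each gate in (0, 1), so it can interpolate the GRU-like update below.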
Episodic Memory Module
- Iterates over the input representations c_t while computing episodes e^i and updating the episodic memory m^i.
- Attention mechanism + recurrent network → memory update:
  - g_t^i = G(c_t, m^{i-1}, q)
  - h_t^i = g_t^i · GRU(c_t, h_{t-1}^i) + (1 − g_t^i) · h_{t-1}^i
  - e^i = h_{T_C}^i (the final hidden state of the pass)
  - Memory update: m^i = GRU(e^i, m^{i-1})
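The episodic memory loop can be sketched as follows. To keep the sketch tiny, the inner GRUs are replaced by tanh mixing steps, and the gate is a toy similarity score; only the control flow (gated pass → episode → memory update) mirrors the module:

```python
import numpy as np

def episode(facts, gates):
    """One pass: h_t = g_t * GRU(c_t, h_{t-1}) + (1 - g_t) * h_{t-1}.
    A tanh step stands in for the inner GRU."""
    h = np.zeros_like(facts[0])
    for c, g in zip(facts, gates):
        h = g * np.tanh(c + h) + (1 - g) * h
    return h                                  # e_i: final hidden state

def episodic_memory(facts, q, n_passes=3):
    """m_0 = q; each pass gates the facts against (c_t, m_{i-1}, q),
    runs an episode, then updates m_i = GRU(e_i, m_{i-1}) (tanh stand-in)."""
    m = q.copy()
    for _ in range(n_passes):
        # toy gate: similarity of each fact to the question and current memory
        gates = [1 / (1 + np.exp(-(c @ q + c @ m))) for c in facts]
        e = episode(facts, gates)
        m = np.tanh(e + m)                    # stand-in for GRU(e_i, m_{i-1})
    return m

facts = [np.array([1.0, 0.0]), np.array([0.0, 1.0]), np.array([0.5, 0.5])]
m = episodic_memory(facts, q=np.array([1.0, 0.0]))
```

Because m changes between passes, the gates differ between passes, which is what lets later passes attend to different facts.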
Multiple Episodes
- Allow the model to attend to different inputs during each pass.
- Allow a type of transitive inference, since the first pass may uncover the need to retrieve additional facts.
  - Q: Where is the football? C1: John put down the football.
  - Only once the model sees C1 (John is relevant) can it reason that the second iteration should retrieve where John was.

Criteria for Stopping
- Append a special end-of-passes representation to the input c; stop if this representation is chosen by the gate function.
- Set a maximum number of iterations.
- This is why it is called a Dynamic Memory Network.
Answer Module
- Triggered once at the end of the episodic memory, or at each time step.
- Concatenates the last generated word and the question vector as the input at each time step.
- Trained with a cross-entropy error.

Training Details
- Adam optimization; L2 regularization; dropout on the word embeddings (GloVe).
- bAbI dataset:
  - Objective function: J = α·E_CE(Gates) + β·E_CE(Answers)
  - Gate supervision aims to select one sentence per pass.
  - Without supervision: the gated GRU over the c_t, with e^i = h_{T_C}^i.
  - With supervision (simpler): e^i = Σ_{t=1}^T softmax(g_t^i) c_t, where softmax(g_t^i) = exp(g_t^i) / Σ_j exp(g_j^i) and g_t^i is the value before the sigmoid.
  - This gives better results, because the softmax encourages sparsity and is suited to picking one sentence.
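The supervised soft-gate episode is just a softmax-weighted sum of the input representations; a minimal sketch (toy fact vectors, illustrative name):

```python
import numpy as np

def supervised_episode(facts, raw_gates):
    """With gate supervision: e_i = sum_t softmax(g_t^i) c_t, where g_t^i is
    the gate value before the sigmoid. The softmax pushes the weights toward
    picking (close to) one sentence per pass."""
    g = np.asarray(raw_gates, dtype=float)
    w = np.exp(g - g.max())
    w /= w.sum()                      # softmax over the gate scores
    return w @ np.stack(facts)        # weighted sum of the c_t

facts = [np.array([1.0, 0.0]), np.array([0.0, 1.0])]
e = supervised_episode(facts, raw_gates=[8.0, -8.0])  # ~all weight on fact 0
```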
Training Details
Stanford Sentiment Treebank (Sentiment Analysis)
- Use all full sentences, subsample 50% of phrase-level labels every epoch
- Only evaluated on the full sentences
- Binary classification, neutral phrases are removed from the dataset
- Trained with GRU sequence models
Dynamic Memory Networks for Visual and Textual Question Answering [Xiong 2016]
Several design choices are motivated by intuition and accuracy improvements.
Input Module in DMN
- A single GRU embeds the story and stores the hidden states.
- The GRU provides a temporal component by allowing each sentence to know the content of the sentences that came before it.
- Cons:
  - The GRU only gives sentences context from the sentences before them, not after them.
  - Supporting sentences may be too far away from each other.
- Hence the input fusion layer.
Input Module in DMN+
Replacing the single GRU with two components:
1. Sentence reader: responsible only for encoding the words into a sentence embedding.
   • Uses the positional encoder (as in E2E MN): f_i = Σ_j l_j · A x_{ij}
   • GRUs and LSTMs were considered, but they require more computational resources and are prone to overfitting.
2. Input fusion layer: allows content interaction between sentences.
   • A bi-directional GRU lets information flow from both past and future sentences.
   • Gradients do not need to propagate through the words between sentences.
   • Distant supporting sentences can have a more direct interaction.
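The fusion layer can be sketched as a forward and a backward recurrence whose states are summed per sentence. Toy recurrent steps stand in for the two GRUs here; only the wiring is the point:

```python
import numpy as np

def input_fusion(sentence_vecs, step_fwd, step_bwd):
    """Bi-directional pass over sentence embeddings: f_i = fwd_i + bwd_i,
    so each fact carries context from both past and future sentences."""
    fwd, h = [], np.zeros_like(sentence_vecs[0])
    for f in sentence_vecs:
        h = step_fwd(f, h)
        fwd.append(h)
    bwd, h = [], np.zeros_like(sentence_vecs[0])
    for f in reversed(sentence_vecs):
        h = step_bwd(f, h)
        bwd.append(h)
    bwd.reverse()
    return [a + b for a, b in zip(fwd, bwd)]

# toy recurrent steps standing in for the forward/backward GRUs
step = lambda x, h: np.tanh(x + 0.5 * h)
sents = [np.array([1.0, 0.0]), np.array([0.0, 1.0]), np.array([1.0, 1.0])]
facts = input_fusion(sents, step, step)
```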
Input Module for DMN+
[Figure: sentence reader + bi-directional input fusion layer. Referenced paper: A Hierarchical Neural Autoencoder for Paragraphs and Documents [Li, 2015].]
Episodic Memory Module in DMN+
- F = [f_1, f_2, ..., f_N]: the output of the input module.
- Attention is computed from interactions between each fact f_i, the question q, and the episode memory state m_t.
Attention Mechanism in DMN+
Attention extracts a contextual vector c_t based on the current focus.
1. Soft attention
   • A weighted summation of F: c_t = Σ_{i=1}^N g_i^t f_i
   • Can approximate hard attention by selecting a single fact f_i.
   • Cons: loses positional and ordering information.
     - Attention passes can retrieve some of this information, but inefficiently.
2. Attention-based GRU (best)
   • An RNN is the natural fit for positional and ordering information, but a standard GRU cannot make use of the attention gates g_i^t.
   • In a GRU, u_i is the update gate and r_i controls how much of h_{i-1} is retained.
   • Replace the update gate u_i (a vector) with the attention gate g_i^t (a scalar).
   • This also makes it easy to visualize how the attention gates activate.
   • The final hidden state is used as c_t, which in turn updates the episodic memory m_t.
Episode Memory Updates in DMN+
1. Untied and tied (better) GRU: m_t = GRU(c_t, m_{t-1})
2. Untied ReLU layer (best): m_t = ReLU(W_t [m_{t-1}; c_t; q] + b)
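Both the attention-GRU step and the ReLU memory update are small enough to sketch directly. The candidate state here is a tanh stand-in for the full GRU candidate, and the weights are random; only the gating structure and the update formula are the point:

```python
import numpy as np

def attn_gru_step(f, h_prev, g):
    """Attention-based GRU step: the update-gate vector u_i is replaced by
    the scalar attention gate g_i^t (tanh stands in for the candidate)."""
    h_tilde = np.tanh(f + h_prev)
    return g * h_tilde + (1 - g) * h_prev

def relu_memory_update(m_prev, c, q, W, b):
    """Untied ReLU update: m_t = ReLU(W_t [m_{t-1}; c_t; q] + b)."""
    return np.maximum(0.0, W @ np.concatenate([m_prev, c, q]) + b)

rng = np.random.default_rng(3)
d = 3
h = np.zeros(d)
for f, g in [(rng.normal(size=d), 0.9), (rng.normal(size=d), 0.0)]:
    h = attn_gru_step(f, h, g)           # g = 0 leaves the state untouched
c_t = h                                   # final hidden state = context vector
m = relu_memory_update(rng.normal(size=d), c_t, rng.normal(size=d),
                       rng.normal(size=(d, 3 * d)), np.zeros(d))
```

Because g_i^t is a scalar per fact, plotting the gates over i and t directly visualizes what each pass attends to.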
Training Details
- Adam optimization.
- Xavier initialization for all weights except the word embeddings.
- L2 regularization on all weights except biases.
- Dropout on the word embeddings (GloVe) and the answer module with p = 0.9.
ABCNN: Attention-Based Convolutional Neural Network for Modeling Sentence Pairs [Yin 2015]
• Most prior work on answer selection models each sentence separately and neglects their mutual influence.
• Humans focus on key parts of s_0 by extracting parts of s_1 related by identity, synonymy, antonymy, etc.
• ABCNN: takes into account the interdependence between s_0 and s_1.
• Convolution layer: increases abstraction from words to phrases.
1. Input embedding with word2vec.
2-1. Convolution layer with wide convolution
   • so that each word v_i is detected by all weights in W.
2-2. Average pooling layer
   • all-ap: column-wise averaging over all columns.
   • w-ap: column-wise averaging over windows of w columns.
3. Output layer with logistic regression
   • fed the all-ap outputs of the non-final layers together with the final ap layer.
Attention on the feature map (ABCNN-1)
• The attention values of row i in A give the attention distribution of the i-th unit of s_0 with respect to s_1.
• A_{i,j} = matchscore(F_{0,r}[:, i], F_{1,r}[:, j])
• matchscore(x, y) = 1 / (1 + |x − y|)
• Generate the attention feature map F_{i,a} for each s_i.
• Cons: needs more parameters.
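The attention matrix and the per-unit weights derived from it can be sketched as follows (toy feature maps; columns are units, as in the paper's convention):

```python
import numpy as np

def match_score(x, y):
    """matchscore(x, y) = 1 / (1 + |x - y|), with |.| the Euclidean distance."""
    return 1.0 / (1.0 + np.linalg.norm(x - y))

def attention_matrix(F0, F1):
    """A[i, j] compares unit i of s_0 with unit j of s_1.
    F0: (d, n0) and F1: (d, n1) feature maps; columns are units."""
    A = np.empty((F0.shape[1], F1.shape[1]))
    for i in range(F0.shape[1]):
        for j in range(F1.shape[1]):
            A[i, j] = match_score(F0[:, i], F1[:, j])
    return A

F0 = np.array([[1.0, 0.0], [0.0, 1.0]])            # 2 units of s_0
F1 = np.array([[1.0], [0.0]])                      # 1 unit of s_1
A = attention_matrix(F0, F1)                       # A[0, 0] = 1 (identical units)
row_sums, col_sums = A.sum(axis=1), A.sum(axis=0)  # ABCNN-2 unit weights
```

ABCNN-1 transforms A with extra weight matrices into attention feature maps (the extra parameters mentioned above); ABCNN-2 instead uses the row/column sums directly to weight the pooling.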
Attention after convolution (ABCNN-2)
• Attention weights are applied directly on the representation, with the aim of improving the features computed by convolution.
• a_{0,j} = Σ A[j, :] (row-wise sums for s_0; column-wise sums for s_1).
• w-ap over the convolution features, weighted by these attention values.
ABCNN-1 vs ABCNN-2

ABCNN-1:
- Indirect impact on convolution
- Needs more parameters → vulnerable to overfitting
- Handles smaller-granularity units (e.g. word level)

ABCNN-2:
- Direct impact via pooling (weighted attention)
- Needs no extra parameters
- Handles larger-granularity units (e.g. phrase level, phrase size = window size)
ABCNN-3: combines ABCNN-1 and ABCNN-2, with attention both before and after convolution.
Empirical Study on Deep Learning Models for QA [Yu 2015]
The first work to examine Neural Turing Machines on QA problems.
Splits QA into two steps:
1. Search for supporting facts.
2. Generate the answer from the relevant pieces of information.

NTM
• Single-layer LSTM network as the controller.
• Input: word embeddings
  1. Supporting facts only
  2. Facts highlighted: use markers to annotate the beginning and end of the supporting facts
• Output: softmax layer (multiclass classification) over answers.
Teaching Machines to Read and Comprehend [Hermann 2015]
[Figure: the Attentive Reader, with document token representations f_i = y_d(t).]
s(t): the degree to which the network attends to a particular token in the document when answering the query (soft attention).
Text Understanding with the Attention Sum Reader Network [Kadlec 2016]
- The answer should appear in the context.
- Inspired by Pointer Networks.
- Contrast to the Attentive Reader:
  • The answer is selected from the context directly, by summing the attention weights of individual token representations, rather than through the Attentive Reader's blended (weighted-sum) representation.
Stochastic Latent Variable
[Figure: graphical models. Generative model: z → x. Conditional generative model: (z, x) → y.]

Generative model:
p(x) = Σ_z p(x, z) = Σ_z p(x|z) p(z)  (discrete z), or
p(x) = ∫ p(x, z) dz = ∫ p(x|z) p(z) dz  (continuous z)

Conditional generative model:
p(y|x) = Σ_z p(y|z, x) p(z|x), or p(y|x) = ∫ p(y|z, x) p(z|x) dz
Variational Inference Framework

p_θ(x, z) = p_θ(x|z) p(z) = ∫ p_θ(x|h) p_θ(h|z) p(z) dh

log p_θ(x, z) = log ∫ [q(h)/q(h)] p_θ(x|h) p_θ(h|z) p(z) dh
  ≥ ∫ q(h) log [p_θ(x|h) p_θ(h|z) p(z) / q(h)] dh    (Jensen's inequality)
  = E_{q(h)}[log p_θ(x|h) p_θ(h|z)] − D_KL(q(h) ∥ p(z))
  = E_{q(h)}[log p_θ(x|h) p_θ(h|z) p(z) − log q(h)]

a tight lower bound if q(h) = p(h|x, z)
Conditional Variational Inference Framework
p_θ(y|x) = Σ_z p_θ(y, z|x) = Σ_z p_θ(y|x, z) p_ψ(z|x)

log p(y|x) = log ∫ [q(z)/q(z)] p(y|z, x) p(z|x) dz
  ≥ ∫ q(z) log [p(y|z, x) p(z|x) / q(z)] dz    (Jensen's inequality)
  = E_{q(z)}[log p(y|z, x)] − D_KL(q(z) ∥ p(z|x))
  = E_{q(z)}[log p(y|z, x) p(z|x) − log q(z)]

a tight lower bound if q(z) = p(z|x, y)
Neural Variational Inference Framework
log p(y|x) ≥ E_{q(z)}[log p(y|z, x)] − D_KL(q(z) ∥ p(z|x)) = ℒ

The variational distribution q(z) is parameterized by an inference network:
1. Vector representations of the observed variables: u = f(x), v = f(y)
2. Joint representation (concatenation): π = g(u, v)
3. Parameterize the variational distribution: μ = l_1(π), σ = l_2(π)
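The three steps above, plus the reparameterization trick from [Kingma, 2013] used to sample z, can be sketched as follows. The "layers" here are fixed toy maps standing in for the small MLPs f, g, l_1, l_2:

```python
import numpy as np

rng = np.random.default_rng(4)

def inference_network(x, y):
    """Diagonal-Gaussian q(z|x, y): representations u, v of the observed
    variables, joint pi = g(u, v), then mu = l1(pi), log sigma = l2(pi)."""
    u, v = np.tanh(x), np.tanh(y)          # u = f(x), v = f(y)
    pi = np.concatenate([u, v])            # joint representation g(u, v)
    mu = pi[: x.size]                      # l1(pi): toy linear slice
    log_sigma = pi[x.size:]                # l2(pi): toy linear slice
    eps = rng.normal(size=mu.shape)
    z = mu + np.exp(log_sigma) * eps       # reparameterisation trick
    return z, mu, log_sigma

z, mu, log_sigma = inference_network(np.zeros(2), np.zeros(2))
```

Sampling z as mu + sigma * eps keeps the sample differentiable with respect to mu and sigma, so the lower bound ℒ can be optimized end-to-end by SGD.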