Explainable Neural Computation via Stack Neural Module Networks (July, 2018)
Ronghang Hu, Jacob Andreas, Trevor Darrell, and Kate Saenko, UC Berkeley

Page 1:

Explainable Neural Computation via Stack Neural Module Networks (July, 2018)

Ronghang Hu, Jacob Andreas, Trevor Darrell, and Kate Saenko, UC Berkeley

Page 2:

Outline

● The Problem

● Motivation and Importance

● The Approach

○ Module layout controller

○ Neural modules with a memory stack

○ Soft program execution

● Dataset

● Results

● Critique

Page 3:

Explaining NNs by:

❏ Building in attention layers
❏ Post-hoc extraction of implicit model attention (e.g., gradient propagation)
❏ Network dissection

Can we go beyond a single heatmap?

❏ Explainable models for more complex tasks: Question Answering, Referential Expression grounding

❏ [These require several reasoning steps to solve]

Motivation

Page 4:

The Problem and Importance

❏ A single heat map highlighting important spatial regions may not tell the full story.
❏ Existing modular nets: analyse the question → predict a sequence of predefined modules → predict the answer.
❏ But they need supervised module layouts (expert layouts) for training the layout policy.
❏ Goal: an explicit modular reasoning process with low supervision.

Question: There is a small gray block. Are there any spheres to the left of it?

Page 5:

The Approach

❏ Replace the layout graph with a stack-based data structure.

❏ [Instead of making discrete layout choices, this makes the layout soft and continuous → the model can be optimised in a fully differentiable way with SGD.]

❏ Steps :

❖ Module layout controller

❖ Neural modules with a memory stack

❖ Soft program execution

Page 6:

Model

❖ Module layout controller
❖ Neural modules with a memory stack
❖ Soft program execution

Page 7:

Layout controller

Parameter dimensions: c_t is d-dim; w^(t) is |M|-dim; W_1^(t) is d×d; W_2 is d×2d; W_3 is 1×d.

Input: question Q of S words, encoded into a sequence [h_1, ..., h_S] (length S, dim d) with a BiLSTM, where h_s is the concatenation of the forward and backward LSTM outputs at the s-th word.

The controller runs in a recurrent manner from t = 0 to T−1.

At each t, it applies a time-dependent linear transform to the question vector q and linearly combines it with the previous c_{t−1}:

u = W_2 [ W_1^(t) q + b_1 ; c_{t−1} ] + b_2

At each t, a small MLP is applied to u to predict the module weights w^(t):

w^(t) = softmax(MLP(u; θ_MLP)), with Σ_m w_m^(t) = 1

At each t, the controller predicts the textual parameter c_t by attending over the words:

cv_{t,s} = softmax_s( W_3 (u ⊙ h_s) )
c_t = Σ_{s=1}^{S} cv_{t,s} · h_s
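
A minimal PyTorch sketch of one controller timestep under these equations (class and variable names are illustrative, not the authors' code; the biases b_1, b_2 are folded into the Linear layers):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LayoutController(nn.Module):
    """Hypothetical sketch of one layout-controller timestep."""
    def __init__(self, d, num_modules, T):
        super().__init__()
        # one W1 per timestep: the time-dependent question transform
        self.W1 = nn.ModuleList([nn.Linear(d, d) for _ in range(T)])
        self.W2 = nn.Linear(2 * d, d)           # u = W2 [W1^(t) q + b1 ; c_{t-1}] + b2
        self.W3 = nn.Linear(d, 1, bias=False)   # per-word attention score
        self.mlp = nn.Sequential(nn.Linear(d, d), nn.ReLU(),
                                 nn.Linear(d, num_modules))

    def step(self, t, q, h, c_prev):
        # q: (d,) question vector; h: (S, d) BiLSTM outputs; c_prev: (d,)
        u = self.W2(torch.cat([self.W1[t](q), c_prev], dim=-1))
        w = F.softmax(self.mlp(u), dim=-1)                  # module weights w^(t)
        cv = F.softmax(self.W3(u * h).squeeze(-1), dim=0)   # word attention cv_{t,s}
        c = (cv.unsqueeze(-1) * h).sum(dim=0)               # textual parameter c_t
        return w, c
```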

Page 8:

Neural modules with a memory stack

Example: "How many objects are right of the blue object?" → Answer[how many](transform[right](find[blue]))

Page 9:

Differentiable memory stack

❏ Modules may take different numbers of inputs, and may need to compare what they see at step t with what they saw previously.

❏ A typical tree-structured layout: Compare(Find(), Transform(Find()))

❏ Therefore, give the modules a memory to remember.

❏ But restrict it to a last-in-first-out (LIFO) stack.

❏ Thus, modules behave like functions in programs, allowing only arguments and returned values to be passed between them.

Page 10:

Differentiable memory stack

The stack stores values of a fixed dimension: a length-L memory array A = {A_i}_{i=1..L} plus a stack-top pointer p (an L-dim one-hot vector).

Push (pointer increment + value writing):
p := 1d_conv(p, [0, 0, 1])
A_i := A_i (1 − p_i) + z · p_i, for i = 1, ..., L

Pop (value reading + pointer decrement):
z := Σ_i A_i · p_i
p := 1d_conv(p, [1, 0, 0])

❏ Stores H×W image attention maps.
❏ Each module first pops image attention maps from the stack → then pushes its result back.
❏ E.g., Compare(Find(), Transform(Find())):
Find pushes its localization result onto the stack.
Then Transform pops one attention map and pushes the transformed attention.
Then Compare pops two image attention maps and uses them to predict the answer.
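
A minimal PyTorch sketch of these push/pop operations (assumed shapes: A is L×D, p is an L-dim soft one-hot pointer, z is a D-dim value; F.conv1d computes cross-correlation, so the slide's convolution kernels appear flipped here):

```python
import torch
import torch.nn.functional as F

def shift(p, kernel):
    # convolve the soft pointer with a fixed 3-tap kernel to move it up/down
    k = torch.tensor(kernel, dtype=p.dtype).view(1, 1, -1)
    return F.conv1d(p.view(1, 1, -1), k, padding=1).view(-1)

def push(A, p, z):
    p = shift(p, [1., 0., 0.])   # pointer increment (conv with [0,0,1] in the slide)
    A = A * (1 - p).unsqueeze(1) + z * p.unsqueeze(1)   # write z at the new top
    return A, p

def pop(A, p):
    z = (A * p.unsqueeze(1)).sum(dim=0)   # read the value at the stack top
    p = shift(p, [0., 0., 1.])            # pointer decrement (conv with [1,0,0])
    return z, p                           # A itself is unchanged by a pop
```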

Page 11:

Soft program execution

Thus, the model performs a continuous selection of layouts through the weights w_m^(t).

At t = 0 → initialize (A, p) with uniform image attention and p = [0, ..., 0, 1], i.e. at the bottom.

At every t → execute every module on the current (A^(t), p^(t)). During execution, each module m may pop from and push to the stack to produce (A_m^(t), p_m^(t)).

Then → use w_m^(t) to weight the results and sharpen the stack pointer with a softmax:

A^(t+1) = Σ_m A_m^(t) · w_m^(t)
p^(t+1) = softmax( Σ_m p_m^(t) · w_m^(t) )
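
A minimal sketch of one soft-execution step under these equations (the module interface here is an assumption for illustration):

```python
import torch
import torch.nn.functional as F

def soft_step(modules, A, p, w, c):
    # run every module m on the current stack (A, p); each returns its own
    # candidate stack (A_m, p_m) after popping/pushing what it needs
    A_next = torch.zeros_like(A)
    p_next = torch.zeros_like(p)
    for m, module in enumerate(modules):
        A_m, p_m = module(A, p, c)
        A_next = A_next + w[m] * A_m          # A^(t+1) = sum_m A_m^(t) w_m^(t)
        p_next = p_next + w[m] * p_m
    return A_next, F.softmax(p_next, dim=0)   # p^(t+1): re-sharpened pointer
```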

Page 12:

Final Output

VQA: collect the outputs from all answer-producing modules across all timesteps:

y = Σ_{t=0}^{T−1} Σ_{m ∈ M(ans)} y_m^(t) · w_m^(t), where M(ans) is the set of answer and compare modules.
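
A minimal sketch of this aggregation (answer_logits[t][m] and w[t][m] are assumed containers of per-step module logits and layout weights):

```python
def aggregate_answers(answer_logits, w, ans_modules, T):
    # y = sum_{t=0}^{T-1} sum_{m in M(ans)} y_m^(t) * w_m^(t)
    y = 0
    for t in range(T):
        for m in ans_modules:
            y = y + w[t][m] * answer_logits[t][m]
    return y
```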

REF: take the image-attention map at the top of the final stack at t = T and extract attended image features from it. Then a linear layer is applied to the attended image features to predict the bounding-box offsets from the feature-grid location.

Page 13:

Experiments

What does the soft layout performance depend on?

❏ How does choice of training task affect it?

Does the soft layout hurt performance?

❏ Comparison with models that use discrete layouts.

Does explicit modular structure make the models more interpretable?

❏ Human evaluation
❏ Comparison with a non-modular model

Page 14:

Dataset

CLEVR VQA: images generated with a graphics engine; focuses on compositional reasoning.

70,000 train | 15,000 val | 15,000 test images | 10 questions/image

CLEVR-Ref: collected by the authors using the same graphics engine.

70,000 train | 15,000 val | 15,000 test images | 10 referential expressions/image

Page 15:

Results: What does the soft layout performance depend on?

Joint training can lead to higher performance on both tasks (especially when not using expert layouts).

Page 16:

The best-performing models do better with layout supervision but fail to converge without it.

Results: Does the soft layout hurt performance?

Page 17:

Real VQA datasets focus more on visual recognition than on compositional reasoning.

Still outperforms N2NMN

Results: Does the soft layout hurt performance?

Page 18:

MAC: also performs multi-step sequential reasoning, with image and textual attention at each step.

Subjective understanding: can you understand what each step is doing from its attention?

Forward prediction: can you tell what the model will predict? [Tells us whether a person can anticipate where the model will go wrong.]

Results: Does explicit modular structure make models more interpretable?

Percentage of each choice

Page 19:

Critique - The Good

● Motivation:
○ A novel idea to increase the applicability of modular neural networks, which are more interpretable.
● Stack-NMN model:
○ A novel end-to-end differentiable training approach for modular networks.
○ The additional advantage of far fewer model parameters [PG+EE: 40.4M, TbD-net: 115M, Stack-NMN: 7.32M].
● Ablation study:
○ Ablated all the important model components, giving the reasoning behind model design decisions.

Page 20:

Critique - The Not So Good

● Dataset:
○ Synthetic datasets are known to suffer from biases. An analysis of the newly created CLEVR-Ref would have been good.
● Stack-NMN model:
○ How many modules are sufficient? [PG+EE, TbD-net: 39 modules | Stack-NMN: 9 modules]
○ Can the modules themselves be made reusable to decrease parameters?
○ Perhaps learnable generic modules?
● Evaluation methodology:
○ Could have given a breakdown of accuracy over Count, Compare Numbers, Exist, Query Attribute, and Compare Attribute.
○ Performance on the CLEVR-CoGenT dataset would have provided an excellent test of generalization.
● Output analysis:
○ Could have shown instances of where the model goes wrong.

Page 21:

Development Since Then

https://arxiv.org/pdf/1905.11532.pdf

- Learnable modules: the cell denotes a generic module, which can span all the required modules for a visual reasoning task.
- Each cell contains a certain number of nodes.
- The function of a node (denoted by O) is to perform a weighted sum of the outputs of different arithmetic operations applied to the input feature maps x1 and x2.
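
A toy sketch of such a node (the candidate operation set here is an assumption for illustration; the actual set is defined in the linked paper):

```python
import torch
import torch.nn.functional as F

OPS = [torch.add, torch.mul, torch.maximum, torch.minimum]  # assumed candidate ops

def node(x1, x2, alpha):
    # weighted sum of the outputs of each arithmetic operation on x1 and x2,
    # with softmax-normalized architecture weights alpha (one weight per op)
    a = F.softmax(alpha, dim=0)
    return sum(a[i] * op(x1, x2) for i, op in enumerate(OPS))
```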

Page 22:

References

- Ronghang Hu, Jacob Andreas, Trevor Darrell, Kate Saenko. Explainable Neural Computation via Stack Neural Module Networks. ECCV 2018.