Multimodal Residual Learning for Visual QA
Transcript of Multimodal Residual Learning for Visual QA
Multimodal Residual Learning for Visual QA
NamHyuk Ahn
Table of Contents
1. Visual QA
2. Stacked Attention Network (SAN)
3. Residual Learning
4. Multimodal Residual Network (MRN)
Visual QA: Evaluation Metric
- Robust to inter-human variability
- Human accuracy is almost 90%
- 248,349 Training questions (82,783 Images)
- 121,512 Validation questions (40,504 Images)
- 244,302 Testing questions (81,434 Images)
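For reference, the accuracy metric used by the VQA dataset (Antol et al.) counts an answer as fully correct when at least three of the ten human annotators gave it:

```latex
\mathrm{Acc}(a) = \min\!\left(\frac{\#\{\text{humans that answered } a\}}{3},\; 1\right)
```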
Stacked Attention Network
Motivation
- Answering a question requires multi-step reasoning
- e.g., an image with objects {bicycles, window, street, baskets, dogs}
- To answer the question well, pinpoint the relevant region.
Q: what are sitting in the basket on a bicycle
Stacked Attention Network (SAN)
- SAN allows multi-step reasoning for visual QA
- An extension of the attention mechanism, which has been successfully applied to captioning, translation, etc.
Q: what are sitting in the basket on a bicycle
Stacked Attention Network
- Image Model: extract image features using a CNN
- Question Model: extract a semantic vector using a CNN or LSTM
- Stacked Attention: multi-step reasoning with attention layers
Image / Question Model
- Image Model
• Get a feature map from the raw-pixel image
• Rescale the image to 448x448; take features from pool5 of VGGNet (14x14x512)
• Add an additional layer to fit the question feature dimension
- Question Model
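The image-model steps can be sketched as follows. This is a minimal NumPy sketch with a random stand-in for the VGGNet pool5 output; the 1024-dim projection size is an assumption, not from the slides.

```python
import numpy as np

# Stand-in for the pool5 feature map of a 448x448 image:
# a 14x14 spatial grid with 512 channels.
feature_map = np.random.randn(14, 14, 512)

# Flatten the spatial grid into 196 region vectors.
regions = feature_map.reshape(-1, 512)      # (196, 512)

# The "additional layer" projecting image features into the
# question-embedding space (projection size 1024 is assumed).
W_i = np.random.randn(512, 1024) * 0.01
v_I = np.tanh(regions @ W_i)                # (196, 1024)
```

Each row of `v_I` then represents one image region in the same space as the question feature, which is what the attention layer operates on.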
Stacked Attention Model
- A global image feature is suboptimal due to noise from irrelevant objects / regions.
- Instead, use the stacked attention model to pinpoint the relevant region.
- Given the image feature matrix and the question vector, compute a 14x14 attention distribution.
- Take the weighted sum of the image vectors over all regions.
- The result is a refined query vector.
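One attention step can be sketched as follows, loosely following the formulation in Yang et al. (bias terms omitted; the dimensions here are toy values, not the paper's):

```python
import numpy as np

def attention_step(v_I, v_Q, W_ia, W_qa, w_p):
    # Combine each of the 196 region vectors with the question vector.
    h_A = np.tanh(v_I @ W_ia + v_Q @ W_qa)   # (196, k)
    # Softmax over regions: the 14x14 attention distribution.
    e = np.exp(h_A @ w_p)
    p_I = e / e.sum()                         # (196,)
    # Weighted sum of the image vectors over regions.
    v_tilde = p_I @ v_I                       # (d,)
    # Refined query vector: attended image feature plus the question.
    return v_tilde + v_Q, p_I

d, k = 64, 32
rng = np.random.default_rng(0)
v_I = rng.normal(size=(196, d))   # image feature matrix, one row per region
v_Q = rng.normal(size=d)          # question vector
u, p_I = attention_step(v_I, v_Q, rng.normal(size=(d, k)),
                        rng.normal(size=(d, k)), rng.normal(size=k))
```

Stacking means feeding the refined query `u` into another identical attention layer, sharpening the attended region at each step.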
Result
Residual Learning
Problem of Degradation
- More depth brings more accuracy, but deep networks suffer from vanishing/exploding gradients
• BN, Xavier initialization, and Dropout can handle this (~30 layers)
- Going even deeper, the degradation problem occurs
• Not just overfitting: the training error also increases
Residual Network (ResNet)
Residual Block
- To avoid the degradation problem, add a shortcut connection.
- Element-wise addition of F(x) and the shortcut connection, then pass through ReLU.
- Similar to LSTM
http://torch.ch/blog/2016/02/04/resnets.html
Shortcut connection
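The residual block can be sketched in a few lines. This uses plain matrix multiplies where the real ResNet uses convolutions, just to show the F(x) + x structure:

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def residual_block(x, W1, W2):
    # F(x): two weight layers with a ReLU in between
    # (convolutions in the real ResNet).
    f = relu(x @ W1) @ W2
    # Element-wise addition of F(x) and the identity shortcut, then ReLU.
    return relu(f + x)

d = 8
rng = np.random.default_rng(1)
x = rng.normal(size=d)
W1 = rng.normal(size=(d, d)) * 0.1
W2 = rng.normal(size=(d, d)) * 0.1
y = residual_block(x, W1, W2)
```

Note that if the weight layers output zero, the block reduces to the identity (after ReLU), which is why depth can be added without increasing training error.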
Multimodal Residual Network
Introduction
- Extends deep residual learning to visual QA
- Achieves state-of-the-art results on the visual QA dataset (not covered today :()
- Introduces a method to visualize the spatial attention effect of joint residual mappings
Background
SAN
- But question information contributes only weakly, causing a bottleneck
Baseline [Lu et al.]
- With just element-wise multiplication, visual and question features embed very well
MRN
- Shortcut mapping and a stacking architecture
- No attention weighted sum
- Instead uses global element-wise multiplication, as [Lu et al.] does
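A single MRN learning block can be sketched as below. This is a simplified sketch, loosely following Kim et al.: the joint residual function multiplies nonlinear mappings of the question and visual features element-wise (no attention weighted sum), and the question shortcut uses an extra linear mapping. The exact number of layers and nonlinearities in the paper differs.

```python
import numpy as np

def mrn_block(q, v, Wq, Wv, Ws):
    # Joint residual function F(q, v): element-wise multiplication of
    # the nonlinearly mapped question and visual features (global
    # multiplication as in Lu et al.'s baseline, no weighted sum).
    F = np.tanh(q @ Wq) * np.tanh(v @ Wv)
    # Question shortcut with an extra linear mapping (an identity
    # shortcut causes degradation here, per the ablation).
    return q @ Ws + F

# Toy dimensions; the real model uses much larger embeddings.
dq, dv, dh = 16, 20, 16
rng = np.random.default_rng(2)
q, v = rng.normal(size=dq), rng.normal(size=dv)
Wq, Wv, Ws = (rng.normal(size=s) * 0.1 for s in [(dq, dh), (dv, dh), (dq, dh)])
h1 = mrn_block(q, v, Wq, Wv, Ws)   # output feeds the next learning block
```

Stacking L such blocks (with the visual feature re-injected at each one) gives the deep residual structure the slides ablate over.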
Quantitative Analysis
- (a) shows a large improvement over SAN; (b) is better still.
- (c): adding an extra embedding to the question causes overfitting.
- (d): an identity shortcut causes degradation (an extra linear mapping is needed).
- (e) performs reasonably, but the extra shortcut is not essential.
Quantitative Analysis
# of Learning Blocks
- 58.85% (L=1), 59.44% (L=2), 60.53% (L=3), 60.42% (L=4)
Visual Features
- ResNet-152 is significantly better than VGGNet,
- even though ResNet's feature dimension is smaller (2048 vs. 4096)
# of Answer Classes
- There is a trade-off among answer types, but 2k classes is best
- Implicit attention via element-wise multiplication
- Yields a high-resolution attention map
References
- Yang, Zichao, et al. "Stacked attention networks for image question answering." arXiv preprint arXiv:1511.02274 (2015).
- Kim, Jin-Hwa, et al. "Multimodal Residual Learning for Visual QA." arXiv preprint arXiv:1606.01455 (2016).
- Antol, Stanislaw, et al. "Vqa: Visual question answering." Proceedings of the IEEE International Conference on Computer Vision. 2015.