Multimodal Residual Learning for Visual QA
Transcript of Multimodal Residual Learning for Visual QA
Multimodal Residual Learning for Visual QA
NamHyuk Ahn
Table of Contents
1. Visual QA
2. Stacked Attention Network (SAN)
3. Residual Learning
4. Multimodal Residual Network (MRN)
Visual QA: Evaluation Metric
- Robust to inter-human variability
- Human accuracy is almost 90%
- 248,349 Training questions (82,783 Images)
- 121,512 Validation questions (40,504 Images)
- 244,302 Testing questions (81,434 Images)
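For reference, the accuracy metric used by the VQA dataset (Antol et al.) counts an answer as fully correct when at least three of the ten human annotators gave it:

```latex
\mathrm{Acc}(a) = \min\!\left(\frac{\#\{\text{humans that answered } a\}}{3},\; 1\right)
```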
Stacked Attention Network
Motivation
- Answering a question requires multi-step reasoning
- e.g., an image with objects {bicycles, window, street, baskets, dogs}
- To answer the question well, pinpoint the relevant region.
Q: what are sitting in the basket on a bicycle
Stacked Attention Network (SAN)
- SAN allows multi-step reasoning for visual QA
- An extension of the attention mechanism, which has been successfully applied to captioning, translation, etc.
Q: what are sitting in the basket on a bicycle
Stacked Attention Network
- Image Model: extract image features using a CNN
- Question Model: extract a semantic vector using a CNN or LSTM
- Stacked Attention: multi-step reasoning with attention layers
Image / Question Model
- Image Model
• Get a feature map from the raw-pixel image
• Rescale the image to 448x448; take features from pool5 of VGGNet (14x14x512)
• Add an additional layer to fit the question feature dimension
- Question Model
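The image-model steps can be sketched as follows. This is a minimal NumPy sketch with a random stand-in for the VGGNet pool5 output; the 1024-dim projection size is an assumption, not from the slides.

```python
import numpy as np

# Stand-in for the pool5 feature map of a 448x448 image:
# a 14x14 spatial grid with 512 channels.
feature_map = np.random.randn(14, 14, 512)

# Flatten the spatial grid into 196 region vectors.
regions = feature_map.reshape(-1, 512)      # (196, 512)

# The "additional layer" projecting image features into the
# question-embedding space (projection size 1024 is assumed).
W_i = np.random.randn(512, 1024) * 0.01
v_I = np.tanh(regions @ W_i)                # (196, 1024)
```

Each row of `v_I` then represents one image region in the same space as the question feature, which is what the attention layer operates on.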
Stacked Attention Model
- A global image feature is suboptimal due to noise from irrelevant objects / regions.
- Instead, use the stacked attention model to pinpoint the relevant region.
- Given the image feature matrix and the question vector, compute a 14x14 attention distribution.
- Take the weighted sum of the image vectors over all regions.
- The result is a refined query vector.
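One attention step can be sketched as follows, loosely following the formulation in Yang et al. (bias terms omitted; the dimensions here are toy values, not the paper's):

```python
import numpy as np

def attention_step(v_I, v_Q, W_ia, W_qa, w_p):
    # Combine each of the 196 region vectors with the question vector.
    h_A = np.tanh(v_I @ W_ia + v_Q @ W_qa)   # (196, k)
    # Softmax over regions: the 14x14 attention distribution.
    e = np.exp(h_A @ w_p)
    p_I = e / e.sum()                         # (196,)
    # Weighted sum of the image vectors over regions.
    v_tilde = p_I @ v_I                       # (d,)
    # Refined query vector: attended image feature plus the question.
    return v_tilde + v_Q, p_I

d, k = 64, 32
rng = np.random.default_rng(0)
v_I = rng.normal(size=(196, d))   # image feature matrix, one row per region
v_Q = rng.normal(size=d)          # question vector
u, p_I = attention_step(v_I, v_Q, rng.normal(size=(d, k)),
                        rng.normal(size=(d, k)), rng.normal(size=k))
```

Stacking means feeding the refined query `u` into another identical attention layer, sharpening the attended region at each step.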
Result
Residual Learning
Problem of Degradation
- More depth brings more accuracy, but deep networks suffer from vanishing/exploding gradients
• BN, Xavier initialization, and Dropout can handle this (~30 layers)
- Going even deeper, the degradation problem occurs
• Not just overfitting: the training error also increases
Residual Network (ResNet)
Residual Block
- To avoid the degradation problem, add a shortcut connection.
- Element-wise addition of F(x) and the shortcut connection, then pass through ReLU.
- Similar to LSTM
http://torch.ch/blog/2016/02/04/resnets.html
Shortcut connection
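The residual block can be sketched in a few lines. This uses plain matrix multiplies where the real ResNet uses convolutions, just to show the F(x) + x structure:

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def residual_block(x, W1, W2):
    # F(x): two weight layers with a ReLU in between
    # (convolutions in the real ResNet).
    f = relu(x @ W1) @ W2
    # Element-wise addition of F(x) and the identity shortcut, then ReLU.
    return relu(f + x)

d = 8
rng = np.random.default_rng(1)
x = rng.normal(size=d)
W1 = rng.normal(size=(d, d)) * 0.1
W2 = rng.normal(size=(d, d)) * 0.1
y = residual_block(x, W1, W2)
```

Note that if the weight layers output zero, the block reduces to the identity (after ReLU), which is why depth can be added without increasing training error.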
Multimodal Residual Network
Introduction
- Extends deep residual learning to visual QA
- Achieves state-of-the-art results on the visual QA dataset (not covered today :()
- Introduces a method to visualize the spatial attention effect of joint residual mappings
Background
SAN
- But question information contributes only weakly, causing a bottleneck
Baseline [Lu et al.]
- With just element-wise multiplication, visual and question features embed very well
MRN
- Shortcut mapping and a stacking architecture
- No attention weighted sum
- Instead uses global element-wise multiplication, as [Lu et al.] does
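A single MRN learning block can be sketched as below. This is a simplified sketch, loosely following Kim et al.: the joint residual function multiplies nonlinear mappings of the question and visual features element-wise (no attention weighted sum), and the question shortcut uses an extra linear mapping. The exact number of layers and nonlinearities in the paper differs.

```python
import numpy as np

def mrn_block(q, v, Wq, Wv, Ws):
    # Joint residual function F(q, v): element-wise multiplication of
    # the nonlinearly mapped question and visual features (global
    # multiplication as in Lu et al.'s baseline, no weighted sum).
    F = np.tanh(q @ Wq) * np.tanh(v @ Wv)
    # Question shortcut with an extra linear mapping (an identity
    # shortcut causes degradation here, per the ablation).
    return q @ Ws + F

# Toy dimensions; the real model uses much larger embeddings.
dq, dv, dh = 16, 20, 16
rng = np.random.default_rng(2)
q, v = rng.normal(size=dq), rng.normal(size=dv)
Wq, Wv, Ws = (rng.normal(size=s) * 0.1 for s in [(dq, dh), (dv, dh), (dq, dh)])
h1 = mrn_block(q, v, Wq, Wv, Ws)   # output feeds the next learning block
```

Stacking L such blocks (with the visual feature re-injected at each one) gives the deep residual structure the slides ablate over.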
Quantitative Analysis
- (a) shows a large improvement over SAN; (b) is better still.
- (c): adding an extra embedding to the question causes overfitting.
- (d): an identity shortcut causes degradation (an extra linear mapping is needed).
- (e) performs reasonably, but the extra shortcut is not essential.
Quantitative Analysis
# of Learning Blocks
- 58.85% (L=1), 59.44% (L=2), 60.53% (L=3), 60.42% (L=4)
Visual Features
- ResNet-152 is significantly better than VGGNet,
- even though ResNet's feature dimension is smaller (2048 vs. 4096)
# of Answer Classes
- There is a trade-off among answer types, but 2k classes is best
- Implicit attention via element-wise multiplication
- Yields a high-resolution attention map
References
- Yang, Zichao, et al. "Stacked attention networks for image question answering." arXiv preprint arXiv:1511.02274 (2015).
- Kim, Jin-Hwa, et al. "Multimodal Residual Learning for Visual QA." arXiv preprint arXiv:1606.01455 (2016).
- Antol, Stanislaw, et al. "Vqa: Visual question answering." Proceedings of the IEEE International Conference on Computer Vision. 2015.