TAB-VCR: Tags and Attributes based VCR Baselines · Question & correct answer Rationales [Tags]...
Transcript of TAB-VCR: Tags and Attributes based VCR Baselines · Question & correct answer Rationales [Tags]...
![Page 1: TAB-VCR: Tags and Attributes based VCR Baselines · Question & correct answer Rationales [Tags] Zellers et al. CVPR 2019 From Recognition to Cognition: Visual Commonsense Reasoning](https://reader034.fdocuments.net/reader034/viewer/2022052010/601f90d1d8555956607a252b/html5/thumbnails/1.jpg)
TAB-VCR: Tags and Attributes based VCR Baselines
Jingxiang Lin, Unnat Jain, Alexander Schwing
NeurIPS 2019
University of Illinois at Urbana-Champaign
![Page 2: TAB-VCR: Tags and Attributes based VCR Baselines · Question & correct answer Rationales [Tags] Zellers et al. CVPR 2019 From Recognition to Cognition: Visual Commonsense Reasoning](https://reader034.fdocuments.net/reader034/viewer/2022052010/601f90d1d8555956607a252b/html5/thumbnails/2.jpg)
Visual Commonsense Reasoning (VCR)
Answers
Question How did [0] and [1] get here?
a) They traveled in a cart.b) [0, 1] got [1] released from jail.c) [0, 1] took the stairs to get up there.d) They both got splashed. Zellers et al. CVPR 2019
From Recognition to Cognition: Visual Commonsense Reasoning
Subtask 1Question
Answering
[Correct]
![Page 3: TAB-VCR: Tags and Attributes based VCR Baselines · Question & correct answer Rationales [Tags] Zellers et al. CVPR 2019 From Recognition to Cognition: Visual Commonsense Reasoning](https://reader034.fdocuments.net/reader034/viewer/2022052010/601f90d1d8555956607a252b/html5/thumbnails/3.jpg)
Visual Commonsense Reasoning (VCR)
Zellers et al. CVPR 2019 From Recognition to Cognition: Visual Commonsense Reasoning
[Tags]
Answers
Question How did [0] and [1] get here?
a) They traveled in a cart.b) [0, 1] got [1] released from jail.c) [0, 1] took the stairs to get up there.d) They both got splashed.
Subtask 1Question
Answering
[Correct]
![Page 4: TAB-VCR: Tags and Attributes based VCR Baselines · Question & correct answer Rationales [Tags] Zellers et al. CVPR 2019 From Recognition to Cognition: Visual Commonsense Reasoning](https://reader034.fdocuments.net/reader034/viewer/2022052010/601f90d1d8555956607a252b/html5/thumbnails/4.jpg)
Visual Commonsense Reasoning (VCR)
a) Presumably they came here to get something from the store.b) They are at a market and [0]'s clothes look like the locals in the background.c) [1] is holding a bag which people often use to carry groceries.d) The cart beside them is likely their mode of transportation.
Question & correct answer
Rationales
[Tags]
Zellers et al. CVPR 2019 From Recognition to Cognition: Visual Commonsense Reasoning
Subtask 2Answer
Justification
How did [0, 1] get here? They traveled in a cart.
[Correct]
![Page 5: TAB-VCR: Tags and Attributes based VCR Baselines · Question & correct answer Rationales [Tags] Zellers et al. CVPR 2019 From Recognition to Cognition: Visual Commonsense Reasoning](https://reader034.fdocuments.net/reader034/viewer/2022052010/601f90d1d8555956607a252b/html5/thumbnails/5.jpg)
1. Simple base network with fewer than half the trainable parameters 2. Leverage attribute information about objects3. Improve image-text grounding by adding new tags
Contributions
![Page 6: TAB-VCR: Tags and Attributes based VCR Baselines · Question & correct answer Rationales [Tags] Zellers et al. CVPR 2019 From Recognition to Cognition: Visual Commonsense Reasoning](https://reader034.fdocuments.net/reader034/viewer/2022052010/601f90d1d8555956607a252b/html5/thumbnails/6.jpg)
1. Simple base network
![Page 7: TAB-VCR: Tags and Attributes based VCR Baselines · Question & correct answer Rationales [Tags] Zellers et al. CVPR 2019 From Recognition to Cognition: Visual Commonsense Reasoning](https://reader034.fdocuments.net/reader034/viewer/2022052010/601f90d1d8555956607a252b/html5/thumbnails/7.jpg)
1. Simple base network
![Page 8: TAB-VCR: Tags and Attributes based VCR Baselines · Question & correct answer Rationales [Tags] Zellers et al. CVPR 2019 From Recognition to Cognition: Visual Commonsense Reasoning](https://reader034.fdocuments.net/reader034/viewer/2022052010/601f90d1d8555956607a252b/html5/thumbnails/8.jpg)
1. Simple base network
![Page 9: TAB-VCR: Tags and Attributes based VCR Baselines · Question & correct answer Rationales [Tags] Zellers et al. CVPR 2019 From Recognition to Cognition: Visual Commonsense Reasoning](https://reader034.fdocuments.net/reader034/viewer/2022052010/601f90d1d8555956607a252b/html5/thumbnails/9.jpg)
1. Simple base network
![Page 10: TAB-VCR: Tags and Attributes based VCR Baselines · Question & correct answer Rationales [Tags] Zellers et al. CVPR 2019 From Recognition to Cognition: Visual Commonsense Reasoning](https://reader034.fdocuments.net/reader034/viewer/2022052010/601f90d1d8555956607a252b/html5/thumbnails/10.jpg)
1. Simple base network
![Page 11: TAB-VCR: Tags and Attributes based VCR Baselines · Question & correct answer Rationales [Tags] Zellers et al. CVPR 2019 From Recognition to Cognition: Visual Commonsense Reasoning](https://reader034.fdocuments.net/reader034/viewer/2022052010/601f90d1d8555956607a252b/html5/thumbnails/11.jpg)
1. Simple base network
![Page 12: TAB-VCR: Tags and Attributes based VCR Baselines · Question & correct answer Rationales [Tags] Zellers et al. CVPR 2019 From Recognition to Cognition: Visual Commonsense Reasoning](https://reader034.fdocuments.net/reader034/viewer/2022052010/601f90d1d8555956607a252b/html5/thumbnails/12.jpg)
Anderson et al. CVPR 2018Bottom-Up and Top-Down Attention for Image Captioning and Visual Question Answering
2. Leverage attribute information
![Page 13: TAB-VCR: Tags and Attributes based VCR Baselines · Question & correct answer Rationales [Tags] Zellers et al. CVPR 2019 From Recognition to Cognition: Visual Commonsense Reasoning](https://reader034.fdocuments.net/reader034/viewer/2022052010/601f90d1d8555956607a252b/html5/thumbnails/13.jpg)
2. Leverage attribute information
![Page 14: TAB-VCR: Tags and Attributes based VCR Baselines · Question & correct answer Rationales [Tags] Zellers et al. CVPR 2019 From Recognition to Cognition: Visual Commonsense Reasoning](https://reader034.fdocuments.net/reader034/viewer/2022052010/601f90d1d8555956607a252b/html5/thumbnails/14.jpg)
Q: How did [0] and [1] get here?
a) They traveled in a cart.b) [0, 1] got [1] released from jail.c) [0, 1] took the stairs to get up there.d) They both got splashed.
Q: How did [0, 1] get hereA: They traveled in a cart
a) Presumably they came here to get something from the store.b) They are at a market and [0]'s clothes look like the locals in the background.c) [1] is holding a bag which people often use to carry groceries.d) The cart beside them is likely their mode of transportation.
3. Adding new tags
![Page 15: TAB-VCR: Tags and Attributes based VCR Baselines · Question & correct answer Rationales [Tags] Zellers et al. CVPR 2019 From Recognition to Cognition: Visual Commonsense Reasoning](https://reader034.fdocuments.net/reader034/viewer/2022052010/601f90d1d8555956607a252b/html5/thumbnails/15.jpg)
Q: How did [0] and [1] get here?
a) They traveled in a [cart].b) [0, 1] got [1] released from jail.c) [0, 1] took the stairs to get up there.d) They both got splashed.
Q: How did [0, 1] get hereA: They traveled in a cart
a) Presumably they came here to get something from the store.b) They are at a market and [0]'s clothes look like the locals in the background.c) [1] is holding a bag which people often use to carry groceries.d) The [cart] beside them is likely their mode of transportation.
3. Adding new tags
![Page 16: TAB-VCR: Tags and Attributes based VCR Baselines · Question & correct answer Rationales [Tags] Zellers et al. CVPR 2019 From Recognition to Cognition: Visual Commonsense Reasoning](https://reader034.fdocuments.net/reader034/viewer/2022052010/601f90d1d8555956607a252b/html5/thumbnails/16.jpg)
Vision Backbone Subtask 1Q à A
Subtask 2QA à R
Trainable Params (Mn)
R2C (Zellers et al.) Resnet 50 63.8 67.2 26.8
(1) Base (ours) Resnet 101 67.50 69.75 4.9
(2) Base + attributes (ours) Resnet 101 69.51 71.57 4.7
(3) Base + attributes + new tags (TAB-VCR) Resnet 101 69.89 72.15 4.7
Results