Visual7W Grounded Question Answering in Images
Visual7W: Grounded Question Answering in Images
Yuke Zhu, Oliver Groth, Michael Bernstein, Li Fei-Fei
Slides by Issey Masuda Mora, Computer Vision Reading Group (09/05/2016)
[arXiv] [web] [GitHub]
Context
Visual Question Answering
Goal: predict the answer to a given question about an image
Motivation
New Turing test? How to evaluate AI’s image understanding?
Visual7W
The 7W
WHAT
WHERE
WHEN
WHO
WHY
HOW
WHICH
Questions are multiple-choice: 4 candidate answers, only one correct
Grounding: image-text correspondences
Exploit the relation between image regions and nouns in the questions
The new answer is...
Question-answer types:
● Telling questions: the answer is text
● Pointing questions: a new QA type introduced by the authors, where the answers are image regions
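The two answer modalities can be sketched as simple record types; the field names and box layout below are illustrative assumptions, not taken from the paper:

```python
from dataclasses import dataclass
from typing import List, Tuple

# Hypothetical (x, y, width, height) bounding-box representation
Box = Tuple[int, int, int, int]

@dataclass
class TellingQA:
    """A 'telling' question: the four candidate answers are text."""
    question: str
    candidates: List[str]
    answer_idx: int  # index of the correct candidate

@dataclass
class PointingQA:
    """A 'pointing' question: the four candidate answers are image regions."""
    question: str
    candidates: List[Box]
    answer_idx: int  # index of the correct region

qa = TellingQA("Who is under the umbrella?",
               ["Two women", "A dog", "A vendor", "Nobody"], 0)
```

Both types share the 4-candidate multiple-choice format; only the candidate type changes.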
Related work
Common approach
Example: "Who is under the umbrella?"
Pipeline: extract visual features from the image and embed the question, then merge both representations and predict the answer: "Two women"
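The common pipeline above can be sketched end to end. This is a toy with untrained random weights, only showing the data flow (extract, embed, merge, score candidates); the `embed` function is a deterministic stand-in for a real CNN/LSTM encoder, and all names are hypothetical:

```python
import numpy as np

def embed(text, dim=16):
    """Toy deterministic embedding -- a stand-in for a real CNN/LSTM encoder."""
    seed = sum(ord(c) for c in text)
    return np.random.default_rng(seed).standard_normal(dim)

def answer(image_id, question, candidates, dim=16):
    visual = embed(image_id, dim)             # "extract visual features"
    q = embed(question, dim)                  # "embedding" of the question
    merged = np.concatenate([visual, q])      # "merge" the two modalities
    W = np.random.default_rng(0).standard_normal((dim, 2 * dim))
    joint = W @ merged                        # project merged features
    scores = [embed(c, dim) @ joint for c in candidates]  # score each candidate
    return int(np.argmax(scores))             # "predict answer": best candidate

idx = answer("img_001", "Who is under the umbrella?",
             ["Two women", "A dog", "A vendor", "Nobody"])
```

With trained encoders and projection weights, `argmax` over candidate scores is how the multiple-choice answer is selected.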
The Dataset
Visual7W Dataset
Characteristics:
● 47,300 images from the COCO dataset
● 327,939 QA pairs
● 561,459 object bounding boxes spread across 36,579 categories
Creating the Dataset
Procedure:
● Write QA pairs
● 3 AMT workers evaluate each pair as good or bad
● Only pairs with at least 2 good evaluations are kept
● Write the 3 wrong answers (given the right one)
● Extract object names and draw a bounding box for each one
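The majority-vote filtering step above can be sketched as a small function (function and variable names are illustrative):

```python
def keep_qa_pairs(pairs_with_votes, min_good=2):
    """Keep QA pairs that at least `min_good` of the 3 AMT workers rated 'good'."""
    return [qa for qa, votes in pairs_with_votes
            if sum(v == "good" for v in votes) >= min_good]

pairs = [("Who is driving?",        ["good", "good", "bad"]),
         ("What color is the sky?", ["bad", "good", "bad"]),
         ("Where is the cat?",      ["good", "good", "good"])]
kept = keep_qa_pairs(pairs)  # keeps the 1st and 3rd pair
```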
The Model
Attention-based model for pointing questions
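The core of such a model is soft attention over image regions. The sketch below is a generic soft-attention step, not the paper's exact architecture: region features are scored against the question state and combined by a softmax-weighted sum.

```python
import numpy as np

def attend(region_feats, query):
    """Soft attention over image regions.

    region_feats: (R, D) array, one feature vector per image region
    query:        (D,) vector summarizing the question so far
    Returns softmax attention weights (R,) and the attended feature (D,).
    """
    scores = region_feats @ query              # relevance of each region
    weights = np.exp(scores - scores.max())    # numerically stable softmax
    weights /= weights.sum()
    context = weights @ region_feats           # weighted sum of region features
    return weights, context
```

For pointing questions, the region with the highest weight (or score) can be read off directly as the predicted answer region.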
Experiments & Results
Experiments
Different experiments were conducted depending on the information given to the subject:
● Only the question
● Question + image

Subjects/models:
● Human
● Logistic regression
● LSTM
● LSTM + attention model
Results
Conclusions
Conclusions
● A visual QA model has been presented
● Attention model to focus on local regions of the image
● Dataset created with groundings