Vision, Language and Action


Transcript of Vision, Language and Action

Page 1: Vision, Language and Action

{name.surname}@unimore.it

University of Modena and Reggio Emilia, Italy

Lorenzo Baraldi, Marcella Cornia, Massimiliano Corsini

Vision, Language and Action: from Captioning to Embodied AI

Page 2: Vision, Language and Action

{name.surname}@unimore.it

University of Modena and Reggio Emilia, Italy

Massimiliano Corsini

Part V

Embodied AI and VLN

Page 3: Vision, Language and Action

• Generally speaking, embodied means to give a physical body to something. In this context, it means giving an Artificial Intelligence algorithm “a body”, so that an agent can solve tasks in an environment.

• Tasks involved in embodied AI:

• Embodied Visual Recognition

• Embodied Question Answering

• Interactive Question Answering

• Visual Navigation

• Vision-and-Language Navigation

Embodied AI


Manolis Savva, Abhishek Kadian, Oleksandr Maksymets, Yili Zhao, Erik Wijmans, Bhavana Jain, Julian Straub, Jia Liu, Vladlen Koltun, Jitendra Malik, Devi Parikh, Dhruv Batra, “Habitat: A Platform for Embodied AI Research”, ArXiv: 1904.01201, 2019.

Page 4: Vision, Language and Action

Embodied AI

Figure adapted from Savva et al., “Habitat: A Platform for Embodied AI Research”, arXiv:1904.01201, 2019.

Page 5: Vision, Language and Action

• In the following we describe Vision-and-Language Navigation (VLN) in depth, a very recent and active research trend.

• VLN has been defined as:

• Interpret a previously unseen natural language navigation command in light of images generated by a previously unseen real environment (Anderson et al. CVPR 2018)

• Follow a given instruction to navigate from a starting location to a goal location (Fried et al. NeurIPS 2018)

• Reach a target location by navigating unseen environments, with a natural language instruction as the only clue (Landi et al. BMVC 2019)

Embodied AI – VLN


Page 6: Vision, Language and Action

• Datasets of spaces:

• ScanNet (Dai et al. 2017)

• Stanford 2D-3D-Semantics (Armeni et al. 2017)

• Matterport3D (Chang et al. 2017)

• Replica (Straub et al. 2019)

• Datasets for VLN:

• R2R – Room to Room (Anderson et al. 2018)

• Touchdown (Chen et al. 2019)

• Simulation Environments

• Matterport3D Simulator (Anderson et al. 2018)

• Gibson (Zamir et al. 2018)

• Habitat (Savva et al. 2019)

Key Aspects


Page 7: Vision, Language and Action

• ScanNet is an RGB-D video dataset containing reconstructed scenes with instance-level semantic segmentations.

• 2.5 million RGB-D frames acquired with a custom device (similar to a Kinect)

• 1,513 scenes reconstructed (volume fusion)

• Instance-level semantic segmentation (20 classes + one class for free space) obtained through 3D CAD model alignment

ScanNet (Dai et al. 2017)


Angela Dai, Angel X. Chang, Manolis Savva, Maciej Halber, Thomas Funkhouser, Matthias Niessner, “ScanNet: Richly-annotated 3D Reconstructions of Indoor Scenes”, Proc. Computer Vision and Pattern Recognition (CVPR), 2017.

Page 8: Vision, Language and Action

• Data collected using a Matterport Camera for indoor acquisition

• > 6,000 m2 of indoor environment

• 1,413 equirectangular RGB images with corresponding depth and surface normals, plus instance-level semantic data

• Annotation is performed in 3D, then projected onto the images (13 object classes)

Stanford 2D-3D-S (Armeni et al. 2017)

Iro Armeni, Sasha Sax, Amir Roshan Zamir, Silvio Savarese, “Joint 2D-3D-Semantic Data for Indoor Scene Understanding”, arXiv:1702.01105, 2017.

Page 9: Vision, Language and Action

• Created using the Matterport Camera (again)

• 90 buildings, 10,800 panoramic views, 194,400 RGB-D images

• Corresponding textured 3D models

• Instance segmentation is provided

Matterport3D (Chang et al. – 3DV2017)


A. Chang, A. Dai, T. Funkhouser, M. Halber, M. Niessner, M. Savva, S. Song, A. Zeng, Y. Zhang, ”Matterport3D: Learning from RGB-D Data in Indoor Environments”, International Conference on 3D Vision (3DV 2017), 2017.

Page 10: Vision, Language and Action

• Ultra high photo-realism (Replica Turing Test)

• Designed for many tasks: egocentric vision, semantic segmentation, geometric inference, development of embodied agents (VLN, VQA)

• Acquired using a custom device (RGBD rig with IR projector) plus manual refinement

• 18 scenes of real world environments

• Dense high-quality meshes, HDR textures

• Semantic annotations are created in parallel with a 2D instance-level masking tool, then transferred to the mesh using a voting scheme; the final result is a semantic forest with 88 classes.

• A minimal SDK to render the dataset is provided.

Replica (Straub et al. 2019)


Straub et al. “The Replica Dataset: A Digital Replica of Indoor Spaces”, arXiv:1906.05797, 2019.

Page 11: Vision, Language and Action

Dataset comparison


Straub et al. “The Replica Dataset: A Digital Replica of Indoor Spaces”, arXiv:1906.05797, 2019.

Page 12: Vision, Language and Action

• Gibson is a perception and physics simulator for the development of embodied agents (developed at Stanford).

• The renderer works with data acquired with the Matterport camera.

• Different types of robotic agents can be trained.

• Physics engine (based on PyBullet).

• It tries to bridge the perceptual gap between rendered and real-world images.

• No support for agent-agent interactions.

Gibson


Fei Xia, Amir Roshan Zamir, Zhi-Yang He, Alexander Sax, Jitendra Malik, Silvio Savarese, “Gibson Env: Real-World Perception for Embodied Agents”, CVPR 2018, pp.9068-9079.

Page 13: Vision, Language and Action

Gibson – Rendering Engine


Fei Xia, Amir Roshan Zamir, Zhi-Yang He, Alexander Sax, Jitendra Malik, Silvio Savarese, “Gibson Env: Real-World Perception for Embodied Agents”, CVPR 2018, pp.9068-9079.

Page 14: Vision, Language and Action

Gibson – Filling the gap between rendering and real images


Fei Xia, Amir Roshan Zamir, Zhi-Yang He, Alexander Sax, Jitendra Malik, Silvio Savarese, “Gibson Env: Real-World Perception for Embodied Agents”, CVPR 2018, pp.9068-9079.

Page 15: Vision, Language and Action

• Habitat is an open source framework for embodied AI

• Two main components:

• Habitat-SIM → 3D simulator

• Habitat-API → high-level library for end-to-end development

• Features:

• High-quality rendering

• Generic dataset support: Matterport3D, Gibson, Replica

• Agents can be configured and equipped with different sensors

• Human-as-agent → this allows investigating human-agent and human-human interactions

Habitat


Manolis Savva, Abhishek Kadian, Oleksandr Maksymets, Yili Zhao, Erik Wijmans, Bhavana Jain, Julian Straub, Jia Liu, Vladlen Koltun, Jitendra Malik, Devi Parikh, Dhruv Batra, “Habitat: A Platform for Embodied AI Research”, ArXiv: 1904.01201, 2019.

Page 16: Vision, Language and Action

• Builds upon the Matterport3D dataset of spaces (Chang et al. 3DV 2017)

• 90 different buildings

• ~7k navigation paths

• 3 different descriptions / path

• ~29 words / instruction on average

• 2 different validation splits

• Test server with public leaderboard

R2R – Room to Room Benchmark


P. Anderson and Q. Wu and D. Teney and J. Bruce and M. Johnson and N. Sunderhauf and I. Reid and S. Gould and A. van den Hengel, “Vision-and-Language Navigation: Interpreting visually-grounded navigation instructions in real environments”, in Proc. of CVPR2018.

Page 17: Vision, Language and Action

R2R – Room to Room Benchmark


P. Anderson and Q. Wu and D. Teney and J. Bruce and M. Johnson and N. Sunderhauf and I. Reid and S. Gould and A. van den Hengel, “Vision-and-Language Navigation: Interpreting visually-grounded navigation instructions in real environments”, in Proc. of CVPR2018.

Page 18: Vision, Language and Action

• Real-life urban environment (built on Google StreetView image data)

• Touchdown task: reach a goal position by following the given instructions (navigation task), then resolve a spatial description by finding a hidden teddy bear in the observed image (spatial description resolution, SDR, task).

• Referring expression vs SDR:

• A referring expression discriminates an object w.r.t other objects.

• An SDR sentence describes a specific location rather than discriminating between objects.

• 9,326 examples of English instructions

• 27,575 spatial description resolution tasks

• Language more complex than R2R

• Qualified annotators.

Touchdown (Chen et al. – CVPR2019)


H. Chen, A. Suhr, D. Misra, N. Snavely, Y. Artzi, “Touchdown: Natural language navigation and spatial reasoning in visual street environments”, in Proc. of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR2019), 2019, pp. 12538-12547.

Page 19: Vision, Language and Action

Touchdown (Chen et al. – CVPR2019)


H. Chen, A. Suhr, D. Misra, N. Snavely, Y. Artzi, “Touchdown: Natural language navigation and spatial reasoning in visual street environments”, in Proc. of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR2019), 2019, pp. 12538-12547.

SDR task

Page 20: Vision, Language and Action

• The first work to introduce the Vision-and-Language Navigation (VLN) task.

• Both VQA and VLN can be seen as sequence-to-sequence transcoding. In VLN, however, the sequences are typically much longer than in VQA, and the model outputs actions.

• Contribution:

• Matterport3D Simulator → a framework for visual reinforcement learning built on the Matterport3D dataset.

• Room-to-Room (R2R) → the first benchmark for VLN.

• A sequence-to-sequence neural network to solve the problem.

Algorithms – Anderson et al. – CVPR2018


Peter Anderson, Qi Wu, Damien Teney, Jake Bruce, Mark Johnson, Niko Sünderhauf, Ian D. Reid, Stephen Gould, Anton van den Hengel, “Vision-and-Language Navigation: Interpreting visually-grounded navigation instructions in real environments”, CVPR 2018.

Page 21: Vision, Language and Action

• Navigation graph: G = ⟨V, E⟩

• At each step, the agent selects the next viewpoint v_{t+1} ∈ W_{t+1} from the set of reachable viewpoints W_{t+1}, and adjusts the camera heading (azimuth angle) and elevation.

• The action space is: turn left, turn right, look up, look down, move forward, and stop.

• Proposed baseline:

• LSTM-based sequence-to-sequence architecture with an attention mechanism (Bahdanau et al. 2015)

Algorithms – Anderson et al. – CVPR2018


Peter Anderson, Qi Wu, Damien Teney, Jake Bruce, Mark Johnson, Niko Sünderhauf, Ian D. Reid, Stephen Gould, Anton van den Hengel, “Vision-and-Language Navigation: Interpreting visually-grounded navigation instructions in real environments”, CVPR 2018.

Page 22: Vision, Language and Action

• Language instruction encoding: each word x_i is presented sequentially to the LSTM encoder as an embedding vector:

h_i = LSTM_enc(x_i, h_{i−1})

• Image and action embedding: image features are extracted using a ResNet-152 trained on ImageNet, and an embedding is learned for each action. The encoded image and the previous action embedding are concatenated to form a vector q_t:

h′_t = LSTM_dec(q_t, h′_{t−1})

• Finally, an attention mechanism is applied to compute an instruction context c_t = f(h, h′_t) before predicting the action a_t.
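To make the pipeline concrete, here is a minimal PyTorch-style sketch of one decoding step of such an attentive sequence-to-sequence agent. Module names, feature sizes, and the dot-product attention are illustrative assumptions, not the authors' implementation:

import torch
import torch.nn as nn
import torch.nn.functional as F

class Seq2SeqAgent(nn.Module):
    """Minimal sketch of an attentive seq2seq VLN agent (illustrative, not the original code)."""
    def __init__(self, vocab_size, n_actions, emb=256, hid=512, img_dim=2048):
        super().__init__()
        self.word_emb = nn.Embedding(vocab_size, emb)
        self.action_emb = nn.Embedding(n_actions, emb)
        self.encoder = nn.LSTM(emb, hid, batch_first=True)
        self.decoder = nn.LSTMCell(img_dim + emb, hid)
        self.att = nn.Linear(hid, hid)                # simple dot-product attention projection
        self.policy = nn.Linear(hid * 2, n_actions)

    def encode(self, instruction_tokens):
        # h: (B, L, hid) encoder states, one per instruction word
        h, _ = self.encoder(self.word_emb(instruction_tokens))
        return h

    def step(self, enc_states, img_feat, prev_action, state=None):
        # q_t = [image features ; previous action embedding]
        q_t = torch.cat([img_feat, self.action_emb(prev_action)], dim=-1)
        h_t, c_t = self.decoder(q_t, state)
        # attention over the instruction: context c_t = f(h, h'_t)
        scores = torch.bmm(enc_states, self.att(h_t).unsqueeze(-1)).squeeze(-1)
        alpha = F.softmax(scores, dim=-1)
        ctx = torch.bmm(alpha.unsqueeze(1), enc_states).squeeze(1)
        # distribution over the actions a_t
        logits = self.policy(torch.cat([h_t, ctx], dim=-1))
        return logits, (h_t, c_t)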

Algorithms – Anderson et al. – CVPR2018


Peter Anderson, Qi Wu, Damien Teney, Jake Bruce, Mark Johnson, Niko Sünderhauf, Ian D. Reid, Stephen Gould, Anton van den Hengel, “Vision-and-Language Navigation: Interpreting visually-grounded navigation instructions in real environments”, CVPR 2018.

Page 23: Vision, Language and Action

• NE (Navigation Error)

• Distance between the agent final position and the goal

• SR (Success Rate)

• Fraction of episodes terminated within 3 meters of the goal

• OSR (Oracle SR)

• SR that the agent would have achieved if it received an oracle stop signal

• SPL (SR weighted by Path Length)

• SR weighted by normalized inverse path length (penalizes overlong navigations)
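For clarity, a small sketch of how these metrics can be computed for a single episode; the geodesic-distance helper geo_dist(a, b) is an assumed callable, and this is an illustrative simplification rather than an official evaluation script:

def navigation_error(final_pos, goal, geo_dist):
    """NE: geodesic distance between the agent's final position and the goal."""
    return geo_dist(final_pos, goal)

def success(final_pos, goal, geo_dist, threshold=3.0):
    """SR contribution of one episode: 1 if the agent stops within 3 m of the goal."""
    return 1.0 if geo_dist(final_pos, goal) <= threshold else 0.0

def spl(final_pos, goal, path_length, shortest_length, geo_dist, threshold=3.0):
    """SPL: success weighted by normalized inverse path length (penalizes overlong routes)."""
    s = success(final_pos, goal, geo_dist, threshold)
    return s * shortest_length / max(path_length, shortest_length)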

Evaluation Metrics


Page 24: Vision, Language and Action

• Main idea: to introduce a speaker module able to generate a description given a context.

• To synthesize new path-instruction pairs

• To enable pragmatic reasoning (Hemachandra et al. 2015)

• Hence, we have a Follower and a Speaker:

• Follower → maps instructions to sequences of actions

• Speaker → maps action sequences to instructions

• Both the Follower and the Speaker are based on standard sequence-to-sequence models.

Algorithms – Fried et al. – NIPS 2018


D. Fried, R. Hu, V. Cirik, A. Rohrbach, J. Andreas, L.-P. Morency, T. Berg-Kirkpatrick, K. Saenko, D. Klein, T. Darrell, “Speaker-Follower Models for Vision-and-Language Navigation”, in NeurIPS, 2018.

S. Hemachandra, F. Duvallet, T. M. Howard, N. Roy, A. Stentz, and M. R. Walter. Learning models for following natural language directions in unknown environments. arXiv preprint arXiv:1503.05079, 2015.

Page 25: Vision, Language and Action

• The Follower estimates p_F(r|d), where r is a route and d a description (instruction)

• The Speaker estimates p_S(d|r)

• At training time the Speaker is used for data augmentation:

• M new paths r̂_k (k = 1..M) are sampled as in Anderson et al. 2018

• New path-description pairs (r̂_k, d̂_k) are generated, forming the set S

• At training time, the Follower is first trained on S ∪ D and then fine-tuned on the original dataset D (a schematic sketch of this procedure is given at the end of this page).

Algorithms – Fried et al. – NIPS 2018


D. Fried, R. Hu, V. Cirik, A. Rohrbach, J. Andreas, L.-P. Morency, T. Berg-Kirkpatrick, K. Saenko, D. Klein, T. Darrell, “Speaker-Follower Models for Vision-and-Language Navigation”, in NeurIPS, 2018.

d̂_k = argmax_d p_S(d | r̂_k)
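A schematic sketch of the augmentation recipe described above, with hypothetical sample_path, speaker and follower interfaces; it only illustrates the training procedure and is not the authors' code:

def speaker_augmentation(env, speaker, follower, original_dataset, sample_path, M=1000):
    """Sketch of Speaker-driven data augmentation (hypothetical interfaces)."""
    augmented = []
    for _ in range(M):
        r_hat = sample_path(env)           # new route r̂_k, sampled as in Anderson et al. 2018
        d_hat = speaker.generate(r_hat)    # d̂_k = argmax_d p_S(d | r̂_k), greedy decoding
        augmented.append((r_hat, d_hat))   # synthetic pair added to the set S
    follower.train(augmented + list(original_dataset))   # train the Follower on S ∪ D
    follower.train(list(original_dataset))                # then fine-tune on the original D
    return augmented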

Page 26: Vision, Language and Action

• At test time the Speaker is used to select the best path among K candidate paths (pragmatic inference)

• Ideally, r̂ = argmax_r p_S(d|r) → solving this exactly is not feasible

• Instead, the best of the K candidate paths is selected according to the rescoring formula reported at the end of this page:

• Panoramic actions space:

• The agent perceives a 360-degree panoramic image and only high-level decisions are taken.

• RESULTS:

• The SR on unseen environments is about 53.5%, which is about 30% better than previous approaches → the Speaker works!

Algorithms – Fried et al. – NIPS 2018


D. Fried, R. Hu, V. Cirik, A. Rohrbach, J. Andreas, L.-P. Morency, T. Berg-Kirkpatrick, K. Saenko, D. Klein, T. Darrell, “Speaker-Follower Models for Vision-and-Language Navigation”, in NeurIPS, 2018.

r̂ = argmax_r p_S(d|r)^λ · p_F(r|d)^(1−λ)
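A compact sketch of this pragmatic rescoring of the K candidates, with hypothetical follower.candidates and speaker.log_prob interfaces; log-probabilities are used so that the exponents λ and 1−λ become linear weights:

import math

def pragmatic_inference(d, follower, speaker, K=40, lam=0.5):
    """Rescore K candidate routes of the Follower with the Speaker (illustrative sketch)."""
    candidates = follower.candidates(d, K)     # list of (route r, log p_F(r|d))
    best_route, best_score = None, -math.inf
    for r, log_p_f in candidates:
        log_p_s = speaker.log_prob(d, r)       # log p_S(d|r)
        score = lam * log_p_s + (1.0 - lam) * log_p_f   # log of p_S(d|r)^λ · p_F(r|d)^(1−λ)
        if score > best_score:
            best_route, best_score = r, score
    return best_route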

Page 27: Vision, Language and Action

• Follows ideas from “Speaker-Follower” but the proposed approach is based on Reinforcement Learning (RL) and Imitation Learning (IL) instead of supervised learning.

• Contribution:

• Reinforced Cross-Modal Matching (RCM): a framework that employs an extrinsic and an intrinsic reward; the latter (intrinsic) reward enforces cycle-reconstruction consistency.

• Self-Supervised Imitation Learning (SIL) to explore unseen environments in a self-supervised way and improve the overall performance.

Algorithms – Wang et al. – CVPR 2019


Xin Wang, Qiuyuan Huang, Asli Çelikyilmaz, Jianfeng Gao, Dinghan Shen, Yuan-Fang Wang, William Yang Wang, Lei Zhang, ”Reinforced Cross-Modal Matching and Self-Supervised Imitation Learning for Vision-Language Navigation”, CVPR 2019.

Page 28: Vision, Language and Action

• Reinforced Cross-Modal Matching (RCM)

• Intrinsic reward: a matching critic estimates the probability p of reconstructing the instruction X from the executed trajectory (cycle reconstruction). A high value of p means that the predicted trajectory is well aligned with the given instruction.

• Extrinsic reward: feedback from the environment, based on how much each step reduces the distance to the target and on whether the goal is reached.
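A hedged sketch of how such a combined reward could be computed for a finished episode; geo_dist, matching_critic, and the weighting delta are assumptions for illustration, see the paper for the exact formulation:

def rcm_reward(positions, goal, instruction, matching_critic, geo_dist,
               delta=1.0, success_radius=3.0):
    """Illustrative combination of extrinsic and intrinsic rewards (not the authors' code)."""
    # Extrinsic part: per-step reduction of the distance to the goal ...
    extrinsic = sum(geo_dist(p, goal) - geo_dist(p_next, goal)
                    for p, p_next in zip(positions[:-1], positions[1:]))
    # ... plus a bonus if the episode ends close enough to the goal
    if geo_dist(positions[-1], goal) <= success_radius:
        extrinsic += 1.0
    # Intrinsic part: probability p of reconstructing the instruction X from the trajectory
    intrinsic = matching_critic.prob(instruction, positions)
    return extrinsic + delta * intrinsic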

Algorithms – Wang et al. – CVPR 2019


Xin Wang, Qiuyuan Huang, Asli Çelikyilmaz, Jianfeng Gao, Dinghan Shen, Yuan-Fang Wang, William Yang Wang, Lei Zhang, ”Reinforced Cross-Modal Matching and Self-Supervised Imitation Learning for Vision-Language Navigation”, CVPR 2019.

Page 29: Vision, Language and Action

• The trade-off between exploration and exploitation is one of the fundamental challenges in RL.

• Oh et al. proposed to exploit past good experiences to improve exploration, providing theoretical justifications for the effectiveness of this approach.

• Following this idea, the authors propose SIL (Self-Supervised Imitation Learning):

• The agent imitates its own past good decisions.

• This is applied to unseen environments with no ground truth information.

• A set of random paths is generated; the matching critic then selects the best trajectory, which is stored in a replay buffer.

• The trajectories stored in the replay buffer can then be exploited by the agent (see the sketch below).
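A minimal sketch of one SIL iteration on an unseen environment, with hypothetical agent.rollout, agent.imitate, and matching_critic.prob interfaces; it illustrates the idea, not the authors' implementation:

def self_supervised_imitation(env, agent, matching_critic, instruction, n_rollouts=8):
    """One SIL step: imitate the agent's own best trajectory, scored by the matching critic."""
    replay_buffer = []
    rollouts = [agent.rollout(env, instruction) for _ in range(n_rollouts)]  # random explorations
    best = max(rollouts, key=lambda traj: matching_critic.prob(instruction, traj))
    replay_buffer.append((instruction, best))   # store the best trajectory in the replay buffer
    agent.imitate(replay_buffer)                # behavioral cloning on the stored trajectories
    return replay_buffer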

Wang et al. 2019 – Self-Supervised Imitation Learning (SIL)


Xin Wang, Qiuyuan Huang, Asli Çelikyilmaz, Jianfeng Gao, Dinghan Shen, Yuan-Fang Wang, William Yang Wang, Lei Zhang, ”Reinforced Cross-Modal Matching and Self-Supervised Imitation Learning for Vision-Language Navigation”, CVPR 2019.

Page 30: Vision, Language and Action

• Beam search is not used because it is not feasible in real scenarios (an interesting consideration by the authors)

• RCM improves from about 28% to 35% on unseen environments (SPL metric).

• SIL improves RCM by 17.5% on SR and by 21% on SPL.

Wang et al. – CVPR 2019 – Results


Xin Wang, Qiuyuan Huang, Asli Çelikyilmaz, Jianfeng Gao, Dinghan Shen, Yuan-Fang Wang, William Yang Wang, Lei Zhang, ”Reinforced Cross-Modal Matching and Self-Supervised Imitation Learning for Vision-Language Navigation”, CVPR 2019.

Page 31: Vision, Language and Action

• Recently, a new categorization has been introduced by Landi et al. (Landi et al. BMVC 2019)

• Methods are subdivided into two categories:

• Methods that work in low-level action space

• Methods that work in high-level action space

• Main contribution:

• State-of-the-art among low-level methods

• It uses dynamic filters to jointly decide the agent’s actions.

Algorithms – Landi et al. (BMVC 2019)


Federico Landi, Lorenzo Baraldi, Massimiliano Corsini, Rita Cucchiara, "Embodied Vision-and-Language Navigation with Dynamic Convolutional Filters.”, in British Machine Vision Conference (BMVC), 2019.

Page 32: Vision, Language and Action

Landi et al. (BMVC 2019) – Action spaces

• Low-level action space (this work!):

• Simulates continuous control of the agent

• Move forward, turn left/right, tilt up/down, stop

• High-level action space:

• Path selection on a discrete graph

• The action space is the list of adjacent nodes

Federico Landi, Lorenzo Baraldi, Massimiliano Corsini, Rita Cucchiara, "Embodied Vision-and-Language Navigation with Dynamic Convolutional Filters.”, in British Machine Vision Conference (BMVC), 2019.

Page 33: Vision, Language and Action

Landi et al. (BMVC 2019) – Dynamic Convolutional Filters


Federico Landi, Lorenzo Baraldi, Massimiliano Corsini, Rita Cucchiara, "Embodied Vision-and-Language Navigation with Dynamic Convolutional Filters.”, in British Machine Vision Conference (BMVC), 2019.

...or “let the sentence drive the convolution”.

[Figure: previous uses of dynamic convolutional filters — Li et al. CVPR 2017 (tracking, query: “Woman with ponytail running”) and Gavrilyuk et al. CVPR 2018 (actor and action segmentation, query: “Small white fluffy puppy biting the cat”)]

Page 34: Vision, Language and Action

Landi et al. (BMVC 2019) – Dynamic Convolutional Filters


...or “let the sentence drive the convolution”.

[Architecture figure: each dynamic filter is generated from the sentence embedding through a Tanh non-linearity followed by L2 normalization]

Federico Landi, Lorenzo Baraldi, Massimiliano Corsini, Rita Cucchiara, "Embodied Vision-and-Language Navigation with Dynamic Convolutional Filters.”, in British Machine Vision Conference (BMVC), 2019.

Page 35: Vision, Language and Action

Landi et al. (BMVC 2019) – Dynamic Convolutional Filters


Federico Landi, Lorenzo Baraldi, Massimiliano Corsini, Rita Cucchiara, "Embodied Vision-and-Language Navigation with Dynamic Convolutional Filters.”, in British Machine Vision Conference (BMVC), 2019.

...or “let the sentence drive the convolution”.

The number of output feature maps equals the number of dynamic filters.
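A minimal PyTorch-style sketch of instruction-conditioned (dynamic) 1×1 convolutional filters, with illustrative tensor shapes; the filter generator, the Tanh + L2-normalization step and the per-sample convolution follow the scheme above, but this is a sketch, not the original code:

import torch
import torch.nn as nn
import torch.nn.functional as F

class DynamicConv(nn.Module):
    """Sketch of text-conditioned (dynamic) 1x1 convolutional filters (illustrative shapes)."""
    def __init__(self, text_dim=512, img_channels=2048, n_filters=8):
        super().__init__()
        self.n_filters = n_filters
        self.img_channels = img_channels
        self.filter_gen = nn.Linear(text_dim, n_filters * img_channels)

    def forward(self, text_feat, img_feat_map):
        # text_feat: (B, text_dim); img_feat_map: (B, C, H, W)
        B = text_feat.size(0)
        # "let the sentence drive the convolution": generate the filters from the instruction
        f = torch.tanh(self.filter_gen(text_feat))
        f = f.view(B, self.n_filters, self.img_channels)
        f = F.normalize(f, p=2, dim=-1)                       # L2-normalize each dynamic filter
        f = f.view(B, self.n_filters, self.img_channels, 1, 1)
        # one output feature map per dynamic filter
        out = [F.conv2d(img_feat_map[b:b + 1], f[b]) for b in range(B)]
        return torch.cat(out, dim=0)                          # (B, n_filters, H, W)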

Page 36: Vision, Language and Action

Landi et al. (BMVC 2019) – Architecture

Federico Landi, Lorenzo Baraldi, Massimiliano Corsini, Rita Cucchiara, "Embodied Vision-and-Language Navigation with Dynamic Convolutional Filters.”, in British Machine Vision Conference (BMVC), 2019.

Page 37: Vision, Language and Action

• State-of-the-art in low-level action spaces

• Ablation study

• All the components are important

• Dynamic filters play a fundamental role

Landi et al. (BMVC 2019) – Results


Federico Landi, Lorenzo Baraldi, Massimiliano Corsini, Rita Cucchiara, "Embodied Vision-and-Language Navigation with Dynamic Convolutional Filters.”, in British Machine Vision Conference (BMVC), 2019.

Page 38: Vision, Language and Action

Landi et al. (BMVC 2019) – Results


Example instruction: “Walk up the stairs. Turn right at the top of the stairs and walk along the red ropes. Walk through the open doorway straight ahead along the red carpet. Walk through that hallway into the room with couches and a marble coffee table.”

Federico Landi, Lorenzo Baraldi, Massimiliano Corsini, Rita Cucchiara, "Embodied Vision-and-Language Navigation with Dynamic Convolutional Filters.”, in British Machine Vision Conference (BMVC), 2019.

Page 39: Vision, Language and Action

Landi et al. (BMVC 2019) – Results


Example instruction: “Turn around and go down the entranceway, heading toward the staircase. Turn to your left and walk past the staircase to the open doorway. Stop near the front of the doorway to the hall.”

Federico Landi, Lorenzo Baraldi, Massimiliano Corsini, Rita Cucchiara, "Embodied Vision-and-Language Navigation with Dynamic Convolutional Filters.”, in British Machine Vision Conference (BMVC), 2019.

Page 40: Vision, Language and Action

• Simulation framework

• Flexibility

• Human interaction

• Evaluation needs further improvements:

• Few datasets (e.g. only one dataset for the spatial description resolution (SDR) task)

• Performance of agents with low-level vs high-level action spaces

• Metrics (discussed in a moment)

• Real-world applications

• Lack of case studies

Challenges


Page 41: Vision, Language and Action

Metrics


Vihan Jain, Gabriel Magalhaes, Alexander Ku, Ashish Vaswani, Eugene Ie, Jason Baldridge, “Stay on the Path: Instruction Fidelity in Vision-and-Language Navigation”, in ACL 2019.

[Figure: blue reference path from R4R. SPL for the red path: 1.0; SPL for the orange path: 0.17. CLS for the red path: 0.23; CLS for the orange path: 0.87.]

Page 42: Vision, Language and Action

• Common metrics to evaluate VLN performance focus on reaching the goal instead of evaluating the step-by-step correspondences with the given instructions.

• We need instruction-oriented metrics instead of goal-oriented metrics.

• Recently proposed metrics for this purpose:

• Coverage Weighted by Length Score (CLS) (Jain et al. 2019).

• Metrics based on dynamic time warping (nDTW, SDTW) (Magalhães et al. 2019)

Problem with commonly used metrics


Page 43: Vision, Language and Action

• A new dataset has been created to evaluate the novel proposed metric: Room-for-Room (R4R)

• R4R is built by joining paths and their corresponding descriptions from R2R

• From goal-oriented to instruction-oriented metrics

• A list of desiderata is proposed:

• Path similarity

• Soft penalties

• Unique optimum

• Scale invariance

• Computational tractability

CLS (Jain et al. 2019)


Gabriel Magalhães, Vihan Jain, Alexander Ku, Eugene Ie, Jason Baldridge, “Effective and General Evaluation for Instruction Conditioned Navigation using Dynamic Time Warping”, ArXiv: 1907.05446 (2019).

Page 44: Vision, Language and Action

• CLS is defined as:

CLS (Jain et al. 2019)


Vihan Jain, Gabriel Magalhaes, Alexander Ku, Ashish Vaswani, Eugene Ie, Jason Baldridge, “Stay on the Path: Instruction Fidelity in Vision-and-Language Navigation”, in ACL 2019.

CLS(P, R) = PC(P, R) · LS(P, R), i.e. the product of a path coverage term (PC) and a length score term (LS).
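Assuming the definitions of Jain et al. 2019 (an exponentially decayed, soft coverage of the reference nodes, and a length score that compares the agent path length with the coverage-weighted reference length), a rough Python sketch could look as follows; treat the exact formulas as a paraphrase from memory rather than the official implementation:

import math

def cls(agent_path, ref_path, dist, d_th=3.0):
    """Rough sketch of Coverage weighted by Length Score; dist(a, b) is a distance function."""
    def path_len(p):
        return sum(dist(a, b) for a, b in zip(p[:-1], p[1:]))
    # Path coverage: soft, exponentially decayed coverage of the reference nodes by the agent path
    pc = sum(math.exp(-min(dist(r, p) for p in agent_path) / d_th)
             for r in ref_path) / len(ref_path)
    # Length score: compares the agent path length with the coverage-weighted reference length
    expected_len = pc * path_len(ref_path)
    denom = expected_len + abs(expected_len - path_len(agent_path))
    ls = expected_len / denom if denom > 0 else 0.0
    return pc * ls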

Page 45: Vision, Language and Action

• The proposed metrics are based on Dynamic Time Warping (DTW).

• DTW is a well-established method to measure the similarity between time series.

• The approach of DTW is to find an optimal warping to align the elements of the series such that the cumulative distance between the aligned elements is minimized.

• This problem can be solved using dynamic programming.
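As an illustration, a classic dynamic-programming implementation of DTW between two sequences of points (a textbook version, independent of any specific VLN codebase):

import math

def dtw(series_a, series_b, dist):
    """Classic dynamic-programming DTW between two sequences; dist(a, b) is a distance function."""
    n, m = len(series_a), len(series_b)
    # D[i][j] = minimal cumulative cost of aligning series_a[:i] with series_b[:j]
    D = [[math.inf] * (m + 1) for _ in range(n + 1)]
    D[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = dist(series_a[i - 1], series_b[j - 1])
            D[i][j] = cost + min(D[i - 1][j], D[i][j - 1], D[i - 1][j - 1])
    return D[n][m]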

nDTW and SDTW (Magalhães et al. 2019)


Gabriel Magalhães, Vihan Jain, Alexander Ku, Eugene Ie, Jason Baldridge, “Effective and General Evaluation for Instruction Conditioned Navigation using Dynamic Time Warping”, ArXiv: 1907.05446 (2019).

Page 46: Vision, Language and Action

• Normalized Dynamic Time Warping (nDTW) metric:

• Success Weighted by Normalized Dynamic Time Warping (SDTW) metric:
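Roughly (the exact definitions are given in Magalhães et al. 2019, so treat the following as a paraphrase): nDTW(R, Q) = exp(−DTW(R, Q) / (|R| · d_th)), where R is the reference path, Q the agent path, and d_th the success-threshold distance; SDTW(R, Q) = S · nDTW(R, Q), where S ∈ {0, 1} indicates whether the episode was successful.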

nDTW and SDTW (Magalhães et al. 2019)


Gabriel Magalhães, Vihan Jain, Alexander Ku, Eugene Ie, Jason Baldridge, “Effective and General Evaluation for Instruction Conditioned Navigation using Dynamic Time Warping”, ArXiv: 1907.05446 (2019).

Page 47: Vision, Language and Action

• VLN is a modern, complex task which involves visual recognition, 3D scene understanding, and language processing.

• The multi-modal information being processed (spoken/text instructions, images, depth) must be turned into a sequence of actions.

• Simulation environments and benchmarks are continuously under development (which is good news).

• The transfer from simulation to real-world applications is an important open issue.

Conclusions


Page 48: Vision, Language and Action

{name.surname}@unimore.it

University of Modena and Reggio Emilia, Italy

Thank you for your attention