Vision, Language and Action
Transcript of Vision, Language and Action
{name.surname}@unimore.it
University of Modena and Reggio Emilia, Italy
Lorenzo Baraldi, Marcella Cornia, Massimiliano Corsini
Vision, Language and Action: from Captioning to Embodied AI
Massimiliano Corsini
Part V
Embodied AI and VLN
• Generally speaking, embodied means giving a physical body to something. In this context, it means giving an Artificial Intelligence algorithm “a body”, so that an agent can solve tasks by acting in an environment.
• Tasks involved in embodied AI:
• Embodied Visual Recognition
• Embodied Question Answering
• Interactive Question Answering
• Visual Navigation
• Vision-and-Language Navigation
Embodied AI
3
Manolis Savva, Abhishek Kadian, Oleksandr Maksymets, Yili Zhao, Erik Wijmans, Bhavana Jain, Julian Straub, Jia Liu, Vladlen Koltun, Jitendra Malik, Devi Parikh, Dhruv Batra, “Habitat: A Platform for Embodied AI Research”, ArXiv: 1904.01201, 2019.
Embodied AI
4
Figure adapted from Savva et al., “Habitat: A Platform for Embodied AI Research”, ArXiv: 1904.01201, 2019.
• In the following we describe Vision-and-Language Navigation (VLN) in depth, a very recent and active research trend.
• VLN has been defined as:
• Interpret a previously unseen natural language navigation command in light of images generated by a previously unseen real environment (Anderson et al. CVPR 2018)
• Follow a given instruction to navigate from a starting location to a goal location (Fried et al. NeurIPS 2018)
• Reach a target location by navigating unseen environments, with a natural language instruction as only clue (Landi et al. BMVC 2019)
Embodied AI – VLN
5
• Dataset of spaces:
• ScanNet (Dai et al. 2017)
• Stanford 2D-3D-Semantics (Armeni et al. 2017)
• Matterport3D (Chang et al. 2017)
• Replica (Straub et al. 2019)
• Dataset for VLN:
• R2R – Room to Room (Anderson et al. 2018)
• Touchdown (Chen et al. 2019)
• Simulation Environments
• Matterport3D Simulator (Anderson et al. 2018)
• Gibson (Zamir et al. 2018)
• Habitat (Savva et al. 2019)
Key Aspects
6
• ScanNet is an RGB-D video dataset containing reconstructed scenes with instance-level semantic segmentations.
• 2.5 million RGB-D frames acquired through a custom device (similar to a Kinect)
• 1,513 scenes reconstructed (volume fusion)
• Instance-level semantic segmentation (20 classes plus one class for free space) through 3D CAD model alignment
ScanNet (Dai et al. 2017)
7
Angela Dai, Angel X. Chang, Manolis Savva, Maciej Halber, Thomas Funkhouser, Matthias Niessner, “ScanNet: Richly-annotated 3D Reconstructions of Indoor Scenes”, Proc. Computer Vision and Pattern Recognition (CVPR), 2017.
• Data collected using a Matterport Camera for indoor acquisition
• > 6,000 m2 of indoor environment
• 1,413 equirectangular RGB images with corresponding depth and surface normal plus instance-level semantic data
• Annotation is performed in 3D, then projected onto the images (13 object classes)
Stanford 2D-3D-S (Armeni et al. 2017)
8
Iro Armeni, Sasha Sax, Amir Roshan Zamir, Silvio Savarese, “Joint 2D-3D-Semantic Data for Indoor Scene Understanding”, arXiv:1702.01105, 2017.
• Created using the Matterport Camera (again)
• 90 buildings, 10,800 panoramic views, 194,400 RGB-D images
• Corresponding textured 3D models
• Instance segmentation is provided
Matterport3D (Chang et al. – 3DV2017)
9
A. Chang, A. Dai, T. Funkhouser, M. Halber, M. Niessner, M. Savva, S. Song, A. Zeng, Y. Zhang, ”Matterport3D: Learning from RGB-D Data in Indoor Environments”, International Conference on 3D Vision (3DV 2017), 2017.
• Ultra high photo-realism (Replica Turing Test)
• Designed for many tasks: egocentric vision, semantic segmentation, geometric inference, development of embodied agents (VLN, VQA)
• Acquired using a custom device (an RGB-D rig with IR projector) plus manual refinement
• 18 scenes of real-world environments
• Dense high-quality meshes, HDR textures
• Semantic annotation is performed in parallel using a 2D instance-level masking tool → transferred to the mesh using a voting scheme → the final result is a SEMANTIC FOREST (88 classes).
• A minimal SDK to render the dataset is provided.
Replica (Straub et al. 2019)
10
Straub et al. “The Replica Dataset: A Digital Replica of Indoor Spaces”, arXiv:1906.05797, 2019.
Dataset comparison
11
Straub et al. “The Replica Dataset: A Digital Replica of Indoor Spaces”, arXiv:1906.05797, 2019.
• Gibson is a perception and physics simulator for the development of embodied agents (developed by Stanford).
• The renderer works with data acquired with the Matterport Camera.
• Different types of robotic agents can be trained.
• Physics engine (based on PyBullet).
• Tries to bridge the perception gap between rendered and real-world images.
• No support for agent-agent interactions.
Gibson
12
Fei Xia, Amir Roshan Zamir, Zhi-Yang He, Alexander Sax, Jitendra Malik, Silvio Savarese, “Gibson Env: Real-World Perception for Embodied Agents”, CVPR 2018, pp.9068-9079.
Gibson – Rendering Engine
13
Fei Xia, Amir Roshan Zamir, Zhi-Yang He, Alexander Sax, Jitendra Malik, Silvio Savarese, “Gibson Env: Real-World Perception for Embodied Agents”, CVPR 2018, pp.9068-9079.
Gibson – Filling the gap between rendering and real images
14
Fei Xia, Amir Roshan Zamir, Zhi-Yang He, Alexander Sax, Jitendra Malik, Silvio Savarese, “Gibson Env: Real-World Perception for Embodied Agents”, CVPR 2018, pp.9068-9079.
• Habitat is an open source framework for embodied AI
• Two main components:
• Habitat-SIM → 3D simulator
• Habitat-API → high-level library for end-to-end development
• Features:
• High-quality rendering
• Generic dataset support: Matterport3D, Gibson, Replica
• Agents can be configured and equipped with different sensors
• Human-as-agent → this allows investigating human-agent and human-human interactions
Habitat
15
Manolis Savva, Abhishek Kadian, Oleksandr Maksymets, Yili Zhao, Erik Wijmans, Bhavana Jain, Julian Straub, Jia Liu, Vladlen Koltun, Jitendra Malik, Devi Parikh, Dhruv Batra, “Habitat: A Platform for Embodied AI Research”, ArXiv: 1904.01201, 2019.
• Builds upon Matterport3D dataset of spaces (Chang et al. 3DV 2017)
• 90 different buildings
• ~7k navigation paths
• 3 different descriptions / path
• ~29 words per instruction on average
• 2 different validation splits
• Test server with public leaderboard
R2R – Room to Room Benchmark
16
P. Anderson and Q. Wu and D. Teney and J. Bruce and M. Johnson and N. Sunderhauf and I. Reid and S. Gould and A. van den Hengel, “Vision-and-Language Navigation: Interpreting visually-grounded navigation instructions in real environments”, in Proc. of CVPR2018.
R2R – Room to Room Benchmark
17
P. Anderson and Q. Wu and D. Teney and J. Bruce and M. Johnson and N. Sunderhauf and I. Reid and S. Gould and A. van den Hengel, “Vision-and-Language Navigation: Interpreting visually-grounded navigation instructions in real environments”, in Proc. of CVPR2018.
• Real-life urban environment (built on Google StreetView image data)
• Touchdown task: reach a goal position by following the given instructions (a navigation task), then resolve a spatial description by finding a hidden teddy bear in the observed image (the spatial description resolution (SDR) task).
• Referring expression vs SDR:
• A referring expression discriminates an object w.r.t. other objects.
• An SDR sentence describes a specific location rather than discriminating between objects.
• 9,326 examples of English instructions
• 27,575 spatial description resolution tasks
• Language more complex than R2R
• Qualified annotators.
Touchdown (Chen et al. – CVPR2019)
18
H. Chen, A. Suhr, D. Misra, N. Snavely, Y. Artzi, “Touchdown: Natural language navigation and spatial reasoning in visual street environments”, in Proc. of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR2019), 2019, pp. 12538-12547.
Touchdown (Chen et al. – CVPR2019)
19
H. Chen, A. Suhr, D. Misra, N. Snavely, Y. Artzi, “Touchdown: Natural language navigation and spatial reasoning in visual street environments”, in Proc. of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR2019), 2019, pp. 12538-12547.
SDR task
• The first work to introduce Vision-and-Language Navigation (VLN) task.
• Both VQA and VLN can be seen as sequence-to-sequence transcoding. However, VLN sequences are typically much longer than those in VQA, and the model outputs actions instead of answers.
• Contribution:
• Matterport3D Simulator → a framework for visual reinforcement learning built on Matterport3D dataset.
• Room-to-Room (R2R) → the first benchmark for VLN.
• A sequence-to-sequence neural network to solve the problem.
Algorithms – Anderson et al. – CVPR2018
20
Peter Anderson, Qi Wu, Damien Teney, Jake Bruce, Mark Johnson, Niko Sünderhauf, Ian D. Reid, Stephen Gould, Anton van den Hengel, “Vision-and-Language Navigation: Interpreting visually-grounded navigation instructions in real environments”, CVPR 2018.
• Navigation graph: 𝐺 = (𝑉, 𝐸)
• The agent selects the next reachable viewpoint 𝑣𝑡+1 ∈ 𝑊𝑡+1 from the set of reachable viewpoints 𝑊𝑡+1 and adjusts the camera heading (azimuth angle) and elevation.
• The action space is: left, right, up, down, forward, and stop.
• Proposed baseline:
• LSTM-based sequence-to-sequence architecture with an attention mechanism (Bahdanau et al. 2015)
Algorithms – Anderson et al. – CVPR2018
21
Peter Anderson, Qi Wu, Damien Teney, Jake Bruce, Mark Johnson, Niko Sünderhauf, Ian D. Reid, Stephen Gould, Anton van den Hengel, “Vision-and-Language Navigation: Interpreting visually-grounded navigation instructions in real environments”, CVPR 2018.
• Language instructions encoding: each word 𝑥𝑖 is presented sequentially to the LSTM encoder as an embedding vector:
ℎ𝑖 = 𝐿𝑆𝑇𝑀𝑒𝑛𝑐 (𝑥𝑖 , ℎ𝑖−1)
• Image and action embedding: image features are extracted using a ResNet-152 pretrained on ImageNet. An embedding is learned for each action. The encoded image and the previous action features are concatenated to form a vector 𝑞𝑡.
ℎ′𝑡 = 𝐿𝑆𝑇𝑀𝑑𝑒𝑐(𝑞𝑡, ℎ′𝑡−1)
• Finally, the attention mechanism computes an instruction context 𝑐𝑡 = 𝑓(𝒉, ℎ′𝑡) before the prediction of the action 𝑎𝑡.
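The attention step above can be sketched in a few lines of NumPy. For simplicity this uses dot-product scoring instead of Bahdanau's additive scoring; all variable names are illustrative:

```python
import numpy as np

def attention_context(encoder_states, decoder_hidden):
    """Compute an instruction context c_t over the encoder states.

    encoder_states: (T, H) hidden states h_1..h_T of the LSTM encoder
    decoder_hidden: (H,) current decoder state h'_t
    Returns the context vector c_t, a weighted sum of encoder states.
    """
    scores = encoder_states @ decoder_hidden      # (T,) alignment scores
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()                      # softmax over words
    return weights @ encoder_states               # (H,) context vector

# toy example: 3 instruction words, hidden size 2
h = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
c = attention_context(h, np.array([10.0, 0.0]))
```

The decoder state strongly attends to the first and third word here, so the context is close to their average.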
Algorithms – Anderson et al. – CVPR2018
22
Peter Anderson, Qi Wu, Damien Teney, Jake Bruce, Mark Johnson, Niko Sünderhauf, Ian D. Reid, Stephen Gould, Anton van den Hengel, “Vision-and-Language Navigation: Interpreting visually-grounded navigation instructions in real environments”, CVPR 2018.
• NE (Navigation Error)
• Distance between the agent's final position and the goal
• SR (Success Rate)
• Fraction of episodes terminated within 3 meters from the goal
• OSR (Oracle SR)
• SR that the agent would have achieved if it received an oracle stop signal
• SPL (SR weighted by Path Length)
• SR weighted by normalized inverse path length (penalizes overlong navigations)
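The four metrics can be sketched as follows (a minimal illustration assuming 2D positions and straight-line distances; the episode format is invented for the example):

```python
import math

def evaluate(episodes, success_radius=3.0):
    """Average VLN metrics over episodes.

    Each episode is a dict with:
      'path'     - list of (x, y) agent positions, in order
      'goal'     - goal position
      'shortest' - length of the shortest path from start to goal
    """
    sr = osr = spl = 0.0
    for ep in episodes:
        path, goal = ep['path'], ep['goal']
        ne = math.dist(path[-1], goal)                # Navigation Error
        success = ne <= success_radius                # SR: stop within 3 m
        # OSR: success if *any* visited point is within the radius
        oracle = any(math.dist(p, goal) <= success_radius for p in path)
        taken = sum(math.dist(a, b) for a, b in zip(path, path[1:]))
        sr += success
        osr += oracle
        # SPL: success weighted by shortest / max(taken, shortest)
        spl += success * ep['shortest'] / max(taken, ep['shortest'])
    n = len(episodes)
    return {'SR': sr / n, 'OSR': osr / n, 'SPL': spl / n}

episodes = [
    {'path': [(0, 0), (5, 0), (10, 0)], 'goal': (10, 0), 'shortest': 10.0},
    {'path': [(0, 0), (10, 0), (20, 0)], 'goal': (10, 0), 'shortest': 10.0},
]
metrics = evaluate(episodes)
```

The second episode passes through the goal but does not stop there: it counts for OSR but not for SR, which is exactly the distinction the oracle metric captures.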
Evaluation Metrics
23
• Main idea: introduce a Speaker module able to generate a description given a trajectory.
• To synthesize new path-instruction pairs
• To enable pragmatic reasoning (Hemachandra et al. 2015)
• Hence, we have a Follower and a Speaker:
• Follower → maps instructions to sequences of actions
• Speaker → maps action sequences to instructions
• Both the Follower and the Speaker are based on a standard sequence-to-sequence model.
Algorithms – Fried et al. – NIPS 2018
24
D. Fried, R. Hu, V. Cirik, A. Rohrbach, J. Andreas, L.-P. Morency, T. Berg-Kirkpatrick, K. Saenko, D. Klein, T. Darrell, “Speaker-Follower Models for Vision-and-Language Navigation”, in NeurIPS, 2018.
S. Hemachandra, F. Duvallet, T. M. Howard, N. Roy, A. Stentz, and M. R. Walter. Learning models for following natural language directions in unknown environments. arXiv preprint arXiv:1503.05079, 2015.
• The Follower estimates 𝑝𝐹(𝑟|𝑑), where 𝑟 is a route and 𝑑 a description
• The Speaker estimates 𝑝𝑆(𝑑|𝑟)
• At training time the Speaker is used for data augmentation:
• M new paths 𝑟̂𝑘 (𝑘 = 1..𝑀) are sampled as in Anderson et al. 2018
• New path-description pairs (𝑟̂𝑘, 𝑑̂𝑘) are generated (set 𝑆)
• The Follower is trained on 𝑆 ∪ 𝐷, then fine-tuned on the original dataset 𝐷.
Algorithms – Fried et al. – NIPS 2018
25
D. Fried, R. Hu, V. Cirik, A. Rohrbach, J. Andreas, L.-P. Morency, T. Berg-Kirkpatrick, K. Saenko, D. Klein, T. Darrell, “Speaker-Follower Models for Vision-and-Language Navigation”, in NeurIPS, 2018.
𝑑̂𝑘 = argmax𝑑 𝑝𝑆(𝑑 | 𝑟̂𝑘)
• At test time the Speaker is used to select the best path among K candidate paths (pragmatic inference)
• 𝑟̂ = argmax𝑟 𝑝𝑆(𝑑|𝑟) → solving this exactly is not feasible
• Instead, the best of the K candidate paths is selected according to:
• Panoramic actions space:
• The agent perceives a 360-degree panoramic image and only high-level decisions are taken.
• RESULTS:
• The SR metric on unseen environments is about 53.5%, which is 30% better than previous approaches → the Speaker works!
Algorithms – Fried et al. – NIPS 2018
26
D. Fried, R. Hu, V. Cirik, A. Rohrbach, J. Andreas, L.-P. Morency, T. Berg-Kirkpatrick, K. Saenko, D. Klein, T. Darrell, “Speaker-Follower Models for Vision-and-Language Navigation”, in NeurIPS, 2018.
𝑟̂ = argmax𝑟 𝑝𝑆(𝑑|𝑟)^𝜆 ∙ 𝑝𝐹(𝑟|𝑑)^(1−𝜆)
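The rescoring step over K candidates can be sketched as follows, working in log space so the λ-weighted product becomes a weighted sum (the candidate routes and scores are illustrative):

```python
def pragmatic_rescore(candidates, lam=0.5):
    """Pick the best of K candidate routes by mixing speaker and
    follower log-probabilities, log p_S(d|r) and log p_F(r|d).

    candidates: list of (route, log_p_speaker, log_p_follower) tuples.
    """
    def score(c):
        _, log_ps, log_pf = c
        # lam*log p_S + (1-lam)*log p_F  ==  log( p_S^lam * p_F^(1-lam) )
        return lam * log_ps + (1.0 - lam) * log_pf
    return max(candidates, key=score)[0]

candidates = [('route_1', -1.0, -5.0), ('route_2', -2.0, -1.0)]
best = pragmatic_rescore(candidates, lam=0.5)
```

With λ = 1 the choice reduces to the pure speaker score; with λ = 0 it reduces to the follower alone.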
• Follows the ideas of “Speaker-Follower”, but the proposed approach is based on Reinforcement Learning (RL) and Imitation Learning (IL) instead of supervised learning.
• Contribution:
• Reinforced Cross Modal Matching (RCM) framework that employs an extrinsic and an intrinsic reward. The latter reward enforces cycle-reconstruction consistency.
• Self-Supervised Imitation Learning (SIL) to explore unseen environments with self-supervision and improve the overall performance.
Algorithms – Wang et al. – CVPR 2019
27
Xin Wang, Qiuyuan Huang, Asli Çelikyilmaz, Jianfeng Gao, Dinghan Shen, Yuan-Fang Wang, William Yang Wang, Lei Zhang, ”Reinforced Cross-Modal Matching and Self-Supervised Imitation Learning for Vision-Language Navigation”, CVPR 2019.
• Reinforced Cross-Modal Matching (RCM)
• INTRINSIC REWARD: the cycle-reconstruction reward 𝑝 = 𝑝(𝑋|𝜏), i.e. the probability of reconstructing the instruction 𝑋 given the executed trajectory 𝜏. A high value of 𝑝 means that the predicted trajectory is well aligned with the given instruction.
• EXTRINSIC REWARD: measures progress toward the target (reduction of the distance to the goal) and success at the end of the episode.
Algorithms – Wang et al. – CVPR 2019
28
Xin Wang, Qiuyuan Huang, Asli Çelikyilmaz, Jianfeng Gao, Dinghan Shen, Yuan-Fang Wang, William Yang Wang, Lei Zhang, ”Reinforced Cross-Modal Matching and Self-Supervised Imitation Learning for Vision-Language Navigation”, CVPR 2019.
• The trade-off between exploration and exploitation is one of the fundamental challenges in RL.
• Oh et al. proposed exploiting past good experience to improve exploration, with theoretical justification of the effectiveness of this approach.
• Following this idea, the authors propose SIL (Self-Supervised Imitation Learning):
• The agent imitates its own past good decisions.
• This is applied to unseen environments with no ground-truth information.
• A set of random paths is generated, then the matching critic is used to select the best trajectory and store it in a replay buffer.
• The trajectories stored in the replay buffer can then be exploited.
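The sample-select-store loop described above can be sketched as follows (all function names are illustrative placeholders, not the authors' API):

```python
def sil_round(instruction, rollout_fn, critic_fn, replay_buffer, k=5):
    """One round of the SIL exploration loop (sketch).

    rollout_fn(instruction) -> a sampled trajectory
    critic_fn(instruction, trajectory) -> matching-critic score
    The best-scoring trajectory is stored for later imitation.
    """
    trajectories = [rollout_fn(instruction) for _ in range(k)]
    best = max(trajectories, key=lambda t: critic_fn(instruction, t))
    replay_buffer.append((instruction, best))
    return best

# toy example: trajectories are ints, the critic score is the value itself
samples = iter([1, 5, 3, 2, 4])
buffer = []
best = sil_round("walk to the kitchen", lambda _: next(samples),
                 lambda _, t: t, buffer, k=5)
```

In the real method the critic is the learned cross-modal matching critic, and the stored trajectories become imitation targets.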
Wang et al. 2019 – Self-Supervised Imitation Learning (SIL)
29
1. Xin Wang, Qiuyuan Huang, Asli Çelikyilmaz, Jianfeng Gao, Dinghan Shen, Yuan-Fang Wang, William Yang Wang, Lei Zhang, ”Reinforced Cross-Modal Matching and Self-Supervised Imitation Learning for Vision-Language Navigation”, CVPR 2019.
• Beam search is not used because it is not feasible in real scenarios (an interesting consideration by the authors)
• RCM improves from about 28% to 35% on unseen environments (SPL metric).
• SIL improves RCM by 17.5% on SR and by 21% on SPL.
Wang et al. – CVPR 2019 – Results
30
Xin Wang, Qiuyuan Huang, Asli Çelikyilmaz, Jianfeng Gao, Dinghan Shen, Yuan-Fang Wang, William Yang Wang, Lei Zhang, ”Reinforced Cross-Modal Matching and Self-Supervised Imitation Learning for Vision-Language Navigation”, CVPR 2019.
• Recently, a new categorization has been introduced by Landi et al. (Landi et al. BMVC 2019)
• Methods are subdivided into two categories:
• Methods that work in low-level action space
• Methods that work in high-level action space
• Main contribution:
• SOTA for low-level methods
• It uses dynamic filters to jointly decide the agent’s actions.
Algorithms – Landi et al. (BMVC 2019)
31
Federico Landi, Lorenzo Baraldi, Massimiliano Corsini, Rita Cucchiara, "Embodied Vision-and-Language Navigation with Dynamic Convolutional Filters.”, in British Machine Vision Conference (BMVC), 2019.
Landi et al. (BMVC 2019) – Action spaces
• Low-level action space: simulates continuous control of the agent (move forward, turn left/right, tilt up/down, stop) ← this work!
• High-level action space: path selection on a discrete graph; the action space is a list of adjacent nodes
Federico Landi, Lorenzo Baraldi, Massimiliano Corsini, Rita Cucchiara, "Embodied Vision-and-Language Navigation with Dynamic Convolutional Filters.”, in British Machine Vision Conference (BMVC), 2019.
Landi et al. (BMVC 2019) – Dynamic Convolutional Filters
33
Federico Landi, Lorenzo Baraldi, Massimiliano Corsini, Rita Cucchiara, "Embodied Vision-and-Language Navigation with Dynamic Convolutional Filters.”, in British Machine Vision Conference (BMVC), 2019.
...or “let the sentence drive the convolution”
Dynamic filters in prior work: Li et al. CVPR 2017 (tracking, query: “Woman with ponytail running”); Gavrilyuk et al. CVPR 2018 (actor and action segmentation, query: “Small white fluffy puppy biting the cat”).
Landi et al. (BMVC 2019) – Dynamic Convolutional Filters
34
(Figure: each dynamic filter is generated from the sentence embedding through a Tanh activation followed by L2 normalization.)
Federico Landi, Lorenzo Baraldi, Massimiliano Corsini, Rita Cucchiara, "Embodied Vision-and-Language Navigation with Dynamic Convolutional Filters.”, in British Machine Vision Conference (BMVC), 2019.
Landi et al. 2019 – Dynamic Convolutional Filters
35
Federico Landi, Lorenzo Baraldi, Massimiliano Corsini, Rita Cucchiara, "Embodied Vision-and-Language Navigation with Dynamic Convolutional Filters.”, in British Machine Vision Conference (BMVC), 2019.
# output feature maps = # dynamic filters
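A minimal sketch of the idea, assuming 1×1 dynamic filters and NumPy tensors (shapes, sizes, and names are illustrative, not the paper's configuration):

```python
import numpy as np

def dynamic_filters(sentence_emb, W, n_filters, channels):
    """Generate 1x1 dynamic conv filters from a sentence embedding:
    linear projection -> Tanh -> per-filter L2 normalization."""
    f = np.tanh(W @ sentence_emb).reshape(n_filters, channels)
    return f / np.linalg.norm(f, axis=1, keepdims=True)

def apply_filters(features, filters):
    """features: (C, H, W) visual features; filters: (F, C) dynamic filters.
    Returns (F, H, W): one response map per filter."""
    return np.tensordot(filters, features, axes=([1], [0]))

rng = np.random.default_rng(0)
W = rng.normal(size=(3 * 4, 8))   # 3 filters over 4 channels, 8-d sentence
filters = dynamic_filters(rng.normal(size=8), W, n_filters=3, channels=4)
responses = apply_filters(np.ones((4, 5, 5)), filters)
```

Note how the number of output feature maps equals the number of dynamic filters, as stated above: each filter, generated from the instruction, produces one response map.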
Landi et al. (BMVC 2019) – Architecture
36
Federico Landi, Lorenzo Baraldi, Massimiliano Corsini, Rita Cucchiara, "Embodied Vision-and-Language Navigation with Dynamic Convolutional Filters”, in British Machine Vision Conference (BMVC), 2019.
• State-of-the-art in low-level action spaces
• Ablation study
• All the components are important
• Dynamic filters play the fundamental role
Landi et al. (BMVC 2019) – Results
37
Federico Landi, Lorenzo Baraldi, Massimiliano Corsini, Rita Cucchiara, "Embodied Vision-and-Language Navigation with Dynamic Convolutional Filters.”, in British Machine Vision Conference (BMVC), 2019.
Landi et al. (BMVC 2019) – Results
38
Instruction: “Walk up the stairs. Turn right at the top of the stairs and walk along the red ropes. Walk through the open doorway straight ahead along the red carpet. Walk through that hallway into the room with couches and a marble coffee table.”
Federico Landi, Lorenzo Baraldi, Massimiliano Corsini, Rita Cucchiara, "Embodied Vision-and-Language Navigation with Dynamic Convolutional Filters.”, in British Machine Vision Conference (BMVC), 2019.
Landi et al. (BMVC 2019) – Results
39
Instruction: “Turn around and go down the entranceway, heading toward the staircase. Turn to your left and walk past the staircase to the open doorway. Stop near the front of the doorway to the hall.”
Federico Landi, Lorenzo Baraldi, Massimiliano Corsini, Rita Cucchiara, "Embodied Vision-and-Language Navigation with Dynamic Convolutional Filters.”, in British Machine Vision Conference (BMVC), 2019.
• Simulation framework
• Flexibility
• Human interaction
• Evaluation needs further improvements:
• Few datasets (e.g. only one dataset for the Spatial Description Resolution task)
• Performance on low-level actions vs high-level actions agent
• Metrics (we will see in a moment..)
• Real-world applications
• Lack of case studies
Challenges
40
Metrics
41
Vihan Jain, Gabriel Magalhaes, Alexander Ku, Ashish Vaswani, Eugene Ie, Jason Baldridge, “Stay on the Path: Instruction Fidelity in Vision-and-Language Navigation”, in ACL 2019.
Blue path: reference path from R4R
SPL for the red path: 1.0; SPL for the orange path: 0.17
CLS for the red path: 0.23; CLS for the orange path: 0.87
• Common metrics used to evaluate VLN performance focus on reaching the goal instead of evaluating the step-by-step correspondence with the given instructions.
• We need instruction-oriented metrics instead of goal-oriented metrics.
• Recently proposed metrics for this purpose:
• Coverage Weighted by Length Score (CLS) (Jain et al. 2019).
• Metrics based on dynamic time warping (nDTW, SDTW) (Magalhães et al. 2019)
Problem with commonly used metrics
42
• A new dataset has been created to evaluate the newly proposed metric: Room-for-Room (R4R)
• R4R is built by joining paths and the corresponding descriptions in R2R
• From goal-oriented to instruction-oriented metrics
• A list of desiderata is proposed:
• Path similarity
• Soft penalties
• Unique optimum
• Scale invariance
• Computational tractability
CLS (Jain et al. 2019)
43
Vihan Jain, Gabriel Magalhaes, Alexander Ku, Ashish Vaswani, Eugene Ie, Jason Baldridge, “Stay on the Path: Instruction Fidelity in Vision-and-Language Navigation”, in ACL 2019.
• CLS is defined as the product of a path coverage term and a length score: CLS(𝑃, 𝑅) = PC(𝑃, 𝑅) ∙ LS(𝑃, 𝑅)
CLS (Jain et al. 2019)
44
Vihan Jain, Gabriel Magalhaes, Alexander Ku, Ashish Vaswani, Eugene Ie, Jason Baldridge, “Stay on the Path: Instruction Fidelity in Vision-and-Language Navigation”, in ACL 2019.
Path Coverage (PC): rewards the agent for staying close to every point of the reference path.
Length Score (LS): penalizes paths whose length deviates from the expected path length.
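A sketch following these definitions (illustrative; the exact distance function and threshold in the paper may differ):

```python
import math

def cls(path, ref, d_th=3.0):
    """Coverage weighted by Length Score for an agent path vs. a
    reference path (lists of 2D points)."""
    def length(p):
        return sum(math.dist(a, b) for a, b in zip(p, p[1:]))
    # Path Coverage: soft reward for covering each reference point
    pc = sum(math.exp(-min(math.dist(r, q) for q in path) / d_th)
             for r in ref) / len(ref)
    # Length Score: penalizes paths longer or shorter than expected
    epl = pc * length(ref)                  # expected path length
    ls = epl / (epl + abs(epl - length(path)))
    return pc * ls

straight = [(0, 0), (1, 0), (2, 0)]
perfect = cls(straight, straight)               # identical paths
detour = cls([(0, 0), (0, 5), (2, 0)], straight)
```

An identical path scores 1; a path that reaches the same endpoint via a long detour is penalized by both factors.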
• The proposed metrics are based on Dynamic Time Warping (DTW).
• DTW is a well-established method to measure the similarity between time series.
• DTW finds an optimal warping that aligns the elements of the two series so that the cumulative distance between aligned elements is minimized.
• This problem can be solved using dynamic programming.
nDTW and SDTW (Magalhães et al. 2019)
45
Gabriel Magalhães, Vihan Jain, Alexander Ku, Eugene Ie, Jason Baldridge, “Effective and General Evaluation for Instruction Conditioned Navigation using Dynamic Time Warping”, ArXiv: 1907.05446 (2019).
• Normalized Dynamic Time Warping (nDTW) metric: nDTW(𝑅, 𝑄) = exp(−DTW(𝑅, 𝑄) / (|𝑅| ∙ 𝑑𝑡ℎ))
• Success weighted by normalized Dynamic Time Warping (SDTW) metric: nDTW multiplied by a binary success indicator
nDTW and SDTW (Magalhães et al. 2019)
46
Gabriel Magalhães, Vihan Jain, Alexander Ku, Eugene Ie, Jason Baldridge, “Effective and General Evaluation for Instruction Conditioned Navigation using Dynamic Time Warping”, ArXiv: 1907.05446 (2019).
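The DTW recurrence and both metrics can be sketched as follows (a minimal version assuming 2D points; the threshold d_th = 3 m mirrors the success criterion used in R2R):

```python
import math

def dtw(ref, pred):
    """Dynamic Time Warping distance between two paths, solved with
    the standard dynamic-programming recurrence."""
    n, m = len(ref), len(pred)
    D = [[math.inf] * (m + 1) for _ in range(n + 1)]
    D[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = math.dist(ref[i - 1], pred[j - 1])
            D[i][j] = cost + min(D[i - 1][j], D[i][j - 1], D[i - 1][j - 1])
    return D[n][m]

def ndtw(ref, pred, d_th=3.0):
    """Normalized DTW in (0, 1]; 1 means the paths coincide."""
    return math.exp(-dtw(ref, pred) / (len(ref) * d_th))

def sdtw(ref, pred, goal, d_th=3.0):
    """Success-weighted nDTW: zero unless the agent stops near the goal."""
    return ndtw(ref, pred, d_th) if math.dist(pred[-1], goal) <= d_th else 0.0

ref = [(0, 0), (1, 0), (2, 0)]
exact = ndtw(ref, ref)                               # identical paths
failed = sdtw(ref, [(0, 0), (0, 10)], goal=(2, 0))   # stops far away
```

Unlike SPL, these scores track fidelity to the whole reference path, not just the endpoint.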
• VLN is a modern, complex task which involves visual recognition, 3D scene understanding, and language processing.
• The multi-modal information processed (spoken/text instructions, images, depth) must be turned into a sequence of actions.
• Simulation environments and benchmarks are continuously under development (this is good news).
• The transition from simulation to real applications is an important open issue.
Conclusions
47
{name.surname}@unimore.it
University of Modena and Reggio Emilia, Italy
Thank you for your attention