Intelligenza e Visione Artificiale - IDEA B3...•Italian SuperComputing Resource Allocation...
![Page 1: Intelligenza e Visione Artificiale - IDEA B3...•Italian SuperComputing Resource Allocation –CINECA •Computer Vision Foundation, CVPL-IAPR, AIXIA AIMAGELAB AimageLab UNIMORE and](https://reader034.fdocuments.net/reader034/viewer/2022043004/5f8591977a0b8d750514612c/html5/thumbnails/1.jpg)
Artificial Intelligence and Computer Vision: technologies and opportunities for the Education world
Lorenzo Baraldi
University of Modena and Reggio Emilia, Italy
{name.surname}@unimore.it
![Page 2: Intelligenza e Visione Artificiale - IDEA B3...•Italian SuperComputing Resource Allocation –CINECA •Computer Vision Foundation, CVPL-IAPR, AIXIA AIMAGELAB AimageLab UNIMORE and](https://reader034.fdocuments.net/reader034/viewer/2022043004/5f8591977a0b8d750514612c/html5/thumbnails/2.jpg)
Who
• 6 staff members (professors and researchers)
• 12 PhD Students
• 5 Research assistants, SW developers
• 3 (ex) spinoff companies
Open collaborations
• Facebook FAIR (F), Eurecom (F)
• Panasonic (USA)
• Ferrari (I), Maserati (I)
• CNR (I)
• MIUR, EU and Italian public bodies
• Italian SuperComputing Resource Allocation – CINECA
• Computer Vision Foundation, CVPL-IAPR, AIXIA
AIMAGELAB
Aimage Lab UNIMORE and Ferrari spa
![Page 3: Intelligenza e Visione Artificiale - IDEA B3...•Italian SuperComputing Resource Allocation –CINECA •Computer Vision Foundation, CVPL-IAPR, AIXIA AIMAGELAB AimageLab UNIMORE and](https://reader034.fdocuments.net/reader034/viewer/2022043004/5f8591977a0b8d750514612c/html5/thumbnails/3.jpg)
Outline
• Introduction to Artificial Intelligence
• AI for Images: Convolutional Neural Networks
• Vision and Language
• Vision, Language and Action
THIS TALK
![Page 4: Intelligenza e Visione Artificiale - IDEA B3...•Italian SuperComputing Resource Allocation –CINECA •Computer Vision Foundation, CVPL-IAPR, AIXIA AIMAGELAB AimageLab UNIMORE and](https://reader034.fdocuments.net/reader034/viewer/2022043004/5f8591977a0b8d750514612c/html5/thumbnails/4.jpg)
A neural network: a composition of differentiable functions with learnable parameters.
Once trained, it can predict an output.
How do we train it?
We define an error (loss) as a function of the learnable parameters, then iteratively update the parameters so that the error is minimized.
LEARNING
[Diagram: Input (sensors, data) → Artificial Intelligence → Output (movement, text)]
![Page 5: Intelligenza e Visione Artificiale - IDEA B3...•Italian SuperComputing Resource Allocation –CINECA •Computer Vision Foundation, CVPL-IAPR, AIXIA AIMAGELAB AimageLab UNIMORE and](https://reader034.fdocuments.net/reader034/viewer/2022043004/5f8591977a0b8d750514612c/html5/thumbnails/5.jpg)
How do we train it?
We define an error (loss) as a function of the learnable parameters, then iteratively update the parameters so that the error is minimized.
GRADIENT DESCENT
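The update rule described on this slide can be sketched in a few lines. This is a minimal illustration and not code from the talk: the one-parameter loss `(w - 3)^2`, the learning rate and the step count are assumptions chosen so that the result is easy to check by hand.

```python
def gradient_descent(grad, w0, learning_rate=0.1, steps=100):
    """Repeatedly move the parameter against the gradient of the loss."""
    w = w0
    for _ in range(steps):
        w = w - learning_rate * grad(w)
    return w


# Toy loss (an assumption for illustration): L(w) = (w - 3)^2, minimized at w = 3.
def loss(w):
    return (w - 3.0) ** 2


# Its derivative with respect to the single learnable parameter w.
def loss_grad(w):
    return 2.0 * (w - 3.0)


w_star = gradient_descent(loss_grad, w0=0.0)
print(w_star, loss(w_star))
```

In a real network the same loop runs over millions of parameters, and the gradient is obtained by backpropagation rather than written by hand.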
![Page 6: Intelligenza e Visione Artificiale - IDEA B3...•Italian SuperComputing Resource Allocation –CINECA •Computer Vision Foundation, CVPL-IAPR, AIXIA AIMAGELAB AimageLab UNIMORE and](https://reader034.fdocuments.net/reader034/viewer/2022043004/5f8591977a0b8d750514612c/html5/thumbnails/6.jpg)
LEARNING
Machine Learning is a type of Artificial Intelligence that provides computers with the ability to learn without being explicitly programmed.
[Diagram: Training: Labeled Data → Machine Learning Algorithm → Learned Model. Prediction: Data → Learned Model → Prediction]
It provides various techniques that can learn from data and make predictions on it.
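The two phases on this slide, training on labeled data and then predicting on new data, can be illustrated with a very small classifier. The nearest-centroid method and the toy data below are assumptions picked for brevity; the slide does not name a specific algorithm.

```python
from collections import defaultdict


def train(labeled_data):
    """Training phase: learn one centroid (mean feature vector) per label."""
    sums = {}
    counts = defaultdict(int)
    for features, label in labeled_data:
        if label not in sums:
            sums[label] = [0.0] * len(features)
        sums[label] = [s + x for s, x in zip(sums[label], features)]
        counts[label] += 1
    return {label: [s / counts[label] for s in total] for label, total in sums.items()}


def predict(model, features):
    """Prediction phase: return the label whose centroid is closest."""
    def squared_distance(centroid):
        return sum((c - x) ** 2 for c, x in zip(centroid, features))
    return min(model, key=lambda label: squared_distance(model[label]))


# Hypothetical labeled data: two-dimensional features with a class label.
labeled_data = [
    ([1.0, 1.2], "small"),
    ([0.8, 1.0], "small"),
    ([4.0, 4.2], "large"),
    ([4.4, 3.9], "large"),
]

model = train(labeled_data)        # the "learned model"
print(predict(model, [1.1, 0.9]))  # prediction on unseen data
```

Nothing in `predict` hard-codes what "small" or "large" means: the behaviour comes entirely from the examples passed to `train`.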
![Page 7: Intelligenza e Visione Artificiale - IDEA B3...•Italian SuperComputing Resource Allocation –CINECA •Computer Vision Foundation, CVPL-IAPR, AIXIA AIMAGELAB AimageLab UNIMORE and](https://reader034.fdocuments.net/reader034/viewer/2022043004/5f8591977a0b8d750514612c/html5/thumbnails/7.jpg)
CONVOLUTIONAL NEURAL NETWORKS
![Page 8: Intelligenza e Visione Artificiale - IDEA B3...•Italian SuperComputing Resource Allocation –CINECA •Computer Vision Foundation, CVPL-IAPR, AIXIA AIMAGELAB AimageLab UNIMORE and](https://reader034.fdocuments.net/reader034/viewer/2022043004/5f8591977a0b8d750514612c/html5/thumbnails/8.jpg)
[Diagram: a 4096-dimensional feature vector feeds a fully-connected layer (4096 to 1000), which outputs class scores, e.g. Cat: 0.9, Dog: 0.05, Car: 0.01, ...]
CONVOLUTIONAL NEURAL NETWORKS
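The last stage of the network on this slide, a fully-connected layer that turns a feature vector into class scores, can be sketched as follows. The sizes are shrunk from 4096 and 1000 to 8 and 3 so the example stays small, the weights are random rather than trained, and the use of a softmax to normalize the scores is an assumption (the slide only shows scores that look like probabilities).

```python
import math
import random


def fully_connected(x, weights, bias):
    """One output per row of weights: dot(row, x) + b."""
    return [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(weights, bias)]


def softmax(logits):
    """Turn raw scores into positive values that sum to 1."""
    m = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


random.seed(0)
in_dim, num_classes = 8, 3  # the slide uses 4096 and 1000
class_names = ["cat", "dog", "car"]

# Stand-ins for the convolutional features and the learned layer parameters.
features = [random.gauss(0.0, 1.0) for _ in range(in_dim)]
weights = [[random.gauss(0.0, 0.1) for _ in range(in_dim)] for _ in range(num_classes)]
bias = [0.0] * num_classes

scores = softmax(fully_connected(features, weights, bias))
for name, score in zip(class_names, scores):
    print(f"{name}: {score:.3f}")
```

With trained weights, the largest score would correspond to the class the network considers most likely for the input image.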
![Page 9: Intelligenza e Visione Artificiale - IDEA B3...•Italian SuperComputing Resource Allocation –CINECA •Computer Vision Foundation, CVPL-IAPR, AIXIA AIMAGELAB AimageLab UNIMORE and](https://reader034.fdocuments.net/reader034/viewer/2022043004/5f8591977a0b8d750514612c/html5/thumbnails/9.jpg)
CONVNETS ARE EVERYWHERE
Detection: [Faster R-CNN: Ren, He, Girshick, Sun 2015]
Segmentation: [Farabet et al., 2012]
![Page 10: Intelligenza e Visione Artificiale - IDEA B3...•Italian SuperComputing Resource Allocation –CINECA •Computer Vision Foundation, CVPL-IAPR, AIXIA AIMAGELAB AimageLab UNIMORE and](https://reader034.fdocuments.net/reader034/viewer/2022043004/5f8591977a0b8d750514612c/html5/thumbnails/10.jpg)
CONVNETS ARE EVERYWHERE
[Taigman et al. 2014]
[Simonyan et al. 2014]
![Page 11: Intelligenza e Visione Artificiale - IDEA B3...•Italian SuperComputing Resource Allocation –CINECA •Computer Vision Foundation, CVPL-IAPR, AIXIA AIMAGELAB AimageLab UNIMORE and](https://reader034.fdocuments.net/reader034/viewer/2022043004/5f8591977a0b8d750514612c/html5/thumbnails/11.jpg)
He et al, “Mask R-CNN”, arXiv 2017
MASK-RCNN ALSO DOES POSE
![Page 12: Intelligenza e Visione Artificiale - IDEA B3...•Italian SuperComputing Resource Allocation –CINECA •Computer Vision Foundation, CVPL-IAPR, AIXIA AIMAGELAB AimageLab UNIMORE and](https://reader034.fdocuments.net/reader034/viewer/2022043004/5f8591977a0b8d750514612c/html5/thumbnails/12.jpg)
Lecture 13 - 84, Fei-Fei Li & Justin John, May 17, 2018
Dumoulin, Shlens, and Kudlur, “A Learned Representation for Artistic Style”, ICLR 2017.
NEURAL STYLE TRANSFER
![Page 13: Intelligenza e Visione Artificiale - IDEA B3...•Italian SuperComputing Resource Allocation –CINECA •Computer Vision Foundation, CVPL-IAPR, AIXIA AIMAGELAB AimageLab UNIMORE and](https://reader034.fdocuments.net/reader034/viewer/2022043004/5f8591977a0b8d750514612c/html5/thumbnails/13.jpg)
M. Cornia, L. Baraldi, H.R. Tavakoli, R. Cucchiara. “CyTIR-Net: a Unified Cycle-Consistent Neural Model for Text and Image Retrieval.” ECCVW 2017.
CyTIR-Net txt2img results for the query captions:
• four men standing, one with an entire bunch of carrots in his mouth.
• brown teddy bear with glasses sitting on blue couch.
• two beach chairs and a white and red umbrella at a beach.
• a man on a snowboard using a parachute.
• a man surfing on a blue green wave.
• a woman riding a bike down a street next to a divider.
Beyond tags and pre-defined concepts: embed text and images into common embedding spaces
VISUAL-SEMANTIC RETRIEVAL
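Once text and images live in a common embedding space, text-to-image retrieval reduces to a nearest-neighbour search. The sketch below is not CyTIR-Net: the three-dimensional embeddings and file names are hand-written assumptions standing in for the outputs of trained text and image encoders, and cosine similarity is assumed as the ranking measure.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def retrieve(query_embedding, image_embeddings, top_k=2):
    """Rank images by similarity to the query caption embedding."""
    ranked = sorted(
        image_embeddings,
        key=lambda name: cosine_similarity(query_embedding, image_embeddings[name]),
        reverse=True,
    )
    return ranked[:top_k]


# Hypothetical image embeddings (a real system would compute these with a CNN).
image_embeddings = {
    "beach_umbrella.jpg": [0.9, 0.1, 0.0],
    "teddy_bear.jpg": [0.1, 0.9, 0.1],
    "surfer.jpg": [0.7, 0.0, 0.7],
}

# Hypothetical embedding of the caption "two beach chairs and a white and red umbrella at a beach."
query_embedding = [0.8, 0.2, 0.1]

print(retrieve(query_embedding, image_embeddings))
```

Because the query is free text mapped into the same space as the images, retrieval is not limited to a fixed set of tags or pre-defined concepts.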
![Page 14: Intelligenza e Visione Artificiale - IDEA B3...•Italian SuperComputing Resource Allocation –CINECA •Computer Vision Foundation, CVPL-IAPR, AIXIA AIMAGELAB AimageLab UNIMORE and](https://reader034.fdocuments.net/reader034/viewer/2022043004/5f8591977a0b8d750514612c/html5/thumbnails/14.jpg)
SPEAKING THE SAME LANGUAGE: GENERATING DESCRIPTIONS
CONV-NET + Recurrent NET (LSTM)
"...a white shark swims in the ocean water..."
Generated caption: A woman is looking at a television screen.
Generated caption: A boat is in the water near a large mountain.
Generated caption: A woman in a red jacket is riding a bicycle.
![Page 15: Intelligenza e Visione Artificiale - IDEA B3...•Italian SuperComputing Resource Allocation –CINECA •Computer Vision Foundation, CVPL-IAPR, AIXIA AIMAGELAB AimageLab UNIMORE and](https://reader034.fdocuments.net/reader034/viewer/2022043004/5f8591977a0b8d750514612c/html5/thumbnails/15.jpg)
QUALITATIVE RESULTS
GT: A large passenger jet sitting on top of an airport runway.
Prediction: A large jetliner sitting on top of an airport runway.
GT: Family of five people in a green canoe on a lake.
Prediction: A group of people sitting on a boat in a lake.
GT: Two people in Swarthmore College sweatshirts are playing frisbee.
Prediction: A man and a woman are playing frisbee on a field.
![Page 16: Intelligenza e Visione Artificiale - IDEA B3...•Italian SuperComputing Resource Allocation –CINECA •Computer Vision Foundation, CVPL-IAPR, AIXIA AIMAGELAB AimageLab UNIMORE and](https://reader034.fdocuments.net/reader034/viewer/2022043004/5f8591977a0b8d750514612c/html5/thumbnails/16.jpg)
To extend captioning to unknown domains, we recast captioning as attending to a sequence of regions. Potentially:
• We can include out-of-vocabulary words, i.e. words not found in the training set.
• We can control which regions are described and in which order, and give more weight to relevant classes than to irrelevant ones.
CONTROLLABLE CAPTIONING
M. Cornia, L. Baraldi, R. Cucchiara. “Show, Control and Tell: A Framework for Generating Controllable and Grounded Captions.” CVPR 2019.
![Page 17: Intelligenza e Visione Artificiale - IDEA B3...•Italian SuperComputing Resource Allocation –CINECA •Computer Vision Foundation, CVPL-IAPR, AIXIA AIMAGELAB AimageLab UNIMORE and](https://reader034.fdocuments.net/reader034/viewer/2022043004/5f8591977a0b8d750514612c/html5/thumbnails/17.jpg)
Results when controlling with a sequence of regions
CONTROLLABLE IMAGE CAPTIONING
[1] Marcella Cornia, Lorenzo Baraldi, and Rita Cucchiara. "Show, Control and Tell: A Framework for Generating Controllable and Grounded Captions." CVPR 2019.
![Page 18: Intelligenza e Visione Artificiale - IDEA B3...•Italian SuperComputing Resource Allocation –CINECA •Computer Vision Foundation, CVPL-IAPR, AIXIA AIMAGELAB AimageLab UNIMORE and](https://reader034.fdocuments.net/reader034/viewer/2022043004/5f8591977a0b8d750514612c/html5/thumbnails/18.jpg)
Results when controlling with a set of regions
CONTROLLABLE IMAGE CAPTIONING
![Page 19: Intelligenza e Visione Artificiale - IDEA B3...•Italian SuperComputing Resource Allocation –CINECA •Computer Vision Foundation, CVPL-IAPR, AIXIA AIMAGELAB AimageLab UNIMORE and](https://reader034.fdocuments.net/reader034/viewer/2022043004/5f8591977a0b8d750514612c/html5/thumbnails/19.jpg)
Results when controlling with a set of regions
CONTROLLABLE IMAGE CAPTIONING
![Page 20: Intelligenza e Visione Artificiale - IDEA B3...•Italian SuperComputing Resource Allocation –CINECA •Computer Vision Foundation, CVPL-IAPR, AIXIA AIMAGELAB AimageLab UNIMORE and](https://reader034.fdocuments.net/reader034/viewer/2022043004/5f8591977a0b8d750514612c/html5/thumbnails/20.jpg)
CONNECTING VISION, LANGUAGE AND ACTIONS
• The navigation goal is given by a natural language instruction;
• Visual information helps the agent progress towards the target;
• The agent must know when to stop (i.e. when the goal is reached).
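The third point, that stopping is itself a decision the agent has to make, can be made concrete with a toy episode loop. Everything here is an assumption for illustration: the one-dimensional corridor, the action names and the hand-written policy, which stands in for a learned model that would read the instruction and the visual input.

```python
STOP = "stop"
MOVES = {"forward": 1, "backward": -1}


def run_episode(policy, start, goal, max_steps=20):
    """Let the agent act until it chooses STOP; succeed only if it stopped at the goal."""
    position, path = start, [start]
    for _ in range(max_steps):
        action = policy(position)
        if action == STOP:
            return path, position == goal
        position += MOVES[action]
        path.append(position)
    return path, False  # the agent never decided to stop


def make_toy_policy(target):
    """Hypothetical policy: in a real agent the target is inferred from the instruction."""
    def policy(position):
        if position == target:
            return STOP
        return "forward" if position < target else "backward"
    return policy


path, success = run_episode(make_toy_policy(target=3), start=0, goal=3)
print(path, success)
```

An agent that walks past the goal, or reaches it but keeps moving until the step budget runs out, counts as a failure under this loop, which is why the stop decision matters.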
![Page 21: Intelligenza e Visione Artificiale - IDEA B3...•Italian SuperComputing Resource Allocation –CINECA •Computer Vision Foundation, CVPL-IAPR, AIXIA AIMAGELAB AimageLab UNIMORE and](https://reader034.fdocuments.net/reader034/viewer/2022043004/5f8591977a0b8d750514612c/html5/thumbnails/21.jpg)
Instruction: Walk up the stairs. Turn right at the top of the stairs and walk along the red ropes. Walk through the open doorway straight ahead along the red carpet. Walk through that hallway into the room with couches and a marble coffee table.
Dynamic Response Map
Agent position (and next action)
VISION AND LANGUAGE NAVIGATION
![Page 22: Intelligenza e Visione Artificiale - IDEA B3...•Italian SuperComputing Resource Allocation –CINECA •Computer Vision Foundation, CVPL-IAPR, AIXIA AIMAGELAB AimageLab UNIMORE and](https://reader034.fdocuments.net/reader034/viewer/2022043004/5f8591977a0b8d750514612c/html5/thumbnails/22.jpg)
![Page 23: Intelligenza e Visione Artificiale - IDEA B3...•Italian SuperComputing Resource Allocation –CINECA •Computer Vision Foundation, CVPL-IAPR, AIXIA AIMAGELAB AimageLab UNIMORE and](https://reader034.fdocuments.net/reader034/viewer/2022043004/5f8591977a0b8d750514612c/html5/thumbnails/23.jpg)
![Page 24: Intelligenza e Visione Artificiale - IDEA B3...•Italian SuperComputing Resource Allocation –CINECA •Computer Vision Foundation, CVPL-IAPR, AIXIA AIMAGELAB AimageLab UNIMORE and](https://reader034.fdocuments.net/reader034/viewer/2022043004/5f8591977a0b8d750514612c/html5/thumbnails/24.jpg)
![Page 25: Intelligenza e Visione Artificiale - IDEA B3...•Italian SuperComputing Resource Allocation –CINECA •Computer Vision Foundation, CVPL-IAPR, AIXIA AIMAGELAB AimageLab UNIMORE and](https://reader034.fdocuments.net/reader034/viewer/2022043004/5f8591977a0b8d750514612c/html5/thumbnails/25.jpg)
![Page 26: Intelligenza e Visione Artificiale - IDEA B3...•Italian SuperComputing Resource Allocation –CINECA •Computer Vision Foundation, CVPL-IAPR, AIXIA AIMAGELAB AimageLab UNIMORE and](https://reader034.fdocuments.net/reader034/viewer/2022043004/5f8591977a0b8d750514612c/html5/thumbnails/26.jpg)
![Page 27: Intelligenza e Visione Artificiale - IDEA B3...•Italian SuperComputing Resource Allocation –CINECA •Computer Vision Foundation, CVPL-IAPR, AIXIA AIMAGELAB AimageLab UNIMORE and](https://reader034.fdocuments.net/reader034/viewer/2022043004/5f8591977a0b8d750514612c/html5/thumbnails/27.jpg)
Thank you!
aimagelab.ing.unimore.it
Marcella Cornia, Lorenzo Baraldi, Rita Cucchiara, Matteo Tomei, Massimiliano Corsini, Federico Landi, Matteo Stefanini