How Machines Learn to Talk
Amitabha Mukerjee IIT Kanpur
work done with: Computer Vision: Profs. C. Venkatesh, Pabitra Mitra
Prithvijit Guha, A. Ramakrishna Rao, Pradeep Vaghela
Natural Language: Prof. Achla Raina, V. Shreeniwas
Robotics
Collaborations:IGCAR Kalpakkam
Sanjay Gandhi PG Medical Hospital
Visual Robot Navigation
Time-to-Collision based Robot Navigation
Hyper-Redundant Manipulators
• Reconfigurable Workspaces / Emergency Access
• Optimal Design of Hyper-Redundant Systems – Scara and 3D
The same manipulator can work in changing workspaces
Planar Hyper-Redundancy
4-link Planar Robot
Motion Planning
Micro-Robots
• Micro Soccer Robots (1999-)
• 8cm Smart Surveillance Robot – 1m/s
• Autonomous Flying Robot (2004)
• Omni-directional platform (2002)
Omni-Directional Robot Sponsor: [email protected]
Flying Robot
heli-flight.wmv
Test Flight of UAV. Inertial Measurement Unit (IMU) under commercial production
Start-Up at
IIT Kanpur
Whirligig Robotics
Tracheal Intubation Device
Device for Intubation during general Anesthesia
Aperture for Fibre optic video cable
Endotracheal tube Aperture
Aperture for Oxygenation tube
Hole for suction tube
Control cables Attachment Points
Ball & Socket joint
Assists surgeon while inserting breathing tube during general anaesthesia
Sponsor: DST / [email protected]
Draupadi’s Swayamvar
Can the Arrow hit the rotating mark? Sponsor: Media Lab Asia
High DOF Motion Planning
• Accessing Hard to Reach spaces
• Design of Hyper-Redundant Systems
• Parallel Manipulators
Sponsor: BRNS / [email protected]
10-link 3D Robot – Optimal Design
Multimodal Language Acquisition
Consider a child observing a scene together with adults talking about it
Grounded Language : Symbols are grounded in perceptual signals
Use of simple videos with boxes and simple shapes – commonly used in social psychology
Objective
To develop a computational framework for Multimodal Language Acquisition:
• acquiring the perceptual structure corresponding to verbs
• using Recurrent Neural Networks as a biologically plausible model for temporal abstraction
• adapting the learned model to interpret activities in real videos
Visually Grounded Corpus
Two psychological research films, one based on the classic Heider & Simmel (1944) and the other based on Hide & Seek
These animations portray motion paths of geometric figures (Big Square, Small Square & Circle)
Chase Alt
Cognate Clustering
• Similarity clustering: different expressions for the same action, e.g. “move away from center” vs “go to a corner”
• Frequency: remove infrequent lexical units
• Synonymy: sets of lexical units used consistently in the same intervals, to mark the same action, for the same set of agents
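One way to realise the synonymy criterion is to merge lexical units whose sets of usage intervals overlap strongly. The greedy merging and Jaccard threshold below are illustrative assumptions, not the described implementation:

```python
# Hypothetical sketch of interval-based cognate clustering (names and
# threshold are illustrative, not the paper's).

def jaccard(a, b):
    """Jaccard overlap between two sets of time indices."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if a | b else 0.0

def cluster_cognates(usage, threshold=0.6):
    """Greedily merge lexical units whose interval sets overlap above threshold."""
    clusters = []
    for unit, intervals in usage.items():
        for cluster in clusters:
            rep = cluster[0]  # compare against the cluster's first member
            if jaccard(usage[rep], intervals) >= threshold:
                cluster.append(unit)
                break
        else:
            clusters.append([unit])
    return clusters

# Toy commentary: two phrasings of the same action share most time indices.
usage = {
    "move away from center": {3, 4, 5, 6},
    "go to a corner":        {4, 5, 6},
    "hit":                   {10, 11},
}
clusters = cluster_cognates(usage)
```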
Perceptual Process
[Block diagram of the VICES framework: multimodal input (video + descriptions); Feature Extraction turns the video into features, Cognate Clustering processes the descriptions, and a trained Simple Recurrent Network produces event labels]
Design of Feature Set
The features selected here relate to spatial aspects of conceptual primitives in children, such as position, relative pose, and velocity.
We use features that are kinematic in nature: temporal derivatives or simple transforms of the basic ones.
Monadic Features
Dyadic Predicates
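A toy sketch of the two feature families, assuming each object's trajectory is a list of per-frame (x, y) positions; the feature names (`pos`, `vel`, `dist`, `d_dist`) are illustrative, not the system's:

```python
import math

def monadic_features(traj):
    """Per-object kinematic features: position and velocity (finite differences).
    `traj` is a list of (x, y) positions, one per frame."""
    feats = []
    for t in range(1, len(traj)):
        (x0, y0), (x1, y1) = traj[t - 1], traj[t]
        feats.append({"pos": (x1, y1), "vel": (x1 - x0, y1 - y0)})
    return feats

def dyadic_features(traj_a, traj_b):
    """Pairwise features: inter-object distance and its rate of change
    (a negative rate means the objects are approaching)."""
    dist = [math.dist(a, b) for a, b in zip(traj_a, traj_b)]
    return [{"dist": d1, "d_dist": d1 - d0} for d0, d1 in zip(dist, dist[1:])]

# Toy example: A chases B along the x-axis and closes the gap.
A = [(0, 0), (1, 0), (2, 0)]
B = [(3, 0), (3.5, 0), (4, 0)]
feats = monadic_features(A)
dyad = dyadic_features(A, B)
```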
Video and Commentary for Event Structures [VICES]
The Classification Problem
The problem is one of time-series classification.
Possible methodologies include:
• Logic-based methods
• Hidden Markov Models
• Recurrent Neural Networks
Elman Network
Commonly a two-layer network with feedback from the first-layer output to the first-layer input.
Elman networks detect and generate time-varying patterns; they are also able to learn spatial patterns.
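The Elman architecture can be sketched in a few lines of NumPy. The layer sizes and random weights below are illustrative (untrained), chosen only to show how the context layer feeds the previous hidden state back into the first layer:

```python
import numpy as np

# Minimal Elman (simple recurrent) network forward pass; weights are
# random placeholders, not a trained model.
rng = np.random.default_rng(0)

n_in, n_hid, n_out = 4, 8, 2
W_in  = rng.normal(scale=0.1, size=(n_hid, n_in))   # input -> hidden
W_ctx = rng.normal(scale=0.1, size=(n_hid, n_hid))  # context (previous hidden) -> hidden
W_out = rng.normal(scale=0.1, size=(n_out, n_hid))  # hidden -> output

def srn_forward(sequence):
    """Run a feature sequence through the SRN; the context layer carries
    the previous hidden state, giving the network temporal memory."""
    h = np.zeros(n_hid)                   # context starts empty
    outputs = []
    for x in sequence:
        h = np.tanh(W_in @ x + W_ctx @ h)  # hidden depends on input AND context
        outputs.append(W_out @ h)          # per-frame classification scores
    return np.array(outputs)

seq = rng.normal(size=(5, n_in))          # 5 frames of 4 features each
scores = srn_forward(seq)
```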
Feature Extraction in Abstract Videos
• Each image is read into a 2D matrix
• Connected Component Analysis is performed
• A bounding box is computed for each connected component
• Dynamic tracking is used to keep track of each object
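The first three steps can be sketched with a plain flood fill (4-connectivity); the grid values and sizes below are toy assumptions:

```python
# Sketch of connected-component analysis on a binary frame (pure Python,
# 4-connectivity); each component gets an axis-aligned bounding box.

def connected_components(grid):
    """Return a list of bounding boxes (min_r, min_c, max_r, max_c),
    one per connected foreground component."""
    rows, cols = len(grid), len(grid[0])
    seen = [[False] * cols for _ in range(rows)]
    boxes = []
    for r in range(rows):
        for c in range(cols):
            if grid[r][c] and not seen[r][c]:
                stack, box = [(r, c)], [r, c, r, c]
                seen[r][c] = True
                while stack:                      # depth-first flood fill
                    i, j = stack.pop()
                    box = [min(box[0], i), min(box[1], j),
                           max(box[2], i), max(box[3], j)]
                    for di, dj in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                        ni, nj = i + di, j + dj
                        if 0 <= ni < rows and 0 <= nj < cols \
                                and grid[ni][nj] and not seen[ni][nj]:
                            seen[ni][nj] = True
                            stack.append((ni, nj))
                boxes.append(tuple(box))
    return boxes

# Two separate blobs in a 4x6 frame.
frame = [
    [1, 1, 0, 0, 0, 0],
    [1, 1, 0, 0, 1, 1],
    [0, 0, 0, 0, 1, 1],
    [0, 0, 0, 0, 0, 0],
]
boxes = connected_components(frame)
```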
Working with Real Videos
Challenges:
• Noise in real-world videos
• Illumination changes
• Occlusions
• Extracting depth information
Our Setup: camera is fixed at head height; angle of depression is approximately 0 degrees.
Video
Background Subtraction
• Learn on still background images
• Find pixel intensity distributions
• Classify pixel (x, y) as background if |P(x,y) − µ(x,y)| < k·σ(x,y)
Remove Shadows
• Special case of reduced illumination: S = k·P, where k < 1.0
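A minimal NumPy sketch of this pipeline, assuming grayscale frames; the threshold k and the shadow ratio band are illustrative parameters, not the values used in the system:

```python
import numpy as np

# Per-pixel background model: learn mean and std over still background
# frames, then label a pixel foreground when it deviates by more than
# k standard deviations. Shadows are treated as a special case of
# uniformly reduced illumination, S = k_s * P with k_s < 1.

def fit_background(frames):
    """frames: array (n_frames, H, W) of grayscale background images."""
    frames = np.asarray(frames, dtype=float)
    return frames.mean(axis=0), frames.std(axis=0) + 1e-6

def foreground_mask(frame, mu, sigma, k=2.5):
    """True where |P - mu| exceeds k*sigma (pixel is not background)."""
    return np.abs(frame - mu) > k * sigma

def shadow_mask(frame, mu, ratio_lo=0.5, ratio_hi=0.95):
    """Flag pixels darkened by a roughly constant factor of the background."""
    ratio = frame / (mu + 1e-6)
    return (ratio > ratio_lo) & (ratio < ratio_hi)

# Toy data: 20 noisy still frames of a uniform background at intensity 100.
bg = [np.full((2, 2), 100.0) + np.random.default_rng(i).normal(0, 1, (2, 2))
      for i in range(20)]
mu, sigma = fit_background(bg)
frame = np.array([[100.0, 200.0], [70.0, 100.0]])  # one bright object, one shadow
fg = foreground_mask(frame, mu, sigma)
sh = shadow_mask(frame, mu)
```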
Contd.
• Extract human blobs by Connected Component Analysis
• A bounding box is computed for each person
Track Human Blobs
Each object is tracked using a mean-shift tracking algorithm.
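A much-simplified illustration of the mean-shift idea: given a per-pixel similarity map for the tracked blob, the window centre is repeatedly moved to the local weighted centroid until it converges. Real trackers compute the weights from colour histograms; here the map is given directly as a toy input:

```python
import numpy as np

def mean_shift(weights, center, half=2, iters=20):
    """Move the window centre to the weighted centroid of `weights`
    within a (2*half+1)-sized window, repeating until convergence."""
    H, W = weights.shape
    cy, cx = center
    for _ in range(iters):
        y0, y1 = max(0, cy - half), min(H, cy + half + 1)
        x0, x1 = max(0, cx - half), min(W, cx + half + 1)
        win = weights[y0:y1, x0:x1]
        if win.sum() == 0:            # nothing to track inside the window
            break
        ys, xs = np.mgrid[y0:y1, x0:x1]
        ny = int(round((ys * win).sum() / win.sum()))
        nx = int(round((xs * win).sum() / win.sum()))
        if (ny, nx) == (cy, cx):      # converged
            break
        cy, cx = ny, nx
    return cy, cx

weights = np.zeros((10, 10))
weights[6:9, 6:9] = 1.0               # blob has moved to the lower right
found = mean_shift(weights, center=(5, 5))
```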
Contd..
Depth Estimation
Two approximations:
• Using Gibson’s affordances
• Camera geometry
Affordances: Visual Cues
The action of a human is triggered by the environment itself: a floor offers “walk-on-ability”.
Every object affords certain actions to perceive, along with anticipated effects: a cup’s handle affords grasping, lifting, and drinking.
Contd..
Gibson’s Model
The horizon is fixed at the head height of the observer.
Monocular depth cues:
• Interposition: an object that occludes another is closer.
• Height in the visual field: the higher an object is in the image, the farther away it is.
Depth Estimation: Pin-hole Camera Model
Mapping (X, Y, Z) to (x, y):
x = X·f / Z
y = Y·f / Z
For the point of contact with the ground: Z ∝ 1/y, X ∝ x/y
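Under these proportionalities, absolute depth follows once the camera height h and focal length f are fixed. The function below is a sketch with assumed values for f and h, taking image coordinates relative to the principal point (horizon at y = 0, ground-contact points at y > 0 with image y pointing down):

```python
# Depth from the ground-contact point under a pin-hole model, assuming a
# camera at head height h metres with zero depression angle; f is the
# focal length in pixels. Both f and h are illustrative values.

def ground_depth(x, y, f=500.0, h=1.6):
    """Recover (X, Z): lateral offset and depth of a ground-contact point."""
    Z = f * h / y        # depth:   Z proportional to 1/y
    X = x * h / y        # lateral: X proportional to x/y
    return X, Z

X1, Z1 = ground_depth(x=50, y=100)   # foot of a nearby person
X2, Z2 = ground_depth(x=50, y=50)    # foot of a person twice as far
```

Halving the image height of the contact point doubles the recovered depth, which is the Z ∝ 1/y relation above.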
Depth plot for “A chases B”: top view (Z–X plane)
Results
Separate SRN for each action:
• Trained & tested on different parts of the abstract video
• Trained on abstract video and tested on real video
Single SRN for all actions:
• Trained on synthetic video and tested on real video
Basis for Comparison
Let the total time of the visual sequence for each verb be t time units.
• E : intervals when the subjects describe an event as occurring
• E′ : intervals when VICES describes an event as occurring
• True Positives : E′ ∩ E
• False Positives : E′ − E
• False Negatives : E − E′
• Focus Mismatches (FM) : intervals classified with a focus mismatch
• Accuracy = (t − |E′ − E| − |E − E′|) / t
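These definitions can be computed directly on sets of time units; `score` and its toy intervals below are an illustrative sketch, with accuracy taken as the fraction of the timeline on which system and subjects agree:

```python
# E: time units the human subjects label as the event;
# E_pred: time units the system labels; t: total timeline length.

def score(E, E_pred, t):
    tp = len(E & E_pred)            # true positives:  both agree
    fp = len(E_pred - E)            # false positives: system only
    fn = len(E - E_pred)            # false negatives: subjects only
    accuracy = (t - fp - fn) / t    # fraction of the timeline in agreement
    return tp, fp, fn, accuracy

E      = set(range(10, 20))   # subjects: event occurs during frames 10-19
E_pred = set(range(12, 22))   # system:   event occurs during frames 12-21
tp, fp, fn, acc = score(E, E_pred, t=100)
```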
Separate SRN for each action
Framework: Abstract video

Verb          True Pos.   False Pos.   False Neg.   Focus Mism.   Accuracy
hit           46.02%      3.06%        53.98%       2.4%          92.37%
chase         24.44%      0%           75.24%       0.72%         93.71%
come closer   25.87%      14.61%       73.26%       16.77%        63.66%
move away     46.34%      7.21%        52.33%       15.95%        73.37%
spins         82.54%      0%           16.51%       24.7%         97.03%
moves         68.24%      0.12%        31.76%       1.97%         77.33%

Counts (same framework):

Verb          True Pos.   False Pos.   False Neg.   Focus Mism.
hit           3           3            1            1
chase         6           0            3            4
come closer   6           20           7            24
move away     8           3            0            14
spins         22          0            1            9
moves         5           1            2            7
Timeline comparison for Chase
Separate SRN for each action: Real video (action recognition only)

Verb        Retrieved   Relevant   True Pos.   False Pos.   False Neg.   Precision   Recall
A Chase B   237         140        135         96           5            58.4%       96.4%
B Chase A   76          130        76          0            56           100%        58.4%
Single SRN for all actions
Framework: Real video

Verb         Retrieved   Relevant   True Pos.   False Pos.   False Neg.   Precision   Recall
Chase        239         270        217         23           5            91.2%       80.7%
Going Away   21          44         13          8            31           61.9%       29.5%
Conclusions & Future Work
• The sparse nature of the video provides for ease of visual analysis
• Event structures are learned directly from the perceptual stream
Extensions:
• Learn fine nuances between event structures of related action words
• Learn morphological variations
• Extend the work towards using Long Short-Term Memory (LSTM)
• Hierarchical acquisition of higher-level action verbs