Vision for Robotics – Detection and Tracking
Transcript of Vision for Robotics – Detection and Tracking
Markus Vincze, Automation Control Institute
Vienna University of Technology, [email protected]
www.acin.tuwien.ac.at
PSFMR – Fermo, 11.-16.9.2006
Content
• Overview
• Tracking
  – Model-based tracking
  – Interest point tracking
  – Maximum tracking velocity
• Detection
  – Perceptual grouping
• Cognitive Vision
"Robot Assistant" – "James, please bring me my cup"
Research fields
• Machine & cognitive vision
• Robotics, visual servoing
• System integration
Motivation
3D (6 DoF) Object Tracking
RobVision – EU Project 1998-2001
• Model-based: line and ellipse features for determination of 3D pose
• Robustness: integration of model and image cues
• Real time, 25 Hz
Robot Navigation in Office
Vision for Robotics (V4R Tracking Tool)
Detection of Structure
• Find cylinders
• First step to perceive function
  – Container; graspable
• Structure to reduce combinatorial complexity
Vision for Natural Interaction
System functions
• Detect, track, recognise
• Spatio-temporal object relationships in 3D
• Semantic interpretation
ActIPret: Interpretation of human activities with objects
ActIPret – EU Project 2001 - 2004
MOVEMENT
• Movement of
  – Persons, objects, data
• Task: autonomous navigation
  – Wheelchair and table
  – Obstacle avoidance
  – Navigation
[Diagram: person, object, and information]
MOVEMENT – EU IST Project 2004-2007
Sensor Concept
Rationale of Sensor Concept
• Stereo vision
  – 3D: detect tables, chairs
  – Cheap; the only alternative is a TOF (time-of-flight) camera
  – Both investigated
• Infrared
  – Special directions: door traversal
• Bumpers
  – Last resort, hopefully never used
Example: Table Scene
• Objects learned from one scan
• Detection in one view, 2 sec
Summary
• 2D: robust detection and tracking
• 3D: classes of features
• Spatio-temporal relationships
• Prediction, context, cognitive approaches
• Framework – integration of vision and robots
Vision for Automation
(Some) Tasks of Perception
• Object detection
  – Objects? Rather a collection of primitives
  – Primitives or features
    • Interest points
    • Edge features: line, junction, parallels, rectangle; arc, ellipse
    • Surface patches
  – Result: object location in image (2D) and/or pose (3D)
• Object tracking: following a detected object/feature
  – Feature location in image sequences
  – Result: real-time pose through sequence (2D and/or 3D)
Approaches
• Model based
  – CAD model of object, environment; geometric features
• Appearance based
  – Enables easier learning of objects
  – Interest points or "whole" object
• Mixture: structure in data – Gestalt principles
  – Model physics of world and imaging process (rather than objects)
  – Features, perceptual grouping
Content
• Overview
• Tracking
  – Model-based tracking
  – Interest point tracking
  – Maximum tracking velocity
• Detection
  – Perceptual grouping
• Cognitive Vision
System View
• Task is known
• Objects involved are known – some model exists
• Environment is partly known
Object Tracking
• Arbitrary motion in 3D
– Navigation, manipulation
• Robustness in real world environment
• System integration: dynamic aspects
State of the Art (1/2)
• Model based object tracking
  – Gradient: Harris '88, Dickmanns '88, Lowe '92, Nagel '00, Thompson '01, Drummond '02, Kragic '03
  – Motion model: Dickmanns '88, Gennery '92, Isard '98
[Thompson’01] [Drummond’02] [Kragic‘03]
Model-based Object Tracking
State of the Art (2/2)
• Integration of image cues (cue integration)
  – Edge classification: Hoff '89, Poggio '89
  – Region based: Aloimonos '89, Toyama '99, Kragic '01, Schiele '02
Show object: colour + texture = found
Model-based Object Tracking
Tracking in V4R (Vision for Robotics)
• Model-based system for object tracking
• Robustness by integrating and evaluating cues
[Diagram: image → features → object tracking → 3D object pose]
Approach
• Window warping (Hager '98):
  – e.g., line is vertical in the image
• Colour Edge Projected Integration of Cues (CEPIC):
  – Pre-selection of relevant edgels
  – Local cues:
    • Image: intensity, colour, (texture)
    • Model: region belonging to object
[Diagram: tracking windows are warped [Hager '98] and fed to CEPIC, which outputs feature candidates]
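The slide only names window warping; below is a minimal sketch, assuming OpenCV, of how a tracking window could be rotated so a tracked model line appears vertical before edge search. The function name and window size are illustrative, not from V4R:

```python
import cv2
import numpy as np

def warp_window(img, center, line_angle_deg, half_size=16):
    """Rotate the image about the window centre so a tracked line of the
    given angle appears vertical, then cut out the tracking window."""
    # Rotation that maps the line direction onto the vertical image axis
    M = cv2.getRotationMatrix2D(center, line_angle_deg - 90.0, 1.0)
    rotated = cv2.warpAffine(img, M, (img.shape[1], img.shape[0]))
    x, y = int(round(center[0])), int(round(center[1]))
    return rotated[y - half_size:y + half_size, x - half_size:x + half_size]
```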
Integration of Image and Model Cues
EPIC – Edge Projected Integration of Cues (1/2)
1. Edge detection: all edges
2. For each edgel, sum the weighted cue likelihoods from the left and right side of the edge:
   e = Σ_{i=1..cues} ( w_left,i · H_left,i + w_right,i · H_right,i )
   • Image cues: intensity, colour (adaptation of μ and σ)
   • Model cue: object side
Integration of Image and Model Cues
EPIC – Edge Projected Integration of Cues (2/2)
3. Selection of the most likely edgels
[Histogram: number of edgels over likelihood e; an adaptive threshold selects the most likely edgels]
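The slides give only the weighted-sum formula and the adaptive-threshold idea; here is a minimal NumPy sketch of that scoring and selection step. Array shapes, weights, and the quantile-based threshold are illustrative assumptions, not the V4R implementation:

```python
import numpy as np

def epic_score(H_left, H_right, w_left, w_right):
    """Combine per-cue likelihoods from both sides of each edgel.

    H_left, H_right: (n_edgels, n_cues) likelihoods that the pixels
    left/right of each edgel belong to the tracked object.
    w_left, w_right: (n_cues,) cue weights.
    """
    return H_left @ w_left + H_right @ w_right

def select_edgels(e, keep_fraction=0.5):
    """Adaptive threshold: the cut-off adapts to the score
    distribution of the current frame instead of being fixed."""
    threshold = np.quantile(e, 1.0 - keep_fraction)
    return np.flatnonzero(e >= threshold)

# Hypothetical example: 100 edgels, 2 cues (intensity, colour)
rng = np.random.default_rng(0)
H_l, H_r = rng.random((100, 2)), rng.random((100, 2))
e = epic_score(H_l, H_r, np.array([0.6, 0.4]), np.array([0.6, 0.4]))
good = select_edgels(e)
```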
Extension – Occlusion Handling
Gradient only vs. CEPIC
Tests – Example Magazine Box
• Maximum gradient only
• Local and global cues
Model-based Approach
• Topological Integration of Cues (TOPIC):
  – Test of the feature topologies from the model:
    • Junctions, parallel lines
  – Global evaluation of sets of feature candidates
[Diagram: feature candidates → TOPIC → object candidates]
Pose Validation
• Validation of the image feature to model feature fit
• Final feature selection
• Detection of outliers
[Diagram: feature candidates → TOPIC → object candidates → Pose Validation → 3D pose]
Approach – Self-evaluation
• Scene-dependent evaluation of cues
• Ambiguity of elements is a measure of the perceived scene complexity
  – e.g., # candidates / feature
• Implementation:
  – Optional call of global evaluation methods (i.e., TOPIC and Pose Validation)
[Diagram: a switch decides whether feature candidates pass through the global evaluation stages]
Tests – Example Toy Helicopter
• CEPIC
• Local and global cues
Tests – Results

Method                 % correct   % wrong   time factor
Max. Gradient            54.1        22.6       1*
Epic (intensity only)    61.3        13.2       1.09
Cepic                    71.5        11.0       3.06
Cepic+Topic              74.5         7.8       3.33
Topic+Pose               68.3        11.7       1.54
Cepic+Topic+Pose         77.1         5.0       3.35
Switch                   77.6         4.6       3.3

*factor 1 = 5.4 ms/line
Conclusion – Tracking (1/2)
• Improvement with each additional cue
• Edges tracked: 77.6%, wrong edge: 4.6%
• Remaining: 14.0%
  – Bad contrast, reflections, camera saturation
Model-based Object Tracking
Conclusion – Tracking (2/2)
• Increasing robustness by self-evaluation using perceived redundancy
• Size: is known, easy to estimate – exploit it
• Limits
  – Texture, multi-coloured regions
  – Few control points
• Problem: automatic initialisation
• Run live – otherwise the system gets tuned to specific sequences
• V4R homepage: http://robsens.acin.tuwien.ac.at/v4r/
Model-based Object Tracking
Content
• Overview
• Tracking
  – Model-based tracking
  – Interest point tracking
  – Maximum tracking velocity
• Detection
  – Perceptual grouping
• Cognitive Vision
Approaches
• Model based
  – CAD model of object, environment; geometric features
• Appearance based
  – Enables easier learning of objects
  – Interest points or "whole" object
• Mixture: structure in data – Gestalt principles
  – Model physics of world and imaging process (rather than objects)
  – Features, perceptual grouping
Objects and Interest Points
• Extraction of interest points (characteristic locations)
• Computation of local descriptors
• Determining correspondences
• Detect similar image parts (objects)
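The four steps above map directly onto a few OpenCV calls; a minimal sketch using SIFT and a ratio test (file names are placeholders):

```python
import cv2

img1 = cv2.imread("object.png", cv2.IMREAD_GRAYSCALE)   # model view
img2 = cv2.imread("scene.png", cv2.IMREAD_GRAYSCALE)    # search image

# 1. interest points + 2. local descriptors
sift = cv2.SIFT_create()
kp1, des1 = sift.detectAndCompute(img1, None)
kp2, des2 = sift.detectAndCompute(img2, None)

# 3. correspondences: two nearest neighbours per descriptor
matches = cv2.BFMatcher(cv2.NORM_L2).knnMatch(des1, des2, k=2)

# 4. keep distinctive matches (Lowe's ratio test) -> similar image parts
good = [m for m, n in matches if m.distance < 0.75 * n.distance]
```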
Extraction of Interest Points
• Corner detectors
  – Harris, Hessian
• Multi-scale corner detectors (with scale selection)
  – Scale-invariant Harris and Hessian corners
  – Difference of Gaussian (DoG) (Lowe)
• Affine covariant regions
  – Harris-Affine (Mikolajczyk and Schmid '02; Schaffalitzky and Zisserman '02)
  – Hessian-Affine (Mikolajczyk and Schmid '02)
  – Maximally stable extremal regions (MSER) (Matas et al. '02)
  – Intensity based regions (IBR) (Tuytelaars and Van Gool '00)
  – Edge based regions (EBR) (Tuytelaars and Van Gool '00)
  – Entropy-based regions (salient regions) (Kadir et al. '04)
Scale Invariant Harris Points
• Multi-scale extraction of Harris interest points
• Selection of points at characteristic scale in scale space
• Characteristic scale:
  – Maximum in scale space
  – Scale invariant
[Mikolajczyk 04]
Difference of Gaussian (DoG)
• Detect peaks in the difference-of-Gaussian pyramid
[Lowe 04]
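As an illustration of the DoG idea (not Lowe's full implementation with sub-pixel interpolation and edge rejection), a small sketch using SciPy Gaussian filters:

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def dog_peaks(img, sigmas=(1.0, 1.6, 2.56, 4.1), thresh=0.02):
    """Build a small difference-of-Gaussian stack and return pixels
    that are local extrema across space and scale."""
    blurred = np.stack([gaussian_filter(img.astype(float), s) for s in sigmas])
    dog = blurred[1:] - blurred[:-1]          # (n_scales-1, H, W)
    peaks = []
    for s in range(1, dog.shape[0] - 1):
        for y in range(1, img.shape[0] - 1):
            for x in range(1, img.shape[1] - 1):
                patch = dog[s-1:s+2, y-1:y+2, x-1:x+2]   # 3x3x3 neighbourhood
                v = dog[s, y, x]
                if abs(v) > thresh and (v == patch.max() or v == patch.min()):
                    peaks.append((x, y, s))
    return peaks
```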
Affine Covariant Regions
[Mikolajczyk 04]
Harris-Affine and Hessian-Affine (1)
[Mik05]
Harris-Affine and Hessian-Affine (2)
• Initialization with multi-scale interest points
• Iterative modification of location, scale and neighborhood
[Mik04]
Maximally Stable Extremal Regions (MSER)
[Mik05]
Maximally Stable Extremal Regions (MSER)
• Threshold image intensities: I > I0
• Extract connected components ("extremal regions")
• Find the threshold at which an extremal region is "maximally stable", i.e. a local minimum of the relative growth of its area
• Approximate a region with an ellipse
• Local Affine Frame
[Matas 02]
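OpenCV ships an MSER detector, so the steps above can be tried in a few lines (the file name is a placeholder and default detector parameters are assumed):

```python
import cv2

img = cv2.imread("scene.png", cv2.IMREAD_GRAYSCALE)

mser = cv2.MSER_create()                   # stability test over thresholds
regions, bboxes = mser.detectRegions(img)  # pixel lists of extremal regions

# Approximate each region with an ellipse, as on the slide
ellipses = [cv2.fitEllipse(r) for r in regions if len(r) >= 5]
```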
Computation of Local Descriptors
• Distinctive
• Robust
• Invariant to geometric & photometric transformation
• Descriptors
  – Sampled image patch
  – Gradient orientation histogram – SIFT (Lowe)
  – Shape context (Belongie et al. '02)
  – PCA-SIFT (Ke and Sukthankar '04)
  – Moment invariants (Van Gool '96)
  – Gaussian derivative-based (Koenderink '87, Freeman '91)
  – Complex filters (Baumberg '00, Schaffalitzky and Zisserman '02)
Gradient Orientation Histogram (SIFT – Scale Invariant Feature Transform)
• Thresholded image gradients are sampled over a 16x16 array of locations in scale space
• Create array of orientation histograms
• 8 orientations x 4x4 histogram array = 128 dimensions
[Lowe 04]
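A stripped-down NumPy sketch of the 4x4x8 histogram construction, omitting Lowe's Gaussian weighting, trilinear interpolation and rotation normalisation:

```python
import numpy as np

def sift_like_descriptor(patch):
    """Build a 4x4x8 = 128-D orientation histogram from a 16x16 patch."""
    gy, gx = np.gradient(patch.astype(float))
    mag = np.hypot(gx, gy)
    ori = np.mod(np.arctan2(gy, gx), 2 * np.pi)       # angles in [0, 2pi)
    bins = (ori / (2 * np.pi) * 8).astype(int) % 8    # 8 orientation bins
    desc = np.zeros((4, 4, 8))
    for y in range(16):
        for x in range(16):
            desc[y // 4, x // 4, bins[y, x]] += mag[y, x]
    desc = desc.ravel()
    desc /= np.linalg.norm(desc) + 1e-12              # illumination invariance
    desc = np.minimum(desc, 0.2)                      # threshold large gradients
    return desc / (np.linalg.norm(desc) + 1e-12)
```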
PCA-SIFT Local Descriptor
From Sukthankar 2004 [Ke04]
Interest points can be used for ...
• Object recognition
• Object recognition and segmentation
• Robot Localization
• Tracking
Planar Recognition
• Planar surfaces can be reliably recognized at rotations of up to 60° away from the camera
• Affine fit approximates perspective projection
• Only 3 points are needed for recognition
Cope with occlusion
[Lowe]
Recognition Under Occlusion
[Lowe]
Recognition and Segmentation
• Initialisation of object surface with dense features
• Iterative search for visible features using affine refinement of features
[Ferrari 04]
Robot Localization
[Se 05]
Tracking of Interest Points
Interest Point Tracking and Occlusion Reasoning
• Grouping KLT features based on motion
• Detect occlusion based on appearance and disappearance of interest points
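A minimal OpenCV sketch of KLT interest-point tracking; points whose status flag drops between frames are the candidates for occlusion reasoning (the file name is a placeholder):

```python
import cv2

cap = cv2.VideoCapture("sequence.avi")
ok, prev = cap.read()
prev_gray = cv2.cvtColor(prev, cv2.COLOR_BGR2GRAY)

# Shi-Tomasi corners as initial KLT features
pts = cv2.goodFeaturesToTrack(prev_gray, maxCorners=200,
                              qualityLevel=0.01, minDistance=7)

while True:
    ok, frame = cap.read()
    if not ok or pts is None or len(pts) == 0:
        break
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    new_pts, status, err = cv2.calcOpticalFlowPyrLK(prev_gray, gray, pts, None)
    # Points that fail to track may have become occluded
    pts = new_pts[status.ravel() == 1].reshape(-1, 1, 2)
    prev_gray = gray
```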
Approaches
• Model based
  – CAD model of object, environment; geometric features
• Appearance based
  – Enables easier learning of objects
  – Interest points or "whole" object
• Mixture: structure in data – Gestalt principles
  – Model physics of world and imaging process (rather than objects)
  – Features, perceptual grouping
Appearance-based Object Recognition
• Training with segmented images
• Representation in high-dimensional or reduced (Principal Component Analysis, PCA) space
• Separate objects linearly or non-linearly (kernel methods, SVM)
• Challenges
  – Illumination, scale, occlusion
[Bischof, Summerschool 2005]
PCA for visual recognition and pose estimation
[Bischof 02]
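The eigenspace approach of [Bischof 02] is not detailed here; below is a minimal sketch of the generic PCA recognition recipe (training matrix layout and the number of components k are assumptions):

```python
import numpy as np

def build_eigenspace(X, k=10):
    """X: rows are vectorised training images (many views per object).
    Returns the mean image and the first k principal axes."""
    mean = X.mean(axis=0)
    _, _, Vt = np.linalg.svd(X - mean, full_matrices=False)
    return mean, Vt[:k]

def project(img_vec, mean, axes):
    """Low-dimensional coefficients of one image in the eigenspace."""
    return axes @ (img_vec - mean)

# Recognition/pose estimation: compare a test image's coefficients with
# the stored coefficients of the training views (e.g. nearest neighbour
# along the object's view manifold in eigenspace).
```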
[Diagram: SVM separating hyperplane with normal vector w, margin, and hyperplanes H1 and H2]
Object Recognition using SVM
• Approximately 200 training images per object (RGB, different views, different lighting)
• Background training images
• Feature space with 3072 dimensions
• Iterative calculation of the separating surface between two classes of objects
[Zillich 01]
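For illustration, the same setup with today's tooling: scikit-learn's SVC on 3072-dimensional vectors (32x32 RGB images flattened, matching the dimensionality on the slide). The random data merely stands in for the real training views:

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.random((400, 3072))        # ~200 views each of two objects (stand-in)
y = np.repeat([0, 1], 200)         # object labels

clf = SVC(kernel="rbf")            # non-linear separating surface
clf.fit(X, y)                      # iterative optimisation of the margin
print(clf.predict(X[:5]))
```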
Histograms for Object Representation
[Diagram: the histogram of an image with an unknown object is compared by histogram intersection against a database of histograms of object models]
[Swain 90]
Tracking using Colour Histograms
• Simple approach
• Very fast (~30 fps)
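One common way to get fast colour-histogram tracking with OpenCV is backprojection plus mean shift; this sketch is a variant of the idea, not [Swain 90]'s histogram intersection itself (the file name and initial box are placeholders):

```python
import cv2

cap = cv2.VideoCapture("sequence.avi")
ok, frame = cap.read()
x, y, w, h = 200, 150, 60, 80                     # hypothetical initial box
hsv_roi = cv2.cvtColor(frame[y:y+h, x:x+w], cv2.COLOR_BGR2HSV)
hist = cv2.calcHist([hsv_roi], [0], None, [16], [0, 180])   # hue histogram
cv2.normalize(hist, hist, 0, 255, cv2.NORM_MINMAX)

term = (cv2.TERM_CRITERIA_EPS | cv2.TERM_CRITERIA_COUNT, 10, 1)
while True:
    ok, frame = cap.read()
    if not ok:
        break
    hsv = cv2.cvtColor(frame, cv2.COLOR_BGR2HSV)
    back = cv2.calcBackProject([hsv], [0], hist, [0, 180], 1)
    ok2, (x, y, w, h) = cv2.meanShift(back, (x, y, w, h), term)
```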
Summary
• Model-based
  + Robust
  – Difficult to model: how to extract high-level features for a wire-frame model?
• Appearance-based
  + Learning by showing is possible
  – Sensitive to illumination, viewpoint, pose
• Interest points are presently en vogue
Content
• Overview
• Tracking
  – Model-based tracking
  – Interest point tracking
  – Maximum tracking velocity
• Detection
  – Perceptual grouping
• Cognitive Vision
Velocity of Target
Video: Slow motion of target object.
Velocity of Target
Video: Fast motion of target object.
Maximum Target Velocity
• Maximum velocity of target in image:
  v = radius / latency  [pixel/sec]
• Calculation time ∝ #pixels = 4Cr² (tracking window of size 2r × 2r)
• C depends on image processing method
  – e.g., methods from the PETS Workshops, IEEE ICRA, ECCV, CVPR
[Diagram: image with target inside a tracking window of radius r]
Latency
• Sum of all times in control loop (T)
  – e.g., image acquisition, data transfer time, other latencies
• Plus time for image processing
[Diagram: control loop – the vision system delivers Δx, Δy to the controller, which outputs the control signal]
Maximum Tracking Velocity

v = radius / latency = r / (T + 4Cr²)  [pixel/sec]

⇒ Maximum where calculation time equals the sum of latencies: T = 4Cr², i.e. at r = (1/2)·√(T/C)
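Plugging hypothetical numbers into the formula shows the trade-off; C and T below are made-up values, not from the experiments:

```python
import math

C = 5e-6      # sec per pixel of image processing (assumed)
T = 0.040     # sec of fixed loop latency: acquisition, transfer, ... (assumed)

r_opt = 0.5 * math.sqrt(T / C)            # r = (1/2)*sqrt(T/C)
v_max = r_opt / (T + 4 * C * r_opt**2)    # equals r_opt / (2T) at the optimum
print(f"optimal window radius: {r_opt:.1f} px, "
      f"max tracking velocity: {v_max:.0f} px/s")
```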
Tessellation of Image with Fovea
[Sandini]
[IBIDEM retina]
Log-polar fovea / Image pyramid
Tracking Velocity
Video: Fast motion of target object
Maximum Tracking Velocity
⇒ exploit full view angle
Increasing size of Fovea
[Plot: tracking velocity [pixel/sec] over radius [pixel] for increasing fovea size]
Experiments
[Plot: tracking velocity [pixel/sec] over window radius [pixel]; image pyramid with 21-pixel fovea vs. constant-resolution window]
Summary – Obtaining High Tracking Velocity
• Cameras with fovea
• Presently: CCD, CMOS sensors
  – Adjust tracking window to latency of control loop
• Reduce latency or resolution (1:1)
• Faster computer, higher frame rate (2:√2)
• Results independent of controller
  – Imperfect controller only reduces height of peak
State of the Art