Post on 23-Jan-2016
SIFT
• Guest Lecture by Jiwon Kim
• http://www.cs.washington.edu/homes/jwkim/
SIFT Features and Its Applications
Autostitch Demo
Autostitch
• Fully automatic panorama generation
– Input: set of images
– Output: panorama(s)
• Uses SIFT (Scale-Invariant Feature Transform) to find/align images
1. Solve for homography
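The slides do not say how the homography is computed. As a minimal sketch, assuming exactly four point correspondences and fixing the last matrix entry to 1, the eight unknowns can be found from a linear system. The function names are hypothetical, and a real stitcher would wrap a solve like this in a robust estimator over many SIFT matches.

```python
def solve_linear(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [list(row) + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for r in reversed(range(n)):
        s = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - s) / m[r][r]
    return x


def homography_from_4(src, dst):
    """Return a 3x3 homography (h33 fixed to 1) mapping 4 src points to 4 dst points."""
    a, b = [], []
    for (x, y), (u, v) in zip(src, dst):
        # u = (h1 x + h2 y + h3) / (h7 x + h8 y + 1), and similarly for v
        a.append([x, y, 1, 0, 0, 0, -u * x, -u * y])
        b.append(u)
        a.append([0, 0, 0, x, y, 1, -v * x, -v * y])
        b.append(v)
    h = solve_linear(a, b)
    return [h[0:3], h[3:6], [h[6], h[7], 1.0]]


def apply_h(H, p):
    """Apply homography H to a 2D point p."""
    x, y = p
    w = H[2][0] * x + H[2][1] * y + H[2][2]
    return ((H[0][0] * x + H[0][1] * y + H[0][2]) / w,
            (H[1][0] * x + H[1][1] * y + H[1][2]) / w)
```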
2. Find connected sets of images
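One way to picture this step: treat each image as a node and each verified pairwise match as an edge, then group nodes into connected components. The sketch below uses union-find. The function name and the input format (pairs of image indices) are assumptions for illustration, not Autostitch's actual interface.

```python
def connected_sets(n_images, matches):
    """Group image indices into connected sets given pairwise matches (a, b)."""
    parent = list(range(n_images))

    def find(i):
        # Follow parent links to the root, compressing the path on the way.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for a, b in matches:
        parent[find(a)] = find(b)

    groups = {}
    for i in range(n_images):
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values())
```

Each returned group would then become one panorama.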
3. Solve for camera parameters
• New images initialised with rotation, focal length of best matching image
4. Blending the panorama
• Burt & Adelson 1983
– Blend frequency bands over range
Low frequency (λ > 2 pixels)
High frequency (λ < 2 pixels)
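A rough 1D sketch of the two-band idea, under stated assumptions: each signal is split into a low band (a box blur here, standing in for whatever low-pass filter the real system uses) and a high band (the residual). The low band is blended with smooth weights, while the high band is taken from whichever signal has the larger weight. The function names and the blur radius are hypothetical.

```python
def box_blur(signal, radius):
    """Simple low-pass filter: mean over a window, shrunk at the borders."""
    n = len(signal)
    out = []
    for i in range(n):
        lo, hi = max(0, i - radius), min(n, i + radius + 1)
        out.append(sum(signal[lo:hi]) / (hi - lo))
    return out


def two_band_blend(a, b, weight_a, radius=2):
    """Blend signals a and b; weight_a[i] in [0, 1] is the smooth weight of a."""
    low_a, low_b = box_blur(a, radius), box_blur(b, radius)
    out = []
    for i in range(len(a)):
        high_a, high_b = a[i] - low_a[i], b[i] - low_b[i]
        w = weight_a[i]
        low = w * low_a[i] + (1.0 - w) * low_b[i]   # smooth blend of low band
        high = high_a if w >= 0.5 else high_b       # hard switch for high band
        out.append(low + high)
    return out
```

Blending the low band smoothly hides exposure differences, and switching the high band sharply avoids ghosting of fine detail.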
2-band Blending
Linear Blending
2-band Blending
So, what is SIFT?
• Scale-Invariant Feature Transform
• David Lowe at UBC
• Scale/rotation invariant
• Currently best known feature descriptor
• Many real-world applications
– Object recognition
– Panorama stitching
– Robot localization
– Video indexing
– …
Example: object recognition
SIFT properties
• Locality: features are local, so robust to occlusion and clutter
• Distinctiveness: individual features can be matched to a large database of objects
• Quantity: many features can be generated for even small objects
• Efficiency: close to real-time performance
SIFT algorithm overview
1. Feature detection
– Detect points that can be repeatably selected under location/scale change
2. Feature description
– Assign orientation to detected feature points
– Construct a descriptor for image patch around each feature point
3. Feature matching
1. Feature detection
• Detect points stable under location/scale change
– Build continuous space (x, y, scale)
– Approximated by multi-scale Difference-of-Gaussian pyramid
– Select maxima/minima in (x, y, scale)
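The detection step above can be sketched in pure Python as follows. This is a simplified illustration, not Lowe's implementation: it uses a single octave with no downsampling. The sigma values, the `min_abs` contrast cutoff and the function names are assumptions. A point is kept if its Difference-of-Gaussian value is larger or smaller than all 26 neighbors in (x, y, scale).

```python
import math


def gaussian_kernel(sigma):
    """Normalized 1D Gaussian kernel truncated at about 3 sigma."""
    radius = int(3 * sigma + 0.5)
    k = [math.exp(-(i * i) / (2 * sigma * sigma)) for i in range(-radius, radius + 1)]
    total = sum(k)
    return [v / total for v in k]


def blur(img, sigma):
    """Separable Gaussian blur with clamped borders."""
    k = gaussian_kernel(sigma)
    r = len(k) // 2
    h, w = len(img), len(img[0])
    tmp = [[sum(k[j + r] * img[y][min(max(x + j, 0), w - 1)] for j in range(-r, r + 1))
            for x in range(w)] for y in range(h)]
    return [[sum(k[j + r] * tmp[min(max(y + j, 0), h - 1)][x] for j in range(-r, r + 1))
             for x in range(w)] for y in range(h)]


def dog_extrema(img, sigmas=(1.0, 1.6, 2.56, 4.096), min_abs=0.03):
    """Return (x, y, layer) for local extrema of the DoG stack."""
    h, w = len(img), len(img[0])
    blurred = [blur(img, s) for s in sigmas]
    # Each DoG layer is the difference of two adjacent blur levels.
    dog = [[[b[y][x] - a[y][x] for x in range(w)] for y in range(h)]
           for a, b in zip(blurred, blurred[1:])]
    found = []
    for s in range(1, len(dog) - 1):
        for y in range(1, h - 1):
            for x in range(1, w - 1):
                v = dog[s][y][x]
                if abs(v) < min_abs:
                    continue  # skip near-zero responses
                nb = [dog[s + ds][y + dy][x + dx]
                      for ds in (-1, 0, 1) for dy in (-1, 0, 1) for dx in (-1, 0, 1)
                      if (ds, dy, dx) != (0, 0, 0)]
                if v > max(nb) or v < min(nb):
                    found.append((x, y, s))
    return found
```

For example, a single bright Gaussian blob on a dark background should produce an extremum at the blob center, in the layer whose scale best matches the blob size.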
1. Feature detection
• Localize extrema by fitting a quadratic
1) Sub-pixel/sub-scale interpolation using Taylor expansion
2) Take derivative and set to zero
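A 1D illustration of the quadratic fit: given three samples around a discrete extremum, the second-order Taylor expansion is differentiated and set to zero, which gives the offset -D'/D''. The full method does the same in (x, y, scale) with a 3x3 Hessian. The function name and the 1D simplification are assumptions made to keep the sketch short.

```python
def refine_extremum_1d(f_prev, f_0, f_next):
    """Fit a parabola through samples at -1, 0, +1; return (offset, value at extremum)."""
    d1 = (f_next - f_prev) / 2.0          # first derivative (central difference)
    d2 = f_next - 2.0 * f_0 + f_prev      # second derivative
    if d2 == 0:
        return 0.0, f_0                   # flat: no refinement possible
    offset = -d1 / d2                     # zero of the derivative of the Taylor expansion
    value = f_0 + 0.5 * d1 * offset       # interpolated function value
    return offset, value
```

The interpolated value is the quantity compared against a contrast threshold in the next step.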
1. Feature detection
• Discard low-contrast/edge points
1) Low contrast: discard keypoints with $|D(\hat{x})|$ < threshold
2) Edge points: high contrast in one direction, low in the other. Compute principal curvatures from the eigenvalues of the 2x2 Hessian matrix, and limit their ratio
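The edge test can be sketched without computing eigenvalues explicitly, because the ratio of principal curvatures is bounded by the ratio of the squared trace to the determinant of the Hessian. The limit r = 10 is the value suggested in Lowe's 2004 paper, not something stated on the slide. The function name and the finite-difference layout are assumptions.

```python
def passes_edge_test(d, x, y, r=10.0):
    """Return True if the DoG response d at (x, y) is not edge-like."""
    dxx = d[y][x + 1] - 2.0 * d[y][x] + d[y][x - 1]
    dyy = d[y + 1][x] - 2.0 * d[y][x] + d[y - 1][x]
    dxy = (d[y + 1][x + 1] - d[y + 1][x - 1]
           - d[y - 1][x + 1] + d[y - 1][x - 1]) / 4.0
    trace = dxx + dyy
    det = dxx * dyy - dxy * dxy
    if det <= 0:
        return False  # curvatures of opposite sign: not a stable extremum
    # trace^2 / det grows as the two principal curvatures become more unequal
    return trace * trace / det < (r + 1.0) ** 2 / r
```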
1. Feature detection
• Example
(a) 233x189 image
(b) 832 DOG extrema
(c) 729 left after peak value threshold
(d) 536 left after testing ratio of principal curvatures
2. Feature description
• Assign orientation to keypoints
– Create histogram of local gradient directions computed at selected scale
– Assign canonical orientation at peak of smoothed histogram
[Figure: orientation histogram over 0 to 2π]
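A minimal sketch of orientation assignment, assuming a small grayscale patch already sampled at the keypoint's scale. Gradient directions are accumulated into a magnitude-weighted histogram, the histogram is smoothed circularly, and the peak bin gives the canonical orientation. The 36-bin count follows Lowe's paper. The smoothing weights and function name are assumptions, and the Gaussian window and peak interpolation of the full method are left out.

```python
import math


def dominant_orientation(patch, bins=36):
    """Return the dominant gradient direction of a patch, in degrees."""
    h, w = len(patch), len(patch[0])
    width = 360.0 / bins
    hist = [0.0] * bins
    for y in range(1, h - 1):
        for x in range(1, w - 1):
            gx = patch[y][x + 1] - patch[y][x - 1]
            gy = patch[y + 1][x] - patch[y - 1][x]
            mag = math.hypot(gx, gy)
            if mag == 0:
                continue
            angle = math.degrees(math.atan2(gy, gx)) % 360.0
            # Bins are centered on multiples of the bin width.
            hist[int(round(angle / width)) % bins] += mag
    # Circular smoothing with weights 1/4, 1/2, 1/4.
    smooth = [0.25 * hist[i - 1] + 0.5 * hist[i] + 0.25 * hist[(i + 1) % bins]
              for i in range(bins)]
    peak = max(range(bins), key=lambda i: smooth[i])
    return peak * width
```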
2. Feature description
• Construct SIFT descriptor
– Create array of orientation histograms
– 8 orientations x 4x4 histogram array = 128 dimensions
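To make the 8 x 4x4 = 128 layout concrete, here is a simplified sketch. It assumes an 18x18 grayscale patch already rotated to the keypoint's canonical orientation, so gradients can be taken on the inner 16x16 pixels. Lowe's descriptor also applies Gaussian weighting, interpolation between bins, and clamping of large values, all omitted here. The function name is hypothetical.

```python
import math


def sift_like_descriptor(patch):
    """Build a 128-D descriptor: 4x4 cells, each an 8-bin orientation histogram."""
    desc = [0.0] * 128
    for y in range(1, 17):
        for x in range(1, 17):
            gx = patch[y][x + 1] - patch[y][x - 1]
            gy = patch[y + 1][x] - patch[y - 1][x]
            mag = math.hypot(gx, gy)
            angle = math.degrees(math.atan2(gy, gx)) % 360.0
            o = int(angle // 45.0) % 8                   # one of 8 orientation bins
            cell = ((y - 1) // 4) * 4 + (x - 1) // 4     # one of 4x4 spatial cells
            desc[cell * 8 + o] += mag
    norm = math.sqrt(sum(v * v for v in desc))
    # Unit-length normalization reduces sensitivity to contrast changes.
    return [v / norm for v in desc] if norm > 0 else desc
```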
2. Feature description
• Advantage over simple correlation
– Gradients less sensitive to illumination change
– Gradients may shift: robust to deformation, viewpoint change
Performance: stability to noise
• Match features after random change in image scale & orientation, with differing levels of image noise
• Find nearest neighbor in database of 30,000 features
Performance: stability to affine change
• Match features after random change in image scale & orientation, with 2% image noise, and affine distortion
• Find nearest neighbor in database of 30,000 features
Performance: distinctiveness
• Vary size of database of features, with 30 degree affine change, 2% image noise
• Measure % correct for single nearest neighbor match
3. Feature matching
• For each feature in A, find nearest neighbor in B
A B
3. Feature matching
• Nearest neighbor search too slow for large database of 128-dimensional data
• Approximate nearest neighbor search:
– Best-bin-first [Beis et al. 97]: modification to k-d tree algorithm
– Use heap data structure to identify bins in order by their distance from query point
• Result: Can give speedup by factor of 1000 while finding nearest neighbor (of interest) 95% of the time
3. Feature matching
• Reject false matches
– Compare distance of nearest neighbor to second nearest neighbor
– Common features aren’t distinctive, therefore bad
– Threshold of 0.8 provides excellent separation
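The ratio test can be sketched with a brute-force search. This ignores the approximate nearest-neighbor speedup from the previous slide, and the function name is hypothetical. A match is kept only if the nearest neighbor is clearly closer than the second nearest, using the 0.8 threshold from the slide.

```python
import math


def ratio_test_matches(desc_a, desc_b, ratio=0.8):
    """Return (index_in_a, index_in_b) pairs that pass the distance ratio test."""
    matches = []
    for i, a in enumerate(desc_a):
        dists = sorted(
            (math.sqrt(sum((p - q) ** 2 for p, q in zip(a, b))), j)
            for j, b in enumerate(desc_b)
        )
        if len(dists) < 2:
            continue  # the test needs a second-nearest neighbor
        (d1, j1), (d2, _) = dists[0], dists[1]
        if d1 < ratio * d2:
            matches.append((i, j1))
    return matches
```

A feature with two almost equally close candidates is ambiguous, so it is dropped rather than matched.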
3. Feature matching
• Now, given feature matches…
– Find an object in the scene
– Solve for homography (panorama)
– …
3. Feature matching
• Example: 3D object recognition
3. Feature matching
• 3D object recognition
– Assume affine transform: clusters of size >=3
– Looking for 3 matches out of 3000 that agree on same object and pose: too many outliers for RANSAC or LMS
– Use Hough Transform
• Each match votes for a hypothesis for object ID/pose
• Voting for multiple bins & large bin size allow for error due to similarity approximation
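A simplified sketch of the voting step, assuming each match has already been turned into a pose hypothesis (object ID, orientation difference in degrees, scale ratio, translation). Unlike the slide, each match votes for a single bin only, to keep the code short. The 30-degree and factor-of-2 bin sizes follow Lowe's paper. The `loc_bin` parameter, the tuple layout and the function name are assumptions.

```python
import math
from collections import defaultdict


def hough_pose_clusters(matches, loc_bin=32.0, min_votes=3):
    """Group pose hypotheses into coarse bins; keep bins with enough votes.

    Each match is (object_id, d_theta_deg, scale_ratio, tx, ty).
    """
    bins = defaultdict(list)
    for idx, (obj, d_theta, scale, tx, ty) in enumerate(matches):
        key = (
            obj,
            int((d_theta % 360.0) // 30.0),     # 30-degree orientation bins
            math.floor(math.log2(scale)),       # factor-of-2 scale bins
            math.floor(tx / loc_bin),           # coarse location bins
            math.floor(ty / loc_bin),
        )
        bins[key].append(idx)
    return {k: v for k, v in bins.items() if len(v) >= min_votes}
```

Bins with at least three votes are the candidate clusters passed on to the pose solution.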
3. Feature matching
• 3D object recognition: solve for pose
– Affine transform of [x,y] to [u,v]:
– Rewrite to solve for transform parameters:
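The equations on this slide did not survive extraction. As a sketch of the usual formulation (following Lowe's 2004 paper), the affine map is u = m1 x + m2 y + tx and v = m3 x + m4 y + ty. Each match contributes two rows to a linear system in the unknowns (m1, m2, m3, m4, tx, ty), which is solved by least squares. The code below solves the normal equations in pure Python. The function names are hypothetical.

```python
def solve_linear(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [list(row) + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for r in reversed(range(n)):
        s = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - s) / m[r][r]
    return x


def fit_affine(model_pts, image_pts):
    """Least-squares affine fit; returns [m1, m2, m3, m4, tx, ty]."""
    rows, rhs = [], []
    for (x, y), (u, v) in zip(model_pts, image_pts):
        rows.append([x, y, 0.0, 0.0, 1.0, 0.0])   # row for u
        rhs.append(u)
        rows.append([0.0, 0.0, x, y, 0.0, 1.0])   # row for v
        rhs.append(v)
    # Normal equations: (A^T A) p = A^T b
    ata = [[sum(r[i] * r[j] for r in rows) for j in range(6)] for i in range(6)]
    atb = [sum(r[i] * b for r, b in zip(rows, rhs)) for i in range(6)]
    return solve_linear(ata, atb)
```

At least three non-collinear matches are needed, which is consistent with the cluster size of 3 mentioned earlier.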
3. Feature matching
• 3D object recognition: verify model
1) Discard outliers for pose solution in previous step
2) Perform top-down check for additional features
3) Evaluate probability that match is correct
a) Use Bayesian model, with probability that features would arise by chance if object was not present
b) Takes account of object size in image, textured regions, model feature count in database, accuracy of fit [Lowe 01]
Planar recognition
• Training images
Planar recognition
• Reliably recognized at a rotation of 60° away from the camera
• Affine fit approximates perspective projection
• Only 3 points are needed for recognition
3D object recognition
• Training images
3D object recognition
• Only 3 keys are needed for recognition, so extra keys provide robustness
• Affine model is no longer as accurate
Recognition under occlusion
Illumination invariance
Applications of SIFT
• Object recognition
• Panoramic image stitching
• Robot localization
• Video indexing
• …
• The Office of the Past
– Document tracking and recognition
Location recognition
Robot Localization
Map continuously built over time
Locations of map features in 3D
Sony Aibo
SIFT usage:
Recognize charging station
Communicate with visual cards
Teach object recognition
The Office of the Past
• Paper everywhere
Unify physical and electronic desktops
• Recognize video of paper on physical desktop
– Tracking
– Recognition
– Linking
[Diagram: video camera, desktop]
Unify physical and electronic desktops
• Applications
– Find lost documents
– Browse remote desktop
– Find electronic version
– History-based queries
[Diagram: video camera, desktop]
Example input video
Demo – Remote desktop
System overview
[Diagram: video camera, desk, user, computer]
System overview
[Diagram, built up over several slides]
• Inputs: video of desk, images from PDF
• Track & recognize
• Internal representation: scene graph of the desk at times T and T+1
• Query: "Where is my W-2?", answered from the internal representation
Assumptions
• Document
– Corresponding electronic copy exists
– No duplicates of same document
• Motion
– 3 event types: move/entry/exit
– One document at a time
– Only topmost document can move
Non-assumptions
• Desk need not be initially empty
• Stacks may overlap
Algorithm overview
[Diagram, built up over several slides; SIFT is labeled on the final build]
• Input frames: …, before, after, …
• Event detection
• Event interpretation: “A document moved from (x1,y1) to (x2,y2)”
• Document recognition: File1.pdf, File2.pdf, File3.pdf
• Scene graph update
Document tracking example
[Image sequence: before/after frame pairs; recovered motion: (x,y,θ)]
Document Recognition
• Match against PDF image database
[Figure: File1.pdf, File2.pdf, File3.pdf, File4.pdf, File5.pdf, File6.pdf, …]
Document Recognition
• Performance analysis
– Tested 20 pages against database of 162 pages
– ~200x300 pixels per document for reliable match
[Plot: recognition rate vs. document resolution; marked values 300 and 0.9]
Results
• Input video
– ~40 minutes
– 1024x768 @ 15 fps
– 22 documents, 49 events
• Running time
– Video processed offline
– No optimization
– A few hours for entire video
Demo – Paper tracking
Photo sorting example
Demo – Photo sorting
Future work
• Enhance realism
– Handle more realistic desktops
– Real-time performance
• More applications
– Support other document tasks
• E.g., attach reminder, cluster documents
– Beyond documents
• Other 3D desktop objects, books/CDs
Summary
• SIFT is:
– Scale/rotation invariant local feature
– Highly distinctive
– Robust to occlusion, illumination change, 3D viewpoint change
– Efficient (real-time performance)
– Suitable for many useful applications
References
• Distinctive image features from scale-invariant keypoints
– David G. Lowe, International Journal of Computer Vision, 60, 2 (2004), pp. 91-110.
• Recognising panoramas
– Matthew Brown and David G. Lowe, International Conference on Computer Vision (ICCV 2003), Nice, France (October 2003), pp. 1218-25.
• Video-Based Document Tracking: Unifying Your Physical and Electronic Desktops
– Jiwon Kim, Steven M. Seitz and Maneesh Agrawala, ACM Symposium on User Interface Software and Technology (UIST 2004), pp. 99-107.