6. Location and context
What makes a cow a cow?
Google knows because other people know
We think we know: “because it has four legs.” But the fact of the matter is: not all cows show four legs, nor are they all brown …
How do you know?
What is the object in the middle?
No segmentation is given … not even the pixel values of the object …
Where is evidence for an object?
Uijlings IJCV 2011
What is the visual extent of an object?
Uijlings IJCV 2012
Where: exhaustive search
Look everywhere for the object window. This imposes computational constraints on:
• the number of locations and windows (coarse grid, fixed aspect ratio);
• the evaluation cost per location (weak features/classifiers).
Impressive, but takes long.
Viola IJCV 2004; Dalal CVPR 2005; Felzenszwalb PAMI 2010; Vedaldi ICCV 2009
Where: the need for a hierarchy
An image is intrinsically hierarchical.
Gu CVPR 2009
Selective search
Van de Sande ICCV 2011
Windows are formed by hierarchical grouping: adjacent regions are grouped on color/texture/shape cues, starting from an initial segmentation (Felzenszwalb 2004).
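A minimal sketch of generating such window proposals, using OpenCV's contrib implementation of selective search rather than the authors' original code (assumes opencv-contrib-python is installed; "cat.jpg" is a hypothetical input image):

```python
import cv2

img = cv2.imread("cat.jpg")  # hypothetical input image

# Hierarchical grouping of an initial oversegmentation into window proposals.
ss = cv2.ximgproc.segmentation.createSelectiveSearchSegmentation()
ss.setBaseImage(img)
ss.switchToSelectiveSearchFast()   # grouping on color/texture/size/fill cues
boxes = ss.process()               # array of (x, y, w, h) candidate windows
print(len(boxes), "candidate windows")
```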
Selective search example
Average best overlap ~88%
[figure: high-recall candidate windows, e.g. on a cat image]
Pairs of concepts
Uijlings ICCV demo 2012
6. Conclusion
Selective search gives good localization. Localization is needed to understand pairs of concepts.
7. Data and metadata
http://bit.ly/visualsearchengines
How many concepts?
Slide credit: Li Fei-Fei. Biederman, Psychological Review, 1987
How many examples?
Once you get beyond 100–1000 examples, success follows.
Russell IJCV 2008
LabelMe 290,000 object annotations
Amateur labeling
Tag relevance by social annotation
Xirong Li, TMM 2009
Consistency in tagging between users on similar images.
Pretty good for “snow”, not so good for “rainbow”.
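A minimal sketch of the neighbor-voting idea behind this (my reading of the slides, not Li's exact estimator): a tag is relevant to an image if its visual neighbors use it more often than chance.

```python
def tag_relevance(tag, neighbor_tag_sets, all_tag_sets):
    """Neighbor voting, sketched: votes for `tag` among the k visual
    neighbors, minus the votes expected from the tag's global frequency."""
    k = len(neighbor_tag_sets)
    votes = sum(tag in tags for tags in neighbor_tag_sets)
    prior = sum(tag in tags for tags in all_tag_sets) / len(all_tag_sets)
    return votes - k * prior   # > 0: tagged above chance, likely relevant

# Hypothetical toy example: 3 of 5 neighbors say "snow", 10% of all images do.
neighbors = [{"snow", "ski"}, {"snow"}, {"tree"}, {"snow", "mountain"}, {"sky"}]
corpus = [{"snow"}] * 10 + [{"city"}] * 90
print(tag_relevance("snow", neighbors, corpus))   # 3 - 5*0.1 = 2.5
```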
Social negative bootstrapping
Xirong Li ACM MM 2009; Xirong Li ICMR 2011
Negative images are as important as positive images for learning, and not just random negatives but close ones.
• We want to learn positive examples from an expert, and can obtain as many negative samples as we like for free from the web.
• We iteratively aim for the hardest negatives.
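A minimal sketch of the iterative hard-negative loop (assumed reading of the slides; LinearSVC stands in for whatever classifier was actually used):

```python
import numpy as np
from sklearn.svm import LinearSVC

def negative_bootstrap(X_pos, X_web, rounds=5, batch=100):
    """X_pos: expert-labeled positive features; X_web: free web features,
    treated as negatives. Returns the final classifier."""
    rng = np.random.default_rng(0)
    # Start from a random sample of web images as negatives.
    neg_idx = rng.choice(len(X_web), size=batch, replace=False)
    for _ in range(rounds):
        X = np.vstack([X_pos, X_web[neg_idx]])
        y = np.r_[np.ones(len(X_pos)), -np.ones(len(neg_idx))]
        clf = LinearSVC().fit(X, y)
        # "Hardest" negatives: web images the current model scores most positive.
        scores = clf.decision_function(X_web)
        hardest = np.argsort(-scores)[:batch]
        neg_idx = np.union1d(neg_idx, hardest)
    return clf
```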
Knowledge ontology: ImageNet
Acknowledgement: WordNet friends
Christiane Fellbaum (Princeton), Dan Osherson (Princeton), Kai Li (Princeton), Alex Berg (Columbia), Jia Deng (Princeton/Stanford), Hao Su (Stanford)
PASCAL VOC
The PASCAL Visual Object Classes (VOC) challenge. 500,000 images downloaded from Flickr with queries like “car”, “vehicle”, “street”, “downtown”. 10,000 objects, 25,000 labels. Organizers: Mark Everingham, Luc Van Gool, Chris Williams, John Winn, Andrew Zisserman
7. Conclusion
Data is king. The data are beginning to reflect human cognitive capacity [at a basic level]. Harvesting social data requires advanced computer vision for quality control.
8. Performance
PASCAL 2010
[figure: example images for aeroplane, bicycle, bird, boat, bottle, bus, car, cat, chair, cow]
[figures: true positives and false positives for “person” from the UOCTTI_LSVM_MDPM, NLPR_HOGLBP_MC_LCEGCHLC, and NUS_HOGLBP_CTX_CLS_RESCORE_V2 detectors]
Non-birds & non-boats
[figures: highest-ranked non-bird images; highest-ranked non-boat images]
Water texture and scene composition?
[figure: highest-ranked non-chair images]
[figures: true positives and false positives for “motorbike” from the MITUCLA_HIERARCHY, NLPR_HOGLBP_MC_LCEGCHLC, and NUS_HOGLBP_CTX_CLS_RESCORE_V2 detectors]
Object localization 2008-2010
Results on the 2008 data improve with the 2010 methods for all categories, by over 100% for some categories.
[chart: max AP (%) per category (aeroplane, bicycle, bird, boat, bottle, bus, car, cat, chair, cow, dining table, dog, horse, motorbike, person, potted plant, sheep, sofa, train, tv monitor) for 2008, 2009, and 2010; y-axis 0–60]
TRECVID evaluation standard
Concept detection
Aircraft
Beach
Mountain
People marching
Police/Security
Flower
Measuring performance
• Precision = |relevant ∩ retrieved| / |retrieved items|
• Recall = |relevant ∩ retrieved| / |relevant items|
Precision and recall stand in an inverse relationship.
[figure: ranked result list, positions 1–5]
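Detection results are typically summarized by average precision over a ranked list. A minimal sketch in the usual ranked-retrieval form (not necessarily TRECVID's exact protocol):

```python
def average_precision(ranking, n_relevant):
    """ranking: booleans in rank order, True = relevant item retrieved.
    n_relevant: total number of relevant items in the collection."""
    hits, total = 0, 0.0
    for rank, is_rel in enumerate(ranking, start=1):
        if is_rel:
            hits += 1
            total += hits / rank      # precision at each recall point
    return total / n_relevant if n_relevant else 0.0

# Toy example: relevant items at ranks 1, 3, and 5; 4 relevant items overall.
print(average_precision([True, False, True, False, True], 4))  # ~0.567
```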
Results
UvA-MediaMill@TRECVID
[chart: UvA-MediaMill versus other systems; Snoek et al., TRECVID 2004–2010]
Performance doubled in just 3 years
• 36 concept detectors
Snoek & Smeulders, IEEE Computer 2010
Even when using training data of a different origin there is great progress, but the number of concepts is still limited.
8. Conclusion
Impressive results, improving quickly year on year. A very valuable competition. The best-ranked non-class images start to make sense!
9. Speed
SURF based on integral images
Introduced by Viola & Jones in the context of face detection: windows slide left to right and top to bottom over integral images.
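A minimal sketch of the integral-image trick (standard technique, not code from the talk): once the integral image is built, any box sum costs four lookups regardless of box size.

```python
import numpy as np

def integral_image(img):
    """ii[y, x] = sum of img[0..y, 0..x] (inclusive)."""
    return img.cumsum(axis=0).cumsum(axis=1)

def box_sum(ii, y0, x0, y1, x1):
    """Sum of the rectangle [y0..y1] x [x0..x1] in O(1): four lookups."""
    total = ii[y1, x1]
    if y0 > 0:
        total -= ii[y0 - 1, x1]
    if x0 > 0:
        total -= ii[y1, x0 - 1]
    if y0 > 0 and x0 > 0:
        total += ii[y0 - 1, x0 - 1]
    return total

img = np.arange(16.0).reshape(4, 4)
ii = integral_image(img)
print(box_sum(ii, 1, 1, 2, 2), img[1:3, 1:3].sum())   # both 30.0
```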
SURF principle
Approximate Gaussian second-order derivatives Lxx, Lyy, Lxy with box filters:
[figure: box-filter approximations of Lyy and Lxy]
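For instance, a sketch of a box-filter Lyy response (SURF-style 9x9 filter, normalization omitted; the box sizes follow the common description of SURF, so treat the details as assumptions):

```python
import numpy as np

def dyy_response(img, y, x):
    """Approximate Lyy at (y, x): three stacked 3x5 boxes, weights +1, -2, +1."""
    # Zero-padded integral image so ii[y+1, x+1] = sum of img[0..y, 0..x].
    ii = np.pad(img, ((1, 0), (1, 0))).cumsum(axis=0).cumsum(axis=1)
    def box(y0, x0, y1, x1):   # inclusive pixel coords, O(1) lookups
        return ii[y1 + 1, x1 + 1] - ii[y0, x1 + 1] - ii[y1 + 1, x0] + ii[y0, x0]
    top = box(y - 4, x - 2, y - 2, x + 2)
    mid = box(y - 1, x - 2, y + 1, x + 2)
    bot = box(y + 2, x - 2, y + 4, x + 2)
    return top - 2 * mid + bot

print(dyy_response(np.random.rand(20, 20), 10, 10))
```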
SURF speed
Computation time: 6 times faster than DoG (~100 msec), independent of filter scale.
[chart: computation time versus scale]
Dense descriptor extraction
From pixel-wise responses to the final descriptor.
A factor 16 speed improvement; another factor 2 by the use of matrix libraries.
Projection: Random Forest
Binary decision trees (Moosmann et al. 2008)
[figure: ensemble of binary decision trees]
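A minimal sketch of tree-based projection in the spirit of Moosmann et al., using scikit-learn's RandomTreesEmbedding as a stand-in (an unsupervised variant; parameters are illustrative): each descriptor is routed to a leaf in every tree, and the leaf indices act as visual words.

```python
import numpy as np
from sklearn.ensemble import RandomTreesEmbedding

descriptors = np.random.rand(10000, 64)            # e.g. 64-D SURF descriptors
forest = RandomTreesEmbedding(n_estimators=4, max_depth=10, random_state=0)
codes = forest.fit_transform(descriptors)          # sparse one-hot leaf indicators
bow = np.asarray(codes.sum(axis=0)).ravel()        # bag-of-words histogram
print(bow.shape)                                   # one bin per leaf across trees
```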
Real-time bag of words
Pipeline: descriptor extraction (D-SURF, 2x2) → projection (Random Forest) → classification (SVM with RBF kernel), taking 15, 10, and 13 milliseconds respectively.
MAP: 0.370
Total computation time is 38 milliseconds per image: 26 frames per second on a normal PC for any 20 concepts.
9. Conclusion
SURF is scale and rotation invariant, and fast due to the use of integral images. Download: http://www.vision.ee.ethz.ch/~surf/
D-SURF extraction is 6x faster than dense SIFT. Projection using a Random Forest is 50x faster than nearest-neighbor assignment.
Internet Video Search: the beginning
[diagram: video, measuring features, lexicon learning, concept detection, browsing, telling stories]