Effective Dataset Construction in Computer Vision
Kota Yamaguchi
Recent progress in image recognition
[Figure: ILSVRC classification error, 2010-2014 [Russakovsky 2014]: NEC 28.2% (2010), XRCE 25.8% (2011), SuperVision 16.4% (2012), Clarifai 11.7% (2013), GoogLeNet 6.7% (2014), Ioffe et al. (arXiv) 4.9%; human error is 5.1%. The ILSVRC image classification task: label an image, e.g., of a steel drum, among candidates such as scale, T-shirt, steel drum, drumstick, mud turtle.]
Deep models and ...
Fig. 1 The best reported performance on PASCAL VOC challenge has shown marked increases since 2006 (top). This could be due to various factors: the dataset itself has evolved over time, the best-performing methods differ across years, etc. In the bottom row, we plot a particular factor, training data size, which appears to correlate well with performance. This begs the question: has the increase been largely driven from the availability of larger training sets?

Fig. 2 We plot idealized curves of performance versus training dataset size and model complexity. The effect of additional training examples is diminished as the training dataset grows (left), while we expect performance to grow with model complexity up to a point, after which an overly-flexible model overfits the training dataset (right). Both these notions can be made precise with learning theory bounds, see e.g. (McAllester 1999).
1.1 Challenges
We found there is a surprising amount of subtlety in scaling up training data sets in current systems. For a fixed model, one would expect performance to generally increase with the amount of data and eventually saturate (Fig. 2). Empirically, we often saw the bizarre result that off-the-shelf implementations show decreased performance with additional data! One would also expect that to take advantage of additional training data, it is necessary to grow the model complexity, in this case by adding mixture components to capture different object sub-categories and viewpoints. However, even with non-parametric models that grow with the amount of training data, we quickly encountered diminishing returns in performance with only modest amounts of training data.
We show that the apparent performance ceiling is not a consequence of HOG+linear classifiers. We provide an analysis of the popular deformable part model (DPM), showing that it can be viewed as an efficient way to implicitly encode and score an exponentially-large set of rigid mixture components with shared parameters. With the appropriate sharing, DPMs produce substantial performance gains over standard non-parametric mixture models. However, DPMs have fixed complexity and still saturate in performance with current amounts of training data, even when scaled to mixtures of DPMs. This difficulty is further exacerbated by the computational demands of non-parametric mixture models, which can be impractical for many applications.
1.2 Proposed Solutions
In this paper, we offer explanations and solutions for many of these difficulties. First, we found it crucial to set model regularization as a function of training dataset size using cross-validation, a standard technique which is often overlooked in current object detection systems. Second, existing strategies for discovering sub-category structure, such as clustering aspect ratios (Felzenszwalb et al. 2010), appearance features (Divvala et al. 2012), and keypoint labels (Bourdev and Malik 2009) may not suffice. We found this was related to the inability of classifiers to deal with "polluted" data when mixture labels were improperly assigned. Increasing model complexity is thus only useful when mixture components capture the "right" sub-category structure.

To efficiently take advantage of additional training data, we introduce a non-parametric extension of a DPM which we call an exemplar deformable part model (EDPM). Notably, EDPMs increase the expressive power of DPMs with only a negligible increase in computation, making them practically useful. We provide evidence that suggests that compositional representations of mixture templates provide an effective way to help target the "long-tail" of object appearances by sharing local part appearance parameters across templates.

Extrapolating beyond our experiments, we see the striking difference between classic mixture models and the non-parametric compositional model (both mixtures of linear classifiers operating on the same feature space) as evidence that the greatest gains in the near future will not be had with simple models + bigger data, but rather through improved representations and learning algorithms.

We introduce our large-scale dataset in Sect. 2, describe our non-parametric mixture models in Sect. 3, present extensive experimental results in Sect. 4, and conclude with a discussion in Sect. 5 including related work.
X Zhu et al. Do We Need More Training Data? IJCV 2015
PASCAL VOC best performance
Data improvement? Model improvement?
We need both
Data drives statistical models
[Diagram: training data → model (training); testing data → model → results (testing); e.g., a CNN]
Object recognition datasets
[Timeline, 2004-2014, building on WordNet (1985-): Caltech 101, MSRC, ESP Game, UIUC, LabelMe, PASCAL VOC, Caltech 256, TinyImage, ImageNet, SUN, Stanford Background, SBU1M, MS COCO, YFCC 100M]
ImageNet [Deng 2009]
• WordNet hierarchy
• 14M images, 21K synsets as of Apr 2015
• Used in ILSVRC

Microsoft COCO [Lin 2014]
• Over 300K images, 2M instances
• Creative Commons
• Segmentation
• 5 captions / image
Quality vs. scale
[Chart: quality versus scale, from 100 to 10B images. In-house datasets are small but high quality; crowd-sourced datasets (SUN, Caltech 101, ImageNet, MS COCO) occupy the 10K-1M range; raw, user-generated data (SBU1M, YFCC100M) reach 100M+ at lower quality.]
Motivation
• Data is driving image recognition, but creating a good dataset is not easy
• Big-data challenges – Scalability – Quality
• How should we construct a dataset?
Agenda
Part I: dataset construction
• Dataset construction
• Collecting data
• Annotating data
• Crowdsourcing

10-min break

Part II: case studies
• Data-driven clothing parsing
• Popularity analysis
• Studying fashion styles
• Studying fashion trends
Dataset construction
1. Decide the task: classification, detection, segmentation, ...
2. Collect data: Web, fieldwork
3. Select and annotate data: crowdsourcing
4. Your dataset is ready for use
To annotate, or not to annotate?
• If your task is a supervised approach or benchmarking, you need a lot of annotations.
• If your task is data mining or a weakly supervised approach, you need only minor annotations.
Supervised scenario: classification
• Image-label pairs: every picture must be completely annotated
• Very clean, e.g., the CIFAR-10 dataset [Krizhevsky 2009]

D = {(x, y)}, y ∈ {bird, cat, dog, ...}
Weakly-supervised scenario: tag prediction
• Image and user-attached tags: no annotation effort, but tags are often missing and not necessarily visual

D = {(x, z)}, z ∈ {canon, USA, vintage, ...}

Example tags: Boulder, Colorado, city, historic, history, America, United States, urban, street, vintage, historical, ephemeral, classic, retro, brick, sign, signage, tavern, restaurant, cafe, dining, building, nostalgic, nostalgia, old, wall, door, window, Canon, architecture, Southwest
https://www.flickr.com/photos/29069717@N02/16772466913/
Dataset construction
• The purpose of a dataset greatly differs depending on the goal
• Be ambitious!
  – ImageNet [Deng et al., 2009]: from WordNet to a visual ontology
  – Visipedia [Perona et al., 2010]: constructing a visual encyclopedia
COLLECTING VISUAL DATA
Collecting data
• Web • Fieldwork
Collecting data on the Web
• Approaches: search engines (Google, Bing), Web APIs (Flickr, SNS), Web scraping
• Considerations: legal issues, noise and distribution of online content, storage
Keyword-based search
• Good for weakly-categorized images
• Issues: data-size limitation, variability
Example query: whippet
Query expansion [Deng 2009]
• Synonyms: whippet → whippet dog, whippet greyhound
• Translations: whippet → lebrel (Spanish), 惠比特犬 (Chinese)
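To make the expansion concrete, here is a minimal sketch of synonym-based query expansion using NLTK's WordNet interface; the function name expand_query is our own, and translations would need a separate bilingual dictionary, which we omit.

from nltk.corpus import wordnet  # requires nltk and its 'wordnet' corpus

def expand_query(keyword):
    """Collect synonym phrases for a keyword from WordNet synsets."""
    expansions = {keyword}
    for synset in wordnet.synsets(keyword):
        for lemma in synset.lemmas():
            expansions.add(lemma.name().replace('_', ' '))
    return sorted(expansions)

print(expand_query('whippet'))  # the keyword plus any synonyms WordNet lists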
Web API
• Web services sometimes provide a developer API
• Structured data, e.g., JSON, XML
www.flickr.com/services/developer
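As an illustration, a minimal Flickr API call might look like the sketch below; the API key is a placeholder, and the parameters shown (Creative Commons license filter, JSON output) follow the public flickr.photos.search documentation, so treat them as assumptions to verify against the current API.

import requests

API_KEY = 'YOUR_FLICKR_API_KEY'  # placeholder; register at the URL above
REST_ENDPOINT = 'https://api.flickr.com/services/rest/'

params = {
    'method': 'flickr.photos.search',
    'api_key': API_KEY,
    'text': 'whippet',               # keyword-based search
    'license': '1,2,3,4,5,6',        # Creative Commons licenses only
    'extras': 'url_m,tags,owner_name',
    'format': 'json',
    'nojsoncallback': 1,
}
response = requests.get(REST_ENDPOINT, params=params).json()
for photo in response['photos']['photo']:
    print(photo.get('url_m'), photo.get('tags'))  # structured JSON fields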
Web scraper
[Diagram: a URL queue feeds an HTTP client that fetches pages from the WWW; a parser extracts data and new URLs; data goes to storage, new URLs go back into the queue]
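A minimal sketch of this crawl loop, assuming the requests and beautifulsoup4 packages; a production scraper would also honor robots.txt and rate limits (see the legal notes below).

import collections
import urllib.parse
import requests
from bs4 import BeautifulSoup

def crawl(seed_url, max_pages=100):
    queue = collections.deque([seed_url])         # URL queue
    seen, storage = {seed_url}, {}
    while queue and len(storage) < max_pages:
        url = queue.popleft()
        try:
            page = requests.get(url, timeout=10)  # HTTP client fetches a page
        except requests.RequestException:
            continue
        storage[url] = page.text                  # storage
        soup = BeautifulSoup(page.text, 'html.parser')   # parser
        for link in soup.find_all('a', href=True):       # extract new URLs
            nxt = urllib.parse.urljoin(url, link['href'])
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return storage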
Legal issues
• CAREFULLY READ THE TERMS
  – Service providers don't like abusive access, especially if it harms their business (e.g., copying the entire website)
  – Users own the copyright on their own content
  – Talk to an expert if unsure
• Recommendations
  – Creative Commons (Flickr)
  – Cite URLs instead of redistributing data
YFCC100M: Yahoo Flickr Creative Commons 100M [Thomee et al., 2015]
http://yahoolabs.tumblr.com/post/89783581601/one-hundred-million-creative-commons-flickr-images
• 100M Flickr photos
• 49M geo-tagged
• Image URLs
• Titles and descriptions
• Tags
Could be used as a basis to build a new dataset upon
Quality of online data
Flickr description != caption
Vacation on the water One week vacation in the blue waters of Turkey was one of the best weeks in my life. On day in each bay just worrying about the sun and the water. One week without putting on shoes or using the phone. Paradise on earth!
https://www.flickr.com/photos/rspedro/8396863230/
Learning from online content
• User-generated content does not contain clean data: non-visual texts/tags, tags with high precision but low recall, and frequency issues
• Hopefully, a large data size resolves these issues
Bigger data helps: retrieval from the SBU1M dataset
[Ordonez 2011]
Im2Text: Describing Images Using 1 Million Captioned Photographs
Vicente Ordonez, Girish Kulkarni, Tamara L. Berg (Stony Brook University)
Contributions
• SBU Captioned Photo Dataset: a large novel data set containing 1 million images from the web with associated captions written by people, filtered so that the descriptions are likely to refer to visual content. [http://tamaraberg.com/sbucaptions]
• A description generation method that utilizes global image representations to retrieve and transfer captions from our data set to a query image.
• A description generation method that utilizes both global representations and direct estimates of image content (objects, actions, stuff, attributes, and scenes) to produce relevant image descriptions.

BLEU score evaluation
Method                                          BLEU score
Global matching (1k)                            0.0774 ± 0.0059
Global matching (10k)                           0.0909 ± 0.0070
Global matching (100k)                          0.0917 ± 0.0101
Global matching (1 million)                     0.1177 ± 0.0099
Global + Content matching (linear regression)   0.1215 ± 0.0071
Global + Content matching (linear SVM)          0.1259 ± 0.0060
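For reference, BLEU scores like those in the table compare n-gram overlap between a generated caption and reference captions. A minimal sketch with NLTK (the example sentences are made up, not from the dataset):

from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

reference = 'a stone bridge over a peaceful river'.split()
candidate = 'the bridge over the lake on suzhou street'.split()

# Smoothing avoids zero scores when higher-order n-grams never match.
score = sentence_bleu([reference], candidate,
                      smoothing_function=SmoothingFunction().method1)
print(round(score, 4))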
Dataset size
Past work on image retrieval has shown that small collections often produce spurious matches. Increasing data set size has a significant effect on the quality of retrieved global matches. Quantitative results also reflect this (see the BLEU table above).
[Figure: retrieval for a query image with 1,000, 10,000, 100,000, and 1,000,000 candidate images; transferred captions grow more relevant with scale, e.g., "A female mallard duck in the lake at Luukki Espoo".]

Good results
[Figure: query images with well-matched transferred captions, e.g., "Fresh fruit and vegetables at the market in Port Louis Mauritius", "Clock tower against the sky", "Tree with red leaves in the field in autumn".]

Bad results
[Figure: failure cases grouped into incorrect objects, incorrect context, and completely wrong, with captions such as "Kentucky cows in a field", "The cat in the window", "Water over the road".]
High-level information
• Objects: 80 object categories using part-based deformable models; compute distances to objects detected in the query image based on visual attributes and raw visual descriptors.
• Stuff: detect stuff regions (water, etc.) using a sliding-window SVM scoring function with texton, color, and geometric features as input; determine similarity with the query image using the product of SVM probabilities.
• People/Actions: detect people and pose using state-of-the-art methods; compute person similarity using an attribute-based representation of pose.
• Scenes: train classifiers using global features for 26 common scene types; use the vector of classifier responses as a feature to compute similarity between images.
• TF-IDF: rank the words in the returned set of image captions using their term-frequency inverse-document-frequency scores, and follow a similar approach with the keywords for each object detection in the matching image set, yielding text-based and object-detection-based TF-IDF scores.

Human evaluation
Caption used                    Success rate
Original human caption          96.0%
Top caption                     66.7%
Best from our top 4 captions    92.7%
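The TF-IDF scoring above can be sketched with scikit-learn; the toy captions here are illustrative stand-ins, not actual dataset entries.

from sklearn.feature_extraction.text import TfidfVectorizer

captions = [
    'bridge to temple in hoan kiem lake',
    'the bridge over the lake on suzhou street',
    'iron bridge over the duck river',
]
vectorizer = TfidfVectorizer()
tfidf = vectorizer.fit_transform(captions)

# Average TF-IDF weight per word across the retrieved captions;
# high-scoring words suggest salient content for the query.
weights = tfidf.mean(axis=0).A1
ranked = sorted(zip(vectorizer.get_feature_names_out(), weights),
                key=lambda pair: -pair[1])
for word, weight in ranked[:5]:
    print(word, round(weight, 3))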
Method overview
SBU Captioned Photo Dataset: 1 million captioned images!
[Figure: example dataset photos with owner-written captions, e.g., "Under the sky of burning clouds", "Stained glass window in Eusebius church", "The old Premium Oil Co. sign in Green River, Utah", "Graffiti water tower in Sidney, Ohio".]

1. Match the query image against the dataset using global image features (GIST + color).
2. Transfer the caption(s) of the closest matches, e.g., "The water is clear enough to see fish swimming around in it."
3. Rerank retrieved images using high-level content (captions, object detections, scene classification, stuff detections, people & actions) over regions such as sky, trees, water, building, bridge, and transfer the reranked caption, e.g., "The bridge over the lake on Suzhou Street."
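The global-matching step reduces to nearest-neighbor search in a descriptor space. A minimal sketch, substituting a normalized color histogram for the actual GIST + color features (image is assumed to be an HxWx3 uint8 array):

import numpy as np

def global_descriptor(image):
    """Stand-in for GIST + color: a normalized 8x8x8 RGB histogram."""
    hist, _ = np.histogramdd(image.reshape(-1, 3), bins=(8, 8, 8),
                             range=((0, 256),) * 3)
    vec = hist.ravel()
    return vec / (vec.sum() + 1e-8)

def transfer_captions(query_image, dataset_images, dataset_captions, k=5):
    query = global_descriptor(query_image)
    feats = np.stack([global_descriptor(im) for im in dataset_images])
    dists = np.linalg.norm(feats - query, axis=1)   # global matching
    top_k = np.argsort(dists)[:k]                   # nearest neighbors
    return [dataset_captions[i] for i in top_k]     # candidates to rerank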
Our goal
[Figure: from a query photo, generate a caption such as "The view from the 13th floor of an apartment building in Nakano awesome."]

In addition, we propose a new evaluation task where a user is presented with two photographs and one caption ("Please choose the image that better corresponds to the given caption"). The user must assign the caption to the most relevant image. For evaluation we use a query image, a random image, and a generated caption.
Power-laws
• A limited vocabulary appears an extremely large number of times; most words are rare (the long tail)
• Seen in the frequency of tag words and in content popularity
• Family of distributions of the form: f(x) = a·x^k
[Figure: long-tailed frequency distribution; figure: Wikipedia]
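One way to check a power law empirically: the exponent k appears as the slope of a least-squares line fit in log-log space. A small sketch with NumPy (the tag list is a tiny stand-in for a real tag stream):

import collections
import numpy as np

tags = ['sky', 'sky', 'sunset', 'sky', 'canon', 'sunset', 'vintage']
counts = np.array(sorted(collections.Counter(tags).values(), reverse=True))
ranks = np.arange(1, len(counts) + 1)

# f(x) = a * x^k becomes log f = k * log x + log a, a straight line.
k, log_a = np.polyfit(np.log(ranks), np.log(counts.astype(float)), 1)
print('estimated exponent k =', k)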
Collecting data in the field
• Full control over data: sensor types (RGB-D, panorama), quality
• No copyright issues
• But: cost and scalability
Data collection summary
• Web, fieldwork, or building on an existing dataset
• Legal concerns
• Quality issues
• Probably bigger is better: deep learning requires big data
ANNOTATING DATA
Annotation process
Input: raw data D = {(x, z)} with weak labels (search keywords, tags, GPS (?))
Output: annotated data D′ = {(x, y)} with clean labels (image labels, bounding boxes, pixel labels)
Annotation system
[Diagram: an annotation tool presents an image to the worker]

Designing annotation tools
Types of annotations:
• Category tag
• Bounding box, human pose
• Segmentation: polygons or super-pixels
• Natural language
• Attributes
• Tracking
Tools: HTML / JavaScript, plus a web server to host images
Bounding boxes
Segmentation
• Pixel-wise labels
• Approximations: polygons, super-pixels
LabelMe [Russell 2007]
Natural language
• Image-text pairs
• Decomposable with NLP techniques: attribute (Adj + Noun), action (Noun + Verb)
• Sentence generation

Example (UIUC Pascal Sentence [Rashtchian 2010]):
• One jet lands at an airport while another takes off next to it.
• Two airplanes parked in an airport.
• Two jets taxi past each other.
• Two parked jet airplanes facing opposite directions.
• Two passenger planes on a grassy plain.
Relative attributes [Parikh 2011]
• Relative comparisons: "> natural", "< smiling"
• versus the absolute question: "Is this natural?"
(Slide credit: Devi Parikh)
Object annotation in videos
Vatic [Vondrick 2012]
Choosing the right task
• The more difficult the task, the more expensive annotation becomes: (worker) time is money, and a very difficult task results in poor quality
• Decompose a very complicated task into multiple simple tasks, e.g., a single task to detect ALL objects → multiple tasks to detect specific objects
Scaling up multi-label annotation [Deng 2014]
• Goal: efficiently labeling hundreds of categories (~1000) per image
[Figure: a matrix of images versus labels (table, chair, horse, dog, cat, bird, ...) with +/- entries]
Hierarchy, sparsity, correlation [Deng 2014]
• Naively asking 1000 labels: "Is there a table?", "Is there a chair?", "Is there a horse?", "Is there a dog?", "Is there a cat?", "Is there a bird?", ...
• Hierarchical questions: "Is there an animal?" → "Is there a mammal?" → "Is there a cat?"; a "no" prunes the whole subtree (no bird), sparsity rules out most labels (no table, chair), and correlation helps too (probably no horse?). A sketch of the idea follows.
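As referenced above, a minimal sketch of hierarchical questioning; the hierarchy and the ask callback are hypothetical stand-ins for a real label tree and a crowd query, not the actual system of Deng et al.

# Hypothetical label hierarchy; a child is asked only if its parent is present.
HIERARCHY = {
    'entity': ['animal', 'furniture'],
    'animal': ['mammal', 'bird'],
    'mammal': ['cat', 'dog', 'horse'],
    'furniture': ['table', 'chair'],
}

def annotate(image, ask, label='entity'):
    """Ask 'is there a <label>?' and recurse only into subtrees answered yes."""
    positives = []
    if ask(image, label):                     # one crowd question per node
        positives.append(label)
        for child in HIERARCHY.get(label, []):
            positives.extend(annotate(image, ask, child))
    return positives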
Human-in-the-loop approach [Vijayanarasimhan 2011]
• Active learning: only annotate uncertain instances
• Lower costs, faster learning
[Diagram: the model selects uncertain images from the unannotated pool; annotators label them; annotated images retrain the model]
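A common selection rule is uncertainty sampling: send annotators the images whose predicted probability is closest to 0.5. A minimal sketch, assuming a scikit-learn-style binary classifier:

import numpy as np

def select_uncertain(model, unlabeled_features, batch_size=10):
    """Indices of the images the current model is least sure about."""
    proba = model.predict_proba(unlabeled_features)[:, 1]
    uncertainty = np.abs(proba - 0.5)        # 0 = maximally uncertain
    return np.argsort(uncertainty)[:batch_size]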
Pitfall: asking people for validation
"Is this an airplane? Answer yes if a green rectangle is drawn around an airplane; otherwise answer no."
Machines are very unlikely to produce 100% correct detections.
Rule of thumb: ask workers to annotate ground truth instead.
Crowdsourcing market
• Requesters post tasks (image classification, object detection, segmentation, language description) with monetary rewards ($$$); online workers return results
• Platforms: Amazon Mechanical Turk, CrowdFlower, etc.
• As of 2015, requesters outside the US probably need a US-based partner or an agent to use MTurk...
Amazon MTurk demographics
• Country: 80% US, 20% India
• Gender: 50% male, 50% female
• Age: ~50% in their 30s, ~20% in their 20s, ~20% in their 40s
P. Ipeirotis, Demographics of Mechanical Turk: Now Live! (April 2015 edition), http://www.behind-the-enemy-lines.com
http://blogs.scientificamerican.com/guilty-planet/
Workers are not the same (P. Welinder et al., The Multidimensional Wisdom of Crowds, NIPS 2010)
• One annotator = one classifier for "duck" presence
• Decision parameters estimated from a Bayesian model
• Groups 1, 2, 3 have different decision boundaries
Quality control
• There are always sloppy annotators; think of a bot randomly clicking on buttons
• Have a qualification test
• Insert JavaScript to validate answers; reject too-short or too-fast answers
• Assign multiple annotators per task
• Control worker motivation: feedback, gamification
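Redundant assignments are typically aggregated by majority vote after filtering implausibly fast answers. A minimal sketch (the answer tuple format is our own, not a platform API):

import collections

def aggregate(answers, min_seconds=3.0):
    """Majority-vote label per task, ignoring suspiciously fast responses.

    answers: iterable of (task_id, worker_id, label, seconds_spent).
    """
    votes = collections.defaultdict(collections.Counter)
    for task_id, worker_id, label, seconds in answers:
        if seconds >= min_seconds:           # drop likely-bot answers
            votes[task_id][label] += 1
    return {task: counter.most_common(1)[0][0]
            for task, counter in votes.items()}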
Qualification tests
• MTurk can be set up to allow only workers who pass a qualification test: prepare test questions with gold-standard answers
• Useful to assess, e.g., writing ability
• Also possible to validate workers during the main tasks
[Diagram: a qualification test gates workers into the annotation tasks (good) or out (bad)]
Giving feedback to workers
• Feedback motivates workers; also include a comment form to get their opinions
[Mockup: a progress bar from 0% to 100% with messages such as "You're the rookie! You've annotated 20 images. You have only 20 images left!"]
Game-based annotation: the ESP Game [Ahn 2004]
• Pros: no MTurk, built-in motivation
• Cons: cheating, bias
(image: many.corante.com)
ReferItGame [Kazemzadeh 2014]
• Player 1 writes a referring expression, e.g., "guy in front", "man in red shirt"
• Player 2 clicks on the referred object
Finding experts on the Web: Quizz [Ipeirotis 2014]
• Free medical quiz promoted via targeted ads
• Knowledgeable volunteers, without monetary rewards
• Much faster, with better quality
• Do monetary rewards harm quality?
Crowdsourcing considerations
• Know your workers
• Quality control – Fun tasks attract motivated workers!
Part I: dataset construction
• What is your dataset for? Know your task
• Collecting data: on the Web or in the field; quality and big data
• Annotating data: designing the right tool
• Crowdsourcing: workers and quality control
Part II: Case studies
• Data-driven clothing parsing • Popularity analysis • Studying fashion styles • Studying fashion trends
DATA-DRIVEN CLOTHING PARSING
CVPR 2012, ICCV 2013
style.com
Clothing parsing
Pose and clothing
Semantic segmentation and pose estimation: [Shotton 06] [Gould 09] [Liu 09] [Eigen 12] [Singh 13] [Tighe 10, 13, 14] [Dong 13] [Ferrari 08] [Bourdev 09] [Yang 11] [Dantone 13] [Ladicky 13]
Online fashion networks
Chictopia Lookbook Chicisimo Pinterest Tumblr ...
www.chictopia.com
Datasets
• Fashionista dataset: small, completely annotated images, for supervised learning
• Paper Doll dataset: large-scale tagged images, for a semi-supervised approach

Fashionista dataset
• 685 images with pose annotations and super-pixel labels
• Manually picked images from Chictopia
• Crowd-sourced annotation
[Figure 2: Clothing parsing pipeline. (a) Parsing the image into superpixels [1]. (b) Original pose estimation using the state-of-the-art flexible mixtures-of-parts model [27]. (c) Precise clothing parse output by our proposed clothing estimation model (note the accurate labeling of items as small as the wearer's necklace, or as intricate as her open-toed shoes); predicted labels include null, shorts, shoes, purse, top, necklace, hair, skin. (d) Optional re-estimate of pose using clothing estimates (note the improvement in her left arm prediction, compared to the original incorrect estimate down along the side of her body).]
...garment retrieval application (Fig. 1).

Our main contributions include:
• A novel dataset for studying clothing parsing, consisting of 158,235 fashion photos with associated text annotations, and web-based tools for labeling.
• An effective model to recognize and precisely parse pictures of people into their constituent garments.
• Initial experiments on how clothing prediction might improve state-of-the-art models for pose estimation.
• A prototype visual garment retrieval application that can retrieve matches independent of pose.

Of course, clothing estimation is a very challenging problem. The number of garment types you might observe in a day on the catwalk of a New York city street is enormous. Add variations in pose, garment appearance, layering, and occlusion into the picture, and accurate clothing parsing becomes formidable. Therefore, we consider a somewhat restricted domain, fashion photos from Chictopia.com. These highly motivated users (fashionistas) upload individual snapshots (often full body) of their outfits to the website and usually provide some information related to the garments, style, or occasion for the outfit. This allows us to consider the clothing labeling problem in two scenarios: 1) a constrained labeling problem where we take the users' noisy and perhaps incomplete tags as the list of possible garment labels for parsing, and 2) where we consider all garment types in our collection as candidate labels.
1.1. Related Work

Clothing recognition: Though clothing items determine most of the surface appearance of the everyday human, there have been relatively few attempts at computational recognition of clothing. Early clothing parsing attempts focused on identifying layers of upper-body clothes in very limited situations [2]. Later work focused on grammatical representations of clothing using artists' sketches [6]. Freifeld and Black [13] represented clothing as a deformation from an underlying body contour, learned from training examples using principal component analysis to produce eigen-clothing. Most recently, attempts have been made to consider clothing items such as t-shirts or jeans as semantic attributes of a person, but only for a limited number of garments [4]. Different from these past approaches, we consider the problem of estimating a complete and precise region-based labeling of a person's outfit, for general images with a large number of potential garment types.

Clothing items have also been used as implicit cues of identity in surveillance scenarios [26], to find people in an image collection of an event [11, 22, 25], to estimate occupation [23], or for robot manipulation [16]. Our proposed approach could be useful in all of these scenarios.

Pose estimation: Pose estimation is a popular and well-studied enterprise. Some previous approaches have considered pose estimation as a labeling problem, assigning most likely body parts to superpixels [18], or triangulated regions [20]. Current approaches often model the body as a collection of small parts and model relationships among them, using conditional random fields [19, 9, 15, 10], or discriminative models [8]. Recent work has extended patches to more general poselet representations [5, 3], or incorporated mixtures of parts [27] to obtain state-of-the-art results. Our pose estimation subgoal builds on this last method [27], extending the approach to incorporate clothing estimations in models for pose identification.

Image parsing: Image parsing has been studied as a step toward general image understanding [21, 12, 24]. We consider a similar problem (parsing) and take a related ap-
Annotation tools: web-based annotation tools on Amazon Mechanical Turk
Lesson: Segmentation is too hard for MTurk workers
Open-source tools
JS-Graph-Annotator JS-Segment-Annotator
github.com/kyamagu/js-graph-annotator github.com/kyamagu/js-segment-annotator
CRF-based parsing
[Figure 4: Successful results on the Fashionista dataset; predicted labels include null, shoes, shirt, jeans, tights, jacket, dress, hat, heels, shorts, blouse, bracelet, wedges, top, stockings, hair, skin.]
[Figure 5: Failure cases: (a) skin-like color, (b) failing pose estimate, (c) spill into the background, (d) coarse pattern.]

...the art [27]). As motivation for future research on clothing estimation, we also find that given true clothing labels our pose re-estimation system reaches a PCP of 89.5%, demonstrating the potential usefulness of incorporating clothing into pose identification.

4.4. Retrieving Visually Similar Garments

We build a prototype system to retrieve garment items via visual similarity in the Fashionista dataset. For each parsed garment item, we compute normalized histograms of RGB and L*a*b* color within the predicted labeled region, and measure similarity between items by Euclidean distance. For retrieval, we prepare a query image and obtain a list of images ordered by visual similarity. Figure 1 shows a few of the top retrieved results for images displaying shorts, blazer, and t-shirt.

CVPR 12
Failure cases CVPR 12
Paper Doll parsing: a retrieval-based approach
1. Get tagged images on the Web (Paper Doll dataset)
2. Retrieve similar (nearest-neighbor) images and collect their candidate tags
3. Use them in the image parser to predict items in the query
Tagged images on the Web: dress, hat, heels, sweater, ...
Paper Doll dataset: ~339,000 images
Retrieval example
[Figure: a query image with candidate tags "dress, shoes, skirt, tights, belt, top" retrieves images tagged "bag, cardigan, heels, shorts, top", "boots, skirt", "flats, necklace, shirt, skirt", "belt, pumps, skirt, t-shirt", "belt, shirt, shoes, skirt, tights", "skirt, top", ...]
Mixture of retrieval-based methods
[Diagram: from the input image, similar styles and candidate items feed three predictors (global parsing, NN parsing, transferred parsing); their predictions are combined and smoothed into the final parsing.]
Results
[Figures: input, ground truth, CRF baseline, and Paper Doll parses shown side by side.]
Big data benefits: performance
[Plot: parsing performance versus data size; the CRF baseline doesn't use big data.]
Big data benefits: qualitative
[Figure: parses of the same input at data size 256 versus 262,144; with more data, predicted items (skin, hair, bag, boots, dress, skirt, top, ...) become much more accurate.]
Data-driven clothing parsing
• Fashionista dataset: small, completely annotated
• Paper Doll dataset: large, user-annotated
POPULARITY ANALYSIS MM2014
Predicted most popular
Predicted least popular
Popularity prediction
Regression analysis on 300K posts
• Content factors (input): tag TF-IDF, image composition, color entropy, style descriptor, parse descriptor
• Social factors (input): user identity, previous posts, node degrees
• Output: popularity (votes)
[Plot: distribution of votes via the like button in Chictopia is long-tailed; a promotion effect?]
Findings
• The outfit doesn’t matter (!!!)
• Popularity is mostly the outcome of the social network – social bias – #votes ∝ #followers – People just click on friends’ photos
Regression performance
Factors           R2     Spearman   Accuracy top 25%   Accuracy top 75%
Social            0.491  0.682      0.847              0.779
Content           0.248  0.488      0.778              0.737
Social + Content  0.493  0.685      0.845              0.775
Social factors significantly boost the performance
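The metrics in the table (R2 and Spearman rank correlation) can be reproduced with standard tools; here is a sketch on synthetic stand-in features, not the actual Chictopia data:

import numpy as np
from scipy.stats import spearmanr
from sklearn.linear_model import Ridge
from sklearn.metrics import r2_score

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 20))              # stand-in post features
y = 2 * X[:, 0] + rng.normal(size=1000)      # stand-in (log) vote counts

model = Ridge(alpha=1.0).fit(X[:800], y[:800])
pred = model.predict(X[800:])
print('R2       =', r2_score(y[800:], pred))
print('Spearman =', spearmanr(y[800:], pred).correlation)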
Rich-get-richer phenomena (Easley and Kleinberg 2010)
• The popularity growth of a linked content is proportional to its current popularity
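The rich-get-richer dynamic is easy to simulate with preferential attachment: each new vote goes to a post with probability proportional to its current votes. This toy simulation (our own illustration) reproduces a long-tailed vote distribution:

import random

def simulate_votes(n_posts=500, n_votes=20000, seed=0):
    """Preferential attachment: popular posts attract new votes faster."""
    random.seed(seed)
    votes = [1] * n_posts                   # one bootstrap vote each
    for _ in range(n_votes):
        i = random.choices(range(n_posts), weights=votes)[0]
        votes[i] += 1
    return sorted(votes, reverse=True)      # heavy-tailed distribution

print(simulate_votes()[:10])                # a few posts dominate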
What if there is no social network?
• Popularity = f(content factors)?
Crowdsourcing!
• Collect popularity votes in Amazon MTurk: 3000 pictures, 25 assignments each
• No network!
Out-of-network popularity
[Histogram: #images versus #votes collected on MTurk; no social factor in the voting process.]
Task
• Predict crowd popularity (MTurk voting data) using content factors and/or social factors from Chictopia
[Diagram: social and content factors from Chictopia → ? → MTurk voting data]
Predicting crowd votes
Factors           R2     Spearman   Accuracy top 25%   Accuracy top 75%
Social            0.423  0.634      0.845              0.787
Content           0.428  0.647      0.888              0.862
Social + Content  0.473  0.686      0.884              0.858
• Content factors matter
• Social factors from Chictopia predict crowd votes well
• User-content correlation: top bloggers consistently post good pictures
Lessons
• Crowdsourcing is not only for getting ground truth, but also for studying human behavior
• Research opportunity for social visual media
STUDYING FASHION STYLES ECCV2014
Q: What makes the boy on the right look Harajuku-style?
Tie? Shoes?
tokyofashion.com
Goal
• Finding what constitutes a fashion style, e.g., Goth
• Approach: game-based annotation, attribute factorization

hipsterwars.com: "Who's more Bohemian?"
• Game-based relative "style-ness" collection
• Asked our online friends for participation: NO MONETARY REWARDS!
• Seed images from an initial keyword search on Google or fashion SNS
Participation statistics
• Most played the game for only a few clicks
• Some motivated users clicked A LOT
TrueSkill game algorithm [R. Herbrich, 2007]
• Algorithm to select which pair to play
• Idea: represent each image's rating by a Gaussian; update the Gaussian parameters after each click; choose expected-to-tie images for play
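A simplified sketch of the TrueSkill win/loss update (no draws, no dynamics), plus a naive expected-to-tie pair selector; the constant follows commonly cited defaults, and the pairing rule is a simplification of the actual match-quality criterion:

import math
from scipy.stats import norm

BETA = 25.0 / 6   # performance noise (common default)

def trueskill_update(mu_w, sigma_w, mu_l, sigma_l):
    """One update of the winner's and loser's Gaussian skill beliefs."""
    c = math.sqrt(2 * BETA ** 2 + sigma_w ** 2 + sigma_l ** 2)
    t = (mu_w - mu_l) / c
    v = norm.pdf(t) / norm.cdf(t)            # mean correction
    w = v * (v + t)                          # variance correction
    mu_w += sigma_w ** 2 / c * v
    mu_l -= sigma_l ** 2 / c * v
    sigma_w *= math.sqrt(max(1 - sigma_w ** 2 / c ** 2 * w, 1e-9))
    sigma_l *= math.sqrt(max(1 - sigma_l ** 2 / c ** 2 * w, 1e-9))
    return mu_w, sigma_w, mu_l, sigma_l

def pick_pair(images):
    """Naive pairing: the two images with the closest mean ratings."""
    images = sorted(images, key=lambda im: im['mu'])
    gaps = [(images[i + 1]['mu'] - images[i]['mu'], i)
            for i in range(len(images) - 1)]
    _, i = min(gaps)
    return images[i], images[i + 1]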
Score distribution after game
Most Hipster
Least Hipster
Annotation examples
[Figure (ECCV-14 submission 1534, Fig. 5): example results of the within-class classification task with δ = 0.5; most- and least-predicted examples for each style: Pinup, Goth, Hipster, Bohemian, Preppy.]

5.2 Within-class classification

Our next style recognition task considers classification between top-rated and bottom-rated examples for each style independently. Here the goal is, for example, to determine whether a person is an uber-hipster or only sort of hipster. Again, we utilize linear SVMs [27], but here learn one visual model for each style in our dataset. Here δ determines the percentage of top- and bottom-ranked images used in the classification task. For example, δ = 0.1 means that we use the top-rated 10% of images from a style as positive samples and the bottom-rated 10% of samples from the same style as negative samples (using the ratings computed in Sec. 3.2). We evaluate experiments for δ ranging from 10% to 50%. We repeat the experiments for 100 random folds with a 9:1 train-to-test ratio. In each experiment, C is determined using 5-fold cross-validation.

Results are reported in Figure 6. We observe that when δ is small we generally have better performance than for larger δ. This is because the classification task generally becomes more challenging as we add less extreme examples of each style. Additionally, we find best performance on the pinup category. Performance on the goth category comes in second. For the hipster category, we do quite well at differentiating between extremely strong or weak examples, but performance
High-quality dataset without Amazon MTurk
Relative vs. absolute
• We also asked MTurk workers for absolute 1-10 ratings
• The MTurk results were much noisier
Analyzing what makes her look preppy
Factorization results
Fashion style analysis
• Game-based annotation collected high-quality data without monetary rewards
• How can we collect seed images?
STUDYING FASHION TRENDS WACV2015
Fashion trend: Runway to realway
Fashion show Street
Runway dataset ~35k images in 9k fashion shows over 15 years, from 2000 to 2014
Brands by photos
[Histogram: number of brands versus number of photos per brand (log scale, 10^1 to 10^3); most brands have only a few photos.]
Collecting human judgments to learn similarity
• Task: "Select the image with the most similar outfit to the query. If there is NO similar image, please select NONE."
• The query image is given in the left column; five candidate images (plus NONE) are shown in the right columns.
Runway-to-runway retrieval: retrieving similar styles from other fashion shows
Runway-to-realway retrieval: retrieving similar styles from street snaps

Visually analyzing the floral trend
[Plot: percentage of retrieved street images matching a runway floral query, over time; peaks in spring!]
Part II: case studies
• Vital roles of data
  – Data-driven clothing parsing: small complete annotations, large-scale tags
  – Popularity analysis: verifying network phenomena using crowds
  – Studying fashion styles: game-based data collection
  – Studying fashion trends: learning human judgments to analyze trends
Effective dataset construction
[Chart: quality versus scale, from 100 to 10B images; in-house datasets are small but high quality, crowd-sourced datasets sit in between, and raw, user-generated data reaches billions of images at lower quality.]
A driving force of computer vision
Wisdom of crowds