Binary and relative descriptions of example images

Image 1
  Binary descriptions: not natural, not open, perspective
  Relative descriptions: more natural than tallbuilding; less natural than forest; more open than tallbuilding; less open than coast; more perspective than tallbuilding

Image 2
  Binary descriptions: not natural, not open, perspective
  Relative descriptions: more natural than insidecity; less natural than highway; more open than street; less open than coast; more perspective than highway; less perspective than insidecity

Image 3
  Binary descriptions: natural, open, perspective
  Relative descriptions: more natural than tallbuilding; less natural than mountain; more open than mountain; less perspective than opencountry

Image 4
  Binary descriptions: White, not Smiling, VisibleForehead
  Relative descriptions: more White than AlexRodriguez; more Smiling than JaredLeto; less Smiling than ZacEfron; more VisibleForehead than JaredLeto; less VisibleForehead than MileyCyrus

Image 5
  Binary descriptions: White, not Smiling, not VisibleForehead
  Relative descriptions: more White than AlexRodriguez; less White than MileyCyrus; less Smiling than HughLaurie; more VisibleForehead than ZacEfron; less VisibleForehead than MileyCyrus

Image 6
  Binary descriptions: not Young, BushyEyebrows, RoundFace
  Relative descriptions: more Young than CliveOwen; less Young than ScarlettJohansson; more BushyEyebrows than ZacEfron; less BushyEyebrows than AlexRodriguez; more RoundFace than CliveOwen; less RoundFace than ZacEfron
Relative Attributes
Devi Parikh (TTIC) and Kristen Grauman (UT Austin)

1. Main Idea
Motivation: Categorical (binary) attributes are restrictive and can be unnatural.
  Natural / Not natural / ?
  Smiling / Not smiling / ?
Proposed idea: Relative attributes
  Richer communication between humans and machines
  Describe images or categories relatively, e.g. “dogs are furrier than giraffes”, “find less congested downtown Chicago scene than [image]”
  Learn a ranking function for each attribute
Enables new applications
  Novel zero-shot learning from attribute comparisons
  Precise automatically generated textual descriptions of images

2. Learning Relative Attributes
For each attribute $a_m$, supervision is a set of ordered pairs $O_m = \{(i \succ j), \ldots\}$ and a set of similar pairs $S_m = \{(i \sim j), \ldots\}$.
Learn a scoring function
\[ r_m(x_i) = w_m^T x_i \]
that best satisfies the constraints:
\[ \forall (i, j) \in O_m : w_m^T x_i > w_m^T x_j \]
\[ \forall (i, j) \in S_m : w_m^T x_i = w_m^T x_j \]
Max-margin learning to rank formulation:
\[ \min \left( \frac{1}{2} \| w_m^T \|_2^2 + C \left( \sum \xi_{ij}^2 + \sum \gamma_{ij}^2 \right) \right) \]
\[ \text{s.t. } w_m^T (x_i - x_j) \ge 1 - \xi_{ij}, \ \forall (i, j) \in O_m; \quad | w_m^T (x_i - x_j) | \le \gamma_{ij}, \ \forall (i, j) \in S_m; \quad \xi_{ij} \ge 0; \ \gamma_{ij} \ge 0 \]
Figure: learnt relative attributes (Young, Smiling, …) and the relative attributes space.

4. Relative Zero-shot Learning
Training: Images from S seen categories and descriptions of U unseen categories
Testing: Categorize image into N (= S + U) categories
Example relative descriptions: Young: H ≺ C ≺ S, Z ≺ M; Smiling: Z ≺ M
  Need not use all attributes
  Need not relate to all S
Figure: seen and unseen categories (S, H, Z, C, M) in the relative attributes space; axes: Smiling, Youth.

5. Describing Images Relatively
Auto-generate a textual description of an image from the learnt relative attributes (e.g. Density: …).

3. Ranking Function vs. Binary Classifier Score
% correctly ordered pairs
                   Classifier   Ranker
Outdoor scenes     80%          89%
Celebrity faces    67%          82%
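The percentage of correctly ordered pairs can be computed directly from per-image scores, whether they come from a ranker or a classifier. The following is a minimal sketch of that measurement; the function name `pair_ordering_accuracy` and all scores and pairs are hypothetical illustrations, not data from the poster.

```python
import numpy as np

def pair_ordering_accuracy(scores, ordered_pairs):
    """Fraction of pairs (i, j), where image i is labeled as having MORE of
    the attribute than image j, for which scores[i] > scores[j]."""
    scores = np.asarray(scores, dtype=float)
    correct = sum(bool(scores[i] > scores[j]) for i, j in ordered_pairs)
    return correct / len(ordered_pairs)

# Hypothetical scores for four images (illustration only).
ranker_scores = [0.1, 0.9, 1.7, 2.6]
classifier_scores = [-1.2, -0.8, 1.5, 1.1]

# Ground-truth comparisons: the first index has more of the attribute.
pairs = [(1, 0), (2, 1), (3, 2), (3, 0)]

ranker_acc = pair_ordering_accuracy(ranker_scores, pairs)
classifier_acc = pair_ordering_accuracy(classifier_scores, pairs)
print(ranker_acc, classifier_acc)
```

A classifier score can separate the two sides of a binary label well and still misorder images within one side, which is the kind of error this measure counts.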
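Figure (describing images relatively): images ordered along the Density axis of the relative attributes space (categories: C C H H H C F H H M F F I F), annotated “1/8 dataset”.
Relative description using reference images: “more dense than [image], less dense than [image]”
Relative description using reference categories: “more dense than Highways, less dense than Forests”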
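A relative description such as “more dense than Highways, less dense than Forests” can be produced by sorting images on the learnt attribute score and picking one reference below and one above the image being described. The sketch below assumes the references sit a fixed fraction of the dataset away in rank order (1/8 here, an interpretation of the “1/8 dataset” annotation); the function name, parameters, scores, and image names are hypothetical.

```python
def describe_relatively(scores, names, target, attribute="dense",
                        offset_fraction=1 / 8):
    """Build a 'more X than ..., less X than ...' phrase for image `target`.

    scores[i] is the (assumed) ranking-function output for image i;
    a higher score means a stronger presence of the attribute.
    """
    n = len(scores)
    order = sorted(range(n), key=lambda i: scores[i])  # weakest to strongest
    pos = order.index(target)
    step = max(1, round(n * offset_fraction))          # rank distance to references

    parts = []
    if pos > 0:                                        # a weaker image exists
        lower = order[max(0, pos - step)]
        parts.append(f"more {attribute} than {names[lower]}")
    if pos < n - 1:                                    # a stronger image exists
        upper = order[min(n - 1, pos + step)]
        parts.append(f"less {attribute} than {names[upper]}")
    return ", ".join(parts)

# Hypothetical density scores for eight images.
scores = [0.2, 1.5, 0.9, 2.4, 0.1, 1.1, 3.0, 1.8]
names = [f"img{i}" for i in range(len(scores))]

mid_description = describe_relatively(scores, names, target=5)
low_description = describe_relatively(scores, names, target=4)
print(mid_description)
print(low_description)
```

Choosing references some distance away in rank, rather than the immediate neighbors, keeps the description from depending on differences too small to see.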
Binary and relative attribute assignments

OSR                 Binary (T I S H C O M F)   Relative
natural             0 0 0 0 1 1 1 1            T≺I∼S≺H≺C∼O∼M∼F
open                0 0 0 1 1 1 1 0            T∼F≺I∼S≺M≺H∼C∼O
perspective         1 1 1 1 0 0 0 0            O≺C≺M∼F≺H≺I≺S≺T
large-objects       1 1 1 0 0 0 0 0            F≺O∼M≺I∼S≺H∼C≺T
diagonal-plane      1 1 1 1 0 0 0 0            F≺O∼M≺C≺I∼S≺H≺T
close-depth         1 1 1 1 0 0 0 1            C≺M≺O≺T∼I∼S∼H∼F

PubFig              Binary (A C H J M S V Z)   Relative
Masculine-looking   1 1 1 1 0 0 1 1            S≺M≺Z≺V≺J≺A≺H≺C
White               0 1 1 1 1 1 1 1            A≺C≺H≺Z≺J≺S≺M≺V
Young               0 0 0 0 1 1 0 1            V≺H≺C≺J≺A≺S≺Z≺M
Smiling             1 1 1 0 1 1 0 1            J≺V≺H≺A∼C≺S∼Z≺M
Chubby              1 0 0 0 0 0 0 0            V≺J≺H≺C≺Z≺M≺S≺A
Visible-forehead    1 1 1 0 1 1 1 0            J≺Z≺M≺S≺A∼C∼H∼V
Bushy-eyebrows      0 1 0 1 0 0 0 0            M≺S≺Z≺V≺H≺A≺C≺J
Narrow-eyes         0 1 1 0 0 0 1 1            M≺J≺S≺A≺H≺C≺V≺Z
Pointy-nose         0 0 1 0 0 0 0 1            A≺C≺J∼M∼V≺S≺Z≺H
Big-lips            1 0 0 0 1 1 0 0            H≺J≺V≺Z≺C≺M≺A≺S
Round-face          1 0 0 0 1 1 0 0            H≺V≺J≺C≺Z≺A≺S≺M
Figure (zero-shot learning results; methods: DAP, SRA, Proposed; datasets: OSR, PubFig): accuracy vs. # unseen categories; accuracy vs. # labeled pairs; accuracy vs. # attributes to describe unseen; accuracy vs. looseness of constraints.
Figure (Binary vs. Relative): % correct image in top choices vs. # top choices.
6. Datasets
Outdoor Scene Recognition (OSR): 2688 images, 8 categories: coast (C), forest (F), highway (H), inside-city (I), mountain (M), open-country (O), street (S) and tall-building (T); gist features.
Public Figure Face (PubFig): 800 images, 8 categories: Alex Rodriguez (A), Clive Owen (C), Hugh Laurie (H), Jared Leto (J), Miley Cyrus (M), Scarlett Johansson (S), Viggo Mortensen (V) and Zac Efron (Z); gist and color features.
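Describing images relatively (example): a relative description places the image between reference images, whereas a conventional binary description says only “not dense” (figure labels: Not dense, Dense).
Figure labels (ranking function vs. binary classifier): w_b, w_m

7. Image Description Results

8. Zero-shot Learning Results
Baselines:
  Direct Attribute Prediction (DAP) [Lampert et al. 2009] (binary)
  Classifier instead of ranker (SRA)
Experiments vary:
  Number of unseen categories
  Amount of labeled data to learn attributes
  Amount of description
  Quality of description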
The max-margin ranking objective is adapted from [Joachims, 2002].
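To make the ranking objective concrete, its constraints can be folded into an unconstrained loss: a squared hinge on each ordered pair in O_m and a squared difference on each similar pair in S_m, plus the (1/2)||w_m||^2 regularizer. The sketch below minimizes that loss with plain gradient descent, which is a simple solver swapped in for illustration and not necessarily the one the authors used. The function name, learning rate, iteration count, and toy data are all assumptions.

```python
import numpy as np

def train_relative_attribute(X, ordered_pairs, similar_pairs,
                             C=1.0, lr=0.01, n_iter=2000):
    """Fit w for one attribute on the unconstrained squared-hinge form of
    the ranking objective, using plain gradient descent (an assumption)."""
    X = np.asarray(X, dtype=float)
    d = X.shape[1]
    # Difference vectors x_i - x_j; in ordered pairs, i has MORE of the attribute.
    D_o = np.array([X[i] - X[j] for i, j in ordered_pairs]).reshape(-1, d)
    D_s = np.array([X[i] - X[j] for i, j in similar_pairs]).reshape(-1, d)
    w = np.zeros(d)
    for _ in range(n_iter):
        slack = np.maximum(0.0, 1.0 - D_o @ w)  # xi_ij: margin violations on O_m
        gap = D_s @ w                           # signed score gap on S_m (gamma_ij = |gap|)
        grad = w - 2.0 * C * (D_o.T @ slack) + 2.0 * C * (D_s.T @ gap)
        w -= lr * grad
    return w

# Toy features: attribute strength grows with the first feature (hypothetical).
X = np.array([[0.0, 1.0], [1.0, 0.9], [2.0, 1.1], [3.0, 1.0], [1.0, 1.5]])
ordered = [(1, 0), (2, 1), (3, 2), (3, 0)]  # (i, j): image i has more than image j
similar = [(1, 4)]                          # images 1 and 4 have similar strength

w = train_relative_attribute(X, ordered, similar)
scores = X @ w                              # r_m(x) = w^T x for every image
print(scores)
```

The learned scores are then comparable across images, so any two images can be ordered by attribute strength, not only the pairs seen in training.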
How do learned ranking functions differ from classifier outputs?
Infer image category using max-likelihood.
Human subject experiment: “Which image is?” (OSR, PubFig)
Description phrases: more / less smiling than [image]; more / less VisFHead (visible forehead) than [image]; more / less chubby than [image]
\[ c(x) = \arg\max_{c} \prod_{m} P(a_m^c \mid x) \]
Plot annotations: “classical recognition problem”; “binary ~ relative supervision”; “<< baseline supervision can give unique ordering on all classes”
An attribute is more discriminative when used relatively.
Relative attributes jointly carve out space for unseen category.
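The idea that relative descriptions carve out a region for an unseen category, followed by max-likelihood inference, can be sketched as follows. This is one plausible instantiation under stated assumptions, none of which are confirmed by the poster text: each category is a Gaussian in the space of attribute scores, an unseen category's mean is placed midway between two seen categories or one average gap beyond a single seen category, undescribed attributes fall back to the mean of the seen means, and all categories share one covariance. Category names, means, and helper names are hypothetical.

```python
import numpy as np

def gaussian_log_likelihood(x, mean, cov):
    """Log-density of x under a multivariate Gaussian N(mean, cov)."""
    diff = x - mean
    _, logdet = np.linalg.slogdet(cov)
    quad = diff @ np.linalg.solve(cov, diff)
    return -0.5 * (quad + logdet + len(x) * np.log(2 * np.pi))

def unseen_mean(description, seen_means, margins, default_mean):
    """description[m] = (lower, upper): the unseen category has more of
    attribute m than seen category `lower` and less than `upper`
    (either may be None). Attributes not described keep the default."""
    mu = default_mean.copy()
    for m, (lower, upper) in description.items():
        if lower is not None and upper is not None:
            mu[m] = 0.5 * (seen_means[lower][m] + seen_means[upper][m])
        elif lower is not None:
            mu[m] = seen_means[lower][m] + margins[m]
        elif upper is not None:
            mu[m] = seen_means[upper][m] - margins[m]
    return mu

def classify(x, means, cov):
    """Pick the category whose Gaussian gives x the highest likelihood."""
    return max(means, key=lambda c: gaussian_log_likelihood(x, means[c], cov))

# Hypothetical seen-category means in a 2-attribute score space.
seen_means = {"A": np.array([0.0, 0.0]),
              "B": np.array([2.0, 1.0]),
              "C": np.array([4.0, 3.0])}
stack = np.array(list(seen_means.values()))
margins = np.diff(np.sort(stack, axis=0), axis=0).mean(axis=0)  # average gap per attribute
default_mean = stack.mean(axis=0)

means = dict(seen_means)
means["U"] = unseen_mean({0: ("A", "B"), 1: ("B", "C")}, seen_means, margins, default_mean)
means["V"] = unseen_mean({0: ("C", None)}, seen_means, margins, default_mean)  # uses one attribute only
cov = 0.25 * np.eye(2)  # shared covariance (assumption)

label_u = classify(np.array([1.1, 1.9]), means, cov)
label_v = classify(np.array([5.8, 1.5]), means, cov)
print(label_u, label_v)
```

Category "V" is described with a single attribute and a single seen category, which mirrors the poster's notes that a description need not use all attributes or relate to all seen categories.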
Figure: Example descriptions