Connected Segmentation Tree -- A Joint Representation of Region Layout and Hierarchy
Narendra Ahuja and Sinisa Todorovic
{n-ahuja, sintod}@uiuc.edu
CVPR 2008

MOTIVATION
Objects in the 3D world:
- are comprised of other objects = parts
- have different materials on their surfaces = structure of surface materials
- have distinct locations and finite volume = are cohesive

Their image projections = regions of contiguous pixels = 2D objects, with:
- photometric properties: color, texture
- geometric properties: shape, area
- spatial layout and hierarchy of 2D objects = layout and embedding of subregions within regions
PROBLEM STATEMENT
GIVEN images, each containing >= 0 occurrences of an object category
IDENTIFY all regions occupied by the category in the image set
LEARN a model of the category that JOINTLY captures:
- Geometric properties (e.g., shape, area, relative displacements)
- Photometric properties (e.g., intensity contrast along the boundary)
- Embedding (or containment) relationships
- Neighbor relationships and their strength
of the regions identified to represent category occurrences.
GIVEN a new image, use the learned category model to
DETECT, RECOGNIZE, SEGMENT any occurrences of the category.
PRIOR WORK vs. PROPOSED OBJECT REPRESENTATION

Modeling properties of regions occupied by 2D objects:

                              photometric   geometric   spatial layout   embedding
bag of keypoints                   ✔            ✘             ✘              ✘
planar-graph models                ✔            ✔             ✔              ✘
hierarchical models                ✔            ✔             ✘              ✔
Connected Segmentation Tree        ✔            ✔             ✔              ✔
- bag of keypoints: no spatial information
- planar-graph models: preselected, same-scale parts
- hierarchical models, e.g., the segmentation tree [1, 2]: different layouts yield the same segmentation tree
- Proposed model: Connected Segmentation Tree = hierarchy of adjacency graphs: different layouts yield different connected segmentation trees
OVERVIEW OF OUR APPROACH
1) Images = Connected segmentation trees (CSTs) ⇒ Similar objects = Similar subgraphs (a data-structure sketch follows this list)
2) Similarity defined in terms of region properties:
- Geometric and photometric
- Strengths of region neighbor relationships
and, recursively, the same properties of embedded subregions
3) Similar subgraphs found by max-clique based graph matching
4) Maximally matching subgraphs are fused into a graph union = CST category model
5) Matching the model with the CST of a new image yields:
- Simultaneous recognition and segmentation of all category occurrences
- An explanation of recognition in terms of object parts and their neighbor relationships
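Step 1) can be grounded with a small data-structure sketch in Python; all names here (CSTNode, link_neighbors) are hypothetical illustrations, not the authors' code:

```python
# Minimal sketch of a CST node: the segmentation-tree hierarchy (children)
# plus weighted neighbor links among sibling regions at each level.
from dataclasses import dataclass, field

@dataclass
class CSTNode:
    region_id: int
    properties: dict                                           # e.g., area, shape, boundary contrast
    children: list["CSTNode"] = field(default_factory=list)    # embedded subregions (containment)
    neighbors: dict[int, float] = field(default_factory=dict)  # sibling region_id -> strength

def link_neighbors(a: CSTNode, b: CSTNode, s_ab: float, s_ba: float) -> None:
    # Neighborliness is asymmetric (see below), so each direction
    # carries its own strength.
    a.neighbors[b.region_id] = s_ab
    b.neighbors[a.region_id] = s_ba
```

A category model produced by step 4) would then be a graph union over such nodes, with matched nodes merged and their properties pooled.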
CONTRIBUTIONS -- GENERALIZED VORONOI DIAGRAM
- Standard Voronoi polygons are defined for points, e.g., for individual boundary pixels
- Generalized Voronoi polygon of a region ⇔ union of the standard Voronoi polygons of its boundary pixels
- Regions are neighbors if their generalized Voronoi polygons touch
- Strength of neighborliness = percentage of a Voronoi polygon's perimeter that is shared
CONTRIBUTIONS -- DEFINITION OF REGION NEIGHBORLINESS
1) Regions are exposed to one another through their nearby boundary segments
2) Two regions are called neighbors if their boundary parts are:
- Visible to each other
- Nearby
- Sufficiently far from other region boundaries
3) Relative degrees of boundary exposure to neighbors = strengths of a region's neighborliness (a computational sketch follows this list)
4) Region neighborliness is asymmetric
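A minimal sketch of how these strengths could be computed on a pixel grid, assuming SciPy's Euclidean distance transform; the function name and the `labels` input convention are hypothetical, and the shared border-length fraction below approximates the paper's shared-perimeter percentage:

```python
from collections import Counter
from scipy import ndimage

def voronoi_neighborliness(labels):
    """labels: 2D int array where k > 0 marks pixels of region k and 0 marks
    the free space between regions. Returns {(i, j): strength}, the fraction
    of cell i's internal border shared with cell j (asymmetric)."""
    # Assign every free pixel to its nearest labeled pixel; this tiles the
    # domain into (approximate) generalized Voronoi cells of the regions.
    _, (iy, ix) = ndimage.distance_transform_edt(labels == 0, return_indices=True)
    cells = labels[iy, ix]

    # Count 4-connected pixel pairs straddling a border between two cells.
    border = Counter()
    for a, b in ((cells[:, :-1], cells[:, 1:]), (cells[:-1, :], cells[1:, :])):
        mask = a != b
        for p, q in zip(a[mask], b[mask]):
            border[p, q] += 1
            border[q, p] += 1

    # Strength of i toward j = shared border / i's total internal border.
    total = Counter()
    for (p, _q), n in border.items():
        total[p] += n
    return {(p, q): n / total[p] for (p, q), n in border.items()}
```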
!""""""#
""""""$
⇒
CONTRIBUTIONS -- GENERALIZED MAX-CLIQUE GRAPH MATCHING
[Figure: two example connected segmentation trees to be matched; nodes carry unit weights and edges carry neighborliness weights between 0.1 and 0.4.]
Problem: How to match graphs in which both nodes and edges are weighted?
Properties of the proposed algorithm:
- It matches regions with regions and, separately, region spatial relationships with corresponding relationships
- The maximum common subgraph found by the algorithm preserves the original structure of the input graphs
- It seeks legal region-region and relationship-relationship matches whose unary and pairwise potentials are large
Theorem: Maximum subgraph isomorphism between two graphs with weighted nodes and edges ⇔ Maximum weight clique of their generalized association graph
Given two CST image representations G = (V, E) and G' = (V', E'), find a structure-preserving bijection f = {(v, v')} ⊆ V × V' which maximizes their similarity measure, defined as

S_{GG'} = \max_f \left[ \sum_{(v,v') \in f} \alpha_{vv'} + \sum_{(v,v',u,u') \in f \times f} \beta_{vv'uu'} \right]

- unary potential \alpha_{vv'} = function of region intrinsic properties
- pairwise potential \beta_{vv'uu'} = function of region neighborliness and embedding
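The theorem suggests the following construction, sketched here with hypothetical names: nodes of the association graph are candidate matches (v, v') weighted by the unary potential α, and edges between compatible matches are weighted by the pairwise potential β, so a heavy clique corresponds to a high-similarity bijection f. Exact maximum-weight clique search is NP-hard, so this sketch grows a clique greedily rather than reproducing the paper's algorithm:

```python
def match_weighted_graphs(V, Vp, alpha, beta):
    """V, Vp: node lists of G and G'; alpha(v, vp) -> float;
    beta(v, vp, u, up) -> float, 0 for structurally illegal pairs."""
    candidates = [(v, vp) for v in V for vp in Vp]   # association-graph nodes
    clique, used_v, used_vp = [], set(), set()
    while True:
        best, best_gain = None, 0.0
        for v, vp in candidates:
            if v in used_v or vp in used_vp:         # keep the match f a bijection
                continue
            # Gain = unary potential + pairwise potentials to the current clique.
            gain = alpha(v, vp) + sum(beta(v, vp, u, up) for u, up in clique)
            if gain > best_gain:
                best, best_gain = (v, vp), gain
        if best is None:
            return clique                            # pairs (v, v') of the match f
        clique.append(best)
        used_v.add(best[0]); used_vp.add(best[1])
```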
EXAMPLE: CST MODEL OF WEIZMANN HORSES
(a) original image (b) STs (c) CSTs
Figure 5. Samples from Hoofed Animals (top) and LabelMe (bottom): CSTs outperform STs in detection and segmentation. Undetected image parts are masked out.
Qualitative evaluation – Model: Fig. 6 illustrates the model of category horses, G, learned on six randomly selected training images D from the Weizmann dataset. Nodes v in G are depicted as rectangles that contain those regions in D that got matched with v during learning. As can be seen, the structure of G correctly captures the recursive containment and neighbor relations of regions that comprise the horses' appearances in the training set. For example, nodes "head," "neck," and "mane" are found to be children of the node representing a larger "head&neck" region, and they are all identified as neighbors. Also, G correctly encodes that "head&neck" and "tail" are not neighbors, and that they are positioned to the left and right of "abdomen." Frequently occurring, similar regions representing background may happen to be included in the model if they consistently co-occur in the training set with regions representing the category of interest. Thus, G contains a few nodes corresponding to "fence" identified as a neighbor to the horse's "head." Typically, these background nodes make up a small percentage of the nodes in the model (3–5%), depending on the specific training set.
Quantitative evaluation: Regarding the comparison with prior work, there is very limited past work on segmenting category instances. For Weizmann horses, the best segmentation results are SE = 7% [25] and SE = 18.2% [5]. [25] uses semi-supervised learning, requires training images to contain only horses, and cannot handle occlusion. Using multiple segmentations modeled with LDA, [19] achieves the following segmentation errors: buildings SE = 47%, cars SE = 79%. Fig. 8 and Tables 1–2 show detection, segmentation, and recognition errors of CST. As can be seen, a small increase in the number of negative examples Mn does not degrade performance. As Mn becomes larger, objects belonging to other categories start appearing more frequently in our training set. Therefore, by our basic definition, these objects become the target category. As a result, the algorithm now correctly learns the new category instances, as expected. Thus, with the increase of Mn, the training set becomes inappropriate.
Figure 6. CST-based model of Weizmann horses.
[Figure 7 plots: recall vs. 1-precision on Caltech-101 Faces (left) and Motorbikes (right).]
Figure 7. Recall-precision curves: (H) hierarchical relationships only, (H+N) hierarchical and neighborhood relationships, (H+WN) hierarchical and weighted neighborhood relationships; (10%) and (20%) the size of a rectangular occlusion with respect to image size.
Algorithm:         CST vs. ST   CST vs. SL   CST, 20 vs. 10   CST, 30 vs. 10
# positive imgs:       10           10          20 vs. 10        30 vs. 10
Faces                 8.5%         3.4%           1.7%             2.5%
Horses                9.5%         4.1%           2.8%             2.9%
Table 2. Increase in the area under the RPC.
Increasing the number of positive training examples yields higher recognition recall and precision. The performance of our approach also has desirable invariance properties.
training set with positive and negative examples
sample parts and their spatial relations in the learned CST model
containment
neighborliness
REFERENCES
[1] N. Ahuja and S. Todorovic, "Learning the taxonomy and models of categories present in arbitrary images," in ICCV 2007.
[2] S. Todorovic and N. Ahuja, "Unsupervised category modeling, recognition and segmentation in images," IEEE TPAMI, 2008.

ACKNOWLEDGMENT
The support of the National Science Foundation under grant NSF IIS 07-43014 is gratefully acknowledged.
RESULTS
UIUC Hoofed Animals Dataset [1] (http://vision.ai.uiuc.edu/~sintod/datasets.html)
Simultaneous recognition and segmentation of category "cow," represented by the CST model (middle row) and the segmentation-tree (ST) model [1, 2] (bottom row): CSTs outperform STs.
CST
ST
In Proc. IEEE Conf. Computer Vision and Pattern Recognition (CVPR), Anchorage, AK, June 2008.
             LabelMe    LabelMe     LabelMe    Weizmann
             Trees      Buildings   Cars       Horses     Horses      Cows       Deer       Sheep      Goats      Camels
Recall       47.6±6.9   92.6±6.9    67.6±6.9   91.9±5.2   81.2±10.3   78.4±4.2   88.1±6.9   81.2±5.3   78.2±8.6   89.9±7.2
Seg. error   41.6±7.9   34.6±13.4   32.5±8.2   7.2±2.5    15.9±5.3    17.1±4.6   11.1±8.4   24.8±7.2   20.1±8.1   11.5±5.1
Rec. error   19.7±3.8   11.6±2.9    12.9±4.8   7.9±4.1    7.8±4.2     6.5±6.2    7.7±3.4    7.8±4.1    12.2±5.4   3.2±3.9
Table 1. Detection recall, segmentation and recognition errors (in %) using the same number of training and test images as in [16, 22, 5].
Characteristics of recognition and segmentation:
1) Invariant to translation and in-plane rotation
2) Robust against a certain degree of:
- Scale variations
- Illumination changes
- Object articulation
- Partial occlusion
- Background clutter
3) Segmentation remains good even when object boundaries are:
- Jagged
- Blurred
- Forming complex topologies
[Figure 7 plots: (left) recall vs. 1-precision on Caltech-101 Faces+Planes+Mbikes+Cars for CST, CST-unweight, and ST, with and without 20% occlusion; (right) recognition accuracy on Caltech-101 (4 categories) vs. the number of training images Mn+Mp.]
Figure 7. (left) Detection recall-precision curves: "CST-unweight" means that edges in CST are not weighted. 20% is the size of a rectangular occlusion w.r.t. the image size. Mp=10, Mn=10. ST is the method of [20]. (right) Recognition accuracy of CST and ST for the varying ratio of Mp and Mn in the training set.
6. Conclusion
We have presented what we believe is the first attempt to jointly learn a canonical model of an object in terms of photometric and geometric properties, and containment and neighbor relationships of its parts. As other fundamental contributions, the paper proposes: (1) a generalized Voronoi diagram to represent region adjacency; and (2) a new max-clique based algorithm for matching weighted graphs, and thus learning the model.
References
[1] N. Ahuja. Dot pattern processing using Voronoi neighborhoods. IEEE TPAMI, 4(3):336–343, 1982.
[2] N. Ahuja. A transform for multiscale image segmentation by integrated edge and region detection. IEEE TPAMI, 18(12):1211–1235, 1996.
[3] N. Ahuja and S. Todorovic. Extracting texels in 2.1D natural textures. In ICCV, 2007.
[4] G. Bouchard and B. Triggs. Hierarchical part-based visual object categorization. In CVPR, vol. 1, pp. 710–715, 2005.
[5] L. Cao and L. Fei-Fei. Spatially coherent latent topic model for concurrent object segmentation and classification. In ICCV, 2007.
[6] G. Carneiro and D. Lowe. Sparse flexible models of local features. In ECCV, vol. 3, pp. 29–43, 2006.
[7] H. Chen, Z. J. Xu, Z. Q. Liu, and S. C. Zhu. Composite templates for cloth modeling and sketching. In CVPR, vol. 1, pp. 943–950, 2006.
[8] D. Crandall, P. Felzenszwalb, and D. Huttenlocher. Spatial priors for part-based recognition using statistical models. In CVPR, vol. 1, pp. 10–17, 2005.
[9] R. Fergus, P. Perona, and A. Zisserman. Object class recognition by unsupervised scale-invariant learning. In CVPR, vol. 2, pp. 264–271, 2003.
[10] S. Fidler and A. Leonardis. Towards scalable representations of object categories: Learning a hierarchy of parts. In CVPR, 2007.
[11] R. Glantz, M. Pelillo, and W. G. Kropatsch. Matching segmentation hierarchies. Int. J. Pattern Rec. Artificial Intelligence, 18(3):397–424, 2004.
[12] Y. Jin and S. Geman. Context and hierarchy in a probabilistic image model. In CVPR, vol. 2, pp. 2145–2152, 2006.
[13] A. Levinshtein, C. Sminchisescu, and S. Dickinson. Learning hierarchical shape models from examples. In EMMCVPR, Springer LNCS, vol. 3757, pp. 251–267, 2005.
[14] B. Ommer and J. M. Buhmann. Learning the compositional nature of visual objects. In CVPR, 2007.
[15] M. Pelillo, K. Siddiqi, and S. W. Zucker. Matching hierarchical structures using association graphs. IEEE TPAMI, 21(11):1105–1120, 1999.
[16] B. C. Russell, A. A. Efros, J. Sivic, W. T. Freeman, and A. Zisserman. Using multiple segmentations to discover objects and their extent in image collections. In CVPR, vol. 2, pp. 1605–1604, 2006.
[17] T. B. Sebastian, P. N. Klein, and B. B. Kimia. Recognition of shapes by editing their shock graphs. IEEE TPAMI, 26(5):550–571, 2004.
[18] A. Shokoufandeh, L. Bretzner, D. Macrini, M. Demirci, C. Jonsson, and S. Dickinson. The representation and matching of categorical shape. Computer Vision Image Understanding, 103(2):139–154, 2006.
[19] K. Siddiqi, A. Shokoufandeh, S. J. Dickinson, and S. W. Zucker. Shock graphs and shape matching. Int. J. Comput. Vision, 35(1):13–32, 1999.
[20] S. Todorovic and N. Ahuja. Extracting subimages of an unknown category from a set of images. In CVPR, vol. 1, pp. 927–934, 2006.
[21] A. Torsello and E. R. Hancock. Computing approximate tree edit distance using relaxation labeling. Pattern Recogn. Lett., 24(8):1089–1097, 2003.
[22] J. Winn and N. Jojic. LOCUS: Learning object classes with unsupervised segmentation. In ICCV, vol. 1, pp. 756–763, 2005.
Poster figure annotations: random occlusion; edges without weights.
Quantitative evaluation: Fig. 7 (left) presents the recall-precision curves (RPC) of detection for the Caltech-101 categories using CSTs and STs. Detection performance in the presence of occlusion is tested by masking out a randomly selected rectangular area in the image, and replacing this area with a patch from the background category of Caltech-101. CST increases the area under the RPC of ST by 6.5%, and by 3.1% in the presence of the occluding patch covering 20% of the image. Invariance to in-plane rotation is tested by randomly rotating test images, as illustrated in Fig. 1b. Performance on these rotated images is the same as that presented in Fig. 7. Measuring the strength of neighborliness using the generalized Voronoi diagram improves performance over the case when the weights of links in the CST are set to take only values 1 or 0, referred to as CST-unweight. CST increases the area under the RPC of CST-unweight by 2.2%. Fig. 7 (right) shows the recognition accuracy of CST and ST. A small increase in Mn does not degrade the accuracy. As Mn becomes larger, objects belonging to other categories start appearing more frequently, and thus get learned, making the training set inappropriate. Increasing Mp yields smaller recognition error. CST outperforms ST in recognition, and maintains high accuracy longer with the increase of Mn. Table 1 summarizes detection recall, and segmentation and recognition errors obtained for the equal error rates on the LabelMe and Hoofed Animals datasets. For Hoofed Animals, CST outperforms ST in detection recall by 7.5%, in segmentation by 10.7%, and in recognition by 8.6%. For comparison, we obtained SE=6.5% on the relatively simple UIUC (multiscale) car dataset, using the same set-up as in [10], while their result is SE=7.9%. The other hierarchical approaches cited here use non-benchmark datasets, or report a single retrieval result for the entire Caltech-101, beyond the focus of this paper. Non-hierarchical approaches to modeling object categories, which ignore the containment of object parts and use image segments obtained at only one pre-selected segmentation scale, report the following results: [16] – SE=47% for buildings and SE=79% for cars of LabelMe; [22] – SE=7% for Weizmann horses; and [5] – SE=18.2% for Weizmann horses, and recognition accuracy of 94.6% for the four categories of Caltech-101. CSTs yield better or, in only a few cases, very similar performance.

The presented results demonstrate that our approach is invariant with respect to: (i) translation, in-plane rotation, and object articulation, since the CST itself is invariant to these changes; (ii) a certain degree of scale changes, since matching is based on relative properties of regions; (iii) occlusion in the training and test sets, since the graph union registers the entire (unoccluded) category structure from partial views of occurrences in the training set, while subgraphs of visible object parts in the CST of a test image can still be matched with the model; (iv) minor depth rotations of objects causing their shape deformations, because structural instability of CSTs (e.g., due to region splits/mergers) is accounted for during matching; and (v) clutter, since clutter regions are not frequent and thus not learned.
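The area-under-RPC comparisons reported above can be reproduced from sampled curves by simple trapezoidal integration; a minimal sketch with a hypothetical helper, assuming recall is sampled against 1-precision:

```python
import numpy as np

def area_under_rpc(one_minus_precision, recall):
    """Trapezoidal area under a recall vs. (1 - precision) curve,
    e.g., for comparing CST against ST as in Fig. 7 (left)."""
    order = np.argsort(one_minus_precision)          # ensure increasing x
    return np.trapz(np.asarray(recall)[order],
                    np.asarray(one_minus_precision)[order])
```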
LabelMe Dataset
Simultaneous recognition and segmentation of category "car" (columns: original, ST, CST).