
Model-Based Recognition of 3D Objects from Single Images

Isaac Weiss, Member, Computer Society, and Manjit Ray

Abstract—In this work, we treat major problems of object recognition which have received relatively little attention lately. Among them are the loss of depth information in the projection from a 3D object to a single 2D image, and the complexity of finding feature correspondences between images. We use geometric invariants to reduce the complexity of these problems. There are no geometric invariants of a projection from 3D to 2D. However, given certain modeling assumptions about the 3D object, such invariants can be found. The modeling assumptions can be either a particular model or a generic assumption about a class of models. Here, we use such assumptions for single-view recognition. We find algebraic relations between the invariants of a 3D model and those of its 2D image under general projective projection. These relations can be described geometrically as invariant models in a 3D invariant space, illuminated by invariant "light rays," and projected onto an invariant version of the given image. We apply the method to real images.

Index Terms—Object recognition, invariance, model-based.


1 INTRODUCTION

This paper concentrates on the use of invariant relations between 3D objects and 2D images for object recognition. In contrast, almost all the work on invariants so far has been concerned with transformations between spaces of equal dimensionality, e.g., [40], [41], [24]. In the single-view case, invariants were found for the projection of a planar shape onto the image, although the planar shape was embedded in 3D. For real 3D objects, most of the work has involved multiple views with known correspondence, which amounts to a 3D to 3D projection. Yet, humans have little problem recognizing a 3D object from a single 2D image.

This recognition ability cannot be based on pure geometry, since it has been shown (e.g., [6], [20]) that there are no geometric invariants of a projection from 3D to 2D. Thus, when we only have 2D geometric information, we need to use some modeling assumptions to recover the 3D shape.

There are several possibilities for a modeling assumption. The simplest one is having a library of specific 3D models. In theory, there could be many models that project into the same image, so an object cannot be identified uniquely. In practice, however, only one or very few models in the database can project to the same image, so it is possible to recognize them.

Another possibility is to have more generic assumptions, rather than specific models. One such assumption can be that the visible object is symmetric in 3D. More general assumptions for curved objects were studied in [45]. A general analysis of modeling assumptions in this context was done in [20]. In this paper, we deal mainly with the two assumptions mentioned above, namely specific models and symmetry. To a lesser extent, we use other assumptions, such as that a vanishing point in a 2D image indicates parallel lines in 3D.

The outline of our recognition method is as follows: 1) Modeling: We define a 3D invariant space, namely, a space with three invariant coordinates $I_1, I_2, I_3$. Given a 3D model, we can extract a set of such invariant triplets from it, so it can be represented as a set of points in the invariant space. 2) Matching: Given an image of the model, the depth information is lost so the invariant point cannot be recovered. However, we show that we can draw a set of "invariant light rays" in 3D, each ray passing through a 3D invariant model point (Fig. 1). When enough rays intersect the model points in the 3D invariant space, we can safely assume that the model is indeed the one visible in the image. We do not need to search for feature correspondences. We can also see that the rays converge at a point in the invariant space that represents the location of the camera center with respect to the model. Thus, it is easy to find the pose of the model. Given this, we can project the original (noninvariant) model onto the image. That makes it possible to perform a more exact match between the model projection and the given image of the object.

In summary, the invariant modeling assumption and object descriptors make it possible to perform recognition regardless of viewpoint and with no need for a search for feature correspondence.

The use of modeling for shape recovery from single images is of course not new. However, most of the earlier work was not concerned with viewpoint invariance. Some recent research does use invariance in modeling. However, most of it uses very specific modeling assumptions that cannot be applied to general shapes. A major example is the assumption that the objects are composed of "generalized cylinders" [4], [46] of various forms. The invariance and generality of this assumption are limited.


. I. Weiss is with the Center for Automation Research, University of Maryland, College Park, MD 20742-3275. E-mail: [email protected].

. M. Ray is with Siemens Medical Systems, Nuclear Medicine Group, 2501 North Barrington Rd., Hoffman Estates, IL 60195. E-mail: [email protected].

Manuscript received 15 Oct. 1998; revised 2 Aug. 1999; accepted 15 Sept. 2000. Recommended for acceptance by D.J. Kriegman. For information on obtaining reprints of this article, please send e-mail to: [email protected], and reference IEEECS Log Number 108052.



A subset of this example is the assumption that the object is a surface of revolution [47]. Another assumption is that various corners visible in the image are right-angled in 3D [13]. Yet another approach is to assume that it is sufficient to characterize an object by special local properties, such as tangencies and inflection points [38]. Of course, this ignores the information available between the inflection points.

The above examples represent only a small part of the quite active research on the use of invariants in recognition. However, unlike our interest here, most of the recent activity is concerned not with single images but with multiple images. These methods can recover the 3D geometry without modeling. However, they require knowledge of the correspondence between features in the different images. Correspondence is a difficult and generally unsolved problem, leading to a high-dimensional search space. Among such methods are those using the trilinear tensor [29], [2], [23]. They require finding a substantial number of corresponding points and lines in at least three images. This can be accomplished reliably only for very small disparities. Much of the multiple image work is intended for camera calibration, e.g., [10], [19], [37], rather than object recognition. Techniques for multiple images have been applied to a single image, when "repetitive" or symmetric structures can be found in it [9], [47], [28]. These parts can be viewed as separate images, but the correspondence between their features still needs to be found. We can deal with both symmetric and nonsymmetric objects in a single image, without correspondence.

Notable related work on recognition from multiple images includes alignment [35], which represents a 3D object as a linear combination of its 2D projections, and a method of recovering 3D Euclidean invariants from a sequence of images [39], [8], [21]. An invariant relation involving a single image was also derived there, but it did not allow recovery of the 3D information from the image. Three-dimensional structure and motion was also recovered from an image stream by [34] using the Singular Value Decomposition method. Point correspondence was used in all of these methods.

Methods for multiple images can be useful for our purpose during the modeling stage. In this paper, we rely on given 3D models to provide 3D invariants. This raises the question of how such models can be obtained. If the coordinates of feature points of an object are known in some coordinate frame, then it is a simple matter to find the 3D model invariants. However, quite often an object is known only from 2D images and, in this case, we can employ various methods of using multiple images to provide the 3D invariants of the model. For the matching stage, we need the 2D-3D invariant relations derived here. These relations can also be used for finding 3D model invariants, as well as the methods mentioned above.

Work that directly connects 3D and 2D quantitative invariants of single images was done in [14] and [32]. That work considered affine point-sets only, while here we derive the projective case as well. The latter reference derived more concise mathematical expressions than the former, earlier work, but was not implemented. Another set of relations was derived in [7], but these contained the camera position vector and are thus not fully invariant.

There is also considerable research on curves. Differential projective geometry was used for plane curves in, e.g., [42], [24], [43], [5], [25], [36]. More specialized approaches, including the use of pairs of coplanar conics, can be found in [31], [33], [3], [26], [18], [15]. More general 3D curves were studied in [45].

2 POINT SET INVARIANTS

Here, we describe our method of connecting 3D and 2D invariants by applying it to point sets. We rederive all the results in [32] in a much simpler way, using elementary algebra rather than algebraic geometry. We then add the new projective case.

2.1 General Dependencies among Invariants

We denote 3D world homogeneous coordinates by X, and 2D image coordinates by x. We start with five points $X_i$, $i = 1, \ldots, 5$ in 3D space, of which at least the first four are not coplanar. They are projected into $x_i$ in the image. The correspondence is assumed to be known for now. In a 3D projective or affine space, five points cannot be linearly independent. We can express the fifth point as a linear combination of the first four:

$$X_5 = aX_1 + bX_2 + cX_3 + dX_4. \qquad (1)$$

In the projective case, the coefficients a, b, c, d are not uniquely determined because the point coordinates can be multiplied by an arbitrary factor. In the affine case, the coefficients are constrained by the requirement that the fourth homogeneous coordinate is always 1, again leaving only three independent coefficients. Because the projection from 3D to 2D is linear (in homogeneous coordinates), the same dependence (1) holds in 2D:

$$x_5 = ax_1 + bx_2 + cx_3 + dx_4.$$

Since determinants are relative invariants of a projective or affine transformation, we look at the determinants formed by these points in both 3D and 2D. Any four of the five points in 3D, expressed in four homogeneous coordinates, can form a determinant $M_i$. We can give the determinant the same index as the fifth point that was left out. For example,


Fig. 1. Invariant space.


$$M_1 = |X_2, X_3, X_4, X_5|.$$

Similarly, in the 2D projection, any three of the five points can form a determinant $m_{ij}$, with indices equal to those of the points that were left out, e.g.,

$$m_{12} = |x_3, x_4, x_5|.$$

Since the points are not independent, neither are the determinants. Substituting the linear dependence (1) in $M_1$ above, we obtain

$$M_1 = a|X_2, X_3, X_4, X_1| + b|X_2, X_3, X_4, X_2| + c|X_2, X_3, X_4, X_3| + d|X_2, X_3, X_4, X_4|.$$

As is well-known, a determinant with two equal columns vanishes. Also, when columns are interchanged in a determinant, the sign of the determinant is reversed. Therefore, we obtain

$$M_1 = a|X_2, X_3, X_4, X_1| = -a|X_1, X_2, X_3, X_4| = -aM_5.$$

Similarly, for the other determinants, with a simplified notation:

$$M_2 = |1, 3, 4, 5| = b|1, 3, 4, 2| = b|1, 2, 3, 4| = bM_5$$
$$M_3 = |1, 2, 4, 5| = c|1, 2, 4, 3| = -c|1, 2, 3, 4| = -cM_5$$
$$M_4 = |1, 2, 3, 5| = d|1, 2, 3, 4| = dM_5.$$

The coefficients a, b, c, d can now be expressed as invariants, using the above relations:

$$a = -\frac{M_1}{M_5}, \quad b = \frac{M_2}{M_5}, \quad c = -\frac{M_3}{M_5}, \quad d = \frac{M_4}{M_5}. \qquad (2)$$

This is the standard solution for a, b, c, d as unknowns in (1). Similar relations hold in the 2D projection:

$$m_{12} = |3, 4, 5| = a|3, 4, 1| + b|3, 4, 2| = a|1, 3, 4| + b|2, 3, 4| = am_{25} + bm_{15} \qquad (3)$$
$$m_{13} = |2, 4, 5| = a|2, 4, 1| + c|2, 4, 3| = am_{35} - cm_{15} \qquad (4)$$
$$m_{14} = |2, 3, 5| = a|2, 3, 1| + d|2, 3, 4| = am_{45} + dm_{15}. \qquad (5)$$

All other relations are linearly dependent on these.

2.2 Relation between 3D and 2D Invariants—Affine Case

In the affine case, the coefficients a, b, c, d are absolute invariants. Therefore, we can substitute the a, b, c, d found in 3D, (2), directly into the 2D equations above. We obtain three relations between the 3D and 2D invariants:

$$M_5 m_{12} + M_1 m_{25} - M_2 m_{15} = 0 \qquad (6)$$
$$M_5 m_{13} + M_1 m_{35} - M_3 m_{15} = 0 \qquad (7)$$
$$M_5 m_{14} + M_1 m_{45} - M_4 m_{15} = 0. \qquad (8)$$

These relations are obviously invariant to any affine transformation in both 3D and 2D. A 3D transformation will merely multiply all the $M_i$ by the same constant factor, which drops out of the equations. A 2D affine transformation multiplies all the $m_{ij}$ by the same constant factor, which again drops out. However, in the projective case, each point can be independently multiplied by an arbitrary factor $\lambda_i$, which does not in general drop out. Thus, the above relations are affine but not projective invariant.

The above relations are linearly dependent, so that only two of them are meaningful. To see this, we first note a relationship between the $M_i$ which exists only in the affine case. We can write a determinant involving all five points as

$$\begin{vmatrix} x_1 & x_2 & x_3 & x_4 & x_5 \\ y_1 & y_2 & y_3 & y_4 & y_5 \\ z_1 & z_2 & z_3 & z_4 & z_5 \\ 1 & 1 & 1 & 1 & 1 \\ 1 & 1 & 1 & 1 & 1 \end{vmatrix} = 0.$$

The $M_i$ are minors of this determinant, so we can write the above equation as

$$M_1 - M_2 + M_3 - M_4 + M_5 = 0. \qquad (9)$$

This is equivalent to writing a + b + c + d = 1 in (1), which ensures that the last coordinate equals 1.

Similar relations can be derived in 2D. We have

$$\begin{vmatrix} x_2 & x_3 & x_4 & x_5 \\ y_2 & y_3 & y_4 & y_5 \\ 1 & 1 & 1 & 1 \\ 1 & 1 & 1 & 1 \end{vmatrix} = 0,$$

leading to the relation

$$m_{12} - m_{13} + m_{14} - m_{15} = 0.$$

Similarly, from the determinant involving points 1, 2, 3, 4, we obtain the relation

$$m_{15} - m_{25} + m_{35} - m_{45} = 0.$$

We now look at the following linear combination of the invariant relations, (6), (7), and (8):

$$(6) - (7) + (8) = M_5(m_{12} - m_{13} + m_{14}) + M_1(m_{25} - m_{35} + m_{45}) + (-M_2 + M_3 - M_4)m_{15} = 0.$$

Using the two relations above between the $m_{ij}$, we obtain

$$m_{15}(M_1 - M_2 + M_3 - M_4 + M_5) = 0,$$

which is an identity, due to the relation (9) between the $M_i$. Thus, only two invariant relations, say (6), (7), are independent.

Similar results for this particular case can be obtained using Grassmannians, Schubert cycles, and wedge-products [32]. These methods are hard to extend much beyond this point.

2.3 Relation between 3D and 2D Absolute Affine Invariants

It is easy now to derive the relation between the 3D and 2D absolute invariants. We define the 3D absolute invariants

$$I_1 = \frac{M_1}{M_5}, \quad I_2 = \frac{M_2}{M_5}, \quad I_3 = \frac{M_3}{M_5}$$


and the 2D absolute invariants

$$i_{12} = \frac{m_{12}}{m_{15}}, \quad i_{13} = \frac{m_{13}}{m_{15}}, \quad i_{25} = \frac{m_{25}}{m_{15}}, \quad i_{35} = \frac{m_{35}}{m_{15}}$$

and obtain the following theorem:

Theorem 1. Given five points $X_i$ in 3D, at least four of which are noncoplanar, the relation between their 3D and 2D absolute affine invariants is given by

$$i_{12} + I_1 i_{25} - I_2 = 0 \qquad (10)$$
$$i_{13} + I_1 i_{35} - I_3 = 0. \qquad (11)$$

Proof. Divide (6), (7) by $M_5$ and $m_{15}$ (assuming these do not vanish). □

Given the 2D projection of a five-point set, we have thus obtained two equations for the three unknown 3D invariants $I_1, I_2, I_3$. Since all three invariants are needed to recover the five points in 3D (up to an affine transformation), we can recover the 3D quintuple only up to one free parameter.

A geometric interpretation and applications are described later.
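The relations of Theorem 1 are easy to check numerically. The following is a minimal illustrative sketch, not the implementation used in our experiments; it assumes a random five-point configuration and a random 2D affine projection, and the helper names M and m simply mirror the determinant notation above.

```python
import numpy as np

rng = np.random.default_rng(0)

X = rng.uniform(-1.0, 1.0, size=(5, 3))      # five generic 3D points (first four noncoplanar)
A = rng.uniform(-1.0, 1.0, size=(2, 3))      # random 2x3 affine projection
t = rng.uniform(-1.0, 1.0, size=2)
x = X @ A.T + t                               # the five projected 2D points

Xh = np.hstack([X, np.ones((5, 1))])          # homogeneous 3D points (rows)
xh = np.hstack([x, np.ones((5, 1))])          # homogeneous 2D points (rows)

def M(i):
    # 4x4 determinant of the four 3D points with point i (1-based) left out
    keep = [k for k in range(5) if k != i - 1]
    return np.linalg.det(Xh[keep].T)

def m(i, j):
    # 3x3 determinant of the three 2D points with points i and j left out
    keep = [k for k in range(5) if k not in (i - 1, j - 1)]
    return np.linalg.det(xh[keep].T)

I1, I2, I3 = M(1) / M(5), M(2) / M(5), M(3) / M(5)      # 3D absolute affine invariants
i12, i13 = m(1, 2) / m(1, 5), m(1, 3) / m(1, 5)         # 2D absolute affine invariants
i25, i35 = m(2, 5) / m(1, 5), m(3, 5) / m(1, 5)

print(i12 + I1 * i25 - I2)    # relation (10): should be ~0
print(i13 + I1 * i35 - I3)    # relation (11): should be ~0
```

Both printed quantities should vanish to within floating-point error for any affine projection and any affine transformation of the 3D points.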

2.4 Points and Directions

Rather than dealing with a point, we can deal with a direction, namely a unit vector in 3D pointing in a certain direction. This can represent a direction of a major symmetry axis of the model or a direction of the camera axis.

A direction in 3D is equivalent to a point on the infinite plane and can be written as (x, y, z, 0), i.e., with a vanishing fourth coordinate. It will remain at infinity under an affine transformation. Thus, the derivation leading to (6), (7), (8) remains valid. However, (9) needs to be modified. For a direction $X_4$, we can write the determinant

$$\begin{vmatrix} x_1 & x_2 & x_3 & x_4 & x_5 \\ y_1 & y_2 & y_3 & y_4 & y_5 \\ z_1 & z_2 & z_3 & z_4 & z_5 \\ 1 & 1 & 1 & 0 & 1 \\ 1 & 1 & 1 & 0 & 1 \end{vmatrix} = 0,$$

in which the $M_i$ are the minors, and we immediately obtain

$$M_1 - M_2 + M_3 + M_5 = 0.$$

The constraints on the $m_{ij}$ also change, but it is easy to show that we again obtain a linear dependency among (6), (7), (8). Thus, Theorem 1 is still valid. A direction vector can be multiplied by an arbitrary constant, but $X_4, x_4$ are common to all terms in the equations and, thus, their factors drop out.

2.5 The Projective Case

Our method is easily extended to the projective case, yielding new simple results such as Theorem 2 below. This case is more difficult because, in "real" (Cartesian) coordinates, the projective transformation is nonlinear, causing previous approaches [1], [17] to be quite cumbersome.

We start with the 3D quantities. To obtain invariants, we need at least six points, having three projective invariants. We now have two linear dependencies rather than one:

$$\lambda_5 X_5 = a\lambda_1 X_1 + b\lambda_2 X_2 + c\lambda_3 X_3 + d\lambda_4 X_4$$
$$\lambda_6 X_6 = a'\lambda_1 X_1 + b'\lambda_2 X_2 + c'\lambda_3 X_3 + d'\lambda_4 X_4,$$

with the $\lambda_i$ being arbitrary scalar factors. We now have two sets of four equations with four unknowns each. The two sets of unknowns are $a\lambda_1/\lambda_5, \ldots, d\lambda_4/\lambda_5$ and $a'\lambda_1/\lambda_6, \ldots, d'\lambda_4/\lambda_6$. The solutions are similar to (2):

$$a\frac{\lambda_1}{\lambda_5} = -\frac{M_1}{M_5}, \quad b\frac{\lambda_2}{\lambda_5} = \frac{M_2}{M_5}, \quad c\frac{\lambda_3}{\lambda_5} = -\frac{M_3}{M_5}, \quad d\frac{\lambda_4}{\lambda_5} = \frac{M_4}{M_5} \qquad (12)$$

$$a'\frac{\lambda_1}{\lambda_6} = -\frac{M'_1}{M_5}, \quad b'\frac{\lambda_2}{\lambda_6} = \frac{M'_2}{M_5}, \quad c'\frac{\lambda_3}{\lambda_6} = -\frac{M'_3}{M_5}, \quad d'\frac{\lambda_4}{\lambda_6} = \frac{M'_4}{M_5}, \qquad (13)$$

with $M'_i$ denoting determinants in which $X_5$ is replaced by $X_6$. Unlike the affine case, these solutions are not invariant.

However, we can find cross-ratios of them which are absolute projective invariants, i.e., cross-ratios that eliminate all the $\lambda_i$:

$$I_1 = \frac{ab'}{a'b} = \frac{M_1 M'_2}{M'_1 M_2}, \quad I_2 = \frac{ac'}{a'c} = \frac{M_1 M'_3}{M'_1 M_3}, \quad I_3 = \frac{ad'}{a'd} = \frac{M_1 M'_4}{M'_1 M_4}. \qquad (14)$$

We turn now to the 2D quantities. Our unknown quantities can be written in terms of 2D relative invariants, similarly to (3), (4), (5), with the first unknown being a free parameter $\alpha$:

$$a\frac{\lambda_1}{\lambda_5} = \alpha, \quad b\frac{\lambda_2}{\lambda_5} = \frac{1}{m_{15}}(m_{12} - \alpha m_{25}), \quad c\frac{\lambda_3}{\lambda_5} = -\frac{1}{m_{15}}(m_{13} - \alpha m_{35}), \quad d\frac{\lambda_4}{\lambda_5} = \frac{1}{m_{15}}(m_{14} - \alpha m_{45}).$$

Similar equations hold for $a', b', c', d'$, with $\alpha'$, $m'_{ij}$, $\lambda_6$ replacing $\alpha$, $m_{ij}$, $\lambda_5$.

We can now eliminate the $\lambda_i$ on the left-hand sides above using the same cross-ratios as in the 3D case, (14). We obtain

$$I_1 = \frac{ab'}{a'b} = \frac{\alpha(m'_{12} - \alpha' m_{25})}{\alpha'(m_{12} - \alpha m_{25})}$$
$$I_2 = \frac{ac'}{a'c} = \frac{\alpha(m'_{13} - \alpha' m_{35})}{\alpha'(m_{13} - \alpha m_{35})}$$
$$I_3 = \frac{ad'}{a'd} = \frac{\alpha(m'_{14} - \alpha' m_{45})}{\alpha'(m_{14} - \alpha m_{45})}.$$

This is simply a quadric surface in invariant space, parametrized by $\alpha, \alpha'$:


$$\alpha m'_{12} - I_1 \alpha' m_{12} + \alpha\alpha'(I_1 - 1)m_{25} = 0$$
$$\alpha m'_{13} - I_2 \alpha' m_{13} + \alpha\alpha'(I_2 - 1)m_{35} = 0$$
$$\alpha m'_{14} - I_3 \alpha' m_{14} + \alpha\alpha'(I_3 - 1)m_{45} = 0.$$

It is easy to eliminate the terms proportional to $\alpha, \alpha'$ (e.g., by Gaussian elimination). We are left with one relation proportional to $\alpha\alpha'$, which we divide by $\alpha\alpha' m_{12} m'_{13} m'_{14} m_{25}$. Defining now the 2D projective invariant cross-ratios

$$i_1 = \frac{m'_{12} m_{14}}{m_{12} m'_{14}}, \quad i_2 = \frac{m'_{12} m_{35}}{m'_{13} m_{25}}, \quad i_3 = \frac{m'_{12} m_{13}}{m_{12} m'_{13}}, \quad i_4 = \frac{m'_{12} m_{45}}{m'_{14} m_{25}},$$

we finally obtain:

Theorem 2. The 3D absolute projective invariants of six generic points in 3D are related to the corresponding 2D invariants by

$$I_3(I_2 - 1)i_1 i_2 - I_3(I_1 - 1)i_1 - I_1(I_2 - 1)i_2 = I_2(I_3 - 1)i_3 i_4 - I_2(I_1 - 1)i_3 - I_1(I_3 - 1)i_4. \qquad (15)$$

The above theorem gives only one relation between the three 3D invariants. To get two relations as in the affine case, we need seven points.
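As a quick numerical sanity check on the 3D side, the cross-ratios of (14) should be unchanged by an arbitrary 3D projective transformation and by arbitrary rescaling of each homogeneous point. The sketch below illustrates this; it is not code from our implementation, and the helper names are introduced here only for the example.

```python
import numpy as np

rng = np.random.default_rng(1)

def det4(Ph, idx):
    # determinant of the four homogeneous points with the given 1-based indices
    return np.linalg.det(Ph[[i - 1 for i in idx]].T)

def invariants(Ph):
    # cross-ratio invariants of (14); M_i leaves out point i from {1..5},
    # M'_i does the same with point 6 substituted for point 5
    M1  = det4(Ph, (2, 3, 4, 5));  M2  = det4(Ph, (1, 3, 4, 5))
    M3  = det4(Ph, (1, 2, 4, 5));  M4  = det4(Ph, (1, 2, 3, 5))
    M1p = det4(Ph, (2, 3, 4, 6));  M2p = det4(Ph, (1, 3, 4, 6))
    M3p = det4(Ph, (1, 2, 4, 6));  M4p = det4(Ph, (1, 2, 3, 6))
    return (M1 * M2p / (M1p * M2), M1 * M3p / (M1p * M3), M1 * M4p / (M1p * M4))

X = np.hstack([rng.uniform(-1.0, 1.0, size=(6, 3)), np.ones((6, 1))])  # six generic points
T = rng.uniform(-1.0, 1.0, size=(4, 4))       # an arbitrary 3D projective transformation
s = rng.uniform(0.5, 2.0, size=(6, 1))        # arbitrary per-point homogeneous scale factors

print(invariants(X))
print(invariants(s * (X @ T.T)))              # the same triple, up to rounding error
```

The same construction, applied to the 2D determinants, gives the image-side cross-ratios $i_1, \ldots, i_4$ entering Theorem 2.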

3 APPLICATIONS

We will apply the above results to recognize 3D objects invariantly. For this purpose, we build a 3D invariant space in which recognition will take place. Each 3D 5-tuple is represented as a point in this space, with invariant coordinates $I_1, I_2, I_3$ (Fig. 1). A 3D model is represented by a set of points in this space. (There is also a tag identifying these points as part of the same model.) The problem now is how to match the image to the model. We will distinguish several applications. They will be described using the affine approximation, but we will show later that the projective case needs only a small modification, requiring only 5-tuples. This involves the use of vanishing points as described in Section 4.

1. Single image. Since the depth information is lost in this case, we cannot recognize an object without having a model. The simplest model can consist of five points in space, giving rise to three invariants $I_1, I_2, I_3$. These can be represented as a point in the 3D invariant space. (Of course, a realistic model will consist of many such points.) To recognize the model in the image, we extract 5-tuples of features from the image and calculate their 2D invariants $i_{\mu\nu}$. Using Theorem 1 (in the affine case), we have two equations relating the three 3D invariants $I_1, I_2, I_3$. Geometrically, we obtain a space line in our 3D invariant space. If a 5-tuple in the 2D image is a projection of some 5-tuple in 3D, then the line obtained from this 2D 5-tuple will pass through the point in invariant space representing the 3D 5-tuple (Fig. 1). That is, we have found the correspondence between the 2D and 3D 5-tuples. A different view of the 5-tuple will give rise to a different line in 3D, but still passing through the same point. To recognize objects, we thus look for instances in which lines obtained from the image pass through points representing models in the invariant 3D space. (A minimal sketch of this line construction appears after this list.)

It is easy to see the advantage in complexity. If we have n tuples from the image and m tuples from the set of models, a brute force method needs $O(mn)$ trial matches to find a correct pose and match. In our method, we can follow each line in the 3D invariant space to find which points it intersects, which is $O(n)$. The factor m is replaced by a much smaller one, derived in Section 5.2. We also avoid the pose calculations, which require inverting an $11 \times 11$ matrix for each trial match. This assumes no errors in the intersections. In practice, of course, we need to find intersections within some tolerance. For this purpose, we use hierarchical indexing methods in the 3D invariant space. The points, and in some variants the lines also, are organized as octrees, and the possible intersections can be quickly narrowed down to small neighborhoods.

2. Multiple images. Although we concentrate on single views, multiple-view applications are also valuable. For instance, the symmetric case (described next) can be regarded as multiple views. Here, we do not need a model to recover depth information, only for identification. We use our method to find correspondences between the images.

We first extract a 5-tuple of features from one of the images and transform it into a line in the 3D invariant space. Next, we look at a different image, in which objects are seen from a different viewpoint. We again draw a line in the same 3D space of invariants as before. If this line meets the first line in 3D, then the two 5-tuples have the same 3D invariants $I_i$, namely the coordinates of the intersection point. This means that the two 5-tuples are affine equivalent. (We will later generalize to the projective case.) This, in turn, indicates that we may have two different views of the same 5-tuple. Thus, we have again detected a correspondence between 5-tuples, this time between two views.

Here, too, we can see the advantage of the invariant method in reducing complexity. With n 5-tuples, the total number of lines in the invariant space is $O(n)$. We can find line intersections in a way similar to that used in Hough space, namely, divide the space into bins and see if a certain bin has more than one line going through it. We do not need to check all the bins; we only need to go along the known lines. A hierarchical scale space approach can be used to make the process more efficient. This brings our total complexity closer to $O(n)$ rather than $O(n^2)$ as in noninvariant methods.

Based on the above discussion, our recognition algorithm involves the following steps:

a. Feature extraction. Find candidate 5-tuples that are potentially affine equivalent (and similarly in the projective case). Various constraints can be used to prune unpromising sets, e.g., links that are visible between the points in the image should exist in the model, or parallelism.


b. Invariant description. For each 5-tuple that remains from (a), calculate the two equations for the three 3D invariants, i.e., draw a line in a 3D invariant space. In the 2-view case, find all lines that meet in 3D. Each intersection represents the 3D affine invariants of two affine equivalent 5-tuples. An object will be represented by several such intersection points.

c. Recognition. We now have an invariant 3D description of the visible object, in the form of intersection points in the 2-view case, or just lines in the 1-view case. Similarly, the models in the database can be represented by point sets in the same invariant 3D space. We can use several 5-tuples in each model, obtaining several points representing the model in the invariant space. Identification is now straightforward. If an object's point or line set falls near a model's point set in the invariant space, then the object is identified with the model. No search is needed for the right model because the invariant point sets of the models can be indexed according to their coordinates. No search is needed for the viewpoint or pose either, because of the invariance. (Of course, this is said from the pure geometric point of view. Errors produce several candidate models and we have to search among them.) As many points as practical need to be used to increase reliability.

d. Verification. This step is independent of the invariants method. It overcomes any errors that may have occurred in calculating the determinants $I_i$. Using the 2D coordinates of the images, and the correspondence found in Step c, we calculate the 3D coordinates of the features. Now, we can find the 3D transformation (pose) that produces the best fit between the 3D object and the model identified in Step c, using least squares fitting. The identification is rejected if the fitting error is too big. We may try to fit several models to each object to find the best fit.

3. Single image, symmetric models. The problem encountered earlier in the single-image case was the unknown parameter along the space line, resulting from the missing depth information. Instead of using a model as was done in case 1, we can use a modeling assumption. One modeling assumption that we use is symmetry. Most man-made objects are symmetric, e.g., vehicles, tanks, airplanes, and buildings. Symmetry is also found in human and animal bodies. The symmetry is observed explicitly only in a frontal view. In any other view, the symmetry is observed as a skew-symmetry. Many researchers have used skew-symmetry for recognition, but with serious limitations. They usually assume that the skew-symmetry itself is known, i.e., we know which feature in one half of the object corresponds to which feature in the symmetric half. In other words, they assume that the correspondence problem has already been solved. Here, we make no such assumption but detect the skew-symmetric objects in an image.

The two halves of a skew-symmetric object are affine equivalent (in the affine projection). Therefore, we can apply the algorithm described above, which was designed to find affine equivalent 5-tuples. Having found matching 5-tuples, we have to verify that they are parts of the same object. The lines connecting corresponding points in a symmetric object are parallel in 3D; therefore, they will be parallel in an affine projection, and this is easy to check.

The verification Step d is easier, due to the symmetry assumption. The skew-symmetric object that we have found can be rectified, using an affine transformation, to obtain a standard view in which the object is (nonskew) symmetric. It can then be matched directly with a database of symmetric models.
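To make the single-image case above concrete: in the affine approximation, (10) and (11) give $I_2 = i_{12} + I_1 i_{25}$ and $I_3 = i_{13} + I_1 i_{35}$, so each image 5-tuple defines a line in $(I_1, I_2, I_3)$ space that can be tested against the stored model points. The sketch below is an illustration of that step only, with hypothetical function names and toy numbers rather than values from our experiments.

```python
import numpy as np

def invariant_line(i12, i13, i25, i35):
    # From (10)-(11): I2 = i12 + I1*i25 and I3 = i13 + I1*i35, so the image
    # 5-tuple constrains (I1, I2, I3) to a line; return a point on it and its direction.
    p0 = np.array([0.0, i12, i13])            # the line at I1 = 0
    d = np.array([1.0, i25, i35])             # how the point moves as I1 varies
    return p0, d / np.linalg.norm(d)

def model_points_hit(model_points, p0, d, tol):
    # indices of stored model invariant points within distance tol of the line
    v = model_points - p0
    perp = v - np.outer(v @ d, d)             # components of v perpendicular to the line
    return np.nonzero(np.linalg.norm(perp, axis=1) < tol)[0]

# Toy usage: a few stored invariant triplets for one model, and one image 5-tuple.
model = np.array([[0.3, 0.7, 0.2],
                  [0.5, 0.1, 0.9],
                  [0.8, 0.4, 0.6]])
p0, d = invariant_line(i12=0.55, i13=0.75, i25=-0.5, i35=0.5)
print(model_points_hit(model, p0, d, tol=0.05))
```

In practice, the proximity test is run through the hierarchical (octree) index described in Section 4 rather than by the brute-force distance computation shown here.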

4 IMPLEMENTATION ISSUES

Experiments performed so far with the method are very encouraging. We have performed both real and simulated experiments, which are detailed in the next section.

During the implementation, a number of problems had to be overcome. We briefly summarize some of them here, along with their solutions.

1. Low-level image processing. It was not the goal of this research to develop new methods of feature extraction. Rather, we use the model-based approach to overcome the problems inherent in feature detection. Nevertheless, the inadequacies of feature detection are so great that we must do some preprocessing before using our method. The positions of feature points are very inaccurate, so we have concentrated on finding long edges and their intersections rather than points. However, we usually have far too many lines, most of which are irrelevant to the object (Fig. 2).

Solution: We keep only those lines that lie along principal directions. Most man-made objects such as vehicles and buildings have a small number of directions, e.g., major axes, to which most prominent lines are parallel. These are relatively easy to find. Fig. 3a shows the lines left after this pruning.

2. Large numbers of 5-tuples. With the number of features in the hundreds, the number of all possible quintuples is on the order of $100^5$, which is quite prohibitive.

Solution: We use only connected quintuples, i.e., 5-tuples whose member points are connected to each other by visible lines. Four points may be connected to form a (3D) corner, with one central point connected to three others, and the connections lying along principal directions. The fifth point can be any point connected to these four. Thus, the number of possible quintuples (and invariants) is reduced to on the order of $10^3$. Fig. 3b shows the corners that we obtained.

3. Perspectivity. The affine (orthographic) approximation that we initially assumed was found to be inadequate in many cases. We had to use the full


perspective treatment. However, this normally requires 7-point sets rather than the 5-point sets we have been using, which could increase the complexity significantly.

Solution: We use vanishing points, namely, the points at which the lines along a principal direction intersect in the image (Fig. 4). This is based on a (usually plausible) assumption that when several lines intersect in an image, they probably intersect in 3D as well. Thus, the vanishing points provide additional known correspondences between 2D and 3D points. We can use them as the two additional points needed to create a 7-tuple without increasing complexity. The invariant calculation is quite insensitive to the exact locations of the vanishing points. (A small least-squares sketch for estimating a vanishing point appears after this list.)

4. Finding intersections. Finding the point-line intersections in the 3D invariant space needs to be done efficiently. Given a line, we want to search for points that intersect it only within a small distance from the line. This is what keeps the complexity of the method down and proportional to the number of lines, as discussed before.

For this purpose, we have used hierarchical indexing methods based on octrees. This was done in cooperation with Professor H. Samet of our department. We tried two variants of the method [12], [22], [27]. 1) Incremental nearest-neighbor: The lines are organized as an octree and we incrementally find points that are closest to the line. 2) Incremental distance join: Both the lines and points are organized as octrees. Both octrees are processed simultaneously to find line-point pairs with smallest distances.
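The vanishing points used in item 3 above can be estimated from the image segments grouped along one principal direction. The sketch below shows one common least-squares formulation (the smallest right singular vector of the stacked homogeneous line equations); it is an illustration only and not necessarily the exact procedure used in our implementation.

```python
import numpy as np

def vanishing_point(segments):
    # segments: (n, 2, 2) array, each row two 2D endpoints of an image segment
    lines = []
    for p, q in segments:
        l = np.cross(np.append(p, 1.0), np.append(q, 1.0))   # homogeneous line through p and q
        lines.append(l / np.linalg.norm(l))
    L = np.array(lines)
    # The vanishing point v minimizes sum_i (l_i . v)^2 subject to ||v|| = 1,
    # i.e., it is the right singular vector of L with the smallest singular value.
    v = np.linalg.svd(L)[2][-1]
    return v[:2] / v[2] if abs(v[2]) > 1e-12 else v[:2]      # dehomogenize if finite

# Toy usage: three segments whose supporting lines all pass through (10, 0).
segs = np.array([[[0.0,  1.0], [5.0,  0.5]],
                 [[0.0, -1.0], [5.0, -0.5]],
                 [[0.0,  0.4], [5.0,  0.2]]])
print(vanishing_point(segs))                                  # approximately [10, 0]
```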

5 EXPERIMENTS

5.1 General Setup

We have experimented with a data set of objects including vehicles such as trucks and tanks, both real and simulated. Most experiments involved single views.

The first step is building a 3D model in the invariant space described earlier. We start from a 3D model of a vehicle (Fig. 5), and choose a "base" of four corner points on it. These can be assigned the coordinates of a standard affine coordinate base, namely,


Fig. 2. Image and detected lines.

Fig. 3. (a) Lines along principal directions. (b) Corner features.

Fig. 4. Vanishing points.


$$X_1 = (1, 0, 0, 1), \quad X_2 = (0, 1, 0, 1), \quad X_3 = (0, 0, 1, 1), \quad X_4 = (0, 0, 0, 1).$$

The original model can be transformed by a unique affine transformation so that the four base points take on the above values. This has no effect on our treatment since it is invariant. Any fifth point of the model is then transformed to a point with generic coordinates $X = (X, Y, Z, 1)$.

It is easy to see that these new coordinates are in fact invariants. One way is to explicitly calculate the invariants as given earlier to obtain

$$I_1 = X, \quad I_2 = Y, \quad I_3 = Z.$$

Another way is to note that such a base is a standard, or canonical, base that can always be obtained from any given affine-transformed version of the model. That makes it an invariant base, and any quantity expressed with respect to it is thus invariant. The model we obtain by this method in the invariant space is shown in Fig. 5 (right).
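The canonical-base construction just described can be written compactly: solve for the affine map that sends the four chosen base points to the standard base, and apply it to any other model point, whose transformed coordinates are then $(I_1, I_2, I_3)$. The sketch below is an illustration with hypothetical helper names and toy coordinates; it assumes the chosen base points are noncoplanar.

```python
import numpy as np

def base_transform(B):
    # B: 4x3 array of chosen base points; returns the 4x4 affine map T such that
    # T @ (X, 1) gives canonical coordinates, with the base mapping to the standard base.
    Bh = np.hstack([B, np.ones((4, 1))])
    target = np.array([[1.0, 0.0, 0.0, 1.0],
                       [0.0, 1.0, 0.0, 1.0],
                       [0.0, 0.0, 1.0, 1.0],
                       [0.0, 0.0, 0.0, 1.0]])
    # Solve T @ Bh[i] = target[i] for every i, i.e. Bh @ T.T = target.
    return np.linalg.solve(Bh, target).T

def invariant_coords(T, P):
    # invariant coordinates (I1, I2, I3) of a model point P with respect to the base
    return (T @ np.append(P, 1.0))[:3]

# Toy usage with a hypothetical noncoplanar base and one additional model point.
base = np.array([[2.0, 0.0, 0.0],
                 [0.0, 3.0, 0.0],
                 [1.0, 1.0, 2.0],
                 [0.5, 0.5, 0.5]])
T = base_transform(base)
print(invariant_coords(T, np.array([1.0, 2.0, 1.5])))
```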

Next, given one view of the vehicle, we calculate the lines in 3D invariant space. This is done using Theorem 2 and taking into account the vanishing points. The results for each view are shown in Fig. 6.

An intersection of a line with a point in 3D means that a correspondence has been found between the 5-point set in the image, which gave rise to the line, and a 5-point set in the 3D model that gave rise to the point. In theory, we need only six or seven feature correspondences. This is enough to calculate the pose (or camera coordinates). We can then project the model on the image and do a more detailed match. However, more intersections give higher reliability.

The lines converge to a point representing the position of the camera center in invariant space, making it easy to calculate the pose. Thus, the lines are in fact a representation of the light rays in invariant space.


Fig. 5. Original and invariant models.

Fig. 6. Intersections in invariant space between image lines and model points. On the left are two different images of the same HMMWV vehicle. On the right is a 3D invariant model of this vehicle. (The same model at the top and the bottom.) The 3D lines on the right are calculated from the 2D images on the left. Their intersection with the model points implies recognition of the vehicle for each view. The closed circles represent base points while the open circles are other feature points. The circle radius represents a tolerance parameter. The lines converge to the corresponding camera center.


5.2 Simulated Images

Several questions have been investigated in these experiments:

1. affine case vs. projective case,
2. effect of the model selection criterion, e.g., a tolerance parameter for the point-line intersections, and
3. effect of varying the number of feature points on reliability and complexity.

To reduce relative errors, we choose model base points that are as far apart from each other as possible. These points define a finite invariant space, whose linear size we normalize as 1. We can set a tolerance for detecting a model, say as error spheres around the points. The error rate as well as the complexity depend on the tolerance. Also, they depend on the number of features in the model. In the following, we derive a rough estimate of some of these dependencies and then describe experiments that better quantify these relations.

The dependence of complexity on the number of image and model points can be roughly estimated as follows: First, we need to match the base points of the image and model. Thus, in the worst case, we need to choose four points out of $n_{image}$ feature points, namely, $\binom{n_{image}}{4}$ possibilities. We will see later that we don't really need to check all these possibilities. One of these choices will match a certain model base in the invariant space. To define a model uniquely, we need at least one more point and a line that intersects it. Thus, we need to know the complexity of finding intersections of a line with a point in 3D. If the number of points in the invariant space is n, and if they are distributed evenly, then on average there will be $O(n^{1/3})$ points in some tube centered on the line. In a hierarchical method, each of these points can be accessed in $O(\log n)$ operations. Thus, the complexity for each line is $O(n^{1/3}\log n)$. This is reduced further by a constant factor $\epsilon^2$, with $\epsilon$ being the radius of the tube as a fraction of the size of the invariant space. This is a significant saving relative to a noninvariant method when n is large. Such a method would simply try all models in all poses, involving at least $O(n)$ operations per 5-tuple. Indeed, n can be large if each model is represented by many bases (we will see why this is necessary). To get the total complexity, the above expressions are multiplied by the number of lines l.

The error rate can be estimated as follows: We set a sphere of tolerance with radius $\epsilon$ around each model point. To intersect a point, a line needs to be within a tube whose base area is $\pi\epsilon^2$ and whose length is 1, so that its volume is proportional to $\epsilon^2$. (We can absorb the $\pi$ in the other factors.) If a model a has $n_a$ points, then the tube will contain $\epsilon^2 n_a$ model points, and this is the probability that the line intersects a point of model a. Given l lines, we thus have $l\epsilon^2 n_a$ intersections with the model's points. The total probability of hitting m points of model a just by chance is then proportional to $(l\epsilon^2 n_a)^m$. Thus, this can be regarded as the false positives error. This error decreases rapidly with m if

$$l\epsilon^2 n_a < 1. \qquad (16)$$

Thus, we want to decrease $\epsilon$ and increase m.
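As a quick illustration of condition (16) with purely hypothetical numbers (not taken from the experiments reported below):

```python
l, n_a, m = 500, 20, 8                 # lines, model points, required hits (hypothetical)
for eps in (0.05, 0.005):
    rate = l * eps**2 * n_a            # expected chance intersections per model
    print(eps, rate, rate**m)          # (16) wants rate < 1; rate**m is the false-positive factor
```

With $\epsilon = 0.05$ the rate is 25, so (16) is violated and the chance-hit estimate is not meaningful; with $\epsilon = 0.005$ the rate is 0.25 and the false-positive factor is about $0.25^8 \approx 1.5 \times 10^{-5}$.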

In a similar way, the number of correct matches can be estimated as $m l_a \epsilon^2 n_a$, with $l_a$ being the number of "good" lines, namely, lines that actually belong to object a rather than to other objects or clutter. If we set $\epsilon$ too small, there will be too many false negatives (misses) due to errors in the positions of the lines. Thus, we want to increase $\epsilon$ up to the level allowed by (16). (However, this increases complexity somewhat.) The relation between the tolerance, the noise, and the error rate is in fact nonlinear due to the nonlinearity of the invariant equations. We investigate it empirically in the experiments; the above discussion is only for gaining some insight.

Condition (16) can be easily maintained if the number of lines l is relatively low, i.e., if we have relatively few features in the image. This was true in our simulated experiments. However, it is not always true in real images, as discussed later.

In our simulations, we generated about 100 models by varying the components of various trucks, i.e., changing the size of the cab, trunk, etc. We generated 10 different simulated views of each model. We then added random noise at various levels to the images. We found lines and points which intersected within a tolerance $\epsilon$. These intersections were collected in a voting table, with each intersection contributing a vote for the model associated with the relevant point. The model with the most votes was considered as recognized. Figs. 7, 8, and 9 show the total error rate vs. the noise level in various situations.

Fig. 7 shows the projective vs. the affine error rates. The different curves correspond to different distances of the camera from the object, relative to the object's size. We can see that the advantage of the projective treatment increases as the camera gets closer to the object. At a short distance, the affine approximation gives a 0 percent recognition rate even without noise, while the projective treatment gives satisfactory results. The affine case improves when we increase the camera distance relative to the object size.

Fig. 8 shows the quantitative relation between the noise level, the tolerance, and the error rate. We see that the curves start out almost flat at low noise levels because, in this case, all the lines are within the set tolerance. As the noise increases, the recognition rate drops depending on the tolerance $\epsilon$.

Fig. 9 shows the effect of increasing the number of features m of each model. The plots correspond to different numbers of model features, specifically 4 and 16 (besides the base). It is clear that more features reduce the error rate, as we deduced analytically.

5.3 Real Images

With real images, we do not have control over parameters such as camera distance or noise level. We can control the number of features detected in the image by varying the threshold used in the edge or corner detector.

Fig. 10 shows samples from our real image data set. There were 32 images taken of each of the 20 objects at various azimuths and depression angles. The number of objects is in fact higher because an object can look very different when seen in pure front, side, or back views. The objects are in natural, uncontrolled settings and are not segmented from the background.


A problem that arises in real images is that not all the points we use as a "base" in the model are visible in its image. This can be because of occlusion or because the points are on a "hidden" side of the object. Therefore, we have to use several models describing the same object, differing by the choice of base points. This ensures that there will always be a model in the invariant space whose base points are visible in the image.

Here, we have a trade-off between two ways of applying the method. One is to use all possible bases in the image and only a few in the models. The opposite approach is to use models with all possible bases and only a few bases in the image. Either way, we will obtain at least one matching base. The first approach adds complexity online, as we have to find all sets of four points in real time. It also adds many lines to the invariant space. The second approach avoids this, but requires storing many representations of the same model in invariant space. This approach appears to be preferable when the features extracted from the image are quite reliable. The first approach is necessary with noisy images, and we will see how to reduce its problems.

Another problem with real images is the large number of features obtained from the edge detector, resulting in a very large number of lines in the invariant space. Many of these features result from background clutter, camouflage, etc. Equation (16) is then not always satisfied. That is, if the feature detector yields a large set of features, there is a good chance that many models will fit our image just by chance. This observation seems to hold generally for any feature-based matching method, whether or not it uses invariants.

A way to deal with the problem is to impose a structure on the set of image features, e.g., find some grouping of the features. In the present work, we impose the constraint of feature connectivity, i.e., we consider only feature points that are connected by straight edges that appear both in the model and in the image. This serves two purposes:

1. It reduces complexity by reducing the number of features to be considered; unconnected feature points are ignored. Thus, the complexity factor of $\binom{n_{image}}{4}$ mentioned earlier is reduced to a much smaller number.

2. It improves the error rate in two ways. First, as we saw before, the error increases with the density of lines in the invariant space, and thus it helps to minimize this density. Second, the connectivity of the features is itself a property that should match between the image and the model, i.e., features are regarded as matched only if they are connected in the same way in the image and the model.

Fig. 7. The affine approximation fails at short camera distances. The full projective treatment works at all distances. d is the size of the object.

Fig. 8. Error rate vs. noise level for various tolerances on the point-line intersections. $\epsilon$ is 5 percent of the size of the invariant space. The noise parameter expresses errors relative to the size of the 2D image.

Fig. 9. Error curves for different numbers of model points.

With large numbers of image features, we found that feature connectivity was essential for a reasonable error rate. More research is needed on ways to group features together and on the effect of grouping on the error rate. This is a general issue that applies to any feature-based recognition method.

To use connectivity, we defined a cost function for matching that contained two main factors: 1) the weighted distance of the image lines from the model points in the invariant space, relative to $\epsilon$; 2) the connectivity of the features. We require that, if two points are connected in the model, the corresponding features are connected in the image; otherwise, we penalize the cost of the match. The relative costs were determined empirically to yield reasonable performance. For a model to be recognized, we required that at least eight points be hit by image lines within a tolerance radius of about 5 percent of the size of the invariant space. As it turns out, these factors have less impact on the error rate than the edge detector threshold. This threshold determines the number of lines l in the invariant space. If we set this threshold too high, too many features are missed and the model is missed. Setting the threshold too low increases complexity and false positives according to (16). However, this is mitigated by our connectivity constraint.

Fig. 10. Samples from our real image data set. Two images of each sample object are shown. Objects are recognized in each image separately using the method of Fig. 6. This data set was made available to us courtesy of the Model Based Vision Laboratory, Wright-Patterson Air Force Base.

We have implemented several variants of an octree hierarchical indexing structure to find the intersections of lines and points in the invariant space, as well as a "brute force" method that does not use invariants, but simply matches the image to all possible models in all poses. For each image, we used two different thresholds on the edge detector, resulting in different numbers of features. We compared the timing of the methods for each image and each threshold. We averaged the results over all the images, with the results summarized in Table 1. As we see, the lower threshold (thresh2) takes more time, but reduces the error rate. With the higher threshold (thresh1), we miss some features and have a higher, but still tolerable, error rate. Thus, we have a degree of robustness to missing features. We also see that the advantage of the octree method is higher with the lower threshold because the relative efficiency of the method increases with the number of image features (namely 3D lines). This is also seen in Table 1.

5.4 Two Views

We have applied a method similar to the above to find a correspondence between two views, without a model. Applications include: 1) "stereo" with widely differing images, such as the two views in Fig. 6; other methods depend on the disparity between the images being small. 2) 3D model construction from 2D images.

In this case, instead of finding line-point intersections, we find line-line intersections. Each image generates a set of lines in the invariant space. A line intersection indicates that there is a correspondence between the original 5-point sets from the two images. Preliminary experiments with the method were promising. Further experimental study is underway.

6 SUMMARY

We have presented a method for recognizing 3D models from single images. The method is based on representing the models as points in an invariant space and representing the image features as lines in the same space. Recognition is achieved when lines derived from the image intersect model points. This was done for the full projective case, unlike previous orthographic treatments. Our experiments show that the projective treatment is essential unless the camera is very far from the object.

The experiments show a significant efficiency advantage of the invariant space in conjunction with hierarchical indexing methods, relative to methods that do not use invariants. It is also observed that the method shares some properties with other feature-based methods. One advantage of a feature-based method over global methods such as moments is robustness to missing parts in the image. Even when the feature detector missed some features, we still had a reasonable error rate. On the other hand, point features are more susceptible to random noise in their positions than global quantities, and there can be too many of them. The way we overcame these problems was by using an invariant structure in the feature set, namely, line connectivity. That is, we match feature points only if they are connected by visible lines in both the model and the image. More work is needed on finding such invariant structures or groupings of the features.

ACKNOWLEDGMENTS

This work was supported by the Defense Advanced Research Projects Agency (DARPA order no. E655), the Air Force Wright Laboratory under grant F49620-96-1-0355, the Air Force Office of Scientific Research under grant F49620-92-J-0332, and the US Office of Naval Research under grant N00014-95-1-0521.

REFERENCES

[1] E. Barrett, personal communication, 1998.
[2] P. Beardsley, P. Torr, and A. Zisserman, "3D Model Acquisition from Extended Image Sequences," Proc. European Conf. Computer Vision, pp. 683-695, 1996.
[3] J. Ben-Arie, Z. Wang, and R. Rao, "Iconic Recognition with Affine-Invariant Spectral Signatures," Proc. Int'l Conf. Pattern Recognition, vol. A, pp. 672-676, 1996.
[4] T.O. Binford and T.S. Levitt, "Model-Based Recognition of Objects in Complex Scenes," Proc. DARPA Image Understanding Workshop, pp. 89-100, 1996.
[5] A. Bruckstein, E. Rivlin, and I. Weiss, "Scale Space Invariants for Recognition," Machine Vision and Applications, vol. 15, pp. 335-344, 1997.
[6] J.B. Burns, R. Weiss, and E.M. Riseman, "View Variation of Point Set and Line Segment Features," Proc. DARPA Image Understanding Workshop, pp. 650-659, 1990.
[7] S. Carlsson, "Relative Positioning from Model Indexing," Image and Vision Computing, vol. 12, pp. 179-186, 1994.
[8] S. Carlsson and D. Weinshall, "Dual Computation of Projective Shapes and Camera Positions from Multiple Images," Int'l J. Computer Vision, vol. 27, no. 3, pp. 227-241, 1998.
[9] R.W. Curwen and J.L. Mundy, "Grouping Planar Projective Symmetries," Proc. DARPA Image Understanding Workshop, pp. 595-605, 1997.
[10] R. Deriche, Z. Zhang, Q.T. Luong, and O.D. Faugeras, "Robust Recovery of the Epipolar Geometry for an Uncalibrated Stereo Rig," Proc. European Conf. Computer Vision, vol. A, pp. 567-576, 1994.

TABLE 1: Timing for Different Methods Used for Real Images

Times are normalized to that of the noninvariant method with the high threshold (thresh1). The speedup factor is the ratio of the noninvariant method time to the average of the two octree methods' times.

[11] H. Guggenheimer, Differential Geometry. New York: Dover, 1963.
[12] G.R. Hjaltason and H. Samet, "Ranking in Spatial Databases," Proc. Advances in Spatial Databases: Fourth Int'l Symp., M.J. Egenhofer and J.R. Herring, eds., pp. 83-95, 1995.
[13] J.P. Hopcroft, D.P. Huttenlocher, and P.C. Wayner, "Affine Invariants for Model-Based Recognition," Geometric Invariance in Machine Vision, J.L. Mundy and A. Zisserman, eds. Cambridge, Mass.: MIT Press, 1992.
[14] D. Jacobs, "Space Efficient 3D Model Indexing," Proc. IEEE Conf. Computer Vision and Pattern Recognition, pp. 439-444, 1992.
[15] D. Jacobs and R. Basri, "3D to 2D Recognition with Regions," Proc. IEEE Conf. Computer Vision and Pattern Recognition, pp. 547-553, 1997.
[16] D. Keren, R. Rivlin, I. Shimshoni, and I. Weiss, "Recognizing 3D Objects Using Tactile Sensing and Curve Invariants," Technical Report CS-TR-3812, Univ. of Maryland, 1997.
[17] S.J. Maybank, "Relation between 3D Invariants and 2D Invariants," Image and Vision Computing, vol. 16, pp. 13-20, 1998.
[18] P. Meer and I. Weiss, "Point/Line Correspondence under Projective Transformations," Proc. Int'l Conf. Pattern Recognition, vol. A, pp. 399-402, 1992.
[19] R. Mohr, L. Quan, and F. Veillon, "Relative 3D Reconstruction Using Multiple Uncalibrated Images," Int'l J. Robotics Research, vol. 14, pp. 619-632, 1995.
[20] Y. Moses and S. Ullman, "Generalization to Novel Views: Universal, Class-Based and Model-Based Processing," Int'l J. Computer Vision, vol. 29, pp. 233-253, 1988.
[21] R. Mohan, D. Weinshall, and R.R. Sarukkai, "3D Object Recognition by Indexing Structural Invariants from Multiple Views," Proc. Fourth Int'l Conf. Computer Vision, pp. 264-268, May 1993.
[22] R.C. Nelson and H. Samet, "A Population Analysis for Hierarchical Data Structures," Proc. SIGMOD Conf., pp. 270-277, 1987.
[23] S. Peleg, A. Shashua, D. Weinshall, M. Werman, and M. Irani, "Multisensor Representation of Extended Scenes Using Multiview Geometry," Proc. DARPA Image Understanding Workshop, pp. 79-83, 1997.
[24] E. Rivlin and I. Weiss, "Local Invariants for Recognition," IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 16, pp. 226-238, 1995.
[25] E. Rivlin and I. Weiss, "Recognizing Objects Using Deformation Invariants," Computer Vision and Image Understanding, vol. 65, pp. 95-108, 1997.
[26] C.A. Rothwell, Object Recognition through Invariant Indexing. Oxford: Oxford Univ. Press, 1995.
[27] H. Samet, The Design and Analysis of Spatial Data Structures. Reading, Mass.: Addison-Wesley, 1990.
[28] J. Sato and R. Cipolla, "Affine Integral Invariants for Extracting Symmetry Axes," Image and Vision Computing, vol. 15, pp. 627-635, 1997.
[29] A. Shashua and N. Navab, "Relative Affine Structure: Canonical Model for 3D from 2D Geometry and Applications," IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 18, pp. 873-883, 1996.
[30] G. Sparr, "Projective Invariants for Affine Shapes of Point Configurations," Proc. First Workshop Invariance, pp. 151-170, 1991.
[31] C.E. Springer, Geometry and Analysis of Projective Spaces. San Francisco: Freeman, 1994.
[32] P.F. Stiller, C.A. Asmuth, and C.S. Wan, "Invariant Indexing and Single View Recognition," Proc. DARPA Image Understanding Workshop, pp. 1423-1428, 1994.
[33] B. Sturmfels, Algorithms in Invariant Theory. New York: Springer-Verlag, 1993.
[34] C. Tomasi and T. Kanade, "Shape and Motion from Image Streams under Orthography: A Factorization Method," Int'l J. Computer Vision, vol. 9, no. 2, pp. 137-154, 1992.
[35] S. Ullman and R. Basri, "Recognition by Linear Combination of Models," IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 13, no. 10, pp. 992-1006, Oct. 1991.
[36] L. Van Gool, T. Moons, E. Pauwels, and A. Oosterlinck, "Vision and Lie's Approach to Invariance," Image and Vision Computing, vol. 13, pp. 259-277, 1995.
[37] T. Vieville, O. Faugeras, and Q.T. Luong, "Motion of Points and Lines in the Uncalibrated Case," Int'l J. Computer Vision, vol. 17, pp. 7-41, 1996.
[38] B. Vijayakumar, D.J. Kriegman, and J. Ponce, "Structure and Motion of Curved 3D Objects from Monocular Silhouettes," Proc. IEEE Conf. Computer Vision and Pattern Recognition, pp. 327-334, 1996.
[39] D. Weinshall, "Model-Based Invariants for 3D Vision," Int'l J. Computer Vision, vol. 10, no. 1, pp. 27-42, 1993.
[40] I. Weiss, "Geometric Invariants of Shapes," Proc. Computer Vision and Image Processing, pp. 291-297, 1988.
[41] I. Weiss, "Noise Resistant Invariants of Curves," IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 15, pp. 943-948, 1993.
[42] I. Weiss, "Geometric Invariants and Object Recognition," Int'l J. Computer Vision, vol. 10, pp. 207-231, 1993.
[43] I. Weiss, "Local Projective and Affine Invariants," Annals of Math. and Artificial Intelligence, vol. 13, pp. 203-225, 1995.
[44] I. Weiss, "3D Curve Reconstruction from Uncalibrated Cameras," Technical Report CS-TR-3605, Univ. of Maryland, 1996.
[45] I. Weiss, "Model-Based Recognition of 3D Curves from One View," J. Math. Imaging and Vision, vol. 10, pp. 1-10, 1999.
[46] M. Zerroug and R. Nevatia, "Using Invariance and Quasi-Invariance for the Segmentation and Recovery of Curved Objects," Lecture Notes in Computer Science 825, Berlin: Springer-Verlag, 1994.
[47] A. Zisserman, D.A. Forsyth, J.L. Mundy, C.A. Rothwell, J. Liu, and N. Pillow, "3D Object Recognition Using Invariance," Artificial Intelligence, vol. 78, pp. 239-288, 1995.

Isaac Weiss received the PhD degree from the Department of Physics and Astronomy of Tel-Aviv University. He was subsequently a research scientist at New York University's Courant Institute of Mathematics (one year) and at the Massachusetts Institute of Technology (two years). He is now a senior research scientist at the Center for Automation Research of the University of Maryland. His current research interests are computer vision, object recognition, pattern recognition, and robotics. Specific areas of focus are applications of both geometric and physics-based invariance in object recognition, and methods of robust estimation. He is a member of the IEEE Computer Society.

M. Ray received the BTech degree from the Indian Institute of Technology, Kharagpur, India, in 1994 in computer science and technology. He received the MSc and PhD degrees from the University of Maryland, College Park, in computer science, in 1997 and 2000, respectively. He is now with Siemens Medical Systems in Hoffman Estates, Illinois. His research interests are computer vision and computer graphics.
