The MPEG-7 Standard and the Content-based Management of ...

6
The MPEG-7 Standard and the Content-based Management of Three-dimensional Data: A Case Study Eric Paquet and Marc Rioux Visual Information Technology Institute for Information Technology National Research Council Ottawa (Ontario) Kl A OR6 Canada E-mail: [email protected], [email protected] Abstract Content-bused management of three-dimensional data in the framework of the MPEG-7 standard is considered. An overview on MPEG-7 and three-dimensional data is presented. Descriptors and description schemes suituble for three-dimensional dutu are introduced and their integration in the framework of MPEG-7 is detailed. The shape descriptors are bused on cords, wuvelet transform and three-dimensional statistical moments. Implications of the normative and the non-normative parts of the standard are analyzed and their relations are considered. The descriptors and description schemes are evaluated according to MPEG-7 criteria. Experimental results indicating the potential of the descriptors ure presented. Some considerations on dynamic scenes are introduced. 1. Introduction More and more multimedia information is available around the world. The number of users and the amount of information are progressing at a very fast rate. To be useful that information has to be filtered and retrieved. In order to perform a search it is necessary to have a description of the content. That description can be a set of keywords, a structured text or some more abstract representation. If the description of the content is not standardized it is very difficult to retrieve the content because the descriptions involved are not necessarily compatible and their distribution can be limited to a restricted subset of users. In order to overcome this problem most search engines retrieve the documents and describe them with keywords by analyzing the textual information that surrounds the multimedia information. That procedure has two major drawbacks: it is time-consuming and the keywords extracted do not necessarily match the multimedia content. Actually the most qualified person to describe the content is the provider. In order to share its knowledge the provider needs a representation that can be understood by anybody accessing the content. This is why a standard representation of the content is needed. The aim of MPEG-7 is to become such a standard. 2. MPEG-7 and Three-dimensional Data MPEG-7 [l] is a member of the MPEG family which itself is joint technical committee of the IS0 and IEC: ISO/IEC JTCl/SC29/WGll. Contrary to MPEG 1, 2 and 4, MPEG-7 does not address coding but content description. As a matter of fact MPEG-7 is formally known as Multimedia Content Description Interface. What is standardized by MPEG-7 is called the normative part while what is not standardized is known as the non- normative part. The normative part of MPEG-7 is the description. The feature extraction and the search engine belong to the non-normative part. They are not standardized for two main raisons: they are not necessary for interoperability and MPEG-7 does not want to close the door to future technical improvements. MPEG-7 is made out of three main elements: the descriptor D, the description scheme DS and the description definition language DDL. The descriptor is the representation of a feature: a feature being a distinctive or characteristic part of the data. The descriptor may be composite. The description scheme is formed by the combination of a plurality of descriptors and description schemes. The description scheme specified the structure and the relation between the descriptors. Finally the DDL is the language used to specify the description scheme. 375 O-7695-0253-9/99 $10.00 0 1999 IEEE

Transcript of The MPEG-7 Standard and the Content-based Management of ...

Page 1: The MPEG-7 Standard and the Content-based Management of ...

The MPEG-7 Standard and the Content-based Management ofThree-dimensional Data: A Case Study

Eric Paquet and Marc RiouxVisual Information Technology

Institute for Information TechnologyNational Research Council

Ottawa (Ontario) Kl A OR6 CanadaE-mail: [email protected], [email protected]

Abstract

Content-bused management of three-dimensional data inthe framework of the MPEG-7 standard is considered.An overview on MPEG-7 and three-dimensional data ispresented. Descriptors and description schemes suitublefor three-dimensional dutu are introduced and theirintegration in the framework of MPEG-7 is detailed. Theshape descriptors are bused on cords, wuvelet transformand three-dimensional statistical moments. Implicationsof the normative and the non-normative parts of thestandard are analyzed and their relations are considered.The descriptors and description schemes are evaluatedaccording to MPEG-7 criteria. Experimental resultsindicating the potential of the descriptors ure presented.Some considerations on dynamic scenes are introduced.

1. Introduction

More and more multimedia information is availablearound the world. The number of users and the amount ofinformation are progressing at a very fast rate. To beuseful that information has to be filtered and retrieved. Inorder to perform a search it is necessary to have adescription of the content. That description can be a setof keywords, a structured text or some more abstractrepresentation.

If the description of the content is not standardized it isvery difficult to retrieve the content because thedescriptions involved are not necessarily compatible andtheir distribution can be limited to a restricted subset ofusers. In order to overcome this problem most searchengines retrieve the documents and describe them withkeywords by analyzing the textual information thatsurrounds the multimedia information. That procedurehas two major drawbacks: it is time-consuming and the

keywords extracted do not necessarily match themultimedia content. Actually the most qualified person todescribe the content is the provider. In order to share itsknowledge the provider needs a representation that can beunderstood by anybody accessing the content. This iswhy a standard representation of the content is needed.The aim of MPEG-7 is to become such a standard.

2. MPEG-7 and Three-dimensional Data

MPEG-7 [l] is a member of the MPEG family whichitself is joint technical committee of the IS0 and IEC:ISO/IEC JTCl/SC29/WGll. Contrary to MPEG 1, 2 and4, MPEG-7 does not address coding but contentdescription. As a matter of fact MPEG-7 is formallyknown as Multimedia Content Description Interface.What is standardized by MPEG-7 is called the normativepart while what is not standardized is known as the non-normative part. The normative part of MPEG-7 is thedescription. The feature extraction and the search enginebelong to the non-normative part. They are notstandardized for two main raisons: they are not necessaryfor interoperability and MPEG-7 does not want to closethe door to future technical improvements.

MPEG-7 is made out of three main elements: thedescriptor D, the description scheme DS and thedescription definition language DDL. The descriptor isthe representation of a feature: a feature being adistinctive or characteristic part of the data. Thedescriptor may be composite. The description scheme isformed by the combination of a plurality of descriptorsand description schemes. The description schemespecified the structure and the relation between thedescriptors. Finally the DDL is the language used tospecify the description scheme.

375O-7695-0253-9/99 $10.00 0 1999 IEEE

Page 2: The MPEG-7 Standard and the Content-based Management of ...

They are many three-dimensional objects and scenes onthe Web. Most of them are in the VRML formatincluding the IS0 standard VRML 97. Along withVRML other formats are emerging or are underdevelopment like Metastream from Metacreation, VRMLC from the VRML Consortium, Fahrenheit and AAF fromMicrosoft. As a matter of fact there are more than 40common commercial formats in use. Some of them likeDXF and ACIS are mostly used for CAD. Among themost important applications of three-dimensional data areindustrial design, inspection, virtual reality, catalogues,collections and medical applications. In most applicationsan important number of files is involved. In order tosearch efficiently those data it is important to have adescription suited for them. Interesting works on thissubject can be found in [2-61. The present paperaddresses the description of the scale, shape and color ofthree-dimensional objects in the context of MPEG-7.

3. Description of Three-dimensional Objectsin the Framework of MPEG-7

One of the main characteristics of three-dimensional datais their physical dimensions or scale. In order to describethe scale it is proposed to utilize three descriptors: thevolume a bounding box and a bounding sphere. Thevolume is simply a float value representing the volume ofthe object. The bounding box is set of three floatsrepresenting the dimensions of the smallest box that cancontain the object. The bounding sphere is a float valuecorresponding to the radius of the smallest sphere that cancontain the object. The three descriptors form a DSdescribing the scale of the object. The scale is featurethat can be easily used to rapidly filter objects for aspecific query. In the DS a field is used to specify the unitused for the measurement: meters, inches, etc.

The color distribution is represented by a set of threehistograms corresponding to the red, green and bluecomponent of the color. This representation has beenretained because commonly used in three-dimensionalformats. The histograms form a DS for the colordistribution. It has to be pointed out that each histogramis normalized in order to make their comparison easier.Each histogram has a header made out of a single numberrepresenting the number of channels.

Three-dimensional shape is of particular interest. Sincethe extraction is not part of the MPEG-7 standard, aspecial care must be taken in order that the descriptiondepends as little as possible on the extraction. In order todescribe the geometrical shape we use the concept of acord. A cord is a vector that goes from the center of mass

of the object to the center of mass of a given trianglebelonging to the surface of the object. In order to definethe orien,..,ion of a cord we use a reference frame thatdoes not depend on the orientation of the object. Thereference frame is defined as the eigen vectors of thetensor of inertia of the object. Assuming a triangulardecomposition, the tensor of inertia is defined by:

Where m, is the mass of a triangle and q and r are theCartesian coordinates. Each axis is identified by its eigenvalue. The axes are labeled one, two and three bydescending order of their corresponding eigen values.

[ITi, = /%,a, I (2),=I.?.?

The orientation of the cord is completely determined bythe two angles between the cord and the first tworeference axes. The norms of the cords are normalized:that means that the cord distribution does not depend onthe scale of the object. The angles and the norm defineuniquely the cord which, is our MPEG-7 feature. Now weare interested in the distribution of those cords so wedefine the cord distribution as a set of three histograms:the first histogram represents the distribution of the firstangle, the second histograms the distribution of thesecond angle and the third histogram the distribution ofthe norms. Each histogram is a descriptor and the setformed by them is a DS. Each histogram has a headermade out of a single number representing the number ofchannels. Depending on the number of channels theresulting DS can be very compact. The size of thehistogram is also dictated by the precision of therepresentation and by the discrimination capability that isneeded.

The behavior of a cord can be better understood byconsidering a regular pyramid and a step pyramid. Mostpeople agree that they belong to the same category. Ifnormal vectors would be used to represent the pyramids,five directions would characterize the regular pyramidwhile six directions would characterize the step pyramid.So they would be classified as two distinct objects. If acord representation is used the histograms correspondingto the regular pyramid and to the step pyramid are muchmore similar. Consequently a cord can be understood as aslow varying normal vector. Some results have beenpublished recently [7] and some are provided here. Fordetailed results the reader is invited to try our demolocated at http://cook.iitsg.nrc.ca:88OO/Nefertiti.

376

Page 3: The MPEG-7 Standard and the Content-based Management of ...

Figure 1. Query for a horse.

In addition to the cord, we propose another descriptorsbased on a wavelet representation. This is because inaddition to be bounded by a surface a three-dimensionalobject is also a volume. In order to analyzed the volumewe use a three-dimensional wavelet transform. Let usreview the procedure. For our purpose we use DAU4wavelets which have two vanishing moments. The NxNmatrix corresponding to that transform is

w=(‘t, Cl L> c,

c’3 -c(‘_ C’, - (‘(8

!z (‘2 (0 [I

L( 1 - C’,, <‘l -c’?

The wavelet coefficients are obtained by applying thematrix W on the three axes defined by the tensor ofinertia. We use those axes because the wavelet transformis not translation and rotation invariant. In order to applythe transform to the object the latter has to be binarized byusing a voxel representation. The wavelet coefficientsrepresent a tremendous amount of information. In orderto reduce it, we compute a set of statistical moments foreach level of resolution of the transform. For eachmoment, a histogram of the distribution of the moment isbuilt. The channels correspond to the level of detail andthe amplitude to the normalized moments. This histogramis the descriptor of the object. The DS is formed by theencapsulation of the histograms. A field in the DSindicates the number of moments used which is also thenumber of histograms. The DS has also a field to indicatethe number of levels of detail corresponding to thetransform.

The last descriptor is based on three-dimensionalstatistical moments. The three-dimensional statisticalmoment Myrs is defined as

Myrc = 2 HI, (4 - -r,,,, )” (Y, - !‘,,,, > ’ k, - z,,,, > \ C4)The statistical moments are not rotation invariant. Inorder to solve this problem they are computed in the samereference of the wavelet descriptors. The order of themoments is related to the level of detail. Three numbersindicating the order and a float indicating the value of themoment for the descriptor. The DS is made out of a fieldindicating the number of moments and the correspondingmoment descriptors. Overviews of all the descriptors andDS that we have presented can be seen in the followingfigure; a pseudo-code is used for the DDL.

Object

Histogram{int numberofchannels;float histogram[numberOfChannelsl;]

Scale{String unit,searchCriterion;float volume,sphereRadium,x,y,z;)

Color{int numberofchannels;String searchcriterion;Histogram red[numberOfChannelsl;Histogram green[numberOfChannelsl;Histogram blue[numberOf channels];)

Shape

Cord{int numberofchannels;String searchcriterion;Histogram cordAngle_l[numberOfChannelsl;Histogram cordAngle~2[numberOfChannelsl;Histogram cor~orm[numberOfChannelsl;l

Wavelet{int numberoflevels;int number0fMoments;String searchcriterion;Histogram histWavelet[numberOfMomentsl;l

Moments

int number0fMoments;String search Criterion;

Momentiint q,r,s;float value;}

Moment moments[numberOfMomentsl;I

1

Figure 2: 00 representation of D and DS.

377

Page 4: The MPEG-7 Standard and the Content-based Management of ...

Even if the non-normative part does not belong to thestandard it may have some serious impact on it. As statedearlier the non-normative part includes the extraction ofthe descriptors and the search engine. Let us start with thesearch engine. The search engine is concerned with thecomparison of the descriptors and the formulation of thequery. The comparison is the most critical part. It isrelated to the criterion that is used to compare thedescriptors. That criterion can be a metric like theHamming distance or the inner product or a non-metriclikes clustering. The fact that the comparison criterion isnot part of the standard may be problematic. As a matterof fact it is possible to optimize a descriptor for a givencomparison criterion. Our experiments have indicatedthat the Hamming distance seems to be the mostappropriate metric for the cord and wavelet descriptors.

Knowing that the descriptors and the comparison criterionmay be related to each other and that a bad criterion maysignificantly deteriorate the results we propose to add afield in the DS in order to indicate the type of comparisoncriterion that is recommended to use with the descriptors.It would be possible to standardize the nomenclature usedin that field. The parameter would indicate whichcomparison criterion should be used in the search. Thisparameter would not be mandatory and could beoverruled by the search engine. In any case the ultimatedecision would be left to the search engine.

As stated earlier the extraction of the descriptors is notpart of the standard. W C are going to review theextraction process in order to determine what are theimplications. We have proposed DS for the scale, colordistribution and shape. The extraction of the scale doesnot pose any particular problem because the volume, thedimensions of the smallest bounding box and the radius ofthe smallest sphere are not ambiguous: the details of theimplementation should not modify significantly theirvalues.

The case of the color distribution is more complicated.The color is defined in a non-unique way. Among themost common methods are texture mapping and vertices.A texture map is a color picture. By associating a set oftwo texture coordinates with each geometrical vertex it ispossible to map the texture on the surface of the object.The texture element is deformed in order to lit thegeometrical element on which it is mapped. That meansthat the color is defined both on the vertices and thesurface surrounding them. In the second case a color isassociated with each vertex. There is no color associateddirectly with the surrounding surface. In the simplest casethe color is defined as a triplet while in more sophisticatedmodels diffuse and specular color are taken into account.

For all those models the color representation can bereduced to a triplet and the corresponding histograms canbe computed. Even if the resulting color could vary fromone implementation to the next the variations are usuallysmall. In the worst case it is always possible to reduce thenumber of channel in the histograms. Actually we couldmodify our descriptors and DS for the color by defining aset of three histograms for the diffuse color, a set of threehistograms for the specular color and one histogram forthe specularity. If a unique triplet defines the color it ispossible to describe the color distribution by only usingthe histograms corresponding to the diffuse color leavingthe other histograms equal to zero. Such a DS wouldallow a more detailed description of the color distribution.A red, green and blue triplet is used for the color. This isnot the best representation from a human point of view.The hue, saturation and intensity representation and manyothers are more appropriated for that purpose. As amatter of fact the well-known JPEG is based on aluminance-chrominance representation. Nevertheless thered, green and blue representation is used because anyrepresentation can be converted to it. If require anotherrepresentation can be used in the comparison process byconverting the reference triplet to the appropriate system.

For the shape analysis two aspects are involved: datamodeling and descriptors computation. In order tocompute the cord descriptors a triangular mesh of theobject is needed. Many common three-dimensionalformats like VRML use triangular meshes as theirstandard representation. In many applications the objectare represented with parametric surfaces like NURBS ornon-uniform rational beta splines. Those representationscan always be reduced to a triangular mesh. As a mater offact most if not all rendering software and hardware needto reduce them to triangular meshes in order to performthe rendering. This is the reason why we use triangularmesh as our standard representation.

The next step is to compute the tensor of inertia. In thiscase the specific implementation of the computation maybe a problem. The reason is that the distribution of thevertices is not uniform on the surface of the object. Ausual case is a planar surface that is represented by a fewvertices while a highly curved surface is represented bymany of them. Consequently the area or mass around thevertices must be taken into account. That can be done byusing the center of mass of each triangle of the mesh andby attributing them a mass corresponding to the area oftheir surface. Of course they are many other ways: as longas the muss is taken into account all methodology shouldprovide a satisfactory result for the tensor of inertia. Thecomputation of the eigen vectors and values is a well-known problem and should not be of concern. The

378

Page 5: The MPEG-7 Standard and the Content-based Management of ...

computation of the orientations and modulus of the cordsshould provide results that are consistent because thosevalues are defined in an unambiguous way. Thus despitethe fact that the extraction and the search engine are notnormative it is possible to implement the descriptors andDS in a consistent way.

The wavelet descriptors can be easily extracted. Asexplained before the reference frame can be calculatedwithout particular problems and there is a fastimplementation of the wavelet transform. Thecomputation of the binary model is a well-knownproblem. Actually the voxel representation is often usedin order to generate a triangular mesh for three-dimensional raw data: the two representations arecomplimentary [g]. The moments-based descriptor is assimple to calculate as the tensor of inertia. Consequentlythe remarks are the same.

4. Evaluation of D and DS

MPEG-7 has issued a set of criterion for evaluating D andDS. The most important criteria used by MPEG-7 for thedescriptors are the effectiveness, the application domain,the expression efficiency or value calculation, theprocessing efficiency or matching, scalability and multi-level representation [I]. Let us review those criteria inorder to determine if our descriptors are suitable forMPEG-7.

The effectiveness determine if the descriptor capture animportant characteristic of the content. The scale is a veryimportant characteristic because it captures the physicaldimensions of the object. Only three-dimensional objectscan provide that information directly. The color is notspecific to three-dimensional objects but in manyapplications it is an important criteria for filtering andretrieval. The shape is also a criterion that is unique tothree-dimensional data in the sense that this is the onlyrepresentation that can p r o v i d e a complctc a n dunambiguous geometrical information as opposed topictures.

The application domain refers to the range of applicationdomains. Today three-dimensional data are used in CADapplications, inspection, visualization, virtual reality,medical applications and the Web. A large number ofscanners are available and important collections of three-dimensional objects already exist or are underconstruction like the CESAR project for human modeling.Those databases are confronted to the same problems:finding an efficient way to filter and retrieve useful

information. We believe that our descriptors can help toprovide an answer to that problem.

The expression efficiency determines if the descriptorexpresses the features precisely and completely. The cordrepresentation tends to capture the global and regionalcharacteristics while minimizing the importance of thelocal features. If local characteristics are needed it ispossible to increase the sensibility of the descriptors byadding more channels to the histograms. The waveletdescriptor describes the object at different levels ofresolution. That make it a good candidate for coarse tofine search. The same observation can be made for themoments-based descriptor.

The processing efficiency refers to the non-normative partof the standard: the extraction and the comparison of thedescriptors. As we have shown in a previous section thedescriptors can be easily extracted from most three-dimensional formats and the outcome of the process hasalmost no dependency on the particular implementation ofthe calculation.

An application is scalable if its performance does notdeteriorate with larger amount of data. Our descriptorsare scalable at the object level because the size of the corddescriptors is not a function of the size of the object file: itdoes not depend on the number of triangles. For thewavelet descriptors it depends logarithmically on thenumber of voxels. This is totally acceptable in mostapplications. Moments-based descriptors are alsoscalable: the size of the moment description does notdepend on the size of the data set.

At the database level our descriptors are also scalable inthe sense that the outcome of a query is determined bycomparing the reference descriptor with all the otherdescriptors. Consequently the complexity of the searchdepends linearly on the number of objects. This isacceptable for most applications and quit comment incontent-based applications.

Finally let us talk about the multi-level representation.That means that the descriptor represents the features atmultiple level of abstraction. In the cord case theabstraction level is related to the number of channels inthe histograms. If they are many channels the descriptionis precise but the level of abstraction is lower while if thenumber of channels is limited the description is lessprecise but the level of abstraction is higher. In case ofthe wavelet-based description the abstraction is controlledby the number of voxels representing the object: the morevoxels the more level of detail we have. Moments-based

379

Page 6: The MPEG-7 Standard and the Content-based Management of ...

descriptors also provide a multi-level description thatcorresponds to the orders of the moments.

According to MPEG-7 the DS should be evaluatedaccording to the following criterion [ 11: effectiveness,application domain, comprehensiveness, flexibility,extensibility, scalability, simplicity and abstraction atmultiple hierarchical levels. Our D.S is very simple: itencapsulates related descriptors in a single class. In thecase of the scale the volume, the dimensions of thebounding box and the radius of the sphere areencapsulated in a single class call the Scale. The red,green and blue histograms are also encapsulated in asingle class call the Color and the histogram describingthe distribution of the cords, the wavelet coefficients andthe moments are encapsulated in a single class calledShape.

A class called Object inherits those classes. The proposedscheme is simple and is scalable in the sense that it doesnot depend on the dimensions of the object. The schemecan be easily extended to three-dimensional scenes. Athree-dimensional scene is made out of objects. Our DScan describe each object present in the scene. AnotherDS can handle the relations between the objects. Theirrelative position and orientation could be such a relation.The DS is also flexible because more descriptors can beadded to the three basic classes: scale, shape and color.The DS does not have any multiple hierarchical levels butthe descriptors already provide those characteristics.

The DS provides a comprehensive description in the sensethat the object is composed of three classes that clearlydefine its major components: the scale, the colordistribution and the shape. It is left to the search engine tocombine those attributes in an appropriate fashion in orderto handle a particular query. The object scheme that wehave presented can also be extended to parts description.An object is made out of many parts. Each part can beconsidered as an object. So we can use our DS to describea part. A DS specifying the relations between the partscan describe the resulting object. Among the relations wehave relative positions and orientations, physicalconstraints and functionality. For the object parts andscenes we believe that the DS has to be applicationdependent in order to describe the relation between thecomponents in a useful way.

5. Conclusions

Descriptors and DS suitable for describing three-dimensional data in terms of scale, color and shape have

been introduced. Their implementation in the frameworkof MPEG-7 has been discussed as well as the implicationsof the normative and non-normative part of the standard.We have evaluated the descriptors and DS according tothe criteria stated by MPEG-7.

In the case of static objects the descriptors do not need tobe streamed with the data. They can remain be on thesame or on a remote location as long as there is a linkbetween them. Some formats like VRML 97 have enginesand script nodes that can be used to introduce a temporalbehavior. In that case the description has to be streamedwith the content. Streaming the descriptors provides aframe by frame description of the object but does notprovide any inter-frame description. As MPEG-I and 2are providing a coding for the inter-frame content webelieve that MPEG-7 should address the description ofinter-frame content. That description could include thedynamic of the action, the location of the action and ameasurement of the importance of the action relatively toother simultaneous actions.

References

[II

[21

[31

[51

[61

[71

WI

MPEG-7: Evaluation Process Document, ISO/IECJTC I /SC29/WG I I N2463, Atlantic City, October 1998.J. Rossignac. “Interactive Exploration of Distributed 3DDatabases over the Internet”, Procrediqs CompfttrrGruphics Ir~ternational Hawver. pp. 324-335 ( 1997).Y. Liu and F Dellaert. “A Classification Based SimilarityMetric for 3D Image Retrieval”. CVPR. pp. 800-805( 1998).Y. Gdalyahu and D. Weinshall. “Automatic HierarchicalClassification of Silhouettes of 3D Objects”, CVPR, pp.787-793 (I 998).J. H. Yi and D. M. Chelberg, “Model-Based 3D ObjectRecognition Using Bayesian Indexing”, Cor,z/>lcter Visimand Image Undrrstanding 69, pp. 87- 105 ( 1998).C. S. Chua and R Jarvis, “Point Signature: A NewRepresentation for 3D Object Recognition”, Internatimalhtrnal of Comprtrr Vision 25, pp. 63-85 (I 997).E. Paquet and M. Rioux, “Content-based Access ofVRML Libraries”, IAPR International Workshop onMulthedia lnfonnution Analysis and Retrieval, LectureNotes in Computer Sciences-Springer 1464, pp. 20-32( 1998).B. Curless and M. Levoy, “A Volumetric Method forBuilding Complex Models from Range Images”,SIGGRAPH, (I 996).