
    Content-Based Video Retrieval and Compression: A Unified Solution

    HongJiang Zhang, John Y. A. Wang and Yucel Altunbasak

Hewlett-Packard Laboratories, 1501 Page Mill Rd., Palo Alto, CA 94304

    Abstract

Video compression and retrieval have been treated as separate problems in the past. In this paper, we present an object-based video representation that facilitates both compression and retrieval. Typically in retrieval applications, a video sequence is subdivided in time into a set of shorter segments, each of which contains similar content. These segments are represented by 2-D representative images called "key-frames" that greatly reduce the amount of data that is searched. However, key-frames do not describe the motions and actions of objects within a segment. We propose a representation that extends the idea of the key-frame to further include what we define as "key-objects". These key-objects consist of regions within a key-frame that move with similar motion. Thus our key-objects allow a retrieval system to more efficiently present information to users and assist them in browsing and retrieving relevant video content.

    1. Introduction

As computer networks improve and digital image and video libraries become more readily available to homes and offices via the Internet, the problems of bandwidth and of accessing relevant data become more challenging. Even now, the task of viewing and browsing through the vast amount of image and video data with conventional VCR-like interfaces is tedious, time-consuming, and unbearable. In order to make these databases widely usable, we need to develop more effective methods for content selection, data retrieval, and browsing.

Many compression algorithms and standards have been developed and adopted in the last two decades for efficient video transmission and storage. These include the well-accepted MPEG-1, MPEG-2, and H.26x standards based on simple block-transform techniques. These block-based representations, however, do not encode data in ways that are suitable for content-based retrieval [1]. This is understandable, since they were designed and optimized around bandwidth and distortion constraints. Many researchers have investigated using these compression schemes for browsing and temporal analysis, and have painstakingly realized the difficulties and challenges of using representations not designed to encode content information for retrieval applications.

In browsing and retrieval applications, the system needs to quickly sample the data, decide on relevance, and present the results to the user in a timely fashion. Consequently, it is necessary to consider some form of data compression and a scheme for content indexing in retrieval applications so that the system can achieve the required response. Thus, compression and content analysis must be addressed simultaneously in a unified framework [1].

As with text retrieval, where a document is divided into smaller sub-components such as sentences, phrases, words, letters, and numerals, a long video sequence must be organized into smaller and more manageable components consisting of scenes, shots, moving objects, and pixel properties such as color, texture, and motion. Ultimately, a compact representation that encodes data as a set of moving objects would be highly desirable. Several researchers have pursued these directions, such as the layered image representation work by Wang and Adelson [2]. An emerging video standard, MPEG-4, has also incorporated similar ideas based on these image layers. However, given these representations, there remain many challenges in the design and creation of effective indices based on image and video content.

In this paper we propose a scheme for video retrieval applications that extends the current key-frame retrieval approach to include attributes of moving objects called key-objects. We begin in Section 2 by reviewing current strategies used in video retrieval and identifying the desirable features. With these features, we develop a video representation that incorporates key-objects and show how it facilitates both content-based query and browsing. In Section 3, we present techniques for key-object analysis. Finally, we discuss applications of key-objects in a video database management system.

    2. Video Representation and Indexing

In this section, we briefly discuss current strategies used in video indexing and, based on these ideas, develop a strategy that involves a more semantic and object-based representation.

Current video indexing schemes rely on two primary concepts: video parsing (temporal segmentation) and content analysis. Video parsing involves the detection of temporal boundaries and the identification of meaningful segments of video. These segments are categorized in a hierarchy similar to the storyboards used in filmmaking. The top-level construct consists of sequences, which are composed of sets of scenes. Scenes are further partitioned into shots. Each shot contains a sequence of frames that expresses an event within the long video sequence.


In current systems, shot boundaries are identified based on quantitative differences between consecutive frames. Consequently, neighboring shots portray different actions or events. These techniques for detecting shot boundaries, which rely on simple temporal segmentation, achieve satisfactory performance. Many such algorithms rely on detecting discontinuities in color and motion between frames [3].
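As an illustration of this class of algorithm, the sketch below marks a shot boundary wherever the color-histogram difference between consecutive frames exceeds a threshold. This is a generic sketch, not the specific detector of [3]; the bin count and threshold are arbitrary assumptions:

```python
import numpy as np

def color_histogram(frame, bins=8):
    """Quantized RGB histogram of an (H, W, 3) uint8 frame, normalized."""
    hist, _ = np.histogramdd(
        frame.reshape(-1, 3).astype(float),
        bins=(bins, bins, bins),
        range=((0, 256), (0, 256), (0, 256)))
    return hist.ravel() / hist.sum()

def detect_shot_boundaries(frames, threshold=0.35):
    """Report frame indices where the histogram difference between
    consecutive frames is large, i.e. likely shot boundaries."""
    boundaries = []
    prev = color_histogram(frames[0])
    for i in range(1, len(frames)):
        cur = color_histogram(frames[i])
        # L1 distance between normalized histograms lies in [0, 2]
        if np.abs(cur - prev).sum() > threshold:
            boundaries.append(i)
        prev = cur
    return boundaries
```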

Given the shot boundaries and data partitions, representative frames within each shot are selected to convey the content. These 2-D image frames, called "key-frames", allow users of retrieval systems to quickly browse the entire video by viewing frames from selected time samples that describe the highlights of each shot. The use of key-frames greatly reduces the amount of data required for indexing and, furthermore, provides an organizational framework for dealing with video content. Thus, the problem of video retrieval and browsing becomes manageable, since it can rely on existing 2-D image database retrieval infrastructure to catalog the key-frames. These techniques are typically based on global analysis of color, texture, and simple motions [3, 4].

These global features are rather effective for general search applications because they describe the characteristics of the key-frame as a whole. However, they sometimes do not provide sufficient resolution for object-based retrievals and queries based on the properties of objects. As a result, key-frame retrieval is limited to searches over general color, texture, or signal attributes.

3. Object-based video indexing

Often it is desirable to qualify queries with particular attributes that involve objects and sub-regions within the viewable image. Some support for objects in the retrieval representation would be greatly advantageous in dealing with these interesting queries about objects and their motions.

Though object representations could be useful, analysis of semantic objects is not always possible. Many researchers have investigated various segmentation techniques based on motion, color, texture, etc. to identify objects, though none can report that their segmentation algorithm identifies semantic objects. These techniques merely identify simple regions of coherent motion, color, or texture, and frequently cannot deal with complex scenes that involve multiple motions, actions, and objects. Despite these limitations, it is nevertheless useful to provide simple object abstractions where possible.

3.1 Key-object representation

Key-frames provide a suitable abstraction and framework for video browsing. However, we could support an even wider range of queries by defining smaller units within the key-frame. We call these smaller units "key-objects" because we use them to represent key regions that participate in distinct actions within the shot. Our key-objects do not necessarily correspond to semantic objects, since identifying semantic objects is an extremely difficult analysis problem. To avoid the semantic object analysis problem, we instead seek out regions of coherent motion. Our criterion of coherent motion is perceptually and physically motivated, since the points of an object exhibit motion coherence and people tend to group regions of similar motion into one semantic object. Thus motion coherence might capture some aspects of objects desirable in retrieval. The attributes that we attach to key-objects include color, texture, shape, motion, and life cycle. The color and texture attributes might be computed with algorithms described by previous authors [3].

We incorporate our key-objects within the shot-based key-frame representation. In the augmented representation, each shot is represented by one or more key-frames, which are further decomposed into key-objects. We also provide a general description of motion activity within the shot. In shots where key-objects cannot be reliably detected, a motion activity descriptor provides information about likely actions within the shot. Motion activity captures the general motions in the shot, such as global motions arising from camera pan or zoom. For example, our motion activity descriptor can be used to distinguish "shaky" sequences captured with a hand-held camera from professionally captured sequences.

Furthermore, there are advantages to decomposing key-object motion into a global component and a local, object-based component. In this decomposition, the key-object motion can more easily be used to reflect motion relative to the background and to other key-objects in the scene. Without this distinction, the key-object motion would instead represent motion relative to the image frame. Thus this decomposition provides a more meaningful and effective description for retrieval.

We summarize our descriptors for video indexing as follows:

Sequence
Sequence-ID: unique index key of the sequence;
Shots: { Shot(1), Shot(2), ..., Shot(N) }

Shot
Shot-ID: unique index key of the shot;
Motion-Activity: mean/dominant motion, activity based on variance;
Objects: { Object(1), Object(2), ..., Object(N) }

Object
Object-ID: identification number of an object within the shot;
Object-Shape: alpha map of the object;
Object-Life: relative time frame in which the object appears in and disappears from the shot;
Object-Color/Texture: the color and texture of the object;
Object-Motion: the trajectory and motion model parameters of the object.

In a decomposition where object attributes do not change drastically over the entire shot, we need only one representative description for each attribute. However, because object attributes do often change, a single descriptor may not be representative. In this case, we derive a set of attributes from various instances in time; thus, each key-object is a collection of "key-instances".
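As an illustration, the descriptor hierarchy above could be captured with simple records; the Python sketch below is one possible encoding, with field types that are our own assumptions rather than part of the paper:

```python
import numpy as np
from dataclasses import dataclass, field
from typing import List, Tuple

@dataclass
class KeyObject:
    """One key-object; fields mirror the Object descriptor above. When
    attributes change over the shot, several of these records can be
    kept per object, one per key-instance."""
    object_id: int
    shape: np.ndarray           # alpha map of the object
    life: Tuple[int, int]       # frames at which the object appears/disappears
    color_texture: List[float]  # color/texture feature vector (illustrative)
    motion: List[float]         # trajectory and motion model parameters

@dataclass
class Shot:
    shot_id: int
    motion_activity: Tuple[float, float]  # mean/dominant motion, variance
    objects: List[KeyObject] = field(default_factory=list)

@dataclass
class Sequence:
    sequence_id: int
    shots: List[Shot] = field(default_factory=list)
```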


    3.2 Key-object analysis

In key-object analysis, we want to identify and group regions within a shot that move with coherent motion. There exist several techniques for achieving this goal. We use the motion segmentation techniques described by Wang and Adelson [2]. The two major components of this algorithm are local motion estimation and motion model estimation. This algorithm is similar to other techniques of this kind, except that it makes several optimizations that improve robustness. It also reduces computation in our analysis procedure, because we reuse the local motion information to derive a measure of motion activity.

The local motion field can be obtained by optic-flow estimation techniques. Optic-flow estimation produces a dense motion field that describes the motion between consecutive frames for every pixel in the image. Based on this observation of local motion, various motion models are hypothesized and their motion parameters iteratively refined. These models can range from simple translation to complex planar parallax motion. We find that affine motion models provide a good compromise between complexity and stability. These models are used to classify pixels, or to partition the image into coherently moving regions.
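To make the model-fitting step concrete, the following sketch fits a six-parameter affine model to a dense flow field by least squares and computes per-pixel residuals that can drive the region assignment. It is a generic illustration, not the authors' exact algorithm; the parameterization u = a0 + a1·x + a2·y, v = a3 + a4·x + a5·y is an assumption:

```python
import numpy as np

def fit_affine_motion(flow, mask=None):
    """Least-squares fit of u = a0 + a1*x + a2*y, v = a3 + a4*x + a5*y
    to a dense flow field of shape (H, W, 2); `mask` optionally restricts
    the fit to one candidate region."""
    h, w, _ = flow.shape
    ys, xs = np.mgrid[0:h, 0:w]
    if mask is None:
        mask = np.ones((h, w), dtype=bool)
    A = np.stack([np.ones(mask.sum()), xs[mask], ys[mask]], axis=1)
    u_params = np.linalg.lstsq(A, flow[..., 0][mask], rcond=None)[0]
    v_params = np.linalg.lstsq(A, flow[..., 1][mask], rcond=None)[0]
    return np.concatenate([u_params, v_params])  # [a0, ..., a5]

def model_residual(flow, params):
    """Per-pixel disagreement between observed and modeled flow; pixels
    with low residual can be assigned to the region's motion model."""
    h, w, _ = flow.shape
    ys, xs = np.mgrid[0:h, 0:w]
    u = params[0] + params[1] * xs + params[2] * ys
    v = params[3] + params[4] * xs + params[5] * ys
    return np.hypot(flow[..., 0] - u, flow[..., 1] - v)
```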

In cases where clear and distinct motion regions exist, we track each region throughout the duration of the shot. These regions and their attributes of shape, color, and texture are cataloged along with the shot descriptors. Furthermore, we use the size of a region to determine the global motion. For example, a large region that includes the peripheral pixels likely corresponds to the background. This assumption performs fairly well when foreground objects are proportionately smaller than the background.

The key-object motions are adjusted to reflect motion relative to this background motion. These adjustments to the affine parameters can easily be completed with a simple matrix transformation. Likewise, the relative motion between key-objects can be computed in a similar fashion. Figure 1 shows an example of optic-flow calculated between two frames of a video shot, where the moving object, a man riding a horse, can be seen clearly.

Fig. 1: (a) One key-frame and (b) its corresponding optic-flow. The key-object of a man riding a horse is clearly visible.
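One plausible reading of that matrix transformation, assuming the affine parameterization of the earlier fitting sketch, lifts each motion to a homogeneous 3x3 matrix and composes the object's map with the inverse of the background's map:

```python
import numpy as np

def to_homogeneous(a):
    """[a0..a5] displacement-style affine -> 3x3 map of (x, y, 1)."""
    return np.array([[1 + a[1], a[2],     a[0]],
                     [a[4],     1 + a[5], a[3]],
                     [0.0,      0.0,      1.0]])

def from_homogeneous(M):
    """Inverse of to_homogeneous: 3x3 map back to [a0..a5]."""
    return np.array([M[0, 2], M[0, 0] - 1.0, M[0, 1],
                     M[1, 2], M[1, 0], M[1, 1] - 1.0])

def relative_motion(obj_params, bg_params):
    """Express a key-object's frame-to-frame motion relative to the
    background by undoing the background's map after the object's."""
    M = np.linalg.inv(to_homogeneous(bg_params)) @ to_homogeneous(obj_params)
    return from_homogeneous(M)
```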

Because of the difficulties of motion segmentation and the limitations of current algorithms, complex scenes often cannot be decomposed into distinct coherent regions. Under these circumstances, we resort to computing a simple measure of motion activity. This motion activity includes measurements of the mean, variance, and distribution of the observed optic-flow motion. These measurements provide information about the complexity and distribution of motion that might be useful in queries. The Wang and Adelson approach of model estimation from optic-flow data nicely complements our analysis of key-objects and motion activity. The local motion is computed once and used both to compute activity and to identify key-objects. Thus, the amount of computation is greatly reduced compared with an approach that uses a dominant motion estimation algorithm, which requires many costly image warps and motion estimations.
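A minimal sketch of such a motion activity measure over a dense flow field; the choice of eight histogram bins is an arbitrary assumption:

```python
import numpy as np

def motion_activity(flow):
    """Mean, variance, and a coarse distribution of flow magnitudes
    for an (H, W, 2) optic-flow field."""
    mag = np.hypot(flow[..., 0], flow[..., 1])
    hist, _ = np.histogram(mag, bins=8, range=(0, mag.max() + 1e-6))
    return {
        "mean": float(mag.mean()),
        "variance": float(mag.var()),
        "distribution": (hist / hist.sum()).tolist(),
    }
```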

    3.3 Key-frame selection for video browsing

As described earlier, in video browsing applications key-frames provide a quick and effective way to convey the video content of shot segments. The effectiveness of key-frame browsing depends on the choice of the image frames selected to represent the shot. The image frames within a shot are not all equally descriptive. Certain frames may provide more information about the objects and actions within the shot than others. Although, as a convenience, image frames at shot boundaries are frequently used as key-frames for video browsing, they often "miss" the more interesting actions that occur in the middle of the shot.

Below we outline some strategies for key-frame selection, based on the motion activity and attributes of key-objects within the shot (a minimal sketch of criterion (i) follows the list):

i) a key-object enters or leaves the image frame boundaries;

ii) key-objects participate in an occlusion relationship;

iii) two key-objects are at their closest distance;

iv) key-object attributes (e.g., color, shape, motion) reach their mean or extrema;

v) key-frames should have some small amount of background object overlap.
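For instance, criterion (i) follows directly from the Object-Life descriptor; the sketch below reuses the illustrative Shot and KeyObject records from Section 3.1:

```python
def keyframe_candidates(shot, num_frames):
    """Criterion (i): propose a key-frame at each frame where a
    key-object enters or leaves the shot, clamped to valid indices."""
    picks = set()
    for obj in shot.objects:
        first, last = obj.life
        picks.add(min(max(first, 0), num_frames - 1))
        picks.add(min(max(last, 0), num_frames - 1))
    return sorted(picks)
```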

Figures 1 and 2 show three key-frames selected from the video sequence according to the criteria outlined above.

Fig. 2: Key-frames selected based on moving objects: (a) the horse enters the scene; (b) the horse leaves the scene.

Another approach to presenting 2-D images to the user for effective video browsing involves computing image mosaics [5]. Image mosaics can be used effectively for image retrieval because the mosaic, which is derived by condensing information from many image frames to produce an expanded view or "panorama" of the scene, quickly conveys the elements of the scene. However, mosaics are not generally applicable or computable for all


shots. In situations where they are computable, our proposed video description and analysis of key-objects help us build these mosaics.

4. Object-based retrieval and compression

The object-based representation has advantages beyond its use in key-frame extraction and video mosaic construction. It widens the range of queries for content-based video retrieval. For instance, many queries involve questions about figure/ground objects, such as a bright red object moving over a green background. Our representation scheme contains all the features required to support such queries.

In our scheme, the matching between the query and candidate video sequences is based on the visual attributes of individual objects and/or their compositions in shots. That is, for queries searching for the presence of one or more objects in a shot, the similarity between the query and the candidate is defined as the similarity between the key-objects. Formally, if two shots are denoted as $S_i$ and $S_j$, and their key-object sets as $K_i = \{o_{i,m},\ m = 1, \ldots, M\}$ and $K_j = \{o_{j,n},\ n = 1, \ldots, N\}$, then the similarity between the two shots can be defined as

$$
S(S_i, S_j) = \sum_{m=1}^{M} \max_{1 \le n \le N} s_k\left(o_{i,m},\, o_{j,n}\right) \qquad (1)
$$

where $s_k$ is a similarity metric between two objects; there are in total $M \times N$ similarity values, from which the maximum over $n$ is selected for each $m$. This definition states that the similarity between two shots is the sum of the similarities of the most similar key-object pairs.
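A direct Python rendering of Eq. (1), with histogram intersection as one example object metric over the illustrative color_texture field (both helper names are our own, not from the paper):

```python
import numpy as np

def shot_similarity(K_i, K_j, s_k):
    """Eq. (1): sum, over the key-objects of shot i, of the similarity
    to the best-matching key-object of shot j (M*N evaluations)."""
    if not K_i or not K_j:
        return 0.0
    return sum(max(s_k(o_i, o_j) for o_j in K_j) for o_i in K_i)

def color_similarity(o_i, o_j):
    """Histogram intersection of the objects' color feature vectors."""
    hi = np.asarray(o_i.color_texture)
    hj = np.asarray(o_j.color_texture)
    return float(np.minimum(hi, hj).sum())
```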

Note that one might be interested in a similarity measure between two objects defined by any combination of the object attributes. Likewise, when key-objects consist of several key-instances, we can easily extend the similarity measure to compare the attributes of the various instances.

The proposed object-based representation also provides a unified framework for both video indexing and compression. Using this framework in video coding, we may achieve higher compression ratios and support content-based data access. As shown in Figure 3, the major difference from traditional video coding schemes is that a content analysis module is added to the compression process. Video is first segmented into shots, each shot is then decomposed into objects with extracted content features, and, finally, the encoding module encodes the objects. The size of the resultant meta-data (content features) is relatively small compared with the image data, and it can be further compressed with conventional algorithms. The emerging video standard, MPEG-4, also incorporates a similar object-based scheme in compression; however, retrieval has not been its focus [6].

At the decoder end, the compressed data are decoded into object image maps and meta-data. At this point, visual browsing and querying can proceed without the sophisticated analysis of compressed video required when browsing MPEG-2 streams. For example, we can use the color histograms coded with key-objects to help retrieve shots with similar key-objects. Furthermore, with key-objects and their motion features, key-frames or image mosaics of retrieved shots can be generated efficiently for browsing.
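A retrieval front-end might then rank decoded shots by Eq. (1) over their key-objects' color histograms; this sketch reuses the hypothetical shot_similarity and color_similarity helpers defined above:

```python
def retrieve_similar_shots(query_shot, database, top_k=5):
    """Rank stored shots by their Eq. (1) similarity to the query shot,
    matching key-objects on decoded color histograms."""
    scored = [(shot_similarity(query_shot.objects, shot.objects,
                               color_similarity), shot)
              for shot in database]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [shot for _, shot in scored[:top_k]]
```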

Fig. 3: Data and process flow diagram of a content-based video compression and retrieval process. (The diagram's blocks include a video content analyzer, object and meta-data encoders, a channel, a database with query and index paths, object and meta-data decoders, and synthesis, editing, and presentation modules.)

5. Conclusion

Video retrieval, browsing, and compression face similar problems. In this paper, we proposed a compact representation that enables efficient search and browsing. Our approach involves key-objects, which extend current key-frame techniques and support more descriptive queries about objects and their actions. Furthermore, we use key-object analysis to assist in the selection of key-frames. Techniques for using key-objects in retrieval were discussed, and their advantages over key-frame techniques demonstrated.

    6. References

1. R. Picard, "Content Access for Image/Video Coding: The Fourth Criterion," Proc. of ICPR, Oct. 1994.

2. J. Y. A. Wang and E. H. Adelson, "Representing Moving Images with Layers," IEEE Trans. on Image Processing, 3(5):625-638, September 1994.

3. H. J. Zhang, et al., "Video Parsing, Retrieval and Browsing: An Integrated and Content-Based Solution," Proc. of ACM Multimedia '95, San Francisco, Nov. 7-9, 1995, pp. 15-24.

4. M. M. Yeung and B. Liu, "Efficient Matching and Clustering of Video Shots," Proc. of ICIP '95, Oct. 1995, pp. 338-341.

5. L. Teodosio and W. Bender, "Salient Video Stills: Content and Context Preserved," Proc. of ACM Multimedia '93, August 1993.

6. "Description of MPEG-4," ISO/IEC JTC1/SC29/WG11 N1410, Oct. 1996.