The MPEG-7 Multimedia Content Description Interface

June 6, 2006


Αναστασία Μπολοβίνου,

PhD candidate, Institute of Informatics and Telecommunications

NCSR Demokritos

Outline

• MPEG-7 motivation and scope
• Visual Descriptors (color, texture, shape)
• MPEG-7 retrieval evaluation criterion
• Similarity measures and MPEG-7 visual descriptors
• Building MPEG-7 Descriptors and Description Schemes with the Description Definition Language
• MPEG-7 VXM current state
• Towards the MPEG-7 Query Format Framework (queries and visual descriptor tools employed by the queries)
• Summary


Proliferation of audio-visual content

MPEG-7 motivation and design scenarios (possible queries)

• Music/audio: play a few notes and return music with similar music/audio

• Images/graphics: draw a sketch and return images with similar graphics

• Text/keywords: find AV material with subject corresponding to a keyword

• Movement: describe movements and return video clips with the specified temporal and spatial relations

• Scenario: describe actions and return scenarios where similar actions take place

Standardize multimedia metadata descriptions (to facilitate content-based multimedia retrieval) for various types of audiovisual information:

• Consumer content
• News
• Sports
• Scientific content
• Digital art galleries
• Recorded material

Scope of the Standard

Description production (extraction) -> Standard description (normative part of the MPEG-7 standard) -> Description consumption

* MPEG-7 does not specify (non-normative parts of MPEG-7):
- How to extract descriptions (feature extraction, indexing process, annotation & authoring tools, ...)
- How to use descriptions (search engine, filtering tool, retrieval process, browsing device, ...)
- The similarity between contents

-> The goal is to define the minimum that enables interoperability.


Information flow

Visual Descriptors

• Color Descriptors: Dominant Color, Scalable Color, Color Layout, Color Structure, GoF/GoP Color
• Texture Descriptors: Homogeneous Texture, Texture Browsing, Edge Histogram
• Shape Descriptors: Region Shape, Contour Shape, 3D Shape
• Motion Descriptors for Video: Camera Motion, Motion Trajectory, Parametric Motion, Motion Activity
• Localization: Region Locator, Spatio-Temporal Locator
• Other: Face Recognition

(Normative, basic, for localization)

Color Descriptors

• Color spaces: R, G, B; Y, Cr, Cb; H, S, V; Monochrome; Linear transformation of R, G, B; HMMD
• Constrained color spaces:
-> Scalable Color Descriptor uses HSV
-> Color Structure Descriptor uses HMMD
• Descriptors:
– Dominant Color
– Scalable Color (HSV space)
– Color Structure (HMMD space)
– Color Layout (YCbCr space)
– GroupOfFrames/Pictures

Scalable Color Descriptor (SCD)

• A color histogram in HSV color space
• Encoded by a Haar transform
• Feature vector: {NoCoef, NoBD, Coeff[..], CoeffSign[..]}
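The Haar encoding of a histogram can be illustrated with a minimal sketch. It assumes an unnormalized transform built from pairwise sums and differences; the function name `haar_1d` is hypothetical, and the standard's exact coefficient ordering, bin quantization and bit-plane handling are not reproduced here.

```python
def haar_1d(values):
    """Unnormalized 1-D Haar transform using pairwise sums and differences.

    Returns [overall sum, coarsest difference, ..., finest differences].
    Illustrative only: not the normative SCD coefficient layout.
    """
    n = len(values)
    if n == 0 or n & (n - 1):
        raise ValueError("length must be a power of two")
    sums = list(values)
    details = []
    while len(sums) > 1:
        pairs = list(zip(sums[0::2], sums[1::2]))
        # Differences from this (coarser) level go in front of the finer ones.
        details = [a - b for a, b in pairs] + details
        sums = [a + b for a, b in pairs]
    return sums + details


histogram = [1, 2, 3, 4]
coefficients = haar_1d(histogram)  # [10, -4, -1, -1]
```

Keeping only the leading coefficients gives a coarser version of the same histogram, which is what makes the representation scalable.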

SCD extraction

[Figure: SCD extraction pipeline. Stage labels recovered from the figure: to 11 bits/bin; to 4 bits/bin; N bits/bin (#bin < 256)]

GoF/GoP Color Descriptor

• Extends the Scalable Color Descriptor to a video segment or a group of pictures (the joint color histogram is then processed as in the SCD, i.e. Haar transform encoding)
• Histogram aggregation methods:
– Average: sensitive to outliers (lighting changes, occlusion, text overlays)
– Median: increased computational complexity because of sorting
– Intersection: differs from the other two; it gives a "least common" color trait viewpoint
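The three aggregation methods for GoF/GoP color can be sketched per histogram bin as follows. This is a minimal illustration on plain Python lists; the function name `aggregate_histograms` is hypothetical, and the subsequent Haar encoding is left out.

```python
from statistics import median


def aggregate_histograms(histograms, method="average"):
    """Combine per-frame histograms (equal length) into one joint histogram."""
    if not histograms:
        raise ValueError("need at least one histogram")
    bins = list(zip(*histograms))  # one tuple of per-frame values for each bin
    if method == "average":
        return [sum(b) / len(b) for b in bins]
    if method == "median":
        return [median(b) for b in bins]
    if method == "intersection":
        # Minimum per bin: the amount of each color common to every frame.
        return [min(b) for b in bins]
    raise ValueError("unknown method: " + method)


frames = [[4, 0, 2], [2, 2, 2], [9, 1, 2]]
joint_average = aggregate_histograms(frames, "average")
joint_median = aggregate_histograms(frames, "median")
joint_intersection = aggregate_histograms(frames, "intersection")
```

In this toy input the third frame has an outlying first bin (9): the average moves towards it, while the median and the intersection do not.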

Dominant Color Descriptor (DCD)

• Clustering colors into a small number of representative colors (salient colors)
• F = { {ci, pi, vi}, s}
– ci : representative colors
– pi : their percentages in the region
– vi : color variances
– s : spatial coherency

DCD Extraction (based on the generalized Lloyd algorithm)

• ci : centroid of cluster i
• x(n) : color vector at pixel n
• v(n) : perceptual weight for pixel n (human visual perception is more sensitive to changes in smooth regions)
• Spatial coherency: average number of connecting pixels of a dominant color, computed with a 3x3 masking window
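A weighted Lloyd iteration using the symbols above (ci, x(n), v(n)) might look like the sketch below. It assumes the usual weighted squared-error distortion, a fixed number of clusters and a fixed iteration count; the cluster splitting and merging steps, the color variances and the spatial coherency are omitted, and the names `nearest_index` and `weighted_lloyd` are hypothetical.

```python
def nearest_index(x, centroids):
    """Index of the centroid closest to color x (squared Euclidean distance)."""
    return min(range(len(centroids)),
               key=lambda i: sum((a - b) ** 2 for a, b in zip(x, centroids[i])))


def weighted_lloyd(pixels, weights, centroids, iterations=10):
    """pixels: color tuples x(n); weights: v(n); centroids: initial c_i."""
    centroids = [tuple(float(v) for v in c) for c in centroids]
    for _ in range(iterations):
        sums = [[0.0] * len(c) for c in centroids]
        totals = [0.0] * len(centroids)
        # Assignment step: each pixel contributes to its nearest centroid.
        for x, v in zip(pixels, weights):
            i = nearest_index(x, centroids)
            totals[i] += v
            for d, value in enumerate(x):
                sums[i][d] += v * value
        # Update step: c_i = sum(v(n) * x(n)) / sum(v(n)) over the cluster.
        centroids = [tuple(s / t for s in acc) if t > 0 else c
                     for acc, t, c in zip(sums, totals, centroids)]
    # Percentages p_i: share of pixels assigned to each final centroid.
    counts = [0] * len(centroids)
    for x in pixels:
        counts[nearest_index(x, centroids)] += 1
    percentages = [n / len(pixels) for n in counts]
    return centroids, percentages


pixels = [(0, 0, 0), (10, 0, 0), (200, 200, 200), (210, 200, 200)]
colors, shares = weighted_lloyd(pixels, [1, 1, 1, 1],
                                [(0, 0, 0), (255, 255, 255)])
```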


• http://debut.cis.nctu.edu.tw/Demo/ContentBasedVideoRetrieval/CBVR/Dominant/index.html

Color Layout Descriptor (CLD)

• Partitioning the image into 64 (8x8) blocks
• Deriving the average color of each block (or using DCD)
• Applying an (8x8) DCT and encoding
• Efficient for
– Sketch-based image retrieval
– Content filtering using image indexing

CLD extraction

-> The derived average colors are transformed into a series of coefficients by performing a DCT (data in the spatial domain -> data in the frequency domain).

-> If the spatial-domain data is smooth (little variation in the data), the transform concentrates the energy in the low-frequency coefficients and leaves the high-frequency coefficients small.

-> A few low-frequency coefficients are selected using zigzag scanning and quantized to form the CLD (large quantization step for the AC coefficients, small quantization step for the DC coefficient).

-> The color space adopted for the CLD is YCbCr.

F = {CoefPattern, YDCCoef, CbDCCoef, CrDCCoef, YACCoef, CbACCoef, CrACCoef}
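The extraction steps can be sketched for a single color channel as below. This is an illustration under stated assumptions: the channel is a list of rows at least 8 pixels wide and high, an orthonormal DCT-II is used, quantization is omitted, and all function names and the default of 6 coefficients are hypothetical.

```python
import math


def block_averages(channel, grid=8):
    """Average a 2-D channel (list of rows) over a grid x grid partition."""
    h, w = len(channel), len(channel[0])
    out = [[0.0] * grid for _ in range(grid)]
    for by in range(grid):
        for bx in range(grid):
            y0, y1 = by * h // grid, (by + 1) * h // grid
            x0, x1 = bx * w // grid, (bx + 1) * w // grid
            vals = [channel[y][x] for y in range(y0, y1) for x in range(x0, x1)]
            out[by][bx] = sum(vals) / len(vals)
    return out


def dct_2d(block):
    """Orthonormal 2-D DCT-II of a square block."""
    n = len(block)

    def alpha(k):
        return math.sqrt(1 / n) if k == 0 else math.sqrt(2 / n)

    return [[alpha(u) * alpha(v) * sum(
                block[y][x]
                * math.cos((2 * y + 1) * u * math.pi / (2 * n))
                * math.cos((2 * x + 1) * v * math.pi / (2 * n))
                for y in range(n) for x in range(n))
             for v in range(n)] for u in range(n)]


def zigzag(matrix):
    """Read a square matrix in zigzag order, lowest frequencies first."""
    n = len(matrix)
    order = sorted(((y, x) for y in range(n) for x in range(n)),
                   key=lambda p: (p[0] + p[1],
                                  p[0] if (p[0] + p[1]) % 2 else p[1]))
    return [matrix[y][x] for y, x in order]


def cld_channel(channel, n_coeffs=6):
    """First n_coeffs zigzag-ordered DCT coefficients of the 8x8 averages."""
    return zigzag(dct_2d(block_averages(channel)))[:n_coeffs]


flat = [[100] * 16 for _ in range(16)]
coeffs = cld_channel(flat)  # smooth input: only the DC coefficient is non-zero
```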

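Color Structure Descriptor (CSD)

• Scanning the image with an 8x8 structuring element
• Counting the number of structuring-element positions that contain each color
• Generating a color histogram (HMMD color space, 4 CSQ operating points)

[Figure: an 8 x 8 structuring element placed over the image, next to color bins C0 to C7; the bins of the colors present inside the element (C1, C3 and C7 in the figure) are each incremented by 1]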
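The counting rule can be sketched as follows, assuming the image has already been quantized to color-bin indices. The function name `color_structure_histogram` is hypothetical, and the sub-sampling of large images and the final bin quantization are left out.

```python
def color_structure_histogram(index_image, n_bins, size=8):
    """Count, for each color bin, the structuring-element positions containing it."""
    h, w = len(index_image), len(index_image[0])
    hist = [0] * n_bins
    for y in range(h - size + 1):
        for x in range(w - size + 1):
            # Set of distinct colors under the element at this position.
            present = {index_image[y + dy][x + dx]
                       for dy in range(size) for dx in range(size)}
            for color in present:
                hist[color] += 1  # +1 per position, however many pixels match
    return hist


# Tiny example with a 2x2 element so the result can be checked by hand.
image = [[0, 0, 1],
         [0, 0, 1],
         [2, 2, 2]]
structure = color_structure_histogram(image, n_bins=3, size=2)  # [4, 2, 2]
```

An ordinary histogram of the same image would be [4, 2, 3]; the structure histogram differs because it counts element positions rather than pixels.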


CSD extraction

If

Then sub sampling factor p is given by:

F = {colQuant, Values[m]}


CSD scaling

Texture Descriptors

• Homogeneous Texture Descriptor
• Non-Homogeneous Texture Descriptor (Edge Histogram)
• Texture Browsing

Homogeneous Texture Descriptor (HTD)

• Partitioning the frequency domain into 30 channels (modeled by a 2D Gabor function)
• Computing the energy and energy deviation for each channel
• Computing the mean and standard deviation of the frequency coefficients

-> F = {fDC, fSD, e1,…, e30, d1,…, d30}

• An efficient implementation: Radon transform followed by Fourier transform

HTD Extraction: how to get a 2-D frequency layout following the HVS

2-D image f(x,y) -> Radon transform -> 1-D projections P(R, θ) -> 1-D Fourier transform F(P(R, θ)) -> resulting sampling grid in polar coordinates

HTD Extraction: data sampling in feature channels

-> A 2D Gabor function is used to define the Gabor filter banks

• It is a Gaussian-weighted sinusoid
• It is used to model individual channels
• Each channel filters a specific type of texture

HTD properties

One can perform

• Rotation-invariant matching
• Intensity-invariant matching (fDC removed from the feature vector)
• Scale-invariant matching

F = {fDC, fSD, e1,…, e30, d1,…, d30}
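Rotation-invariant matching can be illustrated by comparing the channel values under every circular shift of the orientation index and keeping the smallest distance. The sketch assumes the 30 channel values are ordered scale by scale with 6 orientations per scale, and it uses a plain L1 distance; both choices, and the function names, are assumptions made for illustration rather than the normative matching procedure.

```python
def shift_orientations(values, m, n_orient=6):
    """Circularly shift the orientation index of a channel vector by m steps."""
    n_scale = len(values) // n_orient
    return [values[s * n_orient + (r + m) % n_orient]
            for s in range(n_scale) for r in range(n_orient)]


def rotation_invariant_distance(query, target, n_orient=6):
    """Smallest L1 distance over all orientation shifts of the target."""
    return min(
        sum(abs(q - t)
            for q, t in zip(query, shift_orientations(target, m, n_orient)))
        for m in range(n_orient))


energies = list(range(30))                   # stand-in for e1..e30
rotated = shift_orientations(energies, 2)    # same texture, rotated by 2 steps
plain = sum(abs(a - b) for a, b in zip(rotated, energies))
invariant = rotation_invariant_distance(rotated, energies)
```

The plain distance between the two vectors is non-zero, while the shift-minimized distance is zero.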

Texture Browsing Descriptor

-> Same spatial filtering procedure as the HTD: scale- and orientation-selective band-pass filters

• Regularity (periodic to random)
• Coarseness (fine grain to coarse)
• Directionality (in steps of 30°)

-> The texture browsing descriptor can be used to find a set of candidates with similar perceptual properties; the HTD is then used to get a precise similarity match list among the candidate images.

E.g. look for textures that are very regular and oriented at 30°.

Edge Histogram Descriptor (EHD)

• Represents the spatial distribution of five types of edges: vertical, horizontal, 45°, 135°, and non-directional
• Dividing the image into 16 (4x4) blocks
• Generating a 5-bin histogram for each block
• Retaining strong edges by thresholding the Canny edge operator output
• It is scale invariant
• F = {BinCounts[k]}, k = 80
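The 80-bin layout (16 sub-images x 5 edge types) can be sketched as below. Note the swap: instead of a Canny edge map, this sketch classifies non-overlapping 2x2 pixel blocks with five small directional filters, a scheme commonly described for the EHD. The filter coefficients, the threshold value and the function name are assumptions, and bin normalization and quantization are omitted.

```python
import math

# 2x2 filters applied to (top-left, top-right, bottom-left, bottom-right).
EDGE_FILTERS = [
    (1, -1, 1, -1),                        # vertical
    (1, 1, -1, -1),                        # horizontal
    (math.sqrt(2), 0, 0, -math.sqrt(2)),   # 45 degrees
    (0, math.sqrt(2), -math.sqrt(2), 0),   # 135 degrees
    (2, -2, -2, 2),                        # non-directional
]


def edge_histogram(gray, threshold=11):
    """80 counts: bin = 5 * sub_image_index + edge_type, sub-images in a 4x4 grid."""
    h, w = len(gray), len(gray[0])
    hist = [0] * 80
    for y in range(0, h - 1, 2):
        for x in range(0, w - 1, 2):
            block = (gray[y][x], gray[y][x + 1],
                     gray[y + 1][x], gray[y + 1][x + 1])
            strengths = [abs(sum(f * p for f, p in zip(flt, block)))
                         for flt in EDGE_FILTERS]
            best = max(range(5), key=strengths.__getitem__)
            if strengths[best] >= threshold:   # keep strong edges only
                sub = (4 * y // h) * 4 + (4 * x // w)
                hist[sub * 5 + best] += 1
    return hist


# 8x8 image: columns 0-2 are black, columns 3-7 are white (a vertical edge).
step = [[0] * 3 + [255] * 5 for _ in range(8)]
bins = edge_histogram(step)
```

In this example the vertical edge falls in the second column of sub-images, so only the vertical-edge bins of sub-images 1, 5, 9 and 13 are non-zero.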

EHD extraction

• Basic histogram: 80 bins
• Extended histogram: 150 bins (basic + global + semi-global, with 13 clusters for the semi-global bins)
• Edge map image obtained using the "Canny" edge operator

EHD evaluation

• Cannot be used for object-based image retrieval
• If Th_edge is set to 0, the EHD applies to binary edge images (sketch-based retrieval)
• The extended EHD achieves better results but does not exhibit rotation invariance

Shape Descriptors

• Region-based Descriptor
• Contour-based Shape Descriptor
• 2D/3D Shape Descriptor
• 3D Shape Descriptor

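Region-based Descriptor (RBD)

• Expresses pixel distribution within a 2-D object region
• Employs a complex 2D Angular Radial Transformation (ART)

$$F_{nm} = \langle V_{nm}(\rho,\theta),\, f(\rho,\theta) \rangle = \int_0^{2\pi}\!\int_0^1 V_{nm}^{*}(\rho,\theta)\, f(\rho,\theta)\, \rho \, d\rho \, d\theta$$

$$V_{nm}(\rho,\theta) = A_m(\theta)\, R_n(\rho), \qquad A_m(\theta) = \frac{1}{2\pi} \exp(jm\theta)$$

$$R_n(\rho) = \begin{cases} 1, & n = 0 \\ 2\cos(\pi n \rho), & n \neq 0 \end{cases}$$

m = 0, …, 11 (12 angular functions); n = 0, …, 2 (3 radial functions)

• F = {MagnitudeOfART[k]}, k = n x m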

Region-based Descriptor (2)

• Applicable to figures (a) – (e)
• Distinguishes (i) from (g) and (h)
• (j), (k), and (l) are similar

Advantages:
• Describes complex shapes with disconnected regions
• Robust to segmentation noise
• Small size
• Fast extraction and matching


Contour-Based Descriptor (CBD)

• It is based on Curvature Scale-Space representation

Curvature Scale-Space

• Finds the curvature zero-crossing points of the shape's contour (key points)
• Reduces the number of key points step by step, by applying Gaussian smoothing
• The positions of the key points are expressed relative to the length of the contour curve

CBD Extraction

• Repetitive smoothing of the X and Y contour coordinates by the low-pass kernel (0.25, 0.5, 0.25) until the contour becomes convex
• xCSS: location of the curvature zero-crossing points
• yCSS: filtering pass
• F = {NofPeaks, GlobalCurv[ecc][circ], PrototypeCurv[ecc][circ], HighestPeakY, peakX[k], peakY[k]}
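The repetitive smoothing can be sketched on a closed contour given as lists of X and Y coordinates. Curvature zero-crossings are approximated here by sign changes of the cross product of consecutive edge vectors; this approximation, the synthetic star-shaped test contour and the function names are assumptions made for illustration.

```python
import math


def smooth_closed(values):
    """One pass of the (0.25, 0.5, 0.25) kernel over a closed (circular) sequence."""
    n = len(values)
    return [0.25 * values[i - 1] + 0.5 * values[i] + 0.25 * values[(i + 1) % n]
            for i in range(n)]


def count_zero_crossings(xs, ys):
    """Number of turning-direction sign changes around the closed contour."""
    n = len(xs)
    signs = []
    for i in range(n):
        ax, ay = xs[i] - xs[i - 1], ys[i] - ys[i - 1]
        bx, by = xs[(i + 1) % n] - xs[i], ys[(i + 1) % n] - ys[i]
        signs.append(ax * by - ay * bx > 0)
    return sum(signs[i] != signs[i - 1] for i in range(n))


def passes_until_convex(xs, ys, max_passes=10000):
    """Smooth until no zero-crossings remain; return (passes, xs, ys)."""
    passes = 0
    while count_zero_crossings(xs, ys) > 0 and passes < max_passes:
        xs, ys = smooth_closed(xs), smooth_closed(ys)
        passes += 1
    return passes, xs, ys


# Five-lobed star-like contour sampled at 200 points.
angles = [2 * math.pi * k / 200 for k in range(200)]
star_x = [(1 + 0.3 * math.cos(5 * t)) * math.cos(t) for t in angles]
star_y = [(1 + 0.3 * math.cos(5 * t)) * math.sin(t) for t in angles]
initial_crossings = count_zero_crossings(star_x, star_y)
n_passes, final_x, final_y = passes_until_convex(star_x, star_y)
```

Recording the zero-crossing positions after each pass, rather than only the final pass count, would give the (xCSS, yCSS) points from which the peaks are taken.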

CBD Applicability

• Applicable to (a)
• Distinguishes differences in (b)
• Finds similarities in (c) - (e)

Advantages:
• Captures the shape very well
• Robust to noise, scale, and orientation
• It is fast and compact

Comparison (RB/CB descriptors)

• Blue: similar shapes by Region-Based
• Yellow: similar shapes by Contour-Based

How does MPEG-7 compare descriptors?

ANMRR (average normalized modified retrieval rank):

- A normalized measure that takes into account the different sizes of the ground-truth sets and the actual ranks obtained from the retrieval; retrievals that miss items are assigned a penalty.
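The measure can be sketched as follows, using the definition commonly given for the MPEG-7 core experiments: ranks beyond a cut-off K = min(4·NG, 2·GTM) are replaced by a penalty of 1.25·K, where NG is the ground-truth size of the query and GTM the largest ground-truth size over all queries. The function names and the use of `None` for a missed item are assumptions of this sketch.

```python
def nmrr(ranks, gtm):
    """Normalized modified retrieval rank for one query.

    ranks: 1-based retrieval rank of each ground-truth item
           (None if the item was not retrieved at all).
    gtm:   largest ground-truth set size over all queries.
    """
    ng = len(ranks)
    k = min(4 * ng, 2 * gtm)
    # Items ranked beyond K (or missed) receive the penalty rank 1.25 * K.
    penalized = [r if r is not None and r <= k else 1.25 * k for r in ranks]
    avr = sum(penalized) / ng
    mrr = avr - 0.5 * (1 + ng)
    return mrr / (1.25 * k - 0.5 * (1 + ng))


def anmrr(queries):
    """Average NMRR over queries; 0 is perfect retrieval, 1 is nothing found."""
    gtm = max(len(ranks) for ranks in queries)
    return sum(nmrr(ranks, gtm) for ranks in queries) / len(queries)


perfect = nmrr([1, 2, 3], gtm=3)
missed = nmrr([None, None, None], gtm=3)
overall = anmrr([[1, 2, 3], [None, None, None]])
```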

Traditional metric

Similarity between features

• Typically descriptors are multidimensional vectors (of low-level features)
• Similarity of two images in the vector feature space:
– the range query: all the points within a hyperrectangle aligned with the coordinate axes
– the nearest-neighbour or within-distance (α-cut) query: a particular metric in the feature space
– dissimilarity between statistical distributions: the same metrics or specific measures
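The query types can be sketched on a small list of feature vectors. The Euclidean metric and the function names are assumptions chosen for illustration; any metric suited to the descriptor could be substituted.

```python
import math


def range_query(points, lower, upper):
    """Indices of points inside the axis-aligned hyperrectangle [lower, upper]."""
    return [i for i, p in enumerate(points)
            if all(lo <= v <= hi for v, lo, hi in zip(p, lower, upper))]


def within_distance(points, query, alpha):
    """Alpha-cut: indices of points at Euclidean distance <= alpha from query."""
    return [i for i, p in enumerate(points) if math.dist(p, query) <= alpha]


def nearest_neighbours(points, query, k=1):
    """Indices of the k points closest to query, nearest first."""
    return sorted(range(len(points)),
                  key=lambda i: math.dist(points[i], query))[:k]


features = [(0, 0), (1, 1), (3, 4), (5, 5)]
in_box = range_query(features, (0, 0), (1, 2))
in_ball = within_distance(features, (0, 0), 5)
closest = nearest_neighbours(features, (4, 4), k=2)
```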


• http://nayana.ece.ucsb.edu/M7TextureDemo/Demo/client/M7TextureDemo.html

An example of CBIR system using HTD performing range query and NN query

Criticism on MPEG-7 distance measures

• MPEG-7 adopts feature vector space distances based on geometric assumptions about the descriptor space
• ...but these quantitative measures (low-level information) do not fit human similarity perception well

-> Researchers from other areas have developed alternative predicate-based models (descriptors are assumed to contain just binary elements, as opposed to continuous data), which express the existence of properties and convey high-level information

See "Pattern difference":

$$d_{PD}(X_i, X_j) = \frac{bc}{K^2}$$

K: number of predicates in the data vectors Xi, Xj
b: number of properties that exist in Xi but not in Xj
c: number of properties that exist in Xj but not in Xi
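The pattern difference can be sketched for two binary predicate vectors. The sketch assumes the usual contingency-table reading, in which b counts properties present only in Xi and c counts properties present only in Xj; the function name is hypothetical.

```python
def pattern_difference(xi, xj):
    """Pattern difference b*c / K**2 between two binary predicate vectors."""
    if len(xi) != len(xj) or not xi:
        raise ValueError("vectors must be non-empty and of equal length")
    k = len(xi)                                          # number of predicates
    b = sum(1 for p, q in zip(xi, xj) if p and not q)    # in Xi only
    c = sum(1 for p, q in zip(xi, xj) if q and not p)    # in Xj only
    return b * c / k ** 2


different = pattern_difference([1, 1, 0, 0], [1, 0, 1, 0])
identical = pattern_difference([1, 0, 1, 0], [1, 0, 1, 0])
```

The measure is zero whenever one vector's properties are a subset of the other's, since then either b or c is zero.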

How to build and deploy an MPEG-7 Description

A Description = a Description Scheme (structure) + a set of Descriptor Values (instantiation of a Descriptor for a given data set), both expressed in the Description Definition Language

MPEG-7 Description Tools are a library of standardized Descriptors and Description Schemes

Adopting XML Schema as the basis for the MPEG-7 DDL, and the resulting XML-compliant instances (Descriptions in MPEG-7 textual format), eases interoperability by using a common, generic, powerful and extensible representation format

How that works

Description Definition Language:

-> XML Schema (flexibility)
- XML Schema structural language components
- XML Schema datatype language components
- MPEG-7 specific extensions (MPEG-7 support for vectors, matrices and typed references)

-> Binary version (efficiency)

Formats: text format (XML), BiM format, or a mix of both

Descriptions enabled by the MPEG-7 tools

Content description tools (perceptual descriptions):
- content's spatio-temporal structure
- info on low-level features
- semantic info related to the reality captured by the content

Content management tools (archival-oriented descriptions):
- content's creation/production
- info on using the content
- info on storing and representing the content

Organization/Navigation/Access/User Interaction tools (additional info for organizing, managing and accessing the content):
- how objects are related and gathered in collections
- summaries/variations/transcoding to support efficient browsing
- user interaction info

Type hierarchy for top-level elements

<Mpeg7>
  <Description xsi:type="ContentEntity">
    <MultimediaContent xsi:type="VideoType">
      <Video id="video_example">
        <MediaInformation>...</MediaInformation>
        <TemporalDecomposition gap="false" overlap="false">
          <VideoSegment id="VS1">
            <MediaTime>
              <MediaTimePoint>T00:00:00</MediaTimePoint>
              <MediaDuration>PT2M</MediaDuration>
            </MediaTime>
            <VisualDescriptor xsi:type="GoFGoPColorType" aggregation="average">
              <ScalableColor numOfCoef="8" numOfBitplanesDiscarded="0">
                <Coeff>1 2 3 4 5 6 7 8</Coeff>
              </ScalableColor>
            </VisualDescriptor>
          </VideoSegment>
          ……
          …
          </VideoSegment>
        </TemporalDecomposition>
      </Video>
    </MultimediaContent>
  </Description>
</Mpeg7>

What DS to choose?

MPEG-7 provides DSs for describing the structure and semantics of AV content, plus content management

Content management information can be attached to individual Segments


Viewpoint of the structure: Segments

Structure description

[Figure: structure description of a video. Labels: Video Segment; Video Segments; Moving region; Moving regions; Segment decomposition; Relation link: above; attribute lists (Time, Color, Motion, Texture, Shape, Annotation) and (Time, Mosaic, Annotation)]

Segment Decomposition

[Figure: segment decomposition; axis labels: time, connectivity]

Content structural aspects (Segment DS tree)

• Annotate the whole image with a StillRegion
• Spatial segmentation at different levels
• Among different regions we could use SegmentRelationship description tools

Content structural aspects (Segment Relationship DS graph)

Temporal segments


Content Semantic aspects (SemanticGraph)


Example of Structure-Semantic Link DS

Content abstraction aspects (CoAbstr): hierarchical summary of a video

[Figure: hierarchical summary tree of frames, with labels f0, f00, f01, f02]

-> Enables rapid browsing and navigation (a sequential summary is also available)

(CoAbstr) Partitions and decompositions (ViewDecomposition DS)

Frequency-space graph


(CoAbstr) Content Variation

• Universal Multimedia Access: Adapt delivery to network and terminal characteristics

CoAbstr: a collection (CollectionStructure DS)

-> Groups segments, events, or objects into collection clusters and specifies properties that are common to the elements:
• The CollectionStructure DS also describes statistics and models of the attribute values of the elements, such as a mean color histogram for a collection of images.
• The CollectionStructure DS also describes relationships among collection clusters.

Reference Software: the XM

• XM implements
– MPEG-7 Descriptors (Ds)
– MPEG-7 Description Schemes (DSs)
– Coding Schemes
– DDL
• XM application types: extraction, search and retrieval, transcoding, description filtering

Beyond MPEG-7 version 1 (Ds & DSs in VXM)

ColorTemperature: This descriptor specifies the perceptual temperature feeling of the illumination color in an image, for browsing and display preference control purposes (user friendly). Four perceptual temperature browsing categories are provided: hot, warm, moderate, and cool. Each category is used for browsing images based upon its perceptual meaning. It uses the dominant color descriptor.

Illumination Invariant Color: wraps the color descriptors. One or more color descriptors processed by the illumination invariant method can be included in this descriptor.

Shape Variation: can describe shape variations in terms of Shape Variation Map and the statistics of the region shape description of each binary shape image in the collection. Shape Variation Map consists of StaticShapeVariation and DynamicShapeVariation. The former corresponds to 35 quantized ART coefficients on a 2-dimensional histogram of group of shape images and the latter to the inverse of the histogram except the background.

Media-centric description schemes: Three visual description schemes are designed to describe several types of visual contents. The StillRegionFeatureType contains several elementary descriptors to describe the characteristics of arbitrary shaped still regions.

Visual CE current phase

• The CE (Core Experiment) explores new technologies for identifying original images and their modified versions (N-1 modified versions), focusing on the accuracy and robustness of identification

-> Robustness is measured as the accuracy (HitRatio = k/N), calculated separately for each level of modification

Modifications:
• Brightness
• Size reduction
• Color to monochrome
• JPEG compression with varying quality factors
• Color reduction
• Crop
• Histogram equalization
• Blur
• Geometric transformation

Towards MPEG-7 Query Format

-> Although an interface for querying an MPEG-7 database is not yet supported, requirements have been drafted

[Figure: Client Application and MPEG-7 Database, connected through the Input Query Format, the Output Query Format and the Query Management Tools]

e.g.
- query by textual description
- combinations of query conditions
- specification of the structure of the result set

e.g. structure of the response containing the result set

e.g.
- specification of the exceptions
- relevance feedback


Basic search functionalities may include:

• Query by Description (the client application provides possible query criteria)
