MPEG-7 STANDARD TOWARDS INTELLIGENT AUDIO-VISUAL INFORMATION HANDLING.


MPEG-7, formally named “Multimedia Content Description Interface”, is a standard for describing multimedia content data in a way that supports some degree of interpretation of the information’s meaning, so that the description can be passed on to, or accessed by, a device or a computer code. MPEG-7 is not aimed at any one application in particular; rather, the elements that MPEG-7 standardizes support as broad a range of applications as possible.

Examples of such applications are multimedia digital libraries, broadcast media selection, multimedia editing and home entertainment devices. MPEG-7 is also intended to make the web as searchable for multimedia content as it is searchable for text today. This would apply especially to large content archives, which are being made accessible to the public, as well as to multimedia catalogues enabling people to identify content for purchase.

Applications of MPEG-7

OBJECTIVES OF MPEG-7 STANDARD

The MPEG-7 standard aims at providing standardized core technologies allowing description of audiovisual data content in multimedia environments. Audiovisual data content that has MPEG-7 data associated with it may include: still pictures, graphics, 3D models, audio, speech, video, and composition information about how these elements are combined in a multimedia presentation (scenarios). Special cases of these general data types may include facial expressions and personal characteristics.

• Broadcast media selection (e.g., radio channel, TV channel).
• Cultural services (history museums, art galleries, etc.).
• Digital libraries (e.g., image catalogue, musical dictionary, film, video and radio archives).
• E-Commerce (e.g., personalised advertising, on-line catalogues, directories of e-shops).
• Education (e.g., repositories of multimedia courses, multimedia search for support material).
• Home Entertainment (e.g., systems for the management of personal multimedia collections, including manipulation of content, e.g. home video editing, searching a game, karaoke).
• Investigation services (e.g., human characteristics recognition, forensics).
• Journalism (e.g., searching speeches of a certain politician using his name, his voice or his face).
• Multimedia directory services (e.g., yellow pages, tourist information, geographical information systems).
• Multimedia editing (e.g., personalised electronic news service, media authoring).
• Remote sensing (e.g., cartography, ecology, natural resources management).
• Shopping (e.g., searching for clothes that you like).
• Social (e.g., dating services).
• Surveillance (e.g., traffic control, surface transportation, non-destructive testing in hostile environments).

APPLICATION AREAS OF MPEG-7

MPEG-7 Description Tools allow the creation of descriptions of content that may include:

Information describing the creation and production processes of the content (director, title, short feature movie)

Information related to the usage of the content (copyright pointers, usage history, broadcast schedule)

Information of the storage features of the content (storage format, encoding)

Structural information on spatial, temporal or spatio-temporal components of the content (scene cuts, segmentation in regions, region motion tracking)

Information about low level features in the content (colors, textures, sound timbres, melody description)

Conceptual information of the reality captured by the content (objects and events, interactions among objects)

All these descriptions are coded in an efficient way for searching, filtering, etc.

APPLICATION MODEL

Parts of the standard

• MPEG-7 Visual - the Description Tools dealing with Visual descriptions.
• MPEG-7 Audio - the Description Tools dealing with Audio descriptions.
• MPEG-7 Multimedia Description Schemes - the Description Tools dealing with generic features and multimedia descriptions.
• MPEG-7 Description Definition Language - the language defining the syntax of the MPEG-7 Description Tools and for defining new Description Schemes.

Structure of the descriptions

The main elements of the MPEG-7 standard are:

• Descriptors (D): representations of Features, that define the syntax and the semantics of each feature representation,

• Description Schemes (DS): specify the structure and semantics of the relationships between their components. These components may be both Descriptors and Description Schemes,

• Description Definition Language (DDL): to allow the creation of new Description Schemes and, possibly, Descriptors, and to allow the extension and modification of existing Description Schemes,

• System tools: to support multiplexing of descriptions, synchronization issues, transmission mechanisms, coded representations (both textual and binary formats) for efficient storage and transmission, management and protection of intellectual property in MPEG-7 descriptions, etc.

MPEG-7 Description Definition Language (DDL)

The DDL defines the syntactic rules to express and combine Description Schemes and Descriptors.

DDL is a schema language to represent the results of modeling audiovisual data, i.e. DSs and Ds.

It was decided to adopt XML Schema Language from W3C as the MPEG-7 DDL. The DDL will require some specific extensions to XML Schema.

WHAT IS XML ?

Extensible Markup Language (XML) is a subset of SGML (an ISO standard). Its goal is to enable generic SGML to be processed on the Web in the same way that is now possible with HTML. XML has been designed for ease of implementation compared to SGML.

XML defines document structure and embeds it directly within the document through the use of markup. Markup is composed of two kinds of tags which encapsulate data: open tags and close tags. XML is similar to HTML, but the tags can be defined by the user. The definition of valid document structure is expressed in a language called DTD (Document Type Definition).

SIMPLE EXAMPLE of an XML DOCUMENT:

<letter>
  <header>
    <name>Mr John Smith</name>
    <address>
      <street>15 rue Lacepede</street>
      <city>Paris</city>
    </address>
  </header>
  <text>Dear Mr Doe, .....</text>
</letter>
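As a small illustration of how such markup exposes document structure to a program, the letter can be parsed with Python's standard `xml.etree.ElementTree` module. This is only a sketch; the variable names (`LETTER`, `city`, `child_tags`) are chosen here for illustration and are not part of the original material.

```python
import xml.etree.ElementTree as ET

# The letter document from the example, written as a string.
LETTER = """<letter>
  <header>
    <name>Mr John Smith</name>
    <address>
      <street>15 rue Lacepede</street>
      <city>Paris</city>
    </address>
  </header>
  <text>Dear Mr Doe, .....</text>
</letter>"""

root = ET.fromstring(LETTER)

# Navigate the structure that the open and close tags define.
city = root.findtext("header/address/city")
child_tags = [child.tag for child in root]

print(root.tag, child_tags, city)
```

The path `header/address/city` follows the nesting of the tags directly, which is the kind of structure a DTD or schema constrains.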

XML Schema: Structures

XML Schema consists of a set of structural schema components which can be divided into three groups. The primary components are:

• The Schema - the wrapper around the definitions and declarations;
• Simple type definitions;
• Complex type definitions;
• Attribute declarations;
• Element declarations.

The secondary components are:

• Attribute group definitions;
• Identity-constraint definitions;
• Model group definitions;
• Notation declarations.

The third group are the "helper" components, which contribute to the other components and cannot stand alone:

• Annotations;
• Model groups;
• Particles;
• Wildcards.

Simple example of a DTD file:

<!DOCTYPE letter [
  <!ELEMENT letter (header, text)>
  <!ELEMENT header (name, address)>
  <!ELEMENT address (street, city)>
  <!ELEMENT name (#PCDATA)>
  <!ELEMENT street (#PCDATA)>
  <!ELEMENT city (#PCDATA)>
  <!ELEMENT text (#PCDATA)>
]>

What is an XML schema?

The purpose of an XML schema is almost the same as that of a DTD, except that it goes beyond the current functionalities of a DTD and allows more precise datatype definitions and easier reuse of structure definitions. A schema can be seen as an extended DTD. Even more important, an XML schema is itself an XML document.

XML Schema Overview

The DDL can be broken down into the following logical normative components:

• XML Schema Structural components;

• XML Schema Datatype components;

• MPEG-7 Extensions to XML Schema

The MPEG-7 DDL is basically XML Schema Language but with some MPEG-7-specific extensions such as array and matrix datatypes.

The DDL allows complexTypes and simpleTypes to be defined. The complexTypes specify structural constraints, while simpleTypes express datatype constraints.

MPEG-7 Extensions to XML Schema

• Parameterized array sizes;

• Typed references;

• Built-in array and matrix datatypes;

• Enumerated datatypes for MimeType, CountryCode, RegionCode, CurrencyCode and CharacterSetCode

XML Schema Language parsers available:

• XSV - Open Source Edinburgh Schema Validator (written in Python)
• XML Spy - Validating XML Editor
• Xerces - Open source XML Parsers in Java and C++

EXAMPLE

<simpleType name="6bitInteger" base="nonNegativeInteger">
  <minInclusive value="0"/>
  <maxInclusive value="63"/>
</simpleType>
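To make the effect of the two facets concrete, the following sketch checks a lexical value against the same bounds (0 to 63 inclusive). The function name `is_valid_6bit_integer` is hypothetical; a real schema validator would apply the facets itself.

```python
# Hypothetical helper mirroring the minInclusive/maxInclusive facets
# of the "6bitInteger" simple type shown in the example.
MIN_INCLUSIVE = 0
MAX_INCLUSIVE = 63


def is_valid_6bit_integer(text):
    """Return True if text is an integer within [0, 63]."""
    try:
        value = int(text.strip())
    except ValueError:
        return False
    return MIN_INCLUSIVE <= value <= MAX_INCLUSIVE


for candidate in ("0", "63", "64", "-1", "abc"):
    print(candidate, is_valid_6bit_integer(candidate))
```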

A complex type definition is a set of attribute declarations and a content type, applicable to the attributes and children of an element declared to be of this complex type

<complexType name="Organization">
  <element name="OrganizationName" type="string"/>
  <element name="ContactPerson" type="Individual" minOccurs="0" maxOccurs="unbounded"/>
  <element name="Address" type="Place" minOccurs="0"/>
  <attribute name="id" type="ID" use="required"/>
</complexType>

XML Schema: Datatypes

Built-in Primitive Datatypes: string; boolean; float; double; decimal; timeDuration [ISO 8601]; recurringDuration; binary; uriReference; ID; IDREF; ENTITY; NOTATION; QName.

MPEG-7 Structural Extensions

Defining Arrays and Matrices

<simpleType name="IntegerMatrix3x4" base="integer" derivedBy="list">
  <mpeg7:dimension value="3 4" />
</simpleType>

<element name='IntegerMatrix3x4' type='IntegerMatrix3x4'/>

<IntegerMatrix3x4>
  5 8 9 4
  6 7 8 2
  7 1 3 5
</IntegerMatrix3x4>
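The way a dimension value such as "3 4" gives shape to a flat list of values can be sketched as follows. The sketch assumes that the element content is a whitespace-separated list of integers stored row by row; that ordering and the function name `parse_matrix` are assumptions made for illustration.

```python
def parse_matrix(text, dimension):
    """Reshape a whitespace-separated integer list using a "rows cols" dimension string."""
    rows, cols = (int(n) for n in dimension.split())
    values = [int(token) for token in text.split()]
    if len(values) != rows * cols:
        raise ValueError(
            f"expected {rows * cols} values for a {rows}x{cols} matrix, got {len(values)}"
        )
    # Assumption: values are listed row by row (row-major order).
    return [values[r * cols:(r + 1) * cols] for r in range(rows)]


matrix = parse_matrix("5 8 9 4 6 7 8 2 7 1 3 5", "3 4")
for row in matrix:
    print(row)
```

A list whose length does not match the declared dimensions is rejected, which is the constraint the dimension facet adds over a plain list type.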

VECTORS

<simpleType name="listOfInteger" base="integer" derivedBy="list" />

<complexType name="VectorI" base="listOfInteger" derivedBy="extension">
  <attribute ref="mpeg7:dim" />
</complexType>

<simpleType name="listOfFloat" base="float" derivedBy="list" />

<complexType name="VectorR" base="listOfFloat" derivedBy="extension">
  <attribute ref="mpeg7:dim" />
</complexType>

MPEG-7 Visual

MPEG-7 Visual Description Tools included in the standard consist of basic structures and Descriptors that cover the following basic visual features: Color, Texture, Shape, Motion, Localization, and Face recognition. Each category consists of elementary and sophisticated Descriptors.

Visual Descriptors

There are five Visual-related Basic structures: Grid layout, Time series, Multiple view, Spatial 2D coordinates, and Temporal interpolation.

Color Descriptors

There are seven Color Descriptors: Color space, Color Quantization, Dominant Colors, Scalable Color, Color Layout, Color-Structure, and GoF/GoP Color.

Basic Descriptors

VISUAL DESCRIPTORS AND DESCRIPTION SCHEMES

Descriptors: representations of features that define the syntax and the semantics of each feature representation.

Description Schemes: specify the structure and semantics of the relationships between their components, which may be both Descriptors and Description Schemes

BASIC STRUCTURES

GRID LAYOUT

The grid layout is a splitting of the image into a set of rectangular regions, so that each region can be described separately. Each region of the grid can be described in terms of other descriptors such as color or texture.

DDL representation syntax

<element name="GridLayout">
  <complexType content="empty">
    <attribute name="PartNumberH" datatype="positiveInteger"/>
    <attribute name="PartNumberV" datatype="positiveInteger"/>
  </complexType>
</element>

PartNumberH (16 bit)

This field contains the number of horizontal partitions in the grid over the image.

PartNumberV (16 bit)

This field contains the number of vertical partitions in the grid over the image.
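A minimal sketch of how the two partition counts could be turned into rectangular regions is given below. It assumes equal-sized partitions with integer pixel boundaries; the function name `grid_regions` and the box layout (left, top, right, bottom) are illustrative choices, not something the descriptor prescribes.

```python
def grid_regions(width, height, part_number_h, part_number_v):
    """Return (left, top, right, bottom) boxes for each grid cell, row by row."""
    boxes = []
    for v in range(part_number_v):
        top = v * height // part_number_v
        bottom = (v + 1) * height // part_number_v
        for h in range(part_number_h):
            left = h * width // part_number_h
            right = (h + 1) * width // part_number_h
            boxes.append((left, top, right, bottom))
    return boxes


# An 8x6 image split into 4 horizontal and 2 vertical partitions.
boxes = grid_regions(8, 6, 4, 2)
print(len(boxes), boxes[0], boxes[-1])
```

Each box could then be handed to another descriptor, for example a color or texture descriptor, so that every region is described separately.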

COLOR

Color space: several are supported

- RGB
- YCbCr
- HSV
- HMMD
- Linear transformation matrix with reference to RGB
- Monochrome

DDL representation syntax

<element name="ColorSpace">
  <complexType>
    <choice>
      <element name="RGB" type="emptyType"/>
      <element name="YCbCr" type="emptyType"/>
      <element name="HSV" type="emptyType"/>
      <element name="HMMD" type="emptyType"/>
      <element name="LinearMatrix">
        <complexType base="IntegerMatrix" derivedBy="restriction">
          <minInclusive value="0"/>
          <maxInclusive value="65535"/>
          <attribute name="sizes" use="fixed" value="3 3"/>
        </complexType>
      </element>
      <element name="Monochrome" type="emptyType"/>
    </choice>
  </complexType>
</element>

[Figure: HMMD space representation. Labels: White Color, Black Color, Max, Min, Sum, Diff, Hue]

Color quantization

This descriptor defines the quantization of a color space. The following quantization types are supported: uniform, subspace_uniform, subspace_nonuniform and lookup_table.

Dominant color

This descriptor specifies a set of dominant colors in an arbitrarily-shaped region. It targets content-based retrieval for color, either for the whole image or for an arbitrary region (rectangular or irregular).

DDL representation syntax

<element name="DominantColor">
  <complexType>
    <element ref="ColorSpace"/>
    <element ref="ColorQuantization"/>
    <element name="DomColorValues" minOccursPar="DomColorsNumber">
      <complexType>
        <element name="Percentage" type="unsigned5"/>
        <element name="ColorValueIndex">
          <simpleType base="unsigned12" derivedBy="list">
            <length valuePar="ColorSpaceDim"/>
          </simpleType>
        </element>
        <element name="ColorVariance" minOccurs="0" maxOccurs="1">
          <simpleType base="unsigned1" derivedBy="list">
            <length valuePar="ColorSpaceDim"/>
          </simpleType>
        </element>
      </complexType>
    </element>
    <attribute name="DomColorsNumber" type="DomColorsNumberType"/>
    <attribute name="SpatialCoherency" type="unsigned5"/>
  </complexType>
</element>
<simpleType name="DomColorNumberType" base="positiveInteger">

Descriptor semantics

DomColorsNumber

This element specifies the number of dominant colors in the region. The maximum allowed number of dominant colors is 8, the minimum number of dominant colors is 1.

VariancePresent

This is a flag used only in binary representation that signals the presence of the color variances in the descriptor.

SpatialCoherency

The image spatial variance (coherency) per dominant color captures whether or not a given dominant color is coherent and appears to be a solid color in the given image region.

NON-COHERENT AND COHERENT REGIONS

DESCRIPTORS ALREADY DEFINED FOR THE FOLLOWING ATTRIBUTES:

• COLOR (COLOR SPACE, QUANTIZATION, DOMINANT COLOR, SCALABLE COLOR, COLOR LAYOUT, COLOR STRUCTURE)
• COLOR HISTOGRAM FOR GROUP OF FRAMES
• TEXTURE (HOMOGENEOUS TEXTURE, TEXTURE BROWSING, EDGE HISTOGRAM)
• SHAPE (REGION SHAPE, CONTOUR SHAPE)
• MOTION (CAMERA MOTION, MOTION TRAJECTORY, PARAMETRIC MOTION, MOTION ACTIVITY)
• LOCALIZATION (REGION LOCATOR, SPATIO-TEMPORAL LOCATOR, WHICH INCLUDES FigureTrajectory AND ParameterTrajectory)

TEXTURE

Homogeneous texture

[Figure: frequency layout partitioned into channels (Ci), labelled by channel number (i)]

This descriptor provides similarity based image-to-image matching for texture image databases. In order to describe the image texture, energy and energy deviation feature values are extracted from a frequency layout and are used to constitute a texture feature vector for similarity-based retrieval.

30 ANGULAR CHANNELS FOR FREQUENCY LAYOUT

ENERGY FUNCTION IS DEFINED AS FOLLOWS

$$p_i = \sum_{\omega=0}^{1} \sum_{\theta=0^{\circ}}^{360^{\circ}} \left[ G_{P_{s,r}}(\omega,\theta) \, P(\omega,\theta) \right]^2$$

P(ω,θ) is the Fourier transform of an image represented in the polar frequency domain.

$$G_{P_{s,r}}(\omega,\theta) = \exp\left[ \frac{-(\omega-\omega_s)^2}{2\sigma_{\omega_s}^2} \right] \cdot \exp\left[ \frac{-(\theta-\theta_r)^2}{2\sigma_{\theta_r}^2} \right]$$

G is a Gaussian function.

$$e_i = \log_{10}\left[ 1 + p_i \right]$$

e_i is the energy in channel i.
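The channel energy can be sketched in a few lines, assuming that p_i is the sum over sampled (ω, θ) positions of the squared Gaussian-weighted spectrum value and that e_i = log10(1 + p_i). The sampling of the spectrum, the use of the magnitude for complex spectrum values, and all function and parameter names are assumptions made for this illustration.

```python
import math


def gaussian_weight(omega, theta, omega_s, theta_r, sigma_omega, sigma_theta):
    """Separable Gaussian centred on (omega_s, theta_r) in the polar frequency domain."""
    radial = math.exp(-((omega - omega_s) ** 2) / (2 * sigma_omega ** 2))
    angular = math.exp(-((theta - theta_r) ** 2) / (2 * sigma_theta ** 2))
    return radial * angular


def channel_energy(samples, omega_s, theta_r, sigma_omega, sigma_theta):
    """Return e_i = log10(1 + p_i) for one channel.

    samples is an iterable of (omega, theta, spectrum_value) triples.
    Assumption: for complex spectrum values the squared magnitude is used.
    """
    p_i = 0.0
    for omega, theta, spectrum_value in samples:
        g = gaussian_weight(omega, theta, omega_s, theta_r, sigma_omega, sigma_theta)
        p_i += abs(g * spectrum_value) ** 2
    return math.log10(1.0 + p_i)


# A single sample located exactly at the channel centre, with spectrum value 3.
energy = channel_energy([(0.5, 30.0, 3.0)], 0.5, 30.0, 0.1, 15.0)
print(energy)
```

The logarithm compresses the dynamic range of the energies, and adding 1 keeps the result at zero for an empty channel.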

DDL representation syntax

<element name="HomogeneousTexture">
  <complexType>
    <attribute name="FeatureType" type="boolean"/>
    <element name="AverageFeatureValue" type="unsigned8"/>
    <element name="StandardDeviationFeatureValue" type="unsigned8"/>
    <element name="EnergyComponents">
      <simpleType base="unsigned8" derivedBy="list">
        <length value="30"/>
      </simpleType>
    </element>
    <element name="EnergyDeviationComponents" minOccurs="0" maxOccurs="1">
      <simpleType base="unsigned8" derivedBy="list">
        <length value="30"/>
      </simpleType>
    </element>
  </complexType>
</element>

Texture browsing

This descriptor relates to a perceptual characterisation of texture, similar to a human characterisation, in terms of regularity, coarseness and directionality. This representation is useful for browsing applications and coarse classification of textures. We refer to this as the Perceptual Browsing Component (PBC).

RegularityComponent

This element represents the texture’s regularity. A texture is said to be regular if it is a periodic pattern with clear directionalities and of uniform scale.

[Figure: example textures labelled 11, 10, 01 and 00]

DirectionComponent

This element represents the dominant direction characterising the texture directionality.

ScaleComponent

This element represents the coarseness of the texture associated with the corresponding dominant orientation specified in the DirectionComponent.

Edge histogram

The edge histogram descriptor represents the spatial distribution of five types of edges, namely four directional edges and one non-directional edge:

a) vertical edge; b) horizontal edge; c) 45 degree edge; d) 135 degree edge; e) non-directional edge

DDL representation syntax

<element name="EdgeHistogram">
  <complexType>
    <element name="BinCounts">
      <simpleType base="unsigned8" derivedBy="list">
        <length value="80"/>
      </simpleType>
    </element>
  </complexType>
</element>

SHAPE

Region shape

The shape of an object may consist of either a single connected region or a set of disjoint regions, as well as some holes in the object.

[Figure: SHAPES]

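The region-based shape descriptor utilizes a set of ART (Angular Radial Transform) coefficients. ART is a 2-D complex transform defined on a unit disk in polar coordinates,

$$F_{nm} = \left\langle V_{nm}(\rho,\theta), f(\rho,\theta) \right\rangle = \int_{0}^{2\pi} \int_{0}^{1} V_{nm}^{*}(\rho,\theta) \, f(\rho,\theta) \, \rho \, d\rho \, d\theta$$

where $f(\rho,\theta)$ is an image function in polar coordinates, and $V_{nm}(\rho,\theta)$ is the ART basis function. The ART basis functions are separable along the angular and radial directions, i.e.,

$$V_{nm}(\rho,\theta) = A_m(\theta) \, R_n(\rho)$$

The angular and radial basis functions are defined as follows:

$$A_m(\theta) = \frac{1}{2\pi} \exp(jm\theta)$$

$$R_n(\rho) = \begin{cases} 1 & n = 0 \\ 2\cos(\pi n \rho) & n \neq 0 \end{cases}$$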

ART BASIS FUNCTIONS

CONTOUR SHAPE

The object contour shape descriptor describes a closed contour of a 2D object or region in an image or video sequence.

The object contour-based shape descriptor is based on the Curvature Scale Space (CSS) representation of the contour.

HOW IS THE CONTOUR CALCULATED?

N equidistant points are selected on the contour, starting from an arbitrary point on the contour and following the contour clockwise. The x-coordinates of the selected N points are grouped together, and the y-coordinates are also grouped together, into two series X and Y. The contour is then gradually smoothed by repetitive application of a low-pass filter with the kernel (0.25, 0.5, 0.25) to the X and Y coordinates of the selected N contour points.
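The smoothing step can be sketched as follows. Because the contour is closed, the sketch wraps around at both ends of the X and Y series; the wrap-around handling and the function name `smooth_closed_contour` are assumptions of this illustration, and the extraction of curvature zero crossings used by the CSS representation is not shown.

```python
KERNEL = (0.25, 0.5, 0.25)


def smooth_closed_contour(xs, ys, passes=1):
    """Apply the (0.25, 0.5, 0.25) low-pass filter to X and Y, `passes` times."""
    n = len(xs)
    for _ in range(passes):
        # Neighbours are taken modulo n because the contour is closed.
        xs = [KERNEL[0] * xs[(i - 1) % n] + KERNEL[1] * xs[i] + KERNEL[2] * xs[(i + 1) % n]
              for i in range(n)]
        ys = [KERNEL[0] * ys[(i - 1) % n] + KERNEL[1] * ys[i] + KERNEL[2] * ys[(i + 1) % n]
              for i in range(n)]
    return xs, ys


# Four contour points forming a square.
smooth_x, smooth_y = smooth_closed_contour([0, 2, 2, 0], [0, 0, 2, 2])
print(smooth_x, smooth_y)
```

Each pass pulls the points towards their neighbours, so repeated application progressively removes fine detail from the contour.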

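GlobalCurvatureVector

This element specifies global parameters of the contour, namely the Eccentricity and Circularity.

$$circularity = \frac{perimeter^2}{area}$$

FOR A CIRCLE, CIRCULARITY IS

$$C_{circle} = \frac{(2\pi r)^2}{\pi r^2} = 4\pi$$

$$eccentricity = \sqrt{\frac{i_{20} + i_{02} + \sqrt{i_{20}^2 + i_{02}^2 - 2 i_{20} i_{02} + 4 i_{11}^2}}{i_{20} + i_{02} - \sqrt{i_{20}^2 + i_{02}^2 - 2 i_{20} i_{02} + 4 i_{11}^2}}}$$

$$i_{02} = \sum (y - y_c)^2, \qquad i_{11} = \sum (x - x_c)(y - y_c), \qquad i_{20} = \sum (x - x_c)^2$$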
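Both global parameters can be computed directly from their definitions: circularity as perimeter squared over area, and eccentricity from the second-order moments i20, i02 and i11 about the centre (x_c, y_c). In the sketch below the centre is assumed to be the mean of the given points, and the function names are illustrative.

```python
import math


def circularity(perimeter, area):
    """Circularity = perimeter^2 / area."""
    return perimeter ** 2 / area


def eccentricity(points):
    """Eccentricity from second-order moments of a list of (x, y) points.

    Assumption: (x_c, y_c) is the mean of the points. The denominator is
    zero for points lying on a single line, which this sketch does not handle.
    """
    n = len(points)
    x_c = sum(x for x, _ in points) / n
    y_c = sum(y for _, y in points) / n
    i20 = sum((x - x_c) ** 2 for x, _ in points)
    i02 = sum((y - y_c) ** 2 for _, y in points)
    i11 = sum((x - x_c) * (y - y_c) for x, y in points)
    root = math.sqrt(i20 ** 2 + i02 ** 2 - 2 * i20 * i02 + 4 * i11 ** 2)
    return math.sqrt((i20 + i02 + root) / (i20 + i02 - root))


r = 3.0
circle_value = circularity(2 * math.pi * r, math.pi * r ** 2)
square_value = eccentricity([(1, 1), (-1, 1), (-1, -1), (1, -1)])
rectangle_value = eccentricity([(2, 1), (-2, 1), (-2, -1), (2, -1)])
print(circle_value, square_value, rectangle_value)
```

For the circle the radius cancels and the circularity is 4π; the square gives an eccentricity of 1 and the 2:1 rectangle gives 2.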

MOTION

Camera motion

This descriptor characterizes 3-D camera motion parameters. It is based on 3-D camera motion parameter information, which can be automatically extracted or generated by capture devices.

[Figure: camera motion types: track left, track right, boom up, boom down, dolly backward, dolly forward, pan right, pan left, tilt up, tilt down, roll]

Motion trajectory

Motion Trajectory is a high-level feature associated with a moving region, defined as a spatio-temporal localization of one of its representative points (such as the centroid).

Parametric motion

This descriptor addresses the motion of objects in video sequences, as well as global motion

Motion activity

The activity descriptor captures the intuitive notion of “intensity of action” or “pace of action” in a video segment. Examples of high activity include scenes such as “goal scoring in a soccer match”, “scoring in a baseball game”, “a high speed car chase”, etc. On the other hand, scenes such as “news reader shot”, “an interview scene”, “a still shot” etc. are perceived as low action shots.

Localization

Region locator

This descriptor enables localization of regions within images or frames by specifying them with a brief and scalable representation of a Box or a Polygon

Spatio-temporal locator

The SpatioTemporalLocator describes spatio-temporal regions in a video sequence and provides localization functionality especially for hypermedia applications. It consists of FigureTrajectory and ParameterTrajectory.

[Figure: a reference region and its motion]

FigureTrajectory

FigureTrajectory describes a spatio-temporal region by trajectories of the representative points of a reference region. Reference regions are represented by three kinds of figures: rectangles, ellipses and polygons
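To illustrate how the trajectory of one representative point (for example a rectangle or polygon vertex) yields its position at an arbitrary time, the sketch below interpolates linearly between key samples. Linear interpolation is an assumption of this sketch; the names `interpolate_point`, `key_times` and `key_points` are hypothetical, and the key times are assumed to be strictly increasing.

```python
def interpolate_point(key_times, key_points, t):
    """Linearly interpolate an (x, y) position at time t from key samples."""
    if not key_times[0] <= t <= key_times[-1]:
        raise ValueError("t is outside the described time interval")
    for i in range(len(key_times) - 1):
        t_a, t_b = key_times[i], key_times[i + 1]
        if t_a <= t <= t_b:
            (x_a, y_a), (x_b, y_b) = key_points[i], key_points[i + 1]
            w = (t - t_a) / (t_b - t_a)
            return (x_a + w * (x_b - x_a), y_a + w * (y_b - y_a))
    # Only reached when a single key sample is given.
    return key_points[-1]


# One vertex of a reference region moving from (0, 0) to (10, 20).
position = interpolate_point([0, 10], [(0, 0), (10, 20)], 5)
print(position)
```

Applying the same interpolation to every representative point of the reference region gives the region at that time.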

[Figure label: TemporalInterpolation D]



ParameterTrajectory

[Figure: motion parameters a1, a2, a3 and a4 plotted against time]


ParameterTrajectory describes a spatio-temporal region by a reference region and trajectories of motion parameters. Reference regions are described using the RegionLocator descriptor. Motion parameters and parametric motion model specify a mapping from the reference region to a region of an arbitrary frame

AUDIO DESCRIPTORS

• Audio Framework. The main hook into a description for all audio description schemes and descriptors.
• Spoken Content DS. A DS representing the output of Automatic Speech Recognition (ASR).
• Timbre Description. A collection of descriptors describing the perceptual features of instrument sounds.
• Audio Independent Components. A DS containing an Independent Component Analysis (ICA) of audio.

EXAMPLES

AudioPowerType

Describes the temporally-smoothed instantaneous power.

<complexType name="AudioPowerType" base="mpeg7:AudioSampledType" derivedBy="extension">
  <element name="Value" type="mpeg7:SeriesOfScalarType" maxOccurs="unbounded"/>
</complexType>

AudioSpectrumCentroidType

Describes the center of gravity of the log-frequency power spectrum.

<complexType name="AudioSpectrumCentroidType" base="mpeg7:AudioSampledType" derivedBy="extension">
  <element name="Value" type="mpeg7:SeriesOfScalarType" maxOccurs="unbounded"/>
</complexType>

• AudioDescriptorType
• AudioSampledType
• AudioWaveformEnvelopeType
• AudioSpectrumEnvelopeType
• AudioPowerType
• AudioSpectrumCentroidType
• AudioSpectrumSpreadType
• AudioFundamentalFrequencyType
• AudioHarmonicityType

THERE ARE QUITE A FEW AUDIO Ds

SPOKEN CONTENT DESCRIPTORS

Spoken Content DS consists of combined word and phone lattices for each speaker in an audio stream.

The DS can be used for two broad classes of retrieval scenario: indexing into and retrieval of an audio stream, and indexing of multimedia objects annotated with speech.

EXAMPLE APPLICATIONS

a) Recall of audio/video data by memorable spoken events. An example would be a film or video recording where a character or person spoke a particular word or sequence of words. The source media would be known, and the query would return a position in the media.

b) Spoken Document Retrieval. In this case, there is a database consisting of separate spoken documents. The result of the query is the relevant documents, and optionally the position in those documents of the matched speech.

[Figure: a lattice structure for a hypothetical (combined phone and word) decoding of the expression “Taj Mahal drawing …”. It is assumed that the name ‘Taj Mahal’ is out of the vocabulary of the ASR system.]

Definition of the SpokenContentHeader [DDL listing not preserved in this transcript]. A surviving comment from the listing notes that there must be at least one word or phone lexicon.

TIMBRE DESCRIPTOR

Timbre Descriptors aim at describing perceptual features of instrument sounds. Timbre is currently defined in the literature as the perceptual features that make two sounds with the same pitch and loudness sound different. The aim of the Timbre DS is to describe these perceptual features with a reduced set of descriptors. The descriptors relate to notions such as “attack”, “brightness” or “richness” of a sound.

DEFINITIONS ARE GIVEN: EXAMPLE

LOG-ATTACK-TIME

$$lat = \log_{10}(T1 - T0)$$

where T0 is the time the signal starts and T1 is the time the signal reaches its sustained part.
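The log-attack-time is the base-10 logarithm of the attack duration T1 - T0, which can be written as a one-line function. The sketch assumes both times are given in seconds and that T0 and T1 have already been estimated from the signal envelope; how they are estimated is not covered here.

```python
import math


def log_attack_time(t0, t1):
    """Return lat = log10(T1 - T0); t0 and t1 are assumed to be in seconds."""
    if t1 <= t0:
        raise ValueError("T1 must be later than T0")
    return math.log10(t1 - t0)


# An attack lasting 100 ms and one lasting a full second.
short_attack = log_attack_time(0.0, 0.1)
long_attack = log_attack_time(1.0, 2.0)
print(short_attack, long_attack)
```

Under this assumption, attacks shorter than one second give negative values and longer ones give positive values.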

[Figure: signal envelope(t) plotted against time t, with T0 and T1 marked]

[Figure: ESTIMATION OF SOUND TIMBRE DESCRIPTORS. Block labels: Signal, Sliding Analysis Window, STFT, Signal envelope, f0, Harmonic Peaks Detection, z-1. Output labels: ihsc, ihsstd, ihstd, ihsv, lat, isc, tc]
