

Optimized Bit Extraction Using Distortion Estimation in the Scalable Extension of H.264/AVC

Ehsan Maani
EECS Department

Northwestern University
Evanston, IL, USA

[email protected]

Aggelos K. Katsaggelos
EECS Department

Northwestern University
Evanston, IL, USA

[email protected]

Abstract

The newly adopted scalable extension of the H.264/AVC video coding standard (SVC) demonstrates significant improvements in coding efficiency in addition to an increased degree of supported scalability relative to the scalable profiles of prior video coding standards. For efficient adaptation of SVC bit streams to intermediate bit rates, the concept of Quality Layers has been introduced in the design of the SVC. The concept of Quality Layers allows a rate-distortion (RD) optimal bit extraction; however, existing Quality Layer assignment methods do not consider all Network Abstraction Layer (NAL) units from different layers for the optimization. In this paper, we first propose a technique to accurately and efficiently estimate the quality degradation resulting from discarding an arbitrary number of NAL units from multiple layers of a bit stream. Then, we utilize this distortion estimation technique to assign Quality Layers to NAL units for a more efficient extraction. Experimental results show that a significant gain can be achieved by the proposed schemes.

1 Introduction

With the drastic improvements and developments of network infrastructures, multimedia applications for a variety of devices with different capabilities have become very popular. These devices range from cell phones and PDAs with small screens and restricted processing power to high-end PCs with high-definition displays. These devices are mainly connected to different types of networks with various bandwidth limitations and loss characteristics. Addressing this vast heterogeneity is a considerably tedious task. A highly attractive solution, which has been under development for the past 20 years, is known as Scalable Video Coding (SVC). The term "scalability" here means that certain parts of the bit stream can be removed in order to adapt it to the various requirements of end users as well as to varying network conditions or terminal capabilities.

The new SVC standard [1] was developed based on H.264/AVC by the Joint Video Team (JVT) in collaboration with the International Telecommunication Union Telecommunication Standardization Sector (ITU-T). In fact, SVC was approved as Amendment 3 of the Advanced Video Coding (AVC) standard, with full compatibility of base layer information so that it can be decoded by existing AVC decoders. In addition, SVC allows for spatial, temporal, and quality scalabilities [8]. The SVC design enables the creation of a video bit stream that is structured in layers, consisting of a base layer (BL) and one or more enhancement layers (EL). Each enhancement layer either improves the resolution (spatially or temporally) or the quality of the video sequence. The superb adaptability of the SVC and its high coding efficiency make SVC a suitable candidate for many video communication applications such as multicast, video surveillance, and peer-to-peer video sharing. Note that in this paper the term SVC is used interchangeably for both the concept of scalable coding in general and for the particular design of the scalable extension of the H.264/AVC standard.

Temporal scalability is commonly made possible by restricting motion-compensated prediction to reference pictures with a temporal layer identifier that is less than or equal to the temporal layer identifier of the picture to be predicted. In SVC, temporal scalability is provided by the concept of hierarchical B-pictures [7]. Spatial scalability, on the other hand, is achieved by encoding each supported spatial resolution into one layer. In each spatial layer, motion-compensated prediction and intra-prediction are employed similar to AVC. Nonetheless, in order to further improve coding efficiency, additional inter-layer prediction mechanisms are incorporated [8].

Quality scalability can be seen as a special case of spatial scalability with identical picture sizes for base and enhancement layer. The same prediction techniques are utilized except for the corresponding upsampling operations. This type of quality scalability is referred to as coarse-grain quality scalable coding (CGS). Since CGS can only provide a discrete set of decoding points, a variation of the CGS approach, which is referred to as medium-grain quality scalability (MGS), is included in the SVC design to increase the flexibility of bit stream adaptation. The increased flexibility of MGS is provided by splitting the refinements of the transform coefficients associated with each block of the enhancement layer into several layers, each identified with a quality_id [4]. Figure 1 portrays the structure of a single resolution SVC bit stream.

[Figure 1. Structure of a single resolution SVC bit stream: one GOP (frames 0-8 in playback order) with a base layer and MGS quality increments across temporal layers 0-3.]

Similarly to H.264/AVC, the coded video data of SVC are organized into packets with an integer number of bytes, called Network Abstraction Layer (NAL) units. Each NAL unit belongs to a specific spatial, temporal, and quality layer. Moreover, the set of NAL units from all spatial layers having the same temporal instant constitutes an Access Unit (AU).

In order to extract a substream with a particular average bit rate and/or resolution, a bit stream extractor is employed. However, there usually exists a huge number of possibilities (especially for MGS coding) of combining NAL units that result in approximately the same bit rate. A very simple and inefficient method is to randomly discard NAL units until the desired bit rate is achieved. Nonetheless, the efficiency of the bit extractor can be substantially improved by assigning a priority identifier to each NAL unit during the encoding or a post-processing operation [3]. The priority identifier is directly related to the contribution of the NAL unit to the overall quality of the video sequence. Therefore, the bit stream extractor can first discard the NAL units with the lowest priority and reach the target bit rate with a higher average reconstructed quality.

The problem of optimal extraction of the NAL units is a challenging one due to various temporal and spatial dependencies. A basic, content-independent bit stream extractor is provided in the software implementation of the SVC, referred to as the Joint Scalable Video Model (JSVM) [6]. An alternative rate-distortion optimized extraction method, also implemented in JSVM, is presented in [3]. This technique utilizes the Quality Layers concept and improves the performance of the JSVM basic extractor since it orders the packets based on their global quality improvements. However, the quality improvements are calculated only within a quality plane, i.e., it is assumed that the quality increments of all temporal layers with lower quality_id are included first. Thus, when packets are ordered based on this measure, their true impact on the global quality will be different and therefore result in a sub-optimal extraction. An improvement to this technique for a multi-layer bit stream is presented in [5]. In this approach some lower layer quality NAL units that are not that useful in an RD optimal sense are discarded even though they may have been used in the inter-layer prediction process.

In none of the existing RD optimized extraction techniques are NAL units with the most impact on the overall video quality freely selected independently of the layer they belong to. The main challenge is that the process of motion-compensated prediction (MCP) in SVC, unlike MPEG-4 Visual, is designed such that the highest available picture quality is employed for frame prediction in a GOP. As a result, the distortion of a picture also depends on the enhancement layers of the pictures from which it has been predicted. Consequently, each time the set of selected NAL units changes, multiple frame decodings are required before the quality of the sequence can be obtained. Since there is a huge number of possibilities in the selection, finding the optimal set is impossible in practice. In this paper, we propose a method to accurately and efficiently approximate the distortion of the sequence for any subset of the available NAL units. Then, using the proposed distortion model, we introduce a framework for rate-distortion optimized Quality Layer assignment and bit extraction. In Section 2 we provide an overview of the problem considered and its required components. Subsequently, in Section 3 we present our distortion calculations. The solution algorithm is then provided in Section 4. Experimental results are shown in Section 5 and finally conclusions are drawn in Section 6.

2 Problem Formulation

2.1 Quality Layers

The concept of quality or priority layers is employed to assign a prioritization order to the various elements constituting the whole scalable bit stream. This prioritization exhibits the hierarchical distinctions or stratification among various parts of the bit stream to be used for stream adaptation. The priority information may be conveyed in two ways: either in the NAL unit header, utilizing the syntax element priority_id, or using an optional Supplemental Enhancement Information (SEI) message. Quality layers are computed and assigned by the encoder or a post-processing operation at the encoder side. Nonetheless, prioritized bit stream adaptation can take place at any point of the transmission, i.e., at the encoder side, in the network, or at the receiver before decoding.

[Figure 2. Priority order for NAL units using the basic extractor: layers L(Dd, Tt, Qq) for d = 0, 1, t = 0, 1, 2, and q = 0, 1, 2, ordered starting from the base quality of the target spatial/temporal resolutions.]

2.2 Bit Extraction Process

The target average bit rate can be acquired by discarding different quality refinement NAL units. Therefore, the reconstructed video sequence that corresponds to the given target bit rate depends on the extraction method used. The basic extraction process defined in the SVC utilizes the high-level syntax elements dependency_id, temporal_id, and quality_id for prioritization. Figure 2 illustrates the prioritization order used in the basic extraction process. Each block in Fig. 2 represents a layer (a set of pictures with a specific resolution/quality) denoted by L(Dd, Tt, Qq), where Dd indicates the spatial resolution, Tt the temporal level, and Qq the quality level. The application/device for which the video is being decoded usually determines the target spatial and temporal resolutions. Therefore, the base layers of each spatial and temporal resolution lower than or equal to the target spatial and temporal resolutions have to be included first. Next, for each lower spatial resolution, NAL units of higher quality levels are ordered in increasing order of their temporal level. Finally, for the target spatial resolution, NAL units are ordered based on their quality level and are included until the target bit rate is reached.
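The prioritization just described can be sketched as a sort key. This is a hypothetical illustration only — the standard does not mandate any particular implementation, and the tuple layout of a NAL unit here is an assumption:

```python
# Hypothetical NAL unit tuples: (dependency_id d, temporal_id t, quality_id q).
# Priority groups of the basic extractor, for a target spatial resolution:
#   0: base quality (q == 0) of every resolution/temporal level up to target
#   1: quality refinements of lower resolutions, by resolution, then quality,
#      then temporal level
#   2: quality refinements of the target resolution, by quality level
def basic_order_key(nal, target_d):
    d, t, q = nal
    if q == 0:
        return (0, d, t)        # all base layers first
    if d < target_d:
        return (1, d, q, t)     # lower-resolution refinements next
    return (2, q, t)            # target-resolution refinements last

nals = [(0, 0, 0), (1, 0, 0), (1, 0, 1), (0, 1, 1), (1, 1, 0), (0, 0, 1)]
ordered = sorted(nals, key=lambda u: basic_order_key(u, target_d=1))
```

The extractor would then include units from the front of `ordered` until the target bit rate is reached.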

A major drawback of the basic extraction method is that its prioritization policy is independent of the video content. Since the distortion of a frame depends on the content of the frame in addition to the quantization parameter used, only a content-aware prioritization policy can ensure optimal extraction. Considering the fact that the standard does not specify the extraction process, one can devise a more efficient alternative extraction process.

2.3 System Model

As a substitute for the content-independent packet prioritization of JSVM, a rate-distortion optimized priority-based framework can be employed. In this scenario, a priority is computed for a NAL unit which represents a frame or a portion of a frame (i.e., a residual frame) at a given spatial/temporal/quality level. Note that in this scheme, unlike the basic extraction scheme, all pictures of a given layer do not necessarily follow the same prioritization order. In order to efficiently assign Quality Layers, NAL units have to be ordered according to their contribution to the overall quality of the video sequence. When the correct order is obtained, one can assign the Quality Layers to each unit based on a quantization of its index [2].

Assuming an optimal order of the NAL units exists, it can be obtained if for any bit rate Rmin < R < Rmax an optimal subset of the available NAL units can be extracted. Here, Rmin and Rmax denote the minimum and maximum possible bit rates of the scalable bit stream, respectively. As a result, in this paper we consider the problem of optimal extraction of a substream at a provided bit rate R. Once the solution to this problem is obtained, one can easily order packets and assign Quality Layers.

Let π(n, d, q) represent the quality increment (NAL unit) associated with frame n at spatial resolution d and quality level q (q = 0 represents the base quality). Then, any "consistent" subset of quality increments, P, can be uniquely identified by a selection map φ defined by

φ(n, d) = |Q(n, d)|, (1)

where Q(n, d) := {q : π(n, d, q) ∈ P} and the notation |.| represents the cardinality of a set. The term "consistent" here refers to a set whose elements are all decodable by the scalable decoder (children do not appear in the set without parents). Note that φ(n, d) = 0 indicates that no NAL unit for frame n at resolution d has been included in the set. In this case, when d represents the base resolution, we infer that the base layer has been skipped and therefore the dependent frames are also undecodable. The problem of optimal selection of the quality increments with a target rate of RT can be formulated as

φ* = arg min_{φ ∈ Φ} D(φ), s.t. R(φ) ≤ RT, (2)


[Figure 3. Example of a selection map for a single resolution bit stream: the number of quality increments kept for each frame, shown over the playback order.]

where φ is a vector with elements φ(n, d) for all possible n and d. Furthermore, R(φ) and D(φ) denote the average bit rate and distortion of the video sequence computed using the substream associated with selection map φ. Here, Φ represents the set of all possible selection functions for which the resulting substream is decodable. In this work, the distortion D is calculated using the mean squared error (MSE) metric with respect to the full quality decoded sequence. Note that for most applications, bit extraction is a post-processing operation, and thus the original video signal is not available for quality evaluation. An example of the selection function for a single resolution bit stream is illustrated in Fig. 3.

In principle, a solution to equation (2) can be found using a non-linear optimization scheme if fast evaluation of the objective functions is possible. Nevertheless, due to various spatial/temporal dependencies, evaluation of D(φ) requires the decoding of several images in addition to the computational cost of finding the MSE. In order to overcome this difficulty, we propose a low-complexity method to accurately estimate the source distortion for any function φ ∈ Φ.
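As an illustration, a selection map φ and a minimal consistency check might be represented as follows. This is a hypothetical data structure (the paper does not prescribe one), and the check covers only the spatial base/enhancement dependency, not every prediction dependency in a real SVC stream:

```python
# A selection map phi[(n, d)] = number of quality increments kept for
# frame n at spatial resolution d (0 means even the base layer is dropped).
# The count representation guarantees by construction that increment q is
# kept only if increments 0..q-1 are kept. Here we additionally check one
# simple hierarchical rule: an enhancement resolution cannot be decoded if
# its base resolution was skipped entirely.
def is_consistent(phi: dict, resolutions: list) -> bool:
    frames = {n for (n, _) in phi}
    for n in frames:
        for lower, higher in zip(resolutions, resolutions[1:]):
            if phi.get((n, higher), 0) > 0 and phi.get((n, lower), 0) == 0:
                return False  # enhancement kept but its base layer skipped
    return True

phi = {(0, 0): 1, (0, 1): 3, (1, 0): 0, (1, 1): 2}
print(is_consistent(phi, [0, 1]))  # frame 1 keeps EL without BL -> False
```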

3 Source Distortion Calculations

As discussed in Section 2.3, fast evaluation of the average sequence distortion plays an essential role in solving the optimization problem of equation (2). In this section we develop an approximation method for the computation of this distortion. The calculations presented in this section are for a general case where an arbitrary number of spatial layers exist. However, a target resolution has to be specified to evaluate the distortions. The quality increments from spatial layers lower than the target resolution need to be upsampled to the target resolution to evaluate their impact on the global sequence distortion. Based on the status of the base layer of a frame, the following two different distortion models are proposed: (1) when the base layer of the frame is available and decodable by the decoder, and (2) when the base layer is either not available or undecodable due to loss of a required base layer. In the latter case an error concealment strategy is employed which requires some special considerations.

3.1 Frames with Decodable Base-Layer

Since for MGS coding of SVC, motion-compensated prediction is conducted using the highest available quality of the reference pictures (except for key frames), propagation of drift has to be taken into account whenever a quality increment is missing. Let f_n^d and f_n denote vector representations of the reconstructed n-th frame using all of its quality increments in the presence and absence of drift, respectively. Note that although all quality increments of frame n are included for the reconstruction of both f_n and f_n^d, f_n^d ≠ f_n since it is assumed that some quality increments of the parent frames are missing in the reconstruction of f_n^d. Similarly, let e_n(q) represent the error vector introduced by the inclusion of q ≤ Q quality increments for that frame. Here, Q represents the total number of quality levels (in all layers); hence, e_n(Q) = 0. The total distortion of frame n due to drift and EL truncation (i.e., D_n^t) with respect to f_n is obtained according to

D_n^t(q) = ||f_n − f_n^d + e_n(q)||^2
         = D_n^d + D_n^e(q) + 2 (f_n − f_n^d)^T e_n(q),   (3)

where D_n^d and D_n^e(q) represent the distortion, i.e., the sum of squared errors (SSE), due to drift and EL truncation (associated with the inclusion of q quality increments), respectively. The symbol ||.|| here represents the l2-norm. Since the Cauchy-Schwarz inequality provides an upper bound for equation (3), we can approximate the total distortion D_n^t as

D_n^t(q) ≈ D_n^d + D_n^e(q) + 2κ √(D_n^d · D_n^e(q))
         ≤ D_n^d + D_n^e(q) + 2 √(D_n^d · D_n^e(q)),   (4)

where κ is a constant in the range 0 ≤ κ ≤ 1 obtained experimentally from training data. Consequently, in order to calculate the total distortion, we need the drift and EL truncation distortions, D_n^d and D_n^e(q), respectively. Fortunately, the error due to EL truncation, D_n^e(q), can be easily computed, especially at the encoder when performing the quantization of the transform coefficients. The drift distortions, on the other hand, depend on the computationally intensive motion compensation operations and propagate from a picture to its descendants.
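For illustration, equation (4) translates directly into code. This is a sketch only; the function name is a placeholder, and in the paper κ is fitted from training data rather than fixed:

```python
import math

def total_distortion(drift_sse: float, trunc_sse: float, kappa: float = 0.5) -> float:
    """Approximate total SSE of a frame per equation (4): drift distortion
    plus truncation distortion plus a cross term whose weight kappa (<= 1)
    is bounded by the Cauchy-Schwarz inequality."""
    assert 0.0 <= kappa <= 1.0
    return drift_sse + trunc_sse + 2.0 * kappa * math.sqrt(drift_sse * trunc_sse)
```

For example, `total_distortion(4.0, 9.0, kappa=0.5)` yields 4 + 9 + 2·0.5·6 = 19.0, never exceeding the Cauchy-Schwarz bound 4 + 9 + 2·6 = 25.0.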


Figure 4. Parent-child relationship for a GOP of size 4.

Let the set S = {s_0, s_1, ..., s_N} represent the N pictures in the GOP plus the key picture of the preceding GOP, denoted by s_0, as portrayed in Fig. 4 (for N = 4). Moreover, let g : S → Z denote a function defined on the set such that g(x) indicates the display order frame number of any x ∈ S. As illustrated by the figure, picture s_2 is considered a child of pictures s_0 and s_4 since it is bi-directionally predicted from them. In principle, a non-zero distortion in any of the parent frames induces a distortion in the child frame. Let Λ_n represent the set of parent frames associated with frame s_n. Hence, the set Λ_n is either empty (for key frames) or has exactly two members, referred to as s_n^1 and s_n^2. For instance, the parent set for frame s_2 in Fig. 4 equals Λ_2 = {s_0, s_4}. Further, let D_i^t represent the total distortion of a parent frame of s_n, i.e., i ∈ Λ_n. Then, we can assume that the drift distortion inherited by the child frame, denoted as D_{s_n}^d or simply D_n^d, is a function of the parent distortions, i.e., D_n^d = F(D_{s_n^1}^t, D_{s_n^2}^t). Therefore, an approximation to D_n^d can be obtained by a second order Taylor expansion of the function F around zero:

D_n^d ≈ Σ_{i ∈ Λ_n} α_i D_i^t + Σ_{i ∈ Λ_n} Σ_{j ∈ Λ_n} β_ij D_i^t D_j^t.   (5)

[Figure 5. Estimated versus actual distortion (MSE per frame) for a random selection map.]

Here, the coefficients α_i and β_ij are first and second order partial derivatives of F and are obtained by fitting a 2-dimensional quadratic surface to the training data acquired by decoding. Note that, mathematically speaking, F(D_{s_n^1}^t, D_{s_n^2}^t) is not a function, since the mapping {D_{s_n^1}^t, D_{s_n^2}^t} → D_n^d is not unique because distortions may be due to various error distributions. Therefore, equation (5) can only be justified as an approximation, since the errors arising from missing high frequency components are usually widespread throughout the image and follow similar distributions. The coefficients of this equation, for all frames except key frames, can be obtained by several decodings of different substreams extracted from the global SVC bit stream. Nevertheless, different methods for extracting the training data may exist. For instance, a suitable set of data can be computed using the following steps:

• For each temporal layer T, discard a random set of the quality increments from frames in temporal layers T and lower, while keeping all quality increments of the higher layers (to eliminate EL truncation distortion).

• Decode the resulting bit stream and collect all data points: the distortion of each frame n in a temporal layer higher than T, along with the distortions of the parent frames, forms a data point {D_n^d, D_{s_n^1}^t, D_{s_n^2}^t} for that frame.
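The quadratic-surface fit of equation (5) can be sketched as an ordinary least-squares problem over the collected data points. This is an assumed implementation (the paper does not specify the fitting procedure beyond "fitting a 2-dimensional quadratic surface"); function names and the use of NumPy are illustrative:

```python
import numpy as np

# Each training point: (D1, D2) = total distortions of the two parents,
# y = observed drift distortion of the child (collected as in the steps above).
# Fit eq. (5): Dd ≈ a1*D1 + a2*D2 + b11*D1^2 + b12*D1*D2 + b22*D2^2.
def fit_drift_surface(d1, d2, y):
    d1, d2, y = map(np.asarray, (d1, d2, y))
    # No constant term: F(0, 0) = 0, i.e. undistorted parents cause no drift.
    X = np.column_stack([d1, d2, d1**2, d1 * d2, d2**2])
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coef  # [alpha1, alpha2, beta11, beta12, beta22]

def predict_drift(coef, D1, D2):
    a1, a2, b11, b12, b22 = coef
    return a1*D1 + a2*D2 + b11*D1**2 + b12*D1*D2 + b22*D2**2
```

Note the design choice of omitting an intercept, which encodes the fact that the Taylor expansion is taken around zero parent distortion.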

Once the coefficients α_i and β_ij are computed for each frame (except for key frames), the drift distortion of the child frame D_n^d can be efficiently estimated for various distortions of the parent frames. The total distortion D_n^t is then computed according to equation (4). In the next step, the computed distortion of this frame is used (as a parent frame) to approximate the drift distortion of its children. Therefore, the distortion of the whole GOP can be estimated recursively, starting from the key frame, which is not subject to drift distortion. Figure 5 demonstrates a comparison between the estimated and the actual distortion for a random selection map of the Foreman CIF sequence.


3.2 Frames with Missing Base-Layer

In the framework considered in this work, in addition to the enhancement layer NAL units, we allow base layer NAL units to be skipped when resources are limited. Moreover, base layer NAL units may be damaged or lost in the channel and therefore become unavailable to the decoder. In this scenario, all descendants of the frame to which the NAL unit belongs are also discarded by the decoder. Consequently, the decoder utilizes a concealment technique in an attempt to hide the lost information from the viewer. In this work a simple and popular concealment strategy is employed: the lost picture is replaced by the nearest temporally neighboring picture. To be able to determine the impact of a frame loss on the overall quality of the video sequence, the distortion of the lost frame after concealment needs to be computed.

Let D_{n,i}^con denote the distortion of a frame n concealed using frame i, where frame i has a total distortion of D_i^t. Since D_{n,i}^con does not vary greatly with respect to D_i^t, we assume a linear relationship exists between them, i.e.,

D_{n,i}^con ≈ μ_i + ν_i D_i^t,   (6)

where μ_i and ν_i are constant coefficients calculated for each frame with all concealing options (different i's). For example, in Fig. 4, the concealment options for frame s_3, in order of preference, are {s_2, s_4, s_0}. The coefficients in equation (6) are obtained by conducting a linear regression analysis on training data points. Note that these data points are acquired by performing error concealment on frames reconstructed from the decodings explained in Section 3.1.

During the optimization, whenever a frame is skipped or missing, the pre-calculated coefficients μ_i and ν_i associated with the nearest available temporal neighbor i (which has a distortion D_i^t) are used according to equation (6) to estimate the distortion of the missing frame.
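The linear regression of equation (6) amounts to a one-degree line fit per (frame, concealing-frame) pair. A minimal sketch, assuming NumPy and hypothetical function names:

```python
import numpy as np

# Fit eq. (6): Dcon ≈ mu + nu * Dt, from training points gathered by
# concealing frame n with frame i across the decodings of Sec. 3.1.
# dt_parent: total distortions of the concealing frame; d_concealed: the
# measured post-concealment distortions of frame n.
def fit_concealment(dt_parent, d_concealed):
    nu, mu = np.polyfit(dt_parent, d_concealed, deg=1)  # slope, intercept
    return mu, nu

def estimate_concealed(mu, nu, dt):
    """Predicted distortion of the missing frame given the neighbor's Dt."""
    return mu + nu * dt
```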

4 Solution Algorithm

The main challenge in solving the problem considered in this work is the efficient evaluation of the sequence average quality for a provided selection function φ(n) (see equation (2)), as discussed in Section 3. Once the sequence average quality for any selection function φ(n) is known, in theory a nonlinear optimization scheme can be applied in order to find the best packet extraction pattern. In practice, however, careful consideration of the optimization method is necessary due to the coarse-grain discrete nature of φ(n) and its highly complex relation to the overall distortion. In this work we propose a hill-climbing algorithm in order to efficiently find a solution to this problem. Another useful feature of this algorithm is that it provides the optimized selection map for the specified rate as well as for all possible lower rates by properly ordering the packets. Therefore, the optimized order of the packets can be obtained by applying this algorithm and setting the target rate to the maximum possible rate.

The optimization can be performed over an arbitrary number of GOPs, denoted by M. Trivially, increasing the optimization window may result in a greater performance gain at the price of higher computational complexity. In this algorithm, the base layers of the key pictures are given the highest priority and are therefore the first packets to be included. Then, packets are added one at a time based on their global distortion gradient. In other words, initially the selection function φ(n) = 1 if s_n is a key frame, and otherwise φ(n) = 0. Then, at each time step i, a packet π(n_i*, φ(n_i*)) is added and φ(n_i*) is incremented by one, where n_i* is obtained by

n_i* = arg min_n [∂D(φ)/∂φ(n)] / [∂R_s(φ)/∂φ(n)].   (7)

Here, R_s(φ) represents the source rate associated with the current selection function φ. This process continues until the rate constraint RT is met or all available packets within the optimization window (i.e., M GOPs) are added to the ordering queue. Then, we move the optimization window to the next M GOPs and empty the ordering queue for the next set of packets. Note that, especially for short sequences, a global packet ordering can also be acquired by expanding the optimization window to include the entire sequence.
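The greedy selection of equation (7) can be sketched as follows. The callables standing in for the distortion and rate differentials are hypothetical (in the paper the distortion change comes from the Section 3 estimator), and the key-frame initialization and GOP windowing are omitted for brevity:

```python
def greedy_packet_order(candidates, delta_d, delta_r, rate_budget):
    """Greedy ordering per eq. (7): repeatedly include the next quality
    increment of the frame with the steepest distortion decrease per bit.
    candidates: {frame: number of available quality increments}
    delta_d(n, q): change in sequence distortion (negative) when increment
    q of frame n is added; delta_r(n, q): its size in bits."""
    phi = {n: 0 for n in candidates}  # increments included so far per frame
    order, rate = [], 0.0
    while True:
        avail = [n for n in candidates
                 if phi[n] < candidates[n]
                 and rate + delta_r(n, phi[n]) <= rate_budget]
        if not avail:
            return order
        # Most negative distortion-to-rate ratio = best quality gain per bit.
        n = min(avail, key=lambda m: delta_d(m, phi[m]) / delta_r(m, phi[m]))
        order.append((n, phi[n]))
        rate += delta_r(n, phi[n])
        phi[n] += 1
```

Because packets are appended in decreasing usefulness, truncating `order` at any prefix yields the selection map for every lower rate as well, which is the property exploited for Quality Layer assignment.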

5 Experimental Results

In this section, we evaluate the performance of our proposed optimized bit extraction scheme for the H.264/AVC scalable extension. The simulation is implemented with the reference software JSVM 9.10. Three video sequences (Mobile, Tempete, and Table) at a display resolution of CIF and a frame rate of 30 fps are considered in our experiments. For each sequence, 201 frames are considered in 25 GOPs of size 8 frames (M = 25). Sequences are encoded into two layers, a base layer and a quality layer, with base quantization parameters QP = 36 and QP = 24, respectively. Furthermore, the quality layer is divided into 5 MGS layers.

We compare our proposed source extraction scheme to two existing extraction approaches: 1) the JSVM optimized extraction with quality layers [3], referred to as "JSVM QL"; 2) the content-independent JSVM basic extraction, referred to as "JSVM Basic". For a fair comparison, packets in JSVM QL are ordered based on their actual importance measure as explained in [3], not based on a quantized assigned Quality Layer. Furthermore, for the proposed scheme, we used the same number of decodings as the JSVM QL technique requires, making the execution times of both systems roughly equal. Figure 6 shows the performance of the three extraction approaches for the different test sequences/resolutions. As demonstrated by the figure, the proposed scheme outperforms the JSVM extraction schemes by a maximum of over 0.5 dB. The gain provided by the proposed scheme is mainly due to the accurate estimation of the distortion for any substream, which allows us to freely select the NAL units with the highest contribution to the video quality. The JSVM QL extraction, on the other hand, obtains the impact of the refinement packets assuming all lower level quality planes are included. This results in an inaccurate evaluation of the distortion reductions and therefore sub-optimal extraction. The basic extraction scheme performs the worst, as expected, since it only uses the high level syntax elements of the NAL units to order them and thus is unaware of their impact on the quality of the sequence.

6 Conclusion

In this paper, we presented a scheme that allows us to accurately approximate the distortion of a substream resulting from discarding an arbitrary set of NAL units from a global SVC bit stream. This scheme requires the computation of a set of parameters per frame, which is obtained by evaluating the actual distortion of the entire sequence for a number of substreams (usually between 10 and 20). Our extensive simulations show that the distortion estimation error is on average less than 2%. The distortion estimate is then employed to perform optimized Quality Layer assignment and eventually extraction for rate adaptation. Simulation results showed significant gains compared to the basic bit extractor initially proposed in the scalable extension of the H.264/AVC standard.

References

[1] Joint draft ITU-T Rec. H.264 | ISO/IEC 14496-10 / Amd.3 scalable video coding, 2007.

[2] I. Amonou, N. Cammas, S. Kervadec, and S. Pateux. Layered quality optimization of JSVM-3 when considering closed-loop encoding. Joint Video Team (JVT) of ISO/IEC MPEG and ITU-T VCEG, Doc. JVT-Q081, Oct. 2005.

[3] I. Amonou, N. Cammas, S. Kervadec, and S. Pateux. Optimized rate-distortion extraction with quality layers in the scalable extension of H.264/AVC. IEEE Trans. Circuits Syst. Video Technol., 17(9):1186–1193, 2007.

[4] H. Kirchhoffer, H. Schwarz, and T. Wiegand. CE1: Simplified FGS. Joint Video Team, Doc. JVT-W090, Apr. 2007.

[5] M. Manu, K. Lee, and W. Han. Multi layer quality layers.Joint Video Team, Doc. JVT-S043, Apr. 2006.

[6] J. Reichel, H. Schwarz, and M. Wien. Joint Scalable Video Model 11 (JSVM 11). Joint Video Team, Doc. JVT-X202, July 2007.

[7] H. Schwarz, D. Marpe, and T. Wiegand. Hierarchical B pictures. Joint Video Team, Doc. JVT-P014, July 2005.

[Figure 6. Performance of the three extraction methods (Proposed Extraction, JSVM QL, JSVM Basic), plotted as PSNR (dB) versus bitrate (kbps), for the CIF resolution test sequences (a) Mobile, (b) Tempete, and (c) Table.]


[8] H. Schwarz, D. Marpe, and T. Wiegand. Overview of the scalable video coding extension of the H.264/AVC standard. IEEE Transactions on Circuits and Systems for Video Technology, 17(9):1103–1120, Sept. 2007.