Description-based Substitution Methods for Emulating Temporal Scalability in State-of-the-Art Video...

download Description-based Substitution Methods for Emulating Temporal Scalability in State-of-the-Art Video Coding Formats

of 4

Transcript of Description-based Substitution Methods for Emulating Temporal Scalability in State-of-the-Art Video...

  • 7/27/2019 Description-based Substitution Methods for Emulating Temporal Scalability in State-of-the-Art Video Coding Formats

    1/4

    Description-Based Substitution Methods for Emulating Temporal

    Scalability in State-of-the-Art Video Coding Formats

    W. De Neve1, D. De Schrijver1, D. Van de Walle1, P. Lambert1, and R. Van de Walle2

    1 Ghent University - IBBT, ELIS, Multimedia LabGaston Crommenlaan 8 bus 201, B-9050 Ledeberg-Ghent, Belgium

    2 Ghent University - IBBT - IMEC, ELIS, Multimedia LabGaston Crommenlaan 8 bus 201, B-9050 Ledeberg-Ghent, Belgium

    Abstract MPEG-21 BSDL provides a solution for dis-covering the structure of a binary media resource to gen-erate its XML description, and for the generation of anadapted media resource using a transformed description.So far, most literature related to XML-driven video con-tent adaptation deals with the exploitation of temporalscalability by dropping certain pictures. This paper takesit a step further by focusing on how description-drivenexploitation of temporal scalability can be realized byreplacing coded pictures by placeholder pictures. Suchan approach allows for a transparent integration withthe systems layer in modern multimedia architectures.

    Our XML-based replacement technique will be discussedfrom a high-level point of view for MPEG-{1, 2} Video,MPEG-4 Visual, and SMPTEs Video Codec 1 (VC-1).A more detailed analysis, including some performancemeasurements, will be provided for H.264/MPEG-4 AVCas its technical design is the most challenging one.

    1 Introduction

    The traditional view of digital video coding is to find an

    optimal trade-off between bit rate and distortion, givena maximum allowed delay and complexity. However, dueto a growing diversity in the present-day multimedialandscape in terms of terminals, networks, and mediaformats used, there is a strong tendency to draw anotherimportant aspect into this relationship, i.e. adaptivity.The subject of this paper is a discussion of description-driven extraction of video streams of multiple picturerates from a single coded bitstream (i.e., temporal scal-ability). Special attention is paid to the deployment ofplaceholder pictures to avoid conflicts with the systemslayer (e.g., file formats, streaming protocols). As such,the substitution of certain coded pictures by dummy pic-tures makes it, for instance, straightforward to maintainsynchronization with other media streams in a file con-tainer, even when a varying GOP structure is in use.

    This paper is organized as follows. Section 2 discussesthe MPEG-21 Bitstream Syntax Description Language(MPEG-21 BSDL). The concept of placeholder picturesis introduced in Section 3. The creation of placeholderpictures for several widely-used coding formats is elab-orated on in Section 4, hereby putting the emphasis onH.264/AVC. Finally, Section 5 discusses some perfor-mance results while Section 6 concludes this paper.

    2 MPEG-21 BSDL

    The MPEG-21 community is currently working towardsan efficient, secure, and interoperable description-drivenframework for media format-agnostic content adapta-tion. One of the building blocks of this architecture isMPEG-21 BSDL [1]. Sitting at the intersection betweenthe metadata world and the world of media representa-tion, this schema language is used to describe the (high-level) structure of a certain media format (file format,coding format). This gives birth to a document that iscalled a Bitstream Syntax Schema (BS Schema). It con-tains the necessary information for translating the struc-ture of a binary media resource into an XML-based Bit-

    stream Syntax Description (BSD), and for the creationof a tailored media resource by relying on a transformeddescription.

    The generation of the XML-based description is doneby a software module called BintoBSD Parser, while theadapted bitstream is constructed by a module that iscalled BSDtoBin Parser. How to transform a BSD, isnot specified by MPEG-21 BSDL. The functioning ofboth parsers is guided by a particular BS Schema for acertain media format. As such, a BintoBSD and BSD-toBin Parser constitute the pillars of a generic softwarearchitecture for media format-unaware content adapta-tion, and of which the operation is entirely steered byXML-based technologies. This concept is similar to howXHTML and Cascading Style Sheet (CSS) documentssteer a browser in order to render a web page.

  • 7/27/2019 Description-based Substitution Methods for Emulating Temporal Scalability in State-of-the-Art Video Coding Formats

    2/4

    2 W. De Neve, D. De Schrijver, D. Van de Walle, P. Lambert, and R. Van de Walle

    < x s d : s c h e m a >

    < x s d : e l e m e n t n a m e = " b i t s t r e a m " b s 2 : s t a r t C o n t e x t = "' bitstream ' ">

    < x s d : c o m p l e x T y p e >

    < x s d : s e q u e n c e >

    < x s d : e l e m e n t n a m e = " p a r s e _ u n i t "

    b s 2 : s t a r t C o n t e x t = "' pu ' " b s 2 : s t o p C o n t e x t = "' old_pu ' "

    b s 2 : r e d e f i n e M a r k e r = "' tmp_pu ' ' old_pu ' ' pu ' ' tmp_pu ' ">

    < x s d : c o m p l e x T y p e >

    < x s d : c h o i c e >< x s d : e l e m e n t r e f = " c o d e d _ h e a d e r _ d a t a " / >

    < x s d : e l e m e n t r e f = " c o d e d _ v i d e o _ d a t a " / >

    < / x s d : c h o i c e >

    < / x s d : c o m p l e x T y p e >

    < / x s d : e l e m e n t >

    < / x s d : s e q u e n c e >

    < / x s d : c o m p l e x T y p e >

    < / x s d : e l e m e n t >

    < / x s d : s c h e m a >

    Fig. 1 Generic BS Schema for a packet-based format

    In Figure 1, a generic template is provided for aBS Schema describing the structure of a typical packet-based coding format. The high-level structure of all video

    coding formats as discussed in this paper, is in line withthis template [e.g., a Parse Unit (PU) in H.264/AVCequals to a Network Abstraction Layer Unit (NALU)].The template is also annotated by attributes that en-able an intelligent context management by a BintoBSDParser. These extensions to BSDL are currently underconsideration for standardization by MPEG. They allowan efficient BSD generation in terms of processing timeand memory consumption needed (see the performanceresults in Section 5 for the BintoBSD Parser) [2].

    3 Placeholder Pictures

    A placeholder or dummy picture can be defined as apicture that is identical to a certain reference picture,or that is reconstructed by relying on a well-defined in-terpolation process between different reference pictures.This implies that only a limited amount of informationhas to be stored or transmitted in order to identify place-holder pictures. Dummy pictures are typically used forachieving bit rate savings when the content of successivepictures is very similar. Because these pictures are ableto introduce an additional delay, they can also be usedfor synchronization purposes. Hence, they are sometimes

    called delay or skipped pictures as well. In this paper,placeholder pictures are used to fill up the gaps thatare created in a bitstream due to the removal of certainpictures (temporal scalability). This approach makes itstraightforward to maintain synchronization with othermedia streams in a particular container format, espe-cially when a varying GOP structure is in use.

    The operational flow of our XML-driven content adap-tation approach, independent of the coding format used,can be described as follows. Given a bitstream and aBS Schema for the format of the media resource, a de-scription of the high-level structure of the bitstream iscreated by a generic BintoBSD Parser. The descriptionsof the different PUs in the BSD are then processed byrelying on a hybrid solution that involves the use of aStreaming Transformations for XML (STX) and an Ex-

    tensible Stylesheet Language Transformations (XSLT)stylesheet. Note that an XSLT-only approach fails incase of large BSDs due to its high memory requirements.A full STX approach is the subject of further research.

    The STX stylesheet is used for iterating through thedescriptions of the different PUs and for passing themto the XSLT stylesheet. As such, the use of STX makesit possible to acquire a low memory footprint duringthe processing of a BSD (see Section 5). The XSLTstylesheet, embedded in the STX stylesheet, is subse-quently used for doing look-ahead operations in the de-scription of a particular PU. This allows doing a moredetailed analysis of the description of a PU in an easyway. For instance, in case of the detection of a descriptionof a bidirectionally coded picture, the XSLT stylesheetmight decide to replace or to adapt this PU descriptionto obtain a description of a PU representing a place-holder picture. Finally, the customized BSD is providedto a BSDtoBin Parser which in its turn generates anadapted bitstream, suited for a certain usage environ-ment. Note that the BS Schema has to include a syntaxdescription of a dummy picture to generate a tailoredbitstream (due to the use of a few additional, low-levelsyntax elements; see also Figure 2). This fragment in theBS Schema is only used by a BSDtoBin Parser.

    4 Creation of Placeholder Pictures

    In the next subsections, the creation of placeholder pic-tures will be discussed from a syntax point of view for afew well-known coding formats. Special attention will bepaid to the creation of dummy pictures in H.264/AVC.

    4.1 MPEG-1 Video and MPEG-2 Video

    Signaling a placeholder picture in MPEG-2 Video canbe realized by using so-called pseudo-skipped pictures.This concept, introduced by Lee et al. in [3], boils downto creating an artificial picture that consists of multipleslices. Hereby, every slice contains two macroblocks that

    mark the beginning and the end of the slice. In-betweenmacroblocks are automatically considered as skipped.The macroblocks at the start and end of the slice areto be coded in backward or forward motion compen-sated mode, and with zero-valued motion vectors andquantized coefficients. Because MPEG-2 Video can beconsidered as a superset of MPEG-1 Video, the tech-nique for creating dummy pictures in MPEG-1 Video isvery similar. When creating such pictures in MPEG-1Video, there are only some minor syntactical differencesto be taken into account at the macroblock layer. Themore flexible definition of a slice in MPEG-1 Video alsoallows for a few optimizations since an entire picture canbe coded as a single slice. A description of a placeholderpicture in MPEG-1 Video is depicted in Figure 2. Notethe resolution dependency in the macroblock addresses.

  • 7/27/2019 Description-based Substitution Methods for Emulating Temporal Scalability in State-of-the-Art Video Coding Formats

    3/4

    Description-based Substitution Methods for Emulating Temporal Scalability 3

    < p s e u d o _ s k i p p e d _ p i c t u r e _ 3 5 2 x 2 8 8 >

    < p i c t u r e _ h e a d e r >

    < / p i c t u r e _ h e a d e r >

    < o n e _ s l i c e _ 3 5 2 x 2 8 8 >

    < s l i c e _ s t a r t _ c o d e > 0 0 0 0 0 1 0 1 < / s l i c e _ s t a r t _ c o d e >

    < q u a n t i s e r _ s c a l e _ c o d e > 1 < / q u a n t i s e r _ s c a l e _ c o d e >

    < e x t r a _ b i t _ s l i c e > 0 < / e x t r a _ b i t _ s l i c e >

    < f i r s t _ m a c r o b l o c k >< m a c r o b l o c k _ a d d r e s s _ i n c r e m e n t > 1 < / m a c r o b l o c k _ a d d r e s s _ i n c r e m e n t >

    < m a c r o b l o c k _ t y p e > 2 < / m a c r o b l o c k _ t y p e >

    < m o t i o n _ h o r _ b a c k w a r d _ c o d e > 1 < / m o t i o n _ h o r _ b a c k w a r d _ c o d e >

    < m o t i o n _ v e r t _ b a c k w a r d _ c o d e > 1 < / m o t i o n _ v e r t _ b a c k w a r d _ c o d e >

    < / f i r s t _ m a c r o b l o c k >

    < l a s t _ m a c r o b l o c k >

    < m a c r o b l o c k _ e s c a p e > 8 < / m a c r o b l o c k _ e s c a p e >

    < m a c r o b l o c k _ e s c a p e > 8 < / m a c r o b l o c k _ e s c a p e >

    < m a c r o b l o c k _ a d d r e s s _ i n c r e m e n t > 2 5 < / m a c r o b l o c k _ a d d r e s s _ i n c r e m e n t >

    < m a c r o b l o c k _ t y p e > 2 < / m a c r o b l o c k _ t y p e >

    < m o t i o n _ h o r _ b a c k w a r d _ c o d e > 1 < / m o t i o n _ h o r _ b a c k w a r d _ c o d e >

    < m o t i o n _ v e r t _ b a c k w a r d _ c o d e > 1 < / m o t i o n _ v e r t _ b a c k w a r d _ c o d e >

    < / l a s t _ m a c r o b l o c k >

    < b i t _ s t u f f i n g > 0 < / b i t _ s t u f f i n g >

    < / o n e _ s l i c e _ 3 5 2 x 2 8 8 >

    < / p s e u d o _ s k i p p e d _ p i c t u r e _ 3 5 2 x 2 8 8 >

    Fig. 2 A pseudo-skipped picture in MPEG-1 Video

    4.2 MPEG-4 Visual and Video Codec 1

    In MPEG-4 Visual, the description of any coded VideoObject Plane (VOP) can be transformed into a descrip-tion of a non-coded VOP (N-VOP), i.e. a dummy pic-ture, by enabling the vop coded flag in the VOP headerand by stripping all further information for that pic-ture. Because SMPTEs VC-1 specification provides anown picture type for signaling a placeholder picture (i.e.,a Skipped picture), it is rather straightforward to re-place the description of a picture by the description of

    a Skipped picture. Both specifications allow creating aresolution-independent description of a dummy picture.

    4.3 H.264/MPEG-4 AVC

    Before diving into the construction of placeholder pic-tures in H.264/AVC, it is interesting to have a closerlook at some of its design features. First, in H.264/AVCthere is no such thing as an explicit I, P, or B picturesince only I, P, and B slices are defined. In addition, acoded picture can comprise a mixture of different types

    of slices. Finally, B slices can be used as a reference forthe reconstruction of other slices. Therefore, the recom-mended way for picture dropping in H.264/AVC is torely on sub-sequences and Supplemental EnhancementInformation messages (SEI messages) [4].

    The flexible design of H.264/AVC does not only havean impact on how to implement and exploit temporalscalability; it also has an influence regarding the creationof placeholder pictures. While the generic nature of ourXML-driven substitution approach is constrained by aresolution dependency in MPEG-{1, 2} Video, there area few other aspects to deal with when using our pro-posed method for H.264/AVC. Besides the picture reso-lution and the number of macroblocks used in a slice, itis also important to take into account parameters suchas the variable length coding scheme, the slice type, and

    - - - - - - - - - - - - - - - - - - - - - o r i g i n a l n o n - r e f e r e n c e B s l i c e - - - - - - - - - - - - - - - - - - - - - -

    < c o d e d _ s l i c e _ o f _ a _ n o n _ I D R _ p i c t u r e >

    < s l i c e _ l a y e r _ w i t h o u t _ p a r t i t i o n i n g _ r b s p >

    < s l i c e _ h e a d e r >

    < f i r s t _ m b _ i n _ s l i c e > 0 < / f i r s t _ m b _ i n _ s l i c e >

    < s l i c e _ t y p e > 1 < / s l i c e _ t y p e >

    < p i c _ p a r a m e t e r _ s e t _ i d > 0 < / p i c _ p a r a m e t e r _ s e t _ i d >

    < f r a m e _ n u m x s i : t y p e = " b 5 " > 3 < / f r a m e _ n u m >

    < / s l i c e _ h e a d e r >

    < s l i c e _ d a t a >

    < b i t _ s t u f f i n g > 3 < / b i t _ s t u f f i n g >

    < s l i c e _ p a y l o a d > 2 0 6 0 5 4 < / s l i c e _ p a y l o a d >

    < / s l i c e _ d a t a >

    < / s l i c e _ l a y e r _ w i t h o u t _ p a r t i t i o n i n g _ r b s p >

    < / c o d e d _ s l i c e _ o f _ a _ n o n _ I D R _ p i c t u r e >

    - - - - -- - - - - - r e s u l t i ng p l a c eh o l d er : s k i pp e d n o n - r e fe r e n ce P s l i ce - - - - -- - - - -

    < c o d e d _ s l i c e _ o f _ a _ s k i p p e d _ n o n _ I D R _ p i c t u r e >

    < s k i p p e d _ s l i c e _ l a y e r _ w i t h o u t _ p a r t i t i o n i n g _ r b s p >

    < s l i c e _ h e a d e r >

    < f i r s t _ m b _ i n _ s l i c e > 0 < / f i r s t _ m b _ i n _ s l i c e >

    < s l i c e _ t y p e > 5 < / s l i c e _ t y p e >

    < p i c _ p a r a m e t e r _ s e t _ i d > 0 < / p i c _ p a r a m e t e r _ s e t _ i d >

    < f r a m e _ n u m x s i : t y p e = " b 5 " > 3 < / f r a m e _ n u m >

    < / s l i c e _ h e a d e r >

    < s k i p p e d _ s l i c e _ d a t a >

    < m b _ s k i p _ r u n > 5 4 4 < / m b _ s k i p _ r u n >

    < r b s p _ t r a i l i n g _ b i t s >

    < r b s p _ s t o p _ o n e _ b i t > 1 < / r b s p _ s t o p _ o n e _ b i t >

    < r b s p _ a l i g n m e n t _ z e r o _ b i t > 0 < / r b s p _ a l i g n m e n t _ z e r o _ b i t >

    < / r b s p _ t r a i l i n g _ b i t s >

    < / s k i p p e d _ s l i c e _ d a t a >

    < / s k i p p e d _ s l i c e _ l a y e r _ w i t h o u t _ p a r t i t i o n i n g _ r b s p >

    < / c o d e d _ s l i c e _ o f _ a _ s k i p p e d _ n o n _ I D R _ p i c t u r e >

    Fig. 3 Slice manipulation in H.264/AVC

    whether or not a slice is used as a reference for the re-construction of other slices. Therefor, the idea is to keepthe creation of placeholder slices as simple as possible,hereby minimizing the impact on the different decodingprocesses (reference picture and display order manage-ment, deblocking filter). As such, this enables the use ofskipped slices into an H.264/AVC bitstream with a small

    number of changes at the level of the different parametersets and the slice header() syntax structures.

    The use of placeholder pictures in H.264/AVC willbe exemplified with the translation of a non-referenceB slice to a non-reference dummy P slice. This conver-sion process is partly clarified in Figure 3. Firstly, thenal ref idc syntax element in a NALU header is notchanged, implying that the property whether a slice isused as a reference or not, is maintained. This is a firststep in keeping the management of reference picturesconsistent in case such pictures are involved in the con-version process1. Secondly, the slice type is changed

    in order to signal a P slice, hereby avoiding an interpo-lation between the reference pictures in case a skippedB slice would have been used. The frame num syntax el-ement can be merely copied to the target slice header(no reference pictures are removed), as well as all infor-mation related to the display order management (slicesare only replaced by dummy ones), information pertain-ing to the deblocking filter, and information concerningmemory management control operations. The parame-ters related to B slices, used in the management of thereference lists and the reconstruction of a slice (weightedprediction), have to be dropped. To save some extra bits,the value of slice qs delta is initialized to zero as it is

    1 Pictures of which the reconstruction is dependent onskipped reference pictures also need to be signaled as skipped.

  • 7/27/2019 Description-based Substitution Methods for Emulating Temporal Scalability in State-of-the-Art Video Coding Formats

    4/4

    4 W. De Neve, D. De Schrijver, D. Van de Walle, P. Lambert, and R. Van de Walle

    Table 1 Performance measurements (non-reference B slices replaced by non-reference skipped P slices)

    bitstream characteristics BintoBSDm STX/XSLT BSDtoBinm#slices/ fileo filem ET ET MC BSDo cBSDo ET ET MC BSDm cBSDm ET ET MC

    resolution picture #PUs (MB) (MB) (s) (PU/s) (MB) (MB) (KB) (s) (PU/s) (MB) (MB) (KB) (s) (PU/s) (MB)

    848x352 5 17819 37.4 26.6 131 136 1.7 35.0 352 1865 10 2.7 36.2 243 32 557 1.9

    1280x544 7 24945 110.0 73.9 204 122 1.7 45.9 498 2746 9 2.7 47.6 341 44 567 1.91904x800 9 32071 163.0 109.0 278 115 1.7 59.8 607 3340 10 2.7 62.0 419 59 544 1.9

    not used during the reconstruction of a skipped slice. Fi-nally, the slice data() structure is replaced such thatall macroblocks are flagged as skipped. Note that ourapproach only works when Context-Adaptive VariableLength Coding (CAVLC) is in use. Arithmetic codingleads to a context-adaptive representation of the syntaxelement that communicates a skipped macroblock run,something not easy to express in BSDL.

    5 Experimental Results

    The use of placeholder pictures has been successfully ver-ified for all coding formats discussed, among which is theuse of pseudo-skipped pictures for DVD content. Due toplace constraints, we will only provide some measure-ments for H.264/AVC. The results as shown in Table 1were obtained on a PC with an Intel Pentium IV 2.61GHz CPU and 512 MB of memory (ET stands for Ex-ecution Time; MC for peak Memory Consumption; PUfor Parse Unit, i.e. a NALU; and cBSD for compressedBSD). The BSD and bitstream generation was done by

    in-house optimized versions of the parsers as available inthe MPEG-21 reference software. The Joost STX proces-sor (version 20050521), having a built-in XSLT engine,was used for transforming BSDs; WinRAR 3.0s defaulttext compression algorithm was used for compressingthem [the BSDs could not be compressed with currentimplementations of MPEG-7 BiM (Binary Format forMetadata)]. The resources involved are 3 different ver-sions of the same movie trailer (The New World), as usedin a real-world simulstore scenario. Each version has avarying GOP pattern. However, a P slice coded pictureis mostly alternated with a non-reference B slice coded

    picture. At the start of the bitstream, every resource hasone Sequence and one Picture Parameter Set, as well asone buffering period SEI message. The picture rate is23.98 Hz, resulting in a length of 148 s.

    A first observation is that the parsers are able togenerate BSDs and customized bitstreams with a feasi-ble computational and memory complexity. Transform-ing a BSD can be done efficiently in terms of memoryrequirements; an XSLT-only approach would have re-quired lots of memory [5]. However, the processing speedis currently unacceptable, caused by the large number ofslices per picture, the use of XSLT, and the complexityof the transformation. The hybrid STX/XSLT approachalso results in an increase of the BSD size, mainly due tounwanted namespace declarations for each PU. Finally,it is worth paying attention to the decrease in bitstream

    sizes. The column labeled fileo contains the sizes of theoriginal H.264/AVC bitstreams; filem denotes the sizesof the modified bitstreams containing placeholders.

    6 Conclusion

    In this paper, a discussion was provided regarding theuse of placeholder pictures in an XML-driven contentadaptation chain, hereby aiming at transparently ex-

    ploiting temporal scalability. The construction of skippedpictures was outlined for several state-of-the-art videocoding formats. Their use was stressed for H.264/AVCas its design is the most challenging one for our substi-tution technique. Some performance measurements wereprovided for real-life H.264/AVC bitstreams. The resultsillustrate that the generation of a BSD and the construc-tion of an adapted H.264/AVC bitstream with place-holder slices can be done with a feasible complexity interms of system memory and processing time needed.

    Acknowledgement: The research as described in this

    paper was funded by Ghent University, the InterdisciplinaryInstitute for Broadband Technology (IBBT), the Institute for

    the Promotion of Innovation by Science and Technology in

    Flanders (IWT), the Fund for Scientific Research-Flanders

    (FWO-Flanders), the Belgian Federal Science Policy Office

    (BFSPO), and the European Union.

    References

    1. S. Devillers, C. Timmerer, J. Heuer, H. Hellwagner, Bit-stream Syntax Description-Based Adaptation in Stream-ing and Constrained Environments, IEEE Transactionson Multimedia, vol. 7, pp. 463470, 2005.

    2. D. De Schrijver, W. De Neve, K. De Wolf, R. Van deWalle, Generating MPEG-21 BSDL Descriptions UsingContext-Related Attributes, Proc. of the 7th IEEE ISMconference, pp. 7986, USA, 2005.

    3. Y. Lee, J. Lee, H. Chang, J. Y. Nam, A New SceneChange Control Scheme based on Pseudo-skipped Pic-ture, Proc. of SPIE/Visual Communications and ImageProcessing, vol. 3024, pp. 159166, USA, 1997.

    4. W. De Neve, D. Van Deursen, D. De Schrijver, K. De Wolf,R. Van de Walle, Using Bitstream Structure Descriptionsfor the Exploitation of Multi-layered Temporal Scalabil-ity in H.264/MPEG-4 AVCs Base Specification, Proc. ofPCM 2005, Springer-Verlag, pp. 641652, Chejudo, 2005.

    5. W. De Neve, D. De Schrijver, D. Van Deursen, R. Van de

    Walle, XML-Driven Bitstream Extraction Along theTemporal Axis of SMPTEs Video Codec 1, Proc. ofWIAMIS 2006, Accepted for publication, Seoul, 2006.