Overview of HEVC High-Level Syntax and Reference Picture Management

Rickard Sjöberg, Ying Chen, Senior Member, IEEE, Akira Fujibayashi, Miska M. Hannuksela, Member, IEEE, Jonatan Samuelsson, Thiow Keng Tan, Senior Member, IEEE, Ye-Kui Wang, and Stephan Wenger, Senior Member, IEEE

Abstract: The increasing proportion of video traffic in telecommunication networks puts an emphasis on efficient video compression technology. High Efficiency Video Coding (HEVC) is the forthcoming video coding standard that provides substantial bit rate reductions compared to its predecessors. In the HEVC standardization process, technologies such as picture partitioning, reference picture management, and parameter sets are categorized as high-level syntax. The design of the high-level syntax impacts the interface to systems and error resilience, and provides new functionalities. This paper presents an overview of the HEVC high-level syntax, including network abstraction layer unit headers, parameter sets, picture partitioning schemes, reference picture management, and supplemental enhancement information messages.

Index Terms: High Efficiency Video Coding (HEVC), Joint Collaborative Team on Video Coding (JCT-VC), parameter set, reference picture list, reference picture set, video coding.

    I. Introduction

DURING the last decade, there has been a steady increase in transmission speeds over fixed and mobile networks, as well as a large capacity increase in storage devices such as hard disks. One could therefore be led to believe that the need for video compression has decreased, whereas in fact the opposite is true. Faster networks and more efficient video compression have enabled previously infeasible applications, such as over-the-top video streaming, triggering an explosion in video traffic [1]. The demand for higher-resolution content has further raised video traffic, to the point where, today, it occupies over 50% and 40% of transmitted data in fixed and mobile networks, respectively [2], [3].

Manuscript received April 13, 2012; revised July 25, 2012; accepted August 21, 2012. Date of publication October 5, 2012; date of current version January 8, 2013. This paper was recommended by Associate Editor A. Kaup.

R. Sjöberg and J. Samuelsson are with Ericsson Research, Multimedia Technologies, 16480 Stockholm, Sweden (e-mail: [email protected]; [email protected]).

Y. Chen and Y.-K. Wang are with the Multimedia Research and Development and Standards Group, QCT, Qualcomm, Inc., San Diego, CA 92121 USA (e-mail: [email protected]; [email protected]).

A. Fujibayashi is with NTT DOCOMO, Inc., Research Laboratories, Tokyo 239-8536, Japan (e-mail: [email protected]).

M. M. Hannuksela is with the Nokia Research Center, Tampere 33720, Finland (e-mail: [email protected]).

T. K. Tan is with M-Sphere Consulting Pte. Ltd., 808379, Singapore, and also with NTT DOCOMO, Inc., Tokyo 239-8536, Japan (e-mail: [email protected]).

S. Wenger is with VidyoCast, Inc., Hackensack, NJ 07601 USA (e-mail: [email protected]).

Color versions of one or more of the figures in this paper are available online at http://ieeexplore.ieee.org.

Digital Object Identifier 10.1109/TCSVT.2012.2223052

High Efficiency Video Coding (HEVC) is the forthcoming video coding standard expected to be published in early 2013. It has been reported that HEVC provides a bit rate reduction of about 50% at the same subjective quality when compared to Advanced Video Coding (H.264/AVC) [4]. While improved compression efficiency is crucial for the success of the codec, the high-level syntax of HEVC also plays an important role, especially in how the features of the codec are exposed to systems. In this paper, we describe the high-level design of HEVC, with a focus on those novel high-level elements that provide new functionalities and contribute to improved robustness against transmission errors. This paper is based on a draft HEVC specification [5]. It is conceivable that the final ISO/IEC standard or ITU Recommendation will differ in small details from this draft version.

The high-level architecture of video coding standards changed dramatically with H.264/AVC [6]. Previous standards, such as Moving Picture Experts Group (MPEG)-2 Video [7], H.263 [8], and MPEG-4 Visual [9], were designed with a continuous video stream in mind. Although picture segmentation (through slices and similar tools) was available, previous standards were not designed to be loss robust with respect to information above the slice header level, such as picture headers. Especially when transmitting video over lossy packet networks, header information had to be repeated by external, bolt-on technologies, for example by picture header repetition in H.263/RFC2429 [10] and header extension codes (HEC) in MPEG-4 Visual. Although H.264/AVC can be transmitted as a stream, it was designed from the outset for packet-based transmission. The H.264/AVC network abstraction layer (NAL) provides for self-contained packets, allowing the video layer to be identical for different network environments. The parameter set concept removes the need for header information duplication. Header parameters are signaled in parameter set NAL units, which are referenced by coded video NAL units.

HEVC inherits a number of high-level features from H.264/AVC, such as the NAL unit and parameter set concepts, the use of picture order count (POC), and SEI messages for supplemental data signaling, to give a few examples. Some high-level features that exist in H.264/AVC were not included in HEVC, for example flexible macroblock order (FMO), redundant slices, arbitrary slice order (ASO), data partitioning, and SP/SI pictures. A number of new high-level features are introduced, such as the video parameter set (VPS), clean random access (CRA) pictures, broken link access (BLA) pictures, temporal sub-layer access (TSA) pictures, the tiles and wavefront tools for parallel processing, the dependent slice tool for reduced delay, and the reference picture set (RPS) concept for reference picture management.

This paper is organized as follows. Section II provides an overview of the HEVC NAL unit header and discusses future HEVC extensions. In Section III, we describe the HEVC parameter sets, including the new VPS. Random access and temporal switching are discussed in Section IV, and Section V addresses the HEVC picture partitioning schemes and parallel processing. Section VI provides an overview of HEVC reference picture management, including the reference picture set (RPS) concept. Reference picture lists are described in Section VII, and SEI messages are discussed in Section VIII. Finally, Section IX concludes the paper.

    II. NAL Unit Header

Similarly to H.264/AVC, HEVC uses a NAL unit based bitstream structure. A coded bitstream is partitioned into NAL units which, when conveyed over lossy packet networks, should be smaller than the maximum transfer unit (MTU) size. Each NAL unit consists of a NAL unit header followed by the NAL unit payload. There are two conceptual classes of NAL units: video coding layer (VCL) NAL units contain coded sample data, e.g., coded slice NAL units, whereas non-VCL NAL units contain metadata that typically belongs to more than one coded picture (or for which the association with a single coded picture would be meaningless), such as parameter set NAL units, or that is not needed by the decoding process, such as SEI NAL units.

The NAL unit header is designed to co-serve as part of the packet header in RTP-based packet networks, and to be processed by media-aware network elements (MANEs). Further, to enable efficient processing by MANEs, NAL unit headers are of a fixed format (as described below) and do not include variable length codes. All this renders bits in the NAL unit header among the most expensive real estate in terms of coding efficiency, and the JCT-VC has been very careful in assigning to the NAL unit header only those codepoints that are required by MANEs for media-aware processing.

Fig. 1. Video transmission scenario.

Fig. 2. NAL unit headers of H.264/AVC and HEVC. (a) H.264/AVC NAL unit header. (b) HEVC NAL unit header.

A typical HEVC video transmission scenario involving MANEs is shown in Fig. 1, wherein a MANE, also known as a media gateway, is present between the sender and the receiver(s). Receivers, which can be of different types, receive and decode the bitstream or a part thereof. The MANE is in the signaling context, and therefore aware of key properties of the video bitstream such as profile and level. Based on such information, as well as information carried in the NAL unit header, the MANE may perform intelligent media-aware stream adaptation. One example is local repair and/or local redundancy coding: a MANE can, for example, duplicate critical NAL units such as parameter set NAL units on the transmission path to only a subset of receivers with bad connectivity. In order to do that, the MANE needs to know the type of the NAL unit. Another example of such adaptation is bitstream thinning, where certain packets are removed from a bitstream based on network conditions or decoder/display capabilities. For example, a MANE can remove slice data belonging to a temporal enhancement layer when it senses congestion between itself and a given receiver.

H.264/AVC specifies a one-byte NAL unit header, which was extended by three bytes in its scalable (H.264/SVC) [11] and multiview (H.264/MVC) [12] extensions. The extensions are required to signal the different scalability dimensions, e.g., the temporal, spatial, quality, and view dimensions.

In HEVC, a two-byte NAL unit header was introduced with the anticipation that this design is sufficient to support the HEVC scalable and 3-D video coding (3DV) [13] extensions, as well as other future extensions, as briefly described below.

Fig. 2 shows both the H.264/AVC and the HEVC NAL unit headers. In both standards, the forbidden_zero (F) bit must be zero. It is included to prevent start code emulations in MPEG-2 systems legacy environments. In H.264/AVC, nal_ref_idc (NRI) was a two-bit codeword. The main motivation for two bits was the support of different transport priority signaling for data partitioning. HEVC does not include data partitioning, and the disposable nature of NAL units is indicated by the NAL unit type rather than by dedicated bits. Therefore, the NRI field has become unnecessary. One bit is used to increase the numeric range of the NAL unit type to 64 types. The other bit is reserved for future extensions.

The second part of the HEVC NAL unit header includes two syntax elements: reserved_zero_6bits (R6bits, 6 bits, one of which is part of the first byte as already described), and temporal_id_plus1 (TIDP, 3 bits). With TIDP, temporal scalability is supported (with the temporal identifier ranging from 0 to 6, inclusive). The reserved_zero_6bits are widely expected to carry some form of layer identification information in future extensions, as described below.
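To make the byte layout concrete, the following sketch parses the two-byte HEVC NAL unit header described above (1-bit F, 6-bit NAL unit type, 6-bit reserved_zero_6bits straddling the byte boundary, 3-bit temporal_id_plus1). The function name and return structure are illustrative, not from the specification.

```python
def parse_hevc_nal_unit_header(byte0: int, byte1: int) -> dict:
    """Split the two header bytes into the fields described above."""
    forbidden_zero_bit = (byte0 >> 7) & 0x01      # must be 0
    nal_unit_type = (byte0 >> 1) & 0x3F           # 6 bits, 64 possible types
    # reserved_zero_6bits straddles the byte boundary:
    # 1 bit from byte0, 5 bits from byte1
    reserved_zero_6bits = ((byte0 & 0x01) << 5) | ((byte1 >> 3) & 0x1F)
    temporal_id_plus1 = byte1 & 0x07              # 3 bits, shall be > 0
    return {
        "forbidden_zero_bit": forbidden_zero_bit,
        "nal_unit_type": nal_unit_type,
        "reserved_zero_6bits": reserved_zero_6bits,
        "TemporalId": temporal_id_plus1 - 1,
    }
```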

An HEVC bitstream might consist of several temporal sub-layers. Each NAL unit belongs to a specific sub-layer as indicated by its TemporalId (equal to temporal_id_plus1 - 1). All VCL NAL units of the same picture must belong to the same sub-layer, so it can be said that the picture itself belongs to that sub-layer. HEVC prohibits any kind of dependency on data in a higher sub-layer in the decoding process of a lower sub-layer. It is required that a sub-bitstream, created from an HEVC bitstream by removing all NAL units with TemporalId higher than a specific value, is itself a bitstream conforming to HEVC. It is the responsibility of the encoder to ensure that all conditions for bitstream conformance (e.g., buffer restrictions) are fulfilled for each sub-bitstream.
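Because the extraction rule depends only on the fixed-format header, a MANE can implement it without parsing NAL unit payloads. A minimal sketch, assuming NAL units are already framed (e.g., by an RTP depacketizer):

```python
def extract_sub_bitstream(nal_units: list[bytes], max_temporal_id: int) -> list[bytes]:
    """Keep only NAL units whose TemporalId is <= max_temporal_id.

    Per the text above, the result is itself a conforming HEVC bitstream.
    """
    kept = []
    for nu in nal_units:
        # temporal_id_plus1 is the low 3 bits of the second header byte
        temporal_id = (nu[1] & 0x07) - 1
        if temporal_id <= max_temporal_id:
            kept.append(nu)
    return kept
```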

In the first version of HEVC, reserved_zero_6bits shall be equal to 000000 for all NAL units. HEVC version 1 conforming decoders ignore NAL units with reserved_zero_6bits other than 000000.

It is widely anticipated that, in the scalable and 3DV extensions, the R6bits will be renamed layer_id, describing all scalability dimensions other than the temporal dimension. In 3DV, layer_id would identify view and depth, and in a scalable extension it would be used to jointly indicate the spatial and quality scalability dimensions. Adaptation based on temporal_id_plus1, layer_id, or both is made possible by the HEVC NAL unit header design.

    III. Parameter Sets

HEVC inherits the parameter set concept of H.264/AVC [14] with a few modifications and additions. The modifications and additions can be subcategorized into three groups: 1) additions and modifications made necessary by the different coding tools of HEVC compared to H.264/AVC; 2) additions and modifications resulting from operational experience with H.264/AVC; and 3) the newly introduced VPS.

Parameter sets were introduced in H.264/AVC in response to the devastating effects of a loss of the sequence header or picture header when a picture is partitioned into multiple segments (i.e., slices) and those segments are transported in their own transport units (e.g., RTP packets), which is desirable for MTU size matching. The loss of the first packet of a picture, which carries not only the first picture segment data but also the picture header (and sometimes also the GOP and sequence headers), might lead to a completely incorrectly reconstructed picture (and sometimes also following pictures), even if all other packets were not lost. Some decoder implementations would not even attempt to decode the received packets of a picture if the packet with the picture header was lost. To combat this vulnerability, transport layer based mechanisms were introduced. For example, the RTP payload format for H.263, specified in RFC 2429 [10], allowed for carrying a redundant copy of the picture header in as many packets as the encoder/packetizer chooses. During the design of H.264/AVC, it was recognized that the vulnerability of the picture header is an architectural issue of the video codec itself, rather than a transport problem, and therefore the parameter set concept was introduced as a fix for the issue.

Parameter sets can either be part of the video bitstream or be received by a decoder through other means (including out-of-band transmission using a reliable channel, hard coding in encoder and decoder, and so on). A parameter set contains an identification, which is referenced, directly or indirectly, from the slice header, as discussed in more detail later. The referencing process is known as activation. Depending on the parameter set type, the activation occurs per picture or per sequence. The concept of activation through referencing was introduced, among other reasons, because implicit activation by virtue of the position of the information in the bitstream (as is common for other syntax elements of a video codec) is not available in the case of out-of-band transmission.

The VPS was introduced to convey information that is applicable to multiple layers as well as sub-layers. H.264/AVC (in all its versions) did not contain a comparable parameter set, requiring a complex modeling of the layering structure for purposes such as capability exchange and session negotiation. In H.264/AVC's scalable extension, the scalability information SEI message offers approximately the same content, but by its nature of being an SEI message, most of the same information has to be repeated in sequence parameter sets (SPSs), which in some application scenarios also need to be transmitted out-of-band, consequently causing increased initial delay, particularly when retransmission is involved to guarantee reliability of the out-of-band transmission. In the case of broadcast and multicast with in-band transmission of parameter sets, repeating the same information can be a significant overhead, as parameter sets need to be repeated at each random access point for tuning in and channel switching. The VPS was introduced to address these shortcomings as well as to enable a clean and extensible high-level design of multilayer codecs.

Each layer of a given video sequence, regardless of whether the layers have the same or different SPSs, refers to the same VPS. The VPS conveys information including: 1) common syntax elements shared by multiple layers or operation points, in order to avoid unnecessary duplication; 2) essential information about the operation points needed for session negotiation, including, e.g., profile and level; and 3) other operation point specific information that does not belong in one SPS, e.g., hypothetical reference decoder (HRD) parameters for layers or sub-layers. The parsing of the essential information of each operation point does not require variable length coding, and is thus considered lightweight for most network elements. It is expected that the VPS extension, to be specified in the HEVC extensions, may contain more syntax elements than the current VPS, for efficient parameter signaling, flexible and lightweight session negotiation, and advanced bitstream adaptation, e.g., based on the view identifier in the 3DV extension.

According to the HEVC specification [5], some information is duplicated between the VPS and the SPSs belonging to the layers. This duplication was introduced to allow a version 1 decoder to disregard the VPS NAL unit and still have available all information required to decode the bitstream.

In H.264/AVC as well as in HEVC, SPSs contain information that applies to all slices of a coded video sequence. In HEVC, a coded video sequence starts from an instantaneous decoding refresh (IDR) picture, a BLA picture, or a CRA picture that is the first picture in the bitstream, and includes all subsequent pictures that are not an IDR or BLA picture. A bitstream consists of one or more coded video sequences. The content of the SPS can be roughly subdivided into six categories: 1) a self-reference (its own ID); 2) decoder operation point related information (profile, level, picture size, number of sub-layers, and so on); 3) enabling flags for certain tools within a profile, and associated coding tool parameters in case the tool is enabled; 4) information restricting the flexibility of structures and transform coefficient coding; 5) temporal scalability control (similar to H.264/SVC [11]); and 6) video usability information (VUI), which includes HRD information.

HEVC's picture parameter set (PPS) contains the kind of information that could change from picture to picture. The PPS includes information roughly comparable to what was part of the PPS in H.264/AVC, including: 1) a self-reference; 2) initial picture control information, such as the initial quantization parameter (QP) and a number of flags indicating the use of, or presence of, certain tools or control information in the slice header; and 3) tiling information.

The slice header contains information that can change from slice to slice, as well as such picture related information that is relatively small or relevant only for certain slice or picture types. The slice header may be noticeably bigger than the PPS, particularly when tile or wavefront entry point offsets are present in the slice header, or when the RPS, prediction weights, or reference picture list modifications are explicitly signaled.

Activation of parameter sets is similar to H.264/AVC. As shown in Fig. 3, the slice header contains a reference to a PPS. The PPS, in turn, contains a reference to an SPS, and the SPS contains a reference to a VPS. One common implementation strategy for parameter sets is to keep all parameter sets of a given type (PPS, SPS, and VPS) in tables, whose maximum size is indirectly specified by the numbering range of the parameter set IDs. Under this implementation strategy, parameter set activation can be as simple as accessing the PPS table based on information in the slice header, copying the information found into the relevant decoder data structures, following the reference in the PPS to the relevant SPS, and following the reference in the SPS to the relevant VPS. As these operations need to be performed only once per picture (worst case), the operation is lightweight. Similarly, the handling of the reception of a parameter set NAL unit, regardless of its type, is also straightforward. Parameter set NAL units do not contain parsing dependencies, which means they are self-contained and do not require context derived from other NAL units for parsing. Although this may cost a few more bits, it enables straightforward parsing and storage of parameter sets in their respective table entries.
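The table-based implementation strategy described above might be sketched as follows; the class, method, and field names are illustrative assumptions, not specification names:

```python
class ParameterSetTables:
    """Store received parameter sets keyed by their IDs (hypothetical layout)."""

    def __init__(self):
        self.vps, self.sps, self.pps = {}, {}, {}

    def store(self, kind: str, ps: dict) -> None:
        # Parameter sets are self-contained, so they can be parsed and
        # stored on reception, independent of any other NAL unit.
        getattr(self, kind)[ps["id"]] = ps

    def activate_from_slice_header(self, slice_header: dict) -> tuple:
        # Follow the reference chain once per picture (worst case):
        # slice header -> PPS -> SPS -> VPS.
        pps = self.pps[slice_header["pps_id"]]
        sps = self.sps[pps["sps_id"]]
        vps = self.vps[sps["vps_id"]]
        return vps, sps, pps
```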

Finally, each type of parameter set contains an extension mechanism, which allows extending the parameter set in future versions of HEVC without breaking backward compatibility and without creating a parsing dependency on the profile/level information carried in the VPS and SPS.

Fig. 3. Slice header referring to a PPS, indirectly to an SPS, and indirectly to a VPS.

TABLE I
Picture Type Categories, Picture Types, and Picture Subtypes

    IV. Picture Types

As illustrated in Table I, picture types in HEVC can be classified into the following groups: 1) random access point (RAP) pictures; 2) leading pictures; 3) sub-layer access pictures; and 4) pictures that do not fall into the three aforementioned groups. The picture types and their subtypes as described in Table I are identified by the NAL unit type in HEVC. RAP picture types include the IDR, BLA, and CRA pictures, and can be further characterized based on the leading pictures associated with them, as indicated in Table I. The CRA picture facilitates decoding beginning from any random access point in the middle of a coded video sequence, which is more efficient from the compression efficiency point of view than inserting an IDR picture (and, thereby, splitting the coded video sequence into two coded video sequences). RAP pictures and leading pictures are reviewed in further detail in Section IV-A.

Temporal sub-layer access pictures, TSA and STSA, indicate valid temporal sub-layer switching points and are reviewed in Section IV-B.

A. Random Access Point and Leading Pictures

In many video applications, such as broadcasting and streaming, an important feature for users is the ability to switch between different channels and to jump to specific parts of the video with minimum delay. This feature is enabled by inserting RAP pictures at regular intervals into the video bitstream. The IDR picture can be used for random access. However, pictures following the IDR in decoding order cannot use pictures decoded prior to the IDR picture as reference. Consequently, bitstreams relying on IDR pictures for random access can have significantly lower coding efficiency (e.g., 6%, as reported in [15]). To improve coding efficiency, CRA pictures in HEVC allow pictures that follow the CRA picture in decoding order but precede it in output order to use pictures decoded before the CRA picture as reference, while still allowing clean random access functionality similar to that of an IDR picture. Clean random access is ensured by guaranteeing that pictures that follow a CRA picture in both decoding and output order are decodable if random access is performed at the CRA picture.

A typical prediction structure around a CRA picture is shown in Fig. 4, which uses the concept of a structure of pictures (SOP), defined as one or more coded pictures consecutive in decoding order, in which the first coded picture in decoding order is a reference picture at the lowest sub-layer, and no coded picture except potentially the first coded picture in decoding order is a RAP picture. The relative decoding order of the pictures is illustrated by the numerals inside the pictures. Any picture in the previous SOP has a smaller decoding order than any picture in the current SOP, and any picture in the next SOP has a larger decoding order than any picture in the current SOP. The CRA picture, I28, belongs to a SOP that also contains the pictures B29, B30, and B31, which follow the CRA picture in decoding order but precede it in output order. These pictures are called the leading pictures of the CRA picture. Since B29 and B31 do not refer to any picture preceding the CRA picture in decoding order, they can be correctly decoded when the decoding starts from the CRA picture and are therefore RADL pictures. Picture B30 is a RASL picture, which can be correctly decoded only if the decoding starts from a RAP picture before the current CRA picture. RASL pictures cannot be correctly decoded when random access from this CRA picture occurs; hence RASL pictures are typically discarded during random access decoding. If I28 had been an IDR picture, it would have cleared the decoded picture buffer (DPB), and hence B30 would not have been able to use P24 as a reference. Making I28 a CRA picture makes this possible.

To prevent error propagation from reference pictures that may not be available depending on where the decoding starts, all pictures in the next SOP, as shown in Fig. 4, that follow the CRA picture both in decoding order and output order shall not use as reference any picture that precedes the CRA picture either in decoding order or output order (which includes the leading pictures).

Fig. 4. CRA picture and leading pictures.

When a part of a bitstream starting from a CRA picture is included in another bitstream, the RASL pictures associated with the CRA picture cannot be decoded, because some of their reference pictures are not present in the combined bitstream. To make such a splicing operation straightforward, the NAL unit type of the CRA picture can be changed to indicate that it is a BLA picture. The RASL pictures associated with a BLA picture are typically not correctly decodable, and hence should not be output/displayed.
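The decoder-side handling of leading pictures at a random access point can be sketched as follows, using illustrative picture objects with nal_type, poc, and associated_rap_poc attributes (our own names, not specification variables):

```python
def pictures_to_decode_and_output(pictures: list, rap_index: int) -> list:
    """Start decoding at a RAP picture and drop its RASL pictures.

    RASL pictures associated with the RAP we start from may reference
    pictures before that RAP, so they are discarded (not output).
    """
    rap = pictures[rap_index]
    assert rap.nal_type in ("IDR", "CRA", "BLA")
    kept = []
    for pic in pictures[rap_index:]:
        if pic.nal_type == "RASL" and pic.associated_rap_poc == rap.poc:
            continue  # not correctly decodable from this random access point
        kept.append(pic)
    return kept
```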

Similar random access functionalities are supported in H.264/AVC with the recovery point SEI message. An H.264/AVC decoder implementation may or may not support the functionality. In HEVC, a bitstream starting with a CRA or BLA picture is considered a conforming bitstream [16]. When a bitstream starts with a CRA picture, the RASL pictures of the CRA picture may refer to unavailable reference pictures and hence cannot be correctly decoded. However, HEVC specifies that the RASL pictures of the starting CRA picture are not output; hence the name "clean random access." To establish bitstream conformance requirements (e.g., to mandate that encoders follow the syntax as well as the constraints on syntax element values for leading pictures), HEVC specifies a decoding process to generate unavailable reference pictures for decoding of the RASL pictures. However, conforming decoder implementations do not have to follow that decoding process, as long as they generate identical output compared to when the decoding process is performed from the beginning of the coded video sequence. The same applies to RASL pictures associated with BLA pictures, regardless of whether the BLA picture starts a bitstream.

It is worth noting that in HEVC a conforming bitstream may contain no IDR pictures at all.

B. Temporal Sub-Layer Access Pictures

The term temporal sub-layer switching point refers to a point at which a picture in a sub-layer has no dependency on any other picture in the same sub-layer that precedes the picture in decoding order.

A MANE may apply bitstream thinning to an HEVC bitstream that is encoded with multiple sub-layers. At any point in the bitstream, a MANE can start removing NAL units of higher sub-layers with the knowledge that the pictures in the lower sub-layers are still decodable, since the decoding process for the pictures in the lower sub-layers does not depend on the NAL units of the higher sub-layers. The action of starting to remove all NAL units with TemporalId higher than a certain value can be referred to as temporal down-switching. Temporal down-switching is always possible at any picture.

The action of starting to forward NAL units of a certain sub-layer that has not been forwarded up until that point can be referred to as temporal up-switching. Temporal up-switching is only possible if none of the pictures in the layer that is switched to depend on any picture in the same sub-layer prior to the point in the bitstream at which the switch was performed.

In H.264/SVC, temporal sub-layer switching points can be indicated either by setting temporal_id_nesting_flag in the SPS equal to 1, if all pictures in the sequence with temporal_id greater than 0 are temporal layer switching points, or through the temporal level switching point SEI message, which also indicates for how long a period temporal layer M must have been decoded prior to the switch point in order to switch up to temporal layer M + 1 at that point.

In HEVC, just as in H.264/SVC, it is possible to indicate that all pictures with TemporalId greater than 0 are sub-layer switching points by setting sps_temporal_id_nesting_flag in the SPS equal to 1. In addition, HEVC has two picture types, the TSA and STSA picture types, that can be used to indicate temporal sub-layer switching points.

The TSA picture type is defined as a switch-to switching point, meaning that when a picture is coded as a TSA picture it is possible to perform sub-layer switching to the layer in which the TSA picture is contained (as opposed to a switch-from definition, for which an indication that switching is possible would be contained in the sub-layer that it is possible to switch from). The TSA picture type imposes restrictions on the TSA picture itself and on all pictures in the same sub-layer that follow the TSA picture in decoding order: none of these pictures shall use inter prediction from any picture in the same sub-layer that precedes the TSA picture in decoding order. The TSA definition further imposes restrictions on the pictures in higher sub-layers that follow the TSA picture in decoding order: none of these pictures shall reference a picture that precedes the TSA picture in decoding order if that picture belongs to the same or a higher sub-layer than the TSA picture. TSA pictures must have TemporalId greater than 0. The STSA picture is similar to the TSA picture, but does not impose restrictions on the pictures in higher sub-layers that follow the STSA picture in decoding order, and hence enables up-switching only onto the sub-layer where the STSA picture resides.

In Fig. 5, the horizontal axis represents the output order, the vertical axis represents the sub-layer, and the subscript numbers represent the decoding order (which is the same as the output order in this example). The arrows represent inter prediction. The pictures P1, P2, P3, and P5 are valid sub-layer switching points and may use the TSA picture type. P6 and P7 are not switching points, since P7 uses P5 for prediction.

Fig. 5. Coding structure with three temporal layers.

    V. Picture Partitioning Schemes

HEVC includes four different picture partitioning schemes, namely regular slices, dependent slices, tiles, and wavefront parallel processing (WPP), which may be applied for MTU size matching, parallel processing, and reduced end-to-end delay.

Regular slices are similar to those in H.264/AVC. Each regular slice is encapsulated in its own NAL unit, and in-picture prediction (intra sample prediction, motion information prediction, coding mode prediction) and entropy coding dependency across slice boundaries are disabled. Thus a regular slice can be reconstructed independently from other regular slices within the same picture (though there may still be interdependencies due to loop filtering operations).

The regular slice is the only tool for parallelization that is also available, in virtually identical form, in H.264/AVC. Parallelization based on regular slices does not require much interprocessor or intercore communication (except for interprocessor or intercore data sharing for motion compensation when decoding a predictively coded picture, which is typically much heavier than interprocessor or intercore data sharing due to in-picture prediction). However, for the same reason, the use of regular slices can incur substantial coding overhead, due to the bit cost of the slice header and due to the lack of prediction across the slice border. Further, regular slices (in contrast to the other tools mentioned below) also serve as the key mechanism for bitstream partitioning to match MTU size requirements, due to the in-picture independence of regular slices and the fact that each regular slice is encapsulated in its own NAL unit. In many cases, the goal of parallelization and the goal of MTU size matching place contradicting demands on the slice layout in a picture. The realization of this situation led to the development of the parallelization tools mentioned below.

Dependent slices have short slice headers and allow partitioning of the bitstream at treeblock boundaries without breaking any in-picture prediction. Basically, dependent slices provide fragmentation of regular slices into multiple NAL units, providing reduced end-to-end delay by allowing a part of a regular slice to be sent out before the encoding of the entire regular slice is finished.

In WPP, the picture is partitioned into single rows of coding tree units (CTUs). Entropy decoding and prediction are allowed to use data from CTUs in other partitions. Parallel processing is possible through parallel decoding of CTU rows, where the start of the decoding of a CTU row is delayed by two CTUs, so as to ensure that data related to a CTU above and to the right of the subject CTU is available before the subject CTU is decoded. Using this staggered start (which appears like a wavefront when represented graphically), parallelization is possible with up to as many processors/cores as the picture contains CTU rows. Because in-picture prediction between neighboring treeblock rows within a picture is permitted, the required interprocessor/intercore communication to enable in-picture prediction can be substantial. The WPP partitioning does not result in the production of additional NAL units compared to when it is not applied; thus WPP is not a tool for MTU size matching. However, if MTU size matching is required, regular slices can be used with WPP, with certain coding overhead.
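The two-CTU staggered start translates into a simple per-CTU readiness condition, sketched below; the scheduling machinery of a real parallel decoder is omitted, and the function is our own illustration:

```python
def wpp_ctu_ready(row: int, col: int, decoded: list[list[bool]]) -> bool:
    """True when CTU (row, col) may be decoded under WPP.

    The left neighbor must be done, and the row above must be at least
    two CTUs ahead, so that the above-right CTU is available for
    prediction and entropy-coding context inheritance.
    """
    ncols = len(decoded[0])
    left_done = col == 0 or decoded[row][col - 1]
    # For the last column there is no above-right CTU; the above CTU suffices.
    above_right_done = row == 0 or decoded[row - 1][min(col + 1, ncols - 1)]
    return left_done and above_right_done
```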

Tiles define horizontal and vertical boundaries that partition a picture into tile columns and rows. The scan order of CTUs is changed to be local within a tile (in the order of a CTU raster scan of a tile) before decoding the top-left CTU of the next tile in the order of the tile raster scan of the picture. Similar to regular slices, tiles break in-picture prediction dependencies as well as entropy decoding dependencies. However, they do not need to be included in individual NAL units (same as WPP in this regard); hence tiles cannot be used for MTU size matching. Each tile can be processed by one processor/core, and the interprocessor/intercore communication required for in-picture prediction between processing units decoding neighboring tiles is limited to conveying the shared slice header, in cases where a slice spans more than one tile, and the loop-filtering-related sharing of reconstructed samples and metadata. When more than one tile or WPP segment is included in a slice, the entry point byte offset for each tile or WPP segment other than the first one in the slice is signaled in the slice header.

For simplicity, restrictions on the application of the four different picture partitioning schemes have been specified in HEVC. A given coded video sequence cannot include both tiles and wavefronts. For each slice and tile, either or both of the following conditions must be fulfilled: 1) all coded treeblocks in a slice belong to the same tile; and 2) all coded treeblocks in a tile belong to the same slice. Finally, a wavefront segment contains exactly one CTU row, and when WPP is in use, if a slice starts within a CTU row, it must end in the same CTU row.

VI. Reference Picture Sets

A. Introduction

The RPS concept in HEVC defines how previously decoded pictures are managed in a decoded picture buffer (DPB) in order to be used for reference, i.e., for sample data prediction and motion vector prediction. Pictures in the DPB can be marked as "used for short-term reference," "used for long-term reference," or "unused for reference." Once a picture has been marked "unused for reference," it can no longer be used for prediction, and when it is no longer needed for output it can be removed from the DPB. The RPS concept for reference picture management is fundamentally different from the reference picture management of previous video coding standards. Instead of signaling relative changes to the DPB, the status of the DPB is signaled in every slice. A goal in the HEVC development for reference picture management was to have a basic level of error robustness in all standard-conforming bitstreams and decoders.

B. Background

H.263 Annex U and H.264/AVC [6] constituted a technological shift in the area of reference picture usage for motion-compensated prediction of image sample data in video coding. Support for flexible reference picture selection was introduced in Annexes N and U of H.263 and was also adopted in H.264/AVC, allowing up to 16 reference pictures to be used, providing improved compression efficiency as well as improved possibilities for error (packet-loss) recovery using feedback channels.

Fig. 6. HRD buffer model.

H.264/AVC as well as HEVC defines an HRD, which models a decoder and describes the usage of a coded picture buffer (CPB) and a DPB, as illustrated in Fig. 6. In both H.264/AVC and HEVC, the decoding order of coded pictures is the same as the order in which the coded pictures occur in the bitstream. Both standards support an output order of decoded pictures that is different from the decoding order of the pictures. Each picture is associated with a picture order count (POC) value that represents the output order.

In H.264/AVC, there are two types of reference pictures, short-term and long-term. A reference picture may be marked as "unused for reference" when it is no longer needed for prediction reference. The conversion among these three statuses (short-term, long-term, and unused for reference) is controlled by the decoded reference picture marking process. There are two alternative decoded reference picture marking mechanisms, the implicit sliding window process and the explicit memory management control operation (MMCO) process. The sliding window process marks a short-term reference picture as "unused for reference" when the number of reference frames is equal to a given maximum number (max_num_ref_frames in the SPS). The short-term reference pictures are stored in a first-in, first-out manner, so that the most recently decoded short-term pictures are kept in the DPB.

The explicit MMCO process may include multiple MMCO commands. An MMCO command may mark one or more short-term or long-term reference pictures as "unused for reference," mark all the pictures as "unused for reference," or mark the current reference picture or an existing short-term reference picture as long-term and assign a long-term picture index to that long-term picture.

In H.264/AVC, the reference picture marking operations, as well as the processes for output and removal of pictures from the DPB, are performed after a picture has been decoded. Some aspects related to the H.264/AVC reference picture marking mechanisms are discussed below.

1) Gaps in frame_num and Non-Existing Pictures: In H.264/AVC, each reference picture is associated with a number, FrameNum (derived from the frame_num syntax element in the slice header), which indicates the decoding order. Normally this number increases by one for each reference picture, but gaps in FrameNum may be allowed (by setting the sequence-level parameter gaps_in_frame_num_allowed_flag to one), such that an encoder or a MANE can deliver a bitstream in which FrameNum increases by more than one for a reference picture relative to the preceding reference picture in decoding order. This was allowed in order to support temporal scalability. A decoder that receives a sequence with gaps in FrameNum shall create "non-existing" pictures to fill the gap. The non-existing pictures are assigned the FrameNum values in the gap and are considered as reference pictures during decoded reference picture marking, but will not be used for output (hence not displayed). The non-existing pictures ensure that the status of the DPB, with respect to the FrameNum of the pictures residing in it, is the same for a decoder that received the pictures as for a decoder that did not.

2) Loss of a Reference Picture When Using the Sliding Window: When a reference picture is lost in H.264/AVC, a decoder can try to conceal the picture (and possibly report the loss to the encoder if a feedback channel is available), given that the loss is detected. If gaps in FrameNum are disallowed, a discontinuity in FrameNum values indicates an unintentional loss of a reference picture. If gaps in FrameNum are allowed, a discontinuity in FrameNum values may be caused either by intentional removal of temporal layers or subsequences, or by an accidental picture loss, and decoders should infer a picture loss only if a non-existing picture is referred to in the inter prediction process. The POC of a concealed picture may not be known, which can cause the decoder to use incorrect reference pictures without detecting any errors when decoding B-pictures.

3) Loss of a Reference Picture With MMCO: In H.264/AVC, when a picture that contains an MMCO command marking a short-term reference picture as "unused for reference" is lost, the status of the reference pictures in the DPB becomes incorrect, and consequently the reference picture lists for a few pictures following the lost picture may become incorrect.

If a picture containing MMCO commands related to long-term reference pictures is lost, there is a risk that the number of long-term reference pictures in the DPB differs from what it would have been had the picture been received, resulting in an incorrect sliding window process for all following pictures. That is, the encoder and decoder will contain different numbers of short-term reference pictures, resulting in out-of-sync behavior of the sliding window process. What makes the situation even worse is that the decoder will not necessarily know that the sliding window process is out of sync.

C. RPS Concept

HEVC introduces a completely new approach to reference picture management, referred to as RPS or buffer description [17].

The most fundamental difference of the RPS concept compared to MMCO/sliding window of H.264/AVC is that, for each particular slice, a complete set of the reference pictures that are used by the current picture or any subsequent picture must be provided. Thus, a complete set of all pictures that must be kept in the DPB for use by the current or future pictures is signaled. This is different from the H.264/AVC scheme, where only relative changes to the DPB are signaled. With the RPS concept, no information from earlier pictures in decoding order is needed to maintain the correct status of reference pictures in the DPB.

TABLE II
RPS Example

Picture | RPS {reference picture, used by current picture}
--------|--------------------------------------------------
I0      | -
P1      | {I0, 1}
B2      | {I0, 1}, {P1, 1}
B3      | {I0, 1}, {P1, 0}, {B2, 1}
B4      | {P1, 1}, {B2, 1}

    Fig. 7. Coding structure for RPS example.

D. RPS Example

An example of a coding structure is shown in Fig. 7. The RPSs for the pictures in Fig. 7 are shown in Table II. The first picture in decoding order is an IDR picture, I0, for which no RPS is signaled, since it is the first picture in the coded video sequence and no picture that precedes the IDR picture in decoding order can be used for reference by the IDR picture or by any picture that follows it in decoding order. The second picture in decoding order, P1, uses I0 for reference. It must therefore include I0 in its RPS. Picture B2 uses both I0 and P1 for reference, so they are both included in the RPS of B2.

B3 uses I0 and B2 for reference, so they are included in the RPS of B3. P1 must also be included, since it will be used for reference by future pictures. Finally, picture B4 uses B2 and P1 for prediction. Note that the RPS of B4 does not contain I0. Since I0 is not listed, it will be marked "unused for reference." This means that I0 cannot be used for reference by B4 or by any picture that follows B4 in decoding order.

E. Order of Picture Decoding and DPB Operations

The order of picture decoding and DPB operations in HEVC is changed compared to H.264/AVC in order to exploit the advantages of RPS and improve error resilience. In H.264/AVC, picture marking and buffer operations (both output and removal of decoded pictures from the DPB) are applied after the current picture has been decoded, with the exception of when gaps in FrameNum are detected. In HEVC, the RPS is first decoded from a slice header of the current picture, then picture marking and buffer operations are applied before decoding the current picture.

The order of the different processes for H.264/AVC and HEVC is shown in Figs. 8 and 9, respectively. As can be seen, the high-level decoding processes based on syntax elements in the slice header and above have been significantly simplified in HEVC compared to H.264/AVC, thanks to the introduction of the RPS concept.

F. Signaling of RPSs

Each slice header in HEVC must include parameters for signaling of the RPS for the picture containing the slice.

  • 1866 IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS FOR VIDEO TECHNOLOGY, VOL. 22, NO. 12, DECEMBER 2012

    Fig. 8. H.264/AVC decoding order.

The only exception is that no RPS is signaled for IDR slices; instead, the RPS is inferred to be empty. For I slices that do not belong to an IDR picture, an RPS may be provided, even though they belong to an I picture, since there may be pictures following the I picture in decoding order that use inter prediction from pictures preceding the I picture in decoding order. The number of pictures in an RPS shall not exceed the DPB size limit as specified by the sps_max_dec_pic_buffering syntax element in the SPS.

Each picture is associated with a POC value that represents the output order. The slice headers contain a fixed-length codeword, pic_order_cnt_lsb, representing the least significant bits of the full POC value, also known as the POC LSB. The length of the codeword is signaled in the SPS and can be between 4 and 16 bits. The RPS concept uses POC to identify reference pictures. Besides its own POC value, each slice header directly contains, or inherits from the SPS, a coded representation of the POC values of each picture in the RPS.
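The full POC is recovered from pic_order_cnt_lsb by tracking its most significant part across pictures and detecting LSB wrap-around. The following sketch mirrors the well-known derivation; variable names are ours, and error handling is omitted:

```python
def derive_poc(poc_lsb: int, max_poc_lsb: int,
               prev_poc_lsb: int, prev_poc_msb: int) -> int:
    """Compute the full POC of the current picture from its POC LSB.

    max_poc_lsb is 2**N for an N-bit pic_order_cnt_lsb codeword;
    prev_poc_lsb/prev_poc_msb come from a previous reference picture.
    """
    if poc_lsb < prev_poc_lsb and prev_poc_lsb - poc_lsb >= max_poc_lsb // 2:
        poc_msb = prev_poc_msb + max_poc_lsb   # LSB wrapped around upward
    elif poc_lsb > prev_poc_lsb and poc_lsb - prev_poc_lsb > max_poc_lsb // 2:
        poc_msb = prev_poc_msb - max_poc_lsb   # LSB wrapped around downward
    else:
        poc_msb = prev_poc_msb
    return poc_msb + poc_lsb
```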

The RPS for each picture consists of five different lists of reference pictures, also referred to as the five RPS subsets. RefPicSetStCurrBefore consists of all short-term reference pictures that are prior to the current picture in both decoding order and output order, and that may be used in inter prediction of the current picture. RefPicSetStCurrAfter consists of all short-term reference pictures that are prior to the current picture in decoding order, that succeed the current picture in output order, and that may be used in inter prediction of the current picture. RefPicSetStFoll consists of all short-term reference pictures that may be used in inter prediction of one or more of the pictures following the current picture in decoding order, and that are not used in inter prediction of the current picture. RefPicSetLtCurr consists of all long-term reference pictures that may be used in inter prediction of the current picture. RefPicSetLtFoll consists of all long-term reference pictures that may be used in inter prediction of one or more of the pictures following the current picture in decoding order, and that are not used in inter prediction of the current picture.
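Conceptually, each RPS entry falls into exactly one of the five subsets according to whether it is short-term or long-term, its POC relative to the current picture, and whether the current picture uses it. A minimal sketch over illustrative (poc, is_long_term, used_by_curr) tuples:

```python
def classify_rps(entries: list[tuple[int, bool, bool]], current_poc: int) -> dict:
    """Sort RPS entries (poc, is_long_term, used_by_curr) into the five subsets."""
    subsets = {"StCurrBefore": [], "StCurrAfter": [], "StFoll": [],
               "LtCurr": [], "LtFoll": []}
    for poc, is_long_term, used_by_curr in entries:
        if is_long_term:
            subsets["LtCurr" if used_by_curr else "LtFoll"].append(poc)
        elif not used_by_curr:
            subsets["StFoll"].append(poc)
        elif poc < current_poc:
            subsets["StCurrBefore"].append(poc)  # earlier in output order
        else:
            subsets["StCurrAfter"].append(poc)   # later in output order
    return subsets
```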

    Fig. 9. HEVC decoding order.

    Fig. 10. Example of a picture, B14, with entries in all subsets of the RPS.

The RPS is signaled using up to three loops iterating over different types of reference pictures: short-term reference pictures with lower POC value than the current picture, short-term reference pictures with higher POC value than the current picture, and long-term reference pictures. In addition, a flag (used_by_curr_pic_X_flag) is sent for each reference picture, indicating whether the reference picture is used for reference by the current picture (included in one of the lists RefPicSetStCurrBefore, RefPicSetStCurrAfter, or RefPicSetLtCurr) or not (included in one of the lists RefPicSetStFoll or RefPicSetLtFoll).

In the example in Fig. 10, the picture B14 contains exactly one picture in each of the five RPS subsets: P8 is in RefPicSetStCurrBefore, since it is before B14 in output order and used by B14; P12 is in RefPicSetStCurrAfter, since it is after B14 in output order and used by B14; P13 is in RefPicSetStFoll, since it is a short-term reference picture that is not used by B14 (but must be kept in the DPB since it is used by B15); P4 is in RefPicSetLtCurr, since it is a long-term reference picture that is used by B14; and I0 is in RefPicSetLtFoll, since it is a long-term reference picture that is not used by the current picture (but must be kept in the DPB since it is used by B15).

1) Signaling of Short-Term Reference Pictures in an RPS: The short-term part of an RPS may be included directly in the slice header. Alternatively, the slice header may contain only a syntax element representing an index, referencing a predefined list of RPSs sent in the active SPS. The short-term part of an RPS can be signaled using either of two different schemes: Inter RPS, as described in the next subsection, or Intra RPS, as described in this subsection. When Intra RPS is used, num_negative_pics and num_positive_pics are signaled, representing the lengths of two different lists of reference pictures. These lists contain the reference pictures with negative POC difference and positive POC difference compared to the current picture, respectively. Each element in these lists is encoded with a variable length code representing the difference in POC value relative to the previous element in the list, minus one. For the first picture in each list, the signaling is relative to the POC value of the current picture, minus one.
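Decoding the Intra RPS delta chain can be sketched as follows; read_ue stands in for reading one Exp-Golomb codeword and is a hypothetical callable, not a specification name:

```python
def decode_intra_rps_deltas(read_ue, num_pics: int, sign: int) -> list[int]:
    """Decode one list of POC deltas relative to the current picture.

    sign is -1 for the negative (lower-POC) list and +1 for the positive
    (higher-POC) list. Each codeword is the POC difference to the previous
    element minus one, so consecutive pictures are the cheapest to code.
    """
    deltas, prev = [], 0  # the first element is coded relative to delta 0
    for _ in range(num_pics):
        prev += sign * (read_ue() + 1)
        deltas.append(prev)
    return deltas
```

For example, codewords 0, 1 in the negative list would decode to POC deltas -1 and -3 under this sketch.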

2) Signaling of Short-Term Pictures in an RPS Using the Inter RPS Scheme: When encoding the recurring RPSs in the sequence parameter set, it is possible to encode the elements of one RPS with reference to another RPS already encoded in the sequence parameter set. This is referred to as Inter RPS. There are no error robustness problems associated with this method, as all the RPSs of the sequence parameter set are in the same NAL unit.

The Inter RPS syntax exploits the fact that the RPS of the current picture can be predicted from the RPS of a previously decoded picture. This is because all the reference pictures of the current picture must either be reference pictures of the previous picture or the previously decoded picture itself. It is only necessary to indicate which of these pictures should be reference pictures and be used for the prediction of the current picture [18].

Therefore, the syntax comprises the following: an index pointing to the RPS to use as a predictor, a delta POC to be added to the POC differences of the predictor RPS to obtain the POC differences of the current RPS, and a set of indicators specifying which pictures are reference pictures and whether they are used for the prediction of the current or future pictures.
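A simplified sketch of the Inter RPS idea, ignoring the exact flag layout of the draft syntax: the candidate pictures are the POC deltas of the predictor RPS plus the predictor picture itself (delta 0), all shifted by the signaled delta.

```python
def predict_rps(predictor_deltas: list[int], delta_rps: int,
                include: list[bool],
                used_by_curr: list[bool]) -> list[tuple[int, bool]]:
    """Build an RPS from a predictor RPS (conceptual sketch).

    include[j] says whether candidate j is a reference picture of the
    current RPS; used_by_curr[j] says whether the current picture uses it.
    """
    candidates = predictor_deltas + [0]  # delta 0 = the predictor picture itself
    rps = []
    for cand, inc, used in zip(candidates, include, used_by_curr):
        if inc:
            rps.append((cand + delta_rps, used))
    return rps
```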

3) Signaling of Long-Term Reference Pictures in an RPS: Encoders that would like to exploit long-term reference pictures must set the SPS syntax element long_term_ref_pics_present_flag to one. Long-term reference pictures can then be signaled in the slice header by fixed-length codewords, poc_lsb_lt, representing the least significant bits of the full POC value of each long-term picture. Each poc_lsb_lt is a copy of the pic_order_cnt_lsb codeword that was signaled for a particular long-term picture. It is also possible to signal a set of long-term pictures in the SPS as a list of POC LSB values. The POC LSB for a long-term picture can then be signaled in the slice header as an index into this list.

The delta_poc_msb_cycle_lt_minus1 syntax element can additionally be signaled to enable the calculation of the full POC distance of a long-term reference picture relative to the current picture. It is required that the codeword delta_poc_msb_cycle_lt_minus1 is signaled for each long-term reference picture that has the same POC LSB value as any other reference picture in the RPS.

G. Picture Marking

Before picture decoding, there will typically be a number of pictures present in the DPB. Some of them may be available for prediction and thus marked as "used for reference." Others may be unavailable for prediction but waiting for output, and thus marked as "unused for reference." When the slice header has been parsed, a picture marking process is carried out before the slice data is decoded. Pictures that are present in the DPB and marked as "used for reference" but are not included in the RPS are marked "unused for reference." Pictures that are not present in the DPB but are included in the reference picture set are ignored if used_by_curr_pic_X_flag is equal to zero. However, if used_by_curr_pic_X_flag is instead equal to one, this reference picture was intended to be used for prediction in the current picture but is missing. Then an unintentional picture loss is inferred, and the decoder should take appropriate action.

After decoding the current picture, it is marked "used for short-term reference."
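The marking process just described amounts to a set comparison between the DPB and the RPS. A minimal sketch with illustrative picture records (poc and marked_for_reference are our own field names):

```python
def apply_rps_marking(dpb: list, rps: dict) -> list[int]:
    """Mark DPB pictures against the decoded RPS; return POCs of lost pictures.

    rps maps POC -> used_by_curr flag for every picture that the current
    picture or any future picture may reference.
    """
    for pic in dpb:
        if pic.marked_for_reference and pic.poc not in rps:
            pic.marked_for_reference = False  # now "unused for reference"
    dpb_pocs = {pic.poc for pic in dpb}
    # RPS entries needed by the current picture but absent from the DPB
    # indicate an unintentional picture loss the decoder should react to
    # (e.g., conceal, or report over a feedback channel).
    return [poc for poc, used in rps.items() if used and poc not in dpb_pocs]
```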

H. POC, FrameNum, and Non-Existing Pictures

HEVC does not include FrameNum signaling or processes for generating and handling non-existing pictures. In H.264/AVC, FrameNum was used to identify pictures in the DPB when performing buffer operations (i.e., MMCO). FrameNum was also used to detect gaps in the decoding order of reference pictures (i.e., due to temporal scaling) in order to generate non-existing pictures. The non-existing pictures were introduced in H.264/AVC in order to keep the status of the reference pictures in the DPB correct in the case of temporal scaling. In HEVC, the RPS contains the status of all reference pictures in the DPB, so generating and handling non-existing pictures is not needed. Within the RPS concept, reference pictures are identified by POC. This has the advantage that the output order is known even for reference pictures that have been removed from the bitstream or unintentionally lost. Since POC is used for reference picture identification, there is no need to signal FrameNum.

I. Error Resilience Aspects of RPS

Since all reference pictures that are used by the current picture must be included in the RPS, there is no risk that the loss of one or more packets of data will lead to undetected usage of an incorrect reference picture. As soon as there is a reference picture with used_by_curr_pic_X_flag set to one in the RPS but no corresponding reference picture in the DPB, the decoder will know that an unintentional picture loss has occurred. The decoder side can react to the detection of a lost picture as appropriate for the specific application, e.g., create a concealed picture or report the loss to the encoder side through a feedback channel. How to handle picture losses is outside the scope of the HEVC specification. In H.264/AVC, the POC of a lost reference picture is typically not known. In HEVC, the RPS contains the POC of all reference pictures, which improves the possibility of creating a good concealment of lost reference pictures.

    VII. Reference Picture Lists

In HEVC, the term inter prediction is used to denote prediction derived from data elements (e.g., sample values or motion vectors) of reference pictures other than the current decoded picture. As in H.264/AVC, a picture can be predicted from multiple reference pictures. The reference pictures that are used for inter prediction are organized in one or more reference picture lists. The reference index identifies which of the reference pictures in the list should be used for creating the prediction signal.


A single reference picture list, List 0, is used for a P slice, and two reference picture lists, List 0 and List 1, are used for a B slice. Similar to H.264/AVC, reference picture list construction in HEVC includes reference picture list initialization and reference picture list modification.

In H.264/AVC, the initialization process for List 0 is different for P slices (for which decoding order is used) and B slices (for which output order is used). In HEVC, output order is used in both cases.

Reference picture list initialization creates default List 0 and List 1 (if the slice is a B slice) based on three RPS subsets: RefPicSetStCurrBefore, RefPicSetStCurrAfter, and RefPicSetLtCurr. Short-term pictures with earlier (later) output order are first inserted into List 0 (List 1) in ascending order of POC distance to the current picture, then short-term pictures with later (earlier) output order are inserted into List 0 (List 1) in ascending order of POC distance to the current picture, and finally the long-term pictures are inserted at the end. In terms of the RPS, for List 0, the entries in RefPicSetStCurrBefore are inserted first, followed by the entries in RefPicSetStCurrAfter. Afterward, the entries in RefPicSetLtCurr, if available, are appended.

In HEVC, the above process is repeated (reference pictures that have already been added to the reference picture list are added again) when the number of entries in a list is smaller than the target number of active reference pictures (signaled in the picture parameter set or slice header). When the number of entries is larger than the target number, the list is truncated.
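The initialization, repetition, and truncation steps can be sketched as follows (an illustrative Python sketch; the three RPS subsets are assumed to be given as lists of POC values, already ordered by POC distance to the current picture):

    def init_ref_pic_list(st_curr_before, st_curr_after, lt_curr,
                          num_active, is_list1=False):
        # For List 1, the two short-term subsets are taken in the
        # opposite order (pictures with later output order first).
        if is_list1:
            st_curr_before, st_curr_after = st_curr_after, st_curr_before
        candidates = st_curr_before + st_curr_after + lt_curr
        if not candidates:
            return []
        # Repeat entries until the target number of active reference
        # pictures is reached, then truncate to exactly that number.
        ref_list = []
        while len(ref_list) < num_active:
            ref_list.extend(candidates)
        return ref_list[:num_active]

For example, with a current picture of POC 8, RefPicSetStCurrBefore = [7, 6], RefPicSetStCurrAfter = [9], and an empty RefPicSetLtCurr, init_ref_pic_list([7, 6], [9], [], num_active=4) yields the initial List 0 [7, 6, 9, 7], with the first entry repeated to fill the four active positions.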

After a reference picture list has been initialized, it may be modified such that the reference pictures for the current picture can be arranged in any order, including the case where one particular reference picture appears in more than one position in the list, based on the reference picture list modification commands. When the flag indicating the presence of list modifications is set to one, a fixed number of commands (equal to the target number of entries in the reference picture list) are signaled, and each command inserts one entry into the reference picture list. A reference picture is identified in a command by its index into the list of reference pictures for the current picture derived from the RPS signaling. This differs from reference picture list modification in H.264/AVC, in which a picture is identified either by the picture number (derived from the frame_num syntax element) or by the long-term reference picture index, and in which fewer commands may be needed, e.g., for swapping the first two entries of an initial list or for inserting one entry at the beginning of the initial list and shifting the others.
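A sketch of how such modification commands could be applied (illustrative Python; rps_candidates is the RPS-derived list of reference pictures for the current picture, and list_entry holds the signaled indices, one per entry of the final list):

    def modify_ref_pic_list(rps_candidates, list_entry):
        # Each command picks one candidate by its index; the same
        # candidate may appear at more than one position.
        return [rps_candidates[i] for i in list_entry]

With candidates [7, 6, 9] and signaled indices [2, 0, 0, 1], the modified list becomes [9, 7, 7, 6].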

A reference picture list is not allowed to include any reference picture with a TemporalId greater than that of the current picture.

VIII. Supplemental Enhancement Information

The supplemental enhancement information (SEI) mechanism enables encoders to include metadata in the bitstream that is not required for correct decoding of the sample values of the output pictures, but that can be used for various other purposes, such as picture output timing, displaying, as well as loss detection and concealment. Encoders can include any number of SEI NAL units in an access unit, and each SEI NAL unit may contain one or more SEI messages. The HEVC standard includes the syntax and semantics for several SEI messages, but the handling of the SEI messages is not specified, because they do not affect the normative decoding process. One reason to have SEI messages in the HEVC standard is to enable supplemental data to be interpreted identically in different systems using HEVC. Specifications and systems using HEVC may require encoders to generate certain SEI messages or may define specific handling of particular types of received SEI messages.

Table III lists the SEI messages specified in HEVC and briefly describes their purposes. The SEI mechanism and several SEI messages in HEVC were inherited from H.264/AVC and are hence not described in detail in this paper. The field indication, decoded picture hash, sub-picture timing, active parameter sets, and structure of pictures (SOP) description SEI messages were introduced in the HEVC standard. The field indication SEI message enables the use of interlaced video content with HEVC. It is similar to the respective message in the H.263 standard [8]. The decoded picture hash SEI message contains a checksum derived from the decoded samples of the associated picture, hence enabling error detection. The sub-picture timing SEI message provides removal times for sub-pictures in hypothetical reference decoder operation for very low-delay applications, such as remote screen sharing. The active parameter sets SEI message provides the identifiers of the active video and sequence parameter sets, such that a MANE does not need to parse the slice header, PPS, and SPS when obtaining information that is essential for stream adaptation and other intelligent media-aware operations.
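As an illustration of how the decoded picture hash SEI message can be used, a decoder may recompute a hash over its reconstructed samples and compare it against the signaled value (a sketch assuming MD5 is the signaled hash type and that the decoded samples are available as raw bytes; the function and parameter names are hypothetical):

    import hashlib

    def picture_hash_matches(decoded_samples, signaled_digest):
        # Recompute the checksum over the decoded samples and compare
        # it with the digest carried in the SEI message.
        return hashlib.md5(decoded_samples).digest() == signaled_digest

A mismatch indicates that the decoded picture differs from the one the encoder produced, e.g., due to transmission errors.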

The SOP description SEI message describes the structure of the bitstream through RPSs and is described in further detail below. The motivation to design the SOP description SEI message arose from several use cases in which knowledge of the temporal and inter prediction structure is helpful [19]. For example, a gateway can use the SOP information in bit rate adaptation to determine a set of interrelated pictures that can be dropped without affecting the decoding of the forwarded bitstream. Such bitstream trimming can have a finer granularity than sub-bitstream extraction based on TemporalId and can therefore be more suitable for subtle temporary bit rate adaptation. The SOP description SEI message resides in the first access unit of a SOP. It provides the following information for each picture in the SOP: an indication of whether the picture is a reference or a non-reference picture, the TemporalId value of the picture, the short-term RPS index used by the picture, and the picture order count relative to the first picture of the SOP. These pieces of information represent the temporal structure and the inter prediction hierarchy of the SOP comprehensively.
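The per-picture information carried by the SOP description SEI message could be modeled as follows (an illustrative Python data structure, not the normative syntax), together with a sketch of the gateway use case mentioned above:

    from dataclasses import dataclass

    @dataclass
    class SopPictureInfo:
        is_reference: bool       # reference or non-reference picture
        temporal_id: int         # TemporalId value of the picture
        short_term_rps_idx: int  # index of the short-term RPS used
        poc_delta: int           # POC relative to the first picture of the SOP

    def droppable(sop_pictures, max_tid):
        # Non-reference pictures of the highest sub-layer can be dropped
        # without affecting the decoding of the forwarded bitstream.
        return [p for p in sop_pictures
                if not p.is_reference and p.temporal_id == max_tid]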

    IX. Conclusion

We presented various aspects of HEVC's high-level syntax, which form the system and transport interfaces of the HEVC codec to applications. While based on H.264/AVC's concepts, several changes were introduced in HEVC.


TABLE III
Overview of SEI Messages

SEI message                         Purpose
Buffering period                    Initial delays for hypothetical reference decoder (HRD) operation
Picture timing                      Picture output time and picture/sub-picture removal time for HRD operation
Pan-scan rectangle                  Displaying at a different picture aspect ratio (PAR) than the PAR of the output pictures
Filler payload                      Adjusting the bit rate to meet specific constraints
User data registered /              SEI messages specified by external entities
  user data unregistered
Recovery point                      Additional information for clean random access and gradual decoding refresh
Scene information                   Information about scene changes and transitions
Full-frame snapshot                 Indication to label the associated decoded picture as a still-image snapshot of the video content
Progressive refinement segment      Indicates that certain consecutive pictures represent a progressive refinement of the quality of a picture rather than a moving scene
Film grain characteristics          Enables decoders to synthesize film grain
Deblocking filter display           Recommends whether or not displayed pictures should undergo the in-loop deblocking filter process
  preference
Post-filter hint                    Provides suggested post-filter coefficients or correlation information for post-filter design
Tone mapping information            Remapping to another color space than that used or assumed in encoding
Frame packing arrangement           Packing of stereoscopic video into an HEVC bitstream
Display orientation                 Specifies flipping and/or rotation that should be applied to the output pictures when they are displayed
Field indication                    Provides information related to interlaced video content and/or field coding, e.g., indicates whether the picture is a progressive frame, a field, or a frame containing two interleaved fields
Decoded picture hash                Checksum of the decoded picture, which may be used for error detection
Sub-picture timing                  Sub-picture removal time for HRD operation
Active parameter sets               Provides information on the active VPS, SPS, etc.
Structure of pictures description   Describes the temporal and inter prediction structure of the bitstream

Compared to H.264/AVC, a number of source coding tools commonly associated with high-level designs have been eliminated, such as flexible macroblock ordering, data partitioning, and redundant slices. New concepts, including the VPS, parallel coding tools, additional random access point picture types, and temporal sub-layer switching picture types, were shown to be beneficial to the design and have been included. Some mechanisms available in H.264/AVC were kept in spirit, while the details of their implementation were changed, occasionally substantially. These include the RPS architecture and the NAL unit header syntax. We believe that, with these modifications, a cornerstone has been set for the success of HEVC in future video coding applications.

    References

[1] Sandvine, Sandvine Global Internet Phenomena Report: Fall 2011, 2011.

[2] Cisco, Visual Networking Index: Forecast and Methodology, 2010–2015, Jun. 2011.

[3] Allot, MobileTrends: Global Mobile Broadband Report, H2 2011. [Online]. Available: http://www.allot.com/index.aspx?id=4019&fileID=4117

[4] T. K. Tan, A. Fujibayashi, Y. Suzuki, and J. Takiue, [AHG 8] Objective and Subjective Evaluation of HM5.0, Joint Collaborative Team on Video Coding (JCT-VC) of ITU-T SG16 WP3 and ISO/IEC JTC1/SC29/WG11, JCTVC-H0116, 8th Meeting, San Jose, CA, Feb. 2012.

[5] B. Bross, W. J. Han, J. Ohm, G. Sullivan, and T. Wiegand, High Efficiency Video Coding (HEVC) Text Specification Draft 9, JCTVC-K1003-v9, Joint Collaborative Team on Video Coding (JCT-VC) of ITU-T SG16 WP3 and ISO/IEC JTC1/SC29/WG11, 11th Meeting, Shanghai, China, Oct. 2012.

[6] Advanced Video Coding for Generic Audiovisual Services, ITU-T Rec. H.264 and ISO/IEC 14496-10 (MPEG-4 AVC), 2003.

[7] Generic Coding of Moving Pictures and Associated Audio Information, Part 2: Video, ITU-T Rec. H.262 and ISO/IEC 13818-2 (MPEG-2), 1994.

[8] Video Coding for Low Bit Rate Communication, ITU-T Rec. H.263, Version 2, 2005.

[9] Coding of Audio-Visual Objects, Part 2: Visual, ISO/IEC 14496-2 (MPEG-4 Visual), Version 3, May 2004.

[10] C. Bormann, L. Cline, G. Deisher, T. Gardos, C. Maciocco, D. Newell, J. Ott, G. Sullivan, S. Wenger, and C. Zhu, RTP Payload Format for the 1998 Version of ITU-T Rec. H.263 Video (H.263+), RFC 2429, May 1999.

[11] Y.-K. Wang, M. M. Hannuksela, S. Pateux, A. Eleftheriadis, and S. Wenger, System and transport interface of SVC, IEEE Trans. Circuits Syst. Video Technol., vol. 17, no. 9, pp. 1149–1163, Sep. 2007.

[12] Y. Chen, Y.-K. Wang, K. Ugur, M. M. Hannuksela, J. Lainema, and M. Gabbouj, The emerging MVC standard for 3D video services, EURASIP J. Adv. Signal Process., vol. 2009, Article ID 786015, Jan. 2009.

[13] Call for Proposals on 3-D Video Coding Technology, ISO/IEC JTC1/SC29/WG11/N12036, Geneva, Switzerland, Mar. 2011.

[14] S. Wenger, H.264/AVC over IP, IEEE Trans. Circuits Syst. Video Technol., vol. 13, no. 7, pp. 645–656, Jul. 2003.

[15] A. Fujibayashi and T. K. Tan, Random Access Support for HEVC, Joint Collaborative Team on Video Coding (JCT-VC) of ITU-T SG16 WP3 and ISO/IEC JTC1/SC29/WG11, JCTVC-D234, 4th Meeting, Daegu, Korea, Jan. 2011.

[16] Y.-K. Wang, Y. Chen, M. Karczewicz, and J. L. Chen, On Bitstreams Starting with CRA Pictures, Joint Collaborative Team on Video Coding (JCT-VC) of ITU-T SG16 WP3 and ISO/IEC JTC1/SC29/WG11, JCTVC-H0496, 8th Meeting, San Jose, CA, Feb. 2012.

[17] R. Sjöberg and J. Samuelsson, Absolute Signaling of Reference Pictures, Joint Collaborative Team on Video Coding (JCT-VC) of ITU-T SG16 WP3 and ISO/IEC JTC1/SC29/WG11, JCTVC-F493, 6th Meeting, Turin, Italy, Jul. 2011.

[18] T. K. Tan and C. S. Boon, AHG21: Inter Reference Picture Set Prediction Syntax and Semantics, Joint Collaborative Team on Video Coding (JCT-VC) of ITU-T SG16 WP3 and ISO/IEC JTC1/SC29/WG11, JCTVC-G198, 7th Meeting, Geneva, Switzerland, Nov. 2011.

[19] J. Boyce, D. Hong, and A. Eleftheriadis, High Layer Syntax to Improve Support for Temporal Scalability, Joint Collaborative Team on Video Coding (JCT-VC) of ITU-T SG16 WP3 and ISO/IEC JTC1/SC29/WG11, JCTVC-D200, 4th Meeting, Daegu, Korea, Jan. 2011.

Rickard Sjöberg received the M.S. degree in computer science from the Royal Institute of Technology, Stockholm, Sweden, in 1997.

He has been with Ericsson, Sweden, since 1996, working on still image and video coding research, real-time video codec implementations of H.263, MPEG-4 Part 2, and H.264/AVC, as well as subjective video encoding optimizations. Since 1997, he has contributed more than 100 proposals to the video coding standards of ITU-T, MPEG, JVT, and JCT-VC. He is currently a Senior Specialist of video coding with Multimedia Technologies, Ericsson Research, Stockholm, working as the Technical Leader of the Ericsson 2-D video coding research.


Ying Chen (M'05–SM'11) received the B.S. degree in applied mathematics and the M.S. degree in electrical engineering and computer science, both from Peking University, Beijing, China, in 2001 and 2004, respectively, and the Ph.D. degree in computing and electrical engineering from the Tampere University of Technology (TUT), Tampere, Finland, in 2010.

He is currently a Senior Staff Engineer with Qualcomm, Inc., San Diego, CA. He joined Qualcomm in 2009. He was a Researcher with TUT and the Nokia Research Center, Finland, from 2006 to February 2009, and was a Research Engineer with Thomson Corporate Research, Beijing, China, from 2004 to 2006. He has been actively contributing to MPEG, JVT, and JCT-VC, on scalable video coding (SVC), multiview video coding, and 3-D video (3DV) coding extensions of H.264/AVC, as well as high-level syntax and the 3DV extension of HEVC. He has also been involved in standardization activities of MPEG systems, including the MPEG file format, MPEG-2 systems, and dynamic adaptive streaming over HTTP (DASH). He has co-authored more than 200 standard contribution documents to JVT, JCT-VC, and MPEG and around 30 academic papers in the fields of image processing, video coding, and video transmission.

Akira Fujibayashi received the B.S. and M.S. degrees in electronics, information, and communication engineering from Waseda University, Tokyo, Japan, in 2002 and 2004, respectively.

He has been with NTT DOCOMO, Inc., Tokyo, Japan, since 2004. He is an active participant in the ITU-T/ISO/IEC Joint Collaborative Team on Video Coding (JCT-VC) standardization activities and has published several contributions in JCT-VC. His current research interests include image and video coding, perceptual coding, and video enlargement technologies.

Miska M. Hannuksela (M'03) received the M.S. degree in engineering and the Doctor of Science degree in technology from the Tampere University of Technology, Tampere, Finland, in 1997 and 2010, respectively.

He has been with Nokia since 1996 in different positions, including as a Research Manager/Leader in the areas of video and image compression, end-to-end multimedia systems, sensor signal processing, and context extraction. He is currently a Distinguished Scientist in Multimedia Technologies with the Nokia Research Center, Tampere, Finland. He has published about 100 journal and conference papers and hundreds of standardization contributions in JCT-VC, JVT, MPEG, 3GPP, and DVB. He holds patents from more than 60 patent families. His current research interests include video compression, multimedia communication systems and formats, user experience and human perception of multimedia, and sensor signal processing.

Dr. Hannuksela received the Award for the Best Doctoral Thesis of the Tampere University of Technology in 2009, and the Scientific Achievement Award nominated by the Centre of Excellence of Signal Processing, Tampere University of Technology, in 2010. He has been an Associate Editor of the IEEE Transactions on Circuits and Systems for Video Technology since 2010.

Jonatan Samuelsson received the Master of Science degree in computer science from the Royal Institute of Technology, Stockholm, Sweden, in 2007.

He has been with Ericsson Research, Multimedia Technologies, Stockholm, Sweden, since 2007. His research has included different aspects of video compression, with specific emphasis on software implementations of real-time video codecs. His areas of focus have included rate controllers, adaptive QP algorithms, packetization and transport of video, video buffering, and error feedback. He is currently a Senior Researcher, is active in standardization, and has published a dozen contributions in JCT-VC.

Thiow Keng Tan (S'89–M'94–SM'03) received the Bachelor of Science and Bachelor of Electrical and Electronics Engineering degrees from Monash University, Victoria, Australia, in 1987 and 1989, respectively, and the Ph.D. degree in electrical engineering from Monash University in 1994.

He is currently a Consultant for NTT DOCOMO, Inc., Tokyo, Japan. He is an active participant in the video subgroup of the ISO/IEC JTC1/SC29/WG11 Moving Picture Experts Group (MPEG), the ITU-T SG16 Video Coding Experts Group (VCEG), the ITU-T/ISO/IEC Joint Video Team (JVT), and the ITU-T/ISO/IEC Joint Collaborative Team on Video Coding (JCT-VC) standardization activities. He is the inventor of at least 50 granted U.S. patents. His current research interests include image and video coding, analysis, and processing.

Dr. Tan has served on the editorial board of the IEEE Transactions on Image Processing. He was awarded the Douglas Lampard Electrical Engineering Medal for his Ph.D. thesis and the First Prize IEEE Region 10 Student Paper Award for his final-year undergraduate project. He was also awarded three ISO certificates for outstanding contributions to the development of the MPEG-4 standard.

Ye-Kui Wang received the B.S. degree in industrial automation from the Beijing Institute of Technology, Beijing, People's Republic of China, in 1995, and the Ph.D. degree in electrical engineering from the Graduate School in Beijing, University of Science and Technology of China, Beijing, People's Republic of China, in 2001.

He is currently a Senior Staff Engineer with Qualcomm, Inc., San Diego, CA. He was formerly a Multimedia Standards Manager with Huawei Technologies, Bridgewater, NJ, from December 2008 to August 2011. He was a Principal Research Staff Member with Nokia Corporation, Tampere, Finland, from February 2003 to December 2008, and a Senior Researcher with the Tampere International Center for Signal Processing, Tampere University of Technology, Tampere, Finland, from June 2001 to January 2003. He has co-authored over 300 technical standardization contributions, about 50 academic papers, and more than 100 granted or pending patents or patent applications in his areas of interest. His current research interests include video coding, multimedia transport, and systems.

Dr. Wang has been an active contributor to various multimedia standards, including video codecs, file formats, RTP payload formats, and streaming systems, developed in ITU-T VCEG, ISO/IEC MPEG, JVT, JCT-VC, 3GPP SA4, IETF, and AVS. He has been an editor for several standard specifications, including ITU-T H.271, the SVC file format, multiview video coding (MVC), IETF RFC 6184, and IETF RFC 6190.

Stephan Wenger (M'99) received the Diploma and Ph.D. degrees in computer science from Technische Universität (TU) Berlin, Berlin, Germany, in 1989 and 1995, respectively.

After working as an Assistant Professor with TU Berlin and in various consulting roles, he spent five years in Nokia's Research Center and IPR/legal groups. Since 2007, he has been the Chief Technology Officer of VidyoCast, a division of Vidyo, Inc., Hackensack, NJ. He has published dozens of journal and conference papers, Internet RFCs, and hundreds of standardization contributions to many standards setting organizations, as well as more than 30 granted or pending patents and patent applications. His current research interests include the interface between media coding and transport over a variety of networks.