Real-Time CPU Based H.265/HEVC Encoding Solution with Intel® Platform Technology

Yang Lu

Intel Corporation

Shanghai, PRC

2013.12


Contents

1. Abstract

2. Video Codec Introduction

3. H.265/HEVC Performance Issues

4. Real-time HEVC Encoder Solution Based on Intel® Xeon™ Platform

5. Summary

Reference

1. Abstract

The International Telecommunication Union (ITU) has announced a new video codec standard, High Efficiency Video Coding (HEVC)/H.265, which is claimed to be about 50 percent more efficient than the current H.264/MPEG-4 standard. However, the algorithms and data structures of H.265 are more than four times as complex as those of H.264, which means that an H.265-based codec requires more computing resources and power than its predecessor. In this paper we investigate the characteristics of the HEVC codec and focus on CPU-based software video transcoding, which provides the best video quality and the most flexible programming model. We show how to exploit the capabilities of Intel Architecture (IA) platforms to achieve real-time HEVC encoding performance.

2. Video Codec Introduction

Video coding standards have evolved primarily through the development of the

well-known ITU-T and ISO/IEC standards. The ITU-T produced H.261 and H.263,

ISO/IEC produced MPEG-1 and MPEG-4 Visual, and the two organizations jointly

produced the H.262/MPEG-2 Video and H.264/MPEG-4 Advanced Video Coding (AVC)

standards [1].

Figure 1. Video Standard/Codec Evolution

H.265/HEVC (High Efficiency Video Coding), introduced last year, is the latest video codec standard developed jointly by ISO/IEC and ITU-T. It aims to maximize compression capability while limiting quality loss. H.265/HEVC roughly doubles the compression ratio of the previous H.264/AVC standard at the same subjective quality. HEVC therefore helps online video providers deliver high-quality video with less bandwidth, making it the next revolution in video coding.

HEVC introduces several new coding structures, syntax features, and algorithms to achieve this coding efficiency [1][2]:

a) Random Access and Bitstream Splicing Features

The new design supports special features to enable random access and bitstream splicing.

In H.264/MPEG-4 AVC, a bitstream must always start with an IDR access unit, whereas HEVC supports random access.

b) Coding Tree Units Structure



A picture is partitioned into coding tree units (CTUs), each of which contains a luma coding tree block (CTB) and the corresponding chroma CTBs. A luma CTB covers L×L luma samples, where L may be 16, 32, or 64 as determined by a syntax element in the sequence parameter set (SPS). The CTU contains a quadtree syntax that allows the CTBs to be split into coding blocks (CBs) of a size appropriate to the signal characteristics of the region covered by the CTB. Previous video coding standards use a fixed block size of 16×16 luma samples, whereas HEVC supports variable-size CTBs that the encoder selects according to its memory and computational requirements.

c) Tree-Structured Partitioning Into Transform Blocks and Units

A CB can be recursively partitioned into transform blocks (TBs). The partitioning is

signaled by a residual quadtree. In contrast to previous standards, the HEVC design

allows a TB to span across multiple PBs for interpicture-predicted CUs to maximize the

potential coding efficiency benefits of the quadtree-structured TB partitioning.

d) Intrapicture Prediction

Directional prediction with 33 different directional orientations is defined for square transform block (TB) sizes from 4×4 up to 32×32. In addition to this Intra-Angular prediction, HEVC supports the Intra-Planar and Intra-DC prediction methods.
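The quadtree splitting of a CTB described in item b) can be sketched as a short recursive routine. This is only an illustration under stated assumptions, not the HM or Strongene implementation: the names `should_split`, `count_cbs`, and `MIN_CB_SIZE` are hypothetical, and the split decision is a simple flatness test, whereas a real encoder would compare rate-distortion costs.

```c
#include <assert.h>
#include <stdint.h>

#define MIN_CB_SIZE 8   /* smallest luma CB size used in this sketch */

/* Hypothetical split criterion: split when the sample range of the
 * block exceeds a threshold (a stand-in for a rate-distortion test). */
static int should_split(const uint8_t *luma, int stride,
                        int x0, int y0, int size, int threshold)
{
    int lo = 255, hi = 0;
    for (int y = y0; y < y0 + size; y++) {
        for (int x = x0; x < x0 + size; x++) {
            int v = luma[y * stride + x];
            if (v < lo) lo = v;
            if (v > hi) hi = v;
        }
    }
    return (hi - lo) > threshold;
}

/* Recursively partition one CTB of the given size into coding blocks
 * and return the number of leaf CBs in the resulting quadtree. */
static int count_cbs(const uint8_t *luma, int stride,
                     int x0, int y0, int size, int threshold)
{
    /* size must be a power of two and not smaller than the minimum */
    assert(size >= MIN_CB_SIZE && (size & (size - 1)) == 0);

    if (size == MIN_CB_SIZE ||
        !should_split(luma, stride, x0, y0, size, threshold))
        return 1;                          /* leaf: keep this CB whole */

    int half = size / 2;                   /* split into four quadrants */
    return count_cbs(luma, stride, x0,        y0,        half, threshold)
         + count_cbs(luma, stride, x0 + half, y0,        half, threshold)
         + count_cbs(luma, stride, x0,        y0 + half, half, threshold)
         + count_cbs(luma, stride, x0 + half, y0 + half, half, threshold);
}
```

A flat 64×64 CTB stays a single CB, while detail confined to one corner leads to small CBs only around that corner, which is the adaptivity that a fixed 16×16 block size cannot provide.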

This advanced coding standard demands extremely high processing capability from both client devices and back-end transcoding servers.

3. H.265/HEVC Performance Issues

The current HEVC HM project implements only the major functionality of the standard, and its performance is still far from what production deployment requires:

- no parallel scheme

- poor vectorization tuning

Figure 2. HM project profiling – thread concurrency



Figure 3. HM project profiling – hot code

This HEVC encoder consumes more than 100 times the CPU resources of H.264 on the server side, and more than 10 times the CPU power on the client side.

The H.265/HEVC codec has attracted many groups and organizations worldwide to optimize its performance and push it toward real deployment. Several open-source projects exist:

a) OpenHEVC (currently HM10.0 compatible, with some decoder optimization)

https://github.com/OpenHEVC/openHEVC

b) x265 (compatible with HM, with parallel and SIMD optimization)

http://code.google.com/p/x265/

https://bitbucket.org/multicoreware/x265/wiki/Home

We used a 720p, 24 fps video to evaluate x265 encoder performance on an Intel(R) Xeon(R) Sandy Bridge platform (E5-2680 @ 2.70GHz, 8*2 physical cores). This codec does a lot of work to optimize the original code through both task and data parallelism. In our benchmark, however, it uses the capacity of only 6 cores on a system with 32 logical cores (SMT ON), so it cannot fully exploit the computing resources of a current multi-core platform.

Figure 4. CPU usage of the x265 project

Figure 5. SIMD tuning of the x265 project

In the x265 project, SIMD instructions are used for vectorization, which contributes a speedup of more than 70%. With further icc compiler optimization, we obtained a total speedup of 2x on the IA platform. However, the encoder is still far from real-time deployment, especially for HD 1080p video.

In the PRC, more than 20 multimedia ISVs are looking for a usable HEVC solution and platform that reduces the cost of online video services while maintaining quality.

Figure 6: Online video market in the PRC

4. Real-time HEVC Encoder Solution Based on Intel® Xeon™ Platform

Video encoding is a typical CPU- and memory-intensive workload that places high demands on the server platform in terms of per-core computing efficiency, reliability, and stability. The computational complexity of the H.265/HEVC codec is more than four times that of the previous H.264/MPEG-4 codec, which places unprecedented processing requirements on the back-end server platform. In this section, we introduce the major IA technologies that help the Strongene [3] HEVC codec reach real-time 1080p encoding.

4.1 SIMD Vectorization Tuning for HEVC Encoding Functions

Most of the time-consuming functions in video and image processing are block-based, data-intensive computations, which can be optimized with the IA SIMD (single instruction, multiple data) vector instructions. A SIMD instruction processes multiple data elements at once, which greatly improves data throughput and execution efficiency. SIMD is widely supported on x86 processors and has evolved across platform generations from MMX through SSE and AVX to AVX2.
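As a minimal, self-contained sketch of this idea (not taken from the Strongene code), the routine below computes a sum of absolute differences (SAD) with the SSE2 intrinsic `_mm_sad_epu8`, which handles 16 pixels per instruction, next to a scalar reference. The function names `sad_scalar` and `sad_simd` are hypothetical, and the SIMD path is compiled only when the compiler defines `__SSE2__`; on other targets it falls back to the scalar loop.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#if defined(__SSE2__)
#include <emmintrin.h>
#endif

/* Scalar reference: SAD over a w x h block, one pixel at a time. */
static int sad_scalar(const uint8_t *a, int stride_a,
                      const uint8_t *b, int stride_b, int w, int h)
{
    int sum = 0;
    for (int y = 0; y < h; y++) {
        for (int x = 0; x < w; x++)
            sum += abs(a[x] - b[x]);
        a += stride_a;
        b += stride_b;
    }
    return sum;
}

/* SIMD version: 16 pixels per _mm_sad_epu8; w must be a multiple of 16. */
static int sad_simd(const uint8_t *a, int stride_a,
                    const uint8_t *b, int stride_b, int w, int h)
{
    assert(w % 16 == 0);
#if defined(__SSE2__)
    __m128i acc = _mm_setzero_si128();
    for (int y = 0; y < h; y++) {
        for (int x = 0; x < w; x += 16) {
            __m128i va = _mm_loadu_si128((const __m128i *)(a + x));
            __m128i vb = _mm_loadu_si128((const __m128i *)(b + x));
            /* two 64-bit partial sums, one per group of 8 pixels */
            acc = _mm_add_epi64(acc, _mm_sad_epu8(va, vb));
        }
        a += stride_a;
        b += stride_b;
    }
    /* fold the upper 64-bit partial sum onto the lower one */
    acc = _mm_add_epi64(acc, _mm_srli_si128(acc, 8));
    return _mm_cvtsi128_si32(acc);
#else
    return sad_scalar(a, stride_a, b, stride_b, w, h);
#endif
}
```

Both functions return the same value for any block whose width is a multiple of 16; only the number of instructions executed per row differs.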

We take a common 64*64 block computation from video/image processing as an example to demonstrate how SSE and AVX2 intrinsics can be used to optimize the original code:

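Code example for 64*64 block computing

#include <stdint.h>
#include <stdlib.h>
#include <stdio.h>
#include <string.h>
#include "smmintrin.h"
#include "immintrin.h"

/* Not shown in the original listing; assumed here: 8-bit pixels and
   a 64-byte row stride for the 32-byte-aligned source (fenc) buffer. */
typedef uint8_t pixel;
#define FENC_STRIDE 64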

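/********* original block computing: serial scalar computing ************/
/* Only every second row is visited (y += 2); the sum is doubled at the
   end to compensate. */
#define PIXEL_SAD_C( func_type, name, lx, ly ) \
func_type int name( pixel *pix1, int i_stride_pix1, \
                    pixel *pix2, int i_stride_pix2 ) \
{ \
    int sum = 0; \
    int x, y; \
    for( y = 0; y < ly; y += 2 ) \
    { \
        for( x = 0; x < lx; x++ ) \
        { \
            sum += abs( pix1[x] - pix2[x] ); \
        } \
        pix1 += i_stride_pix1 << 1; \
        pix2 += i_stride_pix2 << 1; \
    } \
    return sum << 1; \
}

PIXEL_SAD_C( static, LENT_sad_64x64_c, 64, 64 )
PIXEL_SAD_C( static, LENT_sad_32x32_c, 32, 32 )

/* SAD of one source block against four candidate reference blocks. */
#define SAD4( w, h ) \
static void LENT_sad4_##w##x##h##_c( pixel *fenc, pixel *p0, pixel *p1, pixel *p2, \
                                     pixel *p3, int i_stride, int cost[4] ) \
{ \
    cost[0] = LENT_sad_##w##x##h##_c( fenc, FENC_STRIDE, p0, i_stride ); \
    cost[1] = LENT_sad_##w##x##h##_c( fenc, FENC_STRIDE, p1, i_stride ); \
    cost[2] = LENT_sad_##w##x##h##_c( fenc, FENC_STRIDE, p2, i_stride ); \
    cost[3] = LENT_sad_##w##x##h##_c( fenc, FENC_STRIDE, p3, i_stride ); \
}

SAD4( 64, 64 )
SAD4( 32, 32 )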

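/************** SSE instruction implementation ************************/
/* fenc must be 16-byte aligned (_mm_load_si128); p0..p3 may be unaligned. */
static inline void sad4_32_fast_sse( pixel *fenc, pixel *p0, pixel *p1, pixel *p2,
                                     pixel *p3, int i_stride, int cost[4], int ly )
{
    __m128i sum = _mm_setzero_si128();
    int i;
    i_stride <<= 1;                 /* visit every second row */
    for( i = 0; i < ly; i += 2 )
    {
        /* pixels 0..15 of the row */
        __m128i se = _mm_load_si128( (__m128i *)(fenc) );
        __m128i s0 = _mm_loadu_si128( (__m128i *)(p0) );
        __m128i s1 = _mm_loadu_si128( (__m128i *)(p1) );
        __m128i s2 = _mm_loadu_si128( (__m128i *)(p2) );
        __m128i s3 = _mm_loadu_si128( (__m128i *)(p3) );
        s0 = _mm_sad_epu8( se, s0 );
        s1 = _mm_sad_epu8( se, s1 );
        s2 = _mm_sad_epu8( se, s2 );
        s3 = _mm_sad_epu8( se, s3 );
        s0 = _mm_hadd_epi32( s0, s1 );
        s1 = _mm_hadd_epi32( s2, s3 );
        sum = _mm_add_epi32( sum, _mm_hadd_epi32( s0, s1 ) );
        /* pixels 16..31 of the row */
        se = _mm_load_si128( (__m128i *)(fenc + 16) );
        s0 = _mm_loadu_si128( (__m128i *)(p0 + 16) );
        s1 = _mm_loadu_si128( (__m128i *)(p1 + 16) );
        s2 = _mm_loadu_si128( (__m128i *)(p2 + 16) );
        s3 = _mm_loadu_si128( (__m128i *)(p3 + 16) );
        s0 = _mm_sad_epu8( se, s0 );
        s1 = _mm_sad_epu8( se, s1 );
        s2 = _mm_sad_epu8( se, s2 );
        s3 = _mm_sad_epu8( se, s3 );
        s0 = _mm_hadd_epi32( s0, s1 );
        s1 = _mm_hadd_epi32( s2, s3 );
        sum = _mm_add_epi32( sum, _mm_hadd_epi32( s0, s1 ) );
        fenc += (2*FENC_STRIDE);
        p0 += i_stride;
        p1 += i_stride;
        p2 += i_stride;
        p3 += i_stride;
    }
    /* double the sums to compensate for the skipped rows */
    _mm_storeu_si128( (__m128i *)cost, _mm_slli_epi32( sum, 1) );
}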

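static inline void sad4_64_fast_sse( pixel *fenc, pixel *p0, pixel *p1, pixel *p2,
                                     pixel *p3, int i_stride, int cost[4], int ly )
{
    __m128i sum = _mm_setzero_si128();
    int i;
    i_stride <<= 1;                 /* visit every second row */
    for( i = 0; i < ly; i += 2 )
    {
        /* pixels 0..15 of the row */
        __m128i se = _mm_load_si128( (__m128i *)(fenc) );
        __m128i s0 = _mm_loadu_si128( (__m128i *)(p0) );
        __m128i s1 = _mm_loadu_si128( (__m128i *)(p1) );
        __m128i s2 = _mm_loadu_si128( (__m128i *)(p2) );
        __m128i s3 = _mm_loadu_si128( (__m128i *)(p3) );
        s0 = _mm_sad_epu8( se, s0 );
        s1 = _mm_sad_epu8( se, s1 );
        s2 = _mm_sad_epu8( se, s2 );
        s3 = _mm_sad_epu8( se, s3 );
        s0 = _mm_hadd_epi32( s0, s1 );
        s1 = _mm_hadd_epi32( s2, s3 );
        sum = _mm_add_epi32( sum, _mm_hadd_epi32( s0, s1 ) );
        /* pixels 16..31 */
        se = _mm_load_si128( (__m128i *)(fenc + 16) );
        s0 = _mm_loadu_si128( (__m128i *)(p0 + 16) );
        s1 = _mm_loadu_si128( (__m128i *)(p1 + 16) );
        s2 = _mm_loadu_si128( (__m128i *)(p2 + 16) );
        s3 = _mm_loadu_si128( (__m128i *)(p3 + 16) );
        s0 = _mm_sad_epu8( se, s0 );
        s1 = _mm_sad_epu8( se, s1 );
        s2 = _mm_sad_epu8( se, s2 );
        s3 = _mm_sad_epu8( se, s3 );
        s0 = _mm_hadd_epi32( s0, s1 );
        s1 = _mm_hadd_epi32( s2, s3 );
        sum = _mm_add_epi32( sum, _mm_hadd_epi32( s0, s1 ) );
        /* pixels 32..47 */
        se = _mm_load_si128( (__m128i *)(fenc + 32) );
        s0 = _mm_loadu_si128( (__m128i *)(p0 + 32) );
        s1 = _mm_loadu_si128( (__m128i *)(p1 + 32) );
        s2 = _mm_loadu_si128( (__m128i *)(p2 + 32) );
        s3 = _mm_loadu_si128( (__m128i *)(p3 + 32) );
        s0 = _mm_sad_epu8( se, s0 );
        s1 = _mm_sad_epu8( se, s1 );
        s2 = _mm_sad_epu8( se, s2 );
        s3 = _mm_sad_epu8( se, s3 );
        s0 = _mm_hadd_epi32( s0, s1 );
        s1 = _mm_hadd_epi32( s2, s3 );
        sum = _mm_add_epi32( sum, _mm_hadd_epi32( s0, s1 ) );
        /* pixels 48..63 */
        se = _mm_load_si128( (__m128i *)(fenc + 48) );
        s0 = _mm_loadu_si128( (__m128i *)(p0 + 48) );
        s1 = _mm_loadu_si128( (__m128i *)(p1 + 48) );
        s2 = _mm_loadu_si128( (__m128i *)(p2 + 48) );
        s3 = _mm_loadu_si128( (__m128i *)(p3 + 48) );
        s0 = _mm_sad_epu8( se, s0 );
        s1 = _mm_sad_epu8( se, s1 );
        s2 = _mm_sad_epu8( se, s2 );
        s3 = _mm_sad_epu8( se, s3 );
        s0 = _mm_hadd_epi32( s0, s1 );
        s1 = _mm_hadd_epi32( s2, s3 );
        sum = _mm_add_epi32( sum, _mm_hadd_epi32( s0, s1 ) );
        fenc += (2*FENC_STRIDE);
        p0 += i_stride;
        p1 += i_stride;
        p2 += i_stride;
        p3 += i_stride;
    }
    /* double the sums to compensate for the skipped rows */
    _mm_storeu_si128( (__m128i *)cost, _mm_slli_epi32( sum, 1) );
}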

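/************** AVX2 instruction implementation ************************/
/* fenc must be 32-byte aligned (_mm256_load_si256); p0..p3 may be unaligned. */
static inline void sad4_32_fast_avx2( pixel *fenc, pixel *p0, pixel *p1, pixel *p2,
                                      pixel *p3, int i_stride, int cost[4], int ly )
{
    __m256i sum = _mm256_setzero_si256();
    __m128i total;
    int i;
    i_stride <<= 1;                 /* visit every second row */
    for( i = 0; i < ly; i += 2 )
    {
        /* one 256-bit load covers all 32 pixels of the row */
        __m256i se = _mm256_load_si256( (__m256i *)(fenc) );
        __m256i s0 = _mm256_loadu_si256( (__m256i *)(p0) );
        __m256i s1 = _mm256_loadu_si256( (__m256i *)(p1) );
        __m256i s2 = _mm256_loadu_si256( (__m256i *)(p2) );
        __m256i s3 = _mm256_loadu_si256( (__m256i *)(p3) );
        s0 = _mm256_sad_epu8( se, s0 );
        s1 = _mm256_sad_epu8( se, s1 );
        s2 = _mm256_sad_epu8( se, s2 );
        s3 = _mm256_sad_epu8( se, s3 );
        s0 = _mm256_hadd_epi32( s0, s1 );
        s1 = _mm256_hadd_epi32( s2, s3 );
        sum = _mm256_add_epi32( sum, _mm256_hadd_epi32( s0, s1 ) );
        fenc += (2*FENC_STRIDE);
        p0 += i_stride;
        p1 += i_stride;
        p2 += i_stride;
        p3 += i_stride;
    }
    /* _mm256_hadd_epi32 works per 128-bit lane, so each lane of sum holds
       half of every result: add the two lanes and store exactly four ints
       (a 256-bit store would write past the end of cost[4]). */
    total = _mm_add_epi32( _mm256_castsi256_si128( sum ),
                           _mm256_extracti128_si256( sum, 1 ) );
    _mm_storeu_si128( (__m128i *)cost, _mm_slli_epi32( total, 1 ) );
}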

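static inline void sad4_64_fast_avx2( pixel *fenc, pixel *p0, pixel *p1, pixel *p2,
                                      pixel *p3, int i_stride, int cost[4], int ly )
{
    __m256i sum = _mm256_setzero_si256();
    __m128i total;
    int i;
    i_stride <<= 1;                 /* visit every second row */
    for( i = 0; i < ly; i += 2 )
    {
        /* pixels 0..31 of the row */
        __m256i se = _mm256_load_si256( (__m256i *)(fenc) );
        __m256i s0 = _mm256_loadu_si256( (__m256i *)(p0) );
        __m256i s1 = _mm256_loadu_si256( (__m256i *)(p1) );
        __m256i s2 = _mm256_loadu_si256( (__m256i *)(p2) );
        __m256i s3 = _mm256_loadu_si256( (__m256i *)(p3) );
        s0 = _mm256_sad_epu8( se, s0 );
        s1 = _mm256_sad_epu8( se, s1 );
        s2 = _mm256_sad_epu8( se, s2 );
        s3 = _mm256_sad_epu8( se, s3 );
        s0 = _mm256_hadd_epi32( s0, s1 );
        s1 = _mm256_hadd_epi32( s2, s3 );
        sum = _mm256_add_epi32( sum, _mm256_hadd_epi32( s0, s1 ) );
        /* pixels 32..63 of the row */
        se = _mm256_load_si256( (__m256i *)(fenc + 32) );
        s0 = _mm256_loadu_si256( (__m256i *)(p0 + 32) );
        s1 = _mm256_loadu_si256( (__m256i *)(p1 + 32) );
        s2 = _mm256_loadu_si256( (__m256i *)(p2 + 32) );
        s3 = _mm256_loadu_si256( (__m256i *)(p3 + 32) );
        s0 = _mm256_sad_epu8( se, s0 );
        s1 = _mm256_sad_epu8( se, s1 );
        s2 = _mm256_sad_epu8( se, s2 );
        s3 = _mm256_sad_epu8( se, s3 );
        s0 = _mm256_hadd_epi32( s0, s1 );
        s1 = _mm256_hadd_epi32( s2, s3 );
        sum = _mm256_add_epi32( sum, _mm256_hadd_epi32( s0, s1 ) );
        fenc += (2*FENC_STRIDE);
        p0 += i_stride;
        p1 += i_stride;
        p2 += i_stride;
        p3 += i_stride;
    }
    /* _mm256_hadd_epi32 works per 128-bit lane, so each lane of sum holds
       half of every result: add the two lanes and store exactly four ints
       (a 256-bit store would write past the end of cost[4]). */
    total = _mm_add_epi32( _mm256_castsi256_si128( sum ),
                           _mm256_extracti128_si256( sum, 1 ) );
    _mm_storeu_si128( (__m128i *)cost, _mm_slli_epi32( total, 1 ) );
}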

Result (CPU cycles):

          original   SSE      AVX2
run 1     98877      977      679
run 2     98463      1092     690
run 3     98152      978      679
run 4     98003      943      679
run 5     98118      954      678
avg.      98322.6    988.8    681
speedup   1.00       99.44    144.38

Table 1. SSE and AVX2 implementation result

As Table 1 shows, for this function the SSE and AVX2 implementations are roughly 100 times faster than the scalar code (99.44x and 144.38x respectively), and the AVX2 code is more than 40% faster than the SSE code.
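The speedup figures follow directly from the average cycle counts in Table 1. Restating the arithmetic, with $S$ denoting the speedup over the scalar code (the symbol is introduced here only for illustration):

```latex
\[
S_{\mathrm{SSE}} = \frac{98322.6}{988.8} \approx 99.44, \qquad
S_{\mathrm{AVX2}} = \frac{98322.6}{681} \approx 144.38, \qquad
\frac{S_{\mathrm{AVX2}}}{S_{\mathrm{SSE}}} = \frac{988.8}{681} \approx 1.45
\]
```

The last ratio is the basis for the statement that AVX2 improves on SSE by more than 40% for this function.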

Profiling of the Strongene encoder shows that all of its major hot functions can be vectorized with SIMD instructions, including low-complexity motion compensation interpolation, the transpose-free integer transform, the butterfly Hadamard transform, and the least-memory-redundancy SAD/SSD calculation. Following the SIMD programming model and patterns above, Strongene rewrote the hot functions of the encoder to maximize performance. Figure 7 shows our profiling data for a standard 1080p HEVC encoding scenario: 60% of the hot functions run as SSE SIMD code, and the port to AVX2 has started.


Figure 7. Profiling results of Strongene encoding functions

With 256-bit integer operations, AVX2 can theoretically double the performance of the previous 128-bit SSE code. AVX2 will be supported on the Xeon Haswell platform, which launches in 2014, so we expect a further substantial performance improvement when the SSE code is upgraded to AVX2 on that platform.

4.2 Thread Concurrency and Cores Scalability Tuning

As we saw in Section 3, most current implementations cannot use all the cores of a multi-core platform. Based on the latest IA Xeon multi-core architectures, and after clarifying the parallelism dependencies between the CTB-based HEVC algorithms, Strongene proposes the inter-frame wave-front (IFW) parallel framework to replace the original OWF (overlapped wave-front) and WPP (wave-front parallel processing) methods. A three-level thread management scheme then ensures that IFW can use all the CPU cores to accelerate HEVC encoding. With this parallel framework, on an Intel(R) Xeon(R) Ivy Bridge platform (E5-2697 v2 @ 2.70GHz, 12*2 physical cores, SMT OFF), the Strongene codec uses the computing resources of 18-24 physical cores, which is good thread concurrency.
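To illustrate why a wave-front confined to a single frame may not keep this many cores busy, the sketch below models plain WPP, not Strongene's IFW scheme, whose details the paper does not give. It assumes that every CTU takes one time step and that CTU (row, col) can start only after its left neighbour and the top-right neighbour in the row above have finished; the function names are hypothetical.

```c
#include <assert.h>

/* Earliest time step at which CTU (row, col) can start under the
 * simplified WPP model: each row lags the row above by two CTUs. */
static int wpp_start_step(int row, int col)
{
    assert(row >= 0 && col >= 0);
    return col + 2 * row;
}

/* Number of CTUs needed to cover `pixels` samples along one dimension. */
static int ctus_along(int pixels, int ctu_size)
{
    assert(pixels > 0 && ctu_size > 0);
    return (pixels + ctu_size - 1) / ctu_size;
}

/* Peak number of CTUs that are active in the same time step within
 * one frame, found by scanning every step of the wave-front. */
static int wpp_peak_parallelism(int width, int height, int ctu_size)
{
    int cols = ctus_along(width, ctu_size);
    int rows = ctus_along(height, ctu_size);
    int last = wpp_start_step(rows - 1, cols - 1);
    int peak = 0;

    for (int t = 0; t <= last; t++) {
        int active = 0;
        for (int r = 0; r < rows; r++) {
            int c = t - 2 * r;          /* column processed in row r at step t */
            if (c >= 0 && c < cols)
                active++;
        }
        if (active > peak)
            peak = active;
    }
    return peak;
}
```

Under these simplifying assumptions, a 1920x1080 frame with 64x64 CTUs (30 columns, 17 rows) never has more than 15 CTUs in flight at once, fewer than the 24 physical cores of the test platform. Extending the wave-front across frames is one way to expose more independent work.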

Figure 8. Thread Concurrency and CPU Utilization in Strongene Encoding Codec

With the new parallel framework at the task level and the fully implemented SIMD instructions at the data level, the Strongene encoder achieves a 70.3x speedup on x86 processors for 1080p video sequences.

4.3 Further Tuning with SMT/HT

Simultaneous multithreading (SMT), also called Hyper-Threading (HT) technology, is widely supported on IA platforms. It makes the operating system address two virtual (logical) cores for each physical core and shares the resources between them when possible. The main function of Hyper-Threading is to decrease the number of dependent instructions in the pipeline. It improves performance when the CPU cores are heavily loaded, but not in every application: when some cores stay idle, SMT only introduces task/thread switching overhead. We therefore turn SMT off on the Strongene encoding platform and reach real-time HEVC 1080p encoding on the IA Xeon Ivy Bridge E5-2697 v2 platform, as shown by the 1080p, 25 fps, SMT OFF row of the following table.

Platform                          Resolution  Bitrate (kbps)  fps    CPU usage  Encoding mode  SMT
WSM E7-8837 @2.67GHz (8*8c)       720p        800             8.2    15c        ultrafast      OFF
                                              1600            2.6    18c        ultraslow      OFF
                                  1080p       1500            3.6    27c        ultrafast      OFF
                                  4k          5000            1.2    19c        ultrafast      OFF
IVY E5-2697 v2 @2.70GHz (2*12c)   720p        1000            11     40% 14c    ultraslow      ON
                                  720p        1000            46     60% 16c    ultrafast      ON
                                  1080p       1500            21     70% 16c    ultrafast      ON
                                  1080p       1500            25     80% 18c    ultrafast      OFF
IVB E7-4890 @2.80GHz (4*15c)      1080p       2000            22     19c        ultrafast      ON
                                  1080p       8000            6.11   15c        ultraslow      ON
                                  4k          8000            7.02   29c        ultrafast      ON
                                  4k          8000            3.28   23c        ultraslow      ON

Table 2. Strongene HEVC encoding performance on Xeon Platform¹

Having achieved these performance improvements, we further evaluated the Strongene HEVC encoder on the IA Xeon platform, focusing on bandwidth and quality.

File: BQTerrance_1920x1080_60.yuv

Resolution: 1920x1080

Size: 1869 Mbyte, 622080 kbps

Platform: E5-2697 v2 @2.70GHz, RAM 64GB DDR3-1867, QPI 8.0 GT/s

OS/SW: Red Hat 6.4, kernel 2.6.32, gcc v4.4.7, ffmpeg v2.0.1, Lentoid HEVC Encoder r2096 linux

Codec   Size (bytes)   Bitrate (kbps)   PSNR Y/U/V (dB)
H.264   12254696       4078.1           32.311/39.369/42.043
H.265   6215615        2064.28          34.016/39.822/42.141

¹ Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Performance tests, such as SYSmark* and MobileMark*, are measured using specific computer systems, components, software, operations and functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products. Configurations: [describe config + what test used + who did testing]. For more information go to http://www.intel.com/performance


Table 3. H.264 and H.265 codec performance comparison

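Table 3 and Figure 9 show that the H.265/HEVC codec saves about 50% of the bandwidth (2064 kbps versus 4078 kbps) while maintaining the video quality: its PSNR is no lower than that of H.264 on any of the three components.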
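The saving can be checked from the figures in Table 3 with a few lines of arithmetic. The helper name `saving_percent` below is hypothetical and only illustrates the calculation.

```c
#include <assert.h>

/* Relative saving of `test` compared with `reference`, in percent. */
static double saving_percent(double reference, double test)
{
    assert(reference > 0.0);
    return 100.0 * (reference - test) / reference;
}
```

Calling `saving_percent(4078.1, 2064.28)` for the bitrates and `saving_percent(12254696.0, 6215615.0)` for the file sizes both give a value slightly above 49%, consistent with the rounded 50% quoted in the text.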

Figure 9. Bandwidth and PSNR comparison of the H.264 and H.265 codecs

5. Summary

H.265/HEVC is expected to be the most popular video standard of the coming decade, and media applications and products are currently adding HEVC support. In this paper, we presented a CPU-based real-time HEVC encoding solution on the Intel(R) Xeon(R) platform using new IA platform technologies. This IA-based solution has been deployed in the Xunlei [4] online video service and products, and should accelerate the adoption of H.265/HEVC in production.

Reference

[1] Overview of the High Efficiency Video Coding (HEVC) Standard, IEEE Transactions on Circuits and Systems for Video Technology, vol. 22, no. 12, December 2012.

[2] High Efficiency Video Coding (HEVC) text specification draft 10, JCTVC-L1003_v34.

[3] http://www.strongene.com/en/homepage.jsp

[4] http://www.xunlei.com/
