Hierarchical Parallelization of an H.264/AVC Video Encoder

A. Rodriguez, A. Gonzalez, and M.P. Malumbres

IEEE PARELEC 2006

Outline

• Introduction
• Performance Analysis
• Hierarchical H.264 Parallel Encoder
• Experimental Results
• Conclusions

Introduction: Background Knowledge (1/5)

Video Communication

Introduction: Background Knowledge (2/5)

H.264/AVC
• Removes sensitive redundant information
• Reaching the limits of compression efficiency requires intensive computation
• Used for video on demand, video conferencing, live broadcasting, etc.

Introduction: Background Knowledge (3/5)

H.264/AVC encoder
• High CPU demand
• Low latency
• Real-time response

Platforms with supercomputing capabilities
• Clusters
• Multiprocessors
• Special-purpose devices

Introduction: Background Knowledge (4/5)

Cluster
• A group of linked computers
• Improves performance and/or availability over that provided by a single computer

Categorizations
• High-availability clusters
• Load-balancing clusters
• High-performance clusters

Introduction: Background Knowledge (5/5)

Message Passing Parallelism
• Message passing runtimes and libraries: MPI

Multithread Parallelism
• OpenMP

Optimized libraries
• SIMD extensions and graphics processing units
• Intel IPP, AMD ACML, etc.

Introduction: Main Purpose (1/6)

Apply parallel processing to H.264 encoders in order to reduce computation intensity.

Given video quality and bit rate
• Image resolution
• Frame rate
• Latency

Introduction: Main Purpose (2/6)

Hierarchical parallelization of H.264 encoder

Two-level MPI message passing parallelization
• GOP level
• Slice level

Introduction: Main Purpose (3/6)

GOP level parallelism
• Good speed-up
• High latency

[Figure: video sequence divided into GOPs]

Introduction: Main Purpose (4/6)

Example of latency
• 1 GOP = 10 frames
• Frame rate = 30 frames/sec
• Time for encoding 1 GOP = 3 seconds
• We have to encode 9 GOPs in parallel in order to achieve real-time response
• Latency = 3 seconds
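The figure of 9 GOPs follows from the numbers on this slide. A short worked derivation, using Little's law (N = X * R) as it is defined later in the Performance Analysis slides, with X as the real-time GOP rate and R as the encoding time of one GOP:

```latex
X = \frac{30\ \text{frames/sec}}{10\ \text{frames/GOP}} = 3\ \text{GOPs/sec},
\qquad
N = X \cdot R = 3\ \text{GOPs/sec} \times 3\ \text{sec} = 9\ \text{GOPs}
```

With pure GOP-level parallelism each GOP is still encoded sequentially, so the latency stays at R = 3 seconds regardless of how many GOPs are in flight.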

Introduction: Main Purpose (5/6)

Slice level parallelism
• Low latency
• Less coding efficiency

Introduction: Main Purpose (6/6)

Combination of both approaches
• Speed-up
• Efficiency

Performance Analysis: Overview (1/2)

• “Performance evaluation of parallel MPEG-4 video coding algorithms on clusters of workstations” [2]
• “A Parallel implementation of H.26L video encoder” [1]

Combination
• Scalability and low latency

Performance Analysis: Overview (2/2)

Processing flow

[Figure: video sequence split into GOPs, annotated "Increase throughput" and "Reduce latency"]

Performance Analysis: Equation definition

Little’s law: N = X * R

• N : Number of GOPs processed in parallel.
• X : Number of GOPs encoded per second.
• R : Elapsed time between a GOP entering the system and the same GOP being completely encoded.

Performance Analysis: Analysis (1/2)

If we have np nodes in the cluster and every GOP is decomposed into ns slices:

N = np / ns

R = RSEQ / (ns * Es)

• RSEQ : Sequential encoding time of a GOP
• Es : Parallel efficiency of slice-level parallelism

Performance Analysis: Analysis (2/2)

GOP throughput of the combined parallel encoder:

$$X = \frac{N}{R} = \frac{n_p / n_s}{R_{SEQ} / (n_s E_s)} = \frac{n_p}{R_{SEQ}} E_s$$

If Es is significantly less than 1, throughput would be affected negatively.
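The relations N = np / ns, R = RSEQ / (ns * Es) and X = N / R can be checked numerically. The following is a minimal sketch; the function name and the sample values (16 nodes, Es = 0.9, RSEQ = 5 s) are illustrative assumptions, not figures from the slides.

```python
def combined_model(n_p, n_s, e_s, r_seq):
    """Return (N, R, X) for n_p nodes, n_s slices per GOP,
    slice-level efficiency e_s and sequential GOP time r_seq (seconds)."""
    n = n_p / n_s              # N = np / ns: GOPs encoded in parallel
    r = r_seq / (n_s * e_s)    # R = RSEQ / (ns * Es): latency of one GOP
    x = n / r                  # Little's law rearranged: X = N / R
    return n, r, x


# Illustrative values only (assumed, not from the slides' experiments).
for n_s in (1, 2, 4, 8):
    n, r, x = combined_model(n_p=16, n_s=n_s, e_s=0.9, r_seq=5.0)
    print(f"ns={n_s}: N={n:.1f} GOPs, R={r:.3f} s, X={x:.2f} GOPs/s")
```

For a fixed Es, raising ns shortens R while X stays at np * Es / RSEQ. In practice Es itself tends to fall as ns grows, which is why the slides go on to estimate it.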

Performance Analysis: Example (1/4)

• Video sequence in HDTV format at 1280*720
• Frame rate = 60 frames/sec
• We suppose that the sequential H.264 encoder encodes one GOP (15 frames) in 5 seconds
• Only one slice per frame is defined

$$X = \frac{n_p}{R_{SEQ}}$$

Performance Analysis: Example (2/4)

To get real-time response, X has to be equal to 60 frames/sec, or 4 GOPs/sec

np = 4 * 5 = 20 nodes

Performance Analysis: Example (3/4)

Combined with slice level parallelism
• Maximum allowed latency = 1 sec
• Slice parallelism efficiency = 0.8

$$n_p = \frac{4 \times 5}{0.8} = 25 \text{ nodes}$$

$$n_s = \frac{R_{SEQ}}{R \cdot E_s} = \frac{5}{1 \times 0.8} = 6.25 \text{ slices}$$

Performance Analysis: Example (4/4)

We set ns to 7 and N to 4, and the number of required nodes is adjusted to 28

Throughput:

$$X = \frac{n_p E_s}{R_{SEQ}} = \frac{7 \times 4 \times 0.8}{5} = 4.48 \text{ GOPs/sec}$$

Latency:

$$R = \frac{R_{SEQ}}{n_s E_s} = \frac{5}{7 \times 0.8} = 0.89 \text{ sec}$$
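The sizing steps of this example can be sketched in a few lines. The helper below is hypothetical (its name and the rounding-up policy are assumptions): it picks the smallest whole ns that meets the latency bound, then the smallest whole N that meets the target throughput.

```python
import math


def size_cluster(target_x, max_latency, r_seq, e_s):
    """Size a two-level encoder from a target throughput (GOPs/sec),
    a latency bound (sec), RSEQ (sec) and slice-level efficiency Es."""
    # Smallest whole number of slices with R = RSEQ / (ns * Es) <= max_latency
    n_s = math.ceil(r_seq / (max_latency * e_s))
    latency = r_seq / (n_s * e_s)
    # Little's law: GOPs that must be in flight to sustain target_x
    n = math.ceil(target_x * latency)
    return {
        "n_s": n_s,
        "N": n,
        "nodes": n * n_s,
        "latency": latency,
        "throughput": n / latency,
    }


plan = size_cluster(target_x=4.0, max_latency=1.0, r_seq=5.0, e_s=0.8)
print(plan)
```

With the slide's inputs (4 GOPs/sec, 1 sec, 5 sec, 0.8) this yields ns = 7, N = 4 and 28 nodes, matching the adjusted configuration above.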

Performance Analysis: Efficiency Estimation (1/5)

Why do we have to estimate Es?
• Throughput
• Latency

How do we estimate Es?
• PAMELA (PerformAnce ModEling LAnguage) model

Performance Analysis: Efficiency Estimation (2/5)

Update the DPB (Decoded Picture Buffer) in every node
• Using MPI_Allgather

In this PAMELA model, MPI_Allgather is implemented using a binary tree

Performance Analysis: Efficiency Estimation (3/5)

The PAMELA model for encoding one frame in parallel is:

L = par ( p = 1…ns )

delay (ts); delay (tw)

seq ( i = 0…log2(ns)-1 )

par ( j = 1…ns )

delay ( tL + tc * 2^i )

• ns : The number of slices processed in parallel
• ts : The mean slice encoding time
• tw : The mean wait time due to variations in ts and global synchronization
• tL : Start-up time
• tc : Transmission time of one encoded slice

Performance Analysis: Efficiency Estimation (4/5)

The parallel time obtained by solving this model is

T(L) = ts + tw + tAG

tAG = log2 (ns) * tL + (ns - 1) * tc

Efficiency can then be computed from T(L).
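The closed form for tAG can be checked by evaluating the model's structure directly: a seq adds the times of its steps, and a par costs as much as its slowest branch. The sketch below is an assumption-laden reading of the PAMELA expression (function names are hypothetical, ns is assumed to be a power of two, and the sample values are illustrative only).

```python
import math


def allgather_time(n_s, t_l, t_c):
    """Evaluate seq(i = 0..log2(ns)-1) par(j = 1..ns) delay(tL + tc * 2^i)."""
    steps = int(math.log2(n_s))      # assumes ns is a power of two
    total = 0.0
    for i in range(steps):           # seq: step times add up
        # par: all ns branches run concurrently, so a step costs the maximum
        total += max(t_l + t_c * 2 ** i for _ in range(n_s))
    return total


def frame_time(n_s, t_s, t_w, t_l, t_c):
    """Slice encoding and wait in parallel, followed by the all-gather."""
    encode = max(t_s + t_w for _ in range(n_s))   # mean values, so all equal
    return encode + allgather_time(n_s, t_l, t_c)


def closed_form(n_s, t_s, t_w, t_l, t_c):
    """T(L) = ts + tw + log2(ns) * tL + (ns - 1) * tc."""
    return t_s + t_w + math.log2(n_s) * t_l + (n_s - 1) * t_c


# Illustrative parameter values (assumed, not the measured ones).
for n_s in (2, 4, 8, 16):
    a = frame_time(n_s, t_s=100.0, t_w=5.0, t_l=6.0, t_c=2.0)
    b = closed_form(n_s, t_s=100.0, t_w=5.0, t_l=6.0, t_c=2.0)
    print(n_s, a, b)
```

The two agree because the geometric sum of 2^i for i = 0 to log2(ns) - 1 equals ns - 1.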

Performance Analysis: Efficiency Estimation (5/5)

The experimental estimations of parameter values

tL     tc             ts        tw       tAG
6.0    0.0133*4056    840000    20586    421

[Figure: Estimated efficiency for a slice-based parallel encoder]
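As a rough check on these parameter values, the sketch below plugs them into T(L) = ts + tw + tAG. The efficiency definition used here, Es = ts / T(L), is an assumption (sequential time ns * ts divided by ns processes each busy for T(L)); the slides do not preserve the exact formula, and the time unit of the table is not stated.

```python
def slice_efficiency(t_s, t_w, t_ag):
    """Assumed definition: Es = ts / T(L), with T(L) = ts + tw + tAG."""
    parallel_time = t_s + t_w + t_ag
    return t_s / parallel_time


# ts, tw and tAG taken from the table above; the formula is an assumption.
e_s = slice_efficiency(t_s=840000, t_w=20586, t_ag=421)
print(f"Estimated Es = {e_s:.4f}")
```

Under this assumption the wait time tw dominates the overhead, while the all-gather term tAG is comparatively small.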

Performance Analysis: Slice Parallelism Scalability (1/4)

The feasible number of slices will depend on the video resolution

[Figure: bit rate increment (%) vs. number of MBs per slice]

Performance Analysis: Slice Parallelism Scalability (2/4)

[Figure: Bit rate overhead vs. number of slices per frame]

Performance Analysis: Slice Parallelism Scalability (3/4)

[Figure: PSNR loss vs. number of slices per frame]

Performance Analysis: Slice Parallelism Scalability (4/4)

[Figure: Encoding time vs. number of slices per frame]

Hierarchical Parallel Encoder: Overview

In order to achieve scalability and low latency
• Combine GOP and slice level parallelism

In the first level
• Divide the sequence into GOPs (15 frames)
• Every GOP is assigned to a processor group inside the cluster
• Each group encodes independently
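One way to picture the two levels is as a mapping from a flat process rank to a group (GOP level) and a position inside that group (slice level). The sketch below is an illustrative assumption, not the authors' implementation. The helper names are hypothetical, and in an MPI program the same partition would typically be obtained with MPI_Comm_split, using the group index as the color.

```python
def place(rank, n_s):
    """Map a flat process rank to (group index, slice index within the group)."""
    return rank // n_s, rank % n_s


def build_groups(n_p, n_s):
    """Partition n_p processes into groups of n_s processes each."""
    if n_p % n_s != 0:
        raise ValueError("n_p must be a multiple of n_s")
    groups = [[] for _ in range(n_p // n_s)]
    for rank in range(n_p):
        group, _ = place(rank, n_s)
        groups[group].append(rank)
    return groups


# Example: 8 processes, 2 slices per frame, giving 4 groups of 2 processes.
print(build_groups(n_p=8, n_s=2))
```

Each group then works on its own GOP, and the processes inside a group each encode one slice of every frame in that GOP.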

Hierarchical Parallel Encoder: GOP assignment method

Local manager
• Communicates with the global manager

Global manager
• Informs the GOP assignment by sending a message with the GOP number to the requesting local manager

Simple and load-balanced
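The request/reply scheme described above can be simulated without MPI. The sketch below uses Python threads and queues as a stand-in for message passing. All names are hypothetical, the "encoding" step is a placeholder, and the `None` end-of-work message is an assumed convention.

```python
import queue
import threading


def global_manager(num_gops, num_groups, requests, replies):
    """Hand out GOP numbers on demand until every group has been told to stop."""
    next_gop, finished = 0, 0
    while finished < num_groups:
        group = requests.get()               # a local manager asks for work
        if next_gop < num_gops:
            replies[group].put(next_gop)     # reply with the GOP number
            next_gop += 1
        else:
            replies[group].put(None)         # assumed "no GOPs left" message
            finished += 1


def local_manager(group, requests, reply, log):
    """Request GOPs one at a time and record which ones this group received."""
    while True:
        requests.put(group)
        gop = reply.get()
        if gop is None:
            return
        log.append((group, gop))             # placeholder for encoding the GOP


def run(num_gops=16, num_groups=4):
    requests = queue.Queue()
    replies = [queue.Queue() for _ in range(num_groups)]
    log = []
    threads = [
        threading.Thread(target=local_manager, args=(g, requests, replies[g], log))
        for g in range(num_groups)
    ]
    threads.append(
        threading.Thread(target=global_manager,
                         args=(num_gops, num_groups, requests, replies))
    )
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return log


assignments = run()
print(sorted(gop for _, gop in assignments))
```

Because a group only asks for a new GOP when it has finished the previous one, faster groups naturally take more GOPs, which is the load-balancing property the slide refers to.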

Hierarchical Parallel Encoder: Framework

[Figure: Hierarchical H.264 parallel encoder. A global manager and three processor groups, each made up of processes P0, P1 and P2]

Experimental Results: Environments (1/2)

Mozart
• 4 biprocessor nodes with AMD Opteron 246 at 2 GHz, interconnected by a switched Gigabit Ethernet

Aldebaran
• SGI Altix 3700 with 44 Itanium II nodes, interconnected by a high-performance proprietary network

Experimental Results: Environments (2/2)

720 * 480 standard sequence Ayersroc, which is composed of 16 GOPs

Configuration   Cluster     #Groups   #Slices
01_Gr_08S1      Mozart      1         8
02_Gr_04S1      Mozart      2         4
04_Gr_02S1      Mozart      4         2
08_Gr_01S1      Mozart      8         1
01_Gr_16S1      Aldebaran   1         16
02_Gr_08S1      Aldebaran   2         8
04_Gr_04S1      Aldebaran   4         4
08_Gr_02S1      Aldebaran   8         2
16_Gr_01S1      Aldebaran   16        1

Experimental Results: System Speedup (1/2)

[Figure: Speed-up in Mozart]

Experimental Results: System Speedup (2/2)

[Figure: Speed-up in Aldebaran]

Experimental Results: Encoding Latency

[Figure: Mean GOP encoding time]

Conclusions

• A hierarchical parallel video encoder based on H.264/AVC was proposed.
• Experimental results confirm the results of the preceding analysis, showing that a scalable, low-latency H.264 encoder can be obtained.
• Some issues remain open, as mentioned in the previous section.

References

[1] J.C. Fernández and M.P. Malumbres, “A Parallel implementation of H.26L video encoder”, in Proc. of EuroPar 2002 conf. (LNCS 2400), pp. 830, 833, Paderborn, 2002.

[2] A. Rodriguez, A. González and M.P. Malumbres, “Performance evaluation of parallel MPEG-4 video coding algorithms on clusters of workstations”, IEEE Int. Conference on Parallel Computing in Electrical Engineering, pp. 354, 357, Dresden, 2004.

[3] Arjan J.C. van Gemund, “Symbolic Performance Modeling of Parallel Systems”, IEEE Transactions on Parallel and Distributed Systems, vol. 14, no. 2, Feb. 2003.

[4] P.S. Pacheco, Parallel Programming with MPI, Morgan Kaufmann Publishers, Inc.