Thread-Parallel MPEG-2, MPEG4 and H.264 Video Encoders for SoC Multi-Processor Architecture
Tom R. Jacobs, Vassilios A. Chouliaras, and David J. Mulvaney
IEEE Transactions on Consumer Electronics
Outline
Introduction
  Background knowledge
  Main purpose
Previous work
Methodology
Experimental results
Conclusions
Introduction: Background Knowledge (1/5)
A number of lossy video compression standards have been developed: MPEG-1, MPEG-2, MPEG-4 Part 2 and H.264.
Each aims to maintain image quality while reducing bit-rates, at the cost of additional computation and power consumption.
Introduction: Background Knowledge (2/5)
Such processing-intensive consumer application algorithms are generally implemented in System-on-Chip (SoC) devices.
Parallelism
  DLP: Data-Level Parallelism
  TLP: Thread-Level Parallelism
Introduction: Background Knowledge (3/5)
Data-Level Parallelism (DLP)
  Distributing the data across different parallel processing nodes.
Program:
  ...
  if CPU="a" then
    low_limit=1; upper_limit=5
  else if CPU="b" then
    low_limit=6; upper_limit=10
  end if
  do i = low_limit, upper_limit
    Task on d(i)
  end do
  ...
end program
Introduction: Background Knowledge (4/5)
[Figure: data array D of size 10 (elements 1 to 10), divided between two processing nodes.]
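
The DLP pseudocode can be sketched in Python as follows. This is a minimal illustration, not the authors' code: `task` is a hypothetical stand-in for "Task on d(i)", and the two worker threads play the role of CPUs "a" and "b" with the same index limits as the slide.

```python
from concurrent.futures import ThreadPoolExecutor

def task(x):
    # Hypothetical stand-in for "Task on d(i)"; squaring is arbitrary.
    return x * x

def process_range(d, low, high):
    # Process elements low..high (1-based, inclusive), like the slide's do-loop.
    return [task(d[i - 1]) for i in range(low, high + 1)]

d = list(range(1, 11))                  # data array D of size 10
ranges = {"a": (1, 5), "b": (6, 10)}    # same limits as the pseudocode

with ThreadPoolExecutor(max_workers=2) as pool:
    futures = {cpu: pool.submit(process_range, d, lo, hi)
               for cpu, (lo, hi) in ranges.items()}
    results = {cpu: f.result() for cpu, f in futures.items()}

combined = results["a"] + results["b"]  # results in original element order
```

Both workers run the same code on different halves of the data, which is the defining property of DLP. The sketch shows the partitioning only; it makes no claim about speed-up.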
Introduction: Background Knowledge (5/5)
Thread-Level Parallelism (TLP)
  TLP is the parallelism inherent in an application that runs multiple threads at once.
Benefit
  The workload of a single high-performance processor can be distributed among a number of slower and simpler processor cores.
Introduction: Main Purpose (1/2)
Utilizing Thread-Level Parallelism (TLP) techniques to improve video coding performance.
  Reduce the dynamic instruction count (DIC).
How to improve?
  Distribute the workload among a number of processors executing in parallel.
Introduction: Main Purpose (2/2)
The results presented demonstrate that reductions in dynamic instruction count can be achieved.
Previous Work
The majority of this research focuses on coarse-granularity TLP exploitation, with the workload most commonly distributed at the GOP (group of pictures) level.
[Figure: a sequence of GOPs distributed across threads (multi-threading).]
Little inter-node communication is needed.
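
GOP-level distribution can be sketched as below. This is an illustration under stated assumptions, not code from the cited work: the GOP length of 12 frames, the function names and the string "encoder" are all hypothetical, and GOPs are assumed to be closed so that each one can be encoded without data from its neighbours.

```python
from concurrent.futures import ThreadPoolExecutor

GOP_SIZE = 12  # assumed GOP length in frames

def split_into_gops(num_frames, gop_size=GOP_SIZE):
    """Return (start, end) frame index pairs, with end exclusive."""
    return [(s, min(s + gop_size, num_frames))
            for s in range(0, num_frames, gop_size)]

def encode_gop(gop):
    # Stand-in for a real encoder: only frames inside this GOP are touched,
    # so workers never need to exchange data.
    start, end = gop
    return f"gop[{start}:{end}]"

gops = split_into_gops(30)
with ThreadPoolExecutor(max_workers=3) as pool:
    # map() returns results in input order, so the pieces can be concatenated.
    encoded = list(pool.map(encode_gop, gops))
```

Each worker receives a whole GOP and returns a finished piece of bitstream, which is why this granularity needs so little communication between nodes.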
Previous Work
In 1995, K. Shen, L. A. Rowe, and E. J. Delp implemented a parallel MPEG-1 encoder at the GOP level.
In 1996, S. Bozoki, S. J. P. Westen, R. L. Lagendijk and J. Biemond compared GOP-level and slice-level parallelism for MPEG-1.
Previous Work
In 1997, A. Bilas, J. Fritts and J. P. Singh evaluated the performance of MPEG-2 decoders on a shared-memory system.
Akramullah, Ahmad and Liou implemented a threaded MPEG-2 encoder at the macroblock (MB) level using local memory.
Methodology: Overview
The threaded MPEG-2, MPEG-4 and H.264 encoders were compiled for a multi-context instruction set simulator (MT-ISS) based on the SimpleScalar infrastructure.
The most important issue
  Data dependencies between processors.
  Avoid race hazards.
Methodology: Race hazards
Expected condition (two threads each increment the integer i, initially 0)
  Thread 1 reads i = 0, computes i+1 and writes 1.
  Thread 2 reads i = 1, computes i+1 and writes 2.
  Final value: i = 2.
Error condition (race hazard)
  Thread 1 and Thread 2 both read i = 0.
  Each computes i+1 and writes 1.
  Final value: i = 1, so one increment is lost.
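
The two conditions can be replayed deterministically, and the usual remedy (a mutual-exclusion lock around the read-modify-write) sketched, as follows. The function and class names are illustrative assumptions; the slides do not say which synchronization primitive the authors used.

```python
import threading

def expected_interleaving():
    i = 0
    t1 = i          # thread 1 reads 0
    i = t1 + 1      # thread 1 writes 1
    t2 = i          # thread 2 reads 1
    i = t2 + 1      # thread 2 writes 2
    return i

def racy_interleaving():
    i = 0
    t1 = i          # thread 1 reads 0
    t2 = i          # thread 2 reads 0 before thread 1 has written
    i = t1 + 1      # thread 1 writes 1
    i = t2 + 1      # thread 2 also writes 1: one increment is lost
    return i

class Counter:
    """Shared integer whose increment is protected by a lock."""
    def __init__(self):
        self.value = 0
        self._lock = threading.Lock()

    def increment(self):
        with self._lock:        # only one thread at a time may read-modify-write
            self.value += 1

def run_locked(num_threads=2, increments=1000):
    counter = Counter()

    def work():
        for _ in range(increments):
            counter.increment()

    threads = [threading.Thread(target=work) for _ in range(num_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return counter.value
```

With the lock in place, the final count always equals the total number of increments, whatever order the threads are scheduled in.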
Methodology: Thread-parallel MPEG-2 (1/5)
Test Model 5 (TM5) of the MPEG-2 encoder is used.
Computation analysis (QCIF)
  DIST1: 52% to 73% of total DIC for search windows of 6 to 62 pels respectively.
  FullSearch: 3.5% to 23.2% of total DIC.
    Can be improved by a less complex ME algorithm (such as 3-step, 4-step or diamond search).
  FDCT and IDCT: 2.1% to 21% of total DIC.
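
To show why the distance kernel dominates the instruction count, here is a minimal full-search sketch. It assumes the distance measure is a sum of absolute differences (SAD) over a square block; the function names and the tiny test frames are hypothetical and this is not the TM5 source.

```python
def sad(cur, ref, cx, cy, rx, ry, size):
    """Sum of absolute differences between the size x size block of cur
    at (cx, cy) and the block of ref at (rx, ry)."""
    total = 0
    for y in range(size):
        for x in range(size):
            total += abs(cur[cy + y][cx + x] - ref[ry + y][rx + x])
    return total

def full_search(cur, ref, cx, cy, size, search_range):
    """Try every displacement within +/- search_range that stays inside
    the frame; return (dx, dy, sad) of the best match."""
    height, width = len(ref), len(ref[0])
    best = None
    for dy in range(-search_range, search_range + 1):
        for dx in range(-search_range, search_range + 1):
            rx, ry = cx + dx, cy + dy
            if rx < 0 or ry < 0 or rx + size > width or ry + size > height:
                continue
            cost = sad(cur, ref, cx, cy, rx, ry, size)
            if best is None or cost < best[2]:
                best = (dx, dy, cost)
    return best

# Tiny synthetic frames: cur is ref displaced by (2, 1), zero-padded.
ref = [[x + 8 * y for x in range(8)] for y in range(8)]
cur = [[ref[y + 1][x + 2] if y + 1 < 8 and x + 2 < 8 else 0
        for x in range(8)] for y in range(8)]
match = full_search(cur, ref, cx=2, cy=2, size=2, search_range=2)
```

For a search range R, the full search evaluates up to (2R+1)^2 candidate positions per block and calls the distance function once for each, so the cost of the distance kernel grows quadratically with the window size.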
Methodology: Thread-parallel MPEG-2 (2/5)
Methodology: Thread-parallel MPEG-2 (3/5)
Motion Estimation
  The kernel implementation can take advantage of data-parallel techniques.
  Store the information in the mbinfo structure for motion compensation.
  Maintain exclusivity of all variables during the parallel sections.
Methodology: Thread-parallel MPEG-2 (4/5)
Forward transform
  FDCT first scans the MBs row by row and processes the MBs in each row individually.
  It determines the prediction error and applies the DCT to the block.
  The thread-parallel transform function can be performed at block level.
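
A block-level parallel forward transform might look like the sketch below. It is an assumption-laden illustration: a naive orthonormal 8x8 DCT-II written out directly, a hypothetical `transform_block` helper, and a thread pool in place of the multi-processor target described in the slides.

```python
import math
from concurrent.futures import ThreadPoolExecutor

N = 8  # block size

def residual(cur, pred):
    """Prediction error: current block minus predicted block."""
    return [[cur[y][x] - pred[y][x] for x in range(N)] for y in range(N)]

def dct_2d(block):
    """Naive 2-D DCT-II of an N x N block with orthonormal scaling."""
    def c(k):
        return math.sqrt(1 / N) if k == 0 else math.sqrt(2 / N)
    out = [[0.0] * N for _ in range(N)]
    for u in range(N):
        for v in range(N):
            s = 0.0
            for y in range(N):
                for x in range(N):
                    s += (block[y][x]
                          * math.cos((2 * y + 1) * u * math.pi / (2 * N))
                          * math.cos((2 * x + 1) * v * math.pi / (2 * N)))
            out[u][v] = c(u) * c(v) * s
    return out

def transform_block(job):
    # Each job reads only its own (current, prediction) pair, so jobs
    # share no writable data and can run in any order.
    cur, pred = job
    return dct_2d(residual(cur, pred))

flat = [[10] * N for _ in range(N)]
zero = [[0] * N for _ in range(N)]
ramp = [[x for x in range(N)] for _ in range(N)]
jobs = [(flat, zero), (ramp, ramp)]

with ThreadPoolExecutor(max_workers=2) as pool:
    coeffs = list(pool.map(transform_block, jobs))
```

Because every block produces its own output array, no locking is needed in this stage; the only coordination is collecting the results in order.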
Methodology: Thread-parallel MPEG-2 (5/5)
Inverse transform
  IDCT scans the MBs first row by row and then block by block.
  Because there are no data dependencies between blocks, it can be executed in parallel.
Methodology: Thread-parallel MPEG-4 (1/8)
The implementation is based on the XviD project with the Advanced Simple Profile (ASP).
  Bidirectional frames
  Quarter-pel motion compensation
  Global motion compensation
  Trellis quantization
  Custom quantization matrices
Methodology: Thread-parallel MPEG-4 (2/8)
Computation analysis (QCIF)
Methodology: Thread-parallel MPEG-4 (3/8)
The nature of the XviD encoder
  Intra-frame encoding
  Inter-frame encoding
Methodology: Thread-parallel MPEG-4 (4/8)
Intra-frame encoding
  FrameCodeI (row by row for each MB)
  Parallelize the loop that encodes the MBs in a row of the image.
  MB data structure pMB.
    Shared memory array.
  The function with the highest DIC in FrameCodeI is MBTransQuantIntra.
Methodology: Thread-parallel MPEG-4 (5/8)
MBTransQuantIntra
  Forward transformation, quantization and inverse transformation.
  Shared data structure pEnc
    Includes a count of quantization values.
    Serial code section.
  Transforms the pixel data of a specific MB into the frequency domain independently.
MBPrediction and MBCoding
  Responsible for VLC and writing to the bitstream.
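
The split between independent per-MB work and a short serial section on shared state can be sketched as follows. Everything here is hypothetical: `EncoderState` only imitates the role of pEnc, the "count" is taken to mean the number of non-zero quantized levels, and the quantizer is a simple truncating division rather than XviD's.

```python
import threading
from concurrent.futures import ThreadPoolExecutor

class EncoderState:
    """Hypothetical stand-in for the shared pEnc structure."""
    def __init__(self):
        self.quant_count = 0
        self.lock = threading.Lock()

def quantize_mb(coeffs, qstep):
    # Truncating quantizer; a stand-in for the real one.
    return [int(c / qstep) for c in coeffs]

def trans_quant_intra(enc, mb_coeffs, qstep):
    # Parallel part: touches only this macroblock's data.
    levels = quantize_mb(mb_coeffs, qstep)
    nonzero = sum(1 for level in levels if level != 0)
    # Serial part: the shared counter is updated under a lock.
    with enc.lock:
        enc.quant_count += nonzero
    return levels

enc = EncoderState()
row = [[16, 3, -9, 0], [40, -2, 1, 8]]   # coefficients of two MBs in one row

with ThreadPoolExecutor(max_workers=2) as pool:
    levels = list(pool.map(lambda mb: trans_quant_intra(enc, mb, 4), row))
```

Keeping the locked region down to a single addition limits how much of the loop is forced to run serially.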
Methodology: Thread-parallel MPEG-4 (6/8)
Inter-frame encoding
  FrameCodeP
    Part 1: Motion Estimation
    Part 2: Transformation, Quantization, MC (motion compensation)
Methodology: Thread-parallel MPEG-4 (7/8)
Motion Estimation
  Determines an MV for every MB and applies certain criteria to indicate when intra coding should be used.
  Scanning in raster line order.
  Two kinds of process
    Motion prediction from the current frame.
    ME relative to reference frames.
Methodology: Thread-parallel MPEG-4 (8/8)
Motion Prediction
  Examines the MVs in neighbouring MBs and determines an initial estimate for ME.
[Figure: neighbouring-MB patterns used for motion prediction: ideal pattern, typical pattern, TLP pattern.]
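
The idea of predicting an MV from neighbouring MBs can be sketched as a component-wise median of three candidates. The patterns below are assumptions: `TYPICAL` uses the left, top and top-right neighbours, a common choice in MPEG-4 encoders, while `ROW_PARALLEL` is a guess at what a thread-friendly pattern could look like (previous row only, so MBs in the same row do not wait for each other). The figure itself is not recoverable here, so neither is claimed to match the paper.

```python
TYPICAL = [(-1, 0), (0, -1), (1, -1)]        # left, top, top-right
ROW_PARALLEL = [(-1, -1), (0, -1), (1, -1)]  # assumed: previous row only

def median3(a, b, c):
    return sorted((a, b, c))[1]

def predict_mv(mvs, x, y, pattern):
    """Component-wise median of the neighbours' MVs.
    mvs maps (mb_x, mb_y) to (mv_x, mv_y); missing neighbours count as (0, 0)."""
    cands = [mvs.get((x + dx, y + dy), (0, 0)) for dx, dy in pattern]
    return (median3(*(c[0] for c in cands)),
            median3(*(c[1] for c in cands)))

# MVs already found for the row above and for the left neighbour.
mvs = {(0, 0): (2, 0), (1, 0): (4, -2), (2, 0): (6, 2), (0, 1): (8, 8)}
typical = predict_mv(mvs, 1, 1, TYPICAL)
row_parallel = predict_mv(mvs, 1, 1, ROW_PARALLEL)
```

The two patterns can give different starting points for the search, which is one way a change made for parallelism can affect the resulting motion vectors.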
Methodology: H.264 (1/6)
x264 is used for the implementation.
  Frame slicing
Main problems of using MB-level parallelism
  Wide variation in processor workload.
  The prediction algorithm needs to be modified.
Methodology: H.264 (2/6)
Slice group in H.264
  A group of MBs in a frame.
  Can be encoded or decoded separately from the remainder of the frame.
  Motion prediction is not allowed across slice boundaries.
Drawback
  The required bit-rate increases.
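
Slice-level distribution can be sketched as below, assuming each slice is a contiguous run of whole MB rows and each slice goes to one worker. The helper names are hypothetical and the "encoder" only counts macroblocks; this is not x264's API.

```python
from concurrent.futures import ThreadPoolExecutor

MB_COLS, MB_ROWS = 11, 9   # QCIF (176x144) in 16x16 macroblocks

def split_rows_into_slices(mb_rows, num_slices):
    """Split mb_rows into num_slices contiguous (start, end) row ranges,
    as evenly as possible, with end exclusive."""
    base, extra = divmod(mb_rows, num_slices)
    slices, start = [], 0
    for s in range(num_slices):
        end = start + base + (1 if s < extra else 0)
        slices.append((start, end))
        start = end
    return slices

def encode_slice(rows):
    # Stand-in for slice encoding: only MBs inside this slice are visited,
    # mirroring the rule that prediction does not cross slice boundaries.
    start, end = rows
    return (end - start) * MB_COLS   # number of MBs handled

slices = split_rows_into_slices(MB_ROWS, 4)
with ThreadPoolExecutor(max_workers=4) as pool:
    mbs_per_slice = list(pool.map(encode_slice, slices))
```

When the row count is not divisible by the slice count, one slice carries more MBs than the others, a simple source of uneven workload even before content-dependent effects are considered.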
Methodology: H.264 (3/6)
Comparison of different numbers of slices
Methodology: H.264 (4/6)
Comparison of different numbers of slices
Methodology: H.264 (5/6)
Different resolutions with 4 slices
Methodology: H.264 (6/6)
Computation analysis
Experimental Results: MPEG-2
[Figure: MPEG-2 results plotted against search range.]
Experimental Results: MPEG-4
[Figure: MPEG-4 results plotted against quality setting.]
Experimental Results: H.264
[Figure: H.264 results plotted against quantization parameter.]
Experimental Results: Comparative results
Conclusions
The DIC of the MPEG-2, MPEG-4 and H.264 encoders can be greatly reduced by TLP.
For HD sequences, the improvement is around 84%, 92% and 96% respectively.
TLP has become more significant with each new generation of video encoders.
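
As a rough reading aid, a percentage reduction can be converted into an equivalent reduction factor. This assumes the quoted figures are reductions in DIC relative to the single-threaded encoder; the slides do not state the baseline or the number of processors, so the factors below are arithmetic on the quoted percentages, not reported results.

```python
def reduction_to_factor(reduction_percent):
    """A reduction of r percent leaves (100 - r) percent of the work,
    i.e. a factor of 100 / (100 - r)."""
    return 100.0 / (100.0 - reduction_percent)

quoted = {"MPEG-2": 84, "MPEG-4": 92, "H.264": 96}   # percentages from the slide
factors = {name: reduction_to_factor(r) for name, r in quoted.items()}
```

Under that assumption the three figures correspond to factors of 6.25, 12.5 and 25, which makes the widening gap between encoder generations easier to see than the percentages alone.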