Optimizing Video Editing Software with OpenCL
Stanley Lam
Cyberlink Senior Program Manager/Technologist
Introduction
Video Editing Pipeline
– Decode → Effect → Blend → Encode
– 200+ effects
OpenCL for Effect/Blend acceleration
– Compatibility
Single code base for multiple devices
– Performance
GPU support
Concerns
– Host-to-device memory copy
– Host code security
Video Editing Pipeline
4-stage pipeline:
1. Obtain decoded frames from the Decoder modules
2. Apply Video Effects on demand
3. Apply the Blender to merge frame layers one by one
4. Pass the frame to the Encoder to produce the final video
[Figure: Decoder → Effect → Blender chains feeding a single Encoder]
Problem statement
OpenCL source buffer needs to be uploaded to the device before applying kernel operations
OpenCL result buffer needs to be downloaded to the host and passed to the next pipeline stage
Such host/device buffer movements can eliminate any HW acceleration gain
Effect Name | CPU Total (ms) | OCL_GPU Total (ms) | Kernel (ms) | Upload (ms) | Download (ms) | OCL_Kernel Gain | OCL_GPU Gain
TV Simulator | 14.11 | 15.46 | 2.88 | 6.11 | 6.47 | 389.93 % | -8.73 %
Platform: Armorhead
Driver: 8.832
SDK: APP SDK 2.4 RC
OS: Win7 Ultimate 64bits
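The two gain figures can be re-derived from the timings above. This is a minimal sketch, assuming gain is defined as (CPU time / OpenCL time - 1) × 100, which reproduces both percentages; the function and variable names are illustrative and do not come from the slides.

```python
def gain_percent(cpu_ms: float, ocl_ms: float) -> float:
    """Speed-up of the OpenCL path over the CPU path, in percent."""
    return (cpu_ms / ocl_ms - 1.0) * 100.0


# Timings for the "TV Simulator" effect, in milliseconds (from the slide).
cpu_total = 14.11
kernel, upload, download = 2.88, 6.11, 6.47
ocl_total = kernel + upload + download  # kernel time plus both transfers

kernel_gain = gain_percent(cpu_total, kernel)    # kernel alone vs. CPU
total_gain = gain_percent(cpu_total, ocl_total)  # kernel + copies vs. CPU

print(f"OCL_Kernel gain: {kernel_gain:.2f} %")
print(f"OCL_GPU gain:    {total_gain:.2f} %")
```

The kernel alone is almost five times faster than the CPU path, yet upload and download together take 12.58 ms, enough to turn the overall result negative.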
Resource inventory
Pipeline: n Source Videos (hence n blenders) with k effects and 1 Encoder
[Figure: n chains of Decoder → Effect → Effect → Blender, all feeding one Encoder]
Typical case: All modules are executed by the CPU (Host) on system memory.
• Copy counts marked in the figure: copy = n, copy = k, copy = n, copy = 0 (Encoder)
• Minimum # of frame copy = n + k + n
If k == 2n
[Figure: the same pipeline with Effect 2n-1 and Effect 2n on the chain of Decoder n; every frame is on system memory, numbered 1 … 4n]
If GPGPU Filter is used…
Logic:
– Upload input frame to the Device
– Upload filter kernel code to the Device
– Run the kernel code on the Device
– Read-back output frame to the Host
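The four steps above can be sketched with a toy host/device model. This is not the OpenCL API: `MockDevice` and its methods are hypothetical stand-ins for a buffer write, a program build, a kernel launch and a buffer read, and the "invert" kernel is only an example effect.

```python
class MockDevice:
    """Toy stand-in for a GPGPU device; NOT the OpenCL API."""

    def __init__(self):
        self.buffers = {}    # device-side frame storage
        self.kernels = {}    # kernels available on the device, by name
        self.transfers = []  # log of host<->device frame copies

    def write(self, name, frame):
        self.buffers[name] = list(frame)  # host -> device copy
        self.transfers.append(("upload", name))

    def build(self, name, func):
        self.kernels[name] = func  # "upload" the kernel code

    def run(self, kernel, src, dst):
        func = self.kernels[kernel]
        self.buffers[dst] = [func(p) for p in self.buffers[src]]

    def read(self, name):
        self.transfers.append(("download", name))
        return list(self.buffers[name])  # device -> host copy


def apply_effect(device, frame):
    device.write("input", frame)               # 1. upload input frame
    device.build("invert", lambda p: 255 - p)  # 2. upload kernel code
    device.run("invert", "input", "output")    # 3. run kernel on the device
    return device.read("output")               # 4. read back output frame


dev = MockDevice()
result = apply_effect(dev, [0, 128, 255])
print(result)         # [255, 127, 0]
print(dev.transfers)  # [('upload', 'input'), ('download', 'output')]
```

Each effect applied this way costs one upload and one read-back per frame, regardless of how cheap the kernel itself is.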
OpenCL GPGPU Effect – Host/Device copy
• 1 Input frame uploaded to Device
• 1 Output frame read-back to Host
[Figure: Pipeline OCL Host/Device Frame Movement for an OpenCL Effect. The Input frame is written from Host to Device; the Output frame is read from Device back to Host.]
OpenCL GPGPU Blender – Host/Device copy
• 2 Input frames uploaded to Device
• 1 Output frame read-back to Host
[Figure: Pipeline OCL Host/Device Frame Movement for an OpenCL Blender. Input1 and Input2 are written from Host to Device; the Output frame is read from Device back to Host.]
Data Flow – Use OpenCL Effect/Blender
• If we just simply put OpenCL GPGPU modules into the pipeline
• Copy counts marked in the figure: copy = 4n, copy = n, copy = 3k, copy = 0 (Encoder)
• Minimum # of frame copy = n + 3k + 4n
• Frame copy overhead = 2k + 3n
If k == 2n
[Figure: n chains of Decoder → Effect (OpenCL) → Effect (OpenCL) → Blender (OpenCL) feeding one Encoder, with frames in Host and in Device numbered 1 … 11n - 1]
Reduced Host/Device copy
• Keep frames in Device as long as possible
[Figure: Data Flow OCL Host/Device Frame Movement for an OpenCL Effect. Input and Output frames both stay in the Device.]
[Figure: Data Flow OCL Host/Device Frame Movement for an OpenCL Blender. Input1, Input2 and Output all stay in the Device.]
OpenCL Effects Data Flow – Reduced
• Keeping frames in GPGPU Device to reduce frame copy
• Copy counts marked in the figure: copy = 2n, copy = k, copy = n, copy = 1 (Encoder)
• Minimum # of frame copy = 2n + k + n + 1
• Frame copy overhead = n + 1
If k == 2n
[Figure: n chains of Source Video → Effect (OpenCL) → Effect (OpenCL) → Blender (OpenCL) feeding one Encoder, with frames in Host and in Device numbered 1 … 5n + 1]
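The three frame-copy formulas from the data-flow slides can be compared directly. A minimal sketch: the formulas are taken from the slides, while the function names are illustrative.

```python
def copies_cpu(n: int, k: int) -> int:
    """All modules on the CPU: n + k + n."""
    return n + k + n


def copies_opencl_naive(n: int, k: int) -> int:
    """OpenCL modules dropped into the pipeline as-is: n + 3k + 4n."""
    return n + 3 * k + 4 * n


def copies_opencl_resident(n: int, k: int) -> int:
    """Frames kept in the Device as long as possible: 2n + k + n + 1."""
    return 2 * n + k + n + 1


print(" n   k  cpu  naive  resident")
for n in (1, 2, 4, 8):
    k = 2 * n  # the slides' example: two effects per source video
    print(f"{n:2d} {k:3d} {copies_cpu(n, k):4d} "
          f"{copies_opencl_naive(n, k):6d} {copies_opencl_resident(n, k):9d}")
```

Subtracting the CPU-only count gives the overheads quoted on the slides: 2k + 3n for the naive OpenCL pipeline and n + 1 when frames stay in the Device.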
Resource inventory
Case 1-1: Decoder → OpenCL Effect
– SW Decoder + OpenCL Effect
1 frame copy (OCL Host → OCL Device)
– HW Decoder + OpenCL Effect (no share)
2 frame copies (DxVA Device → OCL Host (System) → OCL Device)
– HW Decoder + OpenCL Effect (DxVA/OpenDecode share)
0 frame copies
Case 1-2: Decoder → SW Effect
– SW Decoder + SW Effect
0 frame copies
– HW Decoder + SW Effect
1 frame copy (DxVA Device → System)
Resource inventory
Case 2-1: OpenCL Effect → Blender
– 2 OpenCL Effects + SW Blender
2 frame copies (2 OCL Device → 2 OCL Host)
– 2 OpenCL Effects + OpenCL Blender (no share)
4 frame copies (2 OCL Device → 2 OCL Host (System) → 2 OCL Device)
– 2 OpenCL Effects + OpenCL Blender (OpenCL shared)
0 frame copies
Case 2-2: SW Effect → Blender
– 2 SW Effects + SW Blender
0 frame copies
– 2 SW Effects + OpenCL Blender
2 frame copies (2 OCL Host → 2 OCL Device)
Resource inventory
Case 3-1: OpenCL Blender → Encoder
– OpenCL Blender + SW Encoder
1 frame copy (OCL Device → OCL Host)
– OpenCL Blender + HW Encoder (no share)
2 frame copies (OCL Device → OCL Host (System) → Enc Device)
– OpenCL Blender + HW Encoder (OpenEncode shared)
0 frame copies
Case 3-2: SW Blender → Encoder
– SW Blender + SW Encoder
0 frame copies
– SW Blender + HW Encoder
1 frame copy (System → Enc Device)
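Cases 1-1 to 3-2 can be read as lookup tables keyed by the implementation on each side of a stage boundary. The sketch below encodes them and sums the boundary copies for a small pipeline. The composition (two sources, one effect each, one blender, one encoder), the `shared` flag and all names are assumptions made for illustration; only the per-case counts come from the slides.

```python
# (producer, consumer, shared) -> frame copies at that boundary.
DECODER_TO_EFFECT = {
    ("SW", "OCL", False): 1,
    ("HW", "OCL", False): 2,
    ("HW", "OCL", True): 0,   # DxVA/OpenDecode share
    ("SW", "SW", False): 0,
    ("HW", "SW", False): 1,
}
# Counts below are for two effects feeding one blender.
EFFECTS_TO_BLENDER = {
    ("OCL", "SW", False): 2,
    ("OCL", "OCL", False): 4,
    ("OCL", "OCL", True): 0,  # OpenCL shared
    ("SW", "SW", False): 0,
    ("SW", "OCL", False): 2,
}
BLENDER_TO_ENCODER = {
    ("OCL", "SW", False): 1,
    ("OCL", "HW", False): 2,
    ("OCL", "HW", True): 0,   # OpenEncode shared
    ("SW", "SW", False): 0,
    ("SW", "HW", False): 1,
}


def lookup(table, src, dst, shared):
    # Sharing only helps where the slides list a shared variant;
    # otherwise fall back to the non-shared count.
    return table.get((src, dst, shared), table[(src, dst, False)])


def boundary_copies(decoder, effect, blender, encoder, shared=False):
    """Boundary copies for 2 sources -> 2 effects -> 1 blender -> 1 encoder."""
    return (2 * lookup(DECODER_TO_EFFECT, decoder, effect, shared)
            + lookup(EFFECTS_TO_BLENDER, effect, blender, shared)
            + lookup(BLENDER_TO_ENCODER, blender, encoder, shared))


print(boundary_copies("SW", "SW", "SW", "SW"))                 # 0
print(boundary_copies("SW", "OCL", "OCL", "SW", shared=True))  # 3
print(boundary_copies("HW", "OCL", "OCL", "HW"))               # 10
print(boundary_copies("HW", "OCL", "OCL", "HW", shared=True))  # 0
```

Under these assumptions, a fully hardware-accelerated pipeline without sharing pays the most boundary copies, and the same pipeline with sharing pays none.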
Resource inventory
Memory is a limited resource within the GPU
e.g. 1024 MB on the HD 6800 series
Lots of memory activities in video editing pipeline
Memory management is critical
Host-to-Device frame-copy must occur when the pipeline uses a mix of CPU and GPU filters
Can be improved by “zero-copy”, “fast copy” techniques
Frame-copy can be avoided if whole pipeline is in the same Device
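To put the 1024 MB figure in perspective, a rough upper bound on the number of uncompressed frames that fit in device memory can be computed. This sketch assumes Full HD (1920×1080) frames and two common pixel layouts (4 bytes per pixel for 32-bit RGBA, 1.5 bytes per pixel for NV12); the slides do not state the formats used, and in practice not all device memory is available for frames.

```python
def frame_bytes(width: int, height: int, bytes_per_pixel: float) -> int:
    """Size of one uncompressed frame in bytes."""
    return int(width * height * bytes_per_pixel)


def frames_that_fit(memory_mb: int, width: int, height: int,
                    bytes_per_pixel: float) -> int:
    """Upper bound on whole frames that fit in `memory_mb` MiB."""
    return (memory_mb * 1024 * 1024) // frame_bytes(width, height, bytes_per_pixel)


rgba_frames = frames_that_fit(1024, 1920, 1080, 4)    # assumed 32-bit RGBA
nv12_frames = frames_that_fit(1024, 1920, 1080, 1.5)  # assumed NV12
print(rgba_frames)  # 129
print(nv12_frames)  # 345
```

With several source videos, intermediate effect outputs and multi-threaded rendering all holding frames at once, a budget of about a hundred RGBA frames can be consumed quickly.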
Resource inventory
Various HW acceleration techniques adopted
They handle resources differently
– Buffers duplicated
Module | HW Acceleration Tech
Decoder | DxVA
Effects: PiP | OpenCL
Effects: DSP | OpenCL/IntelQuickSync/APP/CUDA
Effects: Particle | D3D11
Effects: 3D Template | D3D9
Effects: Title | OpenCL
Blender | OpenCL
Encoder | MSDK/AVT/NVPVENC (Note 1)
Note 1: Use HW Encoder filter from IHV
Resource inventory
Multi-thread rendering
– Can leverage available computational resources
– Further increases overall rendering performance
Dedicated Memory Management is necessary
– To improve performance in CPU+GPU mixed pipeline cases
– To avoid memory movement in pure GPGPU pipeline cases
– OpenCL/D3D9/D3D11
– To use limited GPGPU memory in a more efficient way
Memory management
2 pipeline scenarios
– CPU-only pipeline – use only CPU resources
– HW accelerated pipeline – use GPU resources if possible
To manage resources for the different pipeline scenarios
– To share memory objects
– To reduce memory allocation/destruction
– To increase memory usage efficiency
2-layer design
– Manager + Object
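One way to reduce memory allocation and destruction is to recycle frame buffers instead of allocating one per frame. The sketch below is a generic buffer pool, not the implementation described in the talk; `FramePool` and its methods are hypothetical names, and host `bytearray` objects stand in for real frame buffers.

```python
class FramePool:
    """Recycles frame buffers by size instead of allocating one per frame."""

    def __init__(self):
        self._free = {}       # size in bytes -> list of idle buffers
        self.allocations = 0  # number of real allocations performed

    def acquire(self, size: int) -> bytearray:
        idle = self._free.get(size)
        if idle:
            return idle.pop()  # reuse an idle buffer of the same size
        self.allocations += 1
        return bytearray(size)

    def release(self, buf: bytearray) -> None:
        self._free.setdefault(len(buf), []).append(buf)


FRAME_BYTES = 1920 * 1080 * 4  # assumed Full HD, 4 bytes per pixel

pool = FramePool()
for _ in range(100):  # 100 frames pass through one pipeline stage
    buf = pool.acquire(FRAME_BYTES)
    # ... the stage would fill and consume the frame here ...
    pool.release(buf)

print(pool.allocations)  # 1
```

The same idea applies to device-side buffers, where allocation is typically more expensive and capacity is tighter.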
Memory Manager
Memory object manager
– Allocate/Free/Monitor memory object
from both system and GPGPU memory
– Sync up GPGPU device used in pipeline
– Handle out-of-memory situation
– Keep track of memory object status
Total object amount
Total used/locked memory size
Used by editing kernel to deliver/manage frame buffers in pipeline
Memory Object
Abstracted frame buffer object
– To carry System/OpenCL/DxVA/D3D9/D3D11 frames
Used by all modules within pipeline
Do host/device migration
– Notify out-of-memory situation
Centrally managed by the Memory Manager
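The 2-layer Manager + Object design can be sketched as follows. This is an illustrative model only: the class and method names, the two-location ("host"/"device") simplification and the fall-back-to-host policy on out-of-memory are assumptions, not the actual design from the talk.

```python
class MemoryObject:
    """Abstract frame buffer that knows where its pixels currently live."""

    def __init__(self, manager, size, location):
        self.manager = manager
        self.size = size
        self.location = location  # "host" or "device"

    def migrate(self, target):
        """Move the frame to `target`; no copy if it is already there."""
        if target != self.location:
            self.manager.move(self, target)


class MemoryManager:
    """Allocates, frees and monitors memory objects for the pipeline."""

    def __init__(self, device_capacity):
        self.device_capacity = device_capacity
        self.device_used = 0  # total device memory in use
        self.objects = []     # all live memory objects
        self.copies = 0       # host<->device copies performed

    def allocate(self, size, location="host"):
        if location == "device" and self.device_used + size > self.device_capacity:
            location = "host"  # assumed policy: fall back to system memory
        if location == "device":
            self.device_used += size
        obj = MemoryObject(self, size, location)
        self.objects.append(obj)
        return obj

    def move(self, obj, target):
        if target == "device":
            if self.device_used + obj.size > self.device_capacity:
                raise MemoryError("device out of memory")
            self.device_used += obj.size
        else:
            self.device_used -= obj.size
        obj.location = target
        self.copies += 1

    def free(self, obj):
        if obj.location == "device":
            self.device_used -= obj.size
        self.objects.remove(obj)


mgr = MemoryManager(device_capacity=16)
frame = mgr.allocate(8)  # decoded into system memory
frame.migrate("device")  # first OpenCL effect: one upload
frame.migrate("device")  # second OpenCL effect: already resident, no copy
frame.migrate("host")    # SW encoder: one read-back
print(mgr.copies, mgr.device_used, len(mgr.objects))  # 2 0 1
```

Because each module asks the object for the location it needs rather than copying unconditionally, consecutive OpenCL stages share the resident frame and only mixed CPU/GPU boundaries trigger a copy.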
Examples
[Figure: pipeline Decoder → Effect 1 → Effect 2 → Blender → Encoder, with a legend for the Memory Manager, Memory Objects (buffers passed through the pipeline) and temporary buffers]
Demo
Show whole editing pipeline
– PDR10 (build 5/30)
Demo cases
– No HW acceleration
SW Dec + SW Effect + SW Blender + SW Enc
– Enable all OCL acceleration w/ share
SW Dec + OCL Effect + OCL Blender + SW Enc
– Enable all HW acceleration w/o share
HW Dec + OCL Effect + OCL Blender + HW Enc
– Enable all HW acceleration w/ share
Not ready yet
Performance
Performance numbers of 46 OCL Effects
Performance numbers of Demo cases
Test Projects: PackedProject\Project1.pds - PDR 10 2x2 TVWall with 3 OpenCL effects
Source: Sample_H264.m2ts - Full HD H264 clip, 30 sec
Destination: MPEG2 - Profile "(HD) MPEG-2, 1080i" PDR10 build
Sabine Platform: 1.8 GHz, Radeon HD 6620G, 4GB system memory
Configuration | Total (s) | Gain (%)
SW (Effect + Blender) + SW (Dec + Enc) | 1045 | -
OCL (Effect + Blender) + SW (Dec + Enc) (with overhead) | 485 | 115.46
OCL (Effect + Blender) + SW (Dec + Enc) (w/o overhead) | 405 | 158.02
OCL (Effect + Blender) + HW (Dec + Enc) (w/o overhead) | 455 | 129.67
Q & A
34 | Optimizing Video Editing Software with OpenCL | June 2011
Disclaimer & Attribution
The information presented in this document is for informational purposes only and may contain technical inaccuracies, omissions
and typographical errors.
The information contained herein is subject to change and may be rendered inaccurate for many reasons, including but not limited
to product and roadmap changes, component and motherboard version changes, new model and/or product releases, product
differences between differing manufacturers, software changes, BIOS flashes, firmware upgrades, or the like. There is no
obligation to update or otherwise correct or revise this information. However, we reserve the right to revise this information and to
make changes from time to time to the content hereof without obligation to notify any person of such revisions or changes.
NO REPRESENTATIONS OR WARRANTIES ARE MADE WITH RESPECT TO THE CONTENTS HEREOF AND NO
RESPONSIBILITY IS ASSUMED FOR ANY INACCURACIES, ERRORS OR OMISSIONS THAT MAY APPEAR IN THIS
INFORMATION.
ALL IMPLIED WARRANTIES OF MERCHANTABILITY OR FITNESS FOR ANY PARTICULAR PURPOSE ARE EXPRESSLY
DISCLAIMED. IN NO EVENT WILL ANY LIABILITY TO ANY PERSON BE INCURRED FOR ANY DIRECT, INDIRECT, SPECIAL
OR OTHER CONSEQUENTIAL DAMAGES ARISING FROM THE USE OF ANY INFORMATION CONTAINED HEREIN, EVEN IF
EXPRESSLY ADVISED OF THE POSSIBILITY OF SUCH DAMAGES.
AMD, the AMD arrow logo, and combinations thereof are trademarks of Advanced Micro Devices, Inc. All other names used in
this presentation are for informational purposes only and may be trademarks of their respective owners.
The contents of this presentation were provided by individual(s) and/or company listed on the title page. The information and
opinions presented in this presentation may not represent AMD’s positions, strategies or opinions. Unless explicitly stated, AMD is
not responsible for the content herein and no endorsements are implied.