Optimizing Video Editing Software with OpenCL

Stanley Lam

Cyberlink

Cyberlink Senior Program Manager/Technologist



OpenCL - Introduction

Introduction

Video Editing Pipeline

– Decode → Effect → Blend → Encode

– 200+ effects

OpenCL for Effect/Blend acceleration

– Compatibility

Single code base for multiple devices

– Performance

GPU support

Concerns

– Host-to-device memory copy

– Host code security

[Figure: pipeline diagram, Decoder → Effects → Blender → Encoder]

Video Editing Pipeline

4-stage pipeline

1. Obtain decoded frames from the Decoder modules

2. Apply Video Effects on demand

3. Apply the Blender to merge frame layers one by one

4. Pass the frame to the Encoder to produce the final video
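The four stages can be illustrated with a toy model. This is a minimal sketch, not the product's code: frames are plain lists of pixel values, the function names (`decode`, `apply_effect`, `blend`, `encode`, `render`) are hypothetical, and the blank starting canvas is an assumption made so that n sources need n blend steps.

```python
def decode(source):
    """Stage 1: stand-in decoder, returns the frame as a list of pixel values."""
    return list(source)


def apply_effect(frame, effect):
    """Stage 2: apply one per-pixel effect to a frame."""
    return [effect(px) for px in frame]


def blend(bottom, top):
    """Stage 3: merge two layers (a simple 50/50 average here)."""
    return [(a + b) // 2 for a, b in zip(bottom, top)]


def encode(frame):
    """Stage 4: stand-in encoder, packs the frame into bytes."""
    return bytes(frame)


def render(sources, effects_per_source):
    """Run every source through its effects, then blend the layers one by one."""
    result = [0] * len(sources[0])  # assumed blank canvas as the first layer
    for source, effects in zip(sources, effects_per_source):
        frame = decode(source)
        for effect in effects:
            frame = apply_effect(frame, effect)
        result = blend(result, frame)
    return encode(result)


invert = lambda px: 255 - px
output = render([[0, 100, 200], [100, 100, 100]], [[invert], []])
print(list(output))
```

Every function here takes a frame and returns a new one, which is why the number of frame copies between stages becomes the central cost once some stages move to a GPU.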

[Figure: two Decoder → Effect → Blender chains feeding one Encoder]



OpenCL – Problem statement

Problem statement

OpenCL source buffer needs to be uploaded to the device before applying kernel operations

OpenCL result buffer needs to be downloaded to the host and passed to the next pipeline stage

Such Host-to-Device buffer movements could eliminate any HW acceleration gain

Effect Name    CPU Total (ms)   OCL_GPU Total (ms)   OCL_GPU Kernel (ms)   OCL_GPU Upload (ms)   OCL_GPU Download (ms)   OCL_Kernel Gain   OCL_GPU Gain
TV Simulator   14.11            15.46                2.88                  6.11                  6.47                    389.93 %          -8.73 %

Platform: Armorhead

Driver: 8.832

SDK: APP SDK 2.4 RC

OS: Win7 Ultimate 64bits
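The two gain figures are consistent with gain = (CPU time - OpenCL time) / OpenCL time. That formula is inferred from the numbers in the table and is not stated on the slide, and the helper name `gain_percent` is hypothetical. A minimal sketch that reproduces both values:

```python
def gain_percent(cpu_ms: float, ocl_ms: float) -> float:
    """Relative speed-up of the OpenCL path over the CPU path, in percent.

    Assumed formula (inferred from the table, not stated by the author):
    gain = (cpu - ocl) / ocl * 100
    """
    return (cpu_ms - ocl_ms) / ocl_ms * 100.0


# TV Simulator timings from the table, in milliseconds.
cpu_total = 14.11
kernel, upload, download = 2.88, 6.11, 6.47
ocl_total = kernel + upload + download  # kernel time plus both transfers

kernel_gain = gain_percent(cpu_total, kernel)    # kernel only
total_gain = gain_percent(cpu_total, ocl_total)  # kernel + upload + download

print(f"kernel only: {kernel_gain:.2f} %")
print(f"with copies: {total_gain:.2f} %")
```

The kernel alone is several times faster than the CPU, but upload and download together take more than four times as long as the kernel, which turns the overall gain negative.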



OpenCL – Resource inventory

Pipeline: n source videos (hence n blenders) with k effects and 1 encoder

[Figure: n chains of Decoder → Effects → Blender (Decoder 1 … Decoder n, Effect 1 … Effect k) feeding one Encoder]

Resource inventory

Typical case: All modules are executed by the CPU (Host) on system memory.

• Minimum # of frame copies = n + k + n (Decoders: n, Effects: k, Blenders: n, Encoder: 0)

[Figure: CPU-only data flow for n source videos, Decoder → Effects → Blender per source, then Encoder; all frames are on system memory and are numbered 1 to 4n for the case k == 2n]

Resource inventory

If GPGPU Filter is used…

Logic:

– Upload the input frame to the Device

– Upload the filter kernel code to the Device

– Execute the kernel code on the Device

– Read the output frame back to the Host
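The four steps can be modelled without a GPU. The sketch below is a pure-Python simulation, not OpenCL code: `SimulatedDevice` and its methods are hypothetical names, and the device only counts frames crossing the host/device boundary. With a real OpenCL runtime the steps would roughly correspond to calls such as `clEnqueueWriteBuffer`, `clBuildProgram`, `clEnqueueNDRangeKernel` and `clEnqueueReadBuffer`.

```python
class SimulatedDevice:
    """Stand-in for a GPGPU device that only counts host/device transfers."""

    def __init__(self):
        self.transfers = 0  # frames moved across the host/device boundary
        self.kernels = {}   # kernel name -> callable ("uploaded" kernel code)

    def upload_frame(self, frame):
        self.transfers += 1
        return list(frame)  # device-side copy of the frame

    def upload_kernel(self, name, func):
        self.kernels[name] = func

    def run(self, name, device_frame):
        return [self.kernels[name](px) for px in device_frame]

    def read_back(self, device_frame):
        self.transfers += 1
        return list(device_frame)  # host-side copy of the result


def apply_gpgpu_effect(device, frame):
    dev_in = device.upload_frame(frame)                  # 1. upload input frame
    device.upload_kernel("invert", lambda px: 255 - px)  # 2. upload kernel code
    dev_out = device.run("invert", dev_in)               # 3. execute the kernel
    return device.read_back(dev_out)                     # 4. read output back


device = SimulatedDevice()
result = apply_gpgpu_effect(device, [0, 128, 255])
print(result, device.transfers)
```

Each effect applied this way costs two boundary crossings, one upload and one read-back, regardless of how cheap the kernel itself is.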


OpenCL GPGPU Effect – Host/Device copy

Pipeline OCL Host/Device Frame Movement

[Figure: Input frame (Host) → Write → Input frame (Device) → Effect (OpenCL) → Output frame (Device) → Read → Output frame (Host)]

• 1 input frame uploaded to the Device

• 1 output frame read back to the Host

OpenCL GPGPU Blender – Host/Device copy

Pipeline OCL Host/Device Frame Movement

[Figure: Input1 and Input2 frames (Host) → Write → Input1 and Input2 frames (Device) → Blender (OpenCL) → Output frame (Device) → Read → Output frame (Host)]

• 2 input frames uploaded to the Device

• 1 output frame read back to the Host

Data Flow – Use OpenCL Effect/Blender

• If we simply put OpenCL GPGPU modules into the pipeline:

• Minimum # of frame copies = n + 3k + 4n (Decoders: n, Effects: 3k, Blenders: 4n, Encoder: 0)

• Frame copy overhead = 2k + 3n

[Figure: data flow for n source videos, each Decoder → Effect (OpenCL) → Effect (OpenCL) → Blender (OpenCL), then Encoder; every module uploads its inputs to the Device and reads its output back to the Host; frames in Host and Device are numbered for the case k == 2n]
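The overhead figure follows from subtracting the CPU-only count from the OpenCL count. A minimal sketch, assuming (as an interpretation of the diagrams, not a statement from the slides) that each OpenCL effect costs 3 frames (upload, kernel output, read-back) and each OpenCL blender costs 4 (two uploads, kernel output, read-back). The function names are hypothetical.

```python
def cpu_only_copies(n: int, k: int) -> int:
    """n decoded frames + k effect outputs + n blender outputs."""
    return n + k + n


def naive_opencl_copies(n: int, k: int) -> int:
    """n decoded frames + 3 frames per effect + 4 frames per blender."""
    return n + 3 * k + 4 * n


def copy_overhead(n: int, k: int) -> int:
    """Extra frame copies caused by dropping OpenCL modules in unchanged."""
    return naive_opencl_copies(n, k) - cpu_only_copies(n, k)


for n in (1, 2, 4):
    k = 2 * n  # two effects per source, as in the diagrams
    print(f"n={n} k={k} cpu={cpu_only_copies(n, k)} "
          f"opencl={naive_opencl_copies(n, k)} overhead={copy_overhead(n, k)}")
```

For two sources with two effects each, the count rises from 8 frames to 22, so the copies alone nearly triple the memory traffic.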

Reduced Host/Device copy

Data Flow OCL Host/Device Frame Movement

• If we keep frames in the Device as long as possible

[Figure: Effect (OpenCL) with its Input and Output frames both in the Device; no frame in the Host]

Reduced Host/Device copy

Data Flow OCL Host/Device Frame Movement

• Keep frames in the Device as long as possible

[Figure: Blender (OpenCL) with Input1, Input2 and Output frames all in the Device; no frame in the Host]

OpenCL Effects Data Flow – Reduced

• Keeping frames in the GPGPU Device to reduce frame copies

• Minimum # of frame copies = 2n + k + n + 1 (Source Videos: 2n, Effects: k, Blenders: n, Encoder: 1)

• Frame copy overhead = n + 1

[Figure: data flow for n source videos, each Source Video → Effect (OpenCL) → Effect (OpenCL) → Blender (OpenCL), then Encoder; frames stay in the Device between modules and are numbered 1 to 5n+1 for the case k == 2n]
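Substituting k = 2n into the three minimum frame-copy formulas from the slides gives a side-by-side comparison. The symbols C_CPU, C_naive and C_reduced are notation introduced here for illustration; only the formulas themselves come from the slides.

```latex
\begin{align*}
C_{\text{CPU}}     &= n + k + n      = 4n \\
C_{\text{naive}}   &= n + 3k + 4n    = 11n \\
C_{\text{reduced}} &= 2n + k + n + 1 = 5n + 1 \\[4pt]
C_{\text{naive}} - C_{\text{CPU}}   &= 2k + 3n = 7n \\
C_{\text{reduced}} - C_{\text{CPU}} &= n + 1
\end{align*}
```

Keeping frames in the Device therefore cuts the overhead from 7n to n + 1 extra frame copies.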

Resource inventory

Case 1-1: Decoder → OpenCL Effect

– SW Decoder + OpenCL Effect

1 frame copy (OCL Host → OCL Device)

– HW Decoder + OpenCL Effect (no share)

2 frame copies (DxVA Device → OCL Host (System) → OCL Device)

– HW Decoder + OpenCL Effect (DxVA/OpenDecode share)

0 frame copies

Case 1-2: Decoder → SW Effect

– SW Decoder + SW Effect

0 frame copies

– HW Decoder + SW Effect

1 frame copy (DxVA Device → System)

Resource inventory

Case 2-1: OpenCL Effect → Blender

– 2 OpenCL Effects + SW Blender

2 frame copies (2 OCL Device → 2 OCL Host)

– 2 OpenCL Effects + OpenCL Blender (no share)

4 frame copies (2 OCL Device → 2 OCL Host (System) → 2 OCL Device)

– 2 OpenCL Effects + OpenCL Blender (OpenCL shared)

0 frame copies

Case 2-2: SW Effect → Blender

– 2 SW Effects + SW Blender

0 frame copies

– 2 SW Effects + OpenCL Blender

2 frame copies (2 OCL Host → 2 OCL Device)

Resource inventory

Case 3-1: OpenCL Blender → Encoder

– OpenCL Blender + SW Encoder

1 frame copy (OCL Device → OCL Host)

– OpenCL Blender + HW Encoder (no share)

2 frame copies (OCL Device → OCL Host (System) → Enc Device)

– OpenCL Blender + HW Encoder (OpenEncode shared)

0 frame copies

Case 3-2: SW Blender → Encoder

– SW Blender + SW Encoder

0 frame copies

– SW Blender + HW Encoder

1 frame copy (System → Enc Device)
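The case lists above amount to three lookup tables, one per pipeline boundary. The sketch below transcribes them. The key layout `(producer, consumer, shared)`, the labels `"SW"`, `"HW"`, `"OCL"` and the function `boundary_copies` are illustrative choices, not anything defined in the slides.

```python
# Frame copies at each pipeline boundary, transcribed from Cases 1-1 to 3-2.
# Key: (producer kind, consumer kind, buffers shared between the two).
DECODER_TO_EFFECT = {  # per decoder/effect pair
    ("SW", "OCL", False): 1,
    ("HW", "OCL", False): 2,
    ("HW", "OCL", True): 0,   # DxVA/OpenDecode share
    ("SW", "SW", False): 0,
    ("HW", "SW", False): 1,
}
EFFECTS_TO_BLENDER = {  # per blender fed by two effects
    ("OCL", "SW", False): 2,
    ("OCL", "OCL", False): 4,
    ("OCL", "OCL", True): 0,  # OpenCL shared
    ("SW", "SW", False): 0,
    ("SW", "OCL", False): 2,
}
BLENDER_TO_ENCODER = {  # per encoder
    ("OCL", "SW", False): 1,
    ("OCL", "HW", False): 2,
    ("OCL", "HW", True): 0,   # OpenEncode shared
    ("SW", "SW", False): 0,
    ("SW", "HW", False): 1,
}


def boundary_copies(decoder, effect, blender, encoder,
                    share_dec=False, share_blend=False, share_enc=False):
    """Return the copies at the three boundaries as a tuple."""
    return (
        DECODER_TO_EFFECT[(decoder, effect, share_dec)],
        EFFECTS_TO_BLENDER[(effect, blender, share_blend)],
        BLENDER_TO_ENCODER[(blender, encoder, share_enc)],
    )


all_sw = boundary_copies("SW", "SW", "SW", "SW")
ocl_shared = boundary_copies("SW", "OCL", "OCL", "SW", share_blend=True)
hw_unshared = boundary_copies("HW", "OCL", "OCL", "HW")
print(all_sw, ocl_shared, hw_unshared)
```

Combinations that the slides do not list raise a `KeyError`, which keeps the table from silently inventing a cost.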

Resource inventory

Memory is a limited resource within the GPU

e.g., 1024 MB in the HD 6800 series

Lots of memory activity in the video editing pipeline

Memory management is critical

Host-to-Device frame-copy must occur when the pipeline uses a mix of CPU and GPU filters

Can be improved by “zero-copy”, “fast copy” techniques

Frame-copy can be avoided if whole pipeline is in the same Device

Resource inventory

Various HW acceleration techniques are adopted

They handle resources differently

– Buffers duplicated

Module            HW Acceleration Tech
Decoder           DxVA
Effects
  PiP             OpenCL
  DSP             OpenCL/IntelQuickSync/APP/CUDA
  Particle        D3D11
  3D Template     D3D9
  Title           OpenCL
Blender           OpenCL
Encoder           MSDK/AVT/NVPVENC (Note 1)

Note 1: Use HW Encoder filter from IHV

Resource inventory

Multi-thread rendering

– Can leverage available computational resources

– Further increases overall rendering performance

Dedicated Memory Management is necessary

– To improve performance in CPU+GPU mixed pipeline cases

– To avoid memory movement in pure GPGPU pipeline cases

– OpenCL/D3D9/D3D11

– To use limited GPGPU memory in a more efficient way



OpenCL – Memory Management

Memory management

2 pipeline scenarios

– CPU-only pipeline – use only CPU resources

– HW accelerated pipeline – use GPU resources if possible

To manage resources for the different pipeline scenarios

– To share memory objects

– To reduce memory allocation/destruction

– To increase memory usage efficiency

2-layer design

– Manager + Object

Memory Manager

Memory object manager

– Allocate/Free/Monitor memory objects from both system and GPGPU memory

– Sync up the GPGPU device used in the pipeline

– Handle out-of-memory situations

– Keep track of memory object status

Total object amount

Total used/locked memory size

Used by editing kernel to deliver/manage frame buffers in pipeline

Memory Object

Abstracted frame buffer object

– To carry System/OpenCL/DxVA/D3D9/D3D11 frames

Used by all modules within pipeline

Performs host/device migration

– Notifies out-of-memory situations

Centrally managed by the Memory Manager
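The two-layer design can be sketched in a few lines. This is an illustrative model only, not the product's implementation: every class, method and field name is hypothetical, memory is tracked as byte counts rather than real buffers, and the eviction policy (drop released Device objects first) is an assumption.

```python
class OutOfMemory(Exception):
    """Raised when the Device cannot hold another frame."""


class MemoryObject:
    """Abstract frame buffer that lives either on the host or on the device."""

    def __init__(self, manager, size, location):
        self.manager, self.size, self.location = manager, size, location
        self.locked = False

    def migrate(self, target):
        """Move the frame to the other side; the manager does the accounting."""
        if target != self.location:
            self.manager._move(self, target)


class MemoryManager:
    def __init__(self, device_capacity):
        self.device_capacity = device_capacity
        self.device_used = 0
        self.objects = []    # every live object
        self.free_pool = []  # released objects kept for reuse
        self.copies = 0      # host/device migrations performed

    def allocate(self, size, location):
        for obj in self.free_pool:  # reuse instead of allocate/destroy
            if obj.size == size and obj.location == location:
                self.free_pool.remove(obj)
                obj.locked = True
                return obj
        if location == "device":
            self._reserve_device(size)
        obj = MemoryObject(self, size, location)
        obj.locked = True
        self.objects.append(obj)
        return obj

    def release(self, obj):
        obj.locked = False
        self.free_pool.append(obj)

    def _reserve_device(self, size):
        # Out-of-memory handling: evict released device objects, else fail.
        while self.device_used + size > self.device_capacity:
            victim = next((o for o in self.free_pool
                           if o.location == "device"), None)
            if victim is None:
                raise OutOfMemory(f"need {size} bytes on device")
            self.free_pool.remove(victim)
            self.objects.remove(victim)
            self.device_used -= victim.size
        self.device_used += size

    def _move(self, obj, target):
        if target == "device":
            self._reserve_device(obj.size)
        else:
            self.device_used -= obj.size
        obj.location = target
        self.copies += 1

    def stats(self):
        return {"objects": len(self.objects),
                "locked_bytes": sum(o.size for o in self.objects if o.locked),
                "device_used": self.device_used}


FRAME = 1920 * 1080 * 4  # one Full HD frame at 4 bytes per pixel
manager = MemoryManager(device_capacity=2 * FRAME)
a = manager.allocate(FRAME, "device")
b = manager.allocate(FRAME, "device")
manager.release(a)
c = manager.allocate(FRAME, "device")  # served from the pool
d = manager.allocate(FRAME, "host")
try:
    d.migrate("device")                # device is full and nothing is free
    out_of_memory = False
except OutOfMemory:
    out_of_memory = True
print(c is a, out_of_memory, manager.stats())
```

Routing every allocation and migration through one manager is what lets it report the total object count and used size, and decide what to evict when the Device fills up.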



OpenCL – Coding examples

Examples

[Figure: Memory Manager supplying Memory Objects along the pipeline Decoder → Effect 1 → Effect 2 → Blender → Encoder. Legend: a Memory Object is either a buffer passed through the pipeline or a temporary buffer]



OpenCL - Demo

Demo

Show whole editing pipeline

– PDR10 (build 5/30)

Demo cases

– No HW acceleration

SW Dec + SW Effect + SW Blender + SW Enc

– Enable all OCL acceleration w/ share

SW Dec + OCL Effect + OCL Blender + SW Enc

– Enable all HW acceleration w/o share

HW Dec + OCL Effect + OCL Blender + HW Enc

– Enable all HW acceleration w/ share

Not ready yet

Performance

Performance numbers of 46 OCL Effects

Performance numbers of Demo cases

Test Projects: PackedProject\Project1.pds - PDR 10 2x2 TVWall with 3 OpenCL effects

Source: Sample_H264.m2ts - Full HD H264 clip, 30 sec

Destination: MPEG2 - Profile "(HD) MPEG-2, 1080i" PDR10 build

Sabine Platform: 1.8 GHz, Radeon HD 6620G, 4GB system memory

Configuration                                              Total (s)   Gain (%)
SW (Effect + Blender) + SW (Dec + Enc)                     1045        -
OCL (Effect + Blender) + SW (Dec + Enc) (with overhead)    485         115.46
OCL (Effect + Blender) + SW (Dec + Enc) (w/o overhead)     405         158.02
OCL (Effect + Blender) + HW (Dec + Enc) (w/o overhead)     455         129.67

34 | Optimizing Video Editing Software with OpenCL | June 2011

Disclaimer & Attribution

The information presented in this document is for informational purposes only and may contain technical inaccuracies, omissions and typographical errors.

The information contained herein is subject to change and may be rendered inaccurate for many reasons, including but not limited to product and roadmap changes, component and motherboard version changes, new model and/or product releases, product differences between differing manufacturers, software changes, BIOS flashes, firmware upgrades, or the like. There is no obligation to update or otherwise correct or revise this information. However, we reserve the right to revise this information and to make changes from time to time to the content hereof without obligation to notify any person of such revisions or changes.

NO REPRESENTATIONS OR WARRANTIES ARE MADE WITH RESPECT TO THE CONTENTS HEREOF AND NO RESPONSIBILITY IS ASSUMED FOR ANY INACCURACIES, ERRORS OR OMISSIONS THAT MAY APPEAR IN THIS INFORMATION.

ALL IMPLIED WARRANTIES OF MERCHANTABILITY OR FITNESS FOR ANY PARTICULAR PURPOSE ARE EXPRESSLY DISCLAIMED. IN NO EVENT WILL ANY LIABILITY TO ANY PERSON BE INCURRED FOR ANY DIRECT, INDIRECT, SPECIAL OR OTHER CONSEQUENTIAL DAMAGES ARISING FROM THE USE OF ANY INFORMATION CONTAINED HEREIN, EVEN IF EXPRESSLY ADVISED OF THE POSSIBILITY OF SUCH DAMAGES.

AMD, the AMD arrow logo, and combinations thereof are trademarks of Advanced Micro Devices, Inc. All other names used in this presentation are for informational purposes only and may be trademarks of their respective owners.

The contents of this presentation were provided by individual(s) and/or company listed on the title page. The information and opinions presented in this presentation may not represent AMD’s positions, strategies or opinions. Unless explicitly stated, AMD is not responsible for the content herein and no endorsements are implied.
