Speeding up VirtualDub

Presented by: Shmuel Habari
Advisor: Zvika Guz
Software Systems Lab, Technion


Page 1: Speeding up VirtualDub

Speeding up VirtualDub

Presented by: Shmuel Habari

Advisor: Zvika Guz

Software Systems Lab

Technion

Page 2: Speeding up VirtualDub

What is VirtualDub?

VirtualDub is an incredibly popular open source video processing tool.

It is capable of merging videos, cutting scenes, adding subtitles and applying a wide variety of filters. It also supports third-party video compression (e.g. DivX).

VirtualDub is constantly being refined, expanded and adapted by its original creator, Avery Lee.

http://www.virtualdub.org/

Page 3: Speeding up VirtualDub

VirtualDub’s Benchmark

The benchmark chosen was the Resize filter, an often-used, CPU-heavy filter.

The input was a vibrantly colored animation video, so that every flaw, if any, would be visible.

The result video was made without audio filtering and with no third-party compression utilities.

Page 4: Speeding up VirtualDub

VTune Performance Analyzer

Analyzing the benchmark using VTune:

First step - VDFastMemcpyPartialMMX2

Page 5: Speeding up VirtualDub

Fast Memory Copy

This function copies large quantities of data from a source memory address to a destination memory address.

@blastloop:
    ; load 64 bytes from the source into the eight MMX registers
    movq mm0, [edx]
    movq mm1, [edx+8]
    movq mm2, [edx+16]
    movq mm3, [edx+24]
    movq mm4, [edx+32]
    movq mm5, [edx+40]
    movq mm6, [edx+48]
    movq mm7, [edx+56]
    ; write them to the destination with non-temporal stores
    movntq [ebx], mm0
    movntq [ebx+8], mm1
    movntq [ebx+16], mm2
    movntq [ebx+24], mm3
    movntq [ebx+32], mm4
    movntq [ebx+40], mm5
    movntq [ebx+48], mm6
    movntq [ebx+56], mm7

Each iteration loads 64 bytes of data into the registers and then stores them to the specified destination address.

The loop then moves on to the next 64 bytes and continues until all the data has been copied.

From observations, the function was called to copy 2048 bytes every time.
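
For readers who do not read MMX assembly, the structure of the loop can be sketched in portable C++. This is only an analogue under stated assumptions: the function name copy_in_64_byte_blocks is hypothetical, the eight temporaries stand in for mm0..mm7, and ordinary stores replace the non-temporal movntq stores, so it shows the shape of the loop and not its cache behaviour.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>

// Hypothetical portable analogue of the blast loop (not VirtualDub's code):
// copies `bytes` bytes, 64 at a time, staging each block in eight
// 64-bit temporaries.
void copy_in_64_byte_blocks(void* dst, const void* src, std::size_t bytes) {
    assert(dst != nullptr && src != nullptr);
    auto* d = static_cast<unsigned char*>(dst);
    const auto* s = static_cast<const unsigned char*>(src);

    while (bytes >= 64) {
        std::uint64_t regs[8];        // stands in for mm0..mm7
        std::memcpy(regs, s, 64);     // corresponds to the eight movq loads
        std::memcpy(d, regs, 64);     // corresponds to the eight stores (ordinary, not non-temporal)
        s += 64;
        d += 64;
        bytes -= 64;
    }
    std::memcpy(d, s, bytes);         // tail shorter than one 64-byte block
}
```

With the 2048-byte calls observed in the benchmark, the loop body in this sketch would run 32 times per call.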

Page 6: Speeding up VirtualDub

Clockticks Samples

Again using VTune, it was seen that, predictably, most of the clockticks were spent reading from memory.

Page 7: Speeding up VirtualDub

Dummy Loop

Seeing that, the solution was to fill the cache before beginning to copy the data.

I added a dummy loop, a.k.a. @mainloop, that reads 1024 bytes ahead before running blastloop.

When the prefetched data runs out, and the end of the source data has not been reached, another 1024 bytes are read.

Using the dummy loop, a speedup of 4.21% was gained.

@mainloop:
    ; dummy reads: touch one address every 128 bytes across the next 1024 bytes
    mov edi, [edx+896]
    mov edi, [edx+768]
    mov edi, [edx+640]
    mov edi, [edx+512]
    mov edi, [edx+384]
    mov edi, [edx+256]
    mov edi, [edx+128]
    mov edi, [edx]
    ; presumably the blastloop counter: 16 iterations x 64 bytes = 1024 bytes
    mov esi, 16
@blastloop:
    movq mm0, [edx]
    movq mm1, [edx+8]
    movq mm2, [edx+16]
    movq mm3, [edx+24]
    movq mm4, [edx+32]
    movq mm5, [edx+40]
    movq mm6, [edx+48]
    movq mm7, [edx+56]
    movntq [ebx], mm0
    movntq [ebx+8], mm1
    movntq [ebx+16], mm2
    movntq [ebx+24], mm3
    movntq [ebx+32], mm4
    movntq [ebx+40], mm5
    movntq [ebx+48], mm6
    movntq [ebx+56], mm7
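
The same idea can be sketched in portable C++: touch the next 1024 bytes first, then copy them in 64-byte blocks. This is a minimal sketch, not the modified VirtualDub routine. The function name copy_with_dummy_reads and the constants are hypothetical, the 16-iteration inner loop is an inference from 16 x 64 = 1024, and whether the dummy reads help at all depends on the CPU and its hardware prefetcher.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>

constexpr std::size_t kChunk = 1024;   // bytes touched ahead per round, as on the slide
constexpr std::size_t kStride = 128;   // distance between dummy reads, as on the slide
constexpr std::size_t kBlock = 64;     // bytes copied per inner-loop iteration

// Hypothetical sketch: dummy reads pull a chunk into the cache, then the
// chunk is copied block by block.
void copy_with_dummy_reads(unsigned char* dst, const unsigned char* src, std::size_t bytes) {
    assert(dst != nullptr && src != nullptr);
    while (bytes >= kChunk) {
        // Dummy loop: one read every 128 bytes, highest offset first.
        // Reading through a volatile pointer keeps the compiler from
        // discarding the otherwise unused loads.
        const volatile unsigned char* probe = src;
        for (std::size_t off = kChunk; off >= kStride; off -= kStride) {
            unsigned char sink = probe[off - kStride];
            (void)sink;
        }
        // Copy loop: 16 iterations of 64 bytes cover the touched chunk.
        for (std::size_t i = 0; i < kChunk / kBlock; ++i) {
            std::memcpy(dst, src, kBlock);
            src += kBlock;
            dst += kBlock;
        }
        bytes -= kChunk;
    }
    std::memcpy(dst, src, bytes);      // remainder shorter than one chunk
}
```

Unlike the assembly excerpt, this sketch also copies a remainder shorter than 1024 bytes, so it can be called with any length.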

Page 8: Speeding up VirtualDub

Threads

As stated before, the original VirtualDub is a project in development.

The original creator had access to code-optimizing programs, VTune included, allowing him to improve the code himself and remove many pitfalls and errors common to non-optimized code.

Also, VirtualDub proved to be multithreaded, to a point:

Page 9: Speeding up VirtualDub

Threads

The 1st thread is the processing thread. The 2nd thread, however, is the audio thread; since we specifically disabled the audio, it showed almost no activity:

Therefore, theoretically, multithreading the processing thread was still possible.

Page 10: Speeding up VirtualDub

Threads

At first I had high hopes for multithreading VirtualDub. Studying the code, I came to the conclusion that it processed the video frame by frame, and scanned each frame line by line.

The two approaches I decided to try were:
– Processing two frames in parallel
– Cutting a frame in half, and processing the top and bottom in parallel.
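
The second approach can be sketched as follows, assuming a filter whose rows are independent and which keeps no shared state. Everything here is hypothetical and not taken from VirtualDub: the Frame type, the placeholder process_rows filter and process_frame_in_halves are names made up for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <functional>
#include <thread>
#include <vector>

// Hypothetical frame type: 8-bit grayscale pixels stored row by row.
struct Frame {
    std::size_t width;
    std::size_t height;
    std::vector<std::uint8_t> pixels;   // width * height entries
};

// Placeholder per-row filter (inverts brightness); it stands in for real work.
// Processes rows in the half-open range [first_row, last_row).
void process_rows(Frame& frame, std::size_t first_row, std::size_t last_row) {
    assert(first_row <= last_row && last_row <= frame.height);
    for (std::size_t y = first_row; y < last_row; ++y) {
        for (std::size_t x = 0; x < frame.width; ++x) {
            std::uint8_t& p = frame.pixels[y * frame.width + x];
            p = static_cast<std::uint8_t>(255 - p);
        }
    }
}

// Top half runs on a second thread, bottom half on the calling thread.
// The two halves write disjoint rows, so no locking is needed.
void process_frame_in_halves(Frame& frame) {
    const std::size_t middle = frame.height / 2;
    std::thread top(process_rows, std::ref(frame), std::size_t{0}, middle);
    process_rows(frame, middle, frame.height);
    top.join();
}
```

The sketch only works because process_rows touches nothing but its own rows; a filter that keeps intermediate state in shared members would need that state duplicated per thread.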

Page 12: Speeding up VirtualDub

Threads

However, all my attempts at multithreading VirtualDub’s processing failed.

At first I believed I had run into global variables being accessed, but I discovered them to be private variables of a much higher-level class.

Attempts to duplicate said class in order to split the workload failed.

Page 13: Speeding up VirtualDub

Threads

Lastly, I turned to OpenMP, hoping to use its innate capabilities to duplicate the variables into each thread.
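
In OpenMP, this kind of per-thread duplication is usually requested with the private or firstprivate clauses. The sketch below is an illustration under assumptions, not the code that was attempted: the slides do not say which clauses were tried, and Scratch and double_rows are hypothetical names.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical per-worker scratch state, standing in for the class members
// that got in the way of splitting the work.
struct Scratch {
    std::vector<int> row_buffer;
};

// Doubles every value of a row-major image, one row at a time, going
// through a scratch buffer that each thread owns a copy of.
std::vector<int> double_rows(const std::vector<int>& input, std::size_t width) {
    assert(width > 0 && input.size() % width == 0);
    std::vector<int> output(input.size());
    Scratch scratch{std::vector<int>(width)};
    const long rows = static_cast<long>(input.size() / width);

    // firstprivate gives every thread its own copy-constructed `scratch`.
    // If OpenMP is not enabled, the pragma is ignored and the loop runs serially.
    #pragma omp parallel for firstprivate(scratch)
    for (long y = 0; y < rows; ++y) {
        const std::size_t base = static_cast<std::size_t>(y) * width;
        for (std::size_t x = 0; x < width; ++x)
            scratch.row_buffer[x] = 2 * input[base + x];
        for (std::size_t x = 0; x < width; ++x)
            output[base + x] = scratch.row_buffer[x];
    }
    return output;
}
```

The clause copies a variable that is visible at the loop; state buried as private members of a higher-level class has to be exposed as such a variable before OpenMP can duplicate it.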

VirtualDub’s complexity made it impossible for me to convert it to the Intel Compiler: every change resulted in a staggering number of errors, each requiring many small code changes, and still more that couldn’t be solved.

Limiting the use of the Intel Compiler to only the necessary projects did not show an improvement.

Page 14: Speeding up VirtualDub

Conclusion

A lot of time and effort was put into this project.

To my dismay, it is not evident in the percentage of speedup, but rather in error messages and various versions of code, each a bit closer to a working version, but never quite there.

The bottom line is that, despite the promise VirtualDub initially showed, too much had already been done in it, leaving it already optimized, and too monstrously big and intricate for my optimization.