ARM Neon Optimization for image interleaving and deinterleaving

ARM Neon Optimization

InterLeaving/De-Interleaving

Pi19404

March 10, 2014

Contents

ARM Neon Optimization InterLeaving/De-Interleaving
0.1 Introduction
0.2 ARM Neon
0.3 Deinterleaving and Interleaving channels of Image
0.4 De-InterLeaving
0.5 NDK BUILD
0.6 InterLeaving
0.7 Code
References

ARM Neon Optimization InterLeaving/De-Interleaving

0.1 Introduction

In this article we look at basic interleaving and de-interleaving operations using ARM Neon optimization and evaluate the performance improvement on an Android-based mobile device in comparison with standard OpenCV code.

0.2 ARM Neon

ARM's NEON technology is a 64/128-bit hybrid SIMD architecture designed to accelerate the performance of a piece of code.

SIMD technology allows multiple data elements to be processed with a single instruction, saving time for other computations. For images, this means a set of pixels is processed at a time.

One way to achieve this is to write assembly code, which has a steep learning curve and requires knowledge of the processor architecture, instruction set, etc.

Instead of using low-level instructions directly, one can use special functions called intrinsics, which can be treated as regular functions but operate on multiple input data elements simultaneously.
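As a rough illustration of what an intrinsic looks like in C, the sketch below adds two blocks of 16 bytes with saturation. The function name add16_u8 is hypothetical and not part of the article's code. The scalar branch is an addition so that the sketch also builds on machines without NEON; on ARM the three intrinsics process all 16 bytes at once.

```c
#include <assert.h>
#include <stdint.h>
#if defined(__ARM_NEON) || defined(__ARM_NEON__)
#include <arm_neon.h>
#endif

/* Hypothetical helper (not from the article): saturating add of two
   16-byte blocks, out[i] = min(a[i] + b[i], 255). */
void add16_u8(const uint8_t *a, const uint8_t *b, uint8_t *out)
{
    assert(a && b && out);
#if defined(__ARM_NEON) || defined(__ARM_NEON__)
    /* One load per input, one saturating add, one store: 16 bytes each. */
    uint8x16_t va = vld1q_u8(a);
    uint8x16_t vb = vld1q_u8(b);
    vst1q_u8(out, vqaddq_u8(va, vb));
#else
    /* Scalar fallback with the same result, one byte at a time. */
    for (int i = 0; i < 16; i++) {
        unsigned sum = (unsigned)a[i] + (unsigned)b[i];
        out[i] = (uint8_t)(sum > 255u ? 255u : sum);
    }
#endif
}
```

The intrinsic calls read like ordinary function calls, but the compiler is expected to map each of them to a NEON instruction.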

0.3 Deinterleaving and Interleaving channels of Image

NEON structure loads read data from memory into 64-bit NEON registers, with optional deinterleaving. Stores work similarly, reinterleaving data from registers before writing it to memory.



A set of NEON intrinsics is provided for deinterleaving data.

They simultaneously pull data from memory and separate it into different registers. This is called deinterleaving.

The element size specified in the instruction determines the granularity at which the loaded data is deinterleaved.

The OpenCV functions split and merge are ported to ARM NEON, and their performance is compared with that of the OpenCV code.

0.4 De-InterLeaving

De-interleaving separates adjacent elements in memory into separate registers.

The VLD3 instruction de-interleaves the BGR channels of the image into 3 different registers. The BGR values of each pixel are stored in adjacent memory locations.

The registers produced by the VLD3 instruction are then stored to the destination memory locations.

vld3_u8

/* This intrinsic loads 24 consecutive bytes from memory and
   de-interleaves them, so that every third byte goes to the same register.
   Each 64-bit register therefore receives 8 elements, and there are
   3 such registers as a result of the de-interleaving process.
   It is used when the pointer refers to unsigned 8-bit integers;
   vld3_s8 is the counterpart for signed 8-bit integers. */

vst1_u8

/* This intrinsic stores the contents of a 64-bit register to the
   desired memory location. The 8 elements (8x8 = 64 bits) that make up
   the register are written to memory at once. */

void neon_interlace(uint8_t * __restrict d3,uint8_t * __restrict r0,uint8_t * __restrict r1,uint8_t * __restrict r2,int width,int height)

{



int i;

uint8_t *s3 = (uint8_t *)d3;

for(i=0;i
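The listing above breaks off inside the loop. As a minimal sketch of how such a de-interleaving loop could look, based only on the vld3_u8 and vst1_u8 descriptions above, consider the following. The name split_bgr_sketch is hypothetical, the pixel count is assumed to be a multiple of 8, and the scalar branch is an addition so that the sketch also runs on machines without NEON. It is not the author's original function.

```c
#include <assert.h>
#include <stdint.h>
#if defined(__ARM_NEON) || defined(__ARM_NEON__)
#include <arm_neon.h>
#endif

/* Hypothetical sketch: split packed BGR data into three planes.
   Assumption: width * height is a multiple of 8. */
void split_bgr_sketch(const uint8_t *src, uint8_t *b, uint8_t *g,
                      uint8_t *r, int width, int height)
{
    int blocks = (width * height) / 8;
    assert((width * height) % 8 == 0);

    for (int i = 0; i < blocks; i++) {
#if defined(__ARM_NEON) || defined(__ARM_NEON__)
        /* Load 24 bytes and de-interleave them into 3 registers. */
        uint8x8x3_t v = vld3_u8(src);
        vst1_u8(b, v.val[0]);
        vst1_u8(g, v.val[1]);
        vst1_u8(r, v.val[2]);
#else
        /* Scalar fallback: same result, one pixel at a time. */
        for (int k = 0; k < 8; k++) {
            b[k] = src[3 * k];
            g[k] = src[3 * k + 1];
            r[k] = src[3 * k + 2];
        }
#endif
        src += 24;  /* 8 pixels x 3 channels */
        b += 8;
        g += 8;
        r += 8;
    }
}
```

Each iteration consumes 8 pixels, so a 640x480 image would need 38400 iterations instead of 307200 per-pixel steps.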


0.5 NDK BUILD

Files created under the /data/local/tmp directory can be given execute permission. I could not find any other subdirectory in the file system that allowed binaries to be executed or to be given execute permission.

The script a.ksh called below exports basic variables and then executes the binary.

export LD_LIBRARY_PATH=.:${LD_LIBRARY_PATH}

cd /data/local/tmp/NEON_TEST

./helloneon

adb shell /data/local/tmp/NEON_TEST/a.ksh

The performance of the NEON intrinsic function is compared with the standard OpenCV split function:

OPENCV : 15 ms
NEON : 11 ms
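The article does not say how these timings were taken. One simple way to obtain comparable figures, shown here purely as an assumption, is to average the CPU time of many runs with the standard clock() function. The names average_ms and count_call are hypothetical.

```c
#include <assert.h>
#include <time.h>

/* Hypothetical helper: average CPU time in milliseconds of fn(ctx)
   over `repeats` runs, measured with the standard clock() function. */
double average_ms(void (*fn)(void *), void *ctx, int repeats)
{
    assert(fn && repeats > 0);
    clock_t start = clock();
    for (int i = 0; i < repeats; i++)
        fn(ctx);
    clock_t end = clock();
    return 1000.0 * (double)(end - start) / CLOCKS_PER_SEC / repeats;
}

/* Example callback that only counts how often it was called; a real
   benchmark would call the split function under test here. */
void count_call(void *ctx)
{
    ++*(int *)ctx;
}
```

Averaging over many repetitions matters here because a single run of a few milliseconds is close to the resolution of clock() on many systems.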

No very significant improvement is seen from the NEON optimization.

According to many references, and as can be seen from the disassembly output of the compiler, the main reason is that the ARM compiler is not able to generate optimized assembly code.

The compiler generates heavily unoptimized code that takes a larger number of cycles than required.

The compilation commands were taken from the verbose ndk-build output, and the -c flag was replaced with -S to generate the assembly code.

/opt/android-ndk-r7/toolchains/arm-linux-androideabi-4.4.3/prebuilt/linux-x86/bin/arm-linux-androideabi-gcc \
-MMD -MP -MF /home/pi19404/ARM//obj/local/armeabi-v7a/objs/helloneon/helloneon-intrinsics.o.d \
-fpic -ffunction-sections -funwind-tables -fstack-protector \
-D__ARM_ARCH_5__ -D__ARM_ARCH_5T__ -D__ARM_ARCH_5E__ -D__ARM_ARCH_5TE__ \
-Wno-psabi -march=armv7-a -mfloat-abi=softfp -mfpu=vfp -mthumb -Os \
-fomit-frame-pointer -fno-strict-aliasing -finline-limit=64 -mfpu=neon \
-I/usr/local/include -I/media/UBUNTU/repository/OpenVisionLibrary/OpenVision/ \
-I/opt/android-ndk-r7/sources//android/cpufeatures \
-I/opt/android-ndk-r7/sources/cxx-stl/gnu-libstdc++/include \
-I/opt/android-ndk-r7/sources/cxx-stl/gnu-libstdc++/libs/armeabi-v7a/include \
-I/home/pi19404/ARM//jni -DANDROID -DHAVE_NEON -fPIC -DANDROID \
-I/usr/local/include/opencv -I/usr/local/include -I/OpenVision \
-I/media/UBUNTU/repository/OpenVisionLibrary/OpenVision -fPIC \
-DHAVE_NEON=1 -ftree-vectorize -mfpu=neon -O3 -mfloat-abi=softfp \
-ffast-math -Wa,--noexecstack -O3 -DNDEBUG \
-I/opt/android-ndk-r7/platforms/android-8/arch-arm/usr/include \
/home/pi19404/ARM//jni/helloneon-intrinsics.c -S

The above command generates the file helloneon-intrinsics.s in the present directory.

Many unnecessary instructions can be observed in the assembly code.

The assembly code corresponding to the functions was then optimized and compiled.

For compilation, the verbose build output of the ndk-build process was again modified so that the helloneon-intrinsics.o object file is compiled from helloneon-intrinsics.s, and the helloneon binary is compiled and linked from all source files.

/opt/android-ndk-r7/toolchains/arm-linux-androideabi-4.4.3/prebuilt/linux-x86/bin/arm-linux-androideabi-gcc \
-MMD -MP -MF \
-fpic -ffunction-sections -funwind-tables -fstack-protector \
-D__ARM_ARCH_5__ -D__ARM_ARCH_5T__ -D__ARM_ARCH_5E__ -D__ARM_ARCH_5TE__ \
-Wno-psabi -march=armv7-a -mfloat-abi=softfp -mfpu=vfp -mthumb -Os -fomit-frame-pointer \
-fno-strict-aliasing -finline-limit=64 -mfpu=neon -I/usr/local/include \
-I/media/UBUNTU/repository/OpenVisionLibrary/OpenVision/ \
-I/opt/android-ndk-r7/sources//android/cpufeatures \
-I/opt/android-ndk-r7/sources/cxx-stl/gnu-libstdc++/include \
-I/opt/android-ndk-r7/sources/cxx-stl/gnu-libstdc++/libs/armeabi-v7a/include \
-I/home/pi19404/ARM//jni -DANDROID -DHAVE_NEON -fPIC -DANDROID \
-I/usr/local/include/opencv -I/usr/local/include -I/OpenVision \
-I/media/UBUNTU/repository/OpenVisionLibrary/OpenVision -fPIC \
-DHAVE_NEON=1 -ftree-vectorize -mfpu=neon -O3 -mfloat-abi=softfp \
-ffast-math -Wa,--noexecstack -O3 -DNDEBUG \
-I/opt/android-ndk-r7/platforms/android-8/arch-arm/usr/include \
-c /home/pi19404/ARM/jni/helloneon-intrinsics.s \
-o /home/pi19404/ARM/obj/local/armeabi-v7a/objs/helloneon/helloneon-intrinsics.o \
--sysroot=/opt/android-ndk-r7/platforms/android-14/arch-arm/

/opt/android-ndk-r7/toolchains/arm-linux-androideabi-4.4.3/prebuilt/linux-x86/bin/arm-linux-androideabi-g++ -Wl,--gc-sections \
-Wl,-z,nocopyreloc --sysroot=/opt/android-ndk-r7/platforms/android-8/arch-arm \
/home/pi19404/ARM//obj/local/armeabi-v7a/objs/helloneon/neon.o \
/home/pi19404/ARM//obj/local/armeabi-v7a/objs/helloneon/helloneon-intrinsics.o \
/home/pi19404/ARM//obj/local/armeabi-v7a/libcpufeatures.a \
/home/pi19404/ARM//obj/local/armeabi-v7a/libgnustl_static.a \
/opt/android-ndk-r7/toolchains/arm-linux-androideabi-4.4.3/prebuilt/linux-x86/bin/../lib/gcc/arm-linux-androideabi/4.4.3/libgcc.a \
-Wl,--fix-cortex-a8 -Wl,--no-undefined -Wl,-z,noexecstack \
-L/opt/android-ndk-r7/platforms/android-8/arch-arm/usr/lib -fPIC -llog \
-ldl -lm -lz -lm -lc -lgcc -Wl,-rpath,'libs/armeabi-v7a' \
-L/home/pi19404/ARM//jni/../libs/armeabi -llog -Llibs/armebi -Llibs/armeabi-v7a \
-lopencv_core -lopencv_imgproc -lopencv_highgui -lopencv_flann \
-lc -lm -o /home/pi19404/ARM//obj/local/armeabi-v7a/helloneon

cp /home/pi19404/ARM//obj/local/armeabi-v7a/helloneon libs/armeabi-v7a

The results of the optimization process are as follows:

OPENCV : 15 ms
NEON : 8 ms
NEON OPTIMIZED : 6 ms

Thus a speedup factor of 1.4 and a total performance improvement of 2.5x were observed after optimizing the assembly code.

This still does not motivate the use of assembly-level coding, since the development effort may outweigh the optimization benefits.

push {r4, r5, r6, r7, r8, r9, sl, fp} @ save registers on the stack
.save {r4, r5, r6, r7, r8, r9, sl, fp}
.LCFI0:
.pad #64
sub sp, sp, #64 @ reserve 64 bytes of stack
.LCFI1:
mov r7, r0 @ r7 = pointer to the interleaved source data
ldr r4, [sp, #96] @ load function argument into r4, offset 64+8*4
ldr r5, [sp, #100] @ load function argument into r5, offset 64+9*4
mul r6, r4, r5
asr r6, r6, #3 @ divide loop count by 8
.loop:
@ load 8 pixels
vld3.8 {d0-d2}, [r7] @ load and de-interleave pixels
vst1.8 {d0}, [r1] @ store de-interleaved channels
vst1.8 {d1}, [r2]
vst1.8 {d2}, [r3]
adds r7, r7, #24 @ advance source pointer by 8 pixels x 3 bytes
adds r1, r1, #8 @ advance destination pointers
adds r2, r2, #8
adds r3, r3, #8
subs r6, r6, #1 @ decrement and check loop counter
bne .loop
add sp, sp, #64
pop {r4, r5, r6, r7, r8, r9, sl, fp}
bx lr

0.6 InterLeaving

The interleaving operation combines 3 independent channels of an image into a multi-channel image.

Corresponding elements of the independent channels are stored in adjacent locations in the multi-channel image.

void neon_interleave(uint8_t * __restrict d3,uint8_t * __restrict r0,uint8_t * __restrict r1,uint8_t * __restrict r2,int width,int height)

{

int i;

uint8x8x3_t v;

for(i=0;i
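This listing also breaks off inside the loop. A minimal sketch of an interleaving loop, assuming the VST3 counterpart of the structure load (the vst3_u8 intrinsic) is used, could look as follows. The name merge_bgr_sketch is hypothetical, the pixel count is assumed to be a multiple of 8, and the scalar branch is an addition so that the sketch also runs on machines without NEON. It is not the author's original function.

```c
#include <assert.h>
#include <stdint.h>
#if defined(__ARM_NEON) || defined(__ARM_NEON__)
#include <arm_neon.h>
#endif

/* Hypothetical sketch: merge three planes into packed BGR data.
   Assumption: width * height is a multiple of 8. */
void merge_bgr_sketch(uint8_t *dst, const uint8_t *b, const uint8_t *g,
                      const uint8_t *r, int width, int height)
{
    int blocks = (width * height) / 8;
    assert((width * height) % 8 == 0);

    for (int i = 0; i < blocks; i++) {
#if defined(__ARM_NEON) || defined(__ARM_NEON__)
        /* Load 8 elements per channel, then store them interleaved. */
        uint8x8x3_t v;
        v.val[0] = vld1_u8(b);
        v.val[1] = vld1_u8(g);
        v.val[2] = vld1_u8(r);
        vst3_u8(dst, v);
#else
        /* Scalar fallback: same result, one pixel at a time. */
        for (int k = 0; k < 8; k++) {
            dst[3 * k]     = b[k];
            dst[3 * k + 1] = g[k];
            dst[3 * k + 2] = r[k];
        }
#endif
        dst += 24;  /* 8 pixels x 3 channels */
        b += 8;
        g += 8;
        r += 8;
    }
}
```

The structure mirrors the de-interleaving case: three plain 64-bit loads feed one structure store that writes 24 interleaved bytes.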


Thus, by using NEON intrinsics we can achieve performance improvements with respect to standard C code, and by optimizing the assembly code further performance benefits can be achieved.

It should be noted that the OpenCV code may itself be compiled with SIMD optimizations enabled, so the actual speedup over unoptimized code may be higher.

However, optimizing the assembly code did not produce a large additional speedup in the interleaving and de-interleaving operations.

0.7 Code

The code can be found in the git repository https://github.com/pi19404/OpenVision in the POC/ARM subdirectory.

The jni subdirectory contains the source files as well as the make files.

The script generate_assembly.ksh generates the helloneon-intrinsics.s file in the ARM directory. After modifying the file, copy it to the jni subdirectory.

The script compile_assembly.ksh compiles helloneon-intrinsics.s and also builds the binary.

The binary requires the OpenCV library files, which need to be transferred to the Android mobile device:

adb push libs/armeabi-v7a/ /data/local/tmp/NEON_TEST

References

http://pulsar.webshaker.net/ccc/result.php shows the number of execution cycles taken by ARM assembly code, which can be used to check the performance of compiler-generated and optimized code.
