ARM Neon Optimization for image interleaving and deinterleaving


Basic ARM Neon optimization for image interleaving and deinterleaving. The code was tested on the Android platform. Code repository: www.github.com/pi19404/OpenVision


  • ARM Neon Optimization

    InterLeaving/De-Interleaving

    Pi19404

    March 10, 2014

  • Contents

    ARM Neon Optimization InterLeaving/De-Interleaving

    0.1 Introduction
    0.2 ARM Neon
    0.3 Deinterleaving and Interleaving channels of Image
    0.4 De-InterLeaving
    0.5 NDK BUILD
    0.6 InterLeaving
    0.7 Code
    References

  • ARM Neon Optimization InterLeaving/De-Interleaving


    0.1 Introduction

    In this article we will look at basic interleaving and de-interleaving operations using ARM Neon optimization and evaluate the performance improvements on an Android-based mobile device in comparison with standard OpenCV code.

    0.2 ARM Neon

    ARM's NEON technology is a 64/128-bit hybrid SIMD architecture designed to accelerate the performance of a piece of code.

    SIMD technology allows multiple data elements to be processed with one instruction, saving time for other computations. A set of pixels will be processed at a time.

    One way to achieve this is to write assembly code, which has a steep learning curve and requires knowledge of the processor architecture, instruction set, etc.

    Instead of using low-level instructions directly, there are special functions, called intrinsics, which can be treated as regular functions but operate on multiple input data elements simultaneously.
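    To make the idea of an intrinsic concrete, here is a minimal sketch in C. The function name add_sat_u8 is hypothetical and not part of the article's code; the intrinsics vld1q_u8, vqaddq_u8 and vst1q_u8 come from arm_neon.h. The NEON path is only compiled when the compiler defines __ARM_NEON, so the same file also builds on other targets using the scalar loop.

```c
#include <stddef.h>
#include <stdint.h>

#if defined(__ARM_NEON) || defined(__ARM_NEON__)
#include <arm_neon.h>
#define SKETCH_HAVE_NEON 1
#else
#define SKETCH_HAVE_NEON 0
#endif

/* Saturating add of two byte arrays (hypothetical helper for illustration). */
void add_sat_u8(uint8_t *dst, const uint8_t *a, const uint8_t *b, size_t n)
{
    size_t i = 0;
#if SKETCH_HAVE_NEON
    /* One vqaddq_u8 call adds 16 pairs of bytes at once. */
    for (; i + 16 <= n; i += 16) {
        uint8x16_t va = vld1q_u8(a + i);
        uint8x16_t vb = vld1q_u8(b + i);
        vst1q_u8(dst + i, vqaddq_u8(va, vb));
    }
#endif
    /* Scalar path: leftover elements, or everything when NEON is absent. */
    for (; i < n; i++) {
        unsigned sum = (unsigned)a[i] + b[i];
        dst[i] = (uint8_t)(sum > 255u ? 255u : sum);
    }
}
```

    The intrinsic call reads like an ordinary function call, but the compiler is expected to map it to a single NEON instruction operating on a whole register of pixels.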

    0.3 Deinterleaving and Interleaving channels of Image

    NEON structure loads read data from memory into 64-bit NEON registers, with optional deinterleaving. Stores work similarly, reinterleaving data from registers before writing it to memory.


    A set of NEON intrinsics is provided for deinterleaving data.

    They simultaneously pull data from memory and separate it into different registers. This is called deinterleaving.

    The structure loads deinterleave elements based on the element size specified in the instruction.

    The OpenCV functions split and merge are ported to ARM NEON, and a performance comparison with the OpenCV code is performed.

    0.4 De-InterLeaving

    The de-interleave operation separates groups of adjacent elements in memory into separate registers.

    The VLD3 instruction de-interleaves the BGR channels of the image into 3 different registers. The BGR values are stored in adjacent memory locations.

    The contents of these registers are then stored to the destination memory locations.

    vld3_u8
    /* This intrinsic loads 24 bytes from memory, de-interleaving adjacent
       elements. 8 elements are loaded into each 64-bit register, and 3 such
       registers result from the de-interleaving process.
       It is used when the pointer refers to 8-bit unsigned integers
       (vld3_s8 is the signed counterpart).
    */

    vst1_u8
    /* This intrinsic stores the contents of a 64-bit register to the desired
       memory location. The 8 elements (8x8 = 64 bits) constituting the
       register are written to memory together.
    */

    void neon_interlace(uint8_t * __restrict d3,uint8_t * __restrict r0,uint8_t * __restrict r1,uint8_t * __restrict r2,int width,int height)

    {


    int i;

    uint8_t *s3 = (uint8_t *)d3;

    for(i=0;i
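    A de-interleaving routine of the kind described in this section can be sketched as follows. This is a minimal sketch and not the article's listing: the name split_bgr_sketch and its parameter names are hypothetical, and the buffers are assumed to hold width*height pixels of packed 8-bit BGR data. It uses vld3_u8 and vst1_u8 as described above when NEON is available and falls back to a scalar loop otherwise.

```c
#include <stddef.h>
#include <stdint.h>

#if defined(__ARM_NEON) || defined(__ARM_NEON__)
#include <arm_neon.h>
#endif

/* Split packed BGRBGR... bytes into three planes (illustrative names). */
void split_bgr_sketch(const uint8_t *src, uint8_t *b, uint8_t *g, uint8_t *r,
                      int width, int height)
{
    size_t n = (size_t)width * (size_t)height; /* number of pixels */
    size_t i = 0;
#if defined(__ARM_NEON) || defined(__ARM_NEON__)
    /* 8 pixels per iteration: 24 source bytes in, 3 x 8 bytes out. */
    for (; i + 8 <= n; i += 8) {
        uint8x8x3_t v = vld3_u8(src + 3 * i); /* load and de-interleave */
        vst1_u8(b + i, v.val[0]);
        vst1_u8(g + i, v.val[1]);
        vst1_u8(r + i, v.val[2]);
    }
#endif
    /* Scalar tail: pixels left over when n is not a multiple of 8. */
    for (; i < n; i++) {
        b[i] = src[3 * i];
        g[i] = src[3 * i + 1];
        r[i] = src[3 * i + 2];
    }
}
```

    The scalar tail is a design choice of this sketch: a loop that only counts groups of 8 pixels would silently skip the last few pixels of an image whose size is not a multiple of 8.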

    0.5 NDK BUILD

    Files created under the /data/local/tmp directory can be given execute permission. I could not find any other subdirectory in the file system that allowed binaries to have, or be given, execute permission.

    The script a.ksh called below exports basic variables and then executes the binary.

    export LD_LIBRARY_PATH=.:${LD_LIBRARY_PATH}
    cd /data/local/tmp/NEON_TEST
    ./helloneon

    adb shell /data/local/tmp/NEON_TEST/a.ksh

    The performance of the NEON intrinsic function is compared with the standard OpenCV split function:

    OPENCV : 15 ms
    NEON : 11 ms
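    The article does not say how these timings were taken. One simple way to obtain comparable numbers is to average several runs of each routine, as in the sketch below. The names kernel_fn, average_ms and count_calls are hypothetical; clock() from the C standard library measures processor time, which is adequate for a single-threaded comparison like this one.

```c
#include <time.h>

/* A routine to be timed; ctx carries whatever buffers it needs. */
typedef void (*kernel_fn)(void *ctx);

/* Average processor time per call in milliseconds (hypothetical helper). */
double average_ms(kernel_fn fn, void *ctx, int runs)
{
    if (runs <= 0)
        return 0.0;
    fn(ctx); /* warm-up run, not timed, to fill caches */
    clock_t start = clock();
    for (int k = 0; k < runs; k++)
        fn(ctx);
    clock_t end = clock();
    return 1000.0 * (double)(end - start) / (double)CLOCKS_PER_SEC / runs;
}

/* Trivial example kernel: counts how often it was called. */
void count_calls(void *ctx)
{
    int *counter = (int *)ctx;
    (*counter)++;
}
```

    In practice the kernel would wrap the OpenCV split call or the NEON routine, with the image buffers passed through ctx.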

    The NEON optimization does not bring a very significant improvement.

    As per many references, and as can be seen from the disassembly output of the compiler, the main reason is that the ARM compiler is not able to generate optimized assembly code.

    The compiler generates heavily unoptimized code that takes a larger number of cycles than required.

    The compilation commands were taken from the ndk-build verbose build output, and the -c flag was replaced with -S to generate the assembly code.

    /opt/android-ndk-r7/toolchains/arm-linux-androideabi-4.4.3/prebuilt/linux-x86/bin/arm-linux-androideabi-gcc \
    -MMD -MP -MF /home/pi19404/ARM//obj/local/armeabi-v7a/objs/helloneon/helloneon-intrinsics.o.d \
    -fpic -ffunction-sections -funwind-tables -fstack-protector \
    -D__ARM_ARCH_5__ -D__ARM_ARCH_5T__ -D__ARM_ARCH_5E__ -D__ARM_ARCH_5TE__ \
    -Wno-psabi -march=armv7-a -mfloat-abi=softfp -mfpu=vfp -mthumb -Os \
    -fomit-frame-pointer -fno-strict-aliasing -finline-limit=64 -mfpu=neon \
    -I/usr/local/include -I/media/UBUNTU/repository/OpenVisionLibrary/OpenVision/ \
    -I/opt/android-ndk-r7/sources//android/cpufeatures \
    -I/opt/android-ndk-r7/sources/cxx-stl/gnu-libstdc++/include \
    -I/opt/android-ndk-r7/sources/cxx-stl/gnu-libstdc++/libs/armeabi-v7a/include \
    -I/home/pi19404/ARM//jni -DANDROID -DHAVE_NEON -fPIC -DANDROID \
    -I/usr/local/include/opencv -I/usr/local/include -I/OpenVision \
    -I/media/UBUNTU/repository/OpenVisionLibrary/OpenVision -fPIC \
    -DHAVE_NEON=1 -ftree-vectorize -mfpu=neon -O3 -mfloat-abi=softfp \
    -ffast-math -Wa,--noexecstack -O3 -DNDEBUG \
    -I/opt/android-ndk-r7/platforms/android-8/arch-arm/usr/include \
    /home/pi19404/ARM//jni/helloneon-intrinsics.c -S

    The above command will generate the file helloneon-intrinsics.s in the present directory.

    A lot of unnecessary instructions can be observed in the assembly code.

    The assembly code corresponding to the functions was then optimized and compiled.

    For compilation, the debug build output observed from the ndk-build process was again modified so that the helloneon-intrinsics.o object file is compiled from helloneon-intrinsics.s and the helloneon binary is compiled and linked from all source files.

    /opt/android-ndk-r7/toolchains/arm-linux-androideabi-4.4.3/prebuilt/linux-x86/bin/arm-linux-androideabi-gcc \
    -MMD -MP -MF /home/pi19404/ARM//obj/local/armeabi-v7a/objs/helloneon/helloneon-intrinsics.o.d \
    -fpic -ffunction-sections -funwind-tables -fstack-protector \
    -D__ARM_ARCH_5__ -D__ARM_ARCH_5T__ -D__ARM_ARCH_5E__ -D__ARM_ARCH_5TE__ \
    -Wno-psabi -march=armv7-a -mfloat-abi=softfp -mfpu=vfp -mthumb -Os -fomit-frame-pointer \
    -fno-strict-aliasing -finline-limit=64 -mfpu=neon -I/usr/local/include \
    -I/media/UBUNTU/repository/OpenVisionLibrary/OpenVision/ \
    -I/opt/android-ndk-r7/sources//android/cpufeatures \
    -I/opt/android-ndk-r7/sources/cxx-stl/gnu-libstdc++/include \
    -I/opt/android-ndk-r7/sources/cxx-stl/gnu-libstdc++/libs/armeabi-v7a/include \
    -I/home/pi19404/ARM//jni -DANDROID -DHAVE_NEON -fPIC -DANDROID \
    -I/usr/local/include/opencv -I/usr/local/include -I/OpenVision \
    -I/media/UBUNTU/repository/OpenVisionLibrary/OpenVision -fPIC \
    -DHAVE_NEON=1 -ftree-vectorize -mfpu=neon -O3 -mfloat-abi=softfp \
    -ffast-math -Wa,--noexecstack -O3 -DNDEBUG \
    -I/opt/android-ndk-r7/platforms/android-8/arch-arm/usr/include \
    -c /home/pi19404/ARM/jni/helloneon-intrinsics.s \
    -o /home/pi19404/ARM/obj/local/armeabi-v7a/objs/helloneon/helloneon-intrinsics.o \
    --sysroot=/opt/android-ndk-r7/platforms/android-14/arch-arm/

    /opt/android-ndk-r7/toolchains/arm-linux-androideabi-4.4.3/prebuilt/linux-x86/bin/arm-linux-androideabi-g++ -Wl,--gc-sections \
    -Wl,-z,nocopyreloc --sysroot=/opt/android-ndk-r7/platforms/android-8/arch-arm \
    /home/pi19404/ARM//obj/local/armeabi-v7a/objs/helloneon/neon.o \
    /home/pi19404/ARM//obj/local/armeabi-v7a/objs/helloneon/helloneon-intrinsics.o \
    /home/pi19404/ARM//obj/local/armeabi-v7a/libcpufeatures.a \
    /home/pi19404/ARM//obj/local/armeabi-v7a/libgnustl_static.a \
    /opt/android-ndk-r7/toolchains/arm-linux-androideabi-4.4.3/prebuilt/linux-x86/bin/../lib/gcc/arm-linux-androideabi/4.4.3/libgcc.a \
    -Wl,--fix-cortex-a8 -Wl,--no-undefined -Wl,-z,noexecstack \
    -L/opt/android-ndk-r7/platforms/android-8/arch-arm/usr/lib -fPIC -llog \
    -ldl -lm -lz -lm -lc -lgcc -Wl,-rpath,'libs/armeabi-v7a' \
    -L/home/pi19404/ARM//jni/../libs/armeabi -llog -Llibs/armebi -Llibs/armeabi-v7a \
    -lopencv_core -lopencv_imgproc -lopencv_highgui -lopencv_flann \
    -lc -lm -o /home/pi19404/ARM//obj/local/armeabi-v7a/helloneon

    cp /home/pi19404/ARM//obj/local/armeabi-v7a/helloneon libs/armeabi-v7a

    The results of the optimization process are as follows:

    OPENCV : 15 ms
    NEON : 8 ms
    NEON OPTIMIZED : 6 ms

    Thus a speedup factor of 1.4 and, after optimizing the assembly code, a total performance improvement of 2.5x over OpenCV (15 ms versus 6 ms) was observed.
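    For reference, the ratios implied by the reported timings can be worked out directly. The symbols T below are just shorthand for the three measured times in the second table; the 1.4 figure appears to correspond to the first measurement (15 ms versus 11 ms), although the article does not state this explicitly.

```latex
\[
\frac{T_{\text{OpenCV}}}{T_{\text{NEON}}} = \frac{15}{8} \approx 1.9,
\qquad
\frac{T_{\text{NEON}}}{T_{\text{NEON opt}}} = \frac{8}{6} \approx 1.3,
\qquad
\frac{T_{\text{OpenCV}}}{T_{\text{NEON opt}}} = \frac{15}{6} = 2.5
\]
\[
\text{First measurement: } \frac{15}{11} \approx 1.4
\]
```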

    This still does not motivate the use of assembly-level coding, since the development effort may outweigh the optimization benefits.

    push {r4, r5, r6, r7, r8, r9, sl, fp} @save registers on the stack (8*4 = 32 bytes)
    .save {r4, r5, r6, r7, r8, r9, sl, fp}
    .LCFI0:
    .pad #64
    sub sp, sp, #64 @reserve 64 bytes of stack
    .LCFI1:
    mov r7, r0 @r7 = source pointer (interleaved BGR data)
    ldr r4, [sp, #96] @load stack argument width, offset 64+8*4
    ldr r5, [sp, #100] @load stack argument height, offset 64+9*4
    mul r6,r4,r5 @total number of pixels
    asr r6, r6, #3 @divide loop count by 8
    .loop:
    @ load 8 pixels:
    vld3.8 {d0-d2},[r7] @load and de-interleave 8 BGR pixels
    vst1.8 {d0}, [r1] @store de-interleaved channels
    vst1.8 {d1}, [r2]
    vst1.8 {d2}, [r3]
    adds r7, r7, #24 @advance source pointer by 8 pixels * 3 bytes
    adds r1, r1, #8 @advance the three destination pointers by 8 bytes
    adds r2, r2, #8
    adds r3, r3, #8
    subs r6, r6, #1 @decrement and check loop counter
    bne .loop
    add sp, sp, #64
    pop {r4, r5, r6, r7, r8, r9, sl, fp}
    bx lr

    0.6 InterLeaving

    The interleaving operation corresponds to combining 3 independent channels of an image into a multi-channel image.

    Corresponding elements of the independent channels are stored in adjacent locations in the multi-channel image.

    void neon_interleave(uint8_t * __restrict d3,uint8_t * __restrict r0,uint8_t * __restrict r1,uint8_t * __restrict r2,int width,int height)

    {

    int i;

    uint8x8x3_t v;

    for(i=0;i
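    An interleaving routine of the kind described in this section can be sketched as follows. This is a minimal sketch and not the article's listing: the name merge_bgr_sketch and its parameter names are hypothetical, and each plane is assumed to hold width*height bytes. It loads 8 bytes per channel with vld1_u8 and writes 24 interleaved bytes with vst3_u8 when NEON is available, with a scalar fallback otherwise.

```c
#include <stddef.h>
#include <stdint.h>

#if defined(__ARM_NEON) || defined(__ARM_NEON__)
#include <arm_neon.h>
#endif

/* Merge three planes into packed BGRBGR... bytes (illustrative names). */
void merge_bgr_sketch(uint8_t *dst, const uint8_t *b, const uint8_t *g,
                      const uint8_t *r, int width, int height)
{
    size_t n = (size_t)width * (size_t)height; /* number of pixels */
    size_t i = 0;
#if defined(__ARM_NEON) || defined(__ARM_NEON__)
    /* 8 pixels per iteration: 3 x 8 bytes in, 24 interleaved bytes out. */
    for (; i + 8 <= n; i += 8) {
        uint8x8x3_t v;
        v.val[0] = vld1_u8(b + i);
        v.val[1] = vld1_u8(g + i);
        v.val[2] = vld1_u8(r + i);
        vst3_u8(dst + 3 * i, v); /* interleave and store */
    }
#endif
    /* Scalar tail: pixels left over when n is not a multiple of 8. */
    for (; i < n; i++) {
        dst[3 * i]     = b[i];
        dst[3 * i + 1] = g[i];
        dst[3 * i + 2] = r[i];
    }
}
```

    The structure mirrors the de-interleaving case: the structure store vst3_u8 does the interleaving, so no per-pixel shuffling is needed in the loop body.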


    Thus by using NEON intrinsics we can achieve performance improvements with respect to standard C code, and by optimizing the assembly code further performance benefits can be achieved.

    It is to be noted that the OpenCV code may itself be compiled with SIMD optimizations (SSE is specific to x86; on ARM the equivalent would be NEON), so the actual speedup over plain C code may be higher.

    However, optimizing the assembly code did not produce a large additional speedup in the interleaving and de-interleaving operations.

    0.7 Code

    The code can be found in the git repository https://github.com/pi19404/OpenVision in the POC/ARM subdirectory.

    The jni subdirectory contains the source files as well as the makefiles.

    The script generate_assembly.ksh generates the helloneon-intrinsics.s file in the ARM directory. After modifying the file, copy it to the jni subdirectory.

    compile_assembly.ksh compiles helloneon-intrinsics.s and also builds the binary.

    The binary requires the OpenCV library files, which need to be transferred to the Android mobile device:

    adb push libs/armeabi-v7a/ /data/local/tmp/NEON_TEST

    References

    http://pulsar.webshaker.net/ccc/result.php shows the number of execution cycles taken by ARM assembly code, which can be used to check the performance of compiler-generated and optimized code.
