
    Delft University of Technology, EWI

    IN4342 Embedded Systems Laboratory

    Final Report

Group 4: D. Turi

D. Burlyaev

K. Falatehan

Z. Kocsi-Horváth

{D.Turi, D.Burlyaev, K.Falatehan,
Z.KocsiHorvath}@student.tudelft.nl

    Period 4, Delft University of Technology, EWI


1. Introduction

Typical embedded applications feature both control and data-processing functionality, whereas microprocessor architectures tend to be specialized to carry out only one of the two effectively. Thus, in many cases it can be difficult to choose an architecture which yields high performance at low cost and energy consumption. To tackle this problem, heterogeneous multiprocessor architectures have been developed, which in turn require special skills from the programmer.

In the following report we describe case studies of two simple applications. In both cases we partitioned preexisting algorithms between the cores so that maximum performance could be achieved. As hardware we used the BeagleBoard, which features the TI OMAP3530 chip.

Apart from presenting the programs, we also cover various techniques and problems associated with this field.

2. OMAP 3530 Architecture

The OMAP 3530 is a SoC (system on chip) manufactured by Texas Instruments. It features a heterogeneous multiprocessor architecture consisting of two processor subsystems: an ARM Cortex-A8 GPP (general-purpose processor) and a TMS320C64x+ DSP (digital signal processor). This architecture was designed to facilitate high-performance computation that provides best-in-class video, image, and graphics processing [1].

Fig. 1 gives a brief overview of the architecture. The two processors are connected via a multilevel bus system consisting of the levels L3 and L4 (L1 and L2 designate the memories to which the cores have an exclusive connection). This hierarchy facilitates faster data transfer, as the shared on-chip memory is attached to L3, while the peripherals can be accessed via L4.

Fig. 1: The architecture of the OMAP 3530 SoC

We have used the vendor's library, DSPLINK,


which provides a layer of abstraction over the communication between the cores. Apart from data transfer, this library also facilitates the Cortex-A8's control over the DSP.

There are numerous possibilities for communication between the processors, such as message queues, shared memory pools, notifications, etc. These possibilities, as well as different optimization approaches, will be investigated in this report.

3. Matrix multiplication

a. Introduction

We began studying the capabilities of the chip through a simple program which carries out the multiplication of two N-by-N matrices. As the GPP runs a Linux operating system which provides direct access to the file system and the terminal output, it is much easier for inexperienced programmers to write programs for it. Thus, we implemented the matrix multiplication algorithm first on the GPP and later used it to verify the results obtained from the DSP.
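As a reference, a minimal sketch of such a GPP baseline (the function name and layout are ours; the 16-bit element type follows the report):

/* Naive N-by-N multiply used as the GPP reference. The product is
 * accumulated in a 32-bit temporary and then narrowed to 16 bits;
 * the possible overflow of the 16-bit result is discussed later. */
void matmul_ref(const short *a, const short *b, short *c, int n)
{
    int i, j, k;
    for (i = 0; i < n; i++) {
        for (j = 0; j < n; j++) {
            int acc = 0;
            for (k = 0; k < n; k++)
                acc += a[i * n + k] * b[k * n + j];
            c[i * n + j] = (short)acc;
        }
    }
}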

    Due to the indirect connection through DSPLINK it is more difficult to write programs for the

    DSP. Moreover, without the necessary macros and keywords the compiler generates a program of

    suboptimal performance; thus, additional programming effort is needed to make the usage of the

    DSP feasible.

    Apart from the optimality of the code, the other major influence on the execution time is the

    communication architecture between the two cores. Therefore, it is necessary to measure and

    fine-tune the communication, as well.

    In this assignment we used message queues which not only facilitate data transfer but also

    synchronize execution on both cores.

In the following sections we present our experiments with the above-mentioned algorithm and communication architecture. We conducted measurements with different matrix and message sizes. The elapsed time was measured using the system functions provided by the OS. Due to the stochastic nature of the execution time, we carried out multiple measurements with the same parameters and then calculated the average. The results were written into files and analyzed by a MATLAB program, which generated graphs of the measurements. These


    graphs facilitate a deeper understanding of the dependency of the execution time on the various

    parameters.

In the following figures the upper graph shows the total execution time as a function of the matrix size, whereas the lower graph shows the execution time per matrix element, also as a function of the matrix size.

b. Optimization of the communication

Fig. 2 shows the measurement results for the GPP execution. As can be seen, the execution time for a 128x128 matrix is about 80 ms. In the following, this result has been used to decide whether it is feasible to execute the multiplication on the DSP, i.e. the DSP execution time including the communication overhead has to be shorter than this.

In the next experiment we inspected the effect of the communication overhead on the execution time. We implemented two versions of the same program. In the first version we used one message per matrix row to send the elements (see Fig. 3, in red). Thus, in the case of small matrices there is a large overhead, whereas large matrices can be transmitted efficiently. The results of this program were compared with a second version (see Fig. 3, in blue), in which we packed as many elements into one message as possible to ensure optimal communication.
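A minimal sketch of the packing idea follows; the payload size and the queue_put helper are illustrative assumptions, not the actual DSPLINK API:

#include <string.h>

#define PAYLOAD_BYTES 1024                          /* assumed payload size */
#define ELEMS_PER_MSG (PAYLOAD_BYTES / sizeof(short))

/* Hypothetical transport hook standing in for the message-queue send. */
extern void queue_put(const short *payload, int n_elems);

/* Fill each message with as many 16-bit elements as fit, instead of
 * sending one (mostly empty) message per matrix row. */
static void send_matrix_packed(const short *m, int n_elems)
{
    short payload[ELEMS_PER_MSG];
    int sent = 0;

    while (sent < n_elems) {
        int chunk = n_elems - sent;
        if (chunk > (int)ELEMS_PER_MSG)
            chunk = (int)ELEMS_PER_MSG;
        memcpy(payload, m + sent, chunk * sizeof(short));
        queue_put(payload, chunk);
        sent += chunk;
    }
}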

Fig. 2: Matrix multiplication on the GPP side (element size = 16 bit)

Fig. 3: The communication overhead. Red: message-by-row; blue: packing

The difference is especially visible in the lower part of the graph. Due to the efficiency of the second algorithm, fewer messages need to be sent; thus, the execution time is shorter. However, this difference diminishes with larger matrices.

As we can see, the execution time is approximately 400 ms in both cases, which is much more than that of the GPP.

We found that there are two ways to decrease the elapsed time: further reducing the communication overhead (by increasing the message size) or making the DSP execution faster through optimization of the algorithm.

c. Optimization of the DSP code

The code on the DSP was mostly optimized automatically by the compiler. This optimization is enabled by the -o3 option (in the Makefile). The efficiency of the automatic optimization depends on how much information the compiler has about the program. This information can be supplied by the methods described below.

With software pipelining the compiler tries to overlap consecutive loop iterations in such a way that the hardware is utilized as fully as possible. An important parameter is the initiation interval (ii), the number of cycles per loop iteration; it is equivalent to the cycle count of the software-pipelined loop body. The method is illustrated in Fig. 4 [4].

Apart from the cycle count of the loop body, the compiler feedback shows other items, e.g. the trip counts (the minimum and maximum number of iterations expected by the compiler) and the dependency and resource bounds (the constraints stemming from data dependencies and hardware resource limits). It also shows how the instructions have been partitioned between the two data paths of the DSP (paths A and B).

    The listing for the original code is shown in Listing 1 (see Appendix). The ii was determined to

    be 9 cycles. In the following we used keywords and macros to achieve better efficiency [5].

Fig. 4: Software pipelining [4]


We found that the compiler assumes a dependency between the pointers used to iterate through the multiplicands and the result. Knowing that these memory areas do not overlap, the programmer can declare the arrays with the restrict keyword. The result is shown in Listing 2. The initiation interval decreased to 3 cycles, a significant performance improvement. As we can see, this stems from the loop-carried dependency bound, which fell from the previous 9 cycles to 1. With the dependency removed, the resource bounds increased and became the limiting factor. A minimal sketch of such a kernel follows.
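A sketch of a restrict-qualified inner product (the function and variable names are ours, not the original code's):

/* restrict promises the compiler that out, a and b never alias, which
 * removes the assumed loop-carried dependency between load and store. */
void dot_restrict(short * restrict out,
                  const short * restrict a,
                  const short * restrict b,
                  int n)
{
    int i, acc = 0;
    for (i = 0; i < n; i++)
        acc += a[i] * b[i];
    *out = (short)acc;
}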

To optimize this result even further, we can try to use the data paths more efficiently. Because the number of operations is not even, the operations are assigned asymmetrically to the computational units of the two data paths (i.e. there are periods when the units in one path are computing while the units in the other path are idling). To solve this, the loop can be unrolled, in which case we have an even number of operations. The compiler then reorders the instructions so that the computational units are evenly utilized.

To enable unrolling, the programmer can include the #pragma MUST_ITERATE(min, max, factor) directive, which specifies the minimum and maximum number of iterations plus a factor which is an integer divisor of the trip count. In our program we chose this factor to be 2. If the multiplicand matrices have an odd size, they are augmented by an extra row and column of zeros.

The result is shown in Listing 3. As we can see, the cycle count did not change, but the computational unit usage became more balanced. The trip count factor changed to 2, as specified. A sketch with the pragma in place follows.
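The same kernel with the trip-count hint (the bounds are illustrative assumptions; the factor of 2 is the value the report chose):

void dot_pragma(short * restrict out,
                const short * restrict a,
                const short * restrict b,
                int n)
{
    int i, acc = 0;
    /* Minimum trip count, maximum trip count, and a factor dividing
     * the count; the factor of 2 lets the compiler unroll once. */
    #pragma MUST_ITERATE(2, 256, 2)
    for (i = 0; i < n; i++)
        acc += a[i] * b[i];
    *out = (short)acc;
}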

Fig. 5: Optimized version on the DSP

To further optimize the execution, we tried to exploit the SIMD capabilities of the DSP. The DSP can execute 32-bit-wide arithmetic operations or, if specified, the same operations on two 16-bit-wide numbers in parallel (so-called packed 16-bit arithmetic). To be able to fetch the two 16-bit numbers in the same cycle, the data needs to be


aligned on halfword boundaries. The compiler can be informed of the alignment with the _nassert((int)aligned_ptr % 2 == 0) intrinsic.

The result is shown in Listing 4. As we can see, the cycle count remained the same, which means that the execution takes roughly the same time for 32-bit as for 16-bit-wide data. A sketch combining these hints is given below.
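A sketch combining the hints (the halfword alignment value follows the report's claim; TI examples often assert word or double-word alignment instead):

void dot_packed_hints(short * restrict out,
                      const short * restrict a,
                      const short * restrict b,
                      int n)
{
    int i, acc = 0;
    /* _nassert generates no code; it only tells the compiler the
     * pointers are aligned, so pairs of 16-bit loads can be merged. */
    _nassert((int)a % 2 == 0);
    _nassert((int)b % 2 == 0);
    #pragma MUST_ITERATE(2, 256, 2)
    for (i = 0; i < n; i++)
        acc += a[i] * b[i];
    *out = (short)acc;
}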

The result of the DSP optimization is shown in Fig. 5. As we can see, the execution time for 128x128 matrices is about 80 ms, roughly the same as on the GPP. Thus, using the DSP as a co-processor became feasible.

The other way to optimize is to use larger messages. This yields better results, since the proportion of the message header decreases compared with the total amount of data sent.

We found that the optimal message size is 256*48 bytes per message (with 16-bit matrix elements); for larger messages the latency grows (see Fig. 6). As a final result, the execution time is about 40 ms, roughly half of the GPP execution time.

Upon inspecting the results of the matrix multiplication and verifying them with MATLAB, we found that the 16-bit type used to store the matrix elements can overflow, so wrong values are possible (on the DSP and GPP side alike) for large matrices even if the individual elements are small. Thus, we experimented with elements of 32 bits. The latencies approximately doubled, as we expected. For larger sizes there is an additional latency increase; the reason for this is the limited cache size.

d. Conclusion

Fig. 6: The effect of message size. Blue: maximum message size; red: optimal message size

In this assignment we studied the architecture and programming of a heterogeneous multicore processor. We created several versions of the programs, evolving from a very inefficient one to the optimized version. With each new version we tried to overcome the disadvantages of the previous one (communication inefficiency, compilation inefficiency,


etc.). We investigated the influence of the compilation modes, message sizes, and communication protocols, as well as DSP optimization approaches, to make our program more than 400 times faster than it was at the beginning.

We showed that it is reasonable to use the DSP instead of the GPP for matrix multiplication, since the DSP is ~10 times faster. We also showed that large message sizes can lead not only to benefits but in some cases also to communication degradation.

4. GPROF analysis of the image recognition application

Fig. 7: gprof output

Analyzing gprof output is an important step in improving system performance. From the output we can learn where our program spent its time and which functions called which other functions during execution. In the original application code, this lets us easily locate which function(s) form the bottleneck of the system. As shown in Fig. 7, gprof produces two tables: the flat profile and the call graph [8]. The flat profile shows the total amount of time the program spent executing each function. When analyzing the flat profile, it is worthwhile to check the first column first: the execution time of each function as a percentage of the total. This statistic is related to the fifth column, self ms/call, which shows


the execution time per function call in ms. In our example, build_hmm() takes the largest share of the total running time of the program. This is not surprising, as the function also has the highest self ms/call. Despite this, the function is called only twice, in contrast with states_are_mergeable(), which has 98 calls. If we look at histogram_equalization(), this function is in second place both in percentage of the total running time and in number of calls. Based on these facts, we can assume that build_hmm() and/or histogram_equalization() are the biggest bottlenecks in the program.

The second table worth examining is the call graph, which shows how much time was spent in each function and its children. It often happens that although a function itself does not cost much time, the calls to its children cause a bottleneck. The functions are separated by lines of dashes into entries; the callers of a function are always located on the preceding lines, followed by its children. In the second entry, build_hmm spends most of the time, followed by histogram_equalization.

Based on this information and what we observed in the flat profile, we concluded that build_hmm() and histogram_equalization() are the two functions we should concentrate on to optimize the program.

5. Optimization on the GPP

Execution of the initial code of the image processing application showed that the GPP takes ~823 ms to finish the whole algorithm and produce the correct recognition. However, with the O3 optimization of the gcc compiler enabled, the execution time is almost halved, to ~427 ms.

Several versions of the functions were written to explore other possibilities for performance optimization on the GPP. The main optimization steps are explained below.

a. Square root approximation

A significant reduction of the program execution time on the GPP (and DSP) is obtained by replacing the sqrt function of math.h in the build_hmm(int) function with an approximation. The method described in Fast square root in C [6] replaces the sqrt function with an integer-valued computation of the square root. Since the result of the square root is cast to unsigned char anyway, to be saved in the stddev field of tiger_state_array and eleph_state_array, the final accuracy of this approach reaches 100% whenever the argument of the square root is greater than 100. If the argument is less than 100, the approximation inaccuracy affects the correctness of the algorithm, since the accuracy does not even reach 70%. We therefore decided to split the argument range into two parts: [0..100] and (100, ∞). For the range (100, ∞) the function presented in [6] is used, while for the range [0..100] a table of square root values was built, which takes 100*8 bits = 100 bytes. Measurements showed that replacing the sqrt function with the table plus the integer-valued computation [6] reduces the application execution time by ~100 ms. A sketch of this split-range approach follows.
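A sketch of the split-range square root, assuming the classic bit-by-bit integer routine of [6]; only the first table entries are shown, the rest being precomputed rounded roots:

/* Table for arguments 0..100 (rounded roots; only the beginning is
 * shown here, the remaining entries go up to sqrt(100) = 10). */
static const unsigned char sqrt_tab[101] = {
    0, 1, 1, 2, 2, 2, 2, 3, /* ... precomputed values ... */
};

static unsigned isqrt(unsigned x)
{
    unsigned res = 0;
    unsigned bit = 1u << 30;        /* largest power of 4 in 32 bits */

    if (x <= 100)                   /* small range: table lookup */
        return sqrt_tab[x];

    while (bit > x)                 /* bit-by-bit routine per [6] */
        bit >>= 2;
    while (bit != 0) {
        if (x >= res + bit) {
            x -= res + bit;
            res = (res >> 1) + bit;
        } else {
            res >>= 1;
        }
        bit >>= 2;
    }
    return res;
}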

b. Replacement of floating point by integers

To avoid time-consuming floating-point operations on both the DSP and GPP sides, all float variables were replaced by integers and, where possible, even by unsigned char variables. For instance, the local sum variable (initially of float type) in the histogram_equalization function was replaced by an unsigned integer equivalent; no computation accuracy is lost, since the sum variable is later cast to unsigned char anyway. In inequalities, divisions were replaced by multiplications according to the usual algebraic rules. Constants that were initially float, e.g. TIGER_STATE_THRESHOLD, were multiplied by 100 to obtain integer equivalents. Corresponding changes were made in the algorithm to keep the computations correct; a sketch of both rewrites follows.
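A minimal sketch of the two rewrites (the names and the 0.87 value are made-up illustrations, not the application's actual threshold):

/* Before: float threshold and a division in the comparison.
 *   #define TIGER_STATE_THRESHOLD 0.87f
 *   if ((float)diff / total < TIGER_STATE_THRESHOLD) ...
 *
 * After: the constant is scaled by 100 and the division is moved to
 * the other side of the inequality (valid because total > 0). */
#define TIGER_STATE_THRESHOLD_X100 87

static int below_threshold(unsigned diff, unsigned total)
{
    return 100u * diff < TIGER_STATE_THRESHOLD_X100 * total;
}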

c. Loop unrolling

Manual loop unrolling inside functions such as histogram_equalization, states_are_mergeable, build_hmm, and analyze_img showed that unrolling the innermost loops twice reduces the execution time on the GPP side by 3-8 ms. Unrolling the innermost loops four times did not show any performance increase compared with unrolling twice. As a result, the inner loops were unrolled twice wherever possible (see the sketch below). Special cases of improvements are described in the next sections.
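A sketch of the two-fold unrolling pattern (assuming an even trip count, which the report guarantees by construction; the function is illustrative):

/* Two-fold unrolling with independent accumulators, so the two
 * partial sums can be computed in parallel. n is assumed even. */
static unsigned pixel_sum(const unsigned char *img, int n)
{
    unsigned sum0 = 0, sum1 = 0;
    int i;
    for (i = 0; i < n; i += 2) {
        sum0 += img[i];         /* even pixels */
        sum1 += img[i + 1];     /* odd pixels  */
    }
    return sum0 + sum1;
}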

d. Histogram equalization improvements

Several versions of the function were written to reach high performance:

1st version: The loops of the function were unrolled so that during one iteration of the outermost loop the function processes 3 images of 3 different types (tiger, elephant, and test images). With this unrolling the program executes in ~489 ms without O3 compiler optimization, and in ~116 ms with O3.


2nd version: All loops were unrolled twice. Corresponding results: ~486 ms without O3; ~109 ms with O3.

3rd version: This version merges the previous two: the loops were unrolled twice and each iteration of the outermost loop processes three pictures of different types. The performance without O3 optimization is ~485.3 ms; with O3, ~118 ms.

4th version: This version is based on version 1, but instead of processing three images of different types, it processes 2 images of the same type per call.

Versions of histogram_equalization | 24-image equalization time (ms, with O3) | Function execution time (ms, with O3) | Total algorithm time on GPP (ms), without O3 | with O3
Version 1 | 59.9 | 7.49 | 489 | 116
Version 2 (2 times unrolling) | 53.5 | 2.23 | 486 | 109
Version 2, 4 times unrolling | - | - | - | 110.7
Version 2, without unrolling | - | - | - | 114.5
Version 3 | 76 | 9.5 | 485.3 | 118
Version 4 | 58.1 | 2.42 | 485.5 | 111.2

Table 1. Performance of histogram_equalization function versions

According to the measurements presented in Table 1, the most significant improvement is gained by version 2 (which unrolls twice). This version is used in all further versions of the program.

It is worth mentioning that the execution time depends on the images the function is processing: equalizing 8 tiger images takes 1 ms longer than equalizing 8 elephant images.

e. HMM building improvements

For the Hidden Markov Model algorithm, several versions of the build_hmm function were created:

1st version: The algorithm is organized so that the i-th element of the average field of a state structure is filled during the same iteration as the (i-1)-th element of the stddev field of the same structure. Elephant and tiger images are processed together in each loop iteration. Thus, the function is called only once, which can restrict the


flexibility of further parallelization between the GPP and DSP. The performance of this and the other versions is presented in Table 2.

2nd version: The inner loops for the average and stddev calculations were eliminated by unrolling 8 times (there are 8 elephant images and 8 tiger images). Moreover, the loops calculating the average and stddev elements were merged into one, and the innermost loop was unrolled twice. Since states_are_mergeable is called inside the algorithm, it was also unrolled twice to achieve a better execution time. The results of the measurements are presented in Table 2.

Versions of build_hmm | Function execution time (ms, with O3) | Total algorithm time on GPP (ms), without O3 | with O3
Version 1 | 208 | 571 | 254
Version 2 | 19 | 396 | 101
Version 2, without changes in states_are_mergeable | - | - | 102.3
Version 2, 4 times unrolling of HMM building instead of 2 times | - | - | 108.6
Version 2, HMM building without unrolling | - | - | 110

Table 2. Performance of build_hmm function versions

According to the performance measurements, the 1st version is ~5 times slower than the 2nd version. This can be explained by version 2 making better use of the spatial locality of memory.

f. Improvements to the image analysis (analyze_img)

Several versions of the analyze_img function were written and tested to find the best program execution time:

1st version: The two inner loops comparing against the averaged tiger and elephant images were merged into one.

2nd version: This version is based on version 1; however, the inner part of the merged loop was unrolled 2 and 4 times. To compare test images with the averaged tiger (see Fig. 8) and averaged elephant images (see Fig. 9), a switch-case structure was implemented based on the is_deleted fields. The results are presented in Table 3.

3rd version: This version was created specifically for DSP-GPP parallelization. The function was separated into two independent comparisons, one against the averaged tiger image and one against the averaged elephant image. This version consists of three functions: the outputs of the first two are 8-element integer arrays containing the values of the tiger and eleph variables (which represent the level of similarity between each test image and the averaged tiger and elephant images). The third function compares these two arrays and outputs the recognition result; a sketch of this split is shown below.
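A sketch of the three-function split (all names and signatures are hypothetical; the two scorers are the parts that can later run on different cores):

#define N_TEST_IMAGES 8

/* Hypothetical scorers: each fills an 8-element similarity array for
 * one class; in the parallel version one of them runs on the DSP. */
void score_vs_tiger(int tiger_score[N_TEST_IMAGES]);
void score_vs_eleph(int eleph_score[N_TEST_IMAGES]);

/* The third function only compares the two arrays per test image. */
void decide(const int tiger_score[N_TEST_IMAGES],
            const int eleph_score[N_TEST_IMAGES],
            int is_tiger[N_TEST_IMAGES])
{
    int i;
    for (i = 0; i < N_TEST_IMAGES; i++)
        is_tiger[i] = tiger_score[i] > eleph_score[i];
}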

Versions of analyze_img | Function execution time (ms, with O3) | Total algorithm time on GPP (ms), without O3 | with O3
Version 1 | 11.5 | 340 | 101
Version 2 (twice unrolled) | 9.5 | 333 | -
Version 2, unrolled 4 times instead of twice | - | - | 100
Version 3 | 10.5 | 370 | 102

Table 3. Performance of analyze_img function versions

g. Use of compilation flags for program execution on the GPP

Several options were tried to decrease the execution time below the 98.6 ms reached with the fastest versions described in the previous sections. However, the -funroll-loops flag increased the time by 2 ms, while -finline-functions and -finline-small-functions showed no visible effect. The likely explanation is that the O3 optimization used to reach 98.6 ms already applies the maximum compiler optimization, so additional flags do not improve the performance and in several cases even increase the total execution time.

h. Conclusion

As presented in the previous sections, a significant execution time reduction can be achieved by using integer variables instead of floating-point variables (where this does not corrupt the algorithm), loop unrolling, loop merging, and the integer-valued sqrt approximation. However, it was also shown that unrolling loops four times and several compilation flags may even increase the execution time.

Fig. 8: The averaged elephant image. Fig. 9: The averaged tiger image


Diagram 1 shows the step-by-step reduction of the execution time described in the previous sections. The program executed only on the GPP can be divided into 3 main steps: histogram equalization, HMM building, and the recognition process. The normalized Diagram 2 shows that the O3 compiler optimization had the most significant effect on the analyse_img function; however, after applying the manual optimizations, the share of HMM building in the total execution time decreases significantly.

The function versions with the highest execution speed (marked in bold in Tables 1, 2, and 3) are used in the GPP-DSP parallelization of the algorithm.

6. Parallelization with the DSP

Until now the DSP has not been used, but since it is possible to parallelize the initial algorithm between the GPP and DSP units, the opportunity to decrease the execution time was investigated.

a. DSP-GPP communication

The main obstacle to simple parallelization is the communication overhead. According to the conducted measurements, the GPP spends 47 ms writing 320 KB (the equivalent of 8 images) into the pool. Since the execution of the whole program on the GPP takes ~98.6 ms, a communication overhead of 47 ms is unacceptably high. The easiest way to reduce the communication overhead is to write larger data types into the pool. Table 4 presents the time spent by the GPP to send 320 KB into the pool.

Diagram 1: Execution time of the different program versions (initial version, initial version with O3 optimization, final version), broken down into histogram equalization (histogram_equalization), HMM building (build_hmm), and the recognition process (analyze_img).

Diagram 2: The same breakdown, normalized.


Type of element written to the pool | Element size (bits) | Time to send 320 KB into the pool (ms)
char | 8 | 47
Int32 | 32 | ~12.5
long | 64 | ~7

Table 4. Overhead of sending data into the pool

In the compiler manual we found that the long type on the DSP side corresponds to 40 bits, but the sizeof operator returned 64 bits. Moreover, the compiler treats a long long variable as a variable of long type (64 bits).

Sending one notification (Int32) takes ~30 us, which is negligible in comparison with writing to the pool.

An interesting observation is that writing to the pool takes less time than reading from it. This can be explained as follows: when writing to the pool, the API function sets up a DMA transfer, starts it, and can continue execution right away, whereas when reading we have to wait until the DMA has finished fetching all the data. This asymmetry is in favor of the main processor, as it can quickly hand data to the co-processor and continue with its own task.

On the DSP side, we can read data from the pool and copy it into the DSP RAM. In some cases this decreases the function execution time on the DSP side; however, if the copying overhead dominates the execution, copying does not reduce the total execution time (e.g., histogram equalization of 8 images takes 8 ms less if we read and write data directly from/to the pool without copying it to DSP RAM). The time spent copying from the pool to DSP RAM char by char is presented in Table 5:

Amount of data copied from the pool to DSP RAM (KB) | Time spent (ms)
40 | ~2.5
120 | ~4.5
240 | ~7.2
320 | ~9

Table 5. Overhead of copying from the pool to DSP RAM

Meanwhile, copying from the pool to GPP RAM takes approximately the same amount of time.

Copying from the pool can be sped up significantly if we copy elements of a bigger type (e.g. long instead of char); the speedup ratio is the same as for the writing process (see Table 4). A sketch of such a wide copy follows.
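A sketch of the wide copy (the function is illustrative and assumes the byte count is a multiple of sizeof(long) and both buffers are suitably aligned):

#include <stddef.h>

/* Copy long-by-long instead of char-by-char; on this DSP a long is
 * 64 bits (per the sizeof observation above), so each iteration
 * moves 8 bytes and the loop runs 8 times fewer iterations. */
static void copy_wide(void *dst, const void *src, size_t bytes)
{
    long *d = (long *)dst;
    const long *s = (const long *)src;
    size_t i, n = bytes / sizeof(long);

    for (i = 0; i < n; i++)
        d[i] = s[i];
}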


b. Function execution on the DSP side

Among all functions in the algorithm, the histogram equalization is the most suitable for execution on the DSP side, as in most cases the same operation is carried out on large sets of data (e.g. pixels, or the bins of the histogram) with narrow types (8 or 16 bit). A DSP is typically designed to carry out operations of this type efficiently. Thus, we considered this function for extra optimization in order to gain speed.

A direct port of the unrolled GPP version does not work well on the DSP. For example, histogram equalization of 8 images on the DSP side takes ~45.5 ms (reading from and writing to the pool), while on the GPP the same operation takes no more than 18 ms.

Although the keywords and macros used in the matrix multiplication case mostly do not change the cycle times of the loops here, slight modifications of the loop bodies that eliminate data dependencies can increase performance. One example of such a dependency is in the loop where we accumulate the histogram: there is a read and a write to the same memory address in every loop iteration, which potentially generates RAW hazards in the pipeline. To alleviate this problem, we can accumulate the results in a temporary histogram. On other occasions we manually unrolled loops in which the same index variable was used to address different parts of the same array. This can prevent the compiler from parallelizing the instructions, as they read the same register at the same time; thus, we created separate index variables, each incremented in every iteration. A sketch of both techniques follows.
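A sketch of both tricks applied to the histogram accumulation (names are assumed): the unrolled halves increment two separate arrays through separate index variables, and the partial histograms are merged afterwards.

#include <string.h>

#define N_BINS 256

/* n_pixels is assumed even and small enough that the 16-bit counts
 * do not overflow. Two partial histograms break the read-modify-write
 * chain on a single array; separate indices keep the unrolled halves
 * independent. */
static void histogram(const unsigned char *img, int n_pixels,
                      unsigned short hist[N_BINS])
{
    unsigned short tmp[N_BINS];
    int i, j, k;

    memset(hist, 0, N_BINS * sizeof(hist[0]));
    memset(tmp, 0, sizeof(tmp));

    for (i = 0, j = 1; i < n_pixels; i += 2, j += 2) {
        hist[img[i]]++;     /* even pixels */
        tmp[img[j]]++;      /* odd pixels go to the scratch array */
    }
    for (k = 0; k < N_BINS; k++)
        hist[k] += tmp[k];  /* merge the partial histograms */
}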

Apart from the above techniques, we tried to declare the alignment of the variables using the _nassert() intrinsic, in the hope that the compiler would use packed operations. Surprisingly, these efforts did not lead to any improvement in the initiation intervals of the loops. It is also possible to use packed instructions explicitly with the help of intrinsics; however, we found that deeper knowledge of the assembly and the architecture of the DSP is needed to use these.

With all these optimizations, the execution time became ~4 ms shorter.

c. Use of Texas Instruments software libraries

We tried two software libraries [7] provided free of charge by Texas Instruments: FastRTS and IMGLIB. These libraries can be linked into any little-endian project and have been optimized by the vendor's engineers.


The FastRTS library provides floating-point emulation for fixed-point DSPs. We experimented with replacing the division in the histogram_equalization function with the function provided by this library and casting the result to integer; this yielded very poor results.

IMGLIB, on the other hand, contains a function for histogram calculation which uses a 256-entry array as output, just like the loop in the original program, which is poorly optimized by the compiler. An obvious approach is therefore to exchange this loop for the library implementation. The histogram function has two implementations: a natural-C version and an assembly-level optimized version. The assembly-optimized version performed very poorly in terms of speed, whereas the natural-C implementation has approximately the same speed as the original loop.

d. Use of compilation flags for program execution on the DSP

In order to optimize the release version on the DSP, we added the -k flag to CFLAGS. This flag keeps the assembly file so that we can inspect and analyze the compiler feedback. However, after obtaining the optimized version we removed the flag, as it also influences the total performance of the program. To make it easier to inspect optimization problems in the assembly file, we also used the -ss flag, which adds optimizer comments to the assembly. Like -k, this flag degrades the performance of the program. The -mw option, however, can be used to list software pipeline information in the assembly file without degrading the performance.

Memory dependencies are a problem that can conceal the real optimized performance. A dependency implies that one piece of code can only be executed after another. For optimum efficiency, the compiler should parallelize as many instructions as possible. To help the compiler determine that instructions are independent, the -pm flag should be used: it gives the compiler global access to the whole program or module and allows it to be more aggressive in ruling out dependencies. It is also mentioned in [5] that -pm should be applied to as much of the program as possible. The -mt flag can also be used to eliminate dependencies; using it is equivalent to adding the .no_mdep directive to the linear assembly source file.

The combination of -pm and -op2 allows the compiler to inline a function call in a loop. It may happen that the compiler cannot inline a function invoked inside a loop, and this inability to inline may in turn make it unable


to parallelize the execution of the code. To prevent this, -pm and -op2 should be used to enable automatic inlining of function calls inside loops.

To allow speculative execution, an appropriate amount of padding must be available in data memory to ensure correct execution; the -mh flag can be used for this.

Last but not least, -o3, which was already used in the original Makefile of the program, provides the highest level of optimization available. This flag maximizes compiler analysis and optimization: various loop optimizations are performed, such as software pipelining, unrolling, and SIMD, and some file-level characteristics are also used to improve performance [5].

e. Algorithm parallelization

1st version: This version of the program outsources the equalization of the 8 test images to the DSP. However, due to the communication overhead (9 ms for writing to the pool and 9 ms for reading from it on the GPP side), the whole benefit of the independent parallel equalization on the DSP side is eliminated. The total execution time is ~102 ms, which is 3 ms longer than execution on the GPP alone, described in Section 5f.

2nd version: As shown above, the communication overhead plays a significant role in the total execution time. We therefore decided to limit the communication between the GPP and DSP as much as possible. The best way to meet this limitation is to let the DSP independently equalize and build the HMM for one type of image (e.g. elephants). Meanwhile, the GPP equalizes the other type (tiger images) as well as the test images, and writes the test images into the pool so that the DSP can independently compare them with the averaged elephant picture. To deliver the recognition result, the DSP then sends only 9 notifications (8 with the values of the tiger variables and one carrying hmm_eleph_states), which takes approximately 30 us (version 3 of analyse_img is used, see Section 5f). To organize such parallel processing, timing has to be taken into account; e.g. the DSP must not compare the averaged elephant picture with test images in the pool before those images have been equalized and written into the pool by the GPP (semaphores are used for this). The corresponding process is represented in Scheme 1:


Scheme 1. Parallel image recognition process

The measurements show that the total execution time is 79.2 ms. The described program organization has several advantages:

- The amount of communication is limited; the total communication overhead is ~15 ms.
- Almost half of the algorithm is executed in parallel.
- The synchronization misalignment is less than 2 ms: the GPP waits for the DSP only at the very end (before giving the recognition result).

f. Conclusion

In this section we described how to parallelize the program to make its execution faster. It is necessary to highlight, however, that the communication overhead takes a significant part of the program execution time: 17%. Because of this overhead, the parallelization described in Version 1 (see Section 6e) does not give any benefit. However, since it is possible to separate the program execution into two almost independent parts (as shown in Version 2, Section 6e), the parallelization significantly reduces the program execution time. In our case, the execution


speedup due to GPP-DSP parallelization (Version 2) is almost 25% compared with execution on the GPP side only (79.2 ms versus 98.6 ms).

Fig. 10: Execution times on the GPP and DSP sides

Fig. 10 represents the time spent on different operations on the GPP and DSP sides. The DSP spends more time on every function execution than the GPP. Histogram equalization on the DSP side is much slower than on the GPP (more than 2 times slower). This can be explained by communication overhead, since the DSP reads directly from the pool during the histogram equalization of the elephant images. The other functionality is performed in RAM on the DSP side, so the execution times of functions like HMM building and analyse_img are approximately the same on the DSP as on the GPP side.

Comparing the pool-notify architecture with the previously used message queues, it is clear that the different communication architectures suit different data transfer and synchronization requirements. Message queues offer simple functionality and are suitable for infrequent data transfers between two parties, at the cost of considerable message-header overhead, whereas the pool can be used as storage for large amounts of data available to many cores. Moreover, message queues couple the systems tightly, whereas notifications couple them loosely.

Several attempts were made to decrease the communication overhead: use of the Texas Instruments libraries, compiler optimization, and changing the type of the transferred data (packing 4 chars into one long). The communication overhead was decreased by the latter approach approximately 7 times; however, the overhead still takes a significant portion of the total execution


time. As future work, the use of another type of communication, Dynamic Memory Mapping [9], could be considered.

7. Final conclusion

In this course we studied programming techniques for heterogeneous multiprocessor architectures. We have seen how an application can be separated into functionally different components (control and data processing) and assigned to the processing units best suited for the given task. We have seen that performance is influenced just as much by an adequate choice of the communication architecture as by the optimality of the code running on a given core. We also saw that the performance of the system can be increased by an auxiliary core even if it executes the given code more slowly than the main core. Thus, in these systems load balancing is a key factor in achieving a performance increase.

References

[1] Texas Instruments, OMAP3530/25 Applications Processor, Oct 2009.

[2] Texas Instruments, DSP/BIOS LINK User Guide, Version 1.60, Oct 21, 2008.

[3] Texas Instruments, TMS320C6000 Optimizing C Compiler (datasheet SPRU187O).

[4] http://processors.wiki.ti.com/index.php/C6000_Compiler:_Tuning_Software_Pipelined_Loops

[5] Texas Instruments, TMS320C6000 Optimizing C Compiler Tutorial (SPRU425A).

[6] IAR Systems, Fast square root in C, IAR Application Note G-002, http://supp.iar.com/FilesPublic/SUPPORT/000419/AN-G-002.pdf

[7] Texas Instruments, Software Libraries, http://processors.wiki.ti.com/index.php/software_libraries

[8] http://www.eecs.umich.edu/~sugih/pointers/gprof_quick.html

[9] Hari Kanigeri, Texas Instruments, OMAP3430 Bridge Overview, August 2008.

8. Appendix

Listings from the DSP assembly files:

    ;* SOFTWARE PIPELINE INFORMATION

    ;*

    ;* Loop source line : 25

    ;* Loop opening brace source line : 25

    ;* Loop closing brace source line : 27

    ;* Known Minimum Trip Count : 1

    ;* Known Maximum Trip Count : 255

    ;* Known Max Trip Count Factor : 1

    ;* Loop Carried Dependency Bound(^) : 9

    ;* Unpartitioned Resource Bound : 2

    ;* Partitioned Resource Bound(*) : 2

    ;* Resource Partition:

    ;* A-side B-side

    ;* .L units 0 0

    ;* .S units 2* 1

    ;* .D units 2* 1

    ;* .M units 1 0

    ;* .X cross paths 1 0

;* .T address paths 2* 1
;* Long read paths 0 0

    ;* Long write paths 0 0

    ;* Logical ops (.LS) 0 0 (.L or .S unit)

    ;* Addition ops (.LSD) 1 0 (.L or .S or .D unit)

    ;* Bound(.L .S .LS) 1 1

    ;* Bound(.L .S .D .LS .LSD) 2* 1

    ;*

    ;* Searching for software pipeline schedule at ...

    ;* ii = 9 Schedule found with 1 iterations in parallel

    Listing 1: original matrix multiplication code

    ;* SOFTWARE PIPELINE INFORMATION

    ;*

    ;* Loop source line : 25

    ;* Loop opening brace source line : 25

    ;* Loop closing brace source line : 27

    ;* Known Minimum Trip Count : 4

    ;* Known Maximum Trip Count : 255

    ;* Known Max Trip Count Factor : 1

    ;* Loop Carried Dependency Bound(^) : 1

    ;* Unpartitioned Resource Bound : 3

    ;* Partitioned Resource Bound(*) : 3

    ;* Resource Partition:

    ;* A-side B-side

    ;* .L units 0 0

    ;* .S units 2 1

;* .D units 3* 2
;* .M units 1 3*
;* .X cross paths 1 2
;* .T address paths 3* 2
;* Long read paths 0 0
;* Long write paths 0 0
;* Logical ops (.LS) 0 0 (.L or .S unit)
;* Addition ops (.LSD) 1 3 (.L or .S or .D unit)
;* Bound(.L .S .LS) 1 1
;* Bound(.L .S .D .LS .LSD) 2 2
;*
;* Searching for software pipeline schedule at ...

    ;* ii = 3 Schedule found with 4 iterations in parallel

    ;*

  • 8/6/2019 FINAL Group4

    24/24

    24

    Listing 2: optimizing with the restrict keyword

    ;* SOFTWARE PIPELINE INFORMATION

    ;*

    ;* Loop source line : 28

    ;* Loop opening brace source line : 28

    ;* Loop closing brace source line : 30

    ;* Known Minimum Trip Count : 20

    ;* Known Maximum Trip Count : 254

    ;* Known Max Trip Count Factor : 2

    ;* Loop Carried Dependency Bound(^) : 1

    ;* Unpartitioned Resource Bound : 3

    ;* Partitioned Resource Bound(*) : 3

    ;* Resource Partition:

    ;* A-side B-side

;* .L units 0 0
;* .S units 2 1
;* .D units 3* 2
;* .M units 1 3*
;* .X cross paths 1 2
;* .T address paths 3* 2
;* Long read paths 0 0
;* Long write paths 0 0
;* Logical ops (.LS) 0 0 (.L or .S unit)
;* Addition ops (.LSD) 1 3 (.L or .S or .D unit)
;* Bound(.L .S .LS) 1 1
;* Bound(.L .S .D .LS .LSD) 2 2
;*
;* Searching for software pipeline schedule at ...

    ;* ii = 3 Schedule found with 4 iterations in parallel

    Listing 3: loop-unrolling

    ;* SOFTWARE PIPELINE INFORMATION

    ;*

;* Loop source line : 34

    ;* Loop opening brace source line : 34

    ;* Loop closing brace source line : 36

    ;* Known Minimum Trip Count : 20

    ;* Known Maximum Trip Count : 254

;* Known Max Trip Count Factor : 2

    ;* Loop Carried Dependency Bound(^) : 1

;* Unpartitioned Resource Bound : 3
;* Partitioned Resource Bound(*) : 3

    ;* Resource Partition:

    ;* A-side B-side

    ;* .L units 0 0

    ;* .S units 2 1

    ;* .D units 3* 2

    ;* .M units 1 3*

    ;* .X cross paths 1 2

    ;* .T address paths 3* 2

    ;* Long read paths 0 0

    ;* Long write paths 0 0

    ;* Logical ops (.LS) 0 0 (.L or .S unit)

;* Addition ops (.LSD) 1 3 (.L or .S or .D unit)

    ;* Bound(.L .S .LS) 1 1

    ;* Bound(.L .S .D .LS .LSD) 2 2

    ;*

    ;* Searching for software pipeline schedule at ...

    ;* ii = 3 Schedule found with 4 iterations in parallel

    Listing 4: packed 16 bit arithmetic