8/6/2019 FINAL Group4
Delft University of Technology, EWI
IN4342 Embedded Systems Laboratory
Final Report
Group 4: D. Turi, D. Burlyaev, K. Falatehan, Z. Kocsi-Horváth
{D.Turi, D.Burlyaev, K.Falatehan, Z.KocsiHorvath}@student.tudelft.nl
Period 4, Delft University of Technology, EWI
1. Introduction

Typical embedded applications feature both control and data-processing functionality, whereas microprocessor architectures tend to be specialized to carry out only one of the two effectively. Thus, in many cases it is difficult to choose an architecture that yields high performance at low cost and with good energy efficiency. To tackle this problem, heterogeneous multiprocessor architectures have been developed, which in turn require special skills from the programmers.
In the following report we describe case studies of two simple applications. In both cases we partitioned preexisting algorithms between the processors so that maximum performance could be achieved. As hardware we used the BeagleBoard, which features the TI OMAP3530 chip.
Apart from presenting the programs, we also cover various techniques and problems associated with this field.
2. OMAP 3530 Architecture

The OMAP 3530 is a SoC (System on Chip) manufactured by Texas Instruments. It features a heterogeneous multiprocessor architecture consisting of two processor subsystems: an ARM Cortex-A8 GPP (general-purpose processor) and a TMS320C64x+ DSP (digital signal processor). This architecture was designed to facilitate high-performance computations that provide best-in-class video, image, and graphics processing [1].
Fig. 1 gives a brief overview of the architecture. The two processors are connected via a multilevel bus system which consists of the levels L3 and L4 (L1 and L2 designate the memories to which the cores have an exclusive connection). This hierarchy facilitates faster data transfer, as the shared on-chip memory is attached to L3, while the peripherals can be accessed via L4.

Fig. 1: The architecture of the OMAP 3530 SoC

We have used the vendor's library, DSPLINK,
which provides a layer of abstraction over the communication between the cores. Apart from data transfer, this library facilitates the Cortex-A8's control over the DSP.
There are numerous possibilities for communication between the processors, such as message queues, shared memory pools, notifications, etc. These possibilities, as well as different optimization approaches, are investigated in this report.
3. Matrix multiplication

a. Introduction

We began studying the capabilities of the chip through a simple program which carries out the multiplication of two N-by-N matrices. As the GPP runs a Linux operating system which provides direct access to the file system and the terminal output, it is much easier for inexperienced programmers to write programs for it. Thus, we implemented the matrix multiplication algorithm first on the GPP and later used it to verify the results obtained from the DSP.
Due to the indirect connection through DSPLINK it is more difficult to write programs for the
DSP. Moreover, without the necessary macros and keywords the compiler generates a program of
suboptimal performance; thus, additional programming effort is needed to make the usage of the
DSP feasible.
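As a reference point, the GPP-side baseline can be sketched as follows (a minimal sketch, not the exact code of the report; the function name is ours):

```c
#include <stdint.h>

/* Naive N-by-N matrix multiplication with 16-bit elements and a 32-bit
 * accumulator. The result is truncated back to 16 bits, which, as
 * discussed later, can overflow for large matrices. */
void matmul(int n, const int16_t *a, const int16_t *b, int16_t *c)
{
    for (int i = 0; i < n; i++)
        for (int j = 0; j < n; j++) {
            int32_t acc = 0;
            for (int k = 0; k < n; k++)
                acc += a[i * n + k] * b[k * n + j];
            c[i * n + j] = (int16_t)acc;
        }
}
```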
Apart from the optimality of the code, the other major influence on the execution time is the
communication architecture between the two cores. Therefore, it is necessary to measure and
fine-tune the communication, as well.
In this assignment we used message queues which not only facilitate data transfer but also
synchronize execution on both cores.
In the following sections we present our experiments with the above-mentioned algorithm and communication architecture. We conducted measurements with different matrix and message sizes. The elapsed time was measured using the system functions provided by the OS. Due to the stochastic nature of the execution time, we carried out multiple measurements with the same parameters and calculated the average. The results were written into files which were analyzed by a MATLAB program that generated graphs of the measurements. These
graphs facilitate a deeper understanding of the dependency of the execution time on the various
parameters.
In the following cases the upper part of the graphs shows the total execution time as a function of the matrix size, whereas the lower part represents the portion of the execution time per matrix element as a function of the matrix size.
b. Optimization of the communication

Fig. 2 shows the measurement results for the GPP execution. As can be seen, the execution time for a 128x128 matrix is about 80 ms. In the following, this result has been used to decide whether it is feasible to execute the multiplication on the DSP, i.e., the DSP execution time including the communication overhead has to be shorter than this time.
In the next experiment we inspected the effect of the communication overhead on the execution time. We implemented two versions of the same program. In the first version we used one message per matrix row to send the elements (see Fig. 3, in red). Thus, in the case of small matrices there is a large overhead, whereas large matrices can be transmitted efficiently. The results of this program were compared with the second version (see Fig. 3, in blue), where we tried to pack as many elements into one message as possible to ensure optimal communication.
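The two messaging strategies can be sketched as follows (the DSPLINK message-queue call is abstracted behind a hypothetical send_message() placeholder; the payload size is illustrative):

```c
#include <stdint.h>
#include <string.h>

#define MSG_PAYLOAD_BYTES 1024  /* hypothetical message payload size */

/* Placeholder for the actual DSPLINK MSGQ send; here it only counts
 * how many messages would be sent. */
static int messages_sent;
static void send_message(const void *data, size_t bytes)
{
    (void)data; (void)bytes;
    messages_sent++;
}

/* Strategy 1: one message per matrix row (large overhead for small N). */
void send_by_row(const int16_t *m, int n)
{
    for (int i = 0; i < n; i++)
        send_message(&m[i * n], (size_t)n * sizeof(int16_t));
}

/* Strategy 2: pack as many elements as fit into one message. */
void send_packed(const int16_t *m, int n)
{
    size_t total = (size_t)n * n * sizeof(int16_t);
    for (size_t off = 0; off < total; off += MSG_PAYLOAD_BYTES) {
        size_t chunk = total - off;
        if (chunk > MSG_PAYLOAD_BYTES)
            chunk = MSG_PAYLOAD_BYTES;
        send_message((const uint8_t *)m + off, chunk);
    }
}
```

For a 64x64 matrix of 16-bit elements (8192 bytes), strategy 1 sends 64 messages while strategy 2 sends only 8 with the payload size above.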
Fig. 2: Matrix multiplication on the GPP side (element size = 16 bit)
Fig. 3: The communication overhead. Red: message-by-row; blue: packing

The difference is especially visible in the lower part of the graph. Due to the efficiency
of the second algorithm fewer messages need to be sent; thus, the execution time is shorter.
However, this difference diminishes with larger matrices.
As we can see, the execution time is approximately 400 ms in both cases, which is much more
than that of the GPP.
We found two ways to decrease the elapsed time further: reducing the communication overhead (by increasing the message size) or making the DSP execution faster by optimizing the algorithm.
c. Optimization of the DSP code

The code on the DSP was mostly optimized automatically by the compiler. This can be enabled with the -o3 option (in the Makefile). The efficiency of the automatic optimization depends on how much information the compiler has about the program. This information can be supplied by the methods described below.
With software pipelining the compiler tries to overlap consecutive loop iterations in such a way that the hardware is utilized as fully as possible. An important parameter is the initiation interval (ii), the number of cycles per loop iteration; ii is equivalent to the cycle count of the software-pipelined loop body. The method is illustrated in Fig. 4 [4].
Apart from the cycle count of the loop body, other items are shown as well, e.g. the trip counts (the minimal and maximal number of iterations expected by the compiler) and the dependency and resource bounds (the constraints which stem from data dependencies and hardware resource limits). It is also shown how the instructions have been partitioned between the two data paths of the DSP (paths A and B).
The listing for the original code is shown in Listing 1 (see Appendix). The ii was determined to be 9 cycles. In the following we used keywords and macros to achieve better efficiency [5].

Fig. 4: Software pipelining [4]
We found that the compiler assumes a dependency between the pointers used to iterate through the multiplicands and the result. Knowing that these areas do not overlap, the programmer can declare these arrays with the restrict keyword. The result is shown in Listing 2. The initiation interval decreased to 3 cycles, which is a significant performance improvement. As we can see, this is the result of the loop-carried dependency bound, which fell from the previous 9 cycles to 1 cycle. With the dependency bound lowered, the resource bounds became the limiting factor.
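The effect of restrict can be sketched on a simplified stand-in for the inner-product loop (the function name and the assumption that the column operand is stored contiguously are ours, not the report's exact code):

```c
#include <stdint.h>

/* Declaring the pointers restrict promises the compiler that the output
 * never aliases the multiplicands, removing the assumed loop-carried
 * memory dependency that forced the long initiation interval. */
void row_times_col(int n,
                   const int16_t *restrict a,
                   const int16_t *restrict b,
                   int32_t *restrict out)
{
    int32_t acc = 0;
    for (int k = 0; k < n; k++)
        acc += a[k] * b[k];  /* both operands assumed contiguous */
    *out = acc;
}
```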
In order to optimize this result even further, we can try to use the data paths more efficiently. Because the number of operations is odd, the operations are assigned to the computational units in the two data paths asymmetrically (i.e., there are periods when the units in one path are computing while the units in the other path are idling). To solve this, the loop can be unrolled, in which case we have an even number of operations. After this, the compiler reorders the instructions in such a way that the computational units are evenly utilized.
To enable unrolling, the programmer can include the #pragma MUST_ITERATE(min, max, multiple) macro, which specifies the minimum and maximum number of iterations plus a value which is an integer divisor of both. In our program we chose this value to be 2. If the multiplicand matrices have an odd size, they are augmented by an extra row and column of zeros.
The result is shown in Listing 3. As we can see, the cycle count did not change, but the computational unit usage became more balanced. The trip count factor was changed to 2, as specified.
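The pragma is placed directly before the loop it describes, e.g. (the bounds here are illustrative; the pragma is TI-compiler specific and is simply ignored as unknown by other compilers):

```c
#include <stdint.h>

/* MUST_ITERATE(min, max, multiple): the trip count is at least 2, at
 * most 128 (hypothetical bounds for this sketch), and always a multiple
 * of 2, so the compiler may safely unroll the loop twice. */
int32_t dot16(int n, const int16_t *restrict a, const int16_t *restrict b)
{
    int32_t acc = 0;
#pragma MUST_ITERATE(2, 128, 2)
    for (int k = 0; k < n; k++)
        acc += a[k] * b[k];
    return acc;
}
```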
Fig. 5: Optimized version on the DSP

In order to further optimize the execution, we tried to exploit the SIMD capabilities of the DSP. The DSP is capable of executing 32-bit-wide arithmetic operations or, if specified, the same operations on two 16-bit-wide numbers in parallel (this is called packed 16-bit arithmetic). To be able to fetch the two 16-bit numbers in the same cycle, the data needs to be
aligned on halfword boundaries. The compiler can be convinced of the alignment with the _nassert((int)aligned_ptr % N == 0) function, where N is the alignment in bytes.
The result is shown in Listing 4. As we can see, the cycle count remained the same, which means that the execution takes roughly the same time for 32-bit as for 16-bit-wide data.
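A sketch of the alignment hint (the 4-byte boundary is our illustrative choice for fetching two 16-bit values with one 32-bit load; on non-TI compilers _nassert is mapped here to a plain assert so the sketch compiles off-target):

```c
#include <assert.h>
#include <stdint.h>

#ifndef _TMS320C6X
#define _nassert(e) assert(e)  /* fallback for non-TI compilers */
#endif

/* Promising alignment lets the compiler fetch two 16-bit elements with
 * a single wider load and apply packed 16-bit arithmetic. */
int32_t sum16(const int16_t *p, int n)
{
    _nassert(((intptr_t)p & 3) == 0);  /* assumed 4-byte alignment */
    int32_t s = 0;
    for (int i = 0; i < n; i++)
        s += p[i];
    return s;
}
```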
The result of the DSP optimization is shown in Fig. 5. As we can see, the execution time for a matrix size of 128x128 is about 80 ms, roughly the same value as on the GPP. Thus, using the DSP as a co-processor became feasible.
The other way to optimize is to use larger messages. This yields better results, since the relative share of the message header decreases in comparison with the total amount of sent data.
We found that the optimal message size is 256*48 bytes per message (for 16-bit matrix elements); for larger messages the latency grows (see Fig. 6). As a final result, the execution time is about 40 ms, about half of the GPP execution time.
Upon inspecting the results of the matrix multiplication and verifying them with MATLAB, we found that the 16-bit type used to store the matrix elements can overflow, so it is possible to get wrong values (on the DSP and GPP side alike) for large matrices even if the individual elements are small. Thus, we experimented with elements of 32 bits. The latencies approximately doubled, as we expected. For larger sizes there is an additional latency increase; the reason for this is the limited cache size.
Fig. 6: The effect of message size. Blue: maximum message size; red: optimal message size

d. Conclusion

In this assignment we studied the architecture and programming of a heterogeneous multicore processor. We created several versions of the programs, evolving from a very inefficient one to the optimized version. With each new version we tried to overcome the disadvantages of the previous one (communication inefficiency, compilation inefficiency,
etc.). We investigated the influence of compilation modes, message sizes, and communication protocols, as well as DSP optimization approaches, to make our program more than 400 times faster than it was at the beginning.
We showed that it is reasonable to use the DSP instead of the GPP for matrix multiplication, since the DSP is ~10 times faster. We also showed that big message sizes can lead not only to benefits but in some cases also to communication degradation.
4. GPROF analysis of the image recognition application

Fig. 7: GPROF output

Analyzing the gprof output is one of the important steps in boosting system performance. From this output we can learn where our program spent its time and which functions called which other functions while it was executing. In the original application code, we could thus easily locate which function(s) form the bottleneck of the system.
As represented in Fig. 7, gprof produces two tables, the flat profile and the call graph [8]. The flat profile shows the total amount of time the program spent executing each function. When analyzing the flat profile, it is always worthwhile to first check the first column, the execution time of a function as a percentage. This statistic is related to the fifth column, self ms/call, which depicts
the execution time per function call in ms. In our example above, we can easily tell that build_hmm() accounts for more of the total running time of the program than the other functions. This is no surprise, as the function also has the highest self ms/call. Despite this, the function is called only twice, in contrast with states_are_mergeable(), which has 98 calls. If we take a look at histogram_equalization(), this function is in second place both in percentage of the total running time and in number of calls. Based on these facts, we can assume that build_hmm() and/or histogram_equalization() are the biggest bottlenecks in the program.
The second table that is quite useful to look at is the call graph. The call graph shows how much time was spent in each function and its children. It often occurs that although a function itself does not cost much time, the calls to its children cause a bottleneck in the program. The functions are separated by lines of dashes that divide the table into entries; the callers of a function are always located in the preceding lines, followed by its children. In the second entry, build_hmm spends most of the time, followed by histogram_equalization.
Based on this information and what we observed in the flat profile, we can conclude that build_hmm() and histogram_equalization() are the two functions we should concentrate on to optimize the program.
5. Optimization on GPP

Execution of the initial code of the image processing application showed that it takes ~823 ms for the GPP to finish the whole algorithm and produce the correct recognition. However, after introducing the O3 optimization for the gcc compiler, the execution time is reduced almost twofold, to ~427 ms.
Several versions of the functions were written to explore other possibilities of performance optimization on the GPP. Hereafter, the main optimization steps are explained.
a. Square root approximation

A significant impact on the program execution time on the GPP (and DSP) is achieved by replacing the sqrt function of math.h in the build_hmm(int) function by an approximation. The method described in Fast square root in C [6] replaces the sqrt function by an integer-valued computation of the square root. Since the result of the square root is in any case cast to unsigned char to be saved in the stddev field of tiger_state_array and eleph_state_array, the final
accuracy of this approach reaches 100% in all cases where the argument of the square root is higher than 100. If the argument is less than 100, the approximation inaccuracy influences the performance of the algorithm, since the accuracy does not reach 70%. The decision was made to split the argument range into two parts: [0..100] and (100, ∞). For the range (100, ∞) the function presented in [6] is used, while for the range [0..100] a table with square root values was built, which took 100*8 bits = 100 bytes. The measurements have shown that replacing the sqrt function by the table and the integer-valued computation [6] reduces the application execution time by ~100 ms.
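The two-range approach can be sketched as follows (a standard bit-by-bit integer square root stands in for the exact routine of [6], and the table is filled at startup rather than hard-coded; both are our assumptions):

```c
#include <stdint.h>

/* Square-root values for arguments 0..100 (one byte each). */
static unsigned char sqrt_table[101];

static void init_sqrt_table(void)
{
    unsigned r = 0;
    for (unsigned x = 0; x <= 100; x++) {
        while ((r + 1) * (r + 1) <= x)
            r++;
        sqrt_table[x] = (unsigned char)r;
    }
}

/* Integer square root: table lookup for small arguments, integer-valued
 * computation (in the spirit of [6]) otherwise. */
unsigned isqrt(uint32_t x)
{
    if (x <= 100)
        return sqrt_table[x];
    uint32_t res = 0, bit = 1UL << 30;
    while (bit > x)
        bit >>= 2;
    while (bit) {
        if (x >= res + bit) {
            x -= res + bit;
            res = (res >> 1) + bit;
        } else {
            res >>= 1;
        }
        bit >>= 2;
    }
    return res;
}
```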
b. Replacement of floating point by integers

To avoid time-consuming floating-point operations on both the DSP and GPP sides, all float variables were replaced by integers and, where possible, even by unsigned char variables. For instance, the local sum variable (initially of float type) in the histogram_equalization function was replaced by an unsigned integer equivalent; the accuracy of the computation is not lost, since the sum variable is later cast to unsigned char anyway. In inequalities, divisions were replaced by multiplications according to algebraic rules. Constants that were initially float, e.g. TIGER_STATE_THRESHOLD, were multiplied by 100 to make integer equivalents. Corresponding changes in the algorithm were made to keep the computations correct.
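The transformation can be illustrated with a hypothetical threshold check (the name and value below are illustrative, not the actual constants of the application):

```c
/* Before: floating-point threshold and division, e.g.
 *   if ((float)diff / count < TIGER_STATE_THRESHOLD) ...
 * After: the threshold is pre-scaled by 100 and the division is moved
 * to the other side of the inequality as a multiplication (valid
 * because count is positive). */

#define TIGER_STATE_THRESHOLD_X100 35u  /* hypothetical: 0.35 * 100 */

int below_threshold(unsigned diff, unsigned count)
{
    /* equivalent to diff / count < 0.35 for count > 0 */
    return diff * 100u < TIGER_STATE_THRESHOLD_X100 * count;
}
```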
c. Loop unrolling

Manual loop unrolling inside functions like histogram_equalization, states_are_mergeable, build_hmm, and analyze_img showed that unrolling the innermost loops two times reduces the execution time on the GPP side by 3-8 ms. Unrolling the innermost loops four times did not show any performance increase in comparison with unrolling twice. As a result, where possible, the inner loops were unrolled twice. Special cases of improvements are described in the next sections.
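A minimal illustration of twofold unrolling (the function is a generic stand-in, not the application's code; an even trip count is assumed, as the report guarantees by padding):

```c
#include <stdint.h>

/* Sum of absolute differences between two images. The loop is unrolled
 * twice with two independent accumulators, giving the compiler two
 * independent operations per iteration to schedule in parallel.
 * n must be even. */
uint32_t sad_unrolled(const unsigned char *a, const unsigned char *b, int n)
{
    uint32_t s0 = 0, s1 = 0;
    for (int i = 0; i < n; i += 2) {
        s0 += (uint32_t)(a[i] > b[i] ? a[i] - b[i] : b[i] - a[i]);
        s1 += (uint32_t)(a[i + 1] > b[i + 1] ? a[i + 1] - b[i + 1]
                                             : b[i + 1] - a[i + 1]);
    }
    return s0 + s1;
}
```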
d. Histogram equalization improvements

Several versions of the function were written to reach high performance:
1st version: The loops of the function were unrolled in such a way that during one iteration of the outermost loop the function processes 3 images of 3 different types (tiger, elephant, and test images). With this unrolling the program executes within ~489 ms without the O3 compiler optimization, and in 116 ms with O3 gcc.
2nd version: All loops were unrolled 2 times. Corresponding results: ~486 ms without O3; ~109 ms with O3 gcc optimization.
3rd version: This version merges the previous two: the loops were unrolled 2 times and each iteration of the outermost loop processes three pictures of different types. The performance without O3 optimization is ~485.3 ms; with O3, ~118 ms.
4th version: This version is based on version 1, but instead of processing three images of different types, it processes 2 images of the same type per call.
Versions of histogram_equalization     | 24-image equalization time (ms, with O3) | Function execution time (ms, with O3) | Total algorithm time on GPP (ms), without O3 | with O3
Version 1                              | 59.9 | 7.49 | 489   | 116
Version 2 (2 times unrolling)          | 53.5 | 2.23 | 486   | 109
  Version 2, 4 times unrolling         |      |      |       | 110.7
  Version 2, without unrolling         |      |      |       | 114.5
Version 3                              | 76   | 9.5  | 485.3 | 118
Version 4                              | 58.1 | 2.42 | 485.5 | 111.2
Table 1. Performance of histogram_equalization function versions
According to the measurements presented in Table 1, the most significant improvement is gained by version 2 (which utilizes unrolling twice). This version is used in all further versions of the programs.
It is worth mentioning that the execution time depends on the images the function is processing: the execution time for 8 tiger images is 1 ms longer than for the equalization of 8 elephant images.
e. HMM building improvements

For the Hidden Markov Model algorithm, several versions of the build_hmm function were created:
1st version: The algorithm is organized in such a way that the i-th element of the average field of a state structure is filled during the same iteration as the (i-1)-th element of the stddev field of the same state structure. The version is organized so that elephant and tiger images are processed together in each loop iteration of the function. Thus, the function is called only once, which can restrict
flexibility for further parallelization between GPP and DSP. The performance of this and the other version is presented in Table 2.
2nd version: The inner loops for the average and stddev calculation were eliminated by unrolling 8 times (we have 8 images of elephants and 8 of tigers). Moreover, the loops for the calculation of the average and stddev elements were merged into one, and the innermost loop was unrolled 2 times. Since the function states_are_mergeable is called inside the algorithm, it was also unrolled twice to reach a better execution time. The results of the measurements are presented in Table 2.
Versions of build_hmm                                        | Function execution time (ms, with O3) | Total algorithm time on GPP (ms), without O3 | with O3
Version 1                                                    | 208 | 571 | 254
Version 2                                                    | 19  | 396 | 101
  Version 2, without changes in states_are_mergeable         |     |     | 102.3
  Version 2, 4 times unrolling of HMM building instead of 2  |     |     | 108.6
  Version 2, HMM building without unrolling                  |     |     | 110
Table 2. Performance of build_hmm function versions
According to the performance measurements, the 1st version is ~5 times slower than the 2nd version. This can be explained by the fact that version 2 better utilizes the spatial locality of the memory.
f. Improvements for the process of image analysis (analyze_img)

Several versions of the analyze_img function were written and tested to find the best program execution time:
1st version: The two inner loops for comparison with the averaged tiger and elephant images were merged into one.
2nd version: This version is based on version 1; however, the inner loop of the merged loop was unrolled 2 and 4 times. To compare test images with the averaged tiger (see Fig. 8) and averaged elephant images (see Fig. 9), a switch-case structure was implemented based on the is_deleted fields. The results are presented in Table 3.
3rd version: This version was created specifically for DSP-GPP parallelization. The function was separated into two independent comparisons, with the averaged image of the tiger and the averaged image of the elephant. This version consists of three functions: the outputs of the first two are integer arrays of
8 elements that contain the values of the tiger and eleph variables (which represent the level of similarity between the test images and the averaged images of the tiger and the elephant) for each test image. The third function compares these two arrays and outputs the recognition result.
Versions of analyze_img                        | Function execution time (ms, with O3) | Total algorithm time on GPP (ms), without O3 | with O3
Version 1                                      | 11.5 | 340 | 101
Version 2 (twice unrolled)                     | 9.5  | 333 |
  Version 2, unrolled 4 times instead of two   |      |     | 100
Version 3                                      | 10.5 | 370 | 102
Table 3. Performance of analyze_img function versions
g. Use of compilation flags for the program execution on GPP

Several options were tried to decrease the execution time below the 98.6 ms that was reached using the fastest versions described in the previous sections. However, the use of the -funroll-loops flag increased the time by 2 ms, and -finline-functions and -finline-small-functions did not show visible effects. The explanation for these results can be that the O3 optimization used to reach 98.6 ms already provides the maximum compiler optimization, so the introduction of other flags might not influence the performance and in several cases even increases the total execution time.
h. Conclusion

As presented in the previous sections, a significant execution time reduction can be reached by using integer variables instead of floating-point variables (where it does not corrupt the algorithm), loop unrolling, loop merging, and an integer-valued sqrt approximation. However, it was also shown that unrolling loops four times and several compilation flags may even increase the execution time.

Fig. 8: The averaged elephant image. Fig. 9: The averaged tiger image
Diagram 1 shows the execution-time reduction that was described step by step in the previous sections. The program executed only on the GPP can be divided into 3 main steps: histogram equalization, HMM building, and the recognition process. The normalized Diagram 2 shows that the O3 compiler optimization had the most significant effect on the analyze_img function. However, after applying the manual optimizations, the portion of HMM building in the total execution time decreases significantly.
The function versions that show the highest execution speed (marked in bold inside Tables 1, 2, and 3) are used in the GPP-DSP algorithm parallelization.
6. Parallelization with the DSP

Until now the DSP has not been used, but since it is possible to parallelize the initial algorithm between the GPP and DSP units, the opportunity to decrease the execution time was investigated.

a. Communication DSP-GPP

The main obstacle to simple parallelization is the communication overhead. According to the conducted measurements, the GPP spends 47 ms to write 320 KB (the equivalent of 8 images) into the pool. Since the execution of the whole program on the GPP takes ~98.6 ms, a communication overhead of 47 ms is not acceptable. The easiest way to reduce the communication overhead is to write larger element types into the pool. Table 4 presents the time spent by the GPP to send 320 KB into the pool.
Diagram 1: Execution time of the different program versions (histogram equalization / histogram_equalization, HMM building / build_hmm, recognition process / analyze_img; bars: initial version, initial version with O3 optimization, final version)
Diagram 2: Normalized execution time of the different program versions
Type of element written separately to the pool | Size of element type (bits) | Time to send 320 KB into pool (ms)
char                                           | 8                           | 47
Int32                                          | 32                          | ~12.5
long                                           | 64                          | ~7
Table 4. Overhead of sending data into the pool
In the compiler manual we found that the long type on the DSP side corresponds to 40 bits, but the sizeof operator returned 64 bits. Moreover, the compiler treats a long long variable as a variable of long type (64 bits).
Sending one notification (Int32) takes ~30 us, which is negligible in comparison with writing to the pool.
An interesting observation is that writing to the pool takes less time than reading from it. This can be explained as follows: when writing to the pool, the API function sets up a DMA transfer, starts it, and can continue execution right after, whereas in the other direction we have to wait until the DMA has finished fetching all the data. This asymmetry is in favor of the main processor, as it can quickly transfer data to the co-processor and continue its own task.
On the DSP side, we can read data from the pool and write it to the DSP RAM. In some cases this will decrease the function execution time on the DSP side; however, if the copying overhead takes the main portion of the execution time, the copying will not reduce the total execution time (e.g., histogram equalization for 8 images takes 8 ms less if we read and write data directly from/to the pool without copying it to the DSP RAM). The time spent copying from the pool, char by char, to the DSP RAM is presented in the table below:
Amount of data written to DSP RAM from the pool (KB) | Spent time (ms)
40                                                   | ~2.5
120                                                  | ~4.5
240                                                  | ~7.2
320                                                  | ~9
Table 5. Overhead of writing from the pool to DSP RAM

Meanwhile, copying from the pool to the GPP RAM takes approximately the same amount of time. Copying from the pool can be sped up significantly if we copy elements of a bigger type (e.g. long instead of char). The proportion of the speedup is the same as during the writing process (see Table 4).
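The effect of copying with a wider element type can be sketched as follows (a plain memory copy stands in for the pool API; the alignment assumptions are ours):

```c
#include <stdint.h>
#include <string.h>

/* Byte-by-byte copy: one load and one store per byte. */
void copy_bytes(uint8_t *dst, const uint8_t *src, size_t n)
{
    for (size_t i = 0; i < n; i++)
        dst[i] = src[i];
}

/* Copy by 64-bit words: eight bytes per load/store pair, assuming
 * n_bytes is a multiple of 8 and both buffers are 8-byte aligned,
 * which is how the pool buffers were laid out in our measurements. */
void copy_words(uint64_t *dst, const uint64_t *src, size_t n_bytes)
{
    size_t n = n_bytes / sizeof(uint64_t);
    for (size_t i = 0; i < n; i++)
        dst[i] = src[i];
}
```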
b. Function execution on the DSP side

Among all functions in the algorithm, the histogram equalization is the most suitable for execution on the DSP side, as in most cases the same operation is carried out on large sets of data (e.g. pixels, or the bins of the histogram) with narrow types (8 or 16 bit). A DSP is usually designed to carry out operations of this type efficiently. Thus, we considered this function for extra optimization in order to gain speed.
A direct port of the GPP version with unrolling does not perform well on the DSP. For example, histogram equalization for 8 images on the DSP side takes ~45.5 ms (reading and writing to the pool), while on the GPP the same operation takes no more than 18 ms.
Although the keywords and macros which were used in the case of the matrix multiplication mostly do not change the cycle times of the loops here, slight modifications in the loop bodies which eliminate data dependencies can increase performance. One example of these dependencies is in the loop where we accumulate the histogram: there is a read and a write to the same memory address in every loop iteration, which potentially generates RAW hazards in the pipeline. To alleviate this problem we can write the results into a temporary histogram. In other places we manually unrolled loops in which we used the same index variable to address different parts of the same array. This can cause problems, as the compiler cannot parallelize these instructions when they read the same register at the same time. Thus, we created new index variables which are incremented in every loop iteration.
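The dependency-breaking rewrite of the accumulation loop can be sketched as follows (a 256-bin histogram; the function names and the choice of two temporaries are ours):

```c
#include <stdint.h>
#include <string.h>

/* Original pattern: hist[p] is read and written in the same iteration,
 * a loop-carried memory dependency the pipeline cannot hide. */
void histogram_naive(const unsigned char *img, int n, uint16_t *hist)
{
    for (int i = 0; i < n; i++)
        hist[img[i]]++;
}

/* Rewritten: accumulate into two temporary histograms with independent
 * index variables, then merge; consecutive iterations no longer update
 * the same array back-to-back. */
void histogram_split(const unsigned char *img, int n, uint16_t *hist)
{
    uint16_t t0[256], t1[256];
    memset(t0, 0, sizeof t0);
    memset(t1, 0, sizeof t1);
    int i0 = 0, i1 = 1;
    for (; i1 < n; i0 += 2, i1 += 2) {
        t0[img[i0]]++;
        t1[img[i1]]++;
    }
    if (i0 < n)
        t0[img[i0]]++;  /* odd tail element */
    for (int b = 0; b < 256; b++)
        hist[b] = (uint16_t)(t0[b] + t1[b]);
}
```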
Apart from the above techniques, we tried to express the alignment of the variables with the _nassert() intrinsic in the hope that the compiler would use packed operations. Surprisingly, these efforts did not lead to any improvements in the initiation intervals of the loops. It is also possible to use packed instructions explicitly with the help of intrinsics; however, we found that deeper knowledge of the assembly and the architecture of the DSP is needed to use these.
With all these optimizations, the execution time became ~4 ms shorter.
c. Use of Texas Instruments software libraries

We tried using two software libraries [7] provided free of charge by Texas Instruments: FastRTS and IMGLIB. These libraries can be linked to any little-endian project and have been optimized by the engineers of the vendor.
The FastRTS library provides floating-point mathematics emulation for fixed-point DSPs. We experimented with changing the division in the histogram_equalization function to the function provided by this library and then casting the result to an integer. This yielded very poor results.
On the other hand, IMGLIB contains a function for the histogram calculation, which uses a 256-entry array as output, just like the loop in the original program, which is poorly optimized by the compiler. Thus, an obvious approach is to exchange this loop for the library implementation. The histogram function has two implementations: a natural C and an assembly-level optimized version. The assembly-optimized version performed very poorly in terms of speed, whereas the natural C implementation has approximately the same speed as the original loop.
d. Use of compilation flags for the program execution on DSP

In order to be able to analyze the release version on the DSP, we added the -k flag to CFLAGS. The -k flag keeps the assembly file so that we can inspect and analyze the compiler feedback. However, after obtaining the optimized version, we removed this flag, as it also influences the total performance of the program. To make it easier to inspect the optimization problems inside the assembly file, we also used the -ss flag, which adds optimizer comments to the assembly. As with -k, this flag also degrades the performance of the program. The option -mw, however, can be used to list software-pipeline information in the assembly file without degrading the performance.
Memory dependencies are problems that might conceal the real optimized performance. A dependency implies that one piece of code can only be executed after the execution of another. To achieve optimum efficiency, the compiler should parallelize as many instructions as possible. To help the compiler determine whether instructions are independent, the -pm flag should be used, which gives the compiler global access to the whole program or module and allows it to be more aggressive in ruling out dependencies. It is also mentioned in [5] that -pm should be applied to as much of the program as possible. The -mt flag can also be used to eliminate dependencies; using this flag is equivalent to adding the .no_mdep directive to the linear assembly source file.
The combination of -pm and -op2 allows the compiler to inline a function call inside a loop. Sometimes the compiler is unable to inline a function that is invoked inside a loop, and this inability to inline can in turn prevent parallelization of the code. To avoid this, -pm and -op2 should be used together to enable automatic inlining of function calls inside loops.
To allow speculative execution, an appropriate amount of padding must be available in data memory to ensure correct execution. The -mh option can be used for this.
Last but not least, the -o3 flag, which was already used in the original Makefile of the program, provides the highest level of optimization available. It maximizes compiler analysis and optimization: various loop optimizations are performed, such as software pipelining, unrolling, and SIMD, and file-level characteristics are also used to improve performance [5].
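Putting the flags together, the compiler options used while tuning might look as follows (a sketch only; the exact variable names depend on the project Makefile, and -k and -ss were removed again for the final timed build, as described above):

```make
# Flags used while inspecting the generated assembly; -k keeps the .asm
# file and -ss interlists optimizer comments, both of which cost speed.
CFLAGS_TUNE  = -o3 -pm -op2 -mh -mw -k -ss

# Flags for the final timed build: same optimizations, no inspection flags.
CFLAGS_FINAL = -o3 -pm -op2 -mh -mw
```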
e. Algorithm parallelization
1st version: This version of the program offloads the equalization of the 8 test images to the DSP. However, due to communication overhead (9 ms for writing to the pool, and 9 ms for reading from it on the GPP side), the benefit of independent parallel equalization on the DSP side is entirely eliminated. The total execution time is ~102 ms, which is 3 ms longer than execution on the GPP alone, described in Section 5f.
2nd version: As shown above, the communication overhead plays a significant role in the total execution time. As a result, we decided to limit the communication between GPP and DSP as much as possible. The most effective way to meet this limitation is to let the DSP independently equalize and build the HMM for one type of image (e.g. the elephants). Meanwhile, the GPP equalizes the other type of image (e.g. the tiger images) and the test images, and writes the test images into the pool so that the DSP can independently compare them with the averaged elephant picture. Thus, to deliver the recognition result, the DSP sends only 9 notifications (8 with the values of the tiger variables, and one to send hmm_eleph_states), which takes approximately 30 µs (analyse_img, version 3, is used; see Section 5f). To organize such parallel processing, timing must be taken into account: e.g. the DSP must not compare the averaged elephant picture with test images in the pool before those images have been equalized and written into the pool by the GPP (semaphores are used for this). The corresponding process is represented in Scheme 1:
Scheme 1. Parallel image recognition process
The measurements show that the total execution time is 79.2 ms. The described program organization has several advantages:
- the amount of communication is limited: the total communication overhead is ~15 ms;
- almost half of the algorithm is executed in parallel;
- the synchronization misalignment is less than 2 ms: the GPP waits for the DSP only at the very end (before giving the recognition result).
f. Conclusion
In this section we described how to parallelize the program to make its execution faster. However, it must be highlighted that communication overhead makes up a significant part of the program execution time (17%). Because of this overhead, the parallelization described in Version 1 (see Section 6e) gives no benefit. However, since the program execution can be separated into two almost independent parts (as shown in Version 2, Section 6e), the parallelization significantly reduces the program execution time. In our case, the speedup due to GPP-DSP parallelization (Version 2) is almost 25% compared with execution on the GPP side alone (79.2 ms versus 98.6 ms).
Fig. 10 Execution times on GPP and DSP sides
Fig. 10 shows the time spent on the different operations on the GPP and DSP sides. The DSP spends more time on every function than the GPP does. The histogram equalization on the DSP side is much slower than on the GPP (more than 2 times slower). This can be explained by communication overhead, since the DSP reads directly from the pool during histogram equalization of the elephant images. The other functionality is performed in RAM on the DSP side, so the execution times on the DSP side for functions such as HMM building and analyse_img are approximately the same as on the GPP side.
Comparing the pool-notify architecture with the previously used message queuing, it is evident that the different communication architectures suit different requirements in data transfer and synchronization. Message architectures feature simple functionality and are suitable for infrequent data transfer between two parties, with a considerable overhead due to the message headers, whereas the pool can be used as storage for large amounts of data that is available to many cores. Moreover, message queues provide tight coupling, whereas notification couples the systems loosely.
Several attempts were made to decrease the communication overhead: use of the Texas Instruments libraries, compiler optimization, and changing the format of the transmitted data (packing 4 chars into one long). The last approach decreased the communication overhead approximately 7 times; however, the overhead still takes a significant portion of the total execution time. As future work, another type of communication, Dynamic Memory Mapping [9], can be considered.
7. Final conclusion
In this course we studied programming techniques for heterogeneous multiprocessor architectures. We have seen how an application can be separated into functionally different components (control and data processing) and assigned to the processing units best suited to each task. We have seen that performance is influenced just as much by an adequate choice of communication architecture as by the optimality of the code running on a given core. We also saw that an auxiliary core can increase overall system performance even if it executes the given code more slowly than the main core. Thus, in these systems load balancing is a key factor in achieving a performance increase.
References
[1] Texas Instruments, OMAP3530/25 Applications Processor: Oct ,2009;
[2] Texas Instruments, DSP/BIOS LINK USER GUIDE Version 1.60: Oct 21, 2008;
[3] Texas Instruments, SPRU187O, TMS320C6000 Optimizing C Compiler;
[4] http://processors.wiki.ti.com/index.php/C6000_Compiler:_Tuning_Software_Pipelined_Loops;
[5] SPRU425A TMS320C6000 Optimizing C Compiler tutorial;
[6] Fast square root in C, IAR Application Note G-002, IAR Systems,
http://supp.iar.com/FilesPublic/SUPPORT/000419/AN-G-002.pdf;
[7] Texas Instruments, Software Libraries, http://processors.wiki.ti.com/index.php/software_libraries;
[8] http://www.eecs.umich.edu/~sugih/pointers/gprof_quick.html;
[9] Hari Kanigeri, Texas Instruments, OMAP3430 Bridge Overview, August 2008.
8. Appendix
Listings from the DSP assembly files:
;* SOFTWARE PIPELINE INFORMATION
;*
;* Loop source line : 25
;* Loop opening brace source line : 25
;* Loop closing brace source line : 27
;* Known Minimum Trip Count : 1
;* Known Maximum Trip Count : 255
;* Known Max Trip Count Factor : 1
;* Loop Carried Dependency Bound(^) : 9
;* Unpartitioned Resource Bound : 2
;* Partitioned Resource Bound(*) : 2
;* Resource Partition:
;* A-side B-side
;* .L units 0 0
;* .S units 2* 1
;* .D units 2* 1
;* .M units 1 0
;* .X cross paths 1 0
;* .T address paths 2* 1
;* Long read paths 0 0
;* Long write paths 0 0
;* Logical ops (.LS) 0 0 (.L or .S unit)
;* Addition ops (.LSD) 1 0 (.L or .S or .D unit)
;* Bound(.L .S .LS) 1 1
;* Bound(.L .S .D .LS .LSD) 2* 1
;*
;* Searching for software pipeline schedule at ...
;* ii = 9 Schedule found with 1 iterations in parallel
Listing 1: original matrix multiplication code
;* SOFTWARE PIPELINE INFORMATION
;*
;* Loop source line : 25
;* Loop opening brace source line : 25
;* Loop closing brace source line : 27
;* Known Minimum Trip Count : 4
;* Known Maximum Trip Count : 255
;* Known Max Trip Count Factor : 1
;* Loop Carried Dependency Bound(^) : 1
;* Unpartitioned Resource Bound : 3
;* Partitioned Resource Bound(*) : 3
;* Resource Partition:
;* A-side B-side
;* .L units 0 0
;* .S units 2 1
;* .D units 3* 2
;* .M units 1 3*
;* .X cross paths 1 2
;* .T address paths 3* 2
;* Long read paths 0 0
;* Long write paths 0 0
;* Logical ops (.LS) 0 0 (.L or .S unit)
;* Addition ops (.LSD) 1 3 (.L or .S or .D unit)
;* Bound(.L .S .LS) 1 1
;* Bound(.L .S .D .LS .LSD) 2 2
;*
;* Searching for software pipeline schedule at ...
;* ii = 3 Schedule found with 4 iterations in parallel
;*
Listing 2: optimizing with the restrict keyword
;* SOFTWARE PIPELINE INFORMATION
;*
;* Loop source line : 28
;* Loop opening brace source line : 28
;* Loop closing brace source line : 30
;* Known Minimum Trip Count : 20
;* Known Maximum Trip Count : 254
;* Known Max Trip Count Factor : 2
;* Loop Carried Dependency Bound(^) : 1
;* Unpartitioned Resource Bound : 3
;* Partitioned Resource Bound(*) : 3
;* Resource Partition:
;* A-side B-side
;* .L units 0 0
;* .S units 2 1
;* .D units 3* 2
;* .M units 1 3*
;* .X cross paths 1 2
;* .T address paths 3* 2
;* Long read paths 0 0
;* Long write paths 0 0
;* Logical ops (.LS) 0 0 (.L or .S unit)
;* Addition ops (.LSD) 1 3 (.L or .S or .D unit)
;* Bound(.L .S .LS) 1 1
;* Bound(.L .S .D .LS .LSD) 2 2
;*
;* Searching for software pipeline schedule at ...
;* ii = 3 Schedule found with 4 iterations in parallel
Listing 3: loop-unrolling
;* SOFTWARE PIPELINE INFORMATION
;*
;* Loop source line : 34
;* Loop opening brace source line : 34
;* Loop closing brace source line : 36
;* Known Minimum Trip Count : 20
;* Known Maximum Trip Count : 254
;* Known Max Trip Count Factor : 2
;* Loop Carried Dependency Bound(^) : 1
;* Unpartitioned Resource Bound : 3
;* Partitioned Resource Bound(*) : 3
;* Resource Partition:
;* A-side B-side
;* .L units 0 0
;* .S units 2 1
;* .D units 3* 2
;* .M units 1 3*
;* .X cross paths 1 2
;* .T address paths 3* 2
;* Long read paths 0 0
;* Long write paths 0 0
;* Logical ops (.LS) 0 0 (.L or .S unit)
;* Addition ops (.LSD) 1 3 (.L or .S or .D unit)
;* Bound(.L .S .LS) 1 1
;* Bound(.L .S .D .LS .LSD) 2 2
;*
;* Searching for software pipeline schedule at ...
;* ii = 3 Schedule found with 4 iterations in parallel
Listing 4: packed 16 bit arithmetic