8/6/2019 FINAL Group4
Delft University of Technology, EWI
IN4342 Embedded Systems Laboratory
Final Report
Group 4: D. Turi, D. Burlyaev, K. Falatehan, Z. Kocsi-Horváth
{D.Turi, D.Burlyaev, K.Falatehan, Z.KocsiHorvath}@student.tudelft.nl
Period 4, Delft University of Technology, EWI
1. Introduction

Typical embedded applications feature both control and data-processing functionality, whereas microprocessor architectures tend to be specialized to carry out only one of the two effectively. Thus, in many cases it is difficult to choose an architecture that yields high performance at low cost and with good energy efficiency. To tackle this problem, heterogeneous multiprocessor architectures have been developed, which in turn require special skills from the programmers.
In the following report we describe case studies of two simple applications. In both cases we partitioned preexisting algorithms between the processors so that maximum performance could be achieved. As hardware we used the BeagleBoard, which features the TI OMAP3530 chip.
Apart from presenting the programs, we also cover various techniques and problems associated with this field.
2. OMAP 3530 Architecture

The OMAP 3530 is a SoC (System on Chip) manufactured by Texas Instruments. It features a heterogeneous multiprocessor architecture consisting of two processor subsystems: an ARM Cortex-A8 GPP (general-purpose processor) and a TMS320C64x+ DSP (digital signal processor). This architecture was designed to facilitate high-performance computations that provide best-in-class video, image, and graphics processing [1].
Fig. 1 gives a brief overview of the architecture. The two processors are connected via a multilevel bus system which consists of the levels L3 and L4 (L1 and L2 designate the memories to which the cores have an exclusive connection). This hierarchy facilitates faster data transfer, as the shared on-chip memory is attached to L3, while the peripherals can be accessed via L4.

Fig. 1: The architecture of the OMAP 3530 SoC

We have used the vendor's library, DSPLINK,
which provides a layer of abstraction over the communication between the cores. Apart from data transfer, this library facilitates the Cortex-A8's control over the DSP.
There are numerous possibilities for communication between the processors, such as message queues, shared memory pools, notifications, etc. These possibilities, as well as different optimization approaches, are investigated in this report.
3. Matrix multiplication

a. Introduction

We began studying the capabilities of the chip through a simple program which carries out the multiplication of two N-by-N matrices. As the GPP runs a Linux operating system which provides direct access to the file system and the terminal output, it is much easier for inexperienced programmers to write programs for it. Thus, we implemented the matrix multiplication algorithm first on the GPP and later used it to verify the results obtained from the DSP.
Due to the indirect connection through DSPLINK it is more difficult to write programs for the
DSP. Moreover, without the necessary macros and keywords the compiler generates a program of
suboptimal performance; thus, additional programming effort is needed to make the usage of the
DSP feasible.
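As a reference point, the GPP-side baseline can be sketched as follows (a minimal sketch, not the exact code of the report; the function name is ours):

```c
#include <stdint.h>

/* Naive N-by-N matrix multiplication with 16-bit elements and a 32-bit
 * accumulator. The result is truncated back to 16 bits, which, as
 * discussed later, can overflow for large matrices. */
void matmul(int n, const int16_t *a, const int16_t *b, int16_t *c)
{
    for (int i = 0; i < n; i++)
        for (int j = 0; j < n; j++) {
            int32_t acc = 0;
            for (int k = 0; k < n; k++)
                acc += a[i * n + k] * b[k * n + j];
            c[i * n + j] = (int16_t)acc;
        }
}
```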
Apart from the optimality of the code, the other major influence on the execution time is the
communication architecture between the two cores. Therefore, it is necessary to measure and
fine-tune the communication, as well.
In this assignment we used message queues which not only facilitate data transfer but also
synchronize execution on both cores.
In the following sections we present our experiments with the above-mentioned algorithm and communication architecture. We conducted measurements with different matrix and message sizes. The elapsed time was measured using the system functions provided by the OS. Due to the stochastic nature of the execution time, we carried out multiple measurements with the same parameters and calculated the average. The results were written into files which were analyzed by a MATLAB program that generated graphs of the measurements. These
graphs facilitate a deeper understanding of the dependency of the execution time on the various
parameters.
In the following cases the upper part of the graphs shows the total execution time as a function of the matrix size, whereas the lower part represents the portion of the execution time per matrix element as a function of the matrix size.
b. Optimization of the communication

Fig. 2 shows the measurement results for the GPP execution. As can be seen, the execution time for a 128x128 matrix is about 80 ms. In the following, this result has been used to decide whether it is feasible to execute the multiplication on the DSP, i.e., the DSP execution time including the communication overhead has to be shorter than this time.
In the next experiment we inspected the effect of the communication overhead on the execution time. We implemented two versions of the same program. In the first version we used one message per matrix row to send the elements (see Fig. 3, in red). Thus, in the case of small matrices there is a large overhead, whereas large matrices can be transmitted efficiently. The results of this program were compared with the second version (see Fig. 3, in blue), where we tried to pack as many elements into one message as possible to ensure optimal communication.
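The two messaging strategies can be sketched as follows (the DSPLINK message-queue call is abstracted behind a hypothetical send_message() placeholder; the payload size is illustrative):

```c
#include <stdint.h>
#include <string.h>

#define MSG_PAYLOAD_BYTES 1024  /* hypothetical message payload size */

/* Placeholder for the actual DSPLINK MSGQ send; here it only counts
 * how many messages would be sent. */
static int messages_sent;
static void send_message(const void *data, size_t bytes)
{
    (void)data; (void)bytes;
    messages_sent++;
}

/* Strategy 1: one message per matrix row (large overhead for small N). */
void send_by_row(const int16_t *m, int n)
{
    for (int i = 0; i < n; i++)
        send_message(&m[i * n], (size_t)n * sizeof(int16_t));
}

/* Strategy 2: pack as many elements as fit into one message. */
void send_packed(const int16_t *m, int n)
{
    size_t total = (size_t)n * n * sizeof(int16_t);
    for (size_t off = 0; off < total; off += MSG_PAYLOAD_BYTES) {
        size_t chunk = total - off;
        if (chunk > MSG_PAYLOAD_BYTES)
            chunk = MSG_PAYLOAD_BYTES;
        send_message((const uint8_t *)m + off, chunk);
    }
}
```

For a 64x64 matrix of 16-bit elements (8192 bytes), strategy 1 sends 64 messages while strategy 2 sends only 8 with the payload size above.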
Fig. 2: Matrix multiplication on the GPP side (element size = 16 bit)
Fig. 3: The communication overhead. Red: message-by-row; blue: packing

The difference is especially visible in the lower part of the graph. Due to the efficiency
of the second algorithm fewer messages need to be sent; thus, the execution time is shorter.
However, this difference diminishes with larger matrices.
As we can see, the execution time is approximately 400 ms in both cases, which is much more
than that of the GPP.
We found two ways to decrease the elapsed time further: reducing the communication overhead (by increasing the message size) or making the DSP execution faster by optimizing the algorithm.
c. Optimization of the DSP code

The code on the DSP was mostly optimized automatically by the compiler. This can be enabled with the -o3 option (in the Makefile). The efficiency of the automatic optimization depends on how much information the compiler has about the program. This information can be supplied by the methods described below.
With software pipelining the compiler tries to overlap consecutive loop iterations in such a way that the hardware is utilized as fully as possible. An important parameter is the initiation interval (ii), the number of cycles per loop iteration; ii is equivalent to the cycle count of the software-pipelined loop body. The method is illustrated in Fig. 4 [4].
Apart from the cycle count of the loop body, other items are shown as well, e.g. the trip counts (the minimal and maximal number of iterations expected by the compiler) and the dependency and resource bounds (the constraints which stem from data dependencies and hardware resource limits). It is also shown how the instructions have been partitioned between the two data paths of the DSP (paths A and B).
The listing for the original code is shown in Listing 1 (see Appendix). The ii was determined to be 9 cycles. In the following we used keywords and macros to achieve better efficiency [5].

Fig. 4: Software pipelining [4]
We found that the compiler assumes a dependency between the pointers used to iterate through the multiplicands and the result. Knowing that these areas do not overlap, the programmer can declare these arrays with the restrict keyword. The result is shown in Listing 2. The initiation interval decreased to 3 cycles, which is a significant performance improvement. As we can see, this is the result of the loop-carried dependency bound, which fell from the previous 9 cycles to 1 cycle. With the dependency bound lowered, the resource bounds became the limiting factor.
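The effect of restrict can be sketched on a simplified stand-in for the inner-product loop (the function name and the assumption that the column operand is stored contiguously are ours, not the report's exact code):

```c
#include <stdint.h>

/* Declaring the pointers restrict promises the compiler that the output
 * never aliases the multiplicands, removing the assumed loop-carried
 * memory dependency that forced the long initiation interval. */
void row_times_col(int n,
                   const int16_t *restrict a,
                   const int16_t *restrict b,
                   int32_t *restrict out)
{
    int32_t acc = 0;
    for (int k = 0; k < n; k++)
        acc += a[k] * b[k];  /* both operands assumed contiguous */
    *out = acc;
}
```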
In order to optimize this result even further, we can try to use the data paths more efficiently. Because the number of operations is odd, the operations are assigned to the computational units in the two data paths asymmetrically (i.e., there are periods when the units in one path are computing while the units in the other path are idling). To solve this, the loop can be unrolled, in which case we have an even number of operations. After this, the compiler reorders the instructions in such a way that the computational units are evenly utilized.
To enable unrolling, the programmer can include the #pragma MUST_ITERATE(min, max, multiple) macro, which specifies the minimum and maximum number of iterations plus a value which is an integer divisor of both. In our program we chose this value to be 2. If the multiplicand matrices have an odd size, they are augmented by an extra row and column of zeros.
The result is shown in Listing 3. As we can see, the cycle count did not change, but the computational unit usage became more balanced. The trip count factor was changed to 2, as specified.
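The pragma is placed directly before the loop it describes, e.g. (the bounds here are illustrative; the pragma is TI-compiler specific and is simply ignored as unknown by other compilers):

```c
#include <stdint.h>

/* MUST_ITERATE(min, max, multiple): the trip count is at least 2, at
 * most 128 (hypothetical bounds for this sketch), and always a multiple
 * of 2, so the compiler may safely unroll the loop twice. */
int32_t dot16(int n, const int16_t *restrict a, const int16_t *restrict b)
{
    int32_t acc = 0;
#pragma MUST_ITERATE(2, 128, 2)
    for (int k = 0; k < n; k++)
        acc += a[k] * b[k];
    return acc;
}
```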
Fig. 5: Optimized version on the DSP

In order to further optimize the execution, we tried to exploit the SIMD capabilities of the DSP. The DSP is capable of executing 32-bit-wide arithmetic operations or, if specified, the same operations on two 16-bit-wide numbers in parallel (this is called packed 16-bit arithmetic). To be able to fetch the two 16-bit numbers in the same cycle, the data needs to be
aligned on halfword boundaries. The compiler can be convinced of the alignment with the _nassert((int)aligned_ptr % N == 0) function, where N is the alignment in bytes.
The result is shown in Listing 4. As we can see, the cycle count remained the same, which means that the execution takes roughly the same time for 32-bit as for 16-bit-wide data.
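A sketch of the alignment hint (the 4-byte boundary is our illustrative choice for fetching two 16-bit values with one 32-bit load; on non-TI compilers _nassert is mapped here to a plain assert so the sketch compiles off-target):

```c
#include <assert.h>
#include <stdint.h>

#ifndef _TMS320C6X
#define _nassert(e) assert(e)  /* fallback for non-TI compilers */
#endif

/* Promising alignment lets the compiler fetch two 16-bit elements with
 * a single wider load and apply packed 16-bit arithmetic. */
int32_t sum16(const int16_t *p, int n)
{
    _nassert(((intptr_t)p & 3) == 0);  /* assumed 4-byte alignment */
    int32_t s = 0;
    for (int i = 0; i < n; i++)
        s += p[i];
    return s;
}
```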
The result of the DSP optimization is shown in Fig. 5. As we can see, the execution time for a matrix size of 128x128 is about 80 ms, roughly the same value as on the GPP. Thus, using the DSP as a co-processor became feasible.
The other way to optimize is to use larger messages. This yields better results, since the relative share of the message header decreases in comparison with the total amount of sent data.
We found that the optimal message size is 256*48 bytes per message (for 16-bit matrix elements); for larger messages the latency grows (see Fig. 6). As a final result, the execution time is about 40 ms, about half of the GPP execution time.
Upon inspecting the results of the matrix multiplication and verifying them with MATLAB, we found that the 16-bit type used to store the matrix elements can overflow, so it is possible to get wrong values (on the DSP and GPP side alike) for large matrices even if the individual elements are small. Thus, we experimented with elements of 32 bits. The latencies approximately doubled, as we expected. For larger sizes there is an additional latency increase; the reason for this is the limited cache size.
Fig. 6: The effect of message size. Blue: maximum message size; red: optimal message size

d. Conclusion

In this assignment we studied the architecture and programming of a heterogeneous multicore processor. We created several versions of the programs, evolving from a very inefficient one to the optimized version. With each new version we tried to overcome the disadvantages of the previous one (communication inefficiency, compilation inefficiency,
etc.). We investigated the influence of compilation modes, message sizes, and communication protocols, as well as DSP optimization approaches, to make our program more than 400 times faster than it was at the beginning.
We showed that it is reasonable to use the DSP instead of the GPP for matrix multiplication, since the DSP is ~10 times faster. We also showed that big message sizes can lead not only to benefits but in some cases also to communication degradation.
4. GPROF analysis of the image recognition application

Fig. 7: GPROF output

Analyzing the gprof output is one of the important steps in boosting system performance. From this output we can learn where our program spent its time and which functions called which other functions while it was executing. In the original application code, we could thus easily locate which function(s) form the bottleneck of the system.
As represented in Fig. 7, gprof produces two tables, the flat profile and the call graph [8]. The flat profile shows the total amount of time the program spent executing each function. When analyzing the flat profile, it is always worthwhile to first check the first column, the execution time of a function as a percentage. This statistic is related to the fifth column, self ms/call, which depicts
the execution time per function call in ms. In our example above, we can easily tell that build_hmm() accounts for more of the total running time of the program than the other functions. This is no surprise, as the function also has the highest self ms/call. Despite this, the function is called only twice, in contrast with states_are_mergeable(), which has 98 calls. If we take a look at histogram_equalization(), this function is in second place both in percentage of the total running time and in number of calls. Based on these facts, we can assume that build_hmm() and/or histogram_equalization() are the biggest bottlenecks in the program.
The second table that is quite useful to look at is the call graph. The call graph shows how much time was spent in each function and its children. It often occurs that although a function itself does not cost much time, the calls to its children cause a bottleneck in the program. The functions are separated by lines of dashes that divide the table into entries; the callers of a function are always located in the preceding lines, followed by its children. In the second entry, build_hmm spends most of the time, followed by histogram_equalization.
Based on this information and what we observed in the flat profile, we can conclude that build_hmm() and histogram_equalization() are the two functions we should concentrate on to optimize the program.
5. Optimization on GPP

Execution of the initial code of the image processing application showed that it takes ~823 ms for the GPP to finish the whole algorithm and produce the correct recognition. However, after introducing the O3 optimization for the gcc compiler, the execution time is reduced almost twofold, to ~427 ms.
Several versions of the functions were written to explore other possibilities of performance optimization on the GPP. Hereafter, the main optimization steps are explained.
a. Square root approximation

A significant impact on the program execution time on the GPP (and DSP) is achieved by replacing the sqrt function of math.h in the build_hmm(int) function by an approximation. The method described in Fast square root in C [6] replaces the sqrt function by an integer-valued computation of the square root. Since the result of the square root is in any case cast to unsigned char to be saved in the stddev field of tiger_state_array and eleph_state_array, the final
accuracy of this approach reaches 100% in all cases where the argument of the square root is higher than 100. If the argument is less than 100, the approximation inaccuracy influences the performance of the algorithm, since the accuracy does not reach 70%. The decision was made to split the argument range into two parts: [0..100] and (100, ∞). For the range (100, ∞) the function presented in [6] is used, while for the range [0..100] a table with square root values was built, which took 100*8 bits = 100 bytes. The measurements have shown that replacing the sqrt function by the table and the integer-valued computation [6] reduces the application execution time by ~100 ms.
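The two-range approach can be sketched as follows (a standard bit-by-bit integer square root stands in for the exact routine of [6], and the table is filled at startup rather than hard-coded; both are our assumptions):

```c
#include <stdint.h>

/* Square-root values for arguments 0..100 (one byte each). */
static unsigned char sqrt_table[101];

static void init_sqrt_table(void)
{
    unsigned r = 0;
    for (unsigned x = 0; x <= 100; x++) {
        while ((r + 1) * (r + 1) <= x)
            r++;
        sqrt_table[x] = (unsigned char)r;
    }
}

/* Integer square root: table lookup for small arguments, integer-valued
 * computation (in the spirit of [6]) otherwise. */
unsigned isqrt(uint32_t x)
{
    if (x <= 100)
        return sqrt_table[x];
    uint32_t res = 0, bit = 1UL << 30;
    while (bit > x)
        bit >>= 2;
    while (bit) {
        if (x >= res + bit) {
            x -= res + bit;
            res = (res >> 1) + bit;
        } else {
            res >>= 1;
        }
        bit >>= 2;
    }
    return res;
}
```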
b. Replacement of floating point by integers

To avoid time-consuming floating-point operations on both the DSP and GPP sides, all float variables were replaced by integers and, where possible, even by unsigned char variables. For instance, the local sum variable (initially of float type) in the histogram_equalization function was replaced by an unsigned integer equivalent; the accuracy of the computation is not lost, since the sum variable is later cast to unsigned char anyway. In inequalities, divisions were replaced by multiplications according to algebraic rules. Constants that were initially float, e.g. TIGER_STATE_THRESHOLD, were multiplied by 100 to make integer equivalents. Corresponding changes in the algorithm were made to keep the computations correct.
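The transformation can be illustrated with a hypothetical threshold check (the name and value below are illustrative, not the actual constants of the application):

```c
/* Before: floating-point threshold and division, e.g.
 *   if ((float)diff / count < TIGER_STATE_THRESHOLD) ...
 * After: the threshold is pre-scaled by 100 and the division is moved
 * to the other side of the inequality as a multiplication (valid
 * because count is positive). */

#define TIGER_STATE_THRESHOLD_X100 35u  /* hypothetical: 0.35 * 100 */

int below_threshold(unsigned diff, unsigned count)
{
    /* equivalent to diff / count < 0.35 for count > 0 */
    return diff * 100u < TIGER_STATE_THRESHOLD_X100 * count;
}
```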
c. Loop unrolling

Manual loop unrolling inside functions like histogram_equalization, states_are_mergeable, build_hmm, and analyze_img showed that unrolling the innermost loops two times reduces the execution time on the GPP side by 3-8 ms. Unrolling the innermost loops four times did not show any performance increase in comparison with unrolling twice. As a result, where possible, the inner loops were unrolled twice. Special cases of improvements are described in the next sections.
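A minimal illustration of twofold unrolling (the function is a generic stand-in, not the application's code; an even trip count is assumed, as the report guarantees by padding):

```c
#include <stdint.h>

/* Sum of absolute differences between two images. The loop is unrolled
 * twice with two independent accumulators, giving the compiler two
 * independent operations per iteration to schedule in parallel.
 * n must be even. */
uint32_t sad_unrolled(const unsigned char *a, const unsigned char *b, int n)
{
    uint32_t s0 = 0, s1 = 0;
    for (int i = 0; i < n; i += 2) {
        s0 += (uint32_t)(a[i] > b[i] ? a[i] - b[i] : b[i] - a[i]);
        s1 += (uint32_t)(a[i + 1] > b[i + 1] ? a[i + 1] - b[i + 1]
                                             : b[i + 1] - a[i + 1]);
    }
    return s0 + s1;
}
```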
d. Histogram equalization improvements

Several versions of the function were written to reach high performance:
1st version: The loops of the function were unrolled in such a way that during one iteration of the outermost loop the function processes 3 images of 3 different types (tiger, elephant, and test images). With this unrolling the program executes within ~489 ms without the O3 compiler optimization, and in 116 ms with O3 gcc.
2nd version: All loops were unrolled 2 times. Corresponding results: ~486 ms without O3; ~109 ms with O3 gcc optimization.
3rd version: This version merges the previous two: the loops were unrolled 2 times and each iteration of the outermost loop processes three pictures of different types. The performance without O3 optimization is ~485.3 ms; with O3, ~118 ms.
4th version: This version is based on version 1, but instead of processing three images of different types, it processes 2 images of the same type per call.
Versions of histogram_equalization     | 24-image equalization time (ms, with O3) | Function execution time (ms, with O3) | Total algorithm time on GPP (ms), without O3 | with O3
Version 1                              | 59.9 | 7.49 | 489   | 116
Version 2 (2 times unrolling)          | 53.5 | 2.23 | 486   | 109
  Version 2, 4 times unrolling         |      |      |       | 110.7
  Version 2, without unrolling         |      |      |       | 114.5
Version 3                              | 76   | 9.5  | 485.3 | 118
Version 4                              | 58.1 | 2.42 | 485.5 | 111.2
Table 1. Performance of histogram_equalization function versions
According to the measurements presented in Table 1, the most significant improvement is gained by version 2 (which utilizes unrolling twice). This version is used in all further versions of the programs.
It is worth mentioning that the execution time depends on the images the function is processing: the execution time for 8 tiger images is 1 ms longer than for the equalization of 8 elephant images.
e. HMM building improvements

For the Hidden Markov Model algorithm, several versions of the build_hmm function were created:
1st version: The algorithm is organized in such a way that the i-th element of the average field of a state structure is filled during the same iteration as the (i-1)-th element of the stddev field of the same state structure. The version is organized so that elephant and tiger images are processed together in each loop iteration of the function. Thus, the function is called only once, which can restrict
flexibility for further parallelization between GPP and DSP. The performance of this and the other version is presented in Table 2.
2nd version: The inner loops for the average and stddev calculation were eliminated by unrolling 8 times (we have 8 images of elephants and 8 of tigers). Moreover, the loops for the calculation of the average and stddev elements were merged into one, and the innermost loop was unrolled 2 times. Since the function states_are_mergeable is called inside the algorithm, it was also unrolled twice to reach a better execution time. The results of the measurements are presented in Table 2.
Versions of build_hmm                                        | Function execution time (ms, with O3) | Total algorithm time on GPP (ms), without O3 | with O3
Version 1                                                    | 208 | 571 | 254
Version 2                                                    | 19  | 396 | 101
  Version 2, without changes in states_are_mergeable         |     |     | 102.3
  Version 2, 4 times unrolling of HMM building instead of 2  |     |     | 108.6
  Version 2, HMM building without unrolling                  |     |     | 110
Table 2. Performance of build_hmm function versions
According to the performance measurements, the 1st version is ~5 times slower than the 2nd version. This can be explained by the fact that version 2 better utilizes the spatial locality of the memory.
f. Improvements for the process of image analysis (analyze_img)

Several versions of the analyze_img function were written and tested to find the best program execution time:
1st version: The two inner loops for comparison with the averaged tiger and elephant images were merged into one.
2nd version: This version is based on version 1; however, the inner loop of the merged loop was unrolled 2 and 4 times. To compare test images with the averaged tiger (see Fig. 8) and averaged elephant images (see Fig. 9), a switch-case structure was implemented based on the is_deleted fields. The results are presented in Table 3.
3rd version: This version was created specifically for DSP-GPP parallelization. The function was separated into two independent comparisons, with the averaged image of the tiger and the averaged image of the elephant. This version consists of three functions: the outputs of the first two are integer arrays of
8 elements that contain the values of the tiger and eleph variables (which represent the level of similarity between the test images and the averaged images of the tiger and the elephant) for each test image. The third function compares these two arrays and outputs the recognition result.
Versions of analyze_img                        | Function execution time (ms, with O3) | Total algorithm time on GPP (ms), without O3 | with O3
Version 1                                      | 11.5 | 340 | 101
Version 2 (twice unrolled)                     | 9.5  | 333 |
  Version 2, unrolled 4 times instead of two   |      |     | 100
Version 3                                      | 10.5 | 370 | 102
Table 3. Performance of analyze_img function versions
g. Use of compilation flags for the program execution on GPP

Several options were tried to decrease the execution time below the 98.6 ms that was reached using the fastest versions described in the previous sections. However, the use of the -funroll-loops flag increased the time by 2 ms, and -finline-functions and -finline-small-functions did not show visible effects. The explanation for these results can be that the O3 optimization used to reach 98.6 ms already provides the maximum compiler optimization, so the introduction of other flags might not influence the performance and in several cases even increases the total execution time.
h. Conclusion

As presented in the previous sections, a significant execution time reduction can be reached by using integer variables instead of floating-point variables (where it does not corrupt the algorithm), loop unrolling, loop merging, and an integer-valued sqrt approximation. However, it was also shown that unrolling loops four times and several compilation flags may even increase the execution time.

Fig. 8: The averaged elephant image. Fig. 9: The averaged tiger image
Diagram 1 shows the execution-time reduction that was described step by step in the previous sections. The program executed only on the GPP can be divided into 3 main steps: histogram equalization, HMM building, and the recognition process. The normalized Diagram 2 shows that the O3 compiler optimization had the most significant effect on the analyze_img function. However, after applying the manual optimizations, the portion of HMM building in the total execution time decreases significantly.
The function versions that show the highest execution speed (marked in bold inside Tables 1, 2, and 3) are used in the GPP-DSP algorithm parallelization.
6. Parallelization with the DSP

Until now the DSP has not been used, but since it is possible to parallelize the initial algorithm between the GPP and DSP units, the opportunity to decrease the execution time was investigated.

a. Communication DSP-GPP

The main obstacle to simple parallelization is the communication overhead. According to the conducted measurements, the GPP spends 47 ms to write 320 KB (the equivalent of 8 images) into the pool. Since the execution of the whole program on the GPP takes ~98.6 ms, a communication overhead of 47 ms is not acceptable. The easiest way to reduce the communication overhead is to write larger element types into the pool. Table 4 presents the time spent by the GPP to send 320 KB into the pool.
Diagram 1: Execution time of the different program versions (histogram equalization / histogram_equalization, HMM building / build_hmm, recognition process / analyze_img; bars: initial version, initial version with O3 optimization, final version)
Diagram 2: Normalized execution time of the different program versions
Type of element written separately to the pool | Size of element type (bits) | Time to send 320 KB into pool (ms)
char                                           | 8                           | 47
Int32                                          | 32                          | ~12.5
long                                           | 64                          | ~7
Table 4. Overhead of sending data into the pool
In the compiler manual we found that the long type on the DSP side corresponds to 40 bits, but the sizeof operator returned 64 bits. Moreover, the compiler treats a long long variable as a variable of long type (64 bits).
Sending one notification (Int32) takes ~30 us, which is negligible in comparison with writing to the pool.
An interesting observation is that writing to the pool takes less time than reading from it. This can be explained as follows: when writing to the pool, the API function sets up a DMA transfer, starts it, and can continue execution right after, whereas in the other direction we have to wait until the DMA has finished fetching all the data. This asymmetry is in favor of the main processor, as it can quickly transfer data to the co-processor and continue its own task.
On the DSP side, we can read data from the pool and write it to the DSP RAM. In some cases this will decrease the function execution time on the DSP side; however, if the copying overhead takes the main portion of the execution time, the copying will not reduce the total execution time (e.g., histogram equalization for 8 images takes 8 ms less if we read and write data directly from/to the pool without copying it to the DSP RAM). The time spent copying from the pool, char by char, to the DSP RAM is presented in the table below:
Amount of data written to DSP RAM from the pool (KB) | Spent time (ms)
40                                                   | ~2.5
120                                                  | ~4.5
240                                                  | ~7.2
320                                                  | ~9
Table 5. Overhead of writing from the pool to DSP RAM

Meanwhile, copying from the pool to the GPP RAM takes approximately the same amount of time. Copying from the pool can be sped up significantly if we copy elements of a bigger type (e.g. long instead of char). The proportion of the speedup is the same as during the writing process (see Table 4).
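The effect of copying with a wider element type can be sketched as follows (a plain memory copy stands in for the pool API; the alignment assumptions are ours):

```c
#include <stdint.h>
#include <string.h>

/* Byte-by-byte copy: one load and one store per byte. */
void copy_bytes(uint8_t *dst, const uint8_t *src, size_t n)
{
    for (size_t i = 0; i < n; i++)
        dst[i] = src[i];
}

/* Copy by 64-bit words: eight bytes per load/store pair, assuming
 * n_bytes is a multiple of 8 and both buffers are 8-byte aligned,
 * which is how the pool buffers were laid out in our measurements. */
void copy_words(uint64_t *dst, const uint64_t *src, size_t n_bytes)
{
    size_t n = n_bytes / sizeof(uint64_t);
    for (size_t i = 0; i < n; i++)
        dst[i] = src[i];
}
```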
b. Function execution on the DSP side

Among all functions in the algorithm, the histogram equalization is the most suitable for execution on the DSP side, as in most cases the same operation is carried out on large sets of data (e.g. pixels, or the bins of the histogram) with narrow types (8 or 16 bit). A DSP is usually designed to carry out operations of this type efficiently. Thus, we considered this function for extra optimization in order to gain speed.
A direct port of the GPP version with unrolling does not perform well on the DSP. For example, histogram equalization for 8 images on the DSP side takes ~45.5 ms (reading and writing to the pool), while on the GPP the same operation takes no more than 18 ms.
Although the keywords and macros which were used in the case of the matrix multiplication mostly do not change the cycle times of the loops here, slight modifications in the loop bodies which eliminate data dependencies can increase performance. One example of these dependencies is in the loop where we accumulate the histogram: there is a read and a write to the same memory address in every loop iteration, which potentially generates RAW hazards in the pipeline. To alleviate this problem we can write the results into a temporary histogram. In other places we manually unrolled loops in which we used the same index variable to address different parts of the same array. This can cause problems, as the compiler cannot parallelize these instructions when they read the same register at the same time. Thus, we created new index variables which are incremented in every loop iteration.
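The dependency-breaking rewrite of the accumulation loop can be sketched as follows (a 256-bin histogram; the function names and the choice of two temporaries are ours):

```c
#include <stdint.h>
#include <string.h>

/* Original pattern: hist[p] is read and written in the same iteration,
 * a loop-carried memory dependency the pipeline cannot hide. */
void histogram_naive(const unsigned char *img, int n, uint16_t *hist)
{
    for (int i = 0; i < n; i++)
        hist[img[i]]++;
}

/* Rewritten: accumulate into two temporary histograms with independent
 * index variables, then merge; consecutive iterations no longer update
 * the same array back-to-back. */
void histogram_split(const unsigned char *img, int n, uint16_t *hist)
{
    uint16_t t0[256], t1[256];
    memset(t0, 0, sizeof t0);
    memset(t1, 0, sizeof t1);
    int i0 = 0, i1 = 1;
    for (; i1 < n; i0 += 2, i1 += 2) {
        t0[img[i0]]++;
        t1[img[i1]]++;
    }
    if (i0 < n)
        t0[img[i0]]++;  /* odd tail element */
    for (int b = 0; b < 256; b++)
        hist[b] = (uint16_t)(t0[b] + t1[b]);
}
```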
Apart from the above techniques, we tried to express the alignment of the variables with the _nassert() intrinsic in the hope that the compiler would use packed operations. Surprisingly, these efforts did not lead to any improvements in the initiation intervals of the loops. It is also possible to use packed instructions explicitly with the help of intrinsics; however, we found that deeper knowledge of the assembly and the architecture of the DSP is needed to use these.
With all these optimizations, the execution time became ~4 ms shorter.
c. Use of Texas Instruments software libraries

We tried using two software libraries [7] provided free of charge by Texas Instruments: FastRTS and IMGLIB. These libraries can be linked to any little-endian project and have been optimized by the engineers of the vendor.
The FastRTS library provides floating-point mathematics emulation for fixed-point DSPs. We experimented with changing the division in the histogram_equalization function to the function provided by this library and then casting the result to an integer. This yielded very poor results.
On the other hand, IMGLIB contains a function for the histogram calculation, which uses a 256-entry array as output, just like the loop in the original program, which is poorly optimized by the compiler. Thus, an obvious approach is to exchange this loop for the library implementation. The histogram function has two implementations: a natural C and an assembly-level optimized version. The assembly-optimized version performed very poorly in terms of speed, whereas the natural C implementation has approximately the same speed as the original loop.
d. Use of compilation flags for the program execution on DSP

In order to be able to analyze the release version on the DSP, we added the -k flag to CFLAGS. The -k flag keeps the assembly file so that we can inspect and analyze the compiler feedback. However, after obtaining the optimized version, we removed this flag, as it also influences the total performance of the program. To make it easier to inspect the optimization problems inside the assembly file, we also used the -ss flag, which adds optimizer comments to the assembly. As with -k, this flag also degrades the performance of the program. The option -mw, however, can be used to list software-pipeline information in the assembly file without degrading the performance.
Memory dependencies are problems that might conceal the real optimized performance. A dependency implies that one piece of code can only be executed after the execution of another. To achieve optimum efficiency, the compiler should parallelize as many instructions as possible. To help the compiler determine whether instructions are independent, the -pm flag should be used, which gives the compiler global access to the whole program or module and allows it to be more aggressive in ruling out dependencies. It is also mentioned in [5] that -pm should be applied to as much of the program as possible. The -mt flag can also be used to eliminate dependencies; using this flag is equivalent to adding the .no_mdep directive to the linear assembly source file.
The combination of -pm and -op2 allows the compiler to inline a function call inside a loop. Sometimes the compiler is unable to inline a function that is invoked inside a loop, and this inability to inline can in turn prevent parallelization of the code. To avoid this, -pm and -op2 should be used together to enable automatic inlining of function calls inside loops.
To allow speculative execution, an appropriate amount of padding must be available in data memory to ensure correct execution. The -mh option can be used for this.
Last but not least, the -o3 flag, which was already used in the original Makefile of the program, provides the highest level of optimization available. It maximizes compiler analysis and optimization: various loop optimizations are performed, such as software pipelining, unrolling, and SIMD, and file-level characteristics are also used to improve performance [5].
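Putting the flags together, the compiler options used while tuning might look as follows (a sketch only; the exact variable names depend on the project Makefile, and -k and -ss were removed again for the final timed build, as described above):

```make
# Flags used while inspecting the generated assembly; -k keeps the .asm
# file and -ss interlists optimizer comments, both of which cost speed.
CFLAGS_TUNE  = -o3 -pm -op2 -mh -mw -k -ss

# Flags for the final timed build: same optimizations, no inspection flags.
CFLAGS_FINAL = -o3 -pm -op2 -mh -mw
```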
e. Algorithm parallelization
1st version: This version of the program offloads the equalization of the 8 test images to the DSP. However, due to communication overhead (9 ms for writing to the pool, and 9 ms for reading from it on the GPP side), the benefit of independent parallel equalization on the DSP side is entirely eliminated. The total execution time is ~102 ms, which is 3 ms longer than execution on the GPP alone, described in Section 5f.
2nd version: As shown above, the communication overhead plays a significant role in the total execution time. As a result, we decided to limit the communication between GPP and DSP as much as possible. The most effective way to meet this limitation is to let the DSP independently equalize and build the HMM for one type of image (e.g. the elephants). Meanwhile, the GPP equalizes the other type of image (e.g. the tiger images) and the test images, and writes the test images into the pool so that the DSP can independently compare them with the averaged elephant picture. Thus, to deliver the recognition result, the DSP sends only 9 notifications (8 with the values of the tiger variables, and one to send hmm_eleph_states), which takes approximately 30 µs (analyse_img, version 3, is used; see Section 5f). To organize such parallel processing, timing must be taken into account: e.g. the DSP must not compare the averaged elephant picture with test images in the pool before those images have been equalized and written into the pool by the GPP (semaphores are used for this). The corresponding process is represented in Scheme 1:
Scheme 1. Parallel image recognition process
The measurements show that the total execution time is 79.2 ms. The described program organization has several advantages:
- the amount of communication is limited: the total communication overhead is ~15 ms;
- almost half of the algorithm is executed in parallel;
- the synchronization misalignment is less than 2 ms: the GPP waits for the DSP only at the very end (before giving the recognition result).
f. Conclusion
In this section we described how to parallelize the program to make its execution faster. However, it must be highlighted that communication overhead makes up a significant part of the program execution time (17%). Because of this overhead, the parallelization described in Version 1 (see Section 6e) gives no benefit. However, since the program execution can be separated into two almost independent parts (as shown in Version 2, Section 6e), the parallelization significantly reduces the program execution time. In our case, the speedup due to GPP-DSP parallelization (Version 2) is almost 25% compared with execution on the GPP side alone (79.2 ms versus 98.6 ms).
Fig. 10 Execution times on GPP and DSP sides
Fig. 10 shows the time spent on the different operations on the GPP and DSP sides. The DSP spends more time on every function than the GPP does. The histogram equalization on the DSP side is much slower than on the GPP (more than 2 times slower). This can be explained by communication overhead, since the DSP reads directly from the pool during histogram equalization of the elephant images. The other functionality is performed in RAM on the DSP side, so the execution times on the DSP side for functions such as HMM building and analyse_img are approximately the same as on the GPP side.
Comparing the pool-notify architecture with the previously used message queuing, it is evident that the different communication architectures suit different requirements in data transfer and synchronization. Message architectures feature simple functionality and are suitable for infrequent data transfer between two parties, with a considerable overhead due to the message headers, whereas the pool can be used as storage for large amounts of data that is available to many cores. Moreover, message queues provide tight coupling, whereas notification couples the systems loosely.
Several attempts were made to decrease the communication overhead: use of the Texas Instruments libraries, compiler optimization, and changing the format of the transmitted data (packing 4 chars into one long). The last approach decreased the communication overhead approximately 7 times; however, the overhead still takes a significant portion of the total execution time. As future work, another type of communication, Dynamic Memory Mapping [9], can be considered.
7. Final conclusion
In this course we studied programming techniques for heterogeneous multiprocessor architectures. We have seen how an application can be separated into functionally different components (control and data processing) and assigned to the processing units best suited to each task. We have seen that performance is influenced just as much by an adequate choice of communication architecture as by the optimality of the code running on a given core. We also saw that an auxiliary core can increase overall system performance even if it executes the given code more slowly than the main core. Thus, in these systems load balancing is a key factor in achieving a performance increase.
References
[1] Texas Instruments, OMAP3530/25 Applications Processor: Oct ,2009;
[2] Texas Instruments, DSP/BIOS LINK USER GUIDE Version 1.60: Oct 21, 2008;
[3] Texas Instruments, SPRU187O, TMS320C6000 Optimizing C Compiler;
[4] http://processors.wiki.ti.com/index.php/C6000_Compiler:_Tuning_Software_Pipelined_Loops;
[5] SPRU425A TMS320C6000 Optimizing C Compiler tutorial;
[6] Fast square root in C, IAR Application Note G-002, IAR Systems,
http://supp.iar.com/FilesPublic/SUPPORT/000419/AN-G-002.pdf;
[7] Texas Instruments, Software Libraries, http://processors.wiki.ti.com/index.php/software_libraries;
[8] http://www.eecs.umich.edu/~sugih/pointers/gprof_quick.html;
[9] Hari Kanigeri, Texas Instruments, OMAP3430 Bridge Overview, August 2008.
8. Appendix
Listings from the DSP assembly files:
;* SOFTWARE PIPELINE INFORMATION
;*
;* Loop source line : 25
;* Loop opening brace source line : 25
;* Loop closing brace source line : 27
;* Known Minimum Trip Count : 1
;* Known Maximum Trip Count : 255
;* Known Max Trip Count Factor : 1
;* Loop Carried Dependency Bound(^) : 9
;* Unpartitioned Resource Bound : 2
;* Partitioned Resource Bound(*) : 2
;* Resource Partition:
;* A-side B-side
;* .L units 0 0
;* .S units 2* 1
;* .D units 2* 1
;* .M units 1 0
;* .X cross paths 1 0
;* .T address paths 2* 1
;* Long read paths 0 0
;* Long write paths 0 0
;* Logical ops (.LS) 0 0 (.L or .S unit)
;* Addition ops (.LSD) 1 0 (.L or .S or .D unit)
;* Bound(.L .S .LS) 1 1
;* Bound(.L .S .D .LS .LSD) 2* 1
;*
;* Searching for software pipeline schedule at ...
;* ii = 9 Schedule found with 1 iterations in parallel
Listing 1: original matrix multiplication code
;* SOFTWARE PIPELINE INFORMATION
;*
;* Loop source line : 25
;* Loop opening brace source line : 25
;* Loop closing brace source line : 27
;* Known Minimum Trip Count : 4
;* Known Maximum Trip Count : 255
;* Known Max Trip Count Factor : 1
;* Loop Carried Dependency Bound(^) : 1
;* Unpartitioned Resource Bound : 3
;* Partitioned Resource Bound(*) : 3
;* Resource Partition:
;* A-side B-side
;* .L units 0 0
;* .S units 2 1
;* .D units 3* 2
;* .M units 1 3*
;* .X cross paths 1 2
;* .T address paths 3* 2
;* Long read paths 0 0
;* Long write paths 0 0
;* Logical ops (.LS) 0 0 (.L or .S unit)
;* Addition ops (.LSD) 1 3 (.L or .S or .D unit)
;* Bound(.L .S .LS) 1 1
;* Bound(.L .S .D .LS .LSD) 2 2
;*
;* Searching for software pipeline schedule at ...
;* ii = 3 Schedule found with 4 iterations in parallel
;*
Listing 2: optimizing with the restrict keyword
;* SOFTWARE PIPELINE INFORMATION
;*
;* Loop source line : 28
;* Loop opening brace source line : 28
;* Loop closing brace source line : 30
;* Known Minimum Trip Count : 20
;* Known Maximum Trip Count : 254
;* Known Max Trip Count Factor : 2
;* Loop Carried Dependency Bound(^) : 1
;* Unpartitioned Resource Bound : 3
;* Partitioned Resource Bound(*) : 3
;* Resource Partition:
;* A-side B-side
;* .L units 0 0
;* .S units 2 1
;* .D units 3* 2
;* .M units 1 3*
;* .X cross paths 1 2
;* .T address paths 3* 2
;* Long read paths 0 0
;* Long write paths 0 0
;* Logical ops (.LS) 0 0 (.L or .S unit)
;* Addition ops (.LSD) 1 3 (.L or .S or .D unit)
;* Bound(.L .S .LS) 1 1
;* Bound(.L .S .D .LS .LSD) 2 2
;*
;* Searching for software pipeline schedule at ...
;* ii = 3 Schedule found with 4 iterations in parallel
Listing 3: loop-unrolling
;* SOFTWARE PIPELINE INFORMATION
;*
;* Loop source line : 34
;* Loop opening brace source line : 34
;* Loop closing brace source line : 36
;* Known Minimum Trip Count : 20
;* Known Maximum Trip Count : 254
;* Known Max Trip Count Factor : 2
;* Loop Carried Dependency Bound(^) : 1
;* Unpartitioned Resource Bound : 3
;* Partitioned Resource Bound(*) : 3
;* Resource Partition:
;* A-side B-side
;* .L units 0 0
;* .S units 2 1
;* .D units 3* 2
;* .M units 1 3*
;* .X cross paths 1 2
;* .T address paths 3* 2
;* Long read paths 0 0
;* Long write paths 0 0
;* Logical ops (.LS) 0 0 (.L or .S unit)
;* Addition ops (.LSD) 1 3 (.L or .S or .D unit)
;* Bound(.L .S .LS) 1 1
;* Bound(.L .S .D .LS .LSD) 2 2
;*
;* Searching for software pipeline schedule at ...
;* ii = 3 Schedule found with 4 iterations in parallel
Listing 4: packed 16 bit arithmetic