Averaging Filter Comparing performance of C++ and ‘our’ ASM Example of program development on...
![Page 1: Averaging Filter Comparing performance of C++ and ‘our’ ASM Example of program development on SHARC using C++ and assembly Planned for Tuesday 7 rd October.](https://reader035.fdocuments.net/reader035/viewer/2022062304/56649ef45503460f94c07837/html5/thumbnails/1.jpg)
1
Averaging Filter: Comparing performance of C++ and ‘our’ ASM
Example of program development on SHARC using C++ and assembly
Planned for Tuesday 7th October afternoon
Practical examples handled in Lab 1
2
Demo (uTTCOS) and Test (E-UNIT) configurations
[Diagram: two signal paths]
• Demo path: True A/D, DMA channel, TrueleftChannel_In, Audio ISR with Filter (YOUR SOFTWARE), TrueleftChannel_Out, DMA channel, True D/A
• Test path: Test InAudio array, MockleftChannel_In, Filter (YOUR SOFTWARE), MockleftChannel_Out, Test OutAudio array; the mocked devices are labelled “MOCK ReceiveD2A” and “Mock TransmitA2D”
Test
• Set up InAudio[ ]
• Set up Expected[ ]
• In loop { call Filter to produce OutAudio[ ] }
• Compare Expected[ ] and OutAudio[ ]
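The test flow above can be sketched in C++ roughly as follows. This is only an illustration: `RunFilterTest` is a hypothetical helper, and the filter is passed in as a callable so that a C++ or an ASM version could be tested the same way.

```cpp
#include <cmath>
#include <cstddef>
#include <functional>
#include <vector>

// Hypothetical test helper: feed InAudio[] through the filter one sample
// at a time to produce OutAudio[], then compare OutAudio[] with Expected[].
bool RunFilterTest(const std::function<float(float)>& filter,
                   const std::vector<float>& inAudio,
                   const std::vector<float>& expected,
                   float tolerance) {
    if (inAudio.size() != expected.size()) return false;

    std::vector<float> outAudio(inAudio.size());
    for (std::size_t i = 0; i < inAudio.size(); ++i) {
        outAudio[i] = filter(inAudio[i]);  // in loop: call Filter
    }
    for (std::size_t i = 0; i < outAudio.size(); ++i) {
        if (std::fabs(outAudio[i] - expected[i]) > tolerance) return false;
    }
    return true;
}
```

A mocked audio device then reduces to choosing what goes into the input array and reading back the output array.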
3
Mock device registers “satisfy linker”; CCES says “inconsistent” definition
• Poor mock: we move values in the audio device registers by hand
• Can we “MOCK” Receive_ADC_Samples?
  – Typical industrial testing approach, needed when the hardware is “NOT-YET-DEVELOPED”
4
Better Simulation
• What if the algorithm is “by mistake” still doing Left_Out = Left_In (copy)? Then we would get the same answer
• Currently “LeftChannel_In1” is a fixed constant, making it difficult for us to check whether our algorithm would work for more complex signals
• So we could start testing the algorithm's validity (not its speed) by changing LeftChannel_In1, by mocking the “ReceiveA2D( )” and “TransmitD2A( )” audio devices
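Why a constant input is a weak test can be shown with a small sketch. The function names here are hypothetical, and the window is assumed to be non-empty and already full of samples.

```cpp
#include <vector>

// Hypothetical buggy "filter": still doing Left_Out = Left_In (copy).
float CopyFilter(const std::vector<float>& window) {
    return window.back();  // newest sample, returned unchanged
}

// Intended behaviour: average of all samples in the window.
float AverageOfWindow(const std::vector<float>& window) {
    float sum = 0.0f;
    for (float sample : window) sum += sample;
    return sum / static_cast<float>(window.size());
}

// True when the test signal gives different outputs for the two routines,
// i.e. when the signal is able to expose the copy defect.
bool SignalExposesCopyBug(const std::vector<float>& window) {
    return CopyFilter(window) != AverageOfWindow(window);
}
```

With a constant signal the two routines agree, so the defect stays hidden; a varying signal separates them.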
5
Using ‘MockDevice.c’ loads (RECAP)
What do we do about ‘Receive_ADC_Samples( )’?
• These ‘mock’ routines satisfy a linker requirement for a function we don’t use. When they need to become more detailed, worry about it later (WAIL).
6
Mocked device inside Assign1Library
Can be used during Labs 1 to 4
MADE PRIVATE (FIXED): GOOD OR BAD IDEA?
VARIETY OF ALGORITHMS TESTED
7
Use GUI to add new test group for Averaging code: 3 styles of tests (RECAP)
8
Testing
• Test that it works
• Test that it meets real-time performance
  – Measure ms / sample for 1 channel = Time-1CH
  – Require 20 ms > 8 * Time-1CH
• Move code onto resource chart
  – Determine theoretical best time if all optimizations could be found
• Test to determine real cycle count: cycles / tap / sample
• Examine the CPP .lst file (.i or .is) or your ASM file to determine the expected cycle count
  – Work out why there is a difference between theory and real
  – Looking at accuracy of better than 1 cycle in 1000
  – Assume 1 cycle per instruction except jumps, memory accesses and movement of I registers to memory, or any other delay we find common
• Be able to move the theoretical calculation to other processor architectures (timings) for MidTerm 1 on Thursday 23rd Oct
9
Theoretical Analysis
• We expect our theoretical analysis to be as fast as or faster than what the optimized C++ code takes
• We are not using any C++ DSP extensions, so we expect efficient rather than optimized code
• Is 816 cycles per sample processed by the Average Filter the speed we would expect, based on our understanding of the processor architecture?
10
Expectations
• First instruction after a jump takes 3 cycles to finish executing
• After that, 1 instruction, all things being equal, takes 1 cycle
• 1 cycle for a read, write, add, multiply
• D? cycles for a division
11
Averaging Filter with Loop: Theoretical Analysis
• Fetch N values from memory -- N cycles
• Perform N add operations -- N cycles
• Go round the sum for-loop -- N * FLC cycles
  – Where FLC is # instructions to handle For-Loop-Control; includes all overheads of jumping during the for-loop
• Exit for-loop (done once) -- EFL cycles
• Do division -- D cycles
• Return a value from function -- RV cycles
• Enter and exit Average routine -- EER cycles
AVERAGE_FILTER_TIME = N(1 + 1 + FLC) + EFL + D + RV + EER cycles
VERY BIG DEFECT IN ANALYSIS FOUND LATER: ACTUAL THEORETICAL TIME IS TWICE AS LARGE AS THIS
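The formula is easy to evaluate mechanically once cycle counts are plugged in. The sketch below keeps every cost as a parameter, since the slide leaves them symbolic; the struct and function names are hypothetical. Given the slide's own warning, any result should be read as an underestimate.

```cpp
// Cycle costs that the slide leaves symbolic; all values are inputs.
struct CycleCosts {
    int flc;  // For-Loop-Control overhead per pass round the loop
    int efl;  // Exit for-loop (done once)
    int d;    // Division
    int rv;   // Return a value from the function
    int eer;  // Enter and exit the Average routine
};

// AVERAGE_FILTER_TIME = N(1 + 1 + FLC) + EFL + D + RV + EER
constexpr int AverageFilterTime(int n, const CycleCosts& c) {
    return n * (1 + 1 + c.flc) + c.efl + c.d + c.rv + c.eer;
}
```

For example, with N = 64 and placeholder costs FLC = 9, EFL = 3, D = 10, RV = 2, EER = 20 (illustrative numbers, not measurements), the loop term is 64 * 11 = 704 and the total is 739 cycles.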
12
Modify tests so they can handle both CPP and ASM versions (cut-and-paste)
• It’s not the timing that’s the problem at this moment
• It’s ‘do the ASM and CPP code work at all?’
13
Check what function needs developing
• Fix compiler error with prototype in ‘Assign1.h’
• Linker error message says ‘wrong prototype’ (NM)
14
Check to see if we can run the tests that call ASM code without crashing
C++ prototype: extern "C" void Function(void)
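A minimal sketch of what the prototype does: `extern "C"` switches off C++ name mangling, so the linker looks for the plain symbol name that an assembly file would export. The stand-in body and the counter below are assumptions added only so the sketch links on a PC; on the target the body would come from the ASM file.

```cpp
// Declaration as it might appear in a shared header.
extern "C" void Function(void);

// Hypothetical counter, used only to observe that the call happened.
static int g_callCount = 0;

// Stand-in definition; the real routine would be written in assembly.
extern "C" void Function(void) { ++g_callCount; }

// Call the routine twice and report how many calls were seen in total.
int CallTwice() {
    Function();
    Function();
    return g_callCount;
}
```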
![Page 15: Averaging Filter Comparing performance of C++ and ‘our’ ASM Example of program development on SHARC using C++ and assembly Planned for Tuesday 7 rd October.](https://reader035.fdocuments.net/reader035/viewer/2022062304/56649ef45503460f94c07837/html5/thumbnails/15.jpg)
15
Getting the same constants in an include file working in both CPP and ASM
• Use this type of syntax in ‘Assign1.h’
  – Conditional code generation
• And in assembly code files
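The slide's own header listing is not reproduced in this transcript, so the following is only a sketch of the general idea. `ASSIGN1_ASM_BUILD` is a hypothetical macro standing in for whatever symbol the assembler build defines, and `FILTER_N` and `AverageFilterASM` are made-up names.

```cpp
// Sketch of a shared header such as Assign1.h (contents are assumed).
#ifndef ASSIGN1_H_SKETCH
#define ASSIGN1_H_SKETCH

// Plain #define constants are understood by both the C++ preprocessor
// and an assembler that runs a C-style preprocessor.
#define FILTER_N 4

// C++-only syntax is hidden from the assembly build behind a macro.
#ifndef ASSIGN1_ASM_BUILD
extern "C" float AverageFilterASM(float newValue);  // hypothetical name
#endif

#endif  // ASSIGN1_H_SKETCH

// C++ side: the shared constant sizes the FIFO at compile time.
float g_fifo[FILTER_N] = {};

int FifoLength() {
    return static_cast<int>(sizeof(g_fifo) / sizeof(g_fifo[0]));
}
```

The toolchain documentation should be checked for the macro it actually predefines when assembling.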
16
Initial testing done with small N
N = 4 (as we can work out the expected result)
• Write the test
  – C++ code expected to pass
  – 3.3 is EXACTLY (N – 1) / N of 4.4 when N is 4
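One way such a test could arise is sketched below. It assumes the FIFO starts at zero and that 4.4 is fed in three times, so the FIFO holds three 4.4 values and one zero; the slide's actual test code is not shown, and the class and function names are hypothetical.

```cpp
#include <cmath>

constexpr int kN = 4;

// Hypothetical C++ reference filter: FIFO of the last kN inputs, zeroed at start.
class AverageFilterCPP {
public:
    float Process(float newValue) {
        // Shift the FIFO by one place and insert the newest value.
        for (int i = kN - 1; i > 0; --i) fifo_[i] = fifo_[i - 1];
        fifo_[0] = newValue;
        float sum = 0.0f;
        for (int i = 0; i < kN; ++i) sum += fifo_[i];
        return sum / kN;
    }

private:
    float fifo_[kN] = {};
};

bool NearlyEqual(float a, float b, float tol) {
    return std::fabs(a - b) <= tol;
}

// Feed 4.4 in (N - 1) times; expected output is (N - 1) / N * 4.4 = 3.3.
float OutputAfterThreeSamples() {
    AverageFilterCPP filter;
    float out = 0.0f;
    for (int i = 0; i < kN - 1; ++i) out = filter.Process(4.4f);
    return out;
}
```

The comparison uses a tolerance because 4.4 and 3.3 are not exactly representable in floating point.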
17
Look for ‘one out error’ in loops
Common DSP mistake
• Remember to fix error in ASM ‘pseudo code’
18
Initial testing done with small N
N = 8 (as we can work out the expected result)
• Write the test
  – C++ code expected to pass
  – ASM code MUST fail the test, otherwise the test is poor
  – Must fail as there is no ASM code to allow a pass to occur. This is the TEST of the TEST
Now have 4 tests passing rather than 3, including the ASM test
INDICATES BAD TEST. WHY?
19
Improved test. Don’t allow ‘old correct value’ in output from C++ test
Defect might have been identified by reversing test order
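The stale-value defect can be reproduced in a few lines. Everything here is a hypothetical reconstruction: a shared output variable, a C++ routine that writes it, and an ASM stub that does nothing yet.

```cpp
#include <cmath>

// Shared output location that both tests read (hypothetical name).
static float g_output = 0.0f;

void AverageCPP(float a, float b) { g_output = (a + b) / 2.0f; }

// ASM routine not written yet: the stub leaves g_output untouched.
void AverageASMStub(float, float) {}

bool OutputIsThree() { return std::fabs(g_output - 3.0f) < 1e-6f; }

// C++ test: expected to pass, and leaves the correct value behind.
bool CppTest() {
    AverageCPP(2.0f, 4.0f);
    return OutputIsThree();
}

// Poor test: trusts whatever is already in g_output.
bool PoorAsmTest() {
    AverageASMStub(2.0f, 4.0f);
    return OutputIsThree();
}

// Improved test: overwrite the output with a known-wrong value first.
bool ImprovedAsmTest() {
    g_output = -999.0f;
    AverageASMStub(2.0f, 4.0f);
    return OutputIsThree();
}
```

Run after `CppTest()`, the poor test passes even though no ASM code exists, while the improved test fails as it should. Running the poor test first would also have failed, which is the point about reversing the test order.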
20
What registers can we use in assembly?
• Don’t use these without performing a save immediately and a recover operation later.
• Otherwise C and C++ will crash
• These are okay to use in assembly
21
Here’s the full software loop structure
Note the formatting for easy code review (Required)
Each time around the loop: 9 cycles for control
Not the 5 we thought
22
dm(2, I4) versus dm(I4, 2)
dm(M4, I4) versus dm(I4, M4)
• Both instructions use the ‘eye’ 4 index register (volatile)
• dm(2, I4) is a pre-modify memory operation
  – The 2 is before the I4, hence pre something
  – I4 points to a memory location
  – dm(2, I4) means access the memory location at (I4 + 2)
    • ADD IS NOT performed in parallel with other operations?
  – LEAVE value in index register I4 unchanged
  – Used in array addressing
• dm(I4, 2) is a post-modify memory operation
  – The 2 is after the I4, hence post something
  – I4 points to a memory location
  – dm(I4, 2) means access the memory location at (I4)
  – MODIFY value in index register by 2
    • DO I4 = I4 + 2 AFTER USING I4 (ADD in parallel with other operations?)
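The difference between the two address modes can be mimicked with C++ pointers. This is an analogy only, with hypothetical names; it says nothing about how many cycles either mode takes on the processor.

```cpp
// Result of one memory access in the analogy.
struct Access {
    float value;       // what was read from memory
    const float* ptr;  // the pointer ("I4") after the access
};

// Like dm(2, I4): read at (I4 + 2), leave I4 unchanged.
Access PreModifyRead(const float* i4) {
    return Access{*(i4 + 2), i4};
}

// Like dm(I4, 2): read at (I4), then I4 = I4 + 2.
Access PostModifyRead(const float* i4) {
    float value = *i4;
    return Access{value, i4 + 2};
}
```

With an array {10, 11, 12, 13}, the pre-modify read returns 12 and leaves the pointer at the start, while the post-modify read returns 10 and leaves the pointer two elements further on.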
24
Other bits of code needed
25
Add assembly language ‘externs’ to ‘Assign1.h’
• Still have not coded the division: fake it by hard-coding * 1/4
• Must be an easier way to code memory
  – Yes: use post-increment operation using pointers and not using array indexing
26
Code fails: the most likely place to look for defects is in the loop operations
Forgot to set loopCounter = 0 and loopMax to N when we added code for the new loops
27
Try persuading the “assembler” to pre-calculate F3 = (1.0 / N) at ‘compile time’, not ‘run-time’
Code should now work for N = 64, so we can compare timing with C code
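The same idea in C++ terms, as an analogy to what the slide asks of the assembler: a `constexpr` constant is evaluated at compile time, so the run-time code multiplies instead of dividing. The names are hypothetical.

```cpp
constexpr int kN = 64;

// Evaluated by the compiler, so no division is executed at run time.
constexpr float kOneOverN = 1.0f / kN;

// Average from an already accumulated sum: a multiply replaces the divide.
float AverageFromSum(float sum) {
    return sum * kOneOverN;
}
```

For N = 64 the reciprocal 1/64 is exactly representable in binary floating point; for other values of N the multiply may differ from a true division in the last bit.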
28
If we believe the tests, then calculation accuracy is lower (5E-06 for larger N)
Despite lousy ASM code, we are already beating the compiler in ‘debug’ mode (around 2N)
29
Before optimizing, we need to add a few more tests to check the code is valid
Uses sum of N integers: N (N + 1) / 2
Accuracy now set to 1E-5
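A test of this kind might look like the sketch below. It assumes the inputs are 1, 2, ..., N, so the expected average is (N + 1) / 2, and it treats 1E-5 as an absolute tolerance; both are assumptions, as are the function names.

```cpp
#include <cmath>
#include <vector>

// Plain C++ average of a buffer, standing in for the filter under test.
float AverageOf(const std::vector<float>& values) {
    float sum = 0.0f;
    for (float v : values) sum += v;
    return sum / static_cast<float>(values.size());
}

// Inputs 1, 2, ..., N sum to N (N + 1) / 2, so the average is (N + 1) / 2.
bool RampTestPasses(int n, float tolerance) {
    std::vector<float> ramp;
    for (int i = 1; i <= n; ++i) ramp.push_back(static_cast<float>(i));
    const float expected = static_cast<float>(n + 1) / 2.0f;
    return std::fabs(AverageOf(ramp) - expected) <= tolerance;
}
```

Because the expected value has a closed form, the test scales to N = 64 without working out the answer by hand.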
30
Use post-modify address mode
sum = sum + *pt++; (N = 64)
• ASM was 2400 cycles (N = 64), is now 2208
  – Expect improvement of N = 64 cycles (2 instead of 3 instructions)
  – Get (2400 – 2208) = 192, which is exactly 3 * N = 192 cycles faster
2 cycle stall till M4 ready to use?
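In C++ the two styles of summing loop look like this. It is a sketch with hypothetical function names; what the compiler or assembler actually generates for each form has to be checked in the listing file.

```cpp
constexpr int kN = 64;

// Array-index form: each access is written as base + index.
float SumIndexed(const float* fifo) {
    float sum = 0.0f;
    for (int i = 0; i < kN; ++i) sum = sum + fifo[i];
    return sum;
}

// Post-modify form from the slide: use the pointer, then step it on.
float SumPostIncrement(const float* fifo) {
    const float* pt = fifo;
    float sum = 0.0f;
    for (int i = 0; i < kN; ++i) sum = sum + *pt++;
    return sum;
}
```

Both functions add the elements in the same order, so they return identical results; only the addressing style differs.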
31
dm(2, I4) versus dm(I4, 2)
dm(M4, I4) versus dm(I4, M4)
• Both instructions use the ‘eye’ 4 index register
• dm(2, I4) is a pre-modify memory operation
  – The 2 is before the I4, hence pre something
  – I4 points to a memory location
  – dm(2, I4) means access the memory location at (I4 + 2)
  – LEAVE value in index register I4 unchanged
  – Used in array addressing
• dm(I4, 2) is a post-modify memory operation
  – The 2 is after the I4, hence post something
  – I4 points to a memory location
  – dm(I4, 2) means access the memory location at (I4)
  – MODIFY value in index register by 2 (I4 = I4 + 2 AFTER USE)
• POST MODIFY OFFERS OPPORTUNITY FOR PROCESSOR ARCHITECTURE TO DO ADD IN PARALLEL WITH OTHER PIPELINE STAGES
32
Using pre-modify and post-modify addressing: replace 6 instructions by 2
Expect 4 * N faster (256)
Was 2208, is 1704: about 500 cycles faster (2208 – 1704 = 504)
Close to N * 8 faster!
33
Need to force “C++” to optimize
• Our ASM code: 1704 cycles
• Optimized “C”: 205 cycles
  – 1500 cycles faster, or roughly N * 23.5 cycles faster
• FIFO loop (63 reads / 63 writes) + sum loop (64 reads + 64 adds) ≈ 256
• Loop control = 2 * 64 * 9 + into / out of subroutine 20 + other 10 = 1182
  – Our ASM = 1438 accounted for + 266 unaccounted for (N * 4.2, or roughly N * 4)
CONCLUSION
We have a lot more to learn about using the processor architecture correctly in order to get HIGH SPEED DSP CODE
NOTE: COMPILER ASSUMES GENERAL DSP CODE CHARACTERISTICS
WE KNOW MORE, so should be able to write faster code (if we need to)
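The cycle accounting on this slide can be re-checked mechanically. The sketch below simply adds up the components as they are listed (taking the memory-and-add term as the stated 256); the constant names are hypothetical.

```cpp
constexpr int kN = 64;
constexpr int kMeasuredAsm = 1704;  // measured cycles, from the slide
constexpr int kOptimizedC = 205;    // measured cycles, from the slide

// FIFO loop reads/writes plus sum loop reads/adds, as stated on the slide.
constexpr int kMemoryAndAdds = 256;

// Two loops of N passes at 9 cycles each, plus subroutine entry/exit and other.
constexpr int kLoopControl = 2 * kN * 9 + 20 + 10;

constexpr int kAccounted = kMemoryAndAdds + kLoopControl;
constexpr int kUnaccounted = kMeasuredAsm - kAccounted;

// Gap between our ASM and the optimized C, expressed per tap.
constexpr double kGapPerTap =
    static_cast<double>(kMeasuredAsm - kOptimizedC) / kN;
```

Keeping the budget in code makes it easy to redo the sums whenever a new measurement replaces one of the inputs.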