Automatic Application Profiling
Lecture 22
Today – What parts of the code are slow?
• Amdahl’s law
• How to get the processor to tell us what’s taking the most time
  – Statistical program counter sampling
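Amdahl’s Law
• Gene Amdahl - “Optimize the common case”

$$\text{Speedup} = \frac{1}{\text{Fraction}_{\text{Normal}} + \dfrac{\text{Fraction}_{\text{Enhanced}}}{\text{Speedup}_{\text{Enhanced}}}}$$

• Double the speed of ¼ of the program:

$$\text{Speedup} = \frac{1}{0.75 + \dfrac{0.25}{2}} = 1.14$$

• Quadruple the speed of ¼ of the program:

$$\text{Speedup} = \frac{1}{0.75 + \dfrac{0.25}{4}} = 1.23$$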
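Work on the Slow Part

$$\text{Speedup} = \frac{1}{\text{Fraction}_{\text{Normal}} + \dfrac{\text{Fraction}_{\text{Enhanced}}}{\text{Speedup}_{\text{Enhanced}}}}$$

• Double the speed of ¾ of the program:

$$\text{Speedup} = \frac{1}{0.25 + \dfrac{0.75}{2}} = 1.6$$

• Quadruple the speed of ¾ of the program:

$$\text{Speedup} = \frac{1}{0.25 + \dfrac{0.75}{4}} = 2.29$$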
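The four results above all come from the same formula, so they are easy to check in code. The following is a minimal sketch; the function name amdahl_speedup and its parameter names are chosen here for illustration and do not come from the lecture.

```c
#include <assert.h>

/* Overall speedup when part of a program is made faster (Amdahl's law).
 * fraction_enhanced: share of the original run time that is sped up (0..1)
 * speedup_enhanced:  factor by which that share gets faster (> 0)        */
double amdahl_speedup(double fraction_enhanced, double speedup_enhanced)
{
    double fraction_normal;

    assert(fraction_enhanced >= 0.0 && fraction_enhanced <= 1.0);
    assert(speedup_enhanced > 0.0);

    fraction_normal = 1.0 - fraction_enhanced;
    return 1.0 / (fraction_normal + fraction_enhanced / speedup_enhanced);
}
```

Calling amdahl_speedup(0.25, 2.0) gives about 1.14 and amdahl_speedup(0.75, 4.0) gives about 2.29, matching the slides: the same effort pays off far more when it is spent on the part that dominates run time.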
How Do We Find the Slow Parts?
• Option A: Measure the amount of time each region takes to execute
  – Code

    unsigned long times[NUM_FUNCTIONS];

    void my_function(int i) {
        unsigned long t_start = get_ticks();   /* timestamp on entry */
        /* function body */
        times[THIS_FUNCTION_NUM] += get_ticks() - t_start;   /* accumulate elapsed ticks */
    }

• Pros
  – Exact. Can get single cycle accuracy if needed.
• Cons
  – Tedious. Must add code to each function to be monitored.
  – Need access to source code, which may be a problem for library functions.
Program Counter Sampling
• Option B: Periodically examine the PC to see what’s running
  – Result shows average fraction of time spent executing a region of code
• Supporting data structure: table of region information
  – Starting and ending addresses: defined before sampling
  – Execution counts: updated during sampling
• Sampling
  – Use a timer to interrupt application periodically
  – Within ISR
    • Read PC off of stack
    • Examine table of region addresses to determine currently executing region N
    • Increment entry N of execution count table
    • Also increment a total number of ticks variable (to reveal out-of-range PC instances)
  – At end
    • Provide execution count table to user (via file, serial port, debugger, etc.)

    typedef struct {
        char Name[PROFILE_NAME_SIZE];
        unsigned long Start, End;
        unsigned Count;
    } PROFILE_T;
Configure Timer to Generate Periodic Interrupt
• Call this function with desired sampling frequency in Hz

    void Init_Profiling(unsigned samp_freq) {
        unsigned long divider;

        // set up timer A0 to interrupt at samp_freq
        ta0mr = 0x00;
        divider = ((unsigned long)MAIN_CLOCK)/samp_freq;
        if (divider > 0x0ffffL)
            ta0 = 0x0ffff;          // clamp: the timer register is 16 bits wide
        else
            ta0 = (unsigned) (divider & 0x0ffff);
        DISABLE_IRQ;
        ta0ic = 1;                  // timer A0 interrupt priority level 1
        ENABLE_IRQ;
        ta0s = 1;                   // start timer A0
    }
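Because the divider is clamped to 16 bits, very low sampling frequencies silently turn into a faster rate than requested. The sketch below mirrors the clamping arithmetic of Init_Profiling in plain C so it can be checked on a PC. The helper names profiling_divider and min_unclamped_freq are hypothetical, and the extra guard against a zero frequency is an addition not present in the original routine.

```c
/* Reload value for a 16-bit timer clocked at main_clock_hz so that it
 * interrupts roughly samp_freq_hz times per second. Values that do not
 * fit in 16 bits saturate at 0xFFFF, as in Init_Profiling.            */
unsigned int profiling_divider(unsigned long main_clock_hz,
                               unsigned long samp_freq_hz)
{
    unsigned long divider;

    if (samp_freq_hz == 0)
        return 0xFFFFu;             /* assumption: treat 0 Hz as "slowest possible" */

    divider = main_clock_hz / samp_freq_hz;
    if (divider > 0xFFFFul)
        return 0xFFFFu;
    return (unsigned int)divider;
}

/* Lowest sampling frequency whose divider still fits in 16 bits.
 * floor(clock / f) <= 0xFFFF holds exactly when f > clock / 65536.     */
unsigned long min_unclamped_freq(unsigned long main_clock_hz)
{
    return main_clock_hz / 65536ul + 1ul;
}
```

As an illustration only, with an assumed 20 MHz clock (the lecture does not state MAIN_CLOCK) any request below 306 Hz would be clamped, so the sample counts would accumulate faster than the requested rate suggests.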
Profiler Interrupt Service Routine
• Don’t forget to register ISR in vector table!

    #pragma INTERRUPT/B profile_intr
    void profile_intr(void) {
        unsigned char PC_H;
        unsigned int PC_ML;
        unsigned long PC;
        unsigned char i;

        /* Get PC from stack */
        _asm("mov.w 2[FB], $$[FB]", PC_ML);
        _asm("mov.b 5[FB], $$[FB]", PC_H);
        PC = PC_H;
        PC <<= 16;
        PC += PC_ML;

        profile_ticks++;

        /* look up function in table and inc. counter */
        for (i=0; i<NUM_PROFILE_REGIONS; i++) {
            if ((PC >= profiles[i].Start) && (PC <= profiles[i].End)) {
                profiles[i].Count++;
                return;
            }
        }
    }
Configure Profile Table with Region Information
• Where do we find region addresses? Next page.

To profile modules (source files)

    PROFILE_T profiles[NUM_PROFILE_REGIONS] = {
        {"main.c",    UL 0x0f0100, UL 0x0f0299, 0},
        {"profile.c", UL 0x0f029a, UL 0x0f037f, 0},
        {"skp26.c",   UL 0x0f0380, UL 0x0f0613, 0},
        {"skp_lcd.c", UL 0x0f0614, UL 0x0f0917, 0},
        {"library",   UL 0x0f0918, UL 0x0f290d, 0},
        {"", UL 0, UL 0, 0},
        {"", UL 0, UL 0, 0},
        {"", UL 0, UL 0, 0},
    };

To profile functions

    PROFILE_T profiles[NUM_PROFILE_REGIONS] = {
        {"main",           UL main,           UL (main+0x0199), 0},
        {"LCD_Erase",      UL LCD_Erase_FB,   UL (LCD_Erase_FB+0x0240), 0},
        {"sin",            UL sin,            UL (sin+0x0138), 0},
        {"LCD_Plot_in_FB", UL LCD_Plot_in_FB, UL (LCD_Plot_in_FB+0x02e3), 0},
        {"LCD_Display_FB", UL LCD_Display_FB, UL (LCD_Display_FB+0x0302), 0},
        {"DisplayDelay",   UL DisplayDelay,   UL (DisplayDelay+0x01ad), 0},
        {"LCD_write",      UL LCD_write,      UL (LCD_write+0x017f), 0},
        {"", UL 0, UL 0, 0}
    };
Finding the Region Addresses
• Get module addresses from linker’s map file (in debug directory)
• Get function lengths (if needed) from .LST file
  – Second column is start address of each assembly instruction
  – Subtract function’s first address from its last address to find length

.LST file excerpt:

    506  ;## # FUNCTION LCD_Erase_FB
    507  ;## # FRAME AUTO (y) size 2, offset -4
    508  ;## # FRAME AUTO (x) size 2, offset -2
    509  ;## # ARG Size(0) Auto Size(4) Context Size(5)
    510
    511          .align
    512  ;## # C_SRC : void LCD_Erase_FB(void) {
    513          .glb _LCD_Erase_FB
    514  00212          _LCD_Erase_FB:
    515  00212  7CF204  enter #04H
    516  ;## # C_SRC : for (x=0; x<8; x++)
    517  00215  D90BFE  mov.w #0000H,-2[FB] ; x P
    (etc.)
    539  ;## # C_SRC : }
    540  0023F  7DF2    exitd
    541  00241          E7:

Map file excerpt:

    # SECTION  ATR  TYPE  START   LENGTH  ALIGN  MODULENAME
    program    REL  CODE  0F00FF  000000         NCRT0_26SKP
               REL  CODE  0F0100  00019A  2      MAIN
               REL  CODE  0F029A  0000E5  2      PROFILE
               REL  CODE  0F0380  000293  2      SKP26
               REL  CODE  0F0614  000303  2      SKP_LCD
               REL  CODE  0F0918  00020F  2      _F4DIV
               REL  CODE  0F0B28  00008D  2      _F4TOF8
               REL  CODE  0F0BB6  000A65  2      _F8ADD
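The End field in the profile table is inclusive, so turning map-file columns into table entries needs a little care. The sketch below shows two ways to derive it; both helper names are hypothetical. Note that the module table in these slides appears to use the second form (the byte just before the next module starts), which also covers any alignment padding after an odd-length module.

```c
/* Last byte occupied by a module, from the START and LENGTH columns.
 * A zero-length entry is treated as ending where it starts.          */
unsigned long region_end_from_length(unsigned long start, unsigned long length)
{
    return (length == 0) ? start : start + length - 1;
}

/* Last byte before the next module begins. With ALIGN 2 this can be one
 * byte later than region_end_from_length() when the length is odd.    */
unsigned long region_end_before(unsigned long next_start)
{
    return next_start - 1;
}
```

For MAIN (START 0F0100, LENGTH 00019A) both approaches give 0x0F0299. For PROFILE (START 0F029A, LENGTH 0000E5) the first gives 0x0F037E, while the second gives 0x0F037F, the value used in the module table.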
Overview of Profiling Approach
• Start with the big picture (rough details) and use that to determine where to look next
• Profiling sequence
  – module-level (file-level)
  – function-level within the most common module
  – basic block-level within the most common function
Detailed Steps to using profile.c/h
• Enable list file creation for each C source file
  – HEW: Options -> Renesas M16C Standard Toolchain
    • C Tab -> Category: List. Check –dS and –dSL boxes
• Fill in array profiles with region addresses (e.g. names of functions), dummy lengths, and zero counts
• Compile
  – May need to add function prototypes if profiles table is declared before the functions are
• Update profiles array with correct starting and ending addresses
• Recompile
• Run
• Examine profiles after running long enough
How Long is Enough?
• Complex statistical question
  – The statistician I asked said “it depends” and changed the subject
• So, run it until the digits you care about stop changing
• Example: Module-level profiling
[Figure: Fraction of Program Time (0% to 100%) versus Number of Samples (axis ticks at 10, 100, 1000, 10000, 100000), one series per region: main, profile, skp26, skp_lcd, library]
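"Run it until the digits you care about stop changing" can be turned into a simple check: take two snapshots of the counts some time apart and compare each region's share. The sketch below is one possible way to do that, not a statistically rigorous stopping rule; the function name profile_has_settled and the tolerance parameter are assumptions made for this example.

```c
#include <stddef.h>

/* Returns 1 when every region's share of the total samples changed by
 * less than tolerance between two snapshots, 0 otherwise.
 * tolerance is a fraction: 0.001 means 0.1 percentage points.         */
int profile_has_settled(const unsigned long *prev_counts, unsigned long prev_total,
                        const unsigned long *cur_counts, unsigned long cur_total,
                        size_t num_regions, double tolerance)
{
    size_t i;

    if (prev_total == 0 || cur_total == 0)
        return 0;                   /* nothing sampled yet */

    for (i = 0; i < num_regions; i++) {
        double before = (double)prev_counts[i] / (double)prev_total;
        double after  = (double)cur_counts[i]  / (double)cur_total;
        double delta  = (after > before) ? after - before : before - after;

        if (delta >= tolerance)
            return 0;               /* this region is still moving */
    }
    return 1;
}
```

Choosing the tolerance is the "digits you care about" decision: a tolerance of 0.01 asks only that whole percentages have stopped moving, while 0.0001 asks for two decimal places of a percent and needs far more samples.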
Where does the Lab 4 Skeleton Spend its Time?

We know there are delay loops executed every time the MCU writes to the LCD, but let’s verify how bad they are
• Start with modules
• Then look at functions in module
• Then look at basic blocks within function

    DisplayString(LCD_LINE1," Lab #4 ");
    DisplayString(LCD_LINE2," Starter");
    GRN_LED = LED_ON;

    while (1) {
        for (f=6.0; f>0.0; f -= 0.4) {
            LCD_Erase_FB();
            for (i=0; i<DISP_WIDTH_PIXELS; i++)
                LCD_Plot_in_FB((unsigned char)i,
                               (unsigned char) (3.5*(sin(i/f)+1.0)), 1);
            LCD_Display_FB(LCD_LINE1);
        }
    }
Step One: Profile Modules
• Define a profile region per module, and one for all the library functions

    # SECTION  ATR  TYPE  START   LENGTH  ALIGN  MODULENAME
    program    REL  CODE  0F00FF  000000         NCRT0_26SKP
               REL  CODE  0F0100  00019A  2      MAIN
               REL  CODE  0F029A  0000E5  2      PROFILE
               REL  CODE  0F0380  000293  2      SKP26
               REL  CODE  0F0614  000303  2      SKP_LCD
               REL  CODE  0F0918  00020F  2      _F4DIV
               REL  CODE  0F0B28  00008D  2      _F4TOF8
               REL  CODE  0F0BB6  000A65  2      _F8ADD
               REL  CODE  0F161C  0000A7  2      _F8LE
               REL  CODE  0F16C4  000069  2      _F8LTOR
               REL  CODE  0F172E  0002DE  2      _F8MUL
               REL  CODE  0F1A0C  0000BA  2      _F8TOF4
               REL  CODE  0F1AC6  000025  2      _F8TOI4U
               REL  CODE  0F1AEC  000192  2      _FTOL
               REL  CODE  0F1C7E  00004D  2      _I4DIVU
               REL  CODE  0F1CCC  000022  2      _I4TOF4
               REL  CODE  0F1CEE  0000FD  2      _LTOF
               REL  CODE  0F1DEC  000138  2      SIN
               REL  CODE  0F1F24  00035D  2      TAN
               REL  CODE  0F2282  000058  2      _F4LTOR
               REL  CODE  0F22DA  000060  2      _F4RTOL
               REL  CODE  0F233A  0002E4  2      _F8DIV
               REL  CODE  0F261E  00007C  2      _F8EQ
               REL  CODE  0F269A  0000A7  2      _F8LT
               REL  CODE  0F2742  00007C  2      _F8NE
               REL  CODE  0F27BE  000066  2      _F8RTOL
               REL  CODE  0F2824  00002E  2      _F8SUB
               REL  CODE  0F2852  000025  2      _F8TOI4
               REL  CODE  0F2878  000073  2      _I4MOD
               REL  CODE  0F28EC  000022  2      _I4TOF8
Step One: Profile Modules

    PROFILE_T profiles[NUM_PROFILE_REGIONS] = {
        {"main.c",    UL 0x0f0100, UL 0x0f0299, 0},
        {"profile.c", UL 0x0f029a, UL 0x0f037f, 0},
        {"skp26.c",   UL 0x0f0380, UL 0x0f0613, 0},
        {"skp_lcd.c", UL 0x0f0614, UL 0x0f0917, 0},
        {"library",   UL 0x0f0918, UL 0x0f290d, 0},
        {"", UL 0, UL 0, 0},
        {"", UL 0, UL 0, 0},
        {"", UL 0, UL 0, 0},
    };
Step 1 Results
• Surprise! The LCD functions aren’t taking up most of the processor’s time! The library functions are instead.

[Chart: Execution Time per Module (main, profile, skp26, skp_lcd, library, other)]

    Region     Count    Time
    main          70    0.21%
    profile        0    0.00%
    skp26          0    0.00%
    skp_lcd     9642   29.11%
    library    23415   70.68%
    other          1    0.00%
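The Time column is each count divided by the total number of ticks, and "other" is whatever the ISR counted in profile_ticks but could not attribute to a table entry. A minimal sketch of that post-processing follows; the function names are hypothetical, and it takes plain arrays rather than the PROFILE_T table so that it can run on a PC.

```c
#include <stdio.h>
#include <stddef.h>

/* Percentage of all samples that landed in one region. */
double sample_percent(unsigned long count, unsigned long total_ticks)
{
    return total_ticks ? 100.0 * (double)count / (double)total_ticks : 0.0;
}

/* Samples whose PC matched no table entry: total ticks minus the sum
 * of all region counts.                                               */
unsigned long other_count(const unsigned long *counts, size_t n,
                          unsigned long total_ticks)
{
    unsigned long in_regions = 0;
    size_t i;

    for (i = 0; i < n; i++)
        in_regions += counts[i];
    return (total_ticks > in_regions) ? total_ticks - in_regions : 0;
}

/* Print one line per region followed by the "other" line. */
void print_profile(const char *const names[], const unsigned long counts[],
                   size_t n, unsigned long total_ticks)
{
    unsigned long other = other_count(counts, n, total_ticks);
    size_t i;

    for (i = 0; i < n; i++)
        printf("%-10s %8lu %6.2f%%\n", names[i], counts[i],
               sample_percent(counts[i], total_ticks));
    printf("%-10s %8lu %6.2f%%\n", "other", other,
           sample_percent(other, total_ticks));
}
```

With the Step 1 counts (70, 0, 0, 9642, 23415) and 33128 total ticks, other_count returns 1 and the library share works out to about 70.68%, as in the table.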
Step Two: Profile Library
• We only have eight entries in our table, so let’s split up the library into eight regions of about three functions each

    Group   Start    Library modules (from the map file)
    lib 1   0F0918   _F4DIV, _F4TOF8, _F8ADD
    lib 2   0F161C   _F8LE, _F8LTOR, _F8MUL
    lib 3   0F1A0C   _F8TOF4, _F8TOI4U, _FTOL
    lib 4   0F1C7E   _I4DIVU, _I4TOF4, _LTOF
    lib 5   0F1DEC   SIN, TAN, _F4LTOR
    lib 6   0F22DA   _F4RTOL, _F8DIV, _F8EQ
    lib 7   0F269A   _F8LT, _F8NE, _F8RTOL
    lib 8   0F2824   _F8SUB, _F8TOI4, _I4MOD, _I4TOF8
Step Two: Profile Library

    PROFILE_T profiles[NUM_PROFILE_REGIONS] = {
        {"lib 1", UL 0x0f0918, UL 0x0f161b, 0},
        {"lib 2", UL 0x0f161c, UL 0x0f1a0b, 0},
        {"lib 3", UL 0x0f1a0c, UL 0x0f1c7d, 0},
        {"lib 4", UL 0x0f1c7e, UL 0x0f1deb, 0},
        {"lib 5", UL 0x0f1dec, UL 0x0f22d9, 0},
        {"lib 6", UL 0x0f22da, UL 0x0f2699, 0},
        {"lib 7", UL 0x0f269a, UL 0x0f2823, 0},
        {"lib 8", UL 0x0f2824, UL 0x0f290d, 0}
    };
Step Two Results
• Functions in group lib 6 are taking the most time, followed by lib 4 and lib 1

[Chart: Execution Time per Library Function Group (lib 1 through lib 8, other)]

    Region   Count    Time
    lib 1     3079   10.22%
    lib 2     1796    5.96%
    lib 3      865    2.87%
    lib 4     3092   10.26%
    lib 5      687    2.28%
    lib 6    11044   36.65%
    lib 7      131    0.43%
    lib 8      444    1.47%
    other     8999   29.86%
Step Three: Profile Top Library Functions
• Examine the nine functions in these three groups (lib 1, lib 4 and lib 6). The table has only eight entries, so two of the functions (_F4TOF8 and _F8ADD) share one entry
• Addresses come from the same map file listing shown in Step One
Step Three: Profile Top Library Functions

    PROFILE_T profiles[NUM_PROFILE_REGIONS] = {
        {"_F4DIV",         UL 0x0f0918, UL 0x0f0b27, 0},
        {"_F4TOF8+_F8ADD", UL 0x0f0b28, UL 0x0f161b, 0},
        {"_I4DIVU",        UL 0x0f1c7e, UL 0x0f1ccb, 0},
        {"_I4TOF4",        UL 0x0f1ccc, UL 0x0f1ced, 0},
        {"_LTOF",          UL 0x0f1cee, UL 0x0f1deb, 0},
        {"_F4RTOL",        UL 0x0f22da, UL 0x0f2339, 0},
        {"_F8DIV",         UL 0x0f233a, UL 0x0f261d, 0},
        {"_F8EQ",          UL 0x0f261e, UL 0x0f2699, 0}
    };
Step Three Results
• Most time spent in double precision floating point divide
• 3.5*(sin(i/f)+1.0) is culprit. Avoid floating point when possible

[Chart: Execution Time per Library Function (_F4DIV, _F4TOF8+_F8ADD, _I4DIVU, _I4TOF4, _LTOF, _F4RTOL, _F8DIV, _F8EQ, other)]

    Region            Count    Time
    _F4DIV              482    1.83%
    _F4TOF8+_F8ADD     2413    9.14%
    _I4DIVU               0    0.00%
    _I4TOF4              18    0.07%
    _LTOF              2748   10.41%
    _F4RTOL              36    0.14%
    _F8DIV             9418   35.67%
    _F8EQ                 0    0.00%
    other             11291   42.76%

    Total samples: 26406. Share inside the eight profiled entries: 57.24%
Disadvantages of Sampling
• Sampling is inexact - not guaranteed to get everything that runs
– Code which disables interrupts (e.g. ISRs, OS code) is not measured
– Rarely executed code may be missed
– Takes time for numbers to settle down
– Profile changes based on mode of program
• If manually creating table, user needs to update address table with each code change
Implementing Instrumentation
• Tedious to do manually for large programs, so automate
• Have compiler instrument code for you
  – gcc and other compilers support profiling using a command line switch (for gcc, -pg)
  – They provide a tool to process the output file to determine how much time each function takes (for gcc, gprof)
• Can also modify the binary (after compilation): Atom from DEC’s Western Research Lab, Etch, EEL
  – Tool processes binary files to run your instrumentation procedures for each procedure, basic block, or instruction
• For the M16C, what would be best?
  – Create a program which reads the map file and creates a C file declaring our profiles array with correct region names and addresses
    • Probably easiest to use a scripting language: sed, awk, perl, (f)lex
    • Probably not enough memory to instrument all functions, so must be selective
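The slide suggests a scripting language for the map-file tool; to stay in the same language as the rest of these notes, the core of such a tool is sketched here in C instead. It assumes module lines look like the excerpts shown earlier ("REL CODE start length align name"); the real map file may contain other line formats that this does not handle. Both function names are hypothetical.

```c
#include <stdio.h>
#include <stddef.h>

/* Parse one module line such as "REL CODE 0F0100 00019A 2 MAIN".
 * Returns 1 and fills start, length and name on success; returns 0
 * for anything else (headers, the zero-length startup entry, etc.).  */
int parse_map_line(const char *line, unsigned long *start,
                   unsigned long *length, char name[32])
{
    unsigned align;

    if (sscanf(line, " REL CODE %lx %lx %u %31s",
               start, length, &align, name) != 4)
        return 0;
    return *length != 0;
}

/* Write one initializer row for the profiles array into out.
 * End is start + length - 1, so any alignment padding after the
 * module is not covered. Returns the snprintf result.               */
int format_profile_entry(char *out, size_t out_size, const char *name,
                         unsigned long start, unsigned long length)
{
    return snprintf(out, out_size, "{\"%s\", UL 0x%06lx, UL 0x%06lx, 0},",
                    name, start, start + length - 1);
}
```

A complete tool would read the map file line by line with fgets, call parse_map_line on each line, and emit the rows between the array declaration and its closing brace. Since the table has only eight entries, it would also need a way to select or merge modules, which is the "must be selective" point above.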