Project Presentation by Joshua George Advisor – Dr. Jack Davidson.


Exploiting hardware for loop optimizations

ZOLB – Zero Overhead Loop Buffers

Decrement, compare and jump instructions

Compare with zero instructions

Background

VPO (Very Portable Optimizer): operates on a low-level, machine-independent form, RTLs (register transfer lists).

ZOLB: many DSPs have a compiler-managed cache for loops.
    Reduces loop overhead.
    No branch.
    The loop is held in an internal buffer, which saves on instruction fetch.
    Can reduce code size.
    Power overhead remains low.

Decrement, compare and jump: e.g. the loop instruction on x86, banz on tms320c54x.

Compare with zero: e.g. SPARC.

Status

Added support for repeat instructions (ZOLB) on the tms320c54x.

Added support for converting loops to count down so that they can use decrement, compare and jump instructions. Retargeted to three machines: x86, SPARC and tms320c54x.

Implementation - guidelines

Add as little code as possible to the machine-dependent part (md), while doing most of the implementation in the machine-independent part (lib).

Design the interface between lib and md to allow for possible issues with other targets.

Issues (ZOLB)

How to describe effectively? An example:

BRC=10;                   (Block Repeat Count)
RSA=L1; REA=EN[L1];       (Repeat Start Address and Repeat End Address)
L1:
w[0]=w[0]+1; W[w[0]]=0;
PC=BRC>0,L1; BRC=BRC-1;
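The behaviour that these RTLs describe can be modelled with a short Python sketch. It is only an illustration, not VPO code: the function name run_repeat_block and the dictionary standing in for memory are assumptions. It also assumes that both effects of the last RTL happen together, so the test sees the value of BRC before the decrement and the body runs BRC+1 times.

```python
def run_repeat_block(brc, w0, memory):
    """Model of: L1: w[0]=w[0]+1; W[w[0]]=0; PC=BRC>0,L1; BRC=BRC-1;"""
    iterations = 0
    while True:
        # Loop body: bump the address register, then store zero through it.
        w0 += 1
        memory[w0] = 0
        iterations += 1
        # Last RTL (assumed parallel): test the old BRC value, then decrement.
        repeat = brc > 0
        brc -= 1
        if not repeat:
            break
    return iterations, w0


memory = {}
count, last_address = run_repeat_block(brc=10, w0=0, memory=memory)
print(count, last_address)  # 11 11
```

Under that assumption, a loop that should run N times needs a repeat count of N-1.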

Issues (ZOLB)

How to bind the rpt instruction to the start of the rpt block? (On the tms320c54x, the start of the rpt block is implicitly the instruction after the rpt instruction.)

Changing vpo to support 'binding' of an instruction to the next would be overkill.

Solution: make fixentry() take care of this, after vpo has finished its optimization loop.

Issues (ZOLB)

How to describe unrepeatable instructions?

The machine description sets the UNREPEATABLE flag for each unrepeatable instruction.

The machine description also provides a list of instructions that disappear after conversion. VPO ignores instructions in this list when checking for unrepeatability.
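The check described here can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the VPO implementation: the Instr class, the block_is_repeatable function and the choice of which instruction is flagged are all hypothetical, and real flags are target-specific.

```python
from dataclasses import dataclass


@dataclass
class Instr:
    text: str
    unrepeatable: bool = False  # stands in for the UNREPEATABLE flag set by md


def block_is_repeatable(body, disappearing):
    """Return True if no surviving instruction carries the UNREPEATABLE flag."""
    for instr in body:
        if instr.text in disappearing:
            # This instruction is removed by the conversion, so its flag is ignored.
            continue
        if instr.unrepeatable:
            return False
    return True


# Hypothetical loop body: the branch is flagged only for the sake of the example.
body = [
    Instr("w[0]=w[0]+1;W[w[0]]=0;"),
    Instr("PC=r[0]<0,L1;", unrepeatable=True),
]
print(block_is_repeatable(body, disappearing={"PC=r[0]<0,L1;"}))
print(block_is_repeatable(body, disappearing=set()))
```

The first call reports the block as repeatable because the flagged instruction is in the list of instructions that disappear; the second call does not.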

Issues (ZOLB)

How to specify the end label?

If we simply label the next block, vpo won't print the label since it cannot see a jump to that label.

Solution: use a mangled version of the start label (e.g. L1_end) as the end label for the rpt instruction.

Output the same mangled version of the start label when the last instruction in the rpt block is encountered in fixentry(). Note that this last instruction contains the start label.
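The label mangling can be sketched in Python. The helper names end_label and emit_last_rtl are assumptions made for illustration, as is the regular expression used to read the start label back out of the closing RTL; the real fixentry() works on vpo's own data structures.

```python
import re


def end_label(start_label):
    """Mangle the start label into the end label, e.g. L1 -> L1_end."""
    return start_label + "_end"


def emit_last_rtl(rtl):
    """Turn the closing RTL of a repeat block into its end-label line.

    The closing RTL is assumed to look like PC=<test>,<start label>;...
    so the start label can be recovered from it.
    """
    match = re.search(r"PC=[^,;]+,(\w+);", rtl)
    if match is None:
        raise ValueError("not a repeat-block closing RTL: " + rtl)
    return end_label(match.group(1)) + ":"


print(emit_last_rtl("PC=b[0]>0,L1;b[0]=b[0]-1;"))  # L1_end:
```

Because the end label is derived from the start label in both places, the rpt instruction and the label line agree without vpo having to track a second label.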

Implementation

Information supplied by md to lib:

    Which instructions are unrepeatable.
    The number of instructions that would remain after the conversion.
    The list of rtls involved in the compare and jump.
    The elements involved in the compare (the register, the expression it is being compared with, and the relational operator). These help to determine the iteration count.
    How to identify a comparison rtl.
    How to initialize a register to an expression.

Note: many md parts were already in place, e.g. the loop strength reduction support code.

Implementation (cont..)

The md does the actual insertion of the rpt rtls.

b[0]=10;                      sr_init()               (Block Repeat Count)
b[1]=L1; b[2]=EN[L1];         md_convert_rpt_block    (Repeat Start Address and Repeat End Address)
L1:
w[0]=w[0]+1; W[w[0]]=0;
PC=b[0]>0,L1; b[0]=b[0]-1;    md_convert_rpt_block

(The last rtl is simply converted to a label when outputting the assembly.)

Implementation

What is done in lib?

    Ensuring that the instructions in the loop are repeatable.
    Counting the number of instructions that will remain in the loop after conversion. This allows md to determine whether it wants to convert the loop to a single-rpt instruction.
    Analysis of the uses and lifetime of the loop control variable to determine whether the control variable increments can disappear.
    Finding the iteration count of the loop.
    Identifying the loop control variable and its increment points.
    Finding the loop exit block.

Note: a lot of the functionality listed above was already present in vpo lib.
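Finding the iteration count from the elements of the comparison can be sketched in Python. This is a simplified illustration, not the vpo lib code: the function iteration_count is hypothetical, it handles only constant operands, and it assumes a bottom-tested loop (the body runs, the control variable is stepped, then the comparison decides whether to branch back).

```python
def iteration_count(init, bound, step, relop):
    """Trip count of: do { body; i += step; } while (i relop bound)."""
    if step == 0:
        raise ValueError("step must be non-zero")
    if relop in ("<", "<="):
        if step < 0:
            raise ValueError("counting up needs a positive step")
        distance = bound - init
    elif relop in (">", ">="):
        if step > 0:
            raise ValueError("counting down needs a negative step")
        distance, step = init - bound, -step
    else:
        raise ValueError("unsupported relational operator: " + relop)
    if relop in ("<", ">"):
        count = -(-distance // step)  # ceiling division
    else:
        count = distance // step + 1
    # A bottom-tested loop always executes its body at least once.
    return max(1, count)


# A control variable that starts at 0, steps by 1 and loops while it is < 10.
print(iteration_count(0, 10, 1, "<"))      # 10
print(iteration_count(0, 10, 1, "<") - 1)  # 9
```

The second value is the count minus one, which is the form a repeat count takes if the hardware runs the block count+1 times.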

Example

Before conversion: 5 instructions. Has a branch.

RTL                               Assembly
w[0]=_A;                          stm #0, ar0
L1:
w[0]=w[0]+1; W[w[0]]=0;           st #0, *ar0+
r[0] = (w[0]{24)}24;              ld *(ar0), A
r[0] = r[0] - (_A + 10);          sub _A+#10, A
PC=r[0]<0,L1;                     bc L1, Alt

After conversion to a single instruction repeat: only 3 instructions. Dynamic instruction count becomes much higher once the instruction is in the pipeline.

RTL                               Assembly
w[0]=_A;                          stm #0, ar0
n[1]=L6; n[2]=EL[L6]; n[0]=9;     rpt #9
L6:
w[0]=w[0]+1; W[w[0]]=0;           st #0, *ar0+
PC=n[0]>0,L6; n[0]=n[0]-1;

Future work

How to prevent vpo from changing the block size (e.g. when spills are added)?

In the single repeat instruction, how to add support for the auto-increment direct addressing mode. E.g.

rpt #123
mvdk *ar1, #800h

Count down loops

Objective: convert loops to count down to zero, instead of counting up to a constant or counting down to a constant.

Reasoning

Most architectures have a single compare-to-zero instruction. Comparing to other values needs at least one more instruction.

Some architectures can decrement, compare and jump in a single instruction!
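The effect of the conversion can be illustrated with a small Python sketch. It is a source-level analogy only, with hypothetical function names, and it assumes the iteration count n is at least 1 and known before the loop starts. Both functions model bottom-tested loops, as a compiler would see them after loop inversion.

```python
def count_up(n):
    """Original loop: the exit test compares i against a non-zero bound."""
    visited = []
    i = 0
    while True:
        visited.append(i)
        i += 1
        if not i < n:  # compare with n
            break
    return visited


def count_down(n):
    """Converted loop: a separate counter runs down to zero."""
    visited = []
    i = 0
    counter = n  # iteration count, initialized once before the loop
    while True:
        visited.append(i)
        i += 1  # still needed only because the body uses i
        counter -= 1
        if counter == 0:  # compare with zero
            break
    return visited


print(count_up(5) == count_down(5))  # True
```

If the body did not use i, the increment of i could be dropped and only the counter update would remain, which is the case where a decrement, compare and jump instruction covers the whole loop overhead.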

Implementation

Information supplied by md to lib:

    The list of registers that are candidates to form the count-down-to-zero induction variable (e.g. on x86 it is advantageous to do this conversion only if the count down uses the ecx register).
    Whether this conversion is worthwhile on this loop.

Implementation (cont..)

Information supplied by md to lib:

    How to initialize a register to an expression.
    How to decrement a register.
    The elements of a comparison.
    How to identify a comparison rtl.
    The relop used for comparing to zero.


Finding the expression that represents the iteration count.

Identifying the loop control variable and its increment points.

Analysis of the uses and lifetime of the loop control variable to determine whether the conversion is worthwhile. The decision is made by md.

Identifying the exit block.

Spilling and reloading the new loop control variable if needed.


Analyze the list of candidate registers to select the best one for this loop.

First preference: the current control variable, provided it is free.

If worthwhile, then any other free register.

Last option: a register that is live across the loop but not used within the loop. This register has to be spilled in the loop pre-header and reloaded at loop exit.
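The preference order can be sketched in Python. Everything here is hypothetical: the function pick_counter_register, the use of sets for register state, and the reading that the "worthwhile" decision gates both the second and the last option (the source states it only for the second).

```python
def pick_counter_register(candidates, control_var, free, live_unused, worthwhile=True):
    """Return (register, needs_spill), or None if no candidate qualifies.

    candidates  -- registers the md allows for the count-down variable, in order
    control_var -- register holding the current loop control variable
    free        -- registers that are free for this loop
    live_unused -- registers live across the loop but not used within it
    """
    # First preference: reuse the current control variable if it is free.
    if control_var in candidates and control_var in free:
        return control_var, False
    if not worthwhile:
        return None  # assumption: the md's decision also rules out spilling
    # Second preference: any other free candidate register.
    for reg in candidates:
        if reg in free:
            return reg, False
    # Last option: spill in the pre-header, reload at loop exit.
    for reg in candidates:
        if reg in live_unused:
            return reg, True
    return None


# x86-style example where only ecx is a candidate.
print(pick_counter_register(["ecx"], "eax", free=set(), live_unused={"ecx"}))
```

In this example ecx is not free but is live across the loop and unused inside it, so it is chosen with the spill flag set.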

Performance: SPEC on x86

[Bar chart: "No Count Down" versus "With Count Down" for gzip, vpr, gcc, mcf, parser, perl, gap, vortex, bzip2 and twolf; vertical axis from 200 to 400.]

Analysis

Average performance has improved after applying the count down optimization.

Conclusion

More fine-tuning is needed to realize substantial performance gains.

The primary objective of adding easily retargetable support for these loop optimizations has been accomplished: retargeted to 3 targets!

Acknowledgements

Dr. Jack Davidson (advisor)

Jason Hiser

Clark Coleman
