Loop Unrolling & Predication

Page 1: Loop Unrolling & Predication

Loop Unrolling & Predication

CSE 820

Page 2: Loop Unrolling & Predication

Michigan State University, Computer Science and Engineering

Software Pipelining

With software pipelining a reorganized loop contains instructions from different iterations of the original loop.

Sometimes called symbolic loop unrolling.

Page 3: Loop Unrolling & Predication

Software Pipelined Loop

Page 4: Loop Unrolling & Predication

Unrolled Loop

Select a subset of each iteration (marked "selected"):

Iteration 1:
        L.D    F0,0(R1)
        ADD.D  F4,F0,F2
        S.D    F4,0(R1)     ; selected

Iteration 2:
        L.D    F0,0(R1)
        ADD.D  F4,F0,F2     ; selected
        S.D    F4,0(R1)

Iteration 3:
        L.D    F0,0(R1)     ; selected
        ADD.D  F4,F0,F2
        S.D    F4,0(R1)

Page 5: Loop Unrolling & Predication

Software Pipelining

Loop:   S.D    F4,16(R1)     ; stores into M[i]
        ADD.D  F4,F0,F2      ; adds to M[i-1]
        L.D    F0,0(R1)      ; loads M[i-2]
        DADDUI R1,R1,#-8
        BNE    R1,R2,Loop

Requires start-up and clean-up.
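The start-up and clean-up code is easier to see in a C rendering of the same idea. The sketch below is an illustration only, not code from the slides: the function names `add_scalar_simple` and `add_scalar_pipelined` are hypothetical, the array is walked upward rather than downward, and the pipelined version assumes at least two elements.

```c
#include <assert.h>
#include <stddef.h>

/* Original loop: each iteration loads, adds, and stores one element. */
void add_scalar_simple(double *m, size_t n, double s) {
    for (size_t i = 0; i < n; i++) {
        m[i] = m[i] + s;
    }
}

/* Software-pipelined sketch (hypothetical; requires n >= 2).
 * Each kernel iteration mixes three original iterations:
 * the store for element i, the add for i + 1, the load for i + 2. */
void add_scalar_pipelined(double *m, size_t n, double s) {
    assert(n >= 2);

    /* Start-up: fill the pipeline. */
    double loaded = m[0];          /* load for element 0 */
    double summed = loaded + s;    /* add  for element 0 */
    loaded = m[1];                 /* load for element 1 */

    /* Kernel: one store, one add, one load per iteration. */
    size_t i = 0;
    for (; i + 2 < n; i++) {
        m[i] = summed;             /* store for element i     */
        summed = loaded + s;       /* add   for element i + 1 */
        loaded = m[i + 2];         /* load  for element i + 2 */
    }

    /* Clean-up: drain the pipeline (i == n - 2 here). */
    m[i] = summed;                 /* store for element n - 2 */
    m[i + 1] = loaded + s;         /* add and store for element n - 1 */
}
```

Both functions leave the array in the same state; the pipelined one simply separates the dependent load, add, and store of any one element across different kernel iterations.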

Page 6: Loop Unrolling & Predication

Symbolic Loop Unrolling

Software pipelining can be thought of as symbolic loop unrolling, but has the advantage of generating less code.

Page 7: Loop Unrolling & Predication

Software Pipelining has less overhead

Page 8: Loop Unrolling & Predication

Global Code Scheduling

Allows moving instructions across branches.

Most techniques concentrate on determining a straight-line code segment representing the most frequently executed code.

Page 9: Loop Unrolling & Predication

Trace Scheduling

Concept

1. Guess the likely path through branches (called the trace).

2. The trace now contains long stretches of code without taken branches (as predicted).

3. Schedule the trace, allowing movement across branches.

• Add code off the trace to undo the effects of movement.

• The increased ability to move across branches should improve scheduling.

Page 10: Loop Unrolling & Predication

Movement + Undo

Consider

    if (cond) {
        x = x + 5;      // likely
    } else {
                        // unlikely
    }

After Movement

    x = x + 5;
    if (cond) {
                        // likely
    } else {
        x = x - 5;      // unlikely: undo
    }
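The equivalence of the two versions can be checked directly. The following is a minimal sketch with hypothetical function names (`original`, `after_movement`, `check_equivalent`); it uses `int` and assumes `x + 5` does not overflow, which is itself a reminder that hoisted code can introduce faults the original path would never have hit.

```c
#include <assert.h>
#include <stdbool.h>

/* Original code: the add is control dependent on cond. */
int original(bool cond, int x) {
    if (cond) {
        x = x + 5;      /* likely path */
    }
    return x;
}

/* After movement: the add runs unconditionally above the branch,
 * and compensation code on the unlikely path undoes it. */
int after_movement(bool cond, int x) {
    x = x + 5;          /* hoisted above the branch */
    if (!cond) {
        x = x - 5;      /* unlikely path: undo */
    }
    return x;
}

/* Checks that both versions agree on both paths for one input. */
void check_equivalent(int x) {
    assert(original(true, x) == after_movement(true, x));
    assert(original(false, x) == after_movement(false, x));
}
```

The likely path now executes no extra work, while the unlikely path pays for one additional subtraction.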

Page 11: Loop Unrolling & Predication

Select a trace

Page 12: Loop Unrolling & Predication

Trace showing jumps off the trace

Page 13: Loop Unrolling & Predication

Superblocks

Superblocks avoid the multiple entries and exits of traces.

A superblock has one entry and multiple exits, which makes scheduling easier.

The one-entry, multiple-exit structure is achieved by duplicating code (tail duplication) where the unlikely path exits the trace, so that no re-entry is needed.
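Code duplication at the exit can be illustrated at the source level. This is only a sketch: the function names and the arithmetic are invented for illustration, and a real compiler performs the transformation on its intermediate representation rather than on C source.

```c
#include <assert.h>

/* Trace with a side entrance: the unlikely path rejoins at the tail,
 * so the tail has two entries. */
int with_reentry(int a, int b) {
    int r;
    if (a != 0) {
        r = a * 2;      /* likely path */
    } else {
        r = b;          /* unlikely path */
    }
    r = r + 1;          /* tail: entered from both paths */
    return r * 3;
}

/* Superblock form: the tail is duplicated on the unlikely path, so the
 * likely path is entered only at the top and is left only by exits. */
int as_superblock(int a, int b) {
    int r;
    if (a == 0) goto off_trace;   /* side exit, never re-enters */
    r = a * 2;
    r = r + 1;
    return r * 3;
off_trace:
    r = b;
    r = r + 1;                    /* duplicated tail */
    return r * 3;
}

/* Checks that both forms compute the same result. */
void check_same(int a, int b) {
    assert(with_reentry(a, b) == as_superblock(a, b));
}
```

The cost is larger code; the benefit is that the scheduler can treat the likely path as a single straight-line region.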

Page 14: Loop Unrolling & Predication

Superblock: one entry and multiple exits

Page 15: Loop Unrolling & Predication

Predicated Instructions

Requires
– Hardware
– ISA modification

Predicated instructions eliminate branches, converting a control dependence into a data dependence.

IA-64 has predicated instructions, but many existing ISAs contain at least one predicated instruction (the conditional move).

Page 16: Loop Unrolling & Predication

Conditional Move

if (R1 == 0) R2 = R3;

Branch:
        BNEZ  R1,L
        ADDU  R2,R3,R0
L:

Conditional Move:
        CMOVZ R2,R3,R1

In a pipeline, the control dependence at the beginning of the pipeline is transformed into a data dependence at the end of the pipeline.
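The data-dependence form of the conditional move can be modeled in C with a mask instead of a branch. This is a sketch under assumptions: the function names are hypothetical, registers are modeled as `uint64_t` values, and whether a given compiler emits an actual conditional-move instruction for either function is not guaranteed.

```c
#include <assert.h>
#include <stdint.h>

/* Branch version: mirrors BNEZ R1,L / ADDU R2,R3,R0.
 * Returns the new value of R2. */
uint64_t select_branch(uint64_t r1, uint64_t r2, uint64_t r3) {
    if (r1 == 0) {
        r2 = r3;
    }
    return r2;
}

/* Branch-free version modeling CMOVZ R2,R3,R1.
 * The result depends on r1 only through data, not control flow. */
uint64_t select_cmovz(uint64_t r1, uint64_t r2, uint64_t r3) {
    /* mask is all ones when r1 == 0, all zeros otherwise */
    uint64_t mask = (uint64_t)0 - (uint64_t)(r1 == 0);
    return (r3 & mask) | (r2 & ~mask);
}

/* Checks that both versions agree for one set of register values. */
void check_cmovz(uint64_t r1, uint64_t r2, uint64_t r3) {
    assert(select_branch(r1, r2, r3) == select_cmovz(r1, r2, r3));
}
```

In the mask version the value of `r1` must be available before the result can be produced, which is exactly the data dependence that replaces the branch.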

Page 17: Loop Unrolling & Predication

Full Predication

Every instruction has a predicate: if the predicate is false, the instruction becomes a NOP.

Full predication is particularly useful for global scheduling because it can eliminate non-loop branches, which are the harder ones to schedule.
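Eliminating a non-loop branch this way is often called if-conversion. The sketch below models it in C for a small if/else; `pred_mov` and `pred_add` are hypothetical stand-ins for predicated instructions (the C conditional operator is only a model of the NOP behavior, not a claim about generated code).

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical predicated move: a NOP (dest unchanged) when pred is false. */
static int pred_mov(bool pred, int dest, int src) {
    return pred ? src : dest;
}

/* Hypothetical predicated add: a NOP (dest unchanged) when pred is false. */
static int pred_add(bool pred, int dest, int a, int b) {
    return pred ? a + b : dest;
}

/* Reference version with a branch. */
int with_branch(int a, int b) {
    if (a == 0) a = b; else a = a + 4;
    return a;
}

/* If-converted version: both arms are issued, each under its predicate. */
int if_converted(int a, int b) {
    bool p = (a == 0);          /* compare sets the predicate   */
    bool q = !p;                /* and its complement           */
    a = pred_mov(p, a, b);      /* (p) a = b                    */
    a = pred_add(q, a, a, 4);   /* (q) a = a + 4                */
    return a;
}

/* Checks that the two versions agree. */
void check_if_conversion(int a, int b) {
    assert(with_branch(a, b) == if_converted(a, b));
}
```

Exactly one of the two predicated operations takes effect on any run, so the straight-line sequence computes the same result as the branching code.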

Page 18: Loop Unrolling & Predication

Exceptions & Predication

A predicated instruction must not be allowed to generate an exception if the predicate is false.

Page 19: Loop Unrolling & Predication

Implementation

Predicated instructions can be annulled early in the pipeline, but annulling at commit delays the decision so that data hazards have an opportunity to be resolved.

The disadvantage of late annulment is that an annulled instruction still uses resources such as functional units and registers (rename or other).

Page 20: Loop Unrolling & Predication

Predication is good for…

• Short alternative control flow

• Eliminating some unpredictable branches

• Reducing the overhead of global scheduling

But the precise rules for compilation are still being determined.

Page 21: Loop Unrolling & Predication

Limitations

• Annulled instructions waste resources: registers, functional units, cache & memory bandwidth

• If the evaluation of the predicate condition cannot be scheduled apart from the predicated instruction, a branch might have performed better, provided it could have been accurately predicted.

Page 22: Loop Unrolling & Predication

Limitations (cont'd)

• Predication across multiple branches can complicate control and is undesirable unless hardware supports it (as in IA-64).

• Predicated instructions may have a speed penalty compared with unpredicated ones; this is not the case when all instructions are predicated.

Page 23: Loop Unrolling & Predication

Example

if (A==0) A=B; else A=A+4;

        LD    R1,0(R3)     ;load A
        BNEZ  R1,L1        ;test A
        LD    R1,0(R2)     ;then clause
        J     L2           ;skip else
L1:     DADDI R1,R1,#4     ;else clause
L2:     SD    R1,0(R3)     ;store A

Page 24: Loop Unrolling & Predication

Hoist Load

if (A==0) A=B; else A=A+4;

        LD    R1,0(R3)     ;load A
        LD    R14,0(R2)    ;speculative load B
        BEQZ  R1,L3        ;other branch of if
        DADDI R14,R1,#4    ;else clause
L3:     SD    R14,0(R3)    ;store A

What if speculative load raises an exception?

Page 25: Loop Unrolling & Predication

Guard

if (A==0) A=B; else A=A+4;

        LD     R1,0(R3)    ;load A
        sLD    R14,0(R2)   ;speculative load
        BNEZ   R1,L1       ;test A
        SPECCK 0(R2)       ;speculative check
        J      L2          ;skip else
L1:     DADDI  R14,R1,#4   ;else clause
L2:     SD     R14,0(R3)   ;store A

sLD does not raise certain exceptions; leaves them for SPECCK (IA-64).

Page 26: Loop Unrolling & Predication

Other exception techniques

• Poison bit:
– applied to the destination register.
– set upon exception.
– raises the exception upon access to the poisoned register.
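The poison-bit scheme can be modeled in software to show how it answers the question raised by the hoisted load. Everything in this sketch is an assumption made for illustration: the `Reg` type, the function names, and the use of a NULL address to stand in for a faulting access.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical register model: a value plus a poison bit. */
typedef struct {
    long value;
    bool poisoned;
} Reg;

/* Speculative load: a faulting access (modeled as a NULL address)
 * sets the poison bit instead of raising the exception. */
Reg spec_load(const long *addr) {
    Reg r = {0, false};
    if (addr == NULL) {
        r.poisoned = true;      /* exception deferred */
    } else {
        r.value = *addr;
    }
    return r;
}

/* Access to a register: the deferred exception surfaces here,
 * reported through *fault. */
long use_reg(Reg r, bool *fault) {
    assert(fault != NULL);
    *fault = r.poisoned;
    return r.poisoned ? 0 : r.value;
}

/* if (a == 0) a = b; else a = a + 4; with the load of b hoisted
 * above the test of a. */
long hoisted(long a, const long *b_addr, bool *fault) {
    assert(fault != NULL);
    Reg rb = spec_load(b_addr);     /* speculative load */
    *fault = false;
    if (a == 0) {
        return use_reg(rb, fault);  /* only this path reads rb */
    }
    return a + 4;                   /* poison is never observed */
}
```

When the else path is taken, a bad address for `b` causes no fault at all, which matches the behavior of the original unhoisted code.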

Page 27: Loop Unrolling & Predication

Hoist Load above Store

If memory addresses are known, a load can be hoisted above a store.

If not, add a special instruction to check the addresses before the loaded value is used. (This is similar to the SPECCK shown earlier, as in IA-64.)
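A software model of the address check might look like the following. It is a sketch with hypothetical names: the explicit pointer comparison stands in for the special check instruction, and redoing the load stands in for whatever recovery the hardware or compiler provides.

```c
#include <assert.h>

/* Original order: the store comes first, then the load. */
int store_then_load(int *p, int *q, int v) {
    *p = v;
    return *q + 1;
}

/* Load hoisted above the store, with an address check before use. */
int load_hoisted(int *p, int *q, int v) {
    int loaded = *q;    /* speculative load, moved above the store */
    *p = v;             /* store that may or may not alias q       */
    if (p == q) {       /* check: did the store hit the loaded address? */
        loaded = *q;    /* recovery: redo the load                 */
    }
    return loaded + 1;
}

/* Checks both orders agree for distinct and for aliasing addresses. */
void check_hoist(void) {
    int x1 = 1, y1 = 2, x2 = 1, y2 = 2;
    assert(store_then_load(&x1, &y1, 10) == load_hoisted(&x2, &y2, 10));
    int z1 = 5, z2 = 5;
    assert(store_then_load(&z1, &z1, 10) == load_hoisted(&z2, &z2, 10));
}
```

When the addresses differ, the load has already completed by the time the store issues; only the rare aliasing case pays for the reload.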

Page 28: Loop Unrolling & Predication

Speculation: soft vs. hard

• Speculation must be able to disambiguate memory references (to hoist loads past stores), but the information available at compile time is insufficient.

• Hardware works best when control flow is unpredictable and when hardware branch prediction is superior.

• Exception handling is easier in hardware.

• Trace techniques require compensation code.

• Compilers see further ahead, which allows better scheduling.

Page 29: Loop Unrolling & Predication

IA-64