A Compiler Framework to Support Speculative Multi-Core...

A Compiler Framework to Support A Compiler Framework to Support Speculative MultiSpeculative Multi--Core ProcessorsCore Processorsfor Generalfor General--Purpose ApplicationsPurpose Applications

Pen-Chung Yew 游本中

Department of Computer Science and EngineeringUniversity of Minnesota

http://www.cs.umn.edu/Agassiz

2006/4/12 CTHPC 2006 - P.C.Yew 2

OutlineOutlineSpeculative multi-core processors and general-purpose applicationsA compiler framework to support speculative execution and optimizationsSpeculative optimizations for single-core processorsSpeculative optimizations for multi-core processorsConclusions

2006/4/12 CTHPC 2006 - P.C.Yew 3

MultiMulti--Core Processors on Technology Core Processors on Technology Road Map Road Map

Multi-core processors on Intel’s roadmap SMPs have been around for a long time. What is new for multi-core?Why general-purpose applications now?Use thread-level parallelism (TLP) to improve instruction-level parallelism (ILP)

2006/4/12 CTHPC 2006 - P.C.Yew 4

Use TLP to Support ILPUse TLP to Support ILP

Time

op1 op2 op3 op4

op5 op6 op7 op8

op9 op10 op11 op12

……………..

op21 op22 op23 op24

t1

t2

t3

t6

Time

t1

t2

t3

t6

op1 op2 op3 op4

op5 op6 op7 op8

op9 op10 op11 op12

…………

op21 op22 op23 op24

Th1 Th2 Th3 Th4

Multi-Core

TLPSuperscalar

ILP

2006/4/12 CTHPC 2006 - P.C.Yew 5

TLP Challenges in GeneralTLP Challenges in General--Purpose ApplicationsPurpose Applications

Mostly Do-while loops– Need thread-level control speculation

Parallelism exists mostly in outer loops – Not good for VLIW (i.e. software pipelining ) or vector

processing => need thread-level supportPointers complicate alias and data dependence analysis– Need runtime support for disambiguation and data speculation

Many small loops and doacross loops– Need fast and low overhead communication

Small basic blocks – need to exploit both ILP and TLPNeed new approaches to apply parallel processing

for such applications!!

2006/4/12 CTHPC 2006 - P.C.Yew 6

Speculation:Speculation:Breaking Program Dependency Breaking Program Dependency

Speculation is an effective approach to break dependences

Optimize program execution by ignoring infrequent data dependence edges, or taking predicted pathsCheck possible violation (mis-speculation) at runtimeRecover if violation occurs

2006/4/12 CTHPC 2006 - P.C.Yew 7

Speculation on Intel IA64Speculation on Intel IA64

Both control and data speculation are supported on Intel IA64– Special instructions and hardware are provided– ld.s, ld.a, ld.sa and ld.c, chk.a, chk.s

Memory load operation is targeted for speculation– Memory delay is usually the bottleneck of performance– Memory load is usually the start of speculative operations

2006/4/12 CTHPC 2006 - P.C.Yew 8

OutlineOutlineSpeculative multi-core processors and general-purpose applicationsA compiler framework to support speculative execution and optimizationsSpeculative optimizations for single threadSpeculative optimizations for multi-threaded processorsConclusions

2006/4/12 CTHPC 2006 - P.C.Yew 9

A Compiler Framework:A Compiler Framework:Intel Open Research Compiler (ORC)Intel Open Research Compiler (ORC)

Speculative alias and dataflow analysis

Control flow graph (control speculation)

Speculative use-def chain/ SSA form(data speculation)

Control flow analysis

Edge/path profile

Heuristic rules

Alias /dependence profile

Heuristic rules

Speculative optimizations PRE based optimizations: PRE for expressions,

register promotion, global value numbering based redundancy elimination, strength reduction, … Instruction scheduling …

2006/4/12 CTHPC 2006 - P.C.Yew 10

Main Compiler Components in Main Compiler Components in the Frameworkthe Framework

Efficient alias and data dependence profiling tools, or heuristic algorithmsAnnotating profiled information in compilerSpeculative optimizationsRecovery code generation

2006/4/12 CTHPC 2006 - P.C.Yew 11

Alias and Data Dependence Alias and Data Dependence ProfilingProfiling

Instrumentation-based alias profilingInstrumentation-based data dependence profilingTechniques to reduce profiling overhead

• T.Chen et al, Data Dependence Profiling for Speculative Optimizations, Proc. Of Int’l Conf on Compiler Construction (CC), March 2004

• T.Chen et al, An Empirical Study on the Granularity of Pointer Analysis in C programs, Proc. 15th Workshop on Languages and Compilers for Parallel Computing (LCPC15), August 2002

2006/4/12 CTHPC 2006 - P.C.Yew 12

Crucial Considerations Crucial Considerations

Program coverage: 10/90 rule– uncovered regions => use compiler

analysis results or heuristic rulesInput sensitivityProfiling overhead (space and time)Using alias and data dependence profiles is inherently speculative => need hardware support for correct execution

2006/4/12 CTHPC 2006 - P.C.Yew 13

Alias Profiling vs. Static AnalysisAlias Profiling vs. Static Analysis

0%

50%

100%am

mp art

bzip

2cr

afty eo

neq

uake ga

pgc

cgz

ipm

esa

mcf

pa

rser

perlb

mk

twol

fvo

rtex

vpr

avg

occur > 5%

occur < 5%

never occurat runtime

Most possible aliases reported by compiler do not occur at runtime

2006/4/12 CTHPC 2006 - P.C.Yew 14

Data Dependence ProfilingData Dependence Profiling

Data dependence edges among memory references and function callsDetailed information– type: flow, anti, output, or input– probability: frequency of occurrence

When loops are targeted– dependence distance: limited

2006/4/12 CTHPC 2006 - P.C.Yew 15

Overhead of DD ProfilingOverhead of DD Profiling

96 110 102121120

0

20

40

60

80

bzip2 crafty gap gcc gzip mcf parser perlbmk twolf vortex vpr average

X tim

es s

low

er

alias DD without distance DD for innermost loops DD 4-level loops

Compiler: ORC version 2.0Machine: Itanium2, 900 MHz and 2G memoryBenchmarks: SPEC CPU2000 IntInstrumentation optimization has been done

2006/4/12 CTHPC 2006 - P.C.Yew 16

Techniques to Reduce Profiling Techniques to Reduce Profiling OverheadOverhead

Reduce the space and time requirements by hash table– Larger granularity of address– Smaller iteration counter

Sampling– Sample the snap shots of procedures or loops

instead of individual references– Use instrumentation-based sampling framework

Switch at procedures or loops

2006/4/12 CTHPC 2006 - P.C.Yew 17


2006/4/12 CTHPC 2006 - P.C.Yew 18

Speculating on Data DependenceSpeculating on Data Dependence

More speculative optimizations

I1: … = *qI2: *p = b

I3: … = *q

I4: *r = …

I5: … = *pI6: *r= …

Speculate on this dependence

Redundancy elimination opportunity

2006/4/12 CTHPC 2006 - P.C.Yew 19

Speculate on Data DependencesSpeculate on Data Dependences


I1: … = *qI2: *p = b

I3: … = *q

I4: *r = …

I5: … = *pI6: *r= …


Copy propagationopportunity

2006/4/12 CTHPC 2006 - P.C.Yew 20

Speculate on Data DependencesSpeculate on Data Dependences


I1: … = *qI2: *p = b

I3: … = *q

I4: *r = …

I5: … = *pI6: *r= …


Dead storeelimination opportunity

2006/4/12 CTHPC 2006 - P.C.Yew 21

Integrate Data Speculation in Traditional Integrate Data Speculation in Traditional OptimizationsOptimizations

Code Scheduling

…Redundancy Elimination

CopyPropagation

Dead StoreElimination

Code Scheduling

…Redundancy Elimination

CopyPropagation


Analysis Phase

Code Transformation Phase

Optimization extension to handle data dependence speculation

Traditional optimizations need to be patched to handle data speculation

2006/4/12 CTHPC 2006 - P.C.Yew 22

Example of Speculative SSA FormExample of Speculative SSA Form a1 = … *p1 = 4 a2← χ ( a1 ) b2← χ ( b1 ) v2← χ ( v1 ) … = a2 a3= 4 μ(a3), μ(b2), μ(v2) … = *p1

(a) traditional SSA graph

a1 = … *p1 = 4 a2← χ ( a1 ) b2← χs ( b1 ) v2← χ ( v1 ) … = a2 a3= 4 μ(a3), μ s(b2), μ(v2) … = *p1

(b) speculative SSA graph

The points-to set of pointer pobtained by alias profiling is {b}

2006/4/12 CTHPC 2006 - P.C.Yew 23

Improved Speculative Improved Speculative Optimizations FrameworkOptimizations Framework

Code Scheduling

… Redundancy Elimination

CopyPropagation


Code Scheduling … Redundancy

EliminationCopy

PropagationDead StoreElimination

Traditional Compiler Analysis Phase

Traditional Code Transformation Phase

Speculative Data Dependence Analysis

Data Speculative Code Motion

2006/4/12 CTHPC 2006 - P.C.Yew 24


2006/4/12 CTHPC 2006 - P.C.Yew 25

Compiler Optimizations for Compiler Optimizations for Speculative ThreadsSpeculative Threads

Without compiler optimization, there is limited TLP even under perfect hardware support. [OplingerPACT 99]Compiler have to decide– Which loops/regions to be transformed into thread– Use synchronization or speculation– How to schedule the code to improve overlaps– What transformations to be used – When/How to generate recovery code

2006/4/12 CTHPC 2006 - P.C.Yew 26

Compiler FrameworkCompiler Framework

Program Loop Selection

Thread Partitioningnon-loops

loops

loop threads

non-loop threads

Edge / DepProfiling

Instruction Scheduling

2006/4/12 CTHPC 2006 - P.C.Yew 27

Loop SelectionLoop Selection

-40%-20%

0%20%40%60%80%

100%

mcfcra

fty

twolf gz

ipbz

ip2vo

rtex

vpr

parse

r

gap

gcc

perlb

mk

Outer loop Inner loop Best

prog

ram

spe

edup

Carefully selected loops can improve performance significantly!

2006/4/12 CTHPC 2006 - P.C.Yew 28

ExampleExample

A

B C

D E

GF

H

I

A two-level nested loop

2006/4/12 CTHPC 2006 - P.C.Yew 29

Example: Loop SelectionExample: Loop Selection

A

B C

D E

GF

H

I

Generate loop threads for the selected inner loops

2006/4/12 CTHPC 2006 - P.C.Yew 30

ExampleExample

Partition the remaining part

A

B C

D E

GF

H

I

2006/4/12 CTHPC 2006 - P.C.Yew 31

ExampleExample

Further partitioning

A

B C

D E

GF

H

I

2006/4/12 CTHPC 2006 - P.C.Yew 32

ExampleExample

A

B C

D E

GF

H

I

A

C

D

G

C

E

G

C

D

F

H

I

Sequential execution

A C

D

G

C

E

G

C

D

F

H

I

Parallel execution

2006/4/12 CTHPC 2006 - P.C.Yew 33

Thread Partition AlgorithmThread Partition Algorithm

Bi_Partition(T, T1, T2) {P = Form_Path(T);for each BB b on the path P {

benefit = Perf_Estimation(P, b);if (benefit > max_benefit) {

max_benefit = benefit;T1 = all BBs reachable from b;T2 = all remaining BBs in T;

}}

}

Bi_Partition(T, T1, T2) {P = Form_Path(T);for each BB b on the path P {

benefit = Perf_Estimation(P, b);if (benefit > max_benefit) {

max_benefit = benefit;T1 = all BBs reachable from b;T2 = all remaining BBs in T;

}}

}

Thread_Partition(Procedure P) {Form_Loop_Tree(P);for each unselected loop L in the loop

tree {Initialize_Thread_Queue(L);while (!Thread_Queue_Empty()) {

T = Thread_Queue_Pop();if (Bi_Partition(T, T1, T2)) {

Thread_Queue_Push(T1);Thread_Queue_Push(T2);

}}

} // bottom-up traversal}

Thread_Partition(Procedure P) {Form_Loop_Tree(P);for each unselected loop L in the loop

tree {Initialize_Thread_Queue(L);while (!Thread_Queue_Empty()) {

T = Thread_Queue_Pop();if (Bi_Partition(T, T1, T2)) {

Thread_Queue_Push(T1);Thread_Queue_Push(T2);

}}

} // bottom-up traversal}

2006/4/12 CTHPC 2006 - P.C.Yew 34

Speculative Code MotionSpeculative Code Motion

*p = *p =

*p = = *p

= *p = *p

*p =

*p =

*p =

= *p

= *p

= *p

stall

critical path

other computation

before code motion after code motion

2006/4/12 CTHPC 2006 - P.C.Yew 35

Recovery Code GenerationRecovery Code Generation

Representation of recovery code is crucialIt should not affect the later compiler optimizationsRecovery code could be represented as a heavily-biased If-Then-Else structureGeneration of recovery code is rather straightforward with such a representation

2006/4/12 CTHPC 2006 - P.C.Yew 36

ConclusionsConclusionsMicroprocessors have caught up with supercomputers in ’90 and have gone multi-coreIt is non-trivial to apply current supercomputing technologies to general-purpose applicationsNew architectural support such as thread-level speculative execution, and new compiler techniques such as speculative optimizations using alias and data dependence profiling, even dynamic optimization at runtime, are crucial – as alwaysA new exciting era for parallel processing has arrived – regardless of we are prepared or not!

2006/4/12 CTHPC 2006 - P.C.Yew 37

ReferencesReferences(1) J.Lin et al, A Compiler Framework for Speculative Analysis and Optimizations,

Proc. Of ACM/SIGPLAN Conf. On Programming Language Design and Implementation (PLDI), June 2003, also in ACM Trans. On Architecture and Code Optimization (TACO), Vol. 1, No. 3, Sept. 2004, pp. 247-271

(2) J. Lin et al, Recovery Code Generation for General Speculative Optimizations, to appear in ACM Trans. On Architecture and Code Optimization (TACO) 2006.

(3) X.Dai et al, A General Compiler Framework for Speculative Optimizations UsingData Speculative Code Motion, Proc. Of the 3rd Annual IEEE/ACM Int’l Symp. On Code Generation and Optimization (CGO)

(4) T.Chen et al, Data Dependence Profiling for Speculative Optimizations, Proc. Of Int’l Conf on Compiler Construction (CC), March 2004

(5) T.Chen et al, An Empirical Study on the Granularity of Pointer Analysis in C programs, Proc. 15th Workshop on Languages and Compilers for Parallel Computing (LCPC), August 2002

(6) J.Y.Tsai et al, The Superthreaded Processor Architecture, IEEE Trans on Computers, special issue on Multithreaded Architecture, Vol. 48, No. 9, Sept 1999

2006/4/12 CTHPC 2006 - P.C.Yew 38

Supplement Slides

2006/4/12 CTHPC 2006 - P.C.Yew 39

Control SpeculationControl Speculation

ld.s: move the load operation across the barrier imposed by control dependency (branch instruction)Check the speculation with chk.s

other instructionsbr

ld

control dep

barrier

ld.sbr

chk.suse

control dep

barrier

traditional optimizations control speculation

2006/4/12 CTHPC 2006 - P.C.Yew 40

Control Speculation Control Speculation ––Instruction SchedulingInstruction Scheduling

ld.s r31 = [r30] if (c){ chk.s r31, recovery next: …. } recovery: ld r31 = [r30] br next

cmp c

ld r31 = [r30]

true, probability 90%

false

….

bz c

2006/4/12 CTHPC 2006 - P.C.Yew 41

Data Speculation:Data Speculation:Advance Load Address Table (ALAT)Advance Load Address Table (ALAT)

Reg No. Address Valid

r18 &a valid

2006/4/12 CTHPC 2006 - P.C.Yew 42

Data Speculation: ExampleData Speculation: Example

… = a*q = ..… = a

Original program

ld r18=[a]store [*q] = ld r18=[a]

Traditional compiler code

• *q and a are possible aliases

• a is reloaded after the store of *q

2006/4/12 CTHPC 2006 - P.C.Yew 43

Data Speculation: ExampleData Speculation: Example

ld.a r18 = [a]

st [*q] =

ld.c r18 = [a]

reg addr valid

r18 &a valid

2006/4/12 CTHPC 2006 - P.C.Yew 44

Instruction Cache

Data Cache

Thread processing unit

Execution Unit

Comm. Unit

Memory Buffer

Writeback Unit


Execution Unit

Comm. Unit

Memory Buffer

Writeback Unit


Execution Unit

Comm. Unit

Memory Buffer

Writeback Unit


Execution Unit

Comm. Unit

Memory Buffer

Writeback Unit

2006/4/12 CTHPC 2006 - P.C.Yew 45

Performance Improvement of Performance Improvement of Speculative Register PromotionSpeculative Register Promotion

0%2%4%6%8%

10%12%14%16%

ammp art

equa

ke

bzip2 gz

ip mcf

parse

r

twolf

Impr

ovem

ent p

erce

ntag

e

cpu cycle data access cycle loads retired

Based on alias profile and compared with –O3 with type-based alias analysis on Intel ORC compiler

2006/4/12 CTHPC 2006 - P.C.Yew 46

7

TLP for Each Benchmark

Potential Speedup of Whole Program by TLP

0

1

2

3

4

5

6

7

bzip2 gap gzip mcf parser vortex twolf vpr

A Compiler Framework to Support Speculative Multi-Core...

Documents

Transcript of A Compiler Framework to Support Speculative Multi-Core...