A Compiler Framework to Support Speculative Multi-Core...
Transcript of A Compiler Framework to Support Speculative Multi-Core...
A Compiler Framework to Support A Compiler Framework to Support Speculative MultiSpeculative Multi--Core ProcessorsCore Processorsfor Generalfor General--Purpose ApplicationsPurpose Applications
Pen-Chung Yew 游本中
Department of Computer Science and EngineeringUniversity of Minnesota
http://www.cs.umn.edu/Agassiz
2006/4/12 CTHPC 2006 - P.C.Yew 2
OutlineOutlineSpeculative multi-core processors and general-purpose applicationsA compiler framework to support speculative execution and optimizationsSpeculative optimizations for single-core processorsSpeculative optimizations for multi-core processorsConclusions
2006/4/12 CTHPC 2006 - P.C.Yew 3
MultiMulti--Core Processors on Technology Core Processors on Technology Road Map Road Map
Multi-core processors on Intel’s roadmap SMPs have been around for a long time. What is new for multi-core?Why general-purpose applications now?Use thread-level parallelism (TLP) to improve instruction-level parallelism (ILP)
2006/4/12 CTHPC 2006 - P.C.Yew 4
Use TLP to Support ILPUse TLP to Support ILP
Time
op1 op2 op3 op4
op5 op6 op7 op8
op9 op10 op11 op12
……………..
op21 op22 op23 op24
t1
t2
t3
t6
Time
t1
t2
t3
t6
op1 op2 op3 op4
op5 op6 op7 op8
op9 op10 op11 op12
…………
op21 op22 op23 op24
Th1 Th2 Th3 Th4
Multi-Core
TLPSuperscalar
ILP
2006/4/12 CTHPC 2006 - P.C.Yew 5
TLP Challenges in GeneralTLP Challenges in General--Purpose ApplicationsPurpose Applications
Mostly Do-while loops– Need thread-level control speculation
Parallelism exists mostly in outer loops – Not good for VLIW (i.e. software pipelining ) or vector
processing => need thread-level supportPointers complicate alias and data dependence analysis– Need runtime support for disambiguation and data speculation
Many small loops and doacross loops– Need fast and low overhead communication
Small basic blocks – need to exploit both ILP and TLPNeed new approaches to apply parallel processing
for such applications!!
2006/4/12 CTHPC 2006 - P.C.Yew 6
Speculation:Speculation:Breaking Program Dependency Breaking Program Dependency
Speculation is an effective approach to break dependences
Optimize program execution by ignoring infrequent data dependence edges, or taking predicted pathsCheck possible violation (mis-speculation) at runtimeRecover if violation occurs
2006/4/12 CTHPC 2006 - P.C.Yew 7
Speculation on Intel IA64Speculation on Intel IA64
Both control and data speculation are supported on Intel IA64– Special instructions and hardware are provided– ld.s, ld.a, ld.sa and ld.c, chk.a, chk.s
Memory load operation is targeted for speculation– Memory delay is usually the bottleneck of performance– Memory load is usually the start of speculative operations
2006/4/12 CTHPC 2006 - P.C.Yew 8
OutlineOutlineSpeculative multi-core processors and general-purpose applicationsA compiler framework to support speculative execution and optimizationsSpeculative optimizations for single threadSpeculative optimizations for multi-threaded processorsConclusions
2006/4/12 CTHPC 2006 - P.C.Yew 9
A Compiler Framework:A Compiler Framework:Intel Open Research Compiler (ORC)Intel Open Research Compiler (ORC)
Speculative alias and dataflow analysis
Control flow graph (control speculation)
Speculative use-def chain/ SSA form(data speculation)
Control flow analysis
Edge/path profile
Heuristic rules
Alias /dependence profile
Heuristic rules
Speculative optimizations PRE based optimizations: PRE for expressions,
register promotion, global value numbering based redundancy elimination, strength reduction, … Instruction scheduling …
2006/4/12 CTHPC 2006 - P.C.Yew 10
Main Compiler Components in Main Compiler Components in the Frameworkthe Framework
Efficient alias and data dependence profiling tools, or heuristic algorithmsAnnotating profiled information in compilerSpeculative optimizationsRecovery code generation
2006/4/12 CTHPC 2006 - P.C.Yew 11
Alias and Data Dependence Alias and Data Dependence ProfilingProfiling
Instrumentation-based alias profilingInstrumentation-based data dependence profilingTechniques to reduce profiling overhead
• T.Chen et al, Data Dependence Profiling for Speculative Optimizations, Proc. Of Int’l Conf on Compiler Construction (CC), March 2004
• T.Chen et al, An Empirical Study on the Granularity of Pointer Analysis in C programs, Proc. 15th Workshop on Languages and Compilers for Parallel Computing (LCPC15), August 2002
2006/4/12 CTHPC 2006 - P.C.Yew 12
Crucial Considerations Crucial Considerations
Program coverage: 10/90 rule– uncovered regions => use compiler
analysis results or heuristic rulesInput sensitivityProfiling overhead (space and time)Using alias and data dependence profiles is inherently speculative => need hardware support for correct execution
2006/4/12 CTHPC 2006 - P.C.Yew 13
Alias Profiling vs. Static AnalysisAlias Profiling vs. Static Analysis
0%
50%
100%am
mp art
bzip
2cr
afty eo
neq
uake ga
pgc
cgz
ipm
esa
mcf
pa
rser
perlb
mk
twol
fvo
rtex
vpr
avg
occur > 5%
occur < 5%
never occurat runtime
Most possible aliases reported by compiler do not occur at runtime
2006/4/12 CTHPC 2006 - P.C.Yew 14
Data Dependence ProfilingData Dependence Profiling
Data dependence edges among memory references and function callsDetailed information– type: flow, anti, output, or input– probability: frequency of occurrence
When loops are targeted– dependence distance: limited
2006/4/12 CTHPC 2006 - P.C.Yew 15
Overhead of DD ProfilingOverhead of DD Profiling
96 110 102121120
0
20
40
60
80
bzip2 crafty gap gcc gzip mcf parser perlbmk twolf vortex vpr average
X tim
es s
low
er
alias DD without distance DD for innermost loops DD 4-level loops
Compiler: ORC version 2.0Machine: Itanium2, 900 MHz and 2G memoryBenchmarks: SPEC CPU2000 IntInstrumentation optimization has been done
2006/4/12 CTHPC 2006 - P.C.Yew 16
Techniques to Reduce Profiling Techniques to Reduce Profiling OverheadOverhead
Reduce the space and time requirements by hash table– Larger granularity of address– Smaller iteration counter
Sampling– Sample the snap shots of procedures or loops
instead of individual references– Use instrumentation-based sampling framework
Switch at procedures or loops
2006/4/12 CTHPC 2006 - P.C.Yew 17
OutlineOutlineSpeculative multi-core processors and general-purpose applicationsA compiler framework to support speculative execution and optimizationsSpeculative optimizations for single threadSpeculative optimizations for multi-threaded processorsConclusions
2006/4/12 CTHPC 2006 - P.C.Yew 18
Speculating on Data DependenceSpeculating on Data Dependence
More speculative optimizations
I1: … = *qI2: *p = b
I3: … = *q
I4: *r = …
I5: … = *pI6: *r= …
Speculate on this dependence
Redundancy elimination opportunity
2006/4/12 CTHPC 2006 - P.C.Yew 19
Speculate on Data DependencesSpeculate on Data Dependences
More speculative optimizations
I1: … = *qI2: *p = b
I3: … = *q
I4: *r = …
I5: … = *pI6: *r= …
Speculate on this dependence
Copy propagationopportunity
2006/4/12 CTHPC 2006 - P.C.Yew 20
Speculate on Data DependencesSpeculate on Data Dependences
More speculative optimizations
I1: … = *qI2: *p = b
I3: … = *q
I4: *r = …
I5: … = *pI6: *r= …
Speculate on this dependence
Dead storeelimination opportunity
2006/4/12 CTHPC 2006 - P.C.Yew 21
Integrate Data Speculation in Traditional Integrate Data Speculation in Traditional OptimizationsOptimizations
Code Scheduling
…Redundancy Elimination
CopyPropagation
Dead StoreElimination
Code Scheduling
…Redundancy Elimination
CopyPropagation
Dead StoreElimination
Analysis Phase
Code Transformation Phase
Optimization extension to handle data dependence speculation
Traditional optimizations need to be patched to handle data speculation
2006/4/12 CTHPC 2006 - P.C.Yew 22
Example of Speculative SSA FormExample of Speculative SSA Form a1 = … *p1 = 4 a2← χ ( a1 ) b2← χ ( b1 ) v2← χ ( v1 ) … = a2 a3= 4 μ(a3), μ(b2), μ(v2) … = *p1
(a) traditional SSA graph
a1 = … *p1 = 4 a2← χ ( a1 ) b2← χs ( b1 ) v2← χ ( v1 ) … = a2 a3= 4 μ(a3), μ s(b2), μ(v2) … = *p1
(b) speculative SSA graph
The points-to set of pointer pobtained by alias profiling is {b}
2006/4/12 CTHPC 2006 - P.C.Yew 23
Improved Speculative Improved Speculative Optimizations FrameworkOptimizations Framework
Code Scheduling
… Redundancy Elimination
CopyPropagation
Dead StoreElimination
Code Scheduling … Redundancy
EliminationCopy
PropagationDead StoreElimination
Traditional Compiler Analysis Phase
Traditional Code Transformation Phase
Speculative Data Dependence Analysis
Data Speculative Code Motion
2006/4/12 CTHPC 2006 - P.C.Yew 24
OutlineOutlineSpeculative multi-core processors and general-purpose applicationsA compiler framework to support speculative execution and optimizationsSpeculative optimizations for single threadSpeculative optimizations for multi-threaded processorsConclusions
2006/4/12 CTHPC 2006 - P.C.Yew 25
Compiler Optimizations for Compiler Optimizations for Speculative ThreadsSpeculative Threads
Without compiler optimization, there is limited TLP even under perfect hardware support. [OplingerPACT 99]Compiler have to decide– Which loops/regions to be transformed into thread– Use synchronization or speculation– How to schedule the code to improve overlaps– What transformations to be used – When/How to generate recovery code
2006/4/12 CTHPC 2006 - P.C.Yew 26
Compiler FrameworkCompiler Framework
Program Loop Selection
Thread Partitioningnon-loops
loops
loop threads
non-loop threads
Edge / DepProfiling
Instruction Scheduling
2006/4/12 CTHPC 2006 - P.C.Yew 27
Loop SelectionLoop Selection
-40%-20%
0%20%40%60%80%
100%
mcfcra
fty
twolf gz
ipbz
ip2vo
rtex
vpr
parse
r
gap
gcc
perlb
mk
Outer loop Inner loop Best
prog
ram
spe
edup
Carefully selected loops can improve performance significantly!
2006/4/12 CTHPC 2006 - P.C.Yew 28
ExampleExample
A
B C
D E
GF
H
I
A two-level nested loop
2006/4/12 CTHPC 2006 - P.C.Yew 29
Example: Loop SelectionExample: Loop Selection
A
B C
D E
GF
H
I
Generate loop threads for the selected inner loops
2006/4/12 CTHPC 2006 - P.C.Yew 30
ExampleExample
Partition the remaining part
A
B C
D E
GF
H
I
2006/4/12 CTHPC 2006 - P.C.Yew 31
ExampleExample
Further partitioning
A
B C
D E
GF
H
I
2006/4/12 CTHPC 2006 - P.C.Yew 32
ExampleExample
A
B C
D E
GF
H
I
A
C
D
G
C
E
G
C
D
F
H
I
Sequential execution
A C
D
G
C
E
G
C
D
F
H
I
Parallel execution
2006/4/12 CTHPC 2006 - P.C.Yew 33
Thread Partition AlgorithmThread Partition Algorithm
Bi_Partition(T, T1, T2) {P = Form_Path(T);for each BB b on the path P {
benefit = Perf_Estimation(P, b);if (benefit > max_benefit) {
max_benefit = benefit;T1 = all BBs reachable from b;T2 = all remaining BBs in T;
}}
}
Bi_Partition(T, T1, T2) {P = Form_Path(T);for each BB b on the path P {
benefit = Perf_Estimation(P, b);if (benefit > max_benefit) {
max_benefit = benefit;T1 = all BBs reachable from b;T2 = all remaining BBs in T;
}}
}
Thread_Partition(Procedure P) {Form_Loop_Tree(P);for each unselected loop L in the loop
tree {Initialize_Thread_Queue(L);while (!Thread_Queue_Empty()) {
T = Thread_Queue_Pop();if (Bi_Partition(T, T1, T2)) {
Thread_Queue_Push(T1);Thread_Queue_Push(T2);
}}
} // bottom-up traversal}
Thread_Partition(Procedure P) {Form_Loop_Tree(P);for each unselected loop L in the loop
tree {Initialize_Thread_Queue(L);while (!Thread_Queue_Empty()) {
T = Thread_Queue_Pop();if (Bi_Partition(T, T1, T2)) {
Thread_Queue_Push(T1);Thread_Queue_Push(T2);
}}
} // bottom-up traversal}
2006/4/12 CTHPC 2006 - P.C.Yew 34
Speculative Code MotionSpeculative Code Motion
*p = *p =
*p = = *p
= *p = *p
*p =
*p =
*p =
= *p
= *p
= *p
stall
critical path
other computation
before code motion after code motion
2006/4/12 CTHPC 2006 - P.C.Yew 35
Recovery Code GenerationRecovery Code Generation
Representation of recovery code is crucialIt should not affect the later compiler optimizationsRecovery code could be represented as a heavily-biased If-Then-Else structureGeneration of recovery code is rather straightforward with such a representation
2006/4/12 CTHPC 2006 - P.C.Yew 36
ConclusionsConclusionsMicroprocessors have caught up with supercomputers in ’90 and have gone multi-coreIt is non-trivial to apply current supercomputing technologies to general-purpose applicationsNew architectural support such as thread-level speculative execution, and new compiler techniques such as speculative optimizations using alias and data dependence profiling, even dynamic optimization at runtime, are crucial – as alwaysA new exciting era for parallel processing has arrived – regardless of we are prepared or not!
2006/4/12 CTHPC 2006 - P.C.Yew 37
ReferencesReferences(1) J.Lin et al, A Compiler Framework for Speculative Analysis and Optimizations,
Proc. Of ACM/SIGPLAN Conf. On Programming Language Design and Implementation (PLDI), June 2003, also in ACM Trans. On Architecture and Code Optimization (TACO), Vol. 1, No. 3, Sept. 2004, pp. 247-271
(2) J. Lin et al, Recovery Code Generation for General Speculative Optimizations, to appear in ACM Trans. On Architecture and Code Optimization (TACO) 2006.
(3) X.Dai et al, A General Compiler Framework for Speculative Optimizations UsingData Speculative Code Motion, Proc. Of the 3rd Annual IEEE/ACM Int’l Symp. On Code Generation and Optimization (CGO)
(4) T.Chen et al, Data Dependence Profiling for Speculative Optimizations, Proc. Of Int’l Conf on Compiler Construction (CC), March 2004
(5) T.Chen et al, An Empirical Study on the Granularity of Pointer Analysis in C programs, Proc. 15th Workshop on Languages and Compilers for Parallel Computing (LCPC), August 2002
(6) J.Y.Tsai et al, The Superthreaded Processor Architecture, IEEE Trans on Computers, special issue on Multithreaded Architecture, Vol. 48, No. 9, Sept 1999
2006/4/12 CTHPC 2006 - P.C.Yew 38
Supplement Slides
2006/4/12 CTHPC 2006 - P.C.Yew 39
Control SpeculationControl Speculation
ld.s: move the load operation across the barrier imposed by control dependency (branch instruction)Check the speculation with chk.s
other instructionsbr
ld
control dep
barrier
ld.sbr
chk.suse
control dep
barrier
traditional optimizations control speculation
2006/4/12 CTHPC 2006 - P.C.Yew 40
Control Speculation Control Speculation ––Instruction SchedulingInstruction Scheduling
ld.s r31 = [r30] if (c){ chk.s r31, recovery next: …. } recovery: ld r31 = [r30] br next
cmp c
ld r31 = [r30]
true, probability 90%
false
….
bz c
2006/4/12 CTHPC 2006 - P.C.Yew 41
Data Speculation:Data Speculation:Advance Load Address Table (ALAT)Advance Load Address Table (ALAT)
Reg No. Address Valid
r18 &a valid
2006/4/12 CTHPC 2006 - P.C.Yew 42
Data Speculation: ExampleData Speculation: Example
… = a*q = ..… = a
Original program
ld r18=[a]store [*q] = ld r18=[a]
Traditional compiler code
• *q and a are possible aliases
• a is reloaded after the store of *q
2006/4/12 CTHPC 2006 - P.C.Yew 43
Data Speculation: ExampleData Speculation: Example
ld.a r18 = [a]
st [*q] =
ld.c r18 = [a]
reg addr valid
r18 &a valid
2006/4/12 CTHPC 2006 - P.C.Yew 44
Instruction Cache
Data Cache
Thread processing unit
Execution Unit
Comm. Unit
Memory Buffer
Writeback Unit
Thread processing unit
Execution Unit
Comm. Unit
Memory Buffer
Writeback Unit
Thread processing unit
Execution Unit
Comm. Unit
Memory Buffer
Writeback Unit
Thread processing unit
Execution Unit
Comm. Unit
Memory Buffer
Writeback Unit
2006/4/12 CTHPC 2006 - P.C.Yew 45
Performance Improvement of Performance Improvement of Speculative Register PromotionSpeculative Register Promotion
0%2%4%6%8%
10%12%14%16%
ammp art
equa
ke
bzip2 gz
ip mcf
parse
r
twolf
Impr
ovem
ent p
erce
ntag
e
cpu cycle data access cycle loads retired
Based on alias profile and compared with –O3 with type-based alias analysis on Intel ORC compiler
2006/4/12 CTHPC 2006 - P.C.Yew 46
7
TLP for Each Benchmark
Potential Speedup of Whole Program by TLP
0
1
2
3
4
5
6
7
bzip2 gap gzip mcf parser vortex twolf vpr