POSH: A TLS Compiler that Exploits Program Structure
Wei Liu, James Tuck, Luis Ceze, Wonsun Ahn, Karin Strauss, Jose Renau† and Josep Torrellas
Department of Computer Science
University of Illinois at Urbana-Champaign
http://iacoma.cs.uiuc.edu
†Computer Engineering Department
University of California, Santa Cruz
POSH: A TLS Compiler that Exploits Program Structure Wei Liu et al, University of Illinois 2
CMPs are available now!
Chip multiprocessors (CMPs) have arrived
• Pentium D, Opteron, Power5, Niagara, …
Good speedups for parallel programs
How to speed up single-threaded applications?
• Especially the hard-to-parallelize applications, such as SPECint
Thread-Level Speculation (TLS)
What is Thread-Level Speculation (TLS)?
Execute potentially dependent tasks in parallel
• Assume no dependence across tasks will be violated
TLS hardware
• Track memory accesses; buffer unsafe state
• Detect any violation (RAW)
• Squash offending tasks, repair polluted state, restart tasks

for (i = 0; i < n; i++) {
    … = A[B[i]] …
    …
    A[C[i]] = …
}

Iteration J:   … = A[4] …   A[5] = …
Iteration J+1: … = A[2] …   A[2] = …
Iteration J+2: … = A[5] …   A[6] = …
RAW: iteration J+2 reads A[5], which iteration J writes
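The cross-iteration dependence in this loop can also be checked in software. The sketch below is only an illustration, not the TLS hardware mechanism: it assumes each iteration is one task that reads A[B[i]] and later writes A[C[i]], and that a speculative read may happen before the write of any earlier iteration still in flight. The function name `count_raw_hazards` and the `window` parameter (how many consecutive iterations are assumed to run concurrently) are hypothetical.

```c
#include <stddef.h>

/* Count iterations whose read A[B[i]] targets an element that an earlier
 * iteration inside the same speculation window writes (A[C[j]], j < i).
 * Each such iteration is a candidate for a RAW violation and a squash.
 * `window` is the assumed number of consecutive iterations running in
 * parallel (hypothetical parameter; must be >= 1). */
size_t count_raw_hazards(const int *B, const int *C, size_t n, size_t window)
{
    size_t hazards = 0;
    for (size_t i = 0; i < n; i++) {
        /* Oldest iteration that can still be in flight alongside i. */
        size_t first = (i + 1 > window) ? i + 1 - window : 0;
        for (size_t j = first; j < i; j++) {
            if (C[j] == B[i]) {   /* earlier write, later read: RAW */
                hazards++;
                break;            /* count each reader at most once */
            }
        }
    }
    return hazards;
}
```

With the indices from the slide (B = {4, 2, 5}, C = {5, 2, 6}) and a window of 4, the function reports one hazard: iteration J+2 reads A[5], which iteration J writes. With a window of 2, iterations J and J+2 never overlap and no hazard is reported.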
Novel Compiler and Architecture Optimizations for TLS Wei Liu, University of Illinois 4
Key to the Acceptance of TLS
Fully automated TLS compilers
• Do not need to prove the independence between tasks
• Crucial impact on performance
  • How to break the code into tasks
  • When to spawn the tasks
Contributions
POSH, a fully automated TLS compiler infrastructure
• Exploits code structure (loop iterations and subroutines of any nesting level) for tasks
• Uses a simple profiling pass to discard ineffective tasks
• Leverages both parallelism and data prefetching
Speedup: 1.30 on average for SPECint 2000
Detailed characterization of speedup sources and task behavior
• 26% of the speedup is from prefetching (i.e., 8% w.r.t. the baseline)
Outline
• Background of TLS
• Contributions of POSH
• POSH: A TLS Compiler
• Experimental Results
• Conclusions
Flowchart of the POSH Framework
[Figure: flowchart. Compiler passes in gcc-3.5: Task Selection, Value Prediction, Spawn Hoisting, Refinement. Other boxes: Generate Binary Code, Profiler. Other labels: Program Structure, Dependence Restriction, Placement, Task Size, Profiling Feedback, Parallelism.]
Hardware Assumptions
No special hardware for register communication
• Dependence detection through memory
Support spawn and commit instructions
Task Selection
Break the sequential program into tasks
Select the following structures
• Subroutines
• Subroutine continuations
• Loop iterations
• Loop continuations
• Continuation == code region that follows a subroutine or a loop
Mark the begin point for each task
[Figure: code regions A, B, C, and D, each marked with its begin point (Begin point A through Begin point D)]
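As a rough illustration of the four structures listed on this slide, the small C program below is annotated with the places where a structure-based task selector could mark begin points. The program and its function names are hypothetical, and the comments show an assumed reading of the slide, not output from POSH.

```c
/* Hypothetical program; comments mark where a structure-based task
 * selector could place begin points, following the four structures
 * named on the slide. */
static int square(int x)          /* begin point: subroutine task */
{
    return x * x;
}

int sum_of_squares(int n)
{
    int total = 0;
    for (int i = 0; i < n; i++) { /* begin point: loop-iteration task */
        total += square(i);
        /* begin point: subroutine continuation (code after the call) */
    }
    /* begin point: loop continuation (code after the loop) */
    return total;
}
```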
Value Prediction
Predict the values of variables that cross task boundaries
• Reduce violations
Software value predictor
• Function return variables
• Loop induction variables
• Induction-like variables

for (i = 0; i < 100; i++) {
    …
    if (result == 0) {
        j++; /* induction-like variable */
    }
    …
}
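The slides do not describe how the software value predictor is implemented. As a minimal sketch of the general idea, the code below shows a generic stride predictor for a single variable that crosses task boundaries; the type `stride_predictor` and both function names are assumptions made for illustration. A loop induction variable such as `i` is always predicted correctly, while an induction-like variable such as `j` is mispredicted whenever its update pattern changes.

```c
#include <stdbool.h>

/* Minimal stride predictor for one variable that crosses task boundaries.
 * All names are hypothetical. */
typedef struct {
    int last;    /* value observed at the most recent task boundary */
    int stride;  /* difference between the last two observed values */
} stride_predictor;

/* Value a successor task would assume when it is spawned. */
int predict_next(const stride_predictor *p)
{
    return p->last + p->stride;
}

/* Called once the predecessor has produced the real value.
 * Returns true if the prediction was correct; false corresponds to a
 * misprediction, i.e. the successor consumed a wrong value. */
bool verify_and_update(stride_predictor *p, int actual)
{
    bool correct = (predict_next(p) == actual);
    p->stride = actual - p->last;
    p->last = actual;
    return correct;
}
```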
Spawn Hoisting
The spawn instruction is hoisted relative to the beginning of a task
Restrictions on hoisting (see paper for more details)
• Definition of any variables in the task
• Control-safe location
• Reverse sequential order for multiple spawn instructions
[Figure: tasks A and B, with the "Spawn B" instruction placed in A]
Refinement
[Figure: the flowchart of the POSH framework is repeated on this slide]
POSH Profiler
Runs a TLS binary with tasks selected by the compiler
• Uses the train input set
Executes the TLS binary sequentially
• No TLS architectural support assumed, for generality
• With rudimentary timing, it can collect information about potential run-time violations
Provides a simple L2 cache model to estimate misses
Not tied to a fixed number of processors
Profiling Pass
Goal: minimize execution time
Preserve parallelism
• Find tasks with the best overlap
Reward prefetching due to task squashes
• Keep squashed tasks that prefetch even with small or no overlap
Benefit = Overlap + Prefetching
Estimate Overlap
[Figure: timeline of the profiler execution, in which Task 1 spawns Task 2. Marked times: TSpawn, TSpawn+Toverhead, TBegin, TEnd, TStore (ST X), and TLoad (LD X). One region is labeled "Profitable Overlap".]
TLoad < TStore !!!
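The slide names the time points but not the arithmetic that combines them. The sketch below is one plausible model, stated as an assumption: in the sequential profiler run Task 2 starts at TBegin, whereas in a TLS run it could start at TSpawn + Toverhead, so every event in Task 2 moves earlier by the difference. The struct `task_times` and both function names are hypothetical.

```c
#include <stdbool.h>

/* Times from one sequential profiler run, in cycles. Field names follow
 * the slide; the arithmetic below is an assumed model, not POSH's. */
typedef struct {
    long t_spawn;     /* Task 1 reaches the spawn of Task 2  */
    long t_overhead;  /* assumed cost of spawning a task     */
    long t_begin;     /* Task 2 begins in the sequential run */
    long t_end;       /* Task 2 ends in the sequential run   */
} task_times;

/* How far Task 2 would move earlier if it were spawned in parallel. */
static long start_shift(const task_times *t)
{
    long shift = t->t_begin - (t->t_spawn + t->t_overhead);
    return shift > 0 ? shift : 0;   /* overhead can cancel the gain */
}

/* Estimated overlap: the shift, capped by the length of Task 2. */
long estimated_overlap(const task_times *t)
{
    long shift = start_shift(t);
    long length = t->t_end - t->t_begin;
    return shift < length ? shift : length;
}

/* A load in Task 2 seen at t_load_seq in the sequential run would move
 * earlier by the same shift. If it then precedes the store in Task 1
 * (TLoad < TStore), a TLS run would see a RAW violation. */
bool would_violate(const task_times *t, long t_load_seq, long t_store)
{
    long t_load = t_load_seq - start_shift(t);
    return t_load < t_store;
}
```

For example, with TSpawn = 100, Toverhead = 20, TBegin = 400, and TEnd = 900, the estimated overlap is 280 cycles, and a load seen at cycle 450 in the sequential run would move to cycle 170, ahead of a store at cycle 300.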
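Estimate Prefetching
[Figure: timeline of the profiler execution, in which Task 1 spawns Task 2. Memory operations shown: ST X, LD Y, and LD X, with LD Y and LD X appearing a second time. One region is labeled "Profitable Prefetching".]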
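One way to picture the prefetching estimate is to replay a task's loads twice through a small cache model: once for the squashed attempt and once for the re-execution. The sketch below does this with a tiny direct-mapped cache and then combines the result with the overlap as Benefit = Overlap + Prefetching. The cache geometry, the miss penalty, and every name in the code are assumptions for illustration; the slides only say that the profiler uses a simple L2 cache model.

```c
#include <stdbool.h>
#include <stddef.h>

#define SETS 64          /* assumed: tiny direct-mapped cache */
#define LINE_BYTES 64    /* assumed line size in bytes        */

typedef struct {
    unsigned long tag[SETS];
    bool valid[SETS];
} l2_model;

/* Access one address; returns true on a hit, fills the line on a miss. */
bool l2_access(l2_model *c, unsigned long addr)
{
    unsigned long line = addr / LINE_BYTES;
    size_t set = line % SETS;
    if (c->valid[set] && c->tag[set] == line)
        return true;
    c->valid[set] = true;
    c->tag[set] = line;
    return false;
}

/* Replay a task's loads twice: squashed attempt, then re-execution.
 * The net reduction in misses is taken as the number of lines the
 * squashed attempt prefetched (clamped at zero). */
size_t prefetched_loads(l2_model *c, const unsigned long *loads, size_t n)
{
    size_t first_misses = 0, second_misses = 0;
    for (size_t i = 0; i < n; i++)
        if (!l2_access(c, loads[i])) first_misses++;
    for (size_t i = 0; i < n; i++)
        if (!l2_access(c, loads[i])) second_misses++;
    return first_misses > second_misses ? first_misses - second_misses : 0;
}

/* Benefit = Overlap + Prefetching, both expressed in cycles. */
long task_benefit(long overlap_cycles, size_t prefetched, long miss_penalty)
{
    return overlap_cycles + (long)prefetched * miss_penalty;
}
```

Under this model a task with no overlap can still have a positive benefit, which matches the slide's point that squashed tasks that prefetch are worth keeping.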
Methodology
Simulated CMP architecture
• 4 GHz
• Private 16 KB L1 per core; 1 MB shared L2
Unmodified SPECint 2000 applications
Configurations
• TLS CMP: four 3-issue cores with TLS
• Serial: one 3-issue core
Task Characterization
Static information
• On average 32.2 Subr tasks
• On average 7.3 loops with Loop tasks
• About 6 live-ins and 6 live-outs per task
Dynamic information
• Average task size: 476 instructions
• Most tasks have 50-1000 dynamic instructions
• 45.7% of tasks commit
Application Speedups
Using both forms of tasks is simple and effective
TLS: an average speedup of 1.30!
[Figure: bar chart of speedup over Serial (y-axis 0 to 2) for bzip2, crafty, gap, gzip, mcf, parser, twolf, vortex, vpr, and G.mean. Series: Subr, Loop, Subr+Loop. Value label: 1.30.]
Contribution of Prefetching
[Figure: bar chart of speedup over Serial (y-axis 0 to 2) for bzip2, crafty, gap, gzip, mcf, parser, twolf, vortex, vpr, and G.mean. Series: TLS_NoPrefetch, TLS. Value labels: 1.30 and 1.22.]
About a quarter of the speedup is from prefetching
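The "about a quarter" figure can be reproduced from the two value labels on this slide. The sketch below assumes that 1.30 is the mean speedup for TLS and 1.22 the mean for TLS_NoPrefetch, and that the serial baseline is 1.0; the function name `prefetch_share` is hypothetical.

```c
/* Fraction of the speedup gain over serial (baseline 1.0) that is lost
 * when prefetching effects are excluded. */
double prefetch_share(double speedup_tls, double speedup_no_prefetch)
{
    return (speedup_tls - speedup_no_prefetch) / (speedup_tls - 1.0);
}
```

With the rounded values, (1.30 - 1.22) / (1.30 - 1.00) = 0.08 / 0.30, or roughly 27%. The numerator is the 8% relative to the baseline quoted on the Contributions slide; the 26% quoted there presumably comes from unrounded means.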
Effectiveness of Profiler
[Figure: bar chart of speedup over Serial (y-axis 0 to 2) for bzip2, crafty, gap, gzip, mcf, parser, twolf, vortex, vpr, and G.mean. Series: No Profiler, With Profiler. Value labels: 1.30 and 1.04.]
Profiling is needed for good performance
Profiler reduces the average number of tasks from 198 to 39
Conclusions
POSH: A fully automated TLS compiler
• Simple: Use program structure
• Effective: Prefetching + Parallelism
• Speedup: on average 1.30 for SPECint 2000