
Data-dependence Profiling to Enable Safe Thread Level Speculation

Arnamoy Bhattacharyya, University of Alberta, Edmonton, Alberta, Canada

[email protected]

José Nelson Amaral, University of Alberta, Edmonton, Alberta, Canada

[email protected]

Hal Finkel, Argonne National Laboratory

[email protected]

ABSTRACT

Data-dependence profiling is a technique that enables a compiler to judiciously decide when the execution of a loop — which the compiler could not prove to be dependence free — should be speculated through the use of Thread Level Speculation (TLS). The data collected by a data-dependence profiler can be used to predict whether may-dependencies reported by a compiler's static analysis are likely to materialize at runtime. A cost analysis can then be used to decide that some loops with a lower probability of dependence should be speculatively parallelized. This paper addresses the question of whether a loop's dependence behaviour changes when the input to the program changes — a study of 57 different benchmarks indicates that it usually does not. The paper then describes SpecEval, an automatic speculative parallelization framework that uses single-input data-dependence profiles to find speculation candidates in the SPEC2006 and PolyBench/C benchmarks. The paper also presents a performance evaluation of the TLS implementation in IBM's BlueGene/Q supercomputer and shows that the performance of TLS is affected by several factors, including the number of speculated loops, the execution-time coverage of speculated loops, the miss-speculation overhead, the L1 cache miss rate, and the effect on dynamic instruction path length.

1. INTRODUCTION

Existing auto-parallelizers must follow a conservative approach and generate sequential code for loops with potential dependencies. These auto-parallelization frameworks can only parallelize a loop when the compiler can prove, using compile-time and/or run-time techniques, that the parallel execution of the loop will not affect the correctness of the program. This constraint often restricts the maximum parallelism that can be extracted from loops. A compiler uses the result of a compile-time/run-time dependence analysis to decide whether to parallelize a loop so that all executions of the program are correct. Dependence analyses check whether the same address may be referenced (loaded from or stored into) by different iterations of a loop. If two different iterations of the loop access the same memory address, the loop contains a loop-carried dependence and it is not parallelized.

Copyright © 2015 Arnamoy Bhattacharyya, José Nelson Amaral, Hal Finkel. Permission to copy is hereby granted provided the original copyright notice is reproduced in copies made.

If a compiler cannot determine whether there will be a dependence at run-time, data-dependence profiling can be used to predict the actual occurrence of dependencies at run-time [5, 20]. Data-dependence profiling records the memory addresses accessed by possibly dependent load/store instructions for all may-dependencies. These run-time dependencies for a training run of the program are saved to a profile file, and the number of actual dependencies recorded — hopefully for multiple runs with different data inputs — is used to predict the materialization of such dependencies in future runs of the same program. A parallelization that relies on data-dependence profiling must execute the parallel code speculatively. The speculative execution provides a safe fall-back execution path for the runs in which a profile-based prediction of absence of dependence turns out to be incorrect. Until recently, there were limited options for this fall-back path. Examples include the fall-back code for advanced loads in the Intel IA64 architecture and software solutions based on a combination of compiler-generated code and runtime functionality. Hardware-supported thread-level speculation (TLS) offers an easier-to-implement and more efficient fall-back path [33].

Thus, two important questions are in front of us. (1) How effectively can the TLS support be used to improve the performance of standard benchmarks? (2) Do data-dependence profiling predictions vary with the set of inputs used for the profiling runs? Earlier studies had suggested that, for widely used sets of benchmarks, the dependence behaviour does not change with the data input [10, 15, 31]. This result is confirmed by our own study, described in this paper, and it is great news because it indicates that, for a wide class of applications, an accurate data-dependence prediction can be made based on a single profiling run that executes the loop of interest. A more recent study has identified benchmarks where the dependence behaviour does vary with changes to the data input to the benchmark [32]. This same study reveals that changes to dependence behaviour are often present but rare, thus strengthening the argument for the use of TLS as a performance-favourable choice. Earlier proposals to enable speculative execution of loop iterations based on data-dependency invariability did not provide a safety net to ensure that all possible runs of the program would produce the correct result [31]. One of the main contributions of this paper is to combine this simpler data-dependence profiling



prediction with TLS to provide a framework for speculative parallelization.

Using a single 16-core compute node of the TLS-enabled IBM BlueGene/Q supercomputer as an experimental platform, this paper makes the following contributions:

• A description of SpecEval, a new automatic speculative parallelization framework that combines LLVM with the IBM bgxlc_r compiler to evaluate the performance of TLS in the BG/Q. SpecEval uses a single-input data-dependence profile to find loops that are candidates for speculation in the SPEC2006 and PolyBench/C benchmarks.

• A detailed study, using 57 benchmarks, that confirms the previous claim that loop-dependence behaviour does not change based on program input.

• The first study of the performance impact of TLS when applied along with the existing auto-parallelizers of the bgxlc_r compiler (SIMDizer and OpenMP parallelizer).

• A detailed study of different factors that impact thespeculative execution of loops in the BG/Q.

2. RELATED WORK

Architecture techniques to support TLS have been extensively studied and found to be successful due to their lower overhead. Multiscalar [28] introduced hardware-based TLS and initially used hardware buffers called Address Resolution Buffers (ARB) [11]. Later studies relied on shared-memory cache coherence protocols to support TLS [7, 12, 13, 17, 23, 30]. Proposals on software-only TLS revealed very high overhead for such systems [6, 25].

Several profiling and speculation frameworks have been proposed [4, 5, 8, 9, 16, 20]. Embla is a data-dependence profiler that supports TLS [10]. Chen et al. propose a data-dependence profiler developed for speculative optimizations [5]. They perform speculative Partial Redundancy Elimination (PRE) and code scheduling using a naive profiler and the speculative support provided through the Advanced Load Address Table (ALAT) in the Itanium processors. Wu et al. use the concept of independence windows and dependence clustering to find opportunities for speculative parallelization [37]. The proposed profiler, called DProf, performs iteration-grain disambiguation and differentiates between intra- and inter-iteration dependencies. POSH is a TLS compiler that uses a simple profiling pass to discard ineffective tasks. The criteria to select a task for TLS in POSH include the task size, the expected squash frequency — obtained by simulation of the parallel execution — and L2 cache misses [22].

Previous research found that there is limited variability in the dependence behaviour of loops across inputs [10, 15, 31]. In this study, an evaluation of a wide range of benchmarks that have been used to evaluate the potential of TLS confirms those findings. Earlier research proposals for hardware and hardware+software TLS used simulation to predict the performance of such systems. To the best of our knowledge this is the first evaluation of the actual performance of a hardware-supported TLS system.

The IBM POWER8 processor introduces transactions with suspended regions. Using this mechanism to implement TLS, Le et al. report significant performance gains in a micro-benchmark [19]. However, Odaira and Nakaike found it much more difficult to produce significant performance improvements when implementing TLS on top of off-the-shelf hardware-supported transactional memory [26]. More recently, by manually changing loops in two SPEC CPU benchmarks, Nakaike et al. report speedups of 15% and 25% due to TLS supported by the suspend/resume instructions [24].

3. INPUT SENSITIVITY IN DATA-DEPENDENCE PROFILING

This section examines the sensitivity, to variations in the program data input, of the data-dependence prediction obtained from profiling. This study examined loops from 57 different benchmarks from the SPEC2006, PolyBench/C, Biobenchmark and NAS benchmark suites with different inputs, as listed in Table 1. The study confirms the observation of previous studies [10, 15, 31], and reveals that none of these benchmarks has a loop where variation in dependence behaviour across inputs was observed — admittedly the study used a small number of inputs for each benchmark. For loops that were signalled as containing may-dependencies by the compiler, either the dependence materializes for all program inputs examined or it materializes for none of them. A possible inference from this result is that the may-dependence is due to the imprecision of the static dependence analysis rather than actual variations in the dependence behaviour of the program. This finding is promising for the potential use of TLS because it indicates that costly many-input profiling is seldom required. Furthermore, von Koch and Franke found that dependences that inhibit loop parallelization are fairly stable across program inputs [32].

4. SPECEVAL: A SPECULATIVE PARALLELIZATION FRAMEWORK

SpecEval is an evaluation framework for speculative parallelization that uses LLVM passes to find may-dependencies inside loops, to profile, and to generate debug information so that source-code instrumentation can be performed. SpecEval then uses bgxlc_r to produce executable code from the instrumented source code. Even though our group has access to the source code of the XL compiler for other research initiatives, for this research such access was not possible because the goal is to produce a TLS-enabled path in LLVM, an open-source compiler. Thus SpecEval is only an evaluation framework, rather than a compilation solution, to assess the potential for the use of TLS in the BG/Q. As shown in Figure 1, SpecEval has three phases:

1. Collection of profile and debug information: The first step is the compilation of the source code to Intermediate Representation (IR) with the -g option of the LLVM compiler so that debug information can be collected. The debug information is used to map executable code to source code and thus allows SpecEval to determine the source-code file name and line number where each speculative loop appears so that a speculation pragma can be inserted. Step 2 runs a dependence-analysis pass to find may-dependencies. In step 3, a newly written instrumentation pass inserts calls to functions of a newly written library to prepare the code for profiling. Step 4 uses a newly written LLVM pass to collect the source-code file name and



Table 1: Benchmarks used in the input dependence behaviour study. BB = Biobenchmark

Suite      Benchmark   Suite        Benchmark     Suite   Benchmark   Suite   Benchmark
SPEC2006   lbm         PolyBench/C  2mm           NAS     BT          BB      mummer
SPEC2006   h264ref     PolyBench/C  3mm           NAS     CG          BB      protpar
SPEC2006   hmmer       PolyBench/C  gemm          NAS     DC          BB      dnamove
SPEC2006   mcf         PolyBench/C  gramschmidt   NAS     EP          BB      dnacomp
SPEC2006   sjeng       PolyBench/C  jacobi        NAS     FT          BB      dnaml(2)
SPEC2006   sphinx3     PolyBench/C  lu            NAS     IS          BB      dnamlk
SPEC2006   bzip2       PolyBench/C  seidel        NAS     LU          BB      dnadist
SPEC2006   milc        PolyBench/C  cholesky      NAS     MG          BB      dolmove
SPEC2006   gobmk       PolyBench/C  dynprog       NAS     SP          BB      restml(2)
SPEC2006   namd        PolyBench/C  fdtd-2d       NAS     UA          BB      kitsch
BB         gendist     BB           tigr          BB      clustalw    BB      dynprog
BB         hmmer       BB           protdist      BB      dnapars     BB      dnapenny
BB         dnainvar    BB           seqboot       BB      dnamlk2     BB      dollop
BB         dolpenny    BB           fitch         BB      neighbor

[Figure 1 is a flowchart: Source code → (1) compile with -g option → IR with debug information → (2) dependence-analysis pass → may-dependences → (3) instrumentation pass, using the profiling library → instrumented IR for profiling → (4) collect information for source-code instrumentation → Loop log → (5) profiling run using an input → dependence profile → (6) source-code instrumentation → instrumented source code → bgxlc_r → optimized executable → (7) test run using a different input. Steps 1-5 form Phase 1, step 6 is Phase 2, and step 7 is Phase 3.]

Figure 1: The SpecEval Framework.

line number for all loops in the program. This debug information is stored in a file indicated as Loop log in Figure 1. Step 5 executes the instrumented IR produced at step 3 with a training input to collect the data-dependence profile.

2. Source-code instrumentation: A newly written C program takes the Loop log and the data-dependence profile file as inputs and inserts speculative pragmas in the source code before the for loops that are found to be speculation candidates based on a heuristic. The heuristic selects a loop for speculation if the may-dependencies of the loop are predicted, through profiling, to not materialize and the loop takes more than 1.2% of the whole program execution time. This threshold was selected empirically to prevent slowdowns due to the overhead of speculating loops that represent an insignificant portion of the program execution time. The TLS pragma, '#pragma speculative for', is used to signal to the XL compiler that the loop should be speculated. This BG/Q-specific pragma divides the total iteration space into chunks of:

    number of iterations / number of threads

Figure 2 shows a sample instrumented loop. For a given loop nest of n dimensions, if the pragma is applied to multiple loop levels, bgxlc_r automatically flattens the pragma to be effective on the outermost loop.

3. Test run: The instrumented source code produced in the previous phase is compiled with the bgxlc_r compiler to produce an executable that can be run on the BG/Q.¹ A different input is used in the profiling run than the input used for the test run.

#pragma speculative for
for (i = 0; i < x; i++) {
    // loop body
}

Figure 2: A sample loop with the TLS pragma inserted.

This framework leveraged the dependence analysis and the versatility of the LLVM framework [18] and the compiler and runtime infrastructure developed by IBM to enable a first performance evaluation of the hardware support for TLS on the BG/Q.

5. PROFILE-DRIVEN TLS

This section evaluates the impact of applying profile-driven TLS to benchmarks from the SPEC2006 and PolyBench/C benchmark suites using SpecEval — these suites have been used in previous TLS studies [15, 27, 37]. Times reported are average execution times from 60 runs. For all experiments, a one-sample Student's t-test performed on the 60 execution times shows that the p-values are in the range 0.15-0.33 — a p-value less than or equal to 0.05 would indicate significant variation in the execution time.

Previous studies have used TLS for benchmarks with many parallel loops without regard for the nature of potential dependencies [25, 29]. However, there is no point in applying TLS to a loop that is known to not have dependencies — such loops should be executed in parallel rather than speculatively — or to a loop that is known to have an actual loop-carried dependence — the speculation of such loops is certain to fail. Therefore SpecEval only applies TLS to loops for which the compiler reported a may-dependence and where

¹The _r option generates thread-safe code.



no true loop-carried dependencies exist. Therefore the opportunities for TLS under such assumptions are diminished by the quality of the dependence analysis in the compiler. SpecEval currently relies on the loop dependence analysis in LLVM. A similar approach can be used for other compilation frameworks in the future.

[Figure 3 is a bar chart; its recoverable data (percentage change in execution time over the auto-SIMDized baseline) is:

Benchmark   Auto-OpenMP   Auto-OpenMP+TLS   Oracle
milc        18            36                55
lbm         40            110               110
bzip2       0             10                18
mcf         80            100               140
namd        10            14                14
hmmer       12            20                40
sphinx3     20            20                30
gobmk       0             0                 0
sjeng       10            -50               10
h264ref     40            -70               40]

Figure 3: Percentage change in execution time from different parallelization techniques for SPEC2006 benchmarks over the auto-SIMDized code. The OpenMP and TLS versions use 4 threads.

5.1 Interaction between TLS, AutoSIMD, and AutoOpenMP parallelizers

A loop is SIMDizable if its execution can use the vector instructions and vector registers existing in the machine. Elsewhere such loops are also called vectorizable. The baseline for comparison is an executable generated by the bgxlc compiler from IBM at the highest level of optimization (-O5) with auto-SIMDization. This code is generated with the following command: bgxlc -O5 -qsimd=auto -qhot -qstrict -qprefetch.² Figure 3 compares three compilation versions with this baseline. AutoOpenMP is an automatically parallelized version using OpenMP and SIMDization generated by bgxlc_r by adding the option -qsmp=auto to the compilation command. AutoOpenMP+TLS is the same code as AutoOpenMP with TLS applied by SpecEval to some loops that were not parallelized by bgxlc because of may-dependencies. Loops that are parallelized with the OpenMP option are called parallelizable elsewhere. The -qsmp option is modified to -qsmp=auto:speculative. Oracle is obtained by applying TLS incrementally to candidate loops. If the application of TLS to a candidate loop degrades performance then that loop is rejected from TLS by the oracle. The oracle uses an unrealizable heuristic that knows which loops are profitable for TLS. Therefore the experiment with the oracle is a limit study that estimates the best performance that could be obtained using TLS in SpecEval. The oracle is not a practical solution for actual application programs, but in a performance-evaluation study it provides valuable information to indicate how well the heuristic that selects loops for speculation is doing.

The results in Figure 3 indicate that there are three classes of benchmarks: benchmarks that achieve speedup with TLS (milc, lbm, bzip2, mcf, namd, hmmer), benchmarks where performance neither improves nor degrades (sphinx3, gobmk), and benchmarks that suffer performance degradation with TLS (h264ref, sjeng). The superior oracle performance for mcf is explained by some loops, speculated by SpecEval, that contain function calls that introduce dependencies. In Section 6.4 a heuristic that excludes these loops from TLS gives performance comparable to that of the oracle.

²By default the -O5 level turns on automatic vectorization (SIMDization) and various optimizations of hot loops (e.g. loop unrolling). The option -qstrict maintains the correct semantics of the program when using higher-level optimizations. The option -qprefetch enables prefetching.

The coverage of a loop is the percentage of the total execution time of the program that can be attributed to that loop. There is limited performance improvement when TLS is applied to cold loops due to the speculative thread-creation overhead. bzip2 and sjeng are benchmarks that contain speculative loops with poor coverage (Table 2).

The BG/Q transactional-memory subsystem supports two modes for speculative execution: a long-running (LR) mode and a short-running (SR) mode [33, 34, 35]. Only the LR mode was available for TLS during this study. In the LR mode the L1 cache must be flushed at the start of each speculative region. As a result, a significant number of L1 cache misses happens at the start of a speculative region, leading to the increase in L1 cache misses observed in the experimental evaluation and limiting TLS performance. In the BG/Q the hardware support for transactional memory is built on top of the hardware support for TLS — the main distinction between TLS and TM is the need to enforce the commit order in TLS. Wang et al. describe the SR and LR modes of the BG/Q [34].

As the results in Figure 4 indicate, amongst the PolyBench/C benchmarks, the performance of cholesky and dynprog is affected by the flushing of the L1 cache required for TLS in the BG/Q (Table 3). Although this hypothesis could not be experimentally evaluated in this study, it is possible that the use of the short-running (SR) mode [33] would be more suitable for the speculative execution of these benchmarks.

Starting a TLS execution requires the execution of additional code — such as saving the register context before entering a speculative region and obtaining a speculative ID — that accounts for some of the TLS overhead reflected in the increase in dynamic instruction path length. jacobi and seidel are two benchmarks that experience a significant path-length increase (Table 4), thus limiting or degrading their performance.

[Chart data: speedup of the OpenMP and speculative (TLS) versions of the SPEC2006 and PolyBench/C benchmarks.]

[Chart data: speedup of the SIMD, OpenMP, speculative, and oracle versions of the SPEC2006 and PolyBench/C benchmarks.]

[Chart data: percentage change in execution time for the Auto-OpenMP, Auto-OpenMP+TLS, and oracle versions of the SPEC2006 and PolyBench/C benchmarks.]

[Chart data: percentage change in speedup of TLS+Auto-OpenMP over Auto-OpenMP for the SPEC2006 and PolyBench/C benchmarks.]

Figure 4: Percentage change in execution time from different parallelization techniques for PolyBench/C benchmarks over the auto-SIMDized code. The OpenMP and TLS versions use 4 threads.



6. LIMITATIONS TO TLS PERFORMANCE

The performance effect of TLS is determined by the amount of time spent in speculated loops, by the cost of misspeculation, by changes in cache behaviour, and by the increase in the instruction path length of the programs.

6.1 TLS Startup Overhead on the BG/Q

Starting a TLS region on the BG/Q requires operating-system actions, primarily to set up threads with a special virtual-memory structure. Knowing the cost of this setup process is important to understanding the profitability of speculating the execution of a loop. Profitable use of TLS generally requires choosing a speculative region of code that is large enough to amortize these overheads yet small enough to have a low conflict probability. A separate set of experiments measures the TLS startup overhead. In these experiments, a simple loop with independent iterations is executed many times: sometimes sequentially, sometimes in parallel using OpenMP, and sometimes speculatively in parallel using TLS. The test was constructed so that the iteration count and the iteration independence could not be determined at compile time. Furthermore, the ordering of loop executions using OpenMP and TLS was permuted to search for any 'switching penalty' between the two methods, but none was found. Cycle counts for each loop execution were obtained from the processor's time-base register. With rare exceptions, the serial loop executed in 217 cycles. As shown in Figure 6, the two parallel methods, while similar to each other, exhibit a complex distribution of execution cycle counts. We were unable to determine the cause of the distribution's complexity, but one thing is clear: starting an OpenMP or TLS region on the BG/Q takes many hundreds of thousands of cycles, and sometimes nearly an order of magnitude more. Even though the timings were highly variable, they were also deterministic. This relatively large overhead associated with TLS region startup explains the success of heuristics, such as those used in this work, that limit the use of TLS to high-coverage loops.

6.2 Loops parallelized and their coverage

Table 2 shows the total number of loops in each of the studied benchmarks and the number of loops parallelized by each of the three different parallelization techniques — automatic OpenMP parallelization, automatic SIMDization, and speculative parallelization via TLS. The table also presents the coverage of speculatively parallelized loops in the SPEC2006 and PolyBench/C benchmarks, and the percentage of speculative regions that are successfully committed.

The interesting benchmark in Table 2 is h264ref because it has the highest number of speculatively parallelized loops with good coverage among all the benchmarks, but it still experiences a slowdown due to speculative execution, as seen in Figure 3. Experiments reveal that function calls from within the speculative loop body introduce dependencies at run-time, and those dependencies are not detected by the context-insensitive dependence profiling. This type of misspeculation overhead also limits the TLS performance for sjeng.

The performance improvement from TLS for milc is limited due to the speculative execution of loops with poor coverage, which introduces TLS thread-creation overhead. An ideal TLS candidate loop has high coverage, i.e. it is a hot loop, with a very low probability of dependence violation. The number of loops speculatively parallelized for lbm is small, but they take a significant portion of the whole program execution time (98%). These hot loops of lbm are good speculation candidates and are examples of cases where TLS can be beneficial.

The PolyBench/C benchmarks 2mm and 3mm execute matrix multiplication. Their large loops can be parallelized with OpenMP and thus are not candidates for speculation. For gemm, the speculatively parallel loops have good coverage, which accounts for the speedup from TLS. But for cholesky, dynprog and fdtd-2d, the poor coverage of loops results in a TLS slow-down. Gramschmidt does not have any speculative loops because cold loops (less than 1.2% coverage) are filtered out by the speculation heuristic as they are not beneficial for TLS [3].

6.3 Misspeculation Overhead

Aborting a speculative execution because of dependencies and re-executing the parallel section sequentially imposes overhead that results in performance degradation. In the absence of an inter-procedural data-dependence analysis, the speculation of loops with function calls is risky because the execution of the callee may introduce dependencies that were not detected by the analysis and may thus lead to thread squashing. The measurement of misspeculation overhead revealed that some of the benchmarks contain loops where function calls introduce actual dependencies at run-time.

A speculative thread either commits successfully or ends in a non-committed state — in which case the cache-line versions associated with the thread are discarded.

The se_print_stats function from the speculation.h header file in the BG/Q runtime system is used to collect various statistics for a speculative region, including the number of successfully committed and non-committed threads.

The percentage of successfully committed threads is a good proxy measurement for the misspeculation overhead. Ideally, a benchmark that can benefit from TLS should have a high percentage of successful completion of speculative threads.

Table 2 also shows the percentage of speculative threads committed for the SPEC2006 and PolyBench/C benchmarks. The best TLS performance, in terms of successful thread completion, is in lbm. For sjeng and h264ref the percentage is much lower, indicating a huge amount of wasted computation that causes their slowdown. Closer investigation of these benchmarks reveals that most of the speculatively parallelized loops contain function calls that introduce new dependencies at run-time. Though h264ref has a high number of speculatively parallelized loops (Table 2), the dependencies resulting from the function calls inside the loops account for the slowdown. Mcf, hmmer, milc, namd and bzip2 also suffer from this phenomenon.

Among the PolyBench/C benchmarks, gemm and lu have a high percentage of speculative-thread completion, which accounts for their speedup. Four of the benchmarks — jacobi, seidel, cholesky and dynprog — have a high percentage of thread completion but still experience slowdown.

Experiments show that cholesky and dynprog suffer from an increase in L1 cache misses, likely caused by the LR-mode cache flush, while jacobi and seidel suffer from a significant increase in dynamic instruction path length. These two benchmarks contain loops that have low iteration counts.

Table 2: Number of loops parallelized by OpenMP, SIMDized, and speculated, with the coverage of speculated loops. Coverage is the fraction of total time spent in the speculated loops (expressed as a percentage). The rightmost column is the percentage of speculative threads that successfully committed.

Suite        Benchmark     Total   OpenMP  SIMDized  Speculative           %
                          #Loops   #Loops    #Loops  #Loops  Cover.   Commit
SPEC2006     lbm              23        4         0       5    98 %     94 %
             h264ref        1870      179         3      47    82 %     12 %
             hmmer           851      105        17      30    80 %     79 %
             mcf              52        9         0      12    65 %     68 %
             sjeng           254        9         0      16    32 %      8 %
             sphinx3         609       11         0       2    91 %     29 %
             bzip2           232        4         0       2    35 %     78 %
             gobmk          1265        0         0       0     0 %        -
             milc            421        7         2      22    33 %     79 %
             namd            619        9         7      25    92 %     80 %
PolyBench/C  2mm              20        7         3       0     0 %        -
             3mm              27       10         3       0     0 %        -
             gemm             13        3         4       4    40 %     89 %
             gramschmidt      10        3         0       0     0 %        -
             jacobi            9        3         0       2     3 %     78 %
             lu                8        3         1       5    45 %     89 %
             seidel            7        4         0       2     3 %     70 %
             cholesky          9        0         0       4    10 %     79 %
             dynprog           9        7         0       3    18 %     82 %
             fdtd-2d          14        2         0       3    20 %     68 %

Two approaches can be used to overcome the misspeculation overhead due to function calls:

Conservative Approach: The compiler conservatively does not allow loops with function calls to be executed in parallel, regardless of whether the function call changes the dependence behaviour of the loop. This heuristic might miss parallelization opportunities where the function call is harmless.

Precise Approach: A more sophisticated inter-procedural dependence analysis technique is used to decide whether the function call introduces new dependencies at run-time.

The performance effect of adopting the conservative approach on the SPEC2006 benchmarks is evaluated in Section 6.4.

6.4 Preventing Speculation of Loops with Side-Effect Function Calls

Allowing speculative execution of loops with function calls for h264ref and sjeng introduces misspeculation overhead that results in performance degradation. Function calls inside speculated loop bodies in these benchmarks introduce dependencies across iterations at run time. The static dependence analysis in LLVM is currently unable to detect these dependencies. An inter-procedural dependence analysis based on profiling information is not yet available.

This section studies the performance effect of preventing the speculative execution of loops with function calls that may have side effects.

The results in Figure 5 indicate that preventing speculative execution of loops with function calls changes the performance degradation of h264ref and sjeng into performance gains. The percentage of successfully committed threads jumps from 12% to 96% for h264ref and from 8% to 97% for sjeng. Using this approach, a 32% performance change is achieved for hmmer because hmmer contains some loops where new dependencies are introduced at run-time due to function calls.

[Chart data: speedup of the speculative and speculative-without-function-call versions of the SPEC2006 and PolyBench/C benchmarks, and percentage change in execution time for TLS, TLS without function calls, and the oracle version.]

Figure 5: Percentage change in execution time of SPEC2006 benchmarks over auto-SIMDized code after filtering speculative execution of loops with function calls.

Also, the performance of mcf comes close to the performance of the oracle version. Performance does not degrade for any of these benchmarks, thus indicating that there are no loops in these benchmarks that are both hot and have function calls that affect the run-time dependencies.

The performance of the PolyBench/C benchmarks does not change when this approach is used. Most PolyBench/C benchmarks are kernel benchmarks and the loops inside them do not contain function calls.

6.5 L1 Cache Miss Rate

One of the most dominant run-time overheads in the BG/Q TLS is caused by the loss of L1 cache support due to the L1 flush needed for the bookkeeping of speculative state in L2. The L2 and L1P buffer load latencies are 13x and 5x higher than the L1 load latency. The L1 miss rate for both the sequential and parallel versions of the code gives an idea about the performance gain or loss for the benchmarks.

The Hardware Performance Monitor (HPM) library of the BG/Q is used to collect the L1 miss statistics. Table 3 gives the L1 cache hit rate for the sequential version and the three



[Two histograms: "Single Iteration Cycle Counts for TLS" and "Single Iteration Cycle Counts for OpenMP"; x-axis: Cycles/1000, y-axis: Frequency of Samples.]

Figure 6: Distribution of the number of cycles required to execute a single iteration of a simple loop with TLS and OpenMP.

parallel versions of the SPEC2006 and PolyBench/C benchmarks.

The speculative execution of sjeng results in a high L1 miss rate. This high miss rate is the effect of flushing the L1 cache before entering the TLS region in the LR mode. Apart from the function calls that introduce data dependencies, the high L1 miss rate also affects the performance of sjeng.

A similar effect can be seen for the two PolyBench/C benchmarks cholesky and dynprog. These two benchmarks have a high percentage of successful completion of speculative threads — see Table 2. The speculative execution of the selected loops affects the cache performance due to the locality of data between threads. The cost of bringing the data back after flushing the cache is likely the cause of the slowdown in these benchmarks.

For the jacobi and seidel benchmarks, though the speculative execution of the loops results in better cache utilization, the benchmarks experience a slowdown. The reason for this slowdown is the increase in instruction path length.3 The two benchmarks fdtd-2d and gobmk do not experience any change in cache utilization for the three parallelization techniques (automatic OpenMP, SIMDization and speculative parallelization), because there are no parallelizable loops.

6.6 Increase in Instruction Path Length

Automatic OpenMP and speculative parallelization insert calls to OpenMP and TLS run-time functions, respectively, into the parallelized loops. Code is also inserted for saving the consistent system state so that the system can be rolled back to a previous state in case of a dependence violation and thread squashing. Table 4 shows the effect of TLS on code growth.

The code growth for the PolyBench/C benchmarks is higher than for the SPEC2006 benchmarks because loops constitute a major portion of the PolyBench/C benchmarks and therefore the parallelization of these loops affects the code size more significantly. For the SPEC2006 benchmarks the code

3 The instruction path length is the total number of instructions executed by an application.

Table 4: Percentage of dynamic instruction-path-length increase of the three parallel versions of the SPEC2006 and PolyBench/C benchmarks with respect to their sequential version.

Suite        Benchmark    SIMD  AutoOMP  Speculative
SPEC2006     lbm           0.0      0.3         26.0
             h264ref       0.6     15.0         56.0
             hmmer        10.0     35.0         37.0
             mcf           0.0     12.0         23.0
             sjeng         0.0      0.0         45.0
             sphinx3       0.0     18.0         19.0
             bzip2         0.0      2.0          3.0
             gobmk         0.0      0.0          0.0
             milc          0.9     12.0         23.0
             namd          1.0     12.0         25.0
PolyBench/C  2mm          13.0     45.0         45.0
             3mm          13.0     46.0         46.0
             gemm         11.0     45.0         45.0
             gramschmidt   0.0     46.0         46.0
             jacobi        0.0     95.0        112.0
             lu            1.0     12.0         13.0
             seidel        0.0     98.0        123.0
             cholesky      0.0      0.0         99.0
             dynprog       0.0      0.0         75.0
             fdtd-2d       0.0      0.0         79.0

growth is relatively smaller, the highest being for the speculative parallelization of h264ref, where a large number of loops are speculatively parallelized (Table 2).

The data in Table 4 indicate that jacobi and seidel experience a significant code growth that is likely to explain their slowdown. Benchmarks such as cholesky, dynprog and fdtd-2d also suffer code growth due to the presence of loops with poor coverage (see Table 2). Loops whose speculation leads to non-trivial code growth should be speculated only judiciously.

7. PERFORMANCE TRENDS

As technology scales, the number of cores that can be integrated onto a processor increases. Thus, it is important to understand whether TLS can efficiently utilize all the available cores. This section examines the scalability of TLS performance for the SPEC2006 and PolyBench/C benchmarks by comparing the speedup achieved using 2 to 64 threads. The results of this study are shown in Figure 7.

[Chart data: speedup of the speculative versions of the SPEC2006 and PolyBench/C benchmarks for 2, 4, 8, 16, 32 and 64 threads.]

Figure 7: Scalability of the speculatively parallelized versions of the SPEC2006 benchmarks.

As seen in the figure, lbm, sphinx3, namd and h264ref contain a number of loops with good coverage. Therefore,



Table 3: L1 cache hit rate (percentage) for the sequential and three parallel versions of the SPEC2006 and PolyBench/C benchmarks.

Suite        Benchmark    Sequential  SIMD  AutoOMP  Speculative
SPEC2006     lbm                  95    94       94           93
             h264ref              96    95       95           94
             hmmer                98    97       97           95
             mcf                  92    92       95           95
             sjeng                96    96       95           90
             sphinx3              96    96       95           95
             bzip2                95    95       95           97
             gobmk                97    97       97           97
             milc                 95    97       97           98
             namd                 96    98       97           98
PolyBench/C  2mm                  98    98       99           99
             3mm                  98    98       99           99
             gemm                 98    96       98           98
             gramschmidt          97    97       97           97
             jacobi               97    97       97           97
             lu                   96    96       95           96
             seidel               98    97       98           98
             cholesky             98    98       96           88
             dynprog              97    96       97           90
             fdtd-2d              98    98       98           98

these benchmarks show some scalability with the increasing number of threads.

Mcf is a benchmark that scales up to 32 threads due to cache prefetching [27]. The performance of milc scales up to 8 threads. For hmmer and sjeng, the performance improvement from TLS is negligible in all configurations. While the reasons for the lack of scalability differ from benchmark to benchmark, it is clear that the amount of parallelism is limited. There is also very little performance change with an increasing number of threads.

7.1 Using Clauses with the Basic TLS Pragma

All the experiments discussed so far use the basic TLS pragma available for TLS on the BG/Q. The bgxlc_r compiler also supports OpenMP-like clauses that can be used to optimize the performance of speculative loops [14]. These clauses offer more flexibility in the following two aspects:

• Scoping of variables: The clauses default, shared, private, firstprivate and lastprivate give the option to specify the scope of the variables used inside the loop.

• Work Distribution: The clauses num_threads and schedule give the option to change the number of threads and the distribution of work among threads.

This case study illustrates the use of specific clauses on the lbm and h264ref benchmarks. These benchmarks are chosen because lbm has loops that are suitable for TLS execution and h264ref has the highest number of speculatively parallelized loops. The clauses are manually added to pragmas to study their performance effects. This manual instrumentation allowed the parallelization of more loops.

This study also investigates the impact of different work-distribution strategies on the TLS performance of the speculatively parallelized loops in the two benchmarks. The performance evaluation indicates that the scoping of variables results in negligible performance variations for these two benchmarks (improvements of 0.05% and 0.01% for lbm and h264ref, respectively), and that the different work-distribution strategies do not change the performance at all.

Still, the question remains whether there would be any significant performance change for other benchmarks due to the modification of the basic pragma. Previous work mentions that finding the best-suited (OpenMP) pragma automatically for loops is non-trivial and needs programmer support [31, 36]. One approach for automatic modification of the pragmas for work sharing is to use machine-learning techniques [31, 36]. Auto-scoping of variables is still not supported in bgxlc_r. Techniques used by Oracle's Solaris compiler could be explored for auto-scoping [21].

8. LESSONS LEARNED

The introduction posed some important research questions regarding the use of TLS to improve performance. This study is limited to standard benchmarks that were not designed with the idea of exploiting TLS in mind, and to the first commercially available implementation of hardware-supported TLS. Nonetheless, the results presented indicate that obtaining significant performance gains from TLS alone may be harder than previously thought. The overall performance gains reported appear underwhelming and would discourage the use of TLS if significant effort were required from the final user. However, the good news is that, given that the machinery to support TLS is already implemented in the hardware, once the compilation and run-time system support is in place, deploying TLS to an application requires minimal effort. The confirmation that, for the majority of the applications, the materialization of may-dependencies is independent of the program input — and therefore a single profiling run is sufficient to determine which may-dependencies can be safely speculated — makes the use of TLS even simpler.

A second lesson learned is that, even though the cost of starting a TLS region is similar to the cost of starting an OpenMP region, both of them can be high. An interesting direction for future improvement, which requires access to the IBM proprietary software stack, would be to identify which portion of this overhead is due to the hardware and which to the software, and to try to improve the software stack to reduce this cost.

9. CONCLUSION

This research connected the dependence analysis in one compiler (LLVM) and the support for TLS in another (bgxlc_r) to create a new automatic speculative parallelization framework, SpecEval, that allowed for an evaluation of the performance effects of TLS on the BlueGene/Q supercomputer. This evaluation found that many factors must be taken into



consideration when selecting candidate loops for speculation. These factors include: the number and coverage of speculative loops, misspeculation overhead due to function calls inside the loop body, the increase in L1 cache misses, and the increase in dynamic instruction path length.

The evaluation, using benchmarks from the SPEC2006 and PolyBench/C suites, found that benchmarks that have loops with good coverage and without dependence violations, such as lbm, are well suited for TLS. But the widespread use of TLS in benchmarks that contain function calls with side effects will require more sophisticated inter-procedural dependence analysis in the compiler.

This research was originally motivated by the idea that information obtained from profiling is expected to vary according to the program input used for profiling [1]. While variation in dependence behaviour with input does occur for some programs [2], this study confirms earlier claims that, in practice, the dependence behaviour of loops does not change with the input for a wide range of benchmarks. Based on this finding, single-input data-dependence profiles could be used to find the speculation-candidate loops in this study.

Acknowledgments. We thank Paul Berube and Matthew Gaudet for providing valuable feedback throughout the course of this research. We would also like to thank Jeff Hammond at the Argonne National Laboratory for his support and for giving access to the BG/Q. This research was partially funded by the Natural Sciences and Engineering Research Council of Canada through a Collaborative Research and Development grant and through the IBM Centre for Advanced Studies. This research used resources of the Argonne Leadership Computing Facility, which is a DOE Office of Science User Facility supported under Contract DE-AC02-06CH11357.

10. REFERENCES

[1] P. Berube and J. Amaral. Combined profiling: A methodology to capture varied program behavior across multiple inputs. In International Symposium on Performance Analysis of Systems and Software (ISPASS), pages 210–220, New Brunswick, NJ, USA, 2012.

[2] A. Bhattacharyya. Do inputs matter? Using data-dependence profiling to evaluate thread level speculation in the Blue Gene/Q. Master's thesis, University of Alberta, Edmonton, Canada, 2013.

[3] A. Bhattacharyya and J. N. Amaral. Automatic speculative parallelization of loops using polyhedral dependence analysis. In Proceedings of the First International Workshop on Code OptimiSation for MultI and many Cores, COSMIC '13, pages 1:1–1:9, Shenzhen, China, 2013.

[4] L.-L. Chen and Y. Wu. Aggressive compiler optimization and parallelization with thread-level speculation. In International Conference on Parallel Processing (ICPP), pages 607–614, Kaohsiung, Taiwan, 2003.

[5] T. Chen, J. Lin, X. Dai, W.-C. Hsu, and P.-C. Yew. Data dependence profiling for speculative optimizations. In E. Duesterwald, editor, Compiler Construction, volume 2985 of Lecture Notes in Computer Science, pages 57–72. Springer Berlin Heidelberg, 2004.

[6] M. Cintra and D. R. Llanos. Toward efficient and robust software speculative parallelization on multiprocessors. In Principles and Practice of Parallel Programming (PPoPP), pages 13–24, San Diego, CA, USA, 2003.

[7] M. Cintra and J. Torrellas. Eliminating squashes through learning cross-thread violations in speculative parallelization for multiprocessors. In High-Performance Computer Architecture (HPCA), pages 43–, Cambridge, MA, USA, 2002.

[8] J. Da Silva and J. G. Steffan. A probabilistic pointer analysis for speculative optimizations. In Architectural Support for Programming Languages and Operating Systems (ASPLOS), pages 416–425, San Jose, CA, USA, 2006.

[9] X. Dai, A. Zhai, W.-C. Hsu, and P.-C. Yew. A general compiler framework for speculative optimizations using data speculative code motion. In Code Generation and Optimization (CGO), pages 280–290, San Jose, CA, USA, 2005.

[10] K.-F. Faxen, K. Popov, L. Albertsson, and S. Janson. Embla - data dependence profiling for parallel programming. In Complex, Intelligent and Software Intensive Systems (CISIS), pages 780–785, 2008.

[11] M. Franklin and G. S. Sohi. ARB: A hardware mechanism for dynamic reordering of memory references. IEEE Transactions on Computers, 45(5):552–571, May 1996.

[12] S. Gopal, T. Vijaykumar, J. Smith, and G. Sohi. Speculative versioning cache. In High-Performance Computer Architecture (HPCA), pages 195–, Las Vegas, NV, USA, 1998.

[13] L. Hammond, M. Willey, and K. Olukotun. Data speculation support for a chip multiprocessor. In Architectural Support for Programming Languages and Operating Systems (ASPLOS), pages 58–69, San Jose, CA, USA, 1998.

[14] IBM. IBM XL C/C++ for Blue Gene/Q, V12.1. http://www-03.ibm.com/software/products/us/en/xlcc+forbluegene/.

[15] H. Kim, N. P. Johnson, J. W. Lee, S. A. Mahlke, and D. I. August. Automatic speculative DOALL for clusters. In Code Generation and Optimization (CGO), pages 94–103, San Jose, CA, USA, 2012.

[16] M. Kim, H. Kim, and C.-K. Luk. SD3: A scalable approach to dynamic data-dependence profiling. In International Symposium on Microarchitecture (MICRO), pages 535–546, Atlanta, GA, USA, 2010.

[17] V. Krishnan and J. Torrellas. The need for fast communication in hardware-based speculative chip multiprocessors. In Parallel Architectures and Compilation Techniques (PACT), pages 24–, Newport Beach, CA, USA, 1999.

[18] C. Lattner and V. Adve. LLVM: A Compilation Framework for Lifelong Program Analysis & Transformation. In Code Generation and Optimization (CGO), Palo Alto, California, Mar 2004.

[19] H. Le, G. Guthrie, D. Williams, M. Michael, B. Frey, W. Starke, C. May, R. Odaira, and T. Nakaike. Transactional memory support in the IBM POWER8 processor. IBM Journal of Research and Development, 59(1):8:1–8:14, Jan 2015.

[20] J. Lin, T. Chen, W.-C. Hsu, P.-C. Yew, R. D.-C. Ju, T.-F. Ngai, and S. Chan. A compiler framework for speculative analysis and optimizations. In Programming Language Design and Implementation (PLDI), pages 289–299, San Diego, California, USA, 2003.

[21] Y. Lin, C. Terboven, D. Mey, and N. Copty. Automatic scoping of variables in parallel regions of an OpenMP program. In Shared Memory Parallel Programming with OpenMP, volume 3349 of Lecture Notes in Computer Science, pages 83–97. Springer, 2005.

[22] W. Liu, J. Tuck, L. Ceze, W. Ahn, K. Strauss, J. Renau, and J. Torrellas. POSH: A TLS compiler that exploits program structure. In Principles and Practice of Parallel Programming (PPoPP), pages 158–167, New York, NY, USA, 2006.

[23] P. Marcuello and A. Gonzalez. Clustered speculative multithreaded processors. In International Conference on Supercomputing (ICS), pages 365–372, Rhodes, Greece, 1999.

[24] T. Nakaike, R. Odaira, M. Gaudet, M. M. Michael, and H. Tomari. Quantitative comparison of hardware transactional memory for Blue Gene/Q, zEnterprise EC12, Intel Core, and POWER8. In International Symposium on Computer Architecture (ISCA), Portland, OR, June 2015.

[25] C. E. Oancea and A. Mycroft. Software thread-level speculation: An optimistic library implementation. In Proceedings of the 1st International Workshop on Multicore Software Engineering (IWMSE), pages 23–32, Leipzig, Germany, 2008.

[26] R. Odaira and T. Nakaike. Thread-level speculation on off-the-shelf hardware transactional memory. In IEEE International Symposium on Workload Characterization (IISWC), pages 212–221, Raleigh, NC, October 2014.

[27] V. Packirisamy, A. Zhai, W.-C. Hsu, P.-C. Yew, and T.-F. Ngai. Exploring speculative parallelism in SPEC2006. In International Symposium on Performance Analysis of Systems and Software (ISPASS), pages 77–88, 2009.

[28] G. S. Sohi, S. E. Breach, and T. N. Vijaykumar. Multiscalar processors. In International Symposium on Computer Architecture (ISCA), pages 414–425, S. Margherita Ligure, Italy, 1995.

[29] M. F. Spear, K. Kelsey, T. Bai, L. Dalessandro, M. L. Scott, C. Ding, and P. Wu. Fastpath speculative parallelization. In Languages and Compilers for Parallel Computing (LCPC), pages 338–352, Newark, DE, USA, 2010.

[30] J. G. Steffan, C. B. Colohan, A. Zhai, and T. C. Mowry. A scalable approach to thread-level speculation. In International Symposium on Computer Architecture (ISCA), pages 1–12, Vancouver, BC, Canada, 2000.

[31] G. Tournavitis, Z. Wang, B. Franke, and M. F. O'Boyle. Towards a holistic approach to auto-parallelization: Integrating profile-driven parallelism detection and machine-learning based mapping. In Programming Language Design and Implementation (PLDI), pages 177–187, Dublin, Ireland, 2009.

[32] T. J. K. E. von Koch and B. Franke. Variability of data dependences and control flow. In International Symposium on Performance Analysis of Systems and Software (ISPASS), pages 180–189, March 2014.

[33] A. Wang, M. Gaudet, P. Wu, J. N. Amaral, M. Ohmacht, C. Barton, R. Silvera, and M. Michael. Evaluation of Blue Gene/Q hardware support for transactional memories. In Parallel Architectures and Compilation Techniques (PACT), pages 127–136, Minneapolis, MN, USA, 2012.

[34] A. Wang, M. Gaudet, P. Wu, M. Ohmacht, J. N. Amaral, C. Barton, R. Silvera, and M. M. Michael. Evaluation of Blue Gene/Q hardware support for transactional memories. In Parallel Architectures and Compilation Techniques (PACT), pages 127–136, Minneapolis, MN, USA, September 2012.

[35] A. Wang, M. Gaudet, P. Wu, M. Ohmacht, J. N. Amaral, C. Barton, R. Silvera, and M. M. Michael. Software support and evaluation of hardware transaction memory on Blue Gene/Q. IEEE Transactions on Computers, 64(1):233–346, January 2015.

[36] Z. Wang and M. F. O'Boyle. Mapping parallelism to multi-cores: A machine learning based approach. In Principles and Practice of Parallel Programming (PPoPP), pages 75–84, Raleigh, NC, USA, 2009.

[37] P. Wu, A. Kejariwal, and C. Cascaval. Compiler-driven dependence profiling to guide program parallelization. In Languages and Compilers for Parallel Computing (LCPC), pages 232–248, Edmonton, AB, Canada, 2008.
