Scheduling on Asymmetric Parallel Architectures


Transcript of Scheduling on Asymmetric Parallel Architectures


Scheduling on Asymmetric Parallel Architectures

Filip Blagojevic

Dissertation submitted to the faculty of the Virginia Polytechnic Institute and State University in partial fulfillment of the requirements for the degree of

Doctor of Philosophy in

Computer Science and Applications

Committee Members:
Dimitrios S. Nikolopoulos (Chair)
Kirk W. Cameron
Wu-chun Feng
David K. Lowenthal
Calvin J. Ribbens

May 30, 2008
Blacksburg, Virginia

Keywords: Multicore processors, Cell BE, process scheduling, high-performance computing, performance prediction, runtime adaptation

© Copyright 2008, Filip Blagojevic


Scheduling on Asymmetric Parallel Architectures

Filip Blagojevic

(ABSTRACT)

We explore runtime mechanisms and policies for scheduling dynamic multi-grain parallelism on heterogeneous multi-core processors. Heterogeneous multi-core processors integrate conventional cores that run legacy codes with specialized cores that serve as computational accelerators. The term multi-grain parallelism refers to the exposure of multiple dimensions of parallelism from within the runtime system, so as to best exploit a parallel architecture with heterogeneous computational capabilities between its cores and execution units. To maximize performance on heterogeneous multi-core processors, programs need to expose multiple dimensions of parallelism simultaneously. Unfortunately, programming with multiple dimensions of parallelism is to date an ad hoc process, relying heavily on the intuition and skill of programmers. Formal techniques are needed to optimize multi-dimensional parallel program designs. We investigate user- and kernel-level schedulers that dynamically "rightsize" the dimensions and degrees of parallelism on asymmetric parallel platforms. The schedulers address the problem of mapping application-specific concurrency to an architecture with multiple hardware layers of parallelism, without requiring programmer intervention or sophisticated compiler support. Our runtime environment outperforms the native Linux and MPI scheduling environments by up to a factor of 2.7. We also present a model of multi-dimensional parallel computation for steering the parallelization process on heterogeneous multi-core processors. The model predicts with high accuracy the execution time and scalability of a program that uses conventional processors and accelerators simultaneously. More specifically, the model reveals the optimal degrees of multi-dimensional, task-level and data-level concurrency that maximize performance across cores. We evaluate our runtime policies, as well as the performance model we developed, on an IBM Cell BladeCenter and on a cluster composed of PlayStation 3 nodes, using two realistic bioinformatics applications.


ACKNOWLEDGMENTS

I would like to thank my advisor, Dr. Dimitrios S. Nikolopoulos, for his guidance during my graduate studies. I would also like to thank Dr. Alexandros Stamatakis, Dr. Xizhou Feng, and Dr. Kirk Cameron for providing us with the original MPI implementations of PBPI and RAxML and for discussions on scheduling and modeling the Cell/BE. I would like to thank the members of the PEARL group, Dr. Christos Antonopoulos, Dr. Matthew Curtis-Maury, Scott Schneider, Jae-Sung Yeom, and Benjamin Rose, for their involvement in the projects presented in this dissertation. I would also like to thank my Ph.D. committee for their discussion and suggestions for this work: Dr. Kirk W. Cameron, Dr. David Lowenthal, Dr. Wu-chun Feng, and Dr. Calvin J. Ribbens. Also, I thank Georgia Tech, its Sony-Toshiba-IBM Center of Competence, and NSF for the Cell/BE resources that have contributed to this research. Finally, I would like to thank the institutions that have funded this research: the National Science Foundation and the U.S. Department of Energy.



Contents

1 Problem Statement 1
1.1 Mapping Parallelism to Asymmetric Parallel Architectures . . . . . . . . . . . 2

2 Statement of Objectives 5
2.1 Dynamic Multigrain Parallelism . . . . . . . . . . . . . . . . . . . . . . . . . 5

2.2 Rightsizing Multigrain Parallelism . . . . . . . . . . . . . . . . . . . . . . . . 8

2.3 MMGP Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9

3 Experimental Testbed 11
3.1 RAxML . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12

3.2 PBPI . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13

3.3 Hardware Platform . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14

4 Code Optimization Methodologies for Asymmetric Multi-core Systems with Explicitly Managed Memories 17
4.1 Porting and Optimizing RAxML on Cell . . . . . . . . . . . . . . . . . . . . . 18

4.2 Function Off-loading . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18

4.2.1 Optimizing Off-Loaded Functions . . . . . . . . . . . . . . . . . . . . 19

4.2.2 Vectorizing Conditional Statements . . . . . . . . . . . . . . . . . . . 20

4.2.3 Double Buffering and Memory Management . . . . . . . . . . . . . . 23

4.2.4 Vectorization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24

4.2.5 PPE-SPE Communication . . . . . . . . . . . . . . . . . . . . . . . . 27

4.2.6 Increasing the Coverage of Offloading . . . . . . . . . . . . . . . . . . 28

4.3 Parallel Execution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29

4.4 Chapter Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30

5 Scheduling Multigrain Parallelism on Asymmetric Systems 33
5.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33

5.2 Scheduling Multi-Grain Parallelism on Cell . . . . . . . . . . . . . . . . . . . 33

5.2.1 Event-Driven Task Scheduling . . . . . . . . . . . . . . . . . . . . . . 34

5.2.2 Scheduling Loop-Level Parallelism . . . . . . . . . . . . . . . . . . . 36

5.2.3 Implementing Loop-Level Parallelism . . . . . . . . . . . . . . . . . . 42

5.3 Dynamic Scheduling of Task- and Loop-Level Parallelism . . . . . . . . . . . 43

5.3.1 Application-Specific Hybrid Parallelization on Cell . . . . . . . . . . . 44

5.3.2 MGPS . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47

5.4 S-MGPS . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49

5.4.1 Motivating Example . . . . . . . . . . . . . . . . . . . . . . . . . . . 50

5.4.2 Sampling-Based Scheduler for Multi-grain Parallelism . . . . . . . . . 51

5.5 Chapter Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57

6 Model of Multi-Grain Parallelism 61
6.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 61

6.2 Modeling Abstractions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 62

6.2.1 Hardware Abstraction . . . . . . . . . . . . . . . . . . . . . . . . . . 62

6.2.2 Application Abstraction . . . . . . . . . . . . . . . . . . . . . . . . . 63

6.3 Model of Multi-grain Parallelism . . . . . . . . . . . . . . . . . . . . . . . . . 65

6.3.1 Modeling sequential execution . . . . . . . . . . . . . . . . . . . . . . 66

6.3.2 Modeling parallel execution on APUs . . . . . . . . . . . . . . . . . . 67

6.3.3 Modeling parallel execution on HPUs . . . . . . . . . . . . . . . . . . 69

6.3.4 Using MMGP . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 71

6.3.5 MMGP Extensions . . . . . . . . . . . . . . . . . . . . . . . . . . . . 72

6.4 Experimental Validation and Results . . . . . . . . . . . . . . . . . . . . . . . 72

6.4.1 MMGP Parameter approximation . . . . . . . . . . . . . . . . . . . . 73

6.4.2 Case Study I: Using MMGP to parallelize PBPI . . . . . . . . . . . . . 74

6.4.3 Case Study II: Using MMGP to Parallelize RAxML . . . . . . . . . . 77

6.4.4 MMGP Usability Study . . . . . . . . . . . . . . . . . . . . . . . . . 81

6.5 Chapter Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 83

7 Scheduling Asymmetric Parallelism on a PS3 Cluster 85
7.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 85

7.2 Experimental Platform . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 87

7.3 PS3 Cluster Scalability Study . . . . . . . . . . . . . . . . . . . . . . . . . . . 88

7.3.1 MPI Communication Performance . . . . . . . . . . . . . . . . . . . . 88

7.3.2 Application Benchmarks . . . . . . . . . . . . . . . . . . . . . . . . . 88


7.4 Modeling Hybrid Parallelism . . . . . . . . . . . . . . . . . . . . . . . . . . . 93

7.4.1 Modeling PPE Execution Time . . . . . . . . . . . . . . . . . . . . . . 94

7.4.2 Modeling the off-loaded Computation . . . . . . . . . . . . . . . . . . 96

7.4.3 DMA Modeling . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 97

7.4.4 Cluster Execution Modeling . . . . . . . . . . . . . . . . . . . . . . . 98

7.4.5 Verification . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 99

7.5 Co-Scheduling on Asymmetric Clusters . . . . . . . . . . . . . . . . . . . . . 99

7.6 PS3 versus IBM QS20 Blades . . . . . . . . . . . . . . . . . . . . . . . . . . 102

7.7 Chapter Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 104

8 Kernel-Level Scheduling 107
8.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 107

8.2 SLED Scheduler Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . 108

8.3 ready to run List . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 110

8.3.1 ready to run List Organization . . . . . . . . . . . . . . . . . . . . . . 110

8.3.2 Splitting ready to run List . . . . . . . . . . . . . . . . . . . . . . . . 111

8.4 SLED Scheduler - Kernel Level . . . . . . . . . . . . . . . . . . . . . . . . . 113

8.5 SLED Scheduler - User Level . . . . . . . . . . . . . . . . . . . . . . . . . . 116

8.6 Experimental Setup . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 117

8.6.1 Benchmarks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 118

8.6.2 Microbenchmarks . . . . . . . . . . . . . . . . . . . . . . . . . . . . 118

8.6.3 PBPI . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 122

8.6.4 RAxML . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 123

8.7 Chapter Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 125

9 Future Work 127
9.1 Integrating ready-to-run list in the Kernel . . . . . . . . . . . . . . . . . . . . 128

9.2 Load Balancing and Task Priorities . . . . . . . . . . . . . . . . . . . . . . . . 130

9.3 Increasing Processor Utilization . . . . . . . . . . . . . . . . . . . . . . . . . 131

9.4 Novel Applications and Programming Models . . . . . . . . . . . . . . . . . . 132

9.5 Conventional Architectures . . . . . . . . . . . . . . . . . . . . . . . . . . . . 132

9.6 MMGP extensions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 133

10 Overview of Related Research 135
10.1 Cell – Related Research . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 135

10.2 Process Scheduling - Related Research . . . . . . . . . . . . . . . . . . . . . . 138


10.3 Modeling – Related Research . . . . . . . . . . . . . . . . . . . . . . . . . . 141
10.3.1 PRAM Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 141
10.3.2 BSP model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 142
10.3.3 LogP model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 143
10.3.4 Models Describing Nested Parallelism . . . . . . . . . . . . . . . . . . 144

Bibliography 147


List of Figures

2.1 A hardware abstraction of an accelerator-based architecture. Host processing units (HPUs) supply coarse-grain parallel computation across accelerators. Accelerator processing units (APUs) are the main computation engines and may support internally finer grain parallelism. . . . . . 6

3.1 Organization of Cell. . . . . . 14

4.1 The likelihood vector structure is used in almost all memory traffic between main memory and the local storage of the SPEs. The structure is 128-bit aligned, as required by the Cell architecture. . . . . . 23

4.2 The body of the first loop in newview(): a) Non-vectorized code, b) Vectorized code. . . . . . 25

4.3 The second loop in newview(). Non-vectorized code shown on the left, vectorized code shown on the right. spu_madd() multiplies the first two arguments and adds the result to the third argument. spu_splats() creates a vector by replicating a scalar element. . . . . . 26

4.4 Performance of (a) RAxML and (b) PBPI with a different number of MPI processes. . . . . . 29

5.1 Scheduler behavior for two off-loaded tasks, representative of RAxML. Case (a) illustrates the behavior of the EDTLP scheduler. Case (b) illustrates the behavior of the Linux scheduler with the same workload. The numbers correspond to MPI processes. The shaded slots indicate context switching. The example assumes a Cell-like system with four SPEs. . . . . . 36

5.2 Parallelizing a loop across SPEs using a work-sharing model with an SPE designated as the master. . . . . . 39


5.3 The data structure Pass is used for communication among SPEs. The vi ad variables are used to pass input arguments for the loop body from one local storage to another. The variable sig is used as a notification signal that the memory transfer for the shared data updated during the loop is completed. The variable res is used to send results back to the master SPE, and as a dependence resolution mechanism. . . . . . 42

5.4 Parallelization of the loop from function evaluate() in RAxML. The left side depicts the code executed by the master SPE, while the right side depicts the code executed by a worker SPE. Num SPE represents the number of SPE worker threads. . . . . . 44

5.5 Comparison of task-level and hybrid parallelization schemes in RAxML, on the Cell BE. The input file is 42 SC. The number of ML trees created is (a) 1–16, (b) 1–128. . . . . . 45

5.6 MGPS, EDTLP and static EDTLP-LLP. Input file: 42 SC. Number of ML trees created: (a) 1–16, (b) 1–128. . . . . . 49

5.7 Execution time of RAxML with a variable number of SPE threads. The input dataset is 25 SC. . . . . . 51

5.8 Execution times of RAxML, with various static multi-grain scheduling strategies. The input dataset is 25 SC. . . . . . 51

5.9 The sampling phase of S-MGPS. Samples are taken from four execution intervals, during which the code performs identical operations. For each sample, each MPI process uses a variable number of SPEs to parallelize its enclosed loops. . . . . . 53

5.10 PBPI executed with different levels of TLP and LLP parallelism: deg(TLP)=1–4, deg(LLP)=1–16. . . . . . 56

6.1 A hardware abstraction of an accelerator-based architecture with two layers of parallelism. Host processing units (HPUs) supply coarse-grain parallel computation across accelerators. Accelerator processing units (APUs) are the main computation engines and may support internally finer grain parallelism. Both HPUs and APUs have local memories and communicate through shared memory or message passing. Additional layers of parallelism can be expressed hierarchically in a similar fashion. . . . . . 62


6.2 Our application abstraction of two parallel tasks. Two tasks are spawned by the main process. Each task exhibits phased, multi-level parallelism of varying granularity. In this dissertation, we address the problem of mapping tasks and subtasks to accelerator-based systems. . . . . . 64

6.3 The sub-phases of a sequential application are readily mapped to HPUs and APUs. In this example, sub-phases 1 and 3 execute on the HPU and sub-phase 2 executes on the APU. HPUs and APUs are assumed to communicate via shared memory. . . . . . 66

6.4 Parallel APU execution. The HPU (leftmost bar in parts a and b) offloads computations to one APU (part a) and two APUs (part b). The single point-to-point transfer of part a is modeled as overhead plus computation time on the APU. For multiple transfers, there is additional overhead (g), but also benefits due to parallelization. . . . . . 68

6.5 Parallel HPU execution. The HPU (center bar) offloads computations to 4 APUs (2 on the right and 2 on the left). The first thread on the HPU offloads computation to APU1 and APU2, then idles. The second HPU thread is switched in, offloads code to APU3 and APU4, and then idles. APU1 and APU2 complete and return data, followed by APU3 and APU4. . . . . . 69

6.6 MMGP predictions and actual execution times of PBPI, when the code uses one dimension of PPE (HPU) parallelism. . . . . . 75

6.7 MMGP predictions and actual execution times of PBPI, when the code uses one dimension of SPE (APU) parallelism, with a data-parallel implementation of the maximum likelihood calculation. . . . . . 76

6.8 MMGP predictions and actual execution times of PBPI, when the code uses two dimensions of SPE (APU) and PPE (HPU) parallelism. The mix of degrees of parallelism which optimizes performance is 4-way PPE parallelism combined with 4-way SPE parallelism. The chart illustrates the results when both SPE parallelism and PPE parallelism are scaled to two Cell processors. . . . . . 78

6.9 MMGP predictions and actual execution times of RAxML, when the code uses one dimension of PPE (HPU) parallelism: (a) with DS1, (b) with DS2. . . . . . 79

6.10 MMGP predictions and actual execution times of RAxML, when the code uses one dimension of SPE (APU) parallelism: (a) with DS1, (b) with DS2. . . . . . 80

6.11 MMGP predictions and actual execution times of RAxML, when the code uses two dimensions of SPE (APU) and PPE (HPU) parallelism. Performance is optimized by oversubscribing the PPE and maximizing task-level parallelism. . . . . . 82


6.12 Overhead of the sampling phase when the MMGP scheduler is used with the PBPI application. PBPI is executed multiple times with 107 input species. The sequence size of the input file is varied from 1,000 to 10,000. In the worst case, the overhead of the sampling phase is 2.2% (sequence size 7,000). . . . . . 83

7.1 MPI_Allreduce() performance on the PS3 cluster. Processes are distributed evenly between nodes. Each node runs up to 6 processes, using shared memory for communication within the node. . . . . . 89

7.2 MPI_Send/Recv() latency on the PS3 cluster. Processes are distributed evenly between nodes. Each node runs up to 6 processes, using shared memory for communication within the node. . . . . . 90

7.3 Measured and predicted performance of applications on the PS3 cluster. PBPI is executed with weak scaling. RAxML is executed with strong scaling. x-axis notation: Nnode - number of nodes, Nprocess - number of processes per node, NSPE - number of SPEs per process. . . . . . 92

7.4 Four cases illustrating the importance of co-scheduling PPE threads and SPE threads. Threads labeled "P" are PPE threads, while threads labeled "S" are SPE threads. We assume that P-threads and S-threads communicate through shared memory. P-threads poll shared memory locations directly to detect if a previously off-loaded S-thread has completed. Striped intervals indicate yielding of the PPE, dark intervals indicate computation leading to a thread off-load on an SPE, light intervals indicate computation yielding the PPE without off-loading on an SPE. Stars mark cases of mis-scheduling. . . . . . 95

7.5 SPE execution . . . . . . 96

7.6 Double buffering template for tiled parallel loops. . . . . . 97

7.7 Performance of the yield-if-not-ready policy and the native Linux scheduler in PBPI and RAxML. x-axis notation: Nnode - number of nodes, Nprocess - number of processes per node, NSPE - number of SPEs per process. . . . . . 101

7.8 Performance of different scheduling strategies in PBPI and RAxML. . . . . . 103

7.9 Comparison between the PS3 cluster and an IBM QS20 cluster. . . . . . 104

8.1 Upon completing the assigned tasks, the SPEs send a signal to the PPE processes through the ready-to-run list. The PPE process which decides to yield passes the data from the ready-to-run list to the kernel, which in turn can schedule the appropriate process on the PPE. . . . . . 108


8.2 Vertical overview of the SLED scheduler. The user-level part contains the ready-to-run list, shared among the processes, while the kernel part contains the system call through which the information from the ready-to-run list is passed to the kernel. . . . . . 109

8.3 Process P1, which is bound to CPU1, needs to be scheduled to run by the scheduler that was invoked on CPU2. Consequently, the kernel needs to perform migration of process P1 from CPU1 to CPU2. . . . . . 112

8.4 System call for migrating processes across the execution contexts. Function sched_migrate_task() performs the actual migration. The SLEDS_yield() function schedules the process to be the next to run on the CPU. . . . . . 113

8.5 The ready to run list is split in two parts. Each of the two sublists contains processes that are sharing the execution context (CPU1 or CPU2). This approach avoids any possibility of expensive process migration across the execution contexts. . . . . . 114

8.6 Execution flow of the SLEDS_yield() function: (a) The appropriate process is found in the running list (tree), (b) The process is pulled out from the list, and its priority is increased, (c) The process is returned to the list, and since its priority is increased it will be stored at the leftmost position. . . . . . 115

8.7 Outline of the SLEDS scheduler: Upon off-loading, a process is required to call the SLEDS_Offload() function. SLEDS_Offload() checks if the off-loaded task has finished (Line 14), and if not, calls the yield() function. yield() scans the ready to run list, and yields to the next process by executing the SLEDS_yield() system call. . . . . . 117

8.8 Execution times of RAxML when the ready to run list is scanned between 50 and 1000 times. The x-axis represents the number of scans of the ready to run list. The y-axis represents the execution time. Note that the lowest value for the y-axis is 12.5, and the difference between the lowest and the highest execution time is 4.2%. The input file contains 10 species, each represented by 1800 nucleotides. . . . . . 118

8.9 Comparison of the EDTLP and SLED schemes using microbenchmarks: Total execution time is measured as the length of the off-loaded tasks is increased. . . . . . 119

8.10 Comparison of the EDTLP and SLED schemes using microbenchmarks: Total execution time is measured as the length of the off-loaded tasks is increased – task size is limited to 2.1µs. . . . . . 120

8.11 EDTLP outperforms SLED for small task sizes due to the higher complexity of the SLED scheme. . . . . . 121


8.12 Comparison of the EDTLP scheme and the combination of SLED and EDTLP schemes using microbenchmarks. EDTLP is used for task sizes smaller than 15µs. . . . . . 121

8.13 Comparison of the EDTLP scheme and the combination of SLED and EDTLP schemes using microbenchmarks. EDTLP is used for task sizes smaller than 15µs – task size is limited to 2.1µs. . . . . . 122

8.14 Comparison of EDTLP and SLED schemes using the PBPI application. The application is executed multiple times with varying length of the input sequence (represented on the x-axis). . . . . . 123

8.15 Comparison of EDTLP and the combination of SLED and EDTLP schemes using the PBPI application. The application is executed multiple times with varying length of the input sequence (represented on the x-axis). . . . . . 124

8.16 Comparison of EDTLP and SLED schemes using the RAxML application. The application is executed multiple times with varying length of the input sequence (represented on the x-axis). . . . . . 124

8.17 Comparison of EDTLP and the combination of SLED and EDTLP schemes using the RAxML application. The application is executed multiple times with varying length of the input sequence (represented on the x-axis). . . . . . 125

9.1 Upon completing the assigned tasks, SPEs send signals to PPE processes through the ready-to-run list. The PPE process which decides to yield passes the data from the ready-to-run queue to the kernel, which in turn can schedule the appropriate process on the PPE. . . . . . 129


List of Tables

4.1 Execution time of RAxML (in seconds). The input file is 42 SC. (a) The whole application is executed on the PPE, (b) newview() is offloaded on one SPE. . . . . . 20

4.2 Execution time of RAxML after the floating-point conditional statement is transformed to an integer conditional statement and vectorized. The input file is 42 SC. . . . . . 22

4.3 Execution time of RAxML with double buffering applied to overlap DMA transfers with computation. The input file is 42 SC. . . . . . 24

4.4 Execution time of RAxML following vectorization. The input file is 42 SC. . . . . . 27

4.5 Execution time of RAxML following the optimization of communication to use direct memory-to-memory transfers. The input file is 42 SC. . . . . . 28

4.6 Execution time of RAxML after offloading and optimizing three functions: newview(), makenewz() and evaluate(). The input file is 42 SC. . . . . . 29

5.1 Performance comparison for (a) RAxML and (b) PBPI with two schedulers. The second column shows execution time with the EDTLP scheduler. The third column shows execution time with the native Linux kernel scheduler. The workload for RAxML contains 42 organisms. The workload for PBPI contains 107 organisms. . . . . . 37

5.2 Execution time of RAxML when loop-level parallelism (LLP) is exploited in one bootstrap, via work distribution between SPEs. The input file is 42 SC: (a) DNA sequences are represented with 10,000 nucleotides, (b) DNA sequences are represented with 20,000 nucleotides. . . . . . 40

5.3 Execution time of PBPI when loop-level parallelism (LLP) is exploited via work distribution between SPEs. The input file is 107 SC: (a) DNA sequences are represented with 1,000 nucleotides, (b) DNA sequences are represented with 10,000 nucleotides. . . . . . 41


5.4 Efficiency of different program configurations with two data sets in RAxML. The best configuration for the 42 SC input is deg(TLP)=8, deg(LLP)=1. The best configuration for 25 SC is deg(TLP)=4, deg(LLP)=2. deg() corresponds to the degree of a given dimension of parallelism (LLP or TLP). . . . . . 54

5.5 RAxML – Comparison between S-MGPS and static scheduling schemes, illustrating the convergence overhead of S-MGPS. . . . . . 55

5.6 PBPI – Comparison between S-MGPS and static scheduling schemes: (a) deg(TLP)=1, deg(LLP)=1–16; (b) deg(TLP)=2, deg(LLP)=1–8; (c) deg(TLP)=4, deg(LLP)=1–4; (d) deg(TLP)=8, deg(LLP)=1–2. . . . . . 58


Chapter 1

Problem Statement

In the quest for delivering higher performance to scientific applications, hardware designers began to move away from superscalar processor models and embraced architectures with multiple processing cores. Although all commodity microprocessor vendors are marketing multicore processors, these processors are largely based on replication of superscalar cores. Unfortunately, superscalar designs exhibit well-known performance and power limitations. These limitations, in conjunction with a sustained requirement for higher performance, stimulated interest in unconventional processor designs that combine parallelism with acceleration. These designs leverage multiple cores, some of which are customized accelerators for data-intensive computation. Examples of these heterogeneous, accelerator-based parallel architectures are the Cell BE [3], GPGPUs [4], the Rapport KiloCore [2], and EXOCHI [96].

As a case study and a representative of accelerator-based asymmetric architectures, in this dissertation we investigate the Cell Broadband Engine (CBE). Cell has recently drawn considerable attention from industry and academia. Since it was originally designed for the game-box market, Cell has a low cost and a modest power budget. Nevertheless, the processor is able to achieve unprecedented peak performance for some real-world applications. IBM recently announced the use of Cell chips in a new Petaflop system with 16,000 Cells, named RoadRunner, due for delivery in 2008.

The potential of the Cell BE has been demonstrated convincingly in a number of studies [33, 39, 69, 74, 91]. Thanks to eight high-frequency execution cores with pipelined SIMD capabilities and an aggressive data transfer architecture, Cell has a theoretical peak performance of over 200 Gflops for single-precision FP calculations and a peak memory bandwidth of over 25 Gigabytes/s. These performance figures position Cell ahead of the most powerful commodity microprocessors. Cell has already demonstrated impressive performance ratings in applications and computational kernels with highly vectorizable data parallelism, such as signal processing, compression, encryption, and dense and sparse numerical kernels [12, 13, 15, 39, 48, 49, 66, 75, 78, 79, 99].

1.1 Mapping Parallelism to Asymmetric Parallel Architectures

Arguably, one of the most difficult problems that programmers face while migrating to a new parallel architecture is the mapping of algorithms and data to the architecture. Accelerator-based multi-core processors complicate this problem in two ways. Firstly, by introducing heterogeneous execution cores, the user needs to be concerned with mapping each component of the application to the type of core that best matches the computational and memory bandwidth demand of the component. Secondly, by providing multiple cores with embedded SIMD or multi-threading capabilities, the user needs to be concerned with extracting multiple dimensions of parallelism from the application and mapping each dimension to parallel execution units, so as to maximize performance.

Cell provides a motivating and timely example for the problem of mapping algorithmic parallelism to modern multi-core architectures. The processor can exploit task and data parallelism, both across and within its cores. On accelerator-based multi-core architectures the programmer must be aware of core heterogeneity and carefully balance execution between the host and accelerator cores. Furthermore, the programmer faces a seemingly vast number of options for parallelizing code on these architectures. Functional and data decompositions of the program can be implemented on both the host and the accelerator cores. Functional decompositions can be achieved by dividing functions between the hosts and the accelerators and by off-loading functions from the hosts to accelerators at runtime. Data decompositions are also possible, by using SIMDization on the vector units of the accelerator cores, by loop-level parallelization across accelerators, or by a combination of loop-level parallelization across accelerators and SIMDization within accelerators.

In this thesis we explore different approaches used to automate the mapping of applications to asymmetric parallel architectures. We explore both runtime and static approaches for combining and managing functional and data decomposition. We combine and orchestrate multiple levels of parallelism inside an application in order to achieve both harmonious utilization of all host and accelerator cores and high utilization of the memory bandwidth available on asymmetric multi-core processors. Although we chose Cell as our case study, our scheduling algorithms and decisions are general and can be applied to any asymmetric parallel architecture.


Chapter 2

Statement of Objectives

2.1 Dynamic Multigrain Parallelism

While many studies have focused on performance evaluation and optimizations for heterogeneous multi-core architectures [23, 31, 54, 63, 65, 74, 98], the optimal mapping of parallel applications to these architectures has not been investigated. In this thesis we explore heterogeneous multi-core architectures from a different perspective, namely that of multigrain parallelization. Asymmetric parallel architectures have a specific design: they can exploit orthogonal dimensions of task and data parallelism on a single chip. The processor is controlled by one or more host processing elements, which usually schedule the computation off-loaded to accelerator processing units. The accelerators are usually SIMD processors and provide the bulk of the processor's computational power. A general design of heterogeneous, accelerator-based architectures is represented in Figure 2.1.

To simplify programming and improve efficiency on asymmetric parallel architectures, we present a set of dynamic scheduling policies and the associated mechanisms. We introduce an event-driven scheduler, EDTLP, which oversubscribes the host processing cores and exposes dynamic parallelism across accelerators. We also propose MGPS, a scheduling module which controls multi-grain parallelism on the fly to monotonically increase accelerator utilization.


Figure 2.1: A hardware abstraction of an accelerator-based architecture. Host processing units (HPUs) supply coarse-grain parallel computation across accelerators. Accelerator processing units (APUs) are the main computation engines and may support internally finer grain parallelism.

MGPS monitors the number of active accelerators used by off-loaded tasks over discrete intervals of execution and makes a prediction on the best combination of dimensions and granularity of parallelism to expose to the hardware. The purpose of these policies is to exploit the proper layers and degrees of parallelism from the application, in order to maximize efficiency of the processor's computational cores. We explore the design and implementation of our scheduling policies using two real-world scientific applications, RAxML [87] and PBPI [45]. RAxML and PBPI are bioinformatics applications used for generating phylogenetic trees, and we describe them in more detail in Chapter 3.

One of the most efficient execution models on asymmetric parallel architectures, which reduces the idle time on the host processors as well as on the accelerators, is to oversubscribe the host processor unit with multiple processes. In this approach, one or more accelerators are assigned to each process for off-loading the expensive computation. Although the off-loading approach enables high utilization of the architecture, it also increases contention and the number of context switches on the host processor unit, as well as the time necessary for a single context switch to complete. To reduce the contention caused by context switching, and the idle time that it causes on the accelerator cores, we designed and implemented a slack-minimizer scheduler (SLED). In our case study, the SLED scheduler improves performance on the Cell processor by up to 17%.
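The oversubscription model just described hinges on the host thread yielding, rather than busy-waiting, while its off-loaded task runs on an accelerator. The following is a minimal sketch of that idea in plain POSIX terms; it is illustrative only, the completion flag and the way the accelerator updates it are assumptions, and the actual EDTLP and SLED mechanisms are described in Chapters 5 and 8.

#include <sched.h>

/* Minimal sketch of the yield-on-off-load idea (not the actual EDTLP code).
 * 'done' is assumed to be a flag in shared memory that the accelerator (SPE)
 * sets when the off-loaded task completes. */
static void wait_for_offloaded_task(volatile int *done)
{
    while (!*done)
        sched_yield();   /* give the host core to another oversubscribed process */
}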

Our study of dynamic scheduling strategies makes the following contributions:

• We present a runtime system and scheduling policies that exploit polymorphic (task and loop-level) parallelism on asymmetric parallel processors. Our runtime system is adaptive, in the sense that it chooses the form and degree of parallelism to expose to the hardware in response to workload characteristics. Since the right choice of form(s) and degree(s) of parallelism depends non-trivially on workload characteristics and user input, our runtime system removes an important burden from the programmer.

• We show that dynamic multigrain parallelization is a necessary optimization for sustaining maximum performance on asymmetric parallel architectures, since no static parallelization scheme is able to achieve high accelerator efficiency in all cases.

• We present an event-driven multithreading execution engine, which achieves higher efficiency on accelerators by oversubscribing the host core.

• We present a feedback-guided scheduling policy for dynamically triggering and throttling loop-level parallelism across accelerators. We show that work-sharing of divisible tasks across accelerators should be used when the event-driven multithreading engine leaves more than half of the accelerators idle. We observe benefits from loop-level parallelization of off-loaded tasks across accelerators. However, we also observe that loop-level parallelism should be exposed only in conjunction with low-degree task-level parallelism.

• We present kernel-level extensions to our runtime system, which enable efficient process scheduling when the host core is oversubscribed with multiple processes.


2.2 Rightsizing Multigrain Parallelism

When executing a multi-level parallel application on asymmetric parallel processors, performance can be strongly affected by the execution configuration. In the case of RAxML execution on the Cell processor, depending on the runtime degree of each level of parallelism in the application, the performance variation can be as high as 40%. To address the issue of determining the optimal parallel configuration, we introduce a new runtime scheduler, S-MGPS, which performs sampling and timing of the dominant phases in the application in order to determine the most efficient mapping of different levels of parallelism to the architecture. There are several essential differences between S-MGPS and our previously introduced runtime scheduler, MGPS. MGPS is a utilization-driven scheduler, which seeks the highest possible accelerator utilization by exploiting additional layers of parallelism when some accelerator cores appear underutilized. MGPS attempts to increase utilization by creating more accelerator tasks from innermost layers of parallelism, more specifically, as many tasks as the number of idle accelerators recorded during intervals of execution. S-MGPS is a scheduler which seeks the optimal application-system configuration, in terms of layers of parallelism exposed to the hardware and degree of granularity per layer of parallelism, based on the runtime task throughput of the application and regardless of system utilization. S-MGPS takes into account the cumulative effects of contention and other system bottlenecks on software parallelism and can converge to the best multi-grain parallel execution algorithm. MGPS, on the other hand, uses only information on SPE utilization and may often converge to a suboptimal multi-grain parallel execution algorithm. A further contribution of S-MGPS is that the scheduler is immune to the initial configuration of parallelism in the application and uses a sampling method which is independent of application-specific parameters or input. By contrast, the performance of MGPS is sensitive to both the initial structure of parallelism in the application and the input.

Although the scientific codes we use in this thesis implement similar functionality, they differ in their structure and parallelization strategies and raise different challenges for user-level schedulers. We show that S-MGPS performs within 2% of the optimal scheduling algorithm in PBPI and within 2%–10% of the optimal scheduling algorithm in RAxML. We also show that S-MGPS adapts well to variation of the input size and granularity of parallelism, whereas the performance of MGPS is sensitive to both these factors.

2.3 MMGP Model

The technique used by the S-MGPS scheduler might not be scalable to large, complex systems, large applications, or applications with behavior that varies significantly with the input. The execution time of a complex application is a function of many parameters. A given parallel application may consist of N phases, where each phase is affected differently by accelerators. Each phase can exploit d dimensions of parallelism, or any combination thereof, such as ILP, TLP, or both. Each phase or dimension of parallelism can use any of m different programming and execution models, such as message passing, shared memory, SIMD, or any combination thereof. Accelerator availability or use may consist of c possible configurations, involving different numbers of accelerators. Exhaustive analysis of the execution time for all combinations requires at least N × d × m × c trials with any given input.
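For a sense of scale, consider a purely illustrative case (these numbers are not measurements from the applications studied here): an application with N = 4 phases, d = 2 dimensions of parallelism, m = 2 programming models, and c = 8 accelerator configurations already requires

N × d × m × c = 4 × 2 × 2 × 8 = 128

timed runs for every input of interest.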

Models of parallel computation have been instrumental in the adoption and use of parallel systems. Unfortunately, commonly used models [24, 35] are not directly portable to accelerator-based systems. First, the heterogeneous processing common to these systems is not reflected in most models of parallel computation. Second, current models do not capture the effects of multi-grain parallelism. Third, few models account for the effects of using multiple programming models in the same program. Parallel programming at multiple dimensions and with a synthesis of models consumes both enormous amounts of programming effort and significant amounts of execution time, if not handled with care. To overcome these deficits, we present a model for multi-dimensional parallel computation on asymmetric multi-core processors. Considering that each dimension of parallelism reflects a different degree of computation granularity, we name the model MMGP, for Model of Multi-Grain Parallelism.

MMGP is an analytical model which formalizes the process of programming accelerator-based systems and reduces the need for exhaustive measurements. This dissertation presents a generalized MMGP model for accelerator-based architectures with one layer of host processor parallelism and one layer of accelerator parallelism, followed by the specialization of this model for the Cell Broadband Engine.

The input to MMGP is an explicitly parallel program, with parallelism expressed with machine-independent abstractions, using common programming libraries and constructs. Upon identification of a few key parameters of the application, derived from micro-benchmarking and profiling of a sequential run, MMGP predicts with reasonable accuracy the execution time with all feasible mappings of the application to host processors and accelerators. MMGP is fast and reasonably accurate; therefore, it can be used to quickly identify optimal operating points, in terms of the exposed layers of parallelism and the degree of parallelism in each layer, on accelerator-based systems. Experiments with two complete applications from the field of computational phylogenetics, on a shared-memory multiprocessor with single and multiple nodes that contain the Cell BE, show that MMGP models the parallel execution time of complex parallel codes with multiple layers of task and data parallelism with mean error in the range of 1%–6%, across all feasible program configurations on the target system. Due to this narrow margin of error, MMGP accurately predicts the optimal mapping of programs to cores for the cases we have studied so far.


Chapter 3

Experimental Testbed

This chapter provides details on our experimental testbed, including the two applications that we used to study user-level schedulers on the Cell BE (RAxML and PBPI) and the hardware platform on which we conducted this research.

RAxML and PBPI are computational biology applications designed to determine phylogenetic trees. Phylogenetic trees are used to represent the evolutionary history of a set of n organisms. An alignment with the DNA or AA sequences representing those n organisms (also called taxa) can be used as input for the computation of phylogenetic trees. In a phylogeny, the organisms of the input data set are located at the tips (leaves) of the tree, whereas the inner nodes represent extinct common ancestors. The branches of the tree represent the time which was required for the mutation of one species into another, new one. The generation of phylogenies with computational methods has many important applications in medical and biological research (see [14] for a summary).

The fundamental algorithmic problem computational phylogeny faces is the immense number of alternative tree topologies, which grows exponentially with the number of organisms n; e.g., for n = 50 organisms there exist 2.84 × 10^76 alternative trees (the number of atoms in the universe is ≈ 10^80). In fact, it has only recently been shown that the phylogeny problem is NP-hard [34]. In addition, generating phylogenies is a very memory- and floating-point-intensive process, such that the application of high-performance computing techniques, as well as the assessment of new CPU architectures, can contribute significantly to the reconstruction of larger and more accurate trees. The computation of the phylogenetic tree containing representatives of all living beings on earth is still one of the grand challenges in Bioinformatics.

3.1 RAxML

RAxML-VI-HPC (v2.1.3) (Randomized Axelerated Maximum Likelihood version VI for High Performance Computing) [87] is a program for large-scale ML-based (Maximum Likelihood [43]) inference of phylogenetic (evolutionary) trees using multiple alignments of DNA or AA (Amino Acid) sequences. The program is freely available as open source code at icwww.epfl.ch/~stamatak.

The current version of RAxML incorporates a rapid hill-climbing search algorithm. A recent performance study [87] on real-world datasets with ≥ 1,000 sequences reveals that it is able to find better trees in less time and with lower memory consumption than other current ML programs (IQPNNI, PHYML, GARLI). Moreover, RAxML-VI-HPC has been parallelized with MPI (Message Passing Interface) to enable embarrassingly parallel non-parametric bootstrapping and multiple inferences on distinct starting trees in order to search for the best-known ML tree. Like every ML-based program, RAxML exhibits a source of fine-grained loop-level parallelism in the likelihood functions, which consume over 90% of the overall computation time. This source of parallelism scales well on large memory-intensive multi-gene alignments due to increased cache efficiency.

The MPI version of RAxML is the basis of our Cell version of the code [20]. In RAxML, multiple inferences on the original alignment are required in order to determine the best-known (best-scoring) ML tree (we use the term best-known because the problem is NP-hard). Furthermore, bootstrap analyses are required to assign confidence values ranging between 0.0 and 1.0 to the internal branches of the best-known ML tree. This allows determining how well-supported certain parts of the tree are, and is important for the biological conclusions drawn from it. All those individual tree searches, be they bootstrap replicates or multiple inferences, are completely independent of each other and can thus be exploited by a simple master-worker MPI scheme. Each search can further exploit data parallelism via thread-level parallelization of loops and/or SIMDization.
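The master-worker organization described above can be pictured with a minimal MPI sketch. This is an illustration of the general scheme only, not RAxML's actual source: the integer task index stands for one independent tree search (a bootstrap replicate or a distinct starting tree), and the returned double stands for its likelihood score.

#include <mpi.h>

/* Illustrative master-worker skeleton: rank 0 hands out independent tree
 * searches and collects one likelihood score per search. Not RAxML code. */
int main(int argc, char **argv)
{
    int rank, size, ntasks = 100;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    if (rank == 0) {                                  /* master */
        int sent = 0, done = 0, stop = -1;
        double score;
        MPI_Status st;
        /* seed every worker with one task (or a stop signal) */
        for (int w = 1; w < size; w++) {
            if (sent < ntasks) { MPI_Send(&sent, 1, MPI_INT, w, 0, MPI_COMM_WORLD); sent++; }
            else                 MPI_Send(&stop, 1, MPI_INT, w, 0, MPI_COMM_WORLD);
        }
        /* collect results; refill workers until all tasks are issued */
        while (done < sent) {
            MPI_Recv(&score, 1, MPI_DOUBLE, MPI_ANY_SOURCE, 0, MPI_COMM_WORLD, &st);
            done++;
            if (sent < ntasks) { MPI_Send(&sent, 1, MPI_INT, st.MPI_SOURCE, 0, MPI_COMM_WORLD); sent++; }
            else                 MPI_Send(&stop, 1, MPI_INT, st.MPI_SOURCE, 0, MPI_COMM_WORLD);
        }
    } else {                                          /* worker */
        int task;
        double score;
        while (1) {
            MPI_Recv(&task, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            if (task < 0) break;
            score = 0.0;   /* placeholder: run one ML tree search here */
            MPI_Send(&score, 1, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD);
        }
    }
    MPI_Finalize();
    return 0;
}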

3.2 PBPI

PBPI is based on Bayesian phylogenetic inference, which constructs phylogenetic trees from DNA or AA sequences using the Markov Chain Monte Carlo (MCMC) sampling method. The program is freely available as open source code at www.pbpi.org. The MCMC method is inherently sequential, since the state of each time step depends on previous time steps. Therefore, the PBPI application uses the algorithmic improvements described below to achieve highly efficient parallel inference of phylogenetic trees. PBPI exploits multi-grain parallelism to achieve scalability on large-scale distributed memory systems, such as the IBM BlueGene/L [45]. The algorithm of PBPI can be summarized as follows:

1. Partition the Markov chains into chain groups, and split the data set into segments along the sequences.

2. Organize the virtual processors that execute the code into a two-dimensional grid; map each chain group to a row on the grid and map each segment to a column on the grid.

3. During each generation, compute the partial likelihood across all columns and use all-to-all communication to collect the complete likelihood values to all virtual processors on the same row.

4. When there are multiple chains, randomly choose two chains for swapping using point-to-point communication.
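A minimal sketch of how such a two-dimensional virtual-processor grid can be set up with MPI communicators follows. It is illustrative only, not PBPI's implementation: the grid dimensions are made-up, the placeholder likelihood values carry no real data, and an allreduce stands in for the all-to-all collection of step 3. Splitting MPI_COMM_WORLD by row groups the ranks that cooperate on one chain group; splitting by column groups the ranks that hold the same alignment segment.

#include <mpi.h>

/* Illustrative 2-D decomposition (not PBPI's code): chain groups map to grid
 * rows, sequence segments map to grid columns. Assumes 8 ranks arranged as
 * 2 rows x 4 columns. */
int main(int argc, char **argv)
{
    int rank, ncols = 4;                  /* 4 segments per chain group */
    MPI_Comm row_comm, col_comm;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    int row = rank / ncols;               /* which chain group  */
    int col = rank % ncols;               /* which data segment */

    /* ranks in the same row exchange partial likelihoods each generation */
    MPI_Comm_split(MPI_COMM_WORLD, row, col, &row_comm);
    /* ranks in the same column hold the same segment of the alignment */
    MPI_Comm_split(MPI_COMM_WORLD, col, row, &col_comm);

    double partial = 0.0, complete = 0.0; /* placeholder likelihood values */
    /* step 3: combine partial likelihoods across the columns of one row */
    MPI_Allreduce(&partial, &complete, 1, MPI_DOUBLE, MPI_SUM, row_comm);

    MPI_Comm_free(&row_comm);
    MPI_Comm_free(&col_comm);
    MPI_Finalize();
    return 0;
}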

Figure 3.1: Organization of Cell.

From a computational perspective, PBPI differs substantially from RAxML. While RAxML is embarrassingly parallel, PBPI uses a predetermined virtual processor topology and a corresponding data decomposition method. While the degree of task parallelism in RAxML may vary considerably at runtime, PBPI exposes, from the beginning of execution, a high degree of two-dimensional data parallelism to the runtime system. On the other hand, while the degree of task parallelism can be controlled dynamically in RAxML without performance penalty, in PBPI changing the degree of outermost data parallelism requires data redistribution and incurs a high performance penalty.

3.3 Hardware Platform

The Cell BE is a heterogeneous multi-core processor which integrates a simultaneous multi-threading PowerPC core (the Power Processing Element, or PPE) and eight specialized accelerator cores (the Synergistic Processing Elements, or SPEs) [40]. These elements are connected in a ring topology on an on-chip network called the Element Interconnect Bus (EIB). The organization of Cell is illustrated in Figure 3.1.

The PPE is a 64-bit SMT processor running the PowerPC ISA, with vector/SIMD multimedia extensions [71]. The PPE has two levels of on-chip cache. The L1-I and L1-D caches of the PPE have a capacity of 32 KB each. The L2 cache of the PPE has a capacity of 512 KB.


Each SPE is a 128-bit vector processor with two major components: a Synergistic Processor Unit (SPU) and a Memory Flow Controller (MFC). All instructions are executed on the SPU. The SPU includes 128 registers, each 128 bits wide, and 256 KB of software-controlled local storage. The SPU can fetch instructions and data only from its local storage and can write data only to its local storage. The SPU implements a Cell-specific set of SIMD intrinsics. All single-precision floating-point operations on the SPU are fully pipelined, and the SPU can issue one single-precision floating-point operation per cycle. Double-precision floating-point operations are partially pipelined, and two double-precision floating-point operations can be issued every six cycles. Double-precision FP performance is therefore significantly lower than single-precision FP performance. With all eight SPUs active and fully pipelined double-precision FP operation, the Cell BE is capable of a peak performance of 21.03 Gflops. In single-precision FP operation, the Cell BE is capable of a peak performance of 230.4 Gflops [33].
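As a concrete illustration of the SPU SIMD intrinsics mentioned above, the short sketch below computes y[i] = a*x[i] + y[i] on 4-wide single-precision vectors. It is illustrative only and not taken from RAxML or PBPI; spu_splats() replicates a scalar across a vector and spu_madd() performs a per-lane multiply-add.

#include <spu_intrinsics.h>

/* Minimal SPU SIMD sketch: y[i] = a * x[i] + y[i] on vectors of 4 floats.
 * Assumes n is a multiple of 4 and the arrays are 16-byte aligned. */
void saxpy_spu(float a, vector float *x, vector float *y, int n)
{
    vector float va = spu_splats(a);         /* {a, a, a, a} */
    for (int i = 0; i < n / 4; i++)
        y[i] = spu_madd(va, x[i], y[i]);     /* va * x[i] + y[i], per lane */
}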

The SPE can access RAM through direct memory access (DMA) requests. DMA transfers are handled by the MFC. All programs running on an SPE use the MFC to move data and instructions between local storage and main memory. Data transferred between local storage and main memory must be 128-bit aligned. The size of each DMA transfer can be at most 16 KB. DMA lists can be used for transferring more than 16 KB of data. A list can have up to 2,048 DMA requests, each for up to 16 KB. The MFC supports only DMA transfer sizes that are 1, 2, 4, 8, or multiples of 16 bytes long.
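The following is a minimal sketch of a single SPE-side DMA read using the MFC calls from the Cell SDK's spu_mfcio.h (illustrative only; the buffer name and tile size are assumptions). The local buffer respects the alignment requirement above, and the transfer size is a multiple of 16 bytes.

#include <spu_mfcio.h>

#define CHUNK 4096   /* assumed tile size, a multiple of 16 bytes and <= 16 KB */

/* Local-store buffer, aligned for efficient MFC transfers. */
static volatile float buf[CHUNK / sizeof(float)] __attribute__((aligned(128)));

/* Illustrative blocking DMA read: fetch CHUNK bytes from effective address
 * 'ea' in main memory into local storage, then wait for completion. */
void fetch_chunk(unsigned long long ea)
{
    unsigned int tag = 0;                 /* DMA tag group 0 */
    mfc_get(buf, ea, CHUNK, tag, 0, 0);   /* enqueue the transfer */
    mfc_write_tag_mask(1 << tag);         /* select tag group 0 */
    mfc_read_tag_status_all();            /* block until the transfer completes */
}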

The EIB is an on-chip coherent bus that handles communication between the PPE, the SPEs, main memory, and I/O devices. Physically, the EIB is a 4-ring structure which can transmit 96 bytes per cycle, for a maximum theoretical memory bandwidth of 204.8 Gigabytes/second. The EIB can support more than 100 outstanding DMA requests.

In this work we are using a Cell blade (IBM BladeCenter QS20) with two Cell BEs running at 3.2 GHz and 1 GB of XDR RAM (512 MB per processor). The PPEs run Linux Fedora Core 6. We use IBM SDK 2.1 and LAM/MPI 7.1.3.


Chapter 4

Code Optimization Methodologies for Asymmetric Multi-core Systems with Explicitly Managed Memories

Accelerator-based architectures with explicitly managed memories have the advantage of achiev-

ing a high degree of communication-computation overlap. While this is a highly desirable property in high-performance computing, explicit memory management is also a significant drawback from the programmability per-

spective. Managing all memory accesses from the application level significantly increases the

complexity of the written code. In our work, we investigate execution models that reduce the complexity of code written for asymmetric architectures, while still achieving desirable performance and high utilization of the available architectural resources. We investigate a set of

optimizations that have the most significant impact on the performance of scientific applications

executed on the asymmetric architectures. In our case study, we investigate the optimization

process which enables efficient execution of RAxML and PBPI on the Cell architecture.

The results presented in this chapter indicate that RAxML and PBPI are highly optimized for

Cell, and also motivate the discussion presented in the rest of the thesis. Cell-specific optimiza-

tion applied to the two bioinformatics applications resulted in a more than two-fold speedup. At the same time, we show that even when extensively optimized for sequential execution,

parallel applications demand sophisticated scheduling support for efficient parallel execution

on heterogeneous multi-core platforms.


4.1 Porting and Optimizing RAxML on Cell

We ported RAxML to Cell in four steps:

1. We ported the MPI code to the PPE;

2. We offloaded the most time-consuming parts of each MPI process to the SPEs;

3. We optimized the SPE code using vectorization of floating point computation, vectoriza-

tion of control statements coupled with a specialized casting transformation, overlapping

of computation and communication (double buffering) and other communication opti-

mizations;

4. Lastly, we implemented multi-level parallelization schemes across and within SPEs in

selected cases, as well as a scheduler for effective simultaneous exploitation of task, loop,

and SIMD parallelism.

We outline optimizations 1-3 in the rest of the chapter. We focus on multi-level paralleliza-

tion, as well as different scheduling policies in Chapter 5.

4.2 Function Off-loading

We profiled the application using gprof to identify the computationally intensive functions

that could be candidates for offloading and optimization on SPEs. We used an IBM Power5

processor for profiling RAxML. For the profiling and benchmarking runs of RAxML presented

in this chapter, we used the input file 42 SC, which contains 42 organisms, each represented by

a DNA sequence of 1167 nucleotides. The number of distinct data patterns in this DNA alignment

is on the order of 250.

On the IBM Power5, 98.77% of the total execution time is spent in three functions:

• 77.24% in newview() - which computes the partial likelihood vector [44] at an inner node

of the phylogenetic tree,


• 19.16% in makenewz() - which optimizes the length of a given branch with respect to the

tree likelihood using the Newton–Raphson method,

• 2.37% in evaluate() - which calculates the log likelihood score of the tree at a given branch

by summing over the partial likelihood vector entries.

These functions are the best candidates for offloading on SPEs.

The prerequisite for computing evaluate() and makenewz() is that the likelihood vectors

at the nodes of the phylogenetic tree that are right and left of the current branch have been

computed. Thus, makenewz() and evaluate() initially make calls to newview(), before they can

execute their own computation. The newview() function at an inner node p of a tree, calls itself

recursively when the two children r and q are not tips (leaves) and the likelihood array for r and

q has not already been computed. Consequently, the first candidate for offloading is newview().

Although makenewz() and evaluate() both take a smaller portion of the execution time

than newview(), offloading these two functions results in significant speedup (see Section 4.2.6).

Besides the fact that each function can be executed faster on an SPE, having all three functions

offloaded to an SPE reduces significantly the amount of PPE-SPE communication.

In order to have a function executed on an SPE, we spawn an SPE thread at the beginning

of each MPI process. The thread executes the offloaded function upon receiving a signal from

the PPE and returns the result back to the PPE upon completion. To avoid excessive overhead

from repeated thread spawning and joining, threads remain bound on SPEs and busy-wait for

the PPE signal, before starting to execute a function.
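The resulting SPE-side control flow can be sketched as follows. The control-block layout and the helper functions are hypothetical and stand in for the actual RAxML SPE module; they only illustrate the busy-wait dispatch loop described above.

    /* Hypothetical control block shared with the PPE; layout and field names are
       illustrative, not the structures used in RAxML.                              */
    volatile struct {
        int cmd;                     /* 0 = idle, 1/2/3 = offloaded function id, -1 = exit */
        int done;                    /* set when the offloaded function has completed      */
        unsigned long long args_ea;  /* effective address of the argument block            */
    } ctrl __attribute__((aligned(128)));

    /* Hypothetical helpers: DMA the control block/arguments in, run the selected
       function, and DMA the result and completion flag back to main memory.        */
    void refresh_ctrl(unsigned long long ctrl_ea);
    void fetch_args(unsigned long long args_ea);
    void run_offloaded(int cmd);
    void write_back_result(unsigned long long ctrl_ea);

    int main(unsigned long long speid, unsigned long long argp, unsigned long long envp)
    {
        for (;;) {
            refresh_ctrl(argp);            /* poll: DMA the control block into local storage */
            if (ctrl.cmd == 0)  continue;  /* busy-wait: the PPE has not posted work yet     */
            if (ctrl.cmd == -1) break;     /* the PPE asked this thread to terminate         */

            fetch_args(ctrl.args_ea);      /* bring in the function arguments                */
            run_offloaded(ctrl.cmd);       /* e.g. newview(), makenewz() or evaluate()       */

            ctrl.done = 1;
            write_back_result(argp);       /* return the result and signal completion        */
            ctrl.cmd = 0;
        }
        return 0;
    }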

4.2.1 Optimizing Off-Loaded Functions

The discussion in this section refers to function newview(), which is the most computationally

expensive function in the code. Table 4.1 summarizes the execution times of RAxML before and after

newview() is offloaded. The first column shows the number of workers (MPI processes) used

in the experiment and the amount of work (bootstraps) performed. The maximum number


(a)
1 worker,  1 bootstrap        24.4s
2 workers, 8 bootstraps      134.1s
2 workers, 16 bootstraps     267.7s
2 workers, 32 bootstraps     539s

(b)
1 worker,  1 bootstrap        45s
2 workers, 8 bootstraps      201.9s
2 workers, 16 bootstraps     401.7s
2 workers, 32 bootstraps     805s

Table 4.1: Execution time of RAxML (in seconds). The input file is 42 SC. (a) The whole application is executed on the PPE, (b) newview() is offloaded to one SPE.

of workers we use is 2, since more workers would conflict on the PPE, which is a 2-way SMT processor. Executing a small number of workers results in low SPE utilization (each worker uses

1 SPE). In Section 4.3, we present results when the PPE is oversubscribed with up to 8 worker

processes.

As shown in Table 4.1, merely offloading newview() causes performance degradation. We

profiled the new version of the code in order to get a better understanding of the major bot-

tlenecks. Inside newview(), we identified 3 parts where the function spends almost its entire

lifetime: the first part includes a large if(. . .) statement with a conjunction of four arithmetic

comparisons used to check if small likelihood vector entries need to be scaled to avoid numerical

underflow (similar checks are used in every ML implementation); the second time-consuming

part involves DMA transfers; the third includes the loops that perform the actual likelihood

vector calculation. In the next few sections we describe the techniques used to optimize the

aforementioned parts in newview(). The same techniques were applied to the other offloaded

functions.

4.2.2 Vectorizing Conditional Statements

RAxML always invokes newview() at an inner node of the tree (p) which is at the root of a sub-

tree. The main computational kernel in newview() has a switch statement which selects one out

of four paths of execution. If one or both descendants (r and q) of p are tips (leaves), the com-

putations of the main loop in newview() can be simplified. This optimization leads to significant


performance improvements [87]. To activate the optimization, we use four implementations of

the main computational part of newview() for the cases where r and q are tips, r is a tip, q is a tip,

or r and q are both inner nodes.

Each of the four execution paths in newview() leads to a distinct—highly optimized—

version of the loop which performs the actual likelihood vector calculations. Each iteration

of this loop executes the previously mentioned if() statement (Section 4.2.1), to check for like-

lihood scaling. Mis-predicted branches in the compiled code for this statement incur a penalty

of approximately 20 cycles [92]. We profiled newview() and found that 45% of the execution

time is spent in this particular conditional statement. Furthermore, almost all the time is spent

in checking the condition, while negligible time is spent in the body of code in the fall-through

part of the conditional statement. The problematic conditional statement is shown below. The

symbol ml is a constant and all operands are double precision floating point numbers.

if (ABS(x3->a) < ml && ABS(x3->g) < ml &&
    ABS(x3->c) < ml && ABS(x3->t) < ml) {
    . . .
}

This statement is a challenge for a branch predictor, since it implies 8 conditions, one for

each of the four ABS() macros and the four comparisons against the minimum likelihood value

constant (ml).

On an SPE, comparing integers can be significantly faster than comparing doubles, since

integer values can be compared using the SPE intrinsics. Although the current SPE intrinsics

support only comparison of 32-bit integer values, the comparison of 64-bit integers is also pos-

sible by combining different intrinsics that operate on the 32-bit integers. The current spu-gcc

compiler automatically optimizes an integer branch using the SPE intrinsics. To optimize the

problematic branches, we made the observation that integer comparison is faster than floating


1 worker,  1 bootstrap        32.5s
2 workers, 8 bootstraps      151.7s
2 workers, 16 bootstraps     302.7s
2 workers, 32 bootstraps     604s

Table 4.2: Execution time of RAxML after the floating-point conditional statement is transformed to an integer conditional statement and vectorized. The input file is 42 SC.

point comparison on an SPE. According to the IEEE standard, numbers represented in float and

double formats are “lexicographically ordered” [61], i.e., if two floating point numbers in the

same format are ordered, then they are ordered the same way when their bits are reinterpreted as

Sign-Magnitude integers [61]. In other words, instead of comparing two floating point numbers

we can interpret their bit pattern as integers, and do an integer comparison. The final outcome

of comparing the integer interpretation of two doubles (floats) will be the same as comparing

their floating point values, as long as one of the numbers is positive. In our case, all operands

are positive, consequently instead of floating point comparison we can perform an integer com-

parison.

To get the absolute value of a floating point number, we used the spu_and() logic intrinsic, which performs a vector bit-wise AND operation. With spu_and() we clear the left-most (sign) bit of the floating point number. If the number is already positive, nothing changes, since the sign bit is already zero. In this way, we avoid using ABS(), which uses a

conditional statement to check if the operand is greater than or less than 0. After getting absolute

values of all the operands involved in the problematic if() statement, we reinterpret the bit pattern of each operand as an unsigned long long and perform the comparison. The optimized conditional statement

is presented in Figure 4.2.2. Following optimization of the offending conditional statement,

its contribution to execution time in newview() comes down to 6%, as opposed to 45% before

optimization. The total execution time (Table 4.2) improves by 25%–27%.


unsigned long long a[4];

a[0] = *(unsigned long long*)&x3->a & 0x7fffffffffffffffULL;
a[1] = *(unsigned long long*)&x3->c & 0x7fffffffffffffffULL;
a[2] = *(unsigned long long*)&x3->g & 0x7fffffffffffffffULL;
a[3] = *(unsigned long long*)&x3->t & 0x7fffffffffffffffULL;

if (a[0] < minli && a[1] < minli &&
    a[2] < minli && a[3] < minli) {
    . . .
}

4.2.3 Double Buffering and Memory Management

Depending on the size of the input alignment, the major calculation loop (the loop that performs

the calculation of the likelihood vector) in newview() can execute up to 50,000 iterations. The

number of iterations is directly related to the alignment length. The loop operates on large

arrays, and each member in the arrays is an instance of a likelihood vector structure, shown in

Figure 4.1. The arrays are allocated dynamically at runtime. Since there is no limit on the

typedef struct likelihood_vector {
    double a, c, g, t;
    int exp;
} likelivector __attribute__((aligned(128)));

Figure 4.1: The likelihood vector structure is used in almost all memory traffic between main memory and the local storage of the SPEs. The structure is aligned to a 128-byte boundary, which satisfies the DMA alignment requirements of the Cell architecture.

size of these arrays, we are unable to keep all the members of the arrays in the local storage of


1 worker,  1 bootstrap        31.1s
2 workers, 8 bootstraps      145.4s
2 workers, 16 bootstraps     290s
2 workers, 32 bootstraps     582.6s

Table 4.3: Execution time of RAxML with double buffering applied to overlap DMA transfers with computation. The input file is 42 SC.

SPEs. Instead, we strip-mine the arrays, fetching a few array elements into local storage at a time and executing the corresponding loop iterations on each batch of elements. We use a

2 KByte buffer for caching likelihood vectors, which is enough to store the data needed for 16

loop iterations. It should be noted that the space used for buffers is much smaller than the size

of the local storage.

In the original code where SPEs wait for all DMA transfers, the idle time accounts for 11.4%

of execution time of newview(). We eliminated the waiting time by using double buffering to

overlap DMA transfers with computation. The total execution time of the application after

applying double buffering and tuning the data transfer size (set to 2 KBytes) is shown in Table

4.3.
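The strip-mining with double buffering described above can be sketched as follows, assuming the spu_mfcio.h interface. The 2 KB buffer size matches the figure quoted in the text, while the byte-granularity bookkeeping and the process() placeholder are illustrative.

    #include <spu_mfcio.h>

    #define BUF_SIZE 2048                            /* 2 KB per buffer, as tuned above */

    static char buf[2][BUF_SIZE] __attribute__((aligned(128)));

    void process(char *chunk, unsigned int nbytes);  /* placeholder for the loop body   */

    /* Walk over `total` bytes stored at effective address `ea`, overlapping the DMA
       for chunk i+1 with the computation on chunk i.  `total` is assumed to be a
       multiple of BUF_SIZE to keep the sketch short.                                */
    void strip_mine(unsigned long long ea, unsigned int total)
    {
        unsigned int i, n = total / BUF_SIZE, cur = 0;

        mfc_get(buf[cur], ea, BUF_SIZE, cur, 0, 0);        /* prefetch the first chunk  */

        for (i = 0; i < n; i++) {
            unsigned int nxt = cur ^ 1;
            if (i + 1 < n)                                 /* start fetching the next   */
                mfc_get(buf[nxt], ea + (i + 1) * BUF_SIZE, BUF_SIZE, nxt, 0, 0);

            mfc_write_tag_mask(1 << cur);                  /* wait only for the current */
            mfc_read_tag_status_all();                     /* buffer to arrive          */
            process(buf[cur], BUF_SIZE);                   /* compute on it             */

            cur = nxt;                                     /* swap buffers              */
        }
    }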

4.2.4 Vectorization

All calculations in newview() are enclosed in two loops. The first loop has a small trip count

(typically 4–25 iterations) and computes the individual transition probability matrices (see Sec-

tion 4.2.1) for each distinct rate category of the CAT or Γ models of rate heterogeneity [86].

Each iteration executes 36 double precision floating point operations. The second loop com-

putes the likelihood vector. Typically, the second loop has a large trip count, which depends on

the number of distinct data patterns in the data alignment. For the 42 SC input file, the second

loop has 228 iterations and executes 44 double precision floating point operations per iteration.

Each SPE on the Cell is capable of exploiting data parallelism via vectorization. The SPE vector

registers can store two double precision floating point elements. We vectorized the two loops in


newview() using these registers.

The kernel of the first loop in newview() is shown in Figure 4.2a. In Figure 4.2b we

(a) Non-vectorized:

for ( ... ) {
    ki = *rptr++;

    d1c = exp(ki * lz10);
    d1g = exp(ki * lz11);
    d1t = exp(ki * lz12);

    *left++ = d1c * *EV++;
    *left++ = d1g * *EV++;
    *left++ = d1t * *EV++;
    *left++ = d1c * *EV++;
    *left++ = d1g * *EV++;
    *left++ = d1t * *EV++;
    . . .
}

(b) Vectorized:

1:  vector double *left_v = (vector double*)left;
2:  vector double lz1011 = (vector double)(lz10, lz11);
    . . .
    for ( ... ) {
3:      ki_v = spu_splats(*rptr++);
4:      d1cg = _exp_v( spu_mul(ki_v, lz1011) );
        d1tc = _exp_v( spu_mul(ki_v, lz1210) );
        d1gt = _exp_v( spu_mul(ki_v, lz1112) );

        left_v[0] = spu_mul(d1cg, EV_v[0]);
        left_v[1] = spu_mul(d1tc, EV_v[1]);
        left_v[2] = spu_mul(d1gt, EV_v[2]);
        . . .
    }

Figure 4.2: The body of the first loop in newview(): a) Non-vectorized code, b) Vectorized code.

show the same code vectorized for the SPE. For better understanding of the vectorized code we

briefly describe the SPE vector instructions we used:

• Instruction labeled 1 creates a vector pointer to an array consisting of double elements.

• Instruction labeled 2 joins two double elements, lz10 and lz11, into a single vector

element.

• Instruction labeled 3 creates a vector from a single double element.

• Instruction labeled 4 is a composition of 2 different vector instructions:


Non-vectorized:

for ( . . . ) {
    ump_x1_0  = x1->a;
    ump_x1_0 += x1->c * *left++;
    ump_x1_0 += x1->g * *left++;
    ump_x1_0 += x1->t * *left++;

    ump_x1_1  = x1->a;
    ump_x1_1 += x1->c * *left++;
    ump_x1_1 += x1->g * *left++;
    ump_x1_1 += x1->t * *left++;
    . . .
}

Vectorized:

for ( . . . ) {
    a_v = spu_splats(x1->a);
    c_v = spu_splats(x1->c);
    g_v = spu_splats(x1->g);
    t_v = spu_splats(x1->t);

    l1 = (vector double)(left[0], left[3]);
    l2 = (vector double)(left[1], left[4]);
    l3 = (vector double)(left[2], left[5]);

    ump_v1[0] = spu_madd(c_v, l1, a_v);
    ump_v1[0] = spu_madd(g_v, l2, ump_v1[0]);
    ump_v1[0] = spu_madd(t_v, l3, ump_v1[0]);
    . . .
}

Figure 4.3: The second loop in newview(). Non-vectorized code shown first, vectorized code shown second. spu_madd() multiplies the first two arguments and adds the result to the third argument. spu_splats() creates a vector by replicating a scalar element.

1. spu_mul() multiplies two vectors (in this case the arguments are vectors of doubles).

2. _exp_v() is the vector version of the exponential function.

After vectorization, the number of floating point instructions executed in the body of the first

loop is 24. Also, there is one additional instruction for creating a vector from a scalar element.

Note that due to involved pointer arithmetic on dynamically allocated data structures, automatic

vectorization of this code would be particularly challenging for a compiler.

Figure 4.3 illustrates the second loop (showing a few selected instructions which dominate

execution time in the loop). The variables x1->a, x1->c, x1->g, and x1->t belong to the same

C structure (likelihood vector) and occupy contiguous memory locations. Only three of these

variables are multiplied by the elements of the array left[ ]. This makes vectorization more dif-

ficult, since the code requires vector construction instructions such as spu_splats(). Obviously,

there are many different possibilities for vectorizing this code. The scheme shown in Figure 4.3


1 worker,  1 bootstrap        27.8s
2 workers, 8 bootstraps      132.3s
2 workers, 16 bootstraps     265.2s
2 workers, 32 bootstraps     527s

Table 4.4: Execution time of RAxML following vectorization. The input file is 42 SC.

is the one that achieved the best performance in our tests. After vectorization, the number of floating point instructions in

the body of the loops drops from 36 to 24 for the first loop, and from 44 to 22 for the second

loop. Vectorization adds 25 instructions for creating vectors.

Without vectorization, newview() spends 69.4% of its execution time in the two loops. Fol-

lowing vectorization, the time spent in loops drops to 57% of the execution time of newview().

Table 4.4 shows execution times following vectorization.

4.2.5 PPE-SPE Communication

Although newview() accounts for most of the execution time, its granularity is fine and its con-

tribution to execution time is attributed to the large number of invocations. For the 42 SC input,

newview() is invoked 230,500 times and the average execution time per invocation is 71µs. In

order to invoke an offloaded function, the PPE needs to send a signal to an SPE. Also, after an

offloaded function completes, it sends the result back to the PPE.

In an early implementation of RAxML, we used mailboxes to implement the communica-

tion between the PPE and SPEs. We observed that PPE-SPE communication can be significantly

improved if it is performed through main memory and SPE local storage instead of mailboxes.

Using memory-to-memory communication improves execution time by 5%–6.4%. Table 4.5

shows RAxML execution times, including all optimizations discussed so far and direct memory

to memory communication, for the 42 SC input. It is interesting to note that direct memory-


1 worker,  1 bootstrap        26.4s
2 workers, 8 bootstraps      123.3s
2 workers, 16 bootstraps     246.8s
2 workers, 32 bootstraps     493.3s

Table 4.5: Execution time of RAxML following the optimization of communication to use direct memory-to-memory transfers. The input file is 42 SC.

to-memory communication is an optimization which scales with parallelism on Cell, i.e. its

performance impact grows as the code uses more SPEs. As the number of workers and boot-

straps executed on the SPEs increases, the code becomes more communication-intensive, due

to the fine granularity of the offloaded functions.
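A possible shape of such a memory-to-memory hand-off on the PPE side is sketched below. The control-block layout, field names, and the spin-wait are illustrative assumptions; the actual RAxML implementation (and the memory-ordering details it requires) may differ.

    /* One control block per SPE thread, placed in main memory.  The SPE polls it
       with DMA gets and writes the result back with a DMA put, so no mailbox
       traffic is involved.  Layout and names are illustrative.                    */
    struct offload_ctrl {
        volatile unsigned long long args_ea;   /* where the function arguments live */
        volatile double             result;    /* filled in by the SPE              */
        volatile int                start;     /* set by the PPE to post work       */
        volatile int                done;      /* set by the SPE on completion      */
    } __attribute__((aligned(128)));

    double offload(struct offload_ctrl *cb, unsigned long long args_ea)
    {
        cb->args_ea = args_ea;
        cb->done    = 0;
        cb->start   = 1;          /* the SPE notices this on its next DMA poll   */

        while (!cb->done)         /* in the real scheduler the PPE would switch  */
            ;                     /* to another MPI process instead of spinning  */

        return cb->result;
    }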

4.2.6 Increasing the Coverage of Offloading

In addition to newview(), we offloaded makenewz() and evaluate(). All three offloaded functions

were packaged in a single code module loaded on the SPEs. The advantage of using a single

module is that it can be loaded to the local storage once when an SPE thread is created and

remain pinned in local storage for the rest of the execution. Therefore, the cost of loading the

code on SPEs is amortized and communication between the PPE and SPEs is reduced. For

example, when newview() is called by makenewz() or evaluate(), there is no need for any PPE-

SPE communication, since all functions already reside in SPE local storage.

Offloading all three critical functions improves performance by a further 25%–31%. A

more important implication is that after offloading and optimization of all three functions, the

RAxML code split between the PPE and one SPE actually becomes faster than the sequential code executed exclusively on the PPE, by as much as 19%. Function offloading is another optimization which scales with parallelism. When more than one MPI process is used and more than one bootstrap is offloaded to SPEs by each process, the gains from offloading rise

to 36%. Table 4.6 illustrates execution times after full function offloading.


1 worker,  1 bootstrap        19.8s
2 workers, 8 bootstraps       86.8s
2 workers, 16 bootstraps     173s
2 workers, 32 bootstraps     344.4s

Table 4.6: Execution time of RAxML after offloading and optimizing three functions: newview(), makenewz() and evaluate(). The input file is 42 SC.


Figure 4.4: Performance of (a) RAxML and (b) PBPI with different numbers of MPI processes.

4.3 Parallel Execution

After improving the performance of RAxML and PBPI using the presented optimization tech-

niques, we investigated parallel execution of both applications on the Cell processor. To achieve

higher utilization of the Cell chip, we oversubscribed the PPE with different numbers of MPI processes (2–8) and assigned a single SPE to each MPI process. The execution times of the different parallel configurations are presented in Figure 4.4. In the presented experiments we use strong scaling, i.e. the total amount of computation remains fixed as the number of processes grows.

In Figure 4.4(a) we observe that for any number of processes larger than two, the execution

time of RAxML remains constant. Two factors are responsible for this behavior:

1. On-chip contention, as well as bus and memory contention which occurs on the PPE side

when the PPE is oversubscribed by multiple processes,


2. The Linux kernel is oblivious to the off-loading, which results in poor scheduling decisions. Each process following the off-loading execution model constantly alternates execution between the PPE and an SPE. Unaware of this alternation, the OS allows processes to keep control over resources they are not actually using. In other words, the PPE might be assigned to a process whose execution has currently switched to an SPE.

In the case of PBPI (Figure 4.4(b)), we observe performance similar to that of RAxML. From the presented experiments it is clear that naive parallelization of the applications, where the PPE is simply oversubscribed with multiple processes, does not provide satisfactory performance. The poor scaling of the applications is a strong motivation for a detailed exploration of different parallel programming models as well as scheduling policies for asymmetric processors. We continue the discussion of parallel execution on heterogeneous architectures in Chapter 5.

4.4 Chapter Summary

In this chapter we presented a set of optimizations which enable efficient sequential execution of

scientific applications on asymmetric platforms. We exploited the fact that our test applications contain large computational functions (loops) which consume the majority of the execution time. Nevertheless, this assumption does not reduce the generality of the presented techniques, since large, time-consuming computational loops are common in most scientific codes.

We explored a total of five optimizations and the performance implications of these opti-

mizations: I) Offloading the bulk of the maximum likelihood tree calculation on the accelera-

tors; II) Casting and vectorization of expensive conditional statements involving multiple, hard

to predict conditions; III) Double buffering for overlapping memory communication with com-

putation; IV) Vectorization of the core of the floating point computation; V) Optimization of

communication between the host core and accelerators using direct memory-to-memory trans-

fers.


In our case study, starting from versions of RAxML and PBPI optimized for conventional

uniprocessors and multiprocessors, we were able to boost performance on the Cell processor by

more than a factor of two.


Chapter 5

Scheduling Multigrain Parallelism on Asymmetric Systems

5.1 Introduction

In this chapter, we investigate runtime scheduling policies for mapping different layers of par-

allelism, exposed by an application, to the Cell processor. We assume that applications describe

all available algorithmic parallelism to the runtime system explicitly, while the runtime system

dynamically selects the degree of granularity and the dimensions of parallelism to expose to the

hardware at runtime, using dynamic scheduling mechanisms and policies. In other words, the

runtime system is responsible for partitioning algorithmic parallelism in layers that best match

the diverse capabilities of the processor cores, while at the same time rightsizing the granularity

of parallelism in each layer.

5.2 Scheduling Multi-Grain Parallelism on Cell

We now explore the possibilities for exploiting multi-grain parallelism on Cell. The Cell PPE can execute two threads or processes simultaneously, from which parts of the code can be off-loaded and executed on SPEs. To increase the sources of parallelism for SPEs, the user may

consider two approaches:

• The user may oversubscribe the PPE with more processes or threads, than the number of


processes/threads that the PPE can execute simultaneously. In other words, the program-

mer attempts to find more parallelism to off-load to accelerators, by attempting a more

fine-grain task decomposition of the code. In this case, the runtime system needs to sched-

ule the host processes/threads so as to minimize the idle time on the host core while the

computation is off-loaded to accelerators. We present an event-driven task-level scheduler

(EDTLP) which achieves this goal in Section 5.2.1.

• The user can introduce a new dimension of parallelism to the application by distributing

loops from within the off-loaded functions across multiple SPEs. In other words, the

user can exploit data parallelism both within and across accelerators. Each SPE can work

on a part of a distributed loop, which can be further accelerated with SIMDization. We

present case studies that motivate the dynamic extraction of multi-grain parallelism via

loop distribution in Section 5.2.2.

5.2.1 Event-Driven Task Scheduling

EDTLP is a runtime scheduling module which can be embedded transparently in MPI codes.

The EDTLP scheduler operates under the assumption that the code to off-load to accelerators

is specified by the user at the level of functions. In the case of Cell, this means that the user

has either constructed SPE threads in a separate code module, or annotated the host PPE code

with directives to extract SPE threads via a compiler [17]. The EDTLP scheduler avoids un-

derutilization of SPEs by oversubscribing the PPE and preventing a single MPI process from

monopolizing the PPE.

Informally, the EDTLP scheduler off-loads tasks from MPI processes. A task ready for off-

loading serves as an event trigger for the scheduler. Upon the event occurrence, the scheduler

immediately attempts to serve the MPI process that carries the task to off-load and sends the

task to an available SPE, if any. While off-loading a task, the scheduler suspends the MPI

process that spawned the task and switches to another MPI process, anticipating that more tasks


will be available for off-loading from ready-to-run MPI processes. Switching upon off-loading

prevents MPI processes from blocking the PPE while waiting for their tasks to return. The

scheduler attempts to sustain a high supply of tasks for off-loading to SPEs by serving MPI

processes round-robin.
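The off-loading path of such an event-driven scheduler can be sketched as follows. This is an illustrative sketch only: the helper functions are hypothetical, and the use of sched_yield() to surrender the PPE stands in for whatever mechanism EDTLP actually uses to switch between MPI processes.

    #include <sched.h>

    /* Hypothetical helpers: pick an idle SPE, write a task descriptor into its
       control block, poll for completion, and read back the result.              */
    int    grab_idle_spe(void);
    void   post_to_spe(int spe, const void *task_descriptor);
    int    task_done(int spe);
    double task_result(int spe);

    /* Called by an MPI process when it reaches a task marked for off-loading.    */
    double offload_task(const void *task_descriptor)
    {
        int spe;

        while ((spe = grab_idle_spe()) < 0)
            sched_yield();                  /* no SPE free: give the PPE away            */

        post_to_spe(spe, task_descriptor);  /* the ready task is the event trigger       */

        while (!task_done(spe))             /* instead of blocking, yield the PPE so     */
            sched_yield();                  /* another MPI process can off-load its task */

        return task_result(spe);
    }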

The downside of a scheduler based on oversubscribing a processor is context-switching

overhead. Cell in particular also suffers from the problem of interference between processes

or threads sharing the SMT PPE core. The granularity of the off-loaded code determines if the

overhead introduced by oversubscribing the PPE can be tolerated. The code off-loaded to an

SPE should be coarse enough to marginalize the overhead of context switching performed on

the PPE. The EDTLP scheduler addresses this issue by performing granularity control of the

off-loaded tasks and preventing off-loading of code that does not meet a minimum granularity

threshold.

Figure 5.1 illustrates an example of the difference between scheduling MPI processes with

the EDTLP scheduler and the native Linux scheduler. In this example, each MPI process has

one task to off-load to SPEs. For illustrative purposes only, we assume that there are only 4

SPEs on the chip. In Figure 5.1(a), once a task is sent to an SPE, the scheduler forces a context

switch on the PPE. Since the PPE is a two-way SMT, two MPI processes can simultaneously

off-load tasks to two SPEs. The EDTLP scheduler enables the use of four SPEs via function off-

loading. On the contrary, if the scheduler waits for the completion of a task before providing

an opportunity to another MPI process to off-load (Figure 5.1 (b)), the application can only

utilize two SPEs. Realistic application tasks often have significantly shorter lengths than the

time quanta used by the Linux scheduler. For example, in RAxML, task lengths are on the order of tens of microseconds, while Linux time quanta are on the order of tens of milliseconds.

Table 5.1(a) compares the performance of the EDTLP scheduler to that of the native Linux

scheduler, using RAxML and running a workload comprising 42 organisms. In this experiment,

the number of performed bootstraps is not constant and it is equal to the number of MPI pro-

cesses. The EDTLP scheduler outperforms the Linux scheduler by up to a factor of 2.7. In the


(a) (b)

Figure 5.1: Scheduler behavior for two off-loaded tasks, representative of RAxML. Case (a)illustrates the behavior of the EDTLP scheduler. Case (b) illustrates the behavior of the Linuxscheduler with the same workload. The numbers correspond to MPI processes. The shadedslots indicate context switching. The example assumes a Cell-like system with four SPEs.

experiment with PBPI, Table 5.1(b), we execute the code with one Markov chain for 20,000

generations and we change the number of MPI processes used across runs. PBPI is also exe-

cuted with weak scaling, i.e. we increase the size of the DNA alignment with the number of

processes. The workload for PBPI includes 107 organisms. EDTLP outperforms the Linux

scheduler policy in PBPI by up to a factor of 2.7.

5.2.2 Scheduling Loop-Level Parallelism

The EDTLP model described in Section 5.2 is effective if the PPE has enough coarse-grained

functions to off-load to SPEs. In cases where the degree of available task parallelism is less

than the number of SPEs, the runtime system can activate a second layer of parallelism, by

splitting an already off-loaded task across multiple SPEs. We implemented runtime support

for parallelization of for-loops enclosed within off-loaded SPE functions. We parallelize loops

in off-loaded functions using work-sharing constructs similar to those found in OpenMP. In

RAxML, all for-loops in the three off-loaded functions have no loop-carried dependencies, and

obtain speedup from parallelization, assuming that there are enough idle SPEs dedicated to their

execution. The number of SPEs activated for work-sharing is user- or system-controlled, as in


(a) RAxML                      EDTLP      Linux
1 worker,  1 bootstrap         19.7s      19.7s
2 workers, 2 bootstraps        22.2s      30s
3 workers, 3 bootstraps        26s        40.7s
4 workers, 4 bootstraps        28.1s      43.3s
5 workers, 5 bootstraps        33s        60.7s
6 workers, 6 bootstraps        34s        61.8s
7 workers, 7 bootstraps        38.8s      81.2s
8 workers, 8 bootstraps        39.8s      81.7s

(b) PBPI                       EDTLP      Linux
1 worker,  20,000 gen.         27.77s     27.54s
2 workers, 20,000 gen.         30.2s      30s
3 workers, 20,000 gen.         31.92s     56.16s
4 workers, 20,000 gen.         36.4s      63.7s
5 workers, 20,000 gen.         40.12s     93.71s
6 workers, 20,000 gen.         41.48s     93s
7 workers, 20,000 gen.         53.93s     144.81s
8 workers, 20,000 gen.         52.64s     135.92s

Table 5.1: Performance comparison for (a) RAxML and (b) PBPI with two schedulers. The second column shows execution time with the EDTLP scheduler. The third column shows execution time with the native Linux kernel scheduler. The workload for RAxML contains 42 organisms. The workload for PBPI contains 107 organisms.

OpenMP. We discuss dynamic system-level control of loop parallelism further in Section 5.3.

The parallelization scheme is outlined in Figure 5.2. The program is executed on the PPE

until the execution reaches the parallel loop to be off-loaded. At that point the PPE sends a

signal to a single SPE which is designated as the master. The signal is processed by the master

and further broadcasted to all workers involved in parallelization. Upon a signal reception,

each SPE worker fetches the data necessary for loop execution. We ensure that SPEs work

on different parts of the loop and do not overlap by assigning a unique identifier to each SPE

thread involved in parallelization of the loop. Global data, changed by any of the SPEs during


loop execution, is committed to main memory at the end of each iteration. After processing

the assigned parts of the loop, the SPE workers send a notification back to the master. If the

loop includes a reduction, the master also collects partial results from the SPEs and accumulates

them locally. All communication between SPEs is performed on chip in order to avoid the long

latency of communicating through shared memory.

Note that in our loop parallelization scheme on Cell, all work performed by the master SPE

can also be performed by the PPE. In this case, the PPE would broadcast a signal to all SPE

threads involved in loop parallelization and the partial results calculated by SPEs would be

accumulated back at the PPE. Such collective operations increase the frequency of SPE-PPE

communication, especially when the distributed loop is a nested loop. In the case of RAxML,

in order to reduce SPE-PPE communication and avoid unnecessary invocation of the MPI pro-

cess that spawned the parallelized loop, we opted to use an SPE to distribute loops to other

SPEs and collect the results from other SPEs. In PBPI, we let the PPE execute the master

thread during loop parallelization, since loops are coarse enough to overshadow the loop exe-

cution overhead. Optimizing and selecting between these loop execution schemes is a subject

of ongoing research.

SPE threads participating in loop parallelization are created once upon off-loading the code

for the first parallel loop to SPEs. The threads remain active and pinned to the same SPEs during

the entire program execution, unless the scheduler decides to change the parallelization strategy

and redistribute the SPEs between one or more concurrently executing parallel loops. Pinned

SPE threads can run multiple off-loaded loop bodies, as long as the code of these loop bodies

fits on the local storage of the SPEs. If the loop parallelization strategy is changed on the fly by

the runtime system, a new code module with loop bodies that implement the new parallelization

strategy is loaded on the local storage of the SPEs.

Table 5.2 illustrates the performance of the basic loop-level parallelization scheme of our

runtime system in RAxML. Table 5.2(a) illustrates the execution time of RAxML using one

MPI process and performing one bootstrap, on a data set which comprises 42 organisms. This


[Figure 5.2 (diagram): with x total iterations, the master SPE sends start signals to Worker1 through Worker7; the iterations are split into eight x/8-iteration chunks executed by the master and the seven workers, and each worker sends a stop signal back to the master when its chunk completes.]

Figure 5.2: Parallelizing a loop across SPEs using a work-sharing model with an SPE designatedas the master.

experiment isolates the impact of our loop-level parallelization mechanisms on Cell. The num-

ber of iterations in parallelized loops depends on the size of the input alignment in RAxML. For

the given data set, each parallel loop executes 228 iterations.

The results shown in Table 5.2(a) suggest that when using loop-level parallelism RAxML

sees a reasonable yet limited performance improvement. The highest speedup (1.72) is achieved

with 7 SPEs. The reasons for the modest speedup are the non-optimal coverage of loop-level parallelism (less than 90% of the original sequential code is covered by parallelized loops), the fine granularity of the loops, and the fact that most loops have reductions, which create bottlenecks on the Cell DMA engine. The performance degradation that occurs when 5 or 6 SPEs are used is caused by memory alignment constraints that have to be met on the SPEs. Because of these constraints, it is sometimes not possible to evenly distribute the data used in the loop body, and therefore the workload of iterations, between SPEs. More specifically, the use of character arrays for the main data set in RAxML


(a)
1 worker, 1 boot., no LLP                  19.7s
1 worker, 1 boot., 2 SPEs used for LLP     14s
1 worker, 1 boot., 3 SPEs used for LLP     13.36s
1 worker, 1 boot., 4 SPEs used for LLP     12.8s
1 worker, 1 boot., 5 SPEs used for LLP     13.8s
1 worker, 1 boot., 6 SPEs used for LLP     12.47s
1 worker, 1 boot., 7 SPEs used for LLP     11.4s
1 worker, 1 boot., 8 SPEs used for LLP     11.44s

(b)
1 worker, 1 boot., no LLP                  47.9s
1 worker, 1 boot., 2 SPEs used for LLP     29.5s
1 worker, 1 boot., 3 SPEs used for LLP     23.3s
1 worker, 1 boot., 4 SPEs used for LLP     20.5s
1 worker, 1 boot., 5 SPEs used for LLP     18.7s
1 worker, 1 boot., 6 SPEs used for LLP     18.1s
1 worker, 1 boot., 7 SPEs used for LLP     17.1s
1 worker, 1 boot., 8 SPEs used for LLP     16.8s

Table 5.2: Execution time of RAxML when loop-level parallelism (LLP) is exploited in one bootstrap, via work distribution between SPEs. The input file is 42 SC: (a) DNA sequences are represented with 10,000 nucleotides, (b) DNA sequences are represented with 20,000 nucleotides.

forces array transfers in multiples of 16 array elements. Consequently, loop distribution across

processors is done with a minimum chunk size of 16 iterations.
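The effect of this constraint can be made concrete with a small illustration (the helper and the rounding policy are ours, not the runtime system's): dividing the 228 iterations of the 42 SC loops among 5 SPEs in multiples of 16 iterations necessarily produces uneven chunks.

    /* Assign `total` loop iterations to `nspe` SPEs when chunks must be multiples
       of 16 iterations.  With total = 228 and nspe = 5, SPEs 0-3 receive 48
       iterations each and SPE 4 receives the remaining 36, so the work cannot be
       distributed evenly.                                                          */
    unsigned int chunk_for(unsigned int spe, unsigned int nspe, unsigned int total)
    {
        unsigned int per   = ((total / nspe + 15) / 16) * 16;   /* round up to 16 */
        unsigned int start = spe * per;

        if (start >= total)
            return 0;                                           /* nothing left   */
        return (start + per <= total) ? per : total - start;
    }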

Loop-level parallelization in RAxML can achieve higher speedup in a single bootstrap with

larger input data sets. Alignments that have a larger number of nucleotides per organism have

more loop iterations to distribute across SPEs. To illustrate the behavior of loop-level paral-

lelization with coarser loops, we repeated the previous experiment using a data set where the

DNA sequences are represented with 20,000 nucleotides. The results are shown in Table 5.2(b).

The performance of the loop-level parallelization scheme always increases with the number of

SPEs in this experiment.


PBPI exhibits clearly better scalability with LLP than RAxML, since the granularity of loops is coarser in PBPI than in RAxML. Table 5.3 illustrates the execution times when PBPI is executed

with a variable number of SPEs used for LLP. Again, we control the granularity of the off-loaded

code by using different data sets: Table 5.3(a) shows execution times for a data set that contains

107 organisms, each represented by a DNA sequence of 3,000 nucleotides. Table 5.3(b) shows

execution times for a data set that contains 107 organisms, each represented by a DNA sequence

of 10,000 nucleotides. We run PBPI with one Markov chain for 20,000 generations. For the

two data sets, PBPI achieves a maximum speedup of 4.6 and 6.1 respectively, after loop-level

parallelization.

(a)
1 worker, 1,000 gen., no LLP                 27.2s
1 worker, 1,000 gen., 2 SPEs used for LLP    14.9s
1 worker, 1,000 gen., 3 SPEs used for LLP    11.3s
1 worker, 1,000 gen., 4 SPEs used for LLP     8.4s
1 worker, 1,000 gen., 5 SPEs used for LLP     7.3s
1 worker, 1,000 gen., 6 SPEs used for LLP     6.8s
1 worker, 1,000 gen., 7 SPEs used for LLP     6.2s
1 worker, 1,000 gen., 8 SPEs used for LLP     5.9s

(b)
1 worker, 20,000 gen., no LLP               262s
1 worker, 20,000 gen., 2 SPEs used          131.3s
1 worker, 20,000 gen., 3 SPEs used           92.3s
1 worker, 20,000 gen., 4 SPEs used           70.1s
1 worker, 20,000 gen., 5 SPEs used           58.1s
1 worker, 20,000 gen., 6 SPEs used           49s
1 worker, 20,000 gen., 7 SPEs used           43s
1 worker, 20,000 gen., 8 SPEs used           39.7s

Table 5.3: Execution time of PBPI when loop-level parallelism (LLP) is exploited via work distribution between SPEs. The input file is 107 SC: (a) DNA sequences are represented with 1,000 nucleotides, (b) DNA sequences are represented with 10,000 nucleotides.


struct Pass {
    volatile unsigned int v1_ad;
    volatile unsigned int v2_ad;
    // ... arguments for loop body
    volatile unsigned int vn_ad;
    volatile double res;
    volatile int sig[2];
} __attribute__((aligned(128)));

Figure 5.3: The data structure Pass is used for communication among SPEs. The vi_ad variables are used to pass input arguments for the loop body from one local storage to another. The variable sig is used as a notification signal that the memory transfer for the shared data updated during the loop is completed. The variable res is used to send results back to the master SPE, and as a dependence resolution mechanism.

5.2.3 Implementing Loop-Level Parallelism

The SPE threads participating in loop work-sharing constructs are created once upon function

off-loading. Communication among SPEs participating in work-sharing constructs is imple-

mented using DMA transfers and the communication structure Pass, depicted in Figure 5.3.

The Pass structure is private to each thread. The master SPE thread allocates an array of

Pass structures. Each member of this array is used for communication with an SPE worker

thread. Once the SPE threads are created, they exchange the local addresses of their Pass

structures. This address exchange is performed through the PPE. Whenever one thread needs

to send a signal to a thread on another SPE, it issues an mfc_put() request and sets the

destination address to be the address of the Pass structure of the recipient.

In Figure 5.4, we illustrate a RAxML loop parallelized with work-sharing among SPE

threads. Before executing the loop, the master thread sets the parameters of the Pass struc-

ture for each worker SPE and issues one mfc_put() request per worker. This is done in send_to_spe(). Worker i uses the parameters of the received Pass structure and fetches the data needed for the loop execution to its local storage (function fetch_data()). After


finishing the execution of its portion of the loop, a worker sets the res parameter in the local

copy of the structure Pass and sends it to the master, using send_to_master(). The master accumulates the results from all workers and commits the sum to main memory.

Immediately after calling send_to_spe(), the master participates in the execution of the

loop. The master tends to have a slight head start over the workers. The workers need to

complete several DMA requests before they can start executing the loop, in order to fetch the

required data from the master’s local storage or shared memory. In fine-grained off-loaded

functions such as those encountered in RAxML, load imbalance between the master and the

workers is noticeable. To achieve better load balancing, we set the master to execute a slightly

larger portion of the loop. A fully automated and adaptive implementation of this purposeful

load unbalancing is obtained by timing idle periods in the SPEs across multiple invocations of

the same loop. The collected times are used for tuning iteration distribution in each invocation,

in order to reduce idle time on SPEs.
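One possible form of this timing-driven adjustment is sketched below. Moving work in steps of 16 iterations (the minimum chunk size) and the simple most-idle/least-idle heuristic are illustrative assumptions, not the exact mechanism used by the runtime system.

    /* Per-participant bookkeeping kept across invocations of the same loop:
       idle[i] is the idle time measured for participant i in the previous
       invocation, chunk[i] is the number of iterations it will receive next time. */
    void rebalance(unsigned int n, double idle[], unsigned int chunk[])
    {
        unsigned int i, most = 0, least = 0;

        for (i = 1; i < n; i++) {            /* find who waited longest and shortest */
            if (idle[i] > idle[most])  most  = i;
            if (idle[i] < idle[least]) least = i;
        }

        /* The participant that waited longest finished early, so shift one
           minimum-size chunk of iterations to it from the one that waited least.  */
        if (most != least && chunk[least] >= 16) {
            chunk[least] -= 16;
            chunk[most]  += 16;
        }
    }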

5.3 Dynamic Scheduling of Task- and Loop-Level Parallelism

Merging task-level and loop-level parallelism on Cell can improve the utilization of acceler-

ators. A non-trivial problem with such a hybrid parallelization scheme is the assignment of

accelerators to tasks. The optimal assignment is largely application-specific, task-specific and

input-specific. We support this argument using RAxML as an example. The discussion in this

section is limited to RAxML, where the degree of outermost parallelism can be changed ar-

bitrarily by varying the number of MPI processes executing bootstraps, with a small impact

on performance. PBPI uses a data decomposition approach which depends on the number of

processors, therefore dynamically varying the number of MPI processes executing the code at runtime cannot be accomplished without data redistribution.


Master SPE:

struct Pass pass[Num_SPE];

for (i = 0; i < Num_SPE; i++) {
    pass[i].sig[0] = 1;
    ...
    send_to_spe(i, &pass[i]);
}

/* Parallelized loop */
for ( ... ) {
    . . .
}
tr->likeli = sum;

for (i = 0; i < Num_SPE; i++) {
    while (pass[i].sig[1] == 0);
    pass[i].sig[1] = 0;
    tr->likeli += pass[i].res;
}

commit(tr->likeli);

Worker SPE:

struct Pass pass;

while (pass.sig[0] == 0);
fetch_data();

/* Parallelized loop */
for ( ... ) {
    . . .
}
tr->likeli = sum;
pass.res = sum;
pass.sig[1] = 1;
send_to_master(&pass);

Figure 5.4: Parallelization of the loop from function evaluate() in RAxML. The first listing depicts the code executed by the master SPE, while the second depicts the code executed by a worker SPE. Num_SPE represents the number of SPE worker threads.

5.3.1 Application-Specific Hybrid Parallelization on Cell

We present a set of experiments with RAxML performing a number of bootstraps ranging be-

tween 1 and 128. In these experiments we use three versions of RAxML. Two of the three ver-

sions use hybrid parallelization models combining task- and loop-level parallelism. The third

version exploits only task-level parallelism and uses the EDTLP scheduler. More specifically,

in the first version, each off-loaded task is parallelized across 2 SPEs, and 4 MPI processes

are multiplexed on the PPE, executing 4 concurrent bootstraps. In the second version, each

off-loaded task is parallelized across 4 SPEs and 2 MPI processes are multiplexed on the PPE,


[Figure 5.5 plots execution time (in seconds) against the number of bootstraps for three configurations: EDTLP, EDTLP+LLP with 2 SPEs per parallel loop, and EDTLP+LLP with 4 SPEs per parallel loop; panel (a) covers 1-16 bootstraps and panel (b) covers 1-128 bootstraps.]

Figure 5.5: Comparison of task-level and hybrid parallelization schemes in RAxML, on the Cell BE. The input file is 42 SC. The number of ML trees created is (a) 1–16, (b) 1–128.

executing 2 concurrent bootstraps. In the third version, the code concurrently executes 8 MPI

processes, the off-loaded tasks are not parallelized and the tasks are scheduled with the EDTLP

scheduler. Figure 5.5 illustrates the results of the experiments, with a data set representing 42

organisms. The x-axis shows the number of bootstraps, while the y-axis shows execution time

in seconds.

As expected, the hybrid model outperforms EDTLP when up to 4 bootstraps are executed,

since only a combination of EDTLP and LLP can off-load code to more than 4 SPEs simul-


taneously. With 5 to 8 bootstraps, the hybrid models execute bootstraps in batches of 2 and 4

respectively, while the EDTLP model executes all bootstraps in parallel. EDTLP activates 5

to 8 SPEs solely for task-level parallelism, leaving room for loop-level parallelism on at most

3 SPEs. This proves to be unnecessary, since the parallel execution time is determined by the

length of the non-parallelized off-loaded tasks that remain on at least one SPE. In the range

between 9 and 12 bootstraps, combining EDTLP and LLP selectively, so that the first 8 boot-

straps execute with EDTLP and the remaining bootstraps execute with the hybrid scheme, is the best

option. For the input data set with 42 organisms, performance of EDTLP and hybrid EDTLP-

LLP schemes is almost identical when the number of bootstraps is between 13 and 16. When

the number of bootstraps is higher than 16, EDTLP clearly outperforms any hybrid scheme

(Figure 5.5(b)).

The reader may notice that the problem of hybrid parallelization is trivialized when the

problem size is scaled beyond a certain point, which is 28 bootstraps in the case of RAxML

(see Section 5.3.2). A production run of RAxML for real-world phylogenetic analysis would

require up to 1,000 bootstraps, thus rendering hybrid parallelization seemingly unnecessary.

However, if a production RAxML run with 1,000 bootstraps were to be executed across multiple

Cell BEs, and assuming equal division of bootstraps between the processors, the cut-off point

for EDTLP outperforming the hybrid EDTLP-LLP scheme would be set at 36 Cell processors.

Beyond this scale, performance per processor would be maximized only if LLP were employed

in conjunction with EDTLP on each Cell. Although this observation is empirical and somewhat

simplifying, it is further supported by the argument that scaling across multiple processors will

in all likelihood increase communication overhead and therefore favor a parallelization scheme with fewer MPI processes. The hybrid scheme reduces the number of MPI processes compared to the pure EDTLP scheme, when the granularity of work per Cell becomes fine.


5.3.2 MGPS

The purpose of MGPS is to dynamically adapt the parallel execution by either exposing only

one layer of task parallelism to the SPEs via event-driven scheduling, or expanding to the second

layer of data parallelism and merging it with task parallelism when SPEs are underutilized at

runtime.

MGPS extends the EDTLP scheduler with an adaptive processor-saving policy. The sched-

uler runs locally in each process and it is driven by two events:

• arrivals, which correspond to off-loading functions from PPE processes to SPE threads;

• departures, which correspond to completion of SPE functions.

MGPS is invoked upon arrivals and departures of tasks. Initially, upon arrivals, the scheduler

conservatively assigns one SPE to each off-loaded task. Upon a departure, the scheduler mon-

itors the degree of task-level parallelism exposed by each MPI process, i.e. how many discrete

tasks were off-loaded to SPEs while the departing task was executing. This number reflects the

history of SPE utilization from task-level parallelism and is used to switch from the EDTLP

scheduling policy to a hybrid EDTLP-LLP scheduling policy. The scheduler monitors the num-

ber of SPEs that execute tasks over epochs of 100 off-loads. If the observed SPE utilization

is over 50% the scheduler maintains the most recently selected scheduling policy (EDTLP or

EDTLP-LLP). If the observed SPE utilization falls under 50% and the scheduler uses EDTLP,

it switches to EDTLP-LLP by loading parallelized versions of the loops in the local storages of

SPEs and performing loop distribution.
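The epoch-based decision just described can be sketched as follows. The counters and the switch_to_hybrid() hook are illustrative; the actual MGPS bookkeeping and the way code versions are loaded may differ.

    #define EPOCH     100             /* decision interval, in off-loads (from the text) */
    #define THRESHOLD 0.5             /* 50% SPE utilization                             */

    void switch_to_hybrid(void);      /* hypothetical hook: load parallelized loop code  */

    static unsigned int offloads;     /* off-loads seen in the current epoch             */
    static unsigned int busy_samples; /* sum of SPEs observed busy over the epoch        */
    static int          llp_active;   /* 0 = EDTLP, 1 = hybrid EDTLP-LLP                 */

    /* Called on every task departure; spes_busy is the number of SPEs that executed
       tasks while the departing task ran, num_spes the number of available SPEs.    */
    void mgps_on_departure(unsigned int spes_busy, unsigned int num_spes)
    {
        busy_samples += spes_busy;
        if (++offloads < EPOCH)
            return;                                    /* epoch not finished yet        */

        double util = (double)busy_samples / (EPOCH * num_spes);
        if (util < THRESHOLD && !llp_active) {
            switch_to_hybrid();                        /* under-utilized: merge in LLP  */
            llp_active = 1;
        }                                              /* otherwise keep current policy */

        offloads = busy_samples = 0;                   /* start a new epoch             */
    }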

To switch between different parallel execution models at runtime, the runtime system uses

code versioning. It maintains three versions of the code of each task. One version is used for

execution on the PPE. The second version is used for execution on an SPE from start to finish,

using SIMDization to exploit the vector execution units of the SPE. The third version is used

for distribution of the loop enclosed by the task across more than one SPE. The use of code


versioning increases code management overhead, as SPEs may need to load different versions

of the code of each off-loaded task at runtime. On the other hand, code versioning obviates

the need for conditionals that would be used in a monolithic version of the code. These con-

ditionals are expensive on SPEs, which lack branch prediction capabilities. Our experimental

analysis indicates that overlaying code versions on the SPEs via code transfers ends up being

slightly more efficient than using monolithic code with conditionals. This happens because of

the overhead and frequency of the conditionals in the monolithic version of the SPE code, but

also because the code overlays leave more space available in the local storage of SPEs for data

caching and buffering to overlap computation and communication [20].

We compare MGPS to EDTLP and two static hybrid (EDTLP-LLP) schedulers, using 2

SPEs per loop and 4 SPEs per loop respectively. Figure 5.6 shows the execution times of MGPS,

EDTLP-LLP and EDTLP with various RAxML workloads. The x-axis shows the number of

bootstraps, while the y-axis shows execution time. We observe benefits from using MGPS for up

to 28 bootstraps. Beyond 28 bootstraps, MGPS converges to EDTLP and both are increasingly

faster than static EDTLP-LLP execution, as the number of bootstraps increases.

A clear disadvantage of MGPS is that the time needed for any adaptation decision depends

on the total number of off-loading requests, which in turn is inherently application-dependent

and input-dependent. If the off-loading requests from different processes are spaced apart, there

may be extended idle periods on SPEs, before adaptation takes place. Another disadvantage of

MGPS is the dependency of its dynamic scheduling policy on the initial configuration used to

execute the application. In RAxML, MGPS converges to the best execution strategy only if the

application begins by oversubscribing the PPE and exposing the maximum degree of task-level

parallelism to the runtime system. This strategy is unlikely to converge to the best scheduling

policy in other applications, where task-level parallelism is limited and data parallelism is more

dominant. In this case, MGPS would have to commence its optimization process from a dif-

ferent program configuration favoring data-level rather than task-level parallelism. We address

the aforementioned shortcomings via a sampling-based MGPS algorithm (S-MGPS), which we


Figure 5.6: MGPS, EDTLP and static EDTLP-LLP. Input file: 42 SC. Number of ML trees created: (a) 1–16, (b) 1–128.

introduce in the next section.

5.4 S-MGPS

We begin this section by presenting a motivating example to show why controlling concur-

rency on the Cell is useful, even if SPEs are seemingly fully utilized. This example motivates

the introduction of a sampling-based algorithm that explores the space of program and system


configurations that utilize all SPEs, under different distributions of SPEs between concurrently

executing tasks and parallel loops. We present S-MGPS and evaluate S-MGPS using RAxML

and PBPI.

5.4.1 Motivating Example

Increasing the degree of task parallelism on Cell comes at a cost, namely increasing contention

between MPI processes that time-share the PPE. Pairs of processes that execute in parallel on

the PPE suffer from contention for shared resources, a well-known problem of simultaneous

multithreaded processors. Furthermore, with more processes, context switching overhead and

lack of co-scheduling of SPE threads and PPE threads from which the SPE threads originate,

may harm performance. On the other hand, while loop-level parallelization can ameliorate PPE

contention, its performance benefit depends on the granularity and locality properties of parallel

loops.

Figure 5.7 shows the efficiency of loop-level parallelism in RAxML when the input data

set is relatively small. The input data set in this example (25 SC) has 25 organisms, each

of them represented by a DNA sequence of 500 nucleotides. In this experiment, RAxML is

executed multiple times with a single worker process and a variable number of SPEs used for

LLP. The best execution time is achieved with 5 SPEs. The behavior illustrated in Figure 5.7 is

caused by several factors, including the granularity of loops relative to the overhead of PPE-SPE

communication and load imbalance (discussed in Section 5.2.2).

By using two dimensions of parallelism to execute an application, the runtime system can

control both PPE contention and loop-level parallelization overhead. Figure 5.8 illustrates an

example in which multi-grain parallel executions outperform one-dimensional parallel execu-

tions in RAxML, for any number of bootstraps. In this example, RAxML is executed with three

static parallelization schemes, using 8 MPI processes and 1 SPE per process, 4 MPI processes

and 2 SPEs per process, or 2 MPI processes and 4 SPEs per process respectively. The input data


Figure 5.7: Execution time of RAxML with a variable number of SPE threads. The input dataset is 25 SC.


Figure 5.8: Execution times of RAxML, with various static multi-grain scheduling strategies. The input dataset is 25 SC.

set is 25 SC. Using this data set, RAxML performs the best with a multi-level parallelization

model when 4 MPI processes are simultaneously executed on the PPE and each of them uses 2

SPEs for loop-level parallelization.

5.4.2 Sampling-Based Scheduler for Multi-grain Parallelism

The S-MGPS scheduler automatically determines the best parallelization scheme for a specific

workload, by using a sampling period. During the sampling period, S-MGPS performs a search


of program configurations along the available dimensions of parallelism. The search starts with

a single MPI process and during the first step S-MGPS determines the optimal number of SPEs

that should be used by a single MPI process. The search is implemented by sampling execution

phases of the MPI process with different degrees of loop-level parallelism. Phases represent

code that is executed repeatedly in an application and dominates execution time. In case of

RAxML and PBPI, phases are the off-loaded tasks. Although we identify phases manually in

our execution environment, the selection process for phases is trivial and can be automated in a

compiler. Furthermore, parallel applications almost always exhibit a very strong runtime peri-

odicity in their execution patterns, which makes the process of isolating the dominant execution

phases straightforward.

Once the first sampling step of S-MGPS is completed, the search continues by sampling ex-

ecution intervals with every feasible combination of task-level and loop-level parallelism. In the

second phase of the search, the degree of loop-level parallelism never exceeds the optimal value

determined by the first sampling step. For each execution interval, the scheduler uses execution

time of phases as a criterion for selecting the optimal dimension(s) and granularity of paral-

lelism per dimension. S-MGPS uses a performance-driven mechanism to rightsize parallelism

on Cell, as opposed to the utilization-driven mechanism used in MGPS.

Figure 5.9 illustrates the steps of the sampling phase when 2 MPI processes are executed

on the PPE. This process can be performed for any number of MPI processes that can be exe-

cuted on a single Cell node. For each MPI process, the runtime system uses a variable number

of SPEs, ranging from 1 up to the optimal number of SPEs determined by the first phase of

sampling.

The purpose of the sampling period is to determine the configuration of parallelism that

maximizes efficiency. We define a throughput metric W as:


Figure 5.9: The sampling phase of S-MGPS. Samples are taken from four execution intervals, during which the code performs identical operations. For each sample, each MPI process uses a variable number of SPEs to parallelize its enclosed loops.

W = C / T    (5.1)

where C is the number of completed tasks and T is execution time. Note that a task is de-

fined as a function off-loaded on SPEs, therefore C captures application- and input-dependent

behavior. S-MGPS computes C by counting the number of task off-loads. This metric works

reasonably well, assuming that tasks of the same type (i.e. the same function or chunk of an

expensive computational loop, off-loaded multiple times on an SPE) have approximately the

same execution time. This is indeed the case in the applications that we studied. The metric can

be easily extended so that each task is weighted by its execution time relative to the execution

time of other tasks, to account for unbalanced task execution times. We do not explore this

option further in this thesis.
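As a concrete illustration, the selection step amounts to computing W for every sampled configuration and keeping the maximum; the sketch below assumes a hypothetical sample record filled in during the sampling period and is not taken verbatim from the S-MGPS runtime.

/* Selecting the most efficient configuration from S-MGPS samples.
 * The struct and its fields are hypothetical placeholders for the
 * measurements gathered during the sampling period.               */
#include <stddef.h>

struct sample {
    int    deg_tlp;      /* MPI processes on the PPE               */
    int    deg_llp;      /* SPEs used per off-loaded parallel loop */
    long   tasks_done;   /* C: number of completed task off-loads  */
    double seconds;      /* T: duration of the sampling interval   */
};

/* Returns the index of the configuration with maximum W = C / T (Equation 5.1). */
size_t best_configuration(const struct sample *s, size_t n)
{
    size_t best = 0;
    double best_w = 0.0;
    for (size_t i = 0; i < n; i++) {
        double w = (double)s[i].tasks_done / s[i].seconds;
        if (w > best_w) {
            best_w = w;
            best = i;
        }
    }
    return best;
}

Applied to the 25 SC samples of Table 5.4, for instance, this selection returns the 4x2 configuration (W ≈ 76,423).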


S-MGPS calculates efficiency for every sampled configuration and selects the configuration

with the maximum efficiency for the rest of the execution. In Table 5.4 we present partial

results of the sampling phase in RAxML for different input datasets. In this example, the

degree of task-level parallelism sampled is 8, 4 and 2, while the degree of loop-level parallelism

sampled is 1, 2 and 4. In the case of RAxML we set a single sampling phase to be the time necessary

for all active worker processes to finish a single bootstrap. Therefore, in the case of RAxML

in Table 5.4, the number of bootstraps and the execution time differ across sampling phases:

when the number of active workers is 8, the sampling phase will contain 8 bootstraps, when the

number of active workers is 4 the sampling phase will contain 4 bootstraps, etc. Nevertheless,

the throughput (W ) remains invariant across different sampling phases and always represents

the efficiency of a certain configuration, i.e. amount of work done per second. Results presented

in Table 5.4 confirm that S-MGPS converges to the optimal configurations (4x2 and 8x1) for

the input files 25 SC and 42 SC.

Dataset   deg(TLP) × deg(LLP)   # bootstr. per sampling phase   # off-loaded tasks   phase duration   W
42 SC     8x1                   8                               2,526,126            41.73s           60,535
42 SC     4x2                   4                               1,263,444            21.05s           60,021
42 SC     2x4                   2                               624,308              14.42s           43,294
25 SC     8x1                   8                               1,261,232            16.53s           76,299
25 SC     4x2                   4                               612,155              8.01s            76,423
25 SC     2x4                   2                               302,394              5.6s             53,998

Table 5.4: Efficiency of different program configurations with two data sets in RAxML. The best configuration for the 42 SC input is deg(TLP)=8, deg(LLP)=1. The best configuration for 25 SC is deg(TLP)=4, deg(LLP)=2. deg() corresponds to the degree of a given dimension of parallelism (LLP or TLP).

Since the scheduler performs an exhaustive search, for the 25 SC input, the total number

of bootstraps required for the sampling period on Cell is 17, for up to 8 MPI processes and

1 to 5 SPEs used per MPI process for loop-level parallelization. The upper bound of 5 SPEs

per loop is determined by the first step of the sampling period. Assuming that performance is

optimized if the maximum number of SPEs of the processor are involved in parallelization, the


feasible configurations to sample are constrained by deg(TLP)×deg(LLP)=8, for a single Cell

with 8 SPEs. Under this constraint, the number of samples needed by S-MGPS on Cell drops

to 3. Unfortunately, when considering only configurations that use all SPEs, the scheduler may

omit a configuration that does not use all SPEs but still performs better than the best scheme

that uses all processor cores. In principle, this situation may occur in certain non-scalable

codes or code phases. To address such cases, we recommend the use of exhaustive search in

S-MGPS, given that the total number of feasible configurations of SPEs on a Cell is manageable

and small compared to the number of tasks and the number of instances of each task executed

in real applications. This assumption may need to be revisited in the future for large-scale

systems with many cores and exhaustive search may need to be replaced by heuristics such as

hill climbing or simulated annealing.
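To make the pruning concrete, the small sketch below enumerates the configurations that satisfy deg(TLP)×deg(LLP)=8 while respecting the per-loop SPE bound found in the first sampling step; it is an illustration of the constraint, not code from the scheduler. With the bound of 5 SPEs per loop measured for 25 SC, it prints exactly the 3 samples mentioned above: (2,4), (4,2) and (8,1).

/* Enumerate configurations that use all 8 SPEs of a single Cell, subject to
 * the per-loop SPE bound determined by the first sampling step.            */
#include <stdio.h>

int main(void)
{
    const int num_spes  = 8;
    const int llp_bound = 5;   /* optimal LLP degree from sampling step 1 (25 SC) */

    for (int tlp = 1; tlp <= num_spes; tlp++) {
        if (num_spes % tlp != 0)
            continue;                 /* deg(TLP) x deg(LLP) must equal 8 */
        int llp = num_spes / tlp;
        if (llp > llp_bound)
            continue;                 /* never exceed the step-1 optimum  */
        printf("deg(TLP)=%d, deg(LLP)=%d\n", tlp, llp);
    }
    return 0;
}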

In Table 5.5 we compare the performance of S-MGPS to the static scheduling policies with

both one-dimensional (TLP) and multi-grain (TLP-LLP) parallelism on Cell, using RAxML.

For a small number of bootstraps, S-MGPS underperforms the best static scheduling scheme

by 10%. The reason is that S-MGPS expends a significant percentage of execution time in

the sampling period, while executing the program in mostly suboptimal configurations. As the

number of bootstraps increases, S-MGPS comes closer to the performance of the best static

scheduling scheme (within 3%–5%).

              deg(TLP)=8,   deg(TLP)=4,   deg(TLP)=2,
              deg(LLP)=1    deg(LLP)=2    deg(LLP)=4    S-MGPS
32 boots.     60s           57s           80s           63s
64 boots.     117s          112s          161s          118s
128 boots.    231s          221s          323s          227s

Table 5.5: RAxML – Comparison between S-MGPS and static scheduling schemes, illustrating the convergence overhead of S-MGPS.

To map PBPI to Cell, we used a hybrid parallelization approach where a fixed number of

MPI processes is multiplexed on the PPE and multiple SPEs are used for loop-level paralleliza-

tion. The performance of the parallelized off-loaded code in PBPI is influenced by the same


Figure 5.10: PBPI executed with different levels of TLP and LLP parallelism: deg(TLP) = 1, 2, 4, 8 and deg(LLP) = 1–16.

factors as in RAxML: granularity of the off-loaded code, PPE-SPE communication, and load

imbalance. In Figure 5.10 we present the performance of PBPI when a variable number of SPEs

is used to execute the parallelized off-loaded code. The input file we used in this experiment is

107 SC, including 107 organisms, each represented by a DNA sequence of 1,000 nucleotides.

We run PBPI with one Markov chain for 200,000 generations. Figure 5.10 contains four exe-

cutions of PBPI with 1, 2, 4 and 8 MPI processes with 1–16, 1–8, 1–4 and 1–2 SPEs used per

MPI process respectively. In all experiments we use a single BladeCenter with two Cell BE

processors (total of 16 SPEs).

In the experiments with 1 and 2 MPI processes, the off-loaded code scales successfully only

up to a certain number of SPEs, which is always smaller than the number of total available SPEs.

Furthermore, the best performance in these two cases is reached when the number of SPEs used

for parallelization is smaller than the total number of available SPEs. The optimal number of

SPEs in general depends on the input data set and on the outermost parallelization and data

decomposition scheme of PBPI. The best performance for the specific dataset is reached by

using 4 MPI processes, spread across 2 Cell BEs, with each process using 4 SPEs on one Cell

BE. This optimal operating point shifts with different data set sizes.

The fixed virtual processor topology and data decomposition method used in PBPI prevents


dynamic scheduling of MPI processes at runtime without excessive overhead. We have exper-

imented with the option of dynamically changing the number of active MPI processes via a

gang scheduling scheme, which keeps the total number of active MPI processes constant, but

co-schedules MPI processes in gangs of size 1, 2, 4, or 8 on the PPE and uses 8, 4, 2, or 1

SPE(s) per MPI process per gang respectively, for the execution of parallel loops. This scheme

also suffered from system overhead, due to process control and context switching on the SPEs.

Pending better solutions for adaptively controlling the number of processes in MPI, we evalu-

ated S-MGPS in several scenarios where the number of MPI processes remains fixed. Using

S-MGPS we were able to determine the optimal degree of loop-level parallelism, for any given

degree of task-level parallelism (i.e. initial number of MPI processes) in PBPI. Being able to

pinpoint the optimal SPE configuration for LLP is still important since different loop paral-

lelization strategies can result in a significant difference in execution time. For example, the

naïve parallelization strategy, where all available SPEs are used for parallelization of off-loaded

loops, can result in up to 21% performance degradation (see Figure 5.10).

Table 5.6 shows a comparison of execution times when S-MGPS is used and when different

static parallelization schemes are used. S-MGPS performs within 2% of the optimal static

parallelization scheme. S-MGPS also performs up to 20% better than the naïve parallelization

scheme where all available SPEs are used for LLP (see Table 5.6(b)).

5.5 Chapter Summary

In this chapter we investigated policies and mechanisms pertaining to scheduling multigrain

parallelism on the Cell Broadband Engine. We proposed an event-driven task scheduler, striv-

ing for higher utilization of SPEs via oversubscribing the PPE. We have explored the conditions

under which loop-level parallelism within off-loaded code can be used. We have also proposed a

comprehensive scheduling policy for combining task-level and loop-level parallelism autonom-

ically within MPI code, in response to workload fluctuation. Using a bio-informatics code with


(a) deg(TLP)=1
deg(LLP)     1      2      3      4      5      6      7      8
Time (s)     502    267.8  222.8  175.8  142.1  118.6  108.1  134.3
deg(LLP)     9      10     11     12     13     14     15     16
Time (s)     122    111.9  138.3  109.2  122.3  133.2  115.3  116.5
S-MGPS Time (s): 110.3

(b) deg(TLP)=2
deg(LLP)     1      2      3      4      5      6      7       8
Time (s)     275.9  180.8  139.4  113.5  91.3   97.3   102.55  115
S-MGPS Time (s): 93

(c) deg(TLP)=4
deg(LLP)     1      2       3      4
Time (s)     180.6  118.67  94.63  83.61
S-MGPS Time (s): 85.9

(d) deg(TLP)=8
deg(LLP)     1      2
Time (s)     355.5  265
S-MGPS Time (s): 267

Table 5.6: PBPI – comparison between S-MGPS and static scheduling schemes: (a) deg(TLP)=1, deg(LLP)=1–16; (b) deg(TLP)=2, deg(LLP)=1–8; (c) deg(TLP)=4, deg(LLP)=1–4; (d) deg(TLP)=8, deg(LLP)=1–2.

inherent multigrain parallelism as a case study, we have shown that our user-level scheduling

policies outperform the native OS scheduler by up to a factor of 2.7.

Our MGPS scheduler proves to be responsive to small and large degrees of task-level and

data-level parallelism, at both fine and coarse levels of granularity. This kind of parallelism

is commonly found in optimization problems where many workers are spawned to search a

very large space of solutions, using a heuristic. RAxML is representative of these applica-

tions. MGPS is also appropriate for adaptive and irregular applications such as adaptive mesh

refinement, where the application has task-level parallelism with variable granularity (because

of load imbalance incurred while meshing subdomains with different structural properties) and,

in some implementations, a statically unpredictable degree of task-level parallelism (because

of non-deterministic dynamic load balancing which may be employed to improve execution

time). N-body simulations and ray-tracing are applications that exhibit similar properties and

can also benefit from our scheduler. As a final note, we observe that MGPS reverts to the best

static scheduling scheme for regular codes with a fixed degree of task-level parallelism, such as

blocked linear algebra kernels.

We also investigated the problem of mapping multi-dimensional parallelism on hetero-


geneous parallel architectures with both conventional and accelerator cores. We proposed a

feedback-guided dynamic scheduling scheme, S-MGPS, which rightsizes parallelism on the fly,

without a priori knowledge of application-specific information and regardless of the input data

set.


Chapter 6

Model of Multi-Grain Parallelism

6.1 Introduction

The migration of parallel programming models to accelerator-based architectures raises many

challenges. Accelerators require platform-specific programming interfaces and re-formulation

of parallel algorithms to fully exploit the additional hardware. Furthermore, scheduling code

on accelerators and orchestrating parallel execution and data transfers between host processors

and accelerators is a non-trivial exercise, as discussed in Chapter 5.

Although it can accurately determine the most efficient execution configuration of a multi-level parallel application, the S-MGPS scheduler (Section 5.4) requires sampling of many different configurations at runtime. The sampling time grows with the number of accelerators

on the chip, and with the number of different levels of parallelism available in the applica-

tion. To pinpoint the most efficient execution configuration without using the sampling phase,

we develop a model for multi-dimensional parallel computation on heterogeneous multi-core

processors. We name the model Model of Multi-Grain Parallelism (MMGP). The model is ap-

plicable to any type of accelerator based architecture, and in Section 6.4 we test the accuracy

and usability of the MMGP model on the multicore Cell architecture.


Figure 6.1: A hardware abstraction of an accelerator-based architecture with two layers of parallelism. Host processing units (HPUs) supply relatively coarse-grain parallel computation across accelerators. Accelerator processing units (APUs) are the main computation engines and may support internally finer-grain parallelism. Both HPUs and APUs have local memories and communicate through shared memory or message passing. Additional layers of parallelism can be expressed hierarchically in a similar fashion.

6.2 Modeling Abstractions

Performance can be dramatically affected by the assignment of tasks to resources on a complex

parallel architecture with multiple types of parallel execution vehicles. We intend to create a

model of performance that captures the important costs of parallel task assignment at multiple

levels of granularity, while maintaining simplicity. Additionally, we want our techniques to be

independent of both programming models and the underlying hardware. Thus, in this section

we identify abstractions necessary to allow us to define a simple, accurate model of parallel

computation for accelerator-based architectures.

6.2.1 Hardware Abstraction

Figure 6.1 shows our abstraction for accelerator-based architectures. In this abstraction, each

node consists of multiple host processing units (HPU) and multiple accelerator processing units

(APU). Both the HPUs and APUs have local and shared memory. Multiple HPU-APU nodes

form a cluster. We model the communication cost for i and j, where i and j are HPUs, APUs,


and/or HPU-APU nodes, using a variant of the LogP model [35] of point-to-point communi-

cation:

C_{i,j} = O_i + L + O_j    (6.1)

where C_{i,j} is the communication cost, O_i and O_j are the overheads of the sender and the receiver respectively, and L is the communication latency.

In this hardware abstraction, we model an HPU, APU, or HPU-APU node as a sequential

device with streaming memory accesses. For simplicity, we assume that additional levels of

parallelism in HPUs or APUs, such as ILP and SIMD, can be reflected with a parameter that

represents computing capacity. We could alternatively express multi-grain parallelism hierar-

chically, but this complicates model descriptions without much added value. The assumption of streaming memory accesses allows the model to account for the overlap of communication and computation.

6.2.2 Application Abstraction

Figure 6.2 provides an illustrative view of the succeeding discussion. We model the workload

of a parallel application using a version of the Hierarchical Task Graph (HTG [52]). An HTG

represents multiple levels of concurrency with progressively finer granularity when moving

from outermost to innermost layers. We use a phased HTG, in which we partition the application

into multiple phases of execution and split each phase into nested sub-phases, each modeled as

a single, potentially parallel task. Each subtask may incorporate one or more layers of data or

sub-task parallelism. The degree of concurrency may vary between tasks and within tasks.

Mapping a workload with nested parallelism as shown in Figure 6.2 to an accelerator-based

multi-core architecture can be challenging. In the general case, any application task of any

granularity could map to any type combination of HPUs and APUs. The solution space under

these conditions can be unmanageable.

In this work, we confine the solution space by making some assumptions about the applica-


Figure 6.2: Our application abstraction of two parallel tasks. Two tasks are spawned by the main process. Each task exhibits phased, multi-level parallelism of varying granularity. In this chapter, we address the problem of mapping tasks and subtasks to accelerator-based systems.

tion and hardware. First, we assume that the amount and type of parallelism is known a priori

for all phases in the application. In other words, we assume that the application is explicitly par-

allelized, in a machine-independent fashion. More specifically, we assume that the application

exposes all available layers of inherent parallelism to the runtime environment, without how-

ever specifying how to map this parallelism to parallel execution vehicles in hardware. In other

words, we assume that the application’s parallelism is expressed independently of the number

and the layout of processors in the architecture. The parallelism of the application is represented

by a phased HTG graph. The intent of our work is to improve and formalize programming of

accelerator-based multicore architectures. We believe it is not unreasonable to assume those

interested in porting code and algorithms to such systems would have detailed knowledge about

the inherent parallelism of their application. Furthermore, explicit, processor-independent par-

allel programming is considered by many as a means to simplify parallel programming mod-

els [10].

Second, we prune the number and type of hardware configurations. We assume hardware


configurations consist of a hierarchy of nested resources, even though the actual resources may

not be physically nested in the architecture. Each resource is assigned to an arbitrary level of

parallelism in the application and resources are grouped by level of parallelism in the applica-

tion. For instance, the Cell Broadband Engine can be considered as 2 HPUs and 8 APUs, where

the two HPUs correspond to the PowerPC dual-thread SMT core and APUs to the synergistic

(SPE) accelerator cores. HPUs support parallelism of any granularity, however APUs support

the same or finer, not coarser, granularity. This assumption is reasonable since it represents

faithfully all current accelerator architectures, where front-end processors offload computation

and data to accelerators. This assumption simplifies modeling of both communication and com-

putation.

6.3 Model of Multi-grain Parallelism

This section provides theoretical rigor to our approach. We present MMGP, a model which

predicts execution time on accelerator-based system configurations and applications under the

assumptions described in the previous section. Readers familiar with point-to-point models of

parallel computation may want to skim this section and continue directly to the results of our

execution time prediction techniques discussed in Section 6.4.

We follow a bottom-up approach. We begin by modeling sequential execution on the HPU,

with part of the computation off-loaded to a single APU. Next, we incorporate multiple APUs

in the model, followed by multiple HPUs. We end up with a general model of execution time,

which is not particularly practical. Hence, we reduce the general model to reflect different uses

of HPUs and APUs on real systems. More specifically, we specialize the model to capture the

scheduling policy of threads on the HPUs and to estimate execution times under different map-

pings of multi-grain parallelism across HPUs and APUs. Lastly, we describe the methodology

we use to apply MMGP to real systems.


(a) an architecture with one HPU and one APU; (b) an application with three phases

Figure 6.3: The sub-phases of a sequential application are readily mapped to HPUs and APUs. In this example, sub-phases 1 and 3 execute on the HPU and sub-phase 2 executes on the APU. HPUs and APUs are assumed to communicate via shared memory.

6.3.1 Modeling sequential execution

As the starting point, we consider the mapping of the program to an accelerator-based architec-

ture that consists of one HPU and one APU, and an application with one phase decomposed into

three sub-phases, a prologue and epilogue running on the HPU, and a main accelerated phase

running on the APU, as illustrated in Figure 6.3.

Offloading computation incurs additional communication cost, for loading code and data on

the APU, and saving results calculated from the APU. We model each of these communication

costs with a latency and an overhead at the end-points, as in Equation 6.1. We assume that

APU’s accesses to data during the execution of a procedure are streamed and overlapped with

APU computation. This assumption reflects the capability of current streaming architectures,

such as the Cell and Merrimac [37], to aggressively overlap memory latency with computa-

tion, using multiple buffers. Due to overlapped memory latency, communication overhead is

assumed to be visible only during loading the code and arguments of a procedure on the APU

and during returning the result of a procedure from the APU to the HPU. We combine the com-

munication overhead for offloading the code and arguments of a procedure and signaling the

execution of that procedure on the APU in one term (O_s), and the overhead for returning the result of a procedure from the APU to the HPU in another term (O_r).

We can model the execution time for the offloaded sequential execution for sub-phase 2 in


Figure 6.3 as:

T_{offload}(w_2) = T_{APU}(w_2) + O_r + O_s    (6.2)

where T_{APU}(w_2) is the time needed to complete sub-phase 2 without additional overhead.

Further, we can write the total execution time of all three sub-phases as:

T = T_{HPU}(w_1) + T_{APU}(w_2) + O_r + O_s + T_{HPU}(w_3)    (6.3)

To reduce complexity, we replace T_{HPU}(w_1) + T_{HPU}(w_3) with T_{HPU}, T_{APU}(w_2) with T_{APU}, and O_s + O_r with O_{offload}. Therefore, we can rewrite Equation 6.3 as:

T = T_{HPU} + T_{APU} + O_{offload}    (6.4)

The application model in Figure 6.3 is representative of one of potentially many phases in

an application. We further modify Equation 6.4 for a generic application with N phases, where

each phase i offloads a part of its computation on one APU:

T = \sum_{i=1}^{N} (T_{HPU,i} + T_{APU,i} + O_{offload})    (6.5)

6.3.2 Modeling parallel execution on APUs

Each offloaded part of a phase may contain fine-grain parallelism, such as task-level parallelism

at the sub-procedural level or data-level parallelism in loops. This parallelism can be exploited

by using multiple APUs for the offloaded workload. Figure 6.4 shows the execution time de-

composition for execution using one APU and two APUs. We assume that the code off-loaded

to an APU during phase i, has a part which can be further parallelized across APUs, and a part

executed sequentially on the APU. We denote TAPU,i(1, 1) as the execution time of the further

parallelized part of the APU code during the ith phase. The first index 1 refers to the use of

one HPU thread in the execution. We denote TAPU,i(1, p) as the execution time of the same


part when p APUs are used to execute this part during the ith phase. We denote as CAPU,i the

non-parallelized part of APU code in phase i. Therefore, we obtain:

T_{APU,i}(1, p) = T_{APU,i}(1, 1)/p + C_{APU,i}    (6.6)


Figure 6.4: Parallel APU execution. The HPU (leftmost bar in parts a and b) offloads computations to one APU (part a) and two APUs (part b). The single point-to-point transfer of part a is modeled as overhead plus computation time on the APU. For multiple transfers, there is additional overhead (g), but also benefits due to parallelization.

Given that the HPU offloads to APUs sequentially, there exists a latency gap between con-

secutive offloads on APUs. Similarly, there exists a gap between receiving return values from

two consecutive offloaded procedures on the HPU. We denote with g the larger of the two gaps.

On a system with p APUs, parallel APU execution will incur an additional overhead as large as

p · g. Thus, we can model the execution time in phase i as:

T_i(1, p) = T_{HPU,i} + T_{APU,i}(1, 1)/p + C_{APU,i} + O_{offload} + p · g    (6.7)


6.3.3 Modeling parallel execution on HPUs

An accelerator-based architecture can support parallel HPU execution in several ways, by pro-

viding a multi-core HPU, an SMT HPU or combinations thereof. As a point of reference, we

consider an architecture with one SMT HPU, which is representative of the Cell BE.

Since the compute intensive parts of an application are typically offloaded to APUs, the

HPUs are expected to be in idle state for extended intervals. Therefore, multiple threads can

be used to reduce idle time on the HPU and provide more sources of work for APUs, so that

APUs are better utilized. It is also possible to oversubscribe the HPU with more threads than

the number of available hardware contexts, in order to expose more parallelism via offloading

on APUs.

Figure 6.5 illustrates the execution timeline when two threads share the same HPU, and each

thread offloads parallelized code on two APUs. We use different shade patterns to represent the

workload of different threads.


Figure 6.5: Parallel HPU execution. The HPU (center bar) offloads computations to 4 APUs (2 on the right and 2 on the left). The first thread on the HPU offloads computation to APU1 and APU2, then idles. The second HPU thread is switched in, offloads code to APU3 and APU4, and then idles. APU1 and APU2 complete and return data, followed by APU3 and APU4.


For m concurrent HPU threads, where each thread uses p APUs for distributing a single

APU task, the execution time of a single off-loading phase can be represented as:

T^k_i(m, p) = T^k_{HPU,i}(m, p) + T^k_{APU,i}(m, p) + O_{offload} + p · g    (6.8)

where T^k_i(m, p) is the completion time of the k-th HPU thread during the i-th phase.

Modeling the APU time

Similarly to Equation 6.6, we can write the APU time of the k-th thread in phase i in Equation

6.8 as:

T^k_{APU,i}(m, p) = T_{APU,i}(m, 1)/p + C_{APU,i}    (6.9)

Different parallel implementations may result in different T_{APU,i}(m, 1) terms and a different number of offloading phases. For example, the implementation could parallelize each phase among m HPU threads and then offload the work of each HPU thread to p APUs, resulting in the same number of offloading phases and a reduced APU time during each phase, i.e., T_{APU,i}(m, 1) = T_{APU,i}(1, 1)/m. As another example, the HPU threads can be used to execute multiple identical tasks, resulting in a reduced number of offloading phases (i.e., N/m, where N is the number of offloading phases when there is only one HPU thread) and the same APU time in each phase, i.e., T_{APU,i}(m, 1) = T_{APU,i}(1, 1).

Modeling the HPU time

The execution time of each HPU thread is affected by three factors:

1. Contention between HPU threads for shared resources.

2. Context switch overhead related to resource scheduling.

3. Global synchronization between dependent HPU threads.


Considering all three factors, we can model the execution time of an HPU thread in phase i as:

T^k_{HPU,i}(m, p) = α_m · T_{HPU,i}(1, p) + T_{CSW} + O_{COL}    (6.10)

In this equation, T_{CSW} is the context switching time on the HPU, and O_{COL} is the time needed for collective communication. The parameter α_m is introduced to account for contention between

threads that share resources on the HPU. On SMT and CMP HPUs, such resources typically

include one or more levels of the on-chip cache memory. On SMT HPUs in particular, shared

resources include also TLBs, branch predictors and instruction slots in the pipeline. Contention

between threads often introduces artificial load imbalance due to occasional unfair hardware

policies of allocating resources between threads.

Synthesis

Combining Equations (6.8)–(6.10) and summing over all phases, we can write the execution time for MMGP as:

T(m, p) = α_m · T_{HPU}(1, 1) + T_{APU}(1, 1)/(m · p) + C_{APU} + N · (O_{offload} + T_{CSW} + O_{COL} + p · g)    (6.11)

Due to limited hardware resources (i.e. number of HPUs and APUs), we further constrain

this equation to m × p ≤ N_{APU}, where N_{APU} is the number of available APUs. As described later in this chapter, we can either measure or approximate all parameters in Equation 6.11 from

microbenchmarks and profiles of sequential runs of the program.

6.3.4 Using MMGP

Given a parallel application, MMGP can be applied using the following process:

1. Calculate parameters including O_{offload}, α_m, T_{CSW} and O_{COL} using micro-benchmarks

for the target platform.


2. Profile a short run of the sequential execution with off-loading to a single APU, to estimate

T_{HPU}(1), g, T_{APU}(1, 1) and C_{APU}.

3. Solve a special case of Equation 6.11 (e.g. Equation 6.7) to find the optimal mapping between application concurrency and the HPUs and APUs available on the target platform; a minimal sketch of this step is shown below.
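As an illustration of step 3, the sketch below evaluates Equation 6.11 for every feasible (m, p) with m · p ≤ N_APU and keeps the minimum. All numeric parameter values are placeholders to be filled in from the micro-benchmarks and the profiled sequential run; they are not measured values.

/* Sketch: pick (m, p) minimizing the MMGP prediction of Equation 6.11.
 * All parameter values below are placeholders.                          */
#include <stdio.h>

#define N_APU 8   /* APUs available to the application in this sketch */

/* Placeholder contention factor: alpha_1 = 1, alpha_m for m > 1 taken from
 * the contention micro-benchmark (Section 6.4.1).                          */
static double alpha(int m) { return m > 1 ? 1.28 : 1.0; }

static double mmgp_time(int m, int p)
{
    const double t_hpu = 1.0, t_apu = 100.0, c_apu = 10.0;              /* profiled */
    const double o_offload = 1e-3, t_csw = 2e-6, o_col = 1e-3, g = 1e-3; /* measured */
    const int    n_phases = 1000;                                        /* N        */

    return alpha(m) * t_hpu + t_apu / (m * p) + c_apu
         + n_phases * (o_offload + t_csw + o_col + p * g);
}

int main(void)
{
    int best_m = 1, best_p = 1;
    double best_t = mmgp_time(1, 1);

    for (int m = 1; m <= N_APU; m++)
        for (int p = 1; m * p <= N_APU; p++) {
            double t = mmgp_time(m, p);
            if (t < best_t) { best_t = t; best_m = m; best_p = p; }
        }
    printf("MMGP-recommended configuration: m=%d, p=%d (predicted %.2f s)\n",
           best_m, best_p, best_t);
    return 0;
}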

6.3.5 MMGP Extensions

We note that the concepts and assumptions mentioned in this section do not preclude further

specialization of MMGP for higher accuracy. For example, in Section 6.3.1 we assume com-

putation and data communication overlap. This assumption reflects the fact that streaming

processors can typically overlap completely memory access latency with computation. For

non-overlapped memory accesses, we can employ a DMA model as a specialization of the

overhead factors in MMGP. Also, in Sections 6.3.2 and 6.3.3 we assume only two levels of

parallelism. MMGP is easily extensible to additional levels but the terms of the equations grow

quickly without conceptual additions. Furthermore, MMGP can be easily extended to reflect

specific scheduling policies for threads on HPUs and APUs, as well as load imbalance in the

distribution of tasks between HPUs and APUs. To illustrate the usefulness of our techniques we

apply them to a real system. We next present results from applying MMGP to Cell.

6.4 Experimental Validation and Results

We use MMGP to derive multi-grain parallelization schemes for two bioinformatics applica-

tions, RAxML and PBPI, described in Chapter 3, on a shared-memory dual Cell blade, IBM

QS20. Although we are using only two applications in our experimental evaluation, we should

point out that these are complete applications used for real-world biological data analyses, and

that they are fully optimized for the Cell BE using an arsenal of optimizations, including vector-

ization, loop unrolling, double buffering, if-conversion and dynamic scheduling. Furthermore,


these applications have inherent multi-grain concurrency and non-trivial scaling properties in

their phases, therefore scheduling them optimally on Cell is a challenging exercise for MMGP.

Lastly, in the absence of comprehensive suites of benchmarks (such as NAS or SPEC HPC)

ported on Cell, optimized, and made available to the community by experts, we opted to use,

PBPI and RAxML, codes on which we could verify that enough effort has been invested towards

Cell-specific parallelization and optimization.

6.4.1 MMGP Parameter approximation

MMGP has eight free parameters: T_{HPU}, T_{APU}, C_{APU}, O_{offload}, g, T_{CSW}, O_{COL} and α_m. We

estimate four of the parameters using micro-benchmarks.

αm captures contention between processes or threads running on the PPE. This contention

depends on the scheduling algorithm on the PPE. We estimate αm under an event-driven schedul-

ing model which oversubscribes the PPE with more processes than the number of hardware

threads supported for simultaneous execution on the PPE, and switches between processes upon

each off-loading event on the PPE [19].

To estimate α_m, we use a parallel micro-benchmark that computes the product of two M × M

square matrices consisting of double-precision floating point elements. Matrix-matrix multi-

plication involves O(n^3) computation and O(n^2) data transfers, thus stressing the impact of

sharing execution resources and the L1 and L2 caches between processes on the PPE. We used

several different matrix sizes, ranging from 100× 100 to 500× 500, to exercise different levels

of pressure on the thread-shared caches of the PPE. In the MMGP model, we use the mean of

αm obtained from these experiments, which is 1.28.
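One plausible way to reduce these measurements to a single contention factor is to time the kernel once in isolation and once per co-running process, and average the slowdown ratios over the matrix sizes; the sketch below assumes the two timing arrays are produced by the benchmark harness and is not the exact code we used.

/* Sketch: estimate alpha_m as the mean slowdown of the matrix-multiply
 * micro-benchmark when copies of it time-share the PPE. t_alone[i] and
 * t_shared[i] are assumed to hold the solo and co-scheduled per-process
 * times for the i-th matrix size (100x100 .. 500x500).                 */
double estimate_alpha(const double *t_alone, const double *t_shared, int nsizes)
{
    double sum = 0.0;
    for (int i = 0; i < nsizes; i++)
        sum += t_shared[i] / t_alone[i];   /* slowdown for one matrix size */
    return sum / nsizes;                   /* mean over all sizes          */
}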

PPE-SPE communication is optimally implemented through DMAs on Cell. We devised

a ping-pong micro-benchmark using DMAs to send a single word from the PPE to one SPE

and back. We measured the PPE→SPE→PPE round-trip communication overhead (O_{offload}) to be 70 ns. To measure the overhead caused by various collective communications we used


mpptest [55] on the PPE. Using a micro-benchmark that repeatedly executes the sched_yield()

system call, we estimate the overhead caused by the context switching (TCSW ) on the PPE to be

2 µs. This is a conservative upper bound for context switching overhead, since it includes some

user-level library overhead.

THPU , TAPU , CAPU and the gap g between consecutive DMAs on the PPE are application-

dependent and cannot be approximated easily with a micro-benchmark. To estimate these pa-

rameters, we use a profile of a sequential run of the code, with tasks off-loaded on one SPE. We

use the timing instructions inserted into the applications at specific locations. To estimate THPU

we measure the time that applications spend on the HPU. To estimate TAPU and CAPU we mea-

sure the time that applications spend on the accelerators, in large computational loops which

can be parallelized (TAPU ), and in the sequential accelerator code outside of the large loops

(CAPU ). To estimate g, we measure the time intervals between the consecutive task off-loads

and task completions.

6.4.2 Case Study I: Using MMGP to parallelize PBPI

PBPI with One Dimension of Parallelism

We compare the PBPI execution times predicted by MMGP to the actual execution times ob-

tained on real hardware, using various degrees of PPE and SPE parallelism, i.e. the equivalents

of HPU and APU parallelism on Cell. These experiments illustrate the accuracy of MMGP, in a

sample of the feasible program configurations. The sample includes one-dimensional decompo-

sitions of the program between PPE threads, with simultaneous off-loading of code to one SPE

from each PPE thread, one-dimensional decompositions of the program between SPE threads,

where the execution of tasks on the PPE is sequential and each task off-loads code which is

data-parallel across SPEs, and two-dimensional decompositions of the program, where multi-

ple tasks run on the PPE threads concurrently and each task off-loads code which is data-parallel

across SPEs. In all cases, the SPE code is SIMDized in the innermost loops, to exploit the vec-



Figure 6.6: MMGP predictions and actual execution times of PBPI, when the code uses one dimension of PPE (HPU) parallelism.

tor units of the SPEs. We believe that this sample of program configurations is representative of

what a user would reasonably experiment with while trying to optimize the codes on the Cell.

For these experiments, we used the arch107 L10000 input data set. This data set consists

of 107 sequences, each with 10000 characters. We run PBPI with one Markov chain for 20000

generations. Using the time base register on the PPE and the decrementer register on one SPE,

we obtained the following model parameters for PBPI: THPU = 1.3s, TAPU = 370s, g = 0.8s

and O = 1.72s.

Figure 6.6 compares MMGP and actual execution times for PBPI, when PBPI only ex-

ploits one-dimensional PPE (HPU) parallelism in which each PPE thread uses one SPE for

off-loading. We execute the code with up to 16 MPI processes, which off-load code to up to

16 SPEs on two Cell BEs. Referring to Equation 6.11, we set p = 1 and vary the value of

m between 1 and 8. The X-axis shows the number of processes running on the PPE (i.e. HPU

parallelism), and the Y-axis shows the predicted and measured execution times. The maximum

prediction error of MMGP is 5%. The arithmetic mean of the error is 2.3% and the standard

deviation is 1.4.

Figure 6.7 illustrates predicted and actual execution times when PBPI uses one dimension



Figure 6.7: MMGP predictions and actual execution times of PBPI, when the code uses one dimension of SPE (APU) parallelism, with a data-parallel implementation of the maximum likelihood calculation.

of SPE (APU) parallelism. Referring to Equation 6.11, we set m = 1 and vary p from 1 to

8. MMGP remains accurate, the mean prediction error is 4.1% and the standard deviation is

3.2. The maximum prediction error in this case is higher (approaching 10%) when the APU

parallelism increases and the code uses SPEs on both Cell processors. A closer inspection of

this result reveals that the data-parallel implementation of tasks in PBPI stops scaling beyond

the 8 SPEs confined in one Cell processor, because of DMA bottlenecks and non-uniformity in

the latency of memory accesses by the two Cell processors on the blade. Capturing the DMA

bottlenecks requires the introduction of a model of DMA contention in MMGP, while captur-

ing the NUMA bottleneck would require an accurate memory hierarchy model integrated with

MMGP. The NUMA bottleneck can be resolved by a better page placement policy implemented

in the operating system. We intend to examine these issues in our future work. For the purposes

of this chapter, it suffices to observe that MMGP is accurate enough despite its generality. As we

show later, MMGP predicts accurately the optimal mapping of the program to the Cell multi-

processor, regardless of inaccuracies in execution time prediction in certain edge cases.


PBPI with Two Dimensions of Parallelism

Multi-grain parallelization aims at exploiting simultaneously task-level and data-level paral-

lelism in PBPI. We only consider multi-grain parallelization schemes in which deg_{HPU} · deg_{APU} ≤ 16, i.e. the total number of SPEs (APUs) on the dual-processor Cell Blade we used in this study.

deg() denotes the degree of a layer of parallelism, which corresponds to the number of SPE or

PPE threads used to run the code. Figure 6.8 shows the predicted and actual execution times

of PBPI for all feasible combinations of multi-grain parallelism under the aforementioned con-

straint. MMGP’s mean prediction error is 3.2%, the standard deviation of the error is 2.6 and

the maximum prediction error is 10%. The important observation in these results is that MMGP

agrees with the experimental outcome in terms of the mix of PPE and SPE parallelism to use

in PBPI for maximum performance. In a real program development scenario, MMGP would

point the programmer to the direction of using both task-level and data-level parallelism with a

balanced allocation of PPE contexts and SPEs between the two layers.

6.4.3 Case Study II: Using MMGP to Parallelize RAxML

RAxML with a Single Layer of Parallelism

The units of work (bootstraps) in RAxML are distributed evenly between MPI processes, there-

fore the degree of PPE (HPU) concurrency is bound by the number of MPI processes. As

discussed in Section 6.3.3, the degree of HPU concurrency may exceed the number of HPUs, so

that on an architecture with more APUs than HPUs, the program can expose more concurrency

to APUs. The degree of SPE (APU) concurrency may vary per MPI process. In practice, the

degree of PPE concurrency can not meaningfully exceed the total number of SPEs available

on the system, since as many MPI processes can utilize all available SPEs via simultaneous

off-loading. Similarly to PBPI, each MPI process in RAxML can exploit multiple SPEs via

data-level parallel execution of off-loaded tasks across SPEs. To enable maximal PPE and SPE

concurrency in RAxML, we use a version of the code scheduled by a Cell BE event-driven


Figure 6.8: MMGP predictions and actual execution times of PBPI, when the code uses two dimensions of SPE (APU) and PPE (HPU) parallelism. The mix of degrees of parallelism which optimizes performance is 4-way PPE parallelism combined with 4-way SPE parallelism. The chart illustrates the results when both SPE parallelism and PPE parallelism are scaled to two Cell processors.

scheduler [19], in which context switches on the PPE are forced upon task off-loading and PPE

processes are served with a fair-share scheduler, so as to have even chances for off-loading on

SPEs.

We evaluate the performance of RAxML when each process performs the same amount of

work, i.e. the number of distributed bootstraps is divisible by the number of processes. The case

of unbalanced distribution of bootstraps between MPI processes can be handled with a minor

modification to Equation 6.11, to scale the MMGP parameters by a factor of (⌈B/M⌉ · M)/B, where B is the number of bootstraps (tasks) and M is the number of MPI processes used to execute the

code.

We compare the execution time of RAxML to the time predicted by MMGP, using two

input data sets. The first data set contains 10 organisms, each represented by a DNA sequence

of 20,000 nucleotides. We refer to this data set as DS1. The second data set (DS2) contains 10


Figure 6.9: MMGP predictions and actual execution times of RAxML, when the code uses one dimension of PPE (HPU) parallelism: (a) with DS1, (b) with DS2.

organisms, each represented by a DNA sequence of 50,000 nucleotides. For both data sets, we

set RAxML to perform a total of 16 bootstraps using different parallel configurations.

The MMGP parameters for RAxML, obtained from profiling a sequential run of the code are

THPU = 3.3s, TAPU = 63s, CAPU = 104s for DS1, and THPU = 8.8s, TAPU = 118s, CAPU =

157s for DS2. The values of other MMGP parameters are negligible compared to TAPU , THPU ,

and CAPU , therefore we disregard them for RAxML. Note that the off-loaded code that cannot

be parallelized (CAPU ) takes 57-62% of the execution time of a task on the SPE. Figure 6.9

illustrates the estimated and actual execution times of RAxML with up to 16 bootstraps, using

one dimension of PPE (HPU) parallelism. In this case, each MPI process offloads tasks to one

SPE and SPEs are utilized by oversubscribing the PPE with more processes than the number of

hardware threads available on the PPEs. For DS1, the mean MMGP prediction error is 7.1%,

the standard deviation is 6.4, and the maximum error is 18%. For DS2, the mean MMGP

prediction error is 3.4%, the standard deviation is 1.9 and the maximum error is 5%.


Figure 6.10: MMGP predictions and actual execution times of RAxML, when the code uses one dimension of SPE (APU) parallelism: (a) with DS1, (b) with DS2.

Figure 6.10 illustrates estimated and actual execution times of RAxML, when the code uses

one dimension of SPE (APU) parallelism, with a data-parallel implementation of the maxi-

mum likelihood calculation functions across SPEs. We should point out that although both

RAxML and PBPI perform maximum likelihood calculations in their computational cores,

RAxML’s loops have loop-carried dependencies that prevent scalability and parallelization in

many cases [20], whereas PBPI’s core computation loops are fully parallel and coarse enough to

achieve scalability. The limited scalability of data-level parallelization of RAxML is the reason

why we confine the executions with data-level parallelism to at most 8 SPEs. As shown in Fig-

ure 6.10, the data-level parallel implementation of RAxML does not scale substantially beyond

4 SPEs. When only APU parallelism is extracted from RAxML, for DS1 the mean MMGP

prediction error is 0.9%, the standard deviation is 0.8, and the maximum error is 2%. For DS2,

the mean MMGP prediction error is 2%, the standard deviation is 1.3 and the maximum error

is 4%.


RAxML with Two Dimensions of Parallelism

Figure 6.11 shows the actual and predicted execution times in RAxML, when the code exposes

two dimensions of parallelism to the system. Once again, regardless of execution time predic-

tion accuracy, MMGP is able to pin-point the optimal parallelization model, which in the case

of RAxML is task-level parallelization with no further data-parallel decompositions of tasks

between SPEs, as the opportunity for scalable data-level parallelization in the code is limited.

Innermost loops in tasks are still SIMDized within each SPE. MMGP remains accurate, with

mean execution time prediction error of 4.3%, standard deviation of 4, and maximum prediction

error of 18% for DS1, and mean execution time prediction error of 2.8%, standard deviation of

1.9, and maximum prediction error of 7% for DS2. It is worth noting that although the two

codes tested are fundamentally similar in their computational core, their optimal parallelization

model is radically different. MMGP accurately reflects this disparity, using a small number of

parameters and rapid prediction of execution times across a large number of feasible program

configurations.

6.4.4 MMGP Usability Study

We demonstrate a practical use of MMGP through a simple usability study. We modified PBPI

to execute an MMGP sampling phase at the beginning of the execution. During the sampling

phase, the application is profiled and all MMGP parameters are determined. After finishing

the sampling phase, MMGP estimates the optimal configuration and the application is executed

with the MMGP-recommended configuration. The profiling, sampling and MMGP actuation

phases are performed automatically without any user intervention. We set PBPI to execute 10^6

generations, since this is a number of generations typically required by biologists. We set the

sampling phase to be 10,000 generations. Even with the overhead introduced by the sampling

phase included in the measurements, the configuration provided by MMGP outperforms all

other configurations by margins ranging from 1.1% (compared to the next best configuration



Figure 6.11: MMGP predictions and actual execution times of RAxML, when the code uses two dimensions of SPE (APU) and PPE (HPU) parallelism. Performance is optimized by oversubscribing the PPE and maximizing task-level parallelism.



Figure 6.12: Overhead of the sampling phase when the MMGP scheduler is used with the PBPI application. PBPI is executed multiple times with 107 input species. The sequence size of the input file is varied from 1,000 to 10,000. In the worst case, the overhead of the sampling phase is 2.2% (sequence size 7,000).

identified via an exhaustive search), and up to 4 times (compared to the worst configuration

identified via an exhaustive search). The sampling phase takes in the worst case 2.2% of the

total execution time, but completely eliminates the exhaustive search that would otherwise be

necessary to find the best mapping of the application to the Cell architecture. Figure 6.12

illustrates the overhead of the sampling phase with the PBPI application.
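The control flow of this usability mechanism can be summarized in a short sketch. The code below is only a hypothetical illustration of the profile-model-actuate sequence described above; the mmgp_* and pbpi_* helpers are placeholder names, not functions from PBPI or from an MMGP library.

#include <stdio.h>

#define SAMPLING_GENERATIONS 10000
#define TOTAL_GENERATIONS    1000000

/* Placeholder types and stubs standing in for the real PBPI and MMGP
   routines; only the control flow matters in this sketch.             */
typedef struct { double Tppe, Tspe, Ooffload, C; } mmgp_profile_t;
typedef struct { int processes, spes_per_process; } mmgp_config_t;

static mmgp_profile_t profile;
static void pbpi_run_generation(void)      { /* one MCMC generation  */ }
static void mmgp_record(mmgp_profile_t *p) { /* accumulate timings   */ }
static mmgp_config_t mmgp_best_configuration(const mmgp_profile_t *p)
{
    mmgp_config_t c = { 1, 6 };  /* configuration with lowest predicted T */
    return c;
}
static void apply_configuration(mmgp_config_t c) { /* remap processes/SPEs */ }

int main(void)
{
    /* 1. Sampling phase: run a short window and profile the MMGP terms. */
    for (int g = 0; g < SAMPLING_GENERATIONS; g++) {
        pbpi_run_generation();
        mmgp_record(&profile);
    }

    /* 2. Model: pick the (processes, SPEs per process) pair with the
          smallest predicted execution time.                             */
    mmgp_config_t best = mmgp_best_configuration(&profile);
    apply_configuration(best);

    /* 3. Actuation: finish the remaining generations with that mapping. */
    for (int g = SAMPLING_GENERATIONS; g < TOTAL_GENERATIONS; g++)
        pbpi_run_generation();

    printf("ran with %d processes x %d SPEs\n",
           best.processes, best.spes_per_process);
    return 0;
}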

6.5 Chapter Summary

The introduction of accelerator-based parallel architectures complicates the problem of map-

ping algorithms to systems, since parallelism can no longer be considered as a one-dimensional

abstraction of processors and memory. We presented a new model of multi-dimensional parallel

computation, MMGP, which we introduced to relieve users from the arduous task of mapping

parallelism to accelerator-based architectures. We have demonstrated that the model is fairly

accurate, albeit simple, and that it is extensible and easy to specialize for a given architecture.

We envision three uses of MMGP: i) As a rapid prototyping tool for porting algorithms to


accelerator-based architectures. More specifically, MMGP can help users derive not only a de-

composition strategy, but also an actual mix of programming models to use in the application in

order to best utilize the architecture, while using architecture-independent programming tech-

niques. ii) As a compiler tool for assisting compilers in deriving efficient mappings of programs

to accelerator-based architectures automatically. iii) As a runtime tool for dynamic control of

parallelism in applications, whereby the runtime system searches for optimal program configu-

rations in the neighborhood of optimal configurations derived by MMGP, using execution time

sampling or prediction-based techniques.


Chapter 7

Scheduling Asymmetric Parallelism on a

PS3 Cluster

7.1 Introduction

Cluster computing is already feeling the impact of multi-core processors [30]. Several highly

ranked entries of the latest Top-500 list include clusters of commodity dual-core processors1.

The availability of abundant chip-level and board-level parallelism changes fundamental as-

sumptions that developers currently make while writing software for HPC clusters. While recent

work has improved our understanding of the implications of small-scale symmetric multi-core

processors on cluster computing [7], emerging asymmetric multi-core processors such as the

Cell/BE and boards with conventional processors and hardware accelerators –such as GPUs–,

are rapidly making their way into HPC clusters [94]. There are strong incentives that support

this trend, not the least of which is higher performance with higher energy-efficiency made

possible through asymmetric, rather than symmetric multi-core processor organizations [57].

Understanding the implications of asymmetric multi-core processors on cluster computing

and providing models and software support to ease the migration of parallel programs to these

1 http://www.top500.org


platforms is a challenging and relevant problem. This study makes four contributions:

i) We conduct a performance analysis of a Linux cluster of Sony PlayStation3 (PS3) nodes.

To the best of our knowledge, this is the first study to evaluate this cost-effective and unconven-

tional HPC platform with microbenchmarks and realistic applications from the area of bioin-

formatics. The cluster we used has 22 PS3 nodes connected with a GigE switch and was built

at Virginia Tech for less than $15,000. We first evaluate the performance of MPI collective

and point-to-point communication on the PS3 cluster, and explore the scalability of MPI com-

munication operations under contention for bandwidth within and across PS3 nodes. We then

evaluate performance and scalability of the PS3 cluster with bioinformatics applications. Our

analysis reveals the sensitivity of computation and communication to the mapping of asym-

metric parallelism to the cluster and the importance of coordinated scheduling across multiple

layers of parallelism. Optimal scheduling of MPI codes on the PS3 cluster requires coordi-

nated scheduling and mapping of at least three layers of parallelism (two layers within each

Cell processor and an additional layer across Cell processors), whereas the optimal mapping

and schedule changes with the application, the input data set, and the number of nodes used for

execution.

ii) We adapt and validate MMGP on the PS3 cluster. We model a generic heterogeneous clus-

ter built from compute nodes with front-end host cores and back-end accelerator cores. The

extended model combines analytical components with empirical measurements, to navigate the

optimization space for mapping MPI programs with nested parallelism on the PS3 cluster. Our

evaluation of the extended MMGP model shows that it estimates execution time with an average

error rate of 5.2% on a cluster composed of PlayStation3 nodes. The model captures the effects

of application characteristics, input data sets, and cluster scale on performance. Furthermore,

the model pinpoints optimal mappings of MPI applications to the PS3 cluster with remarkable

accuracy.

iii) Using the cluster of Playstation3 nodes, we analyze earlier proposed user-level scheduling

heuristics for co-scheduling threads (Chapter 5). We show that co-scheduling algorithms yield


significant performance improvements (1.7–2.7×) over the native OS scheduler in MPI appli-

cations. We also explore the trade-off between different co-scheduling policies that selectively

spin or yield the host cores, based on runtime prediction of task execution lengths on the accel-

erator cores.

iv) We present a comparison between our PS3 cluster and an IBM QS20 blade cluster (based on

Cell/BE), illustrating that despite important limitations in computational ability and the com-

munication substrate, the PS3 cluster is a viable platform for HPC research and development.

The rest of this Chapter is organized as follows: Section 7.2 presents our experimental

platform. Section 7.3 presents our performance analysis of the PS3 cluster. Section 7.4 presents

the extended model of hybrid parallelism and its validation. Section 7.5 presents co-scheduling

policies for clusters of asymmetric multi-core processors and evaluates these policies. Section

7.6 compares the PS3 cluster against an IBM QS20 Cell-blade cluster. Section 7.7 concludes

the chapter.

7.2 Experimental Platform

Our experimental platform for this thesis is a cluster of 22 PS3 nodes. Eight of these nodes were

available to us in dedicated mode. The PS3 nodes are connected

to a 1000BASE-T Gigabit Ethernet switch, which supports 96 Gbps switching capacity. Each

PS3 runs Linux FC5 with kernel version 2.6.16, compiled for the 64-bit PowerPC architecture

with platform-specific kernel patches for managing the heterogeneous cores of the Cell/BE.

The nodes communicate with LAM/MPI 7.1.1. We used the IBM Cell SDK 2.1 for intra-

Cell/BE parallelization of the MPI codes. The Linux kernel on the PS3 is running on top of a

proprietary hypervisor. Though some devices are directly accessed, the built-in Gigabit Ethernet

controller in the PS3 is accessed via hypervisor calls, therefore communication performance is

not optimized.


7.3 PS3 Cluster Scalability Study

7.3.1 MPI Communication Performance

As we extend our MMGP model for the cluster of PlayStation3 machines, the cost of MPI calls

becomes a more significant parameter of the prediction model than it would be for a single

machine. To study how the MMGP model scales for the new parallel computing environment,

we experimented with two real-world parallel applications, PBPI and RAxML.

To measure communication performance on the PS3 cluster, we use mpptest [56]. We

present mpptest results only for two MPI communication primitives, which dominate com-

munication time in our application benchmarks. Figure 7.1 shows the overhead of MPI_Allreduce(),

with various message sizes. Each data point represents a number and a distribution of MPI

processes between PS3 nodes. For any given number of PS3 nodes, we use 1 to 6 MPI pro-

cesses, using shared memory for communication within the PS3. Our evaluation methodology

stresses the impact of contention for communication bandwidth both within and across PS3

nodes. There is benefit in exploiting shared memory for communication between MPI pro-

cesses on each PS3. For example, collective operations between 8 processes running on 2 PS3

nodes are up to 30% faster than collective operations between 8 processes running across 8 PS3

nodes. However, there is also a noticeable penalty for oversubscribing each PPE with more

than two processes, due to OS overhead, despite our use of blocking shared memory commu-

nication within each PS3. Similar observations can be made for point-to-point communication

(Figure 7.2), albeit the effect of using shared-memory within each PS3 is less pronounced and

the effect of oversubscribing the PPE is more pronounced.
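For reference, the style of measurement that mpptest performs for a collective operation can be approximated with a few lines of MPI code. The sketch below is not the mpptest implementation; it is a minimal stand-in that times MPI_Allreduce() on an array of doubles and reports the mean per-call latency.

#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

/* Minimal stand-in for an mpptest-style measurement: time            */
/* MPI_Allreduce() on a vector of doubles and report the mean         */
/* per-call latency in microseconds.                                  */
int main(int argc, char **argv)
{
    int rank, iters = 1000, n = (argc > 1) ? atoi(argv[1]) : 1;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    double *in  = malloc(n * sizeof(double));
    double *out = malloc(n * sizeof(double));
    for (int i = 0; i < n; i++) in[i] = 1.0;

    MPI_Barrier(MPI_COMM_WORLD);
    double t0 = MPI_Wtime();
    for (int i = 0; i < iters; i++)
        MPI_Allreduce(in, out, n, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);
    double t1 = MPI_Wtime();

    if (rank == 0)
        printf("%d doubles: %.1f usec per MPI_Allreduce\n",
               n, (t1 - t0) / iters * 1e6);
    MPI_Finalize();
    return 0;
}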

7.3.2 Application Benchmarks

We evaluate the performance of two state-of-the-art phylogenetic tree construction codes, RAxML

and PBPI, on our PS3 cluster, described in Chapter 3. The applications have been painstakingly


(a) MPI_Allreduce() latency, one double. (b) MPI_Allreduce() latency, arrays of doubles.

Figure 7.1: MPI_Allreduce() performance on the PS3 cluster. Processes are distributed evenly between nodes. Each node runs up to 6 processes, using shared memory for communication within the node.



Figure 7.2: MPI_Send/Recv() latency on the PS3 cluster. Processes are distributed evenly between nodes. Each node runs up to 6 processes, using shared memory for communication within the node.

optimized for the Cell BE, using vectorization, loop unrolling and tiling, branch optimizations,

double buffering, and optimized numerical implementations of kernels utilizing fast single-

precision arithmetic to implement double-precision operations. The optimization process is

described in Chapter 4. Both RAxML and PBPI are capable of exploiting multiple levels of

PPE and SPE parallelism. We used a task off-loading execution model in the codes. The execu-

tion commences on the PPE, and SPEs are used for accelerating computation-intensive loops.

The off-loaded loops are parallelized across SPEs and vectorized within SPEs. The number of

PPE processes and the number of SPEs per PPE process are user-specified.

When the PPE is oversubscribed with more than two processes, the processes are scheduled

using an event-driven task-level parallelism (EDTLP) scheduler, described in Chapter 5. Each

process executes until the point it off-loads an SPE task and then releases the PPE, while waiting

for the off-loaded task to complete. The same process resumes execution on the PPE only after

all other processes off-load SPE tasks at least once. The RAxML and PBPI ports on the PS3


cluster are adaptations of the original MPI codes and are capable of executing in a distributed

environment. No algorithmic modifications have been applied to the applications.

For each application we use three data sets, briefly termed small, medium, and large. The

large data set occupies the entire memory of a PS3, minus memory used by the operating system

and the hypervisor. The medium and small data sets occupy 40% and 15% of the free memory

of the PS3 respectively. In PBPI the small, medium, and large data sets represent 218 species

with 1250, 3000, and 5000 nucleotides respectively. In RAxML, the small, medium, and large

datasets represent 42, 50, and 107 species respectively. We execute PBPI using weak scal-

ing, i.e. we scale the data set as we add more PS3 nodes, which is the recommended execution

mode. For RAxML we use strong scaling, since the application uses a master-worker paradigm,

where each worker performs independent, parameterized phylogenetic tree bootstrapping and

processes the entire tree independently. Workers are distributed between nodes to maximize

throughput. We perform 192 bootstraps, which is a realistic workload for real-world phyloge-

netic analysis.

Figure 7.3 illustrates the measured execution times of RAxML and PBPI on the PS3 clus-

ter. The predicted execution times on the same charts are derived from the extended MMGP

model, which is discussed in Section 7.4. We make three observations regarding measured

performance:

i) The PS3 cluster scales well under strong scaling (RAxML) and relatively well under weak

scaling (PBPI), for the problem sizes considered. PBPI is more communication-bound than

RAxML, as it involves several collective operations between executions of its Markov-chain

Monte Carlo kernel. We note that due to the hypervisor of the PS3 and the lack of Cell/BE-

specific optimization of the MPI library we used, the performance measurements on the PS3

cluster are conservative.

ii) The optimal layered decomposition of the applications is at the opposite ends of the opti-

mization space. RAxML executes optimally if the PPE on each PS3 is oversubscribed by 6

MPI processes, each off-loading simultaneously on 1 SPE. PBPI generally executes optimally


Figure 7.3: Measured and predicted performance of applications on the PS3 cluster. PBPI is executed with weak scaling. RAxML is executed with strong scaling. x-axis notation: Nnode - number of nodes, Nprocess - number of processes per node, NSPE - number of SPEs per process.


with 1 MPI process per PS3 using all 6 SPEs for data-parallel computation, as this configuration

reduces inter-PS3 communication volume and avoids PPE contention.

iii) Although the optimal layered decomposition does not change with the problem size for the

three data sets used in this experiment2, it changes with the scale of the cluster. When PBPI is

executed with 8 SPEs, the optimal operating point of the code shifts from 1 to 2 MPI processes

per node, each off-loading simultaneously on 3 SPEs. We have verified with an out-of-band

experiment that this shift is permanent beyond 8 SPEs. This shift happens because of the large

drop in the per-process overhead of MPI_Allreduce() (Figure 7.1), when 2 MPI processes

are packed per node, on 3 or more PS3 nodes. This drop is large enough to outweigh the over-

head due to contention between MPI processes on the PPE. The difficulty in experimentally

discovering the hardware and software implications on the optimal mapping of applications to

asymmetric multi-core clusters motivates the introduction of an analytical model presented in

the next section.

7.4 Modeling Hybrid Parallelism

We present an analytical model of layered parallelism on clusters of asymmetric multi-core

nodes, which is a generalization of a model of parallelism on stand-alone asymmetric multi-core

processors (MMGP), presented in Chapter 6. Our generalized model captures computation and

communication across nodes with host cores and acceleration cores. We specialize the model

for the PS3 cluster, to capture the overhead of non-overlapped DMA operations, wait times

during communication operations in the presence of contention for bandwidth both within and

across nodes, and non-overlapped scheduling overhead on the PPEs. In the rest of the section

we present an overview of the MMGP model and we discuss the extensions related to the context-

switch overhead, on-chip and inter-node communication.

2 The optimal decomposition does not change with the data set; however, as we show in Section 7.5, the optimal scheduling of an application may change with the data set.


We model the non-overlapped components of execution time on the Cell/BE’s PPE and SPE,

for single-threaded PPE code which off-loads to one SPE as:

T = (Tppe + Oppe) + (Tspe + Ospe)     (7.1)

where Tppe and Tspe represent non-overlapped computation, while Oppe and Ospe represent non-

overlapped overhead on the PPE and SPE respectively. We apply the aforementioned model for

phases of parallel computation individually. Phases are separated by collective communication

operations.

7.4.1 Modeling PPE Execution Time

The overhead on the PPE includes instructions and DMA operations to off-load data and code

to SPEs, and wait time for receiving synchronization signals from SPEs on the PPE.

Assuming that multiple PPE threads can simultaneously off-load computation, we introduce

an additional factor for context switching overhead on the PPE. This factor depends on the

thread scheduling algorithm on the PPE. In the general case, Oppe for code off-loaded from a

single PPE thread to l SPEs is modeled as:

Oppe = l · Ooff-load + Tcsw(p)     (7.2)

We assume that a single PPE thread off-loads to multiple SPEs sequentially and that the context

switching overhead is a function of the number of threads co-executed on the PPE, which is

denoted by p. Ooff−load is application-dependent and includes DMA setup overhead which

we measure with microbenchmarks. Tcsw depends on system software and includes context

switching overhead for p/C context switches, where C is the number of hardware contexts on the PPE.

The overhead per context switch is also measured with microbenchmarks.

If a hardware thread on the PPE is oversubscribed with multiple application threads, the


Figure 7.4: Four cases illustrating the importance of co-scheduling PPE threads and SPE threads. Threads labeled "P" are PPE threads, while threads labeled "S" are SPE threads. We assume that P-threads and S-threads communicate through shared memory. P-threads poll shared memory locations directly to detect if a previously off-loaded S-thread has completed. Striped intervals indicate yielding of the PPE, dark intervals indicate computation leading to a thread off-load on an SPE, light intervals indicate computation yielding the PPE without off-loading on an SPE. Stars mark cases of mis-scheduling.

computation time of each thread may increase due to on-chip resource contention. To accu-

rately model this case, we introduce a scaling parameter, α(p), to the PPE computation compo-

nent, which depends on the number of threads co-executed on the PPE. The PPE component of

the model therefore becomes α(p) · Tppe + Oppe. The factor α(p) is estimated using linear regres-

sion with one free parameter, the number of threads sharing a PPE hardware thread, and co-

efficients derived from training samples of Tppe taken during executions of a microbenchmark

that oversubscribes the PPE with 3-6 threads and executes a parameterized ratio of computation

to memory instructions.
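As an illustration of this step, the sketch below fits a simple linear form α(p) = a + b·p to measured slowdown samples by ordinary least squares. The sample values and the assumed linear form are placeholders; the actual regression used by MMGP may differ in both the training data and the functional form.

#include <stdio.h>

/* Fit alpha(p) = a + b*p by ordinary least squares, using measured      */
/* slowdown samples taken while oversubscribing the PPE with p = 3..6    */
/* threads.  The sample values below are placeholders.                   */
int main(void)
{
    double p[]     = { 3.0, 4.0, 5.0, 6.0 };  /* threads per PPE           */
    double alpha[] = { 1.4, 1.6, 1.9, 2.1 };  /* measured Tppe(p)/Tppe(1)  */
    int n = 4;

    double sp = 0, sa = 0, spp = 0, spa = 0;
    for (int i = 0; i < n; i++) {
        sp  += p[i];        sa  += alpha[i];
        spp += p[i] * p[i]; spa += p[i] * alpha[i];
    }
    double b = (n * spa - sp * sa) / (n * spp - sp * sp);
    double a = (sa - b * sp) / n;

    printf("alpha(p) ~ %.3f + %.3f * p\n", a, b);
    return 0;
}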


Figure 7.5: SPE execution

The formulation of Tppe derived thus far ignores additional wait time of threads on the PPE

due to lack of co-scheduling between a PPE thread and an SPE thread off-loaded from it. This

scenario arises when the PPE hardware threads are time-shared between application threads, as

shown in Figure 7.4(a). Ideal co-scheduling requires accurate knowledge of the execution time

of tasks on SPEs by both the operating system and the runtime system. This knowledge is not

generally available. Our model assumes an idealized co-scheduling scenario. SPE tasks for a

given phase of computation are assumed to be of the same execution length and are off-loaded in

bundles with as many tasks per bundle as the number of SPEs on a Cell/BE. We also assume that

the SPE execution time of the first task is long enough to allow for idealized co-scheduling, i.e.

each PPE thread that off-loads a task is rescheduled on the PPE in time to immediately receive

the signal from the corresponding finishing SPE task. We explore this scheduling problem in

Section 7.5 under more realistic assumptions and propose solutions.

7.4.2 Modeling the off-loaded Computation

Execution on SPEs is divided into stages, as shown in Figure 7.5. Tspe is modeled as:

Tspe = Tp + Ts (7.3)

Tp denotes the computation executed in parallel by more than one SPE. An example is a parallel

loop distributed across SPEs. Ts denotes the part of the off-loaded computation that is inherently


sequential and cannot be parallelized across SPEs.

When l SPEs are used for parallelization of off-loaded code, the Tspe term becomes:

Tspe = Tp / l + Ts     (7.4)

The accelerated execution on SPEs includes three more stages, shown in Figure 7.5. Trec

and Tsen account for PPE-SPE communication latency, while Tc captures the SPE overhead that

occurs when an SPE sends a message to, or receives a message from, the PPE. The per-byte latencies for

Trec, Tsen and Tc are application-independent and are obtained from microbenchmarks designed

to stress the PPE-SPE communication. Tp and Ts are application-dependent and are obtained

from a profile of a sequential run of the application, annotated with directives that delimit the

code off-loaded to SPEs.

7.4.3 DMA Modeling

Each SPE on the Cell/BE is capable of moving data between main memory and local storage,

while at the same time executing computation. To overlap computation and communication,

applications use loop tiling and double buffering, which are illustrated in pseudocode in Fig-

ure 7.6. When double-buffering is used, the off-loaded loop can be either computation or com-

1: DMA(Fetch Iteration 1, TAG1);
2: DMA_Wait(TAG1);
3: for( ... ){
4:     DMA(Fetch Iteration i+1, TAG1);
5:     compute(Iteration i);
6:     DMA_Wait(TAG1);
7:     DMA(Commit Iteration i, TAG2);
   }
8: DMA_Wait(TAG2);

Figure 7.6: Double buffering template for tiled parallel loops.


munication bound: if the amount of computation in a single iteration of the loop is sufficient

to completely mask the latency of fetching the data necessary for the next iteration, the loop is

computation bound. Otherwise the loop is communication bound.

Note that a parallel off-loaded loop can be described using Equation 7.4, independently of

whether the parallel part of the loop is computation or communication bound. In both cases, the

loop iterations are assumed to be distributed evenly across SPEs and blocking DMA accesses

can be interspersed with computation in the loop. With double buffering, the DMA request used

to fetch data for the first iteration, as well as the DMA request necessary to commit data to main

memory after the last iteration, can be neither overlapped with computation, nor distributed

(lines 2 and 8 in Figure 7.6). We capture the effect of blocking and non-overlapped DMA in the

model as:

Ospe = Trec + Tsen + Tc + TDMA (7.5)

The last term in Equation 7.5 covers the blocking DMAs performed within loop

iterations and the non-overlapped DMAs exposed when the loop is unrolled, tiled, and executed

with double buffering. We use static analysis of the code to capture the DMA sizes.

7.4.4 Cluster Execution Modeling

We generalize our model of a single asymmetric multi-core processor to a cluster by introducing

an inter-processor communication component as:

T = (Tppe + Oppe) + (Tspe + Ospe) + C     (7.6)

We further decompose the communication term C in communication latency due to each dis-

tinct type of communication pattern in the program, including point-to-point and all-to-all com-


munication. Assuming MPI as the programming model used to communicate across nodes or

between address spaces within nodes, we use mpptest to estimate the MPI communication

overhead for variable message sizes and communication primitives. The message sizes are

captured by static analysis of the application code.
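To make the use of Equations 7.1-7.6 concrete, the sketch below evaluates the extended model for every feasible (processes per node, SPEs per process) pair on a single PS3 and returns the configuration with the smallest predicted time. The parameter values and the alpha() and tcsw() helpers are hypothetical placeholders for quantities that would come from the application profile and the microbenchmarks.

#include <stdio.h>
#include <float.h>

/* Application-dependent parameters (from the sequential profile).   */
typedef struct {
    double Tppe;        /* non-overlapped PPE computation            */
    double Tp, Ts;      /* parallel and sequential SPE computation   */
    double Ooffload;    /* per-off-load overhead (DMA setup etc.)    */
    double Ospe;        /* Trec + Tsen + Tc + TDMA                   */
    double C;           /* MPI communication cost (from mpptest)     */
} mmgp_params_t;

/* Placeholders for microbenchmark-derived terms.                    */
static double alpha(int p) { return 1.0 + 0.2 * (p > 2 ? p - 2 : 0); }
static double tcsw(int p)  { return 0.00005 * (p / 2); }

/* Equation 7.6 with the PPE and SPE extensions:
   T = alpha(p)*Tppe + (l*Ooffload + Tcsw(p)) + (Tp/l + Ts) + Ospe + C */
static double mmgp_predict(const mmgp_params_t *a, int p, int l)
{
    double Oppe = l * a->Ooffload + tcsw(p);
    double Tspe = a->Tp / l + a->Ts;
    return alpha(p) * a->Tppe + Oppe + Tspe + a->Ospe + a->C;
}

int main(void)
{
    mmgp_params_t a = { 1.0, 40.0, 2.0, 0.001, 0.5, 3.0 };  /* placeholders */
    int best_p = 1, best_l = 1;
    double best = DBL_MAX;

    for (int p = 1; p <= 6; p++)            /* MPI processes per PS3  */
        for (int l = 1; p * l <= 6; l++) {  /* SPEs per process       */
            double t = mmgp_predict(&a, p, l);
            if (t < best) { best = t; best_p = p; best_l = l; }
        }
    printf("predicted best: %d processes x %d SPEs (%.2f s)\n",
           best_p, best_l, best);
    return 0;
}

In practice the same search would also range over the number of nodes and add the mpptest-derived communication term that corresponds to each configuration, much as the predictions shown in Figure 7.3 combine per-configuration communication costs with the computation terms.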

7.4.5 Verification

We verify our model by exhaustively executing PBPI and RAxML on all feasible layered de-

compositions that use 1 to 6 PPE threads, 1 to 6 SPEs per PPE and up to 8 PS3 nodes. Fig-

ure 7.3(a),(b) illustrates that the model is accurate both in terms of predicting execution time and

in terms of discovering optimal application decompositions and mappings for different cluster

scales and data sets. The optimal decomposition may vary across multiple dimensions, includ-

ing application characteristics, such as granularity of off-loaded tasks, and frequency and size

of communication and DMA operations, size and structure of the data set used in the applica-

tion, and number of nodes available to the application for execution. Accurate modeling of the

application under each scenario is valuable to tame the complexity of discovering the optimal

decomposition and mapping experimentally. In our test cases, the model achieves error rates

consistently under 15%. The mean error rate is 5.2%. The errors tend to be higher when the

PPE is oversubscribed with a large number of processes, due to errors in estimating the thread

interference factor. With respect to prediction accuracy for any given application, data set, and

number of PS3 nodes, the model predicts accurately the optimal configuration and mapping in

all 48 test cases.

7.5 Co-Scheduling on Asymmetric Clusters

Although our model projects optimal mappings of MPI applications on the PS3 cluster with

high accuracy, it is oblivious to the implications of user-level and kernel-level scheduling on


oversubscribed cores. More specifically, the model ignores cases in which PPE threads and

SPE threads are not co-scheduled when they need to synchronize through shared memory. We

explore user-level co-scheduling solutions to this problem.

The main objective of co-scheduling is to minimize slack time on SPEs, since SPEs bear

the brunt of the computation in practical cases. This slack is minimized whenever a thread off-

loaded to an SPE needs to communicate or synchronize with its originating thread at the PPE

and the originating thread is running on a PPE hardware context.

As illustrated in Figure 7.4, different scheduling policies can have a significant impact on co-

scheduling, slack, SPE utilization, and eventually performance. In Figure 7.4(a), PPE threads

are spinning while waiting for the corresponding off-loaded threads to return results from SPEs.

The time quantum allocated to each PPE thread by the OS can cause continuous mis-scheduling

of PPE threads with respect to SPE threads.

In Figure 7.4(b), the user-level scheduler uses a yield-if-not-ready policy, which forces each

PPE thread to yield the processor, whenever a corresponding off-loaded SPE thread is pend-

ing completion. This policy can be implemented at user-level by having PPE threads poll

shared-memory flags that matching SPE threads set upon completion. Figure 7.7 illustrates

the performance of this policy in PBPI and RAxML on a PS3 cluster, when the PPE on each

node is oversubscribed with 6 MPI processes, each off-loading on 1 SPE (recall that the PPE

is a two-way SMT processor). The results show that compared to a scheduling policy which

is oblivious to PPE-SPE co-scheduling (Linux scheduling policy), yield-if-not-ready achieves

a performance improvement of 1.7–2.7×, on a cluster composed of PS3 nodes. Yield-if-not-

ready bounds the slack by the time needed to context switch across p − 1 PPE threads, where p is the

total number of active PPE threads, but can still cause temporary mis-scheduling and slack, as

shown in Figure 7.4(c). Figure 7.4(d) illustrates an adaptive spinning policy, in which a thread

either spins or yields the processor, based on which thread is anticipated to offload the soonest

on an SPE. This policy uses a prediction which can be derived with various algorithms, the

simplest of which is using the execution length of the most recently off-loaded task from any


Figure 7.7: Performance of the yield-if-not-ready policy and the native Linux scheduler in PBPI and RAxML. x-axis notation: Nnode - number of nodes, Nprocess - number of processes per node, NSPE - number of SPEs per process.

given thread as a predictor of the earliest time that the same thread will be ready to off-load in

the future. The thread spins if it anticipates that it will be the first to off-load, otherwise it yields

the processor.
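The two user-level policies can be expressed compactly. The sketch below is a simplified, hypothetical rendering of the wait loop a PPE process executes after off-loading a task: completion_flag stands for the shared-memory flag set by the SPE, and the two predicted-readiness arguments of the adaptive variant stand for whatever run-length predictor is used.

#include <sched.h>

volatile int completion_flag;      /* set to 1 by the SPE on completion   */

/* Yield-if-not-ready: release the PPE whenever the off-loaded task
   has not completed yet.                                                 */
static void wait_yield_if_not_ready(void)
{
    while (!completion_flag)
        sched_yield();             /* let another PPE process off-load    */
}

/* Adaptive spinning: spin if this process is predicted to be the next
   one to off-load on an SPE, otherwise yield the PPE.                    */
static void wait_adaptive(double my_predicted_ready,
                          double earliest_other_ready)
{
    while (!completion_flag) {
        if (my_predicted_ready <= earliest_other_ready)
            ;                      /* spin: keep the PPE                  */
        else
            sched_yield();         /* someone else will off-load sooner   */
    }
}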

Although the aforementioned adaptive policy can reduce accelerator slack compared to the

yield-if-not-ready policy, it is still suboptimal, as it may mis-schedule threads due to variations

in the execution lengths of consecutive tasks off-loaded by the same thread, or variations in

the run lengths between any two consecutive off-loads on a PPE thread. We should also note


that better policies –with tighter bounds on the maximum slack– can be obtained if the user-

level scheduler is not oblivious to the kernel-level scheduler and vice versa. Devising and

implementing such policies is described in Chapter 8.

Figure 7.8 illustrates results when RAxML and PBPI are executed with various co-scheduling

policies. Both applications are executed with variable sequence length (x-axis), hence variable

SPE task sizes. In RAxML, Figure 7.8(a), adaptive spinning performs better for small data sets,

while yield-if-not-ready performs better for large data sets. In PBPI, Figure 7.8(b), adaptive

spinning outperforms yield-if-not-ready in all cases. In RAxML, there is variance in the length

of the off-loaded tasks, which increases with the size of the input sequence. This causes more

mis-scheduling when the adaptive policy is used. In PBPI, the task length does not vary, which

enables nearly optimal co-scheduling by the adaptive spinning policy. In general, the best co-

scheduling algorithm can improve performance by more than 10%. We emphasize that the opti-

mal co-scheduling policy changes with the dataset, therefore support for flexible co-scheduling

algorithms in system software is essential on the PS3 cluster.

7.6 PS3 versus IBM QS20 Blades

We compare the performance of the PS3 cluster to a cluster of IBM QS20 dual-Cell/BE blades

located at Georgia Tech. The Cell/BE processors on the QS20 have 8 active SPEs and possibly

other undisclosed microarchitectural differences. Furthermore, although both the QS20 cluster

and the PS3 cluster use GigE, communication latencies tend to be markedly lower on the QS20

cluster, first due to the absence of a hypervisor, which is a communication bottleneck on the PS3

cluster, and second due to exploitation of shared-memory communication between two Cell/BE

processors on each QS20, instead of one Cell/BE processor on each PS3.

We present selected experimental data points where the two platforms use the same number

of Cell processors. On the QS20 cluster, we use both Cell processors per node. Figure 7.9

illustrates execution times of PBPI and RAxML on the two platforms. We report the execution


Figure 7.8: Performance of different scheduling strategies in PBPI and RAxML.

time of the most efficient pair of application configuration and co-scheduling policy, on any

given number of Cell processors.

We observe that the performance of the PS3 cluster is reasonably close (within 14% to 27%

for PBPI and 11% to 13% for RAxML) to the performance of the QS20 cluster. The difference

is attributed to the reduced number of active SPEs per processor on the PS3 cluster (6 versus 8

for the QS20 cluster), and faster communication on the QS20 cluster. The difference between

the two platforms in RAxML is less than PBPI, as RAxML is not as communication-intensive.

Interestingly, if we compare datapoints with the same total number of SPEs (48 SPEs on 8

PS3’s versus 48 SPEs on 6 QS20’s), in RAxML the PS3 cluster outperforms the QS20 blade cluster.


Figure 7.9: Comparison between the PS3 cluster and an IBM QS20 cluster.

This result does not indicate superiority of the PS3 hardware or system software, as we apply

experimentally defined optimal decompositions and scheduling policies on both platforms. It

rather indicates the implications of layered parallelization. Oversubscribing the QS20 with 8

MPI processes (versus 6 on the PS3) introduces significantly higher scheduling overhead and

brings performance below that of the PS3. This result stresses our earlier observations on the

necessity of models and better schedulers for asymmetric multi-core clusters.

7.7 Chapter Summary

We evaluated a very low-cost HPC cluster based on PS3 consoles and proposed a model of

asymmetric parallelism and software support for orchestrating asymmetric parallelism extracted

from MPI programs on the PS3 cluster. While the Sony PlayStation has several limitations as


an HPC platform, including limited storage and limited support for advanced networking, it

has enough computational power compared to vastly more expensive multi-processor blades

and forms a solid experimental testbed for research on programming and runtime support for

asymmetric multi-core clusters, before migrating software and applications to production-level

asymmetric machines, such as the LANL RoadRunner. The model presented in this chapter

accurately captures heterogeneity in computation and communication substrates and helps the

user or the runtime environment map layered parallelism effectively to the target architecture.

The co-scheduling heuristics presented in this thesis increase parallelism and minimize slack

on computational accelerators.


Chapter 8

Kernel-Level Scheduling

8.1 Introduction

The ideal scheduling policy, which minimizes the context-switching overhead, assumes that

whenever an SPE communicates to the PPE, the corresponding PPE thread is scheduled and

running on the PPE. In Chapter 7 we discussed the possibility of predicting the next-to-run

thread on the PPE. We have implemented a prototype of the scheduling strategy capable of

predicting what process will be the next to run, and the produced results imply that predicting

the next thread to run might be difficult, especially if the off-loaded tasks exhibit high

variance in execution time.

As another approach which targets minimizing the context switching overhead on the PPE,

we investigate a user-level scheduler which is capable of influencing the kernel scheduling

decisions. We explore the scheduling strategies where the scheduler can not only decide when

a process should release the PPE, but also what will be the next process to run on the PPE. By

reducing the response time related to scheduling on the PPE, our new approach also reduces the

idle time that occurs on the SPE side, while waiting for the new off-loaded task. We call our

new scheduling strategy Slack-minimizer Event-Driven Scheduler (SLEDS).

Besides improving the overall performance, the new scheduling strategy enables more accu-


Figure 8.1: Upon completing the assigned tasks, the SPEs send a signal to the PPE processes through the ready-to-run list. The PPE process which decides to yield passes the data from the ready-to-run list to the kernel, which in turn can schedule the appropriate process on the PPE.

rate performance modeling. Although the MMGP model projects the most efficient mappings

of MPI applications on the Cell processor with high accuracy, it is oblivious to the implications

of user-level and kernel-level scheduling on oversubscribed cores. More specifically, the model

ignores cases in which PPE threads and SPE threads are not appropriately co-scheduled. The

scheduling policy where the PPE threads are not arbitrarily scheduled by the OS scheduler in-

troduces more regularity in the application execution and consequently improves the MMGP

predictions.

8.2 SLED Scheduler Overview

The SLED scheduler is invoked through user-level library calls which can easily be integrated

into an existing Cell application. An overview of the SLED scheduler is illustrated in Fig-

ure 8.1. Each SPE thread upon completing the assigned task sends its own pid to the shared

ready to run list, from where this information is further passed to the kernel. Using the knowl-

edge of the SPE threads that have finished processing the assigned tasks, the kernel can decide

what will be the next process to run on the PPE.

Although it is invoked through a user-level library, part of the scheduler resides in the ker-

nel. Therefore, the implementation of the SLED scheduler can be vertically divided into two


distinguishable parts:

1. The user-level library, and

2. The kernel code that enables accepting and processing the user-level information, which

is further used in making kernel-level scheduling decisions.


Figure 8.2: Vertical overview of the SLED scheduler. The user-level part contains the ready-to-run list, shared among the processes, while the kernel part contains the system call through which the information from the ready-to-run list is passed to the kernel.

Passing the information from the ready to run list to the kernel can be achieved in two ways:

• The information from the list can be read by the processes running on the PPE, and the

information can be passed to the kernel through a system call, or

• The ready to run list can be visible to the kernel and the kernel can directly read the

information from the list.

In the current study we followed the first approach, where the information is passed to the

kernel through a system call (see Figure 8.2). Placing the ready to run list inside the kernel will

be the subject of our future research. In the current implementation of the SLED scheduler, the

size of the list is constant and is equal to the total number of SPEs available on the system. Each

SPE is assigned an entry in the list. We further described the ready to run list organization in

the following section.


8.3 ready to run List

In the current implementation of the IBM SDK on the Cell BE, the local storage of every SPE

is memory-mapped to the address space of the process which has spawned the SPE thread.

Using DMA requests, the running SPE thread is capable of accessing the global memory of the

system. However, these accesses are restricted to the areas of main memory that belong to the

address space of the corresponding PPE thread. Therefore, if the SPE threads do not belong

to the same process, the only possibility of sharing a data structure among them is through the

globally shared memory segments which can reside in the main memory.

The ready to run list needs to be a shared structure accessible by all SPE threads (even if

they belong to different processes). Therefore, it is implemented as a part of a global shared

memory region. The shared memory region is attached to each process at the beginning of

execution.
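A minimal sketch of how such a globally shared region could be set up with System V shared memory is shown below. The key value, the structure layout, and the helper name are illustrative only and are not the exact structures used by the SLED scheduler.

#include <sys/ipc.h>
#include <sys/shm.h>
#include <sys/types.h>
#include <stdio.h>

#define NUM_SPES 6                     /* SPEs available on the PS3        */

/* One entry per SPE: the pid of the PPE process whose off-loaded task
   has completed, or 0 if the entry is empty.                              */
typedef struct {
    volatile pid_t ready_to_run[NUM_SPES];
} sleds_shared_t;

/* Attach (creating if necessary) the globally shared region that holds
   the ready_to_run list.  Every process calls this at start-up, so the
   list is visible across address spaces.                                  */
sleds_shared_t *attach_ready_to_run_list(void)
{
    key_t key = 0x51ED;                /* illustrative key                 */
    int id = shmget(key, sizeof(sleds_shared_t), IPC_CREAT | 0666);
    if (id < 0) { perror("shmget"); return NULL; }
    return (sleds_shared_t *) shmat(id, NULL, 0);
}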

8.3.1 ready to run List Organization

Initial observation suggests that the ready to run list should be organized in FIFO order,

i.e. the process corresponding to the SPE thread which was the first to finish processing a

task, should also be the first to run on the PPE. Nevertheless, the strict FIFO organization of

the scheduler might cause certain problems. Consider the situation where the PPE process

A has off-loaded, but the granularity of the off-loaded task is relatively small, and the SPE

execution finishes before process A had a chance to yield on the PPE side. If process B is

in the ready to run list waiting to be scheduled, in the case when the FIFO order is strictly

followed, process A will yield the PPE and process B will be scheduled to run on the PPE. In

the described scenario, the strict FIFO scheduling causes an extra context switch to occur (there

is no necessity for process A to yield the PPE to process B).

Therefore the SLED scheduler is not designed as a strictly FIFO scheduler. Instead, after off-

loading and before yielding, the process verifies that the off-loaded task is still executing. If the


SPE task has finished executing, instead of yielding, the PPE process will continue executing.

Following the described soft FIFO policy, it is possible that a process (we will call it A) upon

off-loading will not yield the PPE, but at the same time the off-loaded task upon completion will

write the pid of process A to the ready to run list. Because the pid is written to the list, at some

point process A will be scheduled by the SLED scheduler. However, when scheduled by the

SLED scheduler, process A might not have anything useful to process, since it did not yield

upon off-loading. To avoid the described situation, the process which decides not to yield upon

off-loading also needs to clear the field that has been filled in the ready to run list with its own

pid. Since multiple processes require simultaneous read/write access to the list, maintaining the

list in a consistent state requires the use of locks, which can introduce significant overhead.

Instead of allowing processes to access any field in the ready to run list, we found it more

efficient to assign each process an exclusive entry in the list. By not allowing

processes to share the entries in the ready to run list, we avoid any type of locking, which

significantly reduces the overhead of maintaining the list in a consistent state.

8.3.2 Splitting ready to run List

The ready to run list serves as a buffer through which the necessary off-loading related infor-

mation is passed to the kernel. Initially, the SLED scheduler was designed to use a single

ready to run list. However, in certain cases the single-list design forced the SLED scheduler to

perform process migration across the PPE execution contexts.

Consider the situation described in Figure 8.3. The off-loaded task which belongs to the

process P1 has finished processing on the SPE side (the pid of the process P1 has been written

to the ready to run list). Process P1 is bound to CPU1, but the process P2 which is running on

CPU2 off-loads, and initiates the context switch by passing the pid of process P1 to the kernel.

Since the context switch occurred on CPU2 and P1 is bound to run on CPU1, the kernel needs

to migrate process P1 to CPU2. Initially, we implemented a system call which performs the


process migration, and the design of this system call is outlined in Figure 8.4. The essential

step in this system call is outlined in Line 9, where the sched_migrate_task() function is invoked.

This is a kernel function which accepts two parameters: the task to be migrated, and the destination

CPU to which the task should migrate.


Figure 8.3: Process P1, which is bound to CPU1, needs to be scheduled to run by the scheduler that was invoked on CPU2. Consequently, the kernel needs to perform migration of the process P1 from CPU1 to CPU2.

The described process migration can introduce some drawbacks. Specifically, the sched_migrate_task()

function might be expensive due to the required locking of the run queues, and this

function can create an uneven distribution of processes across the available CPUs. To avoid the draw-

backs caused by the sched_migrate_task() function, we redesigned the ready to run list. Instead

of having a single ready to run list shared among all processes in the system, we assign one

ready to run list to each execution context on the PPE. In this way, only processes sharing the

execution context are accessing the same ready to run list. This mechanism is presented in

Figure 8.5. With the separate ready to run lists there is no more necessity for expensive task

migration, and also we avoid possible load imbalance on the PPE processor.
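The split can be captured by a small change to the shared structure: one sub-list per PPE hardware context, with each process reading only the sub-list of the context it is bound to. The layout below is a hypothetical sketch of this organization.

#include <sys/types.h>

#define NUM_CONTEXTS     2         /* PPE hardware contexts (SMT)          */
#define ENTRIES_PER_LIST 3         /* 6 SPEs split across 2 contexts       */

/* Each PPE hardware context owns a private ready_to_run sub-list.
   A process only reads the sub-list of the context it is bound to,
   so cross-context migration (and any locking) is never needed.           */
typedef struct {
    volatile pid_t ready_to_run[NUM_CONTEXTS][ENTRIES_PER_LIST];
} sleds_split_lists_t;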


1:  void migrate(pid_t next){
2:      struct task_struct *p; struct rq *rq_p; int this_cpu, p_cpu;
3:      p = find_process_by_pid(next);
4:      if (p){
5:          rq_p = task_rq(p);
6:          this_cpu = smp_processor_id();
7:          p_cpu = task_cpu(p);
8:          if (p_cpu != this_cpu && p != rq_p->curr){
9:              sched_migrate_task(p, this_cpu);
10:         }
11:         SLEDS_yield(next); ...
12: }

Figure 8.4: System call for migrating processes across the execution contexts. The function sched_migrate_task() performs the actual migration. The SLEDS_yield() function schedules the process to be the next to run on the CPU.

8.4 SLED Scheduler - Kernel Level

The standard scheduler used in the Linux kernel, starting from version 2.6.23, is the Completely

Fair Scheduler (CFS). CFS implements a simple algorithm based on the idea that at any given

moment in time, the CPU should be evenly divided across all active processes in the system.

While this is a desirable theoretical goal, in practice it cannot be achieved since at any moment

in time the CPU can serve only one process. For each process in the system, the CFS records

the amount of time that the process has been waiting to be scheduled on the CPU. Based on the

amount of time spent in waiting to be scheduled and the number of processes in the system, as

well as the static priority of the process, each process is assigned a dynamic priority. The dynamic

priority of a process is used to determine the time when and for how long the process will be

scheduled to run.

The structure used by the CFS for storing the active processes is a red-black tree. The

processes are stored in the nodes of the tree, and the process with the highest dynamic priority


Figure 8.5: The ready to run list is split in two parts. Each of the two sublists contains processes that share the execution context (CPU1 or CPU2). This approach avoids any possibility of expensive process migration across the execution contexts.

(which will be the first to run on the CPU) is stored in the leftmost node of the tree.

The SLED scheduler passes the information from the ready-to-run list to the kernel through

the SLEDS_yield() system call. SLEDS_yield() extends the standard kernel sched_yield()

system call by accepting an integer parameter pid which represents the process that should be the

next to run. A high-level overview of the SLEDS_yield() function is described in Figure 8.6(a)-

(c) (assuming that the passed parameter pid is nonzero). First, the process which

should be the next to run is pulled out from the running tree, and its static priority is increased

to the maximum value. The process is then returned to the running tree, where it will be stored

in the leftmost node (since it has the highest priority). After being returned to the tree, the

static priority of the process is decreased to the normal value. Besides increasing the static

priority of the process, we also increase the time that the process is supposed to run on the

CPU. Increasing the CPU time is important, since if a process is artificially scheduled to run

many times, it might exhaust all the CPU time that it was assigned by the Linux scheduler.


Figure 8.6: Execution flow of the SLEDS_yield() function: (a) the appropriate process is found in the running list (tree), (b) the process is pulled out from the list, and its priority is increased, (c) the process is returned to the list, and since its priority is increased it will be stored at the leftmost position.

In that case, although we are capable of scheduling the process to run on the CPU using the

SLEDS_yield() function, the process will almost immediately be switched out by the kernel.

Before it exits, the SLEDS_yield() function calls the kernel-level schedule() function which

initiates context switching.

We measured the overhead in the SLEDS_yield() system call caused by the operations per-

formed on the running tree. We found that SLEDS_yield() incurs an overhead of approximately

8% compared to the standard sched_yield() system call.


8.5 SLED Scheduler - User Level

Figure 8.7 outlines the part of the SLED scheduler which resides in the user space. Upon off-

loading, the process is required to call the SLEDS_Offload() function (Figure 8.7, Line 13). This

function polls a member of the structure signal in order to check if the SPE

has finished processing the off-loaded task. The structure signal resides in the local storage of an

SPE, and the process executing on the PPE knows the address of this structure and uses it for

accessing the members of the structure. While the SPE task is running, the stop field of the

structure signal is equal to zero, and upon completion of the off-loaded task, the value of this

field is set to one.

If the SPE has not finished processing the off-loaded task, the SLEDS_Offload() function

calls the _yield() function (Figure 8.7, Line 15). The _yield() function scans the

ready to run list searching for an SPE that has finished processing its assigned task (Figure 8.7,

Lines 3–10). Two interesting things can be noticed in the _yield() function. First, the

function scans only three entries in the ready to run list. The reason for this is that the list

is divided among the execution contexts on the PPE, as described in Section 8.3.2. Since the

presented version of the scheduler is adapted to the PlayStation3 (which contains a Cell processor

with only 6 SPEs), each ready to run list contains only 3 entries. Second, the list is scanned

at most N times (see Figure 8.7, Line 3), after which the process is forced to yield. If the N

parameter is relatively large, repeated scanning of the ready to run list becomes harmful for

the process executing on the adjacent PPE execution context. However, if the parameter N is

not large enough, the process might yield before having a chance to find the next-to-run process.

Although the results presented in Figure 8.8 show that the execution time of RAxML depends on N, a theoretical model capable of describing this dependence will be the subject of our future work. Currently, for RAxML we chose N to be 300, as this is the value which achieves the

most efficient execution in our test cases (Figure 8.8). For PBPI we did not see any variance in

execution times for values of N smaller than 1000. When N is larger than 1000, performance of


 1: void _yield() {
 2:     int next = 0, i, j = 0;
 3:     while (next == 0 && j < N) {
 4:         i = 0;
 5:         j++;
 6:         while (next == 0 && i < 3) {
 7:             next = ready_to_run[i];
 8:             i++;
 9:         }
10:     }
11:     SLEDS_yield(next);
12: }

13: void SLEDS_Offload() {
14:     while (((struct signal *)signal)->stop == 0) {
15:         _yield();
16:     }
17: }

Figure 8.7: Outline of the SLEDS scheduler: Upon off-loading, a process is required to call the SLEDS_Offload() function. SLEDS_Offload() checks if the off-loaded task has finished (Line 14), and if not, calls the _yield() function. _yield() scans the ready_to_run list, and yields to the next process by executing the SLEDS_yield() system call.

PBPI decreases due to contention caused by scanning the ready to run list.

8.6 Experimental Setup

To test the SLED scheduler we used the Cell processor built into the PlayStation3 console. As the operating system, we used a variant of the Linux 2.6.23 kernel, specially adapted to run on the PlayStation3. We also modified the kernel by introducing the system calls necessary for executing the SLED scheduler. We used SDK 2.1 to execute our applications on Cell.



Figure 8.8: Execution times of RAxML when the ready_to_run list is scanned between 50 and 1000 times. The x-axis represents the number of scans of the ready_to_run list; the y-axis represents the execution time. Note that the lowest value on the y-axis is 12.5, and the difference between the lowest and the highest execution time is 4.2%. The input file contains 10 species, each represented by 1800 nucleotides.

8.6.1 Benchmarks

In this section we describe the benchmarks used to test the performance of the SLED scheduler.

We compared the SLED scheduler to the EDTLP scheduler using microbenchmarks and real-

world bioinformatics applications, RAxML and PBPI.

The microbenchmarks we used are designed to imitate the behavior of real applications

utilizing the off-loading execution model. Using the microbenchmarks we aimed to determine

the dependence of the context-switch overhead on the size of the off-loaded tasks.

8.6.2 Microbenchmarks

The microbenchmarks we designed are composed of multiple MPI processes, and each process

uses an SPE for task off-loading. The tasks in each process are repeatedly off-loaded inside a

loop which iterates 1,000,000 times. The part of the process executed on the PPE only initiates

task off-loading and waits for the off-loaded task to complete. The off-loaded task executes

a loop which may vary in length. In our experiments we oversubscribe the PPE with 6 MPI



Figure 8.9: Comparison of the EDTLP and SLED schemes using microbenchmarks: Total execution time is measured as the length of the off-loaded tasks is increased.

processes.
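The following sketch illustrates the structure of the PPE side of such a microbenchmark. It is a simplified reconstruction, not the actual benchmark code: offload_task() and task_done() are hypothetical stand-ins for the SPE off-load and the completion check on the signal structure, and _yield() corresponds to the routine from Figure 8.7.

    #include <mpi.h>

    #define ITERATIONS 1000000

    extern void offload_task(int rank, long usec);  /* placeholder: start the SPE loop   */
    extern int  task_done(int rank);                /* placeholder: poll the signal flag */
    extern void _yield(void);                       /* SLEDS/EDTLP yield from Figure 8.7 */

    int main(int argc, char **argv)
    {
        int rank;
        long task_length_us = 20;                   /* varied across experiments */

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        for (long i = 0; i < ITERATIONS; i++) {
            offload_task(rank, task_length_us);     /* initiate the off-loaded task      */
            while (!task_done(rank))                /* the PPE side only waits ...       */
                _yield();                           /* ... yielding between polls        */
        }

        MPI_Finalize();
        return 0;
    }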

We compare the performance of the microbenchmarks using the SLED and EDTLP sched-

ulers. Figure 8.9 represents the total execution time of the microbenchmarks that are executed

with different lengths of the off-loaded tasks. For large task sizes the SLED scheduler outper-

forms EDTLP by up to 17%. However, when the size of the off-loaded task is relatively small,

the EDTLP scheme outperforms the SLED scheme by up to 29%, as represented in Figure 8.10.

We will use the example presented in Figure 8.11 to explain the behavior of the EDTLP and

SLED schemes for small task sizes. Assume that 3 processes, P1, P2, and P3 are oversubscribed

on the PPE. In the EDTLP scheme (Figure 8.11, EDTLP), upon off-loading, P1 yields and the

operating system decides which should be the next process to run on the PPE. Since the process

P1 was the first to off-load and yield, it is not likely that the same process will be scheduled

until all other processes have off-loaded and been switched out from the PPE. If the size of

the off-loaded task is relatively small, by the time the process P1 gets scheduled to run again

on the PPE, the off-loaded task will already be completed and the process P1 can immediately

continue running on the PPE.



Figure 8.10: Comparison of the EDTLP and SLED schemes using microbenchmarks: Total execution time is measured as the length of the off-loaded tasks is increased – task size is limited to 21µs.

Consider now the situation represented in Figure 8.11 (SLED), when the SLED scheduler is

used for scheduling the processes with small off-loaded tasks. Due to the additional complexity introduced by the SLED scheduler, the time necessary for the context switch to complete is increased. Consequently, the time interval for process P1 between off-loading and the next opportunity to run on the PPE also increases. Based on this analysis, we can conclude that for scheduling

the processes with relatively fine-grain off-loaded tasks (the execution time of a task is shorter

than 15µs), it is more efficient to use the EDTLP scheme than the SLED scheme.

For coarser task sizes (the execution time of a task is longer than 15µs), the SLED scheme

almost always outperforms the EDTLP scheme. The exceptions are certain task sizes which are exact multiples of the scheduling interval, as can be seen in Figure 8.9. The scheduling interval is the time after which a process will be scheduled to run on the PPE again, following off-loading. For these specific task sizes, the processes will be ready to run at the exact moment when they get scheduled on the PPE even with only the EDTLP scheme. We need to point out that

these situations are rare, and in real applications described in Section 8.6.3 and Section 8.6.4

we did not observe this behavior.



Figure 8.11: EDTLP outperforms SLED for small task sizes due to the higher complexity of the SLED scheme.


Figure 8.12: Comparison of the EDTLP scheme and the combination of SLED and EDTLP schemes using microbenchmarks. EDTLP is used for the task sizes smaller than 15µs.



Figure 8.13: Comparison of the EDTLP scheme and the combination of SLED and EDTLP schemes using microbenchmarks. EDTLP is used for the task sizes smaller than 15µs – task size is limited to 21µs.

To address the issues related to the small task sizes (when EDTLP outperforms SLED), we

combined the two schemes into a single scheduling policy. The EDTLP scheme is used when

the off-loaded tasks are smaller than 15µs. The results of the combined scheme are

presented in Figure 8.12 and Figure 8.13.
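A minimal sketch of how such a combined policy can be expressed is shown below. It is illustrative only: the 15µs threshold comes from the measurements above, while measured_task_length_us() is an assumed helper for obtaining the granularity of the off-loaded task, not a function from the actual code.

    #include <sched.h>

    #define TASK_THRESHOLD_US 15

    extern long measured_task_length_us(void);  /* assumed: observed task granularity */
    extern void _yield(void);                   /* SLEDS path from Figure 8.7 */

    static void combined_yield(void)
    {
        if (measured_task_length_us() < TASK_THRESHOLD_US)
            sched_yield();   /* fine-grain tasks: plain EDTLP behavior */
        else
            _yield();        /* coarse-grain tasks: SLEDS scans ready_to_run and biases the kernel */
    }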

8.6.3 PBPI

We also compared the performance of the two schemes, EDTLP and SLED, using the PBPI

application. As an input file for PBPI we used a set that contains 107 species and we varied

the length of the DNA sequence that represents the species. In the PBPI application, the length

of the input DNA sequence is directly related to the size of the off-loaded tasks. We varied the

length of the DNA sequence from 200 to 5,000. Figure 8.14 shows the execution time of

PBPI when the EDTLP and SLED scheduling schemes are used. In all experiments the configu-

ration for PBPI was 6 MPI processes and each process was assigned an SPE for off-loading the

expensive computation. As in the previous example, the EDTLP scheme outperforms the SLED scheme



Figure 8.14: Comparison of EDTLP and SLED schemes using the PBPI application. The application is executed multiple times with varying length of the input sequence (represented on the x-axis).

for small task sizes. Again we combined the two schemes, using EDTLP for task sizes smaller than 15µs and SLED for larger task sizes, and we present the obtained performance in Figure 8.15. The combined scheme consistently outperforms the EDTLP scheduler, and the highest difference we recorded is 13%.

8.6.4 RAxML

We executed RAxML with the input file that contained 10 species in order to compare the

EDTLP and SLED schedulers. As in the PBPI case, we varied the length of the input DNA sequence, as the size of the input sequence is directly related to the size of the off-loaded tasks. The length of the sequence in our experiments was between 100 and 5,000 nucleotides. In the case of RAxML, SLED outperforms EDTLP by up to 7%. As in the previous experiments, for relatively small task sizes the EDTLP scheme outperforms the SLED scheme, as shown in Figure 8.16 and Figure 8.17. For larger task sizes the SLED scheme outperforms EDTLP.

Again, by combining the two schemes we can achieve the best performance.



Figure 8.15: Comparison of EDTLP and the combination of SLED and EDTLP schemes using the PBPI application. The application is executed multiple times with varying length of the input sequence (represented on the x-axis).


Figure 8.16: Comparison of EDTLP and SLED schemes using the RAxML application. The application is executed multiple times with varying length of the input sequence (represented on the x-axis).



Figure 8.17: Comparison of EDTLP and the combination of SLED and EDTLP schemes using the RAxML application. The application is executed multiple times with varying length of the input sequence (represented on the x-axis).

8.7 Chapter Summary

In this chapter we investigated strategies aimed at reducing the scheduling overhead which occurs on the PPE side of the Cell BE. We designed and tested the SLED scheduler, which uses user-level off-loading-related information in order to influence kernel-level scheduling decisions.

On a PlayStation3, which contains 6 SPEs, we conducted a set of experiments comparing the

SLED and EDTLP scheduling schemes. For comparison we used the real scientific applications

RAxML and PBPI, as well as a set of microbenchmarks developed to simulate the behavior of

larger applications. Using the microbenchmarks, we found that the SLED scheme is capable of outperforming the EDTLP scheme by up to 17%. SLEDS performs better by up to 13% with PBPI, and up to 7% with RAxML. Note that a higher advantage of the SLED scheme is likely

to occur on the Cell BE with all 8 SPEs available (the Cell BE used in PS3 has only 6 SPEs

available) due to higher PPE contention and consequently higher context-switch overhead.


Chapter 9

Future Work

This chapter discusses directions of future work. The proposed extensions are summarized as

follows:

• We plan to extend the presented kernel-level scheduling policies by moving the ready-to-run list into the kernel and considering more scheduling parameters such as load

balancing and job priorities when making scheduling decisions.

• We plan to increase utilization of the host and accelerator cores by sharing the accelerators

among multiple tasks and by extending the loop-level parallelism to also include the host

core besides already considered accelerator cores.

• We plan to port more applications to Cell. Specifically, we will focus on streaming,

memory-intensive applications, and evaluate the capability of Cell to execute these applications. By using memory-intensive applications, we hope to get better insight into scheduling strategies which would enable efficient execution of communication-bound

applications on asymmetric processors. We consider this to be an important problem

since the memory and bus contention will grow rapidly as the number of cores in multi-

core asymmetric architectures increases.

• Most of the techniques presented in this thesis are not specifically designed for Cell or heterogeneous accelerator-based architectures; in our future work we plan to extend these techniques to homogeneous parallel architectures.

• Finally, we plan to extend the MMGP model by capturing the overhead caused by Element Interconnect Bus congestion, which can significantly limit the ability of Cell to

overlap computation and communication.

We expand on our plans for future work in the following sections.

9.1 Integrating ready-to-run list in the Kernel

As described in Chapter 8, the SLED scheduler spans both the kernel and the user space.

The ready-to-run list resides in the user space, and it is shared among all active processes. The

information from the ready-to-run list is passed to the kernel-level part of the SLED scheduler

through a system call. Based on the received information, the kernel part of the SLED scheduler

biases kernel scheduling decisions. In the remainder of this section we explain the possible drawbacks of having the ready-to-run list reside in the user space.

The timeline diagram of the SLED scheduler is presented in Figure 9.1 (upper figure). Each

process upon off-loading issues a call to the SLED scheduler. The scheduler iterates through the

ready-to-run list, in order to determine the pid of the next process. As presented in Figure 9.1

(upper figure), it is possible that all processes have been switched to SPE execution and the

scheduler will iterate through the list until one of the SPEs sends the signal to the ready-to-run

list. Therefore, it is likely that some idle time will occur (when no useful work is performed)

upon off-loading and before finding the next-to-run process,. Once it finds the next to run

process, the scheduler will switch to the kernel mode, and influence the kernel scheduler to run

the appropriate process.

The possible drawback of this scheme is that upon determining which process should be the

next to run, the system still needs to perform two context switches, between the user process



Figure 9.1: Upon completing the assigned tasks, SPEs send signals to PPE processes through the ready-to-run list. The PPE process which decides to yield passes the data from the ready-to-run queue to the kernel, which in turn can schedule the appropriate process on the PPE.


and the kernel, and between the kernel and the user process. In our future work we plan to allow the kernel to directly access the list, which would eliminate one context switch. In other words, by allowing the kernel to see the ready-to-run list, we overlap the first context switch with the idle time which occurs before one of the active processes is ready to be rescheduled on the PPE. When the next-to-run process is determined, the scheduler would already be in the kernel space and there would be only one context switch left to return execution to a specific process in the user space, see Figure 9.1 (bottom figure).

9.2 Load Balancing and Task Priorities

So far we have considered applying the MGPS and SLED schedulers only to a single applica-

tion. In our future work we plan to investigate the described scheduling strategies in the context of multi-program workloads. Since the schedulers are already designed to work in a distributed environment, using the schedulers with entirely separate applications should be relatively simple. However, we envision several challenges with multi-program workloads that could potentially influence the system performance.

First, using the SLED scheduler with a multi-program workload can cause load imbalance. The SLED scheduler contains two ready-to-run lists, and each list is shared among the processes running on a single CPU. Therefore, the scheduler needs to be capable of deciding how to

group the active processes across CPUs in order to minimize load imbalance. The grouping of

processes will depend on parameters such as granularities of the PPE and SPE tasks, PPE-SPE

communication, inter-node and on-chip communication. Furthermore, the scheduler needs to

be able to recognize when the load of the system has changed (for example when one of the

processes has finished executing), and appropriately reschedule the remaining tasks across the

available CPUs.

Besides being able to handle the load balancing issues, our future work will focus on in-

cluding support for real-time tasks in our scheduling policies. So far in our experiments all


processes were assumed to have the same priority. This is not the case in all situations, and one

example would be streaming video applications. While trying to increase the system throughput

with different process grouping and load-balancing policies, we might actually hurt the perfor-

mance of the real-time jobs in the system. A simple example would be if a real-time task is

grouped with processes that require a lot of resources. Although this might be the best grouping

decision for overall system performance, that particular real-time task might suffer performance

degradation. To address the mentioned issues, we plan to include multiple applications in our

experiments and focus more on load-balancing problems as well as real-time task priorities.

9.3 Increasing Processor Utilization

Our initial scheduling scheme, Event-Driven Task-Level Parallelization (EDTLP), reduces the

idle time on the PPE by forcing each process to yield upon off-loading, and assigning the PPE

to a process which is ready to do work on the PPE side. To further reduce the idle time on both the PPE and the SPEs, we developed the Slack-minimizer Event Driven Scheduler (SLEDS).

In our future work, as another approach for increasing utilization of SPEs, we plan to intro-

duce sharing of SPEs among multiple PPE threads. The processes in an MPI application are

almost identical, and the off-loaded parts of each process are exactly the same. Therefore, a sin-

gle SPE thread could potentially execute the off-loaded computation from multiple processes.

However, different processes cannot share the SPE threads, since the SPE threads exclusively

belong to the process which created them. Therefore, we plan to investigate another level of

parallelism on the Cell processor, namely thread-level parallelism. Within a single node,

instead of having multiple MPI processes, a parallel application would operate with multiple

threads which could share SPEs among themselves. Different processes would be used across

nodes. To further increase utilization of the PPE, we will consider extended loop-level scheduling policies which would also involve the PPE in the computation, in addition to the accelerator cores already used.


9.4 Novel Applications and Programming Models

In our thesis we used a limited number of applications that were able to benefit from the off-

loading execution approach. While it is obvious that many scientific (computationally expen-

sive) applications will benefit from the proposed execution models and scheduling strategies, in

our future work we plan to focus on applications with high-bandwidth requirements. Specifi-

cally, we plan to investigate the capability of accelerator-based architectures to execute applica-

tions such as database servers and network packet processing.

The mentioned applications are computationally intensive, but they also usually require high memory bandwidth because they stream large amounts of data. Besides being

extremely computationally powerful, Cell has a high-bandwidth bus which connects the on-

chip cores among themselves and with the main memory. While the high-bandwidth bus will

be capable of improving the performance of streaming applications, in the near future it might

become a bottleneck as the number of on-chip cores increases. Therefore, in our future work

we will focus on runtime systems which improve the execution of data-intensive applications

on asymmetric processors.

9.5 Conventional Architectures

The main focus of our thesis was heterogeneous, accelerator-based architectures. However, parallel architectures comprising homogeneous cores represent the majority of processors that are in use nowadays. When working with conventional, highly parallel architectures, it is likely that problems similar to those we faced on heterogeneous architectures will occur.

As with asymmetric architectures, applications designed for homogeneous parallel archi-

tectures need to be parallelized at multiple levels in order to achieve efficient execution. Ap-

plications with multiple levels of parallelism are likely to experience load imbalance, which

might result in poor utilization of chip resources. Therefore, we need to have techniques that


are capable of detecting and correcting these anomalies.

Most of the techniques we presented in this thesis are not bound to heterogeneous architectures. In our future work we plan to extend our scheduling and modeling work to homogeneous parallel architectures and to test it there. While scheduling approaches such as MGPS and S-MGPS might be relatively simple to apply to any kind of architecture, the MMGP modeling approach will require more detailed communication modeling. On the Cell architecture, because of the specifics of the SPE design, we were able to assume significant computation-communication overlap. This will obviously not be the case on architectures with conventional caches; therefore,

we will focus more on modeling communication patterns.

9.6 MMGP extensions

Another direction of our future work regarding the MMGP model is more accurate modeling of the off-loaded tasks, specifically of the DMA communication they perform. Each SPE on

the Cell/BE is capable of moving data between main memory and local storage, while at the

same time executing computation. To overlap computation and communication, applications

use loop tiling and double buffering.
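As an illustration of the kind of code this extension would need to model, the sketch below shows a common double-buffering pattern on an SPE using the standard spu_mfcio.h DMA intrinsics. The buffer size, the process() routine and the address arithmetic are assumptions for the example, not code taken from our applications.

    #include <stdint.h>
    #include <spu_mfcio.h>

    #define CHUNK 4096   /* bytes per DMA transfer (illustrative) */

    extern void process(char *buf, int size);   /* assumed computation kernel */

    void stream(uint64_t ea, int nchunks)
    {
        static char buf[2][CHUNK] __attribute__((aligned(128)));
        int cur = 0;

        mfc_get(buf[cur], ea, CHUNK, cur, 0, 0);              /* prefetch the first chunk */
        for (int i = 0; i < nchunks; i++) {
            int nxt = cur ^ 1;
            if (i + 1 < nchunks)                              /* start the next transfer early */
                mfc_get(buf[nxt], ea + (uint64_t)(i + 1) * CHUNK, CHUNK, nxt, 0, 0);

            mfc_write_tag_mask(1 << cur);                     /* wait only for the current buffer */
            mfc_read_tag_status_all();

            process(buf[cur], CHUNK);                         /* computation overlaps the next DMA */
            cur = nxt;
        }
    }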

In this thesis, our MMGP model includes all blocking DMA requests that cannot be overlapped with computation. However, loop unrolling and increased DMA communication can influence the performance on a completely different architectural level. Although the Element Interconnect Bus (the structure that connects the cores on Cell) can achieve a bandwidth of over 200 GB/s, the processor-memory bandwidth is limited to 25 GB/s. When many SPEs work simultaneously the available bandwidth might not be sufficient. Consider a case where each SPE executes exactly the same loop – a realistic scenario when an off-loaded loop is parallelized across

multiple accelerators. If the off-loaded execution is synchronized, all SPEs will issue a DMA

request at the same time. Although the total bandwidth requirements might be less than 25

GB/s, when all SPEs simultaneously and synchronously perform memory communication the


requirements might exceed the available bandwidth. The described scenario is likely to occur

when significant loop unrolling is performed, due to heavily increased DMA communication

necessary for bringing data for the enlarged loop bodies. In our future work we plan to extend

the MMGP model by capturing all on-chip contention caused by high bandwidth requirements

of the off-loaded code.


Chapter 10

Overview of Related Research

10.1 Cell – Related Research

Cell has recently attracted considerable attention as a high-end computing platform. Recent

work on Cell covers modeling, performance analysis, programming and compilation environ-

ments, and application studies.

Kistler et al. [63] analyze the performance of Cell’s on-chip interconnection network and

provide insights into its communication and synchronization protocols. They present experi-

ments that estimate the DMA latencies and bandwidth of Cell, using microbenchmarks. They

also investigate the system behavior under different patterns of communication between local

storage and main memory. Their results show that the Cell communication network provides the speed and bandwidth that applications need to exploit the processor’s computational power. Williams et al. [98] present an analytical framework to predict performance on

Cell. In order to test their model, they use several computational kernels, including dense ma-

trix multiplication, sparse matrix vector multiplication, stencil computations, and 1D/2D FFTs.

In addition, they propose micro-architectural modifications that can increase the performance

of Cell when operating on double-precision floating point elements. Chen et al. [33] investigate communication (DMA) performance on the SPEs. They found a strong relation between


the size of the prefetching buffers allocated in local storage, and application performance. To

determine the optimal buffer size, they present a detailed analytical model of DMA accesses on

the Cell and use the model to optimize the buffer size for DMAs. To evaluate performance of

their model, they use a set of micro-kernels. Our work differs in that it considers the overall

performance implications of multigrain parallelization strategies on Cell.

Balart et al. [16] present a runtime library for asynchronous communication in the Cell BE

processor. The library is organized as a Software Cache and provides opportunities for over-

lapping communication and computation. They found that the fully associative scheme offers

better chances for communication-computation overlap. To evaluate their system they used

benchmarks from the HPCC suite. While their concern was the design and implementation of the off-loaded code, in our work we assume that the application is already Cell-optimized, and we focus on scheduling of already application-exposed parallelism.

Eichenberger et al. [39] present several compiler techniques targeting automatic generation

of highly optimized code for Cell. These techniques attempt to exploit two levels of parallelism,

thread-level and SIMD-level, on the SPEs. The techniques include compiler assisted memory

alignment, branch prediction, SIMD parallelization, OpenMP thread-level parallelization, and

compiler-controlled software caching. The study of Eichenberger et. al does not present details

on how multiple dimensions of parallelism are exploited and scheduled simultaneously by the

compiler. Our contribution addresses this issue. The compiler techniques presented in [39] are

complementary to the work presented in this paper. They focus primarily on extracting high

performance out of each individual SPE, whereas our work focuses on scheduling and orches-

trating computation across SPEs. Zhao and Kennedy [102] present a dependence-driven com-

pilation framework for simultaneous automatic loop-level parallelization and SIMDization on

Cell. They also implement strategies to boost performance by managing DMA data movement,

improving data alignment, and exploiting memory reuse in the innermost loop. To evaluate per-

formance of their techniques, Zhao and Kennedy use microbenchmarks. Similar to the results

presented in our study, they do not see linear speedup when parallelizing tasks across multiple


SPEs. The framework of Zhao and Kennedy does not consider task-level functional parallelism

and its coordinated scheduling with data parallelism, two central issues explored in this thesis.

Although Cell has been a focal point of numerous articles in the popular press, published re-

search using Cell for real-world applications beyond games was scarce until recently. Hjelte [58]

presents an implementation of a smooth particle hydrodynamics simulation on Cell. This simu-

lation requires good interactive performance, since it lies on the critical path of real-time appli-

cations such as interactive simulation of human organ tissue, body fluids, and vehicular traffic.

Benthin et al. [18] present an implementation of ray-tracing algorithms on Cell, also targeting high interactive performance. They show how to efficiently map the ray-tracing algorithm to Cell, with performance improvements of nearly an order of magnitude over conventional processors. However, they found that for certain algorithms Cell does not perform well due to frequent memory accesses. Petrini et al. [73] recently reported experiences from porting

and optimizing Sweep3D on Cell, in which they consider multi-level data parallelization on the

SPEs. They heavily optimized Sweep3D for Cell and achieved impressive performance of 9.3

Gflops/s for double precision and 50 Gflops/s for single precision floating point computation.

Contrary to their conclusion that the memory performance and the data communication patterns

play central role in Sweep3D, we were able to achieve complete communication-computation

overlap in the bioinformatics code we ported to Cell. The same authors presented a study of graph exploration algorithms on Cell [75]. They investigated the suitability of the breadth-first search (BFS) algorithm on the Cell BE. The achieved performance is an order of magnitude better compared to conventional architectures. Bader et al. [13] examine the implementation of list

ranking algorithms on Cell. List ranking is a challenging algorithm for Cell due to its highly

irregular patterns. When utilizing the entire Cell chip, they reported an overall speedup of 8.34

over a PPE-only implementation of the same algorithm. Recently, several Cell studies have been conducted as a result of the 2007 IBM Cell Challenge. Moorkanikara-Nageswaran et al. [1] developed a brain-circuit simulation on a PS3 node. As part of the same contest, De Kruijf ported the MapReduce [38] algorithm to Cell. The main goal of our work is both to develop and opti-


mize applications for Cell and develop system software tools and methodologies for improving

performance on the Cell architecture across application domains. We use a case study from

bioinformatics to understand the implications of static and dynamic multi-grain parallelization

on Cell.

10.2 Process Scheduling - Related Research

Dynamic and off-line process scheduling which improves the performance and overall through-

put of the system has been a very active research area. With the introduction of multi-core systems, many scheduling-related studies have been conducted targeting performance improve-

ment on these novel systems. We list several contributions in this area.

Anderson et al. [9] argue that the performance of kernel-level threads is inherently worse

than the performance of user-level threads. While user-level threads are essential for high per-

formance computation, kernel-level threads, which support user-level threads, are a poor kernel-

level abstraction due to inherently bad performance. The authors propose a new kernel interface and a user-level thread package that together provide the same functionality as kernel threads

while at the same time the performance of their thread library is comparable to that of any

user-level thread library.

Siddha et al. [82] conducted a thorough study on possible scheduling strategies on emerg-

ing multi-core architectures. They consider different multi-core topologies and the associ-

ated power management technologies, and try to point to possible tradeoffs when performing

scheduling on these novel architectures. They focus on symmetric processors and do not con-

sider any of the asymmetric architectures. Somewhat similarly to the results obtained from our study

(using asymmetric cores), they conclude that the most efficient performance can be achieved by

making the process scheduler aware of the multi-core topologies and the task characteristics.

Fedorova et al. [42] designed a kernel-level scheduling algorithm aimed at improving the performance of multi-core architectures with shared levels of cache. The motivation for their


work comes from the fact that applications on multi-core systems are performance-dependent

on the behavior of their co-runners. This performance dependency occurs as a consequence of

shared on-chip resources, such as the cache. Their algorithm ensures that the processes always

run as quickly as they would if the cache was fairly shared among all co-running processes. To

achieve this behavior they adjust the CPU timeslices assigned to the running processes by the

kernel scheduler.

Calandrino et al. [26] developed an approach for scheduling soft real-time periodic tasks in Linux on asymmetric multi-core processors. Their approach performs dynamic scheduling of real-time tasks, while at the same time attempting to provide good performance for non-real-time

processes. To evaluate their approach they used a Linux scheduler simulator, as well as the real

Linux operating system running on a dual-core Intel Xeon processor.

Settle et al. [80] proposed the memory monitoring framework, architectural support that provides cache resource information to the operating system. The authors introduce the concept

of an activity vector which represents a collection of event counters for a set of contiguous cache

blocks. Using runtime information the operating system can improve the process scheduling.

Their scheme schedules threads based on run-time cache use and miss patterns for each active hardware thread. Their techniques improve system performance by 5%. The performance improvements are caused by an increased cache hit rate.

Thekkath and Eggers [93] tested a hypothesis that scheduling threads that share data on the

same processor will decrease compulsory and invalidation misses. They evaluated a variety

of thread placement algorithms. Their workload was composed of fourteen parallel programs

that are representative of real-world scientific applications. They found that placing threads

that share data on the same processor does not have any impact on performance. Instead, the

performance was mostly affected by thread load balancing.

Rajagopalan et al. [76] introduce a scheduling framework for multi-core processors that aims to achieve a balance between control over the system and the level of abstraction. Their

framework uses high-level information supplied by the user to guide thread scheduling and also,


where necessary, gives the programmer fine control over thread placement.

Snavely and Tullsen [83] designed the SOS (Sample, Optimize, Symbios) scheduler – an OS-level scheduler that dynamically chooses the best scheduling strategy in order to increase

the throughput of the system. The SOS scheduler samples the space of possible process com-

binations and collects values of the hardware counters for different scheduling combinations.

The scheduler applies heuristics to the collected counters in order to determine the most effi-

cient scheduling strategy. The scheduler is designed for the SMT architecture and is capable

of improving system performance by up to 17%. The same authors extend their initial work

by introducing job priorities [84]. While different jobs might have various priorities from the

user’s perspective, the SOS scheduler might be unaware of that. In this way while trying to

improve the system throughput, the SOS scheduler might increase the response time of high

priority jobs.

Sudarsan et al. [90] developed ReSHAPE, a runtime scheduler for dynamic resizing of parallel applications executed in a distributed environment. MPI-based applications using the ReSHAPE framework can expand or shrink depending on the availability of the underlying hardware. Using ReSHAPE they demonstrated improvements in job turn-around time and overall system throughput. McCann et al. [67] propose a dynamic processor-scheduling policy for multiprogrammed shared-memory multiprocessors. Their scheduling policy also assumes multiple independent processes and is capable of allocating processors from one parallel job to another based on the requirements of the parallel jobs. The authors show that it is possible to beneficially run low-priority jobs on the same CPU with high-priority jobs, without hurting the high-priority jobs. Their new scheduling scheme can improve system performance by up to 40%.

Curtis-Maury et al. [36] present a prediction model for identifying energy-efficient operating points of concurrency in multithreaded scientific applications. Their runtime system optimizes applications at runtime, by using live analysis of hardware event rates. Zhang et al. [100] developed an OpenMP-based loop scheduler that selects the number of threads to use per processor based on sample executions of each possibility. The authors extend that work to incorporate decision-tree-based prediction of the optimal number of threads to use [101]. Springer et al. [85] developed a scheduler that conforms to two conditions: the scheduling strategy satisfies an external upper limit on energy consumption and minimizes the execution time. The execution model chosen by their scheduler is usually within 2% of optimal.

10.3 Modeling – Related Research

In this section we review related work in programming environments and models for paral-

lel computation on conventional homogeneous parallel systems and programming support for

nested parallelization. The list of related work in models for parallel computation is by no

means complete, but we believe it provides adequate context for the model presented in this

thesis.

10.3.1 PRAM Model

Fortune and Wyllie presented a model based on random access machines operating in parallel and sharing a common memory [46]. They model the execution of a finite program on a PRAM (parallel random access machine), which consists of an unbounded set of processors connected through an unbounded global shared memory. The model is rather simple but not realistic for modern multicore processors, since it assumes that all processors work synchronously and that interprocessor communication is free. PRAM also does not consider network congestion. There are several variants of the PRAM model: (i) the EREW (exclusive read exclusive write) model does not allow simultaneous execution of read or write operations, (ii) the CREW (concurrent read exclusive write) model allows simultaneous reading but prevents simultaneous writing, and (iii) the CRCW model allows both simultaneous read and simultaneous write operations. Cameron et al. [28]

describe two different implementations of CRCW PRAM: priority and arbitrary.

Several extensions of the PRAM model have been developed in order to make it more prac-


tical, but at the same time to preserve its simplicity [5, 6, 51, 62, 68, 72]. Aggarwal et al. [5] add communication latency to the PRAM model, while the same authors include reduced communication costs when blocks of data are transferred [6].

The original PRAM model assumes no asynchronous execution. The Asynchronous PRAM (APRAM) model includes the synchronization costs [51, 68]. APRAM contains four different types of instructions: global reads, global writes, local operations and synchronization steps. A synchronization step represents a global synchronization among processors.

10.3.2 BSP model

Valiant introduced the bulk-synchronous parallel model (BSP) [95], which is a bridging model

between parallel software and hardware. The BSP model is intended neither as a hardware nor

as a programming model, but something in between. The model is defined as a combination

of three attributes: 1) A number of components each performing processing and/or memory

functions; 2) A router that delivers point-to-point messages between the components; and 3)

Facilities for synchronizing all or a subset of components at regular intervals. The computa-

tion is performed in supersteps: each component is allocated a task, and all components are synchronized at the end of each superstep. BSP, as well as the other models mentioned, does not capture the overhead of context switching, which is a significant part of accelerator-

based execution and the MMGP model. BSP allows processors to work asynchronously and

models latency and limited bandwidth.
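For reference, the cost of a BSP superstep is commonly expressed in the standard textbook formulation (included here as background, not reproduced from this thesis) as

\[
T_{\text{superstep}} = w + g \cdot h + l,
\qquad
T_{\text{program}} = \sum_{s=1}^{S} \left( w_s + g \cdot h_s + l \right),
\]

where w is the maximum local computation performed by any component, h the maximum number of words any component sends or receives during the superstep, g the per-word communication gap, and l the cost of the barrier synchronization.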

Bäumker et al. [25] introduce extensions of the BSP model that add blockwise communication to parallel algorithms. A good parallel algorithm should communicate using a small number of large messages rather than a large number of small messages. Therefore, they introduce a new parameter B which represents the minimum message size needed to fully exploit the bandwidth of the router. Fantozzi et al. [41] introduce D-BSP, a model

where a machine can be divided into submachines capable of exploiting locality. Furthermore,


each submachine can execute different algorithms independently. Juurlink et al. [60] extend the

BSP model by providing a way to deal with unbalanced communication patterns, and by adding

a notion of general locality to the BSP model, where the delay of a remote memory access

depends on the relative location of the processors in the interconnection network.

10.3.3 LogP model

LogP [35] is another widely used machine-independent model for parallel computation. The

LogP model captures the performance of parallel applications using four parameters: the latency

(L), the overhead (o), the communication gap (g), which reflects the available per-processor bandwidth, and the number of processors (P).
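As a point of reference (standard formulations from the literature, not taken from this thesis), the end-to-end time of a single small message under LogP, and of a k-byte message under the LogGP extension discussed below, are commonly written as

\[
T_{\mathrm{LogP}} = o + L + o = 2o + L,
\qquad
T_{\mathrm{LogGP}}(k) = o + (k-1)G + L + o,
\]

where, in addition, successive message injections by the same processor must be separated by at least the gap g.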

The drawback of LogP is that it can accurately predict performance only when short mes-

sages are used for communication. Alexandrov et al. [8] propose the LogGP model, which is

an extension of LogP that supports large communication messages and high bandwidth. They

introduce an extra parameter G which captures the bandwidth obtained for large messages.

Ino et al. [59] introduce an extension of LogGP, named LogGPS. LogGPS improves the ac-

curacy of the LogGP model, by capturing the synchronization needed before sending a long

message by high-level communication libraries. They introduce a new parameter S, defined as

the threshold for message length, above which synchronous messages are sent. Frank et al. [47] extend the LogP model by capturing the impact of contention for message processing resources. Cameron et al. [27] extend the LogP model by modeling the point-to-point memory latencies

of inter-node communication in a shared memory cluster.

Traditional parallel programming models, such as BSP [95], LogP [35], PRAM [51] and de-

rived models [8,27,59,70] developed to respond to changes in the relative impact of architectural

components on the performance of parallel systems, are based on a minimal set of parameters to

capture the impact of communication overhead on computation running across a homogeneous

collection of interconnected processors. MMGP borrows elements from LogP and its deriva-

tives, to estimate performance of parallel computations on heterogeneous parallel systems with


multiple dimensions of parallelism implemented in hardware. A variation of LogP, HLogP [24],

considers heterogeneous clusters with variability in the computational power and interconnec-

tion network latencies and bandwidths between the nodes. Although HLogP is applicable to

heterogeneous multi-core architectures, it does not consider nested parallelism. It should be

noted that although MMGP has been evaluated on architectures with heterogeneous processors,

it can readily support architectures with heterogeneity in their communication substrates as well

(e.g. architectures providing both shared-memory and message-passing communication).

10.3.4 Models Describing Nested Parallelism

Several parallel programming models have been developed to support nested parallelism, in-

cluding nested parallel languages such as NESL [21], task-level parallelism extensions to data-

parallel languages such as HPF [89], extensions of common parallel programming libraries such

as MPI and OpenMP to support nested parallel constructs [29, 64], and techniques for combin-

ing constructs from parallel programming libraries, typically MPI and OpenMP, to better ex-

ploit nested parallelism [11,50,77]. Prior work on languages and libraries for nested parallelism

based on MPI and OpenMP is largely based on empirical observations on the relative speed of

data communication via cache-coherent shared memory, versus communication with message

passing through switching networks. Our work attempts to formalize these observations into

a model which seeks optimal work allocation between layers of parallelism in the application

and optimal mapping of these layers to heterogeneous parallel execution hardware. NESL [21]

and Cilk [22] are languages based on formal algorithmic models of performance that guaran-

tee tight bounds on estimating performance of multithreaded computations and enable nested

parallelization. Both NESL and Cilk assume homogeneous machines.

Subhlok and Vondran [88] present a model for estimating the optimal number of homoge-

neous processors to assign to each parallel task in a chain of tasks that form a pipeline. MMGP

has a similar goal of assigning co-processors to simultaneously active tasks originating from


the host processors; however, it also searches for the optimal number of tasks to activate on the host processors, in order to achieve a balance between supply from host processors and demand from co-processors. Sharapov et al. [81] use a combination of queuing theory and cycle-

accurate simulation of processors and interconnection networks, to predict the performance of

hybrid parallel codes written in MPI/OpenMP on ccNUMA architectures. MMGP uses a sim-

pler model, designed to estimate scalability along more than one dimension of parallelism on

heterogeneous parallel architectures.

Research on optimizing compilers for novel microprocessors, such as tiled and streaming

processors, has contributed methods for multi-grain parallelization of scientific and media com-

putations. Gordon et al. [53] present a compilation framework for exploiting three layers of

parallelism (data, task and pipelined) on streaming microprocessors running DSP applications.

The framework uses a combination of fusion and fission transformations on data-parallel com-

putations, to ”right-size” the degree of task and data parallelism in a program running on a

homogeneous multi-core microprocessor. MMGP is a complementary tool which can assist

both compile-time and runtime optimization on heterogeneous multi-core platforms. The de-

velopment of MMGP coincides with several related efforts on measuring, modeling and opti-

mizing performance on the Cell Broadband Engine [32, 75]. An analytical model of the Cell

presented by Williams et al. [97] considers execution of floating point code and DMA accesses

on the Cell SPE for scientific kernels parallelized at one level across SPEs and vectorized fur-

ther within SPEs. MMGP models the use of both the PPE and SPEs and has been demonstrated

to work effectively with complete application codes. In particular, MMGP factors the effects of

PPE thread scheduling, PPE-SPE communication and SPE-SPE communication into the Cell

performance model.


Bibliography

[1] http://www-304.ibm.com/jct09002c/university/students/contests/cell/index.html.

[2] http://www.rapportincorporated.com.

[3] The Cell project at IBM Research; http://www.research.ibm.com/cell.

[4] www.gpgpu.org.

[5] A. Aggarwal, A. K. Chandra, and M. Snir. On communication latency in PRAM computations. In SPAA '89: Proceedings of the First Annual ACM Symposium on Parallel Algorithms and Architectures, pages 11–21, New York, NY, USA, 1989. ACM.

[6] Alok Aggarwal, Ashok K. Chandra, and Marc Snir. Communication complexity of PRAMs. Theor. Comput. Sci., 71(1):3–28, 1990.

[7] S. Alam, R. Barrett, J. Kuehn, P. Roth, and J. Vetter. Characterization of scientific workloads on systems with multi-core processors. In Proc. of the IEEE International Symposium on Workload Characterization (IISWC), 2006.

[8] A. Alexandrov, M. Ionescu, C. Schauser, and C. Scheiman. LogGP: Incorporating Long Messages into the LogP Model: One Step Closer towards a Realistic Model for Parallel Computation. In Proc. of the 7th Annual ACM Symposium on Parallel Algorithms and Architectures, pages 95–105, Santa Barbara, CA, June 1995.

[9] Thomas E. Anderson, Brian N. Bershad, Edward D. Lazowska, and Henry M. Levy. Scheduler activations: effective kernel support for the user-level management of parallelism. ACM Trans. Comput. Syst., 10(1):53–79, 1992.

[10] K. Asanovic, R. Bodik, C. Catanzaro, J. Gebis, P. Husbands, K. Keutzer, D. Patterson, W. Plishker, J. Shalf, S. Williams, and K. Yelick. The Landscape of Parallel Computing Research: A View from Berkeley. Technical Report UCB/EECS-2006-183, EECS Department, University of California–Berkeley, December 2006.


[11] E. Ayguade, X. Martorell, J. Labarta, M. Gonzalez, and N. Navarro. Exploiting Multiple Levels of Parallelism in OpenMP: A Case Study. In Proc. of the 1999 International Conference on Parallel Processing (ICPP'99), pages 172–180, Aizu, Japan, August 1999.

[12] A. Azevedo, C. H. Meenderinck, B. H. H. Juurlink, M. Alvarez, and A. Ramirez. Analysis of video filtering on the Cell processor. In Proceedings of the ProRISC Conference, pages 116–121, November 2007.

[13] D. Bader, V. Agarwal, and K. Madduri. On the Design and Analysis of Irregular Algorithms on the Cell Processor: A Case Study on List Ranking. In Proc. of the 21st International Parallel and Distributed Processing Symposium, Long Beach, CA, March 2007.

[14] D. A. Bader, B. M. E. Moret, and L. Vawter. Industrial applications of high-performance computing for phylogeny reconstruction. In Proc. of SPIE ITCom, volume 4528, pages 159–168, 2001.

[15] David A. Bader, Virat Agarwal, Kamesh Madduri, and Seunghwa Kang. High performance combinatorial algorithm design on the Cell Broadband Engine processor. Parallel Comput., 33(10-11):720–740, 2007.

[16] Jairo Balart, Marc Gonzalez, Xavier Martorell, Eduard Ayguade, Zehra Sura, Tong Chen, Tao Zhang, Kevin O'Brien, and Kathryn O'Brien. A novel asynchronous software cache implementation for the Cell/BE processor. In The 20th International Workshop on Languages and Compilers for Parallel Computing, 2007.

[17] P. Bellens, J. Perez, R. Badia, and J. Labarta. CellSs: A Programming Model for the Cell BE Architecture. In Proc. of Supercomputing'2006, Tampa, FL, November 2006.

[18] Carsten Benthin, Ingo Wald, Michael Scherbaum, and Heiko Friedrich. Ray Tracing on the CELL Processor. Technical Report No. inTrace-2006-001, inTrace Realtime Ray Tracing GmbH (submitted for publication), 2006.

[19] F. Blagojevic, D. Nikolopoulos, A. Stamatakis, and C. Antonopoulos. Dynamic Multi-grain Parallelization on the Cell Broadband Engine. In Proc. of the 12th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, pages 90–100, March 2007.


[20] F. Blagojevic, A. Stamatakis, C. Antonopoulos, and D. Nikolopoulos. RAxML-CELL:Parallel Phylogenetic Tree Construction on the Cell Broadband Engine. In Proc. of the

21st International Parallel and Distributed Processing Symposium, March 2007.

[21] G. Blelloch, S. Chatterjee, J. Harwick, J. Sipelstein, and M. Zagha. Implementation of aPortable Nested Data Parallel Language. In Proc. of the 4th ACM SIGPLAN Symposium

on Principles and Practice of Parallel Programming (PPoPP’93), pages 102–112, SanDiego, CA, June 1993.

[22] R. Blumofe, C. Joerg, B. Kuszmaul, C. Leiserson, K. Randall, and Y. Zhou. Cilk: anEfficient Multithreaded Runtime System. In Proc. of the 5th ACM Symposium on Princi-

ples and Practices of Parallel Programming (PPoPP’95), pages 207–216, Santa Barbara,California, August 1995.

[23] Jeff Bolz, Ian Farmer, Eitan Grinspun, and Peter Schrooder. Sparse matrix solvers on thegpu: conjugate gradients and multigrid. ACM Trans. Graph., 22(3):917–924, 2003.

[24] J. Bosque and L. Pastor. A Parallel Computational Model for Heterogeneous Clusters.IEEE Transactions on Parallel and Distributed Systems, 17(12):1390–1400, December2006.

[25] Armin BŁumker and Wolfgang Dittrich. Fully dynamic search trees for an extension ofthe bsp model.

[26] John M. Calandrino, Dan Baumberger, Tong Li, Scott Hahn, and James H. Anderson.Soft real-time scheduling on performance asymmetric multicore platforms. In RTAS ’07:

Proceedings of the 13th IEEE Real Time and Embedded Technology and Applications

Symposium, pages 101–112, Washington, DC, USA, 2007. IEEE Computer Society.

[27] K. Cameron and X. Sun. Quantifying Locality Effect in Data Access Delay: MemoryLogP. In Proc. of the 17th International Parallel and Distributed Processing Symposium,Nice, France, April 2003.

[28] Kirk W. Cameron and Rong Ge. Predicting and evaluating distributed communication performance. In SC '04: Proceedings of the 2004 ACM/IEEE conference on Supercomputing, page 43, Washington, DC, USA, 2004. IEEE Computer Society.

[29] F. Cappello and D. Etiemble. MPI vs. MPI+OpenMP on the IBM SP for the NAS Benchmarks. In Proc. of the IEEE/ACM Supercomputing'2000: High Performance Networking and Computing Conference (SC'2000), Dallas, Texas, November 2000.

[30] L. Chai, Q. Gao, and D. K. Panda. Understanding the Impact of Multi-Core Architecture in Cluster Computing: A Case Study with Intel Dual-Core System. In Proc. of CCGrid2007, May 2007.

[31] Maria Charalambous, Pedro Trancoso, and Alexandros Stamatakis. Initial experiences porting a bioinformatics application to a graphics processor. In Panhellenic Conference on Informatics, pages 415–425, 2005.

[32] T. Chen, Z. Sura, K. O'Brien, and K. O'Brien. Optimizing the Use of Static Buffers for DMA on a Cell Chip. In Proc. of the 19th International Workshop on Languages and Compilers for Parallel Computing, New Orleans, LA, November 2006.

[33] Thomas Chen, Ram Raghavan, Jason Dale, and Eiji Iwata. Cell broadband engine architecture and its first implementation. IBM developerWorks, Nov 2005.

[34] Benny Chor and Tamir Tuller. Maximum likelihood of evolutionary trees: hardness and approximation. Bioinformatics, 21(1):97–106, 2005.

[35] D. Culler, R. Karp, D. Patterson, A. Sahay, K. Schauser, E. Santos, R. Subramonian, and T. Von Eicken. LogP: Towards a Realistic Model of Parallel Computation. In Proc. of the 4th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (PPoPP'93), pages 1–12, San Diego, California, May 1993.

[36] Matthew Curtis-Maury, Filip Blagojevic, Christos D. Antonopoulos, and Dimitrios S. Nikolopoulos. Prediction-based power-performance adaptation of multithreaded scientific codes. IEEE Transactions on Parallel and Distributed Systems.

[37] William J. Dally, Francois Labonte, Abhishek Das, Patrick Hanrahan, Jung-Ho Ahn, Jayanth Gummaraju, Mattan Erez, Nuwan Jayasena, Ian Buck, Timothy J. Knight, and Ujval J. Kapasi. Merrimac: Supercomputing with streams. In SC '03: Proceedings of the 2003 ACM/IEEE conference on Supercomputing, page 35, Washington, DC, USA, 2003. IEEE Computer Society.

[38] Jeffrey Dean and Sanjay Ghemawat. MapReduce: Simplified data processing on large clusters. In Proc. of the 6th Symposium on Operating Systems Design and Implementation (OSDI '04), pages 137–150, San Francisco, CA, December 2004.

[39] A. Eichenberger, Z. Sura, A. Wang, T. Zhang, P. Zhao, M. Gschwind, K. O'Brien, K. O'Brien, P. Wu, T. Chen, P. Oden, D. Prener, J. Shepherd, and B. So. Optimizing Compiler for the CELL Processor. In Proc. of the 14th International Conference on Parallel Architectures and Compilation Techniques, pages 161–172, Saint Louis, MO, September 2005.

[40] B. Flachs et al. The Microarchitecture of the Streaming Processor for a CELL Processor. Proceedings of the IEEE International Solid-State Circuits Symposium, pages 184–185, February 2005.

[41] Carlo Fantozzi, Andrea Pietracaprina, and Geppino Pucci. Translating submachine locality into locality of reference. J. Parallel Distrib. Comput., 66(5):633–646, 2006.

[42] Alexandra Fedorova, Margo Seltzer, and Michael D. Smith. Improving performance isolation on chip multiprocessors via an operating system scheduler. In PACT '07: Proceedings of the 16th International Conference on Parallel Architecture and Compilation Techniques, pages 25–38, Washington, DC, USA, 2007. IEEE Computer Society.

[43] J. Felsenstein. Evolutionary trees from DNA sequences: A maximum likelihood approach. Journal of Molecular Evolution, 17:368–376, 1981.

[44] J. Felsenstein. Evolutionary trees from DNA sequences: a maximum likelihood approach. J. Mol. Evol., 17:368–376, 1981.

[45] X. Feng, K. Cameron, B. Smith, and C. Sosa. Building the Tree of Life on Terascale Systems. In Proc. of the 21st International Parallel and Distributed Processing Symposium, Long Beach, CA, March 2007.

[46] Steven Fortune and James Wyllie. Parallelism in random access machines. In STOC '78: Proceedings of the tenth annual ACM symposium on Theory of computing, pages 114–118, New York, NY, USA, 1978. ACM.

[47] Matthew Frank, Anant Agarwal, and Mary K. Vernon. LoPC: Modeling contention in parallel algorithms. In Principles and Practice of Parallel Programming, pages 276–287, 1997.

[48] Bugra Gedik, Rajesh Bordawekar, and Philip S. Yu. CellSort: High performance sorting on the cell processor. In Proc. of the 33rd Very Large Databases Conference, pages 1286–1297, 2007.

[49] Bugra Gedik, Philip S. Yu, and Rajesh R. Bordawekar. Executing stream joins on the cell processor. In VLDB '07: Proceedings of the 33rd international conference on Very large data bases, pages 363–374. VLDB Endowment, 2007.

[50] A. Gerndt, S. Sarholz, M. Wolter, D. An Mey, C. Bischof, and T. Kuhlen. Particles and Continuum – Nested OpenMP for Efficient Computation of 3D Critical Points in Multiblock Data Sets. In Proc. of Supercomputing'2006, Tampa, FL, November 2006.

[51] P. Gibbons. A More Practical PRAM Model. In Proc. of the First Annual ACM Symposium on Parallel Algorithms and Architectures, pages 158–168, Santa Fe, NM, June 1989.

[52] M. Girkar and C. Polychronopoulos. The Hierarchical Task Graph as a Universal Intermediate Representation. International Journal of Parallel Programming, 22(5):519–551, October 1994.

[53] M. Gordon, W. Thies, and S. Amarasinghe. Exploiting Coarse-Grained Task, Data and Pipelined Parallelism in Stream Programs. In Proc. of the 12th International Conference on Architectural Support for Programming Languages and Operating Systems, pages 151–162, San Jose, CA, October 2006.

[54] Naga K. Govindaraju, Brandon Lloyd, Wei Wang, Ming Lin, and Dinesh Manocha. Fast computation of database operations using graphics processors. In SIGMOD '04: Proceedings of the 2004 ACM SIGMOD international conference on Management of data, pages 215–226, New York, NY, USA, 2004. ACM.

[55] W. Gropp and E. Lusk. Reproducible Measurements of MPI Performance Characteristics. In Proc. of the 6th European PVM/MPI User's Group Meeting, pages 11–18, Barcelona, Spain, September 1999.

[56] W. Gropp and E. Lusk. Reproducible Measurements of MPI Performance Characteristics. In Proc. of the 6th European PVM/MPI Users Group Meeting, pages 11–18, September 1999.

[57] M. Hill and M. Marty. Amdahl's Law in the Multi-core Era. Technical Report 1593, Department of Computer Sciences, University of Wisconsin-Madison, March 2007.

[58] Nils Hjelte. Smoothed particle hydrodynamics on the cell broadband engine. Master's thesis, Umea University, Department of Computer Science, Jun 2006.

[59] F. Ino, N. Fujimoto, and K. Hagihara. LogGPS: A Parallel Computational Model for Synchronization Analysis. In Proc. of the 8th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, pages 133–142, Snowbird, UT, June 2001.

[60] Ben H. H. Juurlink and Harry A. G. Wijshoff. The e-BSP model: Incorporating general locality and unbalanced communication into the BSP model. In Euro-Par, Vol. II, pages 339–347, 1996.

[61] W. Kahan. Lecture notes on the status of IEEE Standard 754 for binary floating-point arithmetic. 1997.

[62] Richard M. Karp, Michael Luby, and Friedhelm Meyer auf der Heide. Efficient PRAM simulation on a distributed memory machine. In STOC '92: Proceedings of the twenty-fourth annual ACM symposium on Theory of computing, pages 318–326, New York, NY, USA, 1992. ACM.

[63] Mike Kistler, Michael Perrone, and Fabrizio Petrini. Cell Multiprocessor Interconnection Network: Built for Speed. IEEE Micro, 26(3), May-June 2006. Available from http://hpc.pnl.gov/people/fabrizio/papers/ieeemicro-cell.pdf.

[64] G. Krawezik. Performance Comparison of MPI and three OpenMP Programming Styles on Shared Memory Multiprocessors. In Proc. of the 15th Annual ACM Symposium on Parallel Algorithms and Architectures, pages 118–127, San Diego, CA, June 2003.

[65] E. Scott Larsen and David McAllister. Fast matrix multiplies using graphics hardware. In Supercomputing '01: Proceedings of the 2001 ACM/IEEE conference on Supercomputing (CDROM), pages 55–55, New York, NY, USA, 2001. ACM.

[66] L-K. Liu, Q. Li, A. Natsev, K.A. Ross, J.R. Smith, and A.L. Varbanescu. Digital media indexing on the cell processor. In ICME 2007, pages 1866–1869. IEEE Signal Processing Society, July 2007.

[67] Cathy McCann, Raj Vaswani, and John Zahorjan. A dynamic processor allocation policy for multiprogrammed shared-memory multiprocessors. ACM Trans. Comput. Syst., 11(2):146–178, 1993.

[68] Kurt Mehlhorn and Uzi Vishkin. Randomized and deterministic simulations of PRAMs by parallel machines with restricted granularity of parallel memories. Acta Inf., 21(4):339–374, 1984.

[69] Barry Minor, Gordon Fossum, and Van To. Terrain rendering engine (tre), http://www.research.ibm.com/cell/whitepapers/tre.pdf. May 2005.

[70] C. Moritz and M. Frank. LoGPC: Modeling Network Contention in Message Passing Programs. In Proc. of the 1998 ACM SIGMETRICS Conference on Measurement and Modeling of Computer Systems, pages 254–263, Madison, WI, June 1998.

[71] PowerPC Microprocessor Family: Vector/SIMD Multimedia Extension Technology Programming Environments Manual. http://www-306.ibm.com/chips/techlib.

[72] Christos Papadimitriou and Mihalis Yannakakis. Towards an architecture-independent analysis of parallel algorithms. In STOC '88: Proceedings of the twentieth annual ACM symposium on Theory of computing, pages 510–513, New York, NY, USA, 1988. ACM.

[73] F. Petrini, G. Fossum, A. Varbanescu, M. Perrone, M. Kistler, and J. Fernandez Periador. Multi-core Surprises: Lessons Learned from Optimized Sweep3D on the Cell Broadband Engine. In Proc. of the 21st International Parallel and Distributed Processing Symposium, Long Beach, CA, March 2007.

[74] Fabrizio Petrini, Gordon Fossum, Mike Kistler, and Michael Perrone. Multicore Surprises: Lessons Learned from Optimizing Sweep3D on the Cell Broadband Engine.

[75] Fabrizio Petrini, Daniel Scarpazza, Oreste Villa, and Juan Fernandez. Challenges in Mapping Graph Exploration Algorithms on Advanced Multi-core Processors. In Proc. of the 21st International Parallel and Distributed Processing Symposium, Long Beach, CA, March 2007.

[76] Mohan Rajagopalan, Brian T. Lewis, and Todd A. Anderson. Thread scheduling for multi-core platforms. In HotOS 2007: Proceedings of the Eleventh Workshop on Hot Topics in Operating Systems, 2007.

[77] T. Rauber and G. Ruenger. Library Support for Hierarchical Multiprocessor Tasks. In Proc. of Supercomputing'2002, Baltimore, MD, November 2002.

[78] Daniele Paolo Scarpazza, Oreste Villa, and Fabrizio Petrini. Peak-performance dfa-based string matching on the cell processor. In IPDPS, pages 1–8. IEEE, 2007.

[79] Harald Servat, Cecilia Gonzalez, Xavier Aguilar, Daniel Cabrera, and Daniel Jimenez. Drug design on the cell broadband engine. In PACT '07: Proceedings of the 16th International Conference on Parallel Architecture and Compilation Techniques, page 425, Washington, DC, USA, 2007. IEEE Computer Society.

[80] Alex Settle, Joshua Kihm, Andrew Janiszewski, and Dan Connors. Architectural support for enhanced smt job scheduling. In PACT '04: Proceedings of the 13th International Conference on Parallel Architectures and Compilation Techniques, pages 63–73, Washington, DC, USA, 2004. IEEE Computer Society.

[81] I. Sharapov, R. Kroeger, G. Delamater, R. Cheveresan, and M. Ramsay. A Case Study in Top-Down Performance Estimation for a Large-Scale Parallel Application. In Proc. of the 11th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, pages 81–89, New York, NY, March 2006.

[82] Suresh Siddha, Venkatesh Pallipadi, and Asit Mallick. Process Scheduling Challenges in the Era of Multi-core Processors. Intel Technology Journal, 11(4), 2007.

[83] Allan Snavely and Dean M. Tullsen. Symbiotic jobscheduling for a simultaneous multithreaded processor. In ASPLOS-IX: Proceedings of the ninth international conference on Architectural support for programming languages and operating systems, pages 234–244, New York, NY, USA, 2000. ACM.

[84] Allan Snavely, Dean M. Tullsen, and Geoff Voelker. Symbiotic jobscheduling with priorities for a simultaneous multithreading processor. In SIGMETRICS '02: Proceedings of the 2002 ACM SIGMETRICS international conference on Measurement and modeling of computer systems, pages 66–76, New York, NY, USA, 2002. ACM.

[85] Robert Springer, David K. Lowenthal, Barry Rountree, and Vincent W. Freeh. Minimizing execution time in MPI programs on an energy-constrained, power-scalable cluster. In PPoPP '06: Proceedings of the eleventh ACM SIGPLAN symposium on Principles and practice of parallel programming, pages 230–238, New York, NY, USA, 2006. ACM.

[86] A. Stamatakis. Phylogenetic models of rate heterogeneity: A high performance computing perspective. In Proceedings of 20th IEEE/ACM International Parallel and Distributed Processing Symposium (IPDPS2006), High Performance Computational Biology Workshop, Proceedings on CD, Rhodos, Greece, April 2006.

[87] Alexandros Stamatakis. RAxML-VI-HPC: maximum likelihood-based phylogenetic analyses with thousands of taxa and mixed models. Bioinformatics, 22(21):2688–2690, 2006.

[88] J. Subhlok and G. Vondran. Optimal Use of Mixed Task and Data Parallelism for Pipelined Computations. Journal of Parallel and Distributed Computing, 60(3):297–319, March 2000.

[89] J. Subhlok and B. Yang. A New Model for Integrated Nested Task and Data Parallelism. In Proc. of the 6th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, pages 1–12, Las Vegas, NV, June 1997.

[90] Rajesh Sudarsan and Calvin J. Ribbens. ReSHAPE: A framework for dynamic resizing and scheduling of homogeneous applications in a parallel environment, 2007.

[91] Alias Systems. Alias cloth technology demonstration for the cell processor, http://www.research.ibm.com/cell/whitepapers/alias cloth.pdf. 2005.

[92] Cell broadband engine programming tutorial version 1.0; http://www-106.ibm.com/developerworks/eserver/library/es-archguide-v2.html.

[93] R. Thekkath and S. J. Eggers. Impact of sharing-based thread placement on multithreaded architectures. SIGARCH Comput. Archit. News, 22(2):176–186, 1994.

[94] John A. Turner. Roadrunner: Heterogeneous Petascale Computing for Predictive Simulation. Technical Report LANL-UR-07-1037, Los Alamos National Lab, Las Vegas, NV, February 2007. ASC Principal Investigator Meeting.

[95] L. Valiant. A bridging model for parallel computation. Communications of the ACM, 33(8):103–111, August 1990.

[96] Perry H. Wang, Jamison D. Collins, Gautham N. Chinya, Hong Jiang, Xinmin Tian, Milind Girkar, Nick Y. Yang, Guei-Yuan Lueh, and Hong Wang. Exochi: architecture and programming environment for a heterogeneous multi-core multithreaded system. In PLDI '07: Proceedings of the 2007 ACM SIGPLAN conference on Programming language design and implementation, pages 156–166, New York, NY, USA, 2007. ACM.

[97] S. Williams, J. Shalf, L. Oliker, S. Kamil, P. Husbands, and K. Yelick. The Potential of the Cell Processor for Scientific Computing. In Proc. of the 3rd Conference on Computing Frontiers, pages 9–20, Ischia, Italy, June 2006.

[98] Samuel Williams, John Shalf, Leonid Oliker, Shoaib Kamil, Parry Husbands, and Katherine Yelick. The Potential of the Cell Processor for Scientific Computing. ACM International Conference on Computing Frontiers, May 3–6, 2006.

[99] Samuel Williams, John Shalf, Leonid Oliker, Shoaib Kamil, Parry Husbands, and Katherine Yelick. Scientific computing kernels on the cell processor. Int. J. Parallel Program., 35(3):263–298, 2007.

[100] Yun Zhang, Mihai Burcea, Victor Cheng, Ron Ho, and Michael Voss. An adaptive openmp loop scheduler for hyperthreaded smps. In David A. Bader and Ashfaq A. Khokhar, editors, ISCA PDCS, pages 256–263. ISCA, 2004.

[101] Yun Zhang and Michael Voss. Runtime empirical selection of loop schedulers on hyperthreaded smps. In IPDPS '05: Proceedings of the 19th IEEE International Parallel and Distributed Processing Symposium (IPDPS'05) - Papers, page 44.2, Washington, DC, USA, 2005. IEEE Computer Society.

[102] Y. Zhao and K. Kennedy. Dependence-based Code Generation for a Cell Processor. In Proc. of the 19th International Workshop on Languages and Compilers for Parallel Computing, New Orleans, LA, November 2006.
