
Task Scheduling in Speculative Parallelization

David Burgo Baptista

Dissertation for obtaining the Master's Degree in Engenharia Informática e de Computadores

Jury

President: Doutor João António Madeiras Pereira

Supervisor: Doutor João Manuel Pinheiro Cachopo

Member: Doutor João Manuel Santos Lourenço

October 2011


Acknowledgements

Carrying a research project to completion is not an individual feat, and if there is value in the work I deliver here, it is merely the inevitable reflection of the quality of the researchers with whom I worked throughout its realization. Thus, I first leave in these words my sincere thanks to Prof. João Cachopo, mentor of this work and main addressee of these acknowledgements, to whom I owe all the hard work of guiding and accompanying the birth, exploration, and maturation of a scientific idea, as well as of its master's student. I also extend my thanks to all the remaining members of the ESW group, whose daily companionship has without a doubt left its unmistakable signature on this work.

With equal importance, I also thank all the people who accompanied me throughout this journey, which did not always follow the easiest or the shortest paths. I will omit the usual list of names and titles that typically accompany these sections; for, more than a collective acknowledgement, I owe each of you my personal thanks. You know who you are: thank you for your support.

This work was supported by an FCT research grant under the project RuLAM: Running Legacy Applications on Multicores, PTDC/EIA-EIA/108240/2008.

Lisbon, August 2011

David Baptista


The student searches the world for meaning. The master finds worlds of meaning in the search.

To my brother Duarte, and to Diana.


Resumo

Parallel programming using a transactional memory model is an alternative to the well-established lock-based practice. Besides presenting itself as an attractive tool for manual parallelization, it also enables new approaches in automatic parallelization systems, namely speculative parallelization systems. We are currently at the point where efficient and flexible transactional memory systems are available to the scientific community, so building speculative parallelization systems based on transactional memory is becoming possible. Consequently, a new set of challenges presents itself, namely regarding the decomposition of sequential programs into parallel tasks, the scheduling of these tasks, and conflict resolution, among others.

In this work, I explore in detail the problem of automatic parallelization, and particularly of speculative parallelization. At the core of this exploration, I derive a simple theoretical model from which I can identify weaknesses and strengths common to the automatic parallelization systems found in the literature. From this starting point, I identify the need to adopt a more generic speculation model, such as the one we can build using transactional memory support.

Within this scope, I address contention managers, as well as why they are fundamentally limited in their range of action in speculative parallelization systems based on transactional memory. I also discuss the role of task scheduling in those systems, and, combining these two topics, I propose the more generic concept of conflict-aware schedulers, which collect and use information about conflicts occurring between speculative tasks during program execution to generate more efficient schedules.

Advancing with this concept, I implement a conflict-aware scheduler and evaluate its efficacy in a speculative parallelization system, as well as in a generic benchmark for transactional memory systems. The results I obtain in this evaluation range from neutral to positive, indicating the need for further research in this area.


Abstract

Concurrent programming using transactional memory models is an alternative to the ubiquitously supported lock-based programming. Transactional memory suits itself well not only to manual parallelization, but also to automatic parallelization, particularly speculative parallelization. With the advent of efficient, general-purpose software transactional memory frameworks, this kind of automatic parallelization starts to become viable. In consequence, there is a new set of challenges that need to be addressed, concerning the decomposition of a sequential program into tasks, the scheduling of said tasks, and conflict resolution, among others.

In this work, I explore the problem of automatic parallelization, and particularly of speculative parallelization, in detail, and derive a basic model upon which I can identify common strengths and weaknesses in state-of-the-art systems. I delineate why there is a need for a more generic speculation model, such as the one we can build on top of a transactional memory runtime.

Within this scope, I discuss the role of contention managers and how they are poorly suited for speculative parallelization systems using transactional memory runtimes as speculation support, along with the role of scheduling in those systems, and therewith I propose a new, more generic concept, that of conflict-aware schedulers: task schedulers that collect and use information about conflicts occurring between speculative tasks.

Building on this foundation, I implement a conflict-aware scheduler and evaluate it on both a speculative parallelization system and a general-purpose software transactional memory benchmark, obtaining a range of neutral to positive results and opening the way for further research in this area.


Palavras-chave

Memória Transaccional

Execução Especulativa

Paralelização Automática

Ambientes Multiprocessador

Gestão de Contenção

Escalonamento

Keywords

Transactional Memory

Speculative Execution

Automatic Parallelization

Multicore Environments

Contention Management

Scheduling


Contents

1 Introduction 1

1.1 Computing Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2

1.2 Work Objectives . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2

1.3 Structure of the Document . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3

2 Related Work 5

2.1 Software Transactional Memory . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5

2.1.1 Transaction Nesting . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11

2.1.2 Contention Management . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12

2.2 Speculative Parallelization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15

2.3 Task Scheduling . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17

2.4 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18

3 Automatic Parallelization Systems 21

3.1 The Need for Parallelization Systems . . . . . . . . . . . . . . . . . . . . . . . . . . 21

3.2 A Basic Model of Automatic Parallelization Systems . . . . . . . . . . . . . . . . . 23

3.3 Speculative Parallelization Systems . . . . . . . . . . . . . . . . . . . . . . . . . . . 25

3.3.1 Speculation Support . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28

3.4 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31

4 Contention Management within Speculative Parallelization Systems 33

4.1 False Positives . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34

4.2 Reduced Freedom of Choice . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35

4.2.1 Acceptable Serialization Sequences . . . . . . . . . . . . . . . . . . . . . . 36


4.2.2 Influence on Contention Managers . . . . . . . . . . . . . . . . . . . . . . . 38

4.3 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38

5 Conflict-Aware Task Scheduling 41

5.1 Towards Intelligent Schedulers in Speculative Parallelization Systems . . . . . . . 41

5.2 Collecting Conflict Information . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45

5.3 Extending Jaspex with Conflict-Aware Task Scheduling . . . . . . . . . . . . . . 47

5.4 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48

6 Experimental Results 51

6.1 Results on Jaspex . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51

6.2 Results on STMBench7 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 53

7 Conclusion and Future Work 59


List of Figures

2.1 Protein folding scenario with a naive contention management policy. W denotes a write transaction, R a read-only transaction. . . . . 14

2.2 Protein folding scenario with a smart contention management policy. W denotes a write transaction, R a read-only transaction. . . . . 14

3.1 Generic component view of a speculative parallelization system. Arrows illustrate information flow between the three components: Task Identification, Task Spawning, and Speculation Support. . . . . 27

3.2 Fork-join parallelism in the speculative parallelization system of Tian et al. This type of parallelization is recurrent in the literature. Three different versions of the same code are pictured: the static version, mirroring the structure of the loop in the source code; the sequential execution, unfolding the code for an actual execution; and the parallel execution, showing the bodies of the loops being executed in parallel, and being serialized at their prologue (initialization of the loop) and epilogue (finalization of the loop). Adapted from [38]. . . . . 29

5.1 An example of an inefficient schedule. . . . . . . . . . . . . . . . . . . . . . . . . . 43

5.2 An example of an efficient schedule. . . . . . . . . . . . . . . . . . . . . . . . . . . 44

5.3 Component view of a speculative parallelization system with conflict-aware task scheduling. Arrows show the flow of information between the five components: Task Identification, Task Spawning, Task Scheduling, Data Collector, and Speculation Support. . . . . 46

6.1 STMBench7 runs for three types of workload on Phobos, with and without scheduling. The x-axis shows the number of worker threads, and the y-axis the average number of operations per second. . . . . 56

6.2 STMBench7 runs for three types of workload on Azul, with and without scheduling. The x-axis shows the number of worker threads, and the y-axis the average number of operations per second. . . . . 57


List of Tables

6.1 STMBench7 data for the three types of workload on Phobos, for the regular benchmark and with added scheduling (+S). Average throughput is in operations/second. . . . . 54

6.2 STMBench7 data for the three types of workload on Azul, for the regular benchmark and with added scheduling (+S). Average throughput is in operations/second. . . . . 54


List of Listings

2.1 Concurrency control using fine-grained locks. . . . . . . . . . . . . . . . . . . . . . 7

2.2 Concurrency control using coarse-grained locks. . . . . . . . . . . . . . . . . . . . 8

2.3 Concurrency control using transactional memory. . . . . . . . . . . . . . . . . . . 9

2.4 A sample parallel protein folder class. . . . . . . . . . . . . . . . . . . . . . . . . . 13

2.5 A sample OperationsCenter Java class, with (potentially) optimistically parallelizable methods. . . . . 16

3.1 A sample method of the type that might be seen in an enterprise reporting application. . . . . 22

3.2 Two sample methods manipulating the same collection of Customers. . . . . . . . 24

3.3 An example of cold code (also known informally as should-never-happen code). . . 24

3.4 Nested transactions on an operation with different execution profiles. . . . . . . . 30

4.1 A DataSource class containing different operations manipulating the same variable. 34

4.2 A long method doing various manipulations with a data source. . . . . . . . . . . 35

4.3 A dangerous method for speculative parallelization. . . . . . . . . . . . . . . . . . 37

4.4 A simple code sample with two interdependent sequential operations. . . . . . . . 38

5.1 An example of a class whose methods might be inefficiently executed by a black-box parallel execution environment. . . . . 42


Chapter 1

Introduction

Legacy applications are often written in a sequential fashion, and therefore garner no benefit from running on multicore architectures. On the other hand, even new applications are considerably more costly to develop as parallel applications. Therefore, the development and study of tools and frameworks to support the development of parallel applications and the parallelization of existing applications is a critical step towards maximizing the potential of multicores.

The goal of automatic parallelization is precisely to remove from the programmer the burden of having to deal explicitly with parallelism. This can be achieved in a myriad of ways, which fall into two categories: static and dynamic schemes. The first category consists of schemes that rely on offline program analysis to determine data and functional parallelism in the program, and modify it to run in parallel. Dynamic schemes, on the other hand, make decisions at runtime, and can take runtime information and properties into account. Systems in the latter category typically use a speculative approach to parallelization, assuming that a set of tasks can run in parallel and performing some sort of rollback or cancellation upon stumbling on conflicts.

Software transactional memory is a maturing technology that has now reached the point where it is performant enough to support both fine-grained and coarse-grained parallelism; two examples of state-of-the-art performant software transactional memory runtimes are [8, 9] and [26]. Therefore, I believe the opportunity has surfaced to use software transactional memory as support for automatic parallelization systems, and some preliminary results on this possibility have been presented in [2, 1]. This approach to automatic parallelization seems promising, but there are also plenty of challenges to be addressed. One of these challenges is the scheduling of speculative tasks; the existing literature acknowledges, sometimes implicitly and often explicitly, the need for speculation-aware scheduling algorithms, rather than general-purpose schedulers, to take full advantage of speculative parallelization. In this work, I present a theoretical analysis of this point, uncovering the limitations of speculation-blind scheduling, and present a simple yet effective scheduling algorithm based on the principle of conflict-aware scheduling.
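To give a first intuition for the principle of conflict-aware scheduling, the following is my own minimal sketch, not the concrete algorithm developed later in this work: the scheduler remembers which task types have aborted each other in the past, and avoids dispatching a task alongside one it has previously conflicted with. All class and method names here are illustrative.

```java
import java.util.*;

// Minimal sketch of the conflict-aware scheduling principle: record past
// conflicts between task types and avoid co-scheduling conflicting ones.
public class ConflictAwareScheduler {
    private final Set<String> conflicts = new HashSet<>(); // "a|b" pairs
    private final Set<String> running = new HashSet<>();   // task types in flight

    private static String key(String a, String b) {
        return a.compareTo(b) <= 0 ? a + "|" + b : b + "|" + a;
    }

    // Called by the speculation support when two tasks abort each other.
    public void recordConflict(String typeA, String typeB) {
        conflicts.add(key(typeA, typeB));
    }

    // Pick the first queued task type with no recorded conflict against
    // any currently running task; return null if none is safe to start.
    public String pickNext(List<String> queue) {
        for (String candidate : queue) {
            boolean safe = true;
            for (String r : running)
                if (conflicts.contains(key(candidate, r))) { safe = false; break; }
            if (safe) { running.add(candidate); return candidate; }
        }
        return null;
    }

    public void finished(String type) { running.remove(type); }

    public static void main(String[] args) {
        ConflictAwareScheduler s = new ConflictAwareScheduler();
        s.recordConflict("A", "B");
        System.out.println(s.pickNext(List.of("A")));      // starts A
        System.out.println(s.pickNext(List.of("B", "C"))); // skips B, starts C
    }
}
```

A speculation-blind scheduler would have started B alongside A, wasting work on a likely abort; this sketch instead serializes the pair that has conflicted before.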


1.1 Computing Model

Throughout this document, I assume a computing model comprising three types of components: tasks, threads, and processors. Tasks are the basic unit of computation; threads are logical executors that execute the tasks assigned to them in a given order; and processors are physical executing units, to which threads are scheduled by the operating system. At any given point in time, a processor is executing at most one thread. A processor that is not executing any thread is said to be idle, and busy if it is executing one. In a similar way, a thread that has no tasks assigned is said to be idle, and busy if it has at least one task assigned. I assume the existence of a limited thread pool, of the same order of magnitude as the number of processors available. I call these threads worker threads. This computing model is akin to the one described by Herlihy and Shavit in [24].
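As a rough illustration of this model (the class and method names below are mine, not part of any system described in this work), a limited pool of worker threads, sized to the number of processors, executes tasks submitted to it:

```java
import java.util.concurrent.*;

// Sketch of the computing model: tasks are units of computation executed
// by a fixed pool of worker threads, which the OS schedules on processors.
public class ComputingModelSketch {
    // Worker threads: of the same order of magnitude as the processors
    // (here: exactly the number of available processors).
    static final ExecutorService WORKERS =
        Executors.newFixedThreadPool(Runtime.getRuntime().availableProcessors());

    // A task is the basic unit of computation; an idle worker thread
    // picks it up and becomes busy until the task completes.
    static int runTask(Callable<Integer> task) throws Exception {
        return WORKERS.submit(task).get();
    }

    public static void main(String[] args) throws Exception {
        System.out.println(runTask(() -> 21 + 21)); // prints 42
        WORKERS.shutdown();
    }
}
```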

1.2 Work Objectives

One of the approaches that allows parallelization without intervention of the programmer is automatic parallelization. Although there are several research avenues within this realm, I concentrate my efforts on speculative parallelization. The rationale behind this orientation is that speculative parallelization allows an optimistic approach to parallelism, and therefore evades the problem of having to prove the absence of dependencies in the portions of the code being executed concurrently. As a consequence, it works in a much wider range of contexts: on one hand, there are often portions of code that could have inter-dependencies, but where in reality those dependencies never materialize (a great example of this phenomenon is cold code1); on the other hand, even actual dependencies may present no barriers to parallelization in highly likely scenarios. It is also an approach well suited to parallelizing applications that have potentially very different runtime behaviors, depending on the inputs and starting conditions of the application.

Within the research community, several software transactional memory systems have been developed [22, 9, 29], providing a potential basis on which speculative parallelization can be built. These systems provide the usual guarantees of atomicity, consistency, and isolation, without the limitations of hardware transactional memory systems [23]. I believe that software transactional memory runtimes currently best address the needs of speculative parallelization systems. However, in speculative parallelization research, they have not been used very often. Instead, authors either develop ad-hoc, proof-of-concept speculation schemes, which are not explored in depth [38, 27], or opt to run their systems on simulated transactional hardware [10]. In general, the hard problem of developing a comprehensive speculation framework has thus been avoided. In consequence, speculative parallelization systems are usually able to execute speculatively only regions of code that meet certain conditions (for instance, loops). It also becomes difficult to evaluate the efficacy of speculation mechanisms and other components of automatic parallelization systems by themselves, as they are usually fundamentally intertwined. I believe that using a general model of transactional memory for speculation support allows us much greater flexibility in exploiting the available range of parallelism in an application, from fine-grained computations to coarse-grained operations. It also opens the door for better and more efficient research in automatic parallelization systems, by separating the concern of determining what to parallelize from how to actually run it in parallel.

1I give a detailed explanation of this phenomenon and how it can impair automatic parallelization schemes in Chapter 3.


In the context of automatic parallelization using speculative execution, there are two major challenges that need to be addressed. The first is that a sequential program has to be divided into tasks, which are then scheduled to be executed in parallel. Although simple decompositions can already be shown to yield favorable speedups [10], as far as I know, finding good decomposition techniques that minimize both dependencies among tasks and task size (to be able to exploit fine-grained parallelism) remains an open problem, which I do not intend to address here. Anjo [2, 1] has developed some preliminary work in this field, and his work shows that there are a lot of subtle points that need to be dealt with by such techniques. The second point, which I address in this work, is the task scheduling problem. Task scheduling in parallel architectures is a very well studied problem [30, 6, 16]; yet, task scheduling in the context of speculative parallelization faces further challenges, since the independence between tasks is only an optimistic assumption, rather than a guarantee. In this work, I develop a scheduler custom-fitted to the requirements of speculative parallelization systems. I also enhance the scheduler with capabilities and responsibilities usually attributed to contention managers, namely conflict management. I show how this holistic approach has potential not only within the scope of speculative parallelization systems, but also for general-purpose software transactional memory runtimes.

My work builds on the work of Anjo, who used the Java Versioned Software Transactional Memory (JVSTM) [9], developed by Cachopo and Rito-Silva, as the basis for his automatic parallelization scheme. He developed a system that transforms a sequential program into a parallel program that uses speculative execution. The scope of his work was mostly the decomposition of a sequential program into transactional tasks, and I build on it by exploring and validating scheduling techniques for these tasks. I also base my work on the JVSTM, and I obtain results on the integration of my techniques with Anjo's system as part of the evaluation of my work, as well as results on a software transactional memory benchmark running on the JVSTM.

1.3 Structure of the Document

I have divided this work into 7 chapters, the current one being the first. In Chapter 2, I present a detailed review of the state of the art in the three main fields that my work intersects, namely Software Transactional Memory, Speculative Parallelization, and Task Scheduling. In Chapter 3, I describe automatic parallelization systems in detail, developing a basic model to characterize them. In particular, I go into detail on the components of speculative parallelization systems and their current shortcomings.

Chapter 4 contains a discussion of contention management and its usefulness for speculative parallelization systems, which I conclude is limited. I follow by generalizing contention management in Chapter 5, giving birth to the concept of conflict-aware scheduling. I describe the theoretical framework of this concept, and then show how I implemented a simple version of it on top of Anjo's Jaspex system. In Chapter 6, I go in depth into the experimental results, both on Anjo's Jaspex system and on a well-established software transactional memory benchmark, STMBench7 [17].

I finish by giving an overview of my work's conclusions in Chapter 7, along with the topics I consider the most important for future research in this area.


Chapter 2

Related Work

In this chapter, I describe and analyze the state of the art in the three main fields that my work intersects: software transactional memory, speculative parallelization, and task scheduling.

In Section 2.1, I start by reviewing the origins of transactional memory, a parallel programming paradigm offering an alternative to lock-based synchronization. I show how the development of early hardware-based transactional memory systems and specifications eventually led to the proposal of software-based transactional memory systems, and how the two differ. I follow by characterizing several distinct software transactional memory systems, highlighting important milestones in software transactional memory research. I finish by presenting current lines of research in software transactional memory, and how research in the field relates to my work.

In Section 2.2, I review a set of systems designed to enable sequential programs to be automatically executed on multicore architectures, in a parallel fashion. Generically, these systems divide a program into sections of code that are eventually run in parallel, and provide a runtime that allows sequential execution semantics to be preserved in that setting. As the sections of code chosen to be run in parallel at a certain time might not be parallelizable at all, these systems are said to speculate on the parallelism of these sections, and are therefore called speculative parallelization systems. Part of my work is built on top of Anjo's speculative parallelization system [2, 1], and therefore I place greater emphasis on the properties and capabilities of his system.

In the third and final section of this chapter, I describe the task scheduling problem, both in its generic formulation and within the scope of my work. This field has been extensively researched, and, in consequence, there are multiple well-documented approaches available; I will summarily present these approaches and analyze them through the perspective of my work.

2.1 Software Transactional Memory

Transactional memory is a concurrency control mechanism that has emerged as an alternative to

traditional lock-based mechanisms. Transactional memory allows the programmer to specify simply

what operations should execute atomically, and ensures the usual transactional properties for these

operations (atomicity, consistency, and isolation). I will explain, by means of a pratical example,

5

Page 26: Task Scheduling in Speculative Parallelization...Task Scheduling in Speculative Parallelization David Burgo Baptista Disserta˘c~ao para Obten˘c~ao do Grau de Mestre em Engenharia

in what ways transactional memory presents itself as an emerging solution to concurrency control.

As an example of a traditional lock-based solution, consider Listing 2.1. There is a SaleMediator

class, whose purpose is to mediate the sale between two Agents, transfering the funds from the

buyer to the seller and the Item being sold from the seller to the buyer, presumably in a parallel

and fast evolving marketplace. The synchronization is implemented using fine-grained locking, so

the code is performant and does not unnecessarily lock up the buyer or the seller for too long,

which could cause them to miss buying or selling opportunities. Writing this kind of code is highly

error-prone, however. The probability that a programmer is going to forget releasing the lock

before returning, or unlock them in the wrong order, is high. And notice that even the code in

this Listing is not free from other problems, such as deadlocks. In fact, only an analysis of every

region of code where these locks are acquired could give us such a guarantee (another alternative

is the use of tryLock mechanisms, solving the deadlock problem at the cost of this code being

even more complex). Therefore, it is much easier for programmers to opt for a coarse-grained

locking alternative, such as the one in Listing 2.2. Notice that this approach, in addition to being

easier to write, also allows a more logical organization of the code, oriented to business actions

such as ‘transfer item’ and ‘transfer balance’. Unfortunately, it also sacrifices a lot of parallelism.

With transactional memory support, the code is as easy to write as with coarse-grained

locks, if not easier: consider Listing 2.3, which shows how the code looks when using transactional

memory support from the JVSTM [8, 9], a software transactional memory system, described later

in this section. Ideally, transactional memory would allow the programmer to unlock the kind of

parallelism present in Listing 2.1, while using the kind of high-level constructs present in Listing

2.3.

Transactional memory can be supported in hardware, software, or both; in practice, there are

several ways transactional memory systems can be implemented, and therefore the properties of

programmer-defined transactions, memory and time overheads, and the amount of modifications

required to non-transactional code vary strongly. In generic terms, a transactional memory sys-

tem works by adding some level(s) of indirection to memory locations, allowing the transactional

runtime to mediate both read and write access to these locations and imbue these accesses with

transactional properties (consistency, isolation, atomicity). In Listing 2.3, the level of indirection

added by the transactional memory system is visible and specified by the programmer; in other

systems, as in DeuceSTM [26], the additional level of indirection is transparent to the programmer.

In conclusion, the main driving forces in transactional memory research are, on one hand, the

simplification of parallel programming and, on the other, the potential for increased performance.

As exemplified in the preceding paragraphs, due to the inherent difficulty of correctly

placing fine-grained locks, in practice coarse-grained locks are often used to guarantee correct exe-

cution. This is a pessimistic approach to parallelization: potential parallelism is sacrificed for the

sake of correctness. On systems where there is a very high level of actual contention, as opposed

to theoretical contention – that is, where shared objects are frequently the subject of data races

between competing threads – the pessimistic approach is not very wasteful. On the other hand,

given a low probability of data races between parallel operations, an optimistic approach to con-

currency control allows the programmer to explore parallelism that might be lost by the pessimistic

approach. As I demonstrate in Chapter 3, such an approach does make sense for at least a number

of very common scenarios. Software transactional memory runtimes, by design, provide exactly

such an approach, at least from the programmer's perspective. Transactional memory also has some


class SaleMediator {

    [...]

    void order( Agent buyer, Agent seller, Item item, Money price ) {
        buyer.getLock().lock();
        if( buyer.hasFunds(price) ){
            seller.getLock().lock();
            if( seller.hasItemStocked( item ) ){
                // Seller actions
                seller.getStockedItems().remove( item );
                seller.getBalance().add( price );
                seller.getLock().unlock();
                // Buyer actions
                buyer.getOrders().add( item );
                buyer.getBalance().subtract( price );
                buyer.getLock().unlock();
                return;
            }else{
                seller.getLock().unlock();
                buyer.getLock().unlock();
                return;
            }
        }else{
            buyer.getLock().unlock();
            return;
        }
    }
}

Listing 2.1: Concurrency control using fine-grained locks.


class SaleMediator {

    [...]

    void order( Agent buyer, Agent seller, Item item, Money price ) {
        try {
            // Acquire locks
            buyer.getLock().lock();
            seller.getLock().lock();
            if( buyer.hasFunds(price) ){
                if( seller.hasItemStocked( item ) ){
                    // Transfer item
                    buyer.getOrders().add( item );
                    seller.getStockedItems().remove( item );
                    // Transfer balance
                    buyer.getBalance().subtract( price );
                    seller.getBalance().add( price );
                    return;
                }else {
                    return;
                }
            }else {
                return;
            }
        }finally{
            // Release locks
            seller.getLock().unlock();
            buyer.getLock().unlock();
        }
    }
}

Listing 2.2: Concurrency control using coarse-grained locks.


class SaleMediator {

    [...]

    @atomic
    void order( VBox<Agent> buyer, VBox<Agent> seller,
                Item item, Money price ) {
        if( buyer.get().hasFunds(price) ){
            if( seller.get().hasItemStocked( item ) ){
                // Transfer item
                buyer.get().getOrders().add( item );
                seller.get().getStockedItems().remove( item );
                // Transfer balance
                buyer.get().getBalance().subtract( price );
                seller.get().getBalance().add( price );
                return;
            }else {
                return;
            }
        }else {
            return;
        }
    }
}

Listing 2.3: Concurrency control using transactional memory.


exclusive properties that distinguish it from locks, such as failure atomicity: transactional memory

systems are able to undo the changes made by a transaction that, for some reason, fails to commit.

Therefore, it is suited to a whole range of applications for which locks are simply unsuitable. In

my work and others, transactional memory has been applied to support speculative approaches to

parallelization, as I discuss in Section 2.2.

Transactional memory was first introduced by Herlihy et al. in [23]. This first proposal was

hardware-based, and basically rested on a number of extensions to the processor, enabling an

extended Load-Linked/Store-Conditional instruction that operated on multiple memory words si-

multaneously. This specification had the important property of being lock-free, that is, a running

program, as a whole, always makes progress in a finite amount of time even in the presence of

contention between threads, thus excluding deadlocks and livelocks, although not starvation and

similar phenomena. However, this first proposal also had some shortcomings, the most glaring of

all being the need for special-purpose hardware that was not (and still has not become, at the time of

writing) widely available. Another shortcoming was that the data pertaining to a transaction had

to be known in advance to the programmer, thereby excluding the potential of using transactional

memory on programs manipulating dynamic data structures, such as graphs and trees, among

others.

Software transactional memory was introduced in 1995 by Shavit and Touitou in [33], as an

alternative to Herlihy’s hardware transactional memory. Their software-based system for trans-

actional memory allowed the general programmer to have access to a non-blocking alternative to

synchronization. In their work, Shavit and Touitou argue that the development of parallel pro-

grams is made much easier by increased flexibility in the choice of synchronization operations, and

they specify how to implement software transactional memory on top of the standard (and widely

available on commercial hardware) single location Load-Linked/Store-Conditional instruction. Al-

though the system they describe is free from typical hardware transactional memory limitations,

such as a reduced number of non-composable atomic operations and a limited amount of memory, it

is still static, in the sense that the programmer has to know beforehand the data set that pertains

to the transaction. Therefore, it remains an unviable alternative for programs relying on dynamic

data structures.

The first unbounded, dynamic software transactional memory system was proposed in July

2003 by Herlihy et al., in [22]. The system was developed with two main goals, the first being

the possibility of using dynamic data structures within transactions (with the usual guarantees of

atomicity), and the second being faster performance than previous transactional memory systems.

In their work, Herlihy et al. relax some properties of Shavit and Touitou’s specification, to have

a simpler model and faster runtimes. Shavit and Touitou’s software transactional memory model

is wait-free [20], that is, any running transaction is guaranteed to make progress even in the

presence of contention. On the other hand, Herlihy’s dynamic software transactional memory

ensures obstruction-freedom, i.e., any running transaction is only guaranteed to make progress in

the absence of contention. Therefore, their system does not exclude the possibility of livelock or

starvation. To address these issues, they forge the concept of contention managers – entities that are

able to monitor the status of running transactions, and are responsible for using that information

to ensure execution progress. The introduction of contention managers separates conflict resolution

from the remaining aspects of transactional memory, allowing therefore for a greater flexibility in

the development and use of contention management policies.


Since Herlihy et al. first presented their software transactional memory system in 2003, several

other software transactional memory systems have been developed ([19] and [29] provide a detailed

comparison of the strengths and shortcomings of some designs). In my work, I used the Java

Versioned Software Transactional Memory (JVSTM) as the basis for my research. The JVSTM is

a software transactional memory library for Java introduced in 2005 by Cachopo and Rito-Silva

[8, 9]. It incorporates the concept of versioned boxes, generalizations of memory locations that hold

multiple versions of a value instead of just the most recent one. By using this model, it becomes

possible to have read-only transactions that always commit successfully: each transaction reads

only the version of the object that is consistent with its own timestamp, and write-transactions

cache the written values in their own local memory before committing changes to a new version.

JVSTM uses this property to its advantage by differentiating read-only transactions from write

transactions, with the first ones having significantly lower overhead. It also speculatively executes

a transaction as read-only if there is a good probability that it will not write any value (naturally, if

the assumption fails, the transaction must be restarted with the proper read-write status). These design decisions

allow JVSTM to be highly performant for applications with high read/write ratios.
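The versioned-box idea can be pictured with a minimal sketch. The class and method names below are hypothetical and far simpler than the real JVSTM API: each box keeps a history of (version, value) pairs, and a transaction with timestamp t reads the newest version not exceeding t, which is why read-only transactions always see a consistent snapshot and never need to abort.

```java
import java.util.concurrent.atomic.AtomicInteger;

// Minimal sketch of a versioned box; NOT the real JVSTM API.
// Each write installs a new immutable (version, value) node; readers
// walk the history to find the newest version visible to them.
class VersionedBox<T> {
    private static final AtomicInteger clock = new AtomicInteger(0);

    private static final class Node<T> {
        final int version; final T value; final Node<T> prev;
        Node(int version, T value, Node<T> prev) {
            this.version = version; this.value = value; this.prev = prev;
        }
    }

    private volatile Node<T> head;

    VersionedBox(T initial) { head = new Node<>(clock.get(), initial, null); }

    // A committing write transaction installs a new version atomically.
    synchronized void commitWrite(T newValue) {
        head = new Node<>(clock.incrementAndGet(), newValue, head);
    }

    // A read-only transaction with timestamp ts always succeeds: it reads
    // the newest value whose version does not exceed ts.
    T read(int ts) {
        for (Node<T> n = head; n != null; n = n.prev)
            if (n.version <= ts) return n.value;
        throw new IllegalStateException("no version visible at " + ts);
    }

    static int now() { return clock.get(); }
}
```

A read-only transaction that starts at `now()` keeps seeing the same snapshot even if writers commit afterwards, at the cost of retaining old versions in memory.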

In the next sections, I describe two concepts within transactional memory research that are

important for my work: transaction nesting and contention management. First, I explain the

concept of transaction nesting within transactional memory and how the development of powerful

nesting models is a prerequisite for performance improvement in transactional memory-fueled par-

allelism. Afterwards, I present current research in the field of contention management, introduced

by Herlihy in 2003 [22], and which has become an integral part of performance improvement in

current software transactional memory systems.

2.1.1 Transaction Nesting.

By design, all software transactional memory systems provide support for concurrent top-level

transactions – transactions that commit directly to main memory after finishing execution. Ad-

ditionally, a transactional memory system can provide a nesting model, i.e., support the creation

of transactions within transactions, such that there is a parent-child relationship between them.

This allows the programmer to arbitrarily compose transactional operations, and the runtime to

explore further parallelism. Without such a model, transactional memory systems become strongly

restrictive. Opportunities for exploiting parallelism at different levels of granularity are limited,

as small regions of code that may safely run in parallel within a certain scope (for instance, a method)

may not do so at a larger scope. Moreover, transactional operations cannot

invoke other transactional operations, which impairs common programming practices such as ab-

straction and code reuse. To deal with these issues, and nested transactional operations in general,

several nesting models have been studied and proposed.

Moss and Hosking, in [31], analyze three well-known nesting models, and their respective ar-

chitecture sketches. A closed nesting model is any model that only allows transactions to commit

to their parents; that is, upon completion, they update the values of the parent transaction, if

successful. A particular case of a closed nesting model is a linear nesting model, where only one

child of a given transaction is executing at any given time. The JVSTM currently uses this model,

due to its simplicity, and it allows for a number of optimizations, as Moss and Hosking acknowledge.

Development of a more general model is still in progress. In contrast, an open nesting model


allows any transaction to commit to the shared memory state. Although no serializability theo-

rem has yet been formulated for open nesting, there are special cases where open nesting allows

further parallelism, without violating the sequential semantics of a program. For example, in [27],

an open-nested unordered set iterator is implemented, taking advantage of the commutativity of

certain operations, coupled with the possibility of designing ‘inverse methods’ that undo aborted

work.
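The closed nesting model described above can be sketched as follows (the API is hypothetical, not the JVSTM's): a child transaction buffers its writes and, on commit, merges them into its parent's write-set; only a top-level commit publishes to shared memory, and aborting a child leaves its parent untouched.

```java
import java.util.HashMap;
import java.util.Map;

// Sketch of closed nesting (hypothetical API): a child transaction
// buffers its writes and, on commit, merges them into its parent's
// write-set; only a top-level commit reaches shared memory.
class NestedTx {
    static final Map<String, Object> sharedMemory = new HashMap<>();

    final NestedTx parent;                        // null for a top-level transaction
    final Map<String, Object> writeSet = new HashMap<>();

    NestedTx(NestedTx parent) { this.parent = parent; }

    void write(String loc, Object value) { writeSet.put(loc, value); }

    // Reads see the innermost pending write, then ancestors, then memory.
    Object read(String loc) {
        for (NestedTx tx = this; tx != null; tx = tx.parent)
            if (tx.writeSet.containsKey(loc)) return tx.writeSet.get(loc);
        return sharedMemory.get(loc);
    }

    // Closed-nesting commit: merge into the parent if there is one,
    // otherwise publish to shared memory (top-level commit).
    void commit() {
        if (parent != null) parent.writeSet.putAll(writeSet);
        else sharedMemory.putAll(writeSet);
    }

    // Aborting a child simply discards its write-set; the parent is intact.
    void abort() { writeSet.clear(); }
}
```

Linear nesting corresponds to the restriction that at most one child of a given transaction is live at a time, which is what makes the merge step above trivially safe.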

2.1.2 Contention Management.

Herlihy et al., in [22], were mainly concerned with developing an unbounded software transactional

memory system to allow use of dynamic data structures within transactions. However, they also

introduced the practical concept of contention managers. These entities implement policies that

decide how to handle conflicts between transactions, such as when a transaction should be allowed

to abort another transaction, or how long a transaction should wait before trying to continue

executing. The need for contention managers arises from the possibility of livelock and

starvation in software transactional memory systems that do not provide the wait-freedom property.

Moreover, these managers may have a major influence on performance; I will illustrate with an

example the critical role of contention managers in performance, and later in this section I review

a set of studies that support this same conclusion. Listing 2.4 lists the code for a sample parallel

protein folder. The Protein objects are folded using either the method foldProteinSimple,

which performs a reasonably fast computation, but most often fails to fold the protein, or the

method foldProteinComplex, which performs a rather heavy computation and often succeeds.

When either one fails at folding, it stores the Protein in a container that holds all previous

failures. Before beginning the folding itself, either method checks the existing list of failures for

similar Protein objects that have failed before, and immediately declares failure if it does find any

similarity. Now consider that the method foldProteinSimple is invoked four times in a row by

a thread, having for its arguments four similar (in the sense defined above) Protein objects, and

assume that the folding is going to fail (as happens very frequently). At the same time, another

thread invokes foldProteinComplex, and assume the method is going to succeed, as it usually does. In

the absence of a contention manager, or if our contention manager takes no starvation prevention

measures, a possible scenario for this execution is shown in Figure 2.1. This scenario is as bad,

performance-wise, as running the program sequentially. A smarter management policy might

achieve a much better scenario, as shown in Figure 2.2. Here, the contention manager correctly

aborts the transaction running foldProteinComplex, based on some information about the past

behavior of similar transactions, or maybe just transaction length.

Although contention managers are not explicitly recognized as schedulers, in practice they

allow the programmer to encode local scheduling decisions, to be made in the face of conflicts.

Herlihy et al. acknowledge that contention policies are likely to be highly dependent on application

behavior, and therefore introduce the concept of pluggable contention managers in their framework.
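The pluggable design can be pictured as a small policy interface that the runtime consults whenever a conflict is detected. The names below are illustrative (the actual DSTM interface is richer); the sample policy shown, where the older transaction always wins, guarantees progress for the oldest running transaction.

```java
// Sketch of a pluggable contention-manager interface, loosely modeled on
// the DSTM design (hypothetical names). On a conflict, the runtime asks
// the policy whether the attacker may abort the victim or should wait.
interface ContentionManager {
    enum Resolution { ABORT_OTHER, WAIT, ABORT_SELF }

    Resolution resolve(TxInfo me, TxInfo other);

    // Minimal view of a transaction that the policy may inspect.
    final class TxInfo {
        final long startTime; final int retries;
        TxInfo(long startTime, int retries) {
            this.startTime = startTime; this.retries = retries;
        }
    }
}

// A simple timestamp-style policy: the older transaction wins.
class TimestampManager implements ContentionManager {
    public Resolution resolve(TxInfo me, TxInfo other) {
        return me.startTime <= other.startTime ? Resolution.ABORT_OTHER
                                               : Resolution.WAIT;
    }
}
```

Swapping in a different `ContentionManager` implementation changes only conflict resolution, leaving the rest of the transactional runtime untouched, which is precisely the flexibility the separation buys.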

Scherer and Scott [32] built upon the work of Herlihy et al., by exploring some of the design

space of contention management. Although they start with the assumption of Herlihy et al.

that contention managers probably should be application-specific to maximize performance, they

analyze and develop several general-purpose contention management policies. Their Polka policy,

which combines a randomized exponential backoff with a priority accumulation mechanism to avoid


class ProteinFolder {

    @transactional
    ProteinList failures;

    [...]

    @atomic
    void foldProteinComplex( Protein prot ) {
        boolean successful = true;
        // If we have tried to fold similar proteins before and failed,
        // we give up immediately
        if( !failures.containsSimilarProteins( prot ) ){
            // Very heavy computation that may or may not be successful
            [...]
        }else return;
        if( !successful )
            failures.add( prot );
    }

    @atomic
    void foldProteinSimple( Protein prot ) {
        boolean successful = true;
        // If we have tried to fold similar proteins before and failed,
        // we give up immediately
        if( !failures.containsSimilarProteins( prot ) ){
            // Fast computation that may or may not be successful
            [...]
        }else return;
        if( !successful )
            failures.add( prot );
    }
}

Listing 2.4: A sample parallel protein folder class.


Figure 2.1: Protein folding scenario with a naive contention management policy. W denotes a write transaction, R a read-only transaction.

Figure 2.2: Protein folding scenario with a smart contention management policy. W denotes a write transaction, R a read-only transaction.


starvation, yields superior performance in their results, consistently outperforming other contention

management policies over a range of applications with distinct behaviors. In their work, Scherer

and Scott develop policies to address the problem of prioritized contention management, which

differs from the general contention management problem in that the right of each

transaction to make progress is quantified. Prioritized contention management policies should therefore ensure greater

progress for higher-priority transactions than for lower-priority transactions. Scherer and Scott

[32] argue that prioritized contention management is a natural fit for Herlihy’s obstruction-free

software transactional memory. However, in their analysis, they identify some shortcomings of their

contention management policies; one of them is that a flow of short-lived, low-priority transactions

can still starve a higher-priority one. This problem highlights one of the major points of my work:

I claim that contention management is a form of local scheduling, and I show that performance

can be improved by actually incorporating contention management into global scheduling. In this

case, a local contention manager cannot do anything to prevent the scheduler from scheduling

lower-priority transactions for execution, and the starvation problem arises from that inevitability.
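The flavor of a Polka-style policy can be sketched as follows; the constants and exact rules here are illustrative, not the published ones. A transaction's priority grows with the work it has done (in Polka, the number of objects it has opened); on conflict, the attacker backs off for randomized, exponentially growing intervals for a number of rounds proportional to the priority gap before it is allowed to abort the victim.

```java
import java.util.Random;

// Sketch of a Polka-style backoff policy (illustrative parameters).
// Instead of sleeping, onConflict returns the wait in nanoseconds,
// or a negative value to signal "abort the victim now".
class PolkaLikeBackoff {
    static final long BASE_WAIT_NS = 1_000;
    final Random rng = new Random(42);

    long onConflict(int myPriority, int victimPriority, int round) {
        int budget = victimPriority - myPriority;    // rounds granted to the victim
        if (round >= Math.max(budget, 0)) return -1; // budget exhausted: abort victim
        long cap = BASE_WAIT_NS << Math.min(round, 20); // exponential growth, clamped
        return 1 + (long) (rng.nextDouble() * cap);     // randomized backoff interval
    }
}
```

The priority-accumulation part is what prevents starvation of long transactions; the randomized exponential part is what keeps two symmetric attackers from repeatedly colliding.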

In the following section, I present a brief review of the state of the art in speculative paral-

lelization systems, which so far have relied on custom developed mechanisms to support speculative

execution. I argue that software transactional memory is a natural support mechanism for specu-

lative execution, as preliminary results by Anjo [2, 1] and Couto [11] have shown.

2.2 Speculative Parallelization

Thread-level speculation is a technique where sections of code are executed speculatively, in an

optimistic fashion, in an attempt to explore latent parallelism in sequential programs. For ex-

ample, consider Listing 2.5. I may try to speed up the execution of a call to operation1 by

speculatively executing validateOperation1 and setupOperation1 in parallel. In a scenario

where validateOperation1 only makes a few expensive checks, and usually allows operation1

to proceed, I might get quite a significant speedup by doing so. This is a kind of parallelism that

would be hard to exploit using locks, or to uncover by static analysis. In fact, in [37], Steffan

and Mowry show that thread-level speculation successfully exploits parallelism that cannot be ex-

ploited by parallelizing compilers that rely on static analysis to rule out dependency violations,

and that is of coarser grain than instruction-level parallelism. Since then, thread-level speculation

has been applied with success to loops [38, 13, 18], methods [10], and even coarser-grained sections

of code [27]. As I describe in the following paragraphs, several speculation mechanisms have been

developed, from hardware supported speculation to simple copy-and-update schemes.

Tian et al. [38] use thread-level speculation applied to loops, using a copy-and-update scheme

for speculative execution; speculative threads, which are spawned by the main thread, obtain

memory copies of the data that they need to execute, and upon successful speculation (speculation

that does not incur in any conflicts), the written data gets copied to the main thread. To make

the most out of speculation, they use static offline profiling to determine the most profitable loops

for speculation. Although their system may have significant memory overheads, it obtains almost

linear speedups in loop executions. Their speculation scheme uses source code transformations,

and is therefore limited to scenarios where the source code is available. Implicit in their approach is

the assumption that speculative parallelization works at maximum profit when it is profile-guided.
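The copy-and-update idea can be condensed into a sketch; this is a drastic simplification of Tian et al.'s scheme, and the names are mine. The speculative thread works on a private copy of the data it needs, and the main thread accepts the result only if the original data was not modified in the meantime, detected here with a version counter.

```java
import java.util.function.UnaryOperator;

// Sketch of a copy-and-update speculation scheme: speculate on a private
// copy, commit only if no conflicting write happened since the copy.
class CopyAndUpdate {
    int[] data;      // shared data owned by the main thread
    int version;     // bumped on every non-speculative write

    CopyAndUpdate(int[] data) { this.data = data; }

    static final class Speculation {
        final int readVersion; final int[] result;
        Speculation(int readVersion, int[] result) {
            this.readVersion = readVersion; this.result = result;
        }
    }

    // Speculative execution: copy, record the version, compute on the copy.
    Speculation speculate(UnaryOperator<int[]> work) {
        return new Speculation(version, work.apply(data.clone()));
    }

    // Commit succeeds only if the data is unchanged since the copy was taken.
    boolean tryCommit(Speculation s) {
        if (s.readVersion != version) return false;  // misspeculation: discard
        data = s.result;
        version++;
        return true;
    }
}
```

The memory overhead the text mentions is visible in the sketch: every in-flight speculation holds a full private copy of its input data.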


class OperationsCenter {

    void operation1() {
        validateOperation1();
        setupOperation1();
        performOperation1();
    }

    void operation2() {
        validateOperation2();
        setupOperation2();
        performOperation2();
    }
}

Listing 2.5: A sample OperationsCenter Java class, with (potentially) optimistically parallelizable methods.
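The speculation on operation1 described above can be sketched with plain futures; real speculative parallelization systems use bytecode rewriting and a transactional runtime rather than explicit futures, so this is only meant to show the shape of the wager. The setup work runs speculatively while validation executes, and its result is used only if validation succeeds.

```java
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.Supplier;

// Sketch of speculating on a validate/setup pair with plain futures.
// The setup computation is started optimistically; if validation vetoes
// the operation, the speculative work is simply discarded.
class SpeculativeRunner {
    final ExecutorService pool = Executors.newCachedThreadPool();

    <T> T run(Supplier<Boolean> validate, Supplier<T> speculativeSetup) {
        Future<T> setup = pool.submit(speculativeSetup::get);  // speculate
        try {
            if (validate.get()) return setup.get();  // speculation pays off
            setup.cancel(true);                      // discard wasted work
            return null;                             // validation vetoed the operation
        } catch (InterruptedException | ExecutionException e) {
            throw new RuntimeException(e);
        }
    }
}
```

When validation is expensive but usually passes, the setup latency is hidden almost entirely; when it fails, the only cost is the discarded speculative computation.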

Chen and Olukotun [10] apply thread-level speculation to methods. They develop their work in

the context of how different levels of parallelism map to different types of processors. They show

how chip multiprocessors can be used to speculate on method executions in Java, to obtain speedups

that could be obtained neither by relying on superscalar processors and instruction-level parallelism

alone, nor by relying on bus-based multiprocessors, which incur high communication

overheads. They are able to successfully parallelize interesting programs, namely some Java library

classes (StringBuffer and Hashtable), using their approach. These programs are interesting in the

sense that, by obtaining speedups on their execution, it is possible to confer parallelism to appli-

cations that use these libraries without further modifications. The results presented in their paper

were obtained by running the modified applications on simulated speculative hardware; nonethe-

less, Chen and Olukotun succeed in showing the ability of coarser-grained thread-level speculation

to unlock higher levels of parallelism.

Kulkarni et al. [27] go even further, stating that the identification of high-level abstractions

is fundamental to increasing the performance of speculative parallelization. They introduce the Galois

approach, which is essentially based on developing and using optimistic parallel library classes

in irregular programs – programs that have a complex memory access behavior, and therefore

are difficult to parallelize using static data dependency analysis techniques. They implement two

optimistic parallel collections and are able to speed up two irregular applications by replacing

the ‘work-list ’-type constructs in these applications by these collections. To augment optimistic

parallelism, they use the concept of semantic commutativity between operations: by specifying

which operations in the collection are commutative (operations such as add(x) and remove(x)),

they are able to reduce the number of conflicts which cause unnecessary aborts; they categorize

this technique as a controlled type of open nesting. In consequence, their speculation mechanism is

based on inverse methods that undo operations in case of conflicts, an approach similar to that of

Herlihy and Koskinen in [21]. A major contribution of their work is also that they show the influence

of different scheduling policies on the resulting performance. A good scheduling policy increased

the performance by up to a factor of two in their experiments; the results that were obtained also

led Kulkarni et al. to conclude that an effective scheduling policy is highly application dependent.

In [25], Jiang and Shen propose a dynamic speculation scheme that profiles execution of spec-


ulative regions at runtime, and uses information gathered in both past production runs of the

program and its current run to decide if it is profitable to speculate a certain region. They use

machine learning techniques, in particular classification trees, to relate program input with prof-

itability of speculation. By doing so, they are able to reduce misspeculation, obtaining greater

speedups and also greater computing efficiency. Even though they consider only loop-level and

functional parallelism in their work, they obtain the important conclusion that the decision of

whether to speculate can be improved by taking into account runtime properties of the program.

To the best of my knowledge, speculative parallelization supported by software transactional

memory has not yet succeeded in producing strong results that could allow it to compete realisti-

cally with established parallelization paradigms. Anjo [2, 1] has pioneered a speculative paralleliza-

tion system built using the JVSTM [8, 9]. His system parallelizes a sequential program by first

performing bytecode transformations on the application classes, allowing them to be controlled by

the transactional runtime of the JVSTM. He then identifies points of speculation (namely method

calls) and rewrites the bytecode to insert runtime decision points there. The decision on whether

or not to speculate is taken at runtime, but once a section of code (task) has been marked for

speculative execution, the runtime loses control of the task’s execution. It is scheduled for execu-

tion by the Java thread scheduler, and there is no contention management policy. In my work, I

have extended Anjo’s work by creating a scheduling module that is also conflict-aware, allowing

scheduling decisions to take into account not only each task’s properties and the current execution

context, but also past conflict history.

In the next section, as my work draws upon task scheduling research, I will analyze standard

approaches to generic task scheduling problems and characterize them in the scope of my work.

2.3 Task Scheduling

The problem of scheduling, in the context of my computing model, refers to the problem of as-

signing tasks to threads in the thread pool in such a way as to maximize throughput. I assume

that a thread in that pool is allocated to some processor by some mechanism as soon as the thread

becomes busy and an idle processor is available. To maximize throughput, I need therefore to keep

the maximum number of threads busy with useful computation. Different scheduling paradigms

have been in use for general multithreaded computations, and in my work I create a new scheduling

paradigm for scheduling tasks in speculative parallelization, by taking into account conflict infor-

mation. Although it has been acknowledged, in the context of speculative parallelization, that

scheduling policies have a large influence on performance [27, 2], little research has been conducted

in developing schedulers for various forms of thread-level speculation. My results show that a pro-

filing, conflict-aware scheduling policy obtains faster performance than general schedulers in this

context. In the following paragraphs I present the two major paradigms for general scheduling of

multithreaded computations, work sharing and work stealing.

In the work sharing model, a special thread, the scheduler (or possibly a hierarchy of schedulers),

is responsible for managing tasks and assigning them to the threads available for compu-

tation. A recent, successful example of the application of this paradigm to massively parallel

computations is Google’s MapReduce [14]. Contrasting with work sharing, work stealing is a


symmetrical scheduling paradigm, where threads are responsible for obtaining the tasks that they

execute. When a thread has completed its work, it steals tasks owned by busier threads and

executes them.

There is an extensive amount of literature on the scheduling problem modeled as a general

weighted directed acyclic graph (DAG) [30, 16, 6], where the nodes represent the tasks, their

weights the respective computation costs, the directed edges the dependencies between tasks, and

the costs of the edges the communication costs between tasks. In this model, a scheduler is simply

an algorithm that is able to minimize the schedule length, which is the maximum finish-time of

all nodes. The problem of finding the optimal solution to the general form of this problem is

NP-complete [16]. Therefore, the general approach to the DAG-scheduling problem has been to

develop heuristics to help find a solution in tractable time on one hand, and, on the other hand,

to make assumptions about the structure and costs of the graph to allow for relaxed problems.

McCreary et al. provide a review and evaluation of a set of heuristic approaches to this problem in

[30]. In my case, the scheduling-DAG is not known a priori; moreover, I have special, transactional-memory-related

task states. For instance, the information of whether a given task has incurred

conflicts, and the nature and recurrence of these conflicts, are relevant to scheduling and do

not map easily into this model.
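To illustrate the model, the following sketch (hypothetical code, not taken from any of the cited works) computes the critical path of such a weighted DAG in topological order; with unlimited processors, this is a lower bound on the schedule length.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.Map;

// Sketch of the DAG scheduling model: nodes carry computation costs,
// edges carry communication costs. The critical path bounds the
// schedule length from below when processors are unlimited.
public class DagCriticalPath {

    // edges[u] maps each successor v to the communication cost u -> v
    public static int criticalPath(int[] cost, Map<Integer, Integer>[] edges) {
        int n = cost.length;
        int[] indeg = new int[n];
        for (int u = 0; u < n; u++)
            for (int v : edges[u].keySet()) indeg[v]++;

        int[] finish = new int[n]; // earliest start, then finish, per node
        Deque<Integer> ready = new ArrayDeque<>();
        for (int u = 0; u < n; u++) if (indeg[u] == 0) ready.add(u);

        int makespan = 0;
        while (!ready.isEmpty()) {
            int u = ready.poll();
            finish[u] += cost[u]; // earliest start + computation cost
            makespan = Math.max(makespan, finish[u]);
            for (Map.Entry<Integer, Integer> e : edges[u].entrySet()) {
                int v = e.getKey();
                // v cannot start before u's result arrives
                finish[v] = Math.max(finish[v], finish[u] + e.getValue());
                if (--indeg[v] == 0) ready.add(v);
            }
        }
        return makespan;
    }
}
```

For example, with node costs {2, 3, 1} and edges 0→1 (communication cost 1) and 0→2 (cost 2), node 1 finishes at 2 + 1 + 3 = 6, which is the makespan.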

The work stealing approach does not need to know the scheduling-DAG at any time; in [6], Blumofe and Leiserson introduce a randomized work stealing algorithm where processors need only take into account local information about their computation. Their results are valid for fully-strict multithreaded computations, that is, computations where each thread has dependencies only

towards its children. Although this model excludes a good amount of general multithreaded compu-

tations, it actually promises a solid approach for scheduling speculatively parallelized computations,

which follow this structure in general. Lea, in [28], implemented a variant of the randomized work

stealing algorithm by Blumofe and Leiserson to provide an efficient implementation of fork-join

parallelism in Java, obtaining improvements over standard Java thread scheduling.
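A variant of Lea's framework later shipped as java.util.concurrent's ForkJoinPool. As a brief illustration of the fork-join style it supports, consider this deliberately naive recursive Fibonacci, where forked subtasks are pushed to the worker's deque and become stealable by idle workers:

```java
import java.util.concurrent.ForkJoinPool;
import java.util.concurrent.RecursiveTask;

// Fork-join parallelism in the style of Lea's framework: a task forks
// one half of the work (making it stealable) and computes the other
// half directly, joining the forked result at the end.
public class FibTask extends RecursiveTask<Long> {
    private final int n;

    public FibTask(int n) { this.n = n; }

    @Override
    protected Long compute() {
        if (n <= 1) return (long) n;
        FibTask left = new FibTask(n - 1);
        left.fork();                       // pushed to this worker's deque
        FibTask right = new FibTask(n - 2);
        return right.compute() + left.join();
    }

    public static void main(String[] args) {
        long result = new ForkJoinPool().invoke(new FibTask(20));
        System.out.println(result); // prints 6765
    }
}
```

Naive Fibonacci is, of course, only a didactic workload; the point is the fork/compute/join pattern, which mirrors the fully-strict structure for which the Blumofe–Leiserson bounds hold.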

2.4 Discussion

Several approaches to thread-level speculation have been developed, showing that it can be used

to exploit the whole spectrum of parallelism latent in a sequential application, from fine-grained

parallelism to very coarse-grained parallelism. However, these studies have placed most of their

emphasis on speculation techniques and not on the mechanisms of speculation themselves. Simple copy-and-update schemes [38], which easily incur too much memory overhead for threads manipulating large amounts of data, simulated transactional hardware [10], and difficult-to-design (and even more difficult to generate automatically) inverse methods [27], among others, were used to provide

speculation mechanisms for these studies. In my work, I have extended Anjo’s Jaspex, which uses

JVSTM as the backend for its speculation mechanism, as I believe that software transactional

memory allows me to explore the whole range of parallelism in an application in a uniform fashion.

Unfortunately, there are still limitations to software transactional memory systems that restrict

the amount of parallelism that can be exploited. Practical, yet powerful, nesting models such as

closed nesting are still being incorporated into state-of-the-art software transactional memory systems, and these models have a direct influence on how tasks can be run in parallel, and therefore


on the performance of speculative execution systems. For example, consider once more Listing 2.5

in Section 2.2. Now, consider a program that invokes operation1 and operation2 in sequence,

and a transactional memory system that provides no nesting model. If I execute these methods

speculatively, I may then exploit existing parallelism in their execution, but not in the execution

of validateOperation1 and setupOperation1, for example. On the other hand, if I exploit this

finer-grained parallelism by executing validateOperation1 and setupOperation1 speculatively,

then I lose the ability to exploit the coarser-grained parallelism I was exploiting before. Note

that I do not get around this problem by executing validateOperation1, setupOperation1,

validateOperation2 and setupOperation2 speculatively at the same time, as the latter transac-

tions will have to wait for program order to be able to commit; therefore, this option corresponds

roughly to running operation1 and operation2 speculatively and does not unlock further par-

allelism. The linear nesting model currently supported by the JVSTM allows the programmer to

reuse transactional code, to arbitrarily nest transactions, and, in general, to reap the conceptual

benefits of the more powerful closed nesting model, but it limits the amount of parallelism that can be exploited in practice.

As far as I am aware, no other study has been conducted to research and develop scheduling

algorithms that are specifically suited to transactional tasks created in the scope of speculatively

parallelized applications. General-purpose scheduling algorithms can often get in the way of spec-

ulative parallelization, producing schedules that cause great numbers of aborted speculative tasks.

Researchers in both the speculative parallelization and the contention management fields have

supported the belief that better suited scheduling algorithms are needed: in [27], Kulkarni et al.

state that intelligent scheduling of speculative tasks had a large influence on performance; in [32],

Scherer and Scott found that contention management alone could not solve some problems caused

by the obliviousness of the scheduler to the contention management policies, as I discussed in Sec-

tion 2.1.2. In Chapter 5, I describe how I incorporated contention management information and

responsibilities into the scheduling algorithms, thereby increasing the performance of Anjo's Jaspex

system for speculative parallelization.


Chapter 3

Automatic Parallelization Systems

In the context of my work, I use parallelization system to refer to a special runtime environment

with a set of particular capabilities. These capabilities, once combined, allow sequentially written

programs to be executed in a parallel fashion on a multicore machine, potentially resulting in

faster performance. In this chapter, I will derive a basic model for these systems, describing their

properties, and I will go into detail on the particular subset of speculative parallelization systems.

First of all, however, before describing the properties of these systems, it will be useful to define

a set of concepts, beginning with what I consider to be sequentially written programs, and why the

quest to achieve multicore execution of these programs is a legitimate one.

3.1 The Need for Parallelization Systems

Modern software engineering presents an interesting scenario, in that although the elemental build-

ing blocks of parallel programs (threads, locks, and even atomic test-and-set operations) have been

around for a few decades [12], most programs are developed in a sequentially-minded fashion, by which I mean that programmers usually develop a program with the implicit assumption that it

is going to be executed by a single processor, with totally ordered statements (i.e., if statement

a precedes statement b, and statement b precedes statement c, then these three statements will

always be carried out in the order a, b, c) and exclusive memory, whose content at a given point in

time is consistent with the effects brought upon by the (totally ordered) statements executed up

to that point. I will refer to programs developed in this way as sequential programs. To provide a

simple example that I will continually explore throughout this chapter, consider Listing 3.1.

In this listing we have a sample method (of an unspecified class), resembling one that might

be seen in an enterprise application. The specific semantics behind the code is not very relevant, although I opted for a semantically rich example to illustrate, on one hand, how typical sequential code might look and, on the other hand, how easily sequential code maps into our cognitive models of how processes unfold. As seen in the sample code, the Report class is some kind of data container

into which we can load data, add statistics, and then compute statistics on the loaded data. In

addition, there are two different objects (Systems.OnlineStore and Systems.WholesalePortal)

that provide us with some kind of data that is loaded into the report.


public Report generateOrderReport() {
    Report report = new Report();

    OrderData storeData = Systems.OnlineStore.getOrderData();
    report.loadData( storeData );

    OrderData wholesaleData = Systems.WholesalePortal.getOrderData();
    report.loadData( wholesaleData );

    report.addStatistics( Statistics.DefaultStatistics );
    report.computeStatistics();

    return report;
}

Listing 3.1: A sample method of the type that might be seen in an enterprise reporting application.

Given the sample code, we would not expect the computeStatistics method to run before either invocation of loadData has returned. In the same way, although with different implications, we would also not expect the WholesalePortal data to be loaded before the OnlineStore data, even if this might be a semantically sound execution. Concerning memory, we would definitely not expect the Report to have any data before we got to the topmost invocation of loadData in the listing. Programmers typically interpret and design programs the same way they read and write them, which is from top to bottom.
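To make concrete the parallelism latent in Listing 3.1, a programmer could parallelize the two independent fetches by hand. The sketch below (with simplified stand-in types, and assuming the two data sources are indeed independent) runs both fetches concurrently on an executor, while the loadData calls and everything after them keep their original program order:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Hand-parallelized version of the pattern in Listing 3.1, with
// simplified stand-in types: the fetches overlap, the loads do not.
public class ParallelReport {

    static class Report {
        final List<String> data = new ArrayList<>();
        void loadData(String d) { data.add(d); }
    }

    static String fetchStoreData()     { return "store-orders"; }
    static String fetchWholesaleData() { return "wholesale-orders"; }

    public static Report generateOrderReport() throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(2);
        try {
            Callable<String> storeJob     = ParallelReport::fetchStoreData;
            Callable<String> wholesaleJob = ParallelReport::fetchWholesaleData;
            Future<String> store     = pool.submit(storeJob);
            Future<String> wholesale = pool.submit(wholesaleJob);

            Report report = new Report();
            // get() re-establishes the ordering the sequential code implied
            report.loadData(store.get());
            report.loadData(wholesale.get());
            return report;
        } finally {
            pool.shutdown();
        }
    }
}
```

This is exactly the kind of restructuring that parallelization systems attempt to carry out without programmer intervention, and without the source-level knowledge that the two fetches are independent.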

Modern hardware, on the other hand, has developed to the point where even consumer-market computers have multiple processors, typically sharing the same memory (in a myriad of different setups, whose discussion is out of the scope of this work). Theoretically, there is the potential

for increased performance as more raw computing power is available, but in practice there is little

to no performance gain for most applications, as they are sequential applications that by design

cannot allocate processing work to more than a single processor.

Unfortunately, overcoming this hurdle is not easy at present. Parallel programming involves

extra costs, as the programmers have to identify and specify sections of code which can be run

in parallel (and/or which cannot), and develop synchronization mechanisms for sections of code

that might read and write to the same memory regions. Presently, this is still a skill-intensive

task that has a high time and quality cost. The end result is that, excluding performance critical

data-parallel applications (like building the index for a search engine [14]), most applications being

written today, as well as most legacy applications, are still sequentially written programs, and

gain no benefit from being run on a multicore machine rather than on a single-core one.

Fortunately, as I outlined in Chapter 2, researchers have come up with a number of solutions

to address this situation, by developing systems capable of finding parallelism within sequentially

written programs. In my work, I concentrate on a specific family of systems, which I shall refer

to as automatic parallelization systems. These systems, with a well defined set of capabilities,

are described in detail in the next section. They exploit parallelism from sequential programs by

a process that vaguely resembles the manual process that I described above, and thereby allow

sequential programs to potentially run faster in a multicore architecture.


3.2 A Basic Model of Automatic Parallelization Systems

I will now proceed to describe in detail the capabilities that define the family of systems that I am

labeling automatic parallelization systems. I start by laying out a fundamental assumption about

the type of input that these systems accept, followed by a set of fundamental capabilities that will

form the basis of further discussion in this work.

The fundamental assumption is that these systems will have as their main input a compiled

program, not necessarily accompanied by the program's source code. This first assumption immediately rules out any type of source-code analysis technique, an area where some results have already been obtained in the literature (see, for example, the parallelizing compiler Polaris [5]). It is a necessary

one, however, for in practical scenarios it is unlikely that one will have access to the source code,

either due to commercial (non-released source code or code under copyright) or availability reasons

(source code lost or replaced by a new version, unavailability of legacy compilers, etc.). There might also be scenarios where the source code is theoretically available, but in practice the time and/or monetary costs of obtaining it are too high, or the released code base is not trusted. All in all, source-code independence is essential to the viability of automatic parallelization systems as useful tools. Note that this assumption does not rule out

the possibility that the system in question does use source code analysis (when available) to obtain

better results.

Automatic parallelization systems have the goal of interpreting, transforming, and/or running

the mentioned sequential programs in a parallel fashion, allowing faster performance on multicore

architectures. By doing so, they must still preserve the sequential semantics of the original program:

any parallel run of the program must still produce a result that is consistent with a sequential run

of the program. These systems depend on two fundamental capabilities to produce their results:

• Being able to identify and delimit, either offline, at load time, or at runtime, sections of the program that can be executed concurrently with other sections of the program.

• Orchestrating the run of the program, by creating tasks and assigning them to threads as

they become idle.

An example of automatic parallelization systems that fit into this model and have exclusively

these two capabilities are static analysis parallelizers: these systems make an offline analysis of

the program, identify parallel sections of code (typically loops), and modify the program so that

these sections of code are executed in parallel. These systems are still fairly limited in their

power to parallelize most programs; this stems from the fact that static analysis, to preserve

sequential semantics, must uncover parallelism following the principle that two sections of code

are not parallelizable until proven parallelizable. To understand why this is a major limitation,

consider the code samples in Listing 3.2 and Listing 3.3.

In the first listing, we have the declaration of two methods that manipulate a class instance

variable, a collection of Customers. The first method, penalizeRates, searches for customers with

bad credit, and penalizes their rates accordingly. The second method, applyGlobalPromotion,

searches for customers to whom a given promotion applies, and updates their rates accordingly.

Starting from the common-sense notion that promotions usually apply to good customers, we might be


public void updateCreditRates(){
    List<Customer> customers = getCustomers();

    for( Customer c: customers ){
        if( c.Credit >= CREDIT_THRESHOLD )
            c.penalizeRates();
    }
}

public void applyGlobalPromotion( Promotion promotion ){
    List<Customer> customers = getCustomers();

    for( Customer c: customers ){
        if( promotion.appliesTo( c ) )
            c.updateRates( promotion.Rates );
    }
}

Listing 3.2: Two sample methods manipulating the same collection of Customers.

public class Account{

    [...]

    public void transferFunds( Money amount, Account targetAccount ){
        if( this.hasNecessaryFunds( amount ) ){
            this.withdraw( amount );
            targetAccount.deposit( amount );
        }

        // Should never happen
        if( !this._complexValidator.validateAccountState( this ) ){
            [...]
        }
    }
}

Listing 3.3: An example of cold code (also known informally as should-never-happen code).


tempted to say that the set of customers updated by the first method will not usually intersect

the set of customers updated by the second. Beyond our common sense, there might even be some business rule that actually guarantees that they never intersect. Therefore, it would be perfectly safe to execute these two methods concurrently, were they to be called in the code. Unfortunately, a static-analysis–based parallelization system cannot perceive this: it detects a possible overlap in the accesses to the collection of Customers, and therefore will never identify parallelism between these two methods. Moreover, no matter how sophisticated the system is, the potential for parallelism may arise from particular patterns in the data used by the application (customers, promotions, etc.) and not from the structure of the application itself.

The second listing showcases another common phenomenon that also impairs static-analysis–based parallelization systems. This phenomenon, known as cold code (or sometimes

informally as should-never-happen code) is code present in an application to test for, detect, or

recover from some highly unlikely condition, or, in systems with complex business rules and low

tolerance to errors, to ensure that certain complex conditions are enforced at certain points. The

example shown in the listing is one of the latter cases. We have an Account class, with a method

transferFunds that allows us to transfer some amount to another Account. After the execution

of the main logic of the method, there is a sanity test validating what might be a very complex

set of conditions regarding the state of the Account object after the transfer. This sanity test may

access a great amount of data sources to ensure that a set of complex business rules, which should

never be violated by any operation, still hold. In practice, this code may never be relevant (i.e.,

the if could just as well be a no-op if the program was well built and preserved these business rules at all times). Static analysis will still be stumped by this type of code, though, specifically by the

possibility that these complex validations access data modified by other sections of code. It can

therefore fail to parallelize what in practice are perfectly parallelizable sections of code.

To deal with these limitations, we have to extend the basic model I presented in this section

with further capabilities. In the next section, I describe a possible solution, leading us to the family

of systems I call speculative parallelization systems.

3.3 Speculative Parallelization Systems

Automatic parallelization systems, as derived from the basic model I described in the previous

section, have some major limitations if we rely exclusively on the two fundamental capabilities. As

a matter of fact, there is not a great potential for increased performance in general sequential pro-

grams if, among the tasks identified in the program, few or none are eligible for parallel execution.

Fortunately, there are ways around this limitation; one possible solution is the addition of a third

capability as follows:

• Being able to identify and revoke the execution of tasks that have (potentially) violated the

sequential semantics of the original program.

By adding this third capability, we greatly expand the pool of possible candidate sections of code

for parallel execution. Returning to Listing 3.2, systems with this added capability are now able

to choose to execute penalizeRates and applyGlobalPromotion in parallel, as in the event of


the violation of sequential semantics (by the concurrent modification to the same Customer, for

instance) it is possible to revoke either or both tasks, and re-execute them in a way that does

not violate sequential semantics. As there is no longer the need for a formal proof to put sections

of code executing in parallel, these systems can make the decision of putting any two sections of

code in parallel execution, in the hope that sequential semantics will not be violated and increased

performance will result. This is congruent with the thread speculation systems I described in

Chapter 2, and I shall therefore refer to these systems as speculative parallelization systems.
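As a deliberately tiny sketch of this third capability (a hypothetical scheme, far simpler than a real transactional-memory backend), two tasks can be run optimistically while recording their write sets; if the sets overlap, the later task is revoked and re-executed after the first, restoring the sequential outcome:

```java
import java.util.Collections;
import java.util.HashSet;
import java.util.Set;
import java.util.function.Consumer;

// Sketch of optimistic execution with revocation (hypothetical names):
// each task records the locations it writes; on overlap, the second
// task is revoked and re-executed after the first.
public class SpeculationSketch {

    static class Task {
        final Set<String> writeSet = new HashSet<>();
        final Consumer<Task> body;

        Task(Consumer<Task> body) { this.body = body; }

        void run() { writeSet.clear(); body.accept(this); }

        void write(String location) { writeSet.add(location); }
    }

    // Runs both tasks optimistically; returns true if 'second' had to
    // be revoked and re-executed after 'first'.
    public static boolean runSpeculatively(Task first, Task second) {
        first.run();
        second.run(); // optimistic: assume no conflict with 'first'
        boolean conflict =
            !Collections.disjoint(first.writeSet, second.writeSet);
        if (conflict) second.run(); // revocation + sequential re-execution
        return conflict;
    }
}
```

A real system must also track read sets and roll back memory effects, which is precisely what the speculation support component (reviewed below) provides.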

Generically, the speculative parallelization system is composed of three components, as illus-

trated in Figure 3.1. The first, responsible for the first fundamental capability as defined in the

previous section, is the task identification component, which takes the sequential program as an

input and identifies the sections of code (tasks) that are able to be autonomously executed by a

single thread. As speculative parallelization systems are able to revoke the execution of tasks, the

criterion for task delimitation is not, or at least does not necessarily have to be, the feasibility of executing that section of code in parallel with other sections of code without interference and while preserving sequential semantics. Rather, these systems have much greater freedom in delimiting tasks, and the criteria for delimitation will tend more towards the ease of packaging the section of code into an autonomous task (a process that will usually involve some amount of memory copying, insertion of fork and join points, etc.) and the practical viability of the task (typically, sections of code that are too small or execute too fast are not viable tasks, as the overhead of creating the task itself will overshadow any performance gain). For concrete examples of speculative parallelization

systems and the criteria applied for task delimitation, see [2, 1], [38] and [27]. As mentioned before,

this component may take as optional inputs additional kinds of information that help with task

identification, such as the program’s source code.

The second component, responsible for the second fundamental capability, controls task spawn-

ing. This component accompanies the execution of the program, at least logically; in practice some

solutions use injected code to implement this responsibility, and there is not a central authority

for task spawning, but rather a sequence of local decisions injected at fork and join points. Be-

sides accompanying execution, this component is also responsible for spawning speculative tasks.

It makes the decision of whether it is or is not profitable to speculate at a certain point of the

execution of the program, and it is also responsible for return type prediction for the speculative

tasks (if present), and for collecting the necessary data and context to package autonomous spec-

ulative tasks for execution. Once again, this component may take as input more than simply the

output of the Task Identification component; a chief example is found in [38], where profiling data

of previous runs was fed into this component, allowing possibly better assessments to be made on

speculation benefits of individual tasks.

The third and final component of a generic speculative parallelization system is the specu-

lation support, which provides support for task speculation (by providing a mechanism for task

revocation). As reviewed in Section 2.2 of Chapter 2, several solutions with differing degrees of

complexity can be implemented to provide this support; the reviewed range of solutions went from

a simple copy-and-update scheme, to a full blown transactional memory runtime. Speculative

tasks run under the control of this component. Although pictured in Figure 3.1, the thread pool

where speculative tasks are executed is generally not under control of the speculative parallelization

system itself, meaning that the system has no control over the way tasks are scheduled for execu-

tion, once they have been handed over to the thread pool. As I show later in this work, this can


Figure 3.1: Generic component view of a speculative parallelization system. Arrows illustrate information flow between the three components: Task Identification, Task Spawning, and Speculation Support.


be a serious shortcoming, as the system-level scheduling policies treat these tasks as black boxes

and therefore can end up working against speculative parallelization, by systematically scheduling

tasks in a way that causes them to violate sequential semantics (and, in consequence, their revocation), degrading overall throughput and execution time. Before proceeding, though, it will be useful to have a brief discussion on the influence of the speculation support component on the speculative parallelization system.

3.3.1 Speculation Support

Several speculation support mechanisms with different properties have been used by researchers

in different settings. In the work of Tian et al. [38], a copy-and-update scheme is used: the

main thread of execution executes in a non-speculative fashion, and speculative threads (threads

executing speculative tasks) copy all the data necessary for speculative execution to their own

private memory, and, on successful speculation, commit the changes on their private memory to the

main thread’s memory. If a speculative task misspeculates, then all the data previously copied by

the thread is simply discarded. This approach was designed with the intent of making unsuccessful

speculation attempts have the same cost as successful ones – as the state of speculative threads

is simply discarded, no expensive rollback mechanisms are necessary to restore the program’s

memory to a state consistent with the sequential semantics of the program. Naturally, this comes

at the price of introducing a high overhead at each speculation attempt, as a potentially significant

amount of data must be copied back and forth between the speculative threads and the main

thread. Moreover, this speculation support mechanism has a scalability problem: all the speculative

threads need to communicate with the main thread’s memory, and therefore the access to the main

thread’s memory ultimately places a hard limit on the amount of speculative tasks that can be

run in parallel. In the case of the system of Tian et al., the speculative tasks consist solely of loop

step executions, in the fork-join style of parallel programming, as shown in Figure 3.2. As such,

the nature of parallel running tasks is somewhat limited, and the memory copying operations can

be optimized by lazy copying from the main thread’s memory, meaning that any given executing

thread can forego copying certain memory locations on spawning, and instead raise a runtime

exception upon trying to read that location for the first time, forcing it to communicate with the

main thread. After identifying variables that are not likely to be read in most iteration executions,

this simple mechanism can and does reduce wasteful copying, as reported in their experiments.
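The essence of such a copy-and-update scheme can be sketched as follows (hypothetical names, greatly simplified relative to the system of Tian et al.): a speculative task lazily copies the locations it reads into a private buffer, works exclusively on that buffer, and either commits it back to the main memory or simply discards it on misspeculation.

```java
import java.util.HashMap;
import java.util.Map;

// Simplified sketch of a copy-and-update speculation scheme: private
// per-task buffers, lazy copy-in on first read, copy-out on commit,
// and cheap revocation by discarding the buffer.
public class CopyAndUpdate {

    final Map<String, Integer> mainMemory = new HashMap<>();

    class SpeculativeTask {
        private final Map<String, Integer> privateCopy = new HashMap<>();

        int read(String loc) {
            // lazy copy-in: fetch from main memory on first access only
            return privateCopy.computeIfAbsent(loc,
                    l -> mainMemory.getOrDefault(l, 0));
        }

        void write(String loc, int value) { privateCopy.put(loc, value); }

        void commit()  { mainMemory.putAll(privateCopy); } // copy-out
        void discard() { privateCopy.clear(); }            // cheap revocation
    }
}
```

Note how misspeculation costs no more than success here: discard() just drops the buffer, at the price of copying data in and out on every speculation attempt.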

Unfortunately, it is hard to generalize this type of speculation support to other (non-loop based)

models of parallel speculative execution without losing the benefit of these optimizations, as they

rest on some assumptions about the task identification system; namely on the assumption that

speculative tasks access a relatively stable set of memory locations. This might be true in the

special case of loop-parallelism (to which I might add that even then, Tian et al. still have to rely

on profiling for differentiating profitable loops for speculation); however, in arbitrary code sections,

where this assumption may not hold, both memory copying on task spawning and lazy copying

by raising exceptions can become prohibitively expensive. The conclusion is that the speculation

support component in this work is tightly coupled to the task identification component. The

consequence of that coupling is that it becomes hard to exploit parallelism at different granularities,

even in approaches that require programmer input (as opposed to being fully automatic). Note

that this coupling is not specific to this work, but rather a broad trend in the speculative parallelization


Figure 3.2: Fork-join parallelism in the speculative parallelization system of Tian et al. This type of parallelization is recurrent in the literature. Three different versions of the same code are pictured: the static version, mirroring the structure of the loop in the source code; the sequential execution, unfolding the code for an actual execution; and the parallel execution, showing the bodies of loops being executed in parallel, and being serialized at their prologue (initialization of the loop) and epilogue (finalization of the loop). Adapted from [38].


public void oftenComplexOperation( OperationInput oi ){
    atomic{
        boolean complex = true;

        atomic{
            // Is this object in a complex state?
            complex = this.isInComplexState();
        }

        atomic{
            if( complex )
                this.performComplexOperation( oi );
            else
                this.performSimpleOperation( oi );
        }
    }
}

Listing 3.4: Nested transactions on an operation with different execution profiles.

literature, excluding the cases where simulated hardware is used (as in [10]).

To defeat this trend and reduce coupling between the task identification and speculation sup-

port component, we have to depend on a speculation support system that allows us total freedom

on the type of tasks that can be effectively speculated, instead of just on constructs like loops,

functions, etc, that may or may not correspond to the most interesting opportunities for paral-

lelization. As far as I know, the only available solution that allows us this type of freedom is a full

transactional runtime. Transactional runtimes allow us to enclose sections of code in transactions, meaning that they will execute (logically) in complete isolation from other transactions (if

the transactional runtime provides weak atomicity) or the rest of the program (if the transactional

runtime provides strong atomicity [7]). By using a transactional runtime as our speculation sup-

port, we are able to identify tasks of arbitrary size within the program, starting and ending

at any point in the program. If the transactional support allows nested transactions, we are even

able to define a hierarchy of tasks, giving a lot of freedom for the task spawning component to

orchestrate the parallel execution of the program in different ways. For example, consider Listing

3.4, where we have a method with different execution profiles: depending on the object’s state, the

method either performs a simple or a complex operation. The test that determines the path chosen

(isInComplexState) may itself be a computationally intensive operation. I have used the keyword

atomic to encapsulate transactions as could be defined by the task identification component. This

is done for illustration purposes only; the task identification component itself does not have access

to the source code and will therefore likely never produce this Listing as an artifact.

Now, by allowing nested transactions, the task spawning component may be able to decide, in

a scenario where other tasks are being put into parallel execution, whether to spawn two tasks for

this method, or just a single one (the top transaction). Depending on the information available

to the task spawning component on the nature of this method and other currently running tasks,

it can have the capability of gauging the execution profile of this particular execution and the


probability that either of the alternatives is going to degenerate into the violation of the sequential

semantics of the program by some tasks (and thereby their rollback) and decide accordingly.

Currently, as was reviewed in Chapter 2, there are few, if any, hardware-based transactional

runtimes generally available, so we have to resort to software transactional memory for a trans-

actional runtime. Significant advancements have been made in this field in the last few years;

still, few performant transactional runtimes are currently available, among them the Java Versioned Transactional Memory (JVSTM) [9, 8] and DeuceSTM [26]. (Performance is an indirect requirement for speculation support systems in the context of automatic parallelization: the ultimate goal is performance, and we cannot afford to succeed in logically parallelizing a program only to choke on the speculation support system.) I have used as the basis for my research a speculative parallelization system developed by Anjo [2, 1], which itself uses the JVSTM as its speculation support system; repeating the same experiments with DeuceSTM, which has different strengths and limitations than the JVSTM, remains as future work.

Regardless of the speculation support system, there always remain some problems regarding

practical viability of speculation that cause us to deviate from the theoretical scenario where

any section of code can be speculated freely. Non-transactional operations like input/output and

system calls with side effects are responsible for tasks that cannot be revoked, and therefore there

must be mechanisms in place to assure that these operations are executed in the correct execution

context (in the absence of transactional input/output systems or other coping mechanisms). These

limitations, along with possible solutions and JVSTM-specific limitations, have been thoroughly

explored by Anjo in [1].

3.4 Conclusion

In this chapter, I have described the inner workings of automatic speculative and non-speculative

parallelization systems, by deriving a generic model of their capabilities. I have also made an

introduction to the challenges presented by the speculation support component within speculative

parallelization systems, and how it has a crucial influence on the performance of these systems. In

the next chapter, I present the notion of contention management, and how it can help speculative

parallelization systems obtain better performance.


Chapter 4

Contention Management within

Speculative Parallelization

Systems

As reviewed in Chapter 2, contention managers were introduced by Herlihy in [23] as a mechanism

for software transactional memory runtimes that were not wait-free to help prevent livelocks and

starvation (and thereby increase performance and fairness). They work by ensuring that there is

some intelligence guiding transaction abortion and re-execution. By doing so, scenarios such as

the one presented in Chapter 2 can be avoided, although of course contention managers are by

nature heuristic and therefore cannot guarantee that a given application is going to be livelock

and starvation-free. Both Herlihy [23] and Scherer and Scott [32] also recognize that contention

managers are bound to be application specific and, following that premise, present an extensible

framework where contention managers implementing different policies can be plugged into the

software transactional memory runtime.

Contention managers, as their name suggests, take action in the presence of contention (a

race between two transactions to access and modify the same data in a way that causes one or

both to violate isolation or atomicity). In the case of thread-level speculation systems, as is the

case with speculative parallelization systems, this contention is detected in two different ways,

depending on the system: when there is an access to data that another transaction owns (for

example reading data that transaction has written to), or when a transaction tries to commit and

realizes it has accessed data another transaction owned before. In any of the cases, contention is

inferred from the presence of one or more conflicts. Contention managers have had good results

within several scenarios, as was reviewed in Chapter 2, so it is a pertinent question whether they

are a useful concept for speculative parallelization systems. As we shall see, within speculative

parallelization systems contention managers encounter two main problems not present in general

purpose transactional programs, namely increased false positives and reduced freedom of choice.

In this chapter, I will formulate these two problems precisely and then analyze how these problems

point us to a different approach towards conflict management.


public class DataSource{

    private String name;
    private boolean hasChanged;
    private DataSourceObserverCollection observers;

    public void setName( String name ){
        this.name = name;
        this.hasChanged = true;
    }

    public void discardChanges(){
        this.init();
        this.hasChanged = false;
    }

    public void notifyObservers(){
        if( this.hasChanged )
            observers.notifyObservers();
    }

    [...]
}

Listing 4.1: A DataSource class containing different operations manipulating the same variable.

4.1 False Positives

Speculative parallelization systems are a very peculiar case in what concerns contention manage-

ment; first of all, a conflict has a very well defined semantic significance: the sequential semantics

of the original program may have been violated. Whether it has actually been violated or not is

not easily determined. Even after removing trivial cases of false conflicts that may or may not be

recognized as such by the transactional memory runtime (such as replacing a value type object by

another equivalent), whether the sequential semantics have been violated or not remains always

open, as it depends on the execution flow of the program.

For an example, consider Listing 4.1. It contains the partial declaration of a DataSource

class, with two methods writing the hasChanged field, and one reading it. Now look at Listing

4.2. The method doStuffWithDataSource is a long method using a DataSource. Assume for

a moment that a speculative parallelization system has partitioned this method into three tasks,

corresponding to section 1, section 2 and section 3 indicated in the code. If these three sections

happened to be put into parallel execution, and section 1 and section 3 finished their computation

first, there would be a reported conflict on section 1 having written the hasChanged field, and

section 3 having read it. However, consider the case where this field had the false value initially.

Although section 1 would set the field to true, and section 3 has used the value false, section 2, had

the program been executed sequentially, could have set it to false again, depending on the value

of someCondition. Therefore, although we have detected the violation of sequential semantics,


public void doStuffWithDataSource( DataSource dataSource ){
    // Section 1
    dataSource.setName( name );
    [...]

    // Section 2
    if( someCondition )
        dataSource.discardChanges();
    [...]

    // Section 3
    dataSource.notifyObservers();
    [...]
}

Listing 4.2: A long method doing various manipulations with a data source.

it could be a false positive, depending on other sections of the program. Naturally, we cannot

expect a transactional runtime to recognize these false positives (in fact, it is easy to prove that it

is impossible to recognize them in the general case using a halting problem argument1).
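Whether this particular conflict is a false positive can be made concrete with a toy sketch; the helper name and initial value below are hypothetical illustrations of Listings 4.1 and 4.2, not code from any real system. It computes the value of hasChanged that section 3 observes in a purely sequential run:

```java
class FalsePositiveExample {
    // Sequential value of hasChanged as seen by Section 3, following
    // Listings 4.1/4.2: Section 1 sets it, Section 2 may reset it.
    static boolean hasChangedSeenBySection3(boolean someCondition) {
        boolean hasChanged = false;             // hypothetical initial value
        hasChanged = true;                      // Section 1: setName
        if (someCondition) hasChanged = false;  // Section 2: discardChanges
        return hasChanged;                      // value read by Section 3
    }
}
```

When someCondition holds, the sequential run also leaves the field at false, so a speculative section 3 that read the stale initial value behaved correctly and the reported conflict is a false positive; when it does not hold, the conflict is real.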

4.2 Reduced Freedom of Choice

Another profound difference between contention management for speculative parallelization sys-

tems and contention management for general purpose software transactional memory runtimes is

the serialization criterion. To the program (and programmer), transactions execute atomically. This

means that, for a given set of transactions {T1, T2, . . . , Tn} put into parallel execution on a general

purpose software transactional memory runtime, each of these n transactions will commit, and

therefore appear to execute, between two other transactions. So, in practice, the transactional

memory runtime will produce an output consistent with a possible sequential execution of these n

transactions. Each of these possible sequential executions is a serialization sequence. It immedi-

ately follows that, for n transactions, there are n! possible serialization sequences, corresponding

to all the possible sequences containing all of those transactions. For example, if I execute three

transactions (T1, T2 and T3), I have six possible serialization sequences: T1T2T3, T1T3T2, T2T1T3,

T2T3T1, T3T1T2 and T3T2T1.
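The n! counting argument can be checked mechanically with a short sketch (class and method names are hypothetical) that enumerates every serialization sequence of a set of transaction names:

```java
import java.util.ArrayList;
import java.util.List;

class Serializations {
    // Enumerate all serialization sequences (permutations) of a set of
    // transaction names; for n transactions this yields n! sequences.
    static List<List<String>> allSequences(List<String> txs) {
        List<List<String>> result = new ArrayList<>();
        permute(txs, new ArrayList<>(), result);
        return result;
    }

    private static void permute(List<String> remaining, List<String> prefix,
                                List<List<String>> out) {
        if (remaining.isEmpty()) {
            out.add(new ArrayList<>(prefix));
            return;
        }
        for (String t : remaining) {
            List<String> rest = new ArrayList<>(remaining);
            rest.remove(t);          // recurse without the chosen transaction
            prefix.add(t);
            permute(rest, prefix, out);
            prefix.remove(prefix.size() - 1);
        }
    }
}
```

For three transactions the sketch yields exactly the six sequences listed above.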

Contention managers, by means of making a decision on which transaction(s) to restart upon

detection of conflicts, have a very definitive outcome on the resulting serialization sequence. In

the simplest case, if I execute two transactions, T1 and T2, and they conflict, it might just happen

that if I abort and restart T1, then T2 immediately commits, achieving the sequence T2T1; and, if

I restart T2 instead, then T1 commits, achieving the sequence T1T2. In a general purpose software

transactional memory, every serialization sequence is a possible one, so the contention manager

1General terms of the proof: Consider that section 2 had a call to some function f before changing the value of the hasChanged field to some other value. Determining whether this section would actually change the value of the field would also include being able to determine whether f would ever return, as the assignment happens immediately after. Thus it would require solving the halting problem for function f.


has complete freedom of choice on what to abort and what to restart.

For a speculative parallelization system, not all of these serialization sequences can be accepted.

As the speculative parallelization system has to preserve the sequential semantics of the original

program, the actual serialization sequences that are allowed are (in general) a much smaller subset

of the set of all possible serialization sequences. It is not trivial to define what serialization

sequences are allowed; I will define them formally in the following paragraphs.

4.2.1 Acceptable Serialization Sequences

First of all, and since the acceptance or refusal of a serialization sequence is intimately related to the

semantics of the original program, it will be useful to define a state equivalence function equiv that,

given a runtime state of the original program Sorig and a runtime state of the parallelized program

Sparallel, outputs true if Sparallel can be interpreted as Sorig and false otherwise. This encapsulates

the notion that the parallelized version of the program is likely to have a different type of program

state (such as adding versioned boxes to objects [2, 1]), which can, in the absence of running

transactions, be interpreted as a state of the original program. Of course, the correspondence does

not have to be one-to-one, and can actually be many-to-many (suppose something akin to the

parallel collections used by [27], where a linked-list implementing a set could end up in different,

but nonetheless semantically equivalent states).

We also need the notion of original timestamp, which will allow us to impose a total order

on spawned transactions. Any transaction corresponds to a section of the original code being

packaged into a task; and therefore, when two or more transactions are started, any transaction

can be ordered in relation to any other transaction, by mirroring the order of the corresponding

sections of code in the original program. Note that this is the execution order and not necessarily

the source code order (which is not even known to the system).

Now let {T1, T2, . . . , Tn} be a set of transactions corresponding to a sequential section of code

from the original sequential program. An ur-sequence (borrowing the Germanic prefix ur, meaning

original, primitive) is a serialization sequence that, for any given state Sorig1 and Sparallel1 such

that equiv(Sorig1, Sparallel1) = true, verifies the following conditions:

• The sequential execution of the section of code from the original sequential program leads

from state Sorig1 to a valid state Sorig2

• The sequential execution of the transactions corresponding to this serialization sequence leads

from state Sparallel1 to a valid state Sparallel2

• equiv(Sorig2, Sparallel2) = true

Basically, the ur-sequence is a serialization sequence that produces behavior consistent with

the behavior of the section of the original program for any starting conditions. There is usually a

very obvious ur-sequence, namely T1T2 . . . Tn (assuming for the sake of simplicity that the trans-

actions are numbered according to their original timestamp as defined above). This corresponds

to the obvious observation that just wrapping the code into transactions and then executing them

sequentially should result in no violations of the sequential semantics.


public void dangerous( double value ){
    // Section 1
    double multiplier = this.getMultiplier();
    [...]

    // Section 2
    while( value < MAX_VALUE ){
        value = value * multiplier;
    }
    [...]
}

Listing 4.3: A dangerous method for speculative parallelization.

Unfortunately, this obvious sequence is not always a ur-sequence, and in fact there might be

none at all for a given section of code. I will not delve into details, as I shall not address this

issue in the current work; but do consider Listing 4.3. This is dangerous code for speculative

parallelization. If Section 1 and Section 2 get executed in parallel, it is easy to see that Section

2 may enter an infinite loop (if it uses the value 0 for multiplier, for instance), even if they

would get serialized in their original timestamp order. This happens because the second task never

reaches the point where it would commit and detect the conflict, and therefore restart with correct

values. In this case, early conflict detection (i.e. detection of the conflicts at the time the data is

acquired; see [36] for a more detailed overview on conflict detection schemes) would remove this

danger; I have yet to be convinced, however, that it is not subject to the same danger if present

in a more complex form.

On the other hand, there can also be multiple ur-sequences: a simple example are three sections

of code with no interdependencies at all. Three sections of code calculating the value of three

variables from independent data is a common example. Any serialization sequence will correspond

to the same resulting state, which corresponds to the state obtained by the original sequential

code, so all serialization sequences are ur-sequences.

In conclusion, for a given section of code there might be none or multiple ur-sequences, and

usually there is at least one (the obvious one constructed above). With these constructions I can

now state a theorem regarding acceptable serialization sequences, that is, serialization sequences

that can be accepted by the speculative parallelization system as they will cause no violation of

sequential semantics:

• For each section of code and starting state Sparallel1 for that section of code, an acceptable serialization sequence is a serialization sequence that, from that starting state, produces a state

Sparallel2 equivalent to the state produced by a ur-sequence for that section of code, from

that starting state.

As is to be expected, for most sequential programs, the set of acceptable serialization sequences

will be a fairly small subset of all possible serialization sequences.
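The definition above can be sketched in a toy model where program state is a single integer, tasks are pure functions on it, and equiv is plain equality. All names here are hypothetical illustrations; a real system would compare full runtime states:

```java
import java.util.List;
import java.util.Set;
import java.util.function.IntUnaryOperator;

class Acceptability {
    // Run a serialization sequence from a starting state.
    static int execute(List<IntUnaryOperator> sequence, int start) {
        int state = start;
        for (IntUnaryOperator task : sequence) {
            state = task.applyAsInt(state);
        }
        return state;
    }

    // A sequence is acceptable iff, from the given starting state, it
    // produces a state equivalent to that of some ur-sequence.
    static boolean isAcceptable(List<IntUnaryOperator> sequence, int start,
                                Set<List<IntUnaryOperator>> urSequences) {
        int result = execute(sequence, start);
        for (List<IntUnaryOperator> ur : urSequences) {
            if (execute(ur, start) == result) {
                return true;
            }
        }
        return false;
    }
}
```

With the trivial ur-sequence T1T2 as the only ur-sequence, an increment followed by a doubling makes only the original order acceptable, since the reversed order reaches a different final state.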


public void doStuffWithData( Data data ){
    // Section 1
    this.PrepareData( data );

    // Section 2
    this.ProcessData( data );
}

Listing 4.4: A simple code sample with two interdependent sequential operations.

4.2.2 Influence on Contention Managers

The reduced set of acceptable serialization sequences has a direct influence on contention managers,

as it makes some possible decisions useless. For example, consider the simple code in Listing

4.4, where Section 2 uses and modifies the data prepared in Section 1. In the absence of other

transactions, if we have a transaction running for Section 1 and another for Section 2 and they

conflict upon committing, it is useless to restart the transaction for Section 1 as it will always

conflict. This happens because Section 2 can never commit before Section 1 commits (as that is

not an acceptable serialization sequence). Therefore, the commit for Section 2 will be permanently

waiting and Section 1 will always conflict no matter how many times it is restarted. Basically, the

contention manager will have to restart Section 2 to make progress.
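In a system that only accepts the trivial ur-sequence (in-order commits), this collapses the contention manager's decision to a single useful policy, sketched below with hypothetical names: the victim is always the transaction with the later original timestamp, because restarting the earlier one can never unblock the commit order:

```java
class VictimChoice {
    // Minimal transaction descriptor, for illustration purposes only.
    record Tx(String name, long originalTimestamp) {}

    // Under in-order commits, always restart the later transaction:
    // restarting the earlier one leaves the same conflict in place.
    static Tx chooseVictim(Tx a, Tx b) {
        return a.originalTimestamp() <= b.originalTimestamp() ? b : a;
    }
}
```

Applied to Listing 4.4, the transaction for Section 2 is always the one restarted, regardless of which transaction detected the conflict.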

Naturally, this does not mean that there is no latitude for the contention manager’s decisions

to influence overall performance, as this is dependent on various factors, among them the type of

conflict detection used by the speculation support component (early or late [36]), the determinism

of the original program, the acceptable sequences that are actually accepted by the speculative

parallelization system, and other concurrent running transactions.

4.3 Conclusion

Contention management within speculative parallelization systems is generally less useful than on

general purpose software transactional memory systems, given two factors: the increased number

of false positives, and the reduced freedom of choice. In practice, this means that although a parallelized sequential program probably incurs many more conflicts than a parallel program (one that has been engineered from the start to run in parallel and therefore has fewer data dependencies),

the influence of contention management is prone to be smaller.

As I will demonstrate in the next chapter, this does not mean that being aware of conflicts

is useless. In fact the very opposite is true: we will want to go beyond contention management

as a way of solving conflicts, and use information about past conflicts to prevent future

conflicts (in an approach similar to the approach taken by Dragojevic et al. in [15]). Instead of

being fed to a contention manager, the information about conflicting tasks will be collected by a

special component that will then provide that information to a task scheduling component. This

conflict-aware task scheduling component will then be able to take preventive measures to avoid


tasks incurring into conflicts, thereby increasing overall performance.


Chapter 5

Conflict-Aware Task Scheduling

In Chapter 3, I presented a general model of speculative parallelization systems, and went into detail

into the role of each of the three elemental components: the task identification component, the task

spawning component, and the speculation support runtime. In Chapter 4, I discussed the concept of

contention management, invented in the context of general purpose software transactional memory

runtimes. I demonstrated how contention managers, when used on speculative parallelization

systems that use a transactional memory runtime for speculation support, face a set of limitations

specific to those systems. With this in mind, I shall now describe how a different solution, still based on the monitoring of conflicts between running tasks, has the potential to fare much better

than traditional contention management in this scenario. Then, I shall show how this solution can

be applied in practice to a specific speculative parallelization system, Anjo’s Jaspex [2, 1] (which

has been summarily presented in Chapter 2).

5.1 Towards Intelligent Schedulers in Speculative Paral-

lelization Systems

When I reviewed existing speculative parallelization systems in Chapter 2, I made it clear that most

speculative parallelization systems have been developed to work on top of some kind of parallel

execution environment, which is left unspecified and uncontrolled. This means that between the

spawning of a speculative task and the collection of its results (as well as detection of conflicts and

any further action taken by conflict managers, if present) there is a huge black box, the parallel

execution environment, that is responsible for picking up these tasks and executing them from start

to finish. Unfortunately, the fact that the operation of this environment is entirely uncontrolled

presents hazards for the performance of the speculative parallelization system.

Consider that among a given set of speculative tasks, some are bound to conflict and others not.

If the parallel execution environment has a number of tasks awaiting execution that exceeds the

available number of threads, then a decision will have to be made on which tasks to execute first. A

bad decision will cause unnecessary conflicts, thereby degrading the effectiveness of the speculative

parallelization system. These decisions, as of now, are being delegated to the parallel execution


public class Example{

    private Data a, b, c, d, e, f;

    // Does heavy computation using variable a
    void messWithA() { [...] }

    // Does heavy computation using variable b
    void messWithB() { [...] }

    // Does heavy computation using variable c
    void messWithC() { [...] }

    // Does heavy computation using variable d
    void messWithD() { [...] }

    // Does heavy computation using variable e
    void messWithE() { [...] }

    // Does heavy computation using variable f
    void messWithF() { [...] }

    // Does heavy computation using variables a, b, c
    void messWithABC() { [...] }

    // Does heavy computation using variables d, e, f
    void messWithDEF() { [...] }

    public void messWithEverything(){
        messWithA();
        messWithB();
        messWithC();

        messWithABC();

        messWithD();
        messWithE();
        messWithF();

        messWithDEF();
    }
}

Listing 5.1: An example of a class whose methods might be inefficiently executed by a black-box parallel execution environment.


Figure 5.1: An example of an inefficient schedule.

environment (typically the one natively provided by the operating system). This environment has

no information whatsoever on the nature of the tasks it is being handed, and therefore cannot

consistently succeed in scheduling tasks in a way that achieves maximum effectiveness for the

speculative parallelization system.

For an example of this effect (beyond some short remarks that occasionally surface in the literature and hint at this very same conclusion, although less directly, as reviewed in Chapter 2), consider Listing 5.1. If a speculative parallelization system were to execute

messWithEverything by executing speculatively and simultaneously each of the eight methods it

invokes, given a black-box parallel execution environment with three threads, we might easily

obtain the run pictured in Figure 5.1, where the tasks have been scheduled in a way that results

in three conflicts. A much better schedule (among others that are equally efficient) would produce

the run pictured in Figure 5.2, with no conflicts.


Figure 5.2: An example of an efficient schedule.


As we can see, the gains in performance can be significant. And this example in particular is

a simplified one (the tasks all have the same duration); the more complex the scenario, the less

likely it is that a blind parallel execution environment is going to produce the most efficient runs

for the tasks handed to it by the speculative parallelization system.

The solution is to add a fourth component to the speculative parallelization system. This fourth

component is the task scheduler, and it turns the black box scenario into a white box scenario.

Instead of being subject to the scheduling decisions of the parallel runtime environment, we can

now use information about the tasks to produce schedules that enhance performance. Naturally,

to be able to build such a schedule, we are going to need to collect some information on those

tasks.
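As a minimal sketch of what such a component could do (the API and the greedy policy are illustrative assumptions, not Jaspex's actual algorithm), a scheduler holding a table of previously observed conflict pairs could fill the available threads with pairwise conflict-free tasks, taken in original order:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

class GreedyScheduler {
    // Pick, for the available threads, a batch of tasks that are
    // pairwise conflict-free according to the collected conflict pairs,
    // walking the pending tasks in original order.
    static List<String> pickBatch(List<String> pending,
                                  Set<Set<String>> conflictPairs,
                                  int threads) {
        List<String> chosen = new ArrayList<>();
        for (String task : pending) {
            if (chosen.size() == threads) break;
            boolean clashes = false;
            for (String other : chosen) {
                if (conflictPairs.contains(Set.of(task, other))) {
                    clashes = true;
                    break;
                }
            }
            if (!clashes) chosen.add(task);
        }
        return chosen;
    }
}
```

Applied to the eight methods of Listing 5.1 with three threads, this policy starts with messWithA, messWithB and messWithC, deferring messWithABC, in the spirit of the conflict-free schedule of Figure 5.2.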

5.2 Collecting Conflict Information

There is a lot of information that can be used to feed a task scheduling component. Some data can

be collected after the task is executed, such as how long it took to execute or how many different

objects are accessed, among others. Other data are known or can be estimated beforehand, such as original timestamp (in the sense defined in Chapter 4), code size, and others. These data can also be related to task input (such as variables and parameters, if it is a method) to obtain various

profiles of execution, although I will not explore this topic in this work.

Obviously, one of the most interesting types of data that can be collected is information about

conflicts. This information can be coarse grained (how many times did the task conflict until now)

or fine grained (what tasks has this task conflicted with before, and in what way). By collecting

and feeding this data to the task scheduler, there is a good chance that we can help minimize

conflicts over time, as the task scheduler starts to obtain a reliable profile of application behavior.

This profile can even be stored for later use, so that the scheduler can use information from past

runs of the program.

It is not surprising that collecting information about conflicts and using it to make good sched-

ules will result in better performance (and in Chapter 6 I present some evidence it does). If we

go back and look at contention management after reaching this point, we can see that contention

management is actually a less generic form of conflict-aware task scheduling. A contention man-

ager collects data about conflicts, and in state-of-the-art contention managers these data are not consolidated and are used only once (for the resolution of the conflict); it also makes scheduling decisions, the difference being that it only applies differentiated scheduling decisions (such as: task

A has conflicted twice, and therefore is going to be re-executed only a minute from now) to tasks

that have conflicted. The move to a task scheduling component, together with a new data

collecting component that collects statistics about the tasks, is a natural move towards a much

more general, and also much more powerful form of contention management – the core difference

being the capability of anticipating and preventing conflicts rather than solving them. It is also a

necessary one in the specific case of speculative parallelization systems, where conflict management

is bound to be much less effective (as discussed in Chapter 4).

To finish, Figure 5.3 shows the component view of a speculative parallelization system, extended

with a task scheduling and a data collector component to enable conflict-aware task scheduling.


Figure 5.3: Component view of a speculative parallelization system with conflict-aware task scheduling. Arrows show the flow of information between the five components: Task Identification, Task Spawning, Task Scheduling, Data Collector and Speculation Support.


5.3 Extending Jaspex with Conflict-Aware Task Scheduling

As introduced in Chapter 2, Jaspex [2, 1] is a speculative parallelization system for the Java plat-

form, using the JVSTM [8, 9], a software transactional memory runtime, as speculation support.

Jaspex enforces sequential semantics by enforcing an ur-sequence, namely the trivial ur-sequence,

which means that transactions must commit in their original timestamp order. This makes it a

good case study, as the value of contention management is very limited for this system (as argued

in Chapter 4).

To implement conflict-aware task scheduling, I first incorporated the data collector component.

Jaspex, as of now, speculates on method executions only; therefore, the data collector creates a

map of conflicting methods, where, upon each conflict, a pair of conflicting methods is registered.

The JVSTM also had to be slightly changed, so that it would report for each conflict at least one

pair of conflicting transactions (in the standard implementation of JVSTM this information was

not made available externally, although the information is available internally). This data collector

component was implemented in a lock-free way, so that it can run in its own thread and be highly

scalable, presenting no contention point even for hundreds of threads. This was achieved by the

use of lock-free (test-and-set) hash tables on one hand, and on the other hand by allowing a small

margin of error. As the scheduler uses these data to make heuristic decisions, reading data within a small margin of error results at worst in a bad scheduling decision; this is a small price to pay. The margin of error we have to account for results from the simple fact that the data may evolve while the scheduler reads it (as it does not lock the data), and thus the scheduler may obtain a reading that does not correspond to a consistent view of the data at any given point in time. Fortunately, the resulting deviations are small in practice, as the scheduler reads the data at about the same rate as it is modified, a rate directly related to task throughput.
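A minimal sketch of such a data collector follows; the class and method names are illustrative assumptions, not Jaspex's actual API. Java's ConcurrentHashMap (not strictly lock-free, but close in spirit to the test-and-set hash tables described) gives the intended behavior: recording threads never block readers, and a reader may see a slightly stale view, which is acceptable for heuristic scheduling.

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

// Sketch of the data collector (illustrative names, not Jaspex's API).
// Each conflict between two methods is recorded symmetrically; concurrent
// reads may lag behind writes, which is fine for heuristic scheduling.
class ConflictMap {
    private final ConcurrentHashMap<String, Set<String>> enemies =
            new ConcurrentHashMap<>();

    // Called when the speculation support reports a pair of conflicting
    // transactions, identified here by the names of their methods.
    void recordConflict(String a, String b) {
        enemiesOf(a).add(b);
        enemiesOf(b).add(a);
    }

    // Queried by the scheduling policy; never blocks recording threads.
    boolean haveConflicted(String a, String b) {
        Set<String> s = enemies.get(a);
        return s != null && s.contains(b);
    }

    private Set<String> enemiesOf(String method) {
        return enemies.computeIfAbsent(method,
                k -> ConcurrentHashMap.newKeySet());
    }
}
```

A scheduler thread reading this map while conflicts are being recorded sees at worst a marginally outdated view, matching the small margin of error discussed above.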

Afterwards, the task scheduler component was developed to mediate access to the native thread pool. Given the number of available processors, the thread pool is constructed with one thread per processor minus one, as the task spawning component is already running in its own thread, alongside other JVM-native threads such as the garbage collector. After that, the task scheduler ensures that there are never more threads

computing than the number of available processors. It does this by keeping track of which tasks

are doing work and which are not - each task, upon starting or resuming a computation, checks in

as a working task and, upon stopping, checks out. The check-in/check-out mechanism ensures that

the processors are always busy, even when some tasks are not computing. There are two details

in Jaspex that introduce the requirement for this type of control; the first is the enforcement of a

particular serialization sequence. In consequence of this enforcement, once a task reaches the end

of computation, if it does not have the earliest timestamp, it cannot commit and must wait for

earlier tasks to finish. While waiting, these tasks check out at the scheduler, opening the door

for other tasks to be allowed to start computation. The second detail is the way non-transactional

actions are dealt with: as of now, non-transactional actions cause a task to stop computation in a

similar way to the previous situation, waiting for all tasks with earlier timestamps to finish. Once

they are finished and committed, if there are no conflicts, then the non-transactional (and therefore

irreversible) action executes. This mechanism preserves sequential semantics and, from the task scheduler's perspective, works in much the same way as the previous situation: the task checks out of the pool and checks back in when it is ready to resume computing.
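The check-in/check-out control described above can be sketched as follows, in a deliberately simplified form; the names and the admission logic are illustrative assumptions, and the real Jaspex scheduler hands tasks to its own thread pool rather than to a generic Executor.

```java
import java.util.ArrayDeque;
import java.util.Queue;
import java.util.concurrent.Executor;

// Sketch of the check-in/check-out control (illustrative, not Jaspex's API).
// Tasks are admitted only while fewer tasks are computing than there are
// processors; a task that blocks (waiting to commit in timestamp order, or
// for a non-transactional action) checks out, freeing its slot.
class WorkSharingScheduler {
    private final int maxWorking;       // processors available to tasks
    private int working = 0;            // tasks currently computing
    private final Queue<Runnable> pending = new ArrayDeque<>();
    private final Executor pool;

    WorkSharingScheduler(int processors, Executor pool) {
        this.maxWorking = processors;
        this.pool = pool;
    }

    synchronized void submit(Runnable task) {
        pending.add(task);
        admit();
    }

    // Called by a task when it resumes computing on its own.
    synchronized void checkIn() { working++; }

    // Called by a task when it stops computing (finished or blocked).
    synchronized void checkOut() {
        working--;
        admit();                        // a slot opened: admit a pending task
    }

    synchronized int working() { return working; }

    private void admit() {
        while (working < maxWorking && !pending.isEmpty()) {
            working++;                  // the admitted task counts as working
            pool.execute(pending.poll());
        }
    }
}
```

In this sketch the central lock makes the work-sharing nature explicit: all admission decisions pass through one point of control, which is what a work-stealing design would lack.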


Due to these particularities, the task scheduler is a scheduler of the work-sharing type. The

work stealing approach, lacking a central point of control, does not adapt well to this scenario, with

tasks stopping and resuming computation, and having relative interdependence (given that they

have to wait for each other to commit in the correct order, or to execute non-transactional actions).

Naturally, the scheduler in the work sharing approach can become a bottleneck in the speculative

parallelization system. The exploration of the subtleties required for a correct implementation of

a work stealing approach in this scenario remains as future work.

Finally, the scheduling intelligence is plugged into the scheduler, by means of a scheduling

policy. This scheduling policy has access to the data collector, and can be plugged into and unplugged from the scheduler itself. If no scheduling policy is available, then by default the scheduler

works as a first-in-first-out queue: the earliest tasks get executed first. On the other hand, if a

scheduling policy is available, the scheduler queries the policy on what task to execute next, passing

to the policy two collections of tasks: the collection of tasks awaiting execution (pending), and the

collection of tasks currently executing. Given this context and all available information collected

by the data collector, the scheduling policy must then either return a task from the collection of pending tasks for it to be executed, or return nothing. In the latter case, the

scheduler will not put anything into execution, and will query the scheduling policy the next time a

task checks out or finishes computing, as usual. Although dangerous (a defective scheduling policy may cause a complete stop in the execution of the program), the decision not to put anything into execution is also a valid and important one. After all, if there is one particular task that conflicts with every other task (for example), then there is no point in putting any other task into execution while that particular task is executing.

As the scheduling policies are pluggable, experimentation with different kinds of scheduling

policies is easy, in the same way as pluggable contention managers in DSTM [32]. It is also

possible to develop application-specific scheduling policies, or to reuse policies trained on application runs. In the context of this work I developed a simple scheduling policy, which was used

for all the experimental results reported in the next chapter. This policy is the No-Conflicters

policy, and uses a simple and draconian rule: while a candidate task has an enemy in the thread

pool – that is, a task with which it has conflicted in the past – it is not put into execution. In this

way, tasks that have conflicted in the past are never again put into simultaneous execution.
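A sketch of this pluggable design and of the No-Conflicters policy follows; the interface names and signatures are illustrative assumptions, not the actual Jaspex API.

```java
import java.util.Collection;

// A task is identified here simply by the name of its speculated method.
class Task {
    final String method;
    Task(String method) { this.method = method; }
}

// Minimal read-only view over the data gathered by the data collector.
interface ConflictHistory {
    boolean haveConflicted(String a, String b);
}

// The scheduler queries the policy with the pending and executing tasks;
// the policy returns a task to run next, or null to run nothing until the
// scheduler asks again (when a task checks out or finishes).
interface SchedulingPolicy {
    Task next(Collection<Task> pending, Collection<Task> executing);
}

// No-Conflicters: never run a candidate while an "enemy" (a task it has
// conflicted with in the past) is in the pool.
class NoConflicters implements SchedulingPolicy {
    private final ConflictHistory history;

    NoConflicters(ConflictHistory history) { this.history = history; }

    @Override
    public Task next(Collection<Task> pending, Collection<Task> executing) {
        for (Task candidate : pending) {
            boolean enemyRunning = executing.stream().anyMatch(
                    running -> history.haveConflicted(candidate.method,
                                                      running.method));
            if (!enemyRunning) {
                return candidate;   // no known enemy currently executing
            }
        }
        return null;                // every pending task has an enemy running
    }
}
```

Note how returning null when every candidate has an enemy in the pool is exactly the "execute nothing" option discussed above.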

5.4 Conclusion

In this chapter, I introduced the concept of conflict-aware task scheduling, the intelligent scheduling

of tasks based on conflict information collected and estimated during runtime, and how it is related

to contention management. Contention management takes local scheduling decisions based on

short-lived information about a conflict between transactions, and therefore presents itself as a

specialization of conflict-aware task scheduling. Where contention management could only resolve

conflicts, the addition of conflict-aware task scheduling enables both conflict resolution and conflict prevention.

To add this capability to a speculative parallelization system, the addition of at least two new

components is necessary: a data collector component, responsible for collecting and preserving


data about tasks and incurred conflicts, and a task scheduling component, that takes over the

responsibility of managing how spawned tasks get executed in the thread pool.

In the actual implementation of conflict-aware task scheduling for the Jaspex system, I divided

the responsibility between three components rather than two, to separate two types of concerns:

how to manage a limited thread pool (rather than the logically unlimited thread pool presented

by parallel execution environments), given that tasks do not execute linearly from start to finish;

and deciding what tasks to execute given the information collected so far. By separating these

concerns, I obtained a pluggable framework, where diverse scheduling policies, application-specific

or otherwise, can be plugged in for research and/or performance.


Chapter 6

Experimental Results

In this chapter, I present experimental results on conflict-aware task scheduling. In the first section,

I use the extended Jaspex system, as described in the previous chapter, on a number of bench-

marks, as well as on a simple, fabricated benchmark used as a proof of concept. Unfortunately,

due to current limitations on the capabilities of Jaspex for speculative parallelization, and also

due to a lack of benchmarks specially developed for the evaluation of speculative parallelization

systems, results on Jaspex did not go very far beyond proof of concept. Therefore, in the second

section of this chapter, I extend a well-known software transactional memory benchmark, STMBench7 [17],

with conflict-aware task scheduling, and measure the resulting performance gain on two different

machines.

6.1 Results on Jaspex

All the experiments with Jaspex were run on the Phobos machine, a machine with two quad-core

Intel Xeon CPUs E5520 with hyperthreading. I ran the tests with eight threads, one per core.

Each run consisted of executing a program from start to end in eight iterations, allowing three

iterations for warmup and then taking the average execution time of the remaining five iterations.

Each program was run in this way twice, once under the control of (and thus parallelized by) the

regular Jaspex system, and once under the control of the extended Jaspex system.

As the starting point for these experiments, I implemented a simple, fabricated benchmark in the style of Listing 5.1 in Chapter 5. I populated every method with time-consuming computations,

using the data variables in a way that would cause them to conflict in the same way as shown in

that example (messWithA, messWithB and messWithC conflicting with messWithABC, and, in the

same way, messWithD, messWithE and messWithF conflicting with messWithDEF, and finally one

extra method that manipulates all of the variables and therefore would conflict if simultaneously

executed with any other method). I created a program that would continuously execute this

example in a loop, allowing for the data collector component to collect significant data to feed the

task scheduling component.
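The following sketch illustrates the shape of this benchmark; the field types and the busy-work computation are assumptions, as the actual code follows Listing 5.1.

```java
// Sketch of the artificial benchmark's shape (illustrative assumptions;
// the actual code follows Listing 5.1 in Chapter 5).
class ArtificialBenchmark {
    private long a, b, c, d, e, f;

    // A deterministic, time-consuming computation (an iterated LCG step).
    static long compute(long seed) {
        long x = seed;
        for (int i = 0; i < 1_000_000; i++) {
            x = x * 6364136223846793005L + 1442695040888963407L;
        }
        return x;
    }

    // Each messWithX method reads and writes only x, so the three of them
    // conflict with messWithABC (which touches a, b, and c) but not with
    // each other.
    void messWithA() { a = compute(a); }
    void messWithB() { b = compute(b); }
    void messWithC() { c = compute(c); }
    void messWithABC() { long x = compute(a + b + c); a = x; b = x; c = x; }

    // The same pattern for the second group of variables.
    void messWithD() { d = compute(d); }
    void messWithE() { e = compute(e); }
    void messWithF() { f = compute(f); }
    void messWithDEF() { long x = compute(d + e + f); d = x; e = x; f = x; }

    // The extra method manipulates all of the variables, and therefore
    // conflicts with any other method executed simultaneously.
    void messWithAll() {
        long x = compute(a + b + c + d + e + f);
        a = x; b = x; c = x; d = x; e = x; f = x;
    }

    // One iteration of the driver loop; the experiment executed this
    // continuously so the data collector could gather conflict data.
    void runOnce() {
        messWithA(); messWithB(); messWithC(); messWithABC();
        messWithD(); messWithE(); messWithF(); messWithDEF();
        messWithAll();
    }
}
```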

The results of using the extended Jaspex versus plain Jaspex on this artificial benchmark were,


predictably, visible. In the regular Jaspex the average time spent per iteration of the loop was 27.2 seconds; in the extended Jaspex this same measurement amounted to 24.5 seconds, a

speedup of about 11%. The scheduling policy used was the No-Conflicters policy described in

the previous chapter; a qualitative assessment of the runs showed an evolution from the type of

schedule shown in Figure 5.1 to one of the type shown in Figure 5.2, with no conflicts, as the policy quickly picked up the conflict profile of each method (i.e., with which methods it conflicted) and prevented further conflicts.

Although the speedup is low for an artificial benchmark, these results point out that there are

at least some consistent gains to be obtained by adding conflict-aware task scheduling to Jaspex.

To determine exactly how much margin there was for performance gains, I did a second set of runs, this time plugging in an ideal scheduling policy, hand-crafted for this specific example. By using this

scheduling policy, the average time per iteration of the loop was reduced to 15.7 seconds, a much

more significant speedup of about 73% relative to the run on regular Jaspex. Note that this is

not necessarily the highest speedup attainable for this benchmark: I integrated task scheduling

and scheduling policies in Jaspex using a greedy approach, where the scheduling policy enforces

a good schedule by picking one task at a time, given a limited context (the tasks in execution

and those awaiting execution). Therefore, the scheduling policies implemented in this way do not

necessarily have a global view of the program. Greater performance gains may be possible with

more sophisticated schemes.

Nonetheless, these results show, on a proof-of-concept basis, that there is potential for significant performance gains by using conflict-aware task scheduling. They also show that the scheduling

policy is a key ingredient of this success.

After these preliminary results, I tried obtaining results on several benchmarks, among them

non-parallel benchmarks of the DaCapo suite of benchmarks [3, 4], the Java Grande benchmarks

[34], and the JatMark benchmark [35]. I was unable to obtain speedup on any of these benchmarks,

although I did not obtain any slowdown either. Given that the data collector component is lock-free, and given that the scheduler was using a very simple and computationally efficient scheduling policy, all while running in its own thread, it is not surprising that I was able to at least maintain normal Jaspex-level performance. It is also not very surprising that I did not obtain any speedup, due to three main problems, in order of importance:

• Lack of speculation: Jaspex is a system still being heavily improved, but as of the writing of

this work, it was still very limited in the amount of speculation it was able to perform. Anjo

outlines these limitations in detail in [1]. The most important of them is that Jaspex is not

able to speculate on a method when it cannot determine its entry parameters. Jaspex also

does not instrument the Java runtime itself, so it cannot speculate on any methods of objects

of the Java runtime. And, in the absence of enough tasks being speculated, there is very

little that can be improved by adding a scheduling policy (think of the extreme case where

only one task is delivered to the execution runtime at a time). This is also the main reason

why there was no slowdown either, after adding the scheduler.

• Non-transactional actions: Besides the obvious non-transactional actions such as input/out-

put, Jaspex considers as non-transactional actions any computation involving arrays, as well

as any call to a native method. These non-transactional actions cause a huge degradation of


performance, as they force the runtime to wait for all execution prior to the non-transactional

action to be finished. In the case of arrays, any benchmark using arrays is severely hampered.

• Unsuitable benchmarks: Some of the benchmarks, like JatMark and JavaGrande, were developed for measuring Java performance, and are not necessarily suited for the validation of speculative parallelization systems. The ideal benchmarks for speculative parallelization systems are ones that present opportunities for various levels of parallelism, as I believe many general-purpose applications do. Benchmarks like these were instead designed with the performance of low-level operations in mind, present applications that deal with few levels of abstraction, and are unrepresentative of real-world applications.

In addition to the context of automatic parallelization, I also obtained results on the addition

of conflict-aware scheduling in the more general scope of general purpose software transactional

memory systems, which I present in the next section.

6.2 Results on STMBench7

The STMBench7 benchmark [17] is a software transactional memory benchmark that is already

parallel; it emulates some properties of parallel-executing, object-oriented systems by having a graph

of objects that is continually read and modified by different operations. Some of these operations

operate on small portions of the graph, others on large portions; some only read data, and others

write data or alter the graph structure itself. The benchmark therefore provides a good opportunity

for evaluating performance of software transactional memory implementations. It also happens to

provide a good opportunity for evaluating the effectiveness of conflict-aware task scheduling. For

my experiments, I used the JVSTM as the software transactional memory runtime upon which the

STMBench7 benchmark ran.

To construct my experiment, I first had to modify STMBench7 so that it would present

a slightly different behavior. In the original benchmark, each thread continually generates and

executes random operations, but there is never an excess of tasks available, as each thread only

generates a task after finishing its current one. I modified this behavior, so that instead of one task

being generated for each thread that goes idle, a batch of N tasks is generated at the beginning of

the benchmark, and then another batch is generated every time the current batch runs out. For my

experiments I used the N value of 2000, although I also briefly experimented with other numbers,

such as 100, 500, and 1000, and came to the conclusion that this number had practically no influence on the results (as long as it is well above the number of available threads).
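The modified generation scheme can be sketched as follows; the class and the operation representation are illustrative assumptions, as the real change hooks into STMBench7's own operation generator.

```java
import java.util.Queue;
import java.util.Random;
import java.util.concurrent.ConcurrentLinkedQueue;

// Sketch of the batch-generation change (illustrative names). Instead of
// generating one operation per idle thread, a batch of N operations is
// generated up front and refilled whenever it runs out, so there is always
// an excess of pending tasks for the scheduler to choose from.
class BatchedOperationSource {
    private final int batchSize;        // N = 2000 in the experiments
    private final Queue<Integer> batch = new ConcurrentLinkedQueue<>();
    private final Random random = new Random();

    BatchedOperationSource(int batchSize) {
        this.batchSize = batchSize;
        refill();
    }

    // Worker threads take the id of the next operation to execute.
    Integer nextOperation() {
        Integer op = batch.poll();
        if (op == null) {               // the current batch ran out
            synchronized (this) {
                if (batch.isEmpty()) {
                    refill();           // generate the next batch
                }
            }
            op = batch.poll();
        }
        return op;
    }

    private void refill() {
        for (int i = 0; i < batchSize; i++) {
            batch.add(random.nextInt(100));   // a random operation id
        }
    }
}
```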

After this modification, I added a data collector component, collecting conflict data in the same

way as in Jaspex, by classifying the operations available in the benchmark and registering a map of

conflicts. I then added the same scheduler component I had developed for Jaspex, and the same

No-Conflicters policy.

I ran the benchmark on two different machines: Phobos, a machine with two quad-core Intel

Xeon CPUs E5520 with hyperthreading (the same one used for the experiments with Jaspex),

and Azul, an Azul Systems' Java Appliance with 208 cores. On each of these machines, I ran the benchmark with three different workloads: a workload consisting mostly of read-only operations (read-dominated workload), another consisting mostly of write operations (write-dominated workload), and a third where both kinds of operations are roughly in equilibrium (read-write workload). Each workload was run for two minutes. The output of these runs is the average throughput, in operations per second.

Number of threads    Read   Read+S   Read-Write   Read-Write+S   Write   Write+S
  1                  3676     4009       2710          2728        2351      2463
  2                  7402     7436       4983          5422        3180      4302
  4                 12822    13480       6918          9138        3736      4034
  8                 20827    18081       7518          7302        3269      3017
 16                 18478    12055       4575          5807        1874      2900

Table 6.1: STMBench7 data for the three types of workload on Phobos, for the regular benchmark and with added scheduling (+S). Average throughput is in operations/second.

Number of threads    Read   Read+S   Read-Write   Read-Write+S   Write   Write+S
  1                   217      222        182           187         148       146
  2                   466      443        298           364         177       213
  4                   870      740        414           492         190       227
  8                  1500     1033        475           474         185       227
 16                  1618     1725        447           511         167       274
 32                  1594     1324        407           528         167       267
 64                  1651     1761        380           578         164       270
128                  1156     1049        302           601         132       257
256                   625     1181        190           558          74       274

Table 6.2: STMBench7 data for the three types of workload on Azul, for the regular benchmark and with added scheduling (+S). Average throughput is in operations/second.

Figure 6.1 and Table 6.1 show the results obtained when running the benchmark on Phobos.

Each workload is run with an increasing number of threads, in powers of two. In the read-dominated

workload, there are hardly any conflicts (in fact, in the JVSTM, read-only transactions never

conflict). Therefore, it is natural that the read-dominated workload exhibits degraded performance

with the added scheduling, given the extra computing work that ends up being useless in the

absence of conflicts. We can also see that after the number of threads exceeds the number of

physical processors available, there is a loss of performance in both scenarios, due to wasteful

context switching forced on the operating system.

For the read-write workload, there is a similar shape, with the same loss in performance after

the number of threads exceeds the number of processors. However, here the conflict-aware task

scheduling achieves greater throughput, with a gain of 32% at its peak (at 4 threads). For any

number of threads, the benchmark with added scheduling is able to prevent conflicts between

conflict-prone tasks, and thereby always stays at or above the throughput level of the regular benchmark.

The write-dominated workload is even more interesting, with a peak gain of 35% at two threads


(similar to the read-write workload). However, due to the high number of conflicts for this workload, at a high number of threads few tasks are selected for simultaneous execution (fewer than

the number of available threads, as the scheduling policy has the option of not putting anything

into execution), and therefore the throughput decreases much less than in the regular benchmark.

The results obtained when running the benchmark on the Azul machine are detailed in Figure

6.2 and Table 6.2, and they show improved results. In the read-dominated workload, we have a

scenario akin to the scenario on the Phobos machine, with the regular benchmark slightly out-

performing the scheduling benchmark. In the read-write workload and in the write-dominated

workload, however, there is a consistent advantage for the scheduling benchmark. Not only are

the performance gains much more significant than the results on Phobos, with a peak performance

gain of 195% for 256 threads on the read-write workload, and a peak performance gain of 288%

for the write-dominated workload, but adding conflict-aware task scheduling succeeds in turning a trend of descending performance (as the number of threads increases) into a trend of ascending

performance.

These results show that conflict-aware task scheduling can have a great influence on perfor-

mance, even with a naive scheduling policy such as the No-Conflicters policy, used throughout

these experiments. And the greater the probability of conflicts, the greater the performance gains

obtained with their prevention.


Figure 6.1: STMBench7 runs for three types of workload on Phobos, with and without scheduling. The x-axis shows the number of worker threads, and the y-axis the average number of operations per second.


Figure 6.2: STMBench7 runs for three types of workload on Azul, with and without scheduling. The x-axis shows the number of worker threads, and the y-axis the average number of operations per second.


Chapter 7

Conclusion and Future Work

In the course of this work, I derived a three-component model of speculative parallelization systems,

and reviewed a set of speculative parallelization systems in the light of this model, allowing me to synthesize a theoretical approach towards them. On this basis, I analyzed the concept of

contention management, and its usefulness to these systems. By analyzing the limitations of con-

tention management within these systems, I demonstrated how contention management has few

opportunities to influence the performance of a speculative parallelization system, contrasting with

the good results reported by the literature in general purpose software transactional memory sys-

tems. The limited usefulness of contention management within speculative parallelization systems

led directly to the generalization of the concept, by characterizing contention management as

an instance of the much broader concept of conflict-aware task scheduling. Although conceived

within this scope, the concept has direct applicability to general purpose software transactional

memory systems as well.

I extended the three-component model of the speculative parallelization system to a five-component model with added conflict-aware scheduling. I then implemented this approach on both a

state-of-the-art speculative parallelization system (Anjo’s Jaspex), and on a traditional software

transactional memory benchmark, STMBench7. I also developed a simple scheduling policy for the

task scheduling component, the No-Conflicters policy, a simple and draconian policy that prevents

simultaneous execution of any tasks that have conflicted in the past.

In the experimental phase, I showed how the conflict-aware task scheduling approach can func-

tion in principle by the use of a small artificial benchmark. Unfortunately, due to the severe

limitations still present in state-of-the-art speculative parallelization systems, and due to the ab-

sence of appropriate benchmarks, I was unable to obtain further validation. In the traditional

software transactional memory benchmark, STMBench7, the results I obtained were highly pos-

itive and validated the thesis that the approach is a viable one for increased performance, and

certainly not exclusively for speculative parallelization systems, but also for general-purpose software transactional memory systems.

Given the rather exploratory nature of this work, there are several topics left open, among them

improvements to speculative parallelization systems. Instead of an exhaustive list, I have chosen

to present only those that I consider the most important topics for the progress of a conflict-aware


task scheduling approach, some of which I have hinted at various points in the text:

• Validate the approach with other software transactional memory runtimes (namely with Deuce, another high-performance runtime). As different systems are bound to have dif-

ferent strengths and weaknesses, it would be a strong step towards confirming the soundness

of the approach.

• Increase the amount of information collected and estimated about the tasks, such as task

duration, range of data accessed, abort-to-commit ratio, among others, and incorporate this

information into the scheduling policies, so that better decisions can be made earlier in the

program’s execution.

• Develop and test other scheduling policies, and make a comparative study of their effectiveness, both among themselves and relative to perfect schedules.

• Develop a realistic benchmark for speculative parallelization, focused on object-oriented com-

putation at various levels of abstraction rather than performance of low-level operations.


Bibliography

[1] Ivo Filipe Silva Daniel Anjo. Jaspex: Speculative parallelization on the Java platform. Master's thesis, Instituto Superior Tecnico / Universidade Tecnica de Lisboa, October 2009.

[2] Ivo Filipe Silva Daniel Anjo and Joao Cachopo. Jaspex: Speculative parallel execution of Java applications. In 1st INFORUM. Faculdade de Ciencias da Universidade de Lisboa, September 2009.

[3] S. M. Blackburn, R. Garner, C. Hoffman, A. M. Khan, K. S. McKinley, R. Bentzur, A. Diwan, D. Feinberg, D. Frampton, S. Z. Guyer, M. Hirzel, A. Hosking, M. Jump, H. Lee, J. E. B. Moss, A. Phansalkar, D. Stefanovic, T. VanDrunen, D. von Dincklage, and B. Wiedermann. The DaCapo benchmarks: Java benchmarking development and analysis. In OOPSLA '06: Proceedings of the 21st Annual ACM SIGPLAN Conference on Object-Oriented Programming, Systems, Languages, and Applications, pages 169–190, New York, NY, USA, October 2006. ACM Press.

[4] S. M. Blackburn, R. Garner, C. Hoffman, A. M. Khan, K. S. McKinley, R. Bentzur, A. Diwan, D. Feinberg, D. Frampton, S. Z. Guyer, M. Hirzel, A. Hosking, M. Jump, H. Lee, J. E. B. Moss, A. Phansalkar, D. Stefanovic, T. VanDrunen, D. von Dincklage, and B. Wiedermann. The DaCapo benchmarks: Java benchmarking development and analysis (extended version). Technical Report TR-CS-06-01, 2006. http://www.dacapobench.org.

[5] Bill Blume, Rudolf Eigenmann, Keith Faigin, John Grout, Jay Hoeflinger, David Padua, Paul Petersen, Bill Pottenger, Lawrence Rauchwerger, Peng Tu, and Stephen Weatherford. Polaris: The next generation in parallelizing compilers. In Proceedings of the Workshop on Languages and Compilers for Parallel Computing, pages 10–1. Springer-Verlag, Berlin/Heidelberg, 1994.

[6] Robert D. Blumofe and Charles E. Leiserson. Scheduling multithreaded computations by work stealing. Journal of the ACM, 46(5):720–748, 1999.

[7] Colin Blundell, E Christopher Lewis, and Milo M. K. Martin. Deconstructing transactions: The subtleties of atomicity. In Fourth Annual Workshop on Duplicating, Deconstructing, and Debunking, June 2005.

[8] Joao Cachopo and Antonio Rito-Silva. Versioned boxes as the basis for memory transactions. In Workshop on Synchronization and Concurrency in Object-Oriented Languages (SCOOL 05), October 2005.

[9] Joao Cachopo and Antonio Rito-Silva. Versioned boxes as the basis for memory transactions. Science of Computer Programming, 63(2):172–185, 2006.

[10] M. K. Chen and K. Olukotun. Exploiting method-level parallelism in single-threaded Java programs. In PACT '98: Proceedings of the 1998 International Conference on Parallel Architectures and Compilation Techniques, page 176, Washington, DC, USA, 1998. IEEE Computer Society.

[11] Emanuel Amaral Couto. Speculative execution by using software transactional memory. Master's thesis, Universidade Nova de Lisboa, 2009.


[12] David Culler, J. P. Singh, and Anoop Gupta. Parallel Computer Architecture: A Hardware/Software Approach. Morgan Kaufmann, 1st edition, 1998. The Morgan Kaufmann Series in Computer Architecture and Design.

[13] Francis Dang, Hao Yu, and Lawrence Rauchwerger. The R-LRPD test: Speculative parallelization of partially parallel loops. Parallel and Distributed Processing Symposium, International, 1:0020, 2002.

[14] Jeffrey Dean and Sanjay Ghemawat. MapReduce: simplified data processing on large clusters. Communications of the ACM, 51(1):107–113, 2008.

[15] Aleksandar Dragojević, Rachid Guerraoui, Anmol V. Singh, and Vasu Singh. Preventing versus curing: avoiding conflicts in transactional memories. In Proceedings of the 28th ACM symposium on Principles of distributed computing, PODC ’09, pages 7–16, New York, NY, USA, 2009. ACM.

[16] Michael R. Garey and David S. Johnson. Computers and Intractability: A Guide to the Theory of NP-Completeness. W. H. Freeman & Co., New York, NY, USA, 1990.

[17] Rachid Guerraoui, Michał Kapałka, and Jan Vitek. STMBench7: A benchmark for software transactional memory. In Proceedings of the Second European Systems Conference EuroSys 2007, pages 315–324. ACM, March 2007.

[18] Manish Gupta and Rahul Nim. Techniques for speculative run-time parallelization of loops. In Supercomputing ’98: Proceedings of the 1998 ACM/IEEE conference on Supercomputing (CDROM), pages 1–12, Washington, DC, USA, 1998. IEEE Computer Society.

[19] Tim Harris, James R. Larus, and Ravi Rajwar. Transactional Memory, 2nd Edition. Morgan & Claypool, 2010.

[20] Maurice Herlihy. Wait-free synchronization. ACM Transactions on Programming Languages and Systems, 13(1):124–149, 1991.

[21] Maurice Herlihy and Eric Koskinen. Transactional boosting: a methodology for highly-concurrent transactional objects. In PPoPP ’08: Proceedings of the 13th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, pages 207–216, New York, NY, USA, 2008. ACM.

[22] Maurice Herlihy, Victor Luchangco, Mark Moir, and William N. Scherer, III. Software transactional memory for dynamic-sized data structures. In PODC ’03: Proceedings of the twenty-second annual symposium on Principles of distributed computing, pages 92–101, New York, NY, USA, 2003. ACM.

[23] Maurice Herlihy and J. Eliot B. Moss. Transactional memory: architectural support for lock-free data structures. In ISCA ’93: Proceedings of the 20th Annual International Symposium on Computer Architecture, pages 289–300, New York, NY, USA, 1993. ACM.

[24] Maurice Herlihy and Nir Shavit. The Art of Multiprocessor Programming. Morgan Kaufmann, March 2008.

[25] Yunlian Jiang and Xipeng Shen. Speculation with little wasting: Saving cost in software speculation through transparent learning. Technical report, College of William & Mary, July 2009.

[26] Guy Korland, Nir Shavit, and Pascal Felber. Noninvasive Java concurrency with Deuce STM (poster). In SYSTOR ’09: The Israeli Experimental Systems Conference, 2009.

[27] Milind Kulkarni, Keshav Pingali, Bruce Walter, Ganesh Ramanarayanan, Kavita Bala, and L. Paul Chew. Optimistic parallelism requires abstractions. In PLDI ’07: Proceedings of the 2007 ACM SIGPLAN Conference on Programming Language Design and Implementation, pages 211–222, New York, NY, USA, 2007. ACM.


[28] Doug Lea. A Java fork/join framework. In JAVA ’00: Proceedings of the ACM 2000 Conference on Java Grande, pages 36–43, New York, NY, USA, 2000. ACM.

[29] Virendra J. Marathe and Michael L. Scott. A qualitative survey of modern software transactional memory systems. Technical report, University of Rochester, 2004.

[30] C. L. McCreary, A. A. Khan, J. J. Thompson, and M. E. McArdle. A comparison of heuristics for scheduling DAGs on multiprocessors. In Proceedings of the Eighth International Parallel Processing Symposium, pages 446–451, April 1994.

[31] J. Eliot B. Moss and Antony L. Hosking. Nested transactional memory: model and architecture sketches. Science of Computer Programming, 63(2):186–201, 2006.

[32] William N. Scherer, III and Michael L. Scott. Advanced contention management for dynamic software transactional memory. In PODC ’05: Proceedings of the twenty-fourth annual ACM symposium on Principles of distributed computing, pages 240–248, New York, NY, USA, 2005. ACM.

[33] Nir Shavit and Dan Touitou. Software transactional memory. In PODC ’95: Proceedings of the fourteenth annual ACM symposium on Principles of distributed computing, pages 204–213, New York, NY, USA, 1995. ACM.

[34] L. A. Smith and J. M. Bull. A parallel Java Grande benchmark suite. In Supercomputing ’01: Proceedings of the 2001 ACM/IEEE conference on Supercomputing (CDROM), page 8. ACM Press, 2001.

[35] Miloslaw Smyk. Jatmark Java performance benchmark. http://wfmh.org.pl/thorgal/jatmark/, 2008 (accessed October 13, 2011).

[36] Michael F. Spear, Virendra J. Marathe, William N. Scherer, III, and Michael L. Scott. Conflict detection and validation strategies for software transactional memory. In Proceedings of the 20th International Symposium on Distributed Computing, 2006.

[37] J. Steffan and T. Mowry. The potential for using thread-level data speculation to facilitate automatic parallelization. International Symposium on High-Performance Computer Architecture, 0:2, 1998.

[38] Chen Tian, Min Feng, Vijay Nagarajan, and Rajiv Gupta. Copy or discard execution model for speculative parallelization on multicores. In MICRO ’08: Proceedings of the 2008 41st IEEE/ACM International Symposium on Microarchitecture, pages 330–341, Washington, DC, USA, 2008. IEEE Computer Society.
