
Thesis Proposal:

Scheduling Parallel Functional Programs

Daniel Spoonhower

Committee: Guy E. Blelloch (co-chair)
           Robert Harper (co-chair)
           Phillip B. Gibbons
           Simon L. Peyton Jones (Microsoft Research)

(Revised June 13, 2007)

Abstract

Parallelism abounds! To continue to improve performance, programmers must use parallel algorithms to take advantage of multi-core and other parallel architectures. Existing declarative languages allow programmers to express these parallel algorithms concisely. With a deterministic semantics, a declarative language also allows programmers to reason about the correctness of programs independently of the language implementation. Despite this, the performance of these programs still relies heavily on the language implementation and especially on the choice of scheduling policy.

In this thesis, I propose to use a cost semantics to allow programmers to reason about the performance of parallel programs and in particular about their use of space. This cost semantics also provides a specification for the language implementation. In my previous work, I have formalized several implementations, including different scheduling policies, as small-step transition semantics. Incorporating these policies into the language semantics establishes a tight link between programs, scheduling policies, and performance. Using these semantics, I have shown that in some cases, the choice of scheduling policy has an asymptotic effect on memory use.

In my continuing work, I will consider extensions to my language and develop a full-scale implementation. With these, I hope to demonstrate that a declarative language is a practical way to program parallel algorithms and that my cost semantics offers an effective means to reason about their performance.

1 Introduction

The goal of this thesis is to show that declarative programming languages are an effective means to express parallel algorithms. Declarative languages relegate the low-level details of a parallel implementation to the language implementation, freeing programmers to seek out opportunities for parallel execution and to focus on ensuring program correctness. By abstracting away from the concrete aspects of the implementation and architecture, declarative languages facilitate the development of parallel programs. Declarative programs can also be realized on a variety of parallel architectures.

Understanding the behavior of programs in light of these different implementations, however, requires a clear description of the language semantics. In this thesis, I advocate for a semantics based on a deterministic model of parallel execution. Under such a semantics, a parallel program will always yield the same result regardless of the underlying architecture and language implementation. Thus parallelism is merely a means to achieve good performance and can be ignored for the purposes of reasoning about program correctness.


Functional programming provides a good foundation for expressing parallel algorithms because it offers a natural means to achieve a deterministic semantics. The lack of side effects ensures that the behavior of each parallel task can be determined independently of other tasks. The language implementation can interleave and evaluate parallel tasks in an arbitrary order without affecting the results.

Despite this, the implementation of a functional programming language cannot be ignored entirely. While one can analyze the running time and space use of functional programs, subtle aspects of the language implementation can render these analyses meaningless. For example, it is well known that naïve implementations of sequential functional languages can asymptotically increase the space complexity of some programs (e.g., Shao and Appel [1994], Gustavsson and Sands [1999]).

The performance of parallel functional programs often hinges upon the scheduling policy, the mapping of parallel tasks to physical processors. In this thesis, I focus on how the scheduling policy affects the space use of parallel programs. While there is a wealth of research on different scheduling policies and their effects on memory usage (e.g., Blumofe and Leiserson [1998], Blelloch et al. [1999], Narlikar and Blelloch [1999]), I plan to study scheduling from the point of view of the language implementation. This requires a careful analysis of the compilation of functional programs as well as a close integration of the scheduling policy with the parallel semantics.

There has also been significant work on semantics for high-level, parallel languages (e.g., Hudak and Anderson [1987], Roe [1991], Blelloch and Greiner [1995], Aditya et al. [1995], Greiner and Blelloch [1999]), but none of it has attempted to capture implementation details such as scheduling policy. In only one instance [Blelloch and Greiner, 1996] has this work attempted to reason about the space use of programs. In this thesis, I propose to fill this gap between scheduling policy and language semantics by designing a set of semantics that each describe the behavior of a particular policy. Furthermore, by building scheduling into the language definition, I will provide a means to reason about the performance of functional parallel programs.

1.1 Overview

Before giving my thesis statement, I briefly review several key topics relevant to this thesis.

Functional Parallel Languages In the bulk of this proposal, I consider a pure functional language, i.e., a language without side effects. This language allows a form of fork-join parallelism where the elements of pairs and arrays may be computed in parallel. Due to the lack of effects, each element may be evaluated independently of any of the others. Thus every parallel execution of a program will yield the same result: choosing among different parallel implementations is a matter of performance and will not affect program correctness.
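As a small illustration of this style (a hypothetical sketch, not an example from the proposal), the two recursive calls below are placed in a parallel pair { e1, e2 } and so may be evaluated in parallel:

(* a parallel Fibonacci: both recursive calls may run in parallel *)
fun fib n =
  if n < 2 then n
  else
    let val (a, b) = { fib (n - 1), fib (n - 2) }
    in a + b end

Because neither call has side effects, every interleaving or parallel execution of the two calls yields the same result.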

Parallel Scheduling In a parallel language such as this, there are often many more opportunities for parallel execution than can be accommodated by the underlying hardware platform. The way in which parallel tasks are prioritized can have a dramatic, and even asymptotic, effect on performance. A scheduling policy determines the priorities of parallel tasks and assigns tasks to physical resources. This assignment may be determined on-line by part of the language runtime implementation, or it may be determined off-line as part of a compiler transformation.

Cost Semantics To provide a means of comparing the space use of these different implementations, I use a cost semantics. A cost semantics is a refinement of the usual notion of a dynamic semantics that distinguishes programs based on their intensional behavior. It yields not only the result of evaluation, but also a measure of how this result is computed. My cost semantics assigns an abstract cost to each program in the source language. This cost is parameterized in such a way that it can be used to analyze the use of space under different scheduling policies.


Provable Implementations While this cost semantics allows the programmer to compare behavior of different source programs, these comparisons are only meaningful if the language implementation reflects these costs. Thus a cost semantics also acts as a specification for the implementation. Any implementation that meets this specification is called a provable implementation. In this thesis proposal, I describe implementations of several scheduling policies. In each case, I prove a correspondence between the analysis of the abstract cost and the behavior of the implementation.

The aim of this thesis is to show that together, these techniques can be effective tools for expressing and implementing parallel algorithms.

Thesis Statement A cost semantics provides a powerful and elegant means to reason about the use of space in parallel functional programs and to guide provably space-efficient implementations, including parallel scheduling policies.

To substantiate this thesis, I will build upon my previous work on cost semantics and scheduling. I will show how parallel algorithms can be expressed in a high-level language and how this language can be implemented efficiently. My continuing work will focus on three areas.

1. Language Extensions – While I have formalized costs and implementation for a purely functional parallel language, many algorithms require effectful language features such as mutable state. I will consider how these features can be integrated into my framework.

2. Implementation – While I have given several implementations of my language, these implementations are still relatively abstract. It remains to build a concrete implementation that executes programs on parallel hardware.

3. Applications – To witness the potential of my work, I will implement several parallel algorithms in my functional language and analyze the performance of these implementations using both my cost semantics and empirical methods.

In the remainder of this proposal, I give a more detailed motivation for this thesis, a survey of related work, and an account of my previous work. This work consists of two parts: first, a cost semantics that allows a precise analysis of space use, and second, a set of language implementations that each embody a different scheduling policy. Throughout this exposition, I use a series of examples to demonstrate this analysis and illustrate the effects of scheduling on the use of space. Finally, I conclude with specific plans for the completion of this thesis.

1.2 Motivation

Parallel computing has become ubiquitous. While parallel architectures were once used predominantly in scientific simulation and other high-performance applications, parallelism is now considered to be the primary means to improve the performance of commodity microprocessors. Today’s laptops, workstations, and gaming consoles boast at least two and as many as eight cores per processor. Intel predicts that within five years it will deliver chips with 80 cores [Intel].

While it is conceivable that a handful of these cores might be occupied by different applications, the amount of available parallelism will quickly outstrip the number of applications typically run by users. To continue to scale with advances in hardware performance, every desktop application must become a parallel program.

The number of applications will not continue to increase at the same rate as the number of cores and processors, but the amount of data manipulated by users certainly will: personal computers are becoming massive repositories for video, sound, photographs, communication, and other text. To keep pace with the number of processors, applications must process these data in parallel.

Three-dimensional rendering is an example of an application where parallelism has been used to achieve better performance. Graphics processors are designed precisely to take advantage of the natural parallelism of 3D rendering, and programs use this parallelism to render more pixels and more realistic images. Modern graphics cards now offer a tremendous amount of parallel computing power, reaching into the 100s of GFLOPS [Owens et al., 2007]. While graphics processors are somewhat limited in what kinds of calculations they can perform, there is increasing interest in finding additional uses for these computational resources.

To take advantage of this wide range of parallel platforms, including multi-core processors, multi-processor systems, and graphics processors, programs must be written in a platform-independent language. I claim that any such parallel language must be declarative and give only a high-level description of where parallel evaluation may occur. Once details of the input data and platform have been established at runtime, the language implementation can determine the exact mapping of parallel tasks to processors or processing elements.

While elements of the input data may be processed concurrently, it is still sensible to ask, what would the result be if these data were processed sequentially? This question is critical as it defines the means by which programmers can reason about the behavior of parallel programs. It is also a significant constraint on any parallel implementation. It stipulates that evaluation is deterministic, that every parallel implementation must yield the same result.

While determinism is sufficient to reason about the correctness of a parallel application, it does not tell us anything about program performance. Nor does it allow us to compare the performance of a program with respect to different language implementations. A cost semantics provides a vehicle by which we can make these comparisons. Using my cost semantics, I have shown several examples where the choice of scheduling policy has a dramatic, and even asymptotic, effect on the space usage of parallel programs.

In the remainder of this section, I present a small example and give an informal analysis of its space use. In subsequent sections, I will develop the formalisms needed to carry out this analysis more rigorously.

Example: Quicksort Consider the following implementation of quicksort where the input list is partitioned and each half is sorted in parallel. In this work, the components of a pair { e1, e2 } may be evaluated in parallel.

fun qsort xs =
  case xs of
    nil ⇒ nil
  | [x] ⇒ [x]
  | x::_ ⇒ append {qsort (filter (le x) xs),
                   qsort (filter (gt x) xs)}

Note that this is a persistent version of quicksort: the original list is preserved and each intermediate list is freshly allocated. While this code does not necessarily represent the most efficient persistent implementation, it certainly is one reasonable possibility. As such, we would like to understand its performance across a set of language implementations.
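For completeness, the auxiliary functions assumed by qsort might be defined as follows (a minimal sketch; the proposal does not give these definitions). Note that append consumes the parallel pair produced by the two recursive calls:

(* curried comparisons against the pivot x *)
fun le x y = y <= x    (* keep elements no greater than x *)
fun gt x y = y > x     (* keep elements strictly greater than x *)

(* persistent list operations: each allocates a fresh result list *)
fun filter p nil = nil
  | filter p (y::ys) = if p y then y :: filter p ys else filter p ys

fun append (nil, ys) = ys
  | append (x::xs, ys) = x :: append (xs, ys)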

Figure 1 shows an upper bound on the use of space as a function of the input size. Each line represents a variation in the number of processors or in the scheduling policy used to prioritize parallel tasks. I will describe these policies (“depth-first” and “breadth-first”) in greater detail below. Note here only that the first two configurations consume space as a polynomial function of the input size, while the latter two require only linear space. Also, note that these plots all represent the execution of the same program, each with a different implementation of the language.

I briefly draw a connection with another important component of a language implementation, the garbage collector. Programmers who use garbage-collected languages must consider the effect of the collector on application performance. Switching between collector algorithms or changing the configuration of a given algorithm can have a significant impact on end-to-end performance. No single collector algorithm is appropriate for all applications.

In much the same way, I claim that programmers who desire to understand the performance of parallel applications must consider the effect of the scheduling policy. In the sequel, I will give examples that demonstrate that no single policy is best for all applications.


Figure 1 Space Use as a Function of Input Size. This figure shows an upper bound on the space required to sort a list as a function of input size. Each configuration differs in the number of processors or the scheduling policy used to prioritize parallel tasks. These policies will be described in Section 4.

[Plot: space high-water mark versus input size, with curves for 1 PE depth-first, 2 PE depth-first, 3 PE depth-first, and 1–3 PE breadth-first.]

2 Related Work

Parallel Languages Interest in side effect-free parallel programming began decades ago with languages such as Val [McGraw, 1982], Id [Arvind et al., 1989], and Sisal [Feo et al., 1990]. Like the current proposal, this line of research argues that a deterministic semantics is essential in writing correct parallel programs and that a language without side effects is a natural means to achieve such a semantics.

Several researchers have considered speculative evaluation as a means to achieve better performance in functional programming (e.g., Hudak and Anderson [1987], Aditya et al. [1995], Blelloch and Greiner [1995], Greiner and Blelloch [1999]). Roe [1991] advocates for a form of explicit parallelism in a functional language and also gives an abstract performance analysis of functional parallel programs. While most of his work focuses on a call-by-need language, he also analyzes the performance of programs in a language with a strict form of parallelism. Efficient implementations of strict, nested parallelism were considered by Blelloch and his collaborators [Blelloch, 1990, Blelloch et al., 1994]. This work was later extended to a higher-order language with a rich set of types [Chakravarty and Keller, 2000, Lechtchinsky et al., 2006].

Baker-Finch et al. [2000] formalized the semantics for the core of a lazy parallel language, GpH. This semantics is stratified into sequential and parallel components. While alternative evaluation strategies are considered within this framework (e.g., fully speculative), only a single, non-deterministic scheduler is described. In GpH, evaluation strategies [Trinder et al., 1998] allow a functional programmer to explicitly control the structure of parallel evaluation (e.g., divide-and-conquer, collection-oriented parallelism).

Cost Semantics Non-standard semantics were used by both Rosendahl [1989] and Sands [1990] to automatically derive bounds on the time required to execute programs. In both cases, the semantics yields a cost that approximates the intensional behavior of programs. Cost semantics are also used to describe provably efficient implementations of speculative parallelism [Greiner and Blelloch, 1999] and explicit data-parallelism [Blelloch and Greiner, 1996]. These cost semantics yield directed, acyclic graphs that describe the parallel dependencies, just as the computation graphs in this proposal. Execution time on a bounded parallel machine is given in terms of the (parallel) depth and (sequential) work of these graphs. In one case [Blelloch and Greiner, 1996], an upper bound on space use is also given in terms of depth and work. This work was later extended to a language with recursive product and sum types [Lechtchinsky et al., 2002].

Ennals [2004] uses a cost semantics to compare the work performed by a range of sequential evaluation strategies, ranging from lazy to eager. Like the current proposal, he also uses cost graphs with distinguished types of edges, though his edges serve different purposes. He does not formalize the use of space by these different strategies. Gustavsson and Sands [1999] also use a cost semantics to compare the performance of sequential, call-by-need programs. They give a semantic definition of what it means for a program transformation to be “safe for space” [Shao and Appel, 1994] and provide several laws to help prove that a given transformation does not asymptotically increase the space use of programs. To prove the soundness of these laws, they use a semantics that yields the maximum heap and stack space required for execution. In the context of a call-by-value language, Minamide [1999] showed that a CPS transformation is space efficient using a cost semantics.

Jay et al. [1997] describe a static framework for reasoning about the costs of parallel execution using a monadic language. Static cost models have also been used to automatically choose a parallel implementation during compilation based on hardware performance parameters [Hammond et al., 2003] and to inform the granularity of scheduling [Loidl and Hammond, 1996, Portillo et al., 2002]. Unlike this proposal, the latter work focuses on how the size of program data structures affects parallel execution (e.g., through communication costs), rather than how different execution models affect the use of space at a given point in time.

Scheduling Within the algorithms community there has been significant research on the effect of scheduling policy on memory usage [Blumofe and Leiserson, 1998, Blelloch et al., 1999, Narlikar and Blelloch, 1999]. In that work, it has been shown that different schedules can have an exponential difference in memory usage with only two processors.

3 Cost Semantics

In this section, I will describe a cost semantics for this language. Recall that a cost semantics allows us to distinguish programs based not only on their final results, but also based on how those results are computed. Like previous work, the cost semantics in this proposal is a dynamic semantics. Thus, it yields results only for closed expressions, i.e., for a given program over a particular input. Just as in ordinary performance profiling, we must run a program over a series of inputs before we can generalize its behavior.

The dynamic nature of the cost semantics has several implications. It means that extensions to the source language are quite straightforward. For example, adding recursive functions requires only minimal changes to the theorems and proofs in this proposal. Unlike a static analysis, which must compute a fixed point of the behavior for each recursive function, the cost semantics must only account for the dynamic behavior of these functions. This amounts to tracking an additional binding, and ensuring that the space costs associated with this binding are accurately accounted for. The fact that the binding is recursive is immaterial to my analysis.

While this framework can easily accommodate possibly non-terminating programs, it gives no information about individual program instances that diverge. While this is certainly a limitation of my framework, it is also a limitation of current practice using heap profilers [Runciman and Wakeling, 1993]. Furthermore, this is not a significant issue with respect to the implementation of parallel algorithms (as opposed to web servers and other interactive applications) since we are only interested in instances of program execution that actually yield results.

Consider a call-by-value functional language extended with parallel pairs. The language may easily be extended with other primitive types, recursive functions, and arrays, but I elide these for the sake of clarity. The syntax of source expressions is shown below.

(expressions) e ::= x | λx.e | e1 e2 | {e1, e2} | πi e

Remark. The static semantics of this language is completely standard. I will omit a definition of typing, and even the definitions of terms such as type safety, and I trust the reader to assume their conventional meanings.

The cost semantics is given in terms of the following semantic objects.

(values)        v ::= ⟨η; x.e⟩^ℓ | ⟨v1, v2⟩^ℓ
(locations)     ℓ ∈ L
(environments)  η ::= · | η, x ↦ v


Figure 2 Profiling Cost Semantics. In addition to a result, this semantics also yields two graphs that can be used to reconstruct the cost of obtaining that result. Computation graphs g record dependencies in time while heap graphs h record dependencies among values. The substitution [η]e of the values bound in η for the variables appearing in e is defined in Appendix A.2.

η . e ⇓ v; g; h

  ────────────────────────────────────────────────────────────  (E-Fn, ℓ fresh)
  η . λx.e ⇓ ⟨η; x.e⟩^ℓ; [ℓ]; {(ℓ, ℓ′) | ℓ′ ∈ locs([η](x.e))}

  (x ↦ v) ∈ η
  ──────────────────────────────  (E-Var, n fresh)
  η . x ⇓ v; [n]; {(n, loc(v))}

  η1 . e1 ⇓ ⟨η2; x.e3⟩^ℓ1; g1; h1    η1 . e2 ⇓ v2; g2; h2    η2, x ↦ v2 . e3 ⇓ v3; g3; h3
  ──────────────────────────────────────────────────────────────────────────────  (E-App, n fresh)
  η1 . e1 e2 ⇓ v3; g1 ⊕ g2 ⊕ [n] ⊕ g3; h1 ∪ h2 ∪ h3 ∪ {(n, ℓ1), (n, loc(v2))}

  η . e1 ⇓ v1; g1; h1    η . e2 ⇓ v2; g2; h2
  ──────────────────────────────────────────────────────────────────────  (E-Pair, ℓ fresh)
  η . {e1, e2} ⇓ ⟨v1, v2⟩^ℓ; g1 ⊗ g2 ⊕ [ℓ]; h1 ∪ h2 ∪ {(ℓ, loc(v1)), (ℓ, loc(v2))}

  η . e ⇓ ⟨v1, v2⟩^ℓ; g; h
  ──────────────────────────────────────  (E-Proji, n fresh)
  η . πi e ⇓ vi; g ⊕ [n]; h ∪ {(n, ℓ)}

In this semantics, I maintain a distinction between expressions and values. Values are also annotated with locations so that sharing can be made explicit without resorting to an explicit heap: the syntax distinguishes between a pair whose components occupy the same space and one whose components do not. This will allow us to draw a tight connection between this semantics and the behavior of an implementation, in particular, with respect to its use of space.

The cost semantics (Figure 2) is an evaluation semantics that computes both the result of the computation and an abstract cost reflecting how the result was obtained. While the semantics is sequential, the cost will allow us to reconstruct different parallel schedules and reason about the space use of programs executing with these schedules. The judgment

η . e ⇓ v; g; h

is read, in environment η, expression e evaluates to value v with computation graph g and heap graph h. The extensional portions of this judgment are completely standard in the way they relate expressions to values. As discussed below, edges in a computation graph represent control dependencies in the execution of a program, while edges in a heap graph represent dependencies on and between values.

Computation Graphs The first part of the cost associated with each program, the computation graph, is a directed, acyclic graph. Each node in the computation graph represents the evaluation of a sub-expression, and edges represent dependencies between sub-expressions. Edges in the computation graph point forward in time: an edge from node n1 to node n2 indicates that n1 must be executed before n2.

Each computation graph has one source node (with in-degree zero) and one sink node (with out-degree zero), i.e., computation graphs are directed series-parallel graphs. Each such graph consists of a single node, or of the sequential or parallel composition of smaller graphs. Nodes are denoted ℓ and n (and variants). Graphs are written as tuples such as (ns; ne; E) where ns is the source or start node, ne is the sink or end node, and E is a list of edges. The remaining nodes of the graph are implicitly defined by the edge list. Single-node graphs and graph operations are defined below. In the diagrams in the original figure, nodes are represented as circles while arbitrary graphs are represented as diamonds, and time flows downward.

Single node:
  [n] = (n; n; ε)

Serial composition:
  (ns; ne; E) ⊕ (n′s; n′e; E′) = (ns; n′e; E, E′, (ne, n′s))

Parallel composition (n, n′ fresh):
  (ns; ne; E) ⊗ (n′s; n′e; E′) = (n; n′; E, E′, (n, ns), (n, n′s), (ne, n′), (n′e, n′))

Heap Graphs The second part of the cost associated with each program, the heap graph, is also a directed, acyclic graph. Unlike computation graphs, heap graphs do not have distinguished start or end nodes. Each node again represents the evaluation of a sub-expression. While edges in the computation graph point forward in time, edges in the heap graph point backward in time. Edges represent a dependency on a value: if there is an edge from n to ℓ then n depends on the value at location ℓ. It follows that any space associated with ℓ cannot be reclaimed until after n has executed and any space associated with n has also been reclaimed.

Each heap graph shares nodes with the computation graph arising from the same execution. In a sense, computation and heap graphs may be considered as two sets of edges on a shared set of nodes. As above, the nodes of heap graphs are left implicit.

Edges in the heap graph record both the dependencies among values as well as dependencies on values by other parts of the program state. As an example of the first case, in the evaluation rule for pairs, two edges are added to the heap graph to represent the dependencies of the pair on each of its components. Thus, if the pair is reachable, so is each component. In the evaluation of a function application, however, two edges are added to express the use of values. The first such edge marks a use of the function. The second edge denotes a possible last use of the argument. For strict functions, this second edge is redundant: there will be another edge leading to the argument when it is used. However, for non-strict functions, this is the first point at which the garbage collector might reclaim the space associated with the argument.

Consider the rule describing the evaluation of pairs, E-Pair. The cost graphs for this rule are displayed here.

[Figure: cost graphs for E-Pair.]

Arrows in the computation graph point downward. This graph consists of two subgraphs (one for each component) and three additional nodes. The first node, at the top, represents the cost of forking a new parallel computation, and the second node, in the middle, represents the cost of joining these parallel threads. The final node represents the cost of the allocation of the pair. There are two heap edges (pointing upward and in bold) shown in the graph, representing the dependency of the pair on each of its components. Note that these components need not have been allocated as the final step in either sub-graph.


3.1 Schedules

Together, the computation and heap graphs allow a programmer to analyze the behavior of her program under a variety of hardware and scheduling configurations. A key component of this analysis is the notion of a schedule. Each schedule describes one possible parallel execution of the program and records which parallel tasks are executed at each time step. Every schedule must obey the constraints described by the computation graph g.

Definition (Schedule). A schedule of a graph g = (ns; ne; E) is a sequence of sets of nodes N0, . . . , Nk such that ns ∉ N0, ne ∈ Nk, and for all i ∈ [0, k),

• Ni ⊆ Ni+1, and
• for all n ∈ Ni+1, predg(n) ⊆ Ni.

Here predg(n) is the set of nodes n′ such that there is an edge in g of the form (n′, n).

It will also be convenient to distinguish those nodes which are executed in a given time step from those that have executed in previous steps.

Definition (Executed Nodes). Given a schedule N0, . . . , Nk, the nodes executed at each step, E1, . . . , Ek, are defined as Ei = Ni \ Ni−1 for i ∈ [1, k].

For a schedule of a graph g, the sequence of sets of executed nodes corresponds to a pebbling [Hopcroft et al., 1977] of g.
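Both definitions are easy to check mechanically. The sketch below (list-based sets over the graph representation given earlier; the helper names are mine) tests the two schedule conditions and computes the executed nodes:

fun mem x xs = List.exists (fn y => y = x) xs
fun subset (xs, ys) = List.all (fn x => mem x ys) xs

(* predecessors of n in the edge list es *)
fun preds es n =
  List.mapPartial (fn (a, b) => if b = n then SOME a else NONE) es

(* the two conditions relating consecutive sets Ni and Ni+1 *)
fun okStep es (ni, ni1) =
  subset (ni, ni1) andalso
  List.all (fn n => subset (preds es n, ni)) ni1

(* executed nodes: Ei+1 = Ni+1 \ Ni *)
fun executed (ni, ni1) = List.filter (fn n => not (mem n ni)) ni1

A full validity check would additionally confirm the endpoint conditions on the source and sink nodes.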

3.2 Roots

To understand the use of space, the programmer must also account for the structure of the heap graph h. Given a schedule N0, . . . , Nk for an associated graph g, consider the moment of time represented by some Ni. Because Ni contains all previously executed nodes and because edges in h point backward in time, each edge (n, ℓ) in h will fall into one of the following three categories.

• Both n, ℓ ∉ Ni. In this case, since the value associated with ℓ has not yet been allocated, the edge (n, ℓ) does not contribute to the use of space at time i.

• Both n, ℓ ∈ Ni. While the value associated with ℓ has been allocated, the use of this value represented by this edge is also in the past. Again, the edge (n, ℓ) does not contribute to the use of space at time i.

• ℓ ∈ Ni, but n ∉ Ni. In this case, the value associated with ℓ has already been allocated, and n represents one possible use in the future. Here, the edge (n, ℓ) does contribute to the use of space at time i.

This leads us to the following definition.

Definition (Roots). The roots of a heap graph h with respect to a location ℓ after evaluation of the nodes in N, written rootsℓ,h(N), is the set of nodes ℓ′ in N where ℓ′ = ℓ or h contains an edge leading from outside N to ℓ′. Symbolically,

rootsℓ,h(N) = {ℓ′ ∈ N | ℓ′ = ℓ ∨ (∃n. (n, ℓ′) ∈ h ∧ n ∉ N)}

I use the term roots to evoke a related concept from work in garbage collection. For the reader who is most comfortable thinking in terms of an implementation, the roots might correspond to those memory locations that are reachable directly from the processor registers or the call stack. This is just one possible implementation, however: the computation and heap graphs stipulate only the behavior of an implementation. This connection will be made precise in the next section.
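Read operationally, the definition is a filter over N; a direct (hypothetical) transcription, reusing mem from the sketch in Section 3.1:

(* roots of heap graph h (an edge list) with respect to location l,
   after the nodes in bigN have been executed *)
fun roots l h bigN =
  List.filter
    (fn l' => l' = l orelse
              List.exists (fn (n, l'') => l'' = l' andalso not (mem n bigN)) h)
    bigN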


4 Provable Implementations

While the evaluation semantics given in the previous section allows a programmer to draw conclusions about the performance of her program, these conclusions would be meaningless if the implementation of the language did not reflect the costs given by that semantics. In this section, I define several provable implementations [Blelloch and Greiner, 1996] of this language, each as a transition (small-step) semantics. The first is a non-deterministic semantics that defines all possible parallel executions. Each subsequent semantics will define the behavior of a particular scheduling algorithm. The following table gives a brief overview of all the semantics used in this proposal.

Semantics (Figure)     Style       Judgment(s)           Notes
Cost (2)               big-step    η . e ⇓ v; g; h       sequential, profiling semantics
Primitive (3)          small-step  e −→ e′               axioms shared among parallel implementations
Non-deterministic (4)  small-step  e −nd→ e′, d −nd→ d′  defines all possible parallel executions
Depth-first (5)        small-step  e −df→ e′, d −df→ d′  algorithmic implementation favoring left-most subexpressions

As part of the implementation of this language, I extend the syntax to include a parallel let construct. This construct is used to denote expressions whose parallel evaluation has begun but not yet finished. Declarations within a let par may step in parallel, depending on the constraints enforced by one of the transition semantics below. Declarations and let par expressions reify a stack of expression contexts such as those that appear in many abstract machines (e.g., [Landin, 1964]). Unlike a stack, which has exactly one topmost element, there are many “leaves” in our syntax that may evaluate in parallel. We also include values within the syntax of expressions so that substitution will be well-defined. These extensions are shown below.

(expressions)         e ::= . . . | let par d in e | v
(declarations)        d ::= x = e | d1 and d2
(value declarations)  δ ::= x = v | δ1 and δ2

As this semantics is defined using substitution, closures will always be trivial: the environment of every closure in this semantics will be empty. I use ⟨x.e⟩^ℓ as an abbreviation for ⟨·; x.e⟩^ℓ.

To facilitate the definition of several different parallel semantics, I first factor out those parts of the semantics that are common to each variation. These primitive sequential transitions are defined by the following judgment.

e −→ e′

This judgment represents the step taken by a single processor in one unit of time (e.g., allocating a pair, applying a function). Primitive transitions are defined by the axioms in Figure 3. These axioms limit where parallel evaluation may occur by defining the intermediate forms for the evaluation of pairs and function application. When exactly parallel evaluation occurs is defined by the scheduling semantics, as given in the remainder of this section.
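As an illustration, a parallel pair of non-value components steps through the intermediate let par form as follows (the elided middle steps, which evaluate e1 and e2 to values v1 and v2, are governed by a scheduling semantics):

{e1, e2} −→ let par x1 = e1 and x2 = e2 in {x1, x2}    (P-Fork)
    ⋮
let par x1 = v1 and x2 = v2 in {x1, x2} −→ {v1, v2}    (P-Join)
{v1, v2} −→ ⟨v1, v2⟩^ℓ                                 (P-Pair, ℓ fresh)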

4.1 Non-Deterministic Scheduling

The first implementation in this proposal is a non-deterministic (nd) transition semantics that defines all possible parallel executions. Though this semantics itself does not serve as a model for a realistic implementation, it is a useful tool in reasoning about other, more realistic, semantics. The non-deterministic semantics is defined by a pair of judgments

e −nd→ e′        d −nd→ d′


Figure 3 Primitive Transitions. These rules encode transitions where no parallel execution is possible. They will be used in each of the different scheduling semantics that follow in this section. The substitution of a value declaration into an expression [δ]e is defined in Appendix A.2.

e −→ e′

  λx.e −→ ⟨x.e⟩^ℓ   (P-Fn, ℓ fresh)

  e1 e2 −→ let par x1 = e1 in let par x2 = e2 in x1 x2
      (P-App, x1, x2 fresh and e1, e2 not values)

  ⟨x.e⟩^ℓ v2 −→ [v2/x]e   (P-AppBeta)

  {v1, v2} −→ ⟨v1, v2⟩^ℓ   (P-Pair, ℓ fresh)

  πi e −→ let par x = e in πi x   (P-Proji, x fresh and e not a value)

  πi ⟨v1, v2⟩^ℓ −→ vi   (P-ProjiBeta)

  {e1, e2} −→ let par x1 = e1 and x2 = e2 in {x1, x2}
      (P-Fork, x1, x2 fresh and e1, e2 not values)

  let par δ in e −→ [δ]e   (P-Join)

These judgments state that expression e takes a single parallel step to e′ and, similarly, that declaration d takes a single parallel step to d′. This semantics allows unbounded parallelism: it models execution on a parallel machine with an unbounded number of processors. It is defined by the rules in Figure 4.

Most of the nd rules are straightforward. The only non-determinism lies in the application of the rule nd-Idle. In a sense, this rule is complemented by nd-Branch: the latter says that all branches may be executed in parallel, but the former allows any sub-expression to sit idle during a given parallel step.

4.1.1 Extensional Behavior

Though this implementation is non-deterministic in how it schedules parallel evaluation, the result of nd evaluation will always be the same, no matter which expressions evaluate in parallel. This statement is formalized in the following theorem. (In this and other results below, we always consider equality up to the renaming of locations.)

Theorem 1 (Confluence). If e −nd→* e′ and e −nd→* e′′ then there exists an expression e′′′ such that e′ −nd→* e′′′ and e′′ −nd→* e′′′. Similarly, if d −nd→* d′ and d −nd→* d′′ then there exists a declaration d′′′ such that d′ −nd→* d′′′ and d′′ −nd→* d′′′.

Proof. As usual, this follows from the “diamond” property as shown in Lemma 1 below.

Lemma 1. If e −nd→ e′ and e −nd→ e′′, then there exists an expression e′′′ such that e′ −nd→ e′′′ and e′′ −nd→ e′′′. Similarly, if d −nd→ d′ and d −nd→ d′′, then there exists a declaration d′′′ such that d′ −nd→ d′′′ and d′′ −nd→ d′′′.

Proof. By induction on the derivations of e −nd→ e′ and d −nd→ d′.

Case nd-Idle: In this case e′ = e. Assume that e −nd→ e′′ was derived using rule R. Let e′′′ = e′′. Then we have e −nd→ e′′ (by applying R) and e′′ −nd→ e′′ (by nd-Idle), as required.

As all of the non-determinism in this semantics is focused in the use of the nd-Idle rule, the remaining cases follow from the immediate application of the induction hypothesis.


Figure 4 Non-Deterministic Parallel Transition Semantics. This semantics defines all possible parallel transitions of an expression, including those that take an arbitrary number of primitive steps in parallel. Parallelism is isolated within transition expressions of the form let par. Declarations step in parallel using nd-Branch. Note that expressions (or portions thereof) may remain unchanged using the rule nd-Idle.

e −nd→ e′

  d −nd→ d′
  ─────────────────────────────────────  (nd-Let)
  let par d in e −nd→ let par d′ in e

  e −nd→ e   (nd-Idle)

  e −→ e′
  ──────────  (nd-Prim)
  e −nd→ e′

d −nd→ d′

  e −nd→ e′
  ───────────────────  (nd-Leaf)
  x = e −nd→ x = e′

  d1 −nd→ d′1    d2 −nd→ d′2
  ─────────────────────────────  (nd-Branch)
  d1 and d2 −nd→ d′1 and d′2

Before considering the intensional behavior of the parallel semantics, I prove several properties relating its behavior to that of the cost semantics. As such, I will temporarily ignore the cost graphs and write η . e ⇓ v if η . e ⇓ v; g; h for some g and h.

The first such property (Completeness) states that any result obtained using the cost semantics can also be obtained using the nd implementation. To relate these results, I define an embedding of values from the cost semantics (including closures with non-empty environments) into values in the implementation semantics. This embedding substitutes away any variables bound in a closure.

⌊⟨η; x.e⟩^ℓ⌋ = ⟨x.[⌊η⌋]e⟩^ℓ
⌊⟨v1, v2⟩^ℓ⌋ = ⟨⌊v1⌋, ⌊v2⌋⟩^ℓ

Environments η are embedded piecewise by applying the embedding to each component value.

Theorem 2 (nd Completeness). If η . e ⇓ v then [⌊η⌋]e −nd→* ⌊v⌋.

The proof is carried out by induction on the derivation of η . e ⇓ v and is shown in Appendix B.

The following theorem (Soundness) ensures that any result obtained by the implementation semantics can also be derived using the cost semantics. As my extensions to the source language given in this section represent runtime intermediate forms, I define an embedding of these new expression forms into the original syntax. Parallel let expressions are embedded by substitution, and declarations are mapped to environments.

⌈let par d in e⌉ = [⌈d⌉]⌈e⌉
⌈e⌉ = e   (e ≠ let par d in e′)
⌈x = e⌉ = x ↦ ⌈e⌉
⌈d1 and d2⌉ = ⌈d1⌉, ⌈d2⌉

I also define a vectorized form of the evaluation relation that evaluates several expressions simultaneously. It evaluates a list of variable-expression pairs to a list of variable-value pairs.

  η . · ⇓ ·

  η1 . η2 ⇓ η′2    η1 . e ⇓ v
  ──────────────────────────────────
  η1 . (η2, x ↦ e) ⇓ (η′2, x ↦ v)

Finally, I extend the evaluation relation to relate values to themselves. (This extension requires only trivial changes to the previous theorem, since every value steps to itself in zero steps.)

  η . v ⇓ v   (E-Val)


Theorem 3 (nd Soundness). If [η]e −nd→* v, then η . ⌈e⌉ ⇓ ⌊v⌋. Similarly, if [η]d −nd→* δ, then η . ⌈d⌉ ⇓ ⌊δ⌋.

Proof. By induction on the number of steps n in the sequence of transitions.

Case 0: In this case, e = v and therefore ⌈e⌉ = v. Since every value is related to itself under the evaluation relation, the case is proved. The same applies to d and δ.

Case n > 0: Thus [η]e −nd→ [η′]e′ and [η′]e′ −nd→* v. Inductively, we have η′ . ⌈e′⌉ ⇓ ⌊v⌋. The remainder of this case is given by the following lemma.

Lemma 2. If [η]e −nd→ [η′]e′, [η′]e′ −nd→* v, and η′ . ⌈e′⌉ ⇓ ⌊v⌋, then η . ⌈e⌉ ⇓ ⌊v⌋. Similarly, if [η]d −nd→ [η′]d′, [η′]d′ −nd→* δ, and η′ . ⌈d′⌉ ⇓ ⌊δ⌋, then η . ⌈d⌉ ⇓ ⌊δ⌋.

The proof is carried out by induction on the derivations of e −nd→ e′ and d −nd→ d′ and is shown in Appendix B.

4.1.2 Intensional Behavior

Having considered the extensional behavior of this implementation, I now turn to its intensional behavior. As we take the semantics to define all possible parallel executions, it should be the case that any schedule we derive from the cost semantics is implemented by a sequence of parallel steps, as defined by the transition relation. This statement is made precise in the following theorem.

Theorem 4 (Cost Completeness). If e ⇓ v; g; h and N0, . . . , Nk is a schedule of g then there exists a sequence of expressions e0, . . . , ek such that e0 = e and ek = ⌊v⌋ and for all i ∈ [0, k), ei −nd→ ei+1 and locs(ei) ⊆ rootsv,h(Ni).

Here I write rootsv,h(N) as an abbreviation for the roots of the location of that value, i.e., rootsloc(v),h(N). The locations of an expression, locs(e), are simply the locations of any values embedded in that expression. The location of a value is the outermost location of that value. Locations of expressions and values are defined in Appendix A.1.

The final condition of the theorem states that the use of space in the parallel semantics, as determined by locs(), is approximated by the measure of space in the cost graphs, as given by roots().

Theorem 5 (Cost Soundness). If e ⇓ v; g; h and e0, . . . , ek is a sequence of expressions such that e0 = e and ek = ⌊v⌋ and for all i ∈ [0, k), ei −nd→ ei+1, then there exists a schedule of g given by N0, . . . , Nk with locs(ei) ⊆ rootsv,h(Ni).

Both these theorems must be generalized to account for evaluation in a non-empty environment, but in both cases, they are proven much like their extensional counterparts above.

4.2 Depth-First Scheduling

I now define an alternative transition semantics that is deterministic and implements a depth-first schedule. Depth-first (df) schedules, defined below, prioritize the leftmost sub-expressions of a program and always complete the evaluation of these leftmost sub-expressions before proceeding to sub-expressions on the right. The semantics in this section implements a p-depth-first (p-df) scheduler, a scheduler that uses at most p processors. As a trivial example, a left-to-right sequential evaluation is equivalent to a one-processor or 1-df schedule.

Just as we defined the non-deterministic implementation as a transition relation, we can do the same for the depth-first implementation. The p-df transition semantics is defined on configurations p; e and p; d. These configurations describe an expression or declaration together with an integer p that indicates the number of processors that have not yet been assigned a task in this parallel step. At the root of the derivation of each parallel step, p will be equal to the total number of processors. Within a derivation, p may be smaller but not less than zero. The semantics is given by the following pair of judgments.

p; e −df→ p′; e′        p; d −df→ p′; d′


Figure 5 p-Depth-First Parallel Transition Semantics. This deterministic semantics defines a single parallel step for a left-to-right depth-first schedule using at most p processors. Configurations p; e and p; d describe expressions and declarations with p unused processors remaining in this time step.

p; e −df→ p′; e′

  p; d −df→ p′; d′
  ─────────────────────────────────────────────  (DF-Let)
  p; let par d in e −df→ p′; let par d′ in e

  p; v −df→ p; v   (DF-Val)

  0; e −df→ 0; e   (DF-None)

  e −→ e′
  ─────────────────────  (DF-Prim)
  p + 1; e −df→ p; e′

p; d −df→ p′; d′

  p; e −df→ p′; e′
  ───────────────────────────  (DF-Leaf)
  p; x = e −df→ p′; x = e′

  p; d1 −df→ p′; d′1    p′; d2 −df→ p′′; d′2
  ──────────────────────────────────────────  (DF-Branch)
  p; d1 and d2 −df→ p′′; d′1 and d′2

These judgments define a single parallel step of an expression or declaration. The first is read, given p available processors, expression e steps to expression e′ with p′ processors remaining unused. The second has an analogous meaning for declarations.

The p-df transition semantics is defined by the rules given in Figure 5. Most notable is the DF-Branch rule. It states that a parallel declaration may take a parallel step if any available processors are used first on the left sub-declaration and then any remaining available processors are used on the right. Like the non-deterministic semantics above, the p-df transition semantics relies on the primitive transitions given in Figure 3. In rule DF-Prim, one processor is consumed when a primitive transition is applied.
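The threading of the processor count through DF-Branch is the essential mechanism. Below is a hypothetical sketch of the declaration-level step (the expression type is simplified to a stub, and the handling of let par within expressions is elided):

(* simplified expressions: either a value or a redex that can take
   one primitive step, represented as a thunk *)
datatype exp = Value | Redex of unit -> exp
datatype dec = Leaf of string * exp | Branch of dec * dec

(* DF-None (p = 0), DF-Val, and DF-Prim: a primitive step costs one processor *)
fun stepExp (0, e) = (0, e)
  | stepExp (p, Value) = (p, Value)
  | stepExp (p, Redex k) = (p - 1, k ())

fun stepDec (p, Leaf (x, e)) =
      let val (p', e') = stepExp (p, e)          (* DF-Leaf *)
      in (p', Leaf (x, e')) end
  | stepDec (p, Branch (d1, d2)) =
      let val (p', d1') = stepDec (p, d1)        (* DF-Branch: left first, *)
          val (p'', d2') = stepDec (p', d2)      (* leftovers go to the right *)
      in (p'', Branch (d1', d2')) end

When the left branch cannot use all p processors, the remainder flows to the right branch, exactly as in the rule; with p = 1, this degenerates to left-to-right sequential evaluation.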

For the df semantics, we must reset the number of available processors after each parallel step. To do so, we define a “top-level” transition judgment for df evaluation with p processors. This judgment is defined by exactly one rule, shown below. Note that the number of processors remaining idle, p′, remains unconstrained.

  p; e −df→ p′; e′
  ─────────────────
  e −p-df→ e′

The complete evaluation of an expression, as for the non-deterministic semantics, is given by the reflexive, transitive closure of the transition relation −p-df→*.

We now consider several properties of the df semantics. First, unlike the non-deterministic implementation, this semantics defines a particular evaluation strategy.

Theorem 6 (Determinacy of df Evaluation). If p; e −df→ p′; e′ and p; e −df→ p′′; e′′ then p′ = p′′ and e′ = e′′. Similarly, if p; d −df→ p′; d′ and p; d −df→ p′′; d′′ then p′ = p′′ and d′ = d′′.

The proof is carried out by induction on the first derivation and hinges on the following two facts: first, that DF-None and DF-Val yield the same results when both apply, and second, that in no instance can both DF-Let and DF-Prim be applied.

We can easily show the df semantics is correct with respect to the cost semantics, simply by showing that its behavior is contained within that of the non-deterministic semantics.

Theorem 7 (df Soundness). If p; e −df→ p′; e′ then e −nd→ e′. Similarly, if p; d −df→ p′; d′ then d −nd→ d′.

Proof. By induction on the derivation of p; e −df→ p′; e′. Cases for derivations ending with rules DF-Let, DF-Leaf, and DF-Branch follow immediately from appeals to the induction hypothesis and analogous rules in the non-deterministic semantics. DF-Prim also follows from its analogue. Rules DF-None and DF-Val are both restrictions of nd-Idle.

It follows immediately that if e −p-df→ e′ then e −nd→ e′. This result shows the benefit of defining and reasoning about a non-deterministic semantics: once we have shown the soundness of an implementation with respect to the non-deterministic semantics, we get soundness with respect to the cost semantics for free. Thus, we know there is some schedule that accurately models the behavior of the df implementation. It only remains to pick out precisely which schedule does so.

To allow programmers to understand the behavior of this semantics, I define a more restricted form of schedule. For each computation graph g there is a unique p-df schedule. As shown below, these schedules precisely capture the behavior of the df implementation.

Definition (p-Depth-First Schedule). A p-depth-first schedule of a graph g is a schedule N0, . . . , Nk such that for all i ∈ [0, k),

• |Ei+1| ≤ p, and
• for all n ∈ Ni+1 and n′ ∈ Ni, n′ ≺ n,

and for any other schedule of g given by N′0, . . . , N′j that meets these constraints, n ∈ N′i ⇒ n ∈ Ni.

Here n′ ≺ n if n′ appears first in the list of edges in g. The final condition ensures that the depth-first schedule is the most aggressive schedule that respects this ordering of nodes: if it is possible to execute node n at time i (as evidenced by its membership in N′i) then any depth-first schedule must also do so.

Theorem 8 (df Completeness). If e ⇓ v; g; h and N0, . . . , Nk is a p-df schedule of g then there exists a sequence of expressions e0, . . . , ek such that e0 = e and ek = v and for all i ∈ [0, k), ei −p-df→ ei+1 and locs(ei) ⊆ rootsv,h(Ni).

This theorem must be generalized not only over arbitrary environments, but also over df schedules which may use a different number of processors at each time step. This allows a p-df schedule to be split into two df schedules that, when run in parallel, never use more than p processors in a given step. The proof hinges on the fact that any df schedule can be split in this fashion, and moreover, that it can be split so that the left-hand side is allocated all the processors (up to p) that it could possibly make use of.

4.3 Breadth-First Scheduling

Just as the semantics in the previous section captured the behavior corresponding to a depth-first pebbling of the computation graph, we can also give an implementation corresponding to a breadth-first (bf) pebbling. This is the most “fair” schedule in the sense that it distributes computational resources evenly across parallel tasks. For example, given four parallel tasks, a 1-bf scheduler alternates round-robin between the four. A 2-bf scheduler takes one step for each of the first two tasks in parallel, followed by one step for the second two, followed again by the first two, and so on.

I omit the presentation of this semantics and only state that a theorem making a precise correspondence between breadth-first schedules and this implementation, similar to that shown above for the depth-first case, can also be proved.

4.3.1 Example: Quicksort

We now return to the parallel implementation of quicksort described in Section 1.2. The plot shown in Figure 1 is derived from a direct implementation of the cost semantics in Standard ML. The computation and heap graphs are automatically analyzed to determine an upper bound on the total space required for each scheduling policy. This space is determined by the number of nodes in the heap graph reachable from the roots().
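Reachability in the heap graph can be computed by a standard graph search. A sketch, reusing mem from the sketch in Section 3.1:

(* all nodes of heap graph h (an edge list) reachable from the node list rs *)
fun reachable h rs =
  let fun succs n =
        List.mapPartial (fn (a, b) => if a = n then SOME b else NONE) h
      fun go (nil, seen) = seen
        | go (n :: work, seen) =
            if mem n seen then go (work, seen)
            else go (succs n @ work, n :: seen)
  in go (rs, nil) end

The space bound at each time step is then the number of nodes reachable from the roots at that step.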


Recall that the four configurations described in Figure 1 are (from top to bottom) a 1-df (i.e., sequential) schedule, a 2-df schedule, a 3-df schedule, and p-bf schedules for p ≤ 3. While the second requires less space than the first, both require space that is polynomial in the input size. The space used in the third and fourth configurations is a linear function of the input size. In all cases, we are considering the worst-case behavior for this example: sorting a list whose elements are in reverse order.

The p-df schedules for p ≤ 2 require more space because they begin to partition many lists (one for each recursive call) before the first such partitions are completed. In the worst case, each of these in-progress partitions requires O(N) space, and there are N such partitions.

The p-bf schedules avoid this asymptotic blowup in space use by completing all partitions at a given recursive depth before advancing to the next level of recursion. This implies that no matter how many partitions are in progress, only O(N) elements will appear in these lists. The p-df schedules for p > 2 achieve the same performance by exhausting all available parallelism and, in effect, simulating the breadth-first schedules.

Figure 6(a) shows an example of cost graphs for the quicksort example. Both the computation and heap graphs are “distilled” to reveal their essential characteristics. For example, a long sequence of sequential computation is represented as a single node whose size is proportional to the length of that sequence. Heap edges between two composite nodes are weighted by the size of all transitively reachable objects.

4.3.2 Example: Numerical Integration

Our second example concerns the numerical integration of the following degree-five univariate polynomial:

f(x) = (x + 3) × (x − 10) × (2x − 20) × (3x − 24) × (4x + 15)

We approximate the integral between ±10 using the adaptive rectangle rule. This algorithm computes the area under f as a series of rectangles, using narrower rectangles in regions where f seems to be undergoing the most change. Whenever the algorithm splits an interval, the approximation of each half is computed in parallel. Note that the parallel structure of this example is determined not by the size or shape of a data structure, but instead by the behavior of the function f. Increasing the amount of available parallelism allows us to calculate better approximations in a given period of time. The code for this algorithm is shown below.

fun integrate f (a, b) =
    let
      val mid = (a + b) / 2.0
      val xdif = b - a
      val ydif = (f b) - (f a)
    in
      if withinTolerance (xdif, ydif)
      then (* approximate *)
        (f mid) * xdif
      else (* divide and recur *)
        let
          val (l, r) = {integrate f (a, mid),
                        integrate f (mid, b)}
        in
          l + r
        end
    end

Here, withinTolerance is a predicate that determines whether a rectangular approximation is sufficiently accurate; a minimal sketch is given below. Figure 7 shows an upper bound on the space required by this program. Each data point represents an execution with the same input but with a different number of processors.

Unlike the quicksort example, the depth-first schedule requires far less space for a small number of processors.
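One minimal realization of withinTolerance, assuming a fixed absolute tolerance (both the constant and the particular test are illustrative choices of mine, not taken from the text):

(* Illustrative threshold; the proposal does not fix one. *)
val tolerance = 0.001

(* Accept the rectangle when the interval is narrow enough that the
   change in f across it contributes little area. *)
fun withinTolerance (xdif : real, ydif : real) =
  Real.abs (xdif * ydif) < tolerance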


Figure 6 Cost Graphs for Quicksort. Summarized computation and heap graphs for the examples in (a) Section 1.2 and (b) Section 4.3.3. Both represent evaluations of qsort [4,3,2,1]. Graphs are “distilled” to reveal their essential characteristics. The graph on the right shows that parallel evaluation is more restricted in the second version of quicksort (though this particular input offers relatively few opportunities for parallel evaluation to begin with).


Figure 7 Space Use as a Function of Number of Processors. This figure shows an upper bound on the space required to numerically integrate a function using the code given in the text. The breadth-first scheduler requires a nearly constant amount of space, regardless of the number of processors. The depth-first scheduler gradually increases its space requirements as the number of processors increases.

[Figure 7 plot: space high-water mark versus number of processing elements, with one curve each for the breadth-first and depth-first schedulers.]

Even as we increase the number of processors, the depth-first scheduler only gradually increases its use of space. In contrast, the breadth-first scheduler uses a large amount of space, independently of the number of processing elements (PEs). Given enough PEs, the depth-first scheduler will eventually emulate the behavior (and the performance) of the breadth-first scheduler.

In this example, the depth-first scheduler capitalizes on the fact that the result returned by this function is a single scalar value. Thus, by focusing resources on a small number of intervals and immediately aggregating the results, it makes efficient use of space. The breadth-first scheduler, on the other hand, expands many intervals before computing any sums.

4.3.3 Example: Quicksort Revisited

Recall that the poor performance of the depth-first scheduler in the implementation of quicksort above is due to the fact that the input argument xs appears in both branches of the parallel evaluation. In light of this, consider an alternative implementation where the recursive case is structured as follows.

...
  | x :: _ =>
      let val (ls, gs) = {filter (le x) xs,
                          filter (gt x) xs}
      in
        append {qsort ls, qsort gs}
      end

In this version, we partition the list in parallel, but then synchronize before recursively sorting each sublist. This version makes better use of space under a depth-first schedule, but it does so at the cost of introducing more constraints on parallel execution. In particular, by synchronizing before the recursive calls, it ensures that the depth-first scheduler will use only O(N) space. Figure 6(b) shows summarized cost graphs for this version of quicksort.

While this example shows that, even in a declarative language such as ours, programmers have a measure of control over performance, it also points to a potential problem: otherwise innocuous program transformations may adversely affect space usage. In this case, the program variables ls and gs are each used exactly once and are prime candidates for inlining. Inlining the definitions of ls and gs, however, produces exactly the version of the code given in Section 1.2 (reconstructed below). I may explore, as part of my continuing work, a characterization of program transformations (such as inlining) that may be safely performed in the context of a parallel language.
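For reference, inlining ls and gs into the parallel pair yields the following recursive case. This is my reconstruction of the shape of the Section 1.2 version, in which xs remains live in both parallel branches:

...
  | x :: _ =>
      append {qsort (filter (le x) xs),
              qsort (filter (gt x) xs)}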

5 Discussion

This section gives a brief look into the design of the cost semantics and the implications of my design choices. One might ask: are there other (useful) cost semantics for this language? Does the cost semantics reflect a particular implementation? Are there other implementations that also adhere to the specifications set by the cost semantics?

Consider as an example the following alternative rule for the evaluation of function application. The premises remain the same as in Figure 2; the only difference is in the conclusion, in the placement of the two heap edges (n, ℓ1) and (n, loc(v2)).

η1 ⊢ e1 ⇓ ⟨η2; x.e3⟩^ℓ1; g1; h1        η1 ⊢ e2 ⇓ v2; g2; h2        · · ·
──────────────────────────────────────────────────────────────────────────
η1 ⊢ e1 e2 ⇓ v3; g1 ⊕ g2 ⊕ g3 ⊕ [n]; h1 ∪ h2 ∪ h3 ∪ {(n, ℓ1), (n, loc(v2))}

This rule yields the same result as the version given in Figure 2. However, it admits more implementations. Recall that the heap edges (n, ℓ1) and (n, loc(v2)) represent possible last uses of the function and its argument, respectively. This variation of the rule delays those dependencies until after the evaluation of the function body. With this rule, one could prove the preservation of costs for implementations that preserve these values, even in the case where they are not used during the function application.

In contrast, the original version of this rule requires that these values be considered for reclamation by a garbage collector as soon as the function is applied. Suppose that an implementation stores the values of variables on a stack of activation records. Before the application, references to ⟨η2; x.e3⟩^ℓ1 and v2 will appear in the current record. If the implementation conforms to the original application rule, then it must either clear references to these values in the current record before the function is applied or inform the collector that these values are no longer live. An implementation that converts code into continuation-passing style (CPS) [Appel, 1992] and heap-allocates activation records is also constrained by this rule: it must not allocate space for these values unless the corresponding variables appear somewhere in the closure.

As this example suggests, there are many different implementations of the cost semantics, and there is also leeway in the choice of the semantics itself. The goal was to find a semantics that describes a set of common implementation techniques. Where the semantics and an implementation do not align, my experience suggests that either one can be adjusted until they fit together.

6 Conclusion

Before giving a timetable for my remaining work, I briefly discuss several topics that will form the remainder of this thesis.

6.1 Language Extensions

The language I have considered thus far is a pure functional language, i.e., a language without side effects. One avenue of further work is to consider effectful extensions to the language, such as mutable references or arrays. This extension presents two challenges.

First, my analysis of heap graphs depends on the fact that, at each step in the schedule, the set of executed heap nodes is an approximation of the heap state at that point in time. That is, my analysis assumes that heap graphs are persistent structures. In a programming language with mutable state, the semantics of a heap update must preserve enough information to reconstruct both the original heap structure and the new structure resulting from the update.
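As a minimal illustration of this persistence requirement, suppose heaps were modeled as immutable association lists over integer locations (a representation chosen only for this sketch): an update then returns a new heap while the old heap remains intact for reconstruction.

type 'a heap = (int * 'a) list

(* Updating never destroys the old heap; both versions stay available. *)
fun update (h : 'a heap) (loc : int, v : 'a) : 'a heap = (loc, v) :: h

fun lookup (h : 'a heap) (loc : int) : 'a option =
  Option.map #2 (List.find (fn (l, _) => l = loc) h)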

A second and perhaps more significant challenge is determining the parallel semantics of a language with effects. The semantics above relies on the fact that no thread of execution can influence the behavior of any other thread. Thus every schedule will yield the same (extensional) result. With certain kinds of effects, the interleaving of threads may affect this result.

There are several possible solutions to this second problem. We can restrict parallel evaluation to a pure fragment of the language, perhaps using a monad to stratify the language; this would maintain consistency with a sequential implementation. Alternatively, we can give a different semantics, for example, by prioritizing updates not according to the order in which they occur but by their position in the program. While this would not be consistent with a sequential semantics, it would still give a well-defined behavior that programmers could use to reason about their parallel programs independently of the scheduling policy.
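As a hypothetical illustration, suppose the language were extended with ML-style references (it is not, in this proposal), and consider:

let par x = (r := 1) and y = (r := 2) in !r end

A schedule that runs the left branch to completion first yields 2; one that runs the right branch first yields 1. Prioritizing the updates by their position in the program would instead fix a single result for every schedule.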

6.2 Implementation

The implementations discussed above are still relatively abstract in that they rely on operations, such as substitution, that are not readily implemented at the level of modern hardware. Part of the remainder of my thesis will consist of a compiler and runtime system for a parallel functional language.

One possibility for a more concrete implementation is to adapt an existing sequential compiler to support a parallel language. MLton is an open-source compiler for Standard ML that uses whole-program analysis to achieve good performance [MLton]. This performance and MLton’s simple design make it an attractive option for a parallel implementation. This implementation would use one of MLton’s two existing backends, which generate either C or native x86 code, to target multi-core and other shared-memory multi-processor systems based on the x86 architecture.

A parallel version of MLton can be broken down roughly into two parts: compiler changes and runtime support. Changes to the compiler should be relatively simple. They include adding new primitive operations to allow synchronization among threads and access to the scheduler. I must also ensure that these primitives are handled properly by MLton’s many optimizations.

Changes to the runtime will be more significant. Here, I must not only implement the additional primitives required to schedule parallel tasks, but also ensure that existing runtime operations are thread-safe. For example, allocation and collection routines must properly synchronize simultaneous requests from multiple threads. To provide reasonable performance, operations such as allocation must be thread-local in the common case.

One open issue concerns the granularity of parallel tasks. It may be more efficient to occasionally execute a set of parallel tasks sequentially rather than incur the overhead of communicating with the scheduler. In this case, one implementation strategy is to produce two versions of each function: one that adds parallel tasks (if any) to the queue and a second that executes them sequentially. Tasks would be added to the queue either with some fixed frequency or until parallel execution reaches a given depth.
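As an illustration of the two-version strategy, here is a sketch of mine (not MLton code) using a depth cutoff. The function fork stands in for the assumed scheduler primitive and is given a sequential placeholder so that the sketch runs as written:

(* Sequential stand-in for a primitive that would run both thunks in
   parallel via the scheduler. *)
fun fork (f, g) = (f (), g ())

val maxDepth = 5  (* illustrative cutoff for task granularity *)

(* Sequential version: involves no scheduler communication at all. *)
fun fibSeq n = if n < 2 then n else fibSeq (n - 1) + fibSeq (n - 2)

(* Parallel version: spawns tasks only above the cutoff depth, then
   falls back to the sequential version. *)
fun fibPar (depth, n) =
  if n < 2 then n
  else if depth >= maxDepth then fibSeq n
  else
    let
      val (a, b) = fork (fn () => fibPar (depth + 1, n - 1),
                         fn () => fibPar (depth + 1, n - 2))
    in
      a + b
    end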

Alternatives In the case that an MLton implementation proves to be too time-consuming to fit within the scope of my thesis, several alternative implementations are possible. Both Nepal [Chakravarty and Keller, 2000] and NESL [Blelloch and Greiner, 1996] provide implementations of a flattened semantics that would allow me to test predictions made by the cost semantics. There is also an opportunity to explore alternative flattening techniques within these implementations, including implementations based on depth-first traversals of the cost graphs.

Graphics processors present another opportunity for implementation. As in the case of Nepal or NESL, this would again limit me to a data-parallel implementation. However, from the point of view of the central processor, program execution is sequential. This would simplify many aspects of the implementation discussed above.

6.3 Applications

In addition to the small examples discussed in the text above (e.g., quicksort, sieve of Eratosthenes), I plan to implement one or more larger parallel programs taken from areas such as graph theory or mesh generation. While I have listed applications as the third part of my ongoing work, I plan to identify several potential applications before beginning significant work on other parts of my thesis. These applications will serve not only to demonstrate my techniques, but also to drive the extensions to my language and the implementation described in this section.

6.4 Plan of Work

I conclude with a detailed plan for the remainder of this thesis.

Area                 Task                                                         Duration

Language Extensions  Impure Semantics. Determine parallel semantics for an           5%
                     impure language
                     Mutable State. Integrate mutable state into cost semantics     10%

Implementation       Compiler Support. Extend syntax, elaborator, and internal      10%
                     languages; verify new primitives are treated correctly by
                     optimizations
                     Runtime Support. Add synchronization and scheduling            20%
                     primitives; implement thread-safe versions of existing
                     runtime operations (e.g., allocation, collection)
                     Testing and Instrumentation. Verify correctness and            10%
                     performance of implementation; add appropriate
                     instrumentation to validate predictions

Applications         Implementation. Implement several small as well as one or     15%
                     more larger examples
                     Evaluation. Analyze behavior of programs both using cost      10%
                     semantics and empirically

Dissertation         Writing. Complete the thesis document                         20%

References

Shail Aditya, Arvind, Jan-Willem Maessen, and Lennart Augustsson. Semantics of pH: A parallel dialect of Haskell. Technical Report Computation Structures Group Memo 377-1, MIT, June 1995.

Andrew W. Appel. Compiling with Continuations. Cambridge University Press, New York, NY, USA, 1992. ISBN 0-521-41695-7.

Arvind, Rishiyur S. Nikhil, and Keshav K. Pingali. I-structures: Data structures for parallel computing. ACM Trans. Program. Lang. Syst., 11(4):598–632, 1989. ISSN 0164-0925.


Clem Baker-Finch, David King, and Phil Trinder. An operational semantics for parallel lazy evaluation. In ACM SIGPLAN International Conference on Functional Programming (ICFP ’00), pages 162–173. ACM, 2000.

G. E. Blelloch. Vector Models for Data-Parallel Computing. MIT Press, Cambridge, MA, 1990.

G. E. Blelloch, S. Chatterjee, J. C. Hardwick, J. Sipelstein, and M. Zagha. Implementation of a portable nested data-parallel language. Journal of Parallel and Distributed Computing, 21(1):4–14, April 1994.

Guy Blelloch, Phil Gibbons, and Yossi Matias. Provably efficient scheduling for languages with fine-grained parallelism. Journal of the Association for Computing Machinery, 46(2):281–321, 1999.

Guy E. Blelloch and John Greiner. Parallelism in sequential functional languages. In Functional Programming Languages and Computer Architecture, pages 226–237, 1995.

Guy E. Blelloch and John Greiner. A provable time and space efficient implementation of NESL. In ACM SIGPLAN International Conference on Functional Programming, pages 213–225, May 1996.

R. D. Blumofe and C. E. Leiserson. Space-efficient scheduling of multithreaded computations. SIAM Journal of Computing, 27(1):202–229, 1998.

Manuel M. T. Chakravarty and Gabriele Keller. More types for nested data parallel programming. In ICFP ’00: Proceedings of the Fifth ACM SIGPLAN International Conference on Functional Programming, pages 94–105, New York, NY, USA, 2000. ACM Press. ISBN 1-58113-202-6.

Robert Ennals. Adaptive Evaluation of Non-Strict Programs. PhD thesis, University of Cambridge, 2004.

John T. Feo, David C. Cann, and Rodney R. Oldehoeft. A report on the Sisal language project. J. Parallel Distrib. Comput., 10(4):349–366, 1990. ISSN 0743-7315.

John Greiner and Guy E. Blelloch. A provably time-efficient parallel implementation of full speculation. ACM Transactions on Programming Languages and Systems, 21(2):240–285, 1999.

Jorgen Gustavsson and David Sands. A foundation for space-safe transformations of call-by-need programs. In Proceedings of the Workshop on Higher Order Operational Techniques in Semantics, volume 26 of Electronic Notes in Theoretical Computer Science, September 1999.

Kevin Hammond, Jost Berthold, and Rita Loogen. Automatic skeletons in Template Haskell. Parallel Processing Letters, 13(3):413–424, September 2003.

John Hopcroft, Wolfgang Paul, and Leslie Valiant. On time versus space. J. ACM, 24(2):332–337, 1977.ISSN 0004-5411.

Paul Hudak and Steve Anderson. Pomset interpretations of parallel functional programs. In Proc. of a Conference on Functional Programming Languages and Computer Architecture, pages 234–256, London, UK, 1987. Springer-Verlag. ISBN 0-387-18317-5.

Intel. 80-core programmable processor first to deliver teraflops performance. URL http://www.intel.com/.

C. Barry Jay, Murray Cole, M. Sekanina, and Paul Steckler. A monadic calculus for parallel costing of a functional language of arrays. In Euro-Par ’97: Proceedings of the Third International Euro-Par Conference on Parallel Processing, pages 650–661, London, UK, 1997. Springer-Verlag. ISBN 3-540-63440-1.

Peter J. Landin. The mechanical evaluation of expressions. Computer Journal, 6, Jan 1964.

R. Lechtchinsky, M. M. T. Chakravarty, and G. Keller. Higher order flattening. In V. Alexandrov, D. van Albada, P. Sloot, and J. Dongarra, editors, International Conference on Computational Science (ICCS 2006), LNCS. Springer, 2006.


Roman Lechtchinsky, Manuel M. T. Chakravarty, and Gabriele Keller. Costing nested array codes. Parallel Processing Letters, 12(2):249–266, 2002.

Hans-Wolfgang Loidl and Kevin Hammond. A sized time system for a parallel functional language. InProceedings of the Glasgow Workshop on Functional Programming, Ullapool, Scotland, July 1996.

James R. McGraw. The VAL language: Description and analysis. ACM Trans. Program. Lang. Syst., 4(1):44–82, 1982. ISSN 0164-0925.

Yasuhiko Minamide. Space-profiling semantics of the call-by-value lambda calculus and the CPS transformation. In Andrew D. Gordon and Andrew M. Pitts, editors, The Third International Workshop on Higher Order Operational Techniques in Semantics, volume 26 of Electronic Notes in Theoretical Computer Science. Elsevier, 1999.

MLton. An open-source, whole-program, optimizing Standard ML compiler. URL http://www.mlton.org/.

G. J. Narlikar and G. E. Blelloch. Space-efficient scheduling of nested parallelism. ACM Trans. on Programming Languages and Systems, 21(1):138–173, 1999.

John D. Owens, David Luebke, Naga Govindaraju, Mark Harris, Jens Kruger, Aaron E. Lefohn, and Timothy J. Purcell. A survey of general-purpose computation on graphics hardware. Computer Graphics Forum, 26:80–113, March 2007.

Alvaro J. Rebon Portillo, Kevin Hammond, Hans-Wolfgang Loidl, and Pedro B. Vasconcelos. Cost analysis using automatic size and time inference. In Ricardo Pena and Thomas Arts, editors, Implementation of Functional Languages, 14th International Workshop, IFL 2002, Madrid, Spain, September 16–18, 2002, Revised Selected Papers, volume 2670 of Lecture Notes in Computer Science, pages 232–248. Springer, 2002. ISBN 978-3-540-40190-2.

P. Roe. Parallel Programming Using Functional Languages. PhD thesis, Department of Computing Science,University of Glasgow, 1991.

Mads Rosendahl. Automatic complexity analysis. In FPCA ’89: Proceedings of the Fourth International Conference on Functional Programming Languages and Computer Architecture, pages 144–156, New York, NY, USA, 1989. ACM Press. ISBN 0-89791-328-0.

Colin Runciman and David Wakeling. Heap profiling of lazy functional programs. J. Funct. Program., 3(2):217–245, 1993.

D. Sands. Calculi for Time Analysis of Functional Programs. PhD thesis, Department of Computing,Imperial College, University of London, September 1990.

Zhong Shao and Andrew W. Appel. Space-efficient closure representations. In LFP ’94: Proceedings of the 1994 ACM Conference on LISP and Functional Programming, pages 150–161, New York, NY, USA, 1994. ACM Press. ISBN 0-89791-643-3.

Philip W. Trinder, Kevin Hammond, Hans-Wolfgang Loidl, and Simon L. Peyton Jones. Algorithm +Strategy = Parallelism. Journal of Functional Programming, 8(1):23–60, January 1998.


A Definitions

For completeness, I give several definitions. Most are straightforward, and any interesting cases were discussed explicitly in the text above.

A.1 Locations

The location of a value is the outermost location of that value. It serves to uniquely identify that value. The locations of an expression are the locations of any values that appear in that expression. Similarly for declarations.

loc(⟨x.e⟩^ℓ) = ℓ
loc(⟨v1, v2⟩^ℓ) = ℓ

locs(λx.e) = locs(e)
locs(e1 e2) = locs(e1) ∪ locs(e2)
locs({e1, e2}) = locs(e1) ∪ locs(e2)
locs(πi e) = locs(e)
locs(let par d in e) = locs(d) ∪ locs(e)

locs(x = e) = locs(e)
locs(d1 and d2) = locs(d1) ∪ locs(d2)
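These equations transcribe directly into ML. The following sketch assumes a hypothetical datatype for the expression and declaration syntax (including a constructor for embedded values, which carry their outermost location) and uses lists in place of sets:

datatype exp =
    Loc of int                (* an embedded value, by its location *)
  | Var of string
  | Lam of string * exp
  | App of exp * exp
  | Pair of exp * exp         (* the parallel pair {e1, e2} *)
  | Proj of int * exp
  | LetPar of dec * exp
and dec =
    Bind of string * exp
  | And of dec * dec

fun locs (Loc l)           = [l]
  | locs (Var _)           = []
  | locs (Lam (_, e))      = locs e
  | locs (App (e1, e2))    = locs e1 @ locs e2
  | locs (Pair (e1, e2))   = locs e1 @ locs e2
  | locs (Proj (_, e))     = locs e
  | locs (LetPar (d, e))   = locsDec d @ locs e
and locsDec (Bind (_, e))  = locs e
  | locsDec (And (d1, d2)) = locsDec d1 @ locsDec d2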

A.2 Substitution

Substitution, as used in Sections 3 and 4, is a standard capture-avoiding substitution.

[v/x]x = v
[v/x]y = y    (if x ≠ y)
[v/x]c = c
[v/x](λy.e) = λy.([v/x]e)    (if x ≠ y)
[v/x](e1 e2) = ([v/x]e1) ([v/x]e2)
[v/x]{e1, e2} = {[v/x]e1, [v/x]e2}
[v/x](πi e) = πi ([v/x]e)
[v/x](let par d in e) = let par [v/x]d in [v/x]e

[v/x](x = e) = x = [v/x]e
[v/x](d1 and d2) = [v/x]d1 and [v/x]d2

[x = v]e = [v/x]e
[δ1 and δ2]e = [δ1][δ2]e

[·]e = e
[η, x ↦ v]e = [η][v/x]e

B Proofs

Theorem 2 (nd Completeness). If η ⊢ e ⇓ v then [⌞η⌟]e ↦*_{nd} ⌞v⌟.

Proof. By induction on the derivation of η ⊢ e ⇓ v.

Case E-Fn: We apply ND-Prim along with P-Fn to achieve the desired result.

Case E-Var: Since (x ↦ v) ∈ η, it follows that [⌞η⌟]e = ⌞v⌟, and ⌞v⌟ ↦*_{nd} ⌞v⌟ in zero steps.

Case E-App: First, [⌞η⌟](e1 e2) = ([⌞η⌟]e1) ([⌞η⌟]e2). Applying ND-Prim along with P-App, we have let par x1 = [⌞η⌟]e1 in let par x2 = [⌞η⌟]e2 in x1 x2. Inductively, [⌞η⌟]e1 ↦*_{nd} ⌞⟨η′; x.e⟩^ℓ⌟ and [⌞η⌟]e2 ↦*_{nd} ⌞v2⌟. We apply rules ND-Let, ND-Leaf, and ND-Prim to the let par expressions at each step. We obtain the final result by application of P-Join and P-AppBeta (along with ND-Prim in both cases).

Case E-Pair: Here, [⌞η⌟]{e1, e2} = {[⌞η⌟]e1, [⌞η⌟]e2}. Applying ND-Prim along with P-Fork, we have let par x1 = [⌞η⌟]e1 and x2 = [⌞η⌟]e2 in {x1, x2}. Inductively, [⌞η⌟]e1 ↦*_{nd} ⌞v1⌟ and [⌞η⌟]e2 ↦*_{nd} ⌞v2⌟. We again apply rules ND-Let, ND-Branch, ND-Leaf, and ND-Prim to the let par expression at each step (also using ND-Idle in the case where the two sub-computations are of different lengths). We obtain the final result by application of P-Join and P-Pair (along with ND-Prim in both cases).

Case E-Proji: [⌞η⌟](πi e) = πi [⌞η⌟]e, and πi [⌞η⌟]e ↦_{nd} let par x = [⌞η⌟]e in πi x by rule P-Proji with ND-Prim. Inductively, we have [⌞η⌟]e ↦*_{nd} ⌞⟨v1, v2⟩^ℓ⌟. Applying rules ND-Let, ND-Leaf, and ND-Prim at each step, we have let par x = [⌞η⌟]e in πi x ↦*_{nd} let par x = ⌞⟨v1, v2⟩^ℓ⌟ in πi x. We apply rules P-Join and P-ProjiBeta (along with ND-Prim in both cases) to yield the final result.

Lemma 2. If [η]e ↦_{nd} [η′]e′, [η′]e′ ↦*_{nd} v, and η′ ⊢ ⌜e′⌝ ⇓ ⌞v⌟, then η ⊢ ⌜e⌝ ⇓ ⌞v⌟. Similarly, if [η]d ↦_{nd} [η′]d′, [η′]d′ ↦*_{nd} δ, and η′ ⊢ ⌜d′⌝ ⇓ ⌞δ⌟, then η ⊢ ⌜d⌝ ⇓ ⌞δ⌟.

Proof. By induction on the derivations of e ↦_{nd} e′ and d ↦_{nd} d′.

Case ND-Let: We have e = let par d in e1 and e′ = let par d′ in e1, as well as a derivation of d ↦_{nd} d′. Note that ⌜e⌝ = [⌜d⌝]⌜e1⌝ and ⌜e′⌝ = [⌜d′⌝]⌜e1⌝. It is easily shown that η′ ⊢ [⌜d′⌝]⌜e1⌝ ⇓ ⌞v⌟ if and only if η′ ⊢ ⌜d′⌝ ⇓ ⌞δ⌟ and η′ ⊢ [⌞δ⌟]⌜e1⌝ ⇓ ⌞v⌟. Inductively, we have η ⊢ ⌜d⌝ ⇓ ⌞δ⌟. Applying ND-Let yields the desired result.

Case ND-Idle: In this case, e′ = e, and therefore ⌜e′⌝ = ⌜e⌝. The required result is one of our assumptions.

Case ND-Prim: We must consider each primitive transition. P-App, P-Proji, and P-Fork follow immediately since ⌜e′⌝ = e. In the case of P-Join, ⌜e⌝ = e′. P-Fn follows by application of E-Fn. For the cases of P-Pair, P-ProjiBeta, and P-AppBeta, we apply rules E-Pair, E-Proji, and E-App (respectively), using the fact that every value is related to itself in the cost semantics.

Case ND-Leaf: We have [η]e ↦_{nd} [η′]e′ as a sub-derivation. Inductively, we have η ⊢ ⌜e⌝ ⇓ ⌞v⌟, and the result follows immediately.

Case ND-Branch: Here, we have [η]⌜d1⌝ ↦_{nd} [η′]d1′ and [η]⌜d2⌝ ↦_{nd} [η′]d2′ as sub-derivations. Inductively, we have η ⊢ ⌜d1⌝ ⇓ ⌞δ1⌟ and η ⊢ ⌜d2⌝ ⇓ ⌞δ2⌟. The result follows immediately.
