
Exploiting Vector and Multicore Parallelism for Recursive, Data- and Task-Parallel Programs

Bin Ren, College of William and Mary

[email protected]

Sriram Krishnamoorthy, Pacific Northwest National Laboratory

[email protected]

Kunal Agrawal, Washington University in St. Louis

[email protected]

Milind Kulkarni, Purdue University

[email protected]

Abstract

Modern hardware contains parallel execution resources that are well-suited for data-parallelism—vector units—and task parallelism—multicores. However, most work on parallel scheduling focuses on one type of hardware or the other. In this work, we present a scheduling framework that allows for a unified treatment of task- and data-parallelism. Our key insight is an abstraction, task blocks, that uniformly handles data-parallel iterations and task-parallel tasks, allowing them to be scheduled on vector units or executed independently on multicores. Our framework allows us to define schedulers that can dynamically select between executing task blocks on vector units or multicores. We show that these schedulers are asymptotically optimal, and deliver the maximum amount of parallelism available in computation trees. To evaluate our schedulers, we develop program transformations that can convert mixed data- and task-parallel programs into task block–based programs. Using a prototype instantiation of our scheduling framework, we show that, on an 8-core system, we can simultaneously exploit vector and multicore parallelism to achieve 14×–108× speedup over sequential baselines.

Keywords: Task Parallelism; Data Parallelism; General Scheduler

Copyright © 2017 Association for Computing Machinery. ACM acknowledges that this contribution was authored or co-authored by an employee, contractor or affiliate of the United States government. As such, the United States Government retains a nonexclusive, royalty-free right to publish or reproduce this article, or to allow others to do so, for Government purposes only.

PPoPP '17, Feb. 4–8, 2017, Austin, Texas, USA.
Copyright © 2017 ACM 978-1-4503-4493-7/17/02...$15.00
DOI: http://dx.doi.org/10.1145/3018743.3018763

1. Introduction

The two most common types of parallelism are data parallelism, arising from simultaneously executing the same operation(s) over different pieces of data, and task parallelism, arising from executing multiple independent tasks simultaneously. Data parallelism is usually expressed through parallel loops, while task parallelism is often expressed using fork-join parallelism constructs.

These types of parallelism are each well-matched to different types of parallel hardware: data parallelism is well-suited for vector hardware such as SIMD units. The paradigm of executing the same computation across multiple pieces of data is exactly what SIMD units were designed for. While multicore hardware is more flexible (and many programming models allow mapping data-parallel execution to multiple cores), the independent execution contexts provided by multicores are well-matched to the task parallelism in languages such as Cilk [4].

Increasingly, however, both kinds of parallel hardware coexist. Modern machines consist of multicore chips, with each core containing dedicated vector hardware. As a result, it is important to be able to write programs that target both types of hardware in a unified manner.

Unfortunately, when programming data- or task-parallelism, the two paradigms are often treated separately. While programming models such as OpenMP [11] support both data- and task-parallelism, the coexistence is uneasy, requiring very different constructs, used in very different contexts. Indeed, Cilk "supports" data parallelism through constructs such as cilk_for, which, under the hood, translate data parallelism to task parallelism [6]. In recent work, Ren et al. developed transformations that work in the opposite direction [15]: given a task parallel program, they extract data parallelism for execution on vector units. None of these approaches cleanly support both multicore and vector execution simultaneously, especially for programs that mix data- and task-parallelism.

In this paper, we develop a system that unifies the treatment of both types of parallelism and both types of parallel hardware: we develop general schedulers that let programs mix data- and task-parallelism and run efficiently on hardware with both multicores and vector units. Importantly, our schedulers allow exploiting data- and task-parallelism at all levels of granularity: task parallel work can be decomposed into data-parallel work for execution on vector units, and data-parallel work can be forked into separate tasks for load balancing purposes through work-stealing. This grain-free treatment of parallelism allows our system to effectively and dynamically map complex programs to complex hardware, rather than "baking in" a particular execution strategy.

The key insight behind our work is a unifying abstraction for parallel work, the task block. Task blocks contain multiple independent tasks that can arise from both data parallel loops and from task parallel fork-join constructs. We develop a general scheduling framework that allows task blocks to be used as a unit of scheduling for both data parallelism (essentially, by executing a data-parallel for loop over the task block) or task parallelism (by allowing task blocks to be stolen and executed independently). Importantly, each of these execution modes produces new task blocks that make independent scheduling decisions, allowing for the free mixing of data- and task-parallel execution.

To evaluate our schedulers, we present a simple language that allows programmers to express task parallelism nested within data parallelism. We then extend Ren et al.'s transformations [15] to generate task block–based programs from algorithms written in our language. Programs written in our specification language (or in general-purpose languages restricted to use analogous constructs) and run using our scheduling framework can simultaneously exploit data- and task-parallelism (rather than one or the other) on both vector and multicore hardware.

In this paper, we make several contributions:

1. We develop a general scheduling framework based on task blocks for SIMD and multicore execution of computation trees. Using this scheduling framework, we make three advances: (i) we show that Ren et al.'s approach to executing task-parallel programs on vector hardware is a special case of our framework; (ii) we use our framework to develop a new class of scheduling policies, restart, that exposes more data parallelism; (iii) we provide mechanisms to support Cilk-style work-stealing between cores, unifying execution on data-parallel and task-parallel hardware.

2. We provide theoretical bounds on the parallelism exposed by our various scheduling policies. We show that our new scheduling policies can expose more parallelism than Ren et al.'s original policies, and are asymptotically optimal (though in practice, scheduling overheads may dominate). This represents the first theoretical analysis of vector execution of mixed task- and data-parallel programs.

3. We evaluate on eleven benchmarks, including all the benchmarks from Ren et al., as well as new benchmarks that exhibit more general structures (e.g., nesting task- and data-parallelism).

We implement a proof-of-concept instantiation of our scheduling framework. Overall, on an 8-core machine, our system is able to deliver combined SIMD/multicore speedups of 14×–108× over sequential baselines.

2. Background

We now provide some background on previous work on vectorizing task parallel programs and how task parallel programs are scheduled using work stealing.

2.1 Vectorizing Task Parallel Programs

Vectorization of programs has typically focused on vectorizing simple, dense data parallel loops [10]. It is not immediately apparent that SIMD, which is naturally data parallel, can be applied to recursive task-parallel programs. Such programs can be thought of as a computation tree, where a task's children are the tasks that it spawned and the leaves are the base-case tasks.¹ Any effective vectorization of these programs must satisfy two constraints:

1. All of the tasks being executed in a vectorized manner must be (substantially) identical. This criterion is naturally satisfied because the tasks arise from the same recursive code.

2. Data accesses in the tasks must be well structured. Vector load and store operations access contiguous regions of memory. If a SIMD instruction operates on data that is not contiguous in memory, it must be implemented with slow scatter and gather operations.

To understand why this second restriction is problematic, consider trying to vectorize the standard Cilk-style execution model, with each SIMD lane representing a different worker. Because the workers execute independent computation subtrees, their program stacks grow and shrink at different rates, leading to non-contiguous accesses to local variables, and consequently inefficient vectorization.

SIMD-friendly execution: Ren et al. developed an execution strategy that satisfies both constraints for efficient vectorization [15]. The key insight of their strategy was to perform a more disciplined execution of the computation, with the restriction that all tasks executed in a SIMD operation be at the same recursion depth—this allows the stacks of these tasks to be aligned and interleaved so that local variables can be accessed using vector loads and stores.

¹ This formulation does not work once we have computations with syncs. However, those programs can also be represented using a tree, albeit a more complex and dynamic one.
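To make this intuition concrete, consider the following small illustration (ours, not the authors' code; the frame layout and the value of Q are assumed purely for the example). When all tasks in a block sit at the same recursion depth, their stack frames can be interleaved into a structure-of-arrays layout, so the same local variable of Q different tasks occupies one contiguous, vector-loadable array:

#include <cstddef>

constexpr std::size_t Q = 8;        // assumed number of SIMD lanes

// One "interleaved stack frame" shared by a block of Q same-depth tasks.
struct FrameSoA {
  int   x[Q];                       // local variable x of each task, contiguous
  float acc[Q];                     // local accumulator of each task, contiguous
};

void step(FrameSoA& f) {
  // A dense loop over the lane index: the same operation applied to Q
  // contiguous values, which a compiler can turn into vector loads/stores.
  for (std::size_t lane = 0; lane < Q; ++lane)
    f.acc[lane] += static_cast<float>(f.x[lane]) * 0.5f;
}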


void foo(int x)
  if (isBase(x))
    baseCase()
  else
    l1 = inductiveWork1(x)
    spawn foo(l1)
    l2 = inductiveWork2(x)
    spawn foo(l2)

(a) Simple recursive code. spawn creates new tasks.

void bfs_foo(TaskBlock tb)
  TaskBlock next
  foreach (Task t : tb)
    if (isBase(t.x))
      baseCase()
    else
      l1 = inductiveWork1(t.x)
      next.add(new Task(l1))
      l2 = inductiveWork2(t.x)
      next.add(new Task(l2))
  bfs_foo(next)

(b) Blocked breadth-first version of the code in Figure 1(a).

void blocked_foo(TaskBlock tb)
  TaskBlock left, right
  foreach (Task t : tb)
    if (isBase(t.x))
      baseCase()
    else
      l1 = inductiveWork1(t.x)
      left.add(new Task(l1))
      l2 = inductiveWork2(t.x)
      right.add(new Task(l2))
  blocked_foo(left)
  blocked_foo(right)

(c) Blocked depth-first execution.

Figure 1: Pseudocode demonstrating vectorization transformations.

Ren et al.'s approach targets programs written using a Cilk-like language (or programs in other languages that adhere to similar restrictions). The programs they target consist of a single, k-ary recursive method:

f(p1, ..., pk) = if eb then sb else si

This method evaluates a boolean expression to decide whether to execute a base case (sb, which can perform reductions to compute the eventual program result) or to execute an inductive case (si), which can spawn additional tasks by invoking the recursive method again. As in Cilk, these spawned methods are assumed to be independent of the remainder of the work in the spawning method, and hence can be executed in parallel. Ren et al.'s transformations target single recursive methods; Section 5 describes how we extend the language to handle data-parallel programs with task-parallel iterations.
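As a concrete, hypothetical instance of this shape (our sketch, not code from the paper), a fib-style computation fits the pattern directly: the guard eb selects between a reducing base case sb and an inductive case si that spawns two further calls. A sequential C++ rendering, with the Cilk spawns indicated in comments, might look like:

#include <atomic>

std::atomic<long> result{0};     // s_b performs a reduction into this value

void f(int n) {                  // f(p1): a single parameter in this example
  if (n <= 1) {                  // e_b: base-case guard
    result += n;                 // s_b: base case (a reduction)
  } else {                       // s_i: inductive case spawning two tasks
    f(n - 1);                    // would be `spawn f(n-1)` in the Cilk-like language
    f(n - 2);                    // would be `spawn f(n-2)` in the Cilk-like language
  }
}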

Figure 1(a) shows a simple recursive program that we use to explain Ren et al.'s execution strategy.² The steps of execution are as follows:

1. As shown in Figure 1(b), the computation is initially executed in a breadth-first, level-by-level manner where tasks at the same level are grouped into a task block. Crucially, by executing block by block, the code structure is a dense loop over the block, allowing one to use standard vectorization techniques [10]. If a task spawns a new task, the newly-spawned task is placed into the next task block, representing the next level of tasks. Once the task block finishes, execution moves on to the next block.

2. While breadth-first execution satisfies the two constraints outlined above, it suffers from excessive space usage. A TaskBlock contains all the tasks at a level in the computation tree; therefore, its size can be Ω(n), where n is the number of tasks. To control this space explosion, Ren et al. switch to depth-first execution once a TaskBlock reaches a sufficient size: each task in the task block simply executes to completion by calling the original recursive function. This limits space usage, as each of these tasks will only use O(d) space to execute (where d is the depth of the tree). To recover vectorizability, Ren et al. apply point blocking [7], yielding the code in Figure 1(c). This strategy proceeds by (recursively) first executing each task's left subtask in a blocked manner and then each task's right subtask. Note that the code still operates over a dense block of tasks and all the tasks in any TaskBlock are at the same level, facilitating vectorization.

Because not all tasks have identical subtrees, as execution proceeds, some tasks will "die out" and the TaskBlock will empty out. If there are fewer than Q tasks in a TaskBlock (Q is the number of lanes in the vector unit), it is impossible to effectively use the SIMD execution resources. To tackle this problem, Ren et al. proposed re-expansion: if a TaskBlock has fewer than Q tasks, they can switch back to breadth-first execution, generating more work and making the next-level block larger. Re-expansion is the key to effective vectorization. In Section 3, we show that re-expansion is a specific instance of a class of scheduling policies that our general scheduling framework supports.

² Reductions, computations, etc., are hidden in the baseCase and inductiveWork methods. Here, we use Java-like pseudocode, rather than the original specification language.

2.2 Work-Stealing Scheduler Operation

Work-stealing schedulers are known to be efficient both theoretically and empirically for scheduling task parallel programs [1, 3]. In a work-stealing scheduler, each worker maintains a work deque, which contains ready tasks. When a task executing on worker p creates two children, the worker places its right child on its deque and executes its left child. When a worker p finishes executing its current task, it pops a task from the bottom of its own deque. If p's deque is empty, then p becomes a thief. It picks a victim uniformly at random from the other workers, and steals work from the top of the victim's deque. If the victim's deque is empty, then p simply performs another random steal attempt.
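The following is a minimal, mutex-based sketch of that protocol (our illustration only; production runtimes such as Cilk use carefully engineered lock-free deques, and Task is a placeholder type). The owner pushes and pops at the bottom of its deque, while thieves steal from the top of a randomly chosen victim:

#include <deque>
#include <mutex>
#include <optional>
#include <random>
#include <vector>

struct Task { int payload; };

class Worker {
  std::deque<Task> deque_;            // bottom = back, top = front
  std::mutex m_;
 public:
  void pushBottom(Task t) { std::lock_guard<std::mutex> g(m_); deque_.push_back(t); }
  std::optional<Task> popBottom() {   // owner takes its newest work
    std::lock_guard<std::mutex> g(m_);
    if (deque_.empty()) return std::nullopt;
    Task t = deque_.back(); deque_.pop_back(); return t;
  }
  std::optional<Task> stealTop() {    // thieves take the oldest (largest) work
    std::lock_guard<std::mutex> g(m_);
    if (deque_.empty()) return std::nullopt;
    Task t = deque_.front(); deque_.pop_front(); return t;
  }
};

// A worker that runs out of work becomes a thief and picks victims at random.
std::optional<Task> becomeThief(std::vector<Worker>& workers, std::mt19937& rng) {
  std::uniform_int_distribution<std::size_t> pick(0, workers.size() - 1);
  for (int attempt = 0; attempt < 100; ++attempt) {   // bounded only for the sketch
    if (auto t = workers[pick(rng)].stealTop()) return t;
  }
  return std::nullopt;
}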

3. General Schedulers for Vector and Multicore Execution

This section presents our general scheduler framework for efficient vectorized execution. We first describe three mechanisms that can be combined in various ways to produce different classes of schedulers. We next show that re-expansion is an instantiation of this framework. We then describe a new, more flexible, scheduler policy, restart. Finally, we explain how the framework can be extended to support multicore execution for both re-expansion and restart-based schedulers. Our schedulers are general, and apply to any type of computation tree.

Notation: We assume P cores and Q SIMD lanes per core. We assume that the recursive method has two recursive spawns; the framework readily generalizes to more spawns.

3.1 Scheduler Mechanisms

We first describe mechanisms for a "sequential" scheduler—a scheduler that supports vectorized execution on a single core. The unit of scheduling is the task block (TaskBlock), which contains some number of tasks—nodes in the computation tree—that can be executed in a SIMD manner: if a TaskBlock has t tasks in it, the scheduler can use Q lanes to execute the t tasks in ⌈t/Q⌉ SIMD steps. The scheduler has a deque with multiple levels. Each level represents a particular level of the computation tree.

The deque starts with a single TaskBlock at the top level that contains a single task: the task generated by calling the recursive method for the first time. At any time, when scheduling a block b for execution, the scheduler can take one of three actions:

Breadth-first expansion (BFE): Block b is executed in a SIMD manner. Any new tasks that are generated (spawned) are placed in a block b′, which is scheduled next. Note that b′ can contain up to twice as many tasks as b. If b′ is empty (has no tasks), another block is chosen for scheduling from the deque according to the scheduling policy.

Depth-first execution (DFE): Block b is executed in a SIMD manner. For each task that is executed, all new tasks generated by the first spawn call are placed in block b′, and all new tasks generated by the second spawn call are placed in block b′′. Note that b′ and b′′ contain at most as many tasks as b. If not empty, block b′′ is pushed onto the deque, and block b′ is scheduled next. If b′ is empty, then another block is chosen for scheduling from the deque according to the scheduling policy.

Restart: The scheduler pushes block b onto the deque (without executing it). If there is already a block on the deque at the same level (e.g., if b is the sibling of a block that was pushed onto the deque), then b is merged with the block at that level. Some block is chosen from the deque for scheduling next, according to the scheduling policy.

Note the distinction between "scheduling" a block, which means choosing an action to take with the block, and "executing" a block, which means running the tasks in the block using SIMD. For brevity, we say, e.g., a block is "executed using DFE" to mean that the block was scheduled, a DFE action was chosen, and then the block was executed.

Various schedulers can be created by combining these mechanisms in various ways, and changing the policies that dictate which mechanism is used at each scheduling step.
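To summarize how these pieces might fit together, here is a hypothetical sketch (ours, not the authors' implementation) of the TaskBlock and leveled-deque structures and the three actions. expand() stands in for running a single task and returning the tasks produced by its two spawn sites:

#include <map>
#include <utility>
#include <vector>

struct Task { int payload; };
using TaskBlock = std::vector<Task>;

// One (merged) block per level of the computation tree.
struct LeveledDeque {
  std::map<int, TaskBlock> levels;               // level -> block at that level
  void pushMerge(int level, TaskBlock b) {
    TaskBlock& dst = levels[level];
    dst.insert(dst.end(), b.begin(), b.end());   // merge with any existing block
  }
};

// Stand-in for executing one task (cf. inductiveWork1/2 in Figure 1); returns
// the tasks created by the first and second spawn sites (empty at a base case).
std::pair<TaskBlock, TaskBlock> expand(const Task& t) {
  if (t.payload <= 1) return {{}, {}};           // base case: no children
  TaskBlock l{Task{t.payload - 1}};
  TaskBlock r{Task{t.payload - 2}};
  return {l, r};                                 // two children, one per spawn site
}

// Breadth-first expansion: run the block; put *all* children into one block b'.
TaskBlock bfe(const TaskBlock& b) {
  TaskBlock next;                                // can hold up to 2x |b| tasks
  for (const Task& t : b) {                      // dense, vectorizable loop
    auto [l, r] = expand(t);
    next.insert(next.end(), l.begin(), l.end());
    next.insert(next.end(), r.begin(), r.end());
  }
  return next;                                   // scheduled next (if nonempty)
}

// Depth-first execution: split children by spawn site; park the second-spawn
// block b'' on the deque and schedule the first-spawn block b' next.
TaskBlock dfe(const TaskBlock& b, int level, LeveledDeque& dq) {
  TaskBlock left, right;
  for (const Task& t : b) {
    auto [l, r] = expand(t);
    left.insert(left.end(), l.begin(), l.end());
    right.insert(right.end(), r.begin(), r.end());
  }
  if (!right.empty()) dq.pushMerge(level + 1, std::move(right));
  return left;                                   // at most |b| tasks
}

// Restart: do not execute b; merge it onto the deque at its own level and let
// the policy choose which block to schedule next.
void restart(TaskBlock b, int level, LeveledDeque& dq) {
  dq.pushMerge(level, std::move(b));
}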

3.2 Re-expansion

Ren et al.'s re-expansion policy is, in fact, in the class of scheduling policies that can be described using our scheduling framework. Re-expansion only uses the BFE and DFE actions. In the context of the pseudocode of Figure 1, breadth-first expansion is akin to executing the code in Figure 1(b)—a single new TaskBlock is generated—while depth-first execution is akin to executing the code in Figure 1(c)—two new TaskBlocks are generated.

Re-expansion schedulers have two thresholds: tdfe (the trigger to switch to depth-first execution) and tbfe (the trigger to switch back to breadth-first expansion). A re-expansion scheduler executes its first block in BFE mode. As long as the next block has fewer than tdfe tasks, the scheduler keeps performing BFE actions. If the next block has more than tdfe tasks, the scheduler switches to DFE actions to conserve space. The scheduler then remains in DFE mode until the block has fewer than tbfe tasks, at which point it switches back to BFE mode. If the current block has no work to do, it grabs work from the bottom of the deque. Intuitively, the scheduler uses BFE actions to generate parallel work and DFE actions to limit space usage. However, because BFE actions can only draw from the bottom of the deque, they can only find parallelism "below" the current block, limiting the effectiveness of re-expansion, as Section 4 shows.
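A hypothetical rendering of this policy (ours, reusing the TaskBlock, LeveledDeque, bfe, and dfe sketches from Section 3.1; the thresholds are parameters) could look like:

#include <cstddef>

enum class Mode { BFE, DFE };

// Re-expansion: BFE until a block exceeds t_dfe tasks, DFE until a block drops
// below t_bfe tasks, and grab work from the bottom (deepest level) of the deque
// whenever the current block is empty.
void reexpansion(TaskBlock b, int level, LeveledDeque& dq,
                 std::size_t t_dfe, std::size_t t_bfe) {
  Mode mode = Mode::BFE;
  while (true) {
    if (b.empty()) {
      if (dq.levels.empty()) return;               // whole computation finished
      auto bottom = dq.levels.end(); --bottom;     // deepest level = bottom of deque
      level = bottom->first;
      b = std::move(bottom->second);
      dq.levels.erase(bottom);
    }
    if (mode == Mode::BFE && b.size() > t_dfe) mode = Mode::DFE;      // conserve space
    else if (mode == Mode::DFE && b.size() < t_bfe) mode = Mode::BFE; // regenerate work
    b = (mode == Mode::BFE) ? bfe(b) : dfe(b, level, dq);
    ++level;                                       // children live one level deeper
  }
}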

3.3 Restart

We now define a new class of schedulers that we call restart schedulers. These schedulers take advantage of the restart mechanism to be more aggressive in finding more parallel work, which can lead to better parallelism with smaller block sizes, as Section 4 shows.

The basic pattern of a restart scheduler is as follows: the scheduler starts in BFE mode, and switches to DFE mode when the tdfe threshold is hit, just as in a re-expansion scheduler. However, once in DFE mode, the scheduler behaves differently. If the current block, b, has fewer than trestart tasks on it, the scheduler takes a restart action. b is merged with any other blocks at the current level, and the scheduler looks for more work to be done. Unlike re-expansion schedulers, which would be constrained to begin executing the bottom block on the deque, a restart scheduler scans up from the bottom of the deque. As the scheduler works its way up through the deque, it merges (concatenates) all blocks at a given level together (there might be more than one block at a given level due to multiple restart or DFE actions that leave blocks on the deque). After merging, if a block has at least trestart tasks on it, the scheduler begins execution of that block in DFE mode. If the restart scheduler reaches the top of the deque without finding trestart work, it executes the top block in BFE mode. Intuitively, if a particular part of the computation tree runs out of work, the restart scheduler will look higher up in the tree and try to merge work from different portions of the tree that are at the same level to generate a larger block of work.

A restart scheduler maintains two invariants: (i) at each level above the currently executing block, there may be up to two blocks on the deque—one left over from a restart (that must have fewer than trestart tasks), and the other left over from a DFE (that may have more than trestart tasks, but does not have to); (ii) at each level below the currently executing block, there is at most one block on the deque, and that block must have fewer than trestart tasks.
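As an illustration of the restart scheduler's search for work (again our sketch under the same assumptions as before; blocks at a level are already merged by pushMerge, so the up-to-two-blocks-per-level case collapses to one), the scan might look like:

#include <cstddef>

// Scan the deque from the bottom (deepest level) toward the top, looking for a
// merged block with at least t_restart tasks; resume DFE there. If no level has
// enough work, expand the top block breadth-first to generate more.
void restart_scan(LeveledDeque& dq, std::size_t t_restart) {
  for (auto it = dq.levels.rbegin(); it != dq.levels.rend(); ++it) {
    if (it->second.size() >= t_restart) {
      int level = it->first;
      TaskBlock b = std::move(it->second);
      auto fwd = it.base(); --fwd;                 // forward iterator to this entry
      dq.levels.erase(fwd);
      TaskBlock left = dfe(b, level, dq);          // resume depth-first execution
      // (a full scheduler would keep executing 'left'; parking it keeps the sketch short)
      if (!left.empty()) dq.pushMerge(level + 1, std::move(left));
      return;
    }
  }
  if (!dq.levels.empty()) {                        // reached the top without enough work:
    auto top = dq.levels.begin();                  // execute the top block with BFE
    int level = top->first;
    TaskBlock b = std::move(top->second);
    dq.levels.erase(top);
    TaskBlock next = bfe(b);
    if (!next.empty()) dq.pushMerge(level + 1, std::move(next));
  }
}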

3.4 Creating Parallel Schedulers

Both schedulers described above are sequential; though TaskBlocks can be used for vector-parallel execution, the schedulers are for single-core execution; there is just one deque of TaskBlocks. We now extend the scheduling framework to support multicore execution by adding support for multiple workers (cores), each with their own deque, and adding one more action that a scheduler can take:

Steal: The scheduler places block b onto the deque without executing it (b might be empty). The thread then steals a TaskBlock from another worker's deque. The stolen TaskBlock is added to the worker's deque at the appropriate level and, if necessary, merged with any block already on the worker's deque. The choice of which deque to steal from, which block on the deque to steal, and which block to schedule next, is a matter of scheduler policy.

Using this new action, we can extend re-expansion and restart schedulers to support multicore execution:

Re-expansion: A worker never attempts to steal until its deque is empty (i.e., it is completely out of work). It then steals the top TaskBlock from a random worker's deque, adds it to its deque, and executes that block as usual; that is, with DFE if it has more than tbfe nodes and with BFE otherwise. Note that only right blocks generated during DFE are placed on deques and are available for stealing.

Restart: Stealing in restart schedulers is more complicated. In the sequential scheduler, a worker that is unable to create a block with at least trestart tasks in its deque executes the top block in its deque with a BFE action. In the parallel scheduler, that worker would instead perform a steal action. Note that at the time it performs the steal, the worker has an invariant that all of the blocks in its deque have fewer than trestart tasks (because it was unable to restart from its own deque). During the steal, the current worker (thief) chooses a random worker (victim) and looks at the top of that deque (note that the victim could be the thief itself).

The top of the victim's deque contains one or two blocks. If one of them has at least trestart tasks, the thief steals it, adds it to its deque, and executes it with a DFE action. If the block has fewer than trestart tasks, the thief steals it, adds it to its deque, and executes a constant number of BFE actions starting from that block (to generate more work). If this generates a block with at least trestart tasks, the thief switches to DFE. Otherwise, it tries a restart action and, if that fails, steals again.

3.5 Setting the Thresholds

In producing various schedulers, there are three thresholds of importance: tdfe, tbfe, and trestart. The first threshold, tdfe, places an upper bound on the block size: a TaskBlock has at most 2·tdfe tasks. The latter two thresholds dictate when a scheduler looks for more work: if a TaskBlock is below one of the thresholds, the scheduler performs some action to find more work—re-expansion schedulers switch to BFE, restart schedulers perform a restart. These thresholds are less constrained and should be set between Q and tdfe.
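For a small worked example (all numbers assumed purely for illustration): with 128-bit SSE operating on 32-bit values there are Q = 4 lanes, and choosing tdfe = 2^12 gives

\[
  k = \frac{t_{dfe}}{Q} = 2^{10}, \qquad
  |\mathrm{TaskBlock}| \le 2\,t_{dfe} = 2^{13}, \qquad
  Q \le t_{bfe},\ t_{restart} \le t_{dfe},
\]

so a block of t = 1000 tasks executes in ⌈t/Q⌉ = 250 SIMD steps, and no block ever holds more than 2^13 tasks.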

4. Theoretical Analysis

This section analyzes the performance of the various scheduling strategies described above, namely: a) the basic strategy that performs BFE and then switches to DFE; b) the re-expansion strategy; and c) the restart strategy. We prove asymptotically tight upper and lower bounds on the SIMD running time Ts for computation trees with n unit time tasks. (Utilization bounds are implied by these bounds.) We first analyze the three sequential strategies on a core with Q SIMD lanes. These results indicate that restart can provide linear speedup as long as the block size (tdfe) is larger than Q. On the other hand, the basic strategy and re-expansion both require large block sizes to guarantee linear speedup; however, re-expansion requires significantly smaller block sizes. We then also analyze the parallel performance of restart where we have P cores each with Q SIMD lanes and show that restart is asymptotically optimal for these programs. Our theoretical results hold for any computation tree where the elements of a task block are vectorizable, and hence our restart schedulers can guarantee asymptotically optimal speedup regardless of computation tree structure.

Preliminaries: For the theoretical bounds, we model the execution of the program as a walk of a computation tree with n nodes. Each node in the computation tree is a unit time task and each edge is a dependence between tasks (tasks of longer length can be modeled as a chain). We also assume that all nodes at the same level in the tree are similar in that they can be part of a SIMD block, since, in our computations, they have the same computation and aligned stacks. We also assume that each node has at most 2 children (this assumption is for simplicity of exposition; all the results generalize to larger, but constant, out-degrees). Therefore, the tree's height h is at least lg n and at most n.

In each step, a core executes between 1 and Q nodes. The step is a complete step if the core executes Q nodes; otherwise it is incomplete. Recall that in our strategies, the scheduler operates on TaskBlocks of size between 1 and kQ = tdfe. If a TaskBlock is of size larger than Q, then it may take multiple SIMD steps to execute it. We call the entire execution of a TaskBlock a superstep.

Ts denotes the SIMD execution time of the program on a single core with Q SIMD lanes; alternatively, we can think of Ts as the number of steps executed by a scheduling strategy. Some relatively obvious bounds on Ts: (1) Ts < n, since each step executes at least 1 node. (2) Ts ≥ n/Q, since each step executes at most Q nodes. (3) Ts ≥ h, since each step executes only nodes at a particular level. We will later also analyze the parallel execution where we have P cores each with Q SIMD lanes. We say Tsp is the execution time of the program on these P cores. Again, we have the obvious lower bounds of Tsp ≥ n/(QP) and Tsp ≥ h.

We first state some straightforward structural lemmas.

Claim 1. Each superstep has at most one incomplete step.

Claim 2. In the entire computation, the number of complete steps is at most n/Q, since each complete step executes exactly Q nodes.

Claim 3. The breadth-first expansion of a tree (or subtree) of height h has at most h supersteps, since each superstep completes an entire level of the tree.

4.1 Sequential Strategies with and without Re-expansion

For this section, we state the theorems without proof; proofs will be provided in an extended technical report. The following theorem for the basic strategy, with no re-expansion and no restarts, implies that it guarantees linear speedup iff k = Ω(2^ε); either the block size must be very large or the computation tree very shallow.

Theorem 1. For any tree with height h = lg n + ε, the running time is Θ(min{2^ε · n/(kQ) + n/Q + lg n + ε, n}).

Now consider the strategy with re-expansion, as described in Section 3.2. We assume tdfe = kQ for some k, and tbfe = k1·Q, where 1 ≤ k1 ≤ k.

Theorem 2. For any tree with height h = lg n + ε, the running time is Θ(min{((ε − lg k)/k1 + 1) · n/Q + lg n + ε, n}).

Comparing Theorems 1 and 2, we see that the dependence on ε is linear instead of exponential in the strategy with re-expansion. Therefore, re-expansion provides speedup for larger values of ε and smaller values of k than the strategy without re-expansion. In addition, we should make k1 ≈ k; that is, we switch to BFE as soon as we have enough space in the block to do so, since that decreases the execution time.

4.2 Sequential Execution with Restart

We now prove that the restart strategy, as described in Section 3.3, is asymptotically optimal. We set trestart = k2·Q, where 1 ≤ k2 ≤ k; once a block of at least this size is assembled, depth-first execution starts again. We say that a superstep that executes a block of size smaller than k2·Q is a partial superstep, and a block of size greater than k2·Q is a full block.

The following lemma is obvious from Claim 3.

Lemma 1. There are at most h partial supersteps during the computation.

Theorem 3. The running time is Θ(n/Q + h) for a computation tree with n nodes and depth h, which is asymptotically optimal for any scheduler.

Proof. There are at most n/(k2·Q) full supersteps and at most h partial supersteps; thus, we have at most n/(k2·Q) + h incomplete steps by Claim 1. Adding the bound of n/Q for complete steps (by Claim 2) gives us the total bound. This is asymptotically optimal due to the lower bounds mentioned in the preliminaries.

Note that, unlike the previous strategies, the execution time of the restart strategy does not depend on k or k2; therefore, we can make the block size as small as Q and still get linear speedup.

4.3 Work-Stealing with Restart

We now bound the running time of the parallel work-stealing strategy with restart, as described in Section 3.4.

We can first bound the number of partial supersteps by observing that a partial block is only executed as part of a breadth-first execution, which only occurs either at the beginning of the computation or after a steal attempt.

Lemma 2. The number of partial supersteps is at most h + O(S), where S is the total number of steal attempts.

For the purposes of this proof, we will assume that a steal attempt takes 1 unit of time. The proof can be generalized so that a steal attempt takes c time for any constant c, but for expository purposes, we stick to the simple assumption. We can then bound the total completion time in terms of the number of steal attempts:

Lemma 3. The completion time of the computation is at most O(n/(QP) + S/P).

Proof. Each step is either an incomplete step, a complete step, or a steal attempt. There are a total of O(n/Q) complete steps by Claim 2. Each full superstep has at least k2 complete steps. Combining this fact with Lemma 2, we can conclude that there are at most O(n/(k2·Q) + S) supersteps. By Claim 1, we can bound the number of incomplete steps by the same number. Since there are P total processors, we can add the number of complete steps, the number of incomplete steps, and the number of steal attempts and divide by P to get a bound on the completion time.

From the above lemma, we can see that the crux of this proof will depend on bounding the total number of steal attempts S. We will bound the number of steal attempts using an argument very similar to the one used by Arora et al. [1] (henceforth referred to as ABP). Similar to that paper, we will define a potential function on blocks which decreases geometrically with the depth of the block (in ABP, the potential function is on nodes).

For any block b, we define d(b) as the depth of the block, which is equal to the depth of all nodes in the block. We define w(b) = h − d(b). We can now define various potentials:

• The potential of a full ready block is 4^(3w(b)); the potential of a partial ready block is 4^(3w(b)−1). The potential of a block that is executing is 4^(3w(b)−1).

• The potential of any processor is the total potential of all the blocks on its deque added to the potential of the block it is executing (if any).

• The potential of the computation is the sum of the potentials of all processors.

Like the ABP proof, we show that the potential never increases and that a certain number of steal attempts decrease the potential by a constant factor, allowing us to bound the number of steal attempts. However, the proof also differs for the following reasons: 1) Unlike standard work-stealing, where there is one node at each depth on the deque, we can have up to two blocks at each depth, necessitating a change in the potential function. 2) In standard work-stealing, a worker only steals when its deque is empty; here it may steal even if its deque contains partial blocks. 3) In work-stealing, all the nodes on a processor's deque have a larger potential than the one that the worker is executing; this may not be the case for our scheduler. 4) Each node takes unit time to execute in the ABP model, while our blocks can take up to k time steps to finish executing; this changes the bound on the number of steal attempts.

The following three lemmas prove that the potential never increases, and that steal attempts reduce the potential by a significant amount.

Lemma 4. The initial potential is 4^(3h) and the final potential is 0. The potential never increases during the computation.

Proof. Potential changes due to the following reasons:

1. A block starts executing: The potential decreases, since an executing block has a lower potential than a ready block.

2. An executing block with weight w finishes executing: This block generates at most 2 ready blocks with lower weight. The potential decreases the least when both are full blocks. In this case, the change in potential is 4^(3w−2) − 2 × 4^(3(w−1)) ≥ 0.

3. Two blocks with weight w merge: If the result of the merge is a partial block, the potential obviously decreases. If the result is a full block, it immediately starts executing. Therefore, the potential also decreases.

Lemma 5. (Top-Heavy Deques Lemma) The heaviest block on any processor is either at the top of the processor's deque or it is executing. In addition, this heaviest block contains at least 1/3 of the total potential of the processor.

Proof. By construction, the deque is ordered by depth and the weight decreases as depth increases. Therefore, the heaviest block has to be at the top of the deque or executing.

Let x be the heaviest block (if there are two, we pick one arbitrarily). The potential of this block is at least Φ(x) = 4^(3w−1) (if it is executing or partial); otherwise it is higher. If there is another block at the same depth, its potential is also 4^(3w−1), since it must be a partial block. Other than this, there may be 2 blocks at each weight lower than w, one full and one partial. Therefore, the total potential of the processor is

Φ(p) ≤ 2 × 4^(3w−1) + Σ_{y=1}^{w−1} (4^(3(y−1)) + 4^(3(y−1)−1)) ≤ 3 × 4^(3w−1) = 3Φ(x).

Thus, the heaviest block has at least a third of the potential.

Lemma 6. Between time t and time t′, if there are at least kP steal attempts, then Pr{Φ_t − Φ_{t′} ≥ (1/24)Φ_t} ≥ 1/4.

Proof. We divide the potential of the computation into two parts: Φ(D_t), the potential of the workers whose heaviest block is on the deque, and Φ(A_t), the potential of the workers whose heaviest block is executing at time t.

First, let us consider a processor p whose potential is in Φ(A_t), and let us say that the currently executing block is x. From Lemma 5, we know that Φ(x) = 4^(3w(x)−1) ≥ (1/3)Φ(p). For kP steal attempts to occur, at least k time steps must have passed (since each steal attempt takes 1 time step). Therefore, by time step t′, x has finished executing. In the worst case, it enables two full blocks with a cumulative potential of 2 × 4^(3(w(x)−1)) = (1/8)Φ(x). Therefore, the potential of p reduces by at least 7/8 × 1/3 = 7/24.

Now consider the processors whose potential is in Φ(D_t), and say p is one of these processors. If p is the victim of a steal attempt in this time interval, then the heaviest block x will be stolen and assigned. From Lemma 5, we know that Φ(x) ≥ (1/3)Φ(p). Once x is assigned, its potential drops by a factor of 1/4. Therefore, the potential that used to be on p has reduced by a factor of 1/12.

We use Lemma 7 (Balls into Bins) from ABP to conclude that with probability 1/4, after P steal attempts, Φ(D_t) reduces by at least a factor of 1/24.

Lemma 7. The expected number of steal attempts is O(kPh). In addition, the number of steal attempts is O(kPh + P lg(1/ε)) with probability at least 1 − ε.

Proof. The initial potential is 4^(3h) and the final potential is 0. From Lemma 6, we conclude that after O(kP) steal attempts, the potential decreases by a constant factor in expectation. Therefore, the expected number of steal attempts is O(kPh). The high-probability bound can be derived by applying Chernoff bounds.

The final theorem is obvious from Lemmas 3 and 7.

Theorem 4. The running time is O(n/(QP) + kh) in expectation, and O(n/(QP) + kh + lg(1/ε)) with probability at least 1 − ε.


Corollary 1. Since n/(QP) and h are lower bounds on the completion time, the restart mechanism provides an asymptotically optimal completion time.

4.4 Space Bounds and Discussion

Since each worker can have at most 2 blocks at each depth level in its deque, we get the following space bound for both restart and re-expansion.

Lemma 8. The total space used is O(hkQP).

We can now compare the different strategies in terms of the performance they provide. First, let us consider the strategies on a single processor. The above theorems hold for all values of h = lg n + ε. However, different strategies require different block sizes in order to provide speedup. Here we look at the implications of various block size values on the speedup provided by the various strategies. Since the restart strategy always provides optimal speedup, we only compare the strategies with and without re-expansion. In addition, as mentioned earlier, there is no advantage to making k1 small, so for the following discussion we take k1 ≈ k.

1. If k > 2^(h−lg n), all schemes give an asymptotically optimal parallelization.

2. If k ≤ 2^(h−lg n) but k > 2^(h−lg n)/Q, then we see very little speedup without re-expansion. With re-expansion, we get optimal speedup if k ≥ lg Q (a reasonable assumption since k is already so large).

3. For smaller k, where k > h − lg n, we get no speedup without re-expansion, but continue to get speedup with re-expansion as long as k > lg Q.

4. For smaller k, re-expansion provides some speedup until k reaches (h − lg n)/Q. After that, it provides no speedup.

Therefore, while the basic strategy requires very large block sizes to provide speedup, re-expansion can provide speedup with moderate block sizes, which nevertheless increase with the height of the tree.

We also implement parallelism using both restart and re-expansion. However, we only prove the bounds for the restart strategy in this section, since that is the more complicated bound. We can use a similar technique to prove the following bound for re-expansion: O(((h − lg n − lg k)/k + 1) · n/(PQ) + kh) (assuming we make k1 as large as we can). However, while with restart we can set k = 1 to get asymptotically optimal speedup, this is not possible for re-expansion. As we can see from the bounds, there is a trade-off between multicore parallelism and SIMD parallelism. Larger k reduces the first term and increases the second term—thereby increasing SIMD parallelism at the cost of reducing multicore parallelism. Smaller k does the opposite. As for space, for a given k, both strategies have the same bound. However, since restart can provide linear speedup at smaller block sizes, it may use less space for the same performance.

void c_f(Node p, Node r, float d)
  x = r.c_x - p.c_x
  /* similarly y and z */
  dr = x*x + y*y + z*z
  if (dr >= d || (r.isLeaf && p != r))
    // update p using reduction
  else
    for (Node c : r.children)
      if (c != null)  // task parallelism
        spawn c_f(p, c, d/4)

void comp_f_all(ps, root)
  foreach (Node p : ps)  // data parallelism
    c_f(p, root, d)

Figure 2: Barnes-Hut, with data and task parallelism

5. Extended language specification

To evaluate our scheduling framework, we can simply target the same type of recursive, task-parallel programs that Ren et al.'s approach targets. However, we also want to validate our schedulers for more general computation trees. To do so, we extend Ren et al.'s specification language to target a broader class of applications: those that combine data- and task-parallelism by nesting task-parallel function calls inside data-parallel loops. We then extend Ren et al.'s transformation so that programs written in this extended specification language generate the TaskBlocks required by our scheduling framework.

5.1 Mixed data- and task-parallel programs

Ren et al.'s model captures classical programs such as those for solving n-queens or min-max problems. However, many interesting task-parallel programs have a different pattern. In particular, in many domains, the application performs task-parallel work for each item in a set of data; this is often expressed as a data-parallel loop with each iteration consisting of task-parallel work.

For instance, consider the Barnes-Hut simulation algorithm (Figure 2) [2]. In it, a series of bodies in space (typically, stars in a galaxy) are placed into an octree. Each body then traverses the octree to compute the gravitational forces exerted by each of the other bodies. At its heart, Barnes-Hut consists of a data-parallel loop: "for each body in a set of bodies, compute the force." However, the force computation can be parallelized using task parallelism by visiting subtrees of the octree in parallel [16]. Many other benchmarks follow this general pattern [8, 14].

Prior work on vectorizing such applications, including Barnes-Hut [8] and decision-tree traversal [14], has focused on the data-parallel portion of the computation and does not exploit the task parallelism. Moreover, these attempts have relied on special-purpose transformations targeted specifically at data-parallel traversal applications, rather than general data-parallel applications with embedded task parallelism.


5.2 Extending the language

Ren et al.'s work, and hence their language specification, does not admit task-parallelism nested within data-parallelism: it targets single recursive calls. We can extend the specification language to admit two types of programs: a single recursive method, as before, or a data-parallel loop enclosing a recursive method; this simple extension to the language allows us to express programs such as Barnes-Hut.

foreach (d : data)
  f(d, p1, ..., pk) = if eb then sb else si

Note that the non-recursive functions from Ren et al.'s language specification—such as baseCase and isBase from Figure 1—can include data-parallel loops. This means that our modified language supports programs that nest data-parallelism inside task-parallelism inside data-parallelism.

5.3 Extending the transformation

For these programs, Ren et al.'s transformation framework treats each iteration of the outer data parallel loop as a single task-parallel program and hence executes the data-parallel outer loop sequentially. Therefore, only task-parallelism is exploited, not data-parallelism (the opposite problem of that faced by prior work [8, 14]).

To exploit both data- and task-parallelism, we can readily extend the transformation described in Section 2.1. Note that the transformed program in Figure 1(b) operates on a block of tasks. The initial call to this code consists of a TaskBlock with a single task in it, representing the initial call to the recursive method. When transforming data-parallel code enclosing task-parallel code, we can iterate over the data parallel work to construct multiple tasks, one for each iteration of the data-parallel loop. These tasks can then be placed into a TaskBlock, which is passed to the initial breadth-first, task-parallel code. If the initial TaskBlock is too big (i.e., if the data-parallel loop has too many iterations), strip mining can be used to create smaller TaskBlocks that are then sequentially passed to the vectorized, task-parallel code.
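A hypothetical sketch of this initial-block construction (ours; Item, makeTask, and maxBlock are placeholder names, and bfs_foo stands for the blocked breadth-first code of Figure 1(b)):

#include <algorithm>
#include <cstddef>
#include <vector>

struct Task { int payload; };
using TaskBlock = std::vector<Task>;
struct Item { int value; };                     // one element of the data-parallel loop

Task makeTask(const Item& d) { return Task{d.value}; }
void bfs_foo(TaskBlock tb) { /* blocked task-parallel code, as in Figure 1(b) */ }

// Each iteration of the outer data-parallel loop becomes one task; strip mining
// caps the size of the initial TaskBlock handed to the vectorized code.
void run_data_parallel(const std::vector<Item>& data, std::size_t maxBlock) {
  for (std::size_t start = 0; start < data.size(); start += maxBlock) {
    std::size_t end = std::min(start + maxBlock, data.size());
    TaskBlock tb;
    tb.reserve(end - start);
    for (std::size_t i = start; i < end; ++i)
      tb.push_back(makeTask(data[i]));
    bfs_foo(tb);                                // strips are processed one after another
  }
}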

If there are data-parallel loops within the non-recursive functions of the task-parallel body (i.e., if a single task has a nested data-parallel loop), the iterations of the data-parallel loops are placed into a TaskBlock generated from the enclosing task. Because these data-parallel loops do not have any further nesting, no further transformations are necessary to handle this additional level of parallelism.

6. Implementation

In this section, we present our Cilk implementation—using spawn and sync—of the re-expansion and restart strategies to schedule recursive task-parallel programs.

Vectorization: We do not discuss the specifics of SIMDization in this paper. At a high level, the data parallel loops over task blocks can be vectorized using the same techniques as in Ren et al.'s work [15], such as AoS-to-SoA transformation, together with auto-vectorization and SIMD intrinsics when the auto-vectorizer fails; the process of adding new tasks to blocks can be vectorized using stream compaction.
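As a scalar illustration of the compaction step (ours, not the paper's SIMD code; a vector version would compute output offsets from a lane mask with a prefix sum), newly spawned tasks are appended densely so blocks stay contiguous even when some lanes spawn nothing:

#include <vector>

struct Task { int payload; bool spawns; };
using TaskBlock = std::vector<Task>;

TaskBlock compactChildren(const TaskBlock& tb) {
  TaskBlock next;
  next.reserve(tb.size());
  for (const Task& t : tb)
    if (t.spawns)                                // lane mask: does this task spawn a child?
      next.push_back(Task{t.payload - 1, t.payload > 2});   // dense (compacted) append
  return next;
}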

Re-expansion: Figure 3(a) shows the parallel, blocked re-expansion version of the example in Figure 1(a). If re-expansion is triggered, the left and right TaskBlocks are merged together. Otherwise, they are executed independently. This code demonstrates that re-expansion can be naturally expressed as a Cilk program by marking recursive function calls (blocked_foo_reexp) in lines 13, 15, and 16 as concurrent using the spawn keyword. Other aspects of TaskBlock management, such as the stack of task blocks, are handled by the default Cilk runtime.

Restart: Figure 3(b) shows the parallel, blocked restart execution for the same example according to the strategy explained in Section 3.4. This figure illustrates an ideal implementation. Each thread maintains a deque of TaskBlocks and RestartBlocks (tbs and rbs in the figure), with at most one task block and restart block per level in the computation tree. Until termination, each thread steals a TaskBlock or RestartBlock (lines 2–4), places it in a TaskBlock at the corresponding level in its local deque, and processes it recursively until no level has a non-empty TaskBlock. At the end of this working phase, a worker's local deque only consists of restart blocks. For conciseness, we do not show the BFE step on a steal as explained in Section 3.3. While efficient, this strategy does not naturally map to Cilk-like programming models. A particular challenge of the strategy in Figure 3(b) is that both the continuation of the TaskBlock being executed and the RestartBlock are pushed to the deque and made available for stealing. Rather than modifying the implementation of the spawn-sync constructs in Cilk, we use a simpler strategy that does not guarantee the same space or time bounds, but nevertheless provides good performance.

Simplified restart: The implementation of our simplified restart strategy is illustrated in Figure 3(c). Rather than maintaining the restart blocks in the deque, they are maintained in an explicitly managed RestartBlock stack. The restart stack is implemented as a RestartBlock linked list, consisting of an array of tasks at the current level and a pointer to the RestartBlock at the next level. The blocked recursive function (blocked_foo_restart) takes a TaskBlock stack and a RestartBlock stack as input and returns a RestartBlock stack. If the total number of tasks in the TaskBlock and RestartBlock is less than the restart threshold, the tasks in the TaskBlock are moved into the RestartBlock, which is returned. Otherwise, we fill up the TaskBlock with tasks from the RestartBlock and spawn the TaskBlock for the next level. This spawn only exposes the task block's continuation—not the restart stack—for stealing by other worker threads. The returned RestartBlock stacks from all spawned instances of the next level are merged into one RestartBlock stack (merge(rleft, rright)), which is then returned. The merge function is also implemented as a blocked function that recursively merges the input restart stacks, and invokes blocked_foo_restart if the number of tasks at a level exceeds the restart threshold.


 1 void blocked_foo_reexp(TaskBlock tb)
 2   TaskBlock left, right
 3   foreach (Task t : tb)
 4     if (isBase(t.x))
 5       baseCase()
 6     else
 7       l1 = inductiveWork1(t.x)
 8       left.add(new Task(l1))
 9       l2 = inductiveWork2(t.x)
10       right.add(new Task(l2))
11   if (/* do re-expansion */)
12     left.merge(right)
13     spawn blocked_foo_reexp(left)
14   else /* depth-first execution */
15     spawn blocked_foo_reexp(left)
16     spawn blocked_foo_reexp(right)

(a) Re-expansion

 1 /* tbs/rbs: task/restart block stack */
 2 while (not done)
 3   tb = steal(); lvl = tb.lvl; tbs[lvl] = tb
 4   blocked_foo_restart(lvl)
 5 void blocked_foo_restart(int lvl)
 6   if (tbs[lvl].size + rbs[lvl].size < t_restart)
 7     /* empty tasks in tbs[lvl] into rbs[lvl] */
 8     return
 9   /* fill tbs[lvl] with tasks from rbs[lvl] */
10   TaskBlock left, right
11   foreach (Task t : tb)
12     if (isBase(t.x)) baseCase()
13     else left.add(new Task(inductiveWork1(t.x)))
14   tbs[lvl+1] = left; blocked_foo_restart(lvl+1)
15   foreach (Task t : tb)
16     if (not isBase(t.x))
17       right.add(new Task(inductiveWork2(t.x)))
18   tbs[lvl+1] = right; blocked_foo_restart(lvl+1)

(b) Ideal restart

 1 RestartBlock blocked_foo_restart(TaskBlock tb, RestartBlock rb)
 2   if (tb.size + rb.size < t_restart)
 3     /* move tasks from tb into rb */
 4     return rb
 5   /* fill tb with tasks from rb */
 6   TaskBlock left, right
 7   foreach (Task t : tb)
 8     /* Figure 3(a), lines 4-10 */
 9   RestartBlock rleft, rright
10   rleft = spawn blocked_foo_restart(left, rb.next)
11   rright = spawn blocked_foo_restart(right, NIL)
12   sync
13   rb.next = spawn merge(rleft, rright)
14   sync
15   return rb

(c) Simplified restart

Figure 3: Pseudocode demonstrating blocked, parallel, vectorized execution with different scheduling strategies

Frequent restart stack merges may be expensive. To minimize this cost, we introduce an optimization to directly pass the restart stack from one spawn to the next if no steal happened. As shown in Figure 3(c), lines 9–13, in the normal case, each spawn takes as input a distinct restart stack and returns a distinct restart stack. These restart stacks are merged together after the sync. To reduce the cost, we test whether a steal immediately preceded the given spawn statement. If not, the previous spawn's output restart stack is provided as the input restart stack to this spawn, and the merge operation between them is eliminated. In terms of the pseudocode, the spawn in the blocked_foo_restart function (line 11) is replaced by the following:

1 if NO_INTERVENING_STEAL:
2   rright = spawn blocked_foo_restart(right, rb.next)
3 else:
4   rright = spawn blocked_foo_restart(right, NIL)
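How the NO_INTERVENING_STEAL test is realized is runtime-specific; MIT Cilk, for instance, provides a SYNCHED pseudo-variable that reports whether any spawned child is still outstanding, which can serve this purpose. The sketch below illustrates an alternative, purely hypothetical mechanism: a per-worker steal counter sampled just before the previous spawn and checked afterwards (the counter, hook, and function names are assumptions, not part of the paper's implementation).

#include <cstdint>

// Hypothetical per-worker counter, bumped by the scheduler on every
// successful steal from this worker's deque.
thread_local std::uint64_t g_steal_count = 0;

// Called by the (hypothetical) scheduler when a thief steals from this worker.
void on_steal_from_me() { ++g_steal_count; }

// True if no steal occurred between 'before' (sampled just before the previous
// spawn) and now; in that case the previous spawn ran sequentially, so its
// output restart stack can feed the next spawn directly and the merge is skipped.
bool no_intervening_steal(std::uint64_t before) {
    return g_steal_count == before;
}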

Limitations of simplified restart: As shown in Figure 3(c), the simplified restart strategy can be directly translated into a Cilk program. While simpler to implement than the restart strategy formulated in Section 4.3, this simplified strategy suffers from significant limitations. These arise from the fact that the RestartBlock stack returned by the function is not available for further processing until all work preceding the immediately enclosing sync completes. We briefly discuss these limitations, but do not present concrete proofs due to space limitations. First, passing restart stacks through the return path can stash away an important source of concurrency, leaving it unavailable for execution by idle workers; in the worst case, this can lead to a fully serialized execution. Second, consider an execution of a TaskBlock at depth h. At each level 0 through h-1, a restart stack can be suspended, requiring a total of h^2 * t_restart space, which is worse than the space overhead of the ideal restart strategy. Despite these limitations, we observe in Section 7 that this restart strategy achieves competitive performance for the benchmarks considered.
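One plausible accounting behind the h^2 * t_restart figure (our reading, not a proof from the paper): if a restart stack suspended at level i can span levels i through h-1, holding up to t_restart tasks per level, then summing over the h levels at which a stack may be suspended gives

\sum_{i=0}^{h-1} (h-i)\, t_{\mathrm{restart}} \;=\; \frac{h(h+1)}{2}\, t_{\mathrm{restart}} \;=\; O\!\left(h^2\, t_{\mathrm{restart}}\right).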

7. Experimental Evaluation

We conduct our experiments on an 8-core, 2.6 GHz Intel E5-2670 CPU with 32 KB L1 cache per core, a 20 MB last-level cache, and the 128-bit SSE 4.2 instruction set. The input recursive program and the re-expansion (reexp) and restart (restart) variants are parallelized as Cilk programs and compiled using MIT Cilk [5] (the non-parallelized version of reexp is, effectively, Ren et al.'s vectorization strategy [15]). All program variants are compiled using the Intel icc-13.3.163 compiler with -O3 optimization. We report the mean of 10 runs; we do not report the negligible standard deviation observed, due to space constraints. We observed little difference in performance between enabling and disabling compiler vectorization for the input program, despite some loops getting vectorized. We use the execution times of the sequential runs with compiler auto-vectorization as the baseline.

Table 1 lists the basic features of the benchmarks: problem input, number of levels in the computation tree, and total number of tasks. The sequential version of each benchmark is obtained by removing the Cilk keywords (spawn and sync). The execution time of this sequential version (Ts) serves as the baseline for all evaluation in this section. The benchmarks have varying computation tree structures: knapsack is a perfectly balanced tree with all base-case tasks at the last level. fib, binomial, parentheses, Point correlation, and knn (k-nearest neighbor) are unbalanced binary trees with varying numbers of base-case tasks in the middle levels. All other benchmarks are more unbalanced, with larger fan-outs. uts is a binomial tree and is much deeper than nqueens, graphcol, minmax, and Barnes-Hut (a more detailed characterization of the benchmarks can be found in [15]).

The benchmarks mix task parallelism and data parallelism in various ways: knapsack, fib, parentheses, uts, binomial, and minmax exhibit only task parallelism, in the form of recursion; nqueens and graphcol have nested data parallelism within their outer task-parallel recursions; Barnes-Hut has a for-loop outside the recursive function calls, i.e., task parallelism nested in data parallelism; and Point correlation and knn are more complex, with three levels of parallelism: data-parallel base cases nested within task-parallel recursion, which is, in turn, nested within data-parallel outer loops.




Benchmark    | Problem  | #Levels | #Tasks | Ts(s) | T1(s) | T16(s) | Block size | RB size | Ts/T1 | Ts/T1x | Ts/T1r | Ts/T16 | Ts/T16x | Ts/T16r
knapsack     | long     | 31      | 2.15B  | 9     | 61    | 3.9    | 2^12       | any     | 0.14  | 1.68   | 1.69   | 2.2    | 24.8    | 24.6
fib          | 45       | 45      | 3.67B  | 9     | 99    | 6.7    | 2^14       | 4096    | 0.09  | 1.57   | 1.58   | 1.3    | 22.9    | 22.7
parentheses  | 19       | 37      | 4.85B  | 10    | 134   | 8.8    | 2^13       | 4607    | 0.08  | 1.42   | 1.40   | 1.2    | 20.3    | 19.4
nqueens      | 15       | 16      | 168M   | 233   | 267   | 16.7   | 2^12       | 2040    | 0.89  | 4.00   | 4.08   | 14.2   | 61.6    | 60.9
graphcol     | 3(38-64) | 39      | 42.4M  | 31    | 37    | 2.4    | 2^10       | 473     | 0.85  | 8.74   | 8.78   | 13.0   | 108     | 107
uts          | 30       | 228     | 19.9M  | 146   | 152   | 9.6    | 2^11       | 2047    | 0.97  | 1.58   | 1.59   | 15.4   | 23.0    | 23.8
binomial     | C(36,13) | 36      | 4.62B  | 8     | 126   | 8.1    | 2^13       | 4096    | 0.07  | 1.00   | 0.97   | 1.0    | 14.9    | 14.1
minmax       | 4×4      | 13      | 2.42B  | 19    | 58    | 3.7    | 2^10       | 32767   | 0.31  | 1.65   | 1.57   | 4.9    | 20.6    | 17.1
Barnes-Hut   | 1M       | 18      | 3.00B  | 81    | 227   | 53.8   | 2^9        | 511     | 0.35  | 1.34   | 1.48   | 1.5    | 16.0    | 17.5
Point corr.  | 300K     | 18      | 1.77B  | 132   | 197   | 13.1   | 2^10       | 256     | 0.74  | 1.89   | 1.74   | 11.1   | 29.9    | 27.9
knn          | 100K     | 15      | 1.36B  | 71    | 105   | 7.6    | 2^9        | 128     | 0.65  | 1.24   | 1.20   | 8.9    | 19.3    | 18.4
Geo. mean    |          |         |        |       |       |        |            |         | 0.31  | 1.89   | 1.87   | 4.2    | 26.7    | 26.0

Table 1: Benchmark characteristics and performance. Ts: sequential execution time; T1: single-threaded execution time of the input Cilk version; T1x, T1r: single-threaded SIMD execution time of the re-expansion and restart versions, respectively; T16, T16x, T16r: execution time of the input, re-expansion, and restart versions on 16 workers, respectively. All re-expansion and restart implementations use 16-wide vector operations, except 8-wide for knapsack and 4-wide for uts, Barnes-Hut, Point correlation, and knn. Block size: best block size for re-expansion and restart, except 16-worker nqueens (2^11), graphcol (2^9), and minmax (2^15 for re-expansion and 2^9 for restart). RB size: restart block size used by the restart versions.

            | scalar | reexp Block | reexp SOA | reexp SIMD | restart Block | restart SOA | restart SIMD
1-worker    | 0.3    | 0.5         | 0.6       | 1.9        | 0.5           | 0.6         | 1.9
16-worker   | 4.2    | 6.4         | 9.5       | 26.7       | 8.2           | 9.3         | 26.0
Scalability | 13.6   | 11.7        | 15.3      | 14.2       | 16.2          | 15.2        | 13.9

Table 2: Geometric mean of speedup with respect to Ts for implementation variants on 1 and 16 workers. scalar: input Cilk program; Block: blocked using the re-expansion or restart strategy; SOA: blocked version transformed to struct-of-arrays layout to enable vectorization; SIMD: SIMD-vectorized version of the benchmark in SOA form. Scalability shows the relative performance of the same implementation running on 16 workers as compared to 1 worker.


To improve vectorization potential, we use the smallest possible data type, without affecting generality, for each benchmark: knapsack and uts use short and int, respectively; Barnes-Hut, Point correlation, and knn use float; the other benchmarks use char.
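For reference, with the 128-bit SSE vectors on the evaluation machine, the number of SIMD lanes is simply the register width divided by the element width, which is why shrinking the element type matters. A trivial C++ illustration (the constant and helper below are ours, not the paper's):

#include <cstdio>

constexpr int kVectorBits = 128;  // SSE 4.2 register width on the evaluation machine

// Number of SIMD lanes available for elements of type T.
template <typename T>
constexpr int lanes() { return kVectorBits / (8 * static_cast<int>(sizeof(T))); }

int main() {
    std::printf("char: %d lanes, short: %d, int/float: %d\n",
                lanes<char>(), lanes<short>(), lanes<int>());
    // Prints 16, 8, and 4, matching the 16-, 8-, and 4-wide vector
    // operations reported in the caption of Table 1.
    return 0;
}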

7.1 Speedup from Blocked SIMD Execution

Table 1 shows the effectiveness of the various strategies at exploiting task and data parallelism on both a single core (1-worker) and multiple cores (16-worker). Even on one thread, the re-expansion and restart strategies, coupled with SIMD vectorization, improve over sequential execution times (1.8× on average). Combining compiler vectorization with Cilk parallelization achieves a speedup of 4.2× (average) over sequential execution. Re-expansion and restart significantly improve upon this default parallelization, achieving 26× (average) speedup over Ts. This shows the overall efficacy of the re-expansion and restart strategies, which exhibit similar performance across all the benchmarks considered. For restart, we select the optimal t_restart threshold, shown as RB size in the table.

Our speedup comes primarily from three sources: first, the reduction of recursive function calls by moving from pure task parallelism to a combination of task and data parallelism; second, the vectorization optimization and the associated data layout transformations, such as AoS to SoA; and third, good scalability from one worker to multiple workers. The first effect is important for kernels with little sequential work: in these kernels, Cilk's recursive scheduling incurs relatively large overheads, and our blocked execution reduces this overhead because task blocks coarsen the granularity of spawned tasks. This is also why some benchmarks achieve more than SIMD-width speedup in their 1-worker versions. Table 2 shows the relative speedup of each transformation applied to the input benchmarks: from the input version to the blocked version, to the layout-transformed struct-of-arrays (SOA) form that enables SIMD vectorization, and finally to the vectorized version.
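As a reminder of what the AoS-to-SoA layout transformation buys, here is a hedged C++ sketch (the field names are illustrative, not the paper's task layout): in array-of-structs form the same field of consecutive tasks is strided in memory, whereas the struct-of-arrays form keeps each field contiguous, so one SIMD step over consecutive tasks becomes a unit-stride vector load.

#include <vector>

// Array-of-structs: one Task object per entry; the 'x' fields of consecutive
// tasks are strided in memory, so vectorizing across tasks needs gathers.
struct TaskAoS { int x; int depth; };
using TaskBlockAoS = std::vector<TaskAoS>;

// Struct-of-arrays: each field of the task block is a contiguous array, so a
// SIMD step over W consecutive tasks is a single contiguous load per field.
struct TaskBlockSoA {
    std::vector<int> x;
    std::vector<int> depth;
};

// Layout transformation applied to a block before vectorized execution.
TaskBlockSoA to_soa(const TaskBlockAoS& aos) {
    TaskBlockSoA soa;
    soa.x.reserve(aos.size());
    soa.depth.reserve(aos.size());
    for (const TaskAoS& t : aos) {
        soa.x.push_back(t.x);
        soa.depth.push_back(t.depth);
    }
    return soa;
}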



[Figure 4: SIMD utilization for (a) nqueens, (b) graphcol, (c) uts, (d) minmax, (e) Barnes-Hut, and (f) Point correlation, plotted for reexp and restart. x-axis: block size (2^0 to 2^16); y-axis: percentage of tasks that can be vectorized. knn is identical to Point correlation.]

For both the re-expansion and restart strategies, the SOA transformation to enable vectorization improves performance on top of blocking, and we observe even greater improvements from vectorizing this layout-transformed version. In general, the second and third effects are more important than the first. Therefore, in the following sections we evaluate the impact the reexp and restart strategies have on vectorization efficiency and scalability.

7.2 Impact of Block Size on SIMD Utilization

Increasing the block size can expose more work to be executed by the vector unit, increasing SIMD utilization (the ratio of complete SIMD steps to total steps during execution). However, larger block sizes consume more space and increase cache pressure. A good scheduling strategy should achieve good SIMD utilization with small block sizes.
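For concreteness, SIMD utilization as defined above can be tallied from a trace of per-step lane occupancies; a minimal C++ sketch (the function and its inputs are illustrative, not the paper's instrumentation):

#include <vector>

// Given the number of active tasks in each SIMD step and the vector width W,
// return the fraction of steps that ran with all W lanes occupied.
double simd_utilization(const std::vector<int>& active_per_step, int W) {
    if (active_per_step.empty()) return 0.0;
    int full_steps = 0;
    for (int active : active_per_step)
        if (active == W) ++full_steps;
    return static_cast<double>(full_steps) / active_per_step.size();
}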

Figure 4 shows the SIMD utilization of reexp and restart at various block sizes (for space reasons, we present results only for the larger benchmarks). SIMD utilization grows with increasing block size and, consistent with the theoretical analysis, at each block size restart matches or exceeds the SIMD utilization achieved by reexp. For smaller block sizes, restart performs better than reexp for all benchmarks except knapsack, Barnes-Hut, Point correlation, and knn, where both strategies achieve similar improvements. For benchmarks like graphcol and uts, restart can achieve > 90% SIMD utilization at a block size of 2^4, while reexp requires 2^7 and 2^5, respectively. For benchmarks such as nqueens and minmax, SIMD utilization improves more slowly for both reexp and restart; however, SIMD utilization for restart continues to exceed that of reexp at smaller block sizes. For Barnes-Hut and Point correlation, restart and reexp achieve identical SIMD utilization at all block sizes considered.

7.3 Scalability of Blocked SIMD Execution

The scalability for the best block size is shown in Table 1. At the best block size, reexp and restart achieve comparable speedups for most benchmarks, with performance differences within 5%.

Given the difference in SIMD utilization between re-expansion and restart at smaller block sizes, Figure 5 shows the scalability of reexp and restart with various numbers of Cilk workers for a small block size (2^5). For this small block size, we find that restart performs better when it can improve the SIMD utilization (as for nqueens, graphcol, uts, and minmax).

[Figure 5: Xeon E5 scalability for block size 2^5, for (a) graphcol, (b) uts, (c) minmax, (d) Barnes-Hut, (e) Point correlation, and (f) knn, plotted for scalar, reexp, and restart. x-axis: number of workers (0 to 16); y-axis: speedup relative to the 1-worker Cilk baseline.]

Otherwise, restart may result in some slowdown due to its implementation complexity and the cost of manipulating the restart stacks. For nqueens and minmax, restart can provide a reasonable speedup whereas reexp still results in some slowdown relative to the non-SIMD version. For benchmarks such as Barnes-Hut, Point correlation, and knn, the relatively poor scalability from 8 workers to 16 workers is due to interference between hyper-threaded workers executing on the same core of the 8-core system.

In summary, both scheduling strategies have their advantages: reexp is easier to implement and has very low scheduling overhead, while restart has good theoretical bounds and can improve SIMD utilization at smaller block sizes. With either scheduling method, we can transform our candidate codes to achieve good speedup.

8. Related Work

In many programming systems, parallelizing recursive programs involves user identification of concurrent function invocations, coupled with a runtime system that dynamically maps those invocations to the processing cores. Variants of the spawn-sync construct used in this paper constitute the most common subset of many such programming systems (Cilk [3, 4], Threading Building Blocks [13], OpenMP [11], X10 [18], etc.). These programming systems also allow programmers to write programs that mix task- and data-parallelism, and are not necessarily tied to SIMD architectures, so they are more general. However, they focus on mapping that parallelism to multicores rather than tackling SIMD; thus, they typically decouple vectorization considerations from multicore parallelism and attempt to use SIMD instructions primarily in the base case.



These approaches hence exploit SIMD less thoroughly than ours, which vectorizes the recursive steps in addition to the base cases, taking advantage of SIMD at all steps of the computation.

Intel's ispc [12] is another existing effort closely related to our work. It targets programs that have task- and data-parallelism, and also attempts to exploit SIMD units in addition to multicores. However, its handling of nested task- and data-parallelism differs substantially from ours. In particular, given a data-parallel for loop, different iterations are mapped to individual lanes of a SIMD unit. If that for loop contains nested task parallelism, that task parallelism is not exploited: the entire iteration is executed within a single SIMD lane, underexploiting parallelism, causing divergent execution, and incurring unnecessary gather and scatter memory operations among the SIMD lanes.

Optimizations to reduce recursion overhead include partial recursive call inlining [17] and transforming the program into a loop program [9]. Partial inlining indirectly helps vectorization by adding more instructions to the inductive-work basic block, but often not enough to keep wide SIMD units busy.

Similar to our approach, Jo et al. [7] optimized tree traversal algorithms by vectorizing across multiple root-to-leaf traversals; the nodes of the tree, rather than the call stack, are used to track execution progress. However, their work did not consider multicore parallelization. Ren et al. [15] presented the re-expansion strategy to vectorize recursive programs. However, as discussed in the introduction, they provide no theoretical analysis of the benefits of their strategy, do not deal with multicore execution, and do not consider the restart strategy.

9. Conclusions

We presented a unified approach for executing programs that exhibit both task- and data-parallelism on systems that contain both vector units and multicores. Rather than "baking in" execution strategies based on programming models (e.g., converting data parallelism to task parallelism in Cilk [6], or vice versa [15], or choosing to exploit only one type of parallelism in programs that contain both [8, 14]), our unified scheduling framework allows programs to execute in the manner best suited to the current context and hardware: if there is enough data parallelism at a particular point in an execution, vector units can be exploited, while if there is too much load imbalance at another point, data parallelism can be traded off for task parallelism to allow for work stealing. This grain-free execution strategy allows for the free mixing and arbitrary exploitation of both types of parallelism and both types of parallel hardware. Using our framework, we achieve 14×–108× speedup (23×–259× over 1-worker Cilk) across ten benchmarks by exploiting both SIMD units and multicores on an 8-core system.

10. Acknowledgments

The authors would like to thank our shepherd, Mary Hall, as well as the anonymous reviewers for their helpful suggestions and comments. The authors would also like to thank Youngjoon Jo for contributing to several benchmarks. This work was supported in part by the U.S. Department of Energy's (DOE) Office of Science, Office of Advanced Scientific Computing Research, under DOE Early Career awards 63823 and DE-SC0010295. This work was also supported in part by NSF awards CCF-1150013 (CAREER), CCF-1439126, CCF-1218017, and CCF-1439062. Pacific Northwest National Laboratory is operated by Battelle for DOE under Contract DE-AC05-76RL01830.

References

[1] N. S. Arora, R. D. Blumofe, and C. G. Plaxton. Thread Scheduling for Multiprogrammed Multiprocessors. In SPAA, pages 119–129, 1998.

[2] J. Barnes and P. Hut. A hierarchical O(N log N) force-calculation algorithm. Nature, 324(4):446–449, December 1986.

[3] R. D. Blumofe and C. E. Leiserson. Scheduling Multithreaded Computations by Work Stealing. J. ACM, 46(5):720–748, 1999.

[4] R. D. Blumofe, C. F. Joerg, B. C. Kuszmaul, C. E. Leiserson, K. H. Randall, and Y. Zhou. Cilk: An Efficient Multithreaded Runtime System. Journal of Parallel and Distributed Computing, 37(1):55–69, August 1996.

[5] Cilk. Cilk 5.4.6. http://supertech.csail.mit.edu/cilk/.

[6] Intel Corp. Intel Cilk Plus Language Extension Specification, 2011.

[7] Y. Jo and M. Kulkarni. Enhancing Locality for Recursive Traversals of Recursive Structures. In OOPSLA, pages 463–482, 2011.

[8] Y. Jo, M. Goldfarb, and M. Kulkarni. Automatic Vectorization of Tree Traversals. In PACT, pages 363–374, 2013.

[9] Y. A. Liu and S. D. Stoller. From Recursion to Iteration: What Are the Optimizations? In PEPM, pages 73–82, 2000.

[10] S. Maleki, Y. Gao, M. J. Garzaran, T. Wong, and D. A. Padua. An Evaluation of Vectorizing Compilers. In PACT, pages 372–382, 2011.

[11] OpenMP Architecture Review Board. OpenMP Specification and Features. http://openmp.org/wp/, May 2008.

[12] M. Pharr and W. R. Mark. ispc: A SPMD Compiler for High-performance CPU Programming. In Innovative Parallel Computing (InPar), pages 1–13. IEEE, 2012.

[13] J. Reinders. Intel Threading Building Blocks: Outfitting C++ for Multi-Core Processor Parallelism. O'Reilly, 2007.

[14] B. Ren, G. Agrawal, J. R. Larus, T. Mytkowicz, T. Poutanen, and W. Schulte. SIMD Parallelization of Applications that Traverse Irregular Data Structures. In CGO, pages 1–10. IEEE, 2013.



[15] B. Ren, Y. Jo, S. Krishnamoorthy, K. Agrawal, and M. Kulkarni. Efficient Execution of Recursive Programs on Commodity Vector Hardware. In PLDI, pages 509–520, 2015.

[16] M. Rinard and P. C. Diniz. Commutativity analysis: a new analysis technique for parallelizing compilers. ACM Trans. Program. Lang. Syst., 19(6):942–991, 1997.

[17] R. Rugina and M. C. Rinard. Recursion Unrolling for Divide and Conquer Programs. In LCPC, pages 34–48, 2000.

[18] X10. The X10 Programming Language. www.research.ibm.com/x10/, Mar. 2006.
