
Cluster Comput (2012) 15:125–144, DOI 10.1007/s10586-010-0151-6

Optimizing dataflow applications on heterogeneous environments

George Teodoro · Timothy D.R. Hartley · Umit V. Catalyurek · Renato Ferreira

Received: 20 September 2010 / Accepted: 29 December 2010 / Published online: 24 March 2011
© Springer Science+Business Media, LLC 2011

Abstract The increases in multi-core processor parallelism and in the flexibility of many-core accelerator processors, such as GPUs, have turned traditional SMP systems into hierarchical, heterogeneous computing environments. Fully exploiting these improvements in parallel system design remains an open problem. Moreover, most of the current tools for the development of parallel applications for hierarchical systems concentrate on the use of only a single processor type (e.g., accelerators) and do not coordinate several heterogeneous processors. Here, we show that making use of all of the heterogeneous computing resources can significantly improve application performance. Our approach, which consists of optimizing applications at run-time by efficiently coordinating application task execution on all available processing units, is evaluated in the context of replicated dataflow applications. The proposed techniques were developed and implemented in an integrated run-time system targeting both intra- and inter-node parallelism. The experimental results with a real-world complex biomedical application show that our approach nearly doubles the performance of the GPU-only implementation on a distributed heterogeneous accelerator cluster.

G. Teodoro (✉) · R. Ferreira
Dept. of Computer Science, Universidade Federal de Minas Gerais, Belo Horizonte, MG, Brazil
e-mail: [email protected]

R. Ferreira
e-mail: [email protected]

T.D.R. Hartley · U.V. Catalyurek
Depts. of Biomedical Informatics, and Electrical & Computer Engineering, The Ohio State University, Columbus, OH, USA

T.D.R. Hartley
e-mail: [email protected]

U.V. Catalyurek
e-mail: [email protected]

Keywords GPGPU · Run-time optimizations · Filter-stream

1 Introduction

An important current trend in computer architecture is increasing parallelism. This trend is turning traditional distributed computing resources into hierarchical systems, where each computing node may have several multi-core processors. At the same time, manufacturers of modern graphics processors (GPUs) have increased GPU programming flexibility, garnering intense interest in using GPUs for general-purpose computation; under the right circumstances, GPUs perform significantly better than CPUs. In light of these two trends, developers who want to make full use of high-performance computing (HPC) resources need to design their applications to run efficiently on distributed, hierarchical, heterogeneous environments.

In this paper, we propose techniques to efficiently execute applications on heterogeneous clusters of GPU-equipped, multi-core processing nodes. We evaluate our method in the context of the filter-stream programming model, a type of dataflow where applications are decomposed into filters that may run on multiple nodes of a distributed system. The application processing occurs in the filters, and filters communicate with one another by using unidirectional logical streams. By using such a dataflow programming model, we expose a large number of independent tasks which can be executed concurrently on multiple devices. In our filter-stream run-time system, filters are multithreaded, and can include several implementations of their processing function, in order to target different processor types.

The speedup GPUs can achieve as compared to CPUs depends on the type of computation, amount of work, input data size, and application parameters. Additionally, in the dataflow paradigm, the tasks generated to be performed by the filters to achieve the application's goals are not necessarily known prior to execution. To support these types of applications, the decision about where to run each task has to be made at run-time, when the tasks are created. Our approach assigns tasks to devices based on the relative performance of that device, with the aim of optimizing the overall execution time. While in previous work [39] we considered intra-node parallelism, in the original version of this paper [37] and here we consider both intra-node and inter-node parallelism. We also present new techniques for on-line performance estimation and handling data transfers between the CPU and the GPU. Our main contributions are:

– A relative performance estimation module for filter-stream tasks based on their input parameters;

– An algorithm for efficiently coordinating data transfers, enabling asynchronous data transfers between the CPU and the GPU;

– A novel stream communication policy for heterogeneous environments that efficiently coordinates the use of CPUs and GPUs in cluster settings.

Although it is an important and active research topic, generating code for the GPU is beyond the scope of this paper. We assume that the necessary code to run the application on both the CPU and the GPU is provided by the programmer, and we focus on the efficient coordination of the execution on heterogeneous environments, as most of the related work relegates this problem to the programmer and either focuses only on the accelerator performance or assumes the speedup of the device is constant for all tasks. For GPU programming we refer to the CUDA [23] toolkit and mention other GPU programming research based on compiler techniques [18, 26], specialized libraries [7, 12, 21], or applications [34]. Also, though we focus on GPUs and CPUs, the techniques are adaptable to multiple different devices.

The rest of the paper is organized as follows: in Sect. 2 we present our use case and motivating application. Section 3 presents Anthill, the framework used for the evaluation of the proposed techniques, and how it has been extended to target multiple devices. Section 4 presents our performance estimation approach and Sect. 5 describes the run-time optimizations. An extensive experimental evaluation is presented in Sect. 6 and the conclusions are summarized in Sect. 7.

2 Motivating application

For this work, our motivating application is the Neuroblastoma Image Analysis System (NBIA), a real-world biomedical application developed by Sertel et al. [31]; we will also use this application to evaluate the performance of our technique. Neuroblastoma is a cancer of the sympathetic nervous system affecting mostly children. The prognosis of the disease is currently determined by expert pathologists based on visual examination under a microscope of tissue slides. The slides can be classified into different prognostic groups determined by the differentiation grade of the neuroblasts, among other issues.

The process of manual examination by pathologists is error-prone and very time-consuming. Therefore, the goal of NBIA is to assist in the determination of the prognosis of the disease by classifying the digitized tissue samples into different subtypes that have prognostic significance. The focus of the application is on the classification of stromal development as either stroma-rich or stroma-poor, which is one of the morphological criteria in the disease's prognosis that contributes to the categorization of the histology as favorable and unfavorable [32].

Since the slide images can be very high resolution (over 100K × 100K pixels of 24-bit color), the first step in NBIA is to decompose the image into smaller image tiles that can be processed independently. Next, the image analysis uses a multi-resolution strategy that constructs a pyramid representation [28], with multiple copies of each image tile from the decomposition step at different resolutions. As an example, a three-resolution pyramid of an image tile could be constructed with (32 × 32), (128 × 128), and (512 × 512) images; each higher-resolution image is simply a higher-resolution version of the actual tile from the overall image. NBIA analyzes each tile starting at the lowest resolution, and will process the higher-resolution images unless the classification satisfies some pre-determined criterion.

The classification of each tile is based on statistical features that characterize the texture of the tissue structure. To that end, NBIA first applies a color space conversion to the La*b* color space, where color and intensity are separated, enabling the use of Euclidean distance for feature calculation. The texture information is calculated using co-occurrence statistics and local binary patterns (LBPs), which help characterize the color and intensity variations in the tissue structure. Finally, the classification confidence at a particular resolution is computed using hypothesis testing; either the classification decision is accepted or the analysis resumes at a higher resolution, if one exists. The result of the image analysis is a classification label assigned to each image tile indicating the underlying tissue subtype, e.g., stroma-rich, stroma-poor, or background. Figure 1 shows the flowchart for the classification of stromal development. More details on NBIA can be found in [31].
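As a concrete illustration of the texture features mentioned above, the sketch below shows the standard 8-neighbor local binary pattern computation on a single intensity channel. It is not the NBIA code; the function name, image layout, and thresholding rule are generic assumptions used only to make the idea concrete.

// Standard 8-neighbor LBP on one intensity channel (illustrative, not NBIA's code).
#include <cstdint>
#include <vector>

std::vector<uint8_t> lbp8(const std::vector<uint8_t>& img, int w, int h) {
    std::vector<uint8_t> out(w * h, 0);
    const int dx[8] = {-1, 0, 1, 1, 1, 0, -1, -1};
    const int dy[8] = {-1, -1, -1, 0, 1, 1, 1, 0};
    for (int y = 1; y < h - 1; ++y)
        for (int x = 1; x < w - 1; ++x) {
            uint8_t center = img[y * w + x];
            uint8_t code = 0;
            // Each neighbor that is at least as bright as the center sets one bit.
            for (int k = 0; k < 8; ++k)
                if (img[(y + dy[k]) * w + (x + dx[k])] >= center)
                    code |= (uint8_t)(1u << k);
            out[y * w + x] = code;
        }
    return out;
}

A histogram of such per-pixel codes over a tile typically feeds the feature vector, alongside the co-occurrence statistics.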


Fig. 1 NBIA flow chart

The NBIA application was originally implemented in the filter-stream programming model as a set of five components: (i) Image reader, which reads tiles from the disk; (ii) Color conversion, which converts image tiles from the RGB color space to the La*b* color space; (iii) Statistical features, which is responsible for computing a feature vector of co-occurrence statistics and LBPs; (iv) Classifier, which calculates the per-tile operations and conducts a hypothesis test to decide whether or not the classification is satisfactory; and (v) Start/Output, which controls the processing flow. The computation starts with tiles at the lowest resolution, and only restarts computations for tiles at a higher resolution if the classification is insufficient, according to the Classifier filter. This loop continues until all tiles have satisfactory classifications, or they are computed at the highest resolution. Since the Color conversion and Statistical features filters are the most computationally intensive, they were implemented for GPUs. For our experiments, we started from the original implementations of the neuroblastoma application for both GPU and CPU, except that we utilize both resources simultaneously at run-time. There is no significant difference in the application code; our changes only affect the run-time support libraries.

NBIA was chosen as our motivating application because it is an important, complex, real-world classification system whose skeleton, we believe, is shared by a broad class of applications. Its approach of dividing the image into several tiles that can be independently processed through a pipeline of operations is found, for instance, in several image processing applications. Furthermore, the multi-resolution processing strategy, which concurrently generates tasks with different granularities during execution, is also present in various research areas such as computer vision, image processing, medicine, and signal processing [13, 15, 22, 28].

In our experience with the filter-stream programming model, most applications are bottleneck-free, and the number of active internal tasks is higher than the number of available processors [8, 11, 17, 35, 40, 43]. Thus, the proposed approach to exploit heterogeneous resources consists of allocating multiple tasks concurrently to the processors where they will perform the best, as detailed in Sect. 3. Also, the computation time of the applications' tasks dominates the overall application execution time, so the cost of the communication latency is offset by the speedups on the computation. Although these premises are valid for a broad range of applications, there is also interesting work on linear algebra computation for multicore processors [33], for instance, including heterogeneous computing with GPUs [2], where the number of an application's concurrent tasks is smaller than the number of processors. The authors improved application performance by using scheduling algorithms that take into account the task dependencies, in order to increase the number of concurrently active tasks, hence allowing them to take advantage of multicore systems.

3 Anthill

Anthill is a run-time system based on the dataflow model and, as such, applications are decomposed into processing stages, called filters, which communicate with each other using unidirectional streams. The application is then described as a multi-graph representing the logical interconnection of the filters [4]. At run time, Anthill spawns instances of each filter on multiple nodes of the cluster, which are called transparent copies. Anthill automatically handles run-time communication and state partitioning among transparent copies [36].

When developing an application using the filter-stream model, task and data parallelism are exposed. Task parallelism is achieved as the application is broken up into a set of filters, which independently perform their specific transformations in a pipeline fashion. Data parallelism, on the other hand, can be obtained by creating multiple (transparent) copies of each filter and dividing the data to be processed among them. In Fig. 2 we show a typical Anthill application.

The filter programming abstraction provided by Anthill is event-oriented. The programmer provides functions that match input buffers from the multiple streams to a set of dependencies, creating event queues. The Anthill run-time controls the necessary non-blocking I/O. This approach derives heavily from the message-oriented programming model [6, 24, 42].


Fig. 2 Anthill application

Fig. 3 Filter architecture

The programmer also provides handling functions for each of the events on a filter, which are automatically invoked by the run-time. These events, in the dataflow model, amount to asynchronous and independent tasks, and as the filters are multithreaded, multiple tasks can be spawned provided there are pending events and compute resources. This feature is essential in exploiting the full capability of current multicore architectures, and in heterogeneous platforms it is also used to spawn tasks on multiple devices. To accomplish that, Anthill allows the user to provide multiple handlers for the same event, one for each specific device.

Figure 3 illustrates the architecture of a typical filter. It receives data from multiple input streams (In1, In2, and In3), each generating its own event queue, and there are handler functions associated with each of them. As shown, these functions are implemented targeting different types of processors for each event queue. The Event Scheduler, depicted in the figure, is responsible for consuming events from the queues, invoking the appropriate handlers according to the availability of compute resources. As events are consumed, eventually some data is generated on the filter that needs to be forwarded to the next filter. This is done by the run-time system, though it is not depicted in the figure.

The assignment of events to processors is demand-driven. Thus, when events are queued, they are not immediately assigned to a processor. Rather, this occurs on-demand as devices become idle and new events are queued. In the current implementation, the demand-driven, first-come, first-served (DDFCFS) task assignment policy is used as the default strategy of the Event Scheduler [38].

The first decision for the DDFCFS policy is to select from which queue to execute events; this decision is made in a round-robin fashion, provided there is a handling function for the available processor. Otherwise, the next queue is selected. Whenever a queue is selected, the oldest event on that queue is then dispatched. This simple approach guarantees assignment to different devices according to their relative performance in a transparent way.
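A minimal sketch of this dispatch step is shown below, assuming hypothetical Event/EventQueue types rather than the actual Anthill structures: when a device becomes idle, the queues are visited round-robin, queues without a handler for that device are skipped, and the oldest event of the first eligible queue is dispatched.

#include <cstddef>
#include <deque>
#include <optional>
#include <vector>

struct Event { int id; };

struct EventQueue {
    std::deque<Event> events;            // oldest event at the front
    std::vector<bool> handlerForDevice;  // handlerForDevice[d]: a handler exists for device d
    bool hasHandlerFor(int d) const {
        return d >= 0 && d < (int)handlerForDevice.size() && handlerForDevice[d];
    }
};

// Called when 'device' becomes idle; 'cursor' keeps the round-robin position across calls.
std::optional<Event> ddfcfs_next(std::vector<EventQueue>& queues, std::size_t& cursor, int device) {
    for (std::size_t tried = 0; tried < queues.size(); ++tried) {
        EventQueue& q = queues[cursor];
        cursor = (cursor + 1) % queues.size();
        if (q.hasHandlerFor(device) && !q.events.empty()) {
            Event e = q.events.front();  // first-come, first-served within the queue
            q.events.pop_front();
            return e;
        }
    }
    return std::nullopt;                 // no ready event this device can handle
}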

4 Performance estimator

As highlighted earlier, at the core of the techniques proposed in this paper is the fact that the relative performance of GPUs is data dependent. With that in mind, the decision about where to assign each task has to be delayed until run-time, and determining each task's relative performance is central to our whole approach.

Despite the fact that modeling the performance of applications has been an open challenge for decades [9, 16, 41], we believe the use of relative fitness to measure the performance of the same algorithm or workload running on different devices is accurate enough, and is far easier to predict than execution times.


Table 1 Evaluating the performance estimator prediction

Benchmark          Speedup avg. error (%)   CPU time avg. error (%)   Description                          App. source
Black-Scholes      2.53                     70.50                     European option price                CUDA SDK [23]
N-body             7.34                     11.57                     Simulate bodies iterations           CUDA SDK [23]
Heart Simulation   13.79                    41.97                     Simulate electrical heart activity   [27]
kNN                8.76                     21.18                     Find k-nearest neighbors             Anthill [38]
Eclat              11.32                    102.61                    Calculate frequent itemsets          Anthill [38]
NBIA-component     7.38                     30.35                     Neuroblastoma (Sect. 2)              [11, 29]

Fig. 4 Anthill relative performance estimator

However, this task should not be left to the application programmer but rather should be part of the system, and so we propose the Performance Estimator.

The proposed solution, depicted in Fig. 4, uses a two-phase strategy. In the first phase, when a new application is implemented, it is benchmarked for a representative workload and the execution times are stored. The profile generated in this phase consists of the application input parameters, targeted devices, and execution times, and constitutes a training dataset that is used during the actual performance prediction.

The second phase implements a model learning algorithm, and can employ different strategies to estimate the targeted relative performance. However, it is important to notice that modeling the behavior of applications based on their inputs is beyond the reach of basic regression models. Also, it is beyond the scope of this paper to study this specific problem and propose a final solution. Rather, we propose an algorithm which we have validated experimentally and which was shown to yield sufficient accuracy for our decision making.

Our algorithm uses kNN [10] as the model learning algorithm. When a new application task is created, the k nearest executions in our profile are retrieved based on a distance metric on the input parameters, and their execution times are averaged and used to compute the relative speedup of the task on the different processors. The employed distance metric first normalizes the numeric parameters, dividing each value by the highest value of its dimension, and then uses the Euclidean distance. For non-numeric attributes, on the other hand, the distance is 0 when the attributes match completely and 1 otherwise.
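The sketch below illustrates this estimation step under simplifying assumptions: it handles only numeric parameters, and the Sample structure and predict_speedup function are our names, not the Performance Estimator's API.

#include <algorithm>
#include <cmath>
#include <cstddef>
#include <vector>

// One profiled run: numeric input parameters and the measured relative speedup.
struct Sample { std::vector<double> params; double speedup; };

double predict_speedup(const std::vector<Sample>& profile,
                       const std::vector<double>& query, std::size_t k) {
    // Normalize every numeric dimension by its largest value in the profile.
    std::vector<double> maxv(query.size(), 1e-12);
    for (const Sample& s : profile)
        for (std::size_t d = 0; d < maxv.size(); ++d)
            maxv[d] = std::max(maxv[d], std::abs(s.params[d]));

    // Euclidean distance in the normalized parameter space.
    std::vector<std::pair<double, double>> byDistance;  // (distance, speedup)
    for (const Sample& s : profile) {
        double sum = 0.0;
        for (std::size_t d = 0; d < maxv.size(); ++d) {
            double diff = (s.params[d] - query[d]) / maxv[d];
            sum += diff * diff;
        }
        byDistance.push_back({std::sqrt(sum), s.speedup});
    }

    // Average the speedups of the k nearest profiled runs.
    std::size_t n = std::min(k, byDistance.size());
    std::partial_sort(byDistance.begin(), byDistance.begin() + n, byDistance.end());
    double acc = 0.0;
    for (std::size_t i = 0; i < n; ++i) acc += byDistance[i].second;
    return n ? acc / n : 1.0;            // fall back to 1.0 with an empty profile
}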

For the purpose of evaluating the effectiveness of the Performance Estimator, we evaluated six representative applications (described in Table 1) with the technique discussed above. The evaluation has two main purposes: (i) to understand whether the proposed technique performs an acceptable estimation; and (ii) to support our insight that relative performance (speedup) is easier to predict with sufficient accuracy than execution times are. The results shown were obtained by performing a first-phase benchmark using a workload of 30 jobs, which are executed on both the CPU and the GPU. The estimator errors are calculated using a 10-fold cross-validation, and k = 2 was utilized as it achieved near-best estimations for all configurations.

The average speedup error for each application is shown in Table 1. First of all, our methodology's accuracy was high, since the worst-case error is not higher than 14%, while the average error among all the applications is only 8.52%. For our use case application, whose accuracy was about average, this error level does not impact performance because the estimate is still accurate enough to maintain the optimal ordering of the tasks. We also used the same approach to estimate task execution times for the CPU, using the same workload as before. In this second evaluation, we simply computed the predicted execution times as the average of the k nearest samples' execution times. The CPU execution time error is also shown in the same table; those errors are much higher than the speedup errors for all applications, although the same prediction methodology is employed.

This empirical evaluation is interesting for two reasons: (i) as our task assignment relies on relative performance estimation, the estimator should serve this requirement better than it would serve time-based strategies; (ii) the speedup can also be used to predict execution times of an application in different run-time environments. For instance, if the execution time on one device is available, the time on a second processor can be calculated using the relative performance between them. The advantage of this approach is that the error of the estimated execution time equals the error of the predicted speedup.

We believe that relative performance is easier to predict because it abstracts away effects such as conditional statements or loop breaks that strongly affect execution time modeling. Moreover, relative performance does not try to model the application itself, but rather the differences between devices when running the same program.

5 Performance optimizations

In this section we discuss several run-time techniques for improving the performance of replicated dataflow computations, such as filter-stream applications, on heterogeneous clusters of CPU- and GPU-equipped machines. First, we present our approach to reduce the impact of data transfers between the CPU and the GPU by using CUDA's asynchronous copy mechanism. Next, in Sect. 5.2, we present a technique to better coordinate CPU and GPU utilization. Lastly, in Sect. 5.3, we propose a novel stream communication policy that improves the performance on heterogeneous clusters, where computing nodes have different processors, by coordinating the task assignment.

5.1 Improving CPU/GPU data transfers

The limited bandwidth between the CPU and the GPU is a critical barrier to the efficient execution of GPU kernels; for many applications the cost of data transfer operations is comparable to the computation time [34]. Moreover, this limitation has strongly influenced application design, increasing the programming challenges.

One approach to reduce GPU idle time during data transfers is to overlap them with useful computation. Similar solutions have been used in other scenarios, where techniques such as double buffering keep processors busy while data is transferred among the memory hierarchies of multicore processors [30]. The approach employed in this work is also based on overlapping communication and computation, but on NVidia GPUs double buffering may not be the most appropriate technique because these devices allow multiple concurrent transfers between CPU and GPU. Moreover, in all but the most recent GPUs, prior to the new NVidia Fermi GPUs in their Tesla version, these concurrent transfers are only possible in one direction, from CPU to GPU or from GPU to CPU; two concurrent transfers in different directions are not supported. Thus, the problem of providing efficient data transfer becomes challenging, as the performance can only be improved up to a saturation point by increasing the number of concurrent transfers. Unfortunately, the optimal number of concurrent transfers varies according to the computation/communication rates of the tasks being processed and the size of the transferred data. We show this empirically in Sect. 6.2.

The solution of overlapping communication with computation to reduce processor idle time consists of assigning multiple concurrent processing events to the GPU, overlapping the events' data transfers with computation, and determining the number of events that maximizes the application performance. We assume that the data to be copied from the CPU to the GPU are the data buffers received through the filter input streams. For the cases where the GPU kernel's input data is not self-contained in the received data buffers, the copy/format function can be rewritten according to each filter's requirements.

After copying the GPU kernel's input data to the GPU, and after running the kernel, it is typically necessary to copy the output results back to the CPU and send them downstream. During this stage, instead of copying the result itself, the programmer can use the Anthill API to send the output data to the next filter, passing the address of the data on the GPU to the run-time environment. With the pointer to this data, Anthill can transparently start an asynchronous copy of the data back to the CPU before sending it to the next filter. It is also important to highlight that our asynchronous copy mechanism uses the Stream API of the CUDA SDK [23], and that each Anthill event is associated with a single CUDA stream.
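The snippet below sketches, for a single event, how such a per-event CUDA stream can carry an input copy, a kernel launch, and an output copy asynchronously. The kernel, buffer names, and sizes are placeholders, not Anthill or NBIA code; note that the host buffers must be page-locked for the copies to be truly asynchronous.

#include <cuda_runtime.h>

// Placeholder kernel standing in for a filter's GPU handler.
__global__ void process_kernel(const float* in, float* out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[i];           // identity transform, for illustration only
}

// One event: copy in, process, copy out, all queued on the event's own stream.
// h_in and h_out are assumed to be page-locked; otherwise the copies fall back
// to synchronous behavior and no overlap is obtained.
void run_event_async(const float* h_in, float* h_out, int n) {
    cudaStream_t stream;
    cudaStreamCreate(&stream);           // one CUDA stream per Anthill event
    float *d_in, *d_out;
    cudaMalloc(&d_in,  n * sizeof(float));
    cudaMalloc(&d_out, n * sizeof(float));
    cudaMemcpyAsync(d_in, h_in, n * sizeof(float), cudaMemcpyHostToDevice, stream);
    process_kernel<<<(n + 255) / 256, 256, 0, stream>>>(d_in, d_out, n);
    cudaMemcpyAsync(h_out, d_out, n * sizeof(float), cudaMemcpyDeviceToHost, stream);
    cudaStreamSynchronize(stream);       // wait before the result is sent downstream
    cudaFree(d_in);
    cudaFree(d_out);
    cudaStreamDestroy(stream);
}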

Because application performance can be dramatically influenced by GPU idle time, we propose an automated approach to dynamically configure the number of concurrent, asynchronous data copies at run-time, according to the GPU tasks' performance characteristics. Our solution changes the number of concurrent events assigned to the GPU according to the throughput of the application. Our algorithm (Algorithm 1) starts with two concurrent events and increases the number until the GPU's throughput begins to decrease. The previous configuration is then saved, and with the next set of events, the algorithm continues searching for a better number of concurrent data copies by starting from the saved configuration. In order to quickly find the saturation point, the algorithm increases the number of concurrent data copies exponentially until the GPU throughput begins to decrease. Thereafter, the algorithm makes a binary search for the best configuration between the current and last value of concurrentEvents. After that phase, it simply makes changes by one concurrent data copy at a time.

Because most recent NVidia GPUs only allow concurrent transfers in one direction, our algorithm has been designed with this constraint in mind.


Algorithm 1 Algorithm to control the CPU/GPU data transfers

concurrentEvents = 2; streamStepSize = 2; stopExponentialGrowth = 0;
while not EndOfWork do
    for i := 0, eventId = 0; i < concurrentEvents; i++ do
        if event ← tryToGetNewEvent() then
            asyncCopy(event.data, event.GPUData, ..., event.cuStream)
            activeEvents.insert(eventId++, event)
        end if
    end for
    for i := 0; i < activeEvents.size; i++ do
        proc(event[i])
    end for
    for i := 0; i < activeEvents.size; i++ do
        event ← activeEvents.getEvent(i)
        waitStream(event.cuStream)
        asyncCopy(event.outGPUData, event.outData, ..., event.cuStream)
    end for
    for i := 0; i < activeEvents.size; i++ do
        event ← activeEvents.getEvent(i)
        waitStream(event.cuStream)
        send(event)
        activeEvents.remove[i]
    end for
    curThroughput ← calcThroughput()
    if curThroughput > lastThroughput then
        concurrentEvents += streamStepSize
        if stopExponentialGrowth ≠ 1 then
            streamStepSize *= 2
        end if
    end if
    if curThroughput < lastThroughput and concurrentEvents > 2 then
        if streamStepSize > 1 then
            streamStepSize /= 2
        end if
        if stopExponentialGrowth == 0 then
            stopExponentialGrowth = 1
            streamStepSize /= 2
        end if
        concurrentEvents -= streamStepSize
    end if
end while

To maximize performance given this limitation, the algorithm schedules multiple concurrent transfers from the CPU to the GPU, executes the event processing, and finally schedules all transfers of the data back to the CPU. A barrier is used at the end of this processing chain to stop concurrent transfers in different directions from occurring. If data transfers in each direction are not grouped, the asynchronous, concurrent data copy mechanism is not used, and the GPU driver defaults to the slower synchronous copy version.

The new NVidia Fermi GPUs, on the other hand, have added the capability to handle multiple data transfers in both directions concurrently, from CPU to GPU and GPU to CPU, with dual overlapped memory transfer engines. Thus, when using these newer processors, it is possible to exploit the transfer mechanism without grouping data transfers in each direction, as is done in the grouped loops of the algorithm shown above. However, the proposed algorithm to dynamically control the number of concurrent data buffers being transferred and processed is still useful for Fermi Tesla GPUs, as this value still impacts performance. Thus, when using these new GPUs, the loops of the original algorithm can be merged into a single loop, since it is possible to have asynchronous transfers in both directions. Although not shown in the algorithm, we guarantee that concurrentEvents is never smaller than 1, and its maximum is bounded by the available memory.
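Under the stated assumption of dual copy engines, the merged loop can be sketched as below; the GpuEvent structure and the kernel are hypothetical stand-ins for the per-event bookkeeping, not the actual Anthill implementation.

#include <cuda_runtime.h>
#include <vector>

// Hypothetical per-event bookkeeping; not the Anthill data structures.
struct GpuEvent {
    float *hostIn, *hostOut;   // page-locked host buffers
    float *devIn, *devOut;     // device buffers
    int n;
    cudaStream_t stream;       // one CUDA stream per event
};

__global__ void process_kernel(const float* in, float* out, int n);  // as in the previous sketch

void issue_events(std::vector<GpuEvent>& activeEvents) {
    // With dual copy engines, H2D and D2H copies issued from different events
    // can overlap, so no grouping of transfer directions is needed.
    for (GpuEvent& e : activeEvents) {
        size_t bytes = e.n * sizeof(float);
        cudaMemcpyAsync(e.devIn, e.hostIn, bytes, cudaMemcpyHostToDevice, e.stream);
        process_kernel<<<(e.n + 255) / 256, 256, 0, e.stream>>>(e.devIn, e.devOut, e.n);
        cudaMemcpyAsync(e.hostOut, e.devOut, bytes, cudaMemcpyDeviceToHost, e.stream);
    }
    for (GpuEvent& e : activeEvents)
        cudaStreamSynchronize(e.stream);  // each event can then be sent downstream
}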

5.2 Intra-filter task assignment

The problem of assigning tasks in heterogeneous environments has been the target of research for a long time [1–3, 5, 14, 15, 19]. Recently, with the increasing ubiquity of GPUs in mainstream computers, the scientific community has examined the use of nodes with CPUs and GPUs in more detail. In Mars [12], an implementation of the MapReduce programming model for CPU- and GPU-equipped nodes, the authors evaluate the collaborative use of CPUs and GPUs, where the Map and Reduce tasks are divided among them using a fixed relative performance between the devices. The Qilin system [21] argues that the processing rates of system processors depend on the input data size. By generating a model of the processing rates for each of the processors in the system during a training phase, Qilin determines how best to split the work among the processors for successive executions of the application, assuming that the application's internal tasks have the same relative performance. However, for certain classes of applications, such as our image analysis application, the processing rates of the various processors are data-dependent, meaning that such a static partitioning will not be optimal in these cases.

Indeed, in dataflow applications, there are many internal tasks which can exhibit these data-dependent performance variations, and we experimentally show that taking these variations into account can significantly improve application performance. Heterogeneous processing has been previously studied [15], but in this work the authors target methods to map and schedule tasks onto heterogeneous, parallel resources where the task execution times do not vary according to the data. Here, we show that this data-dependent processing rate variability can be leveraged to give applications extra performance.

In order to exploit this intra-filter task heterogeneity, we proposed and implemented a task assignment policy called demand-driven dynamic weighted round-robin (DDWRR) [39] in the Anthill Event Scheduler module, previously shown in Sect. 3. As in DDFCFS, the assignment of events to devices is demand-driven: the ready-to-execute tasks are shared among the processors inside a single node and are only assigned when a processor becomes idle, and the first step of selecting from which stream to process events is done in round-robin fashion.

Fig. 5 DDWRR: task ordering per device type, having the CPU as the baseline processor

The main difference between DDWRR and DDFCFS is in the second phase, when an event is chosen from the selected stream. In this phase, DDWRR chooses events according to a per-processor weight that may vary during the execution. This value is the computed estimate of the event's performance when processed by each device. For example, this value could be the event's likely execution time speedup on this device when compared to a baseline processor (e.g., the slowest processor in the system). During the execution, this weight is used to order the ready-to-execute events for each device, sorting them from highest to lowest weight. Since the speedup of all tasks on the baseline processor is 1, tasks with the same speedup are additionally ordered using the inverse of the best speedup they achieve on any of the available devices. The idea is that if a given processor X can choose among a set of tasks with the same speedup, it should select the task that would perform worst on any of the other available devices of the system.

Figure 5 presents the DDWRR task ordering for five tasks, t1, t2, t3, t4, and t5, created in this order in a given filter instance. As shown, on the CPU baseline processor the tasks' relative performance is 1. The GPU queue is built by inserting tasks in decreasing order of speedup over the baseline CPU processor. Therefore, since t1 and t3 have the same relative performance, the older task (t1) is inserted first. The CPU queue, meanwhile, uses the inverse of the GPU speedup to create its ordering, as all tasks have the same CPU relative performance because it is the baseline.

Given the task ordering shown in the figure, during execution, when a certain processor is available, the first event in the queue for that device type is chosen, and the selected task is removed from the other queues.

Therefore, DDWRR assigns events in an out-of-order fashion, but instead of using this for speculative execution or to reduce the negative impact of data dependencies [25], it is used to sort the events according to their suitability for each device. It is also important to highlight that DDWRR does not require an exact speedup value for each task, because it is only necessary to have a relative ordering of events according to their performance. The estimator described in Sect. 4 has sufficient accuracy for our purposes.
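The sketch below illustrates the per-device ordering described above with hypothetical Task fields and a two-device setup (CPU baseline, GPU); it is only meant to make the tie-breaking rule concrete, not to reproduce the Anthill scheduler.

#include <algorithm>
#include <vector>

// Hypothetical task descriptor: the CPU is the baseline (speedup 1), and
// gpuSpeedup is the estimated speedup of the task on the GPU.
struct Task {
    int id;             // creation order: smaller id means older task
    double gpuSpeedup;
};

void order_for_devices(const std::vector<Task>& ready,
                       std::vector<Task>& cpuQueue,
                       std::vector<Task>& gpuQueue) {
    gpuQueue = ready;
    std::sort(gpuQueue.begin(), gpuQueue.end(), [](const Task& a, const Task& b) {
        if (a.gpuSpeedup != b.gpuSpeedup) return a.gpuSpeedup > b.gpuSpeedup;
        return a.id < b.id;                      // ties: older task first
    });
    cpuQueue = ready;
    std::sort(cpuQueue.begin(), cpuQueue.end(), [](const Task& a, const Task& b) {
        // All CPU speedups are 1, so ties are broken by the inverse of the best
        // speedup elsewhere: tasks the GPU accelerates least come first.
        if (a.gpuSpeedup != b.gpuSpeedup) return 1.0 / a.gpuSpeedup > 1.0 / b.gpuSpeedup;
        return a.id < b.id;
    });
}
// When a device becomes idle it takes the head of its own queue, and the chosen
// task is removed from every other device's queue.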

5.3 Inter-filter optimizations: on-demand dynamic selective stream

On distributed systems, performance is heavily dependent upon the load balance, as the overall execution time is that of the slowest node. Our previous techniques deal only with the events received in a single instance of a filter. To optimize globally, however, when there are multiple instances of a filter, we need to consider which of those instances should receive and process the messages we send.

We present a novel stream communication policy to optimize filter-stream computations on distributed, heterogeneous, multi-core, multi-accelerator computing environments. To fully utilize and achieve maximum performance on these systems, filter-stream applications have to satisfy two premises that motivate the proposed policy: (i) the number of data buffers at the input of each filter should be high enough to keep all the processors busy, making it possible to exploit all of the available resources, but not so high as to create a load imbalance among filter instances; (ii) the data buffers sent to a filter should maximize the performance of the processors allocated to that filter instance.

Based on these premises, we propose an on-demand dynamic selective stream (ODDS) policy. This stream policy implements an n × m on-demand directed communication channel from n instances of a producer filter Fi to m instances of a consumer filter Fj. As such, ODDS implements a policy where each instance of the receiver filter Fj can consume data at a different rate, according to its processing power.

Because instances of Fj can consume data at different rates, it is important to determine the number of data buffers needed by each instance to keep all processors fully utilized. Moreover, as discussed previously, the number of buffers kept in the queue should be as small as possible to avoid load imbalance across computing nodes. These two requirements are obviously contradictory, which poses an interesting challenge. Additionally, the ideal number of data buffers in the filter's input queue may be different for each filter instance and can change as the application execution progresses. Not only do the data buffers' characteristics change over time, but the communication times can also vary due to the load of the sender filter instances, for example. ODDS is comprised of two components: the Dynamic Queue Adaptation Algorithm (DQAA) and the Data Buffer Selection Algorithm (DBSA). DQAA is responsible for premise (i), whereas DBSA is responsible for premise (ii). In the next two subsections we describe these two algorithms in more detail.

5.3.1 Dynamic queue adaptation algorithm (DQAA)

Our solution to control the queue size on the receiver side derives from concepts developed by Brakmo et al. for TCP Vegas [20], a transport protocol which controls flow and congestion in networks by continuously measuring network response (packet round trip times) and adjusting the transmission window (number of packets in transit). For our purposes, we continuously measure both the time it takes for a request message to be answered by the upstream filter instance and the time it takes for a processor to process each data buffer, as detailed in Fig. 6. Based on the ratio of the request response time to the data buffer processing time, we decide whether the StreamRequestSize (the number of data buffers assigned to a filter instance, which includes those data buffers being transferred, already received, and queued) must be increased, decreased, or left unaltered. The alterations are the responsibility of the ThreadWorker thread, which computes its target request size after finishing the processing of each of its data buffers and updates the current request size if necessary.

In parallel, the ThreadRequester thread observes the changes in the requestSize and the target stream request size for each ThreadWorker. Whenever the requestSize falls below the target value, instances of the upstream filter are contacted to request more data buffers, which are received and stored in the filter's StreamOutQueue. Although these requests occur per ThreadWorker, the StreamOutQueue keeps the received data buffers in a single shared queue. For the buffers residing in this shared queue, it also maintains a queue of data buffer pointers for each processor type, sorted by the data buffers' speedup on that processor.

5.3.2 Data buffer selection algorithm (DBSA)

Our approach for selecting a data buffer to send downstream is based on the expected speedup when a given data buffer is processed by a certain type of processor. This algorithm is similar to the one described earlier to select a task for a given device, and it also relies on the Performance Estimator to accomplish that.

Whenever an instance of filter Fj demands more data from its input stream, the request includes information about the processor type that caused the request to be issued (because, according to Fig. 6, the ThreadRequester generates specific request messages for each event handler thread).

Algorithm 2 ThreadWorker (proctype, tid)

for all proctype, targetrequestsize(tid) = 1, requestsize(tid) = 0 do
    while not EndOfWork do
        if |StreamOutQueue(proctype)| > 0 then
            d ← GETDATABUFFER(StreamOutQueue(proctype))
            requestsize(tid)−−
            timetoprocess ← PROCESSDATABUFFER(d)
            targetlength ← requestlatency / timetoprocess
            if targetlength > |targetrequestsize(tid)| then
                targetrequestsize(tid)++
            end if
            if targetlength < |targetrequestsize(tid)| then
                targetrequestsize(tid)−−
            end if
        end if
    end while
end for

Algorithm 3 ThreadRequester (proctype, tid)

while not EndOfWork do
    while |requestsize(tid)| < targetrequestsize(tid) do
        p ← CHOOSESENDER(proctype)
        sendtime ← TIMENOW()
        SENDMESSAGE(REQUESTMSG(proctype), p)
        m ← RECEIVEMESSAGE(p)
        recvtime ← TIMENOW()
        if m ≠ ∅ then
            d ← m.data
            INSERT(StreamOutQueue, d)
            requestlatency ← recvtime − sendtime
            requestsize(tid)++
        end if
    end while
end while

Fig. 6 Receiver threads

Upon receipt of the data request, the upstream filter instance selects, from among the queued data buffers, the one best suited for that processor type.

The algorithm we propose, which runs on the sender side of the stream, maintains a queue of data buffers that is kept sorted by the speedup of each type of processor versus the baseline processor. When the instance of Fi which received a request chooses and sends the data buffer with the highest speedup to the requesting processor, it removes the same buffer from all other sorted queues. On the receiver side, as stated above, a shared queue is used to minimize the load imbalance between the processors running in the system.

In Fig. 7, we show the operation of DBSA. The ThreadBufferQueuer thread is activated each time the filter Fi's instance sends a data buffer through the stream. It inserts the buffer into that node's SendQueue together with the computed speedups for each type of processor.


Algorithm 4 ThreadBufferQueuer

while not EndOfWork do
    if StreamInQueue ≠ ∅ then
        d ← GETDATABUFFER(StreamInQueue)
        INSERTSORTED(d, SendQueue)
    end if
end while

Algorithm 5 ThreadBufferSender

while not EndOfWork do
    if ∃ requestmsg then
        proctype ← requestmsg.proctype
        requestor ← requestmsg.sender
        d ← DATABUFFERSELECTIONALGORITHM(SendQueue, proctype)
        SENDMESSAGE(DATAMSG(d), requestor)
    end if
end while

Fig. 7 Sender threads

Whenever a processor running an instance of Fj requests a data buffer from that filter instance Fi, the ThreadBufferSender thread processes the request message, executes DBSA, and sends the selected data buffer to the requesting filter Fj. Figure 8 shows an overview of the proposed stream policy, including the messages exchanged when filter Fj requests data from Fi, and the algorithms executed on each side to control the data buffer queue sizes.

6 Experimental results

We have carried out our experiments on a 14-node PC cluster. Each node is equipped with one 2.13 GHz Intel Core 2 Duo CPU, 2 GB of main memory, and a single NVIDIA GeForce 8800GT GPU; the cluster is interconnected using switched Gigabit Ethernet (Intel/8800GT). We have also executed some experiments on single compute nodes, the first being equipped with dual quad-core 2.00 GHz AMD Opteron processors, 16 GB of main memory, and an NVidia GeForce GTX260 GPU (AMD/GTX260). The second single node, equipped with dual NVidia Tesla C2050 GPUs, has been used to evaluate the dual overlapped memory transfer engines available on this Fermi-based GPU. All of these systems use the Linux operating system. In experiments where the GPU is used, one CPU core of the node is assigned to manage it, and is not available to run other tasks.

We have run each experiment multiple times, such that the maximum standard deviation is less than 3.2%, and present the average results here. The speedups shown in this section are calculated based on the single CPU-core version of the application. We also fused the GPU NBIA filters to avoid extra overhead due to unnecessary GPU/CPU data transfers and network communication; thus, our optimizations are evaluated using an already optimized version of the application.

6.1 Effect of tile size on performance

In the first set of experiments, we analyzed the performance of the CPU and GPU versions of NBIA as a function of the input image resolution. During this evaluation, we generated different workloads using a fixed number of 26,742 tiles, while varying the tile size. We also assumed that all tiles are successfully classified at the first resolution level, so that NBIA only computes tiles of a single resolution.

The speedup of the GPU versus one CPU core is shown, for various tile sizes and two machine configurations, in Fig. 9. The results show high variations in the relative performance between the CPU and the GPU: when using the Intel/8800GT machine (see Fig. 9(a)), their performance is similar for 32 × 32 pixel tiles, but the GPU is almost 33 times faster for 512 × 512 pixel tiles. For small tasks, the overhead of using the GPU is large relative to the analysis execution time, making its use inefficient.

Fig. 8 ODDS: communication policy overview


Fig. 9 NBIA: CPU/GPU relative performance variation according to the tile dimensions: (a) Intel/8800GT, (b) AMD/GTX260

The same behavior is also observed in the experiments with the AMD/GTX260 computing node, although the variation in relative performance is even higher, as shown in Fig. 9(b). The large difference in relative performance observed between the two machines occurs because the second computing node (AMD/GTX260) is equipped with a faster GPU and a CPU that is 53% slower than the Intel CPU for NBIA.

Figure 9 also shows that the performance of NBIA is strongly affected by the input tile size. Moreover, the speedup variation shows that different types of processing units have different performance capabilities. This performance heterogeneity can be observed in other applications, such as linear algebra kernels, data mining applications like kNN and Eclat, FFTs, and applications that employ pyramid multi-resolution representations [13, 22, 28].

In large-scale parallel executions of NBIA, multiple processing tasks will process different tile resolutions concurrently, making the performance of the devices vary according to the tasks they process. This heterogeneity creates the demand for techniques to efficiently use the CPU and GPU collaboratively, as discussed in Sect. 6.3.

Fig. 10 VI: number of streams vs. input data chunk size: (a) Intel/8800GT, (b) Tesla C2050

6.2 Effect of async. CPU/GPU data transfers

The results of our approach to improve CPU/GPU data transfers are presented in this section. For our evaluation, we used two applications: NBIA, and a vector incrementer (VI) that divides a vector into small chunks which are copied to the GPU and incremented, iterating over each value six times (resulting in a computation to communication ratio of 7:3).
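Since the paper does not list the VI kernel, the following is only a plausible sketch of its GPU side consistent with the description above (each element of a chunk is incremented, iterating six times over the value); the exact arithmetic behind the 7:3 ratio is an assumption of ours.

// Plausible sketch of the VI kernel (our illustration, not the paper's code).
__global__ void vi_kernel(int* chunk, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        int v = chunk[i];
        for (int it = 0; it < 6; ++it)   // repeated passes set the compute/transfer ratio
            v += 1;
        chunk[i] = v;
    }
}
// Each chunk is copied to the GPU, processed by this kernel on its own CUDA
// stream, and copied back, following the asynchronous transfer scheme of Sect. 5.1.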

6.2.1 Vector incrementer

Figure 10 shows the VI execution times as we vary the number of concurrent events/CUDA streams for three data chunk sizes (100K, 500K, and 1M), using an input vector of 360M integers, on the 8800GT and C2050 GPUs; the latter is used here to evaluate the new Fermi dual overlapped memory transfer engines. These results show that a large number of streams is necessary to attain maximum performance. Please also note that the number of CUDA streams required for optimum execution times varies with the size of the data being transferred. Additionally, the memory and network bandwidth of the GPU are important factors influencing the optimal number of CUDA streams for minimal application execution time. Therefore, for an application with multiple kernel functions, or one run in an environment with mixed GPU types, a single optimal value might not exist, even if the application parameters are static.


Table 2 VI: Static vs. dynamic number of CUDA streams (Intel/8800GT)

                                   Input chunk size
                                   100K     500K     1M
Best static stream size (secs.)    16.50    16.16    16.15
Dynamic algorithm (secs.)          16.53    16.23    16.16

Table 3 VI: Static vs. dynamic number of CUDA streams (Tesla C2050)

                                   Input chunk size
                                   100K     500K     1M
Best static stream size (secs.)    1.19     1.10     1.09
Dynamic algorithm (secs.)          1.19     1.11     1.10


We believe the results in Fig. 10 strongly motivate the need for an adaptive method to control the number of concurrent CPU/GPU transfers, such as the one we have proposed in Sect. 5.1. The performance of the proposed algorithm is presented in Table 2, where it is compared to the best performance among all fixed numbers of CUDA streams for the 8800GT GPU. The results achieved by the dynamic algorithm are very close to the best static performance, and are within one standard deviation (around 1%) of the average static performance numbers. Finally, in Table 3, we also present the comparison of the dynamic algorithm to the best static configuration when using the Tesla C2050 GPU. As before, the execution times of the application when using the dynamic algorithm are very close to those of the best static choice.

6.2.2 NBIA

The improvement in NBIA execution time due to the use of asynchronous data copies is presented in Fig. 11, where the tile size is varied (but tiles are classified at the first resolution level). The results show that NBIA performance improves for all tile sizes. For tiles of 512 × 512 on an Intel/8800GT computing node, the data transfer overhead was reduced by 83%, resulting in a gain of roughly 20% in application performance, as shown in Fig. 11(a). When running the same experiment on the AMD/GTX260 machine (Fig. 11(b)), the performance improvement due to asynchronous copy was even higher, reaching 25%.

It is important to highlight that the gains achieved when using the proposed algorithm are not only relevant in terms of performance, but were achieved using an automated approach that does not require the developer to perform a tedious parameter search for each new application, parameter, or computing environment.

Fig. 11 NBIA: synchronous vs. asynchronous copy: (a) Intel/8800GT, (b) AMD/GTX260


6.3 Effect of intra-filter task assignment

This section shows how different intra-filter task assignment policies affect the performance of NBIA during large-scale execution of the application, with tiles of multiple resolutions concurrently active in the processing pipeline. These experiments were run using a fixed number of 26,742 tiles and two resolution levels: (32 × 32) and (512 × 512). We varied the tile recalculation rate (the percentage of tiles recalculated at the higher resolution) to show how it affects the performance of NBIA and our optimizations.

Figure 12 presents the speedup of NBIA using various system configurations and intra-filter task assignment policies: GPU-only, and GPU+CPU with the DDFCFS and DDWRR policies. For these experiments, DDFCFS significantly improves the application performance only for a tile recalculation rate of 0%. When no tiles are recalculated, both the CPU and the GPU process tiles of the smallest size, for which they have the same performance. Therefore, for this recalculation rate, the speedup is roughly 2 when adding a second device of any type. For reference, we also include the execution time of the CPU-only version of NBIA as the recalculation rate is varied in Table 4.


Fig. 12 Intel/8800GT: intra-filter assignment policies

Table 4 Intel/8800GT: execution time of the CPU-only version of NBIA as the recalculation rate is varied

Recalc. rate (%)     0     4     8     12     16     20
Exec. time (sec)     30    350   665   974    1287   1532

Table 5 Intel/8800GT: number of tiles processed by the CPU using 16% tile recalculation

              Tile size
              32 × 32    512 × 512
DDFCFS        1.52%      14.70%
DDWRR         84.63%     0.16%

When increasing the recalculation rate, however, the DDWRR policy still almost doubles the speedup of the GPU-only version, while DDFCFS achieves little or no improvement. For instance, with 16% tile recalculation, the GPU-only version of the application is 16.06 times faster than the CPU version, while using the CPU and GPU together achieved speedups of 16.78 and 29.79 for DDFCFS and DDWRR, respectively. The profile of the tasks processed by the CPU under each policy, shown in Table 5, explains the performance gap. When using DDFCFS, the CPU processed some tiles of both resolutions, while DDWRR schedules the majority of the low-resolution tiles to the CPU, leaving the GPU to focus on the high-resolution tiles, for which it is far faster than the CPU. The overhead due to the task assignment policy, including our on-line performance estimation, was negligible.

The performance of NBIA using the various configurations and intra-filter task assignment policies, as discussed previously, is presented for the AMD/GTX260 machine in Fig. 13. For these results, when compared to a single Intel/8800GT node, DDFCFS achieved better performance relative to the GPU-only version, as it was capable of improving the application performance for tile recalculation rates up to about 12%.

Fig. 13 AMD/GTX260: intra-filter assignment policies

These gains are due to the higher number of computing cores available in the AMD machine, such that even with an inefficient schedule, the CPUs were able to contribute effectively to the entire application execution. Moreover, DDWRR also performs much better than DDFCFS in most cases, and this performance gap was maintained as the tile recalculation rate increased. For instance, when the recalculation rate is 20%, DDWRR is 137% faster than the GPU-only version of NBIA.

6.4 Effect of tile resolutions on intra-filter task assignment

In the previous section, we showed the performance of the intra-filter task assignment policies for large-scale executions of NBIA, where tiles of multiple resolutions are concurrently active in the processing pipeline using two resolution levels: (32 × 32) and (512 × 512). In this section, we analyze the intra-filter task assignment policies as we vary the dimensions of the tiles at the highest resolution, keeping the lowest resolution fixed at (32 × 32) and using 26,742 tiles as before. The experiments shown in this and the next section were performed using only the Intel/8800GT machine.

Figure 14 shows the speedup, as compared to the CPU-based sequential version of NBIA, for different configurations of the application and task assignment policies: GPU-only, and GPU+CPU with DDFCFS and DDWRR. First of all, DDWRR significantly improved the performance of NBIA over the GPU-only version for all configurations, and it was always faster than DDFCFS. The gains of DDFCFS over the GPU-only version of the application decrease as the dimensions of the tile size at the second resolution increase. Thus, DDFCFS significantly improves upon the GPU-only execution only for second-resolution tile sizes of (64 × 64) and (128 × 128), while for the other configurations its gains are minimal.

The DDWRR policy, when compared to DDFCFS, was able to maintain higher gains for all configurations of tile sizes, being nearly 80% faster than DDFCFS for the two largest second-resolution tile sizes, (256 × 256) and (512 × 512).


Fig. 14 Intra-filter assignment policy performance as the tile size at the second resolution is varied. The first resolution level and tile recalculation rate are fixed, respectively, at (32 × 32) and 20%

Fig. 15 Intra-filter assignment: DDWRR gains over DDFCFS as the tile size at the second resolution is varied. The first resolution level and tile recalculation rate are fixed, respectively, at (32 × 32) and 20%

Figure 15 summarizes the gains of the DDWRR intra-filter scheduling policy over DDFCFS as the tile dimensions are varied.

In Fig. 16, we present the percentage of tiles processed by the CPU or the GPU for each tile dimension and intra-filter assignment policy. As shown in Fig. 16(a), when using DDFCFS the CPU was able to compute nearly 50% and 35% of the tasks for the two smallest second-resolution configurations, thus contributing to the overall execution of NBIA. For the other configurations of tile dimensions, however, the CPU processed few tasks and, consequently, the DDFCFS performance is almost the same as that of the GPU-only version.

The number of tasks processed by the CPU when using DDWRR (see Fig. 16(b)), however, is larger than with DDFCFS, and it increases as the resolution of the tiles at the second magnification level is increased. This difference in the number of tasks processed by the CPU explains the performance gap between the two policies. In DDFCFS, the CPU is not well exploited in most cases, resulting in small gains in performance over the GPU-only version of NBIA.

Fig. 16 Percent of tiles computed by each processor type according to the scheduling policy, as the tile size at the second resolution level is varied and the first resolution level is fixed at (32 × 32). The tile recalculation rate for all experiments is 20%. (a) DDFCFS; (b) DDWRR

DDWRR, on the other hand, allocates the appropriate tasks to each processor, and the CPU is able to contribute significantly to the overall execution.

6.5 Effect of inter-filter optimizations

In this section, we evaluate the proposed on-demand dynamic selective stream task assignment policy, ODDS. Our evaluation was conducted using two cluster configurations to understand both the impact of assigning tasks at the sender side and the capacity of ODDS to dynamically adapt the streamRequestsSize (the number of target data buffers necessary to keep processors busy with a minimum of load imbalance). The cluster configurations are: (i) a homogeneous cluster of 14 machines equipped with one CPU and one GPU, as described in the beginning of Sect. 6; and (ii) a heterogeneous cluster with the same 14 machines, but with 7 GPUs turned off. Thus, we have a cluster with heterogeneity among machines, where 7 nodes are CPU- and GPU-equipped machines and 7 nodes are dual-core CPU-only machines.


Table 6 Different demand-driven scheduling policies used in Sect. 6

Demand-driven        Area of        Queue policy                              Size of request
scheduling policy    effect         Sender              Receiver              for data buffers
DDFCFS               Intra-filter   Unsorted            Unsorted              Static
DDWRR                Intra-filter   Unsorted            Sorted by speedup     Static
ODDS                 Inter-filter   Sorted by speedup   Sorted by speedup     Dynamic

In Table 6 we present the three demand-driven policies (where consumer filters only get as much data as they request) used in our evaluation. All these scheduling policies maintain some minimal queue at the receiver side, such that processor idle time is avoided. Simpler policies like round-robin or random do not fit into the demand-driven paradigm, as they simply push data buffers down to the consumer filters without any knowledge of whether the data buffers are being processed efficiently. As such, we do not consider these to be good scheduling methods, and we exclude them from our evaluation.

The First-Come, First-Served (DDFCFS) policy simply maintains FIFO queues of data buffers on both ends of the stream, and a filter instance requesting data will get whatever data buffer is next out of the queue. The DDWRR policy uses the same technique as DDFCFS on the sender side, but sorts its receiver-side queue of data buffers by the relative speedup to give the highest-performing data buffers to each processor. Both DDFCFS and DDWRR use a static value for the number of data buffer requests during execution, which is chosen by the programmer. For ODDS, discussed in Sect. 5.3, the sender and receiver queues are sorted by speedup and the receiver's number of requests for data buffers is dynamically calculated at run-time.
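Read this way, the policies in Table 6 differ only in where the data-buffer queue is sorted and in how the request size is chosen. The Python fragment below is a hypothetical illustration of the sender side of that demand-driven exchange; the function, its parameters, and the speedup figures are ours, not part of the run-time's API.

```python
# Hypothetical sketch of the sender side of the demand-driven exchange
# summarized in Table 6; names and speedup figures are illustrative only.

def demand_driven_send(sender_queue, n_requested, node_gpu_speedup, sorted_sender):
    """Return up to n_requested data buffers for one requesting node.

    sender_queue      -- list of (estimated_gpu_speedup, buffer) at the producer
    node_gpu_speedup  -- ~1 for a CPU-only node, large for a GPU-equipped node
    sorted_sender     -- False for DDFCFS/DDWRR (FIFO), True for ODDS
    """
    if sorted_sender:
        # ODDS: give each node the buffers it handles most efficiently, i.e.
        # high-speedup buffers to GPU nodes, low-speedup buffers to CPU nodes.
        sender_queue.sort(key=lambda item: item[0],
                          reverse=(node_gpu_speedup > 1.0))
    granted = sender_queue[:n_requested]
    del sender_queue[:n_requested]
    return [buf for _, buf in granted]

# One GPU node and one CPU-only node pulling from the same stream:
queue = [(16.0, "512x512 tile A"), (1.1, "32x32 tile B"), (16.0, "512x512 tile C")]
print(demand_driven_send(queue, 1, node_gpu_speedup=16.0, sorted_sender=True))  # high-res tile
print(demand_driven_send(queue, 1, node_gpu_speedup=1.0, sorted_sender=True))   # low-res tile
```

The receiver-side counterpart is the speedup-ordered queue sketched earlier; ODDS applies the ordering at both ends and additionally adjusts the request size at run-time, as discussed below.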

6.5.1 Homogeneous cluster base case

This section presents the results of experiments run in the homogeneous cluster base case, which consists of a single CPU/GPU-equipped machine. In these experiments, we compared ODDS to DDWRR. DDWRR is the only one used for comparison because it achieved the best performance among the intra-filter task assignment policies (see Sect. 6.3). These experiments used NBIA with asynchronous copy and 26,742 image tiles with two resolution levels, as in Sect. 6.3, and the tile recalculation rate is varied.

The results, presented in Fig. 17, show that, perhaps surprisingly, even for one processing node ODDS could surpass the performance achieved by DDWRR. The gain of ODDS over DDWRR (both using asynchronous transfers) at a 20% tile recalculation rate, for instance, is around 23%. The improvements obtained by ODDS are directly related to its ability to better select the data buffers that maximize the performance of the target processing units. This occurs even for a single processing machine because the data buffers are queued at the sender side for both policies.

Fig. 17 Homogeneous base case evaluation

Fig. 18 Tiles processed by the CPU for each communication policy as the recalculation rate is varied

However, ODDS selects the data buffers that maximize the performance of all processors of the receiver, improving the ability of the receiver filter to better assign tasks locally.

Figure 18 presents the percentage of tasks processed by the CPU according to the communication policy and tile recalculation rate. As shown, DDFCFS is only able to process a reasonable number of tiles when the recalculation rate is 0%; its contribution to the overall execution is minimal in the other experiments. DDWRR and ODDS, on the other hand, both allow the CPU to compute a significant number of tiles for all values of the recalculation rate, which directly explains the performance gap between them and DDFCFS.


Fig. 19 Heterogeneous base case evaluation

Finally, the same figure shows that with ODDS the CPU processes a higher number of tiles than with DDWRR (for instance, 18% when the tile recalculation rate is 20%). Consequently, ODDS is faster even for one machine, as discussed before.

6.5.2 Heterogeneous cluster base case

The demand-driven stream task assignment policies are evaluated in this section in a heterogeneous environment, where the base case consists of two computing nodes: the first equipped with one CPU and one GPU, and the second being a dual-core CPU-only machine. Figure 19 presents the speedups for each stream policy as the tile recalculation rate is varied.

When comparing the results for the homogeneous cluster base case vs. the heterogeneous cluster base case, shown in Figs. 17 and 19, respectively, notice that DDFCFS achieves slightly better performance with the additional dual-core CPU of the second computing node, as does DDWRR. The performance of ODDS, however, increased significantly. For instance, at an 8% recalculation rate, DDWRR and ODDS achieve, respectively, 23x and 25x the performance of a single CPU-core on the homogeneous base case cluster; on two heterogeneous nodes, DDWRR's speedup slightly increases to 25, while the speedup of ODDS increases to 44.

To understand how the computation is distributed in these experiments, we next present in Table 7 the profile of the data buffers processed by the GPU under each stream policy, at an 8% tile recalculation rate. For the homogeneous base case experiments, the performance difference between DDFCFS and the other two policies occurred because in the DDFCFS scheme the CPU did not collaborate significantly in the execution: 92-98% of both the low resolution and the high resolution tiles are processed by the GPU, leaving little for the CPU to do. The DDWRR and ODDS algorithms, however, show a preference to give the GPU the vast majority of the high resolution buffers, and save the majority of the low resolution buffers for the CPU.

Fig. 20 Understanding performance of stream policies with static streamRequestsSize values. (a) Best streamRequestsSize: number of data buffer requests received by a filter; (b) CPU utilization for DBSA-only and DDWRR, using a 10% recalculation rate and the best value of streamRequestsSize


Comparing the profiles with one and two nodes also helps in understanding the performance impact of adding an extra CPU-only node. The DDFCFS performance gains are simply because in the configuration with two nodes the CPU was able to process a slightly higher proportion of tiles at both resolutions. The DDWRR scheduling scheme, on the other hand, could not efficiently utilize the second node. As shown in Table 7, under DDWRR the GPU processed almost the same number of low resolution tiles and a few more high resolution tiles than when using only one machine. When the ODDS approach is utilized, since the decision about where each buffer should be sent is made initially at the sender, ODDS was able to intelligently utilize the additional CPU-only node for the processing of the remaining low resolution tiles as well as a few high resolution tiles.

An important factor in the preceding heterogeneous base case experiments is the choice of the number of data buffer requests that maximizes performance. In Fig. 20(a), we show the number of requests that gives the best execution time for each stream policy and tile recalculation rate.


Table 7 Percent of tiles processed by the GPU at each resolution/stream policy

                  Homogeneous base case            Heterogeneous base case
Scheduling        DDFCFS    DDWRR    ODDS          DDFCFS    DDWRR    ODDS
Low res. (%)      98.16     17.07    6.98          84.85     16.72    0
High res. (%)     92.42     96.34    97.89         85.67     92.92    97.62

These values were determined via exhaustive search. For policies with a static number of requests, the programmer is responsible for determining this parameter.

The DDWRR stream approach achieved better performance with a higher number of requests, as it is important for this stream policy to have a large number of data buffers on the input queue of the CPU/GPU machines in order to create opportunities for intra-filter scheduling. DDFCFS, on the other hand, had better performance with a smaller streamRequestsSize because it results in less load imbalance among the computing nodes. For both DDFCFS and DDWRR, the best performance was achieved in a configuration where processor utilization is not maximal during the whole execution. For these policies, leaving processors idle may be better than requesting a high number of data buffers and generating load imbalance among filter instances at the end of the application's execution.

Figure 20(b) shows the processor utilization for DDWRR and DBSA at the sender side, with the best fixed value of streamRequestsSize. As shown, the better performance of this stream communication policy is achieved in a configuration where the hardware is not fully utilized. This observation, as discussed before, motivated the development of DQAA to dynamically calculate the appropriate value of streamRequestsSize for each computing node according to its characteristics. DQAA and DBSA were then coupled to create the ODDS stream communication policy for heterogeneous environments, first proposed in the earlier version of this work [37].
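The exact rule DQAA uses to size the request window is not reproduced here; the following sketch only illustrates the general idea under an assumed rule: keep just enough buffers in flight to cover one request round trip at the node's recently observed consumption rate, within fixed bounds. The function name, parameters, and bounds are hypothetical.

```python
# Assumed illustration of a DQAA-style computation of streamRequestsSize:
# request only as many buffers as the node can consume during one request
# round trip, so fast nodes stay busy and slow nodes do not hoard work.
# This is our reading of the idea, not the formula used by the run-time.

def stream_requests_size(observed_rate, request_latency, min_size=1, max_size=64):
    """observed_rate   -- data buffers per second recently consumed by this node
    request_latency -- seconds between issuing a request and receiving buffers"""
    in_flight = observed_rate * request_latency     # work consumed per round trip
    return max(min_size, min(max_size, int(in_flight) + 1))

# A CPU-only node slowed down by high-resolution buffers shrinks its window,
# while a GPU-equipped node keeps a larger one:
print(stream_requests_size(observed_rate=2.0, request_latency=0.5))    # -> 2
print(stream_requests_size(observed_rate=40.0, request_latency=0.5))   # -> 21
```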

In contrast to both DDFCFS and DDWRR, ODDS can adapt the number of data requests according to the processing rate of each receiver filter instance, giving it the ability to better utilize the processors during the whole execution (see Fig. 21(a)). To show this, Fig. 21(b) presents how ODDS changes the streamRequestsSize dynamically in one execution of the NBIA application. This experiment used a 10% recalculation rate. As expected, the streamRequestsSize varies as the execution proceeds, adapting to the slack in each queue. The effect is especially dramatic at the end of the execution, where there is a build-up of higher-resolution data buffers to be computed. It is this build-up of higher-resolution data buffers (and their longer processing times on the CPU-only machine) that causes DQAA to reduce the number of requests of the CPU-only machine, reducing the load imbalance among the computing nodes at the tail end of the application's execution.

Fig. 21 ODDS execution in detail. (a) CPU utilization for ODDS; (b) ODDS data request size


6.5.3 Scaling homogeneous and heterogeneous cases

Finally, in this section, we show NBIA performance as the number of machines is increased for each cluster type. The scaling of the homogeneous cluster was done by simply increasing the number of nodes, while for the heterogeneous cluster 50% of the computing nodes are equipped with both a CPU and a GPU and the other 50% are without GPUs. The experiments were performed using 267,420 image tiles with two resolution levels, as before. We used an 8% tile recalculation rate, and the speedups are calculated with respect to a single CPU-core version of the application.


Fig. 22 Scaling homogeneous base case

Fig. 23 Scaling heterogeneous base case

In Fig. 22, we first present the NBIA speedups on the homogeneous cluster for four configurations of the application: GPU-only, and GPU and CPU working collaboratively under the three stream policies presented in Table 6: DDFCFS, DDWRR, and ODDS. The results show that DDFCFS could not improve the application performance much over the GPU-only version of the application. DDWRR, on the other hand, doubled the performance of the GPU-only configuration. Further, ODDS was 15% faster than DDWRR even in this homogeneous environment, as it can better choose which data buffers to send to requesting processors thanks to its knowledge of downstream processors' performance characteristics.

The experimental results of increasing the number of machines in the heterogeneous environment are presented in Fig. 23. ODDS again showed the best performance among the stream policies, but now it almost doubled the performance achieved by DDWRR. Once again, the DDWRR and DDFCFS results shown for each number of machines are those with the manually-determined best number of buffer requests, while ODDS determined its values dynamically at run-time.

The speedup achieved by ODDS when using 14 heterogeneous nodes is also four times higher than that of seven GPU-only machines, showing that significant gains can be achieved by mixing heterogeneous computing nodes. The ODDS task assignment policy's better scalability is due to its ability to adapt the streamRequestsSize to the available hardware, reducing inefficiency due to load imbalance and processor under-utilization. Also, by targeting data buffers that maximize the performance of the downstream consumers, ODDS can dynamically maximize processor performance over the whole range of data buffers in the application.

Indeed, by using a GPU cluster equipped with 14 nodes and by using ODDS to coordinate the execution of the application, the execution time of NBIA for an 8% tile recalculation rate is reduced from over 11 minutes to just over 2 seconds. Certainly, much of the speedup is due to the GPUs, but the GPU-only version of the application takes nearly 5 seconds to achieve the same analysis results. In real-world scenarios, many of these biomedical images will need to be analyzed together, and gains in performance such as those offered by ODDS can bring tangible benefits.

7 Conclusions

In this paper, we presented and evaluated several run-time optimizations for filter-stream applications on heterogeneous environments. These optimizations have the capacity to improve performance in such systems by reducing the overhead due to data transfers between CPUs and GPUs, and by coordinating appropriate and collaborative task execution on CPUs and GPUs.

The experimental results for a complex, real-world biomedical image analysis application, which exhibits data-dependent processing speedups, show that our optimizations reduce the total overhead due to data transfers between the CPU and the GPU by up to 83%. Also, appropriate coordination between CPUs and GPUs doubles the performance of the GPU-only version of the application by simply adding a single CPU-core. Moreover, our proposed ODDS stream policy provides the developer with a straightforward way to make efficient use of both homogeneous and heterogeneous clusters, arguing for the use of heterogeneous processing nodes. The proposed performance estimator was evaluated for several applications, showing good relative performance prediction, whereas estimating execution times directly gave poor results.

The task assignment policy proposed in this paper allocates the entire GPU to each internal filter task it has to process. As future work, we intend to consider the concurrent execution of multiple tasks on the same GPU to exploit filters' intrinsic data parallelism. This may not only be a source of performance optimization, but could also ease the development of GPU kernels, since partitioning each task among the GPU's execution units would be obviated. The performance estimator is another focus for future work,


where we plan to evaluate more sophisticated model learning algorithms and to use the current prediction model in other contexts.

Acknowledgements This work was partially supported by CNPq, CAPES, Fapemig, and INWeb; by the DOE grant DE-FC02-06ER2775; by the AFRL/DAGSI Ohio Student-Faculty Research Fellowship RY6-OSU-08-3; by the NSF grants CNS-0643969, OCI-0904809, OCI-0904802 and CNS-0403342; and by computing time from the Ohio Supercomputer Center.

References

1. Arpaci-Dusseau, R.H., Anderson, E., Treuhaft, N., Culler, D.E., Hellerstein, J.M., Patterson, D., Yelick, K.: Cluster I/O with River: making the fast case common. In: IOPADS '99: Input/Output for Parallel and Distributed Systems (1999)
2. Augonnet, C., Thibault, S., Namyst, R., Wacrenier, P.A.: StarPU: a unified platform for task scheduling on heterogeneous multicore architectures. In: Euro-Par '09: Proceedings of the 15th International Euro-Par Conference on Parallel Processing, pp. 863–874 (2009)
3. Berman, F.D., Wolski, R., Figueira, S., Schopf, J., Shao, G.: Application-level scheduling on distributed heterogeneous networks. In: Supercomputing '96: Proceedings of the 1996 ACM/IEEE Conference on Supercomputing, p. 39 (1996)
4. Beynon, M., Ferreira, R., Kurc, T.M., Sussman, A., Saltz, J.H.: DataCutter: middleware for filtering very large scientific datasets on archival storage systems. In: IEEE Symposium on Mass Storage Systems, pp. 119–134 (2000)
5. Beynon, M.D., Kurc, T., Catalyurek, U., Chang, C., Sussman, A., Saltz, J.: Distributed processing of very large datasets with DataCutter. Parallel Comput. 27(11), 1457–1478 (2001)
6. Bhatti, N.T., Hiltunen, M.A., Schlichting, R.D., Chiu, W.: Coyote: a system for constructing fine-grain configurable communication services. ACM Trans. Comput. Syst. 16(4), 321–366 (1998)
7. Buck, I., Foley, T., Horn, D., Sugerman, J., Fatahalian, K., Houston, M., Hanrahan, P.: Brook for GPUs: stream computing on graphics hardware. ACM Trans. Graph. 23(3), 777–786 (2004)
8. Catalyurek, U., Beynon, M.D., Chang, C., Kurc, T., Sussman, A., Saltz, J.: The virtual microscope. IEEE Trans. Inf. Technol. Biomed. 7(4), 230–248 (2003)
9. Fahringer, T., Zima, H.P.: A static parameter based performance prediction tool for parallel programs. In: ICS '93: Proceedings of the 7th International Conference on Supercomputing, pp. 207–219 (1993)
10. Fix, E., Hodges, J.: Discriminatory analysis, nonparametric discrimination, consistency properties. Computer science technical report, School of Aviation Medicine, Randolph Field, Texas (1951)
11. Hartley, T.D., Catalyurek, U.V., Ruiz, A., Ujaldon, M., Igual, F., Mayo, R.: Biomedical image analysis on a cooperative cluster of GPUs and multicores. In: 22nd ACM Intl. Conference on Supercomputing (2008)
12. He, B., Fang, W., Luo, Q., Govindaraju, N.K., Wang, T.: Mars: a MapReduce framework on graphics processors. In: Parallel Architectures and Compilation Techniques (2008)
13. Hoppe, H.: View-dependent refinement of progressive meshes. In: SIGGRAPH 97 Proc., pp. 189–198 (1997). http://research.microsoft.com/hoppe/
14. Hsu, C.H., Chen, T.L., Li, K.C.: Performance effective pre-scheduling strategy for heterogeneous grid systems in the master slave paradigm. Future Gener. Comput. Syst. (2007)
15. Iverson, M., Ozguner, F., Follen, G.: Parallelizing existing applications in a distributed heterogeneous environment. In: 4th Heterogeneous Computing Workshop (HCW'95) (1995)
16. Kerbyson, D.J., Alme, H.J., Hoisie, A., Petrini, F., Wasserman, H.J., Gittings, M.: Predictive performance and scalability modeling of a large-scale application. In: Supercomputing '01: Proceedings of the 2001 ACM/IEEE Conference on Supercomputing, pp. 37–37 (2001)
17. Kurc, T., Lee, F., Agrawal, G., Catalyurek, U., Ferreira, R., Saltz, J.: Optimizing reduction computations in a distributed environment. In: SC '03: Proceedings of the 2003 ACM/IEEE Conference on Supercomputing, p. 9 (2003)
18. Lee, S., Min, S.J., Eigenmann, R.: OpenMP to GPGPU: a compiler framework for automatic translation and optimization. In: PPoPP '09: Proceedings of the 14th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, pp. 101–110 (2009)
19. Linderman, M.D., Collins, J.D., Wang, H., Meng, T.H.: Merge: a programming model for heterogeneous multi-core systems. ACM SIGPLAN Not. 43(3), 287–296 (2008)
20. Low, S., Peterson, L., Wang, L.: Understanding TCP Vegas: a duality model. In: Proceedings of ACM Sigmetrics (2001)
21. Luk, C.K., Hong, S., Kim, H.: Qilin: exploiting parallelism on heterogeneous multiprocessors with adaptive mapping. In: 42nd International Symposium on Microarchitecture (MICRO) (2009)
22. Maes, F., Vandermeulen, D., Suetens, P.: Comparative evaluation of multiresolution optimization strategies for multimodality image registration by maximization of mutual information. Med. Image Anal. 3(4), 373–386 (1999)
23. NVIDIA: NVIDIA CUDA SDK (2007). http://nvidia.com/cuda
24. O'Malley, S.W., Peterson, L.L.: A dynamic network architecture. ACM Trans. Comput. Syst. 10(2) (1992)
25. Patkar, N., Katsuno, A., Li, S., Maruyama, T., Savkar, S., Simone, M., Shen, G., Swami, R., Tovey, D.: Microarchitecture of HAL's CPU. In: IEEE International Computer Conference, p. 259 (1995)
26. Ramanujam, J.: Toward automatic parallelization and auto-tuning of affine kernels for GPUs. In: Workshop on Automatic Tuning for Petascale Systems (2008)
27. Rocha, B.M., Campos, F.O., Plank, G., dos Santos, R.W., Liebmann, M., Haase, G.: Simulations of the electrical activity in the heart with graphic processing units. Accepted for publication in Eighth International Conference on Parallel Processing and Applied Mathematics (2009)
28. Rosenfeld, A. (ed.): Multiresolution Image Processing and Analysis. Springer, Berlin (1984)
29. Ruiz, A., Sertel, O., Ujaldon, M., Catalyurek, U., Saltz, J., Gurcan, M.: Pathological image analysis using the GPU: stroma classification for neuroblastoma. In: Proc. of IEEE Int. Conf. on Bioinformatics and Biomedicine (2007)
30. Sancho, J.C., Kerbyson, D.J.: Analysis of double buffering on two different multicore architectures: quad-core Opteron and the Cell-BE. In: International Parallel and Distributed Processing Symposium (IPDPS) (2008)
31. Sertel, O., Kong, J., Shimada, H., Catalyurek, U.V., Saltz, J.H., Gurcan, M.N.: Computer-aided prognosis of neuroblastoma on whole-slide images: classification of stromal development. Pattern Recognit. 42(6) (2009)
32. Shimada, H., Ambros, I.M., Dehner, L.P., Ichi Hata, J., Joshi, V.V., Roald, B.: Terminology and morphologic criteria of neuroblastic tumors: recommendation by the International Neuroblastoma Pathology Committee. Cancer 86(2) (1999)
33. Song, F., YarKhan, A., Dongarra, J.: Dynamic task scheduling for linear algebra algorithms on distributed-memory multicore systems. In: SC '09: Proceedings of the Conference on High Performance Computing Networking, Storage and Analysis (2009)
34. Sundaram, N., Raghunathan, A., Chakradhar, S.T.: A framework for efficient and scalable execution of domain-specific templates on GPUs. In: IPDPS '09: Proceedings of the 2009 IEEE International Symposium on Parallel and Distributed Processing, pp. 1–12 (2009)
35. Tavares, T., Teodoro, G., Kurc, T., Ferreira, R., Guedes, D., Meira, W.J., Catalyurek, U., Hastings, S., Oster, S., Langella, S., Saltz, J.: An efficient and reliable scientific workflow system. In: IEEE International Symposium on Cluster Computing and the Grid, pp. 445–452 (2007)
36. Teodoro, G., Fireman, D., Guedes, D. Jr., Ferreira, R.: Achieving multi-level parallelism in filter-labeled stream programming model. In: The 37th International Conference on Parallel Processing (ICPP) (2008)
37. Teodoro, G., Hartley, T.D.R., Catalyurek, U., Ferreira, R.: Run-time optimizations for replicated dataflows on heterogeneous environments. In: Proc. of the 19th ACM International Symposium on High Performance Distributed Computing (HPDC) (2010)
38. Teodoro, G., Sachetto, R., Fireman, D., Guedes, D., Ferreira, R.: Exploiting computational resources in distributed heterogeneous platforms. In: 21st International Symposium on Computer Architecture and High Performance Computing, pp. 83–90 (2009)
39. Teodoro, G., Sachetto, R., Sertel, O., Gurcan, M. Jr., Catalyurek, U., Ferreira, R.: Coordinating the use of GPU and CPU for improving performance of compute intensive applications. In: IEEE Cluster (2009)
40. Teodoro, G., Tavares, T., Ferreira, R., Kurc, T., Meira, W., Guedes, D., Pan, T., Saltz, J.: Run-time support for efficient execution of scientific workflows on distributed environments. In: International Symposium on Computer Architecture and High Performance Computing, Ouro Preto, Brazil (2006)
41. Vrsalovic, D.F., Siewiorek, D.P., Segall, Z.Z., Gehringer, E.F.: Performance prediction and calibration for a class of multiprocessors. IEEE Trans. Comput. 37(11) (1988)
42. Welsh, M., Culler, D., Brewer, E.: SEDA: an architecture for well-conditioned, scalable internet services. SIGOPS Oper. Syst. Rev. 35(5), 230–243 (2001)
43. Woods, B., Clymer, B., Saltz, J., Kurc, T.: A parallel implementation of 4-dimensional Haralick texture analysis for disk-resident image datasets. In: SC '04: Proceedings of the 2004 ACM/IEEE Conference on Supercomputing (2004)

George Teodoro is currently a Postdoc at The University of Maryland at College Park in the Department of Computer Science. His research interests include runtime systems for data-intensive computing, execution of dataflow computing in modern parallel environments, efficient execution of dataflows under online variable demands, and auto-tuning for scientific applications. He received his Ph.D., M.S. and B.S. in Computer Science from Universidade Federal de Minas Gerais, Brazil, in 2010, 2006 and 2004, respectively.

Timothy D.R. Hartley is a Ph.D. candidate in the Department of Electrical and Computer Engineering at The Ohio State University. His research interests involve high-performance runtime systems and heterogeneous architectures. He received his M.S. from The Ohio State University in 2006 and his B.S. from New Mexico State University in 2002. He is currently a Fellow of the Dayton Area Graduate Studies Institute's AFRL/DAGSI Ohio Student-Faculty Research Fellowship Program.

Umit V. Catalyurek is an Associate Professor in the Department of Biomedical Informatics at The Ohio State University, and has a joint faculty appointment in the Department of Electrical and Computer Engineering. His research interests include combinatorial scientific computing, runtime systems for data-intensive computing, and high-performance computing in biomedicine. He received his Ph.D., M.S. and B.S. in Computer Engineering and Information Science from Bilkent University, Turkey, in 2000, 1994 and 1992, respectively.

Renato Ferreira is an associate professor in the Department of Computer Science at Universidade Federal de Minas Gerais. His research focuses on compiler and run-time support for high performance computing and large, dynamic datasets. It involves both high performance, an important issue from the application end-users' perspective, and high-level programming abstractions, important for the application domain developers. His work is currently targeting support for knowledge discovery in very large online datasets. Dr. Ferreira obtained his Ph.D. from the University of Maryland, College Park in 2001.