
Entropy-based Scheduling Policy for Cross Aggregate Ranking Workloads

Chengcheng Dai, Sarana Nutanong, Member, IEEE, Chi-Yin Chow, Member, IEEE, and Reynold Cheng, Member, IEEE

Abstract—Many data exploration applications require the ability to identify the top-k results according to a scoring function. We study a class of top-k ranking problems where top-k candidates in a dataset are scored with the assistance of another set. We call this class of workloads cross aggregate ranking. Example computation problems include evaluating the Hausdorff distance between two datasets, finding the medoid or radius within one dataset, and finding the closest or farthest pair between two datasets. In this paper, we propose a parallel and distributed solution to process cross aggregate ranking workloads. Our solution subdivides the aggregate score computation of each candidate into tasks while constantly maintaining the tentative top-k results as an uncertain top-k result set. The crux of our proposed approach lies in our entropy-based scheduling technique, which determines result-yielding tasks based on their ability to reduce the uncertainty of the tentative result set. Experimental results show that our proposed approach consistently outperforms the best existing one in two different types of cross aggregate ranking workloads using real datasets.

Index Terms—Query processing, Knowledge and data engineering tools and techniques.


1 INTRODUCTION

MANY data exploration applications require the ability to identify the top-k results according to a scoring function. Well-known computational problems include nearest neighbor search [1], Hausdorff distance computations [2], [3], and medoid computations [4], [5]. Processing a top-k query generally involves scoring and ranking [6], [7], [8], [9], i.e., applying a scoring function to a set of candidates and ranking them in ascending or descending order to identify the top-k. For example, the k-nearest neighbor query can be processed by scoring candidate objects according to their proximity to a given reference point and identifying the k nearest candidates through a ranking mechanism.

Our investigation focuses on a class of top-k ranking problems where top-k candidates in one dataset are scored with the assistance of another set. For example, the directional Hausdorff distance between two sets of objects is the worst-case nearest neighbor distance from one set to another. Notable applications which make use of the Hausdorff distance include

• service coverage measurement: the greatest distance between any customer and their nearest warehouse;
• surface approximation error: the worst-case discrepancy between any point on an approximated surface and the nearest point on the original surface.

As shown in Figure 1, when computing the directional Hausdorff distance from set X to set Y, we may consider each object in X as a candidate, while the scoring function is given by $\min_{y \in Y} D(x, y)$.

• C. Dai, S. Nutanong and C. Chow are with the Department of Computer Science, City University of Hong Kong, Hong Kong SAR. Email: {chengcdai2-c@my., snutanon@, chiychow@}cityu.edu.hk

• R. Cheng is with the Department of Computer Science, University of Hong Kong, Hong Kong SAR. Email: [email protected]

The overall result is the maximum of minimum distances, i.e.,

$d_h = \max_{x \in X} \min_{y \in Y} D(x, y).$

We call X the candidate set and Y the scoring set. A simple method to compute the directional Hausdorff distance from X to Y is the nested-loop join method, which (i) traverses the Cartesian product (cross-join) X × Y, (ii) computes the aggregate score for each x in X, and (iii) ranks the candidates accordingly. As a result, we name this workload type cross aggregate rank.

Fig. 1. Directional Hausdorff Distance from X to Y
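The following is a minimal sketch of this nested-loop baseline for the directional Hausdorff distance (max of min). The Euclidean distance function and the toy point sets are illustrative assumptions, not data from the paper.

import math

def D(p, q):
    # Euclidean distance between two points (an assumed metric)
    return math.dist(p, q)

def directional_hausdorff(X, Y):
    # score every candidate x with the min aggregate over the scoring set Y,
    # then rank the candidates by taking the maximum of those scores
    return max(min(D(x, y) for y in Y) for x in X)

X = [(0.0, 0.0), (1.0, 2.0), (4.0, 1.0)]   # candidate set (toy data)
Y = [(0.5, 0.5), (3.0, 3.0)]               # scoring set (toy data)
print(directional_hausdorff(X, Y))

This baseline evaluates every pair in X × Y, which is exactly the cost that the rest of the paper tries to avoid.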

Note that in some cases, we want argmin or argmax instead of the actual aggregate value. For example, the medoid computation involves computing the data object $x^*_M$ that minimizes the total distance to other entries in the same set, i.e.,

$x^*_M = \operatorname{argmin}_{x_c \in X} \sum_{x_s \in X} D(x_c, x_s).$

Another useful generalization is the top-k query. That is, instead of finding the absolute minimum or maximum, we obtain the top-k to mitigate the effect of outliers.

We can see that both Hausdorff distance and medoid computations can be defined by two parts: ranking and


aggregation. For the Hausdorff distance, the ranking part corresponds to descending ordering (max), while the aggregation part is min. For the medoid computation, the ranking part corresponds to ascending ordering (min), while the aggregation part is sum. Other popular query types that can be expressed in the format of (ranking, aggregation) include (i) the closest pair problem (min, min) [10] and (ii) the farthest pair problem (max, max) [11].

These query types are widely used in scientific data exploration applications. In such cases, a data user may access the data interactively. For example, a data user executes a query to compare the discrepancy between two datasets. Based on the result, they may decide whether to explore other similar datasets. As a result, we need an efficient solution to enable scientific data users to execute cross aggregate ranking queries and obtain results in a timely fashion.

An off-the-shelf solution to speed up cross aggregate ranking computations is to exploit the data parallelism of the problem. We can process different parts of X × Y in a distributed manner using a main memory data parallel framework [12], [13]. Specifically, we partition the scoring set Y into blocks and distribute them across different data processing nodes (termed workers in this paper). In this way, when computing the score for each object x in the candidate set X, each scoring block corresponds to a partial aggregation task which returns a partial aggregation result, i.e., the aggregate distance from x to a scoring block in Y. The aggregate score of x is obtained by combining the partial aggregation results from different blocks. The query result can be obtained by ranking the candidates according to their scores.

The main advantage of this approach is that a significant degree of parallelism can be obtained by processing each partial aggregation task (or simply task when the context is clear) independently. However, this approach needs to evaluate D(x, y) for all (x, y) ∈ X × Y, which is a drawback when the datasets are large or when the distance function is computationally intensive.

The branch-and-bound principle is a widely used technique to process queries that involve ranking [14], [15]. A branch-and-bound algorithm generally involves (i) organizing the solution space in a hierarchical fashion; and (ii) using a heuristic function to guide the solution space exploration process and to prune parts of the solution space that cannot yield the result.

Our objective is to combine (i) the scalability and speed benefits of parallel and distributed processing with (ii) the selective exploration benefit of the branch-and-bound approach. Our main challenge here is that, traditionally in a branch-and-bound algorithm, the exploration decisions are made sequentially by following the hierarchical organization of the solution space. To address this challenge, we devise a novel scheduling method to prioritize multiple result-yielding partial aggregation tasks so that they can be processed in parallel.

Recall that we partition the candidate set X and the scoring set Y according to proximity. This organization allows us to derive useful best-case and worst-case estimates for each partial aggregation task. In this way, the score of each candidate is considered as an uncertain value bounded within a range [16].

The crux of our method lies in the mapping between partial aggregation task scheduling and uncertain data cleaning [17], [18]. Since overlaps between ranges are the source of uncertainty in the top-k results, making a scheduling decision involves identifying partial aggregation tasks that reduce uncertainty-contributing overlaps. We consider this process to be equivalent to cleaning uncertain candidates in a probabilistic database. This mapping allows us to quantify the top-k uncertainty as entropy and prioritize partial aggregation tasks according to their entropy reduction potentials.

We conducted an experimental study comparing our proposed solution with methods adapted from the best-first search principle, using two different workload types and two different real datasets. Experimental results show that our method consistently outperforms the benchmark methods. Specifically, our method consistently obtains a speedup of one order of magnitude in comparison to the best competitor for the Hausdorff distance computation workload on both datasets.

In summary, our contributions are as follows.

• A main-memory parallel and distributed approach to processing cross aggregate ranking workloads
• A novel technique to determine result-yielding partial aggregation tasks based on the entropy
• A novel scheduling policy based on the mapping between partial aggregation task scheduling and uncertain data cleaning, which continuously assesses the entropy contribution of each partial aggregation task
• An extensive experimental study to evaluate the performance of our proposed method

The rest of this paper is organized as follows. Section 2 provides related work discussions. In Section 3, we describe our proposed parallel and distributed query processing framework, which processes cross aggregate rank queries in an online fashion. In Section 4, we derive our novel entropy-based scheduling policy, which provides entropy contribution assessments for partial aggregation tasks. Section 5 presents experimental results. Section 6 concludes the paper.

2 RELATED WORK

Recall that our objective is to combine the benefits of parallel and distributed processing and the branch-and-bound approach. For parallel and distributed processing, a considerable amount of research attention has been dedicated to enhancing the MapReduce framework for different applications. Doulkeridis et al. [19] provide a comprehensive survey of different types of MapReduce variants and enhancements. In this section, we discuss those that are related to our work. For the branch-and-bound approach, we discuss query processing techniques that are related to join processing. Besides, we also discuss techniques for probabilistic top-k queries.

2.1 MapReduce Variants and Enhancements

MapReduce [20] is a parallel, distributed programming model for processing large datasets. To implement a data parallel program in MapReduce, one has to specify two stages of operations, Map and Reduce. Due to its popularity,


the framework has been adopted for many types of query processing problems.

Sampling and result approximation. Online aggregation [21], [22] is a technique generally used to provide early results by incrementally aggregating Map results as they become available. Pansare et al. [22] propose a MapReduce modification which enables a user to monitor the accuracy in real time and to stop the computation once they are satisfied with the accuracy. Laptev et al. [23] apply a statistical method called bootstrapping [24] to estimate the result accuracy and use it as a termination condition. Agarwal et al. [25] propose a parallel approximate query engine which allows a user to provide time bounds and/or error bounds. The query engine maintains a set of multi-dimensional samples and estimates the sample size based on the speed or accuracy requirements imposed by the user.

In our proposed method, we employ the concept of online aggregation [21], [22] to report approximate scores for the candidates in order to produce preliminary ranking results. However, instead of computing approximate results from sampling, our work applies the branch-and-bound principle to compute the lower and upper bounds of the score for each candidate. In this way, we are able to compute exact results without scanning the entire solution space.

Ranking and selective access. One of the main drawbacks of MapReduce is that the Map stage requires access to the entire dataset. This restriction can be considered undesirable when running a query with filtering conditions, e.g., compute the average height of male students older than 15. In Hadoop++ [26], this problem is addressed by introducing a cover index for the actual data content. A Map operation may use the cover index to determine which part of the covered data block needs to be accessed.

Another type of query that may benefit from the ability to select parts of the dataset to access is the top-k query [19]. Techniques in this area are concerned with sorting data access to facilitate early production of the top-k results [27] and deriving a termination condition by estimating the worst-case top-k score [28].

In terms of indexing, our proposed method organizes each dataset, which is a set of objects in a metric space, into blocks. Each block is represented by its center and radius, as well as the number of objects. We can consider this summarized representation as a cover index [26], which is used to determine whether the block should be examined, or how it should be prioritized for examination. For top-k processing, our work is focused on reducing the score calculation cost rather than minimizing the sorting and ranking effort.

Join processing. Join processing in MapReduce can be supported by modifying the input reader so that it accepts input from multiple sources [29]. After input reading, joining two datasets can be done during either the Map stage (map-side join) or the Reduce stage (reduce-side join), depending on how the join keys of the two datasets are located [30]. The map-side join is possible when one of the datasets (preferably the smaller one) is duplicated across different cluster nodes, or when the join keys of the two datasets are colocated. The reduce-side join is generally used when the map-side join is not possible. The method uses the MapReduce shuffle mechanism to deliver tuples with

the same join key to the same reducer. Studies on join processing in MapReduce are largely concerned with optimizing the query performance with respect to different join query types, such as equi-join [31], theta-join [32], similarity join [33], [34], k-NN join [35] and top-k join [28].

Comparison to our work. For the cross aggregate ranking problem studied in this paper, there is no predicate defining the part of the Cartesian product to be evaluated. As a result, query processing techniques for equi-join [31], theta-join [32], and similarity join [33], [34] are not applicable to our query processing problem.

The problem of joining two datasets without explicit joining conditions is studied by Candan et al. [28] as the top-k join. A top-k join between two sets R and S produces the top-k pairs (r, s) from R × S according to a scoring function f(r, s). This query type can be considered a special case of our cross aggregate ranking query problem under the condition that the ranking function and the aggregate function are of the same type, i.e., (min, min) or (max, max). The query processing method proposed by Candan et al. [28] relies on this condition, which may not hold for the cross aggregate ranking query. For example, the Hausdorff distance computation problem has max as the ranking function and min as the aggregate function. Hence, we cannot use their proposed method to solve our query processing problem.

2.2 Branch-and-bound Algorithm Design Principle

Branch-and-bound search [14], [15], [36] is a widely used technique to process queries that involve ranking. A branch-and-bound algorithm generally involves (i) organizing the solution space in a hierarchical fashion; and (ii) using a heuristic function to guide the process of finding the top-k results. In this subsection, we focus our discussions on branch-and-bound methods in join processing: closest pair (min, min), farthest pair (max, max), and Hausdorff distance (max, min).

For the closest pair (min, min) and farthest pair (max, max) problems [11], a single priority queue is used to guide the search process. This is because in these problems, the optimization goal for exploring X and the optimization goal for exploring Y are aligned and can be merged into one, e.g., $\operatorname{argmin}_{x \in X, y \in Y} D(x, y)$ for the closest pair.

Comparison to our work. When the optimization goals are not aligned, e.g., (max, min) or (min, max), the search process involves determining whether to explore the hierarchical structure of X or the hierarchical structure of Y. Nutanong et al. [3] apply a balanced approach to ensure that X and Y are equally explored. However, the drawbacks of this solution are that (i) it only works with (max, min) or (min, max) and does not solve the cross aggregate rank problem in general; and (ii) it heavily relies on priority queues, which makes the solution strictly sequential.

2.3 Probabilistic Top-k Queries

Due to the rapid increase in the amount of uncertain data collected from multiple data sources, analyzing a large volume of uncertain data has become an important problem [16]. Compared to certain data, the fundamental challenge is how to handle the uncertainty in the dataset. Query


processing techniques on uncertain and probabilistic data can be classified according to the following two types: local certainty and global certainty. In a local certainty setting, each query answer is independent of other objects (e.g., range query [37], [38]). In a global certainty setting, each query answer depends on other objects (e.g., top-k [39], [40], [41], [42], skyline [43]) due to the ranking/comparative nature of the query.

Various types of top-k queries have been studied, including U-Topk, U-kRanks [39], [41], [42] and probabilistic threshold top-k queries (i.e., PT-k queries) [44]. Well-known solutions range from Poisson binomial recurrence [41] to spatial and probabilistic pruning techniques [42]. The intuition behind these solutions is to enumerate and prune possible answers systematically. In addition to the discussed exact algorithms [39], [41], [42], sampling-based approximation methods [44] use a sample to estimate the probabilities, while the approximation quality is guaranteed by applying the Chernoff-Hoeffding bound. Probabilistic top-k queries in a distributed setting are studied in [45], [46], [47]. In this case, queries must be answered in a communication-efficient manner on probabilistic data from multiple, distributed sites such as sensor networks. Sampling is adopted to decide the expected ranking for distributed objects in order to reduce the communication cost.

Comparison to our work. While the discussed techniques can handle uncertain data efficiently, they are not designed for data-intensive cross-join aggregate workloads. Existing probabilistic top-k processing methods are inapplicable to our problem. In this paper, we propose a method which applies the concept of data cleaning [18] in a probabilistic database in order to quantify the ambiguity of answers in the approximate/tentative top-k result set.

3 PROPOSED SOLUTION

3.1 Basic Design Principles

Our proposed solution is based on the branch-and-bound (BB) algorithm design principle. The principle is generally applied to discrete optimization problems that can be decomposed into a search tree. For example, the best-first search algorithm computes an optimistic estimate of each search tree branch; the search process prioritizes branches with the best estimates. In order to adapt the BB principle to our cross aggregate ranking problem, we have to address the issues of data partitioning and branch estimation.

3.1.1 Data partitioning

We propose adopting a metric tree data structure, such as the M-tree [48], VP-trees [49], and MVP-trees [50], to organize the dataset into blocks according to the triangle inequality. For each dataset, we construct an M-tree in order to use the leaf nodes as blocks. In this way, each block Yi is represented by its center Yi.c, radius Yi.r and size Yi.n.

Other indexes such as the k-D tree [51] and R-tree [52] are also suitable for spatial data. For those indexes that do not provide an exact center and radius, Yi.c and Yi.r could be obtained by loose estimations based on block information. In this study, we consider metric space structures to be more suitable than vector space structures for the following reasons. The main drawback of structures like the minimum bounding rectangle (MBR) and the quad-tree bounding box is that the number of corners we need to consider increases exponentially as the number of dimensions increases. On the other hand, the M-tree uses the hypersphere as its bounding structure, making the process of calculating distance upper and lower bounds dimensionality-independent. Furthermore, constructing an M-tree bounding hypersphere requires only a metric distance function. This allows us to index generic metric space datasets, e.g., datasets measured by the Levenshtein distance, while the R-tree is suitable for vector space datasets.

For our application, the metric space structures M-tree [48], VP-trees [49], and MVP-trees [50] produce similar results. Since data partitioning is not the main focus of this investigation, we choose the M-tree due to its simplicity.

3.1.2 Branch estimation

The score of each candidate x in a candidate set X is computed by applying an aggregate function (max, min, sum, count, average) to the distances between x and entries in a scoring set Y, i.e.,

$S(x) = \operatorname{agg}_{y \in Y} D(x, y).$

Assume that the scoring set Y is partitioned into Y1, ..., Ym. Estimating S(x) can be done by aggregating the estimates of each scoring block in Y. For the aggregate function min, S_min(x) is guaranteed to be in the range

$[\min_{Y_i} \{D(x, Y_i.c) - Y_i.r\},\ \min_{Y_i} D(x, Y_i.c)]. \quad (1)$

The same principle can also be applied to the max aggregate function.

We call this type of estimate a point-to-block estimate, since it is the aggregate distance from a candidate x in the candidate set X to a scoring block in the scoring set Y. Figure 2 shows an example of how point-to-block estimates can be computed. Given that the aggregate function is min, the lower and upper point-to-block estimates obtained from expression (1) are [2, 3.5] for the distance from x2 to Y1, [2.5, 3] for x2 to Y2, and [2.75, 3.75] for x2 to Y3.

Fig. 2. An example of point-to-block distance.
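A small sketch of these point-to-block estimates for the min aggregate follows. It assumes each scoring block is summarized by a center c and radius r, with the center itself being a data object (as in the M-tree), so that D(x, c) is an attainable distance; the distance function and the toy blocks are illustrative.

import math

def D(p, q):
    # metric distance; Euclidean is assumed here for illustration
    return math.dist(p, q)

def point_to_block_range(x, block):
    # per-block estimate: the nearest object in the block is at least
    # D(x, c) - r away and at most D(x, c) away (the center itself)
    d = D(x, block["c"])
    return max(d - block["r"], 0.0), d

def min_score_range(x, blocks):
    # combine the per-block ranges into the bounds of S_min(x), expression (1)
    ranges = [point_to_block_range(x, b) for b in blocks]
    return min(lo for lo, _ in ranges), min(hi for _, hi in ranges)

blocks = [{"c": (0.0, 0.0), "r": 1.5}, {"c": (3.0, 1.0), "r": 0.5}]  # toy blocks
print(min_score_range((2.0, 2.0), blocks))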

It is also possible to apply the same principle to the sum aggregate function. That is, we can multiply the minimum distance and maximum distance by the number


of data entries to get the lower estimate and upper estimate, respectively:

$[(D(x, Y_i.c) - Y_i.r) \cdot Y_i.n,\ (D(x, Y_i.c) + Y_i.r) \cdot Y_i.n].$

Tighter bounds can be obtained by profiling the scoring block Yi using an equi-depth distance histogram. Specifically, we organize the data objects in Yi into equal-sized buckets, where each bucket bj is represented by bj.d, the distance from the farthest object in the bucket to the center Yi.c, and bj.n, the number of objects in the bucket. When computing the aggregate score, we consider these buckets as concentric blocks and aggregate the individual upper and lower bound estimates accordingly. As a result, the aggregate score from the candidate x to the objects in the block Yi is estimated as

$\Big[\sum_j (D(x, Y_i.c) - b_j.d) \cdot b_j.n,\ \sum_j (D(x, Y_i.c) + b_j.d) \cdot b_j.n\Big].$

This method provides tighter estimates than the basic (non-histogram) method, since inner buckets are guaranteed to have a smaller radius than the entire block. Furthermore, the estimate calculation does not incur any additional distance computation, since all buckets share the same center.

Figure 3 provides a comparison between the non-histogram and histogram methods. In both sub-figures (Figures 3(a) and (b)), the distance from the candidate x to the block center is 10 and the block radius is 5.5. For the basic method, the lower and upper point-to-block estimates are [45, 155]. For the histogram method, assume that the maximum bin size is set to 2 and we create a bucket for the center by itself. We obtain the lower and upper point-to-block estimates as [66.5, 133.5]. As can be seen, the histogram method reduces the range by 39.1% in comparison to the basic method.

Fig. 3. Using an equi-depth distance histogram for additive aggregate functions: (a) basic (non-histogram); (b) histogram.
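The following sketch contrasts the basic and histogram-based bounds for the sum aggregate. The bucket layout used to reproduce the [66.5, 133.5] figure is an assumed reading of Figure 3(b); everything else follows the expressions above.

def sum_score_range_basic(dist_to_center, radius, n):
    # basic bounds: every object pushed to the near or far edge of the block
    return (dist_to_center - radius) * n, (dist_to_center + radius) * n

def sum_score_range_histogram(dist_to_center, buckets):
    # buckets: list of (b_d, b_n) pairs sharing the block center, so no extra
    # distance computation is needed beyond dist_to_center
    lower = sum((dist_to_center - b_d) * b_n for b_d, b_n in buckets)
    upper = sum((dist_to_center + b_d) * b_n for b_d, b_n in buckets)
    return lower, upper

# Figure 3 example: distance to center 10, block radius 5.5, 10 objects
print(sum_score_range_basic(10, 5.5, 10))                     # (45.0, 155.0)
buckets = [(0, 1), (2, 2), (3, 2), (4, 2), (5, 2), (5.5, 1)]  # assumed layout
print(sum_score_range_histogram(10, buckets))                 # (66.5, 133.5)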

3.2 Parallel and distributed computing

For scalability, our solution also utilizes parallel and distributed computing resources. A straightforward parallel and distributed solution to this problem is to apply the block nested-loop join technique [30] to scan the entire cross-join space and compute the aggregate score of each candidate in parallel. However, as previously stated, one of our objectives is to reduce the search space. In order to achieve this, we need to introduce the following capabilities to our framework.

• Task prioritization. In a branch-and-bound search algorithm, it is beneficial to prioritize branches that are more result-yielding than others. Hence, our framework needs to support a scheduler which assesses tasks according to their ability to get us closer to the final result.

• Incremental aggregation. We need the ability to monitor the tentative upper and lower estimates of different candidates as partial aggregation tasks get processed. In this way, we can identify candidates that have the potential to be in the result set.

• Early termination. We need the ability to check whether we have obtained all query results so that the computation can terminate before exhausting the cross-join space.

3.3 System Overview

In this section, we propose a master-slave framework for processing cross aggregate ranking workloads based on the design principles discussed in the previous subsection. Specifically, as shown in Figure 4, our proposed system consists of the following components to derive tentative/approximate top-k results in an incremental manner: 1) Online candidate scoring table (OCST): maintaining partial aggregation results for each candidate; 2) Task scheduler: scheduling and prioritizing partial aggregation tasks; 3) Worker pool: processing partial aggregation tasks; 4) Candidate maintenance and termination check module: maintaining a list of active candidates and checking whether the termination condition has been satisfied.

Fig. 4. System overview

3.3.1 Online candidate scoring table

We maintain candidate information in a data structure called the Online Candidate Scoring Table (OCST), which contains the following fields: 1) CandidateID: reference to a data object in the candidate set X; 2) ScoringBlockID: reference to a block in the scoring set Y; 3) Upper Bound: upper bound of the actual score for this entry; 4) Lower Bound: lower bound of the actual score for this entry; 5) Status: indication of whether the entry is untouched, has been scheduled for processing, or has already been processed. Each OCST entry corresponds to a partial aggregation task of computing the aggregate distance (partial aggregation result) from a candidate to a scoring block.

This organization allows us to separately maintain estimates for each candidate and gives us the ability to selectively evaluate any partial aggregation task from any candidate.
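A minimal sketch of one OCST entry is shown below; the dataclass representation and field types are illustrative, not the paper's implementation (the authors store the OCST in an RDBMS, as described in Section 5).

from dataclasses import dataclass

@dataclass
class OCSTEntry:
    candidate_id: int       # reference to a data object in the candidate set X
    scoring_block_id: int   # reference to a block in the scoring set Y
    lower_bound: float      # lower bound of the partial aggregation result
    upper_bound: float      # upper bound of the partial aggregation result
    status: str = "untouched"  # "untouched", "scheduled", or "processed"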


Each OCST entry is initialized with point-to-block upper and lower estimates, which will be replaced by a single value once processed. In Figure 4, given the same X and Y as in Figure 2, the OCST entries associated with x2 are initialized as [2, 3.5], [2.5, 3] and [2.75, 3.75]. The final aggregate score for each candidate is computed by applying the target aggregate function to the individual upper and lower bounds, i.e., creating a view with the aggregate function. Given min as the aggregate function, since 2 and 3 are respectively the minimum lower bound and the minimum upper bound, the aggregate score of x2 is [2, 3]. As a result, we can always derive tentative or approximate top-k results from the OCST. As more partial aggregation tasks get processed, the differences between the upper and lower bounds of candidates become smaller and the tentative top-k results become more accurate.

3.3.2 Worker pool

The scheduled tasks are processed by a pool of workers, which are long-running processes distributed over different nodes. For performance, datasets are stored in main memory. Specifically, each worker is associated with a number of scoring blocks, which are loaded into main memory upon initialization. Each worker is allowed to accept only partial aggregation tasks with a matching scoring block. For example, worker 1 in Figure 4 can process only tasks associated with scoring blocks Y1 or Y2. As a result, each worker needs to be provided with only the candidate data to process a task. The result of each partial aggregation task is the aggregate distance from the candidate to the scoring block.

3.3.3 Candidate maintenance and termination check module

Each partial aggregation result, i.e., the exact aggregate distance from x to Yi, is returned to the master asynchronously. Upon receiving a partial aggregation result, the status of the corresponding OCST entry is updated to "processed", and the lower and upper bounds of the aggregate distance from x to Yi are updated to the partial aggregation result.

Using the information from the OCST, the system periodically checks each candidate in order to separate out candidates that need not be further examined. In particular, we separate candidates into the following three categories. 1) Positive: a positive candidate is guaranteed to be included in the top-k result set. Assuming ranking in ascending order, a candidate is considered positive when there are more than n − k candidates that have a lower bound greater than the candidate's upper bound. 2) Negative: a negative candidate is guaranteed to be excluded from the top-k result set. Assuming ranking in ascending order, a candidate is considered negative when there are more than k candidates that have an upper bound smaller than the candidate's lower bound. 3) Uncertain: this is the initial status for all candidates.
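A sketch of this periodic classification follows, directly encoding the rules above for ascending ranking; `cands`, which maps each candidate id to its current aggregate score range (lower, upper), is an illustrative structure.

def classify(cands, k):
    n = len(cands)
    status = {}
    for cid, (lo, hi) in cands.items():
        # count candidates guaranteed to score worse / better than this one
        worse = sum(1 for o, (olo, _) in cands.items() if o != cid and olo > hi)
        better = sum(1 for o, (_, ohi) in cands.items() if o != cid and ohi < lo)
        if worse > n - k:
            status[cid] = "positive"    # guaranteed inside the top-k
        elif better > k:
            status[cid] = "negative"    # guaranteed outside the top-k
        else:
            status[cid] = "uncertain"
    return status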

3.3.4 Task scheduler

The main objective of the task scheduler is to select "untouched" partial aggregation tasks in the OCST. However, tasks are not equally important. Some partial aggregation

tasks contribute more to the score of a candidate, and some candidates require more consideration than other candidates in order to determine whether they should be included in or excluded from the top-k result set.

Partial aggregation tasks have the potential to reduce the range of the aggregate score, except when a task's lower bound is already greater than the upper bounds of the other tasks. Consider the three tasks in Figure 2 as an example. All the tasks have the potential to reduce the range of the aggregate score S(x2), and we cannot prune any of them. The scheduler also updates the status of each scheduled OCST entry to "scheduled" upon selection.

Our framework allows users to define their own scheduling policy. The challenge of task scheduling is how to identify result-yielding partial aggregation tasks to be processed next. In the next section, we discuss our proposed entropy-based scheduling policy, as well as comparative scheduling policies.

4 SCHEDULING POLICIES

In this section, we formulate our proposed entropy-based scheduling policy. We also describe the comparative scheduling policies that we use to benchmark our proposed method.

4.1 Problem Modeling

We now describe how our problem can be mapped to uncertain data cleaning [17], [18]. Recall that the OCST continuously maintains the ranges of candidates' scores and organizes them into three categories: positive, negative, and uncertain. The main objective of a scheduler is to identify partial aggregation tasks that will help reduce the number of uncertain candidates. In particular, these candidates are considered uncertain because their score ranges overlap each other in such a way that each of them still retains a possibility of being included in the top-k result set. By processing partial aggregation tasks associated with these uncertain candidates, their ranges and the overlaps between them become smaller, while the termination check module incrementally moves candidates out of the uncertain category.

The intuition behind our proposed scheduling policy is to quantify the uncertainty in the OCST so that we can attempt to reduce it objectively. We consider an overlap between the score ranges of two or more candidates to be the source of uncertainty. In particular, we consider the actual score of each candidate as a random variable that may assume a value anywhere in the candidate's score range; an overlap between two score ranges introduces uncertainty in how the corresponding two candidates should be ranked. By using the entropy as an uncertainty measurement, we can assess the potential of each partial aggregation task to reduce the uncertainty in the result set in a quantitative fashion.

Algorithm 1 provides a summary of the proposed scheduling policy. The algorithm assesses the entropy contribution of each uncertain candidate x based on how much its score range overlaps with those of other uncertain candidates. For each uncertain candidate, we assess the entropy contribution (line 2). We then use the result to compute the


entropy contribution (i.e., the entropy reduction potential) of each partial aggregation task γ (lines 3-4). The scheduler then uses this information to try to maximize the total entropy contribution for a given query processing budget C, which can be represented as the number of distance calculations.

Algorithm 1: Entropy-based adaptive scheduling
input : OCST, Budget C
output: Scheduled tasks

1  while # Positives is smaller than k do
2      foreach Uncertain Candidate x in OCST do
3          Compute the entropy contribution of x;
4          foreach Task γ associated with Candidate x do
5              Compute the entropy contribution of γ;
6      Schedule tasks that maximize the total entropy contribution under a given budget C;

4.2 Result Set Entropy

We adopt the probabilistic database concept [16], [18], [53] and treat the score S(x) of each candidate x as a random variable that may assume any value in a given score range [S_lower, S_upper] with a certain probability distribution. Figure 5 shows four candidates x1, x2, x3 and x4 in a top-2 descending-order ranking query. According to the definition given in Section 3, x1 is positive, x4 is negative, and the remaining candidates, x2 and x3, are uncertain. Hence, we have two possibilities for the result set: {x1, x2} and {x1, x3}.

Which of the two outcomes occurs depends on whether the final score of x2 is greater or smaller than that of x3. Hence, the entropy can be calculated by assessing Pr[S(x2) ≤ S(x3)] and Pr[S(x2) > S(x3)], where S(x) denotes the final score of x. Specifically, the entropy Q of the top-k query result can be computed as

$Q = -\sum_{r \in R} \Pr(r) \log \Pr(r), \quad (2)$

where R denotes the set of all possible top-k outcomes, i.e., {x1, x2} and {x1, x3} in this example.
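For this two-outcome example, Equation (2) reduces to the binary entropy. Writing p as shorthand for the probability that x2 outranks x3 (a symbol introduced here for illustration only):

% Binary-entropy instantiation of Equation (2) for the outcomes
% {x1, x2} and {x1, x3}
Q = -\,p \log p - (1 - p)\log(1 - p),
\qquad p = \Pr\left[S(x_2) > S(x_3)\right].

The uncertainty is largest when p = 1/2 and vanishes when either outcome becomes certain.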

4.2.1 Background: x-form method

Computing the entropy by enumerating all possible outcomes is a prohibitively expensive process due to the exponential complexity with respect to the number of uncertain candidates. In order to avoid such an expensive computation, we adopt the x-form [18] for entropy calculations in a probabilistic top-k processing problem. In an x-tuple database, each candidate x is a random variable which may assume a value in a discrete set of real numbers. Each value v is also associated with an existential probability Pr[x = v]. In this way, each candidate x is represented as a set of tuples t = (v, e = Pr[x = v]). Cheng et al. [18] showed that the entropy of a probabilistic top-k query (when represented using the described x-form) can be assessed by evaluating the entropy contribution of each tuple t = (v, e) from the set T of all tuples in the database:

$Q = \sum_{t=(v,e) \in T} \omega(t, T)\, \mathrm{TopKProb}(t), \quad (3)$

where ω(t, T) is the weight of the tuple t = (v, e) and TopKProb(t) is the probability that the candidate x associated with t = (v, e) is a top-k result given that (x = v).

Specifically, ω(t, T) can be computed using Equation 4:

$\omega(t = (v, e), T) = \log e + \frac{1}{e}\Big(Z\Big(1 - \sum_{(v', e') \in prcd(t, T)} e'\Big) - Z\Big(1 - \sum_{(v', e') \in prcd(t, T)} e' + e\Big)\Big), \quad (4)$

where Z(a) is defined as a log a and prcd(t, T) represents the set of tuples t' in T associated with the same candidate as t that are ranked higher than t under the top-k ordering function. Note that ω(t = (v, e), T) can be evaluated in an incremental fashion [18].

The x-form method [18] computes TopKProb(t) using the PSR (probabilistic similarity ranking) method [54]. The function TopKProb(t) is the total probability of the candidate x (associated with t) taking any rank h between 1 and k, which can be evaluated by adding up the probability of each h value. Specifically, let ρ_h(t = (v, e)) denote the probability of the associated candidate x being ranked h-th given that (x = v). Then

$\mathrm{TopKProb}(t) = \sum_{h=1}^{k} \rho_h(t). \quad (5)$

The exact definition and derivation of ρ_h(·) is given by Bernecker et al. [54].

4.2.2 Mapping data cleaning to our problem

The x-tuple entropy calculation method is not directly applicable to our probabilistic database problem setting described earlier in this section. Specifically, the x-form requires tuples to be rankable. That is, for any given tuples (vi, ei) and (vj, ej), if xi = vi and xj = vj, then we can determine with absolute certainty that (xi < xj), (xi = xj), or (xi > xj). However, the score S(x) of a candidate x is a random variable that may assume any value in a continuous range [S_lower, S_upper] with a given probability distribution [53], [55], [56]. As a result, the score ranges of candidates may overlap and hence are not rankable. In order to mitigate this ranking uncertainty problem, we need to partition each range into subranges in such a way that each subrange either does not overlap with any other subrange or coincides exactly with one or more other subranges. In this way, subranges can be ordered in a similar fashion to x-form tuples, where each tuple represents the score lower bound lv, upper bound uv, and the existential probability Pr[S(x) ∈ [lv, uv]].

Figure 5 shows four candidates {x1, x2, x3, x4} ranked in descending order according to their maximum values. In Figure 5, since the ranking function is max, the optimistic and pessimistic estimates of a candidate's score correspond to its upper bound and lower bound, respectively. In order to identify the worst cases of the top-k result set, we sort the candidates according to their lower bounds and then identify the lower bounds of the candidates x1 and x2 (i.e., E1 and E2). Similarly, we identify the best cases of the candidates outside the tentative top-k result set, i.e., the (n − k) lowest-ranked candidates. The upper bounds of these candidates (x3 and


x4) are marked as E3 and E2. These end points E1, E2 and E3 are used to partition the score range of each candidate into non-overlapping subranges, as shown in Table 1. Similar partitioning techniques are used in [53], [55], [56]. Table 1 also shows how uncertainty is introduced by the subranges that fall inside the worst case of the tentative top-k result set (e.g., E2) and the best case of the candidates outside the top-k result set (e.g., E3)^1.

Fig. 5. Descending-order top-k candidate ranking (k = 2). The score ranges are x1 = [3.5, 3.5] (positive), x2 = [2, 3] (uncertain), x3 = [1, 2.5] (uncertain) and x4 = [0.5, 2] (negative).

The next step is to assess the existential probability of each subrange [lv, uv] associated with each candidate x. Without loss of generality, we assume that the probability distribution of S(x) is uniform for ease of exposition. In this case, the existential probability of [lv, uv] is proportional to the subrange width (uv − lv) relative to the score range of x. For example, t2 occupies 50% of the score range of x2. As a result, its existential probability is 0.5. The existential probabilities of the other subranges are given in Table 1.

TABLE 1
Existential probabilities of the subranges in Figure 5.

Candidate   Candidate Status   Tuple   Subrange     Prob.
x1          Positive           t1      [3.5, 3.5]   1.0
x2          Uncertain          t2      [2.0, 2.5]   0.5
                               t3      [2.5, 3.0]   0.5
x3          Uncertain          t4      [1.0, 2.0]   0.66
                               t5      [2.0, 2.5]   0.34
x4          Negative           t6      [0.5, 2.0]   1.0
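A small sketch of this partitioning step follows, producing (lower, upper, probability) tuples under the uniform-distribution assumption used above; the function name and the explicit breakpoint list are illustrative.

def split_into_tuples(lo, hi, breakpoints):
    # cut the score range [lo, hi] at the given end points and assign each
    # subrange an existential probability proportional to its width
    cuts = [lo] + sorted(b for b in breakpoints if lo < b < hi) + [hi]
    width = hi - lo
    if width == 0:                       # degenerate range, e.g. x1 in Fig. 5
        return [(lo, hi, 1.0)]
    return [(a, b, (b - a) / width) for a, b in zip(cuts, cuts[1:])]

# x2 from Figure 5: range [2, 3] cut at E2 = 2.0 and E3 = 2.5
print(split_into_tuples(2.0, 3.0, [2.0, 2.5]))   # [(2.0, 2.5, 0.5), (2.5, 3.0, 0.5)]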

The way in which we represent S(x) as a set of tuples is now compatible with the x-form. This enables us to apply Equation 4 to compute the weight and Equation 5 to compute the top-k probability of each tuple. After obtaining the weight and top-k probability of each tuple, we can apply Equation 3 at the candidate level to compute the entropy contribution Q_x of each candidate x. That is,

$Q_x = \sum_{t=(v,e) \in T_x} \omega(t, T)\, \mathrm{TopKProb}(t), \quad (6)$

where T represents the set of all tuples and T_x represents the set of tuples associated with x.

4.3 Entropy Contribution of Each Task

We are now ready to compute the entropy contribution of each task, which will be used for scheduling. Note that if we process all tasks associated with the same candidate x, S(x) is reduced to a single value and the entropy contribution Q_x becomes 0. As a result, the total entropy contribution of all tasks associated with x is equal to Q_x.

1. When ranking tuples occupy the same interval, we use the optimistic estimates of candidate scores as the tiebreaker.

The entropy contribution of each individual task is calculated based on its potential to reduce the score range of x. The way in which the score range of x is computed depends on the nature of the aggregate function, i.e., ranking or additive. As a result, we present two methods for assessing task entropy contributions as follows.

4.3.1 Ranking aggregate function

When the aggregate function is a ranking function (e.g., min), we compute the expected entropy contribution of each task γ of x as

$Q_\gamma = \mathrm{AggValProb}(\gamma) \times Q_x, \quad (7)$

where AggValProb(γ) is the probability that the aggregate value of x is equal to the partial aggregation result v_γ that the partial aggregation task γ provides. It is equivalent to the probability that v_γ is smaller than those of the other partial aggregation tasks associated with the same candidate.

Before processing, each partial aggregation task is represented as a range [lγ, uγ] of possible partial aggregation results. To compute AggValProb(γ) for each task, we partition each task range in a similar fashion to how we partition the score ranges of candidates in Section 4.2.2 (with k = 1).

Figure 6 shows three tasks associated with the same candidate x2, ranked in ascending order according to their lower bounds. The upper bound of γ2, i.e., E1, represents the worst case of what is inside the top-k, while the lower bounds of γ2 and γ3, i.e., E2 and E3, represent the best cases of what is outside the top-k. The partial aggregation result v_γ1 of task γ1 is partitioned into t1, t2, t3 and t4; v_γ2 is partitioned into t5 and t6; v_γ3 is partitioned into t7 and t8. We apply the same method used to calculate the existential probabilities of candidate subranges (Table 1) to evaluate the existential probabilities of the task partitions. The existential probability of each task partition is shown in Table 2.

Fig. 6. Entropy reduction potential of each task for the min aggregate function. The task ranges for candidate x2 are γ1 = (x2, Y1): [2, 3.5], γ2 = (x2, Y2): [2.5, 3] and γ3 = (x2, Y3): [2.75, 3.75]; the candidate's score range is [2, 3].

We can see that for each task tuple ti, either ti does not overlap with any other task tuple (e.g., t1), or ti overlaps with tj in such a way that their subranges occupy the exact same intervals (e.g., {t2, t5} and {t3, t6, t7}). Note that t4 and t8 are excluded from the entropy contribution evaluation, since they are non-result-yielding, i.e., their subranges lie outside the worst case of S(x2). Again, we assume the uniform distribution here for ease of exposition.


TABLE 2
Existential probabilities of task subranges from Figure 6.

Task = (Candidate, Scoring Block)   Task Tuple   Task Subrange   Existential Prob.
γ1 = (x2, Y1)                       t1           [2.00, 2.50]    0.33
                                    t2           [2.50, 2.75]    0.17
                                    t3           [2.75, 3.00]    0.17
                                    t4           [3.00, 3.50]    0.33
γ2 = (x2, Y2)                       t5           [2.50, 2.75]    0.5
                                    t6           [2.75, 3.00]    0.5
γ3 = (x2, Y3)                       t7           [2.75, 3.00]    0.25
                                    t8           [3.00, 3.75]    0.75

At this point, we can apply the PSR method [54] to compute the rank-1 probability ρ1(·) of each task tuple t associated with the same task γ. The top-1 probability of each task γ is computed as

$\mathrm{AggValProb}(\gamma) = \sum_{t \in T_\gamma} \rho_1(t),$

where T_γ denotes the set of task tuples associated with the task γ. Consider γ2 in Figure 6 as an example. The rank-1 probabilities of the task tuples t5 and t6 are ρ1(t5) = 0.25 and ρ1(t6) = 0.17, respectively. Hence,

$\mathrm{AggValProb}(\gamma_2) = \rho_1(t_5) + \rho_1(t_6) = 0.42.$

Finally, the entropy contribution of γ2 can be computed using Equation 7.

4.3.2 Additive aggregate function

When the aggregate function is additive (e.g., sum, count, or average), we compute the expected entropy contribution of each task γ of x according to how much γ contributes to the current score range of x. For the sum aggregate function, the score range of x is evaluated by adding up the lower bounds and upper bounds of all tasks associated with x. As a result, the entropy contribution Q_γ of each task γ is proportional to its width relative to the score range of x and to the candidate's entropy contribution Q_x, i.e.,

$Q_\gamma = \frac{u_\gamma - l_\gamma}{S_{upper}(x) - S_{lower}(x)} \times Q_x, \quad (8)$

where [S_lower(x), S_upper(x)] is the score range of x.

Figure 7 illustrates candidate x3 with three tasks γ1, γ2 and γ3 with ranges of [1, 7], [2, 6] and [3, 5], respectively. These ranges correspond to the score range [6, 18]. After processing γ1, its width becomes 0 and the width of the aggregate value shrinks from 12 to 6 units. As a result, the entropy contribution of γ1 is 50% of the candidate's entropy contribution.

Fig. 7. Entropy reduction potential of each task for the sum aggregate function. Candidate x3 has tasks γ1 = (x3, Y1), γ2 = (x3, Y2) and γ3 = (x3, Y3), with S_lower(x3) = 6 and S_upper(x3) = 18.
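Plugging the Figure 7 numbers into Equation (8) reproduces the 50% figure stated above:

% Worked instantiation of Equation (8) for task gamma_1 of candidate x_3
Q_{\gamma_1} = \frac{u_{\gamma_1} - l_{\gamma_1}}{S_{upper}(x_3) - S_{lower}(x_3)} \times Q_{x_3}
             = \frac{7 - 1}{18 - 6} \times Q_{x_3} = 0.5\, Q_{x_3}.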

4.4 Scheduling as an Optimization Problem

The objective of our entropy-based scheduler is to find the set Γ*_C of tasks with the greatest entropy reduction potential for a given computational budget C. As stated earlier, we use the total entropy contribution of the scheduled tasks as our entropy reduction measurement. Specifically, assume that the current entropy of the uncertain candidates is Q. After processing the tasks in Γ*_C, the new entropy Q′ is expected to be reduced by the total entropy contribution of the tasks in Γ*_C. That is,

$Q' = Q - \sum_{\gamma \in \Gamma^*_C} Q_\gamma,$

where Q_γ is given by Equation 7 for ranking aggregate functions and by Equation 8 for additive aggregate functions.

The budget C is the total computational allowance for processing the scheduled tasks. Specifically, the processing cost is defined as the number of distance calculations, which is equivalent to the number of data objects in the scoring blocks. The optimization constraint requires that the total computational cost may not exceed C. As a result, this optimization problem can be modeled as a 0/1 knapsack problem defined as follows.

Definition 1. Let Γ denote the set of all tasks {γ1, ..., γL}; let c_i and Q_γi denote the cost and entropy contribution of γ_i, respectively.

$\text{Maximize} \quad \sum_{i=1}^{L} b_i \cdot Q_{\gamma_i} \quad (9)$

$\text{Subject to} \quad \sum_{i=1}^{L} b_i \cdot c_i \leq C \quad \text{and} \quad b_i \in \{0, 1\} \quad (10)$

The 0/1 knapsack problem [57] can be solved using dynamic programming in pseudo-polynomial time. In this paper, we adopt a heuristic approach due to efficiency considerations. Specifically, we formulate a greedy heuristic solution that computes the benefit-cost ratio β_i of each task γ_i, i.e.,

$\beta_i = \frac{Q_{\gamma_i}}{c_i}.$

All tasks are ranked according to the benefit-cost ratio. We then select the top-ranked tasks whose total cost does not exceed C.
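A minimal sketch of this greedy heuristic follows; `tasks` is an illustrative list of (task_id, entropy_contribution, cost) triples rather than the paper's actual data structure.

def greedy_schedule(tasks, budget):
    # rank tasks by benefit-cost ratio beta_i = Q_gamma_i / c_i
    ranked = sorted(tasks, key=lambda t: t[1] / t[2], reverse=True)
    selected, spent = [], 0
    for task_id, q, cost in ranked:
        if spent + cost <= budget:   # keep adding tasks while the budget C allows
            selected.append(task_id)
            spent += cost
    return selected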

4.5 Budget C

As stated earlier, the budget C represents the total amount of computational work submitted to the workers. Recall that each task corresponds to the aggregate distance computation over a scoring block with respect to a candidate object. As a result, the number of scoring objects in the scoring block directly corresponds to the amount of work (i.e., the cost ci) of the task γi.

In order to ensure that a decent degree of parallelism is achieved, C should be large enough to keep all workers active. However, recall that the entropy contribution of each task is an estimate computed based on the current state of the tentative result set. A smaller C allows the entropy estimates to be updated more often, which results in greater entropy estimation accuracy. Therefore, its value is related


to the number of workers Wn and how fast each worker consumes a task. Empirically, we set C to a constant multiple of the number of workers; a good trade-off between parallelism and accuracy is obtained by setting C to 10 · Wn. Ideally, the budget C should be set through a feedback control mechanism that dynamically adjusts C by considering the degree of parallelism and the estimate accuracy. One way is to increase the C value if the utilization drops below a threshold and decrease the C value if the utilization is well above the threshold.
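A sketch of such a feedback rule is shown below; the step factor and utilization thresholds are illustrative assumptions, not values reported in the paper.

def adjust_budget(C, utilization, low=0.7, high=0.9, step=1.25, C_min=1):
    # grow C when workers are starving, shrink it when estimates risk going stale
    if utilization < low:
        return int(C * step)
    if utilization > high:
        return max(C_min, int(C / step))
    return C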

4.6 Comparative Scheduling Policies

In order to benchmark the effectiveness of the proposed entropy-based scheduling method, we prepare three comparative scheduling policies, namely best scoring block first (BBF), best candidate first (BCF) and random (RD).

Best scoring block first (BBF). For each candidate, BBF selects the best or most result-yielding task. For a ranking aggregate function, the best task is the most optimistic one. Consider the min aggregate function as an example. The best task for each candidate is the one with the smallest lower bound in comparison to the other tasks associated with the same candidate. Note that we may select the second and third best tasks of each candidate if the budget C allows.

Best candidate first (BCF). BCF is considered a candidate-level best-first search. That is, we select the most optimistic candidate first. Consider descending-order ranking as an example. The most optimistic candidate is the one with the greatest upper bound. We schedule all the tasks from the most optimistic candidates within a given budget C.

Random (RD). RD randomly selects unscheduled tasks to process within a given budget C.

5 PERFORMANCE EVALUATION

In this section, we evaluate the performance of our entropy-based scheduling (ES) method through experiments. We first describe our experiment settings, and then analyze the experimental results.

5.1 Experiment Settings

We used the PostgreSQL RDBMS to manage the online candidate scoring table (OCST) and the termination check module. All scheduling policies and workers were implemented in Python. All experiments were conducted on Amazon EC2 c3.8xlarge instances with 32 virtual CPUs and 60 GB of memory. The operating system is Ubuntu Server 14.04 LTS (HVM). The performance evaluation consists of studies on the effect of the following parameters: the number of workers, the result set size k, the duplication factor and recall. The default value and range of each parameter are given in Table 3.

TABLE 3
Parameter table.

Parameter            Default value   Range
Number of workers    8               [1, 2, 4, 6, 8, 10]
k                    15              [5, 10, 15, 20, 25]
Duplication factor   3               [1, 2, 3, 4, 5]
Recall               1.0             [0.2, 0.4, 0.6, 0.8, 1.0]

We present results from two different workload types, namely ranking-ranking and ranking-additive. For ranking-ranking, we use the Hausdorff distance computation (MaxMin) as our workload. For ranking-additive, we use the MinSum computation as our workload.

Data sets. We use two data sets, namely molecular dynamics simulation (MD) and point cloud city (PC). 1) Molecular dynamics simulation (MD). We use molecular dynamics simulation results from [58]. The original dataset contains 4×10⁶ protein simulation shapes organized into 5 clusters. Each protein shape contains 6.4×10⁴ 3D coordinates. For each candidate set and scoring set, 200 samples from each cluster are selected from the original dataset. The total data size is 46 GB. 2) Point cloud city (PC). We also use a collection of point sets representing the old town area of Bremen, Germany. We use a 3D point cloud scan which contains 1.6×10⁷ data points². For the candidate set, we sample 1,000 data points, while the rest of the points become the scoring set. The total data size is 806 MB.

For each dataset, we index the data points using a single-level M-tree, which organizes the data points into leaf nodes (blocks). The number of blocks is set to 100 for all data sets. Query processing involves two steps: OCST initialization and partial aggregation task processing.

Preprocessing: preparing the OCST. To prepare the OCST, we first compute the block-wise estimates for the candidates. In this case, candidates from the same block share the same upper and lower estimates. We then apply the rules described in Section 3 to filter out positive and negative candidates. For the remaining (uncertain) candidates, we compute a point-to-block estimate for each candidate and scoring block combination. We then apply the same pruning rule again to refine the uncertain candidate set. The distance calculation cost of this step is given by the number of block-wise estimates and the number of point-to-block estimates that we need to compute.
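To make the refinement step concrete, here is a minimal sketch of a bound-based pruning pass for a descending-order top-k query, assuming each uncertain candidate carries lower and upper bounds on its aggregate score; the exact rules are those of Section 3, and this snippet only illustrates the general bound-comparison idea.

```python
def refine_uncertain(candidates, k):
    """Keep only candidates that can still enter the top-k (descending order).

    `candidates` maps candidate id -> (lower_bound, upper_bound) on its aggregate score.
    A candidate whose upper bound falls below the k-th largest lower bound can never
    displace the current top-k, so it is pruned as a negative candidate.
    """
    lowers = sorted((lb for lb, _ in candidates.values()), reverse=True)
    kth_lower = lowers[k - 1] if len(lowers) >= k else float("-inf")
    return {cid: (lb, ub) for cid, (lb, ub) in candidates.items() if ub >= kth_lower}
```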

Note that all scheduling methods share the same preprocessing cost (which is at most 100 · 100 + 100 · 1000 in this case). Consequently, we exclude the preprocessing cost from the results reported in this section.

Measurements. The performance is measured in terms of distance calculation cost, running time, and utilization.

• Distance calculation cost is the total number of times we compute the distance between two objects, including the preprocessing step.

• Running time measures the total query execution time. In the figures, running time is shown on a log scale.

• Utilization is the average utilization rate over all workers. Specifically, the utilization rate of each worker is the ratio of its non-idle time to the total running time. By nature, different solutions result in different utilization rates depending on how they diversify the scoring blocks.

5.2 Summary of Results

A summary of experimental results is given in Table 4. We display the results from the two datasets. The two query workload types, MaxMin and MinSum, were executed using the default parameter values described in Table 3.

2. http://kos.informatik.uni-osnabrueck.de/3Dscans/


TABLE 4
Summary of results where each cell contains an absolute result and a result relative to ES in the format "absolute (relative-to-ES)".

                 Ranking-Ranking: MaxMin (Hausdorff Distance)            Ranking-Additive: MinSum
Dataset  Method  Dist. Cal. Cost    Running Time (s)   Utilization       Dist. Cal. Cost   Running Time (s)   Utilization
MD       BBF     2.00×10⁵ (27.60)   4368.51 (63.16)    0.958 (1.22)      7.18×10⁵ (1.08)   20794.54 (3.37)    0.484 (0.60)
MD       BCF     5.96×10⁴ (8.22)    795.20 (11.50)     0.793 (1.01)      7.90×10⁵ (1.19)   10506.10 (1.70)    0.797 (0.98)
MD       RD      8.68×10⁵ (119.75)  12133.29 (175.41)  0.744 (0.95)      7.38×10⁵ (1.11)   10277.02 (1.67)    0.743 (0.92)
MD       ES      7.25×10³ (1.00)    69.17 (1.00)       0.782 (1.00)      6.66×10⁵ (1.00)   6169.93 (1.00)     0.810 (1.00)
PC       BBF     7.38×10⁸ (109.90)  1457.03 (123.90)   0.892 (1.12)      9.72×10⁹ (3.40)   47747.31 (7.23)    0.384 (0.47)
PC       BCF     3.35×10⁸ (49.95)   733.37 (62.36)     0.864 (1.09)      4.65×10⁹ (1.63)   10388.81 (1.57)    0.846 (1.04)
PC       RD      2.96×10⁸ (44.11)   613.62 (52.18)     0.842 (1.06)      4.34×10⁹ (1.52)   9669.30 (1.46)     0.845 (1.04)
PC       ES      6.72×10⁶ (1.00)    11.76 (1.00)       0.795 (1.00)      2.86×10⁹ (1.00)   6602.31 (1.00)     0.815 (1.00)

Our proposed method, ES, consistently performs better than the other three methods on both datasets in terms of distance computation cost and running time. We can see that the performance improvement for the ranking-ranking workload type is much greater than that for the ranking-additive workload type. This is because the ranking-ranking workload provides a better opportunity to prune partial aggregation tasks when computing the aggregate scores of candidates.

5.3 Ranking-ranking: MaxMin (Hausdorff Distance)

We now discuss the performance of the different scheduling policies in computing the Hausdorff distance on the MD dataset.

Number of workers. In this experiment, we varied the number of workers from 1 to 10. The single-worker setting is included to compare the performance of the different scheduling methods in the case of sequential processing. Figure 8a shows that the number of workers has no effect on the distance computation cost of the three comparative methods, since the number of workers does not affect the total number of tasks we have to process. The figure also shows that our proposed method is the best performer in this cost measure. Figure 8b shows that for all methods, the query processing time decreases as the number of workers increases due to the decrease in the number of tasks per worker. Again, our proposed method is the best performer in this measure. We can also see that the running time drops sharply as the number of workers increases from 1 to 4, while it remains relatively unchanged as the number of workers increases from 4 to 10. Figure 8c shows that as the number of workers increases, the utilization rate decreases. This is because, as the number of workers increases, the scoring blocks become more distributed and hence task scheduling becomes more restricted. As a result, we cannot obtain a substantial speedup by simply increasing the number of workers without also increasing the duplication factor.

The number k of results. In this experiment, we varied the value of k from 5 to 25. Figure 9a shows the effect of k on the distance computation cost. For our proposed method, as k increases, the distance computation cost increases due to the increased number of results we have to produce. For the total execution time (Figure 9b), the results conform with the distance computation cost. The two figures show that our proposed method is the best performer on both measures. Figure 9c shows that k has no effect on the utilization rate for all methods.

Duplication factor. In this experiment, we varied the duplication factor from 1 to 5. In general, as the duplication factor increases, each block is associated with more workers. For example, if the duplication factor is 3, each block is loaded into the main memory of 3 different workers upon initialization. Figure 10a shows that the duplication factor does not affect the distance computation cost. This is because a change in the duplication factor does not affect the number of partial aggregation tasks we have to process. For the total running time (Figure 10b) and utilization (Figure 10c), recall that each worker is only allowed to process tasks with a matching scoring block. As the duplication factor increases, the flexibility of task scheduling increases; hence, the utilization increases and the total running time decreases.

Recall. We redefine the termination condition to allow for approximate results. Specifically, one can define the guaranteed degree of discrepancy between the tentative result set and the ideal one [59]. We define our recall [60] threshold R as the ratio between the number of true results in the approximate result set and k. Hence, the computation can terminate when ⌈k · R⌉ positive candidates are obtained. In this experiment, we compare our proposed method, ES, with the three benchmark methods using different recall threshold values. Both the distance computation cost and the running time increase with the recall threshold, as shown in Figures 11a and 11b. This is because a smaller recall threshold means that more approximate results are allowed.
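A minimal sketch of this relaxed termination check follows; the function and argument names are illustrative, not the paper's.

```python
import math

def can_terminate(num_positive_candidates, k, recall_threshold):
    """Relaxed termination: stop once ceil(k * R) positive candidates are confirmed."""
    return num_positive_candidates >= math.ceil(k * recall_threshold)

# Example: for k = 15 and R = 0.8, the query may stop after 12 confirmed results.
assert can_terminate(12, k=15, recall_threshold=0.8)
```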

5.4 Ranking-additive: MinSum

In this subsection, we discuss the performance of the different scheduling policies for the MinSum workload on the MD dataset. We compare the proposed method, ES, with BBF, BCF, and RD using different k values. In this experiment, we varied the value of k from 5 to 25. Figure 12a shows the effect of k on the distance computation cost. For all methods, as k increases, the distance computation cost increases. Figure 12b shows that for all methods, the total running time increases due to the increased distance computation cost. Figure 12c shows that changes in the k value have no effect on the utilization rate. Note that the utilization rate of BBF is significantly lower than those of the other methods for MinSum. This is because the best scoring block for MinSum is the block with the greatest potential to reduce the score range, which is the same for all candidates. As a result, BBF schedules a large number of tasks associated with the same scoring block. However, the number of scoring block duplicates is limited, causing a utilization bottleneck.


[Fig. 8. Effect of number of workers. Panels: (a) Dist. cal. cost (×100,000), (b) Running time (seconds, log scale), (c) Utilization rate; x-axis: no. of workers (1–10); methods: RD, BBF, BCF, ES.]

[Fig. 9. Effect of the number k of results. Panels: (a) Dist. cal. cost (×100,000), (b) Running time (seconds, log scale), (c) Utilization rate; x-axis: top k (5–25); methods: RD, BBF, BCF, ES.]

[Fig. 10. Effect of duplicate factor. Panels: (a) Dist. cal. cost (×100,000), (b) Running time (seconds, log scale), (c) Utilization rate; x-axis: duplicate factor (1–5); methods: RD, BBF, BCF, ES.]

6 CONCLUSIONS

In this paper, we formalize the notion of cross aggregate ranking workloads to describe the processing pattern of various types of data exploration queries, such as medoid and Hausdorff distance computations. We propose a scalable framework for various types of cross aggregate ranking workloads and formulate a novel scheduling policy for this framework based on the concept of entropy. Experimental results show that our proposed solution significantly outperforms the benchmark solutions. As future work, we plan to set the budget C through a dynamic feedback mechanism and to apply our technique to other similar top-k processing problems.

ACKNOWLEDGEMENTS

S. Nutanong was partially supported by CityU research grants (CityU Projects 6000511, 7200387, and 7004421). Reynold Cheng is supported by the Research Grants Council of Hong Kong (RGC GRF Project 17205115).

REFERENCES

[1] N. Roussopoulos, S. Kelley, and F. Vincent, "Nearest neighbor queries," in SIGMOD, 1995, pp. 71–79.

[2] D. P. Huttenlocher, G. A. Klanderman, and W. Rucklidge, "Comparing images using the Hausdorff distance," IEEE Trans. Pattern Anal. Mach. Intell., vol. 15, no. 9, pp. 850–863, 1993.


[Fig. 11. Effect of recall. Panels: (a) Dist. cal. cost (×100,000), (b) Running time (seconds, log scale), (c) Utilization rate; x-axis: recall (0.2–1.0); methods: RD, BBF, BCF, ES.]

[Fig. 12. Effect of top k. Panels: (a) Dist. cal. cost (×100,000), (b) Running time (seconds, log scale), (c) Utilization rate; x-axis: top k (5–25); methods: RD, BBF, BCF, ES.]

[3] S. Nutanong, E. H. Jacox, and H. Samet, "An incremental Hausdorff distance calculation algorithm," PVLDB, vol. 4, no. 8, pp. 506–517, 2011.

[4] L. Kaufman and P. Rousseeuw, Clustering by Means of Medoids. North-Holland, 1987.

[5] H. Park and C. Jun, "A simple and fast algorithm for k-medoids clustering," Expert Syst. Appl., vol. 36, no. 2, pp. 3336–3341, 2009. [Online]. Available: http://dx.doi.org/10.1016/j.eswa.2008.01.039

[6] M. L. Yiu and N. Mamoulis, "Multi-dimensional top-k dominating queries," VLDB J., vol. 18, no. 3, pp. 695–718, 2009.

[7] L. Chen, X. Lin, H. Hu, C. S. Jensen, and J. Xu, "Answering why-not questions on spatial keyword top-k queries," in ICDE, 2015, pp. 279–290.

[8] X. Lin, J. Xu, and H. Hu, "Reverse keyword search for spatio-textual top-k queries in location-based services," IEEE Trans. Knowl. Data Eng., vol. 27, no. 11, pp. 3056–3069, 2015.

[9] Q. Chen, H. Hu, and J. Xu, "Authenticating top-k queries in location-based services with confidentiality," PVLDB, vol. 7, no. 1, pp. 49–60, 2013.

[10] A. Corral, Y. Manolopoulos, Y. Theodoridis, and M. Vassilakopoulos, "Closest pair queries in spatial databases," in SIGMOD, 2000, pp. 189–200.

[11] E. H. Jacox and H. Samet, "Spatial join techniques," ACM Trans. Database Syst., vol. 32, no. 1, p. 7, 2007.

[12] S. H. Pugsley, J. Jestes, R. Balasubramonian, V. Srinivasan, A. Buyuktosunoglu, A. Davis, and F. Li, "Comparing implementations of near-data computing with in-memory mapreduce workloads," IEEE Micro, vol. 34, no. 4, pp. 44–52, 2014.

[13] M. Zaharia, M. Chowdhury, T. Das, A. Dave, J. Ma, M. McCauly, M. J. Franklin, S. Shenker, and I. Stoica, "Resilient distributed datasets: A fault-tolerant abstraction for in-memory cluster computing," in NSDI, 2012, pp. 15–28.

[14] G. R. Hjaltason and H. Samet, "Distance browsing in spatial databases," ACM Trans. Database Syst., vol. 24, no. 2, pp. 265–318, 1999.

[15] Y. Tao, V. Hristidis, D. Papadias, and Y. Papakonstantinou, "Branch-and-bound processing of ranked queries," Information Syst., vol. 32, no. 3, pp. 424–445, 2007.

[16] J. Pei, M. Hua, Y. Tao, and X. Lin, "Query answering techniques on uncertain and probabilistic data: tutorial summary," in SIGMOD, 2008, pp. 1357–1364.

[17] R. Cheng, J. Chen, and X. Xie, "Cleaning uncertain data with quality guarantees," PVLDB, vol. 1, no. 1, pp. 722–735, 2008.

[18] L. Mo, R. Cheng, X. Li, D. W. Cheung, and X. S. Yang, "Cleaning uncertain data for top-k queries," in ICDE, 2013, pp. 134–145.

[19] C. Doulkeridis and K. Nørvag, "A survey of large-scale analytical query processing in mapreduce," VLDB J., vol. 23, no. 3, pp. 355–380, 2014.

[20] J. Dean and S. Ghemawat, "Mapreduce: a flexible data processing tool," Commun. ACM, vol. 53, no. 1, pp. 72–77, 2010.

[21] V. Kalavri, V. Brundza, and V. Vlassov, "Block sampling: Efficient accurate online aggregation in mapreduce," in CloudCom, 2013, pp. 250–257.

[22] N. Pansare, V. R. Borkar, C. Jermaine, and T. Condie, "Online aggregation for large mapreduce jobs," PVLDB, vol. 4, no. 11, pp. 1135–1145, 2011.

[23] N. Laptev, K. Zeng, and C. Zaniolo, "Early accurate results for advanced analytics on mapreduce," PVLDB, vol. 5, no. 10, pp. 1028–1039, 2012.

[24] B. Efron, "Bootstrap methods: Another look at the jackknife," The Annals of Statistics, vol. 7, no. 1, pp. 1–26, 1979. [Online]. Available: http://dx.doi.org/10.1214/aos/1176344552

[25] S. Agarwal, A. Panda, B. Mozafari, A. P. Iyer, S. Madden, and I. Stoica, "Blink and it's done: Interactive queries on very large data," PVLDB, vol. 5, no. 12, pp. 1902–1905, 2012.

[26] J. Dittrich, J. Quiane-Ruiz, A. Jindal, Y. Kargin, V. Setty, and J. Schad, "Hadoop++: Making a yellow elephant run like a cheetah (without it even noticing)," PVLDB, vol. 3, no. 1, pp. 518–529, 2010.

[27] C. Doulkeridis and K. Nørvag, "On saying "enough already!" in mapreduce," in Cloud-I, 2012, p. 7.


[28] K. S. Candan, J. W. Kim, P. Nagarkar, M. Nagendra, and R. Yu, "Rankloud: Scalable multimedia data processing in server clusters," IEEE MultiMedia, vol. 18, no. 1, pp. 64–77, 2011.

[29] V. R. Borkar, M. J. Carey, R. Grover, N. Onose, and R. Vernica, "Hyracks: A flexible and extensible foundation for data-intensive computing," in ICDE, 2011, pp. 1151–1162.

[30] T. White, Hadoop: The Definitive Guide, 1st ed. O'Reilly Media, Inc., 2009.

[31] D. Jiang, A. K. H. Tung, and G. Chen, "MAP-JOIN-REDUCE: toward scalable and efficient data analysis on large clusters," IEEE Trans. Knowl. Data Eng., vol. 23, no. 9, pp. 1299–1311, 2011.

[32] I. K. Koumarelas, A. Naskos, and A. Gounaris, "Binary theta-joins using mapreduce: Efficiency analysis and improvements," in EDBT/ICDT, 2014, pp. 6–9.

[33] S. Fries, B. Boden, G. Stepien, and T. Seidl, "Phidj: Parallel similarity self-join for high-dimensional vector data with mapreduce," in ICDE, 2014, pp. 796–807.

[34] Y. Kim and K. Shim, "Parallel top-k similarity join algorithms using mapreduce," in ICDE, 2012, pp. 510–521.

[35] C. Zhang, F. Li, and J. Jestes, "Efficient parallel knn joins for large data in mapreduce," in EDBT, 2012, pp. 38–49.

[36] D. Papadias, Y. Tao, G. Fu, and B. Seeger, "An optimal and progressive algorithm for skyline queries," in SIGMOD, 2003, pp. 467–478.

[37] Y. Tao, X. Xiao, and R. Cheng, "Range search on multidimensional uncertain data," TODS, vol. 32, no. 3, p. 15, 2007.

[38] R. Cheng, D. V. Kalashnikov, and S. Prabhakar, "Evaluating probabilistic queries over imprecise data," in SIGMOD, 2003, pp. 551–562.

[39] M. A. Soliman, I. F. Ilyas, and K. Chen-Chuan Chang, "Top-k query processing in uncertain databases," in ICDE, 2007, pp. 896–905.

[40] X. Lian and L. Chen, "Top-k dominating queries in uncertain databases," in EDBT, 2009, pp. 660–671.

[41] K. Yi, F. Li, G. Kollios, and D. Srivastava, "Efficient processing of top-k queries in uncertain databases," in ICDE, 2008, pp. 1406–1408.

[42] X. Lian and L. Chen, "Probabilistic ranked queries in uncertain databases," in EDBT, 2008, pp. 511–522.

[43] J. Pei, B. Jiang, X. Lin, and Y. Yuan, "Probabilistic skylines on uncertain data," in PVLDB, 2007, pp. 15–26.

[44] M. Hua, J. Pei, W. Zhang, and X. Lin, "Efficiently answering probabilistic threshold top-k queries on uncertain data," in ICDE, vol. 8, 2008, pp. 1403–1405.

[45] M. Hua, J. Pei, and X. Lin, "Ranking queries on uncertain data," VLDB J., vol. 20, no. 1, pp. 129–153, 2011.

[46] F. Li, K. Yi, and J. Jestes, "Ranking distributed probabilistic data," in SIGMOD, 2009, pp. 361–374.

[47] M. Ye, X. Liu, W.-C. Lee, and D. L. Lee, "Probabilistic top-k query processing in distributed sensor networks," in ICDE, 2010, pp. 585–588.

[48] P. Ciaccia, M. Patella, and P. Zezula, "M-tree: An efficient access method for similarity search in metric spaces," in PVLDB, 1997, pp. 426–435.

[49] P. N. Yianilos, "Data structures and algorithms for nearest neighbor search in general metric spaces," in ACM/SIGACT-SIAM, 1993, pp. 311–321.

[50] T. Bozkaya and Z. M. Ozsoyoglu, "Indexing large metric spaces for similarity search queries," ACM Trans. Database Syst., vol. 24, no. 3, pp. 361–404, 1999.

[51] J. L. Bentley, "Multidimensional binary search trees used for associative searching," Commun. ACM, vol. 18, no. 9, pp. 509–517, 1975.

[52] A. Guttman, "R-trees: A dynamic index structure for spatial searching," in SIGMOD, 1984, pp. 47–57.

[53] J. Chen, R. Cheng, M. Mokbel, and C.-Y. Chow, "Scalable processing of snapshot and continuous nearest-neighbor queries over one-dimensional uncertain data," VLDB J., vol. 18, no. 5, pp. 1219–1240, 2009.

[54] T. Bernecker, H.-P. Kriegel, N. Mamoulis, M. Renz, and A. Zuefle, "Scalable probabilistic similarity ranking in uncertain databases," IEEE Trans. Knowl. Data Eng., vol. 22, no. 9, pp. 1234–1246, 2010.

[55] R. Cheng, J. Chen, M. F. Mokbel, and C. Chow, "Probabilistic verifiers: Evaluating constrained nearest-neighbor queries over uncertain data," in ICDE, 2008, pp. 973–982.

[56] R. Cheng, L. Chen, J. Chen, and X. Xie, "Evaluating probability threshold k-nearest-neighbor queries over uncertain data," in EDBT, 2009, pp. 672–683.

[57] T. H. Cormen, Introduction to Algorithms. MIT Press, 2009.

[58] D. E. Shaw, P. Maragakis, K. Lindorff-Larsen, S. Piana, R. O. Dror, M. P. Eastwood, J. A. Bank, J. M. Jumper, J. K. Salmon, Y. Shan et al., "Atomic-level characterization of the structural dynamics of proteins," Science, vol. 330, no. 6002, pp. 341–346, 2010.

[59] M. Theobald, G. Weikum, and R. Schenkel, "Top-k query evaluation with probabilistic guarantees," in PVLDB, 2004, pp. 648–659.

[60] M. K. Buckland and F. C. Gey, "The relationship between recall and precision," JASIS, vol. 45, no. 1, pp. 12–19, 1994.

Chengcheng Dai received the bachelor's degree in computer science and technology from East China Normal University, Shanghai, China, in 2011 and the M.Phil. degree in computer science from City University of Hong Kong in 2013. She is pursuing her Ph.D. degree at City University of Hong Kong. Her research interests include query processing, location-based recommendation, and cloud computing.

Sarana Nutanong received the PhD degree in computer science from the University of Melbourne, Australia, in 2010. He is currently an assistant professor in the Department of Computer Science, City University of Hong Kong. Before joining CityU, he was a Postdoctoral Research Associate at the University of Maryland Institute for Advanced Computer Studies between 2010 and 2012 and held a research faculty position at the Johns Hopkins University from 2012 to 2013. His research interests include scientific data management, data-intensive computing, spatial-temporal query processing, and large-scale machine learning.

Chi-Yin Chow received the M.S. and Ph.D. degrees from the University of Minnesota-Twin Cities in 2008 and 2010, respectively. He is currently an assistant professor in the Department of Computer Science, City University of Hong Kong. His research interests include big data analytics, data management, GIS, mobile computing, and location-based services. He is the co-founder and co-organizer of the ACM SIGSPATIAL MobiGIS 2012 to 2016 and an editor of the ACM SIGSPATIAL Newsletter.

Reynold Cheng received the B.E. degree in computer engineering and the M.Phil. degree in computer science and information systems from the University of Hong Kong (HKU), Hong Kong, in 1998 and 2000, respectively, and the M.Sc. and Ph.D. degrees from the Department of Computer Science, Purdue University, West Lafayette, IN, USA, in 2003 and 2005, respectively. He is an Associate Professor in the Department of Computer Science, HKU. From 2005 to 2008, he was an Assistant Professor in the Department of Computing, Hong Kong Polytechnic University. He is a member of the IEEE, ACM, ACM SIGMOD, and UPE. He has served on the program committees and review panels for leading database conferences and journals. He is a member of the Editorial Board of Information Systems and the DAPD journal. He is also a Guest Editor for a special issue in TKDE. His current research interests include database management, querying and mining of uncertain data.