
Date of acceptance Grade

Instructor

Hashing-based delayed duplicate detection as an approach to improve the scalability of optimal Bayesian network structure learning with external memory frontier breadth-first branch and bound search

Niklas Jahnsson

Helsinki February 22, 2015

UNIVERSITY OF HELSINKI
Department of Computer Science


Faculty of Science Department of Computer Science

Niklas Jahnsson

Hashing-based delayed duplicate detection as an approach to improve the scalability of optimal Bayesian network structure learning with external memory frontier breadth-first branch and bound search

Computer Science

Master’s thesis February 22, 2015 53 pages + 0 appendices

Bayesian networks, Artificial intelligence, Algorithms, Delayed duplicate detection

Bayesian networks are graphical models used to represent the joint probability distribution for all variables in a data set. A Bayesian network can be constructed by an expert, but learning the network from the data is another option. This thesis is centered around exact score-based Bayesian network structure learning. The idea is to optimize a scoring function measuring the fit of the network to the data, and the solution represents an optimal network structure.

The thesis adds to the earlier literature by extending an external memory frontier breadth-first branch and bound algorithm by Malone et al. [MYHB11], which searches the space of candidate networks using dynamic programming in a layered fashion. To detect duplicates during candidate solution generation, the algorithm uses both semiconductor and magnetic disk memory efficiently. In-memory duplicate detection is performed using a hash table, while a delayed duplicate detection strategy is employed when resorting to the disk. Delayed duplicate detection is designed to work well despite long disk latencies, because hash tables are currently still infeasible on disk. Delayed duplicate detection allows the algorithm to scale beyond search spaces of candidate solutions fitting only in the comparatively expensive and limited semiconductor memory at its disposal.

The sorting-based delayed duplicate detection strategy employed by the original algorithm has been found to be inferior to a hashing-based delayed duplicate detection strategy in other application domains [Kor08]. This thesis presents an approach to use hashing-based delayed duplicate detection in Bayesian network structure learning and compares it to the sorting-based method. The problem faced when hashing candidate solutions to disk is dividing the candidate solutions in a certain stage of the search in an efficient and scalable manner to files on the disk. The division presented in this thesis distributes the candidate solutions into files unevenly, but takes into account the maximum number of candidate solutions that can be held in the semiconductor memory. The method works in theory as long as the operating system has free space and inodes to allocate for new files.

Although the hashing-based method should in principle be faster than the sorting-based one, the benchmarks presented in this thesis for the two methods show mixed results. However, the analysis presented also highlights the fact that by improving the efficiency of, for example, the distribution of hashed candidate solutions to different files, the hashing-based method could be further improved.



Contents

1 Introduction
2 Background
2.1 Bayesian networks
2.2 Score-based learning
2.2.1 Scoring functions
2.2.2 Dynamic programming
2.2.3 Graph-based search approach
2.2.4 Order graph
2.2.5 Sparse parent graph
2.2.6 Heuristic function
3 Delayed duplicate detection
3.1 Sorting-based delayed duplicate detection
3.2 Hashing-based delayed duplicate detection
3.3 Hybrid duplicate detection
4 Using delayed duplicate detection with hashing in Bayesian network structure learning
4.1 First-k subcombinations
4.2 Selecting the number of variables used in a first-k subcombination
4.3 Dividing candidate order graph nodes to files on the disk
4.4 Distribution of candidate order graph nodes to different files on the disk
5 Experimental results
5.1 Implementation details
5.2 Environment
5.3 Data used
5.4 Overall performance
5.5 Scenario with 20 variables
5.5.1 Locality
5.6 Scenario with 29 variables
5.6.1 First set of results under light load
5.6.2 Second set of results under heavy load
5.6.3 Locality
6 Conclusions
References


1 Introduction

Bayesian networks are probabilistic graphical models used to represent relationships between variables in data sets [Pea88]. A Bayesian network is a directed acyclic graph: its nodes represent the variables in the dataset and its edges represent the direct dependencies among the variables. In other words, a Bayesian network represents the joint probability distribution of all the variables in it.

To use Bayesian networks in practice, a Bayesian network representing the data needs to be constructed. When lacking experts to construct the network, an option is to learn the network from data. There are different ways to learn a network from the data, and this thesis is centered around score-based Bayesian network structure learning [HGC95], in which a scoring function is used to measure how well a certain Bayesian network structure fits the data. The goal in score-based learning is to find a network structure that optimizes the value of the scoring function.

Within the score-based learning framework, the earlier literature has utilized both approximate [Bou94, Hec96] and exact methods [KS04, SM06, JSGM10, dCJ11] to learn a Bayesian network. One of the major motivations for exact methods is that the solutions found by the approximate learning methods are of unknown quality, whereas the exact methods are guaranteed to provide an optimal solution.

This thesis adds to the earlier literature by extending an external memory frontier breadth-first branch and bound algorithm by Malone et al. [MYHB11]. The algorithm by Malone et al., referred to from here on as the MYHB algorithm, searches the space of candidate subnetworks using dynamic programming in a layered fashion. This enables the search to use semiconductor memory efficiently and to resort to magnetic disk memory if needed. During the search, duplicates corresponding to subnetworks over the same set of variables are created, and they are detected using an in-memory hash table. Only the best subnetwork is retained. This process is called duplicate detection, and it is a key operation in the algorithm. However, this also means that when resorting to disk, the algorithm needs to employ a duplicate detection strategy to make sure that duplicates on disk are detected.

Delayed duplicate detection is a process utilized in artificial intelligence algorithms employing external memory to detect duplicates in generated candidate solutions during, for example, a brute-force search. It replaces a hash table in semiconductor memory, and it is designed to work well despite long disk latencies when searching on magnetic disk storage. This is important because hash tables are currently infeasible on disk. In other words, delayed duplicate detection enables algorithms to search spaces of candidate solutions which would not fit into the semiconductor memory at their disposal.

The MYHB algorithm uses a sorting-based delayed duplicate detection algorithm, which sorts and writes the candidate solutions to disk every time the memory fills up. At the end of the search for a single layer, the duplicates can be detected by reading the candidate solutions in an orderly fashion from the files on the disk into memory. However, in the earlier literature a hashing-based delayed duplicate detection method has outperformed the sorting-based method, for example, in a breadth-first search of the Towers of Hanoi [Kor08]. In principle, the hashing-based method runs in O(n) time in the number of candidate solutions, while the sorting-based method requires O(n log n) time. This suggests that it might be beneficial to apply the hashing-based method also in the structure learning framework for Bayesian networks.

A version of the algorithm using hashing-based delayed duplicate detection is presented in this thesis and compared against the MYHB algorithm. The problem faced when hashing candidate solutions to disk is dividing the candidate solutions in a certain stage of the search in an efficient and scalable way to files on the magnetic disk. The division presented in this thesis distributes the candidate solutions into files unevenly, but takes into account the maximum number of candidate solutions that can be held in the semiconductor memory. In other words, the method should in theory be able to scale up as long as the operating system has free space and inodes to allocate for new files. Although the hashing-based method should in principle be faster than the sorting-based one, the benchmarks presented for the hashing- and sorting-based algorithms show mixed results.

The thesis is organized as follows. The next section presents the relevant background in Bayesian networks and structure learning. After that, delayed duplicate detection is introduced, and its application to structure learning is outlined. Section 4 defines one way to implement hashing-based delayed duplicate detection in the Bayesian network structure learning context and briefly analyzes how the chosen method divides the candidate solutions on the disk in practice. Section 5 introduces the experimental setup and showcases the results. The last section concludes the thesis.


2 Background

This section briefly summarizes the earlier work related to structure learning of Bayesian networks and introduces the concepts utilized later on. First, a random variable is a variable that may take on one of several possible outcomes, or values, from a specific domain. In the following, random variables refer to categorical random variables, i.e. random variables taking values from finite unordered domains.

Notation will mostly follow the standards in the literature as presented, for example, by Heckerman [Hec96]. Random variables will be referred to with upper-case letters, for example X1, and their states with the same letter in lower case, for example x1. A set of random variables will be referred to by a bold-face, upper-case letter, for example V, and a state of a set with each variable in the set assigned to a particular value will be referred to with the corresponding bold-face, lower-case letter, for example v. In the previous, it is assumed that the set of random variables V is finite with size n ∈ N.

The probability of seeing random variable X1 having value x1 will be denoted by p(X1 = x1), and p(x1) will be used as a shorthand for this. Similarly, the notation p(x1|ξ) will be used as a shorthand of p(X1 = x1|ξ) to denote the conditional probability that X1 = x1 given we know state information ξ. Note that in addition to using p(·) as a shorthand for probability, it will also be used to denote a probability distribution and a probability mass function; the usage will be clear from the context.

2.1 Bayesian networks

A Bayesian network is defined as a directed acyclic graph (DAG) G as depicted by Figure 1. The vertices of the graph G represent single random variables from a set of random variables V = {X1, X2, ..., Xn} with n ∈ N. The edges of the graph G represent dependencies between two variables: for i, j ∈ [1, n], a directed edge from node Xi to node Xj means that node Xi is a parent of node Xj or, in other words, node Xj is a child of node Xi. The parent set PAi for node Xi is defined to contain all the parents of node Xi. Bayesian network factorization assumes that given its parent set PAi, node Xi is independent of its non-descendants [Pea88].

[Figure 1: An example of a Bayesian network of three variables V = {X1, X2, X3}.]

Relations between the variables in the network are quantified by a set of conditional probability distributions. In a Bayesian network each variable Xi has its own conditional probability distribution p(Xi|PAi) conditioning on its parents PAi. As a whole, a Bayesian network is used to represent a joint probability distribution over a set of random variables V. Due to the independence of a single node from its non-descendants given its parents, the joint probability distribution can be factorized as the product of all the individual nodes' conditional distributions as follows

p(x_1, x_2, \ldots, x_n) = \prod_{i=1}^{n} p(x_i \mid pa_i).  (1)
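As a concrete instance, assuming the edges in Figure 1 are X1 → X2 and X1 → X3 (an assumption based on the figure's layout, with X1 the sole parent of both children), Equation 1 factorizes the joint distribution as

p(x_1, x_2, x_3) = p(x_1) \, p(x_2 \mid x_1) \, p(x_3 \mid x_1).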

Overall, Bayesian networks have been used earlier in the literature to solve various inference tasks. There are roughly three approaches to tackle a Bayesian network structure learning task: score-based learning, constraint-based learning and hybrid methods [YM13]. Score-based learning measures the different network structures using a scoring function and selects the one having the best score [HGC95, Hec96]. On the other hand, constraint-based learning methods try to find conditional independence relationships from the data using statistical tests, for example, for causality, and then use the found relationships to build up the network structure [Pea88, Pea00]. Hybrid methods are combinations of the two earlier methods. Next, score-based learning will be explained in more detail as it is the paradigm followed in this thesis.

2.2 Score-based learning

Learning a Bayesian network is defined as a quest to find a Bayesian network structure that best fits the data. In other words, given a scoring function score(·), the task is to find a network structure G* fitting the data best

G^* = \arg\min_{G} \, score(G).  (2)


It should be noted that whether the argument in Equation 2 is minimized or maximized depends on the scoring function used.

Overall, Chickering [Chi96] has shown that score-based learning with a typical scoring function is NP-hard, and early research in this area mainly focused on using approximation algorithms. However, recently multiple exact learning algorithms have also been developed based on, for example, dynamic programming [SM06] and integer linear programming paradigms [JSGM10]. The benefit of exact learning algorithms is that the quality of the solution they provide is by construction known to be optimal, whereas the quality of the solution provided by the approximation algorithms is unknown.

The dataset D used to learn a network structure can be defined as a set of vectors D = {D1, D2, ..., DN} with N ∈ N [YM13]. This definition assumes that each data point Dk with k ∈ N is a vector of values over the variables in set V. In addition, it is assumed that each variable Xi in set V is categorical and has only a finite number of possible values. It is also assumed that no data point is missing values.

Score-based learning can be divided into two elements: the scoring function and the search strategy [YM13]. A scoring function defines the quality of the solution, while the search strategy decides how the candidate solutions are generated and how the scoring function is optimized.

2.2.1 Scoring functions

A scoring function score(·) measures how well a network structure fits the given data. Effectively, it takes as parameters the observed data and the Bayesian network structure. Given its parameters, a scoring function returns a score, which reflects how well the data fits the given network structure. For any decomposable scoring function the score of the network can be defined as the sum of the scores for the nodes in the network [Hec96]

score(G) = \sum_{i=1}^{n} score(X_i \mid PA_i).  (3)

Two network structures are equivalent if the set of distributions that can be represented using one of the networks is identical to the set of distributions that can be represented using the other [Chi95]. If this is the case, then the two network structures belong to the same equivalence class. Scoring functions can be score equivalent, which means that they assign the same score to each network structure in the same equivalence class. We now define a scoring function using the minimum description length principle (MDL) as its metric; this scoring function is known to be score equivalent [Chi95].

Given the number of states of Xi as ri, the number of data points consistent with PAi = pai as N_{pa_i}, and the number of data points further having Xi = xi as N_{x_i, pa_i}, we can define the MDL scoring function as follows [LB94]

MDL(G) = \sum_{i=1}^{n} MDL(X_i \mid PA_i),  (4)

where

MDL(X_i \mid PA_i) = H(X_i \mid PA_i) + \frac{\log N}{2} K(X_i \mid PA_i),  (5)

H(X_i \mid PA_i) = -\sum_{x_i, pa_i} N_{x_i, pa_i} \log \frac{N_{x_i, pa_i}}{N_{pa_i}},  (6)

K(X_i \mid PA_i) = (r_i - 1) \prod_{X_l \in PA_i} r_l.  (7)

With MDL the goal is to find the network that minimizes the MDL score. Note that for other scoring functions the task might be maximization, but it can be transformed to minimization just by switching the sign of the score.

The idea behind the MDL principle is to find a model which minimizes the sum of the encoding lengths of the data and the model itself [LB94]. In other words, the MDL principle is a formalization of the well-known Occam's razor principle, which states that among competing hypotheses, the one with the simplest explanation should be selected. In the above equations the function H(·) represents the encoding of the data, while the function K(·) represents the encoding of the model.
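To make Equations 5-7 concrete, the following is a minimal Python sketch of the local score MDL(Xi|PAi) computed from empirical counts. The thesis does not give code, so the names here (mdl_local_score, arities) are illustrative rather than taken from the implementation.

```python
from collections import Counter
from math import log

def mdl_local_score(data, i, parents, arities):
    """Local MDL score MDL(X_i | PA_i) from Equations 5-7, assuming `data`
    is a list of tuples of categorical values, `parents` lists the parent
    column indices, and `arities[j]` is the number of states r_j of X_j."""
    N = len(data)
    # Empirical counts N_{x_i, pa_i} and N_{pa_i}.
    joint = Counter((row[i], tuple(row[j] for j in parents)) for row in data)
    marginal = Counter(tuple(row[j] for j in parents) for row in data)
    # H(X_i | PA_i): data encoding term (Equation 6).
    h = -sum(n * log(n / marginal[pa]) for (_, pa), n in joint.items())
    # K(X_i | PA_i): number of free parameters (Equation 7).
    k = arities[i] - 1
    for j in parents:
        k *= arities[j]
    # Equation 5: data encoding plus model encoding.
    return h + log(N) / 2 * k
```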

2.2.2 Dynamic programming

To find an optimal Bayesian network for a set of variables V, it is sufficient to find the best choice of a variable Xi as a leaf [YM13]. This effectively means that for any variable Xi chosen as the best leaf, the best possible Bayesian network can be created by choosing its optimal parent set PA_Xi from the rest of the variables V \ {Xi} and constructing an optimal subnetwork for the rest of the variables V \ {Xi}. The optimal leaf is thus the one for which the sum of score(V \ {Xi}) and bestscore(Xi, V \ {Xi}) is minimized, where bestscore(·) selects the optimal parent set for variable Xi. In other words, we have the following relation

score(V) = \min_{X_i \in V} \{ score(V \setminus \{X_i\}) + bestscore(X_i, V \setminus \{X_i\}) \},  (8)

where

bestscore(X_i, V \setminus \{X_i\}) = \min_{PA_{X_i} \subseteq V \setminus \{X_i\}} score(X_i, PA_{X_i}).  (9)
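As an illustration of the recurrence in Equations 8 and 9, the following is a minimal in-memory Python sketch of the dynamic program over subsets encoded as bitmasks. Here best_score is a hypothetical helper standing in for the parent graph lookup; the layered, disk-backed search studied in this thesis replaces this purely in-memory table.

```python
from math import inf

def optimal_network_score(n, best_score):
    """Dynamic program of Equations 8-9, assuming `best_score(i, mask)`
    returns bestscore(X_i, U) for the candidate parent pool U encoded as
    a bitmask. Returns score(V) for variables X_0..X_{n-1}."""
    score = [inf] * (1 << n)
    score[0] = 0.0  # the empty subset has nothing to encode
    for mask in range(1, 1 << n):
        # Try every variable in the subset as the leaf (Equation 8).
        for i in range(n):
            if mask & (1 << i):
                rest = mask & ~(1 << i)
                cand = score[rest] + best_score(i, rest)
                if cand < score[mask]:
                    score[mask] = cand
    return score[(1 << n) - 1]
```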

2.2.3 Graph-based search approach

The idea in the following is to formulate the learning task as a graph search problem. The search graph naturally decomposes into layers, which can be searched one at a time. This enables the algorithm to use memory and disk in an orchestrated manner without keeping all the nodes in memory at any point in time. Theoretically, this also means that the algorithm can scale up to any number of nodes if there is enough disk available. In the following, a definition of the order graph is given; effectively, the order graph is the search graph. After that, a short review of the sparse parent graph is given, as it is used to obtain optimal scores for different parent sets during the search in the order graph. As noted by Malone et al. [MYHB11], the formulation of the problem of finding an optimal Bayesian network as a graph search problem, where the search graph is the order graph, opens up the possibility to use any graph search algorithm to find the best path from the start node to the goal node. This thesis is centered around breadth-first search, but, for example, Yuan and Malone [YM13] use an A* algorithm to search for an optimal Bayesian network.

2.2.4 Order graph

As defined by Malone et al. [MYHB11], an order graph is a Hasse diagram. Each of its nodes Nk, with k = 1, ..., m and m ∈ N, contains the score for the associated optimal subnetwork for one subset of variables. Effectively, the order graph is the power set of the set of variables V, as it contains all the subsets of the set of variables V. This straight away yields that the number of nodes m in the order graph equals 2^n. One can derive this from the fact that any subset of V can be represented as {ω1, ω2, ..., ωn}, where each item ωi in the subset, 1 ≤ i ≤ n, can have either value one or zero based on whether variable Xi belongs to the subset. As there are two choices for each of the n variables in the variable set V, there are altogether 2^n unique subsets. Figure 2 shows an example of an order graph for four variables.

The binomial coefficient C(n, k), where n, k ∈ N, gives the number of distinct k-element subsets of any set containing n elements. For an order graph, the binomial coefficient can be used to calculate the number of nodes on a certain layer. For an order graph with n variables in the set of variables V, the number of nodes on layer l is given by C(n, l).

A directed path in the order graph starts from the start node at the top of the graph and ends at the goal node at the bottom of the graph. A path effectively induces an ordering of the variables on the path. For example, the path {}, {X1}, {X1, X2}, {X1, X2, X3}, {X1, X2, X3, X4} produces an ordering of the variables where variable X1 is first, variable X2 is second, variable X3 third and variable X4 last. All variables preceding a variable in the ordering are candidate parents for the variable. Each edge on the path has a cost equal to bestscore(·), where the new variable is the child node and its parents are a subset of the preceding variables. For example, the cost of the edge between node {X1, X2} and node {X1, X2, X3} is given by bestscore(X3, {X1, X2}).

{}
{X1} {X2} {X3} {X4}
{X1, X2} {X1, X3} {X1, X4} {X2, X3} {X2, X4} {X3, X4}
{X1, X2, X3} {X1, X2, X4} {X1, X3, X4} {X2, X3, X4}
{X1, X2, X3, X4}

Figure 2: An example of an order graph for a Bayesian network of four variables V = {X1, X2, X3, X4}.

2.2.5 Sparse parent graph

A parent graph is a data structure used to compute the costs for the arcs in the order graph [YM13]. In other words, for variable Xi and candidate parents U ⊂ V \ {Xi}, it will be used to find bestscore(Xi, U) for the edge U → U ∪ {Xi}. Each variable in the dataset has its own parent graph. A full parent graph is a Hasse diagram containing all subsets of the variables in the set of variables V except the variable in question, Xi, itself. Figure 3a gives an example of the parent graph for variable X1 with a set of variables V = {X1, X2, X3, X4}.

Parent set:         {}   {X2}  {X3}  {X4}  {X2,X3}  {X2,X4}  {X3,X4}  {X2,X3,X4}
(a) Raw score:      12   5     6     7     7        4        12       10
(b) Propagated:     12   5     6     7     5        4        6        4

Figure 3: A sample parent graph for variable X1 for a Bayesian network of four variables V = {X1, X2, X3, X4}. Figure (a) shows the raw scores for all parent sets of X1; the first line specifies the parent set and the second line gives the score of using that set as the parent set for X1. Figure (b) gives the optimal parent set scores, i.e. bestscore(X1, ·), for each candidate parent set; the score is propagated from a predecessor parent set if the predecessor has a better score than the parent set itself. Figure (c) shows the effect of pruning the parent sets by leaving the pruned candidate parent sets in white. A parent set can be pruned if any of the parent set's predecessors has a better score than the parent set itself.

Each node in the parent graph stores the optimal parent set PA_Xi formed using the predecessors U to minimize score(Xi, PA_Xi). Figure 3a shows the scores for different parent sets of node X1. In practice, Yuan et al. collect the counts from a dataset using sparse AD-trees and compute scores based on the counts [YM13, ML98]. Score calculation would easily become a problem with an exponential number of scores to calculate unless some of the parent sets could be pruned. Actually, some parent sets can be discarded without even calculating their scores at all due to the following theorems by Tian [Tia00]. Although the following theorems concern only MDL scores, other scoring functions have similar pruning rules which could be used [dCJ11].

We begin with Theorem 1 from Tian, which gives an upper bound on the number of parents in an optimal Bayesian network measured by the MDL scoring function.

Theorem 2.1. In an optimal Bayesian network based on the MDL scoring function, each variable has at most ⌊log(2N / log N)⌋ parents, where N is the number of data points.

Another theorem used in pruning candidate parent sets is Equation 13 from Tian [Tia00]; it is presented next.

Theorem 2.2. Let U and S be two candidate parent sets for Xi, U ⊂ S, and K(Xi|S) − MDL(Xi|U) > 0. Then S and all supersets of S cannot possibly be optimal parent sets for Xi.

One more pruning theorem for the parent graph has appeared in the literature [Tey05] and is presented next. It effectively means that a parent set is not optimal when one of its subsets has a better score.

Theorem 2.3. Let U and S be two candidate parent sets for Xi such that U ⊂ S, and score(Xi, U) is better than score(Xi, S). Then S is not the optimal parent set of Xi for any candidate set.

This theorem can be seen in action in Figure 3b, where the optimal parent set propagates to successor nodes having worse scores than their predecessors. Due to this propagation, the bestscore function must have the following property [Tey05].

Theorem 2.4. Let U and S be two candidate parent sets for Xi such that U ⊂ S. Then we have bestscore(Xi, S) ≤ bestscore(Xi, U).

The full parent graph for each variable Xi would enumerate the whole power set of V \ {Xi} and store the bestscore(Xi, U) values for all of its members. However, Theorem 2.3 shows that the number of optimal parent sets is often smaller than the whole power set and that an optimal parent set might be shared by several candidate parent sets. As a response to this, a sparse representation of the parent graph can be used instead [YM13].

First of all, Theorems 2.1 and 2.2 allow the algorithm to prune some of the parent sets without evaluating their quality. This effectively means that the whole power set will never be created. Second, instead of creating Hasse diagrams to represent parent sets, the optimal parent scores for each variable Xi are sorted in a list along with another list containing the corresponding optimal parent sets. By using these lists effectively, Yuan and Malone [YM13] are able to largely alleviate the problem of the full parent graphs' size increasing exponentially with the input variable count.
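The following Python sketch illustrates how a bestscore lookup can work on such sorted lists; the names are illustrative rather than taken from [YM13], and parent sets are encoded as bitmasks.

```python
def best_score(scores, parent_sets, candidate_mask):
    """bestscore(X_i, U) lookup on a sorted-list sparse parent graph,
    assuming `scores` is sorted from best to worst and `parent_sets[j]`
    is the bitmask of the parent set achieving `scores[j]`.
    `candidate_mask` encodes the candidate parents U as a bitmask."""
    for s, pa in zip(scores, parent_sets):
        # The first parent set contained in U is optimal, because the
        # list is ordered by score (cf. Theorem 2.4).
        if pa & ~candidate_mask == 0:  # pa is a subset of U
            return s
    raise ValueError("no parent set is contained in the candidate set")
```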

2.2.6 Heuristic function

In order to prune candidate solutions during the search in the order graph, a heuristic function is needed to assess the quality of subnetworks [YM13]. The idea is based on the fact that an admissible heuristic function can be used to guide the search: the optimal solution to a relaxed problem can be used as an admissible bound for the original problem. However, as this thesis is centered around the performance of delayed duplicate detection, explained in further detail later, the use of a heuristic function to guide the search in the order graph is left aside. All the results presented in this thesis are obtained without any pruning by a heuristic function during the order graph search.


3 Delayed duplicate detection

Many search algorithms are limited by the amount of semiconductor memory available. At the same time, disk storage is a few orders of magnitude cheaper than semiconductor memory. One of the main challenges in using disk instead of memory in search algorithms is the detection of duplicates during the generation of the candidate nodes in the search graph, in other words, duplicate detection. When working with semiconductor memory, duplicate detection can be done with an in-memory hash table. However, due to long disk latencies, randomly accessed hash tables have been infeasible on disk. According to Korf [Kor08], the main reason why in-memory hash tables using virtual memory on disk are infeasible is that accesses to memory would in many cases result in a page fault.

Search algorithms can address the problem of duplicate detection when the search graph does not fit into semiconductor memory by using a mechanism called delayed duplicate detection (DDD) [Kor08, Kor04, Ros94]. It can be generically defined as a process in which the duplicates are detected after they have been written to disk. According to Korf [Kor08], Jerry Brian was among the first to propose the use of delayed duplicate detection to perform brute-force breadth-first searches of Rubik's Cube using external memory. However, the work was never published and only briefly described in a series of posts to the Cube-Lovers mailing list starting in December 1993. It is also worth noting that although the price of semiconductor memory has gone down, there has still been interest in research on delayed duplicate detection, for example in making it faster by moving the calculations to the GPU [ES11].

3.1 Sorting-based delayed duplicate detection

There were not many deviations from Jerry Brian's idea until quite recently. The general idea of delayed duplicate detection until Korf's paper in 2008 [Kor08] was to push the candidate solutions from memory to disk as soon as they are generated [Kor04]. Then, at the end of the search, the candidate solutions are sorted on the disk and duplicates detected. The sorting part of this runs in O(n log n) time, which is the bottleneck of the method.

A variant of this approach was used by Malone et al. [MYHB11] in the MYHB algorithm. In this approach the candidate solutions are first sorted in memory before writing them to separate files on the disk. Once the candidate solutions for one layer in the search graph have been generated, they reside in files on the disk. To create a global order between all of the candidate solutions in the files, the algorithm then reads the topmost candidate solution from each file into memory, thus creating an ordered set of the topmost candidate solutions in memory. By ordering this set and then selecting the first candidate solution according to the ordering, the procedure finds the first candidate solution in the global ordering. After that, the set of candidate solutions is updated to again contain the topmost candidate solutions left in the files on the disk, and then the next candidate solution in the global ordering can be selected. This process finds a global ordering for the nodes on the disk, and the procedure allows the detection of duplicates as the ordering will present duplicates contiguous to each other. The global ordering ensures that after a new candidate solution with a certain score has been read from the disk, all of the duplicates of the candidate solution will be read from the disk before reading a new unseen candidate solution.
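As a sketch of this merge, assuming each file yields (node key, score) records already sorted by key, the duplicate detection can be expressed with a standard k-way merge; this is an illustration, not the MYHB implementation itself.

```python
import heapq

def merge_and_detect_duplicates(files):
    """K-way merge over sorted per-file iterators of (key, score)
    records. Yields one (key, best_score) pair per unique key; the
    global ordering makes duplicates of a key contiguous."""
    merged = heapq.merge(*files)  # global ordering over all files
    current_key, best = None, None
    for key, score in merged:
        if key == current_key:
            best = min(best, score)  # duplicate: keep the best score
        else:
            if current_key is not None:
                yield current_key, best
            current_key, best = key, score
    if current_key is not None:
        yield current_key, best
```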

After Jerry Brian's initial work, there have been two major changes to delayed duplicate detection. One of them is structured duplicate detection [ZH04], but it will not be considered more extensively here. The second one is the hashing-based delayed duplicate detection introduced by Korf [Kor08], which will be discussed next in more detail.

3.2 Hashing-based delayed duplicate detection

Hashing-based delayed duplicate detection runs in linear time in the number of candidate solutions being processed. This makes it faster than the sorting-based method, which runs in O(n log n) time in the number of candidate solutions processed.

The main idea in hashing-based delayed duplicate detection is to partition the search space into non-overlapping subsets of unique candidate solutions. These sets define files on the disk so that a single file only contains candidate solutions belonging to one of the sets. Next, the candidate solutions are written to the files as they are generated. The delayed duplicate detection is done by employing an in-memory hash table: the candidate solutions from a single file are read into the hash table and duplicates are detected one file at a time. The definition of the subsets partitioning the search space must guarantee that the hash table will not contain more than a given number of unique candidate solutions. This ensures that memory requirements of any size can be met.
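The two phases can be sketched as follows in Python, with in-memory lists standing in for the files on disk; all names here are illustrative, not from the thesis.

```python
from collections import defaultdict

def hashing_ddd(candidates, partition_key):
    """Hashing-based DDD sketch, assuming `candidates` yields
    (key, score) pairs and `partition_key(key)` maps a candidate to
    its file-set."""
    # Phase 1: hash each candidate to the "file" of its partition as
    # it is generated.
    files = defaultdict(list)
    for key, score in candidates:
        files[partition_key(key)].append((key, score))
    # Phase 2: detect duplicates one file at a time with an in-memory
    # hash table; distinct files never share a candidate, so each
    # table stays within the memory bound by construction.
    for bucket in files.values():
        best = {}
        for key, score in bucket:
            if key not in best or score < best[key]:
                best[key] = score
        yield from best.items()
```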


Korf [Kor08] presents as an example the 30-disc four-peg Towers of Hanoi problem. He divides the discs into the 16 largest and the 14 smallest, and uses the 16 largest discs to decide the file into which a candidate solution is written. The division of candidate solutions based on the 16 largest discs also reduces the memory requirement, because within a single file the 16 largest discs are the same for each candidate solution. This means that it is actually enough to write only the 14 smallest discs per candidate solution into each file.

The method can also be parallelized without problems [Kor08]. Neither reading from nor writing to separate files is subject to concurrent accesses that would need to be synchronized at the program level. The candidate solutions can be generated in separate threads without two threads accessing the same resources, because a single thread can generate candidate solutions without affecting others. In a layered search, the files containing the candidate solutions from the previous layer can be distributed between the worker threads so that each thread only needs to access one file at a time. Similarly, writes to files on the disk need not be synchronized, because writes to the same file by separate threads are temporarily stored in separate output buffers [Kor08]. In other words, it is enough to allow the operating system to serialize the writes to the files on the disk.

3.3 Hybrid duplicate detection

The MYHB algorithm by Malone et al. [MYHB11] utilizes in-memory duplicate detection as long as possible and resorts to delayed duplicate detection only when necessary. This behavior can be called hybrid duplicate detection, and it is done to reduce the number of writes to the disk. The barrier where in-memory duplicate detection changes to delayed duplicate detection is binary: the in-memory hash table is flushed to disk if it contains more nodes than a given predefined limit.


4 Using delayed duplicate detection with hashing in Bayesian network structure learning

Since the size of the largest layer of the order graph with n variables is of the order of C(n, n/2), even for a modest n the number of unique order graph nodes generated for that layer is too large to fit into the semiconductor memory available to the algorithm. The generation of the nodes during the search through the order graph also produces duplicates, and only the duplicate with the best score should be used in node generation for the next order graph layer. Overall, this means that the algorithm needs to utilize disk to manage all generated order graph nodes.

The MYHB algorithm by Malone et al. [MYHB11] performs delayed duplicate detection using the sorting-based hybrid method. The idea is to perform the delayed duplicate detection using multiple sorted files and reading nodes from them one at a time to detect possible duplicates, as described in Section 3.1. However, in the work by Korf [Kor08], disk-based hashing was superior to the sorting-based approach.

The main idea in this thesis is to study whether this is also the case for Bayesian network structure learning. In principle, the hashing-based DDD algorithm runs in linear time in the number of candidate solutions generated, while the sorting-based DDD algorithm requires O(n log n) time.

The main concern when using hashing to deal with external memory is dividing the generated nodes to different files when writing them to the disk. In the following, the concept of a file-set is used to refer to a set of candidate order graph nodes which are laid into the same file on the disk. The delayed duplicate detection is performed on a file basis, so a practical requirement is that each file-set must not contain more unique nodes than fit into memory; otherwise the duplicate detection would fail. For the implementation, a natural starting point is to first satisfy this memory constraint, and that will be discussed next.

To hash a single candidate order graph node to a certain file on the disk, a file hash function needs to be defined. It divides a single layer of the order graph into subsets of at most a given size. One approach to do the division will be presented next. The idea used in dividing the nodes to different files is very similar to the idea Korf [Kor08] uses when dividing the candidate solutions to different files based on a certain number of bits in the bitstring representing a single candidate solution. Such a division allows the algorithm to control the number of files and the number of unique candidate solutions per file, which is crucial to ensure that the delayed duplicate detection respects the given memory requirements.

4.1 First-k subcombinations

The basic idea in dividing order graph nodes into different files is based on the use of combinations and subcombinations similar to Tamada et al. [TIM11].

Definition 4.1 (Combination). An ordered set of variables U ⊂ V is referred to as a combination, and it uniquely identifies a node in the order graph. With the number of variables in the set of variables V denoted by n, a combination is represented using an indicator set I = {ω1, ω2, ..., ωn}, where each item ωi, 1 ≤ i ≤ n, in the indicator set I can have either value one or zero indicating whether a variable Xi ∈ V belongs to the subset U.

It should be noted that the index i is used in indexing the set of variables V, but also the indicator set I for the combination U. This way the index i creates a precedence order for the variables belonging to the combination, which is important to keep in mind in the following definitions. It is also worth noting that such a precedence order is arbitrary in the sense that variable X1 has the highest precedence solely because it happens to be the first variable in the set of variables V.

Definition 4.2 (Subcombination). An ordered subset S ⊂ U ⊂ V of a combination U is called a subcombination.

Example 4.1. To highlight the importance of the indicator set, we showcase it using set notation. Define an element of V as Xi. Now a subcombination S of combination U can be represented using an indicator set

I = {ωi | if Xi ∈ S, then ωi = 1, otherwise ωi = 0}.

Based on these two definitions we can define the first-k subcombination as follows:

Definition 4.3 (First-k subcombination). A first-k subcombination of combination U is a subcombination having the first k variables in common with combination U.

First-k subcombinations are used in the following to define the file-sets. This means that we need to define when two combinations have a common first-k subcombination.


Definition 4.4 (Indicator set for a common first-k subcombination). Define the index in the variable set V of the kth variable appearing in the first-k subcombination S as l. Given two combinations U1 and U2 from the variable set V with indicator sets IU1 and IU2, the indicator set Ik for their common first-k subcombination is defined as

Ik = {ωi | ωi ∈ IU1, λi ∈ IU2, ωi = λi when i ≤ l ≤ n, otherwise ωi = 0}.

The following example illustrates the definition of a common first-k subcombination.

Example 4.2. If we have a combination U1 = {X1, X4, X6, ...} and another combination U2 = {X1, X4, X7, ...}, we can determine, knowing only the first four indicator values of both combinations, their common first-2 subcombination S = {X1, X4}.

It is important to note that from the definition of the first-k subcombination it follows that each possible combination from the set of variables V belongs to exactly one first-k subcombination for each k. This is built into the definition of the first-k subcombination. We can see this by defining l as the index of the kth variable. Then the first l indicators in the indicator set I = {ω1, ω2, ..., ωn} of a combination U identify uniquely the first-k subcombination in question. The following example explains this in further detail.

Example 4.3. In the following, it is assumed that there are ten variables in the set of variables V = {X1, X2, ..., X10}, the combinations formed from the set of variables have five variables in them, and we are considering first-4 subcombinations. We take as an example the first-4 subcombination S = {X1, X2, X3, X7}. From the definition of the first-4 subcombination S we know that the index l of the fourth variable, X7, is seven. This means that any combination belonging to this first-4 subcombination will have the values {1, 1, 1, 0, 0, 0, 1} as the first seven indicators in its indicator set. If we assumed that there were two first-k subcombinations with the same indicator sets, then the two would by definition be the same first-k subcombination. In other words, any combination having these seven indicators set as defined above has the same first-4 subcombination. Two examples are the combinations C1 = {X1, X2, X3, X7, X9} and C2 = {X1, X2, X3, X7, X8}, which both have the same first seven values in their indicator sets IC1 = {1, 1, 1, 0, 0, 0, 1, 0, 1, 0} and IC2 = {1, 1, 1, 0, 0, 0, 1, 1, 0, 0}.
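A minimal Python sketch of extracting this file-identifying prefix from an indicator set, reproducing Example 4.3; the function name is illustrative.

```python
def first_k_key(indicators, k):
    """Return the prefix of the indicator set up to the index l of the
    kth variable; this prefix identifies the first-k subcombination."""
    key, seen = [], 0
    for w in indicators:
        key.append(w)
        seen += w
        if seen == k:      # the index of the kth variable (l) reached
            return tuple(key)
    raise ValueError("combination has fewer than k variables")

# Both combinations from Example 4.3 share the same first-4 key:
c1 = [1, 1, 1, 0, 0, 0, 1, 0, 1, 0]
c2 = [1, 1, 1, 0, 0, 0, 1, 1, 0, 0]
assert first_k_key(c1, 4) == first_k_key(c2, 4) == (1, 1, 1, 0, 0, 0, 1)
```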

Based on the definition of a first-k subcombination, we define a set of combinations having a common first-k subcombination. A set of combinations is a set of sets: it will be called a family of combinations and referred to with an italic capital letter such as F.


Definition 4.5 (Family of combinations). A family of combinations is a family of sets containing combinations having a certain common subcombination. In other words, given a first-k subcombination S with k ≤ l, for each combination Uj, j ∈ N, in the family of combinations F, the indicator set of the jth combination must satisfy

IUj = {ωi | ωi ∈ IS when i ≤ l ≤ n}.

Not all of the possible first-k subcombinations correspond to combinations. For example, if we have ten variables in the set of variables V = {X1, X2, ..., X10} and a combination has five variables, then a first-2 subcombination having the variables X9 and X10 corresponds to zero combinations. This is because a node having variables X9 and X10 must have three other variables in it as well, and all of these three variables must have indices smaller than 9 and 10, because 9 and 10 are the largest indices; thus X9 and X10 cannot be the first two variables of the combination.

4.2 Selecting the number of variables used in a first-k subcombination

Binomial coefficients can be used to calculate how many distinct variable combinations belong to the family of combinations F for a first-k subcombination. With the number of variables in the set of variables V denoted by n, the number of distinct order graph nodes having m variables is given by the binomial coefficient C(n, m).

Similarly, the same method can be used to solve for the number of variables used in the first-k subcombination defining the family of combinations for each file, i.e. a file-set. The number k of first-k variables used to create the family of combinations F can be calculated by enforcing that the number of distinct variable combinations per file is at most the number of nodes that fit into memory. In practice, the solver starts by guessing that the number of variables used in the first-k subcombination is not larger than m. Then k is decremented one by one until the number of order graph nodes in the family of combinations per file is as large as possible while still fitting into memory. The implementation selects k to be as small as possible to avoid creating more files than needed, because more files increase overhead.

We will now present an example showing how to calculate the number of unique combinations for a single file-set based on a first-k subcombination. This example mimics the process used in the implemented algorithm to find the number of variables used in the first-k subcombination, which defines the file-sets on the disk.

Example 4.4. In the following, it is assumed that there are ten variables in the set of variables V = {X1, X2, ..., X10} and we are on level 5 of the order graph, i.e., we have five variables in a candidate order graph node. This means that there are C(10, 5) = 252 unique order graph nodes. Assume now that we can fit into memory at most 50 unique order graph nodes at a time. This means that a single file-set can contain at most 50 unique nodes. We can now start the iteration, where we decrement the number of variables used in the first-k subcombination one by one until a first-k subcombination matches at most 50 unique order graph nodes. If we use five variables for the first-k subcombination, then there will be exactly one unique order graph node in each of the files, and 252 different files. If we use four variables for the first-k subcombination, then there will be C(10 − 4, 5 − 4) = C(6, 1) = 6 unique combinations in each file-set. We can decrease the number of variables used in the first-k subcombination to three, which means that there will be C(10 − 3, 5 − 3) = C(7, 2) = 21 unique combinations in each file-set. If we decreased the number of variables used in the first-k subcombination to two, then there would be C(10 − 2, 5 − 2) = C(8, 3) = 56 unique combinations in each file-set, and that would violate the memory limit set earlier. When we use three as the number of variables in the first-k subcombination, the possible number of files used on the disk can be calculated by noticing that there are three variables that we select from a set of ten variables. Thus, we can have C(10, 3) = 120 different file-sets based on this first-k subcombination.
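The iteration of Example 4.4 can be sketched as follows in Python; math.comb computes the binomial coefficients, and the names are illustrative rather than from the implementation.

```python
from math import comb

def select_k(n, l, capacity):
    """Choose k as in Example 4.4: with k variables fixed, each file-set
    holds at most C(n - k, l - k) unique nodes on layer l, so pick the
    smallest k whose file-sets still fit into `capacity` nodes."""
    k = l  # k = l gives one node per file, which always fits
    while k > 0 and comb(n - (k - 1), l - (k - 1)) <= capacity:
        k -= 1  # fewer, larger files as long as the memory bound holds
    return k

# Example 4.4: ten variables, layer 5, at most 50 nodes in memory.
assert select_k(10, 5, 50) == 3  # C(7, 2) = 21 <= 50, but C(8, 3) = 56 > 50
```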

It is worth noting here that the number of files used on the disk is far from perfect. When we have 21 unique order graph nodes in a single file, and altogether 252 unique order graph nodes, it would make sense to have only 252/21 = 12 files on the disk. However, the definition of the first-k subcombinations means that some of the files will contain only a few unique order graph nodes. This property of the first-k subcombinations will be assessed later in more detail.

4.3 Dividing candidate order graph nodes to files on the disk

After selecting the number of variables used in the first-k subcombination, the next step is to divide the candidate order graph nodes to the files on the disk based on the file-set to which they belong. Each first-k subcombination will have its own file, and the definition of the first-k subcombination guarantees that none of these files will contain more unique nodes than can fit into memory.

We will next present an example showing how the candidate order graph nodes are divided to files on the disk based on the first-k subcombination defining the different file-sets. This is the process used in the algorithm when writing the candidate order graph nodes to the disk.

Example 4.5. In the following, it is assumed that there are ten variables in the set of variables V = {X1, X2, ..., X10} and we are on level 5 of the order graph, i.e., we have five variables in a candidate order graph node. If we now use two variables for the first-k subcombinations, we have, for example, the following first-k subcombinations: {X1, X2}, {X1, X3}. Similarly, we have, for example, the following candidate order graph nodes which need to be written to the files on the disk: {X1, X2, X3, X5, X6}, {X1, X2, X5, X6, X9}. Both of these nodes would be written to the file corresponding to the first-k subcombination {X1, X2}, because the first two variables of both combinations are X1 and X2.
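A minimal sketch of this assignment, reproducing Example 4.5 with nodes represented as sets of 1-based variable indices; the function name is illustrative.

```python
def file_for_node(node, k):
    """Identify a node's file by its k smallest variable indices,
    i.e. by the node's first-k subcombination."""
    return tuple(sorted(node)[:k])

# Both nodes from Example 4.5 land in the file for {X1, X2}:
assert file_for_node({1, 2, 3, 5, 6}, 2) == (1, 2)
assert file_for_node({1, 2, 5, 6, 9}, 2) == (1, 2)
```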

4.4 Distribution of candidate order graph nodes to different files on the disk

The definition of the first-k subcombination means that the distribution of unique order graph nodes to files is uneven. The maximum number of nodes per file will also be reached in at most one file. In other words, there is still room for improvement in the way this hashing function divides order graph nodes into different files. To quantify the distribution of combinations to file-sets, we next define a union of combinations.

Definition 4.6 (Union of combinations). A union of combinations is a union of the variables belonging to the combinations that are part of the union. For combinations Uj, j ∈ N, the union is defined as ⋃_{j∈N} Uj. In other words, for combinations with indicator sets Ij and components ωi,j, the indicator set for the union, I∪, is defined as

I∪ = {ωi | if ωi,j = 1 for at least one j, then ωi = 1, otherwise ωi = 0}.

We now consider an example showing how first-k subcombinations lead to an uneven distribution of candidate order graph nodes to files on the disk.

Example 4.6. In the following, it is assumed that there are ten variables in the set of variables V = {X1, X2, ..., X10} and we are on level 3 of the order graph. This means that there are C(10, 3) = 120 unique order graph nodes. In addition, there are C(9, 2) = 36 nodes containing, for example, X1, because if we take one of the variables as given, i.e. X1, then we have nine variables left and we need to choose two of them. Similarly, we have C(8, 1) = 8 nodes containing the subcombination {X1, X2} or the subcombination {X1, X3}. Now if we look at the union of the two earlier subcombinations, {X1, X2, X3}, we can see that there is C(7, 0) = 1 node containing this combination. This means effectively that this one node will be laid to the file corresponding to the subcombination {X1, X2}, because variables X1 and X2 precede variable X3 in the ordering of the variables in the set of variables V. This means that instead of having eight nodes, the file for the subcombination {X1, X3} will have one node less. The same will also happen for the subcombinations {X1, X3} and {X1, X4}, whose union is {X1, X3, X4}. The file for the subcombination {X1, X4} will contain only six nodes, because the subcombinations {X1, X2} and {X1, X3} will both steal one node from it. The analysis can be continued in this case up to the super-combination {X1, X9, X10}, and it would show that the file for the subcombination {X1, X9} will have exactly one node and the file for the subcombination {X1, X10} will not contain any nodes. This means that combinations will be divided unevenly between the file-sets and only one of the file-sets, in this case the file-set for the first-2 subcombination {X1, X2}, will contain the maximum number of nodes.

Table 1 shows the counts and percentages for five different subcombinations for three different variable sets on different levels of the order graph. It also shows what happens when the number of nodes on a layer is increased or when the number of variables in the set of variables is increased. The main takeaway from the table is that as the number of variables increases, the number of variables in the subcombinations used needs to increase as well to allow the hashing to properly divide the order graph nodes into files on the disk. The table also highlights the fact that the hardest part for the hashing is always the layer of the order graph containing half of the variables in the set of variables. For example, if we have twenty variables in the variable set and the subcombination uses four variables, then the most combinations are created on layers twelve and thirteen of the order graph. This can be explained by the fact that the number of unique combinations from a set of variables V with n variables is largest when we select a subset with n/2 variables. In other words, when we fix four of the 20 variables, 20 − 4 = 16 variables remain. The most combinations are therefore created on level twelve of the order graph, because there we fix four of the twelve variables in a node, leaving 12 − 4 = 8 free variables, which is half of sixteen.
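The counts in Table 1 follow from a simple binomial identity. On layer l, a node in the family of the first-k subcombination {1, ..., k} is determined by choosing its remaining l − k variables from the n − k variables that are left, so the family contains

    C(n − k, l − k)

nodes. For example, C(20 − 4, 12 − 4) = C(16, 8) = 12,870, which matches the entry for the subcombination {1, 2, 3, 4} on layer 12 for twenty variables; as a function of l, the count is largest when l − k is about (n − k)/2.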


Table 1 also shows that as the number of variables in the order graph node increases, the effect of stealing from other nodes increases. This can be seen from the fact that, for example, the number of nodes belonging to the family of the first-2 subcombination {1, 2} increases until about half of the nodes belong to its family. The problem with stealing is not very apparent with ten variables in the set of variables. However, there are 24,310 nodes in the family of the first-3 subcombination {1, 2, 3} on layer eleven of the order graph for twenty variables, so there the problem comes into focus. By increasing the size of the first-k subcombination, the problem can be alleviated: the subcombination {1, 2, 3, 4, 5} has only 5,005 nodes on layer eleven and at most 6,435 nodes on layer 12 of the order graph for twenty variables.


                       counts (percentage of layer size in parentheses)
 n   l   size       {1}               {1, 2}            {1, 2, 3}         {1, 2, 3, 4}      {1, 2, 3, 4, 5}
10   1        10    1 (0.10)          0 (0.00)          0 (0.00)          0 (0.00)          0 (0.00)
10   2        45    9 (0.20)          1 (0.02)          0 (0.00)          0 (0.00)          0 (0.00)
10   3       120    36 (0.30)         8 (0.07)          1 (0.01)          0 (0.00)          0 (0.00)
10   4       210    84 (0.40)         28 (0.13)         7 (0.03)          1 (0.00)          0 (0.00)
10   5       252    126 (0.50)        56 (0.22)         21 (0.08)         6 (0.02)          1 (0.00)
10   6       210    126 (0.60)        70 (0.33)         35 (0.17)         15 (0.07)         5 (0.02)
10   7       120    84 (0.70)         56 (0.47)         35 (0.29)         20 (0.17)         10 (0.08)
10   8        45    36 (0.80)         28 (0.62)         21 (0.47)         15 (0.33)         10 (0.22)
10   9        10    9 (0.90)          8 (0.80)          7 (0.70)          6 (0.60)          5 (0.50)
10  10         1    1 (1.00)          1 (1.00)          1 (1.00)          1 (1.00)          1 (1.00)
15   1        15    1 (0.07)          0 (0.00)          0 (0.00)          0 (0.00)          0 (0.00)
15   2       105    14 (0.13)         1 (0.01)          0 (0.00)          0 (0.00)          0 (0.00)
15   3       455    91 (0.20)         13 (0.03)         1 (0.00)          0 (0.00)          0 (0.00)
15   4     1,365    364 (0.27)        78 (0.06)         12 (0.01)         1 (0.00)          0 (0.00)
15   5     3,003    1,001 (0.33)      286 (0.10)        66 (0.02)         11 (0.00)         1 (0.00)
15   6     5,005    2,002 (0.40)      715 (0.14)        220 (0.04)        55 (0.01)         10 (0.00)
15   7     6,435    3,003 (0.47)      1,287 (0.20)      495 (0.08)        165 (0.03)        45 (0.01)
15   8     6,435    3,432 (0.53)      1,716 (0.27)      792 (0.12)        330 (0.05)        120 (0.02)
15   9     5,005    3,003 (0.60)      1,716 (0.34)      924 (0.18)        462 (0.09)        210 (0.04)
15  10     3,003    2,002 (0.67)      1,287 (0.43)      792 (0.26)        462 (0.15)        252 (0.08)
15  11     1,365    1,001 (0.73)      715 (0.52)        495 (0.36)        330 (0.24)        210 (0.15)
15  12       455    364 (0.80)        286 (0.63)        220 (0.48)        165 (0.36)        120 (0.26)
15  13       105    91 (0.87)         78 (0.74)         66 (0.63)         55 (0.52)         45 (0.43)
15  14        15    14 (0.93)         13 (0.87)         12 (0.80)         11 (0.73)         10 (0.67)
15  15         1    1 (1.00)          1 (1.00)          1 (1.00)          1 (1.00)          1 (1.00)
20   1        20    1 (0.05)          0 (0.00)          0 (0.00)          0 (0.00)          0 (0.00)
20   2       190    19 (0.10)         1 (0.01)          0 (0.00)          0 (0.00)          0 (0.00)
20   3     1,140    171 (0.15)        18 (0.02)         1 (0.00)          0 (0.00)          0 (0.00)
20   4     4,845    969 (0.20)        153 (0.03)        17 (0.00)         1 (0.00)          0 (0.00)
20   5    15,504    3,876 (0.25)      816 (0.05)        136 (0.01)        16 (0.00)         1 (0.00)
20   6    38,760    11,628 (0.30)     3,060 (0.08)      680 (0.02)        120 (0.00)        15 (0.00)
20   7    77,520    27,132 (0.35)     8,568 (0.11)      2,380 (0.03)      560 (0.01)        105 (0.00)
20   8   125,970    50,388 (0.40)     18,564 (0.15)     6,188 (0.05)      1,820 (0.01)      455 (0.00)
20   9   167,960    75,582 (0.45)     31,824 (0.19)     12,376 (0.07)     4,368 (0.03)      1,365 (0.01)
20  10   184,756    92,378 (0.50)     43,758 (0.24)     19,448 (0.11)     8,008 (0.04)      3,003 (0.02)
20  11   167,960    92,378 (0.55)     48,620 (0.29)     24,310 (0.14)     11,440 (0.07)     5,005 (0.03)
20  12   125,970    75,582 (0.60)     43,758 (0.35)     24,310 (0.19)     12,870 (0.10)     6,435 (0.05)
20  13    77,520    50,388 (0.65)     31,824 (0.41)     19,448 (0.25)     11,440 (0.15)     6,435 (0.08)
20  14    38,760    27,132 (0.70)     18,564 (0.48)     12,376 (0.32)     8,008 (0.21)      5,005 (0.13)
20  15    15,504    11,628 (0.75)     8,568 (0.55)      6,188 (0.40)      4,368 (0.28)      3,003 (0.19)
20  16     4,845    3,876 (0.80)      3,060 (0.63)      2,380 (0.49)      1,820 (0.38)      1,365 (0.28)
20  17     1,140    969 (0.85)        816 (0.72)        680 (0.60)        560 (0.49)        455 (0.40)
20  18       190    171 (0.90)        153 (0.81)        136 (0.72)        120 (0.63)        105 (0.55)
20  19        20    19 (0.95)         18 (0.90)         17 (0.85)         16 (0.80)         15 (0.75)
20  20         1    1 (1.00)          1 (1.00)          1 (1.00)          1 (1.00)          1 (1.00)

Table 1: Counts and, in parentheses, percentages of the order graph nodes per layer l in the families of the five subcombinations {1}, {1, 2}, {1, 2, 3}, {1, 2, 3, 4} and {1, 2, 3, 4, 5}, for variable sets V = {1, 2, 3, ...} with n = 10, 15 and 20 variables, on all layers of the order graph.


5 Experimental results

5.1 Implementation details

The hashing-based DDD was implemented on top of the codebase for structure learning created by Malone for a series of published papers [MYHB11, YM13]. The codebase1 is written in C++ and can be obtained from the original author. Thus, a natural experimental setup is to compare the effect of the changes against the original codebase. The changes are available as a fork2 of the original codebase.

The implementation allows the program to have at most 500 files open at the same time while dividing the order graph nodes. This is because the operating system can only have a limited number of files open at any time. The operating system used in the tests had a hard limit of 1024 open files per process, so 500 files was selected to ensure that no operating-system-related problems affect the tests.
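As an illustration of how such a cap can be kept safely below the operating system limit, a minimal POSIX sketch might look as follows. The helper function is hypothetical and not part of either codebase; the headroom value is an assumption chosen here for the example.

#include <sys/resource.h>
#include <cstdio>

// Return a safe number of simultaneously open node files, leaving
// headroom under the operating system's per-process descriptor limit.
int safeOpenFileCount(int requested) {
    struct rlimit rl;
    if (getrlimit(RLIMIT_NOFILE, &rl) != 0) {
        return requested;  // limit unknown; fall back to the requested cap
    }
    // Reserve descriptors for stdin/stdout/stderr, logs and the like.
    const int headroom = 24;
    long usable = static_cast<long>(rl.rlim_cur) - headroom;
    return usable < requested ? static_cast<int>(usable) : requested;
}

int main() {
    std::printf("using at most %d open files\n", safeOpenFileCount(500));
}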

The implementation uses hybrid duplicate detection in the same manner as the MYHB algorithm. The algorithm takes as a parameter the maximum number of nodes kept in memory, and once this limit is reached, all of the nodes in memory at that point in time are written to disk.
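In outline, the hybrid scheme keeps candidate nodes in an in-memory hash set and, when the limit is hit, partitions everything in memory into the node files on disk, where duplicates are detected later. The sketch below is a simplification under the same bitmask representation as in the earlier example; the hashing function, class names and bookkeeping are assumptions of this sketch, not the actual codebase's interface.

#include <cstdint>
#include <cstdio>
#include <unordered_set>
#include <utility>
#include <vector>

// Hybrid duplicate detection: duplicates generated while the in-memory set
// fits are caught by the hash set; when maxNodes is exceeded, everything in
// memory is appended to per-subcombination files for delayed detection.
class HybridDuplicateDetector {
public:
    HybridDuplicateDetector(std::size_t maxNodes, std::vector<std::FILE*> files)
        : maxNodes_(maxNodes), files_(std::move(files)) {}

    void add(std::uint32_t node, int k) {
        inMemory_.insert(node);          // in-memory duplicate detection
        if (inMemory_.size() >= maxNodes_) flush(k);
    }

    void flush(int k) {
        for (std::uint32_t node : inMemory_) {
            std::size_t file = fileIndex(node, k) % files_.size();
            std::fwrite(&node, sizeof node, 1, files_[file]);
        }
        inMemory_.clear();               // disk now holds possible duplicates
    }

private:
    // Rank the first-k subcombination; here simply the k lowest set bits
    // packed into an index (a stand-in for the thesis's hashing function).
    static std::size_t fileIndex(std::uint32_t node, int k) {
        std::size_t index = 0;
        for (int i = 0, taken = 0; i < 32 && taken < k; ++i)
            if (node & (1u << i)) { index = index * 32 + i; ++taken; }
        return index;
    }

    std::size_t maxNodes_;
    std::vector<std::FILE*> files_;
    std::unordered_set<std::uint32_t> inMemory_;
};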

5.2 Environment

Experiments were performed on a cluster consisting of 240 Dell PowerEdge M610 blade servers. The operating system on the nodes was Ubuntu 12.04.5 LTS (GNU/Linux 3.2.0-60-generic x86_64). Each node had 32 gigabytes of RAM and two four-core Intel Xeon E5540 processors supporting hyper-threading. Each node had 16 gigabytes of local disk for all users, which meant that the running program could write at most 16 gigabytes to the disk. In a more realistic situation the amount of disk is many times the amount of RAM.

The code is single-threaded, but the experiments were run in parallel on separate nodes, with each node running only a single instance at a time. None of the experiments were run on a dedicated node, so other users' access to a node's resources might also be reflected in the results in nondeterministic ways. Because other users could affect the results, all of the timing tests were run ten times to minimize their impact.

1 https://bitbucket.org/bmmalone/urlearning-cpp
2 https://bitbucket.org/nikkekin/urlearning-cpp



Measurements for the runtime were taken by adding logging statements into the code. The Boost Timer library3 was used to log time, and the results reported in the following use the concepts introduced in the Timer library. Wall time measures time as if it were measured by a clock on the wall. User time measures the time charged by the operating system as user CPU time, and system time measures the time charged by the operating system as system CPU time. CPU time should be the sum of the latter two, and any deviation between the wall and CPU times suggests either that the code runs in parallel on multiple cores or that there has been waiting, for example, on input and output. The code used is single-threaded, so the deviations seen in our case result from congestion on input and/or output.
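A minimal sketch of this kind of instrumentation with Boost.Timer follows; the actual logging statements of the codebase are not reproduced here, and the work being measured is left as a placeholder.

#include <boost/timer/timer.hpp>
#include <iostream>

int main() {
    boost::timer::cpu_timer timer;

    // ... the work being measured, e.g. one round of delayed
    // duplicate detection, would run here ...

    boost::timer::cpu_times t = timer.elapsed();  // nanoseconds
    double wall = t.wall / 1e9;
    double user = t.user / 1e9;
    double system = t.system / 1e9;
    // CPU time = user + system; a wall time well above CPU time
    // indicates waiting, for example on input and output.
    std::cout << "wall " << wall << "s, user " << user
              << "s, system " << system
              << "s, cpu " << (user + system) << "s\n";
}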

5.3 Data used

The experiments consist of two different scenarios. The first scenario is the same as that used by Malone and Yuan [MY13], to allow testing the sensitivity of the algorithm to the network and dataset generation. The scenario consists of networks of 29 variables with 2, 4, 6 or 8 maximum parents per variable. In addition, the generated dataset sizes are varied with 1k, 5k, 10k and 20k data points. The dataset names are given as follows: <number of variables>.<maximum number of parents>.<number of data points>.

The second scenario was generated using the same ideas to test the sensitivity of the algorithm to the network and dataset generation, but this time with a dataset with fewer variables to allow a more fine-grained understanding of the dataset. Only 20 variables were used in the network to make sure that the runtimes of the algorithms would stay manageable. The maximum numbers of parents were again 2, 4, 6 or 8, and the dataset sizes were again 1k, 5k, 10k and 20k data points. The naming of the datasets follows the convention explained above. The results are presented first for the dataset with 20 variables, as it introduces the concepts used in the analysis.

For the scenario with 20 variables, the datasets were generated as follows. The generation mechanism was the same as that used by Malone and Yuan [MY13] and utilizes the C++ library already referenced above and a Java library provided by Malone and Yuan4.

3 http://www.boost.org


First, a random Bayesian network was generated using the Ide and Cozman Bayesian network structure generation algorithm [IC02], given the structural constraint on the maximum number of parents. Then logic sampling [Hen88] was used to sample data points from the generated network. Logic sampling is a process where the network is first topologically sorted and then sampled variable by variable in that order. The sampling works correctly because the topological sorting guarantees that once a certain node is reached, its parents have already been sampled. This means that it is enough to sample from the conditional distribution of the variable given its parents. After the dataset for a certain network was generated, the parent scores were calculated and stored in a file, which was used as an input for the algorithm.
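As an illustration, the sampling step reduces to the following loop once the variables are in topological order. This is a toy sketch over binary variables, not the Java library's actual interface; the function names and the example conditional probabilities are assumptions made here.

#include <cstdio>
#include <random>
#include <vector>

// One data point by logic sampling over binary variables: visit variables
// in topological order and sample each from its conditional distribution
// given its (already sampled) parents.
std::vector<int> logicSample(const std::vector<int>& order,
                             const std::vector<std::vector<int>>& parents,
                             double (*pTrue)(int, const std::vector<int>&),
                             std::mt19937& rng) {
    std::vector<int> sample(order.size(), 0);
    std::uniform_real_distribution<double> unit(0.0, 1.0);
    for (int v : order) {
        std::vector<int> parentValues;
        for (int p : parents[v]) parentValues.push_back(sample[p]);
        // Topological order guarantees the parents are already sampled.
        sample[v] = unit(rng) < pTrue(v, parentValues) ? 1 : 0;
    }
    return sample;
}

// Toy conditional distribution for a two-variable chain X0 -> X1
// (hypothetical numbers, for illustration only).
double pTrue(int v, const std::vector<int>& parentValues) {
    if (v == 0) return 0.3;                   // P(X0 = 1)
    return parentValues[0] == 1 ? 0.9 : 0.2;  // P(X1 = 1 | X0)
}

int main() {
    std::mt19937 rng(42);
    std::vector<int> order = {0, 1};
    std::vector<std::vector<int>> parents = {{}, {0}};
    for (int i = 0; i < 5; ++i) {
        std::vector<int> s = logicSample(order, parents, pTrue, rng);
        std::printf("%d %d\n", s[0], s[1]);
    }
}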

5.4 Overall performance

Overall, user time accounts for the bulk of the time spent running the learning algorithm, and the difference between the wall time and the total used CPU time is not large, which suggests that there is not much waiting for the disk. This holds for both datasets and indicates that writing to disk is not a bottleneck. It is also clear that as the maximum number of nodes kept in memory is increased, less delayed duplicate detection is needed: the more duplicate detection takes place in memory, the less of it has to be delayed to the disk.

5.5 Scenario with 20 variables

Overall, the results reported in Table 2 and in Tables 12-26 in online Appendix A5 show that the runtime results for the two algorithms with 20 variables are not sensitive to either the number of parents in the generating network or the amount of data. For brevity, only Table 2 is included in the main part of the thesis; the rest of the tables are in online Appendix A.

The results in, for example, Table 2 show that the results for the sorting codebase with 1000 maxnodes are empty. This is because the algorithm died in the middle of the run: the number of open file handles exceeded the operating system's limit of 1024 open file handles per process.

4 http://www.cs.helsinki.fi/u/bmmalone/urlearning-files/URLearning.jar
5 https://bitbucket.org/nikkekin/urlearning-cpp/src/original_with_timer/tables.pdf



Another point to make is that when the maximum number of nodes kept in memory is reduced from 5000 to 1000, the performance of the codebase using hashing deteriorates significantly: the mean time used for hashing-based delayed duplicate detection jumps from 6.48 seconds to 21.15 seconds. This deterioration is probably related to the fact that the number of files used by the hashing-based codebase increases dramatically. The important point here is that the maximum number of nodes kept in the semiconductor memory governs two things in the hashing codebase: the number of files on the disk and the number of times the nodes in memory are divided into those files. In the MYHB algorithm the idea is similar, but the difference is that the number of files used on the disk does not depend on the way the nodes can be divided into them: each time the nodes are written to disk, a new file is created.

With a small maximum number of nodes kept in memory, the sorting-based codebase seems overall to slightly outperform the hashing-based codebase. However, as the maximum number of nodes kept in memory is increased, the hashing-based codebase seems to overtake the sorting-based codebase. In both cases the differences are not significant and might be related to, for example, the different data structures used by the codebases or the different ways the two write the nodes to the disk.

It is also worth noting that when 1000 is used as the maximum number of nodes kept in memory, the results do show a 0.75-second delay from the disk, calculated as the difference between the mean wall and CPU times for the hashing-based codebase. This is consistent over all of the results reported for the networks with 20 variables. However, even in this case the disk accounts for only around 0.75/21.15 = 3.5% of the runtime, which is not a large percentage. Overall, the results show that both the sorting and the hashing codebases are able to utilize the disk quite effectively without suffering from long waits for input and output.


                                      delayed duplicate detection                                 whole run
filename    codebase   maxnodes    n  wall          user          system        cpu           wall          user          system        cpu
20.2.1000   sorting       1,000   10  0.00 (0.00)   0.00 (0.00)   0.00 (0.00)   0.00 (0.00)   0.00 (0.00)   0.00 (0.00)   0.00 (0.00)   0.00 (0.00)
20.2.1000   hashing       1,000   10  21.15 (0.44)  10.79 (0.30)  9.61 (0.20)   20.40 (0.42)  25.76 (0.57)  13.82 (0.41)  11.79 (0.22)  25.62 (0.57)
20.2.1000   sorting       5,000   10  2.58 (0.05)   2.21 (0.11)   0.30 (0.04)   2.51 (0.11)   4.82 (0.10)   4.21 (0.09)   0.61 (0.05)   4.81 (0.10)
20.2.1000   hashing       5,000   10  6.48 (0.05)   4.15 (0.14)   2.14 (0.13)   6.29 (0.17)   9.48 (0.06)   6.55 (0.12)   2.90 (0.08)   9.45 (0.07)
20.2.1000   sorting      10,000   10  2.20 (0.06)   1.92 (0.10)   0.23 (0.04)   2.15 (0.10)   4.24 (0.09)   3.67 (0.10)   0.57 (0.05)   4.23 (0.09)
20.2.1000   hashing      10,000   10  4.15 (0.30)   3.01 (0.25)   1.07 (0.14)   4.08 (0.32)   6.92 (0.49)   5.36 (0.37)   1.54 (0.16)   6.90 (0.49)
20.2.1000   sorting      50,000   10  1.47 (0.01)   1.26 (0.04)   0.16 (0.03)   1.42 (0.06)   3.13 (0.02)   2.53 (0.05)   0.59 (0.04)   3.12 (0.02)
20.2.1000   hashing      50,000   10  1.94 (0.09)   1.74 (0.10)   0.18 (0.03)   1.92 (0.11)   4.09 (0.20)   3.45 (0.17)   0.63 (0.05)   4.08 (0.20)
20.2.1000   sorting      75,000   10  1.39 (0.01)   1.18 (0.05)   0.15 (0.04)   1.33 (0.07)   3.05 (0.02)   2.46 (0.04)   0.58 (0.04)   3.04 (0.01)
20.2.1000   hashing      75,000   10  1.75 (0.01)   1.57 (0.06)   0.15 (0.03)   1.72 (0.06)   3.79 (0.02)   3.18 (0.04)   0.60 (0.04)   3.78 (0.02)
20.2.1000   sorting      88,000   10  1.30 (0.05)   1.12 (0.05)   0.15 (0.02)   1.27 (0.06)   2.93 (0.12)   2.34 (0.12)   0.58 (0.05)   2.92 (0.11)
20.2.1000   hashing      88,000   10  1.57 (0.02)   1.40 (0.06)   0.14 (0.03)   1.54 (0.07)   3.60 (0.03)   3.01 (0.05)   0.58 (0.03)   3.59 (0.03)
20.2.1000   sorting      94,000   10  1.30 (0.04)   1.12 (0.06)   0.16 (0.04)   1.28 (0.07)   2.94 (0.11)   2.37 (0.09)   0.56 (0.04)   2.93 (0.11)
20.2.1000   hashing      94,000   10  1.53 (0.01)   1.34 (0.06)   0.16 (0.03)   1.50 (0.07)   3.49 (0.01)   2.88 (0.04)   0.60 (0.03)   3.48 (0.01)
20.2.1000   sorting     125,000   10  1.22 (0.01)   1.04 (0.03)   0.14 (0.03)   1.18 (0.01)   2.79 (0.01)   2.23 (0.04)   0.55 (0.04)   2.78 (0.01)
20.2.1000   hashing     125,000   10  1.46 (0.01)   1.30 (0.06)   0.13 (0.05)   1.43 (0.07)   3.42 (0.01)   2.83 (0.03)   0.58 (0.03)   3.41 (0.01)
20.2.1000   sorting     150,000   10  1.03 (0.04)   0.88 (0.07)   0.13 (0.04)   1.01 (0.07)   2.72 (0.10)   2.14 (0.09)   0.57 (0.04)   2.72 (0.10)
20.2.1000   hashing     150,000   10  1.13 (0.01)   0.99 (0.08)   0.11 (0.03)   1.10 (0.07)   3.10 (0.02)   2.52 (0.04)   0.57 (0.04)   3.09 (0.02)
20.2.1000   sorting     170,000   10  0.82 (0.02)   0.69 (0.07)   0.12 (0.02)   0.81 (0.08)   2.38 (0.06)   1.84 (0.08)   0.53 (0.04)   2.38 (0.07)
20.2.1000   hashing     170,000   10  0.82 (0.04)   0.66 (0.07)   0.11 (0.03)   0.77 (0.09)   2.86 (0.12)   2.24 (0.11)   0.60 (0.03)   2.85 (0.12)
20.2.1000   sorting     190,000   10  0.72 (0.01)   0.56 (0.06)   0.13 (0.03)   0.69 (0.07)   2.28 (0.02)   1.71 (0.04)   0.56 (0.04)   2.27 (0.02)
20.2.1000   hashing     190,000   10  0.61 (0.01)   0.47 (0.04)   0.11 (0.02)   0.58 (0.05)   2.55 (0.02)   1.96 (0.04)   0.59 (0.04)   2.54 (0.02)
20.2.1000   sorting     250,000   10  0.73 (0.03)   0.55 (0.06)   0.14 (0.03)   0.69 (0.07)   2.35 (0.08)   1.72 (0.07)   0.62 (0.05)   2.34 (0.08)
20.2.1000   hashing     250,000   10  0.61 (0.03)   0.49 (0.05)   0.11 (0.03)   0.60 (0.08)   2.57 (0.11)   1.95 (0.10)   0.61 (0.06)   2.56 (0.11)
20.2.1000   sorting     500,000   10  0.72 (0.03)   0.56 (0.04)   0.11 (0.03)   0.67 (0.05)   2.31 (0.10)   1.74 (0.07)   0.57 (0.05)   2.30 (0.10)
20.2.1000   hashing     500,000   10  0.60 (0.00)   0.49 (0.07)   0.12 (0.03)   0.61 (0.07)   2.57 (0.01)   1.97 (0.05)   0.59 (0.04)   2.57 (0.01)
20.2.1000   sorting   1,000,000   10  0.72 (0.02)   0.57 (0.04)   0.14 (0.05)   0.71 (0.07)   2.30 (0.06)   1.71 (0.07)   0.58 (0.04)   2.29 (0.05)
20.2.1000   hashing   1,000,000   10  0.60 (0.00)   0.49 (0.03)   0.12 (0.03)   0.61 (0.04)   2.54 (0.00)   1.94 (0.05)   0.59 (0.05)   2.53 (0.01)

Table 2: Timing results in seconds for the dataset with 20 variables, maximum 2 parents per variable and 1000 observations, from runs with a varying maximum number of nodes kept in semiconductor memory (maxnodes). Each cell reports the mean over the n runs with the standard deviation in parentheses. Timing results for delayed duplicate detection report the time used for delayed duplicate detection, while the timing results for the whole run report the time used running the algorithm altogether.


5.5.1 Locality

Due to hybrid duplicate detection, many duplicates are detected in semiconductor memory. In this context, locality refers to the number of duplicate nodes detected in this way. As the algorithms used in this thesis are deterministic, locality can be measured by running each codebase only once. In addition, as the search of the order graph is restricted not to use any pruning, it proceeds in exactly the same way regardless of the dataset generation mechanism; for example, the maximum number of parents per variable and the number of observations do not impact the results. However, both the maximum number of nodes kept in memory, i.e. maxnodes, and the method used to write and read the candidate order graph nodes from the disk affect the duplicate detection behavior, as they change the order in which candidates are generated.

Table 3 shows the order graph nodes expanded and the candidate order graph nodes written to disk during three runs of the hashing codebase as maxnodes varies. It effectively shows how much more work the delayed duplicate detection has to do on each layer, and as a whole, when the maximum number of nodes is increased from 5,000 to 150,000. In the optimal scenario no delayed duplicate detection is needed, and the number of nodes written to disk on layer l equals the number of nodes expanded on layer l + 1.
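The expansion counts in Table 3 are fixed by the order graph itself: with pruning disabled, every proper subset of the twenty variables (including the empty set) is expanded exactly once, so layer l expands C(20, l) nodes and over all layers

    C(20, 0) + C(20, 1) + ... + C(20, 19) = 2^20 − 1 = 1,048,575

nodes are expanded, which matches the "All" row of the table.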

Similarly, Table 3 also shows the same information for the codebase using sorting. The differences in the number of nodes written to disk between hashing and sorting are explained by the fact that the two methods read the nodes from the previous layers from the disk in a different order. Thus, they also expand nodes in different orders, which means that a different number of duplicates is detected in memory. Surprisingly, there is quite a big difference of almost one million nodes between the two codebases when only 5000 candidate nodes are kept in memory. Interestingly, the results for twenty variables also show that for these three values of the maximum number of nodes kept in memory, the sorting codebase always writes fewer order graph nodes to disk. One plausible explanation is that the sorting codebase implicitly benefits from the order in which order graph nodes are generated and does not distort it in the same way as the hashing codebase does. This is based on the assumption that duplicates of the expanded nodes are more likely to be found in the in-memory hash table for the sorting-based codebase than for the hashing-based one. This is interesting because the sorting-based codebase writes nodes to the disk in groups that are generated and then written out as the number of nodes in the hash table exceeds maxnodes, while the hashing-based codebase distorts this order entirely by using the hashing function to write the nodes to separate files.


                      maxnodes 5,000            maxnodes 50,000           maxnodes 150,000
codebase  layer   expanded     written       expanded     written       expanded     written
hashing       0          1          20              1          20              1          20
              1         20         190             20         190             20         190
              2        190       1,140            190       1,140            190       1,140
              3      1,140       4,845          1,140       4,845          1,140       4,845
              4      4,845      34,191          4,845      15,504          4,845      15,504
              5     15,504     121,570         15,504      38,760         15,504      38,760
              6     38,760     299,277         38,760     148,021         38,760      77,520
              7     77,520     598,778         77,520     334,625         77,520     125,970
              8    125,970     855,097        125,970     483,161        125,970     271,184
              9    167,960   1,088,706        167,960     596,340        167,960     291,981
             10    184,756   1,129,665        184,756     549,388        184,756     283,930
             11    167,960     838,553        167,960     393,760        167,960     125,970
             12    125,970     523,100        125,970     192,809        125,970      77,520
             13     77,520     254,566         77,520      38,760         77,520      38,760
             14     38,760      69,389         38,760      15,504         38,760      15,504
             15     15,504       4,845         15,504       4,845         15,504       4,845
             16      4,845       1,140          4,845       1,140          4,845       1,140
             17      1,140         190          1,140         190          1,140         190
             18        190          20            190          20            190          20
             19         20           1             20           1             20           1
            All  1,048,575   5,825,283      1,048,575   2,819,023      1,048,575   1,374,994
sorting       0          1          20              1          20              1          20
              1         20         190             20         190             20         190
              2        190       1,140            190       1,140            190       1,140
              3      1,140       4,845          1,140       4,845          1,140       4,845
              4      4,845      33,559          4,845      15,504          4,845      15,504
              5     15,504     109,752         15,504      38,760         15,504      38,760
              6     38,760     268,704         38,760     134,479         38,760      77,520
              7     77,520     504,546         77,520     269,226         77,520     125,970
              8    125,970     759,800        125,970     431,549        125,970     254,006
              9    167,960     915,483        167,960     538,909        167,960     284,036
             10    184,756     890,172        184,756     494,549        184,756     261,859
             11    167,960     693,281        167,960     342,943        167,960     125,970
             12    125,970     423,816        125,970     149,992        125,970      77,520
             13     77,520     195,058         77,520      38,760         77,520      38,760
             14     38,760      64,038         38,760      15,504         38,760      15,504
             15     15,504       4,845         15,504       4,845         15,504       4,845
             16      4,845       1,140          4,845       1,140          4,845       1,140
             17      1,140         190          1,140         190          1,140         190
             18        190          20            190          20            190          20
             19         20           1             20           1             20           1
            All  1,048,575   4,870,600      1,048,575   2,482,566      1,048,575   1,327,800

Table 3: Order graph nodes expanded and written to disk for the dataset with twenty variables when the maximum number of nodes kept in memory (maxnodes) varies over the values 5,000, 50,000 and 150,000, for the codebases using hashing and sorting.


5.6 Scenario with 29 variables

Due to the exponential size of the order graph with respect to the number of variables, the scenario with 29 variables creates a much larger learning problem than the scenario with 20 variables. A small Python script was used to execute each algorithm ten times on a given dataset on a single server node. Some of the server nodes failed during the experiments, so the node running the experiment was not


the same for all of the runs, even for the same algorithm with the same parameters. Due to the problems in running the experiments, two sets of experiments are presented. They are otherwise identical, but in the latter a bug in the timing of the delayed duplicate detection was fixed so that the user, system and CPU times for delayed duplicate detection were also reported. Both experiments are shown because of the variation in the results between them, which is probably related to how the load from other users affected the running of the algorithm. One possible source of variation in the load on the servers between the two experiments is the time of the year at which they were run. The first set of results was run during the two weeks starting from the end of July 2014, which is a holiday season in Finland, so the load on the server nodes was likely low compared to other times of the year. The second set of experiments was run in the middle of September 2014, when the holidays were over.

5.6.1 First set of results under light load

The results reported in Tables 5 and 6 show that the mean runtimes for the hashing- and sorting-based algorithms are usually close to each other, with variation between the two depending on the dataset. Overall, as the maximum number of order graph nodes kept in memory is increased, the amount of delayed duplicate detection decreases, because more duplicate detection is done in memory. Interestingly, the results also suggest that both the hashing- and sorting-based codebases seem to be, at least in these tests, sensitive to the maximum number of parents and to the amount of data available, as also noticed earlier by Malone and Yuan [MY13]. This can be seen from the fact that the runtimes of the algorithms are clearly different when these parameters are varied. This can be at least partly explained by the number of scores in the sparse parent lists for the datasets, given in Table 4. These lists need to be scanned during the search to find optimal parents, so their size has an effect on the performance of the algorithm.


dataset       number of scores
29.2.1000     1,081
29.2.5000     5,798
29.2.10000    11,567
29.2.20000    27,138
29.4.1000     2,837
29.4.5000     28,165
29.4.10000    81,128
29.4.20000    258,056
29.6.1000     2,349
29.6.5000     29,472
29.6.10000    104,128
29.8.1000     787
29.8.5000     11,002
29.8.10000    38,611

Table 4: Number of scores in the sparse parent lists for the datasets with 29 variables.

Figures 4, 5 and 6 show the kernel density estimates6 for the runtime of the delayed duplicate detection and for the runtime of the whole algorithm for the first set of results, conditioned on the codebase, the maximum number of nodes kept in memory, the number of data points and the maximum number of parents. Together these figures show that both algorithms behaved in a consistent manner, with their mean runtimes close to each other over the tests; the timing results contain only a few outliers, which do not affect the interpretation of the results.

6 The kernel density functions presented in this thesis were estimated using the standard library routine density from the statistical package R. Density functions have an advantage over histograms when presenting data, because they require fewer parameters. For a histogram, one has to select the number of subintervals dividing the data, the size of the intervals and the locations of the intervals [TK76]. With non-parametric kernel density estimates, it is enough to select only the size of the intervals. The kernel function is situated at each observation instead of grouping observations [TK76]. The size of the intervals corresponds to the standard deviation of the kernel density and is called the bandwidth. Another parameter needed is the smoothing kernel. The R routine uses the normal distribution as the smoothing kernel by default and selects the bandwidth using Silverman's rule of thumb. These parameters are sufficient for the comparisons presented in this thesis, but the bandwidth in particular can be chosen using more advanced methods [JMS96].


dataset     codebase   maxnodes     n  ddd wall              run wall                run user                run system          run cpu
29.2.1000   sorting    25,000,000  10  2,150.17 (135.61)     6,019.91 (463.43)       5,460.54 (456.74)       556.26 (15.46)      6,016.80 (463.22)
29.2.1000   hashing    25,000,000  10  1,908.11 (2.93)       5,374.14 (50.37)        5,057.08 (49.79)        314.64 (2.01)       5,371.72 (50.35)
29.2.1000   sorting    50,000,000  10  1,544.19 (94.84)      5,330.25 (382.56)       4,814.18 (373.89)       513.46 (32.78)      5,327.64 (382.30)
29.2.1000   hashing    50,000,000  10  1,617.55 (108.13)     5,535.33 (400.87)       5,227.31 (401.56)       305.47 (12.77)      5,532.78 (400.65)
29.2.1000   sorting    75,000,000  10  734.80 (20.32)        3,607.10 (192.77)       3,241.62 (194.35)       363.76 (12.00)      3,605.38 (192.70)
29.2.1000   hashing    75,000,000  10  716.95 (1.94)         4,286.77 (8.19)         3,998.45 (6.37)         286.40 (2.80)       4,284.85 (8.19)
29.2.10000  sorting    25,000,000  10  1,909.54 (343.57)     6,648.36 (1,311.40)     6,198.24 (1,233.14)     438.57 (78.65)      6,636.81 (1,305.65)
29.2.10000  hashing    25,000,000  10  1,984.03 (220.97)     6,564.30 (905.34)       6,239.10 (892.78)       322.22 (12.37)      6,561.32 (904.84)
29.2.10000  sorting    50,000,000  10  997.85 (42.71)        4,246.46 (216.04)       3,861.73 (202.58)       382.49 (17.48)      4,244.22 (215.57)
29.2.10000  hashing    50,000,000  10  1,589.61 (141.30)     6,639.36 (664.06)       6,310.17 (660.56)       326.13 (21.90)      6,636.31 (663.78)
29.2.10000  sorting    75,000,000  10  878.78 (167.28)       5,356.61 (1,144.86)     4,885.28 (1,102.76)     464.50 (57.11)      5,349.78 (1,140.82)
29.2.10000  hashing    75,000,000  10  723.72 (8.57)         5,329.47 (151.72)       5,044.67 (156.57)       282.42 (5.43)       5,327.09 (151.65)
29.2.20000  sorting    25,000,000  10  3,592.61 (389.21)     14,758.94 (1,737.70)    14,022.56 (1,695.40)    707.02 (53.72)      14,729.58 (1,731.99)
29.2.20000  hashing    25,000,000  10  2,385.01 (160.28)     9,249.63 (606.53)       8,846.94 (610.94)       398.28 (13.93)      9,245.22 (606.38)
29.2.20000  sorting    50,000,000  10  1,647.15 (1,120.47)   8,490.24 (5,873.91)     7,960.04 (5,670.09)     512.91 (180.50)     8,472.95 (5,848.31)
29.2.20000  hashing    50,000,000  10  1,827.46 (274.34)     8,877.60 (1,484.23)     8,489.37 (1,406.97)     381.78 (80.22)      8,871.15 (1,480.68)
29.2.20000  sorting    75,000,000  10  1,361.14 (472.64)     10,714.96 (4,034.33)    10,073.93 (3,891.70)    617.74 (135.49)     10,691.67 (4,019.65)
29.2.20000  hashing    75,000,000  10  735.79 (27.74)        6,361.16 (263.87)       6,067.05 (257.73)       291.25 (6.81)       6,358.30 (263.69)
29.2.5000   sorting    25,000,000  10  1,691.64 (443.04)     5,256.72 (1,398.11)     4,865.68 (1,266.60)     388.53 (133.47)     5,254.21 (1,397.29)
29.2.5000   hashing    25,000,000  10  3,980.16 (1,568.48)   13,217.18 (5,492.14)    12,608.99 (5,321.35)    586.82 (155.34)     13,195.81 (5,475.46)
29.2.5000   sorting    50,000,000  10  1,547.45 (94.92)      6,037.35 (380.69)       5,511.31 (366.10)       523.25 (26.25)      6,034.56 (380.50)
29.2.5000   hashing    50,000,000  10  1,685.24 (180.11)     6,633.30 (876.23)       6,327.75 (868.77)       302.43 (16.37)      6,630.19 (875.72)
29.2.5000   sorting    75,000,000  10  729.65 (10.36)        4,033.69 (38.47)        3,668.69 (34.87)        363.10 (10.11)      4,031.79 (38.47)
29.2.5000   hashing    75,000,000  10  755.05 (45.13)        5,129.94 (340.74)       4,807.74 (356.81)       319.87 (25.50)      5,127.60 (340.58)
29.4.1000   sorting    25,000,000  10  2,016.93 (108.58)     6,316.97 (313.60)       5,838.23 (303.30)       464.78 (16.99)      6,303.01 (312.83)
29.4.1000   hashing    25,000,000  10  1,961.20 (161.54)     5,907.46 (588.17)       5,593.15 (591.10)       311.65 (3.85)       5,904.80 (587.91)
29.4.1000   sorting    50,000,000  10  1,108.57 (149.91)     4,159.26 (643.01)       3,760.23 (595.77)       394.82 (46.95)      4,155.05 (640.25)
29.4.1000   hashing    50,000,000  10  1,886.20 (86.84)      7,424.72 (470.02)       7,002.16 (453.05)       417.46 (27.17)      7,419.63 (469.06)
29.4.1000   sorting    75,000,000  10  975.75 (11.26)        5,309.89 (47.80)        4,815.29 (48.80)        483.98 (3.14)       5,299.27 (48.18)
29.4.1000   hashing    75,000,000  10  732.18 (96.24)        4,956.64 (811.53)       4,669.91 (787.24)       284.49 (25.56)      4,954.40 (811.08)
29.4.10000  sorting    25,000,000  10  2,088.88 (78.16)      13,683.00 (691.57)      13,091.54 (657.76)      585.19 (35.66)      13,676.72 (691.26)
29.4.10000  hashing    25,000,000  10  1,933.50 (34.50)      10,230.75 (150.92)      9,900.31 (150.40)       325.81 (2.03)       10,226.12 (150.69)
29.4.10000  sorting    50,000,000  10  1,962.42 (459.70)     15,937.79 (3,122.70)    15,295.99 (3,039.08)    623.90 (69.16)      15,919.89 (3,106.51)
29.4.10000  hashing    50,000,000  10  2,142.95 (48.81)      15,150.82 (488.11)      14,627.01 (476.88)      506.99 (12.92)      15,134.00 (486.07)
29.4.10000  sorting    75,000,000  10  996.86 (333.68)       11,321.53 (3,583.92)    10,784.29 (3,444.61)    513.13 (129.03)     11,297.42 (3,570.23)
29.4.10000  hashing    75,000,000  10  732.11 (13.59)        9,358.81 (179.87)       9,055.46 (175.66)       299.13 (14.88)      9,354.58 (179.70)
29.4.20000  sorting    25,000,000  10  1,492.85 (139.01)     17,077.87 (742.46)      16,710.10 (719.64)      359.99 (26.04)      17,070.09 (741.89)
29.4.20000  hashing    25,000,000  10  3,201.09 (1,480.25)   30,588.39 (17,479.62)   29,944.65 (17,057.74)   604.06 (372.73)     30,548.71 (17,428.56)
29.4.20000  sorting    50,000,000  10  1,157.37 (153.38)     17,601.04 (3,052.40)    17,252.06 (2,995.11)    341.17 (57.18)      17,593.23 (3,050.90)
29.4.20000  hashing    50,000,000  10  1,726.67 (189.62)     19,868.54 (1,689.41)    19,496.89 (1,622.63)    358.28 (61.17)      19,855.16 (1,678.10)
29.4.20000  sorting    75,000,000  10  721.89 (51.07)        16,819.60 (1,125.95)    16,332.27 (1,169.45)    478.64 (83.43)      16,810.90 (1,125.82)
29.4.20000  hashing    75,000,000  10  817.17 (66.23)        18,167.08 (648.54)      17,747.01 (625.29)      411.10 (30.08)      18,158.11 (646.24)

Table 5: Timing results in seconds for the first seven datasets from the first set of runs. Each cell reports the mean over the n runs with the standard deviation in parentheses. Timing results for delayed duplicate detection (ddd) report the time used for delayed duplicate detection, while the timing results for the whole run report the time used running the algorithm altogether.


dataset     codebase   maxnodes     n  ddd wall              run wall                run user                run system          run cpu
29.4.5000   sorting    25,000,000  10  2,644.91 (785.36)     10,954.51 (2,934.08)    10,325.70 (2,830.66)    613.23 (89.61)      10,938.93 (2,918.97)
29.4.5000   hashing    25,000,000  10  1,989.50 (134.44)     7,840.63 (701.31)       7,512.79 (696.89)       324.31 (13.68)      7,837.09 (700.97)
29.4.5000   sorting    50,000,000  10  2,507.34 (218.49)     13,723.52 (1,137.32)    13,041.32 (1,081.55)    654.43 (59.96)      13,695.75 (1,130.94)
29.4.5000   hashing    50,000,000  10  1,636.47 (116.67)     7,663.24 (511.75)       7,211.48 (467.28)       318.39 (12.46)      7,529.87 (471.61)
29.4.5000   sorting    75,000,000  10  1,627.96 (282.37)     13,074.73 (2,323.15)    12,380.34 (2,249.61)    657.05 (61.30)      13,037.39 (2,309.79)
29.4.5000   hashing    75,000,000  10  740.90 (50.73)        6,844.61 (753.50)       6,543.63 (737.21)       297.84 (18.51)      6,841.47 (753.11)
29.6.1000   sorting    25,000,000  10  1,702.65 (380.43)     5,258.92 (1,379.73)     4,864.00 (1,280.65)     386.95 (93.06)      5,250.95 (1,373.31)
29.6.1000   hashing    25,000,000  10  1,911.16 (6.56)       5,692.08 (8.58)         5,375.13 (7.99)         314.38 (2.03)       5,689.51 (8.57)
29.6.1000   sorting    50,000,000  10  1,146.52 (13.16)      4,166.82 (28.99)        3,770.82 (24.99)        393.89 (6.54)       4,164.70 (28.96)
29.6.1000   hashing    50,000,000  10  1,700.66 (170.46)     6,189.37 (756.90)       5,734.93 (749.79)       361.08 (46.87)      6,096.00 (750.97)
29.6.1000   sorting    75,000,000  10  809.50 (161.39)       4,249.42 (1,001.97)     3,778.35 (981.47)       465.34 (38.04)      4,243.69 (997.66)
29.6.1000   hashing    75,000,000  10  719.06 (41.87)        4,834.90 (400.58)       4,553.95 (406.73)       278.77 (17.17)      4,832.72 (400.39)
29.6.10000  sorting    25,000,000  10  1,349.13 (8.48)       9,634.72 (18.27)        9,316.49 (15.05)        313.97 (4.27)       9,630.46 (18.26)
29.6.10000  hashing    25,000,000  10  2,439.71 (189.56)     14,536.62 (740.13)      14,108.93 (740.12)      420.49 (6.65)       14,529.43 (739.88)
29.6.10000  sorting    50,000,000  10  1,046.04 (154.32)     9,633.96 (833.04)       9,312.07 (813.40)       317.60 (19.28)      9,629.67 (832.56)
29.6.10000  hashing    50,000,000  10  1,750.69 (160.64)     13,024.34 (1,167.04)    12,678.65 (1,139.38)    339.68 (38.49)      13,018.32 (1,166.26)
29.6.10000  sorting    75,000,000  10  678.68 (23.74)        9,431.60 (452.79)       9,024.17 (462.55)       402.62 (38.20)      9,426.79 (452.63)
29.6.10000  hashing    75,000,000  10  1,080.22 (175.45)     15,653.81 (2,830.53)    15,162.16 (2,715.67)    483.93 (117.72)     15,646.09 (2,828.84)
29.6.5000   sorting    25,000,000  10  1,339.07 (11.03)      5,677.45 (38.02)        5,376.32 (36.30)        298.57 (3.73)       5,674.89 (37.90)
29.6.5000   hashing    25,000,000  10  1,983.51 (201.49)     7,866.05 (811.58)       7,534.40 (788.16)       328.09 (23.05)      7,862.50 (811.13)
29.6.5000   sorting    50,000,000  10  978.25 (3.33)         5,347.19 (8.11)         5,046.81 (7.73)         298.01 (1.29)       5,344.81 (8.11)
29.6.5000   hashing    50,000,000  10  1,615.18 (62.12)      7,864.50 (467.53)       7,550.09 (471.66)       310.85 (7.02)       7,860.94 (467.30)
29.6.5000   sorting    75,000,000  10  654.63 (5.54)         5,168.25 (25.58)        4,787.48 (50.51)        378.16 (38.81)      5,165.64 (25.59)
29.6.5000   hashing    75,000,000  10  1,246.26 (131.49)     11,294.45 (608.61)      10,677.64 (537.30)      573.09 (20.54)      11,250.73 (552.27)
29.8.1000   sorting    25,000,000  10  1,326.45 (6.82)       3,653.97 (11.82)        3,366.45 (10.35)        285.89 (4.46)       3,652.34 (11.82)
29.8.1000   hashing    25,000,000  10  1,986.72 (164.48)     5,681.70 (711.67)       5,360.77 (705.05)       318.35 (12.82)      5,679.12 (711.34)
29.8.1000   sorting    50,000,000  10  1,409.29 (73.48)      4,921.52 (273.03)       4,477.26 (229.87)       441.97 (55.26)      4,919.23 (272.91)
29.8.1000   hashing    50,000,000  10  1,652.28 (75.39)      5,741.93 (457.78)       5,437.86 (463.78)       301.44 (12.77)      5,739.31 (457.59)
29.8.1000   sorting    75,000,000  10  722.78 (11.12)        3,500.94 (90.02)        3,142.37 (91.07)        356.89 (8.32)       3,499.26 (89.96)
29.8.1000   hashing    75,000,000  10  1,181.93 (54.46)      7,069.67 (498.96)       6,498.75 (490.78)       567.40 (21.72)      7,066.15 (498.71)
29.8.10000  sorting    25,000,000  10  2,118.25 (1,228.90)   10,532.66 (5,888.17)    10,074.82 (5,644.10)    444.04 (226.63)     10,518.86 (5,870.52)
29.8.10000  hashing    25,000,000  10  2,155.16 (208.21)     9,723.51 (705.64)       9,391.26 (693.48)       327.80 (18.26)      9,719.07 (705.27)
29.8.10000  sorting    50,000,000  10  1,023.53 (51.69)      6,823.16 (375.01)       6,392.39 (360.35)       426.52 (14.76)      6,818.91 (372.87)
29.8.10000  hashing    50,000,000  10  1,777.87 (365.48)     10,175.69 (1,799.30)    9,761.06 (1,715.27)     361.78 (49.52)      10,122.84 (1,759.02)
29.8.10000  sorting    75,000,000  10  721.59 (92.43)        6,729.97 (615.52)       6,299.54 (589.54)       426.23 (58.14)      6,725.77 (613.03)
29.8.10000  hashing    75,000,000  10  1,180.47 (29.49)      13,078.88 (658.64)      12,513.05 (638.06)      559.64 (22.43)      13,072.69 (658.12)
29.8.5000   sorting    25,000,000  10  1,330.38 (9.66)       4,754.03 (14.94)        4,462.01 (11.77)        289.91 (5.75)       4,751.92 (14.94)
29.8.5000   hashing    25,000,000  10  2,200.48 (402.31)     7,712.17 (1,362.40)     7,352.59 (1,294.38)     350.45 (66.54)      7,703.04 (1,349.14)
29.8.5000   sorting    50,000,000  10  999.87 (30.02)        4,544.29 (145.15)       4,249.96 (146.30)       292.27 (2.25)       4,542.23 (144.95)
29.8.5000   hashing    50,000,000  10  1,705.80 (207.38)     7,636.75 (841.30)       7,336.50 (837.60)       296.75 (17.53)      7,633.25 (840.83)
29.8.5000   sorting    75,000,000  10  653.00 (3.70)         4,227.85 (12.18)        3,842.49 (68.15)        383.19 (59.96)      4,225.68 (12.19)
29.8.5000   hashing    75,000,000  10  785.50 (108.48)       6,257.64 (763.09)       5,920.43 (734.63)       333.81 (40.85)      6,254.24 (761.28)

Table 6: Timing results in seconds for the last seven datasets from the first set of runs. Each cell reports the mean over the n runs with the standard deviation in parentheses. Timing results for delayed duplicate detection (ddd) report the time used for delayed duplicate detection, while the timing results for the whole run report the time used running the algorithm altogether.


[Figure 4 appears here as two grids of kernel density plots with the underlying data points; the x-axes show DDD wall runtime in seconds and the y-axes density. Grid (a) contains one panel per codebase and number of data points (hashing/sorting × 1000, 5000, 10000, 20000); grid (b) contains one panel per codebase and maximum number of parents (hashing/sorting × 2, 4, 6, 8).]

Figure 4: Kernel density estimates and data points for the first set of results for the runtime of the delayed duplicate detection, conditioned on the codebase and the number of data points in the dataset, are presented in Figure (a). In Figure (b) the runtime of the delayed duplicate detection is conditioned on the codebase and the maximum number of parents.


[Figure 5 appears here as two grids of kernel density plots with the underlying data points. Grid (a) shows DDD wall runtime in seconds, with one panel per maximum number of nodes kept in memory and codebase (25M/50M/75M × hashing/sorting); grid (b) shows whole-run wall runtime in seconds, with one panel per codebase and number of data points (hashing/sorting × 1000, 5000, 10000, 20000).]

Figure 5: Kernel density estimates and data points for the first set of results for the runtime of the delayed duplicate detection, conditioned on the codebase and the maximum number of nodes kept in memory, are presented in Figure (a). In Figure (b) kernel density estimates and data points are plotted for the first set of results for the runtime of the algorithm, conditioned on the codebase and the number of data points in the dataset.


[Figure 6 appears here as two grids of kernel density plots with the underlying data points; both show whole-run wall runtime in seconds. Grid (a) contains one panel per codebase and maximum number of parents (hashing/sorting × 2, 4, 6, 8); grid (b) contains one panel per maximum number of nodes kept in memory and codebase (25M/50M/75M × hashing/sorting).]

Figure 6: Kernel density estimates and data points for the first set of results for the runtime of the algorithm, conditioned on the codebase and the maximum number of parents, are presented in Figure (a). In Figure (b) the runtime of the algorithm is conditioned on the codebase and the maximum number of nodes kept in memory.


To analyze the differences between the hashing-based and sorting-based codebases statistically, Mann-Whitney-Wilcoxon [MW47, Wil45] tests7 were executed for each input file, both for the time used for delayed duplicate detection and for the whole run. The results for the first set of runs are given in Table 7. It shows that in most of the cases the null hypothesis stating that the hashing-based and sorting-based populations are the same can be rejected at the 1 percent significance level for both timing results. This suggests that, for most of the input files, the times used for delayed duplicate detection and for the whole run differ statistically significantly between the two methods.

7 The Mann-Whitney-Wilcoxon [MW47, Wil45] tests presented in this thesis were executed using the standard library function wilcox.test from the statistical package R. The Mann-Whitney-Wilcoxon test is a non-parametric test which assesses whether two samples, say A and B, are from the same distribution. If they were, the values in the two samples should be distributed similarly. In practice the test assigns a rank to each observation in the two samples based on the numerical value of the observation in relation to all the observations in the two samples. Then a rank sum is calculated for one of the samples, in this example for sample A. Under the null hypothesis, each of the observations is equally likely to be part of the rank sum calculated, and based on this assumption a distribution for the rank sums can be derived. Based on this distribution, the test assesses the probability of seeing a rank sum smaller than or equal to the one observed for sample A.
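For concreteness, the rank-sum statistic described above can be computed as in the following sketch; ties and the tail-probability computation that wilcox.test performs are left out, and the function name is an illustration only.

#include <algorithm>
#include <cstdio>
#include <vector>

// Rank sum of sample A within the pooled samples, the statistic behind the
// Mann-Whitney-Wilcoxon test. Ties and the p-value computation performed
// by R's wilcox.test are omitted for brevity.
double rankSumOfA(const std::vector<double>& a, const std::vector<double>& b) {
    struct Obs { double value; bool fromA; };
    std::vector<Obs> pooled;
    for (double x : a) pooled.push_back({x, true});
    for (double x : b) pooled.push_back({x, false});
    std::sort(pooled.begin(), pooled.end(),
              [](const Obs& l, const Obs& r) { return l.value < r.value; });
    double rankSum = 0.0;
    for (std::size_t i = 0; i < pooled.size(); ++i)
        if (pooled[i].fromA) rankSum += static_cast<double>(i + 1);  // 1-based ranks
    return rankSum;
}

int main() {
    // Example: A = {1, 3}, B = {2, 4} gives A the ranks 1 and 3, so W = 4.
    std::printf("W = %.0f\n", rankSumOfA({1.0, 3.0}, {2.0, 4.0}));
}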


maxnodes  maxparents  data points   n   DDD p-value   run p-value
25M       2           1000         20   0.000*        0.000*
50M       2           1000         20   0.247         0.481
75M       2           1000         20   0.000*        0.000*
25M       2           10000        20   0.353         0.353
50M       2           10000        20   0.000*        0.000*
75M       2           10000        20   0.481         0.481
25M       2           20000        20   0.000*        0.000*
50M       2           20000        20   0.143         0.143
75M       2           20000        20   0.143         0.143
25M       2           5000         20   0.001*        0.001*
50M       2           5000         20   0.075         0.105
75M       2           5000         20   0.436         0.000*
25M       4           1000         20   0.089         0.002*
50M       4           1000         20   0.000*        0.000*
75M       4           1000         20   0.002*        0.002*
25M       4           10000        20   0.001*        0.000*
50M       4           10000        20   0.315         0.481
75M       4           10000        20   0.000*        0.684
25M       4           20000        20   0.000*        0.000*
50M       4           20000        20   0.000*        0.002*
75M       4           20000        20   0.002*        0.015
25M       4           5000         20   0.023         0.001*
50M       4           5000         20   0.000*        0.000*
75M       4           5000         20   0.000*        0.000*
25M       6           1000         20   1.000         1.000
50M       6           1000         20   0.000*        0.000*
75M       6           1000         20   1.000         0.315
25M       6           10000        20   0.000*        0.000*
50M       6           10000        20   0.000*        0.000*
75M       6           10000        20   0.000*        0.000*
25M       6           5000         20   0.000*        0.000*
50M       6           5000         20   0.000*        0.000*
75M       6           5000         20   0.000*        0.000*
25M       8           1000         20   0.000*        0.000*
50M       8           1000         20   0.000*        0.000*
75M       8           1000         20   0.000*        0.000*
25M       8           10000        20   0.143         0.143
50M       8           10000        20   0.000*        0.000*
75M       8           10000        20   0.000*        0.000*
25M       8           5000         20   0.000*        0.000*
50M       8           5000         20   0.000*        0.000*
75M       8           5000         20   0.000*        0.000*

Table 7: Results from Mann-Whitney-Wilcoxon tests on the first set of results for the time used for delayed duplicate detection (DDD) and for the time used for the whole run. The tests were executed for each individual file used as an input for the two algorithms. P-values significant at the 1 percent level are marked with an asterisk (*).

5.6.2 Second set of results under heavy load

Tables 8 and 9 report the results for the second set of runs. They confirm the earlier finding that a larger maximum number of nodes kept in memory means less writing to disk and a faster runtime. At the same time, the results from the second runs reverse the relative performance of the hashing-based and sorting-based codebases: in these tests, the sorting-based codebase was typically faster. Due to the difference in testing conditions, the results between the two sets


of runs are inconsistent. That is, the best choice of algorithm depends upon theenvironment in which it is executed.

Another interesting observation relates to the timing of the delayed duplicate detection and of the whole run. The whole runtime includes the delayed duplicate detection, yet the delayed duplicate detection shows larger differences between the used CPU time and the wall time than the whole run does. For example, for the dataset 29.2.1000 with sorting and 25 million as the maximum number of nodes kept in memory, the difference between the delayed duplicate detection's wall time and used CPU time is about 25 seconds, while for the same runs the difference between the whole run's wall time and used CPU time is only about six seconds. In a normal situation these differences could be used to judge how much time the algorithm has spent doing input and output.

Figures 7, 8 and 9 show kernel density estimates for the runtime of the delayed duplicate detection and for the runtime of the whole algorithm, conditioned on the codebase, the maximum number of nodes kept in memory, the number of data points and the maximum number of parents. Figure 7 shows that the variation in the delayed duplicate detection's runtime is clearly larger for the hashing-based codebase when the maximum number of nodes kept in memory is larger. Figure 8 shows that delayed duplicate detection in the hashing-based codebase is clearly slower than in the sorting-based codebase when the maximum number of nodes kept in memory is 25 or 50 million.


41ddd

wholerun

wall

user

system

cpu

wall

user

system

cpu

filenam

e2codebase

max

nodes

nmean

sdmean

sdmean

sdmean

sdmean

sdmean

sdmean

sdmean

sd29.2.1000

sorting

25,000,000.00

101,690.27

66.55

1,606.38

65.43

59.49

2.68

1,665.87

65.50

4,854.27

220.32

4,465.43

215.95

382.97

12.00

4,848.40

220.05

29.2.1000

hashing

25,000,000.00

102,805.99

115.78

2,710.20

105.59

66.06

5.43

2,776.27

110.14

8,392.87

473.13

7,961.94

431.65

421.30

55.88

8,383.24

470.55

29.2.1000

sorting

50,000,000.00

101,218.53

219.16

1,151.57

202.58

41.65

8.31

1,193.21

210.69

4,425.68

987.35

4,128.80

945.76

294.08

43.31

4,422.88

986.15

29.2.1000

hashing

50,000,000.00

101,566.48

142.80

1,509.91

141.96

38.55

1.11

1,548.46

142.74

5,562.99

605.61

5,224.33

592.45

335.87

16.55

5,560.21

605.28

29.2.1000

sorting

75,000,000.00

10716.17

7.90

662.37

8.17

34.40

0.79

696.77

7.95

3,522.78

36.19

3,213.64

36.92

307.47

7.34

3,521.11

36.17

29.2.1000

hashing

75,000,000.00

10786.61

59.44

744.17

62.27

27.25

2.29

771.43

60.43

4,854.59

478.65

4,525.95

482.32

326.38

16.05

4,852.33

478.42

29.2.10000

sorting

25,000,000.00

101,528.04

107.23

1,456.46

105.05

49.87

3.43

1,506.34

105.38

5,325.80

401.53

4,967.26

386.25

353.96

27.24

5,321.22

400.54

29.2.10000

hashing

25,000,000.00

102,377.44

291.03

2,297.88

283.59

55.20

6.92

2,353.08

288.36

8,106.72

1,117.54

7,745.62

1,087.63

357.21

36.33

8,102.83

1,116.93

29.2.10000

sorting

50,000,000.00

101,237.03

121.29

1,174.59

116.70

39.15

3.44

1,213.73

118.51

5,543.38

595.76

5,213.71

583.94

327.04

13.55

5,540.76

595.42

29.2.10000

hashing

50,000,000.00

101,750.23

129.49

1,688.21

121.44

38.93

2.99

1,727.14

123.41

7,431.82

601.80

7,064.14

582.67

363.95

31.52

7,428.09

601.29

29.2.10000

sorting

75,000,000.00

10674.32

34.30

625.48

31.23

31.83

2.83

657.31

31.55

4,154.32

312.63

3,817.82

302.39

334.35

30.71

4,152.17

311.98

29.2.10000

hashing

75,000,000.00

10820.85

28.96

777.10

29.38

25.77

2.52

802.87

29.45

6,148.23

201.58

5,815.41

206.88

329.46

19.19

6,144.87

202.08

29.2.20000

sorting

25,000,000.00

101,473.53

79.31

1,404.61

74.80

45.08

3.24

1,449.69

77.41

6,325.81

466.58

5,975.39

434.82

343.81

31.46

6,319.20

462.91

29.2.20000

hashing

25,000,000.00

102,515.68

240.51

2,439.12

238.65

51.62

3.48

2,490.74

239.37

9,548.77

960.14

9,188.89

949.45

355.44

21.28

9,544.33

959.65

29.2.20000

sorting

50,000,000.00

101,047.72

55.90

994.05

53.43

30.36

0.99

1,024.41

53.70

5,493.79

281.70

5,219.65

278.41

271.61

12.69

5,491.26

281.61

29.2.20000

hashing

50,000,000.00

101,661.10

114.28

1,601.79

115.33

38.95

7.26

1,640.74

114.04

7,899.74

593.93

7,531.52

585.98

364.45

42.60

7,895.97

593.68

29.2.20000

sorting

75,000,000.00

10772.73

58.42

720.02

56.32

29.82

2.78

749.84

58.23

5,952.20

571.97

5,590.46

546.27

357.01

27.89

5,947.47

568.99

29.2.20000

hashing

75,000,000.00

10743.33

17.14

698.81

14.25

25.09

1.69

723.90

15.15

6,403.29

229.76

6,043.40

212.22

356.91

28.97

6,400.31

229.62

29.2.5000

sorting

25,000,000.00

101,449.19

74.26

1,383.77

70.56

45.84

2.32

1,429.61

72.87

4,540.63

330.17

4,233.29

316.43

305.22

13.69

4,538.50

329.97

29.2.5000

hashing

25,000,000.00

102,306.62

59.33

2,240.20

58.04

46.07

2.00

2,286.27

58.15

7,229.57

222.02

6,912.88

225.54

313.35

11.29

7,226.24

221.88

29.2.5000

sorting

50,000,000.00

101,177.22

48.88

1,111.74

44.13

41.14

2.74

1,152.88

46.54

4,832.65

278.15

4,497.03

255.22

332.83

23.60

4,829.86

277.74

29.2.5000

hashing

50,000,000.00

101,916.40

100.31

1,855.27

98.27

39.98

1.81

1,895.24

99.09

7,745.91

502.38

7,394.19

491.80

347.94

16.54

7,742.13

502.01

29.2.5000

sorting

75,000,000.00

10716.68

6.70

663.46

5.70

32.35

0.93

695.81

6.26

4,041.78

27.13

3,723.56

24.97

316.31

5.45

4,039.86

27.11

29.2.5000

hashing

75,000,000.00

10757.34

29.23

717.18

27.94

24.62

1.70

741.81

28.35

5,276.43

286.25

4,966.83

282.80

306.98

9.53

5,273.81

286.05

29.4.1000

sorting

25,000,000.00

101,501.52

134.74

1,430.41

129.96

49.30

5.85

1,479.71

132.57

4,535.69

443.18

4,207.80

420.72

325.06

33.70

4,532.86

442.32

29.4.1000

hashing

25,000,000.00

102,108.48

182.30

2,040.51

183.38

49.47

6.10

2,089.98

184.01

6,422.82

627.67

6,101.08

624.70

318.81

34.29

6,419.89

627.37

29.4.1000

sorting

50,000,000.00

101,036.06

46.39

981.53

44.39

36.40

0.98

1,017.93

44.94

3,957.21

232.82

3,678.23

224.50

276.92

11.36

3,955.15

232.28

29.4.1000

hashing

50,000,000.00

101,589.83

58.21

1,534.77

57.51

39.28

4.19

1,574.05

58.06

6,250.43

315.16

5,915.16

318.21

332.34

29.74

6,247.49

315.00

29.4.1000

sorting

75,000,000.00

10690.02

23.66

639.13

22.65

34.67

3.11

673.79

23.21

3,664.15

159.63

3,327.24

160.34

335.20

32.47

3,662.44

159.52

29.4.1000

hashing

75,000,000.00

10803.03

34.10

757.77

33.44

27.73

3.74

785.50

33.45

5,429.11

357.40

5,106.65

357.77

319.96

27.66

5,426.60

357.24

29.4.10000

sorting

25,000,000.00

101,749.57

93.74

1,673.77

94.85

44.61

3.42

1,718.38

93.76

11,280.08

443.34

10,868.04

433.14

399.19

26.64

11,267.24

441.79

29.4.10000

hashing

25,000,000.00

102,331.92

199.86

2,258.66

196.80

46.16

3.43

2,304.82

199.76

11,788.56

763.25

11,440.42

755.23

342.79

10.71

11,783.21

762.84

29.4.10000

sorting

50,000,000.00

101,317.38

302.47

1,241.94

272.34

36.75

12.50

1,278.69

284.59

9,945.95

1,792.84

9,586.25

1,731.66

353.45

58.86

9,939.70

1,789.67

29.4.10000

hashing

50,000,000.00

101,805.11

374.97

1,731.57

351.69

39.75

12.34

1,771.33

362.94

11,767.15

1,874.48

11,356.65

1,839.12

404.16

49.06

11,760.81

1,872.70

29.4.10000

sorting

75,000,000.00

10755.69

30.12

701.62

25.26

25.13

1.25

726.75

26.14

8,583.45

261.28

8,225.48

249.06

351.30

11.65

8,576.77

257.01

29.4.10000

hashing

75,000,000.00

10801.62

46.89

757.68

46.45

20.15

1.05

777.84

46.53

10,112.17

473.00

9,763.67

465.12

343.87

26.32

10,107.54

472.72

29.4.20000

sorting

25,000,000.00

101,564.14

220.61

1,480.87

206.61

37.09

5.40

1,517.96

210.94

17,542.41

1,304.73

17,162.48

1,268.34

371.68

42.80

17,534.15

1,303.72

29.4.20000

hashing

25,000,000.00

102,374.97

122.66

2,284.52

114.53

42.43

4.21

2,326.95

117.56

20,739.87

756.46

20,340.27

725.49

388.76

33.86

20,729.03

754.27

29.4.20000

sorting

50,000,000.00

101,092.18

61.86

1,029.47

54.78

25.02

1.57

1,054.49

56.10

16,768.25

686.32

16,420.74

668.26

339.76

20.56

16,760.50

685.45

29.4.20000

hashing

50,000,000.00

101,608.61

171.69

1,538.29

166.16

30.20

2.30

1,568.49

167.53

19,034.98

1,131.87

18,638.14

1,105.03

387.61

44.20

19,025.75

1,131.34

29.4.20000

sorting

75,000,000.00

10747.59

20.10

685.35

21.49

21.40

1.77

706.75

21.65

16,881.61

293.95

16,463.83

275.90

409.77

36.98

16,873.60

293.53

29.4.20000

hashing

75,000,000.00

10877.48

118.69

811.57

106.32

17.95

2.47

829.52

108.27

19,101.55

1,669.61

18,725.33

1,631.23

367.51

40.84

19,092.84

1,668.59

Table8:

Tim

ingresultsin

second

sforthefirst

sevenda

tasets

from

thesecond

setof

runs.Tim

ingresultsfordelayeddu

plicate

detection(ddd

)repo

rtthetimeused

forde

layeddu

plicatedetectionwhile

thetimingresultsforthewho

lerunrepo

rttimeused

runn

ingthealgo

rithm

altogether.

filename    codebase  maxnodes  n  | DDD wall / user / system / cpu, each mean (sd)                       | whole run wall / user / system / cpu, each mean (sd)
29.4.5000   sorting   25M  10 | 1,664.85 (82.88) / 1,591.53 (84.04) / 47.00 (3.53) / 1,638.53 (81.04) | 7,239.70 (166.60) / 6,858.45 (185.97) / 374.22 (26.60) / 7,232.68 (167.23)
29.4.5000   hashing   25M  10 | 2,479.17 (518.38) / 2,384.30 (466.14) / 59.76 (24.47) / 2,444.06 (489.46) | 9,669.11 (2,065.55) / 9,268.26 (1,972.41) / 391.70 (85.79) / 9,659.96 (2,055.32)
29.4.5000   sorting   50M  10 | 1,145.29 (176.06) / 1,084.27 (164.14) / 34.35 (6.56) / 1,118.62 (169.82) | 6,114.37 (980.86) / 5,808.40 (948.98) / 302.91 (35.04) / 6,111.31 (980.44)
29.4.5000   hashing   50M  10 | 1,674.49 (133.30) / 1,616.82 (135.50) / 35.74 (5.35) / 1,652.57 (132.89) | 8,165.69 (669.02) / 7,810.86 (681.40) / 350.83 (39.19) / 8,161.69 (668.79)
29.4.5000   sorting   75M  10 | 779.19 (140.75) / 721.88 (123.52) / 29.89 (6.78) / 751.77 (129.90) | 6,069.98 (1,138.98) / 5,734.16 (1,091.56) / 332.59 (47.25) / 6,066.74 (1,137.87)
29.4.5000   hashing   75M  10 | 756.40 (51.49) / 712.28 (49.69) / 24.75 (1.86) / 737.03 (50.85) | 6,646.86 (343.47) / 6,291.90 (329.57) / 351.88 (34.95) / 6,643.77 (343.26)
29.6.1000   sorting   25M  10 | 1,327.94 (11.15) / 1,262.49 (8.71) / 46.46 (1.53) / 1,308.95 (9.98) | 3,915.21 (26.07) / 3,615.22 (13.68) / 298.21 (13.57) / 3,913.43 (26.05)
29.6.1000   hashing   25M  10 | 2,218.24 (214.09) / 2,149.23 (207.91) / 50.20 (4.34) / 2,199.43 (211.88) | 6,682.55 (693.94) / 6,354.15 (677.46) / 325.03 (17.27) / 6,679.19 (693.21)
29.6.1000   sorting   50M  10 | 1,061.93 (79.17) / 1,008.91 (78.03) / 33.88 (1.70) / 1,042.79 (77.25) | 4,000.64 (338.62) / 3,745.93 (343.32) / 252.70 (13.23) / 3,998.64 (338.24)
29.6.1000   hashing   50M  10 | 1,683.59 (100.56) / 1,631.32 (100.70) / 36.59 (1.16) / 1,667.90 (101.08) | 6,432.78 (452.94) / 6,120.55 (452.70) / 309.20 (13.18) / 6,429.75 (452.67)
29.6.1000   sorting   75M  10 | 680.58 (10.40) / 628.49 (10.04) / 36.05 (2.69) / 664.54 (10.58) | 3,555.43 (56.34) / 3,210.37 (61.73) / 343.44 (22.88) / 3,553.81 (56.32)
29.6.1000   hashing   75M  10 | 775.61 (41.94) / 730.20 (38.72) / 29.60 (5.29) / 759.80 (40.46) | 5,102.86 (357.29) / 4,764.09 (333.21) / 336.40 (46.14) / 5,100.49 (357.11)
29.6.10000  sorting   25M  10 | 1,593.12 (182.05) / 1,513.66 (161.17) / 42.60 (11.75) / 1,556.26 (171.29) | 10,891.19 (1,002.17) / 10,530.61 (966.60) / 354.14 (37.89) / 10,884.75 (999.61)
29.6.10000  hashing   25M  10 | 2,249.16 (136.89) / 2,177.35 (133.99) / 40.76 (1.67) / 2,218.11 (134.98) | 13,157.02 (733.33) / 12,807.63 (724.56) / 343.26 (11.15) / 13,150.90 (732.58)
29.6.10000  sorting   50M  10 | 1,147.85 (55.69) / 1,085.44 (47.56) / 30.54 (2.57) / 1,115.99 (50.02) | 10,522.74 (519.99) / 10,173.94 (493.22) / 343.07 (26.04) / 10,517.01 (519.10)
29.6.10000  hashing   50M  10 | 1,636.99 (125.61) / 1,575.29 (124.84) / 32.70 (2.64) / 1,607.99 (125.47) | 12,445.88 (775.81) / 12,062.54 (771.50) / 377.14 (42.03) / 12,439.68 (775.27)
29.6.10000  sorting   75M  10 | 750.07 (10.30) / 696.01 (10.46) / 24.86 (1.02) / 720.87 (10.50) | 10,001.82 (65.69) / 9,629.19 (65.46) / 367.91 (11.41) / 9,997.09 (65.63)
29.6.10000  hashing   75M  10 | 829.70 (95.90) / 775.34 (87.93) / 19.20 (1.38) / 794.54 (88.99) | 12,211.17 (1,278.41) / 11,851.44 (1,256.25) / 352.83 (22.28) / 12,204.27 (1,276.08)
29.6.5000   sorting   25M  10 | 1,441.19 (80.55) / 1,372.38 (78.06) / 43.95 (3.51) / 1,416.33 (80.69) | 6,171.89 (323.57) / 5,847.19 (310.20) / 321.41 (14.98) / 6,168.60 (322.72)
29.6.5000   hashing   25M  10 | 2,575.74 (609.75) / 2,484.35 (571.15) / 57.02 (19.62) / 2,541.38 (590.59) | 10,149.17 (2,378.33) / 9,766.37 (2,305.16) / 375.68 (68.31) / 10,142.05 (2,372.54)
29.6.5000   sorting   50M  10 | 1,160.66 (193.21) / 1,099.97 (175.09) / 35.61 (10.03) / 1,135.58 (184.92) | 6,323.86 (903.43) / 6,012.93 (871.28) / 307.65 (32.71) / 6,320.57 (901.95)
29.6.5000   hashing   50M  10 | 1,675.28 (113.81) / 1,614.64 (113.74) / 37.56 (3.21) / 1,652.20 (112.13) | 8,482.05 (593.83) / 8,107.14 (606.50) / 369.76 (34.24) / 8,476.89 (592.55)
29.6.5000   sorting   75M  10 | 733.29 (7.80) / 682.08 (8.06) / 28.23 (0.84) / 710.31 (8.05) | 5,779.68 (33.04) / 5,444.16 (34.99) / 332.77 (7.81) / 5,776.93 (33.01)
29.6.5000   hashing   75M  10 | 803.85 (93.03) / 758.90 (84.47) / 22.94 (3.10) / 781.84 (87.19) | 7,312.16 (824.16) / 6,977.22 (789.60) / 331.55 (35.90) / 7,308.77 (823.67)
29.8.1000   sorting   25M  10 | 1,697.45 (57.86) / 1,610.98 (58.24) / 62.91 (2.83) / 1,673.89 (56.93) | 4,847.02 (190.53) / 4,444.41 (188.98) / 393.92 (9.47) / 4,838.33 (190.13)
29.8.1000   hashing   25M  10 | 2,312.20 (264.35) / 2,240.90 (246.54) / 51.28 (10.76) / 2,292.18 (256.86) | 6,744.57 (954.41) / 6,426.79 (914.64) / 314.58 (43.76) / 6,741.37 (953.67)
29.8.1000   sorting   50M  10 | 1,118.57 (16.80) / 1,060.09 (16.00) / 38.23 (1.34) / 1,098.31 (16.32) | 3,950.67 (51.80) / 3,673.24 (51.45) / 275.57 (6.82) / 3,948.81 (51.78)
29.8.1000   hashing   50M  10 | 1,625.51 (288.94) / 1,560.20 (275.83) / 45.34 (7.46) / 1,605.54 (281.52) | 5,774.49 (1,217.14) / 5,413.89 (1,171.70) / 356.73 (50.41) / 5,770.61 (1,214.26)
29.8.1000   sorting   75M  10 | 715.75 (8.28) / 660.98 (9.35) / 35.53 (1.42) / 696.50 (9.46) | 3,499.39 (29.26) / 3,186.18 (31.18) / 311.53 (7.43) / 3,497.71 (29.25)
29.8.1000   hashing   75M  10 | 818.34 (138.39) / 768.86 (121.40) / 29.56 (8.72) / 798.42 (128.80) | 5,120.43 (942.97) / 4,787.40 (906.32) / 330.07 (43.28) / 5,117.47 (941.42)
29.8.10000  sorting   25M  10 | 1,544.44 (120.84) / 1,477.52 (118.68) / 42.78 (3.11) / 1,520.30 (119.78) | 8,466.56 (669.39) / 8,100.73 (644.54) / 358.09 (30.24) / 8,458.83 (666.96)
29.8.10000  hashing   25M  10 | 2,508.82 (216.44) / 2,423.42 (200.64) / 53.33 (9.16) / 2,476.75 (208.38) | 11,499.69 (1,067.00) / 11,112.19 (1,029.58) / 378.82 (39.10) / 11,491.01 (1,064.39)
29.8.10000  sorting   50M  10 | 1,191.33 (148.52) / 1,126.74 (133.71) / 32.87 (6.18) / 1,159.61 (139.58) | 7,693.06 (882.60) / 7,391.24 (849.04) / 296.26 (33.80) / 7,687.51 (879.12)
29.8.10000  hashing   50M  10 | 1,816.99 (334.58) / 1,744.97 (310.18) / 37.35 (9.80) / 1,782.32 (319.47) | 10,616.97 (1,931.94) / 10,232.95 (1,857.25) / 377.06 (70.82) / 10,610.00 (1,927.38)
29.8.10000  sorting   75M  10 | 730.97 (31.90) / 682.58 (31.21) / 26.25 (1.97) / 708.82 (31.47) | 7,072.39 (357.50) / 6,726.12 (364.34) / 342.81 (16.36) / 7,068.93 (357.09)
29.8.10000  hashing   75M  10 | 864.61 (83.73) / 811.25 (77.45) / 24.84 (1.58) / 836.09 (78.31) | 9,262.80 (788.50) / 8,869.58 (774.90) / 388.73 (19.36) / 9,258.31 (788.01)
29.8.5000   sorting   25M  10 | 1,538.76 (168.48) / 1,472.63 (163.73) / 43.85 (2.18) / 1,516.48 (165.20) | 5,587.27 (661.17) / 5,269.05 (648.55) / 315.39 (13.54) / 5,584.44 (660.57)
29.8.5000   hashing   25M  10 | 2,190.11 (138.65) / 2,121.47 (139.76) / 46.88 (3.01) / 2,168.36 (138.52) | 7,674.57 (580.72) / 7,317.79 (574.09) / 353.30 (34.70) / 7,671.09 (580.47)
29.8.5000   sorting   50M  10 | 1,087.77 (37.59) / 1,036.10 (36.06) / 30.97 (1.57) / 1,067.08 (36.23) | 4,981.51 (164.09) / 4,721.82 (165.50) / 257.45 (12.01) / 4,979.27 (164.01)
29.8.5000   hashing   50M  10 | 1,516.79 (156.06) / 1,461.22 (146.88) / 35.17 (3.96) / 1,496.39 (150.78) | 6,750.31 (835.40) / 6,415.50 (810.41) / 330.87 (23.76) / 6,746.37 (833.75)
29.8.5000   sorting   75M  10 | 676.77 (41.75) / 629.79 (38.15) / 28.11 (1.79) / 657.90 (39.85) | 4,450.70 (333.37) / 4,153.17 (322.31) / 295.48 (17.84) / 4,448.65 (333.15)
29.8.5000   hashing   75M  10 | 782.94 (75.38) / 744.60 (76.39) / 21.93 (1.28) / 766.53 (75.51) | 6,279.63 (691.54) / 5,970.70 (685.58) / 306.10 (13.89) / 6,276.80 (691.24)

Table 9: Timing results in seconds for the last seven datasets from the second set of runs. Timing results for delayed duplicate detection (DDD) report the time used for delayed duplicate detection, while the timing results for the whole run report the time used running the algorithm altogether.


Figure 7: Kernel density estimates and data points for the second set of results. Panel (a) shows the runtime of the delayed duplicate detection conditioned on the codebase and the number of data points in the dataset. In panel (b) the runtime of the delayed duplicate detection is conditioned on the codebase and the maximum number of parents.


Figure 8: Kernel density estimates and data points for the second set of results. Panel (a) shows the runtime of the delayed duplicate detection conditioned on the codebase and the maximum number of nodes kept in memory. In panel (b) kernel density estimates and data points are plotted for the runtime of the whole algorithm conditioned on the codebase and the number of data points in the dataset.


Figure 9: Kernel density estimates and data points for the second set of results. Panel (a) shows the runtime of the whole algorithm conditioned on the codebase and the maximum number of parents. In panel (b) the runtime of the algorithm is conditioned on the codebase and the maximum number of nodes kept in memory.


The results from the Mann-Whitney-Wilcoxon tests for the second set of runs, for the time used for delayed duplicate detection and for the whole run, are given in Table 10. Also this time the results show that in most cases the null hypothesis, which states that the hashing-based and sorting-based populations are the same, can be rejected at the 1 percent significance level for both timing results. In other words, for most of the input files the times used for delayed duplicate detection and for the whole run differ statistically significantly between the two methods: the distributions generating the timing results for the two algorithms are in most cases not the same.
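For reference, such a comparison can be computed along the following lines with SciPy; the two arrays are illustrative placeholders for the ten measurements per codebase on one input file, not the thesis's data.

    # Two-sided Mann-Whitney U test comparing the hashing and sorting runtimes
    # measured for one input file.
    from scipy.stats import mannwhitneyu

    hashing_times = [2805.0, 2760.4, 2911.2, 2688.9, 2850.1,
                     2790.6, 2875.3, 2725.8, 2832.0, 2769.4]
    sorting_times = [1690.3, 1655.2, 1720.8, 1605.9, 1701.4,
                     1662.0, 1688.7, 1630.1, 1676.5, 1645.9]

    stat, p_value = mannwhitneyu(hashing_times, sorting_times,
                                 alternative="two-sided")
    print(f"U = {stat}, p = {p_value:.3f}")  # reject at the 1 percent level if p < 0.01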


maxnodes  maxparents  data points  n   DDD p-value  run p-value
25M       2           1000         20  0.000*       0.000*
50M       2           1000         20  0.005*       0.009*
75M       2           1000         20  0.000*       0.000*
25M       2           10000        20  0.000*       0.000*
50M       2           10000        20  0.000*       0.000*
75M       2           10000        20  0.000*       0.000*
25M       2           20000        20  0.000*       0.000*
50M       2           20000        20  0.000*       0.000*
75M       2           20000        20  0.393        0.015
25M       2           5000         20  0.000*       0.000*
50M       2           5000         20  0.000*       0.000*
75M       2           5000         20  0.000*       0.000*
25M       4           1000         20  0.000*       0.000*
50M       4           1000         20  0.000*       0.000*
75M       4           1000         20  0.000*       0.000*
25M       4           10000        20  0.000*       0.075
50M       4           10000        20  0.005*       0.015
75M       4           10000        20  0.019        0.000*
25M       4           20000        20  0.000*       0.000*
50M       4           20000        20  0.000*       0.000*
75M       4           20000        20  0.000*       0.000*
25M       4           5000         20  0.000*       0.000*
50M       4           5000         20  0.000*       0.001*
75M       4           5000         20  0.393        0.019
25M       6           1000         20  0.000*       0.000*
50M       6           1000         20  0.000*       0.000*
75M       6           1000         20  0.000*       0.000*
25M       6           10000        20  0.000*       0.000*
50M       6           10000        20  0.000*       0.000*
75M       6           10000        20  0.015        0.000*
25M       6           5000         20  0.000*       0.000*
50M       6           5000         20  0.000*       0.001*
75M       6           5000         20  0.075        0.000*
25M       8           1000         20  0.000*       0.000*
50M       8           1000         20  0.000*       0.000*
75M       8           1000         20  0.000*       0.000*
25M       8           10000        20  0.000*       0.000*
50M       8           10000        20  0.000*       0.000*
75M       8           10000        20  0.004*       0.000*
25M       8           5000         20  0.000*       0.000*
50M       8           5000         20  0.000*       0.000*
75M       8           5000         20  0.000*       0.000*

Table 10: Results from Mann-Whitney-Wilcoxon tests on the second set of results for the time used for delayed duplicate detection (DDD) and for the time used for the whole run. The tests were executed for each individual file used as an input for the two algorithms. P-values significant at the 1 percent level have been marked with an asterisk (*).

5.6.3 Locality

Table 11 shows the number of order graph nodes expanded and written to disk during the runtime of the algorithms for 29 variables. It shows the same result as the tables with 20 variables: the algorithm using hashing in delayed duplicate detection writes a clearly larger number of order graph nodes to disk.
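A useful sanity check on Table 11 below: with 29 variables, layer l of the order graph contains C(29, l) nodes, one per subset of size l, so the expanded counts should follow the binomial coefficients and sum over layers 0 to 28 to 2^29 - 1 = 536,870,911, which is exactly the value on the "All" row. A short sketch of the check:

    # Sanity check for Table 11: layer l of the order graph over 29 variables
    # holds C(29, l) nodes. Layers 0..28 are expanded during the search.
    from math import comb

    expanded = [comb(29, layer) for layer in range(29)]
    assert expanded[2] == 406                          # matches layer 2 in Table 11
    assert expanded[14] == 77_558_760                  # the widest layer
    assert sum(expanded) == 2**29 - 1 == 536_870_911   # the 'All' row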


codebase  layer | 25M: expanded / written | 50M: expanded / written | 75M: expanded / written

hashing   0   | 1 / 29 | 1 / 29 | 1 / 29
hashing   1   | 29 / 406 | 29 / 406 | 29 / 406
hashing   2   | 406 / 3,654 | 406 / 3,654 | 406 / 3,654
hashing   3   | 3,654 / 23,751 | 3,654 / 23,751 | 3,654 / 23,751
hashing   4   | 23,751 / 118,755 | 23,751 / 118,755 | 23,751 / 118,755
hashing   5   | 118,755 / 475,020 | 118,755 / 475,020 | 118,755 / 475,020
hashing   6   | 475,020 / 1,560,780 | 475,020 / 1,560,780 | 475,020 / 1,560,780
hashing   7   | 1,560,780 / 4,292,145 | 1,560,780 / 4,292,145 | 1,560,780 / 4,292,145
hashing   8   | 4,292,145 / 10,015,005 | 4,292,145 / 10,015,005 | 4,292,145 / 10,015,005
hashing   9   | 10,015,005 / 20,030,010 | 10,015,005 / 20,030,010 | 10,015,005 / 20,030,010
hashing   10  | 20,030,010 / 63,099,947 | 20,030,010 / 34,597,290 | 20,030,010 / 34,597,290
hashing   11  | 34,597,290 / 121,474,189 | 34,597,290 / 84,186,959 | 34,597,290 / 51,895,935
hashing   12  | 51,895,935 / 192,163,998 | 51,895,935 / 130,416,997 | 51,895,935 / 67,863,915
hashing   13  | 67,863,915 / 221,675,261 | 67,863,915 / 187,434,159 | 67,863,915 / 129,551,954
hashing   14  | 77,558,760 / 248,842,458 | 77,558,760 / 190,105,665 | 77,558,760 / 131,210,950
hashing   15  | 77,558,760 / 199,639,480 | 77,558,760 / 137,433,140 | 77,558,760 / 67,863,915
hashing   16  | 67,863,915 / 137,836,117 | 67,863,915 / 90,726,381 | 67,863,915 / 51,895,935
hashing   17  | 51,895,935 / 71,474,182 | 51,895,935 / 34,597,290 | 51,895,935 / 34,597,290
hashing   18  | 34,597,290 / 20,030,010 | 34,597,290 / 20,030,010 | 34,597,290 / 20,030,010
hashing   19  | 20,030,010 / 10,015,005 | 20,030,010 / 10,015,005 | 20,030,010 / 10,015,005
hashing   20  | 10,015,005 / 4,292,145 | 10,015,005 / 4,292,145 | 10,015,005 / 4,292,145
hashing   21  | 4,292,145 / 1,560,780 | 4,292,145 / 1,560,780 | 4,292,145 / 1,560,780
hashing   22  | 1,560,780 / 475,020 | 1,560,780 / 475,020 | 1,560,780 / 475,020
hashing   23  | 475,020 / 118,755 | 475,020 / 118,755 | 475,020 / 118,755
hashing   24  | 118,755 / 23,751 | 118,755 / 23,751 | 118,755 / 23,751
hashing   25  | 23,751 / 3,654 | 23,751 / 3,654 | 23,751 / 3,654
hashing   26  | 3,654 / 406 | 3,654 / 406 | 3,654 / 406
hashing   27  | 406 / 29 | 406 / 29 | 406 / 29
hashing   28  | 29 / 1 | 29 / 1 | 29 / 1
hashing   All | 536,870,911 / 1,329,244,743 | 536,870,911 / 962,536,992 | 536,870,911 / 642,516,295

sorting   0   | 1 / 29 | 1 / 29 | 1 / 29
sorting   1   | 29 / 406 | 29 / 406 | 29 / 406
sorting   2   | 406 / 3,654 | 406 / 3,654 | 406 / 3,654
sorting   3   | 3,654 / 23,751 | 3,654 / 23,751 | 3,654 / 23,751
sorting   4   | 23,751 / 118,755 | 23,751 / 118,755 | 23,751 / 118,755
sorting   5   | 118,755 / 475,020 | 118,755 / 475,020 | 118,755 / 475,020
sorting   6   | 475,020 / 1,560,780 | 475,020 / 1,560,780 | 475,020 / 1,560,780
sorting   7   | 1,560,780 / 4,292,145 | 1,560,780 / 4,292,145 | 1,560,780 / 4,292,145
sorting   8   | 4,292,145 / 10,015,005 | 4,292,145 / 10,015,005 | 4,292,145 / 10,015,005
sorting   9   | 10,015,005 / 20,030,010 | 10,015,005 / 20,030,010 | 10,015,005 / 20,030,010
sorting   10  | 20,030,010 / 61,421,587 | 20,030,010 / 34,597,290 | 20,030,010 / 34,597,290
sorting   11  | 34,597,290 / 96,453,404 | 34,597,290 / 74,996,588 | 34,597,290 / 51,895,935
sorting   12  | 51,895,935 / 163,037,873 | 51,895,935 / 125,619,058 | 51,895,935 / 67,863,915
sorting   13  | 67,863,915 / 197,249,101 | 67,863,915 / 137,311,880 | 67,863,915 / 116,137,578
sorting   14  | 77,558,760 / 198,139,736 | 77,558,760 / 139,902,903 | 77,558,760 / 118,401,950
sorting   15  | 77,558,760 / 190,680,156 | 77,558,760 / 132,848,015 | 77,558,760 / 67,863,915
sorting   16  | 67,863,915 / 121,986,981 | 67,863,915 / 82,302,851 | 67,863,915 / 51,895,935
sorting   17  | 51,895,935 / 69,982,011 | 51,895,935 / 34,597,290 | 51,895,935 / 34,597,290
sorting   18  | 34,597,290 / 20,030,010 | 34,597,290 / 20,030,010 | 34,597,290 / 20,030,010
sorting   19  | 20,030,010 / 10,015,005 | 20,030,010 / 10,015,005 | 20,030,010 / 10,015,005
sorting   20  | 10,015,005 / 4,292,145 | 10,015,005 / 4,292,145 | 10,015,005 / 4,292,145
sorting   21  | 4,292,145 / 1,560,780 | 4,292,145 / 1,560,780 | 4,292,145 / 1,560,780
sorting   22  | 1,560,780 / 475,020 | 1,560,780 / 475,020 | 1,560,780 / 475,020
sorting   23  | 475,020 / 118,755 | 475,020 / 118,755 | 475,020 / 118,755
sorting   24  | 118,755 / 23,751 | 118,755 / 23,751 | 118,755 / 23,751
sorting   25  | 23,751 / 3,654 | 23,751 / 3,654 | 23,751 / 3,654
sorting   26  | 3,654 / 406 | 3,654 / 406 | 3,654 / 406
sorting   27  | 406 / 29 | 406 / 29 | 406 / 29
sorting   28  | 29 / 1 | 29 / 1 | 29 / 1
sorting   All | 536,870,911 / 1,171,989,960 | 536,870,911 / 835,214,986 | 536,870,911 / 616,292,919

Table 11: Order graph nodes expanded and written to disk for the maximum number of nodes taking values 25M, 50M and 75M for the hashing and sorting codebases.


6 Conclusions

This thesis has presented a way to use hashing-based delayed duplicate detection in structure learning of Bayesian networks. The method developed divides the candidate order graph nodes into files on the disk by using the first k variables of the candidate node. This way a single layer of the order graph can be divided so that a single file on the disk contains at most a given number of unique order graph nodes.
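A minimal sketch of this first-k division is given below. It assumes a candidate node is represented as a sorted tuple of variable indices and that max_per_file is the in-memory node limit; the helper names and the rule for choosing k are one plausible reading of the idea, not the thesis implementation itself.

    # Sketch of the first-k division: a candidate order graph node, stored as
    # a sorted tuple of variable indices, is routed to the disk file named by
    # its first k variables.
    from math import comb

    def smallest_k(n_vars: int, layer: int, max_per_file: int) -> int:
        # The worst-case bucket for a k-variable prefix is C(n_vars - k, layer - k):
        # the prefix (0, 1, ..., k-1) leaves the most room for the remaining
        # layer - k variables. Grow k until even that bucket fits in memory.
        k = 0
        while comb(n_vars - k, layer - k) > max_per_file:
            k += 1
        return k

    def file_key(node: tuple, k: int) -> tuple:
        # All nodes sharing the same first k variables go to the same file.
        return node[:k]

    # Layer 14 over 29 variables has C(29, 14) = 77,558,760 nodes, more than
    # a 25M in-memory limit, so a non-trivial prefix length is needed.
    k = smallest_k(29, 14, max_per_file=25_000_000)             # k == 2 here
    node = (0, 2, 3, 5, 8, 9, 11, 13, 16, 18, 21, 24, 26, 28)  # one layer-14 node
    print(k, file_key(node, k))                                 # 2 (0, 2)

With k = 2 in this example, buckets keyed by small leading variables are much fuller than those keyed by large ones, which is exactly the uneven distribution discussed below.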

The hashing-based delayed duplicate detection has proven to be a more efficient method than sorting-based methods in other application areas, but in structure learning of Bayesian networks the results are mixed. The results presented in this thesis show that the order in which the candidate solutions are read from the earlier layer during the search through the order graph affects the number of duplicates detected in memory. The locality measure presented shows that this order can have a significant impact on the number of candidate solutions written to disk during delayed duplicate detection. Overall, the timing results presented for the different datasets are mixed. The discrepancies between otherwise identical experiments suggest that using an environment shared with other users can have a major effect on the runtime of the algorithm, and hashing-based delayed duplicate detection seems to be affected by this differently than the sorting-based delayed duplicate detection algorithm.

There are several areas for future work. A different division of combinations into files on the disk per first-k subcombinations could improve the efficiency of hashing the candidate solutions to disk; the current method unnecessarily increases the number of files because the combinations are unevenly distributed between the created files. A more efficient division of combinations into files could decrease the number of files needed during the search. However, it is unclear whether such a division can be found generically without enumerating the different candidate solutions during the search.

The algorithm presented here could also be parallelized in the same manner as suggested by Korf [Kor08], who parallelized the writing and reading of the candidate solutions to the files on the disk. This could speed up the search by allowing both the candidate generation and the delayed duplicate detection phases to run in parallel on multiple cores. After all, the results clearly suggest that the delayed duplicate detection with hashing makes up only a minor portion of the runtime of the whole algorithm. The algorithm could be built around the idea of a "stop the world" pause between the writing and reading phases. This pause is necessary because the delayed duplicate detection cannot be performed before all of the candidates have been generated. However, a clear problem with parallelization is that the in-memory duplicate detection utilizes a hash table, so parallelization would slow down the in-memory duplicate detection since multiple threads would be accessing the same hash table at the same time. Korf did not face this problem because his solution uses only delayed duplicate detection. Of course, another possibility is to test the speedup that the hybrid duplicate detection gives to the whole algorithm compared to a parallelized algorithm using only delayed duplicate detection.

One more possible avenue of future research would be to investigate how the order of reading the order graph nodes from the previous layer affects the number of nodes written to disk during delayed duplicate detection. This could be an especially interesting research area because right now the candidate order graph nodes are not deliberately generated in a certain order, but instead just created in a brute-force manner. If there were a way of generating the candidate solutions that allowed for efficient in-memory duplicate detection, it would of course make the generation of candidate solutions more efficient than it is right now. Based on the results presented in this thesis, it can be said that the sorting-based algorithm implicitly generates the candidate solutions in an order more suitable for in-memory duplicate detection than the hashing-based algorithm presented in this thesis. This observation has not been made previously, and it is also important in application domains of delayed duplicate detection other than Bayesian network structure learning.


References

Bou94 Bouckaert, R. R., Properties of Bayesian Belief Network Learning Algorithms. Proceedings of the 10th International Conference on Uncertainty in Artificial Intelligence, San Francisco, CA, USA, 1994, Morgan Kaufmann Publishers Inc., pages 102–109.

Chi95 Chickering, D. M., A Transformational Characterization of Equivalent Bayesian Network Structures. Proceedings of the 11th Conference on Uncertainty in Artificial Intelligence, 1995, pages 87–98.

Chi96 Chickering, D. M., Learning Bayesian Networks is NP-Complete. Learning from Data: Artificial Intelligence and Statistics V. Springer-Verlag, 1996, pages 121–130.

dCJ11 de Campos, C. P. and Ji, Q., Efficient Structure Learning of Bayesian Networks Using Constraints. Journal of Machine Learning Research, 12, pages 663–689.

ES11 Edelkamp, S. and Sulewski, D., External Memory Breadth-First Search with Delayed Duplicate Detection on the GPU. In Model Checking and Artificial Intelligence, van der Meyden, R. and Smaus, J.-G., editors, volume 6572 of Lecture Notes in Computer Science, Springer Berlin Heidelberg, 2011, pages 12–31.

Hec96 Heckerman, D., A Tutorial on Learning With Bayesian Networks. Technical Report, Learning in Graphical Models, 1996.

Hen88 Henrion, M., Propagating uncertainty in Bayesian networks by probabilistic logic sampling. Proceedings of the 2nd International Conference on Uncertainty in Artificial Intelligence, 1988, pages 317–324.

HGC95 Heckerman, D., Geiger, D. and Chickering, D. M., Learning Bayesian Networks: The Combination of Knowledge and Statistical Data. 1995, pages 197–243.

IC02 Ide, J. S. and Cozman, F. G., Random Generation of Bayesian Networks. In Brazilian Symposium on Artificial Intelligence. Springer-Verlag, 2002, pages 366–375.

JMS96 Jones, M. C., Marron, J. S. and Sheather, S. J., A brief survey of bandwidth selection for density estimation. Journal of the American Statistical Association, 91,433(1996), pages 401–407.

JSGM10 Jaakkola, T., Sontag, D., Globerson, A. and Meila, M., Learning Bayesian Network Structure using LP Relaxations. Journal of Machine Learning Research - Proceedings Track, 9, pages 358–365.

Kor04 Korf, R. E., Best-first Frontier Search with Delayed Duplicate Detection. Proceedings of the 19th National Conference on Artificial Intelligence, 2004, pages 650–657.

Kor08 Korf, R. E., Linear-time disk-based implicit graph search. Journal of the ACM, 55,6(2008), pages 26:1–26:40.

KS04 Koivisto, M. and Sood, K., Exact Bayesian Structure Discovery in Bayesian Networks. Journal of Machine Learning Research, 5, pages 549–573.

LB94 Lam, W. and Bacchus, F., Learning Bayesian belief networks: an approach based on the MDL principle. Computational Intelligence, 10,3(1994), pages 269–293.

ML98 Moore, A. and Lee, M. S., Cached Sufficient Statistics for Efficient Machine Learning with Large Datasets. Journal of Artificial Intelligence Research, 8,1(1998), pages 67–91.

MW47 Mann, H. B. and Whitney, D. R., On a Test of Whether one of Two Random Variables is Stochastically Larger than the Other. The Annals of Mathematical Statistics, 18,1(1947), pages 50–60.

MY13 Malone, B. M. and Yuan, C., Evaluating Anytime Algorithms for Learning Optimal Bayesian Networks. Proceedings of the 29th Conference on Uncertainty in Artificial Intelligence, 2013.

MYHB11 Malone, B. M., Yuan, C., Hansen, E. A. and Bridges, S., Improving the Scalability of Optimal Bayesian Network Learning with External-Memory Frontier Breadth-First Branch and Bound Search. Proceedings of the 27th Conference on Uncertainty in Artificial Intelligence, 2011, pages 479–488.

Pea88 Pearl, J., Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference. Morgan Kaufmann Publishers Inc., San Francisco, CA, USA, 1988.

Pea00 Pearl, J., Causality: Models, Reasoning, and Inference. Cambridge University Press, New York, NY, USA, 2000.

Ros94 Roscoe, A. W., A Classical Mind. Prentice Hall International (UK) Ltd., Hertfordshire, UK, 1994, chapter Model-checking CSP, pages 353–378.

SM06 Silander, T. and Myllymaki, P., A simple approach for finding the globally optimal Bayesian network structure. Proceedings of the 22nd Annual Conference on Uncertainty in Artificial Intelligence, 2006.

Tey05 Teyssier, M., Ordering-based search: A simple and effective algorithm for learning Bayesian networks. Proceedings of the 21st Conference on Uncertainty in Artificial Intelligence, 2005, pages 584–590.

Tia00 Tian, J., A Branch-and-bound Algorithm for MDL Learning Bayesian Networks. Proceedings of the 16th Conference on Uncertainty in Artificial Intelligence, 2000, pages 580–588.

TIM11 Tamada, Y., Imoto, S. and Miyano, S., Parallel Algorithm for Learning Optimal Bayesian Network Structure. Journal of Machine Learning Research, 12, pages 2437–2459.

TK76 Tarter, M. E. and Kronmal, R. A., An Introduction to the Implementation and Theory of Nonparametric Density Estimation. The American Statistician, 30,3(1976), pages 105–112.

Wil45 Wilcoxon, F., Individual Comparisons by Ranking Methods. Biometrics Bulletin, 1,6(1945), pages 80–83.

YM13 Yuan, C. and Malone, B., Learning Optimal Bayesian Networks: A Shortest Path Perspective. Journal of Artificial Intelligence Research, 48, pages 23–65.

ZH04 Zhou, R. and Hansen, E. A., Structured duplicate detection in external-memory graph search. In Proceedings of the 19th National Conference on Artificial Intelligence, 2004.