
A Clustering-Based Framework for Incrementally Repairing Entity Resolution

Qing Wang, Jingyi Gao, and Peter Christen

Research School of Computer Science, The Australian National University, Canberra, ACT 0200, Australia

{qing.wang,jingyi.gao,peter.christen}@anu.edu.au

Abstract. Although entity resolution (ER) is known to be an important problem that has wide-spread applications in many areas, including e-commerce, health-care, social science, and crime and fraud detection, one aspect that has largely been neglected is to monitor the quality of entity resolution and repair erroneous matching decisions over time. In this paper we develop an efficient method for incrementally repairing ER, i.e., fixing detected erroneous matches and non-matches. Our method is based on an efficient clustering algorithm that eliminates inconsistencies among matching decisions, and an efficient provenance indexing data structure that allows us to trace the evidence of clustering for supporting ER repairing. We have evaluated our method over real-world databases, and our experimental results show that the quality of entity resolution can be significantly improved through repairing over time.

Keywords: Data matching · Record linkage · Deduplication · Data provenance · Data repairing · Consistent clustering

1 Introduction

Entity resolution (ER) is the process of deciding which records from one or more databases correspond to the same entities. Typically, two types of matching decisions are involved in the ER process: match (two records refer to the same entity) and non-match (two records refer to two different entities). Based on matching decisions, a clustering algorithm is often used to group records into different entity clusters, each of which consists of a set of records and corresponds to one entity [5]. Due to its importance in practical applications, ER has been extensively studied in the past [5,8]. Nonetheless, the ER process is still far from satisfactory in practice. For example, when data is dirty (i.e., incorrect, missing, or badly formatted), an ER result may contain errors that cannot be detected at the time of performing the ER task. Instead, such errors are often detected later on by users, particularly when users use their entities in various ER applications. If we repair these errors incrementally whenever they are detected, the quality of ER can be improved over time, which would accordingly improve the quality of applications that use ER.

© Springer International Publishing Switzerland 2016. J. Bailey et al. (Eds.): PAKDD 2016, Part II, LNAI 9652, pp. 283–295, 2016. DOI: 10.1007/978-3-319-31750-2_23


However, in many real-life situations, repairing ER is complicated and labour-intensive. For example, we may find that two records r3 and r5 in an entity cluster 〈r1, r2, r3, r4, r5〉 are an erroneous match, and therefore should be split into two different entities. Since there are many possible ways of splitting r3 and r5 into two entity clusters, such as {〈r3〉, 〈r1, r2, r4, r5〉}, {〈r5〉, 〈r1, r2, r3, r4〉}, and {〈r3〉, 〈r5〉, 〈r1, r2, r4〉}, repairing this error would require a manual review of all the records in the entity cluster, their match or non-match decisions, and possible inconsistencies. Such repairing tasks can become even more difficult if some constraints, such as hard matches and hard non-matches (often dictated by domain experts or users), are required to be preserved in the repairing process. For example, suppose that we have another entity cluster 〈r6, r7, r8, r9〉 and know that (r3, r8) is an erroneous non-match, but (r3, r1) is a hard match and (r1, r9) is a hard non-match. In this case, it is not easy to repair the erroneous match (r3, r5) and non-match (r3, r8) while still preserving hard matches and non-matches.

This paper aims to develop an efficient method for incrementally repairing ER. Our observation is that repairing errors in an ER result should take into account how an entity was produced, e.g., the matches used to determine its entity cluster over time and the confidences of these matches. Thus, in order to establish an efficient procedure and also a reasonable level of credibility for performing ER repairing, we propose to build a provenance index that allows us to trace the evidence of matches used in the ER process, and interact with the evidence of non-matches. As illustrated in Fig. 1, such a provenance index can be built during the ER clustering process, and then iteratively maintained in the repairing process. Based on this provenance index, our repairing algorithms can leverage the evidence of matches and non-matches to repair errors accurately and efficiently.

Fig. 1. A general framework for repairing ER based on provenance indexing

Contributions. In summary, our key contributions are as follows:

(1) We develop an algorithm that can resolve inconsistencies among matching decisions to generate entity clusters.


(2) We design a provenance index structure which can keep track of matches that are relevant to each entity and capture the relationships among these matches as they occur in the clustering process.

(3) Based on the provenance index structure, we propose a novel approach to automatically and efficiently repair errors in ER whenever they are detected.

(4) We experimentally validate the feasibility of the proposed provenance index and the benefits of our repairing technique using real-world databases.

2 Related Work

Our work lies in the intersection of two active lines of research: entity resolution and database repair. Below we discuss related work from these two areas.

Entity resolution (ER) has long been central to the studies of data quality and integration, and has many important applications [5,8,15]. Previous research has studied various aspects of the ER process, such as blocking, similarity comparison, classification, evaluation and training data [5,9,14]. Recently, several works incorporated constraints into similarity-based ER techniques to improve the quality of ER [2,10–12]. Nonetheless, these works focused on preventing errors in ER results, rather than repairing ER results that contain errors.

Database repair has been extensively studied in recent years [1,6,16], and is concerned with data consistency and accuracy. Existing approaches mostly deal with data errors either by obtaining consistent answers to queries over an inconsistent database [1] or by developing automated methods to find a repair that satisfies the required constraints and also "minimally" differs from the original database [6]. These approaches are often computationally expensive and not applicable to repairing ER, particularly when ER is used in applications where processing time is critical.

To date, little work has been reported on the entity repairing problem [13]. In this paper we show how to use a provenance index structure to incrementally and automatically repair errors in ER results. We consider not only matches but also non-matches in the ER process. Our work not only generalizes the clustering algorithm in [2] to support efficient clustering in the presence of both matches and non-matches, but also improves on the work in [13] by automating the repairing process. In [13], the authors only considered ER rules that generate matches, and their work on ER repairing requires the manual review of suspicious matches by domain experts.

3 Problem Statement

Let R be a database that contains a set of relations, each having a finite set of records. An entity relation R∗ ∈ R has three attributes, two of which refer to records in R − {R∗} and the third refers to values in [−1, 1]. That is, (r1, r2, a) ∈ R∗ indicates a match (r1, r2) with the confidence value |a| if a > 0, or a non-match (r1, r2) with the confidence value |a| if a < 0. A match or non-match is hard if it has |a| = 1, and soft otherwise.


Accordingly, there exist two kinds of ER rules: match rules and non-match rules. A match rule (resp. non-match rule) is a function that, given R as input, generates a set of matches (resp. non-matches). Any ER algorithm that generates matches (resp. non-matches) may be considered as a match (resp. non-match) rule in our work, including pairwise similarity and machine-learning algorithms [5,7]. Each ER rule is associated with a weight, i.e., for a match rule u, w(u) ∈ (0, 1]; for a non-match rule u, w(u) ∈ [−1, 0). In practice, not every ER rule is equally important. The more important a rule is, the higher its weight value is. Similarly, we call a rule u with |w(u)| = 1 a hard rule. We use u(ri, rj) to indicate that an ER rule u generates a match or non-match (ri, rj).

Given a set U of ER rules, applying U over R generates matches and non-matches in R∗. For each pair (ri, rj) of records, we use U_ij^+ = {u | u(ri, rj) ∧ w(u) > 0} and U_ij^− = {u | u(ri, rj) ∧ w(u) < 0} to denote the sets of match and non-match rules that generate (ri, rj), respectively. With a slight abuse of notation, we will use | · | to denote the absolute value with lower-case letters and the cardinality of a set with upper-case letters. If there exists a hard rule u ∈ U that generates (ri, rj), then we have (ri, rj, w(u)) ∈ R∗ with |w(u)| = 1. Otherwise, we have (ri, rj, a) ∈ R∗ where the confidence value a is determined by:

    |a| = \max\left( \sum_{u \in U_{ij}^{+}} \frac{|w(u)|}{|U_{ij}^{+}|}, \; \sum_{u \in U_{ij}^{-}} \frac{|w(u)|}{|U_{ij}^{-}|} \right)    (1)

We have a > 0 if ∑_{u ∈ U_ij^+} |w(u)| / |U_ij^+| ≥ ∑_{u ∈ U_ij^−} |w(u)| / |U_ij^−|, and a < 0 otherwise.
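As an illustration of Eq. (1), the following Python sketch (the function and variable names are assumed for illustration, not taken from our implementation) computes the signed confidence of a record pair from the weights of the match and non-match rules that generated it, treating hard rules as overriding:

def pair_confidence(match_weights, nonmatch_weights):
    """Confidence value a of a record pair following Eq. (1).

    match_weights    -- weights w(u) in (0, 1] of the match rules in U_ij^+
    nonmatch_weights -- weights w(u) in [-1, 0) of the non-match rules in U_ij^-
    """
    # A hard rule (|w(u)| = 1) forces a hard match or hard non-match.
    if any(w == 1.0 for w in match_weights):
        return 1.0
    if any(w == -1.0 for w in nonmatch_weights):
        return -1.0

    # Average absolute weight of the match rules and of the non-match rules.
    pos = sum(abs(w) for w in match_weights) / len(match_weights) if match_weights else 0.0
    neg = sum(abs(w) for w in nonmatch_weights) / len(nonmatch_weights) if nonmatch_weights else 0.0

    # |a| is the larger average; the sign follows the stronger side, ties favouring a match.
    return pos if pos >= neg else -neg

For example, pair_confidence([0.8, 0.6], [-0.7]) yields 0.7: the match rules average 0.7, the single non-match rule also averages 0.7, and the tie is resolved as a match (a > 0).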

An entity cluster c is a set of records in R. An ER clustering C is a set {c1, . . . , ck} of entity clusters which are pairwise disjoint. We use C(r) to denote the cluster of C to which a record r belongs. A clustering C is valid if, (i) for each hard match (ri, rj, 1) ∈ R∗, C(ri) = C(rj), and (ii) for each hard non-match (ri, rj, −1) ∈ R∗, C(ri) ≠ C(rj).
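As a small example of the validity condition, the following sketch checks whether a clustering satisfies all hard matches and hard non-matches; the data layout (a cluster identifier per record, and hard decisions as signed triples) is assumed for illustration only:

def is_valid_clustering(cluster_of, hard_decisions):
    """Check validity of a clustering w.r.t. hard matches and non-matches.

    cluster_of     -- dict mapping each record to its cluster identifier, i.e. C(r)
    hard_decisions -- iterable of (ri, rj, a) with a = 1 (hard match) or a = -1 (hard non-match)
    """
    for ri, rj, a in hard_decisions:
        same_cluster = cluster_of[ri] == cluster_of[rj]
        if a == 1 and not same_cluster:    # a hard match split across clusters
            return False
        if a == -1 and same_cluster:       # a hard non-match merged into one cluster
            return False
    return True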

Users can provide feedback on erroneous matches or non-matches, and accordingly user feedback may be viewed as hard matches or hard non-matches. In this paper, we address two related problems: (1) entity clustering and (2) entity repairing. The entity clustering problem is, given a set R∗ of matches and non-matches, to find a valid clustering w.r.t. R∗. The entity repairing problem is, given a valid clustering C w.r.t. R∗ and a collection Rf of user feedback, to find a valid clustering C′ w.r.t. R∗ ∪ Rf. We say C′ repairs C w.r.t. Rf.

We assume no conflicts among hard matching decisions because such conflicts have to be manually resolved by a domain expert. Our focus here is to automatically resolve conflicts among soft matches and soft non-matches.

4 Clustering Algorithms

We first present clustering graphs to describe how matches and non-matches are related in a graphical structure. Then we develop an efficient clustering algorithm, which generalizes the work in [2], to cluster records under consistency.


4.1 Clustering Graphs

Conceptually, each clustering graph is a graph with labelled edges. Let L = LP ∪ LN be a set of labels consisting of a subset LP of positive labels and a subset LN of negative labels, with LP ∩ LN = ∅. A clustering graph G = (V, E, λ) over L has a set V of vertices where each vertex represents a record, a set E ⊆ V × V of edges where each edge represents a match or non-match, and a labelling function λ that assigns each edge e ∈ E a label λ(e) ∈ L. For the labels in L, we have LP ⊇ (0, 1] and LN ⊇ [−1, 0), which correspond to the confidence values of matches and non-matches represented by edges, respectively. For convenience, we use E(1) = {e | e ∈ E and λ(e) = 1} and E(−1) = {e | e ∈ E and λ(e) = −1} to refer to the subsets of edges in a clustering graph G which represent hard matches and hard non-matches.

A clustering graph G = (V, E, λ) is consistent if it corresponds to an ER clustering CG over V such that for each edge (ri, rj) ∈ E, if λ(ri, rj) ∈ LP, then ri and rj are in the same entity cluster; otherwise (i.e., λ(ri, rj) ∈ LN), ri and rj are in two different entity clusters. Such an ER clustering CG is valid w.r.t. R∗ if every hard match (resp. non-match) in R∗ is represented by an edge with a positive (resp. negative) label in G.
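Consistency of a clustering graph can be tested directly by grouping records along positively labelled edges and then checking that no negatively labelled edge lies within a single group. The sketch below assumes edges are given as (ri, rj, label) triples with positive labels for matches and negative labels for non-matches; it is an illustration, not the algorithm used in this paper:

class UnionFind:
    """Minimal union-find (disjoint sets) over hashable record identifiers."""
    def __init__(self):
        self.parent = {}

    def find(self, x):
        self.parent.setdefault(x, x)
        while self.parent[x] != x:
            self.parent[x] = self.parent[self.parent[x]]   # path halving
            x = self.parent[x]
        return x

    def union(self, x, y):
        self.parent[self.find(x)] = self.find(y)

def is_consistent(edges):
    """edges: iterable of (ri, rj, label) with label > 0 for matches and label < 0 for non-matches."""
    edges = list(edges)
    uf = UnionFind()
    for ri, rj, label in edges:
        if label > 0:
            uf.union(ri, rj)              # records joined by a match must share a cluster
    return all(uf.find(ri) != uf.find(rj)
               for ri, rj, label in edges if label < 0)   # non-matches must cross clusters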

Given a clustering graph G that represents matches and non-matches in R∗, two questions naturally arise: (1) Can we efficiently decide whether G is consistent? (2) If G is not consistent, can we efficiently generate a consistent clustering graph from G to provide a valid ER clustering w.r.t. R∗? To answer these questions, we will discuss two clustering algorithms. The first algorithm was introduced in [2]; it checks the consistency of clustering through triangles and, as proven by the authors, is a 3-factor approximation algorithm for the correlation clustering problem [3]. The second algorithm is proposed by us and generalizes the first one to improve efficiency by checking through cycles.

4.2 Triangle-Based Clustering (TriC)

The central idea of TriC is to eliminate occurrences of inconsistent triangles (as depicted in Fig. 2(a)) in a clustering graph. Thus, two transitive closure (TC) rules are used in the clustering process:

– TC1: If (r1, r2) and (r1, r3) are two hard matches, then (r2, r3) must also be a hard match, as depicted in Fig. 2(b).

Fig. 2. (a) inconsistent triangle, (b)–(c) two transitive closure rules used in TriC


– TC2: If (r1, r2) is a hard match and (r1, r3) is a hard non-match, then (r2, r3) must be a hard non-match, as depicted in Fig. 2(c).
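The two rules can be applied exhaustively by scanning triples of records until no new edge can be hardened. The sketch below only illustrates this fixpoint computation (the dictionary representation of hard edges is our own choice); the full TriC procedure, including the queue of soft edges, is given in Algorithm 1:

from itertools import combinations

def apply_tc_rules(records, hard):
    """Exhaustively apply TC1 and TC2.

    records -- list of record identifiers
    hard    -- dict mapping frozenset({ri, rj}) to 1 (hard match) or -1 (hard non-match);
               extended in place with newly derived hard edges
    """
    changed = True
    while changed:
        changed = False
        for r1, r2, r3 in combinations(records, 3):
            # Consider each record of the triangle as the shared vertex of the two premises.
            for a, b, c in ((r1, r2, r3), (r2, r3, r1), (r3, r1, r2)):
                ab, ac, bc = frozenset({a, b}), frozenset({a, c}), frozenset({b, c})
                if bc in hard:
                    continue
                labels = {hard.get(ab), hard.get(ac)}
                if labels == {1}:            # TC1: two hard matches imply a third hard match
                    hard[bc] = 1
                    changed = True
                elif labels == {1, -1}:      # TC2: a hard match and a hard non-match imply a hard non-match
                    hard[bc] = -1
                    changed = True
    return hard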

Algorithm 1. Triangle-based clustering (TriC)
Input: A clustering graph G1 = (V, E1, λ1)
Output: A clustering graph G2 = (V, E2, λ2) that is consistent and only labelled by {1, −1}
1: G2 := G1 // Initialize the clustering graph G2
2: Apply TC1 and TC2 on all possible edges in G2 (i.e., edges in V × V)
3: Determine a queue Q of soft edges E1 − (E2^(1) ∪ E2^(−1))
4: Do the following until Q = ∅:
5:   e := Q.pop() // Get the first soft edge from the queue
6:   if λ1(e) ∈ LP, then: // Harden the edge according to its label
7:     λ2(e) := 1
8:   else:
9:     λ2(e) := −1
10:  Apply TC1 and TC2 on all possible edges in G2 (i.e., edges in V × V)
11:  Remove the hard edges in E2^(1) ∪ E2^(−1) from Q
12: Return G2

Algorithm 1 describes the main steps of TriC. Firstly, the TC1 and TC2 rules are applied to harden all possible edges over V, including edges that do not exist in the initial clustering graph G2 (line 2). Then, a queue Q of soft edges is determined (line 3), which reflects the relative importance of hardening soft edges according to their labels (i.e., hardened as a match if the label is positive, or as a non-match if the label is negative). Different strategies may be used to determine such a queue. We will discuss and evaluate two such strategies in Sect. 6. Each time an edge is hardened, the TC1 and TC2 rules are applied again to further harden other edges, if any (lines 5–10). In each iteration, newly hardened edges are removed from the queue (line 11). The iterations continue until the queue Q is empty. Then the algorithm returns the clustering graph G2 as output (line 12).

As TriC ensures the consistency of clustering through triangles, it unavoidably leads to a clique over V after the clustering (i.e., for any v1, v2 ∈ V with v1 ≠ v2, we have (v1, v2) ∈ E2 in G2). The complexity of the algorithm is O(|V|³) in the worst case (i.e., it needs |V|! / (2 × (|V| − 3)!) steps of checking edges), which is computationally expensive when clustering over large databases.

Fig. 3. (a) inconsistent cycle, (b)–(c) two extended transitive closure rules used in CyC


Algorithm 2. Cycle-based clustering (CyC)
Input: A clustering graph G1 = (V, E1, λ1)
Output: A clustering graph G2 = (V, E2, λ2) that is consistent and only labelled by {1, −1}
1: G2 := G1 // Initialize the clustering graph G2
2: Apply ETC1 and ETC2 on edges in E2
3: Determine a queue Q of soft edges E1 − (E2^(1) ∪ E2^(−1))
4: Do the following until Q = ∅:
5:   e := Q.pop() // Get the first soft edge from the queue
6:   if λ1(e) ∈ LP, then: // Harden the edge according to its label
7:     λ2(e) := 1
8:   else:
9:     λ2(e) := −1
10:  Apply ETC1 and ETC2 on edges in E2
11:  Remove the hard edges in E2^(1) ∪ E2^(−1) from Q
12: Return G2

4.3 Cycle-Based Clustering (CyC)

To improve the efficiency of clustering, we now propose an optimized algorithm (i.e., CyC) which ensures consistency based on cycles instead of triangles. In a nutshell, CyC generalizes TriC by eliminating inconsistent cycles, as depicted in Fig. 3(a). This is motivated by the observation that, for entity clustering, it is impossible to have a cycle (r1, . . . , rk, rk+1, . . . , rk+m, r1) in which all edges are hard matches except for one edge (r1, rk+m) that is a hard non-match. Thus, to eliminate such inconsistent cycles, two extended transitive closure rules are used in the clustering process:

– ETC1: If each (ri, ri+1) for i ∈ [1, . . . , k + m − 1] is a hard match, then (r1, rk+m) must also be a hard match, as depicted in Fig. 3(b).

– ETC2: If (rk, rk+1) is a hard non-match, and the other edges from r1 to rk+m are hard matches, then (r1, rk+m) must be a hard non-match, as depicted in Fig. 3(c).

Algorithm 2 describes the main steps of CyC. Similar to TriC, the ETC1 and ETC2 rules are first applied to harden soft edges in G2 (line 2). After that, a queue Q of soft edges is determined, and each of these soft edges is hardened according to its order in the queue (lines 3–9). Again the queue reflects the relative importance of hardening different soft edges. Next, CyC applies the ETC1 and ETC2 rules to ensure the consistency of clustering (line 10). Newly hardened edges in each iteration are removed from the queue (line 11). The iterations continue until the queue is empty. Finally, the resulting graph G2 is returned (line 12).

CyC applies the rules only on the existing edges of a clustering graph G, which leads to a complexity of O(|E| × |V|) in the worst case. It is more efficient than TriC because |E| is often much smaller than |V|² in a clustering graph. In practice, we may simply merge records that are connected by hard matches and treat each such merged group of records as a single record.
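This merging of hard-match components can be realised with a standard connected-components pass before clustering; the sketch below (with illustrative names, not from our implementation) maps every record to a representative super-record:

from collections import defaultdict, deque

def hard_match_components(records, hard_matches):
    """Group records connected by hard matches into super-records.

    records      -- iterable of record identifiers
    hard_matches -- iterable of (ri, rj) pairs that are hard matches
    Returns a dict mapping each record to the representative of its component.
    """
    neighbours = defaultdict(set)
    for ri, rj in hard_matches:
        neighbours[ri].add(rj)
        neighbours[rj].add(ri)

    representative = {}
    for start in records:
        if start in representative:
            continue
        queue = deque([start])               # breadth-first search over hard-match edges
        representative[start] = start
        while queue:
            current = queue.popleft()
            for nxt in neighbours[current]:
                if nxt not in representative:
                    representative[nxt] = start
                    queue.append(nxt)
    return representative

Each component can then be treated as a single record when building the clustering graph.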


5 Repairing Algorithms

In this section, we discuss an efficient provenance index structure and repairing algorithms for entities.

5.1 Provenance Index

In our work, a provenance index is constituted by a number of entity trees. Each entity tree describes how the records of an entity are identified from a number of matches. We use leaf(t) to refer to the set of all leaves of an entity tree t, root(t) to the root vertex of t, parent(v) to the parent vertex of a vertex v, subtree(v) to the subtree rooted at v, and subtree(v/v1, t1) to the subtree that replaces subtree(v1) by t1 in subtree(v), where v1 is a child vertex of v. An entity tree t is a binary tree with a labelling function ℓ such that (1) each leaf represents a record, (2) each internal vertex represents a match, and (3) each edge (parent(v), v) is labelled by a record that participates in the match represented by parent(v) such that ℓ(parent(v), v) ∈ leaf(t′) for t′ = subtree(v).

Entity trees are constructed bottom up during the clustering process. In terms of Algorithm 2, hard matches are first added into an entity tree and represented by vertices that are close to leaves, and these vertices must be under the vertices for soft matches (line 2 of Algorithm 2). Then, each time a soft edge (r1, r2) is hardened as a match (i.e., line 5 of Algorithm 2), a vertex v representing that match is added as the root of the corresponding entity tree t if {r1, r2} ⊈ leaf(t). Otherwise, we discard the match because it has already been implicitly captured in the provenance index. Thus, entity trees enjoy two nice properties to support efficient repairing. (1) Distinct leaves: Records represented by leaves of an entity tree are distinct. This provides an efficient index for finding matches on records. (2) Ordered vertices: Internal vertices are ordered to reflect the importance of matches, which is determined by the queue Q in Algorithm 2, e.g., the closer a vertex is to a leaf (in terms of the number of edges between them), the more important the match it represents.
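A minimal way to realise entity trees is sketched below; the class layout (explicit parent pointers and an edge label stored with each child) is our own illustrative choice and simply mirrors the definitions above, with leaves carrying records and internal vertices representing matches:

class EntityTreeNode:
    """A vertex of an entity tree: a leaf holds a record, an internal vertex represents a match."""

    def __init__(self, record=None, left=None, right=None,
                 left_label=None, right_label=None):
        self.record = record            # record identifier for leaves, None for internal vertices
        self.left = left                # left child, or None for a leaf
        self.right = right              # right child, or None for a leaf
        self.left_label = left_label    # record labelling the edge to the left child
        self.right_label = right_label  # record labelling the edge to the right child
        self.parent = None
        if left is not None:
            left.parent = self
        if right is not None:
            right.parent = self

    def leaves(self):
        """Set of records at the leaves of subtree(self)."""
        if self.record is not None:
            return {self.record}
        return self.left.leaves() | self.right.leaves()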

5.2 Entity Tree Splitting Algorithm

Algorithm 3 describes the proposed splitting algorithm for repairing erroneous matches. Given an erroneous match (r1, r2) detected by a user, following the property of ordered vertices, the algorithm first finds the lowest common parent vertex v of r1 and r2 (line 1). Then the child vertices of v are split into two subtrees (lines 2–3). After that, to propagate the effect of splitting the two child vertices of v, each vertex on the path from v to the root is checked. Depending on the edge label, the vertex is added to the top of one of the two subtrees (lines 6–9). The complexity of Algorithm 3 is O(|V|), which is linear in the number |V| of vertices in an entity tree t.
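Line 1 of Algorithm 3 needs the lowest common parent of the two leaves; with parent pointers as in the entity-tree sketch above, it can be found by walking upwards, as in this illustrative helper:

def lowest_common_parent(leaf1, leaf2):
    """Lowest common parent vertex of two leaves that carry .parent pointers (None at the root)."""
    ancestors = set()
    node = leaf1
    while node is not None:
        ancestors.add(node)
        node = node.parent
    node = leaf2
    while node is not None:
        if node in ancestors:
            return node
        node = node.parent
    return None   # the two leaves are not in the same entity tree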


Algorithm 3. Entity Tree Splitting Algorithm
Input: An entity tree t, an erroneous match (r1, r2) with r1, r2 ∈ leaf(t)
Output: Two entity trees t1 and t2 with r1 ∈ leaf(t1) and r2 ∈ leaf(t2)
1: Find the lowest common parent vertex v of r1 and r2 in t
2: t1 := subtree(v1) where v1 is the left child vertex of v
3: t2 := subtree(v2) where v2 is the right child vertex of v
4: v′ := v
5: Do the following for each vertex v′ in the path from v to root(t):
6:   if ℓ(parent(v′), v′) ∈ leaf(t1), then:
7:     t1 := subtree(parent(v′)/v′, t1)
8:   else:
9:     t2 := subtree(parent(v′)/v′, t2)
10:  v′ := parent(v′)
11: Return t1 and t2

Algorithm 4. Entity Tree Merging Algorithm
Input: Two entity trees t1 and t2, an erroneous non-match (r1, r2) with ri ∈ leaf(ti) for i = 1, 2, and a set of hard non-matches between t1 and t2 in R∗
Output: A set T of entity trees which includes an entity tree t ∈ T with r1, r2 ∈ leaf(t)
1: T := {t} where t is created with one vertex that has two child vertices r1 and r2
2: v_i^1 := ri, v_i := parent(ri), v_i^2 := otherchild(v_i, v_i^1) and len_i := 1 for i = 1, 2
3: Do the following until len_1 = 0 and len_2 = 0:
4:   if len_1 ≤ len_2:
5:     k := 1
6:   else:
7:     k := 2
8:   Find the tree t′ ∈ T satisfying ℓ(v_k, v_k^1) ∈ leaf(t′)
9:   if there exists (ri, rj, −1) ∈ R∗ with ri, rj ∈ leaf(t′) ∪ leaf(subtree(v_k^2)):
10:    T := T ∪ {subtree(v_k^2)}
11:  else:
12:    t′ := subtree(v_k/v_k^1, t′)
13:  if parent(v_k) exists:
14:    v_k^1 := v_k, v_k := parent(v_k), v_k^2 := otherchild(v_k, v_k^1) and len_k := len_k + 1
15:  else:
16:    len_k := 0
17: Return T

5.3 Entity Tree Merging Algorithm

Algorithm 4 describes the proposed merging algorithm for repairing erroneous non-matches, in which we use otherchild(v, vi) to denote the other child vertex of v in addition to vi. Given an erroneous non-match (r1, r2) detected by a user, the algorithm first creates an entity tree t with one internal vertex representing (r1, r2) (line 1). Then, for each vertex on the paths from the leaf r1 (resp. r2) to root(t1) (resp. root(t2)) (lines 3–16), if there is no violation of hard non-matches, the vertex is added into an entity tree in T based on its closeness to r1 or r2 in t1 and t2 (line 12); otherwise, a new entity tree is created for the subtree at the other child vertex of that vertex (line 10). The complexity of this algorithm is O(|V1| + |V2|), which is linear in the number of vertices in t1 and t2.
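The test in line 9 of Algorithm 4, whether joining two leaf sets would violate a hard non-match, amounts to a simple set check; in the sketch below, the hard non-matches are assumed to be given as a set of unordered record pairs, a representation chosen purely for illustration:

def violates_hard_nonmatch(leaves_a, leaves_b, hard_nonmatches):
    """True if placing the two leaf sets into one entity violates some hard non-match.

    leaves_a, leaves_b -- sets of records that would end up in the same entity
    hard_nonmatches    -- set of frozenset({ri, rj}) pairs declared as hard non-matches
    """
    combined = leaves_a | leaves_b
    return any(pair <= combined for pair in hard_nonmatches)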

6 Experiments

We have implemented our framework to evaluate how efficiently our repairing algorithms can improve the quality of ER based on the provenance indexing.


6.1 Experimental Setup

Our implementation is written in Python. The experiments were performed on a Linux machine with an Intel Core i7-3770 CPU at 3.40 GHz and 8 GBytes of RAM.

Data Sets. We used two real-world data sets in our experiments: CORA and NCVR. The CORA data set contains 1,878 machine learning publications and is publicly available together with its ground truth¹. The NCVR data set is a public voter registration data set from North Carolina², which contains the names and addresses of over 8 million voters. Each record includes a voter registration number, providing us with the true match status of record pairs. The number of true duplicate records in the full NCVR data set is below 2% of all records. We therefore extracted a smaller subset containing 448,134 records that includes all duplicates as well as randomly selected individuals (i.e., singleton clusters). Table 1 presents some characteristics of these two data sets.

ER Rules. For these data sets, we used the ER rules described in Table 2 to obtain an initial ER clustering result, which nonetheless only serves as a baseline for evaluating our ER repair algorithms. The better the quality of an initial clustering, the fewer repairs we would need to achieve a high-quality clustering. Thus, we used simple ER rules that can be easily set up by domain experts.
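Each rule in Table 2 is a conjunction of attribute-similarity thresholds. As a rough illustration (not the code we used), a CORA-style match rule such as u3 could be written as follows, with difflib's ratio standing in for whichever string similarity measure is actually employed:

from difflib import SequenceMatcher

def sim(value_a, value_b):
    """Stand-in string similarity in [0, 1]; the actual similarity function is not prescribed here."""
    return SequenceMatcher(None, str(value_a), str(value_b)).ratio()

def rule_u3(record_a, record_b):
    """CORA match rule u3 from Table 2: sim(title) > 0.7 and sim(author) > 0.7 (weight 0.74)."""
    return (sim(record_a["title"], record_b["title"]) > 0.7 and
            sim(record_a["author"], record_b["author"]) > 0.7)

When such a rule fires for a record pair, it contributes its weight (here 0.74) to the match rules of that pair, as described in Sect. 3.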

To our knowledge, this work is the first report on developing automatic techniques for repairing errors in ER. There are no baseline techniques with which we can compare our repairing approach.

Table 1. Characteristics of data sets used in our experiments

Data set | Number of records | Number of unique weight vectors | Number of clusters | Min. size of clusters | Max. size of clusters | Avg. size of clusters
CORA     | 1,878             | 77,004                          | 185                | 1                     | 239                   | 10.15
NCVR     | 448,134           | 1,556,184                       | 296,431            | 1                     | 6                     | 1.51

6.2 Quality of Repairing

In our experiments, we used two different strategies to determine the queue in Algorithm 2: (a) Random, i.e., randomly selecting soft edges from a clustering graph and adding them to the queue; (b) Weight, i.e., adding soft edges to the queue in descending order of their weights. We simulated user feedback by randomly selecting erroneous matches or non-matches in accordance with the clustering result after each repair, and used six metrics to evaluate the quality of ER: (1) Precision, Recall and Fmeasure, which are based on pairs of records

¹ Available from: http://www.cs.umass.edu/∼mccallum/.
² Available from: ftp://alt.ncsbe.gov/data/.


Table 2. ER rules used in our experiments, where sim(A) indicates the similarity of values in the attribute A of two records.

Data set | Rule | Rule description                                                                                          | Weight
CORA     | u1   | sim(title) > 0.6 ∧ sim(author) > 0.3 ∧ sim(date) > 0.3 ∧ sim(name) > 0.1                                  | 1
CORA     | u2   | sim(author) > 0.5 ∧ sim(name) > 0.5                                                                       | 0.36
CORA     | u3   | sim(title) > 0.7 ∧ sim(author) > 0.7                                                                      | 0.74
CORA     | u4   | sim(title) > 0.1 ∧ sim(name) > 0.9 ∧ sim(vol) > 0.7                                                       | 0.9
CORA     | u5   | sim(title) < 0.4 ∧ sim(author) < 0.3                                                                      | −1
NCVR     | u6   | sim(gender) = 1 ∧ sim(first name) ≥ 0.8 ∧ sim(last name) ≥ 0.8 ∧ sim(age) ≥ 0.7 ∧ sim(phone number) = 1   | 1
NCVR     | u7   | sim(first name) ≥ 0.6 ∧ sim(last name) ≥ 0.6 ∧ sim(age) ≥ 0.5                                             | 0.8
NCVR     | u8   | sim(first name) ≥ 0.7 ∧ sim(zip code) ≥ 0.7 ∧ sim(phone number) ≥ 0.7                                     | 0.7
NCVR     | u9   | sim(first name) ≥ 0.5 ∧ sim(last name) ≥ 0.5 ∧ sim(zip code) ≥ 0.8                                        | 0.8
NCVR     | u10  | sim(gender) < 1 ∧ sim(phone number) < 1 ∧ sim(first name) < 0.2 ∧ sim(last name) ≤ 0.2                    | −1

as in the traditional process [5], and (2) Closest Cluster Precision, Recall and Fmeasure (written as CC-Precision, CC-Recall and CC-Fmeasure, respectively), which compare clusterings by counting completely correct clusters [4].
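For the pairwise metrics, the following sketch (names assumed for illustration) shows how Precision, Recall and Fmeasure can be computed from a predicted clustering and the ground-truth clustering, counting record pairs that co-occur in a cluster:

from itertools import combinations

def cluster_pairs(clusters):
    """All unordered record pairs that co-occur in some cluster."""
    pairs = set()
    for cluster in clusters:
        pairs.update(frozenset(p) for p in combinations(cluster, 2))
    return pairs

def pairwise_quality(predicted_clusters, true_clusters):
    """Return (precision, recall, f_measure) over record pairs."""
    predicted = cluster_pairs(predicted_clusters)
    truth = cluster_pairs(true_clusters)
    true_positives = len(predicted & truth)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(truth) if truth else 0.0
    if precision + recall == 0.0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)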

In Figs. 4 and 5, the Precision and Fmeasure values are improved over both data sets, although the Recall values for NCVR drop slightly before going up. The CC-Precision, CC-Recall and CC-Fmeasure values also increase with increasing repairs for both data sets, and in particular, repairing over time can lead to ER clustering results that are close to the ground truth. Compared with NCVR, CORA has clusters of larger sizes and its initial clustering result contains more conflicts. Thus, the difference between the random-based and weight-based strategies on CORA is also larger than on NCVR.

6.3 Efficiency of Repairing

We have also evaluated the efficiency of repairing over NCVR, since it contains nearly half a million records. We measured the memory and time resources required during the repairing process. As depicted in Fig. 6, the weight-based strategy is less time efficient than the random-based strategy because it requires time to sort edges based on their weights. Nonetheless, compared with the random-based strategy, the weight-based strategy has a more uneven but generally lower memory usage.


Fig. 4. Quality of the clustering over CORA after repairing errors

Fig. 5. Quality of the clustering over NCVR after repairing errors

Fig. 6. Memory usage and time of repairing over NCVR


7 Conclusion

We have developed a framework for incrementally repairing errors existing in an ER clustering. During the clustering process, we establish a provenance index for capturing how records are clustered into each entity. This index provides useful information which enables us to efficiently repair errors when they are detected later on. In the future, we plan to continue this line of research by looking into how the provenance index structure can be generalized to support repairing for entity resolution that collectively resolves entities of different types.

References

1. Afrati, F.N., Kolaitis, P.G.: Repair checking in inconsistent databases: algorithms and complexity. In: ICDT, pp. 31–41 (2009)
2. Arasu, A., Re, C., Suciu, D.: Large-scale deduplication with constraints using dedupalog. In: ICDE, pp. 952–963 (2009)
3. Bansal, N., Blum, A., Chawla, S.: Correlation clustering. Mach. Learn. 56(1–3), 89–113 (2004)
4. Barnes, M.: A practitioner's guide to evaluating entity resolution results (2014)
5. Christen, P.: Data Matching: Concepts and Techniques for Record Linkage, Entity Resolution, and Duplicate Detection. Springer, Heidelberg (2012)
6. Cong, G., Fan, W., Geerts, F., Jia, X., Ma, S.: Improving data quality: consistency and accuracy. In: PVLDB, pp. 315–326 (2007)
7. Elmagarmid, A.K., Ipeirotis, P.G., Verykios, V.S.: Duplicate record detection: a survey. IEEE TKDE 19, 1–16 (2007)
8. Fellegi, I., Sunter, A.: A theory for record linkage. J. Am. Stat. Assoc. 64(328), 1183–1210 (1969)
9. Fisher, J., Christen, P., Wang, Q., Rahm, E.: A clustering-based framework to control block sizes for entity resolution. In: KDD, pp. 279–288 (2015)
10. Schewe, K.-D., Wang, Q.: A theoretical framework for knowledge-based entity resolution. TCS 549, 101–126 (2014)
11. Shen, W., Li, X., Doan, A.: Constraint-based entity matching. In: AAAI, pp. 862–867 (2005)
12. Shen, Z., Wang, Q.: Entity resolution with weighted constraints. In: Manolopoulos, Y., Trajcevski, G., Kon-Popovska, M. (eds.) ADBIS 2014. LNCS, vol. 8716, pp. 308–322. Springer, Heidelberg (2014)
13. Wang, Q., Schewe, K.-D., Wang, W.: Provenance-aware entity resolution: leveraging provenance to improve quality. In: Renz, M., Shahabi, C., Zhou, X., Cheema, M.A. (eds.) DASFAA 2015. LNCS, vol. 9049, pp. 474–490. Springer, Heidelberg (2015)
14. Wang, Q., Vatsalan, D., Christen, P.: Efficient interactive training selection for large-scale entity resolution. In: Cao, T., Lim, E.-P., Zhou, Z.-H., Ho, T.-B., Cheung, D., Motoda, H. (eds.) PAKDD 2015. LNCS, vol. 9078, pp. 562–573. Springer, Heidelberg (2015)
15. Whang, S.E., Garcia-Molina, H.: Entity resolution with evolving rules. VLDB 3(1–2), 1326–1337 (2010)
16. Wijsen, J.: Database repairing using updates. TODS 30(3), 722–768 (2005)