Yannis Tzitzikas, Christina Lantzaki and Dimitris Zeginis Institute of Computer Science, FORTH-ICS,...

Blank Node Matching and RDF/S Comparison Functions Yannis Tzitzikas , Christina Lantzaki and Dimitris Zeginis Institute of Computer Science, FORTH-ICS, and Computer Science Department, University of Crete, GREECE ISWC2012, Boston, Nov. 2012


Page 1

Blank Node Matching and RDF/S Comparison Functions

Yannis Tzitzikas , Christina Lantzaki and Dimitris Zeginis

Institute of Computer Science, FORTH-ICS, and Computer Science Department, University of Crete, GREECE

ISWC2012, Boston, Nov. 2012

Page 2

In two slides (1/2)

• Several RDF/S Knowledge Bases rely heavily on blank nodes (bnodes).
• Bnodes are convenient for representing complex attributes, or resources whose identity is unknown but whose attributes (either literals or associations with other resources) are known.
• We show how to exploit blank node anonymity in order to reduce the delta size when comparing RDF/S Knowledge Bases.
• We approach the problem as an optimization problem:
  – Find the mapping that gives the delta of minimum size

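[Figure: two RDF graphs. G1: Chris hasAddress _:ad1, where _:ad1 has street "Arlington St", no "77", city "Boston". G2: Chris hasAddress _:ad2, where _:ad2 has the same street, no and city values; G2 also contains a node Jim with a hasAddress edge.]

Blank node prevalence*: Opencalais.com 44.9%, hi5.com foaf 87.5%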

*[On blank nodes ISWC 2011]

FORTH-ICS, ISWC 2012

Page 3

In two slides (2/2)

• All KBs (general case): finding the optimal mapping is NP-Hard
  – Approximately optimal mapping (Hungarian-based): time complexity $O(n^3)$, deviation from optimal in [0%, 7.2%]
  – Approximately optimal mapping (Signature-based): time complexity $O(n \log n)$, deviation from optimal in [1%, 7.2%]
• KBs with no directly connected bnodes
  – Optimal mapping: time complexity $O(n^3)$
• Deviation from optimal: $deviation = \frac{|\Delta_x| - |\Delta_{opt}|}{|\Delta_{opt}|}$
• Mapping of 150,000 blank nodes: ~11 sec

Page 4

Outline

• Motivation
• RDF Knowledge Bases with Blank Nodes
• On finding the Optimal Bnode Mapping
  – Delta and Bnode Name Tuning
  – The Optimization Problem
  – Polynomially-solved Cases
• Approximate Bnode Matching Algorithms
  – Hungarian Bnode Matching Algorithm
  – A Fast Signature-based Algorithm
• Experimental Evaluation
• Discussing Semantics and Inference Rules
• Related Work
• Concluding Remarks

Page 5

Motivation

• The world evolves, and world models (e.g. KBs expressed in RDF/S) evolve as well.
• The result of the comparison of two KBs is called a Delta.
• Deltas can be useful for
  – aiding humans to understand the evolution of knowledge
  – reducing the amount of data that needs to be exchanged and managed over the network in order to build synchronization, versioning and replication services
• The inability to match bnodes increases the delta size and does not assist in detecting the changes between subsequent versions of a KB. However, a large percentage of the nodes of existing RDF KBs are blank nodes
  – Opencalais.com: 44.9% bnodes, hi5.com foaf: 87.5% bnodes

Page 6

RDF Knowledge Bases with Blank Nodes

Def: Equivalence. Two RDF graphs G1 and G2 are equivalent if there is a bijection M between the sets of nodes of the two graphs (N1 and N2), such that:

– $M(uri) = uri$ for each $uri \in U_1 \cap N_1$
– $M(lit) = lit$ for each $lit \in L_1$
– M maps bnodes to bnodes
– The triple $(s, p, o)$ is in $G_1$ if and only if the triple $(M(s), p, M(o)) \in G_2$

[Figure: the bijection M between N1 and N2. It is the identity function on URIs and the identity function on literals; the mapping between the blank nodes is the open question (?).]

Graph notation: N: nodes, B: blank nodes, L: literals, U: URIs
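The equivalence definition can be checked mechanically for a given candidate mapping. The following is a minimal Python sketch, not the authors' implementation. It assumes (for illustration only) that a graph is a set of `(s, p, o)` string triples and that blank nodes are exactly the strings starting with `_:`; the function names are hypothetical.

```python
def is_bnode(node):
    """Assumed convention: blank nodes are strings that start with '_:'."""
    return node.startswith("_:")


def equivalent_under(g1, g2, m):
    """Check G1 ≡ G2 under a bnode mapping m (dict: bnode of G1 -> bnode of G2).

    URIs and literals are mapped to themselves (identity function).
    """
    # M must map bnodes to bnodes ...
    if not all(is_bnode(a) and is_bnode(b) for a, b in m.items()):
        return False
    # ... and must be injective.
    if len(set(m.values())) != len(m):
        return False
    # (s, p, o) in G1  <=>  (M(s), p, M(o)) in G2
    image = {(m.get(s, s), p, m.get(o, o)) for s, p, o in g1}
    return image == g2


g1 = {("_:1", "name", "Joe")}
g2 = {("_:2", "name", "Joe")}
g3 = {("_:2", "name", "Joe"), ("_:2", "lives", "UK")}

same = equivalent_under(g1, g2, {"_:1": "_:2"})
different = equivalent_under(g1, g3, {"_:1": "_:2"})
```

Here `same` is true because renaming `_:1` to `_:2` turns G1 into G2, while `different` is false because G3 has one extra triple.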

Page 7

RDF Knowledge Bases with Blank Nodes (Cont)

Def: Edit Distance over Nodes given a Bijection. Let o1 and o2 be two nodes of G1 and G2, and suppose a bijection between the nodes of these graphs, i.e. a function $h : N_1 \to N_2$. We define the edit distance between o1 and o2 over h, denoted by $dist_h(o_1, o_2)$, as the number of additions or deletions of triples which are required for making the "direct neighborhoods" of o1 and o2 the same. Formally,

$$dist_h(o_1, o_2) = |\{(o_1, p, a) \in G_1 \mid (o_2, p, h(a)) \notin G_2\}| + |\{(a, p, o_1) \in G_1 \mid (h(a), p, o_2) \notin G_2\}| + |\{(o_2, p, a) \in G_2 \mid (o_1, p, h^{-1}(a)) \notin G_1\}| + |\{(a, p, o_2) \in G_2 \mid (h^{-1}(a), p, o_1) \notin G_1\}|$$

Theorem: RDF Graph Equivalence. $G_1 \equiv_h G_2 \Leftrightarrow dist_h(o, h(o)) = 0$ for each $o \in N_1$

[Figure: two graphs, K1 with nodes o1, o2, o3, o4 and K2 with nodes o5, o6, o7, o8, each with three edges labeled p.]

Example: h = {(o1 → o7), (o2 → o6), (o3 → o5), (o4 → o8)}

dist_h(o2, o6) = 4
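The four terms of the edit distance translate directly into four counts. The sketch below is a hedged illustration in Python, assuming graphs are sets of `(s, p, o)` triples and `h` is a dict that covers every node of the first graph. The edge layout of K1 and K2 is not recoverable from the transcript, so the two graphs used here are a hypothetical layout, chosen because it yields the value 4 for the pair (o2, o6) under the mapping h given on the slide.

```python
def dist_h(g1, g2, o1, o2, h):
    """Edit distance between o1 (node of g1) and o2 (node of g2) over bijection h.

    h is a dict N1 -> N2; its inverse is derived here.
    """
    h_inv = {v: k for k, v in h.items()}
    # Outgoing triples of o1 whose image is missing from g2.
    out1 = sum(1 for s, p, a in g1 if s == o1 and (o2, p, h[a]) not in g2)
    # Incoming triples of o1 whose image is missing from g2.
    in1 = sum(1 for a, p, o in g1 if o == o1 and (h[a], p, o2) not in g2)
    # Outgoing triples of o2 whose pre-image is missing from g1.
    out2 = sum(1 for s, p, a in g2 if s == o2 and (o1, p, h_inv[a]) not in g1)
    # Incoming triples of o2 whose pre-image is missing from g1.
    in2 = sum(1 for a, p, o in g2 if o == o2 and (h_inv[a], p, o1) not in g1)
    return out1 + in1 + out2 + in2


# Hypothetical edge layout (an assumption, not taken from the figure).
k1 = {("o1", "p", "o2"), ("o2", "p", "o3"), ("o2", "p", "o4")}
k2 = {("o5", "p", "o6"), ("o6", "p", "o7"), ("o6", "p", "o8")}
h = {"o1": "o7", "o2": "o6", "o3": "o5", "o4": "o8"}

d = dist_h(k1, k2, "o2", "o6", h)
```

With this layout, one outgoing and one incoming triple of o2 have no counterpart in K2, and the same holds for o6 with respect to K1, so `d` is 4.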

Page 8

Deltas and Bnode Mappings

• For the case where the Knowledge Bases are not necessarily equivalent, we would like to find the bnode mapping that reduces the delta size
• Delta
  – We use the differential function $\Delta_e$. The computed delta consists of triple additions and triple deletions:
    $\Delta_e(G_1 \to G_2) = \{Add(t) \mid t \in G_2 - G_1\} \cup \{Del(t) \mid t \in G_1 - G_2\}$
  – Note: no rename operation is needed and hence no particular execution order
• Consider the following example:
  G1 = {(_:1, name, Joe)}
  G2 = {(_:2, name, Joe), (_:2, lives, UK)}

Δe without bnode matching (|Δe| = 3):
  Delete(_:1, name, Joe)
  Add(_:2, lives, UK)
  Add(_:2, name, Joe)

Δe with bnode matching (|Δe| = 1):
  Add(_:2, lives, UK)

Bnode Name Tuning (the matched bnode _:2 takes the name _:1):
  Add(_:1, lives, UK)
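The Joe example can be replayed in a few lines. This is a minimal Python sketch under the assumption that graphs are sets of `(s, p, o)` triples; `delta` and `rename` are hypothetical helper names, and renaming the bnodes of G2 to the names used in G1 is one simple way to model bnode name tuning.

```python
def delta(g1, g2):
    """Explicit delta: triples to add plus triples to delete to go from g1 to g2."""
    return {("Add", t) for t in g2 - g1} | {("Del", t) for t in g1 - g2}


def rename(graph, mapping):
    """Rename nodes according to mapping (dict old name -> new name)."""
    return {(mapping.get(s, s), p, mapping.get(o, o)) for s, p, o in graph}


g1 = {("_:1", "name", "Joe")}
g2 = {("_:2", "name", "Joe"), ("_:2", "lives", "UK")}

# Without bnode matching, _:1 and _:2 are treated as unrelated nodes.
without_matching = delta(g1, g2)

# With the match _:1 <-> _:2, G2 is expressed with the bnode names of G1.
with_matching = delta(g1, rename(g2, {"_:2": "_:1"}))
```

The first delta contains three operations and the second only `Add(_:1, lives, UK)`, matching the sizes 3 and 1 shown above.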

Page 9

On Finding the Optimal Mapping

• Our objective is to find the bijection M (between bnodes) that minimizes the delta size
  – it concerns the mapping of the blank nodes of the subsets B1 and B2
  – the bijection M a priori contains the mappings of all the URIs (U1, U2) and literals (L1, L2) as identity functions
• The number of possible bijections M is exponential
  – $|J| = n_2 (n_2 - 1) \cdots (n_2 - n_1 + 1)$, if $|B_1| = n_1$, $|B_2| = n_2$, $|B_1| < |B_2|$
• The cost of a bijection M (which is actually the part of the delta that concerns bnodes)
  – $Cost(M) = \sum_{b_1 \in B_1} dist_M(b_1, M(b_1))$

Problem Statement: Given two Knowledge Bases, find the bijection (or bijections) that minimizes the cost. $M_{sol} = \arg\min_{M \in J} Cost(M)$

Theorem: Hardness of Optimality. Finding the optimal bijection is NP-Hard.

Proof: reduction from the subgraph isomorphism problem (NP-Complete)
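For very small inputs the problem statement can be solved by exhaustive search, which also makes the size of the search space concrete. The sketch below is a brute-force illustration in Python, not a practical algorithm. It assumes graphs are sets of `(s, p, o)` triples, bnodes are strings starting with `_:`, the two graphs use disjoint bnode labels, and |B1| ≤ |B2|; all function names are hypothetical.

```python
from itertools import permutations


def is_bnode(node):
    return node.startswith("_:")


def bnodes(graph):
    return sorted({n for s, _, o in graph for n in (s, o) if is_bnode(n)})


def dist(g1, g2, o1, o2, m):
    """Edit distance of the direct neighborhoods of o1 and o2 over mapping m.

    Nodes missing from m (URIs, literals, unmatched bnodes) map to themselves.
    """
    inv = {v: k for k, v in m.items()}
    return (
        sum(1 for s, p, a in g1 if s == o1 and (o2, p, m.get(a, a)) not in g2)
        + sum(1 for a, p, o in g1 if o == o1 and (m.get(a, a), p, o2) not in g2)
        + sum(1 for s, p, a in g2 if s == o2 and (o1, p, inv.get(a, a)) not in g1)
        + sum(1 for a, p, o in g2 if o == o2 and (inv.get(a, a), p, o1) not in g1)
    )


def cost(g1, g2, m):
    """Cost(M): sum of dist over all mapped bnode pairs."""
    return sum(dist(g1, g2, b1, b2, m) for b1, b2 in m.items())


def optimal_mapping(g1, g2):
    """Enumerate all n2 * (n2 - 1) * ... * (n2 - n1 + 1) candidate mappings."""
    b1, b2 = bnodes(g1), bnodes(g2)
    candidates = (dict(zip(b1, image)) for image in permutations(b2, len(b1)))
    best = min(candidates, key=lambda m: cost(g1, g2, m))
    return best, cost(g1, g2, best)


g1 = {("_:1", "name", "Joe"), ("_:2", "name", "Ann")}
g2 = {("_:a", "name", "Ann"), ("_:b", "name", "Joe"), ("_:b", "lives", "UK")}
best, best_cost = optimal_mapping(g1, g2)
```

In this example there are only two candidate mappings; the better one pairs the two "Joe" bnodes and the two "Ann" bnodes and leaves a single triple to add.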

Page 10

[Overview diagram. All KBs (general case): NP-Hard; approximately optimal mapping with time complexity $O(n^3)$ or $O(n \log n)$. KBs with no directly connected bnodes: optimal mapping with time complexity $O(n^3)$.]

Page 11

Polynomially-solved cases: Not directly connected bnodes

Key observation: If there are no directly connected bnodes, then the edit distance between a pair of bnodes is independent of the other pairs

Consequence
• The optimization problem can be solved using the Hungarian Algorithm [J. Munkres, 1957]
  – The elements of B1 play the role of workers
  – The elements of B2 play the role of jobs
  – The edit distances of the pairs in B1 × B2 play the role of the costs
• All the possible combinations can be checked with only |B1| * |B2| (i.e. $n^2$, assuming n = |B1| = |B2|) edit distance computations
• The Hungarian-based method has cubic time complexity $O(n^3)$ and quadratic main memory complexity.

Theorem: Finding the optimal bijection is a polynomial task if there are no directly connected blank nodes.
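The worker/job analogy can be made concrete with a small assignment solver. The block below is a pure-Python sketch of the classic potentials-based $O(n^3)$ formulation of the Hungarian (Kuhn-Munkres) method; it is not the authors' implementation. The cost matrix is a hypothetical table of edit distances, with rows standing for bnodes of B1 and columns for bnodes of B2, and the sketch assumes there are at most as many rows as columns.

```python
def hungarian(cost):
    """Minimum-cost assignment of rows to columns (requires rows <= columns).

    Returns a list where entry i is the column assigned to row i.
    """
    n, m = len(cost), len(cost[0])
    INF = float("inf")
    u = [0] * (n + 1)      # row potentials (1-indexed)
    v = [0] * (m + 1)      # column potentials (1-indexed)
    p = [0] * (m + 1)      # p[j] = row currently matched to column j (0 = none)
    way = [0] * (m + 1)    # back-pointers along the augmenting path
    for i in range(1, n + 1):
        p[0] = i
        j0 = 0
        minv = [INF] * (m + 1)
        used = [False] * (m + 1)
        while True:
            used[j0] = True
            i0, delta, j1 = p[j0], INF, 0
            for j in range(1, m + 1):
                if not used[j]:
                    cur = cost[i0 - 1][j - 1] - u[i0] - v[j]
                    if cur < minv[j]:
                        minv[j], way[j] = cur, j0
                    if minv[j] < delta:
                        delta, j1 = minv[j], j
            for j in range(m + 1):
                if used[j]:
                    u[p[j]] += delta
                    v[j] -= delta
                else:
                    minv[j] -= delta
            j0 = j1
            if p[j0] == 0:
                break
        # Flip the matching along the augmenting path.
        while j0 != 0:
            j1 = way[j0]
            p[j0] = p[j1]
            j0 = j1
    assignment = [None] * n
    for j in range(1, m + 1):
        if p[j] != 0:
            assignment[p[j] - 1] = j - 1
    return assignment


# Hypothetical edit distances: rows = bnodes of B1, columns = bnodes of B2.
distances = [[4, 1, 3],
             [2, 0, 5],
             [3, 2, 2]]
assignment = hungarian(distances)
total = sum(distances[i][j] for i, j in enumerate(assignment))
```

Storing the full distance matrix is what gives the quadratic memory footprint mentioned above.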

Page 12

[Overview diagram. All KBs (general case): NP-Hard; approximately optimal mapping with time complexity $O(n^3)$ or $O(n \log n)$. KBs with no directly connected bnodes: optimal mapping with time complexity $O(n^3)$.]

Page 13

The Hungarian-based Algorithm (1/2)

• It is a variation of the optimal Hungarian algorithm that provides an approximate solution, because computing $dist_h$ requires an assumption about how directly connected blank nodes are treated

Two possible assumptions:
• All connected bnodes are considered different
• All connected bnodes are considered the same

It again makes only |B1| * |B2| ($n^2$) edit distance computations, and its complexity remains $O(n^3)$

Page 14

The Hungarian-based Algorithm (2/2)

[Figure: graphs G1 (bnodes _:1 to _:5) and G2 (bnodes _:6 to _:10) with similar structure. Both use the properties hasAgenda, brother, friend, name and sname, and the values Jim, Chris, Zeginis, John and Tom.]

dist_h(_:1, _:6) = ?
– It depends on the mappings of bnodes _:3, _:4, _:8, _:9

Assume all the connected bnodes are considered:
• the same: dist_h(_:1, _:6) = 0. This exploits the similarity of their predicates. This assumption is used for the experiments.
• different: dist_h(_:1, _:6) = 4. This does not take common predicates into account.
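The effect of the two assumptions can be illustrated by comparing direct neighborhoods with connected bnodes either collapsed to a single placeholder or kept distinct. This is a minimal Python sketch, not the authors' code. The graph fragments are hypothetical (the full figure is not recoverable from the transcript); they are chosen so that _:1 and _:6 each have two bnode neighbors, which gives the values 0 and 4.

```python
from collections import Counter


def is_bnode(node):
    return node.startswith("_:")


def neighborhood(graph, node, bnodes_same):
    """Multiset of (direction, predicate, neighbor) around node.

    If bnodes_same is True, every bnode neighbor is replaced by one placeholder.
    """
    def label(n):
        return "_:*" if (bnodes_same and is_bnode(n)) else n

    items = [("out", p, label(o)) for s, p, o in graph if s == node]
    items += [("in", p, label(s)) for s, p, o in graph if o == node]
    return Counter(items)


def approx_dist(g1, g2, b1, b2, bnodes_same=True):
    """Number of neighborhood entries present on one side but not the other."""
    c1 = neighborhood(g1, b1, bnodes_same)
    c2 = neighborhood(g2, b2, bnodes_same)
    return sum(((c1 - c2) + (c2 - c1)).values())


# Hypothetical fragments around _:1 (in G1) and _:6 (in G2).
g1 = {("Jim", "hasAgenda", "_:1"), ("_:1", "brother", "_:3"), ("_:1", "friend", "_:4")}
g2 = {("Jim", "hasAgenda", "_:6"), ("_:6", "brother", "_:8"), ("_:6", "friend", "_:9")}

d_same = approx_dist(g1, g2, "_:1", "_:6", bnodes_same=True)
d_different = approx_dist(g1, g2, "_:1", "_:6", bnodes_same=False)
```

Under the "same" assumption only predicates and directions matter, so the two neighborhoods coincide; under the "different" assumption each of the four bnode edges counts as a mismatch.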

Page 15

[Overview diagram. All KBs (general case): NP-Hard; approximately optimal mapping with time complexity $O(n^3)$ or $O(n \log n)$. KBs with no directly connected bnodes: optimal mapping with time complexity $O(n^3)$.]

Page 16

The Signature-based Algorithm (1/2)

It consists of two steps

1. Signature Construction Phase: for each bnode a signature (string) is constructed based on the direct neighborhood of the bnode

2. Mapping Construction Phase: the two bags of signatures are matched. Each signature matching corresponds to a mapping of a pair of blank nodes

Example of Signature Construction:

[Figure: G1 contains Christina hasAddress _:1 (street "Oxford St", no "14", city "London") and Yannis hasAddress _:2 (street "Broadway", no "445", city "New York"). G2 contains Christina hasAddress _:3 (street "Oxford St", no "14", city "London") and Yannis hasAddress _:4 (street "Michigan A", no "132", city "Chicago"). In both graphs the address bnodes have rdf:type Address.]
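A signature can be any string that summarizes the direct neighborhood of a bnode. The slides do not spell out the exact encoding, so the Python sketch below uses one possible choice as an assumption: the sorted concatenation of direction, predicate and neighbor, with bnode neighbors replaced by a placeholder. The graphs follow the Christina/Yannis example; the triple representation and the function name are hypothetical.

```python
def signature(graph, bnode):
    """Build a signature string from the direct neighborhood of bnode.

    Assumed encoding: '>' marks outgoing and '<' incoming triples;
    bnode neighbors are written as '_' so their local names do not matter.
    """
    parts = []
    for s, p, o in graph:
        if s == bnode:
            parts.append(">" + p + "=" + ("_" if o.startswith("_:") else o))
        if o == bnode:
            parts.append("<" + p + "=" + ("_" if s.startswith("_:") else s))
    # Sorting makes the signature independent of triple order.
    return "|".join(sorted(parts))


def address(person, bnode, street, no, city):
    """Helper that builds the triples of one address (illustrative only)."""
    return {
        (person, "hasAddress", bnode),
        (bnode, "rdf:type", "Address"),
        (bnode, "street", street),
        (bnode, "no", no),
        (bnode, "city", city),
    }


g1 = address("Christina", "_:1", "Oxford St", "14", "London") | \
     address("Yannis", "_:2", "Broadway", "445", "New York")
g2 = address("Christina", "_:3", "Oxford St", "14", "London") | \
     address("Yannis", "_:4", "Michigan A", "132", "Chicago")

sig1, sig3 = signature(g1, "_:1"), signature(g2, "_:3")
sig2, sig4 = signature(g1, "_:2"), signature(g2, "_:4")
```

With this encoding `_:1` and `_:3` receive identical signatures, while `_:2` and `_:4` differ in the street, number and city parts.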

Page 17

The Signature-based Algorithm (2/2)

Steps: Signature Construction Phase, lexicographical sorting, Mapping Construction

Mapping Construction
• The mapping is exported in two passes
• For both passes we start from the smaller list, say BS1, and for each bs1 in that list we perform a lookup in the second list BS2, using binary search (logarithmic complexity)
• First pass (exact match): exports only the exact matches
• Second pass (closest match): is applied over the remaining part of BS1 and BS2, and matches each element of BS1 to the lexicographically closest element

Note: we perform the closest matches after finishing with the exact matches in order to avoid the situation where an approximate match prevents an exact match at a later step.
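The two passes can be sketched over sorted lists with the standard `bisect` module. This is a hedged Python illustration, not the authors' code. It assumes the signatures are already computed and given as dicts from bnode to signature string, and it interprets "lexicographically closest" as the neighbor of the insertion point that shares the longest common prefix, which is an assumption. Removing elements with `list.pop` keeps the sketch short but is linear per removal, so this version does not reach the $O(n \log n)$ bound.

```python
from bisect import bisect_left
from os.path import commonprefix


def match_signatures(sigs1, sigs2):
    """Map bnodes of sigs1 to bnodes of sigs2 (dicts: bnode -> signature)."""
    swapped = len(sigs1) > len(sigs2)
    if swapped:  # always iterate over the smaller list
        sigs1, sigs2 = sigs2, sigs1
    bs1 = sorted((sig, b) for b, sig in sigs1.items())
    bs2 = sorted((sig, b) for b, sig in sigs2.items())
    mapping, rest = {}, []

    # First pass: exact matches only, located by binary search.
    for sig, b in bs1:
        i = bisect_left(bs2, (sig, ""))
        if i < len(bs2) and bs2[i][0] == sig:
            mapping[b] = bs2.pop(i)[1]
        else:
            rest.append((sig, b))

    # Second pass: closest remaining signature around the insertion point.
    for sig, b in rest:
        i = bisect_left(bs2, (sig, ""))
        candidates = [j for j in (i - 1, i) if 0 <= j < len(bs2)]
        best = max(candidates, key=lambda j: len(commonprefix([sig, bs2[j][0]])))
        mapping[b] = bs2.pop(best)[1]

    return {v: k for k, v in mapping.items()} if swapped else mapping


# Hypothetical signatures for illustration.
sigs1 = {"_:1": "a|b", "_:2": "c|d"}
sigs2 = {"_:3": "a|b", "_:4": "c|e", "_:5": "z"}
mapping = match_signatures(sigs1, sigs2)
```

In the example `_:1` is paired with `_:3` in the exact pass, and `_:2` falls through to the closest pass, where it is paired with `_:4`.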

Page 18

[Overview diagram. All KBs (general case): NP-Hard; approximately optimal mapping with time complexity $O(n^3)$ or $O(n \log n)$. KBs with no directly connected bnodes: optimal mapping with time complexity $O(n^3)$. Next: Experimental evaluation.]

Page 19

Experimental Evaluation

• Over real datasets
  – Available in the LOD cloud
  – Two versions from each dataset
• Over synthetic datasets
  – A synthetic generator was implemented
  – Built over the UBA generator [Y. Guo et al., ISWC '04]
  – Extended to support control over the number of blank nodes and the blank node properties
• Evaluation Aspects
  – Delta reduction potential
  – Equivalence detection potential
  – Time efficiency
  – Deviation from optimal delta

Experiments were conducted using the Sesame RDF/S Repository (main memory model) on a PC with an Intel Core i3 at 2.2 GHz and 3.8 GB RAM, running Ubuntu 11.10.

Page 20

Experimental Evaluation: Real Datasets

Datasets*: Swedish Open Cultural Heritage, Italian Museums

None of the datasets contains directly connected blank nodes
• The Hungarian always finds the optimal solution

Delta Size
• The proposed algorithms give a much smaller (12.7 to 7,924 times reduced) delta than without blank node matching
• The Signature gave a delta 0.34 times (34%) bigger than the Hungarian

Mapping Time
• The Hungarian requires more time (from 15 to 624 times) than the Signature
• The Signature needs less than one second for mapping 6390 blank nodes

* The datasets were downloaded from CKAN

Page 21

Experimental Evaluation: Synthetic Datasets 1

Synthetic Generation 1
• A set of 9 datasets, from KB0 to KB8, was generated
  – all of them contain the same number of blank nodes (240)
  – the blank node structures gradually become more complex

Two rounds of experiments
1. Delta reduction potential: compare each dataset with another version
2. Equivalence detection potential: compare each dataset with itself

Page 22

Experimental Evaluation: Synthetic Datasets 1

Delta Reduction Potential

[Chart: Delta Reduction Potential. x-axis: pair of datasets (0 to 8); y-axis: delta size percentage in log scale (0.1 to 1000); series: OPTIMAL, NO BNODE MATCHING, HUNGARIAN, SIGNATURE.]

• Delta size is given as a percentage: $|\Delta_e(KB \to KB')|$ relative to the size of the compared KBs. Without bnode matching the delta size ranges from 95% to 143%
• The algorithms provide a much smaller delta than without blank node matching
• The Hungarian achieves the optimal delta for most of the pairs
• The Hungarian yields from 0 to 3 times smaller deltas than the Signature

Page 23

Experimental Evaluation: Synthetic Datasets 1

Equivalence detection potential
– Both the proposed algorithms detected equivalence for the first five Knowledge Bases

Time Efficiency
– The Signature gives two orders of magnitude lower mapping times than the Hungarian

[Chart: Mapping Time. x-axis: pair of datasets (0 to 8); y-axis: mapping time (ms) in log scale (10 to 10000); series: HUNGARIAN, SIGNATURE.]

Page 24

Experimental Evaluation: Synthetic Datasets 2

Synthetic Generation 2
• A set of 7 bigger datasets, from KB0 to KB6, was generated, containing from 2,400 to 153,600 blank nodes

[Chart: Mapping Time. x-axis: |BNodes| (2,400; 4,800; 9,600; 19,200; 38,400; 76,800; 153,600); y-axis: mapping time (ms), 0 to 12000; series: Signature Construction, Signature Mapping.]

Note: The Hungarian Algorithm could not be applied even to the third pair of datasets due to its high requirements in main memory space

The mapping time for the Signature was only 10.5 seconds for the seventh pair of Knowledge Bases

Page 25

Measuring the approximation

Deviation from optimal delta
• Investigate how the bnode structures impact the deviation from the optimal delta: $d_x = \frac{|\Delta_x| - |\Delta_{opt}|}{|\Delta_{opt}|}$
• b_density: the percentage of bnodes in the direct neighborhood

Hungarian deviation: 0% - 7.2%

Signature deviation: 1% - 7.2%

[Chart: Deviation from Optimal (non-equivalent KBs). x-axis: b_density; y-axis: deviation $d_x$; series: HUNG ordered, HUNG reversed, SIGN ordered, SIGN reversed.]
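As a worked illustration of the metric, the deviation can be computed as the relative increase of a delta over the optimal one, i.e. (|Δx| − |Δopt|) / |Δopt|; that reading of the formula is an assumption here. The delta sizes in the Python sketch below are hypothetical numbers, not measurements from the experiments.

```python
def deviation(delta_size, optimal_size):
    """Relative deviation of a delta size from the optimal delta size."""
    if optimal_size <= 0:
        raise ValueError("optimal delta size must be positive")
    return (delta_size - optimal_size) / optimal_size


# Hypothetical sizes: an algorithm yields 1072 operations, the optimum is 1000.
d = deviation(1072, 1000)
```

Here `d` is 0.072, i.e. a deviation of 7.2%; a delta equal to the optimal one gives a deviation of 0.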

Page 26

Discussion: Semantics and Inference Rules

• Apart from the explicitly specified triples of a KB, other triples can be inferred based on the RDF/S semantics, or other custom inference rules.
• To apply our method, the only difference is that the graphs should be completed with the inferred triples.
• It follows that if the semantics is based on a set of inference rules yielding a finite closure, then the graph is finite and thus our method can be applied.
  – E.g. Minimal RDFS semantics, ter Horst's pD* semantics and others
• Note: the optimal bnode mapping over the complete graphs may be different from the optimal mapping when considering the explicit graphs.

Page 27

Related Work

• Past works focusing on detecting only isomorphism
  – Jena
• Past works focusing on finding the delta
  – RDF Sync: no effort is dedicated to finding a blank node mapping
  – PromptDiff: employs heuristic matchers, but does not treat blank nodes
  – OntoView: no blank node matching is offered
  – CWM: requires the blank nodes to have term labels
  – SemVersion: creates and assigns unique identifiers for the blank nodes
  – RDF Molecules (SSWS 2008): a blank node mapping in $O(n^2)$ is offered, but it requires the blank nodes to be part of a uniquely identified triple
• They do not try to find a mapping that reduces the delta size
• Works for constructing RDF/S mappings are not directly related since they map the named entities of the two KBs, and thus they take into account lexical similarities, something that is not possible with bnodes.

Page 28

Concluding Remarks

• We have shown how to exploit blank node anonymity in order to reduce the delta size when comparing RDF/S Knowledge Bases
• We proved that finding the optimal mapping is NP-Hard in the general case (polynomial if there are no directly connected blank nodes)
• We presented polynomial approximate algorithms for the general case (a Hungarian-based and a Signature-based)
• In real datasets with no directly connected blank nodes
  – Signature Alg.: two orders of magnitude faster than the Hungarian Alg. (1 second for datasets with 6,390 blank nodes), with 34% bigger deltas than the Hungarian Alg.
• In synthetic datasets with directly connected blank nodes
  – Hungarian Alg. yielded from 0 to 3 times smaller deltas than the Signature Alg.
  – The Signature Algorithm was 18 to 57 times faster
• The algorithms provide a delta of 12.7 to 7,294 times smaller than without blank node matching
• The Signature Algorithm requires only 10.5 seconds to match 153,600 blank nodes!

Page 29

Possible Future Research

Several issues are interesting for further research
• Investigation of other special cases where the optimal blank node mapping can be found polynomially
  – Directly connected blank nodes that form graphs of bounded treewidth
• Comparative evaluation of various (probabilistic) signature construction methods and greedy approximation algorithms

Page 30

Thank you for your attention

Work done in the context of SCIDIP-ES, APARSEN and i-Marine

Web system available at: http://www.ics.forth.gr/isl/BNodeDelta