Similarity Flooding

Yishai Beeri, Similarity Flooding, SDBI – Winter 2001


Page 1: Similarity Flooding

Similarity Flooding

A Versatile Graph Matching Algorithm

by Sergey Melnik, Hector Garcia-Molina, Erhard Rahm

Page 2: Similarity Flooding

Introduction & Motivation

• Goal: matching elements of related, complex objects
• Matching elements of two data schemas
• Matching elements of two data instances
• Many conceivable uses for object matching
• Looking for a generic algorithm with wide applicability

Page 3: Similarity Flooding

Applications

• Comparing data schemas:
– Items from different shopping sites
– Merger between two corporations
– Preparation of data for data warehousing and analysis processes
• Comparing data instances:
– Bio-informatics
– Collaboration: allowing multiple users to edit a program / system

Page 4: Similarity Flooding

Existing Approaches

• Comparing SQL: can use type information
• Comparing XML: can use hierarchy

Requires domain-specific knowledge and coding

Solution:
• Generic algorithm that is agnostic to domain
• Structural model – relies on structural similarities to find a matching

Page 5: Similarity Flooding


Part I: Algorithm Framework

General Discussion of Algorithm Input, Output, and Main Components

Page 6: Similarity Flooding

Algorithm Framework

• Input: two objects to match
• Representation of objects as graphs: G1 = (V1, E1), G2 = (V2, E2)
• Matching between graphs gives a mapping σ: V1 × V2 → ℝ
• Filtering of mapping to obtain meaningful match
• Output: mapping between elements of input objects

Human verification sometimes required

Page 7: Similarity Flooding

Input → Graph → Mapping → Filtering

• The input is two objects to be matched
• The match is between sub-elements of the two objects
• Each match of sub-elements is scored; high scores indicate a strong similarity
• Assumption: objects can be represented as graphs

Page 8: Similarity Flooding

Input → Graph → Mapping → Filtering

• Represent objects as directed, labeled graphs
• Choose any sensible graph representation (this is domain-specific) that maintains structural information
• Structural information in graphs will be used for mapping
• Intuition: similar elements have similar neighbors

G1 = (V1, E1), G2 = (V2, E2)

Page 9: Similarity Flooding

Input → Graph → Mapping → Filtering

• We want a mapping σ: V1 × V2 → ℝ
• Convenient to normalize such that 0 ≤ σ(v,u) ≤ 1
• Begin with an initial mapping function σ^0:
– Null function: σ^0(v,u) := 1 for all v in V1, u in V2
– String matching function
– Other domain-specific function
• Perform an iterative fixpoint calculation. Each iteration floods the similarity value σ(v,u) to the neighbors of v and u

Page 10: Similarity Flooding

Input → Graph → Mapping → Filtering

• We have a mapping σ: V1 × V2 → ℝ
• We are usually not interested in all pairs in V1 × V2
• Applying filtering functions yields a partial mapping:
– Threshold (keep a pair only when σ(v,u) > some constant)
– Wedding (each v mapped to only one u and vice versa)
• Result is a useful mapping that matches elements of V1 with elements of V2
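The threshold filter can be sketched in a few lines of Python. This is only an illustration: the function name `select_threshold`, the dictionary representation of the mapping, and the sample scores are assumptions made for the example, not part of the original algorithm description.

```python
def select_threshold(sigma, threshold):
    """Keep only the node pairs whose similarity exceeds the threshold.

    sigma maps (v, u) pairs to similarity scores (a hypothetical representation).
    """
    return {pair: score for pair, score in sigma.items() if score > threshold}


# Illustrative scores only.
sigma = {
    ("Pno", "EmpNo"): 0.25,
    ("Pno", "DeptName"): 0.02,
    ("Born", "Birthdate"): 0.17,
}
filtered = select_threshold(sigma, 0.1)
```

The result is a partial mapping: pairs below the constant are dropped, and nothing is said yet about how many partners each element may keep.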

Page 11: Similarity Flooding


Part II: An Example - Relational Schemas

An Example Employing the Algorithm to Match Two Simple Relational Schemas

Page 12: Similarity Flooding


Example: Relational Schemas

• Scenario: two relational schemas that describe similar or same data

• Goal: match elements of two given relational schemas

• Input: SQL statements for creating each schema
• Desired output: a meaningful mapping between the elements of the two schemas

Page 13: Similarity Flooding

Example: Relational Schemas

Input → Graph → Mapping → Filtering

S1:

    CREATE TABLE Personnel (
      Pno int,
      Pname string,
      Dept string,
      Born date,
      UNIQUE perskey(Pno)
    )

S2:

    CREATE TABLE Employee (
      EmpNo int PRIMARY KEY,
      EmpName varchar(50),
      DeptNo int REFERENCES Department,
      Salary dec(15,2),
      Birthdate date
    )
    CREATE TABLE Department (
      DeptNo int PRIMARY KEY,
      DeptName varchar(70)
    )

Page 14: Similarity Flooding


Example: Relational Schemas

Algorithm script:

G1 = SQLDDL2Graph(S1);

G2 = SQLDDL2Graph(S2);

initialMap = StringMatch(G1, G2);

product = SFJoin(G1, G2, initialMap);

result = SelectThreshold(product)

Page 15: Similarity Flooding

Example: Relational Schemas

Input → Graph → Mapping → Filtering

• Any graph representation of schemas can be chosen

• Representation should maintain as much information as possible, in particular structural information

• Example uses a graph representation based on the Open Information Model (OIM)

Page 16: Similarity Flooding

Example: Relational Schemas

Input → Graph → Mapping → Filtering

Page 17: Similarity Flooding

Example: Relational Schemas

Input → Graph → Mapping → Filtering

• Calculate an initial mapping to improve performance
• Initial mapping can apply domain knowledge
• In this example, StringMatch is used:
– Compares common prefixes and suffixes of literals
– Assumes elements with similar names have similar meaning
– Applies to all elements, including elements that are created by the graph representation (e.g. ‘type’)
• Initial mapping is still far from satisfactory
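The slides only say that StringMatch compares common prefixes and suffixes. The sketch below is one possible reading of that idea, not the authors' implementation: the function name `string_match`, the lowercasing, and the normalization by the longer name are all assumptions.

```python
import os


def string_match(a, b):
    """Score two names by the length of their shared prefix and suffix.

    Hypothetical scoring: shared characters divided by the longer length,
    so the result lies in [0, 1].
    """
    a, b = a.lower(), b.lower()
    if not a or not b:
        return 0.0
    prefix = len(os.path.commonprefix([a, b]))
    suffix = len(os.path.commonprefix([a[::-1], b[::-1]]))
    # Cap the count so that identical strings are not counted twice.
    shared = min(prefix + suffix, len(a), len(b))
    return shared / max(len(a), len(b))


score_same = string_match("Column", "Column")
score_dept = string_match("Dept", "DeptNo")
score_none = string_match("abc", "xyz")
```

A function of this kind is purely lexical, which is why it also scores helper nodes introduced by the graph representation and why its output still needs the structural propagation step.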

Page 18: Similarity Flooding

Example: Relational Schemas

Input → Graph → Mapping → Filtering

Top values of similarity mapping after StringMatch

    Similarity  Node in G1   Node in G2
    1.0         Column       Column
    0.66        ColumnType   Column
    0.66        ‘Dept’       ‘DeptNo’
    0.66        ‘Dept’       ‘DeptName’
    0.5         UniqueKey    PrimaryKey
    0.26        ‘Pname’      ‘DeptName’
    0.26        ‘Pname’      ‘EmpName’
    0.22        ‘date’       ‘BirthDate’
    0.11        ‘Dept’       ‘Department’
    0.06        ‘int’        ‘Department’

Page 19: Similarity Flooding

Example: Relational Schemas

Input → Graph → Mapping → Filtering

• Next step: similarity flooding (SFJoin)
• Initial similarity values are taken from the initial mapping
• In each iteration, the similarity of two elements affects the similarity of their respective neighbors (e.g. similarity of type names such as ‘string’ adds to the similarity of columns of the same type)
• Iterate until similarity values are stable

Page 20: Similarity Flooding

Example: Relational Schemas

Input → Graph → Mapping → Filtering

• After the fixpoint calculation, the mapping is filtered to provide a meaningful mapping
• The filter operator SelectThreshold removes node pairs for which σ(u,v) < some constant
• In this example, the mapping product contained 211 node pairs with positive similarities, which were filtered to a total of 12 node pairs

Page 21: Similarity Flooding

Example: Relational Schemas

Similarity mapping after SelectThreshold

    Similarity  Node in G1            Node in G2
    1.0         Column                Column
    0.81        Personnel*            Employee*
    0.66        ColType               ColType
    0.44        int**                 int**
    0.43        Table                 Table
    0.35        date**                date**
    0.29        UniqueKey: perskey    PrimaryKey: on EmpNo
    0.28        Personnel / Dept+     Department / DeptName+
    0.25        Personnel / Pno+      Employee / EmpNo+
    0.19        UniqueKey             PrimaryKey
    0.18        Personnel / Pname+    Employee / EmpName+
    0.17        Personnel / Born+     Employee / Birthdate+

    * Table   ** SQL column type   + Column

Page 22: Similarity Flooding

Example: Relational Schemas

Summary of example:
• Good results without domain-specific knowledge
• Graph representation may vary
• Similarity flooding results need to be filtered

Page 23: Similarity Flooding


Part III: Similarity Flooding Calculation

Details of the Similarity Flooding Calculation Algorithm

Page 24: Similarity Flooding

Similarity Flooding Calculation

• Start with directed, labeled graphs A, B
• Every edge e in a graph is represented by a triplet (s,p,o): an edge labeled p from s to o
• Define the pairwise connectivity graph PCG(A, B):

$$((x, y), p, (x', y')) \in PCG(A, B) \iff (x, p, x') \in A \text{ and } (y, p, y') \in B$$
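The PCG definition translates almost directly into a set comprehension. The sketch below assumes each graph is given as a set of (source, label, target) triples; the function name and the two toy graphs are illustrative choices.

```python
def pairwise_connectivity_graph(A, B):
    """Build PCG(A, B) from two sets of (source, label, target) triples.

    An edge ((x, y), p, (x2, y2)) is created whenever A has an edge
    (x, p, x2) and B has an edge (y, p, y2) with the same label p.
    """
    return {
        ((x, y), p, (x2, y2))
        for (x, p, x2) in A
        for (y, q, y2) in B
        if p == q
    }


# Toy graphs, assumed for illustration.
A = {("a", "l1", "a1"), ("a", "l1", "a2"), ("a1", "l2", "a2")}
B = {("b", "l1", "b1"), ("b", "l2", "b2"), ("b2", "l2", "b1")}
pcg = pairwise_connectivity_graph(A, B)
```

Each PCG node is a candidate pair (x, y), and an edge between two pairs records that similarity of one pair is structural evidence for the other.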

Page 25: Similarity Flooding


Similarity Flooding Calculation

Pairwise Connectivity Graph – Example

Page 26: Similarity Flooding


Similarity Flooding Calculation

• Induced Propagation Graph: add edges in opposite direction

• Edge weights: propagation coefficients. They measure how the similarity propagates to neighbors

• One way to calculate weights: each edge type (label) contributes a total of 1.0 outgoing propagation
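One way to read the weighting rule above is sketched here: every PCG edge gets a reverse edge, and for each node the weight 1.0 is split equally among its outgoing edges with the same label. The function name, the dictionary of weights, and the hard-coded toy PCG are assumptions for the example.

```python
from collections import Counter


def propagation_graph(pcg):
    """Return {(source_pair, target_pair): weight} for the induced propagation graph."""
    out_count = Counter((src, label) for src, label, _ in pcg)
    in_count = Counter((dst, label) for _, label, dst in pcg)
    weights = {}
    for src, label, dst in pcg:
        # Forward edge: 1.0 shared among src's outgoing edges with this label.
        fwd = 1.0 / out_count[(src, label)]
        # Reverse edge: 1.0 shared among dst's incoming edges with this label.
        bwd = 1.0 / in_count[(dst, label)]
        weights[(src, dst)] = weights.get((src, dst), 0.0) + fwd
        weights[(dst, src)] = weights.get((dst, src), 0.0) + bwd
    return weights


# Toy PCG with four labeled edges, assumed for illustration.
pcg = {
    (("a", "b"), "l1", ("a1", "b1")),
    (("a", "b"), "l1", ("a2", "b1")),
    (("a1", "b"), "l2", ("a2", "b2")),
    (("a1", "b2"), "l2", ("a2", "b1")),
}
weights = propagation_graph(pcg)
```

In this toy graph the pair (a, b) has two outgoing l1 edges, so each carries 0.5, while the reverse edges back into (a, b) each carry 1.0.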

Page 27: Similarity Flooding


Similarity Flooding Calculation

Induced Propagation Graph – Example

Page 28: Similarity Flooding

Similarity Flooding Calculation

• Similarity measure σ(x,y) ≥ 0 for all x ∈ A and y ∈ B. We also call σ a “mapping”
• Iterative computation of σ, with propagation in each iteration
– σ^i is the mapping after the i’th iteration
– σ^0 is the initial mapping
• Each iteration computes σ^i based on σ^(i-1) and the propagation graph
• Stop when a stable mapping is reached

Page 29: Similarity Flooding

Similarity Flooding Calculation

$$\varphi(\sigma^i)(x, y) := \sum_{(a_u, p, x) \in A,\ (b_u, p, y) \in B} \sigma^i(a_u, b_u) \cdot w\big((a_u, b_u), (x, y)\big) \;+\; \sum_{(x, p, a_v) \in A,\ (y, p, b_v) \in B} \sigma^i(a_v, b_v) \cdot w\big((a_v, b_v), (x, y)\big)$$

The propagation φ(σ^i) for the similarity of x and y is the sum of the similarities of all neighboring pairs, each multiplied by the corresponding propagation coefficient w

Page 30: Similarity Flooding

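Similarity Flooding Calculation

• Many ways to iterate:

Basic: $\sigma^{i+1} = \mathrm{normalize}(\sigma^i + \varphi(\sigma^i))$

A: $\sigma^{i+1} = \mathrm{normalize}(\sigma^0 + \varphi(\sigma^i))$

B: $\sigma^{i+1} = \mathrm{normalize}(\varphi(\sigma^0 + \sigma^i))$

C: $\sigma^{i+1} = \mathrm{normalize}(\sigma^0 + \sigma^i + \varphi(\sigma^0 + \sigma^i))$

• Choice will aim to achieve high quality and fast convergence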
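The four iteration formulas can be sketched as follows. The code assumes the propagation graph is a dictionary of weighted edges between node pairs, normalizes by the largest value, and stops when the Euclidean change between iterations drops below a small `eps`; these choices, the helper names, and the toy data are assumptions rather than details given in the slides.

```python
def phi(sigma, weights):
    """Propagate each pair's similarity along its weighted outgoing edges."""
    out = {}
    for (src, dst), w in weights.items():
        out[dst] = out.get(dst, 0.0) + sigma.get(src, 0.0) * w
    return out


def add(*maps):
    """Pointwise sum of similarity mappings."""
    total = {}
    for m in maps:
        for key, value in m.items():
            total[key] = total.get(key, 0.0) + value
    return total


def normalize(sigma):
    """Divide by the largest value so that scores lie in [0, 1]."""
    top = max(sigma.values(), default=0.0)
    return {k: v / top for k, v in sigma.items()} if top > 0 else dict(sigma)


STEP = {
    "basic": lambda s0, s, w: add(s, phi(s, w)),
    "A": lambda s0, s, w: add(s0, phi(s, w)),
    "B": lambda s0, s, w: phi(add(s0, s), w),
    "C": lambda s0, s, w: add(s0, s, phi(add(s0, s), w)),
}


def flood(sigma0, weights, variant="basic", max_iter=100, eps=1e-6):
    """Iterate the chosen formula until the mapping is (nearly) stable."""
    sigma = dict(sigma0)
    for _ in range(max_iter):
        nxt = normalize(STEP[variant](sigma0, sigma, weights))
        keys = set(nxt) | set(sigma)
        delta = sum((nxt.get(k, 0.0) - sigma.get(k, 0.0)) ** 2 for k in keys) ** 0.5
        sigma = nxt
        if delta < eps:
            break
    return sigma


# Toy propagation graph and null initial mapping, assumed for illustration.
weights = {
    (("a", "b"), ("a1", "b1")): 0.5, (("a1", "b1"), ("a", "b")): 1.0,
    (("a", "b"), ("a2", "b1")): 0.5, (("a2", "b1"), ("a", "b")): 1.0,
    (("a1", "b"), ("a2", "b2")): 1.0, (("a2", "b2"), ("a1", "b")): 1.0,
    (("a1", "b2"), ("a2", "b1")): 1.0, (("a2", "b1"), ("a1", "b2")): 1.0,
}
sigma0 = {pair: 1.0 for edge in weights for pair in edge}
results = {name: flood(sigma0, weights, variant=name) for name in STEP}
```

Keeping the update rule as a swappable function makes it easy to compare how strongly each variant holds on to the initial mapping.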

Page 31: Similarity Flooding

Similarity Flooding Calculation

• Basic: each iteration propagates from neighbors; the initial mapping has a diminishing effect
• A: the initial mapping has high importance; propagation has a diminishing effect

Basic: $\sigma^{i+1} = \mathrm{normalize}(\sigma^i + \varphi(\sigma^i))$

A: $\sigma^{i+1} = \mathrm{normalize}(\sigma^0 + \varphi(\sigma^i))$

Page 32: Similarity Flooding

Similarity Flooding Calculation

• B: the initial mapping has high importance, recurring in propagation
• C: the initial mapping and the current mapping have identical importance

B: $\sigma^{i+1} = \mathrm{normalize}(\varphi(\sigma^0 + \sigma^i))$

C: $\sigma^{i+1} = \mathrm{normalize}(\sigma^0 + \sigma^i + \varphi(\sigma^0 + \sigma^i))$

Page 33: Similarity Flooding


Part IV: Filtering

Overview of Various Approaches to Filtering of SF Mapping

Page 34: Similarity Flooding


Filtering

• Result of iterations is a mapping between all pairs in V1 and V2. We usually want much less information!

• Filtering will remove pairs, leaving us with only the interesting ones

• There are many ways to filter. Filter choice is domain-specific

Page 35: Similarity Flooding

Filtering

Possible filtering directions:
• Remove uninteresting pairs according to domain-specific knowledge (e.g. ‘column’, ‘table’, ‘string’ from SQL matches) and typing information
• Cardinality considerations: do we want a 1:1 mapping? An n:m mapping?
• Threshold: remove matches with low scores

Page 36: Similarity Flooding

Filtering: Cardinality

Cardinality-based filters can use techniques from bipartite graph (“marriage”) problems:

• Stable marriage
• Assignment problem: maximize Σ σ(x,y)
• Maximum mapping: max. number of 1:1 matches
• Maximal mapping: not contained in any other mapping
• Perfect/Complete: all are “married”

All of the above give [0,1]:[0,1] (monogamous) matches, and can be found in polynomial time
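Two of these filters can be contrasted on a tiny example. The sketch assumes both sides rank partners by the same σ values, in which case picking pairs greedily by descending score yields a stable marriage; the assignment filter is shown by brute force, which is only workable for very small inputs. Function names and scores are illustrative.

```python
from itertools import permutations


def greedy_stable_marriage(sigma):
    """Pick pairs by descending score, skipping elements already matched."""
    used_left, used_right, result = set(), set(), {}
    for (x, y), score in sorted(sigma.items(), key=lambda kv: kv[1], reverse=True):
        if x not in used_left and y not in used_right:
            result[(x, y)] = score
            used_left.add(x)
            used_right.add(y)
    return result


def best_assignment(sigma, left, right):
    """Brute-force 1:1 assignment maximizing the total similarity.

    Assumes len(left) <= len(right); exponential, for illustration only.
    """
    best = max(
        permutations(right, len(left)),
        key=lambda perm: sum(sigma.get((x, y), 0.0) for x, y in zip(left, perm)),
    )
    return dict(zip(left, best))


# Illustrative scores.
sigma = {("a", "x"): 1.0, ("a", "y"): 0.81, ("b", "x"): 0.54, ("b", "y"): 0.27}
stable = greedy_stable_marriage(sigma)
assigned = best_assignment(sigma, ["a", "b"], ["x", "y"])
```

Here the stable marriage keeps the single best pair (a, x) and leaves b with y, while the assignment filter prefers (a, y) and (b, x) because their total is higher. The two criteria can therefore disagree on the same scores.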

Page 37: Similarity Flooding

Filtering: Relative Similarity

• σ(x,y) is the absolute similarity of x and y
• We can also define a relative similarity:

$$\sigma^{\max}_x := \max_{y \in B} \sigma(x, y), \qquad \sigma_{rel}(x, y) := \frac{\sigma(x, y)}{\sigma^{\max}_x}$$

• Relative similarity is directed. The reverse direction is defined in an analogous manner
• Bipartite graph methods can also handle directed graphs

Page 38: Similarity Flooding

Filtering: Threshold

• Threshold can be applied to absolute or relative similarities
• A useful example: a threshold of t_rel = 1.0 gives a perfectionist egalitarian polygamy – i.e. no man/woman is willing to accept any but the best match
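Relative similarity and the t_rel = 1.0 threshold can be sketched together. The code below reads the threshold as applying in both directions, so a pair survives only if each element is the other's best candidate; that reading, the function names, and the sample scores are assumptions.

```python
def relative_similarity(sigma, side=0):
    """Divide each score by the best score of the pair's left (side=0)
    or right (side=1) element."""
    best = {}
    for pair, score in sigma.items():
        best[pair[side]] = max(best.get(pair[side], 0.0), score)
    return {
        pair: (score / best[pair[side]] if best[pair[side]] > 0 else 0.0)
        for pair, score in sigma.items()
    }


def select_best_both_ways(sigma, t_rel=1.0):
    """Keep pairs whose relative similarity reaches t_rel in both directions."""
    left = relative_similarity(sigma, side=0)
    right = relative_similarity(sigma, side=1)
    return {
        pair: score
        for pair, score in sigma.items()
        if left[pair] >= t_rel and right[pair] >= t_rel
    }


# Illustrative scores.
sigma = {("x1", "y1"): 0.8, ("x1", "y2"): 0.4, ("x2", "y1"): 0.2, ("x2", "y2"): 0.2}
rel = relative_similarity(sigma)
kept = select_best_both_ways(sigma)
```

With ties, an element may keep several partners, which is where the "polygamy" in the slide's wording comes from.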

Page 39: Similarity Flooding


Part V: Examples

Examples of Algorithm Application to Various Problems

Page 40: Similarity Flooding

Example: Change Detection

• Goal: change detection in two labeled trees
• Original tree T1 was changed to give T2:
– Node names were replaced
– Subtrees were copied and moved
– New node was inserted
• We want the best match for every node of T2
– Cardinality constraint: [0,n] – [1,1]

Page 41: Similarity Flooding

Example: Change Detection

Algorithm script:

product = SFJoin(T2, T1);

result = SelectLeft(product);

Page 42: Similarity Flooding

Example: Change Detection

• No initial mapping
• SelectLeft operator selects the best absolute match for each element in the left argument
• Results can also provide hints on the type of change that was performed!
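A SelectLeft-style filter might look like the sketch below, which keeps, for every left element, the partner with the highest absolute similarity. The function name, the tie-breaking (first pair seen wins), and the sample scores are assumptions.

```python
def select_left(sigma):
    """For each left element, keep only its highest-scoring partner."""
    best = {}
    for (x, y), score in sigma.items():
        if x not in best or score > best[x][1]:
            best[x] = (y, score)
    return {(x, y): score for x, (y, score) in best.items()}


# Illustrative scores: n1..n3 stand for nodes of T2, m1..m2 for nodes of T1.
sigma = {
    ("n1", "m1"): 0.9,
    ("n1", "m2"): 0.3,
    ("n2", "m2"): 0.6,
    ("n3", "m2"): 0.4,
}
result = select_left(sigma)
```

Every left node gets exactly one match, while a right node such as m2 may be matched several times or not at all, in line with the [0,n] – [1,1] cardinality constraint.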

Page 43: Similarity Flooding


Example: Change Detection

Page 44: Similarity Flooding


Example: Matching Schemas Using Instance Data

• Goal: match two XML Schemas using instance data

• Two XML product descriptions from two shopping websites

• We want to use the instance data to match the XML schemas

Page 45: Similarity Flooding


Example: Matching Schemas Using Instance Data

Page 46: Similarity Flooding

Example: Matching Schemas Using Instance Data

Algorithm script:

G1 = XML2DOMGraph(db1);

G2 = XML2DOMGraph(db2);

initialMap = StringMatch(G1, G2);

product = SFJoin(G1, G2, initialMap);

result = XMLMapFilter(product, G1, G2)

• Only new piece of code is the XMLMapFilter operator

Page 47: Similarity Flooding


Example: Schemas, Instance Data

Page 48: Similarity Flooding


Part VI: Analysis

Match Quality, Algorithm Complexity, Convergence and Limitations

Page 49: Similarity Flooding

Match Quality

• Assessing match quality is difficult
• Human verification and tuning of matching is often required
• A useful metric would be to measure the amount of human work required to reach the perfect match
• Recall: how many of the good matches did we show?
• Precision: how many of the matches we show are good?
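Precision and recall for a proposed mapping can be computed against a hand-made reference as sketched below. The function name and the sample pairs are assumptions; the "intended" set stands for whatever a human judge considers the correct match.

```python
def precision_recall(proposed, intended):
    """Return (precision, recall) of a proposed match against a reference match."""
    proposed, intended = set(proposed), set(intended)
    correct = proposed & intended
    precision = len(correct) / len(proposed) if proposed else 0.0
    recall = len(correct) / len(intended) if intended else 0.0
    return precision, recall


# Illustrative pairs only.
intended = {("Pno", "EmpNo"), ("Pname", "EmpName"), ("Born", "Birthdate"),
            ("Dept", "DeptName"), ("Personnel", "Employee"), ("perskey", "EmpNo-key")}
proposed = {("Pno", "EmpNo"), ("Pname", "EmpName"), ("Born", "Birthdate"),
            ("Dept", "Salary")}
precision, recall = precision_recall(proposed, intended)
```

In this sample three of the four proposed pairs are correct and three of the six intended pairs were found.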

Page 50: Similarity Flooding

Convergence

• Fixpoint iterations are an eigenvector computation for the matrix that corresponds to the propagation graph
• Computation converges iff the graph is strongly connected
• To achieve this we use dampening: use σ^0 in the fixpoint formula, where σ^0(x,y) > 0 for all x,y
• Convergence rate depends on the spectral radius of the matrix, and can be improved by high dampening values

Page 51: Similarity Flooding

Convergence

• In many cases we are only interested in the order of map pairs, and not the absolute values of σ
• The order usually stabilizes before the actual values do

Page 52: Similarity Flooding

Complexity

• Usually 5-30 iterations
• Each iteration is O(|E|) (edges in propagation graph)
• |E| = O(|E1|·|E2|)
• |E1| = O(|V1|²) – if G1 is highly connected
• |E2| = O(|V2|²) – if G2 is highly connected
• Worst case of each iteration is O(|V1|²·|V2|²)
• Average case of each iteration is O(|V1|·|V2|)

Page 53: Similarity Flooding

Limitations

• Algorithm requires representation as a directed, labeled graph
– Degrades when edges are unlabeled or undirected
– Degrades when labeling is more uniform
• Assumes structural adjacency contributes to similarity
– Will not work for matching HTML
• Requires matched objects to be of the same type and with the same graph representation

Page 54: Similarity Flooding

Limitations

• Algorithm cannot utilize order and aggregation information (e.g. for XML)
– Order: the order of sub-elements within an element
– Aggregation: an element containing an “array” of sub-elements

Page 55: Similarity Flooding


Part VII: Variability and Applications

Discussion of Algorithm Variability Areas and Possible Applications

Page 56: Similarity Flooding

Variability in Algorithm

• Graph representation of input objects
• Calculation of propagation coefficients
• Initial mapping function
• Iteration formula
• Filtering function

Page 57: Similarity Flooding

Graph Representation

• Graph representation of input objects is arbitrary; sub-elements can be modeled as nodes, edges, or both
• On one hand:
– Richer graph captures more structural information
– Type information about sub-elements can be modeled
• On the other hand:
– Larger graphs mean longer computation
– Rich graph often implies more uniform labeling

Page 58: Similarity Flooding

Propagation Coefficients

• Propagation coefficients can be calculated in many ways:
– Sum of all outgoing edges is 1.0
– Equal weight (1.0) for all edges
– Sum of all outgoing edges of label ‘p’ is 1.0
– Sum of all incoming edges is 1.0
– Label-specific weight allocation
– Etc.

Page 59: Similarity Flooding


Initial Mapping Function

• Initial mapping can improve performance and help convergence

• Initial mapping function can be naïve, or it can employ domain-specific knowledge

Page 60: Similarity Flooding

Iteration Formula

• Each iteration calculates σ^(i+1) from σ^i, σ^0, and φ(σ^i)
• Iteration formula can vary, giving different weight and effect to these components
– Example: if the initial mapping is good, give higher weight to σ^0
• Formula affects convergence speed as well as the resultant mapping

Page 61: Similarity Flooding

Filtering Function

• Results of iterations require filtering to become a meaningful mapping
• Many approaches to filtering are possible, as discussed
• Choice usually stems from graph representation and specific goal. For example:
– If graphs contain many type-related nodes, they can be pruned from results
– If goal is to detect changes, we want a match for each element of the newer object

Page 62: Similarity Flooding

Applications

There are many possible applications besides the ones described:

• Comparing websites
– Old vs. new versions of website
– Two websites with information about same subject
– Structural information gained from containment and links

Page 63: Similarity Flooding

Applications

• Natural language processing and speech recognition:
– Match given sentence to XML template
– Match two text segments that refer to the same subject
• Finding self-similarities and related data items by running SFJoin(G,G)
• Preparation of data and schemas for data warehousing and data mining
– Canonization of data and meta-data

Page 64: Similarity Flooding

Semantic Interpretation - Example

For example (1st approach), the user utterance:

"I would like a medium coca cola and a large pizza with pepperoni and mushrooms."

could be converted to the following semantic result:

    {
      drink: {
        beverage: "coke"
        drinksize: "medium"
      }
      pizza: {
        pizzasize: "large"
        topping: [ "pepperoni", "mushrooms" ]
      }
    }

Page 65: Similarity Flooding


Applications

• More…

Page 66: Similarity Flooding

Summary

• Generic algorithm – with many applications
• Relies on structural information captured in graph representation
• Domain-specific customizations can improve performance and match quality
• Useful but does not deliver 100% exact results; human verification often required