Entity Disambiguation

Entity Disambiguation By Angela Maduko Directed by Amit Sheth


Page 1: Entity Disambiguation

Entity Disambiguation

By Angela Maduko, directed by Amit Sheth

Page 2: Entity Disambiguation

Entity Disambiguation Problem

Emerges mainly when merging information from different sources.

Two major levels:

1. Schema/ontology level: determining the similarity of attributes, concepts, or classes from the different schemas or ontologies to be merged.

2. Instance level: determining which instances of concepts or classes (tuples, in relational databases) refer to the same entity.

Page 3: Entity Disambiguation

Current approaches for both levels:

Feature-based Similarity Approach (FSA), comprising:
    Set-Theory Similarity Approach (STA)
    Information-Theory Similarity Approach (ITA)
    Hybrid Approach (HA)

Relationship-based Similarity Approach (RSA)

Hybrid Similarity Approach (HSA)

Page 4: Entity Disambiguation

ITA

In [1], Lin presents a measure of the similarity between two concepts based on both their commonalities and their differences.

Intuition 1: The similarity between A and B is related to their commonality. The more commonality they share, the more similar they are.

Intuition 2: The similarity between A and B is related to the differences between them. The more differences they have, the less similar they are.

Intuition 3: The maximum similarity between A and B is reached when A and B are identical, no matter how much commonality they share.

Page 5: Entity Disambiguation

ITA

Consider the concept Fruit, where A is an Apple and B is an Orange. What is the commonality of A and B?

common(A, B) = Fruit(A) and Fruit(B)

The commonality between A and B is measured by I(common(A, B)), the amount of information contained in common(A, B),

where the information content of a proposition S is I(S) = -log P(S).

Page 6: Entity Disambiguation

ITA

The difference between A and B is measured by I(description(A, B)) - I(common(A, B)), where description(A, B) is a proposition that describes what A and B are.

Can be applied at both levels 1 and 2.

Intuitively, sim(A, B) = 1 when A and B are exactly alike, and 0 when they share no commonalities.

Proposes

$$\mathrm{sim}(A, B) = \frac{\log P(\mathrm{common}(A, B))}{\log P(\mathrm{description}(A, B))}$$
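As a minimal sketch of this measure, the similarity can be computed as the ratio of log P(common(A, B)) to log P(description(A, B)). The function name and the example probabilities below are illustrative assumptions and do not come from [1].

```python
import math


def lin_similarity(p_common: float, p_description: float) -> float:
    """Ratio of log P(common(A, B)) to log P(description(A, B)).

    Assumes 0 < p_description < 1 and p_description <= p_common <= 1,
    since the full description is at least as specific as the commonality.
    """
    if not (0.0 < p_description < 1.0):
        raise ValueError("p_description must lie strictly between 0 and 1")
    if not (p_description <= p_common <= 1.0):
        raise ValueError("p_common must lie between p_description and 1")
    if p_common == 1.0:
        # The commonality carries no information: nothing is shared.
        return 0.0
    return math.log(p_common) / math.log(p_description)


# Hypothetical probabilities, chosen only to exercise the formula.
print(lin_similarity(0.01, 0.01))  # A and B exactly alike
print(lin_similarity(1.0, 0.01))   # no commonality
print(lin_similarity(0.1, 0.01))   # partial commonality
```

The two boundary cases correspond to the intuition stated on this slide: the ratio is 1 when the commonality is the whole description, and 0 when the commonality is uninformative.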

Page 7: Entity Disambiguation

ITA

In [2], Resnik measures the similarity between two concepts in an is-a taxonomy based on the information content of their most specific common super-concept

Define P(c) as the probability of encountering an instance of a concept c in the taxonomy

For any two concepts c1 and c2, define S(c1, c2) as the set of concepts that subsume both c1 and c2

Proposes

$$\mathrm{sim}(c_1, c_2) = \max_{c \in S(c_1, c_2)} \left[ -\log P(c) \right]$$
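The following sketch illustrates the measure on a toy is-a taxonomy: the similarity of two concepts is the largest information content, -log P(c), among the concepts that subsume both. The taxonomy, the instance counts, and all function names are hypothetical, and P(c) is estimated here by counting the instances of c together with those of its sub-concepts.

```python
import math

# Hypothetical is-a taxonomy, stored as child -> parent (None marks the root).
PARENT = {
    "apple": "fruit",
    "orange": "fruit",
    "fruit": "food",
    "bread": "food",
    "food": None,
}

# Hypothetical number of instances observed directly under each concept.
COUNT = {"apple": 30, "orange": 20, "fruit": 10, "bread": 40, "food": 0}


def subsumers(concept):
    """Return the concept itself followed by all of its super-concepts."""
    chain = []
    while concept is not None:
        chain.append(concept)
        concept = PARENT[concept]
    return chain


def concept_probability(concept):
    """P(c): share of all instances that fall under c or its sub-concepts."""
    total = sum(COUNT.values())
    covered = sum(n for c, n in COUNT.items() if concept in subsumers(c))
    return covered / total


def resnik_similarity(c1, c2):
    """Largest information content among the common subsumers of c1 and c2."""
    shared = set(subsumers(c1)) & set(subsumers(c2))
    return max(-math.log(concept_probability(c)) for c in shared)


print(resnik_similarity("apple", "orange"))  # common subsumer: fruit
print(resnik_similarity("apple", "bread"))   # only the root is shared
```

Because P(c) can only grow toward the root, the maximum is always attained at the most specific common super-concept.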

Page 8: Entity Disambiguation

ITA

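Example counts: 100 instances of concept X, 4 instances of concept Y, 200 instances of concept Z, and 2000 instances of all concepts.

[Figure: taxonomy fragment in which X subsumes A and B, Y subsumes C and D, and Z subsumes E and F]

Similarities to compare: sim(A, B), sim(C, D), sim(A, D), sim(A, E).

The measure gives sim(C, D) > sim(A, B). Should this be so?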
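The comparison can be checked with the counts given on this slide. The working below assumes that X is the most specific common subsumer of A and B, that Y plays the same role for C and D (as the figure suggests), and that natural logarithms are used; the slide does not state the base.

```latex
\begin{align*}
\mathrm{sim}(A, B) &= -\ln P(X) = -\ln\frac{100}{2000} = \ln 20 \approx 3.00 \\
\mathrm{sim}(C, D) &= -\ln P(Y) = -\ln\frac{4}{2000} = \ln 500 \approx 6.21
\end{align*}
```

Under these assumptions the larger value for C and D comes only from Y being rarer than X, not from any property of C and D themselves.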

Page 9: Entity Disambiguation

ITA

Define s(w) as the set of concepts that are word senses of word w. Proposes a measure of word similarity as follows:

$$\mathrm{sim}(w_1, w_2) = \max_{c_1 \in s(w_1),\ c_2 \in s(w_2)} \mathrm{sim}(c_1, c_2)$$

Can be applied at level 1 only.

Example: sim(Doctor, Nurse), where Doctor has the senses medical and PhD, and Nurse has the senses medical and nanny.
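A small sketch of the word-level measure: take the best concept similarity over every pairing of a sense of the first word with a sense of the second. The sense labels follow the Doctor/Nurse example; the numeric concept similarities are invented for illustration and stand in for whatever concept-level measure is used.

```python
from itertools import product

# Word senses, following the Doctor/Nurse example (labels are illustrative).
SENSES = {
    "doctor": {"medical_doctor", "phd_holder"},
    "nurse": {"medical_nurse", "nanny"},
}

# Hypothetical concept-level similarity scores; the numbers are made up.
CONCEPT_SIM = {
    frozenset({"medical_doctor", "medical_nurse"}): 0.9,
    frozenset({"medical_doctor", "nanny"}): 0.3,
    frozenset({"phd_holder", "medical_nurse"}): 0.2,
    frozenset({"phd_holder", "nanny"}): 0.1,
}


def concept_similarity(c1, c2):
    """Look up the assumed similarity of two concepts (0.0 if unlisted)."""
    return CONCEPT_SIM.get(frozenset((c1, c2)), 0.0)


def word_similarity(w1, w2, sim=concept_similarity):
    """Maximum concept similarity over all pairs of senses of w1 and w2."""
    return max(sim(c1, c2) for c1, c2 in product(SENSES[w1], SENSES[w2]))


print(word_similarity("doctor", "nurse"))
```

With these made-up scores the medical senses dominate, so the word similarity is driven by the closest pair of senses rather than by an average over all of them.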

Page 10: Entity Disambiguation

STA

[3] introduces a set-theoretic notion of a matching function F, based on the following assumptions for classes a, b, c with description sets A, B, C respectively:

Matching: $s(a, b) = F(A \cap B,\ A - B,\ B - A)$

Monotonicity: $s(a, b) \ge s(a, c)$ whenever $A \cap B \supseteq A \cap C$, $A - B \subseteq A - C$, and $B - A \subseteq C - A$

Page 11: Entity Disambiguation

STA

Proposes two models.

Contrast model: similarity is defined as an increasing function of common features and a decreasing function of distinctive features (features that apply to one object but not the other).

$$S(a, b) = \theta f(A \cap B) - \alpha f(A - B) - \beta f(B - A), \qquad \theta, \alpha, \beta \ge 0$$

The function f measures the salience of a set of features; f depends on intensity and context factors.

Intensity: physical salience (e.g. physical features).

Context: the salience of features varies with context.
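A minimal sketch of the contrast model: a weighted salience of the shared features minus weighted saliences of the features unique to each object. Using the set size as the salience function f, the default weights, and the two feature sets are all assumptions made for illustration.

```python
def contrast_similarity(a, b, theta=1.0, alpha=0.5, beta=0.5, f=len):
    """Weighted common features minus weighted distinctive features.

    a, b  -- feature sets of the two objects
    theta -- weight of the common features (A intersect B)
    alpha -- weight of the features only in a (A - B)
    beta  -- weight of the features only in b (B - A)
    f     -- salience function; set size is used here as a stand-in
    """
    return theta * f(a & b) - alpha * f(a - b) - beta * f(b - a)


# Hypothetical feature sets.
APPLE = {"fruit", "round", "sweet", "red"}
ORANGE = {"fruit", "round", "sweet", "citrus", "orange_colored"}

print(contrast_similarity(APPLE, ORANGE))

# With unequal weights the measure is no longer symmetric.
print(contrast_similarity(APPLE, ORANGE, alpha=1.0, beta=0.0))
print(contrast_similarity(ORANGE, APPLE, alpha=1.0, beta=0.0))
```

Choosing different values for alpha and beta lets the model express directional judgments, where "a is like b" need not score the same as "b is like a".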

Page 12: Entity Disambiguation

STA

Ratio model:

$$S(a, b) = \frac{f(A \cap B)}{f(A \cap B) + \alpha f(A - B) + \beta f(B - A)}, \qquad \alpha, \beta \ge 0$$

Can be applied at both levels 1 and 2.
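The ratio model can be sketched in the same way: the salience of the common features divided by that salience plus the weighted saliences of the distinctive features. Set size as the salience function f, the default weights, the feature sets, and the value returned for two empty sets are assumptions of this sketch.

```python
def ratio_similarity(a, b, alpha=1.0, beta=1.0, f=len):
    """Salience of common features, normalized to the range [0, 1]."""
    common = f(a & b)
    denominator = common + alpha * f(a - b) + beta * f(b - a)
    if denominator == 0:
        # Neither object has any feature; returning 0.0 is a convention
        # chosen for this sketch, not something the model prescribes.
        return 0.0
    return common / denominator


# Hypothetical feature sets.
APPLE = {"fruit", "round", "sweet", "red"}
ORANGE = {"fruit", "round", "sweet", "citrus", "orange_colored"}

print(ratio_similarity(APPLE, ORANGE))                        # alpha = beta = 1
print(ratio_similarity(APPLE, ORANGE, alpha=0.5, beta=0.5))   # alpha = beta = 1/2
print(ratio_similarity(APPLE, APPLE))                         # identical sets
```

With f as set size, alpha = beta = 1 reduces to the Jaccard coefficient and alpha = beta = 1/2 to the Dice coefficient, which is one reason the normalized form is convenient in practice.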

Page 13: Entity Disambiguation

HA

[7] combines clustering and information-content approaches for entity disambiguation, using the Scalable Information Bottleneck (LIMBO) method.

Attempts to cluster entities in such a way that the clusters are informative about the entities within them.

Model: a set T of n entities (relational tuples) defined on m attributes (A1, A2, …, Am). The domain of attribute Ai is the set Vi = {Vi,1, Vi,2, …, Vi,di}.

Let T and V be two discrete random variables that take values from T and V respectively.

Initially, each entity is assigned to its own cluster, i.e. #clusters = #entities. Let Cq denote this initial clustering; then the mutual information of Cq and T, I(Cq, T), equals the mutual information of V and T, I(V, T).

Page 14: Entity Disambiguation

HA

Assumes that the number of distinct entities, k, is known.

Seeks a clustering Ck of V such that I(Ck, T) remains as large as possible or, equivalently, the information loss I(V, T) - I(Ck, T) is minimal.
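The information-loss criterion can be illustrated with a small sketch that merges objects of a joint distribution into clusters and measures how much mutual information is lost. This is not the LIMBO algorithm of [7]; it only evaluates the loss for a given clustering. Which variable is clustered is an assumption here: the sketch groups entities and measures the information retained about their attribute values, with each entity equally likely and its values equally weighted. The tuples and function names are hypothetical.

```python
import math
from collections import defaultdict


def mutual_information(joint):
    """I(X; Y) in nats for a joint distribution given as {(x, y): p}."""
    px, py = defaultdict(float), defaultdict(float)
    for (x, y), p in joint.items():
        px[x] += p
        py[y] += p
    return sum(
        p * math.log(p / (px[x] * py[y]))
        for (x, y), p in joint.items()
        if p > 0
    )


def cluster_joint(joint, assignment):
    """Merge the x-objects of the joint distribution into their clusters."""
    merged = defaultdict(float)
    for (x, y), p in joint.items():
        merged[(assignment[x], y)] += p
    return dict(merged)


def information_loss(joint, assignment):
    """Mutual information before clustering minus the value after it."""
    clustered = cluster_joint(joint, assignment)
    return mutual_information(joint) - mutual_information(clustered)


def build_joint(entities):
    """Uniform weight per entity, uniform weight per value within an entity."""
    n = len(entities)
    return {
        (t, v): 1.0 / (n * len(values))
        for t, values in entities.items()
        for v in values
    }


# Hypothetical tuples: t1 and t2 share a value, t3 shares nothing with t1.
ENTITIES = {
    "t1": {"J. Smith", "Athens"},
    "t2": {"John Smith", "Athens"},
    "t3": {"A. Jones", "Dayton"},
}
JOINT = build_joint(ENTITIES)

loss_similar = information_loss(JOINT, {"t1": "c1", "t2": "c1", "t3": "c2"})
loss_different = information_loss(JOINT, {"t1": "c1", "t2": "c2", "t3": "c1"})
print(loss_similar, loss_different)
```

Merging the two tuples that share a value loses less information than merging two unrelated tuples, which is the behavior a loss-minimizing clustering exploits.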

Page 15: Entity Disambiguation

HSA

In [8], Kashyap and Sheth introduce the concept of semantic proximity (semPro) between entities to capture their similarity

In addition to context, employs relationships and features of entities in determining their similarity

semPro(O1,O2) = <Context, Abstraction, (D1, D2), (S1, S2)>

Context: the context in which objects O1 and O2 are being compared
Abstraction: the abstraction/mappings relating the domains of the objects
(D1, D2): domain definitions of the objects
(S1, S2): states of the objects

Page 16: Entity Disambiguation

HSA

Abstractions:

Total 1-1 value mapping
Partial many-one mapping
Generalization/specialization
Aggregation
Functional dependencies
ANY
NONE

Page 17: Entity Disambiguation

HSA

Semantic Taxonomy

Defines 5 degrees of similarity between objects:

Semantic Equivalence
Semantic Relationship
Semantic Relevance
Semantic Resemblance
Semantic Incompatibility

Page 18: Entity Disambiguation

HSA

Semantic Equivalence: the strongest measure of semantic proximity.

Two objects are said to be semantically equivalent when they represent the same real-world entity, i.e.

semPro(O1,O2) = <ALL, total 1-1 value mapping, (D1, D2), _> (domain semantic equivalence)

semPro(O1,O2) = <ALL, M, (D1, D2), (S1, S2)>, where M is a total 1-1 value mapping between (D1, S1) and (D2, S2) (state semantic equivalence)

Page 19: Entity Disambiguation

HSA

Semantic Relationship: weaker than semantic equivalence.

semPro(O1,O2) = <ALL, M, (D1, D2), _>, where M is a partial many-one value mapping, a generalization, or an aggregation.

The requirement of a 1-1 mapping is relaxed such that, given an instance of O1, we can identify an instance of O2, but not vice versa.

Page 20: Entity Disambiguation

HSA

Semantic Relevance: two objects are semantically relevant if there exists any mapping between their domains in some context.

semPro(O1,O2) = <SOME, ANY, (D1, D2), _>

Page 21: Entity Disambiguation

HSA

Semantic Resemblance: weakest measure of semantic proximity.

There does not exist any mapping between the domains of the two objects in any context.

The objects have the same roles in some contexts, with coherent definition contexts.

Page 22: Entity Disambiguation

HSA

Semantic Incompatibility: asserts semantic dissimilarity.

Asserts that there is no context and no abstraction in which the domains of the two objects are related.

semPro(O1,O2) = <NONE, NONE, (D1,D2), _>
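The tuple patterns given for the degrees of the semantic taxonomy can be collected into a small lookup. The sketch below is only an illustration of those patterns: the class, field, and function names are hypothetical, contexts and abstractions are represented as plain strings, and semantic resemblance is left out because no tuple pattern is given for it here.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class SemPro:
    """Semantic proximity <Context, Abstraction, (D1, D2), (S1, S2)>."""
    context: str                              # "ALL", "SOME", or "NONE"
    abstraction: str                          # e.g. "generalization", "ANY"
    domains: Tuple[str, str]                  # (D1, D2)
    states: Optional[Tuple[str, str]] = None  # (S1, S2), or None for "_"


# Abstractions that the slides associate with a semantic relationship.
RELATIONSHIP_ABSTRACTIONS = {
    "partial many-one value mapping",
    "generalization",
    "aggregation",
}


def classify(sp: SemPro) -> Optional[str]:
    """Map a semPro tuple to a degree of similarity, or None if unmatched."""
    if sp.context == "ALL" and sp.abstraction == "total 1-1 value mapping":
        return "semantic equivalence"
    if sp.context == "ALL" and sp.abstraction in RELATIONSHIP_ABSTRACTIONS:
        return "semantic relationship"
    if sp.context == "SOME" and sp.abstraction == "ANY":
        return "semantic relevance"
    if sp.context == "NONE" and sp.abstraction == "NONE":
        return "semantic incompatibility"
    return None  # e.g. resemblance, which needs role information


print(classify(SemPro("ALL", "generalization", ("D1", "D2"))))
print(classify(SemPro("NONE", "NONE", ("D1", "D2"))))
```

Ordering the checks from strongest to weakest mirrors the taxonomy, so a tuple is reported at the strongest degree whose pattern it matches.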

Page 23: Entity Disambiguation

HSA

In [9], Cho et al. propose a model derived from the edge-based approach that also employs the information content of the node-based approach. It is based on these observations:

There exists a correlation between similarity and the number of shared parent concepts in a hierarchy.

The link type (hyponymy, meronymy, etc.) indicates the semantic relationship between concepts.

Page 24: Entity Disambiguation

HSA

The conceptual similarity between a node and each of its adjacent child nodes may not be equal.

As depth in the hierarchy increases, the conceptual similarity between a node and its adjacent child node decreases.

The population of nodes is not uniform over the entire ontological structure (links in a dense part of the hierarchy represent less distance than those in a less dense part).

Page 25: Entity Disambiguation

HSA

Proposes

$$S(c_i, c_j) = D(L_{j \to i}) \cdot \sum_{0 \le k \le n} \left[ W(t_k) \cdot d(c_{k+1 \to k}) \cdot f(d) \right] \cdot \max[H(c)]$$

where

f(d) is a function that returns a depth factor (topological location in the hierarchy)
$d(c_{k+1 \to k})$ is a density function
$D(L_{j \to i})$ is a function that returns a distance factor between ci and cj (shortest path from one node to the other)
$W(t_k)$ is a weight function that assigns a weight to each link type ($W(t_k) = 1$ for an is-a link)
H(c) is the information content of the super-concepts of ci and cj

For level 1 only.

Page 26: Entity Disambiguation

References

1. Dekang Lin, An Information-Theoretic Definition of Similarity, Proceedings of the Fifteenth International Conference on Machine Learning, pp. 296-304, 1998.

2. Philip Resnik, Using Information Content to Evaluate Semantic Similarity in a Taxonomy, IJCAI, 1995.

3. Amos Tversky, Features of Similarity, Psychological Review, 84 (4): 327-352, 1977.

4. Debabrata Dey, A Distance-Based Approach to Entity Reconciliation in Heterogeneous Databases, IEEE Transactions on Knowledge and Data Engineering, 14 (3), May/June 2002.

5. Hui Han, Hongyuan Zha and C. Lee Giles, A Model-based K-means Algorithm for Name Disambiguation, in Proceedings of the Second International Semantic Web Conference (ISWC-03) Workshop on Semantic Web Technologies for Searching and Retrieving Scientific Data, 2003.

6. M. Andrea Rodriguez and Max J. Egenhofer, Determining Semantic Similarity Among Entity Classes from Different Ontologies, IEEE Transactions on Knowledge and Data Engineering, 15 (2): 442-456, 2003.

7. Periklis Andritsos, Renee J. Miller and Panayiotis Tsaparas, Information-Theoretic Tools for Mining Database Structure from Large Data Sets, SIGMOD Conference 2004: 731-742.

8. Vipul Kashyap and Amit Sheth, Semantic and Schematic Similarities between Database Objects: A Context-based Approach, VLDB Journal, 5 (4): 276-304, 1996.

9. Miyoung Cho, Junho Choi and Pankoo Kim, An Efficient Computational Method for Measuring Similarity between Two Conceptual Entities, WAIM 2003: 381-388.