Filtering Inaccurate Entity Co-references on the Linked Open Data


Transcript of Filtering Inaccurate Entity Co-references on the Linked Open Data

Page 1: Filtering Inaccurate Entity Co-references on the Linked Open Data

NEXT: Background

SCID: Semantic Co-reference Inaccuracy Detection - [ INTRODUCTION ]

Filtering Inaccurate Entity Co-references on the Linked Open Data

John Cuzzola, Jelena Jovanovic, Ebrahim Bagheri
[email protected]

DEXA 2015

Page 2: Filtering Inaccurate Entity Co-references on the Linked Open Data

The Linked Open Data (LOD) cloud comprises hundreds of datasets available throughout the Web.

❖ 570 datasets and 2,909 linkage relationships between the datasets.¹

1. http://data.dws.informatik.uni-mannheim.de/lodcloud/2014/

NEXT: How are datasets linked?

SCID: Semantic Co-reference Inaccuracy Detection - [ BACKGROUND ]

Page 3: Filtering Inaccurate Entity Co-references on the Linked Open Data

To utilize the data from multiple ontologies within the LOD, “equivalence” relationships between concepts are necessary (i.e., the “edges” or linkages of the LOD must be defined).

[Diagram: the LOD cloud of 570 datasets and 2,909 linkage relationships, with a “?” over one of the links.]

Equivalence relationships between DBpedia and Freebase?

NEXT: The sameAs predicate

SCID: Semantic Co-reference Inaccuracy Detection - [ BACKGROUND ]

Page 4: Filtering Inaccurate Entity Co-references on the Linked Open Data

The equivalence relationship is often expressed via the predicate owl:sameAs:

http://rdf.freebase.com/ns/en.dog <owl:sameAs> http://dbpedia.org/resource/Dog
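For illustration, a minimal Python sketch (using rdflib, not from the paper) of asserting this identity link as an RDF triple:

```python
from rdflib import Graph, URIRef
from rdflib.namespace import OWL

g = Graph()

# The two resources from the example above.
freebase_dog = URIRef("http://rdf.freebase.com/ns/en.dog")
dbpedia_dog = URIRef("http://dbpedia.org/resource/Dog")

# owl:sameAs asserts that both URIs denote the same real-world entity.
g.add((freebase_dog, OWL.sameAs, dbpedia_dog))

print(g.serialize(format="turtle"))
```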

NEXT: sameAs linkage mistakes

SCID: Semantic Co-reference Inaccuracy Detection - [ BACKGROUND ]

Page 5: Filtering Inaccurate Entity Co-references on the Linked Open Data

http://rdf.freebase.com/ns/en.bitch <owl:sameAs> http://dbpedia.org/resource/Dog

NOT the same!

X

ns:common.topic.description: "Bitch, literally meaning a female dog, is a common slang term in the English language, especially used as a denigrating term applied to a person, commonly a woman”

dbo:abstract: "The domestic dog (Canis lupus familiaris) is a usually furry, carnivorous member of the Canidae family."

The Problem: There are many incorrect LOD linkages using owl:sameAs.

The Effect: Incorrect (embarrassing) assertions by reasoners that use the LOD.

Example: (from http://www.sameas.org)

NEXT: SCID

SCID: Semantic Co-reference Inaccuracy Detection - [ PROBLEM / MOTIVATION ]

Page 6: Filtering Inaccurate Entity Co-references on the Linked Open Data

SCID: Semantic Co-reference Inaccuracy Detection

❖ A method of natural language analysis for detecting incorrect owl:sameAs assertions.

1. Construct a baseline comparison vector vb(x,Sx).

2. For each resource (1,2,...) claiming to be the “same”, construct vectors v1(x1,Sx), v2(x2,Sx) …

3. Compare individual distances from v1(x1,Sx), v2(x2,Sx) … to baseline vb(x,Sx)

4. Disregard those v1(x1,Sx), v2(x2,Sx) … that are outside some threshold distance δ.
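A minimal Python sketch of this four-step loop; all names here (rho, select_categories, describe, pcc, delta) are illustrative stand-ins for the functions defined on the following slides, not the paper's actual code:

```python
from typing import Callable

def scid_filter(candidate_uri: str,
                sameas_uris: list[str],
                baseline_text: str,
                rho: Callable,                # category distribution function ρ(t, S)
                select_categories: Callable,  # category selection function S(uri)
                describe: Callable,           # natural-language text for a resource
                pcc: Callable,                # vector similarity (Pearson correlation)
                delta: float = 0.5) -> list[str]:
    """Keep only the sameAs resources whose distribution vector stays
    within the threshold distance δ of the baseline vector."""
    S = select_categories(candidate_uri)   # shared category set S_x
    v_b = rho(baseline_text, S)            # step 1: baseline vector v_b
    kept = []
    for uri in sameas_uris:                # step 2: one vector per resource
        v_i = rho(describe(uri), S)
        if pcc(v_b, v_i) >= delta:         # step 3: distance to baseline
            kept.append(uri)               # step 4: keep only close vectors
    return kept
```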

NEXT: The core functions of SCID.
UPCOMING: How are vb(x,Sx) and v1(x1,Sx), v2(x2,Sx) … constructed?

SCID: Semantic Co-reference Inaccuracy Detection - [ CONTRIBUTION ]

Page 7: Filtering Inaccurate Entity Co-references on the Linked Open Data

SCID: Semantic Co-reference Inaccuracy Detection - [ METHOD ]

SCID depends on two key functions:

1. A category distribution function ρ(t,S). Given some natural language text (t) and a set of “suitable” subject categories (S) for t, compute a distribution vector describing how strongly t relates to each subject category in S.

2. A category selection function S(uri). Given a resource (uri), return a “suitable” set of subject categories (S) that can be used in ρ(t,S).

NEXT: The category distribution function.
UPCOMING: The category selection function.

Page 8: Filtering Inaccurate Entity Co-references on the Linked Open Data

SCID: Semantic Co-reference Inaccuracy Detection - [ METHOD ]

The category distribution function: ρ(t,S).

Ex: Given input text (t) as shown and three DBpedia subject categories S = [Fruit, Oranges, Color], ρ(t,S) produces the output:

ρ(t, [Fruit, Oranges, Color]) = v1(x1,S) = [ 0.27 (Fruit), 0.50 (Oranges), 0.22 (Color) ]

NEXT: The category selection function.
UPCOMING: How is the baseline vector vb(x,Sx) computed and compared to v1(x1,S)?

● Computes Rx,k, defined as the importance of word x to category k, for every word in t.

○ Uses 5 features: (1) the count of x in k, (2) the count of x across all k, (3) the count of concepts in which word x appears, (4) the ratio of x in k to the vocabulary of all k, (5) the average word frequency of x per resource in k.
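A toy Python sketch of ρ(t,S), reducing Rx,k to a single relative-frequency feature instead of the five features above; the category vocabularies are assumed to be precomputed bags of words, and all data below is invented for illustration:

```python
from collections import Counter

def rho(text: str, categories: dict[str, Counter]) -> dict[str, float]:
    """Toy category distribution function: score each category k by a
    crude importance R_{x,k} summed over the words x of the input text,
    then normalize the scores into a distribution over categories."""
    words = text.lower().split()
    scores = {}
    for k, vocab in categories.items():
        total = sum(vocab.values()) or 1
        # R_{x,k} reduced to the relative frequency of x in k's vocabulary
        scores[k] = sum(vocab[x] / total for x in words)
    z = sum(scores.values()) or 1.0
    return {k: s / z for k, s in scores.items()}

# Mirrors the slide's example with S = [Fruit, Oranges, Color]
cats = {
    "Fruit":   Counter({"fruit": 5, "sweet": 3, "citrus": 2}),
    "Oranges": Counter({"orange": 6, "citrus": 4, "peel": 2}),
    "Color":   Counter({"orange": 3, "hue": 2, "color": 4}),
}
print(rho("orange is a sweet citrus fruit", cats))
```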

Page 9: Filtering Inaccurate Entity Co-references on the Linked Open Data

SCID: Semantic Co-reference Inaccuracy Detection - [ METHOD ]

The category selection function: S(uri).

⇶ DBpedia contains 656,000+ subject categories. How do we select a few that are suitable for ρ(t,S)?

1. Begin with a candidate resource (uri): http://dbpedia.org/resource/Orange_(fruit)

2. Find its DBpedia disambiguation page: http://dbpedia.org/resource/Orange_(disambiguation)

3. Combine (take the union of) the subject categories of each resource listed on the disambiguation page (see the sketch after the diagram below).
S(uri) = { category:Optical_Spectrum, category:Oranges, category:Citrus_hybrids, category:Tropical_agriculture, category:American_punk_rock, category:Rock_music, category:Hellcat_Records }

NEXT: How do we compute v1(x1,S) for sameAs inaccuracy filtering?
UPCOMING: How is the baseline vector vb(x,Sx) computed and compared to v1(x1,S)?

dbr:Orange_(colour) → category:Optical_Spectrum
dbr:Orange_(fruit) → category:Oranges, category:Citrus_hybrids, category:Tropical_agriculture, …
dbr:Orange_(band) → category:American_punk_rock, category:Rock_music, category:Hellcat_Records
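A sketch of how S(uri) could be realized against the public DBpedia SPARQL endpoint, using SPARQLWrapper; deriving the disambiguation-page URI by string substitution is an assumption here, as the paper does not specify the exact lookup:

```python
from SPARQLWrapper import SPARQLWrapper, JSON

def select_categories(uri: str) -> set[str]:
    """Union of the subject categories of the resource itself and of the
    resources listed on its DBpedia disambiguation page."""
    # Assumption: derive the disambiguation page by string substitution,
    # e.g. ...Orange_(fruit) -> ...Orange_(disambiguation)
    disamb = uri.rsplit("_(", 1)[0] + "_(disambiguation)"
    query = f"""
    PREFIX dct: <http://purl.org/dc/terms/>
    PREFIX dbo: <http://dbpedia.org/ontology/>
    SELECT DISTINCT ?cat WHERE {{
      {{ <{uri}> dct:subject ?cat . }}
      UNION
      {{ <{disamb}> dbo:wikiPageDisambiguates ?res .
         ?res dct:subject ?cat . }}
    }}"""
    sparql = SPARQLWrapper("http://dbpedia.org/sparql")
    sparql.setQuery(query)
    sparql.setReturnFormat(JSON)
    rows = sparql.query().convert()["results"]["bindings"]
    return {row["cat"]["value"] for row in rows}

# e.g. select_categories("http://dbpedia.org/resource/Orange_(fruit)")
```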

Page 10: Filtering Inaccurate Entity Co-references on the Linked Open Data

SCID: Semantic Co-reference Inaccuracy Detection - [ METHOD ]

NEXT: How is the baseline vector vb(x,Sx) computed and compared to v1(x1,S)?
UPCOMING: Experimental results.

Example from http://www.sameas.org, resources listed as sameAs for dbr:Port:
● dbr:Port
● www.w3.org: synset-seaport-noun-1
● rdf.freebase.com: en.port
● sw.opencyc.org: Seaport
● rdf.freebase.com: River_port
● dbr:Bad_conduct
● rdf.freebase.com: en.military_discharge
● dbr:IVDP
● rdf.freebase.com: en.port_wine

How do we compute v1(t1,S), v2(t2,S), … for sameAs inaccuracy filtering?

1. Start with a group of resources that are identified as sameAs. Ex: http://dbpedia.org/resource/Port (dbr:Port)

2. Collect the subject categories Sdbr:Port using the category selection function.

3. For each of the sameAs resources, collect natural language text (t) describing the resource. Collect (t) using DBpedia rdfs:comment, Freebase ns:common.topic.description, and w3.org wn20schema:gloss.

4. Compute the vectors v1(t1,Sdbr:Port), v2(t2,Sdbr:Port), …, where t1 = rdfs:comment of dbr:Port, t2 = ns:common.topic.description of rdf.freebase:River_port, …, using the category distribution function ρ(t,Sdbr:Port).

We now have the individual vectors v1,2,…(t1,2,…,Sdbr:Port). We only need the base vector vb(tb,S) for comparison.
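A sketch of the text-collection step, assuming each URI still dereferences to RDF via content negotiation; describe and DESCRIPTION_PREDICATES are illustrative names, while the three predicates are the ones listed in step 3:

```python
from rdflib import Graph, URIRef

# The three description predicates from step 3, one per dataset.
DESCRIPTION_PREDICATES = [
    URIRef("http://www.w3.org/2000/01/rdf-schema#comment"),         # DBpedia
    URIRef("http://rdf.freebase.com/ns/common.topic.description"),  # Freebase
    URIRef("http://www.w3.org/2006/03/wn/wn20/schema/gloss"),       # WordNet
]

def describe(uri: str, lang: str = "en") -> str:
    """Dereference the resource and return the first natural-language
    description found among the predicates above."""
    g = Graph()
    g.parse(uri)  # relies on the server returning RDF via content negotiation
    for pred in DESCRIPTION_PREDICATES:
        for obj in g.objects(URIRef(uri), pred):
            if getattr(obj, "language", None) in (None, lang):
                return str(obj)
    return ""
```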

Page 11: Filtering Inaccurate Entity Co-references on the Linked Open Data

SCID: Semantic Co-reference Inaccuracy Detection - [ METHOD ]

NEXT: Experimental results.
UPCOMING: Conclusion.

How is the baseline vector vb(x,Sx) computed and compared to v1(x1,S)?

1. Retrieve the subject categories of the candidate resource from DBpedia. Ex: http://dbpedia.org/resource/Port (dbr:Port)

2. Find (all) other resources that use the categories of the candidate resource. Concatenate the rdfs:comment of all these resources to form the baseline text (t).

3. Compute vb(t,Sdbr:Port) using the category distribution function ρ(t,Sdbr:Port).

● We now have the base vector vb(t,Sdbr:Port), which can be compared to the individual sameAs vectors v1,2(x1,2,Sdbr:Port).

● We use Pearson Correlation Coefficient (PCC) to compare vectors.

● Remove vectors whose PCC is less than the threshold δ.
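A sketch of the comparison and filtering step with NumPy; np.corrcoef gives the PCC, and the vectors and threshold below are invented for illustration:

```python
import numpy as np

def pcc(a: np.ndarray, b: np.ndarray) -> float:
    """Pearson Correlation Coefficient between two distribution vectors."""
    return float(np.corrcoef(a, b)[0, 1])

def filter_sameas(v_b: np.ndarray,
                  candidates: dict[str, np.ndarray],
                  delta: float = 0.5) -> dict[str, np.ndarray]:
    """Keep only the sameAs candidates whose vector correlates with the
    baseline vector at or above the threshold δ."""
    return {uri: v for uri, v in candidates.items() if pcc(v_b, v) >= delta}

# Toy usage: the dog-like vector survives, the unrelated one is removed.
v_b = np.array([0.6, 0.3, 0.1])
candidates = {"dbr:Dog": np.array([0.5, 0.4, 0.1]),
              "fb:en.bitch": np.array([0.1, 0.2, 0.7])}
print(filter_sameas(v_b, candidates, delta=0.5))
```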

[Diagram: http://www.dbpedia.org returns the subject categories of dbr:Port: category:Nautical_terms, category:Ports_and_harbours.]

Page 12: Filtering Inaccurate Entity Co-references on the Linked Open Data

SCID: Semantic Co-reference Inaccuracy Detection - [ EXPERIMENTATION ]

NEXT: Experimental results continued.
UPCOMING: Conclusion.

● We examined 7,690 resources obtained from the www.sameAs.org database, covering five topics:
○ Animal, City, Person, Color, and Miscellaneous.

● We performed some data cleansing on these resources.
○ Removal of: duplicate resources (i.e., aliases/redirects), broken links, and redundant resources (e.g., dbpedialite is a subset of DBpedia).

● After cleansing, 411 unique resources remained, with 251 errors identified by a human oracle.
○ E.g., http://dbpedia.org/resource/Dog is not the same as http://rdf.freebase.com/ns/en.bitch

Page 13: Filtering Inaccurate Entity Co-references on the Linked Open Data

SCID: Semantic Co-reference Inaccuracy Detection - [ EXPERIMENTATION ]

● We computed the individual vectors v1…411(t,S) for all 411 resources, each with its associated baseline comparison vector.

● We computed the Pearson Correlation between each v(t,S) and its baseline.

● We removed identity links based on thresholds ranging from 0.0 to 0.90, and the F-score was calculated for each threshold used.

○ The original 411 resources contained 160 correct / 251 incorrect sameAs links (0.560 F-score).

○ Thresholds (δ) of 0.50 and 0.60 gave the best F-scores.
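The baseline F-score can be checked directly from these numbers: keeping all 411 links gives precision 160/411 ≈ 0.389 and recall 1.0, which yields the reported 0.560:

```python
# From the slide: keeping all 411 links leaves 160 correct, 251 incorrect.
tp, fp = 160, 251
precision = tp / (tp + fp)   # 160 / 411 ≈ 0.389
recall = 1.0                 # no correct link has been removed yet
f_score = 2 * precision * recall / (precision + recall)
print(round(f_score, 3))     # 0.56, matching the reported 0.560
```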

NEXT: Experimental results continued.
UPCOMING: Conclusion.

Page 14: Filtering Inaccurate Entity Co-references on the Linked Open Data

SCID: Semantic Co-reference Inaccuracy Detection - [ EXPERIMENTATION ]

Scatter plot of F-score versus Pearson Correlation Coefficient for oracle-identified right (blue) and wrong (red) identity links, with the threshold δ marked on the PEARSON axis.

NEXT: Conclusion

Page 15: Filtering Inaccurate Entity Co-references on the Linked Open Data

SCID: Semantic Co-reference Inaccuracy Detection - [ CONCLUSION ]

-- END --

● In this presentation:

○ We introduce SCID: a technique for discovering inaccuracies in identity link assertions (owl:sameAs).

○ Experimental results indicate that SCID can identify incorrect identity link assertions and improve the precision of an identity database (http://www.sameas.org).

● In the future:

○ Experimentation with identity links other than owl:sameAs (e.g., skos:closeMatch, skos:exactMatch, owl:equivalentClass).

○ Experimentation with vector comparison methods other than Pearson Correlation (e.g., cosine similarity, Euclidean distance, Spearman's rank correlation coefficient).