Capturing emerging relations between schema ontologies on the Web of Data

Andriy Nikolov, Enrico Motta


Page 1: Capturing emerging relations between schema ontologies on the Web of Data

Capturing emerging relations between schema ontologies on the Web of Data

Andriy Nikolov, Enrico Motta

Page 2: Capturing emerging relations between schema ontologies on the Web of Data

Public linked data

Linking Open Data cloud diagram, by Richard Cyganiak and Anja Jentzsch. http://lod-cloud.net/

• “Linking” in the Linked Data cloud:
  – References to instance URIs described in external sources
  – Special case: identity links between equivalent resources

Page 3: Capturing emerging relations between schema ontologies on the Web of Data

Motivation

• Schema heterogeneity is an obstacle both for creating and for utilising these links
  – Extracting information on the same topic from different repositories
  – Discovering equivalence links between individuals
• Motivation for our work: discovering instance-level links
  – How to choose the repositories to connect a new one to?
  – Which subsets of repositories contain co-referring instances?

[Figure: diagram with the labels TV programs, movies, pieces of music; LinkedMDB, DBPedia, Freebase, MusicBrainz; “?”]

Page 4: Capturing emerging relations between schema ontologies on the Web of Data

Schema-level interlinks

Data-level

Schema-level?

Page 5: Capturing emerging relations between schema ontologies on the Web of Data

Matching approaches

• “Top-down”
  – Analyzing schema ontologies and generating alignments (manually or automatically)
  – UMBEL
    • Using CYC as a “backbone”
    • Mapping commonly used schema ontologies
• “Bottom-up”
  – Inferring schema mappings based on instance-level information

Page 6: Capturing emerging relations between schema ontologies on the Web of Data

Our approach

• Constructing a large-scale network of schema mappings
  – Applying a light-weight instance-based matcher
• Analysing the resulting network
  – What does it tell us about the use of ontologies?

Page 7: Capturing emerging relations between schema ontologies on the Web of Data

Motivating factors

• Potential use case scenarios
  – Discovering relevant sources for connection
  – Discovering relevant subsets of comparable instances
• Tolerance to the quality of mappings
  – A mapping between “strongly overlapping” classes is still useful even if there is no strict equivalence/subsumption

Page 8: Capturing emerging relations between schema ontologies on the Web of Data

Instance-based matching

• Use of instance-based matching
  – Some implicit schema-level assumptions cannot be captured using only schema-level evidence
• Interpretation mismatches
  – dbpedia:Actor = professional actor (film or stage)
  – movie:actor = anybody who participated in a movie
• Class interpretation “as used” vs “as designed”
  – FOAF: foaf:Person = any person
  – DBLP: foaf:Person = computer scientist

Class          Repository  Richard Nixon  David Garrick
dbpedia:Actor  DBPedia     -              +
movie:actor    LinkedMDB   +              -

Page 9: Capturing emerging relations between schema ontologies on the Web of Data

Instance set overlaps

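• Co-typing: one instance is assigned classes from several ontologies
  – DBPedia: dbpedia:Ennio_Morricone is_a dbpedia:Artist, is_a yago:ItalianComposers
• Declared association: instances in different repositories are declared equivalent (==)
  – LinkedMDB: movie:music_contributor/2490 is_a movie:music_contributor
  – DBPedia: dbpedia:Ennio_Morricone is_a dbpedia:Artist
  – MusicBrainz: music:artist/a16…9fdf is_a mo:MusicArtist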
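The two kinds of instance set overlap can be illustrated with a minimal Python sketch. It is not the authors' implementation: the toy triples, the function names and the choice to count one overlap per shared instance or per equivalence link are all assumptions made for illustration.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical toy data: (instance, class) typing statements and equivalence links.
TYPES = [
    ("dbpedia:Ennio_Morricone", "dbpedia:Artist"),
    ("dbpedia:Ennio_Morricone", "yago:ItalianComposers"),
    ("movie:music_contributor/2490", "movie:music_contributor"),
]
SAME_AS = [("movie:music_contributor/2490", "dbpedia:Ennio_Morricone")]


def classes_of(types):
    """Map each instance to the set of classes it is typed with."""
    result = defaultdict(set)
    for instance, cls in types:
        result[instance].add(cls)
    return result


def co_typing_pairs(types):
    """Count, per class pair, the instances that are typed with both classes."""
    counts = defaultdict(int)
    for classes in classes_of(types).values():
        for a, b in combinations(sorted(classes), 2):
            counts[(a, b)] += 1
    return dict(counts)


def association_pairs(types, same_as):
    """Count, per class pair, the equivalence links that connect their instances."""
    by_instance = classes_of(types)
    counts = defaultdict(int)
    for left, right in same_as:
        for a in by_instance.get(left, ()):
            for b in by_instance.get(right, ()):
                if a != b:
                    counts[tuple(sorted((a, b)))] += 1
    return dict(counts)


print(co_typing_pairs(TYPES))
print(association_pairs(TYPES, SAME_AS))
```

On this toy input, co-typing yields the pair (dbpedia:Artist, yago:ItalianComposers), while the declared equivalence link connects movie:music_contributor with both DBPedia-side classes.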

Page 10: Capturing emerging relations between schema ontologies on the Web of Data

Dataset

• Billion Triple Challenge 2009
  – about 1.14 billion triples
  – contains
    • core LOD repositories (DBPedia, Freebase, Geonames, Musicbrainz, LinkedMDB, …)
    • smaller semantic datasets retrieved by search servers (Falcon-S, Sindice)
  – ≈3.6M co-typing-based overlapping pairs of classes
  – ≈1M association-based pairs

Page 11: Capturing emerging relations between schema ontologies on the Web of Data

Inferring mappings

• Classification task
  – Classes A, B: is there a mapping?
  – Boolean classification
    • type of mappings assigned based on comparing sizes of instance sets
• Features
  – namespaces of the class URIs of A and B
  – size of the overlap between the instance sets of A and B
  – sizes of the instance sets of A and B
  – ratio of the overlapping subset to the complete instance set
  – direct/indirect: whether classes have instances explicitly declared to be equivalent
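The feature list above could be computed along the following lines. This is a sketch under stated assumptions rather than the authors' feature extractor: class names are assumed to be prefixed names such as dbpedia:Actor, instance identifiers are assumed to be already unified across repositories, the ratio is computed once per class, and all function and key names are hypothetical.

```python
def namespace(class_name):
    """Prefix part of a prefixed name such as 'dbpedia:Actor' (assumed format)."""
    return class_name.split(":", 1)[0]


def overlap_features(class_a, instances_a, class_b, instances_b, direct):
    """Build one feature record for a candidate pair of classes.

    instances_a / instances_b are sets of instance identifiers; direct says
    whether the classes have instances explicitly declared to be equivalent.
    """
    overlap = len(instances_a & instances_b)
    return {
        "ns_a": namespace(class_a),
        "ns_b": namespace(class_b),
        "overlap": overlap,
        "size_a": len(instances_a),
        "size_b": len(instances_b),
        # Share of each instance set that falls into the overlap.
        "ratio_a": overlap / len(instances_a) if instances_a else 0.0,
        "ratio_b": overlap / len(instances_b) if instances_b else 0.0,
        "direct": direct,
    }


# Hypothetical instance identifiers, for illustration only.
features = overlap_features(
    "dbpedia:Actor", {"i1", "i2", "i3", "i4"},
    "movie:actor", {"i3", "i4"},
    direct=True,
)
print(features)
```

A record of this shape is what a Boolean classifier would consume for each overlapping pair of classes.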

Page 12: Capturing emerging relations between schema ontologies on the Web of Data

Test

Mapping set        Algorithm  Precision  Recall  F1
Association-based  J48        0.939      0.689   0.795
Co-typing-based    J48        0.952      0.944   0.948
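As a quick consistency check, the F1 column can be recomputed from the precision and recall columns, assuming F1 is the usual harmonic mean of the two. The function name is illustrative.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (0.0 when both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Precision and recall values taken from the table above.
reported = [
    ("Association-based", 0.939, 0.689),
    ("Co-typing-based", 0.952, 0.944),
]
for name, precision, recall in reported:
    print(name, round(f1_score(precision, recall), 3))
```

Rounded to three decimals, both values agree with the F1 column.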

• Training
  – Training set: 6000 overlapping pairs of classes
  – Test: 10-fold cross-validation
• Applying
  – 2 networks of class mappings

Property                    Association-based  Co-typing-based
Nodes                       20365              35578
Edges                       82422              67620
Max. connections/node       5301               18137
Node with max. connections  geonames:Feature   foaf:Person
Avg. connections/node       8.09               3.80
Distribution law            power              power
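The average number of connections per node in this table is consistent with treating each mapping as an undirected edge counted once, so that the average degree is 2E/N. The short sketch below checks this; the undirected reading is an assumption, not something stated on the slide.

```python
def average_degree(nodes, edges):
    """Average degree of an undirected graph: every edge touches two nodes."""
    return 2 * edges / nodes


# Node and edge counts taken from the table above.
print(round(average_degree(20365, 82422), 2))  # association-based network
print(round(average_degree(35578, 67620), 2))  # co-typing-based network
```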

Page 13: Capturing emerging relations between schema ontologies on the Web of Data

Observations: class mappings

• Association-based network: classes involved in the largest number of mappings
  – High-level classes representing concepts covered in many repositories
  – … and describing categories with very fine-grained class decomposition
  – Usually also the most populated ones

[Figure: plot of number of mappings per class against instance set size (logarithmic axes). Labelled classes: geonames:Feature, freebase:people.person, yago:PhysicalEntity, linkedmdb:film, umbel:Person; akt:Person, akt:ArticleReference, … (“under-linked” ones?)]

Page 14: Capturing emerging relations between schema ontologies on the Web of Data

Observations: class mappings

• Co-typing-based network: classes involved in the largest number of mappings
  – Popular classes reused in many repositories
  – … or in DBPedia
  – … and describing categories with fine-grained class decomposition
  – Usually also the most populated ones

[Figure: plot of number of mappings per class against instance set size (logarithmic axes). Labelled classes: foaf:Person, umbel:Person, dbpedia:Person, dbpedia:FootballPlayer, wordnet:Person, dbpedia:Album; sioc:WikiArticle, geonames:Feature, …]

Page 15: Capturing emerging relations between schema ontologies on the Web of Data

Links between ontologies

Property                    Association-based  Co-typing-based
Nodes                       52                 743
Edges                       172                1352
Max. connections/node       29                 504
Node with max. connections  YAGO               FOAF
Avg. connections/node       3.96               1.85
Connected components        5                  35

• Aggregated network: connections between ontologies
  – Mapping-based links between ontologies
  – At least 1 mapping between corresponding classes must exist
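The aggregation step can be sketched as follows: two ontologies are linked as soon as at least one mapping connects their classes, and connected components are then counted on the resulting graph. This is an illustrative sketch only. The mappings listed are hypothetical, and identifying an ontology by the prefix of a class name is an assumption.

```python
from collections import defaultdict


def ontology_of(class_name):
    """Assumption: the prefix before ':' identifies the ontology."""
    return class_name.split(":", 1)[0]


def aggregate(class_mappings):
    """Link two ontologies when at least one mapping connects their classes."""
    edges = set()
    for class_a, class_b in class_mappings:
        onto_a, onto_b = ontology_of(class_a), ontology_of(class_b)
        if onto_a != onto_b:
            edges.add(frozenset((onto_a, onto_b)))
    return edges


def connected_components(edges):
    """Count components among the ontologies that appear in at least one edge."""
    neighbours = defaultdict(set)
    for edge in edges:
        a, b = tuple(edge)
        neighbours[a].add(b)
        neighbours[b].add(a)
    seen, components = set(), 0
    for start in neighbours:
        if start in seen:
            continue
        components += 1
        stack = [start]
        while stack:  # depth-first traversal of one component
            node = stack.pop()
            if node not in seen:
                seen.add(node)
                stack.extend(neighbours[node] - seen)
    return components


# Hypothetical class-level mappings, for illustration only.
MAPPINGS = [
    ("dbpedia:Artist", "yago:ItalianComposers"),
    ("dbpedia:Person", "foaf:Person"),
    ("movie:music_contributor", "mo:MusicArtist"),
    ("dbpedia:Artist", "dbpedia:Person"),  # same ontology, produces no link
]
ontology_edges = aggregate(MAPPINGS)
print(len(ontology_edges), connected_components(ontology_edges))
```

Here the four class-level mappings collapse into three ontology-level links that form two components (dbpedia/yago/foaf and movie/mo).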

Page 16: Capturing emerging relations between schema ontologies on the Web of Data

Association-based network

Page 17: Capturing emerging relations between schema ontologies on the Web of Data

Association-based network

Generic: YAGO, Freebase, UMBEL, OpenCYC, DBPedia

Page 18: Capturing emerging relations between schema ontologies on the Web of Data

Association-based network

Domain-specific

Generic: YAGO, Freebase, UMBEL, OpenCYC, DBPedia

Page 19: Capturing emerging relations between schema ontologies on the Web of Data

Association-based network

• Main factor: topic coverage
• Popularity for linking is not reflected
  – Data-level: DBPedia has more connections than Freebase
  – Schema-level: no substantial difference
  – Effect of exploiting composed links

Page 20: Capturing emerging relations between schema ontologies on the Web of Data

Co-typing-based network

• Main factor:
  – Popularity for reuse
• FOAF and WordNet:
  – the most popular
• DBPedia, YAGO, OpenCYC, UMBEL
  – Reused for DBPedia instances

Page 21: Capturing emerging relations between schema ontologies on the Web of Data

Outcomes

• Possible usage scenarios for mappings
  – Selecting suitable sources to connect
    • “LinkedMDB contains more movies than DBPedia – more likely to cover all my instances”
  – Selecting an ontology to reuse to structure new instances
    • Which sources use this ontology? Do I want my data to be integrated with them?
  – Other data-driven tasks
    • E.g., exploratory search
• Generic challenges
  – How to take into account task requirements in ontology matching?
    • Recall vs precision, fuzzy vs exact
  – How to capture changes in the data?
    • BTC 2009 is almost obsolete by now

Page 22: Capturing emerging relations between schema ontologies on the Web of Data

Limitations and future work

• Limitations
  – Light-weight matcher can lead to lower quality mappings
    • OK for our scenario but not others
  – Pre-existing instance-level mappings are not always available
• Future work
  – Combining with schema-based ontology matching techniques
  – Taking into account properties and complex correspondences

Page 23: Capturing emerging relations between schema ontologies on the Web of Data

Questions?

Thanks for your attention

Page 24: Capturing emerging relations between schema ontologies on the Web of Data

Disjoint but overlapping

• Spurious owl:sameAs link
  – dbpedia:Hippocrates (Hippocrates) = bookmashup:9004095748 (Hippocratic Lives and Legends (Studies in Ancient Medicine, Vol 4))
• Spurious rdf:type assignment
  – dbpedia:Celtic_Frost (band) defined as Person in DBPedia (fixed in the current version of DBPedia)
• Modelling assumptions
  – dbpedia:Masada describes both the geographical place and the battle