A Scalable Approach to Learn Semantic Models of Structured Sources

Mohsen Taheriyan, Craig Knoblock, Pedro Szekely, Jose Luis Ambite

8th IEEE International Conference on Semantic Computing

Semantic models of data sources describe the meaning of the data in terms of the concepts and relationships defined by a domain ontology. Building such models is an important step toward integrating data from different sources, where we need to provide the user with a unified view of underlying sources. In this paper, we present a scalable approach to automatically learn semantic models of a structured data source by exploiting the knowledge of previously modeled sources. Our evaluation shows that the approach generates expressive semantic models with minimal user input, and it is scalable to large ontologies and data sources with many attributes.

Transcript of A Scalable Approach to Learn Semantic Models of Structured Sources

Page 1

A Scalable Approach to Learn Semantic Models of Structured Sources

Mohsen Taheriyan, Craig Knoblock, Pedro Szekely, Jose Luis Ambite

8th IEEE International Conference on Semantic Computing

Page 2

Motivation

How to express the intended meaning of data?

Explicit semantics is missing in many of the structured sources

creator? actor? rightsHolder?

artwork? movie? referenced entity?

Page 3

Map the Source to the Domain Ontology

Data source: artworks in the Indianapolis Museum of Art

Domain ontologies:
• EDM: Europeana Data Model
• SKOS: Simple Knowledge Organization System
• FOAF: Friend of a Friend
• AAC: American Art Collaborative
• ElementsGr2: RDA Group 2 Elements
• ORE: Open Archives Initiative Object Reuse and Exchange
• DCTerms: Dublin Core Metadata Terms

Semantic model: a mapping from the source to the concepts and relationships defined by the domain ontologies

Page 4

Semantic Model

[Figure: semantic model of the source. Classes: aac:CulturalHeritageObject, edm:WebResource, skos:Concept, aac:Person, edm:EuropeanaAggregation. Links: dcterms:title, edm:aggregatedCHO, skos:prefLabel, ElementsGr2:nameOfThePerson, rdf:type, edm:hasResource, dcterms:creator, edm:hasType, dcterms:description.]

Key ingredient to automate source discovery, data integration, and publishing RDF triples

Page 5

Problem: How to automatically learn a semantic model for a source

Page 6

Main Idea

Sources in the same domain often have similar data

Exploit knowledge of known semantic models to hypothesize a semantic model for a new source

Page 7

Previous Approach (ISWC 2013)

Input
• Sample data from new source (S)
• Domain ontologies (O)
• Known semantic models

Steps
1. Learn semantic types for attributes(S)
2. Construct graph G = (V, E)
3. Generate mappings between attributes(S) and V
4. Generate and rank semantic models

Output
• A ranked set of semantic models for S

Page 8

Limitations

• Considers only one semantic type (label) for each attribute
• Not scalable to sources with a large number of attributes

Page 9

Contributions

• Consider K candidate semantic types per attribute
• A beam search algorithm to score and prune the mappings

Page 10

Example

Goal: semantic model for source S

New source (Indianapolis Museum of Art): S(title, label, image, type, artist)

Domain ontologies: EDM, SKOS, FOAF, AAC, ElementsGr2, ORE, DCTerms

Known semantic models:
• S1 (Dallas Museum): S1(title, creationDate, name, birthDate, deathDate, type)
• S2 (The Metropolitan Museum of Art): S2(name, copyright, materials, dimensions, imageUrl)

[Figures: semantic model of S1; semantic model of S2]

Page 11

Approach

Input
• Sample data from new source (S)
• Domain ontologies (O)
• Known semantic models

Steps
1. Learn semantic types for attributes(S)
2. Construct graph G = (V, E)
3. Generate mappings between attributes(S) and V
4. Generate and rank semantic models

Output
• A ranked set of semantic models for S

Page 12

Learn Semantic Types

• A CRF-based machine learning technique to learn semantic types for each attribute from its data
• Semantic type
– Ontology class: <class_uri>
– Data property + domain class: <class_uri, property_uri>
• Pick the top K semantic types according to their confidence values

New source: S(title, label, image, type, artist)

Candidate semantic types and confidence values per attribute:

title
– <aac:CulturalHeritageObject, dcterms:title>, 0.19
– <aac:CulturalHeritageObject, rdfs:label>, 0.08
label
– <aac:CulturalHeritageObject, dcterms:description>, 0.7
– <aac:Person, ElementsGr2:note>, 0.03
image
– <edm:WebResource>, 0.58
– <foaf:Document>, 0.41
type
– <skos:Concept, skos:prefLabel>, 0.82
– <skos:Concept, rdfs:label>, 0.15
artist
– <foaf:Person, foaf:name>, 0.27
– <aac:Person, ElementsGr2:nameOfThePerson>, 0.19
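
As a rough illustration of the "pick top K" step, the sketch below keeps the K highest-confidence candidates for one attribute. The `SemanticType` tuple and the `top_k` helper are hypothetical names introduced here, not part of the authors' system, and the CRF labeler that produces the confidence values is not shown. The candidate values are the ones listed for the image attribute on this slide.

```python
from typing import List, NamedTuple, Optional


class SemanticType(NamedTuple):
    """Hypothetical container for one candidate semantic type."""
    class_uri: str
    property_uri: Optional[str]  # None when the type is only an ontology class
    confidence: float


def top_k(candidates: List[SemanticType], k: int) -> List[SemanticType]:
    """Return the k candidates with the highest confidence, best first."""
    return sorted(candidates, key=lambda t: t.confidence, reverse=True)[:k]


# Candidates for the "image" attribute, copied from the slide.
image_candidates = [
    SemanticType("foaf:Document", None, 0.41),
    SemanticType("edm:WebResource", None, 0.58),
]

best = top_k(image_candidates, k=1)
```

With k = 1 only `edm:WebResource` survives; a larger k keeps the lower-ranked alternatives available for the later mapping step.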

Page 14

Build Graph G: Add Known Models

Page 15

Build Graph G: Add Semantic Types

Page 16

Build Graph G: Expand with Paths from Ontologies

Page 18

Map Source Attributes to the Graph

New source: S(title, label, image, type, artist)

title
– <aac:CulturalHeritageObject, dcterms:title>
– <aac:CulturalHeritageObject, rdfs:label>
label
– <aac:CulturalHeritageObject, dcterms:description>
– <aac:Person, ElementsGr2:note>
image
– <edm:WebResource>
– <foaf:Document>
type
– <skos:Concept, skos:prefLabel>
– <skos:Concept, rdfs:label>
artist
– <foaf:Person, foaf:name>
– <aac:Person, ElementsGr2:nameOfThePerson>

Page 21

Scalability Issue

• Multiple mappings from attributes(S) to nodes of G
– Each attribute has more than one semantic type
– Multiple matches for each semantic type
• Not feasible to generate all possible mappings
– The size of the graph may be large
– The source may have many attributes
• Exponential in the number of attributes
– N attributes with M matches each give M^N mappings

Page 22

Prune the Mappings

• Score the partial mappings after mapping each attribute
– Coherence: number of nodes in a mapping that belong to the same component
– Confidence: average confidence of the semantic types in the mapping
– Score = arithmetic mean of coherence and confidence
• Beam search
– Keep only the top K mappings after mapping each attribute
• The number of mappings will not exceed a fixed size after mapping each attribute
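
The pruning idea can be sketched as a small beam search. This is a minimal illustration under stated assumptions, not the authors' implementation: each candidate is a hypothetical `(node_id, component_id, confidence)` tuple, coherence is assumed to be the fraction of mapped nodes that share the most common component, and the score is the mean of coherence and average confidence. The node names, component labels, and `beam_search` helper are invented for the example.

```python
from collections import Counter
from statistics import mean


def score(mapping):
    """Score a (partial) mapping given as a list of (node, component, confidence)."""
    components = Counter(component for _, component, _ in mapping)
    # Assumed coherence: share of nodes that fall in the most common component.
    coherence = components.most_common(1)[0][1] / len(mapping)
    confidence = mean(conf for _, _, conf in mapping)
    return (coherence + confidence) / 2


def beam_search(candidates_per_attribute, beam_width):
    """Map attributes one at a time, keeping only the best beam_width partial mappings."""
    beam = [[]]
    for candidates in candidates_per_attribute:
        extended = [partial + [c] for partial in beam for c in candidates]
        extended.sort(key=score, reverse=True)
        beam = extended[:beam_width]  # prune after every attribute
    return beam


# Three attributes with two hypothetical candidate matches each.
candidates = [
    [("n1", "S1", 0.19), ("n2", "S2", 0.08)],
    [("n3", "S1", 0.70), ("n4", "S2", 0.03)],
    [("n5", "S2", 0.27), ("n6", "S1", 0.19)],
]

result = beam_search(candidates, beam_width=2)
best_nodes = [node for node, _, _ in result[0]]
```

In this toy input the best mapping picks `n6` (confidence 0.19) over `n5` (confidence 0.27) for the last attribute, because `n6` lies in the same component as the other chosen nodes. The beam never holds more than `beam_width` mappings, so the work per attribute stays bounded.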

Page 23

Score the Mappings

New source: S(title, label, image, type, artist)

Example Mapping 2 (the semantic type chosen for each attribute is shown with its confidence):

title
– <aac:CulturalHeritageObject, dcterms:title>, 0.19
– <aac:CulturalHeritageObject, rdfs:label>
label
– <aac:CulturalHeritageObject, dcterms:description>, 0.7
– <aac:Person, ElementsGr2:note>
image
– <edm:WebResource>, 0.58
– <foaf:Document>
type
– <skos:Concept, skos:prefLabel>, 0.82
– <skos:Concept, rdfs:label>
artist
– <foaf:Person, foaf:name>, 0.27
– <aac:Person, ElementsGr2:nameOfThePerson>

Coherence: 4/9 = 0.44
Confidence: 0.56
Score: 0.5

Page 24

Score the Mappings

New source: S(title, label, image, type, artist)

Example Mapping 1 (the semantic type chosen for each attribute is shown with its confidence):

title
– <aac:CulturalHeritageObject, dcterms:title>, 0.19
– <aac:CulturalHeritageObject, rdfs:label>
label
– <aac:CulturalHeritageObject, dcterms:description>, 0.7
– <aac:Person, ElementsGr2:note>
image
– <edm:WebResource>, 0.58
– <foaf:Document>
type
– <skos:Concept, skos:prefLabel>, 0.82
– <skos:Concept, rdfs:label>
artist
– <foaf:Person, foaf:name>
– <aac:Person, ElementsGr2:nameOfThePerson>, 0.19

Coherence: 6/9 = 0.66
Confidence: 0.55
Score: 0.605

This mapping gets a higher score even though it uses the 2nd ranked semantic type for artist
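
The scores reported for the two example mappings are consistent with averaging coherence and confidence. The short check below reproduces that arithmetic from the slide values. The function name `mapping_score` is an illustrative choice, and the coherence and confidence inputs are taken as given rather than recomputed.

```python
def mapping_score(coherence: float, confidence: float) -> float:
    """Arithmetic mean of coherence and confidence."""
    return (coherence + confidence) / 2


# Mapping that uses <foaf:Person, foaf:name> for artist.
score_foaf = mapping_score(coherence=0.44, confidence=0.56)

# Mapping that uses <aac:Person, ElementsGr2:nameOfThePerson> for artist.
score_aac = mapping_score(coherence=0.66, confidence=0.55)

print(round(score_foaf, 3), round(score_aac, 3))
```

The more coherent mapping wins (0.605 versus 0.5) even though its average confidence is slightly lower.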

Page 26

Generate Semantic Models

• Select top M mappings
• Compute a Steiner tree for each mapping
– A minimal tree connecting the nodes of the mapping
• Each tree is a candidate model
• Rank candidate models (Steiner trees)
– Cost
– Score of the corresponding mapping
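
The slides do not say which Steiner tree algorithm is used, so the sketch below uses a generic shortest-path heuristic: start from one terminal and repeatedly attach the nearest remaining terminal through its cheapest path. It only approximates the minimal tree. The graph, the edge weights, and the helper names (`add_edge`, `steiner_tree`, `tree_cost`) are assumptions made for illustration; the node labels are borrowed from the example ontologies.

```python
import heapq


def add_edge(graph, u, v, weight):
    """Add an undirected weighted edge to an adjacency-dict graph."""
    graph.setdefault(u, {})[v] = weight
    graph.setdefault(v, {})[u] = weight


def steiner_tree(graph, terminals):
    """Approximate Steiner tree; assumes positive weights. Returns frozenset edges."""
    tree_nodes = {terminals[0]}
    tree_edges = set()
    remaining = set(terminals) - tree_nodes
    while remaining:
        # Multi-source Dijkstra: every node already in the tree starts at distance 0.
        dist = {n: 0 for n in tree_nodes}
        prev = {}
        heap = [(0, n) for n in tree_nodes]
        heapq.heapify(heap)
        target = None
        while heap:
            d, u = heapq.heappop(heap)
            if d > dist[u]:
                continue  # stale heap entry
            if u in remaining:
                target = u  # nearest terminal not yet in the tree
                break
            for v, w in graph[u].items():
                if d + w < dist.get(v, float("inf")):
                    dist[v] = d + w
                    prev[v] = u
                    heapq.heappush(heap, (d + w, v))
        if target is None:
            raise ValueError("some terminals are not reachable")
        # Walk back from the new terminal until the existing tree is reached.
        node = target
        while node not in tree_nodes:
            tree_edges.add(frozenset((node, prev[node])))
            tree_nodes.add(node)
            node = prev[node]
        remaining.discard(target)
    return tree_edges


def tree_cost(graph, edges):
    """Sum of the weights of the selected edges."""
    return sum(graph[u][v] for u, v in (tuple(e) for e in edges))


g = {}
add_edge(g, "aac:CulturalHeritageObject", "aac:Person", 1)
add_edge(g, "aac:CulturalHeritageObject", "skos:Concept", 1)
add_edge(g, "aac:CulturalHeritageObject", "edm:EuropeanaAggregation", 1)
add_edge(g, "edm:EuropeanaAggregation", "edm:WebResource", 1)
add_edge(g, "aac:CulturalHeritageObject", "edm:WebResource", 3)  # hypothetical costly link

tree = steiner_tree(g, ["aac:Person", "edm:WebResource", "skos:Concept"])
cost = tree_cost(g, tree)
```

In this toy graph the tree reaches `edm:WebResource` through `edm:EuropeanaAggregation` rather than through the direct but more expensive link, which shows how non-terminal nodes can enter a candidate model.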

Page 27

Steiner Tree

Page 28

Evaluation

• Dataset
– 29 museum data sources
– 332 attributes (average 11 per source)
• Domain ontologies
– EDM, SKOS, FOAF, ORE, ElementsGr2, AAC, DCTerms
– 119 classes, 351 properties
• Compute precision and recall between learned models and correct models
– Precision: how many of the learned relationships are correct?
– Recall: how many of the correct relationships are learned?
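
One plausible way to compute these two measures is to treat each model as a set of relationship triples and compare the sets. The sketch below assumes that representation; the triples are made up for illustration and are not taken from the evaluation data.

```python
def precision_recall(learned, correct):
    """Precision and recall of learned relationships against the correct ones."""
    learned, correct = set(learned), set(correct)
    hits = len(learned & correct)
    precision = hits / len(learned) if learned else 0.0
    recall = hits / len(correct) if correct else 0.0
    return precision, recall


# Hypothetical (subject class, property, object class) triples.
correct_model = {
    ("aac:CulturalHeritageObject", "dcterms:creator", "aac:Person"),
    ("aac:CulturalHeritageObject", "edm:hasType", "skos:Concept"),
    ("edm:EuropeanaAggregation", "edm:aggregatedCHO", "aac:CulturalHeritageObject"),
    ("edm:EuropeanaAggregation", "edm:hasResource", "edm:WebResource"),
}
learned_model = {
    ("aac:CulturalHeritageObject", "dcterms:creator", "aac:Person"),
    ("aac:CulturalHeritageObject", "edm:hasType", "skos:Concept"),
    ("aac:CulturalHeritageObject", "dcterms:contributor", "aac:Person"),  # not in the correct model
}

precision, recall = precision_recall(learned_model, correct_model)
```

Here two of the three learned relationships are correct (precision 2/3), and two of the four correct relationships were found (recall 1/2).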

Page 29

Quality

• k = 1: the correct semantic type was learned for only 62% of attributes
• k = 4: the correct semantic type was among the 4 learned types for 87% of attributes

[Plot: precision and recall (y-axis, 0.3 to 1.0) versus number of known semantic models (x-axis, 0 to 28). Series: precision (k=1), recall (k=1), precision (k=4), recall (k=4), precision (correct types), recall (correct types).]

Page 30

Performance

The previous approach was not able to learn semantic models for sources with more than 4 attributes within the timeout of 1 hour

Example: S16, with only 5 attributes, has 16,633,298 mappings (29*29*29*31*22)

[Plot: time (y-axis, 0 to 60) versus number of attributes (x-axis, 0 to 30). Series: Previous Approach, New Approach (Kbeam = 100).]
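
The mapping count quoted for S16 is the product of the number of matches per attribute. The snippet below reproduces it and, for comparison, computes how many extended mappings a beam of width 100 would have to score at any single step. The variable names are illustrative; the per-step bound follows from the general beam search idea rather than from a figure reported on the slide.

```python
from math import prod

# Matches per attribute for source S16, as quoted on the slide.
matches_per_attribute = [29, 29, 29, 31, 22]

# Exhaustive enumeration: one mapping per combination of matches.
total_mappings = prod(matches_per_attribute)

# With a beam, each step extends at most beam_width partial mappings.
beam_width = 100
max_scored_per_step = beam_width * max(matches_per_attribute)

print(total_mappings, max_scored_per_step)
```

Exhaustive enumeration yields 16,633,298 mappings for five attributes, while the beam caps each step at a few thousand candidates regardless of how many attributes follow.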

Page 31

Related Work

• Schema matching & schema mapping
– iMAP [Dhamankar et al., 2004], Clio [Fagin et al., 2009]
• Mapping databases and spreadsheets to ontologies
– Mapping languages: D2R [Bizer, 2003], D2RQ [Bizer and Seaborne, 2004], R2RML [Das et al., 2012]
– Tools: RDOTE [Vavliakis et al., 2010], RDF123 [Han et al., 2008], XLWrap [Langegger and Wöß, 2009]
– String similarity between column names and ontology terms [Polfliet and Ichise, 2010]
• Understand semantics of Web tables
– Use column headers and cell values to find the labels and relations from a database of labels and relations populated from the Web [Wang et al., 2012] [Limaye et al., 2010] [Venetis et al., 2011]
• Exploit Linked Open Data (LOD)
– Link the values to the entities in LOD to find the types of the values and their relationships [Muñoz et al., 2013] [Mulwad et al., 2013]
• Learn Semantic Definitions of Online Information Sources [Carman, Knoblock, 2007]
– Learns LAV rules from known sources
– Only learns descriptions that are conjunctive combinations of known descriptions

Page 32

Future Work

• Scalability regarding the number of known models
– Create a more compact graph
– Consolidate overlapping segments of known models
• Leverage Linked Open Data (LOD)
– Exploit the relationships between instances
– Improve the accuracy of the learned relations
• Integrate the new approach in Karma
– http://www.isi.edu/integration/karma
– @KarmaSemWeb