The Linked Data Lifecycle


ICCL Summer 2013 (http://www.computational-logic.org/content/events/iccl-ss-2013/) lectures on the Linked Data Life-Cycle

Transcript of The Linked Data Lifecycle

Page 1: The Linked Data Lifecycle

The Linked Data Life-Cycle

Jens Lehmann, Lorenz Bühmann

Contributors: Quan Nguyen, Sören Auer, Richard Cyganiak, Daniel Gerber, Sebastian Hellmann, Anja Jentzsch, Dimitris Kontokostas, Axel Ngonga, Claus Stadler, Christina Unger

2013-08-23

Lehmann, Bühmann (Univ. Leipzig) The Linked Data Life-Cycle 2013-08-23 1 / 252

Page 2: The Linked Data Lifecycle

Outline

1 Introduction to Linked Data

2 Linked Dataset Example: DBpedia

3 Linked Data Life-Cycle Overview

4 Knowledge Extraction

5 Data Integration / Linking

6 Enrichment

7 Repair

8 Knowledge Base Exploration / Querying

[Figure: Linked Data Lifecycle. Stages: Extraction; Storage/Querying; Manual revision/Authoring; Interlinking/Fusing; Classification/Enrichment; Quality Analysis; Evolution/Repair; Search/Browsing/Exploration]



Page 4: The Linked Data Lifecycle

The Linked Data Principles

The term Linked Data refers to a set of best practices for publishing and interlinking structured data on the Web.

Linked Data principles:

1 Use URIs as names for things.

2 Use HTTP URIs, so that people can look up those names.

3 When someone looks up a URI, provide useful information, using the standards (RDF, SPARQL).

4 Include links to other URIs, so that they can discover more things.


Page 5: The Linked Data Lifecycle

LOD Cloud


Page 6: The Linked Data Lifecycle

Linked Data Principles Detailed: 1 + 2

1 URI references to identify not just Web documents and digital content, but also real world objects and abstract concepts

tangible things: people, places

abstract things: relationship type of knowing somebody

2 HTTP URIs enable re-use of Web architecture: Linked Data gives emphasis to the Web in Semantic Web

Resource dereferencing

Re-use of standard tools for security, load-balancing etc.


Page 7: The Linked Data Lifecycle

Principles Detailed: 3 Content Negotiation

Humans and machines should be able to retrieve appropriate representations of resources: HTML for humans, RDF for machines

Achievable using an HTTP mechanism called content negotiation

Basic idea: the HTTP client sends HTTP headers with each request to indicate what kinds of documents it prefers

Servers can inspect the headers and select an appropriate response

Two strategies:

303 URIs

Hash URIs

Both ensure that objects and the documents that describe them are not confused, and that humans and machines can retrieve appropriate representations
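The server-side half of content negotiation can be sketched in a few lines. This is a simplified illustration, not a complete implementation of HTTP's negotiation rules: the function names `parse_accept` and `negotiate` and the list `AVAILABLE` are made up for this example, and media-type wildcards other than `*/*` are ignored.

```python
def parse_accept(header):
    """Parse an Accept header into (media_type, q) pairs, highest q first."""
    prefs = []
    for part in header.split(","):
        fields = [f.strip() for f in part.split(";")]
        media_type = fields[0]
        if not media_type:
            continue
        q = 1.0  # default quality value when none is given
        for param in fields[1:]:
            if param.startswith("q="):
                try:
                    q = float(param[2:])
                except ValueError:
                    q = 0.0
        prefs.append((media_type, q))
    # sorted() is stable, so equal q-values keep their header order.
    return sorted(prefs, key=lambda p: p[1], reverse=True)


def negotiate(accept_header, available):
    """Return the first available media type the client accepts, or None."""
    for media_type, q in parse_accept(accept_header):
        if q <= 0:
            continue
        if media_type == "*/*":
            return available[0]
        if media_type in available:
            return media_type
    return None


# Hypothetical server offering an HTML and an RDF representation.
AVAILABLE = ["text/html", "application/rdf+xml"]
print(negotiate("application/rdf+xml", AVAILABLE))
print(negotiate("text/html,application/xhtml+xml;q=0.9", AVAILABLE))
```

A browser that sends `text/html` first gets the HTML page, while an RDF client asking for `application/rdf+xml` gets the machine-readable description of the same resource.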


Page 10: The Linked Data Lifecycle

303 URIs

303 Redirect: instead of sending the object itself over the network, the server responds to the client with the HTTP response code 303 See Other and the URI of a Web document which describes the real-world object

Second step: client dereferences new URI and gets a Web document describing the real-world object
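The two-step lookup can be simulated without any network access. In the sketch below the `SERVER` table, the example.org URIs and the function `dereference` are all hypothetical; a real client would issue HTTP requests instead of dictionary lookups.

```python
# Hypothetical in-memory "server": maps a URI to (status, headers, body).
SERVER = {
    "http://example.org/resource/Alice": (
        303,
        {"Location": "http://example.org/data/Alice"},
        "",
    ),
    "http://example.org/data/Alice": (
        200,
        {"Content-Type": "text/turtle"},
        "<http://example.org/resource/Alice> a <http://xmlns.com/foaf/0.1/Person> .",
    ),
}


def dereference(uri, max_redirects=5):
    """Follow 303 redirects; return (document_uri, body, number_of_requests)."""
    requests = 0
    for _ in range(max_redirects + 1):
        status, headers, body = SERVER[uri]
        requests += 1
        if status == 303:
            # The URI named a real-world object; continue with the document URI.
            uri = headers["Location"]
            continue
        return uri, body, requests
    raise RuntimeError("too many redirects")


doc_uri, body, n = dereference("http://example.org/resource/Alice")
print(doc_uri, n)
```

The request counter makes the cost of the strategy visible: one description of a real-world object takes two round-trips.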


Page 11: The Linked Data Lifecycle

Hash URIs

The hash URI strategy builds on the characteristic that URIs may contain a special part (the fragment identifier), separated from their base part by a hash symbol (#)

The HTTP protocol requires the fragment part to be stripped off before requesting the URI from the server

→ a URI that includes a hash cannot be retrieved directly and therefore does not necessarily identify a Web document
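Fragment stripping is easy to observe with Python's standard library. The URI below is a made-up example, and `request_target` is just a thin illustrative wrapper around `urllib.parse.urldefrag`.

```python
from urllib.parse import urldefrag


def request_target(uri):
    """Split a URI into the part sent to the server and the fragment kept by the client."""
    document_uri, fragment = urldefrag(uri)
    return document_uri, fragment


doc, frag = request_target("http://example.org/people#alice")
print(doc)   # http://example.org/people
print(frag)  # alice
```

Only the document URI reaches the server; the client uses the fragment afterwards to pick the description of `#alice` out of the returned document.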


Page 12: The Linked Data Lifecycle

Hash versus 303

Hash URIs

(+) Reduced number of necessary HTTP round-trips → reduces access latency

(-) Descriptions of all resources sharing the same non-fragment URI part are always returned to the client together → can lead to large amounts of data being unnecessarily transmitted to the client

303 URIs

(+) Flexible because the redirection target can be configured separately for each resource (usually points to a single document for each resource, but could also summarise several resources)

(-) Requires two HTTP requests to retrieve a single description of a real-world object


Page 13: The Linked Data Lifecycle

Principles Detailed: 4 Links

If an RDF triple connects URIs in different namespaces/datasets, it is called a link (no unique syntactical definition of link exists)

Basic idea of Linked Data: apply the general hyperlink-based architecture of the World Wide Web to the task of sharing structured data on a global scale

Research challenge: efficient creation of links with high precision and recall
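Since no unique syntactic definition of a link exists, any test for one is a heuristic. The sketch below assumes that a namespace is everything up to the last `#` or `/` of a URI; that rule, the function names and the example.org URI are illustrative assumptions only.

```python
def namespace(uri):
    """Crude namespace heuristic: everything up to the last '#' or '/'."""
    cut = max(uri.rfind("#"), uri.rfind("/"))
    return uri[: cut + 1]


def is_link(triple):
    """True if subject and object are URIs from different namespaces."""
    subject, _predicate, obj = triple
    if not obj.startswith(("http://", "https://")):
        return False  # literals cannot be link targets
    return namespace(subject) != namespace(obj)


internal = (
    "http://dbpedia.org/resource/Dresden",
    "http://dbpedia.org/ontology/country",
    "http://dbpedia.org/resource/Germany",
)
external = (
    "http://dbpedia.org/resource/Dresden",
    "http://www.w3.org/2002/07/owl#sameAs",
    "http://other.example.org/place/42",  # hypothetical external dataset
)
print(is_link(internal), is_link(external))
```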


Page 15: The Linked Data Lifecycle

Why Linked Data?

Problem: Try to search for these things on the current Web:

Apartments near German-Russian bilingual childcare in Leipzig.

ERP service providers with offices in Vienna and London.

Researchers working on multimedia topics in Eastern Europe.

Information is available on the Web, but opaque to current Web search.

Solution: complement text on Web pages with structured linked open data and intelligently combine/integrate such structured information from different sources.


Page 16: The Linked Data Lifecycle

How to get there?


Page 17: The Linked Data Lifecycle

Tim Berners-Lee's 5-star plan

Tim Berners-Lee's 5-star plan for an open web of data

★ Make data available on the Web under an open license

★★ Make it available as structured data

★★★ Use a non-proprietary format

★★★★ Use URIs to identify things

★★★★★ Link your data to other people's data to provide context
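The plan is cumulative: each star presupposes the ones before it. A minimal sketch of that rule follows; the function `star_rating` and its boolean parameters are invented for illustration and are not part of any official tool.

```python
def star_rating(open_license, structured, non_proprietary, uses_uris, linked):
    """Count stars cumulatively: stop at the first criterion that is not met."""
    stars = 0
    for criterion_met in (open_license, structured, non_proprietary, uses_uris, linked):
        if not criterion_met:
            break
        stars += 1
    return stars


# An openly licensed CSV file without URIs: open, structured, non-proprietary.
print(star_rating(True, True, True, False, False))
```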


Page 18: The Linked Data Lifecycle

The 0th star

Data catalog with good metadata

Make your data findable


Page 19: The Linked Data Lifecycle

★ Data on the Web, Open License


Page 20: The Linked Data Lifecycle

★ Data on the Web, Open License

Open vs. Closed:

Data used to be closed by default

In the future, it may be open by default.


Page 21: The Linked Data Lifecycle

★ Data on the Web, Open License

Publishers: sharing data to make it more visible


Page 22: The Linked Data Lifecycle

★ Data on the Web, Open License

E-Commerce: data sharing for increasing traffic


Page 23: The Linked Data Lifecycle

★ Data on the Web, Open License

Community: Collaboratively created databases


Page 24: The Linked Data Lifecycle

Good reasons against opening data

Privacy

Competitive advantage

Producing data and charging for it as business model

Can't get license from upstream


Page 25: The Linked Data Lifecycle

★★ Structured Data

Enabling re-use:

Delivering data to end users in different forms

Combining data with other data

3rd party analysis of data


Page 26: The Linked Data Lifecycle

★★ Structured Data

Formats:

Good for re-use / Structured: MS Excel, CSV, XML, JSON, Microdata

Not so good for re-use: Pure websites, MS Word

Bad for re-use: PDF

Really bad for re-use: Only charts/maps without numbers


Page 28: The Linked Data Lifecycle

★★★ Non-Proprietary Formats

Specialist tools often have specialist formats

Few people have the tools

Expensive

Difficult to re-use

(Geospatial tools, statistics packages, etc.)

Non-proprietary:

CSV (dead simple)

XML

JSON

RDF (good for 4+5 stars)


Page 29: The Linked Data Lifecycle

★★★★ URIs as Identifiers


Page 31: The Linked Data Lifecycle

★★★★ URIs as Identifiers

URI-Design: prefer stable, implementation independent URIs


Page 32: The Linked Data Lifecycle

★★★★ URIs as Identifiers

Turning local identifiers into URIs

Why?

Make them globally unique

Clarify authority

Make them resolvable

Make them linkable
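Turning a local identifier into a URI can be as simple as prefixing it with a base URI that the publisher controls. The sketch below is one possible pattern; the base `http://data.example.org/`, the path layout and the function name `mint_uri` are assumptions for illustration.

```python
from urllib.parse import quote


def mint_uri(base, kind, local_id):
    """Build an HTTP URI from a local identifier, percent-encoding unsafe characters."""
    return f"{base.rstrip('/')}/{kind}/{quote(str(local_id), safe='')}"


# A local employee number becomes a globally unique, resolvable name.
print(mint_uri("http://data.example.org/", "employee", 4711))
print(mint_uri("http://data.example.org/", "employee", "A 17/b"))
```

The domain in the base URI is what clarifies authority: whoever controls it decides what the identifier resolves to.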


Page 33: The Linked Data Lifecycle

★★★★★ Links to Other Data

Hyperlinks are the soul of the Web. The Web of Data is no different.


Page 35: The Linked Data Lifecycle

Summary

Linked Data Principles:

1 Use URIs to name things (not only documents, but also people, locations, concepts, etc.)

2 To enable agents (human users and machine agents alike) to look up those names, use HTTP URIs

3 When someone looks up a URI, provide useful information (structured data in RDF, SPARQL).

4 Include links to other URIs allowing agents to discover more things

5-Star-Data:

Five-star plan for realising an emerging web of data, dataset by dataset

2 stars: re-usable data

3 stars: open standards

4+5 stars: connect data silos



Page 38: The Linked Data Lifecycle

DBpedia

Community effort to extract structured information from Wikipedia and to make this information available on the Web

Allows asking sophisticated queries against Wikipedia, and linking other data sets on the Web to Wikipedia data

Semi-structured Wiki markup → structured information


Page 39: The Linked Data Lifecycle

Wikipedia Limitations

Simple questions that are hard to answer with Wikipedia:

What do Innsbruck and Leipzig have in common?

Who are mayors of central European towns elevated more than 1000 m?

Which movies star both Brad Pitt and Angelina Jolie?

All soccer players who played as goalkeeper for a club that has a stadium with more than 40,000 seats and who were born in a country with more than 10 million inhabitants


Page 40: The Linked Data Lifecycle

Structure in Wikipedia

Title

Abstract

Infoboxes

Geo-coordinates

Categories

Images

Links

other language versions

other Wikipedia pages

To the Web

Redirects

Disambiguation

...


Page 41: The Linked Data Lifecycle

DBpedia Information Extraction Framework

DBpedia Information Extraction Framework (DIEF)

Started in 2007

Hosted on Sourceforge and Github

Initially written in PHP, then fully rewritten in Scala and Java

Around 40 Contributors

See https://www.ohloh.net/p/dbpedia for detailed overview

Can potentially be adapted to other MediaWikis

Currently: Wiktionary (http://wiktionary.dbpedia.org)


Page 42: The Linked Data Lifecycle

DIEF - Overview


Page 43: The Linked Data Lifecycle

DIEF - Raw Infobox Extractor

WikiText syntax

{{Infobox Korean settlement
| title = Busan Metropolitan City
...
| area_km2 = 763.46
| pop = 3635389
| region = [[Yeongnam]]
}}

RDF serialization

dbp:Busan dbp:title "Busan Metropolitan City"
dbp:Busan dbp:area_km2 "763.46"^^xsd:float
dbp:Busan dbp:pop "3635389"^^xsd:int
dbp:Busan dbp:region dbp:Yeongnam
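The idea behind the raw infobox extractor can be approximated with a few regular expressions. This is a toy sketch in Python, not the DIEF code (which is written in Scala and Java): the function `infobox_to_triples` and its datatype guessing rules are assumptions made for illustration.

```python
import re


def infobox_to_triples(subject, wikitext):
    """Turn '| key = value' infobox lines into (subject, property, object) triples."""
    triples = []
    for line in wikitext.splitlines():
        match = re.match(r"\s*\|\s*([^=|]+?)\s*=\s*(.*?)\s*$", line)
        if not match:
            continue
        key, value = match.groups()
        if not value:
            continue  # skip empty infobox fields
        prop = "dbp:" + key.replace(" ", "_")
        link = re.fullmatch(r"\[\[([^\]|]+)(?:\|[^\]]*)?\]\]", value)
        if link:
            # A wiki link becomes a reference to another resource.
            obj = "dbp:" + link.group(1).strip().replace(" ", "_")
        elif re.fullmatch(r"-?\d+", value):
            obj = f'"{value}"^^xsd:int'
        elif re.fullmatch(r"-?\d+\.\d+", value):
            obj = f'"{value}"^^xsd:float'
        else:
            obj = f'"{value}"'
        triples.append((subject, prop, obj))
    return triples


WIKITEXT = """\
| title = Busan Metropolitan City
| area_km2 = 763.46
| pop = 3635389
| region = [[Yeongnam]]
"""
for triple in infobox_to_triples("dbp:Busan", WIKITEXT):
    print(*triple)
```

Because the property names are taken verbatim from the infobox keys, different templates produce different properties for the same concept.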


Page 44: The Linked Data Lifecycle

DIEF - Raw Infobox Extractor/Diversity


Page 46: The Linked Data Lifecycle

DIEF - Mapping-Based Infobox Extractor

Cleaner data:

Combine what belongs together (birth_place, birthplace)

Separate what is different (bornIn, birthplace)

Correct handling of datatypes

Mappings Wiki:

http://mappings.dbpedia.org

Everybody can contribute to new mappings or improve existing ones

≈ 170 editors
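Combining infobox keys that belong together amounts to a lookup from raw keys to ontology properties. The sketch below hard-codes a tiny mapping table; real mappings are maintained in the Mappings Wiki, and the names `MAPPINGS` and `map_properties` are illustrative assumptions.

```python
# Hypothetical excerpt of a mapping table: raw infobox key -> ontology property.
MAPPINGS = {
    "birth_place": "dbo:birthPlace",
    "birthplace": "dbo:birthPlace",
}


def map_properties(raw):
    """Merge raw infobox keys that map to the same ontology property."""
    mapped = {}
    for key, value in raw.items():
        prop = MAPPINGS.get(key)
        if prop is None:
            continue  # unmapped keys are left to the raw extractor
        mapped.setdefault(prop, set()).add(value)
    return mapped


print(map_properties({"birth_place": "Leipzig"}))
print(map_properties({"birthplace": "Leipzig", "shoe_size": "44"}))
```

Both spellings end up under one property, so a query only has to ask for `dbo:birthPlace`.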


Page 48: The Linked Data Lifecycle

URI/IRI schemes

http://lang.dbpedia.org is the main domain

For every article there exists a DBpedia resource in the form: http://lang.dbpedia.org/resource/ArticleName

Properties from the raw infobox extractor use the http://lang.dbpedia.org/property/ namespace

The ontology is global for all languages and lives under the http://dbpedia.org/ontology/ namespace

Note: for English, no language code is used:

http://dbpedia.org as main domain

http://dbpedia.org/resource/title for articles

http://dbpedia.org/property/title for properties
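The scheme can be written down as a small helper. The function `dbpedia_uri` is a made-up name, and replacing spaces with underscores follows the Wikipedia title convention; the global ontology namespace is deliberately not covered here.

```python
def dbpedia_uri(title, lang="en", kind="resource"):
    """Build a DBpedia URI; English uses the main domain without a language code."""
    host = "dbpedia.org" if lang == "en" else f"{lang}.dbpedia.org"
    return f"http://{host}/{kind}/{title.replace(' ', '_')}"


print(dbpedia_uri("Dresden"))
print(dbpedia_uri("Dresden", lang="de"))
print(dbpedia_uri("area_km2", lang="ko", kind="property"))
```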


Page 49: The Linked Data Lifecycle

Linked Data Publication via 303 Redirects

http://dbpedia.org/resource/Dresden - URI of the city of Dresden

http://dbpedia.org/page/Dresden - information resource describing the city of Dresden in HTML format

http://dbpedia.org/data/Dresden - information resource describing the city of Dresden in RDF/XML format

further formats supported, e.g. http://dbpedia.org/data/Dresden.n3 for N3


Page 50: The Linked Data Lifecycle

DBpedia Links

Data set                  Predicate               Count     Tool
Amsterdam Museum          owl:sameAs              627       S
BBC Wildlife Finder       owl:sameAs              444       S
Book Mashup               rdf:type, owl:sameAs    9 100
Bricklink                 dc:publisher            10 100
CORDIS                    owl:sameAs              314       S
Dailymed                  owl:sameAs              894       S
DBLP Bibliography         owl:sameAs              196       S
DBTune                    owl:sameAs              838       S
Diseasome                 owl:sameAs              2 300     S
Drugbank                  owl:sameAs              4 800     S
EUNIS                     owl:sameAs              3 100     S
Eurostat (Linked Stats)   owl:sameAs              253       S
Eurostat (WBSG)           owl:sameAs              137
CIA World Factbook        owl:sameAs              545       S


Page 51: The Linked Data Lifecycle

DBpedia Links

Data set                  Predicate                Count       Tool
flickr wrappr             dbp:hasPhotoCollection   3 800 000   C
Freebase                  owl:sameAs               3 600 000   C
GADM                      owl:sameAs               1 900
GeoNames                  owl:sameAs               86 500      S
GeoSpecies                owl:sameAs               16 000      S
GHO                       owl:sameAs               196         L
Project Gutenberg         owl:sameAs               2 500       S
Italian Public Schools    owl:sameAs               5 800       S
LinkedGeoData             owl:sameAs               103 600     S
LinkedMDB                 owl:sameAs               13 800      S
MusicBrainz               owl:sameAs               23 000
New York Times            owl:sameAs               9 700
OpenCyc                   owl:sameAs               27 100      C
OpenEI (Open Energy)      owl:sameAs               678         S


Page 52: The Linked Data Lifecycle

DBpedia Links

Data set       Predicate          Count        Tool
Revyu          owl:sameAs         6
Sider          owl:sameAs         2 000        S
TCMGeneDIT     owl:sameAs         904
UMBEL          rdf:type           896 400
US Census      owl:sameAs         12 600
WikiCompany    owl:sameAs         8 300
WordNet        dbp:wordnet_type   467 100
YAGO2          rdf:type           18 100 000

Sum                               27 211 732

(S: Silk, L: LIMES, C: custom script, missing: no regeneration)


Page 53: The Linked Data Lifecycle

DBpedia Links - Query Example

Compare funding per year (from FTS) and country with the gross domestic product of a country (from DBpedia)

SELECT * { {
    SELECT ?ftsyear ?dbpcountry (SUM(?amount) AS ?funding) {
      ?com rdf:type fts-o:Commitment .
      ?com fts-o:year ?year .
      ?year rdfs:label ?ftsyear .
      ?benefit fts-o:detailAmount ?amount .
      ?benefit fts-o:beneficiary ?beneficiary .
      ?beneficiary fts-o:country ?ftscountry .
      ?ftscountry owl:sameAs ?dbpcountry .
    } } {
    SELECT ?dbpcountry ?gdpyear ?gdpnominal {
      ?dbpcountry rdf:type dbo:Country .
      ?dbpcountry dbp:gdpNominal ?gdpnominal .
      ?dbpcountry dbp:gdpNominalYear ?gdpyear .
    } }
  FILTER ((?ftsyear = str(?gdpyear))) }


Page 54: The Linked Data Lifecycle

Infrastructure

DBpedia has two extraction modes:

Wikipedia-database-dump-based extraction

DBpedia Live synchronisation (more later)

DBpedia Dumps:

The DBpedia dump archive is located at: http://downloads.dbpedia.org/

The latest downloads are described at: http://dbpedia.org/Downloads

Official endpoint (by OpenLink): http://dbpedia.org/sparql


Page 55: The Linked Data Lifecycle

Query Answering

Back to our Wikipedia questions:

What do Innsbruck and Leipzig have in common?

Who are mayors of central European towns elevated more than 1000 m?

Which movies star both Brad Pitt and Angelina Jolie?

All soccer players who played as goalkeeper for a club that has a stadium with more than 40,000 seats and who were born in a country with more than 10 million inhabitants

Using the data extracted from Wikipedia and the public SPARQL endpoint, DBpedia can answer these questions.
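As an illustration, the movie question might be phrased as the SPARQL query below and sent to the public endpoint. This is a sketch under assumptions: the use of the `dbo:starring` property and the Virtuoso-style `format` parameter are not taken from the slides, and the code only builds the request URL without contacting the endpoint.

```python
from urllib.parse import urlencode

ENDPOINT = "http://dbpedia.org/sparql"

# Assumed modelling: films point to their actors via dbo:starring.
QUERY = """
PREFIX dbo: <http://dbpedia.org/ontology/>
PREFIX dbr: <http://dbpedia.org/resource/>
SELECT ?film WHERE {
  ?film dbo:starring dbr:Brad_Pitt .
  ?film dbo:starring dbr:Angelina_Jolie .
}
"""


def build_request_url(endpoint, query):
    """Encode a SPARQL query as an HTTP GET URL (SPARQL protocol 'query' parameter)."""
    params = urlencode(
        {"query": query.strip(), "format": "application/sparql-results+json"}
    )
    return f"{endpoint}?{params}"


url = build_request_url(ENDPOINT, QUERY)
print(url)
```

Fetching the resulting URL with any HTTP client should return the matching films, provided the endpoint is reachable and the data is modelled as assumed.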


Page 56: The Linked Data Lifecycle

DBpedia Live

DBpedia dumps are generated on a bi-annual basis

Wikipedia has around 100,000 to 150,000 page edits per day

DBpedia Live pulls page updates in real-time, and the extraction results update the triple store

In practice, a 5 minute update delay increases performance by 15%

Links

SPARQL Endpoint: http://live.dbpedia.org/sparql

Documentation: http://wiki.dbpedia.org/DBpediaLive

Statistics: http://live.dbpedia.org/LiveStats/


Page 57: The Linked Data Lifecycle

DBpedia Live - Overview


Page 58: The Linked Data Lifecycle

DBpedia Internationalization (I18n)

DBpedia Internationalization Committee founded:

http://wiki.dbpedia.org/Internationalization

Available DBpedia language editions in:

Korean, Greek, German, Polish, Russian, Dutch, Portuguese, Spanish, Italian, Japanese, French

Use the corresponding Wikipedia language edition for input

Mappings available for 23 languages


Page 59: The Linked Data Lifecycle

DBpedia I18n - Overview


Page 60: The Linked Data Lifecycle

Applications: Disambiguation

Named entity recognition and disambiguation

Tools such as: DBpedia Spotlight, AlchemyAPI, Semantic API, Open Calais, Zemanta and Apache Stanbol


Page 61: The Linked Data Lifecycle

Applications: Question Answering

DBpedia is the primary target for several QA systems in the Question Answering over Linked Data (QALD) workshop series

IBM Watson also relied on DBpedia


Page 62: The Linked Data Lifecycle

Applications: Faceted Browsing

Neofonie Browser

gFacet

OpenLink faceted browser (fct)


Page 63: The Linked Data Lifecycle

Applications: Search and Querying

Query Builder

RelFinder

SemLens


Page 64: The Linked Data Lifecycle

Applications: Digital Libraries & Archives

Virtual International Authority Files (VIAF) project as Linked Data

VIAF added a total of 250,000 reciprocal authority links to Wikipedia.

DBpedia can also provide:

Context information for bibliographic and archive records (e.g. an author's demographics, a film's homepage, an image etc.)

Stable and curated identifiers for linking.

The broad range of Wikipedia topics can form the basis for a thesaurus for subject indexing.


Page 65: The Linked Data Lifecycle

Applications: DBpedia Mobile

DBpedia Mobile is a location-centric DBpedia client application for mobile devices consisting of a map view, the Marbles Linked Data Browser and a GPS-enabled launcher application.


Page 66: The Linked Data Lifecycle

Applications: DBpedia Wiktionary

Wiktionary is a Wikimedia project: http://wiktionary.org

171 languages, 3M words for English.

Extracted using the DBpedia Information Extraction Framework

Easily configurable for every Wiktionary language edition

Pre-configured for German, Greek, English, Russian and French.

http://wiktionary.dbpedia.org

100 million triples

Lemon model


Page 67: The Linked Data Lifecycle

Other Applications

See http://wiki.dbpedia.org/Applications for a more complete list



Page 69: The Linked Data Lifecycle

Linked Data - Achievements and Challenges

Achievements:

1 Extension of the Web with a data commons (50B facts)

2 Vibrant, global RTD community

3 Industrial uptake begins (e.g. BBC, Thomson Reuters, Eli Lilly, NY Times, Facebook, Google, Yahoo)

4 Governmental adoption in sight

5 Establishing Linked Data as a deployment path for the Semantic Web.

Challenges:

1 Coherence: relatively few, expensively maintained links

2 Quality: partly low quality data and inconsistencies

3 Performance: still substantial penalties compared to relational data management

4 Data consumption: large-scale processing, schema mapping and data fusion still in their infancy

5 Usability: missing direct end-user tools and network effect.

These issues are closely related and should ultimately lead to an ecosystem of interlinked knowledge!


Page 70: The Linked Data Lifecycle

[Figure: Linked Data Lifecycle. Stages: Extraction; Storage/Querying; Manual revision/Authoring; Interlinking/Fusing; Classification/Enrichment; Quality Analysis; Evolution/Repair; Search/Browsing/Exploration]


Page 72: The Linked Data Lifecycle

Extraction

From unstructured sources

Formats: plain text

Methods: NLP, text mining, ontology learning

From semi-structured sources

Formats: wiki markup, tags

Tools: DBpedia framework (Wikipedia, Wiktionary)

From structured sources

Formats: databases, spreadsheets, XML

RDB2RDF tools: Sparqlify, D2R, Triplify

CSV converters: RDF extension of Google Refine
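For structured sources, the core of a conversion is a mechanical mapping from rows and columns to subjects and properties. The sketch below shows that idea for CSV; it is not how any of the tools named above work internally, and the function `csv_to_triples`, the example.org base URI and the `vocab/` path are illustrative assumptions.

```python
import csv
import io


def csv_to_triples(text, base, key_column):
    """Map each CSV row to triples: one subject per row, one property per column."""
    triples = []
    for row in csv.DictReader(io.StringIO(text)):
        subject = f"<{base}{row[key_column]}>"
        for column, value in row.items():
            if column == key_column or not value:
                continue  # the key names the subject; empty cells yield no triple
            triples.append((subject, f"<{base}vocab/{column}>", f'"{value}"'))
    return triples


CSV_TEXT = "id,name,population\n1,Leipzig,\n2,Dresden,530000\n"
for triple in csv_to_triples(CSV_TEXT, "http://example.org/", "id"):
    print(*triple, ".")
```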


Page 73: The Linked Data Lifecycle

Extraction Challenges

From unstructured sources

Improve F-Measure of existing NLP approaches (OpenCalais, Ontos API)

Develop standardized, LOD enabled interfaces between NLP tools (NLP2RDF)

From semi-structured sources

Efficient bi-directional synchronization

From structured sources

Declarative syntax and semantics of data model transformations (W3C WG RDB2RDF)

Orthogonal challenges

Using LOD as background knowledge

Provenance


Page 75: The Linked Data Lifecycle

RDF Data Management

From unstructured sources

SPARQL/RDF access is still slower than relational data management by a factor of 2-10

Performance increases steadily

Comprehensive, well-supported open-source and commercial implementations are available:

OpenLink's Virtuoso (os+commercial)

OWLIM-Lite (free), OWLIM-SE, OWLIM-Enterprise

Talis (hosted)

Bigdata (distributed)

Allegrograph (commercial)

Mulgara (os)


Page 76: The Linked Data Lifecycle

Storage and Querying Challenges

Reduce the performance gap between relational and RDF data management

SPARQL Query extensions: Spatial/semantic/temporal data management

View maintenance / adaptive reorganization based on common access patterns

More realistic benchmarks


Page 78: The Linked Data Lifecycle

Authoring

Integrated in Existing Environments: Tiki

Data oriented: RDFauthor, rdfEditor

Schema oriented: Protégé, TopBraid Composer, NeOn Toolkit, Swoop, Neologism, Knoodl


Page 79: The Linked Data Lifecycle

Authoring: Semantic Wikis

1 Semantic (Text) Wikis

Authoring of semantically annotated texts

Semantic MediaWiki, KiWi, (Wikipedia+DBpedia)

2 Semantic Data Wikis

Direct authoring of structured information (i.e. RDF, RDF-Schema, OWL)

OntoWiki


Page 82: The Linked Data Lifecycle

Interlinking

Data Web is an uncontrolled environment → proliferation of equivalent or similar entities → need for links / merging

Currently only few RDF triples are links

Manual Link Discovery:

Sindice Integration, LODStats, Semantic Pingback

Tool supported / Semi-Automatic:

SILK, LIMES, COMA, RDF-AI

Usually via mapping specifications / heuristics

Machine Learning / Automatic:

RAVEN, EAGLE, SILK GP
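A mapping specification in its simplest form is a similarity measure plus a threshold. The sketch below compares labels with Python's `difflib`; it is a naive quadratic illustration and does not reflect how SILK or LIMES are implemented. The function `discover_links`, the threshold of 0.9 and the example.org URIs are assumptions.

```python
from difflib import SequenceMatcher


def discover_links(source, target, threshold=0.9):
    """Propose owl:sameAs links between resources whose labels are similar enough."""
    links = []
    for s_uri, s_label in source.items():
        for t_uri, t_label in target.items():
            score = SequenceMatcher(None, s_label.lower(), t_label.lower()).ratio()
            if score >= threshold:
                links.append((s_uri, "owl:sameAs", t_uri))
    return links


# Two hypothetical datasets, each mapping resource URIs to labels.
SOURCE = {"http://a.example.org/Leipzig": "Leipzig"}
TARGET = {
    "http://b.example.org/1": "leipzig",
    "http://b.example.org/2": "Dresden",
}
print(discover_links(SOURCE, TARGET))
```

Comparing every source resource with every target resource does not scale, which is one reason dedicated link discovery frameworks exist.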


Page 83: The Linked Data Lifecycle

Interlinking Challenges

Apply work in the de-duplication/record linkage literature

Consider the open world nature of Linked Data

Use LOD background knowledge

Zero-configuration linking

Explore active learning approaches, which integrate users in a feedback loop

Maintain a 24/7 linking service: Linked Open Data Around-The-Clock project (http://latc-project.eu/)


Page 84: The Linked Data Lifecycle


Page 85: The Linked Data Lifecycle

Enrichment

Currently, there is a lack of knowledge bases with sophisticated schema information and instance data adhering to this schema

Goal: powerful reasoning, consistency checking and querying

Manual:

Via ontology editors, DBpedia mappings

(Semi-)Automatic:

DL-Learner, Statistical Schema Induction


Page 86: The Linked Data Lifecycle

Enrichment: Example

Given: knowledge base with property birthPlace (i.e. triples using that property) but no information on the semantics of birthPlace

Possible enrichment:

ObjectProperty: birthPlace

Characteristics: Functional

Domain: Person

Range: Place

SubPropertyOf: hasBeenAt

Benefits:

axioms serve as documentation for purpose and correct usage of schema elements

additional implicit information can be inferred

improve the applicability of schema debugging techniques
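
As a rough illustration of how such axioms yield implicit information, the sketch below applies the Domain, Range and SubPropertyOf axioms of the example to a single birthPlace triple. The resource names Leibniz and Leipzig and the helper `infer` are made up for illustration; this is not how DL-Learner or an OWL reasoner works internally.

```python
# Axioms from the enrichment example, written as plain dictionaries.
DOMAIN = {"birthPlace": "Person"}
RANGE = {"birthPlace": "Place"}
SUB_PROPERTY_OF = {"birthPlace": "hasBeenAt"}


def infer(triples):
    """Return the triples implied by the domain, range and subproperty axioms."""
    inferred = set()
    for s, p, o in triples:
        if p in DOMAIN:
            inferred.add((s, "rdf:type", DOMAIN[p]))
        if p in RANGE:
            inferred.add((o, "rdf:type", RANGE[p]))
        if p in SUB_PROPERTY_OF:
            inferred.add((s, SUB_PROPERTY_OF[p], o))
    return inferred


# Hypothetical instance data: one triple using birthPlace.
facts = {("Leibniz", "birthPlace", "Leipzig")}
for triple in sorted(infer(facts)):
    print(triple)
```

One asserted triple leads to three inferred ones: a type for the subject, a type for the object, and a hasBeenAt statement.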


Page 87: The Linked Data Lifecycle

Repair

Ontology Debugging: OWL reasoning to detect inconsistencies and unsatisfiable classes + detect the most likely sources for the problems

basic task: provide feedback to user for resolving undesired entailments

justification J ⊆ O of an entailment is a minimal set of axioms from which the entailment can be drawn


Page 88: The Linked Data Lifecycle


Page 89: The Linked Data Lifecycle

Linked Data Quality Analysis

Quality on the Data Web varies widely

Hand-crafted or expensively curated knowledge bases (e.g. DBLP, UMLS) vs. knowledge bases extracted from text or Web 2.0 sources (DBpedia)

Quality = Fitness for use

Often it is not necessary to fix all problems, but to know about them

30+ quality dimensions defined in a recent survey

Research Challenge

Establish measures for assessing the authority, provenance and reliability of Data Web resources


Page 90: The Linked Data Lifecycle

Evolution (© CC-BY-SA by alasis on flickr)


Page 91: The Linked Data Lifecycle

KB Evolution

Tasks:

Performing knowledge base changes / refactoring

Ensuring consistency of related knowledge

Managing changes, e.g. undo operations

Update materialized inferred data upon changes

Update materialised links to other data upon changes

Tools:

Protégé - PROMPT and change management plugins

EvoPat - easily re-usable and sharable evolution patterns defined via SPARQL

PatOMat - ontology transformation framework


Page 92: The Linked Data Lifecycle


Page 93: The Linked Data Lifecycle

Exploration

RDF data can be complex (as discussed by Pascal Hitzler)

Exploration phase aims to make data accessible to non-experts

Options:

Faceted Browsing
Question Answering
Query Builders
Visualisation of statistical or geospatial data
. . .


Page 94: The Linked Data Lifecycle

Catalogus Professorum Lipsiensis


Page 95: The Linked Data Lifecycle

Visual Query Builder


Page 96: The Linked Data Lifecycle

Relationship Finder in CPL


Page 97: The Linked Data Lifecycle

[Figure: Linked Data Lifecycle diagram with the stages Extraction, Storage/Querying, Manual revision/Authoring, Interlinking/Fusing, Classification/Enrichment, Quality Analysis, Evolution/Repair, Search/Browsing/Exploration]


Page 98: The Linked Data Lifecycle

Make the Web a Linked Data Washing Machine


Page 99: The Linked Data Lifecycle

Tool Support for Life-Cycle?

Many SW tools support one or more life-cycle stages

Linked Data Stack (http://stack.linkeddata.org) provides a consolidated repository of such tools

Each tool is a Debian package

Lightweight integration between tools via common vocabularies and SPARQL

Demonstrator interfaces for showing tools in combination

Developed by LOD2 and GeoKnow EU projects


Page 100: The Linked Data Lifecycle

Outline

1 Introduction to Linked Data

2 Linked Dataset Example: DBpedia

3 Linked Data Life-Cycle Overview

4 Knowledge Extraction

5 Data Integration / Linking

6 Enrichment

7 Repair

8 Knowledge Base Exploration / Querying


Page 101: The Linked Data Lifecycle

Knowledge Extraction

Knowledge Extraction is the creation of knowledge from structured (relational databases, XML) and unstructured (text, documents, images) sources.

Resulting knowledge needs to be in a machine-readable and machine-interpretable format and facilitate inferencing

Similar to Information Extraction (NLP) and ETL (Data Warehouse), but main difference: extraction result goes beyond the creation of structured information or the transformation into a relational schema

Requires re-use of existing formal knowledge (reusing ontologies) or the generation of a schema based on the source data


Page 102: The Linked Data Lifecycle

Categorisation of Approaches

Source - Examples: plain text, relational databases, XML, CSV

Exposition - How is the extracted knowledge made explicit? How can you query and perform inference?

Synchronization - Is the knowledge extraction process executed once to produce a dump or is the result synchronized with the source? Are changes to the result written back (bi-directional)?

Reuse of Vocabularies - Can popular ontologies (Good Relations, FOAF, . . . ) be re-used to simplify global data integration?

Automatisation - manual, semi-automatic, automatic

Domain Ontology Required - Does the approach require a pre-defined ontology or can it create a schema from the source (e.g. ontology learning)?


Page 103: The Linked Data Lifecycle

Extraction from Structured Sources to RDF

Simple mappings from RDB tables/views to RDF

Direct mapping of the model of relational databases to RDF
Table ↦ OWL class
Row ↦ Instance s of this class
Cell with value o in column p ↦ Triple (s, p, o)
Details: http://www.w3.org/TR/rdb-direct-mapping/

Complex mappings of relational databases to RDF
Additional refinements can be applied to the 1:1 mapping to improve the usefulness of the RDF output
Extract or learn an OWL schema from the given database schema
Map the schema and its contents to a pre-existing domain ontology

Powerful mapping languages: R2RML, SML

XML

XML tree structure can be directly converted to RDF graph structure
Complex mappings possible, e.g. via XSLT processors
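
The direct mapping rule for relational data (table to class, row to instance, cell to triple) can be sketched in a few lines. This is a simplified illustration, not the full W3C Direct Mapping: the base IRI `http://example.org/`, the IRI layout and the function name `direct_map` are assumptions, and foreign keys, datatypes and composite keys are ignored.

```python
from typing import Dict, Iterable, List, Tuple

Triple = Tuple[str, str, str]


def direct_map(table: str, rows: Iterable[Dict[str, object]], key: str,
               base: str = "http://example.org/") -> List[Triple]:
    """Simplified direct mapping: one class per table, one subject per row."""
    triples: List[Triple] = []
    for row in rows:
        # Row -> instance s, identified by the primary key value.
        s = f"<{base}{table}/{key}={row[key]}>"
        # Table -> class of the instance.
        triples.append((s, "rdf:type", f"<{base}{table}>"))
        for column, value in row.items():
            if value is None:  # NULL cells produce no triple
                continue
            # Cell with value o in column p -> triple (s, p, o).
            triples.append((s, f"<{base}{table}#{column}>", f'"{value}"'))
    return triples


nodes = [{"id": 1, "geom": "POINT(0 0)"}, {"id": 2, "geom": "POINT(1 1)"}]
triples = direct_map("nodes", nodes, key="id")
for t in triples:
    print(" ".join(t), ".")
```

The output is deliberately schema-agnostic; the refinements listed under complex mappings (custom IRIs, vocabulary reuse) are what languages such as R2RML and SML add on top.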


Page 104: The Linked Data Lifecycle

Extraction from Natural Language Sources

80% of the information in business documents is in unstructured natural language¹

(-) Increased complexity and decreased quality of extraction

(+) Potential for a massive acquisition of extracted knowledge

Traditional Information Extraction (IE)

Recognize and categorise elements in text
Techniques: Named Entity Recognition (NER), Coreference Resolution (CO), . . .

Ontology Learning (OL) from Text

Learn whole ontologies from natural language text
Usually extracted (semi-)automatically

¹ Wimalasuriya, Dou. "Ontology-based information extraction: [. . . ]" Journal of Information Science


Page 105: The Linked Data Lifecycle

LinkedGeoData + Sparqlify

Example: LinkedGeoData Knowledge Extraction Project using Sparqlify

Structure

Motivation

OpenStreetMap

LGD Architecture

Mapping

Access (How LinkedGeoData is published)

Use Cases


Page 106: The Linked Data Lifecycle

Motivation

Ease information integration tasks that require spatial knowledge, such as

Offerings of bakeries next door
Map of distributed branches of a company
Historical sights along a bicycle track

LOD cloud contains data sets with spatial features

e.g. Geonames, DBpedia, US census, EuroStat
But: they are restricted to popular or large entities like countries, famous places etc. or specific regions

Therefore they lack buildings, roads, mailboxes, etc.


Page 107: The Linked Data Lifecycle

OpenStreetMap - Datamodel

Basic entities are:

Nodes: Latitude, Longitude.
Ways: Sequence of nodes.
Relations: Associations between any number of nodes, ways and relations. Every member in a relation plays a certain role.

Each entity may be described with tags (= key-value pairs)

A way is closed if the ID of the last referenced node equals that of the first one.

Whether a closed way denotes a linear ring or a polygon (i.e. whether the enclosed area is part of the respective OSM entity) depends on the tags.
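
The closed-way rule can be made concrete with a small sketch. The class `Way`, the method names and in particular the set `AREA_KEYS` are illustrative assumptions; which tags actually turn a closed way into a polygon is decided by OSM tagging conventions that the slide does not spell out.

```python
from dataclasses import dataclass, field
from typing import Dict, List

# Illustrative only: tag keys assumed here to mark a closed way as an area.
AREA_KEYS = {"building", "landuse", "area"}


@dataclass
class Way:
    node_ids: List[int]                                  # ordered node references
    tags: Dict[str, str] = field(default_factory=dict)   # key-value pairs

    def is_closed(self) -> bool:
        """A way is closed if the last referenced node ID equals the first."""
        return len(self.node_ids) > 1 and self.node_ids[0] == self.node_ids[-1]

    def geometry_kind(self) -> str:
        """Classify the way; for closed ways the tags decide ring vs. polygon."""
        if not self.is_closed():
            return "linestring"
        if any(key in AREA_KEYS for key in self.tags):
            return "polygon"
        return "linear ring"


house = Way([1, 2, 3, 1], {"building": "yes"})
roundabout = Way([4, 5, 6, 4], {"junction": "roundabout"})
street = Way([7, 8, 9], {"highway": "residential"})
for way in (house, roundabout, street):
    print(way.geometry_kind())
```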


Page 108: The Linked Data Lifecycle

Example: Leipzig's Zoo


Page 109: The Linked Data Lifecycle

Comparison: Leipzig's Zoo (OpenStreetMap)


Page 110: The Linked Data Lifecycle

Comparison: Leipzig's Zoo (GoogleMaps)


Page 111: The Linked Data Lifecycle

LGD Architecture


Page 112: The Linked Data Lifecycle

Tag Mappings

Key-value pairs are mapped to RDF resources

Each pair (k, v) can be annotated with datatypes, language tags, classes

Mappings are themselves tables

Example table: lgd_map_literal

k          property       lang
name       rdfs:label
name:en    rdfs:label     en
alt_label  skos:altLabel
note       rdfs:comment
. . .      . . .          . . .


Page 113: The Linked Data Lifecycle

View Definition

RDF mapping of the data from a PostgreSQL database

Create View lgd_nodes As

Construct

?n a lgdm:Node .

?n geom:geometry ?g .

?g ogc:asWKT ?o .

With

?n = uri(lgd:node, ?id)

?g = uri(lgd-geom:node, ?id)

?o = typedLiteral(?geom, ogc:wktLiteral)

From

nodes


Page 114: The Linked Data Lifecycle

Sparqlify

SPARQL-SQL Rewriter

Rewrites SPARQL queries according to the view definition
Platform module offers SPARQL endpoint and Linked Data interface

https://github.com/AKSW/Sparqlify


Page 115: The Linked Data Lifecycle

Rest-API

Offers REST methods for frequent queries

Based on SPARQL (Virtuoso) endpoint


Page 116: The Linked Data Lifecycle

Downloads

RDF dataset for download

Generated using Construct ?s ?p ?o

http://downloads.linkedgeodata.org


Page 117: The Linked Data Lifecycle

Ontology

Enriched classes and properties with multilingual labels from TranslateWiki

http://translatewiki.net

Imported icons for 90 classes from the freely available icon collection from the SJJB Management

http://www.sjjb.co.uk/mapicons/


Page 118: The Linked Data Lifecycle

SML Mapping Examples

The following slides demonstrate how to map relational data to RDF with the Sparqlification Mapping Language (SML).

The examples use these prefixes:

Prefixes

prefix    IRI
rdfs      http://www.w3.org/2000/01/rdf-schema#
ogc       http://www.opengis.net/ont/geosparql#
geom      http://geovocab.org/geometry#
lgd       http://linkedgeodata.org/triplify/
lgd-geom  http://linkedgeodata.org/geometry/


Page 119: The Linked Data Lifecycle

SML - Mapping Example I: The Goal (1/4)

Input Table

nodes
id  geom
1   POINT(0 0)
2   POINT(1 1)

How to map tables to RDF?

How to introduce the commonly used distinction in GIS between feature and geometry?

Aimed for RDF Output

@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .

...

lgd:node1 geom:geometry lgd-geom:node1 .

lgd:node2 geom:geometry lgd-geom:node2 .

lgd-geom:node1 ogc:asWKT "POINT(0 0)"^^ogc:wktLiteral .

lgd-geom:node2 ogc:asWKT "POINT(1 1)"^^ogc:wktLiteral .


Page 120: The Linked Data Lifecycle

SML - Mapping Example I: SML Syntax Outline (2/4)

Input Table

nodes
id  geom
1   POINT(0 0)
2   POINT(1 1)

Create View myNodesView As

Construct

...

With

...

From

...

Aimed for RDF Output

@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .

...

lgd:node1 geom:geometry lgd-geom:node1 .

lgd:node2 geom:geometry lgd-geom:node2 .

lgd-geom:node1 ogc:asWKT "POINT(0 0)"^^ogc:wktLiteral .

lgd-geom:node2 ogc:asWKT "POINT(1 1)"^^ogc:wktLiteral .


Page 121: The Linked Data Lifecycle

SML - Mapping Example I: Construct and From (3/4)

Input Table

nodes
id  geom
1   POINT(0 0)
2   POINT(1 1)

Create View myNodesView As

Construct

?n geom:geometry ?g .

?g ogc:asWKT ?o

With

...

From

nodes

Aimed for RDF Output

@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .

...

lgd:node1 geom:geometry lgd-geom:node1 .

lgd:node2 geom:geometry lgd-geom:node2 .

lgd-geom:node1 ogc:asWKT "POINT(0 0)"^^ogc:wktLiteral .

lgd-geom:node2 ogc:asWKT "POINT(1 1)"^^ogc:wktLiteral .


Page 122: The Linked Data Lifecycle

SML - Mapping Example I: Complete! (4/4)

Input Table

nodes
id  geom
1   POINT(0 0)
2   POINT(1 1)

Create View myNodesView As

Construct

?n geom:geometry ?g .

?g ogc:asWKT ?o

With

?n = uri(lgd:node, ?id)

?g = uri(lgd-geom:node, ?id)

?o = typedLiteral(?geom,

ogc:wktLiteral)

From

nodes

Aimed for RDF Output

@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .

...

lgd:node1 geom:geometry lgd-geom:node1 .

lgd:node2 geom:geometry lgd-geom:node2 .

lgd-geom:node1 ogc:asWKT "POINT(0 0)"^^ogc:wktLiteral .

lgd-geom:node2 ogc:asWKT "POINT(1 1)"^^ogc:wktLiteral .
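
To make the semantics of the completed view easier to follow, the sketch below emulates it on the two input rows. It assumes that SML's uri() simply concatenates its arguments and that typedLiteral() attaches the datatype to the lexical value, which is what the target output suggests; the Python function names are hypothetical stand-ins, not part of Sparqlify.

```python
def uri(prefix: str, value) -> str:
    """Assumed behaviour of SML's uri(): concatenate its arguments."""
    return f"{prefix}{value}"


def typed_literal(lexical: str, datatype: str) -> str:
    """Assumed behaviour of SML's typedLiteral(): lexical form plus datatype."""
    return f'"{lexical}"^^{datatype}'


def nodes_view(rows):
    """Emulate the view myNodesView: two triples per row of the table `nodes`."""
    lines = []
    for row in rows:
        n = uri("lgd:node", row["id"])          # ?n = uri(lgd:node, ?id)
        g = uri("lgd-geom:node", row["id"])     # ?g = uri(lgd-geom:node, ?id)
        o = typed_literal(row["geom"], "ogc:wktLiteral")
        lines.append(f"{n} geom:geometry {g} .")  # feature -> geometry
        lines.append(f"{g} ogc:asWKT {o} .")      # geometry -> WKT literal
    return lines


nodes = [{"id": 1, "geom": "POINT(0 0)"}, {"id": 2, "geom": "POINT(1 1)"}]
for line in nodes_view(nodes):
    print(line)
```

The printed triples match the target RDF output up to ordering. Sparqlify itself does not materialize rows like this; it rewrites SPARQL queries to SQL using the view definition.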


Page 123: The Linked Data Lifecycle

SML Mapping Examples

A more complex example, which demonstrates the use of an SQL mapping table and an SQL helper view.


Page 124: The Linked Data Lifecycle

SML - Mapping Example II: The Goal (1/8)

Input Table

node_tags
id  k            v
1   name         Universitaet Leipzig
1   name:en      University of Leipzig
1   amenity      university
1   addr:street  Augustusplatz
1   addr:city    Leipzig

Aimed for RDF Output

@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .

@prefix lgd: <http://linkedgeodata.org/triplify/> .

lgd:node1 rdfs:label "Universitaet Leipzig" .

lgd:node1 rdfs:label "University of Leipzig"@en .


Page 125: The Linked Data Lifecycle

SML - Mapping Example II: Source Data (2/8)

OSM Table

node_tags
id  k            v
1   name         Universitaet Leipzig
1   name:en      University of Leipzig
1   amenity      university
1   addr:street  Augustusplatz
1   addr:city    Leipzig


Page 126: The Linked Data Lifecycle

SML - Mapping Example II: Mapping Table (3/8)

OSM Table

node_tags
id  k            v
1   name         Universitaet Leipzig
1   name:en      University of Leipzig
1   amenity      university
1   addr:street  Augustusplatz
1   addr:city    Leipzig

RDF Mapping Table

lgd_map_literal
k          property       lang
name       rdfs:label
name:en    rdfs:label     en
alt_label  skos:altLabel
note       rdfs:comment
. . .      . . .          . . .


Page 127: The Linked Data Lifecycle

SML - Mapping Example II: Helper View (4/8)

OSM Table

node_tags
id  k            v
1   name         Universitaet Leipzig
1   name:en      University of Leipzig
1   amenity      university
1   addr:street  Augustusplatz
1   addr:city    Leipzig

RDF Mapping Table

lgd_map_literal
k          property       lang
name       rdfs:label
name:en    rdfs:label     en
alt_label  skos:altLabel
note       rdfs:comment
. . .      . . .          . . .

Helper View

lgd_node_tags_literal
id     property    v                      lang
1      rdfs:label  Universitaet Leipzig
1      rdfs:label  University of Leipzig  en
. . .  . . .       . . .                  . . .

SELECT id, property, v, lang FROM node_tags, lgd_map_literal

WHERE node_tags.k = lgd_map_literal.k


Page 128: The Linked Data Lifecycle

SML - Mapping Example II: SML View (5/8)

Logical Table SML View

lgd_node_tags_literal
id     property    v            lang
1      rdfs:label  Univ. L.
1      rdfs:label  Univ. of L.  en
. . .  . . .       . . .        . . .

Create View lgd_node_tags_text As

Construct


Page 129: The Linked Data Lifecycle

SML - Mapping Example II: SML View (6/8)

Logical Table SML View

lgd_node_tags_literal
id     property    v            lang
1      rdfs:label  Univ. L.
1      rdfs:label  Univ. of L.  en
. . .  . . .       . . .        . . .

Create View lgd_node_tags_text As

Construct

?s ?p ?o .

With

...

From

lgd_node_tags_literal


Page 130: The Linked Data Lifecycle

SML - Mapping Example II: SML View (7/8)

Logical Table SML View

lgd_node_tags_literal
id     property    v            lang
1      rdfs:label  Univ. L.
1      rdfs:label  Univ. of L.  en
. . .  . . .       . . .        . . .

Create View lgd_node_tags_text As

Construct

?s ?p ?o .

With

?s = uri(lgd:node, ?id)

?p = uri(?property)

?o = plainLiteral(?v, ?lang)

From

lgd_node_tags_literal


Page 131: The Linked Data Lifecycle

SML - Mapping Example II: SML View (8/8)

Logical Table SML View

lgd_node_tags_literal
id     property    v            lang
1      rdfs:label  Univ. L.
1      rdfs:label  Univ. of L.  en
. . .  . . .       . . .        . . .

Create View lgd_node_tags_text As

Construct

?s ?p ?o .

With

?s = uri(lgd:node, ?id)

?p = uri(?property)

?o = plainLiteral(?v, ?lang)

From

lgd_node_tags_literal

Resulting RDF

@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .

@prefix lgd: <http://linkedgeodata.org/triplify/> .

lgd:node1 rdfs:label "Universitaet Leipzig" .

lgd:node1 rdfs:label "University of Leipzig"@en .
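
The whole pipeline of Example II (source table, mapping table, helper join, literal construction) can be replayed with an in-memory SQLite database. This is only a sketch under assumptions: SQLite stands in for the PostgreSQL database, `plain_literal` is a hypothetical stand-in for SML's plainLiteral(), and only a subset of the example rows is loaded.

```python
import sqlite3


def plain_literal(value, lang):
    """Stand-in for plainLiteral(?v, ?lang): add a language tag if one is given."""
    return f'"{value}"@{lang}' if lang else f'"{value}"'


def label_triples():
    con = sqlite3.connect(":memory:")
    con.executescript("""
        CREATE TABLE node_tags (id INTEGER, k TEXT, v TEXT);
        CREATE TABLE lgd_map_literal (k TEXT, property TEXT, lang TEXT);
        INSERT INTO node_tags VALUES
            (1, 'name', 'Universitaet Leipzig'),
            (1, 'name:en', 'University of Leipzig'),
            (1, 'amenity', 'university');
        INSERT INTO lgd_map_literal VALUES
            ('name', 'rdfs:label', NULL),
            ('name:en', 'rdfs:label', 'en');
    """)
    # Same join as the helper view lgd_node_tags_literal on the slide.
    rows = con.execute(
        "SELECT id, property, v, lang FROM node_tags, lgd_map_literal "
        "WHERE node_tags.k = lgd_map_literal.k"
    ).fetchall()
    con.close()
    # ?s = uri(lgd:node, ?id), ?p = uri(?property), ?o = plainLiteral(?v, ?lang)
    return sorted(
        f"lgd:node{node_id} {prop} {plain_literal(v, lang)} ."
        for node_id, prop, v, lang in rows
    )


for line in label_triples():
    print(line)
```

Tags without an entry in the mapping table (here amenity) drop out of the join, so only the two label triples are produced.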


Page 132: The Linked Data Lifecycle

Further Tag Mappings

lgd_map_datatype
k       datatype
seats   integer
unisex  boolean

lgd_map_property
k        property
website  foaf:homepage

lgd_map_resource_k
k        property  object
highway  rdf:type  lgdo:HighwayThing

lgd_map_resource_kv
k         v      property  object
waterway  river  rdf:type  lgdo:River


Page 133: The Linked Data Lifecycle

LGD Edit Tool

Multi User Tag Mapping WebApp


Page 134: The Linked Data Lifecycle

Resources

Sparqlify: http://sparqlify.org

LinkedGeoData: http://linkedgeodata.org

Tag Mappings: https://github.com/GeoKnow/LinkedGeoData/blob/master/linkedgeodata-core/src/main/resources/org/aksw/linkedgeodata/sql/Mappings.sql

SML View Definitions: https://github.com/GeoKnow/LinkedGeoData/blob/master/linkedgeodata-core/src/main/resources/org/aksw/linkedgeodata/sml/LinkedGeoData-Triplify-IndividualViews.sml


Page 135: The Linked Data Lifecycle

Statistics (15 August 2013)

Complete OSM planet file corresponds to ∼ 20.000.000.000 triples

Virtual access via Sparqlify

Downloads limited to selected classes: 292.780.188 triples

153.613.243 triples of Nodes
139.166.945 triples of Ways
Relations not yet available for download

Among them

532.812 PlaceOfWorship
82.788 RailwayStation
72.091 Toilets
71.613 Town
19.937 City


Page 136: The Linked Data Lifecycle

Access

Materialized SPARQL Endpoint (based on Virtuoso DB, download datasets loaded)

http://linkedgeodata.org/sparql

http://linkedgeodata.org/snorql

Virtual SPARQL Endpoint (based on Sparqlify, access to 20B triples, limited SPARQL 1.0 support)

http://linkedgeodata.org/vsparql

http://linkedgeodata.org/vsnorql

REST Interface (based on the Virtual SPARQL Endpoint)

Supports limited queries (e.g. circular/rectangular area, filtering by labels)

Downloads

http://downloads.linkedgeodata.org

Monthly updates on the above datasets envisioned


Page 137: The Linked Data Lifecycle

Use Cases: Augmented Reality


Page 138: The Linked Data Lifecycle

Use Cases: Generic Browsing


Page 139: The Linked Data Lifecycle

Use Cases: Generic Browsing


Page 140: The Linked Data Lifecycle

Outline

1 Introduction to Linked Data

2 Linked Dataset Example: DBpedia

3 Linked Data Life-Cycle Overview

4 Knowledge Extraction

5 Data Integration / Linking

6 Enrichment

7 Repair

8 Knowledge Base Exploration / Querying


Page 141: The Linked Data Lifecycle

Why Link Discovery?

1 Fourth Linked Data principle

2 Links are central for

Cross-ontology QA
Data Integration
Reasoning
Federated Queries
...

3 2011 topology of the LOD Cloud:

31+ billion triples
≈ 0.5 billion links
owl:sameAs in most cases


Page 142: The Linked Data Lifecycle

Why is it difficult?

1 Time complexity

Large number of triples
Quadratic a-priori runtime
69 days for mapping cities from DBpedia to Geonames (1 ms per comparison)
Decades for linking DBpedia and LGD
. . .

Definition (Link Discovery)

Given sets $S$ and $T$ of resources and a relation $R$
Task: Find $M = \{(s, t) \in S \times T : R(s, t)\}$
Common approaches:

Find $M' = \{(s, t) \in S \times T : \sigma(s, t) \geq \theta\}$ (with a similarity function $\sigma$)
Find $M' = \{(s, t) \in S \times T : \delta(s, t) \leq \theta\}$ (with a distance function $\delta$)
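
The quadratic cost behind these numbers is easy to see in a naive implementation of the distance-based variant. The sketch below is a baseline for illustration, not the LIMES algorithm; the function names and the dataset sizes in the runtime estimate are hypothetical (the sizes were picked only so that the estimate lands near the 69 days quoted above).

```python
from itertools import product


def brute_force_links(source, target, distance, theta):
    """Compare every pair (s, t): |S| * |T| distance computations."""
    return [(s, t) for s, t in product(source, target) if distance(s, t) <= theta]


def estimated_days(size_s, size_t, ms_per_comparison=1.0):
    """Runtime of the brute-force approach in days."""
    seconds = size_s * size_t * ms_per_comparison / 1000.0
    return seconds / 86400.0


# Toy one-dimensional example with the absolute difference as distance.
S = [0.0, 5.0, 10.0]
T = [0.4, 9.5, 20.0]
links = brute_force_links(S, T, lambda s, t: abs(s - t), theta=1.0)
print(links)

# Hypothetical sizes: 100,000 x 60,000 resources at 1 ms per comparison.
print(round(estimated_days(100_000, 60_000), 1), "days")
```

Because the cost grows with the product of the two set sizes, doubling both datasets quadruples the runtime, which motivates the lossless filtering approaches on the following slides.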


Page 143: The Linked Data Lifecycle

Why is it difficult?

2 Complexity of specifications

Combination of several attributes required for high precision
Tedious discovery of most adequate mapping
Dataset-dependent similarity functions


Page 144: The Linked Data Lifecycle

LIMES Framework


Page 145: The Linked Data Lifecycle

Runtime Optimization

Reduce the number of comparisons $C(A) \geq |M'|$ (assuming we need all σ/θ values for links)

Maximize reduction ratio:

$RR(A) = 1 - \frac{C(A)}{|S||T|}$

Question

Can we devise lossless approaches with guaranteed RR?

Advantages

Space management
Runtime prediction
Resource scheduling



Page 147: The Linked Data Lifecycle

RR Guarantee

Best achievable reduction ratio: $RR_{max} = 1 - \frac{|M'|}{|S||T|}$

Approach $H(\alpha)$ fulfills the RR guarantee criterion iff:

$\forall r < RR_{max}, \exists \alpha : RR(H(\alpha)) \geq r$

Here, we use the relative reduction ratio (RRR):

$RRR(A) = \frac{RR_{max}}{RR(A)}$
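
A short numeric sketch may help to relate the three quantities. The function names and the example sizes are made up for illustration; only the formulas come from the slides.

```python
def reduction_ratio(comparisons, size_s, size_t):
    """RR(A) = 1 - C(A) / (|S| |T|)."""
    return 1.0 - comparisons / (size_s * size_t)


def best_reduction_ratio(num_matches, size_s, size_t):
    """RR_max = 1 - |M'| / (|S| |T|): only the true matches are compared."""
    return 1.0 - num_matches / (size_s * size_t)


def relative_reduction_ratio(comparisons, num_matches, size_s, size_t):
    """RRR(A) = RR_max / RR(A); 1.0 is optimal, larger values are worse."""
    return (best_reduction_ratio(num_matches, size_s, size_t)
            / reduction_ratio(comparisons, size_s, size_t))


# Hypothetical run: |S| = |T| = 100, 50 true matches, 200 comparisons carried out.
print(reduction_ratio(200, 100, 100))
print(best_reduction_ratio(50, 100, 100))
print(relative_reduction_ratio(200, 50, 100, 100))
```

Since $C(A) \geq |M'|$, RR(A) can never exceed $RR_{max}$, so the RRR is always at least 1.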



Page 150: The Linked Data Lifecycle

Goal

Formal Goal

Devise $H(\alpha)$ such that $\forall r > 1, \exists \alpha : RRR(H(\alpha)) \leq r$


Page 151: The Linked Data Lifecycle

Restrictions

Minkowski Distance

$\delta(s, t) = \sqrt[p]{\sum_{i=1}^{n} |s_i - t_i|^p}, \quad p \geq 2$


Page 152: The Linked Data Lifecycle

Space Tiling

HYPPO

δ(s, t) ≤ θ describes a hypersphere

Approximate hypersphere by using a hypercube

Easy to compute
No loss of recall (blocking)


Page 153: The Linked Data Lifecycle

Space Tiling

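Set width of single hypercube to $\Delta = \theta / \alpha$

Tile $\Omega = S \cup T$ into the adjacent cubes $C$
Coordinates: $(c_1, \ldots, c_n) \in \mathbb{N}^n$

Contains points $\omega \in \Omega : \forall i \in \{1, \ldots, n\},\ c_i \Delta \leq \omega_i < (c_i + 1)\Delta$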
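
The tiling condition $c_i \Delta \leq \omega_i < (c_i + 1)\Delta$ is equivalent to $c_i = \lfloor \omega_i / \Delta \rfloor$, which gives a direct way to index points by cube. The sketch below uses hypothetical function names and toy points; it is not taken from the LIMES code base.

```python
import math
from collections import defaultdict


def cube_coordinates(point, theta, alpha):
    """Coordinates (c_1, ..., c_n) of the cube of width theta / alpha containing point."""
    delta = theta / alpha
    return tuple(math.floor(x / delta) for x in point)


def tile(points, theta, alpha):
    """Group all points of Omega = S u T by the cube they fall into."""
    cubes = defaultdict(list)
    for point in points:
        cubes[cube_coordinates(point, theta, alpha)].append(point)
    return dict(cubes)


# Toy data: theta = 1.0 and alpha = 2 give cubes of width 0.5.
points = [(0.2, 0.3), (0.4, 0.1), (2.5, 0.4)]
index = tile(points, theta=1.0, alpha=2)
for cube, members in index.items():
    print(cube, members)
```

A larger $\alpha$ yields smaller cubes, so the cubes collected around a point approximate the hypersphere of radius $\theta$ more closely at the price of a larger index.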



Page 156: The Linked Data Lifecycle

HYPPO

Combine $(2\alpha + 1)^n$ hypercubes around $C(\omega)$ to approximate the hypersphere

$RRR(HYPPO(\alpha)) = \frac{(2\alpha + 1)^n}{\alpha^n S(n)}$

$\lim_{\alpha \to \infty} RRR(HYPPO(\alpha)) = \frac{2^n}{S(n)}$


Page 157: The Linked Data Lifecycle

HYPPO

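RRR(HYPPO) for $p = 2$, $n = 2, 3, 4$ and $2 \leq \alpha \leq 50$

$\lim_{\alpha \to \infty} RRR(HYPPO(\alpha)) = \frac{4}{\pi} \approx 1.27 \quad (n = 2)$

$\lim_{\alpha \to \infty} RRR(HYPPO(\alpha)) = \frac{6}{\pi} \approx 1.91 \quad (n = 3)$

$\lim_{\alpha \to \infty} RRR(HYPPO(\alpha)) = \frac{32}{\pi^2} \approx 3.24 \quad (n = 4)$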
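
These limits can be checked numerically. The sketch assumes that $S(n)$ denotes the volume of the $n$-dimensional unit ball for $p = 2$, i.e. $\pi^{n/2} / \Gamma(n/2 + 1)$; this reading is not stated on the slides, but it is consistent with the three limit values given for $n = 2, 3, 4$. The function names are hypothetical.

```python
import math


def unit_ball_volume(n):
    """Volume of the n-dimensional Euclidean unit ball (assumed meaning of S(n))."""
    return math.pi ** (n / 2) / math.gamma(n / 2 + 1)


def rrr_hyppo(alpha, n):
    """RRR(HYPPO(alpha)) = (2 alpha + 1)^n / (alpha^n S(n))."""
    return (2 * alpha + 1) ** n / (alpha ** n * unit_ball_volume(n))


def rrr_hyppo_limit(n):
    """Limit for alpha -> infinity: 2^n / S(n)."""
    return 2 ** n / unit_ball_volume(n)


for n in (2, 3, 4):
    print(n, round(rrr_hyppo(50, n), 3), round(rrr_hyppo_limit(n), 3))
```

The limit stays above 1 and grows with the dimension: however fine the grid, a hypercube always contains more volume than the hypersphere it encloses.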



Page 159: The Linked Data Lifecycle

HR3: Idea

$index(C, \omega) = \begin{cases} 0 & \text{if } \exists i : |c_i - c(\omega)_i| \leq 1,\ 1 \leq i \leq n, \\ \sum_{i=1}^{n} (|c_i - c(\omega)_i| - 1)^p & \text{else} \end{cases}$


Page 160: The Linked Data Lifecycle

HR3: Idea

Compare $C(\omega)$ with $C$ iff $index(C, \omega) \leq \alpha^p$

$\alpha = 4$, $p = 2$
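
The index test can be tried out on the slide's setting $\alpha = 4$, $p = 2$ in two dimensions. The sketch follows the index definition as given on the previous slide; restricting the candidates to the $(2\alpha + 1)^n$ block around $C(\omega)$ is an assumption carried over from HYPPO, and the function names are hypothetical rather than taken from LIMES.

```python
from itertools import product


def hr3_index(c, c_omega, p):
    """Index of cube c relative to the cube c_omega that contains omega."""
    diffs = [abs(a - b) for a, b in zip(c, c_omega)]
    if any(d <= 1 for d in diffs):
        return 0
    return sum((d - 1) ** p for d in diffs)


def cubes_to_compare(c_omega, alpha, p):
    """Cubes from the (2 alpha + 1)^n block around c_omega that pass the index test."""
    offsets = product(range(-alpha, alpha + 1), repeat=len(c_omega))
    candidates = (tuple(a + o for a, o in zip(c_omega, off)) for off in offsets)
    return [c for c in candidates if hr3_index(c, c_omega, p) <= alpha ** p]


kept = cubes_to_compare((0, 0), alpha=4, p=2)
print(len(kept), "of", (2 * 4 + 1) ** 2, "cubes are compared")
```

Only cubes in the corners of the block are discarded; with growing $\alpha$ the kept region follows the hypersphere more and more closely.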


Page 161: The Linked Data Lifecycle

HR3: Idea

Lemma

$\forall s \in S : index(C, s) > \alpha^p$ implies that all $t \in C$ are non-matches

Claims

No loss of recall ✓
$\lim_{\alpha \to \infty} RRR(HR3(\alpha)) = 1$


Page 162: The Linked Data Lifecycle

HR3: Lemma 3

Lemma

∀α > 1 RRR(HR3(2α)) < RRR(HR3(α))

p = 2, α = 4


Page 163: The Linked Data Lifecycle

HR3: Proof

Lemma

∀α > 1 RRR(HR3(2α)) < RRR(HR3(α))

p = 2, α = 8


Page 164: The Linked Data Lifecycle

HR3: Proof

Lemma

∀α > 1 RRR(HR3(2α)) < RRR(HR3(α))

p = 2, α = 25


Page 165: The Linked Data Lifecycle

HR3: Proof

Lemma

∀α > 1 RRR(HR3(2α)) < RRR(HR3(α))

p = 2, α = 50


Page 166: The Linked Data Lifecycle

HR3: Idea

Theorem

$\lim_{\alpha \to \infty} RRR(HR3(\alpha)) = 1$

Claims

No loss of recall ✓
$\lim_{\alpha \to \infty} RRR(HR3(\alpha)) = 1$ ✓


Page 167: The Linked Data Lifecycle

HR3: Experiments

Compare HR3 with LIMES 0.5's HYPPO and SILK 2.5.1

Experimental Setup:

Deduplicating DBpedia places by minimum elevation, elevation and maximum elevation (θ = 49 m, 99 m).
Linking Geonames and LinkedGeoData by longitude and latitude (θ = 1, 9)

64-bit computer with a 2.8GHz i7 processor with 8GB RAM.


Page 168: The Linked Data Lifecycle

HR3: Experiments (Comparisons)

Experiment 2: Deduplicating DBpedia places, θ = 99 m
$0.64 \times 10^6$ fewer comparisons


Page 169: The Linked Data Lifecycle

HR3: Experiments (Comparisons)

Experiment 4: Linking Geonames and LinkedGeoData, θ = 9

$4.3 \times 10^6$ fewer comparisons


Page 170: The Linked Data Lifecycle

HR3: Experiments (Runtime)

Experiment 1, 2: DBpedia, θ = 49 m, 99 m
Experiment 3, 4: Geonames and LGD, θ = 1, 9

[Figure: bar chart of the runtime in seconds (logarithmic axis from 10^0 to 10^4) for Exp. 1 to Exp. 4, comparing HR3, HYPPO and SILK]


Page 171: The Linked Data Lifecycle

HR3: Summary

Mission

New category of algorithms for link discovery

Presented HR3

Link discovery in affine spaces with Minkowski measures

Outperforms the state of the art (runtime, comparisons)

Optimal reduction ratio

Integrated in LIMES

Page 173: The Linked Data Lifecycle

Learning Complex Specifications

Supervised (mostly active, e.g., RAVEN, EAGLE, SILK)

Unsupervised (e.g., KnoFuss, EUCLID, EAGLE)

Page 179: The Linked Data Lifecycle

Learning Complex Specifications

Insight

Choice of right example is key for learning

So far, only use of informativeness

Question

Can we do better by using more information?

Higher F-measure

Often slower

Page 182: The Linked Data Lifecycle

Basic Idea

Use similarity of link candidates when selecting most informative examples (intra + inter class similarity)

Page 185: The Linked Data Lifecycle

Similarity of Candidates

Link candidate x = (s, t) can be regarded as vector $(\sigma_1(x), \ldots, \sigma_n(x)) \in [0, 1]^n$.

Similarity of link candidates x and y:

$$sim(x, y) = \frac{1}{1 + \sqrt{\sum_{i=1}^{n} (\sigma_i(x) - \sigma_i(y))^2}} \quad (1)$$

Allows exploiting both intra- and inter-class similarity
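A minimal sketch of Equation (1), assuming each link candidate has already been turned into a list of n similarity values in [0, 1]. The function name sim follows the slide; the example vectors are hypothetical.

```python
import math


def sim(x, y):
    """Similarity of two link candidates given as equal-length vectors in [0, 1]^n."""
    if len(x) != len(y):
        raise ValueError("candidate vectors must have the same length")
    # Euclidean distance between the two similarity vectors
    distance = math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))
    return 1.0 / (1.0 + distance)


# Hypothetical candidates, each described by two similarity measures.
x = [1.0, 0.0]
y = [0.0, 0.0]
print(sim(x, x))  # 1.0, identical vectors have distance 0
print(sim(x, y))  # 0.5, Euclidean distance 1
```

The value is 1 for identical vectors and decreases towards 0 as the vectors move apart, so it can be used directly as an edge weight.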

Page 186: The Linked Data Lifecycle

Graph Clustering

Rationale: Use intra-class similarity

Approach

Cluster elements of S+ and S− independently

Choose one element per cluster as representative

Present oracle with most informative representatives

[Figure: similarity graphs over the link candidates in S+ and S− (nodes a to l), with edge weights 0.25, 0.8 and 0.9]

Page 187: The Linked Data Lifecycle

BorderFlow

G = (V, E, ω) with V = S+ or V = S−

ω(x, y) = sim(x, y)

Keep best ec edges for each x ∈ V
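The graph construction can be sketched as follows. This is an illustration only: the dictionary layout, the function name build_graph and the example vectors are assumptions, and an edge is kept here as soon as it is among the ec best edges of either endpoint, which the slide does not specify.

```python
import math
from itertools import combinations


def sim(x, y):
    """Similarity from Equation (1): 1 / (1 + Euclidean distance)."""
    distance = math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))
    return 1.0 / (1.0 + distance)


def build_graph(candidates, ec):
    """Return {frozenset({u, v}): weight}, keeping the ec best edges of every node."""
    # Weight of every possible undirected edge
    weights = {
        frozenset(pair): sim(candidates[pair[0]], candidates[pair[1]])
        for pair in combinations(candidates, 2)
    }
    kept = {}
    for node in candidates:
        incident = [edge for edge in weights if node in edge]
        incident.sort(key=lambda edge: weights[edge], reverse=True)
        for edge in incident[:ec]:
            kept[edge] = weights[edge]
    return kept


# Hypothetical similarity vectors for three candidates of one class (S+ or S-).
candidates = {"a": [0.9, 0.9], "b": [0.8, 0.9], "c": [0.1, 0.2]}
graph = build_graph(candidates, ec=1)
print(sorted(tuple(sorted(edge)) for edge in graph))
```

Pruning to the best ec edges keeps the graph sparse, which bounds the work done by the clustering step.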

Page 188: The Linked Data Lifecycle

BorderFlow

Seed-based algorithm

Goal: Maximize border flow ratio $bf(X) = \frac{\Omega(b(X), X)}{\Omega(b(X), n(X))}$

http://sourceforge.net/projects/cugar-framework/
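A small sketch of the border flow ratio for a given cluster X. The slide does not define b, n and Ω; the code assumes that b(X) is the set of nodes in X with a neighbour outside X, that n(X) is the set of outside nodes adjacent to X, and that Ω(A, B) is the total weight of the edges between A and B, with every undirected edge counted once. The edge list is hypothetical.

```python
def neighbours(weights, node):
    """Nodes sharing an edge with node."""
    return {v if u == node else u for (u, v) in weights if node in (u, v)}


def omega(weights, xs, ys):
    """Total weight of edges with one endpoint in xs and the other in ys."""
    total = 0.0
    for (u, v), w in weights.items():
        if (u in xs and v in ys) or (v in xs and u in ys):
            total += w
    return total


def border(weights, X):
    """b(X): nodes of X with at least one neighbour outside X (assumed definition)."""
    return {x for x in X if neighbours(weights, x) - X}


def outer_neighbours(weights, X):
    """n(X): nodes outside X adjacent to a node of X (assumed definition)."""
    return {n for x in X for n in neighbours(weights, x)} - X


def border_flow(weights, X):
    """bf(X) = Omega(b(X), X) / Omega(b(X), n(X))."""
    b = border(weights, X)
    outward = omega(weights, b, outer_neighbours(weights, X))
    if outward == 0:
        return float("inf")  # nothing flows out of X
    return omega(weights, b, X) / outward


# Hypothetical weighted graph: a chain a - b - c - d - e.
weights = {("a", "b"): 0.9, ("b", "c"): 0.8, ("c", "d"): 0.25, ("d", "e"): 0.9}
print(border_flow(weights, {"a", "b", "c"}))
```

A seed-based algorithm would start from a single node and add neighbours as long as this ratio increases; that loop is left out here.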

Page 193: The Linked Data Lifecycle

Conclusion

Can be combined with arbitrary active learning ML algorithms

Was experimentally combined with EAGLE (genetic programming) and RAVEN (linear classifier) and shown to outperform the plain informativeness function in terms of F-measure

Choice of example important to minimise user effort

Contact me for detailed experimental results

Longer runtimes (up to 2×)

Page 194: The Linked Data Lifecycle

Summary

Linking is a crucial task in the Web of Data

Two key problems

1 Efficient execution of link specifications

2 Creation of link specifications

Presented HR3 to handle the first problem

Presented COALA as building block for the second problem

Page 195: The Linked Data Lifecycle

Outline

1 Introduction to Linked Data

2 Linked Dataset Example: DBpedia

3 Linked Data Life-Cycle Overview

4 Knowledge Extraction

5 Data Integration / Linking

6 Enrichment

7 Repair

8 Knowledge Base Exploration / Querying

[Figure: Linked Data Lifecycle diagram with the stages Extraction, Storage/Querying, Manual revision/Authoring, Interlinking/Fusing, Classification/Enrichment, Quality Analysis, Evolution/Repair, Search/Browsing/Exploration]

Page 196: The Linked Data Lifecycle

Motivation

rise in the availability and usage of knowledge bases

still a lack of knowledge bases that consist of sophisticated schema information and instance data adhering to this schema

e.g. in the life sciences several knowledge bases

only consist of schema information

are, to a large extent, a collection of facts without a clear structure (e.g. information extracted from databases)

combination of sophisticated schema and instance data would allow powerful reasoning, consistency checking, and improved querying

→ create schemata based on existing data

Page 197: The Linked Data Lifecycle

Example

dbr:Brad_Pitt :birthPlace dbr:Shawnee,_Oklahoma ;
    a :Person .

dbr:Angela_Merkel :birthPlace dbr:Hamburg ;
    a :Person .

dbr:Albert_Einstein :birthPlace dbr:Ulm ;
    a :Person .

dbr:Shawnee,_Oklahoma a :Place .
dbr:Ulm a :Place .
dbr:Hamburg a :Place .

Suggestions: birthPlace

ObjectProperty: birthPlace
    Characteristics: Functional
    Domain: Person
    Range: Place
    SubPropertyOf: hasBeenAt

Page 198: The Linked Data Lifecycle

Benefits of an expressive schema

Axioms serve as documentation for the purpose and correct usage of schema elements

Additional implicit information can be inferred

Improve query optimisation

Improve/allow the application of schema debugging techniques

Page 199: The Linked Data Lifecycle

Each person was only born at one place?!

Page 200: The Linked Data Lifecycle

[Figure: one resource with two birthPlace edges pointing to two different (≠) resources]

birthPlace is functional

SELECT ?s WHERE {
  ?s dbo:birthPlace ?o1 .
  ?s dbo:birthPlace ?o2 .
  FILTER (?o1 != ?o2)
}
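The same check can be sketched on an in-memory list of triples, which may help when no SPARQL endpoint is at hand. The function name and the ex: resources are hypothetical and only serve as illustration.

```python
from collections import defaultdict


def functional_violations(triples, prop):
    """Return {subject: objects} for subjects with more than one distinct value of prop."""
    values = defaultdict(set)
    for s, p, o in triples:
        if p == prop:
            values[s].add(o)
    return {s: objs for s, objs in values.items() if len(objs) > 1}


# Toy data; the two birth places of ex:Person_B are invented for illustration.
triples = [
    ("dbr:Brad_Pitt", "dbo:birthPlace", "dbr:Shawnee,_Oklahoma"),
    ("ex:Person_B", "dbo:birthPlace", "ex:Town_1"),
    ("ex:Person_B", "dbo:birthPlace", "ex:Town_2"),
]
print(functional_violations(triples, "dbo:birthPlace"))
```

Like the query, this only reports candidates: two distinct URIs may still denote the same place, so each hit needs a closer look.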

Page 206: The Linked Data Lifecycle

Where was Julia Nannie Wallace born?

Page 207: The Linked Data Lifecycle

Julia Nannie Wallace was born in Lacrosse?

Page 208: The Linked Data Lifecycle

No, Julia Nannie Wallace was born in La Crosse!

Page 209: The Linked Data Lifecycle

[Figure: the object of a birthPlace triple has rdf:type Sport, while the range axiom expects rdf:type Place (City ⊑ Place); Sport ≠ Place]

birthPlace range Place

Place disjointWith Sport

SELECT ?s ?place WHERE {
  ?s dbo:birthPlace ?place .
  ?place rdf:type/rdfs:subClassOf* ?type1 .
  ?type2 rdfs:subClassOf* dbo:Place .
  ?type1 owl:disjointWith ?type2 .
}
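A sketch of what the query computes, written over plain Python dictionaries. The resource identifiers, the helper names and the axiom City ⊑ Place are assumptions made for this illustration; only the class names and the disjointness axiom come from the slide.

```python
def superclasses(cls, sub_class_of):
    """Reflexive-transitive closure of rdfs:subClassOf starting from cls."""
    seen, stack = set(), [cls]
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(sub_class_of.get(current, ()))
    return seen


def range_violations(triples, prop, range_cls, types, sub_class_of, disjoint):
    """Return (subject, object) pairs whose object has a type disjoint with a subclass of range_cls."""
    # Classes on the range side: range_cls itself and all of its known subclasses
    range_side = {
        c for c in set(sub_class_of) | {range_cls}
        if range_cls in superclasses(c, sub_class_of)
    }
    hits = []
    for s, p, o in triples:
        if p != prop:
            continue
        object_types = set()
        for t in types.get(o, ()):
            object_types |= superclasses(t, sub_class_of)
        if any((t1, t2) in disjoint or (t2, t1) in disjoint
               for t1 in object_types for t2 in range_side):
            hits.append((s, o))
    return hits


# Hypothetical identifiers for the example on the slides.
types = {"dbr:Lacrosse": ["dbo:Sport"], "dbr:La_Crosse": ["dbo:City"]}
sub_class_of = {"dbo:City": ["dbo:Place"]}
disjoint = {("dbo:Place", "dbo:Sport")}
wrong = [("dbr:Julia_Nannie_Wallace", "dbo:birthPlace", "dbr:Lacrosse")]
right = [("dbr:Julia_Nannie_Wallace", "dbo:birthPlace", "dbr:La_Crosse")]
print(range_violations(wrong, "dbo:birthPlace", "dbo:Place", types, sub_class_of, disjoint))
print(range_violations(right, "dbo:birthPlace", "dbo:Place", types, sub_class_of, disjoint))
```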

Page 222: The Linked Data Lifecycle

3 Steps to get a schema

Input: Entity URI, Axiom Type, Knowledge Base (SPARQL Endpoint)

3-Phase Enrichment Learning Approach:

1. obtain schema information (only executed once per knowledge base)

2. obtain axiom type and entity specific data (sample data if necessary)

3. run machine learning algorithm

iterate over all axiom types and schema entities for full enrichment

[Figure: data flow between SPARQL Endpoint, Reasoner (optional invocation), Learner (DL-Learner) and Enrichment Ontology; intermediate results are Background Knowledge, Background Knowledge + Relevant Instance Data, and List of Axiom Suggestions + Metadata]

Page 223: The Linked Data Lifecycle

Starting Point

SPARQL endpoint: http://dbpedia.org/sparql

Entity URI: http://dbpedia.org/ontology/author

Axiom Type: Object Property Domain

Page 224: The Linked Data Lifecycle

Step 1 - Obtaining Schema Information

CONSTRUCT WHERE {
  ?sub rdfs:subClassOf ?sup .
}
ORDER BY DESC(?sub) LIMIT 1000 OFFSET 1000

dbo:Disease rdfs:subClassOf owl:Thing .
dbo:Book rdfs:subClassOf dbo:WrittenWork .
dbo:WrittenWork rdfs:subClassOf dbo:Work .
dbo:Work rdfs:subClassOf owl:Thing .
dbo:Philosopher rdfs:subClassOf dbo:Person .
dbo:Person rdfs:subClassOf dbo:Agent .
dbo:Agent rdfs:subClassOf owl:Thing .
dbo:Sport rdfs:subClassOf dbo:Activity .
dbo:Activity rdfs:subClassOf owl:Thing .
dbo:Fish rdfs:subClassOf dbo:Animal .
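The LIMIT/OFFSET pair suggests that the schema is fetched in pages. A possible way to generate such paged queries is sketched below; the function name, the default page size and the fixed number of pages are assumptions, and prefix declarations are left to the endpoint.

```python
def paged_queries(pattern, order_var, page_size=1000, pages=3):
    """Yield CONSTRUCT queries that fetch consecutive slices of the given triple pattern."""
    for page in range(pages):
        yield (
            f"CONSTRUCT WHERE {{ {pattern} }} "
            f"ORDER BY DESC(?{order_var}) "
            f"LIMIT {page_size} OFFSET {page * page_size}"
        )


queries = list(paged_queries("?sub rdfs:subClassOf ?sup .", "sub"))
for query in queries:
    print(query)
```

In practice one would keep requesting pages until a page comes back with fewer than page_size triples instead of fixing the number of pages in advance.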

Page 227: The Linked Data Lifecycle

Step 2 - Obtain axiom type and entity specific data

SELECT ?type (COUNT(DISTINCT ?s) AS ?cnt) WHERE {
  ?s dbo:author ?o .
  ?s a ?type .
}
GROUP BY ?type ORDER BY DESC(?cnt)

type                  cnt
owl:Thing             30284
dbo:Work              30284
schema:CreativeWork   30284
dbo:WrittenWork       25730
dbo:Book              24673
schema:Book           24673
dbo:TelevisionShow    2567
dbo:Play              1057
...                   ...

CONSTRUCT WHERE {
  ?ind dbo:author ?o .
  ?ind a ?type .
}
ORDER BY DESC(?ind) LIMIT 1000 OFFSET 2000

...
dbpedia:The_Adventures_of_Tom_Sawyer
    dbo:author dbpedia:Mark_Twain ;
    rdf:type dbo:Book .
dbpedia:The_Zombie_Survival_Guide
    dbo:author dbpedia:Max_Brooks ;
    rdf:type dbo:WrittenWork .
dbpedia:Web_Therapy
    dbo:author dbpedia:Lisa_Kudrow ;
    rdf:type dbo:Book .
...

Page 232: The Linked Data Lifecycle

Step 3 - Scoring

dbpedia:The_Adventures_of_Tom_Sawyer
    dbo:author dbpedia:Mark_Twain ;
    rdf:type dbo:Book .

dbpedia:The_Zombie_Survival_Guide
    dbo:author dbpedia:Max_Brooks ;
    rdf:type dbo:WrittenWork .

dbpedia:Web_Therapy
    dbo:author dbpedia:Lisa_Kudrow ;
    rdf:type dbo:Book .

Score(Domain(dbo:author, dbo:Book)) = 2/3 ≈ 66.7%

Score(Domain(dbo:author, dbo:WrittenWork)) = 1/3 ≈ 33.3%

Taking the schema axiom into account:

dbo:Book rdfs:subClassOf dbo:WrittenWork .

Score(Domain(dbo:author, dbo:WrittenWork)) = 3/3 = 100%
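The three scores can be reproduced with a few lines of Python. This is only a sketch of the counting idea on the three example resources; the function names and the dictionary layout are assumptions, not the DL-Learner implementation.

```python
def superclasses(cls, sub_class_of):
    """Reflexive-transitive closure of rdfs:subClassOf starting from cls."""
    seen, stack = set(), [cls]
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(sub_class_of.get(current, ()))
    return seen


def domain_score(subject_types, candidate, sub_class_of=None):
    """Fraction of subjects that are instances of candidate, directly or via subclass axioms."""
    sub_class_of = sub_class_of or {}
    hits = sum(
        1 for types in subject_types.values()
        if any(candidate in superclasses(t, sub_class_of) for t in types)
    )
    return hits / len(subject_types)


# Subjects of dbo:author and their asserted types, as on the slide.
subject_types = {
    "dbpedia:The_Adventures_of_Tom_Sawyer": ["dbo:Book"],
    "dbpedia:The_Zombie_Survival_Guide": ["dbo:WrittenWork"],
    "dbpedia:Web_Therapy": ["dbo:Book"],
}
schema = {"dbo:Book": ["dbo:WrittenWork"]}

print(domain_score(subject_types, "dbo:Book"))                 # 2 of 3 subjects
print(domain_score(subject_types, "dbo:WrittenWork"))          # 1 of 3 without the schema
print(domain_score(subject_types, "dbo:WrittenWork", schema))  # 3 of 3 with the schema
```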

Page 236: The Linked Data Lifecycle

Step 3 - Scoring(2)

Problem:

support for axiom in KB not taken into account → no difference between 3 out of 3 and 100 out of 100

Solution:

Average of 95% confidence interval (Wald method)

$$p' = \frac{s + 2}{m + 4}$$

s: #success, m: #total

$$\text{upper} = \min\left(1,\ p' + 1.96 \cdot \sqrt{\frac{p' \cdot (1 - p')}{m + 4}}\right) \qquad \text{lower} = \max\left(0,\ p' - 1.96 \cdot \sqrt{\frac{p' \cdot (1 - p')}{m + 4}}\right)$$

In 95% of the intervals the true value is between ... and ...

Score(Domain(dbo:author, dbo:Book)) ≈ 57.3%

Score(Domain(dbo:author, dbo:WrittenWork)) ≈ 69.1%
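A sketch of the interval-based score. The constants 2 and 4 on the slide are the usual roundings of z²/2 and z² for z = 1.96 (the adjusted Wald or Agresti-Coull interval). With the rounded constants the two example scores come out at about 57.1% and 69.0%; with the unrounded constants they match the 57.3% and 69.1% shown above, which suggests, as an inference not stated on the slides, that the unrounded form was used. The function name and the rounded switch are assumptions.

```python
import math


def interval_score(s, m, z=1.96, rounded=False):
    """Centre of the clipped 95% interval for s successes out of m observations."""
    if rounded:
        # Constants exactly as printed on the slide
        p = (s + 2) / (m + 4)
        n = m + 4
    else:
        # Unrounded constants z^2 / 2 and z^2
        p = (s + z * z / 2) / (m + z * z)
        n = m + z * z
    margin = z * math.sqrt(p * (1 - p) / n)
    upper = min(1.0, p + margin)
    lower = max(0.0, p - margin)
    return (lower + upper) / 2


print(round(100 * interval_score(2, 3), 1))  # Domain(dbo:author, dbo:Book)
print(round(100 * interval_score(3, 3), 1))  # Domain(dbo:author, dbo:WrittenWork)
print(interval_score(100, 100) > interval_score(3, 3))
```

The last line illustrates the point of the slide: 100 out of 100 now scores higher than 3 out of 3.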

Page 239: The Linked Data Lifecycle

More Complex Axioms

"Pattern Based Knowledge Base Enrichment", ISWC 2013

Page 240: The Linked Data Lifecycle

Outlook and Summary

Schema in the Linked Data Web often shallow → tools needed to support knowledge engineers

Showed some techniques for learning OWL axioms on large knowledge bases available as SPARQL endpoints

More complex axioms require:

OWL-SPARQL rewriting or

Fragment extraction

Small- and medium-sized knowledge bases can be handled via techniques from Inductive Logic Programming

All algorithms implemented in DL-Learner framework (http://dl-learner.org)

Page 241: The Linked Data Lifecycle

Outline

1 Introduction to Linked Data

2 Linked Dataset Example: DBpedia

3 Linked Data Life-Cycle Overview

4 Knowledge Extraction

5 Data Integration / Linking

6 Enrichment

7 Repair

8 Knowledge Base Exploration / Querying

[Figure: Linked Data Lifecycle diagram with the stages Extraction, Storage/Querying, Manual revision/Authoring, Interlinking/Fusing, Classification/Enrichment, Quality Analysis, Evolution/Repair, Search/Browsing/Exploration]

Page 242: The Linked Data Lifecycle

Motivation

increasing number of knowledge bases in the Semantic Web (see e.g. LOD cloud)

maintenance of knowledge bases with expressive semantics is challenging

Page 243: The Linked Data Lifecycle

(Automatically) Detectable Ontology Problems

Common problems:

Syntactic Problems

Structural Problems

Semantic Problems (focus of talk)

Task Based Problems:

Reasoning Related Problems

Linked Data Related Problems

Page 244: The Linked Data Lifecycle

Syntactic Problems

Syntactic errors are mainly violations of conventions of the language in which the ontology is modelled.

Example (Validity of XML)

<?xml version="1.0"?>
<rdf:RDF
  xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
  xmlns:dc="http://purl.org/dc/elements/1.1/">
  <rdf:Description rdf:about="http://www.w3.org/">
    <dc:title>World Wide Web Consortium</dc:title>
</rdf:RDF>

FatalError: The element type rdf:Description must be terminated by the matching end-tag </rdf:Description>. [Line = 7, Column = 3]


Structural Problems

Problems in the taxonomy

Example (Circularities)

A ⊑ B, B ⊑ C, C ⊑ A
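A circularity of this kind can be found with a depth-first search over the asserted subclass graph. The following is a minimal sketch, assuming the taxonomy is given as plain (subclass, superclass) pairs; the function name `find_subclass_cycle` is hypothetical and does not come from any of the tools discussed here.

```python
from collections import defaultdict

def find_subclass_cycle(axioms):
    """Return one cycle of class names in the subclass graph, or None.

    `axioms` is an iterable of (subclass, superclass) pairs; this input
    format is an assumption made for the sketch.
    """
    graph = defaultdict(list)
    for sub, sup in axioms:
        graph[sub].append(sup)

    WHITE, GREY, BLACK = 0, 1, 2   # unvisited, on current path, finished
    colour = defaultdict(int)
    path = []

    def visit(node):
        colour[node] = GREY
        path.append(node)
        for nxt in graph[node]:
            if colour[nxt] == GREY:
                # nxt is already on the current path: cycle found
                return path[path.index(nxt):] + [nxt]
            if colour[nxt] == WHITE:
                cycle = visit(nxt)
                if cycle:
                    return cycle
        colour[node] = BLACK
        path.pop()
        return None

    for start in list(graph):
        if colour[start] == WHITE:
            cycle = visit(start)
            if cycle:
                return cycle
    return None

print(find_subclass_cycle([("A", "B"), ("B", "C"), ("C", "A")]))
```

The sketch only inspects asserted subclass axioms; cycles that arise through inference would need a reasoner.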


Reasoning Related Problems

Problems which negatively affect the performance of reasoning over expressive knowledge bases

Example (A named concept is equivalent to an AllValues restriction)

A ≡ ∀r.C

Reasoning complexity:

A universal restriction does not require an individual to have a property value; it only restricts the values of existing properties

Any concept B whose instances cannot have r-fillers satisfies the restriction, i.e. B ⊑ ∀r.C, and becomes a subclass of A

This typically leads to unintended inferences, and the additional inferences may eventually slow down reasoning

Can be checked via Pellint (part of Pellet)
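The vacuous satisfaction of a universal restriction can be seen on a small finite interpretation. This is an illustrative sketch only: the domain, role and concept below are made-up example data, and `all_values_from` is a hypothetical helper, not part of Pellint or Pellet.

```python
def all_values_from(domain, role, concept):
    """Extension of ∀r.C: individuals all of whose r-fillers are in C."""
    return {x for x in domain
            if all(filler in concept for (subject, filler) in role if subject == x)}

domain = {"a", "b", "c"}
r = {("a", "b")}   # only individual a has an r-filler
C = {"b"}

# b and c have no r-fillers at all, so they satisfy ∀r.C vacuously.
print(sorted(all_values_from(domain, r, C)))
```

With A ≡ ∀r.C, every individual in this interpretation would be classified under A, which is usually not what the modeller intended.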


Linked Data Related Problems

Problems which are specific to publishing RDF using the Linked Data principles

Incorrect implementation of content negotiation

Mixing up information and non-information resources


Semantic Problems

Logical contradictions in the underlying knowledge base

Example (Unsatisfiable classes)

O = {A ⊑ B ⊓ C, C ⊑ ¬B} ⊨ A ⊑ ⊥

Example (Inconsistent ontology)

O = {A ⊑ B ⊓ C, C ⊑ ¬B, A(x)} ⊨ ⊤ ⊑ ⊥

Usually handled by Ontology Debugging


Ontology Debugging

Problem: We have undesirable entailments

Solution: Repair (delete/modify) the responsible axioms

Question: Which axioms?


Justification

Justification

For an ontology O and an entailment η where O ⊨ η, a set of axioms J is a justification for η in O if J ⊆ O, J ⊨ η, and J′ ⊭ η for every J′ ⊂ J.

Minimal subsets of an ontology that are sufficient for a given entailment to hold

Synonyms: MUPS (Minimal Unsatisfiability Preserving Sub-TBoxes), MinAs (Minimal Axiom sets), Kernels

Observations:

there can be multiple justifications for a single entailment

an axiom can be part of multiple justifications


Justification - Example

O = {
B ⊑ ∃r.D (1)
B ⊑ ∀r.¬D (2)
A ⊑ B ⊓ C (3)
B ⊑ ¬C (4)
A ⊑ E (5)
A ⊑ ¬E ⊓ F (6)
} ⊨ A ⊑ ⊥

J1 = {1, 2, 3}

J2 = {5, 6}

J3 = {3, 4}


Justification Based Repair

For a repair, at least one axiom from every justification needs to be removed.

For a repair plan, all justifications are needed.
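Since a repair has to remove at least one axiom from every justification, the minimal repairs are exactly the minimal hitting sets of the justifications. A brute-force sketch for the justification sets {1, 2, 3}, {5, 6} and {3, 4} is given below; the function name is hypothetical and the exhaustive enumeration is only suitable for tiny inputs.

```python
from itertools import combinations

def minimal_repairs(justifications):
    """Enumerate all minimal sets of axioms that hit every justification.

    Candidates are tried by increasing size, so any candidate that
    contains an already found repair cannot be minimal and is skipped.
    """
    axioms = sorted(set().union(*justifications))
    repairs = []
    for size in range(1, len(axioms) + 1):
        for candidate in combinations(axioms, size):
            chosen = set(candidate)
            if any(repair <= chosen for repair in repairs):
                continue
            if all(chosen & justification for justification in justifications):
                repairs.append(chosen)
    return repairs

J = [{1, 2, 3}, {5, 6}, {3, 4}]
for repair in minimal_repairs(J):
    print(sorted(repair))
```

Axiom 3 occurs in two justifications, which is why the two smallest repairs both contain it.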


Justification Algorithms

Single justification:

Glass Box: Modifying underlying reasoning algorithm (tableau tracing)

Black-Box: Using reasoner as oracle

All justifications:

Reiter's Hitting Set Tree Algorithm (HST)


Black-Box

Expansion-Contraction Strategy

Expansion: Add axioms to empty set until entailment holds

Contraction: Remove axioms from the set until it is minimal and the entailment can still be derived.

Figure 3.1: A Depiction of a Black-Box Expand-Contract Strategy (key: axiom, axiom in justification, selected axiom)

3.2 Black-Box Algorithms for Computing Single Justifications

The basic idea behind a black-box justification finding algorithm is to systematically test different subsets of an ontology in order to find one that corresponds to a justification. As depicted in Figure 3.1, subsets of an ontology are typically explored using an “expand-contract” strategy. In order to compute a justification for O ⊨ η, an initial, small, subset S of O (represented by circles with thick black borders in Figure 3.1) is selected. The axioms in S are typically the axioms whose signature has a non-empty intersection with the signature of η, or axioms that “define” terms in the signature of η (for example, the axiom A ⊑ B defines the class name A). A reasoner is then used to check if S ⊨ η, and if not, S is expanded by adding a few more axioms from O. This incremental expansion phase continues until S is large enough so that it entails η. When this happens, either S, or some subset of S, is guaranteed to be a justification for η. At this point S is gradually contracted until it is a minimal set of axioms that entails η, i.e. a justification for η in O.

In some black-box algorithms the expand phase may be trivial, or “empty”, where S is immediately expanded to all input axioms. In this situation it is the contraction phase which “does all the work”. An example of such a strategy is presented in Algorithm 3.1. In this algorithm a set of axioms S is initialised (expanded) with all of the axioms in O so that S ⊨ η. S is then pruned one axiom at a time, so that for each α ∈ S, if S \ {α} ⊨ η, then S = S \ {α}. This process terminates when all axioms α ∈ S have been examined, at which point [...]

Source: M. Horridge: Justification Based Explanation in Ontologies (PhD Thesis)
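The contraction-only variant described in the excerpt (start with all axioms, then prune one axiom at a time) can be sketched as follows. The `entails` function is a toy stand-in for a reasoner and is an assumption of this sketch: it only chains atomic subsumptions given as (subclass, superclass) pairs. The example ontology is made up for illustration.

```python
def entails(axioms, goal):
    """Toy oracle (assumption): chain atomic subsumptions from goal[0] to goal[1]."""
    sub, sup = goal
    reached, frontier = {sub}, [sub]
    while frontier:
        current = frontier.pop()
        for lhs, rhs in axioms:
            if lhs == current and rhs not in reached:
                reached.add(rhs)
                frontier.append(rhs)
    return sup in reached

def single_justification(ontology, goal):
    """Black-box contraction: drop every axiom whose removal keeps the entailment."""
    if not entails(ontology, goal):
        raise ValueError("the ontology does not entail the goal")
    kept = list(ontology)
    for axiom in list(ontology):
        rest = [a for a in kept if a != axiom]
        if entails(rest, goal):
            kept = rest          # axiom is not needed for the entailment
    return set(kept)

# Hypothetical toy ontology: A ⊑ B, B ⊑ C, C ⊑ D, A ⊑ E, X ⊑ Y
O = [("A", "B"), ("B", "C"), ("C", "D"), ("A", "E"), ("X", "Y")]
print(sorted(single_justification(O, ("A", "D"))))
```

Each axiom is tested exactly once, so the procedure needs one oracle call per axiom after the initial check; a non-trivial expansion phase would reduce the size of the sets passed to the reasoner.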


Hitting Set Tree Algorithm

from the field of Model Based Diagnosis

given a faulty system (ontology), it constructs a finite tree whose nodes are labelled with conflict sets (justifications) and whose edges are labelled with components (axioms)

finds all minimal hitting sets, which represent diagnoses for the conflict sets in the system

diagnosis = repair


Hitting Set Tree Algorithm - Example

O = {A ⊑ B, B ⊑ D, A ⊑ ∃R.C, ∃R.⊤ ⊑ D} ⊨ A ⊑ D

Figure 3.2: An Example of a Hitting Set Tree (root node labelled J1 = {A ⊑ B, B ⊑ D} with successor edges A ⊑ B and B ⊑ D; successor nodes labelled J2 = {A ⊑ ∃R.C, ∃R.⊤ ⊑ D} and J′2 = {A ⊑ ∃R.C, ∃R.⊤ ⊑ D}, each with successor edges A ⊑ ∃R.C and ∃R.⊤ ⊑ D)

[...] bottom right hand successor to the node labelled with J′2 and whose successor edge is labelled with ∃R.⊤ ⊑ D was generated by considering O \ S where S = {B ⊑ D, ∃R.⊤ ⊑ D} (∃R.⊤ plus the label of the path to the root) and noting that O \ S does not contain a justification for A ⊑ D. A fresh successor node was therefore generated and labelled with the empty set, with the successor edge label being set as ∃R.⊤ ⊑ D.

When no more successor nodes can be generated the hitting set tree is complete. At this point, all justifications for O ⊨ η occur as labels of nodes in the tree. Additionally, all minimal repairs (diagnoses) for O ⊨ η are contained as leaf-root paths in the tree.

3.3.3 Model Based Diagnosis Optimisations

The above description of a hitting set tree and illustrative example do not take into consideration any optimisations. In order to achieve acceptable performance it is necessary to consider two important optimisations: (1) Early Path Termination, and (2) Justification Reuse, which are detailed below:

Early path termination

In the unoptimised version of the algorithm a node can be extended with successor edges provided there is an axiom in its label which does not already label one of the [...]

Source: M. Horridge: Justification Based Explanation in Ontologies (PhD Thesis)
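An unoptimised hitting set tree exploration for this example can be sketched as below. Several parts are assumptions of the sketch rather than the algorithm as implemented in any tool: the oracle `entails` only chains atomic subsumptions, the structural fact ∃R.C ⊑ ∃R.⊤ is hard-coded as background knowledge, and single justifications are found by simple contraction. Neither early path termination nor justification reuse is implemented.

```python
# Structural fact that holds in every interpretation (not an ontology axiom).
BACKGROUND = [("∃R.C", "∃R.⊤")]

def entails(axioms, goal):
    """Toy oracle (assumption): chain atomic subsumptions from goal[0] to goal[1]."""
    sub, sup = goal
    edges = list(axioms) + BACKGROUND
    reached, frontier = {sub}, [sub]
    while frontier:
        current = frontier.pop()
        for lhs, rhs in edges:
            if lhs == current and rhs not in reached:
                reached.add(rhs)
                frontier.append(rhs)
    return sup in reached

def single_justification(axioms, goal):
    """Black-box contraction: drop every axiom that is not needed."""
    kept = list(axioms)
    for axiom in list(axioms):
        rest = [a for a in kept if a != axiom]
        if entails(rest, goal):
            kept = rest
    return frozenset(kept)

def hitting_set_tree(ontology, goal):
    """Return (all justifications, all minimal repairs) for ontology ⊨ goal."""
    justifications, hitting_sets = set(), set()
    todo = [frozenset()]        # axioms removed on the path from the root
    seen = set()
    while todo:
        removed = todo.pop()
        if removed in seen:
            continue
        seen.add(removed)
        remaining = [a for a in ontology if a not in removed]
        if not entails(remaining, goal):
            hitting_sets.add(removed)      # leaf: the path is a repair
            continue
        justification = single_justification(remaining, goal)
        justifications.add(justification)  # node label
        for axiom in justification:        # one successor edge per axiom
            todo.append(removed | {axiom})
    minimal = {h for h in hitting_sets
               if not any(other < h for other in hitting_sets)}
    return justifications, minimal

ONTOLOGY = [("A", "B"), ("B", "D"), ("A", "∃R.C"), ("∃R.⊤", "D")]
justifications, repairs = hitting_set_tree(ONTOLOGY, ("A", "D"))
print(len(justifications), len(repairs))
```

Which justification labels the root depends on the order in which the contraction step examines the axioms, so the tree built by the sketch may differ in shape from the one in the figure while yielding the same justifications and repairs.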


Justification Scenarios

A user can be faced with the following situations:

Small number of small justifications: easy and pleasant to inspect

Small number of large justifications: better than the alternatives

Large number of justifications: pretty hopeless with current mechanisms

Idea: Find the source of unsatisfiability


Root Unsatisfiability - Definitions

A root unsatisfiable class (UC) is a class whose unsatisfiability does not depend on another class; otherwise it is a derived UC.

A derived UC for which there is some justification that is not a strict superset of a justification for another UC is a partial derived UC.

Root Unsatisfiable Class

A class A is a root unsatisfiable class if there is no justification J with J ⊨ A ⊑ ⊥ such that J is a strict superset of a justification for some other unsatisfiable class.


Root Unsatisfiability - Approaches

Approaches:

1: compute all justifications for each unsatisfiable class and apply the definition → often computationally too expensive

2: heuristics for structural analysis of axioms

Debugging Unsatisfiable Classes in OWL Ontologies, Kalyanpur, Parsia, Sirin, Hendler, J. Web Sem., 2005.


Root Unsatisfiability - Example

O = {
B ⊑ ∃r.D (1)
B ⊑ ∀r.¬D (2)
A ⊑ B ⊓ C (3)
B ⊑ ¬C (4)
A ⊑ E (5)
A ⊑ ¬E ⊓ F (6)
}

O ⊨ A ⊑ ⊥ with J1 = {1, 2, 3}, J2 = {5, 6}, J3 = {3, 4}: A is a partial derived unsatisfiable class (J4 ⊂ J1)

O ⊨ B ⊑ ⊥ with J4 = {1, 2}: B is a root unsatisfiable class
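Applying the definition directly (approach 1) amounts to comparing justification sets across classes. The sketch below assumes that all justifications per unsatisfiable class are already known and are given as sets of axiom numbers; the function name and the three labels it returns are choices made for this illustration.

```python
def classify_unsatisfiable(justifications_by_class):
    """Label each unsatisfiable class as root, partial derived or derived."""
    labels = {}
    for cls, own in justifications_by_class.items():
        # Justifications of all *other* unsatisfiable classes.
        others = [j for other, js in justifications_by_class.items()
                  if other != cls for j in js]
        # A justification is "derived" if it strictly contains one of those.
        derived = [any(o < j for o in others) for j in own]
        if not any(derived):
            labels[cls] = "root"
        elif all(derived):
            labels[cls] = "derived"
        else:
            labels[cls] = "partial derived"
    return labels

example = {
    "A": [{1, 2, 3}, {5, 6}, {3, 4}],
    "B": [{1, 2}],
}
print(classify_unsatisfiable(example))
```

Repairing B first removes J1 as a reason for the unsatisfiability of A, but A stays unsatisfiable through J2 and J3, which is what the "partial" label expresses.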


Axiom Relevance

resolving a justification requires deleting or editing axioms

ranking methods highlight the most probable causes for problems

methods:

frequency

syntactic relevance

semantic relevance
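The frequency method can be read as counting in how many justifications an axiom occurs; that reading is an assumption here, since the slides only name the method. A short sketch over justifications given as sets of axiom numbers (the function name is hypothetical):

```python
from collections import Counter

def rank_by_frequency(justifications):
    """Rank axioms by the number of justifications they occur in."""
    counts = Counter(axiom for j in justifications for axiom in j)
    # Most frequent first; ties are broken by axiom number.
    return sorted(counts.items(), key=lambda item: (-item[1], item[0]))

J = [{1, 2, 3}, {5, 6}, {3, 4}]
print(rank_by_frequency(J))
```

Removing a highly ranked axiom resolves several justifications at once, which is why such an axiom is a natural first candidate for inspection.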


Repair Consequences

after the repair process, axioms have been deleted or modified

→ desired entailments may be lost or new entailments obtained (including inconsistencies!)

→ user can decide to preserve them


SPARQL Endpoint Support

Previously mentioned approaches are implemented in the ORE tool (http://ore-tool.net)

ORE supports using SPARQL endpoints

implements an incremental load procedure

knowledge base is loaded in small chunks:

count number of axioms by type

priority-based loading procedure

e.g. disjointness axioms have higher priority than class assertion axioms

uses Pellet incremental reasoning

Learning of OWL Class Descriptions on Very Large Knowledge Bases, Hellmann, Lehmann, Auer, Int. Journal Semantic Web Inf. Syst., 2009


SPARQL Endpoint Support II

algorithm performs sanity checks, e.g. SPARQL queries which probe for typical inconsistent axiom sets

can fetch additional Linked Data

different termination criteria

overall:

ORE allows applying state-of-the-art ontology debugging methods on a larger scale than was previously possible

aims at stronger support for the web aspect of the Semantic Web and the high popularity of the Web of Data initiative


DBpedia Live Demo

Inconsistency in DBpedia Live:

Individual: dbr:Purify_(album)

Facts: dbo:artist dbr:Axis_of_Advance

Individual: dbr:Axis_of_Advance

Types: dbo:Organisation

Class: dbo:Organisation

DisjointWith: dbo:Person

ObjectProperty: dbo:artist

Range: dbo:Person
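The pattern behind this inconsistency (a property value whose asserted type is disjoint with the property range) can be checked without a full reasoner. The sketch below works on a small in-memory copy of the facts above; the data structures and the function name are assumptions for illustration and do not reflect how ORE implements its sanity checks.

```python
def range_disjointness_violations(facts, types, ranges, disjoint):
    """Find facts whose object has a type disjoint with the property range."""
    violations = []
    for subject, prop, obj in facts:
        required = ranges.get(prop)
        if required is None:
            continue                      # no range axiom for this property
        for asserted in types.get(obj, ()):
            if frozenset({asserted, required}) in disjoint:
                violations.append((subject, prop, obj, asserted, required))
    return violations

facts = [("dbr:Purify_(album)", "dbo:artist", "dbr:Axis_of_Advance")]
types = {"dbr:Axis_of_Advance": {"dbo:Organisation"}}
ranges = {"dbo:artist": "dbo:Person"}
disjoint = {frozenset({"dbo:Organisation", "dbo:Person"})}

print(range_disjointness_violations(facts, types, ranges, disjoint))
```

The four axioms reported for a violation (fact, type assertion, range, disjointness) together form a justification for the inconsistency.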


DBpedia Live Demo 2

Inconsistency in DBpedia in combination with WGS84 (Linked Data):

Individual: dbr:WKWS

Facts: geo:long -81.76833343505859

Types: dbo:Organisation

DataProperty: geo:long

Domain: geo:SpatialThing

Class: dbo:Organisation

DisjointWith: geo:SpatialThing


OpenCyc Demo

Inconsistency in OpenCyc:

Individual: 'PopulatedPlace'

Types: 'ArtifactualFeatureType', 'ExistingStuffType'

Class: 'ArtifactualFeatureType'

SubClassOf: 'ExistingObjectType'

Class: 'ExistingObjectType'

DisjointWith: 'ExistingStuffType'


ORE - Screenshot


Related Tools

Swoop

can compute justifications for unsatisfiability of classes and offers a repair mode

fine-grained justification computation algorithm is incomplete

can also compute justifications for an inconsistent ontology, but does not offer a repair mode in this case

does not extract locality-based modules, which leads to lower performance for large ontologies

RaDON

plugin for the NeOn toolkit

offers a number of techniques for working with inconsistent or incoherent ontologies

allows reasoning with inconsistent ontologies and can handle sets of ontologies (ontology networks)

no fine-grained justifications, no repair impact analysis

Pellint

searches for common patterns which lead to potential reasoning performance problems

integration in ORE planned


Related Tools II

PION and DION

developed in the SEKT project to deal with inconsistencies

PION is an inconsistency tolerant reasoner (four-valued paraconsistent logic)

DION offers the possibility to compute justifications, but no repair

Explanation Workbench

Protégé plugin for reasoner requests like class unsatisfiability or inferred subsumption relations

can compute regular and laconic justifications

motivated the ORE debugging interface

current version of Explanation Workbench does not allow removing axioms in laconic justifications

RepairTab

supports the user in finding and detecting errors in ontologies

RepairTab uses a modified tableau algorithm

shows inferences which can no longer be drawn after removing an axiom (inspired ORE)


Outline

1 Introduction to Linked Data

2 Linked Dataset Example: DBpedia

3 Linked Data Life-Cycle Overview

4 Knowledge Extraction

5 Data Integration / Linking

6 Enrichment

7 Repair

8 Knowledge Base Exploration / Querying



Motivation

User Query Interfaces:

Knowledge Base Specific Interfaces

Facet-Based Browsing

Visual SPARQL Query Builders

Question Answering

Which tools for creating (SPARQL) queries are end-user friendly?


Strengths and Weaknesses of Query Interfaces

[Figure: comparison chart. Dimensions: Easy to Use, Robust, Flexible Queries, Expressive. Interfaces compared: Knowledge Base Specific, Facet-Based Browsing, Visual SPARQL Query Builders, Question Answering]


Faceted Browsing

Simple way to browse structured information

User starts with all resources and then drills down via facets

Multiple dimensions can be supported for facets, e.g. taxonomy, existence of properties, values of properties

Can be combined with text search: previously, information was browsed either via a fixed classification scheme or via text search (with the latter being dominant); facet-based browsing allows a combination of both
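The drill-down idea can be illustrated on a handful of resources held in memory. Everything in this sketch is hypothetical example data (the `ex:` resources, properties and helper names); a real faceted browser would issue SPARQL queries against an endpoint instead.

```python
from collections import Counter

# Hypothetical toy data: resource -> property -> set of values.
resources = {
    "ex:Leipzig": {"rdf:type": {"ex:City"},  "ex:country": {"ex:Germany"}},
    "ex:Dresden": {"rdf:type": {"ex:City"},  "ex:country": {"ex:Germany"}},
    "ex:Paris":   {"rdf:type": {"ex:City"},  "ex:country": {"ex:France"}},
    "ex:Elbe":    {"rdf:type": {"ex:River"}, "ex:country": {"ex:Germany"}},
}

def drill_down(resources, constraints):
    """Keep the resources that have every (facet, value) pair in constraints."""
    return {name: props for name, props in resources.items()
            if all(value in props.get(facet, set()) for facet, value in constraints)}

def facet_counts(resources, facet):
    """Count how many of the given resources carry each value of a facet."""
    return Counter(value for props in resources.values()
                   for value in props.get(facet, set()))

cities = drill_down(resources, [("rdf:type", "ex:City")])
print(facet_counts(cities, "ex:country"))
print(sorted(drill_down(cities, [("ex:country", "ex:Germany")])))
```

The counts shown next to each facet value tell the user how many resources remain after selecting it, which is what makes the drill-down predictable.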


Facete

Generic Facet-Based Browser

RDF properties are facets (sub-facets are supported)

Each facet serves as a source for columns (table rendering) and points of interest (map rendering)

JavaScript implementation - SPARQL queries are done by the client

Each SPARQL endpoint can serve as backend, no API needs to be implemented


Question Answering - State of the art

1 Map natural language question to triple-based representation.

2 Match triple-based representation against RDF data.

Example:

Where did Abraham Lincoln die?

SELECT ?x WHERE {
  res:Abraham_Lincoln dbo:deathPlace ?x .
}

PowerAqua:

Triple representation: 〈state/place, die, Abraham Lincoln〉

Ontology mappings: 〈Place, deathPlace, Abraham_Lincoln〉


Problem

Triples do not always provide a faithful representation of the semantic structure of the question.

Thus more expressive queries cannot be answered.

Example 1:

Which cities have more than three universities?

SELECT ?y WHERE {
  ?x rdf:type dbo:University .
  ?x dbo:city ?y .
}
HAVING (COUNT(?x) > 3)

PowerAqua (triple representation): 〈cities, more than, universities three〉


Example 2:

Who produced the most films?

SELECT ?y WHERE {
  ?x rdf:type dbo:Film .
  ?x dbo:producer ?y .
}
ORDER BY DESC(COUNT(?x)) LIMIT 1

PowerAqua (triple representation): 〈person/organization, produced, most films〉


Goal

In order to understand a user question, we need to understand:

The words

Find a mapping from natural language expressions to ontology concepts.

Abraham Lincoln → res:Abraham_Lincoln

died in → dbo:deathPlace

The semantic structure

Determine the triple structure as well as filters and aggregation functions (order by, count, etc.).

the most N → ORDER BY DESC(COUNT(?n)) LIMIT 1

more than three N → HAVING (COUNT(?n) > 3)

Goal: an approach that combines both an analysis of the semantic structure and a mapping of words to URIs


Template-based question answering

1 Template generation (Understanding the semantic structure)

Parse question to produce a SPARQL template that directly mirrors the structure of the question, including filters and aggregation operations.

2 Template instantiation (Understanding the words)

Instantiate SPARQL template by matching natural language expressions with ontology concepts using statistical entity identification and predicate detection.


Example: Who produced the most films?

1 SPARQL template:

SELECT ?x WHERE {
  ?y rdf:type ?c .
  ?y ?p ?x .
}
ORDER BY DESC(COUNT(?y)) LIMIT 1

?c CLASS [films]

?p PROPERTY [produced]

2 Instantiations:

?c = <http://dbpedia.org/ontology/Film>

?p = <http://dbpedia.org/ontology/producer>
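The instantiation step can be pictured as plain slot substitution in the template text. The following sketch is an illustration under stated assumptions, not the system's implementation: the function name is hypothetical, the `rdf:` prefix declaration is omitted, and a `GROUP BY ?x` clause has been added because SPARQL 1.1 engines generally require grouping when an aggregate such as COUNT is used in ORDER BY.

```python
import re

# Template from the example, with GROUP BY added (assumption, see lead-in).
TEMPLATE = """SELECT ?x WHERE {
  ?y rdf:type ?c .
  ?y ?p ?x .
}
GROUP BY ?x
ORDER BY DESC(COUNT(?y)) LIMIT 1"""

def instantiate(template, bindings):
    """Replace slot variables such as ?c and ?p by the chosen URIs."""
    query = template
    for slot, uri in bindings.items():
        # \b prevents ?c from matching a longer variable name like ?city.
        pattern = re.escape(slot) + r"\b"
        query = re.sub(pattern, lambda _match, u=uri: f"<{u}>", query)
    return query

query = instantiate(TEMPLATE, {
    "?c": "http://dbpedia.org/ontology/Film",
    "?p": "http://dbpedia.org/ontology/producer",
})
print(query)
```

Since several candidate URIs are usually available per slot, one template typically yields several instantiated queries, which then need to be ranked.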


Architecture

[Figure: architecture diagram. Labels: Natural Language Question, Tagged Question, Parsing, Domain Independent Lexicon, Domain Dependent Lexicon, Semantic Representation, Templates, Entity identification, Resources and Classes, Properties, BOA Pattern Library, Corpora, Loading, Templates with URI slots, Entity and Query Ranking, Type Checking and Prominence, Ranked SPARQL Queries, Query Selection, SPARQL Query, SPARQL Endpoint, LOD, Answer. Legend: State, Process, Uses]


Step 1: Template generation - Linguistic processing

1 Natural language question is tagged with part-of-speech information.

2 Based on POS tags, lexical entries are built on the fly.

Lexical entries are pairs of:

tree structures (Lexicalized Tree Adjoining Grammar)

semantic representations (ext. Discourse Representation Structures)

3 These lexical entries, together with domain-independent lexical entries, are used for parsing the question.

4 The resulting semantic representation is translated into a SPARQL template.


Example: Who produced the most films?

domain-independent: who, the most

domain-dependent: produced/VBD, films/NNS

SPARQL template 1:

SELECT ?x WHERE {
  ?x ?p ?y .
  ?y rdf:type ?c .
}
ORDER BY DESC(COUNT(?y)) LIMIT 1

?c CLASS [films]

?p PROPERTY [produced]


SPARQL template 2:

SELECT ?x WHERE {
  ?x ?p ?y .
}
ORDER BY DESC(COUNT(?y)) LIMIT 1

?p PROPERTY [films]


Step 2: Template instantiation - Entity identification

1 For resources and classes:

Identify synonyms of the label using WordNet.

Retrieve entities with a label similar to the slot label based on string similarities (trigram, Levenshtein, substring).

2 For property labels, the label is additionally compared to natural language expressions stored in the BOA pattern library.

3 The highest ranking entities are returned as candidates for filling the query slots.
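Two of the string similarities named in step 1 can be sketched in a few lines. These are textbook formulations chosen for illustration (Levenshtein edit distance by dynamic programming, and Jaccard overlap of padded character trigrams); how the system actually combines and weights them is not stated in the slides, and the candidate labels below are made up.

```python
def levenshtein(a, b):
    """Edit distance: minimal number of insertions, deletions, substitutions."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            current.append(min(previous[j] + 1,                      # deletion
                               current[j - 1] + 1,                   # insertion
                               previous[j - 1] + (char_a != char_b)))  # substitution
        previous = current
    return previous[-1]

def trigrams(text):
    """Character trigrams of the lower-cased, space-padded text."""
    padded = f"  {text.lower()} "
    return {padded[i:i + 3] for i in range(len(padded) - 2)}

def trigram_similarity(a, b):
    """Jaccard overlap of the two trigram sets (1.0 means identical sets)."""
    set_a, set_b = trigrams(a), trigrams(b)
    return len(set_a & set_b) / len(set_a | set_b)

def rank_candidates(slot_label, labels):
    """Order candidate labels by edit distance to the slot label."""
    return sorted(labels, key=lambda label: levenshtein(slot_label.lower(), label.lower()))

print(rank_candidates("films", ["Album", "Film", "Filmmaker"]))
print(trigram_similarity("films", "film"))
```

Edit distance copes well with inflection ("films" vs. "Film"), while trigram overlap is less sensitive to word order in multi-word labels.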


BOA

The BOA pattern library is a repository of natural language representations of Semantic Web predicates.

Idea:

For each predicate P in a data repository (e.g. DBpedia), collect the set of entities S and O connected through P.

Search a text corpus (e.g. Wikipedia) for all sentences containing the labels of S and O.

For all retrieved sentences, the natural language predicate is a potential pattern for P. The potential patterns are then scored by a neural network (e.g. according to frequency) and filtered.

Page 322: The Linked Data Lifecycle

BOA: Example

Predicate: http://dbpedia.org/ontology/subsidiary

RDF snippet:

<http://dbpedia.org/resource/Google> <http://dbpedia.org/ontology/subsidiary> <http://dbpedia.org/resource/YouTube> .

<http://dbpedia.org/resource/Google> rdfs:label "Google"@en .

<http://dbpedia.org/resource/YouTube> rdfs:label "Youtube"@en .

Sentences:

Google's acquisition of Youtube comes as online video is really starting to hit its stride.

Youtube, a division of Google, is exploring a new way to get more high-quality clips on its site: financing amateur video creators.

Patterns:

subsidiary: S's acquisition of O

subsidiary: O, a division of S
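
The pattern extraction step of this example can be sketched as follows. This is a minimal illustration under stated assumptions, not the BOA implementation: the function name and the regular expression are hypothetical, and the scoring and filtering step is omitted entirely.

```python
import re

SENTENCES = [
    "Google's acquisition of Youtube comes as online video is really "
    "starting to hit its stride.",
    "Youtube, a division of Google, is exploring a new way to get more "
    "high-quality clips on its site: financing amateur video creators.",
]


def extract_patterns(subject_label, object_label, sentences):
    """Yield the text between the two labels, with the labels replaced by S and O."""
    s, o = re.escape(subject_label), re.escape(object_label)
    # Match either order: "S ... O" or "O ... S"; the middle part is kept short (lazy).
    label_pair = re.compile(rf"\b({s}|{o})\b(.+?)\b({s}|{o})\b", re.IGNORECASE)
    for sentence in sentences:
        match = label_pair.search(sentence)
        if match is None:
            continue
        first, middle, last = match.groups()
        if first.lower() == last.lower():
            continue  # the same label twice is not a pattern for the predicate
        if first.lower() == subject_label.lower():
            yield f"S{middle}O"
        else:
            yield f"O{middle}S"


for pattern in extract_patterns("Google", "Youtube", SENTENCES):
    print("subsidiary:", pattern)
```

Run on the two sentences above, the sketch yields the two patterns listed on the slide.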

Page 323: The Linked Data Lifecycle

BOA

The use of BOA patterns allows us to match natural language expressions and ontology concepts even if they are not string similar and not covered by WordNet.

Examples:

married to → http://dbpedia.org/ontology/spouse

was born in → http://dbpedia.org/ontology/birthPlace

graduated from → http://dbpedia.org/ontology/almaMater

write → http://dbpedia.org/ontology/author

Page 324: The Linked Data Lifecycle

Example: Who produced the most films?

Candidates for filling query slots:

?c CLASS [films]

<http://dbpedia.org/ontology/Film>

<http://dbpedia.org/ontology/FilmFestival>

. . .

?p PROPERTY [produced]

<http://dbpedia.org/ontology/producer>

<http://dbpedia.org/property/producer>

<http://dbpedia.org/ontology/wineProduced>

. . .
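
How the candidates turn into concrete queries can be sketched as follows: each combination of one candidate per slot yields one instantiated query. The enumeration strategy and the names `TEMPLATE`, `CANDIDATES` and `instantiate` are assumptions for illustration; only the URIs and the query shape come from the slides.

```python
from itertools import product

# Template 1 with named slots; doubled braces are literal braces in str.format.
TEMPLATE = (
    "SELECT ?x WHERE {{ ?x <{p}> ?y . ?y rdf:type <{c}> . }} "
    "ORDER BY DESC(COUNT(?y)) LIMIT 1"
)

# Candidate URIs per slot, as listed above (the elided entries are left out).
CANDIDATES = {
    "c": [
        "http://dbpedia.org/ontology/Film",
        "http://dbpedia.org/ontology/FilmFestival",
    ],
    "p": [
        "http://dbpedia.org/ontology/producer",
        "http://dbpedia.org/property/producer",
        "http://dbpedia.org/ontology/wineProduced",
    ],
}


def instantiate(template, candidates):
    """Yield one query string per combination of slot candidates."""
    slots = sorted(candidates)
    for combination in product(*(candidates[slot] for slot in slots)):
        yield template.format(**dict(zip(slots, combination)))


for query in instantiate(TEMPLATE, CANDIDATES):
    print(query)
```

With two class candidates and three property candidates this produces six queries, which is why a ranking step is needed afterwards.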

Page 325: The Linked Data Lifecycle

Step 3: Query ranking and selection

1 Every entity receives a score considering string similarity and prominence (roughly how often it occurs in the knowledge base).

2 The score of a query is then computed as the average of the scores of the entities used to fill its slots.

3 In addition, type checks are performed: For all triples ?x rdf:type <class>, all query triples ?x p e and e p ?x are checked w.r.t. whether domain/range of p is consistent with <class>. (If not, the query is rejected.)

4 Of the remaining queries, the one with highest score that returns a result is chosen to retrieve an answer.
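
Items 2 to 4 can be sketched for template 1 as follows. Everything concrete here is hypothetical: the `ex:` identifiers, the entity scores, the range declaration, and the `has_result` callback that stands in for running the query against an endpoint. The type check is simplified to an equality test on the declared range, whereas a real check would also consider domains and subclass relations.

```python
from statistics import mean

# Hypothetical per-entity scores (string similarity combined with prominence).
ENTITY_SCORES = {
    "ex:produced": 0.8,
    "ex:wineProduced": 0.9,
    "ex:Film": 0.7,
    "ex:FilmFestival": 0.5,
}

# Hypothetical schema: declared range per property (missing means undeclared).
PROPERTY_RANGE = {"ex:wineProduced": "ex:Wine"}


def query_score(slot_entities, entity_scores):
    """Item 2: average score of the entities filling the slots."""
    return mean(entity_scores[entity] for entity in slot_entities)


def passes_type_check(prop, cls, property_range):
    """Item 3, simplified: in '?x p ?y . ?y rdf:type cls' the range of p must fit cls."""
    declared = property_range.get(prop)
    return declared is None or declared == cls


def select_query(candidates, entity_scores, property_range, has_result):
    """Item 4: best-scoring query that survives the type check and returns a result."""
    surviving = [(p, c) for p, c in candidates
                 if passes_type_check(p, c, property_range)]
    ranked = sorted(surviving,
                    key=lambda pair: query_score(pair, entity_scores),
                    reverse=True)
    for pair in ranked:
        if has_result(pair):
            return pair
    return None


CANDIDATE_QUERIES = [
    ("ex:wineProduced", "ex:Film"),      # highest average, but fails the type check
    ("ex:produced", "ex:Film"),
    ("ex:produced", "ex:FilmFestival"),
]

print(select_query(CANDIDATE_QUERIES, ENTITY_SCORES, PROPERTY_RANGE,
                   has_result=lambda pair: True))
```

In this toy setting the combination with the highest average score is rejected by the type check, so the second-best combination is selected.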

Page 329: The Linked Data Lifecycle

Example: Who produced the most films?

SELECT ?x WHERE {
  ?x <http://dbpedia.org/ontology/producer> ?y .
  ?y rdf:type <http://dbpedia.org/ontology/Film> .
}
ORDER BY DESC(COUNT(?y)) LIMIT 1

Score: 0.7592425075864263

SELECT ?x WHERE {
  ?x <http://dbpedia.org/ontology/film> ?y .
}
ORDER BY DESC(COUNT(?y)) LIMIT 1

Score: 0.6264001353183296

SELECT ?x WHERE {
  ?x <http://dbpedia.org/ontology/producer> ?y .
  ?y rdf:type <http://dbpedia.org/ontology/FilmFestival> .
}
ORDER BY DESC(COUNT(?y)) LIMIT 1

Score: 0.6012584940627768

Page 330: The Linked Data Lifecycle

Evaluation set-up

Question set: 39 DBpedia training questions from QALD-1

The other 11 questions rely on namespaces which were not incorporated in predicate detection (FOAF and YAGO).

POS tags were idealized, in order to exclude tagging errors.

Evaluation measures:

\[ \text{Precision} = \frac{\text{number of correct resources returned by system}}{\text{number of resources returned by system}} \]

\[ \text{Recall} = \frac{\text{number of correct resources returned by system}}{\text{number of resources in gold standard answer}} \]
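
For a single question, the two measures can be computed as in the following sketch. Treating answers as sets of resource URIs and returning 0.0 for an empty answer or an empty gold standard are assumptions; the slides do not say how such cases were handled.

```python
def precision_recall(returned, gold):
    """Precision and recall of a returned answer set against a gold standard set."""
    returned, gold = set(returned), set(gold)
    correct = len(returned & gold)
    # Assumption: an empty denominator counts as 0.0.
    precision = correct / len(returned) if returned else 0.0
    recall = correct / len(gold) if gold else 0.0
    return precision, recall


# Hypothetical answer sets: 2 of 3 returned resources are correct, gold has 4.
print(precision_recall({"res:A", "res:B", "res:C"},
                       {"res:A", "res:B", "res:D", "res:E"}))
```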

Page 331: The Linked Data Lifecycle

Results

Of the 39 questions ...

5 could not be parsed due to unknown syntactic constructions or uncovered domain-independent expressions

19 were answered exactly as required by the benchmark (with precision and recall 1.0)

another 2 were answered almost correctly (with precision and recall greater than 0.8)

Mean precision: 0.61

Mean recall: 0.63

F-measure: 0.62
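
The slides do not state how the F-measure was obtained. Assuming it is the harmonic mean of the mean precision and the mean recall, the reported figures are consistent with each other, as this small check suggests:

```python
def f_measure(precision, recall):
    """Harmonic mean of precision and recall (assumed definition)."""
    total = precision + recall
    return 2 * precision * recall / total if total else 0.0


# Values reported above: mean precision 0.61, mean recall 0.63.
print(round(f_measure(0.61, 0.63), 2))
```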

Page 332: The Linked Data Lifecycle

Discussion: Main sources of error

Incorrect templates

Template structure does not coincide with structure of the data:

When did Germany join the EU?
res:Germany dbp:accessioneudate ?x .

Predicate detection fails

inhabitants ↛ dbp:population, dbp:populationTotal

owns ↛ dbo:keyPerson

higher ↛ dbp:elevationM

Wrong query is selected

Who wrote The pillars of the Earth?
res:The_Pillars_of_the_Earth_(TV_Miniseries) dbo:writer ?x .
res:The_Pillars_of_the_Earth dbo:author ?x .

Page 333: The Linked Data Lifecycle

Conclusion

Two-step approach:

1 Build templates that capture the semantic structure of a question.

map complex expressions (quantifiers, comparatives, superlatives, etc.) to aggregation functions

2 Instantiate templates mapping natural language expressions to URIs.

BOA patterns for cases where string similarity and WordNet are not sufficient

Page 334: The Linked Data Lifecycle

Outlook

Current work: Entity identification should take into account whether candidate entities actually have any connection in the dataset.

Future work: Make templates less rigid and determine the exact triple structure on the basis of data exploration.

The created template structure does not always coincide with how the data is actually modelled.

Considering all possibilities of how the data could be modelled leads to a huge number of templates (and even more queries) for one question.

Page 335: The Linked Data Lifecycle

Links

Web: http://aksw.org/Projects/AutoSPARQL

Demo: http://autosparql-tbsl.dl-learner.org

BOA: http://boa.aksw.org

Daniel Gerber and Axel-Cyrille Ngonga Ngomo: Bootstrapping the Linked Data Web. In: Proceedings of the Web Scale Knowledge Extraction Workshop (WekEx), ISWC 2011.

QALD: http://www.sc.cit-ec.uni-bielefeld.de/ild/

Page 336: The Linked Data Lifecycle

The End

Jens [email protected]/Uni Leipzig

GeoKnow

http://geoknow.eu
