Transcript of DBpedia - A crystallization point for the Web of Data · 2019-12-27
- 1 -
DBpedia - A crystallization point for the Web of Data
Bizer C, Lehmann J, Kobilarov G, et al. DBpedia - A crystallization point for the Web of Data[J]. Web Semantics: Science, Services and Agents on the World Wide Web, 2009, 7(3): 154-165.
Christian Bizer, Jens Lehmann, Georgi Kobilarov, Sören Auer, Christian Becker, Richard Cyganiak, Sebastian Hellmann
Presenter: Hu Zhiqiang
March 14, 2018
- 2 -
Contents
➢ 1. Introduction
➢ 2. The DBpedia knowledge extraction framework
➢ 3. The DBpedia knowledge base
➢ 4. Accessing the DBpedia knowledge base over the web
➢ 5. Interlinked web content
➢ 6. Applications facilitated by DBpedia
➢ 7. Related work
➢ 8. Conclusions and future work
- 3 -
Abstract
➢ The DBpedia project
It is a community effort to extract structured information from Wikipedia and to make this information accessible on the Web.
➢ This article describes
the extraction of the DBpedia knowledge base;
the current status of interlinking DBpedia with other data sources on the Web;
an overview of applications that facilitate the Web of Data around DBpedia.
Keywords: Web of Data; Linked Data; Knowledge extraction; Wikipedia; RDF
- 4 -
➢ Linked Data Principles
Use URIs as names for things
Use HTTP URIs so that people can look up those names
Use RDF as data format
Include links to additional content
➢ RDF (Resource Description Framework)
A triple model: (subject, predicate, object)
Resources and properties are identified by URIs
E.g. (SeminarOne, speaker, Zhiqiang)
(SeminarOne, theme, DBpedia)
[Figure: RDF graph with nodes http://xxx/SeminarOne, http://xxx/Zhiqiang and http://xxx/DBpedia, connected by the edges http://xxx/schema#speaker and http://xxx/schema#theme.]
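The triple model above can be sketched in plain Python, with tuples standing in for a real RDF library; the http://xxx/ URIs are the slide's own placeholders, not real identifiers.

```python
# Each triple is (subject, predicate, object); subjects and predicates are
# URIs, objects may be URIs or literal values.
triples = [
    ("http://xxx/SeminarOne", "http://xxx/schema#speaker", "http://xxx/Zhiqiang"),
    ("http://xxx/SeminarOne", "http://xxx/schema#theme", "http://xxx/DBpedia"),
]

# Query the tiny graph: what is the theme of SeminarOne?
themes = [o for s, p, o in triples
          if s == "http://xxx/SeminarOne" and p.endswith("#theme")]
print(themes)  # ['http://xxx/DBpedia']
```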
- 5 -
1. Introduction
➢ The motivation
Today, most knowledge bases cover only specific domains, are created by relatively small groups of knowledge engineers, and are very cost-intensive to keep up-to-date as domains change;
Wikipedia has grown into one of the central knowledge sources of mankind, maintained by thousands of contributors.
➢ The resulting DBpedia knowledge base
Describes more than 2.6 million entities, including 198,000 persons, 328,000 places, 101,000 musical works, 34,000 films, and 20,000 companies;
Includes 3.1 million links to external web pages and 4.9 million RDF links into other Web data sources.
- 6 -
1. Introduction
➢ The advantages over existing knowledge bases
It covers many domains;
It represents real community agreement;
It automatically evolves as Wikipedia changes;
It is truly multilingual;
It is accessible on the Web.
➢ The contributions to the development of the Web of Data
Develop an information extraction framework that converts Wikipedia content into a rich multi-domain knowledge base;
Define a Web-dereferenceable identifier for each DBpedia entity;
Publish RDF links pointing from DBpedia into other Web data sources and support data publishers in setting links from their data sources to DBpedia.
- 7 -
2. The DBpedia knowledge extraction framework
2.1. Architecture of the extraction framework
The main components:
➢ PageCollections
➢ Extraction Job
➢ Extractors
➢ Parsers
➢ Destinations
Extraction Jobs group a page
collection, extractors and a
destination into a workflow.
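The grouping described above can be sketched as a few Python classes; the class and method names are illustrative stand-ins, not the actual DBpedia framework API.

```python
class LabelExtractor:
    """Toy extractor: turns a page title into an rdfs:label triple."""
    def extract(self, title, text):
        subject = "http://dbpedia.org/resource/" + title.replace(" ", "_")
        return [(subject, "rdfs:label", title)]

class ListDestination:
    """Collects extracted triples in memory (stand-in for an N-Triples file)."""
    def __init__(self):
        self.triples = []
    def write(self, triples):
        self.triples.extend(triples)

def run_extraction_job(pages, extractors, destination):
    # An extraction job groups a page collection, extractors and a destination.
    for title, text in pages:
        for extractor in extractors:
            destination.write(extractor.extract(title, text))

dest = ListDestination()
run_extraction_job([("Tom Hanks", "...article text...")], [LabelExtractor()], dest)
print(dest.triples)
```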
- 8 -
2. The DBpedia knowledge extraction framework
Extractors process the following types of Wikipedia content:
Labels; Abstracts; Interlanguage links; Images; Redirects; Disambiguation; External links; Page links; Homepages; Categories; Geo-coordinates.
Labels: (rdfs:label)
Page links: (dbpedia:wikilink)
Abstracts: first paragraph (rdfs:comment), long abstract (dbpedia:abstract)
- 9 -
2. The DBpedia knowledge extraction framework
Extractors process the following types of Wikipedia content
Categories: (skos:Concept)
External links: (dbpedia:reference)
- 10 -
2. The DBpedia knowledge extraction framework
Extractors process the following types of Wikipedia content
Categories: (skos:broader)
- 11 -
2. The DBpedia knowledge extraction framework
Extractors process the following types of Wikipedia content
Disambiguation: (dbpedia:disambiguates)
- 12 -
2. The DBpedia knowledge extraction framework
Two workflows:
➢ Dump-based extraction
The Wikimedia Foundation publishes SQL dumps of all Wikipedia editions on a monthly basis;
Uses the DatabaseWikipedia page collection as the source of article texts;
Uses the N-Triples serializer as the output destination.
➢ Live extraction
The Wikimedia Foundation has granted access to the Wikipedia OAI-PMH live feed, which instantly reports all Wikipedia changes (OAI-PMH is a metadata-harvesting protocol);
Uses this update stream to extract new RDF whenever a Wikipedia article is changed;
The text of these articles is accessed via the LiveWikipedia page collection;
The SPARQL-Update destination deletes existing triples and inserts new ones into a separate triple store.
- 13 -
2. The DBpedia knowledge extraction framework
(Repeated: the architecture overview from slide 7.)
- 14 -
2. The DBpedia knowledge extraction framework
2.2. Generic versus mapping-based infobox extraction
[Figure: example infoboxes, as rendered on the web in 2018 and as shown in the 2009 paper.]
- 15 -
2. The DBpedia knowledge extraction framework
2.2. Generic versus mapping-based infobox extraction
➢ The problems
Different communities use different templates to describe the same type of things (e.g. infobox_city_japan, infobox_swiss_town);
Different templates use different names for the same attribute (e.g. birthplace and placeofbirth).
➢ Two different extraction approaches
Generic infobox extraction (aims at wide coverage)
Mapping-based infobox extraction (aims at high data quality)
- 16 -
2. The DBpedia knowledge extraction framework
2.2. Generic versus mapping-based infobox extraction
➢ Generic infobox extraction
The corresponding DBpedia URI of the Wikipedia article is used as subject;
(e.g. http://dbpedia.org/resource/Tom_Hanks)
The predicate URI is created by concatenating the namespace fragment http://dbpedia.org/property/ and the name of the infobox attribute;
(e.g. http://dbpedia.org/property/birthDate)
Objects are created from the attribute value;
(e.g. "1956-07-09" (xsd:date); objects can be URI references or literal values)
➢ Advantage and disadvantage
Advantage: complete coverage of all infoboxes and infobox attributes
Disadvantage: synonymous attribute names are not resolved
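The generic rule above can be sketched in a few lines of Python; the infobox dict below is illustrative sample data, not parsed from Wikipedia.

```python
PROPERTY_NS = "http://dbpedia.org/property/"

def generic_extract(article_title, infobox):
    """Generic infobox extraction: one triple per attribute, names kept verbatim."""
    subject = "http://dbpedia.org/resource/" + article_title.replace(" ", "_")
    # Attribute names are used as-is, which is why synonyms such as
    # birthplace/placeofbirth remain unresolved in this approach.
    return [(subject, PROPERTY_NS + attr, value) for attr, value in infobox.items()]

triples = generic_extract("Tom Hanks", {"birthDate": "1956-07-09"})
print(triples)
```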
- 17 -
2. The DBpedia knowledge extraction framework
2.2. Generic versus mapping-based infobox extraction
➢ Mapping-based infobox extraction
Map Wikipedia templates to an ontology;
The ontology was created by manually arranging the 350 most commonly
used infobox templates into a subsumption hierarchy consisting of 170
classes;
Map 2350 attributes from within these templates to 720 ontology properties.
➢ Advantage and disadvantage
Advantage: overcomes the problems of synonymous attribute names and multiple templates
Disadvantage: covers only 350 Wikipedia templates
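The mapping idea can be sketched as a hand-written lookup table that resolves synonymous attribute names to one ontology property; the mapping entries and template names below are invented examples, not DBpedia's actual mapping rules.

```python
ONTOLOGY_NS = "http://dbpedia.org/ontology/"

# Maps (template, attribute) pairs to ontology properties.
ATTRIBUTE_MAPPING = {
    ("infobox_person", "birthplace"): "birthPlace",
    ("infobox_actor", "placeofbirth"): "birthPlace",
}

def mapped_extract(article_title, template, infobox):
    subject = "http://dbpedia.org/resource/" + article_title.replace(" ", "_")
    triples = []
    for attr, value in infobox.items():
        prop = ATTRIBUTE_MAPPING.get((template, attr))
        if prop:  # unmapped attributes are dropped (the coverage cost)
            triples.append((subject, ONTOLOGY_NS + prop, value))
    return triples

# Both spellings end up on the same ontology property:
a = mapped_extract("A", "infobox_person", {"birthplace": "Concord"})
b = mapped_extract("B", "infobox_actor", {"placeofbirth": "Concord"})
print(a[0][1] == b[0][1])  # True
```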
- 18 -
3. The DBpedia knowledge base
The overall description
[Table: common DBpedia classes with the number of their instances and example properties.]
➢ Data scale
more than 2.6 million entities;
labels and short abstracts in 30 different languages;
609,000 links to images;
3,150,000 links to external web pages;
415,000 Wikipedia categories;
286,000 YAGO categories.
- 19 -
3. The DBpedia knowledge base
3.1. Identifying entities
➢ DBpedia uses English article names for creating identifiers;
➢ Resources are assigned a URI according to the pattern
http://dbpedia.org/resource/Name
(e.g. http://dbpedia.org/resource/Tom_Hanks)
- 20 -
3. The DBpedia knowledge base
3.2. Classifying entities
Four classification schemata, compared:
Schema | Number of categories | Description
Wikipedia | 415,000 | Collaboratively extended and kept up-to-date by thousands of Wikipedia editors; its categories do not form a proper topical hierarchy.
YAGO | 286,000 | Forms a deep subsumption hierarchy; there are a few errors and omissions due to its automatic generation (e.g. the class "MultinationalCompaniesHeadquarteredInTheNetherlands").
UMBEL | 20,000 | A lightweight ontology.
DBpedia ontology | 170 | Includes 720 properties with domain and range definitions; the ontology was manually created.
- 21 -
3. The DBpedia knowledge base
3.3. Describing entities
➢ Comparison of the generic infobox, mapping-based infobox and pagelinks datasets
➢ Comparison of the graph structure of the generic infobox, mapping-based
infobox and pagelinks datasets.
(all numbers are for DBpedia release 3.2, English version)
(To measure characteristics of the RDF graph that connects DBpedia entities, all triples that do not point at a DBpedia entity are removed, including all literal triples, all external links and all dead links.)
The average node indegree is the sum of all inbound edges divided by the number of objects.
The clustering coefficient is the number of existing connections between neighbors of a node, divided by the number of possible connections, k*(k − 1), where k is the number of neighbors of the node.
(Share of triples pointing at other DBpedia entities: 26% for the generic infobox dataset, 53% for the mapping-based dataset.)
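The two graph measures defined above can be computed on a toy graph of directed (subject, object) edges; the edges are invented sample data, not DBpedia triples.

```python
from collections import defaultdict

edges = [("a", "b"), ("a", "c"), ("b", "c"), ("d", "c")]

# Average node indegree: all inbound edges divided by the number of objects.
indegree = defaultdict(int)
for _, obj in edges:
    indegree[obj] += 1
avg_indegree = sum(indegree.values()) / len(indegree)

def clustering(node):
    """Existing connections between the node's neighbors over k*(k-1) possible."""
    neighbors = {o for s, o in edges if s == node} | {s for s, o in edges if o == node}
    k = len(neighbors)
    if k < 2:
        return 0.0
    links = sum(1 for s, o in edges if s in neighbors and o in neighbors)
    return links / (k * (k - 1))

print(avg_indegree, clustering("a"))  # 2.0 0.5
```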
- 22 -
3. The DBpedia knowledge base
3.3. Describing entities
➢ Comparison of the generic
infobox, mapping-based
infobox and pagelinks
datasets in terms of node
indegree versus rank.
(all numbers are for DBpedia release 3.2, English version)
The node indegrees follow
a power-law distribution
in all datasets;
- 23 -
4. Accessing the DBpedia knowledge base over the web
Four access mechanisms for the DBpedia knowledge base:
➢ Linked Data
Publishes RDF data on the Web, relying on HTTP URIs as resource identifiers and the HTTP protocol to retrieve resource descriptions;
E.g. http://dbpedia.org/page/Tom_Hanks
➢ SPARQL endpoint
Provides a SPARQL endpoint for querying the DBpedia knowledge base.
URL: http://dbpedia.org/sparql
➢ RDF dumps
The DBpedia knowledge base is sliced by triple predicate into several parts, and N-Triples serialisations of these parts are offered for download.
URL: http://wiki.dbpedia.org/develop/datasets
➢ Lookup index
Provides a lookup service that proposes DBpedia URIs for a given label.
URL: http://lookup.dbpedia.org/api/search.asmx
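Querying the SPARQL endpoint listed above follows the SPARQL protocol: the query travels in a `query` URL parameter. The sketch below only builds the request URL (no network call); the query itself is a small illustrative example, not one from the paper.

```python
from urllib.parse import urlencode

ENDPOINT = "http://dbpedia.org/sparql"
query = """
SELECT ?abstract WHERE {
  <http://dbpedia.org/resource/Tom_Hanks> rdfs:comment ?abstract .
}
"""

# `format` asks the endpoint to return JSON results instead of an HTML page.
url = ENDPOINT + "?" + urlencode({"query": query,
                                  "format": "application/sparql-results+json"})
print(url.startswith("http://dbpedia.org/sparql?query="))  # True
```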
- 24 -
5. Interlinked web content
Various data sources that are interlinked with DBpedia
These RDF links lay the foundation for:
➢ Web of Data browsing and crawling
➢ Web data fusion and mashups
➢ Web content annotation (e.g. references to places)
➢ Agents can follow these links to retrieve additional information about Spain.
➢ DBpedia identifiers support data integration, e.g. annotating the topical subject of a paper.
- 25 -
5. Interlinked web content
Various data sources that are interlinked with DBpedia
[Figure: data sources that are interlinked with DBpedia.]
- 26 -
5. Interlinked web content
Various data sources that are interlinked with DBpedia
[Figure: distribution of the 4.9 million outgoing RDF links pointing from DBpedia to other datasets, and data sources publishing RDF links pointing at DBpedia entities; example entities shown include a book and a computer scientist.]
- 27 -
6. Applications facilitated by DBpedia
The applications
➢ Browsing and exploration
(DBpedia URIs make good starting points to explore or crawl the Web of Data)
DBpedia Mobile
➢ Querying and search
DBpedia Query Builder
Relationship Finder
➢ Content annotation
- 28 -
6. Applications facilitated by DBpedia
6.1. Browsing and exploration
➢ DBpedia Mobile
A location-aware client for the Semantic Web that uses DBpedia locations as navigation starting points.
It allows users to discover, search and publish Linked Data pertaining to their current physical environment.
[Figure: interesting navigation paths, e.g. from a location to a person in DBpedia, from an author to the author's books in the Book Mashup, and from local bands the user is interested in to their albums in MusicBrainz.]
- 29 -
6. Applications facilitated by DBpedia
6.2. Querying and search
➢ DBpedia Query Builder
A form-based DBpedia query builder.
➢ It can answer a query about soccer players that play for specific clubs and are born in countries with more than 10 million inhabitants.
➢ It offers a look-ahead search that proposes suitable options.
- 30 -
6. Applications facilitated by DBpedia
6.2. Querying and search
➢ Relationship Finder
The DBpedia Relationship Finder, displaying a connection between two objects.
➢ The Relationship Finder allows users to find connections between two different entities in DBpedia.
➢ It answers: does a connection exist? If so, a connection is displayed, preferring the shortest one.
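The "exists? → connection → shortest connection" behavior can be sketched as a breadth-first search over an undirected view of the graph; the edges below are invented sample data, not DBpedia triples.

```python
from collections import deque

edges = [("Leipzig", "Germany"), ("Germany", "Angela_Merkel"),
         ("Leipzig", "Saxony"), ("Saxony", "Germany")]

def shortest_connection(start, goal):
    """BFS: the first path that reaches the goal is a shortest connection."""
    neighbors = {}
    for a, b in edges:
        neighbors.setdefault(a, set()).add(b)
        neighbors.setdefault(b, set()).add(a)
    queue, seen = deque([[start]]), {start}
    while queue:
        path = queue.popleft()
        if path[-1] == goal:
            return path
        for nxt in neighbors.get(path[-1], ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(path + [nxt])
    return None  # no connection exists

print(shortest_connection("Leipzig", "Angela_Merkel"))
# ['Leipzig', 'Germany', 'Angela_Merkel']
```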
- 31 -
7. Related work
➢ Extraction of structured Wikipedia content
Freebase Wikipedia Extraction;
YAGO, which extracts 14 relationship types, such as subClassOf, locatedIn, bornInYear;
the KOG system, which uses both SVMs and Markov Logic Networks.
➢ NLP-based knowledge extraction
The semantically annotated snapshot of Wikipedia published by Yahoo!;
the Powerset search engine.
➢ Advancing Wikipedia itself
The Semantic MediaWiki project also aims at enabling the reuse of information within wikis as well as at enhancing search and browse facilities.
➢ Stability of Wikipedia identifiers
Confirms the approach of using DBpedia URIs for interlinking data sources across the Web of Data.
- 32 -
8. Conclusions and future work
➢ Cross-language infobox knowledge fusion
Infoboxes within different Wikipedia editions cover different aspects of an entity at varying degrees of completeness.
➢ Wikipedia article augmentation
Interlinking DBpedia with other data sources makes it possible to develop a MediaWiki extension that augments Wikipedia articles with additional information;
e.g. for a geographic location such as a city or monument, additional facts can be added from the CIA World Factbook.
➢ Wikipedia consistency checking
The extraction of different Wikipedia editions and the interlinking of DBpedia with external Web knowledge sources lay the foundation for checking the consistency of Wikipedia content.
- 33 -
Thanks