ESWC SS 2013 - Tuesday Tutorial 1 Maribel Acosta and Barry Norton: Providing Linked Data
Providing Linked Data
Presented by: Barry Norton
Maribel Acosta
Motivation: Music!
[Figure: architecture of a Linked Data music application. Labels: Metadata, Streaming providers, Downloads, Musical Content, Other content; Physical Wrapper, LD Wrapper, R2R Transf., RDF/XML, RDFa; Data acquisition; Integrated Dataset (Interlinking, Cleansing, Vocabulary Mapping); SPARQL Endpoint; Publishing; Access; LD Data set; Application (Visualization Module, Analysis & Mining Module).]
LINKED DATA LIFECYCLE
EUCLID -‐ Querying Linked Data 3
Linked Data Principles
1. Use URIs as names for things.
2. Use HTTP URIs so that users can look up those names.
3. When someone looks up a URI, provide useful information, using the standards (RDF*, SPARQL).
4. Include links to other URIs, so that users can discover more things.
EUCLID - Providing Linked Data
Linked Data Lifecycle
Source: Sören Auer. “The Semantic Data Web” (slides)
Source: José M. Alvarez. “My Linked Data Lifecycle”
Source: Michael Hausenblas. “Linked Data lifecycle”
Core Tasks for Providing Linked Data
Based on the proposed LD lifecycles and the LD principles, we can identify 3 main tasks for providing LD:
① Creating: includes data extraction, creation of HTTP URIs, and vocabulary selection. (LD principles 1 & 2)
② Interlinking: involves the creation of (RDF) links to external data sets. (LD principle 4)
③ Publishing: consists of creating the metadata and making the data set accessible. (LD principle 3)
Agenda
1. Creating Linked Data
2. Interlinking Linked Data
3. Publishing Linked Data
4. Linked Data publishing checklist
CREATING LINKED DATA
Extracting the Data
• The data of interest may be stored in a wide range of formats: spreadsheets or tabular data, databases, text
• Several tools support the process of mining data from different repositories, for example R2RML for databases
Using the RDF Data Model
• The RDF data model is used to represent the extracted information
• The nodes represent the concepts/entities within the data. A node corresponds to a URI, a blank node or a literal (literals only in the object position)
• The relationships between the concepts/entities are modeled as arcs
[Figure: a triple drawn as Subject, Predicate (arc), Object]
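As a minimal sketch of the subject/predicate/object structure described above, the following Python models a single triple and writes it in N-Triples syntax. The class names and the choice of foaf:name as predicate are illustrative assumptions, not part of the original slides; blank nodes, datatypes and language tags are left out.

```python
from typing import NamedTuple


class URI(NamedTuple):
    value: str


class Literal(NamedTuple):
    value: str


class Triple(NamedTuple):
    subject: URI
    predicate: URI
    obj: "URI | Literal"  # literals are only allowed in the object position


def to_ntriples(triple: Triple) -> str:
    """Serialize one triple as a single N-Triples line."""
    def term(node):
        if isinstance(node, URI):
            return f"<{node.value}>"
        # Escape backslashes and double quotes inside plain literals.
        escaped = node.value.replace("\\", "\\\\").replace('"', '\\"')
        return f'"{escaped}"'

    return f"{term(triple.subject)} {term(triple.predicate)} {term(triple.obj)} ."


# Hypothetical statement: the MusicBrainz artist resource has the name "The Beatles".
beatles = Triple(
    URI("http://musicbrainz.org/artist/b10bbbfc-cf9e-42e0-be17-e2c3e1d2600d#_"),
    URI("http://xmlns.com/foaf/0.1/name"),
    Literal("The Beatles"),
)
line = to_ntriples(beatles)
print(line)
```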
Naming Things: URIs
• All the things or distinct entities within the data must be named
• According to the Linked Data principles, the standard mechanism to name entities is the URI
• Designing Cool URIs:
– Leave out information about the data such as author, technologies, status, access mechanisms, …
– Simplicity: short, mnemonic URIs
– Stability: maintain the URIs as long as possible
– Manageability: issue the URIs in a way that you can manage
Source: http://www.w3.org/TR/cooluris/
Selecting Vocabularies
• Vocabularies model the concepts and the relationships between them in a knowledge domain
• Terms from well-known vocabularies should be reused wherever possible
• New terms should be defined only if you cannot find the required terms in existing vocabularies
• A large number of vocabularies in RDF are openly available, e.g., Linked Open Vocabularies (LOV)
Selecting Vocabularies (2)
Linked Open Vocabularies
322 vocabularies classified by domain
Source: http://lov.okfn.org/dataset/lov/
Selecting Vocabularies (3)
Linked Open Vocabularies: Analyzing MusicOntology
Source: http://lov.okfn.org/dataset/lov/details/vocabulary_mo.html
Selecting Vocabularies (4)
Other lists of well-known vocabularies are maintained by:
• W3C SWEO Linking Open Data community project: http://www.w3.org/wiki/TaskForces/CommunityProjects/LinkingOpenData/CommonVocabularies
• Library Linked Data Incubator Group (vocabularies in the library domain): http://www.w3.org/2005/Incubator/lld/XGR-lld-vocabdataset-20111025
INTERLINKING LINKED DATA
Interlinking Data Sets
• It’s one of the Linked Data principles: “4. Include links to other URIs, so that users can discover more things.”
• Involves the creation of RDF links between two different RDF data sets:
– Links at instance level (rdfs:seeAlso, owl:sameAs)
– Links at schema level (RDFS subclass/subproperty, OWL equivalent class/property, SKOS mapping properties)
• Appropriate links are detected via link discovery
Interlinking Data Sets (2)
Challenges for link discovery
• Linked Data sets are heterogeneous in terms of vocabularies, formats and data representation
• Large range of knowledge domains
• Scalability: LD is composed of a large number of data sets and RDF triples, hence it is not possible to compare every possible entity pair
Source: Robert Isele. “LOD2 Webinar Series: Silk”
Interlinking Data Sets (3)
Challenges for link discovery
• It corresponds to the entity resolution problem: deciding whether two entities correspond to the same object in the real world
• Name ambiguities: typos, misspellings, different languages, homonyms
• Structural ambiguities: the same concepts/entities with different structures. Requires the application of ontology and schema matching techniques
Interlinking Data Sets (4)
RDF data sets can be interlinked:
Manually
• Involves the manual exploration of LD data sets and their RDF resources to identify linking targets
• May not be feasible when the number of entities within the data set is very large
Automatically
• Using tools that perform link discovery based on linkage rules, for example: Silk, Limes and xCurator
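To illustrate what a linkage rule does, here is a minimal Python sketch that compares entity labels with a string-similarity measure and emits an owl:sameAs link when the score passes a threshold. It is not the rule language of Silk, Limes or xCurator; the example.org URIs, the labels and the 0.9 threshold are assumptions made for the illustration.

```python
from difflib import SequenceMatcher


def normalize(label: str) -> str:
    """Lower-case the label and collapse whitespace before comparing."""
    return " ".join(label.lower().split())


def similarity(a: str, b: str) -> float:
    """String similarity in [0, 1] based on matching subsequences."""
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio()


def discover_links(source: dict, target: dict, threshold: float = 0.9) -> list:
    """Compare every source label with every target label.

    This naive loop looks at all entity pairs, which is exactly what does
    not scale on large data sets; real tools reduce the candidate pairs first.
    """
    links = []
    for s_uri, s_label in source.items():
        for t_uri, t_label in target.items():
            if similarity(s_label, t_label) >= threshold:
                links.append((s_uri, "owl:sameAs", t_uri))
    return links


# Hypothetical data sets: URI -> label
source = {
    "http://example.org/a/beatles": "The Beatles",
    "http://example.org/a/queen": "Queen",
}
target = {
    "http://example.org/b/The_Beatles": "the  beatles",
    "http://example.org/b/Madonna": "Madonna",
}
links = discover_links(source, target)
print(links)
```

Lowering the threshold finds more links but also more false matches, which is the name-ambiguity problem listed among the link discovery challenges.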
owl:sameAs & rdfs:seeAlso
• owl:sameAs
– Creates links between individuals
– States that two URIs refer to the same individual
• rdfs:seeAlso
– States that a resource may provide additional information about the subject resource
• Links in MusicBrainz:
– owl:sameAs is used for music artists
– rdfs:seeAlso is used for albums
SKOS
• Simple Knowledge Organization System
– http://www.w3.org/TR/skos-reference/
• Data model for knowledge organization systems (thesauri, classification schemes, taxonomies)
• SKOS data is expressed as RDF triples
• Allows the creation of RDF links between different data sets with the usage of mapping properties
SKOS: Mapping Properties
These properties are used to link SKOS concepts (particularly instances) in different schemes:
• skos:closeMatch: links two concepts that are sufficiently similar (sometimes can be used interchangeably)
• skos:exactMatch: indicates that the two concepts can be used interchangeably.
– Axiom: It is a transitive property
• skos:relatedMatch: states an associative mapping link between two concepts
SKOS: Mapping Properties (2)
Example of SKOS exact match
@prefix skos: <http://www.w3.org/2004/02/skos/core#> .
@prefix mo: <http://purl.org/ontology/mo/> .
@prefix dbpedia-ont: <http://dbpedia.org/ontology/> .
@prefix schema: <http://schema.org/> .
mo:MusicArtist skos:exactMatch dbpedia-ont:MusicalArtist .
mo:MusicGroup skos:exactMatch schema:MusicGroup .
mo:MusicGroup skos:exactMatch dbpedia-ont:Band .
SKOS: Mapping Properties (3)
Example of SKOS close match
@prefix skos: <http://www.w3.org/2004/02/skos/core#> .
@prefix mo: <http://purl.org/ontology/mo/> .
@prefix dbpedia-ont: <http://dbpedia.org/ontology/> .
@prefix schema: <http://schema.org/> .
mo:SignalGroup skos:closeMatch schema:MusicAlbum .
mo:SignalGroup skos:closeMatch dbpedia-ont:Album .
SKOS: Mapping Properties (4)
Integrity conditions
• Guarantee consistency and avoid contradictions in the relationships between SKOS concepts
[Figure: partial Mapping Relation diagram with integrity conditions. skos:closeMatch, skos:exactMatch and skos:relatedMatch are kinds of skos:mappingRelation; skos:exactMatch is symmetric and transitive; skos:closeMatch and skos:relatedMatch are symmetric; skos:exactMatch is disjoint with skos:relatedMatch.]
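The integrity conditions can be checked mechanically. The sketch below, in Python, computes the symmetric and transitive closure of a set of skos:exactMatch statements and reports any pair that is also linked by skos:relatedMatch, which the disjointness condition forbids. The exactMatch pairs come from the example slides; the relatedMatch statement is a hypothetical one added only to trigger a violation.

```python
def exact_match_closure(pairs):
    """Symmetric and transitive closure of skos:exactMatch pairs."""
    closure = set(pairs) | {(b, a) for a, b in pairs}
    changed = True
    while changed:
        changed = False
        for a, b in list(closure):
            for c, d in list(closure):
                # (a exactMatch b) and (b exactMatch d) imply (a exactMatch d)
                if b == c and a != d and (a, d) not in closure:
                    closure.add((a, d))
                    changed = True
    return closure


def disjointness_violations(exact_pairs, related_pairs):
    """Pairs linked by both exactMatch (after closure) and relatedMatch."""
    related = set(related_pairs) | {(b, a) for a, b in related_pairs}
    return exact_match_closure(exact_pairs) & related


exact = {
    ("mo:MusicGroup", "schema:MusicGroup"),
    ("mo:MusicGroup", "dbpedia-ont:Band"),
}
# Hypothetical statement that contradicts the inferred exact match.
related = {("dbpedia-ont:Band", "schema:MusicGroup")}

closure = exact_match_closure(exact)
violations = disjointness_violations(exact, related)
print(sorted(violations))
```

The violation is only visible after the closure is computed: schema:MusicGroup and dbpedia-ont:Band are never directly stated to be exact matches, but transitivity via mo:MusicGroup implies it.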
PUBLISHING LINKED DATA
Publishing Linked Data
Once the RDF data set has been created and interlinked, the publishing process involves the following tasks:
1. Metadata creation for describing the data set
2. Making the data set accessible
3. Exposing the data set in Linked Data repositories
4. Validating the data set
Describing RDF Data Sets
• Consists of providing (machine-readable) metadata of RDF data sets which can be processed by engines
• This information allows for:
– Efficient and effective search of data sets
– Selection of appropriate data sets (for consumption or interlinking)
– Retrieval of general statistics of the data sets
Describing RDF Data Sets (2)
• The common language for describing RDF data sets is VoID (Vocabulary of Interlinked Datasets)
• Describes an RDF data set as an instance of the class void:Dataset
• Covers 4 types of metadata:
– General metadata
– Structural metadata
– Descriptions of linksets
– Access metadata
VoID: General Metadata
• General metadata is used by users to identify appropriate data sets.
• Specifies information about the description of the data set, contact person/organization, the license of the data set, data subject and some technical features.
• VoID (re)uses predicates from the Dublin Core Metadata [1] and FOAF [2] vocabularies.
[1] http://dublincore.org/documents/2010/10/11/dcmi-terms/
[2] http://xmlns.com/foaf/spec/
VoID: General Metadata (2)
General Information
Contains information about the creation of the data set
Predicate | Range | Description
dcterms:title | Literal | Name of the data set.
dcterms:description | Literal | Description of the data set.
dcterms:source | RDF resource | Source from which the data set was derived.
dcterms:creator | RDF resource | Entity primarily responsible for creating the data set.
dcterms:date | xsd:date | Time associated with an event in the life-cycle of the resource.
dcterms:created | xsd:date | Date of creation of the data set.
dcterms:issued | xsd:date | Date of publication of the data set.
dcterms:modified | xsd:date | Date on which the data set was changed.
foaf:homepage | RDF resource | Homepage of the data set.
dcterms:publisher | RDF resource | Entity responsible for making the data set available.
dcterms:contributor | RDF resource | Entity responsible for making contributions to the data set.
Source: http://www.w3.org/TR/void/#metadata
VoID: General Metadata (3)
Other Information
• License of the data set: specifies the usage conditions of the data. The license can be indicated with the property dcterms:license
• Category of the data set: to specify the topics or domains covered by the data set, the property dcterms:subject can be used
• Technical features: the property void:feature can be used to express technical properties of the data (e.g. RDF serialization formats)
VoID: Structural Metadata
• Provides high-level information about the internal structure of the data set
• This metadata is useful when exploring or querying the data set
• Includes information about resources, vocabularies used in the data set, statistics and examples of resources in the data set
VoID: Structural Metadata (2)
Information about resources
• Example resources: allow users to get an impression of the kind of resources included in the data set. Examples can be shown with the property void:exampleResource
:MusicBrainz a void:Dataset ;
    void:exampleResource <http://musicbrainz.org/artist/b10bbbfc-cf9e-42e0-be17-e2c3e1d2600d> .
• Pattern for resource URIs: the void:uriSpace property can be used to state that all the entity URIs in a data set start with a given string
:MusicBrainz a void:Dataset ;
    void:uriSpace "http://musicbrainz.org/" .
VoID: Structural Metadata (3)
Vocabularies used in the data set
• The void:vocabulary property identifies the vocabulary or ontology that is used in a data set
• Typically, only the most relevant vocabularies are listed
• This property can only be used for entire vocabularies. It cannot be used to express that a subset of the vocabulary occurs in the data set.
:MusicBrainz a void:Dataset ;
    void:vocabulary <http://purl.org/ontology/mo/> .
VoID: Structural Metadata (4)
Statistics about a data set
Express numeric statistics about a data set:
Predicate | Range | Description
void:triples | Number | Total number of triples contained in the data set.
void:entities | Number | Total number of entities that are described in the data set. An entity must have a URI, and match the void:uriRegexPattern.
void:classes | Number | Total number of distinct classes in the data set.
void:properties | Number | Total number of distinct properties in the data set.
void:distinctSubjects | Number | Total number of distinct subjects in the data set.
void:distinctObjects | Number | Total number of distinct objects in the data set.
void:documents | Number | Total number of documents, in case the data set is published as a set of individual documents.
Source: http://www.w3.org/TR/void/#metadata
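Several of these statistics can be derived directly from the triples. The following Python sketch computes them for a tiny hand-written data set; the triples are invented for illustration, terms are kept as plain strings, and counting void:classes as the distinct objects of rdf:type is a simplifying assumption rather than the definition used by any particular tool.

```python
RDF_TYPE = "rdf:type"


def void_statistics(triples):
    """Compute a few VoID statistics from (subject, predicate, object) tuples."""
    triples = set(triples)  # a data set is a set of triples, duplicates do not count
    return {
        "void:triples": len(triples),
        "void:classes": len({o for s, p, o in triples if p == RDF_TYPE}),
        "void:properties": len({p for s, p, o in triples}),
        "void:distinctSubjects": len({s for s, p, o in triples}),
        "void:distinctObjects": len({o for s, p, o in triples}),
    }


# Hypothetical sample data
sample = [
    (":beatles", "rdf:type", "mo:MusicGroup"),
    (":beatles", "foaf:name", '"The Beatles"'),
    (":beatles", "mo:member", ":lennon"),
    (":queen", "rdf:type", "mo:MusicGroup"),
    (":queen", "foaf:name", '"Queen"'),
]
stats = void_statistics(sample)
for predicate, value in stats.items():
    print(predicate, value)
```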
VoID: Structural Metadata (5)
Partitioned data sets
• The void:subset property provides descriptions of parts of a data set
:MusicBrainz a void:Dataset ;
    void:subset :MusicBrainzArtists .
• Data sets can be partitioned based on classes or properties:
– void:classPartition contains only instances of a particular class
– void:propertyPartition contains only triples with a particular predicate
:MusicBrainz a void:Dataset ;
    void:classPartition [ void:class mo:Release ] ;
    void:propertyPartition [ void:property mo:member ] .
VoID: Describing Linksets
• Linkset: collection of RDF links between two RDF data sets
[Figure: data sets :DS1 and :DS2 connected by the linksets :LS1 and :LS2, with owl:sameAs as link predicate. Image based on http://semanticweb.org/wiki/File:Void-linkset-conceptual.png]
@prefix void: <http://rdfs.org/ns/void#> .
@prefix owl: <http://www.w3.org/2002/07/owl#> .
:DS1 a void:Dataset .
:DS2 a void:Dataset .
:DS1 void:subset :LS1 .
:LS1 a void:Linkset ;
    void:linkPredicate owl:sameAs ;
    void:target :DS1, :DS2 .
VoID: Describing Linksets (2)
Example
@prefix void: <http://rdfs.org/ns/void#> .
@prefix skos: <http://www.w3.org/2004/02/skos/core#> .
@prefix mo: <http://purl.org/ontology/mo/> .
:MusicBrainz a void:Dataset .
:DBpedia a void:Dataset .
:MusicBrainz void:classPartition :MBArtists .
:MBArtists void:class mo:MusicArtist .
:MBArtists a void:Linkset ;
    void:linkPredicate skos:exactMatch ;
    void:target :MusicBrainz, :DBpedia .
VoID: Access Metadata
The access metadata describes the methods of accessing the actual RDF data set
Method | Predicate | Description
URI look up endpoint | void:uriLookupEndpoint | Specifies the URI of a service for accessing the data set (different from the SPARQL protocol)
Root resource | void:rootResource | URI of the top concepts (only for data sets structured as trees)
SPARQL endpoint | void:sparqlEndpoint | Provides access to the data set via the SPARQL protocol.*
RDF data dumps | void:dataDump | Specifies the location of the dump file. If the data set is split into multiple files, then several values of this property are provided.
* This assumes that the default graph of the SPARQL endpoint contains the data set. VoID cannot express that a data set is contained in a specific named graph. This can be specified with SPARQL 1.1 Service Description.
Providing Access to the Data Set
The data set can be accessed via different mechanisms:
• Dereferencing HTTP URIs
• RDFa
• RDF dump
• SPARQL endpoint
Dereferencing HTTP URIs
• Allows for easily exploring certain resources contained in the data set
• What to return for a URI?
– Immediate description: triples where the URI is the subject.
– Backlinks: triples where the URI is the object.
– Related descriptions: information of interest in typical usage scenarios.
– Metadata: information such as author and licensing.
– Syntax: RDF descriptions as RDF/XML and human-readable formats.
• Applications (e.g. LD browsers) render the retrieved information so it can be perceived by a user.
Source: How to Publish Linked Data on The Web - Chris Bizer, Richard Cyganiak, Tom Heath.
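Whether a client gets the RDF description or the human-readable page is usually decided by HTTP content negotiation. As a minimal sketch, the Python below builds (but does not send) a lookup request that asks for RDF/XML or HTML via the Accept header. The function name is hypothetical and the DBpedia resource URI is an assumption used only as an example; how a server answers is up to its configuration.

```python
from urllib.request import Request


def build_dereference_request(uri: str, want_rdf: bool = True) -> Request:
    """Prepare an HTTP GET for a resource URI with the desired media type."""
    accept = "application/rdf+xml" if want_rdf else "text/html"
    return Request(uri, headers={"Accept": accept})


rdf_request = build_dereference_request("http://dbpedia.org/resource/The_Beatles")
html_request = build_dereference_request(
    "http://dbpedia.org/resource/The_Beatles", want_rdf=False
)
print(rdf_request.get_method(), rdf_request.full_url, rdf_request.get_header("Accept"))
print(html_request.get_method(), html_request.full_url, html_request.get_header("Accept"))
```

Passing either request to urllib.request.urlopen would perform the actual lookup; it is left out here so the sketch runs without network access.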
Dereferencing HTTP URIs (2)
Example: Dereferencing
RDFa
• RDFa = “RDF in attributes”
• Extension to HTML5 for embedding RDF within HTML pages:
– The HTML is processed by the browser; the (human) consumer does not see the RDF data
– The RDF triples within the page are consumed by APIs to extract the (semi-)structured data
• It is considered the bridge between the Web of Data and the Web of Documents
• It is a complete serialization of RDF
RDFa: Attributes
Attribute role | Attribute | Description
Syntax | prefix | List of prefix-name/IRI pairs
Syntax | vocab | IRI that specifies the vocabulary where the concept is defined
Subject | about | Specifies the subject of the relationship
Predicate | property | Expresses the relationship between the subject and the value
Predicate | rel | Defines a relation between the subject and a URL
Predicate | rev | Expresses reverse relationships between two resources
Resource | href | Specifies an object URI for the rel and rev attributes
Resource | resource | Same as href (used when href is not present)
Resource | src | Specifies the subject of a relationship
Literal | datatype | Expresses the datatype of the object of the property attribute
Literal | content | Supplies machine-readable content for a literal
Literal | xml:lang, lang | Specifies the language of the literal
Macro | typeof | Indicates the RDF type(s) to associate with a subject
Macro | inlist | An object is added to the list of a predicate
RDFa: Example
Extracting RDF from HTML
HTML (+RDFa):
<div class="artistheader" about="http://musicbrainz.org/artist/b10bbbfc-cf9e-42e0-be17-e2c3e1d2600d#_" typeof="http://purl.org/ontology/mo/MusicGroup"> … </div>
RDF:
<http://musicbrainz.org/artist/b10bbbfc-cf9e-42e0-be17-e2c3e1d2600d#_> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://purl.org/ontology/mo/MusicGroup> .
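The step from the about and typeof attributes to an rdf:type triple can be sketched with Python's standard HTML parser. This is only an illustration of that one mapping, not an RDFa processor: it assumes both attributes hold full IRIs on the same element and ignores prefix, vocab, property, nesting and everything else the RDFa specification defines. The class name is hypothetical.

```python
from html.parser import HTMLParser

RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"


class TypeofExtractor(HTMLParser):
    """Collect (subject, rdf:type, class) triples from about/typeof pairs."""

    def __init__(self):
        super().__init__()
        self.triples = []

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        subject = attributes.get("about")
        types = attributes.get("typeof")
        if subject and types:
            # typeof may list several types separated by whitespace
            for rdf_class in types.split():
                self.triples.append((subject, RDF_TYPE, rdf_class))


html = (
    '<div class="artistheader" '
    'about="http://musicbrainz.org/artist/b10bbbfc-cf9e-42e0-be17-e2c3e1d2600d#_" '
    'typeof="http://purl.org/ontology/mo/MusicGroup"> … </div>'
)
parser = TypeofExtractor()
parser.feed(html)
triples = parser.triples
for s, p, o in triples:
    print(f"<{s}> <{p}> <{o}> .")
```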
RDFa: Example (2)
Extracting RDF from MusicBrainz.org
http://musicbrainz.org/artist/b10bbbfc-cf9e-42e0-be17-e2c3e1d2600d
http://www.w3.org/2007/08/pyRdfa/extract?uri=http%3A%2F%2Fmusicbrainz.org%2Fartist%2Fb10bbbfc-cf9e-42e0-be17-e2c3e1d2600d&format=nt
Source: http://www.w3.org/2007/08/pyRdfa/
Watch the EUCLID screencast: http://vimeo.com/euclidproject
RDF Dump
• An RDF dump refers to a file which contains (part of) a data set specified in an RDF format (RDF/XML, N-Triples, N-Quads)
• The data set can be split into several RDF dumps
• A list of data sets available as RDF dumps can be found at:
– http://www.w3.org/wiki/DataSetRDFDumps
SPARQL Endpoint
• The SPARQL endpoint refers to the URI of the listener of the SPARQL protocol service, which handles requests for SPARQL protocol operations
• The user submits SPARQL queries to the SPARQL endpoint in order to retrieve only a desired subset of the RDF data set
• List of available SPARQL endpoints:
– http://www.w3.org/wiki/SparqlEndpoints
– http://labs.mondeca.com/sparqlEndpointsStatus/
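In the SPARQL protocol, a query can be sent as an HTTP GET request with the query text in the query parameter. The Python sketch below only builds such a request URL; the endpoint address http://example.org/sparql and the helper name are placeholders, and the query itself is an illustrative one over the Music Ontology class used earlier.

```python
from urllib.parse import urlencode, urlsplit, parse_qs


def sparql_get_url(endpoint: str, query: str) -> str:
    """Build the GET URL for a SPARQL query (URL-encoded 'query' parameter)."""
    return endpoint + "?" + urlencode({"query": query})


query = (
    "SELECT ?group WHERE { "
    "?group a <http://purl.org/ontology/mo/MusicGroup> "
    "} LIMIT 10"
)
url = sparql_get_url("http://example.org/sparql", query)
print(url)

# Decoding the URL again gives back the original query text.
decoded = parse_qs(urlsplit(url).query)["query"][0]
```

The LIMIT clause reflects the point made above: the client retrieves only the subset it asks for rather than the whole data set.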
Using Linked Data Catalogs
• Data catalogs, markets or repositories are platforms dedicated to providing access to a wide range of data sets from different domains
• Allow data consumers to easily find and use the data
• Usually the catalogs offer relevant metadata about the creation of the data set
Using Linked Data Catalogs (2)
How to publish an RDF data set into a catalog?
Create your own data catalog
• Recommended for big organizations/institutions aiming at providing a large number of data sets
• Use a data management system, for example: CKAN
Upload your data set into an existing catalog
• Allows data consumers to easily find new data sets
• Common LD catalogs are:
– The Data Hub
– The Linking Open Data Cloud
Validating Data Sets
There are different ways to validate the published RDF data set:
Accessibility
• Vapour - Performs two types of tests: without content negotiation and requesting RDF/XML content. http://validator.linkeddata.org/vapour
• URI Debugger - Retrieves the HTTP responses of accessing a URI. http://linkeddata.informatik.hu-berlin.de/uridbg/
Parsing & Syntax
• RDF Triple-Checker - Dereferences namespaces associated with the resources used in the document. http://graphite.ecs.soton.ac.uk/checker/
• W3C RDF/XML Validation Service - Evaluates the syntax of RDF/XML documents and displays the RDF triples in it. http://www.w3.org/RDF/Validator/
• W3C Markup Validation Service - Checks syntactic correctness for web documents with RDFa markup. http://validator.w3.org/
General validators
• RDF:ALERTS - Validates syntax, undefined resources, datatype and other types of errors. http://swse.deri.org/RDFAlerts/
Validating Data Sets (2)
Example: Validating URIs with Vapour
• HTML content: http://dbpedia.org/page/The_Beatles
• RDF document: http://dbpedia.org/data/The_Beatles.xml
Source: http://idi.fundacionctic.org/vapour
PROVIDING LINKED DATA: CHECKLIST
Providing Linked Data: Checklist (1)
Creating Linked Data
o Were all the relevant entities/concepts effectively extracted from the raw data?
o Are all the created URIs dereferenceable?
o Are you reusing terms from widely accepted vocabularies?
Providing Linked Data: Checklist (2)
Interlinking Linked Data
o Is the data set linked to other RDF data sets?
o Are the created vocabulary terms linked to other vocabularies?
Providing Linked Data: Checklist (3)
Publishing Linked Data
o Do you provide data set metadata?
o Do you provide information about licensing?
o Do you provide additional access methods?
o Is the data set available in LD catalogs?
o Did the data set pass the validation tests?
Summary
In this chapter we studied:
• The Linked Data lifecycle:
– 3 core tasks: creating, interlinking and publishing
• Creation of Linked Data:
– Extracting relevant data, using URIs to name entities, selecting vocabularies and expressing the data using the RDF data model
• Interlinking Linked Data:
– Challenges of link discovery, using Silk to create links between two data sets and using SKOS links
• Publishing Linked Data:
– Creation of data set metadata; publishing the data set via RDF dumps, SPARQL endpoints or RDFa; using RDFa and schema.org to enrich search results, and uploading the data set to an LD catalog
The Web & Linked Data
• Linked Data catalogs
• Applications
CKAN
• CKAN is an open source platform for developing data set catalogs
• Implements useful tools for data publishers to support:
– Data harvesting
– Creation of metadata
– Access mechanisms to the data set
– Updating the data set
– Monitoring the access to the data set
CKAN (2)
Source: http://ckan.org
CKAN (3)
Source: http://ckan.org
The Data Hub
• The Data Hub is a community-run data catalog which contains more than 5,000 data sets [1]
• “(…) is an openly editable open data catalogue, in the style of Wikipedia”. [2]
• It is implemented on top of the CKAN platform
• Allows the creation of groups:
– The Linking Open Data Cloud group exclusively contains Linked Data sets
[1] According to the information presented in the portal in March 2013
[2] Source: http://datahub.io/about
The Data Hub (2)
Source: http://datahub.io/
The Data Hub (3)
Source: http://datahub.io/
The Linking Open Data Cloud
[Figure: Linking Open Data cloud diagram, September 2011]
Source: Linking Open Data cloud diagram, by Richard Cyganiak and Anja Jentzsch
The Linking Open Data Cloud
How to publish an RDF data set in this cloud?
1. The data set must follow the Linked Data principles
2. The data set must contain at least 1,000 RDF triples
3. The data set must contain at least 50 RDF links to a data set that is already in the diagram
4. Access to the data set must be provided
Once these criteria are met, the data publisher must add the data set to the Data Hub catalog, and contact the administrators of the Linking Open Data Cloud group
Source: http://lod-cloud.net/
Linked Data & Search Engines
• Search engines collect information about web resources in order to produce richer search results by improving the display of the results
• This is only possible if the search engines are able to understand the content within the web pages
• The HTML pages must be annotated with machine-readable content to describe their content: a markup format and a vocabulary
RDFa for marking up data
• RDFa is used to provide (semi-)structured Linked Data embedded in web content
• Examples:
– Some search engines use RDFa, e.g., Google, Yahoo! and Bing
– Facebook’s Open Graph is based on RDFa
Google Rich Snippets
• Embedding semantics via RDFa (or microformats/microdata) enhances search results:
Google Rich Snippets (2)
Schema.org
• Collection of schemas/vocabularies to mark up HTML pages
• It is recognized by Bing, Google, Yahoo! and Yandex
• Covers a wide range of knowledge domains
• It also offers an extension mechanism in case the publisher is interested in adding new concepts to the vocabularies
Schema.org (2)
The vocabularies cover the following topics:
Source: http://schema.org/docs/schemas.html
Schema.org (3)
“The world is too rich, complex and interesting for a single schema to describe fully on its own. With schema.org we aim to find a balance, by providing a core schema that covers lots of situations, alongside extension mechanisms for extra detail.” (Dan Brickley, schema.org)
Integrates (/aligns) existing vocabularies where appropriate, e.g. rNews
Source: http://schema.org/Article
Google Knowledge Graph
• The user is able to find answers to their queries without browsing pages
• Provides detailed information
Google Knowledge Graph (2)
• Google Search results include structured data from Freebase
• Might disambiguate search terms
Freebase
• Knowledge base of structured data
• Data is stored as a graph
• Describes data from different domains
Bing Snapshot
• Provides structured data related to the search term
• Includes a significant number of entities from more domains
• Connects data from LinkedIn
• It is powered by the graph engine Trinity.RDF
Bing Snapshot (2)
Open Graph Protocol
• It was originally created by Facebook
• Allows describing web content as graph objects, establishing connections between people and objects
• The descriptions are embedded in the web page as RDFa data
• Supports description of several domains: basic metadata, music, video, articles, books, websites and user profiles
Source: http://ogp.me/
Open Graph Protocol (2)
Who is using Open Graph protocol?
Consumers: Google, Mixi
Publishers: IMDb, Microsoft, NHL, Posterous, Rotten Tomatoes, TIME
Source: http://ogp.me/
Open Graph Protocol (3)
• Facebook expands the vocabulary of relationships beyond “friendship” and “like”: more actions!
Source: https://developers.facebook.com/docs/opengraph/
Open Graph Protocol & Facebook
List of domains and actions
General: Like, Recommend, Follow
Music: Listen, Create a playlist
Movies & TV: Watch, Rate, Wants to watch
Games: Achieve, High score
Fitness: Bike, Run, Walk
Book: Rate, Read, Quote, Wants to read
Source: https://developers.facebook.com/docs/opengraph/
How can we exploit these links and relationships?
Facebook Graph Search
• Focuses on people and their interests, exploiting how everything is related to each other
• Queries are specified using natural language
• Takes advantage of context and suggests possible queries
• Allows for building more complex (expressive) queries that are not possible with normal search:
– For example, “music liked by me and friends who live in my city”
Facebook Graph Search (2)
Context (information from profile):
Graph search suggestions:
Facebook Graph Search (3)
Results
Facebook Graph Search (4)
Observations
• Allows for conjunctive queries (applying filter over intermediate results = “apply operator”)
• Disjunctive queries are not supported:
– For example: “My friends who like SemanticWeb.com OR ReadWrite”
• Post search is not supported
– It is not possible to search in post content submitted to the timeline
• User privacy settings affect the results
Tools for providing Linked Data
• Extracting data from spreadsheets: OpenRefine
• Extracting data from RDBMS: R2RML
• Extracting data from text: Zemanta, OpenCalais, GATE
• Interlinking data sets: Silk
EXTRACTING DATA FROM SPREADSHEETS WITH OPENREFINE
Integrate Chart Data
• Task: Integrate latest chart information into your RDF database.
• Data may be available in non-RDF formats:
– Plain text
– CSV, TSV, separator-based files
– HTML tables
– Spreadsheets (OpenDocument, Excel, …)
– XML
– JSON
– …
[Figure: architecture diagram highlighting data acquisition. Labels: CSV/TSV, HTML, Spreadsheets, JSON; Data acquisition; Integrated Data Set (Interlinking, Cleansing, Vocabulary Mapping); SPARQL Endpoint; Publishing; Access; LD Data set.]
Example Data
The Beatles, 250 million Elvis Presley, 203.3 million Michael Jackson, 157.4 million Madonna, 160.1 million Led Zeppelin, 135.5 million
Queen, 90.5 million
98
hjp://en.wikipedia.org/wiki/ List_of_best-‐selling_music_ar3sts
Artist | Country of origin | Period active | Release year of first charted record | Total certified units (from available markets)
The Beatles | United Kingdom | 1960–1970 | 1962 | 250 million
Elvis Presley | United States | 1954–1977 | 1954 | 203.3 million
Michael Jackson | United States | 1964–2009 | 1971 | 157.4 million
Madonna | United States | 1979–present | 1982 | 160.1 million
Led Zeppelin | United Kingdom | 1968–1980 | 1969 | 135.5 million
Queen | United Kingdom | 1971–present | 1973 | 90.5 million
{
  "artist": {
    "class": "artist",
    "name": "The Beatles"
  },
  "rank": 1,
  "value": "250 million"
}, …
CSV
JSON
HTML tables
OpenRefine
• transforms and cleans messy input data sets.
• is an open-source successor of Google Refine.
• allows for entity reconciliation against SPARQL endpoints or RDF data.
• is extended with plugins that enhance its functionality, e.g. for RDF support.
Quick Facts
Use of OpenRefine
1. Messy input data is imported, transformed into a table representation and cleaned.
2. Entity reconciliation is applied to allow for interlinking with existing data sets.
3. The structure of the RDF output is defined.
4. The data is exported into some RDF syntax.
The Beatles, 250 million
Elvis Presley, 203.3 million
Michael Jackson, 157.4 million
Madonna, 160.1 million
Led Zeppelin, 135.5 million
Queen, 90.5 million
CSV
musicbrainz:b10bbbfc-cf9e-42e0-be17-e2c3e1d2600d :totalSales "250000000"^^xsd:int .
musicbrainz:01809552-4f87-45b0-afff-2c6f0730a3be :totalSales "203300000"^^xsd:int .
musicbrainz:f27ec8db-af05-4f36-916e-3d57f91ecf5e :totalSales "157400000"^^xsd:int .
musicbrainz:79239441-bfd5-4981-a70c-55c3f15c1287 :totalSales "160100000"^^xsd:int .
musicbrainz:678d88b2-87b0-403b-b63d-5da7465aecc3 :totalSales "135500000"^^xsd:int .
musicbrainz:0383dadf-2a4e-4d10-a46a-e9e041da8eb3 :totalSales "90500000"^^xsd:int .
RDF
Typical steps:
• Group and explore data items
• Delete columns or rows based on a filter condition
• Split columns into several columns based on a condition
• Modify messy data items with GREL, a powerful expression language
• Replay steps from a previous Refine project
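In OpenRefine these steps are carried out interactively and with GREL expressions. As a rough illustration of what the cleaning amounts to for the chart data, here is a sketch in plain Python rather than GREL; the helper names `parse_units` and `clean_rows` are hypothetical, and the multiplier table is an assumption about which unit words may occur.

```python
import csv
import io

# The example chart data from the slides, as separator-based text.
RAW = """\
The Beatles, 250 million
Elvis Presley, 203.3 million
Michael Jackson, 157.4 million
Madonna, 160.1 million
Led Zeppelin, 135.5 million
Queen, 90.5 million
"""

# Assumed unit words; extend as needed for other inputs.
MULTIPLIERS = {"million": 1_000_000, "billion": 1_000_000_000}


def parse_units(text):
    """Turn a messy value such as '203.3 million' into an integer."""
    number, unit = text.strip().split()
    return round(float(number) * MULTIPLIERS[unit.lower()])


def clean_rows(raw):
    """Split each line into (artist, units) and normalise both cells."""
    rows = []
    for name, sales in csv.reader(io.StringIO(raw), skipinitialspace=True):
        rows.append((name.strip(), parse_units(sales)))
    return rows


rows = clean_rows(RAW)
for name, units in rows:
    print(name, units)
```

The result is a clean two-column table, which is the shape reconciliation and RDF export work on.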
Data Transformation
How to Generate RDF?
• Additional problem: the data needs to be interlinked with existing MusicBrainz data
• This is the point where plugins come into play:
– RDF Refine: developed by DERI
– An extension of OpenRefine to support RDF
Core Capabilities
• Interlinking of data by entity reconciliation
– Against SPARQL endpoints, RDF dumps
– Discovery of relevant RDF data sets
• RDF export with the help of RDF skeletons
– Define the vocabulary and graph structure of the RDF serialization
– In Turtle, RDF/XML
Typical steps:
• Define a reconciliation service
• Select specific types to reconcile against
• Start reconciling a column against the service
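Conceptually, reconciling a column means matching each cell against candidate labels from the service and keeping the best match if it is good enough. The sketch below imitates that with a small in-memory candidate list instead of a real reconciliation service; the `reconcile` function, the 0.85 threshold, and the use of `difflib` as the similarity measure are all assumptions made for illustration, not what RDF Refine does internally.

```python
from difflib import SequenceMatcher

# Stand-in for the labels a reconciliation service would return.
# The identifiers are the MusicBrainz ones used in the RDF example.
CANDIDATES = {
    "The Beatles": "musicbrainz:b10bbbfc-cf9e-42e0-be17-e2c3e1d2600d",
    "Led Zeppelin": "musicbrainz:678d88b2-87b0-403b-b63d-5da7465aecc3",
    "Queen": "musicbrainz:0383dadf-2a4e-4d10-a46a-e9e041da8eb3",
}


def normalise(label):
    """Lower-case and collapse whitespace before comparing."""
    return " ".join(label.lower().split())


def reconcile(cell, candidates, threshold=0.85):
    """Return (identifier, score) for the best candidate, or (None, score)."""
    best_id, best_score = None, 0.0
    for label, identifier in candidates.items():
        score = SequenceMatcher(None, normalise(cell), normalise(label)).ratio()
        if score > best_score:
            best_id, best_score = identifier, score
    if best_score >= threshold:
        return best_id, best_score
    return None, best_score


for cell in ["the beatles ", "Qeen", "Nirvana"]:
    print(cell, "->", reconcile(cell, CANDIDATES)[0])
```

Cells that fall below the threshold stay unmatched, which mirrors the manual review step a user would do in the tool.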
Entity Reconciliation
Define RDF Skeletons
• An RDF skeleton defines the structure of the RDF triples that are exported
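One way to think about a skeleton is as a triple pattern with placeholders that is filled in once per table row. The following sketch applies such a pattern to reconciled rows and emits Turtle-style lines; the `SKELETON` string and `export_turtle` helper are hypothetical, and the `musicbrainz:` and `:` prefixes are assumed to be declared elsewhere.

```python
# Triple pattern with one placeholder per table column (hypothetical format).
SKELETON = '{subject} :totalSales "{units}"^^xsd:int .'

XSD_INT_MAX = 2_147_483_647  # largest value allowed by xsd:int

# Reconciled rows: (MusicBrainz identifier, cleaned integer value).
ROWS = [
    ("musicbrainz:b10bbbfc-cf9e-42e0-be17-e2c3e1d2600d", 250_000_000),
    ("musicbrainz:01809552-4f87-45b0-afff-2c6f0730a3be", 203_300_000),
    ("musicbrainz:0383dadf-2a4e-4d10-a46a-e9e041da8eb3", 90_500_000),
]


def export_turtle(rows):
    """Fill the skeleton once per row and return the resulting triples."""
    lines = []
    for subject, units in rows:
        if not 0 <= units <= XSD_INT_MAX:
            raise ValueError(f"{units} does not fit into xsd:int")
        lines.append(SKELETON.format(subject=subject, units=units))
    return "\n".join(lines)


turtle = export_turtle(ROWS)
print(turtle)
```

The range check is there because xsd:int is a 32-bit type, so large totals would need xsd:integer or xsd:long instead.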
RDF Skeletons
EXTRACTING DATA FROM RDBMS WITH R2RML
W3C RDB2RDF
• Task: Integrate data from relational DBMS with Linked Data
• Approach: Map from the relational schema to a semantic vocabulary with R2RML
• Publishing: two alternatives
– Translate SPARQL into SQL on the fly
– Batch transform data into RDF, index it, and provide SPARQL access in a triplestore
Figure: Linked Data lifecycle for relational data (relational DBMS; R2RML engine; data acquisition; integrated data in triplestore with interlinking, cleansing, vocabulary mapping; publishing; access to the LD data set via SPARQL endpoint)
W3C RDB2RDF
• In 2012, the W3C published two recommendations for mapping between relational databases and RDF:
– Direct Mapping directly exposes data as RDF
  • No allowance for vocabulary mapping
  • No allowance for interlinking (unless URIs are used in the relational data)
  • Not appropriate for this topic
– R2RML, the RDB to RDF mapping language
  • Allows vocabulary mapping (subject, predicate and object maps with class options)
  • Allows interlinking: URIs can be constructed
  • A means to provide MusicBrainz RDF/SPARQL itself
http://www.w3.org/2001/sw/rdb2rdf/
MusicBrainz Next Gen Schema
• Artist: as pre-NGS, but with further attributes
• Artist Credit: allows joint credit
• Release Group: cf. ‘album’, versus:
• Release
• Medium
• Track
• Track List
• Work
• Recording
Source: https://wiki.musicbrainz.org/Next_Generation_Schema
Music Ontology
• OWL ontology with the following core concepts (classes) and relationships (properties):
Source: http://musicontology.com
R2RML Class Mapping
• Mapping tables to classes is ‘easy’:
lb:Artist a rr:TriplesMap ;
  rr:logicalTable [rr:tableName "artist"] ;
  rr:subjectMap [rr:class mo:MusicArtist ;
                 rr:template "http://musicbrainz.org/artist/{gid}#_"] ;
  rr:predicateObjectMap [rr:predicate mo:musicbrainz_guid ;
                         rr:objectMap [rr:column "gid" ; rr:datatype xsd:string]] .
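To see what this mapping produces, it can help to apply it by hand to one row of the artist table. The sketch below does that in Python: it expands the rr:template with the row's gid and emits the class triple and the mo:musicbrainz_guid triple. It is an illustration of the effect of the mapping, not an R2RML engine; `expand_template` and `map_artist_row` are hypothetical helpers, and the percent-encoding of column values is a simplified reading of the R2RML template rules.

```python
import re
from urllib.parse import quote

TEMPLATE = "http://musicbrainz.org/artist/{gid}#_"


def expand_template(template, row):
    """Replace each {column} with the row's value, percent-encoded."""
    def substitute(match):
        return quote(str(row[match.group(1)]), safe="")
    return re.sub(r"\{(\w+)\}", substitute, template)


def map_artist_row(row):
    """Emit the triples the class mapping yields for one artist row."""
    subject = "<" + expand_template(TEMPLATE, row) + ">"
    gid = row["gid"]
    return [
        subject + " a mo:MusicArtist .",
        subject + ' mo:musicbrainz_guid "' + gid + '"^^xsd:string .',
    ]


# One row of the artist table (only the column the mapping uses).
row = {"gid": "b10bbbfc-cf9e-42e0-be17-e2c3e1d2600d"}
triples = map_artist_row(row)
for triple in triples:
    print(triple)
```

Because the subject IRI is built from the gid, every other triples map that uses the same template attaches its triples to the same resource.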
R2RML Property Mapping
• Mapping columns to properties can be easy:
lb:artist_name a rr:TriplesMap ;
  rr:logicalTable [rr:sqlQuery
    """SELECT artist.gid, artist_name.name
       FROM artist
       INNER JOIN artist_name ON artist.name = artist_name.id"""] ;
  rr:subjectMap lb:sm_artist ;
  rr:predicateObjectMap [rr:predicate foaf:name ;
                         rr:objectMap [rr:column "name"]] .
NGS Advanced Relations
• Major entities (Artist, Release Group, Track, etc.) plus URL are paired (e.g. l_artist_artist)
• Each pairing of instances refers to a Link
• Links have types (cf. RDF properties) and attributes
Source: http://wiki.musicbrainz.org/Advanced_Relationship
R2RML Advanced Mapping
• Mapping advanced relationships (SQL joins):
lb:artist_member a rr:TriplesMap ;
  rr:logicalTable [rr:sqlQuery
    """SELECT a1.gid, a2.gid AS band
       FROM artist a1
       INNER JOIN l_artist_artist ON a1.id = l_artist_artist.entity0
       INNER JOIN link ON l_artist_artist.link = link.id
       INNER JOIN link_type ON link.link_type = link_type.id
       INNER JOIN artist a2 ON l_artist_artist.entity1 = a2.id
       WHERE link_type.gid='5be4c609-9afa-4ea0-910b-12ffb71e3821'
       AND link.ended=FALSE"""] ;
  rr:subjectMap lb:sm_artist ;
  rr:predicateObjectMap [rr:predicate mo:member_of ;
                         rr:objectMap [rr:template "http://musicbrainz.org/artist/{band}#_" ;
                                       rr:termType rr:IRI]] .
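The join in this mapping can be tried out on a toy database. The sketch below builds a heavily reduced version of the four tables in an in-memory SQLite database, runs the same join, and turns each result row into a mo:member_of triple. The table columns are cut down to what the query needs, the artist gids are placeholders rather than real MusicBrainz identifiers, and `ended` is stored as 0/1 because SQLite is used here instead of PostgreSQL.

```python
import sqlite3

# Link type gid taken from the mapping's WHERE clause.
MEMBER_LINK_TYPE_GID = "5be4c609-9afa-4ea0-910b-12ffb71e3821"

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE artist (id INTEGER PRIMARY KEY, gid TEXT);
CREATE TABLE link_type (id INTEGER PRIMARY KEY, gid TEXT);
CREATE TABLE link (id INTEGER PRIMARY KEY, link_type INTEGER, ended INTEGER);
CREATE TABLE l_artist_artist (id INTEGER PRIMARY KEY, link INTEGER,
                              entity0 INTEGER, entity1 INTEGER);
-- Placeholder gids, not real MusicBrainz identifiers.
INSERT INTO artist VALUES (1, 'gid-member'), (2, 'gid-band'), (3, 'gid-former-member');
-- Link 100 is a current membership, link 101 has ended.
INSERT INTO link VALUES (100, 10, 0), (101, 10, 1);
INSERT INTO l_artist_artist VALUES (1000, 100, 1, 2), (1001, 101, 3, 2);
""")
conn.execute("INSERT INTO link_type VALUES (?, ?)", (10, MEMBER_LINK_TYPE_GID))

QUERY = """
SELECT a1.gid, a2.gid AS band
FROM artist a1
INNER JOIN l_artist_artist ON a1.id = l_artist_artist.entity0
INNER JOIN link ON l_artist_artist.link = link.id
INNER JOIN link_type ON link.link_type = link_type.id
INNER JOIN artist a2 ON l_artist_artist.entity1 = a2.id
WHERE link_type.gid = ? AND link.ended = 0
"""

rows = conn.execute(QUERY, (MEMBER_LINK_TYPE_GID,)).fetchall()

TEMPLATE = "<http://musicbrainz.org/artist/{}#_>"
triples = [
    TEMPLATE.format(gid) + " mo:member_of " + TEMPLATE.format(band) + " ."
    for gid, band in rows
]
for triple in triples:
    print(triple)
```

Only the current membership survives the `ended` filter, so the former member produces no triple.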
EXTRACTING DATA FROM TEXT
OpenCalais
• Not easily customised/extended
• Domain-specific coverage varies
Source: http://viewer.opencalais.com/
DBpedia Spotlight
• Not easily customised/extended
• Currently only available for English
Source: http://dbpedia-spotlight.github.com/demo/
http://dbpedia.org/page/Slowcore
http://dbpedia.org/page/Dorothy_Parker
Zemanta
Source: http://www.zemanta.com/demo/
• Common problem with general-purpose, open-domain semantic annotation tools
• Best results require bespoke customisation
GATE
• General Architecture for Text Engineering
• Free open-source (LGPL) framework and development environment
• Started in 1996, large developer community
• Used worldwide by many organisations to build bespoke solutions, e.g. Press Association and the National Archive
• Information extraction in many languages
http://www.gate.ac.uk/
GATE Example - LODIE
• Increases recall over DBpedia by deriving new lexicalisations for URIs from link anchor texts, disambiguation pages, and redirect pages
Precision and Recall
• Generic services typically have very low recall
• Combining services is one solution
• The other solution is custom extraction
(precision / recall) | PER | LOC | ORG | TOTAL
DB Spotlight | 0.97 / 0.40 | 0.82 / 0.46 | 0.86 / 0.31 | 0.85 / 0.39
Zemanta | 0.96 / 0.84 | 0.89 / 0.62 | 0.82 / 0.57 | 0.90 / 0.68
LODIE | 0.81 / 0.82 | 0.73 / 0.76 | 0.56 / 0.59 | 0.71 / 0.74
Zemanta ∩ LODIE | 1.00 / 0.74 | 0.95 / 0.45 | 0.97 / 0.42 | 0.97 / 0.54
Zemanta ∪ LODIE | 0.94 / 0.93 | 0.77 / 0.76 | 0.72 / 0.71 | 0.82 / 0.81
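To compare the rows on a single number, precision and recall can be combined into the F1 score, F1 = 2PR / (P + R). The sketch below computes it for the TOTAL column, assuming each pair in the table is given as precision / recall; the F1 values are derived here and do not appear on the slides.

```python
# TOTAL column of the table, read as (precision, recall) pairs.
TOTALS = {
    "DB Spotlight": (0.85, 0.39),
    "Zemanta": (0.90, 0.68),
    "LODIE": (0.71, 0.74),
    "Zemanta ∩ LODIE": (0.97, 0.54),
    "Zemanta ∪ LODIE": (0.82, 0.81),
}


def f1(precision, recall):
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


scores = {name: f1(p, r) for name, (p, r) in TOTALS.items()}
best = max(scores, key=scores.get)

for name, score in sorted(scores.items(), key=lambda item: -item[1]):
    print(f"{name}: {score:.2f}")
```

Under this reading, intersecting the two services trades recall for precision, while the union trades a little precision for the highest recall.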
Custom GATE Gazetteer
• Retrieve MusicBrainz entity/label/class with a SPARQL query
GATECloud
• Custom GATE pipelines (e.g. based around a custom gazetteer) can be executed on the cloud:
INTERLINKING DATA SETS WITH SILK
Interlinking with Silk
• Task: Create links between the data set and external Linked Data sources.
• Approach: Creation of specified links by querying the target data sets
• Alternatives:
– Manual creation of linkage rules by the user
– Automatic learning of linkage rules by submitting predefined SPARQL queries
Figure: Linked Data lifecycle (data acquisition from CSV/TSV, HTML, spreadsheets, JSON; integrated data set with interlinking, cleansing, vocabulary mapping; publishing; access to the LD data set via SPARQL endpoint)
Link Discovery with Silk
• Open-source tool for discovering RDF links between data items within different Linked Data sources
• It is based on the Silk Link Specification Language (Silk-LSL) for expressing linkage rules
• It accesses the target RDF data sets via SPARQL endpoints to generate RDF links
Source: Robert Isele. “LOD2 Webinar Series: Silk”
Silk Variants
• Silk Single Machine
– Generates RDF links on a single machine
– Data sets can reside either locally or on remote machines
– Provides multithreading and caching
• Silk MapReduce
– Uses a cluster composed of multiple machines
– Based on Hadoop and designed to scale to big data sets
• Silk Server
– Used within applications that consume Linked Data from the Web while keeping track of known entities
– Provides an HTTP API for matching entities from an incoming stream
Source: http://wifo5-03.informatik.uni-mannheim.de/bizer/silk/
Source: Silk workflow is partially based on “LOD2 Webinar Series: Silk - (Simplified) Linking Workflow” by Robert Isele.
Silk Workflow
Select LD data sets
• Identify suitable data sets in LD catalogs*
• Select the two data sets to link
Specify LD data sets
• Specify the access method to the data set (RDF dump, SPARQL endpoint)*
• Specify the entity types to be linked
Write linkage rule
• Specifies how to compare the resources
• Use Silk-LSL
• The rules can also be learnt
Generate RDF links
• Output links can be stored in a file or a triple store
• Can discover SKOS links
Silk framework
* See section “Publishing Linked Data”
Linkage Rule Components
• Linkage rules define the conditions to create the links between the data sets. These rules are composed of:
RDF Paths
• Describe the elements to be compared
• Example: ?a/rdfs:label
Transformations
• Apply transformations to the result set of an RDF path
• Examples: LowerCase, Concatenate, Replace, …
Comparators
• Compute the similarity of two inputs
• Examples: String similarity metrics, Date similarity, …
Aggregations
• Compute an aggregated value from multiple comparators
• Examples: Min, Max, Avg, various means, Euclidean distance, …
Source: http://wifo5-03.informatik.uni-mannheim.de/bizer/silk/
Silk Workbench
• Web application built on top of Silk, which allows the creation of projects to manage the creation of links between RDF data sets
• The data sets can be stored locally or accessed remotely by specifying the SPARQL endpoint
• The user is able to create customized linking tasks:
– The tool offers a graphical editor to create linkage rules by combining the linkage rule components via drag & drop
– Includes support for (automatic) learning of linkage rules
Project configuration
Silk Workbench (2)
1. Project: name and components (data sources, linking tasks and output tasks)
2. Data sources: specification of the data sets to be interlinked
3. Linking task: specification of the linkage rules and type of links to be created
4. Output task: mechanism to store the results from the linking process
Editing a linking task
Silk Workbench (3)
1. Linkage rule components
2. Graphical editor: the items from (1) are dragged and dropped into this area and connected to compose the linkage rules
3. Generate links: based on the linkage rules defined in (2), the data sets are accessed to discover possible links
4. Learn: automatic learning of linkage rules
Adding a linkage rule
Silk Workbench (4)
The previous linkage rule states:
1. Retrieve the foaf:name values from MusicBrainz and the rdfs:label values from DBpedia
2. Apply a lower-case transformation to the output of (1)
3. Compare the output from (2) using the metric “Levenshtein distance”. If the resulting similarity is greater than 0.90, create a link.
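The three steps of this rule can be sketched in a few lines of Python. This is not Silk code and does not use Silk-LSL; it only mirrors the logic. Reading the 0.90 threshold as a similarity normalised to the range 0 to 1 (one minus edit distance divided by the longer string's length) is an assumption, and the function names are hypothetical.

```python
def levenshtein(a, b):
    """Edit distance between two strings (insert, delete, substitute)."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def similarity(a, b):
    """Assumed normalisation: 1.0 for identical strings, 0.0 for disjoint."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1.0 - levenshtein(a, b) / longest


def should_link(foaf_name, rdfs_label, threshold=0.90):
    """Steps 2 and 3 of the rule: lower-case both values, then compare."""
    return similarity(foaf_name.lower(), rdfs_label.lower()) > threshold


pairs = [
    ("The Beatles", "the beatles"),
    ("Led Zeppelin", "Led Zepelin"),
    ("Queen", "Queens of the Stone Age"),
]
for name, label in pairs:
    print(name, "|", label, "->", should_link(name, label))
```

In the Workbench the same pipeline is assembled graphically from path, transformation, and comparator components.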
Generate Links
Silk Workbench (5)
Learn Rules
Silk Workbench (6)
For exercises, quiz and further material visit our website:
@euclid_project euclidproject euclidproject
http://www.euclid-project.eu
Other channels:
eBook Course