How to integrate Linked Data into your application
|
HOW TO INTEGRATE LINKED DATA INTO YOUR APPLICATION
LDIF Team:
Andreas Schultz, Freie Universität Berlin
Andrea Matteini, mes|semantics
Robert Isele, Freie Universität Berlin
Pablo N. Mendes, Freie Universität Berlin
Christian Becker, mes|semantics
Christian Bizer, Freie Universität Berlin
With contributions by: Hannes Mühleisen, Freie Universität Berlin; William Smith, Vulcan Inc.
SEMANTIC TECHNOLOGY & BUSINESS CONFERENCE, SAN FRANCISCO, JUNE 5, 2012
|
WHAT IS LINKED DATA?
• Raw data (RDF)
• Accessible on the web
• Data can link to other data sources
• Benefits: Ease of access and re-use; enables discovery
• One API for all data sources?
[Figure: data sources A to E, each describing several Things, connected to one another by data links]
|
LINKING OPEN DATA CLOUD
As of September 2011
[Figure: LOD cloud diagram. Data sets are grouped by domain: Media, Geographic, Publications, Government, Cross-domain, Life sciences, User-generated content. Data sets shown include DBpedia, Freebase, GeoNames, MusicBrainz, LinkedMDB, New York Times, UniProt, DrugBank, and several data.gov.uk data sets.]
http://lod-cloud.net
|
TYPES OF LINKED DATA
• Linked Enterprise Data
• Open, Public Data (LOD Cloud)
• Commercial Linked Data (very soon?)
... AND WHAT YOU CAN DO WITH THEM
• Provide interfaces on top of them
• Augment your website
• Integrate them into your application logic
• Create specialized data marts
|
AUGMENT YOUR WEBSITE: BBC
BBC online properties make intensive use of data from Wikipedia and MusicBrainz
|
DATA MARTS: NEUROWIKI
• NeuroWiki creates views for genes, drugs and diseases data from four RDF data sources
• Provides navigation and composition tools for accessing and mining the data
|
APPLICATION LOGIC: IBM WATSON
• IBM Watson makes use of Linked Data sources such as DBpedia
http://www.flickr.com/photos/ibm_media/
|
4 STEPS TO LINKED DATA INTEGRATION
|
STEP #1: ACCESS LINKED DATA
• Linked Data is published via HTTP, SPARQL endpoints, RDF dumps
• Live access allows quick prototyping and limited production use
• As data sets grow in size and more data sources are added, a crawling/caching architecture often becomes necessary
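Architectures compared by access methods (HTTP dereferencing, SPARQL, dump import) and decision factors (recency, speed/scalability, reliability, complexity):
• On-the-fly dereferencing
  • Access methods: HTTP dereferencing
  • Recency: high; speed/scalability: low; reliability: low; complexity: high
• Query federation
  • Access methods: SPARQL
  • Recency: high; speed/scalability: decreases exponentially as new sources are added; reliability: low; complexity: moderate with SPARQL 1.1 SERVICE clause
• Crawling and caching
  • Access methods: HTTP dereferencing, SPARQL, dump import
  • Recency: depends; speed/scalability: high; reliability: high; complexity: high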
Adapted from: Linked Data: Evolving the Web into a Global Data Space (Heath/Bizer 2011)
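On-the-fly dereferencing comes down to an HTTP GET on a resource URI with an Accept header that asks for an RDF serialization. The following is a minimal sketch using only the Python standard library; the DBpedia resource URI and the media-type preferences are illustrative assumptions, and the request is only built, not sent.

```python
from urllib.request import Request


def build_dereference_request(uri: str) -> Request:
    """Build (but do not send) an HTTP GET request that asks for RDF."""
    # Content negotiation: prefer Turtle, fall back to RDF/XML.
    # These preferences are an assumption for illustration.
    accept = "text/turtle, application/rdf+xml;q=0.9, */*;q=0.1"
    return Request(uri, headers={"Accept": accept}, method="GET")


# Illustrative resource URI; any dereferenceable Linked Data URI works.
request = build_dereference_request("http://dbpedia.org/resource/Berlin")
print(request.get_method(), request.full_url)
print(request.get_header("Accept"))
```

Passing the request to urllib.request.urlopen would perform the lookup. A production client would also need timeouts, error handling, and an RDF parser, which is part of why this architecture is rated high in complexity.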
|
STEP #1: ACCESS LINKED DATA
Implementations:
• On-the-fly dereferencing
  • LDspider, SQUIN, Semantic Web Client library
• Query federation
  • SPARQL 1.1 SERVICE clause
• Crawling and caching
  • Triplestore import script
  • Public caches (e.g. Sindice, OpenLink LOD endpoint)
  • LDIF
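The crawling and caching pattern can be sketched as a breadth-first walk over data links that stores each document locally the first time it is seen. This is a simplified illustration, not how LDspider or LDIF are implemented; the crawl function, the fetch callback, and the in-memory FAKE_WEB standing in for real HTTP lookups are all hypothetical.

```python
from collections import deque
from typing import Callable, Dict, Iterable, List, Tuple

Triple = Tuple[str, str, str]


def crawl(seeds: Iterable[str],
          fetch: Callable[[str], List[Triple]],
          max_documents: int = 100) -> Dict[str, List[Triple]]:
    """Breadth-first crawl that fetches and caches each URI at most once."""
    cache: Dict[str, List[Triple]] = {}
    queue = deque(seeds)
    while queue and len(cache) < max_documents:
        uri = queue.popleft()
        if uri in cache:
            continue  # already cached, do not fetch again
        triples = fetch(uri)
        cache[uri] = triples
        # Follow data links: objects that look like HTTP(S) URIs.
        for _, _, obj in triples:
            if obj.startswith(("http://", "https://")) and obj not in cache:
                queue.append(obj)
    return cache


# Hypothetical in-memory "web" standing in for real HTTP lookups.
FAKE_WEB = {
    "http://example.org/A": [
        ("http://example.org/A", "http://example.org/link", "http://example.org/B"),
    ],
    "http://example.org/B": [
        ("http://example.org/B", "http://example.org/link", "http://example.org/A"),
        ("http://example.org/B", "http://example.org/label", "Thing B"),
    ],
}


def fake_fetch(uri: str) -> List[Triple]:
    return FAKE_WEB.get(uri, [])


local_cache = crawl(["http://example.org/A"], fake_fetch)
print(sorted(local_cache))
```

Once the cache is filled, queries run against local data only, which is where the speed and reliability advantages of this pattern come from; the price is that recency depends on how often the crawl is repeated.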
|
STEP #2: NORMALIZE VOCABULARIES
Data sources that overlap in content use a wide range of vocabularies.
[Figure: tag cloud of the most widely used vocabularies in the LOD cloud (08/10/2011): po, bib, swrc, dcam, tl, mpeg7, rdfg, compass, wot, txn, metalex, doap, wdrs, admingeo, vann, api, org, sawsdl, sdmx, geospecies, qb, xml, vu-wordnet, rev, umbel, uniprot, http, scovo, void, tag, dbp, bio, ore, dbo, gr, dbpedia, event, time, xsd, frbr, geonames, cc, sioc, vcard, mo, bibo, akt, xhtml, geo, skos, foaf, dc. Source: FU Berlin / DERI; http://www4.wiwiss.fu-berlin.de/lodcloud/state/]
• Over 60% of all LOD sources use proprietary vocabularies
• It’s up to the data consumer to normalize the vocabularies
• Enterprise: Need to translate between internal and external vocabularies
|
STEP #2: NORMALIZE VOCABULARIES
Approaches to Schema Mapping:
• Hand-crafting queries against individual sources – no different than an API
• Ontology Representation Languages: OWL, RDFS
• Rules: SWRL, RIF
• Query Languages
  • SPARQL CONSTRUCT clause
  • TopQuadrant SPARQLMotion
  • Mosto
  • R2R (part of LDIF)
Example of a hand-crafted query fragment covering several sources:
OPTIONAL { ?ow fb:location.location.containedby [ ot:preferredLabel ?city_fb_con ] } .
OPTIONAL { ?ow dbp-prop:location ?loc .
           ?loc rdf:type umbel-sc:City ;
                ot:preferredLabel ?city_db_loc }
OPTIONAL { ?ow dbp-ont:city [ ot:preferredLabel ?city_db_cit ] }
Source: http://www.readwriteweb.com/archives/the_modigliani_test_for_linked_data.php
|
STEP #2: NORMALIZE VOCABULARIES
Using SPARQL:
• Rename a class
CONSTRUCT { ?s a mo:MusicArtist }
WHERE { ?s a dbpedia-owl:MusicalArtist }
• Value transformation
CONSTRUCT { ?s movie:runtime ?runtimeInMinutes . }
WHERE {
  ?s dbpedia-owl:runtime ?runtime .
  # dbpedia-owl:runtime is given in seconds, so divide by 60 to get minutes
  BIND(?runtime / 60 AS ?runtimeInMinutes)
}
• Create URI from literal
CONSTRUCT {
  ?s diseasome:omim ?omimuri .
  ?omimuri dc:identifier ?identifier .
}
WHERE {
  ?s dbpedia-owl:omim ?omim .
  BIND(IRI(concat("http://bio2rdf.org/omim:", ?omim)) AS ?omimuri)
  BIND(concat("omim:", ?omim) AS ?identifier)
}
Slide credits: Andreas Schultz
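To make the effect of these three mapping types concrete, here is a rough Python sketch that applies them to in-memory triples. It only illustrates the idea and is not R2R or a SPARQL engine; the normalize function, the ex: subjects, and the sample values are hypothetical, and the runtime conversion assumes the source value is in seconds.

```python
from typing import List, Tuple

Triple = Tuple[str, str, object]


def normalize(triples: List[Triple]) -> List[Triple]:
    """Translate source-vocabulary triples into the target vocabulary."""
    out: List[Triple] = []
    for s, p, o in triples:
        if p == "rdf:type" and o == "dbpedia-owl:MusicalArtist":
            # Rename a class.
            out.append((s, "rdf:type", "mo:MusicArtist"))
        elif p == "dbpedia-owl:runtime":
            # Value transformation: seconds to minutes (assumed source unit).
            out.append((s, "movie:runtime", o / 60))
        elif p == "dbpedia-owl:omim":
            # Create a URI from a literal, and keep a readable identifier.
            omim_uri = "http://bio2rdf.org/omim:" + str(o)
            out.append((s, "diseasome:omim", omim_uri))
            out.append((omim_uri, "dc:identifier", "omim:" + str(o)))
        # Triples with no mapping are dropped, as a CONSTRUCT query would do.
    return out


# Hypothetical sample data for illustration only.
source = [
    ("ex:artist1", "rdf:type", "dbpedia-owl:MusicalArtist"),
    ("ex:film1", "dbpedia-owl:runtime", 7200.0),
    ("ex:disease1", "dbpedia-owl:omim", "104300"),
    ("ex:film1", "ex:unmapped", "ignored"),
]
for triple in normalize(source):
    print(triple)
```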
|
STEP #3: RESOLVE IDENTIFIERS
Data sources that overlap in content use different identifiers for the same real-world entity.
[Figure: bar chart of the number of linked data sets per source (08/10/2011), giving the number of sources in each category: 1 linked data set: 98; 2: 62; 3: 38; 4: 19; 5: 5; 6 to 10: 17; more than 10: 27. Source: FU Berlin / DERI; http://www4.wiwiss.fu-berlin.de/lodcloud/state/]
• Most LOD sources only provide owl:sameAs links to one other data source
• It’s up to the data consumer to generate additional links
• Enterprise: Need to link both internal and external resources
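Once owl:sameAs links are available, a consumer still has to collapse each cluster of equivalent identifiers into one URI. One common way to do this is a union-find structure, sketched below; this is an assumption about how such a step could be implemented, not a description of LDIF internals, and the resolve_aliases function and the ex-a:, ex-b:, ex-c: identifiers are hypothetical.

```python
from typing import Dict, Iterable, Tuple


def resolve_aliases(same_as: Iterable[Tuple[str, str]]) -> Dict[str, str]:
    """Map every URI to one canonical URI per owl:sameAs cluster."""
    parent: Dict[str, str] = {}

    def find(uri: str) -> str:
        parent.setdefault(uri, uri)
        while parent[uri] != uri:
            parent[uri] = parent[parent[uri]]  # path halving
            uri = parent[uri]
        return uri

    for a, b in same_as:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            # Keep the lexicographically smaller root as the canonical URI.
            keep, drop = sorted((root_a, root_b))
            parent[drop] = keep

    return {uri: find(uri) for uri in list(parent)}


# Hypothetical sameAs links between three sources.
links = [
    ("ex-b:San_Francisco", "ex-c:city42"),
    ("ex-a:SF", "ex-b:San_Francisco"),
]
canonical = resolve_aliases(links)
print(canonical)
```

Because owl:sameAs is transitive, ex-a:SF and ex-c:city42 end up in the same cluster even though no source links them directly.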
|
STEP #3: RESOLVE IDENTIFIERS
Approaches to Identity Resolution:
• Improvised or manual merging
• Rule-based approaches:
  • Silk (part of LDIF)
  • LIMES
[Figure: "Union Square" at 37°47′ N 122°24′ W is compared with the candidates Union Sq., New York; Union Sq., Seattle; and Union Sq., San Francisco. Result: Union Sq. = Union Sq., San Francisco (37°47′ N 122°24′ W)]
|
STEP #4: FILTER DATA
Data sources that overlap in content provide data that is conflicting and of varying quality.
• Data sources have...
  • ... different knowledge levels, views or intents
  • ... wrong, biased, inconsistent or outdated information
• Approaches:
  • Import data into distinct Named Graphs; query them separately using the SPARQL GRAPH clause
  • Sieve (part of LDIF)
|
LDIF – LINKED DATA INTEGRATION FRAMEWORK
Integrates Linked Data from multiple sources into a clean, local target representation while keeping track of data provenance
1. Collect data: managed download and update
2. Translate data into a single target vocabulary
3. Resolve identifier aliases into local target URIs
4. Cleanse data: resolve conflicting values
5. Output
• Follows the Crawling and Caching Architecture Pattern
• Open source (Apache License, Version 2.0)
• Collaboration between Freie Universität Berlin and mes|semantics
NEW
|
LDIF PIPELINE
Pipeline steps: 1. Collect data; 2. Translate data; 3. Resolve identities; 4. Cleanse data; 5. Output
Step 1: Collect data
Supported data sources:
• RDF dumps (all common formats)
• SPARQL Endpoints
• Crawling Linked Data via HTTP
|
LDIF PIPELINE
Step 2: Translate data (R2R)
Sources use a wide range of different RDF vocabularies
[Figure: the source classes dbpedia-owl:City, schema:Place, and fb:location.citytown are translated to the target class local:City]
• Simple mappings using OWL / RDFS statements (x rdfs:subClassOf y)
• Complex mappings with SPARQL expressivity
• Built-in transformation function library (XPath)
|
LDIF PIPELINE
Step 3: Resolve identities (Silk)
Sources use different identifiers for the same entity
[Figure: "Union Square" at 37°47′ N 122°24′ W is compared with the candidates Union Sq., New York; Union Sq., Seattle; and Union Sq., San Francisco. Result: Union Sq. = Union Sq., San Francisco (37°47′ N 122°24′ W). Other diagram labels: rdf:type wiki:Gene, wiki:IsInvolvedIn]
• Automated link creation based on Link Specifications
• Supports various comparators and transformations (string similarity, basic arithmetic, time, geographical distance)
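The Union Square case shows why a single comparator is rarely enough: the labels are nearly identical, and only the coordinates tell the candidates apart. The sketch below combines a string similarity with a geographical distance in plain Python. It is not Silk's Link Specification Language; the thresholds, the is_same_place function, and the approximate coordinates for the New York and Seattle candidates are assumptions for illustration.

```python
import math
from difflib import SequenceMatcher


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two points in kilometres."""
    radius_km = 6371.0
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2)
    return 2 * radius_km * math.asin(math.sqrt(a))


def is_same_place(a, b, min_name_similarity=0.4, max_distance_km=1.0):
    """Link two records only if both the name and the location agree."""
    name_similarity = SequenceMatcher(
        None, a["label"].lower(), b["label"].lower()).ratio()
    distance = haversine_km(a["lat"], a["lon"], b["lat"], b["lon"])
    return name_similarity >= min_name_similarity and distance <= max_distance_km


# 37°47′ N 122°24′ W converted to decimal degrees.
target = {"label": "Union Square", "lat": 37 + 47 / 60, "lon": -(122 + 24 / 60)}

# Coordinates for New York and Seattle are approximate, illustrative values.
candidates = [
    {"label": "Union Sq., New York", "lat": 40.7359, "lon": -73.9911},
    {"label": "Union Sq., Seattle", "lat": 47.6101, "lon": -122.3344},
    {"label": "Union Sq., San Francisco", "lat": 37 + 47 / 60, "lon": -(122 + 24 / 60)},
]

matches = [c["label"] for c in candidates if is_same_place(target, c)]
print(matches)
```

All three candidates pass the name comparison, so the distance threshold does the actual disambiguation here.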
|
LDIF PIPELINE
Step 4: Cleanse data (Sieve)
Sources provide different values for the same property
[Figure: source statements "San Francisco population is 0.7M" and "San Francisco population is 0.8M", each annotated with a star quality rating]
1. Quality Assessment – assign quality scores to Named Graphs (by time, by source preference, thresholds)
2. Data Fusion – resolve conflicting property values (according to quality scores, frequency, averages)
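The two phases can be illustrated with a small Python sketch: first score each Named Graph by recency, then keep, for every subject and property, the value from the best-scored graph. This is not Sieve's configuration syntax or implementation; the function names, graph names, update dates, and the linear scoring formula are assumptions made up for the example.

```python
from datetime import date
from typing import Dict, List, Tuple

Quad = Tuple[str, str, object, str]  # subject, property, value, named graph


def recency_score(last_updated: date, today: date,
                  max_age_days: int = 730) -> float:
    """Quality assessment by time: 1.0 for fresh data, 0.0 beyond max age."""
    age_days = (today - last_updated).days
    return max(0.0, 1.0 - age_days / max_age_days)


def fuse(quads: List[Quad],
         graph_scores: Dict[str, float]) -> Dict[Tuple[str, str], object]:
    """Data fusion: keep the value from the highest-scored graph."""
    best: Dict[Tuple[str, str], Tuple[object, float]] = {}
    for s, p, value, graph in quads:
        score = graph_scores.get(graph, 0.0)
        key = (s, p)
        if key not in best or score > best[key][1]:
            best[key] = (value, score)
    return {key: value for key, (value, _) in best.items()}


today = date(2012, 6, 5)
# Hypothetical last-update dates for two source graphs.
scores = {
    "graph:sourceA": recency_score(date(2010, 1, 1), today),
    "graph:sourceB": recency_score(date(2012, 1, 1), today),
}
quads = [
    ("ex:San_Francisco", "ex:population", 700_000, "graph:sourceA"),
    ("ex:San_Francisco", "ex:population", 800_000, "graph:sourceB"),
]
print(scores)
print(fuse(quads, scores))
```

Other fusion policies named on the slide, such as taking the most frequent value or an average, would replace only the body of fuse; the scoring step stays the same.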
|
LDIF PIPELINE
Step 5: Output
Output options:
• N-Quads
• N-Triples
• SPARQL Update Stream
• Provenance tracking using Named Graphs
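N-Quads extends N-Triples with a fourth term naming the graph, which is what lets provenance travel with every statement. The following sketch writes one such line by hand; the to_nquad helper and the example.org URIs are hypothetical, and the escaping covers only the most common special characters in plain string literals.

```python
def escape_literal(value: str) -> str:
    """Escape backslashes, quotes, and line breaks inside a literal."""
    return (value.replace("\\", "\\\\")
                 .replace('"', '\\"')
                 .replace("\n", "\\n")
                 .replace("\r", "\\r"))


def to_nquad(subject: str, predicate: str, obj: str, graph: str,
             obj_is_uri: bool = False) -> str:
    """Serialize one statement as an N-Quads line: s p o g ."""
    if obj_is_uri:
        obj_term = "<" + obj + ">"
    else:
        obj_term = '"' + escape_literal(obj) + '"'
    return "<{}> <{}> {} <{}> .".format(subject, predicate, obj_term, graph)


line = to_nquad(
    "http://example.org/San_Francisco",
    "http://example.org/population",
    "800000",
    "http://example.org/graph/sourceB",
)
print(line)
```

Dropping the fourth term yields the corresponding N-Triples line, at the cost of losing the link back to the source graph.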
|
LDIF ARCHITECTURE
[Figure: layered architecture diagram with the following components]
• Application Layer: Application Code
• Data Access, Integration and Storage Layer (LDIF): Web Data Access Module, Data Translation Module, Identity Resolution Module, Data Quality and Fusion Module, Integrated Web Data, SPARQL or RDF API
• Web of Data: accessed via HTTP
• Publication Layer: LD Wrappers, Database A, Database B, CMS, RDF/XML, RDFa
|
VERSIONS
• In-memory
  • fast, but scalability limited by local RAM
• RDF Store (TDB)
  • stores intermediate results in a Jena TDB RDF store
  • can process more data than In-memory but doesn't scale
• Cluster (Hadoop)
  • scales by parallelizing work across multiple machines using Hadoop
  • can process a virtually unlimited amount of data
  • ready for Amazon Elastic MapReduce
|
BENCHMARKS: KEGG GENES VS. UNIPROT (CLUSTER)
300M TRIPLES
3.6B TRIPLES
|
Q & A
|
THANKS!
• Early adopters wanted!
• Website: http://bit.ly/ldifweb
• Google Group: http://bit.ly/ldifgroup
• http://mes-semantics.com
• Supported in part by
  • Vulcan Inc. as part of its Project Halo
  • EU FP7 project LOD2 - Creating Knowledge out of Interlinked Data (Grant No. 257943)
• Slide credits: Andrea Matteini, Robert Isele, Andreas Schultz