Linked Data, Cultural Heritage & the Karma Mapping Software
Transcript of Linked Data, Cultural Heritage & the Karma Mapping Software
Linked Data & Cultural Heritage
Pedro Szekely and Craig Knoblock
USC/Information Sciences Institute
http://isi.edu/integration/karma
February 2015
Outline
• Problem
• Linked Data
• Karma
• Reconciliation
• Next steps
CC-By 2.0 2 USC Information Sciences Institute
CURRENT STATE OF CULTURAL HERITAGE DATA
Humans Browsing the Web
Crystal Bridges Museum of American Art
Dallas Museum of Art
Indianapolis Museum of Art
The Metropolitan Museum of Art
National Portrait Gallery
Smithsonian American Art Museum
WHAT WE SEE
blah blah blah blah blah blah blah blah blah blah blah blah …
WHAT THE COMPUTER SEES
WEB PAGES ARE UNUSABLE FOR CREATING INNOVATIVE APPLICATIONS USING THE DATA
SOLUTION: Linked Open Data
“web pages for computers”
using W3C standards for publishing data
Tim Berners-Lee on Linked Open Data
http://youtu.be/OM6XIICm_qo
Humans Browsing the Web
Crystal Bridges Museum of American Art
Dallas Museum of Art
Indianapolis Museum of Art
The Metropolitan Museum of Art
National Portrait Gallery
Smithsonian American Art Museum
RAW DATA NOW
Publish Your Raw Data
Crystal Bridges Museum of American Art
Dallas Museum of Art
Indianapolis Museum of Art
The Metropolitan Museum of Art
National Portrait Gallery
Smithsonian American Art Museum
Examples of Raw Data Now
https://github.com/cooperhewitt/collection
https://github.com/IMAmuseum/ima-collection
Convert Data to CRM (2 star)
Crystal Bridges Museum of American Art
Dallas Museum of Art
Indianapolis Museum of Art
The Metropolitan Museum of Art
National Portrait Gallery
Smithsonian American Art Museum
Linked Museum Data (3 star)
Crystal Bridges Museum of American Art
Dallas Museum of Art
Indianapolis Museum of Art
The Metropolitan Museum of Art
National Portrait Gallery
Smithsonian American Art Museum
Linked Cultural Heritage Data (4 star)
Represent Resources Using URIs
http://szekelys.com/family#pedro
“Pedro”
http://xmlns.com/foaf/0.1/firstName
Represent Information as Triples
Subject: http://szekelys.com/family#pedro (the resource being described)
Predicate: http://xmlns.com/foaf/0.1/firstName (a property of the resource)
Object: “Pedro” (the value of the property)
RDF Graphs
http://szekelys.com/family#pedro foaf:firstName “Pedro”
http://szekelys.com/family#pedro rdf:type foaf:Person
http://szekelys.com/family#pedro foaf:homepage http://isi.edu/~szekely
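The URI, triple, and graph slides can be made concrete with a small sketch. It uses only the Python standard library and the example resource from the slides; the `Triple` type and the `to_ntriples` helper are illustrative names chosen here, not part of any RDF library, and the serializer handles only simple literals.

```python
from typing import NamedTuple

FOAF = "http://xmlns.com/foaf/0.1/"
RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"


class Triple(NamedTuple):
    subject: str              # the resource being described (a URI)
    predicate: str            # a property of the resource (a URI)
    obj: str                  # the value: a URI, or a literal if is_literal
    is_literal: bool = False


pedro = "http://szekelys.com/family#pedro"

# An RDF graph is a set of triples; these three mirror the slide.
graph = {
    Triple(pedro, FOAF + "firstName", "Pedro", is_literal=True),
    Triple(pedro, RDF + "type", FOAF + "Person"),
    Triple(pedro, FOAF + "homepage", "http://isi.edu/~szekely"),
}


def to_ntriples(triples):
    """Serialize triples as N-Triples lines (simple literals, no escaping)."""
    lines = []
    for t in sorted(triples):
        obj = f'"{t.obj}"' if t.is_literal else f"<{t.obj}>"
        lines.append(f"<{t.subject}> <{t.predicate}> {obj} .")
    return "\n".join(lines)


print(to_ntriples(graph))
```

Each printed line is one subject, predicate, object statement, which is the form a computer can consume without scraping a web page.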
Linked Open Data
Steps to Create Linked Open Data
• Publish the raw data … get the data out of the proprietary database
• Select ontologies … that define classes and properties for our data
• Define URI scheme … identifiers of your resources
• Convert data to RDF … from data sources to the ontologies
• Identify links to other Linked Data datasets … aka reconciliation, entity resolution, …
CIDOC CRM
• Select ontologies … that define classes and properties for our data
http://www.cidoc-crm.org/
• Define URI scheme … identifiers of your resources
http://edan.si.edu/saam/person-institution/8
http://edan.si.edu/saam/person-institution/8/id
http://edan.si.edu/saam/person-institution/8/appellation/displayname
http://edan.si.edu/saam/object/12
http://edan.si.edu/saam/object/12/title
http://edan.si.edu/saam/object/12/id
http://edan.si.edu/saam/object/12/acquisition
http://edan.si.edu/saam/object/12/production
http://edan.si.edu/saam/object/12/production/date
http://edan.si.edu/saam/thesauri/nationality/American
http://edan.si.edu/saam/thesauri/classification/Photography
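A URI scheme like the one on the slide amounts to a small, predictable function from record identifiers to URIs. The following is a minimal sketch of that idea; `resource_uri` is a hypothetical helper written for this illustration, and only the base and path patterns come from the slide.

```python
from urllib.parse import quote

BASE = "http://edan.si.edu/saam"  # base taken from the slide's examples


def resource_uri(kind, identifier, *path):
    """Build BASE/kind/identifier[/path...], percent-encoding each segment."""
    parts = [kind, str(identifier), *path]
    return "/".join([BASE] + [quote(p, safe="") for p in parts])


print(resource_uri("object", 12))
print(resource_uri("object", 12, "production", "date"))
print(resource_uri("person-institution", 8, "appellation", "displayname"))
print(resource_uri("thesauri", "nationality", "American"))
```

Keeping the scheme in one function means every dataset conversion mints the same URI for the same record, which is what later linking depends on.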
• Convert data to RDF … from data sources to the ontologies
RDF Mapping Tools
TOOL | SHORTCOMINGS | BENEFITS
custom code | labor intensive; error prone | flexible
R2RML | difficult to learn; only SQL databases | W3C standard; good documentation; multiple vendors
Open Refine | no guidance; only tabular data | graphical user interface; support for reconciliation; open source
Karma | university product | easy to use; flexible; multiple data formats; multiple deployment databases; scalable; R2RML compatible; open source
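To show what the "custom code" row means in practice, here is a minimal hand-written converter from a CSV export to N-Triples. Everything specific in it is an assumption made for illustration: the column names, the two sample rows, and the use of the Dublin Core `title` property as a stand-in for a real CRM mapping. It is not how Karma or any R2RML tool works internally.

```python
import csv
import io

DCT_TITLE = "http://purl.org/dc/terms/title"
BASE = "http://edan.si.edu/saam/object/"  # object URI pattern from the slides

# Hypothetical raw export; column names and rows are made up for the sketch.
RAW = """id,title
12,Untitled
13,"Portrait, seated"
"""


def rows_to_ntriples(text):
    """Yield one N-Triples line per CSV row, mapping 'title' to dct:title."""
    for row in csv.DictReader(io.StringIO(text)):
        subject = BASE + row["id"].strip()
        # Escape backslashes and quotes so the literal stays well formed.
        title = row["title"].replace("\\", "\\\\").replace('"', '\\"')
        yield f'<{subject}> <{DCT_TITLE}> "{title}" .'


for line in rows_to_ntriples(RAW):
    print(line)
```

Even this tiny script hard-codes the column names, the URI pattern, and the escaping rules, which illustrates why the table calls custom code flexible but labor intensive and error prone.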
Karma
Interactive tool for rapidly extracting, cleaning, transforming, integrating & publishing linked data in multiple formats
[Diagram labels: XML/JSON, Services, SQL/CSV, BigData, Ontology, Karma, RDF, JSON, …]
KARMA DEMO
http://youtu.be/h3_yiBhAJIc
Easy To Use
CLEAR DEPICTION OF MAPPING
LEARNS TO MAP YOUR DATA
SUGGEST CORRECT ADJUSTMENTS
EMBEDDED PYTHON SCRIPTING
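As an illustration of the kind of cleaning step a scripted transformation can perform, the sketch below normalizes date strings such as the `"1856-1-12"` value that appears in the reconciliation slides. The function name `normalize_date` is hypothetical, and the sketch is plain Python; it does not reproduce Karma's own scripting interface.

```python
import re


def normalize_date(value):
    """Zero-pad year-month-day strings; return other values stripped but unchanged."""
    match = re.fullmatch(r"(\d{4})-(\d{1,2})-(\d{1,2})", value.strip())
    if not match:
        return value.strip()
    year, month, day = match.groups()
    return f"{year}-{int(month):02d}-{int(day):02d}"


for raw in ["1856-1-12", "1925-4-15", "1856"]:
    print(raw, "->", normalize_date(raw))
```

Values that are only a year pass through untouched, so the same transformation can be applied to a column that mixes full dates and bare years.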
IMPORT POPULAR DATA FORMATS
OUTPUT RDF IN MULTIPLE FORMATS
ntriples
JSON
AVRO
SPARQL
ElasticSearch, GitHub, …
Hadoop, BigData
40 million documents, 1 billion triples
larger than all AAC museums combined
periodic update: every hour, every day
continuous update: as new records come in
Karma compatible with R2RML tools
Karma Is Open Source
URI RECONCILIATION
Multiple “John Singer Sargent”

ima:Singer_Sargent_John a aac:Person ;
    dct:date "1856-1925" ;
    foaf:name "John Singer Sargent" .

saam:person_4253 a aac:Person ;
    saam:associatedPlace saam:SaamPlace_1357324439768t1r13950_0, saam:SaamPlace_1357324439768t1r13951_0 ;
    saam:constituentId "4253" ;
    rdaGr2:biographicalInformation "Painter. Sargent traveled …" ;
    rdaGr2:dateAssociatedWithThePerson "1990-10-1", "1995-5-8" ;
    rdaGr2:dateOfBirth "1856-1-12" ;
    rdaGr2:dateOfDeath "1925-4-15" ;
    rdaGr2:placeOfBirth saam:SaamPlace_1357324439768t1r13952_0 ;
    rdaGr2:placeOfDeath saam:SaamPlace_1357324439768t1r13953_0 ;
    skos:altLabel "John S. Sargent" ;
    skos:prefLabel "John Singer Sargent" .

cb:12_4567 a aac:Person ;
    ont0:dateOfBirth "1879", "1885" ;
    ont0:dateOfDeath "1925" ;
    skos:prefLabel "John Singer Sargent" .

met:person_1893_3819 a aac:Person ;
    ont0:placeOfResidence "North and Central America", "United States" ;
    foaf:name "John Singer Sargent" .

dma:person_John_Singer_Sargent a aac:Person ;
    ont0:dateOfBirth "1856" ;
    ont0:dateOfDeath "1925" ;
    foaf:name "John Singer Sargent" .
John Singer Sargent

ima:SaamPerson_John_Singer_Sargent a aac:Person ;
    dct:date "1856-1925" ;
    foaf:name "John Singer Sargent" .

aac:Person_4253 a aac:Person ;
    saam:associatedPlace saam:SaamPlace_1357324439768t1r13950_0, saam:SaamPlace_1357324439768t1r13951_0 ;
    saam:constituentId "4253" ;
    rdaGr2:biographicalInformation "Painter. Sargent traveled …" ;
    rdaGr2:dateAssociatedWithThePerson "1990-10-1", "1995-5-8" ;
    rdaGr2:dateOfBirth "1856-1-12" ;
    rdaGr2:dateOfDeath "1925-4-15" ;
    rdaGr2:placeOfBirth saam:SaamPlace_1357324439768t1r13952_0 ;
    rdaGr2:placeOfDeath saam:SaamPlace_1357324439768t1r13953_0 ;
    skos:altLabel "John S. Sargent" ;
    skos:prefLabel "John Singer Sargent" .

cb:SaamPerson_John_Singer_Sargent a aac:Person ;
    ont0:dateOfBirth "1879", "1885" ;
    ont0:dateOfDeath "1925" ;
    skos:prefLabel "John Singer Sargent" .

met:SaamPerson_John_Singer_Sargent a aac:Person ;
    ont0:placeOfResidence "North and Central America", "United States" ;
    foaf:name "John Singer Sargent" .

dallas:SaamPerson_John_Singer_Sargent a aac:Person ;
    ont0:dateOfBirth "1856" ;
    ont0:dateOfDeath "1925" ;
    foaf:name "John Singer Sargent" .
Reconciled “John Singer Sargent” URIs

saam:person_4253
    owl:sameAs cb:12_4567 ;
    owl:sameAs dma:person_John_Singer_Sargent ;
    owl:sameAs ima:Singer_Sargent_John ;
    owl:sameAs met:SaamPerson_John_Singer_Sargent ;
    owl:sameAs dbpedia:John_Singer_Sargent ;
    owl:sameAs nytimes/N49129220686803623753 ;
    owl:sameAs w-flick/John_Singer_Sargent ;
    ....
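One way to think about producing such `owl:sameAs` candidates is to compare names and then weigh date evidence. The sketch below is a simplified illustration, not the linking method used in Karma: it keeps only the name and the birth and death years from four of the records shown earlier (full dates are reduced to their year), and it reports the evidence for each pair rather than deciding the match.

```python
import re
from itertools import combinations

# Name and year evidence transcribed from the slides.
records = {
    "saam:person_4253": {"name": "John Singer Sargent",
                         "birth": {"1856"}, "death": {"1925"}},
    "dma:person_John_Singer_Sargent": {"name": "John Singer Sargent",
                                       "birth": {"1856"}, "death": {"1925"}},
    "cb:12_4567": {"name": "John Singer Sargent",
                   "birth": {"1879", "1885"}, "death": {"1925"}},
    "met:person_1893_3819": {"name": "John Singer Sargent",
                             "birth": set(), "death": set()},
}


def normalize(name):
    """Lowercase, drop punctuation, and collapse whitespace."""
    return " ".join(re.sub(r"[^a-z\s]", "", name.lower()).split())


def compare_years(a, b):
    """Return 'unknown' if either side is empty, else 'agree' or 'conflict'."""
    if not a or not b:
        return "unknown"
    return "agree" if a & b else "conflict"


def candidate_links(recs):
    """Yield (uri1, uri2, birth_evidence, death_evidence) for same-name pairs."""
    for (u1, r1), (u2, r2) in combinations(sorted(recs.items()), 2):
        if normalize(r1["name"]) == normalize(r2["name"]):
            yield (u1, u2,
                   compare_years(r1["birth"], r2["birth"]),
                   compare_years(r1["death"], r2["death"]))


for link in candidate_links(records):
    print(link)
```

The Crystal Bridges record shares the name and death year but lists different birth years, so it surfaces as a conflict for a person to review instead of being linked or discarded silently.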
URI Reconciliation In Karma
Results of Automatic Linking
99% are correct
6% are missing
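Read as standard evaluation measures, "99% are correct" corresponds to precision and "6% are missing" to one minus recall; that reading is an assumption, since the slide does not define the terms. The sketch below shows how the two numbers would be computed from a set of proposed links and a set of true links. The link sets are synthetic and chosen only to land near the slide's figures; they are not the evaluation data.

```python
def precision_recall(proposed, truth):
    """Precision: share of proposed links that are true. Recall: share of true links found."""
    correct = proposed & truth
    return len(correct) / len(proposed), len(correct) / len(truth)


# Synthetic example: 100 true links, 95 proposed, 94 of them right.
truth = {f"link{i}" for i in range(100)}
proposed = {f"link{i}" for i in range(94)} | {"wrong0"}

precision, recall = precision_recall(proposed, truth)
print(f"correct: {precision:.0%}, missing: {1 - recall:.0%}")
```

The two numbers answer different questions: precision tells a curator how much to trust a proposed link, and the missing share tells them how many links still have to be found by other means.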
Steps to Create Linked Open Data
• Publish the raw data … get the data out of the proprietary database
• Select ontologies … that define classes and properties for our data
• Define URI scheme … identifiers of your resources
• Convert data to RDF … from data sources to the ontologies
• Identify links to other Linked Data datasets … aka reconciliation, entity resolution, …
TMS to CRM easy?
NO
COMMUNITY EFFORT
• Publish the raw data … get the data out of the proprietary database
• Select ontologies … that define classes and properties for our data
• Define URI scheme … identifiers of your resources
• Convert data to RDF … from data sources to the ontologies
• Identify links to other Linked Data datasets … aka reconciliation, entity resolution, …
Radical Ideas
• ULAN in Wikipedia or Wikidata
• ULAN in GitHub
• Collection data in GitHub
• Community created CRM mappings in GitHub
• CRM in JSON-LD in GitHub
• Tools to export from TMS to GitHub
STORING AND MAINTAINING THE DATA
Deployment Options
TECHNOLOGY | SHORTCOMINGS | BENEFITS
SPARQL endpoint | low reliability, esoteric, slow | sophisticated query language
RDF dump | no query capability, esoteric | flexibility: clients can download and use in applications, easy to publish
JSON-LD + ElasticSearch | restricted query language | very high performance, mainstream technology, easy to publish
Karma supports the three options
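To give a feel for the JSON-LD option, the sketch below builds one JSON-LD document for a person and runs a simple field lookup over a list of such documents. The `@context` terms, the `example.org` URI, and both helper functions are assumptions made for illustration; the lookup is a plain Python stand-in for an index query and does not use an ElasticSearch client.

```python
import json

# Context mapping short JSON keys to RDF properties (illustrative choice of terms).
CONTEXT = {
    "foaf": "http://xmlns.com/foaf/0.1/",
    "owl": "http://www.w3.org/2002/07/owl#",
    "name": "foaf:name",
    "sameAs": {"@id": "owl:sameAs", "@type": "@id"},
}


def person_doc(uri, name, same_as):
    """Return a JSON-LD document describing one person."""
    return {"@context": CONTEXT, "@id": uri, "name": name,
            "sameAs": sorted(same_as)}


def find_by_name(docs, name):
    """Exact match on one field, standing in for a simple index query."""
    return [d["@id"] for d in docs if d["name"] == name]


docs = [
    person_doc("http://example.org/person/4253",  # placeholder URI
               "John Singer Sargent",
               ["http://dbpedia.org/resource/John_Singer_Sargent"]),
]

print(json.dumps(docs[0], indent=2))
print(find_by_name(docs, "John Singer Sargent"))
```

The document is ordinary JSON that mainstream tools can store and search, while the `@context` keeps it interpretable as RDF; the trade-off named in the table is that field lookups like this are far less expressive than SPARQL.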
federation: everyone publishes their data with their own URIs
aggregation: an aggregator republishes everyone’s data with new URIs
Thanks for your attention!
https://github.com/usc-isi-i2/Web-Karma
Open Source, Apache 2 License