Linked Data, Cultural Heritage & the Karma Mapping Software

Post on 18-Jul-2015


Linked Data & Cultural Heritage

Pedro Szekely and Craig Knoblock
USC/Information Sciences Institute
pszekely@isi.edu, knoblock@isi.edu

http://isi.edu/integration/karma

February 2015

Outline

•  Problem

•  Linked Data

•  Karma

•  Reconciliation

•  Next steps

CC-By 2.0 2 USC Information Sciences Institute

CURRENT STATE OF CULTURAL HERITAGE DATA

Humans Browsing the Web

•  Crystal Bridges Museum of American Art
•  Dallas Museum of Art
•  Indianapolis Museum of Art
•  The Metropolitan Museum of Art
•  National Portrait Gallery
•  Smithsonian American Art Museum

WHAT WE SEE

WHAT THE COMPUTER SEES

blah blah blah blah blah blah blah blah blah blah …

WEB PAGES ARE UNUSABLE FOR CREATING INNOVATIVE APPLICATIONS USING THE DATA

SOLUTION: Linked Open Data

“web pages for computers”

using W3C standards for publishing data

Tim Berners-Lee on Linked Open Data

http://youtu.be/OM6XIICm_qo


RAW DATA NOW

Publish Your Raw Data

•  Crystal Bridges Museum of American Art
•  Dallas Museum of Art
•  Indianapolis Museum of Art
•  The Metropolitan Museum of Art
•  National Portrait Gallery
•  Smithsonian American Art Museum

Examples of Raw Data Now

•  https://github.com/cooperhewitt/collection
•  https://github.com/IMAmuseum/ima-collection

Convert Data to CRM (2 star)

•  Crystal Bridges Museum of American Art
•  Dallas Museum of Art
•  Indianapolis Museum of Art
•  The Metropolitan Museum of Art
•  National Portrait Gallery
•  Smithsonian American Art Museum

Linked Museum Data (3 star)

•  Crystal Bridges Museum of American Art
•  Dallas Museum of Art
•  Indianapolis Museum of Art
•  The Metropolitan Museum of Art
•  National Portrait Gallery
•  Smithsonian American Art Museum

Linked Cultural Heritage Data (4 star)

Represent Resources Using URIs

•  http://szekelys.com/family#pedro
•  "Pedro"
•  http://xmlns.com/foaf/0.1/firstName

Represent Information as Triples

•  Subject: the resource being described
   http://szekelys.com/family#pedro
•  Predicate: a property of the resource
   http://xmlns.com/foaf/0.1/firstName
•  Object: the value of the property
   "Pedro"
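A minimal sketch of the triple from this slide as a Python value. The `Triple` class and its field names are illustrative choices, not part of any RDF library; only the two URIs and the literal come from the slide.

```python
from typing import NamedTuple


class Triple(NamedTuple):
    """One RDF statement, using the three roles named on the slide."""
    subject: str    # the resource being described
    predicate: str  # a property of the resource
    obj: str        # the value of the property


pedro_first_name = Triple(
    subject="http://szekelys.com/family#pedro",
    predicate="http://xmlns.com/foaf/0.1/firstName",
    obj="Pedro",
)

# Each part can be read back by role.
print(pedro_first_name.subject)
print(pedro_first_name.predicate)
print(pedro_first_name.obj)
```

Using URIs for the subject and predicate is what lets statements from different publishers refer to the same resource and the same property.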

RDF Graphs

•  http://szekelys.com/family#pedro  foaf:firstName  "Pedro"
•  http://szekelys.com/family#pedro  rdf:type  foaf:Person
•  http://szekelys.com/family#pedro  foaf:homepage  http://isi.edu/~szekely
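The graph on this slide can be sketched as a set of triples. This is a plain-Python illustration, not an RDF library: the helper names `expand` and `describe` are hypothetical, and the edge directions are my reading of the flattened diagram.

```python
from collections import defaultdict

# Standard namespace URIs for the two prefixes used on the slide.
PREFIXES = {
    "foaf": "http://xmlns.com/foaf/0.1/",
    "rdf": "http://www.w3.org/1999/02/22-rdf-syntax-ns#",
}


def expand(curie):
    """Turn a prefixed name such as 'foaf:Person' into a full URI."""
    prefix, local = curie.split(":", 1)
    return PREFIXES[prefix] + local


PEDRO = "http://szekelys.com/family#pedro"

# A graph is just a set of (subject, predicate, object) triples.
graph = {
    (PEDRO, expand("foaf:firstName"), "Pedro"),
    (PEDRO, expand("rdf:type"), expand("foaf:Person")),
    (PEDRO, expand("foaf:homepage"), "http://isi.edu/~szekely"),
}


def describe(graph, subject):
    """Collect every property of one resource, keeping multiple values."""
    properties = defaultdict(list)
    for s, p, o in graph:
        if s == subject:
            properties[p].append(o)
    return {p: sorted(values) for p, values in properties.items()}


properties = describe(graph, PEDRO)
for predicate in sorted(properties):
    print(predicate, properties[predicate])
```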

Linked Open Data

Steps to Create Linked Open Data

•  Publish the raw data … get the data out of the proprietary database
•  Select ontologies … that define classes and properties for our data
•  Define URI scheme … identifiers of your resources
•  Convert data to RDF … from data sources to the ontologies
•  Identify links to other Linked Data datasets … aka reconciliation, entity resolution, …

CIDOC CRM

•  Select ontologies … that define classes and properties for our data

http://www.cidoc-crm.org/

•  Define URI scheme … identifiers of your resources

http://edan.si.edu/saam/person-institution/8
http://edan.si.edu/saam/person-institution/8/id
http://edan.si.edu/saam/person-institution/8/appellation/displayname
http://edan.si.edu/saam/object/12
http://edan.si.edu/saam/object/12/title
http://edan.si.edu/saam/object/12/id
http://edan.si.edu/saam/object/12/acquisition
http://edan.si.edu/saam/object/12/production
http://edan.si.edu/saam/object/12/production/date
http://edan.si.edu/saam/thesauri/nationality/American
http://edan.si.edu/saam/thesauri/classification/Photography
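The example URIs above follow a regular pattern, which can be sketched as a few small builder functions. The function names are hypothetical; only the base URL and the path segments are taken from the slide.

```python
from urllib.parse import quote

BASE = "http://edan.si.edu/saam"


def object_uri(object_id, *parts):
    """URI for an object, or for a sub-resource such as its title."""
    return "/".join([BASE, "object", str(object_id), *parts])


def person_uri(person_id, *parts):
    """URI for a person or institution, or for one of its sub-resources."""
    return "/".join([BASE, "person-institution", str(person_id), *parts])


def thesaurus_uri(scheme, term):
    """URI for a controlled-vocabulary term, percent-encoding the term."""
    return "/".join([BASE, "thesauri", scheme, quote(term, safe="")])


print(object_uri(12))
print(object_uri(12, "production", "date"))
print(person_uri(8, "appellation", "displayname"))
print(thesaurus_uri("classification", "Photography"))
```

Deriving every identifier from a stable record id keeps the URIs reproducible each time the data is regenerated.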

•  Convert data to RDF … from data sources to the ontologies
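As a rough sketch of what "convert data to RDF" means for one record, the code below turns a CSV row into triples. Everything about the input is assumed: the column names and the sample title are hypothetical, and the CRM class and property names (E22_Man-Made_Object, P102_has_title, E35_Title) and namespace are my reading of CIDOC CRM, not a mapping shown in the slides.

```python
import csv
import io

CRM = "http://www.cidoc-crm.org/cidoc-crm/"  # assumed CRM namespace
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
RDFS_LABEL = "http://www.w3.org/2000/01/rdf-schema#label"
OBJECT_BASE = "http://edan.si.edu/saam/object/"

# Hypothetical raw export: one row per object.
RAW = "id,title\n12,Untitled Study\n"


def row_to_triples(row):
    """Map one source row onto ontology classes and properties."""
    obj = OBJECT_BASE + row["id"]
    title = obj + "/title"
    return [
        (obj, RDF_TYPE, CRM + "E22_Man-Made_Object"),
        (obj, CRM + "P102_has_title", title),
        (title, RDF_TYPE, CRM + "E35_Title"),
        (title, RDFS_LABEL, row["title"]),
    ]


triples = [
    triple
    for row in csv.DictReader(io.StringIO(RAW))
    for triple in row_to_triples(row)
]

for triple in triples:
    print(triple)
```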

RDF Mapping Tools

•  custom code
   Shortcomings: labor intensive, error prone
   Benefits: flexible
•  R2RML
   Shortcomings: difficult to learn, only SQL databases
   Benefits: W3C standard, good documentation, multiple vendors
•  Open Refine
   Shortcomings: no guidance, only tabular data
   Benefits: graphical user interface, support for reconciliation, open source
•  Karma
   Shortcomings: university product
   Benefits: easy to use, flexible, multiple data formats, multiple deployment databases, scalable, R2RML compatible, open source

Karma

Interactive tool for rapidly extracting, cleaning, transforming, integrating & publishing linked data in multiple formats

[Diagram labels: XML/JSON, Services, SQL/CSV, Ontology, BigData, RDF, JSON]

KARMA DEMO

http://youtu.be/h3_yiBhAJIc

Easy To Use

easy to use • flexible • multiple data formats • multiple deployment databases • scalable • R2RML compatible • open source

CLEAR DEPICTION OF MAPPING

LEARNS TO MAP YOUR DATA

SUGGEST CORRECT ADJUSTMENTS

EMBEDDED PYTHON SCRIPTING
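The slides do not show Karma's scripting interface, so the following is written as an ordinary Python function rather than against any Karma API. It sketches the kind of cleaning step such a script might perform: splitting a combined life-dates value such as "1856-1925" into two years. The function name is hypothetical.

```python
import re


def split_life_dates(value):
    """Split a 'birth-death' string into two integer years.

    Returns (None, None) when the value does not have that shape.
    """
    match = re.fullmatch(r"\s*(\d{4})\s*-\s*(\d{4})\s*", value)
    if match is None:
        return None, None
    return int(match.group(1)), int(match.group(2))


birth, death = split_life_dates("1856-1925")
print(birth, death)
```

Returning a pair of `None` values for unparseable input leaves the decision about bad records to the caller instead of guessing.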

IMPORT POPULAR DATA FORMATS

OUTPUT RDF IN MULTIPLE FORMATS

•  ntriples
•  JSON
•  AVRO
•  SPARQL
•  ElasticSearch, GitHub, …
•  Hadoop, BigData
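To make the first of these formats concrete, here is a simplified N-Triples writer in plain Python. It is an illustration only, not Karma's exporter: it handles URIs and plain string literals, and it omits language tags, datatypes and blank nodes.

```python
def nt_term(value, is_literal=False):
    """Format one term: <uri> for resources, "text" for plain literals."""
    if is_literal:
        escaped = (
            value.replace("\\", "\\\\")
            .replace('"', '\\"')
            .replace("\n", "\\n")
            .replace("\r", "\\r")
        )
        return '"' + escaped + '"'
    return "<" + value + ">"


def to_ntriples(triples):
    """Serialize (subject, predicate, object, object_is_literal) tuples."""
    lines = [
        f"{nt_term(s)} {nt_term(p)} {nt_term(o, literal)} ."
        for s, p, o, literal in triples
    ]
    return "\n".join(lines) + "\n"


output = to_ntriples([
    ("http://szekelys.com/family#pedro",
     "http://xmlns.com/foaf/0.1/firstName",
     "Pedro",
     True),
])
print(output, end="")
```

One statement per line is what makes the format easy to split, stream and load in bulk.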

40 million documents, 1 billion triples

larger than all AAC museums combined

•  periodic update: every hour, every day
•  continuous update: as new records come in

Karma compatible with R2RML tools

Karma Is Open Source

URI RECONCILIATION

Multiple "John Singer Sargent"

ima:Singer_Sargent_John a aac:Person ;
    dct:date "1856-1925" ;
    foaf:name "John Singer Sargent" .

saam:person_4253 a aac:Person ;
    saam:associatedPlace saam:SaamPlace_1357324439768t1r13950_0, saam:SaamPlace_1357324439768t1r13951_0 ;
    saam:constituentId "4253" ;
    rdaGr2:biographicalInformation "Painter. Sargent traveled …" ;
    rdaGr2:dateAssociatedWithThePerson "1990-10-1", "1995-5-8" ;
    rdaGr2:dateOfBirth "1856-1-12" ;
    rdaGr2:dateOfDeath "1925-4-15" ;
    rdaGr2:placeOfBirth saam:SaamPlace_1357324439768t1r13952_0 ;
    rdaGr2:placeOfDeath saam:SaamPlace_1357324439768t1r13953_0 ;
    skos:altLabel "John S. Sargent" ;
    skos:prefLabel "John Singer Sargent" .

cb:12_4567 a aac:Person ;
    ont0:dateOfBirth "1879", "1885" ;
    ont0:dateOfDeath "1925" ;
    skos:prefLabel "John Singer Sargent" .

met:person_1893_3819 a aac:Person ;
    ont0:placeOfResidence "North and Central America", "United States" ;
    foaf:name "John Singer Sargent" .

dma:person_John_Singer_Sargent a aac:Person ;
    ont0:dateOfBirth "1856" ;
    ont0:dateOfDeath "1925" ;
    foaf:name "John Singer Sargent" .
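The records on this slide show why reconciliation is not just string matching. The sketch below applies a deliberately naive rule (same name, and no conflicting birth or death year) to four of them. It is not the method Karma uses, which the slides do not describe; the dictionary layout and function names are hypothetical, while the names and dates are copied from the slide.

```python
def year(value):
    """Take the year from values such as '1856' or '1856-1-12'."""
    return int(value.split("-")[0])


def same_person(a, b):
    """Naive rule: equal names and no conflicting birth or death year."""
    if a["name"].casefold() != b["name"].casefold():
        return False
    for key in ("birth", "death"):
        years_a = {year(v) for v in a.get(key, [])}
        years_b = {year(v) for v in b.get(key, [])}
        # A conflict needs both records to state the date and share no year.
        if years_a and years_b and not (years_a & years_b):
            return False
    return True


saam = {"name": "John Singer Sargent", "birth": ["1856-1-12"], "death": ["1925-4-15"]}
dma = {"name": "John Singer Sargent", "birth": ["1856"], "death": ["1925"]}
cb = {"name": "John Singer Sargent", "birth": ["1879", "1885"], "death": ["1925"]}
met = {"name": "John Singer Sargent"}

print(same_person(saam, dma))  # dates agree at year level
print(same_person(saam, cb))   # birth years disagree in the source data
print(same_person(saam, met))  # no dates to compare, the name alone decides
```

The Crystal Bridges record is rejected by this rule because its birth years differ from the others, which illustrates how errors in source data complicate automatic linking.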

John Singer Sargent

ima:SaamPerson_John_Singer_Sargent a aac:Person ;
    dct:date "1856-1925" ;
    foaf:name "John Singer Sargent" .

aac:Person_4253 a aac:Person ;
    saam:associatedPlace saam:SaamPlace_1357324439768t1r13950_0, saam:SaamPlace_1357324439768t1r13951_0 ;
    saam:constituentId "4253" ;
    rdaGr2:biographicalInformation "Painter. Sargent traveled …" ;
    rdaGr2:dateAssociatedWithThePerson "1990-10-1", "1995-5-8" ;
    rdaGr2:dateOfBirth "1856-1-12" ;
    rdaGr2:dateOfDeath "1925-4-15" ;
    rdaGr2:placeOfBirth saam:SaamPlace_1357324439768t1r13952_0 ;
    rdaGr2:placeOfDeath saam:SaamPlace_1357324439768t1r13953_0 ;
    skos:altLabel "John S. Sargent" ;
    skos:prefLabel "John Singer Sargent" .

cb:SaamPerson_John_Singer_Sargent a aac:Person ;
    ont0:dateOfBirth "1879", "1885" ;
    ont0:dateOfDeath "1925" ;
    skos:prefLabel "John Singer Sargent" .

met:SaamPerson_John_Singer_Sargent a aac:Person ;
    ont0:placeOfResidence "North and Central America", "United States" ;
    foaf:name "John Singer Sargent" .

dallas:SaamPerson_John_Singer_Sargent a aac:Person ;
    ont0:dateOfBirth "1856" ;
    ont0:dateOfDeath "1925" ;
    foaf:name "John Singer Sargent" .

Reconciled "John Singer Sargent" URIs

saam:person_4253
    owl:sameAs cb:12_4567 ;
    owl:sameAs dma:person_John_Singer_Sargent ;
    owl:sameAs ima:Singer_Sargent_John ;
    owl:sameAs met:SaamPerson_John_Singer_Sargent ;
    owl:sameAs dbpedia:John_Singer_Sargent ;
    owl:sameAs nytimes/N49129220686803623753 ;
    owl:sameAs w-flick/John_Singer_Sargent ;
    ....
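Because owl:sameAs is symmetric and transitive, a set of such links partitions URIs into groups that each denote one entity. The sketch below computes those groups with a small union-find. The function name is hypothetical, the first four links are taken from the slide, and the `ex:` pair is an invented extra that only shows an unrelated cluster staying separate.

```python
def cluster(same_as_pairs):
    """Group URIs connected by sameAs links, directly or through a chain."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for a, b in same_as_pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    groups = {}
    for node in list(parent):
        groups.setdefault(find(node), []).append(node)
    return sorted(sorted(members) for members in groups.values())


links = [
    ("saam:person_4253", "cb:12_4567"),
    ("saam:person_4253", "dma:person_John_Singer_Sargent"),
    ("saam:person_4253", "ima:Singer_Sargent_John"),
    ("saam:person_4253", "dbpedia:John_Singer_Sargent"),
    ("ex:other_artist_a", "ex:other_artist_b"),  # hypothetical, unrelated pair
]

clusters = cluster(links)
for members in clusters:
    print(members)
```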

URI Reconciliation In Karma


Results of Automatic Linking

•  99% are correct
•  6% are missing
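One way to read these two figures, and it is only an assumption since the slide does not define them, is as precision (99% of the links produced are correct) and recall (6% of the true links are missing, so 94% are found). Under that reading the combined F1 score comes out at roughly 0.96:

```python
def f1(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


precision = 0.99      # assumed meaning of "99% are correct"
recall = 1 - 0.06     # assumed meaning of "6% are missing"
score = f1(precision, recall)

print(f"precision={precision:.2f} recall={recall:.2f} F1={score:.3f}")
```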


TMS to CRM easy?

NO

COMMUNITY EFFORT

•  Publish the raw data … get the data out of the proprietary database
•  Select ontologies … that define classes and properties for our data
•  Define URI scheme … identifiers of your resources
•  Convert data to RDF … from data sources to the ontologies
•  Identify links to other Linked Data datasets … aka reconciliation, entity resolution, …

Radical Ideas

•  ULAN in Wikipedia or Wikidata
•  ULAN in GitHub
•  Collection data in GitHub
•  Community created CRM mappings in GitHub
•  CRM in JSON-LD in GitHub
•  Tools to export from TMS to GitHub

STORING AND MAINTAINING THE DATA

Deployment Options

•  SPARQL endpoint
   Shortcomings: low reliability, esoteric, slow
   Benefits: sophisticated query language
•  RDF dump
   Shortcomings: no query capability, esoteric
   Benefits: flexibility (clients can download and use in applications), easy to publish
•  JSON-LD + ElasticSearch
   Shortcomings: restricted query language
   Benefits: very high performance, mainstream technology, easy to publish

Karma supports the three options
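To illustrate the third option, here is a minimal JSON-LD document built with Python's standard `json` module, reusing the example resource from the earlier slides. It does not call ElasticSearch or Karma; it only shows that the same triples can be carried as ordinary JSON, which is what makes them usable with mainstream document stores.

```python
import json

# The same three statements about one resource, expressed as JSON-LD.
document = {
    "@context": {"foaf": "http://xmlns.com/foaf/0.1/"},
    "@id": "http://szekelys.com/family#pedro",
    "@type": "foaf:Person",
    "foaf:firstName": "Pedro",
    "foaf:homepage": {"@id": "http://isi.edu/~szekely"},
}

text = json.dumps(document, indent=2, sort_keys=True)
print(text)

# Any JSON consumer can read it back without knowing about RDF.
loaded = json.loads(text)
print(loaded["foaf:firstName"])
```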

federation: everyone publishes their data with their own URIs

aggregation: an aggregator republishes everyone's data with new URIs

thanks for your attention!

https://github.com/usc-isi-i2/Web-Karma

Open Source, Apache 2 License