Linking the Data: Building Effective Authority and Identity Lookup Huda Khan and E. Lynette Rayle Cornell University Collaborators: Dave Eichmann (University of Iowa) Simeon Warner and Dean Krafft (Cornell) December 6, 2017 Linked Data for Libraries - Labs


Overview

• Background and Motivation
• Examples
  • VitroLib
  • Hyrax
• Architecture overview
• Future work
• Questions

Background

• Mellon Foundation-funded LD4 Projects
  • Transition library systems to linked data
• Link better, explore better
  • Flat record -> Discrete entities with well-defined relationships
  • String identifiers -> URIs
  • Relationships with other linked data

Background

[Figure: the example "Made in America" (1980, Blues Brothers) shown as strings in a MARC record and name authority file, and then as BIBFRAME / bibliotek-o entities with URIs: Work, Instance, Agent, RWO.]

Background

"A cataloger is an individual responsible for the processes of description, subject analysis, classification, and authority control of library materials. Catalogers serve as the 'foundation of all library service, as they are the ones who organize information in such a way as to make it easily accessible'." (Emphasis mine)

From https://en.wikipedia.org/wiki/Cataloging

Background

• Traditional practices: Authority File
  • E.g., Name Authority Files, Subject Headings, Genre Forms from LOC
  • String as unique identifier, e.g., "Mark Twain"
• Tasks and workflows
  • Identification, "Aboutness"
  • Disambiguation
  • Context and original authority record

Background

• Goals: Design and architecture around accessing authorities
• VitroLib
  • Prototype cataloging editor
  • Creates/uses linked data
  • Enables lookup and use of authorities
• Hyrax
  • Samvera technology stack
  • Incorporate authorities into institutional repository records

VitroLib Demo

What just happened?

Components shown: Questioning Authority; MAGIC (To Be Explained); VitroLib Search Service; LOC Genre Forms

• Search LOC Genre Form data: Query = animation
• Translate to QA Service Request

"uri": "http://id.loc.gov/gf2011026141",
"label": "Clay animation television programs",
"context":
  "Alternate Label": [
    "Claymation television programs",
    "Sculptmation television programs"
  ] ...

"uri": "http://id.loc.gov/gf2011026141",
"label": "Clay animation television programs",
"altLabelList": [
  "Claymation television programs",
  "Sculptmation television programs"
] ...
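The two result shapes above differ only in where the alternate labels live. As a minimal sketch, assuming the first shape (with `context`) is what the QA service returns and the second (with `altLabelList`) is what the VitroLib search service passes on, the translation could look like this. The function name `to_vitrolib_result` is hypothetical, and the URI punctuation is reconstructed from the slide.

```python
from typing import Any, Dict


def to_vitrolib_result(qa_result: Dict[str, Any]) -> Dict[str, Any]:
    """Flatten a QA-style result into the altLabelList shape from the slide."""
    context = qa_result.get("context", {})
    return {
        "uri": qa_result["uri"],
        "label": qa_result["label"],
        # "Alternate Label" is the context key shown on the slide;
        # a result without context yields an empty list.
        "altLabelList": list(context.get("Alternate Label", [])),
    }


# Example input mirroring the slide (assumed to be the QA response shape).
qa_result = {
    "uri": "http://id.loc.gov/gf2011026141",
    "label": "Clay animation television programs",
    "context": {
        "Alternate Label": [
            "Claymation television programs",
            "Sculptmation television programs",
        ]
    },
}

if __name__ == "__main__":
    print(to_vitrolib_result(qa_result))
```

Keeping the translation in one small function means the UI only ever sees one result shape, whatever context keys an authority configuration exposes.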

Hyrax Demo

Autocomplete: Saving String and URI

Authority: OCLC FAST; Subauthority: PersonName

Selected String and URI

Saves both string and URI

Selecting a Term using Lookup with Context

Getting more from the same authority

Getting more from other authorities

Architecture

Technical Motivation

• Linked data provides...
  • URIs that identify specific terms (as opposed to the ambiguity of using strings)
  • Reconciliation to relate terms that are defined in separate authorities
• Goals of implementation...
  • Provide a single process to access many authorities
  • Provide efficient and reliable access to authorities
  • Provide a means for disambiguation that empowers library staff to make the most accurate selections

First Set of Challenges

1. Finding Documentation
2. Linked Data Access API, e.g., no support, partial support, requires login credentials, SPARQL query endpoint only
3. Varying Results Formats, e.g., rdf-xml, json-ld, turtle, n-triples, etc.
4. Varying Ontologies, e.g., SKOS, schema.org, madsrdf, dbpedia, geonames

Multi-Server Architecture

[Diagram, built up over several slides]

• Hyrax/VitroLib - UI for selecting an entry from an authority
• QA - normalize RDF returned from an authority
• Direct Access of External Authority

1. The UI sends a search request to QA:

http://localhost:3000/qa/search/linked_data/oclc_fast/personal_name?q=twain&maximumRecords=2

2. QA queries the external authority:

http://experimental.worldcat.org/fast/search?query=oclc.personalName+%22twain%22&sortKeys=usage&maximumRecords=2

3. The authority returns RDF:

<http://id.worldcat.org/fast/31622>
  a schema:Person ;
  dcterms:identifier "31622" ;
  skos:prefLabel "Twain, Mark, 1835-1910" ;
  skos:altLabel "Make Teviin, 1835-1910", "Make Tuwen, 1835-1910" .

<http://id.worldcat.org/fast/365563>
  a schema:Person ;
  dcterms:identifier "365563" ;
  skos:prefLabel "Twain, Shania" ;
  skos:altLabel "Twain, Eilleen", "Edwards, Eilleen" .

4. QA returns normalized results to the UI:

[{"uri":"http://id.worldcat.org/fast/31622","id":"31622","label":"Twain, Mark, 1835-1910"},
 {"uri":"http://id.worldcat.org/fast/365563","id":"365563","label":"Twain, Shania"}]
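The normalization step, reducing an authority's RDF to `uri` / `id` / `label` entries, can be illustrated without an RDF library. This is only a sketch: the triples are written out by hand as tuples (a real setup would parse the authority's response), the function name `normalize` is hypothetical, and the choice of `dcterms:identifier` and `skos:prefLabel` as the id and label predicates is taken from the FAST example above and would differ per authority.

```python
import json

SKOS_PREF_LABEL = "http://www.w3.org/2004/02/skos/core#prefLabel"
SKOS_ALT_LABEL = "http://www.w3.org/2004/02/skos/core#altLabel"
DCTERMS_IDENTIFIER = "http://purl.org/dc/terms/identifier"

# (subject, predicate, object) tuples standing in for parsed RDF.
triples = [
    ("http://id.worldcat.org/fast/31622", DCTERMS_IDENTIFIER, "31622"),
    ("http://id.worldcat.org/fast/31622", SKOS_PREF_LABEL, "Twain, Mark, 1835-1910"),
    ("http://id.worldcat.org/fast/31622", SKOS_ALT_LABEL, "Make Teviin, 1835-1910"),
    ("http://id.worldcat.org/fast/365563", DCTERMS_IDENTIFIER, "365563"),
    ("http://id.worldcat.org/fast/365563", SKOS_PREF_LABEL, "Twain, Shania"),
    ("http://id.worldcat.org/fast/365563", SKOS_ALT_LABEL, "Twain, Eilleen"),
]


def normalize(triples, id_predicate=DCTERMS_IDENTIFIER, label_predicate=SKOS_PREF_LABEL):
    """Collapse triples into one {uri, id, label} entry per subject."""
    results = {}
    for subject, predicate, obj in triples:
        entry = results.setdefault(subject, {"uri": subject, "id": None, "label": None})
        # Keep the first id and the first label seen for each subject.
        if predicate == id_predicate and entry["id"] is None:
            entry["id"] = obj
        elif predicate == label_predicate and entry["label"] is None:
            entry["label"] = obj
    # Subjects without a label are dropped; order follows first appearance.
    return [entry for entry in results.values() if entry["label"] is not None]


if __name__ == "__main__":
    print(json.dumps(normalize(triples), indent=1))
```

Passing the predicates as parameters reflects the point of the normalization layer: the per-authority differences live in configuration, not in the calling UI.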

Direct Access Query API

Direct against authority...

http://experimental.worldcat.org/fast/search?query=oclc.personalName+%22twain%22&maximumRecords=2

http://api.geonames.org/search?q=ithaca&maxRows=2&username=demo&type=rdf

http://artemide.art.uniroma2.it:8081/agrovoc/rest/v1/search?query=milk&lang=en&maxhits=2

Normalized Query API

Through QA normalization layer...

http://localhost:3000/qa/search/linked_data/oclc_fast?q=twain&maxRecords=2

http://localhost:3000/qa/search/linked_data/geonames?q=ithaca&maxRecords=2

http://localhost:3000/qa/search/linked_data/agrovoc?q=milk&maxRecords=2&lang=en
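To make the contrast between the direct and normalized query APIs concrete, here is a rough sketch of how one uniform call (`authority`, `q`, `max_records`, optional `lang`) might be expanded into the three authority-specific URLs from the Direct Access Query API slide. The function name `direct_url` and the hard-coded branches are illustrative assumptions; QA itself drives this from per-authority configuration files rather than code like this.

```python
from typing import Optional
from urllib.parse import urlencode


def direct_url(authority: str, q: str, max_records: int, lang: Optional[str] = None) -> str:
    """Translate a uniform query into an authority-specific search URL."""
    if authority == "oclc_fast":
        base = "http://experimental.worldcat.org/fast/search"
        # FAST expects an index name plus a quoted phrase in one parameter.
        params = {"query": f'oclc.personalName "{q}"', "maximumRecords": max_records}
    elif authority == "geonames":
        base = "http://api.geonames.org/search"
        # "demo" is the username shown on the slide.
        params = {"q": q, "maxRows": max_records, "username": "demo", "type": "rdf"}
    elif authority == "agrovoc":
        base = "http://artemide.art.uniroma2.it:8081/agrovoc/rest/v1/search"
        params = {"query": q, "lang": lang or "en", "maxhits": max_records}
    else:
        raise ValueError(f"unknown authority: {authority}")
    # urlencode handles the percent-encoding (space -> "+", quote -> "%22").
    return f"{base}?{urlencode(params)}"


if __name__ == "__main__":
    for name, term in [("oclc_fast", "twain"), ("geonames", "ithaca"), ("agrovoc", "milk")]:
        print(direct_url(name, term, 2))
```

The caller never needs to know that the record limit is called `maximumRecords`, `maxRows`, or `maxhits`, which is the convenience the normalized API is meant to provide.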

Normalized Results

OCLC FAST:

[{"uri":"http://id.worldcat.org/fast/31622","id":"31622","label":"Twain, Mark, 1835-1910"},
 {"uri":"http://id.worldcat.org/fast/365563","id":"365563","label":"Twain, Shania"}]

GeoNames:

[{"uri":"http://sws.geonames.org/2162552","id":"http://sws.geonames.org/2162552","label":"Ithaca (AU)"},
 {"uri":"http://sws.geonames.org/4515289","id":"http://sws.geonames.org/4515289","label":"Ithaca (US)"}]

AGROVOC:

[{"uri":"http://aims.fao.org/aos/agrovoc/c_8602","id":"http://aims.fao.org/aos/agrovoc/c_8602","label":"acidophilus milk"},
 {"uri":"http://aims.fao.org/aos/agrovoc/c_16076","id":"http://aims.fao.org/aos/agrovoc/c_16076","label":"buffalo milk"}]

Second Set of Challenges

5. Reliability & Efficiency, e.g., server uptime, server load
6. Accuracy, e.g., select results based on usage data, lexical match, custom weighting, other
7. Order/Ranking, e.g., how to order a graph

Cache Server Query Process

[Diagram, built up over several slides: JSP Query API, Lucene/SOLR Index, Jena-Fuseki Triplestore. One full setup per authority.]

1. A request arrives at the JSP Query API:

http://services.ld4l.org/ld4l_services/loc_name_batch.jsp?query=ezra%20cornell&maxRecords=10

2. Lucene search for "ezra cornell". The index is built with the values of the predicates <skos:prefLabel> and <skos:altLabel>.
3. For each result: extract search rank, extract URI.
4. SPARQL query for the URI against the Jena-Fuseki triplestore.
5. Insert search rank in predicate <http://vivoweb.org/ontology/core#rank>.
6. Combine all results.
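The cache server query process can be walked through in miniature. The sketch below is not the JSP/Lucene/Fuseki implementation: a token-overlap score stands in for the Lucene search, a list filter stands in for the per-URI SPARQL query, and the `example.org` URIs and all function names are made up for illustration. Only the overall flow (search the label index, take rank and URI per hit, fetch that URI's triples, add a rank triple, combine) follows the slides.

```python
import re

PREF_LABEL = "http://www.w3.org/2004/02/skos/core#prefLabel"
ALT_LABEL = "http://www.w3.org/2004/02/skos/core#altLabel"
RANK = "http://vivoweb.org/ontology/core#rank"

# Made-up triples standing in for one authority's triplestore.
TRIPLES = [
    ("http://example.org/names/n1", PREF_LABEL, "Cornell, Ezra, 1807-1874"),
    ("http://example.org/names/n1", ALT_LABEL, "Ezra Cornell"),
    ("http://example.org/names/n2", PREF_LABEL, "Cornell University"),
    ("http://example.org/names/n3", PREF_LABEL, "Pound, Ezra, 1885-1972"),
]


def tokens(text):
    """Lowercased word tokens of a label or query."""
    return set(re.findall(r"\w+", text.lower()))


def build_index(triples):
    """Index each URI by the tokens of its prefLabel and altLabel values."""
    index = {}
    for subject, predicate, obj in triples:
        if predicate in (PREF_LABEL, ALT_LABEL):
            index.setdefault(subject, set()).update(tokens(obj))
    return index


def search(index, query, max_records):
    """Stand-in for the Lucene search: (rank, uri) pairs, best match first."""
    wanted = tokens(query)
    scored = [(len(wanted & toks), uri) for uri, toks in index.items()]
    hits = [(score, uri) for score, uri in scored if score > 0]
    # Highest score first; ties broken by URI so the order is deterministic.
    ordered = sorted(hits, key=lambda hit: (-hit[0], hit[1]))
    return [(rank, uri) for rank, (_, uri) in enumerate(ordered[:max_records], start=1)]


def describe(triples, uri):
    """Stand-in for the per-URI SPARQL query against the triplestore."""
    return [t for t in triples if t[0] == uri]


def cache_query(triples, index, query, max_records):
    """Combine each hit's triples with an added rank triple."""
    combined = []
    for rank, uri in search(index, query, max_records):
        combined.extend(describe(triples, uri))
        combined.append((uri, RANK, rank))
    return combined


if __name__ == "__main__":
    for triple in cache_query(TRIPLES, build_index(TRIPLES), "ezra cornell", 2):
        print(triple)
```

Carrying the rank as a triple is what lets a consumer restore the search order after the results have been merged into a single, unordered graph.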

UI-QA-Authority

[Diagram: Hyrax/VitroLib (UI for selecting an entry from an authority) sends requests to QA (normalize RDF returned from an authority). Other labels in the diagram: RDF of search results; Active-Triples; LDF Cache (Marmotta or Blazegraph); LDF Cache; Jena-Fuseki-Lucene Cache; Direct Access of External Authority. Note: search of cache performed via Lucene index.]

Request to QA:

http://localhost:3000/qa/search/linked_data/oclc_fast/personal_name?q=twain&maximumRecords=2

Normalized results:

[{"uri":"http://id.worldcat.org/fast/31622","id":"31622","label":"Twain, Mark, 1835-1910"},
 {"uri":"http://id.worldcat.org/fast/365563","id":"365563","label":"Twain, Shania"}]

Request to the authority:

http://experimental.worldcat.org/fast/search?query=oclc.personalName+%22twain%22&sortKeys=usage&maximumRecords=2

Third Set of Challenges

8. Disambiguation through better context, e.g., expand from just prefLabel to... prefLabel, altLabel, birth/death dates, occupation, etc.
9. Reconciliation across multiple sources, e.g., match LoC URI to OCLC FAST URI

What's next?

Addressing Architectural Challenges

• Generalize process for accessing context on the cache server and in the normalization layer
• Multi-authority search and reconciliation
• Address the need for cache refresh
• Mirrored cache servers

User Experience and Design

• User-centered Design
  • Observe, listen, learn, design, evaluate, iterate
• Iteratively design and evaluate UI for lookup/authorities with catalogers
• Search result ranking/ordering/filtering for catalogers
• Additional UI platforms, e.g., FOLIO

Questions

http://tinyurl.com/ld4l-auth-access

Appendix for Challenges 1-4

Challenge 1: Documentation

LoC: http://id.loc.gov/techcenter
  C. Harlow notes on reconciling LoC: https://github.com/cmh2166/lc-reconcile

OCLC FAST: https://www.oclc.org/developer/develop/web-services/fast-api/linked-data.en.html

GeoNames: http://www.geonames.org/export/geonames-search.html

AGROVOC: http://aims.fao.org/vest-registry/vocabularies/agrovoc-multilingual-agricultural-thesaurus
  swagger config: https://github.com/NatLibFi/Skosmos/blob/master/swagger.json

NALT: https://agclass.nal.usda.gov

DBpedia: http://wiki.dbpedia.org/OnlineAccess#1.2%20Public%20Faceted%20Web%20Service%20Interface

Challenge 2: Linked Data Access API

LoC
  for Search Query: not supported
  for Term Fetch: URI

OCLC FAST
  for Search Query: http://experimental.worldcat.org/fast/search?query={subauth}+all+%22{query}%22&sortKeys=usage&maximumRecords={maximumRecords}
  for Term Fetch: URI

GeoNames
  for Search Query: http://api.geonames.org/search?q={query}&maxRows={maxRows}&username={username}&type=rdf
  for Term Fetch: URI

AGROVOC
  for Search Query: http://artemide.art.uniroma2.it:8081/agrovoc/rest/v1/search?query={query}&lang={lang}
  for Term Fetch: http://artemide.art.uniroma2.it:8081/agrovoc/rest/v1/data?uri=http://aims.fao.org/aos/agrovoc/{term_id}

NALT
  for Search Query: http://skosmos.library.cornell.edu/rest/v1/nalt/search?query={query}&lang={lang}
  for Term Fetch: http://skosmos.library.cornell.edu/rest/v1/nalt/data?uri={term_uri}

DBpedia

Challenge 3: Varying Results Formats

             for Search Query    for Term Fetch
LoC          not supported       rdf-xml
OCLC FAST    rdf-xml             rdf-xml
GeoNames     rdf-xml             rdf-xml
AGROVOC      json-ld             rdf-xml, json-ld, turtle
NALT         json-ld             rdf-xml, json-ld, turtle
DBpedia

Challenge 4: Varying Ontologies

             Primary Ontology    Flat vs Navigation required
LoC          madsrdf, SKOS       navigation required
OCLC FAST    schema.org, SKOS    flat
GeoNames     geonames            flat, hierarchical
AGROVOC      SKOS                flat, hierarchical
NALT         SKOS                flat, hierarchical
DBpedia      dbpedia             flat

Configurations for Questioning Authority

LoC: https://github.com/ld4l-labs/linked_data_authorities/tree/master/qa_loc/config/authorities/linked_data

OCLC FAST: https://github.com/ld4l-labs/linked_data_authorities/tree/master/qa_oclcfast/config/authorities/linked_data

GeoNames: https://github.com/ld4l-labs/linked_data_authorities/tree/master/qa_geonames/config/authorities/linked_data

AGROVOC: https://github.com/ld4l-labs/linked_data_authorities/tree/master/qa_agrovoc/config/authorities/linked_data

NALT: https://github.com/ld4l-labs/linked_data_authorities/tree/master/qa_nalt/config/authorities/linked_data

DBpedia: https://github.com/ld4l-labs/linked_data_authorities/tree/master/qa_dbpedia/config/authorities/linked_data

Appendix for Challenges 5-7

Creating a Cache Server

Hardware

• 8-core, 64 GB, 3 GHz Mac Pro (late 2013), macOS Sierra (10.12.6)
• 32 TB Pegasus-2 Thunderbolt RAID configured as RAID-5

Triplestore

• Apache Jena Fuseki 2.4.0 provides SPARQL endpoint
• Apache Tomcat 9.0 runs custom web application(s)
• Apache Lucene 3.6 provides search interface

Customizations

• custom per-data-source JSP web application provides search/browse/download functionality
• custom (generic) SPARQL Tag Library provides API for web apps (available at https://github.com/eichmann/lod-utilities)
• custom (generic) Lucene Tag Library provides API for web apps

Loading a New Vocabulary

• download RDF
• if necessary, convert to n-triples (required for GeoNames data, for instance)
• use tdbloader2 to populate triplestore
• configure Fuseki server(s) with triplestore details
• create new JSP project in Eclipse
• write one or more indexer programs that populate Lucene indices, and run indexer(s)
• write search/browse/download application logic using the SPARQL and Lucene tags
• package project as war
• deploy to Apache Tomcat server(s)
• add new service to Apache HTTPD virtual host specification
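The indexer step in the list above can be approximated in a few lines. This sketch reads N-Triples lines and builds a token-to-URI inverted index over `skos:prefLabel` and `skos:altLabel` literals. It is plain Python rather than Lucene, uses no Lucene APIs, and the regular expression handles only simple literal lines without decoding escape sequences; the sample lines, `example.org` URIs, and the name `build_label_index` are assumptions made for illustration.

```python
import re
from collections import defaultdict

LABEL_PREDICATES = {
    "http://www.w3.org/2004/02/skos/core#prefLabel",
    "http://www.w3.org/2004/02/skos/core#altLabel",
}

# Matches: <subject> <predicate> "literal"[@lang | ^^<datatype>] .
NT_LITERAL = re.compile(
    r'^<([^>]+)>\s+<([^>]+)>\s+"((?:[^"\\]|\\.)*)"(?:@[\w-]+|\^\^<[^>]+>)?\s*\.\s*$'
)


def build_label_index(lines):
    """Map each lowercased label token to the set of URIs whose labels contain it."""
    index = defaultdict(set)
    for line in lines:
        match = NT_LITERAL.match(line.strip())
        # Skip non-literal triples and predicates that are not labels.
        if not match or match.group(2) not in LABEL_PREDICATES:
            continue
        uri, _, literal = match.groups()
        for token in re.findall(r"\w+", literal.lower()):
            index[token].add(uri)
    return index


# Made-up sample input in N-Triples form.
sample_lines = [
    '<http://example.org/names/n1> <http://www.w3.org/2004/02/skos/core#prefLabel> "Cornell, Ezra, 1807-1874"@en .',
    '<http://example.org/names/n1> <http://www.w3.org/2000/01/rdf-schema#seeAlso> <http://example.org/other> .',
    '<http://example.org/names/n2> <http://www.w3.org/2004/02/skos/core#altLabel> "Ezra Pound" .',
]

if __name__ == "__main__":
    index = build_label_index(sample_lines)
    print(sorted(index["ezra"]))
```

Indexing only the label predicates keeps the search index small relative to the full triplestore, which still answers the follow-up per-URI queries.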

UI Access to Cache Server

http://services.ld4l.org/ld4l_services/loc_name.jsp

Downloads

LoC: http://id.loc.gov/download (n-triples OR rdf-xml)

OCLC FAST: http://www.oclc.org/research/themes/data-science/fast/download.html (n-triples)

GeoNames: http://www.geonames.org/ontology/documentation.html (custom format - see notes for processing)

AGROVOC: https://aims-fao.atlassian.net/wiki/spaces/AGV/pages/2949126/Releases (n-triples OR rdf-xml)

NALT: https://agclass.nal.usda.gov/download.shtml (rdf-xml)

DBpedia: http://wiki.dbpedia.org/downloads-2016-04

Potential Options for Reconciliation

• VIAF for name reconciliation - we are doing some work with this
• Wikidata - I've heard that they are working on reconciliation issues, but haven't yet explored in depth
  • Intro Video (3hrs)
  • API Access
  • SPARQL - user manual
  • federated queries with other authorities

Doing a Google search for "linked data reconciliation" returns a large number of articles and presentations on this concept.

Links to Code & More

• QA Server - Code for a small app that provides the Questioning Authority normalization layer
• Linked Data Authorities - Configurations that can be used with QA Server
• LD4L Services - UI access to Cache Server
• VitroLib - Code for the VitroLib cataloging tool

Page 2: Linking the Data: Building effective Authority and …Linking the Data: Building Effective Authority and Identity Lookup Huda Khan and E. Lynette Rayle Cornell University Collaborators:

Overview

bull Background and Motivation

bull Examples

bull VitroLib

bull Hyrax

bull Architecture overview

bull Future work

bull Questions

Background

bull Mellon Foundation-funded LD4 Projects

bull Transition library systems to linked data

bull Link better explore better

bull Flat record -gt Discrete entities with well-defined relationships

bull String identifiers -gt URIs

bull Relationships with other linked data

Background

4

Made in

America

1980

Made in

America

Blues

Brothers

Made in

America 1980 Blues

Brothers

Blues

Brothers

MARC

RECORD

NAME

AUTH

FILE

WORK

INSTANCE

AGENT

RWO

BIBFRAME

BIBLIOTEK-O

ENTITIES

WITH URIS

Background

ldquoA cataloger is an individual responsible for the processes of description subject analysis classification and authority control of library materials Catalogers serve as the lsquofoundation of all library service as they are the ones who organize information in such a way as to make it easily accessiblersquordquo (Emphasis mine)

From httpsenwikipediaorgwikiCataloging

Background

bull Traditional practices Authority File

bull Eg Name Authority Files Subject Headings Genre Forms from LOC

bull String as unique identifier eg ldquoMark Twainrdquo

bull Tasks and workflows

bull Identification ldquoAboutnessrdquo

bull Disambiguation

bull Context and original authority record

Background

bull Goals Design and architecture around accessing authorities

bull VitroLib

bull Prototype cataloging editor

bull Createsuses linked data

bull Enables lookup and use of authorities

bull Hyrax

bull Samvera technology stack

bull Incorporate authorities into institutional repository records

8

VitroLib Demo

9

What just happened

Questioning Authority

MAGIC (To Be Explained)

VitroLib Search Service

LOC Genre Forms

Search LOC Genre Form

data

Query = animation

Translate to QA Service

Request

urihttpidlocgovgf2011026141

label ldquoClay animation television programsrdquo

context

ldquoAlternate Labelrdquo [

ldquoClaymation television programsrdquo

ldquoSculptmation television programsrdquo

] hellip

urihttpidlocgovgf2011026141

label ldquoClay animation television programsrdquo

altLabelList [

ldquoClaymation television programsrdquo

ldquoSculptmation television programsrdquo

] hellip

23

Hyrax Demo

Autocomplete Saving String and URI

Authority OCLC FAST Subauthority PersonName

Selected String and URI

Saves both string and URI

Selecting a Term using

Lookup with Context

26

Selecting a Term using

Lookup with Context

27

Getting more from the same authority

Getting more from other authorities

30

Architecture

Technical Motivation

bull Linked data provideshellip

bull URIs that identify specific terms (as opposed to ambiguity of using

strings)

bull Reconciliation to relate terms that are defined in separate authorities

bull Goals of implementationhellip

bull Provide a single process to access many authorities

bull Provide efficient and reliable access to authorities

bull Provide a means for disambiguation that empowers library staff to

make the most accurate selections

First Set of Challenges

1 Finding Documentation

2 Linked Data Access API eg no support partial support requires login credentials

sparql query endpoint only

3 Varying Results Formats eg rdf-xml json-ld turtle n-triples etc

4 Varying Ontologies eg SKOS schemaorg madsrdf dbpedia geonames

Multi-Server Architecture

QA ndash normalize RDF returned from an authority

Multi-Server Architecture

QA ndash normalize RDF returned from an authority

Direct Access

of External

Authority

HyraxVitrolib ndash UI for selecting an entry from an authority

Multi-Server Architecture

QA ndash normalize RDF returned from an authority

httplocalhost3000qasearchlinked_data

oclc_fastpersonal_nameq=twainamp

maximumRecords=2

Direct Access

of External

Authority

HyraxVitrolib ndash UI for selecting an entry from an authority

Multi-Server Architecture

QA ndash normalize RDF returned from an authority

httplocalhost3000qasearchlinked_data

oclc_fastpersonal_nameq=twainamp

maximumRecords=2

httpexperimentalworldcatorgfast

searchquery=oclcpersonalName+22twain22

ampsortKeys=usageampmaximumRecords=2

Direct Access

of External

Authority

HyraxVitrolib ndash UI for selecting an entry from an authority

Multi-Server Architecture

QA ndash normalize RDF returned from an authority

httplocalhost3000qasearchlinked_data

oclc_fastpersonal_nameq=twainamp

maximumRecords=2

httpexperimentalworldcatorgfast

searchquery=oclcpersonalName+22twain22

ampsortKeys=usageampmaximumRecords=2

lthttpidworldcatorgfast31622gt

a schemaPerson

dctermsidentifier 31622

skosprefLabel Twain Mark 1835-1910

skosaltLabel Make Teviin 1835-1910

Make Tuwen 1835-1910

lthttpidworldcatorgfast365563gt

a schemaPerson

dctermsidentifier 365563

skosprefLabel Twain Shania

skosaltLabel Twain Eilleen

Edwards Eilleen

Direct Access

of External

Authority

HyraxVitrolib ndash UI for selecting an entry from an authority

Multi-Server Architecture

QA ndash normalize RDF returned from an authority

httplocalhost3000qasearchlinked_data

oclc_fastpersonal_nameq=twainamp

maximumRecords=2

[urihttpidworldcatorgfast31622

id31622 labelTwain Mark 1835-1910

urihttpidworldcatorgfast365563

id365563labelTwain Shania ]

httpexperimentalworldcatorgfast

searchquery=oclcpersonalName+22twain22

ampsortKeys=usageampmaximumRecords=2

lthttpidworldcatorgfast31622gt

a schemaPerson

dctermsidentifier 31622

skosprefLabel Twain Mark 1835-1910

skosaltLabel Make Teviin 1835-1910

Make Tuwen 1835-1910

lthttpidworldcatorgfast365563gt

a schemaPerson

dctermsidentifier 365563

skosprefLabel Twain Shania

skosaltLabel Twain Eilleen

Edwards Eilleen

Direct Access

of External

Authority

HyraxVitrolib ndash UI for selecting an entry from an authority

Direct Access Query API

Direct against authorityhellip

httpexperimentalworldcatorgfastsearch

query=oclcpersonalName+22twain22

ampmaximumRecords=2

httpapigeonamesorgsearch

q=ithaca

ampmaxRows=2

ampusername=demo

amptype=rdf

httpartemideartuniroma2it8081agrovocrestv1search

query=milk

amplang=en

ampmaxhits=2

Normalized Query API

Through QA normalization layerhellip

httplocalhost3000qasearchlinked_dataoclc_fast

q=twain

ampmaxRecords=2

httplocalhost3000qasearchlinked_datageonames

q=ithaca

ampmaxRecords=2

httplocalhost3000qasearchlinked_dataagrovoc

q=milk

ampmaxRecords=2

amplang=en

Normalized Results

[urihttpidworldcatorgfast31622 id31622 labelTwain Mark 1835-1910 urihttpidworldcatorgfast365563 id365563 labelTwain Shania]

[uri httpswsgeonamesorg2162552 id httpswsgeonamesorg2162552 label Ithaca (AU) uri httpswsgeonamesorg4515289 id httpswsgeonamesorg4515289 label Ithaca (US)]

[uri httpaimsfaoorgaosagrovocc_8602 id httpaimsfaoorgaosagrovocc_8602 label acidophilus milk uri httpaimsfaoorgaosagrovocc_16076 id httpaimsfaoorgaosagrovocc_16076 label buffalo milk]

OCLC FAST GeoNames AgroVoc

Second Set of Challenges

5 Reliability amp Efficiency eg server uptime server load

6 Accuracy eg select results based on usage data lexical match

custom weighting other

7 Order Ranking eg How to order a graph

Cache Server Query Process

JSP Query API

Jena-Fuseki

Triplestore

One full setup per authority

LuceneSOLR

Index

Cache Server Query Process

JSP Query API

httpservicesld4lorgld4l_servicesloc_name_batchjspquery=ezra20cornellampmaxRecords=10

Jena-Fuseki

Triplestore

LuceneSOLR

Index

One full setup per authority

Cache Server Query Process

JSP Query API

lucene search for ezra cornell

index built with predicate values

ltskosprefLabelgt

ltskosaltLabelgt

Jena-Fuseki

Triplestore

LuceneSOLR

Index

httpservicesld4lorgld4l_servicesloc_name_batchjspquery=ezra20cornellampmaxRecords=10

One full setup per authority

Cache Server Query Process

JSP Query API

extract search rank

extract URI

Jena-Fuseki

Triplestore

for each result

LuceneSOLR

Index

lucene search for ezra cornell

index built with predicate values

ltskosprefLabelgt

ltskosaltLabelgt

httpservicesld4lorgld4l_servicesloc_name_batchjspquery=ezra20cornellampmaxRecords=10

One full setup per authority

Cache Server Query Process

JSP Query API

sparql query for URI

Jena-Fuseki

Triplestore

LuceneSOLR

Index

extract search rank

extract URI

for each result

lucene search for ezra cornell

index built with predicate values

ltskosprefLabelgt

ltskosaltLabelgt

httpservicesld4lorgld4l_servicesloc_name_batchjspquery=ezra20cornellampmaxRecords=10

One full setup per authority

Cache Server Query Process

JSP Query API

combine all results

Jena-Fuseki

Triplestore

insert search rank in predicate

lthttpvivoweborgontology

corerankgt

LuceneSOLR

Index

sparql query for URI extract search rank

extract URI

for each result

lucene search for ezra cornell

index built with predicate values

ltskosprefLabelgt

ltskosaltLabelgt

httpservicesld4lorgld4l_servicesloc_name_batchjspquery=ezra20cornellampmaxRecords=10

One full setup per authority

UI-QA-Authority

QA ndash normalize RDF returned from an authority

httplocalhost3000qasearchlinked_data

oclc_fastpersonal_nameq=twainampmaximumRecords=2

[urihttpidworldcatorgfast31622id31622

labelTwain Mark 1835-1910

urihttpidworldcatorgfast365563id365563

labelTwain Shania

httpexperimentalworldcatorgfastsearchquery=o

clcpersonalName+22twain22

ampsortKeys=usageampmaximumRecords=2

RDF of

search

results

Active-Triples

LDF Cache

(Marmotta or

Blazegraph) LDF Cache Jena-Fuseki-

Lucene

Cache

Direct Access

of External

Authority

HyraxVitrolib ndash UI for selecting an entry from an authority

search of cache performed via Lucene index

Third Set of Challenges

8 Disambiguation through better context eg expand from just prefLabel tohellip

preLabel altLabel birthdeath dates occupation etc

9 Reconciliation across multiple sources eg match LoC URI to OCLC FAST URI

53

Whatrsquos next

Addressing Architectural Challenges

bull Generalize process for accessing context on the

cache server and in the normalization layer

bull Multi-authority search and reconciliation

bull Address the need for cache refresh

bull Mirrored cache servers

User Experience and Design

bull User-centered Design

bull Observe listen learn design evaluate iterate

bull Iteratively design and evaluate UI for lookupauthorities

with catalogers

bull Search result rankingorderingfiltering for catalogers

bull Additional UI platforms eg FOLIO

56

Questions

httptinyurlcomld4l-auth-access

Appendix for Challenges 1-4

Challenge 1 Documentation

58

LoC httpidlocgovtechcenter

C Harlow notes on reconciling LoC - httpsgithubcomcmh2166lc-reconcile

OCLC FAST

httpswwwoclcorgdeveloperdevelopweb-servicesfast-apilinked-dataenhtml

GeoNames

httpwwwgeonamesorgexportgeonames-searchhtml

AGROVOC httpaimsfaoorgvest-registryvocabulariesagrovoc-multilingual-agricultural-thesaurus

swagger config httpsgithubcomNatLibFiSkosmosblobmasterswaggerjson

NALT

httpsagclassnalusdagov

DBpedia httpwikidbpediaorgOnlineAccess1220Public20Faceted20Web20Service20Inter

face

Challenge 2 Linked Data Access API

59

for Search Query for Term Fetch

LoC not supported URI

OCLC FAST httpexperimentalworldcatorgfastsearchq

uery=subauth+all+22query22ampsortK

eys=usageampmaximumRecords=maximumR

ecords

URI

GeoNames httpapigeonamesorgsearchq=queryamp

maxRows=maxRowsampusername=userna

meamptype=rdf

URI

AGROVOC httpartemideartuniroma2it8081agrovocr

estv1searchquery=queryamplang=lang

httpartemideartuniroma2it8081agrovo

crestv1datauri=httpaimsfaoorgaosa

grovocterm_id

NALT httpskosmoslibrarycornelledurestv1nalt

searchquery=queryamplang=lang

httpskosmoslibrarycornelledurestv1na

ltdatauri=term_uri

DBpedia

Challenge 3 Varying Results Formats

60

for Search Query for Term Fetch

LoC not supported rdf-xml

OCLC FAST rdf-xml rdf-xml

GeoNames rdf-xml rdf-xml

AGROVOC json-ld rdf-xml json-ld turtle

NALT json-ld rdf-xml json-ld turtle

DBpedia

Challenge 4 Varying Ontologies

61

Primary Ontology Flat vs Navigation

required

LoC madsrdf

SKOS

navigation required

OCLC FAST schemaorg

SKOS

flat

GeoNames geonames flat

hierarchical

AGROVOC SKOS flat

hierarchical

NALT SKOS flat

hierarchical

DBpedia dbpedia flat

Configurations for Questioning Authority

62

LoC httpsgithubcomld4l-

labslinked_data_authoritiestreemasterqa_locconfigauthoritieslinked_dat

a

OCLC FAST httpsgithubcomld4l-

labslinked_data_authoritiestreemasterqa_oclcfastconfigauthoritieslinked

_data

GeoNames httpsgithubcomld4l-

labslinked_data_authoritiestreemasterqa_geonamesconfigauthoritieslink

ed_data

AGROVOC httpsgithubcomld4l-

labslinked_data_authoritiestreemasterqa_agrovocconfigauthoritieslinked

_data

NALT httpsgithubcomld4l-

labslinked_data_authoritiestreemasterqa_naltconfigauthoritieslinked_dat

a

DBpedia httpsgithubcomld4l-

labslinked_data_authoritiestreemasterqa_dbpediaconfigauthoritieslinked

_data

Appendix for Challenges 5-7

Creating a Cache Server

Hardware

bull 8-core 64gb 3Ghz Mac Pro (late 2013) macOS Sierra

(10126)

bull 32tb Pegasus-2 Thunderbolt RAID configured as RAID-5

Triplestore

bull Apache Jena Fuseki 240 provides SPARQL endpoint

bull Apache Tomcat 90 runs custom web application(s)

bull Apache Lucene 36 provides search interface

64

Customizations

bull custom per-data-source JSP web application provides

searchbrowsedownload functionality



Background

• Traditional practices: Authority File
  • E.g., Name Authority Files, Subject Headings, Genre Forms from LOC
  • String as unique identifier, e.g., “Mark Twain”
• Tasks and workflows
  • Identification: “Aboutness”
  • Disambiguation
  • Context and original authority record

Background

• Goals: Design and architecture around accessing authorities
• VitroLib
  • Prototype cataloging editor
  • Creates/uses linked data
  • Enables lookup and use of authorities
• Hyrax
  • Samvera technology stack
  • Incorporate authorities into institutional repository records

VitroLib Demo


What just happened?

[Diagram: VitroLib Search Service -> Questioning Authority (MAGIC, to be explained) -> LOC Genre Forms]

• Search LOC Genre Form data; query = animation
• Translate to QA service request

  uri: http://id.loc.gov/gf2011026141
  label: “Clay animation television programs”
  context:
    “Alternate Label”: [
      “Claymation television programs”,
      “Sculptmation television programs”
    ] …

  uri: http://id.loc.gov/gf2011026141
  label: “Clay animation television programs”
  altLabelList: [
    “Claymation television programs”,
    “Sculptmation television programs”
  ] …
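The two result shapes on this slide differ only in where the alternate labels live. The following is a minimal sketch of that translation, assuming the first shape (with `context`) is what Questioning Authority returns and the second (with `altLabelList`) is what the VitroLib search service passes on. The function name `to_vitrolib_result` is hypothetical and is not taken from VitroLib's code.

```python
def to_vitrolib_result(qa_result):
    """Flatten a QA-style result into the altLabelList shape.

    Assumption: alternate labels sit under context["Alternate Label"].
    """
    context = qa_result.get("context", {})
    return {
        "uri": qa_result["uri"],
        "label": qa_result["label"],
        "altLabelList": list(context.get("Alternate Label", [])),
    }


qa_result = {
    "uri": "http://id.loc.gov/gf2011026141",  # URI as printed on the slide
    "label": "Clay animation television programs",
    "context": {
        "Alternate Label": [
            "Claymation television programs",
            "Sculptmation television programs",
        ]
    },
}

vitrolib_result = to_vitrolib_result(qa_result)
print(vitrolib_result["altLabelList"])
```

A result without any `context` simply yields an empty `altLabelList`, so the UI does not need a special case for authorities that return labels only.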

Hyrax Demo

Autocomplete: Saving String and URI

Authority: OCLC FAST; Subauthority: PersonName

Selected string and URI

Saves both string and URI

Selecting a Term Using Lookup with Context

Getting more from the same authority

Getting more from other authorities

Architecture

Technical Motivation

• Linked data provides…
  • URIs that identify specific terms (as opposed to the ambiguity of using strings)
  • Reconciliation to relate terms that are defined in separate authorities
• Goals of implementation…
  • Provide a single process to access many authorities
  • Provide efficient and reliable access to authorities
  • Provide a means for disambiguation that empowers library staff to make the most accurate selections

First Set of Challenges

1. Finding Documentation
2. Linked Data Access API, e.g., no support, partial support, requires login credentials, SPARQL query endpoint only
3. Varying Results Formats, e.g., rdf-xml, json-ld, turtle, n-triples, etc.
4. Varying Ontologies, e.g., SKOS, schema.org, madsrdf, dbpedia, geonames

Multi-Server Architecture

[Diagram: Hyrax/VitroLib (UI for selecting an entry from an authority) -> QA (normalizes RDF returned from an authority) -> direct access of external authority]

1. The UI sends a search request to QA:

   http://localhost:3000/qa/search/linked_data/oclc_fast/personal_name?q=twain&maximumRecords=2

2. QA sends the corresponding query to the external authority:

   http://experimental.worldcat.org/fast/search?query=oclc.personalName+%22twain%22&sortKeys=usage&maximumRecords=2

3. The authority returns RDF:

   <http://id.worldcat.org/fast/31622>
     a schema:Person ;
     dcterms:identifier "31622" ;
     skos:prefLabel "Twain, Mark, 1835-1910" ;
     skos:altLabel "Make Teviin, 1835-1910", "Make Tuwen, 1835-1910" .

   <http://id.worldcat.org/fast/365563>
     a schema:Person ;
     dcterms:identifier "365563" ;
     skos:prefLabel "Twain, Shania" ;
     skos:altLabel "Twain, Eilleen", "Edwards, Eilleen" .

4. QA returns normalized results to the UI:

   [{"uri":"http://id.worldcat.org/fast/31622","id":"31622","label":"Twain, Mark, 1835-1910"},
    {"uri":"http://id.worldcat.org/fast/365563","id":"365563","label":"Twain, Shania"}]
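The normalization step in this flow can be illustrated with a small sketch. It assumes the authority's RDF has already been parsed into (subject, predicate, object) tuples, and it reduces them to the uri/id/label records that QA hands to the UI. The `normalize` function and its parameters are hypothetical names for illustration, not the Questioning Authority implementation, which is configuration-driven.

```python
import json

SKOS_PREF_LABEL = "http://www.w3.org/2004/02/skos/core#prefLabel"
SKOS_ALT_LABEL = "http://www.w3.org/2004/02/skos/core#altLabel"
DCTERMS_IDENTIFIER = "http://purl.org/dc/terms/identifier"

# Parsed triples standing in for the RDF returned by OCLC FAST.
triples = [
    ("http://id.worldcat.org/fast/31622", DCTERMS_IDENTIFIER, "31622"),
    ("http://id.worldcat.org/fast/31622", SKOS_PREF_LABEL, "Twain, Mark, 1835-1910"),
    ("http://id.worldcat.org/fast/31622", SKOS_ALT_LABEL, "Make Teviin, 1835-1910"),
    ("http://id.worldcat.org/fast/365563", DCTERMS_IDENTIFIER, "365563"),
    ("http://id.worldcat.org/fast/365563", SKOS_PREF_LABEL, "Twain, Shania"),
]


def normalize(triples, id_predicate, label_predicate):
    """Reduce triples to one {uri, id, label} record per subject."""
    results = {}  # dicts preserve insertion order, so source order is kept
    for subject, predicate, obj in triples:
        entry = results.setdefault(
            subject, {"uri": subject, "id": None, "label": None}
        )
        if predicate == id_predicate:
            entry["id"] = obj
        elif predicate == label_predicate:
            entry["label"] = obj
        # any other predicate (e.g. altLabel) is ignored in this sketch
    return list(results.values())


normalized = normalize(triples, DCTERMS_IDENTIFIER, SKOS_PREF_LABEL)
print(json.dumps(normalized, indent=2))
```

Passing the id and label predicates as arguments mirrors the idea that each authority uses its own ontology, so only the predicate mapping changes from one authority to the next.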

Direct Access Query API

Direct against authority…

http://experimental.worldcat.org/fast/search
  ?query=oclc.personalName+%22twain%22
  &maximumRecords=2

http://api.geonames.org/search
  ?q=ithaca
  &maxRows=2
  &username=demo
  &type=rdf

http://artemide.art.uniroma2.it:8081/agrovoc/rest/v1/search
  ?query=milk
  &lang=en
  &maxhits=2

Normalized Query API

Through QA normalization layer…

http://localhost:3000/qa/search/linked_data/oclc_fast
  ?q=twain
  &maxRecords=2

http://localhost:3000/qa/search/linked_data/geonames
  ?q=ithaca
  &maxRecords=2

http://localhost:3000/qa/search/linked_data/agrovoc
  ?q=milk
  &maxRecords=2
  &lang=en
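The practical effect of the normalized API is that a client needs one URL builder instead of one per authority. The following is a minimal sketch, assuming a QA server at `localhost:3000` as on the slides; the helper name `qa_search_url` is hypothetical.

```python
from urllib.parse import urlencode

QA_BASE = "http://localhost:3000/qa/search/linked_data"

# For contrast, the direct APIs each name the same two ideas differently:
#   OCLC FAST: query / maximumRecords
#   GeoNames:  q     / maxRows
#   AGROVOC:   query / maxhits


def qa_search_url(authority, query, max_records=2, **extra):
    """Build a normalized QA search URL for any configured authority."""
    params = {"q": query, "maxRecords": max_records, **extra}
    return f"{QA_BASE}/{authority}?{urlencode(params)}"


for authority, query, extra in [
    ("oclc_fast", "twain", {}),
    ("geonames", "ithaca", {}),
    ("agrovoc", "milk", {"lang": "en"}),
]:
    print(qa_search_url(authority, query, **extra))
```

Authority-specific options such as `lang` pass through as extra keyword arguments, so the calling code stays the same when a new authority is added.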

Normalized Results

OCLC FAST:
[{"uri":"http://id.worldcat.org/fast/31622","id":"31622","label":"Twain, Mark, 1835-1910"},
 {"uri":"http://id.worldcat.org/fast/365563","id":"365563","label":"Twain, Shania"}]

GeoNames:
[{"uri":"http://sws.geonames.org/2162552/","id":"http://sws.geonames.org/2162552/","label":"Ithaca, (AU)"},
 {"uri":"http://sws.geonames.org/4515289/","id":"http://sws.geonames.org/4515289/","label":"Ithaca, (US)"}]

AGROVOC:
[{"uri":"http://aims.fao.org/aos/agrovoc/c_8602","id":"http://aims.fao.org/aos/agrovoc/c_8602","label":"acidophilus milk"},
 {"uri":"http://aims.fao.org/aos/agrovoc/c_16076","id":"http://aims.fao.org/aos/agrovoc/c_16076","label":"buffalo milk"}]

Second Set of Challenges

5. Reliability & Efficiency, e.g., server uptime, server load
6. Accuracy, e.g., select results based on usage data, lexical match, custom weighting, other
7. Order Ranking, e.g., how to order a graph

Cache Server Query Process

[Diagram: JSP Query API, Lucene/SOLR Index, Jena-Fuseki Triplestore. One full setup per authority.]

1. A request arrives at the JSP Query API:

   http://services.ld4l.org/ld4l_services/loc_name_batch.jsp?query=ezra%20cornell&maxRecords=10

2. Lucene search for "ezra cornell". The index is built with the values of the predicates <skos:prefLabel> and <skos:altLabel>.
3. For each result: extract the search rank and extract the URI.
4. SPARQL query for the URI against the Jena-Fuseki triplestore.
5. Insert the search rank in the predicate <http://vivoweb.org/ontology/core#rank>.
6. Combine all results.
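The query process on the cache server (label search, then per-URI fetch, then rank insertion, then combination) can be sketched with in-memory stand-ins. Everything below is an assumption made for illustration: a dictionary plays the triplestore, a token-overlap count plays Lucene scoring, the `example.org` URIs and records are invented sample data, and the function names are hypothetical. It is not the JSP implementation.

```python
import re

RANK = "http://vivoweb.org/ontology/core#rank"
PREF = "http://www.w3.org/2004/02/skos/core#prefLabel"
ALT = "http://www.w3.org/2004/02/skos/core#altLabel"

# Stand-in for the triplestore: URI -> {predicate: [values]} (sample data).
TRIPLESTORE = {
    "http://example.org/names/1": {PREF: ["Cornell, Ezra, 1807-1874"], ALT: ["Cornell, E. (Ezra)"]},
    "http://example.org/names/2": {PREF: ["Cornell University"], ALT: []},
    "http://example.org/names/3": {PREF: ["Pound, Ezra, 1885-1972"], ALT: []},
}


def tokens(text):
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def search_index(query, max_records):
    """Stand-in for the Lucene search over prefLabel/altLabel values."""
    wanted = tokens(query)
    scored = []
    for uri, props in TRIPLESTORE.items():
        label_tokens = set()
        for label in props.get(PREF, []) + props.get(ALT, []):
            label_tokens |= tokens(label)
        score = len(wanted & label_tokens)
        if score:
            scored.append((score, uri))
    scored.sort(key=lambda pair: -pair[0])  # stable sort: ties keep store order
    return [uri for _, uri in scored[:max_records]]


def fetch(uri):
    """Stand-in for the SPARQL query that retrieves one URI's triples."""
    return {predicate: list(values) for predicate, values in TRIPLESTORE[uri].items()}


def query_cache(query, max_records=10):
    combined = []
    for rank, uri in enumerate(search_index(query, max_records), start=1):
        description = fetch(uri)
        description[RANK] = [rank]  # carry the search rank as a predicate
        combined.append((uri, description))
    return combined


results = query_cache("ezra cornell")
for uri, description in results:
    print(description[RANK][0], uri, description[PREF][0])
```

Storing the rank as a predicate on each result is what lets a client re-order the returned graph, since RDF itself carries no ordering.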

UI-QA-Authority

[Diagram: Hyrax/VitroLib (UI for selecting an entry from an authority) -> QA (normalizes RDF returned from an authority) -> authority source. Diagram labels: RDF of search results; Active-Triples; LDF Cache (Marmotta or Blazegraph); Jena-Fuseki-Lucene Cache; Direct Access of External Authority.]

Request to QA:

  http://localhost:3000/qa/search/linked_data/oclc_fast/personal_name?q=twain&maximumRecords=2

Response from QA:

  [{"uri":"http://id.worldcat.org/fast/31622","id":"31622","label":"Twain, Mark, 1835-1910"},
   {"uri":"http://id.worldcat.org/fast/365563","id":"365563","label":"Twain, Shania"}]

Query to the external authority:

  http://experimental.worldcat.org/fast/search?query=oclc.personalName+%22twain%22&sortKeys=usage&maximumRecords=2

Search of cache is performed via the Lucene index.

Third Set of Challenges

8. Disambiguation through better context, e.g., expand from just prefLabel to… prefLabel, altLabel, birth/death dates, occupation, etc.
9. Reconciliation across multiple sources, e.g., match LoC URI to OCLC FAST URI

What’s next?

Addressing Architectural Challenges

• Generalize process for accessing context on the cache server and in the normalization layer
• Multi-authority search and reconciliation
• Address the need for cache refresh
• Mirrored cache servers

User Experience and Design

• User-centered design
  • Observe, listen, learn, design, evaluate, iterate
• Iteratively design and evaluate UI for lookup/authorities with catalogers
• Search result ranking/ordering/filtering for catalogers
• Additional UI platforms, e.g., FOLIO

Questions

http://tinyurl.com/ld4l-auth-access

Appendix for Challenges 1-4

Challenge 1: Documentation

LoC: http://id.loc.gov/techcenter
  C. Harlow notes on reconciling LoC: https://github.com/cmh2166/lc-reconcile
OCLC FAST: https://www.oclc.org/developer/develop/web-services/fast-api/linked-data.en.html
GeoNames: http://www.geonames.org/export/geonames-search.html
AGROVOC: http://aims.fao.org/vest-registry/vocabularies/agrovoc-multilingual-agricultural-thesaurus
  swagger config: https://github.com/NatLibFi/Skosmos/blob/master/swagger.json
NALT: https://agclass.nal.usda.gov
DBpedia: http://wiki.dbpedia.org/OnlineAccess#1.2%20Public%20Faceted%20Web%20Service%20Interface

Challenge 2: Linked Data Access API

LoC
  Search query: not supported
  Term fetch: URI
OCLC FAST
  Search query: http://experimental.worldcat.org/fast/search?query={subauth}+all+%22{query}%22&sortKeys=usage&maximumRecords={maximumRecords}
  Term fetch: URI
GeoNames
  Search query: http://api.geonames.org/search?q={query}&maxRows={maxRows}&username={username}&type=rdf
  Term fetch: URI
AGROVOC
  Search query: http://artemide.art.uniroma2.it:8081/agrovoc/rest/v1/search?query={query}&lang={lang}
  Term fetch: http://artemide.art.uniroma2.it:8081/agrovoc/rest/v1/data?uri=http://aims.fao.org/aos/agrovoc/{term_id}
NALT
  Search query: http://skosmos.library.cornell.edu/rest/v1/nalt/search?query={query}&lang={lang}
  Term fetch: http://skosmos.library.cornell.edu/rest/v1/nalt/data?uri={term_uri}
DBpedia
  (blank)

Challenge 3: Varying Results Formats

Authority    for Search Query    for Term Fetch
LoC          not supported       rdf-xml
OCLC FAST    rdf-xml             rdf-xml
GeoNames     rdf-xml             rdf-xml
AGROVOC      json-ld             rdf-xml, json-ld, turtle
NALT         json-ld             rdf-xml, json-ld, turtle
DBpedia      (blank)             (blank)

Challenge 4: Varying Ontologies

Authority    Primary Ontology     Flat vs. Navigation Required
LoC          madsrdf, SKOS        navigation required
OCLC FAST    schema.org, SKOS     flat
GeoNames     geonames             flat, hierarchical
AGROVOC      SKOS                 flat, hierarchical
NALT         SKOS                 flat, hierarchical
DBpedia      dbpedia              flat

Configurations for Questioning Authority

LoC: https://github.com/ld4l-labs/linked_data_authorities/tree/master/qa_loc/config/authorities/linked_data
OCLC FAST: https://github.com/ld4l-labs/linked_data_authorities/tree/master/qa_oclcfast/config/authorities/linked_data
GeoNames: https://github.com/ld4l-labs/linked_data_authorities/tree/master/qa_geonames/config/authorities/linked_data
AGROVOC: https://github.com/ld4l-labs/linked_data_authorities/tree/master/qa_agrovoc/config/authorities/linked_data
NALT: https://github.com/ld4l-labs/linked_data_authorities/tree/master/qa_nalt/config/authorities/linked_data
DBpedia: https://github.com/ld4l-labs/linked_data_authorities/tree/master/qa_dbpedia/config/authorities/linked_data

Appendix for Challenges 5-7

Creating a Cache Server

Hardware
• 8-core, 64 GB, 3 GHz Mac Pro (late 2013), macOS Sierra (10.12.6)
• 32 TB Pegasus-2 Thunderbolt RAID configured as RAID-5

Triplestore
• Apache Jena Fuseki 2.4.0 provides SPARQL endpoint
• Apache Tomcat 9.0 runs custom web application(s)
• Apache Lucene 3.6 provides search interface

Customizations

• custom per-data-source JSP web application provides search/browse/download functionality
• custom (generic) SPARQL Tag Library provides API for web apps (available at https://github.com/eichmann/lod-utilities)
• custom (generic) Lucene Tag Library provides API for web apps

Loading a New Vocabulary

• download RDF
• if necessary, convert to n-triples (required for GeoNames data, for instance)
• use tdbloader2 to populate triplestore
• configure Fuseki server(s) with triplestore details
• create new JSP project in Eclipse
• write one or more indexer programs that populate Lucene indices, and run indexer(s)
• write search/browse/download application logic using the SPARQL and Lucene tags
• package project as war
• deploy to Apache Tomcat server(s)
• add new service to Apache HTTPD virtual host specification
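The indexer step in this list can be sketched in a few lines. This is a pure-Python stand-in, not the Lucene indexer used on the cache server: it assumes the vocabulary is already in n-triples, reads only `skos:prefLabel` and `skos:altLabel` literals, and builds a token-to-URI lookup. The `example.org` URIs are invented sample data and `build_label_index` is a hypothetical name.

```python
import re
from collections import defaultdict

LABEL_PREDICATES = {
    "http://www.w3.org/2004/02/skos/core#prefLabel",
    "http://www.w3.org/2004/02/skos/core#altLabel",
}

# Matches simple N-Triples lines: <subject> <predicate> "literal"[@lang or ^^type] .
# Limitation of this sketch: escape sequences inside literals are not decoded.
TRIPLE = re.compile(r'^<([^>]+)>\s+<([^>]+)>\s+"((?:[^"\\]|\\.)*)"[^"]*\.\s*$')


def build_label_index(lines):
    """Map each lowercase label token to the set of URIs whose labels contain it."""
    index = defaultdict(set)
    for line in lines:
        match = TRIPLE.match(line.strip())
        if not match:
            continue  # skip comments, blank lines, and non-literal objects
        subject, predicate, literal = match.groups()
        if predicate in LABEL_PREDICATES:
            for token in re.findall(r"[a-z0-9]+", literal.lower()):
                index[token].add(subject)
    return index


sample = [
    '<http://example.org/t/1> <http://www.w3.org/2004/02/skos/core#prefLabel> "buffalo milk"@en .',
    '<http://example.org/t/2> <http://www.w3.org/2004/02/skos/core#prefLabel> "acidophilus milk"@en .',
    '<http://example.org/t/2> <http://www.w3.org/2004/02/skos/core#broader> <http://example.org/t/9> .',
]

index = build_label_index(sample)
print(sorted(index["milk"]))
```

Indexing only the label predicates keeps the search index small while the full description of each term stays in the triplestore, which matches the split between the Lucene and Fuseki components described earlier.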

UI Access to Cache Server

http://services.ld4l.org/ld4l_services/loc_name.jsp

Downloads

LoC: http://id.loc.gov/download (n-triples OR rdf-xml)
OCLC FAST: http://www.oclc.org/research/themes/data-science/fast/download.html (n-triples)
GeoNames: http://www.geonames.org/ontology/documentation.html (custom format – see notes for processing)
AGROVOC: https://aims-fao.atlassian.net/wiki/spaces/AGV/pages/2949126/Releases (n-triples OR rdf-xml)
NALT: https://agclass.nal.usda.gov/downloads.html (rdf-xml)
DBpedia: http://wiki.dbpedia.org/downloads-2016-04

Potential Options for Reconciliation

• VIAF for name reconciliation – we are doing some work with this
• Wikidata – I’ve heard that they are working on reconciliation issues, but haven’t yet explored in depth
  • Intro Video (3hrs)
  • API Access
  • SPARQL – user manual
  • federated queries with other authorities

Doing a Google search for “linked data reconciliation” returns a large number of articles and presentations on this concept.

Links to Code & More

• QA Server - Code for a small app that provides the Questioning Authority normalization layer
• Linked Data Authorities - Configurations that can be used with QA Server
• LD4L Services - UI access to Cache Server
• VitroLib - Code for the VitroLib cataloging tool

Page 4: Linking the Data: Building effective Authority and …Linking the Data: Building Effective Authority and Identity Lookup Huda Khan and E. Lynette Rayle Cornell University Collaborators:

Background

4

Made in

America

1980

Made in

America

Blues

Brothers

Made in

America 1980 Blues

Brothers

Blues

Brothers

MARC

RECORD

NAME

AUTH

FILE

WORK

INSTANCE

AGENT

RWO

BIBFRAME

BIBLIOTEK-O

ENTITIES

WITH URIS

Background

ldquoA cataloger is an individual responsible for the processes of description subject analysis classification and authority control of library materials Catalogers serve as the lsquofoundation of all library service as they are the ones who organize information in such a way as to make it easily accessiblersquordquo (Emphasis mine)

From httpsenwikipediaorgwikiCataloging

Background

bull Traditional practices Authority File

bull Eg Name Authority Files Subject Headings Genre Forms from LOC

bull String as unique identifier eg ldquoMark Twainrdquo

bull Tasks and workflows

bull Identification ldquoAboutnessrdquo

bull Disambiguation

bull Context and original authority record

Background

bull Goals Design and architecture around accessing authorities

bull VitroLib

bull Prototype cataloging editor

bull Createsuses linked data

bull Enables lookup and use of authorities

bull Hyrax

bull Samvera technology stack

bull Incorporate authorities into institutional repository records

8

VitroLib Demo

9

What just happened

Questioning Authority

MAGIC (To Be Explained)

VitroLib Search Service

LOC Genre Forms

Search LOC Genre Form

data

Query = animation

Translate to QA Service

Request

urihttpidlocgovgf2011026141

label ldquoClay animation television programsrdquo

context

ldquoAlternate Labelrdquo [

ldquoClaymation television programsrdquo

ldquoSculptmation television programsrdquo

] hellip

urihttpidlocgovgf2011026141

label ldquoClay animation television programsrdquo

altLabelList [

ldquoClaymation television programsrdquo

ldquoSculptmation television programsrdquo

] hellip

23

Hyrax Demo

Autocomplete Saving String and URI

Authority OCLC FAST Subauthority PersonName

Selected String and URI

Saves both string and URI

Selecting a Term using

Lookup with Context

26

Selecting a Term using

Lookup with Context

27

Getting more from the same authority

Getting more from other authorities

30

Architecture

Technical Motivation

bull Linked data provideshellip

bull URIs that identify specific terms (as opposed to ambiguity of using

strings)

bull Reconciliation to relate terms that are defined in separate authorities

bull Goals of implementationhellip

bull Provide a single process to access many authorities

bull Provide efficient and reliable access to authorities

bull Provide a means for disambiguation that empowers library staff to

make the most accurate selections

First Set of Challenges

1 Finding Documentation

2 Linked Data Access API eg no support partial support requires login credentials

sparql query endpoint only

3 Varying Results Formats eg rdf-xml json-ld turtle n-triples etc

4 Varying Ontologies eg SKOS schemaorg madsrdf dbpedia geonames

Multi-Server Architecture

QA ndash normalize RDF returned from an authority

Multi-Server Architecture

QA ndash normalize RDF returned from an authority

Direct Access

of External

Authority

HyraxVitrolib ndash UI for selecting an entry from an authority

Multi-Server Architecture

QA ndash normalize RDF returned from an authority

httplocalhost3000qasearchlinked_data

oclc_fastpersonal_nameq=twainamp

maximumRecords=2

Direct Access

of External

Authority

HyraxVitrolib ndash UI for selecting an entry from an authority

Multi-Server Architecture

QA ndash normalize RDF returned from an authority

httplocalhost3000qasearchlinked_data

oclc_fastpersonal_nameq=twainamp

maximumRecords=2

httpexperimentalworldcatorgfast

searchquery=oclcpersonalName+22twain22

ampsortKeys=usageampmaximumRecords=2

Direct Access

of External

Authority

HyraxVitrolib ndash UI for selecting an entry from an authority

Multi-Server Architecture

QA ndash normalize RDF returned from an authority

httplocalhost3000qasearchlinked_data

oclc_fastpersonal_nameq=twainamp

maximumRecords=2

httpexperimentalworldcatorgfast

searchquery=oclcpersonalName+22twain22

ampsortKeys=usageampmaximumRecords=2

lthttpidworldcatorgfast31622gt

a schemaPerson

dctermsidentifier 31622

skosprefLabel Twain Mark 1835-1910

skosaltLabel Make Teviin 1835-1910

Make Tuwen 1835-1910

lthttpidworldcatorgfast365563gt

a schemaPerson

dctermsidentifier 365563

skosprefLabel Twain Shania

skosaltLabel Twain Eilleen

Edwards Eilleen

Direct Access

of External

Authority

HyraxVitrolib ndash UI for selecting an entry from an authority

Multi-Server Architecture

QA ndash normalize RDF returned from an authority

httplocalhost3000qasearchlinked_data

oclc_fastpersonal_nameq=twainamp

maximumRecords=2

[urihttpidworldcatorgfast31622

id31622 labelTwain Mark 1835-1910

urihttpidworldcatorgfast365563

id365563labelTwain Shania ]

httpexperimentalworldcatorgfast

searchquery=oclcpersonalName+22twain22

ampsortKeys=usageampmaximumRecords=2

lthttpidworldcatorgfast31622gt

a schemaPerson

dctermsidentifier 31622

skosprefLabel Twain Mark 1835-1910

skosaltLabel Make Teviin 1835-1910

Make Tuwen 1835-1910

lthttpidworldcatorgfast365563gt

a schemaPerson

dctermsidentifier 365563

skosprefLabel Twain Shania

skosaltLabel Twain Eilleen

Edwards Eilleen

Direct Access

of External

Authority

HyraxVitrolib ndash UI for selecting an entry from an authority

Direct Access Query API

Direct against authorityhellip

httpexperimentalworldcatorgfastsearch

query=oclcpersonalName+22twain22

ampmaximumRecords=2

httpapigeonamesorgsearch

q=ithaca

ampmaxRows=2

ampusername=demo

amptype=rdf

httpartemideartuniroma2it8081agrovocrestv1search

query=milk

amplang=en

ampmaxhits=2

Normalized Query API

Through QA normalization layerhellip

httplocalhost3000qasearchlinked_dataoclc_fast

q=twain

ampmaxRecords=2

httplocalhost3000qasearchlinked_datageonames

q=ithaca

ampmaxRecords=2

httplocalhost3000qasearchlinked_dataagrovoc

q=milk

ampmaxRecords=2

amplang=en

Normalized Results

[urihttpidworldcatorgfast31622 id31622 labelTwain Mark 1835-1910 urihttpidworldcatorgfast365563 id365563 labelTwain Shania]

[uri httpswsgeonamesorg2162552 id httpswsgeonamesorg2162552 label Ithaca (AU) uri httpswsgeonamesorg4515289 id httpswsgeonamesorg4515289 label Ithaca (US)]

[uri httpaimsfaoorgaosagrovocc_8602 id httpaimsfaoorgaosagrovocc_8602 label acidophilus milk uri httpaimsfaoorgaosagrovocc_16076 id httpaimsfaoorgaosagrovocc_16076 label buffalo milk]

OCLC FAST GeoNames AgroVoc

Second Set of Challenges

5 Reliability amp Efficiency eg server uptime server load

6 Accuracy eg select results based on usage data lexical match

custom weighting other

7 Order Ranking eg How to order a graph

Cache Server Query Process

JSP Query API

Jena-Fuseki

Triplestore

One full setup per authority

LuceneSOLR

Index

Cache Server Query Process

JSP Query API

httpservicesld4lorgld4l_servicesloc_name_batchjspquery=ezra20cornellampmaxRecords=10

Jena-Fuseki

Triplestore

LuceneSOLR

Index

One full setup per authority

Cache Server Query Process

JSP Query API

lucene search for ezra cornell

index built with predicate values

ltskosprefLabelgt

ltskosaltLabelgt

Jena-Fuseki

Triplestore

LuceneSOLR

Index

httpservicesld4lorgld4l_servicesloc_name_batchjspquery=ezra20cornellampmaxRecords=10

One full setup per authority

Cache Server Query Process

JSP Query API

extract search rank

extract URI

Jena-Fuseki

Triplestore

for each result

LuceneSOLR

Index

lucene search for ezra cornell

index built with predicate values

ltskosprefLabelgt

ltskosaltLabelgt

httpservicesld4lorgld4l_servicesloc_name_batchjspquery=ezra20cornellampmaxRecords=10

One full setup per authority

Cache Server Query Process

JSP Query API

sparql query for URI

Jena-Fuseki

Triplestore

LuceneSOLR

Index

extract search rank

extract URI

for each result

lucene search for ezra cornell

index built with predicate values

ltskosprefLabelgt

ltskosaltLabelgt

httpservicesld4lorgld4l_servicesloc_name_batchjspquery=ezra20cornellampmaxRecords=10

One full setup per authority

Cache Server Query Process

JSP Query API

combine all results

Jena-Fuseki

Triplestore

insert search rank in predicate

lthttpvivoweborgontology

corerankgt

LuceneSOLR

Index

sparql query for URI extract search rank

extract URI

for each result

lucene search for ezra cornell

index built with predicate values

ltskosprefLabelgt

ltskosaltLabelgt

httpservicesld4lorgld4l_servicesloc_name_batchjspquery=ezra20cornellampmaxRecords=10

One full setup per authority

UI-QA-Authority

QA ndash normalize RDF returned from an authority

httplocalhost3000qasearchlinked_data

oclc_fastpersonal_nameq=twainampmaximumRecords=2

[urihttpidworldcatorgfast31622id31622

labelTwain Mark 1835-1910

urihttpidworldcatorgfast365563id365563

labelTwain Shania

httpexperimentalworldcatorgfastsearchquery=o

clcpersonalName+22twain22

ampsortKeys=usageampmaximumRecords=2

RDF of

search

results

Active-Triples

LDF Cache

(Marmotta or

Blazegraph) LDF Cache Jena-Fuseki-

Lucene

Cache

Direct Access

of External

Authority

HyraxVitrolib ndash UI for selecting an entry from an authority

search of cache performed via Lucene index

Third Set of Challenges

8 Disambiguation through better context eg expand from just prefLabel tohellip

preLabel altLabel birthdeath dates occupation etc

9 Reconciliation across multiple sources eg match LoC URI to OCLC FAST URI

53

Whatrsquos next

Addressing Architectural Challenges

bull Generalize process for accessing context on the

cache server and in the normalization layer

bull Multi-authority search and reconciliation

bull Address the need for cache refresh

bull Mirrored cache servers

User Experience and Design

bull User-centered Design

bull Observe listen learn design evaluate iterate

bull Iteratively design and evaluate UI for lookupauthorities

with catalogers

bull Search result rankingorderingfiltering for catalogers

bull Additional UI platforms eg FOLIO

56

Questions

httptinyurlcomld4l-auth-access

Appendix for Challenges 1-4

Challenge 1 Documentation

58

LoC httpidlocgovtechcenter

C Harlow notes on reconciling LoC - httpsgithubcomcmh2166lc-reconcile

OCLC FAST

httpswwwoclcorgdeveloperdevelopweb-servicesfast-apilinked-dataenhtml

GeoNames

httpwwwgeonamesorgexportgeonames-searchhtml

AGROVOC httpaimsfaoorgvest-registryvocabulariesagrovoc-multilingual-agricultural-thesaurus

swagger config httpsgithubcomNatLibFiSkosmosblobmasterswaggerjson

NALT

httpsagclassnalusdagov

DBpedia httpwikidbpediaorgOnlineAccess1220Public20Faceted20Web20Service20Inter

face

Challenge 2 Linked Data Access API

59

for Search Query for Term Fetch

LoC not supported URI

OCLC FAST httpexperimentalworldcatorgfastsearchq

uery=subauth+all+22query22ampsortK

eys=usageampmaximumRecords=maximumR

ecords

URI

GeoNames httpapigeonamesorgsearchq=queryamp

maxRows=maxRowsampusername=userna

meamptype=rdf

URI

AGROVOC httpartemideartuniroma2it8081agrovocr

estv1searchquery=queryamplang=lang

httpartemideartuniroma2it8081agrovo

crestv1datauri=httpaimsfaoorgaosa

grovocterm_id

NALT httpskosmoslibrarycornelledurestv1nalt

searchquery=queryamplang=lang

httpskosmoslibrarycornelledurestv1na

ltdatauri=term_uri

DBpedia

Challenge 3 Varying Results Formats

60

for Search Query for Term Fetch

LoC not supported rdf-xml

OCLC FAST rdf-xml rdf-xml

GeoNames rdf-xml rdf-xml

AGROVOC json-ld rdf-xml json-ld turtle

NALT json-ld rdf-xml json-ld turtle

DBpedia

Challenge 4 Varying Ontologies

61

Primary Ontology Flat vs Navigation

required

LoC madsrdf

SKOS

navigation required

OCLC FAST schemaorg

SKOS

flat

GeoNames geonames flat

hierarchical

AGROVOC SKOS flat

hierarchical

NALT SKOS flat

hierarchical

DBpedia dbpedia flat

Configurations for Questioning Authority

62

LoC httpsgithubcomld4l-

labslinked_data_authoritiestreemasterqa_locconfigauthoritieslinked_dat

a

OCLC FAST httpsgithubcomld4l-

labslinked_data_authoritiestreemasterqa_oclcfastconfigauthoritieslinked

_data

GeoNames httpsgithubcomld4l-

labslinked_data_authoritiestreemasterqa_geonamesconfigauthoritieslink

ed_data

AGROVOC httpsgithubcomld4l-

labslinked_data_authoritiestreemasterqa_agrovocconfigauthoritieslinked

_data

NALT httpsgithubcomld4l-

labslinked_data_authoritiestreemasterqa_naltconfigauthoritieslinked_dat

a

DBpedia httpsgithubcomld4l-

labslinked_data_authoritiestreemasterqa_dbpediaconfigauthoritieslinked

_data

Appendix for Challenges 5-7

Creating a Cache Server

Hardware

bull 8-core 64gb 3Ghz Mac Pro (late 2013) macOS Sierra

(10126)

bull 32tb Pegasus-2 Thunderbolt RAID configured as RAID-5

Triplestore

bull Apache Jena Fuseki 240 provides SPARQL endpoint

bull Apache Tomcat 90 runs custom web application(s)

bull Apache Lucene 36 provides search interface

64

Customizations

bull custom per-data-source JSP web application provides

searchbrowsedownload functionality

bull custom (generic) SPARQL Tag Library provides API for web

apps (available at httpsgithubcomeichmannlod-utilities)

bull custom (generic) Lucene Tag Library provides API for web apps

65

Loading a New Vocabulary

bull download RDF

bull if necessary convert to n-triples (required for GeoNames data for instance)

bull use tdbloader2 to populated triplestore

bull configure Fuseki server(s) with triplestore details

bull create new JSP project in Eclipse

bull write one or more indexer programs that populate Lucene indices and run indexer(s)

bull write searchbrowsedownload application logic using the SPARQL and Lucene tags

bull package project as war

bull deploy to Apache Tomcat server(s)

bull add new service to Apache HTTPD virtual host specification

66

UI Access to Cache Server

httpservicesld4lorgld4l_servicesloc_namejsp

Downloads

68

LoC httpidlocgovdownload (n-triples OR rdf-xml)

OCLC FAST httpwwwoclcorgresearchthemesdata-sciencefastdownloadhtml (n-triples)

GeoNames httpwwwgeonamesorgontologydocumentationhtml (custom format ndash see notes for processing)

AGROVOC httpsaims-faoatlassiannetwikispacesAGVpages2949126Releases (n-triples OR rdf-xml)

NALT httpsagclassnalusdagovdownloadshtml (rdf-xml)

DBpedia httpwikidbpediaorgdownloads-2016-04

Potential Options for Reconciliation

bull VIAF for name reconciliation ndash we are doing some

work with this

bull Wikidata ndash Ive heard that they are working on

Reconciliation issues but havent yet explored in

depth bull Intro Video (3hrs)

bull API Access

bull SPARQL ndash user manual

bull federated queries with other authorities

Doing a google search for linked data reconciliation

returns a large number of articles and presentations

on this concept

Links to Code amp More

bull QA Server - Code for a small app that provides the

Questioning Authority normalization layer

bull Linked Data Authorities - Configurations that can

be used with QA Server

bull LD4L Services - UI access to Cache Server

bull VitroLib - Code for the VitroLib cataloging tool

Page 5: Linking the Data: Building effective Authority and …Linking the Data: Building Effective Authority and Identity Lookup Huda Khan and E. Lynette Rayle Cornell University Collaborators:

Background

ldquoA cataloger is an individual responsible for the processes of description subject analysis classification and authority control of library materials Catalogers serve as the lsquofoundation of all library service as they are the ones who organize information in such a way as to make it easily accessiblersquordquo (Emphasis mine)

From httpsenwikipediaorgwikiCataloging

Background

bull Traditional practices Authority File

bull Eg Name Authority Files Subject Headings Genre Forms from LOC

bull String as unique identifier eg ldquoMark Twainrdquo

bull Tasks and workflows

bull Identification ldquoAboutnessrdquo

bull Disambiguation

bull Context and original authority record

Background

• Goals: Design and architecture around accessing authorities
• VitroLib
  • Prototype cataloging editor
  • Creates/uses linked data
  • Enables lookup and use of authorities
• Hyrax
  • Samvera technology stack
  • Incorporate authorities into institutional repository records

VitroLib Demo

What just happened?

[Diagram: VitroLib Search Service ↔ Questioning Authority (“MAGIC”, to be explained) ↔ LOC Genre Forms]

VitroLib Search Service: Query = animation

Translate to QA Service Request

Questioning Authority: Search LOC Genre Form data

Results shown in the diagram, first with a context block and then with an altLabelList:

uri: http://id.loc.gov/gf2011026141
label: “Clay animation television programs”
context:
  “Alternate Label”: [
    “Claymation television programs”,
    “Sculptmation television programs”
  ] …

uri: http://id.loc.gov/gf2011026141
label: “Clay animation television programs”
altLabelList: [
  “Claymation television programs”,
  “Sculptmation television programs”
] …
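The two result shapes on this slide differ only in where the alternate labels live. As a minimal sketch of that translation step, assuming the context-bearing shape comes from QA and the altLabelList shape is what the VitroLib search service hands to the editor (the slide does not label them, and the function name `to_vitrolib_result` is hypothetical):

```python
def to_vitrolib_result(qa_result):
    """Flatten a result whose alternate labels sit under "context"
    into a result that carries them in a top-level altLabelList."""
    context = qa_result.get("context", {})
    return {
        "uri": qa_result["uri"],
        "label": qa_result["label"],
        # Assumed mapping: "Alternate Label" context entry -> altLabelList
        "altLabelList": list(context.get("Alternate Label", [])),
    }


qa_result = {
    "uri": "http://id.loc.gov/gf2011026141",
    "label": "Clay animation television programs",
    "context": {
        "Alternate Label": [
            "Claymation television programs",
            "Sculptmation television programs",
        ]
    },
}

vitrolib_result = to_vitrolib_result(qa_result)
print(vitrolib_result["altLabelList"])
```

A result with no context at all simply yields an empty altLabelList, so the editor can treat both cases the same way.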

Hyrax Demo

Autocomplete: Saving String and URI

Authority: OCLC FAST; Subauthority: PersonName

Selected String and URI

Saves both string and URI

Selecting a Term using Lookup with Context

Getting more from the same authority

Getting more from other authorities

Architecture

Technical Motivation

• Linked data provides…
  • URIs that identify specific terms (as opposed to the ambiguity of using strings)
  • Reconciliation to relate terms that are defined in separate authorities
• Goals of implementation…
  • Provide a single process to access many authorities
  • Provide efficient and reliable access to authorities
  • Provide a means for disambiguation that empowers library staff to make the most accurate selections

First Set of Challenges

1. Finding Documentation
2. Linked Data Access API, e.g., no support, partial support, requires login credentials, SPARQL query endpoint only
3. Varying Results Formats, e.g., rdf-xml, json-ld, turtle, n-triples, etc.
4. Varying Ontologies, e.g., SKOS, schema.org, madsrdf, dbpedia, geonames

Multi-Server Architecture

[Diagram, built up over several slides: Hyrax/VitroLib – UI for selecting an entry from an authority ↔ QA – normalize RDF returned from an authority ↔ Direct Access of External Authority]

Request from the UI to QA:

http://localhost:3000/qa/search/linked_data/oclc_fast/personal_name?q=twain&maximumRecords=2

Request from QA to the external authority:

http://experimental.worldcat.org/fast/search?query=oclc.personalName+%22twain%22&sortKeys=usage&maximumRecords=2

RDF returned by the external authority:

<http://id.worldcat.org/fast/31622>
    a schema:Person
    dcterms:identifier 31622
    skos:prefLabel Twain, Mark, 1835-1910
    skos:altLabel Make Teviin, 1835-1910
                  Make Tuwen, 1835-1910

<http://id.worldcat.org/fast/365563>
    a schema:Person
    dcterms:identifier 365563
    skos:prefLabel Twain, Shania
    skos:altLabel Twain, Eilleen
                  Edwards, Eilleen

Normalized results returned from QA to the UI:

[{"uri":"http://id.worldcat.org/fast/31622", "id":"31622", "label":"Twain, Mark, 1835-1910"},
 {"uri":"http://id.worldcat.org/fast/365563", "id":"365563", "label":"Twain, Shania"}]
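The normalization step in this diagram can be sketched as follows. This is an illustration only, not the Questioning Authority code: it assumes the RDF has already been parsed into plain (subject, predicate, object) tuples, and the function name `normalize` is hypothetical. The sample values come from the OCLC FAST example on the slide.

```python
import json

RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
DCTERMS_IDENTIFIER = "http://purl.org/dc/terms/identifier"
SKOS_PREF_LABEL = "http://www.w3.org/2004/02/skos/core#prefLabel"
SKOS_ALT_LABEL = "http://www.w3.org/2004/02/skos/core#altLabel"


def normalize(triples):
    """Collapse triples into one {uri, id, label} entry per subject,
    keeping subjects in the order they first appear."""
    results = {}
    for subject, predicate, obj in triples:
        entry = results.setdefault(
            subject, {"uri": subject, "id": None, "label": None}
        )
        if predicate == DCTERMS_IDENTIFIER:
            entry["id"] = obj
        elif predicate == SKOS_PREF_LABEL:
            entry["label"] = obj
        # Other predicates (type, altLabel, ...) are ignored in this sketch.
    return list(results.values())


mark = "http://id.worldcat.org/fast/31622"
shania = "http://id.worldcat.org/fast/365563"
triples = [
    (mark, RDF_TYPE, "http://schema.org/Person"),
    (mark, DCTERMS_IDENTIFIER, "31622"),
    (mark, SKOS_PREF_LABEL, "Twain, Mark, 1835-1910"),
    (mark, SKOS_ALT_LABEL, "Make Teviin, 1835-1910"),
    (shania, RDF_TYPE, "http://schema.org/Person"),
    (shania, DCTERMS_IDENTIFIER, "365563"),
    (shania, SKOS_PREF_LABEL, "Twain, Shania"),
]

normalized = normalize(triples)
print(json.dumps(normalized, indent=2))
```

Whatever the authority's RDF looks like, the UI only ever sees the same three keys, which is the point of putting the normalization layer between the two.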

Direct Access Query API

Direct against authority…

http://experimental.worldcat.org/fast/search?
    query=oclc.personalName+%22twain%22
    &maximumRecords=2

http://api.geonames.org/search?
    q=ithaca
    &maxRows=2
    &username=demo
    &type=rdf

http://artemide.art.uniroma2.it:8081/agrovoc/rest/v1/search?
    query=milk
    &lang=en
    &maxhits=2

Normalized Query API

Through QA normalization layer…

http://localhost:3000/qa/search/linked_data/oclc_fast?
    q=twain
    &maxRecords=2

http://localhost:3000/qa/search/linked_data/geonames?
    q=ithaca
    &maxRecords=2

http://localhost:3000/qa/search/linked_data/agrovoc?
    q=milk
    &maxRecords=2
    &lang=en
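The contrast between the direct and normalized query APIs comes down to a per-authority translation of parameter names. A minimal sketch of that idea, assuming a hand-written configuration table (the `AUTHORITIES` dictionary and `direct_url` function are hypothetical and are not the Questioning Authority configuration format; the endpoints and parameter names are the ones shown on the slides):

```python
from urllib.parse import urlencode

# Hypothetical per-authority settings, derived from the direct-access URLs.
AUTHORITIES = {
    "oclc_fast": {
        "base": "http://experimental.worldcat.org/fast/search",
        "query_param": "query",
        # Assumes the personalName subauthority, as in the slide example.
        "query_format": 'oclc.personalName "{q}"',
        "limit_param": "maximumRecords",
        "fixed": {},
    },
    "geonames": {
        "base": "http://api.geonames.org/search",
        "query_param": "q",
        "query_format": "{q}",
        "limit_param": "maxRows",
        "fixed": {"username": "demo", "type": "rdf"},
    },
    "agrovoc": {
        "base": "http://artemide.art.uniroma2.it:8081/agrovoc/rest/v1/search",
        "query_param": "query",
        "query_format": "{q}",
        "limit_param": "maxhits",
        "fixed": {},
    },
}


def direct_url(authority, q, max_records, **extra):
    """Translate a normalized (q, max_records) request into the
    authority-specific URL."""
    cfg = AUTHORITIES[authority]
    params = {
        cfg["query_param"]: cfg["query_format"].format(q=q),
        cfg["limit_param"]: max_records,
    }
    params.update(cfg["fixed"])
    params.update(extra)
    return cfg["base"] + "?" + urlencode(params)


fast_url = direct_url("oclc_fast", "twain", 2)
geonames_url = direct_url("geonames", "ithaca", 2)
agrovoc_url = direct_url("agrovoc", "milk", 2, lang="en")
print(fast_url)
print(geonames_url)
print(agrovoc_url)
```

The caller always supplies the same two arguments; adding another authority means adding a table entry rather than changing the calling code.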

Normalized Results

OCLC FAST:

[{"uri":"http://id.worldcat.org/fast/31622", "id":"31622", "label":"Twain, Mark, 1835-1910"},
 {"uri":"http://id.worldcat.org/fast/365563", "id":"365563", "label":"Twain, Shania"}]

GeoNames:

[{"uri":"http://sws.geonames.org/2162552", "id":"http://sws.geonames.org/2162552", "label":"Ithaca (AU)"},
 {"uri":"http://sws.geonames.org/4515289", "id":"http://sws.geonames.org/4515289", "label":"Ithaca (US)"}]

AgroVoc:

[{"uri":"http://aims.fao.org/aos/agrovoc/c_8602", "id":"http://aims.fao.org/aos/agrovoc/c_8602", "label":"acidophilus milk"},
 {"uri":"http://aims.fao.org/aos/agrovoc/c_16076", "id":"http://aims.fao.org/aos/agrovoc/c_16076", "label":"buffalo milk"}]

Second Set of Challenges

5. Reliability & Efficiency, e.g., server uptime, server load
6. Accuracy, e.g., select results based on usage data, lexical match, custom weighting, other
7. Order Ranking, e.g., how to order a graph

Cache Server Query Process

[Diagram, built up over several slides: JSP Query API, Lucene/SOLR Index, Jena-Fuseki Triplestore. One full setup per authority.]

1. Request to the JSP Query API:
   http://services.ld4l.org/ld4l_services/loc_name_batch.jsp?query=ezra%20cornell&maxRecords=10
2. Lucene search for “ezra cornell” (index built with predicate values <skos:prefLabel> and <skos:altLabel>)
3. For each result: extract search rank, extract URI
4. SPARQL query for URI against the Jena-Fuseki Triplestore
5. Insert search rank in predicate <http://vivoweb.org/ontology/core#rank>
6. Combine all results
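The cache server query process can be sketched end to end with in-memory stand-ins. This is not the JSP implementation: a dictionary of labels replaces the Lucene index, a list of tuples replaces the Fuseki triplestore, the matching is a naive all-terms substring test rather than Lucene scoring, and the example.org URIs and their labels are made up for illustration.

```python
RANK_PREDICATE = "http://vivoweb.org/ontology/core#rank"
SKOS_PREF_LABEL = "http://www.w3.org/2004/02/skos/core#prefLabel"


def lucene_search(index, query, max_records):
    """Stand-in for the Lucene search: return URIs whose indexed labels
    contain every query term, in index order."""
    terms = query.lower().split()
    hits = [
        uri
        for uri, labels in index.items()
        if all(any(term in label.lower() for label in labels) for term in terms)
    ]
    return hits[:max_records]


def fetch_triples(store, uri):
    """Stand-in for the SPARQL query that fetches one URI's triples."""
    return [triple for triple in store if triple[0] == uri]


def cache_query(index, store, query, max_records=10):
    """Search the index, fetch each hit from the store, attach the search
    rank as a triple, and combine everything into one result graph."""
    combined = []
    for rank, uri in enumerate(lucene_search(index, query, max_records), start=1):
        triples = fetch_triples(store, uri)
        triples.append((uri, RANK_PREDICATE, rank))
        combined.extend(triples)
    return combined


# Hypothetical data; the index is built from prefLabel/altLabel values.
person = "http://example.org/names/1"
university = "http://example.org/names/2"
index = {person: ["Cornell, Ezra"], university: ["Cornell University"]}
store = [
    (person, SKOS_PREF_LABEL, "Cornell, Ezra"),
    (university, SKOS_PREF_LABEL, "Cornell University"),
]

graph = cache_query(index, store, "ezra cornell")
for triple in graph:
    print(triple)
```

Writing the rank into the graph as a predicate is what lets a consumer restore the search order later, since a set of triples has no inherent ordering.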

UI-QA-Authority

[Diagram: Hyrax/VitroLib – UI for selecting an entry from an authority ↔ QA – normalize RDF returned from an authority ↔ RDF of search results]

Sources shown for the RDF of search results: Active-Triples; LDF Cache (Marmotta or Blazegraph); LDF Cache; Jena-Fuseki-Lucene Cache (search of cache performed via Lucene index); Direct Access of External Authority

Request from the UI to QA:

http://localhost:3000/qa/search/linked_data/oclc_fast/personal_name?q=twain&maximumRecords=2

Normalized results returned from QA to the UI:

[{"uri":"http://id.worldcat.org/fast/31622", "id":"31622", "label":"Twain, Mark, 1835-1910"},
 {"uri":"http://id.worldcat.org/fast/365563", "id":"365563", "label":"Twain, Shania"}]

Request from QA to the external authority:

http://experimental.worldcat.org/fast/search?query=oclc.personalName+%22twain%22&sortKeys=usage&maximumRecords=2

Third Set of Challenges

8. Disambiguation through better context, e.g., expand from just prefLabel to… prefLabel, altLabel, birth/death dates, occupation, etc.
9. Reconciliation across multiple sources, e.g., match LoC URI to OCLC FAST URI
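To make challenge 9 concrete, here is one naive way to pair URIs across two authorities: match on a normalized form of the label. This is an illustration of the problem, not a method described in the talk; the function names and the example.org LoC URI are hypothetical, and the FAST entries reuse the values from the earlier slides. Exact label matching would conflate different entities that share a heading, which is part of why reconciliation is listed as an open challenge.

```python
import unicodedata


def normalize_label(label):
    """Case-fold, strip diacritics and commas, and collapse whitespace."""
    decomposed = unicodedata.normalize("NFKD", label)
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return " ".join(stripped.casefold().replace(",", " ").split())


def reconcile(source_a, source_b):
    """Map each URI in source_a to the URI in source_b that carries the
    same normalized label, when there is one."""
    by_label = {normalize_label(e["label"]): e["uri"] for e in source_b}
    matches = {}
    for entry in source_a:
        key = normalize_label(entry["label"])
        if key in by_label:
            matches[entry["uri"]] = by_label[key]
    return matches


loc_entries = [
    # Hypothetical URI standing in for an LoC name authority record.
    {"uri": "http://example.org/loc/twain", "label": "Twain, Mark, 1835-1910"},
]
fast_entries = [
    {"uri": "http://id.worldcat.org/fast/31622", "label": "Twain, Mark, 1835-1910"},
    {"uri": "http://id.worldcat.org/fast/365563", "label": "Twain, Shania"},
]

matches = reconcile(loc_entries, fast_entries)
print(matches)
```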

What’s next

Addressing Architectural Challenges

• Generalize process for accessing context on the cache server and in the normalization layer
• Multi-authority search and reconciliation
• Address the need for cache refresh
• Mirrored cache servers

User Experience and Design

• User-centered Design
  • Observe, listen, learn, design, evaluate, iterate
• Iteratively design and evaluate UI for lookup/authorities with catalogers
• Search result ranking/ordering/filtering for catalogers
• Additional UI platforms, e.g., FOLIO

Questions

http://tinyurl.com/ld4l-auth-access

Appendix for Challenges 1-4

Challenge 1: Documentation

LoC: http://id.loc.gov/techcenter
     C. Harlow notes on reconciling LoC - https://github.com/cmh2166/lc-reconcile

OCLC FAST: https://www.oclc.org/developer/develop/web-services/fast-api/linked-data.en.html

GeoNames: http://www.geonames.org/export/geonames-search.html

AGROVOC: http://aims.fao.org/vest-registry/vocabularies/agrovoc-multilingual-agricultural-thesaurus
         swagger config: https://github.com/NatLibFi/Skosmos/blob/master/swagger.json

NALT: https://agclass.nal.usda.gov

DBpedia: http://wiki.dbpedia.org/OnlineAccess#1.2%20Public%20Faceted%20Web%20Service%20Interface

Challenge 2: Linked Data Access API

LoC
  for Search Query: not supported
  for Term Fetch: URI

OCLC FAST
  for Search Query: http://experimental.worldcat.org/fast/search?query={subauth}+all+%22{query}%22&sortKeys=usage&maximumRecords={maximumRecords}
  for Term Fetch: URI

GeoNames
  for Search Query: http://api.geonames.org/search?q={query}&maxRows={maxRows}&username={username}&type=rdf
  for Term Fetch: URI

AGROVOC
  for Search Query: http://artemide.art.uniroma2.it:8081/agrovoc/rest/v1/search?query={query}&lang={lang}
  for Term Fetch: http://artemide.art.uniroma2.it:8081/agrovoc/rest/v1/data?uri=http://aims.fao.org/aos/agrovoc/{term_id}

NALT
  for Search Query: http://skosmos.library.cornell.edu/rest/v1/nalt/search?query={query}&lang={lang}
  for Term Fetch: http://skosmos.library.cornell.edu/rest/v1/nalt/data?uri={term_uri}

DBpedia

Challenge 3: Varying Results Formats

             for Search Query    for Term Fetch
LoC          not supported       rdf-xml
OCLC FAST    rdf-xml             rdf-xml
GeoNames     rdf-xml             rdf-xml
AGROVOC      json-ld             rdf-xml, json-ld, turtle
NALT         json-ld             rdf-xml, json-ld, turtle
DBpedia

Challenge 4: Varying Ontologies

             Primary Ontology    Flat vs. Navigation required
LoC          madsrdf, SKOS       navigation required
OCLC FAST    schema.org, SKOS    flat
GeoNames     geonames            flat, hierarchical
AGROVOC      SKOS                flat, hierarchical
NALT         SKOS                flat, hierarchical
DBpedia      dbpedia             flat
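Because each authority uses a different primary ontology, the normalization layer has to know which predicate carries the display label for each one. A minimal sketch of that lookup, assuming one commonly used label predicate per ontology (the choice of predicate per ontology is an assumption made for illustration and is not taken from the Questioning Authority configurations; `LABEL_PREDICATES` and `extract_label` are hypothetical names):

```python
# Assumed label predicate for each ontology named in the table.
LABEL_PREDICATES = {
    "madsrdf": "http://www.loc.gov/mads/rdf/v1#authoritativeLabel",
    "skos": "http://www.w3.org/2004/02/skos/core#prefLabel",
    "schema.org": "http://schema.org/name",
    "geonames": "http://www.geonames.org/ontology#name",
    "dbpedia": "http://www.w3.org/2000/01/rdf-schema#label",
}


def extract_label(triples, subject, ontologies):
    """Return the first label found for subject, trying each ontology's
    label predicate in the order given."""
    for ontology in ontologies:
        predicate = LABEL_PREDICATES[ontology]
        for s, p, o in triples:
            if s == subject and p == predicate:
                return o
    return None


# Hypothetical term described with SKOS only.
term = "http://example.org/term/1"
triples = [(term, LABEL_PREDICATES["skos"], "buffalo milk")]

# An authority listed as "madsrdf, SKOS" falls back to SKOS here.
label = extract_label(triples, term, ["madsrdf", "skos"])
print(label)
```

Trying the ontologies in a configured order keeps the per-authority differences in data rather than in code.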

Configurations for Questioning Authority

LoC: https://github.com/ld4l-labs/linked_data_authorities/tree/master/qa_loc/config/authorities/linked_data

OCLC FAST: https://github.com/ld4l-labs/linked_data_authorities/tree/master/qa_oclcfast/config/authorities/linked_data

GeoNames: https://github.com/ld4l-labs/linked_data_authorities/tree/master/qa_geonames/config/authorities/linked_data

AGROVOC: https://github.com/ld4l-labs/linked_data_authorities/tree/master/qa_agrovoc/config/authorities/linked_data

NALT: https://github.com/ld4l-labs/linked_data_authorities/tree/master/qa_nalt/config/authorities/linked_data

DBpedia: https://github.com/ld4l-labs/linked_data_authorities/tree/master/qa_dbpedia/config/authorities/linked_data

Appendix for Challenges 5-7

Creating a Cache Server

Hardware

• 8-core, 64 GB, 3 GHz Mac Pro (late 2013), macOS Sierra (10.12.6)
• 32 TB Pegasus-2 Thunderbolt RAID, configured as RAID-5

Triplestore

• Apache Jena Fuseki 2.4.0 provides SPARQL endpoint
• Apache Tomcat 9.0 runs custom web application(s)
• Apache Lucene 3.6 provides search interface

Customizations

• custom per-data-source JSP web application provides search/browse/download functionality
• custom (generic) SPARQL Tag Library provides API for web apps (available at https://github.com/eichmann/lod-utilities)
• custom (generic) Lucene Tag Library provides API for web apps

Loading a New Vocabulary

• download RDF
• if necessary, convert to n-triples (required for GeoNames data, for instance)
• use tdbloader2 to populate triplestore
• configure Fuseki server(s) with triplestore details
• create new JSP project in Eclipse
• write one or more indexer programs that populate Lucene indices, and run indexer(s)
• write search/browse/download application logic using the SPARQL and Lucene tags
• package project as WAR
• deploy to Apache Tomcat server(s)
• add new service to Apache HTTPD virtual host specification
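The tdbloader2 step above can be scripted. A minimal sketch, assuming the Apache Jena command-line tools are on the PATH and using made-up paths; it only builds and prints the command, so nothing is loaded until the list is handed to `subprocess.run`:

```python
import shlex


def tdbloader2_command(tdb_dir, rdf_files):
    """Build the tdbloader2 invocation that bulk-loads RDF files into a
    TDB database directory."""
    if not rdf_files:
        raise ValueError("at least one RDF file is required")
    return ["tdbloader2", "--loc", str(tdb_dir), *[str(f) for f in rdf_files]]


# Hypothetical locations for one authority's triplestore and its n-triples dump.
command = tdbloader2_command("/data/tdb/geonames", ["/data/rdf/geonames.nt"])
print(shlex.join(command))
```

Since the slides call for one full setup per authority, a helper like this would be run once per vocabulary with its own database directory.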

UI Access to Cache Server

http://services.ld4l.org/ld4l_services/loc_name.jsp

Downloads

LoC: http://id.loc.gov/download (n-triples OR rdf-xml)

OCLC FAST: http://www.oclc.org/research/themes/data-science/fast/download.html (n-triples)

GeoNames: http://www.geonames.org/ontology/documentation.html (custom format – see notes for processing)

AGROVOC: https://aims-fao.atlassian.net/wiki/spaces/AGV/pages/2949126/Releases (n-triples OR rdf-xml)

NALT: https://agclass.nal.usda.gov/download.shtml (rdf-xml)

DBpedia: http://wiki.dbpedia.org/downloads-2016-04

Potential Options for Reconciliation

• VIAF for name reconciliation – we are doing some work with this
• Wikidata – I've heard that they are working on reconciliation issues, but haven't yet explored in depth
  • Intro Video (3hrs)
  • API Access
  • SPARQL – user manual
  • federated queries with other authorities

Doing a Google search for “linked data reconciliation” returns a large number of articles and presentations on this concept.

Links to Code & More

• QA Server - Code for a small app that provides the Questioning Authority normalization layer
• Linked Data Authorities - Configurations that can be used with QA Server
• LD4L Services - UI access to Cache Server
• VitroLib - Code for the VitroLib cataloging tool

Page 7: Linking the Data: Building effective Authority and …Linking the Data: Building Effective Authority and Identity Lookup Huda Khan and E. Lynette Rayle Cornell University Collaborators:

Background

bull Goals Design and architecture around accessing authorities

bull VitroLib

bull Prototype cataloging editor

bull Createsuses linked data

bull Enables lookup and use of authorities

bull Hyrax

bull Samvera technology stack

bull Incorporate authorities into institutional repository records

8

VitroLib Demo

9

What just happened

Questioning Authority

MAGIC (To Be Explained)

VitroLib Search Service

LOC Genre Forms

Search LOC Genre Form

data

Query = animation

Translate to QA Service

Request

urihttpidlocgovgf2011026141

label ldquoClay animation television programsrdquo

context

ldquoAlternate Labelrdquo [

ldquoClaymation television programsrdquo

ldquoSculptmation television programsrdquo

] hellip

urihttpidlocgovgf2011026141

label ldquoClay animation television programsrdquo

altLabelList [

ldquoClaymation television programsrdquo

ldquoSculptmation television programsrdquo

] hellip

23

Hyrax Demo

Autocomplete Saving String and URI

Authority OCLC FAST Subauthority PersonName

Selected String and URI

Saves both string and URI

Selecting a Term using

Lookup with Context

26

Selecting a Term using

Lookup with Context

27

Getting more from the same authority

Getting more from other authorities

30

Architecture

Technical Motivation

bull Linked data provideshellip

bull URIs that identify specific terms (as opposed to ambiguity of using

strings)

bull Reconciliation to relate terms that are defined in separate authorities

bull Goals of implementationhellip

bull Provide a single process to access many authorities

bull Provide efficient and reliable access to authorities

bull Provide a means for disambiguation that empowers library staff to

make the most accurate selections

First Set of Challenges

1 Finding Documentation

2 Linked Data Access API eg no support partial support requires login credentials

sparql query endpoint only

3 Varying Results Formats eg rdf-xml json-ld turtle n-triples etc

4 Varying Ontologies eg SKOS schemaorg madsrdf dbpedia geonames

Multi-Server Architecture

Hyrax/VitroLib – UI for selecting an entry from an authority

QA – normalize RDF returned from an authority

Direct Access of External Authority

1. The UI sends a normalized query to QA:

http://localhost:3000/qa/search/linked_data/oclc_fast/personal_name?q=twain&maximumRecords=2

2. QA sends the corresponding direct query to the external authority:

http://experimental.worldcat.org/fast/search?query=oclc.personalName+%22twain%22&sortKeys=usage&maximumRecords=2

3. The external authority returns RDF:

<http://id.worldcat.org/fast/31622>
    a schema:Person ;
    dcterms:identifier "31622" ;
    skos:prefLabel "Twain, Mark, 1835-1910" ;
    skos:altLabel "Make Teviin, 1835-1910", "Make Tuwen, 1835-1910" .

<http://id.worldcat.org/fast/365563>
    a schema:Person ;
    dcterms:identifier "365563" ;
    skos:prefLabel "Twain, Shania" ;
    skos:altLabel "Twain, Eilleen", "Edwards, Eilleen" .

4. QA returns normalized results to the UI:

[{"uri": "http://id.worldcat.org/fast/31622", "id": "31622", "label": "Twain, Mark, 1835-1910"},
 {"uri": "http://id.worldcat.org/fast/365563", "id": "365563", "label": "Twain, Shania"}]

Direct Access Query API

Direct against authority…

http://experimental.worldcat.org/fast/search?query=oclc.personalName+%22twain%22&maximumRecords=2

http://api.geonames.org/search?q=ithaca&maxRows=2&username=demo&type=rdf

http://artemide.art.uniroma2.it:8081/agrovoc/rest/v1/search?query=milk&lang=en&maxhits=2

Normalized Query API

Through QA normalization layer…

http://localhost:3000/qa/search/linked_data/oclc_fast?q=twain&maxRecords=2

http://localhost:3000/qa/search/linked_data/geonames?q=ithaca&maxRecords=2

http://localhost:3000/qa/search/linked_data/agrovoc?q=milk&maxRecords=2&lang=en
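The contrast between the direct and normalized query styles can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names `normalized_url` and `direct_url` are hypothetical, the FAST index is hard-coded to personal names, and the real normalization layer is presumably driven by per-authority configuration rather than `if` branches. The URL shapes and parameter names are the ones shown on the slides.

```python
from urllib.parse import quote, urlencode

# Host and port as shown on the slides (a local development server).
QA_BASE = "http://localhost:3000/qa/search/linked_data"


def normalized_url(authority, query, max_records, lang=None):
    """Build a query against the normalization layer: same parameters for every authority."""
    params = {"q": query, "maxRecords": max_records}
    if lang is not None:
        params["lang"] = lang
    return f"{QA_BASE}/{authority}?{urlencode(params)}"


def direct_url(authority, query, max_records, lang="en"):
    """Build the authority-specific query; each service uses its own parameter names."""
    if authority == "oclc_fast":
        phrase = quote(f'"{query}"')  # quoted phrase, e.g. %22twain%22
        return ("http://experimental.worldcat.org/fast/search"
                f"?query=oclc.personalName+{phrase}&maximumRecords={max_records}")
    if authority == "geonames":
        params = {"q": query, "maxRows": max_records, "username": "demo", "type": "rdf"}
        return "http://api.geonames.org/search?" + urlencode(params)
    if authority == "agrovoc":
        params = {"query": query, "lang": lang, "maxhits": max_records}
        return ("http://artemide.art.uniroma2.it:8081/agrovoc/rest/v1/search?"
                + urlencode(params))
    raise ValueError(f"unknown authority: {authority}")


if __name__ == "__main__":
    for name, term in [("oclc_fast", "twain"), ("geonames", "ithaca"), ("agrovoc", "milk")]:
        print(normalized_url(name, term, 2))
        print(direct_url(name, term, 2))
```

The caller only ever needs `q` and `maxRecords`; the differences between `maximumRecords`, `maxRows`, and `maxhits` stay hidden behind the layer.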

Normalized Results

OCLC FAST

[{"uri": "http://id.worldcat.org/fast/31622", "id": "31622", "label": "Twain, Mark, 1835-1910"},
 {"uri": "http://id.worldcat.org/fast/365563", "id": "365563", "label": "Twain, Shania"}]

GeoNames

[{"uri": "http://sws.geonames.org/2162552/", "id": "http://sws.geonames.org/2162552/", "label": "Ithaca (AU)"},
 {"uri": "http://sws.geonames.org/4515289/", "id": "http://sws.geonames.org/4515289/", "label": "Ithaca (US)"}]

AGROVOC

[{"uri": "http://aims.fao.org/aos/agrovoc/c_8602", "id": "http://aims.fao.org/aos/agrovoc/c_8602", "label": "acidophilus milk"},
 {"uri": "http://aims.fao.org/aos/agrovoc/c_16076", "id": "http://aims.fao.org/aos/agrovoc/c_16076", "label": "buffalo milk"}]
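As a rough illustration of what producing these normalized results involves, the sketch below collapses RDF triples into uri/id/label records. It is not the Questioning Authority implementation: the function name `normalize` is hypothetical, triples are plain Python tuples rather than a parsed RDF graph, and the label is assumed to come from skos:prefLabel, whereas the slides note that each authority uses its own ontology.

```python
SKOS_PREF_LABEL = "http://www.w3.org/2004/02/skos/core#prefLabel"
DCTERMS_IDENTIFIER = "http://purl.org/dc/terms/identifier"


def normalize(triples):
    """Collapse (subject, predicate, object) triples into uri/id/label records.

    The id falls back to the URI when no dcterms:identifier is present,
    which matches the GeoNames and AGROVOC results shown on the slide.
    """
    records = {}
    for subject, predicate, obj in triples:
        record = records.setdefault(subject, {"uri": subject, "id": subject, "label": None})
        if predicate == DCTERMS_IDENTIFIER:
            record["id"] = obj
        elif predicate == SKOS_PREF_LABEL:
            record["label"] = obj
        # Other predicates (e.g. skos:altLabel) are ignored in this minimal sketch.
    return list(records.values())


if __name__ == "__main__":
    fast = "http://id.worldcat.org/fast/31622"
    print(normalize([
        (fast, DCTERMS_IDENTIFIER, "31622"),
        (fast, SKOS_PREF_LABEL, "Twain, Mark, 1835-1910"),
    ]))
```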

Second Set of Challenges

5. Reliability & Efficiency, e.g. server uptime, server load

6. Accuracy, e.g. select results based on usage data, lexical match, custom weighting, other

7. Order Ranking, e.g. how to order a graph

Cache Server Query Process

Components (one full setup per authority): JSP Query API, Lucene/SOLR Index, Jena-Fuseki Triplestore

1. The JSP Query API receives a request:

http://services.ld4l.org/ld4l_services/loc_name_batch.jsp?query=ezra%20cornell&maxRecords=10

2. Lucene search for "ezra cornell"; the index is built with the values of the predicates <skos:prefLabel> and <skos:altLabel>

3. For each result: extract search rank, extract URI

4. SPARQL query for URI against the Jena-Fuseki triplestore

5. Insert search rank in predicate <http://vivoweb.org/ontology/core#rank>

6. Combine all results
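The query process on this slide can be sketched end to end with in-memory stand-ins. Everything here is an assumption made for illustration: the function names are hypothetical, a dictionary of labels stands in for the Lucene index, a dictionary of triples stands in for the Jena-Fuseki triplestore, and the scoring (count of query terms found in the labels) is far cruder than real Lucene ranking. The actual service is a JSP application, not Python.

```python
RANK = "http://vivoweb.org/ontology/core#rank"


def lucene_like_search(query, label_index):
    """Stand-in for the Lucene search: return URIs ordered by a crude match score."""
    terms = query.lower().split()
    hits = []
    for uri, labels in label_index.items():
        text = " ".join(labels).lower()
        score = sum(term in text for term in terms)
        if score:
            hits.append((score, uri))
    hits.sort(key=lambda hit: (-hit[0], hit[1]))  # best score first, URI as tie-breaker
    return [uri for _, uri in hits]


def query_cache(query, label_index, triplestore, max_records=10):
    """Search the index, fetch triples per URI, add the rank, and combine all results."""
    combined = []
    ranked_uris = lucene_like_search(query, label_index)[:max_records]
    for rank, uri in enumerate(ranked_uris, start=1):
        triples = list(triplestore.get(uri, []))  # stand-in for the SPARQL query for this URI
        triples.append((uri, RANK, rank))         # insert the search rank as an extra triple
        combined.extend(triples)
    return combined


if __name__ == "__main__":
    index = {"http://example.org/a": ["Cornell University"],
             "http://example.org/b": ["Cornell, Ezra"]}
    store = {"http://example.org/b": [("http://example.org/b", "skos:prefLabel", "Cornell, Ezra")]}
    for triple in query_cache("ezra cornell", index, store):
        print(triple)
```

Carrying the rank inside the returned graph is what lets a client restore the search order, since an RDF graph has no inherent ordering.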

UI-QA-Authority

Hyrax/VitroLib – UI for selecting an entry from an authority

http://localhost:3000/qa/search/linked_data/oclc_fast/personal_name?q=twain&maximumRecords=2

[{"uri": "http://id.worldcat.org/fast/31622", "id": "31622", "label": "Twain, Mark, 1835-1910"},
 {"uri": "http://id.worldcat.org/fast/365563", "id": "365563", "label": "Twain, Shania"}]

QA – normalize RDF returned from an authority

http://experimental.worldcat.org/fast/search?query=oclc.personalName+%22twain%22&sortKeys=usage&maximumRecords=2

RDF of search results

Sources shown behind QA:

• Active-Triples

• LDF Cache (Marmotta or Blazegraph)

• LDF Cache

• Jena-Fuseki-Lucene Cache (search of cache performed via Lucene index)

• Direct Access of External Authority

Third Set of Challenges

8. Disambiguation through better context, e.g. expand from just prefLabel to… prefLabel, altLabel, birth/death dates, occupation, etc.

9. Reconciliation across multiple sources, e.g. match LoC URI to OCLC FAST URI
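To make the reconciliation challenge concrete, here is a deliberately naive sketch that pairs entries from two authorities when their labels agree after normalization. The slides do not describe a reconciliation method, so this exact-label matching is purely an assumption for illustration, and the names `label_key` and `reconcile` and the example.org URI are hypothetical. Real reconciliation would likely need additional context such as dates or existing cross-links.

```python
import unicodedata


def label_key(label):
    """Normalize a label: strip diacritics, ignore case and punctuation."""
    chars = []
    for char in unicodedata.normalize("NFKD", label):
        if unicodedata.combining(char):
            continue  # drop accents separated out by NFKD
        chars.append(char.casefold() if char.isalnum() else " ")
    return " ".join("".join(chars).split())


def reconcile(source_a, source_b):
    """Map each URI in source_a to the URIs in source_b that share a normalized label."""
    by_key = {}
    for record in source_b:
        by_key.setdefault(label_key(record["label"]), []).append(record["uri"])
    return {record["uri"]: by_key.get(label_key(record["label"]), [])
            for record in source_a}


if __name__ == "__main__":
    loc = [{"uri": "http://example.org/loc/twain", "label": "Twain, Mark, 1835-1910"}]
    fast = [{"uri": "http://id.worldcat.org/fast/31622", "label": "Twain, Mark, 1835-1910"},
            {"uri": "http://id.worldcat.org/fast/365563", "label": "Twain, Shania"}]
    print(reconcile(loc, fast))
```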


What's next

Addressing Architectural Challenges

• Generalize process for accessing context on the cache server and in the normalization layer

• Multi-authority search and reconciliation

• Address the need for cache refresh

• Mirrored cache servers

User Experience and Design

• User-centered Design

• Observe, listen, learn, design, evaluate, iterate

• Iteratively design and evaluate UI for lookup/authorities with catalogers

• Search result ranking/ordering/filtering for catalogers

• Additional UI platforms, e.g. FOLIO

Questions

http://tinyurl.com/ld4l-auth-access

Appendix for Challenges 1-4

Challenge 1: Documentation

LoC: http://id.loc.gov/techcenter/

C. Harlow notes on reconciling LoC: https://github.com/cmh2166/lc-reconcile

OCLC FAST: https://www.oclc.org/developer/develop/web-services/fast-api/linked-data.en.html

GeoNames: http://www.geonames.org/export/geonames-search.html

AGROVOC: http://aims.fao.org/vest-registry/vocabularies/agrovoc-multilingual-agricultural-thesaurus

swagger config: https://github.com/NatLibFi/Skosmos/blob/master/swagger.json

NALT: https://agclass.nal.usda.gov/

DBpedia: http://wiki.dbpedia.org/OnlineAccess#1.2%20Public%20Faceted%20Web%20Service%20Interface

Challenge 2: Linked Data Access API

LoC
  for Search Query: not supported
  for Term Fetch: URI

OCLC FAST
  for Search Query: http://experimental.worldcat.org/fast/search?query={subauth}+all+%22{query}%22&sortKeys=usage&maximumRecords={maximumRecords}
  for Term Fetch: URI

GeoNames
  for Search Query: http://api.geonames.org/search?q={query}&maxRows={maxRows}&username={username}&type=rdf
  for Term Fetch: URI

AGROVOC
  for Search Query: http://artemide.art.uniroma2.it:8081/agrovoc/rest/v1/search?query={query}&lang={lang}
  for Term Fetch: http://artemide.art.uniroma2.it:8081/agrovoc/rest/v1/data?uri=http://aims.fao.org/aos/agrovoc/{term_id}

NALT
  for Search Query: http://skosmos.library.cornell.edu/rest/v1/nalt/search?query={query}&lang={lang}
  for Term Fetch: http://skosmos.library.cornell.edu/rest/v1/nalt/data?uri={term_uri}

DBpedia

Challenge 3: Varying Results Formats

Authority | for Search Query | for Term Fetch
LoC | not supported | rdf-xml
OCLC FAST | rdf-xml | rdf-xml
GeoNames | rdf-xml | rdf-xml
AGROVOC | json-ld | rdf-xml, json-ld, turtle
NALT | json-ld | rdf-xml, json-ld, turtle
DBpedia | |

Challenge 4: Varying Ontologies

Authority | Primary Ontology | Flat vs. Navigation required
LoC | madsrdf, SKOS | navigation required
OCLC FAST | schema.org, SKOS | flat
GeoNames | geonames | flat, hierarchical
AGROVOC | SKOS | flat, hierarchical
NALT | SKOS | flat, hierarchical
DBpedia | dbpedia | flat

Configurations for Questioning Authority

LoC: https://github.com/ld4l-labs/linked_data_authorities/tree/master/qa_loc/config/authorities/linked_data

OCLC FAST: https://github.com/ld4l-labs/linked_data_authorities/tree/master/qa_oclcfast/config/authorities/linked_data

GeoNames: https://github.com/ld4l-labs/linked_data_authorities/tree/master/qa_geonames/config/authorities/linked_data

AGROVOC: https://github.com/ld4l-labs/linked_data_authorities/tree/master/qa_agrovoc/config/authorities/linked_data

NALT: https://github.com/ld4l-labs/linked_data_authorities/tree/master/qa_nalt/config/authorities/linked_data

DBpedia: https://github.com/ld4l-labs/linked_data_authorities/tree/master/qa_dbpedia/config/authorities/linked_data

Appendix for Challenges 5-7

Creating a Cache Server

Hardware

• 8-core, 64 GB, 3 GHz Mac Pro (late 2013), macOS Sierra (10.12.6)

• 32 TB Pegasus-2 Thunderbolt RAID configured as RAID-5

Triplestore

• Apache Jena Fuseki 2.4.0 provides SPARQL endpoint

• Apache Tomcat 9.0 runs custom web application(s)

• Apache Lucene 3.6 provides search interface

Customizations

• custom per-data-source JSP web application provides search/browse/download functionality

• custom (generic) SPARQL Tag Library provides API for web apps (available at https://github.com/eichmann/lod-utilities)

• custom (generic) Lucene Tag Library provides API for web apps

Loading a New Vocabulary

• download RDF

• if necessary, convert to n-triples (required for GeoNames data, for instance)

• use tdbloader2 to populate the triplestore

• configure Fuseki server(s) with triplestore details

• create new JSP project in Eclipse

• write one or more indexer programs that populate Lucene indices, and run the indexer(s)

• write search/browse/download application logic using the SPARQL and Lucene tags

• package project as war

• deploy to Apache Tomcat server(s)

• add new service to Apache HTTPD virtual host specification
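The indexer step in the list above can be illustrated with a small sketch. This is not the project's indexer, which populates Lucene indices from Java; it is a hypothetical Python stand-in (`build_index` is an invented name) that reads N-Triples lines and builds an in-memory inverted index over skos:prefLabel and skos:altLabel values, the two predicates the cache server slides say the index is built from. The regular expression handles only simple literal triples and does not unescape N-Triples escape sequences.

```python
import re
from collections import defaultdict

LABEL_PREDICATES = {
    "http://www.w3.org/2004/02/skos/core#prefLabel",
    "http://www.w3.org/2004/02/skos/core#altLabel",
}

# Matches: <subject> <predicate> "literal"  (language tags or datatypes may follow)
TRIPLE = re.compile(r'^<([^>]+)>\s+<([^>]+)>\s+"((?:[^"\\]|\\.)*)"')


def build_index(ntriples_lines):
    """Return a mapping from lower-cased label token to the set of URIs using it."""
    index = defaultdict(set)
    for line in ntriples_lines:
        match = TRIPLE.match(line.strip())
        if not match or match.group(2) not in LABEL_PREDICATES:
            continue  # skip non-label triples, blank lines, and comments
        uri, _, literal = match.groups()
        for token in re.findall(r"\w+", literal.lower()):
            index[token].add(uri)
    return index


if __name__ == "__main__":
    lines = [
        '<http://id.worldcat.org/fast/31622> '
        '<http://www.w3.org/2004/02/skos/core#prefLabel> "Twain, Mark, 1835-1910" .',
        '<http://id.worldcat.org/fast/365563> '
        '<http://www.w3.org/2004/02/skos/core#prefLabel> "Twain, Shania" .',
    ]
    print(sorted(build_index(lines)["twain"]))
```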


UI Access to Cache Server

http://services.ld4l.org/ld4l_services/loc_name.jsp

Downloads

LoC: http://id.loc.gov/download/ (n-triples OR rdf-xml)

OCLC FAST: http://www.oclc.org/research/themes/data-science/fast/download.html (n-triples)

GeoNames: http://www.geonames.org/ontology/documentation.html (custom format – see notes for processing)

AGROVOC: https://aims-fao.atlassian.net/wiki/spaces/AGV/pages/2949126/Releases (n-triples OR rdf-xml)

NALT: https://agclass.nal.usda.gov/download.shtml (rdf-xml)

DBpedia: http://wiki.dbpedia.org/downloads-2016-04

Potential Options for Reconciliation

• VIAF for name reconciliation – we are doing some work with this

• Wikidata – I've heard that they are working on reconciliation issues, but haven't yet explored this in depth

  • Intro Video (3 hrs)

  • API Access

  • SPARQL – user manual

  • federated queries with other authorities

Doing a Google search for "linked data reconciliation" returns a large number of articles and presentations on this concept.

Links to Code & More

• QA Server - Code for a small app that provides the Questioning Authority normalization layer

• Linked Data Authorities - Configurations that can be used with QA Server

• LD4L Services - UI access to Cache Server

• VitroLib - Code for the VitroLib cataloging tool


VitroLib Demo

What just happened

Query = animation

VitroLib Search Service: translate to QA service request

Questioning Authority: MAGIC (To Be Explained); search LOC Genre Form data

LOC Genre Forms

Result with context:

uri: http://id.loc.gov/.../gf2011026141
label: "Clay animation television programs"
context:
  "Alternate Label": [
    "Claymation television programs",
    "Sculptmation television programs"
  ] …

Result with altLabelList:

uri: http://id.loc.gov/.../gf2011026141
label: "Clay animation television programs"
altLabelList: [
  "Claymation television programs",
  "Sculptmation television programs"
] …
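The two result shapes on this slide differ mainly in where the alternate labels live: nested under a context entry named "Alternate Label" in one, and in a flat altLabelList in the other. Assuming the first is what Questioning Authority returns and the second is what the VitroLib Search Service hands to its UI (the slide does not say so explicitly), the translation could look like the sketch below. The function name `to_vitrolib` is hypothetical, and the URI is abbreviated exactly as in the slide text.

```python
def to_vitrolib(qa_result):
    """Flatten a result with nested context into one with an altLabelList field."""
    context = qa_result.get("context", {})
    return {
        "uri": qa_result["uri"],
        "label": qa_result["label"],
        # Copy the list so later changes to the input do not leak into the output.
        "altLabelList": list(context.get("Alternate Label", [])),
    }


if __name__ == "__main__":
    qa_result = {
        "uri": "http://id.loc.gov/.../gf2011026141",  # abbreviated as on the slide
        "label": "Clay animation television programs",
        "context": {
            "Alternate Label": [
                "Claymation television programs",
                "Sculptmation television programs",
            ]
        },
    }
    print(to_vitrolib(qa_result))
```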

Hyrax Demo

Autocomplete: Saving String and URI

Authority: OCLC FAST; Subauthority: PersonName

Selected String and URI

Saves both string and URI

Selecting a Term using Lookup with Context

Getting more from the same authority

Getting more from other authorities

Architecture

Technical Motivation

• Linked data provides…

  • URIs that identify specific terms (as opposed to ambiguity of using strings)

  • Reconciliation to relate terms that are defined in separate authorities

• Goals of implementation…

  • Provide a single process to access many authorities

  • Provide efficient and reliable access to authorities

  • Provide a means for disambiguation that empowers library staff to make the most accurate selections


Page 9: Linking the Data: Building effective Authority and …Linking the Data: Building Effective Authority and Identity Lookup Huda Khan and E. Lynette Rayle Cornell University Collaborators:

9

What just happened

Questioning Authority

MAGIC (To Be Explained)

VitroLib Search Service

LOC Genre Forms

Search LOC Genre Form

data

Query = animation

Translate to QA Service

Request

urihttpidlocgovgf2011026141

label ldquoClay animation television programsrdquo

context

ldquoAlternate Labelrdquo [

ldquoClaymation television programsrdquo

ldquoSculptmation television programsrdquo

] hellip

urihttpidlocgovgf2011026141

label ldquoClay animation television programsrdquo

altLabelList [

ldquoClaymation television programsrdquo

ldquoSculptmation television programsrdquo

] hellip

23

Hyrax Demo

Autocomplete Saving String and URI

Authority OCLC FAST Subauthority PersonName

Selected String and URI

Saves both string and URI

Selecting a Term using

Lookup with Context

26

Selecting a Term using

Lookup with Context

27

Getting more from the same authority

Getting more from other authorities

30

Architecture

Technical Motivation

bull Linked data provideshellip

bull URIs that identify specific terms (as opposed to ambiguity of using

strings)

bull Reconciliation to relate terms that are defined in separate authorities

bull Goals of implementationhellip

bull Provide a single process to access many authorities

bull Provide efficient and reliable access to authorities

bull Provide a means for disambiguation that empowers library staff to

make the most accurate selections

First Set of Challenges

1 Finding Documentation

2 Linked Data Access API eg no support partial support requires login credentials

sparql query endpoint only

3 Varying Results Formats eg rdf-xml json-ld turtle n-triples etc

4 Varying Ontologies eg SKOS schemaorg madsrdf dbpedia geonames

Multi-Server Architecture

QA ndash normalize RDF returned from an authority

Multi-Server Architecture

QA ndash normalize RDF returned from an authority

Direct Access

of External

Authority

HyraxVitrolib ndash UI for selecting an entry from an authority

Multi-Server Architecture

QA ndash normalize RDF returned from an authority

httplocalhost3000qasearchlinked_data

oclc_fastpersonal_nameq=twainamp

maximumRecords=2

Direct Access

of External

Authority

HyraxVitrolib ndash UI for selecting an entry from an authority

Multi-Server Architecture

QA ndash normalize RDF returned from an authority

httplocalhost3000qasearchlinked_data

oclc_fastpersonal_nameq=twainamp

maximumRecords=2

httpexperimentalworldcatorgfast

searchquery=oclcpersonalName+22twain22

ampsortKeys=usageampmaximumRecords=2

Direct Access

of External

Authority

HyraxVitrolib ndash UI for selecting an entry from an authority

Multi-Server Architecture

QA ndash normalize RDF returned from an authority

httplocalhost3000qasearchlinked_data

oclc_fastpersonal_nameq=twainamp

maximumRecords=2

httpexperimentalworldcatorgfast

searchquery=oclcpersonalName+22twain22

ampsortKeys=usageampmaximumRecords=2

lthttpidworldcatorgfast31622gt

a schemaPerson

dctermsidentifier 31622

skosprefLabel Twain Mark 1835-1910

skosaltLabel Make Teviin 1835-1910

Make Tuwen 1835-1910

lthttpidworldcatorgfast365563gt

a schemaPerson

dctermsidentifier 365563

skosprefLabel Twain Shania

skosaltLabel Twain Eilleen

Edwards Eilleen

Direct Access

of External

Authority

HyraxVitrolib ndash UI for selecting an entry from an authority

Multi-Server Architecture

QA ndash normalize RDF returned from an authority

httplocalhost3000qasearchlinked_data

oclc_fastpersonal_nameq=twainamp

maximumRecords=2

[urihttpidworldcatorgfast31622

id31622 labelTwain Mark 1835-1910

urihttpidworldcatorgfast365563

id365563labelTwain Shania ]

httpexperimentalworldcatorgfast

searchquery=oclcpersonalName+22twain22

ampsortKeys=usageampmaximumRecords=2

lthttpidworldcatorgfast31622gt

a schemaPerson

dctermsidentifier 31622

skosprefLabel Twain Mark 1835-1910

skosaltLabel Make Teviin 1835-1910

Make Tuwen 1835-1910

lthttpidworldcatorgfast365563gt

a schemaPerson

dctermsidentifier 365563

skosprefLabel Twain Shania

skosaltLabel Twain Eilleen

Edwards Eilleen

Direct Access

of External

Authority

HyraxVitrolib ndash UI for selecting an entry from an authority

Direct Access Query API

Direct against authorityhellip

httpexperimentalworldcatorgfastsearch

query=oclcpersonalName+22twain22

ampmaximumRecords=2

httpapigeonamesorgsearch

q=ithaca

ampmaxRows=2

ampusername=demo

amptype=rdf

httpartemideartuniroma2it8081agrovocrestv1search

query=milk

amplang=en

ampmaxhits=2

Normalized Query API

Through QA normalization layerhellip

httplocalhost3000qasearchlinked_dataoclc_fast

q=twain

ampmaxRecords=2

httplocalhost3000qasearchlinked_datageonames

q=ithaca

ampmaxRecords=2

httplocalhost3000qasearchlinked_dataagrovoc

q=milk

ampmaxRecords=2

amplang=en

Normalized Results

[urihttpidworldcatorgfast31622 id31622 labelTwain Mark 1835-1910 urihttpidworldcatorgfast365563 id365563 labelTwain Shania]

[uri httpswsgeonamesorg2162552 id httpswsgeonamesorg2162552 label Ithaca (AU) uri httpswsgeonamesorg4515289 id httpswsgeonamesorg4515289 label Ithaca (US)]

[uri httpaimsfaoorgaosagrovocc_8602 id httpaimsfaoorgaosagrovocc_8602 label acidophilus milk uri httpaimsfaoorgaosagrovocc_16076 id httpaimsfaoorgaosagrovocc_16076 label buffalo milk]

OCLC FAST GeoNames AgroVoc

Second Set of Challenges

5 Reliability amp Efficiency eg server uptime server load

6 Accuracy eg select results based on usage data lexical match

custom weighting other

7 Order Ranking eg How to order a graph

Cache Server Query Process

One full setup per authority: JSP Query API, Lucene/SOLR Index, Jena-Fuseki Triplestore

1. Request to the JSP Query API: http://services.ld4l.org/ld4l_services/loc_name_batch.jsp?query=ezra%20cornell&maxRecords=10
2. Lucene search for "ezra cornell" (index built with predicate values <skos:prefLabel> and <skos:altLabel>)
3. For each result: extract URI, extract search rank
4. SPARQL query for URI against the Jena-Fuseki triplestore
5. Insert search rank in predicate <http://vivoweb.org/ontology/core#rank>
6. Combine all results
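The query process on this slide can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the in-memory triple list stands in for the Jena-Fuseki triplestore, the token index stands in for the Lucene index, and the example.org URIs, the scoring rule, and all function names are hypothetical rather than taken from the LD4L services code.

```python
from collections import defaultdict

# Hypothetical stand-in for the triplestore: (subject, predicate, object) triples.
TRIPLES = [
    ("http://example.org/auth/n1", "skos:prefLabel", "Cornell, Ezra, 1807-1874"),
    ("http://example.org/auth/n1", "skos:altLabel", "Ezra Cornell"),
    ("http://example.org/auth/n2", "skos:prefLabel", "Cornell University"),
    ("http://example.org/auth/n3", "skos:prefLabel", "Pound, Ezra, 1885-1972"),
]
INDEXED_PREDICATES = {"skos:prefLabel", "skos:altLabel"}
RANK_PREDICATE = "http://vivoweb.org/ontology/core#rank"


def tokenize(text):
    """Lowercase the text and split it on anything that is not a letter or digit."""
    return "".join(c.lower() if c.isalnum() else " " for c in text).split()


def build_index(triples):
    """Index built with predicate values: token -> set of subject URIs."""
    index = defaultdict(set)
    for subject, predicate, value in triples:
        if predicate in INDEXED_PREDICATES:
            for token in tokenize(value):
                index[token].add(subject)
    return index


def search(index, query, max_records=10):
    """Stand-in for the Lucene search: score = number of query tokens matched."""
    scores = defaultdict(int)
    for token in set(tokenize(query)):
        for uri in index.get(token, ()):
            scores[uri] += 1
    ranked = sorted(scores, key=lambda uri: (-scores[uri], uri))
    return ranked[:max_records]


def describe(triples, uri):
    """Stand-in for the SPARQL query for a single URI."""
    return [t for t in triples if t[0] == uri]


def cache_query(triples, index, query, max_records=10):
    """For each hit: fetch its triples, insert the search rank, combine all results."""
    combined = []
    for rank, uri in enumerate(search(index, query, max_records), start=1):
        graph = describe(triples, uri)
        graph.append((uri, RANK_PREDICATE, rank))
        combined.extend(graph)
    return combined


index = build_index(TRIPLES)
results = cache_query(TRIPLES, index, "ezra cornell")
```

Carrying the rank as an extra triple keeps the search order recoverable after the per-URI graphs are merged, since a combined graph has no inherent order.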

UI-QA-Authority

Hyrax/VitroLib - UI for selecting an entry from an authority
  Request to QA: http://localhost:3000/qa/search/linked_data/oclc_fast/personal_name?q=twain&maximumRecords=2
  Response from QA: [{"uri": "http://id.worldcat.org/fast/31622", "id": "31622", "label": "Twain, Mark, 1835-1910"}, {"uri": "http://id.worldcat.org/fast/365563", "id": "365563", "label": "Twain, Shania"}]

QA - normalize RDF returned from an authority (Active-Triples)
  Request to the authority: http://experimental.worldcat.org/fast/search?query=oclc.personalName+%22twain%22&sortKeys=usage&maximumRecords=2
  Response from the authority: RDF of search results

Sources queried by QA
• Direct Access of External Authority
• LDF Cache (Marmotta or Blazegraph)
• Jena-Fuseki-Lucene Cache (search of cache performed via Lucene index)

Third Set of Challenges

8. Disambiguation through better context, e.g., expand from just prefLabel to... prefLabel, altLabel, birth/death dates, occupation, etc.
9. Reconciliation across multiple sources, e.g., match LoC URI to OCLC FAST URI
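One simple way to approach challenge 9 is to pair entries whose normalized labels agree. The sketch below is an assumption-laden illustration, not the project's reconciliation method: exact matching on a normalized label is a deliberately naive technique, the example.org URI on the LoC side is a placeholder, and the function names are hypothetical. Only the two FAST URIs and labels come from the slides.

```python
import unicodedata


def label_key(label):
    """Lowercase, strip accents and punctuation so near-identical headings compare equal."""
    decomposed = unicodedata.normalize("NFKD", label)
    kept = (c for c in decomposed if not unicodedata.combining(c))
    spaced = "".join(c.lower() if c.isalnum() else " " for c in kept)
    return " ".join(spaced.split())


def reconcile(source_a, source_b):
    """Map each URI in source_a to the URI in source_b with the same normalized label."""
    by_key = {label_key(label): uri for uri, label in source_b.items()}
    matches = {}
    for uri, label in source_a.items():
        key = label_key(label)
        if key in by_key:
            matches[uri] = by_key[key]
    return matches


# Placeholder LoC-side entry (hypothetical URI); FAST entries as shown on the slides.
loc = {"http://example.org/loc/A": "Twain, Mark, 1835-1910."}
fast = {
    "http://id.worldcat.org/fast/31622": "Twain, Mark, 1835-1910",
    "http://id.worldcat.org/fast/365563": "Twain, Shania",
}
matches = reconcile(loc, fast)
```

Label matching alone cannot separate two people who share a heading, which is why the added context from challenge 8 (dates, occupation) matters for any real reconciliation.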

What's next

Addressing Architectural Challenges
• Generalize process for accessing context on the cache server and in the normalization layer
• Multi-authority search and reconciliation
• Address the need for cache refresh
• Mirrored cache servers

User Experience and Design
• User-centered Design
• Observe, listen, learn, design, evaluate, iterate
• Iteratively design and evaluate UI for lookup/authorities with catalogers
• Search result ranking/ordering/filtering for catalogers
• Additional UI platforms, e.g., FOLIO

Questions

http://tinyurl.com/ld4l-auth-access

Appendix for Challenges 1-4

Challenge 1: Documentation

LoC: http://id.loc.gov/techcenter/
  C. Harlow notes on reconciling LoC - https://github.com/cmh2166/lc-reconcile
OCLC FAST: https://www.oclc.org/developer/develop/web-services/fast-api/linked-data.en.html
GeoNames: http://www.geonames.org/export/geonames-search.html
AGROVOC: http://aims.fao.org/vest-registry/vocabularies/agrovoc-multilingual-agricultural-thesaurus
  swagger config: https://github.com/NatLibFi/Skosmos/blob/master/swagger.json
NALT: https://agclass.nal.usda.gov/
DBpedia: http://wiki.dbpedia.org/OnlineAccess#1.2%20Public%20Faceted%20Web%20Service%20Interface

Challenge 2: Linked Data Access API

LoC
  for Search Query: not supported
  for Term Fetch: URI
OCLC FAST
  for Search Query: http://experimental.worldcat.org/fast/search?query=subauth+all+%22query%22&sortKeys=usage&maximumRecords=maximumRecords
  for Term Fetch: URI
GeoNames
  for Search Query: http://api.geonames.org/search?q=query&maxRows=maxRows&username=username&type=rdf
  for Term Fetch: URI
AGROVOC
  for Search Query: http://artemide.art.uniroma2.it:8081/agrovoc/rest/v1/search?query=query&lang=lang
  for Term Fetch: http://artemide.art.uniroma2.it:8081/agrovoc/rest/v1/data?uri=http://aims.fao.org/aos/agrovoc/term_id
NALT
  for Search Query: http://skosmos.library.cornell.edu/rest/v1/nalt/search?query=query&lang=lang
  for Term Fetch: http://skosmos.library.cornell.edu/rest/v1/nalt/data?uri=term_uri
DBpedia

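Challenge 3: Varying Results Formats

LoC
  for Search Query: not supported
  for Term Fetch: rdf-xml
OCLC FAST
  for Search Query: rdf-xml
  for Term Fetch: rdf-xml
GeoNames
  for Search Query: rdf-xml
  for Term Fetch: rdf-xml
AGROVOC
  for Search Query: json-ld
  for Term Fetch: rdf-xml, json-ld, turtle
NALT
  for Search Query: json-ld
  for Term Fetch: rdf-xml, json-ld, turtle
DBpedia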

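Challenge 4: Varying Ontologies

LoC
  Primary Ontology: madsrdf, SKOS
  Flat vs Navigation required: navigation required
OCLC FAST
  Primary Ontology: schema.org, SKOS
  Flat vs Navigation required: flat
GeoNames
  Primary Ontology: geonames
  Flat vs Navigation required: flat, hierarchical
AGROVOC
  Primary Ontology: SKOS
  Flat vs Navigation required: flat, hierarchical
NALT
  Primary Ontology: SKOS
  Flat vs Navigation required: flat, hierarchical
DBpedia
  Primary Ontology: dbpedia
  Flat vs Navigation required: flat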

Configurations for Questioning Authority

LoC: https://github.com/ld4l-labs/linked_data_authorities/tree/master/qa_loc/config/authorities/linked_data
OCLC FAST: https://github.com/ld4l-labs/linked_data_authorities/tree/master/qa_oclcfast/config/authorities/linked_data
GeoNames: https://github.com/ld4l-labs/linked_data_authorities/tree/master/qa_geonames/config/authorities/linked_data
AGROVOC: https://github.com/ld4l-labs/linked_data_authorities/tree/master/qa_agrovoc/config/authorities/linked_data
NALT: https://github.com/ld4l-labs/linked_data_authorities/tree/master/qa_nalt/config/authorities/linked_data
DBpedia: https://github.com/ld4l-labs/linked_data_authorities/tree/master/qa_dbpedia/config/authorities/linked_data

Appendix for Challenges 5-7

Creating a Cache Server

Hardware
• 8-core, 64 GB, 3 GHz Mac Pro (late 2013), macOS Sierra (10.12.6)
• 32 TB Pegasus-2 Thunderbolt RAID configured as RAID-5

Triplestore
• Apache Jena Fuseki 2.4.0 provides SPARQL endpoint
• Apache Tomcat 9.0 runs custom web application(s)
• Apache Lucene 3.6 provides search interface

Customizations
• custom per-data-source JSP web application provides search/browse/download functionality
• custom (generic) SPARQL Tag Library provides API for web apps (available at https://github.com/eichmann/lod-utilities)
• custom (generic) Lucene Tag Library provides API for web apps

Loading a New Vocabulary
• download RDF
• if necessary, convert to n-triples (required for GeoNames data, for instance)
• use tdbloader2 to populate triplestore
• configure Fuseki server(s) with triplestore details
• create new JSP project in Eclipse
• write one or more indexer programs that populate Lucene indices, and run indexer(s)
• write search/browse/download application logic using the SPARQL and Lucene tags
• package project as .war
• deploy to Apache Tomcat server(s)
• add new service to Apache HTTPD virtual host specification
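The indexer step in the list above can be illustrated with a small Python sketch that reads N-Triples lines and indexes label literals by token. It is a simplified stand-in, not the project's Lucene indexer: the regular expression only handles simple literal triples, the sample lines are illustrative rather than copied from an AGROVOC download, and the function and variable names are hypothetical.

```python
import re
from collections import defaultdict

LABEL_PREDICATES = {
    "http://www.w3.org/2004/02/skos/core#prefLabel",
    "http://www.w3.org/2004/02/skos/core#altLabel",
}

# Matches only simple lines of the form: <s> <p> "literal"(@lang | ^^<datatype>)? .
LITERAL_TRIPLE = re.compile(
    r'^<([^>]+)>\s+<([^>]+)>\s+"((?:[^"\\]|\\.)*)"(?:@[\w-]+|\^\^<[^>]+>)?\s*\.\s*$'
)


def index_labels(lines):
    """Build token -> set of subject URIs from prefLabel/altLabel literals."""
    index = defaultdict(set)
    for line in lines:
        match = LITERAL_TRIPLE.match(line.strip())
        if not match or match.group(2) not in LABEL_PREDICATES:
            continue  # skip non-literal objects and predicates that are not indexed
        for token in re.findall(r"\w+", match.group(3).lower()):
            index[token].add(match.group(1))
    return index


# Illustrative sample lines (assumed shape of the converted n-triples data).
SAMPLE = [
    '<http://aims.fao.org/aos/agrovoc/c_16076> <http://www.w3.org/2004/02/skos/core#prefLabel> "buffalo milk"@en .',
    '<http://aims.fao.org/aos/agrovoc/c_8602> <http://www.w3.org/2004/02/skos/core#prefLabel> "acidophilus milk"@en .',
    '<http://aims.fao.org/aos/agrovoc/c_8602> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://www.w3.org/2004/02/skos/core#Concept> .',
]
label_index = index_labels(SAMPLE)
```

Converting every source to n-triples first keeps an indexer like this line-oriented, so one program shape can be reused across vocabularies.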

UI Access to Cache Server

http://services.ld4l.org/ld4l_services/loc_name.jsp

Downloads

LoC: http://id.loc.gov/download/ (n-triples OR rdf-xml)
OCLC FAST: http://www.oclc.org/research/themes/data-science/fast/download.html (n-triples)
GeoNames: http://www.geonames.org/ontology/documentation.html (custom format - see notes for processing)
AGROVOC: https://aims-fao.atlassian.net/wiki/spaces/AGV/pages/2949126/Releases (n-triples OR rdf-xml)
NALT: https://agclass.nal.usda.gov/download.shtml (rdf-xml)
DBpedia: http://wiki.dbpedia.org/downloads-2016-04

Potential Options for Reconciliation

• VIAF for name reconciliation - we are doing some work with this
• Wikidata - I've heard that they are working on reconciliation issues but haven't yet explored in depth
  • Intro Video (3 hrs)
  • API Access
  • SPARQL - user manual
  • federated queries with other authorities

Doing a Google search for "linked data reconciliation" returns a large number of articles and presentations on this concept.

Links to Code & More

• QA Server - Code for a small app that provides the Questioning Authority normalization layer
• Linked Data Authorities - Configurations that can be used with QA Server
• LD4L Services - UI access to Cache Server
• VitroLib - Code for the VitroLib cataloging tool


What just happened

VitroLib Search Service
• Search LOC Genre Form data, Query = animation
• Translate to QA Service Request

Questioning Authority
• MAGIC (To Be Explained)

LOC Genre Forms

uri: http://id.loc.gov/gf2011026141
label: "Clay animation television programs"
context: "Alternate Label": ["Claymation television programs", "Sculptmation television programs"] ...

uri: http://id.loc.gov/gf2011026141
label: "Clay animation television programs"
altLabelList: ["Claymation television programs", "Sculptmation television programs"] ...

Hyrax Demo

Autocomplete Saving String and URI
• Authority: OCLC FAST, Subauthority: PersonName
• Selected String and URI
• Saves both string and URI

Selecting a Term using Lookup with Context

Getting more from the same authority

Getting more from other authorities

Architecture

Technical Motivation

• Linked data provides...
  • URIs that identify specific terms (as opposed to the ambiguity of using strings)
  • Reconciliation to relate terms that are defined in separate authorities
• Goals of implementation...
  • Provide a single process to access many authorities
  • Provide efficient and reliable access to authorities
  • Provide a means for disambiguation that empowers library staff to make the most accurate selections

First Set of Challenges

1. Finding Documentation
2. Linked Data Access API, e.g., no support, partial support, requires login credentials, SPARQL query endpoint only
3. Varying Results Formats, e.g., rdf-xml, json-ld, turtle, n-triples, etc.
4. Varying Ontologies, e.g., SKOS, schema.org, madsrdf, dbpedia, geonames

Multi-Server Architecture

Hyrax/VitroLib - UI for selecting an entry from an authority
QA - normalize RDF returned from an authority
Direct Access of External Authority

1. Request from the UI to QA:
   http://localhost:3000/qa/search/linked_data/oclc_fast/personal_name?q=twain&maximumRecords=2
2. Request from QA to the external authority:
   http://experimental.worldcat.org/fast/search?query=oclc.personalName+%22twain%22&sortKeys=usage&maximumRecords=2
3. RDF returned by the authority:
   <http://id.worldcat.org/fast/31622>
     a schema:Person ;
     dcterms:identifier "31622" ;
     skos:prefLabel "Twain, Mark, 1835-1910" ;
     skos:altLabel "Make Teviin 1835-1910", "Make Tuwen 1835-1910" .
   <http://id.worldcat.org/fast/365563>
     a schema:Person ;
     dcterms:identifier "365563" ;
     skos:prefLabel "Twain, Shania" ;
     skos:altLabel "Twain, Eilleen", "Edwards, Eilleen" .
4. Normalized results returned by QA to the UI:
   [{"uri": "http://id.worldcat.org/fast/31622", "id": "31622", "label": "Twain, Mark, 1835-1910"}, {"uri": "http://id.worldcat.org/fast/365563", "id": "365563", "label": "Twain, Shania"}]
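The normalization step on this slide, from authority RDF to a flat list of uri/id/label entries, can be sketched as follows. This is a minimal illustration and not the Questioning Authority implementation: triples are plain Python tuples instead of parsed RDF, and the `FAST_CONFIG` mapping and `normalize` function are hypothetical names standing in for a per-authority configuration.

```python
SKOS_PREF_LABEL = "http://www.w3.org/2004/02/skos/core#prefLabel"
DCTERMS_IDENTIFIER = "http://purl.org/dc/terms/identifier"

# Triples transcribed from the FAST example on the slide (altLabels omitted).
FAST_TRIPLES = [
    ("http://id.worldcat.org/fast/31622", DCTERMS_IDENTIFIER, "31622"),
    ("http://id.worldcat.org/fast/31622", SKOS_PREF_LABEL, "Twain, Mark, 1835-1910"),
    ("http://id.worldcat.org/fast/365563", DCTERMS_IDENTIFIER, "365563"),
    ("http://id.worldcat.org/fast/365563", SKOS_PREF_LABEL, "Twain, Shania"),
]

# Hypothetical per-authority configuration: which predicate carries which field.
FAST_CONFIG = {"id": DCTERMS_IDENTIFIER, "label": SKOS_PREF_LABEL}


def normalize(triples, config):
    """Collapse triples into one {uri, id, label} entry per subject, in first-seen order."""
    results = {}
    for subject, predicate, value in triples:
        # Default the id to the URI, as the GeoNames and AgroVoc results on the slides do.
        entry = results.setdefault(subject, {"uri": subject, "id": subject, "label": ""})
        if predicate == config.get("id"):
            entry["id"] = value
        elif predicate == config.get("label"):
            entry["label"] = value
    return list(results.values())


normalized = normalize(FAST_TRIPLES, FAST_CONFIG)
```

Keeping the ontology differences in a small mapping means that supporting another authority is mostly a matter of supplying a different mapping, which is the role the linked data authority configurations play for QA.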

Direct Access Query API

Direct against authority...

http://experimental.worldcat.org/fast/search?query=oclc.personalName+%22twain%22&maximumRecords=2

http://api.geonames.org/search?q=ithaca&maxRows=2&username=demo&type=rdf

http://artemide.art.uniroma2.it:8081/agrovoc/rest/v1/search?query=milk&lang=en&maxhits=2

Normalized Query API

Through QA normalization layer...

http://localhost:3000/qa/search/linked_data/oclc_fast?q=twain&maxRecords=2

http://localhost:3000/qa/search/linked_data/geonames?q=ithaca&maxRecords=2

http://localhost:3000/qa/search/linked_data/agrovoc?q=milk&maxRecords=2&lang=en
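The contrast between the direct and normalized query styles can be made concrete with a small URL-building sketch. The endpoints and parameter names are the ones shown on the slides; the `DIRECT` table and both helper functions are hypothetical names introduced here for illustration, and nothing in the sketch contacts a server.

```python
from urllib.parse import urlencode

QA_BASE = "http://localhost:3000/qa/search/linked_data"

# Direct access: every authority has its own endpoint and parameter names.
DIRECT = {
    "oclc_fast": (
        "http://experimental.worldcat.org/fast/search",
        lambda q, n: {"query": f'oclc.personalName "{q}"', "maximumRecords": n},
    ),
    "geonames": (
        "http://api.geonames.org/search",
        lambda q, n: {"q": q, "maxRows": n, "username": "demo", "type": "rdf"},
    ),
    "agrovoc": (
        "http://artemide.art.uniroma2.it:8081/agrovoc/rest/v1/search",
        lambda q, n: {"query": q, "lang": "en", "maxhits": n},
    ),
}


def direct_search_url(authority, query, max_records=2):
    """Build the authority-specific search URL."""
    base, make_params = DIRECT[authority]
    return f"{base}?{urlencode(make_params(query, max_records))}"


def qa_search_url(authority, query, max_records=2, **extra):
    """Build the normalized search URL: same shape for every authority."""
    params = {"q": query, "maxRecords": max_records, **extra}
    return f"{QA_BASE}/{authority}?{urlencode(params)}"


print(direct_search_url("oclc_fast", "twain"))
print(qa_search_url("oclc_fast", "twain"))
print(qa_search_url("agrovoc", "milk", lang="en"))
```

A client written against `qa_search_url` only needs to know the authority name, while the per-authority differences stay in one table behind the normalization layer.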

Normalized Results

OCLC FAST
[{"uri": "http://id.worldcat.org/fast/31622", "id": "31622", "label": "Twain, Mark, 1835-1910"}, {"uri": "http://id.worldcat.org/fast/365563", "id": "365563", "label": "Twain, Shania"}]

GeoNames
[{"uri": "http://sws.geonames.org/2162552", "id": "http://sws.geonames.org/2162552", "label": "Ithaca (AU)"}, {"uri": "http://sws.geonames.org/4515289", "id": "http://sws.geonames.org/4515289", "label": "Ithaca (US)"}]

AgroVoc
[{"uri": "http://aims.fao.org/aos/agrovoc/c_8602", "id": "http://aims.fao.org/aos/agrovoc/c_8602", "label": "acidophilus milk"}, {"uri": "http://aims.fao.org/aos/agrovoc/c_16076", "id": "http://aims.fao.org/aos/agrovoc/c_16076", "label": "buffalo milk"}]


Page 11: Linking the Data: Building effective Authority and …Linking the Data: Building Effective Authority and Identity Lookup Huda Khan and E. Lynette Rayle Cornell University Collaborators:

23

Hyrax Demo

Autocomplete Saving String and URI

Authority OCLC FAST Subauthority PersonName

Selected String and URI

Saves both string and URI

Selecting a Term using

Lookup with Context

26

Selecting a Term using

Lookup with Context

27

Getting more from the same authority

Getting more from other authorities

30

Architecture

Technical Motivation

bull Linked data provideshellip

bull URIs that identify specific terms (as opposed to ambiguity of using

strings)

bull Reconciliation to relate terms that are defined in separate authorities

bull Goals of implementationhellip

bull Provide a single process to access many authorities

bull Provide efficient and reliable access to authorities

bull Provide a means for disambiguation that empowers library staff to

make the most accurate selections

First Set of Challenges

1 Finding Documentation

2 Linked Data Access API eg no support partial support requires login credentials

sparql query endpoint only

3 Varying Results Formats eg rdf-xml json-ld turtle n-triples etc

4 Varying Ontologies eg SKOS schemaorg madsrdf dbpedia geonames

Multi-Server Architecture

QA ndash normalize RDF returned from an authority

Multi-Server Architecture

QA ndash normalize RDF returned from an authority

Direct Access

of External

Authority

HyraxVitrolib ndash UI for selecting an entry from an authority

Multi-Server Architecture

QA ndash normalize RDF returned from an authority

httplocalhost3000qasearchlinked_data

oclc_fastpersonal_nameq=twainamp

maximumRecords=2

Direct Access

of External

Authority

HyraxVitrolib ndash UI for selecting an entry from an authority

Multi-Server Architecture

QA ndash normalize RDF returned from an authority

httplocalhost3000qasearchlinked_data

oclc_fastpersonal_nameq=twainamp

maximumRecords=2

httpexperimentalworldcatorgfast

searchquery=oclcpersonalName+22twain22

ampsortKeys=usageampmaximumRecords=2

Direct Access

of External

Authority

HyraxVitrolib ndash UI for selecting an entry from an authority

Multi-Server Architecture

QA ndash normalize RDF returned from an authority

httplocalhost3000qasearchlinked_data

oclc_fastpersonal_nameq=twainamp

maximumRecords=2

httpexperimentalworldcatorgfast

searchquery=oclcpersonalName+22twain22

ampsortKeys=usageampmaximumRecords=2

lthttpidworldcatorgfast31622gt

a schemaPerson

dctermsidentifier 31622

skosprefLabel Twain Mark 1835-1910

skosaltLabel Make Teviin 1835-1910

Make Tuwen 1835-1910

lthttpidworldcatorgfast365563gt

a schemaPerson

dctermsidentifier 365563

skosprefLabel Twain Shania

skosaltLabel Twain Eilleen

Edwards Eilleen

Direct Access

of External

Authority

HyraxVitrolib ndash UI for selecting an entry from an authority

Multi-Server Architecture

QA ndash normalize RDF returned from an authority

httplocalhost3000qasearchlinked_data

oclc_fastpersonal_nameq=twainamp

maximumRecords=2

[urihttpidworldcatorgfast31622

id31622 labelTwain Mark 1835-1910

urihttpidworldcatorgfast365563

id365563labelTwain Shania ]

httpexperimentalworldcatorgfast

searchquery=oclcpersonalName+22twain22

ampsortKeys=usageampmaximumRecords=2

lthttpidworldcatorgfast31622gt

a schemaPerson

dctermsidentifier 31622

skosprefLabel Twain Mark 1835-1910

skosaltLabel Make Teviin 1835-1910

Make Tuwen 1835-1910

lthttpidworldcatorgfast365563gt

a schemaPerson

dctermsidentifier 365563

skosprefLabel Twain Shania

skosaltLabel Twain Eilleen

Edwards Eilleen

Direct Access

of External

Authority

HyraxVitrolib ndash UI for selecting an entry from an authority

Direct Access Query API

Direct against authorityhellip

httpexperimentalworldcatorgfastsearch

query=oclcpersonalName+22twain22

ampmaximumRecords=2

httpapigeonamesorgsearch

q=ithaca

ampmaxRows=2

ampusername=demo

amptype=rdf

httpartemideartuniroma2it8081agrovocrestv1search

query=milk

amplang=en

ampmaxhits=2

Normalized Query API

Through QA normalization layerhellip

httplocalhost3000qasearchlinked_dataoclc_fast

q=twain

ampmaxRecords=2

httplocalhost3000qasearchlinked_datageonames

q=ithaca

ampmaxRecords=2

httplocalhost3000qasearchlinked_dataagrovoc

q=milk

ampmaxRecords=2

amplang=en

Normalized Results

[urihttpidworldcatorgfast31622 id31622 labelTwain Mark 1835-1910 urihttpidworldcatorgfast365563 id365563 labelTwain Shania]

[uri httpswsgeonamesorg2162552 id httpswsgeonamesorg2162552 label Ithaca (AU) uri httpswsgeonamesorg4515289 id httpswsgeonamesorg4515289 label Ithaca (US)]

[uri httpaimsfaoorgaosagrovocc_8602 id httpaimsfaoorgaosagrovocc_8602 label acidophilus milk uri httpaimsfaoorgaosagrovocc_16076 id httpaimsfaoorgaosagrovocc_16076 label buffalo milk]

OCLC FAST GeoNames AgroVoc

Second Set of Challenges

5 Reliability amp Efficiency eg server uptime server load

6 Accuracy eg select results based on usage data lexical match

custom weighting other

7 Order Ranking eg How to order a graph

Cache Server Query Process

JSP Query API

Jena-Fuseki

Triplestore

One full setup per authority

LuceneSOLR

Index

Cache Server Query Process

JSP Query API

httpservicesld4lorgld4l_servicesloc_name_batchjspquery=ezra20cornellampmaxRecords=10

Jena-Fuseki

Triplestore

LuceneSOLR

Index

One full setup per authority

Cache Server Query Process

JSP Query API

lucene search for ezra cornell

index built with predicate values

ltskosprefLabelgt

ltskosaltLabelgt

Jena-Fuseki

Triplestore

LuceneSOLR

Index

httpservicesld4lorgld4l_servicesloc_name_batchjspquery=ezra20cornellampmaxRecords=10

One full setup per authority

Cache Server Query Process

JSP Query API

extract search rank

extract URI

Jena-Fuseki

Triplestore

for each result

LuceneSOLR

Index

lucene search for ezra cornell

index built with predicate values

ltskosprefLabelgt

ltskosaltLabelgt

httpservicesld4lorgld4l_servicesloc_name_batchjspquery=ezra20cornellampmaxRecords=10

One full setup per authority

Cache Server Query Process

JSP Query API

sparql query for URI

Jena-Fuseki

Triplestore

LuceneSOLR

Index

extract search rank

extract URI

for each result

lucene search for ezra cornell

index built with predicate values

ltskosprefLabelgt

ltskosaltLabelgt

httpservicesld4lorgld4l_servicesloc_name_batchjspquery=ezra20cornellampmaxRecords=10

One full setup per authority

Cache Server Query Process

JSP Query API

combine all results

Jena-Fuseki

Triplestore

insert search rank in predicate

lthttpvivoweborgontology

corerankgt

LuceneSOLR

Index

sparql query for URI extract search rank

extract URI

for each result

lucene search for ezra cornell

index built with predicate values

ltskosprefLabelgt

ltskosaltLabelgt

httpservicesld4lorgld4l_servicesloc_name_batchjspquery=ezra20cornellampmaxRecords=10

One full setup per authority

UI-QA-Authority

QA ndash normalize RDF returned from an authority

httplocalhost3000qasearchlinked_data

oclc_fastpersonal_nameq=twainampmaximumRecords=2

[urihttpidworldcatorgfast31622id31622

labelTwain Mark 1835-1910

urihttpidworldcatorgfast365563id365563

labelTwain Shania

httpexperimentalworldcatorgfastsearchquery=o

clcpersonalName+22twain22

ampsortKeys=usageampmaximumRecords=2

RDF of

search

results

Active-Triples

LDF Cache

(Marmotta or

Blazegraph) LDF Cache Jena-Fuseki-

Lucene

Cache

Direct Access

of External

Authority

HyraxVitrolib ndash UI for selecting an entry from an authority

search of cache performed via Lucene index

Third Set of Challenges

8 Disambiguation through better context eg expand from just prefLabel tohellip

preLabel altLabel birthdeath dates occupation etc

9 Reconciliation across multiple sources eg match LoC URI to OCLC FAST URI

53

Whatrsquos next

Addressing Architectural Challenges

bull Generalize process for accessing context on the

cache server and in the normalization layer

bull Multi-authority search and reconciliation

bull Address the need for cache refresh

bull Mirrored cache servers

User Experience and Design

bull User-centered Design

bull Observe listen learn design evaluate iterate

bull Iteratively design and evaluate UI for lookupauthorities

with catalogers

bull Search result rankingorderingfiltering for catalogers

bull Additional UI platforms eg FOLIO

56

Questions

httptinyurlcomld4l-auth-access

Appendix for Challenges 1-4

Challenge 1 Documentation

58

LoC httpidlocgovtechcenter

C Harlow notes on reconciling LoC - httpsgithubcomcmh2166lc-reconcile

OCLC FAST

httpswwwoclcorgdeveloperdevelopweb-servicesfast-apilinked-dataenhtml

GeoNames

httpwwwgeonamesorgexportgeonames-searchhtml

AGROVOC httpaimsfaoorgvest-registryvocabulariesagrovoc-multilingual-agricultural-thesaurus

swagger config httpsgithubcomNatLibFiSkosmosblobmasterswaggerjson

NALT

httpsagclassnalusdagov

DBpedia httpwikidbpediaorgOnlineAccess1220Public20Faceted20Web20Service20Inter

face

Challenge 2 Linked Data Access API

59

for Search Query for Term Fetch

LoC not supported URI

OCLC FAST httpexperimentalworldcatorgfastsearchq

uery=subauth+all+22query22ampsortK

eys=usageampmaximumRecords=maximumR

ecords

URI

GeoNames httpapigeonamesorgsearchq=queryamp

maxRows=maxRowsampusername=userna

meamptype=rdf

URI

AGROVOC httpartemideartuniroma2it8081agrovocr

estv1searchquery=queryamplang=lang

httpartemideartuniroma2it8081agrovo

crestv1datauri=httpaimsfaoorgaosa

grovocterm_id



Autocomplete: Saving String and URI

Authority: OCLC FAST; Subauthority: PersonName

Selected String and URI

Saves both string and URI

Selecting a Term using Lookup with Context

Getting more from the same authority

Getting more from other authorities

Architecture

Technical Motivation

• Linked data provides…

  • URIs that identify specific terms (as opposed to the ambiguity of using strings)

  • Reconciliation to relate terms that are defined in separate authorities

• Goals of implementation…

  • Provide a single process to access many authorities

  • Provide efficient and reliable access to authorities

  • Provide a means for disambiguation that empowers library staff to make the most accurate selections

First Set of Challenges

1. Finding Documentation

2. Linked Data Access API, e.g., no support, partial support, requires login credentials, SPARQL query endpoint only

3. Varying Results Formats, e.g., rdf-xml, json-ld, turtle, n-triples, etc.

4. Varying Ontologies, e.g., SKOS, schema.org, madsrdf, dbpedia, geonames


Multi-Server Architecture

QA – normalize RDF returned from an authority

http://localhost:3000/qa/search/linked_data/oclc_fast/personal_name?q=twain&maximumRecords=2

[{"uri":"http://id.worldcat.org/fast/31622","id":"31622","label":"Twain, Mark, 1835-1910"},
 {"uri":"http://id.worldcat.org/fast/365563","id":"365563","label":"Twain, Shania"}]

http://experimental.worldcat.org/fast/search?query=oclc.personalName+%22twain%22&sortKeys=usage&maximumRecords=2

<http://id.worldcat.org/fast/31622>
  a schema:Person ;
  dcterms:identifier "31622" ;
  skos:prefLabel "Twain, Mark, 1835-1910" ;
  skos:altLabel "Make Teviin, 1835-1910", "Make Tuwen, 1835-1910" .

<http://id.worldcat.org/fast/365563>
  a schema:Person ;
  dcterms:identifier "365563" ;
  skos:prefLabel "Twain, Shania" ;
  skos:altLabel "Twain, Eilleen", "Edwards, Eilleen" .

Direct Access of External Authority

Hyrax/VitroLib – UI for selecting an entry from an authority
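The normalization step on this slide can be sketched roughly as follows. This is not the Questioning Authority implementation; it is a minimal Python illustration that assumes the authority's RDF has already been parsed into (subject, predicate, object) tuples. The function name normalize, the TRIPLES list, and the prefixed predicate strings are assumptions made for the example; the data is transcribed from the FAST example above.

```python
# Triples transcribed from the OCLC FAST example; prefixed names are kept as
# plain strings because no RDF library is used in this sketch.
TRIPLES = [
    ("http://id.worldcat.org/fast/31622", "rdf:type", "schema:Person"),
    ("http://id.worldcat.org/fast/31622", "dcterms:identifier", "31622"),
    ("http://id.worldcat.org/fast/31622", "skos:prefLabel", "Twain, Mark, 1835-1910"),
    ("http://id.worldcat.org/fast/31622", "skos:altLabel", "Make Teviin, 1835-1910"),
    ("http://id.worldcat.org/fast/365563", "rdf:type", "schema:Person"),
    ("http://id.worldcat.org/fast/365563", "dcterms:identifier", "365563"),
    ("http://id.worldcat.org/fast/365563", "skos:prefLabel", "Twain, Shania"),
]


def normalize(triples, label_predicate="skos:prefLabel",
              id_predicate="dcterms:identifier"):
    """Collapse triples into one {uri, id, label} record per subject.

    Records are returned in first-seen subject order. When an authority
    supplies no identifier triple, the id falls back to the URI itself.
    """
    records = {}
    for subject, predicate, obj in triples:
        record = records.setdefault(
            subject, {"uri": subject, "id": subject, "label": ""})
        if predicate == id_predicate:
            record["id"] = obj
        elif predicate == label_predicate and not record["label"]:
            record["label"] = obj
    return list(records.values())


results = normalize(TRIPLES)
```

The label and id predicates are parameters because each authority uses a different ontology; a per-authority configuration would supply them.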

Direct Access Query API

Direct against authority…

http://experimental.worldcat.org/fast/search
  ?query=oclc.personalName+%22twain%22
  &maximumRecords=2

http://api.geonames.org/search
  ?q=ithaca
  &maxRows=2
  &username=demo
  &type=rdf

http://artemide.art.uniroma2.it:8081/agrovoc/rest/v1/search
  ?query=milk
  &lang=en
  &maxhits=2

Normalized Query API

Through QA normalization layer…

http://localhost:3000/qa/search/linked_data/oclc_fast
  ?q=twain
  &maxRecords=2

http://localhost:3000/qa/search/linked_data/geonames
  ?q=ithaca
  &maxRecords=2

http://localhost:3000/qa/search/linked_data/agrovoc
  ?q=milk
  &maxRecords=2
  &lang=en
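To make the contrast between the direct and normalized query APIs concrete, here is a small Python sketch of the translation a normalization layer has to perform. It is an illustration only, not QA's code: the function direct_url and its arguments are hypothetical, while the base URLs and parameter names are the ones shown on the slides (the GeoNames username "demo" is the slide's example value).

```python
from urllib.parse import urlencode


def direct_url(authority, q, max_records, lang="en"):
    """Translate one normalized query (q, max_records, lang) into the
    authority-specific search URL shown in the direct-access examples."""
    if authority == "oclc_fast":
        base = "http://experimental.worldcat.org/fast/search"
        # FAST expects the index name and a quoted term in a single parameter.
        params = {"query": f'oclc.personalName "{q}"',
                  "maximumRecords": max_records}
    elif authority == "geonames":
        base = "http://api.geonames.org/search"
        params = {"q": q, "maxRows": max_records,
                  "username": "demo", "type": "rdf"}
    elif authority == "agrovoc":
        base = "http://artemide.art.uniroma2.it:8081/agrovoc/rest/v1/search"
        params = {"query": q, "lang": lang, "maxhits": max_records}
    else:
        raise ValueError(f"no configuration for authority: {authority}")
    # urlencode turns spaces into '+' and double quotes into %22.
    return f"{base}?{urlencode(params)}"


fast = direct_url("oclc_fast", "twain", 2)
geonames = direct_url("geonames", "ithaca", 2)
agrovoc = direct_url("agrovoc", "milk", 2)
```

The caller only ever supplies a query string and a record limit; the differing parameter names (maximumRecords, maxRows, maxhits) stay hidden behind the one function.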

Normalized Results

OCLC FAST

[{"uri":"http://id.worldcat.org/fast/31622","id":"31622","label":"Twain, Mark, 1835-1910"},
 {"uri":"http://id.worldcat.org/fast/365563","id":"365563","label":"Twain, Shania"}]

GeoNames

[{"uri":"http://sws.geonames.org/2162552/","id":"http://sws.geonames.org/2162552/","label":"Ithaca (AU)"},
 {"uri":"http://sws.geonames.org/4515289/","id":"http://sws.geonames.org/4515289/","label":"Ithaca (US)"}]

AGROVOC

[{"uri":"http://aims.fao.org/aos/agrovoc/c_8602","id":"http://aims.fao.org/aos/agrovoc/c_8602","label":"acidophilus milk"},
 {"uri":"http://aims.fao.org/aos/agrovoc/c_16076","id":"http://aims.fao.org/aos/agrovoc/c_16076","label":"buffalo milk"}]

Second Set of Challenges

5. Reliability & Efficiency, e.g., server uptime, server load

6. Accuracy, e.g., select results based on usage data, lexical match, custom weighting, other

7. Order/Ranking, e.g., how to order a graph


Cache Server Query Process

Components: JSP Query API, Lucene/SOLR Index, Jena-Fuseki Triplestore

One full setup per authority

1. Request to the JSP Query API: http://services.ld4l.org/ld4l_services/loc_name_batch.jsp?query=ezra%20cornell&maxRecords=10

2. Lucene search for "ezra cornell" (index built with predicate values <skos:prefLabel> and <skos:altLabel>)

3. For each result: extract search rank, extract URI

4. SPARQL query for URI against the Jena-Fuseki triplestore

5. Insert search rank in predicate <http://vivoweb.org/ontology/core#rank>

6. Combine all results
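The query process on this slide can be sketched in Python with in-memory stand-ins. Everything below is hypothetical and for illustration: LABEL_INDEX stands in for the Lucene index, TRIPLESTORE for Jena-Fuseki, the example.org URIs and labels are invented, and the token-overlap scoring is a placeholder for Lucene's relevance ranking. Only the rank predicate URI comes from the slide.

```python
import re

RANK_PREDICATE = "http://vivoweb.org/ontology/core#rank"

# Stand-in for the Lucene index: URI -> prefLabel/altLabel values (invented data).
LABEL_INDEX = {
    "http://example.org/names/a": ["Cornell, Ezra", "Ezra Cornell"],
    "http://example.org/names/b": ["Cornell University"],
    "http://example.org/names/c": ["Twain, Mark"],
}

# Stand-in for the triplestore: URI -> (predicate, object) pairs.
TRIPLESTORE = {
    uri: [("skos:prefLabel", labels[0])] for uri, labels in LABEL_INDEX.items()
}


def tokens(text):
    """Lowercased word tokens of a string."""
    return set(re.findall(r"\w+", text.lower()))


def search_index(query, max_records):
    """Return matching URIs, best match first (placeholder for a Lucene search)."""
    wanted = tokens(query)
    scored = []
    for uri, labels in LABEL_INDEX.items():
        score = len(wanted & tokens(" ".join(labels)))
        if score:
            scored.append((-score, uri))
    return [uri for _, uri in sorted(scored)[:max_records]]


def query_cache(query, max_records=10):
    """Search the index, fetch triples per hit, add a rank triple, combine."""
    combined = []
    for rank, uri in enumerate(search_index(query, max_records), start=1):
        # In the real setup this lookup would be a SPARQL query for the URI.
        combined.extend((uri, p, o) for p, o in TRIPLESTORE.get(uri, []))
        combined.append((uri, RANK_PREDICATE, rank))
    return combined


results = query_cache("ezra cornell")
```

Carrying the rank as a triple is what lets the ordering survive when results are returned as an unordered RDF graph, which is the concern raised in challenge 7.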

UI-QA-Authority

QA – normalize RDF returned from an authority

http://localhost:3000/qa/search/linked_data/oclc_fast/personal_name?q=twain&maximumRecords=2

[{"uri":"http://id.worldcat.org/fast/31622","id":"31622","label":"Twain, Mark, 1835-1910"},
 {"uri":"http://id.worldcat.org/fast/365563","id":"365563","label":"Twain, Shania"}]

http://experimental.worldcat.org/fast/search?query=oclc.personalName+%22twain%22&sortKeys=usage&maximumRecords=2

RDF of search results

Active-Triples

LDF Cache (Marmotta or Blazegraph)

LDF Cache

Jena-Fuseki-Lucene Cache

Direct Access of External Authority

Hyrax/VitroLib – UI for selecting an entry from an authority

Search of cache performed via Lucene index

Third Set of Challenges

8. Disambiguation through better context, e.g., expand from just prefLabel to… prefLabel, altLabel, birth/death dates, occupation, etc.

9. Reconciliation across multiple sources, e.g., match LoC URI to OCLC FAST URI

What's next

Addressing Architectural Challenges

• Generalize process for accessing context on the cache server and in the normalization layer

• Multi-authority search and reconciliation

• Address the need for cache refresh

• Mirrored cache servers

User Experience and Design

• User-centered Design

  • Observe, listen, learn, design, evaluate, iterate

• Iteratively design and evaluate UI for lookup/authorities with catalogers

• Search result ranking/ordering/filtering for catalogers

• Additional UI platforms, e.g., FOLIO

Questions

http://tinyurl.com/ld4l-auth-access

Appendix for Challenges 1-4

Challenge 1: Documentation

LoC: http://id.loc.gov/techcenter

  C. Harlow notes on reconciling LoC - https://github.com/cmh2166/lc-reconcile

OCLC FAST: https://www.oclc.org/developer/develop/web-services/fast-api/linked-data.en.html

GeoNames: http://www.geonames.org/export/geonames-search.html

AGROVOC: http://aims.fao.org/vest-registry/vocabularies/agrovoc-multilingual-agricultural-thesaurus

  swagger config: https://github.com/NatLibFi/Skosmos/blob/master/swagger.json

NALT: https://agclass.nal.usda.gov

DBpedia: http://wiki.dbpedia.org/OnlineAccess#1.2%20Public%20Faceted%20Web%20Service%20Interface

Challenge 2: Linked Data Access API

LoC
  For search query: not supported
  For term fetch: URI

OCLC FAST
  For search query: http://experimental.worldcat.org/fast/search?query={subauth}+all+%22{query}%22&sortKeys=usage&maximumRecords={maximumRecords}
  For term fetch: URI

GeoNames
  For search query: http://api.geonames.org/search?q={query}&maxRows={maxRows}&username={username}&type=rdf
  For term fetch: URI

AGROVOC
  For search query: http://artemide.art.uniroma2.it:8081/agrovoc/rest/v1/search?query={query}&lang={lang}
  For term fetch: http://artemide.art.uniroma2.it:8081/agrovoc/rest/v1/data?uri=http://aims.fao.org/aos/agrovoc/{term_id}

NALT
  For search query: http://skosmos.library.cornell.edu/rest/v1/nalt/search?query={query}&lang={lang}
  For term fetch: http://skosmos.library.cornell.edu/rest/v1/nalt/data?uri={term_uri}

DBpedia

Challenge 3: Varying Results Formats

Authority    For Search Query    For Term Fetch
LoC          not supported       rdf-xml
OCLC FAST    rdf-xml             rdf-xml
GeoNames     rdf-xml             rdf-xml
AGROVOC      json-ld             rdf-xml, json-ld, turtle
NALT         json-ld             rdf-xml, json-ld, turtle
DBpedia

Challenge 4: Varying Ontologies

Authority    Primary Ontology    Flat vs. Navigation Required
LoC          madsrdf, SKOS       navigation required
OCLC FAST    schema.org, SKOS    flat
GeoNames     geonames            flat, hierarchical
AGROVOC      SKOS                flat, hierarchical
NALT         SKOS                flat, hierarchical
DBpedia      dbpedia             flat

Configurations for Questioning Authority

LoC: https://github.com/ld4l-labs/linked_data_authorities/tree/master/qa_loc/config/authorities/linked_data

OCLC FAST: https://github.com/ld4l-labs/linked_data_authorities/tree/master/qa_oclcfast/config/authorities/linked_data

GeoNames: https://github.com/ld4l-labs/linked_data_authorities/tree/master/qa_geonames/config/authorities/linked_data

AGROVOC: https://github.com/ld4l-labs/linked_data_authorities/tree/master/qa_agrovoc/config/authorities/linked_data

NALT: https://github.com/ld4l-labs/linked_data_authorities/tree/master/qa_nalt/config/authorities/linked_data

DBpedia: https://github.com/ld4l-labs/linked_data_authorities/tree/master/qa_dbpedia/config/authorities/linked_data

Appendix for Challenges 5-7

Creating a Cache Server

Hardware

• 8-core, 64 GB, 3 GHz Mac Pro (late 2013), macOS Sierra (10.12.6)

• 32 TB Pegasus-2 Thunderbolt RAID configured as RAID-5

Triplestore

• Apache Jena Fuseki 2.4.0 provides SPARQL endpoint

• Apache Tomcat 9.0 runs custom web application(s)

• Apache Lucene 3.6 provides search interface

Customizations

• custom per-data-source JSP web application provides search/browse/download functionality

• custom (generic) SPARQL Tag Library provides API for web apps (available at https://github.com/eichmann/lod-utilities)

• custom (generic) Lucene Tag Library provides API for web apps

Loading a New Vocabulary

• download RDF

• if necessary, convert to n-triples (required for GeoNames data, for instance)

• use tdbloader2 to populate the triplestore

• configure Fuseki server(s) with triplestore details

• create new JSP project in Eclipse

• write one or more indexer programs that populate Lucene indices, and run the indexer(s)

• write search/browse/download application logic using the SPARQL and Lucene tags

• package project as WAR

• deploy to Apache Tomcat server(s)

• add new service to Apache HTTPD virtual host specification
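The tdbloader2 step in the list above could be scripted along these lines. This is a sketch, not the project's tooling: the helper name tdbloader2_command and the example paths are hypothetical, and it assumes Apache Jena's tdbloader2 script is on the PATH and that the input is already in n-triples. The function only builds the argument list; running it is left to the caller.

```python
def tdbloader2_command(tdb_dir, ntriples_files):
    """Build the argument list for Jena's tdbloader2 bulk loader.

    tdb_dir is the directory that will hold the TDB database (--loc);
    ntriples_files are the n-triples dumps to load.
    """
    if not ntriples_files:
        raise ValueError("at least one n-triples file is required")
    return ["tdbloader2", "--loc", str(tdb_dir), *[str(f) for f in ntriples_files]]


# Hypothetical paths, for illustration only.
command = tdbloader2_command("/data/tdb/geonames", ["geonames.nt"])

# To actually run the load (requires Jena to be installed):
#   import subprocess
#   subprocess.run(command, check=True)
```

Keeping the command as a list rather than a shell string avoids quoting problems when dump file names contain spaces.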

UI Access to Cache Server

http://services.ld4l.org/ld4l_services/loc_name.jsp

Downloads

LoC: http://id.loc.gov/download (n-triples OR rdf-xml)

OCLC FAST: http://www.oclc.org/research/themes/data-science/fast/download.html (n-triples)

GeoNames: http://www.geonames.org/ontology/documentation.html (custom format – see notes for processing)

AGROVOC: https://aims-fao.atlassian.net/wiki/spaces/AGV/pages/2949126/Releases (n-triples OR rdf-xml)

NALT: https://agclass.nal.usda.gov/download.shtml (rdf-xml)

DBpedia: http://wiki.dbpedia.org/downloads-2016-04

Potential Options for Reconciliation

• VIAF for name reconciliation – we are doing some work with this

• Wikidata – I've heard that they are working on reconciliation issues but haven't yet explored in depth

  • Intro Video (3hrs)

  • API Access

  • SPARQL – user manual

  • federated queries with other authorities

Doing a Google search for "linked data reconciliation" returns a large number of articles and presentations on this concept.

Links to Code & More

• QA Server - Code for a small app that provides the Questioning Authority normalization layer

• Linked Data Authorities - Configurations that can be used with QA Server

• LD4L Services - UI access to Cache Server

• VitroLib - Code for the VitroLib cataloging tool


Page 14: Linking the Data: Building effective Authority and …Linking the Data: Building Effective Authority and Identity Lookup Huda Khan and E. Lynette Rayle Cornell University Collaborators:

Selecting a Term using

Lookup with Context

26

Selecting a Term using

Lookup with Context

27

Getting more from the same authority

Getting more from other authorities

30

Architecture

Technical Motivation

bull Linked data provideshellip

bull URIs that identify specific terms (as opposed to ambiguity of using

strings)

bull Reconciliation to relate terms that are defined in separate authorities

bull Goals of implementationhellip

bull Provide a single process to access many authorities

bull Provide efficient and reliable access to authorities

bull Provide a means for disambiguation that empowers library staff to

make the most accurate selections

First Set of Challenges

1 Finding Documentation

2 Linked Data Access API eg no support partial support requires login credentials

sparql query endpoint only

3 Varying Results Formats eg rdf-xml json-ld turtle n-triples etc

4 Varying Ontologies eg SKOS schemaorg madsrdf dbpedia geonames

Multi-Server Architecture

QA ndash normalize RDF returned from an authority

Multi-Server Architecture

QA ndash normalize RDF returned from an authority

Direct Access

of External

Authority

HyraxVitrolib ndash UI for selecting an entry from an authority

Multi-Server Architecture

QA ndash normalize RDF returned from an authority

httplocalhost3000qasearchlinked_data

oclc_fastpersonal_nameq=twainamp

maximumRecords=2

Direct Access

of External

Authority

HyraxVitrolib ndash UI for selecting an entry from an authority

Multi-Server Architecture

QA ndash normalize RDF returned from an authority

httplocalhost3000qasearchlinked_data

oclc_fastpersonal_nameq=twainamp

maximumRecords=2

httpexperimentalworldcatorgfast

searchquery=oclcpersonalName+22twain22

ampsortKeys=usageampmaximumRecords=2

Direct Access

of External

Authority

HyraxVitrolib ndash UI for selecting an entry from an authority

Multi-Server Architecture

QA ndash normalize RDF returned from an authority

httplocalhost3000qasearchlinked_data

oclc_fastpersonal_nameq=twainamp

maximumRecords=2

httpexperimentalworldcatorgfast

searchquery=oclcpersonalName+22twain22

ampsortKeys=usageampmaximumRecords=2

lthttpidworldcatorgfast31622gt

a schemaPerson

dctermsidentifier 31622

skosprefLabel Twain Mark 1835-1910

skosaltLabel Make Teviin 1835-1910

Make Tuwen 1835-1910

lthttpidworldcatorgfast365563gt

a schemaPerson

dctermsidentifier 365563

skosprefLabel Twain Shania

skosaltLabel Twain Eilleen

Edwards Eilleen

Direct Access

of External

Authority

HyraxVitrolib ndash UI for selecting an entry from an authority

Multi-Server Architecture

Hyrax/VitroLib - UI for selecting an entry from an authority

  Request sent to QA:
  http://localhost:3000/qa/search/linked_data/oclc_fast/personal_name?q=twain&maximumRecords=2

  Normalized results returned by QA:
  [{"uri":"http://id.worldcat.org/fast/31622","id":"31622","label":"Twain, Mark, 1835-1910"},
   {"uri":"http://id.worldcat.org/fast/365563","id":"365563","label":"Twain, Shania"}]

QA - normalize RDF returned from an authority

  Request sent to the authority:
  http://experimental.worldcat.org/fast/search?query=oclc.personalName+%22twain%22&sortKeys=usage&maximumRecords=2

  RDF returned by the authority:
  <http://id.worldcat.org/fast/31622>
      a schema:Person ;
      dcterms:identifier "31622" ;
      skos:prefLabel "Twain, Mark, 1835-1910" ;
      skos:altLabel "Make Teviin, 1835-1910", "Make Tuwen, 1835-1910" .

  <http://id.worldcat.org/fast/365563>
      a schema:Person ;
      dcterms:identifier "365563" ;
      skos:prefLabel "Twain, Shania" ;
      skos:altLabel "Twain, Eilleen", "Edwards, Eilleen" .

Direct Access of External Authority

Direct Access Query API

Direct against authority...

OCLC FAST:
http://experimental.worldcat.org/fast/search?query=oclc.personalName+%22twain%22&maximumRecords=2

GeoNames:
http://api.geonames.org/search?q=ithaca&maxRows=2&username=demo&type=rdf

AGROVOC:
http://artemide.art.uniroma2.it:8081/agrovoc/rest/v1/search?query=milk&lang=en&maxhits=2

Normalized Query API

Through QA normalization layer...

OCLC FAST:
http://localhost:3000/qa/search/linked_data/oclc_fast?q=twain&maxRecords=2

GeoNames:
http://localhost:3000/qa/search/linked_data/geonames?q=ithaca&maxRecords=2

AGROVOC:
http://localhost:3000/qa/search/linked_data/agrovoc?q=milk&maxRecords=2&lang=en
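The contrast between the direct and normalized query forms can be sketched in a few lines of Python. This is only an illustration of the idea, not the Questioning Authority implementation: the AUTHORITIES table and the build_direct_url function are hypothetical names, and the parameter names and base URLs are taken from the example URLs on the slides.

```python
from urllib.parse import urlencode

# Illustrative per-authority settings, copied from the slide URLs.
# The table layout and function name are assumptions, not QA's own config.
AUTHORITIES = {
    "oclc_fast": {
        "base": "http://experimental.worldcat.org/fast/search",
        "query_param": "query",
        "limit_param": "maximumRecords",
        # FAST expects an index name followed by a quoted term.
        "wrap": lambda term: f'oclc.personalName "{term}"',
        "extra": {},
    },
    "geonames": {
        "base": "http://api.geonames.org/search",
        "query_param": "q",
        "limit_param": "maxRows",
        "wrap": lambda term: term,
        "extra": {"username": "demo", "type": "rdf"},
    },
    "agrovoc": {
        "base": "http://artemide.art.uniroma2.it:8081/agrovoc/rest/v1/search",
        "query_param": "query",
        "limit_param": "maxhits",
        "wrap": lambda term: term,
        "extra": {},
    },
}


def build_direct_url(authority, q, max_records, **passthrough):
    """Translate the normalized (q, maxRecords) pair into an authority URL."""
    cfg = AUTHORITIES[authority]
    params = {
        cfg["query_param"]: cfg["wrap"](q),
        cfg["limit_param"]: max_records,
    }
    params.update(cfg["extra"])      # fixed parameters the authority requires
    params.update(passthrough)       # e.g. lang="en" for AGROVOC
    return f'{cfg["base"]}?{urlencode(params)}'


fast_url = build_direct_url("oclc_fast", "twain", 2)
geonames_url = build_direct_url("geonames", "ithaca", 2)
agrovoc_url = build_direct_url("agrovoc", "milk", 2, lang="en")
print(fast_url)
print(geonames_url)
print(agrovoc_url)
```

The caller only ever supplies the search term and a record limit; everything authority-specific lives in the table, which is the point of the normalization layer.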

Normalized Results

OCLC FAST:
[{"uri":"http://id.worldcat.org/fast/31622","id":"31622","label":"Twain, Mark, 1835-1910"},
 {"uri":"http://id.worldcat.org/fast/365563","id":"365563","label":"Twain, Shania"}]

GeoNames:
[{"uri":"http://sws.geonames.org/2162552/","id":"http://sws.geonames.org/2162552/","label":"Ithaca (AU)"},
 {"uri":"http://sws.geonames.org/4515289/","id":"http://sws.geonames.org/4515289/","label":"Ithaca (US)"}]

AGROVOC:
[{"uri":"http://aims.fao.org/aos/agrovoc/c_8602","id":"http://aims.fao.org/aos/agrovoc/c_8602","label":"acidophilus milk"},
 {"uri":"http://aims.fao.org/aos/agrovoc/c_16076","id":"http://aims.fao.org/aos/agrovoc/c_16076","label":"buffalo milk"}]
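A minimal sketch of the normalization step, assuming the authority's RDF has already been parsed into (subject, predicate, object) tuples. The normalize function is a hypothetical name, and falling back to the URI when no identifier triple exists is an assumption drawn from the GeoNames and AGROVOC results above, where id and uri are the same. The actual QA layer is driven by per-authority configuration rather than hard-coded predicates.

```python
import json

SKOS_PREF = "http://www.w3.org/2004/02/skos/core#prefLabel"
SKOS_ALT = "http://www.w3.org/2004/02/skos/core#altLabel"
DCT_ID = "http://purl.org/dc/terms/identifier"


def normalize(triples, label_predicate=SKOS_PREF, id_predicate=DCT_ID):
    """Reduce RDF triples to a list of {uri, id, label} records."""
    results = {}
    for subject, predicate, obj in triples:
        # The id defaults to the URI until an identifier triple is seen.
        entry = results.setdefault(
            subject, {"uri": subject, "id": subject, "label": None}
        )
        if predicate == label_predicate and entry["label"] is None:
            entry["label"] = obj
        elif predicate == id_predicate:
            entry["id"] = obj
    # Subjects without a preferred label cannot be shown in a lookup list.
    return [entry for entry in results.values() if entry["label"] is not None]


# Triples taken from the OCLC FAST example on the slides.
fast_triples = [
    ("http://id.worldcat.org/fast/31622", DCT_ID, "31622"),
    ("http://id.worldcat.org/fast/31622", SKOS_PREF, "Twain, Mark, 1835-1910"),
    ("http://id.worldcat.org/fast/31622", SKOS_ALT, "Make Tuwen, 1835-1910"),
    ("http://id.worldcat.org/fast/365563", DCT_ID, "365563"),
    ("http://id.worldcat.org/fast/365563", SKOS_PREF, "Twain, Shania"),
]

normalized = normalize(fast_triples)
print(json.dumps(normalized, indent=2))
```

Passing different label_predicate and id_predicate values is one way to accommodate authorities that use other ontologies, such as madsrdf or the GeoNames ontology.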

Second Set of Challenges

5. Reliability & Efficiency, e.g., server uptime, server load
6. Accuracy, e.g., select results based on usage data, lexical match, custom weighting, other
7. Order/Ranking, e.g., how to order a graph


Cache Server Query Process

One full setup per authority: JSP Query API, Lucene/SOLR Index, Jena-Fuseki Triplestore

1. A request arrives at the JSP Query API:
   http://services.ld4l.org/ld4l_services/loc_name_batch.jsp?query=ezra%20cornell&maxRecords=10
2. Lucene search for "ezra cornell" (index built with predicate values <skos:prefLabel> and <skos:altLabel>)
3. For each result: extract search rank, extract URI, SPARQL query for URI against the Jena-Fuseki triplestore
4. Combine all results and insert search rank in predicate <http://vivoweb.org/ontology/core#rank>
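The query process above can be sketched as follows. This is a hedged illustration only: the real cache server is a JSP application backed by Lucene and Jena Fuseki, whereas here toy_search and toy_describe are hypothetical in-memory stand-ins for the index lookup and the per-URI SPARQL query, and the example.org URIs and labels are made-up sample data.

```python
SKOS_PREF = "http://www.w3.org/2004/02/skos/core#prefLabel"
VIVO_RANK = "http://vivoweb.org/ontology/core#rank"

# Made-up sample data standing in for one authority's cached vocabulary.
TOY_LABELS = {
    "http://example.org/names/1": "Cornell, Ezra, 1807-1874",
    "http://example.org/names/2": "Cornell University",
    "http://example.org/names/3": "Ezra Cornell (Ship)",
}


def toy_search(term):
    """Stand-in for the Lucene search: URIs whose label has every token."""
    tokens = term.lower().split()
    return [
        uri
        for uri, label in TOY_LABELS.items()
        if all(token in label.lower() for token in tokens)
    ]


def toy_describe(uri):
    """Stand-in for the SPARQL query that fetches triples for one URI."""
    return [(uri, SKOS_PREF, TOY_LABELS[uri])]


def query_cache(term, search_index, describe, max_records=10):
    """Search the index, fetch each hit's triples, and attach its rank."""
    combined = []
    hits = search_index(term)[:max_records]
    for rank, uri in enumerate(hits, start=1):
        combined.extend(describe(uri))            # triples for this URI
        combined.append((uri, VIVO_RANK, rank))   # rank carried as a triple
    return combined


results = query_cache("ezra cornell", toy_search, toy_describe)
for triple in results:
    print(triple)
```

Carrying the rank inside the graph as an extra triple is what lets a downstream consumer restore the search order, since an RDF graph has no inherent ordering.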

UI-QA-Authority

Hyrax/VitroLib - UI for selecting an entry from an authority

  Request sent to QA:
  http://localhost:3000/qa/search/linked_data/oclc_fast/personal_name?q=twain&maximumRecords=2

  Normalized results returned by QA:
  [{"uri":"http://id.worldcat.org/fast/31622","id":"31622","label":"Twain, Mark, 1835-1910"},
   {"uri":"http://id.worldcat.org/fast/365563","id":"365563","label":"Twain, Shania"}]

QA - normalize RDF returned from an authority

  Request sent to the authority:
  http://experimental.worldcat.org/fast/search?query=oclc.personalName+%22twain%22&sortKeys=usage&maximumRecords=2

  Returned by the authority: RDF of search results

Sources shown behind QA in the diagram:
• Active-Triples
• LDF Cache (Marmotta or Blazegraph)
• LDF Cache
• Jena-Fuseki-Lucene Cache (search of cache performed via Lucene index)
• Direct Access of External Authority

Third Set of Challenges

8. Disambiguation through better context, e.g., expand from just prefLabel to prefLabel, altLabel, birth/death dates, occupation, etc.
9. Reconciliation across multiple sources, e.g., match LoC URI to OCLC FAST URI

What's next

Addressing Architectural Challenges

• Generalize process for accessing context on the cache server and in the normalization layer
• Multi-authority search and reconciliation
• Address the need for cache refresh
• Mirrored cache servers

User Experience and Design

• User-centered design
  • Observe, listen, learn, design, evaluate, iterate
• Iteratively design and evaluate UI for lookup/authorities with catalogers
• Search result ranking/ordering/filtering for catalogers
• Additional UI platforms, e.g., FOLIO

Questions

http://tinyurl.com/ld4l-auth-access

Appendix for Challenges 1-4

Challenge 1: Documentation

LoC: http://id.loc.gov/techcenter/
  C. Harlow notes on reconciling LoC: https://github.com/cmh2166/lc-reconcile
OCLC FAST: https://www.oclc.org/developer/develop/web-services/fast-api/linked-data.en.html
GeoNames: http://www.geonames.org/export/geonames-search.html
AGROVOC: http://aims.fao.org/vest-registry/vocabularies/agrovoc-multilingual-agricultural-thesaurus
  swagger config: https://github.com/NatLibFi/Skosmos/blob/master/swagger.json
NALT: https://agclass.nal.usda.gov/
DBpedia: http://wiki.dbpedia.org/OnlineAccess#1.2%20Public%20Faceted%20Web%20Service%20Interface

Challenge 2: Linked Data Access API

LoC
  for Search Query: not supported
  for Term Fetch: URI

OCLC FAST
  for Search Query: http://experimental.worldcat.org/fast/search?query={subauth}+all+%22{query}%22&sortKeys=usage&maximumRecords={maximumRecords}
  for Term Fetch: URI

GeoNames
  for Search Query: http://api.geonames.org/search?q={query}&maxRows={maxRows}&username={username}&type=rdf
  for Term Fetch: URI

AGROVOC
  for Search Query: http://artemide.art.uniroma2.it:8081/agrovoc/rest/v1/search?query={query}&lang={lang}
  for Term Fetch: http://artemide.art.uniroma2.it:8081/agrovoc/rest/v1/data?uri=http://aims.fao.org/aos/agrovoc/{term_id}

NALT
  for Search Query: http://skosmos.library.cornell.edu/rest/v1/nalt/search?query={query}&lang={lang}
  for Term Fetch: http://skosmos.library.cornell.edu/rest/v1/nalt/data?uri={term_uri}

DBpedia
  (no entries listed)

Challenge 3: Varying Results Formats

Authority    for Search Query    for Term Fetch
LoC          not supported       rdf-xml
OCLC FAST    rdf-xml             rdf-xml
GeoNames     rdf-xml             rdf-xml
AGROVOC      json-ld             rdf-xml, json-ld, turtle
NALT         json-ld             rdf-xml, json-ld, turtle
DBpedia      (not listed)        (not listed)

Challenge 4: Varying Ontologies

Authority    Primary Ontology     Flat vs. Navigation Required
LoC          madsrdf, SKOS        navigation required
OCLC FAST    schema.org, SKOS     flat
GeoNames     geonames             flat, hierarchical
AGROVOC      SKOS                 flat, hierarchical
NALT         SKOS                 flat, hierarchical
DBpedia      dbpedia              flat

Configurations for Questioning Authority

LoC: https://github.com/ld4l-labs/linked_data_authorities/tree/master/qa_loc/config/authorities/linked_data
OCLC FAST: https://github.com/ld4l-labs/linked_data_authorities/tree/master/qa_oclcfast/config/authorities/linked_data
GeoNames: https://github.com/ld4l-labs/linked_data_authorities/tree/master/qa_geonames/config/authorities/linked_data
AGROVOC: https://github.com/ld4l-labs/linked_data_authorities/tree/master/qa_agrovoc/config/authorities/linked_data
NALT: https://github.com/ld4l-labs/linked_data_authorities/tree/master/qa_nalt/config/authorities/linked_data
DBpedia: https://github.com/ld4l-labs/linked_data_authorities/tree/master/qa_dbpedia/config/authorities/linked_data

Appendix for Challenges 5-7

Creating a Cache Server

Hardware

• 8-core, 64 GB, 3 GHz Mac Pro (late 2013), macOS Sierra (10.12.6)
• 32 TB Pegasus-2 Thunderbolt RAID configured as RAID-5

Triplestore

• Apache Jena Fuseki 2.4.0 provides SPARQL endpoint
• Apache Tomcat 9.0 runs custom web application(s)
• Apache Lucene 3.6 provides search interface

Customizations

• custom per-data-source JSP web application provides search/browse/download functionality
• custom (generic) SPARQL Tag Library provides API for web apps (available at https://github.com/eichmann/lod-utilities)
• custom (generic) Lucene Tag Library provides API for web apps

Loading a New Vocabulary

• download RDF
• if necessary, convert to n-triples (required for GeoNames data, for instance)
• use tdbloader2 to populate the triplestore
• configure Fuseki server(s) with triplestore details
• create a new JSP project in Eclipse
• write one or more indexer programs that populate Lucene indices, and run the indexer(s)
• write search/browse/download application logic using the SPARQL and Lucene tags
• package the project as a WAR
• deploy to Apache Tomcat server(s)
• add the new service to the Apache HTTPD virtual host specification

UI Access to Cache Server

http://services.ld4l.org/ld4l_services/loc_name.jsp

Downloads

LoC: http://id.loc.gov/download/ (n-triples OR rdf-xml)
OCLC FAST: http://www.oclc.org/research/themes/data-science/fast/download.html (n-triples)
GeoNames: http://www.geonames.org/ontology/documentation.html (custom format - see notes for processing)
AGROVOC: https://aims-fao.atlassian.net/wiki/spaces/AGV/pages/2949126/Releases (n-triples OR rdf-xml)
NALT: https://agclass.nal.usda.gov/download.shtml (rdf-xml)
DBpedia: http://wiki.dbpedia.org/downloads-2016-04

Potential Options for Reconciliation

• VIAF for name reconciliation - we are doing some work with this
• Wikidata - I've heard that they are working on reconciliation issues, but haven't yet explored in depth
  • Intro Video (3 hrs)
  • API Access
  • SPARQL - user manual
  • federated queries with other authorities

A Google search for "linked data reconciliation" returns a large number of articles and presentations on this concept.

Links to Code & More

• QA Server - Code for a small app that provides the Questioning Authority normalization layer
• Linked Data Authorities - Configurations that can be used with QA Server
• LD4L Services - UI access to Cache Server
• VitroLib - Code for the VitroLib cataloging tool




bull custom (generic) Lucene Tag Library provides API for web apps

65

Loading a New Vocabulary

bull download RDF

bull if necessary convert to n-triples (required for GeoNames data for instance)

bull use tdbloader2 to populated triplestore

bull configure Fuseki server(s) with triplestore details

bull create new JSP project in Eclipse

bull write one or more indexer programs that populate Lucene indices and run indexer(s)

bull write searchbrowsedownload application logic using the SPARQL and Lucene tags

bull package project as war

bull deploy to Apache Tomcat server(s)

bull add new service to Apache HTTPD virtual host specification

66

UI Access to Cache Server

httpservicesld4lorgld4l_servicesloc_namejsp

Downloads

68

LoC httpidlocgovdownload (n-triples OR rdf-xml)

OCLC FAST httpwwwoclcorgresearchthemesdata-sciencefastdownloadhtml (n-triples)

GeoNames httpwwwgeonamesorgontologydocumentationhtml (custom format ndash see notes for processing)

AGROVOC httpsaims-faoatlassiannetwikispacesAGVpages2949126Releases (n-triples OR rdf-xml)

NALT httpsagclassnalusdagovdownloadshtml (rdf-xml)

DBpedia httpwikidbpediaorgdownloads-2016-04

Potential Options for Reconciliation

bull VIAF for name reconciliation ndash we are doing some

work with this

bull Wikidata ndash Ive heard that they are working on

Reconciliation issues but havent yet explored in

depth bull Intro Video (3hrs)

bull API Access

bull SPARQL ndash user manual

bull federated queries with other authorities

Doing a google search for linked data reconciliation

returns a large number of articles and presentations

on this concept

Links to Code amp More

bull QA Server - Code for a small app that provides the

Questioning Authority normalization layer

bull Linked Data Authorities - Configurations that can

be used with QA Server

bull LD4L Services - UI access to Cache Server

bull VitroLib - Code for the VitroLib cataloging tool

Getting more from other authorities

Architecture

Technical Motivation

• Linked data provides...
  • URIs that identify specific terms (as opposed to the ambiguity of using strings)
  • Reconciliation to relate terms that are defined in separate authorities
• Goals of implementation...
  • Provide a single process to access many authorities
  • Provide efficient and reliable access to authorities
  • Provide a means for disambiguation that empowers library staff to make the most accurate selections

First Set of Challenges

1. Finding Documentation
2. Linked Data Access API, e.g., no support, partial support, requires login credentials, SPARQL query endpoint only
3. Varying Results Formats, e.g., rdf-xml, json-ld, turtle, n-triples, etc.
4. Varying Ontologies, e.g., SKOS, schema.org, madsrdf, dbpedia, geonames

Multi-Server Architecture

Hyrax/VitroLib - UI for selecting an entry from an authority
QA - normalize RDF returned from an authority
Direct Access of External Authority

Request from the UI to QA:
http://localhost:3000/qa/search/linked_data/oclc_fast/personal_name?q=twain&maximumRecords=2

Request from QA to the external authority:
http://experimental.worldcat.org/fast/search?query=oclc.personalName+%22twain%22&sortKeys=usage&maximumRecords=2

RDF returned by the authority:
<http://id.worldcat.org/fast/31622>
    a schema:Person ;
    dcterms:identifier "31622" ;
    skos:prefLabel "Twain, Mark, 1835-1910" ;
    skos:altLabel "Make Teviin, 1835-1910", "Make Tuwen, 1835-1910" .
<http://id.worldcat.org/fast/365563>
    a schema:Person ;
    dcterms:identifier "365563" ;
    skos:prefLabel "Twain, Shania" ;
    skos:altLabel "Twain, Eilleen", "Edwards, Eilleen" .

Normalized results returned by QA to the UI:
[{"uri":"http://id.worldcat.org/fast/31622","id":"31622","label":"Twain, Mark, 1835-1910"},
 {"uri":"http://id.worldcat.org/fast/365563","id":"365563","label":"Twain, Shania"}]
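The normalization step shown in this architecture can be sketched in a few lines of Python. This is only an illustration and not the Questioning Authority code: the function name `normalize`, the constants, and the choice of `dcterms:identifier` and `skos:prefLabel` as the source predicates are assumptions taken from the OCLC FAST example, and a real configuration would differ per authority.

```python
# Assumed predicates for the OCLC FAST example; other authorities use other ontologies.
ID_PREDICATE = "dcterms:identifier"
LABEL_PREDICATE = "skos:prefLabel"


def normalize(triples):
    """Collapse (subject, predicate, object) triples into uri/id/label records."""
    records = {}
    for subject, predicate, obj in triples:
        record = records.setdefault(subject, {"uri": subject, "id": None, "label": None})
        if predicate == ID_PREDICATE:
            record["id"] = obj
        elif predicate == LABEL_PREDICATE:
            record["label"] = obj
        # Every other predicate (rdf:type, skos:altLabel, ...) is dropped here.
    return list(records.values())


triples = [
    ("http://id.worldcat.org/fast/31622", "rdf:type", "schema:Person"),
    ("http://id.worldcat.org/fast/31622", "dcterms:identifier", "31622"),
    ("http://id.worldcat.org/fast/31622", "skos:prefLabel", "Twain, Mark, 1835-1910"),
    ("http://id.worldcat.org/fast/31622", "skos:altLabel", "Make Tuwen, 1835-1910"),
    ("http://id.worldcat.org/fast/365563", "dcterms:identifier", "365563"),
    ("http://id.worldcat.org/fast/365563", "skos:prefLabel", "Twain, Shania"),
]
results = normalize(triples)
```

Keeping only three fields is what lets one UI work against many authorities, at the cost of discarding context such as alternate labels.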

Direct Access Query API

Direct against authority...

http://experimental.worldcat.org/fast/search
    ?query=oclc.personalName+%22twain%22
    &maximumRecords=2

http://api.geonames.org/search
    ?q=ithaca
    &maxRows=2
    &username=demo
    &type=rdf

http://artemide.art.uniroma2.it:8081/agrovoc/rest/v1/search
    ?query=milk
    &lang=en
    &maxhits=2

Normalized Query API

Through QA normalization layer...

http://localhost:3000/qa/search/linked_data/oclc_fast
    ?q=twain
    &maxRecords=2

http://localhost:3000/qa/search/linked_data/geonames
    ?q=ithaca
    &maxRecords=2

http://localhost:3000/qa/search/linked_data/agrovoc
    ?q=milk
    &maxRecords=2
    &lang=en

Normalized Results

OCLC FAST
[{"uri":"http://id.worldcat.org/fast/31622","id":"31622","label":"Twain, Mark, 1835-1910"},
 {"uri":"http://id.worldcat.org/fast/365563","id":"365563","label":"Twain, Shania"}]

GeoNames
[{"uri":"http://sws.geonames.org/2162552/","id":"http://sws.geonames.org/2162552/","label":"Ithaca, (AU)"},
 {"uri":"http://sws.geonames.org/4515289/","id":"http://sws.geonames.org/4515289/","label":"Ithaca, (US)"}]

AGROVOC
[{"uri":"http://aims.fao.org/aos/agrovoc/c_8602","id":"http://aims.fao.org/aos/agrovoc/c_8602","label":"acidophilus milk"},
 {"uri":"http://aims.fao.org/aos/agrovoc/c_16076","id":"http://aims.fao.org/aos/agrovoc/c_16076","label":"buffalo milk"}]
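The point of the normalized query API is that one URL pattern covers every authority. A minimal sketch of a client-side helper follows; `qa_search_url` and its parameters are hypothetical names, the host and port are the local development address shown on the slides, and the `maxRecords` parameter name follows the Normalized Query API slide (an earlier slide uses `maximumRecords`).

```python
from urllib.parse import urlencode

# Local development address as shown on the slides; a deployment would differ.
QA_BASE = "http://localhost:3000/qa/search/linked_data"


def qa_search_url(authority, query, max_records=2, subauthority=None, lang=None):
    """Build a normalized search URL for any configured authority."""
    path = f"{QA_BASE}/{authority}"
    if subauthority:
        path += f"/{subauthority}"
    params = {"q": query, "maxRecords": max_records}
    if lang:
        params["lang"] = lang
    return f"{path}?{urlencode(params)}"


fast_url = qa_search_url("oclc_fast", "twain")
geonames_url = qa_search_url("geonames", "ithaca")
agrovoc_url = qa_search_url("agrovoc", "milk", lang="en")
```

The caller never needs to know that OCLC FAST expects `maximumRecords`, GeoNames expects `maxRows`, and AGROVOC expects `maxhits`; that mapping lives in the normalization layer.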

Second Set of Challenges

5. Reliability & Efficiency, e.g., server uptime, server load
6. Accuracy, e.g., select results based on usage data, lexical match, custom weighting, other
7. Order Ranking, e.g., How to order a graph?

Cache Server Query Process

One full setup per authority: JSP Query API, Lucene/SOLR Index, Jena-Fuseki Triplestore

1. Request to the JSP Query API:
   http://services.ld4l.org/ld4l_services/loc_name_batch.jsp?query=ezra%20cornell&maxRecords=10
2. Lucene search for "ezra cornell" (index built with predicate values <skos:prefLabel> and <skos:altLabel>)
3. For each result: extract search rank, extract URI
4. SPARQL query for URI against the Jena-Fuseki triplestore
5. Insert search rank in predicate <http://vivoweb.org/ontology/core#rank>
6. Combine all results
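The query process above (index search, then a per-URI lookup, then a rank predicate added to each result) can be sketched with in-memory stand-ins. Nothing here is the cache server's actual code: the dictionaries replace Lucene and Fuseki, the `example.org` URIs are placeholders, and the word-overlap scoring is a deliberately crude substitute for Lucene relevance ranking.

```python
RANK_PREDICATE = "http://vivoweb.org/ontology/core#rank"

# Toy stand-ins: a label index (the Lucene role) and a triple store (the Fuseki role).
LABEL_INDEX = {
    "http://example.org/names/1": ["Cornell, Ezra, 1807-1874", "Ezra Cornell"],
    "http://example.org/names/2": ["Cornell University"],
}
TRIPLES = {
    "http://example.org/names/1": [("skos:prefLabel", "Cornell, Ezra, 1807-1874")],
    "http://example.org/names/2": [("skos:prefLabel", "Cornell University")],
}


def search_index(query, max_records):
    """Return URIs ordered by how many query words appear in their indexed labels."""
    words = query.lower().split()
    scored = []
    for uri, labels in LABEL_INDEX.items():
        text = " ".join(labels).lower()
        score = sum(word in text for word in words)
        if score:
            scored.append((score, uri))
    scored.sort(key=lambda pair: -pair[0])  # stable sort keeps index order for ties
    return [uri for _, uri in scored[:max_records]]


def cache_query(query, max_records=10):
    """Search the index, fetch each URI's triples, attach its rank, combine."""
    combined = []
    for rank, uri in enumerate(search_index(query, max_records), start=1):
        triples = list(TRIPLES[uri])            # stands in for the SPARQL query for the URI
        triples.append((RANK_PREDICATE, rank))  # search rank inserted as a predicate
        combined.append({"uri": uri, "triples": triples})
    return combined


results = cache_query("ezra cornell")
```

Carrying the rank inside the graph is one way to address the ordering challenge: RDF graphs are unordered, so the consumer needs an explicit value to sort on.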

UI-QA-Authority

Hyrax/VitroLib - UI for selecting an entry from an authority
QA - normalize RDF returned from an authority

Request from the UI to QA:
http://localhost:3000/qa/search/linked_data/oclc_fast/personal_name?q=twain&maximumRecords=2

Normalized results returned to the UI:
[{"uri":"http://id.worldcat.org/fast/31622","id":"31622","label":"Twain, Mark, 1835-1910"},
 {"uri":"http://id.worldcat.org/fast/365563","id":"365563","label":"Twain, Shania"}]

Request from QA to the authority:
http://experimental.worldcat.org/fast/search?query=oclc.personalName+%22twain%22&sortKeys=usage&maximumRecords=2

RDF of search results

Diagram components:
• Active-Triples
• LDF Cache (Marmotta or Blazegraph)
• LDF Cache
• Jena-Fuseki-Lucene Cache (search of cache performed via Lucene index)
• Direct Access of External Authority

Third Set of Challenges

8. Disambiguation through better context, e.g., expand from just prefLabel to... prefLabel, altLabel, birth/death dates, occupation, etc.
9. Reconciliation across multiple sources, e.g., match LoC URI to OCLC FAST URI

What's next

Addressing Architectural Challenges

• Generalize process for accessing context on the cache server and in the normalization layer
• Multi-authority search and reconciliation
• Address the need for cache refresh
• Mirrored cache servers

User Experience and Design

• User-centered Design
  • Observe, listen, learn, design, evaluate, iterate
• Iteratively design and evaluate UI for lookup/authorities with catalogers
• Search result ranking/ordering/filtering for catalogers
• Additional UI platforms, e.g., FOLIO

Questions

http://tinyurl.com/ld4l-auth-access

Appendix for Challenges 1-4

Challenge 1: Documentation

LoC: http://id.loc.gov/techcenter
     C. Harlow notes on reconciling LoC - https://github.com/cmh2166/lc-reconcile
OCLC FAST: https://www.oclc.org/developer/develop/web-services/fast-api/linked-data.en.html
GeoNames: http://www.geonames.org/export/geonames-search.html
AGROVOC: http://aims.fao.org/vest-registry/vocabularies/agrovoc-multilingual-agricultural-thesaurus
         swagger config: https://github.com/NatLibFi/Skosmos/blob/master/swagger.json
NALT: https://agclass.nal.usda.gov
DBpedia: http://wiki.dbpedia.org/OnlineAccess#1.2%20Public%20Faceted%20Web%20Service%20Interface

Challenge 2: Linked Data Access API

LoC
  Query for Search: not supported
  Query for Term Fetch: URI
OCLC FAST
  Query for Search: http://experimental.worldcat.org/fast/search?query={subauth}+all+%22{query}%22&sortKeys=usage&maximumRecords={maximumRecords}
  Query for Term Fetch: URI
GeoNames
  Query for Search: http://api.geonames.org/search?q={query}&maxRows={maxRows}&username={username}&type=rdf
  Query for Term Fetch: URI
AGROVOC
  Query for Search: http://artemide.art.uniroma2.it:8081/agrovoc/rest/v1/search?query={query}&lang={lang}
  Query for Term Fetch: http://artemide.art.uniroma2.it:8081/agrovoc/rest/v1/data?uri=http://aims.fao.org/aos/agrovoc/{term_id}
NALT
  Query for Search: http://skosmos.library.cornell.edu/rest/v1/nalt/search?query={query}&lang={lang}
  Query for Term Fetch: http://skosmos.library.cornell.edu/rest/v1/nalt/data?uri={term_uri}
DBpedia

Challenge 3: Varying Results Formats

Authority    Query for Search    Query for Term Fetch
LoC          not supported       rdf-xml
OCLC FAST    rdf-xml             rdf-xml
GeoNames     rdf-xml             rdf-xml
AGROVOC      json-ld             rdf-xml, json-ld, turtle
NALT         json-ld             rdf-xml, json-ld, turtle
DBpedia
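One way a client might cope with the varying formats is to keep a per-authority list of what each source offers and choose a serialization from it. The sketch below assumes the term-fetch formats listed for this challenge and the standard media types for each serialization; the names `accept_header` and `TERM_FETCH_FORMATS` are hypothetical, and whether a given authority honors an `Accept` header rather than a URL suffix or query parameter is not stated in the slides.

```python
# Standard media types for the RDF serializations named on the slides.
MEDIA_TYPES = {
    "rdf-xml": "application/rdf+xml",
    "json-ld": "application/ld+json",
    "turtle": "text/turtle",
    "n-triples": "application/n-triples",
}

# Term-fetch formats per authority, as listed for Challenge 3.
TERM_FETCH_FORMATS = {
    "LoC": ["rdf-xml"],
    "OCLC FAST": ["rdf-xml"],
    "GeoNames": ["rdf-xml"],
    "AGROVOC": ["rdf-xml", "json-ld", "turtle"],
    "NALT": ["rdf-xml", "json-ld", "turtle"],
}


def accept_header(authority, preferred=("json-ld", "turtle", "rdf-xml")):
    """Return the media type of the first preferred format the authority offers."""
    offered = TERM_FETCH_FORMATS[authority]
    for name in preferred:
        if name in offered:
            return MEDIA_TYPES[name]
    raise ValueError(f"no supported format for {authority}")


loc_type = accept_header("LoC")
agrovoc_type = accept_header("AGROVOC")
```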

Challenge 4: Varying Ontologies

Authority    Primary Ontology     Flat vs Navigation required
LoC          madsrdf, SKOS        navigation required
OCLC FAST    schema.org, SKOS     flat
GeoNames     geonames             flat, hierarchical
AGROVOC      SKOS                 flat, hierarchical
NALT         SKOS                 flat, hierarchical
DBpedia      dbpedia              flat

Configurations for Questioning Authority

LoC: https://github.com/ld4l-labs/linked_data_authorities/tree/master/qa_loc/config/authorities/linked_data
OCLC FAST: https://github.com/ld4l-labs/linked_data_authorities/tree/master/qa_oclcfast/config/authorities/linked_data
GeoNames: https://github.com/ld4l-labs/linked_data_authorities/tree/master/qa_geonames/config/authorities/linked_data
AGROVOC: https://github.com/ld4l-labs/linked_data_authorities/tree/master/qa_agrovoc/config/authorities/linked_data
NALT: https://github.com/ld4l-labs/linked_data_authorities/tree/master/qa_nalt/config/authorities/linked_data
DBpedia: https://github.com/ld4l-labs/linked_data_authorities/tree/master/qa_dbpedia/config/authorities/linked_data

Appendix for Challenges 5-7

Creating a Cache Server

Hardware

• 8-core, 64 GB, 3 GHz Mac Pro (late 2013), macOS Sierra (10.12.6)
• 32 TB Pegasus-2 Thunderbolt RAID configured as RAID-5

Triplestore

• Apache Jena Fuseki 2.4.0 provides SPARQL endpoint
• Apache Tomcat 9.0 runs custom web application(s)
• Apache Lucene 3.6 provides search interface

Customizations

• custom per-data-source JSP web application provides search/browse/download functionality
• custom (generic) SPARQL Tag Library provides API for web apps (available at https://github.com/eichmann/lod-utilities)
• custom (generic) Lucene Tag Library provides API for web apps

Loading a New Vocabulary

• download RDF
• if necessary, convert to n-triples (required for GeoNames data, for instance)
• use tdbloader2 to populate the triplestore
• configure Fuseki server(s) with triplestore details
• create new JSP project in Eclipse
• write one or more indexer programs that populate Lucene indices, and run the indexer(s)
• write search/browse/download application logic using the SPARQL and Lucene tags
• package project as a WAR
• deploy to Apache Tomcat server(s)
• add new service to Apache HTTPD virtual host specification
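The triplestore steps in the list above (bulk load with tdbloader2, then serve with Fuseki) can be illustrated by a small helper that only assembles the command lines without running them. The file and directory names are placeholders, `load_commands` is a hypothetical name, and the `--loc` options reflect common Jena usage; they should be checked against the Jena and Fuseki versions actually installed.

```python
def load_commands(ntriples_file, tdb_dir, dataset_name):
    """Return command lines for bulk-loading one vocabulary and serving it."""
    return [
        # Bulk-load the N-Triples file into a TDB database directory.
        ["tdbloader2", "--loc", tdb_dir, ntriples_file],
        # Serve that database as a SPARQL endpoint under /<dataset_name>.
        ["fuseki-server", f"--loc={tdb_dir}", f"/{dataset_name}"],
    ]


# Placeholder paths for illustration only.
commands = load_commands("geonames.nt", "/data/tdb/geonames", "geonames")
for command in commands:
    print(" ".join(command))
```

Because the slides describe one full setup per authority, a helper like this would be called once per vocabulary with its own database directory and dataset name.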

UI Access to Cache Server

http://services.ld4l.org/ld4l_services/loc_name.jsp

Downloads

LoC: http://id.loc.gov/download (n-triples OR rdf-xml)
OCLC FAST: http://www.oclc.org/research/themes/data-science/fast/download.html (n-triples)
GeoNames: http://www.geonames.org/ontology/documentation.html (custom format - see notes for processing)
AGROVOC: https://aims-fao.atlassian.net/wiki/spaces/AGV/pages/2949126/Releases (n-triples OR rdf-xml)
NALT: https://agclass.nal.usda.gov/download.shtml (rdf-xml)
DBpedia: http://wiki.dbpedia.org/downloads-2016-04

Potential Options for Reconciliation

• VIAF for name reconciliation - we are doing some work with this
• Wikidata - I've heard that they are working on reconciliation issues, but haven't yet explored in depth
  • Intro Video (3 hrs)
  • API Access
  • SPARQL - user manual
  • federated queries with other authorities

Doing a Google search for "linked data reconciliation" returns a large number of articles and presentations on this concept.

Links to Code & More

• QA Server - Code for a small app that provides the Questioning Authority normalization layer
• Linked Data Authorities - Configurations that can be used with QA Server
• LD4L Services - UI access to Cache Server
• VitroLib - Code for the VitroLib cataloging tool

Page 19: Linking the Data: Building effective Authority and …Linking the Data: Building Effective Authority and Identity Lookup Huda Khan and E. Lynette Rayle Cornell University Collaborators:

Technical Motivation

bull Linked data provideshellip

bull URIs that identify specific terms (as opposed to ambiguity of using

strings)

bull Reconciliation to relate terms that are defined in separate authorities

bull Goals of implementationhellip

bull Provide a single process to access many authorities

bull Provide efficient and reliable access to authorities

bull Provide a means for disambiguation that empowers library staff to

make the most accurate selections

First Set of Challenges

1 Finding Documentation

2 Linked Data Access API eg no support partial support requires login credentials

sparql query endpoint only

3 Varying Results Formats eg rdf-xml json-ld turtle n-triples etc

4 Varying Ontologies eg SKOS schemaorg madsrdf dbpedia geonames

Multi-Server Architecture

Hyrax/VitroLib – UI for selecting an entry from an authority

QA – normalize RDF returned from an authority

Direct Access of External Authority

Request from the UI to QA:
http://localhost:3000/qa/search/linked_data/oclc_fast/personal_name?q=twain&maximumRecords=2

Request from QA to the external authority:
http://experimental.worldcat.org/fast/search?query=oclc.personalName+%22twain%22&sortKeys=usage&maximumRecords=2

RDF returned by the authority:
<http://id.worldcat.org/fast/31622>
    a schema:Person ;
    dcterms:identifier "31622" ;
    skos:prefLabel "Twain, Mark, 1835-1910" ;
    skos:altLabel "Make Teviin, 1835-1910", "Make Tuwen, 1835-1910" .

<http://id.worldcat.org/fast/365563>
    a schema:Person ;
    dcterms:identifier "365563" ;
    skos:prefLabel "Twain, Shania" ;
    skos:altLabel "Twain, Eilleen", "Edwards, Eilleen" .

Normalized results returned by QA to the UI:
[{"uri":"http://id.worldcat.org/fast/31622","id":"31622","label":"Twain, Mark, 1835-1910"},
 {"uri":"http://id.worldcat.org/fast/365563","id":"365563","label":"Twain, Shania"}]

Direct Access Query API

Direct against authority…

http://experimental.worldcat.org/fast/search
    ?query=oclc.personalName+%22twain%22
    &maximumRecords=2

http://api.geonames.org/search
    ?q=ithaca
    &maxRows=2
    &username=demo
    &type=rdf

http://artemide.art.uniroma2.it:8081/agrovoc/rest/v1/search
    ?query=milk
    &lang=en
    &maxhits=2

Normalized Query API

Through QA normalization layer…

http://localhost:3000/qa/search/linked_data/oclc_fast
    ?q=twain
    &maxRecords=2

http://localhost:3000/qa/search/linked_data/geonames
    ?q=ithaca
    &maxRecords=2

http://localhost:3000/qa/search/linked_data/agrovoc
    ?q=milk
    &maxRecords=2
    &lang=en
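The contrast between the two query styles can be sketched in Python. This is only an illustration, not the Questioning Authority implementation: the `AUTHORITIES` table, the `direct_url` and `normalized_url` helpers, and the per-authority settings are assumptions transcribed from the example URLs on these two slides.

```python
from urllib.parse import urlencode

# Hypothetical per-authority settings, copied from the example URLs on the slides.
AUTHORITIES = {
    "oclc_fast": {
        "base": "http://experimental.worldcat.org/fast/search",
        "query_param": "query",
        "query_format": 'oclc.personalName "{q}"',
        "limit_param": "maximumRecords",
        "extra": {},
    },
    "geonames": {
        "base": "http://api.geonames.org/search",
        "query_param": "q",
        "query_format": "{q}",
        "limit_param": "maxRows",
        "extra": {"username": "demo", "type": "rdf"},
    },
    "agrovoc": {
        "base": "http://artemide.art.uniroma2.it:8081/agrovoc/rest/v1/search",
        "query_param": "query",
        "query_format": "{q}",
        "limit_param": "maxhits",
        "extra": {},
    },
}

# Assumed local QA endpoint, as shown on the slide.
QA_BASE = "http://localhost:3000/qa/search/linked_data"


def direct_url(authority, q, max_records, **extra):
    """Build the authority-specific URL: every authority names its parameters differently."""
    cfg = AUTHORITIES[authority]
    params = {
        cfg["query_param"]: cfg["query_format"].format(q=q),
        cfg["limit_param"]: max_records,
    }
    params.update(cfg["extra"])
    params.update(extra)
    return cfg["base"] + "?" + urlencode(params)


def normalized_url(authority, q, max_records, **extra):
    """Build the normalized URL: the same parameter names for every authority."""
    params = {"q": q, "maxRecords": max_records, **extra}
    return f"{QA_BASE}/{authority}?{urlencode(params)}"


if __name__ == "__main__":
    for name, term in [("oclc_fast", "twain"), ("geonames", "ithaca"), ("agrovoc", "milk")]:
        print(direct_url(name, term, 2))
        print(normalized_url(name, term, 2))
```

The caller of `normalized_url` only needs the authority name and a search term; the differences in parameter names and query syntax stay inside the configuration table.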

Normalized Results

OCLC FAST:
[{"uri":"http://id.worldcat.org/fast/31622","id":"31622","label":"Twain, Mark, 1835-1910"},
 {"uri":"http://id.worldcat.org/fast/365563","id":"365563","label":"Twain, Shania"}]

GeoNames:
[{"uri":"http://sws.geonames.org/2162552/","id":"http://sws.geonames.org/2162552/","label":"Ithaca (AU)"},
 {"uri":"http://sws.geonames.org/4515289/","id":"http://sws.geonames.org/4515289/","label":"Ithaca (US)"}]

AgroVoc:
[{"uri":"http://aims.fao.org/aos/agrovoc/c_8602","id":"http://aims.fao.org/aos/agrovoc/c_8602","label":"acidophilus milk"},
 {"uri":"http://aims.fao.org/aos/agrovoc/c_16076","id":"http://aims.fao.org/aos/agrovoc/c_16076","label":"buffalo milk"}]
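As a rough illustration of what normalization involves, the sketch below reduces a set of RDF triples to the uri/id/label shape shown above. It assumes the RDF has already been parsed into (subject, predicate, object) tuples; the `normalize` function and its defaults are hypothetical and are not taken from the QA code. When no identifier predicate is present, the id falls back to the URI, which matches the GeoNames and AgroVoc results above.

```python
SKOS_PREF_LABEL = "http://www.w3.org/2004/02/skos/core#prefLabel"
DCTERMS_IDENTIFIER = "http://purl.org/dc/terms/identifier"


def normalize(triples, label_predicate=SKOS_PREF_LABEL, id_predicate=DCTERMS_IDENTIFIER):
    """Collapse (subject, predicate, object) triples into uri/id/label records."""
    results = {}
    for subject, predicate, obj in triples:
        # The id defaults to the URI until an explicit identifier is seen.
        entry = results.setdefault(subject, {"uri": subject, "id": subject, "label": None})
        if predicate == label_predicate and entry["label"] is None:
            entry["label"] = obj
        elif predicate == id_predicate:
            entry["id"] = obj
    # Subjects without a label cannot be shown to a cataloger, so skip them.
    return [entry for entry in results.values() if entry["label"] is not None]


if __name__ == "__main__":
    fast = "http://id.worldcat.org/fast/"
    triples = [
        (fast + "31622", DCTERMS_IDENTIFIER, "31622"),
        (fast + "31622", SKOS_PREF_LABEL, "Twain, Mark, 1835-1910"),
        (fast + "365563", DCTERMS_IDENTIFIER, "365563"),
        (fast + "365563", SKOS_PREF_LABEL, "Twain, Shania"),
    ]
    print(normalize(triples))
```

Passing a different `label_predicate` is one way a per-authority configuration could absorb the ontology differences listed under Challenge 4.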

Second Set of Challenges

5. Reliability & Efficiency, e.g., server uptime, server load
6. Accuracy, e.g., select results based on usage data, lexical match, custom weighting, other
7. Order Ranking, e.g., how to order a graph

Cache Server Query Process

One full setup per authority: JSP Query API, Lucene/SOLR Index, Jena-Fuseki Triplestore

JSP Query API:
http://services.ld4l.org/ld4l_services/loc_name_batch.jsp?query=ezra%20cornell&maxRecords=10

1. Lucene search for "ezra cornell" against the Lucene/SOLR Index (index built with predicate values <skos:prefLabel> and <skos:altLabel>)
2. For each result:
   • extract search rank
   • extract URI
   • SPARQL query for URI against the Jena-Fuseki Triplestore
   • insert search rank in predicate <http://vivoweb.org/ontology/core#rank>
3. Combine all results
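The query process above can be sketched as follows. This is a minimal Python illustration, not the JSP implementation: `lucene_search` and `describe` are hypothetical stand-ins for the Lucene index lookup and the SPARQL query against Fuseki, and the example.org URIs and labels are made-up test data.

```python
VIVO_RANK = "http://vivoweb.org/ontology/core#rank"


def cache_search(query, lucene_search, describe, max_records=10):
    """Search the index, fetch triples per hit, and record each hit's rank as a triple."""
    combined = []
    # Step 1: the index returns URIs in relevance order (stand-in for Lucene).
    hits = lucene_search(query)[:max_records]
    # Step 2: for each result, keep its rank and URI, then fetch its triples.
    for rank, uri in enumerate(hits, start=1):
        triples = list(describe(uri))  # stand-in for the SPARQL query for this URI
        triples.append((uri, VIVO_RANK, rank))  # ordering survives inside the graph
        # Step 3: combine all results into one graph.
        combined.extend(triples)
    return combined


if __name__ == "__main__":
    # Made-up in-memory index and triplestore, for illustration only.
    index = {"ezra cornell": ["http://example.org/name/1", "http://example.org/name/2"]}
    store = {
        "http://example.org/name/1": [("http://example.org/name/1", "label", "Example label one")],
        "http://example.org/name/2": [("http://example.org/name/2", "label", "Example label two")],
    }
    graph = cache_search(
        "ezra cornell",
        lambda q: index.get(q, []),
        lambda uri: store.get(uri, []),
    )
    for triple in graph:
        print(triple)
```

Writing the rank into the graph is what addresses Challenge 7: an RDF graph has no inherent order, so the consumer sorts subjects by the rank value.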

UI-QA-Authority

Hyrax/VitroLib – UI for selecting an entry from an authority

Request from the UI to QA:
http://localhost:3000/qa/search/linked_data/oclc_fast/personal_name?q=twain&maximumRecords=2

Normalized results returned by QA to the UI:
[{"uri":"http://id.worldcat.org/fast/31622","id":"31622","label":"Twain, Mark, 1835-1910"},
 {"uri":"http://id.worldcat.org/fast/365563","id":"365563","label":"Twain, Shania"}]

QA – normalize RDF returned from an authority

Request from QA to the authority:
http://experimental.worldcat.org/fast/search?query=oclc.personalName+%22twain%22&sortKeys=usage&maximumRecords=2

RDF of search results is returned to QA. Sources shown in the diagram:
• Active-Triples
• LDF Cache (Marmotta or Blazegraph)
• LDF Cache
• Jena-Fuseki-Lucene Cache (search of cache performed via Lucene index)
• Direct Access of External Authority

Third Set of Challenges

8. Disambiguation through better context, e.g., expand from just prefLabel to… prefLabel, altLabel, birth/death dates, occupation, etc.
9. Reconciliation across multiple sources, e.g., match LoC URI to OCLC FAST URI
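To make Challenge 9 concrete, here is a deliberately naive Python sketch that pairs entries from two authorities when their labels match after light normalization. It is not the project's approach (the slides leave reconciliation as an open problem); `label_key`, `reconcile`, and the example.org URI are hypothetical, and real reconciliation would need more context than a label.

```python
import unicodedata


def label_key(label):
    """Reduce a label to a comparison key: no case, accents, or punctuation."""
    text = unicodedata.normalize("NFKD", label).casefold()
    text = "".join(ch for ch in text if not unicodedata.combining(ch))
    text = "".join(ch if ch.isalnum() else " " for ch in text)
    return " ".join(text.split())


def reconcile(source_a, source_b):
    """Map each URI in source_a to the URIs in source_b whose labels share a key."""
    index = {}
    for entry in source_b:
        index.setdefault(label_key(entry["label"]), []).append(entry["uri"])
    return {entry["uri"]: index.get(label_key(entry["label"]), []) for entry in source_a}


if __name__ == "__main__":
    fast = [{"uri": "http://id.worldcat.org/fast/31622", "label": "Twain, Mark, 1835-1910"}]
    # Placeholder record standing in for a second authority's entry.
    other = [{"uri": "http://example.org/authority/1", "label": "Twain, Mark, 1835-1910."}]
    print(reconcile(fast, other))
```

Label matching fails whenever two authorities phrase a heading differently or two people share a name, which is why the extra context in Challenge 8 (dates, occupation) matters for reconciliation as well as for disambiguation.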

What's next

Addressing Architectural Challenges

• Generalize process for accessing context on the cache server and in the normalization layer
• Multi-authority search and reconciliation
• Address the need for cache refresh
• Mirrored cache servers

User Experience and Design

• User-centered Design
  • Observe, listen, learn, design, evaluate, iterate
• Iteratively design and evaluate UI for lookup/authorities with catalogers
• Search result ranking/ordering/filtering for catalogers
• Additional UI platforms, e.g., FOLIO

Questions

http://tinyurl.com/ld4l-auth-access

Appendix for Challenges 1-4

Challenge 1: Documentation

LoC: http://id.loc.gov/techcenter/
C. Harlow notes on reconciling LoC: https://github.com/cmh2166/lc-reconcile
OCLC FAST: https://www.oclc.org/developer/develop/web-services/fast-api/linked-data.en.html
GeoNames: http://www.geonames.org/export/geonames-search.html
AGROVOC: http://aims.fao.org/vest-registry/vocabularies/agrovoc-multilingual-agricultural-thesaurus
swagger config: https://github.com/NatLibFi/Skosmos/blob/master/swagger.json
NALT: https://agclass.nal.usda.gov/
DBpedia: http://wiki.dbpedia.org/OnlineAccess#1.2%20Public%20Faceted%20Web%20Service%20Interface

Challenge 2: Linked Data Access API

LoC
  for Search Query: not supported
  for Term Fetch: URI

OCLC FAST
  for Search Query: http://experimental.worldcat.org/fast/search?query={subauth}+all+%22{query}%22&sortKeys=usage&maximumRecords={maximumRecords}
  for Term Fetch: URI

GeoNames
  for Search Query: http://api.geonames.org/search?q={query}&maxRows={maxRows}&username={username}&type=rdf
  for Term Fetch: URI

AGROVOC
  for Search Query: http://artemide.art.uniroma2.it:8081/agrovoc/rest/v1/search?query={query}&lang={lang}
  for Term Fetch: http://artemide.art.uniroma2.it:8081/agrovoc/rest/v1/data?uri=http://aims.fao.org/aos/agrovoc/{term_id}

NALT
  for Search Query: http://skosmos.library.cornell.edu/rest/v1/nalt/search?query={query}&lang={lang}
  for Term Fetch: http://skosmos.library.cornell.edu/rest/v1/nalt/data?uri={term_uri}

DBpedia

Challenge 3: Varying Results Formats

Authority    for Search Query    for Term Fetch
LoC          not supported       rdf-xml
OCLC FAST    rdf-xml             rdf-xml
GeoNames     rdf-xml             rdf-xml
AGROVOC      json-ld             rdf-xml, json-ld, turtle
NALT         json-ld             rdf-xml, json-ld, turtle
DBpedia

Challenge 4: Varying Ontologies

Authority    Primary Ontology     Flat vs. Navigation required
LoC          madsrdf, SKOS        navigation required
OCLC FAST    schema.org, SKOS     flat
GeoNames     geonames             flat, hierarchical
AGROVOC      SKOS                 flat, hierarchical
NALT         SKOS                 flat, hierarchical
DBpedia      dbpedia              flat

Configurations for Questioning Authority

LoC: https://github.com/ld4l-labs/linked_data_authorities/tree/master/qa_loc/config/authorities/linked_data
OCLC FAST: https://github.com/ld4l-labs/linked_data_authorities/tree/master/qa_oclcfast/config/authorities/linked_data
GeoNames: https://github.com/ld4l-labs/linked_data_authorities/tree/master/qa_geonames/config/authorities/linked_data
AGROVOC: https://github.com/ld4l-labs/linked_data_authorities/tree/master/qa_agrovoc/config/authorities/linked_data
NALT: https://github.com/ld4l-labs/linked_data_authorities/tree/master/qa_nalt/config/authorities/linked_data
DBpedia: https://github.com/ld4l-labs/linked_data_authorities/tree/master/qa_dbpedia/config/authorities/linked_data

Appendix for Challenges 5-7

Creating a Cache Server

Hardware

• 8-core, 64 GB, 3 GHz Mac Pro (late 2013), macOS Sierra (10.12.6)
• 32 TB Pegasus-2 Thunderbolt RAID configured as RAID-5

Triplestore

• Apache Jena Fuseki 2.4.0 provides SPARQL endpoint
• Apache Tomcat 9.0 runs custom web application(s)
• Apache Lucene 3.6 provides search interface

Customizations

• custom per-data-source JSP web application provides search/browse/download functionality
• custom (generic) SPARQL Tag Library provides API for web apps (available at https://github.com/eichmann/lod-utilities)
• custom (generic) Lucene Tag Library provides API for web apps

Loading a New Vocabulary

• download RDF
• if necessary, convert to n-triples (required for GeoNames data, for instance)
• use tdbloader2 to populate the triplestore
• configure Fuseki server(s) with triplestore details
• create new JSP project in Eclipse
• write one or more indexer programs that populate Lucene indices, and run the indexer(s)
• write search/browse/download application logic using the SPARQL and Lucene tags
• package project as a WAR
• deploy to Apache Tomcat server(s)
• add new service to Apache HTTPD virtual host specification

UI Access to Cache Server

http://services.ld4l.org/ld4l_services/loc_name.jsp

Downloads

LoC: http://id.loc.gov/download/ (n-triples OR rdf-xml)
OCLC FAST: http://www.oclc.org/research/themes/data-science/fast/download.html (n-triples)
GeoNames: http://www.geonames.org/ontology/documentation.html (custom format – see notes for processing)
AGROVOC: https://aims-fao.atlassian.net/wiki/spaces/AGV/pages/2949126/Releases (n-triples OR rdf-xml)
NALT: https://agclass.nal.usda.gov/download.shtml (rdf-xml)
DBpedia: http://wiki.dbpedia.org/downloads-2016-04

Potential Options for Reconciliation

• VIAF for name reconciliation – we are doing some work with this
• Wikidata – I've heard that they are working on reconciliation issues, but haven't yet explored in depth
  • Intro Video (3hrs)
  • API Access
  • SPARQL – user manual
  • federated queries with other authorities

Doing a Google search for "linked data reconciliation" returns a large number of articles and presentations on this concept.

Links to Code & More

• QA Server - Code for a small app that provides the Questioning Authority normalization layer
• Linked Data Authorities - Configurations that can be used with QA Server
• LD4L Services - UI access to Cache Server
• VitroLib - Code for the VitroLib cataloging tool

Page 21: Linking the Data: Building effective Authority and …Linking the Data: Building Effective Authority and Identity Lookup Huda Khan and E. Lynette Rayle Cornell University Collaborators:

Multi-Server Architecture

QA ndash normalize RDF returned from an authority

Multi-Server Architecture

QA ndash normalize RDF returned from an authority

Direct Access

of External

Authority

HyraxVitrolib ndash UI for selecting an entry from an authority

Multi-Server Architecture

QA ndash normalize RDF returned from an authority

httplocalhost3000qasearchlinked_data

oclc_fastpersonal_nameq=twainamp

maximumRecords=2

Direct Access

of External

Authority

HyraxVitrolib ndash UI for selecting an entry from an authority

Multi-Server Architecture

QA ndash normalize RDF returned from an authority

httplocalhost3000qasearchlinked_data

oclc_fastpersonal_nameq=twainamp

maximumRecords=2

httpexperimentalworldcatorgfast

searchquery=oclcpersonalName+22twain22

ampsortKeys=usageampmaximumRecords=2

Direct Access

of External

Authority

HyraxVitrolib ndash UI for selecting an entry from an authority

Multi-Server Architecture

QA ndash normalize RDF returned from an authority

httplocalhost3000qasearchlinked_data

oclc_fastpersonal_nameq=twainamp

maximumRecords=2

httpexperimentalworldcatorgfast

searchquery=oclcpersonalName+22twain22

ampsortKeys=usageampmaximumRecords=2

lthttpidworldcatorgfast31622gt

a schemaPerson

dctermsidentifier 31622

skosprefLabel Twain Mark 1835-1910

skosaltLabel Make Teviin 1835-1910

Make Tuwen 1835-1910

lthttpidworldcatorgfast365563gt

a schemaPerson

dctermsidentifier 365563

skosprefLabel Twain Shania

skosaltLabel Twain Eilleen

Edwards Eilleen

Direct Access

of External

Authority

HyraxVitrolib ndash UI for selecting an entry from an authority

Multi-Server Architecture

QA ndash normalize RDF returned from an authority

httplocalhost3000qasearchlinked_data

oclc_fastpersonal_nameq=twainamp

maximumRecords=2

[urihttpidworldcatorgfast31622

id31622 labelTwain Mark 1835-1910

urihttpidworldcatorgfast365563

id365563labelTwain Shania ]

httpexperimentalworldcatorgfast

searchquery=oclcpersonalName+22twain22

ampsortKeys=usageampmaximumRecords=2

lthttpidworldcatorgfast31622gt

a schemaPerson

dctermsidentifier 31622

skosprefLabel Twain Mark 1835-1910

skosaltLabel Make Teviin 1835-1910

Make Tuwen 1835-1910

lthttpidworldcatorgfast365563gt

a schemaPerson

dctermsidentifier 365563

skosprefLabel Twain Shania

skosaltLabel Twain Eilleen

Edwards Eilleen

Direct Access

of External

Authority

HyraxVitrolib ndash UI for selecting an entry from an authority

Direct Access Query API

Direct against authorityhellip

httpexperimentalworldcatorgfastsearch

query=oclcpersonalName+22twain22

ampmaximumRecords=2

httpapigeonamesorgsearch

q=ithaca

ampmaxRows=2

ampusername=demo

amptype=rdf

httpartemideartuniroma2it8081agrovocrestv1search

query=milk

amplang=en

ampmaxhits=2

Normalized Query API

Through QA normalization layerhellip

httplocalhost3000qasearchlinked_dataoclc_fast

q=twain

ampmaxRecords=2

httplocalhost3000qasearchlinked_datageonames

q=ithaca

ampmaxRecords=2

httplocalhost3000qasearchlinked_dataagrovoc

q=milk

ampmaxRecords=2

amplang=en

Normalized Results

[urihttpidworldcatorgfast31622 id31622 labelTwain Mark 1835-1910 urihttpidworldcatorgfast365563 id365563 labelTwain Shania]

[uri httpswsgeonamesorg2162552 id httpswsgeonamesorg2162552 label Ithaca (AU) uri httpswsgeonamesorg4515289 id httpswsgeonamesorg4515289 label Ithaca (US)]

[uri httpaimsfaoorgaosagrovocc_8602 id httpaimsfaoorgaosagrovocc_8602 label acidophilus milk uri httpaimsfaoorgaosagrovocc_16076 id httpaimsfaoorgaosagrovocc_16076 label buffalo milk]

OCLC FAST GeoNames AgroVoc

Second Set of Challenges

5 Reliability amp Efficiency eg server uptime server load

6 Accuracy eg select results based on usage data lexical match

custom weighting other

7 Order Ranking eg How to order a graph

Cache Server Query Process

JSP Query API

Jena-Fuseki

Triplestore

One full setup per authority

LuceneSOLR

Index

Cache Server Query Process

JSP Query API

httpservicesld4lorgld4l_servicesloc_name_batchjspquery=ezra20cornellampmaxRecords=10

Jena-Fuseki

Triplestore

LuceneSOLR

Index

One full setup per authority

Cache Server Query Process

JSP Query API

lucene search for ezra cornell

index built with predicate values

ltskosprefLabelgt

ltskosaltLabelgt

Jena-Fuseki

Triplestore

LuceneSOLR

Index

httpservicesld4lorgld4l_servicesloc_name_batchjspquery=ezra20cornellampmaxRecords=10

One full setup per authority

Cache Server Query Process

JSP Query API

extract search rank

extract URI

Jena-Fuseki

Triplestore

for each result

LuceneSOLR

Index

lucene search for ezra cornell

index built with predicate values

ltskosprefLabelgt

ltskosaltLabelgt

httpservicesld4lorgld4l_servicesloc_name_batchjspquery=ezra20cornellampmaxRecords=10

One full setup per authority

Cache Server Query Process

JSP Query API

sparql query for URI

Jena-Fuseki

Triplestore

LuceneSOLR

Index

extract search rank

extract URI

for each result

lucene search for ezra cornell

index built with predicate values

ltskosprefLabelgt

ltskosaltLabelgt

httpservicesld4lorgld4l_servicesloc_name_batchjspquery=ezra20cornellampmaxRecords=10

One full setup per authority

Cache Server Query Process

JSP Query API

combine all results

Jena-Fuseki

Triplestore

insert search rank in predicate

lthttpvivoweborgontology

corerankgt

LuceneSOLR

Index

sparql query for URI extract search rank

extract URI

for each result

lucene search for ezra cornell

index built with predicate values

ltskosprefLabelgt

ltskosaltLabelgt

httpservicesld4lorgld4l_servicesloc_name_batchjspquery=ezra20cornellampmaxRecords=10

One full setup per authority

UI-QA-Authority

QA ndash normalize RDF returned from an authority

httplocalhost3000qasearchlinked_data

oclc_fastpersonal_nameq=twainampmaximumRecords=2

[urihttpidworldcatorgfast31622id31622

labelTwain Mark 1835-1910

urihttpidworldcatorgfast365563id365563

labelTwain Shania

httpexperimentalworldcatorgfastsearchquery=o

clcpersonalName+22twain22

ampsortKeys=usageampmaximumRecords=2

RDF of

search

results

Active-Triples

LDF Cache

(Marmotta or

Blazegraph) LDF Cache Jena-Fuseki-

Lucene

Cache

Direct Access

of External

Authority

HyraxVitrolib ndash UI for selecting an entry from an authority

search of cache performed via Lucene index

Third Set of Challenges

8 Disambiguation through better context eg expand from just prefLabel tohellip

preLabel altLabel birthdeath dates occupation etc

9 Reconciliation across multiple sources eg match LoC URI to OCLC FAST URI

53

Whatrsquos next

Addressing Architectural Challenges

bull Generalize process for accessing context on the

cache server and in the normalization layer

bull Multi-authority search and reconciliation

bull Address the need for cache refresh

bull Mirrored cache servers

User Experience and Design

bull User-centered Design

bull Observe listen learn design evaluate iterate

bull Iteratively design and evaluate UI for lookupauthorities

with catalogers

bull Search result rankingorderingfiltering for catalogers

bull Additional UI platforms eg FOLIO

56

Questions

httptinyurlcomld4l-auth-access

Appendix for Challenges 1-4

Challenge 1 Documentation

58

LoC httpidlocgovtechcenter

C Harlow notes on reconciling LoC - httpsgithubcomcmh2166lc-reconcile

OCLC FAST

httpswwwoclcorgdeveloperdevelopweb-servicesfast-apilinked-dataenhtml

GeoNames

httpwwwgeonamesorgexportgeonames-searchhtml

AGROVOC httpaimsfaoorgvest-registryvocabulariesagrovoc-multilingual-agricultural-thesaurus

swagger config httpsgithubcomNatLibFiSkosmosblobmasterswaggerjson

NALT

httpsagclassnalusdagov

DBpedia httpwikidbpediaorgOnlineAccess1220Public20Faceted20Web20Service20Inter

face

Challenge 2: Linked Data Access API

LoC
  Search query: not supported
  Term fetch: URI

OCLC FAST
  Search query: http://experimental.worldcat.org/fast/search?query={subauth}+all+%22{query}%22&sortKeys=usage&maximumRecords={maximumRecords}
  Term fetch: URI

GeoNames
  Search query: http://api.geonames.org/search?q={query}&maxRows={maxRows}&username={username}&type=rdf
  Term fetch: URI

AGROVOC
  Search query: http://artemide.art.uniroma2.it:8081/agrovoc/rest/v1/search?query={query}&lang={lang}
  Term fetch: http://artemide.art.uniroma2.it:8081/agrovoc/rest/v1/data?uri=http://aims.fao.org/aos/agrovoc/{term_id}

NALT
  Search query: http://skosmos.library.cornell.edu/rest/v1/nalt/search?query={query}&lang={lang}
  Term fetch: http://skosmos.library.cornell.edu/rest/v1/nalt/data?uri={term_uri}

DBpedia
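To make the differences between these search APIs concrete, here is a minimal sketch that fills in three of the search URL patterns listed above. The placeholder names, the dictionary layout, and the function name are illustrative assumptions; only the URL shapes come from the slide.

```python
from urllib.parse import quote

# URL patterns transcribed from the table above. The {placeholder} names and
# this dictionary are assumptions made for the sketch.
SEARCH_TEMPLATES = {
    "oclc_fast": (
        "http://experimental.worldcat.org/fast/search"
        "?query={subauth}+all+%22{query}%22"
        "&sortKeys=usage&maximumRecords={max_records}"
    ),
    "geonames": (
        "http://api.geonames.org/search"
        "?q={query}&maxRows={max_records}&username={username}&type=rdf"
    ),
    "agrovoc": (
        "http://artemide.art.uniroma2.it:8081/agrovoc/rest/v1/search"
        "?query={query}&lang={lang}"
    ),
}


def build_search_url(authority, query, **params):
    """Percent-encode every value and substitute it into the authority's pattern."""
    values = {name: quote(str(value), safe="") for name, value in params.items()}
    values["query"] = quote(query, safe="")
    return SEARCH_TEMPLATES[authority].format(**values)


if __name__ == "__main__":
    print(build_search_url("oclc_fast", "twain", subauth="oclc.personalName", max_records=2))
    print(build_search_url("geonames", "ithaca", max_records=2, username="demo"))
    print(build_search_url("agrovoc", "buffalo milk", lang="en"))
```

Each authority needs its own parameter names and its own pattern, which is the burden a normalization layer takes off the calling application.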

Challenge 3: Varying Results Formats

Authority    Search query     Term fetch
LoC          not supported    rdf-xml
OCLC FAST    rdf-xml          rdf-xml
GeoNames     rdf-xml          rdf-xml
AGROVOC      json-ld          rdf-xml, json-ld, turtle
NALT         json-ld          rdf-xml, json-ld, turtle
DBpedia

Challenge 4: Varying Ontologies

Authority    Primary ontology     Flat vs. navigation required
LoC          madsrdf, SKOS        navigation required
OCLC FAST    schema.org, SKOS     flat
GeoNames     geonames             flat, hierarchical
AGROVOC      SKOS                 flat, hierarchical
NALT         SKOS                 flat, hierarchical
DBpedia      dbpedia              flat

Configurations for Questioning Authority

LoC: https://github.com/ld4l-labs/linked_data_authorities/tree/master/qa_loc/config/authorities/linked_data

OCLC FAST: https://github.com/ld4l-labs/linked_data_authorities/tree/master/qa_oclcfast/config/authorities/linked_data

GeoNames: https://github.com/ld4l-labs/linked_data_authorities/tree/master/qa_geonames/config/authorities/linked_data

AGROVOC: https://github.com/ld4l-labs/linked_data_authorities/tree/master/qa_agrovoc/config/authorities/linked_data

NALT: https://github.com/ld4l-labs/linked_data_authorities/tree/master/qa_nalt/config/authorities/linked_data

DBpedia: https://github.com/ld4l-labs/linked_data_authorities/tree/master/qa_dbpedia/config/authorities/linked_data

Appendix for Challenges 5-7

Creating a Cache Server

Hardware

• 8-core, 64 GB, 3 GHz Mac Pro (late 2013), macOS Sierra (10.12.6)

• 32 TB Pegasus-2 Thunderbolt RAID configured as RAID-5

Triplestore

• Apache Jena Fuseki 2.4.0 provides SPARQL endpoint

• Apache Tomcat 9.0 runs custom web application(s)

• Apache Lucene 3.6 provides search interface

Customizations

• custom per-data-source JSP web application provides search/browse/download functionality

• custom (generic) SPARQL Tag Library provides API for web apps (available at https://github.com/eichmann/lod-utilities)

• custom (generic) Lucene Tag Library provides API for web apps

Loading a New Vocabulary

• download RDF

• if necessary, convert to n-triples (required for GeoNames data, for instance)

• use tdbloader2 to populate the triplestore

• configure Fuseki server(s) with triplestore details

• create new JSP project in Eclipse

• write one or more indexer programs that populate Lucene indices, and run the indexer(s)

• write search/browse/download application logic using the SPARQL and Lucene tags

• package project as WAR

• deploy to Apache Tomcat server(s)

• add new service to Apache HTTPD virtual host specification
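The first few loading steps could be scripted along the following lines. This is a sketch under assumptions, not the project's actual tooling: it only builds the command lines and does not run them, the file and directory names are hypothetical, and the use of Jena's riot tool for the conversion step is an assumption (the slides name only tdbloader2 and Fuseki).

```python
import os


def loading_plan(rdf_file, tdb_dir, dataset_name, needs_conversion=False):
    """Return argument lists for: optional conversion to n-triples,
    bulk load with tdbloader2, and starting Fuseki on the loaded store."""
    commands = []
    data_file = rdf_file
    if needs_conversion:
        data_file = os.path.splitext(rdf_file)[0] + ".nt"
        # riot writes to standard output; redirecting it into data_file
        # is left to whoever runs the command.
        commands.append(["riot", "--output=N-Triples", rdf_file])
    commands.append(["tdbloader2", "--loc", tdb_dir, data_file])
    commands.append(["fuseki-server", "--loc=" + tdb_dir, "/" + dataset_name])
    return commands


if __name__ == "__main__":
    for command in loading_plan("nalt.rdf", "/data/tdb/nalt", "nalt", needs_conversion=True):
        print(" ".join(command))
```

The remaining steps (indexer programs, JSP application, WAR packaging, HTTPD configuration) are specific to each data source and are not covered by this sketch.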

UI Access to Cache Server

http://services.ld4l.org/ld4l_services/loc_name.jsp

Downloads

LoC: http://id.loc.gov/download/ (n-triples OR rdf-xml)

OCLC FAST: http://www.oclc.org/research/themes/data-science/fast/download.html (n-triples)

GeoNames: http://www.geonames.org/ontology/documentation.html (custom format – see notes for processing)

AGROVOC: https://aims-fao.atlassian.net/wiki/spaces/AGV/pages/2949126/Releases (n-triples OR rdf-xml)

NALT: https://agclass.nal.usda.gov/download.shtml (rdf-xml)

DBpedia: http://wiki.dbpedia.org/downloads-2016-04

Potential Options for Reconciliation

• VIAF for name reconciliation – we are doing some work with this

• Wikidata – I've heard that they are working on reconciliation issues, but haven't yet explored in depth

  • Intro Video (3 hrs)

  • API Access

  • SPARQL – user manual

  • federated queries with other authorities

Doing a Google search for "linked data reconciliation" returns a large number of articles and presentations on this concept.

Links to Code & More

• QA Server - Code for a small app that provides the Questioning Authority normalization layer

• Linked Data Authorities - Configurations that can be used with QA Server

• LD4L Services - UI access to Cache Server

• VitroLib - Code for the VitroLib cataloging tool


Multi-Server Architecture

Hyrax/VitroLib – UI for selecting an entry from an authority

QA – normalize RDF returned from an authority

Direct Access of External Authority

Request from the UI to QA:

http://localhost:3000/qa/search/linked_data/oclc_fast/personal_name?q=twain&maximumRecords=2

Request from QA to the external authority:

http://experimental.worldcat.org/fast/search?query=oclc.personalName+%22twain%22&sortKeys=usage&maximumRecords=2

RDF returned by the authority:

<http://id.worldcat.org/fast/31622>
    a schema:Person ;
    dcterms:identifier "31622" ;
    skos:prefLabel "Twain, Mark, 1835-1910" ;
    skos:altLabel "Make Teviin, 1835-1910", "Make Tuwen, 1835-1910" .

<http://id.worldcat.org/fast/365563>
    a schema:Person ;
    dcterms:identifier "365563" ;
    skos:prefLabel "Twain, Shania" ;
    skos:altLabel "Twain, Eilleen", "Edwards, Eilleen" .

Normalized results returned by QA to the UI:

[{"uri": "http://id.worldcat.org/fast/31622", "id": "31622", "label": "Twain, Mark, 1835-1910"},
 {"uri": "http://id.worldcat.org/fast/365563", "id": "365563", "label": "Twain, Shania"}]
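The normalization step in this architecture can be pictured with the sketch below, which reduces already-parsed triples to the uri/id/label records shown on the slide. It is not the Questioning Authority implementation; the function name and the plain-tuple triple representation are assumptions chosen to keep the example self-contained.

```python
SKOS_PREF_LABEL = "http://www.w3.org/2004/02/skos/core#prefLabel"
SKOS_ALT_LABEL = "http://www.w3.org/2004/02/skos/core#altLabel"
DCTERMS_IDENTIFIER = "http://purl.org/dc/terms/identifier"


def normalize(triples):
    """Collapse (subject, predicate, object) triples into one record per subject.

    The id falls back to the URI when the authority gives no identifier, and
    subjects without a preferred label are dropped.
    """
    records = {}
    for subject, predicate, obj in triples:
        record = records.setdefault(subject, {"uri": subject, "id": subject, "label": None})
        if predicate == DCTERMS_IDENTIFIER:
            record["id"] = obj
        elif predicate == SKOS_PREF_LABEL and record["label"] is None:
            record["label"] = obj
    return [record for record in records.values() if record["label"] is not None]


if __name__ == "__main__":
    twain = "http://id.worldcat.org/fast/31622"
    shania = "http://id.worldcat.org/fast/365563"
    triples = [
        (twain, DCTERMS_IDENTIFIER, "31622"),
        (twain, SKOS_PREF_LABEL, "Twain, Mark, 1835-1910"),
        (twain, SKOS_ALT_LABEL, "Make Tuwen, 1835-1910"),
        (shania, DCTERMS_IDENTIFIER, "365563"),
        (shania, SKOS_PREF_LABEL, "Twain, Shania"),
    ]
    for record in normalize(triples):
        print(record)
```

Because every authority uses a different ontology, a real normalization layer would take the label and identifier predicates from per-authority configuration rather than from constants.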

Direct Access Query API

Direct against authority…

http://experimental.worldcat.org/fast/search?query=oclc.personalName+%22twain%22&maximumRecords=2

http://api.geonames.org/search?q=ithaca&maxRows=2&username=demo&type=rdf

http://artemide.art.uniroma2.it:8081/agrovoc/rest/v1/search?query=milk&lang=en&maxhits=2

Normalized Query API

Through QA normalization layer…

http://localhost:3000/qa/search/linked_data/oclc_fast?q=twain&maxRecords=2

http://localhost:3000/qa/search/linked_data/geonames?q=ithaca&maxRecords=2

http://localhost:3000/qa/search/linked_data/agrovoc?q=milk&maxRecords=2&lang=en

Normalized Results

OCLC FAST:

[{"uri": "http://id.worldcat.org/fast/31622", "id": "31622", "label": "Twain, Mark, 1835-1910"},
 {"uri": "http://id.worldcat.org/fast/365563", "id": "365563", "label": "Twain, Shania"}]

GeoNames:

[{"uri": "http://sws.geonames.org/2162552", "id": "http://sws.geonames.org/2162552", "label": "Ithaca (AU)"},
 {"uri": "http://sws.geonames.org/4515289", "id": "http://sws.geonames.org/4515289", "label": "Ithaca (US)"}]

AGROVOC:

[{"uri": "http://aims.fao.org/aos/agrovoc/c_8602", "id": "http://aims.fao.org/aos/agrovoc/c_8602", "label": "acidophilus milk"},
 {"uri": "http://aims.fao.org/aos/agrovoc/c_16076", "id": "http://aims.fao.org/aos/agrovoc/c_16076", "label": "buffalo milk"}]
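From the client's side, the payoff of the normalized API is that one small helper works for every authority. The sketch below builds the normalized query URL and reads the uri/id/label response; the helper names are hypothetical, no network request is made, and the base URL is the local development address shown on the slides.

```python
import json
from urllib.parse import urlencode

# Local development address used in the slides.
QA_BASE = "http://localhost:3000/qa/search/linked_data"


def qa_search_url(authority, query, max_records=2, lang=None):
    """Build the same query shape for any authority configured in QA."""
    params = {"q": query, "maxRecords": max_records}
    if lang:
        params["lang"] = lang
    return f"{QA_BASE}/{authority}?{urlencode(params)}"


def labels_by_uri(response_body):
    """Parse a normalized response (a JSON list of uri/id/label objects)."""
    return {item["uri"]: item["label"] for item in json.loads(response_body)}


if __name__ == "__main__":
    print(qa_search_url("oclc_fast", "twain"))
    print(qa_search_url("agrovoc", "milk", lang="en"))
    sample = (
        '[{"uri": "http://aims.fao.org/aos/agrovoc/c_16076", '
        '"id": "http://aims.fao.org/aos/agrovoc/c_16076", "label": "buffalo milk"}]'
    )
    print(labels_by_uri(sample))
```

The same two functions serve OCLC FAST, GeoNames, and AGROVOC, in contrast to the three different direct query shapes.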

Second Set of Challenges

5. Reliability & Efficiency, e.g., server uptime, server load

6. Accuracy, e.g., select results based on usage data, lexical match, custom weighting, other

7. Order Ranking, e.g., how to order a graph

Cache Server Query Process

One full setup per authority: JSP Query API, Jena-Fuseki Triplestore, Lucene/SOLR Index

http://services.ld4l.org/ld4l_services/loc_name_batch.jsp?query=ezra%20cornell&maxRecords=10

1. Lucene search for "ezra cornell" (index built with predicate values <skos:prefLabel> and <skos:altLabel>)

2. For each result: extract search rank, extract URI

3. SPARQL query for URI against the Jena-Fuseki triplestore

4. Insert search rank in predicate <http://vivoweb.org/ontology/core#rank>

5. Combine all results
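The query process above can be sketched end to end with in-memory stand-ins. Everything here is illustrative: a dictionary of labels replaces the Lucene index, a dictionary of triples replaces the Fuseki SPARQL endpoint, the term-overlap scoring is a crude substitute for Lucene relevance, and the example.org URIs are hypothetical.

```python
VIVO_RANK = "http://vivoweb.org/ontology/core#rank"


def cache_search(query, label_index, triplestore, max_records=10):
    """Label search, then a per-URI lookup, then rank triples added to the graph.

    label_index: {uri: [prefLabel and altLabel strings]}   (stand-in for Lucene)
    triplestore: {uri: [(subject, predicate, object), ...]} (stand-in for SPARQL)
    """
    terms = query.lower().split()

    # Step 1: search the label index and score each hit by matching terms.
    hits = []
    for uri, labels in label_index.items():
        text = " ".join(labels).lower()
        score = sum(term in text for term in terms)
        if score:
            hits.append((score, uri))
    hits.sort(key=lambda hit: (-hit[0], hit[1]))

    combined = []
    # Step 2: for each result, take its rank and URI.
    for rank, (_score, uri) in enumerate(hits[:max_records], start=1):
        # Step 3: fetch the triples for that URI.
        triples = list(triplestore.get(uri, []))
        # Step 4: record the search rank as a triple in the result graph.
        triples.append((uri, VIVO_RANK, rank))
        # Step 5: combine all results into one graph.
        combined.extend(triples)
    return combined


if __name__ == "__main__":
    person = "http://example.org/names/1"
    org = "http://example.org/names/2"
    index = {person: ["Cornell, Ezra, 1807-1874"], org: ["Cornell University"]}
    store = {
        person: [(person, "skos:prefLabel", "Cornell, Ezra, 1807-1874")],
        org: [(org, "skos:prefLabel", "Cornell University")],
    }
    for triple in cache_search("ezra cornell", index, store):
        print(triple)
```

Carrying the rank inside the graph addresses the ordering challenge: an RDF graph has no inherent order, so the client sorts on the rank predicate.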

UI-QA-Authority

Hyrax/VitroLib – UI for selecting an entry from an authority

QA – normalize RDF returned from an authority

http://localhost:3000/qa/search/linked_data/oclc_fast/personal_name?q=twain&maximumRecords=2

[{"uri": "http://id.worldcat.org/fast/31622", "id": "31622", "label": "Twain, Mark, 1835-1910"},
 {"uri": "http://id.worldcat.org/fast/365563", "id": "365563", "label": "Twain, Shania"}]

http://experimental.worldcat.org/fast/search?query=oclc.personalName+%22twain%22&sortKeys=usage&maximumRecords=2

RDF of search results

Active-Triples

LDF Cache (Marmotta or Blazegraph)

LDF Cache

Jena-Fuseki-Lucene Cache

Direct Access of External Authority

search of cache performed via Lucene index


Page 23: Linking the Data: Building effective Authority and …Linking the Data: Building Effective Authority and Identity Lookup Huda Khan and E. Lynette Rayle Cornell University Collaborators:

Multi-Server Architecture

QA ndash normalize RDF returned from an authority

httplocalhost3000qasearchlinked_data

oclc_fastpersonal_nameq=twainamp

maximumRecords=2

Direct Access

of External

Authority

HyraxVitrolib ndash UI for selecting an entry from an authority

Multi-Server Architecture

QA ndash normalize RDF returned from an authority

httplocalhost3000qasearchlinked_data

oclc_fastpersonal_nameq=twainamp

maximumRecords=2

httpexperimentalworldcatorgfast

searchquery=oclcpersonalName+22twain22

ampsortKeys=usageampmaximumRecords=2

Direct Access

of External

Authority

HyraxVitrolib ndash UI for selecting an entry from an authority

Multi-Server Architecture

QA ndash normalize RDF returned from an authority

httplocalhost3000qasearchlinked_data

oclc_fastpersonal_nameq=twainamp

maximumRecords=2

httpexperimentalworldcatorgfast

searchquery=oclcpersonalName+22twain22

ampsortKeys=usageampmaximumRecords=2

lthttpidworldcatorgfast31622gt

a schemaPerson

dctermsidentifier 31622

skosprefLabel Twain Mark 1835-1910

skosaltLabel Make Teviin 1835-1910

Make Tuwen 1835-1910

lthttpidworldcatorgfast365563gt

a schemaPerson

dctermsidentifier 365563

skosprefLabel Twain Shania

skosaltLabel Twain Eilleen

Edwards Eilleen

Direct Access

of External

Authority

HyraxVitrolib ndash UI for selecting an entry from an authority

Multi-Server Architecture

QA ndash normalize RDF returned from an authority

httplocalhost3000qasearchlinked_data

oclc_fastpersonal_nameq=twainamp

maximumRecords=2

[urihttpidworldcatorgfast31622

id31622 labelTwain Mark 1835-1910

urihttpidworldcatorgfast365563

id365563labelTwain Shania ]

httpexperimentalworldcatorgfast

searchquery=oclcpersonalName+22twain22

ampsortKeys=usageampmaximumRecords=2

lthttpidworldcatorgfast31622gt

a schemaPerson

dctermsidentifier 31622

skosprefLabel Twain Mark 1835-1910

skosaltLabel Make Teviin 1835-1910

Make Tuwen 1835-1910

lthttpidworldcatorgfast365563gt

a schemaPerson

dctermsidentifier 365563

skosprefLabel Twain Shania

skosaltLabel Twain Eilleen

Edwards Eilleen

Direct Access

of External

Authority

HyraxVitrolib ndash UI for selecting an entry from an authority

Direct Access Query API

Direct against authorityhellip

httpexperimentalworldcatorgfastsearch

query=oclcpersonalName+22twain22

ampmaximumRecords=2

httpapigeonamesorgsearch

q=ithaca

ampmaxRows=2

ampusername=demo

amptype=rdf

httpartemideartuniroma2it8081agrovocrestv1search

query=milk

amplang=en

ampmaxhits=2

Normalized Query API

Through QA normalization layerhellip

httplocalhost3000qasearchlinked_dataoclc_fast

q=twain

ampmaxRecords=2

httplocalhost3000qasearchlinked_datageonames

q=ithaca

ampmaxRecords=2

httplocalhost3000qasearchlinked_dataagrovoc

q=milk

ampmaxRecords=2

amplang=en

Normalized Results

[urihttpidworldcatorgfast31622 id31622 labelTwain Mark 1835-1910 urihttpidworldcatorgfast365563 id365563 labelTwain Shania]

[uri httpswsgeonamesorg2162552 id httpswsgeonamesorg2162552 label Ithaca (AU) uri httpswsgeonamesorg4515289 id httpswsgeonamesorg4515289 label Ithaca (US)]

[uri httpaimsfaoorgaosagrovocc_8602 id httpaimsfaoorgaosagrovocc_8602 label acidophilus milk uri httpaimsfaoorgaosagrovocc_16076 id httpaimsfaoorgaosagrovocc_16076 label buffalo milk]

OCLC FAST GeoNames AgroVoc

Second Set of Challenges

5 Reliability amp Efficiency eg server uptime server load

6 Accuracy eg select results based on usage data lexical match

custom weighting other

7 Order Ranking eg How to order a graph

Cache Server Query Process

JSP Query API

Jena-Fuseki

Triplestore

One full setup per authority

LuceneSOLR

Index

Cache Server Query Process

JSP Query API

httpservicesld4lorgld4l_servicesloc_name_batchjspquery=ezra20cornellampmaxRecords=10

Jena-Fuseki

Triplestore

LuceneSOLR

Index

One full setup per authority

Cache Server Query Process

JSP Query API

lucene search for ezra cornell

index built with predicate values

ltskosprefLabelgt

ltskosaltLabelgt

Jena-Fuseki

Triplestore

LuceneSOLR

Index

httpservicesld4lorgld4l_servicesloc_name_batchjspquery=ezra20cornellampmaxRecords=10

One full setup per authority

Cache Server Query Process

JSP Query API

extract search rank

extract URI

Jena-Fuseki

Triplestore

for each result

LuceneSOLR

Index

lucene search for ezra cornell

index built with predicate values

ltskosprefLabelgt

ltskosaltLabelgt

httpservicesld4lorgld4l_servicesloc_name_batchjspquery=ezra20cornellampmaxRecords=10

One full setup per authority

Cache Server Query Process

JSP Query API

sparql query for URI

Jena-Fuseki

Triplestore

LuceneSOLR

Index

extract search rank

extract URI

for each result

lucene search for ezra cornell

index built with predicate values

ltskosprefLabelgt

ltskosaltLabelgt

httpservicesld4lorgld4l_servicesloc_name_batchjspquery=ezra20cornellampmaxRecords=10

One full setup per authority

Cache Server Query Process

JSP Query API

combine all results

Jena-Fuseki

Triplestore

insert search rank in predicate

lthttpvivoweborgontology

corerankgt

LuceneSOLR

Index

sparql query for URI extract search rank

extract URI

for each result

lucene search for ezra cornell

index built with predicate values

ltskosprefLabelgt

ltskosaltLabelgt

httpservicesld4lorgld4l_servicesloc_name_batchjspquery=ezra20cornellampmaxRecords=10

One full setup per authority

UI-QA-Authority

QA ndash normalize RDF returned from an authority

httplocalhost3000qasearchlinked_data

oclc_fastpersonal_nameq=twainampmaximumRecords=2

[urihttpidworldcatorgfast31622id31622

labelTwain Mark 1835-1910

urihttpidworldcatorgfast365563id365563

labelTwain Shania

httpexperimentalworldcatorgfastsearchquery=o

clcpersonalName+22twain22

ampsortKeys=usageampmaximumRecords=2

RDF of

search

results

Active-Triples

LDF Cache

(Marmotta or

Blazegraph) LDF Cache Jena-Fuseki-

Lucene

Cache

Direct Access

of External

Authority

HyraxVitrolib ndash UI for selecting an entry from an authority

search of cache performed via Lucene index

Third Set of Challenges

8 Disambiguation through better context eg expand from just prefLabel tohellip

preLabel altLabel birthdeath dates occupation etc

9 Reconciliation across multiple sources eg match LoC URI to OCLC FAST URI

53

Whatrsquos next

Addressing Architectural Challenges

bull Generalize process for accessing context on the

cache server and in the normalization layer

bull Multi-authority search and reconciliation

bull Address the need for cache refresh

bull Mirrored cache servers

User Experience and Design

bull User-centered Design

bull Observe listen learn design evaluate iterate

bull Iteratively design and evaluate UI for lookupauthorities

with catalogers

bull Search result rankingorderingfiltering for catalogers

bull Additional UI platforms eg FOLIO

56

Questions

httptinyurlcomld4l-auth-access

Appendix for Challenges 1-4

Challenge 1 Documentation

58

LoC httpidlocgovtechcenter

C Harlow notes on reconciling LoC - httpsgithubcomcmh2166lc-reconcile

OCLC FAST

httpswwwoclcorgdeveloperdevelopweb-servicesfast-apilinked-dataenhtml

GeoNames

httpwwwgeonamesorgexportgeonames-searchhtml

AGROVOC httpaimsfaoorgvest-registryvocabulariesagrovoc-multilingual-agricultural-thesaurus

swagger config httpsgithubcomNatLibFiSkosmosblobmasterswaggerjson

NALT

httpsagclassnalusdagov

DBpedia httpwikidbpediaorgOnlineAccess1220Public20Faceted20Web20Service20Inter

face

Challenge 2 Linked Data Access API

59

for Search Query for Term Fetch

LoC not supported URI

OCLC FAST httpexperimentalworldcatorgfastsearchq

uery=subauth+all+22query22ampsortK

eys=usageampmaximumRecords=maximumR

ecords

URI

GeoNames httpapigeonamesorgsearchq=queryamp

maxRows=maxRowsampusername=userna

meamptype=rdf

URI

AGROVOC httpartemideartuniroma2it8081agrovocr

estv1searchquery=queryamplang=lang

httpartemideartuniroma2it8081agrovo

crestv1datauri=httpaimsfaoorgaosa

grovocterm_id

NALT httpskosmoslibrarycornelledurestv1nalt

searchquery=queryamplang=lang

httpskosmoslibrarycornelledurestv1na

ltdatauri=term_uri

DBpedia

Challenge 3 Varying Results Formats

60

for Search Query for Term Fetch

LoC not supported rdf-xml

OCLC FAST rdf-xml rdf-xml

GeoNames rdf-xml rdf-xml

AGROVOC json-ld rdf-xml json-ld turtle

NALT json-ld rdf-xml json-ld turtle

DBpedia

Challenge 4 Varying Ontologies

61

Primary Ontology Flat vs Navigation

required

LoC madsrdf

SKOS

navigation required

OCLC FAST schemaorg

SKOS

flat

GeoNames geonames flat

hierarchical

AGROVOC SKOS flat

hierarchical

NALT SKOS flat

hierarchical

DBpedia dbpedia flat

Configurations for Questioning Authority

62

LoC httpsgithubcomld4l-

labslinked_data_authoritiestreemasterqa_locconfigauthoritieslinked_dat

a

OCLC FAST httpsgithubcomld4l-

labslinked_data_authoritiestreemasterqa_oclcfastconfigauthoritieslinked

_data

GeoNames httpsgithubcomld4l-

labslinked_data_authoritiestreemasterqa_geonamesconfigauthoritieslink

ed_data

AGROVOC httpsgithubcomld4l-

labslinked_data_authoritiestreemasterqa_agrovocconfigauthoritieslinked

_data

NALT httpsgithubcomld4l-

labslinked_data_authoritiestreemasterqa_naltconfigauthoritieslinked_dat

a

DBpedia httpsgithubcomld4l-

labslinked_data_authoritiestreemasterqa_dbpediaconfigauthoritieslinked

_data

Appendix for Challenges 5-7

Creating a Cache Server

Hardware

bull 8-core 64gb 3Ghz Mac Pro (late 2013) macOS Sierra

(10126)

bull 32tb Pegasus-2 Thunderbolt RAID configured as RAID-5

Triplestore

bull Apache Jena Fuseki 240 provides SPARQL endpoint

bull Apache Tomcat 90 runs custom web application(s)

bull Apache Lucene 36 provides search interface

64

Customizations

bull custom per-data-source JSP web application provides

searchbrowsedownload functionality

bull custom (generic) SPARQL Tag Library provides API for web

apps (available at httpsgithubcomeichmannlod-utilities)

bull custom (generic) Lucene Tag Library provides API for web apps

65

Loading a New Vocabulary

bull download RDF

bull if necessary convert to n-triples (required for GeoNames data for instance)

bull use tdbloader2 to populated triplestore

bull configure Fuseki server(s) with triplestore details

bull create new JSP project in Eclipse

bull write one or more indexer programs that populate Lucene indices and run indexer(s)

bull write searchbrowsedownload application logic using the SPARQL and Lucene tags

bull package project as war

bull deploy to Apache Tomcat server(s)

bull add new service to Apache HTTPD virtual host specification

66

UI Access to Cache Server

httpservicesld4lorgld4l_servicesloc_namejsp

Downloads

68

LoC httpidlocgovdownload (n-triples OR rdf-xml)

OCLC FAST httpwwwoclcorgresearchthemesdata-sciencefastdownloadhtml (n-triples)

GeoNames httpwwwgeonamesorgontologydocumentationhtml (custom format ndash see notes for processing)

AGROVOC httpsaims-faoatlassiannetwikispacesAGVpages2949126Releases (n-triples OR rdf-xml)

NALT httpsagclassnalusdagovdownloadshtml (rdf-xml)

DBpedia httpwikidbpediaorgdownloads-2016-04

Potential Options for Reconciliation

bull VIAF for name reconciliation ndash we are doing some

work with this

bull Wikidata ndash Ive heard that they are working on

Reconciliation issues but havent yet explored in

depth bull Intro Video (3hrs)

bull API Access

bull SPARQL ndash user manual

bull federated queries with other authorities

Doing a google search for linked data reconciliation

returns a large number of articles and presentations

on this concept

Links to Code amp More

bull QA Server - Code for a small app that provides the

Questioning Authority normalization layer

bull Linked Data Authorities - Configurations that can

be used with QA Server

bull LD4L Services - UI access to Cache Server

bull VitroLib - Code for the VitroLib cataloging tool

Page 24: Linking the Data: Building effective Authority and …Linking the Data: Building Effective Authority and Identity Lookup Huda Khan and E. Lynette Rayle Cornell University Collaborators:

Multi-Server Architecture

QA ndash normalize RDF returned from an authority

httplocalhost3000qasearchlinked_data

oclc_fastpersonal_nameq=twainamp

maximumRecords=2

httpexperimentalworldcatorgfast

searchquery=oclcpersonalName+22twain22

ampsortKeys=usageampmaximumRecords=2

Direct Access

of External

Authority

HyraxVitrolib ndash UI for selecting an entry from an authority

Multi-Server Architecture

QA ndash normalize RDF returned from an authority

httplocalhost3000qasearchlinked_data

oclc_fastpersonal_nameq=twainamp

maximumRecords=2

httpexperimentalworldcatorgfast

searchquery=oclcpersonalName+22twain22

ampsortKeys=usageampmaximumRecords=2

lthttpidworldcatorgfast31622gt

a schemaPerson

dctermsidentifier 31622

skosprefLabel Twain Mark 1835-1910

skosaltLabel Make Teviin 1835-1910

Make Tuwen 1835-1910

lthttpidworldcatorgfast365563gt

a schemaPerson

dctermsidentifier 365563

skosprefLabel Twain Shania

skosaltLabel Twain Eilleen

Edwards Eilleen

Direct Access

of External

Authority

HyraxVitrolib ndash UI for selecting an entry from an authority

Multi-Server Architecture

QA ndash normalize RDF returned from an authority

httplocalhost3000qasearchlinked_data

oclc_fastpersonal_nameq=twainamp

maximumRecords=2

[urihttpidworldcatorgfast31622

id31622 labelTwain Mark 1835-1910

urihttpidworldcatorgfast365563

id365563labelTwain Shania ]

httpexperimentalworldcatorgfast

searchquery=oclcpersonalName+22twain22

ampsortKeys=usageampmaximumRecords=2

lthttpidworldcatorgfast31622gt

a schemaPerson

dctermsidentifier 31622

skosprefLabel Twain Mark 1835-1910

skosaltLabel Make Teviin 1835-1910

Make Tuwen 1835-1910

lthttpidworldcatorgfast365563gt

a schemaPerson

dctermsidentifier 365563

skosprefLabel Twain Shania

skosaltLabel Twain Eilleen

Edwards Eilleen

Direct Access

of External

Authority

HyraxVitrolib ndash UI for selecting an entry from an authority

Direct Access Query API

Direct against authorityhellip

httpexperimentalworldcatorgfastsearch

query=oclcpersonalName+22twain22

ampmaximumRecords=2

httpapigeonamesorgsearch

q=ithaca

ampmaxRows=2

ampusername=demo

amptype=rdf

httpartemideartuniroma2it8081agrovocrestv1search

query=milk

amplang=en

ampmaxhits=2

Normalized Query API

Through QA normalization layerhellip

httplocalhost3000qasearchlinked_dataoclc_fast

q=twain

ampmaxRecords=2

httplocalhost3000qasearchlinked_datageonames

q=ithaca

ampmaxRecords=2

httplocalhost3000qasearchlinked_dataagrovoc

q=milk

ampmaxRecords=2

amplang=en

Normalized Results

[urihttpidworldcatorgfast31622 id31622 labelTwain Mark 1835-1910 urihttpidworldcatorgfast365563 id365563 labelTwain Shania]

[uri httpswsgeonamesorg2162552 id httpswsgeonamesorg2162552 label Ithaca (AU) uri httpswsgeonamesorg4515289 id httpswsgeonamesorg4515289 label Ithaca (US)]

[uri httpaimsfaoorgaosagrovocc_8602 id httpaimsfaoorgaosagrovocc_8602 label acidophilus milk uri httpaimsfaoorgaosagrovocc_16076 id httpaimsfaoorgaosagrovocc_16076 label buffalo milk]

OCLC FAST GeoNames AgroVoc

Second Set of Challenges

5 Reliability amp Efficiency eg server uptime server load

6 Accuracy eg select results based on usage data lexical match

custom weighting other

7 Order Ranking eg How to order a graph

Cache Server Query Process

(Diagram, built up over several slides. Components: JSP Query API, Lucene/SOLR Index, Jena-Fuseki Triplestore. One full setup per authority.)

1. Request to the JSP Query API: http://services.ld4l.org/ld4l_services/loc_name_batch.jsp?query=ezra%20cornell&maxRecords=10

2. Lucene search for "ezra cornell"; the index is built with the values of the predicates <skos:prefLabel> and <skos:altLabel>

3. For each result: extract search rank, extract URI

4. SPARQL query for the URI against the Jena-Fuseki triplestore

5. Insert search rank in predicate <http://vivoweb.org/ontology/core#rank>

6. Combine all results
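
The query process above can be sketched end to end with in-memory stand-ins. This is a toy illustration under stated assumptions, not the cache server's implementation: a Python dict replaces the Lucene index, a list of tuples replaces the Jena-Fuseki triplestore, the scoring is a simple term count rather than Lucene's ranking, and the `example.org` URIs and labels are placeholders.

```python
RANK = "http://vivoweb.org/ontology/core#rank"
PREF = "http://www.w3.org/2004/02/skos/core#prefLabel"
ALT = "http://www.w3.org/2004/02/skos/core#altLabel"

# Toy triplestore: (subject, predicate, object) tuples with placeholder URIs.
TRIPLES = [
    ("http://example.org/auth/1", PREF, "Cornell, Ezra, 1807-1874"),
    ("http://example.org/auth/1", ALT, "Ezra Cornell"),
    ("http://example.org/auth/2", PREF, "Cornell University"),
    ("http://example.org/auth/3", PREF, "Twain, Mark, 1835-1910"),
]


def build_index(triples):
    """Index only the label predicates, mapping each subject to its label tokens."""
    index = {}
    for s, p, o in triples:
        if p in (PREF, ALT):
            index.setdefault(s, set()).update(o.lower().replace(",", " ").split())
    return index


def lexical_search(index, query, max_records):
    """Return subject URIs ordered by how many query terms their labels contain."""
    terms = query.lower().split()
    scored = [(sum(t in tokens for t in terms), uri) for uri, tokens in index.items()]
    hits = sorted((item for item in scored if item[0] > 0), key=lambda x: (-x[0], x[1]))
    return [uri for _, uri in hits[:max_records]]


def describe(triples, uri):
    """Stand-in for the SPARQL query that fetches everything about one URI."""
    return [t for t in triples if t[0] == uri]


def query_cache(query, max_records):
    """Search, fetch each hit, attach its rank as a triple, and combine the graphs."""
    index = build_index(TRIPLES)  # a real server would build this ahead of time
    combined = []
    for rank, uri in enumerate(lexical_search(index, query, max_records), start=1):
        combined.extend(describe(TRIPLES, uri))
        combined.append((uri, RANK, rank))
    return combined
```

Storing the rank as a triple is what lets an unordered graph carry the search order back to the caller, which addresses challenge 7.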

UI-QA-Authority

(Diagram of the three layers.)

Hyrax/VitroLib - UI for selecting an entry from an authority

Request to QA: http://localhost:3000/qa/search/linked_data/oclc_fast/personal_name?q=twain&maximumRecords=2

Response from QA:
[{"uri": "http://id.worldcat.org/fast/31622", "id": "31622", "label": "Twain, Mark, 1835-1910"},
 {"uri": "http://id.worldcat.org/fast/365563", "id": "365563", "label": "Twain, Shania"}]

QA - normalize RDF returned from an authority

Request to authority: http://experimental.worldcat.org/fast/search?query=oclc.personalName+%22twain%22&sortKeys=usage&maximumRecords=2

Response from authority: RDF of search results

Authority access options:

• Active-Triples with LDF Cache (Marmotta or Blazegraph)
• Jena-Fuseki-Lucene Cache (search of cache performed via Lucene index)
• Direct Access of External Authority

Third Set of Challenges

8. Disambiguation through better context, e.g. expand from just prefLabel to prefLabel, altLabel, birth/death dates, occupation, etc.

9. Reconciliation across multiple sources, e.g. match LoC URI to OCLC FAST URI
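
Both challenges can be illustrated with a deliberately naive sketch. Everything here is an assumption for illustration: the entry fields `dates` and `occupation`, the helper names, and the `example.org` URI are hypothetical, and matching on normalized labels alone would produce false matches for common names in real data.

```python
import re
import unicodedata


def context_label(entry):
    """Challenge 8: extend a bare prefLabel with whatever context is present."""
    extras = [entry[k] for k in ("dates", "occupation") if entry.get(k)]
    return entry["prefLabel"] + (" (" + "; ".join(extras) + ")" if extras else "")


def match_key(label):
    """Reduce a label to sorted lowercase words, ignoring accents, dates, punctuation."""
    decomposed = unicodedata.normalize("NFKD", label)
    stripped = "".join(c for c in decomposed if not unicodedata.combining(c))
    return " ".join(sorted(re.findall(r"[a-z]+", stripped.casefold())))


def reconcile(entries_a, entries_b):
    """Challenge 9: pair URIs from two authorities whose labels share a match key."""
    by_key = {match_key(e["prefLabel"]): e["uri"] for e in entries_b}
    return [
        (e["uri"], by_key[match_key(e["prefLabel"])])
        for e in entries_a
        if match_key(e["prefLabel"]) in by_key
    ]


fast = [
    {"uri": "http://id.worldcat.org/fast/31622", "prefLabel": "Twain, Mark, 1835-1910"},
    {"uri": "http://id.worldcat.org/fast/365563", "prefLabel": "Twain, Shania"},
]
# Placeholder entry standing in for a second authority.
other = [{"uri": "http://example.org/other/mark-twain", "prefLabel": "Mark Twain"}]
pairs = reconcile(fast, other)
```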

What's next

Addressing Architectural Challenges

• Generalize process for accessing context on the cache server and in the normalization layer
• Multi-authority search and reconciliation
• Address the need for cache refresh
• Mirrored cache servers

User Experience and Design

• User-centered Design
• Observe, listen, learn, design, evaluate, iterate
• Iteratively design and evaluate UI for lookup/authorities with catalogers
• Search result ranking/ordering/filtering for catalogers
• Additional UI platforms, e.g. FOLIO

Questions

http://tinyurl.com/ld4l-auth-access

Appendix for Challenges 1-4

Challenge 1: Documentation

LoC: http://id.loc.gov/techcenter/
C. Harlow notes on reconciling LoC - https://github.com/cmh2166/lc-reconcile

OCLC FAST: https://www.oclc.org/developer/develop/web-services/fast-api/linked-data.en.html

GeoNames: http://www.geonames.org/export/geonames-search.html

AGROVOC: http://aims.fao.org/vest-registry/vocabularies/agrovoc-multilingual-agricultural-thesaurus
swagger config: https://github.com/NatLibFi/Skosmos/blob/master/swagger.json

NALT: https://agclass.nal.usda.gov/

DBpedia: http://wiki.dbpedia.org/OnlineAccess#1.2%20Public%20Faceted%20Web%20Service%20Interface

Challenge 2: Linked Data Access API

LoC
  for Search Query: not supported
  for Term Fetch: URI

OCLC FAST
  for Search Query: http://experimental.worldcat.org/fast/search?query={subauth}+all+%22{query}%22&sortKeys=usage&maximumRecords={maximumRecords}
  for Term Fetch: URI

GeoNames
  for Search Query: http://api.geonames.org/search?q={query}&maxRows={maxRows}&username={username}&type=rdf
  for Term Fetch: URI

AGROVOC
  for Search Query: http://artemide.art.uniroma2.it:8081/agrovoc/rest/v1/search?query={query}&lang={lang}
  for Term Fetch: http://artemide.art.uniroma2.it:8081/agrovoc/rest/v1/data?uri=http://aims.fao.org/aos/agrovoc/{term_id}

NALT
  for Search Query: http://skosmos.library.cornell.edu/rest/v1/nalt/search?query={query}&lang={lang}
  for Term Fetch: http://skosmos.library.cornell.edu/rest/v1/nalt/data?uri={term_uri}

DBpedia
  (left blank on the slide)

Challenge 3: Varying Results Formats

Authority | for Search Query | for Term Fetch
LoC | not supported | rdf-xml
OCLC FAST | rdf-xml | rdf-xml
GeoNames | rdf-xml | rdf-xml
AGROVOC | json-ld | rdf-xml, json-ld, turtle
NALT | json-ld | rdf-xml, json-ld, turtle
DBpedia | |
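
A client that talks to several authorities has to choose a serialization each one supports. The sketch below transcribes the term-fetch column of the table above and picks the first supported format from a preference list. The MIME types are the registered ones for each serialization; how a given authority expects the choice to be expressed (an Accept header, a format parameter, or a URL suffix) is not stated on the slide and is left out. `pick_format` and the dictionary keys are illustrative names.

```python
# Registered media types for the three serializations named in the table.
MIME_TYPES = {
    "rdf-xml": "application/rdf+xml",
    "json-ld": "application/ld+json",
    "turtle": "text/turtle",
}

# Term-fetch formats per authority, transcribed from the table.
TERM_FETCH_FORMATS = {
    "loc": ["rdf-xml"],
    "oclc_fast": ["rdf-xml"],
    "geonames": ["rdf-xml"],
    "agrovoc": ["rdf-xml", "json-ld", "turtle"],
    "nalt": ["rdf-xml", "json-ld", "turtle"],
}


def pick_format(authority, preferred=("json-ld", "turtle", "rdf-xml")):
    """Return the MIME type of the most preferred format the authority offers."""
    for fmt in preferred:
        if fmt in TERM_FETCH_FORMATS[authority]:
            return MIME_TYPES[fmt]
    raise ValueError(f"no supported format for {authority}")
```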

Challenge 4: Varying Ontologies

Authority | Primary Ontology | Flat vs Navigation required
LoC | madsrdf, SKOS | navigation required
OCLC FAST | schema.org, SKOS | flat
GeoNames | geonames | flat, hierarchical
AGROVOC | SKOS | flat, hierarchical
NALT | SKOS | flat, hierarchical
DBpedia | dbpedia | flat

Configurations for Questioning Authority

LoC: https://github.com/ld4l-labs/linked_data_authorities/tree/master/qa_loc/config/authorities/linked_data

OCLC FAST: https://github.com/ld4l-labs/linked_data_authorities/tree/master/qa_oclcfast/config/authorities/linked_data

GeoNames: https://github.com/ld4l-labs/linked_data_authorities/tree/master/qa_geonames/config/authorities/linked_data

AGROVOC: https://github.com/ld4l-labs/linked_data_authorities/tree/master/qa_agrovoc/config/authorities/linked_data

NALT: https://github.com/ld4l-labs/linked_data_authorities/tree/master/qa_nalt/config/authorities/linked_data

DBpedia: https://github.com/ld4l-labs/linked_data_authorities/tree/master/qa_dbpedia/config/authorities/linked_data

Appendix for Challenges 5-7

Creating a Cache Server

Hardware

• 8-core, 64 GB, 3 GHz Mac Pro (late 2013), macOS Sierra (10.12.6)
• 32 TB Pegasus-2 Thunderbolt RAID configured as RAID-5

Triplestore

• Apache Jena Fuseki 2.4.0 provides SPARQL endpoint
• Apache Tomcat 9.0 runs custom web application(s)
• Apache Lucene 3.6 provides search interface

Customizations

• custom per-data-source JSP web application provides search/browse/download functionality
• custom (generic) SPARQL Tag Library provides API for web apps (available at https://github.com/eichmann/lod-utilities)
• custom (generic) Lucene Tag Library provides API for web apps

Loading a New Vocabulary

• download RDF
• if necessary, convert to n-triples (required for GeoNames data, for instance)
• use tdbloader2 to populate the triplestore
• configure Fuseki server(s) with triplestore details
• create new JSP project in Eclipse
• write one or more indexer programs that populate Lucene indices, and run the indexer(s)
• write search/browse/download application logic using the SPARQL and Lucene tags
• package project as a WAR
• deploy to Apache Tomcat server(s)
• add new service to Apache HTTPD virtual host specification
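
The indexer step in the list above can be sketched as follows. The slides describe Java indexer programs writing Lucene indices; this is a Python stand-in that builds an in-memory inverted index instead, so it shows the idea rather than the project's code. The regular expression handles only simple N-Triples lines with literal objects and does not unescape characters inside literals; `index_labels` and the sample URIs are illustrative.

```python
import re

LABEL_PREDICATES = {
    "http://www.w3.org/2004/02/skos/core#prefLabel",
    "http://www.w3.org/2004/02/skos/core#altLabel",
}

# Matches: <subject> <predicate> "literal" with optional @lang or ^^<datatype>, then '.'
TRIPLE = re.compile(
    r'^<([^>]+)>\s+<([^>]+)>\s+"((?:[^"\\]|\\.)*)"(?:@[\w-]+|\^\^<[^>]+>)?\s*\.\s*$'
)


def index_labels(ntriples_lines):
    """Map each lowercase label token to the set of subject URIs that carry it."""
    index = {}
    for line in ntriples_lines:
        match = TRIPLE.match(line.strip())
        if not match:
            continue  # blank lines, comments, and triples whose object is not a literal
        subject, predicate, literal = match.groups()
        if predicate in LABEL_PREDICATES:
            for token in re.findall(r"\w+", literal.lower()):
                index.setdefault(token, set()).add(subject)
    return index


sample = [
    '<http://example.org/auth/1> <http://www.w3.org/2004/02/skos/core#prefLabel> "Cornell, Ezra"@en .',
    '<http://example.org/auth/1> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://www.w3.org/2004/02/skos/core#Concept> .',
    '<http://example.org/auth/2> <http://www.w3.org/2004/02/skos/core#altLabel> "Cornell University" .',
]
label_index = index_labels(sample)
```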

UI Access to Cache Server

http://services.ld4l.org/ld4l_services/loc_name.jsp

Downloads

LoC: http://id.loc.gov/download/ (n-triples OR rdf-xml)

OCLC FAST: http://www.oclc.org/research/themes/data-science/fast/download.html (n-triples)

GeoNames: http://www.geonames.org/ontology/documentation.html (custom format - see notes for processing)

AGROVOC: https://aims-fao.atlassian.net/wiki/spaces/AGV/pages/2949126/Releases (n-triples OR rdf-xml)

NALT: https://agclass.nal.usda.gov/download.shtml (rdf-xml)

DBpedia: http://wiki.dbpedia.org/downloads-2016-04

Potential Options for Reconciliation

• VIAF for name reconciliation - we are doing some work with this
• Wikidata - I've heard that they are working on reconciliation issues, but haven't yet explored in depth
  • Intro Video (3 hrs)
  • API Access
  • SPARQL - user manual
  • federated queries with other authorities

Doing a Google search for "linked data reconciliation" returns a large number of articles and presentations on this concept.

Links to Code & More

• QA Server - Code for a small app that provides the Questioning Authority normalization layer
• Linked Data Authorities - Configurations that can be used with QA Server
• LD4L Services - UI access to Cache Server
• VitroLib - Code for the VitroLib cataloging tool

Multi-Server Architecture

(Diagram of the three layers, built up over two slides.)

Hyrax/VitroLib - UI for selecting an entry from an authority

Request to QA: http://localhost:3000/qa/search/linked_data/oclc_fast/personal_name?q=twain&maximumRecords=2

Response from QA:
[{"uri": "http://id.worldcat.org/fast/31622", "id": "31622", "label": "Twain, Mark, 1835-1910"},
 {"uri": "http://id.worldcat.org/fast/365563", "id": "365563", "label": "Twain, Shania"}]

QA - normalize RDF returned from an authority

Request to authority: http://experimental.worldcat.org/fast/search?query=oclc.personalName+%22twain%22&sortKeys=usage&maximumRecords=2

Response from authority:

<http://id.worldcat.org/fast/31622>
  a schema:Person ;
  dcterms:identifier "31622" ;
  skos:prefLabel "Twain, Mark, 1835-1910" ;
  skos:altLabel "Make Teviin, 1835-1910", "Make Tuwen, 1835-1910" .

<http://id.worldcat.org/fast/365563>
  a schema:Person ;
  dcterms:identifier "365563" ;
  skos:prefLabel "Twain, Shania" ;
  skos:altLabel "Twain, Eilleen", "Edwards, Eilleen" .

Direct Access of External Authority
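
The normalization step in the middle layer can be sketched as a reduction from triples to uri/id/label records. This is a simplified illustration, not Questioning Authority's code: the triples are plain Python tuples rather than a parsed RDF graph, the predicate choices match the OCLC FAST example on the slide, and the fallback of using the URI as the id when no identifier is present is an assumption based on how the GeoNames and AGROVOC results appear in the normalized output.

```python
PREF = "http://www.w3.org/2004/02/skos/core#prefLabel"
ALT = "http://www.w3.org/2004/02/skos/core#altLabel"
IDENT = "http://purl.org/dc/terms/identifier"

# The first FAST record from the slide, written as (subject, predicate, object).
FAST_TRIPLES = [
    ("http://id.worldcat.org/fast/31622", IDENT, "31622"),
    ("http://id.worldcat.org/fast/31622", PREF, "Twain, Mark, 1835-1910"),
    ("http://id.worldcat.org/fast/31622", ALT, "Make Teviin, 1835-1910"),
]


def normalize(triples):
    """Collapse each subject to a uri/id/label record, ignoring other predicates."""
    records = {}
    for subject, predicate, obj in triples:
        record = records.setdefault(subject, {"uri": subject})
        if predicate == IDENT:
            record["id"] = obj
        elif predicate == PREF:
            record["label"] = obj
    results = []
    for record in records.values():
        if "label" in record:  # skip subjects that carry no preferred label
            record.setdefault("id", record["uri"])  # assumed fallback, see lead-in
            results.append(record)
    return results
```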

Page 27: Linking the Data: Building effective Authority and …Linking the Data: Building Effective Authority and Identity Lookup Huda Khan and E. Lynette Rayle Cornell University Collaborators:

Direct Access Query API

Direct against authorityhellip

httpexperimentalworldcatorgfastsearch

query=oclcpersonalName+22twain22

ampmaximumRecords=2

httpapigeonamesorgsearch

q=ithaca

ampmaxRows=2

ampusername=demo

amptype=rdf

httpartemideartuniroma2it8081agrovocrestv1search

query=milk

amplang=en

ampmaxhits=2

Normalized Query API

Through QA normalization layerhellip

httplocalhost3000qasearchlinked_dataoclc_fast

q=twain

ampmaxRecords=2

httplocalhost3000qasearchlinked_datageonames

q=ithaca

ampmaxRecords=2

httplocalhost3000qasearchlinked_dataagrovoc

q=milk

ampmaxRecords=2

amplang=en

Normalized Results

[urihttpidworldcatorgfast31622 id31622 labelTwain Mark 1835-1910 urihttpidworldcatorgfast365563 id365563 labelTwain Shania]

[uri httpswsgeonamesorg2162552 id httpswsgeonamesorg2162552 label Ithaca (AU) uri httpswsgeonamesorg4515289 id httpswsgeonamesorg4515289 label Ithaca (US)]

[uri httpaimsfaoorgaosagrovocc_8602 id httpaimsfaoorgaosagrovocc_8602 label acidophilus milk uri httpaimsfaoorgaosagrovocc_16076 id httpaimsfaoorgaosagrovocc_16076 label buffalo milk]

OCLC FAST GeoNames AgroVoc

Second Set of Challenges

5 Reliability amp Efficiency eg server uptime server load

6 Accuracy eg select results based on usage data lexical match

custom weighting other

7 Order Ranking eg How to order a graph

Cache Server Query Process

One full setup per authority: JSP Query API, Jena-Fuseki Triplestore, Lucene/SOLR Index

Request to the JSP Query API:
http://services.ld4l.org/ld4l_services/loc_name_batch.jsp?query=ezra%20cornell&maxRecords=10

1. Lucene search for "ezra cornell" (index built with predicate values <skos:prefLabel> and <skos:altLabel>)
2. For each result: extract search rank, extract URI
3. SPARQL query for URI (Jena-Fuseki Triplestore)
4. Insert search rank in predicate <http://vivoweb.org/ontology/core#rank>
5. Combine all results
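The query process can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the cache server's code: the real service uses JSP, Lucene, and SPARQL against Jena-Fuseki, while the sketch replaces both the index and the triplestore with in-memory dictionaries. The `example.org` URIs and the function names are made up; only the two SKOS label predicates and the VIVO `rank` predicate come from the slide.

```python
import re

PREF = "http://www.w3.org/2004/02/skos/core#prefLabel"
ALT = "http://www.w3.org/2004/02/skos/core#altLabel"
RANK = "http://vivoweb.org/ontology/core#rank"

# Stand-in for the triplestore: URI -> list of (predicate, object) pairs.
# The URIs are placeholders, not real authority identifiers.
TRIPLES = {
    "http://example.org/auth/1": [(PREF, "Cornell, Ezra")],
    "http://example.org/auth/2": [(PREF, "Cornell University")],
    "http://example.org/auth/3": [(PREF, "Pound, Ezra")],
}


def tokens(text):
    """Lowercase word tokens of a label or query."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def lexical_search(query, max_records):
    """Stand-in for the Lucene search: score = number of shared tokens."""
    wanted = tokens(query)
    scored = []
    for uri, pairs in TRIPLES.items():
        labels = [obj for pred, obj in pairs if pred in (PREF, ALT)]
        score = max((len(wanted & tokens(label)) for label in labels), default=0)
        if score:
            scored.append((score, uri))
    scored.sort(key=lambda item: (-item[0], item[1]))  # best score first
    return [uri for _, uri in scored[:max_records]]


def describe(uri):
    """Stand-in for the per-URI SPARQL query; returns a copy of the triples."""
    return list(TRIPLES[uri])


def cache_query(query, max_records=10):
    """Search, fetch each hit, attach its search rank, combine the results."""
    combined = []
    for rank, uri in enumerate(lexical_search(query, max_records), start=1):
        graph = describe(uri)
        graph.append((RANK, rank))  # the rank travels with the result graph
        combined.append((uri, graph))
    return combined


for uri, graph in cache_query("ezra cornell"):
    print(uri, graph)
```

Storing the rank as a triple is what lets a graph, which has no inherent order, be returned to the caller in search order.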

UI-QA-Authority

Hyrax/VitroLib – UI for selecting an entry from an authority
  Request to QA: http://localhost:3000/qa/search/linked_data/oclc_fast/personal_name?q=twain&maximumRecords=2
  Response from QA: [{"uri":"http://id.worldcat.org/fast/31622","id":"31622","label":"Twain, Mark, 1835-1910"}, {"uri":"http://id.worldcat.org/fast/365563","id":"365563","label":"Twain, Shania"}]

QA – normalize RDF returned from an authority
  Request to authority: http://experimental.worldcat.org/fast/search?query=oclc.personalName+%22twain%22&sortKeys=usage&maximumRecords=2
  Response from authority: RDF of search results

Authority access options (diagram labels)
  Active-Triples
  LDF Cache (Marmotta or Blazegraph)
  Jena-Fuseki-Lucene Cache (search of cache performed via Lucene index)
  Direct Access of External Authority

Third Set of Challenges

8. Disambiguation through better context, e.g., expand from just prefLabel to… prefLabel, altLabel, birth/death dates, occupation, etc.
9. Reconciliation across multiple sources, e.g., match LoC URI to OCLC FAST URI
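To make the reconciliation challenge concrete, here is a deliberately naive sketch that pairs records from two sources when their normalized labels are equal. It is not the approach used in the project; it only shows why label matching alone is fragile and why the extra context from challenge 8 matters. The FAST records come from the earlier slide; the second source, its `example.org` URIs, and the function names are hypothetical.

```python
import unicodedata


def label_key(label):
    """Normalize a label for comparison: strip accents, case, punctuation."""
    text = unicodedata.normalize("NFKD", label)
    text = "".join(c for c in text if not unicodedata.combining(c))
    kept = "".join(c if c.isalnum() else " " for c in text.lower())
    return " ".join(kept.split())


def reconcile(source_a, source_b):
    """Pair URIs from two authorities whose normalized labels are equal."""
    by_key = {}
    for rec in source_b:
        by_key.setdefault(label_key(rec["label"]), []).append(rec["uri"])
    matches = []
    for rec in source_a:
        for uri_b in by_key.get(label_key(rec["label"]), []):
            matches.append((rec["uri"], uri_b))
    return matches


fast = [
    {"uri": "http://id.worldcat.org/fast/31622", "label": "Twain, Mark, 1835-1910"},
    {"uri": "http://id.worldcat.org/fast/365563", "label": "Twain, Shania"},
]
# Hypothetical second source; these URIs are placeholders, not LoC identifiers.
other = [
    {"uri": "http://example.org/names/a", "label": "TWAIN, MARK 1835-1910."},
    {"uri": "http://example.org/names/b", "label": "Twain, Mary"},
]

print(reconcile(fast, other))
```

A label that omits the dates, or spells the name differently, would not match here, which is the gap that additional context (altLabel, birth/death dates, occupation) is meant to close.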

What's next

Addressing Architectural Challenges

• Generalize process for accessing context on the cache server and in the normalization layer
• Multi-authority search and reconciliation
• Address the need for cache refresh
• Mirrored cache servers

User Experience and Design

• User-centered Design
• Observe, listen, learn, design, evaluate, iterate
• Iteratively design and evaluate UI for lookup/authorities with catalogers
• Search result ranking/ordering/filtering for catalogers
• Additional UI platforms, e.g., FOLIO

Questions

http://tinyurl.com/ld4l-auth-access

Appendix for Challenges 1-4

Challenge 1: Documentation

LoC: http://id.loc.gov/techcenter/
  C. Harlow notes on reconciling LoC: https://github.com/cmh2166/lc-reconcile
OCLC FAST: https://www.oclc.org/developer/develop/web-services/fast-api/linked-data.en.html
GeoNames: http://www.geonames.org/export/geonames-search.html
AGROVOC: http://aims.fao.org/vest-registry/vocabularies/agrovoc-multilingual-agricultural-thesaurus
  swagger config: https://github.com/NatLibFi/Skosmos/blob/master/swagger.json
NALT: https://agclass.nal.usda.gov/
DBpedia: http://wiki.dbpedia.org/OnlineAccess#1.2%20Public%20Faceted%20Web%20Service%20Interface

Challenge 2: Linked Data Access API

LoC
  for Search Query: not supported
  for Term Fetch: URI

OCLC FAST
  for Search Query: http://experimental.worldcat.org/fast/search?query={subauth}+all+%22{query}%22&sortKeys=usage&maximumRecords={maximumRecords}
  for Term Fetch: URI

GeoNames
  for Search Query: http://api.geonames.org/search?q={query}&maxRows={maxRows}&username={username}&type=rdf
  for Term Fetch: URI

AGROVOC
  for Search Query: http://artemide.art.uniroma2.it:8081/agrovoc/rest/v1/search?query={query}&lang={lang}
  for Term Fetch: http://artemide.art.uniroma2.it:8081/agrovoc/rest/v1/data?uri=http://aims.fao.org/aos/agrovoc/{term_id}

NALT
  for Search Query: http://skosmos.library.cornell.edu/rest/v1/nalt/search?query={query}&lang={lang}
  for Term Fetch: http://skosmos.library.cornell.edu/rest/v1/nalt/data?uri={term_uri}

DBpedia
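The search URLs differ per authority in both path and parameter names. A minimal sketch of how such differences can be captured as templates, assuming Python's `str.format` placeholders; the dictionary name, the function name, and the `max_records` placeholder name are illustrative and do not reproduce the Questioning Authority configuration format.

```python
from urllib.parse import quote

# One search template per authority, following the URL patterns on the slide.
# Placeholder names are this sketch's own, not the QA configuration syntax.
SEARCH_TEMPLATES = {
    "oclc_fast": (
        "http://experimental.worldcat.org/fast/search"
        "?query={subauth}+all+%22{query}%22"
        "&sortKeys=usage&maximumRecords={max_records}"
    ),
    "geonames": (
        "http://api.geonames.org/search"
        "?q={query}&maxRows={max_records}&username={username}&type=rdf"
    ),
    "agrovoc": (
        "http://artemide.art.uniroma2.it:8081/agrovoc/rest/v1/search"
        "?query={query}&lang={lang}"
    ),
}


def search_url(authority, query, **params):
    """Fill an authority's template, percent-encoding every value."""
    values = {name: quote(str(value), safe="") for name, value in params.items()}
    values["query"] = quote(query, safe="")
    return SEARCH_TEMPLATES[authority].format(**values)


print(search_url("geonames", "ithaca", max_records=2, username="demo"))
print(search_url("agrovoc", "milk", lang="en"))
print(search_url("oclc_fast", "twain", subauth="oclc.personalName", max_records=2))
```

A missing parameter raises `KeyError`, which surfaces a misconfigured authority early instead of sending a malformed request.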

Challenge 3: Varying Results Formats

Authority    for Search Query    for Term Fetch
LoC          not supported       rdf-xml
OCLC FAST    rdf-xml             rdf-xml
GeoNames     rdf-xml             rdf-xml
AGROVOC      json-ld             rdf-xml, json-ld, turtle
NALT         json-ld             rdf-xml, json-ld, turtle
DBpedia

Challenge 4: Varying Ontologies

Authority    Primary Ontology     Flat vs Navigation required
LoC          madsrdf, SKOS        navigation required
OCLC FAST    schema.org, SKOS     flat
GeoNames     geonames             flat, hierarchical
AGROVOC      SKOS                 flat, hierarchical
NALT         SKOS                 flat, hierarchical
DBpedia      dbpedia              flat
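Because each authority describes its terms with a different ontology, a normalization layer needs to know which predicate carries the label for each one. The sketch below illustrates that idea with plain tuples. The predicate URIs are the standard MADS/RDF, SKOS, schema.org, and GeoNames ones, but the mapping of predicates to authorities is an assumption made for illustration; the settings actually used are in the Questioning Authority configurations. The sketch also sets `id` to the full URI, whereas the FAST results shown earlier use a short id.

```python
SKOS_PREF = "http://www.w3.org/2004/02/skos/core#prefLabel"
MADS_LABEL = "http://www.loc.gov/mads/rdf/v1#authoritativeLabel"
SCHEMA_NAME = "http://schema.org/name"
GN_NAME = "http://www.geonames.org/ontology#name"

# Assumed mapping, in order of preference; for illustration only.
LABEL_PREDICATES = {
    "loc": [MADS_LABEL, SKOS_PREF],
    "oclc_fast": [SKOS_PREF, SCHEMA_NAME],
    "geonames": [GN_NAME],
    "agrovoc": [SKOS_PREF],
}


def normalize(authority, triples):
    """Reduce (subject, predicate, object) triples to uri/id/label records."""
    wanted = LABEL_PREDICATES[authority]
    best = {}  # uri -> (priority, label); dict order preserves first-seen order
    for subject, predicate, obj in triples:
        if predicate in wanted:
            priority = wanted.index(predicate)
            if subject not in best or priority < best[subject][0]:
                best[subject] = (priority, obj)
    return [
        {"uri": uri, "id": uri, "label": label}
        for uri, (_, label) in best.items()
    ]


agrovoc_triples = [
    ("http://aims.fao.org/aos/agrovoc/c_8602", SKOS_PREF, "acidophilus milk"),
    ("http://aims.fao.org/aos/agrovoc/c_16076", SKOS_PREF, "buffalo milk"),
]
print(normalize("agrovoc", agrovoc_triples))
```

Authorities marked "navigation required" need more than this: the label may sit on a linked node rather than on the term itself, so the triples have to be followed before a label can be picked.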

Configurations for Questioning Authority

LoC: https://github.com/ld4l-labs/linked_data_authorities/tree/master/qa_loc/config/authorities/linked_data
OCLC FAST: https://github.com/ld4l-labs/linked_data_authorities/tree/master/qa_oclcfast/config/authorities/linked_data
GeoNames: https://github.com/ld4l-labs/linked_data_authorities/tree/master/qa_geonames/config/authorities/linked_data
AGROVOC: https://github.com/ld4l-labs/linked_data_authorities/tree/master/qa_agrovoc/config/authorities/linked_data
NALT: https://github.com/ld4l-labs/linked_data_authorities/tree/master/qa_nalt/config/authorities/linked_data
DBpedia: https://github.com/ld4l-labs/linked_data_authorities/tree/master/qa_dbpedia/config/authorities/linked_data

Appendix for Challenges 5-7

Creating a Cache Server

Hardware

• 8-core, 64 GB, 3 GHz Mac Pro (late 2013), macOS Sierra (10.12.6)
• 32 TB Pegasus-2 Thunderbolt RAID, configured as RAID-5

Triplestore

• Apache Jena Fuseki 2.4.0 provides SPARQL endpoint
• Apache Tomcat 9.0 runs custom web application(s)
• Apache Lucene 3.6 provides search interface

Customizations

• custom per-data-source JSP web application provides search/browse/download functionality
• custom (generic) SPARQL Tag Library provides API for web apps (available at https://github.com/eichmann/lod-utilities)
• custom (generic) Lucene Tag Library provides API for web apps

Loading a New Vocabulary

• download RDF
• if necessary, convert to n-triples (required for GeoNames data, for instance)
• use tdbloader2 to populate triplestore
• configure Fuseki server(s) with triplestore details
• create new JSP project in Eclipse
• write one or more indexer programs that populate Lucene indices and run indexer(s)
• write search/browse/download application logic using the SPARQL and Lucene tags
• package project as WAR
• deploy to Apache Tomcat server(s)
• add new service to Apache HTTPD virtual host specification
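The indexer step can be illustrated with a small pure-Python stand-in. The real indexers populate Lucene indices; this sketch only shows the idea of reading n-triples and indexing the values of the label predicates named on the cache server slides (skos:prefLabel and skos:altLabel). The regular expression handles only simple literal-valued triples, and the function name and sample lines are assumptions made for illustration.

```python
import re
from collections import defaultdict

LABEL_PREDICATES = {
    "http://www.w3.org/2004/02/skos/core#prefLabel",
    "http://www.w3.org/2004/02/skos/core#altLabel",
}

# Matches only simple triples of the form: <s> <p> "literal"[@lang|^^<type>] .
TRIPLE = re.compile(r'^<([^>]+)>\s+<([^>]+)>\s+"((?:[^"\\]|\\.)*)"\S*\s*\.\s*$')


def build_index(lines):
    """Map each lowercase label token to the set of subject URIs using it."""
    index = defaultdict(set)
    for line in lines:
        match = TRIPLE.match(line.strip())
        if not match:
            continue  # URI-valued objects, blank nodes, comments, blank lines
        subject, predicate, literal = match.groups()
        if predicate in LABEL_PREDICATES:
            for token in re.findall(r"\w+", literal.lower()):
                index[token].add(subject)
    return index


sample = [
    '<http://aims.fao.org/aos/agrovoc/c_8602> '
    '<http://www.w3.org/2004/02/skos/core#prefLabel> "acidophilus milk"@en .',
    '<http://aims.fao.org/aos/agrovoc/c_16076> '
    '<http://www.w3.org/2004/02/skos/core#prefLabel> "buffalo milk"@en .',
    '<http://aims.fao.org/aos/agrovoc/c_8602> '
    '<http://www.w3.org/1999/02/22-rdf-syntax-ns#type> '
    '<http://www.w3.org/2004/02/skos/core#Concept> .',
]

index = build_index(sample)
print(sorted(index["milk"]))
```

Converting every source to n-triples first, as the list recommends, is what makes a line-at-a-time indexer like this feasible: each line is one complete statement.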

UI Access to Cache Server

http://services.ld4l.org/ld4l_services/loc_name.jsp

Downloads

LoC: http://id.loc.gov/download/ (n-triples OR rdf-xml)
OCLC FAST: http://www.oclc.org/research/themes/data-science/fast/download.html (n-triples)
GeoNames: http://www.geonames.org/ontology/documentation.html (custom format – see notes for processing)
AGROVOC: https://aims-fao.atlassian.net/wiki/spaces/AGV/pages/2949126/Releases (n-triples OR rdf-xml)
NALT: https://agclass.nal.usda.gov/download.shtml (rdf-xml)
DBpedia: http://wiki.dbpedia.org/downloads-2016-04

Potential Options for Reconciliation

• VIAF for name reconciliation – we are doing some work with this
• Wikidata – I've heard that they are working on reconciliation issues, but haven't yet explored in depth
  • Intro Video (3 hrs)
  • API Access
  • SPARQL – user manual
  • federated queries with other authorities

A Google search for "linked data reconciliation" returns a large number of articles and presentations on this concept.

Links to Code & More

• QA Server - Code for a small app that provides the Questioning Authority normalization layer
• Linked Data Authorities - Configurations that can be used with QA Server
• LD4L Services - UI access to Cache Server
• VitroLib - Code for the VitroLib cataloging tool



Page 30: Linking the Data: Building effective Authority and …Linking the Data: Building Effective Authority and Identity Lookup Huda Khan and E. Lynette Rayle Cornell University Collaborators:

Second Set of Challenges

5 Reliability amp Efficiency eg server uptime server load

6 Accuracy eg select results based on usage data lexical match

custom weighting other

7 Order Ranking eg How to order a graph

Cache Server Query Process

JSP Query API

Jena-Fuseki

Triplestore

One full setup per authority

LuceneSOLR

Index

Cache Server Query Process

JSP Query API

httpservicesld4lorgld4l_servicesloc_name_batchjspquery=ezra20cornellampmaxRecords=10

Jena-Fuseki

Triplestore

LuceneSOLR

Index

One full setup per authority

Cache Server Query Process

JSP Query API

lucene search for ezra cornell

index built with predicate values

ltskosprefLabelgt

ltskosaltLabelgt

Jena-Fuseki

Triplestore

LuceneSOLR

Index

httpservicesld4lorgld4l_servicesloc_name_batchjspquery=ezra20cornellampmaxRecords=10

One full setup per authority

Cache Server Query Process

JSP Query API

extract search rank

extract URI

Jena-Fuseki

Triplestore

for each result

LuceneSOLR

Index

lucene search for ezra cornell

index built with predicate values

ltskosprefLabelgt

ltskosaltLabelgt

httpservicesld4lorgld4l_servicesloc_name_batchjspquery=ezra20cornellampmaxRecords=10

One full setup per authority

Cache Server Query Process

JSP Query API

sparql query for URI

Jena-Fuseki

Triplestore

LuceneSOLR

Index

extract search rank

extract URI

for each result

lucene search for ezra cornell

index built with predicate values

ltskosprefLabelgt

ltskosaltLabelgt

httpservicesld4lorgld4l_servicesloc_name_batchjspquery=ezra20cornellampmaxRecords=10

One full setup per authority

Cache Server Query Process

JSP Query API

combine all results

Jena-Fuseki

Triplestore

insert search rank in predicate

lthttpvivoweborgontology

corerankgt

LuceneSOLR

Index

sparql query for URI extract search rank

extract URI

for each result

lucene search for ezra cornell

index built with predicate values

ltskosprefLabelgt

ltskosaltLabelgt

httpservicesld4lorgld4l_servicesloc_name_batchjspquery=ezra20cornellampmaxRecords=10

One full setup per authority

UI-QA-Authority

QA ndash normalize RDF returned from an authority

httplocalhost3000qasearchlinked_data

oclc_fastpersonal_nameq=twainampmaximumRecords=2

[urihttpidworldcatorgfast31622id31622

labelTwain Mark 1835-1910

urihttpidworldcatorgfast365563id365563

labelTwain Shania

httpexperimentalworldcatorgfastsearchquery=o

clcpersonalName+22twain22

ampsortKeys=usageampmaximumRecords=2

RDF of

search

results

Active-Triples

LDF Cache

(Marmotta or

Blazegraph) LDF Cache Jena-Fuseki-

Lucene

Cache

Direct Access

of External

Authority

HyraxVitrolib ndash UI for selecting an entry from an authority

search of cache performed via Lucene index

Third Set of Challenges

8 Disambiguation through better context eg expand from just prefLabel tohellip

preLabel altLabel birthdeath dates occupation etc

9 Reconciliation across multiple sources eg match LoC URI to OCLC FAST URI

53

Whatrsquos next

Addressing Architectural Challenges

bull Generalize process for accessing context on the

cache server and in the normalization layer

bull Multi-authority search and reconciliation

bull Address the need for cache refresh

bull Mirrored cache servers

User Experience and Design

bull User-centered Design

bull Observe listen learn design evaluate iterate

bull Iteratively design and evaluate UI for lookupauthorities

with catalogers

bull Search result rankingorderingfiltering for catalogers

bull Additional UI platforms eg FOLIO

56

Questions

httptinyurlcomld4l-auth-access

Appendix for Challenges 1-4

Challenge 1 Documentation

58

LoC httpidlocgovtechcenter

C Harlow notes on reconciling LoC - httpsgithubcomcmh2166lc-reconcile

OCLC FAST

httpswwwoclcorgdeveloperdevelopweb-servicesfast-apilinked-dataenhtml

GeoNames

httpwwwgeonamesorgexportgeonames-searchhtml

AGROVOC httpaimsfaoorgvest-registryvocabulariesagrovoc-multilingual-agricultural-thesaurus

swagger config httpsgithubcomNatLibFiSkosmosblobmasterswaggerjson

NALT

httpsagclassnalusdagov

DBpedia httpwikidbpediaorgOnlineAccess1220Public20Faceted20Web20Service20Inter

face

Challenge 2 Linked Data Access API

59

for Search Query for Term Fetch

LoC not supported URI

OCLC FAST httpexperimentalworldcatorgfastsearchq

uery=subauth+all+22query22ampsortK

eys=usageampmaximumRecords=maximumR

ecords

URI

GeoNames httpapigeonamesorgsearchq=queryamp

maxRows=maxRowsampusername=userna

meamptype=rdf

URI

AGROVOC httpartemideartuniroma2it8081agrovocr

estv1searchquery=queryamplang=lang

httpartemideartuniroma2it8081agrovo

crestv1datauri=httpaimsfaoorgaosa

grovocterm_id

NALT httpskosmoslibrarycornelledurestv1nalt

searchquery=queryamplang=lang

httpskosmoslibrarycornelledurestv1na

ltdatauri=term_uri

DBpedia

Challenge 3 Varying Results Formats

60

for Search Query for Term Fetch

LoC not supported rdf-xml

OCLC FAST rdf-xml rdf-xml

GeoNames rdf-xml rdf-xml

AGROVOC json-ld rdf-xml json-ld turtle

NALT json-ld rdf-xml json-ld turtle

DBpedia

Challenge 4 Varying Ontologies

61

Primary Ontology Flat vs Navigation

required

LoC madsrdf

SKOS

navigation required

OCLC FAST schemaorg

SKOS

flat

GeoNames geonames flat

hierarchical

AGROVOC SKOS flat

hierarchical

NALT SKOS flat

hierarchical

DBpedia dbpedia flat

Configurations for Questioning Authority

LoC https://github.com/ld4l-labs/linked_data_authorities/tree/master/qa_loc/config/authorities/linked_data

OCLC FAST https://github.com/ld4l-labs/linked_data_authorities/tree/master/qa_oclcfast/config/authorities/linked_data

GeoNames https://github.com/ld4l-labs/linked_data_authorities/tree/master/qa_geonames/config/authorities/linked_data

AGROVOC https://github.com/ld4l-labs/linked_data_authorities/tree/master/qa_agrovoc/config/authorities/linked_data

NALT https://github.com/ld4l-labs/linked_data_authorities/tree/master/qa_nalt/config/authorities/linked_data

DBpedia https://github.com/ld4l-labs/linked_data_authorities/tree/master/qa_dbpedia/config/authorities/linked_data

Appendix for Challenges 5-7

Creating a Cache Server

Hardware
• 8-core, 64 GB, 3 GHz Mac Pro (late 2013), macOS Sierra (10.12.6)
• 32 TB Pegasus-2 Thunderbolt RAID configured as RAID-5

Triplestore
• Apache Jena Fuseki 2.4.0 provides SPARQL endpoint
• Apache Tomcat 9.0 runs custom web application(s)
• Apache Lucene 3.6 provides search interface

Customizations
• custom per-data-source JSP web application provides search/browse/download functionality
• custom (generic) SPARQL Tag Library provides API for web apps (available at https://github.com/eichmann/lod-utilities)
• custom (generic) Lucene Tag Library provides API for web apps

Loading a New Vocabulary

• download RDF
• if necessary, convert to n-triples (required for GeoNames data, for instance)
• use tdbloader2 to populate the triplestore
• configure Fuseki server(s) with triplestore details
• create new JSP project in Eclipse
• write one or more indexer programs that populate Lucene indices, and run the indexer(s)
• write search/browse/download application logic using the SPARQL and Lucene tags
• package project as war
• deploy to Apache Tomcat server(s)
• add new service to Apache HTTPD virtual host specification
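The two triplestore steps in the list above map onto Apache Jena command-line tools. The Python sketch below only assembles the command lines and does not run them. The `load_plan` helper, the dataset name and the directory layout are hypothetical; `tdbloader2 --loc` and `fuseki-server --loc=` are existing Jena options, though the exact invocations used for the cache server are not given in the slides.

```python
import shlex


def load_plan(vocab, ntriples_file, tdb_dir):
    """Return the shell commands for the triplestore steps, without running them."""
    commands = [
        # Bulk-load the N-Triples file into a TDB database directory.
        ["tdbloader2", "--loc", tdb_dir, ntriples_file],
        # Serve that database as a SPARQL endpoint at /<vocab>
        # (read-only, since --update is not passed).
        ["fuseki-server", f"--loc={tdb_dir}", f"/{vocab}"],
    ]
    return [shlex.join(argv) for argv in commands]
```

Printing the result of `load_plan("loc_names", "names.nt", "/data/tdb/loc_names")` gives a reviewable script for one vocabulary; the remaining steps (indexer, JSP application, Tomcat and HTTPD configuration) are specific to the cache server and are not sketched here.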

UI Access to Cache Server

http://services.ld4l.org/ld4l_services/loc_name.jsp

Downloads

LoC http://id.loc.gov/download/ (n-triples OR rdf-xml)
OCLC FAST http://www.oclc.org/research/themes/data-science/fast/download.html (n-triples)
GeoNames http://www.geonames.org/ontology/documentation.html (custom format – see notes for processing)
AGROVOC https://aims-fao.atlassian.net/wiki/spaces/AGV/pages/2949126/Releases (n-triples OR rdf-xml)
NALT https://agclass.nal.usda.gov/download.shtml (rdf-xml)
DBpedia http://wiki.dbpedia.org/downloads-2016-04

Potential Options for Reconciliation

• VIAF for name reconciliation – we are doing some work with this
• Wikidata – I've heard that they are working on reconciliation issues, but haven't yet explored in depth
  • Intro Video (3hrs)
  • API Access
  • SPARQL – user manual
  • federated queries with other authorities

Doing a Google search for "linked data reconciliation" returns a large number of articles and presentations on this concept.

Links to Code & More

• QA Server - Code for a small app that provides the Questioning Authority normalization layer
• Linked Data Authorities - Configurations that can be used with QA Server
• LD4L Services - UI access to Cache Server
• VitroLib - Code for the VitroLib cataloging tool


Cache Server Query Process

Components (one full setup per authority):
• JSP Query API
• Lucene/SOLR Index
• Jena-Fuseki Triplestore

Example request:
http://services.ld4l.org/ld4l_services/loc_name_batch.jsp?query=ezra%20cornell&maxRecords=10

Steps:
1. The JSP Query API receives the request
2. Lucene search for "ezra cornell"; the index is built with the values of the predicates <skos:prefLabel> and <skos:altLabel>
3. For each result:
   • extract search rank
   • extract URI
   • SPARQL query for URI against the Jena-Fuseki triplestore
   • insert search rank in predicate <http://vivoweb.org/ontology/core#rank>
4. Combine all results
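The query process can be illustrated end to end with in-memory stand-ins. This Python sketch is not the JSP implementation: a dict of words replaces the Lucene index, a list of tuples replaces the Jena-Fuseki triplestore, the scoring is a toy word count, and the function names (`build_index`, `search`, `query_cache`) are hypothetical. Only the overall flow and the rank predicate come from the slides.

```python
import re

RANK = "http://vivoweb.org/ontology/core#rank"
PREF = "http://www.w3.org/2004/02/skos/core#prefLabel"
ALT = "http://www.w3.org/2004/02/skos/core#altLabel"


def words(text):
    """Lowercase a string and split it into alphanumeric words."""
    return re.findall(r"\w+", text.lower())


def build_index(triples):
    """Index each subject URI under the words of its prefLabel/altLabel values."""
    index = {}
    for subject, predicate, obj in triples:
        if predicate in (PREF, ALT):
            for word in words(obj):
                index.setdefault(word, set()).add(subject)
    return index


def search(index, query, max_records):
    """Return URIs ordered by how many query words they match (a toy score)."""
    scores = {}
    for word in words(query):
        for uri in index.get(word, ()):
            scores[uri] = scores.get(uri, 0) + 1
    ranked = sorted(scores, key=lambda uri: (-scores[uri], uri))
    return ranked[:max_records]


def query_cache(triples, index, query, max_records=10):
    """Search, fetch each hit's triples, attach its rank, combine everything."""
    combined = []
    for rank, uri in enumerate(search(index, query, max_records), start=1):
        # Stand-in for the SPARQL query that fetches the triples for one URI.
        combined.extend(t for t in triples if t[0] == uri)
        # Carry the search rank along as an extra triple.
        combined.append((uri, RANK, str(rank)))
    return combined
```

Attaching the rank as a triple keeps the response pure RDF, so a downstream normalization layer can sort results without needing a separate ranking channel.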

UI-QA-Authority

Hyrax/VitroLib – UI for selecting an entry from an authority

QA – normalize RDF returned from an authority

Authority access, one of:
• Active-Triples LDF Cache (Marmotta or Blazegraph)
• LDF Cache
• Jena-Fuseki-Lucene Cache (search of cache performed via Lucene index)
• Direct Access of External Authority

Example:

Request from the UI to QA:
http://localhost:3000/qa/search/linked_data/oclc_fast/personal_name?q=twain&maximumRecords=2

Request from QA to the authority, which returns RDF of search results:
http://experimental.worldcat.org/fast/search?query=oclc.personalName+%22twain%22&sortKeys=usage&maximumRecords=2

Normalized response from QA to the UI:
[{"uri":"http://id.worldcat.org/fast/31622","id":"31622","label":"Twain, Mark, 1835-1910"},
 {"uri":"http://id.worldcat.org/fast/365563","id":"365563","label":"Twain, Shania"}]
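The last step of normalization, shaping labelled URIs into the uri/id/label records shown on this slide, can be sketched in a few lines of Python. This is an illustration rather than Questioning Authority's code: the `normalize` helper is hypothetical, and deriving `id` from the last path segment of the URI is an assumption that happens to fit the FAST example.

```python
import json


def normalize(labels_by_uri, max_records):
    """Shape {uri: label} pairs into QA-style uri/id/label records."""
    records = []
    for uri, label in list(labels_by_uri.items())[:max_records]:
        records.append({
            "uri": uri,
            # Assumes the identifier is the last path segment of the URI.
            "id": uri.rstrip("/").rsplit("/", 1)[-1],
            "label": label,
        })
    return json.dumps(records)
```

Whatever RDF serialization or ontology the authority returns, the UI only ever sees this one JSON shape, which is the point of placing QA between the UI and the authority.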

Third Set of Challenges

8. Disambiguation through better context, e.g. expand from just prefLabel to… prefLabel, altLabel, birth/death dates, occupation, etc.

9. Reconciliation across multiple sources, e.g. match LoC URI to OCLC FAST URI

What's next

Addressing Architectural Challenges
• Generalize process for accessing context on the cache server and in the normalization layer
• Multi-authority search and reconciliation
• Address the need for cache refresh
• Mirrored cache servers

User Experience and Design
• User-centered Design
• Observe, listen, learn, design, evaluate, iterate
• Iteratively design and evaluate UI for lookup/authorities with catalogers
• Search result ranking/ordering/filtering for catalogers
• Additional UI platforms, e.g. FOLIO

Questions

http://tinyurl.com/ld4l-auth-access





Page 34: Linking the Data: Building effective Authority and …Linking the Data: Building Effective Authority and Identity Lookup Huda Khan and E. Lynette Rayle Cornell University Collaborators:

Cache Server Query Process

JSP Query API

extract search rank

extract URI

Jena-Fuseki

Triplestore

for each result

LuceneSOLR

Index

lucene search for ezra cornell

index built with predicate values

ltskosprefLabelgt

ltskosaltLabelgt

httpservicesld4lorgld4l_servicesloc_name_batchjspquery=ezra20cornellampmaxRecords=10

One full setup per authority

Cache Server Query Process

JSP Query API

sparql query for URI

Jena-Fuseki

Triplestore

LuceneSOLR

Index

extract search rank

extract URI

for each result

lucene search for ezra cornell

index built with predicate values

ltskosprefLabelgt

ltskosaltLabelgt

httpservicesld4lorgld4l_servicesloc_name_batchjspquery=ezra20cornellampmaxRecords=10

One full setup per authority

Cache Server Query Process

JSP Query API

combine all results

Jena-Fuseki

Triplestore

insert search rank in predicate

lthttpvivoweborgontology

corerankgt

LuceneSOLR

Index

sparql query for URI extract search rank

extract URI

for each result

lucene search for ezra cornell

index built with predicate values

ltskosprefLabelgt

ltskosaltLabelgt

httpservicesld4lorgld4l_servicesloc_name_batchjspquery=ezra20cornellampmaxRecords=10

One full setup per authority

UI-QA-Authority

QA ndash normalize RDF returned from an authority

httplocalhost3000qasearchlinked_data

oclc_fastpersonal_nameq=twainampmaximumRecords=2

[urihttpidworldcatorgfast31622id31622

labelTwain Mark 1835-1910

urihttpidworldcatorgfast365563id365563

labelTwain Shania

httpexperimentalworldcatorgfastsearchquery=o

clcpersonalName+22twain22

ampsortKeys=usageampmaximumRecords=2

RDF of

search

results

Active-Triples

LDF Cache

(Marmotta or

Blazegraph) LDF Cache Jena-Fuseki-

Lucene

Cache

Direct Access

of External

Authority

HyraxVitrolib ndash UI for selecting an entry from an authority

search of cache performed via Lucene index

Third Set of Challenges

8 Disambiguation through better context eg expand from just prefLabel tohellip

preLabel altLabel birthdeath dates occupation etc

9 Reconciliation across multiple sources eg match LoC URI to OCLC FAST URI

53

Whatrsquos next

Addressing Architectural Challenges

bull Generalize process for accessing context on the

cache server and in the normalization layer

bull Multi-authority search and reconciliation

bull Address the need for cache refresh

bull Mirrored cache servers

User Experience and Design

bull User-centered Design

bull Observe listen learn design evaluate iterate

bull Iteratively design and evaluate UI for lookupauthorities

with catalogers

bull Search result rankingorderingfiltering for catalogers

bull Additional UI platforms eg FOLIO

56

Questions

httptinyurlcomld4l-auth-access

Appendix for Challenges 1-4

Challenge 1 Documentation

58

LoC httpidlocgovtechcenter

C Harlow notes on reconciling LoC - httpsgithubcomcmh2166lc-reconcile

OCLC FAST

httpswwwoclcorgdeveloperdevelopweb-servicesfast-apilinked-dataenhtml

GeoNames

httpwwwgeonamesorgexportgeonames-searchhtml

AGROVOC httpaimsfaoorgvest-registryvocabulariesagrovoc-multilingual-agricultural-thesaurus

swagger config httpsgithubcomNatLibFiSkosmosblobmasterswaggerjson

NALT

httpsagclassnalusdagov

DBpedia httpwikidbpediaorgOnlineAccess1220Public20Faceted20Web20Service20Inter

face

Challenge 2 Linked Data Access API

59

for Search Query for Term Fetch

LoC not supported URI

OCLC FAST httpexperimentalworldcatorgfastsearchq

uery=subauth+all+22query22ampsortK

eys=usageampmaximumRecords=maximumR

ecords

URI

GeoNames httpapigeonamesorgsearchq=queryamp

maxRows=maxRowsampusername=userna

meamptype=rdf

URI

AGROVOC httpartemideartuniroma2it8081agrovocr

estv1searchquery=queryamplang=lang

httpartemideartuniroma2it8081agrovo

crestv1datauri=httpaimsfaoorgaosa

grovocterm_id

NALT httpskosmoslibrarycornelledurestv1nalt

searchquery=queryamplang=lang

httpskosmoslibrarycornelledurestv1na

ltdatauri=term_uri

DBpedia

Challenge 3: Varying Results Formats

Authority    Search query     Term fetch
LoC          not supported    rdf-xml
OCLC FAST    rdf-xml          rdf-xml
GeoNames     rdf-xml          rdf-xml
AGROVOC      json-ld          rdf-xml, json-ld, turtle
NALT         json-ld          rdf-xml, json-ld, turtle
DBpedia      (not listed)     (not listed)

Challenge 4: Varying Ontologies

Authority    Primary Ontology     Flat vs. Navigation required
LoC          madsrdf, SKOS        navigation required
OCLC FAST    schema.org, SKOS     flat
GeoNames     geonames             flat, hierarchical
AGROVOC      SKOS                 flat, hierarchical
NALT         SKOS                 flat, hierarchical
DBpedia      dbpedia              flat
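Since each authority describes its terms with a different ontology, a normalization layer needs some per-authority mapping from ontology-specific predicates to a common field. A minimal sketch of that idea for the label field follows. Which predicate each authority actually uses for its label is an assumption made for illustration, as are the function and dictionary names.

```python
# Label predicates to try for each authority, in priority order. Which
# predicate each authority actually uses is an assumption for illustration.
LABEL_PREDICATES = {
    "loc": ["madsrdf:authoritativeLabel", "skos:prefLabel"],
    "oclc_fast": ["schema:name", "skos:prefLabel"],
    "geonames": ["gn:name"],
    "agrovoc": ["skos:prefLabel"],
    "nalt": ["skos:prefLabel"],
    "dbpedia": ["rdfs:label"],
}


def extract_label(authority, properties):
    """Return the first label found under the authority's predicates, else None.

    `properties` maps a prefixed predicate name to a list of literal values.
    """
    for predicate in LABEL_PREDICATES[authority]:
        values = properties.get(predicate)
        if values:
            return values[0]
    return None
```

Keeping the mapping in data rather than code mirrors the configuration-driven approach of the per-authority setups: supporting a new authority means adding an entry, not writing a new parser.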

Configurations for Questioning Authority

LoC: https://github.com/ld4l-labs/linked_data_authorities/tree/master/qa_loc/config/authorities/linked_data

OCLC FAST: https://github.com/ld4l-labs/linked_data_authorities/tree/master/qa_oclcfast/config/authorities/linked_data

GeoNames: https://github.com/ld4l-labs/linked_data_authorities/tree/master/qa_geonames/config/authorities/linked_data

AGROVOC: https://github.com/ld4l-labs/linked_data_authorities/tree/master/qa_agrovoc/config/authorities/linked_data

NALT: https://github.com/ld4l-labs/linked_data_authorities/tree/master/qa_nalt/config/authorities/linked_data

DBpedia: https://github.com/ld4l-labs/linked_data_authorities/tree/master/qa_dbpedia/config/authorities/linked_data

Appendix for Challenges 5-7

Creating a Cache Server

Hardware

• 8-core, 64 GB, 3 GHz Mac Pro (late 2013), macOS Sierra (10.12.6)

• 32 TB Pegasus-2 Thunderbolt RAID configured as RAID-5

Triplestore

• Apache Jena Fuseki 2.4.0 provides SPARQL endpoint

• Apache Tomcat 9.0 runs custom web application(s)

• Apache Lucene 3.6 provides search interface

Customizations

• custom per-data-source JSP web application provides search/browse/download functionality

• custom (generic) SPARQL Tag Library provides API for web apps (available at https://github.com/eichmann/lod-utilities)

• custom (generic) Lucene Tag Library provides API for web apps

Loading a New Vocabulary

• download RDF

• if necessary, convert to n-triples (required for GeoNames data, for instance)

• use tdbloader2 to populate the triplestore

• configure Fuseki server(s) with triplestore details

• create new JSP project in Eclipse

• write one or more indexer programs that populate Lucene indices, and run the indexer(s)

• write search/browse/download application logic using the SPARQL and Lucene tags

• package project as a WAR

• deploy to Apache Tomcat server(s)

• add new service to Apache HTTPD virtual host specification
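The first three steps above can be sketched as a small helper that assembles the command lines without running them. Using Jena's riot for the conversion is an assumption: the slides do not say which converter was used, and the GeoNames dump needs its own preprocessing first. The function name and file paths are hypothetical.

```python
def load_commands(source_file, tdb_dir, needs_conversion=False):
    """Return the command lines (as argument lists) for loading one vocabulary."""
    commands = []
    data_file = source_file
    if needs_conversion:
        # Jena's riot writes the converted triples to stdout; redirecting
        # them into data_file is left to the caller.
        data_file = source_file.rsplit(".", 1)[0] + ".nt"
        commands.append(["riot", "--output=ntriples", source_file])
    # tdbloader2 bulk-loads the n-triples file into the TDB directory.
    commands.append(["tdbloader2", "--loc", tdb_dir, data_file])
    return commands
```

Returning argument lists instead of executing them keeps the sketch side-effect free; a caller could pass each list to subprocess.run once the paths have been checked.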

UI Access to Cache Server

http://services.ld4l.org/ld4l_services/loc_name.jsp

Downloads

LoC: http://id.loc.gov/download/ (n-triples OR rdf-xml)

OCLC FAST: http://www.oclc.org/research/themes/data-science/fast/download.html (n-triples)

GeoNames: http://www.geonames.org/ontology/documentation.html (custom format – see notes for processing)

AGROVOC: https://aims-fao.atlassian.net/wiki/spaces/AGV/pages/2949126/Releases (n-triples OR rdf-xml)

NALT: https://agclass.nal.usda.gov/download.shtml (rdf-xml)

DBpedia: http://wiki.dbpedia.org/downloads-2016-04

Potential Options for Reconciliation

• VIAF for name reconciliation – we are doing some work with this

• Wikidata – I've heard that they are working on reconciliation issues, but haven't yet explored in depth

  • Intro Video (3 hrs)

  • API Access

  • SPARQL – user manual

  • federated queries with other authorities

Doing a Google search for "linked data reconciliation" returns a large number of articles and presentations on this concept

Links to Code & More

• QA Server - Code for a small app that provides the Questioning Authority normalization layer

• Linked Data Authorities - Configurations that can be used with QA Server

• LD4L Services - UI access to Cache Server

• VitroLib - Code for the VitroLib cataloging tool


Cache Server Query Process

Request: http://services.ld4l.org/ld4l_services/loc_name_batch.jsp?query=ezra%20cornell&maxRecords=10

JSP Query API

1. Lucene search for "ezra cornell" against the Lucene/SOLR index (index built with predicate values <skos:prefLabel> and <skos:altLabel>)

2. For each result: extract URI, extract search rank

3. SPARQL query for URI against the Jena-Fuseki triplestore

4. Insert search rank in predicate <http://vivoweb.org/ontology/core#rank>

5. Combine all results

One full setup per authority
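The query flow on this slide can be sketched with in-memory stand-ins for the Lucene index and the triplestore. Everything here is illustrative: the word-overlap scoring is a hypothetical substitute for Lucene's ranking, the list of tuples stands in for the Fuseki SPARQL endpoint, and the function names are not those of the actual JSP application.

```python
RANK_PREDICATE = "http://vivoweb.org/ontology/core#rank"


def search_index(index, query, max_records):
    """Stand-in for the Lucene search: rank URIs by how many query words
    appear in their indexed prefLabel/altLabel text (hypothetical scoring)."""
    words = query.lower().split()
    scored = []
    for uri, text in index.items():
        hits = sum(1 for word in words if word in text.lower())
        if hits:
            scored.append((hits, uri))
    # Best score first; ties broken by URI so the order is deterministic.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [uri for _, uri in scored[:max_records]]


def fetch_triples(store, uri):
    """Stand-in for the per-URI SPARQL query against the triplestore."""
    return [(s, p, o) for (s, p, o) in store if s == uri]


def query_cache(index, store, query, max_records=10):
    """Search, fetch each hit's triples, add a rank triple, combine."""
    combined = []
    for rank, uri in enumerate(search_index(index, query, max_records), start=1):
        combined.extend(fetch_triples(store, uri))
        combined.append((uri, RANK_PREDICATE, str(rank)))
    return combined
```

Carrying the rank as an extra triple means the combined result stays plain RDF, so the normalization layer can recover the search order without a side channel.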

UI-QA-Authority

Hyrax/VitroLib – UI for selecting an entry from an authority

Request to QA: http://localhost:3000/qa/search/linked_data/oclc_fast/personal_name?q=twain&maximumRecords=2

Response from QA: [{"uri":"http://id.worldcat.org/fast/31622","id":"31622","label":"Twain, Mark, 1835-1910"},{"uri":"http://id.worldcat.org/fast/365563","id":"365563","label":"Twain, Shania"}]

QA – normalize RDF returned from an authority

Request to authority: http://experimental.worldcat.org/fast/search?query=oclc.personalName+%22twain%22&sortKeys=usage&maximumRecords=2

Returned by authority: RDF of search results

Authority access options shown: Active-Triples; LDF Cache (Marmotta or Blazegraph); LDF Cache; Jena-Fuseki-Lucene Cache; Direct Access of External Authority

Search of cache performed via Lucene index
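The normalization step can be sketched as a function that turns (uri, label) pairs, assumed to have been extracted already from the authority's RDF, into the flat uri/id/label records that the UI receives. Deriving the id from the last path segment of the URI is an assumption that happens to match the FAST example on this slide; the function name is hypothetical.

```python
import json


def normalize(results):
    """Turn (uri, label) pairs extracted from an authority's RDF into
    flat uri/id/label records."""
    normalized = []
    for uri, label in results:
        local_id = uri.rstrip("/").rsplit("/", 1)[-1]  # last path segment
        normalized.append({"uri": uri, "id": local_id, "label": label})
    return normalized


pairs = [
    ("http://id.worldcat.org/fast/31622", "Twain, Mark, 1835-1910"),
    ("http://id.worldcat.org/fast/365563", "Twain, Shania"),
]
print(json.dumps(normalize(pairs), indent=2))
```

Whatever the authority's ontology or serialization, the UI only ever sees this one shape, which is what lets Hyrax and VitroLib share the same lookup code.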





Page 39: Linking the Data: Building effective Authority and …Linking the Data: Building Effective Authority and Identity Lookup Huda Khan and E. Lynette Rayle Cornell University Collaborators:

53

Whatrsquos next

Addressing Architectural Challenges

bull Generalize process for accessing context on the

cache server and in the normalization layer

bull Multi-authority search and reconciliation

bull Address the need for cache refresh

bull Mirrored cache servers

User Experience and Design

bull User-centered Design

bull Observe listen learn design evaluate iterate

bull Iteratively design and evaluate UI for lookupauthorities

with catalogers

bull Search result rankingorderingfiltering for catalogers

bull Additional UI platforms eg FOLIO

56

Questions

httptinyurlcomld4l-auth-access

Appendix for Challenges 1-4

Challenge 1: Documentation

LoC: http://id.loc.gov/techcenter

C. Harlow notes on reconciling LoC: https://github.com/cmh2166/lc-reconcile

OCLC FAST: https://www.oclc.org/developer/develop/web-services/fast-api/linked-data.en.html

GeoNames: http://www.geonames.org/export/geonames-search.html

AGROVOC: http://aims.fao.org/vest-registry/vocabularies/agrovoc-multilingual-agricultural-thesaurus

Swagger config: https://github.com/NatLibFi/Skosmos/blob/master/swagger.json

NALT: https://agclass.nal.usda.gov

DBpedia: http://wiki.dbpedia.org/OnlineAccess#1.2%20Public%20Faceted%20Web%20Service%20Interface

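Challenge 2: Linked Data Access API

Access for search query and for term fetch, by authority. In the URLs, tokens such as query, lang, maxRows, username, and term_id stand for caller-supplied values.

LoC

• Search query: not supported

• Term fetch: URI

OCLC FAST

• Search query: http://experimental.worldcat.org/fast/search?query=subauth+all+%22query%22&sortKeys=usage&maximumRecords=maximumRecords

• Term fetch: URI

GeoNames

• Search query: http://api.geonames.org/search?q=query&maxRows=maxRows&username=username&type=rdf

• Term fetch: URI

AGROVOC

• Search query: http://artemide.art.uniroma2.it:8081/agrovoc/rest/v1/search?query=query&lang=lang

• Term fetch: http://artemide.art.uniroma2.it:8081/agrovoc/rest/v1/data?uri=http://aims.fao.org/aos/agrovoc/term_id

NALT

• Search query: http://skosmos.library.cornell.edu/rest/v1/nalt/search?query=query&lang=lang

• Term fetch: http://skosmos.library.cornell.edu/rest/v1/nalt/data?uri=term_uri

DBpedia

• No entries given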
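As a rough illustration of why a normalization layer helps here, the sketch below fills per-authority search URL templates from a single function call. The templates follow the search URLs listed for Challenge 2; the `{placeholder}` syntax, the dictionary keys, and the function name `build_search_url` are assumptions made for this sketch and are not taken from Questioning Authority.

```python
from urllib.parse import quote

# Search URL templates per authority, following the URLs listed for Challenge 2.
# The {placeholder} syntax and the dictionary keys are assumptions of this sketch.
SEARCH_TEMPLATES = {
    "oclc_fast": (
        "http://experimental.worldcat.org/fast/search"
        "?query={subauth}+all+%22{query}%22&sortKeys=usage"
        "&maximumRecords={max_records}"
    ),
    "geonames": (
        "http://api.geonames.org/search"
        "?q={query}&maxRows={max_records}&username={username}&type=rdf"
    ),
    "agrovoc": (
        "http://artemide.art.uniroma2.it:8081/agrovoc/rest/v1/search"
        "?query={query}&lang={lang}"
    ),
    "nalt": (
        "http://skosmos.library.cornell.edu/rest/v1/nalt/search"
        "?query={query}&lang={lang}"
    ),
}


def build_search_url(authority, query, **params):
    """Fill an authority's search template with percent-encoded values.

    Raises ValueError for authorities with no search template
    (for example LoC, which the slide lists as "not supported").
    """
    template = SEARCH_TEMPLATES.get(authority)
    if template is None:
        raise ValueError(f"no search API configured for {authority!r}")
    values = {key: quote(str(value), safe="") for key, value in params.items()}
    values["query"] = quote(query, safe="")
    return template.format(**values)
```

A caller would use the same call shape for every authority, for instance `build_search_url("nalt", "corn belt", lang="en")`, and leave the per-authority differences to the template table.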

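Challenge 3: Varying Results Formats

Result formats for search query and for term fetch, by authority:

LoC

• Search query: not supported

• Term fetch: rdf-xml

OCLC FAST

• Search query: rdf-xml

• Term fetch: rdf-xml

GeoNames

• Search query: rdf-xml

• Term fetch: rdf-xml

AGROVOC

• Search query: json-ld

• Term fetch: rdf-xml, json-ld, turtle

NALT

• Search query: json-ld

• Term fetch: rdf-xml, json-ld, turtle

DBpedia

• No entries given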
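One way a client could cope with the differing term-fetch formats is to keep a small table of what each authority offers and choose the first format it prefers. The sketch below assumes that approach; the format lists mirror the Challenge 3 table, the media types are the standard registered ones for each serialization, and the names `FETCH_FORMATS` and `pick_fetch_format` are hypothetical.

```python
# Term-fetch formats per authority, as listed for Challenge 3.
FETCH_FORMATS = {
    "loc": ["rdf-xml"],
    "oclc_fast": ["rdf-xml"],
    "geonames": ["rdf-xml"],
    "agrovoc": ["rdf-xml", "json-ld", "turtle"],
    "nalt": ["rdf-xml", "json-ld", "turtle"],
}

# Standard media types for each serialization name used on the slide.
MIME_TYPES = {
    "rdf-xml": "application/rdf+xml",
    "json-ld": "application/ld+json",
    "turtle": "text/turtle",
}


def pick_fetch_format(authority, preferred=("json-ld", "turtle", "rdf-xml")):
    """Return (format name, media type) for the first preferred format
    that the authority offers for term fetch."""
    available = FETCH_FORMATS[authority]
    for name in preferred:
        if name in available:
            return name, MIME_TYPES[name]
    raise LookupError(f"{authority!r} offers none of {preferred!r}")
```

The returned media type could then be sent as an HTTP `Accept` header, or used to select a parser for the response body.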

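Challenge 4: Varying Ontologies

Primary ontology, and flat vs. navigation required, by authority:

LoC

• Primary ontology: madsrdf, SKOS

• Flat vs. navigation required: navigation required

OCLC FAST

• Primary ontology: schema.org, SKOS

• Flat vs. navigation required: flat

GeoNames

• Primary ontology: geonames

• Flat vs. navigation required: flat, hierarchical

AGROVOC

• Primary ontology: SKOS

• Flat vs. navigation required: flat, hierarchical

NALT

• Primary ontology: SKOS

• Flat vs. navigation required: flat, hierarchical

DBpedia

• Primary ontology: dbpedia

• Flat vs. navigation required: flat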
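To make the ontology differences concrete, here is a minimal sketch of label normalization: each authority is mapped to the predicates that most plausibly carry its primary label, and the first match wins. The predicate choices are assumptions based on the ontologies named above (`rdfs:label` for DBpedia in particular is a guess), the triples are plain Python tuples rather than a parsed graph, and it covers only flat lookups, not the navigation that LoC is listed as requiring.

```python
SKOS_PREF = "http://www.w3.org/2004/02/skos/core#prefLabel"

# Candidate label predicates per authority, in priority order.
# These choices are assumptions for this sketch, not project configuration.
LABEL_PREDICATES = {
    "loc": ["http://www.loc.gov/mads/rdf/v1#authoritativeLabel", SKOS_PREF],
    "oclc_fast": ["http://schema.org/name", SKOS_PREF],
    "geonames": ["http://www.geonames.org/ontology#name"],
    "agrovoc": [SKOS_PREF],
    "nalt": [SKOS_PREF],
    "dbpedia": ["http://www.w3.org/2000/01/rdf-schema#label"],
}


def primary_label(authority, subject, triples):
    """Return the first label found for `subject`, trying the authority's
    predicates in priority order; None if no predicate matches.

    `triples` is an iterable of (subject, predicate, object) string tuples.
    """
    triples = list(triples)
    for predicate in LABEL_PREDICATES[authority]:
        for s, p, o in triples:
            if s == subject and p == predicate:
                return o
    return None
```

A normalization layer built this way gives callers one field to read, whichever ontology the authority publishes in.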

Configurations for Questioning Authority

LoC: https://github.com/ld4l-labs/linked_data_authorities/tree/master/qa_loc/config/authorities/linked_data

OCLC FAST: https://github.com/ld4l-labs/linked_data_authorities/tree/master/qa_oclcfast/config/authorities/linked_data

GeoNames: https://github.com/ld4l-labs/linked_data_authorities/tree/master/qa_geonames/config/authorities/linked_data

AGROVOC: https://github.com/ld4l-labs/linked_data_authorities/tree/master/qa_agrovoc/config/authorities/linked_data

NALT: https://github.com/ld4l-labs/linked_data_authorities/tree/master/qa_nalt/config/authorities/linked_data

DBpedia: https://github.com/ld4l-labs/linked_data_authorities/tree/master/qa_dbpedia/config/authorities/linked_data

Appendix for Challenges 5-7

Creating a Cache Server

Hardware

• 8-core, 64 GB, 3 GHz Mac Pro (late 2013), macOS Sierra (10.12.6)

• 32 TB Pegasus-2 Thunderbolt RAID configured as RAID-5

Triplestore

• Apache Jena Fuseki 2.4.0 provides SPARQL endpoint

• Apache Tomcat 9.0 runs custom web application(s)

• Apache Lucene 3.6 provides search interface
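Because Fuseki exposes a standard SPARQL endpoint, a client can query the cache with an ordinary HTTP GET under the SPARQL 1.1 Protocol. The sketch below only builds the request URL. The endpoint address, the dataset name, and the helper name `sparql_get_url` are hypothetical, and the example query assumes the cached data carries `skos:prefLabel` values.

```python
from urllib.parse import urlencode

# Example query: look up terms by exact preferred label (assumes SKOS labels).
LABEL_QUERY = """
PREFIX skos: <http://www.w3.org/2004/02/skos/core#>
SELECT ?term WHERE { ?term skos:prefLabel "Maize"@en } LIMIT 10
"""


def sparql_get_url(endpoint, query):
    """Build a SPARQL 1.1 Protocol GET URL: the query text is passed,
    URL-encoded, in the `query` parameter."""
    return endpoint + "?" + urlencode({"query": query})


# Hypothetical local Fuseki dataset; adjust host, port, and dataset name.
url = sparql_get_url("http://localhost:3030/nalt/query", LABEL_QUERY)
```

Sending the request with an `Accept: application/sparql-results+json` header would typically return the bindings as JSON.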

Customizations

• Custom per-data-source JSP web application provides search/browse/download functionality

• Custom (generic) SPARQL Tag Library provides API for web apps (available at https://github.com/eichmann/lod-utilities)

• Custom (generic) Lucene Tag Library provides API for web apps

Loading a New Vocabulary

• Download RDF

• If necessary, convert to n-triples (required for GeoNames data, for instance)

• Use tdbloader2 to populate the triplestore

• Configure Fuseki server(s) with triplestore details

• Create new JSP project in Eclipse

• Write one or more indexer programs that populate Lucene indices, and run the indexer(s)

• Write search/browse/download application logic using the SPARQL and Lucene tags

• Package project as WAR

• Deploy to Apache Tomcat server(s)

• Add new service to Apache HTTPD virtual host specification
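The triplestore-loading step above can be scripted. The sketch below wraps Apache Jena's `tdbloader2` command, whose `--loc` option names the TDB database directory. The paths, the function names, and the dry-run default are assumptions for illustration; the remaining steps (Fuseki configuration, indexing, packaging, deployment) are not covered.

```python
import subprocess


def tdbloader2_command(tdb_dir, ntriples_files):
    """Build the tdbloader2 invocation for a bulk load into `tdb_dir`."""
    return ["tdbloader2", "--loc", str(tdb_dir), *[str(f) for f in ntriples_files]]


def load_vocabulary(tdb_dir, ntriples_files, dry_run=True):
    """Return the load command; run it only when dry_run is False.

    Running it requires Apache Jena's command-line tools on the PATH.
    """
    command = tdbloader2_command(tdb_dir, ntriples_files)
    if not dry_run:
        subprocess.run(command, check=True)
    return command


# Hypothetical paths: print the command that would load a NALT n-triples dump.
print(" ".join(load_vocabulary("/data/tdb/nalt", ["nalt.nt"])))
```

Keeping the command builder separate from the runner makes the step easy to check before committing to a long bulk load.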

UI Access to Cache Server

http://services.ld4l.org/ld4l_services/loc_name.jsp

Downloads

LoC: http://id.loc.gov/download (n-triples OR rdf-xml)

OCLC FAST: http://www.oclc.org/research/themes/data-science/fast/download.html (n-triples)

GeoNames: http://www.geonames.org/ontology/documentation.html (custom format – see notes for processing)

AGROVOC: https://aims-fao.atlassian.net/wiki/spaces/AGV/pages/2949126/Releases (n-triples OR rdf-xml)

NALT: https://agclass.nal.usda.gov/download.shtml (rdf-xml)

DBpedia: http://wiki.dbpedia.org/downloads-2016-04

Potential Options for Reconciliation

• VIAF for name reconciliation – we are doing some work with this

• Wikidata – I've heard that they are working on reconciliation issues, but haven't yet explored in depth

  • Intro Video (3 hrs)

  • API Access

  • SPARQL – user manual

  • Federated queries with other authorities

Doing a Google search for "linked data reconciliation" returns a large number of articles and presentations on this concept.

Links to Code & More

• QA Server - Code for a small app that provides the Questioning Authority normalization layer

• Linked Data Authorities - Configurations that can be used with QA Server

• LD4L Services - UI access to Cache Server

• VitroLib - Code for the VitroLib cataloging tool






Page 45: Linking the Data: Building effective Authority and …Linking the Data: Building Effective Authority and Identity Lookup Huda Khan and E. Lynette Rayle Cornell University Collaborators:

Challenge 2: Linked Data Access API

Authority | for Search Query | for Term Fetch

LoC | not supported | URI

OCLC FAST | http://experimental.worldcat.org/fast/search?query={subauth}+all+%22{query}%22&sortKeys=usage&maximumRecords={maximumRecords} | URI

GeoNames | http://api.geonames.org/search?q={query}&maxRows={maxRows}&username={username}&type=rdf | URI

AGROVOC | http://artemide.art.uniroma2.it:8081/agrovoc/rest/v1/search?query={query}&lang={lang} | http://artemide.art.uniroma2.it:8081/agrovoc/rest/v1/data?uri=http://aims.fao.org/aos/agrovoc/{term_id}

NALT | http://skosmos.library.cornell.edu/rest/v1/nalt/search?query={query}&lang={lang} | http://skosmos.library.cornell.edu/rest/v1/nalt/data?uri={term_uri}

DBpedia | |
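
Because every authority exposes a different search URL, a client typically keeps one URL template per authority and substitutes the variables. The sketch below illustrates that idea; it is not the Questioning Authority implementation. The two templates are reconstructed from the table above (their punctuation and the `{name}` placeholder syntax are assumptions), and `SEARCH_TEMPLATES` and `build_search_url` are hypothetical names.

```python
from urllib.parse import quote

# Search URL templates per authority. Reconstructed from the slide table;
# the exact punctuation and placeholder syntax are assumptions.
SEARCH_TEMPLATES = {
    "geonames": (
        "http://api.geonames.org/search"
        "?q={query}&maxRows={maxRows}&username={username}&type=rdf"
    ),
    "nalt": (
        "http://skosmos.library.cornell.edu/rest/v1/nalt/search"
        "?query={query}&lang={lang}"
    ),
}


def build_search_url(authority, **params):
    """Fill an authority's template with percent-encoded parameter values."""
    template = SEARCH_TEMPLATES[authority]
    encoded = {name: quote(str(value), safe="") for name, value in params.items()}
    return template.format(**encoded)


if __name__ == "__main__":
    print(build_search_url("geonames", query="Ithaca NY", maxRows=5, username="demo"))
    print(build_search_url("nalt", query="milk", lang="en"))
```

Keeping the per-authority differences in data rather than in code means a new authority only needs a new template entry.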

Challenge 3: Varying Results Formats

Authority | for Search Query | for Term Fetch

LoC | not supported | rdf-xml

OCLC FAST | rdf-xml | rdf-xml

GeoNames | rdf-xml | rdf-xml

AGROVOC | json-ld | rdf-xml, json-ld, turtle

NALT | json-ld | rdf-xml, json-ld, turtle

DBpedia | |
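
One way to cope with the varying formats is to record what each authority offers and pick a serialization the client can parse. The sketch below assumes the term-fetch formats listed in the table above; the standard media types are real, but `TERM_FETCH_FORMATS`, `pick_format`, and the authority keys are hypothetical, and how a given service is actually asked for a format (Accept header, URL parameter, file extension) is not covered here.

```python
# Standard media types for the serializations named on the slide.
MIME_TYPES = {
    "rdf-xml": "application/rdf+xml",
    "json-ld": "application/ld+json",
    "turtle": "text/turtle",
}

# Term-fetch formats per authority, as read from the table (hypothetical keys).
TERM_FETCH_FORMATS = {
    "loc": ["rdf-xml"],
    "oclc_fast": ["rdf-xml"],
    "geonames": ["rdf-xml"],
    "agrovoc": ["rdf-xml", "json-ld", "turtle"],
    "nalt": ["rdf-xml", "json-ld", "turtle"],
}


def pick_format(authority, preferred=("json-ld", "turtle", "rdf-xml")):
    """Return (format, media type) for the first preferred format on offer."""
    offered = TERM_FETCH_FORMATS[authority]
    for fmt in preferred:
        if fmt in offered:
            return fmt, MIME_TYPES[fmt]
    raise ValueError(f"no usable format for {authority}")


if __name__ == "__main__":
    for name in TERM_FETCH_FORMATS:
        print(name, pick_format(name))
```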

Challenge 4: Varying Ontologies

Authority | Primary Ontology | Flat vs Navigation required

LoC | madsrdf, SKOS | navigation required

OCLC FAST | schema.org, SKOS | flat

GeoNames | geonames | flat, hierarchical

AGROVOC | SKOS | flat, hierarchical

NALT | SKOS | flat, hierarchical

DBpedia | dbpedia | flat
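
Since each authority describes a term with a different ontology, a normalization layer needs to know which predicate carries the label for each source. The sketch below shows that lookup over plain (subject, predicate, object) tuples. The predicate URIs are the standard ones for each vocabulary, but which predicate a given authority actually uses for its preferred label is an assumption here, and `LABEL_PREDICATES` and `find_label` are hypothetical names.

```python
SKOS_PREF_LABEL = "http://www.w3.org/2004/02/skos/core#prefLabel"

# Candidate label predicates per authority, tried in order (assumed mapping).
LABEL_PREDICATES = {
    "loc": ["http://www.loc.gov/mads/rdf/v1#authoritativeLabel", SKOS_PREF_LABEL],
    "oclc_fast": ["http://schema.org/name", SKOS_PREF_LABEL],
    "geonames": ["http://www.geonames.org/ontology#name"],
    "agrovoc": [SKOS_PREF_LABEL],
    "nalt": [SKOS_PREF_LABEL],
    "dbpedia": ["http://www.w3.org/2000/01/rdf-schema#label"],
}


def find_label(authority, subject, triples):
    """Return the first label found for `subject`, or None if there is none."""
    for predicate in LABEL_PREDICATES[authority]:
        for s, p, o in triples:
            if s == subject and p == predicate:
                return o
    return None


if __name__ == "__main__":
    # Made-up triples for illustration only.
    term = "http://example.org/term/1"
    triples = [
        (term, "http://www.w3.org/1999/02/22-rdf-syntax-ns#type", "http://example.org/Concept"),
        (term, SKOS_PREF_LABEL, "milk"),
    ]
    print(find_label("nalt", term, triples))
```

Authorities marked "navigation required" would need an extra step that follows links from the term to related nodes before the label can be read; that step is not shown.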

Configurations for Questioning Authority

LoC: https://github.com/ld4l-labs/linked_data_authorities/tree/master/qa_loc/config/authorities/linked_data

OCLC FAST: https://github.com/ld4l-labs/linked_data_authorities/tree/master/qa_oclcfast/config/authorities/linked_data

GeoNames: https://github.com/ld4l-labs/linked_data_authorities/tree/master/qa_geonames/config/authorities/linked_data

AGROVOC: https://github.com/ld4l-labs/linked_data_authorities/tree/master/qa_agrovoc/config/authorities/linked_data

NALT: https://github.com/ld4l-labs/linked_data_authorities/tree/master/qa_nalt/config/authorities/linked_data

DBpedia: https://github.com/ld4l-labs/linked_data_authorities/tree/master/qa_dbpedia/config/authorities/linked_data

Appendix for Challenges 5-7

Creating a Cache Server

Hardware

• 8-core, 64 GB, 3 GHz Mac Pro (late 2013), macOS Sierra (10.12.6)

• 32 TB Pegasus-2 Thunderbolt RAID configured as RAID-5

Triplestore

• Apache Jena Fuseki 2.4.0 provides SPARQL endpoint

• Apache Tomcat 9.0 runs custom web application(s)

• Apache Lucene 3.6 provides search interface
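
A client can reach the Fuseki SPARQL endpoint with a plain HTTP GET, as defined by the SPARQL protocol. The sketch below builds such a request URL. The host, port 3030 (Fuseki's default), the dataset name `nalt`, and the `/query` service name are assumptions about how the cache server might be configured, and `fuseki_query_url` is a hypothetical helper.

```python
from urllib.parse import urlencode

# Example query: list a few SKOS preferred labels from the dataset.
LABEL_QUERY = """
PREFIX skos: <http://www.w3.org/2004/02/skos/core#>
SELECT ?term ?label
WHERE { ?term skos:prefLabel ?label }
LIMIT 10
"""


def fuseki_query_url(base_url, dataset, sparql):
    """Build a SPARQL-protocol GET URL for a Fuseki dataset's query service."""
    endpoint = f"{base_url.rstrip('/')}/{dataset}/query"
    return endpoint + "?" + urlencode({"query": sparql})


if __name__ == "__main__":
    # Hypothetical local server and dataset name.
    print(fuseki_query_url("http://localhost:3030", "nalt", LABEL_QUERY))
```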

Customizations

• custom per-data-source JSP web application provides search/browse/download functionality

• custom (generic) SPARQL Tag Library provides API for web apps (available at https://github.com/eichmann/lod-utilities)

• custom (generic) Lucene Tag Library provides API for web apps

Loading a New Vocabulary

• download RDF

• if necessary, convert to n-triples (required for GeoNames data, for instance)

• use tdbloader2 to populate the triplestore

• configure Fuseki server(s) with triplestore details

• create new JSP project in Eclipse

• write one or more indexer programs that populate Lucene indices, and run indexer(s)

• write search/browse/download application logic using the SPARQL and Lucene tags

• package project as WAR

• deploy to Apache Tomcat server(s)

• add new service to Apache HTTPD virtual host specification
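
The triplestore steps in the list above (load with tdbloader2, then point Fuseki at the result) can be sketched as command lines assembled in Python. This is a rough illustration, not the project's actual scripts: the directory, file, and dataset names are made up, and the helpers `tdbloader2_command` and `fuseki_command` are hypothetical. The commands are only printed, not executed.

```python
import shlex


def tdbloader2_command(tdb_dir, ntriples_files):
    """Command line that bulk-loads N-Triples files into a TDB database directory."""
    return ["tdbloader2", "--loc", tdb_dir, *ntriples_files]


def fuseki_command(tdb_dir, dataset_name):
    """Command line that serves an existing TDB directory as a Fuseki dataset."""
    return ["fuseki-server", f"--loc={tdb_dir}", f"/{dataset_name}"]


if __name__ == "__main__":
    # Hypothetical paths and dataset name for a newly downloaded vocabulary.
    steps = [
        tdbloader2_command("/data/tdb/nalt", ["nalt.nt"]),
        fuseki_command("/data/tdb/nalt", "nalt"),
    ]
    for argv in steps:
        # Print a shell-safe rendering; pass argv to subprocess.run to execute.
        print(shlex.join(argv))
```

The remaining steps (Lucene indexing, the JSP application, Tomcat and HTTPD configuration) are specific to the cache server's custom code and are not sketched here.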











