Extracting XML from Unicorn with OAI and SRU

European Unicorn User Group Conference
Glasgow Caledonian University
September 7th & 8th, 2006

Benoit PAUWELS
Université Libre de Bruxelles (ULB), Brussels

Page 2: Agenda

• Introduction – Unicorn interfaces
• Part 1: An OAI frontend for Unicorn
• Part 2: An SRU frontend for Unicorn

For each part:
– Short description of the OAI and SRU protocols
– Overview of technical implementation
– Use cases and demos

Page 3: Introduction

• OAI and SRU are ‘open’ protocols that permit the exchange of metadata between information systems
• Well-known Unicorn interfaces:
  – Unicorn API server
  – Unicorn Webcat/iBistro/iLink server
  – Unicorn Z39.50 server
• All follow the philosophy of request/response sequences

Page 4: Unicorn interfaces: API server

[Diagram: a client system sends an API request over a TCP/IP socket to the API server on the Unicorn server, which reads the catalogue database (records and indexes) and returns an API response made of data codes and values.]

SirsiDynix clients:
• Character client
• C Workflows client
• Java Themes client

Communication protocol: TCP/IP socket
Information exchange protocol: proprietary SirsiDynix API requests/responses
Returned record structure: proprietary SirsiDynix format (data codes and values)

Page 5: Unicorn interfaces: iLink

[Diagram: a client system sends an iLink request (a URL) over HTTP to the web server and iLink on the Unicorn server, which reads the catalogue database (records and indexes) and returns an HTML page.]

Client: any Web browser

Communication protocol: HTTP
Information exchange protocol: URL requests / HTML responses
Returned record structure: HTML

Page 6: Unicorn interfaces: Z39.50

[Diagram: a client system sends a Z39.50 request to the Z39.50 server on the Unicorn server, which reads the catalogue database (records and indexes) and returns a Z39.50 response carrying MARC21 records.]

Client: any Z39.50 client

Communication protocol: Z39.50 specific
Information exchange protocol: Z39.50 specific
Returned record structure: typically MARC21

Page 7: Unicorn interfaces

• API: proprietary
  – low interoperability level
• HTML: record data not well structured
  – low reusability level
• Z39.50: protocol specific
  – more difficult to implement (steep learning curve)
  – Z39.50 is stateful

Difficult to integrate into today’s web services environments:
– communication: use HTTP
– information exchange: use open protocols (like OAI and SRU)
– record data structure: use XML (according to a well-defined XML Schema)

Page 8: 2 new Unicorn interfaces

• HTTP / Open / XML

• OAI-PMH: Open Archives Initiative – Protocol for Metadata Harvesting

• SRU: Search and Retrieve via URL

Page 9: OAI-PMH: the protocol

[Diagram: a service provider sends HTTP-embedded OAI requests to a data provider; the data provider's web server and OAI frontend read the document archive and return HTTP-embedded OAI responses.]

Page 10: OAI-PMH: the protocol

• ‘Harvester collects metadata from archives’

• Stateless protocol: sequence of OAI requests/responses over HTTP

• Just harvesting -- NOT searching

Page 11: OAI-PMH: the protocol

OAI requests

• HTTP GET|POST requests
• Syntax
  – Base URL: host + port + path of the OAI request handler
  – key=value pairs
• Examples:
  – http://www.cible.ulb.ac.be:80/cgi-bin/OAI20/catalog?verb=Identify
  – http://www.biomedcentral.com/oai/1.1/bmcoai.asp?verb=GetRecord&identifier=oai:bmc:1471-2105-1-1&metadataPrefix=oai_dc
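The request syntax above (base URL plus key=value pairs) can be sketched in a few lines of Python. The helper name `oai_request_url` is hypothetical; only the base URL and the argument names come from the slide.

```python
from urllib.parse import urlencode


def oai_request_url(base_url, verb, **arguments):
    """Return an OAI-PMH GET URL: the base URL followed by key=value pairs."""
    # 'verb' comes first; the other arguments keep the order they were given in.
    pairs = [("verb", verb)] + list(arguments.items())
    # safe=":" leaves OAI identifiers such as oai:ulbcat:245000 readable.
    return base_url + "?" + urlencode(pairs, safe=":")


BASE_URL = "http://www.cible.ulb.ac.be:80/cgi-bin/OAI20/catalog"

identify_url = oai_request_url(BASE_URL, "Identify")
get_record_url = oai_request_url(
    BASE_URL, "GetRecord",
    identifier="oai:ulbcat:245000", metadataPrefix="marc21",
)
print(identify_url)
print(get_record_url)
```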

Page 12: OAI-PMH: the protocol

OAI responses

• XML-encoded byte streams, containing the records
• Record = triplet
  – header (unique OAI identifier)
  – metadata
  – about
• Metadata schemes
  – XML Schema
  – Minimum: unqualified Dublin Core
  – Community specific
• Example of a record (catkey 450000 from the ULB catalogue):
  – oai_dc, marc21, umods
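To make the header/metadata/about triplet concrete, here is a minimal Python sketch that splits one record out of a GetRecord response. The XML sample is hand-written for illustration and is not real output from the ULB server; the title and datestamp in it are placeholders. The namespace URIs are the ones defined by OAI-PMH 2.0 and Dublin Core.

```python
import xml.etree.ElementTree as ET

NS = {
    "oai": "http://www.openarchives.org/OAI/2.0/",
    "oai_dc": "http://www.openarchives.org/OAI/2.0/oai_dc/",
    "dc": "http://purl.org/dc/elements/1.1/",
}

# Hand-written sample response (assumption), trimmed to a single record.
SAMPLE = """
<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <GetRecord>
    <record>
      <header>
        <identifier>oai:ulbcat:450000</identifier>
        <datestamp>2006-08-01</datestamp>
      </header>
      <metadata>
        <oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
                   xmlns:dc="http://purl.org/dc/elements/1.1/">
          <dc:title>Sample title</dc:title>
        </oai_dc:dc>
      </metadata>
    </record>
  </GetRecord>
</OAI-PMH>
"""


def split_record(record):
    """Return the (header, metadata, about) parts of an OAI record element."""
    header = record.find("oai:header", NS)
    metadata = record.find("oai:metadata", NS)
    about = record.findall("oai:about", NS)  # optional and repeatable
    return header, metadata, about


root = ET.fromstring(SAMPLE.strip())
record = root.find("oai:GetRecord/oai:record", NS)
header, metadata, about = split_record(record)
identifier = header.findtext("oai:identifier", namespaces=NS)
title = metadata.findtext("oai_dc:dc/dc:title", namespaces=NS)
print(identifier, title, len(about))
```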

Page 13: OAI-PMH: the protocol

Simple: 6 OAI requests/responses

• Identify
  – http://www.cible.ulb.ac.be/cgi-bin/OAI20/catalog?verb=Identify
• ListMetadataFormats [identifier]
  – http://www.cible.ulb.ac.be/cgi-bin/OAI20/catalog?verb=ListMetadataFormats
• ListSets
  – http://www.cible.ulb.ac.be/cgi-bin/OAI20/catalog?verb=ListSets
• GetRecord identifier, metadataPrefix
  – http://www.cible.ulb.ac.be/cgi-bin/OAI20/catalog?verb=GetRecord&identifier=oai:ulbcat:245000&metadataPrefix=marc21

Page 14: OAI-PMH: the protocol

Simple: 6 OAI requests/responses

• ListRecords metadataPrefix, [from,until,set]
  – http://www.cible.ulb.ac.be/cgi-bin/OAI20/catalog?verb=ListRecords&metadataPrefix=oai_dc
  – http://www.cible.ulb.ac.be/cgi-bin/OAI20/catalog?verb=ListRecords&metadataPrefix=mhld21&set=elper
  – http://www.cible.ulb.ac.be/cgi-bin/OAI20/catalog?verb=ListRecords&metadataPrefix=marc21&from=2006-08-01
• ListIdentifiers metadataPrefix, [from,until,set]
  – http://www.cible.ulb.ac.be/cgi-bin/OAI20/catalog?verb=ListIdentifiers&metadataPrefix=oai_dc
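The six requests and their arguments, as listed on these two slides, fit in a small lookup table. The Python sketch below checks a request against that table; `check_arguments` is a hypothetical helper, and the `resumptionToken` argument defined by the OAI-PMH specification is left out because the slides do not cover it.

```python
# verb -> (required arguments, optional arguments), as listed on the slides
OAI_VERBS = {
    "Identify": (set(), set()),
    "ListMetadataFormats": (set(), {"identifier"}),
    "ListSets": (set(), set()),
    "GetRecord": ({"identifier", "metadataPrefix"}, set()),
    "ListRecords": ({"metadataPrefix"}, {"from", "until", "set"}),
    "ListIdentifiers": ({"metadataPrefix"}, {"from", "until", "set"}),
}


def check_arguments(verb, arguments):
    """Return a list of problems; an empty list means the request is acceptable."""
    if verb not in OAI_VERBS:
        return ["unknown verb: " + verb]
    required, optional = OAI_VERBS[verb]
    given = set(arguments)
    problems = ["missing argument: " + name for name in sorted(required - given)]
    problems += ["unexpected argument: " + name
                 for name in sorted(given - required - optional)]
    return problems


print(check_arguments("GetRecord", {"identifier": "oai:ulbcat:245000"}))
print(check_arguments("ListRecords", {"metadataPrefix": "oai_dc", "set": "elper"}))
```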

Page 15: OAI frontend for Unicorn

• Implementation of the data provider functionality (2001)
• http://www.openarchives.org/tools/tools.html: pick a template and interface with Unicorn through Unicorn database tools
• Our choice: Object Oriented Perl frontend (H. Suleman – Virginia Tech)

Page 16: OAI frontend for Unicorn

[Diagram: an HTTP-embedded OAI request reaches the HTTP server on the Unicorn server; the CGI program OAI (a C wrapper) forks OAI.pl in the ‘sirsi’ environment; OAI.pl works against the Unicorn database, and an HTTP-embedded OAI response is returned.]

OAI.pl:
• calls the appropriate OAI request handler
• retrieves metadata from the Unicorn database
• formats it in XML

Page 17: OAI frontend for Unicorn

Example: implementation of the GetRecord request

http://www.cible.ulb.ac.be/cgi-bin/OAI20/catalog?verb=GetRecord&identifier=oai:ulbcat:245000&metadataPrefix=oai_dc

1. Get metadata from Unicorn for catkey 245000
   $record = `echo $catkey | catalogdump -of | filtermarc -iALL -od -Ds`;
   @dates = split('\|', `echo $catkey | selcatalog -iK -opr`);
2. Convert ANSEL character set into ISO-LATIN-1
3. Map from MARC to oai_dc
4. Format into XML
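Steps 3 and 4 can be illustrated with a short Python sketch. It is not the Perl code of the frontend: the crosswalk table is a small hypothetical subset (the slides do not list the actual mapping), the input is assumed to be already parsed into (tag, subfield code, value) tuples, and step 2 is skipped because it needs a dedicated ANSEL conversion table.

```python
import xml.etree.ElementTree as ET

OAI_DC = "http://www.openarchives.org/OAI/2.0/oai_dc/"
DC = "http://purl.org/dc/elements/1.1/"

# Hypothetical crosswalk: (MARC tag, subfield code) -> Dublin Core element.
MARC_TO_DC = {
    ("100", "a"): "creator",
    ("245", "a"): "title",
    ("260", "b"): "publisher",
    ("260", "c"): "date",
    ("650", "a"): "subject",
}


def marc_to_oai_dc(fields):
    """Map (tag, code, value) tuples to an oai_dc XML string (steps 3 and 4)."""
    ET.register_namespace("oai_dc", OAI_DC)
    ET.register_namespace("dc", DC)
    root = ET.Element("{%s}dc" % OAI_DC)
    for tag, code, value in fields:
        element = MARC_TO_DC.get((tag, code))
        if element is not None:  # unmapped fields are silently dropped
            ET.SubElement(root, "{%s}%s" % (DC, element)).text = value
    return ET.tostring(root, encoding="unicode")


# Sample input (assumption): two mapped fields and one local field.
record = [
    ("100", "a", "Einstein, Albert"),
    ("245", "a", "Relativity"),
    ("999", "a", "local"),
]
xml_text = marc_to_oai_dc(record)
print(xml_text)
```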

Page 18: OAI frontend for Unicorn

Example: implementation of the ‘set’ parameter of the ListRecords request

http://www.cible.ulb.ac.be/cgi-bin/OAI20/catalog?verb=ListRecords&metadataPrefix=oai_dc&set=elper

• Precompile set as a file of catkeys
  – name of file: « <name of set>_catkeys »
    • einstein_albert_catkeys
    • elper_catkeys
    • sd_catkeys
    • all_catkeys
  – through periodic execution of the « mkoaisets » custom report
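Resolving a set name to its precompiled file is then a plain file read. The sketch below assumes one catkey per line, which the slides do not state, and uses a temporary directory as a stand-in for wherever the real files live.

```python
import os
import tempfile


def catkeys_for_set(set_name, directory):
    """Read '<set name>_catkeys' from directory; one catkey per line (assumed)."""
    # Reject anything but letters, digits and underscores, so a request
    # parameter can never point outside the directory.
    if not set_name.replace("_", "").isalnum():
        raise ValueError("invalid set name: " + set_name)
    path = os.path.join(directory, set_name + "_catkeys")
    with open(path) as handle:
        return [line.strip() for line in handle if line.strip()]


# Demo with a throwaway directory and made-up file content.
with tempfile.TemporaryDirectory() as demo_dir:
    with open(os.path.join(demo_dir, "elper_catkeys"), "w") as handle:
        handle.write("245000\n450000\n617924\n")
    elper = catkeys_for_set("elper", demo_dir)

print(elper)
```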

Page 19: OAI frontend for Unicorn

Example: implementation of the ‘from/until’ parameters of the ListRecords request

http://www.cible.ulb.ac.be/cgi-bin/OAI20/catalog?verb=ListRecords&metadataPrefix=oai_dc&from=2006-08-01&until=2006-08-31

• BRS index on creation/modification date?
• Every Unicorn record that gets created or modified is ‘touched’ in the ‘textedit’ and ‘browsedit’ directories
• Custom report ‘cadutext’
  – saves catkeys to <ud>/Savedkeys/adutext/rptid
  – adds line ‘rptid|date|status’ to <ud>/Lastruns/cadutext
• Example: « from=2006-08-01&until=2006-08-31 »
  – obtain report ids for all runs of cadutext after 2006-08-01 and before 2006-08-31 from the file <ud>/Lastruns/cadutext
  – for each of these report ids: obtain catkeys from <ud>/Savedkeys/adutext/rptid and save them to the randomnumber_catkeys file
  – sort and uniq the randomnumber_catkeys file
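The date-range lookup above can be sketched as follows. Several details are assumptions: the date column is taken to be ISO formatted (YYYY-MM-DD), both bounds are treated as inclusive, the status column is ignored, and a Python set replaces the intermediate randomnumber_catkeys file. The directory layout follows the slide; the file contents in the demo are made up.

```python
import os
import tempfile


def catkeys_between(user_dir, date_from, date_until):
    """Collect catkeys saved by 'cadutext' runs dated within the given range."""
    catkeys = set()
    with open(os.path.join(user_dir, "Lastruns", "cadutext")) as runs:
        for line in runs:
            if not line.strip():
                continue
            rptid, run_date, _status = line.rstrip("\n").split("|")
            # ISO dates compare correctly as plain strings.
            if date_from <= run_date <= date_until:
                path = os.path.join(user_dir, "Savedkeys", "adutext", rptid)
                with open(path) as saved:
                    catkeys.update(key.strip() for key in saved if key.strip())
    # The set removes duplicates; sorting mirrors the 'sort and uniq' step.
    return sorted(catkeys, key=int)


# Demo data (assumption): three runs, one of them before the range.
with tempfile.TemporaryDirectory() as ud:
    os.makedirs(os.path.join(ud, "Lastruns"))
    os.makedirs(os.path.join(ud, "Savedkeys", "adutext"))
    with open(os.path.join(ud, "Lastruns", "cadutext"), "w") as f:
        f.write("r1|2006-07-30|OK\nr2|2006-08-02|OK\nr3|2006-08-15|OK\n")
    saved_keys = {"r1": "100\n", "r2": "300\n200\n", "r3": "200\n400\n"}
    for rptid, keys in saved_keys.items():
        with open(os.path.join(ud, "Savedkeys", "adutext", rptid), "w") as f:
            f.write(keys)
    august = catkeys_between(ud, "2006-08-01", "2006-08-31")

print(august)
```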

Page 20: OAI frontend for Unicorn

• Limitations of implementation:
  – ListRecords/ListIdentifiers:
    • The from and until parameters are not permitted if the set parameter is given on the request
    • The from and until parameters are permitted if the set parameter is not given on the request, but their values should fall within a certain date range (at this moment arbitrarily set to ‘today - 2 months’ and ‘today’)
  – Deleted records
• Complete source code and documentation available on the API Repository (http://sirsiapi.org)
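The two ListRecords/ListIdentifiers restrictions can be expressed as a small check. This Python sketch is illustrative only: `check_selective_harvest` is a hypothetical name, and ‘2 months’ is approximated as 60 days.

```python
from datetime import date, timedelta

WINDOW = timedelta(days=60)  # stand-in for 'today - 2 months' (assumption)


def check_selective_harvest(arguments, today):
    """Apply the from/until restrictions; return a list of problems."""
    if "from" not in arguments and "until" not in arguments:
        return []
    if "set" in arguments:
        return ["from/until cannot be combined with set"]
    problems = []
    for name in ("from", "until"):
        if name in arguments:
            value = date.fromisoformat(arguments[name])
            if not (today - WINDOW <= value <= today):
                problems.append(name + " is outside the permitted date range")
    return problems


today = date(2006, 9, 7)
print(check_selective_harvest({"from": "2006-08-01", "set": "elper"}, today))
print(check_selective_harvest({"from": "2006-08-01", "until": "2006-08-31"}, today))
```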

Page 21: OAI frontend - use cases @ ULB

Use case 1: Vlink - OpenURL resolver system (joint project with Vrije Universiteit Brussel (VUB))

[Diagram: sources (ULB iLink, JSTOR, ISI Web of Science, Elsevier ScienceDirect, OVID WebSpirs) send OpenURLs to Vlink, which uses the Vlink knowledge base and returns an HTML page of extended services.]

Page 23: OAI frontend - use cases @ ULB

Use case 1: Vlink - OpenURL resolver system

• OpenURL sent from iLink:
  http://bibdev.vub.ac.be/cgi-bin/openurlulb?sid=ULB:Webcat&id=oai:ulbcat:617924
• This OpenURL does not contain enough metadata for the specific item, so Vlink fetches back to Unicorn through an OAI GetRecord request to obtain a full MARC21 bibliographic description:
  http://www.cible.ulb.ac.be/cgi-bin/OAI20/catalog?verb=GetRecord&identifier=oai:ulbcat:617924&metadataPrefix=marc21
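The fetch back amounts to reading the `id` key of the OpenURL and reusing it as the OAI identifier. The Python sketch below shows the idea; it is not Vlink's code, and `fetch_back_url` is a hypothetical name.

```python
from urllib.parse import parse_qs, urlencode, urlsplit

OAI_BASE_URL = "http://www.cible.ulb.ac.be/cgi-bin/OAI20/catalog"


def fetch_back_url(openurl, metadata_prefix="marc21"):
    """Turn the 'id' of an incoming OpenURL into an OAI GetRecord request."""
    query = parse_qs(urlsplit(openurl).query)
    identifier = query["id"][0]  # e.g. oai:ulbcat:617924
    pairs = [
        ("verb", "GetRecord"),
        ("identifier", identifier),
        ("metadataPrefix", metadata_prefix),
    ]
    return OAI_BASE_URL + "?" + urlencode(pairs, safe=":")


openurl = ("http://bibdev.vub.ac.be/cgi-bin/openurlulb"
           "?sid=ULB:Webcat&id=oai:ulbcat:617924")
get_record = fetch_back_url(openurl)
print(get_record)
```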

Page 24: OAI frontend - use cases @ ULB

Use case 1: Vlink - OpenURL resolver system

• Feed Vlink Knowledge Base through OAI harvesting

[Diagram: Vlink fills the Vlink Knowledge Base from Unicorn over OAI-PMH.]

http://www.cible.ulb.ac.be/cgi-bin/OAI20/catalog?verb=ListRecords&metadataPrefix=mhld21&set=elper

Page 25: OAI frontend - use cases @ ULB

Use case 2: Unicat - Virtual Union Catalog of Belgium

[Diagram. Data providers: university library catalogues (Unicorn, Aleph, VIRTUA, VUBIS), public, museum and other catalogues, reached over OAI. Central repository: Unicat Harvester, union OAI archive, Unicat Indexer, search/browse indexes, Unicat WWW gateway, with OAI and SRU interfaces. End user: receives HTML.]

Page 26: SRU: the protocol

[Diagram: a client system sends an SRU request over HTTP to the web server and SRU frontend on the Unicorn server, which reads the catalogue database (records and indexes) and returns an SRU response in XML.]

Communication protocol: HTTP
Information exchange protocol: SRU
Returned record structure: XML

Page 27: SRU: the protocol

• ‘Client searches and retrieves metadata records from an archive’

• Stateless protocol: sequence of SRU requests/responses over HTTP

• Search and Retrieve (<-> OAI: harvesting)

Page 28: SRU: the protocol

SRU requests

• HTTP GET requests
• Syntax
  – Base URL: host + port + path of the SRU request handler
  – key=value pairs
• 3 possible requests (operations)
  – explain
    • serves to record the facilities available at an SRU server
    • used by clients to self-configure
    • the returned explain record is in XML and follows the ZeeRex Schema
    • Example: http://z3950.loc.gov:7090/voyager?version=1.1&operation=explain
  – scan
    • allows the client to request a range of the available terms at a given point within a list of indexed terms
    • enables clients to present an ordered list of values and, if supported, how many hits there would be for a search on that term
  – searchRetrieve

Page 29: SRU: the protocol

searchRetrieve operation

• searchRetrieve (principal) parameters
  – version: version of the request; current protocol version: 1.1
  – query: query expressed in CQL
  – startRecord: position within the sequence of matched records of the first record to be returned
  – maximumRecords: number of records requested to be returned
  – recordSchema: schema requested for the records to be returned
  – stylesheet: URL for an XML stylesheet. The client requests that the server simply return this URL in the response.
• CQL

« Traditionally, query languages have fallen into two camps: Powerful, expressive languages, not easily readable nor writable by non-experts (e.g. SQL, PQF, and XQuery); or simple and intuitive languages not powerful enough to express complex concepts (e.g. CCL and google). CQL tries to combine simplicity and intuitiveness of expression for simple, every day queries, with the richness of more expressive languages to accommodate complex concepts when necessary. »

(http://www.loc.gov/standards/sru/cql)
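A searchRetrieve request is the base URL plus the parameters listed above, with the CQL query percent-encoded. Here is a minimal Python sketch; `search_retrieve_url` and its default values are hypothetical, and the base URL is the one used in the examples of this talk.

```python
from urllib.parse import quote, urlencode


def search_retrieve_url(base_url, query, start_record=1, maximum_records=10,
                        record_schema=None, stylesheet=None):
    """Build an SRU 1.1 searchRetrieve URL from the principal parameters."""
    params = [
        ("version", "1.1"),
        ("operation", "searchRetrieve"),
        ("query", query),
        ("startRecord", start_record),
        ("maximumRecords", maximum_records),
    ]
    if record_schema:
        params.append(("recordSchema", record_schema))
    if stylesheet:
        params.append(("stylesheet", stylesheet))
    # quote_via=quote encodes spaces as %20 and double quotes as %22.
    return base_url + "?" + urlencode(params, quote_via=quote)


sru_url = search_retrieve_url(
    "http://bib49.ulb.ac.be:9000/Cible",
    'title all "einstein albert"',
    record_schema="dc",
)
print(sru_url)
```

Encoding the query this way avoids the raw spaces and quotation marks that browsers tolerate in a typed URL but that other HTTP clients may reject.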

Page 30: SRU: the protocol

searchRetrieve operation

Examples of CQL queries:

• dinosaur
• title = "complete dinosaur"
• title exact "the complete dinosaur"
• dinosaur not reptile
• dinosaur and bird or dinobird
• publicationYear < 1980
• title all "complete dinosaur"
  title contains all of the words: ‘complete’ and ‘dinosaur’
• title any "dinosaur bird reptile"
  title contains any of the words: ‘dinosaur’, ‘bird’, or ‘reptile’
• ribs prox/distance<=5 chevrons
  a more specific proximity query: ‘ribs’ within 5 words of ‘chevrons’

Page 31: SRU: the protocol

searchRetrieve operation: examples

• http://bib49.ulb.ac.be:9000/Cible?version=1.1&operation=searchRetrieve&query=author=einstein
• http://bib49.ulb.ac.be:9000/Cible?version=1.1&operation=searchRetrieve&maximumRecords=10&startRecord=1&query=author=einstein
• http://bib49.ulb.ac.be:9000/Cible?version=1.1&operation=searchRetrieve&maximumRecords=10&startRecord=1&query=author=einstein&recordSchema=dc
• http://bib49.ulb.ac.be:9000/Cible?version=1.1&operation=searchRetrieve&maximumRecords=10&startRecord=1&query=author all "einstein albert"
• http://bib49.ulb.ac.be:9000/Cible?version=1.1&operation=searchRetrieve&maximumRecords=10&startRecord=1&query=title all "einstein albert"
• http://bib49.ulb.ac.be:9000/Cible?version=1.1&operation=searchRetrieve&maximumRecords=10&startRecord=1&query=title all "einstein albert"&stylesheet=http://bib49.ulb.ac.be/cibleCanevas.xsl
• http://bib49.ulb.ac.be:9000/Cible?version=1.1&operation=searchRetrieve&maximumRecords=10&startRecord=1&query=title all "einstein albert"&stylesheet=http://bib49.ulb.ac.be/cibleTypo3.xsl
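On the client side, the XML that comes back from such a request can be read with any XML parser. The Python sketch below pulls the hit count and the titles out of a response; the sample is hand-written in the shape of an SRU 1.1 response with Dublin Core records and is not real output from the ULB server.

```python
import xml.etree.ElementTree as ET

NS = {
    "srw": "http://www.loc.gov/zing/srw/",
    "dc": "http://purl.org/dc/elements/1.1/",
}

# Hand-written sample (assumption); titles are placeholders.
SAMPLE = """
<srw:searchRetrieveResponse xmlns:srw="http://www.loc.gov/zing/srw/">
  <srw:version>1.1</srw:version>
  <srw:numberOfRecords>2</srw:numberOfRecords>
  <srw:records>
    <srw:record>
      <srw:recordSchema>info:srw/schema/1/dc-v1.1</srw:recordSchema>
      <srw:recordPacking>xml</srw:recordPacking>
      <srw:recordData>
        <srw_dc:dc xmlns:srw_dc="info:srw/schema/1/dc-schema"
                   xmlns:dc="http://purl.org/dc/elements/1.1/">
          <dc:title>First sample title</dc:title>
        </srw_dc:dc>
      </srw:recordData>
      <srw:recordPosition>1</srw:recordPosition>
    </srw:record>
    <srw:record>
      <srw:recordSchema>info:srw/schema/1/dc-v1.1</srw:recordSchema>
      <srw:recordPacking>xml</srw:recordPacking>
      <srw:recordData>
        <srw_dc:dc xmlns:srw_dc="info:srw/schema/1/dc-schema"
                   xmlns:dc="http://purl.org/dc/elements/1.1/">
          <dc:title>Second sample title</dc:title>
        </srw_dc:dc>
      </srw:recordData>
      <srw:recordPosition>2</srw:recordPosition>
    </srw:record>
  </srw:records>
</srw:searchRetrieveResponse>
"""


def summarize(response_xml):
    """Return (number of matching records, titles of the returned records)."""
    root = ET.fromstring(response_xml.strip())
    total = int(root.findtext("srw:numberOfRecords", namespaces=NS))
    path = "srw:records/srw:record/srw:recordData//dc:title"
    titles = [title.text for title in root.iterfind(path, NS)]
    return total, titles


total, titles = summarize(SAMPLE)
print(total, titles)
```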

Page 32: SRU frontend for Unicorn

[Diagram: the client system sends an SRU request over HTTP to the web server and SRU frontend on the Unicorn server, which reads the catalogue database (records and indexes) and returns an SRU response in XML.]

Page 33: SRU frontend for Unicorn

[Diagram: the client system sends an SRU request over HTTP to an SRU/Z39.50 gateway; the gateway sends a Z39.50 request to the Z39.50 frontend on the Unicorn server, which reads the catalogue database (records and indexes); the Z39.50 response goes back to the gateway, which returns an SRU response in XML over HTTP.]

Page 34: SRU frontend for Unicorn

• SRU/Z39.50 Gateway: YAZ Proxy (Index Data)
  – Implemented at ULB: 7/2006 (2 days)
  – config.xml

      <target name="cible" default="1">
        <url>bib7.ulb.ac.be:2200</url>
        <xi:include href="explain.xml"/>
        <cql2rpn>pqf.properties</cql2rpn>
      </target>
      <target name="slavko" default="1">
        <url>velma.library.mun.ca:2200</url>
        <xi:include href="explain.slavko.xml"/>
        <cql2rpn>pqf.slavko.properties</cql2rpn>
      </target>

  – explain.xml
    • ZeeRex XML record as response to the ‘explain’ operation
  – pqf.properties
    • specifies the mapping of various CQL indexes, relations, etc. into Type-1 query attributes
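The role of pqf.properties can be illustrated with a toy translation in Python. This is not how YAZ Proxy performs the conversion, which also handles booleans, relations and structure attributes; the sketch only maps a CQL index name to a Bib-1 ‘use’ attribute for queries of the form `index = term`. The attribute numbers are the standard Bib-1 values; which indexes a given server actually supports is an assumption.

```python
import re

# CQL index name -> Bib-1 'use' attribute number (standard Bib-1 values).
CQL_INDEX_TO_USE = {"title": 4, "author": 1003, "subject": 21}


def simple_cql_to_pqf(cql):
    """Translate '<index> = <term>' into a PQF (Type-1) query string."""
    match = re.fullmatch(r'\s*(\w+)\s*=\s*"?([^"]+?)"?\s*', cql)
    if match is None:
        raise ValueError("only '<index> = <term>' is handled in this sketch")
    index, term = match.groups()
    if index not in CQL_INDEX_TO_USE:
        raise ValueError("unmapped index: " + index)
    return '@attr 1=%d "%s"' % (CQL_INDEX_TO_USE[index], term)


print(simple_cql_to_pqf("author = einstein"))
print(simple_cql_to_pqf('title = "complete dinosaur"'))
```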

Page 35: SRU frontend for Unicorn

• YAZ Proxy
  – http://bib49.ulb.ac.be:9000/Cible?version=1.1&operation=searchRetrieve&maximumRecords=10&startRecord=1&query=title all "einstein albert"&stylesheet=http://bib49.ulb.ac.be/cibleTypo3.xsl
  – http://bib49.ulb.ac.be:9000/Slavko?version=1.1&operation=searchRetrieve&maximumRecords=10&startRecord=1&query=title all "einstein albert"&stylesheet=http://bib49.ulb.ac.be/cibleTypo3.xsl

Page 36: SRU frontend: use case @ ULB

• Seamless integration of catalog searches in CMS
• Typo3
• Example
  – HTML page containing a biography of the famous Belgian historian Henri Pirenne
  – frame pointing to the following URL:
    http://bib49.ulb.ac.be:9000/Cible?version=1.1&operation=searchRetrieve&maximumRecords=10&startRecord=1&query=pirenne%20and%20epub-dnu-*&stylesheet=http://bib49.ulb.ac.be/cibleTypo3.xsl
• Project
  – Unicorn contains descriptions of databases, websites, etc., with local thematic classification codes in field 653
  – create thematic websites within our CMS, containing frames that list available databases per theme