OAI Overview

OAI Overview Michael L. Nelson Old Dominion University Norfolk Virginia, USA [email protected] http://www.cs.odu.edu/~mln/ Bioinformatics Seminar ODU CS 791/891 Feb 3 2003


Page 1: OAI Overview

OAI Overview

Michael L. Nelson
Old Dominion University
Norfolk Virginia, USA
[email protected]
http://www.cs.odu.edu/~mln/

Bioinformatics Seminar
ODU CS 791/891
Feb 3 2003

Page 2: OAI Overview

The Rise and Fall of Distributed Searching

• wholesale distributed searching, popular at the time, is attractive in theory but troublesome in practice
  – Davis & Lagoze, JASIS 51(3), pp. 273-80
  – Powell & French, Proc 5th ACM DL, pp. 264-265
• distributed searching of N nodes still viable, but only for small values of N
  • NCSTRL: N > 100; bad
  • NTRS/NIX: N <= 20; ok (but could be better)

Page 3: OAI Overview

The Rise and Fall of Distributed Searching

• Other problems of distributed searching (from STARTS)
  – source-metadata problem
    • how do you know which nodes to search?
  – query-language problem
    • syntax varies and drifts over time between the various nodes
  – rank-merging problem
    • how do you meaningfully merge multiple result sets?
• Temptations:
  – centralize all functions
    • “everything will be done at X”
  – standardize on a single product
    • “everyone will use system Y”

Page 4: OAI Overview

Universal Preprint Service

• A cross-archive DL that provides services on a collection of metadata harvested from multiple archives
  – based on NCSTRL+; a modified version of Dienst
    • support for “clustering”
    • support for “buckets”
• Demonstrated at Santa Fe NM, October 21-22, 1999
  – http://ups.cs.odu.edu/
  – D-Lib Magazine, 6(2) 2000 (2 articles)
    • http://www.dlib.org/dlib/february00/02contents.html
  – UPS was soon renamed the Open Archives Initiative (OAI) http://www.openarchives.org/

Page 5: OAI Overview

Data and Service Providers
• Data Providers
  – publishing into an archive
  – providing methods for metadata “harvesting”
    • provide non-technical context for sharing information also
• Service Providers
  – harvest metadata from providers
  – implement user interface to data
(Even if these are done by the same DL, these are distinct roles)
• Self-describing archives
  – Much of the learning about the constituent UPS archives occurred out of band…

Page 6: OAI Overview

Metadata Harvesting
• Move away from distributed searching
• Extract metadata from various sources
• Build services on local copies of metadata
  – data remains at remote repositories
[Figure: a user searches for “cfd applications” against a local copy of metadata; the metadata is harvested offline from each node; each node is independently maintained; all searching, browsing, etc. is performed on the local copy of the metadata; individual nodes can still support direct user interaction]

Page 7: OAI Overview

Result… OAI

• http://www.openarchives.org/

• The OAI was the result of the demonstration and discussion during the Santa Fe meeting

• Initial focus was on federating collections of scholarly e-print materials…

• …however, interest grew and the scope and application of OAI expanded to become a generic bulk metadata transport protocol

• Note:
  – OAI is only about metadata -- not full text!
  – OAI is neutral with respect to the nature of the metadata or the resources the metadata describes
    • read: commercial publishers have an interest in OAI too...

Page 8: OAI Overview

           Santa Fe convention   OAI-PMH v.1.0/1.1        OAI-PMH v.2.0
about      eprints               document like objects    resources
metadata   OAMS                  unqualified Dublin Core  unqualified Dublin Core
transport  HTTP                  HTTP                     HTTP
responses  XML                   XML                      XML
requests   HTTP GET/POST         HTTP GET/POST            HTTP GET/POST
verbs      Dienst                OAI-PMH                  OAI-PMH
nature     experimental          experimental             stable
model      metadata harvesting   metadata harvesting      metadata harvesting

Page 9: OAI Overview

Dublin Core

• Dublin Core Metadata Initiative
  – http://www.dublincore.org/
  – from 1994-1995, recognizing the need for simple, interoperable metadata for resource discovery
  – good overview of metadata & DC: http://www.dlib.org/dlib/january01/lagoze/01lagoze.html
  – 15 elements (qualifiers possible)

Title Creator Subject Description Publisher

Contributor Date Type Format Identifier

Source Language Relation Coverage Rights
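To make the 15-element model concrete, here is a minimal sketch of reading unqualified Dublin Core out of an oai_dc container with Python's standard library. The sample record and the function name dc_fields are made up for illustration; only the two namespace URIs come from the DC and OAI-PMH 2.0 schemas.

```python
import xml.etree.ElementTree as ET

DC_NS = "http://purl.org/dc/elements/1.1/"

# Hypothetical record, invented for this example.
SAMPLE = """<oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
                       xmlns:dc="http://purl.org/dc/elements/1.1/">
  <dc:title>An Example Technical Report</dc:title>
  <dc:creator>Doe, J.</dc:creator>
  <dc:creator>Roe, R.</dc:creator>
  <dc:date>2003-02-03</dc:date>
</oai_dc:dc>"""


def dc_fields(xml_text):
    """Return {element name: [values]} for each Dublin Core element present."""
    fields = {}
    for child in ET.fromstring(xml_text):
        # Keep only children in the Dublin Core namespace.
        if child.tag.startswith("{" + DC_NS + "}"):
            name = child.tag.split("}", 1)[1]
            fields.setdefault(name, []).append((child.text or "").strip())
    return fields


if __name__ == "__main__":
    for name, values in dc_fields(SAMPLE).items():
        print(name, values)
```

Every element is optional and repeatable, which is why the sketch maps each name to a list rather than a single value.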

Page 10: OAI Overview

Overview of OAI Verbs

Verb                  Function
Identify              description of archive
ListMetadataFormats   metadata formats supported by archive
ListSets              sets defined by archive
ListIdentifiers       OAI unique ids contained in archive
ListRecords           listing of N records
GetRecord             listing of a single record

archival metadata: Identify, ListMetadataFormats, ListSets
harvesting verbs: ListIdentifiers, ListRecords, GetRecord

most verbs take arguments: dates, sets, ids, metadata formats and resumption token (for flow control)

Page 11: OAI Overview

Argument Summary
Verb                 metadataPrefix  from      until     set       resumptionToken  identifier
Identify             -               -         -         -         -                -
ListMetadataFormats  -               -         -         -         -                optional
ListSets             -               -         -         -         exclusive        -
ListIdentifiers      required        optional  optional  optional  exclusive        -
ListRecords          required        optional  optional  optional  exclusive        -
GetRecord            required        -         -         -         -                required
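The argument rules can be sketched as a small request builder. This is only an illustration: the function name build_request and the base URL are hypothetical, and the required arguments listed in the code are taken from the OAI-PMH 2.0 specification, not from any particular harvester.

```python
from urllib.parse import urlencode

# verb -> (required, optional) arguments, following OAI-PMH 2.0.
ARGS = {
    "Identify": (set(), set()),
    "ListMetadataFormats": (set(), {"identifier"}),
    "ListSets": (set(), set()),
    "ListIdentifiers": ({"metadataPrefix"}, {"from", "until", "set"}),
    "ListRecords": ({"metadataPrefix"}, {"from", "until", "set"}),
    "GetRecord": ({"identifier", "metadataPrefix"}, set()),
}
# Verbs that accept resumptionToken as an exclusive argument.
RESUMABLE = {"ListSets", "ListIdentifiers", "ListRecords"}


def build_request(base_url, verb, args=None):
    """Return a GET URL, or raise ValueError named after the OAI error code."""
    args = dict(args or {})
    if verb not in ARGS:
        raise ValueError("badVerb")
    if "resumptionToken" in args:
        # Exclusive: no other argument may accompany the token.
        if verb not in RESUMABLE or len(args) != 1:
            raise ValueError("badArgument")
    else:
        required, optional = ARGS[verb]
        names = set(args)
        if not required <= names or not names <= (required | optional):
            raise ValueError("badArgument")
    return base_url + "?" + urlencode({"verb": verb, **args})


if __name__ == "__main__":
    # Hypothetical base URL.
    print(build_request("http://example.org/oai", "ListRecords",
                        {"metadataPrefix": "oai_dc", "from": "2003-01-01"}))
```

Arguments are passed as a dict rather than keyword arguments because from is a reserved word in Python.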

Page 12: OAI Overview

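Error Summary
Identify             BA
ListMetadataFormats  BA NMF IDDNE
ListSets             BA BRT NSH
ListIdentifiers      BA BRT CDF NRM NSH
ListRecords          BA BRT CDF NRM NSH
GetRecord            BA CDF IDDNE

(BA = badArgument, BRT = badResumptionToken, CDF = cannotDisseminateFormat, IDDNE = idDoesNotExist, NMF = noMetadataFormats, NRM = noRecordsMatch, NSH = noSetHierarchy)

Generate badVerb on any input not matching the 6 defined verbs.
This is an inversion of the table in section 3.6 of the OAI-PMH specification.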

Page 13: OAI Overview

Flow Control

• ListSets, ListIdentifiers, ListRecords are all allowed to return partial responses, via a combination of:
  – resumptionToken – an opaque, archive-defined data string that when passed back to the archive allows the response to begin where it left off
    • each archive defines its own resumptionToken syntax; it may have visible semantics or not
  – 503 HTTP status code – “retry after”
    • up to the harvester to understand this code and respect it, and up to the archive to enforce it

Page 14: OAI Overview

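resumptionToken
scenario: harvesting 277 records in 3 separate 100-record “chunks”
[Figure: message exchange between harvester and RDBMS]
harvester → RDBMS: ListRecords
RDBMS → harvester: Records 1-100, resumptionToken=AXad31
harvester → RDBMS: ListRecords, resumptionToken=AXad31
RDBMS → harvester: Records 101-200, resumptionToken=pQ22-x
harvester → RDBMS: ListRecords, resumptionToken=pQ22-x
RDBMS → harvester: Records 201-277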
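The 277-record scenario can be sketched as a harvester loop that keeps passing back whatever token it was given until none is returned. Everything here is a stand-in: make_archive simulates an archive in memory, and its tokens happen to be offsets, which is one possible archive-defined syntax. The harvester never looks inside the token.

```python
def harvest(fetch):
    """Collect all records; fetch(token) returns (records, next_token or None)."""
    records, token = [], None
    while True:
        chunk, token = fetch(token)
        records.extend(chunk)
        if not token:
            # No resumptionToken in the response: the list is complete.
            return records


def make_archive(total=277, chunk=100):
    """Simulated archive serving `total` records in chunks of `chunk`."""
    def fetch(token):
        start = int(token) if token else 0
        end = min(start + chunk, total)
        next_token = str(end) if end < total else None
        return list(range(start + 1, end + 1)), next_token
    return fetch


if __name__ == "__main__":
    print(len(harvest(make_archive())))
```

A real harvester would replace fetch with an HTTP request and would also honor a 503 “retry after” response before asking again.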

Page 15: OAI Overview

OAI Links & Demos
• Data providers
  – not really meant for end-user interaction, but Suleman’s “Repository Explorer” is an excellent tool
    • http://purl.org/net/oai_explorer
    • ~100 registered data providers
      – http://oaisrv.nsdl.cornell.edu/Register/BrowseSites.pl
      – many being used for internal purposes; not registered
• Service providers
  – Arc, the first known SP harvesting from OAI data providers
    • http://arc.cs.odu.edu/
    • ~20 registered service providers
      – http://www.openarchives.org/service_provider/oai_sp.htm
      – several more known to be in testing or creation

Page 16: OAI Overview

Field of Dreams
• It should be easy to be a data provider, even if it makes more work for the service provider.
  – if enough data providers exist, the service providers will come (DPs >> SPs)
• Open-source / freely available tools
  – “drop-in” data providers:
    • industrial strength: http://www.eprints.org/
    • personal size: http://kepler.cs.odu.edu/
  – tools to make your existing DL a data provider:
    • http://www.openarchives.org/tools/tools.htm
    • also: OAI-implementers mailing list / mail archive!
  – service providers:
    • only bits and pieces currently publicly available...

Page 17: OAI Overview

OAI Observation: Front-End Only

• No input/registry mechanism
  – OAI harvesting protocol is always a front-end for something else
    • filesystem, Dienst, RDBMS, LDAP, etc.
  – convenient for pre-existing DLs, but does not address “new” DLs
    • e.g., “we want to do OAI”
• Bounds the scope of OAI
  – responsibilities and domain of OAI are still being discussed
  – tension between functionality and simplicity

Page 18: OAI Overview

OAI Observation: No T&C

• No terms & conditions provisions in protocol
  – assumes all metadata has uniform access rights
    • how to restrict metadata to certain hosts?
  – introducing T&C would increase the scope of application, but at the expense of simplicity
    • how expensive do we want to make a “just-a-front-end protocol”?
    • maybe T&C is a good application for sets?

Page 19: OAI Overview

OAI Observation: No T&C

• Possible to use multiple OAI servers in a DMZ-like configuration…

Public OAI Server

Private OAI Server

Source database

OAI requests from trusted hosts

OAI requests from arbitrary hosts

could even use a separate copy of the database…

Page 20: OAI Overview

OAI Observation: No T&C

• Possible to use OAI harvesting protocol in closed, restricted systems

OAI 1 OAI 2

OAI 3 OAI 4

all OAI requests originate from these 4 DLs

Page 21: OAI Overview

OAI Observation: Monolithic

• An OAI server has no protocol-defined concept of “other” OAI servers
  – backups, mirrors, etc. have to be resolved outside of the scope of OAI
    • scope vs. complexity again
  – fully connected graph of DLs harvesting from each other is unnecessary
    • cf. web crawlers vs. “gatherers” in U of Colorado’s Harvest System
  – 3rd party harvesting interfaces raise more T&C and data coherency issues

Page 22: OAI Overview

302 Load Balancing
• Interactive users on main DL machine should not be impacted by metadata harvesting
  – don’t take deliveries through the front door
  – not part of the protocol; defined outside the protocol
[Figure: the harvester sends http://blah/oai/?verb=ListIdentifiers to the OAI server at naca.larc.nasa.gov/oai/; if load > 0.05 that server redirects the request with HTTP Status Code 302 to the OAI server at buckets.dsi.internet2.edu/naca/oai/; the harvester repeats http://blah/oai/?verb=ListIdentifiers there and receives the response below]
<?xml version="1.0" encoding="UTF-8"?>…<ListIdentifiers>…</ListIdentifiers>
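The redirect decision on the main server might look like the sketch below. It assumes the server can read its own load figure and knows the mirror's base URL; the function name respond and its return shape are hypothetical, and only the 0.05 threshold and the mirror address come from the slide.

```python
def respond(load, query, mirror_base, threshold=0.05):
    """Return (status, headers) for an incoming OAI request.

    load        -- current load figure of the main DL machine
    query       -- the request's query string, e.g. "verb=ListIdentifiers"
    mirror_base -- base URL of the OAI server that absorbs harvesting traffic
    """
    if load > threshold:
        # Send the harvester to the mirror with the same query.
        return 302, {"Location": mirror_base + "?" + query}
    # Lightly loaded: answer locally with the XML response.
    return 200, {"Content-Type": "text/xml"}


if __name__ == "__main__":
    mirror = "http://buckets.dsi.internet2.edu/naca/oai/"
    print(respond(0.40, "verb=ListIdentifiers", mirror))
    print(respond(0.01, "verb=ListIdentifiers", mirror))
```

Because this lives in plain HTTP, any harvester built on a client that follows redirects needs no OAI-specific code to take part.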

Page 23: OAI Overview

OAI Observation: Data Coherency

• In the interest of OAI implementer simplicity, several issues are left for the service provider to interpret
  – what is an update vs. addition?
    • in the NACA OAI interface, they are reported as the same and it’s up to the harvesting system to figure it out
  – deletions?
    • it is currently optional for OAI systems to mark records as deleted or not…
  – still left to the harvester to interpret

Page 24: OAI Overview

OAI Observation: Harvest Model
• Frequency of harvests
  – all-at-once harvests?
    • initial harvest
    • resolving data coherency
  – frequent incremental harvests?
    • far more efficient for both service and data providers
• Webcrawling vs. digital library models
  – webcrawlers: little to no a priori information about target
  – DLs: frequent harvesting of a small number of known targets
• Realization: we know very little about harvesting behavior…
  – are we optimizing for all-at-once, when incremental will be more common?
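A toy sketch of why incremental harvests are cheaper: the data provider only returns records whose datestamp falls on or after the harvester's last visit, which is what the from argument selects. The record identifiers, the dates, and the function name changed_since are invented for illustration.

```python
def changed_since(records, last_harvest=None):
    """Return the identifiers a harvester would receive.

    records      -- {identifier: datestamp as 'YYYY-MM-DD'}
    last_harvest -- value the harvester sends as `from`, or None for all-at-once
    """
    if last_harvest is None:
        return sorted(records)
    # ISO dates compare correctly as strings; `from` is inclusive.
    return sorted(i for i, stamp in records.items() if stamp >= last_harvest)


if __name__ == "__main__":
    repo = {"a": "2002-12-01", "b": "2003-01-15", "c": "2003-02-01"}
    print(changed_since(repo))                # initial harvest
    print(changed_since(repo, "2003-01-15"))  # incremental harvest
```

Note that an added record and an updated record look the same here, which is the data coherency question raised earlier.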

Page 25: OAI Overview

Other Uses For the OAI-PMH
• Assumptions:
  – Traditional DLs / SPs will continue on their present path of increasing sophistication
    • citation indexing, search results viz, personalization, recommendations, subject-based filtering, etc.
  – growth rates remain the same (5x DPs as SPs)
• Premise: OAI-PMH is applicable to any scenario that needs to update / synchronize distributed state
  – Future opportunities are possible by creatively interpreting the OAI-PMH data model

Page 26: OAI Overview

OAI-PMH Data Model
[Figure: a resource (David); an item holding all available metadata about David; records in Dublin Core, MARC, and SPECTRUM metadata formats]
item = identifier
record = identifier + metadata format + datestamp
set-membership is item-level property
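One way to read the model as data structures, offered as a sketch and not as the specification's own definition: an item is keyed by its identifier and carries set membership, while a record is the combination of identifier, metadata format, and datestamp. The class names, field names, and sample values are assumptions made for this example.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Record:
    identifier: str        # shared with the item
    metadata_prefix: str   # e.g. "oai_dc"
    datestamp: str


@dataclass
class Item:
    identifier: str
    sets: set = field(default_factory=set)        # set membership is item-level
    metadata: dict = field(default_factory=dict)  # prefix -> (datestamp, payload)

    def record(self, prefix):
        """Return the record for one metadata format of this item."""
        datestamp, _payload = self.metadata[prefix]
        return Record(self.identifier, prefix, datestamp)


if __name__ == "__main__":
    # Hypothetical identifier and payloads.
    david = Item("oai:example:david", {"sculpture"},
                 {"oai_dc": ("2003-02-03", "<dc/>"),
                  "marc": ("2003-01-20", "<marc/>")})
    print(david.record("oai_dc"))
    print(david.record("marc"))
```

Both records share one identifier and one set membership, but each has its own datestamp.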

Page 27: OAI Overview

Typical Values
• repository
  – collection of publications
• resource
  – scholarly publication
• item
  – all metadata (DC + MARC)
• record
  – a single metadata format
• datestamp
  – last update / addition of a record
• metadata format
  – bibliographic metadata format
• set
  – originating institution or subject categories

Page 28: OAI Overview

Repositories…

• Stretching the idea of a repository a bit:
  – contextually sensitive repositories
    • “personalization for harvesters”
    • communication between strangers, or communication between friends?
  – OAI-PMH for individual complex objects?
    • OAI-PMH without MySQL?!
      – Fedora, Multi-valent documents, buckets
      – tar, jar, zip, etc. files

Page 29: OAI Overview

Resource

• What if resource were:
  – computer system status
    • uptime, who, w, df, ps, etc.
  – or generalized “system” status
    • e.g., sports league standings
  – people
    • personnel databases
    • authority files for authors

Page 30: OAI Overview

Item

• What if item were:
  – software
    • union of versions + formats
  – all forms of metadata
    • administrative + structural
    • citations, annotations, reviews, etc.
  – data
    • e.g., newsfeeds and other XML expressible content
  – metadataPrefixes or sets could be defined to be different versions

Page 31: OAI Overview

Record

• What if record were:
  – specific software instantiations / updates
  – access / retrieval logs for DLs (or computer systems)
  – push / pull model inversion
    • put a harvester on the client behind a firewall, the client contacts a DP and receives “instructions” on how to submit the desired document (e.g., send email to a specified address)

Page 32: OAI Overview

Datestamp

• semantics of datestamp are strongly influenced by the choice of resource / item / record / metadataPrefix, but it could be used to:
  – signify change of set membership (e.g., workflow: item moves from “submitted” to “approved”)
  – change datestamp to reflect access to the DP
    • e.g., in conjunction with metadataPrefixes of “accessed” or “mirrored”

Page 33: OAI Overview

metadataPrefix

• what if metadataPrefix were:
  – instructions for extracting / archiving / scraping the resource
    • verb=ListRecords&metadataPrefix=extract_TIFFs
  – code fragments to run locally
    • (harvested from a trusted source!)
  – XSLT for other metadataPrefixes
    • branding container is at the repository-level, this could be record- or item-level

Page 34: OAI Overview

Set
• sets are already used for tunneling OAI-PMH extensions (see Suleman & Fox, D-Lib 7(12))
• other uses:
  – in aggregators, automatically create 1 set per baseURL
  – have “hidden” sets (or metadataPrefix) that have administrative or community-specific values (or triggers)
    • set=accessed>1000&from=2001-01-01
    • set=harvestMeWithTheseARGS&until=2002-05-05&metadataPrefix=oai_marc

Page 35: OAI Overview

Interesting Services

• DP9
  – gateway to expose repository contents in HTML suitable for web crawlers
• Celestial
  – OAI “cache”, also 1.1 -> 2.0 converter
• Static (mini-) repositories
  – XML files, based on OLAC work
• OpenURL metadata format registries
  – record = metadata format