Advanced Overview of Version 2.0 of the Open Archives Initiative Protocol for Metadata Harvesting
Michael L. Nelson
Old Dominion University
Norfolk VA
mln@cs.odu.edu
Herbert Van de Sompel
Los Alamos National Laboratory
Los Alamos NM
herbertv@lanl.gov
Simeon Warner
Cornell University
Ithaca NY
sim
ACM/IEEE Joint Conference on Digital Libraries
Tutorial 5
Portland, Oregon
14:00 - 17:30
July 14 2002
Scope and Focus
•This Tutorial is not…
–an introduction to OAI-PMH
–a listing of all the wonderful projects that use OAI-PMH
–a discussion of the merits of metadata harvesting vs. distributed searching
•A passing familiarity is assumed for:
–OAI-PMH 1.x
–Dublin Core
Outline
•How 2.0 evolved from SFC and 1.x
–people, processes, events
•What’s new in 2.0
–comparison with 1.x
•Guidelines, recommendations, best practices for 2.0 implementations
–harvesters, repositories, aggregators, optional containers
•Novel applications of OAI-PMH
Spoiler: What’s New in 2.0?!
•Good news: OAI-PMH is still Six Verbs + DC
•Incremental improvements
–single XML schema
–ambiguities removed
–more expressive options
–cleaner separation of roles & responsibilities
•Bad news: not backwards compatible with 1.1
from Santa Fe Convention to OAI-PMH v.2.0
            Santa Fe convention    OAI-PMH v.1.0/1.1          OAI-PMH v.2.0
nature      experimental           experimental               stable
verbs       Dienst                 OAI-PMH                    OAI-PMH
requests    HTTP GET/POST          HTTP GET/POST              HTTP GET/POST
responses   XML                    XML                        XML
transport   HTTP                   HTTP                       HTTP
metadata    OAMS                   unqualified Dublin Core    unqualified Dublin Core
about       eprints                document like objects      resources
model       metadata harvesting    metadata harvesting        metadata harvesting
Santa Fe Convention [02/2000]
• goal: optimize discovery of e-prints
• input:
  • the UPS prototype
  • RePEc / SODA “data provider / service provider model”
  • Dienst protocol
  • deliberations at Santa Fe meeting [10/99]
OAI-PMH v.1.0 [01/2001]
• goal: optimize discovery of document-like objects
• input:
  • SFC
  • DLF meetings on metadata harvesting
  • deliberations at Cornell meeting [09/00]
  • alpha test group of OAI-PMH v.1.0
• low-barrier interoperability specification
• metadata harvesting model: data provider / service provider
• focus on document-like objects
• autonomous protocol
• HTTP based
• XML responses
• unqualified Dublin Core
• experimental: 12-18 months
Selected Pre-2.0 OAI Highlights
•October 21-22, 1999 - initial UPS meeting
•February 15, 2000 - Santa Fe Convention published in D-Lib Magazine
–precursor to the OAI metadata harvesting protocol
•June 3, 2000 - workshop at ACM DL 2000 (Texas)
•August 25, 2000 - OAI steering committee formed, DLF/CNI support
•September 7-8, 2000 - technical meeting at Cornell University
–defined the core of the current OAI metadata harvesting protocol
•September 21, 2000 - workshop at ECDL 2000 (Portugal)
•November 1, 2000 - Alpha test group announced (~15 organizations)
•January 23, 2001 - OAI protocol 1.0 announced, OAI Open Day in the U.S. (Washington DC)
–purpose: freeze protocol for 12-16 months, generate critical mass
•February 26, 2001 - OAI Open Day in Europe (Berlin)
•July 3, 2001 - OAI protocol 1.1 announced
–to reflect changes in the W3C’s latest XML Schema recommendation
•September 8, 2001 - workshop at ECDL 2001 (Darmstadt)
OAI-PMH v.2.0 [06/2002]
• goal: recurrent exchange of metadata about resources between systems
• input:
  • OAI-PMH v.1.0
  • feedback on OAI-implementers
  • deliberations by OAI-tech [09/01 - 06/02]
  • alpha test group of OAI-PMH v.2.0 [03/02 - 06/02]
• officially released June 14, 2002
• low-barrier interoperability specification
• metadata harvesting model: data provider / service provider
• metadata about resources
• autonomous protocol
• HTTP based
• XML responses
• unqualified Dublin Core
• stable
releasing OAI-PMH v.2.0
• creation of OAI-tech
• pre-alpha phase
• alpha phase
• beta phase
creation of OAI-tech [06/01]
• created for 1 year period
• charge:
  • review functionality and nature of OAI-PMH v.1.0
  • investigate extensions
  • release stable version of OAI-PMH by 05/02
  • determine need for infrastructure to support broad adoption of the protocol
• communication: listserv, SourceForge, conference calls
OAI-tech
US representatives
Thomas Krichel (Long Island U) - Jeff Young (OCLC) - Tim Cole (U of Illinois at Urbana Champaign) - Hussein Suleman (Virginia Tech) - Simeon Warner (Cornell U) - Michael Nelson (NASA) - Caroline Arms (LoC) - Mohammad Zubair (Old Dominion U) - Steven Bird (U Penn.)
European representatives
Andy Powell (Bath U. & UKOLN) - Mogens Sandfaer (DTV) - Thomas Baron (CERN) - Les Carr (U of Southampton)
pre-alpha phase [09/01 – 02/02]
• review process by OAI-tech:
  • identification of issues
  • conference call to filter/combine issues
  • white paper per issue
  • on-line discussion per white paper
  • proposal for resolution of issue by OAI-exec
  • discussion of proposal & closure of issue
  • conference call to resolve open issues
pre-alpha phase [02/02]
• creation of revised protocol document
  • in-person meeting Lagoze - Van de Sompel - Nelson - Warner
  • autonomous decisions
  • internal vetting of protocol document
alpha phase [02/02 – 05/02]
• alpha-1 release to OAI-tech March 1st 2002
• OAI-tech extended with alpha testers
• discussions/implementations by OAI-tech
• ongoing revision of protocol document
OAI-PMH 2.0 alpha testers (1/2)
• The British Library
• Cornell U. -- NSDL project & e-print arXiv
• Ex Libris
• FS Consulting Inc -- harvester for my.OAI
• Humboldt-Universität zu Berlin
• InQuirion Pty Ltd, RMIT University
• Library of Congress
• NASA
• OCLC
OAI-PMH 2.0 alpha testers (2/2)
• Old Dominion U. -- ARC, DP9
• U. of Illinois at Urbana-Champaign
• U. of Southampton -- OAIA (now Celestial), CiteBase, eprints.org
• UCLA, Johns Hopkins U., Indiana U., NYU -- sheet music collection
• UKOLN, U. of Bath -- RDN
• Virginia Tech -- repository explorer
beta phase [05/02-06/02]
• beta release on May 1st 2002 to:
  • registered data providers and service providers
  • interested parties
• fine tuning of protocol document
• preparation for the release of 2.0 conformant tools by alpha testers
what’s new in OAI-PMH v.2.0
• quick recap
• general changes to improve solidity of protocol
• corrections
• new functionality
Overview of OAI-PMH Verbs
Verb                   Function
Identify               description of archive
ListMetadataFormats    metadata formats supported by archive
ListSets               sets defined by archive
ListIdentifiers        OAI unique ids contained in archive
ListRecords            listing of N records
GetRecord              listing of a single record
• Identify, ListMetadataFormats, ListSets: metadata about the repository
• ListIdentifiers, ListRecords, GetRecord: harvesting verbs
• most verbs take arguments: dates, sets, ids, metadata formats and resumption token (for flow control)
general changes
protocol vs periphery
• clear distinction between protocol and periphery
• fixed protocol document
• extensible implementation guidelines:
  • e.g. sample metadata formats, description containers, about containers
  • allows for OAI guidelines and community guidelines
OAI-PMH vs HTTP
• clear separation of OAI-PMH and HTTP
• OAI-PMH error handling
  • all OK at HTTP level? => 200 OK
  • something wrong at OAI-PMH level? => OAI-PMH error (e.g. badVerb)
• http codes 302, 503, etc. still available to implementers, but no longer represent OAI-PMH events
resource – item – record
[Figure: a resource (David); an item carrying all available metadata about the resource; records of that item in individual metadata formats (Dublin Core metadata, MARC metadata, SPECTRUM metadata)]
• item = identifier
• record = identifier + metadata format + datestamp
• set-membership is an item-level property
other general changes
• better definitions of harvester, repository, item, unique identifier, record, set, selective harvesting
• oai_dc schema builds on DCMI XML Schema for unqualified Dublin Core
• usage of must, must not etc. as in RFC2119
• wording on response compression
other general changes
• all protocol responses can be validated with a single XML Schema
  • easier for data providers
  • no redundancy in type definitions
  • SOAP-ready
  • clean for error handling
response no errors
<?xml version="1.0" encoding="UTF-8"?>
<OAI-PMH>
  <responseDate>2002-02-08T08:55:46Z</responseDate>
  <request verb="GetRecord" … …>http://arXiv.org/oai2</request>
  <GetRecord>
    <record>
      <header>
        <identifier>oai:arXiv:cs/0112017</identifier>
        <datestamp>2001-12-14</datestamp>
        <setSpec>cs</setSpec>
        <setSpec>math</setSpec>
      </header>
      <metadata>
        …..
      </metadata>
    </record>
  </GetRecord>
</OAI-PMH>
note: no http encoding of the OAI-PMH request
response with error
<?xml version="1.0" encoding="UTF-8"?>
<OAI-PMH>
  <responseDate>2002-02-08T08:55:46Z</responseDate>
  <request>http://arXiv.org/oai2</request>
  <error code="badVerb">ShowMe is not a valid OAI-PMH verb</error>
</OAI-PMH>
with errors, only the correct attributes are echoed in <request>
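A harvester has to look for <error> elements in the body, because an OAI-PMH level problem still arrives with HTTP 200. The following is a minimal sketch of that check, assuming Python's standard xml.etree.ElementTree; the function names oai_errors and local_name are illustrative and not part of the protocol.

```python
import xml.etree.ElementTree as ET


def local_name(tag):
    """Strip an XML namespace, if present: '{ns}error' -> 'error'."""
    return tag.rsplit("}", 1)[-1]


def oai_errors(response_xml):
    """Return (code, message) pairs for each <error> child of the response root."""
    root = ET.fromstring(response_xml)
    return [
        (child.get("code"), (child.text or "").strip())
        for child in root
        if local_name(child.tag) == "error"
    ]


# The error response shown on the slide (responseDate written as a full UTC timestamp).
SAMPLE = b"""<?xml version="1.0" encoding="UTF-8"?>
<OAI-PMH>
  <responseDate>2002-02-08T08:55:46Z</responseDate>
  <request>http://arXiv.org/oai2</request>
  <error code="badVerb">ShowMe is not a valid OAI-PMH verb</error>
</OAI-PMH>"""

print(oai_errors(SAMPLE))
```

An empty list means the response carries no OAI-PMH error and the verb element can be processed.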
corrections
dates/times
• all dates/times are UTC, encoded in ISO8601, Z-notation
  1957-03-20T20:30:00Z
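As an illustration of the Z-notation, the sketch below converts a time-zone-aware timestamp to UTC and formats it at either of the two 2.0 granularities. It assumes Python's datetime module; the helper name to_oai_datestamp is hypothetical.

```python
from datetime import datetime, timedelta, timezone


def to_oai_datestamp(moment, granularity="YYYY-MM-DDThh:mm:ssZ"):
    """Format an aware datetime as a UTC datestamp in ISO8601 Z-notation."""
    if moment.tzinfo is None:
        # Without a time zone the conversion to UTC would be a guess.
        raise ValueError("naive datetime: time zone unknown")
    utc = moment.astimezone(timezone.utc)
    if granularity == "YYYY-MM-DD":
        return utc.strftime("%Y-%m-%d")
    return utc.strftime("%Y-%m-%dT%H:%M:%SZ")


# 15:30 at UTC-5 is 20:30 UTC, the example value on the slide.
local = datetime(1957, 3, 20, 15, 30, 0, tzinfo=timezone(timedelta(hours=-5)))
print(to_oai_datestamp(local))
```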
resumptionToken
• idempotency of resumptionToken: return same incomplete list when rT is reissued
  • while no changes occur in the repo: strict
  • while changes occur in the repo: all items with unchanged datestamp
• new, optional attributes for the resumptionToken:
  • expirationDate
  • completeListSize
  • cursor
noRecordsMatch
•1.x - if no records match, an empty list was returned
•2.0 - if no records match, the error condition noRecordsMatch is returned -- not an empty list
new functionality
harvesting granularity
• mandatory support of YYYY-MM-DD
• optional support of YYYY-MM-DDThh:mm:ssZ
• other granularities considered, but ultimately rejected
• granularity of from and until must be the same
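The granularity rules can be checked mechanically. The sketch below shows one way a repository might vet from and until before answering; the function names and the returned messages are assumptions for illustration. A repository would presumably report any problem as a badArgument error.

```python
import re

DAY = "YYYY-MM-DD"
SECONDS = "YYYY-MM-DDThh:mm:ssZ"
_PATTERNS = {
    DAY: re.compile(r"\d{4}-\d{2}-\d{2}"),
    SECONDS: re.compile(r"\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}Z"),
}


def granularity_of(datestamp):
    """Return the granularity name a datestamp is written in, or None."""
    for name, pattern in _PATTERNS.items():
        if pattern.fullmatch(datestamp):
            return name
    return None


def range_problem(from_=None, until=None, supported=DAY):
    """Return None if from/until are acceptable, else a short reason string."""
    found = {}
    for name, value in (("from", from_), ("until", until)):
        if value is None:
            continue
        found[name] = granularity_of(value)
        if found[name] is None:
            return f"{name} is not a legal datestamp"
        if found[name] == SECONDS and supported == DAY:
            return f"{name} is finer than the repository granularity"
    if len(set(found.values())) > 1:
        return "from and until have different granularities"
    return None
```

The check covers only the syntactic form of the datestamps; it does not verify that they are valid calendar dates.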
Identify
• Identify more expressive
<Identify>
  <repositoryName>Library of Congress 1</repositoryName>
  <baseURL>http://memory.loc.gov/cgi-bin/oai</baseURL>
  <protocolVersion>2.0</protocolVersion>
  <adminEmail>[email protected]</adminEmail>
  <adminEmail>[email protected]</adminEmail>
  <deletedRecord>transient</deletedRecord>
  <earliestDatestamp>1990-02-01T00:00:00Z</earliestDatestamp>
  <granularity>YYYY-MM-DDThh:mm:ssZ</granularity>
  <compression>deflate</compression>
header
• header contains set membership of item
<record>
  <header>
    <identifier>oai:arXiv:cs/0112017</identifier>
    <datestamp>2001-12-14</datestamp>
    <setSpec>cs</setSpec>
    <setSpec>math</setSpec>
  </header>
  <metadata>
    …..
  </metadata>
</record>
eliminates the need for the “double harvest” 1.x required to get all records and all set information
ListIdentifiers
• ListIdentifiers returns headers
<?xml version="1.0" encoding="UTF-8"?>
<OAI-PMH>
  <responseDate>2002-02-08T08:55:46Z</responseDate>
  <request verb="…" …>http://arXiv.org/oai2</request>
  <ListIdentifiers>
    <header>
      <identifier>oai:arXiv:hep-th/9801001</identifier>
      <datestamp>1999-02-23</datestamp>
      <setSpec>physic:hep</setSpec>
    </header>
    <header>
      <identifier>oai:arXiv:hep-th/9801002</identifier>
      <datestamp>1999-03-20</datestamp>
      <setSpec>physic:hep</setSpec>
      <setSpec>physic:exp</setSpec>
    </header>
    ……
ListIdentifiers
• ListIdentifiers mandates metadataPrefix as argument
http://www.perseus.tufts.edu/cgi-bin/pdataprov?
verb=ListIdentifiers
&metadataPrefix=olac
&from=2001-01-01
&until=2001-01-01
&set=Perseus:collection:PersInfo
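A request like the one above can be assembled programmatically. The following sketch assumes Python's urllib.parse; the helper name list_identifiers_url is hypothetical. Note that urlencode percent-encodes the ":" in the set name, which is one way of keeping the argument URL-safe.

```python
from urllib.parse import urlencode


def list_identifiers_url(base_url, metadata_prefix, from_=None, until=None, set_spec=None):
    """Build a 2.0 ListIdentifiers request URL; metadataPrefix is required."""
    if not metadata_prefix:
        raise ValueError("metadataPrefix is required for ListIdentifiers in 2.0")
    params = [("verb", "ListIdentifiers"), ("metadataPrefix", metadata_prefix)]
    for name, value in (("from", from_), ("until", until), ("set", set_spec)):
        if value is not None:
            params.append((name, value))
    return base_url + "?" + urlencode(params)


print(list_identifiers_url(
    "http://www.perseus.tufts.edu/cgi-bin/pdataprov",
    "olac",
    from_="2001-01-01",
    until="2001-01-01",
    set_spec="Perseus:collection:PersInfo",
))
```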
ListIdentifiers
•the changes to ListIdentifiers are subtle, and reflect a change in the OAI-PMH data model
•Could have been named “ListHeaders” or reduced to an option for ListRecords
–“ListIdentifiers” kept for lexicographical consistency
metadataPrefix
• character set for metadataPrefix and setSpec extended to URL-safe characters
A-Z a-z 0-9 _ ! ‘ $ ( ) + - . *
in the periphery
provenance
• introduction of provenance container to facilitate tracing of harvesting history
<about>
  <provenance>
    <originDescription>
      <baseURL>http://an.oa.org</baseURL>
      <identifier>oai:r1:plog/9801001</identifier>
      <datestamp>2001-08-13T13:00:02Z</datestamp>
      <metadataPrefix>oai_dc</metadataPrefix>
      <harvestDate>2001-08-15T12:01:30Z</harvestDate>
      <originDescription>
        … … …
      </originDescription>
    </originDescription>
  </provenance>
</about>
friends
• introduction of friends container to facilitate discovery of repositories
<description>
  <friends>
    <baseURL>http://cav2001.library.caltech.edu/perl/oai</baseURL>
    <baseURL>http://formations2.ulst.ac.uk/perl/oai</baseURL>
    <baseURL>http://cogprints.soton.ac.uk/perl/oai</baseURL>
    <baseURL>http://wave.ldc.upenn.edu/OLAC/dp/aps.php4</baseURL>
  </friends>
</description>
branding
• introduction of branding container for DPs to suggest rendering & association hints
<branding xmlns="http://www.openarchives.org/OAI/2.0/branding/"
    xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
    xsi:schemaLocation="http://www.openarchives.org/OAI/2.0/branding/
    http://www.openarchives.org/OAI/2.0/branding.xsd">
  <collectionIcon>
    <url>http://my.site/icon.png</url>
    <link>http://my.site/homepage.html</link>
    <title>MySite(tm)</title>
    <width>88</width>
    <height>31</height>
  </collectionIcon>
  <metadataRendering
      metadataNamespace="http://www.openarchives.org/OAI/2.0/oai_dc/"
      mimeType="text/xsl">http://some.where/DCrender.xsl</metadataRendering>
  <metadataRendering
      metadataNamespace="http://another.place/MARC"
      mimeType="text/css">http://another.place/MARCrender.css</metadataRendering>
</branding>
oai-identifier
• revision of oai-identifier
• domain based repository names
<description>
  <oai-identifier xmlns="http://www.openarchives.org/OAI/2.0/oai-identifier"
      xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
      xsi:schemaLocation="http://www.openarchives.org/OAI/2.0/oai-identifier
      http://www.openarchives.org/OAI/2.0/oai-identifier.xsd">
    <scheme>oai</scheme>
    <repositoryIdentifier>oai-stuff.foo.org</repositoryIdentifier>
    <delimiter>:</delimiter>
    <sampleIdentifier>oai:oai-stuff.foo.org:5324</sampleIdentifier>
  </oai-identifier>
</description>
oai_dc
•OAI 1.x: oai_dc Schema defined by OAI
•OAI 2.0: oai_dc Schema imports from DCMI Schema for unqualified DC elements
MARC21
•OAI 1.x: oai_marc
•OAI 2.0: LoC marcxml, oai_marc
–http://www.loc.gov/standards/marcxml/
did not make it into OAI-PMH v.2.0
•SOAP implementation
•Result set filtering
•Multiple / “best” metadata
•GetRecord -> GetRecords
•Machine readable rights management
•XML format for “mini-archives”
Detailed Review of the OAI-PMH 2.0 Verbs
Identify
1.1
•Arguments
–none
•Errors
–none
2.0
•Arguments
–none
•Errors
–badArgument
ListMetadataFormats
1.1
•Arguments
–identifier (OPTIONAL)
•Errors
–id does not exist
2.0
•Arguments
–identifier (OPTIONAL)
•Errors
–badArgument
–noMetadataFormats
–idDoesNotExist
ListSets
1.1
•Arguments
–resumptionToken (EXCLUSIVE)
•Errors
–no set hierarchy
2.0
•Arguments
–resumptionToken (EXCLUSIVE)
•Errors
–badArgument
–badResumptionToken
–noSetHierarchy
ListIdentifiers
1.1
•Arguments
–from (OPTIONAL)
–until (OPTIONAL)
–set (OPTIONAL)
–resumptionToken (EXCLUSIVE)
•Errors
–no records match
2.0
•Arguments
–from (OPTIONAL)
–until (OPTIONAL)
–set (OPTIONAL)
–resumptionToken (EXCLUSIVE)
–metadataPrefix (REQUIRED)
•Errors
–badArgument
–cannotDisseminateFormat
–badResumptionToken
–noSetHierarchy
–noRecordsMatch
ListRecords
1.1
•Arguments
–from (OPTIONAL)
–until (OPTIONAL)
–set (OPTIONAL)
–resumptionToken (EXCLUSIVE)
–metadataPrefix (REQUIRED)
•Errors
–no records match
–metadata format cannot be disseminated
2.0
•Arguments
–from (OPTIONAL)
–until (OPTIONAL)
–set (OPTIONAL)
–resumptionToken (EXCLUSIVE)
–metadataPrefix (REQUIRED)
•Errors
–noRecordsMatch
–cannotDisseminateFormat
–badResumptionToken
–noSetHierarchy
–badArgument
GetRecord
1.1
•Arguments
–identifier (REQUIRED)
–metadataPrefix (REQUIRED)
•Errors
–id does not exist
–metadata format cannot be disseminated
2.0
•Arguments
–identifier (REQUIRED)
–metadataPrefix (REQUIRED)
•Errors
–badArgument
–cannotDisseminateFormat
–idDoesNotExist
Argument Summary
Verb                  metadataPrefix  from      until     set       resumptionToken  identifier
Identify              -               -         -         -         -                -
ListMetadataFormats   -               -         -         -         -                optional
ListSets              -               -         -         -         exclusive        -
ListIdentifiers       required        optional  optional  optional  exclusive        -
ListRecords           required        optional  optional  optional  exclusive        -
GetRecord             required        -         -         -         -                required
("-" = argument not allowed)
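Error Summary
Verb                  Errors
Identify              BA
ListMetadataFormats   BA, NMF, IDDNE
ListSets              BA, BRT, NSH
ListIdentifiers       BA, BRT, CDF, NRM, NSH
ListRecords           BA, BRT, CDF, NRM, NSH
GetRecord             BA, CDF, IDDNE
(BA = badArgument, BRT = badResumptionToken, CDF = cannotDisseminateFormat, IDDNE = idDoesNotExist, NMF = noMetadataFormats, NRM = noRecordsMatch, NSH = noSetHierarchy)
Generate badVerb on any input not matching the 6 defined verbs
this is an inversion of the table in section 3.6 of the OAI-PMH specification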
Repository Implementation Guidelines
Minimal Repository
•2.0 provides many expressive, but optional features
–but still low barrier!
•if you are writing your own repository software, the quickest path to implementation can involve initially:
–only supporting DC
–skipping: <about>, sets, compression
–skipping flow control (resumptionTokens) if < 1000 items
•add optional features as requirements and familiarity allow
Be Honest with datestamp!
•a change in the process of dynamic generation of a metadata format really does mean all records have been updated!
if (internalItemDatestamp > disseminationInterfaceDatestamp) {
datestamp = internalItemDatestamp
} else {
datestamp = disseminationInterfaceDatestamp
}
Not Hiding Updates
•OAI-PMH is designed to allow incremental harvesting
•Updates must be available by the end of the period of the datestamp assigned, i.e.
–Day granularity => during same day
–Seconds granularity => during same second
•Reason: harvesters need to overlap requests by just one datestamp interval (one day or one second)
–in 1.x, 2 intervals were required (in many circumstances)
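One reading of the overlap rule, sketched from the harvester's side: take the responseDate of the previous harvest and step back one interval at the repository's granularity to obtain the next from value. The function name next_from is hypothetical, and the choice of responseDate as the starting point is an assumption of this sketch.

```python
from datetime import datetime, timedelta

# Output format and size of one datestamp interval for each granularity.
_GRANULARITIES = {
    "YYYY-MM-DD": ("%Y-%m-%d", timedelta(days=1)),
    "YYYY-MM-DDThh:mm:ssZ": ("%Y-%m-%dT%H:%M:%SZ", timedelta(seconds=1)),
}


def next_from(previous_response_date, granularity):
    """Return the 'from' value for the next incremental harvest.

    previous_response_date is the UTC responseDate of the last harvest,
    e.g. '2002-06-14T00:00:05Z'; the result overlaps it by one interval.
    """
    moment = datetime.strptime(previous_response_date, "%Y-%m-%dT%H:%M:%SZ")
    fmt, step = _GRANULARITIES[granularity]
    return (moment - step).strftime(fmt)


print(next_from("2002-06-14T00:00:05Z", "YYYY-MM-DD"))
print(next_from("2002-06-14T00:00:00Z", "YYYY-MM-DDThh:mm:ssZ"))
```

Records in the overlapping interval may be delivered twice, so the harvester should treat re-received records as updates rather than as new items.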
State in resumptionTokens
•HTTP is stateless
•resumptionTokens allow state information to be passed back to the repository to create a complete list from sequence of incomplete lists
•EITHER – all state in resumptionToken
•OR – cache result set in repository
Caching the Result Set
•Repository caches results of initial request, returns only incomplete list
•resumptionToken does not contain all state information, it includes:
–a session id
–offset information, necessary for idempotency
•resumptionToken allows repository to return next incomplete list
•increased complexity due to cache management
–but a potential performance win
All State in the resumptionToken
•Arrange that remaining items/headers in complete list response can be specified with a new query and encode that in resumptionToken
•One simple approach is to return items/headers in id order and make the new query specify the same parameters and the last id returned (or by date)
–simple to implement, but possibly longer execution times
•Can encode parameters very simply:
<resumptionToken>metadataPrefix=oai_dc&
from=1999-02-03&until=2002-04-01&
lastid=fghy:45:123</resumptionToken>
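The all-state approach can be sketched as a pair of encode/decode helpers plus a paging function over identifiers kept in id order. This is an illustration under assumptions: the token layout mirrors the slide's example but is percent-encoded by urlencode, and the names encode_token, decode_token and next_chunk are hypothetical.

```python
from urllib.parse import parse_qsl, urlencode


def encode_token(metadata_prefix, last_id, from_=None, until=None, set_spec=None):
    """Pack the original query plus the last id returned into one token string."""
    state = {
        "metadataPrefix": metadata_prefix,
        "from": from_,
        "until": until,
        "set": set_spec,
        "lastid": last_id,
    }
    return urlencode({k: v for k, v in state.items() if v is not None})


def decode_token(token):
    """Unpack a token; raise ValueError for anything we could not have issued."""
    state = dict(parse_qsl(token, strict_parsing=True))
    if "metadataPrefix" not in state or "lastid" not in state:
        raise ValueError("badResumptionToken")
    return state


def next_chunk(sorted_ids, last_id, size):
    """Return the next `size` ids after last_id and whether more remain."""
    remaining = [i for i in sorted_ids if last_id is None or i > last_id]
    return remaining[:size], len(remaining) > size


token = encode_token("oai_dc", "fghy:45:123", from_="1999-02-03", until="2002-04-01")
print(token)
print(decode_token(token)["lastid"])
```

Because the token is the whole state, re-issuing it yields the same query again, which is what makes this variant naturally idempotent as long as the underlying items are unchanged.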
resumptionToken attributes (1)
•expirationDate – likely to be useful when cache clean-up schedule is known
–Do not specify expirationDate if all state in resumptionToken
•badResumptionToken error to be used if resumptionToken expired
–May also be used if request cannot be completed for some other reason
•e.g.: if repository changes cause the incomplete list to have no records
resumptionToken attributes (2)
•completeListSize and cursor optionally provide information about size of complete list and number of records so far disseminated
–use consistently if used
–designed for status monitoring
–caveat harvester: completeListSize may be approximate and may be revised
resumptionToken
The only defined use of resumptionToken is as follows:
•a repository must include a resumptionToken element as part of each response that includes an incomplete list;
•in order to retrieve the next portion of the complete list, the next request must use the value of that resumptionToken element as the value of the resumptionToken argument of the request;
•the response containing the incomplete list that completes the list must include an empty resumptionToken element;
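From the harvester's side, the three rules above amount to a simple loop. The sketch below assumes a caller-supplied fetch function (hypothetical) that issues one request and returns the items of the incomplete list together with the resumptionToken value: None when the element is absent, an empty string when it is empty.

```python
def harvest_complete_list(fetch, first_request):
    """Follow resumptionTokens until the complete list has been assembled.

    fetch(params) -> (items, token); token is None or '' when the list is done.
    """
    items, token = fetch(first_request)
    collected = list(items)
    while token:
        # resumptionToken is an exclusive argument: send only the verb with it.
        follow_up = {"verb": first_request["verb"], "resumptionToken": token}
        items, token = fetch(follow_up)
        collected.extend(items)
    return collected
```

A real harvester would wrap fetch with the HTTP handling (503 Retry-After, 302 redirects) described elsewhere in the guidelines; that is deliberately left out here.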
Flow Control & Load Balancing
How to respond to a “bad” harvester:
1. HTTP status code 200; response to OAI-PMH request with a resumptionToken.
2. HTTP status code 503; with the Retry-After header set to an appropriate value if subsequent request follows too quickly or if the server is heavily loaded.
3. HTTP status code 403; with an appropriate reason specified if subsequent requests do not adhere to Retry-After delays.
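The escalation from 200 to 503 to 403 can be sketched as a small decision function on the repository side. Everything numeric here is an assumption: the minimum interval, the Retry-After value and the per-client bookkeeping (last request time, end of the current Retry-After window) are hypothetical and would be tuned per site.

```python
def throttle(last_request_time, retry_until, now, min_interval=10, retry_after=60):
    """Decide how to answer a harvester request.

    Times are seconds (e.g. from time.time()). Returns a tuple
    (http_status, extra_headers, new_retry_until).
    """
    if retry_until is not None and now < retry_until:
        # The client was told to wait and came back early: refuse.
        return 403, {}, retry_until
    if last_request_time is not None and now - last_request_time < min_interval:
        # Too quick: ask the client to come back later.
        return 503, {"Retry-After": str(retry_after)}, now + retry_after
    # Well-behaved: answer normally (possibly with a resumptionToken).
    return 200, {}, None


print(throttle(None, None, now=1000))
print(throttle(998, None, now=1000))
print(throttle(1000, 1060, now=1005))
```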
302 Load Balancing
•Interactive users on main DL machine should not be impacted by metadata harvesting
–don’t take deliveries through the front door
–not part of the protocol; defined outside the protocol
[Figure: the harvester sends http://blah/oai/?verb=ListIdentifiers&metadataPrefix=oai_dc to the OAI server at naca.larc.nasa.gov/oai/; if load > 0.50 that server redirects the request with HTTP Status Code 302 to the OAI server at buckets.dsi.internet2.edu/naca/oai/, which returns the <ListIdentifiers> … </ListIdentifiers> response]
DNS Load Balancing
•using a DNS rotor, establish
–a.foo.org, b.foo.org, c.foo.org
–each with a synchronized copy of the repository
–let DNS & chance distribute the load
Load Balancing Caveats
•Copies of the repository must be synchronized
–(cf. Pande, et al. JCDL 02)
•Complex hierarchies are possible
–programmer must ensure no cycles in redirection graphs!
•The baseURL in the reply must always point to the original repository, not the repository that eventually answered the request
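Checking a redirection configuration for cycles is straightforward when each server redirects to at most one other server. The sketch below assumes the configuration is available as a mapping from base URL to redirect target; that representation and the name find_redirect_cycle are hypothetical.

```python
def find_redirect_cycle(redirects, start):
    """Follow redirects from `start`; return the cycle as a list, or None.

    redirects maps a base URL to the base URL it may redirect to.
    """
    seen = []
    current = start
    while current in redirects:
        if current in seen:
            # We came back to a server already visited: report the loop.
            return seen[seen.index(current):] + [current]
        seen.append(current)
        current = redirects[current]
    return None


ok = {"http://naca.larc.nasa.gov/oai/": "http://buckets.dsi.internet2.edu/naca/oai/"}
bad = dict(ok)
bad["http://buckets.dsi.internet2.edu/naca/oai/"] = "http://naca.larc.nasa.gov/oai/"

print(find_redirect_cycle(ok, "http://naca.larc.nasa.gov/oai/"))
print(find_redirect_cycle(bad, "http://naca.larc.nasa.gov/oai/"))
```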
Error Handling: Verbosity
More is better…
<error code="badArgument">Illegal argument ‘foo’</error>
<error code="badArgument">Illegal argument ‘bar’</error>
is preferred over:
<error code="badArgument">Illegal arguments ‘foo’, ‘bar’</error>
which is preferred over:
<error code="badArgument">Illegal arguments</error>
Error Handling: Levels
•the OAI-PMH error / exception conditions are for OAI-PMH semantic events
•they are not for situations when:
–the database is down
–a record is malformed
•remember: record = id + datestamp + metadataPrefix
•if you’re missing one of those, you don’t have an OAI record!
–and other conditions that occur outside the OAI scope
•use http codes 500, 503 or other appropriate values to indicate non-OAI problems
Error Handling: Extensions
•Arguments that are not 'required', 'optional' or 'exclusive' are 'illegal' and should generate badArgument errors.
•If you want to extend the OAI-PMH:
–stop and consider: do you really need to?
•maybe you should have different OAI-PMH interfaces, or creative metadata formats
–if you really, really want to, tunnel your extensions through the “set” feature
•see http://www.dlib.org/dlib/december01/suleman/12suleman.html for examples
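The rule on illegal arguments can be sketched as a lookup against the legal arguments of each 2.0 verb, emitting one error per offending argument in the verbose style recommended earlier. The table is taken from the argument summary; the function name check_arguments is hypothetical, and the sketch does not check missing required arguments or the exclusivity of resumptionToken.

```python
# Legal (required, optional or exclusive) arguments per 2.0 verb.
_LIST_ARGS = {"from", "until", "set", "resumptionToken", "metadataPrefix"}
LEGAL_ARGUMENTS = {
    "Identify": set(),
    "ListMetadataFormats": {"identifier"},
    "ListSets": {"resumptionToken"},
    "ListIdentifiers": _LIST_ARGS,
    "ListRecords": _LIST_ARGS,
    "GetRecord": {"identifier", "metadataPrefix"},
}


def check_arguments(params):
    """Return a list of (error code, message) pairs for a request's parameters."""
    verb = params.get("verb")
    if verb not in LEGAL_ARGUMENTS:
        return [("badVerb", f"{verb} is not a valid OAI-PMH verb")]
    illegal = sorted(set(params) - LEGAL_ARGUMENTS[verb] - {"verb"})
    # One error element per illegal argument, rather than one combined message.
    return [("badArgument", f"Illegal argument '{name}'") for name in illegal]


print(check_arguments({"verb": "Identify", "foo": "1", "bar": "2"}))
```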
Idempotency of “List” Requests (1)
•Purpose is to allow harvesters to recover from lost responses or crashes without starting a large harvest from scratch
•Recover by re-issuing request using resumptionToken from previous request
•IMPLICATION: repository must accept both the most recent resumptionToken issued and the previous one
Idempotency of “List” Requests (2)
•response to a re-issued request must contain all unchanged records
•any changed records will get new datestamps after time of initial request
•changes will be picked up by subsequent harvest if not included
[no experience yet with incomplete responses to ListSets or ListMetadataFormats requests]
Case Study: “bucket” based repositories
•Buckets: see Nelson & Maly, CACM 44(5)
•2.0
–LTRS - techreports.larc.nasa.gov/ltrs/oai2.0/ (file system, refer)
–NACA - naca.larc.nasa.gov/oai2.0/ (file system, refer)
•1.1
–LTRS - techreports.larc.nasa.gov/ltrs/oai/
–NACA - naca.larc.nasa.gov/ltrs/oai/
–Open Video - www.open-video.org/oai/ (MySQL, local)
–JTRS - ston.jsc.nasa.gov/collections/TRS/oai (MS Access dump, local)
–GLTRS (filesystem, HTML scraping)
•Characteristics:
–resumptionToken support initially skipped; added later (all)
•highly encoded rT’s: [2001-01-01!!!!301!600]
–sets initially skipped, added later (LTRS)
–initially had load balancing with 2 NACA repositories…
Case Study: “bucket” based repositories
•in bucket terminology:
–6 OAI verbs (methods) added to the existing list of methods
•http://naca.larc.nasa.gov/oai2.0/?method=list_methods
•http://naca.larc.nasa.gov/oai2.0/?method=list_source&target=ListIdentifiers
–a data element is added to the bucket that contains the specifics of the particular repository and its metadata format
•http://naca.larc.nasa.gov/oai2.0/?method=display&pkg_name=oai&element_name=oai.pl
Harvester Implementation Guidelines
Be a Polite OAI Neighbor
•Re-use existing free harvesters rather than writing your own
–http://www.openarchives.org/tools/index.html
•Read http://www.robotstxt.org/wc/robots.html if you insist on writing your own
•Provide meaningful User-Agent & From http headers
–these values can be configured even if using someone else’s harvester
Listen to the Repository!
•Check Identify’s <granularity> element if you wish to use finer than YYYY-MM-DD
•If you harvest with sets, remember that “:” indicates hierarchies
–harvesting “a” will recursively harvest “a:b”, “a:b:c”, and “a:d”
•Check for and handle non-200 http status codes (503, 302, etc.)
•empty resumptionToken => end of complete list
•consider asking for compressed responses if the repository supports them…
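The set hierarchy rule reduces to a prefix test on setSpec values. A minimal sketch follows; the name in_set is hypothetical. It shows which records a harvest restricted to a given set should be expected to include.

```python
def in_set(set_spec, harvested_set):
    """True if a record's setSpec falls under the set being harvested.

    'a' covers 'a' itself and every descendant such as 'a:b' or 'a:b:c',
    but not an unrelated set that merely starts with the same letters.
    """
    return set_spec == harvested_set or set_spec.startswith(harvested_set + ":")


for spec in ["a", "a:b", "a:b:c", "a:d", "ab", "b:a"]:
    print(spec, in_set(spec, "a"))
```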
How to Grab Everything
•Issue an Identify request to find the finest datestamp granularity supported.
•Issue a ListMetadataFormats request to obtain a list of all metadataPrefixes supported.
•Harvest using ListRecords requests for each metadataPrefix supported. Knowledge of the datestamp granularity allows for less overlap if granularities finer than a day are supported.
•Set structure can be inferred from the setSpec elements in the header blocks of each record returned (consistency checks are possible).
•Items may be reconstructed from the constituent records. Local datestamps must be assigned to harvested items.
•Provenance and other information in <about> blocks may be re-assembled at the item level if it is the same for all metadata formats harvested. However, this information may be supplied differently for different metadata formats and may thus need to be stored separately for each metadata format.
Harvesting v1.1 and v2.0
•Not difficult to handle both cases, test with Identify:
–v1.1: <Identify> <protocolVersion>
–v2.0: <OAI-PMH> <Identify> <protocolVersion>
•Note also different error and exception handling
•Many similarities, harvesters can share lots of code
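The version test can be sketched by looking at the root element of the Identify response: a v2.0 response is wrapped in <OAI-PMH>, a v1.1 response has <Identify> as its root. The sketch assumes Python's xml.etree.ElementTree and ignores namespaces; the function names are hypothetical.

```python
import xml.etree.ElementTree as ET


def _local(tag):
    """Drop the namespace part of a tag name."""
    return tag.rsplit("}", 1)[-1]


def _child(node, name):
    """Return the first child with the given local name, or raise ValueError."""
    for candidate in node:
        if _local(candidate.tag) == name:
            return candidate
    raise ValueError(f"missing <{name}> element")


def protocol_version(identify_xml):
    """Return the protocolVersion text of a v1.x or v2.0 Identify response."""
    root = ET.fromstring(identify_xml)
    if _local(root.tag) == "OAI-PMH":      # v2.0 layout
        identify = _child(root, "Identify")
    elif _local(root.tag) == "Identify":   # v1.x layout
        identify = root
    else:
        raise ValueError("not an Identify response")
    return (_child(identify, "protocolVersion").text or "").strip()
```

The returned string can then select the version-specific error handling, while the record parsing code is shared.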
Harvesting Demo
•Harvester written in Perl
•Handles v1.0, v1.1 and v2.0
•Uses LWP, Expat and XML::Parser (no schema validation)
•UTF-8 conditioning to deal with some “imperfect” repositories
•Sequence of requests: Identify, ListMetadataFormats, ListSets then ListRecords/ListIdentifiers
•Support for incremental harvesting, uses responseDate from last harvest to get new start datestamp
Harvesting logs
•Alan Kent’s v2.0 harvester logs: http://www.inquirion.com:8123/public/collList;collListCmd=list
•Alan Kent’s summary of v1.1 harvesting results: http://www.mds.rmit.edu.au/~ajk/oai/interop/summary.htm
•Celestial v1.1 harvesting logs: http://celestial.eprints.org/cgi-bin/status
•DP9 gateway using arc harvested information: http://arc.cs.odu.edu:8080/dp9/index.jsp
NASA <friends> example (1)
•A light weight, DP-centric method to communicate the existence of “others”
http://techreports.larc.nasa.gov/ltrs/oai2.0/?verb=Identify
..
<description>
  <friends ..namespace stuff..>
    <baseURL>http://naca.larc.nasa.gov/oai2.0</baseURL>
    <baseURL>http://ntrs.nasa.gov/oai2.0</baseURL>
    <baseURL>http://horus.riacs.edu/perl/oai/</baseURL>
    <baseURL>http://ston.jsc.nasa.gov/collections/TRS/oai/</baseURL>
  </friends>
</description>
..
NASA <friends> example (2)
[Figure: the harvester sends Identify to http://techreports.larc.nasa.gov/ltrs/oai2.0/; the <friends>…</friends> container in the reply leads it to http://naca.larc.nasa.gov/oai2.0/, http://ntrs.nasa.gov/oai2.0/, http://ston.jsc.nasa.gov/collections/TRS/oai/ and http://horus.riacs.edu/perl/oai/]
Case Study: arXiv (1)
•Existing system, running >10 years. Written mostly in Perl
•Flat file system for ‘database’
•200k papers with metadata in homebrew format
•~200 updates/day. OAI repository just one view of system, must integrate with daily update schedule
Case Study: arXiv (2)
•Write in Perl
–Easy integration with rest of system, reuse code from v1.0/v1.1 interface
–Use libwww; XML::DOM
•Daily rebuild of datestamp database
–No existing date in system appropriate
–Base on Unix cdate of metadata files
•On-the-fly metadata translation
–Straightforward, avoids data duplication
Case Study: arXiv (3)
•Flow control to avoid loading server and to avoid harvesters tripping robot alarms
–resumptionTokens to limit response size (500-1000 records or 5k-10k identifiers / response)
–503 Retry-After replies based on client ip
•Implement resumptionTokens that include all state
–Avoid need to cache result sets / clean cache
Aggregator / Cache / Proxy Implementation Guidelines
<provenance> & datestamps
•reminder: datestamps are local to the repository, and a re-exporting service must use new local datestamps
•such services should use the <provenance> container to preserve the original datestamps
Identifiers are Local
•identifiers are local to the repository
•unless you absolutely did not change the metadata and the identifier corresponds to a recognized URI scheme, use a new identifier upon re-exporting
–use the <provenance> container to preserve the harvesting history
Derived from the same item?
3 ways to determine if records share provenance from the same item:
1. both records have the same identifier and the baseURL in the request elements of the OAI-PMH responses which include the record are the same;
2. both records have the same identifier and that identifier belongs to some recognized URI scheme;
3. the provenance containers of both records have the same entries for both the identifier and baseURL;
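The three tests can be sketched as one predicate over simplified record descriptions. The dictionary layout is an assumption of this sketch: each record carries its identifier, the baseURL echoed in the <request> element of the response it came from, and, optionally, a (baseURL, identifier) pair taken from its provenance container. Which URI schemes count as "recognized" is left to the caller.

```python
def same_item(a, b, recognized_schemes=()):
    """Apply the three provenance tests to two simplified record descriptions."""
    same_id = a["identifier"] == b["identifier"]
    # 1. same identifier, harvested from the same baseURL
    if same_id and a["request_base_url"] == b["request_base_url"]:
        return True
    # 2. same identifier in a recognized URI scheme
    scheme = a["identifier"].split(":", 1)[0]
    if same_id and scheme in recognized_schemes:
        return True
    # 3. provenance containers agree on baseURL and identifier
    prov_a, prov_b = a.get("provenance"), b.get("provenance")
    return prov_a is not None and prov_a == prov_b


origin = ("http://odd.oa.org", "oai:odd.oa.org:z1x2y3")
r1 = {"identifier": "oai:cw.oa.org:z1x2y3",
      "request_base_url": "http://crosswalker.oa.org", "provenance": origin}
r2 = {"identifier": "oai:other.oa.org:42",
      "request_base_url": "http://other.oa.org", "provenance": origin}
print(same_item(r1, r2))
```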
<provenance> example (1)
Consider a request from crosswalker.oa.org:
http://odd.oa.org?verb=GetRecord
&identifier=oai:odd.oa.org:z1x2y3&metadataPrefix=odd_fmt
and the following response from odd.oa.org:
<?xml version="1.0" encoding="UTF-8"?>
<GetRecord ..namespace stuff..>
  <responseDate>2002-02-08T08:55:46.1</responseDate>
  <request verb="GetRecord" metadataPrefix="odd_fmt"
      identifier="oai:odd.oa.org:z1x2y3">http://odd.oa.org</request>
  <record>
    <header>
      <identifier>oai:odd.oa.org:z1x2y3</identifier>
      <datestamp>1999-08-07T06:05:04Z</datestamp>
    </header>
    <metadata> ...metadata record in odd_fmt... </metadata>
  </record>
</GetRecord>
<provenance> example (2)
Imagine that crosswalker.oa.org cross-walks harvested metadata from odd_fmt into oai_marc and then re-exposes the metadata with new identifiers. A request from getmarc.oa.org:
http://crosswalker.oa.org?verb=GetRecord
&identifier=oai:cw.oa.org:z1x2y3
&metadataPrefix=oai_marc
might then yield the following response from crosswalker.oa.org:
<provenance> example (3)
<record>
  <header>
    <identifier>oai:cw.oa.org:z1x2y3</identifier>
    <datestamp>2002-02-09T01:15:43Z</datestamp>
  </header>
  <metadata> ...metadata record in oai_marc... </metadata>
  <about>
    <provenance ..namespace stuff..>
      <originDescription harvestDate="2002-02-08T08:55:46Z"
          altered="true">
        <baseURL>http://odd.oa.org</baseURL>
        <identifier>oai:odd.oa.org:z1x2y3</identifier>
        <datestamp>1999-08-07T06:05:04Z</datestamp>
        <metadataNamespace>http://odd.oa.org/odd_fmt</..>
      </originDescription>
    </provenance>
  </about>
</record>
<provenance> example (4)
This oai_marc record is then re-exposed by getmarc.oa.org with the same identifier oai:cw.oa.org:z1x2y3 (because the record has not been altered).
The associated <provenance> container might be:
<provenance> example (5)
<record>
  <header>
    <identifier>oai:cw.oa.org:z1x2y3</identifier>
    <datestamp>2002-03-01T01:46:11Z</datestamp>
  </header>
  <metadata> ...metadata record in oai_marc... </metadata>
  <about>
    <provenance ..namespace stuff..>
      <originDescription harvestDate="2002-03-01T01:23:45" altered="false">
        <baseURL>http://crosswalker.oa.org/</baseURL>
        <identifier>oai:cw.oa.org:z1x2y3</identifier>
        <datestamp>2002-02-09T01:15:43Z</datestamp>
        <metadataNamespace>http://../oai_marc</..>
        <originDescription harvestDate="2002-02-08T08:55:46Z" altered="true">
          <baseURL>http://odd.oa.org</baseURL>
          <identifier>oai:odd.oa.org:z1x2y3</identifier>
          <datestamp>1999-08-07T06:05:04Z</datestamp>
          <metadataNamespace>http://odd.oa.org/odd_fmt</..>
        </originDescription>
      </originDescription>
    </provenance>
  </about>
</record>
oai-identifier
•not compatible with v1.1 oai-identifier
–repositoryName now domain name based
–not reliant upon OAI centralized registration
•one-to-one mapping for escaping characters: %3F allowed, %3f not
–allows simple comparison
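Two of these points lend themselves to a quick check: splitting an identifier into scheme, repository identifier and local part, and verifying that percent-escapes use upper-case hex digits so that identifiers can be compared as plain strings. The sketch below is illustrative only; the function names are hypothetical and it does not validate that the repository identifier is a real domain name.

```python
import re


def split_oai_identifier(identifier, delimiter=":"):
    """Split 'oai:<repositoryIdentifier>:<local part>' into its three parts."""
    parts = identifier.split(delimiter, 2)
    if len(parts) != 3 or parts[0] != "oai" or not all(parts):
        raise ValueError(f"not an oai-identifier: {identifier!r}")
    return tuple(parts)


def escapes_are_uppercase(identifier):
    """True if every %XX escape uses upper-case hex digits (%3F, not %3f)."""
    escapes = re.findall(r"%([0-9A-Fa-f]{2})", identifier)
    return all(hexdigits == hexdigits.upper() for hexdigits in escapes)


print(split_oai_identifier("oai:oai-stuff.foo.org:5324"))
print(escapes_are_uppercase("oai:oai-stuff.foo.org:a%3Fb"))
print(escapes_are_uppercase("oai:oai-stuff.foo.org:a%3fb"))
```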
looking ahead: novel uses of OAI-PMH
•Registry of metadata formats for OpenURL
–http://www.sfxit.com/openurl/
–http://lib-www.lanl.gov/~herbertv/papers/icpp02-draft.pdf
•other uses?
registry model
[Figure: registrars register XML Schema URLs (URL1, URL2 … URLn) with a central repository; the central repository polls the registered schemas and is exposed through OAI-PMH; linking servers harvest it via OAI-PMH to support a user service]
Goal:
• inform linking servers re Schema
• ease of admin for all parties involved
• limit human overhead
Registry:
• schemaLocation
• registration date
• mirror of Schema
Poll:
• fetch schema at schemaLocation
• log failure/success
• compare fetched Schema with mirror
• changed => replace mirror
• removed => deregistered
OAI repo:
• record-ids = schemaLocation
• oai_dc record:
  • registration info
  • (de)registration datestamp
• xsi record:
  • mirror schema
  • schema update datestamp
• poll record:
  • process info
  • recent poll datestamp