mod_oai: Metadata Harvesting for Everyone
Michael L. Nelson, Herbert Van de Sompel, Xiaoming Liu, Aravind Elango
{mln,aelango}@cs.odu.edu {herbertv,liu_x}@lanl.gov
DLF 2004 Fall Forum
Baltimore MD
October 25-27, 2004
mod_oai is sponsored by the Andrew Mellon Foundation
Outline
• mod_oai
  – crawling vs. harvesting
  – complex objects & OAI-PMH
  – how mod_oai works
  – scenarios
  – demos
• More information
  – http://www.modoai.org/
  – http://www.openarchives.org/
www.getty.edu
doc1; last mod 2003-03-12
doc2; last mod 2002-07-19
doc3; last mod 2003-11-29
doc4; last mod 2002-10-03
doc100; last mod 2003-09-113…
what documents have been modified since 2003-11-15?
Inefficient Web Crawlers
robot image from: http://www.q-design.com/toy/ToyArt/robots/55.JPEG
www.getty.edu with OAI-PMH
doc1; last mod 2003-03-12
doc2; last mod 2002-07-19
doc3; last mod 2003-11-29
doc4; last mod 2002-10-03
doc100; last mod 2003-09-113…
what documents have been modified since 2003-11-15?
A More Efficient Way…
mod_oai
• Goal: integrate OAI-PMH functionality into the web server itself…
• mod_oai: an Apache 2.0 module to automatically answer OAI-PMH requests for an http server
  – written in C
  – respects values in .htaccess, httpd.conf
• Result: web harvesting with OAI-PMH semantics (e.g., from, until, sets)
• www.foo.edu/modoai?verb=ListIdentifiers&metadataPrefix=oai_dc&from=2004-09-15&set=video:mpeg
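A selective-harvest request like the one above is just a base URL plus OAI-PMH arguments. The following is a minimal sketch, not part of mod_oai itself; the helper name `build_request` is hypothetical, and the host and set name are taken from the slide's example.

```python
from urllib.parse import urlencode


def build_request(base_url, verb, args=None):
    """Return an OAI-PMH request URL (hypothetical helper, for illustration)."""
    params = {"verb": verb}
    # Skip arguments that were not supplied so the URL stays minimal.
    params.update({k: v for k, v in (args or {}).items() if v is not None})
    # safe=":" leaves the colon in the set name unescaped, as on the slide.
    return base_url + "?" + urlencode(params, safe=":")


url = build_request(
    "http://www.foo.edu/modoai",
    "ListIdentifiers",
    {"metadataPrefix": "oai_dc", "from": "2004-09-15", "set": "video:mpeg"},
)
print(url)
```

The `verb` argument is mandatory in every OAI-PMH request; `from`, `until` and `set` are the optional selective-harvesting arguments.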
OAI-PMH data model
resource
item
  OAI-PMH identifier = entry point to all records pertaining to the resource
records
  Dublin Core metadata (simple model)
  MARCXML metadata (more expressive model)
  MPEG-21 DIDL (complex model)
  METS (complex model)
record: metadata pertaining to the resource
record: modeled representation of the resource
OAI-PMH and complex models
• OAI-PMH record == modeled representation of the resource
• Can be selectively harvested via OAI-PMH ~ datestamp, set
• Resource can be:
  – simple object (1 file)
  – compound object (multiple files)
• OAI-PMH records can contain:
  – Typical metadata
  – Actual resource(s)
    • By-Value – base64 encoded
    • By-Reference – http address of resource
    • both
  – Identifiers of metadata and resource(s), unambiguously mapped to the identified data
  – A variety of secondary information
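The by-value / by-reference distinction can be sketched from the harvester's side. This is an illustration only: the dictionary keys `value_b64` and `ref` and the function `resolve_resource` are hypothetical stand-ins, not the actual DIDL element names.

```python
import base64


def resolve_resource(entry, fetch):
    """Return the resource bytes for one (hypothetical) record entry.

    entry: dict with optional keys "value_b64" (by-value) and "ref" (by-reference).
    fetch: callable mapping a URL to bytes; only used for by-reference entries.
    """
    if entry.get("value_b64") is not None:
        # By-value: the resource travels inside the record, base64 encoded.
        return base64.b64decode(entry["value_b64"])
    if entry.get("ref") is not None:
        # By-reference: the record only carries the http address.
        return fetch(entry["ref"])
    raise ValueError("entry carries neither a by-value nor a by-reference resource")


def fake_fetch(url):
    """Stand-in for an HTTP GET so the sketch runs without a network."""
    return b"fetched:" + url.encode("ascii")


by_value = {"value_b64": base64.b64encode(b"%PDF-1.4 example").decode("ascii")}
by_ref = {"ref": "http://www.foo.edu/doc1.pdf"}

print(resolve_resource(by_value, fake_fetch))
print(resolve_resource(by_ref, fake_fetch))
```

When a record carries both forms, this sketch prefers the by-value copy, which avoids a second round trip to the server.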
Complex Objects & OAI-PMH
• LANL Repository
  – OAI-PMH as a Repository Access Protocol to access metadata and content represented as DIDLs
• APS/LANL/LoC Mirroring
  – OAI-PMH transfer of APS content represented in application neutral format (DIDLs)
• LANL DSpace Plug-in
  – Exposes MPEG-21 DIDL documents through built-in DSpace OAI-PMH infrastructure
How mod_oai works
• Install on an Apache 2.0 server
  – compile & edit httpd.conf
http://www.foo.edu/ now has an OAI-PMH baseURL of:
http://www.foo.edu/modoai
OAI-PMH characteristics: Typical Repository
OAI-PMH Entity           value            description
Resource                 URL              PDF, PS, XML, HTML or other file
Item: identifier         OAI Identifier   DNS-based name of metadata about resource
Item: set membership     LCSH             Library of Congress Subject Heading
Record: metadataPrefix   oai_dc           bibliographic metadata in Dublin Core
Record: datestamp        2004-10-18       modification date of DC record
Record: metadataPrefix   oai_marc         bibliographic metadata in MARC
Record: datestamp        2004-07-31       modification date of MARC record
OAI-PMH Data Model in mod_oai
resource: http://techreports.larc.nasa.gov/ltrs/PDF/2004/aiaa/NASA-aiaa-2004-0015.pdf
item
  OAI Identifier == URL of Resource
  Set membership == MIME type
records (DC, HTTP, DIDL Modeled Representations)
  Dublin Core metadata
  HTTP headers
  DIDL: base64 or urls + HTTP headers
OAI-PMH characteristics: mod_oai
OAI-PMH Entity           value         description
Resource                 URL           HTML, GIF, PDF or other web file
Item: identifier         URL           same URL as the resource
Item: set membership     MIME type     MIME type of the resource
Record: metadataPrefix   http_header   the http headers that would have been returned via HTTP GET/HEAD
Record: datestamp        2004-07-31    modification date of resource
Record: metadataPrefix   oai_dc        a subset of http_header in DC
Record: datestamp        2004-07-31    modification date of resource
Record: metadataPrefix   oai_didl      MPEG-21 DIDL: base64 encoded resource + http_header metadata
Record: datestamp        2004-07-31    modification date of resource
OAI-PMH Concepts
concept           mod_oai interpretation
OAI Identifier    URL of resource
set               MIME type of resource
datestamp         change time of resource
deleted records   “no” deleted records
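The mapping in the table above can be sketched for a single file on disk. This is a rough illustration in Python rather than the module's C code; `oai_header` is a hypothetical name, the file's modification time is assumed as the datestamp source, and the colon-separated set name is an assumption based on the `set=video:mpeg` example URL.

```python
import calendar
import mimetypes
import os
import tempfile
from datetime import datetime, timezone


def oai_header(site_url, doc_root, path):
    """Map one file under doc_root to OAI-PMH header fields (illustrative)."""
    rel = os.path.relpath(path, doc_root).replace(os.sep, "/")
    mime, _ = mimetypes.guess_type(path)
    mtime = os.stat(path).st_mtime  # assumption: mtime is the datestamp source
    return {
        # OAI identifier == URL of the resource
        "identifier": site_url.rstrip("/") + "/" + rel,
        # set == MIME type; colon form assumed from the set=video:mpeg example
        "setSpec": (mime or "application/octet-stream").replace("/", ":"),
        # datestamp in the UTC format OAI-PMH uses
        "datestamp": datetime.fromtimestamp(mtime, tz=timezone.utc).strftime(
            "%Y-%m-%dT%H:%M:%SZ"
        ),
    }


with tempfile.TemporaryDirectory() as doc_root:
    path = os.path.join(doc_root, "index.html")
    with open(path, "w") as fh:
        fh.write("<html></html>")
    # Pin the modification time so the example is reproducible.
    stamp = calendar.timegm((2004, 7, 31, 0, 0, 0))
    os.utime(path, (stamp, stamp))
    header = oai_header("http://www.foo.edu/", doc_root, path)

print(header)
```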
http_header
Use Cases
• Regular Web Crawling
  – use ListIdentifiers to discover URLs
  – add new URLs to the list of URLs to be crawled
• Harvesting Resources w/ OAI-PMH
  – use ListRecords to extract the entire resource as an MPEG-21 DIDL AIP
Regular Crawling: ListIdentifiers
harvester issues a ListIdentifiers, finds the updates, and does HTTP GETs on just the updates
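As a sketch of the crawler side of this scenario, the snippet below parses a ListIdentifiers response and returns the URLs to fetch. The sample response is hand-written for illustration (it mirrors the doc3 entry from the earlier slide) and is not captured mod_oai output; `urls_to_fetch` is a hypothetical name.

```python
import xml.etree.ElementTree as ET

OAI_NS = {"oai": "http://www.openarchives.org/OAI/2.0/"}

# Hand-written sample response: only doc3 changed since 2003-11-15.
SAMPLE = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <responseDate>2004-10-25T12:00:00Z</responseDate>
  <request verb="ListIdentifiers" metadataPrefix="oai_dc"
           from="2003-11-15">http://www.foo.edu/modoai</request>
  <ListIdentifiers>
    <header>
      <identifier>http://www.foo.edu/doc3.html</identifier>
      <datestamp>2003-11-29T00:00:00Z</datestamp>
      <setSpec>text:html</setSpec>
    </header>
  </ListIdentifiers>
</OAI-PMH>"""


def urls_to_fetch(response_xml):
    """Return the identifiers in a ListIdentifiers response.

    Under mod_oai the identifier is the resource URL, so each value
    can be passed directly to an ordinary HTTP GET.
    """
    root = ET.fromstring(response_xml)
    return [
        el.text.strip()
        for el in root.findall("oai:ListIdentifiers/oai:header/oai:identifier", OAI_NS)
    ]


print(urls_to_fetch(SAMPLE))
```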
Resource Harvesting: ListRecords
harvester issues a ListRecords, and gets the updates in DIDLs (http headers + by-value or by-ref resources)
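A harvester of this kind might look roughly like the sketch below. It follows the standard OAI-PMH resumptionToken flow control for large result lists; the `fetch` callable, the canned pages and the token value `page2` are made up so the example runs without a live server.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode

OAI = "{http://www.openarchives.org/OAI/2.0/}"


def harvest(base_url, metadata_prefix, fetch, from_date=None):
    """Yield every <record> element, following resumptionTokens."""
    params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    if from_date:
        params["from"] = from_date
    while True:
        root = ET.fromstring(fetch(base_url + "?" + urlencode(params)))
        container = root.find(OAI + "ListRecords")
        if container is None:  # error response or empty result
            return
        yield from container.findall(OAI + "record")
        token = container.find(OAI + "resumptionToken")
        if token is None or not (token.text or "").strip():
            return  # an absent or empty token marks the last chunk
        # Follow-up requests carry only the verb and the token.
        params = {"verb": "ListRecords", "resumptionToken": token.text.strip()}


def _page(identifier, token):
    """Build a canned one-record response (illustrative data only)."""
    return (
        '<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/"><ListRecords>'
        "<record><header><identifier>%s</identifier></header></record>"
        "<resumptionToken>%s</resumptionToken>"
        "</ListRecords></OAI-PMH>" % (identifier, token)
    )


def fake_fetch(url):
    """Stand-in for an HTTP GET against a hypothetical baseURL."""
    if "resumptionToken=page2" in url:
        return _page("http://www.foo.edu/doc2.html", "")
    return _page("http://www.foo.edu/doc1.html", "page2")


identifiers = [
    rec.find(OAI + "header/" + OAI + "identifier").text
    for rec in harvest("http://www.foo.edu/modoai", "oai_didl", fake_fetch, "2004-09-15")
]
print(identifiers)
```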
Demo
• Repository Explorer
  – http://oai.dlib.vt.edu/cgi-bin/Explorer/oai2.0/testoai
  – http://oai.dlib.vt.edu/cgi-bin/Explorer/oai2.0/testoai?archive=http://whiskey.cs.odu.edu/modoai
• Direct URLs
  – http://whiskey.cs.odu.edu/modoai?verb=Identify
  – http://whiskey.cs.odu.edu/modoai?verb=ListMetadataFormats
  – http://whiskey.cs.odu.edu/modoai?verb=ListIdentifiers&metadataPrefix=oai_dc
  – http://whiskey.cs.odu.edu/modoai?verb=ListRecords&metadataPrefix=http_header
  – http://whiskey.cs.odu.edu/modoai?verb=ListRecords&metadataPrefix=oai_didl
Datestamps and Etags
• Procedure
  – 16 harvests over 1 month of 465,374 .dk domains
  – 5,543,470 possible downloads
  – 5,182,034 successful downloads
  – 599,143 changes
Datestamp and Etag Example
L. Clausen, “Concerning Etags and Datestamps”, 4th International Web Archiving Workshop, ECDL 2004
http://www.netarchive.dk/website/publications/Etags-2004.pdf
Errors in Datestamps and Etags
Indicating Change   Etags    Datestamps
missed change       0.087%   0.30%
redundant crawl     32%      10.7%
L. Clausen, “Concerning Etags and Datestamps”, 4th International Web Archiving Workshop, ECDL 2004
http://www.netarchive.dk/website/publications/Etags-2004.pdf
40.1% of pages without Etags
0.07% of pages without Datestamps
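The two error types in the table can be read as follows, assuming the usual interpretation: a missed change is a real content change that the indicator (Etag or datestamp) failed to signal, and a redundant crawl is a signalled change where the content was in fact identical. The sketch below tallies both from made-up observations; it does not reproduce Clausen's data or method.

```python
from collections import Counter


def classify(indicator_changed, content_changed):
    """Label one revisit of a page (assumed interpretation of the table rows)."""
    if content_changed and not indicator_changed:
        return "missed change"  # harvester would have skipped a real update
    if indicator_changed and not content_changed:
        return "redundant crawl"  # harvester would have re-fetched identical content
    return "correct"


# Made-up observations: (indicator says changed, content actually changed)
observations = [
    (True, True),
    (False, False),
    (True, False),
    (False, True),
    (False, False),
]

tally = Counter(classify(ind, con) for ind, con in observations)
print(dict(tally))
```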
mod_oai…
• is:
  – a simple way to more efficiently harvest web pages
  – a possible impact on robots.txt
  – fully OAI-PMH compliant
    • works with existing harvesters
• is not:
  – yet suitable for dynamic files
  – a replacement for
    • DSpace
    • Fedora
    • eprints.org
    • other digital libraries / repositories / cms