mod_oai: Metadata Harvesting for Everyone

mod_oai: Metadata Harvesting for Everyone Michael L. Nelson, Herbert Van de Sompel, Xiaoming Liu, Aravind Elango {mln,aelango}@cs.odu.edu {herbertv,liu_x}@lanl.gov DLF 2004 Fall Forum Baltimore MD October 25-27, 2004 mod_oai is sponsored by the Andrew Mellon Foundation


Transcript of mod_oai: Metadata Harvesting for Everyone

Page 1: mod_oai:  Metadata Harvesting  for Everyone

mod_oai: Metadata Harvesting for Everyone

Michael L. Nelson, Herbert Van de Sompel, Xiaoming Liu, Aravind Elango

{mln,aelango}@cs.odu.edu
{herbertv,liu_x}@lanl.gov

DLF 2004 Fall Forum
Baltimore MD
October 25-27, 2004

mod_oai is sponsored by the Andrew Mellon Foundation

Page 2: mod_oai:  Metadata Harvesting  for Everyone

Outline

• mod_oai
– crawling vs. harvesting
– complex objects & OAI-PMH
– how mod_oai works
– scenarios
– demos

• More information
– http://www.modoai.org/
– http://www.openarchives.org/

Page 3: mod_oai:  Metadata Harvesting  for Everyone

Inefficient Web Crawlers

www.getty.edu

doc1; last mod 2003-03-12
doc2; last mod 2002-07-19
doc3; last mod 2003-11-29
doc4; last mod 2002-10-03
…
doc100; last mod 2003-09-11

what documents have been modified since 2003-11-15?

robot image from: http://www.q-design.com/toy/ToyArt/robots/55.JPEG

Page 4: mod_oai:  Metadata Harvesting  for Everyone

A More Efficient Way…

www.getty.edu with OAI-PMH

doc1; last mod 2003-03-12
doc2; last mod 2002-07-19
doc3; last mod 2003-11-29
doc4; last mod 2002-10-03
…
doc100; last mod 2003-09-11

what documents have been modified since 2003-11-15?

Page 5: mod_oai:  Metadata Harvesting  for Everyone

mod_oai

• Goal: integrate OAI-PMH functionality into the web server itself…
• mod_oai: an Apache 2.0 module to automatically answer OAI-PMH requests for an http server
– written in C
– respects values in .htaccess, httpd.conf
• Result: web harvesting with OAI-PMH semantics (e.g., from, until, sets)
• www.foo.edu/modoai?verb=ListIdentifiers&metadataPrefix=oai_dc&from=2004-09-15&set=video:mpeg
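A selective-harvest request like the one on this slide is an ordinary HTTP GET whose query string carries the OAI-PMH arguments. The following is a minimal sketch of assembling such a URL in Python. The function name `list_identifiers_url` and its parameter names are illustrative assumptions, not part of mod_oai; the argument names (`verb`, `metadataPrefix`, `from`, `until`, `set`) come from the OAI-PMH protocol.

```python
from urllib.parse import urlencode


def list_identifiers_url(base_url, metadata_prefix,
                         from_date=None, until_date=None, set_spec=None):
    """Build an OAI-PMH ListIdentifiers request URL (hypothetical helper)."""
    params = [("verb", "ListIdentifiers"), ("metadataPrefix", metadata_prefix)]
    # Optional selective-harvesting arguments are added only when supplied.
    if from_date:
        params.append(("from", from_date))
    if until_date:
        params.append(("until", until_date))
    if set_spec:
        params.append(("set", set_spec))
    # safe=":" leaves the colon in a setSpec such as "video:mpeg" unescaped.
    return base_url + "?" + urlencode(params, safe=":")


url = list_identifiers_url("http://www.foo.edu/modoai", "oai_dc",
                           from_date="2004-09-15", set_spec="video:mpeg")
print(url)
```

The base URL here is the placeholder host used on the slides, so the printed URL is only an illustration of the request shape.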

Page 6: mod_oai:  Metadata Harvesting  for Everyone

OAI-PMH data model

resource

item
– OAI-PMH identifier = entry point to all records pertaining to the resource

records (metadata pertaining to the resource; modeled representation of the resource)
– Dublin Core metadata: simple model
– MARCXML metadata: more expressive model
– MPEG-21 DIDL: complex model
– METS: complex model

Page 7: mod_oai:  Metadata Harvesting  for Everyone

OAI-PMH and complex models

• OAI-PMH record == modeled representation of the resource
• Can be selectively harvested via OAI-PMH ~ datestamp, set
• Resource can be:
– simple object (1 file)
– compound object (multiple files)
• OAI-PMH records can contain:
– Typical metadata
– Actual resource(s)
  • By-Value – base64 encoded
  • By-Reference – http address of resource
  • both
– Identifiers of metadata and resource(s), unambiguously mapped to the identified data
– A variety of secondary information

Page 8: mod_oai:  Metadata Harvesting  for Everyone

Complex Objects & OAI-PMH

• LANL Repository
– OAI-PMH as a Repository Access Protocol to access metadata and content represented as DIDLs
• APS/LANL/LoC Mirroring
– OAI-PMH transfer of APS content represented in application neutral format (DIDLs)
• LANL DSpace Plug-in
– Exposes MPEG-21 DIDL documents through built-in DSpace OAI-PMH infrastructure

Page 9: mod_oai:  Metadata Harvesting  for Everyone

How mod_oai works

• Install on an Apache 2.0 server
– compile & edit httpd.conf

http://www.foo.edu/ now has an OAI-PMH baseURL of: http://www.foo.edu/modoai

Page 10: mod_oai:  Metadata Harvesting  for Everyone

OAI-PMH characteristics: Typical Repository

OAI-PMH Entity      value            description
Resource            URL              PDF, PS, XML, HTML or other file
Item
  identifier        OAI Identifier   DNS-based name of metadata about resource
  set membership    LCSH             Library of Congress Subject Heading
Record
  metadataPrefix    oai_dc           bibliographic metadata in Dublin Core
  datestamp         2004-10-18       modification date of DC record
Record
  metadataPrefix    oai_marc         bibliographic metadata in MARC
  datestamp         2004-07-31       modification date of MARC record

Page 12: mod_oai:  Metadata Harvesting  for Everyone

OAI-PMH characteristics: mod_oai

OAI-PMH Entity      value            description
Resource            URL              HTML, GIF, PDF or other web file
Item
  identifier        URL              same URL as the resource
  set membership    MIME type        MIME type of the resource
Record
  metadataPrefix    http_header      the http headers that would have been returned via HTTP GET/HEAD
  datestamp         2004-07-31       modification date of resource
Record
  metadataPrefix    oai_dc           a subset of http_header in DC
  datestamp         2004-07-31       modification date of resource
Record
  metadataPrefix    oai_didl         MPEG-21 DIDL: base64 encoded resource + http_header metadata
  datestamp         2004-07-31       modification date of resource

Page 13: mod_oai:  Metadata Harvesting  for Everyone

OAI-PMH Concepts

concept             mod_oai interpretation
OAI Identifier      URL of resource
set                 MIME type of resource
datestamp           change time of resource
deleted records     “no” deleted records
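The mapping in this table can be sketched in a few lines of Python. This is an illustration of the idea only, not the module's C implementation: `describe_resource` is a hypothetical helper, the MIME type is guessed from the URL's extension rather than taken from Apache's configuration, and writing the set as the MIME type with ":" in place of "/" is an assumption based on the `set=video:mpeg` example in the request URL shown earlier in the deck.

```python
import mimetypes
from datetime import datetime, timezone


def describe_resource(url, mtime_epoch):
    """Map a web resource onto OAI-PMH concepts (hypothetical helper)."""
    # Guess the MIME type from the file extension in the URL.
    mime_type, _encoding = mimetypes.guess_type(url)
    # Assumption: the setSpec is the MIME type with "/" written as ":".
    set_spec = mime_type.replace("/", ":") if mime_type else None
    # The datestamp is the resource's modification time, day granularity, UTC.
    datestamp = datetime.fromtimestamp(
        mtime_epoch, tz=timezone.utc).strftime("%Y-%m-%d")
    return {"identifier": url, "setSpec": set_spec, "datestamp": datestamp}


# Example modification time: noon UTC on 2004-07-31.
mtime = datetime(2004, 7, 31, 12, 0, tzinfo=timezone.utc).timestamp()
item = describe_resource("http://www.foo.edu/index.html", mtime)
print(item)
```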

Page 14: mod_oai:  Metadata Harvesting  for Everyone

http_header

Page 15: mod_oai:  Metadata Harvesting  for Everyone

Use Cases

• Regular Web Crawling
– use ListIdentifiers to discover URLs
– add new URLs to the list of URLs to be crawled
• Harvesting Resources w/ OAI-PMH
– use ListRecords to extract the entire resource as an MPEG-21 DIDL AIP

Page 16: mod_oai:  Metadata Harvesting  for Everyone

Regular Crawling: ListIdentifiers

harvester issues a ListIdentifiers, finds the updates, and does HTTP GETs on just the updates
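As a rough sketch of the crawler side of this scenario, the code below pulls the identifiers (which under mod_oai are URLs) and datestamps out of a ListIdentifiers response. The XML namespace is the standard OAI-PMH 2.0 one; the sample response, its host, and the function name `changed_urls` are made up for illustration, and resumptionToken handling is left out. A real crawler would then issue HTTP GETs for the returned URLs.

```python
import xml.etree.ElementTree as ET

OAI_NS = {"oai": "http://www.openarchives.org/OAI/2.0/"}

# Hypothetical response body; a real one would come from the repository.
SAMPLE_RESPONSE = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <responseDate>2004-10-25T12:00:00Z</responseDate>
  <request verb="ListIdentifiers" metadataPrefix="oai_dc"
           from="2003-11-15">http://www.foo.edu/modoai</request>
  <ListIdentifiers>
    <header>
      <identifier>http://www.foo.edu/doc3.html</identifier>
      <datestamp>2003-11-29</datestamp>
      <setSpec>text:html</setSpec>
    </header>
  </ListIdentifiers>
</OAI-PMH>"""


def changed_urls(response_xml):
    """Return (identifier, datestamp) pairs from a ListIdentifiers response."""
    root = ET.fromstring(response_xml)
    pairs = []
    for header in root.findall("oai:ListIdentifiers/oai:header", OAI_NS):
        identifier = header.findtext("oai:identifier", namespaces=OAI_NS)
        datestamp = header.findtext("oai:datestamp", namespaces=OAI_NS)
        pairs.append((identifier, datestamp))
    return pairs


updates = changed_urls(SAMPLE_RESPONSE)
print(updates)
```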

Page 17: mod_oai:  Metadata Harvesting  for Everyone

Resource Harvesting: ListRecords

harvester issues a ListRecords, and gets the updates in DIDLs (http headers + by-value or by-ref resources)
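To make the by-value / by-reference distinction concrete, here is a small sketch that reads `Resource` elements from a simplified DIDL fragment. The fragment is hypothetical and much smaller than a real mod_oai record (it omits the http header metadata entirely); the `encoding="base64"` and `ref` attributes are assumed to follow the MPEG-21 DIDL schema, and `extract_resources` is an illustrative name.

```python
import base64
import xml.etree.ElementTree as ET

DIDL_RESOURCE = "{urn:mpeg:mpeg21:2002:02-DIDL-NS}Resource"

# Hypothetical, simplified DIDL: one by-value and one by-reference resource.
SAMPLE_DIDL = """<didl:DIDL xmlns:didl="urn:mpeg:mpeg21:2002:02-DIDL-NS">
  <didl:Item>
    <didl:Component>
      <didl:Resource mimeType="text/html"
                     encoding="base64">PGh0bWw+aGk8L2h0bWw+</didl:Resource>
    </didl:Component>
    <didl:Component>
      <didl:Resource mimeType="text/html"
                     ref="http://www.foo.edu/index.html"/>
    </didl:Component>
  </didl:Item>
</didl:DIDL>"""


def extract_resources(didl_xml):
    """Return one dict per Resource: decoded bytes (by-value) or a URL (by-ref)."""
    resources = []
    for res in ET.fromstring(didl_xml).iter(DIDL_RESOURCE):
        content = None
        if res.get("encoding") == "base64" and res.text:
            # By-value: the resource bytes travel inside the record.
            content = base64.b64decode(res.text.strip())
        resources.append({
            "mimeType": res.get("mimeType"),
            "content": content,
            "ref": res.get("ref"),  # By-reference: fetch this URL separately.
        })
    return resources


resources = extract_resources(SAMPLE_DIDL)
print(resources)
```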

Page 18: mod_oai:  Metadata Harvesting  for Everyone

Demo

• Repository Explorer
– http://oai.dlib.vt.edu/cgi-bin/Explorer/oai2.0/testoai
– http://oai.dlib.vt.edu/cgi-bin/Explorer/oai2.0/testoai?archive=http://whiskey.cs.odu.edu/modoai
• Direct URLs
– http://whiskey.cs.odu.edu/modoai?verb=Identify
– http://whiskey.cs.odu.edu/modoai?verb=ListMetadataFormats
– http://whiskey.cs.odu.edu/modoai?verb=ListIdentifiers&metadataPrefix=oai_dc
– http://whiskey.cs.odu.edu/modoai?verb=ListRecords&metadataPrefix=http_header
– http://whiskey.cs.odu.edu/modoai?verb=ListRecords&metadataPrefix=oai_didl

Page 19: mod_oai:  Metadata Harvesting  for Everyone

Datestamps and Etags

• Procedure
– 16 harvests over 1 month of 465,374 .dk domains
– 5,543,470 possible downloads
– 5,182,034 successful downloads
– 599,143 changes

Datestamp and Etag Example

L. Clausen, “Concerning Etags and Datestamps”, 4th International Web Archiving Workshop, ECDL 2004

http://www.netarchive.dk/website/publications/Etags-2004.pdf

Page 20: mod_oai:  Metadata Harvesting  for Everyone

Errors in Datestamps and Etags

Indicating Change   Etags     Datestamps
missed change       0.087%    0.30%
redundant crawl     32%       10.7%

40.1% of pages without Etags
0.07% of pages without Datestamps

L. Clausen, “Concerning Etags and Datestamps”, 4th International Web Archiving Workshop, ECDL 2004

http://www.netarchive.dk/website/publications/Etags-2004.pdf

Page 21: mod_oai:  Metadata Harvesting  for Everyone

mod_oai…

• is:
– a simple way to more efficiently harvest web pages
– a possible impact on robots.txt
– fully OAI-PMH compliant
  • works with existing harvesters
• is not:
– yet suitable for dynamic files
– a replacement for
  • DSpace
  • Fedora
  • eprints.org
  • other digital libraries / repositories / cms