Harvesting Metadata Using OAI-PMH

Roy Tennant, California Digital Library


Transcript of Harvesting Metadata Using OAI-PMH

Page 1: Harvesting Metadata Using OAI-PMH

Harvesting Metadata Using OAI-PMH

Roy Tennant
California Digital Library

Page 2: Harvesting Metadata Using OAI-PMH

Outline

The Open Archives Initiative
OAI-PMH
The Harvesting Process
Harvesting Problems
Steps to a Fruitful Harvest
A Harvesting Service Model
The OAI Future

Page 3: Harvesting Metadata Using OAI-PMH

Open Archives Initiative

Aimed at making the large and growing number of repositories of freely available digital content interoperable
Only five years old, but already essential
Protocol for Metadata Harvesting (OAI-PMH) specifies how repositories can expose their metadata for others to harvest
Well over 500 repositories world-wide support the protocol
OAIster.org has indexed 5 million items from those repositories

Page 4: Harvesting Metadata Using OAI-PMH

www.oaforum.org/tutorial/

Page 5: Harvesting Metadata Using OAI-PMH

OAI-PMH

Data providers (DP): those with the stuff
Service providers (SP): those who harvest metadata and provide aggregation and search services
Software for both DPs and SPs readily available
OAI-PMH verbs:

  Identify
  ListIdentifiers
  ListMetadataFormats
  ListSets
  ListRecords
  GetRecord

Page 6: Harvesting Metadata Using OAI-PMH

OAI Architecture

Source: Open Archives Forum Tutorial

Page 7: Harvesting Metadata Using OAI-PMH
Page 8: Harvesting Metadata Using OAI-PMH

Identify

Provides basic information about a repository

Page 9: Harvesting Metadata Using OAI-PMH

ListMetadataFormats

Lists available metadata formats

Page 10: Harvesting Metadata Using OAI-PMH

ListIdentifiers

Lists all identifiers (or only those of the optionally specified set)
Must include the metadataPrefix argument

Page 11: Harvesting Metadata Using OAI-PMH

ListSets

Lists available sets

Page 12: Harvesting Metadata Using OAI-PMH

Library of Congress ListSets response

Page 13: Harvesting Metadata Using OAI-PMH

ListRecords

Lists all records (or only those of the optionally specified set)
Must include the metadataPrefix argument

Page 14: Harvesting Metadata Using OAI-PMH

GetRecord

Retrieves a specific record
Must include the metadataPrefix and identifier arguments
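The six verbs and their required arguments can be sketched as a small request builder. This is a minimal illustration, not part of the original slides: the function name `build_request` and the example.org base URL are hypothetical, and the required-argument table follows my reading of the OAI-PMH 2.0 specification (a resumptionToken, when present, stands in for the other arguments).

```python
from urllib.parse import urlencode

# Required arguments per verb (assumed from the OAI-PMH 2.0 specification).
REQUIRED_ARGS = {
    "Identify": set(),
    "ListMetadataFormats": set(),
    "ListSets": set(),
    "ListIdentifiers": {"metadataPrefix"},
    "ListRecords": {"metadataPrefix"},
    "GetRecord": {"metadataPrefix", "identifier"},
}


def build_request(base_url, verb, **args):
    """Return the GET URL for one OAI-PMH request (hypothetical helper)."""
    if verb not in REQUIRED_ARGS:
        raise ValueError(f"unknown OAI-PMH verb: {verb}")
    # A resumptionToken is exclusive: it replaces every other argument.
    if "resumptionToken" not in args:
        missing = REQUIRED_ARGS[verb] - set(args)
        if missing:
            raise ValueError(f"{verb} requires: {', '.join(sorted(missing))}")
    query = {"verb": verb, **args}
    return f"{base_url}?{urlencode(query)}"


if __name__ == "__main__":
    # example.org is a placeholder, not a real repository.
    print(build_request("http://example.org/oai", "ListRecords",
                        metadataPrefix="oai_dc"))
```

Checking arguments before sending the request catches the most common mistake (a missing metadataPrefix) locally rather than as a badArgument error from the repository.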

Page 15: Harvesting Metadata Using OAI-PMH

The Harvesting Process

Identifying Sources
Selecting Sets
Harvesting
Indexing
Interface

Page 16: Harvesting Metadata Using OAI-PMH

gita.grainger.uiuc.edu/registry/

Page 17: Harvesting Metadata Using OAI-PMH

errol.oclc.org

Page 18: Harvesting Metadata Using OAI-PMH

Selecting Sets

Review the response to the ListSets verb
May be instructive to search the collection in the native interface, if possible
Look for descriptive pages on the site being harvested

Page 19: Harvesting Metadata Using OAI-PMH
Page 20: Harvesting Metadata Using OAI-PMH
Page 21: Harvesting Metadata Using OAI-PMH

Harvesting

Many harvesting applications are available; I will focus on:

  Public Knowledge Project (PKP) Harvester
  Virginia Tech Perl Harvester

Library software vendors increasingly offer harvesting products (e.g., ExLibris’ MetaIndex)
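Whatever harvester is used, the core loop is the same: issue ListRecords, collect the records, and keep following the resumptionToken until the repository stops returning one. A minimal sketch, assuming a caller-supplied `fetch` function (hypothetical) that takes a dict of OAI-PMH arguments and returns the response XML as a string; it is not how the PKP or Virginia Tech harvesters are implemented.

```python
import xml.etree.ElementTree as ET

# Namespace of OAI-PMH 2.0 responses.
OAI_NS = "{http://www.openarchives.org/OAI/2.0/}"


def harvest(fetch, metadata_prefix="oai_dc", set_spec=None):
    """Yield every <record> element, following resumption tokens.

    `fetch` is a hypothetical callable: dict of arguments -> XML string.
    """
    args = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    if set_spec:
        args["set"] = set_spec
    while True:
        root = ET.fromstring(fetch(args))
        yield from root.iter(OAI_NS + "record")
        token = root.find(f".//{OAI_NS}resumptionToken")
        # An absent or empty token marks the final chunk of the list.
        if token is None or not (token.text or "").strip():
            break
        # Follow-up requests carry only the verb and the token.
        args = {"verb": "ListRecords", "resumptionToken": token.text.strip()}
```

Passing `fetch` in as a parameter keeps the loop independent of any HTTP library and makes it easy to exercise against canned responses.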

Page 22: Harvesting Metadata Using OAI-PMH
Page 23: Harvesting Metadata Using OAI-PMH

+-----------------------------------------+
| Harvester Sample Configurator           |
+-----------------------------------------+
| Version 1.1 :: July 2002                |
| Hussein Suleman <[email protected]>     |
| Digital Library Research Laboratory     |
| www.dlib.vt.edu :: Virginia Tech        |
------------------------------------------+

Defaults/previous values are in brackets - press <enter> to accept those
enter "&delete" to erase a default value
enter "&continue" to skip further questions and use all defaults
press <ctrl>-c to escape at any time (new values will be lost)

Press <enter> to continue

[ARCHIVES]
Add all the archives that should be harvested

Current list of archives:
No archives currently defined !

Select from: [A]dd [D]one
Enter your choice [D] : a{return}

[ARCHIVE IDENTIFIER]
You need a unique name by which to refer to the archive you
will harvest metadata from
Examples: nsdl-380602, VTETD

Archive identifier [] : nsdl-380602{return}

Virginia Tech Perl Harvester

Page 24: Harvesting Metadata Using OAI-PMH

Let’s Harvest!

Page 25: Harvesting Metadata Using OAI-PMH

Indexing

Pick your favorite database/indexing software:

  MySQL
  SWISH-E
  Whatever is lying around…

May need to specifically set up a method to search across the entire record
May need different fields for indexing than for display
Will need to deal with element collision
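One way to address the first and third points is to build, for each harvested record, an index document with a catch-all field for whole-record searching and format-prefixed field names. This is a sketch under assumptions: records are plain dicts mapping element names to lists of values, the names `build_index_doc` and `fulltext` are hypothetical, and "element collision" is read here as same-named elements arriving from different metadata formats.

```python
def build_index_doc(record, prefix="dc"):
    """Map a harvested record (element -> list of values) to index fields."""
    doc = {}
    for element, values in record.items():
        # Prefixing with the metadata format keeps, say, a Dublin Core
        # "title" apart from a same-named element in another format.
        doc[f"{prefix}_{element}"] = list(values)
    # One catch-all field lets a single query search the entire record.
    doc["fulltext"] = " ".join(
        value for values in record.values() for value in values
    )
    return doc
```

The resulting dict can be loaded into whichever database or indexing tool is at hand; the display layer can keep reading the original record.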

Page 26: Harvesting Metadata Using OAI-PMH

Interface

Software interface (API) for other applications:

  SRU/SRW?
  Arbitrary Web Services schema?

User interface:

  What functions do you want your users to be able to perform?
  What kinds of displays do you want?

Page 27: Harvesting Metadata Using OAI-PMH

Harvesting Problems

Sets
Metadata Formats
Metadata Artifacts
Granularity
Metadata Variances

Page 28: Harvesting Metadata Using OAI-PMH

Sets

Records are harvested in clumps, called “sets”, created by DPs
No guidelines exist for defining sets
Examples:

  Collection
  Organizational structure
  Format (but is a page image an image? See example)

Page 29: Harvesting Metadata Using OAI-PMH

Metadata Formats

Only required format is simple Dublin Core, although any format can be made available in addition
Few DPs surface richer metadata
Simple DC is simply too simple!
Example (artifact vs. surrogate dates)

Page 30: Harvesting Metadata Using OAI-PMH

Metadata Artifacts

“unintended, unwanted aberrations”
Sample causes:

  Idiosyncratic local practices
  Anachronisms
  HTML code

Examples:

  Circa = string of dates for searching purposes
  [electronic resource]

Page 31: Harvesting Metadata Using OAI-PMH

Granularity

Record Granularity: what is an “object”?

  A book, or each individual page?
  Examples: CDL, Univ. of Michigan

Metadata Granularity: Multiple values in one field

  Example: Univ. of Washington
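When several values are packed into one field, a service provider can split them back apart before indexing. A minimal sketch, assuming the values are separated by semicolons; the delimiter actually used by any given data provider has to be checked against its records, and `split_values` is a hypothetical helper name.

```python
def split_values(values, delimiter=";"):
    """Split multi-valued strings into one trimmed value per entry."""
    result = []
    for value in values:
        for part in value.split(delimiter):
            part = part.strip()
            if part:  # skip empty pieces left by trailing delimiters
                result.append(part)
    return result
```

For example, a single subject field holding "Indians of North America; Totem poles" becomes two separate subject values.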

Page 32: Harvesting Metadata Using OAI-PMH

Metadata Variances

Subject terminology differences
Disparities in recording the same metadata

  Example: date variances

Mapping oddities or mistakes

  Examples: 1) format into description, 2) description into subject

Page 33: Harvesting Metadata Using OAI-PMH

Steps to a Fruitful Harvest

Needs Assessment (it’s the user, stupid)
DP Identification and Communication
Metadata Capture
Metadata Analysis
Metadata Subsetting
Metadata Normalization
Metadata Enrichment
Indexing & Display
Interface (it’s still the user, stupid)

Page 34: Harvesting Metadata Using OAI-PMH

Needs Assessment

What are you trying to accomplish?
What will your users want to be able to do?
What metadata will you need, and what procedures will you need to set up to enable these activities?
Which repositories have what you want?
Is what they have (e.g., sets, metadata) usable as is, or ?

Page 35: Harvesting Metadata Using OAI-PMH

DP Identification & Communication

Identification:

  Use UIUC directory of DPs to identify potential sources

Communication:

  Not required to tell them you are harvesting, but may help establish a good relationship
  May want to request that they surface a richer metadata format and/or provide a different set

Page 36: Harvesting Metadata Using OAI-PMH

Metadata Capture

Sample questions to answer:

  Individual sets, or all?
  Richer metadata formats available?
  How frequently to reharvest?
  Start from scratch each time or update?
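The "start from scratch or update" choice maps onto the protocol's selective harvesting arguments: a full harvest omits `from`, while an update passes the date of the previous harvest. A minimal sketch; the function name and its parameters are hypothetical, and it uses day-level dates (YYYY-MM-DD).

```python
from datetime import date


def list_records_args(metadata_prefix, set_spec=None, last_harvest=None):
    """Build ListRecords arguments for a full or an incremental harvest."""
    args = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    if set_spec:
        args["set"] = set_spec  # harvest one set instead of everything
    if last_harvest:
        # Only records added or changed since this date are returned.
        args["from"] = last_harvest.isoformat()
    return args


if __name__ == "__main__":
    print(list_records_args("oai_dc", last_harvest=date(2004, 5, 1)))
```

Whether an update also reveals deletions depends on how the repository tracks deleted records, which its Identify response declares.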

Many software options

Page 37: Harvesting Metadata Using OAI-PMH

Metadata Analysis

Finding out what you have (and don’t have)

  Encoding practices
  Gap analysis (e.g., missing fields, etc.)
  Mistakes (e.g., mapping errors)

Software can help

  Commercial software like Spotfire
  In-house or open source software tools
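An in-house gap analysis can start very small: count how many harvested records actually carry each element, then report which expected elements are missing. A sketch under the assumption that records are dicts mapping element names to lists of string values; `element_usage` and `gaps` are hypothetical names, not an existing tool.

```python
from collections import Counter


def element_usage(records):
    """Count the records that carry at least one non-blank value per element."""
    usage = Counter()
    for record in records:
        for element, values in record.items():
            if any(value.strip() for value in values):
                usage[element] += 1
    return usage


def gaps(records, expected):
    """Return, per expected element, how many records lack it."""
    usage = element_usage(records)
    return {element: len(records) - usage[element] for element in expected}
```

Run against a freshly harvested set, `gaps(records, ["title", "date", "subject"])` shows at a glance which fields cannot be relied on for indexing or display.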

Page 38: Harvesting Metadata Using OAI-PMH

Source: 2002 Master’s Thesis, Jewel Hope Ward, UNC Chapel Hill

Five elements are used 71% of the time

Page 39: Harvesting Metadata Using OAI-PMH
Page 40: Harvesting Metadata Using OAI-PMH
Page 41: Harvesting Metadata Using OAI-PMH
Page 42: Harvesting Metadata Using OAI-PMH

Metadata Subsetting

DP sets are unlikely to serve all SP uses well
SPs will need the ability to subset harvested metadata
Example: prototype subsetting tool
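Subsetting on the service provider side amounts to selecting harvested records by their content rather than by the data provider's sets. The sketch below is an illustration only and is not the prototype tool mentioned on the slide; it assumes records are dicts of element name to list of values, and `subset` is a hypothetical name.

```python
def subset(records, element, needle):
    """Return the records whose `element` contains `needle`, ignoring case."""
    needle = needle.lower()
    return [
        record for record in records
        if any(needle in value.lower() for value in record.get(element, []))
    ]
```

For instance, `subset(records, "subject", "totem")` pulls a topical cluster out of a set that the data provider organized by collection.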

Page 43: Harvesting Metadata Using OAI-PMH
Page 44: Harvesting Metadata Using OAI-PMH

Metadata Normalization

Normalizing: to reduce to a standard or normal state
Prototype date normalization service screen
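For dates, a first normalization pass can simply pull the four-digit years out of free-text values. This is a minimal sketch and not the prototype service shown on the slide; it assumes years fall between 1000 and 2099, and `normalize_years` is a hypothetical name.

```python
import re

# A four-digit year (1000-2099) not embedded in a longer run of digits.
YEAR = re.compile(r"(?<!\d)(1[0-9]{3}|20[0-9]{2})(?!\d)")


def normalize_years(raw):
    """Return the distinct years found in a free-text date, sorted."""
    return sorted({int(year) for year in YEAR.findall(raw)})
```

Values with no recognizable year come back as an empty list, which flags them for manual review rather than guessing.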

Page 45: Harvesting Metadata Using OAI-PMH

Metadata Enrichment

Adding fields and/or qualifiers may be useful or required, for example:

  Metadata provider information
  Geographic coverage
  Subject terms mapped to a different thesaurus
  Authority control record

The enrichment process may be the same tool as the subsetting tool (i.e., find a cluster of records and perform an action)
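The "find a cluster of records and perform an action" pattern can be sketched as a function that takes a selection test and a value to add. This is an assumption-laden illustration: records are dicts of element name to list of values, and `enrich`, `matches`, and the `provider` element name used in the usage note are all hypothetical.

```python
def enrich(records, matches, element, value):
    """Add `value` to `element` on every record selected by `matches`.

    `matches` is a caller-supplied predicate (record -> bool).
    Returns the number of records that were changed.
    """
    changed = 0
    for record in records:
        if matches(record):
            values = record.setdefault(element, [])
            if value not in values:  # avoid duplicates on re-runs
                values.append(value)
                changed += 1
    return changed
```

For example, `enrich(records, lambda r: True, "provider", "Univ. of Washington")` stamps every record in a harvested batch with its source.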

Page 46: Harvesting Metadata Using OAI-PMH

Indexing & Display

Selected fields may need to be mapped to specific indexing and display elements
Particularly required if harvesting different metadata formats
But also needs to be done with multiple, conflicting fields:

<date>1863.</date>
<date>[2001 or 2002.]</date>

<identifier>SHS 1,679</identifier>
<identifier>http://content.lib.washington.edu/cgi-bin/htmlview.exe?CISOROOT=/loc&CISOPTR=58</identifier>
<identifier>http://content.lib.washington.edu/loc/image/1679.jpg</identifier>
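With several identifier values on one record, the display layer needs a rule for which one becomes the link. The sketch below uses one possible heuristic, stated here as an assumption: prefer a URL that does not point directly at an image file, and fall back to any URL. `pick_link` is a hypothetical name.

```python
# File extensions treated as direct image links (an assumed list).
IMAGE_SUFFIXES = (".jpg", ".jpeg", ".gif", ".png", ".tif", ".tiff")


def pick_link(identifiers):
    """Choose the identifier to use as the display link, or None."""
    urls = [
        value for value in identifiers
        if value.lower().startswith(("http://", "https://"))
    ]
    # Prefer a page describing the item over the bare image file.
    pages = [url for url in urls if not url.lower().endswith(IMAGE_SUFFIXES)]
    return (pages or urls or [None])[0]
```

Applied to the three identifiers above, this selects the htmlview.exe URL and leaves "SHS 1,679" available as a local call number for display.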

Page 47: Harvesting Metadata Using OAI-PMH

A Harvesting Service Model

Page 48: Harvesting Metadata Using OAI-PMH

The OAI Future

Further protocol development
Services layered on top of OAI-PMH
Shared software tools
Best practices for both DPs and SPs

Page 49: Harvesting Metadata Using OAI-PMH

oai-best.comm.nsdl.org