NLM Digital Collections Update for DCFedoraUsersGroup
January 22, 2013
John Doyle, National Library of Medicine
The Story So Far
- Texts
  - 7,866 books, incl. 225 multi-vol sets
  - Medical Heritage Library
- 1.7m pages
- In-house digitization
  - 1 multi-part report
- Audiovisuals
  - 70 films
  - 2 thematic collections
The Saga Continues
- Serials
  - NIH Institute annual reports
  - 61 volume printed index of historical citations
  - Journals may be coming soon
- Oral Histories
- Still Images
- Born-digital resources
- Citation dataset
Public Interface: “Digital Collections”
- Browse & Search (Muradora)
  - Supports multiple collections, diverse content
  - Resource display page: metadata, datastreams
- Book Viewer (NWU)
  - Open source software from Northwestern University
  - Open source JPEG2000 server (Djatoka)
- Video Player with Search (NLM)
  - Features video transcript search and play-ahead jump
  - HHS Innovates finalist (top 6), Fall 2011
Replacing Muradora
- Muradora codebase is aging
  - No community development or support
- Newer community projects reaching maturity
  - Islandora
  - Hydra
- Priority is to preserve/enhance resource search and browse
- Probably retain the book and video viewing applications
Current Developments
- Workflows
  - Increasingly concurrent content projects
  - Moving from project-specific to project-agnostic
- Data Services
  - Programmatic access: search web service
  - Bulk data
  - Need to pin down use cases
- Fedora framework upgrading
  - Journaling for propagating changes across multiple Fedora instances
Current Developments
- Periodic checksum checking
  - Make use of recent Fedora enhancements in this area
- Third copy of content
  - “Just in case” copy, not primary disaster recovery
  - Amazon Glacier seems to be a good fit
- Descriptive Metadata
  - More automated updating of ILS
  - Need to update Fedora/Solr post-ingest
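Periodic checksum checking can be pictured as recomputing a digest for each stored file and comparing it with the value recorded at ingest. The sketch below is a generic Python illustration, not Fedora's own mechanism; the function names and the manifest layout (file path mapped to a SHA-256 hex digest) are assumptions made for the example.

```python
import hashlib
from pathlib import Path


def sha256_of(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def find_fixity_failures(manifest):
    """Return the paths whose current digest differs from the recorded one.

    `manifest` maps a file path to the digest recorded at ingest
    (a hypothetical layout). Missing files count as failures too.
    """
    failures = []
    for path, expected in manifest.items():
        if not Path(path).is_file() or sha256_of(path) != expected:
            failures.append(path)
    return sorted(failures)
```

Run on a schedule, a check like this returns an empty list when every file still matches its recorded digest.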
Related Activities
- Internet Archive
  - Over 6,500 books uploaded as part of MHL project
  - Only selected datastreams going up
  - Expect to continue sending books to IA going forward
- Hathi Trust
  - Working group delivered recommendations last year
  - Participation could involve an IA-to-HT path
  - Some bibliographic challenges to be met
NLM Digital Collections: Support for Multi-volume texts
January 22, 2013
Nancy Fallgren, Doron Shalvi, National Library of Medicine
Outline
- Regular book processing
- Regular book data model and presentation
- What is a multi-volume?
- Multi-volume metadata issues
- Multi-volume scanning and identifiers
- Multi-volume metadata generation and workflow
- Asynchronous volume processing (a.k.a. Jail)
- Multi-volume data model and presentation
- Software adjustments
- Questions
Regular book processing
- Voyager record
  - One to one relationship between BIB record and digital object
- Metadata processing
  - MARCXML to OAI-DC and DMDINDEX
- Preingest process
  - Create derivatives
  - Generate FOXML
  - Locate files
- Ingest into Fedora
Regular book data model
ID        TYPE  MIMETYPE             LABEL
PID       -     -                    Fedora persistent identifier
DC        X     text/xml             Dublin Core metadata for this object
RELS-EXT  X     application/rdf+xml  RDF statements about this object
MARCXML   M     text/xml             MARCXML metadata
DMDINDEX  X     text/xml             DMDINDEX descriptive metadata
METS      M     text/xml             METS file for entire book
OCR       E     text/plain           Book OCR - full text of entire book
PDF       E     application/pdf      PDF of entire book
THUMB     E     image/jpeg           JPG Thumbnail image of selected page in book
Preview   E     image/jpeg           JPG Preview image of selected page in book
Regular book presentation
What is a Multi-volume?
- Multiple volume monographic series
  - All volumes share the same series title
  - Each volume may or may not have a unique title
  - The series has a finite beginning and end
- Unanalyzed cataloging, i.e., the entire set is cataloged as a single unit; individual volumes do not have their own catalog/BIB records
- Not journals or serials
Multi-volume metadata issues
- One to many relationship between the Voyager BIB record (for the series) and the digital objects (each volume)
  - NLM UID (MARC 035$9) is the basis for each digital object’s PID
  - Disambiguating volume titles
- Distinguishing multi-vol pre- and post-ingest processing workflows from monograph workflows
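Because the NLM UID in MARC 035$9 is the basis for each PID, the first step is reading that subfield out of the MARCXML. The following is a minimal Python sketch using the standard MARCXML namespace; the function names and the `namespace:uid` PID layout are illustrative assumptions, not the scheme actually used at NLM.

```python
import xml.etree.ElementTree as ET

# Standard MARCXML ("MARC21 slim") namespace.
MARC_NS = {"marc": "http://www.loc.gov/MARC21/slim"}


def nlm_uid(marcxml_text):
    """Return the first 035 $9 value in a MARCXML record, or None."""
    root = ET.fromstring(marcxml_text)
    node = root.find(
        ".//marc:datafield[@tag='035']/marc:subfield[@code='9']", MARC_NS
    )
    if node is None or not node.text:
        return None
    return node.text.strip()


def pid_for(uid, namespace="example"):
    """Build a PID from a UID. The 'namespace:uid' layout is hypothetical."""
    return f"{namespace}:{uid}"
```

With one BIB record covering a whole set, each volume still needs its own UID to produce a distinct PID, which is where the per-volume suffixes in the scanning spreadsheets come in.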
Scanning Spreadsheets: UIDs and volume nos.
From spreadsheet to XML
Set/Parent MARCXML
New child/volume MARCXML
Set/Parent DC
Child/Volume DC
Disambiguating Multi-volume workflows
- Transform pre-ingest manifests (UID lists)
  - Remove all UIDs with “X#” suffix
- Transform post-ingest manifests
  - Remove all “X#” suffixes from UIDs
  - De-dupe the remaining list
  - Add only set/parent url to BIB records
- DREPSERIES code
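The two manifest transforms above can be sketched in a few lines of Python. This assumes a UID is a base identifier optionally followed by "X" and a volume number (for example "12345X2"); the exact NLM format is not given in the slides, and the function names are hypothetical.

```python
import re

# Assumed UID shape: base identifier, optionally followed by "X" and a
# volume number (e.g. "12345X2"). The real format may differ.
VOLUME_SUFFIX = re.compile(r"X\d+$")


def pre_ingest_manifest(uids):
    """Drop every UID that carries an "X#" volume suffix."""
    return [uid for uid in uids if not VOLUME_SUFFIX.search(uid)]


def post_ingest_manifest(uids):
    """Strip "X#" suffixes, then de-duplicate, keeping first-seen order."""
    stripped = (VOLUME_SUFFIX.sub("", uid) for uid in uids)
    return list(dict.fromkeys(stripped))
```

For a manifest of `["100", "200X1", "200X2", "300"]`, the pre-ingest transform keeps only the unsuffixed UIDs, while the post-ingest transform collapses the two volumes of set 200 into a single parent entry.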
Asynchronous Volume processing, a.k.a. Jail
- Do not pass GO, do not collect $200
- Volumes are scanned and processed asynchronously
- Set object created for first child part
- Standard processing and review workflow
- Volumes held in Jail (no further processing) until all volumes pass manual review on Fedora QA system
- Once all volumes reviewed, full set promoted to Production
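The Jail rule amounts to a gate: nothing in a set moves on until every volume has passed review. A minimal Python sketch of that gate follows; the class names, the `reviewed` flag, and the `expected_count` field are assumptions for illustration, since the slides do not say how the system knows a set is complete.

```python
from dataclasses import dataclass, field


@dataclass
class Volume:
    uid: str
    reviewed: bool = False  # set once manual QA review has passed


@dataclass
class VolumeSet:
    expected_count: int  # assumed to be known up front, e.g. from the scanning spreadsheet
    volumes: dict = field(default_factory=dict)

    def add(self, volume):
        """Register a volume as it arrives; arrival order does not matter."""
        self.volumes[volume.uid] = volume

    def ready_for_production(self):
        """True only when every expected volume is present and reviewed."""
        return (
            len(self.volumes) == self.expected_count
            and all(v.reviewed for v in self.volumes.values())
        )
```

A set with two expected volumes stays in Jail while only one has arrived, and also while both have arrived but one is still unreviewed.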
Multi-volume set data model
ID        TYPE  MIMETYPE             LABEL
PID       -     -                    Fedora persistent identifier
DC        X     text/xml             Dublin Core metadata for this object
RELS-EXT  X     application/rdf+xml  RDF statements about this object
MARCXML   M     text/xml             MARCXML metadata
DMDINDEX  X     text/xml             DMDINDEX descriptive metadata
THUMB     E     image/jpeg           JPG Thumbnail image of selected page in set
Preview   E     image/jpeg           JPG Preview image of selected page in set
Same data model as book, but no METS, OCR or PDF
Multi-volume part data model
ID        TYPE  MIMETYPE             LABEL
PID       -     -                    Fedora persistent identifier
DC        X     text/xml             Dublin Core metadata for this object
RELS-EXT  X     application/rdf+xml  RDF statements about this object
MARCXML   M     text/xml             MARCXML metadata
DMDINDEX  X     text/xml             DMDINDEX descriptive metadata
METS      M     text/xml             METS file for entire book
OCR       E     text/plain           Book OCR - full text of entire book
PDF       E     application/pdf      PDF of entire book
THUMB     E     image/jpeg           JPG Thumbnail image of selected page in book
Preview   E     image/jpeg           JPG Preview image of selected page in book
Same data model as book
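The three data models differ only in whether METS, OCR and PDF are present, which makes a pre-ingest completeness check easy to express. The Python sketch below encodes the datastream IDs from the tables; the `book` key and the `missing_datastreams` helper are hypothetical names, while `mvset` and `mvpart` are the content model names used in this deck.

```python
# Datastream IDs taken from the data model tables.
BOOK_DATASTREAMS = frozenset({
    "DC", "RELS-EXT", "MARCXML", "DMDINDEX",
    "METS", "OCR", "PDF", "THUMB", "Preview",
})

# A set object has the book model minus the whole-book files.
SET_DATASTREAMS = BOOK_DATASTREAMS - {"METS", "OCR", "PDF"}

REQUIRED = {
    "book": BOOK_DATASTREAMS,
    "mvpart": BOOK_DATASTREAMS,  # a part uses the same model as a book
    "mvset": SET_DATASTREAMS,
}


def missing_datastreams(model, present):
    """Return the sorted datastream IDs that `model` requires but `present` lacks."""
    return sorted(REQUIRED[model] - set(present))
```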
Multi-volume relationships
- Set fedora:hasPart Part
- Part fedora:isPartOf Set
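In Fedora 3 these relationships live in each object's RELS-EXT datastream as RDF/XML, using the relations-external namespace. The sketch below shows one way to generate such a datastream in Python; the `rels_ext` function and the `demo:` PIDs in the usage note are hypothetical, and a real RELS-EXT would normally also carry a content model assertion, which is omitted here.

```python
import xml.etree.ElementTree as ET

RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
REL = "info:fedora/fedora-system:def/relations-external#"


def rels_ext(pid, relations):
    """Serialize a RELS-EXT RDF/XML document for the object `pid`.

    `relations` is a list of (predicate, target_pid) pairs, where the
    predicate is "hasPart" or "isPartOf".
    """
    ET.register_namespace("rdf", RDF)
    ET.register_namespace("fedora", REL)
    root = ET.Element(f"{{{RDF}}}RDF")
    description = ET.SubElement(
        root, f"{{{RDF}}}Description", {f"{{{RDF}}}about": f"info:fedora/{pid}"}
    )
    for predicate, target in relations:
        ET.SubElement(
            description,
            f"{{{REL}}}{predicate}",
            {f"{{{RDF}}}resource": f"info:fedora/{target}"},
        )
    return ET.tostring(root, encoding="unicode")
```

A set would call `rels_ext("demo:set", [("hasPart", "demo:vol1"), ("hasPart", "demo:vol2")])`, and each part would call `rels_ext("demo:vol1", [("isPartOf", "demo:set")])`, so the link can be followed in either direction.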
Multi-volume presentation - set
Multi-volume presentation - part
Software adjustments
- Creation of new content models: mvset, mvpart
- New process to generate FOXML, capture thumb
- New relationships in RELS-EXT
- Adjustment of UI and business logic to handle sets: link to all parts, query part names from Solr
- Adjustment of UI to handle child parts: link back to set
- Hide basic display of dc.relation; info in hotlinks instead
- More abstract content models, to reduce redundant changes, would have helped