
NLM Digital Collections
Update for DCFedoraUsersGroup

January 22, 2013

John Doyle
National Library of Medicine

The Story So Far

Texts
– 7,866 books, incl. 225 multi-vol sets
– Medical Heritage Library
    1.7m pages
    In-house digitization
– 1 multi-part report

Audiovisuals
– 70 films
– 2 thematic collections

The Saga Continues

Serials
– NIH Institute annual reports
– 61-volume printed index of historical citations
– Journals may be coming soon

Oral Histories
Still Images
Born-digital resources
Citation dataset

Public Interface: “Digital Collections”

Browse & Search (Muradora)
– Supports multiple collections, diverse content
– Resource display page: metadata, datastreams

Book Viewer (NWU)
– Open source software from Northwestern University
– Open source JPEG2000 server (Djatoka)

Video Player with Search (NLM)
– Features video transcript search and play-ahead jump
– HHS Innovates finalist (top 6), Fall 2011

Replacing Muradora

Muradora codebase is aging
– No community development or support

Newer community projects reaching maturity
– Islandora
– Hydra

Priority is to preserve/enhance resource search and browse

Probably retain the book and video viewing applications

Current Developments

Workflows
– Increasingly concurrent content projects
– Moving from project-specific to project-agnostic

Data Services
– Programmatic access: search web service
– Bulk data
– Need to pin down use cases

Fedora framework upgrading
– Journaling for propagating changes across multiple Fedora instances

Current Developments

Periodic checksum checking
– Make use of recent Fedora enhancements in this area

Third copy of content
– “Just in case” copy, not primary disaster recovery
– Amazon Glacier seems to be a good fit

Descriptive Metadata
– More automated updating of ILS
– Need to update Fedora/Solr post-ingest
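The periodic checksum checking mentioned above can be illustrated with a generic file-level sketch. This is not the Fedora checksum feature the slide refers to; it only shows the idea of recomputing a digest and comparing it with a recorded value. The function names `file_checksum` and `audit`, the choice of SHA-256, and the path-to-digest mapping are all assumptions made for illustration.

```python
import hashlib
from pathlib import Path


def file_checksum(path, algorithm="sha256", chunk_size=1 << 20):
    """Return the hex digest of a file, read in chunks to bound memory use."""
    digest = hashlib.new(algorithm)
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def audit(expected):
    """Compare recorded digests with freshly computed ones.

    `expected` maps a file path to the hex digest recorded at ingest time
    (a hypothetical record format). Returns the paths that are missing or
    whose current digest no longer matches.
    """
    failures = []
    for path, recorded in expected.items():
        if not Path(path).is_file() or file_checksum(path) != recorded:
            failures.append(path)
    return failures
```

Run on a schedule, `audit` would return an empty list while every stored copy still matches its recorded digest.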

Related Activities

Internet Archive– Over 6,500 books uploaded as part of MHL

project– Only selected datastreams going up– Expect to continue sending books to IA going

forward Hathi Trust

– Working group delivered recommendations last year

– Participation could involve an IA-to-HT path– Some bibliographic challenges to be met

NLM Digital Collections
Support for Multi-volume texts

January 22, 2013

Nancy Fallgren, Doron Shalvi
National Library of Medicine

Outline

Regular book processing
Regular book data model and presentation
What is a multi-volume?
Multi-volume metadata issues
Multi-volume scanning and identifiers
Multi-volume metadata generation and workflow
Asynchronous volume processing (a.k.a. Jail)
Multi-volume data model and presentation
Software adjustments
Questions

Regular book processing

Voyager record
– One to one relationship between BIB record and digital object

Metadata processing
– MARCXML to OAI-DC and DMDINDEX

Preingest process
– Create derivatives
– Generate FOXML
– Locate files

Ingest into Fedora

Regular book data model

ID        TYPE  MIMETYPE             LABEL
PID       -     -                    Fedora persistent identifier
DC        X     text/xml             Dublin Core metadata for this object
RELS-EXT  X     application/rdf+xml  RDF statements about this object
MARCXML   M     text/xml             MARCXML metadata
DMDINDEX  X     text/xml             DMDINDEX descriptive metadata
METS      M     text/xml             METS file for entire book
OCR       E     text/plain           Book OCR - full text of entire book
PDF       E     application/pdf      PDF of entire book
THUMB     E     image/jpeg           JPG Thumbnail image of selected page in book
Preview   E     image/jpeg           JPG Preview image of selected page in book

Regular book presentation


What is a Multi-volume?

Multiple volume monographic series
– All volumes share the same series title
– Each volume may or may not have a unique title
– The series has a finite beginning and end

Unanalyzed cataloging, i.e., the entire set is cataloged as a single unit; individual volumes do not have their own catalog/BIB records

Not journals or serials

Multi-volume metadata issues

One to many relationship between the Voyager BIB record (for the series) and the digital objects (each volume)
– NLM UID (MARC 035$9) is the basis for each digital object’s PID
– Disambiguating volume titles

Distinguishing multi-vol pre- and post-ingest processing workflows from monograph workflows

Scanning Spreadsheets: UIDs and volume nos.

From spreadsheet to XML

Set/Parent MARCXML

New child/volume MARCXML

Set/Parent DC

Child/Volume DC

Disambiguating Multi-volume workflows

Transform pre-ingest manifests (UID lists)
– Remove all UIDs with “X#” suffix

Transform post-ingest manifests
– Remove all “X#” suffixes from UIDs
– De-dupe the remaining list
– Add only set/parent url to BIB records

DREPSERIES code
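The two manifest transforms above can be sketched in a few lines. This is only an illustration under stated assumptions: the slides do not define the “X#” suffix precisely, so the sketch assumes it means the letter X followed by a volume number at the end of the UID. The function names and the sample UIDs are hypothetical.

```python
import re

# Assumption: a volume UID is the set UID followed by "X" and a volume number.
VOLUME_SUFFIX = re.compile(r"X\d+$")


def pre_ingest_manifest(uids):
    """Drop every UID that carries an "X#" volume suffix."""
    return [uid for uid in uids if not VOLUME_SUFFIX.search(uid)]


def post_ingest_manifest(uids):
    """Strip "X#" suffixes, then de-duplicate while keeping first-seen order."""
    stripped = (VOLUME_SUFFIX.sub("", uid) for uid in uids)
    return list(dict.fromkeys(stripped))


# Made-up UIDs: one monograph and three volumes of one set.
sample = ["101234567", "101555555X1", "101555555X2", "101555555X3"]
pre = pre_ingest_manifest(sample)
post = post_ingest_manifest(sample)
```

With the sample list, the post-ingest transform leaves one entry per set/parent, which matches the goal of adding only the set/parent URL to the BIB record.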

Asynchronous Volume processing a.k.a. Jail

Do not pass GO, do not collect $200
Volumes are scanned and processed asynchronously
Set object created for first child part
Standard processing and review workflow
Volumes held in Jail (no further processing) until all volumes pass manual review on Fedora QA system
Once all volumes reviewed, full set promoted to Production
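The Jail rule can be modeled as a simple gate: a set is promoted only when every volume has arrived and passed review. The sketch below is a hypothetical model, not the NLM implementation. In particular, the `expected_volumes` count is an assumption, since the slides do not say how the workflow knows a set is complete, and the class and method names are invented for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Volume:
    uid: str
    reviewed: bool = False  # set to True after manual review on the QA system


@dataclass
class VolumeSet:
    uid: str
    expected_volumes: int  # assumption: the set size is known in advance
    volumes: dict = field(default_factory=dict)

    def add_volume(self, volume):
        """Register a volume as it finishes scanning and processing."""
        self.volumes[volume.uid] = volume

    def ready_for_production(self):
        """True only when all expected volumes are present and reviewed."""
        all_present = len(self.volumes) == self.expected_volumes
        return all_present and all(v.reviewed for v in self.volumes.values())
```

Until `ready_for_production()` returns true, every volume of the set stays in Jail, including those that have already passed review.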

Multi-volume set data model

THUMB     E     image/jpeg           JPG Thumbnail image of selected page in set
Preview   E     image/jpeg           JPG Preview image of selected page in set

Same data model as book, but no METS, OCR or PDF

Multi-volume part data model

METS      M     text/xml             METS file for entire book
OCR       E     text/plain           Book OCR - full text of entire book
PDF       E     application/pdf      PDF of entire book
THUMB     E     image/jpeg           JPG Thumbnail image of selected page in book
Preview   E     image/jpeg           JPG Preview image of selected page in book

Same data model as book

Multi-volume relationships

Set  -- fedora:hasPart -->  Part
Part -- fedora:isPartOf --> Set
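As a rough sketch of how these relationships could be written into a RELS-EXT datastream, the function below builds the RDF/XML with the Python standard library. The namespace is the Fedora relations-external ontology, which defines `hasPart` and `isPartOf`. The function name `rels_ext` and the `example:` PIDs are assumptions for illustration, not the PID scheme used at NLM.

```python
import xml.etree.ElementTree as ET

RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
REL_NS = "info:fedora/fedora-system:def/relations-external#"


def rels_ext(subject_pid, predicate, object_pids):
    """Return RELS-EXT RDF/XML linking one object to others.

    `predicate` is "hasPart" (on the set) or "isPartOf" (on a part).
    """
    ET.register_namespace("rdf", RDF_NS)
    ET.register_namespace("fedora", REL_NS)
    root = ET.Element(f"{{{RDF_NS}}}RDF")
    description = ET.SubElement(
        root,
        f"{{{RDF_NS}}}Description",
        {f"{{{RDF_NS}}}about": f"info:fedora/{subject_pid}"},
    )
    for pid in object_pids:
        ET.SubElement(
            description,
            f"{{{REL_NS}}}{predicate}",
            {f"{{{RDF_NS}}}resource": f"info:fedora/{pid}"},
        )
    return ET.tostring(root, encoding="unicode")


# Hypothetical PIDs: the set points at its parts, each part points back.
set_rdf = rels_ext("example:set", "hasPart", ["example:setX1", "example:setX2"])
part_rdf = rels_ext("example:setX1", "isPartOf", ["example:set"])
```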

Multi-volume presentation - set


Multi-volume presentation - part


Software adjustments

Creation of new content models: mvset, mvpart
New process to generate FOXML, capture thumb
New relationships in RELS-EXT
Adjustment of UI and business logic to handle sets: link to all parts, query part names from Solr
Adjustment of UI to handle child parts: link back to set
Hide basic display of dc.relation: info in hotlinks instead
More abstract content models, to reduce redundant changes, would have helped
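Querying part names from Solr might look something like the sketch below, which only builds the request URL. The `q`, `fl`, `rows` and `wt` parameters are standard Solr select parameters, but the field names `rels.isPartOf`, `dc.title` and `PID`, the base URL, and the function name are all hypothetical; the actual index schema is not given in the slides.

```python
from urllib.parse import urlencode


def part_names_query(solr_base, set_pid,
                     relation_field="rels.isPartOf", title_field="dc.title"):
    """Build a Solr select URL listing the parts that point at a set.

    Field names are assumptions; substitute those of the real index.
    """
    params = {
        # Quote the value because the PID URI contains colons.
        "q": f'{relation_field}:"info:fedora/{set_pid}"',
        "fl": f"PID,{title_field}",
        "rows": 500,
        "wt": "json",
    }
    return f"{solr_base.rstrip('/')}/select?{urlencode(params)}"


url = part_names_query("http://localhost:8983/solr/", "example:set")
```

The set display page could issue such a query once and render each returned title as a link to the corresponding part.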

Demonstration

http://collections.nlm.nih.gov
