HARVARD UNIVERSITY LIBRARY The Global Digital Format Registry (GDFR) Project Stephen Abrams Harvard University Andreas Stanescu OCLC CNI Fall Task Force Meeting Washington, DC, December 10-11, 2007


HARVARD UNIVERSITY LIBRARY

The Global Digital Format Registry (GDFR) Project

Stephen Abrams, Harvard University

Andreas Stanescu, OCLC

CNI Fall Task Force Meeting, Washington, DC, December 10-11, 2007


Digital preservation and format

• Preservation is concerned with ensuring access to managed digital assets over time

• Thus, preservation activities are focused on

– Viability

– Fixity

– Authenticity

– Interpretability

– Renderability

• The last two are primarily a function of format


Without format typing, all content is opaque

ffd8ffe000104a46494600010201008300830000ffed0fb050686f746f73686f7020332e30003842494d03e90a5072696e7420496e666f000000007800000000004800480000000002f40240ffeeffee030602520347052803fc00020000004800480000000002d80228000100000064000000010003030300000001270f0001000100000000000000000000000060080019019000000000000000000000000000000000000000000000000000000000000000003842494d03ed0a5265736f6c7574696f6e0000000010008313a3000200 ...


SOI | APP0 JFIF 1.2 | APP13 IPTC | APP2 ICC | DQT | SOF0 183x512 | DRI | DHT | SOS | ECS0 | RST0 | ECS1 | RST1 | ECS2 | ...
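
The decoding shown on the slide can be reproduced for the first few bytes of the dump. The following is a minimal sketch, assuming the bytes are a JPEG/JFIF stream (which is exactly the format knowledge the slide argues is required); the helper name `read_jfif_header` and the dictionary keys are illustrative, not part of GDFR.

```python
import struct

# First 20 bytes of the hex dump shown above.
HEADER = bytes.fromhex("ffd8ffe000104a46494600010201008300830000")


def read_jfif_header(data: bytes) -> dict:
    """Interpret the start of a byte stream as JPEG/JFIF (illustrative helper)."""
    if data[:2] != b"\xff\xd8":
        raise ValueError("no SOI marker: not a JPEG stream")
    # Each segment starts with a 2-byte marker and a 2-byte big-endian length.
    marker, length = struct.unpack(">HH", data[2:6])
    if marker != 0xFFE0 or data[6:11] != b"JFIF\x00":
        raise ValueError("no APP0/JFIF segment after SOI")
    major, minor, units, x_density, y_density = struct.unpack(">BBBHH", data[11:18])
    return {
        "segment_length": length,
        "version": (major, minor),
        "units": units,
        "x_density": x_density,
        "y_density": y_density,
    }


info = read_jfif_header(HEADER)
print(info["version"])  # (1, 2), the "JFIF 1.2" label on the slide
```

Without the format rules encoded in the function, the same 20 bytes carry no recoverable meaning.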



Global Digital Format Registry

“The Global Digital Format Registry (GDFR) will provide sustainable services to collect, review, store, discover, and deliver significant representation information about digital formats.”

– Centrally-organized collection and review

– Distributed storage, discovery, and delivery on a network of independent, but cooperating registries


What is a format?

• “A serialized encoding of an abstract information model”

• Encompasses the nominal sense of “file format” as well as a range of conceptual entities from the micro to the macro level

– IEEE 754 floating point number

– File system

– In both cases, there are well-defined syntactic and semantic rules for mapping from information to bits, and back again
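
The micro-level case can be made concrete. The sketch below uses Python's standard `struct` module to map a number to its IEEE 754 single-precision bits and back; the function names are illustrative.

```python
import struct


def encode_float32(value: float) -> bytes:
    """Serialize a number to big-endian IEEE 754 single-precision bits."""
    return struct.pack(">f", value)


def decode_float32(bits: bytes) -> float:
    """Recover the number from its 4-byte IEEE 754 encoding."""
    return struct.unpack(">f", bits)[0]


bits = encode_float32(1.0)
print(bits.hex())            # 3f800000
print(decode_float32(bits))  # 1.0
```

The abstract information model is "a real number"; the format is the rule that turns it into four bytes and back.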


What’s wrong with MIME types?

• Non-standardized documentation

• Intended for human, not machine consumption

• Coarse granularity

– image/tiff vs. TIFF 4.0 – 6.0; Baseline Class B, G, P, R; Extension Class Y; TIFF/EP; TIFF/IT with file types CT, LW, HC, MP, BP, BL, FP; Exif 2.0 – 2.2; GeoTIFF; TIFF/FX; DNG


GDFR project

• Two DLF-sponsored invitational workshops

– University of Pennsylvania, January 2003

– Washington, March 2003

• Two independent demonstration projects

– FRED [John Ockerbloom, University of Pennsylvania] tom.library.upenn.edu/fred/

– FOCUS [Joseph JaJa, University of Maryland] www.umiacs.umd.edu/~joseph/focus-archiving06.pdf


GDFR project

• Harvard University Library (HUL) funded for 2 years by the Andrew W. Mellon Foundation

• Staffing and technical work subcontracted by HUL to OCLC (July 2006)


GDFR project oversight

• Technical Working Group (TWG)

– Bibliothèque nationale de France

– British Library

– California Digital Library

– Digital Curation Centre, UK

– Library of Congress

– National Archives, UK

– National Archives and Records Administration

– National Library of Australia

– National Library of New Zealand

– Stanford University

– University of Pennsylvania


General development goals

• A generalized registry framework, specialized for the distributed GDFR application

• Based on well-known products and protocols

• Human and machine interfaces

• Full information content expressible in XML form, and can be re-instantiated from that expression

• Platform independence

• Globally fault tolerant

• Open source
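
The XML goal above amounts to a round-trip requirement: a record exported to XML must be recoverable from that XML alone. A minimal sketch follows, using Python's standard `xml.etree.ElementTree`; the element names (`format`, `name`, `version`, `mimeType`) are hypothetical and do not reflect the actual GDFR schema.

```python
import xml.etree.ElementTree as ET


def to_xml(record: dict) -> str:
    """Express a flat record as XML (element names are hypothetical)."""
    root = ET.Element("format")
    for key, value in record.items():
        ET.SubElement(root, key).text = value
    return ET.tostring(root, encoding="unicode")


def from_xml(text: str) -> dict:
    """Re-instantiate the record from its XML expression."""
    root = ET.fromstring(text)
    return {child.tag: child.text for child in root}


record = {"name": "TIFF", "version": "6.0", "mimeType": "image/tiff"}
restored = from_xml(to_xml(record))
print(restored == record)  # True
```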


GDFR data model

• Consistent with PRONOM registry

[Diagram: entity types Format, Relationship, Agent, Document, Software, Hardware, Media; edge labels: dependencies; author, owner, maintainer; documentation]

Format attributes: Identifier, Name, Version, Classification, Description, ReleaseDate, WithdrawalDate, Rights, Signature, Byte order, Grammar, Assessment
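
The Format entity and its attributes could be sketched as plain data classes. This is an illustration only: the field names follow the attribute labels on the slide, but the types, defaults, and the shape of `Relationship` are assumptions, not the GDFR schema.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Format:
    """One registry entry; fields mirror the attribute labels on the slide."""
    identifier: str
    name: str
    version: Optional[str] = None
    classification: dict = field(default_factory=dict)
    description: str = ""
    release_date: Optional[str] = None
    withdrawal_date: Optional[str] = None
    rights: Optional[str] = None
    signatures: list = field(default_factory=list)
    byte_order: Optional[str] = None
    grammar: Optional[str] = None
    assessment: Optional[str] = None


@dataclass
class Relationship:
    """A typed link between two formats (assumed shape)."""
    kind: str
    source: str  # identifier of the first format
    target: str  # identifier of the related format


tiff = Format(identifier="info:rfa/gdfr1/Formats/1", name="TIFF")
tiff.classification["genre"] = "still-image"
print(tiff.name, tiff.classification)
```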


Identifiers

• Canonical, GDFR-assigned identifier

– “info” URI info:rfa/gdfr1/Formats/1

• Other well-known identifiers

– Common name “TIFF”, “Tagged Image File Format”

– MIME type image/tiff

– PRONOM identifier info:pronom/fmt/7

– Library of Congress Format Description Document (FDD) identifier fdd000022
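
One way to read this slide is as a crosswalk from any well-known identifier to the canonical one. The sketch below assumes, for simplicity, that every alias on the slide maps to a single canonical entry; in practice a coarse alias such as a MIME type may correspond to several registered formats. The `resolve` function is hypothetical.

```python
CANONICAL = "info:rfa/gdfr1/Formats/1"

# Aliases taken from the slide; the one-to-one mapping is an assumption.
ALIASES = {
    "TIFF": CANONICAL,
    "Tagged Image File Format": CANONICAL,
    "image/tiff": CANONICAL,
    "info:pronom/fmt/7": CANONICAL,
    "fdd000022": CANONICAL,
}


def resolve(identifier: str) -> str:
    """Return the canonical identifier for any known alias."""
    if identifier == CANONICAL:
        return identifier
    try:
        return ALIASES[identifier]
    except KeyError:
        raise LookupError(f"unknown identifier: {identifier}") from None


print(resolve("info:pronom/fmt/7"))  # info:rfa/gdfr1/Formats/1
```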


Classification scheme

• Eight facets

– Genre (required) text, still-image, sound, aggregate, …

– Role (required) family, file-format, encoding, serialization

– Composition unitary, container-bundle, container-wrapper

– Form binary, text

– Constraint structured, unstructured

– Basis sampled, symbolic

– Domain astronomy, cad-cam, gis, web-archive, …

– Transform compression, encryption, message-digest, …


Classification scheme

• Examples

– TIFF (Tagged Image File Format) genre:still-image role:family composition:container-wrapper form:binary basis:sampled

– LZW (Lempel-Ziv-Welch) genre:still-image role:encoding transform:compression

– SVG (Scalable Vector Graphics) genre:still-image role:file-format form:text basis:symbolic
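
The facet scheme lends itself to a simple validity check. In this sketch the closed value lists are copied from the facet slide, facets whose lists end in "…" are treated as open-ended, and the `validate` function and its message wording are illustrative assumptions.

```python
# None marks a facet whose value list is open-ended on the slide.
FACETS = {
    "genre": None,
    "role": {"family", "file-format", "encoding", "serialization"},
    "composition": {"unitary", "container-bundle", "container-wrapper"},
    "form": {"binary", "text"},
    "constraint": {"structured", "unstructured"},
    "basis": {"sampled", "symbolic"},
    "domain": None,
    "transform": None,
}
REQUIRED = {"genre", "role"}


def validate(classification: dict) -> list:
    """Return a list of problems; an empty list means the classification passes."""
    missing = sorted(REQUIRED - set(classification))
    problems = [f"missing required facet: {name}" for name in missing]
    for name, value in classification.items():
        if name not in FACETS:
            problems.append(f"unknown facet: {name}")
        elif FACETS[name] is not None and value not in FACETS[name]:
            problems.append(f"invalid value for {name}: {value}")
    return problems


tiff = {
    "genre": "still-image",
    "role": "family",
    "composition": "container-wrapper",
    "form": "binary",
    "basis": "sampled",
}
print(validate(tiff))  # []
```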


Signatures

• External signatures

– File extension

– Mac OS type

– Mac OS X Uniform Type Identifiers (UTI)

• Internal signatures

– “Magic numbers”

– Required vs. optional

– Fixed vs. restricted vs. unrestricted
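
Internal signatures of the fixed, required kind can be matched by comparing bytes at a known offset. The sketch below uses a handful of widely known magic numbers; the table layout and the `identify` function are illustrative and are not how GDFR stores or evaluates signatures.

```python
# (format name, byte offset, magic bytes)
SIGNATURES = [
    ("JPEG", 0, b"\xff\xd8\xff"),
    ("TIFF (little-endian)", 0, b"II*\x00"),
    ("TIFF (big-endian)", 0, b"MM\x00*"),
    ("PDF", 0, b"%PDF-"),
]


def identify(data: bytes) -> list:
    """Return the names of all formats whose magic number matches the data."""
    return [
        name
        for name, offset, magic in SIGNATURES
        if data[offset:offset + len(magic)] == magic
    ]


print(identify(bytes.fromhex("ffd8ffe000104a464946")))  # ['JPEG']
```

An empty result means no registered signature matched, which is different from saying the content has no format.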


Grammar

• Formal description of the syntactic grammar underlying a format, expressed in some formal typed notation

– BNF Backus-Naur Form

– BSDL MPEG-21 Bitstream Syntax Description Language

– DFDL Data Format Description Language

– EAST CCSDS 644.0-B-2

– XCEL Extensible Characterisation Extraction Language


Assessment

• Assessment of a format, expressed in some formal typed notation

– Cornell Virtual Remote Control (VRC)

– DSTC PANIC

– Library of Congress Sustainability, Quality, Function (SQF)

– National Library of Australia AONS

– OCLC INFORM


Documentation

• Specification documents (and software files) can be managed and distributed in the network

– Applicable only in cases of public domain resources or if explicit permission is granted by rights holders

– Other documents (and software) will be referenced by full citation, including actionable links where possible

– Mechanism for individuals or institutions to register locally-held copies, with terms of use


Software

• Format role Input, output

• Process type Characterize, create, edit, identify, …

• Enables discovery of transformative processing chains

[Diagram: PDF → Transform (pdf2ps) → PostScript → Transform (ps2ascii) → ASCII → Render (Notepad)]
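
Discovering a processing chain is a path search over tools described by their input and output formats. The following sketch runs a breadth-first search over the two transform tools named in the slide's example; the tuple layout and the `find_chain` function are assumptions for illustration.

```python
from collections import deque

# (process type, tool name, input format, output format), from the slide's example.
TOOLS = [
    ("transform", "pdf2ps", "PDF", "PostScript"),
    ("transform", "ps2ascii", "PostScript", "ASCII"),
]


def find_chain(source: str, target: str):
    """Breadth-first search for the shortest tool chain from source to target."""
    queue = deque([(source, [])])
    seen = {source}
    while queue:
        fmt, chain = queue.popleft()
        if fmt == target:
            return chain
        for _kind, tool, tool_input, tool_output in TOOLS:
            if tool_input == fmt and tool_output not in seen:
                seen.add(tool_output)
                queue.append((tool_output, chain + [tool]))
    return None  # no chain of registered tools connects the two formats


print(find_chain("PDF", "ASCII"))  # ['pdf2ps', 'ps2ascii']
```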


Relationships

• Modification BWF → WAVE

– Extension DNG → TIFF 6.0

– Restriction PDF/A → PDF 1.4

• Definition NITF → XML DTD

• Requisite XML → Relax NG

• Containment ZIP → *

• Equivalence DXF (ASCII) → DXF (binary)

• Version Word 97 → Word 6.0

• Affinity SPIFF → JPEG


GDFR node

• Based on the OCLC IWSA / RFA framework

[Architecture diagram. Layers: Public service layer, Canonical service layer, Collection layer, Storage layer. Protocol labels: SRU/W, OAI, SRU Update, RSS, Atom, TCP/IP. Storage labels: XML, RDBMS. Operation labels: Add, Delete, Update, Search, Create, Display, Admin, Data, Content, History, Export, Import]


GDFR node

• Java, Apache/Tomcat, Berkeley DB XML

• GNU LGPL license

– Including pre-existing OCLC technology and technology newly-developed for the project

• Release schedule

– v0.1 (alpha) March 23, 2007

– v0.1 (beta) June 14, 2007

– v1.0 June 30, 2007

– v1.1 August 12, 2007

– v1.3 September 17, 2007

– v1.3.1 October 26, 2007


GDFR network

• Peer-to-peer network of independent, but cooperating registries communicating over a common protocol

[Diagram: a root GDFR node and three GDFR nodes connected by the GDFR protocol. Labels: editorial process; submissions for technical vetting; vetted for propagation; data propagation]


GDFR network

• Public notification of the availability of new data

– RSS feed available at well-known public address to which remote nodes can subscribe

• Remote harvesting of local data

– OAI-PMH (Open Archives Initiative Protocol for Metadata Harvesting)

• Initially, a single source (root node) for all new data
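
The harvesting step could look like this from a remote node's side. `ListRecords`, `metadataPrefix`, and `from` are standard OAI-PMH request parameters; the base URL, the choice of `oai_dc`, and the function name are hypothetical, and the sketch only builds the request URL without contacting any server.

```python
from urllib.parse import urlencode

# Hypothetical endpoint; a real node would publish its own OAI-PMH base URL.
BASE_URL = "https://registry.example.org/oai"


def list_records_url(base_url, metadata_prefix="oai_dc", from_date=None):
    """Build an OAI-PMH ListRecords request, optionally limited to recent changes."""
    params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    if from_date:
        params["from"] = from_date  # harvest only records changed since this date
    return f"{base_url}?{urlencode(params)}"


url = list_records_url(BASE_URL, from_date="2007-10-26")
print(url)
```

A node that sees a new item in the root node's RSS feed could issue such a request with `from` set to the date of its last successful harvest.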


Project status

• Extensive internal testing of GDFR software in a stand-alone mode

• Current project activities are focused on

– Implementing the distribution and synchronization functions

– Building the network

– Data acquisition

– Succession planning


Initial population

• Manual addition is possible, but time consuming

• Automated update using Atom

• What sources are available for bulk population?

– PRONOM registry www.nationalarchives.gov.uk/pronom

– Library of Congress Format Description Documents (FDD) www.digitalpreservation.gov/formats/fdd/descriptions.shtml

– Unix / Linux magic(4) database


Subsequent population

• RFC 2026, Internet Standards Process www.ietf.org/rfc/rfc2026.txt

– “Iterations of review by the ... community and revision based upon experience”

• Draft distribution and public discussion

• Approval by “area” editors

• Release to the network for distribution


Sustainability

• The technological solution is the (relatively) easy part, but…

– The technology is expendable

– The important point is for the data to survive, evolve, and expand


Governance and succession

• Mellon funding was for technical work only

• At the end of the two year project…

– Harvard will continue maintenance for up to two years

– Library of Congress has agreed to be a care-taker agency until a permanent body is identified


Governance and succession

• NARA GDFR governance investigation

– Part of the Electronic Records Archives (ERA) initiative

– GDFR Governance Workshop, November 2007

• Bibliothèque et Archives, Canada

• Corp. for National Research Initiatives

• Digital Curation Centre, UK

• Digital Library Federation

• General Services Administration

• Georgia Institute of Technology

• Government Printing Office

• Harvard University

• IBM Watson Research Center

• Koninklijke Bibliotheek, Netherlands

• Library of Congress

• MIT

• NARA

• NASA

• NIST

• National Library of Australia

• National Library of New Zealand

• San Diego Supercomputer Center

• Stanford University

• Statens Archiv, Sweden

• Tessella Support Services

• University of Pennsylvania


Administrative considerations

• Policy

– Who (and how many) can join the network?

– What are the eligibility requirements?

– What are the rights and obligations of membership?

• Technical

– Who will maintain and enhance the data model?

– Who will maintain, enhance, distribute, and support the software?


Administrative considerations

• Data

– Who will contribute data?

– Who will vouch for data authenticity?

– Who will ensure data integrity?

• Financial

– What are the real human and system costs associated with GDFR operation?

– Who pays, and how?


Summary

• The GDFR is an enabling technology that will support digital repository and preservation activities

– Supports the strong typing of digital assets at an appropriate level of granularity

– Enables the future recovery of the syntax and semantics associated with typed digital assets

– A means to pool and redistribute the expertise of the international digital preservation community


For more information…

www.formatregistry.org
