Global Digital Format Registry
Stephen L. Abrams, Harvard University Library
MacKenzie Smith, Massachusetts Institute of Technology
DLF Spring Forum New York, May 14-16, 2003
Why Do We Need a Registry?
• Repository functions are performed on a format-specific basis
• Interpretation of otherwise opaque content streams is dependent upon knowledge of how typed content is represented
• Interchange requires mutual agreement of format syntax and semantics
Potential Use Cases
• Identification
  – “I have a digital object; what format is it?”
• Validation
  – “I have an object purportedly of format F; is it?”
• Transformation
  – “I have an object of format F, but need G; how can I produce it?”
• Characterization
  – “I have an object of format F; what are its significant properties?”
• Risk assessment
  – “I have an object of format F; is it at risk of obsolescence?”
• Delivery
  – “I have an object of format F; how can I render it?”
Repository Format Dependencies
• Ingest
  – Validation
  – SIP-to-AIP
• Access
  – AIP-to-DIP
  – Rendering
• Preservation planning
  – Migration
  – Emulation
  – UVC
Repository Format Dependencies
[Figure: repository functional diagram. Labels: SIP; Ingest; QA; Generate AIP; AIP; Data Management; Descriptive Metadata; Archival storage; Content and representation information; Administer; Manage; Access; Discovery; Generate DIP; Delivery; DIP; Preservation; Strategies; Monitoring; Migration; Emulation.]
Repository Format Dependencies
[Figure: repository functional diagram (SIP, Ingest, AIP, Data Management, Archival storage, Administer, Manage, Access, DIP, Preservation) with a Format registry added. Registry-related labels: Validate SIP; Transform SIP-to-AIP; Transform AIP-to-DIP; Metadata for encapsulation/archaeology.]
What’s Wrong with MIME Types?
• Insufficient depth of detail
  – Syntax and semantics
  – Public and proprietary
• Insufficient granularity
  – Both tiled RGB TIFF with LZW and striped bi-tonal TIFF with Group 4 → image/tiff
  – All of PDF 1.0 – 1.4, PDF/X-1 – 3, and PDF/A → application/pdf
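The granularity problem can be made concrete with a small sketch. The fine-grained labels below are illustrative strings only, not registry identifiers; the point is simply that distinct formats collapse onto a single MIME type.

```python
from collections import defaultdict

# Hypothetical fine-grained format labels (not real registry identifiers),
# each paired with the MIME type named on the slide.
FINE_GRAINED_TO_MIME = {
    "TIFF, tiled RGB, LZW": "image/tiff",
    "TIFF, striped bi-tonal, Group 4": "image/tiff",
    "PDF 1.4": "application/pdf",
    "PDF/X-1": "application/pdf",
    "PDF/A": "application/pdf",
}


def group_by_mime(mapping):
    """Invert the mapping to show which distinct formats share a MIME type."""
    groups = defaultdict(list)
    for label, mime in mapping.items():
        groups[mime].append(label)
    return dict(groups)


collisions = group_by_mime(FINE_GRAINED_TO_MIME)
for mime, labels in collisions.items():
    print(f"{mime}: {len(labels)} distinct formats")
```

A repository that stores only the MIME type cannot tell these variants apart later, which is the gap a finer-grained registry is meant to close.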
A Bit of History
• DLF-sponsored invitational meetings
• Ad-hoc committee
  – Collected use cases
  – Working groups on data and governance models
During summer 2002 the Harvard LDI and MIT DSpace teams met to discuss shared concerns.
Ad-Hoc Committee
• Bibliothèque nationale de France
• California Digital Library
• Digital Library Federation
• Harvard University
• IETF
• JISC
• JSTOR
• Library of Congress
• MIT
• NARA
• National Archives of Canada
• New York University
• NIST
• OCLC
• Public Records Office, UK
• RLG
• Stanford University
• University of Pennsylvania
Global Digital Format Registry
The registry will maintain persistent, unambiguous bindings between public identifiers for digital formats and representation information for those formats.
What is a Format?
• No assumption regarding byte size
• An information model is a formal expression of exchangeable knowledge
A format is a fixed, byte-serialized encoding of an information model.
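As a rough illustration of this definition, the sketch below separates a toy information model from one fixed byte-level encoding of it. The `ImageHeader` model, the `TOY1` tag, and the field layout are all invented for the example and do not correspond to any real format.

```python
import struct
from dataclasses import dataclass


@dataclass(frozen=True)
class ImageHeader:
    """A toy information model: the knowledge to be exchanged."""
    width: int
    height: int
    bits_per_sample: int


# Hypothetical toy format: a 4-byte tag, two 32-bit unsigned integers and one
# 16-bit unsigned integer, all big-endian with no padding (14 bytes total).
LAYOUT = struct.Struct(">4sIIH")
TAG = b"TOY1"


def encode(model: ImageHeader) -> bytes:
    """Serialize the information model into its fixed byte encoding."""
    return LAYOUT.pack(TAG, model.width, model.height, model.bits_per_sample)


def decode(data: bytes) -> ImageHeader:
    """Recover the information model; only possible if the layout is known."""
    tag, width, height, bits = LAYOUT.unpack(data)
    if tag != TAG:
        raise ValueError("not a TOY1 stream")
    return ImageHeader(width, height, bits)


header = ImageHeader(width=640, height=480, bits_per_sample=8)
encoded = encode(header)
assert decode(encoded) == header
```

Without knowledge of `LAYOUT`, the 14 bytes are an opaque stream; that knowledge is the kind of representation information the registry is meant to hold.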
What is Representation Information?
• Significant properties are those aspects of a format that are the primary carriers of the format’s intellectual value
Representation information maps typed formats into more meaningful concepts by capturing the significant syntactic and semantic properties of those formats.
Data Model
• Registry
• Format
  – Descriptive
    • General descriptive properties
  – Characterization
    • Technical syntactic/semantic properties
  – Processing
    • Services and systems using format as input or output
  – Administrative
    • Provenance
Informative, not Evaluative
• Legal liability
• May discourage deposit of proprietary information
• Investigate ways to include (by reference?) third party evaluations/recommendations
  – Insofar as this doesn’t hamper primary goal
The format properties stored in the registry should be factual, not judgmental.
Data Model Sources
• ISO 14721, Open archival information system -- Reference model
  – CCSDS OAIS reference model
  – Representation information
    • Interpret, or provide “additional meaning” to Data Object
    • Structure and semantic information
• PRONOM
  – Public Records Office, UK
  – “information about file formats and the application software needed to open them”
  – Format, vendor, product
Data Model Sources
• Diffuse
  – EC’s Information Society Technologies programme
  – “reference and guidance information on available and emerging standards and specifications”
  – Business Guides
    • “application of standards and specifications in specific areas”
• OCLC/RLG Preservation Metadata Framework
  – “information necessary to render/display, understand, and interpret the Content Data Object”
  – Based on CEDARS, NEDLIB, NLA, OAIS, and OCLC metadata
Data Model Sources
• NIST National Software Reference Library
  – File profiles for the NSRL Reference Data Set
    • Vendor, product, operating system
  – Used for forensic identification
• Media features
  – Protocol-independent content negotiation
    • Selection of an “appropriate representation” of a resource
  – RFCs 2506, 2533, 2534
Data Model Sources
• Typed Object Model (TOM)
  – “model for identifying and describing data formats … distributed system of ‘type brokers’ that maintain and interpret these descriptions”
  – Format is aggregate of type (attributes, operations, semantics) and encoding
• JISC File Format Representation and Rendering Project
  – Assessment of formats and rendering software
  – Representation system to track formats and their rendering software
Data Model Sources
• Bitstream Syntax Description Language
  – MPEG-21 content adaptation
  – XML schema to model multimedia bitstreams
Useful for administrative properties and data types:
• ISO/IEC 11179, Specification and standardization of data elements
• OASIS/ebXML Registry Information Model
Data Model
Relation
  Target : Cognomen
  Registry : Cognomen
  Type : <<enum>>
  Note * : UTF-8
Cognomen
  Value : UTF-8
  Type : <<enum>>
  Note * : UTF-8
Person
  Title ? : UTF-8
  Affiliation + : Agent
Agent
  Name : UTF-8
  Address ? : UTF-8
  Telephone ? : ITU-T
  Fax ? : ITU-T E.164
  Email ? : RFC 2821
  Web ? : URI
  Type : <<enum>>
  Note * : UTF-8
Class
  Identifier : Cognomen
  Ontology : Cognomen
  Note * : UTF-8
Document
  Title : UTF-8
  Version ? : UTF-8
  Author * : Agent
  Publisher * : Agent
  Date ? : ISO 8601
  Type : <<enum>>
  Identifier * : Cognomen
  Accessibility : Access
  Note * : UTF-8
Signature
  Value : Byte stream
  Obligation : <<enum>>
  Note * : UTF-8
ExternalSignature
  Type : <<enum>>
InternalSignature
  Fixity : <<enum>>
  Offset ? : Non-negative
Registry
  Name : UTF-8
  Version : UTF-8
  Date : ISO 8601
  Format * : Format
  Service * : Service
  Registry * : Registry
  Note * : UTF-8
Service
  Name : UTF-8
  Protocol ? : UTF-8
  Note * : UTF-8
Event
  Agent : Agent
  Date : ISO 8601
  Type : <<enum>>
  Review : <<enum>>
  Note * : UTF-8
Process
  Type : <<enum>>
  Stream * : Stream
  Note * : UTF-8
System
  Name : UTF-8
  Version : UTF-8
  Agent : Agent
  Process * : Process
  Relationship * : Relation
  Note * : UTF-8
Format
  Identifier : UTF-8
  Alias * : UTF-8
  Author * : Agent
  Owner + : Authority
  Maintainer * : Authority
  Classification + : Class
  Relationship * : Relation
  Specification * : Document
  Signature * : Signature
  Tool * : System
  Status : <<enum>>
  Provenance * : Event
  Note * : UTF-8
Stream
  Format : Cognomen
  Type : <<enum>>
  Note * : UTF-8
Authority
  Agent : Agent
  Start : ISO 8601
  End ? : ISO 8601
  Note * : UTF-8
Access
  Type : <<enum>>
  Start : ISO 8601
  End : ISO 8601
  Note * : UTF-8
High-Level Format Properties
Format
Property          Type            Description
Identifier        UTF-8           Canonical format identifier
Alias *           UTF-8           Variant identifiers
Author *          Agent           Author
Owner +           Authority       Owner
Maintainer *      Authority       Maintenance agency
Classification +  Class           Ontological classification
Relationship *    FormatRelation  Typed relationship with another format, either registered internally or externally
Specification *   Document        Specification document
Signature *       Signature       Internal or external signature
Tool *            System          Process or service having format as input or output
Status            <<enum>>        Status: ‘Active’, ‘Withdrawn’, ‘Unknown’, ‘Other’
Provenance *      Event           Provenance event
Note *            UTF-8           Informative note
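One possible rendering of a subset of these properties as Python dataclasses is sketched below. It reads "+" as one or more and "*" as zero or more, which is an assumption about the slide's notation; the simplified `Agent` and `Authority` classes, the use of plain strings for classification, and all example values are hypothetical.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import List


class Status(Enum):
    ACTIVE = "Active"
    WITHDRAWN = "Withdrawn"
    UNKNOWN = "Unknown"
    OTHER = "Other"


@dataclass
class Agent:
    name: str


@dataclass
class Authority:
    agent: Agent
    start: str  # ISO 8601 date, kept as a string in this sketch


@dataclass
class Format:
    identifier: str                  # canonical format identifier
    owner: List[Authority]           # "+": assumed to mean one or more
    classification: List[str]        # "+": simplified to plain strings here
    status: Status = Status.UNKNOWN
    alias: List[str] = field(default_factory=list)  # "*": zero or more
    note: List[str] = field(default_factory=list)   # "*": zero or more

    def __post_init__(self):
        # Enforce the assumed "one or more" cardinalities.
        if not self.owner:
            raise ValueError("a format needs at least one owner")
        if not self.classification:
            raise ValueError("a format needs at least one classification")


# Hypothetical entry; none of these values come from an actual registry.
example = Format(
    identifier="example-format-1.0",
    owner=[Authority(Agent("Example Organization"), start="2003-01-01")],
    classification=["Content stream / Image / Still"],
    status=Status.ACTIVE,
    alias=["EXF"],
)
```

The remaining properties (relationships, specifications, signatures, tools, provenance) would follow the same pattern and are omitted to keep the sketch short.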
Descriptive Properties
• Identifiers
  – Canonical and alias
• Arbitrary relationships
  – Equivalence
  – Encapsulation
  – Sub-typing, with strict substitutability
    • PDF 1.0 ← … ← PDF 1.4 ← PDF/A
    • XML ← SVG
  – Versioning
• Ontological classification
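The sub-typing relationship could be checked along the following lines. This is a sketch that assumes the arrows point from subtype to supertype; it includes only the links spelled out on the slide, and the function name is invented for the example.

```python
# Direct supertype links, read "subtype -> supertype". Only links stated
# explicitly on the slide are included; the elided PDF versions are left out.
SUPERTYPE = {
    "PDF/A": "PDF 1.4",
    "SVG": "XML",
}


def is_substitutable(candidate: str, required: str) -> bool:
    """True if `candidate` is `required` or a (transitive) subtype of it."""
    current = candidate
    seen = set()
    while current is not None and current not in seen:
        if current == required:
            return True
        seen.add(current)  # guards against accidental cycles in the data
        current = SUPERTYPE.get(current)
    return False


print(is_substitutable("PDF/A", "PDF 1.4"))
print(is_substitutable("PDF 1.4", "PDF/A"))
```

Under strict substitutability, a tool that accepts the supertype should accept any subtype, but not the reverse, which is why the check only walks upward.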
Format Ontology
• Content stream
  – Logical
  – Numeric
    • Scalar
      – Integer
        » Unsigned
      – Real
        » Floating point
      – Complex
  – Text
    • Structured text
      – Mark-up language
      – Programming language
    • Message
      – Mail
      – News
  – Image
    • Still
      – Font
        » Outline
        » Raster
      – Graphic
        » Vector
        » Raster
      – Page description
    • Motion
  – Audio
    • Music
  – Application
    • CAD
    • Communication
    • Database
    • Executable
    • GIS
    • Presentation
    • Spreadsheet
    • Word processing
  – Transformation
    • Compression
      – Lossless
      – Lossy
    • Container
      – File system
    • Transfer
      – 7-bit safe
• Physical media
  – Magnetic
    • Disk
    • Tape
      – Reel
      – Cartridge
  – Optical
    • Disk
      – CD-ROM
      – DVD
    • Film
  – Paper
    • Card
    • Tape
Characterization Properties
• Specification documents
  – Actionable links
  – Public identifiers
  – Hard copy
    • Public, on-site, license, and escrow access
• Signatures
  – External
    • File extension, Mac OS data fork type
  – Internal
    • Magic number
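A minimal sketch of how internal and external signatures might be used for identification follows. The byte values are the commonly documented magic numbers for these formats; the class and function names are invented for the example, and a real registry entry would carry more detail (fixity, obligation, notes).

```python
import os
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class InternalSignature:
    format_name: str
    value: bytes
    offset: int = 0  # non-negative position of the magic number


# Commonly documented magic numbers; the list is illustrative, not complete.
SIGNATURES = [
    InternalSignature("PDF", b"%PDF-"),
    InternalSignature("PNG", b"\x89PNG\r\n\x1a\n"),
    InternalSignature("GIF", b"GIF87a"),
    InternalSignature("GIF", b"GIF89a"),
    InternalSignature("TIFF", b"II*\x00"),  # little-endian byte order
    InternalSignature("TIFF", b"MM\x00*"),  # big-endian byte order
]

# External signatures: here only file extensions.
EXTENSIONS = {".pdf": "PDF", ".png": "PNG", ".gif": "GIF",
              ".tif": "TIFF", ".tiff": "TIFF"}


def identify_internal(data: bytes) -> Optional[str]:
    """Return the first format whose magic number matches the byte stream."""
    for sig in SIGNATURES:
        if data[sig.offset:sig.offset + len(sig.value)] == sig.value:
            return sig.format_name
    return None


def identify_external(filename: str) -> Optional[str]:
    """Guess the format from the file extension alone."""
    ext = os.path.splitext(filename)[1].lower()
    return EXTENSIONS.get(ext)


print(identify_internal(b"%PDF-1.4\n"), identify_external("scan.TIF"))
```

External signatures are only hints, since a file can be renamed; internal signatures inspect the content itself, so the two kinds are typically cross-checked.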
Centralized vs. Distributed
• Allowing arbitrary granularity may lead to an explosion of registered formats
  – Versions
  – Local profiles
• Typed relationships support internal and external references
• Enable distributed architecture without mandating it
Core Registry Services
• Management Services
  – Approval
    • Level of review, level of public disclosure
  – Maintenance
    • Add, update, delete format entries
  – Notification
    • Notify registry clients of new/updated format or trigger events (e.g. obsolescence, new transformation service, etc.)
  – Introspection
    • Determine local policies (scope, coverage, implemented services, etc.) of a given registry to identify appropriate registry to use
Core Registry Services
• Access Services
  – Description
    • Representation information returned on request for single format
  – Export
    • Entire registry or selected subset sent to external repository
Supported Services
• Representation Services
  – Identification services
    • Determine format of a specific digital object by comparing its attributes to the attribute profiles retrieved from the registry
  – Validation services
    • Verify format of a specific digital object (DO) by comparing its attributes to the attribute profile retrieved from the registry for that format
Supported Services
• Brokerage Services
  – Rendering service
    • Identify current rendering conditions for supplied DO
  – Transformation service
    • Convert DO from current (source) format to target format
  – Metadata Extraction services
    • Registry returns information supporting automated extraction of attribute metadata from a DO of a specific format
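One way a brokerage layer might answer the transformation question ("I have format F, but need G") is to search the registered transformation services for a chain of conversions. The sketch below uses breadth-first search; the service table, the service names, and the idea of chaining several steps are assumptions made for illustration, not something the slides specify.

```python
from collections import deque
from typing import Dict, List, Optional, Tuple

# Hypothetical registered transformation services: (source, target) -> name.
SERVICES: Dict[Tuple[str, str], str] = {
    ("F", "H"): "f-to-h",
    ("H", "G"): "h-to-g",
    ("G", "F"): "g-to-f",
}


def find_transformation_path(
    source: str, target: str, services: Dict[Tuple[str, str], str]
) -> Optional[List[str]]:
    """Return the shortest chain of service names from source to target."""
    if source == target:
        return []
    # Build an adjacency list: format -> [(next format, service name)].
    graph: Dict[str, List[Tuple[str, str]]] = {}
    for (src, dst), name in services.items():
        graph.setdefault(src, []).append((dst, name))
    queue = deque([(source, [])])
    visited = {source}
    while queue:
        fmt, path = queue.popleft()
        for nxt, name in graph.get(fmt, []):
            if nxt in visited:
                continue
            new_path = path + [name]
            if nxt == target:
                return new_path
            visited.add(nxt)
            queue.append((nxt, new_path))
    return None  # no registered chain of services reaches the target


print(find_transformation_path("F", "G", SERVICES))
```

Breadth-first search returns the chain with the fewest steps, a reasonable default when each conversion risks losing information; a fuller model might instead weight each step by its fidelity.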
Service Model Sources
• ANSI X3.285, Metamodel for Management of Shareable Data
  – Service model for ISO/IEC 11179
• IANA MIME media type registry
• OASIS/ebXML Registry Services Specification
Registry Operation
• Trust is necessary to encourage deposit of proprietary information
• Sustainability is necessary to justify expense
  – As for all preservation activities, how do we generate income today, for services not needed until tomorrow?
The registry is valuable insofar as it is trustworthy and sustainable.
Registry Operation
• Will registry staff collect and manage representation information, or
• Will knowledgeable community members submit information?
• What is the level of technical review, and by whom?
  – IETF model
Is the registry self-populating, or a public bulletin board?
Governance Model
• Can this initiative reasonably be placed under the umbrella of an existing organization?
• Is global scope in conflict with national prerogatives?
• How to build sufficient trust models
• Governance model becomes more important as the operational model becomes more pro-active (distributed and contributory)
Business Model
• Costs depend on level of quality and authority required (e.g. wiki vs. OCLC)
• Assuming the registry needs to be cost-recovered, options for supporting “common good” services include:
  – Subsidy
  – Subscription
  – Pay to submit
    • Format registration accompanied by fee
  – Pay to view
    • Queries on a for-fee basis
  – Added-value services
Next Steps
• Tell people what we’re doing
  – National, academic, private libraries/archives
  – Standards bodies
  – Commercial
    • Regulated industries
    • Software vendors (developers and consumers of formats)
    • Publishers
  – Anyone with long-term digital preservation needs
• Refine documentation for a general audience
  – Vision statement and high-level project plan
Next Steps
• Look for project funding
  – Potentially two phases:
    • Design and implementation
      – Can be funded through grants, in-kind participation
    • Operational
      – Need reliable, sustainable income stream
  – Planning grant to sustain initial activity
    • Data and service models
    • Governance and business model
    • Development and operations plan
  – Library of Congress NDIIPP and/or JISC (UK) Digital Curation Centre
Why Is This Important to You?
• If you care about the long-term usability of your digital assets:
  – The registry will allow typing of digital objects at an appropriate level of granularity
  – The registry will allow the recovery in the future of the syntax and semantics associated with typed digital objects
  – The registry is an enabling technology underlying digital repository operations and preservation activities
… thanks!
hul.harvard.edu/formatregistry
[email protected]@mit.edu