Jay Gattuso Persistently Identifying Formats

‘Persistently’ Identifying Formats PRONOM, DROID and the NDHA Jay Gattuso Digital Preservation Analyst National Digital Heritage Archive National Library of New Zealand

Transcript of Jay Gattuso Persistently Identifying Formats

Page 1: Jay Gattuso Persistently Identifying Formats

‘Persistently’ Identifying Formats

PRONOM, DROID and the NDHA

Jay Gattuso Digital Preservation Analyst

National Digital Heritage Archive

National Library of New Zealand

Page 2: Jay Gattuso Persistently Identifying Formats

Summary

How Rosetta uses DROID

How DROID has changed

Research NDHA completed

Results

Recommendations

Page 3: Jay Gattuso Persistently Identifying Formats

DROID & PRONOM

• PRONOM is the most widely used file format registry in the sector

• DROID is a tool that ‘identifies’ file types (based on PRONOM records)

• Both are from TNA (UK)

• DROID Signature v59

– 551 signature sets

– 864 file type records

EP/1958/2520-F Registry, Hunter Building, Victoria University of Wellington

Photograph taken for the Evening Post newspaper, 31 Jul 1958 Alexander Turnbull Library

www.nationalarchives.gov.uk/PRONOM/Default.aspx

Page 4: Jay Gattuso Persistently Identifying Formats

Rosetta – A Brief History

• NLNZ Digital Preservation Repository

• 4 years since inception

• 18 months out of project

• 8 significant upgrades/software revisions

• ~6 Million digital objects to date

• Backbone of the ANZ GDAP

1/1-000008-G Smiley's stables and horse repository, Whanganui

Harding, William James, 1826-1899: Negatives of Wanganui district. Alexander Turnbull Library

Page 5: Jay Gattuso Persistently Identifying Formats

Write Once, Read Many

Inside Rosetta, format identification is a ‘WORM’ process.

As part of the ingest routine, format identification is undertaken automatically, written to the file records and the system database, and used thereafter as a consistent ‘label’.

E-272-f-001 Abbot, John 1751-1840:

Original drawings of insects by J Abott. [1816?] Alexander Turnbull Library

We rely on the persistence of the label to accurately plan activities and ‘measure’ the content or shape of the repository.
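The write-once, read-many behaviour described on this slide can be pictured with a minimal Python sketch. This is not Rosetta's implementation: the class `FormatLabelStore`, its method names and the file IDs are all hypothetical, chosen only to show a label that is set once at ingest and then read many times.

```python
from collections import Counter


class FormatLabelStore:
    """Hypothetical write-once store of format labels (PUIDs), keyed by file ID."""

    def __init__(self):
        self._labels = {}

    def record(self, file_id, puid):
        """Write once: store the label produced at ingest, refuse any overwrite."""
        if file_id in self._labels:
            raise ValueError(f"format label already recorded for {file_id!r}")
        self._labels[file_id] = puid

    def label(self, file_id):
        """Read many: every later lookup returns the ingest-time label."""
        return self._labels[file_id]

    def shape(self):
        """Count files per label, i.e. 'measure' the repository by format."""
        return Counter(self._labels.values())


store = FormatLabelStore()
store.record("file-001", "fmt/12")  # hypothetical file ID; PUID taken from the slides
store.record("file-002", "fmt/14")
print(store.label("file-001"))
print(store.shape())
```

Any planning or reporting built on `shape()` is only as reliable as the labels are persistent, which is the concern the rest of the talk examines.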

Page 6: Jay Gattuso Persistently Identifying Formats

Behaviours and functions based on DROID format assertions

Rosetta uses DROID to automatically establish format type.

Page 7: Jay Gattuso Persistently Identifying Formats

Rosetta Overview

Validation Stack

Automated Format Identification via DROID

Page 8: Jay Gattuso Persistently Identifying Formats

Shape Sorting...

Where:

• The area inside the box is Rosetta

• Each block is a DO

• Each shape is a format

• The ‘Sorter’ is DROID

Page 9: Jay Gattuso Persistently Identifying Formats

Shape Sorting...

Process:

• A record is kept of the ‘shape’ the DO entered the box via

• The record is used by the system to trigger activities

• The DO can be removed from the box using the same shaped hole it used on entry

Page 10: Jay Gattuso Persistently Identifying Formats

Shape Sorting...

Expectations:

• The ‘Sorter’ never changes

• The blocks never change

• A DO placed in the box yesterday will be the same shape tomorrow

• A DO placed in the box yesterday will be extractable via the shape tomorrow

Page 11: Jay Gattuso Persistently Identifying Formats

Shape Sorting...

The reality for NDHA:

• DROID has undergone 2 major revisions

• Container signatures have been included

• Since Rosetta v1 release:

– 406 new formats

– 600 changes to signatures

– (This is generally a good thing!)

Page 12: Jay Gattuso Persistently Identifying Formats

• Rosetta has used DROID versions 3 and 5, currently testing with 6

• Rosetta has used DROID signature versions v13, v37, v45 and v49, testing with v52

• Proposal to use a new DROID method in Rosetta

• How has/will this affect the way we characterise Digital Objects at the NDHA?

Identifying and Quantifying Change

EP/1958/0585-F Signature of Queen Elizabeth II in a visitors book

Negatives of the Evening Post newspaper. Feb 1958 Alexander Turnbull Library

Page 13: Jay Gattuso Persistently Identifying Formats

• Source set:

– 26,000 digital objects

– ~600 GB of content

– spanning 61 format types

– all from the live system

• DROID v3, DROID v5, DROID v6 and DROID v6 ‘FAST’ tested

• Signatures v13, v37, v45, v49 and v50 tested

• All files tested with and without file extensions
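To give a sense of the scale of this test, here is a short Python sketch that enumerates the test configurations. It assumes every DROID version was paired with every signature version, both with and without file extensions; the slides do not state that the full cross product was run, so treat that as an assumption. The variable and function names are illustrative only.

```python
from itertools import product

# Values taken from the slide; pairing them all is an assumption.
DROID_VERSIONS = ["v3", "v5", "v6", "v6 FAST"]
SIGNATURE_VERSIONS = ["v13", "v37", "v45", "v49", "v50"]
EXTENSION_MODES = ["with extension", "without extension"]


def build_configurations():
    """Return every (DROID, signature, extension mode) combination."""
    return list(product(DROID_VERSIONS, SIGNATURE_VERSIONS, EXTENSION_MODES))


configs = build_configurations()
objects = 26_000
total_assertions = len(configs) * objects

print(len(configs), "configurations")
print(total_assertions, "assertions if every object is run under every configuration")
```

Under that assumption there are 4 x 5 x 2 = 40 configurations, and 40 x 26,000 = 1,040,000 assertions, in line with the figure of about 1 million assertions captured.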

Identifying and Quantifying Change

EP/1990/0432/29-F New school patrol system being tested, Wellington

Photograph taken by John Nicholson ca 2 Feb 1990

Alexander Turnbull Library

Page 14: Jay Gattuso Persistently Identifying Formats

• 1 million DROID ‘assertions’ captured

• Python and MySQL used to sort, clean, filter, draw graphics and otherwise interpret results

• Paper completed and will be available on the OPF website

www.openplanetsfoundation.org
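The kind of sorting described on this slide might look like the following Python sketch, which groups assertions by file extension and separates the extensions that always received one PUID from those that received several. It is not the analysis code from the paper: the row layout, the function names and the sample rows are invented for illustration and are not results from the study.

```python
from collections import defaultdict

# Invented sample rows for illustration only (NOT results from the study):
# (file extension, DROID version, signature version, asserted PUID)
SAMPLE_ASSERTIONS = [
    ("png", "v3", "v13", "fmt/12"),
    ("png", "v6", "v50", "fmt/12"),
    ("docx", "v3", "v13", "x-fmt/263"),
    ("docx", "v6", "v50", "fmt/189"),
]


def puids_by_extension(assertions):
    """Collect the set of PUIDs asserted for each file extension."""
    seen = defaultdict(set)
    for extension, _droid, _signature, puid in assertions:
        seen[extension].add(puid)
    return dict(seen)


def split_by_stability(assertions):
    """Return (extensions with a single PUID, extensions with several PUIDs)."""
    stable, unstable = [], []
    for extension, puids in sorted(puids_by_extension(assertions).items()):
        (stable if len(puids) == 1 else unstable).append(extension)
    return stable, unstable


stable, unstable = split_by_stability(SAMPLE_ASSERTIONS)
print("single PUID:", stable)
print("multiple PUIDs:", unstable)
```

With real data, the same grouping could be done in MySQL with a `GROUP BY` on the extension column and a `COUNT(DISTINCT ...)` on the PUID column.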

Identifying and Quantifying Change

DCDL-0004533 Eric Idle. 5 December, 2007.

Webb, Murray, 1947- : Digital caricatures published from 29 July 2005 onwards

Alexander Turnbull Library

Page 15: Jay Gattuso Persistently Identifying Formats

Summary of Results

Of the 61 tested file types:

75% performed identically for all tested versions of DROID and signature versions

fmt/49 (RTF 1.4)

Page 16: Jay Gattuso Persistently Identifying Formats

Summary of Results

Of the 61 tested file types:

40% consistently offered a single PUID across the range of DROID tests

By extension: gif, avi, png, jpg, html, xml, bmp, wp, and some subsets of doc, ppt and exe

fmt/12 (PNG 1.1)

Page 17: Jay Gattuso Persistently Identifying Formats

Summary of Results

Of the 61 tested file types:

In 26% of the file types multiple PUIDs are equally asserted by DROID at various times.

By extension: docx, xlsx, pptx, some pdf, doc, xls, ppt, txt, log, aiff, and arc

fmt/7 (TIF format)

Page 18: Jay Gattuso Persistently Identifying Formats

Summary of Results

Of the 61 tested file types:

In 16% of the file types DROID version 6 in ‘FAST’ mode performs differently from DROID version 6 in standard mode

By extension: epub, mp4, flac, wav, zip and some subsets of pdf, xls, tif and exe

fmt/6 (Waveform Audio)

Page 19: Jay Gattuso Persistently Identifying Formats

Recommendation 1

There is a clear need for a community owned dataset that spans the PRONOM catalogue to support testing

(This should be community created)

ExL-fmt/62 - fmt/189 (MS Open Office XML 2007)

Page 20: Jay Gattuso Persistently Identifying Formats

Recommendation 2

It is strongly recommended that more research is undertaken looking at the persistence of PUIDs to give a more complete history of file type assertions by PRONOM/DROID

fmt/14 (PDF 1.0)

Page 21: Jay Gattuso Persistently Identifying Formats

Recommendation 3

Given the variances observed, especially with DROID v6 ‘FAST’ mode, it is recommended that all signatures are robustly tested prior to release, and that efforts are made to maintain consistency with legacy signatures and limit the impact on users

x-fmt/263 (ZIP format)

Page 22: Jay Gattuso Persistently Identifying Formats

Recap

How Rosetta uses DROID

How DROID has changed

Research NDHA completed

Results

Recommendations

Page 23: Jay Gattuso Persistently Identifying Formats

Thank you

[email protected]

Rosetta demo – Wednesday 28th March 9am to 1pm @ NLNZ - 77 Thorndon Quay

Paper available through the Open Planets Website www.openplanetsfoundation.org