The Reality of Digital Transfer @ArchivesNZ

Department of Internal Affairs The Reality of Digital Transfer @ArchivesNZ Ross Spencer, Talei Masters Archives New Zealand Records Management Network Event, Tuesday November 25 2014

description

Presentation for the Archives New Zealand Records Management Network Event describing the reality of digital transfer. It looks at the potential scale of digital transfers, based on the largest collections we investigated during the initial transfers project, and compares that with the accession work we were investigating at the time of writing. It also covers some of the challenges involved and how we are tackling them.

Transcript of The Reality of Digital Transfer @ArchivesNZ

Page 1: The Reality of Digital Transfer @ArchivesNZ

Department of Internal Affairs

The Reality of Digital Transfer

@ArchivesNZ

Ross Spencer, Talei Masters

Archives New Zealand

Records Management Network Event, Tuesday November 25 2014

Page 2: The Reality of Digital Transfer @ArchivesNZ

Background

Born Digital and Cultural Heritage Conference, Melbourne*: http://bit.ly/1utAqz0

Spencer, Braden, Hutar, Masters, Crouch, Mosely

Fly Away Home: Pilot Transfer of Born-digital Records at Archives New Zealand

Collected our experiences from late 2013 through to early 2014. Royal Commission work through to GDAP Closure and beginning of eAccessions.

* http://playitagainproject.org/conference-report/

Page 3: The Reality of Digital Transfer @ArchivesNZ

A missing piece of the jigsaw…

• An appraisal of the technical challenges

• The first of a much bigger puzzle?

• We understood a minimal set of descriptive metadata, e.g. transfer metadata file; mapping of EDRMS fields to that schema

• But the collection profile was missing: technical implications of digital preservation…

Page 4: The Reality of Digital Transfer @ArchivesNZ

And the numbers were/are huge!

Royal Commission on the Pike River Coal Mine Tragedy

Two EDRMS: Lotus Notes DMS, AccessData Summation

374,264 Files (200GB)

66,580 Directories

3,892 Unidentified Objects

15 Unidentified Extensions

87 Known Formats

55,425 Duplicates (Content)

Analysis time: 108 minutes

24,190 Files (5GB)

641 Directories

1,254 Unidentified Objects

8 Unidentified Extensions

62 Known Formats

6,200 Duplicates (Content)

Analysis time: 44 minutes

Page 5: The Reality of Digital Transfer @ArchivesNZ

There’s more…

The Canterbury Earthquakes Royal Commission (partial stats)

11,505 Files (57GB)

246 Directories

123 Unidentified Objects

2 Unidentified Extensions

55 Known Formats

2,468 Duplicates (Content)

Analysis time: stats not collected

One EDRMS: Lotus Notes DMS… (but a different flavour!)

Page 6: The Reality of Digital Transfer @ArchivesNZ

Performance of tools…

Just one (fairly profound?) example for you… Pike River metadata extraction and checksum generation… ‘triage’

2949m21.680s

49 Hours!
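The 49-hour figure follows directly from the timing string. A minimal sketch of the conversion, assuming the value is in the `NmS.SSSs` style printed by a shell's `time` builtin (the slide does not say which tool produced it); `parse_elapsed` is a hypothetical helper name:

```python
import re


def parse_elapsed(value):
    """Convert a string such as '2949m21.680s' into seconds.

    Assumes the minutes/seconds layout used by a shell's `time` builtin.
    """
    match = re.fullmatch(r"(\d+)m(\d+(?:\.\d+)?)s", value)
    if match is None:
        raise ValueError(f"unrecognised elapsed time: {value!r}")
    minutes, seconds = match.groups()
    return int(minutes) * 60 + float(seconds)


elapsed_seconds = parse_elapsed("2949m21.680s")
elapsed_hours = elapsed_seconds / 3600

# 2949 minutes and 21.68 seconds is a little over 49 hours.
print(f"{elapsed_seconds:.2f} seconds = {elapsed_hours:.2f} hours")
```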

Page 7: The Reality of Digital Transfer @ArchivesNZ

Questions already forming…

• How do we speed things up?

• How do we make reporting consistent?

• Where do we begin with this information?

• Some answers already appearing: the stats report is now generated by a Python script in response to these issues: https://github.com/exponential-decay/droid-sqlite-analysis

• Relies only on The National Archives' DROID tool, a file listing, format identification, and checksumming utility
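To illustrate how statistics like those on the earlier slides can be derived from a DROID export alone, here is a minimal sketch. It is not the script linked above: it reads a CSV export directly rather than a SQLite database, and the column names (`TYPE`, `EXT`, `SIZE`, `PUID`, `MD5_HASH`) are assumptions based on a typical DROID CSV export; the hash column in particular varies with the checksum algorithm configured. "Duplicates" is also defined here by assumption, as every file that shares its checksum with at least one other file.

```python
import csv
import io
from collections import Counter


def profile_droid_export(csv_text, hash_column="MD5_HASH"):
    """Summarise a DROID-style CSV export into collection statistics.

    Column names are assumptions; adjust them to match the real export.
    """
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    folders = [r for r in rows if r["TYPE"] == "Folder"]
    files = [r for r in rows if r["TYPE"] != "Folder"]
    # A file with no PUID was not matched to any known format.
    unidentified = [r for r in files if not r["PUID"]]
    checksums = Counter(r[hash_column] for r in files if r[hash_column])
    return {
        "files": len(files),
        "total_bytes": sum(int(r["SIZE"] or 0) for r in files),
        "directories": len(folders),
        "unidentified_objects": len(unidentified),
        "unidentified_extensions": len(
            {r["EXT"].lower() for r in unidentified if r["EXT"]}
        ),
        "known_formats": len({r["PUID"] for r in files if r["PUID"]}),
        # Count every file whose checksum occurs more than once.
        "duplicates": sum(n for n in checksums.values() if n > 1),
    }


# Tiny made-up export used only to exercise the function.
SAMPLE = """TYPE,NAME,EXT,SIZE,PUID,MD5_HASH
Folder,reports,,,,
File,a.pdf,pdf,100,fmt/18,aaa
File,b.pdf,pdf,100,fmt/18,aaa
File,c.doc,doc,50,fmt/40,bbb
File,d.xyz,xyz,10,,ccc
"""

summary = profile_droid_export(SAMPLE)
for label, value in summary.items():
    print(f"{label}: {value}")
```

Generating the report from one function like this is one way to keep the figures consistent between collections, which is the reporting question raised above.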

Page 8: The Reality of Digital Transfer @ArchivesNZ

eAccession One [e1]

Legacy accessions where we have the opportunity to apply lessons learned from the Initial Digital Transfers…

175 Files (166.5 MB)

10 Directories

0 Unidentified Objects

0 Unidentified Extensions

7 Known Formats

0 Duplicates (content)

Page 9: The Reality of Digital Transfer @ArchivesNZ

eAccession Four [e4]

eAccessions were seen to be the least complex and allowed us to focus, primarily, on the challenge of ingest…

1295 Files (565.0 MB)

6 Directories

2 Unidentified Objects

1 Unidentified Extension

12 Known Formats

2 Duplicates (content)

Note: Obscured issue in original statistics… A number of false positives! System files identified as something more generic. Thumbnail preview files and Serif PagePlus files might normally look like MS Office file-like objects.

Page 10: The Reality of Digital Transfer @ArchivesNZ

Technical Challenges in e1 and e4

• [Tools] Ability to handle multi-byte character encodings, e.g. Māori macrons such as ‘Ā’.

• [Tools] Unidentified files and false positives.

• [Tools] Recording of pre-conditioning actions on ingest into the digital preservation system.

• [Tools] Implementing a CSV ingest mechanism: configuration, code, and workflow.

• [Pre-conditioning / Tools] The digital preservation system’s (Rosetta) ability to handle contiguous spaces in filenames.

• [Pre-conditioning] One invalid JPEG. Required rearrangement of application marker segments.
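Several of these challenges meet at the filename: macrons are multi-byte in UTF-8, contiguous spaces needed handling, and any change made before ingest should be recorded. The following is a minimal sketch of one possible pre-conditioning step, not the process actually used for e1 and e4. The function name `precondition_filename`, the choice to collapse runs of spaces to a single space, and the NFC normalisation are all assumptions made for illustration.

```python
import re
import unicodedata


def precondition_filename(name):
    """Return (new_name, actions); actions lists every change made.

    Hypothetical pre-conditioning rules, chosen for illustration only.
    """
    actions = []
    # Compose base letter + combining macron into a single code point.
    composed = unicodedata.normalize("NFC", name)
    if composed != name:
        actions.append("normalised Unicode to NFC")
    # Collapse runs of two or more spaces into one space.
    collapsed = re.sub(r" {2,}", " ", composed)
    if collapsed != composed:
        actions.append("collapsed contiguous spaces")
    return collapsed, actions


# 'a' followed by a combining macron (U+0304), plus a double space.
original = "Ma\u0304ori  minutes.doc"
new_name, actions = precondition_filename(original)

print(new_name)
print(actions)
# A macronised letter occupies more than one byte in UTF-8.
print(len("Ā".encode("utf-8")), "bytes for 'Ā' in UTF-8")
```

Returning the list of actions alongside the new name keeps a record that could accompany the file into the preservation system, so the original name remains recoverable.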

Page 11: The Reality of Digital Transfer @ArchivesNZ

What next..?

• One step at a time. Accessions e1 and e4; develop capability further with e2 and e3.

• Incorporate the metadata extraction tool JHOVE into the process following experience with e1 and e4, possibly via FITS

• Refine current metrics and the presentation of statistics, e.g. make them more useful for Archivists working on the born-digital records we’re already in possession of…

• Ideal: Archivists’ knowledge (processes, analysis, diagnosis) becomes actuated.

Page 12: The Reality of Digital Transfer @ArchivesNZ

What next..?

• SCALE!

Thank you!
