American Archives Horror Story

Post on 18-Jul-2015

ARCHIVE

LTO FAILURE AND DATA LOSS

Who we are: WGBH MLA

Who We Are: AAPB

...and more than 120 public radio and television stations and archives nationwide

Digitization recently completed

WGBH’s 7,010 tapes that were sent for digitization

Returned on 17 LTO-6 tapes

• 5,000 hours of digitized and born digital media

• Up to 59,000 files

• Not to exceed 5.24 terabytes after transcoding has occurred

The Born Digital Deliverable

• Lack of staff resources at stations

• Absence of existing metadata

• Unique identifiers ≠ actual names of files

• Limitations of our metadata management system

• Bicycling hard drives

• Access quality vs. preservation quality

• 5.24 terabytes became 300+ terabytes

We had some challenges

• Send multiple batches totaling 13,500 video and audio files

• Pull 300TB of files over our network and place on 76 3TB hard drives

– Stored on LTO-4 robotic machine in IT

– Checksums for most files did not exist

– Many files up to 100GB each

The Plan at WGBH

THE PROBLEM

Out of a set of 2069 files pulled for Batch 3 part 1, 1195 proved to have failed on reaching Crawford

693 failed initial analysis

394 failed QC

108 failed transcode

= 57% failure rate

The next batch had 1310 failures out of 2826 files
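The figures on this slide can be rechecked with a little arithmetic. The following is a minimal sketch; `failure_rate` is a hypothetical helper name chosen for illustration, not part of the original workflow.

```shell
# failure_rate FAILED TOTAL -> prints the percentage of failed files, one decimal place.
# Hypothetical helper for re-checking the numbers on the slide.
failure_rate() {
    awk -v failed="$1" -v total="$2" 'BEGIN { printf "%.1f\n", 100 * failed / total }'
}

# The three failure stages reported for Batch 3 part 1 add up to the headline count.
echo $((693 + 394 + 108))      # prints 1195

failure_rate 1195 2069         # Batch 3 part 1, prints 57.8
failure_rate 1310 2826         # the next batch, prints 46.4
```

So the second batch failed at a somewhat lower rate than the first, though still close to half of the files.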

THE PROCESS

start with csv file containing final name of file at receiving end, full path to file on source end, ID value of offline storage tape

shell script:

- sorts files by # of storage tape

- logs into DAM using ssh

- transfers file using scp through Artesia from LTO 4 tape (stored as tarball) onto 3 TB hard drive

later versions used tar rather than scp
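The steps above can be sketched roughly as follows. This is not the script used at WGBH: the column order, the assumption of a headerless CSV with no commas inside fields, the function names, and the `DAM_HOST` placeholder are all assumptions made for illustration. The sketch only prints the `scp` commands it would run, so it is safe to try without access to the DAM.

```shell
# sort_by_tape MANIFEST -> prints manifest rows grouped by tape ID (column 3),
# so that each storage tape only needs to be mounted once.
# Assumed row layout: final_name,source_path,tape_id
sort_by_tape() {
    sort -t, -k3,3 -k2,2 "$1"
}

# plan_transfers MANIFEST DEST_DIR -> prints (does not run) one scp command per row.
# DAM_HOST is a placeholder; the real host and Artesia paths are not given in the slides.
plan_transfers() {
    sort_by_tape "$1" | while IFS=, read -r final_name source_path tape_id; do
        printf 'scp "%s:%s" "%s/%s"  # tape %s\n' \
            "${DAM_HOST:-dam.example.org}" "$source_path" "$2" "$final_name" "$tape_id"
    done
}
```

The sort is lexicographic on the tape ID; if the IDs were purely numeric, `-k3,3n` would be the more natural key.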

THE PROCESS (REVISED)

post-transfer, compare the megabyte block counts of source and destination products

(no checksum – took too much time to perform on such large files while under time pressure)

failed items automatically removed from drive

transfer script re-run until all files download successfully

if files fail repeatedly, assume they have failed on LTO; a backup tape is recalled from Iron Mountain and staging is attempted from that copy
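A minimal sketch of the verify-and-retry loop described above, under stated assumptions: both copies are reachable as local paths, the megabyte count is derived from the byte size rather than from the file system's block count, and `cp` stands in for the real `scp`/`tar` step. All function names are hypothetical.

```shell
# size_mb FILE -> file size in whole megabytes (rounded up), derived from the byte count.
size_mb() {
    bytes=$(wc -c < "$1" | tr -d ' ')
    echo $(( (bytes + 1048575) / 1048576 ))
}

# verify_copy SRC DST -> succeeds if both files exist with equal megabyte counts;
# otherwise removes the destination copy so that a re-run fetches it again.
verify_copy() {
    if [ -f "$1" ] && [ -f "$2" ] && [ "$(size_mb "$1")" -eq "$(size_mb "$2")" ]; then
        return 0
    fi
    rm -f "$2"
    return 1
}

# transfer_with_retry SRC DST [MAX_ATTEMPTS] -> re-tries until verify_copy passes
# or the attempts run out. cp is only a stand-in for the real transfer command.
transfer_with_retry() {
    attempt=1
    max=${3:-3}
    while [ "$attempt" -le "$max" ]; do
        cp "$1" "$2" 2>/dev/null && verify_copy "$1" "$2" && return 0
        attempt=$((attempt + 1))
    done
    return 1   # repeated failure: per the slide, fall back to the backup tape
}
```

A size comparison of this kind is fast, but it cannot detect corruption that leaves the file size unchanged, and at megabyte granularity it can also miss a small truncation.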

THE PROGRESS

Many files that initially failed eventually transferred successfully, either from the initial tape or from a backup, after multiple attempts

Others were never successfully transferred

Out of a planned 10,648 files in the batch, 2173 were never successfully downloaded – a 20% failure rate

BREAKING DOWN THE FAILURES

ffmpeg -i ${filename}

mediainfo -f ${filename}

“moov atom not found”
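In QuickTime and MP4 files the moov atom holds the index needed to decode the media, and it is often written at the end of the file, so a truncated transfer is one plausible way to end up with this error. A minimal sketch for sorting a batch of files by this symptom follows; the function names and the two labels are hypothetical, and `check_file` assumes `ffmpeg` is installed.

```shell
# classify_log -> reads ffmpeg's diagnostic output on stdin and prints a label.
classify_log() {
    if grep -q 'moov atom not found'; then
        echo "missing-moov"
    else
        echo "other"
    fi
}

# check_file FILE -> runs the ffmpeg probe from the slide and classifies its output.
# ffmpeg writes its diagnostics to stderr, hence the 2>&1.
check_file() {
    ffmpeg -hide_banner -i "$1" 2>&1 < /dev/null | classify_log
}
```

For example, `for f in *.mov; do printf '%s %s\n' "$f" "$(check_file "$f")"; done` would list each file with its label. Files labelled "other" are not necessarily healthy; they simply failed (or passed) for a different reason.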

QC FAILURE

Playable files with evidence of corruption defined by Crawford as “issues that would make the file unusable,” for example:

a green screen with no audio

a video that plays for two seconds before the screen goes black or grey

pixels shifted out of place in a zigzag pattern

audio is digital noise only

THE PROGNOSIS

Sample data: 5000 files with checksums generated at creation

1012 of those files could not be transferred from LTO, after multiple attempts

However, MD5s on LTO show the files are unchanged

So the files are good – but can’t be reached?
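Comparing a stored checksum against a freshly computed one can be sketched as below. This assumes the GNU `md5sum` utility is available (on macOS the equivalent is `md5`), and `verify_md5` is a hypothetical helper name; the slides do not say which tool produced the original MD5s.

```shell
# verify_md5 FILE EXPECTED_HEX -> succeeds if the file's MD5 matches the stored value.
# Reading the file on stdin keeps md5sum's output to "<hash>  -".
verify_md5() {
    actual=$(md5sum < "$1" | cut -d' ' -f1)
    expected=$(printf '%s' "$2" | tr 'A-F' 'a-f')   # tolerate upper-case hex
    [ "$actual" = "$expected" ]
}
```

A typical use would be `verify_md5 "$file" "$stored_md5" || echo "$file changed"`. Note that a match only says the bytes are the same as when the checksum was generated; it says nothing about whether the file was already damaged at that point.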

THE POSSIBILITIES

Files were bad before they went onto LTO: the production environment provides little opportunity for QC

Files are good, but inaccessible on LTO because of problems with the way the data is stored on the tape or the interaction of the different technologies used to get it out

THE PROBLEMS NOW

Administrative distance between institutional IT and archival needs makes it difficult to get clear answers about the technology we’re using

Staff turnover means information about original systems and data transfer processes is lost

Local LTO systems incompatible with older tapes, making direct testing currently impossible

NEXT STEPS

Acquire Linux machine for direct testing of LTO 4 tapes

Test different transfer protocols

More investigation into the SL8500 SAMFS/QFS

Look for patterns in inaccessible files (file size, date uploaded, system architecture on storage tape)
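One simple way to start looking for such patterns is to tally failures per storage tape. The sketch below assumes a results file with rows of the form `final_name,tape_id,status`, where status is `ok` or `failed`; that layout and the function name are assumptions for illustration, not something described in the slides.

```shell
# failures_by_tape RESULTS -> prints "tape_id failed total" for each tape, sorted by tape ID.
failures_by_tape() {
    awk -F, '{ total[$2]++; if ($3 == "failed") failed[$2]++ }
             END { for (t in total) printf "%s %d %d\n", t, failed[t], total[t] }' "$1" | sort
}
```

If failures cluster on a handful of tapes, that points toward the media or the way those tapes were written; if they are spread evenly, the retrieval path becomes the more likely suspect. The same tally can be repeated with file size or upload date as the grouping column.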

Rebecca Fraimow & Casey Davis

@rhfraim

@CaseyEDavis1

rebecca_fraimow@wgbh.org

casey_davis@wgbh.org