The Archival Problem & Infrastructure for Solutions

What needs to be archived and what needs to be done?

Richard Boulderstone
Director eStrategy
February 2010

What needs to be archived?

most things...at least a sample of most things...

What needs to be done?

1. Ingest
  Transition from print to digital information resources
  Heterogeneity, complexity and scale of digital content
  Interactive items
  Should we validate?

2. Storage
  Long term authenticity of items
  Loss or corruption
  External References

3. Access
  Securely share content with other legal deposit libraries
  Long term access – Beyond life of original hardware & software platform (aka Digital Preservation)
  Controlled access
    Public domain, legal deposit, licensed
  Content must be easy to find – Important!!!

Digital Library Architecture & Design Considerations

Started with Long-Term Storage problem

Wanted cost effective, highly resilient store (highly unlikely to lose items or have items corrupted), long term integrity

Analysis showed that magnetic tape solutions had limitations:
  As size of store grows (petabytes) total recovery time can be long
  Cost of tape not much less than commodity spinning disk
  Wanted continuous validation to ensure content retained integrity

Disk storage market tends to focus ‘value-added’ products on high-transaction rates, large capacity and high reliability – our requirement is low-cost, large capacity and reasonable performance

Needed to share some of the archived content with other legal deposit libraries

Architect resilience into system – is ‘backup’ part of the architecture?
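
The "continuous validation" requirement above can be illustrated with a fixity check. The slides do not say how the DLS validates content, so this is only a minimal sketch under stated assumptions: each item is a file on disk, a SHA-256 checksum was recorded at ingest, and the names `sha256_of`, `validate_store` and the manifest layout are hypothetical.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large items need not fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def validate_store(manifest: dict[str, str], root: Path) -> list[str]:
    """Return the names of items that are missing or no longer match the manifest.

    `manifest` maps a file name (relative to `root`) to the checksum
    recorded at ingest. Both are assumptions made for this sketch.
    """
    damaged = []
    for name, expected in sorted(manifest.items()):
        item = root / name
        if not item.is_file() or sha256_of(item) != expected:
            damaged.append(name)
    return damaged
```

Run on a schedule over the whole store, a check of this kind detects silent corruption while the content is still on spinning disk, which is harder to do with tape that has to be mounted and read back.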

The Digital Library System Store


CONTENT STREAMS with operational DIGITAL LIBRARY INGEST

ID | Name | Description
1 | DIGITISED BOOKS & JOURNALS | Microsoft-funded digitised nineteenth century books accessible from DLS to the Reading Rooms via ILS. Ingest is complete.
2 | VOLUNTARY ELECTRONIC LEGAL DEPOSIT (VELD) | Deposited hand held media and offline electronic media submitted ‘in lieu’ of electronic legal deposit legislation; includes journals, books etc (formerly known as VDEP material; excludes scholarly e-journals which are identified as a separate stream). A very limited amount of this content is accessible from DLS in the Reading Rooms via ILS.
3 | FIELD SOUND RECORDINGS | Born digital field recordings created by Sound Archive – low volume, accessible from DLS to Reading Rooms via Sound Server.
4 | eJOURNALS | Scholarly e-Journals sent to the BL as part of the Voluntary Deposit scheme. Ingest of simple eJournals (Stream 2) is live. Progress towards ingest of complex eJournals is halted while the technical options are being considered. Work has started on a project to ingest ESTAR eJournals.
5a | DIGITAL NEWSPAPERS | Contemporary newspapers to be supplied digitally & directly to the BL by newspaper publishers. A pilot of a small number of titles from a single publisher is live. Progress is halted pending agreements from publishers.
5b | LEGACY DIGITISED NEWSPAPERS | Scanned historical newspapers already in the BL’s current collection. Ingest of JISC-1 is currently underway.

CONTENT STREAMS prioritised for DIGITAL LIBRARY INGEST

ID | Name | Description
6 | WEB ARCHIVING | An archive of websites gathered after gaining permission from rights holders. Following Legal Deposit Regulations the BL will be able to harvest sites from the UK domain without asking for permission. The project to ingest this collection is at the shape stage.
7 | BORN DIGITAL SOUND | Born digital sound recordings, acquisitions & voluntary deposits. Expanding, with implications for Gateway project & links to Moving Image stream. The project to ingest this collection is at the shape stage.

OTHER CONTENT STREAMS for DIGITAL LIBRARY INGEST

ID | Name | Description
8 | LEGACY DIGITISED MASTERS | All existing image-based digitisation products largely on hand-held media with significant preservation risk.
9 | NEW DIGITISED MASTERS | Digitised images from forthcoming projects including Single-Sheet Digitisation, Vulnerable Items Imaging, possibly Greek Manuscripts. Fast Track to Safety has been used to process the Vulnerable Collection Items (VCI) images – these are now DLS-ready. The same process can be used for the single-sheet digitised objects & other Aleph catalogued content.
10 | DIGITISED SOUND | Digitised archival sound recordings funded by JISC (ASR1 & ASR2).
11 | eMANUSCRIPTS | Hybrid collections, comprising paper, computer & other media with significant issues (technical, privacy, rights management etc).
12 | DATABASES | Including large numerical datasets.
13 | DIGITAL MAPS | Including OS/OSNI MasterMap data and other contacts / agreements (actually a dataset rather than a ‘map’).
14 | E-THESES | Digitised & born digital theses funded by JISC (eThOS Project).
15 | ELECTRONIC GREY LITERATURE | Scientific, technical & business documents, conference papers, newsletters, e-govt documents, i.e. not readily available through commercial channels. Some of this content is already ingested to DLS via the VELD route (see 2 above).
16 | eBOOKS | Assumes born-digital material. Some of this content is already ingested to DLS via the VELD route (see 2 above).
17 | MOVING IMAGES | Digital recordings of television programmes, online podcasts etc.

1. Ingest – Some remaining issues

‘Dynamic’ Content – Update after initial deposit
  Currently use snapshot, version-based approach
  Other generic solutions?

Should we archive published outputs, underlying data or both?

Growing diversity of content
  Should we validate to ensure long-term access?
  Container formats may hide significant complexity (3D PDF)

Scale
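
The "snapshot, version-based approach" to dynamic content mentioned on this slide could look something like the sketch below. It is not a description of the DLS: the `VersionedItem` class, its methods and the in-memory storage are all hypothetical, chosen only to show each update being kept as a new immutable snapshot rather than overwriting the earlier deposit.

```python
import hashlib
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class VersionedItem:
    """Append-only list of snapshots for one deposited item (hypothetical model)."""
    item_id: str
    # Each snapshot is (sha256 hex digest, content); earlier entries are never changed.
    snapshots: list[tuple[str, bytes]] = field(default_factory=list)

    def deposit(self, content: bytes) -> int:
        """Store a new snapshot unless it is identical to the latest one.

        Returns the 1-based version number that now holds this content.
        """
        checksum = hashlib.sha256(content).hexdigest()
        if self.snapshots and self.snapshots[-1][0] == checksum:
            return len(self.snapshots)
        self.snapshots.append((checksum, content))
        return len(self.snapshots)

    def read(self, version: Optional[int] = None) -> bytes:
        """Return the requested version, or the latest when none is given."""
        index = len(self.snapshots) if version is None else version
        if not 1 <= index <= len(self.snapshots):
            raise KeyError(f"{self.item_id}: no version {version}")
        return self.snapshots[index - 1][1]
```

A snapshot model keeps every deposited state retrievable, at the cost of storing each version in full; it does not by itself capture changes that happen between snapshots.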


Legal Deposit Libraries Shared Infrastructure

[Diagram of sites: Edinburgh - 2010, Aberystwyth, Boston Spa, St. Pancras, Cambridge Univ., Oxford Univ.]

Large scale, highly resilient digital store

Complete copies of content at each node

Continuous validation & correction

Long term digital storage for BL content & eLegal deposit distribution

Distribution of eLegal deposit content (NLW, NLS and Oxford & Cambridge)
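
Holding complete copies at each node is what makes "continuous validation & correction" possible: a damaged copy can be replaced from a good one. The slides do not describe the correction mechanism, so the sketch below swaps in a simple majority vote across replicas as an assumption. The function name `repair_replicas` and the in-memory dictionary of node copies are hypothetical.

```python
import hashlib
from collections import Counter


def repair_replicas(replicas: dict[str, bytes]) -> list[str]:
    """Overwrite minority copies with the majority copy and return the repaired nodes.

    `replicas` maps a node name to that node's copy of one item.
    Raises ValueError when no copy is held by a strict majority of nodes.
    """
    if not replicas:
        raise ValueError("no replicas supplied")
    digests = {node: hashlib.sha256(data).hexdigest() for node, data in replicas.items()}
    winner, votes = Counter(digests.values()).most_common(1)[0]
    if votes * 2 <= len(replicas):
        raise ValueError("no majority copy; manual intervention needed")
    good = next(data for node, data in replicas.items() if digests[node] == winner)
    repaired = sorted(node for node in replicas if digests[node] != winner)
    for node in repaired:
        replicas[node] = good
    return repaired
```

For example, with copies at Boston Spa, St. Pancras, Aberystwyth and Edinburgh where only the Aberystwyth copy differs, the call returns `["Aberystwyth"]` and leaves all four copies identical. A production system would more likely compare each copy against a checksum recorded at ingest, which also works when only two copies exist.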

Agreement between UK Legal Deposit Libraries

Use of single IT infrastructure, based on BL Digital Library System, to share legal deposit content

Use of single ingest point (Boston Spa) for legal deposit content

Deployment of ‘nodes’ at BL, NLW & NLS for resilience, operational efficiency, autonomy of operation. Oxford and Cambridge to access content from BL node.

Consistent approach to preservation, metadata standards, SLAs (service level agreements), infrastructure operations.

Access controls

Trinity College Dublin will be included when legislation allows

Digital Library System Contents

Live Content Streams

Sound Archives (BL)

Voluntary Digital Donations (Vol. Scheme)

Nineteenth Century Digitised Books (BL)

Born Digital Newspapers (BL Pilot)

eJournals (Vol. Scheme)

Digitised Newspapers (BL)

Storage

>500,000 Digital Items

~50 Terabytes of Content

Long-Term Access (aka Digital Preservation)

Dedicated digital preservation team at BL

Digital Library System currently supports Bit-level Preservation – long term integrity of ingested ‘bits’.

Also need to support Content-level Preservation, where the DLS is able to provide long-term access to the content, ensuring that users can render and use preserved content.

The Planets Project will deliver preservation modules for DLS in summer 2010.

  Identification of at risk content
  Support for file format migrations
  Technology watch service
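
As a rough illustration of what "identification of at risk content" involves, the sketch below filters an inventory of items by file format. It says nothing about how the Planets modules actually work: the function `at_risk_items`, the inventory layout and the idea of a flat set of flagged formats are assumptions, and any format names passed in are placeholders supplied by the caller.

```python
from collections import defaultdict


def at_risk_items(inventory: dict[str, str], at_risk_formats: set[str]) -> dict[str, list[str]]:
    """Group item identifiers by format, keeping only formats flagged as at risk.

    `inventory` maps an item identifier to its file format name.
    `at_risk_formats` holds lower-case format names, e.g. from a technology watch.
    """
    flagged = defaultdict(list)
    for item_id, file_format in sorted(inventory.items()):
        key = file_format.lower()
        if key in at_risk_formats:
            flagged[key].append(item_id)
    return dict(flagged)
```

The grouped result gives a natural work list for format migration: one migration path per flagged format, applied to every item listed under it.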

2. Storage – Some remaining issues

Ongoing cost
  Storage
  Can we share common costs (Tools, Technology watch, Test-beds)?

Can ‘dynamic’ items be frozen and, more importantly, unfrozen?

How many file formats/software will become obsolete, requiring heroic efforts to recreate the original user experience?

How are external references maintained over time?


Digital Policy & Rights Management

To provide the widest possible access to our digital collections while respecting the terms and conditions of licenses, voluntary schemes and regulations.

Most content controlled by copyright/legal deposit restrictions – will this change?

Current access control supports: Embargoed (no access), Authorised staff only, Reading room only

To be developed:
  Internet
  Single consecutive use at legal deposit libraries
  Secure container so that readers can use own PCs to access legal deposit content
  Mobile (anywhere) access
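
The three access levels currently supported (embargoed, authorised staff only, reading room only) can be expressed as a small policy check. This is a sketch, not the DLS implementation: the `Access` enum and `may_view` function are hypothetical, and the rule that staff status does not extend reading-room-only access to other locations is an assumption the slides do not confirm.

```python
from enum import Enum


class Access(Enum):
    EMBARGOED = "embargoed"              # no access for anyone
    STAFF_ONLY = "authorised staff only"
    READING_ROOM = "reading room only"


def may_view(level: Access, is_authorised_staff: bool, in_reading_room: bool) -> bool:
    """Decide whether a request may see an item with the given access level."""
    if level is Access.EMBARGOED:
        return False
    if level is Access.STAFF_ONLY:
        return is_authorised_staff
    # READING_ROOM: assumed to depend only on where the request comes from.
    return in_reading_room
```

The modes listed as still to be developed (Internet, single consecutive use, secure container, mobile) would each need extra state, such as a per-item lock for single consecutive use, which this stateless check does not model.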

Content Navigation & Discovery

The most important issue

Catalogue model designed for two levels of hierarchy (Title & holdings)

Using Ex Libris Primo product as initial solution (Lucene full-text search engine embedded in product)

Much more needed – need help!
  Persistent links
  Full featured commercial search engines
  Semantic web/Linked data/RDF Triples
  Text mining, entity extraction
  Information visualisation techniques
  Hardware developments, mobile technologies, large displays

3. Access – Some remaining issues

With huge quantity of content how can people find what they want?

How can we support the development of sophisticated content navigation tools?

Where should we invest in resource discovery?

Conclusion

We have developed a highly-resilient, scalable store for digital items

We will need to archive a very broad range of content.

The BL Digital Library System will be used by the legal deposit libraries to share legal deposit content

However, this feels like the beginning of a very long journey!

We will need considerable help along the way

Thank you.