The Archival Problem & Infrastructure for Solutions
What needs to be archived and what needs to be done?
Richard Boulderstone, Director eStrategy, February 2010
What needs to be done?
1. Ingest
    Transition from print to digital information resources
    Heterogeneity, complexity and scale of digital content
    Interactive items
    Should we validate?
2. Storage
    Long term authenticity of items
    Loss or corruption
    External references
3. Access
    Securely share content with other legal deposit libraries
    Long term access: beyond life of original hardware & software platform (aka Digital Preservation)
    Controlled access (public domain, legal deposit, licensed)
    Content must be easy to find. Important!
Digital Library Architecture & Design Considerations
Started with the long-term storage problem
Wanted a cost-effective, highly resilient store (highly unlikely to lose items or have items corrupted) with long-term integrity
Analysis showed that magnetic tape solutions had limitations:
    As the size of the store grows (petabytes), total recovery time can be long
    Cost of tape is not much less than commodity spinning disk
    Wanted continuous validation to ensure content retained integrity
Disk storage market tends to focus ‘value-added’ products on high transaction rates, large capacity and high reliability; our requirement is low cost, large capacity and reasonable performance
Needed to share some of the archived content with other legal deposit libraries
Architect resilience into the system: is ‘backup’ part of the architecture?
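
The "continuous validation" requirement can be illustrated with a simple fixity check: record a checksum for each item at ingest, then re-hash the stored files on a schedule and flag anything missing or changed. This is only a minimal sketch under assumptions of my own; the slides do not say which hash or manifest format the Digital Library System uses, and the function names `sha256_of`, `record_at_ingest` and `validate` are hypothetical.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Stream a file through SHA-256 so large items need not fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def record_at_ingest(paths):
    """Build a manifest (path -> checksum) once, when items enter the store."""
    return {str(p): sha256_of(Path(p)) for p in paths}


def validate(manifest):
    """Re-hash every item; return the sorted paths that are missing or changed."""
    damaged = []
    for name, expected in manifest.items():
        path = Path(name)
        if not path.is_file() or sha256_of(path) != expected:
            damaged.append(name)
    return sorted(damaged)
```

Running `validate` in a loop over the whole store is what makes the check "continuous"; on disk this is a sequential read, which is one reason it is easier to do there than on tape.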
CONTENT STREAMS with operational DIGITAL LIBRARY INGEST
ID Name Description
1 DIGITISED BOOKS & JOURNALS
    Microsoft-funded digitised nineteenth century books accessible from DLS to the Reading Rooms via ILS. Ingest is complete.
2 VOLUNTARY ELECTRONIC LEGAL DEPOSIT (VELD)
    Deposited hand held media and offline electronic media submitted ‘in lieu’ of electronic legal deposit legislation; includes journals, books etc (formerly known as VDEP material; excludes scholarly e-journals which are identified as a separate stream). A very limited amount of this content is accessible from DLS in the Reading Rooms via ILS.
3 FIELD SOUND RECORDINGS
    Born digital field recordings created by Sound Archive. Low volume, accessible from DLS to Reading Rooms via Sound Server.
4 eJOURNALS
    Scholarly e-Journals sent to the BL as part of the Voluntary Deposit scheme. Ingest of simple eJournals (Stream 2) is live. Progress towards ingest of complex eJournals is halted while the technical options are being considered. Work has started on a project to ingest ESTAR eJournals.
5a DIGITAL NEWSPAPERS
    Contemporary newspapers to be supplied digitally & directly to the BL by newspaper publishers. A pilot of a small number of titles from a single publisher is live. Progress is halted pending agreements from publishers.
5b LEGACY DIGITISED NEWSPAPERS
    Scanned historical newspapers already in the BL’s current collection. Ingest of JISC-1 is currently underway.
CONTENT STREAMS prioritised for DIGITAL LIBRARY INGEST
ID Name Description
6 WEB ARCHIVING
    An archive of web-sites gathered after gaining permission from rights holders. Following Legal Deposit Regulations the BL will be able to harvest sites from the uk domain without asking for permission. The project to ingest this collection is at the shape stage.
7 BORN DIGITAL SOUND
    Born digital sound recordings, acquisitions & voluntary deposits. Expanding, with implications for Gateway project & links to Moving Image stream. The project to ingest this collection is at the shape stage.
OTHER CONTENT STREAMS for DIGITAL LIBRARY INGEST
ID Name Description
8 LEGACY DIGITISED MASTERS
    All existing image-based digitisation products largely on hand-held media with significant preservation risk
9 NEW DIGITISED MASTERS
    Digitised images from forthcoming projects including Single-Sheet Digitisation, Vulnerable Items Imaging, possibly Greek Manuscripts. Fast Track to Safety has been used to process the Vulnerable Collection Items (VCI) images; these are now DLS-ready. The same process can be used for the single-sheet digitised objects & other Aleph catalogued content
10 DIGITISED SOUND
    Digitised archival sound recordings funded by JISC (ASR1 & ASR2)
11 eMANUSCRIPTS
    Hybrid collections, comprising paper, computer & other media with significant issues (technical, privacy, rights management etc)
12 DATABASES
    Including large numerical datasets
13 DIGITAL MAPS
    Including OS/OSNI MasterMap data and other contacts / agreements (actually a dataset rather than a ‘map’)
14 E-THESES
    Digitised & born digital theses funded by JISC (eThOS Project)
15 ELECTRONIC GREY LITERATURE
    Scientific technical & business documents, conference papers, newsletters, e-govt documents i.e. not readily available through commercial channels. Some of this content is already ingested to DLS via the VELD route (see 2 above)
16 eBOOKS
    Assumes born-digital material. Some of this content is already ingested to DLS via the VELD route (see 2 above)
17 MOVING IMAGES
    Digital recordings of television programmes, online podcasts etc
1. Ingest – Some remaining issues
‘Dynamic’ content: update after initial deposit
    Currently use snapshot, version-based approach
    Other generic solutions?
Should we archive published outputs, underlying data or both?
Growing diversity of content
    Should we validate to ensure long-term access?
    Container formats may hide significant complexity (3D pdf)
Scale
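
The snapshot, version-based approach to dynamic content mentioned on this slide can be pictured as an append-only list of snapshots per item, where a re-deposit is stored only if the content has changed. The sketch below is an illustration under my own assumptions; the class `VersionedItem`, its fields and the label scheme are hypothetical and not taken from the Digital Library System.

```python
import hashlib
from dataclasses import dataclass, field


@dataclass
class VersionedItem:
    """Append-only list of snapshots for one deposited item (hypothetical model)."""
    item_id: str
    snapshots: list = field(default_factory=list)  # entries: (label, sha256, content)

    def deposit(self, label: str, content: bytes) -> bool:
        """Store a new snapshot; skip it if it is identical to the latest one."""
        checksum = hashlib.sha256(content).hexdigest()
        if self.snapshots and self.snapshots[-1][1] == checksum:
            return False
        self.snapshots.append((label, checksum, content))
        return True

    def as_of(self, label: str) -> bytes:
        """Return the content stored under the given snapshot label."""
        for snap_label, _, content in self.snapshots:
            if snap_label == label:
                return content
        raise KeyError(label)
```

Earlier snapshots are never overwritten, so any version that was deposited stays retrievable; what the model cannot capture is change that happens between snapshots, which is the limitation behind the "other generic solutions?" question.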
Legal Deposit Libraries Shared Infrastructure
[Map of sites: Edinburgh - 2010, Aberystwyth, Boston Spa, St. Pancras, Cambridge Univ., Oxford Univ.]
Large scale, highly resilient digital store
Complete copies of content at each node
Continuous validation & correction
Long term digital storage for BL content & eLegal deposit distribution
Distribution of eLegal deposit content (NLW, NLS and Oxford & Cambridge)
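
Holding a complete copy at each node is what makes "correction" possible: a damaged copy at one node can be replaced from the others. One common way to decide which copy is good is a majority vote over checksums, sketched below. The slides do not describe the actual correction mechanism, so the voting rule, the function `repair_item` and the node names in the usage note are assumptions for illustration only.

```python
import hashlib
from collections import Counter


def digest(data: bytes) -> str:
    """SHA-256 checksum of one copy of an item."""
    return hashlib.sha256(data).hexdigest()


def repair_item(copies: dict) -> tuple:
    """Given {node name: bytes}, return (repaired copies, corrected node names).

    A copy is trusted only when a strict majority of nodes hold identical
    bytes; otherwise ValueError is raised so a person can investigate.
    """
    votes = Counter(digest(data) for data in copies.values())
    winner, count = votes.most_common(1)[0]
    if count * 2 <= len(copies):
        raise ValueError("no majority among node copies")
    good = next(data for data in copies.values() if digest(data) == winner)
    corrected = sorted(node for node, data in copies.items() if digest(data) != winner)
    repaired = {node: good for node in copies}
    return repaired, corrected
```

For example, `repair_item({"node_a": b"page", "node_b": b"page", "node_c": b"pagX"})` reports `node_c` as corrected and returns three identical copies. If a checksum recorded at ingest is available, comparing each copy against it would be a stronger test than voting.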
Agreement between UK Legal Deposit Libraries
Use of single IT infrastructure, based on BL Digital Library System, to share legal deposit content
Use of single ingest point (Boston Spa) for legal deposit content
Deployment of ‘nodes’ at BL, NLW & NLS for resilience, operational efficiency, autonomy of operation. Oxford and Cambridge to access content from BL node.
Consistent approach to preservation, metadata standards, SLAs (service level agreements), infrastructure operations.
Access controls
Trinity College Dublin will be included when legislation allows
Digital Library System Contents
Live Content Streams
Sound Archives (BL)
Voluntary Digital Donations (Vol. Scheme)
Nineteenth Century Digitised Books (BL)
Born Digital Newspapers (BL Pilot)
eJournals (Vol. Scheme)
Digitised Newspapers (BL)
Storage
>500,000 Digital Items
~50 Terabytes of Content
Long-Term Access (aka Digital Preservation)
Dedicated digital preservation team at BL
Digital Library System currently supports Bit-level Preservation – long term integrity of ingested ‘bits’.
Also need to support Content-level Preservation, where the DLS is able to provide long-term access to the content, ensuring that users can render and use preserved content.
The Planets Project will deliver preservation modules for DLS in summer 2010.
    Identification of at risk content
    Support for file format migrations
    Technology watch service
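
Identification of at-risk content can be sketched as matching each stored file's format against a watch list maintained by a technology watch service. The example below is deliberately simplified and rests on assumptions: the watch list entries are illustrative only, the function `at_risk_report` is hypothetical, and it keys on file extensions, whereas real identification tools inspect file signatures. It does not represent the Planets modules.

```python
from collections import defaultdict
from pathlib import PurePath

# Illustrative watch list only: extension -> reason for concern (assumed values).
AT_RISK = {
    ".wpd": "legacy word-processor format",
    ".swf": "needs a specific player to render",
}


def at_risk_report(filenames, watch_list=AT_RISK):
    """Group filenames by at-risk extension; files not on the list are omitted."""
    report = defaultdict(list)
    for name in filenames:
        extension = PurePath(name).suffix.lower()
        if extension in watch_list:
            report[extension].append(name)
    return dict(report)
```

The resulting groups are natural units of work for a format migration: everything listed under one extension can be converted with the same tool and checked in the same way.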
2. Storage – Some remaining issues
Ongoing cost
    Storage
    Can we share common costs (tools, technology watch, test-beds)?
Can ‘dynamic’ items be frozen and, more importantly, unfrozen?
How many file formats/software will become obsolete, requiring heroic efforts to recreate the original user experience?
How are external references maintained over time?
Digital Policy & Rights Management
To provide the widest possible access to our digital collections while respecting the terms and conditions of licenses, voluntary schemes and regulations.
Most content controlled by copyright/legal deposit restrictions – will this change?
Current access control supports:
    Embargoed (no access)
    Authorised staff only
    Reading room only
To be developed:
    Internet
    Single consecutive use at legal deposit libraries
    Secure container so that readers can use own PCs to access legal deposit content
    Mobile (anywhere) access
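
The three access levels currently supported can be expressed as a small policy check. This is a sketch under stated assumptions, not the Digital Library System's implementation: the names `AccessLevel`, `Request` and `may_access` are hypothetical, and I assume the levels are independent, so that authorised staff outside a reading room do not automatically see reading-room-only content.

```python
from dataclasses import dataclass
from enum import Enum


class AccessLevel(Enum):
    EMBARGOED = "embargoed"        # no access
    STAFF_ONLY = "staff_only"      # authorised staff only
    READING_ROOM = "reading_room"  # reading room only


@dataclass(frozen=True)
class Request:
    """Who is asking and from where (hypothetical request model)."""
    is_authorised_staff: bool
    in_reading_room: bool


def may_access(level: AccessLevel, request: Request) -> bool:
    """Return True if the request satisfies the item's access level."""
    if level is AccessLevel.EMBARGOED:
        return False
    if level is AccessLevel.STAFF_ONLY:
        return request.is_authorised_staff
    if level is AccessLevel.READING_ROOM:
        return request.in_reading_room
    raise ValueError(f"unknown access level: {level}")
```

The modes still to be developed would each need more context than this model carries, for example a count of concurrent sessions for single consecutive use, or proof that the reader's PC is running the secure container.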
Content Navigation & Discovery
The most important issue
Catalogue model designed for two levels of hierarchy (Title & holdings)
Using Ex Libris Primo product as initial solution (Lucene full-text search engine embedded in product)
Much more needed; need help!
    Persistent links
    Full featured commercial search engines
    Semantic web/Linked data/RDF Triples
    Text mining, entity extraction
    Information visualisation techniques
    Hardware developments, mobile technologies, large displays
3. Access – Some remaining issues
With a huge quantity of content, how can people find what they want?
How can we support the development of sophisticated content navigation tools?
Where should we invest in resource discovery?
Conclusion
We have developed a highly-resilient, scalable store for digital items
We will need to archive a very broad range of content.
The BL Digital Library System will be used by the legal deposit libraries to share legal deposit content
However, this feels like the beginning of a very long journey!
We will need considerable help along the way
Thank you.