Post on 25-May-2015
The Challenges of Preserving Every Digital Format on the Face of the Planet
Leslie Johnston
March 26, 2012
Well, not every format
But we often have little or no control over what comes into the Library of Congress Digital Collections, and we manage and preserve a wide variety of formats.
What are examples of some of the collecting and preservation challenges?
NATIONAL DIGITAL NEWSPAPER PROGRAM chroniclingamerica.loc.gov/
A partnership between the National Endowment for the Humanities and the Library of Congress:
Enhance access to America's newspapers
Sustainable digital collection
Scalable, phased, cost-effective management
The program has:
Multiple producers (25 now, ultimately 54)
Digitization standards (http://loc.gov/ndnp/)
Free and open public access
APIs for machine access and automated processes
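As a rough illustration of what machine access can look like, the sketch below builds a full-text search URL that asks for JSON. The endpoint path and the `andtext` and `format=json` parameters are assumptions based on the publicly documented Chronicling America search interface, which may have changed since this talk; `build_page_search_url` is a hypothetical helper name, not part of any Library toolkit.

```python
from urllib.parse import urlencode

# Assumed search endpoint; check the current API documentation before relying on it.
BASE_URL = "https://chroniclingamerica.loc.gov/search/pages/results/"


def build_page_search_url(text: str) -> str:
    """Return a URL that searches newspaper page text and requests JSON output."""
    query = urlencode({"andtext": text, "format": "json"})
    return f"{BASE_URL}?{query}"


if __name__ == "__main__":
    # Print the URL only; fetching it is left to the caller.
    print(build_page_search_url("suffrage"))
```

The resulting URL could then be fetched with any HTTP client and parsed as JSON by an automated process.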
Files
TIFFs, JPEGs, JPEG 2000s, and XML.
Over 4 million newspaper pages ingested to date
Over 250 TB of data
WEB ARCHIVING http://www.loc.gov/webarchiving/ lcweb2.loc.gov/diglib/lcwa/html/lcwa-home.html
The Library has been archiving the web since 2000. Subject area specialists curate the collections, and Library catalogers create collection-level metadata records.
The collections include:
U.S. elections
Web sites created by members of the House and Senate
Thematic collections around events, such as elections in the Philippines, the Iraq war, and the appointment of Supreme Court Justices.
Collections around an area of study, such as Legal “Blawgs”
The file formats include every format possible on the web. The collection comprises approximately 5 billion files in 300 TB.
NATIONAL DIGITAL INFORMATION INFRASTRUCTURE & PRESERVATION PROGRAM digitalpreservation.gov
CONTENT TYPES
Images and Text
Audio Visual
Geospatial
Web Sites
PACKARD CAMPUS NATIONAL AUDIO-VISUAL CENTER
Preserving Film, Broadcast Television, and Audio
The Packard Campus operates a variety of preservation workflows, including those for obsolete physical formats such as wire recordings, wax cylinders, and 2" videotape. The Campus is fully equipped to play back and preserve all antique film, video and sound formats, and to maintain that capability far into the future.
The facility also handles born-digital video and audio received directly from producers.
The formats include MPEG-4, MP3, BWF, AVI, and a wide variety of specialized commercial formats.
eDEPOSIT FOR eSERIALS
eDeposit for eSerials is a collaborative effort between the U.S. Copyright Office and the Library of Congress.
Copyright Mandatory Deposit represents the largest acquisitions channel for the Library. In general, all U.S. publishers are legally required to submit for deposit two copies of each of their publications to the Copyright Office. This mechanism has allowed the Library to build the collection and to preserve the publications.
eSerials became subject to mandatory deposit in January 2010, with the publication of a new interim regulation. Demands began in June 2010 and files began to arrive in October 2010.
The files must come to the Library “as published” – in whatever their original formats are. This means a wide variety of XML content and metadata, HTML, and PDFs.
WORLD DIGITAL LIBRARY www.wdl.org
Deliver historically significant primary materials from cultures around the world to an international multilingual audience
Over 100 participating partner institutions, and contributions from over 40 institutions so far.
Representing all 193 UNESCO member countries.
Maps, prints, photographs, rare books, manuscripts, journals, sound recordings, and motion pictures.
Metadata in Arabic, Chinese, French, English, Portuguese, Russian, and Spanish.
JPEG 2000s, PDFs, XML.
THE TWITTER ARCHIVE
Every public tweet since Twitter’s launch in March 2006.
We have a historic 2006-2010 archive and ongoing access to new tweets.
We do not receive personal account information, linked images, or linked web page content.
Tweets will not move into the archive until six months after their initial posting.
The Library’s researcher services will not recreate Twitter, and the archive cannot be openly accessible.
We are testing various technologies, and entering a pilot phase with test researchers. We will announce it when the archive is open to all researchers.
The collection comprises only a few TB, but over 80 billion tweets.
An FAQ is available online at: http://blogs.loc.gov/loc/2010/04/the-library-and-twitter-an-faq/
So how are we making this easier for the Library to manage?
Preservation Infrastructure
• The Library developed the BagIt transfer specification for the movement of files between and within organizations.
• http://www.digitalpreservation.gov/documents/bagitspec.pdf
• The Library inventories all incoming files, and is inventorying all digital content.
• We maintain multiple copies of files on servers and on tape, in geographically distributed locations.
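To make the transfer and inventory ideas concrete, here is a minimal sketch of a BagIt-style package using only the Python standard library. It assumes the basic layout from the BagIt specification (a `bagit.txt` declaration, a `data/` payload directory, and a checksum manifest); the version string, the choice of SHA-256, and the function names `make_bag` and `verify_bag` are illustrative assumptions, not the Library's own tooling.

```python
import hashlib
import shutil
from pathlib import Path
from typing import List


def sha256_of(path: Path) -> str:
    """Hash a file in chunks so large files need not fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def make_bag(source_dir: Path, bag_dir: Path) -> None:
    """Copy source_dir into bag_dir/data and write the declaration and manifest."""
    payload_dir = bag_dir / "data"
    shutil.copytree(source_dir, payload_dir)  # also creates bag_dir
    # Version string is an assumption; use whichever spec version applies.
    (bag_dir / "bagit.txt").write_text(
        "BagIt-Version: 0.97\nTag-File-Character-Encoding: UTF-8\n",
        encoding="utf-8",
    )
    lines = []
    for path in sorted(p for p in payload_dir.rglob("*") if p.is_file()):
        relative = path.relative_to(bag_dir).as_posix()
        lines.append(f"{sha256_of(path)}  {relative}\n")
    (bag_dir / "manifest-sha256.txt").write_text("".join(lines), encoding="utf-8")


def verify_bag(bag_dir: Path) -> List[str]:
    """Return payload paths that are missing or no longer match the manifest."""
    failures = []
    manifest = (bag_dir / "manifest-sha256.txt").read_text(encoding="utf-8")
    for line in manifest.splitlines():
        expected, relative = line.split(maxsplit=1)
        target = bag_dir / relative
        if not target.is_file() or sha256_of(target) != expected:
            failures.append(relative)
    return failures
```

Running `verify_bag` on each stored copy gives a simple fixity check: an empty list means every payload file still matches the checksum recorded at transfer time.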
Preservation Partnerships
The Library cannot collect everything on its own, so it works as part of:
The National Digital Stewardship Alliance http://www.digitalpreservation.gov/ndsa/
The International Internet Preservation Consortium http://netpreserve.org/about/index.php
among others…
What are the Library’s strategies for formats?
• The Library has documented sustainability factors for file formats.
• http://www.digitalpreservation.gov/formats/
• For cases where we do have control over what comes in, we have a “Best Edition” Preferred Formats statement, which is currently being updated.
• http://www.copyright.gov/circs/circ07b.pdf
• The Library is developing Format Preservation Action Plans.
DISCUSSION?
Leslie Johnston
Chief of Repository Development
Manager of Technical Architecture Initiatives, NDIIPP
lesliej@loc.gov