PANDORA and Beyond: Managing Web Archiving at the National Library of Australia
Digital Preservation Seminar
National Library of Australia, 21 November 2006
Paul Koerbin
Manager Digital Archiving
National Library of Australia
[email protected]
PANDORA and Beyond
• Context and background
• PANDORA – selective archiving
• PANDAS – a web archiving system
• Domain harvesting
• Now and beyond
PANDORA and Beyond – Context - Legislation
• National Library Act, 1960
• Functions of the National Library
– Maintain and develop a national collection of library material, including a comprehensive collection of library material relating to Australia and the Australian people
– To make library material in the national collection available … in the national interest
– ‘Library material’ ~ books, periodicals, newspapers, manuscripts, films, sound recordings, musical scores, maps, plans, pictures, photographs, prints and other recorded material …
PANDORA and Beyond – Context - Legislation
• Copyright Act, 1968 – Sect 201
• Delivery of library materials to the National Library
– ‘Library material’ ~ book, periodical, newspaper, pamphlet, sheet of letter-press, sheet of music, map, plan, chart or table, being a literary, dramatic, musical or artistic work or an edition of such a work …
• Enabling and supportive legislation does not address the collection of digital content
• Copyright Amendment (Digital Agenda) Act, 2000
– some support for digital preservation actions
PANDORA and Beyond – Context – Web Publishing
• World Wide Web: a new publishing medium, 1995→
• Defining a publication for our purpose:
A publication is information, regardless of its format or method of delivery, that is made available to the general public, or to an identified public, either free of charge or for a fee.
Definition from: PANDORA Selection Guidelines
http://pandora.nla.gov.au/selectionguidelines.html#pubdefinition
• Content rendered through a web browser
• Email – only as delivery mechanism (e.g. PDF)
• Databases – yes, but more problematic
PANDORA and Beyond – Context – Web Publishing
• Enormous growth and volume of material
• Everyone can be creators and publishers
• Virtually instantaneous publication
• Dynamic content and format
• Multiplicity of formats
• Technology dependent
• Hyperlinked and interconnected
• Highly accessible but hard to identify
• Ephemeral
• Interactivity, re-use, personalisation (web 2.0)
PANDORA and Beyond – Context – Some Objectives
• Fulfil the functions of the National Library
• Identify published content to collect
• Manage content for long term preservation
– Integrity of the data streams
– Maintain access to authentic content
• Provide persistent access to the content
• Incorporate collection and preservation of web content into routine Library processes
• Efficient and sustainable
PANDORA and Beyond – The PANDORA Archive
• PANDORA Archive 1996→
• Began as proof-of-concept project
• Now a routine process within NLA
• Currently 10 participants
– NLA, state libraries (not Tas), NFSA, AWM, AIATSIS
• Selective, content focused (bibliocentric)
– simple documents to whole websites
• PANDAS workflow management system, 2001→
PANDORA and Beyond – PANDORA – Web Archiving
What is web archiving?
• Identifying and selecting
• Seeking permission to collect and make accessible
• Recording metadata
• Crawling/harvesting (including scheduling)
• Processing for quality assurance (best effort)
• Storing and maintaining the data
• Preparing and rendering for public display
• Creating resource discovery metadata
PANDORA and Beyond – PANDAS
• PANDAS – PANDORA Digital Archiving System
• Web based workflow management system
• Developed specifically to manage the web archiving processes at the National Library of Australia
• Used by PANDORA’s participants located throughout Australia (mainland state libraries, AWM, NFSA, AIATSIS)
• Also used by UKWAC
PANDORA and Beyond – PANDAS
• Developed in-house at the NLA
• Replaced multiple non-integrated systems used between 1996 and 2001
• Written in Java on Apple WebObjects application development platform
• Presentation, application, business and data layers
• Version 1 released June 2001
• Version 2 released August 2002
• Version 3 due early 2007
PANDORA and Beyond – PANDAS
• Record administrative metadata about titles selected (or considered) for archiving
• Schedule and initiate harvesting – but not a crawler; currently use HTTrack
• Manage quality assurance checking and problem fixing workflow
• Prepare and deliver archived copies for public display through the PANDORA home page
– dynamically from PANDAS database
• Manage access restrictions
• Facilitates management reporting
PANDORA and Beyond – Persistent URIs
• Running number generated by PANDAS
• Persistent URL applied to title entry page
http://nla.gov.au/nla.arc-21220
• Logically extended to any resource in the Archive
http://nla.gov.au/nla.arc-21220-20030822-www.ipjp.org/september2002/schweitzer-ed.html
• Citation generator on public interface
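The two example URLs above suggest a simple composition rule, sketched below in Python. This is an illustration inferred from the examples only, not the PANDAS implementation: the function names are hypothetical, and reading the middle field (20030822) as a YYYYMMDD harvest date is an assumption.

```python
from urllib.parse import urlsplit

# Resolver prefix and "nla.arc-" scheme as they appear in the slide's examples.
RESOLVER = "http://nla.gov.au/"
PREFIX = "nla.arc-"


def title_uri(title_number: int) -> str:
    """Persistent URL for a title entry page (hypothetical helper)."""
    return f"{RESOLVER}{PREFIX}{title_number}"


def resource_uri(title_number: int, harvest_date: str, original_url: str) -> str:
    """Extend a title URL to a single archived resource.

    Assumes harvest_date is YYYYMMDD and that the original URL is appended
    without its scheme; query strings are ignored in this sketch.
    """
    parts = urlsplit(original_url)
    return f"{title_uri(title_number)}-{harvest_date}-{parts.netloc}{parts.path}"


def parse_uri(uri: str):
    """Split a persistent URL into (title number, date or None, path or None)."""
    if not uri.startswith(RESOLVER + PREFIX):
        raise ValueError(f"not an nla.arc identifier: {uri}")
    # Split on the first two hyphens only, so hyphens in the path survive.
    fields = uri[len(RESOLVER + PREFIX):].split("-", 2)
    number = int(fields[0])
    date = fields[1] if len(fields) > 1 else None
    path = fields[2] if len(fields) > 2 else None
    return number, date, path
```

Calling resource_uri(21220, "20030822", "http://www.ipjp.org/september2002/schweitzer-ed.html") reproduces the second URL on the slide, and parse_uri recovers the running number from either form.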
PANDORA and Beyond – PANDORA Statistics
Indicative statistics as at October 2006
• 13,000+ titles
• 26,000+ archived instances
• 33.5+ million files*
• 1.2+ Terabytes data*
*These figures are for the display copy only. Three preservation copies are actually maintained: a preservation master, an access master and a metadata master.
PANDORA and Beyond – Domain Harvesting
• Crawl conducted by the Internet Archive for the NLA
• 1st harvest June/July 2005
– 4 weeks, 185m files, 6.69 TB
• 2nd harvest Aug/Sept 2006
– 5 weeks, 516m files, 19.04 TB
• Harvest of the .au top level domain
– plus, non .au hosts identified through geoIP lookup as being hosted in Australia
• Domain harvesting – obvious choice?
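The scoping rule on this slide (the .au domain, plus hosts that a geoIP lookup places in Australia) can be sketched as a small predicate. This is only an illustration of the rule as stated, not the crawler configuration used for the harvests; the function name is hypothetical and the geoIP lookup is passed in as a caller-supplied function because the slides do not say which service was used.

```python
from typing import Callable


def in_scope(host: str, hosted_in_australia: Callable[[str], bool]) -> bool:
    """Return True if a host falls within the domain harvest scope.

    hosted_in_australia stands in for a geoIP lookup (an assumption of this
    sketch); it is only consulted for hosts outside the .au domain.
    """
    host = host.lower().rstrip(".")
    if host == "au" or host.endswith(".au"):
        return True
    return hosted_in_australia(host)


# Stub lookup for illustration: a fixed set of hosts treated as Australian-hosted.
known_australian_hosts = {"example.org"}


def stub_lookup(host: str) -> bool:
    return host in known_australian_hosts


print(in_scope("www.example.com.au", stub_lookup))
print(in_scope("example.org", stub_lookup))
print(in_scope("example.com", stub_lookup))
```

Keeping the lookup separate from the suffix test makes the geo-location dependence explicit: any error or bias in the lookup affects only the non-.au part of the scope.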
Comparative statistics

PANDORA (c. 6% of 2006 DH)
  Files:          33 million
  Size:           1.2 TB
  HTML:           67%
  Image files:    28.5%
  PDF files:      1.6%
  MS Word files:  0.2%

[Figure: DH MIME types]

Domain Harvest    2005          2006
  Unique files    185,549,662   516,280,205
  Hosts crawled   811,523       1,046,038
  Size            6.69 TB       19.04 TB
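The "c. 6%" figure can be checked against the numbers quoted in the slides. The short Python calculation below uses the PANDORA display-copy figures (33.5 million files, 1.2 TB) and the two domain harvest totals; the variable names are illustrative only.

```python
# Figures as quoted in the slides (PANDORA display copy, October 2006).
pandora_files = 33.5e6
pandora_tb = 1.2

# Domain harvest totals for 2005 and 2006.
harvest = {
    2005: {"files": 185_549_662, "hosts": 811_523, "tb": 6.69},
    2006: {"files": 516_280_205, "hosts": 1_046_038, "tb": 19.04},
}

# PANDORA as a percentage of the 2006 domain harvest, by file count and by size.
files_share = 100 * pandora_files / harvest[2006]["files"]
size_share = 100 * pandora_tb / harvest[2006]["tb"]

# Growth of the domain harvest from 2005 to 2006.
file_growth = harvest[2006]["files"] / harvest[2005]["files"]
size_growth = harvest[2006]["tb"] / harvest[2005]["tb"]
host_growth = harvest[2006]["hosts"] / harvest[2005]["hosts"]

print(f"PANDORA share of 2006 harvest: {files_share:.1f}% of files, {size_share:.1f}% of size")
print(f"2005 to 2006 growth: files x{file_growth:.2f}, size x{size_growth:.2f}, hosts x{host_growth:.2f}")
```

Both shares come out a little above 6%, consistent with the slide. The file count and data volume grew by a factor of nearly three between harvests, while the number of hosts crawled grew by less than a third.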
PANDORA and Beyond – Domain Harvesting – Pros and Cons
• Convergence of resources, technology, collaborations, and purpose in 2005
• Some pros
– Retains linkages and context
– Large scale – more bytes for the buck
– Less selectively discriminating
• Some cons
– High dependence on the crawler technology
– Domain and geo-location bias (.au, geoIP)
– Limitations in timeliness, quality assurance, scoping, site complexity, deep web
– Legal and access issues to resolve
PANDORA and Beyond – Now
• 10 years selective web archiving for PANDORA
– publicly accessible web archive
• 2 years domain harvesting
– large scale archival content
• PANDAS
– production workflow system
• Tangible outcomes from pragmatic approach
• Doing (what we can) with limited resources
• Developing experience, knowledge and skill through practical engagement in the tasks
PANDORA and Beyond – Future Strategies
• Renewed focus on strategic thinking
• Collaborations, relationships, partnerships
– International Internet Preservation Consortium, Internet Archive
– Open source tools, standards (IIPC)
– Institutional and trusted repositories (universities and e-presses)
– Government & academic sectors (APSR, ARROW)
– ‘research information infrastructure’
• services that support the discovery and management of research resources and research outputs by and for the current and future research community
PANDORA and Beyond – Future Strategies
• Preservation planning and infrastructure
• Sustainable resourcing and workflows
• Push for legislation for collecting in the digital age
• Understanding the territory
– Personal web archiving (HanzoWeb); archive crawlers (Warrick); advanced bookmarking (spurl.net)
• Strategic use of selective and domain harvesting
• Architecture, systems and workflows for efficient management of and access to web archive collections