PANDORA and Beyond: Managing Web Archiving at the National Library of Australia Digital Preservation...

PANDORA and Beyond: Managing Web Archiving at the National Library of Australia

Digital Preservation Seminar
National Library of Australia, 21 November 2006

Paul Koerbin
Manager Digital Archiving

National Library of Australia
[email protected]

PANDORA and Beyond

• Context and background
• PANDORA – selective archiving
• PANDAS – a web archiving system
• Domain harvesting
• Now and beyond

PANDORA and Beyond – Context - Legislation

• National Library Act, 1960

• Functions of the National Library
– Maintain and develop a national collection of library material, including a comprehensive collection of library material relating to Australia and the Australian people

– To make library material in the national collection available … in the national interest

– ‘Library material’ ~ books, periodicals, newspapers, manuscripts, films, sound recordings, musical scores, maps, plans, pictures, photographs, prints and other recorded material …

PANDORA and Beyond – Context - Legislation

• Copyright Act, 1968 – Sect 201
• Delivery of library materials to the National Library
– ‘Library material’ ~ book, periodical, newspaper, pamphlet, sheet of letter-press, sheet of music, map, plan, chart or table, being a literary, dramatic, musical or artistic work or an edition of such a work …

• Enabling and supportive legislation does not address the collection of digital content

• Copyright Amendment (Digital Agenda) Act, 2000
– some support for digital preservation actions

PANDORA and Beyond – Context – Web Publishing

• World Wide Web: a new publishing medium, 1995→

• Defining a publication for our purpose:

A publication is information, regardless of its format or method of delivery, that is made available to the general public, or to an identified public, either free of charge or for a fee.

Definition from: PANDORA Selection Guidelines

http://pandora.nla.gov.au/selectionguidelines.html#pubdefinition

• Content rendered through a web browser
• Email – only as delivery mechanism (e.g. PDF)
• Databases – yes, but more problematic

PANDORA and Beyond – Context – Web Publishing

• Enormous growth and volume of material
• Everyone can be creators and publishers
• Virtually instantaneous publication
• Dynamic content and format
• Multiplicity of formats
• Technology dependent
• Hyperlinked and interconnected
• Highly accessible but hard to identify
• Ephemeral
• Interactivity, re-use, personalisation (web 2.0)

PANDORA and Beyond – Context – Some Objectives

• Fulfil the functions of the National Library
• Identify published content to collect
• Manage content for long term preservation
– Integrity of the data streams
– Maintain access to authentic content
• Provide persistent access to the content
• Incorporate collection and preservation of web content into routine Library processes
• Efficient and sustainable

PANDORA and Beyond – The PANDORA Archive

• PANDORA Archive 1996→
• Began as proof-of-concept project
• Now a routine process within NLA
• Currently 10 participants
– NLA, state libraries (not Tas), NFSA, AWM, AIATSIS
• Selective, content focused (bibliocentric)
– simple documents to whole websites
• PANDAS workflow management system, 2001→

http://pandora.nla.gov.au/

PANDORA and Beyond – PANDORA – Web Archiving

What is web archiving?
• Identifying and selecting
• Seeking permission to collect and make accessible
• Recording metadata
• Crawling/harvesting (including scheduling)
• Processing for quality assurance (best effort)
• Storing and maintaining the data
• Preparing and rendering for public display
• Creating resource discovery metadata

PANDORA and Beyond – PANDAS

• PANDAS – PANDORA Digital Archiving System
• Web based workflow management system
• Developed specifically to manage the web archiving processes at the National Library of Australia

• Used by PANDORA’s participants located throughout Australia (mainland state libraries, AWM, NFSA, AIATSIS)

• Also used by UKWAC


• Developed in-house at the NLA
• Replaced multiple non-integrated systems used between 1996 and 2001
• Written in Java on Apple WebObjects application development platform
• Presentation, application, business and data layers
• Version 1 released June 2001
• Version 2 released August 2002
• Version 3 due early 2007




• Record administrative metadata about titles selected (or considered) for archiving

• Schedule and initiate harvesting – but not a crawler; currently use HTTrack

• Manage quality assurance checking and problem fixing workflow

• Prepare and deliver archived copies for public display through the PANDORA home page
– dynamically from PANDAS database
• Manage access restrictions
• Facilitate management reporting
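The division of labour described above (PANDAS schedules and initiates harvests, while a separate tool, currently HTTrack, does the crawling) can be sketched roughly as follows. This is an illustration only and not PANDAS code, which is written in Java on WebObjects. The `ScheduledHarvest` record, the `due_harvests` helper, the seed URLs and the output directory layout are all assumptions made for the sketch. The only HTTrack detail relied on is its basic command-line form, `httrack <URL> -O <output path>`.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class ScheduledHarvest:
    """Hypothetical record: one title scheduled for gathering on a given day."""
    title_id: int
    seed_url: str
    due: date


def due_harvests(schedule, today):
    """Return the scheduled harvests that fall due on or before `today`."""
    return [h for h in schedule if h.due <= today]


def httrack_command(harvest, root="/tmp/harvests"):
    """Build (but do not run) an HTTrack command line for one harvest.

    The layout <root>/<title_id>/<YYYYMMDD> is an assumption of this sketch.
    """
    out_dir = f"{root}/{harvest.title_id}/{harvest.due:%Y%m%d}"
    return ["httrack", harvest.seed_url, "-O", out_dir]


# Illustrative schedule; the seed URLs and the second title are made up.
schedule = [
    ScheduledHarvest(21220, "http://www.ipjp.org/", date(2003, 8, 22)),
    ScheduledHarvest(99999, "http://www.example.org/", date(2030, 1, 1)),
]

commands = [httrack_command(h) for h in due_harvests(schedule, date(2006, 11, 21))]
print(commands)
```

Keeping the scheduler separate from the command builder mirrors the point made in the list: the workflow system decides what to gather and when, and the crawler could in principle be swapped for another tool.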

PANDORA and Beyond – Persistent URIs

• Running number generated by PANDAS
• Persistent URL applied to title entry page
http://nla.gov.au/nla.arc-21220
• Logically extended to any resource in the Archive

http://nla.gov.au/nla.arc-21220-20030822-www.ipjp.org/september2002/schweitzer-ed.html

• Citation generator on public interface

http://nla.gov.au/nla.arc-21220
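The identifier scheme can be illustrated with a short sketch. The pattern used here is inferred only from the two example URLs on this slide: a fixed prefix, the running number, and, for an individual resource, an eight-digit segment that appears to be a date in YYYYMMDD form followed by the original URL without its scheme. The function names are hypothetical and this is not the resolver's actual implementation.

```python
import re

# Prefix taken from the example URLs on the slide.
BASE = "http://nla.gov.au/nla.arc-"

# Assumed pattern: running number, optionally followed by -YYYYMMDD-<original URL>.
_PATTERN = re.compile(r"http://nla\.gov\.au/nla\.arc-(\d+)(?:-(\d{8})-(.+))?")


def title_uri(running_number):
    """Persistent URL for a title entry page."""
    return f"{BASE}{running_number}"


def resource_uri(running_number, instance_date, original_url):
    """Extend a title identifier to a single archived resource.

    `instance_date` is a YYYYMMDD string; the scheme (e.g. http://) is
    stripped from `original_url`, as in the slide's example.
    """
    without_scheme = re.sub(r"^[a-z][a-z0-9+.-]*://", "", original_url, flags=re.IGNORECASE)
    return f"{BASE}{running_number}-{instance_date}-{without_scheme}"


def parse_uri(uri):
    """Split a persistent URL into (running number, date, original path)."""
    match = _PATTERN.fullmatch(uri)
    if match is None:
        raise ValueError(f"not a recognised archive identifier: {uri}")
    number, instance_date, rest = match.groups()
    return int(number), instance_date, rest


example = resource_uri(
    21220, "20030822", "http://www.ipjp.org/september2002/schweitzer-ed.html"
)
print(example)
print(parse_uri(example))
print(parse_uri(title_uri(21220)))
```

Because the title identifier is a strict prefix of every resource identifier beneath it, a citation to a single page still carries the title's running number.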



PANDORA and Beyond – PANDORA Statistics

Indicative statistics as at October 2006

• 13,000+ titles
• 26,000+ archived instances
• 33.5+ million files*
• 1.2+ Terabytes data*

*These figures are for the display copy only. Three preservation copies are actually maintained: a preservation master, an access master and a metadata master.

PANDORA and Beyond – Domain Harvesting

• Crawl conducted by the Internet Archive for the NLA

• 1st harvest June/July 2005
– 4 weeks, 185m files, 6.69 TB
• 2nd harvest Aug/Sept 2006
– 5 weeks, 516m files, 19.04 TB
• Harvest of the .au top level domain
– plus, non .au hosts identified through geoIP lookup as being hosted in Australia
• Domain harvesting – obvious choice?
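The scoping rule for the domain harvest (any .au host, plus non-.au hosts that a geoIP lookup places in Australia) can be expressed as a simple predicate. This is a minimal sketch and not the Internet Archive's crawler configuration. The `hosted_in_australia` callable is a hypothetical stand-in for a real geoIP lookup, and the demonstration replaces it with a hard-coded set of made-up hosts.

```python
from urllib.parse import urlsplit


def in_scope(url, hosted_in_australia):
    """Return True if `url` falls within the domain harvest scope.

    `hosted_in_australia` is a caller-supplied function (hypothetical here)
    that takes a host name and reports whether it is hosted in Australia.
    """
    host = (urlsplit(url).hostname or "").lower()
    if not host:
        return False
    # Rule 1: anything under the .au top level domain is in scope.
    if host.endswith(".au"):
        return True
    # Rule 2: other hosts are in scope only if geo-located in Australia.
    return bool(hosted_in_australia(host))


# Stand-in for a geoIP lookup: pretend these non-.au hosts are in Australia.
FAKE_AUSTRALIAN_HOSTS = {"www.example.com"}


def fake_lookup(host):
    return host in FAKE_AUSTRALIAN_HOSTS


for candidate in ("http://www.nla.gov.au/", "http://www.example.com/", "http://www.example.org/"):
    print(candidate, in_scope(candidate, fake_lookup))
```

Passing the lookup in as a function keeps the rule itself independent of any particular geoIP database, which matters because the accuracy of that lookup is one source of the geo-location bias noted among the cons.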

Comparative statistics

PANDORA (c. 6% of 2006 DH)

Files: 33 million

Size: 1.2 TB

HTML: 67%

Image files: 28.5%

PDF files: 1.6%

MS Word files: 0.2%

DH MIME types

Domain Harvest    2005           2006
Unique files      185,549,662    516,280,205
Hosts crawled     811,523        1,046,038
Size              6.69 TB        19.04 TB
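The "c. 6% of 2006 DH" figure and the growth between the two harvests can be checked with a few lines of arithmetic. The inputs below are copied from the statistics on this slide; the variable names are only for illustration.

```python
# PANDORA display-copy figures from the slide.
pandora_files = 33_000_000
pandora_tb = 1.2

# Domain harvest (DH) figures from the slide.
dh = {
    2005: {"files": 185_549_662, "hosts": 811_523, "tb": 6.69},
    2006: {"files": 516_280_205, "hosts": 1_046_038, "tb": 19.04},
}

# PANDORA as a percentage of the 2006 domain harvest, by file count and by size.
file_share = 100 * pandora_files / dh[2006]["files"]
size_share = 100 * pandora_tb / dh[2006]["tb"]

# Growth factor from the 2005 harvest to the 2006 harvest for each measure.
growth = {key: dh[2006][key] / dh[2005][key] for key in ("files", "hosts", "tb")}

print(f"PANDORA share of 2006 DH: {file_share:.1f}% of files, {size_share:.1f}% of size")
for key, factor in growth.items():
    print(f"{key}: x{factor:.2f} from 2005 to 2006")
```

Both shares come out a little above 6%, consistent with the slide, and files and data volume each grew a little under threefold between the harvests while the number of hosts crawled grew by under a third.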

PANDORA and Beyond – Domain Harvesting – Pros and Cons

• Convergence of resources, technology, collaborations, and purpose in 2005

• Some pros
– Retains linkages and context
– Large scale – more bytes for the buck
– Less selectively discriminating
• Some cons
– High dependence on the crawler technology
– Domain and geo-location bias (.au, geoIP)
– Limitations in timeliness, quality assurance, scoping, site complexity, deep web
– Legal and access issues to resolve

PANDORA and Beyond – Now

• 10 years selective web archiving for PANDORA
– publicly accessible web archive
• 2 years domain harvesting
– large scale archival content
• PANDAS
– production workflow system
• Tangible outcomes from pragmatic approach
• Doing (what we can) with limited resources
• Developing experience, knowledge and skill through practical engagement in the tasks

PANDORA and Beyond – Future Strategies

• Renewed focus on strategic thinking
• Collaborations, relationships, partnerships
– International Internet Preservation Consortium, Internet Archive
– Open source tools, standards (IIPC)
– Institutional and trusted repositories (universities and e-presses)
– Government & academic sectors (APSR, ARROW)
– ‘research information infrastructure’
• services that support the discovery and management of research resources and research outputs by and for the current and future research community

PANDORA and Beyond – Future Strategies

• Preservation planning and infrastructure
• Sustainable resourcing and workflows
• Push for legislation for collecting in the digital age
• Understanding the territory
– Personal web archiving (HanzoWeb); archive crawlers (Warrick); advanced bookmarking (spurl.net)
• Strategic use of selective and domain harvesting
• Architecture, systems and workflows for efficient management of and access to web archive collections

PANDORA
Australia’s Web Archive


