BUILDING DIGITAL WEB ARCHIVES FOR FUTURE SCHOLARS Jani Stenvall 1.10.2004.

BUILDING DIGITAL WEB ARCHIVES FOR FUTURE SCHOLARS

Jani Stenvall

1.10.2004

Themes

• Web archiving in general: what it is + international perspectives

• Archiving Finnish web content: legal framework, plans and techniques

Web archiving?

• Storing and preserving content made available on the Internet/www

• Relates to the general idea of preserving any cultural heritage

• For the use of future scholars, researchers and common citizens

Who is archiving the Web?

• Usually national libraries: in the context of legal deposit law or some other mandate

– E.g. Nordic countries, Australia, France, UK, USA, Italy...

• Internet Archive (globally)

– www.archive.org

• Other specialized organisations

• Cooperation: International Internet Preservation Consortium (IIPC)

– www.netpreserve.org

Basics of web archiving

• Two approaches

– Selection-based

  • identifying web sites that are to be archived

– Harvesting-based (crawler-based)

  • using specialised software to collect large amounts of data (e.g. country domain-level)

• Various challenges in

– Presentation and preservation: understanding web publishing technologies

– Harvesting, link extraction, version control

– Deep web, web databases

– Cooperation with web publishers
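
Link extraction is the step that lets a harvester move from one page to the next. The following is a minimal sketch using only the Python standard library, not the method of any particular crawler; the class name `LinkExtractor`, the helper `extract_links` and the example URL are assumptions made for illustration.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urldefrag


class LinkExtractor(HTMLParser):
    """Collects absolute link targets found in href and src attributes."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if name in ("href", "src") and value:
                # Resolve relative links against the page URL and drop
                # the #fragment, which does not identify a new document.
                absolute, _fragment = urldefrag(urljoin(self.base_url, value))
                if absolute not in self.links:
                    self.links.append(absolute)


def extract_links(base_url, html):
    """Return the de-duplicated absolute links found in an HTML string."""
    parser = LinkExtractor(base_url)
    parser.feed(html)
    return parser.links


# Hypothetical page content, for illustration only.
page = '<a href="/news/">News</a><img src="logo.gif"><a href="/news/#top">Top</a>'
links = extract_links("http://www.example.fi/index.html", page)
```

A real harvester also has to deal with links built by scripts, forms and database queries, which is where the deep web challenge listed above begins.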

Finnish digital content in the National Library: digital legal deposit

• Reforming the legislation

– A draft Government Bill for a new legal deposit act introduced to the Ministry of Education in June 2003

– Still waiting for the parliamentary process

– Copyright legislation also being reformed according to the EU directive

• New responsibilities for the National Library

– Collect and store web material: "a representative and versatile sample of publicly available materials on the Internet"

– Collect and store off-line media

Collecting and Storing the Web: Case Finland

• Two methods:

– Crawling/harvesting the national web space by the National Library

– If the material cannot be harvested and/or the National Library considers it especially valuable:

  • the library notifies the publisher => the publisher deposits the material or provides the "means for storing"

  • e.g. commercial web publications, ministry reports, some deep web publications

• Different workflow and data storage for these two methods

– a web archive (harvests)

– a document archive (deposits)

The Two Methods

[Diagram: The Finnish Web Space feeds two stores. Crawling/harvesting fills the Web Archive (full text index); deposits provided by publishers fill the Document Archive (metadata in DOMS: ENCompass).]

The Two Methods

• Web Archive

– Web sites, web pages

– National domain (.fi) + other domains with national content

– html, gif, jpeg...

– Some sites with id + password

– Harvesting sets the limits

– Full text indexing, little metadata

• Document Archive

– Digital documents that we also want to catalogue in the national bibliography

– Documents that cannot be harvested (some restricted documents, deep web materials)

– Documents that are deemed quality publications (e.g. research series, e-books)

– Rich metadata used for the management and indexing, currently no full text indexing available
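
The web archive scope described above (the .fi domain plus other domains with national content) can be expressed as a simple test on each candidate URL. This is only a sketch under assumptions: the function `in_scope` and the host list `EXTRA_NATIONAL_HOSTS` are hypothetical, and how the extra hosts are actually chosen is not stated here.

```python
from urllib.parse import urlsplit

# Hypothetical hosts outside .fi that have been judged to carry
# national content; a real list would be curated by the library.
EXTRA_NATIONAL_HOSTS = frozenset({"www.example.com"})


def in_scope(url, extra_hosts=EXTRA_NATIONAL_HOSTS):
    """Return True if the URL belongs to the national web space."""
    host = (urlsplit(url).hostname or "").lower()
    # Anything under the national top-level domain is in scope.
    if host.endswith(".fi"):
        return True
    # Otherwise only explicitly listed hosts are accepted.
    return host in extra_hosts
```

For example, `in_scope("http://www.helsinki.fi/")` is accepted through the domain rule, while a .com address is accepted only if its host is on the list.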

Web archive

• Harvesting: Heritrix

• Indexing: FAST

• Search User Interface: Nordic Web Archive Toolset

• Data storage

– Currently only around 1.5 TB of data stored in a tape robot

– Currently storing the web data in the ARC format (Heritrix)

– Negotiations for a large scale storage - hopefully in production in 2005
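
In the ARC format each harvested document is preceded by a one-line header. The sketch below assumes the version 1 record layout (URL, IP address, 14-digit archive date, content type and length, separated by spaces); the names `ArcRecordHeader` and `parse_arc_header` and the sample line are made up for illustration, and the code has not been checked against real harvest files.

```python
from datetime import datetime
from typing import NamedTuple


class ArcRecordHeader(NamedTuple):
    url: str
    ip_address: str
    archive_date: datetime
    content_type: str
    length: int  # size of the record body in bytes


def parse_arc_header(line):
    """Parse one ARC v1 URL-record header line (assumed 5-field layout)."""
    # Split from the right so that only the last four fields are cut off
    # and the URL is kept whole.
    url, ip, stamp, ctype, length = line.rstrip("\r\n").rsplit(" ", 4)
    return ArcRecordHeader(
        url=url,
        ip_address=ip,
        archive_date=datetime.strptime(stamp, "%Y%m%d%H%M%S"),
        content_type=ctype,
        length=int(length),
    )


# Hypothetical header line, for illustration only.
header = parse_arc_header(
    "http://www.example.fi/ 192.0.2.1 20041001120000 text/html 1234\n"
)
```

The length field tells a reader how many bytes to take as the document before the next header starts, which is what makes it possible to store many documents in one file.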

Web Archive 2

• Archiving policies are being formed in Finland

• Will be part of the overall collection development in the National Library

• Current thinking:

– 1-2 times a year: a wide sweep (all that we can find)

– More frequent harvests for certain sites (e.g. news & media)

– Theme harvests (e.g. elections)
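
The three kinds of harvest in the current thinking can be modelled as jobs with different intervals. This is a rough sketch, not a description of how the National Library schedules its crawls: the class `HarvestJob`, the job names, and all intervals and dates are assumed values chosen for illustration.

```python
from dataclasses import dataclass
from datetime import date, timedelta
from typing import Optional


@dataclass
class HarvestJob:
    name: str
    interval_days: int                   # assumed; no exact intervals are given
    last_run: date
    active_from: Optional[date] = None   # theme harvests run only in a window
    active_until: Optional[date] = None

    def is_due(self, today: date) -> bool:
        """A job is due when it is inside its window and its interval has passed."""
        if self.active_from is not None and today < self.active_from:
            return False
        if self.active_until is not None and today > self.active_until:
            return False
        return today - self.last_run >= timedelta(days=self.interval_days)


def due_jobs(jobs, today):
    """Return the names of all jobs that should be started today."""
    return [job.name for job in jobs if job.is_due(today)]


# Hypothetical policy: a wide sweep about twice a year, daily news
# harvests, and a weekly theme harvest limited to an election period.
JOBS = [
    HarvestJob("wide-sweep", 180, date(2004, 5, 1)),
    HarvestJob("news-and-media", 1, date(2004, 9, 30)),
    HarvestJob("elections", 7, date(2004, 9, 1),
               active_from=date(2004, 10, 10), active_until=date(2004, 10, 31)),
]
```

Calling `due_jobs(JOBS, date(2004, 10, 1))` would start only the news harvest; later in the month the other two become due as well.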

Document archive and the DOMS

• Digital Object Management System (DOMS)

– Purchased from Endeavor Inc. = ENCompass for Digital Collections - not yet in production

– Will be a centralised system for universities

– Allows the description of digital objects and building of collections

– Metadata customisable

– Access restrictions customisable

– Search UI customisable for each collection

• Digital objects stored in a document archive or anywhere else

– E.g. central digital repository or a web server

– Linking from an ENCompass metadata record to the object (URI, path)

Legal deposit web content in DOMS/ENCompass

• How to deal with incoming web deposits?

– Workflows

– Formats

– Collection building

• Metadata schema to support the management and preservation of objects

– Technical metadata

– Administrative metadata

• Utilise existing metadata

– E.g. MARC-records, data from publisher, OAI

• Convert ENCompass metadata for other use

– E.g. to MARC-records in national bibliography
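
Converting a metadata record to MARC could look roughly like the sketch below. The ENCompass schema is not described here, so the input keys (`creator`, `title`, `uri`) and the functions `to_marc_fields` and `format_field` are assumptions; only the MARC 21 tags themselves (100 personal name, 245 title, 856 electronic location) are standard. The output is a simplified text form, not a full MARC record with indicators and leader.

```python
def to_marc_fields(record):
    """Map a simple metadata dict to (tag, subfields) pairs in tag order."""
    fields = []
    if record.get("creator"):
        fields.append(("100", {"a": record["creator"]}))  # main entry, personal name
    if record.get("title"):
        fields.append(("245", {"a": record["title"]}))    # title statement
    if record.get("uri"):
        fields.append(("856", {"u": record["uri"]}))      # link to the digital object
    return fields


def format_field(tag, subfields):
    """Render one field in a readable text form, e.g. '245 $a Title'."""
    return tag + " " + " ".join(
        f"${code} {value}" for code, value in subfields.items()
    )


# Hypothetical deposit record, for illustration only.
record = {
    "title": "Annual report 2003",
    "creator": "Virtanen, Maija",
    "uri": "http://www.example.fi/reports/2003.pdf",
}
marc_lines = [format_field(tag, subs) for tag, subs in to_marc_fields(record)]
```

The 856 field carries the same URI or path that links the metadata record to the stored object, so the catalogue entry in the national bibliography can point back to the document archive.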

Off-line content and ENCompass

• Finnish music publications

– Digitised + digital

• CD-ROMs and games?

– Preservation requires an emulation environment

• Video => Finnish Film Archive

• Other digitised collections regarded as legal deposit content

Access to the legal deposit collections (web archive and document archive)

• Based on the draft bill

– For researchers "and other users"

– On-site only: legal deposit libraries (currently 7) + The Finnish Film Archive

– Researcher workstations

• User interfaces

– Web archive UI

– DOMS UI

Anything else?

Questions?

Thank You!
