BUILDING DIGITAL WEB ARCHIVES FOR FUTURE SCHOLARS Jani Stenvall 1.10.2004.
Themes
• Web archiving in general: what it is + international perspectives
• Archiving Finnish web content: legal framework, plans and techniques
Web archiving?
• Storing and preserving content made available on the Internet/www
• Relates to the general idea of preserving any cultural heritage
• For the use of future scholars, researchers and common citizens
Who is archiving the Web?
• Usually national libraries: in the context of legal deposit law or some other mandate
– E.g. Nordic countries, Australia, France, UK, USA, Italy...
• Internet Archive (globally)
– www.archive.org
• Other specialized organisations
• Cooperation: International Internet Preservation Consortium (IIPC)
– www.netpreserve.org
Basics of web archiving
• Two approaches
– Selection-based
  • identifying web sites that are to be archived
– Harvesting-based (crawler-based)
  • using specialised software to collect large amounts of data (e.g. country domain-level)
• Various challenges in
– Presentation and preservation: understanding web publishing technologies
– Harvesting, link extraction, version control
– Deep web, web databases
– Cooperation with web publishers
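As an illustration of the link extraction step that a harvester performs on every fetched page, here is a minimal sketch using only the Python standard library. It is not how Heritrix or any production crawler is implemented; the class and function names are hypothetical, and the choice to follow only `href` and `src` attributes and only http/https links is an assumption made to keep the example short.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urldefrag


class LinkExtractor(HTMLParser):
    """Collect absolute http(s) URLs from href/src attributes of one page."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if name in ("href", "src") and value:
                # Resolve relative links against the page URL, drop "#fragment".
                absolute, _fragment = urldefrag(urljoin(self.base_url, value))
                is_web = absolute.startswith(("http://", "https://"))
                if is_web and absolute not in self.links:
                    self.links.append(absolute)


def extract_links(base_url, html):
    """Return the de-duplicated outgoing links of one HTML document."""
    parser = LinkExtractor(base_url)
    parser.feed(html)
    parser.close()
    return parser.links
```

A crawler would feed each extracted link back into its queue of pages to fetch. Links generated by scripts or hidden behind forms are not found this way, which is one reason the deep web is listed as a challenge.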
Finnish digital content in the National Library: digital legal deposit
• Reforming the legislation
– A draft Government Bill for a new legal deposit act introduced to the Ministry of Education in June 2003
– Still waiting for the parliamentary process
– Copyright legislation also being reformed according to the EU directive
• New responsibilities for the National Library
– Collect and store web material: "a representative and versatile sample of publicly available materials on the Internet"
– Collect and store off-line media
Collecting and Storing the Web: Case Finland
• Two methods:
– crawling/harvesting the national web space by the National Library
– If the material cannot be harvested and/or the Nat. Lib. considers it especially valuable:
  • library notifies the publisher => publisher deposits or provides the "means for storing"
  • e.g. commercial web publications, ministry reports, some deep web publications
• Different workflow and data storage for these two methods
– a web archive (harvests)
– a document archive (deposits)
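The choice between the two methods can be read as a simple decision rule. The sketch below is one possible reading of the slide, not a description of the National Library's actual workflow: the `Publication` type and its two flags are hypothetical, and in practice "especially valuable" is an editorial judgement rather than a stored boolean.

```python
from dataclasses import dataclass


@dataclass
class Publication:
    url: str
    harvestable: bool          # assumption: a crawler can reach and copy it
    especially_valuable: bool  # assumption: flagged by library staff


def choose_archive(pub):
    """Route a publication to the web archive or the document archive."""
    if pub.harvestable and not pub.especially_valuable:
        return "web archive"       # collected by crawling
    return "document archive"      # publisher is notified and deposits
```

Under this reading, valuable material goes through the deposit route even when it could also be harvested.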
The Two Methods
[Diagram: the Finnish web space feeds two archives. Crawling/harvesting leads to the Web Archive (full text index); deposits provided by publishers lead to the Document Archive (DOMS: ENCompass, metadata).]
The Two Methods
• Web Archive
– Web sites, web pages
– National domain (.fi) + other domains with national content
– html, gif, jpeg...
– Some sites with id + password
– Harvesting sets the limits
– Full text indexing, little metadata
• Document Archive
– Digital documents that we also want to catalogue in the national bibliography
– Documents that cannot be harvested (some restricted documents, deep web materials)
– Documents that are deemed quality publications (e.g. research series, e-books)
– Rich metadata used for the management and indexing, currently no full text indexing available
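The harvesting scope described for the web archive (the national .fi domain plus other domains with national content) could be expressed as a URL filter along these lines. This is a sketch only: the function name is hypothetical, and the extra host list is a placeholder, since the slides do not say how hosts outside .fi are identified.

```python
from urllib.parse import urlsplit

# Hypothetical, hand-maintained list of non-.fi hosts carrying national content.
EXTRA_NATIONAL_HOSTS = {"www.example.com"}


def in_scope(url):
    """Return True if the URL belongs to the harvesting scope."""
    host = (urlsplit(url).hostname or "").lower()
    if host.endswith(".fi"):
        return True
    return host in EXTRA_NATIONAL_HOSTS
```

A crawler would apply such a check to every extracted link before queueing it, so that the harvest does not wander off into the whole web.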
Web archive
• Harvesting: Heritrix
• Indexing: FAST
• Search User Interface: Nordic Web Archive Toolset
• Data storage
– Currently only around 1.5 TB of data stored in a tape robot
– Currently storing the web data in the ARC format (Heritrix)
– Negotiations for large-scale storage, hopefully in production in 2005
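To give an idea of what the ARC format stores, the sketch below parses the header line that precedes each archived document. It assumes the version 1 layout of the Internet Archive's ARC format, where a record header consists of five space-separated fields: URL, IP address, 14-digit archive date, content type and length in bytes. The type and function names are hypothetical, and real ARC files also contain a file-level header record and the document bodies, which are not handled here.

```python
from datetime import datetime
from typing import NamedTuple


class ArcRecordHeader(NamedTuple):
    url: str
    ip_address: str
    archive_date: datetime
    content_type: str
    length: int  # size of the record body in bytes


def parse_arc_header(line):
    """Parse one ARC v1 record header line (assumed five fields)."""
    url, ip_address, stamp, content_type, length = line.strip().split(" ")
    return ArcRecordHeader(
        url=url,
        ip_address=ip_address,
        archive_date=datetime.strptime(stamp, "%Y%m%d%H%M%S"),
        content_type=content_type,
        length=int(length),
    )
```

The timestamp in each header is what allows several versions of the same URL, captured in different harvests, to be kept side by side.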
Web Archive 2
• Archiving policies are being formed in Finland
• Will be part of the overall collection development in the National Library
• Current thinking:
– 1-2 times a year: a wide sweep (all that we can find)
– More frequent harvests for certain sites (e.g. news & media)
– Theme harvests (e.g. elections)
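The current thinking on harvest frequency could be written down as a small policy table. The sketch below is illustrative only: the policy names and the concrete intervals are assumptions (the slides say "1-2 times a year" and "more frequent", nothing more precise), and theme harvests are left out because they are triggered by events such as elections rather than by a fixed interval.

```python
from datetime import date, timedelta

# Assumed intervals; the actual policy was still being formed.
HARVEST_INTERVALS = {
    "wide_sweep": timedelta(days=182),  # roughly twice a year
    "news_media": timedelta(days=1),    # assumed daily for news sites
}


def is_due(policy, last_harvest, today):
    """Return True if a harvest under the given policy should run today."""
    return today - last_harvest >= HARVEST_INTERVALS[policy]
```

For example, `is_due("wide_sweep", date(2004, 1, 1), date(2004, 10, 1))` compares the nine months elapsed against the half-year interval.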
Document archive and the DOMS
• Digital Object Management System (DOMS)
– Purchased from Endeavor Inc. = ENCompass for Digital Collections, not yet in production
– Will be a centralised system for Universities
– Allows the description of digital objects and building of collections
– Metadata customisable
– Access restrictions customisable
– Search UI customisable for each collection
• Digital objects stored in a document archive or anywhere else
– E.g. central digital repository or a web server
– Linking from an ENCompass metadata record to the object (URI, path)
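The idea of a metadata record that points to an object stored elsewhere can be sketched as follows. This does not reflect the ENCompass data model or API, which the slides do not describe; the record type, its field names and the helper function are all hypothetical.

```python
from dataclasses import dataclass
from urllib.parse import urlsplit


@dataclass
class MetadataRecord:
    title: str
    object_location: str  # either a URI or a file system path


def location_kind(record):
    """Classify the link target as a network URI or a storage path."""
    scheme = urlsplit(record.object_location).scheme
    return "uri" if scheme in ("http", "https", "ftp") else "path"
```

Keeping only a link in the record means the object itself can be moved between a web server and a central repository, as long as the link is updated.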
Legal deposit web content in DOMS/ENCompass
• How to deal with incoming web deposits?
– Workflows
– Formats
– Collection building
• Metadata schema to support the management and preservation of objects
– Technical metadata
– Administrative metadata
• Utilise existing metadata
– E.g. MARC records, data from publisher, OAI
• Convert ENCompass metadata for other use
– E.g. to MARC records in national bibliography
Off-line content and ENCompass
• Finnish Music publications
– Digitised + digital
• CD-ROMs and Games?
– Preservation requires an emulation environment
• Video => Finnish Film Archive
• Other digitised collections regarded as legal deposit content
Access to the legal deposit collections (web archive and document archive)
• Based on the draft bill
– For researchers "and other users"
– On-site only: legal deposit libraries (currently 7) + The Finnish Film Archive
– Researcher workstations
• User interfaces
– Web archive UI
– DOMS UI