Post on 30-Dec-2015
Web Archives: Interacting with Scholars
Helen Hockx-Yu
Head of Web Archiving
British Library
28 November 2013
OVERVIEW Access to Web Archives
Web Archiving initiatives worldwide
http://en.wikipedia.org/wiki/File:Map_of_Web_archiving_initiatives_worldwide.png
(Scholarly) use of web archives?
Restricted access, e.g. large scale national web archives referred to as “dark archives”
Archiving institutions’ focus on data collection, not usage
“Document-centric” access methods
Cannot produce replicas of original websites
No agreed way of calculating / benchmarking access statistics
Little evidence of scholarly use of web archives, making it difficult to understand requirements
Access methods
URL search
Keyword search
Full-text search
Thematic Collections
Subject Browsing
Alphabetical browsing
26 15 11 11 9 14
International Internet Preservation Consortium (IIPC) – 46 members worldwide
“IIPC members’ archives” has 29 entries; 19 have full or partial online access, often permission-based
URL search as standard, universal access method - requires users to know the URL of the website they are looking for
For many archives, full-text search is the next challenge on the roadmap
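The difference between the two access methods can be sketched in a few lines. This is a minimal illustration, not any archive's actual implementation: the `captures` records, the `normalise` helper, and the example URLs and timestamps are all invented here. A URL index can only answer a query for a URL the user already knows, whereas a full-text index maps words back to the pages that contain them.

```python
import re
from collections import defaultdict
from urllib.parse import urlsplit

# Hypothetical captures: (URL, 14-digit capture timestamp, extracted page text).
captures = [
    ("http://www.example.org.uk/", "20100501120000", "General election manifesto launch"),
    ("http://www.example.org.uk/", "20120101090000", "New year message from the leader"),
    ("http://blog.example.co.uk/post/1", "20110615080000", "Notes on the election campaign"),
]

def normalise(url):
    """Reduce a URL to a lookup key: lower-case host, no 'www.', no trailing slash."""
    parts = urlsplit(url)
    host = parts.netloc.lower()
    if host.startswith("www."):
        host = host[4:]
    return host + parts.path.rstrip("/")

url_index = defaultdict(list)   # URL key -> capture timestamps
text_index = defaultdict(set)   # word -> URL keys of pages containing that word
for url, timestamp, text in captures:
    key = normalise(url)
    url_index[key].append(timestamp)
    for word in re.findall(r"[a-z]+", text.lower()):
        text_index[word].add(key)

def url_search(url):
    """All capture timestamps for a URL the user already knows."""
    return sorted(url_index.get(normalise(url), []))

def full_text_search(word):
    """All archived pages whose text contains the word."""
    return sorted(text_index.get(word.lower(), set()))
```

Calling `url_search("http://example.org.uk")` lists the captures of one known site, while `full_text_search("election")` discovers pages the user did not know in advance.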
Web archive as historical document
SCHOLARLY FEEDBACK UK Web Archive
Scholarly feedback
User survey in 2012:
• To identify the scholarly value of the UK Web Archive, as perceived by researchers
• To obtain feedback on the access mechanisms currently offered by the archive
• To identify gaps in terms of content coverage
• To obtain insight into the reasons why researchers may or may not use the web archive
Methodology
• By IRN Research between May and June 2012
• 94 telephone interviews with previous users and non-users of the UK Web Archive – 74% are non-users
• A small group was asked to undertake a second phase, running searches and detailing each stage – documented as case studies
Interview sample by subject
Subject                         Non-users   Users
Arts and Humanities             33          10
Social Sciences                 27          11
Science, Technology, Medicine   4           3
Total                           64          24
Unclassified                    6           -
Scholarly value
Non users Users
Appreciate potential value but for many no relevant content
All understand the value as snapshot of selective sites at specific times
More special collections would increase value
Value would increase with more scientific and technical content
Access Mechanisms
Non users Users
Search tool easy to use but complicated for minority
Majority satisfied with presentation of results and ease of use of site
Most search / browse by special collections
More interest in visualisation tools
Search results unstructured and random
Need for improved data mining tools
More explanation about functions and features needed
Limited interest in visualisation tools
Additional functions and features
Non users Users
Improvements to search results pages
6-monthly updates
Interactive features Interactive features
Facility to suggest special collections
Too much text on home page
Content coverage
Non-users                           Users
More relevant special collections   More images, illustrations, rich media
More images, blogs                  Politics, contemporary British history
Too much missed from specific websites
Reason for using or not using UKWA
Non-users                      Users
Current content not relevant   Majority “very likely” to use again as there is content of interest
More information regarding selection policy
Another 39% “quite likely”
Less than a quarter “very likely” to use again
Why do researchers use / not use a web archive
Relevance of content determines whether researchers use it
Selective web archives please some but disappoint others
Use web archives for reference AND analytics
Still a significant portion of the research community yet to be reached
Access statistics of the UK Web Archive: 1 Jan – 28 Nov 2013
INTERACTING WITH SCHOLARS
Web Archives
Scholarly interactions: three types
Archive-driven
• Initiated by archival institutions
• Aimed at understanding scholarly requirements and improving archival practice
Scholar-driven
• Initiated by scholars with research interest related to web archiving or archived web material, including many “unknown” scholars
• A number of active research groups emerging: Netlab, WebArt and DMI, IHR, OII, ODU…
• Attention from the Web Science community
Project-based
• Various scale, scope and funding sources
• Developing web archiving or discipline specific solutions
• Researchers and archiving institutions as partners
Scholarly interactions: three phases
Phase 1: Building collections
• Scholars’ involvement in scoping collections, selecting and describing websites relevant to research interest
• Creation of specific, (narrow) topical collections, e.g. “Religion, politics and law since 2005” in the UK Web Archive
Phase 2: Formulating research questions
• Brain-storm sessions, workshops etc.
• Shift of focus to web archives in entirety
• The Analytical Access to the Domain Dark Archive (AADDA) project
  • 9 research proposals by arts, humanities and social sciences scholars
  • A prototype UI for analytical access
• Lack of awareness & baseline knowledge
• Time & resource consuming
• Challenging: you don’t know what you don’t know
Scholarly interactions: three phases (continued)
Phase 3: Independent use of web archives
• The desired “go-to” state, meeting common scholarly requirements
• Web archives do not become bottlenecks
• Base-line knowledge is self-explanatory, e.g. scope of the archive, its coverage and lacunae, how it was collected, and how a particular website was crawled
• Clear interfaces and jargon-free descriptions in alignment with scholarly requirements
• Open access
• Including provision of downloadable derived or secondary datasets, e.g. http://data.webarchive.org.uk/opendata/
How was the UK web linked in 1996?
• By Rainer Simon using UK Host-Level Link Graph (1996-2010) dataset
• Based on the 1996 portion: 58,842 hosts (nodes); 184,433 host-to-host links (edges)
• UK web as part of the global web
• Scalability issues with large dataset over time
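The node and edge counts above can be illustrated with a small sketch of how a host-level link graph might be built for a single year. The sample rows are invented, and the `year|source host|target host|link count` layout is an assumption made for this example, not a statement about the published dataset's file format. Rows are processed one at a time, which is one simple way to keep memory use tied to the size of a single year's graph and not the whole file.

```python
from collections import defaultdict

# Invented sample rows; the "year|source host|target host|link count" layout
# is an assumption for illustration only.
SAMPLE = """\
1996|a.ac.uk|b.co.uk|3
1996|a.ac.uk|c.org.uk|1
1996|b.co.uk|a.ac.uk|2
1997|c.org.uk|a.ac.uk|5
"""

def build_graph(lines, year):
    """Return (nodes, edges) for one year; edges maps (source, target) -> link count."""
    nodes, edges = set(), defaultdict(int)
    for line in lines:
        line = line.strip()
        if not line:
            continue
        row_year, source, target, count = line.split("|")
        if int(row_year) != year:
            continue
        nodes.update((source, target))
        edges[(source, target)] += int(count)
    return nodes, dict(edges)

def in_degree(edges):
    """Number of distinct hosts linking to each target host."""
    degree = defaultdict(int)
    for (_source, target) in edges:
        degree[target] += 1
    return dict(degree)

nodes, edges = build_graph(SAMPLE.splitlines(), 1996)
print(len(nodes), "hosts (nodes);", len(edges), "host-to-host links (edges)")
```

Passing an open file object in place of `SAMPLE.splitlines()` would stream a larger file in the same way.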
SCHOLARLY REQUIREMENTS Web Archives
Scholarship is changing
Blurred boundaries between scholarly sources and popular sources, even more so in the context of the web
Any source used for scholarly purposes can be defined as scholarly source
Scholarship is evolving: computationally engaged research gaining momentum, e.g. digital humanities
• Redrawing disciplinary boundaries
• Less text-based, multi-media driven
• Web playing an important role – will archives of the web do the same?
Scholarly use (of digital sources): key characteristics
Availability or accessibility
Text and paratext, defined by Gérard Genette as “accompaniments” that “surround or prolong the text”. Niels Brügger (2010) applied this concept to websites, arguing that it differs in form and function and plays a crucial role in the textual coherence of a website
Or context, in the usual sense of the word, e.g. out and in-links
Citation – backbone of research – requires persistent identification of sources, ideally retrievable
Sources relevant and specific to research question, without any arbitrarily imposed (national, geographical or format-related) boundaries
Quality
Flexibility / ability to apply digital methods for analytics and discovery of new knowledge
Requirements for web archives
Characteristics of scholarly use: Requirements for web archives
Availability: No access restriction, available online
Paratext or context: Access to collection policy and scope, crawl configuration, crawl log and any contextual information
Persistence and citability:
• Longevity of web archives
• Persistent identifiers
• Standards of citing archived websites
• Integration with bibliographical management tools (e.g. Zotero)
Collect / organise research corpus:
• Archiving of research corpora on demand
• Means to mix and match and reassemble corpora based on research questions
Quality:
• Archival version represents as much as possible the live website in completeness, intellectual content, behaviour and look and feel
• Curation
Applying digital methods:
• Multiple access methods including data analytics and visualisations
• Access to web archives as “big data”
Boundary & format-independent:
• Interlinked web archives
• Integration with other digital and printed holdings, e.g. books, ejournals
Unique Selling Points (USPs)
The live web as a fast evolving, interactive, multi-dimensional, open and participatory and interlinked collective system
Web archives as static, flat, exclusive, individual systems with boundaries and limitations
Focus on USPs – things that differentiate web archives from the live web
• Some web resources have vanished and web archives hold the only copies of these
• Periodic snapshots showing evolution and change of websites
• Web archives as comprehensive historical datasets, lending themselves to opportunities for analytical access
Linked web archives
Who has archived http://www.conservatives.com/?
Mementos service
Allow users to find archived web pages (mementos) in multiple web archives across the world (search based on aggregated metadata)
Exposes the Memento protocol, which adds a time dimension to HTTP, so the past web can be accessed in the same way as the current web
Uses the Memento aggregate TimeGate hosted by lanl.gov
Source code
Also developed the Find memento bookmarklet, finding archived versions of 404 webpages while browsing
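The time dimension that Memento adds to HTTP can be sketched as follows. Under RFC 7089 a client sends an `Accept-Datetime` request header to a TimeGate, and responses carry `Link` headers whose `rel` values include `original` and `memento`. Everything else below is an assumption made for illustration: the `TIMEGATE` base address is a placeholder rather than the real aggregator, appending the original URI to it is a common convention but not a requirement, and the sample `Link` header is made up. No network request is sent.

```python
import re
from datetime import datetime, timezone
from email.utils import format_datetime

# Placeholder TimeGate base URI (hypothetical, not the real aggregator address).
TIMEGATE = "http://timegate.example.org/timegate/"

def timegate_request(original_url, when):
    """Return (URI, headers) for a datetime-negotiation request.

    `when` must be a timezone-aware datetime; it is sent as an HTTP date in GMT.
    """
    stamp = format_datetime(when.astimezone(timezone.utc), usegmt=True)
    return TIMEGATE + original_url, {"Accept-Datetime": stamp}

LINK_RE = re.compile(r'<([^>]*)>((?:\s*;\s*\w+="[^"]*")*)')
PARAM_RE = re.compile(r'(\w+)="([^"]*)"')

def parse_links(header):
    """Split a Link header into dicts such as {'uri': ..., 'rel': ..., 'datetime': ...}."""
    links = []
    for uri, params in LINK_RE.findall(header):
        links.append({"uri": uri, **dict(PARAM_RE.findall(params))})
    return links

def mementos(header):
    """Keep only the links whose rel value includes the 'memento' relation."""
    return [link for link in parse_links(header)
            if "memento" in link.get("rel", "").split()]

# Made-up Link header of the kind a TimeGate response might carry.
SAMPLE_LINK = (
    '<http://www.conservatives.com/>; rel="original", '
    '<http://archive.example.org/20100501000000/http://www.conservatives.com/>; '
    'rel="first memento"; datetime="Sat, 01 May 2010 00:00:00 GMT"'
)

uri, headers = timegate_request(
    "http://www.conservatives.com/", datetime(2013, 11, 28, tzinfo=timezone.utc)
)
```

With a real TimeGate address, the returned URI and headers could be passed to any HTTP client; the redirect target and `Link` header of the response would then identify the archived copy closest to the requested date.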
EXTRA SLIDES FOR ILLUSTRATION
UK Web Archive
UK Web Archive: search interface
UK Web Archive: browse interface
Using N-gram for scholarly research
Courtesy of Dr Peter Webster, Institute of Historical Research, University of London
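As a rough idea of what sits behind an N-gram view, the sketch below counts how often a phrase occurs per crawl year, relative to all phrases of the same length in that year. It is a toy illustration under stated assumptions, not the UK Web Archive's implementation: the `pages` corpus is invented and tokenisation is a simple lower-cased word split.

```python
import re
from collections import Counter

# Invented mini-corpus: (crawl year, page text).
pages = [
    (2005, "Climate change debate in parliament"),
    (2005, "Local news and weather"),
    (2009, "Climate change summit: climate change policy"),
]

def ngrams(text, n):
    """List the n-word sequences in a text, lower-cased."""
    words = re.findall(r"[a-z']+", text.lower())
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]

def relative_frequency(pages, phrase):
    """Share of each year's n-grams that match the phrase."""
    n = len(phrase.split())
    hits, totals = Counter(), Counter()
    for year, text in pages:
        grams = ngrams(text, n)
        totals[year] += len(grams)
        hits[year] += grams.count(phrase.lower())
    return {year: hits[year] / totals[year] for year in sorted(totals) if totals[year]}

trend = relative_frequency(pages, "climate change")
```

Normalising by the yearly total matters because the amount of archived text varies from year to year, so raw counts alone would mostly reflect crawl size.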
UK Web Archive: visual browsing
RSS feed of latest instances
Replacing original search function on site
Showing the big picture
http://seadragon.com/view/wky