Web Archives: Interacting with Scholars

Helen Hockx-Yu

Head of Web Archiving

British Library

28 November 2013

OVERVIEW Access to Web Archives

Web Archiving initiatives worldwide

http://en.wikipedia.org/wiki/File:Map_of_Web_archiving_initiatives_worldwide.png

(Scholarly) use of web archives?

Restricted access, e.g. large scale national web archives referred to as “dark archives”

Archiving institutions’ focus on data collection, not usage

“Document-centric” access methods

Cannot produce replicas of original websites

No agreed way of calculating / benchmarking access statistics

Little evidence of scholarly use of web archives, making it difficult to understand requirements

Access methods

[Chart. Categories: URL search, Keyword search, Full-text search, Thematic Collections, Subject Browsing, Alphabetical browsing; values as shown: 26, 15, 11, 11, 9, 14]

International Internet Preservation Consortium (IIPC) – 46 members worldwide

“IIPC members’ archives” has 29 entries; 19 have full or partial online access, often permission-based

URL search as standard, universal access method - requires users to know the URL of the website they are looking for

For many archives, full-text search is the next challenge on the roadmap

Web archive as historical document

SCHOLARLY FEEDBACK UK Web Archive

Scholarly feedback

User Survey in 2012

• To identify scholarly value of the UK Web Archive, as perceived by researchers

• To obtain feedback on the access mechanisms currently offered by the archive

• To identify gaps in terms of content coverage

• To obtain insight into the reasons why researchers may or may not use the web archive

Methodology

By IRN Research between May and June 2012

94 telephone interviews with previous users and non-users of the UK Web Archive – 74% are non-users

A small group was asked to undertake a second phase, running searches and detailing each stage – documented as case studies

Interview sample by subject

Subject                        Non-users  Users
Arts and Humanities            33         10
Social Sciences                27         11
Science, Technology, Medicine
Total                          64         24
Unclassified                   6          -
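
The values for the Science, Technology and Medicine row did not survive on the slide, but they can be cross-checked against the totals. A minimal sketch, assuming the Total row sums the three subject rows and excludes the Unclassified row (an assumption, though it is consistent with the 94 interviews and 74% non-users reported under Methodology):

```python
# Figures copied from the interview sample table; the derived row is an
# inference from the totals, not a value stated on the slide.
non_users = {"Arts and Humanities": 33, "Social Sciences": 27}
users = {"Arts and Humanities": 10, "Social Sciences": 11}
total_non_users, total_users = 64, 24
unclassified_non_users = 6

# Whatever the two known subject rows do not account for is the missing row.
stm_non_users = total_non_users - sum(non_users.values())
stm_users = total_users - sum(users.values())

# Cross-check against the interview count and non-user share.
interviews = total_non_users + total_users + unclassified_non_users
non_user_share = (total_non_users + unclassified_non_users) / interviews

print(stm_non_users, stm_users)
print(interviews, round(non_user_share * 100))
```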

Scholarly value

Non-users Users

Appreciate potential value but for many no relevant content

All understand the value as snapshot of selective sites at specific times

More special collections would increase value

Value would increase with more scientific and technical content

Access Mechanisms

Non-users Users

Search tool easy to use but complicated for minority

Majority satisfied with presentation of results and ease of use of site

Most search / browse by special collections

More interest in visualisation tools

Search results unstructured and random

Need for improved data mining tools

More explanation about functions and features needed

Limited interest in visualisation tools

Additional functions and features

Non-users Users

Improvements to search results pages

6-monthly updates

Interactive features Interactive features

Facility to suggest special collections

Too much text on home page

Content coverage

Non-users                          Users
More relevant special collections  More images, illustrations, rich media
More images, blogs                 Politics, contemporary British history

Too much missed from specific websites

Reason for using or not using UKWA

Non-users                     Users
Current content not relevant  Majority “very likely” to use again as there is content of interest

More information regarding selection policy

Another 39% “quite likely”

Less than a quarter “very likely” to use again

Why do researchers use / not use a web archive

Relevance of content determines whether researchers use it

Selective web archives please some but disappoint others

Use web archives for reference AND analytics

Still a significant portion of the research community yet to be reached

Access statistics of the UK Web Archive: 1 Jan – 28 Nov 2013

INTERACTING WITH SCHOLARS

Web Archives

Scholarly interactions: three types

Archive-driven

• Initiated by archival institutions

• Aimed at understanding scholarly requirements and improving archival practice

Scholar-driven

• Initiated by scholars with research interest related to web archiving or archived web material, including many “unknown” scholars

• A number of active research groups emerging: Netlab, WebArt and DMI, IHR, OII, ODU…

• Attention from the Web Science community

Project-based

• Various scale, scope and funding sources

• Developing web archiving or discipline-specific solutions

• Researchers and archiving institutions as partners

Scholarly interactions: three phases

Phase 1: Building collections

• Scholars’ involvement in scoping collections, selecting and describing websites relevant to research interest

• Creation of specific (narrow) topical collections, e.g. “Religion, politics and law since 2005” in the UK Web Archive

Phase 2: Formulating research questions

• Brainstorming sessions, workshops etc.

• Shift of focus to web archives in their entirety

• The Analytical Access to the Domain Dark Archive (AADDA) project

• 9 research proposals by arts, humanities and social sciences scholars

• A prototype UI for analytical access

• Lack of awareness & baseline knowledge

• Time & resource consuming

• Challenging: you don’t know what you don’t know

Scholarly interactions: three phases

Phase 3: Independent use of web archives

• The desired “go-to” state, meeting common scholarly requirements

• Web archives do not become bottlenecks

• Baseline knowledge is self-explanatory, e.g. scope of the archive, its coverage and lacunae, how it was collected, and how a particular website was crawled

• Clear interfaces and jargon-free descriptions in alignment with scholarly requirements

• Open access, including provision of downloadable derived or secondary datasets, e.g. http://data.webarchive.org.uk/opendata/

How was the UK web linked in 1996?

• By Rainer Simon using UK Host-Level Link Graph (1996-2010) dataset

• Based on the 1996 portion: 58,842 hosts (nodes); 184,433 host-to-host links (edges)

• UK web as part of the global web

• Scalability issues with large dataset over time
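
A host-level link graph of this kind can be explored with very little code. The sketch below is illustrative only: the pipe-delimited `year|source|target|count` row layout and the sample hosts are assumptions made for the example, not the documented format of the dataset.

```python
from collections import defaultdict

# Hypothetical sample rows: year|source host|target host|link count.
SAMPLE_ROWS = [
    "1996|a.example.co.uk|b.example.co.uk|12",
    "1996|a.example.co.uk|c.example.ac.uk|3",
    "1996|b.example.co.uk|c.example.ac.uk|1",
    "1997|c.example.ac.uk|a.example.co.uk|7",
]

def build_graph(rows, year):
    """Return (nodes, edges) for one year; edges map (source, target) to count."""
    nodes, edges = set(), defaultdict(int)
    for row in rows:
        row_year, source, target, count = row.strip().split("|")
        if int(row_year) != year:
            continue  # keep only the requested yearly slice
        nodes.update((source, target))
        edges[(source, target)] += int(count)
    return nodes, dict(edges)

nodes, edges = build_graph(SAMPLE_ROWS, 1996)
print(len(nodes), len(edges))

# Average host-to-host links per host for the 1996 figures quoted above.
print(round(184_433 / 58_842, 2))
```

Filtering by year before building the graph keeps each slice small, which is one simple way to sidestep the scalability issue noted above when only one year is of interest.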

SCHOLARLY REQUIREMENTS Web Archives

Scholarship is changing

Blurred boundaries between scholarly sources and popular sources, even more so in the context of the web

Any source used for scholarly purposes can be defined as scholarly source

Scholarship is evolving:

• Computationally engaged research gaining momentum, e.g. digital humanities

• Redrawing disciplinary boundaries

• Less text-based, multi-media driven

• Web playing an important role – will archives of the web do the same?

Scholarly use (of digital sources): key characteristics

Availability or accessibility

Text and paratext, the latter defined by Gérard Genette as “accompaniments” that “surround or prolong the text”. Niels Brügger (2010) applied this concept to websites, arguing that it is different in form and function and plays a crucial role in the textual coherence of a website

Or context, in the usual sense of the word, e.g. out and in-links

Citation – backbone of research – requires persistent identification of sources, ideally retrievable

Sources relevant and specific to research question, without any arbitrarily imposed (national, geographical or format-related) boundaries

Quality

Flexibility / ability to apply digital methods for analytics and discovery of new knowledge

Requirements for web archives

Characteristics of scholarly use and the corresponding requirements for web archives

Availability

- No access restriction, available online

Paratext or context

- Access to collection policy and scope, crawl configuration, crawl log and any contextual information

Persistence and citability

- Longevity of web archives

- Persistent identifiers

- Standards of citing archived websites

- Integration with bibliographical management tools (e.g. Zotero)

Collect / organise research corpus

- Archiving of research corpora on demand

- Means to mix and match and reassemble corpora based on research questions

Quality

- Archival version represents as much as possible the live website in completeness, intellectual content, behaviour and look and feel

- Curation

Applying digital methods

- Multiple access methods including data analytics and visualisations

- Access to web archives as “big data”

Boundary & format-independent

- Interlinked web archives

- Integration with other digital and printed holdings, e.g. books, ejournals

Unique Selling Points (USPs)

The live web as a fast evolving, interactive, multi-dimensional, open and participatory and interlinked collective system

Web archives as static, flat, exclusive, individual systems with boundaries and limitations

Focus on USPs – things that differentiate web archives from the live web

• Some web resources have vanished and web archives hold the only copies of these

• Periodic snapshots showing evolution and change of websites

• Web archives as comprehensive historical datasets, lending themselves to opportunities for analytical access

Linked web archives

Who has archived http://www.conservatives.com/?

Mementos service

Allow users to find archived web pages (mementos) in multiple web archives across the world (search based on aggregated metadata)

Exposes the Memento protocol, which adds a time dimension to HTTP so that the past web can be accessed in the same way as the current web

Uses the Memento aggregate TimeGate hosted by lanl.gov

Source code

Also developed the Find memento bookmarklet, finding archived versions of 404 webpages while browsing
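
In protocol terms, a client asks a TimeGate for a URL as it existed at a given moment by sending an `Accept-Datetime` request header (defined in RFC 7089). The sketch below only builds such a request and does not send it; the TimeGate base URL is a placeholder, not the address of the aggregator used by the service, and the function name is made up for the example.

```python
from datetime import datetime, timezone
from email.utils import format_datetime

# Placeholder TimeGate endpoint (hypothetical); substitute a real one.
TIMEGATE_BASE = "https://timegate.example.org/timegate/"

def build_timegate_request(target_url, when):
    """Return (request URL, headers) for a Memento datetime negotiation."""
    # Accept-Datetime uses the HTTP date format and must be expressed in GMT.
    http_date = format_datetime(when.astimezone(timezone.utc), usegmt=True)
    return TIMEGATE_BASE + target_url, {"Accept-Datetime": http_date}

url, headers = build_timegate_request(
    "http://www.conservatives.com/",
    datetime(2010, 5, 6, 12, 0, tzinfo=timezone.utc),
)
print(url)
print(headers["Accept-Datetime"])
```

The returned URL and headers can be passed to any HTTP client. A compliant TimeGate would typically answer with a redirect to the closest memento, whose response carries a `Memento-Datetime` header.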

EXTRA SLIDES FOR ILLUSTRATION

UK Web Archive

UK Web Archive: search interface

UK Web Archive: browse interface

Using N-gram for scholarly research

Courtesy of Dr Peter Webster, Institute of Historical Research, University of London
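
For readers unfamiliar with the technique, an n-gram search counts how often a word sequence occurs in the archive per time slice, so that trends can be plotted. A minimal sketch over a made-up, in-memory corpus (the texts and years are invented for illustration and are not drawn from the UK Web Archive):

```python
from collections import Counter

# Invented sample documents keyed by crawl year (illustrative only).
CORPUS = {
    2008: ["the credit crunch hits the high street", "credit crunch fears grow"],
    2010: ["after the credit crunch", "general election results"],
}

def ngrams(text, n):
    """Yield every run of n consecutive lower-cased words in text."""
    words = text.lower().split()
    for i in range(len(words) - n + 1):
        yield " ".join(words[i:i + n])

def yearly_counts(corpus, phrase):
    """Count occurrences of phrase per year, treating it as an n-gram."""
    n = len(phrase.split())
    return {
        year: Counter(g for doc in docs for g in ngrams(doc, n))[phrase.lower()]
        for year, docs in corpus.items()
    }

print(yearly_counts(CORPUS, "credit crunch"))
```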

UK Web Archive: visual browsing

RSS feed of latest instances

Replacing original search function on site

Showing the big picture

http://seadragon.com/view/wky
