Prospects and pitfalls in using web archives for research

A new class of primary source? Prospects and pitfalls in using web archives for research

Dr Peter Webster, Webster Research and Consulting, @pj_webster


A lost archive?


The web its own archive?

Open UK Web Archive 2004-13 comparison. @anjacks0n, http://britishlibrary.typepad.co.uk/webarchive/2014/10/what-is-still-on-the-web-after-10-years-of-archiving-.html

Disappearing predictably

Disappearing unpredictably

... but safe and sound in the archive

Reasons to care about web archiving

• education and research

• enforcement of the law

• public accountability

Three archives for the UK

Archive                 Temporal scope  Content scope            Access
Open UKWA               2004-present    Selective (14.7k)        Online
Legal Deposit UKWA      2013-present    Comprehensive (for UK)   Onsite
JISC UK Domain Dataset  1996-2013       Comprehensive (for .uk)  Index only

JISC UK Web Domain Dataset (1996-2013)

• copy of Internet Archive holdings for .uk

• bought by JISC, held by British Library

• 60TB of data

• no direct access to content

• prototype search at webarchive.org.uk/shine

• derived datasets in public domain

Web archives for NI and RoI

Archive             Temporal scope  Content scope            Access
NLI Web Archive     2011-present    Selective (542)          Online
PRONI Web Archive   2010-present    Selective (115)          Online
Legal Deposit UKWA  2013-present    Comprehensive (for UK!)  Onsite (TCD)

Ways to use the archived web

• URL search -> single page

• Full-text search -> single page

• Visualisation -> trend -> page

Changing aesthetics

gov.ie, captured by archive.org, 15 August 2000

Vanished content

southtippcoco.ie, captured by archive.org, 4 Jan 2014

Visualising trends: Ngram

http://www.webarchive.org.uk/shine/graph

Ways to use the archived web

• URL search -> single page

• Full-text search -> single page

• Visualisation -> trend -> page

• Direct access to WARC

• Derived datasets

• API access

Derived datasets from the BL

From JISC UK Web Domain Dataset (1996-2010)

• File format profile

• Geo-index

• Crawled URL Index (CDX)

• Host Link Graph

Public domain at data.webarchive.org.uk

Creationism?

• non-evolutionary account of human origins

• modern

• a long history

• a feature of some parts of evangelicalism

• (anti-evolutionism, Intelligent Design)

The creationist web: three questions

A justified conspiracy theory about marginalisation of creationist voices?

A real danger or a moral panic (Truth in Science)?

The web as friend of the marginalised opinion?

http://peterwebster.me/2014/11/18/reading-creationism-in-the-web-archive/

UK Host Link Graph (1996-2010)

2008 | newsimg.bbc.co.uk | youtube.com | 45

2008 | archbishopofyork.org.uk | flickr.com | 1

2002 | secularism.org.uk | geocities.com | 1

Public domain at: data.webarchive.org.uk
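The three sample rows above appear to follow a "year | linking host | linked-to host | count" layout. A minimal Python sketch for reading rows of that shape follows; the field order and the pipe separator are assumptions taken from the slide, the layout of the published file may differ, and `LinkRow` and `parse_row` are hypothetical names introduced only for illustration.

```python
from typing import NamedTuple


class LinkRow(NamedTuple):
    year: int
    source_host: str  # host that contains the links (assumed second field)
    target_host: str  # host being linked to (assumed third field)
    count: int        # count given in the last field


def parse_row(line: str) -> LinkRow:
    """Parse one pipe-separated row; the field order is assumed from the slide."""
    year, source, target, count = (field.strip() for field in line.split("|"))
    return LinkRow(int(year), source, target, int(count))


# The three rows shown on the slide.
rows = [
    parse_row("2008 | newsimg.bbc.co.uk | youtube.com | 45"),
    parse_row("2008 | archbishopofyork.org.uk | flickr.com | 1"),
    parse_row("2002 | secularism.org.uk | geocities.com | 1"),
]

for row in rows:
    print(row.year, row.source_host, "->", row.target_host, row.count)
```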

Approach

• selection of key UK creationist sites

• extraction of all unique inbound referring hosts for 1996-2010

• inspection and classification
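The first two steps of the approach can be sketched in Python as below. This is an illustration under stated assumptions, not the code used for the study: the rows are assumed to be "year | linking host | linked-to host | count", the host names are invented placeholders, and `unique_inbound_hosts` is a hypothetical helper. The third step, inspection and classification, is manual and is not modelled.

```python
from collections import defaultdict

# Invented sample rows in the assumed layout; none of these hosts come
# from the real dataset.
SAMPLE = """\
2005 | church.example.org.uk | creationist-a.example.org.uk | 3
2007 | blog.example.co.uk | creationist-a.example.org.uk | 1
2007 | church.example.org.uk | creationist-a.example.org.uk | 2
2008 | school.example.sch.uk | creationist-b.example.org.uk | 1
2008 | news.example.co.uk | unrelated.example.com | 9
2012 | late.example.co.uk | creationist-a.example.org.uk | 4
"""


def unique_inbound_hosts(lines, targets, first_year=1996, last_year=2010):
    """Map each selected target host to the set of hosts linking to it."""
    inbound = defaultdict(set)
    for line in lines:
        if not line.strip():
            continue  # skip blank lines
        year, source, target, _count = (f.strip() for f in line.split("|"))
        if target in targets and first_year <= int(year) <= last_year:
            inbound[target].add(source)
    return dict(inbound)


# Step 1: the selected sites (placeholders here).
selected = {"creationist-a.example.org.uk", "creationist-b.example.org.uk"}

# Step 2: unique inbound referring hosts within 1996-2010.
result = unique_inbound_hosts(SAMPLE.splitlines(), selected)
for target, hosts in sorted(result.items()):
    print(target, len(hosts), sorted(hosts))
```

Using a set per target means a host that links in several years is counted once, which matches the "unique inbound referring hosts" wording.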

Caveats on method

• partial nature of the dataset

• benchmarking of absolute numbers

• selective sample

• what does a link mean, anyway?

• not looking at number of linking resources per host

Truth in Science: how significant?

• only 46 unique inbound hosts

• … of which many were other creationist or secularist sites

• two churches, one school

• fewer in 2010 than 2007
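A year-on-year comparison like the last bullet needs the number of unique inbound hosts per year for a single site. A hedged Python sketch of that count follows, again assuming the "year | linking host | linked-to host | count" layout. The rows and host names are made up for illustration and are not the figures behind the talk, and `inbound_hosts_per_year` is a hypothetical helper.

```python
from collections import defaultdict

# Made-up rows in the assumed layout; hosts and numbers are placeholders,
# not results from the talk.
SAMPLE_ROWS = [
    "2007 | a.example.org.uk | target.example.org.uk | 2",
    "2007 | b.example.co.uk | target.example.org.uk | 1",
    "2007 | c.example.ac.uk | target.example.org.uk | 5",
    "2007 | a.example.org.uk | other.example.com | 7",
    "2010 | a.example.org.uk | target.example.org.uk | 1",
    "2010 | d.example.co.uk | target.example.org.uk | 1",
]


def inbound_hosts_per_year(lines, target):
    """Count the distinct hosts linking to `target` in each year."""
    per_year = defaultdict(set)
    for line in lines:
        year, source, linked, _count = (f.strip() for f in line.split("|"))
        if linked == target:
            per_year[int(year)].add(source)
    return {year: len(hosts) for year, hosts in sorted(per_year.items())}


counts = inbound_hosts_per_year(SAMPLE_ROWS, "target.example.org.uk")
print(counts)
```

The caveats listed earlier still apply: this counts hosts, not the number of linking resources per host, and absolute numbers depend on how complete the crawl was in each year.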

Conclusions

• a utopian dream unfulfilled

• a genuine moral panic

• a justified conspiracy theory

Next steps (1)

1. NI the 'creationism capital of Europe'? Analysis of:

• links from GB organisations to NI creationists

• links from NI to RoW

2. What about creationism in .ie?

Next steps (2)

Project: EU National Web Spheres

• part of resaw.eu

• investigating the nature of a national web domain

• ... including the interlinking between them

• case study I: Anglican & Presbyterian churches in Ireland, north and south

Web Archives for Historians

@HistWebArchives, http://webarchivehistorians.org/

Questions?

Peter Webster

[email protected]

@pj_webster

peterwebster.me

websterresearchconsulting.com