TimeMaps: Metadata for Memento

TimeMaps: Metadata for Memento. GSLIS Metadata Group, UIUC, 14th July 2010. Herbert Van de Sompel, Robert Sanderson, Michael L. Nelson, Lyudmila Balakireva, Scott Ainsworth, Harihar Shankar. http://www.mementoweb.org/ Memento is partially funded by the Library of Congress.

description

Abstract: Dr. Robert Sanderson, a scientist at Los Alamos National Laboratory and visiting scholar at the Graduate School of Library & Information Science for 2009-2010, will present his work on Memento, a technical framework for adding a temporal dimension to web browsing to allow users to access web resources not only as they exist now but also as they existed in the past. After describing the overall system, the talk will focus on the metadata requirements for the TimeMap API that compliant web archives can expose in order to publish their holdings. This talk is sponsored by the GSLIS Metadata Round Table.

Transcript of TimeMaps: Metadata for Memento

Page 1: TimeMaps: Metadata for Memento

TimeMaps: Metadata for Memento GSLIS Metadata Group, UIUC, 14th July 2010

Herbert Van de Sompel Robert Sanderson Michael L. Nelson

Lyudmila Balakireva Scott Ainsworth Harihar Shankar

http://www.mementoweb.org/

Memento is partially funded by the Library of Congress

Page 2: TimeMaps: Metadata for Memento

Memento wants to make Navigating the Web’s Past Easy

http://www.mementoweb.org/
http://groups.google.com/group/memento-dev

•  Problem Statement

•  Memento Solution
    o  Navigation not Search
    o  API for Web Archives

•  Memento Ontology for TimeMaps

Page 3: TimeMaps: Metadata for Memento

Web Resources have Different Representations over Time

Page 4: TimeMaps: Metadata for Memento

Thankfully Archived Representations Exist

Page 5: TimeMaps: Metadata for Memento

3 Issues with Current Access to Archives

1.  Access is via a new URI, unknown to the user.

2.  People do not like to search for archived resources, and there is no automated method.

3.  Navigation in the past is inconsistent:
    1.  Stuck in a single, necessarily incomplete archive
    2.  Or, if links are not rewritten, URIs lead back to the present

Comment on Popular Science article: http://bit.ly/bWr5gP

Page 6: TimeMaps: Metadata for Memento

1. Representations Archived at a Different URI

http://web.archive.org/web/20010911203610/http://www.cnn.com/ is the archived resource for http://cnn.com (Sep 11 2001, 20:36:10 UTC)

http://en.wikipedia.org/w/index.php?title=September_11_attacks&oldid=282333 is the archived resource for http://en.wikipedia.org/wiki/September_11_attacks (Dec 20 2001, 4:51:00 UTC)

Page 7: TimeMaps: Metadata for Memento

2. Searching is Cumbersome

http://web.archive.org/web/*/http://cnn.com/

http://en.wikipedia.org/w/index.php?title=September_11_attacks&action=history

Page 8: TimeMaps: Metadata for Memento

3. Inconsistent Navigation (Archives Incomplete)

http://web.archive.org/web/20010911203610/http://www.cnn.com/ is the archived resource for http://cnn.com (Sep 11 2001, 20:36:10 UTC)

Following the SPACE link: http://web.archive.org/web/20010911213855/www.cnn.com/TECH/space/ (Sep 11 2001, 21:38:55 UTC)

Page 9: TimeMaps: Metadata for Memento

3. Inconsistent Navigation (Can't Stay in Past)

http://en.wikipedia.org/w/index.php?title=September_11_attacks&oldid=282333 is the archived resource for http://en.wikipedia.org/wiki/September_11_attacks (Dec 20 2001, 4:51:00 UTC)

Following the Pentagon link: http://en.wikipedia.org/wiki/The_Pentagon (current)

Page 10: TimeMaps: Metadata for Memento

Past and Current Web are Not Integrated

Page 11: TimeMaps: Metadata for Memento

The Web without a Time Dimension

Need to use a different URI to access archived versions of a resource and its current version

Page 12: TimeMaps: Metadata for Memento

The Web with Time Dimension added by Memento

With Memento, use the URI of the current version to access archived versions, qualify it with a datetime, and magically arrive at the correct location.

Page 13: TimeMaps: Metadata for Memento

The Memento Solution

There are two components to the Memento Solution:

•  Component 1: Navigation to an archived resource via its original resource, by leveraging content negotiation.

•  Component 2: A discovery API for archives that enables retrieving a list of all archived versions of a resource for a given URI.

Page 14: TimeMaps: Metadata for Memento

Content Negotiation in Time

•  Many systems support content negotiation for file format
    o  Your client by default asks for HTML and gets HTML
    o  But it could get PDF via the same URI

•  Memento proposes a new dimension for content negotiation: Time
    o  Your client by default asks for the current time, and gets it
    o  But it could get an older version via the same URI

•  Can be accomplished with only one new HTTP header in each direction:
    o  Accept-Datetime: request for a particular timestamp
    o  Content-Datetime: the returned content’s timestamp
    o  These exactly mirror existing headers for Format, Language, etc.
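
As a rough illustration of the two headers, the sketch below formats a requested time as an HTTP date for Accept-Datetime and parses a Content-Datetime value back into a datetime. The header names are the ones given in this talk; the function names are hypothetical, and the use of Python's standard `email.utils` helpers is just one convenient way to handle HTTP dates.

```python
from datetime import datetime, timezone
from email.utils import format_datetime, parsedate_to_datetime


def accept_datetime_headers(when: datetime) -> dict:
    """Build request headers asking for the version nearest to `when`.

    `when` is expected to be a timezone-aware datetime.
    """
    # HTTP dates are expressed in GMT, so normalise to UTC first.
    when_utc = when.astimezone(timezone.utc)
    return {"Accept-Datetime": format_datetime(when_utc, usegmt=True)}


def content_datetime(response_headers: dict) -> datetime:
    """Read the timestamp of the returned representation."""
    return parsedate_to_datetime(response_headers["Content-Datetime"])


request_headers = accept_datetime_headers(
    datetime(2001, 9, 11, 20, 35, tzinfo=timezone.utc)
)
print(request_headers["Accept-Datetime"])  # Tue, 11 Sep 2001 20:35:00 GMT
```

The same dictionary can be passed to any HTTP client as extra request headers; nothing else about the request changes.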

Page 15: TimeMaps: Metadata for Memento

[Figure: www.cnn.com (current) alongside three archived versions at web.archive.org: Apr 10 2001, 21:39:30 UTC; Aug 15 2004, 08:45:27 UTC; Aug 15 2007, 19:21:58 UTC]

Page 16: TimeMaps: Metadata for Memento

[Figure: the www.cnn.com diagram, with the current version labeled Original Resource and the three archived versions at web.archive.org (Apr 10 2001, Aug 15 2004, Aug 15 2007) labeled Mementos]

Page 17: TimeMaps: Metadata for Memento

[Figure: the www.cnn.com diagram with Original Resource and Mementos, plus a question mark: how does a client get from the Original Resource to its Mementos?]

Page 18: TimeMaps: Metadata for Memento

[Figure: the www.cnn.com diagram with Original Resource and Mementos, now with a TimeGate added]

Page 19: TimeMaps: Metadata for Memento

[Figure: the www.cnn.com diagram with Original Resource, TimeGate, and Mementos; content negotiation (conneg) with the TimeGate leads to the Mementos]

Page 20: TimeMaps: Metadata for Memento

[Figure: the www.cnn.com diagram with Original Resource, TimeGate, and Mementos; conneg with the TimeGate leads to the Mementos, and Link headers connect the resources]

Page 21: TimeMaps: Metadata for Memento

[Figure: the same arrangement for wikipedia.org: Original Resource, TimeGate, and Mementos, with conneg with the TimeGate leading to the Mementos, and Link headers connecting the resources]

Page 22: TimeMaps: Metadata for Memento

The Web with Time Dimension added by Memento

Page 23: TimeMaps: Metadata for Memento

The Memento Solution

•  Component 2: A discovery API for archives that allows requesting a list of all archived versions held for a resource with a given URI.

Page 24: TimeMaps: Metadata for Memento

Why an API?

•  Mementos for any given resource are distributed across archives. (What? Not just the Internet Archive?!)

•  To get a correct picture of the available Mementos, several archives need to be consulted.

•  This can be done by distributed search (slow) or by consulting an aggregator.

•  Aggregators and other services need a machine-readable description of archives' holdings to select the appropriate Memento for a request:
    o  Closest in time
    o  Most reliable representation
    o  Fastest responding
    o  (etc.)
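
To make the "closest in time" criterion concrete, here is a minimal sketch of how an aggregator might pick one Memento from the holdings of several archives. The `Holding` type and `closest_in_time` function are hypothetical names, the capture times are sample values, and treating them as UTC is an assumption; a real aggregator would presumably weigh the other criteria as well.

```python
from datetime import datetime, timezone
from typing import Iterable, NamedTuple


class Holding(NamedTuple):
    """One archived copy, as an aggregator might record it (hypothetical)."""
    archive: str
    captured: datetime


def closest_in_time(holdings: Iterable[Holding], requested: datetime) -> Holding:
    """Return the holding whose capture time is nearest to `requested`."""
    return min(holdings, key=lambda h: abs(h.captured - requested))


def utc(*parts: int) -> datetime:
    return datetime(*parts, tzinfo=timezone.utc)


# Sample holdings across three archives.
holdings = [
    Holding("WebCitation", utc(2009, 5, 13, 12, 28, 39)),
    Holding("Archive-It", utc(2009, 5, 14, 1, 18, 11)),
    Holding("BL Archive", utc(2009, 5, 14, 7, 12, 45)),
]

best = closest_in_time(holdings, utc(2009, 5, 14, 6, 0, 0))
print(best.archive)  # BL Archive
```

Without a machine-readable list of each archive's holdings, this comparison would require querying every archive separately for every request.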

Page 25: TimeMaps: Metadata for Memento

WebCitation 13 May 2009 12:28:39

Page 26: TimeMaps: Metadata for Memento

•  WebCitation: 13 May 2009 12:28:39
•  Archive-It: 14 May 2009 01:18:11

Page 27: TimeMaps: Metadata for Memento

•  WebCitation: 13 May 2009 12:28:39
•  Archive-It: 14 May 2009 01:18:11
•  BL Archive: 14 May 2009 07:12:45

Page 28: TimeMaps: Metadata for Memento

•  WebCitation: 13 May 2009 12:28:39
•  Archive-It: 14 May 2009 01:18:11
•  BL Archive: 14 May 2009 07:12:45
•  Dracos: 14 May 2009 13:00:00

Page 29: TimeMaps: Metadata for Memento

•  WebCitation: 13 May 2009 12:28:39
•  Archive-It: 14 May 2009 01:18:11
•  BL Archive: 14 May 2009 07:12:45
•  Dracos: 14 May 2009 13:00:00
•  TNA: 14 May 2009 18:21:32

And no Internet Archive…

Page 30: TimeMaps: Metadata for Memento

TimeMaps

•  At most basic: a list of the URIs of Mementos and their datetimes
•  Expressed as Linked Data; a profile of OAI-ORE Resource Maps
•  Linked from the TimeGate and from each Memento via a Link header

Page 31: TimeMaps: Metadata for Memento

Basic ORE Model

Aggregation (Aggr) is a set of web resources (R-1 to R-3), described in RDF or Atom by a Resource Map (ReM).

Page 32: TimeMaps: Metadata for Memento

TimeBundles

Resources of Interest in Memento:
•  Original Resource
•  TimeGate
•  Mementos

Page 33: TimeMaps: Metadata for Memento

TimeGates

•  Period(s) that the TimeGate covers
•  Which resource it is a TimeGate for
•  mem:TimeSpan is used, as a TimeGate can cover multiple distinct periods

Page 34: TimeMaps: Metadata for Memento

Mementos

•  Time period: valid for or observed over; number of observations
•  Metadata: size, format, etc. (will come back to the "etc.")
•  Which resource it is a Memento for

Page 35: TimeMaps: Metadata for Memento

Serializations

•  RDF/XML
    o  Good for XML parsers

•  Turtle, N3 and related
    o  Good for graph parsers

•  RDFa
    o  Good for web browsers

•  Atom
    o  Good for alerting, feed readers, etc. (but still embeds RDF)

•  New: Link Header format
    o  Good for real-time applications
    o  Smaller file size (just the facts, ma'am)
    o  Easy to implement with existing Link header parsers
    o  Servers need to produce this format anyway, so it offers a non-RDF way out
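
A Link Header style TimeMap could look roughly like the output of the sketch below, which writes one link entry per Memento with its datetime. The function name is hypothetical, and `rel="memento"` for the individual entries is an assumption; the talk's own examples show only `original`, `timebundle`, `timegate`, and the first/last/prev/next relations.

```python
from datetime import datetime, timezone
from email.utils import format_datetime
from typing import Iterable, Tuple


def link_format_timemap(original: str,
                        mementos: Iterable[Tuple[str, datetime]]) -> str:
    """Serialize (uri, datetime) pairs as comma-separated link entries."""
    entries = ['<%s>; rel="original"' % original]
    # Sort chronologically so consumers can scan the list in time order.
    for uri, when in sorted(mementos, key=lambda pair: pair[1]):
        stamp = format_datetime(when.astimezone(timezone.utc), usegmt=True)
        # rel="memento" is an assumed relation name for this illustration.
        entries.append('<%s>; rel="memento"; datetime="%s"' % (uri, stamp))
    return ",\n".join(entries)


timemap = link_format_timemap(
    "http://cnn.com/",
    [
        ("http://web.archive.org/web/20080708093433/http://www.cnn.com",
         datetime(2008, 7, 8, 9, 34, 33, tzinfo=timezone.utc)),
        ("http://web.archive.org/web/20000915112826/http://www.cnn.com",
         datetime(2000, 9, 15, 11, 28, 26, tzinfo=timezone.utc)),
    ],
)
print(timemap)
```

Each entry carries only a URI, a relation, and a datetime, which is where the smaller file size comes from compared with the RDF serializations.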

Page 36: TimeMaps: Metadata for Memento

Use Case: Aggregator using TimeMaps

Page 37: TimeMaps: Metadata for Memento

[Figure: Original Resource, TimeGate, and Mementos; conneg with the TimeGate leads to the Mementos, and Link headers connect the resources]

Page 38: TimeMaps: Metadata for Memento

Metadata Discussion Points

1.  What metadata is necessary to determine the most appropriate copy?
    •  Distance to requested time most important
    •  Quality of representation?
    •  Usage statistics for Original Resource? For Memento?
    •  User tagging of Memento for quality?
    •  Archive response speed?
    •  Need to know more information from user preferences?

2.  What other metadata is useful and available?
    •  Crawling archives have limited information
    •  CMS systems have much more
    •  User tags, comments, annotations
    •  Semantic information about content, e.g. title, author, subject
    •  Distribution of changes over time

Page 39: TimeMaps: Metadata for Memento

Metadata Discussion Points

3.  What metadata is necessary for inter-archive synchronization?
    •  Deduplication information: digests, request headers
    •  "Significant Change" factors
    •  Crawler settings: respect no-cache, robots.txt, etc.

4.  What metadata can be generated by other services?
    •  Open World Model: Anyone can say anything about anything
    •  Technical metadata easy (MIX for images, etc.)
    •  Time Series Analysis interesting (techtales.org)
    •  Machine Learning based approaches?
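
As one possible reading of the "digests" point under deduplication, the sketch below groups captures from different archives by a content digest, so byte-identical copies end up together. The choice of SHA-256, the function name, and the example.org URIs are all assumptions made for illustration; the talk does not specify a digest algorithm.

```python
import hashlib
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def group_by_digest(captures: Iterable[Tuple[str, bytes]]) -> Dict[str, List[str]]:
    """Map each content digest to the Memento URIs that share it."""
    groups: Dict[str, List[str]] = defaultdict(list)
    for uri, body in captures:
        # SHA-256 is an assumed choice; any agreed-upon digest would do.
        groups[hashlib.sha256(body).hexdigest()].append(uri)
    return dict(groups)


# Hypothetical captures: two archives hold identical bytes, one differs.
captures = [
    ("http://archive-a.example.org/m/1", b"<html>same page</html>"),
    ("http://archive-b.example.org/m/7", b"<html>same page</html>"),
    ("http://archive-b.example.org/m/8", b"<html>changed page</html>"),
]

duplicates = [uris for uris in group_by_digest(captures).values() if len(uris) > 1]
print(duplicates)
```

A byte-level digest only catches exact copies, which is presumably why "Significant Change" factors appear as a separate discussion point.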

Page 40: TimeMaps: Metadata for Memento

Thank You

Rob Sanderson:
•  [email protected]
•  [email protected]

This presentation:
•  http://www.slideshare.net/azaroth42/xxx

Memento:
•  http://www.mementoweb.org/
•  http://groups.google.com/group/memento-dev

MementoFox:
•  https://addons.mozilla.com/en-US/firefox/addon/100298 aka: http://bit.ly/memfox

Memento Enables Navigating the Past Web

Page 41: TimeMaps: Metadata for Memento

Discussion Questions

1.  What metadata is necessary to determine the most appropriate copy?

2.  What other metadata is useful and available?

3.  What metadata is necessary for inter-archive synchronization?

4.  What metadata can be generated by other services?

Page 42: TimeMaps: Metadata for Memento

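Appendix: Memento HTTP Flow

1.  Request to R: HEAD R, (Accept-Datetime)
2.  Response from R: 200, Link G
3.  Request to G: GET G, Accept-Datetime
4.  Response from G: 302 to M, Vary, TCN, Link R,B,M
5.  Request to M: GET M, (Accept-Datetime)
6.  Response from M: 200, Content-Datetime, Link R,B,M

(R = Original Resource, G = TimeGate, M = Memento, B = TimeBundle)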

Page 43: TimeMaps: Metadata for Memento

[Figure: Memento HTTP Flow diagram, repeated from Page 42]

Page 44: TimeMaps: Metadata for Memento

Memento HTTP Flow: URI-R

HEAD R, (Accept-Datetime)

    HEAD http://cnn.com/ HTTP/1.1
    Host: cnn.com
    Accept-Datetime: Tue, 11 Sep 2001 20:35:00 GMT
    Connection: close

Page 45: TimeMaps: Metadata for Memento

[Figure: Memento HTTP Flow diagram, repeated from Page 42]

Page 46: TimeMaps: Metadata for Memento

Memento HTTP Flow: Success – URI-R

Link G

    HTTP/1.1 200 OK
    Date: Thu, 21 Jan 2010 00:02:12 GMT
    Server: Apache
    Link: <http://web.archive.org/web/timegate/http://cnn.com>; rel="timegate"
    Content-Length: 255
    Connection: close
    Content-Type: text/html; charset=iso-8859-1

Page 47: TimeMaps: Metadata for Memento

[Figure: Memento HTTP Flow diagram, repeated from Page 42]

Page 48: TimeMaps: Metadata for Memento

Memento HTTP Flow

GET G, Accept-Datetime

    GET http://web.archive.org/web/timegate/http://cnn.com HTTP/1.1
    Host: web.archive.org
    Accept-Datetime: Tue, 11 Sep 2001 20:35:00 GMT
    Connection: close

Page 49: TimeMaps: Metadata for Memento

[Figure: Memento HTTP Flow diagram, repeated from Page 42]

Page 50: TimeMaps: Metadata for Memento

Memento HTTP Flow

302 M, Vary, Link R,B,M

    HTTP/1.1 302 Found
    Date: Thu, 21 Jan 2010 00:06:50 GMT
    Server: Apache
    TCN: choice
    Vary: negotiate, accept-datetime
    Location: http://web.archive.org/web/20010911203610/http://www.cnn.com
    Link: <http://cnn.com/>; rel="original",
      <http://web.archive.org/web/timebundle/http://cnn.com/>; rel="timebundle",
      <http://web.archive.org/web/20000915112826/http://www.cnn.com>; rel="first-memento"; datetime="Fri, 15 Sep 2000 11:28:26 GMT",
      <http://web.archive.org/web/20080708093433/http://www.cnn.com>; rel="last-memento"; datetime="Tue, 08 Jul 2008 09:34:33 GMT",
      <http://web.archive.org/web/20010911203610/http://www.cnn.com>; rel="prev-memento"; datetime="Tue, 11 Sep 2001 20:30:51 GMT",
      <http://web.archive.org/web/20010911203610/http://www.cnn.com>; rel="next-memento"; datetime="Tue, 11 Sep 2001 20:47:33 GMT"
    Content-Length: 0
    Connection: close
    Content-Type: text/plain; charset=UTF-8
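
A client has to pull the individual relations out of a Link header like the one in this 302 response. The sketch below is one simple way to do that with regular expressions; the function name is hypothetical, and it assumes straight double quotes around parameter values and a single `rel` per link, which holds for the examples in this talk but not for every valid Link header.

```python
import re
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime

# One link: <uri> followed by zero or more ; name="value" parameters.
LINK_RE = re.compile(r'<([^>]*)>((?:\s*;\s*[\w-]+="[^"]*")*)')
PARAM_RE = re.compile(r'([\w-]+)="([^"]*)"')


def parse_link_header(value: str) -> dict:
    """Return {rel: (uri, datetime or None)} for each link in the header."""
    links = {}
    for uri, params in LINK_RE.findall(value):
        attrs = dict(PARAM_RE.findall(params))
        stamp = attrs.get("datetime")
        when = parsedate_to_datetime(stamp) if stamp else None
        links[attrs.get("rel")] = (uri, when)
    return links


link_value = (
    '<http://cnn.com/>; rel="original", '
    '<http://web.archive.org/web/timebundle/http://cnn.com/>; rel="timebundle", '
    '<http://web.archive.org/web/20080708093433/http://www.cnn.com>; '
    'rel="last-memento"; datetime="Tue, 08 Jul 2008 09:34:33 GMT"'
)

links = parse_link_header(link_value)
print(links["original"][0])      # http://cnn.com/
print(links["last-memento"][1])  # 2008-07-08 09:34:33+00:00
```

Because the datetime values themselves contain commas, splitting the header on commas would break entries apart; matching the quoted values as a unit avoids that.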

Page 51: TimeMaps: Metadata for Memento

[Figure: Memento HTTP Flow diagram, repeated from Page 42]

Page 52: TimeMaps: Metadata for Memento

Memento HTTP Flow

GET M, Accept-Datetime

    GET http://web.archive.org/web/20010911203610/http://www.cnn.com HTTP/1.1
    Host: web.archive.org
    Accept-Datetime: Tue, 11 Sep 2001 20:35:00 GMT
    Connection: close

Page 53: TimeMaps: Metadata for Memento

[Figure: Memento HTTP Flow diagram, repeated from Page 42]

Page 54: TimeMaps: Metadata for Memento

Memento HTTP Flow

200, Content-Datetime, Link R,B,M

    HTTP/1.1 200 OK
    Server: Apache-Coyote/1.1
    X-Archive-Orig-Accept-Ranges: bytes
    …
    Content-Type: text/html;charset=utf-8
    Content-Length: 23364
    Date: Thu, 21 Jan 2010 00:09:40 GMT
    Content-Datetime: Tue, 11 Sep 2001 20:36:10 GMT
    Link: <http://cnn.com/>; rel="original",
      <http://web.archive.org/web/timebundle/http://cnn.com/>; rel="timebundle",
      <http://web.archive.org/web/20000915112826/http://www.cnn.com>; rel="first-memento"; datetime="Fri, 15 Sep 2000 11:28:26 GMT",
      <http://web.archive.org/web/20080708093433/http://www.cnn.com>; rel="last-memento"; datetime="Tue, 08 Jul 2008 09:34:33 GMT",
      <http://web.archive.org/web/20010911203610/http://www.cnn.com>; rel="prev-memento"; datetime="Tue, 11 Sep 2001 20:30:51 GMT",
      <http://web.archive.org/web/20010911203610/http://www.cnn.com>; rel="next-memento"; datetime="Tue, 11 Sep 2001 20:47:33 GMT"
    Connection: close
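
Putting the appendix together, the three round trips can be sketched as a small client routine. Everything here is illustrative: `resolve_memento`, `find_rel`, and the `fetch` callable are hypothetical names, and `fetch` is deliberately injected so the sketch runs against canned responses taken from the slides rather than the live network. It also assumes `rel` is the first parameter after each URI, as in the headers shown above.

```python
import re
from typing import Callable, Dict, Optional, Tuple

Response = Tuple[int, Dict[str, str]]                    # (status, headers)
Fetch = Callable[[str, str, Dict[str, str]], Response]   # (method, uri, headers)


def find_rel(link_header: str, rel: str) -> Optional[str]:
    """Return the URI of the first link whose rel matches, if any."""
    pattern = r'<([^>]*)>\s*;\s*rel="%s"' % re.escape(rel)
    match = re.search(pattern, link_header)
    return match.group(1) if match else None


def resolve_memento(uri_r: str, accept_datetime: str,
                    fetch: Fetch) -> Tuple[str, Optional[str]]:
    """Follow R -> G -> M and return (Memento URI, its Content-Datetime)."""
    wanted = {"Accept-Datetime": accept_datetime}

    # Steps 1-2: HEAD the Original Resource; its Link header names the TimeGate.
    _, headers = fetch("HEAD", uri_r, wanted)
    uri_g = find_rel(headers.get("Link", ""), "timegate")
    if uri_g is None:
        raise LookupError("no timegate link advertised for " + uri_r)

    # Steps 3-4: GET the TimeGate; it redirects (302) to the chosen Memento.
    status, headers = fetch("GET", uri_g, wanted)
    if status != 302 or "Location" not in headers:
        raise LookupError("TimeGate did not redirect to a Memento")
    uri_m = headers["Location"]

    # Steps 5-6: GET the Memento; Content-Datetime gives its timestamp.
    _, headers = fetch("GET", uri_m, wanted)
    return uri_m, headers.get("Content-Datetime")


# Canned responses mirroring the messages in this appendix.
CANNED = {
    ("HEAD", "http://cnn.com/"): (200, {
        "Link": '<http://web.archive.org/web/timegate/http://cnn.com>; rel="timegate"'}),
    ("GET", "http://web.archive.org/web/timegate/http://cnn.com"): (302, {
        "Location": "http://web.archive.org/web/20010911203610/http://www.cnn.com"}),
    ("GET", "http://web.archive.org/web/20010911203610/http://www.cnn.com"): (200, {
        "Content-Datetime": "Tue, 11 Sep 2001 20:36:10 GMT"}),
}


def fake_fetch(method: str, uri: str, headers: Dict[str, str]) -> Response:
    return CANNED[(method, uri)]


result = resolve_memento("http://cnn.com/", "Tue, 11 Sep 2001 20:35:00 GMT", fake_fetch)
print(result)
```

Swapping `fake_fetch` for a function that performs real HTTP requests (without automatically following redirects) would exercise the same logic against a live TimeGate.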