TimeMaps: Metadata for Memento
TimeMaps: Metadata for Memento GSLIS Metadata Group, UIUC, 14th July 2010
Herbert Van de Sompel Robert Sanderson Michael L. Nelson
Lyudmila Balakireva Scott Ainsworth Harihar Shankar
http://www.mementoweb.org/
Memento is partially funded by the Library of Congress
Memento wants to make Navigating the Web’s Past Easy
http://www.mementoweb.org/ http://groups.google.com/group/memento-dev
• Problem Statement
• Memento Solution
  o Navigation not Search
  o API for Web Archives
• Memento Ontology for TimeMaps
Web Resources have Different Representations over Time
Thankfully Archived Representations Exist
3 Issues with Current Access to Archives
1. Access is via a new URI, unknown to the user.
2. People do not like to search for archived resources, and there is no automated method for finding them.
3. Navigation in the past is inconsistent:
  1. You are stuck in a single, necessarily incomplete archive
  2. Or, if links are not rewritten, URIs lead back to the present
Comment on Popular Science article: http://bit.ly/bWr5gP
1. Representations Archived at a Different URI
http://web.archive.org/web/20010911203610/http://www.cnn.com/
archived resource for http://cnn.com (Sep 11 2001, 20:36:10 UTC)
http://en.wikipedia.org/w/index.php?title=September_11_attacks&oldid=282333
archived resource for http://en.wikipedia.org/wiki/September_11_attacks (Dec 20 2001, 4:51:00 UTC)
2. Searching is Cumbersome
http://web.archive.org/web/*/http://cnn.com/
http://en.wikipedia.org/w/index.php?title=September_11_attacks&action=history
3. Inconsistent Navigation (Archives Incomplete)
http://web.archive.org/web/20010911203610/http://www.cnn.com/
archived resource for http://cnn.com (Sep 11 2001, 20:36:10 UTC)
SPACE link: http://web.archive.org/web/20010911213855/www.cnn.com/TECH/space/ (Sep 11 2001, 21:38:55 UTC)
3. Inconsistent Navigation (Can't Stay in Past)
http://en.wikipedia.org/w/index.php?title=September_11_attacks&oldid=282333
archived resource for http://en.wikipedia.org/wiki/September_11_attacks (Dec 20 2001, 4:51:00 UTC)
Pentagon link: http://en.wikipedia.org/wiki/The_Pentagon (current)
Past and Current Web are Not Integrated
The Web without a Time Dimension
Accessing an archived version of a resource requires a different URI than accessing its current version.
The Web with Time Dimension added by Memento
With Memento, you use the URI of the current version to access archived versions: qualify it with a datetime, and you "magically" arrive at the correct location.
The Memento Solution
There are two components to the Memento Solution:
• Component 1: Navigation to an archived resource via its original resource, by leveraging content negotiation.
• Component 2: A discovery API for archives that enables retrieving a list of all archived versions of a resource for a given URI.
Content Negotiation in Time
• Many systems support content negotiation for file format
  o Your client by default asks for HTML and gets HTML
  o But it could get PDF via the same URI
• Memento proposes a new dimension for content negotiation: Time
  o Your client by default asks for the current version, and gets it
  o But it could get an older version via the same URI
• Can be accomplished with only one new HTTP header in each direction:
  o Accept-Datetime: request for a particular timestamp
  o Content-Datetime: the returned content's timestamp
  o These exactly mirror existing headers for Format, Language, etc.
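
As a rough illustration of the request side, the sketch below formats a timestamp as an HTTP date and attaches it as an Accept-Datetime header. The function names are hypothetical, and nothing is sent over the network; it only shows what a client would put on the wire.

```python
from datetime import datetime, timezone
from email.utils import format_datetime
from urllib.request import Request


def accept_datetime_value(when):
    """Format an aware datetime as an HTTP date expressed in GMT."""
    return format_datetime(when.astimezone(timezone.utc), usegmt=True)


def datetime_request(uri, when, method="GET"):
    """Build (without sending) a request for `uri` as it existed at `when`."""
    headers = {"Accept-Datetime": accept_datetime_value(when)}
    return Request(uri, headers=headers, method=method)


when = datetime(2001, 9, 11, 20, 35, 0, tzinfo=timezone.utc)
req = datetime_request("http://cnn.com/", when, method="HEAD")
print(req.get_method(), req.full_url)
print("Accept-Datetime:", accept_datetime_value(when))
```

Note that urllib normalizes the stored header name to "Accept-datetime"; HTTP header names are case-insensitive, so this does not change the request's meaning.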
[Diagram, built up over several slides: www.cnn.com (current) and web.archive.org holdings at Apr 10 2001, 21:39:30 UTC; Aug 15 2004, 08:45:27 UTC; Aug 15 2007, 19:21:58 UTC. Labels added in turn: Original Resource, Mementos, TimeGate, Conneg with TimeGate to Mementos, Link Headers. A final slide shows the same labels for wikipedia.org.]
The Memento Solution
• Component 2: A discovery API for archives that allows requesting a list of all archived versions held for a resource with a given URI.
Why an API?
• Mementos for any given resource are distributed across archives. (What? Not just the Internet Archive?!)
• To get a correct perspective of the available Mementos, different archives need to be consulted.
• This can be done by distributed search (slow) or by consulting an aggregator.
• Aggregators and other services need a machine-readable description of archives' holdings to select the appropriate Memento for a request:
  o Closest in time
  o Most reliable representation
  o Fastest responding
  o (etc)
WebCitation 13 May 2009 12:28:39
Archive-It 14 May 2009 01:18:11
BL Archive 14 May 2009 07:12:45
Dracos 14 May 2009 13:00:00
TNA 14 May 2009 18:21:32
And no Internet Archive…
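
To make the "closest in time" criterion concrete, here is a minimal sketch of how an aggregator might pick among the holdings listed above. The function name and the requested datetime are illustrative, and the slide does not state a time zone, so UTC is assumed.

```python
from datetime import datetime, timezone

UTC = timezone.utc

# Holdings copied from the slide; the UTC time zone is an assumption.
HOLDINGS = [
    ("WebCitation", datetime(2009, 5, 13, 12, 28, 39, tzinfo=UTC)),
    ("Archive-It", datetime(2009, 5, 14, 1, 18, 11, tzinfo=UTC)),
    ("BL Archive", datetime(2009, 5, 14, 7, 12, 45, tzinfo=UTC)),
    ("Dracos", datetime(2009, 5, 14, 13, 0, 0, tzinfo=UTC)),
    ("TNA", datetime(2009, 5, 14, 18, 21, 32, tzinfo=UTC)),
]


def closest_in_time(holdings, requested):
    """Return the (archive, datetime) pair nearest to the requested datetime."""
    return min(holdings, key=lambda entry: abs(entry[1] - requested))


# Hypothetical request: 14 May 2009, 09:00:00.
requested = datetime(2009, 5, 14, 9, 0, 0, tzinfo=UTC)
archive, when = closest_in_time(HOLDINGS, requested)
print(archive, when.isoformat())
```

A real aggregator would presumably weigh the other criteria as well (reliability, response speed); this sketch covers only temporal distance.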
TimeMaps
• At most basic: a list of the URIs of Mementos and their times
• Expressed as Linked Data; a profile of OAI ORE Resource Maps
• Discoverable via a Link header from the TimeGate and from each Memento
Basic ORE Model
Aggregation (Aggr) is a set of web resources (R-1 to R-3), described in RDF or Atom by a Resource Map (ReM).
TimeBundles
Resources of Interest in Memento:
• Original Resource
• TimeGate
• Mementos
TimeGates
• Period(s) that the TimeGate covers
• Which resource it is a TimeGate for
• mem:TimeSpan is used, as a TimeGate can cover multiple distinct periods
Mementos
• Time Period: valid for or observed over, number of observations
• Metadata: size, format, etc (will come back to the "etc")
• Which resource it is a Memento for
Serializations
• RDF/XML
  o Good for XML parsers
• Turtle, N3 and related
  o Good for graph parsers
• RDFa
  o Good for web browsers
• Atom
  o Good for alerting, feed readers etc (but still embeds RDF)
• New: Link Header format
  o Good for real-time applications
  o Smaller file size (just the facts, ma'am)
  o Easy to implement with existing link header parsers
  o Servers need to produce this format anyway, so it offers a non-RDF way out
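
As a hedged sketch of what the Link Header serialization could look like, the code below writes a basic TimeMap (a list of Memento URIs and their times) in the same `<uri>; rel="..."; datetime="..."` style used in Link headers. The `TimeMap` class is hypothetical, and the relation name "memento" for individual entries is an assumption rather than something these slides specify.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from email.utils import format_datetime


@dataclass
class TimeMap:
    """Hypothetical minimal TimeMap: an Original Resource and its Mementos."""
    original: str
    mementos: list = field(default_factory=list)  # (uri, aware datetime) pairs

    def to_link_format(self):
        lines = [f'<{self.original}>; rel="original"']
        # Emit Mementos in chronological order.
        for uri, when in sorted(self.mementos, key=lambda m: m[1]):
            stamp = format_datetime(when.astimezone(timezone.utc), usegmt=True)
            # rel="memento" is assumed here for illustration.
            lines.append(f'<{uri}>; rel="memento"; datetime="{stamp}"')
        return ",\n".join(lines)


timemap = TimeMap(
    original="http://cnn.com/",
    mementos=[
        ("http://web.archive.org/web/20080708093433/http://www.cnn.com",
         datetime(2008, 7, 8, 9, 34, 33, tzinfo=timezone.utc)),
        ("http://web.archive.org/web/20010911203610/http://www.cnn.com",
         datetime(2001, 9, 11, 20, 36, 10, tzinfo=timezone.utc)),
    ],
)
serialized = timemap.to_link_format()
print(serialized)
```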
Use Case: Aggregator using TimeMaps
[Diagram: Original Resource, TimeGate, Mementos; labels: Conneg with TimeGate to Mementos, Link Headers.]
Metadata Discussion Points
1. What metadata is necessary to determine the most appropriate copy?
• Distance to requested time most important
• Quality of representation?
• Usage statistics for Original Resource? For Memento?
• User tagging of Memento for quality?
• Archive response speed?
• Need to know more information from user preferences?
2. What other metadata is useful and available?
• Crawling archives have limited information
• CMS systems have much more
• User tags, comments, annotations
• Semantic information about content, e.g. title, author, subject
• Distribution of changes over time
3. What metadata is necessary for inter-archive synchronization?
• Deduplication information: digests, request headers
• "Significant Change" factors
• Crawler settings: respect no-cache, robots.txt etc
4. What metadata can be generated by other services?
• Open World Model: Anyone can say anything about anything
• Technical metadata easy (MIX for images, etc)
• Time Series Analysis interesting (techtales.org)
• Machine Learning based approaches?
Thank You
Rob Sanderson:
• [email protected]
• [email protected]
This presentation:
• http://www.slideshare.net/azaroth42/xxx
Memento:
• http://www.mementoweb.org/
• http://groups.google.com/group/memento-dev
MementoFox:
• https://addons.mozilla.org/en-US/firefox/addon/100298 aka: http://bit.ly/memfox
Memento Enables Navigating the Past Web
Discussion Questions
1. What metadata is necessary to determine the most appropriate copy?
2. What other metadata is useful and available?
3. What metadata is necessary for inter-archive synchronization?
4. What metadata can be generated by other services?
Appendix: Memento HTTP Flow
(R = Original Resource, G = TimeGate, M = Memento, B = TimeBundle)
1. Request to R: HEAD R, (Accept-Datetime)
2. Response from R: Link G
3. Request to G: GET G, Accept-Datetime
4. Response from G: 302 M, Vary, TCN, Link R,B,M
5. Request to M: GET M, (Accept-Datetime)
6. Response from M: 200, Content-Datetime, Link R,B,M
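
The request/response sequence of this flow can be sketched as a small client routine. This is only an illustration under stated assumptions: `fetch` is a hypothetical transport callable returning a status code and a header dictionary, the function names are made up, and the canned responses are abbreviated from the example messages in this appendix so that the sketch runs without network access.

```python
import re


def timegate_from_link(link_header):
    """Return the URI carrying rel="timegate" in a Link header, or None."""
    match = re.search(r'<([^>]*)>\s*;\s*rel="timegate"', link_header)
    return match.group(1) if match else None


def memento_flow(fetch, uri_r, accept_datetime):
    """Walk Original Resource -> TimeGate -> Memento.

    `fetch(method, uri, headers)` is a hypothetical transport that
    returns a (status, response_headers) tuple.
    """
    headers = {"Accept-Datetime": accept_datetime}
    # Ask the Original Resource which TimeGate it advertises.
    _, resp = fetch("HEAD", uri_r, headers)
    uri_g = timegate_from_link(resp.get("Link", ""))
    if uri_g is None:
        raise LookupError("no TimeGate advertised for " + uri_r)
    # Negotiate in the datetime dimension with the TimeGate.
    status, resp = fetch("GET", uri_g, headers)
    if status != 302:
        raise LookupError("TimeGate did not redirect to a Memento")
    uri_m = resp["Location"]
    # Retrieve the selected Memento.
    _, resp = fetch("GET", uri_m, headers)
    return uri_m, resp.get("Content-Datetime")


# Canned responses standing in for the network.
CANNED = {
    ("HEAD", "http://cnn.com/"): (200, {
        "Link": '<http://web.archive.org/web/timegate/http://cnn.com>; rel="timegate"'}),
    ("GET", "http://web.archive.org/web/timegate/http://cnn.com"): (302, {
        "Location": "http://web.archive.org/web/20010911203610/http://www.cnn.com"}),
    ("GET", "http://web.archive.org/web/20010911203610/http://www.cnn.com"): (200, {
        "Content-Datetime": "Tue, 11 Sep 2001 20:36:10 GMT"}),
}


def fake_fetch(method, uri, headers):
    return CANNED[(method, uri)]


result = memento_flow(fake_fetch, "http://cnn.com/", "Tue, 11 Sep 2001 20:35:00 GMT")
print(result)
```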
Memento HTTP Flow: URI-R
HEAD R, (Accept-Datetime)

HEAD http://cnn.com/ HTTP/1.1
Host: cnn.com
Accept-Datetime: Tue, 11 Sep 2001 20:35:00 GMT
Connection: close
Memento HTTP Flow: Success – URI-R
Link G

HTTP/1.1 200 OK
Date: Thu, 21 Jan 2010 00:02:12 GMT
Server: Apache
Link: <http://web.archive.org/web/timegate/http://cnn.com>; rel="timegate"
Content-Length: 255
Connection: close
Content-Type: text/html; charset=iso-8859-1
GET G, Accept-Datetime

GET http://web.archive.org/web/timegate/http://cnn.com HTTP/1.1
Host: web.archive.org
Accept-Datetime: Tue, 11 Sep 2001 20:35:00 GMT
Connection: close
302 M, Vary, TCN, Link R,B,M

HTTP/1.1 302 Found
Date: Thu, 21 Jan 2010 00:06:50 GMT
Server: Apache
TCN: choice
Vary: negotiate, accept-datetime
Location: http://web.archive.org/web/20010911203610/http://www.cnn.com
Link: <http://cnn.com/>; rel="original",
  <http://web.archive.org/web/timebundle/http://cnn.com/>; rel="timebundle",
  <http://web.archive.org/web/20000915112826/http://www.cnn.com>; rel="first-memento"; datetime="Fri, 15 Sep 2000 11:28:26 GMT",
  <http://web.archive.org/web/20080708093433/http://www.cnn.com>; rel="last-memento"; datetime="Tue, 08 Jul 2008 09:34:33 GMT",
  <http://web.archive.org/web/20010911203610/http://www.cnn.com>; rel="prev-memento"; datetime="Tue, 11 Sep 2001 20:30:51 GMT",
  <http://web.archive.org/web/20010911203610/http://www.cnn.com>; rel="next-memento"; datetime="Tue, 11 Sep 2001 20:47:33 GMT"
Content-Length: 0
Connection: close
Content-Type: text/plain; charset=UTF-8
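
One practical wrinkle in a Link header like the one above is that the datetime values themselves contain commas, so splitting the header on "," would cut entries in half. The following sketch, with hypothetical names and an abbreviated header, parses the links with a regular expression instead and converts each datetime to a Python object. It is not a complete Link header parser; it assumes every parameter value is double-quoted, as in the example.

```python
import re
from email.utils import parsedate_to_datetime

# One link: <uri> followed by zero or more ; name="value" parameters.
LINK_RE = re.compile(r'<([^>]*)>((?:\s*;\s*[\w-]+\s*=\s*"[^"]*")*)')
PARAM_RE = re.compile(r'([\w-]+)\s*=\s*"([^"]*)"')


def parse_links(link_header):
    """Map each relation name to a (uri, datetime or None) pair."""
    links = {}
    for uri, raw_params in LINK_RE.findall(link_header):
        params = dict(PARAM_RE.findall(raw_params))
        stamp = params.get("datetime")
        when = parsedate_to_datetime(stamp) if stamp else None
        for rel in params.get("rel", "").split():
            links[rel] = (uri, when)
    return links


header = (
    '<http://cnn.com/>; rel="original", '
    '<http://web.archive.org/web/timebundle/http://cnn.com/>; rel="timebundle", '
    '<http://web.archive.org/web/20080708093433/http://www.cnn.com>; '
    'rel="last-memento"; datetime="Tue, 08 Jul 2008 09:34:33 GMT"'
)
links = parse_links(header)
for rel, (uri, when) in links.items():
    print(rel, uri, when)
```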
GET M, Accept-Datetime

GET http://web.archive.org/web/20010911203610/http://www.cnn.com HTTP/1.1
Host: web.archive.org
Accept-Datetime: Tue, 11 Sep 2001 20:35:00 GMT
Connection: close
200, Content-Datetime, Link R,B,M

HTTP/1.1 200 OK
Server: Apache-Coyote/1.1
X-Archive-Orig-Accept-Ranges: bytes
…
Content-Type: text/html;charset=utf-8
Content-Length: 23364
Date: Thu, 21 Jan 2010 00:09:40 GMT
Content-Datetime: Tue, 11 Sep 2001 20:36:10 GMT
Link: <http://cnn.com/>; rel="original",
  <http://web.archive.org/web/timebundle/http://cnn.com/>; rel="timebundle",
  <http://web.archive.org/web/20000915112826/http://www.cnn.com>; rel="first-memento"; datetime="Fri, 15 Sep 2000 11:28:26 GMT",
  <http://web.archive.org/web/20080708093433/http://www.cnn.com>; rel="last-memento"; datetime="Tue, 08 Jul 2008 09:34:33 GMT",
  <http://web.archive.org/web/20010911203610/http://www.cnn.com>; rel="prev-memento"; datetime="Tue, 11 Sep 2001 20:30:51 GMT",
  <http://web.archive.org/web/20010911203610/http://www.cnn.com>; rel="next-memento"; datetime="Tue, 11 Sep 2001 20:47:33 GMT"
Connection: close
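
A client may want to know how far the returned Memento is from the datetime it asked for. As a small sketch (the function name is hypothetical), the two header values from this exchange can be compared directly:

```python
from email.utils import parsedate_to_datetime


def datetime_offset(accept_datetime, content_datetime):
    """Return how far the Memento's datetime is from the requested one."""
    requested = parsedate_to_datetime(accept_datetime)
    returned = parsedate_to_datetime(content_datetime)
    return returned - requested


offset = datetime_offset(
    "Tue, 11 Sep 2001 20:35:00 GMT",  # Accept-Datetime sent by the client
    "Tue, 11 Sep 2001 20:36:10 GMT",  # Content-Datetime returned by the archive
)
print(offset.total_seconds())  # 70.0
```

Here the archive's copy was captured 70 seconds after the requested moment.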