Memento: Time Travel for the Web - arXiv · Memento: Time Travel for the Web Herbert Van de Sompel...

Memento: Time Travel for the Web

Herbert Van de SompelLos Alamos NationalLaboratory, NM, USA

[email protected]

Michael L. NelsonOld Dominion University,

Norfolk, VA, [email protected]

Robert SandersonLos Alamos NationalLaboratory, NM, USA

[email protected] L. Balakireva

Los Alamos NationalLaboratory, NM, USA

[email protected]

Scott AinsworthOld Dominion University,

Norfolk, VA, [email protected]

Harihar ShankarLos Alamos NationalLaboratory, NM, USA

[email protected]

ABSTRACTThe Web is ephemeral. Many resources have representa-tions that change over time, and many of those represen-tations are lost forever. A lucky few manage to reappearas archived resources that carry their own URIs. For ex-ample, some content management systems maintain versionpages that reflect a frozen prior state of their changing re-sources. Archives recurrently crawl the web to obtain theactual representation of resources, and subsequently makethose available via special-purpose archived resources. Inboth cases, the archival copies have URIs that are protocol-wise disconnected from the URI of the resource of whichthey represent a prior state. Indeed, the lack of temporalcapabilities in the most common Web protocol, HTTP, pre-vents getting to an archived resource on the basis of theURI of its original. This turns accessing archived resourcesinto a significant discovery challenge for both human andsoftware agents, which typically involves following a mul-titude of links from the original to the archival resource,or of searching archives for the original URI. This paperproposes the protocol-based Memento solution to addressthis problem, and describes a proof-of-concept experimentthat includes major servers of archival content, includingWikipedia and the Internet Archive. The Memento solutionis based on existing HTTP capabilities applied in a novelway to add the temporal dimension. The result is a frame-work in which archived resources can seamlessly be reachedvia the URI of their original: protocol-based time travel forthe Web.

Categories and Subject DescriptorsH.3.5 [Information Storage and Retrieval]: Online In-formation Services

General TermsDesign, Experimentation, Standardization

KeywordsWeb Architecture, HTTP, Archiving, Content Negotiation,OAI-ORE, Time Travel

1. INTRODUCTION“The web does not work,” my eleven year old son com-

plained. After checking power and network connection, Irealized he meant something rather more subtle. The URI(http://stupidfunhouse.com) he had bookmarked the yearbefore returned a page that didn’t look like the original atall, and definitely was not fun. He had just discovered thatthe web has a terrible memory.

Let us restate the obvious: the Web is the most pervasiveinformation environment in the history of humanity; hun-dreds of millions of people1 access billions of resources2 usinga variety of wired or wireless devices. The rapid growth ofthe Web was made possible by a suite of relatively simple,yet powerful technologies including TCP/IP, URI, HTTP,and HTML. The Web is also highly dynamic, with a sig-nificant percentage of resources changing at different ratesover time [3, 9, 18, 25]. Given the ubiquity of the Web, itis rather surprising to find how poor its memory is regard-ing these continuous changes. Indeed, once a resource haschanged, accessing one of its prior versions becomes a sig-nificant discovery challenge, no longer merely a matter ofusing Web protocols to dereference its URI.

In essence, the time dimension is absent from the mostcommon of Web protocols, HTTP. This timelessness is evenwritten into the W3C’s Architecture of the World Wide Web[15], which reminds us that dereferencing a URI yields a rep-resentation of the (current) state of the resource identifiedby that URI, and highlights the impracticality of keepingprior states accessible at their own distinct URIs:

Resource state may evolve over time. Requiring aURI owner to publish a new URI for each changein resource state would lead to a significant num-ber of broken references. For robustness, Webarchitecture promotes independence between anidentifier and the state of the identified resource.

Nevertheless, the Web does contain a meaningful amountof records of the past. Sites based on Content Manage-ment Systems (CMS) such as Wikimedia, the platform usedby Wikipedia, keep the current version of a page accessibleat a generic URI, while older versions remain accessible at

1http://www.internetworldstats.com/stats.htm

2http://googleblog.blogspot.com/2008/07/we-knew-web-was-

big.html

arX

iv:0

911.

1112

v2 [

cs.I

R]

6 N

ov 2

009

version-specific URIs. Special-purpose services that are con-cerned with persistent referencing, such as WebCite, storea representation of the resource retrieved at the time theservice is invoked. Also, inspired by the pioneering work ofthe Internet Archive, there is an ever-growing internationalWeb Archiving [6, 22] activity that consists of recurrentlysending out crawlers to take snapshots of Web resources,storing those in special-purpose distributed archives, andmaking them accessible through tools such as the WaybackMachine3. Transactional archives [11] store every materi-ally different representation of a web server’s resources asthey are being delivered to clients. Currently, their use isprimarily restricted to applications that need to meet spe-cial legal requirements, such as keeping an exact record ofwhat has been delivered to users of an ecommerce or gov-ernment site, and they are therefore typically not openlyaccessible. Exploratory work is ongoing regarding the es-tablishment of a peer-to-peer web archive that receives itscontent from browser caches, and that therefore can be con-sidered a client-side transactional archive [4]. Also personalclient-side transactional archives have been proposed [7, 30,32] but their private purpose excludes accessing them onthe Web. Search engine caches may also contain prior rep-resentations of resources, but they are restricted to the mostrecent snapshot taken by a crawler.

Although this variety of archival solutions exists and theircoverage is growing, accessing last year’s version of a re-source remains a significant challenge. In the case of Wiki-pedia, one has to resort to its History tab and navigate thesometimes thousands of entries there. The situation is sim-ilar for most other version-aware sites. For news sites, onemay find the answer by searching the site’s special purposearchive if one exists. And, as an option of last resort un-known to many Web users, one can individually search themany Web Archives, hoping to find a page that was archivedat a time close to the desired one. This situation is cum-bersome for users who, for example, want to revisit a book-marked resource as it existed at the time of bookmarking.Research has indicated that anywhere between 50% and 80%of page visits are revisits [2, 26, 31]. To an extent, this find-ing emphasizes the need for end-user time travel on the Web.

The poor integration of archival content in regular Webnavigation is also a fundamental hindrance to applicationsthat require finding, analyzing, extracting, comparing, andotherwise leveraging historical Web information. Examplesinclude Zoetrope, a tool that allows interaction with and vi-sualization of high-resolution temporal Web data [1]; DiffIE,a Web browser plug-in that emphasizes Web content thatchanged since a previous visit [32]; and time-oriented searchthat tracks the frequency of words and phrases in resourcesover time [20]. These applications must build their ownspecial-purpose archives in an ad-hoc manner in order toachieve their goals.

In this paper, we present the Memento solution to allowtemporal access to the Web. Our solution is based on and isas simple as the technologies that led to the rapid growth ofthe Web. It focuses on seamless access to archival content(irrespective of its location) as part of regular Web naviga-tion for both human and software agents. It does not dealwith the aspect of creating, populating, and maintainingarchives, but rather leverages their existence. The remain-

3http://www.archive.org/

der of the paper is structured as follows: Section 2 briefly re-views transparent content negotiation for HTTP in order toallow a better understanding of Section 3 which introducesthe Memento solution for time travel on the Web; Section4 describes an experiment that provides a proof of conceptfor the solution; Section 5 discusses open issues; and Section6 provides an overview of related work; Section 7 holds ourconclusion.

2. CONTENT NEGOTIATIONTransparent Content Negotiation for HTTP [14] (from

here on abbreviated as conneg) allows a client to select whichrepresentation it wants to retrieve from a transparently ne-gotiable resource; that is, a resource that has multiple rep-resentations (variants) associated with it, each of which isavailable from a variant resource. Currently deployed di-mensions that are open to conneg are media type, language,compression, and character set. A client expresses prefer-ences, possibly according to multiple dimensions, in special-purpose HTTP Accept headers. Preferences are qualifiedwith “quality”, or “q”, values, that have a normalized valueof 1.0 – 0.0 (an argument without a q value is assumed tohave q=1.0). For example, by using the header “Accept-Language: en, fr;q=0.7” the client indicates that English ispreferred and French is acceptable. Based on information inthese headers, a server will either:

• Select an appropriate representation: There are twoways for a server to do so. One way is to providea “HTTP 200 OK” response with a “TCN: Choice”header, and a Content-Location header that indicatesthe URI of the variant resource that delivered the rep-resentation. The other is to provide a “HTTP 302Found” response with a “TCN: Choice” header, and aLocation header that indicates the URI of where theclient can access the variant resource.

• Respond with a “HTTP 406 Not Acceptable” responseif the server cannot meet the client’s preferences asstated in the request. The server then also returnsa “TCN: List” header and a list of variant resourcesit possesses that are associated with the requested re-source. The client can then make an informed decisionabout variant selection.4

RFC 2295 proposes a format for these lists, expressed asan Alternates response header that can be used in both theChoice and List scenarios. Web servers do not necessarilysupport all the negotiation dimensions for all of their re-sources, but do indicate the supported dimensions to clients(e.g., “Vary: negotiate, accept-language” if the language di-mension is supported). Also, note that according to RFC2295, variant resources do not themselves support contentnegotiation5.

As an example, presume a transparently negotiable re-source http://an.example.org/paper for which the followingvariant resources are available: the paper in HTML and En-glish (paper.html.en), in PDF and English (paper.pdf.en),

4The client can also force a “ HTTP 300 Multiple Choices” response

by issuing a “Negotiate: 1.0” request header. This rarely occurs inpractice, but the response is functionally equivalent to a “HTTP 406Not Acceptable” response.5Servers must return a “HTTP 506 Variant Also Negotiates” response

if variant resources support conneg.

and in PDF and French (paper.pdf.fr). Now presume a clientwants to access the paper and has a preference for HTMLand English. The interaction, in which the server makes achoice that fully honors the client’s preferences, would thenbe (only headers relevant for conneg are shown):

GET /paper HTTP/1.1Host: an.example.orgAccept: text/html, application/pdf;q=0.8Accept-Language: en-US, fr;q=0.7, de;q=0.5

HTTP/1.1 200 OKTCN: choiceVary: negotiate, accept, accept-languageContent-Location: /paper.html.enContent-Type: text/htmlContent-Language: enAlternates:

{"paper.html.en" 1.0 {type text/html} {language en}},{"paper.pdf.en" 0.8 {type application/pdf} {language en}},{"paper.pdf.fr" 0.6 {type application/pdf} {language fr}}

However, if the client prefers PDF over HTML and in-sists only on German language documents (French and En-glish have q=0.0), the interaction in which the server cannothonor the request, and leaves the choice to the client wouldbe:

GET /paper HTTP/1.1Host: an.example.orgAccept: application/pdf, text/html;q=0.8Accept-Language: de, fr;q=0.0, en-US;q=0.0

HTTP/1.1 406 Not AcceptableTCN: listVary: negotiate, accept, accept-languageAlternates: {"paper.pdf.fr" 0.8 {type application/pdf}

{language fr}}, {"paper.html.en" 0.5 {type text/html}{language en}}, {"paper.pdf.en" 0.4{type application/pdf} {language en}}

3. THE MEMENTO SOLUTIONIn this section, we introduce the two core building blocks

of the Memento solution to allow temporal navigation of theWeb: HTTP content negotiation in the datetime dimension,and an API for archives of web resources that allows request-ing an inventory of available archived resources associatedwith a resource with a given URI.

3.1 A Memento: An Archival ResourceWe introduce the term Memento to refer to an archival

record of a resource. More formally, a Memento for a re-source URI-R (as it existed) at time ti is a resource URI-Mi[URI-R@ti] for which the representation at any momentpast its creation time tc is the same as the representationthat was available from URI-R at time ti, with tc ≥ ti. Im-plicit in this definition is the notion that, once created, aMemento always keeps the same representation.

In the remainder of this paper, the term original resourceis used to refer to a resource that itself is not a Mementoof another resource, and URI-R is used to denote its URI.URI-M is used to denote the URI of a Memento.

3.2 HTTP Datetime Content NegotiationWe introduce the notion of content negotiation in the

datetime dimension (from here on abbreviated as DT-conneg),allowing a client to indicate that it is looking for past ratherthan current representations of a resource. This is achievedby using a special-purpose Accept header, experimentallynamed X-Accept-Datetime, which has datetimes (ratherthan media type or similar) as its value:

X-Accept-Datetime: {Sun, 06 Nov 1994 08:49:37 GMT}

Generally speaking, DT-conneg works in very much thesame way as existing conneg approaches: If a client wantsto retrieve a Memento of the original resource URI-R, it is-sues an HTTP GET at URI-R using the X-Accept-Datetimeheader to express the datetimes of the archival record(s) ofURI-R in which it is interested. The server handling thisHTTP GET request tries to honor it by delivering a rep-resentation it chooses based on the client’s datetime prefer-ence(s), and/or by providing the client with a list of availablevariant resources, each of which is a Memento of URI-R. De-scribed in more detail below, two distinctions exist betweenDT-conneg and other conneg approaches:

• Cases exist in which the server hosting URI-R can notitself honor the DT-conneg request, but instead redi-rects to a server that can.

• The list of available variant resources can be too exten-sive to be expressed in an Alternates header. In thiscase, a combination of a sizeable Alternates headerlisting variants centered on the requested datetime(s),and an HTTP Link header pointing at an extensivelist of variants is used.

Before deciding on the X-Accept-Datetime header, we in-vestigated possible alternatives that could be used in HTTPinteraction. We decided not to use the “features” exten-sibility mechanism introduced by RFC 2295 because it isgeared at the fine-grained specification of variant options(e.g., paper size, color depth) and hence is not suitable forsomething with the primacy of datetime. Also, the ongoingMedia Fragment work of the W3C [33] is not applicable be-cause it proposes expressing a segment of a resource (e.g.,a region of an image, a section of a video) as a URI frag-ment. It does not deal with the notion of a resource thathas changing representations over time.

3.3 A TimeGate: A Resource Capable of DT-conneg

We introduce the term TimeGate to refer to a transpar-ently negotiable resource that supports the datetime dimen-sion. More formally, a TimeGate for an original resourceURI-R is a transparently negotiable resource URI- G[URI-R]for which all variant resources are Mementos URI-Mi[URI-R@ti] of the resource URI-R. Since multiple archives mayhost versions of URI-R, multiple TimeGates may exist forany given resource, i.e. one per archive.

3.4 Time Travel: Combining DT-conneg andTimeGates

To further explain DT-conneg and TimeGates, two sepa-rate scenarios are explored. The combination of these sce-narios provides a solution for temporal Web navigation thatintegrates operational web servers and archives of all types.To allow for a better understanding, the description is re-stricted to conneg in the datetime dimension only. Also, inorder to keep examples simple, requests with multiple date-time values and associated q-values are not used. It shouldbe noted, however, that both multi-dimensional conneg, andmultiple datetime values are possible in the proposed frame-work, since it builds on the principles of RFC 2295 thatprovides these capabilities. Furthermore, we assume that

the server that hosts the original resource URI-R for whicha client wants to retrieve Mementos, is able to detect theexistence of an X-Accept-Datetime header.

Before describing the scenarios, let us provide some ex-planatory information about the HTTP headers that areinvolved:

Alternates: RFC 2295 requires listing all variant resources.However, since an extensive set of variant resources may ex-ist in case of DT-conneg, the Alternates listing is imprac-tical. Therefore, Alternates only lists a limited amount ofvariant resources, centered on the datetime requested by theclient.

Link : To compensate for the incomplete list of variantresources in Alternates, an HTTP Link header [23] providesa pointer to a resource (the TimeBundle, see Section 3.5)that supports retrieving a list of all variant resources (Me-mentos), and their associated metadata.

X-Archive-Interval : Indicates the entire datetime intervalfor which the archival server has Mementos for URI-R.

X-Datetime-Validity : Indicates the datetime interval dur-ing which the provided representation was valid. Certainservers, including CMS and transactional archives, can re-liably provide this information. Others, such as crawler-driven web archives cannot.

3.4.1 Web servers with archival capabilitiesSome web servers handle aspects of resource archiving na-

tively, by maintaining explicit information about the loca-tion and datetimes of archival records of their resources,stored internally or remotely. Many CMS, Version ControlSystems, as well as the TTApache system [8] fall under thiscategory. But also servers that recurrently archive into acloud store and keep track of the URIs of the remote archivalrecords fit in.

When a client is looking for Mementos of an original re-source URI-R hosted by these servers, they can handle therequests internally since all the information that is required– URIs of Mementos and their datetimes – is available. Inthis case, the set-up is as follows:

• URI-R itself becomes a transparently negotiable re-source that supports DT-conneg to provide access toall its available Mementos. In essence, URI-R func-tions as its own TimeGate URI-G. Note that typicalURI-Rs for these systems either provide access to thecurrent version of a resource, or to a list of all its ver-sions (each with its own URI-M), or to a combinationof both.

• All Mementos URI-Mi[URI-R@ti] of URI-R becomevariant resources for URI-R.

Figure 1 depicts a typical, successful, DT-conneg transac-tion flow for this type of server, including the HTTP head-ers that are used. The transactional behavior for less trivialcases are also considered in the Memento solution but spaceprevents us from discussing them here. Such cases includerequesting Mementos for datetimes that are out of the date-range for which the server has archival records, requestingMementos for URI-Rs that no longer exist, and the clientproviding a datetime which the server is unable to parse6.

6Details: http://mementoweb.org/guide/http/local

3.4.2 Web servers without archival capabilitiesMany other servers have no local archival capabilities

whatsoever. They host resources for which only a represen-tation of the current state can be retrieved, and are unawareof the details regarding the existence of Mementos of theirresources in other archival servers. Naturally, such a servercannot redirect a client that requests an archival record ofone of its URI-Rs to an appropriate Memento. However,these systems can still play a constructive role by redirectingthe client to a server that is equipped to handle the request:an archive of web resources. In this case, the set-up is asfollows:

• Upon detection of the X-Accept-Datetime header inthe client’s request for URI-R, the server merely redi-rects (using “HTTP 302 Found”) the client to anarchival server. Note that this is not a 302 redirectionthat is part of a conneg transaction, as described inSection 2. Rather it is a 302 redirection that resultsfrom detecting the X-Accept-Datetime header.

• The redirection is to a TimeGateURI-G[URI-R] that the archival server makes availablefor the original resource URI-R.

• The archive’s URI-G is a transparently negotiable re-source that supports DT-conneg to provide access toall the Mementos that the archive has available forURI-R.

• All Mementos URI-Mi[URI-R@ti] that the archive hasavailable for URI-R become variant resources for itsURI-G.

Figure 2 depicts a typical, successful, DT-conneg trans-action flow for this type of server, and includes the HTTPheaders that are involved. Again, the transactional behav-ior for less trivial cases is not covered here7. In essence, thesolution is the same as in the above case, with the exceptionthat the TimeGates reside on an external archival server,not on the server that hosts the original resource URI-R.This distinction raises two important questions.

First, to which archive should a server redirect? In orderto help the client, a server should redirect to an archive thathas the best archival coverage of its resources. Servers thathave an associated transactional archive should redirect toit, servers that have explicit recurrent crawling agreementswith systems such as Archive-It8 should point there, otherservers may point at their country-specific archive (suchas the Finnish, Danish, Canadian, etc. archives), and inmany cases servers can point at the Internet Archive. Notethat scenarios may be envisioned in which the redirection issubject to configuration, for example, redirection to differ-ent archives depending on archival time-range, media type,etc. Then again, this problem of redirecting to a specificarchive could be addressed by uniformly pointing at an ag-gregator service that holds crucial metadata (e.g., URI-R,URI-G, URI-M, ti) about Mementos available in a variety ofarchival servers, and that exposes cross-archive TimeGatesURI-G[URI-R]. In Section 3.5, we introduce a discovery APIfor archives that enables the creation of such a TimeGateaggregator.

7Details: http://mementoweb.org/guide/http/remote

8http://www.archive-it.org/

Fig

ure

1:

DT

-conneg

for

Web

serv

ers

wit

harc

hiv

alcapabilit

ies:

UR

I-R

=U

RI-

G.

Second, how does the server know the URI-G of theTimeGate for its own URI-R on an external archival server?This problem can be addressed by introducing archive-specific or cross-archive conventions for the syntax for URI-G of TimeGates as a function of URI-R. This would simplyformalize the status-quo as all major web archives that usethe Heritrix/Wayback solution already use such conventions.For example, the URI to retrieve a list of all archived ver-sions of http://cnn.com/ is:

http://web.archive.org/web/*/http://cnn.com/

Hence, the URI that could be used as a convention for theInternet Archive’s TimeGate for http://cnn.com/ would be:

http://web.archive.org/web/timegate/http://cnn.com/

Such a convention seems achievable in the context of theInternational Internet Preservation Consortium9 that hasmade archive interoperability one of its goals. However,when a TimeGate aggregator service is introduced, URI-G syntax conventions for individual archives are not crucial;only a convention for the aggregator’s URI-G syntax wouldbe essential.

3.5 Discovering Mementos: TimeBundles andTimeMaps

For discovery purposes, we introduce the notion of a re-source hosted by an archival server, via which a full overviewis available of all Mementos that the archive holds for anoriginal resource URI-R; we name such a resource a Time-Bundle. More formally, a TimeBundle for a resource URI-R,is a resource URI-B[URI-R] that is an aggregation of: (a) allMementos URI-Mi[URI-R@ti] available from an archive, (b)the archive’s TimeGate URI-G for URI-R, (c) the originalresource URI-R itself.

Given the semantics of a TimeBundle, as an aggregation ofa set of resources, all of which share a temporal relationshipwith URI-R, we propose to model it as an ORE Aggrega-tion [34]. The ORE specifications comply with the LinkedData conventions [5], and treat an ORE Aggregation as anon-information resource [21] described by an informationresource that is accessible via an HTTP 303 redirect fromthe URI of the ORE Aggregation. We name the informationresource that describes the TimeBundle a TimeMap; it is aspecialization of an ORE Resource Map. The TimeMap liststhe URIs of all resources that are aggregated in the Time-Bundle, as well as metadata that is available about them.We have not formally engaged in specifying which meta-data to convey in TimeMaps, but essentials such as archivaldatetime, media type, and language, as well as more specificinformation such as digest, number of observations, validitytime-range [4] must be considered10.

TimeBundles made available by archives may be leveragedin real-time client interaction, since their URI-B is expressedas the content of the HTTP Link header (see the HTTPheaders in Figures 1 and 2). And, when an archive makes itsTimeBundles discoverable using common approaches suchas SiteMaps [12], Atom Feeds [24], or OAI-PMH [19] theybecome a powerful mechanism for batch harvesting of meta-

9http://www.netpreserve.org/

10An example RDF/XML TimeMap as used in our experiment is avail-

able at http://mementoweb.org/guide/api/map1

data that describes an archive’s entire collection, and thatcan be used for the creation of cross-archive services.

3.6 A TimeGate AggregatorIf various archives implement TimeBundles and associated

TimeMaps, and make them discoverable using the aforemen-tioned techniques, then information about Mementos hostedby different archives can be harvested into an aggregatorservice. For each original resource URI-R, for which Me-mentos exist in the harvested archives, such an aggregatorthen minimally holds the distinct URI-Ms of each of thoseMementos in the various archives, as well as their archivaldatetime, media type, language etc. This information allowsthe aggregator to introduce TimeGates URI-G for each ofthe URI-Rs for which the harvested archives have Memen-tos. The variant resources for any specific TimeGate arethe Mementos for URI-R as they exist in the distributedarchives. Because the aggregator has information on Me-mentos across archives, its time-granularity is finer than thatof any of the individual archives. This provides the aggre-gator with a better range of possibilities when redirecting aclient to a Memento in response to a request for a specificdatetime. In essence, this aggregator behaves as the archivalservers discussed in Section 3.4.2, but it has a broader cov-erage both regarding URI-Rs and Memento datetimes, andit does not store the Mementos itself.

Figure 3 illustrates the value such an aggregator can bringto time travel. It shows various Mementos for the noaa.govhome page as it was around the time of Hurricane Katrina.In order to revive how the drama unfolded, inspecting Me-mentos held by different archives is required. Indeed, boththe content of the Mementos as well as their archival serverchanges as time progresses. Note also that, although the In-ternet Archive claims to have coverage for September 9 2005,the Memento is not really available (bottom left of Figure3; it is not known if this is a permanent or transient error);the next available Memento is for September 10 2005, and isavailable from Archive-It. In cases like this, an aggregatorcould support navigation across archives and across time.

4. EXPERIMENTWe have performed an experiment to demonstrate the fea-

sibility of the proposed DT-conneg framework involving adiverse array of components that jointly realize web timetravel across various servers. The deployed environment isdepicted in Figure 4. The arrows indicate the flow of HTTPinteractions shown in Figures 1 and 2, subject to the follow-ing considerations that are directly related to conducting atime travel experiment in a Web that is not (yet) DT-connegenabled.

First, as it was not realistic to try and get active develop-ment involvement from existing archival servers within thetimeframe the reported work took place, TimeGates andTimeBundles for several archives (CMS and web archives)were not implemented natively within those systems butrather by-proxy. This means that they were exposed byservers under our control, which obtained the essential in-formation from the archives using ad-hoc techniques suchas screen scraping. While it may seem that this approachundermines the essence of the protocol-based DT-connegframework, it actually is a strong illustration of its feasibil-ity: if one can scrape the essential information from archives’pages, it is certainly available in their databases, and hence,

Fig

ure

2:

DT

-conneg

for

Web

serv

ers

wit

hout

arc

hiv

alcapabilit

ies:

UR

I-R6=

UR

I-G

.

(a) Archive-ItThu, 08 Sep 2005 17:48:47 GMT

(b) Internet ArchiveThu, 08 Sep 2005 21:07:05 GMT

(c) Internet ArchiveFri, 09 Sep 2005 01:58:48 GMT

(d) Archive-ItSat, 10 Sep 2005 08:11:47 GMT

Figure 3: Distributed archive coverage of www.noaa.gov. 3(a) and 3(d) come from Archive-It and 3(b) and3(c) come from the Internet Archive. Note that 3(c) has either a transient or permament error.

native implementation should be more straightforward thanby-proxy. Also, while we rely on a by-proxy approach forcertain systems, demonstrations of the feasibility of nativeimplementation are also available.

Second, existing web servers do not currently detect the X-Accept-Datetime header required for time travel, and hencecannot issue the essential “HTTP 302 Found” to a TimeGate(see Section 3.4.2). These servers will currently respond asusual, typically with an “HTTP 200 OK” or “HTTP 404Not Found”. In order to still be able to demonstrate theDT-conneg framework in the experiment, the remedy is tohave the time travel client detect such responses that are un-expected from the time travel perspective, and take controlby subsequently issuing the DT-conneg request directly to aTimeGate for URI-R exposed by an archival server (nativeor by-proxy). In essence, in these cases, the client fulfills theredirecting role that the host of URI-R normally would inthe DT-conneg framework. For servers outside of our con-trol, there was no other option than to resort to this clientapproach; for servers under our control the redirect to aTimeGate was implemented natively.

The following is a description of the components involvedin the experiment:

Web servers: We equipped domains under our own con-trol with the capability to honor DT-conneg requests bydetecting the X-Accept-Datetime header, and redirecting toTimeGates exposed by an appropriate archival server. Thiswas trivially implemented using an Apache mod rewriterule11 for servers we could configure:

http://lanlsource.lanl.gov/

http://odusource.cs.odu.edu/

http://digitalpreservation.gov/

(LANL, ODU, and LoC, respectively in Figure 4). For ob-vious reasons, we were not able to implement this for serversbeyond our control.

Archives: Wikipedia is a prominent example of the classof servers with local archival capabilities. TimeGates (andTimeBundles) for it were implemented by-proxy (Wiki proxyin Figure 4). However, to demonstrate the possibility ofnative implementation, a plug-in was developed that addsX-Accept-Datetime and TimeGate capabilities to the Wiki-media platform on which Wikipedia is based12. To coverfor the class of servers that lack local archival capabilities,TimeGates (and TimeBundles) were implemented by-proxyfor the Internet Archive (IA proxy in Figure 4), the Inter-net Archive’s Archive-It (AI proxy in Figure 4), the Libraryof Congress’ Archive-It, the Government of Canada WebArchive, and WebCite. In addition, we developed a transac-tional archive platform and deployed it at LANL and ODU(LANL TA and ODU TA in Figure 4, respectively). As theLANL and ODU web servers respond to client requests, therepresentations they serve are pushed into these archives,yielding a high-resolution archival record of their evolvingresources. It is worth noting that the described selectioncovers a broad range of commonly deployed archival solu-tions: CMS, web-crawler based archives, on-user-demandarchives, and transactional archives.

Aggregator : Furthermore, a TimeGate aggregator (Aggr

11See http://mementoweb.org/tools/apache

12Plug-in at http://mementoweb.org/tools/wiki

in Figure 4) was developed that collects archival metadatafrom the aforementioned web archives’ TimeBundles (someby-proxy and some native), and can hence serve as a com-mon target for redirection. This collecting is currently donedynamically: as a client requests a Memento for an originalresource URI-R via the aggregator, the aggregator contactsassociated TimeBundles in various archives, merges the re-turned TimeMap information, and only then redirects theclient to an appropriate Memento. This experimental ap-proach makes retrieving Mementos via the aggregator pre-dictably slow.

Clients: We developed a FireFox plug-in that allows set-ting the browser to time travel mode, and selecting a date-time for the journey. From there onwards, the browser addsan X-Accept-Datetime header, with the datetime value setby the user, to every HTTP GET issued. If all targetedservers would implement the “HTTP 302 Found” redirec-tion upon detection of the X-Accept-Datetime header, onlyarchival pages would be retrieved, and all links in thosepages would be interpreted as requests for Mementos. Thiseffectively happens for the servers under our control (theblack flows labeled [1], [2] and [4] in Figure 4). As describedabove, other servers do not exhibit this behavior (the redflows [3] and [5] in Figure 4). Implementing the remedialbehavior where the client itself takes care of the redirectionturned out not to be trivial in the Mozilla plug-in frame-work as it does not support intercepting and modifying re-sponses13 (e.g., on 404 or 200 response codes). The result isa time travel plug-in that deals perfectly with URI-Rs of theservers under our control but not with any others. We thendecided to develop a time travel client that runs on a serverand is developed using the Apache mod python frameworkthat offered the required flexibility. The resulting gate-way client handles all flows of Figure 4 correctly, and fullydemonstrates the potential of the DT-conneg framework. Itis accessible via a web form that allows entering URI-R anda datetime. Upon submitting the time travel request, thegateway client (not the browser) fulfills the DT-conneg re-quests, and once completely handled, returns the resultingMemento page to the browser. In order to allow for contin-ued time travel of links in the page, they need to be rewrittento point at the gateway client. This is merely an artifact ofa server-side, not a browser-based, implementation. Thisclient also depicts the HTTP transactions that take placeduring time travel, and allows inspecting the HTTP head-ers involved.

With the above components in place, an experimental en-vironment results that effectively demonstrates the feasibil-ity of web time travel using the Memento solution. Twoclients, both admittedly with respective restrictions, allownavigating the past Web in very much the same way asthe current Web is browsed; they seamlessly move acrossweb servers and archives (CMS-style and web archives) us-ing the HTTP protocol, extended with DT-conneg, to tryand return a Memento that meets the client’s preference.Due to the various by-proxy components, and the dynamicimplementation of the aggregator, the navigation can oftenbe slow. However, the navigations that involve the serverswith full native support (flows [1] and [2] in Figure 4), thosethat bypass the aggregator (flows [1], [2] and [3]), and thosefor which the aggregator can respond from its limited cache

13See https://wiki.mozilla.org/Firefox/Projects/Network Error Pages

(some flows [4] and [5]) perform noticeably faster, even us-ing the gateway client. In addition, many well understoodtechniques including batch harvesting, caching, and recur-rent refreshing are available to improve the performance ofthe aggregator and fundamentally improve response times.

As an illustration of our results, Figure 5 shows two nav-igations conducted on November 2 2009, around 16:25:00UTC: one in real-time, and one in time travel mode with adatetime set to October 12 2009 16:25:00 UTC. The captionsin the figure also indicate the flow of the HTTP interactionsin the experimental environment as indicated in Figure 4.The DT-conneg framework allows a re-navigation of both inthe future. It suffices to use these respective datetimes asthe DT-conneg value, and hope that archives have recordsof the resources involved14.

Figure 4: The Memento Experiment environment.

5. DISCUSSIONIn this section, we touch upon issues pertaining to the Me-

mento solution that require further attention. The seamlessintegration of archives in regular web navigation provided byDT-conneg allows exploring novel ways to address commonproblems that result from the dynamics of the Web. Someof these have tentatively been explored in our experiment.Consider the following cases:

A URI-R vanishes, but the server that used to serve itis still operational : In this case, the server should still issuethe redirect to a TimeGate upon detection of the DT-connegrequest. This allows seamless access to a Memento of URI-R, even if the server no longer hosts the original.

A domain vanishes: The client is looking for a current rep-resentation of a URI-R that was hosted by the domain, butfails. The client resorts to interaction with other archivesor with a TimeBundle aggregator and arrives at the mostrecent Memento of the resource.

14Try it at http://mementoweb.org/demo/client1

A domain is taken over by a new custodian: The new cus-todian adheres to other policies regarding which archive toredirect a DT-conneg request. The client understands fromthe X-Archive-Interval returned by that archive of choice,that it does not cover the time range in which the previouscustodian operated the domain. The client resorts to inter-action with other archives or with a TimeBundle aggregatorand arrives at an appropriate Memento.

Two aspects related to the integration of the proposedMemento solution into the existing Web infrastructure re-quire attention. First, when issuing a request with an X-Accept-Datetime header to a server that hosts the originalresource URI-R, all caches between the client and the servermust be bypassed in order to avoid retrieving a current rep-resentation of URI-R. In our experiment, we enforced thisbehavior through a combination of two client request head-ers: “Cache-Control: no-cache” to force cache revalidationand “If-Modified-Since: Thu, 01 Jan 1970 00:00:00 GMT”to make sure that revalidation fails. Further research is re-quired to find an alternative for this admittedly inelegant ap-proach. Ideally, it should leverage existing caching practicebut extend it in such a way that caches are only bypassed inDT-conneg when essential, but still used whenever possible(e.g., to deliver Mementos). Second, when it comes to list-ing variant resources in response headers, the DT-connegframework cannot operate according to the letter of RFC2295. Indeed, the RFC states: “If a response from a trans-parently negotiable resource includes an Alternates header,this header MUST contain the complete variant list boundto the negotiable resource.” This mandate is based on aperspective expressed in the RFC that “it is expected thata typical transparently negotiable resource will have 2 to 10variants, depending on its purpose.” Clearly, TimeGates asproposed in the DT-conneg framework can have many morethan 10 Mementos. We do not think this makes the con-neg framework inapplicable to the datetime dimension, butrather we believe DT-conneg introduces a challenge that theauthors of the RFC did not anticipate ten years ago. As de-scribed, we propose a solution based on a sizeable Alternatesheader combined with an HTTP Link header that leads to acomplete list of variants; other options should be explored.

An interesting characteristic of the DT-conneg frameworkrequires more explicit attention. When requesting a Me-mento for a page that contains links to external pages, orembedded resources such as images or videos, each of thoseare requested with DT-conneg from the respective serversthat host/hosted the URI-R of those resources. This isa core characteristic that the proposed time travel frame-work shares with regular web navigation. It should be notedthat this is not the current behavior of pages stored in webarchives. Indeed, in order to avoid filling out an archivedpage with current representations of embedded resources,web archives rewrite URIs in archived pages to point backinto the archive at archived representations of those resources.The same happens with links in archived pages, effectivelyturning the archive into an island isolated from the restof the Web. The upside of this approach is that archivedpages are self-contained: the page and its embedded re-sources were typically crawled around the same time andhence the archived page is likely to be a faithful reconstruc-tion of what the original looked like at the time of the crawl.The drawback of the approach is that navigation is restrictedto the archive’s island. Navigating beyond it to obtain an

(a) http://lanlsource.lanl.gov/hello - flow 1 in Figure 4

(b) http://en.wikipedia.org/MS Oasis of the Seas - flow 3 in Figure 4

(c) http://news.bbc.co.uk/ - flow 5 in Figure 4

Figure 5: Browsing in real time (Mon, 02 Nov 2009 16:25:00 GMT) on the left and time travel(Mon, 12 Oct 2009 16:25:00 GMT) on the right.

archived version of a linked resource that is not available inthe archive but might be available elsewhere on the Web, isnot possible. Further exploration is required to arrive at astrategy for web archives that would at the same time ad-here to the self-containedness principle and allow externalnavigation using the DT-conneg framework when beneficial.

Another challenge pertains to selecting a Memento thatbest meets the client’s conneg preferences. There are twoaspects to this problem. The first relates to the archivaldatetime of the Memento that an archive should return inresponse to a datetime expressed by a time travel client.For certain archives the choice is straightforward. Indeed,transactional archives and servers such as Wikipedia knowexactly during which time interval a certain Memento func-tioned as the active representation of URI-R (cf. the X-Datetime-Validity discussed in Section 3.4). Hence, theycan return the Memento that was active at the datetimespecified by the client. However for resources not hostedby such servers, it will be rare that any archive has a Me-mento that perfectly matches the client’s preference. In thiscase, an archive (or a TimeBundle aggregator) must makea choice. A typical approach used by existing web archivesis to choose the Memento that is the “closest” in time, re-gardless of whether its archival datetime is before or afterthe requested datetime. But this approach is challengedwhen pages have embedded resources. The more resourcesrequired to render a page, the more variation there will bebetween the requested datetime and the archival datetimesof available Mementos. As a matter of fact, when not beingsensible about the selection of Mementos, the resulting pagemay never actually have existed. A second challenge relatesto multi dimensional conneg that involves the datetime di-mension. Current conneg algorithms15 deal with variantselection in the dimensions specified in RFC 2295. Thesewould need to be revised to include the datetime dimension:if a client requests an HTML Memento for a specific date-time, but only a pdf is available, what should the archivalserver do? Research is required to explore both problems.

6. RELATED WORKThe goal of adding a temporal aspect to web navigation

has been explored in projects that focus on user interfaceenhancement. The Zoetrope project [1] provides a rich in-terface for querying and interacting with a set of archivedversions of selected seed pages. The interface leverages a lo-cal archive that is assembled by frequently polling those seedpages. The Past Web Browser [16] provides a simpler level ofinteraction with changing pages, but it is restricted to nav-igating existing web archives such as the Internet Archive.And DiffIE is a plug-in for Internet Explorer that empha-sizes web content that changed since a user’s previous visitby leveraging a dedicated client cache [32]. None of theseprojects propose protocol enhancements but rather use ad-hoc techniques to achieve their goals. All could benefit fromDT-conneg as a standard mechanism for accessing prior rep-resentations of resources.

Some projects have dealt with the problem of disappearedweb pages and finding archived or replacement copies onthe Web. The use of lexical signatures as search enginequery terms was proposed as a way to find content that hadmoved from its original URI [28, 29]. This approach was

15See, http://httpd.apache.org/docs/2.2/content-negotiation.html

later applied to search for content in web archives [13, 17].Also, when a “HTTP 404 Not Found” occurs, the ErrorZillaFireFox plug-in16 presents a user with a search page allowingher to find disappeared pages in web archives, and the UKNational Archive’s server plug-in redirects the client to anarchive of its choice. As suggested (Section 5), in the DT-conneg framework a client could intelligently react to 404s,and when doing so leverage available re-finding approaches.

To the best of our knowledge, very little research has ex-plored a protocol-based solution to augment the Web withtime travel capabilities. TTApache [8] introduced a modi-fied version of Apache that stored archived representationsin a local transactional archive (similar to the configura-tion illustrated in Figure 1). Ad-hoc RPC-style mechanismswere used to access archived representations given the URIof their original, e.g. “page.html?02-Nov-2009” and“page.html?now”. This approach reveals the local scopeof the problem addressed by TTApache, as opposed to theglobal perspective taken by the proposed DT-conneg frame-work. Indeed, the query components are issued against aspecific server, and are not maintained when a client movesto another server as is the case with the X-Accept-Datetimeheader of DT-conneg. TTApache also allowed addressingarchived representations using version numbers in querycomponents rather than datetimes. This capability is sim-ilar to the deprecated “Content-Version” header field fromRFC 2068 [10] and other, similar expired proposals (e.g.,[27]). Such versioning features have not found wide-spreadadoption, presumably because their address space is tied toa specific resource or server, and not universal like the date-time of DT-conneg.

7. CONCLUSIONSIn Web Archiving [22], Julien Masanes expresses a vision

of a global grid of web archives realized by interconnectingexisting and future ones:

Such a grid should link Web archives so that theytogether form one global navigation space likethe live Web itself. This is only possible if theyare structured in a way close enough to the orig-inal Web and if they are openly accessible.

We could not agree more, and feel that our Memento solu-tion presents a significant step towards achieving this vision.But our approach reaches beyond it. Indeed, the navigationspace that results from our proposal is not “like the live Webitself”, it is the Web itself, as regular navigation and timetravel are integrated. Also, it does not restrict the globalarchival grid to web archives but incorporates servers (suchas CMS) on the live Web that host archival content. TheMemento solution is capable of realizing this, and does notdisrupt firmly established HTTP practice. Rather, it addsto it an orthogonal time dimension. Moreover, the Mementosolution does not disrupt existing web archives or their es-tablished operating principles, but leverages both by tightlyintegrating them into the web. Time travel can be ours.

8. ACKNOWLEDGMENTSThis work sponsored in part by the Library of Congress.

16https://addons.mozilla.org/en-US/firefox/addon/3336

9. REFERENCES[1] E. Adar, M. Dontcheva, J. Fogarty, and D. S. Weld.

Zoetrope: interacting with the ephemeral web. InUIST ’08: Proceedings of the 21st annual ACMsymposium on User interface software and technology,pages 239–248, 2008.

[2] E. Adar, J. Teevan, and S. T. Dumais. Resonance onthe web: web dynamics and revisitation patterns. InCHI ’09: Proceedings of the 27th internationalconference on Human factors in computing systems,pages 1381–1390, 2009.

[3] E. Adar, J. Teevan, S. T. Dumais, and J. L. Elsas.The web changes everything: understanding thedynamics of web content. In WSDM ’09: Proceedingsof the Second ACM International Conference on WebSearch and Data Mining, pages 282–291, 2009.

[4] A. Anand, S. Bedathur, K. Berberich, R. Schenkel,and C. Tryfonopoulos. Everlast: a distributedarchitecture for preserving the web. In JCDL ’09:Proceedings of the 9th ACM/IEEE-CS joint conferenceon Digital libraries, pages 331–340, 2009.

[5] C. Bizer, R. Cyganiak, and T. Heath. How to publishlinked data on the web, 2007. http://sites.wiwiss.fu-berlin.de/bizer/pub/LinkedDataTutorial/.

[6] A. Brown. Archiving Websites: A practical guide forinformation management professionals. FacetPublishing, 2006.

[7] B. F. Cooper and H. Garcia-Molina. Infomonitor:Unobtrusively archiving a World Wide Web server.International Journal on Digital Libraries,5(2):106–119, April 2005.

[8] C. E. Dyreson, H. Lin, and Y. Wang. Managingversions of web documents in a transaction-time webserver. In WWW ’04: Proceedings of the 13thinternational conference on World Wide Web, pages422–432, 2004.

[9] D. Fetterly, M. Manasse, M. Najork, and J. Wiener. Alarge-scale study of the evolution of web pages. InWWW ’03: Proceedings of the 12th internationalconference on World Wide Web, pages 669–678, 2003.

[10] R. Fielding, J. Gettys, J. Mogul, H. Frystyk, andT. Berners-Lee. Hypertex transfer protocol –HTTP/1.1, Internet RFC-2068, 1997.

[11] K. Fitch. Web site archiving: an approach torecording every materially different response producedby a Website. In 9th Australasian World Wide WebConference, Sanctuary Cove, Queensland, Australia,July, pages 5–9, 2003.

[12] Google, Microsoft, and Yahoo. Sitemaps XML format,2008. http://www.sitemaps.org/protocol.php.

[13] T. L. Harrison and M. L. Nelson. Just-in-timerecovery of missing web pages. In HYPERTEXT ’06:Proceedings of the Seventeenth ACM Conference onHypertext and Hypermedia, pages 145–156, 2006.

[14] K. Holtman and A. Mutz. Transparent contentnegotiation in HTTP, Internet RFC-2295, 1998.

[15] I. Jacobs and N. Walsh. Architecture of the worldwide web, volume one. Technical Report W3CRecommendation 15 December 2004, W3C, 2004.

[16] A. Jatowt, Y. Kawai, S. Nakamura, Y. Kidawara, andK. Tanaka. Journey to the past: proposal of aframework for past web browser. In HYPERTEXT

’06: Proceedings of the seventeenth conference onHypertext and hypermedia, pages 135–144, 2006.

[17] M. Klein and M. L. Nelson. Revisiting lexicalsignatures to (re-)discover web pages. In ECDL ’08:Proceedings of the 12th European Conference onResearch and Advanced Technology for DigitalLibraries, pages 371 – 382, 2008.

[18] W. Koehler. Web page change and persistence — afour-year longitudinal study. Journal of the AmericanSociety for Information Science and Technology,53(2):162–171, 2002.

[19] C. Lagoze and H. Van de Sompel. The Open ArchivesInitiative: building a low-barrier interoperabilityframework. In JCDL ’01: Proceedings of the 1stACM/IEEE-CS Joint Conference on Digital Libraries,pages 54–62, 2001.

[20] J. Leskovec, L. Backstrom, and J. Kleinberg.Meme-tracking and the dynamics of the news cycle. InKDD ’09: Proceedings of the 15th ACM SIGKDDinternational conference on Knowledge discovery anddata mining, pages 497–506, 2009.

[21] R. Lewis. Dereferencing HTTP URIs. TechnicalReport Draft Tag Finding 04 October 2007, 2007.

[22] J. Masanes. Web Archiving. Springer-Verlag, 2006.

[23] M. Nottingham. Web linking, Internet Draftdraft-nottinghamgm-http-link-header-06, 2009.

[24] M. Nottingham and R. Sayre. The Atom syndicationformat, Internet RFC-4287, 2005.

[25] A. Ntoulas, J. Cho, and C. Olston. What’s new on theweb?: the evolution of the web from a search engineperspective. In WWW ’04: Proceedings of the 13thinternational Conference on World Wide Web, pages1–12, 2004.

[26] H. W. E. H. Obendorf, Hartmut and M. Mayer. Webpage revisitation revisited: Implications of a long-termclick-stream study of browser usage. In CHI ’07:Proceedings of the 25th international conference onHuman factors in computing systems, pages 597–606,2007.

[27] K. Ota, K. Takahashi, and K. Sekiya. Versionmanagement with meta-level links via HTTP/1.1,Internet Draft draft-ntt-http-version-00, 1996.

[28] S.-T. Park, D. M. Pennock, C. L. Giles, andR. Krovetz. Analysis of lexical signatures forimproving information persistence on the World WideWeb. ACM Transactions on Information Systems,22(4):540–572, 2004.

[29] T. A. Phelps and R. Wilensky. Robust hyperlinks costjust five words each. Technical ReportUCB/CSD-00-1091, EECS Department, University ofCalifornia, Berkeley, 2000.

[30] H. C. Rao, Y. Chen, and M. Chen. A proxy-basedpersonal web archiving service. SIGOPS OperatingSystems Review, 35(1):61–72, 2001.

[31] L. Tauscher and S. Greenberg. How people revisit webpages: Empirical findings and implications for thedesign of history systems. International Journal ofHuman-Computer Studies, 47(1), 1997.

[32] J. Teevan, S. T. Dumais, D. J. Liebling, and R. L.Hughes. Changing how people view changes on theweb. In UIST ’09: Proceedings of the 22nd annual

ACM symposium on User interface software andtechnology, pages 237–246, 2009.

[33] R. Troncy, J. Jansen, Y. Lafon, E. Mannens,S. Pfeiffer, and D. V. Deursen. Use cases andrequirements for media fragments, W3C WorkingDraft 30 April 2009, 2009.

[34] H. Van de Sompel, C. Lagoze, M. L. Nelson,S. Warner, R. Sanderson, and P. Johnston. Addingescience assets to the data web. In Proceedings of theLinked Data on the Web Workshop (LDOW 2009),2009.

Memento: Time Travel for the Web - arXiv · Memento: Time Travel for the Web Herbert Van de Sompel...

Documents

Transcript of Memento: Time Travel for the Web - arXiv · Memento: Time Travel for the Web Herbert Van de Sompel...