Information persistence on the Web Judit Bar-Ilan The Hebrew University and Bar-Ilan University and...



Information persistence on the Web

Judit Bar-Ilan
The Hebrew University and Bar-Ilan University
and
Bluma Peritz
The Hebrew University

Web documents
  • They are not like printed/written material
      • Printed/written material, if preserved, lasts “forever”, e.g. the Code of Hammurabi
  • They are not like unrecorded phone calls that disappear in the air

Web documents
  • Can exist only for a limited amount of time
  • Can be removed, or moved to a different location
  • Can undergo changes
      • CNN’s main page is updated approx. every 15 minutes
      • The program page for this conference
  • Can be temporarily inaccessible
      • Communication/server problems
The Web is dynamic

The Web
  • On the one hand, it grows continuously
  • On the other hand, it changes constantly: not only are new documents added to it, but
      • Existing documents are removed
      • Existing documents undergo changes in
          • content
          • format
          • linkage

Question: How do documents on the Web evolve?
  • News pages change very frequently
  • How about more “academic” topics?
  • As a case study, we analyzed the changes occurring to a set of pages containing the search terms informetric or informetrics over a period of five years
  • Almost no other such long-range studies exist
      • Koehler (JASIST, 2002): a “random”, fixed set of Web pages monitored weekly for a period of four years

Data collection
  • First data collection point (June 1998)
  • Data discovery through submission of the query to the largest search engines existing at the time
      • AltaVista, Excite, HotBot, InfoSeek, Lycos and NorthernLight (for exhaustiveness)
  • All results retrieved, collated list of URLs created (941 URLs)
  • URLs downloaded as soon as possible after searching
  • Content of URLs checked for presence of search terms (866 URLs, 91.9%)
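The collation and checking steps on this slide can be sketched as follows. This is a minimal illustration and not the authors' tooling: the function names, the exact-string de-duplication, the case-insensitive substring test, and the sample data are all assumptions.

```python
def collate_urls(result_lists):
    """Merge per-engine result lists into one sorted list of unique URLs."""
    unique = set()
    for urls in result_lists.values():
        unique.update(url.strip() for url in urls)
    return sorted(unique)


def satisfies_query(page_text, terms=("informetric", "informetrics")):
    """Return True if the downloaded text contains any of the search terms.

    A plain substring test is assumed here; note that it makes the second
    term redundant, since "informetrics" contains "informetric".
    """
    lowered = page_text.lower()
    return any(term in lowered for term in terms)


# Hypothetical search results and page contents, for illustration only.
results = {
    "AltaVista": ["http://a.example/1", "http://b.example/2"],
    "HotBot": ["http://b.example/2", "http://c.example/3"],
}
pages = {
    "http://a.example/1": "An Informetrics reading list",
    "http://b.example/2": "Page about something else",
    "http://c.example/3": "Informetric laws and distributions",
}

urls = collate_urls(results)
matching = [url for url in urls if satisfies_query(pages.get(url, ""))]
print(len(urls), len(matching))
```

Checking the downloaded content, rather than trusting the search engines, is what separates the collated list from the set of pages that actually satisfy the query.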

Data collection (cont.)
  • Consecutive data collection points: June 1999, 2002 and 2003
  • Search engines were queried as before
      • Set of search engines in 2002 & 2003: AltaVista, Fast, Google, HotBot, Teoma and Wisenut
  • List of URLs, pages downloaded
  • Previously identified URLs that were not retrieved by the search engines in the current round were revisited and their contents downloaded
  • This method allowed us to monitor previously discovered URLs, while adding new (or newly discovered) URLs to the set
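The bookkeeping behind this method amounts to a few set operations. The sketch below assumes URLs are compared as exact strings; the function name and the sample URLs are hypothetical.

```python
def plan_round(known_urls, retrieved_urls):
    """Split one data collection round into revisits and new discoveries."""
    known = set(known_urls)
    retrieved = set(retrieved_urls)
    return {
        # Known from earlier rounds but missed by the engines this time:
        # these are revisited directly.
        "revisit": sorted(known - retrieved),
        # Returned by the engines for the first time.
        "new": sorted(retrieved - known),
        # Everything monitored from this round onwards.
        "tracked": sorted(known | retrieved),
    }


# Hypothetical URLs, for illustration only.
plan = plan_round(
    known_urls=["http://a.example/1", "http://b.example/2"],
    retrieved_urls=["http://b.example/2", "http://c.example/3"],
)
print(plan["revisit"], plan["new"])
```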

The observed growth rate during the study period

[Chart: number of URLs per year (vertical axis 0 to 6000), observed counts against constant-growth reference curves]

Year   all_types   45% growth rate   html_text   40% growth rate
1998   866         866               866         866
1999   1297        1255.7            1297        1212.4
2000   -           1820.765          -           1697.36
2001   -           2640.10925        -           2376.304
2002   4075        3828.15841        3746        3326.8256
2003   5155        5550.8297         4389        4657.55584
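The reference curves in the chart are plain compound growth from the 1998 count. The sketch below reproduces them and also derives the average annual rate implied by the observed 1998 and 2003 counts. The function names are assumptions, and the implied rates are a derived illustration, not figures quoted on the slides; they come out somewhat below the 45% and 40% reference rates.

```python
def project(start, rate, years):
    """Compound growth: value after `years` at a constant annual `rate`."""
    return start * (1 + rate) ** years


def annual_growth_rate(first, last, years):
    """Constant annual rate that turns `first` into `last` over `years`."""
    return (last / first) ** (1 / years) - 1


# Reference curves, 1998 (year 0) through 2003 (year 5).
curve_45 = [project(866, 0.45, n) for n in range(6)]
curve_40 = [project(866, 0.40, n) for n in range(6)]

# Rates implied by the observed counts in 1998 and 2003.
implied_all_types = annual_growth_rate(866, 5155, 5)
implied_html_text = annual_growth_rate(866, 4389, 5)

print([round(v, 1) for v in curve_45])
print(f"{implied_all_types:.1%} {implied_html_text:.1%}")
```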

Not only growth …
  • Until and including 2002, 5034 URLs were discovered
  • In June 2003 only 2850 were still available and satisfied the query
  • Thus 37.5% of the URLs (1890 URLs) disappeared or ceased to satisfy the query (topic shift)
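As a quick check of the arithmetic, the 37.5% figure is the 1890 lost URLs taken as a share of the 5034 discovered through 2002. The variable names below are assumptions. The last line shows that the two reported counts do not add up to 5034; the slides do not say how the remaining URLs were classified.

```python
discovered_through_2002 = 5034
still_matching_2003 = 2850
reported_lost = 1890

# Share of all URLs discovered through 2002 that were lost by June 2003.
lost_share = reported_lost / discovered_through_2002

# URLs covered by neither of the two reported counts.
unaccounted = discovered_through_2002 - still_matching_2003 - reported_lost

print(f"{lost_share:.1%}", unaccounted)
```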

… also modifications
  • Out of the URLs satisfying the query at two consecutive data points, about 50% have undergone some kind of modification
      • Text of the source files compared
  • Stable dataset compared to random sets
      • e.g. in Koehler’s random set of 361 URLs, only for 3% no changes were observed
  • Unstable set compared to digital libraries
      • e.g. PubMedCentral, arXiv, CiteSeer – only 3% of the sample disappeared during the one-year period of observation
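The slide says only that the text of the source files was compared. One way to sketch such a comparison is to fingerprint each page's source at two data points and label every URL accordingly. The hashing approach, the function names, and the sample data are assumptions, not the authors' stated procedure.

```python
import hashlib


def fingerprint(text):
    """SHA-256 digest of a page's source text."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def classify(earlier, later):
    """Label each URL from the earlier data point.

    `earlier` and `later` map URL -> source text at two data points.
    Any difference in the text, however small, counts as a modification.
    """
    status = {}
    for url, old_text in earlier.items():
        if url not in later:
            status[url] = "disappeared"
        elif fingerprint(old_text) == fingerprint(later[url]):
            status[url] = "unchanged"
        else:
            status[url] = "modified"
    return status


# Hypothetical snapshots, for illustration only.
snapshot_a = {"u1": "<p>informetrics</p>", "u2": "<p>old</p>", "u3": "<p>x</p>"}
snapshot_b = {"u1": "<p>informetrics</p>", "u2": "<p>new</p>"}
status = classify(snapshot_a, snapshot_b)
print(status)
```

An exact comparison like this treats a changed date stamp the same as a rewritten page; distinguishing content, format and linkage changes would need a finer comparison.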

What is the value of such elusive information???
  • Dellavalle et al., Science 2003: Going, going, gone: Lost Internet References
  • Bar-Ilan & Peritz, JASIST (to appear): Evolution, Continuity and Disappearance of Documents on a Specific Topic on the Web - A Longitudinal Study of “Informetrics”
      http://shum.huji.ac.il/~judit/evolution/barilan_peritz_JASIST_notice.pdf
  • Internet Archive – saves “snapshots” of the Web at different points in time
      • Wayback Machine: http://archive.org

Aug 26, 2000

March 31, 2001

May 28, 2002

April 25, 2003
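Assuming the four dates above are Wayback Machine snapshot dates (the transcript does not say so), the sketch below builds the kind of URL under which the Wayback Machine serves an archived copy near a given date. The helper name and the page address are hypothetical, and the `web.archive.org/web/<timestamp>/<url>` pattern is used here as commonly observed, not as a documented guarantee.

```python
from datetime import date


def wayback_url(page_url, day):
    """Build a Wayback Machine address for `page_url` around `day`."""
    stamp = day.strftime("%Y%m%d")
    return f"https://web.archive.org/web/{stamp}/{page_url}"


# The four dates listed above; the page address is a placeholder.
snapshot_dates = [
    date(2000, 8, 26),
    date(2001, 3, 31),
    date(2002, 5, 28),
    date(2003, 4, 25),
]
for day in snapshot_dates:
    print(wayback_url("http://example.org/", day))
```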