Information persistence on the Web Judit Bar-Ilan The Hebrew University and Bar-Ilan University and...
Date posted: 22-Dec-2015
Information persistence on the Web
  Judit Bar-Ilan, The Hebrew University and Bar-Ilan University
  and
  Bluma Peritz, The Hebrew University

Web documents
  They are not like printed/written material
    If preserved, they last “forever”, e.g. the Code of Hammurabi
  They are not like unrecorded phone calls that disappear in the air

Web documents
  Can exist only for a limited amount of time
  Can be removed, or moved to a different location
  Can undergo changes
    CNN’s main page is updated approx. every 15 minutes
    The program page for this conference
  Can be temporarily inaccessible
    Communication/server problems
  The Web is dynamic

The Web
  On the one hand, it grows continuously
  On the other hand, it changes constantly: not only are new documents added to it, but
    Existing documents are removed
    Existing documents undergo changes in
      content
      format
      linkage

Question: How do documents on the Web evolve?
  News pages change very frequently
  How about more “academic” topics?
  As a case study, we analyzed the changes occurring to a set of pages containing the search terms informetric or informetrics over a period of five years
  Almost no other such long-range studies
    Koehler (JASIST, 2002): a “random”, fixed set of Web pages monitored weekly for a period of four years

Data collection
  First data collection point (June 1998)
  Data discovery through submission of the query to the largest search engines existing at the time
    AltaVista, Excite, HotBot, InfoSeek, Lycos and NorthernLight (for exhaustiveness)
  All results retrieved, collated list of URLs created (941 URLs)
  URLs downloaded as soon as possible after searching
  Content of URLs checked for presence of search terms (866 URLs, 91.9%)

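
The collation and checking steps on this slide can be sketched roughly as follows. This is a minimal illustration, not the authors' tooling: the function names, the toy engine names and the example URLs are all hypothetical, and term presence is assumed to be a case-insensitive substring match.

```python
def collate_urls(results_by_engine):
    """Merge per-engine result lists into one de-duplicated, sorted URL list."""
    collated = set()
    for urls in results_by_engine.values():
        collated.update(u.strip() for u in urls if u.strip())
    return sorted(collated)


def contains_search_terms(page_text, terms=("informetric", "informetrics")):
    """Return True if any search term occurs in the text (case-insensitive substring)."""
    lowered = page_text.lower()
    return any(term in lowered for term in terms)


def filter_relevant(pages, terms=("informetric", "informetrics")):
    """Keep only the URLs whose downloaded content contains a search term."""
    return [url for url, text in pages.items() if contains_search_terms(text, terms)]


# Hypothetical toy data, not taken from the study.
results = {
    "engine_a": ["http://example.org/a", "http://example.org/b"],
    "engine_b": ["http://example.org/b", "http://example.org/c"],
}
urls = collate_urls(results)

pages = {
    "http://example.org/a": "An introduction to Informetrics",
    "http://example.org/b": "Unrelated page",
    "http://example.org/c": "informetric indicators",
}
relevant = filter_relevant(pages)
print(len(urls), "collated URLs,", len(relevant), "containing a search term")
```

Collating into a set before downloading means a page returned by several engines is fetched and checked only once.
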
Data collection (cont.)
  Consecutive data collection points: June 1999, 2002 and 2003
  Search engines were queried as before
    Set of search engines in 2002 & 2003: AltaVista, Fast, Google, HotBot, Teoma and Wisenut
  List of URLs, pages downloaded
  Previously identified URLs that were not retrieved by the search engines in the current round were revisited and their contents downloaded
  This method allowed us to monitor previously discovered URLs, while adding new (or newly discovered) URLs to the set

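
One way to picture the bookkeeping for each later collection round is with plain set operations. The sketch below is an assumption about how such a round could be organized, with a hypothetical function name and toy URLs. It is not a description of the software used in the study.

```python
def plan_collection_round(known_urls, retrieved_urls):
    """Split one collection round into URLs to revisit, new URLs, and the full set.

    known_urls: URLs identified at earlier data collection points.
    retrieved_urls: URLs returned by the search engines in the current round.
    """
    known = set(known_urls)
    retrieved = set(retrieved_urls)
    to_revisit = known - retrieved        # previously identified, not retrieved now
    newly_discovered = retrieved - known  # new (or newly discovered) URLs
    monitored = known | retrieved         # everything tracked from now on
    return to_revisit, newly_discovered, monitored


# Hypothetical toy data, not taken from the study.
known = ["http://example.org/a", "http://example.org/b"]
retrieved = ["http://example.org/b", "http://example.org/c"]
to_revisit, newly_discovered, monitored = plan_collection_round(known, retrieved)
print(sorted(to_revisit), sorted(newly_discovered), len(monitored))
```
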
The observed growth rate during the study period
  [Chart: number of URLs (axis 0 to 6000) per year; observed counts at the four data collection points compared with constant 45% and 40% yearly growth curves starting from 866]

  Series            1998   1999     2000       2001         2002         2003
  all_types          866   1297        -          -         4075         5155
  45% growth rate    866   1255.7   1820.765   2640.10925   3828.15841   5550.8297
  html_text          866   1297        -          -         3746         4389
  40% growth rate    866   1212.4   1697.36    2376.304     3326.8256    4657.55584

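
The two reference curves in the chart are plain compound growth from the 1998 count of 866. A short sketch reproduces them. The helper names are hypothetical, and the implied-rate calculation is an added illustration that assumes five yearly steps between June 1998 and June 2003.

```python
def compound_growth(start, rate, years):
    """Return [start, start*(1+rate), ..., start*(1+rate)**years]."""
    return [start * (1 + rate) ** n for n in range(years + 1)]


def implied_annual_rate(start, end, years):
    """Constant yearly growth rate that takes `start` to `end` in `years` steps."""
    return (end / start) ** (1 / years) - 1


curve_45 = compound_growth(866, 0.45, 5)  # reference curve for all_types
curve_40 = compound_growth(866, 0.40, 5)  # reference curve for html_text

# Observed counts from the slide: 866 (1998) to 5155 (2003) for all_types.
rate_all_types = implied_annual_rate(866, 5155, 5)

print([round(v, 1) for v in curve_45])
print([round(v, 1) for v in curve_40])
print(f"implied yearly growth, all_types: {rate_all_types:.1%}")
```
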
Not only growth …
  Until and including 2002, 5034 URLs were discovered
  In June 2003 only 2850 were still available and satisfied the query
  Thus 37.5% of the URLs (1890 URLs) disappeared or ceased to satisfy the query (topic shift)

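
The percentage on this slide can be checked with a few lines of arithmetic. The variable names are hypothetical. The slide does not say how the URLs that fall into neither group are classified, so the sketch only reports their number.

```python
discovered = 5034              # URLs discovered until and including 2002
still_satisfying = 2850        # available and satisfying the query in June 2003
disappeared_or_shifted = 1890  # disappeared or ceased to satisfy the query

share_lost = disappeared_or_shifted / discovered
share_kept = still_satisfying / discovered
# URLs in neither group; their status is not stated on the slide.
unaccounted = discovered - still_satisfying - disappeared_or_shifted

print(f"lost or shifted: {share_lost:.1%}")
print(f"still satisfying the query: {share_kept:.1%}")
print(f"not covered by either figure: {unaccounted} URLs")
```
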
… also modifications
  Out of the URLs satisfying the query at two consecutive data points, about 50% have undergone some kind of modification
    Text of the source files compared
  Stable dataset compared to random sets
    e.g. in Koehler’s random set of 361 URLs, no changes were observed for only 3%
  Unstable set compared to digital libraries
    e.g. PubMedCentral, arXiv, CiteSeer: only 3% of the sample disappeared during the one-year period of observation

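
A comparison of source text between two data points could look roughly like the sketch below. It is an illustration under stated assumptions: the function names are hypothetical, whitespace is normalized before comparison (the slide does not say whether the study did this), and any remaining difference counts as a modification.

```python
import hashlib


def fingerprint(source_text):
    """Hash the source text after collapsing runs of whitespace (an assumption)."""
    normalized = " ".join(source_text.split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def modified_share(snapshot_then, snapshot_now):
    """Share of URLs present in both snapshots whose source text differs."""
    common = snapshot_then.keys() & snapshot_now.keys()
    if not common:
        return 0.0
    changed = sum(
        1
        for url in common
        if fingerprint(snapshot_then[url]) != fingerprint(snapshot_now[url])
    )
    return changed / len(common)


# Hypothetical toy snapshots, not taken from the study.
then = {"http://example.org/a": "<p>old text</p>", "http://example.org/b": "<p>same</p>"}
now = {"http://example.org/a": "<p>new text</p>", "http://example.org/b": "<p>same</p>  "}
share = modified_share(then, now)
print(f"modified: {share:.0%}")
```

Hashing keeps only a short fingerprint per page, which is enough to detect that a change happened but not what kind of change it was (content, format or linkage).
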
What is the value of such elusive information???
  Dellavalle et al., Science 2003: Going, going, gone: Lost Internet References
  Bar-Ilan & Peritz, JASIST (to appear): Evolution, Continuity and Disappearance of Documents on a Specific Topic on the Web - A Longitudinal Study of “Informetrics”
    http://shum.huji.ac.il/~judit/evolution/barilan_peritz_JASIST_notice.pdf
  Internet Archive: saves “snapshots” of the Web at different points in time
    Wayback Machine: http://archive.org