Web archiving challenges and opportunities

WEB ARCHIVING CHALLENGES & OPPORTUNITIES
PRESENTATION FOR WEB ARCHIVING ENGINEERING POSITION

Ahmed AlSum, PhD Candidate

Old Dominion University

Outline
• Engineering Experience
  • IBM
  • Old Dominion University
  • Internet Archive
• Web Archiving Challenges & Opportunities
  • Selection
  • Harvesting
  • Storage
  • Access
  • Community
• Conclusions

Cairo, Egypt, 2006 - 2009

CCSP Project
• An internal IBM support portal that provides client-facing audiences a by-client, holistic view of client situations
• Technologies: WebSphere Portal, DB2, deployed on zLinux machines

Responsibilities
• Software Engineer
  • Enterprise applications with J2EE platform technologies for front-end (Servlets, JSP, Portlet APIs) and back-end tasks based on EJB
  • Front-end components based on Web 2.0 technologies (AJAX based on Dojo 1.0, and JavaScript)
  • Lotus Sametime (plugin and bot development)
• Software engineer team leader
  • Support project quality activities
  • Lead code review and static analysis activities

Responsibilities
• Administrator
  • Deploying Portal solutions on WebSphere Portal
  • WebSphere Portal administration for standalone and clustered environments
  • Administration on Linux and Windows OS
  • DB2 server administration for single instance and multiple instances with HADR support
• Customer support team lead
  • Leading customer support activities

Certifications

Sharing IBM Internal Solutions with Broader Community

Norfolk, VA, USA, 2009 - 2013

Memento
• Memento is an HTTP extension to integrate the Past and the Current Web

I. Jacobs and N. Walsh, Architecture of the World Wide Web, Technical report, W3C, 2004. http://www.w3.org/TR/webarch/

[Figure: Memento timeline with the points Now, T1, T2, and T3]

• Developer and administrator for Memento aggregator and proxies

Memento Clients

• Memento is currently an Internet-Draft (I-D) and is expected to move to RFC soon.
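As a rough illustration of how a client talks to a Memento-aware server, the sketch below formats the Accept-Datetime request header and parses a Link response header into its relations. The sample header, the archive.example.org host, and the helper names are assumptions made up for this example; only the header names and the rel values (original, memento) come from the Memento draft.

```python
import re
from datetime import datetime, timezone
from email.utils import format_datetime

# One <uri> followed by its ;name="value" parameters.
LINK_RE = re.compile(r'<([^>]*)>((?:\s*;\s*[\w-]+="[^"]*")*)')
PARAM_RE = re.compile(r'([\w-]+)="([^"]*)"')


def accept_datetime(dt):
    """Format an aware datetime as the HTTP-date sent in Accept-Datetime."""
    return format_datetime(dt.astimezone(timezone.utc), usegmt=True)


def parse_links(header):
    """Split a Link header into dicts with uri, rel list, and datetime."""
    links = []
    for uri, params in LINK_RE.findall(header):
        attrs = dict(PARAM_RE.findall(params))
        links.append({
            "uri": uri,
            "rel": attrs.get("rel", "").split(),
            "datetime": attrs.get("datetime"),
        })
    return links


# Hypothetical response header, for illustration only.
SAMPLE_LINK = (
    '<http://www.vancouver2010.com/>; rel="original", '
    '<http://archive.example.org/web/20100212000000/http://www.vancouver2010.com/>; '
    'rel="first memento"; datetime="Fri, 12 Feb 2010 00:00:00 GMT"'
)

request_headers = {
    "Accept-Datetime": accept_datetime(datetime(2010, 2, 23, tzinfo=timezone.utc))
}
links = parse_links(SAMPLE_LINK)
mementos = [link for link in links if "memento" in link["rel"]]
```

A real client would send request_headers to a TimeGate and follow the returned memento URI; the network step is left out here so the sketch stays self-contained.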

San Francisco, CA, USA, 2012

WAT Extraction
• Web Archive Transformation (WAT) is a specification for structuring metadata generated by Web crawls
• Technologies:

WEB ARCHIVING

Challenges and Opportunities

Web Archive Life Cycle

Hockx-Yu, H., 2011. The Past Issue of the Web. In Proceedings of the 3rd International Conference on Web Science, pp. 1–8.

Selection
• Decide what to capture
  • Everything, any domain
  • National domains
  • Delegate selection to partners
  • Users’ favorites
• We studied what is already captured

How Much Of The Web Is Archived?

S. G. Ainsworth, A. AlSum, H. SalahEldeen, M. C. Weigle, and M. L. Nelson

In Proceedings of the 11th annual international ACM/IEEE joint conference on Digital libraries, JCDL '11, Ottawa, Canada 2011

See also: http://arxiv.org/abs/1212.6177


Archive categories

We have 3 categories of archives:
• Internet Archive (classic interface)
• Search engine
• Other archives

[Figure: public archives, ca. late 2010 / early 2011; 1000 URIs ordered by first observation date; labels: UK, US]

See also: http://ws-dl.blogspot.com/2011/06/2011-06-23-how-much-of-web-is-archived.html

Memento Distribution, ordered by the first observation date

How Much of the Web is Archived? It Depends on Which Web…

Including SE cache    Excluding SE cache
90%                   79%
97%                   68%
88%                   19%
35%                   16%

Changes since 2011: no more free SE APIs; greatly reduced IA quarantine period; 15 public web archives

2013: 95%, 92%, 23%, 26%

Profiling Web Archive Coverage For Top-level Domain And Content Language

A. AlSum, M. C. Weigle, M. L. Nelson, and H. Van de Sompel

In Proceedings of the 17th International Conference on Theory and Practice of Digital Libraries, TPDL 2013, 2013




Where is it archived?

IA: Internet Archive
CAN: Library and Archives Canada
PO: Portuguese Web Archive
CZ: Archive of the Czech Web
LoC: Library of Congress
BL: British Library
CAT: Web Archive of Catalonia
TW: National Taiwan University
IC: Icelandic Web Archive
UK: UK National Library
CR: Croatian Web Archive
AIT: Archive-It

Language Coverage





Growth Rate





Borrowed Portuguese material from IA

Stopped archiving since 2008

Steady growth

Stopped getting new URIs, but still crawling

Selection Research Output
• Some portions of the web, such as India and Africa, are not well archived.
• Profiling helps us in Memento query routing.
• IIPC proposal with Herbert Van de Sompel (LANL) and David Rosenthal (SUL).
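To make the query-routing idea concrete, here is a minimal sketch, assuming each archive profile is simply a map from top-level domain to a coverage score. The scores and the route_query helper are hypothetical and are not taken from the profiling paper; a real router would use the measured profiles.

```python
from urllib.parse import urlsplit

# Hypothetical coverage scores per archive and TLD (illustration only).
PROFILES = {
    "IA": {"com": 0.9, "uk": 0.8, "pt": 0.6},
    "BL": {"uk": 0.7},
    "PO": {"pt": 0.8, "com": 0.05},
}


def route_query(uri, profiles, top_k=2):
    """Return up to top_k archives, best coverage of the URI's TLD first."""
    host = urlsplit(uri).hostname or ""
    tld = host.rsplit(".", 1)[-1].lower()
    ranked = sorted(
        profiles, key=lambda name: profiles[name].get(tld, 0.0), reverse=True
    )
    # Skip archives whose profile shows no coverage for this TLD.
    return [name for name in ranked[:top_k] if profiles[name].get(tld, 0.0) > 0]


pt_targets = route_query("http://www.example.pt/", PROFILES)
uk_targets = route_query("http://www.example.co.uk/news", PROFILES)
```

Sending the request only to the archives returned here, rather than to every archive, is the saving that profiling is meant to provide.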


Selection at SUL
• Focus on the missing parts of the Web
• Twitter - Crowdsource:
  • UK Web Archive: Twittervane
  • Internet Memory: Collect URIs from Twitter APIs
  • VA Tech: CTRnet project
• Stanford Community
• World News collection: 10 news websites from each country
• Tools:

Harvesting
• Services
  • Archive-It
  • WAS @ CDLib
• Dedicated servers
• New tools

See also: http://ws-dl.blogspot.com/2013/07/2013-07-10-warcreate-and-wail-warc.html


Special Harvesting Techniques
• Borrow old materials from other web archives
• Ex.: Stanford WebBase Project*
  • 260 TB
  • 7 billion web pages

*http://www-diglib.stanford.edu/~testbed/doc2/WebBase/

Special Harvesting Techniques
• Social Media
  • Focus on shared resources in the social media

Hany M. SalahEldeen, Michael L. Nelson, Losing My Revolution: How Many Resources Shared on Social Media Have Been Lost?, Proceedings of TPDL 2012
http://ws-dl.blogspot.com/2012/02/2012-02-11-losing-my-revolution-year.html

Special Harvesting Techniques
• SiteStory - Transactional Archive

Justin F. Brunelle, Michael L. Nelson, Lyudmila Balakireva, Robert Sanderson, Herbert Van de Sompel, Evaluating the SiteStory Transactional Web Archive With the ApacheBench Tool, Proceedings of TPDL 2013
SiteStory: http://mementoweb.github.io/SiteStory/

Harvesting
• Challenges
  • Ajax and Web 2.0/3.0
  • Streaming Media
  • URI challenges
  • Mobile

http://blog.dshr.org/2012/05/harvesting-and-preserving-future-web.html
http://netpreserve.org/sites/default/files/resources/OverviewFutureWebWorkshop.pdf

Storage (Format)
• Flat files:
  • WARC files (ISO standard)
• NoSQL database:
  • HBase at Internet Memory*
• Storage at SUL:
  • We need to use both

*Philippe Rigaux, Understanding HBase: The data model, IM technology blog, http://internetmemory.org/en/index.php/synapse/understanding_the_hbase_data_model/
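For readers who have not seen the flat-file option up close, the following sketch builds a single WARC/1.0 record of type resource using only the standard library. It is a simplified illustration under stated assumptions: the function name is invented here, only a handful of headers are written, and production harvesters rely on dedicated WARC libraries and usually store full HTTP responses rather than bare payloads.

```python
import uuid
from datetime import datetime, timezone


def warc_resource_record(target_uri, payload, content_type, when=None):
    """Serialize one WARC/1.0 'resource' record as bytes."""
    when = (when or datetime.now(timezone.utc)).astimezone(timezone.utc)
    headers = [
        ("WARC-Type", "resource"),
        ("WARC-Record-ID", f"<urn:uuid:{uuid.uuid4()}>"),
        ("WARC-Date", when.strftime("%Y-%m-%dT%H:%M:%SZ")),
        ("WARC-Target-URI", target_uri),
        ("Content-Type", content_type),
        ("Content-Length", str(len(payload))),
    ]
    # Version line, header lines, then a blank line before the content block.
    head = "WARC/1.0\r\n" + "".join(f"{k}: {v}\r\n" for k, v in headers) + "\r\n"
    # Every record ends with two CRLF pairs after the content block.
    return head.encode("utf-8") + payload + b"\r\n\r\n"


record = warc_resource_record(
    "http://www.vancouver2010.com/",
    b"hello world",
    "text/plain",
    datetime(2010, 2, 23, tzinfo=timezone.utc),
)
```

Appending such records one after another yields a WARC file, which is why the format suits sequential crawl output while a NoSQL store suits random access.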

Storage (Infrastructure)• Wrong solution could be a disaster


Accessing Web Archive

URI-Based

Wayback Machine

• Textbox to enter the requested URI

• BubbleMap to show you the available mementos


Full-text search

• Challenges: Temporal Page Rank, Rank per site or memento, Date filtering

Accessing Web Archive
• Thumbnail View
  • Trade-off between building the thumbnail in real time or pre-building it. Also, trade-off between representing the thumbnail by URI or by embedded binary data. Can we build a partial thumbnail map?

Accessing Web Archive
• Title View
  • Trade-off between extracting all the titles and keeping them as metadata about the memento, and extracting the title from the HTML content in real time

Implemented using Simile: http://www.simile-widgets.org/timeline/

Accessing Web Archive
• Wayback Machine API
  • XML interface for the list of available Mementos

Accessing Web Archive
• Web Page Snapshot Replay
  • URI rewriting, JavaScript, and embedded resources
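The URI rewriting step can be sketched as follows, assuming a replay URI of the form prefix + 14-digit timestamp + "/" + original URI. The archive.example.org prefix, the sample page, and the rewrite_links helper are assumptions for illustration; this regex-based version only touches href and src attributes and does nothing for URIs built by JavaScript at run time, which is the harder part of replay.

```python
import re
from urllib.parse import urljoin

# Matches quoted href/src attribute values.
ATTR_RE = re.compile(
    r'(?P<attr>\b(?:href|src)\s*=\s*)(?P<q>["\'])(?P<uri>.*?)(?P=q)',
    re.IGNORECASE,
)


def rewrite_links(html, page_uri, timestamp, replay_prefix):
    """Point href/src values at the archive instead of the live Web."""
    def repl(match):
        absolute = urljoin(page_uri, match.group("uri"))
        if not absolute.startswith(("http://", "https://")):
            return match.group(0)  # leave mailto:, javascript:, etc. untouched
        quote = match.group("q")
        return f'{match.group("attr")}{quote}{replay_prefix}{timestamp}/{absolute}{quote}'

    return ATTR_RE.sub(repl, html)


SAMPLE_HTML = (
    '<a href="/schedule">Schedule</a>'
    '<img src="logo.png">'
    '<a href="mailto:info@example.org">Mail</a>'
)
rewritten = rewrite_links(
    SAMPLE_HTML,
    "http://www.vancouver2010.com/en/",
    "20100223000000",
    "http://archive.example.org/web/",
)
```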

Accessing Web Archive
• Page Completeness Degree
  • The completeness degree could be calculated in real time by using the preserved HTTP status for the embedded resources
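One simple way to turn this idea into a number is sketched below. The formula is an assumption for illustration (the slides do not define one): it treats every embedded resource equally and counts a resource as preserved only when its archived HTTP status is 2xx.

```python
def completeness_degree(embedded_statuses):
    """Fraction of embedded resources whose preserved HTTP status is 2xx.

    embedded_statuses maps each embedded resource URI to the HTTP status
    stored in the archive. A page with no embedded resources counts as complete.
    """
    if not embedded_statuses:
        return 1.0
    preserved = sum(1 for status in embedded_statuses.values() if 200 <= status < 300)
    return preserved / len(embedded_statuses)


# Hypothetical statuses for one memento's embedded resources.
sample_statuses = {
    "http://www.vancouver2010.com/style.css": 200,
    "http://www.vancouver2010.com/logo.png": 404,
    "http://www.vancouver2010.com/app.js": 200,
    "http://www.vancouver2010.com/banner.swf": 503,
}
degree = completeness_degree(sample_statuses)
```

A weighted variant, where a missing stylesheet costs more than a missing icon, would be a natural refinement.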




Accessing Web Archive
• Reconstructing web site
  • Current approach is using the web archive public interface.

Accessing Web Archive
• Wayback Annotator
  • Create collections
  • Select and save relevant content to their collections
  • Annotate & mark important parts of archived web pages
  • Share their work and collaborate on archived content use

http://netpreserve.org/sites/default/files/resources/Predstavitev_07.pdf
http://netpreserve.org/sites/default/files/resources/Wayback_annotator_06.pdf


Collection-Based

• In addition to browsing the collection, you can browse the URIs in this collection

• Research questions: Collection overview

Accessing Web Archive
• Collection visualization
  • Term frequency algorithms should be normalized to take the density of mementos into consideration

http://ws-dl.blogspot.com/2012/08/2012-08-10-ms-thesis-visualizing.html
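The normalization concern can be illustrated with a small sketch, assuming a collection is given as (URI, text) pairs, one per memento. Each URI contributes its average per-memento term count, so a page crawled hundreds of times does not drown out a page crawled once. The function name and sample data are hypothetical, and this is one possible normalization rather than the method used in the cited thesis.

```python
from collections import Counter, defaultdict


def density_normalized_tf(mementos):
    """Sum, over URIs, of each term's average count per memento of that URI."""
    term_counts = defaultdict(Counter)  # uri -> raw term counts over all its mementos
    memento_counts = Counter()          # uri -> number of mementos
    for uri, text in mementos:
        term_counts[uri].update(text.lower().split())
        memento_counts[uri] += 1

    weights = Counter()
    for uri, counts in term_counts.items():
        for term, n in counts.items():
            weights[term] += n / memento_counts[uri]
    return weights


# Hypothetical collection: URI "a" was captured three times, URI "b" once.
sample_collection = [
    ("a", "olympics olympics"),
    ("a", "olympics"),
    ("a", "olympics"),
    ("b", "hockey"),
]
tf = density_normalized_tf(sample_collection)
```

With raw counts, "olympics" would outweigh "hockey" four to one; after normalization the ratio drops to four thirds to one.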

Accessing Web Archive
• Web Archive analytics
  • ArcSpread took a query from the user, extracted related information, and displayed the results in spreadsheet style.

See also: http://ilpubs.stanford.edu:8090/1037/1/arcspread.pdf

Who And What Links To The Internet Archive

Y. Alnoamany, A. AlSum, M. C. Weigle, M. L. Nelson

In Proceedings of 17th International Conference on Theory and Practice of Digital Libraries, TPDL 2013, 2013

(Best Student Paper)

See also: http://arxiv.org/abs/1309.4016



Serving Robots!
• Log files analysis using Apache Pig
• Access to IA Wayback Machine as

Robots outnumber Humans
• 10:1 in terms of sessions
• 5:4 in terms of raw HTTP accesses
• 4:1 in terms of megabytes transferred

Access


Where do Wayback Machine Users Come From?

Website                      Percentage   Description
en.wikipedia.org             12.9%        Wikipedia
archive.org                  11.9%        IA Home Page
reddit.com                   10.2%        Social News Web Site
google.TLD                   9.9%         Search Engine
info-poland.buffalo.edu      1.5%         Polish Studies
de.wikipedia.org             1.4%         Wikipedia
cracked.com                  1.2%         Humor Site
snopes.com                   1.1%         Urban Legends Reference Pages
facebook.com                 0.9%         Social Media
crochetpatterncentral.com    0.9%         Crocheting Hobbies


Most Languages Self-Link


ArcLink: Optimization Techniques To Build And Retrieve The Temporal Web Graph

A. AlSum, M. L. Nelson

IIPC GA 2013, Ljubljana, Slovenia

In Proceedings of the 13th international ACM/IEEE joint conference on Digital libraries, JCDL '13, 2013




http://www.youtube.com/watch?v=ikrhYMaTNzQ

Easy Solved Questions

Q: What are the available mementos for vancouver2010.com?


Solved Questions, but hard

Q: What are the HTML titles for vancouver2010.com through time?

A: Page scraping for all mementos

Impossible Questions

Q: What anchor texts pointed to www.vancouver2010.com through time?

…<a href=www.vancouver2010.com >Vancouver Olympics</a>….

…<a href=www.vancouver2010.com >Winter Olympics</a>…

…<a href=www.vancouver2010.com >Vancouver 2010</a>…

http://www.vancouver2010.com/
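To show why this question needs a temporal web graph, the sketch below extracts anchor text from mementos of pages that link to the target. It is an illustration only: the capture timestamps and sample pages are hypothetical, the href must match the target string exactly, and it assumes the linking mementos are already in hand, which is the step a public archive interface cannot answer without an inlink index.

```python
from html.parser import HTMLParser


class AnchorTextParser(HTMLParser):
    """Collect the text of <a> elements whose href equals the target."""

    def __init__(self, target):
        super().__init__()
        self.target = target
        self.texts = []
        self._buffer = None  # list of text chunks while inside a matching <a>

    def handle_starttag(self, tag, attrs):
        if tag == "a" and dict(attrs).get("href") == self.target:
            self._buffer = []

    def handle_data(self, data):
        if self._buffer is not None:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._buffer is not None:
            self.texts.append("".join(self._buffer).strip())
            self._buffer = None


def anchor_texts_through_time(linking_mementos, target):
    """linking_mementos: (14-digit timestamp, html) pairs of pages linking to target."""
    results = []
    for timestamp, html in sorted(linking_mementos):
        parser = AnchorTextParser(target)
        parser.feed(html)
        parser.close()
        results.extend((timestamp, text) for text in parser.texts)
    return results


TARGET = "http://www.vancouver2010.com/"
sample_mementos = [
    ("20100301000000", f'<p><a href="{TARGET}">Vancouver 2010</a></p>'),
    ("20080115000000", f'<p><a href="{TARGET}">Vancouver Olympics</a></p>'),
    ("20090620000000", f'<p><a href="{TARGET}">Winter Olympics</a> <a href="/x">other</a></p>'),
]
timeline = anchor_texts_through_time(sample_mementos, TARGET)
```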



ArcLink


Google code: https://code.google.com/p/arcsys/


Impossible Questions
• Q: What anchor texts pointed to www.vancouver2010.com through time?

Thumbnail Summarization Techniques For Web Archives

A. AlSum and M. L. Nelson

Submitted for publication.

Thumbnails


Internet Archive UK Web archive

Thumbnail Creation Challenges
• Scalability in Time
  • IA may need 361 years to create one thumbnail per memento using one hundred machines
• Scalability in Space
  • IA will need 355 TB to store one thumbnail per memento
• Page quality

How many thumbnails do we need?


www.unfi.com on the live Web


40 Thumbnails are good.


Same technique applied to apple.com


From 8000 Mementos to 69 Thumbnails.


iTunes cover application


Community
• I suggest becoming a member of IIPC
  • Join the open Wayback Machine team
  • Join the Winter Olympics 2014 collaborative project, even as an observer

Congratulations

Community
• Web Archiving Workshops

WAC 2011, Ottawa, Canada

WAC 2012, Stanford, CA, USA

WADL 2013, Indianapolis, IN, USA

TempWeb 2013, Rio de Janeiro, Brazil

Tools to SUL Web Archive• Selection

• Harvest

• Analysis

• Access

Conclusions
• Be Selective: Cover missing parts of the Web
• Be Older: Include WebBase
• Be Smart: Innovative services
• Be Helpful: Researcher Framework/Dataset
• Be Active: Participate in the WA communities
• Make a difference

@aalsum

BACKUP

What is missing?


LoC: Library of Congress
BL: British Library
CAT: Web Archive of Catalonia
TW: National Taiwan University


Thumbnail Features

SimHash DOM tree

Embedded resources Datetime

Clustering technique
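As an illustration of how the SimHash feature could drive thumbnail selection, the sketch below fingerprints each memento's tokens and keeps a memento only when it differs from the last kept one by more than a chosen number of bits. This greedy threshold rule is a stand-in chosen for brevity, not the clustering technique from the paper; the MD5-based token hash, the threshold, and the sample fingerprints are all assumptions.

```python
import hashlib


def simhash(tokens, bits=64):
    """Compute a SimHash fingerprint from an iterable of string tokens."""
    votes = [0] * bits
    for token in tokens:
        digest = hashlib.md5(token.encode("utf-8")).digest()
        value = int.from_bytes(digest[: bits // 8], "big")
        for i in range(bits):
            votes[i] += 1 if (value >> i) & 1 else -1
    # A bit is set when more tokens voted 1 than 0 at that position.
    return sum(1 << i for i in range(bits) if votes[i] > 0)


def hamming(a, b):
    """Number of differing bits between two fingerprints."""
    return bin(a ^ b).count("1")


def pick_thumbnails(fingerprints, threshold):
    """Keep mementos that differ from the last kept one by more than threshold bits.

    fingerprints: (datetime label, fingerprint) pairs in chronological order.
    """
    kept = []
    for when, fingerprint in fingerprints:
        if not kept or hamming(kept[-1][1], fingerprint) > threshold:
            kept.append((when, fingerprint))
    return kept


# Tiny hand-made fingerprints to show the selection rule.
sample = [("T1", 0b0000), ("T2", 0b0001), ("T3", 0b1111)]
selected = pick_thumbnails(sample, threshold=2)
```

In practice the tokens would come from the memento's HTML text, and the threshold would be tuned so that visually similar captures share one thumbnail.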
