Thumbnail Summarization Techniques For Web Archives

Thumbnail Summarization Techniques For Web Archives. Ahmed AlSum*, Stanford University Libraries, Stanford CA, USA [email protected]. Michael L. Nelson, Old Dominion University, Norfolk VA, USA [email protected]. The 36th European Conference on Information Retrieval (ECIR 2014), Amsterdam, Netherlands, 2014. *Ahmed AlSum did this work while he was a PhD student at Old Dominion University.

Transcript of Thumbnail Summarization Techniques For Web Archives

Page 1: Thumbnail Summarization Techniques For Web Archives

Thumbnail Summarization Techniques For Web Archives

Ahmed AlSum*

Stanford University Libraries

Stanford CA, USA [email protected]

Michael L. Nelson

Old Dominion University

Norfolk VA, USA [email protected]

The 36th European Conference on Information Retrieval. ECIR 2014, Amsterdam, Netherlands, 2014

*Ahmed AlSum did this work while he was a PhD student at Old Dominion University

ECIR 2014 Amsterdam, Netherlands

Page 2: Thumbnail Summarization Techniques For Web Archives

What is a Web Archive?

http://www.cs.odu.edu

Page 3: Thumbnail Summarization Techniques For Web Archives

Thumbnails in Web Archive

Internet Archive; UK Web Archive

Page 4: Thumbnail Summarization Techniques For Web Archives

Memento Terminology

Original Resource: URI-R, R (http://www.amazon.com)

Memento: URI-M, M (http://web.archive.org/web/20110411070244/http://amazon.com)

TimeMap: URI-T, TM
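
As a small illustration of the terminology above, the sketch below splits a Wayback-style URI-M into its capture time and its URI-R. It assumes the URI-M follows the pattern of the example on this slide (a 14-digit YYYYMMDDhhmmss timestamp after /web/); the function name split_uri_m is hypothetical and other archives may use different URI layouts.

```python
import re
from datetime import datetime, timezone

# Pattern assumed from the slide's example URI-M; not a general Memento rule.
WAYBACK_URI_M = re.compile(r"^https?://web\.archive\.org/web/(\d{14})/(.+)$")

def split_uri_m(uri_m):
    """Return (capture datetime in UTC, URI-R) for a Wayback-style URI-M."""
    match = WAYBACK_URI_M.match(uri_m)
    if match is None:
        raise ValueError("not a Wayback-style URI-M: " + uri_m)
    captured = datetime.strptime(match.group(1), "%Y%m%d%H%M%S")
    return captured.replace(tzinfo=timezone.utc), match.group(2)

capture_time, uri_r = split_uri_m(
    "http://web.archive.org/web/20110411070244/http://amazon.com"
)
print(capture_time.isoformat(), uri_r)
```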

Page 5: Thumbnail Summarization Techniques For Web Archives

Thumbnails Creation Challenges

• Scalability in time
  • The IA may need 361 years to create a thumbnail for each memento using one hundred machines.
• Scalability in space
  • The IA will need 355 TB to store one thumbnail per memento.
• Page quality

Page 6: Thumbnail Summarization Techniques For Web Archives

Thumbnails Usage Challenges

• This is a partial view of the first 700 thumbnails out of the 10,500 available mementos for www.apple.com

Page 7: Thumbnail Summarization Techniques For Web Archives

From 10,500 Mementos to 69 Thumbnails.

Page 8: Thumbnail Summarization Techniques For Web Archives

How many thumbnails do we need?

www.unfi.com on the live Web

Page 9: Thumbnail Summarization Techniques For Web Archives

How many thumbnails do we need?

www.unfi.com on the live Web

Page 10: Thumbnail Summarization Techniques For Web Archives

40 Thumbnails are good.

Page 11: Thumbnail Summarization Techniques For Web Archives

METHODOLOGY

Page 12: Thumbnail Summarization Techniques For Web Archives

Visual Similarity and Text Similarity

[Figure: example page pairs labeled "Similar" and "Different"; label: HTML Text]

Page 13: Thumbnail Summarization Techniques For Web Archives

Correlation between Visual Similarity and Text Similarity

• Text Similarity
  • SimHash
  • DOM Tree
  • Embedded resources
  • Memento Datetime (capture time)
• Visual Similarity

Page 14: Thumbnail Summarization Techniques For Web Archives

Text Similarity

SimHash

• Compute 64-bit SimHash fingerprints with k = 4 for two pages
  • Full HTML text ✔
  • The main content from the web page
  • All the text
  • Templates including the text
  • The template excluding the text
• Calculate the differences using Hamming distance
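
The fingerprint-and-compare step above can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: it assumes whitespace-separated tokens as features and an MD5-derived 64-bit hash per token, and it does not model the k = 4 parameter from the slide. The names simhash64 and hamming_distance are hypothetical.

```python
import hashlib

def simhash64(text):
    """Build a 64-bit SimHash fingerprint from whitespace-separated tokens."""
    weights = [0] * 64
    for token in text.split():
        digest = hashlib.md5(token.encode("utf-8")).digest()
        token_hash = int.from_bytes(digest[:8], "big")  # first 64 bits of MD5
        for bit in range(64):
            # Each token votes +1 or -1 on every bit position.
            weights[bit] += 1 if (token_hash >> bit) & 1 else -1
    fingerprint = 0
    for bit in range(64):
        if weights[bit] > 0:
            fingerprint |= 1 << bit
    return fingerprint

def hamming_distance(a, b):
    """Number of bit positions in which two fingerprints differ."""
    return bin(a ^ b).count("1")

page_1 = "<html><body><h1>Store</h1><p>Welcome to our store</p></body></html>"
page_2 = "<html><body><h1>Store</h1><p>Welcome back to our store</p></body></html>"
print(hamming_distance(simhash64(page_1), simhash64(page_2)))
```

A small Hamming distance suggests the two inputs share most of their tokens; which input (full HTML, main content, template) to fingerprint is the choice listed in the bullets above.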

Page 15: Thumbnail Summarization Techniques For Web Archives

Text Similarity

DOM Tree

• Convert each web page to a DOM tree
• Calculate the difference using Levenshtein distance
  • Levenshtein distance: the number of insert, update, and delete operations needed to turn one into the other.
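
To make the edit-distance idea concrete, here is a simplified sketch. It flattens each page into its sequence of opening tag names and computes the Levenshtein distance between the two sequences; a true tree edit distance over the DOM would be more involved, so treat this as an assumption-laden approximation. TagCollector, tag_sequence, and levenshtein are hypothetical names.

```python
from html.parser import HTMLParser

class TagCollector(HTMLParser):
    """Collect opening tag names in document order (a flattened DOM view)."""
    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        self.tags.append(tag)

def tag_sequence(html):
    collector = TagCollector()
    collector.feed(html)
    collector.close()
    return collector.tags

def levenshtein(a, b):
    """Minimum number of inserts, updates, and deletes turning a into b."""
    previous = list(range(len(b) + 1))
    for i, item_a in enumerate(a, 1):
        current = [i]
        for j, item_b in enumerate(b, 1):
            current.append(min(
                previous[j] + 1,                       # delete item_a
                current[j - 1] + 1,                    # insert item_b
                previous[j - 1] + (item_a != item_b),  # update, or keep if equal
            ))
        previous = current
    return previous[-1]

old_page = "<html><body><p>a</p></body></html>"
new_page = "<html><body><div><p>a</p></div></body></html>"
print(levenshtein(tag_sequence(old_page), tag_sequence(new_page)))
```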

Page 16: Thumbnail Summarization Techniques For Web Archives

Text Similarity

Embedded resources

• Extract the embedded resources for each page
• Calculate the total number of new resources that have been added and the resources that have been removed.
• For example, the difference between M1 and M2:
  • Addition of 5 resources (2 JavaScript files and 3 images)
  • Removal of 2 resources (1 JavaScript file and 1 image).
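
The added/removed count can be sketched with set differences. The resource names below are made up to reproduce the slide's example (5 additions, 2 removals); resource_diff is a hypothetical helper, and extracting the resource URLs from the HTML is assumed to have happened already.

```python
def resource_diff(old_resources, new_resources):
    """Return (added, removed, total change) between two resource collections."""
    old_set, new_set = set(old_resources), set(new_resources)
    added = new_set - old_set
    removed = old_set - new_set
    return added, removed, len(added) + len(removed)

# Hypothetical embedded resources of two mementos, M1 and M2.
m1 = {"main.css", "menu.js", "banner.png"}
m2 = {"main.css", "app.js", "tracker.js", "hero.png", "logo.png", "promo.png"}

added, removed, total = resource_diff(m1, m2)
print(len(added), len(removed), total)
```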

Page 17: Thumbnail Summarization Techniques For Web Archives

Text Similarity

Memento datetime

• Calculate the difference, in seconds, between the capture times of the two pages.
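
A minimal sketch of this feature, assuming each capture time is available as an HTTP-date string such as the value of a Memento-Datetime header. The example values and the function name capture_gap_seconds are illustrative only.

```python
from email.utils import parsedate_to_datetime

def capture_gap_seconds(datetime_a, datetime_b):
    """Absolute difference between two HTTP-date capture times, in seconds."""
    first = parsedate_to_datetime(datetime_a)
    second = parsedate_to_datetime(datetime_b)
    return abs((second - first).total_seconds())

# Hypothetical capture times of two mementos of the same page.
gap = capture_gap_seconds(
    "Mon, 11 Apr 2011 07:02:44 GMT",
    "Tue, 12 Apr 2011 07:02:44 GMT",
)
print(gap)
```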

Page 18: Thumbnail Summarization Techniques For Web Archives

Visual Similarity

• Measurement: the number of different pixels between two thumbnails
• To compare two thumbnails:
  • Resize them to several dimensions: 64x64, 128x128, 256x256, and 600x600.
  • Calculate the Manhattan distance and Zero distance between each pair
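
The two pixel distances can be sketched on plain nested lists of grayscale values. Two assumptions are made here: "Zero distance" is read as the count of pixel positions that differ (consistent with the measurement bullet above), and the thumbnails are already resized to the same dimensions, since the resizing step would need an image library and is not shown. The function names are hypothetical.

```python
def _check_same_shape(a, b):
    if len(a) != len(b) or any(len(ra) != len(rb) for ra, rb in zip(a, b)):
        raise ValueError("thumbnails must have the same dimensions")

def manhattan_distance(a, b):
    """Sum of absolute per-pixel differences between two grayscale images."""
    _check_same_shape(a, b)
    return sum(abs(p - q) for ra, rb in zip(a, b) for p, q in zip(ra, rb))

def zero_distance(a, b):
    """Number of pixel positions whose values differ (assumed meaning)."""
    _check_same_shape(a, b)
    return sum(1 for ra, rb in zip(a, b) for p, q in zip(ra, rb) if p != q)

# Tiny 2x3 stand-ins for real thumbnails.
thumb_a = [[0, 0, 255], [10, 10, 10]]
thumb_b = [[0, 5, 250], [10, 10, 10]]
print(manhattan_distance(thumb_a, thumb_b), zero_distance(thumb_a, thumb_b))
```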

Page 19: Thumbnail Summarization Techniques For Web Archives

Correlation between Visual Similarity and Text Similarity

[Figure panels: SimHash; DOM tree; Embedded resources; Memento Datetime]

SimHash [Charikar 2002], DOM tree [Pawlik 2011], Memento Datetime [Van de Sompel 2013]

Page 20: Thumbnail Summarization Techniques For Web Archives

SELECTION ALGORITHMS

Page 21: Thumbnail Summarization Techniques For Web Archives

Threshold Grouping

Page 22: Thumbnail Summarization Techniques For Web Archives

Threshold Grouping

Page 23: Thumbnail Summarization Techniques For Web Archives

Clustering technique

• Input:
  • TimeMap with n mementos
  • A set of features.
    • For example, F = {SimHash, Memento-Datetime}
• Task:
  • Cluster n mementos into K clusters.
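
One way to picture the task is a basic K-medoids loop over SimHash fingerprints, with each cluster's medoid standing for the memento whose thumbnail would be kept. This is a generic sketch under several assumptions: a single feature (SimHash with Hamming distance), evenly spaced initial medoids, and k no larger than n. It is not a reproduction of the exact algorithm or initialization used in the talk, and the names are hypothetical.

```python
def hamming(a, b):
    return bin(a ^ b).count("1")

def k_medoids(points, k, dist, max_iter=100):
    """Return (medoid indices, cluster label per point) for k clusters."""
    n = len(points)
    medoids = [i * n // k for i in range(k)]  # evenly spaced start (assumption)
    labels = [0] * n
    for _ in range(max_iter):
        # Assignment step: each point joins its nearest medoid.
        labels = [
            min(range(k), key=lambda c: dist(points[i], points[medoids[c]]))
            for i in range(n)
        ]
        # Update step: the new medoid minimizes total distance within its cluster.
        new_medoids = []
        for c in range(k):
            members = [i for i in range(n) if labels[i] == c]
            if not members:
                new_medoids.append(medoids[c])
                continue
            new_medoids.append(min(
                members,
                key=lambda m: sum(dist(points[m], points[o]) for o in members),
            ))
        if new_medoids == medoids:
            break
        medoids = new_medoids
    return medoids, labels

# Toy fingerprints in TimeMap order; real inputs would be 64-bit SimHash values.
fingerprints = [0b0000, 0b0001, 0b0011, 0b11110000, 0b11110001, 0b11110011]
medoids, labels = k_medoids(fingerprints, 2, hamming)
print(medoids, labels)
```

Adding Memento-Datetime as a second feature would require a combined distance function, which is left out here.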

Page 24: Thumbnail Summarization Techniques For Web Archives

Clustering technique

[Figure panels: SimHash feature; SimHash and Datetime features]

Park, H.-S., & Jun, C.-H. (2009). A simple and fast algorithm for K-medoids clustering. Expert Systems with Applications, 36(2, Part 2), 3336–3341.

Page 25: Thumbnail Summarization Techniques For Web Archives

Time Normalization

Page 26: Thumbnail Summarization Techniques For Web Archives

Selection Algorithms Comparison

                         Threshold Grouping   K clustering   Time Normalization
TimeMap Reduction        27%                  9% to 12%      23%
Image Loss               28                   78 - 101       109
# Features               1 feature            1 or more      1 feature
Preprocessing required   Yes                  Yes            No
Efficient processing     Medium               Extensive      Light
Incremental              Yes                  No             Yes
Online/offline           Both                 Both           Both

Page 27: Thumbnail Summarization Techniques For Web Archives

Generalization outside the Web Archive

• Get k thumbnails from a website that has n pages

Page 28: Thumbnail Summarization Techniques For Web Archives

Conclusions

• We explored the similarity between the text and the visual appearance of the web page.
  • We found that SimHash and Levenshtein distance have the highest correlation with visual similarity.
• We presented three algorithms to select k thumbnails from n mementos per TimeMap.

[email protected] @aalsum
