MS Thesis Defense, Aug 2012 - Visualizing Digital Collections at Archive-It
Visualizing Digital Collections at Archive-It
Kalpesh Padia
Director: Michele C. Weigle
Committee: Michael L. Nelson, Ravi Mukkamala
7/20/2012 MS Thesis - August 2012
Agenda
Introduction
Motivation
Related Work
Collection Retrieval and Processing
Visualizations
Case Studies
Future Work
Conclusion
INTRODUCTION AND MOTIVATION
Digital Archives
http://www.loc.gov/index.html
http://digitalcollections.library.yale.edu/
Archive-It Collection Hierarchy
Root: Collection Title
Level 1: Category 1 to Category n
Level 2: Web page 1 to Web page n
Level 3 (Leaf Nodes): Archived Version 1 to Archived Version n
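The four-level hierarchy on this slide (collection title, categories, web pages, archived versions) can be sketched as nested records. This is a minimal illustration, not the data model used in the thesis: the class names, field names, example URIs, and timestamps are all assumptions made for the example.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import List


@dataclass
class WebPage:
    """Level 2 node; its archived versions (mementos) are the Level 3 leaves."""
    uri: str
    mementos: List[str] = field(default_factory=list)  # hypothetical timestamps


@dataclass
class Category:
    """Level 1 node grouping web pages."""
    name: str
    pages: List[WebPage] = field(default_factory=list)


@dataclass
class Collection:
    """Root node: the collection title plus its categories."""
    title: str
    categories: List[Category] = field(default_factory=list)

    def memento_count(self) -> int:
        # Count leaf nodes across every category and page.
        return sum(len(p.mementos) for c in self.categories for p in c.pages)


# Placeholder data, not taken from any real Archive-It collection.
collection = Collection(
    title="Example collection",
    categories=[
        Category("Category 1", [
            WebPage("http://example.org/", ["20100101000000", "20100201000000"]),
        ]),
        Category("Category n", [
            WebPage("http://example.com/", ["20100315000000"]),
        ]),
    ],
)

# asdict() recurses through nested dataclasses, so the tree serializes directly.
collection_json = json.dumps(asdict(collection), indent=2)
```

Serializing the whole tree in one step mirrors the "Construct JSON" processing step listed later in the deck, although the actual JSON layout used there is not shown in the slides.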
Exploring Archive-It Collections
http://archive-it.org/collections/1068
http://wayback.archive-it.org/1068/*/http://acda.co/
http://archive-it.org/collections/2836
Drawbacks
No visual feedback
Discovering individual pages is difficult
Optional metadata and categorization
Collection structure known only to curator
Contribution
Interactive visualizations: treemap, time cloud, bubble chart, image plot, wordle, timeline
Temporal exploration of collections
Uncover collection structure
RELATED WORK
Microsoft Pivot
http://www.microsoft.com/silverlight/pivotviewer/
Page History Explorer
A. Jatowt, Y. Kawai, and K. Tanaka, “Visualizing Historical Content of Web Pages,” in Proceedings of the 17th International Conference on World Wide Web, 2008.
3D Wall
http://www.webarchive.org.uk/ukwa/wall/Blogs
Treemap
B. Johnson and B. Shneiderman, “Tree-Maps: A Space-Filling Approach to the Visualization of Hierarchical Information Structures,” in Proceedings of the 2nd Conference on Visualization '91, 1991.
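The space-filling idea behind the treemap can be sketched with the slice-and-dice layout from Johnson and Shneiderman's paper: each node receives a rectangle whose area is proportional to its weight, and the split direction alternates at every level. The sketch below is only an illustration of that layout; the tree shape, the weights, and the function names are assumptions, and the slides do not say which layout variant the thesis implementation uses.

```python
from typing import Dict, List, Tuple, Union

Tree = Union[float, Dict[str, "Tree"]]    # a leaf weight or a dict of children
Rect = Tuple[float, float, float, float]  # x, y, width, height


def weight(node: Tree) -> float:
    """Total weight of a node: its own value, or the sum over its children."""
    if isinstance(node, dict):
        return sum(weight(child) for child in node.values())
    return float(node)


def slice_and_dice(node: Tree, rect: Rect, depth: int = 0,
                   path: str = "") -> List[Tuple[str, Rect]]:
    """Return (path, rectangle) pairs for every leaf under node."""
    if not isinstance(node, dict):
        return [(path, rect)]
    x, y, w, h = rect
    total = weight(node)
    leaves: List[Tuple[str, Rect]] = []
    offset = 0.0  # fraction of the parent rectangle already consumed
    for name, child in node.items():
        share = weight(child) / total if total else 0.0
        if depth % 2 == 0:
            # Even depth: children are placed side by side along the x axis.
            child_rect = (x + offset * w, y, share * w, h)
        else:
            # Odd depth: children are stacked along the y axis.
            child_rect = (x, y + offset * h, w, share * h)
        offset += share
        leaves.extend(slice_and_dice(child, child_rect, depth + 1, f"{path}/{name}"))
    return leaves


# Hypothetical collection: two categories, weights standing in for memento counts.
layout = dict(slice_and_dice({"News": {"a": 3, "b": 1}, "Blogs": 4},
                             (0.0, 0.0, 100.0, 50.0)))
```

Slice-and-dice is simple and keeps sibling order stable, at the cost of thin rectangles when weights are very uneven.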
Series Browser
M. Whitelaw, “Visualising Archival Collections: The Visible Archive Project,” Archives and Manuscripts, vol. 37, no. 2, 2009.
A1 Explorer
M. Whitelaw, “Visualising Archival Collections: The Visible Archive Project,” Archives and Manuscripts, vol. 37, no. 2, 2009.
EASY
Scharnhorst et al., “Looking at a digital research data archive - Visual interfaces to EASY,” CoRR, 2012, http://arxiv.org/abs/1204.3200
Wordle
Jonathan Feinberg, http://wordle.net/, Dogear
DATA RETRIEVAL AND PROCESSING
11 Collections, 2K+ Web pages, 70K+ Mementos
Data Retrieval & Processing
Retrieval: screen scrape, copy collection hierarchy, store page content
Processing: calculate TF and TF-IDF, generate screenshots, generate wordles, rule-based categorization, construct JSON
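The "Calculate TF and TF-IDF" processing step can be illustrated with a small sketch. The slides do not state which weighting variant the thesis uses, so this assumes the plain form, relative term frequency multiplied by log(N / document frequency). The tokenizer and the sample texts are also assumptions made for the example.

```python
import math
import re
from collections import Counter
from typing import Dict, List


def tokenize(text: str) -> List[str]:
    """Lowercase the text and keep runs of letters and digits (a simplification)."""
    return re.findall(r"[a-z0-9]+", text.lower())


def term_frequencies(text: str) -> Dict[str, float]:
    """Relative frequency of each term within one document."""
    counts = Counter(tokenize(text))
    total = sum(counts.values())
    return {term: count / total for term, count in counts.items()} if total else {}


def tf_idf(docs: List[str]) -> List[Dict[str, float]]:
    """TF-IDF score per term per document, using idf = log(N / df)."""
    tfs = [term_frequencies(doc) for doc in docs]
    # Document frequency: the number of documents containing each term.
    df = Counter(term for tf in tfs for term in tf)
    n = len(docs)
    return [{term: f * math.log(n / df[term]) for term, f in tf.items()} for tf in tfs]


# Placeholder page texts, not taken from any archived page.
scores = tf_idf([
    "flood relief in Pakistan",
    "flood warning",
    "earthquake relief",
])
```

Under this weighting a term that occurs in every document scores zero, so terms distinctive to a page rise to the top, which is the property a wordle of page content relies on.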
No categorization
http://www.archive-it.org/collections/2836
Improper Categorization
http://www.archive-it.org/collections/2323
Rule-based categorization
Categories: News Web pages, Blogs, Social Media, Videos
http://www.archive-it.org/collections/2836
Special URI- and TLD-based categorization
Pakistani news web pages
http://www.archive-it.org/collections/2836
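Rule-based categorization by URI and top-level domain can be sketched as a short lookup. The category names come from the slides, but every rule below (the host lists and the `.pk` mapping) is a hypothetical stand-in, since the thesis rule set is not shown in the deck.

```python
from urllib.parse import urlparse

# Hypothetical host rules, for illustration only; checked in order.
HOST_RULES = [
    ("Videos", ("youtube.com", "vimeo.com")),
    ("Social Media", ("twitter.com", "facebook.com")),
    ("Blogs", ("blogspot.com", "wordpress.com")),
    ("News", ("bbc.co.uk", "cnn.com")),
]

# Hypothetical TLD rule: treating every .pk host as Pakistani news is a
# simplifying assumption for this sketch.
TLD_RULES = {"pk": "Pakistani News"}


def categorize(uri: str) -> str:
    """Return the first matching category for a URI, else 'Uncategorized'."""
    host = (urlparse(uri).hostname or "").lower()
    for category, domains in HOST_RULES:
        # Match the registered domain itself or any subdomain of it.
        if any(host == d or host.endswith("." + d) for d in domains):
            return category
    tld = host.rsplit(".", 1)[-1]
    return TLD_RULES.get(tld, "Uncategorized")
```

Checking host rules before TLD rules means a video hosted under a country-code domain would still land in Videos; the order of precedence is a design choice that a real rule set would need to settle.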
VISUALIZATIONS
Treemap
Time Cloud
Bubble Chart, Image Plot & Timeline
CASE STUDIES
1. Collection Building and Growth
2. Re-Categorization (Pakistan Flood: no categorization)
2. Re-Categorization (Pakistan Flood: after categorization)
3. Collection Synopsis
4. Theme Tracking
Informal User Evaluation
Alex Thurman, Columbia University Libraries
Feedback on:
ease of browsing and obtaining information
user-friendliness of the interface
whether they prefer a textual or graphical interface
most effective visualization
effectiveness of the rule-based categorization in exploring archives
Feedback
Effective visualizations:
Treemap – color coding useful for identifying newer additions
Image plot – screenshots with mouse-over wordles allow for good navigation
Timeline – useful for visualizing development of groups in collection
Suggestions:
Broader timescale for treemaps
Include stop words from other languages
FUTURE WORK AND CONCLUSION
Future Work
N-Gram wordles
Term expansion
Krovetz stemmer (dictionary based stemmer)
Integration with Archive-It
Detailed user evaluation
Implementation for other archives
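As a rough illustration of the "N-Gram wordles" item above, word n-grams can be counted the same way single terms are, and the counts then used as weights. The function name, the tokenizer, and the sample sentence are assumptions for this sketch; the slides do not describe how the thesis would build such wordles.

```python
import re
from collections import Counter


def ngram_counts(text: str, n: int = 2) -> Counter:
    """Count word n-grams in text; n=2 yields bigrams."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    # Slide a window of n tokens over the text; too-short texts yield no n-grams.
    return Counter(" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


# Placeholder text, not taken from any archived page.
bigrams = ngram_counts("flood relief and flood relief camps", n=2)
```

Counting phrases instead of single words would let a wordle show "flood relief" as one unit, which single-term counts split apart.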
Conclusion
Identified metrics for collections
Visualizations: treemap, time cloud, bubble chart, image plot, wordle, timeline
Rule-based categorization
BACKUP
Time Span
Small: 1 Day - 2 Weeks
Medium: 2 Weeks - 4 Months
Large: > 4 Months
http://wayback.archive-it.org/1068/*/http://amigosdemujeres.org/
Groups
Small: 1
Medium: 2 - 5
Large: > 5
http://www.archive-it.org/collections/1068
URI Domains
Small: 1 - 10
Medium: 11 - 20
Large: > 20
http://www.archive-it.org/collections/2836
Number of Web Pages
Small: 1 - 10
Medium: 11 - 99
Large: > 99
http://www.archive-it.org/collections/2836
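The Small/Medium/Large thresholds on the four metric slides above (time span, groups, URI domains, number of web pages) can be collected into one lookup. This is a sketch under stated assumptions: the metric keys are hypothetical names, time span is expressed in days with 2 weeks taken as 14 days and 4 months approximated as 120 days, and a value exactly on a shared boundary (such as 2 weeks) is placed in the lower class because the slides leave that case open.

```python
from typing import Dict, Tuple

# (largest "Small" value, largest "Medium" value) per metric, taken from the
# threshold slides. Key names and the day conversions are assumptions.
THRESHOLDS: Dict[str, Tuple[int, int]] = {
    "time_span_days": (14, 120),  # 2 weeks = 14 days; 4 months ~ 120 days
    "groups": (1, 5),
    "uri_domains": (10, 20),
    "web_pages": (10, 99),
}


def size_class(metric: str, value: int) -> str:
    """Classify a metric value as Small, Medium, or Large."""
    small_max, medium_max = THRESHOLDS[metric]
    if value <= small_max:
        return "Small"
    if value <= medium_max:
        return "Medium"
    return "Large"
```

For example, `size_class("web_pages", 99)` returns "Medium" and `size_class("uri_domains", 21)` returns "Large", matching the ranges on the slides.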
Jigsaw
Stasko et al., IEEE VAST 2007
Themeriver
Wei et al., SIGKDD, 2010
Time Cloud
Bubble Chart
http://tmix.cs.odu.edu:8080/project/test.php?coll_id=1068
Image Plot with Wordle
http://tmix.cs.odu.edu:8080/project/test.php?coll_id=1068
Timeline
http://tmix.cs.odu.edu:8080/project/test.php?coll_id=1068