web archiving tools and technologies
![Page 1: web archiving tools and technologies](https://reader033.fdocuments.net/reader033/viewer/2022052505/555c0316d8b42a56448b5388/html5/thumbnails/1.jpg)
Web Archiving Tools and Technology
Dan Chudnov - GWU Libraries
dchud at gwu edu
@dchud
IS&T Workshop, April 2, 2013
Washington DC USA
Tuesday, April 2, 13
![Page 2: web archiving tools and technologies](https://reader033.fdocuments.net/reader033/viewer/2022052505/555c0316d8b42a56448b5388/html5/thumbnails/2.jpg)
select scope crawl process access
unt nom tool X X
heritrix X X
wct X X X X
netarchivesuite X X X X X
warc tools X
nutchwax X X
wayback X
![Page 3: web archiving tools and technologies](https://reader033.fdocuments.net/reader033/viewer/2022052505/555c0316d8b42a56448b5388/html5/thumbnails/3.jpg)
select
• what to collect
• who authorizes
• when
• what order
![Page 4: web archiving tools and technologies](https://reader033.fdocuments.net/reader033/viewer/2022052505/555c0316d8b42a56448b5388/html5/thumbnails/4.jpg)
scope
• how much
• robots.txt
• what to leave out
• which doors not to open
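One way to picture the robots.txt part of scoping is the following minimal sketch, which uses Python's standard `urllib.robotparser`. The robots.txt text, the `example-archiver` user agent, and the `build_scope_check` helper are all hypothetical, invented for illustration; a real crawler would fetch the file from each site and combine it with its own scope rules.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content; a real crawler would fetch this per site.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 5
"""


def build_scope_check(robots_text, user_agent):
    """Return (in_scope, crawl_delay) derived from a robots.txt body."""
    parser = RobotFileParser()
    parser.parse(robots_text.splitlines())

    def in_scope(url):
        # True when robots.txt does not disallow this URL for our agent.
        return parser.can_fetch(user_agent, url)

    return in_scope, parser.crawl_delay(user_agent)


in_scope, delay = build_scope_check(ROBOTS_TXT, "example-archiver")
print(in_scope("http://example.org/index.html"))
print(in_scope("http://example.org/private/report.html"))
print(delay)
```

Whether an archive honors robots.txt at all is a policy decision made during scoping; the sketch only shows how the check could be expressed once that decision is made.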
![Page 5: web archiving tools and technologies](https://reader033.fdocuments.net/reader033/viewer/2022052505/555c0316d8b42a56448b5388/html5/thumbnails/5.jpg)
crawl
• start with seeds
• find, queue, follow links
• be kind to each site
• parallelize across sites
• schedule, log, checkpoint, resume
• bundle
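The crawl steps above can be sketched as a small frontier loop. This is a toy under stated assumptions, not how heritrix or any other tool here is implemented: `FAKE_WEB` is a hypothetical in-memory link graph standing in for real fetching and link extraction, time is simulated instead of slept, and "parallelize across sites" is approximated by interleaving hosts rather than by real concurrency.

```python
from collections import deque
from urllib.parse import urlsplit

# Hypothetical link graph standing in for HTTP fetches and link extraction.
FAKE_WEB = {
    "http://a.example/": ["http://a.example/about", "http://b.example/"],
    "http://a.example/about": ["http://a.example/"],
    "http://b.example/": ["http://b.example/news", "http://a.example/about"],
    "http://b.example/news": [],
}


def crawl(seeds, fetch_links, delay=2.0, max_pages=100):
    """Crawl from seeds with one queue per host and a per-host delay.

    Returns a list of (simulated_time, url) tuples in fetch order.
    """
    queues = {}      # host -> deque of URLs waiting to be fetched
    ready_at = {}    # host -> earliest simulated time of the next fetch
    seen = set()     # every URL ever queued, so nothing is fetched twice
    log = []

    def enqueue(url):
        if url not in seen:
            seen.add(url)
            host = urlsplit(url).netloc
            queues.setdefault(host, deque()).append(url)
            ready_at.setdefault(host, 0.0)

    for seed in seeds:
        enqueue(seed)

    now = 0.0
    while len(log) < max_pages:
        pending = [host for host, queue in queues.items() if queue]
        if not pending:
            break
        # Pick the host that may be contacted soonest (be kind to each site).
        host = min(pending, key=lambda h: ready_at[h])
        now = max(now, ready_at[host])  # "wait" until that host is ready
        url = queues[host].popleft()
        log.append((now, url))
        ready_at[host] = now + delay
        for link in fetch_links(url):   # find, queue, follow links
            enqueue(link)
    return log


log = crawl(["http://a.example/"], lambda url: FAKE_WEB.get(url, []))
for when, url in log:
    print(when, url)
```

Keeping a separate queue and a next-allowed time per host is what lets the loop stay polite to each site while still making progress on others. Scheduling, checkpointing, and bundling into ARC/WARC files are left out of the sketch.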
![Page 6: web archiving tools and technologies](https://reader033.fdocuments.net/reader033/viewer/2022052505/555c0316d8b42a56448b5388/html5/thumbnails/6.jpg)
process
• lump, split, bundle, rebundle
• quality control
• index, surrogate, reorder, prep for access
• store, distribute, preserve
![Page 7: web archiving tools and technologies](https://reader033.fdocuments.net/reader033/viewer/2022052505/555c0316d8b42a56448b5388/html5/thumbnails/7.jpg)
access
• browse
• search
• known items
• patterns
• needles
![Page 8: web archiving tools and technologies](https://reader033.fdocuments.net/reader033/viewer/2022052505/555c0316d8b42a56448b5388/html5/thumbnails/8.jpg)
select scope crawl process access
unt nom tool X X
heritrix X X
wct X X X X
netarchivesuite X X X X X
warc tools X
nutchwax X X
wayback X
![Page 9: web archiving tools and technologies](https://reader033.fdocuments.net/reader033/viewer/2022052505/555c0316d8b42a56448b5388/html5/thumbnails/9.jpg)
UNT URL Nomination Tool
• collaborative selection
• collect seed lists
• attach metadata
• agree on scope
• feed crawlers
![Page 10: web archiving tools and technologies](https://reader033.fdocuments.net/reader033/viewer/2022052505/555c0316d8b42a56448b5388/html5/thumbnails/10.jpg)
heritrix
• free software from Internet Archive
• easy to start with
• difficult to master
• powerful, configurable, confusing
![Page 11: web archiving tools and technologies](https://reader033.fdocuments.net/reader033/viewer/2022052505/555c0316d8b42a56448b5388/html5/thumbnails/11.jpg)
heritrix cont’d
• two major versions, “1” and “3”
• WCT and NetArchive embed “1”
• “1” - minimal UI
• “3” - even less
• iterate early - long learning curve
• best available tool
![Page 12: web archiving tools and technologies](https://reader033.fdocuments.net/reader033/viewer/2022052505/555c0316d8b42a56448b5388/html5/thumbnails/12.jpg)
heritrix cont’d
![Page 13: web archiving tools and technologies](https://reader033.fdocuments.net/reader033/viewer/2022052505/555c0316d8b42a56448b5388/html5/thumbnails/13.jpg)
Web Curator Tool
• free software from NLNZ / BL
• full crawling workflow suite
• select, obtain permissions, authorize
• schedule, crawl w/heritrix 1
![Page 14: web archiving tools and technologies](https://reader033.fdocuments.net/reader033/viewer/2022052505/555c0316d8b42a56448b5388/html5/thumbnails/14.jpg)
WCT cont’d
• quality review
• statistics, hierarchy visualization, pruning
• troubleshooting
• task notifications
• reporting
![Page 15: web archiving tools and technologies](https://reader033.fdocuments.net/reader033/viewer/2022052505/555c0316d8b42a56448b5388/html5/thumbnails/15.jpg)
WCT cont’d
![Page 16: web archiving tools and technologies](https://reader033.fdocuments.net/reader033/viewer/2022052505/555c0316d8b42a56448b5388/html5/thumbnails/16.jpg)
NetarchiveSuite
• free software from netarkivet.dk
• used by State and University Library, The Royal Library in Denmark
• complete solution from selection to access
![Page 17: web archiving tools and technologies](https://reader033.fdocuments.net/reader033/viewer/2022052505/555c0316d8b42a56448b5388/html5/thumbnails/17.jpg)
NetarchiveSuite cont’d
![Page 18: web archiving tools and technologies](https://reader033.fdocuments.net/reader033/viewer/2022052505/555c0316d8b42a56448b5388/html5/thumbnails/18.jpg)
NetarchiveSuite cont’d
• selection, scoping, scheduling
• crawling, troubleshooting, tweaking
• system dashboard, quality assurance
• heritrix and wayback
![Page 19: web archiving tools and technologies](https://reader033.fdocuments.net/reader033/viewer/2022052505/555c0316d8b42a56448b5388/html5/thumbnails/19.jpg)
warc-tools
• command-line tools for arc/warc
• validate, summarize, filter
• bundle / rebundle, convert, index
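To make "summarize" and "filter" concrete, here is a rough stdlib-only sketch that reads uncompressed WARC records and tallies them by type. It is not the warc-tools code or its command-line interface. The helper names (`make_record`, `iter_records`, `summarize`) and the sample records are hypothetical, and the sample omits mandatory fields such as WARC-Record-ID and WARC-Date, so it would not pass a real validator.

```python
import io
from collections import Counter


def make_record(record_type, target_uri, payload):
    """Serialize one simplified, uncompressed WARC/1.0-style record."""
    head = "\r\n".join([
        "WARC/1.0",
        f"WARC-Type: {record_type}",
        f"WARC-Target-URI: {target_uri}",
        f"Content-Length: {len(payload)}",
    ])
    # Header block, blank line, content block, then two CRLFs end the record.
    return head.encode("utf-8") + b"\r\n\r\n" + payload + b"\r\n\r\n"


def iter_records(stream):
    """Yield (headers, payload) for each record in an uncompressed stream."""
    while True:
        line = stream.readline()
        if not line:
            return
        if not line.strip():
            continue  # blank lines separating records
        if not line.startswith(b"WARC/"):
            raise ValueError(f"expected a WARC version line, got {line!r}")
        headers = {}
        while True:
            line = stream.readline()
            if line in (b"\r\n", b"\n", b""):
                break
            name, _, value = line.decode("utf-8").partition(":")
            headers[name.strip()] = value.strip()
        # Content-Length says exactly how many payload bytes follow.
        payload = stream.read(int(headers["Content-Length"]))
        yield headers, payload


def summarize(stream):
    """Return ({record type: count}, total payload bytes)."""
    counts, total = Counter(), 0
    for headers, payload in iter_records(stream):
        counts[headers["WARC-Type"]] += 1
        total += len(payload)
    return dict(counts), total


# Hypothetical sample data standing in for a crawler's output file.
PAYLOADS = [
    ("response", "http://example.org/", b"HTTP/1.1 200 OK\r\n\r\nhello"),
    ("response", "http://example.org/about", b"HTTP/1.1 404 Not Found\r\n\r\n"),
    ("resource", "http://example.org/logo.png", b"\x89PNG"),
]
SAMPLE = b"".join(make_record(*p) for p in PAYLOADS)

counts, total_bytes = summarize(io.BytesIO(SAMPLE))
# Filter: keep only the target URIs of response records.
responses = [h["WARC-Target-URI"] for h, _ in iter_records(io.BytesIO(SAMPLE))
             if h["WARC-Type"] == "response"]
print(counts, total_bytes, responses)
```

Files written by crawlers are usually gzip-compressed record by record, which this sketch does not handle; dedicated tools exist precisely because of details like that.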
![Page 20: web archiving tools and technologies](https://reader033.fdocuments.net/reader033/viewer/2022052505/555c0316d8b42a56448b5388/html5/thumbnails/20.jpg)
NutchWax
• free software
• index / search of ARC data
• development slowed / stopped but still used
![Page 21: web archiving tools and technologies](https://reader033.fdocuments.net/reader033/viewer/2022052505/555c0316d8b42a56448b5388/html5/thumbnails/21.jpg)
searching web archives is hard
![Page 22: web archiving tools and technologies](https://reader033.fdocuments.net/reader033/viewer/2022052505/555c0316d8b42a56448b5388/html5/thumbnails/22.jpg)
wayback
• free software from Internet Archive
• public web access to web archives
• what you’ve seen at archive.org
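The kind of known-item lookup seen at archive.org, where a URL plus a date resolves to the nearest capture, might be sketched as below. This is an illustration only, not the wayback software's actual index or code. The `INDEX` data and the `closest_capture` helper are hypothetical; the one borrowed convention is the 14-digit `YYYYMMDDhhmmss` timestamp used in archive.org replay URLs.

```python
from bisect import bisect_left
from datetime import datetime

FMT = "%Y%m%d%H%M%S"  # 14-digit timestamp, e.g. 20130402093000

# Hypothetical capture index: URL -> sorted list of capture timestamps.
INDEX = {
    "http://example.org/": [
        "20100101000000",
        "20120615120000",
        "20130402093000",
    ],
}


def closest_capture(captures, wanted):
    """Return the timestamp in sorted `captures` nearest in time to `wanted`."""
    if not captures:
        return None
    # Fixed-width timestamps sort the same as text and as dates,
    # so a binary search finds the neighbours on either side.
    i = bisect_left(captures, wanted)
    candidates = captures[max(i - 1, 0):i + 1]
    target = datetime.strptime(wanted, FMT)
    return min(candidates,
               key=lambda ts: abs(datetime.strptime(ts, FMT) - target))


captures = INDEX["http://example.org/"]
print(closest_capture(captures, "20130101000000"))
print(closest_capture(captures, "20090101000000"))
```

Only the two neighbours of the insertion point need comparing, which keeps the lookup logarithmic in the number of captures for a URL.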
![Page 23: web archiving tools and technologies](https://reader033.fdocuments.net/reader033/viewer/2022052505/555c0316d8b42a56448b5388/html5/thumbnails/23.jpg)
wayback cont’d
![Page 24: web archiving tools and technologies](https://reader033.fdocuments.net/reader033/viewer/2022052505/555c0316d8b42a56448b5388/html5/thumbnails/24.jpg)
wayback cont’d
![Page 25: web archiving tools and technologies](https://reader033.fdocuments.net/reader033/viewer/2022052505/555c0316d8b42a56448b5388/html5/thumbnails/25.jpg)