Documenting Internet2
an IT perspective

Eric Celeste
University of Minnesota (Twin Cities) Libraries
for the Coalition for Networked Information

6 December 2005

...or... A joyful romp with Heritrix, JavaScript, & Spotlight!

background...

• DI2 brought together
– University of Minnesota (CBI)
– University of Michigan (SI)
– Internet2

• web crawling only a small part

• the “save everything” approach

briefly…

• on crawling with spiders
• on Heritrix and JavaScript
• on Spotlight and local files
• on sinkholes and strategies

spiders on the web

pages

links

hosts & domains

robots.txt

scope

seeds

excluded pages

done!
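The pieces on this slide (seeds, links, hosts, robots.txt, scope, excluded pages) can be tied together in a small sketch. This is not how Heritrix is implemented; it is a minimal Python illustration in which the `web` dictionary stands in for fetched pages, and the host names, the excluded prefix, and the function names `in_scope` and `crawl` are all made up for the example.

```python
from collections import deque
from urllib.parse import urlparse
from urllib.robotparser import RobotFileParser


def in_scope(url, hosts, excluded_prefixes, robots):
    """Return True if the crawl rules allow fetching `url`."""
    parts = urlparse(url)
    if parts.scheme not in ("http", "https"):
        return False
    if parts.hostname not in hosts:                 # hosts & domains
        return False
    if url.startswith(tuple(excluded_prefixes)):    # excluded pages
        return False
    return robots.can_fetch("*", url)               # robots.txt


def crawl(seeds, links, hosts, excluded_prefixes, robots):
    """Breadth-first walk from the seeds; `links` maps a page to its outlinks."""
    frontier = deque(seeds)
    seen = set()
    visited = []
    while frontier:
        url = frontier.popleft()
        if url in seen:
            continue
        seen.add(url)
        if not in_scope(url, hosts, excluded_prefixes, robots):
            continue
        visited.append(url)
        frontier.extend(links.get(url, []))
    return visited                                  # frontier empty: done!


# Hypothetical robots.txt rules and a tiny stand-in for the web.
robots = RobotFileParser()
robots.parse(["User-agent: *", "Disallow: /private/"])

web = {
    "http://www.example.edu/": [
        "http://www.example.edu/about.html",
        "http://www.example.edu/private/notes.html",
        "http://www.example.edu/calendar/2005/12/",
        "http://other.example.org/",
    ],
    "http://www.example.edu/about.html": ["http://www.example.edu/"],
}

visited = crawl(
    seeds=["http://www.example.edu/"],
    links=web,
    hosts={"www.example.edu"},
    excluded_prefixes=["http://www.example.edu/calendar/"],
    robots=robots,
)
```

Here the private page is dropped by robots.txt, the calendar by the exclusion list, and the other host by scope, so only the seed and the about page are visited.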

our crawler

• Heritrix, from the Internet Archive (IA)
• aiming for broad deployment, Archive-It
• cross-platform, many users
• simple setup, sophisticated options

• generates ARC files

from ARC to archive

• keep originals intact
• a few large files to manage
• can serve a mirror from the master
• can extract files for research
• solution requires Perl, PHP, JavaScript, MySQL
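To give a feel for what "extracting files" from an ARC involves, here is a minimal Python sketch (the project itself used Perl). It assumes the uncompressed ARC version 1 layout, where each record starts with a header line of five space-separated fields (URL, IP address, archive date, content type, length) followed by that many bytes of content. Real ARC files are usually gzip-compressed, begin with a `filedesc://` record, and store the HTTP response headers inside each record body; none of that is handled here. The function name `read_arc_records` and the sample data are invented for the example.

```python
import io


def read_arc_records(stream):
    """Yield (url, ip, date, mime, body) tuples from an uncompressed ARC stream."""
    while True:
        header = stream.readline()
        if not header:
            return                      # end of file
        header = header.rstrip(b"\r\n")
        if not header:
            continue                    # blank line separating records
        # Split from the right so a stray space in the URL does not break parsing.
        url, ip, date, mime, length = header.decode("latin-1").rsplit(" ", 4)
        body = stream.read(int(length))
        yield url, ip, date, mime, body


# Two made-up records in the assumed layout.
sample = (
    b"http://www.example.edu/ 192.0.2.1 20051206120000 text/html 13\n"
    b"<html></html>\n"
    b"http://www.example.edu/a.txt 192.0.2.1 20051206120001 text/plain 5\n"
    b"hello\n"
)
records = list(read_arc_records(io.BytesIO(sample)))
```

Because the reader only ever moves forward through the stream, the original ARC file is never modified, which fits the goal of keeping originals intact.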

processing...

• for mirroring online
– optimizing and indexing with Perl
– loading into MySQL database
– presenting via PHP
• for using on local disk
– extracting files from ARC
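For the local-disk case, each extracted record needs a file name that a browser and desktop search tools can work with. The sketch below shows one possible mapping from a captured URL to a path on disk. The `archive` root directory, the `index.html` convention for directory URLs, and the function name `local_path` are assumptions made for illustration, not the project's actual layout. Query strings are ignored and `..` segments are not sanitized, which a real extractor would need to handle.

```python
from pathlib import PurePosixPath
from urllib.parse import urlparse


def local_path(url, root="archive"):
    """Map a captured URL to a relative path under `root` (hypothetical layout)."""
    parts = urlparse(url)
    path = parts.path or "/"
    if path.endswith("/"):
        path += "index.html"            # give directory URLs a browsable file
    return str(PurePosixPath(root, parts.hostname, path.lstrip("/")))


print(local_path("http://www.example.edu/about/"))
print(local_path("http://www.example.edu/docs/report.pdf"))
```

Keeping the host name and path structure in the file tree is what lets relative links continue to work when the site is browsed from disk.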

joys of javascript...

• modifies the page after loading

• HTML almost unmolested
• changes explicit in code
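One way to read "HTML almost unmolested" is that the only change made to a stored page when it is served is a single added script reference, with everything else done by that script in the browser after loading. The Python sketch below illustrates that idea only; the slides do not say how the script was attached, and the script path `/archive-banner.js` and the function name `add_archive_script` are hypothetical.

```python
import re


def add_archive_script(html, script_url="/archive-banner.js"):
    """Insert one <script> tag before </head>, leaving the rest of the page as captured."""
    tag = f'<script src="{script_url}"></script>'
    match = re.search(r"</head\s*>", html, flags=re.IGNORECASE)
    if match is None:
        return tag + html               # no <head> found: fall back to prepending
    return html[:match.start()] + tag + html[match.start():]


page = "<html><head><title>Internet2</title></head><body>original</body></html>"
served = add_archive_script(page)
```

Removing the one inserted tag gives back the captured page byte for byte, so the modification stays explicit and easy to reverse.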

are we there yet?

• make the archive obvious
• yet intrude as little as possible

global research locally

• a web site in your pocket
• applying local tools
• maintaining browse-ability
• Apple’s Spotlight one of many

sinkholes / strategies

• partnership with institution
– config, IP, retention
• crawling far from perfect
– no creation dates, exclusions
– sticky traps, scripted pages (AJAX)
• scripts still immature
– better demarcation
– more self-contained (not at /)

still...

• capture & save what we can
• keep it as “original” as possible
• stay flexible for the future
• have fun in the present!

more information

• http://wiki.lib.umn.edu/DI2/

• Eric Celeste <efc@umn.edu>