Documenting Internet2 an IT perspective Eric Celeste University of Minnesota (Twin Cities) Libraries for the Coalition for Networked Information 6 December 2005 ...or... A joyful romp with Heritrix, JavaScript, & Spotlight!

Documenting Internet2
an IT perspective

Eric Celeste
University of Minnesota (Twin Cities) Libraries

for the Coalition for Networked Information

6 December 2005

...or... A joyful romp with Heritrix, JavaScript, & Spotlight!

background...

• DI2 brought together
  – University of Minnesota (CBI)
  – University of Michigan (SI)
  – Internet2

• web crawling only a small part

• the “save everything” approach

briefly…

• on crawling with spiders
• on Heritrix and JavaScript
• on Spotlight and local files
• on sinkholes and strategies

spiders on the web

pages

links

hosts & domains

robots.txt

scope

seeds

excluded pages

done!
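The spider walkthrough above (pages, links, hosts, robots.txt, scope, seeds, excluded pages) can be sketched in a few lines. This is only an illustration of the general idea, not Heritrix's actual logic: the in-memory `PAGES` map, the `crawl` function, and the example hosts are all hypothetical, and no network access happens.

```python
from collections import deque
from urllib.parse import urljoin, urlsplit
from urllib.robotparser import RobotFileParser

# Hypothetical in-memory "web": each URL maps to the links found on that page.
PAGES = {
    "http://example.org/": ["/about", "/private/notes", "http://other.example.com/"],
    "http://example.org/about": ["/", "/calendar?month=1"],
    "http://example.org/calendar?month=1": ["/about"],
    "http://example.org/private/notes": [],
    "http://other.example.com/": [],
}

# Hypothetical robots.txt for example.org.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""


def crawl(seeds, in_scope_hosts, excluded_prefixes, robots):
    """Breadth-first crawl from the seeds, returning URLs in fetch order."""
    frontier = deque(seeds)
    seen = set(seeds)
    fetched = []
    while frontier:
        url = frontier.popleft()
        fetched.append(url)
        for href in PAGES.get(url, []):
            link = urljoin(url, href)  # resolve relative links against the page
            if link in seen:
                continue
            seen.add(link)
            if urlsplit(link).hostname not in in_scope_hosts:
                continue  # out of scope: a different host
            if any(link.startswith(prefix) for prefix in excluded_prefixes):
                continue  # explicitly excluded pages
            if not robots.can_fetch("*", link):
                continue  # disallowed by robots.txt
            frontier.append(link)
    return fetched


robots = RobotFileParser()
robots.parse(ROBOTS_TXT.splitlines())

result = crawl(
    seeds=["http://example.org/"],
    in_scope_hosts={"example.org"},
    excluded_prefixes=["http://example.org/calendar"],
    robots=robots,
)
print(result)
```

In this toy web the crawl fetches the seed and the "about" page, then stops: the other host is out of scope, the calendar is excluded, and robots.txt blocks the private page.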

our crawler

• Heritrix, from the IA
• aiming for broad deployment, Archive-It
• cross-platform, many users
• simple setup, sophisticated options

• generates ARC files

from ARC to archive

• keep originals intact
• a few large files to manage
• can serve a mirror from the master
• can extract files for research
• solution requires Perl, PHP, JavaScript, MySQL

processing...

• for mirroring online
  – optimizing and indexing with Perl
  – loading into MySQL database
  – presenting via PHP
• for using on local disk
  – extracting files from ARC
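To give a feel for what "extracting files from ARC" involves, here is a minimal sketch of reading one record. It assumes the version-1 ARC layout, in which each record starts with a single header line (URL, IP address, archive date, content type, length) followed by that many bytes of captured response. The project's own scripts were in Perl and are not shown in the deck; the `ArcRecord` and `read_record` names are hypothetical. Real ARC files are usually gzip-compressed and begin with a `filedesc://` record, neither of which this sketch handles.

```python
from typing import NamedTuple, Tuple


class ArcRecord(NamedTuple):
    url: str
    ip: str
    archive_date: str  # YYYYMMDDhhmmss
    content_type: str
    body: bytes        # captured response, including HTTP headers


def read_record(data: bytes, offset: int = 0) -> Tuple[ArcRecord, int]:
    """Parse one uncompressed record at `offset`; return it and the end offset."""
    end_of_header = data.index(b"\n", offset)
    header = data[offset:end_of_header].decode("latin-1")
    url, ip, archive_date, content_type, length = header.split(" ")
    start = end_of_header + 1
    stop = start + int(length)
    return ArcRecord(url, ip, archive_date, content_type, data[start:stop]), stop


# A made-up record for illustration only.
body = b"HTTP/1.1 200 OK\r\nContent-Type: text/html\r\n\r\n<html>hi</html>"
header = (
    f"http://example.org/ 192.0.2.1 20051206120000 text/html {len(body)}\n"
).encode("latin-1")
sample = header + body + b"\n"  # records are separated by a newline

record, next_offset = read_record(sample)
print(record.url, record.archive_date, len(record.body))
```

Because the header carries the byte length, an index of (URL, offset) pairs is enough to pull any single page out of a large ARC file without rewriting it, which fits the "keep originals intact" goal.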

joys of javascript...

• modifies the page after loading

• HTML almost unmolested
• changes explicit in code
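One way to read "HTML almost unmolested" is that the served page differs from the captured page by a single script reference, with everything else (banner, link rewriting) done by that script after the page loads. The deck does not show how the script was attached, so the sketch below is an assumption: the `add_archive_script` helper and the `/archive-banner.js` path are hypothetical.

```python
import re

# Hypothetical script location; the deck does not name the actual file.
ARCHIVE_SCRIPT = '<script src="/archive-banner.js"></script>'


def add_archive_script(html: str) -> str:
    """Insert one script tag before </head>, leaving all other bytes untouched."""
    match = re.search(r"</head\s*>", html, flags=re.IGNORECASE)
    if match is None:
        # No head element found: fall back to prepending the tag.
        return ARCHIVE_SCRIPT + html
    i = match.start()
    return html[:i] + ARCHIVE_SCRIPT + html[i:]


original = "<html><head><title>t</title></head><body>x</body></html>"
served = add_archive_script(original)
print(served)
```

Removing the one inserted tag gives back the captured page exactly, so the only modification is explicit and easy to audit.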

are we there yet?

• make the archive obvious
• yet intrude as little as possible

global research locally

• a web site in your pocket
• applying local tools
• maintaining browse-ability
• Apple’s Spotlight one of many
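Keeping an extracted site browsable on local disk mostly comes down to a predictable mapping from captured URLs to file paths, so that desktop tools such as Spotlight can index ordinary files. The sketch below shows one possible mapping. It is an assumption, not the project's actual scheme: the `local_path` function and the `archive` root folder are hypothetical, and query strings are simply ignored.

```python
from pathlib import PurePosixPath
from urllib.parse import urlsplit


def local_path(url: str, root: str = "archive") -> PurePosixPath:
    """Map a captured URL to a file path under `root`, one folder per host."""
    parts = urlsplit(url)
    path = parts.path
    if path == "" or path.endswith("/"):
        path += "index.html"  # directory-style URLs need a file name on disk
    host = parts.hostname or "unknown-host"
    return PurePosixPath(root, host, path.lstrip("/"))


print(local_path("http://example.org/about/"))
print(local_path("http://example.org/docs/paper.pdf"))
```

A production version would also need to handle query strings, unsafe path segments such as `..`, and file-name collisions.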

sinkholes / strategies

• partnership with institution
  – config, IP, retention
• crawling far from perfect
  – no creation dates, exclusions
  – sticky traps, scripted pages (AJAX)
• scripts still immature
  – better demarcation
  – more self-contained (not at /)

still...

• capture & save what we can
• keep it as “original” as possible
• stay flexible for the future
• have fun in the present!

more information

• http://wiki.lib.umn.edu/DI2/

• Eric Celeste