Sample Crawl with Heritrix 1.14
cornelia/russir14/lectures/russir_handson1.pdf
Why Heritrix?
Internet Archive’s web-scale, archival-quality web crawler project
Open-source and extensible
Written in Java and used in CiteSeer
Download/untar/cd bin
http://crawler.archive.org/index.html
Go to the SourceForge downloads page and get version 1.14.3.
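The untar and cd bin steps can be sketched in Python as below. This is only an illustration, not part of the original slides: the function name unpack_and_find_bin is hypothetical, and it assumes the download is a .tar.gz archive with a single top-level directory containing bin/. The download itself is left out because the exact SourceForge file URL is not given here.

```python
import tarfile
from pathlib import Path


def unpack_and_find_bin(archive: Path, dest: Path) -> Path:
    """Extract a .tar.gz archive into dest and return its top-level bin/ directory.

    Hypothetical helper: it assumes the archive has exactly one top-level
    directory and that this directory contains bin/.
    """
    with tarfile.open(archive, "r:gz") as tar:
        # Read the top-level directory name from the archive instead of hard-coding it.
        top_levels = {
            Path(name).parts[0] for name in tar.getnames() if Path(name).parts
        }
        tar.extractall(dest)

    if len(top_levels) != 1:
        raise ValueError(
            f"expected one top-level directory, found {sorted(top_levels)}"
        )

    bin_dir = dest / top_levels.pop() / "bin"
    if not bin_dir.is_dir():
        raise FileNotFoundError(f"no bin/ directory under {bin_dir.parent}")
    return bin_dir
```

For example, if the 1.14.3 download was saved as heritrix-1.14.3.tar.gz (an assumed file name, adjust to whatever you actually saved), unpack_and_find_bin(Path("heritrix-1.14.3.tar.gz"), Path(".")) should return the bin directory to cd into.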