Crawling

Ida Mele

Nutch

• Apache Nutch is an open source Java implementation of a search engine

• We can use Nutch for crawling a portion of the Web

• Useful links:
  • http://nutch.apache.org/
  • http://wiki.apache.org/nutch/
  • http://wiki.apache.org/nutch/NutchTutorial

Nutch: advantages

• Understanding
  • We have the source code, and we can use it to see how a large search engine works
  • Nutch has been built using ideas from academia and industry, and it is very useful for researchers who want to try out new search algorithms

Nutch: advantages

• Transparency
  • The details of the ranking algorithms used by commercial search engines are secret, and there are usually economic reasons behind the ranked list of results
  • The Nutch implementation is transparent: we know how the ranking algorithms work, so we can trust the fairness of the final rankings

Nutch: advantages

• Extensibility
  • Nutch is a platform for adding search to heterogeneous collections of information
  • It allows us to customize the search interface
  • We can extend the out-of-the-box functionality through the plugin mechanism

Nutch vs. Lucene

• Nutch is built on top of Lucene
• Apache Lucene is a Java library for text indexing and searching
  • It provides high-performance, full-featured text search
  • It supports any application that requires full-text search
  • It is used just for indexing, not for crawling

Architecture

• Nutch can be divided into two pieces:
  • the crawler, which fetches pages and turns them into an inverted index
  • the searcher, which answers users' search queries

• The index is the interface between the crawler and the searcher

• The crawler and searcher systems can be on separate hardware platforms

Architecture

• Crawler and searcher systems can be scaled independently

• For example, if we have a highly trafficked search page that provides searching for a relatively modest set of sites, we may use a modest crawler infrastructure, and invest more substantial resources for supporting the searcher

Crawler system

• The crawler system is driven by the Nutch tool called crawl, and by other related tools to build and maintain the data structures

• The data structures are:
  • the web database
  • a set of segments
  • the index

WebDB

• The web database (WebDB) is a data structure for mirroring the structure and properties of the web graph being crawled

• It stores two types of entities:
  • Page: It is indexed by its URL and the MD5 hash of its contents. Other information: the number of outlinks, fetch information, the score of the page
  • Link: It represents the connection between the source page and the target page
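
The two entity types can be pictured with a small in-memory sketch. This is only an illustration, not Nutch's actual storage format: the class names, the field names, and the default score are hypothetical, and only the idea of indexing pages by URL and by MD5 hash comes from the slide.

```python
import hashlib
from dataclasses import dataclass


@dataclass
class Page:
    url: str
    content_md5: str        # MD5 hash of the page contents
    num_outlinks: int = 0
    score: float = 1.0      # hypothetical default score


@dataclass(frozen=True)
class Link:
    source_url: str
    target_url: str


class WebDB:
    """Toy web database: pages indexed by URL and by content hash, plus links."""

    def __init__(self):
        self.pages_by_url = {}
        self.pages_by_md5 = {}
        self.links = set()

    def add_page(self, url, content, outlinks):
        md5 = hashlib.md5(content).hexdigest()
        page = Page(url, md5, num_outlinks=len(outlinks))
        self.pages_by_url[url] = page
        self.pages_by_md5.setdefault(md5, []).append(page)
        for target in outlinks:
            self.links.add(Link(url, target))
        return page


db = WebDB()
home = db.add_page("http://example.org/", b"<html>home</html>",
                   ["http://example.org/a"])
print(home.num_outlinks, len(db.links))
```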

Segment

• The segment is a collection of pages that are fetched and indexed by the crawler in a run

• The fetchlist is a list of URLs to fetch, and it is generated from the WebDB

• The fetcher output is the data retrieved from the pages in the fetchlist

• Any segment has a lifespan (30 days is the default re-fetch interval)
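
As a rough illustration of the re-fetch interval, the check below decides whether a page is due to be fetched again. The function name and the dates are hypothetical; only the 30-day default comes from the slide.

```python
from datetime import datetime, timedelta

REFETCH_INTERVAL = timedelta(days=30)  # default re-fetch interval from the slide


def is_due_for_refetch(fetched_at, now, interval=REFETCH_INTERVAL):
    """Return True when a page fetched at `fetched_at` should be fetched again."""
    return now - fetched_at >= interval


fetched = datetime(2024, 1, 1)
print(is_due_for_refetch(fetched, datetime(2024, 1, 20)))  # False (19 days)
print(is_due_for_refetch(fetched, datetime(2024, 2, 5)))   # True (35 days)
```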

Index

• Inverted index of all pages retrieved by the system
• The index is created by merging all of the individual segment indexes
• Nutch uses Lucene to build the index. Note that Lucene also has the concept of a segment, but it is different from the segment in Nutch:
  • In Lucene, the index segment is a portion of the index
  • In Nutch, the segment is a fetched and indexed portion of the WebDB
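
The idea of building one inverted index per segment and then merging them can be sketched in a few lines. This is a toy model, not Lucene: the whitespace tokenizer, the page ids, and the function names are assumptions made for the example.

```python
from collections import defaultdict


def build_index(docs):
    """Map each term to the sorted list of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return {term: sorted(ids) for term, ids in index.items()}


def merge_indexes(indexes):
    """Combine several per-segment indexes into a single index."""
    merged = defaultdict(set)
    for index in indexes:
        for term, ids in index.items():
            merged[term].update(ids)
    return {term: sorted(ids) for term, ids in merged.items()}


segment_a = build_index({"p1": "open source search", "p2": "web crawler"})
segment_b = build_index({"p3": "search engine crawler"})
full_index = merge_indexes([segment_a, segment_b])
print(full_index["search"])   # ['p1', 'p3']
print(full_index["crawler"])  # ['p2', 'p3']
```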

Crawling

• Nutch can operate at one of these three different scales:
  • Local filesystem
  • Intranet
  • Web

• Each scale has different characteristics. For example, crawling the local filesystem is more reliable than crawling at the other two scales

Crawling

• For crawling billions of pages from the web, we must:
  • define the seed set (i.e., the set of pages we want to start with)
  • decide how many crawlers we use and how to partition the work among them
  • decide how often we want to do the re-crawling
  • cope with broken links, unresponsive sites, and unintelligible or duplicate content

Crawling

• The crawling process is basically a cycle made of three steps:
  1. the crawler generates a set of fetchlists from the WebDB (generate)
  2. a set of fetchers downloads the content from the Web (fetch)
  3. the crawler updates the WebDB with new links that were found (update)
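
The generate/fetch/update cycle can be simulated on a toy web held in memory. This is a minimal sketch, not how Nutch is implemented: the TOY_WEB dictionary, the function names, and the single fetchlist per round are all assumptions made for the example.

```python
# Hypothetical in-memory "web": URL -> outlinks found on that page.
TOY_WEB = {
    "http://a.example/": ["http://a.example/1", "http://b.example/"],
    "http://a.example/1": ["http://a.example/"],
    "http://b.example/": ["http://b.example/x"],
    "http://b.example/x": [],
}


def generate(webdb):
    """Fetchlist: every known URL that has not been fetched yet."""
    return sorted(url for url, fetched in webdb.items() if not fetched)


def fetch(fetchlist):
    """Stand-in for the fetchers: look each page up in the toy web."""
    return {url: TOY_WEB.get(url, []) for url in fetchlist}


def update(webdb, fetched_pages):
    """Mark pages as fetched and add newly discovered links to the WebDB."""
    for url, outlinks in fetched_pages.items():
        webdb[url] = True
        for target in outlinks:
            webdb.setdefault(target, False)


def crawl(seeds, depth):
    webdb = {url: False for url in seeds}   # inject the seed URLs
    for _ in range(depth):
        fetchlist = generate(webdb)
        if not fetchlist:
            break
        update(webdb, fetch(fetchlist))
    return webdb


result = crawl(["http://a.example/"], depth=2)
print(sorted(url for url, fetched in result.items() if fetched))
```

With depth 2, the page http://b.example/x is discovered but not yet fetched; one more round would fetch it.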

Crawling

• Nutch observes:
  • Politeness: URLs with the same host are always assigned to the same fetchlist, so that a web site is not overloaded with requests from multiple fetchers in rapid succession
  • Robots Exclusion Protocol: It allows site owners to control which parts of their site may be crawled
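
Both rules can be illustrated with the Python standard library. This is a sketch under assumptions: the round-robin assignment of hosts to fetchlists, the agent name MyCrawler, and the robots.txt lines are hypothetical, and Nutch's own partitioning may work differently.

```python
from collections import defaultdict
from urllib.parse import urlsplit
from urllib.robotparser import RobotFileParser


def partition_by_host(urls, num_fetchlists):
    """Assign every URL of a given host to the same fetchlist."""
    fetchlists = defaultdict(list)
    host_slot = {}
    for url in urls:
        host = urlsplit(url).hostname
        # A new host gets the next slot in round-robin order.
        slot = host_slot.setdefault(host, len(host_slot) % num_fetchlists)
        fetchlists[slot].append(url)
    return dict(fetchlists)


fetchlists = partition_by_host(
    ["http://a.example/1", "http://b.example/1",
     "http://a.example/2", "http://c.example/1"],
    num_fetchlists=2,
)
print(fetchlists)

# Robots Exclusion Protocol: parse robots.txt rules and ask before fetching.
robots = RobotFileParser()
robots.parse(["User-agent: *", "Disallow: /private/"])
print(robots.can_fetch("MyCrawler", "http://a.example/private/x"))   # False
print(robots.can_fetch("MyCrawler", "http://a.example/index.html"))  # True
```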

Crawling: low-level tools

• Crawling is done by the crawl tool of Nutch, which is a front end to lower-level tools

• The crawl tool can be used to get started with crawling websites, but then we need to use the lower-level tools to perform re-crawls and other maintenance on the data structures built during the initial crawl

Crawling: low-level tools

• We can use the lower-level tools in sequence:
  1. Create a new WebDB (admin db-create)
  2. Inject root URLs into the WebDB (inject)
  3. Generate a fetchlist from the WebDB in a new segment (generate)
  4. Fetch content from URLs in the fetchlist (fetch)
  5. Update the WebDB with links from fetched pages (updatedb)
  6. Repeat steps 3-5 until the required depth is reached

Crawling: low-level tools

7. Update segments with scores and links from the WebDB (updatesegs)
8. Index the fetched pages (index)
9. Eliminate duplicate content, and duplicate URLs, from the indexes (dedup)
10. Merge the indexes into a single index for searching (merge)

Crawling: low-level tools

• We create a new WebDB (step 1), and we populate it with some seed URLs (step 2)

• Then we use the generate/fetch/update cycle (steps 3-6)

• After the cycle, the crawler creates an index (steps 7-10). In particular:
  • each segment is indexed independently (step 8)
  • the duplicate pages are removed (step 9)
  • the individual indexes are combined into a single index (step 10)
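
The duplicate-removal step can be illustrated by dropping pages whose URL or content hash has already been seen. This is only a sketch of the idea: the slides do not say which copy Nutch keeps, so keeping the first occurrence is an assumption, and the function name and sample pages are hypothetical.

```python
import hashlib


def dedup(pages):
    """pages: list of (url, content) pairs.

    Keep the first page seen for each URL and for each content hash
    (assumption: the real dedup tool may choose a different survivor).
    """
    seen_urls, seen_hashes, kept = set(), set(), []
    for url, content in pages:
        digest = hashlib.md5(content.encode("utf-8")).hexdigest()
        if url in seen_urls or digest in seen_hashes:
            continue
        seen_urls.add(url)
        seen_hashes.add(digest)
        kept.append((url, content))
    return kept


pages = [
    ("http://a.example/", "home"),
    ("http://a.example/index.html", "home"),   # same content, different URL
    ("http://a.example/", "home v2"),          # same URL, different content
    ("http://a.example/about", "about us"),
]
print(dedup(pages))
```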

Running a crawl with Nutch

• Download and unpack a Nutch distribution (for example, apache-nutch-1.1-bin.zip)

• Make sure that the environment variable NUTCH_JAVA_HOME or JAVA_HOME is set to the Java home path:
  • Run the following command or add it to the .bashrc file (replace %pathJava with the Java home path):
      export NUTCH_JAVA_HOME=%pathJava

Nutch configuration

• All of Nutch's configuration files are in the conf subdirectory

• The main configuration file is conf/nutch-default.xml. It contains the default settings, and should not be modified

• To change a setting we can create or update the conf/nutch-site.xml file

Nutch configuration

• Add your agent name in the value field of the http.agent.name property in the file conf/nutch-site.xml. For example, we can use the name Sapienza University:

<property>
  <name>http.agent.name</name>
  <value>Sapienza University</value>
  <description>HTTP 'User-Agent' request header. MUST NOT be empty - please set this to a single word uniquely related to your organization.</description>
</property>
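
A quick way to check that the property was set is to parse the file and look the value up. The sketch below works on an inline string rather than the real file; the enclosing configuration element and the helper name get_property are assumptions made for the example.

```python
import xml.etree.ElementTree as ET

# Assumed layout: properties wrapped in a <configuration> root element.
SITE_XML = """<configuration>
  <property>
    <name>http.agent.name</name>
    <value>Sapienza University</value>
  </property>
</configuration>"""


def get_property(xml_text, name):
    """Return the value of the named property, or None if it is absent."""
    root = ET.fromstring(xml_text)
    for prop in root.iter("property"):
        if prop.findtext("name") == name:
            return (prop.findtext("value") or "").strip()
    return None


agent = get_property(SITE_XML, "http.agent.name")
print(agent if agent else "http.agent.name is missing or empty")
```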

Url filter

• The crawl tool uses a filter to decide which URLs can go into the WebDB (steps 2 and 5)

• This can be used to restrict the crawling to the URLs that match any given pattern, specified by regular expressions

• For example, if we want to restrict the domain to the DIS domain, we have to update the configuration file conf/crawl-urlfilter.txt

Url filter

• Open the file conf/crawl-urlfilter.txt and replace the line:
    +^http://([a-z0-9]*\.)*MY.DOMAIN.NAME/
  with:
    +^http://([a-z0-9]*\.)*dis.uniroma1.it/

• The file conf/crawl-urlfilter.txt will contain:
    # accept hosts in MY.DOMAIN.NAME
    #+^http://([a-z0-9]*\.)*MY.DOMAIN.NAME/
    +^http://([a-z0-9]*\.)*dis.uniroma1.it/
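
To see which URLs the pattern lets through, it can be tried out with Python's re module. This is an approximation of the filter, not Nutch code: it assumes that the first matching rule decides, that a URL matching no rule is rejected, and that Python and Java regular expressions agree on this pattern. The accepts function is hypothetical.

```python
import re

# One rule per filter line: the sign ("+" accept, "-" reject) and the pattern.
RULES = [
    ("+", re.compile(r"^http://([a-z0-9]*\.)*dis.uniroma1.it/")),
]


def accepts(url, rules=RULES):
    """Apply the first matching rule; reject URLs that match no rule."""
    for sign, pattern in rules:
        if pattern.search(url):
            return sign == "+"
    return False


for url in [
    "http://www.dis.uniroma1.it/",
    "http://cclii.dis.uniroma1.it/airo/index.php",
    "http://www.example.com/",
    "http://www.dis.uniroma1.it",   # no trailing slash
]:
    print(url, accepts(url))
```

The last URL is rejected in this sketch because the pattern requires a slash after the host name.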

Example

• Create a file called urls that contains the root URLs.

• These URLs will be used to populate the initial fetchlist.

• For example, if we want to start from the home page of the department, we will use:
    echo 'http://www.dis.uniroma1.it' > urls

Example

• We run the crawler with:
    bin/nutch crawl urls -dir mycrawl -depth 5 > mycrawl.log
  where:
  • urls is the name of the file with the seed URLs
  • mycrawl is the name of the directory where the crawl data is stored
  • 5 is the depth of the crawl
  • mycrawl.log is the name of the log file

Results of the crawl

• The directory mycrawl contains the following subdirectories:
  • crawldb
  • linkdb
  • segments
  • index
  • indexes

Results of the crawl: readdb

• The readdb tool parses the WebDB and displays portions of it in human-readable form

• The stats option displays the number of pages and links:
    bin/nutch readdb mycrawl/crawldb -stats > stats.txt
  Then, we can use:
    more stats.txt

Results of the crawl: readdb

• The dump option gives the dump of the pages. Each page appears in a separate block, with one field per line. The ID field is the MD5 hash of the page contents. There is also information about when the pages should be next fetched (which defaults to 30 days), and the page scores

• We issue the command:
    bin/nutch readdb mycrawl/crawldb -dump mydump
  then we use:
    more mydump/part-00000

Results of the crawl: readdb

• The readdb tool also supports extraction of an individual page or link by URL or MD5 hash

• For example, to examine the information about the page http://cclii.dis.uniroma1.it/airo/index.php we use the url option by issuing the command:
    bin/nutch readdb mycrawl/crawldb -url http://www.dis.uniroma1.it/airo/index.php

Results of the crawl: readlinkdb

• The readlinkdb tool can be used to create the dump of the link structure (the graph) by using the dump option:
    bin/nutch readlinkdb mycrawl/linkdb/ -dump mylinks

• We can read the in-links by using:
    more mylinks/part-00000
  Note that it gives us just the list of the in-links. For the out-links we have to merge the segments and read the result

Results of the crawl: readseg

• The crawl creates a few segments in timestamped subdirectories, one for each generate/fetch/update cycle

• The readseg tool is the segment reader
  • The list option gives a summary of all of the generated segments:
      bin/nutch readseg -list -dir mycrawl/segments/

Results of the crawl: readseg

• The dump option gives a dump of a given segment:
    bin/nutch readseg -dump mycrawl/segments/YYYYMMDDhhmmss/ dump_seg1
  where YYYYMMDDhhmmss is the name of the segment, given by the date and time at which the segment was created

• Then we can use:
    more dump_seg1/dump

Results of the crawl: mergesegs

• We have seen that the readlinkdb tool can be used to obtain the list of in-links

• To obtain the out-links we need to merge the segments and read the result

• We use the mergesegs tool:
    bin/nutch mergesegs whole-segments -dir mycrawl/segments/*

• Then we can use the dump option of the readseg tool on the result of the merge:
    bin/nutch readseg -dump whole-segments/YYYYMMDDhhmmss/ dump-outlinks

Exercise

• We want to create the webgraph of a portion of the Web
• First of all, install and configure Nutch
• For the crawling:
  • Create the file with the seed set (for example, urls)
  • Update the conf/crawl-urlfilter.txt file
  • Decide the depth of the crawl and crawl a portion of the web using the crawl tool. For example, for depth 5 we issue:
      bin/nutch crawl urls -dir mycrawl -depth 5 > mycrawl.log

Exercise

• Once the crawling is completed, you can create the webgraph
  • Download the directory with libraries lib.zip, available at: http://www.dis.uniroma1.it/~mele/WebIR.html
  • Download the file set-classpath.sh, available at: http://www.dis.uniroma1.it/~mele/WebIR.html
  • Update the file set-classpath.sh with the path to your lib directory

• Put the set-classpath.sh file in the Nutch home, open the terminal, and set the classpath with:
    source set-classpath.sh

Exercise

• Create the file with in-links using the following commands:
    bin/nutch readlinkdb mycrawl/linkdb/ -dump mylinks
    egrep -v $'^$' mylinks/part-00000 > inlinks.txt

Exercise

• Create the file with the out-links:
  1) Merge the segments:
      bin/nutch mergesegs whole-segments -dir mycrawl/segments/*
  2) Use readseg to read the segments, and then create the file with out-links:
      bin/nutch readseg -dump whole-segments/YYYYMMDDhhmmss/ dump-outlinks
      cat dump-outlinks/dump | egrep 'URL|toUrl' > outlinks.txt

Exercise

• Print the in-links and out-links in the links.txt file by issuing the following commands:
    java nutchGraph.PrintInlinks inlinks.txt > links.txt
    java nutchGraph.PrintOutlinks outlinks.txt >> links.txt

• Remove the duplicates:
    LANG=C sort links.txt | uniq > cleaned-links.txt

Exercise

• Create the map of URLs with the following commands:
    cut -f1 links.txt > url-list.txt
    cut -f2 links.txt >> url-list.txt
    LANG=C sort url-list.txt | uniq > sorted-url-list.txt
    java -Xmx2G it.unimi.dsi.util.FrontCodedStringList -u -r 32 umap.fcl < sorted-url-list.txt
    java -Xmx2G it.unimi.dsi.sux4j.mph.MWHCFunction umap.mph sorted-url-list.txt
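
What this step computes can be mimicked in plain Python on a small edge list. The sketch rests on two assumptions: that links.txt holds one tab-separated source/target pair per line (as the cut -f1 and cut -f2 commands suggest), and that the map sends each URL to its position in the sorted list. The function name and sample edges are hypothetical.

```python
def build_url_map(edges):
    """edges: iterable of (source_url, target_url) pairs.

    Returns the URLs sorted in byte order (like LANG=C sort | uniq)
    and a dictionary mapping each URL to its position in that list.
    """
    urls = {url for edge in edges for url in edge}
    sorted_urls = sorted(urls, key=lambda u: u.encode("utf-8"))
    return sorted_urls, {url: i for i, url in enumerate(sorted_urls)}


edges = [
    ("http://b.example/", "http://a.example/"),
    ("http://a.example/", "http://c.example/"),
    ("http://b.example/", "http://a.example/"),   # duplicate edge
]
sorted_urls, url_id = build_url_map(edges)
print(sorted_urls)
print(url_id["http://b.example/"])  # 1
```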

Exercise

• Create the graph:
    java -Xmx2G nutchGraph.PrintEdges cleaned-links.txt umap.mph > webgraph.dat
    numNodes=$(wc -l < sorted-url-list.txt)
    java -Xmx2G nutchGraph.IncidenceList2Webgraph $numNodes webgraph
    java -Xmx2G it.unimi.dsi.webgraph.BVGraph -g ASCIIGraph webgraph webgraph
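
The intermediate text graph can be pictured as follows: the first line holds the number of nodes, and line i+1 lists the successors of node i in increasing order, separated by spaces. The sketch below produces that layout from integer edges; whether the nutchGraph tools emit exactly this text is an assumption, and the function name is hypothetical.

```python
def to_ascii_graph(num_nodes, edges):
    """Render integer edges as text: node count, then one successor list per node."""
    successors = [set() for _ in range(num_nodes)]
    for source, target in edges:
        successors[source].add(target)   # the set drops duplicate edges
    lines = [str(num_nodes)]
    lines += [" ".join(str(t) for t in sorted(succ)) for succ in successors]
    return "\n".join(lines) + "\n"


text = to_ascii_graph(3, [(1, 0), (0, 2), (1, 0), (1, 2)])
print(text, end="")
```

For this input the output has four lines: 3, then 2 for node 0, then 0 2 for node 1, then an empty line for node 2, which has no successors.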

Indexing

• Once the crawling operation is completed, we have the graph and the indexed pages

• Remember that Nutch uses Lucene for the indexing phase

• If we want to use MG4J for building the inverted index, we can collect the pages fetched during the crawling by using: wget -i sorted-url-list.txt

• Then we can use MG4J for indexing and querying the resulting collection of web pages

[Figure: pipeline diagram with the labels WEB, Nutch, db, readdb, ParserDB, Link structure, graph.txt, ASCIIGraph, BVGraph, PageRank, RankPR, getfiles, files, MG4J, Query]

Homework

• Repeat the exercise using a different seed set and/or depth. Create the corresponding webgraph. Compute the PageRank of the nodes of the webgraph. Plot the distribution of the PageRank values
